Multimedia Databases and Content-Based Retrieval
Mais M. Fatayer Department of Computer Science Amman Arab University Amman – Jordan
E-mail: [email protected]
Introduction

Traditional database management systems cannot handle the demands of managing multimedia data. With the rapid growth of multimedia platforms and the World Wide Web, database management systems must now process, store, index, and retrieve alphanumeric data, bitmapped and vector-based graphics, and video and audio clips, both compressed and uncompressed. Before the emergence of content-based retrieval, media was annotated with text, allowing it to be accessed by text-based searching. Through textual description, media can be managed based on a classification of subject or semantics. This hierarchical structure allows users to navigate and browse easily, and to search using standard Boolean queries. However, with the emergence of massive multimedia databases, traditional text-based search suffers from the following limitations:
- Manual annotation requires too much time and is expensive to implement. As the number of media items in a database grows, the difficulty of finding desired information increases, and it becomes infeasible to annotate all attributes of the media content manually. Annotating a sixty-minute video, containing more than 100,000 images, consumes a vast amount of time and expense.
- Manual annotation fails to deal with the discrepancies of subjective perception. The phrase "an image says more than a thousand words" implies that textual description is insufficient for depicting subjective perception; capturing all the concepts, thoughts, and feelings conveyed by the content of any medium is almost impossible.
- Some media content is difficult to describe concretely in words. For example, a melody without lyrics or an irregular organic shape cannot easily be expressed in textual form, yet people expect to search for media with similar content based on examples they provide.
In an attempt to overcome these difficulties, content-based retrieval employs content information to automatically index data with minimal human intervention.
APPLICATIONS

Content-based retrieval has been proposed by different communities for various applications. These include:
- Medical diagnosis: The amount of digital medical imagery used in hospitals has increased tremendously. Since images with similar pathology-bearing regions can be found and interpreted, those images can aid diagnosis through image-based reasoning. For example, Wei & Li (2004) proposed a general framework for content-based medical image retrieval and constructed a retrieval system for locating digital mammograms with similar pathological parts.
- Intellectual property: Trademark image registration has applied content-based retrieval techniques to compare a new candidate mark with existing marks to ensure that there is no repetition. Copyright protection can also benefit from content-based retrieval, as copyright owners are able to search for and identify unauthorized copies of images on the Internet. For example, Wang & Chen (2002) developed a content-based system using hit statistics to retrieve trademarks.
- Broadcasting archives: Every day, broadcasting companies produce large amounts of audio-visual data. To deal with these large archives, which can contain millions of hours of video and audio, content-based retrieval techniques are used to annotate their contents and summarize the audio-visual data, drastically reducing the volume of raw footage. For example, Yang et al. (2003) developed a content-based video retrieval system to support personalized news retrieval.
- Information searching on the Internet: A large amount of media has been made available on the Internet for retrieval, yet existing search engines mainly perform text-based retrieval. To access the various media on the Internet, content-based search engines can assist users in finding the information whose content is most similar to their queries. For example, Hong & Nah (2004) designed an XML scheme to enable content-based image retrieval on the Internet.
TEXT DOCUMENT INDEXING AND RETRIEVAL

Information retrieval (IR) techniques are important in multimedia information management systems, as large numbers of text documents exist in many organizations such as libraries. Text is a very important information source for any organization, and it can be used to annotate other media such as audio, images, and video. The two major design issues of IR systems are how to represent documents and queries, and how to compare the similarity between document and query representations. A retrieval model defines these two aspects. The most common technique is exact match, and the Boolean model is discussed below as an example of this retrieval method.

Automatic Text Document Indexing and the Boolean Retrieval Model

Basic Boolean Retrieval Model

Most commercial IR systems can be classified as Boolean IR systems or text-pattern search systems. Text-pattern search queries are strings or regular expressions; during retrieval, all documents are searched and those containing the query string are retrieved. Text-pattern systems are more common for searching small document collections. In a Boolean retrieval system, documents are indexed by sets of keywords. Queries are also represented by sets of keywords, joined by logical (Boolean) operators that express relationships between the query terms. Three operators are in common use: OR, AND, and NOT. Their retrieval rules are:
- The OR operator treats two terms as effectively synonymous. For example, given the query (term1 OR term2), the presence of either term in a record or document suffices to retrieve that record.
- The AND operator combines terms into term phrases; thus the query (term1 AND term2) indicates that both terms must be present in the document in order for it to be retrieved.
- The NOT operator is a restriction, or term-narrowing, operator that is normally used in conjunction with the AND operator to restrict the applicability of particular terms; thus the query (term1 AND NOT term2) leads to the retrieval of records containing term1 but not term2.
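The three operator rules above can be sketched with simple set operations over a keyword index. The sample documents and the `boolean_query` helper below are illustrative, not part of any real IR system:

```python
# A minimal sketch of Boolean retrieval: documents are indexed by sets
# of keywords, and queries combine terms with AND, OR, and AND NOT.

documents = {
    1: {"multimedia", "database", "retrieval"},
    2: {"image", "retrieval", "color"},
    3: {"audio", "database"},
}

def boolean_query(docs, required=(), optional=(), excluded=()):
    """Return ids of documents that contain all `required` terms (AND),
    at least one `optional` term (OR), and no `excluded` term (NOT)."""
    results = set()
    for doc_id, terms in docs.items():
        if required and not all(t in terms for t in required):
            continue
        if optional and not any(t in terms for t in optional):
            continue
        if any(t in terms for t in excluded):
            continue
        results.add(doc_id)
    return results

# (term1 AND term2): both terms must be present
print(boolean_query(documents, required=("database", "retrieval")))   # {1}
# (term1 OR term2): either term suffices
print(boolean_query(documents, optional=("image", "audio")))          # {2, 3}
# (term1 AND NOT term2): term1 present, term2 absent
print(boolean_query(documents, required=("database",), excluded=("audio",)))  # {1}
```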
Term Operations and Automatic Indexing

A document contains many terms or words, but not every word is useful and important; for example, prepositions and articles such as "of", "the", and "a" are not useful for representing the content of the document. These terms are called stop words. During the indexing process, a document is treated as a list of words, and stop words are removed from the list. The remaining terms are further processed to improve indexing and retrieval efficiency and effectiveness. Common operations carried out on these terms are stemming, thesaurus mapping, and weighting. Stemming is the automated conflation of related words, usually by reducing them to a common root form. For example, suppose that the words "retrieval", "retrieved", "retrieving", and "retrieve" all appear in a document. Instead of treating these as four different words, for indexing purposes they are reduced to the common root "retrieve", which is then used as an index term of the document. Another way of conflating related terms is with a thesaurus that lists synonymous terms and sometimes the relationships among them. For example, the words "study", "learning", "schoolwork", and "reading" have similar meanings, so instead of using four index terms, the general term "study" can be used to represent all four. Different index terms have different frequencies of occurrence and importance to the document. Note that the occurrence frequency of a term after stemming or thesaurus operations is the sum of the occurrence frequencies of all its variations. Introducing term-importance weights for document and query terms makes it possible to distinguish terms that are more important for retrieval purposes from less important ones.
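The indexing pipeline above, stop-word removal, stemming, and term-frequency counting, can be sketched as follows. The stop-word list and suffix list are tiny illustrative placeholders, not a full stemmer such as Porter's, so the crude stemmer below yields the root "retriev" rather than "retrieve":

```python
# A sketch of automatic indexing: remove stop words, conflate related
# words by crude suffix stripping, and count term frequencies.

STOP_WORDS = {"of", "the", "a", "and", "is", "in"}
SUFFIXES = ("ing", "ed", "al", "s")  # illustrative, checked in this order

def stem(word):
    """Strip one known suffix to approximate a common root form."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text):
    """Return {root: frequency} for the non-stop-words in the text.
    The frequency of a root is the sum over all its variations."""
    counts = {}
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        root = stem(word)
        counts[root] = counts.get(root, 0) + 1
    return counts

doc = "retrieval retrieved retrieving of the retrieval"
print(index_document(doc))  # {'retriev': 4}
```

Note how all four variations are counted under a single root, matching the rule that a stemmed term's frequency is the sum of its variations' frequencies.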
IMAGE INDEXING AND RETRIEVAL

Images are stored in a database in raw form, as a set of pixel or cell values, or in compressed form to save space. Each image is represented as a grid of cells. There are several approaches to image indexing and retrieval. In the first, attribute-based approach, image contents are modeled as a set of attributes extracted manually and managed within the framework of conventional DBMSs, and queries are specified using these attributes. Examples of such attributes are image file name, image category, date of creation, subject, author, and image source. However, database attributes may not be able to describe the image contents completely, and the types of queries are limited to those attributes. The second approach, feature extraction and object recognition, depends on a subsystem to automate feature extraction and object recognition. Its limitations are that it is computationally expensive, difficult to implement, and tends to be domain specific. A third approach annotates images with high-level features and uses IR techniques to carry out retrieval: text describes the high-level features contained in the images, and retrieval uses relevance feedback and domain knowledge, which can overcome some problems of incompleteness and subjectiveness. Finally, a fourth approach uses low-level features to index and retrieve images.
In practice, the second and fourth approaches have provided good performance; however, the second approach is not applicable to general applications. The following sections describe in more detail low-level feature methods, based on color, shape, and texture, combined with text-based retrieval techniques. In practice, text-based and low-level, feature-based techniques are combined to achieve high retrieval performance.

Text-Based Image Retrieval

In text-based image retrieval, images are described with free text. Queries take the form of keywords, with or without Boolean operators, and retrieval is based on similarities between the query and the text descriptions of images. There are two main differences between text-based image retrieval and conventional text document retrieval. First, text annotation is a manual process, as high-level image understanding is not yet possible. In image annotation we care about efficiency and about describing image contents completely and consistently; domain knowledge or a thesaurus should be used to overcome completeness and consistency problems, and relationships between words or terms must also be considered. For example, images annotated with "child", "man", or "woman" should all be retrieved by a query using the keyword "human being", which intends to retrieve all images containing human beings. Second, the text description may not be complete and may be subjective; thus the use of a knowledge base and relevance feedback is extremely important for text-based image retrieval. The advantage of text-based image retrieval is that it captures high-level abstractions and concepts, such as "smile" and "happy", contained in images. However, it cannot retrieve images based on an example, and some features, such as shape and texture, are difficult to describe in words.

Color-Based Image Indexing and Retrieval Technique

This is a commonly used approach in content-based retrieval.
The idea of color-based image retrieval is to retrieve from the database images that have colors similar to the user's query. Each image in the database is represented using the three channels of a chosen color space; the most common is RGB (red, green, and blue). Each color channel is discretized into m intervals, so the total number of discrete color combinations (called bins), n, is equal to m^3. For example, if each color channel is discretized into 16 intervals, we have 4,096 bins in total. A color histogram H(M) is a vector (h1, h2, h3, ..., hj, ..., hn), where element hj represents the number of pixels in image M falling into bin j. This histogram is the feature vector stored as the index of the image. During image retrieval, a histogram is computed for the query image or estimated from the user's query, and the distances between the histogram of the query and those of the images in the database are measured. Images whose histogram distance is smaller than a predefined threshold are retrieved from the database and presented to the user; alternatively, the k images with the smallest distances are retrieved.
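The histogram construction and retrieval-by-distance scheme can be sketched as follows, using an L1 (city-block) histogram distance. The images here are simply lists of precomputed bin indices; real images would first be quantized into bins as described above:

```python
# A minimal sketch of color-histogram indexing and comparison.
# Each "image" is a list of per-pixel bin indices after color quantization.

def histogram(pixels, n_bins):
    """Count how many pixels fall into each of the n_bins bins."""
    h = [0] * n_bins
    for p in pixels:
        h[p] += 1
    return h

def l1_distance(h1, h2):
    """Sum of absolute per-bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Two 8x8 images over 8 colors: 8 pixels of each color, versus
# 7 pixels of each of the first four colors and 9 of each of the rest.
image1 = [c for c in range(8) for _ in range(8)]
image2 = [c for c in range(4) for _ in range(7)] + [c for c in range(4, 8) for _ in range(9)]
h1 = histogram(image1, 8)   # [8, 8, 8, 8, 8, 8, 8, 8]
h2 = histogram(image2, 8)   # [7, 7, 7, 7, 9, 9, 9, 9]
print(l1_distance(h1, h2))  # 8
```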
The L1 metric defines the distance between images I and H as:

d(I, H) = Σ (l = 1 to n) | i_l − h_l |

where i_l and h_l are the numbers of pixels falling into bin l in images I and H, respectively. For example, suppose we have three images of 8×8 pixels, and each pixel is one of eight colors, C1 to C8. Image 1 has 8 pixels in each of the eight colors; image 2 has 7 pixels in each of colors C1 to C4 and 9 pixels in each of colors C5 to C8; image 3 has 2 pixels in each of colors C1 and C2 and 10 pixels in each of colors C3 to C8. We then have the following three histograms:
H1 = (8, 8, 8, 8, 8, 8, 8, 8)
H2 = (7, 7, 7, 7, 9, 9, 9, 9)
H3 = (2, 2, 10, 10, 10, 10, 10, 10)

The distances between these three images are:

d(H1, H2) = 1+1+1+1+1+1+1+1 = 8
d(H1, H3) = 6+6+2+2+2+2+2+2 = 24
d(H2, H3) = 5+5+3+3+1+1+1+1 = 20

Therefore, according to the histograms, images 1 and 2 are the most similar and images 1 and 3 the most different.

Image Retrieval Based on Shape

Shape representation is a fundamental issue in the newly emerging multimedia applications. In content-based image retrieval (CBIR), shape is an important low-level image feature. A good shape representation and similarity measure for recognition and retrieval purposes should have two important properties:
- Each shape should have a unique representation, invariant to translation, rotation, and scale.
- Similar shapes should have similar representations, so that retrieval can be based on distances among shape representations.
There are generally two types of shape representation: contour-based and region-based. Contour-based methods require the extraction of boundary information, which in some cases may not be available. Region-based methods do not necessarily rely on shape boundary information, but they do not reflect local features of a shape. Therefore, for generic purposes, both types of shape representation are necessary. Two shape descriptors have been widely adopted for CBIR: Fourier descriptors (FD) and grid descriptors (GD).

Fourier descriptor method: In the Fourier descriptor-based method, a shape is first represented by a feature function called a shape signature. A discrete Fourier transform is applied to the signature to obtain the FDs of the shape, which are used to index the shape and to calculate shape similarity.
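The Fourier descriptor pipeline can be sketched as follows. The centroid-distance signature and the plain-Python DFT are illustrative choices (a real system would sample the boundary uniformly and use an FFT); normalizing the magnitudes by the first coefficient gives scale invariance, and discarding phase gives rotation invariance:

```python
# A sketch of Fourier descriptors over a centroid-distance shape signature.
import cmath

def centroid_distance_signature(boundary):
    """boundary: list of (x, y) points sampled along the shape contour.
    Returns the distance of each point from the shape centroid."""
    cx = sum(x for x, _ in boundary) / len(boundary)
    cy = sum(y for _, y in boundary) / len(boundary)
    return [((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 for x, y in boundary]

def fourier_descriptor(signature, n_coeffs=4):
    """DFT magnitudes of the signature, normalized by the DC term."""
    n = len(signature)
    mags = []
    for k in range(n_coeffs + 1):
        coeff = sum(signature[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(coeff))
    return [m / mags[0] for m in mags[1:]]  # drop the DC term itself

# Eight boundary points of a square centered at the origin:
square = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
print(fourier_descriptor(centroid_distance_signature(square)))
```

Shapes are then compared by the distance between their descriptor vectors, exactly as histograms are compared above.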
Grid descriptor method: In grid shape representation, a shape is projected onto a grid of fixed size. Grid cells are assigned the value 1 if they are covered by the shape (or covered beyond a threshold) and 0 if they are outside it. A shape number consisting of a binary sequence is created by scanning the grid in left-to-right and top-to-bottom order, and this binary sequence is used as the shape descriptor to index the shape.

Image Retrieval Based on Texture

Texture is an important image feature, but it is difficult to describe, and its perception is subjective to a certain extent. One of the best-known methods is that of Tamura, Mori, and Yamawaki, who conducted psychological experiments to find a texture description that conforms to human perception as closely as possible. According to their specification, six features describe texture:
- Coarseness: Coarse is the opposite of fine. Coarseness is the most fundamental texture feature; to some people, texture simply means coarseness. The larger the distinctive image elements, the coarser the image, so an enlarged image is coarser than the original.
- Contrast: Contrast is measured using four parameters: the dynamic range of gray levels in the image; the polarization of the distribution of black and white on the gray-level histogram, or the ratio of black to white areas; the sharpness of edges; and the period of repeating patterns.
- Directionality: A global property over the given region that measures both element shape and placement. The orientation of the texture pattern itself is not important.
- Line-likeness: This parameter is concerned with the shape of a texture element. Two common types of shape are line-like and blob-like.
- Regularity: This measures variation in the element-placement rule, i.e., whether the texture is regular or irregular. Differing element shapes reduce regularity, and a fine texture tends to be perceived as regular.
- Roughness: This measures whether the texture is rough or smooth. It is related to coarseness and contrast.
Not all six features are used in texture-based image retrieval systems. For example, in the QBIC system, texture is described by coarseness, contrast, and directionality. Retrieval is based on similarity instead of exact match.
Integrated Image Indexing and Retrieval Techniques

An individual feature alone cannot describe an image adequately. For example, it is not possible to distinguish a red car from a red apple based on color alone; therefore, a combination of features is required for effective image indexing and retrieval. A practical system, QBIC, was developed by IBM Corporation. It allows a large image database to be queried by visual properties such as colors, color percentages, texture, shape, and sketch, as well as by keywords.
QBIC capabilities have been incorporated into IBM’s DB2 Universal Database product.
VIDEO INDEXING AND RETRIEVAL

Video is information rich. A complete video may consist of text, a sound track (both speech and non-speech), and images recorded or played back continuously at a fixed rate. The following methods are used for video indexing and retrieval:
- Metadata-based method: Video is indexed and retrieved based on structured metadata such as author, producer, director, date of production, and type of video.
- Text-based method: Video is indexed and retrieved using IR techniques applied to associated text.
- Audio-based method: Using speech recognition and IR techniques, video can be indexed and retrieved based on the spoken words associated with video frames.
- Content-based method: There are two approaches. The first treats the video as a set of independent frames or images and applies the image indexing and retrieval methods. The second divides the video into groups of similar frames and indexes each group by its representative frame; this approach is called shot-based video indexing and retrieval.
- Integrated approach: Two or more of the above methods can be combined to provide more effective video indexing and retrieval.
The following section discusses the shot-based video indexing and retrieval technique.

Shot-Based Video Indexing and Retrieval Technique

A video sequence consists of a sequence of images taken at a certain rate. A long video contains many frames; if they were treated individually, indexing and retrieval would be very hard. Therefore, video is divided into a number of logical units or segments called shots.
A shot can have the following features:
- The frames depict the same scene.
- The frames signify a single camera operation.
- The frames contain a distinct event or action, such as the significant presence of an object.
- The frames are chosen as a single indexable entity by the user.
[Figure: a video partitioned into shots. Frames taken in the same scene and featuring the same group of people correspond to a shot; we need to identify the part of the video that contains the required information.]
Shot-based video indexing and retrieval consists of the following main steps:
1- Segment the video into shots (called video temporal segmentation, partitioning, or shot detection).
2- Index each shot. The common approach is first to identify key frames or representative frames (r frames) for each shot, and then to apply the image indexing methods described earlier.
3- Measure the similarity between the query and the video shots and retrieve the shots with the highest similarities. This is achieved using the image retrieval methods, based on the indexes or feature vectors obtained in step 2.
SEGMENT THE VIDEO INTO SHOTS

Video Shot Detection or Segmentation

Consecutive frames on either side of a camera break generally display a significant quantitative change, so a suitable quantitative measure that captures the difference between a pair of frames is needed. If the difference between two consecutive frames exceeds a given threshold, it may be interpreted as indicating a segment boundary. The camera break is the simplest transition between two scenes; a video may also contain other transitions such as dissolve, wipe, fade-in, and fade-out, which produce more gradual changes between consecutive frames than a camera break does.

Basic Video Segmentation Techniques

The key issue in shot detection is how to measure frame-to-frame differences. The simplest way is to measure the sum of pixel-to-pixel differences between neighboring frames; if the sum is larger than a preset threshold, a shot boundary is declared between the two frames. However, this method is not effective and reports many false shot boundaries, because two frames within one shot may have a large pixel-to-pixel difference due to object movement between frames.
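The pixel-difference baseline just described can be sketched as follows. Frames are flat lists of gray-level values and the threshold is an illustrative choice; as noted above, object motion within a shot can defeat this measure:

```python
# A minimal sketch of shot detection by summed pixel-to-pixel differences.

def pixel_difference(frame_a, frame_b):
    """Sum of absolute per-pixel differences between two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))

def detect_shot_boundaries(frames, threshold):
    """Return indices i where a boundary lies between frames i and i+1."""
    return [i for i in range(len(frames) - 1)
            if pixel_difference(frames[i], frames[i + 1]) > threshold]

frames = [
    [10, 10, 10, 10],    # shot 1
    [12, 9, 11, 10],     # small change: same shot
    [200, 200, 200, 0],  # abrupt change: camera break
    [199, 201, 200, 2],
]
print(detect_shot_boundaries(frames, threshold=100))  # [1]
```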
To overcome this limitation, methods were introduced that measure the color-histogram distance between neighboring frames. The principle behind these methods is that object motion causes only small histogram differences; if a large difference is found, a camera break is assumed to have occurred. The following formula measures the difference between the ith frame and its successor:
SDi = Σj | Hi(j) − Hi+1(j) |
where Hi(j) denotes the histogram value of the ith frame at level j, and j is one of the G possible gray levels. If SDi is larger than the predetermined threshold, a shot boundary is declared. Another simple but more effective approach compares histograms based on a color code derived from the R, G, and B components:
SDi = Σj ( Hi(j) − Hi+1(j) )² / Hi+1(j)
This measurement is called the χ² test; here, j denotes a color code instead of a gray level. In these techniques, selecting appropriate threshold values is a key issue in determining segmentation performance.

Detecting Shot Boundaries with Gradual Change

The techniques above rely on a single frame-to-frame difference threshold for shot detection. In practice, they cannot detect shot boundaries when the change between frames is gradual, as in videos produced with fade-in, fade-out, dissolve, and wipe operations, or when the color histograms of two frames from two different scenes happen to be similar. Fade-in is when a scene gradually appears; fade-out is when a scene gradually disappears; dissolve is when one scene gradually disappears while another gradually appears; and wipe is when one scene gradually enters across the frame while another gradually leaves. In such operations, the difference values tend to be higher than those within a shot but significantly lower than the shot threshold. A single threshold does not work here, because capturing these boundaries would require lowering the threshold significantly, causing many false detections. To overcome this, Zhang et al. developed a twin-comparison technique that can detect both normal camera breaks and gradual transitions. This technique uses two difference thresholds:
- Tb: used to detect a normal camera break.
- Ts: a lower threshold used to detect the potential frames where a gradual transition may occur.
During the shot-boundary detection process, consecutive frames are compared using one of the previously described methods.
If the difference is larger than Tb, a shot boundary is declared. If the difference is less than Tb but larger than Ts, the frame is declared a potential transition frame, and the frame-to-frame differences of consecutively occurring potential transition frames are accumulated. If the accumulated difference of these consecutive potential frames is larger than Tb, a transition is declared and the consecutive potential frames are treated as a special segment. Note that the accumulated difference is computed only while the frame-to-frame difference remains larger than Ts.

VIDEO INDEXING AND RETRIEVAL

We now need to represent and index each shot so that shots can be located and retrieved quickly in response to queries. The most common way is to represent each shot with one or more key frames or representative frames (r frames). Retrieval is then based on the similarity between the query and the r frames.

Indexing and Retrieval Based on r Frames of Video Shots

Using a representative frame is the most common way to represent a shot. The r frame captures the main contents of the shot; its features are extracted and indexed based on color, shape, and texture, as in image retrieval. During retrieval, queries are compared with the indexes or feature vectors of these frames. If an r frame is similar to the query, it is presented to the user, who can then play the shot it represents. When a shot is static, any frame is good enough to serve as the representative frame; but when there is a lot of object movement in the shot, other methods should be used. Two issues must be addressed regarding r-frame selection: first, how many r frames should be used per shot; second, how to select these r frames within a shot. To determine how many r frames should be used, a number of methods have been proposed:
1- Use one r frame per shot. However, this method does not consider the length and content changes of shots.
2- Assign the number of r frames according to shot length, with one r frame representing each second (or less) of the shot: if a shot is longer than one second, one r frame is assigned to each second of video. This method partially overcomes the limitation of the first method, but it ignores shot contents.
3- Divide a shot into subshots or scenes and assign one r frame to each subshot. A subshot is detected based on changes in content, which are determined from motion vectors, optical flow, and frame-to-frame differences.
In the second step, we need to determine how these r frames are selected. Corresponding to the previous methods of determining the number of r frames per shot, several possibilities have been proposed (here, the general term "segment" refers to a shot, a second of video, or a subshot, depending on the method used):
1- In the first method, the first frame of each segment is normally used as the r frame. This choice is based on the observation that cinematographers attempt to "characterize" a segment with the first few frames before beginning to track or zoom to a close-up; thus the first frame of a segment normally captures the overall contents of the segment.
2- In the second method, an average frame is defined so that each pixel in it is the average of the pixel values at the same grid point across all frames of the segment. The frame within the segment that is most similar to this average frame is then selected as the representative frame.
3- In the third method, the histograms of all frames in the segment are averaged, and the frame whose histogram is closest to this average histogram is selected as the representative frame.
4- The fourth method is mainly used for segments captured with camera panning. Each frame within the segment is divided into background and foreground objects; a large background is constructed from the backgrounds of all frames, and the main foreground objects of all frames are then superimposed onto this constructed background.
Among these methods it is hard to determine which is best; the choice of r frame is application dependent. The next sections address some additional techniques for video indexing and retrieval.

Indexing and Retrieval Based on Motion Information

Video indexing and retrieval based on motion information has been proposed to complement the r-frame-based approach, which treats a video as a collection of still images. In this method, motion information is derived from motion vectors and determined for each r frame, so that r frames are indexed based on both image contents and motion information.

Indexing and Retrieval Based on Objects

Object-based indexing schemes find a way to distinguish individual objects throughout a given scene, that is, a complex collection of objects, and carry out the indexing process based on information about each object. This indexing strategy can capture the changes in content throughout the sequence.
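The third r-frame selection method above (the frame whose histogram is closest to the segment's average histogram) can be sketched as follows, reusing the L1 histogram distance; the sample segment is illustrative:

```python
# A sketch of r-frame selection by averaged histograms: average the
# per-frame histograms over a segment, then pick the frame whose
# histogram is closest (L1 distance) to that average.

def average_histogram(histograms):
    n = len(histograms)
    return [sum(h[j] for h in histograms) / n for j in range(len(histograms[0]))]

def select_r_frame(histograms):
    """Return the index of the representative frame of the segment."""
    avg = average_histogram(histograms)
    distances = [sum(abs(a - b) for a, b in zip(h, avg)) for h in histograms]
    return distances.index(min(distances))

segment = [
    [8, 8, 8, 8],
    [7, 9, 8, 8],
    [2, 14, 8, 8],  # outlier frame within the segment
]
print(select_r_frame(segment))  # 1: the middle frame best matches the average
```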
Indexing and Retrieval Based on Metadata

Metadata for video is available in some standard video formats. Video indexing and retrieval can be based on this metadata using conventional DBMSs.

Indexing and Retrieval Based on Annotation

Annotations can be produced by manually interpreting and annotating the video, by using transcripts and subtitles, or by applying speech recognition to the sound track to extract spoken words, which can then be used for indexing and retrieval.
AUDIO INDEXING AND RETRIEVAL

Digital audio is represented as a sequence of samples and is normally stored in compressed form.
For a human being, it is easy to recognize different types of audio: we can all tell whether a piece of audio is music, noise, or a human voice, and whether its mood is happy, sad, relaxing, and so on. For a computer, however, audio is just a sequence of sample values, so a retrieval technique is needed to access audio files and answer query requests. The traditional method of accessing audio pieces is based on their titles or file names, which is not good enough to answer a query such as "find audio pieces similar to the one being played", in other words, query by example. To overcome this problem, content-based audio retrieval techniques are required. The general approach to content-based audio retrieval is as follows:
- Audio is classified into common types such as speech, music, and noise.
- Different audio types are processed and indexed in different ways. For example, if the audio type is speech, speech recognition is applied and the speech is indexed based on the recognized words.
- Query audio pieces are similarly classified, processed, and indexed.
- Audio pieces are retrieved based on the similarity between the query index and the indexes in the database.
Audio signals are represented in the time domain or the frequency domain, and different features are extracted from these two representations.
Time-Domain Features
- Average energy: indicates the loudness of the audio signal.
- Zero crossing rate (ZCR): indicates the frequency of sign changes in the signal amplitude.
- Silence ratio: indicates the proportion of the sound piece that is silent.
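The three time-domain features above can be sketched directly over a list of samples. The silence threshold below is an illustrative choice; in practice it depends on the recording level:

```python
# A sketch of the three time-domain audio features.

def average_energy(samples):
    """Mean of squared sample values: a loudness indicator."""
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def silence_ratio(samples, silence_threshold=0.05):
    """Proportion of samples whose magnitude is below the threshold."""
    silent = sum(1 for s in samples if abs(s) < silence_threshold)
    return silent / len(samples)

signal = [0.0, 0.5, -0.5, 0.5, 0.0, 0.01, -0.02, 0.0]
print(average_energy(signal))
print(zero_crossing_rate(signal))
print(silence_ratio(signal))
```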
Frequency-Domain Features
- Sound spectrum: shows the frequency components and frequency distribution of a sound signal. In the frequency domain, the signal is represented as amplitude varying with frequency, indicating the amount of energy at different frequencies.
- Bandwidth: indicates the frequency range of a sound. It can be taken as the difference between the highest and lowest frequencies of the non-zero spectrum components, where "non-zero" may be defined as at least 3 dB above the silence level.
- Energy distribution: the distribution of the signal across frequency components. One important feature derived from the energy distribution is the centroid, the mid-point of the spectral energy distribution of a sound; the centroid is also called brightness.
- Harmonicity: in harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency, which is called the fundamental frequency. Music is normally more harmonic than other sounds.
- Pitch: the distinctive quality of a sound, dependent primarily on the frequency of the sound waves produced by its source. Only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch. In practice, the fundamental frequency is used as an approximation of the pitch.
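The centroid and bandwidth features above can be sketched over a magnitude spectrum, here given as illustrative (frequency, magnitude) pairs rather than computed from a real signal; the "non-zero" floor stands in for the 3 dB rule mentioned above:

```python
# A sketch of spectral centroid (brightness) and bandwidth over a
# magnitude spectrum given as (frequency_hz, magnitude) pairs.

def spectral_centroid(spectrum):
    """Energy-weighted mean frequency: the mid-point of the spectral
    energy distribution, also called brightness."""
    total = sum(m for _, m in spectrum)
    return sum(f * m for f, m in spectrum) / total

def bandwidth(spectrum, floor=0.01):
    """Highest minus lowest frequency among the non-zero components."""
    active = [f for f, m in spectrum if m > floor]
    return max(active) - min(active)

spectrum = [(100, 0.0), (200, 1.0), (400, 0.5), (800, 0.25), (1600, 0.0)]
print(spectral_centroid(spectrum))
print(bandwidth(spectrum))  # 800 - 200 = 600
```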
Spectrogram
The previous two representations are simple, but each is incomplete: the amplitude-time representation does not show the frequency components of the signal, and the spectrum does not show when the different frequency components occur. To overcome these limitations, a combined representation called the spectrogram is used. The spectrogram of a signal shows the relation between three variables: frequency content, time, and intensity. Frequency content is shown along the vertical axis and time along the horizontal axis, with intensity shown in gray scale, the darkest parts marking the greatest amplitude/power.
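A spectrogram is built by splitting the signal into short overlapping frames and taking the magnitude spectrum of each. A minimal sketch (frame size and hop are illustrative choices):

```python
import numpy as np

def spectrogram(samples, frame_size=256, hop=128):
    """Magnitude spectra of successive windowed frames.

    Returns a 2-D array: rows are time frames, columns are
    frequency bins, and each cell is the intensity there.
    """
    samples = np.asarray(samples, dtype=float)
    window = np.hanning(frame_size)   # taper to reduce spectral leakage
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)
```

Plotting this array (time on one axis, frequency on the other, intensity as gray level) gives the familiar spectrogram picture.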
Audio Classification
Audio needs to be classified into speech, music, and possibly other categories and subcategories, because different audio types require different processing, indexing, and retrieval techniques, and because they have different significance to different applications.
Main Characteristics of Different Types of Sound
The following are the main characteristics of speech and music, as they are the basis for audio classification.
Speech
Speech has a low bandwidth compared to music, within the range of 0-7 kHz; hence, the spectral centroids (brightness) of speech signals are usually lower than those of music. Speech signals have a higher silence ratio than music because of the frequent pauses in speech occurring between words and sentences.
Music
Music normally has a wide frequency range, from 16 to 20,000 Hz; thus, its spectral centroid is higher than that of speech. Music has a low silence ratio compared to speech; one exception may be music produced by a solo instrument or singing without accompanying music.
Audio Classification Framework
All classification methods are based on calculated feature values; however, they differ in how these features are used.
• Step-by-Step Classification
Each feature is used individually in a different classification step to determine whether an audio piece is speech or music, so each feature acts as a filtering criterion. At each step, an audio piece is determined to be one type or another. In this classification method, the centroid of each input audio piece is calculated first: if the centroid is higher than a pre-determined threshold, the piece is music; otherwise it is either music or speech (since some music also has a low centroid). Then the silence ratio is calculated: if it has a low value, the audio piece is music; otherwise it is either solo music or speech (solo music also has a high silence ratio). Finally, the ZCR (zero crossing rate) variability is calculated: if the input has high ZCR variability, it is speech; otherwise it is solo music.
The order of the steps in the algorithm is based on the differences between the features: the less complicated feature with the higher differentiating power is used first. A possible filtering process is shown in Figure 1.
Audio input
1. High centroid? Yes: music. No: speech plus music; go to step 2.
2. High silence ratio? No: music. Yes: speech plus solo music; go to step 3.
3. High ZCR variability? Yes: speech. No: solo music.
Figure 1: Audio classification process
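The step-by-step decision sequence can be sketched as a simple rule-based classifier. All the threshold values here are placeholders that would have to be tuned on training data:

```python
def classify_audio(centroid, silence_ratio, zcr_variability,
                   centroid_thr=2000.0, silence_thr=0.2, zcr_var_thr=0.5):
    """Step-by-step audio classification; thresholds are illustrative."""
    if centroid > centroid_thr:          # high brightness -> music
        return "music"
    if silence_ratio < silence_thr:      # few pauses -> music
        return "music"
    if zcr_variability > zcr_var_thr:    # ZCR varies a lot -> speech
        return "speech"
    return "solo music"                  # high silence, steady ZCR
```

Each `if` corresponds to one filtering step in Figure 1, applied in order of increasing feature cost.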
• Feature-Vector-Based Audio Classification
The values of a set of features are calculated and used as a feature vector. During the training stage, an average feature vector is found for each class of audio. During classification, the feature vector of the input is calculated, and the vector distance between the input feature vector and each of the reference vectors is computed. The input is classified into the class from which it has the smallest vector distance.
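This is essentially a nearest-centroid classifier. A minimal sketch, assuming the feature vectors have already been extracted:

```python
import numpy as np

def train(examples):
    """Training: average the feature vectors of each class.

    `examples` maps class name -> list of feature vectors;
    the result maps class name -> reference (average) vector.
    """
    return {name: np.mean(np.asarray(vecs, dtype=float), axis=0)
            for name, vecs in examples.items()}

def classify(reference, x):
    """Assign x to the class whose reference vector is closest
    (smallest Euclidean vector distance)."""
    x = np.asarray(x, dtype=float)
    return min(reference, key=lambda name: np.linalg.norm(x - reference[name]))
```

For example, training on two clusters of vectors and classifying a new point returns the label of the nearer cluster average.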
Speech Recognition
Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as in applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding. Speech recognition systems can be characterized by many parameters, some of the more important of which are:
- Speaking mode: isolated words to continuous speech
- Speaking style: read speech to spontaneous speech
- Enrollment: speaker-dependent to speaker-independent
- Vocabulary: small (<20 words) to large (>20,000 words)
- Language model: finite-state to context-sensitive
- Perplexity: small (<10) to large (>100)
- SNR: high (>30 dB) to low (<10 dB)
- Transducer: voice-cancelling microphone to telephone
An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment, in which a user must provide samples of his or her speech before using the system, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combinations of words.
Basic Concepts of ASR (Automatic Speech Recognition)
There are two stages in ASR:
1. Training: the features of each speech unit are extracted and stored in the system.
2. Recognition: the features of an input speech unit are extracted and compared with each of the stored features, and the speech unit with the best-matching features is taken as the recognized unit.
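A toy illustration of the two stages using stored templates and nearest-template matching. Real recognizers use far more sophisticated models (e.g., hidden Markov models); the feature vectors here are stand-ins for whatever features the system extracts:

```python
import math

templates = {}  # speech unit (e.g., a word) -> stored feature vector

def train_unit(word, features):
    """Training stage: extract and store the features of a speech unit."""
    templates[word] = features

def recognize(features):
    """Recognition stage: return the stored unit whose features best
    match the input (smallest Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(templates, key=lambda w: dist(templates[w], features))
```

After training on a few units, an input close to the "yes" template is recognized as "yes".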
Music Indexing and Retrieval
There are two types of music: structured (synthetic) music and sample-based music.
Indexing and Retrieval of Structured Music and Sound Effects
Structured music and sound effects are represented by a set of commands or algorithms. The most common structured music format is MIDI, which represents music as a number of notes and control commands. MPEG-4 is a newer standard for structured audio, which represents sound with algorithms and control languages. These standards were developed for sound transmission, synthesis, and production; they are not designed for indexing and retrieval purposes. However, the explicit structure and note descriptions in these formats make the retrieval process easy, since no feature extraction from audio signals is needed. User queries for sound files also rely on exact matches between the query and the database sound files. Sometimes, the sound produced by a retrieved sound file may not be what the user wants, because different devices can render the same structured sound file differently.
Indexing and Retrieval of Sample-Based Music
There are two general approaches to indexing and retrieval of sample-based music.
Retrieval based on a set of features: a model is built for each class based on a set of features, and the similarity between the features of the query and the models is computed.
Retrieval based on pitch: the pitch of each note is extracted or estimated, converting the musical sound into a symbolic representation.
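For the pitch-based approach, one common symbolic representation is the melodic contour (the Parsons code): each note is encoded only as up, down, or same relative to the previous note, and melodies are then matched as strings. A minimal sketch, assuming the pitch sequence has already been extracted:

```python
def pitch_contour(pitches):
    """Convert a pitch sequence (e.g., in Hz) into a U/D/S contour string."""
    contour = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            contour.append("U")   # up
        elif cur < prev:
            contour.append("D")   # down
        else:
            contour.append("S")   # same
    return "".join(contour)

def retrieve(query_pitches, database):
    """Return names of database melodies whose contour contains the
    query's contour as a substring (hummed-query style matching)."""
    q = pitch_contour(query_pitches)
    return [name for name, pitches in database.items()
            if q in pitch_contour(pitches)]
```

The contour representation deliberately discards absolute pitch, so a query hummed in the wrong key can still match the stored melody.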
FUTURE RESEARCH ISSUES AND TRENDS Since the 1990s, remarkable progress has been made in theoretical research and system development. However, there are still many challenging research problems. This section identifies and addresses some issues on the future research agenda. Automatic Metadata Generation Metadata (data about data) is the data associated with an information object for the purposes of description, administration, technical functionality, and so on. Metadata standards have been proposed to support the annotation of multimedia content. Automatic generation of annotations for multimedia involves high-level semantic
representation and machine learning to ensure the accuracy of annotation. Content-based retrieval techniques can be employed to generate the metadata, which can then be used by text-based retrieval. Embedding Relevance Feedback Multimedia contains large quantities of rich information and involves the subjectivity of human perception. The design of content-based retrieval systems has come to emphasize an interactive approach instead of a computer-centric approach. A user-interaction approach requires human and computer to interact in refining high-level queries. Relevance feedback is a powerful technique for facilitating interaction between the user and the system. Research issues include the design of the interface with regard to usability, and learning algorithms that can dynamically update the weights embedded in the query object to model high-level concepts and perceptual subjectivity. Bridging the Semantic Gap One of the main challenges in multimedia retrieval is bridging the gap between low-level representations and high-level semantics (Lew & Eakins, 2002). The semantic gap exists because low-level features are more easily computed in the system design process, whereas high-level queries are used at the starting point of the retrieval process. The semantic gap involves not only the conversion between low-level features and high-level semantics, but also the understanding of the contextual meaning of the query, involving human knowledge and emotion. Current research aims to develop mechanisms or models that directly associate high-level semantic objects with representations of low-level features.
Conclusion So far, the main concepts, issues, and techniques in developing multimedia information indexing and retrieval systems have been discussed. The importance of multimedia databases has led researchers to focus their efforts on designing more efficient methods and techniques to retrieve the best of these databases.
References
Guojun Lu, Multimedia Database Management Systems, Artech House Publishers, 1999.
Chia-Hung Wei and Chang-Tsun Li, "Design of Content-based Multimedia Retrieval", Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK.
Leung, "Survey Papers on Audio Indexing and Retrieval", 2004/2005, http://www.it.cityu.edu.hk
Jahanzeb Farooq and Michael Osadebey, "Content-Based Image Retrieval & Shape as Feature of Image", Media Signal Processing, presentation.
Dengsheng Zhang and Guojun Lu, "Content-Based Shape Retrieval Using Different Shape Descriptors: A Comparative Study", Gippsland School of Computing and Information Technology, Monash University, Churchill, Victoria 3842, Australia.
Terms and Definitions
Boolean Query: A query that uses Boolean operators (AND, OR, and NOT) to formulate a complex condition; for example, "university" OR "college".
Content-Based Retrieval: An application that directly makes use of the contents of the media, rather than annotations entered by humans, to locate the desired data in large databases.
Feature Extraction: A subject of multimedia processing that involves applying algorithms to calculate and extract attributes describing the media.
High-Level Feature: A feature, such as timbre, rhythm, instruments, or events, that involves some degree of semantics contained in the media.
Intensity: The power of a frequency component at a particular time interval.
Low-Level Feature: A feature such as object motion, color, shape, texture, loudness, power spectrum, bandwidth, or pitch.
Query by Example: A method of searching a database using example media as the search criteria. This mode allows users to select predefined examples without requiring them to learn a query language.
Segmentation: The process of dividing a video sequence into shots.
Shot: A short sequence of contiguous frames.
Similarity Measure: A measure that compares the similarity of any two objects represented in a multi-dimensional space. The general approach is to represent the data features as multi-dimensional points and then calculate the distances between the corresponding points.
Video: A combination of text, audio, and images with a time dimension.
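The distance computation behind the similarity measure can be illustrated with Euclidean distance between feature points (other metrics, such as city-block distance, are also used):

```python
import math

def euclidean_distance(a, b):
    """Distance between two feature vectors in multi-dimensional space;
    a smaller distance means the two objects are more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For instance, the points (0, 0) and (3, 4) are at distance 5 in two-dimensional feature space.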