Statistical Correlation Analysis in Image Retrieval

Mingjing Li*, Zheng Chen, Hong-Jiang Zhang
Microsoft Research China
49 Zhichun Road, Beijing 100080, China
{mjli, zhengc, hjzhang}@microsoft.com

* Author to whom all correspondence should be addressed.

Abstract

A statistical correlation model for image retrieval is proposed. This model captures the semantic relationships among images in a database from simple statistics of user-provided relevance feedback information. It is applied in the post-processing of image retrieval results such that more semantically related images are returned to the user. The algorithm is easy to implement and can be efficiently integrated into an image retrieval system to help improve the retrieval performance. Preliminary experimental results on a database of 100,000 images show that the proposed model improves image retrieval performance for both content-based and text-based queries.

Keywords: image retrieval, relevance feedback, correlation analysis, statistical learning, and knowledge accumulation

1. Introduction

As the amount of digital image data available on the Internet and in digital libraries grows rapidly, there is a great need for efficient image indexing and access tools to fully utilize this massive digital resource. Image retrieval is a research area dedicated to addressing this need, and substantial research efforts have been made. By and large, however, earlier image retrieval systems took keyword- or text-based approaches to indexing and retrieving image data. Because image annotation is a tedious process, it is practically impossible to annotate all images on the Internet. Furthermore, due to the multiplicity of contents in a single image and the subjectivity of human perception and understanding, different users rarely annotate the same image in exactly the same way.

To address those limitations, content-based image retrieval (CBIR) approaches have been studied in the last decade [1, 2, 3, 4, 5]. These approaches work with descriptions based on properties inherent in the images themselves, such as color, texture, and shape, and utilize them for retrieval purposes. Since visual features are automatically extracted from images, automated indexing of image databases becomes possible. However, despite the many research efforts, the retrieval accuracy of today's CBIR algorithms is still limited and often worse than that of keyword-based approaches. The problem stems from the fact that visual similarity measures, such as color histograms, do not necessarily match the perceptional semantics and subjectivity of images. In addition, each type of image feature tends to capture only one of many aspects of
image similarity, and it is difficult to require a user to specify clearly which aspect, or what combination of aspects, he/she wants to apply in defining a query.

To address these problems, interactive relevance feedback techniques have been proposed [6, 7, 8, 9, 10, 11, 12]. The idea is to incorporate human perception subjectivity into the retrieval process: users are given opportunities to evaluate retrieval results, and queries are automatically refined on the basis of those evaluations. Lately, this research topic has become one of the most challenging in CBIR research. In general, the relevance feedback process in CBIR is as follows. For a given query, the CBIR system first retrieves a list of ranked images according to a predefined similarity metric, often defined by the distance between the query vector and the feature vectors of images in a database. Then, the user selects a set of positive and/or negative examples from the retrieved images, and the system refines the query and retrieves a new list of images. Hence, the key issue in relevance feedback approaches is how to incorporate positive and negative examples in the query and/or similarity refinement. The early relevance feedback schemes for CBIR were mainly adopted from text document retrieval research and can be classified into two approaches: query point movement (query refinement) and re-weighting (similarity measure refinement). Both are built upon the vector model in information retrieval theory [13, 14]. Recently, more computationally robust methods that perform global optimization have been proposed. The MindReader retrieval system formulates a minimization problem on the parameter estimating process [9]. It allows for correlations between attributes in addition to different weights on each component.

However, while all the approaches adapted from text document retrieval do improve the performance of CBIR, they have severe limitations: even with feedback, it is still difficult to capture the high-level semantics of images when only low-level image features are used in queries. The inherent problem with these approaches is that low-level features are often not as powerful in representing the complete semantic content of images as keywords are in representing text documents. In other words, applying the relevance feedback approaches used in text document retrieval to low-level feature-based image retrieval will not be as successful as in text document retrieval. Using low-level features alone is not effective in representing users' feedback or in describing their intentions. Furthermore, in these algorithms, the semantics potentially captured in the relevance feedback process of one query session are not memorized to continuously improve the retrieval performance of the system.

To overcome these limitations, another school of ideas is to use learning approaches to incorporate semantics in relevance feedback [15, 16, 17, 18]. The PicHunter framework further extended the relevance feedback and learning idea with a Bayesian approach [17]. With an explicit model of what users would do given the target image they want, PicHunter uses Bayes' rule to predict which target they want, given their actions. This is done via a probability distribution over possible image targets, rather than by refining a query. To achieve this, an entropy-minimizing display algorithm is developed that attempts to maximize the information obtained from a user at each iteration of the search.
Also, this proposed framework
makes use of hidden annotation rather than a possibly inaccurate and inconsistent annotation structure that the user must learn and make queries in. However, this could be a disadvantage as well, since it excludes the possibility of benefiting from good annotations, which may lead to very slow convergence. The information-embedding framework [16] attempts to embed semantic information into CBIR processes through relevance feedback, using a semantic correlation matrix and low-level feature distances. In this framework, semantic relevance between image clusters is learnt from users' feedback and used to improve the retrieval performance. In other words, the framework maintains the strengths of feature-based image retrieval while incorporating learning and annotation into the relevance feedback process. Experiments have shown that this framework is effective not only in improving retrieval performance within a given query session, but also in utilizing the knowledge learnt from previous queries to reduce the number of iterations in subsequent queries. However, it is complex in terms of computation and implementation and is thus difficult to incorporate into practical image retrieval systems.

Motivated by the work on statistical language modeling [19] and link structure analysis in web page search [20, 21], we propose a statistical correlation model that is able to accumulate and memorize the semantic knowledge learnt from the relevance feedback information of previous queries. We have also developed an effective algorithm to apply this model in image retrieval so as to help yield better results for future queries. This model estimates the probability that two images are semantically similar to each other based on the co-occurrence frequency with which both images are labeled as positive examples during a query/feedback session. It can be trained from the users' relevance feedback log and dynamically updated during the image retrieval process. The algorithm is simple enough to be easily incorporated into an image retrieval system. Preliminary versions of this paper appeared in the proceedings of the 3rd Intl Workshop on Multimedia Information Retrieval (MIR 2001) [22] and were presented as a keynote at the 1st Intl Workshop on Pattern Recognition in Information Systems (PRIS 2001).

The remainder of this paper is organized as follows. In Section 2, the definition of the correlation model is introduced. In Section 3, the training algorithms of the model are described. In Section 4, the image ranking schemes based on the correlation model are explained. Preliminary experimental results on a database of 100,000 images are presented in Section 5. Finally, concluding remarks are given in Section 6.

2. Definition of Statistical Correlation Model

The main idea behind the proposed model is the assumption that two images represent similar semantics if they are jointly labeled as relevant to the same query in a relevance feedback phase. Accordingly, the model estimates the semantic correlation between two images based on the number of search sessions in which both images are relevant examples. A search session starts with a query phase and is possibly followed by one or more feedback phases. For simplicity, the number of times that two images are co-relevant is referred to as the bigram frequency, while the number of times that an image is
relevant is referred to as the unigram frequency. The maximum value over all unigram and bigram frequencies is referred to as the maximum frequency. Intuitively, the larger the bigram frequency, the more likely that the two images are semantically similar to each other, and thus the higher the semantic correlation between them. Ideally, the correlation strength might be defined as the ratio between the bigram frequency and the total number of search sessions. In practice, however, there are many images in the database, and users are usually reluctant to provide feedback information; therefore, the bigram frequency is very small relative to the number of queries. Here, we define the semantic correlation between two images as the ratio between the bigram frequency and the maximum frequency. Since the definition of bigram frequency is symmetric, the semantic correlation is also symmetric. The self-correlation, i.e., the correlation between an image and itself, is defined in the same way, except that the bigram frequency is replaced by the unigram frequency of the image. By definition, the correlation strength lies within the interval between 0 and 1. To be specific,

$$0 \le R(I, J) \le 1, \qquad R(I, J) = R(J, I),$$

$$R(I, J) = \begin{cases} U(I)/M, & \text{if } I = J, \\ B(I, J)/M, & \text{if } I \ne J, \end{cases}$$

where $I, J$ are two images, $B(I, J)$ is their bigram frequency, $U(I)$ is the unigram frequency of image $I$, $M$ is the maximum frequency, and $R(I, J)$ is the semantic correlation strength between images $I$ and $J$.
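To make the definition concrete, the following Python sketch computes correlation strengths from unigram and bigram frequency tables. It is illustrative only, not code from the described system; the dictionary layout and the example frequencies are assumptions.

```python
# Hypothetical sketch of the correlation definition; names are illustrative.
# unigram[i]   : number of sessions in which image i was relevant, U(I)
# bigram[i, j] : number of sessions in which images i and j were co-relevant, B(I, J)

def correlation(i, j, unigram, bigram, max_freq):
    """Semantic correlation R(I, J) = B(I, J) / M, or U(I) / M on the diagonal."""
    if i == j:
        return unigram.get(i, 0) / max_freq
    key = (min(i, j), max(i, j))    # symmetry: store each pair only once
    return bigram.get(key, 0) / max_freq

unigram = {1: 5, 2: 3}
bigram = {(1, 2): 2}
M = max(list(unigram.values()) + list(bigram.values()))   # maximum frequency
print(correlation(1, 2, unigram, bigram, M))              # 0.4
print(correlation(1, 1, unigram, bigram, M))              # 1.0
```

Storing each pair once under a sorted key is one way to realize the triangular representation discussed next.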
Because of the model's symmetry, a triangular matrix is sufficient to keep all correlation information. All items with zero correlation are excluded from the model, since they do not convey any information. To further reduce the model size, items with correlation values below a certain threshold may also be removed; this is called pruning. The representation of the correlation model is therefore highly efficient.
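As a small illustration of pruning (a hypothetical sketch; the threshold value and the dictionary-based triangular layout are assumptions), entries below the cutoff can simply be dropped:

```python
def prune(model, threshold=0.01):
    """Drop correlation entries below threshold to shrink the model.
    Keys are (i, j) pairs with i <= j, so symmetry is implicit."""
    return {pair: r for pair, r in model.items() if r >= threshold}

model = {(1, 1): 1.0, (1, 2): 0.4, (1, 7): 0.004, (2, 2): 0.6}
print(prune(model))   # {(1, 1): 1.0, (1, 2): 0.4, (2, 2): 0.6}
```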
The proposed correlation model is solely determined by the unigram and bigram frequencies of images in the database. An intuitive training method is to obtain these frequencies from the statistics of user-provided feedback information collected in the user log. Let $A$ denote the query-image adjacency matrix, whose $(i, j)$-th entry is 1 if the $j$-th image is relevant to the $i$-th query, and 0 otherwise. Then the co-relevant matrix $A^T A$ contains all the necessary information: its diagonal entries are the unigram frequencies, while its off-diagonal entries are the bigram frequencies. This method is simple and can be used for dynamic updates of the correlation model during the image retrieval process.
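For a concrete picture of this computation, here is a small numpy sketch; the adjacency matrix is hypothetical, not data from the system.

```python
import numpy as np

# Hypothetical example: 3 queries x 4 images; A[i, j] = 1 iff image j
# was marked relevant to query i.
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0]])

C = A.T @ A            # co-relevant matrix
unigram = np.diag(C)   # diagonal: unigram frequencies U(I)
print(unigram)         # [3 2 1 0]
print(C[0, 1])         # bigram frequency B(I0, I1) = 2
```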
However, we encountered two problems in practice. One is how to process the irrelevant examples, which also provide important information. Because of the diversity of search intentions, an image that is relevant to one user's intention may be marked as irrelevant by another user, even if their queries are the same. The correlation model should reflect the consensus of many different users. The other problem is data sparseness: because the database is large and feedback is limited, it is not easy to collect sufficient training data.

To address these two problems, the calculation of the unigram and bigram frequencies is somewhat more involved. In our solution, the definitions of unigram and bigram frequency are extended to take account of irrelevant images. For a specific search session, we assume a positive correlation between two positive (relevant) examples, and the corresponding bigram frequency is increased. We assume a negative correlation between a positive example and a negative (irrelevant) example, and their bigram frequency is decreased. However, we do not assume any correlation between two negative examples, because they may be irrelevant to the user's query in many different ways. Accordingly, the unigram frequency of a positive example is increased, while that of a negative example is decreased. Non-feedback images are not automatically treated as negative examples in our model; these images are therefore excluded from the calculation of unigram and bigram frequencies.

To overcome the problem of data sparseness, the feedbacks of search sessions with the same query, either a text query or an image example, are grouped together, so that feedback images in different sessions may obtain correlation information. Within each group of search sessions sharing the same query, the local unigram frequency of each image, referred to as its unigram count, is calculated first. Based on these counts, the global unigram and bigram frequencies are updated.

The unigram counts in a group are calculated as follows. Initially, the unigram count $C(I)$ of each image $I$ is set to 0. After that, $C(I)$ is iteratively updated for every session in the group:

$$C(I) = C(I) + 1, \quad \text{if image } I \text{ is marked as positive in a session,}$$
$$C(I) = C(I) - 1, \quad \text{if image } I \text{ is marked as negative in a session,}$$

and $C(I)$ is unchanged otherwise. This process is repeated for every image in the database. The unigram frequencies are then updated as

$$U(I) = U(I) + C(I),$$

while the bigram frequencies of image pairs are updated as

$$B(I, J) = \begin{cases} B(I, J) + \min\{C(I), C(J)\}, & \text{if } C(I) > 0, C(J) > 0, \\ B(I, J) - \min\{C(I), -C(J)\}, & \text{if } C(I) > 0, C(J) < 0, \\ B(I, J) - \min\{-C(I), C(J)\}, & \text{if } C(I) < 0, C(J) > 0, \\ B(I, J), & \text{otherwise.} \end{cases}$$

In this way, the symmetry of the bigram frequency is preserved. The procedure for training the correlation model is summarized as follows, with a sketch given after the list:

(1) Initialize all unigram and bigram frequencies to zero;
(2) Cluster search sessions with the same query into groups;
(3) Calculate the local unigram counts within a group;
(4) Update the global unigram frequencies;
(5) Update the global bigram frequencies;
(6) Repeat steps 3, 4 and 5 for all session groups;
(7) Set all negative unigram and bigram frequencies to zero;
(8) Calculate the correlation strengths according to the definition.
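The following Python sketch implements one group update (steps 3 to 5) as described above. The session format and data structures are illustrative assumptions, not the authors' implementation; per step (7), any remaining negative frequencies would be zeroed before the correlation strengths are computed.

```python
from collections import defaultdict

def update_group(sessions, U, B):
    """Apply the unigram/bigram updates for one group of sessions that
    share the same query. U and B are the global frequency tables."""
    # Step 3: local unigram counts, +1 per positive, -1 per negative label.
    C = defaultdict(int)
    for positives, negatives in sessions:
        for img in positives:
            C[img] += 1
        for img in negatives:
            C[img] -= 1
    # Step 4: global unigram update, U(I) = U(I) + C(I).
    for img, c in C.items():
        U[img] = U.get(img, 0) + c
    # Step 5: global bigram update; symmetry kept via ordered pair keys.
    imgs = sorted(C)
    for a in range(len(imgs)):
        for b in range(a + 1, len(imgs)):
            i, j = imgs[a], imgs[b]
            ci, cj = C[i], C[j]
            if ci > 0 and cj > 0:
                B[(i, j)] = B.get((i, j), 0) + min(ci, cj)
            elif ci > 0 and cj < 0:
                B[(i, j)] = B.get((i, j), 0) - min(ci, -cj)
            elif ci < 0 and cj > 0:
                B[(i, j)] = B.get((i, j), 0) - min(-ci, cj)
            # Two negative counts: no correlation assumed, B unchanged.

# Hypothetical group: two sessions with the same query.
U, B = {}, {}
update_group([([1, 2], [3]), ([1], [3])], U, B)
print(U)   # {1: 2, 2: 1, 3: -2}
print(B)   # {(1, 2): 1, (1, 3): -2, (2, 3): -1}
```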
4. Image Ranking Schemes

In a feature-based image retrieval system, the images returned by the system are consistent in the feature space, but not necessarily consistent semantically. As a result, some retrieved images are often not relevant to the user's intention at all. The proposed correlation model may be used to impose semantic constraints on the retrieved results so that more semantically consistent images are presented to the user, thereby improving the retrieval accuracy of the system.

The basic idea is to reorder the retrieved images based on the correlation model. It comes from the following observations. Given a query, the similarity between the query and an image in the database is measured based on their feature vectors. If the user provides any relevance feedback, the similarity measure is refined accordingly. Images with the highest similarities are returned as the retrieval results. Among these images, a certain number are relevant to the query. In general, the retrieval precision declines as the number of images in consideration increases. This implies that the feature-based similarity is itself a measure of relevance, although often not a good enough one. On the other hand, the retrieved images exhibit different relationships. Since relevant images convey semantically similar content with respect to the query, it is likely that previous users have already judged them as co-relevant through relevance feedback. Therefore, the correlation strength between two relevant images is expected to be high. In contrast, as irrelevant images may differ semantically from the query in many different ways, it is unlikely that they have been jointly labeled as relevant. Thus, the correlation strength between two irrelevant images is expected to be low. Similarly, the correlation between a relevant image and an irrelevant one is also expected to be low. It is therefore reasonable to assume that images having strong correlations with the top-ranked images are likely to be relevant to the query, even if their similarity scores defined in the feature space are low.

With each retrieved image, we associate a non-negative relevance score, which can be treated as a semantic similarity and is initialized to the image's feature-based similarity. We then use the relationships between images to iteratively update the scores in the following way. Relevance scores are propagated to other images via the correlation model, and each image receives a refined score: the sum of all relevance scores, weighted by the correlation strength between this image and the others in the retrieved list. The refined relevance scores are propagated further, and this process repeats until the properly normalized scores converge to equilibrium values.

Suppose there are $n$ images returned by the system, denoted $I_1, I_2, \ldots, I_n$, ranked in descending order of their similarities. Let $P$ denote the vector of relevance scores, and let $W$ be an $n \times n$ matrix whose $(i, j)$-th entry is the correlation strength between images $I_i$ and $I_j$. The iterative refinement of relevance scores is equivalent to $P' = \lambda_k W^k P$ with $k$ increasing without bound, where $\lambda_k$ is a normalization factor and $P'$ is the vector of refined scores.
Since $W$ is a symmetric matrix with only non-negative entries, it can be proved that the unit vector in the direction of $P'$ converges to the principal eigenvector of $W$, which corresponds to the largest eigenvalue of $W$ and has only non-negative entries [20]. This leads to a possible image ranking scheme: images are re-ranked according to the corresponding coordinates of the principal eigenvector of $W$, as in the sketch below.
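A minimal sketch of this eigenvector ranking, using plain power iteration on a small symmetric correlation matrix; the matrix values and iteration count are illustrative assumptions.

```python
import numpy as np

def principal_eigenvector(W, iters=100):
    """Power iteration: returns the unit principal eigenvector of a
    symmetric non-negative matrix W."""
    p = np.ones(W.shape[0])
    for _ in range(iters):
        p = W @ p
        p /= np.linalg.norm(p)   # normalization factor lambda_k
    return p

# Hypothetical correlation matrix among 3 retrieved images.
W = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
scores = principal_eigenvector(W)
order = np.argsort(-scores)      # re-rank by eigenvector coordinates
print(order)
```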
This ranking method does improve image retrieval precision in our experiments. However, it is not reliable enough. Unlike the link structure analysis in web page search [20, 21], the correlation model is trained using the limited feedback information available; it may thus not be well trained and may be inaccurate, so we cannot rely on it alone to re-rank images. Moreover, when the number of retrieved images is large, extracting the principal eigenvector becomes computationally inefficient, and images relevant to other semantics might dominate its coordinates. Therefore, a more reasonable method is to calculate the relevance scores in an efficient way and combine them with the feature-based similarities to produce the final ranking.

Our ranking scheme is as follows. For image $I_j$, its relevance score $p_j$ is initialized to its similarity $s_j$ and is iteratively updated a fixed number of times $k$ according to the following equation:

$$p_j = \frac{\sum_{i=1}^{m} p_i \, r_{ij}}{\sum_{i=1}^{m} p_i}, \qquad m \le n, \quad j = 1, 2, \ldots, n,$$

where $r_{ij}$ is the correlation strength between images $I_i$ and $I_j$, i.e., $r_{ij} = R(I_i, I_j)$. In this equation, only the relevance scores of the top $m$ images are propagated to the others. The final ranking score of image $I_j$ is then the weighted sum of the relevance score $p_j$ and the similarity $s_j$:

$$S_j = w \cdot s_j + (1 - w) \cdot p_j, \qquad 0 \le w \le 1,$$

where $w$ is the semantic weight.
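A minimal sketch of this ranking scheme, assuming similarity scores sorted in descending order; the function name, default parameter values, and the example matrix are illustrative assumptions.

```python
import numpy as np

def rerank(s, R, k=5, m=30, w=0.5):
    """Combine feature similarities s (sorted descending) with scores
    propagated through the correlation matrix R for k iterations."""
    n = len(s)
    m = min(m, n)
    p = np.asarray(s, dtype=float)          # scores start as similarities
    for _ in range(k):
        top = p[:m]
        # p_j <- sum_i p_i * r_ij / sum_i p_i, over the top-m images
        p = (top @ R[:m, :]) / top.sum()
    return w * np.asarray(s) + (1 - w) * p  # final weighted ranking score

# Tiny illustrative example with 3 images.
s = [0.9, 0.5, 0.4]
R = np.array([[1.0, 0.7, 0.0],
              [0.7, 1.0, 0.1],
              [0.0, 0.1, 1.0]])
print(rerank(s, R, k=5, m=3, w=0.5))
```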
5. Experiments
We have implemented the correlation model and integrated it into an image search system [23] that provides keyword-based image search, query by image example, and relevance feedback. In this system, the image database has been greatly expanded and now contains about 100,000 images collected from more than 2,000 representative websites. These images cover a variety of categories, such as "animals", "arts", and "nature". Their high-level textual features and low-level visual features are extracted from the web pages containing the images and from the images themselves, respectively [24]. The following six low-level features are used in this system: color histogram in HSV space with quantization 256, first and second color moments in Lab space, color coherence vector in LUV space with quantization 64, the MRSAR texture feature, the Tamura coarseness feature, and Tamura directionality.

The correlation model is trained using users' search and feedback data collected in the user log. After months of internal use, about 3,000 queries with relevance feedback were collected. Two experiments were conducted to evaluate the proposed method: one on text-based image retrieval, the other on pure content-based retrieval. For the former, we chose 20 text queries, consisting of the following keywords: car, flower, tree, cat, submarine, mars, spring, galaxy, movie star, potato, ship, space, tomb raider, woman, mountain, Clinton, Jordan, angel, dog, and summer. We then asked two subjects
to perform the image search experiments. Each of them was required to search for images with every query twice and to label all relevant and irrelevant images within the top 200 results returned by the system, according to his/her own subjective judgment. In the first search, no feedback was provided; in the second, three images were selected as either positive or negative examples. All the information was stored in the log, from which the ground truth was extracted automatically. For CBIR, 10 image examples were selected. So far, only one subject has performed the query-by-example experiments; here the ground truth was obtained by repeatedly conducting relevance feedback until no more images relevant to the query could be retrieved.

Based on the queries and the ground truth, the performance evaluation is conducted automatically with different semantic weights. In the experiments, the number of iterations used to refine the relevance scores is set to 5 ($k = 5$), while the number of images used to propagate relevance scores is set to 30 ($m = 30$). Because of the subjectivity of relevance judgment, the image retrieval precision is calculated for each subject separately and then averaged. The precision is defined as the percentage of relevant images in the retrieved list.

The experimental results for CBIR are presented in Figure 1, and those for text-based retrieval in Figure 2, where the horizontal axis is the number of top images in consideration, the vertical axis is the corresponding retrieval precision, and $w$ is the semantic weight. It is not surprising that the performance of text-based retrieval is much higher than that of CBIR. In both cases, the proposed correlation model significantly improves the retrieval precision. For CBIR, the precision is improved from 10% to 41% for the top 10 images, and from 4.6% to 18.5% for the top 100 images. In the experiments, the iterative refinement of relevance scores converges quickly: a small value of $k$, such as 4 or 5, produces stable retrieval results. The selection of $m$ does not appear to be very critical; the retrieval precision is almost the same for $m$ between 20 and 60, but it declines when $m > 100$. In general, a higher semantic weight yields better precision; however, the best result is not obtained at $w = 1$.
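As a concrete illustration of the precision measure defined above (a hypothetical example, not data from the experiments):

```python
def precision_at(n, retrieved, relevant):
    """Precision at scope n: fraction of the top-n retrieved images
    that are in the relevant (ground-truth) set."""
    top = retrieved[:n]
    return sum(1 for img in top if img in relevant) / n

# Hypothetical example: 3 of the top 5 results are relevant.
print(precision_at(5, ["a", "b", "c", "d", "e"], {"a", "c", "e"}))  # 0.6
```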
6. Conclusion

A statistical correlation model is proposed to improve the retrieval accuracy of image retrieval systems by re-ranking the top images. The training process is very simple: the model can be built and updated from the statistics of users' feedback information. The representation of the model is efficient; because of its symmetry, a triangular matrix is sufficient to store all correlation information, and the size of the model can be effectively controlled by pruning. The additional computation is minor when the model is applied in image retrieval. All of these advantages make the proposed model a good choice for practical image retrieval systems.

Acknowledgments
We thank Fang Qian, Xiaoxin Yin and Lei Zhang for helping perform the image search experiments.
References
1. M. Flickner, H. Sawhney, W. Niblack, et al., Query by Image and Video Content: The QBIC System, IEEE Computer Magazine 28 (9) (1995) 23-32.
2. B. Furht, S. W. Smoliar, H. J. Zhang, Image and Video Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.
3. J. R. Smith, S. F. Chang, VisualSEEK: A Fully Automated Content-Based Image Query System, Proc. 4th ACM Multimedia Conf., Boston, USA, 1996, pp. 87-98.
4. J. Huang, S. R. Kumar, M. Mitra, Combining Supervised Learning with Color Correlograms for Content-Based Image Retrieval, Proc. 5th ACM Multimedia Conf., Seattle, USA, 1997, pp. 325-334.
5. W. Y. Ma, B. S. Manjunath, NeTra: A Toolbox for Navigating Large Image Databases, Proc. IEEE Int. Conf. on Image Processing (Vol. I), Washington, USA, 1997, pp. 568-571.
6. I. J. Cox, M. L. Miller, S. M. Omohundro, P. N. Yianilos, PicHunter: Bayesian Relevance Feedback for Image Retrieval, Proc. 13th Int. Conf. on Pattern Recognition, Vienna, Austria, 1996, pp. 361-369.
7. Y. Rui, T. S. Huang, S. Mehrotra, Content-Based Image Retrieval with Relevance Feedback in MARS, Proc. IEEE Int. Conf. on Image Processing (Vol. II), Washington, USA, 1997, pp. 815-818.
8. S. Sclaroff, L. Taycher, M. La Cascia, ImageRover: A Content-Based Image Browser for the World Wide Web, Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, 1997, pp. 2-9.
9. Y. Ishikawa, R. Subramanya, C. Faloutsos, MindReader: Query Databases Through Multiple Examples, Proc. 24th Int. Conf. on Very Large Databases, New York, USA, 1998, pp. 218-227.
10. M. E. J. Wood, N. W. Campbell, B. T. Thomas, Iterative Refinement by Relevance Feedback in Content-Based Digital Image Retrieval, Proc. 6th ACM Multimedia Conf., Bristol, United Kingdom, 1998, pp. 13-20.
11. Y. Rui, T. S. Huang, A Novel Relevance Feedback Technique in Image Retrieval, Proc. 7th ACM Multimedia Conf., Orlando, USA, 1999, pp. 67-70.
12. N. Vasconcelos, A. Lippman, Learning from User Feedback in Image Retrieval Systems, Proc. Neural Information Processing Systems 12, Denver, USA, 1999, pp. 977-983.
13. J. J. Rocchio, Relevance Feedback in Information Retrieval, in The SMART Retrieval System – Experiments in Automatic Document Processing, Prentice Hall, 1971, pp. 313-323.
14. C. Buckley, G. Salton, Optimization of Relevance Feedback Weights, Proc. 18th Annual Intl ACM SIGIR Conf., Seattle, USA, 1995, pp. 351-357.
15. T. P. Minka, R. W. Picard, Interactive Learning with a "Society of Models", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Francisco, USA, 1996, pp. 447-452.
16. C. S. Lee, W. Y. Ma, H. J. Zhang, Information Embedding Based on User's Relevance Feedback for Image Retrieval, Proc. SPIE Vol. 3846 (Multimedia Storage and Archiving Systems IV), Boston, USA, 1999, pp. 294-304.
17. I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, P. N. Yianilos, The Bayesian Image Retrieval System, PicHunter: Theory, Implementation and Psychophysical Experiments, IEEE Trans. Image Processing 9 (1) (2000) 20-37.
18. Y. Lu, C. Hu, X. Zhu, H. J. Zhang, Q. Yang, A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems, Proc. 8th ACM Multimedia Conf., Los Angeles, USA, 2000, pp. 31-38.
19. P. R. Clarkson, R. Rosenfeld, Statistical Language Modeling Using the CMU-Cambridge Toolkit, Proc. Eurospeech 97, Rhodes, Greece, 1997, pp. 2707-2710.
20. J. M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46 (5) (1999) 604-632.
21. R. Lempel, A. Soffer, PicASHOW: Pictorial Authority Search by Hyperlinks on the Web, Proc. 10th Intl World Wide Web Conf., Hong Kong, China, 2001, pp. 438-448.
22. M. Li, Z. Chen, W. Liu, H. J. Zhang, A Statistical Correlation Model for Image Retrieval, Proc. 3rd Intl Workshop on Multimedia Information Retrieval, Ottawa, Canada, 2001, pp. 42-45.
23. H. J. Zhang, W. Liu, C. Hu, iFind – A System for Semantics and Feature Based Image Retrieval over Internet, Proc. 8th ACM Multimedia Conf., Los Angeles, USA, 2000, pp. 477-478.
24. Z. Chen, W. Liu, F. Zhang, M. Li, H. J. Zhang, Web Mining for Web Image Retrieval, Journal of the American Society for Information Science and Technology 52 (10) (2001) 831-839.
[Figure 1 plot: retrieval precision (vertical axis) versus number of images (horizontal axis), with one curve per semantic weight w.]
Figure 1. Precision vs. scope curve for content-based image retrieval
[Figure 2 plot: retrieval precision (vertical axis) versus number of images (horizontal axis), with one curve per semantic weight w.]
Figure 2. Precision vs. scope curve for keyword-based image retrieval
About the Author – Mingjing Li received his B.S. in electrical engineering from the University of Science and Technology of China in 1989, and his Ph.D. in pattern recognition from the Institute of Automation, Chinese Academy of Sciences, in 1995. He joined Microsoft Research China in July 1999. His research interests include handwriting recognition, statistical language modeling, search engines, and image retrieval.

About the Author – Zheng Chen received his B.S. and Ph.D. degrees in computer science from Tsinghua University, China, in 1994 and 1999, respectively. He joined Microsoft Research China in March 1999. His research interests include speech recognition, natural language processing, information retrieval, multimedia information retrieval, personal information management, and artificial intelligence.

About the Author – Hong-Jiang Zhang received his B.S. from Zhengzhou University, China, in 1982, and his Ph.D. from the Technical University of Denmark in 1991, both in electrical engineering. His research interests include video and image analysis and processing; content-based image, video, and audio retrieval; media compression and streaming; computer vision; and their applications in consumer and enterprise markets. He has published over 120 articles in these areas. He is a senior member of the IEEE, serves on the editorial boards of five professional journals, and sits on a dozen committees of various international conferences.