Fifth International Conference on Computer Graphics, Imaging and Visualization Visualisation
Tampered Image Detection using Image Matching Zhenghao Li1,2, A.Y.C. Nee2, S.K. Ong2, Weiguo Gong1 1. Key Lab of Optoelectronic Technology and System of Ministry of Education, Chongqing University, Chongqing, China 400030 2. Mechanical Engineering Department, Faculty of Engineering, National University of Singapore, Singapore 117576 {lizhenghao|mpeneeyc|mpeongsk}@nus.edu.sg,
[email protected]
advocate that the essential way to solve these kinds of problems is, at first, to extract the hidden information which cannot be quantitatively detected by human eyes from the images, employing the computer vision and digital image processing technology, and then to find traces of tampering using the information mentioned above. According to this procedure, a novel solution for detecting tampered image based on image matching is proposed. There are two main contributions in this paper. First, we present a hybrid matching algorithm for detailed image analysis based on the multi-scale theory and local feature descriptors. There are two desirable properties in this algorithm. The first property is that the algorithm can generate more accurate and reliable correspondences and the second is it can obtain a more explicit description for the scale space of image features. We detect the features using the Maximally Stable Extremal Region (MSER) [1] and Laplacian-ofGaussian (LoG) extrema, and then wrap the features to a SIFT-like descriptor [2]. Lastly, we find the correspondences using hybrid spill tree (SP-Tree) [3]. An evaluation is conducted based on the reconstruction similarity. It is shown that the proposed algorithm outperforms others. Second, we explore the use of the information embedded in the correspondences, and introduce image matching to tampered image detection. This idea not only enriches the use of the matching technology, but also develops a powerful solution for tampered image detection. The rest of the paper is organized as follows. In Section 2, the proposed matching algorithm is introduced and evaluated. The order regularity of scale space between the corresponding patches is explained in Section 3. In Section 4, the description for object rotation is given. Lastly in Section 5, conclusions and future work are presented and proposed.
Abstract With the development of image processing software, such as Photoshop, scientific analysis for detecting tampered images becomes a critical research issue. A hybrid image matching algorithm for image analysis is proposed and evaluated. Features are extracted using blob detectors and interest point detectors. This combination ensures sufficient correspondences in small image patches while maintaining high accuracy. Nearest neighbors are searched using hybrid spill tree in order to reduce the computational load. Applying the algorithm to tampered image detection, a novel solution is proposed utilizing the scale space information and gradient orientation information from the corresponding features. Furthermore, an order regularity of scale space between the corresponding patches and a description for object rotation are also proposed. In addition, we conducted experiments on tampered images, proving the method to be powerful.
1. Introduction With the development of digital image processing software, it becomes difficult for people to distinguish tampered images from the genuine ones. In fact, the authenticity of some news photographs has been questioned more frequently recently. Therefore, it is pertinent to develop theories and algorithms for detecting tampered images, and this has already received the attention of many scientists and researchers. It is known that tampered digital images should be produced in accordance with the principles of computer vision, such as perspective transformation, and epipolar constraint, and it is necessary to use image processing technology, for example, synthesis, scale, skew, Gaussian blur, etc. In this case, we
978-0-7695-3359-9/08 $25.00 © 2008 IEEE DOI 10.1109/CGIV.2008.13
174
selected from the data set [8], and are regulated to the size of 800*600 before use. We discarded the duplicate detections which are of the 8-neighborhood with other features. Reliable matches are obtained by performing RANSAC constraint after matching. The average numbers of extracted features are shown in Table 1. It can be observed from Table 1 that the best performances are obtained using the combination of MSER and LoG extrema.
2. Matching algorithm In computer vision, sets of data acquired through sampling from the same scene or object at different times, or from different perspectives, will be in different coordinate systems. Image matching is the process of finding the correspondences between these sets of data. We break up a matching task into three modules [4], and a modularized design method [5] is adopted for selecting the appropriate algorithm. It starts with a feature detection module that detects features (interesting points, regions) from the images. This is followed by a description module, where local image regions are selected and converted to descriptor vectors that can be compared with the other sets of descriptors in the final matching module. Note that the main focus is on precision rather than computational load.
Table 1. Comparisons of different detector combinations Combination
I & II
II & III
I & III
I, II & III
Totally Detected
907
609
648
1082
Discard Duplicates
493
492
519
530
Matched
185
193
200
202
Reliable Matched
174
185
189
189
It should be noted that the features are detected in a multi-scale representation, from which a set of smoothed images are obtained through a successive convolution of the Gaussian kernel. LG ( x, y ) GG ( x, y )
I ( x, y ) ,
2.1. Feature detection The intention which is also the challenge of this module is to find the features that are invariant to viewing conditions. Though image features are usually classified into two categories, global and local, the latter has been shown to be better suited for matching as they are robust to occlusion and background clutter. Different solutions for local feature extraction have been proposed in recent years. Some researchers proved that the interest points are robust and stable, but others suggested distinguished regions are more appropriate for matching. Mikolajczyk et al. [6, 7] systematically reviewed and compared the performance of these detectors. The experimental results revealed that their Harris-Laplace detector and the DoG extrema proposed by Lowe [2], which is an approximation of LoG, outperform other interest point detectors, and the MSER detector proposed by Matas et al. [1] performs best in the region detectors on a wide range of test sequences. As a matter of fact, some light-weight detectors have already performed well in narrow baseline systems, but the number of reliable matches drops sharply when there are large viewpoint and illumination changes, even if using the most complex detectors. We found an effective way to solve this problem through the fusion of different detectors. Three candidate detectors, LoG extrema (I), Harris-Laplace points (II) and MSER (III), are selected in our experiments. We extracted the features using all the possible detector combinations. The test images are
where GG ( x, y )
1 2SG 2
Exp(
x2 y 2 ). 2G 2
A multi-scale pyramid is built by several octaves with different initial size, and each octave is divided into several intervals. We construct the scale factor (o,i) = 2(o+i/I)/2, and 1/(21/2) bicubic interpolation is performed to obtain the initial image of the next octave when the variance of internal scale factor equals to 21/2, where o and i are defined as the sequence number of octave and interval.
2.2. Description For a robust image matching algorithm, extracting a sufficient number of distinctive potential features is a prerequisite but certainly cannot guarantee success. In most cases, the description module also acts an important role to the system. Though the algorithms for building the descriptors vary to a large extent currently, the influential reviews by Winder et al. [5] and Mikolajczyk et al. [9] ranked the Scale Invariant Feature Transform (SIFT) and Gradient Location and Orientation Histogram (GLOH) [8] as the best descriptors that are able to achieve high accuracy. In this module, the SIFT descriptor is adopted for its intuitional representation of a 3D space. After a sequence of successive processing, the descriptor becomes not only invariant to image scaling and
175
rotation, but also partial invariant to changes in the illumination and the 3D camera viewpoint.
We propose a Reconstruction Similarity test to avoid the need for the information of the data set. The flow chart is shown in Figure 1. We define a nonoverlapping feature to represent a kind of feature which the distance to any other features is larger than a threshold (a threshold of 4 pixels is chosen in our experiments), and the rest are overlapping features. After performing an image matching task between two images, we firstly remove the matched overlapping features in one reference image, and a Delaunay triangulation mesh [11, 12] is built on the nonoverlapping features. Then, a piece-wise affine deformation is performed on the mesh to reconstruct image 2 using the materials from image 1. The last step is similarity measurement, in which, features are extracted using 2D-PCA [13], and the similarity is computed using a cosine classifier.
2.3. Matching Exact NN search, such as the Hash Table and the KD-Tree, may bring “curse of dimensionality” into our image matching system due to the high dimensionality of the SIFT-like descriptors. In the standard SIFT, the Best-Bin-First (BBF) approach [10] is used, which can reduce the search time through maintaining a priority queue of bins that are not visited. We adopted a new approximate NN approach called the hybrid SP-Tree [3] which has been shown to outperform other existing approaches. The hybrid SP-Tree is considered as a combination of the “Defeatist search” in the nonoverlapping nodes and the SP-Tree in the overlapping nodes. The key idea of this approach is to properly set an overlapping buffer that the child-nodes can share their points, with which the time cost of back-tracking can be greatly reduced. Finally, an effective way of eliminating the error matches, such as RANSAC or epipolar geometry consistent, should not be omitted. The comparisons of computational time between different image matching algorithms are conducted and the results are shown in Table 2. The results show that the hybrid SP-Tree works more efficiently than the BBF while maintaining the same accuracy.
Image 1
Image 2
Image Matching Non-overlapping Feature Selection Delaunay Triangulation Affine Reconstruction Image 1 Trim
Table 2. Comparisons of different image matching algorithms
Image 2* Trim
Similarity Measurement
Algorithm
SIFT
GLOH
MSER
LoG+MSER +BBF
Proposed Algorithm
Time (sec) Reconstruction Similarity
2.312
2.507
1.475
3.144
2.389
0.966
0.969
0.911
0.972
0.972
Reconstruction Similarity
Figure 1. Flow chart of Reconstruction Similarity It is known that local image deformations cannot be realistically approximated even when using a full affine translation in the wide baseline system. However, there is no doubt that if more accurate correspondences and greater number of reliable nonoverlapping matches can be obtained, then a higher reconstruction similarity can be achieved. Based on our experience, in a wide baseline system, some wrong matches are hard to be eliminated using RANSAC or the epipolar constraint, but the remaining ones are usually cause of fatal error. Though ROC and Recall-Precision can reflect the number of the matches, neither of them can describe the degree of error. The advantage of the reconstruction similarity is that it is covariant with both these two factors.
2.4. Evaluation In the field of image matching, the performance of the different algorithms is usually measured based on the repeatability rate, i.e., the percentage of the number of points that are simultaneously present in two images. Though the repeatability rate indicates the potential matches, it cannot fully represent the overall performance. Mikolajczyk et al. [9] and Winder et al. [5] introduced the Recall-Precision and Receiver Operating Characteristics (ROC) for evaluation, and gave the comprehensive reviews of the current algorithms. Other researchers cannot achieve the same comparisons because they may not know the ground truth of the test data set.
176
(1)
(2)
Figure 2. Two matched images with 3 adjacent corresponding pairs A, B, and C
(1)
(2)
Figure 4. Two matched images with corresponding regions A and B when the camera is closer to A in image 1 than it in image 2, accordingly, the camera will be closer to 1B than to 2B. It can be decomposed as follows:
We tested our algorithm on the image library [8] with three highly recognized methods, namely, SIFT, GLOH and MSER. The experimental result shown in Table 2 demonstrates that algorithm we proposed performs better than others, benefiting from the efficient searching engine and the fusion of regionbased and point-based detectors.
E[1 A / E[1B | E[ 2 A / E[ 2 B . ® ¯ E[1 A E[ 2 A | E[1B E[ 2 B However, it is more complicated in a real system due to changes in the camera settings and the viewpoint. More significantly, the durations of certain features in a scale space are not the same. Some features can be detected with large scale variances (stable), while other features only exist in a narrow range of the scale change (unstable). Thus, there is a common phenomenon that a feature from a higher scale space has been matched with one from a lower scale space, which makes E usually smaller than its actual value. Wang et al. investigated the relation between the two types of extrema, ɮ1 and ɮ2, and the information content in the scale space [14]. They concluded that the higher the variation rate of the information content is, the lower is the ratio of the stable to unstable features. Therefore, we can
3. Order regularity of scale space According to the theory of scale space, the linear scale space mainly reflects the distance between the camera and the object. In a simplified model, the value of (denote = log2 as an indicator of the scale space) is proportional to the distance. Assuming object A and B simultaneously appeared in image 1 and 2 is adjacent, the order regularity can be expressed as
E[1 A E[ 2 A !0, E[1B E[ 2 B
where E is the average of the scale space indicators in an image patch. Intuitively, as illustrated in Figure 2,
177
compensate the change of E using the information theory. Owing to the instability of the cardinality of ɮ1 and ɮ2, we substitute the entropy for the cardinality. ( E[1 A E[ 2 B ) /( E[ 2 A E[1B ) R ,
radius should be adopted to avoid the problem (rX and rY are set to two times of rZ). Figure 4 demonstrated the use of this rotation expression. The horizontal view angles vary much of the two given images, but it is abnormal that Y:x,y of region A and B remains the same. In other words, the change of Y:x,y does not reflect the rotation relative to the camera, so the authenticity is questionable.
>1, 1 D @ En'1 A ! En'2 A or En'1B ! En'2 B , R® ¯>1 D , 1@ En'1 A En'2 A or En'1B En'2 B
where is set to be 0.16 in our experiments. For instance, as shown in Figure 2, it is easy to find that the change of EC from images 1 to 2 cannot meet the order regularity with EA and EB. Hence, if image 1 is authentic, we can say that region C in image 2 has been tampered.
5. Conclusions and future work In conclusion, a hybrid matching algorithm is proposed in this paper. It has been proven to be very suitable for scientific image analysis due to its desirable properties. We also propose the order regularity of the scale space and the description for object rotation, which can be used as powerful tools for tampered image detection. In future, the following three aspects will be further studied. z Finding reliable correspondences from images of a scene taken from arbitrary viewpoints in different illumination conditions is still a difficult and critical research issue. A potential solution is to extend the current algorithm, usually designed for gray level images, to the color space. Thus, we are planning to replace gray-level based detectors by color-based detectors in our algorithm. z While most researchers have concentrated on obtaining more reliable matches, the useful information hidden in matched pairs has been ignored. An attempt to explore this is made in this paper, but it is a very small percentage of the total matches. More embedded information can be mined and utilized in image processing. z Completing a semi-automatic system for detecting tampered images is now under consideration.
4. Description for object rotation Y
X Z Figure 3. Rotation in a 3D space In Section 2.2, the orientation x,y and the gradient magnitude mx,y of a descriptor are computed using the pixel differences in order to ensure that they are invariant to rotation.
Z : mx , y Z : T x, y
( Lx 1, y Lx 1, y ) 2 ( Lx , y 1 Lx , y 1 ) 2 w tan 1 (( Lx , y 1 Lx , y 1 ) /( Lx 1, y Lx 1, y )
where mx,y is weighted by the Gaussian-weighted circular window. 2
w
Exp(
rZ ) 2 (1.5 V ) 2
Z:x,y and Z:mx,y can be treated as an indicator of the rotation of an object with respect to the Z-axis as illustrated in Figure 3. We extend another two orthogonal orientations to fully describe the rotation in a 3D space:
X : mx , y X : T x, y Y : mx , y Y : T x, y
Acknowledgements This research is supported by the National HighTech Research and Development Plan of China (863 Program) under Grant No. 2007AA01Z423, by the Defense Basic Research Project of the ‘Eleventh FiveYear-Plan’ of China under Grant No. C10020060355, by the Key Project of Chinese Ministry of Education under Grant No. 02057 and by the Key Research Project of the Natural Science Foundation of Chongqing Municipality of China under Grant No. CSTC2005BA2002, CSTC2007AC2018. In addition, the research is conducted at the CIPMAS AR Laboratory at the National University of Singapore.
( Lx , y 1 Lx , y ) 2 ( Lx , y 1 Lx , y ) 2 w' tan 1 (( Lx , y 1 Lx , y ) /( L x , y 1 Lx , y )) ( Lx1, y Lx , y ) 2 ( Lx1, y Lx , y ) 2 w' tan 1 (( Lx1, y Lx , y ) /( Lx1, y Lx , y ))
.
The absence of the depth information usually causes the instability of the two presented orientations. Therefore, a larger Gaussian-weighted circular window
178
References [1]
J. Matas, O. Chuma, M. Urbana, and T. Pajdla, “WideBaseline Stereo from Maximally Stable Extremal Regions”, Image and Vision Computing, 2004, 22(10), pp.761-767.
[2]
D.G. Lowe, “Distinctive Image Features from ScaleInvariant Keypoints”, International Journal of Computer Vision, 2004, 60(2), pp.91-110.
[3]
T. Liu, A.W. Moore, A. Gray, and K. Yang, “An Investigation of Practical Approximate Nearest Neighbor Algorithms”, Advances in Neural Information Processing Systems, 2005, pp.825-832.
[4]
P.E. Forssén, “Maximally Stable Colour Regions for Recognition and Matching”, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp.1220-1227.
[5]
S.A. J. Winder and M. Brown, “Learning local image descriptors”, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.17-24.
[6]
K. Mikolajczyk and C. Schmid, “Scale & Affine Invariant Interest Point Detectors”, International Journal of Computer Vision, 2004, 60(1/2), pp.63-86.
[7]
K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Vangool, “Comparison of Affine Region Detectors”, International Journal of Computer Vision, 2005, 65(1/2), pp.43-72.
[8]
Affine Covariant Regions Datasets, url: http://www. robots.ox.ac.uk/~vgg/data/, 2004.
[9]
K. Mikolajczyk and C. Schmid, “A Performance Evaluation of Local Descriptors”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(10), pp.1615-1630.
[13] J. Yang, D. Zhang, A.F. Frangi, and J.Y. Yang, “Twodimensional PCA: a new approach to appearancebased face representation and recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(1), pp.131-137. [14] Z.Y. Wang, Z.X. Cheng, and S.J. Tang, “Information Measures of Scale-Space based on Visual Characters”, Journal of Image and Graphics, 2005, 10(7), pp.922928.
[10] J.S. Beis and D.G. Lowe, “Shape Indexing Using Approximate Nearest-Neighbour Search in HighDimensional Spaces”, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp.1000-1006. [11] A. Bowyer, “Computing Dirichlet tessellations”, Computer Journal, 1981, 24(2), pp.162-166. [12] H. Øyvind and D. Morten, “Triangulations and Applications”, Mathematics and Visualization, Springer, Berlin, 2006.
179