Probabilistic Modeling for Semantic Scene Classification

Matthew R. Boutell

URCS Technical Report 862

May, 2005

This thesis was proposed in April, 2003. The dissertation itself will be published in May, 2005. While the dissertation subsumes and modifies much of the material in this proposal, I have made it available (as URCS TR 862) as a historical supplement to document the details and results of the MASSES (Material and Spatial Experimental Scenes) prototype.
Probabilistic Modeling for Semantic Scene Classification by Matthew R. Boutell
Thesis Proposal for the Degree Doctor of Philosophy
Supervised by Christopher M. Brown

Department of Computer Science
The College
Arts and Sciences

University of Rochester
Rochester, New York

2005
Abstract

Scene classification, the automatic categorization of images into semantic classes such as beach, field, or party, is useful in applications such as content-based image organization and context-sensitive digital enhancement. Most current scene-classification systems use low-level features and pattern recognition techniques; they achieve some success on limited domains. Several contemporary classifiers, including some developed in Rochester, incorporate semantic material and object detectors. Classification performance improves because the gap between the features and the image semantics is narrowed. We propose that spatial relationships between the objects or materials can help by distinguishing between certain types of scenes and by mitigating the effects of detector failures. While past work on spatial modeling has used logic- or rule-based models, we propose a probabilistic framework to handle the loose spatial relationships that exist in many scene types. To this end, we have developed MASSES, an experimental testbed that can generate virtual scenes. MASSES can be used to experiment with different spatial models, different detector characteristics, and different learning parameters. Using a tree-structured Bayesian network for inference on a series of simulated natural scenes, we have shown that the presence of key materials can effectively distinguish certain scene types. However, spatial relationships are needed to disambiguate other types of scenes, achieving a gain of 7% in one case. Unfortunately, our simple Bayes net is not expressive enough to model the faulty detection at the level of individual regions. As future work, we propose first to
evaluate full (DAG) Bayesian networks and Markov Random Fields as potential probabilistic frameworks. We then plan to extend the chosen framework for our problem. Finally, we will compare our results on real and simulated sets of images with those obtained by other systems using spatial features represented implicitly.
Table of Contents

Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 The Problem of Scene Classification
    1.2.1 Scene Classification vs. Full-scale Image Understanding
    1.2.2 Scene Classification vs. Object Recognition
  1.3 Past Work in Scene Classification
  1.4 Statement of Thesis
    1.4.1 Philosophy
  1.5 Summary of Preliminary Work
  1.6 Organization of Proposal

2 Related Work
  2.1 Design Space of Scene Classification
    2.1.1 Features
    2.1.2 Learning and Inference Engines
  2.2 Scene Classification Systems
    2.2.1 Low-level Features and Implicit Spatial Relationships
    2.2.2 Low-level Features and Explicit Spatial Relationships
    2.2.3 Mid-level Features and Implicit Spatial Relationships
    2.2.4 Semantic Features without Spatial Relationships
    2.2.5 Semantic Features and Explicit Spatial Relationships
    2.2.6 Summary of Scene Classification Systems
  2.3 Options for Computing Spatial Relationships
    2.3.1 Computing Qualitative Spatial Relationships
    2.3.2 Computing Quantitative Spatial Relationships
  2.4 Probabilistic Graphical Models
    2.4.1 Bayesian Networks
    2.4.2 Markov Random Fields
    2.4.3 Relative Merits

3 Methodology
  3.1 Statistics of Natural Images
    3.1.1 Ground Truth Collection Process
    3.1.2 Scene Prototypes
    3.1.3 Spatial Relationships
    3.1.4 Scene-specific Spatial Relationship Statistics
  3.2 Experimental Environment
    3.2.1 Advantages of Working in Simulation
    3.2.2 MASSES Prototype
    3.2.3 Simulating Faulty Detectors
    3.2.4 Background Regions

4 Experimental Results
  4.1 Best-case Detection in MASSES
  4.2 Best-case Detection on Beach Photographs
  4.3 Faulty Detection on Beach Photographs

5 Proposed Research
  5.1 Research Plan
    5.1.1 Why a New Inference Mechanism?
    5.1.2 Bayes Nets vs. MRFs
    5.1.3 Extend the Chosen Framework
    5.1.4 Analyze the Effect of Detector Quality
    5.1.5 Evaluate Explicit Spatial Relationships
    5.1.6 Evaluate Semantic Features
    5.1.7 Generalize to Other Scene Classes
    5.1.8 Explore Potential Long-term Directions
  5.2 Issues Not Addressed in This Thesis
  5.3 Research Schedule

6 Acknowledgments

Bibliography

A Natural Scene Statistics
B Detector Characteristics
List of Tables

2.1 Options for features to use in scene classification.
2.2 Potential classifiers to use in scene classification.
2.3 Related work in scene classification, organized by feature type and use of spatial information.
3.1 Scene class definitions.
3.2 Distribution resulting from offline sampling procedure.
4.1 MASSES with best-case material detection: Accuracy with and without spatial information.
4.2 MASSES with best-case material detection: Accuracy with and without spatial information.
4.3 MASSES with faulty material detection: Accuracy with and without spatial information.
List of Figures

1.1 Content-ignorant color balancing can destroy the brilliance of sunset images, such as those pictured, which have the same global color distribution as indoor, incandescent-illuminated images.
2.1 A Bayes Net with a loop.
2.2 Portion of a typical two-layer MRF. In low-level computer vision problems, the top layer (black) represents the external evidence of the observed image while the bottom layer (white) expresses the a priori knowledge about relationships between parts of the underlying scene.
3.1 Ground-truth labeling of a beach scene. Sky, water, and sand regions are clearly shown.
3.2 Prototypical beach scenes. (a) A simple beach scene without background objects. (b) Because we make no attempt to detect it, we consider the sailboat to be "background". (c) A more complicated scene: a developed beachfront. (d) A scene from a more distant field-of-view. (e) A crowded beach.
3.3 Prototypical urban scenes. (a) The most common urban scene, containing sky, buildings, and roads. (b),(c) The sky is not simply above the buildings in these images. (d) Roads are not necessary. (e) Perspective views induce varied spatial relationships. (f) Close views can preclude the presence of sky.
3.4 An example of spatial relationships involving a split region.
3.5 The MASSES environment. Statistics from labeled scenes are used to bootstrap the generative model, which can then produce new virtual scenes for training or testing the inference module.
3.6 Single-level Bayesian network used for MASSES.
3.7 Sampling the scene type yields class C. Then we sample to find the materials present in the image, in this case, M1, M3, and M4. Finally, we sample to find the relationships between each pair of these material regions.
3.8 Bayesian network subgraph showing relationship between regions and detectors.
4.1 Images incorrectly classified due to spatial relationships.
4.2 Images incorrectly classified using faulty detectors and no spatial relationships. The actual materials are shown in the top row; the detected materials are shown below each.
4.3 Images incorrectly classified using faulty detectors.
4.4 Images incorrectly classified using faulty detectors.
5.1 Classifier input (labeled regions) and output (classification and confidence).
5.2 An image in which a region is mis-detected, creating contradictory spatial relationships in the material-based inference scheme.
5.3 Proposed DAG Bayesian network. Note that separate material and region layers are needed.
1 Introduction
Semantic scene classification, the process of categorizing images into semantic classes such as beaches, sunsets, or parties, is a useful endeavor. As humans, we can quickly determine the classification of a scene, even without recognizing every one of the details present. Even the gist of a scene is worth much in terms of communication.
1.1 Motivation
Automatic semantic classification of digital images finds many applications. We describe two major ones briefly: content-based image organization and retrieval (CBIR) and digital enhancement. With digital libraries growing in size so quickly, accurate and efficient techniques for CBIR become more and more important. Many current systems allow a user to specify an image and then search for images “similar” to it, where similarity is often defined only by color or texture properties. Because a score is computed on each image in the potentially large database, this approach is somewhat inefficient (though individual calculations vary in complexity). Furthermore, this so-called “query by example” has often proven to return
inadequate results [68]. Sometimes the match between the retrieved and the query images is hard to understand, while other times, the match is understandable, but contains no semantic value. For instance, with simple color features, a query for a rose can return a picture of a man wearing a red shirt, especially if the background colors are similar as well. Knowledge of the semantic category of a scene helps narrow the search space dramatically [37]. If the categories of the query image and the database images have been assigned (either manually or by an algorithm), they can be exploited to improve both efficiency and accuracy. For example, knowing what constitutes a party scene allows us to consider only potential party scenes in our search and thus helps to answer the query “find photos of Mary’s birthday party”. This way, search time is reduced, the hit rate is higher, and the false alarm rate is expected to be lower. Visual examples can be found in [76]. Knowledge about the scene category also finds application in digital enhancement [73]. Digital photofinishing processes involve three steps: digitizing the image if necessary (if the original source was film), applying enhancement algorithms, and outputting the image in either hardcopy or electronic form. Enhancement consists primarily of color balancing, exposure enhancement, and noise reduction. Currently, enhancement is generic (i.e. without knowledge of the scene content). Unfortunately, while a balancing algorithm might enhance the quality of some classes of pictures, it degrades others. Take color balancing as an example. Photographs captured under incandescent lighting without flash tend to be yellowish in color. Color balancing removes the yellow cast. However, applying the same color balancing to a sunset image (containing the same overall yellow color) destroys the desired brilliance (Figure 1.1). Other images that are negatively affected by color balancing are those containing skin-type colors. Correctly balanced skin colors are important to human
perception [64], and it is important to balance them. However, causing non-skin objects with similar colors to look like skin is a conspicuous error. Rather than applying generic color balancing and exposure adjustment to all images, knowledge of the scene’s semantic classification allows us to customize them to the scene. Following the example above, we could retain or boost sunset scenes’ brilliant colors while reducing a tungsten-illuminated indoor scene’s yellowish cast.

Figure 1.1: Content-ignorant color balancing can destroy the brilliance of sunset images, such as those pictured, which have the same global color distribution as indoor, incandescent-illuminated images.
1.2 The Problem of Scene Classification
On one hand, isn’t scene classification preceded by image understanding, the “holy grail” of vision? What makes us think we can achieve results? On the other hand, isn’t scene classification just an extension of object recognition, for which many techniques have been proposed with varying success? How is scene classification different from these two related fields?
1.2.1 Scene Classification vs. Full-scale Image Understanding
As usually defined, image understanding is the process of converting “pixels to predicates”: (iconic) image representations to another (symbolic) form of knowledge [2]. Image understanding is the highest (most abstract) processing level in computer vision [71], as opposed to image processing techniques, which convert one image representation to another. (For instance, using a mask to convert raw pixels to an edge image is much more concrete than identifying the expression on a person’s face in the image!) Lower-level image processing techniques such as segmentation are used to create regions that can then be identified as objects. Various control strategies can be used to order the processing steps [3]. The desired end result is for the vision system to support high-level reasoning about the objects and their relationships in order to meet a goal. While image understanding in unconstrained environments is still very much an open problem [71; 77], much progress is currently being made in scene classification. Because scenes can often be classified without full knowledge of every object in the image, the goal is not as ambitious. For instance, if a person recognizes trees at the top of a photo, grass on the bottom, and people in the middle, he may hypothesize that he is looking at a park scene, even if he cannot see every detail in the image. Or on a different level, if there are lots of sharp vertical and horizontal edges, he may be looking at an urban scene. It may be possible in some cases to use low-level information, such as color or texture, to classify some scene types accurately. In other cases, perhaps object recognition is necessary, but not necessarily of every object in the scene. In general, classification seems to be an easier problem than unconstrained image understanding; early results have confirmed this for certain scene types in constrained environments [74; 77]. Scene classification is a subset of the image
understanding problem, and can be used to ease other image understanding tasks [75]. For example, knowing that a scene is of a beach constrains where in the scene one should look for people. Obtaining image understanding in unconstrained environments is a lofty goal, and one worthy of pursuit. However, given the state of image understanding, we see semantic scene classification as a necessary stepping-stone in pursuit of the “grail”.
1.2.2 Scene Classification vs. Object Recognition
However, scene classification is a different beast than object recognition. Detection of rigid objects can rely upon geometrical relationships within the objects, and various techniques [21; 63] can be used to achieve invariance to affine transforms and changes in scene luminance. Detection of non-rigid objects is less constrained physically, since the relationships are looser [12]. Scene classification is even less constrained, since the components of a scene are varied. For instance, while humans might find it easy to recognize a scene of a child’s birthday party, the objects and people that populate the scene can vary widely, and the cues that determine the birthday scene class (such as special decorations, articles marked with the age of the child, and facial expressions on the attendees) can be subtle. Even the more obvious cues, like a birthday cake, may be difficult to detect. Again, the areas of scene classification and object recognition are related; knowing the identity of some of the scene’s objects will certainly help to classify the scene, while knowing the scene type affects the expected likelihood and location of the objects it contains.
1.3 Past Work in Scene Classification
Most of the current systems primarily use low-level features to classify scenes and achieve some success on constrained problems. These systems tend to be exemplar-based, in which features are extracted from images, and pattern recognition techniques are used to learn the statistics of a training set and to classify novel test images. Very few systems are model-based, in which the expected configuration of the objects in the scenes is specified by a human expert.
1.4 Statement of Thesis
The limited success of scene classification systems using low-level features forces us to look for other solutions. Currently, good semantic material detectors and object recognizers are available [70; 38; 63] and have begun to be successfully applied to scene classification [37]. However, the presence or absence of certain objects is not always enough to determine a scene type. Furthermore, object detection is still developing and is far from perfect. Faulty detection causes brittle rule-based systems to break. Our central claim is that spatial modeling of semantic objects and materials can be used to disambiguate certain scene types as well as to mitigate the effects of faulty detectors. Furthermore, an appropriate probabilistic inference mechanism must be developed to handle the loose spatial structure found in real images. Current research into spatial modeling relies on (fuzzy) logic and subgraph matching [44; 83]. While we have found no research that incorporates spatial modeling in a probabilistic framework, we argue that a probabilistic approach would be more appropriate. First, logic (even fuzzy variants) is not equipped to handle exceptions efficiently [50], a concern we address in more detail in Section 2.4. Second, semantic material detectors often yield belief in the materials. While
it is not obvious how to use belief values, it seems desirable to exploit the uncertainty in calculating the overall belief in each scene type. A nice side effect of true belief-based calculation is the ease with which a “don’t know” option can be added to the classifier: simply threshold the final belief value.

The first interesting problem is the appropriate choice of a probabilistic framework. Both Bayesian networks [66] and Markov Random Fields [30; 16; 15] have been applied to other vision problems in the past. We also propose to investigate the effects of spatial modeling. In our experimentation, we plan to compare the following (a toy sketch of options 1 and 2a appears below):

1) Baseline (no spatial relationships). Use naive Bayes classification rules using the presence or absence of materials only.

2a) Qualitative spatial relationships. Incorporate relations such as above or beside between regions. This would be appropriate for use in a Bayesian network framework.

2b) Quantitative spatial relationships. Use distance and direction between regions. This may potentially be more accurate, due to the increase in the information used, but requires more training data. These may work particularly well within a Markov Random Field framework.

One of our major foreseen contributions will be to validate the hypothesized gain due to spatial modeling.
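To make the comparison concrete, the following toy sketch contrasts option (1) with option (2a) under naive conditional-independence assumptions. The scene classes, material names, relation names, and all probability values are illustrative placeholders rather than statistics from our ground-truth data; it is a sketch of the idea, not the planned implementation.

```python
# Toy comparison of (1) material presence only vs. (2a) materials + qualitative relations.
# All probability tables are illustrative placeholders.

PRIORS = {"beach": 0.5, "urban": 0.5}

# P(material present | scene class)
P_MATERIAL = {
    "beach": {"sky": 0.9, "water": 0.8, "sand": 0.7, "buildings": 0.2},
    "urban": {"sky": 0.8, "water": 0.1, "sand": 0.05, "buildings": 0.9},
}

# P(relation holds for a detected pair | scene class), e.g. ("sky", "above", "water")
P_RELATION = {
    "beach": {("sky", "above", "water"): 0.95, ("water", "above", "sand"): 0.85},
    "urban": {("sky", "above", "water"): 0.50, ("water", "above", "sand"): 0.50},
}

def posterior(materials, relations=None):
    """Return P(class | evidence) assuming all evidence is conditionally independent."""
    scores = {}
    for scene, prior in PRIORS.items():
        p = prior
        for m, present in materials.items():
            pm = P_MATERIAL[scene][m]
            p *= pm if present else (1.0 - pm)
        for r in (relations or []):
            p *= P_RELATION[scene].get(r, 0.5)   # uninformative if the relation is unmodeled
        scores[scene] = p
    z = sum(scores.values())
    return {scene: s / z for scene, s in scores.items()}

materials = {"sky": True, "water": True, "sand": True, "buildings": False}
print(posterior(materials))                                # option (1): presence only
print(posterior(materials, [("sky", "above", "water")]))   # option (2a): with a relation
```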
1.4.1 Philosophy
The success of our approach seems to hinge on the strength of the underlying detectors. Consider two scenarios. First, if the detectors are reasonably accurate, then we can expect to overcome some faults using spatial relationships. However, if they are extremely weak, we would be left with a very ambitious goal: from a very pessimistic view of the world (loose spatial structure and weak detectors),
pull order from the chaos and accurately apply a discrete label to a configuration of objects. In this latter case, prior research suggests that the task is not promising. For instance, Selinger found that if an object could be recognized with moderate success using a single camera view, additional views could improve recognition substantially. However, if the single view gave weak detection, then multiple views could not redeem the problem. She states [62] (p. 106):

    The result is in concert with general expertise in the field of recognition concerning the difficulty of leveraging multiple sources of weak evidence into strong hypotheses.

Therefore, while we cannot expect to use our technique to classify scenes for which the detectors are completely inaccurate, we stand a reasonable chance of improving accuracy if the detectors themselves are reasonably strong.
1.5 Summary of Preliminary Work
We have performed our experiments in a simulated abstract world. The materials and spatial relationships used are based on statistics captured from real scenes. This provides us with a rich test bed, in which we can develop algorithms, compare approaches, quickly experiment with parameters, and explore “what-if” situations. Using a prototype scene simulator that we developed and a single-level, tree-structured Bayesian network for inference on a series of simulated natural scenes, we have shown that the presence of key materials can effectively distinguish certain scene types. However, spatial relationships are needed to disambiguate other types of scenes, achieving a gain of 7% in one case.
However, when simulating faulty detectors, we found that the network is not expressive enough to capture the necessary information, actually leading to lower accuracy when spatial relationships were used.
1.6 Organization of Proposal
Chapter 2 contains an overview of the relevant literature in scene classification, spatial modeling, and probabilistic frameworks. In Chapter 3, we describe our methodology, both for the detector and the simulator. Chapter 4 is a summary of our experiments and results (using best-case and faulty detectors). We conclude in Chapter 5, in which we describe our research plan and proposed contributions.
2 Related Work
Scene classification is a young, emerging field. The first section of this chapter is taken in large part from our earlier review of the state of the art in scene classification [6]; because this thesis is a work in progress, there is much overlap between the two. Here we focus our attention on systems using approaches directly related to our proposed thesis. Readers desiring a more comprehensive survey or more detail are referred to the original review. All systems classifying scenes must extract appropriate features and use some sort of learning or inference engine to classify the image. We start by outlining the options available for features and classifiers. We then present a number of systems which we have deemed to be good representations of the field. We augment our review of the literature by discussing two computational models of spatial relationships and then discussing in detail two graphical models we could use for probabilistic inference: Bayesian Networks and Markov Random Fields, each of which will be explored in the thesis.
2.1 Design Space of Scene Classification
The literature reveals two approaches to scene classification: exemplar-based and model-based. On one hand, exemplar-based approaches use pattern recognition techniques on vectors of low-level image features (such as color, texture, or edges) or semantic features (such as sky, faces, or grass). The exemplars are thought to fall into clusters, which can then be used to classify novel test images, using an appropriate distance metric. Most systems use an exemplar-based approach, perhaps due to recent advances in pattern recognition techniques. On the other hand, model-based approaches are designed using expert knowledge of the scene, such as its expected configuration. A scene’s configuration is the layout (relative location and sizes) of its objects. While it seems as though this should be very important, very little research has been done in this area. In either case, appropriate features must be extracted for accurate classification. What makes a certain feature appropriate for a given task? For pattern classification, one wants the inter-class distances to be maximized and the intra-class distances to be minimized. Many choices are intuitive, e.g. edge features should help separate city and landscape scenes [78].
2.1.1 Features
In our review [6], we described features we found in similar systems, or which we thought could be potentially useful. Table 2.1 is a summary of that set of descriptions.
2.1.2 Learning and Inference Engines
Pattern recognition systems classify samples represented by feature vectors (see a good review in [28]). Features are extracted from each of a set of training images,
Table 2.1: Options for features to use in scene classification.

  Feature               Description
  Color                 Histograms [72], Coherence vectors [49], Moments [79]
  Texture [51]          Wavelets [42; 65], MSAR [73], Fractal dimension [71]
  Filter Output         Fourier & discrete cosine transforms [46; 73; 74; 75; 77], Gabor [59], Spatio-temporal [53]
  Edges                 Direction histograms [77], Direction coherence vectors
  Context Patch         Dominant edge with neighboring edges [63]
  Object Geometry       Area, Eccentricity, Orientation [10]
  Object Detection      Output from belief-based material detectors [37; 66], rigid object detectors [63], face detectors [60]
  IU Output             Output of other image understanding systems, e.g., Main Subject Detection [66]
  Context               Within images (scale, focus, pose) [75]; between images (adjacent images on film or video)
  Mid-level             "Spatial envelope" features [47]
  Meta-data             Time-stamp, Flash firing, Focal length, text [36; 24]
  Statistical Measures  Dimensionality reduction [18; 58]
or exemplars. In most classifiers, a statistical inference engine then extracts information from the processed training data. Finally, to classify a novel test image, this type of system extracts the same features from the test image and compares them to those in the training set [18]. This exemplar-based approach is used by most of the current systems. The classifiers used in these types of systems differ in how they extract information from the training data. In Table 2.2, we present a summary of the major classifiers used in the realm of scene classification.
2.2 Scene Classification Systems
As stated, many of the systems proposed in the literature for scene classification are exemplar-based, but a few are model-based, relying on expert knowledge to model scene types, usually in terms of the expected configuration of objects in the scene. In this section, we describe briefly some of these systems and point out some of their limitations en route to differentiating our proposed method. We organize the systems by feature type and by their use of spatial information, as shown in Table 2.3. Features are grouped into low-level, mid-level, and high-level (semantic) features, while systems are divided into those that model the spatial relationships explicitly in the inference stage and those that do not.
2.2.1 Low-level Features and Implicit Spatial Relationships
A number of researchers have used low-level features sampled at regular spatial locations (e.g. blocks in a rectangular grid). In this way, spatial features are encoded implicitly, since the features computed on each location are mapped to fixed dimensions in the feature space.
Table 2.2: Potential classifiers to use in scene classification.

  1-Nearest-Neighbor (1NN): Classifies a test sample with the same class as the exemplar closest to it in the feature space.
  K-Nearest-Neighbor (kNN) [18]: Generalization of 1NN in which the sample is given the label of the majority of the k closest exemplars.
  Learning Vector Quantization (LVQ) [31; 32]: A representative set of exemplars, called a codebook, is extracted. The codebook size and learning rate must be chosen in advance.
  Maximum a Posteriori (MAP) [77]: Combines the class likelihoods (which must be modeled, e.g., with a mixture of Gaussians) with class priors using Bayes rule.
  Support Vector Machine (SVM) [8; 61]: Finds an optimal hyperplane separating two classes. Maps data into higher dimensions, using a kernel function, to increase separability. The kernel and associated parameters must be chosen in advance.
  Artificial Neural Networks (ANN) [1]: Function approximators in which the inputs are mapped, through a series of linear combinations and non-linear activation functions, to outputs. The weights are learned using a technique called backpropagation.
Table 2.3: Related work in scene classification, organized by feature type and use of spatial information.

  Low-level features, implicit/no spatial information: Vailaya, et al.; Oliva, et al.; Szummer & Picard; Serrano, et al.; Paek & Chang; Carson, et al.; Wang, et al.
  Low-level features, explicit spatial information: Lipson, et al.; Ratan & Grimson; Smith & Li
  Mid-level features, implicit spatial information: Oliva, et al.
  High-level (semantic) features, no spatial information: Luo, et al.; Song & Zhang
  High-level (semantic) features, explicit spatial information: Mulhem, et al.; Proposed Method
The problems addressed include indoor vs. outdoor classification (Szummer and Picard [73], Paek and Chang [48], and Serrano et al. [65]), outdoor scene classification (Vailaya et al. [77]), and image orientation detection [79; 80].1 The indoor vs. outdoor classifiers’ accuracy approaches 90% on tough (e.g., consumer) image sets. On the outdoor scene classification problem, mid-90% accuracy is reported. This may be due to the use of constrained data sets (e.g. from the Corel stock photo library), because on less constrained (e.g., consumer) image sets, we found the results to be lower. The generalizability of the technique is also called into question by the discrepancies in the numbers reported for image orientation detection by some of the same researchers [79; 80].

1 While image orientation detection is a different level of semantic classification, many of the techniques used are similar.
Pseudo-Object Features

The Blobworld system, developed at Berkeley, was designed primarily for content-based indexing and retrieval. However, it is applied to the scene classification problem in [9]. The researchers segment the image and use statistics computed for each region (e.g., color, texture, location with respect to a 3 × 3 grid) without performing object recognition. Admittedly, this is a more general approach for scene types containing no recognizable objects. However, we can hope for more using object recognition. Finally, a maximum likelihood classifier performs the classification.

Wang’s SIMPLIcity (Semantics-sensitive Integrated Matching for Picture LIbraries) system [80] also uses segmentation to match pseudo-objects. The system uses a fuzzy method called “Integrated Region Matching” to effectively compensate for potentially poor segmentation, allowing a region in one image to match with several in another image. However, spatial relationships between regions are not used and the framework is used only for CBIR, not scene classification.
2.2.2 Low-level Features and Explicit Spatial Relationships
The systems above either ignore spatial information or encode it implicitly using a feature vector. However, other bodies of research imply that spatial information is valuable and should be encoded explicitly and used by the inference engine. In this section, we review this body of research, describing a number of systems using spatial information to model the expected configuration of the scene.

Configural Recognition

Lipson, Grimson, and Sinha at MIT use an approach they call “configural recognition” [34; 35], using relative spatial and color relationships between pixels in low-resolution images to match the images with class models.
The specific features extracted are very simple. The image is smoothed and subsampled at a low resolution (ranging from 8 × 8 to 32 × 32). Each pixel represents the average color of a block in the original image; no segmentation is performed. For each pixel, only its luminance, RGB values, and position are extracted. The hand-crafted models are also extremely simple. For example, a template for a snowy mountain image is a blue region over a white region over a dark region; one for a field image is a large bluish region over a large greener region. In general, the model contains relative x- and y-coordinates, relative R-, G-, B-, and luminance values, and relative sizes of regions in the image. The matching process uses the relative values of the colors in an attempt to achieve illumination invariance. Furthermore, using relative positions mimics the performance of a deformable template: as the model is compared to the image, the model can be deformed by moving the patch around so that it best matches the image. A model-image match occurs if any one configuration of the model matches the image. However, this criterion may be extended to include the degree of deformation and multiple matches depending on how well the model is expected to match the scene. Classification is binary for each classifier. On a test set containing 700 professional images (the Corel Fields, Sunsets and Sunrises, Glaciers and Mountains, Coasts, California Coasts, Waterfalls, and Lakes and Rivers CDs), the authors report recall using four classifiers: fields (80%), snowy mountains (75%), snowy mountains with lakes (67%), and waterfalls (33%). Unfortunately, exact precision numbers cannot be calculated from the results given. The strength of the system lies in the flexibility of the template, in terms of both luminance and position. However, one limitation the authors state is that each class model captured only a narrow band of images within the class and that multiple models were needed to span a class.
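To give a feel for the flavor of such a model, the sketch below subsamples an image to a coarse grid and tests a hand-built "bluish band over brighter band over darker band" template of the kind described for snowy mountains. The band boundaries and color tests are invented placeholders, and the deformable, relative-position matching of the actual system is not reproduced here; this is only an illustration of relative color and position checks on low-resolution patches.

```python
import numpy as np

def subsample(image, grid=8):
    """Average an H x W x 3 RGB array down to a grid x grid array of mean block colors."""
    h, w, _ = image.shape
    out = np.zeros((grid, grid, 3))
    for i in range(grid):
        for j in range(grid):
            block = image[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            out[i, j] = block.reshape(-1, 3).mean(axis=0)
    return out

def matches_snowy_mountain(image):
    """Toy configural test: a bluish band over a brighter band over a darker band."""
    g = subsample(image, grid=8)
    top = g[:3].reshape(-1, 3).mean(axis=0)    # mean RGB of the top rows (candidate sky)
    mid = g[3:5].reshape(-1, 3).mean(axis=0)   # middle rows (candidate snow)
    bot = g[5:].reshape(-1, 3).mean(axis=0)    # bottom rows (candidate foreground)
    top_bluish = top[2] > top[0] and top[2] > top[1]              # blue channel dominates
    mid_brightest = mid.mean() > top.mean() and mid.mean() > bot.mean()
    bot_darkest = bot.mean() < top.mean() and bot.mean() < mid.mean()
    return bool(top_bluish and mid_brightest and bot_darkest)

rng = np.random.default_rng(0)
print(matches_snowy_mountain(rng.random((240, 320, 3))))  # random image: almost surely False
```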
Learning the Model Parameters

In a follow-up study, Ratan and Grimson [54] used the same model, but learned the model parameters from exemplars. They reported similar results to the hand-crafted models used by Lipson. However, the method was computationally expensive [83].

Combining Configurations with Statistical Learning

In another variation on the previous research, Yu and Grimson adapt the configural approach to a statistical, feature-vector-based approach, treating configurations like words appearing in a document [83]. Set representations, e.g. attributed graphs, contain parts and relations. In this framework, the configurations of relative brightness, positions, and sizes are subgraphs. However, inference is computationally costly. Vector representations allow for efficient learning of visual concepts (using the rich theory of supervised learning). Encoding configural information in the features overcomes the limited ability of vector representations to preserve relevant information about spatial layout [83].

Within a CBIR framework with two query images, configurations are extracted as follows. Because configurations contained in both images are most informative, an extension of the maximum clique method is used to extract common subgraphs from the two images. The essence of the method is that configurations are grown from the best matching pairs (e.g., highly contrasting regions) in each image. During the query process, the common configurations are broken into smaller parts and converted to a vector format, in which feature i corresponds to the probability that sub-configuration i is present in the image. A naive (i.e., single-level, tree-structured) Bayesian network is trained on-line for image retrieval. A set of query images is used for training, with likelihood parameters estimated by EM. Database images are then retrieved in order of their posterior probability. On a subset of 1000 Corel images, a single waterfall query is shown to have
better retrieval performance than other measures such as color histograms, wavelet coefficients, and Gabor filter outputs. Note that the spatial information is explicitly encoded in the features, but is not used directly in the inference process. In the subgraph extraction process above, if extracting a common configuration from more than two images is desired, one can use Hong, et al.’s method [26].
Composite Region Templates (CRT)

CRTs are configurations of segmented image regions [69]. The configurations are limited to those occurring in the vertical direction: each vertical column is stored as a region string, and statistics are computed for various sequences occurring in the strings. While this is an interesting approach, one unfortunate limitation of the experimental work is that the sizes of the training and testing sets were both extremely limited.
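As a small illustration of the CRT idea, the sketch below reduces a segmented label grid to one top-to-bottom region string per column; the 4 x 4 label grid is a made-up example, not data from [69], and the statistics collected over such strings are omitted.

```python
import numpy as np

def region_strings(label_grid):
    """Return one top-to-bottom region string per column, with consecutive repeats collapsed."""
    strings = []
    for col in np.asarray(label_grid).T:        # walk each vertical column
        s = [col[0]]
        for label in col[1:]:
            if label != s[-1]:                  # collapse runs of the same region label
                s.append(label)
        strings.append(tuple(s))
    return strings

grid = [["sky",   "sky",   "sky",   "sky"],
        ["sky",   "sky",   "water", "water"],
        ["water", "water", "water", "water"],
        ["sand",  "sand",  "sand",  "sand"]]
print(region_strings(grid))   # e.g. ('sky', 'water', 'sand') for each column
```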
2.2.3 Mid-level Features and Implicit Spatial Relationships
20
their features are highly correlated with the semantic categories of the images (e.g., “highway” scenes are “open” and exhibit high “expansion”), demonstrating some success on their set of images. It is unclear how their results generalize. One observation they make is that their scene-centered approach is complementary to an “object-centered” approach like ours.
2.2.4 Semantic Features without Spatial Relationships
Semantic Features for Indoor vs. Outdoor Classification

Luo and Savakis extended the method of [65] by incorporating semantic material detection [37]. A Bayesian network was trained for inference, with evidence coming from low-level (color, texture) features and semantic (sky, grass) features. Detected semantic features (which are not completely accurate) produced a gain of over 2%, and "best-case" (100% accurate) semantics gave a gain of almost 8% over low-level features alone. The network used conditional probabilities of the form P(sky present | outdoor). While this work showed the advantage of using semantic material detection for certain types of scene classification, it stopped short of using spatial relationships.

Semantic Features for Image Retrieval

Song and Zhang investigate the use of semantic features within the context of image retrieval [70]. Their results are impressive, showing that semantic features greatly outperform typical low-level features, including color histograms, color coherence vectors, and wavelet texture, for retrieval. They use the illumination topology of images (using a variant of contour trees) to identify image regions and combine this with other features to classify the regions into semantic categories such as sky, water, trees, waves, placid water, lawn, and snow.
While they do not apply their work directly to scene classification, their success with semantic features confirms our hypothesis that they help bridge the semantic gap between pixel-representations and high-level understanding.
2.2.5 Semantic Features and Explicit Spatial Relationships
Mulhem, Leow, and Lee [44] present a novel variation of fuzzy conceptual graphs for use in scene classification. Conceptual graphs are used for representing knowledge in logic-based applications, since they can be converted to expressions of first-order logic. Fuzzy conceptual graphs extend this by adding a method of handling uncertainty. A fuzzy conceptual graph is composed of three elements: a set of concepts (e.g., "mountain" or "tree"), a set of relations (e.g., "smaller than" or "above"), and a set of relation attributes (e.g., the "ratio" of two sizes). Any of these elements which contains multiple possibilities is called fuzzy, while one which does not is called crisp. Model graphs for prototypical scenes are hand-crafted, and contain crisp concepts and fuzzy relations and attributes. For example, a "mountain-over-lake" scene must contain a mountain and water, but the spatial relations are not guaranteed to hold. A fuzzy relation such as "smaller than" may hold most of the time, but not always. Image graphs contain fuzzy concepts and crisp relations and attributes. This is intuitive: while a material detector calculates the boundaries of objects and can therefore calculate relations (e.g. "to the left of") between them, it can be uncertain as to the actual classification of the material (consider the difficulty of distinguishing between cloudy sky and snow, or rock and sand). The ability to handle uncertainty on the part of the material detectors is an advantage of this framework.
Two subgraphs are matched using graph projection, a mapping such that each part of a subgraph of the model graph exists in the image graph, and a metric for linearly combining the strength of match between concepts, relations, and attributes. A subgraph isomorphism algorithm is used to find the subgraph of the model that best matches the image. The basic idea of the algorithm is to decompose the model and image into arches (two concepts connected by a relation), seed a subgraph with the best matching pair of arches, and incrementally add other model arches that match well. They found that the image matching metric worked well on a small database of two hundred images and four scene models (of mountain/lake scenes) generated by hand. Fuzzy classification of materials was done using color histograms and Gabor texture features. The method of generating confidence levels of the classification is not specified. While the results look promising for mountain/lake scenes, it remains to be seen how well this approach will scale to a larger number of scene types.
2.2.6 Summary of Scene Classification Systems
Referring back to the summary of prior work in semantic scene classification given in Table 2.3, we see that our work is closest to that of Mulhem, et al., but differs in one key aspect: while theirs is logic-based, our proposed method is founded upon probability theory, leading to principled methods of handling variability in scene configurations. Our proposed method also learns the model parameters from a set of training data, while theirs are fixed.
2.3 Options for Computing Spatial Relationships
If we are to utilize spatial relationships between regions, we need a method for computing and encoding these relationships. We discuss both qualitative and quantitative spatial relationships.
2.3.1 Computing Qualitative Spatial Relationships
We start by considering three models of computing qualitative spatial relationships: Attentional Vector Sum, a biologically-inspired model; Weighted Walkthroughs, a model developed for occluded regions; and a hybrid model produced for efficient computation.
Attentional Vector Sum

Regier and Carlson [55] propose a computational model of spatial relations based on human perception. They consider up, down, left, and right to be symmetric, and so focus their work on the above relation. They call the reference object the landmark and the located object the trajector. For example, "the ball (trajector) is above the table (landmark)". The model is designed to handle 2D landmarks, but only point trajectors. However, the researchers state that they are in the process of extending the model. Four models are compared:

1. Bounding box (BB). The trajector is above the landmark if it is higher than the landmark's highest point and between its leftmost and rightmost points. The strength of the match varies depending on the height and how centered the trajector is above the object; three parameters govern how quickly the match drops off, via sigmoidal functions.
2. Proximal and Center-of-Mass (PC). Here, the projection is defined based on the angle formed by the y-axis and the line connecting the trajector to the landmark. Connecting it to the closest point on the landmark gives the proximal angle, and to the centroid gives the center-of-mass angle. This model has four parameters: a gain, slope, and y-intercept of the piecewise function for the goodness of the angle, and the relative weight given to the components corresponding to the two angles.

3. Hybrid Model (PC-BB). This model extends the PC model by adding the BB model's height term. The height term gives the presence of a "grazing line" at the top of the landmark, an effect that was observed experimentally. The model has four parameters: the slope, y-intercept, and relative weight of the PC model plus the gain on the height function's sigmoid.

4. Attentional Vector Sum (AVS). This model incorporates two human perceptual elements:

• Attention. Visual search for a target in a field of distractors is slow when targets differ from distractors only in the spatial relation among their elements (i.e., they do not exhibit "pop-out"). Therefore, they require attention.

• Vector sum. Studies of orientation cells in the monkey cortex show that directions were modeled by a vector sum of the cells.

In the AVS model, the angle between the landmark and the trajector is calculated as the weighted sum of angles between the points in the landmark area and the trajector. The weights in the sum are related to attention. The center of attention on the landmark is the point closest to the trajector; its angle receives the most weight. As the landmark points get further from the center of attention, they are weighted less, dropping off exponentially.
Lastly, the BB model's height function is used again (for which they give no physiological or perceptual basis; it is included only because the effect was observed experimentally). The model has four parameters: the width of the beam of attention, the slope and y-intercept of the linear function relating angle to match strength (as in the PC model), and the gain on the height function's sigmoid. Optimal parameters for each model were found by fitting the model to another researcher's data set. A series of experiments was then performed to distinguish between the models. The AVS model fit each experiment's data most consistently. The AVS method also gives a measure of "how above" one region is compared to another. This measure may potentially be used to our advantage. However, as given, AVS may be too computationally expensive. If n_i is the number of points in region i, finding the perimeter points of each region is O(n_1 + n_2), giving p perimeter points. Finding the closest point between landmark and trajector then requires O(p) distance computations, while integrating over each region requires O(n_1 × n_2) distances. However, if using a larger step size would not substantially reduce accuracy, we could reduce the computation significantly.
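To make the computation concrete, here is a rough sketch of the attentional vector-sum direction for a point trajector. It works directly on landmark points rather than perimeter points, uses an arbitrary attention width (sigma), omits the height ("grazing line") term, and maps deviation from vertical to an "aboveness" score with a crude linear function; none of these choices reproduce Regier and Carlson's actual parameters.

```python
import numpy as np

def avs_aboveness(landmark_pts, trajector, sigma=10.0):
    """Attention-weighted vector sum from landmark points to a point trajector.

    landmark_pts: (N, 2) array of (x, y) points; trajector: (x, y); y increases upward.
    Weights decay exponentially with distance from the attentional center
    (the landmark point closest to the trajector). Returns a rough aboveness score.
    """
    pts = np.asarray(landmark_pts, dtype=float)
    t = np.asarray(trajector, dtype=float)
    d = np.linalg.norm(pts - t, axis=1)
    focus = pts[d.argmin()]                            # center of attention
    w = np.exp(-np.linalg.norm(pts - focus, axis=1) / sigma)
    vectors = (t - pts) * w[:, None]                   # attention-weighted vectors to trajector
    vx, vy = vectors.sum(axis=0)
    angle = np.degrees(np.arctan2(vx, vy))             # deviation from straight up (0 = directly above)
    return max(0.0, 1.0 - abs(angle) / 90.0)           # crude linear mapping to a score

square = np.array([(x, y) for x in range(10) for y in range(10)])  # 10 x 10 landmark
print(avs_aboveness(square, trajector=(5.0, 25.0)))    # trajector well above: score near 1
```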
Weighted Walkthroughs

Berretti et al. [5] developed a technique named "weighted walkthroughs" to calculate spatial relations. The method is designed to compare segmented regions created by color backpropagation, and therefore has the advantage of handling landmarks or trajectors that are made of multiple regions. This may be important in natural images, where large regions are sometimes occluded. The method is straightforward: consider two regions A and B. All pairs of points (a, b) in the set S = {(a, b) | a ∈ A, b ∈ B} are compared (a "walkthrough" of
each of the regions). For some pairs, A will lie northeast (NE) of B; for others, A will lie SE, and so on for the four quadrants. The fraction of pairs falling in each of the four quadrants is computed, giving four weights: w_NE, w_NW, w_SE, and w_SW. Finally, these can be converted to above/below, left/right, and diagonality features: above = w_NE + w_NW, right = w_NE + w_SE, and diagonality = w_NE + w_SW. One advantage of this method is its ability to handle 2D, occluded (i.e., disconnected) landmarks and trajectors.
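A minimal sketch of the weight computation follows, assuming two regions given as (x, y) point sets with y increasing upward; the handling of exactly aligned pairs and the example regions are our own simplifications, not Berretti et al.'s formulation.

```python
import numpy as np

def weighted_walkthroughs(region_a, region_b):
    """Fractions of point pairs (a, b) for which a lies NE, NW, SE, SW of b.

    region_a, region_b: (N, 2) and (M, 2) arrays of (x, y) coordinates, y up.
    O(N*M); subsampling the regions keeps this tractable in practice.
    """
    a = np.asarray(region_a, dtype=float)
    b = np.asarray(region_b, dtype=float)
    dx = a[:, 0][:, None] - b[:, 0][None, :]   # x_a - x_b for every pair
    dy = a[:, 1][:, None] - b[:, 1][None, :]   # y_a - y_b for every pair
    ne = np.mean((dx > 0) & (dy > 0))
    nw = np.mean((dx <= 0) & (dy > 0))
    se = np.mean((dx > 0) & (dy <= 0))
    sw = np.mean((dx <= 0) & (dy <= 0))
    return {"NE": ne, "NW": nw, "SE": se, "SW": sw,
            "above": ne + nw, "right": ne + se, "diagonality": ne + sw}

sky = np.array([(x, y) for x in range(20) for y in range(10, 15)])
sea = np.array([(x, y) for x in range(20) for y in range(0, 5)])
print(weighted_walkthroughs(sky, sea))   # sky scores 1.0 on "above" relative to sea
```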
Hybrid Approach

In their research, Luo and Zhu [40] use a hybrid approach, combining the bounding box and weighted walkthrough methods. The method was designed for modeling spatial relations between materials in natural scenes, and so favors above/below calculations. It skips weighted walkthroughs when object bounding boxes do not overlap. It does not handle some obscure cases correctly, but is fast and accurate when used in practice.

The final decision of which spatial relationship model would be most appropriate for my work depends in large part on whether the AVS method can be extended to 2D trajectors while being made computationally tractable. One answer may be to work on a higher conceptual level than individual pixels.
2.3.2 Computing Quantitative Spatial Relationships
While computationally more expensive and possibly too sensitive, more detailed spatial information may be necessary to distinguish some scene types. Rather than just encoding the direction (such as “above”) in our knowledge framework, we could incorporate a direction and a distance. For example, Rimey encoded spatial relationships using an expected area net [56].
In the limit, we may wish to model the class-conditional spatial distributions as continuous variables. There is a body of literature addressing the issue of efficient inference of continuous variables in graphical models (e.g. Bayes Nets), if the distributions are assumed Gaussian (e.g., [33]). Felzenszwalb and Huttenlocher [19] used Gaussian models in their object recognition system, which we will review in the next section. Another option, if a lattice-structured Markov Random Field framework is used (also discussed in the next section), would be to use the spatial relationships that arise from the lattice structure.
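For instance, a minimal quantitative descriptor of the relation between two regions might be the centroid-to-centroid distance and direction, which could then be modeled class-conditionally (e.g., with a Gaussian). The centroid summary and sample regions below are our own illustration, not Rimey's expected-area nets or Felzenszwalb and Huttenlocher's models.

```python
import numpy as np

def quantitative_relation(region_a, region_b):
    """Centroid-to-centroid distance and direction between two regions.

    Regions are (N, 2) arrays of (x, y) coordinates with y increasing upward.
    Direction is in degrees counterclockwise from the +x axis.
    """
    ca = np.asarray(region_a, dtype=float).mean(axis=0)
    cb = np.asarray(region_b, dtype=float).mean(axis=0)
    dx, dy = cb - ca
    return {"distance": float(np.hypot(dx, dy)),
            "direction_deg": float(np.degrees(np.arctan2(dy, dx)))}

sky = np.array([(x, y) for x in range(20) for y in range(10, 15)])
sand = np.array([(x, y) for x in range(20) for y in range(0, 5)])
print(quantitative_relation(sand, sky))   # direction near +90 degrees: sky lies above sand
```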
2.4 Probabilistic Graphical Models
Early research involving spatial relationships between objects used logic [2]. This approach was not particularly successful on natural scenes: while logic is certain, life is uncertain. In an attempt to overcome this limitation, more recent work has extended the framework to use fuzzy logic [44]. However, Pearl [50] argues that logic cannot be extended to the uncertainty of life, where many rules have exceptions. The rules of Boolean logic contain no method of combining exceptions. Furthermore, logical inference proceeds in stages, allowing for efficient computation; we would like to handle uncertain evidence incrementally as well. But unless one makes strict independence assumptions, this is impossible with logic, and computing the effect of evidence in one global step is likewise impossible. Logic-based (syntactic or rule-based) systems combine beliefs numerically. The uncertainty of a formula is calculated as a combination of the uncertainties of the sub-formulas. Computationally, this approach mirrors the process of logical inference, leading to an efficient, modular scheme. Rules can be combined regardless of other rules and regardless of how each rule was derived. Semantically, these
assumptions are too strong except under the strictest independence conditions. They cause the following semantic problems:

1. Bidirectional inferences. Semantically, if we have the two statements fire → smoke and smoke, then fire should become more plausible. However, in a rule-based system, this would introduce a loop.

2. Limits of modularity. Consider the rules alarm → burglar and alarm → earthquake. Due to the modular nature of logic, if alarm becomes more plausible, then burglar should become more plausible as well. However, using plausible reasoning, if we add evidence for earthquake, then alarm becomes more plausible and burglar becomes less plausible, which corresponds with human intuition.

3. Correlated evidence. The rules of logic cannot handle multiple pieces of evidence originating from a single source. As an example, one should not independently increase the belief in an event based on many local news stories that merely echo the Associated Press.

Some attempts have been made to overcome this last limitation, such as bounds propagation or user-defined combination; however, each approach introduces further difficulties. We are fully aware that there is not universal agreement with Pearl philosophically regarding the superiority of probability over logic. (Witness the heated rebuttals to Cheeseman's argument for probability [14] by the logic community!) Still, we think his arguments are sound.

Specifically, Pearl argues for a graphical model-based approach founded on probability calculus. While he elaborated on Bayesian networks in [50], we also consider Markov Random Fields (MRF), another probabilistic graphical model
that has been used primarily for low-level vision problems (finding boundaries, growing regions), but has recently been used for object detection as well.

In general, graphical probability models provide a distinct advantage in problems of inference and learning: they make statistical independence assumptions explicit. In a graphical model, nodes represent random variables and edges represent dependencies between those variables. Ideally, nodes are connected by an edge if and only if their variables are directly dependent; however, many models only capture one direction of this implication. Sparse graphs, in particular, benefit from the message-passing algorithms used to propagate evidence around the network. While calculating a joint probability distribution takes exponential space in general (and marginals are difficult to calculate), these calculations are much cheaper in certain types of graphs, as we will see.
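For reference, the factorizations encoded by these two families of models are standard (textbook material rather than anything specific to this proposal). A Bayesian network over variables $x_1, \ldots, x_n$ with parent sets $\mathrm{pa}(x_i)$ represents

$$P(x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr),$$

while an MRF factors over cliques $c$ with potentials $\psi_c$ and partition function $Z$:

$$P(x_1, \ldots, x_n) \;=\; \frac{1}{Z} \prod_{c} \psi_c(\mathbf{x}_c).$$

Sparseness pays off because each conditional table or clique potential then involves only a small set of variables.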
2.4.1 Bayesian Networks
Bayesian (or belief ) networks are used to model causal probabilistic relationships [13] between a system of random variables. The causal relationships are represented by a directed acyclic graph (DAG) in which each link connects a cause (the “parent” node) to an effect (the “child” node). The strength of the link between the two is represented as the conditional probability of the child given the parent. The directed nature of the graph allows conditional independence to be specified; in particular, a node is conditionally independent of all of its non-successors, given its parent(s). The independence assumptions allow the joint probability distribution of all of the variables in the system to be specified in a simplified manner, particularly if the graph is sparse. Specifically, the network consists of four parts, as follows [66]:
• Prior probabilities are the initial beliefs about the root node(s) in the network when no evidence is presented. • Each node has a conditional probability matrix (CPM) associated with it, representing the causality between the node and its parents. These can be assigned by an expert or learned from data. • Evidence is the input presented to the network. A node can be instantiated (by setting the belief in one of its hypotheses to 1) or given fractional (uncertain) beliefs (via virtual evidence [50]). • Posteriors are the output of the network. Their values are calculated from the product of priors and likelihoods arising from the evidence (as in Bayes' Rule). The expressive power, inference schemes, and associated computational complexity all depend greatly on the density and topology of the graph. We discuss three categories: tree, polytree, and general DAG.
Trees If the graph is tree-structured, with each node having exactly one parent node, each node’s exact posterior belief can be calculated quickly and in a distributed fashion using a simple message-passing scheme. Feedback is avoided by separating causal and diagnostic (evidential) support for each variable using top-down and bottom-up propagation of messages, respectively. The message-passing algorithm for tree-structured Bayesian networks is simple and allows for inference in polynomial time. However, its expressive power is somewhat limited because each effect can have only a single cause. In human reasoning, effects can have multiple potential causes that are weighed against one another as independent variables [50].
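To make the tree case concrete, the following is a minimal sketch (in Python, with invented class names and probabilities) of exact inference in the simplest such tree: a root class node whose children are conditionally independent given the class, so that evidence at the leaves enters the root posterior as a product of likelihood messages.

import numpy as np

# Hypothetical two-class root with two material children; numbers are illustrative only.
prior = np.array([0.5, 0.5])                    # P(class), classes = (B, U)
cpm_sky   = np.array([[0.95, 0.05],             # P(sky present/absent | class = B)
                      [0.90, 0.10]])            # P(sky present/absent | class = U)
cpm_water = np.array([[0.95, 0.05],
                      [0.05, 0.95]])

def root_posterior(prior, cpms, observations):
    """Exact posterior at the root of a tree whose leaves are conditionally
    independent given the root. observations[i] is the observed state (column)
    of child i; each child's CPM has one row per root state."""
    likelihood = np.ones_like(prior)
    for cpm, obs in zip(cpms, observations):
        likelihood *= cpm[:, obs]               # lambda message from that leaf
    post = prior * likelihood
    return post / post.sum()

# Sky present (state 0) and water present (state 0):
print(root_posterior(prior, [cpm_sky, cpm_water], [0, 0]))   # -> roughly [0.95, 0.05]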
Causal Polytrees A polytree is a singly-connected graph (one whose underlying undirected graph is acyclic). Polytrees are a generalization of trees that allow for effects to have multiple causes. The message-passing schemes for trees generalize to polytrees, and exact posterior beliefs can be calculated. One drawback is that each variable is conditioned on the combination of its parents’ values. Estimating the values in the conditional probability matrix may be difficult because its size is exponential in the number of parent nodes. Large numbers of parents for a node can induce considerable computational complexity, since the message involves a summation over each combination of parent values. Models for multicausal interactions, such as the noisy-OR gate, have been developed to solve this problem. They are modeled after human reasoning and reduce the complexity of the messages from a node to O(p), linear in the number of its parents. The messages in the noisy-OR gate model can be computed in closed form (see [50]). A nice summary of the inference processes for trees and polytrees given in [50] can be found in [66].
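The following hedged sketch illustrates why the noisy-OR gate needs only one parameter per parent: each active parent independently fails to produce the effect with its own inhibition probability q_i, so an entry of the conditional probability table can be computed on demand rather than stored. The numbers are invented.

def noisy_or(parent_states, q):
    """Noisy-OR gate: P(effect = true | parents).
    parent_states: 0/1 value for each parent cause.
    q[i]: probability that an active parent i alone fails to produce the effect."""
    p_all_inhibited = 1.0
    for active, qi in zip(parent_states, q):
        if active:
            p_all_inhibited *= qi
    return 1.0 - p_all_inhibited

# Three hypothetical causes with individual inhibition probabilities:
q = [0.2, 0.5, 0.9]
print(noisy_or([1, 0, 1], q))   # 1 - 0.2 * 0.9 = 0.82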
General Directed Acyclic Graphs The most general case is a DAG that contains undirected loops. While a DAG cannot contain a directed cycle, its underlying undirected graph may contain a cycle, as shown in Figure 2.1. Loops cause problems for Bayesian networks, both architectural and semantic. First, the message passing algorithm fails, since messages may cycle around the loop. Second, the posterior probabilities may not be correct, since the conditional independence assumption is violated. In Figure 2.1, variables B and C may be
conditionally independent given their common parent A, but messages passed through D from B will also (incorrectly) affect the belief in C. There exist a number of methods for coping with loops [50]. Two methods, clustering and conditioning, are tractable only for sparse graphs. Another method, stochastic simulation, involves sampling the Bayesian network; we use a simple top-down version, called logic sampling, as a generative model and describe it in Section 3.2.2. However, it is inefficient in the face of instantiated evidence, since it involves rejecting each sample that does not agree with the evidence. Finally, the methods of belief propagation and generalized belief propagation, in which the loops are simply ignored, have been applied with success in many cases [82] and are worth further investigation. We discuss these methods in the context of MRFs in the next section.

Figure 2.1: A Bayes net with a loop.
Applications of Bayesian Networks In computer vision, Bayesian networks have been used in many applications including indoor vs. outdoor image classification [37; 48], main subject detection [66], and control of selective perception [57]. An advantage of Bayesian networks
is that they are able to fuse different types of sensory data (e.g. low-level and semantic features) in a well-founded manner.
2.4.2 Markov Random Fields
Markov Random Fields (MRFs), or Markov Networks, model a set of random variables as nodes in a graph. Dependencies between variables are represented by undirected arcs between the corresponding nodes in the graph. The topology of the network explicitly identifies independence assumptions: the absence of an arc between two nodes indicates that the nodes are assumed to be conditionally independent given their neighbors. MRFs are used extensively for problems in low-level computer vision and statistical physics, and provide a framework for inferring underlying global structure from local observations. We now discuss the basic concepts of MRFs, drawing from our in-house review [7] of the typical treatments in the literature [30; 16; 22; 15].

Random Field A set of random variables X = {x_i}.

Graphical Model A random field X may be represented as a graphical model G = (Q, E) composed of a set of nodes Q and edges E connecting pairs of nodes. A node i ∈ Q represents the random variable x_i ∈ X. An edge (i, j) ∈ E connecting nodes i and j indicates a statistical dependency between random variables x_i and x_j. More importantly, the lack of an edge between two graph nodes indicates an assumption of independence between the nodes given their neighbors.

Configuration For a random field X of size n, a configuration ω of X assigns a value (x_1 = ω_1, x_2 = ω_2, . . . , x_n = ω_n) to each random variable x_i ∈ X. P(Ω) is the probability density function over the set Ω = {ω} of all possible configurations of X.
Neighborhood Relationship We define a neighborhood relationship N on a field X as follows. Let E be a set of ordered pairs representing connections (typically probabilistic dependencies) between elements of X. Then for any x_i, x_j ∈ X, x_j ∈ N_i ⇔ (x_i, x_j) ∈ E.

Markov Property [30] A variable x_i in a random field X satisfies the Markov property if

P(x_i = ω_i | x_j = ω_j, j ≠ i) = P(x_i = ω_i | x_j = ω_j, j ∈ N_i).

These probabilities are called local characteristics and intuitively describe a locality condition, namely that the value of any variable in the field depends only on its neighbors.

Positivity Condition The positivity condition states that for every configuration ω ∈ Ω, P(x = ω) > 0.

Markov Random Field Any random field satisfying both the Markov property and the positivity condition. Also called a Markov Network.

Two-Layer MRF [23; 16; 22] "Two-Layer" describes the network topology of the MRF. The top layer represents the input, or evidence, while the bottom layer represents the relationships between neighboring nodes (Figure 2.2). In typical computer vision problems, inter-level links between the two layers enforce compatibility between the image evidence and the underlying scene. Intra-level links in the bottom (scene) layer leverage a priori knowledge about relationships between parts of the underlying scene to enforce consistency between neighboring scene nodes [16].
Figure 2.2: Portion of a typical two-layer MRF. In low-level computer vision problems, the top layer (black) represents the external evidence of the observed image, while the bottom layer (white) expresses the a priori knowledge about relationships between parts of the underlying scene.

Pairwise MRF [23; 16; 82] In a pairwise MRF, the joint distribution over the MRF is captured by a set of compatibility functions that describe the statistical relationships between pairs of random variables in the MRF. For inferential purposes, this means that the graphical model representing the MRF has no cliques larger than size two.

Compatibility Functions The statistical dependency between two random variables x_i, x_j in a random field is characterized by a compatibility function ψ_ij(ω_i, ω_j) that scores every possible pair of hypotheses (x_i = ω_i, x_j = ω_j). As an example, consider a link (i, j) in a graphical model G connecting nodes i and j. If there are three possible outcomes for x_i and two possible outcomes for x_j, the compatibility function relating i and j is a 3×2 matrix, M = [m_ij]. Depending upon the problem, compatibilities may be characterized by the joint distribution of the two variables (or, equivalently, by both conditional distributions, since p(x, y) may be obtained from p(x|y) and p(y|x)).
For some problems for which the joint is unobtainable, a single conditional distribution suffices (e.g., for a problem in which p(x|y) is known but p(y|x) cannot be computed).

Inference In typical low-level computer vision applications of MRFs, what is desired from the inference procedure is the MAP estimate of the true scene (the labeling), given the observed data (the image). We have identified two complementary approaches in the literature for calculating the MAP estimate: deterministic techniques and Monte Carlo techniques (described later in this section). We start by reviewing two deterministic techniques: Belief Propagation and Highest Confidence First. The Highest Confidence First algorithm finds local maxima of the posterior distribution using the principle of least commitment [43], while belief propagation is an inexact inference procedure that applies message-passing algorithms successfully in loopy networks by simply ignoring the loops.

Highest Confidence First (HCF) [16] The HCF algorithm is used for MAP estimation, finding local maxima of the posterior distribution. It is a deterministic procedure founded on the principle of least commitment. Scene nodes connected to image nodes with the strongest external evidence (i.e., a large ratio between the likelihood of the best hypothesis and the likelihoods of the others) are "committed" first, since they are unlikely to change (based on compatibility with neighbors). Nodes with weak evidence commit later, based primarily on their compatibility with their already-committed neighbors. Using an edge-modeling MRF as an example, large intensity gradients might constitute strong evidence at some of the network's nodes. The nodes with strong evidence should influence scene nodes with weaker evidence (via edge-continuity constraints) more than the other way around.
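As an illustration only, here is a greedy caricature of this least-commitment idea: nodes whose local evidence is most decisive are labeled first, and weaker nodes are then labeled using their compatibility with already-committed neighbors. This is not the exact HCF algorithm of [16] (which commits labels based on stability criteria over a posterior energy); the potentials and evidence below are invented.

import numpy as np

def greedy_least_commitment_labeling(evidence, psi, neighbors):
    """Label nodes in order of decreasing evidence strength; later (weaker) nodes are
    scored by their local evidence times compatibility with committed neighbors."""
    n = len(evidence)
    labels = [None] * n
    def strength(i):                                  # ratio of best to second-best evidence
        e = np.sort(np.asarray(evidence[i], dtype=float))[::-1]
        return e[0] / max(e[1], 1e-12)
    for i in sorted(range(n), key=strength, reverse=True):
        score = np.asarray(evidence[i], dtype=float).copy()
        for j in neighbors[i]:
            if labels[j] is not None:                 # committed neighbours vote
                score = score * psi[:, labels[j]]
        labels[i] = int(np.argmax(score))
    return labels

# Two labels on a three-node chain; the middle node has weak evidence.
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])                          # neighbours prefer matching labels
evidence = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(greedy_least_commitment_labeling(evidence, psi, neighbors))   # -> [0, 0, 1]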
Belief Propagation [82; 22; 29] The Belief Propagation (BP) algorithm is a message-passing algorithm for probabilistic networks. It generalizes a number of inference algorithms, including the forward-backward algorithm, the Viterbi algorithm, Pearl's algorithm for Bayesian polytrees, and the Kalman filter. At each iteration, each node computes a belief, which is a marginal of the joint probability. The belief is a function of the local compatibilities φ_i(x_i) (local evidence nodes, which are constant, can be subsumed into them) and the incoming messages m_ji from neighboring nodes:
b_i(x_i) = k φ_i(x_i) ∏_{j ∈ N(i)} m_ji(x_i)
Each message m_ij is computed from the compatibility between the sender and recipient nodes and from the previous messages arriving at the sender from its other neighbors:

m_ij(x_j) = Σ_{x_i} φ_i(x_i, y_i) ψ_ij(x_i, x_j) ∏_{k ∈ N(i)\j} m_ki(x_i)
Intuitively, the incoming messages represent combined evidence that has already propagated through the rest of the network. In the (for our purposes rare) case that the graph contains no loops, it can be shown that the computed marginals are exact. In the typical "loopy" networks the evidence can be "double-counted" as it cycles, yet experimental work suggests that, at least for certain problems, the approximations are still good [81]. One can calculate the MAP estimate at each node by replacing the summations in the messages with maximizations.
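The sketch below is a compact, illustrative implementation of the sum-product updates above for a small pairwise MRF; it simply iterates the messages a fixed number of times and ignores the loops, exactly as described. The potentials in the usage example are made up, and on the three-node loop the resulting beliefs are only approximate.

import numpy as np

def loopy_bp(phi, psi, edges, n_iters=20):
    """Sum-product belief propagation on a pairwise MRF.
    phi:   {node: local evidence vector phi_i}
    psi:   {(i, j): matrix with psi[(i, j)][a, b] = psi_ij(x_i = a, x_j = b)}
    edges: undirected edges (i, j); messages are passed in both directions."""
    neighbors = {n: [] for n in phi}
    msgs = {}
    for (i, j) in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
        msgs[(i, j)] = np.ones(len(phi[j])) / len(phi[j])
        msgs[(j, i)] = np.ones(len(phi[i])) / len(phi[i])
    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:                                   # message from i to j
            pre = phi[i].astype(float).copy()
            for k in neighbors[i]:
                if k != j:
                    pre = pre * msgs[(k, i)]                  # product over the other neighbours
            pairwise = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            m = pairwise.T @ pre                              # sum over x_i
            new_msgs[(i, j)] = m / m.sum()
        msgs = new_msgs
    beliefs = {}
    for n in phi:
        b = phi[n].astype(float).copy()
        for k in neighbors[n]:
            b = b * msgs[(k, n)]
        beliefs[n] = b / b.sum()
    return beliefs

# A three-node loop with "prefer agreement" potentials (all values invented):
agree = np.array([[0.8, 0.2], [0.2, 0.8]])
phi = {0: np.array([0.7, 0.3]), 1: np.array([0.4, 0.6]), 2: np.array([0.5, 0.5])}
psi = {(0, 1): agree, (1, 2): agree, (0, 2): agree}
print(loopy_bp(phi, psi, edges=[(0, 1), (1, 2), (0, 2)]))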
Generalized belief propagation [82] Generalized belief propagation (GBP) uses messages from clusters of nodes to other clusters of nodes. Since these messages are expected to be more informative, performance is also expected to improve. GBP is theoretically justified: it is related to Kikuchi approximations in the same way that BP is related to the Bethe free energy; a set of beliefs is a GBP fixed point in a graph if and only if the beliefs are local stationary points of the Kikuchi free energy (for details of the free-energy minimization techniques, see [82]). GBP has been found to perform much better than BP on graphs with short loops. The drawback is that the complexity is exponential in the cluster size; but, again, if the graph has short loops (and thus requires only small clusters), the added complexity can be minimal. Pearl's clustering algorithm is a special case of GBP in which the clusters are chosen in a fixed manner and are usually large; it obtains increased accuracy at increased complexity. An advantage of GBP is that the cluster size can be varied to trade off accuracy against complexity.
Inference on Tree-Structured MRFs Felzenszwalb and Huttenlocher [19] use tree-structured MRFs for recognition of objects such as faces and people. They model an object as a collection of parts appearing in a particular spatial arrangement. Their premise is that in a part-based approach, recognition of individual parts is difficult in isolation and requires spatial context for accurate performance. They model the expected part locations using a tree-structured two-layer MRF. In the scene layer, the nodes represent parts and the connections represent
general spatial relationships between the parts. However, rather than using the typical square lattice, they use a tree, so inference in the MRF is both exact and efficient. Their MAP estimation algorithm is based on dynamic programming and is very similar in flavor to the Viterbi algorithm for Hidden Markov Models. In fact, the small body of literature on using Hidden Markov Models for object and people detection [20] might be better cast in an MRF framework.

Monte Carlo Methods Our treatment is taken in large part from [7; 45; 41; 23]. Monte Carlo methods are used for sampling: the goal is to characterize a distribution using a set of well-chosen samples. These can be used for approximate MAP estimation, computing expectations, and so on, and are especially helpful when the expectations cannot be calculated analytically. How the representative samples are drawn and weighted depends on the particular Monte Carlo method. One must keep in mind that the number of iterations the various algorithms need in order to obtain independent samples may be large.

Monte Carlo Integration Monte Carlo integration is used to compute expectations of functions over probability distributions. Let p(x) be a probability distribution and a(x) a function of interest. We assume that ∫ a(x)p(x) dx cannot be evaluated analytically, but that p(x) is easy to sample from (e.g., Gaussian). To compute a Monte Carlo estimate of ∫ a(x)p(x) dx, we first draw a representative sample of points from p(x); there will be many points from the regions of high probability density (intuitively, the values that should be common in the real world). We then calculate a(x) for each x in the sample. The average of these values closely approximates the expectation. A key insight into this concept can be stated as follows:
The expectation of a function of random variables depends not only on the function's values, but on how often the random variables take those values! The drawback is that our assumption that p(x) is "easy" is often not valid; p(x) is often not easy to sample from, and so we need a search mechanism to draw good samples. Furthermore, we must be careful that this search mechanism does not bias the results.

Markov Chain Monte Carlo (MCMC) Methods [45; 41] In Markov chain Monte Carlo (MCMC) methods, the samples are drawn from the end of a random walk. A Markov chain is a series of random variables X^0, X^1, . . . , X^t in which a locality condition is satisfied, that is,
P(X^{t+1} | X^t, X^{t-1}, . . . , X^0) = P(X^{t+1} | X^t).

The chain can be specified using the initial probability p_0(x) = P(X^0) and the transition probabilities P(X^{t+1} | X^t). The probability of moving from state x to state y at time t is denoted T_t(x, y) and can be summarized in a transition matrix T_t. We consider homogeneous Markov chains, those in which the transition probabilities are constant, i.e., T_t = T for all t. We take an initial distribution across the state space (which, in the case of MRFs, is the set of possible configurations of the individual variables). This distribution is multiplied by the matrix of transition probabilities repeatedly, each iteration yielding a new distribution. The theory of Markov chains guarantees that the distribution will converge to the true (stationary) distribution if the chain is ergodic, i.e., it converges to the same distribution regardless of the starting point.
In practice, one generates a sample from the initial probability distribution (e.g., uniform) and then moves through the state space stochastically: a random walk guided by the values in the transition matrix. Since the distribution is guaranteed to converge, at the end of the random walk the sample should come from the actual distribution. The number of steps needed before convergence is reached is bounded by theoretical results, but varies in practice. There are a number of MCMC algorithms:

• Gibbs Sampler [23; 45; 41]. In the Gibbs sampling algorithm, each step of the random walk is taken along one dimension, conditioned on the present values of the other dimensions. In an MRF problem, the needed conditional probabilities are assumed known, since they are local (by the Markov property). The method was developed in the field of physics, but was first applied to low-level computer vision problems by Geman and Geman on the problem of image restoration [23]. Geman and Geman furthermore combined Gibbs sampling with simulated annealing to obtain not just a sample, but the MAP estimate of their distribution. Their application of Gibbs sampling is also called stochastic relaxation (to differentiate it from deterministic relaxation techniques).

• Metropolis Sampler [45; 41]. In each iteration of the Metropolis algorithm, one makes a small change to the current state and accepts the change based on how good (probabilistically) the new state is compared to the old one.

• Metropolis-Hastings Sampler [41]. A generalization of the Metropolis algorithm.
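As a small illustration of the Gibbs sampler described above, the sketch below resamples one node at a time from its local conditional in a toy binary pairwise MRF. The potentials are invented, and the simulated-annealing step that Geman and Geman add to obtain a MAP estimate is omitted.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_sample(phi, psi, neighbors, n_sweeps=500):
    """One long Gibbs run over a binary pairwise MRF; returns the final configuration.
    phi: (n, 2) local evidence; psi: (2, 2) shared compatibility; neighbors: adjacency lists."""
    n = phi.shape[0]
    x = rng.integers(0, 2, size=n)                 # arbitrary initial configuration
    for _ in range(n_sweeps):
        for i in range(n):                         # one dimension (node) at a time
            p = phi[i].copy()
            for j in neighbors[i]:
                p = p * psi[:, x[j]]               # condition on the current neighbour values
            p = p / p.sum()
            x[i] = rng.choice(2, p=p)              # resample node i from its local conditional
    return x

# Toy four-node cycle with smoothness-favouring potentials (invented numbers):
phi = np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.8, 0.2]])
psi = np.array([[0.7, 0.3], [0.3, 0.7]])
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(gibbs_sample(phi, psi, neighbors))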
Importance Sampling [45; 41] To sample from a distribution f(x), sample from a simpler distribution g(x) and weight each sample by the ratio between f and g. One caveat is that g(x) should have heavy tails (e.g., Cauchy), because g(x) must not vanish where the original distribution has non-zero probability.

Rejection Sampling [45; 41] To sample from a distribution f(x), sample from another, similar distribution g(x) for which a constant c exists such that f(x) ≤ c·g(x) everywhere. Generate a point x̃ from g(x) and accept it with probability f(x̃)/(c·g(x̃)), repeating until a point is accepted. The efficiency of the method depends on how tight c is and how close the functions f(x) and g(x) are; it is not useful in practice for "difficult" problems [45].
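A minimal sketch of rejection sampling under the stated condition f(x) ≤ c·g(x); the target and proposal below are chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(f, g_sample, g_pdf, c, n):
    """Draw n samples from f using proposals from g, assuming f(x) <= c * g_pdf(x) everywhere."""
    samples = []
    while len(samples) < n:
        x = g_sample()
        if rng.uniform() < f(x) / (c * g_pdf(x)):   # accept with probability f / (c g)
            samples.append(x)
    return np.array(samples)

# Illustration: a standard-normal shape restricted to [0, 2], proposed from Uniform(0, 2).
f = lambda x: np.exp(-0.5 * x ** 2)                 # unnormalized target on [0, 2]
g_sample = lambda: rng.uniform(0.0, 2.0)
g_pdf = lambda x: 0.5                               # Uniform(0, 2) density
c = 2.0                                             # f <= 1 = c * 0.5 on [0, 2]
print(rejection_sample(f, g_sample, g_pdf, c, 1000).mean())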
2.4.3 Relative Merits
The literature is divided regarding the potential equivalence of Bayesian networks and Markov Random Fields. Pearl argues that only a subset of each model class (decomposable models) is equivalent, due to the differing semantics of directed and undirected links [50]. However, Yedidia [82] argues that they are equivalent, providing algorithms for converting each to and from factor graphs, a third graphical model. While we need to settle this issue in our minds, we argue that even if the models are technically equivalent, their ease of use is not necessarily equivalent; each definitely has its particular merits. (Consider the theoretical equivalence of Turing machines with modern computers: which is easier to program for practical tasks?) Bayesian networks model causal dependencies and allow
for efficient inference in sparse graphs. Their utility in high-level vision problems has been proven. In contrast, Markov Random Fields have been used primarily for low-level vision problems in which the spatial constraints can be specified directly and only recently have been applied to object recognition (in which case a tree-structured MRF was used). We believe that each model warrants further investigation.
3 Methodology
In this chapter, we discuss our experimental setup, the statistics calculated on natural images, the advantages of working in simulation, and the details of the simulator. For our prototype scene classifier using spatial information, we have focused on the problem of classifying outdoor scenery (i.e., natural) scenes. Many other attempts have been made to classify this type of scene (e.g., [4; 9; 27; 35; 44; 46; 77]), all with some degree of success. One reason is that in natural scenes, low-level features tend to be more correlated with semantics than they are in scenes of human activity. Another is that there is a limited range of object types present in these scenes [70]. Some research suggests that natural scene databases are less complex (by a certain metric) than those of people [52]; we note that these findings are by no means conclusive, because the range of features used in the experiments was extremely narrow. Furthermore, outdoor scenery images tend to contain specific, albeit loose, spatial arrangements. For instance, in a beach scene, we would expect to see sky appearing above water, in turn appearing above sand. One reason for this seems to be the large field of view in these images: perspective is much more constrained than in short-range scenes dominated by human activity.
Finally, the ultimate test for our research is to solve a practical yet very interesting problem: classifying real consumer-type images. In much of the scene classification work, as we have seen, experimental databases are usually limited to professional stock photo libraries. If we can successfully classify consumer images, which are much less constrained in color and composition, our work could potentially generalize to other domains; robot vision, for instance, needs to operate in unstructured environments with low-quality sensors. When the time comes to experiment with real images and real detectors, we will be well-equipped to detect the materials in these scenes. We have access to state-of-the-art detectors for sky (blue [38] and mixed [67]), as well as grass, water, and snow [39]. We have also started to gather training data, labeling approximately 4000 images with scene classes (beach, sunset, fall foliage, field, mountain, urban, night, lake and ocean, skiing, and desert). The materials and their locations in approximately 300 of these images have been hand-labeled as well, as we describe in the following section.
3.1 Statistics of Natural Images
For a small set of these natural images, we have marked the location and extent of each material in each image. While gathering data about the presence or absence of materials is fairly straightforward (and can be represented by text), learning the spatial relationships between them requires labeling the materials in each image in such a way that the spatial information can be extracted. Because we want to compare the effects of qualitative and quantitative spatial relationships, it is not enough to collect data in the form “the sky is above the sand”. We need to outline each material’s region in each image in some manner, so that the relationships can be calculated. A sample of the desired information is captured in Figure 3.1.
Figure 3.1: Ground-truth labeling of a beach scene. Sky, water, and sand regions are clearly shown.
3.1.1 Ground Truth Collection Process
Ground truth collection involved two stages: collecting images of each class, and labeling the material regions in each image.
Scene Class Definitions As stated, we have collected approximately 400 images of each of the following types of scenes: beach, sunset, fall foliage, field, mountain, and urban. These scene types are similar to those used in the literature and are chosen such that material detection of various types should be helpful. The scene definitions are given in Table 3.1. In general, the images chosen did not have a dominant main subject. This is legitimate; for example, a close-up image containing a horse in a field would be considered a "horse" image, not a "field" image. We do plan to investigate later how to adapt the framework for such images.
Table 3.1: Scene class definitions.
  Class         Definition
  beach         At least 5% each of water, sand, and sky
  sunset        Illuminant source in front of camera
  fall foliage  Detectable color in turned leaves on tree
  field         No aerial view, not cluttered with trees ("open")
  mountain      Open, whole mountains, mid-range view
  urban         At least one building, no extreme perspectives
Material Region Labeling Our region labeling was done in a principled fashion, using the following methodology: • Identify key materials. We chose a number of materials to label in each image: sand/gravel, sky (blue and mixed/cloudy), water, snow, buildings (skyscrapers), roads, grass, foliage (evergreen, green and turned deciduous), rock, crowds of people, and crops. • Define each material precisely. For instance, how is a crowd of people defined? Is wet sand on a beach considered part of a sand or water region? We defined “sand/gravel” as dry sand or gravel, and ignored wet sand, sometimes causing wide borders between the sand and water regions as a result. For a detailed list of definitions, see [39]. • Label each image. We select unobstructed (continuous) areas of each material; for instance, we do not select the bits of sky visible between the leaves of a tree. While our labeling tool’s user interface allows us to create and label only polygonal regions, this is sufficient.
The labeled regions correspond to a semantics-based coarse segmentation of each image, which is ideal for training: the labels correspond to semantic regions even where changes in appearance (e.g., deep shadows) would fool many of today's segmentation algorithms. In the classification stage, errors can then be attributed to concept failures, segmentation errors, or material detection errors. For each image region, we selected a polygon contained strictly in the interior of the region, yielding a "pure" sample of that region. Avoiding the boundary altogether is simpler, yet leaves spaces between the regions (as shown in Figure 3.1). We could, of course, achieve a closer approximation using a minimization refinement such as snakes [71]; however, it is doubtful whether such a fine-grained segmentation would help.
3.1.2 Scene Prototypes
Prototypical beach scenes, in order of increasing complexity, are given in Figure 3.2. We labeled a total of 321 images, belonging to the beach, urban, and field classes.
3.1.3 Spatial Relationships
We use Luo et al.'s hybrid bounding-box/weighted-walkthrough approach (see Section 2.3.1) to compute spatial relationships. In [40], the relations were used to solve the related problem of holistic material detection using spatial context-aware scene models. (Note that they did not build scene-specific models; we could apply our research to do so.) They were interested in outdoor scenes and the typical materials found there as well. The relations of interest are above, below, beside, far above, and far below, favoring the above and below relations over the beside relation, because they were found to be more helpful for outdoor scenes.
Figure 3.2: Prototypical beach scenes. (a) A simple beach scene without background objects. (b) Because we make no attempt to detect it, we consider the sailboat to be "background". (c) A more complicated scene: a developed beachfront. (d) A scene from a more distant field-of-view. (e) A crowded beach.

We use their algorithm for its proven performance on similar scenes (and because it is faster than the other methods discussed). We make the simplifying assumption that multiple (potentially) disconnected regions with the same semantic label should be considered as one region for the purposes of computing spatial relations. This is not generally an issue; see, for example, the sand regions in Figure 3.2 (e): calculating their relationship as one entity with respect to the sky or water is equivalent to computing the individual relationships and merging them. This also solves the issue of conflicting relations between a material's subregions in an intuitive fashion. As a simple example, see Figure 3.4. We have two regions (A1 and A2) of material MA and one region (B) of material MB. A1 is beside B and A2 is above B. However, by considering A1 and A2 as one region, the resulting bounding box overlaps region B, and the weighted-walkthrough algorithm yields a relationship favoring the larger region.

Figure 3.3: Prototypical urban scenes. (a) The most common urban scene, containing sky, buildings, and roads. (b),(c) The sky is not simply above the buildings in these images. (d) Roads are not necessary. (e) Perspective views induce varied spatial relationships. (f) Close views can preclude the presence of sky.
Figure 3.4: An example of spatial relationships involving a split region.
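To make the qualitative relations concrete, the following is a much-simplified stand-in (not Luo et al.'s weighted-walkthrough algorithm, which is what we actually use) that derives an above/below/beside label from merged bounding boxes; the thresholds are invented.

def merged_bbox(regions):
    """Merge possibly disconnected regions (lists of (x, y) pixels) into one bounding box."""
    xs = [x for region in regions for (x, y) in region]
    ys = [y for region in regions for (x, y) in region]
    return min(xs), min(ys), max(xs), max(ys)          # y grows downward in image coordinates

def spatial_relation(regions_a, regions_b, far_factor=1.5):
    """Crude qualitative relation of material A with respect to material B."""
    ax0, ay0, ax1, ay1 = merged_bbox(regions_a)
    bx0, by0, bx1, by1 = merged_bbox(regions_b)
    gap = 0.5 * (by0 + by1) - 0.5 * (ay0 + ay1)        # positive if A's centre is higher up
    extent = max(ay1 - ay0, by1 - by0, 1)
    if abs(gap) < 0.25 * extent:
        return "beside"
    if gap > 0:
        return "far above" if gap > far_factor * extent else "above"
    return "far below" if -gap > far_factor * extent else "below"

# A sky region above a water region (pixel coordinates):
sky   = [[(x, y) for x in range(100) for y in range(0, 30)]]
water = [[(x, y) for x in range(100) for y in range(40, 80)]]
print(spatial_relation(sky, water))                    # -> "above"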
3.1.4 Scene-specific Spatial Relationship Statistics
Using the relationships defined above, we extracted spatial relationship statistics for two classes of natural images: beach and urban scenes. The statistics are given in Appendix A.
3.2 Experimental Environment
One goal of our research is to develop and demonstrate representations and algorithms for difficult semantic scene classification problems. We start with properties of image regions and their spatial relations, and exploit probabilistic constraints on interpretations (discrete labels) of regions arising from cultural and physical regularity and predictability. A simulated environment is essential for us to manipulate and explore the various relationships, properties, algorithms, and assumptions available to us without being constrained by the "accidental" (in the philosophical sense) nature of any particular domain. However, another goal certainly is to solve practical problems in scene classification, spurred on by the motivations given in the Introduction. That is, we want our solutions to be specific as well as general, and we care about application as well as abstraction. These goals can compete with each other: abstracting away too much in a simulated environment can be totally unrealistic and give little hope of solving the original problem. We aim to balance the two. We can safely abstract away the bank of available material and object detectors by representing them by their probabilities of correct and incorrect detection, along with their expected beliefs. For some detectors, we learn these beliefs from training data, and for others, we estimate them. We can also abstract away the images (to some degree) by using a probabilistic generative model, as will be
described. We do not have to label a large number of images, and yet we learn the probabilities from the images we have labeled, thus achieving some balance. Using these abstractions, the simulator should allow us to experiment with schemes using various degrees of spatial information: qualitative vs. quantitative. Furthermore, we will want to track the effect on each of these schemes of substituting full knowledge of the scene parts with potentially faulty detectors. We have developed a prototype of such an environment, dubbed MASSES (Material And Spatial Simulated Experimental Scenes). In MASSES, materials and their relationships are abstracted for classification purposes. Each scene contains “objects” arranged according to some loose structure. MASSES is similar in spirit to Rimey’s T-world [56; 57], except T-world modeled scenes with tight structure (e.g., a table setting), and should generalize to scenes such as traffic scenes or medical diagnosis, where parts of the image follow precise rules. It is doubtful that T-world could model natural scenes, which have a much looser structure.
3.2.1 Advantages of Working in Simulation
We now discuss the specific advantages of idealizing the world with MASSES. 1. It is a standalone work, under our control and free from outside influences. Therefore, we are not dependent upon others’ ground truth data collection or upon their detectors. Intellectual property issues are avoided. 2. It can embody the ideas we hold about the world, namely the strength and granularity of the spatial organization and the accuracy of the detectors. 3. We can experiment with various algorithms. 4. Perhaps most importantly, we can easily quantify the value of added information. We can answer questions such as:
(a) What is the value of spatial information? Intuitively, we believe that it should help, but we need to quantify the effect. (b) How does performance decline when we replace true material knowledge with faulty detectors? (c) How does a better sky detector help? We can vary the performance continuously and track the effects. (d) How does adding another spatial constraint help? For instance, cliques in a region adjacency graph (RAG) [71] may contain discriminatory information. (e) How strict do the spatial rules have to be before we can benefit from them? How fine-grained do the measurements have to be? (f) What is the value of added training data?
3.2.2 MASSES Prototype
The expected output of the simulator is a series of virtual scenes, which can be used to train a classifier or to test its performance. In this section, we describe our scene generative model. For our prototype simulator, we have chosen to use a tree-structured Bayesian network as an inference engine. Bayesian networks of this form are well-understood and have efficient inference algorithms, as we discussed in Section 2.4.1. More expressive formalisms, such as Bayesian networks (for general graphs), or Markov Random Fields, may better capture the spatial relationship distributions, but we have not yet explored these options. The flow of information in MASSES is depicted in Figure 3.5.
Figure 3.5: The MASSES environment. Statistics from labeled scenes are used to bootstrap the generative model, which can then produce new virtual scenes for training or testing the inference module.

Bayesian Network We designed a Bayesian network to classify scenes. We place a conditional independence assumption on each material, given the scene class. We also assume that all of the spatial relationships are independent, given the scene class. Our single-level network is shown in Figure 3.6.
Figure 3.6: Single-level Bayesian network used for MASSES
The network has been designed with the following assumptions in mind:

1. We model regions as point sources and model qualitative spatial relationships between them. This means we ignore the shape and size of each region, adjacency information about the regions, and occlusions causing regions to be split. We also ignore the underlying color and texture features of the region that were used to determine its classification. While any of these may be useful features in a full-scale system, we temporarily ignore them.

2. We treat each material independently. This ignores the potential co-occurrence relations between materials, which is a limitation if materials are dependent given the class. For example, in beach images, roads are more likely to be present when a building is also present.

3. We model spatial relationships as binary, symmetric relations, such as A Above B. This raises interesting questions. In some cases, relationships are pairwise consistent, yet globally inconsistent. For example, it is possible for region R1 to be above region R2, region R2 to be above region R3, and region R1 to be below region R3. While this would be impossible when modeling regions as point sources, it is certainly possible (though unlikely) with 2D regions of varying shape (consider region R3 in the shape of the letter "C", enclosing the other two regions).

4. We assume flat priors on the scene classes, thus eliminating their effect. While this assumption does not hold in practice, the analysis becomes simpler and we lose no generality.

Finally, we base the values of the conditional probability matrices (CPMs) on the material and spatial-relationship links on the statistics of natural scenes presented in Section 3.1. Specifically, we round the probabilities and ignore rare,
apparently less important materials (such as the rarely-occurring grass region on a beach).

Sampling the Bayesian Network to Generate Scenes Our first experiment is to determine whether best-case (100% accurate) material detection alone (without spatial relations) can give reliable scene classification. If not, we want to determine the effects of adding spatial information. To this end, we ignore detector characteristics and create scenes directly from the material distributions learned from the outdoor scenes. With this in mind, the method of sampling the Bayes net is straightforward, using its priors and the conditional probability matrices of its links, because the network is a directed acyclic graph (DAG). The algorithm, called probabilistic logic sampling for DAGs, was proposed by Henrion [25] and extended by Cowell [17]. (If the graph is not a DAG, a more complicated method such as rejection sampling, or a different framework such as a nested junction tree [17], can be used.) We apply Henrion's algorithm as follows:

1. Generate the class, C, of the scene by randomly sampling the priors, P(class = C).

2. Generate the list of materials present in the scene. Each material has the scene's class as its parent, so we sample from P(material_i | class = C) (one row in the CPM on the link) for each material i.

3. Generate spatial relationships for each pair of materials <i, j> present in the scene by sampling P(M_i rel_ij M_j | class = C).

Figure 3.7 shows an example. Note that there is a separate node in the network for each material and for each pair of materials in the image, a total of (mC2 + m) nodes for m materials.
Figure 3.7: Sampling the scene type yields class C. Then we sample to find the materials present in the image, in this case M1, M3, and M4. Finally, we sample to find the relationships between each pair of these material regions.

However, when generating scenes or doing inference, we need only instantiate those nodes corresponding to materials already determined to be present in the image. If d materials are detected in a given image, there will be dC2 instantiated relationship nodes. In practice, this is a large savings because, for typical scenes, d < m. The alternative to Henrion's algorithm is to use a guess-and-check method: generate a random placement of material regions and then use the Bayes net to classify them. In a non-forced-choice classification problem, this would give the ability to generate scenes in an "other" category, but it would also generate nonsensical scenes. It is also extremely inefficient.
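A hedged sketch of this top-down (logic) sampling procedure follows. The class set, material list, and probability tables are placeholders, not the CPMs learned from our labeled data.

import numpy as np

rng = np.random.default_rng(0)

CLASSES, PRIOR = ["B", "U"], [0.5, 0.5]
# P(material present | class); placeholder values only.
P_MATERIAL = {"sky":      {"B": 0.95, "U": 0.90},
              "water":    {"B": 0.95, "U": 0.05},
              "sand":     {"B": 0.90, "U": 0.02},
              "building": {"B": 0.02, "U": 0.95}}
RELATIONS = ["above", "below", "beside"]
# P(rel(i, j) | class) for a few material pairs; again placeholders.
P_REL = {("sky", "water"):  {"B": [0.90, 0.02, 0.08], "U": [0.60, 0.10, 0.30]},
         ("sky", "sand"):   {"B": [0.90, 0.02, 0.08], "U": [0.50, 0.10, 0.40]},
         ("water", "sand"): {"B": [0.70, 0.10, 0.20], "U": [0.40, 0.20, 0.40]}}

def generate_scene():
    """Top-down (logic) sampling: class first, then materials, then pairwise relations."""
    c = rng.choice(CLASSES, p=PRIOR)
    materials = [m for m, p in P_MATERIAL.items() if rng.uniform() < p[c]]
    relations = {}
    for (a, b), dist in P_REL.items():
        if a in materials and b in materials:          # only pairs that are actually present
            relations[(a, b)] = rng.choice(RELATIONS, p=dist[c])
    return c, materials, relations

print(generate_scene())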
Inference in the Bayesian Network for Classifying Scenes Classifying scenes generated in this fashion is straightforward. A scene is represented by a list of materials and binary relations. The material leaves are instantiated as either present or not present. Only the relations corresponding to materials present in the scene are instantiated; the others cannot be, because two materials that are not present have no relationship to instantiate. This also makes inference much more efficient. The likelihoods propagate to the root and, since the priors are flat, the scene is classified as the class with the maximum likelihood.
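The corresponding classification step can be sketched as a product of the instantiated leaves' conditional probabilities; with flat priors, the class with the largest product wins. The tiny model below is illustrative only and is not our learned network.

def classify(materials, relations, p_material, p_rel, classes, relation_names):
    """Maximum-likelihood scene class from detected materials and their pairwise relations.
    With flat class priors the posterior is proportional to this likelihood."""
    scores = {}
    for c in classes:
        s = 1.0
        for m, p in p_material.items():               # absent materials contribute (1 - p)
            s *= p[c] if m in materials else 1.0 - p[c]
        for pair, rel in relations.items():           # only relations among present materials
            s *= p_rel[pair][c][relation_names.index(rel)]
        scores[c] = s
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, {c: v / total for c, v in scores.items()}

# A tiny beach-like (B) vs. lake-like (L) model that differs mainly in where the sand sits:
classes, relation_names = ["B", "L"], ["above", "below", "beside"]
p_material = {"sky":   {"B": 0.95, "L": 0.95},
              "water": {"B": 0.95, "L": 0.95},
              "sand":  {"B": 0.95, "L": 0.50}}
p_rel = {("water", "sand"): {"B": [0.85, 0.05, 0.10],   # on a beach, water usually lies above sand
                             "L": [0.10, 0.80, 0.10]}}  # at a lake, sand at the horizon lies above water
print(classify(["sky", "water", "sand"], {("water", "sand"): "above"},
               p_material, p_rel, classes, relation_names))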
3.2.3 Simulating Faulty Detectors
If we were to have full knowledge of the materials in a scene, then many scenes could be distinguished by the mere presence or absence of materials. For instance, consider the outdoor scenes described above: an image containing only sky, a skyscraper, and roads is almost guaranteed to be an urban scene. However, to be practical, we must allow for faulty detection of objects and materials in a scene during the classification stage. Our hope is that if a number of materials are falsely detected, their spatial relationships will yield low beliefs in every scene type, and the image can therefore be rejected as depicting no known scene, or as one that cannot be classified from the materials detected. We can do this by incorporating a reject option into our inference mechanism (e.g., by thresholding class beliefs). In the more general case, a region may be classified as multiple materials, each with a corresponding belief. The correct material may be detected with a belief smaller than the maximum, but spatial relationships may boost the overall belief in the configuration containing this material until it becomes the maximum. A method of simulating faulty detectors will involve a causal relationship between the true materials and the detectors. Define the characteristics of detector D on a set of materials M to be the set of probabilities {P(D fires | m_i) : m_i ∈ M}. Take the sand detector as an example:

Sand_Detector
  0.05  0.95  Sky
  0.95  0.05  Sand
  0.20  0.80  Road
  0.20  0.80  Water
  0.25  0.75  Building
  0.30  0.70  Background
The first column gives P(D_s | M), the probability that the sand detector D_s fires given the true material M; the second column gives the probability 1 − P(D_s | M) that the detector does not fire. In this example, the sand detector has 95% recall of true sand, and detects sand falsely on 20% of water regions, for instance (perhaps due to a brown reflection in the water or shallow sand). It may also fire falsely on 30% of the background (unmodeled) regions in the images because they have similar colors. Assume we have a detector for each material of interest. Each of these detectors is applied to any given region in the image and has a certain probability of firing given the true material in that region. The evidence consists of the detectors that actually fired and the corresponding beliefs. Take a sand region; the detectors' characteristics with respect to it may be as follows:

True Sand
  0.03  0.97  Sky_detector
  0.95  0.05  Sand_detector
  0.20  0.80  Road_detector
  0.20  0.80  Water_detector
  0.10  0.90  Building_detector

In this case, the first column gives, for each material detector D_m, P(D_m | S), the probability that the detector fires on a true sand region. Each detector is linked to each region in the image, as shown in Figure 3.8. Note that the subgraph's "root" node corresponds not to a specific material, but to a specific region. Inference on this subgraph allows the combined likelihood of each true material to be calculated, yielding a soft belief in each material. Let M_i be a material and D be the set of detectors, D = {D_1, D_2, . . . , D_n}. We calculate the region's likelihood vector over materials as follows.
Figure 3.8: Bayesian network subgraph showing relationship between regions and detectors.
λ(R) = α ∏_i λ_{D_i}(R)                    (3.1)
     = α ∏_i M_{D_i|R} λ(D_i)              (3.2)
where α is a normalizing constant and the matrix notation for M is defined by Pearl [50] as:
M_{y|x} ≜ P(y|x); more specifically, the (i, j)th entry of M_{y|x} is P(Y = y_j | X = x_i). Equation 3.1 follows from the conditional independence assumption on the detectors, and Equation 3.2 follows by conditioning and summing over the values of each detector. At this point, we have two options: the likelihood vector λ(R) can be passed on to the remainder of the network (i.e., attaching the subgraph to each material leaf in the original network), or only the material with the maximum likelihood (ML) can be passed on, by taking arg max_i P(M_i | D). The second option corresponds to performing inference on the subgraph off-line. Each option has advantages and disadvantages.

1. Pass up the belief in each material, given which detectors fired on the region. Maintaining uncertainty concerning the true material is desirable. Our
single-level, tree-structured network described above cannot handle such soft beliefs. The main limitation is that there is no way of deciding which material spatial relationships should be instantiated. Consider regions A and B, with likelihood vectors Λ_A = (0.5, 0.3, 0.2) and Λ_B = (0.1, 0.2, 0.7) for materials M_1, M_2, and M_3, respectively. Which relationship node should be instantiated? The combination Rel(A = M_1, B = M_3) is most likely based on the materials alone, but may no longer be the most likely once the relationship probabilities are included. If we consider all combinations of materials, this leads to an explosion in the number of relationship nodes to be considered during the inference stage, since each potential material combination must be explored.

2. Use a maximum-likelihood approach, detecting one material per region. Even though we lose the uncertainty in the detection, we chose this approach initially, since incorporating it into the network was straightforward. In the generative stage for test images, we first generate the scene as described earlier in this section, including true materials and spatial relations. Consider a single region with known material. We can approximate, by sampling, the probability of each configuration of detectors firing, and of each material being chosen by the ML approach. Doing this offline once saves time and lends insight into the true quality of the detectors. The result is a distribution of ML detections given the true material; sampling from this distribution yields a perturbed (possibly faulty) material that we substitute for the true one. We keep the spatial relationships originally generated: the composition of the region was mis-detected, not its location. We describe this approach in the following section.
Offline ML Approach to Faulty Detectors using Detector Beliefs We start by describing how to perturb a single material region. The idea is that we can use each detector's belief in the region to determine the material with the maximum likelihood.

Material Perturbation Algorithm
1. Determine which detectors fire by sampling the detector characteristics.
2. For each detector that fires, sample the belief distribution to determine the confidence in the detection.
3. Propagate the beliefs in the Bayesian network to determine the overall likelihood of each material.
4. Replace the true material with the (possibly different) material with the maximum likelihood.

We work an example using the detector characteristics given in Appendix B. Consider the sand region example in the previous section. Given the large (95%) probability that the sand detector fires, and smaller probabilities that the road and water detectors fire, it is most likely that only the sand detector fires, obviously giving no perturbation. However, it is also somewhat likely that the road detector fires as well. Say that when we sample, we find the sand detector firing with belief 0.6 and the road detector firing with belief 0.7. (That the false detector would give a larger belief than the true detector is unlikely, yet possible, and we use it to illustrate a point.) Then, calculating the combined likelihood of each material, we get:
λ(R) = α ∏_{i ∈ {KD, SD, RD, WD, BD}} λ_{D_i}(R)                                   (3.3)

where, for example (rows of the CPM correspond to sky, sand, road, water, building, and background),

λ_{D_SD}(R) = [0.05 0.95; 0.95 0.05; 0.20 0.80; 0.20 0.80; 0.25 0.75; 0.30 0.70] (0.6, 0.4)^T    (3.4)
            = (0.41, 0.59, 0.44, 0.44, 0.45, 0.46)^T                                             (3.5)

Completing similar calculations for the other four detectors yields:

λ_{D_KD}(R) = (0.05, 0.97, 0.95, 0.82, 0.95, 0.99)^T                                             (3.6)
λ_{D_RD}(R) = (0.32, 0.38, 0.64, 0.36, 0.46, 0.40)^T                                             (3.7)
λ_{D_WD}(R) = (0.52, 0.80, 0.80, 0.15, 0.85, 0.85)^T                                             (3.8)
λ_{D_BD}(R) = (0.95, 0.90, 0.85, 0.95, 0.20, 0.90)^T                                             (3.9)
Multiplying yields:
λ(R) = α (0.0032, 0.1566, 0.1819, 0.0185, 0.0334, 0.1394)^T

In this somewhat unlikely case, because the road material has the greatest likelihood, the ML approach would choose it. This sampling method can be used on-line to perturb materials. However, if we are not interested in the underlying belief in each material (only in which has the maximum likelihood), we can create a new distribution from which we can sample directly to perturb the material. The idea is that we determine, for each true material, the probability of each material being detected in its place.
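The worked example is easy to reproduce numerically. The sketch below recomputes equations 3.3 through 3.5 from the sand-detector characteristics quoted earlier in this section, and takes the remaining four likelihood vectors directly from equations 3.6 through 3.9 (their underlying CPMs are in Appendix B and are not repeated here).

import numpy as np

MATERIALS = ["sky", "sand", "road", "water", "building", "background"]

# Sand-detector CPM rows (P(fires | material), P(does not fire | material)),
# from the table earlier in this section.
M_SAND = np.array([[0.05, 0.95], [0.95, 0.05], [0.20, 0.80],
                   [0.20, 0.80], [0.25, 0.75], [0.30, 0.70]])

# Equations 3.4-3.5: the sand detector fired with belief 0.6.
lam_sd = M_SAND @ np.array([0.6, 0.4])
print(np.round(lam_sd, 2))                    # [0.41 0.59 0.44 0.44 0.45 0.46]

# Likelihood vectors of the other detectors, taken directly from equations 3.6-3.9.
lam_kd = np.array([0.05, 0.97, 0.95, 0.82, 0.95, 0.99])
lam_rd = np.array([0.32, 0.38, 0.64, 0.36, 0.46, 0.40])
lam_wd = np.array([0.52, 0.80, 0.80, 0.15, 0.85, 0.85])
lam_bd = np.array([0.95, 0.90, 0.85, 0.95, 0.20, 0.90])

# Equation 3.3 (up to the normalizing constant alpha), then the ML material.
lam = lam_kd * lam_sd * lam_rd * lam_wd * lam_bd
print(np.round(lam, 4))                       # approx. (0.0032, 0.1566, 0.1819, 0.0185, 0.0334, 0.1394)
print(MATERIALS[int(np.argmax(lam))])         # "road": the perturbed (mis-detected) material

Running this procedure many times with sampled detector firings and beliefs, and tallying the winning material for each true material, is the offline process that produces Table 3.2.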
Table 3.2: Distribution resulting from offline sampling procedure.
  True                    Detected Material
  Material   Accuracy     Sky        Sand       Road       Water      Bldg       Bkgrd
  Sky        78%          0.78204    0.00951    0.008457   0.060073   0.005972   0.133948
  sand       73%          0.005474   0.726594   0.030825   0.034847   0.015689   0.186571
  road       62%          0.011089   0.04848    0.620218   0.041971   0.039691   0.238551
  water      62%          0.080188   0.048161   0.034062   0.617168   0.008486   0.211935
  building   57%          0.012262   0.058007   0.073957   0.034174   0.573262   0.248338
We achieve this off-line by simply running the Material Perturbation Algorithm repeatedly, keeping tallies of each material detected, and normalizing. Using a given set of five detector characteristics (see Appendix B), we find the distributions shown in Table 3.2. Note that under this ML framework, the detectors are not as good as they first appeared: the most common detector error is for none of the detectors to fire (or to fire only with low belief), causing the region not to be detected at all. Because we assume the orientation of the image is known, we could also incorporate an orientation model into the material fusion module, which prior work suggests yields a substantial gain in accuracy (11% in [40]). Two caveats are worth noting. First, we assume that the detectors' CPMs are independent, which may not be true in practice: whatever causes one detector to fire falsely may be correlated with what causes another to fire. Second, not using orientation information in the detectors gives poor detection overall, so we can expect better results when we do use it; alternatively, we can simply interpret these results as what we would obtain with bad detectors.
3.2.4 Background Regions
We define a background region as any substantial region not detectable or not of interest to us (e.g., people in a beach scene). We do not include the small borders
between materials. Background regions can be treated like other material regions: their probability of occurrence and their spatial relationships with other regions can be calculated. It is not yet clear whether spatial relationships involving the background would be helpful. When our framework includes faulty detectors, there is always a chance that true materials of interest will not be detected by any detector (missed regions), and would therefore be considered background, or that background regions will falsely be detected as materials of interest (false positives). We currently account for misses by ignoring (not instantiating) the background and any relationships involving it. We require several extensions to the initial generative model in MASSES:

1. Model the probability of background regions and their relationships with the materials of interest.
2. Generate background regions (even if we do not use them during inference).
3. Allow them to be perturbed to become materials of interest (false positives).
4. Modify the conditional probabilities to make the model more robust to errors; for instance, 0s in the CPMs should be changed to a small ε.

A possible side-effect of modeling the background regions is that once we hypothesize a scene type, we can attempt to infer the identity of unknown background regions.
4 Experimental Results
We have hypothesized that spatial relationships can be used to distinguish certain scene types and that they can mitigate the effects of faulty detectors. We present in this section our experiments and results.
4.1 Best-case Detection in MASSES
As described, the generative model in MASSES is a Bayesian network with conditional probabilities based on our training data. We used MASSES to generate a number of scenes and their relationships. We then classified them using two Bayesian networks: the one with spatial information used to generate the scenes, and another in which the spatial relationship nodes are removed.

1. B-scenes always contain water, sky, and sand/gravel. The sky is above or far above the other materials, and the water tends to lie above the sand, although it occasionally lies to the side of or below the sand (accounting for certain less-likely perspectives).

2. L-scenes always contain water and sky, and often contain sand/gravel. However, the sand tends to lie at the horizon line, above the water.
3. U-scenes contain buildings and sky, and often contain roads.

We have modeled these three simulated scene types after beach, lake/ocean, and urban scenes. They are similar to their real-life counterparts, except that we model the land at the horizon as sand/gravel (since the horizon may be confused with these materials in images). The adjectives "occasionally" and "often" in the definitions above are quantified by the conditional probability distributions in the Bayesian network from which we are sampling. We based the CPMs for the B- and U-scenes on those learned from data (120 beach scenes and 100 urban scenes), using only a subset of the materials present: sky, water, sand, building, and road. We hand-modeled the L-scene CPMs ourselves. The scene priors were chosen to be uniform. We expect the B- and L-scenes to be difficult to discriminate using materials alone, since they often contain the same materials; however, the location of the sand is usually different, causing differences in their spatial layouts. The U-scenes should be classified correctly with or without spatial information; we have designed them for use in later experiments involving faulty detectors. Accuracy on 100,000 scenes was 89.94% without spatial information and 96.99% with spatial information, a gain of 7% (Table 4.1). The absolute performance gain is not important; what is important is that the hypothesis was verified under the assumption of best-case material detection.

Table 4.1: MASSES with best-case material detection: accuracy with and without spatial information.
  Data Size    Material Only    Spatial Information Added
  100,000      89.94%           96.99%
Success Spatial relationships were most helpful in discriminating L-scenes containing sand, water, and sky from B-scenes. Scenes of this type were classified
correctly (and with high confidence) if the sand region was between the sky and the water regions.
Baseline Success The urban scenes could be discriminated from the other scene types using material presence alone. In our simplified model, buildings were present if and only if the image was a U-scene, so this was expected. Furthermore, L-scenes not containing sand were correctly classified as well, without spatial relationships.
Continued Failure What scenes escaped even spatial relations? First, with small probability, the sand region was below the water region in some L-scenes; these were mistaken for B-scenes. Second, when the sand and water regions were next to each other, the two scene types were often confused. In these cases, the final likelihoods (and hence the scene type) were determined by the spatial relations between each region and the sky region: in B-scenes, water tends to lie below the sky and sand far below the sky, whereas in L-scenes the opposite is true. The small fraction of images not obeying this relationship were confused. Third, some simulated scenes contained seemingly conflicting relations, primarily due to the "far" modifier. Take, for instance, a scene with water below the sky and sand far below the sky. We could reason from this that the sand should lie below the water, yet often the sand would lie above the water. Because we assumed each pairwise relationship to be independent of every other relationship, this situation could occur.
4.2 Best-case Detection on Beach Photographs
We classified the 120 material-labeled beach scenes in our collection using the same single-level Bayesian networks used in MASSES. As stated earlier, the
MASSES net is designed to distinguish between the B-, L-, and U-scenes.

Table 4.2: Beach photographs with best-case material detection: accuracy with and without spatial information.
  Dataset Size    Material Only    Spatial Information Added
  120             100%             116/120 = 96.67%
Table 4.2: MASSES with best-case material detection: Accuracy with and without spatial information. These results are not surprising, given our definitions. We had defined beach scenes (see Table 3.1) as ones with at least 5% of the image occupied by each of water, sand, and sky regions, with no regard for the spatial relationships between these regions. Therefore, we expect accuracy to be 100% without spatial information. Furthermore, we are not surprised that accuracy went down when spatial information was added to the inference process. Upon observation of the images, three contained sand above water, due to camera location (e.g., on a boat), and could be classified as L-scenes. They all contain large water regions and small sand regions.
Figure 4.1: Images incorrectly classified due to spatial relationships. The last image on the right contains sand beside the water (which alone is ambiguous), but the sky being closer to the sand causes a higher belief in lake. The rationale for calling it a beach scene is that the sand region dominates the water region, which has nothing to do with spatial relationships.
4.3 Faulty Detection on Beach Photographs
Before the inference stage, we perturb the results according to the method described in Section 3.2.3. Note that the detector model is weak, so we can expect a large percentage of the materials to be misclassified. Results are shown in Table 4.3.

Dataset Size    Material Only        Spatial Information Added
120             110/120 = 91.67%     108/120 = 90.00%

Table 4.3: MASSES with faulty material detection: Accuracy with and without spatial information.

The 10 images missed when the material-only Bayesian network was used all contained undetected sand. These fall into two categories.

1. Eight images contain sky, water, and background, and are thus classified as L-scenes (Figure 4.2).
Figure 4.2: Images incorrectly classified using faulty detectors and no spatial relationships. The actual materials are shown in the top row; the detected materials are shown below each.

2. The other two images also had a building falsely detected, and were thus classified as U-scenes (Figure 4.3).
Figure 4.3: Images incorrectly classified using faulty detectors.

The 12 images missed when the spatial relationships were added to the Bayesian network included the 10 missed above. In addition, spatial relationships caused an incorrect classification in two cases.

1. Spatial relationships were unable to rescue the two images above in which the sand was missed and a building was falsely detected; they were still classified as U-scenes. In the first image, spatial relationships should be able to catch the mis-detected regions: the building and road regions are beside each other, and the building is far below the sky region, both of which are unlikely. However, the belief in the U-scene still dominated that in the other scenes. Mis-detection in the second image was such that it was indistinguishable from a normal beach scene.

2. Spatial relationships were unable to rescue the eight images containing sky, water, and background: they were still classified as L-scenes.

3. Two additional images were misclassified (Figure 4.4). Both contain a region misclassified as sand above the water region, causing them to be classified as L-scenes. In the first image, a large sand region was undetected, and the small remaining sand region was above the water, leading to a higher belief in an L-scene than a B-scene. The second image continued to be misclassified as an L-scene, due to camera location (as above).
Figure 4.4: Images incorrectly classified using faulty detectors.

We reiterate that in the current model of faulty detectors, we threshold the combined material beliefs and replace the true material with the material most likely to be detected. That the underlying material beliefs are not used in inference is a severe limitation that must be addressed. Take, for example, a true sand region that has 49% belief in sand and 51% belief in background. By allowing only binary belief, we call it background and eliminate any chance that it could still be sand, even if the spatial relationships imply that it would fit the model better as a sand region. We plan to overcome this and other limitations in the thesis, as we now describe in Chapter 5.
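For concreteness, the current binary perturbation can be sketched as follows; the detector firing probabilities are taken from Appendix B (restricted to four materials for brevity), but the function itself is an illustrative stand-in for the simulator code, not the code itself:

```python
import random

# Sketch of the current (binary) faulty-detector model -- illustrative only.
# P_FIRE[detector][true_material] = P(detector fires | true material),
# values from Appendix B, restricted to four materials for brevity.
P_FIRE = {
    "sky":      {"sky": 0.95, "sand": 0.03, "water": 0.18, "building": 0.05},
    "sand":     {"sky": 0.05, "sand": 0.95, "water": 0.20, "building": 0.25},
    "water":    {"sky": 0.48, "sand": 0.20, "water": 0.85, "building": 0.15},
    "building": {"sky": 0.05, "sand": 0.10, "water": 0.05, "building": 0.80},
}
DETECTORS = list(P_FIRE)

def perturb(true_material, rng=random):
    """Fire each detector independently, then keep only the single most
    confident label (binary belief), discarding the soft scores."""
    beliefs = {}
    for det in DETECTORS:
        p = P_FIRE[det][true_material]
        fired = rng.random() < p
        beliefs[det] = p if fired else 0.0    # soft belief, then...
    winner = max(beliefs, key=beliefs.get)    # ...collapsed to one hard label
    if beliefs[winner] == 0.0:
        winner = "background"                 # nothing fired
    return winner

if __name__ == "__main__":
    random.seed(0)
    print([perturb("sand") for _ in range(10)])   # occasionally mis-detected
```

The soft scores computed along the way are exactly what the binary model throws away, which is the limitation discussed above.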
5 Proposed Research
Automatically determining the semantic classification of images is useful, among other things, for context-sensitive image enhancement and image cataloging. As we reviewed in Chapter 2, most current systems use a standard pattern recognition approach, treating images as high-dimensional vectors of low-level features. Due to advances in robust classifiers such as Support Vector Machines, these systems have achieved some success in limited domains. However, the “semantic gap” between low-level pixel representations of images and the high-level understanding of them we seek is still wide. We believe that more expressive features, such as the output of semantic object and material detectors, are needed in order to bridge this gap and thus advance the state of the art in scene classification. While semantic detectors are not perfect, they continue to increase in speed and accuracy; they are mature enough now to be reliably useful. We have access to state-of-the-art detectors for blue and mixed sky, grass, skin, and faces, as well as less-reliable detectors for water and snow fields, still under development [39; 38; 67]. Current research, still in its infancy, into related frameworks that rely on high-level features like these is promising [37; 44; 48; 70]. However, no scene classification system makes use of the spatial relationships between the objects
detected. We have chosen to use a probabilistic graphical model to perform inference, because our input and desired output are probabilistic and we can learn the model parameters from real data. Our experience developing the Bayesian tree that served as our initial inference mechanism was encouraging, but we found it not expressive enough to handle evidence from faulty detectors when that evidence corresponds to individual regions and their spatial relationships. With this in mind, our central claim is as follows: Spatial modeling of semantic objects and materials can be used to disambiguate certain scene types as well as mitigate the effects of faulty detectors. Furthermore, an appropriate probabilistic inference mechanism must be developed to handle the loose spatial structure found in real images.
5.1 Research Plan
Our research plan can be described succinctly:

• Identify the probabilistic inference mechanism, either Markov Random Fields or Bayesian networks, most appropriate for modeling scenes for semantic classification.

• Extend the chosen framework, because neither fits our needs exactly.

• Measure the efficacy both of our features and of our inference mechanism by conducting experiments on both simulated and real images.

We define the input and output of our system, then elaborate on our research plan in the following sections.
System Input: Fused detector belief maps

We assume the existence of a bank of relevant object and material detectors (e.g., sky, water, sand), each of which produces a probabilistic belief map. In the belief map, each material region is labeled with the corresponding belief in that material, from 1 (highest confidence that the region is the material) down to 0 (no confidence that the region is the material). We also assume the existence of a mechanism for fusing the material beliefs of each region and for adjudicating between discrepancies in the segmentation induced by each map. Material detectors and the associated fusion mechanism are documented in the literature [40; 39]. A simplified input to the system is shown in Figure 5.1.

Figure 5.1: Classifier input (labeled regions) and output (classification and confidence).
System Output: Semantic label and confidence

We seek first to output one of a small (e.g., fewer than 12) set of discrete labels for the scene classification. As two examples, in our simulated experiments we used B-, L-, and U-scenes, and in an outdoor scene classifier we might use beach, lake, urban, field, mountain, picnic, and forest. We would also like to output a belief in at least the most likely scene type (if not in each scene type). The probabilistic framework allows us to output these beliefs, which can be thresholded to label a scene as "unknown".
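As a purely illustrative data-structure sketch (the dictionary layout and the label_scene helper are assumptions, not an implemented interface), the fused input of Figure 5.1 and the thresholded output might look like this:

```python
# Hypothetical data structures for the classifier input and output.
# Input: one fused belief dictionary per segmented region (values from Figure 5.1).
regions = [
    {"sky": 0.78, "background": 0.12, "water": 0.07},
    {"water": 0.53, "sky": 0.36},
    {"sand": 0.68, "rock": 0.22, "background": 0.10},
]

def label_scene(scene_beliefs, threshold=0.5):
    """Return the most likely scene label, or 'unknown' if its belief is too low."""
    best = max(scene_beliefs, key=scene_beliefs.get)
    return best if scene_beliefs[best] >= threshold else "unknown"

# Output of the (not yet implemented) inference step, then thresholded:
scene_beliefs = {"beach": 0.73, "lake": 0.23, "urban": 0.04}
print(label_scene(scene_beliefs))   # -> "beach"
```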
5.1.1 Why a New Inference Mechanism?
As we have seen, the generative model (single-level Bayes net) of our MASSES simulator generates materials, not individual regions. Under best-case material detection, this is not problematic. However, when materials are falsely detected, it is limited. Consider the case shown in Figure 5.2. Initially, we have the relationships (1) sky Above water, (2) sand Above water, and (3) sky Above sand. Consider the case where the water is mis-detected as sky. Now the relationships are (1) sky Above sky, (2) sand Above sky, and (3) sky Above sand. Relation (1) can be safely ignored if we consider it to be a sky region broken into two pieces. However, relations (2) and (3) are contradictory.
Figure 5.2: An image in which a region is mis-detected, creating contradictory spatial relationships in the material-based inference scheme.

It is not clear how to handle conflicting evidence. Splitting belief within a single relationship node is not correct, and adding multiple nodes to capture multiple binary relationships between two materials is ad hoc. The limitation seems to stem from the current system's architecture, which encodes only materials, and not individual regions, in the nodes. It is true that some non-adjacent regions of the same material could be merged using model-based or belief-based grouping [67] (creating occlusion relationships in the process). However, since multiple regions may have different associated material beliefs, information is lost.
Because the underlying material beliefs should be useful in inference, it appears better to encode each region (and the corresponding relations between them) distinctly.
5.1.2 Bayes Nets vs. MRFs
It is clear that the single-level Bayesian network is not expressive enough to encode our knowledge of the world. We have identified two potential frameworks to explore: general-topology Bayesian networks and Markov Random Fields.
General Bayesian Networks

If the network topology is allowed to deviate from a strict tree structure, Bayesian networks are a natural framework in which to encode our knowledge. Specifically, we must model individual regions to incorporate material beliefs. Figure 5.3 shows an example of such a framework. Note that we cannot simply replace the material nodes with region nodes; doing so would preclude the expression of knowledge such as "there is no sky present in the image".

Figure 5.3: Proposed DAG Bayesian network, with nodes for the scene class, materials, and material spatial relations in one layer and for regions and region spatial relations in another. Note that separate material and region layers are needed.

The network shown is a two-level Bayesian network, in which one level corresponds to the materials present and the other level corresponds to the individual regions.
The two levels are combined using a fuzzy-OR logical relation: for example, if any region is detected as sky, then the sky node is activated. However, we have some concerns about this network. First, while the approximate inference algorithms discussed in Section 2.4.1 are designed for networks with loops, it is unclear how well they would perform on dense graphs such as this one. Pearl hints that both the clustering and conditioning algorithms are tractable only for sparse graphs. We need to investigate the stochastic simulation and belief propagation algorithms further. Second, it is not clear how to quantify the links connecting the region and material nodes, because they represent logical relationships between variables with multiple hypotheses. Formal mechanisms such as the noisy-OR gate are designed for binary variables. Furthermore, given the debate over the relative merits of logic and probability [14; 50], mixing the two by adding logical constraints to a probability model raises technical red flags in our minds.
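For intuition about the region-to-material links, the following minimal sketch shows one candidate quantification, a noisy-OR-style aggregation of per-region beliefs into a material-level belief; it is only a sketch, and it sidesteps the multi-valued-hypothesis issue raised above:

```python
# Sketch: fuzzy/noisy-OR aggregation of per-region sky beliefs into a single
# material-level "sky present" belief.  One candidate quantification only.

def noisy_or(region_beliefs):
    """P(material present) = 1 - prod_r (1 - Bel_r(material))."""
    p_absent = 1.0
    for b in region_beliefs:
        p_absent *= (1.0 - b)
    return 1.0 - p_absent

print(noisy_or([0.78, 0.36]))   # two regions with some sky belief -> ~0.86
print(noisy_or([]))             # no regions detected as sky       -> 0.0
```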
Markov Random Fields

Markov Random Fields are an attractive alternative to Bayesian networks. The lattice structure typically used models spatial relationships directly. Furthermore, two-layer MRFs (Figure 2.2) were designed to infer interpretations of underlying scenes from uncertain evidence. For this reason, they have been applied with success to low-level image-processing problems such as restoration [23], segmentation, and depth map integration [16]. One approach to using MRFs for scene classification would be to create an array of scene-specific MRFs and run them all in parallel. (While seemingly inefficient, the number of scenes of interest is fairly small.) We then classify the image with the scene corresponding to the MRF producing the region configuration with
the highest posterior probability. For each scene-specific MRF, the top level of the MRF corresponds to the combined material beliefs in each region. The bottom level corresponds to the scene model, which is given by a distribution of "typical" material configurations for that scene. Some reasons why the MRF is attractive include:

1. An available code base. We are nearing completion of a basic implementation of MRFs with inference using the belief propagation algorithm [7]; a sketch of the style of computation involved appears after this list. I could then extend this framework (as discussed in the next section) to accommodate the scene classification problem.

2. Ease of representing multiple regions of the same material.

3. Direct use of detector belief values.

4. Solution to a related problem, that of spatial- and scene-context-aware material detection, because the configuration with the smallest energy corresponds to the most likely material labeling.
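The code base mentioned in item 1 is not reproduced here; as a rough illustration of the kind of computation involved, the sketch below runs synchronous sum-product belief propagation on a tiny region-adjacency graph with hand-made unary and pairwise terms (all labels and values are placeholders):

```python
# Rough sketch of sum-product belief propagation on a region-adjacency MRF.
# The labels, potentials, and tiny graph are all illustrative placeholders.
LABELS = ["sky", "water", "sand"]

def unary(region):                      # detector evidence phi_i(x_i)
    return region["beliefs"]

def pairwise(xa, xb):                   # compatibility psi_ij(x_i, x_j); placeholder
    return 0.9 if xa == xb else 0.3

def run_bp(regions, edges, iters=5):
    directed = edges + [(b, a) for (a, b) in edges]
    msg = {(i, j): {l: 1.0 for l in LABELS} for (i, j) in directed}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            out = {}
            for xj in LABELS:
                total = 0.0
                for xi in LABELS:
                    incoming = 1.0
                    for (k, t) in msg:          # messages into i, except from j
                        if t == i and k != j:
                            incoming *= msg[(k, i)][xi]
                    total += unary(regions[i]).get(xi, 1e-6) * pairwise(xi, xj) * incoming
                out[xj] = total
            z = sum(out.values())
            new[(i, j)] = {l: v / z for l, v in out.items()}
        msg = new
    beliefs = []                                # node beliefs: evidence * incoming messages
    for i, r in enumerate(regions):
        b = {l: unary(r).get(l, 1e-6) for l in LABELS}
        for (k, t) in msg:
            if t == i:
                for l in LABELS:
                    b[l] *= msg[(k, i)][l]
        z = sum(b.values())
        beliefs.append({l: v / z for l, v in b.items()})
    return beliefs

regions = [{"beliefs": {"sky": 0.7, "water": 0.2, "sand": 0.1}},
           {"beliefs": {"sky": 0.4, "water": 0.5, "sand": 0.1}}]
print(run_bp(regions, edges=[(0, 1)]))
```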
5.1.3 Extend the Chosen Framework
Neither MRFs nor Bayesian networks fit our needs exactly. We briefly describe shortcomings in each framework and our initial ideas for extending them.
Extensions to Bayesian Networks

1. We argued that material regions must be encoded directly as nodes in the Bayesian network. However, the number of regions is unknown in advance. This would necessitate the ability to build the network dynamically during the inference process. The alternative, that of creating more nodes than should be necessary and only instantiating those actually present in the image, is possible, yet seems ad hoc.

2. Our network requires the use of logical functions of multi-valued hypotheses. However, the canonical models, such as noisy-OR gates, are designed for binary-valued hypotheses.

3. In cases where the detector evidence is weak (or evidence is conflicting, the configurations are unlikely, or the scene is unmodeled), we would like to label the image as "unknown". This necessitates modeling an "other" category, which would require a large training set. This problem is also much more difficult than forced-choice classification.
Extensions to Markov Random Fields

MRFs are typically used on a single scene model and generate a MAP estimate of the scene given the observed image. To approximate the MAP, one does not need to calculate the actual probabilities:

    arg max_S P(S | I) = P(I | S) P(S) / P(I)        (5.1)
                       ∝ P(I | S) P(S)               (5.2)
We are proposing a novel use of MRFs: classification problems. We are motivated by the successful use of Hidden Markov Models in speech recognition, where HMM models for each word are run in parallel and their probabilities are compared to decide which word is most likely. Inspired by this, we envision a series of scene-specific MRF models that can be run in parallel. Each MRF produces a MAP estimate of the scene labeling and the probability of occurrence within that model (e.g.,
the probability of the image of that scene). We label the image with the class of the MRF model producing the "maximum of the MAP estimates" (MMAP); a skeleton of this decision rule appears after the following list. Of course, cross-model comparison necessitates normalization, which is our first challenge.

1. Because we are comparing multiple models, the probabilities produced by each model must be normalized to the same scale. Computing the normalization constant is intractable in general. We have two ideas for overcoming this. First, using a small number of nodes allows direct computation of the normalization constant; having a node correspond to each region, or using a small (e.g., 4 × 4) square lattice, should suffice. Second, we may be able to approximate the normalization constant by sampling the scene models using a Monte Carlo method such as Gibbs sampling (Section 2.4.2). This may prove too computationally expensive, but it is worth investigating.

2. While MRFs encode spatial relationships explicitly, there is no work in the literature demonstrating their use in high-level vision. Whether compatibility functions can be generalized beyond the stationary, isotropic, universal pixel- or edgel-level relationships used in the literature is an interesting open problem and an opportunity to make an intellectual contribution.

3. Specifically, the compatibility functions in MRFs are typically stationary; ours depend on the location in the image. For example, in a beach scene, sky next to sky may be very compatible at the top of the image, but completely incompatible at the bottom of the image.

4. Because they operate at the pixel level, typical MRFs for vision are constructed on regular lattices. Our high-level vision problem starts with segmented regions of irregular shape. Two options seem possible:

(a) Encode irregular regions and spatial relationships using a general graph structure. The literature shows few departures from the lattice structure, except for use in hypertext link modeling [11] and object recognition [19]. Our approach would be similar to the latter usage, since Felzenszwalb and Huttenlocher (Section 2.4.2) were modeling spatial relationships between object parts. However, scene configurations are less constrained than object configurations, so the spatial probability model must be more general than the multivariate Gaussians they used.

(b) Overlay a lattice on the regions, but constrain belief propagation so as to remain consistent with the segmentation given by the detectors. The open question is how this will affect the theoretical properties of probabilities or energy minimization. One option might be to propagate as usual, and then reconcile the lowest-energy configuration with the segmentation map at the end.
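Abstracting away from the MRF internals, the MMAP decision rule itself is simple. The skeleton below assumes a hypothetical score_under_model function that already returns comparably normalized probabilities, which is precisely the normalization challenge of item 1:

```python
# Hypothetical skeleton of the MMAP decision rule over scene-specific models.

def score_under_model(model, observation):
    """Stand-in: would return the normalized MAP probability of `observation`
    under one scene-specific MRF.  Here it is a toy lookup for illustration."""
    return model["toy_scores"].get(observation, 1e-9)

def classify_mmap(models, observation):
    """Run every scene model 'in parallel' and label with the best one."""
    scores = {name: score_under_model(m, observation) for name, m in models.items()}
    return max(scores, key=scores.get), scores

models = {
    "beach": {"toy_scores": {"sand-below-water": 0.1, "water-below-sand": 0.7}},
    "lake":  {"toy_scores": {"sand-below-water": 0.6, "water-below-sand": 0.1}},
}
print(classify_mmap(models, "water-below-sand"))   # -> ('beach', ...)
```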
5.1.4 Analyze the Effect of Detector Quality
As we argued in Chapter 3, one benefit of working in simulation is that we are not dependent on the actual detectors used. We plan to investigate the relationship, which we expect to be monotonic, between material detector performance and classification performance.
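The planned study is essentially a simulation loop; the sketch below shows only its shape, with make_scene and classify as hypothetical stand-ins for the MASSES components:

```python
import random

# Sketch of the planned study: vary detector accuracy, record classification accuracy.
# make_scene() and classify() are hypothetical stand-ins, not the MASSES components.

def make_scene(rng):
    return rng.choice(["beach", "lake", "urban"])

def classify(scene, detector_accuracy, rng):
    # Stand-in classifier: correct whenever the (single) detection is correct.
    return scene if rng.random() < detector_accuracy else rng.choice(["beach", "lake", "urban"])

def accuracy_vs_detector_quality(qualities, trials=10000, seed=0):
    rng = random.Random(seed)
    results = {}
    for q in qualities:
        correct = 0
        for _ in range(trials):
            scene = make_scene(rng)
            if classify(scene, q, rng) == scene:
                correct += 1
        results[q] = correct / trials
    return results

print(accuracy_vs_detector_quality([0.5, 0.7, 0.9, 1.0]))
```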
5.1.5 Evaluate Explicit Spatial Relationships
In our experiments, we demonstrated that modeling spatial relationships can help distinguish between certain types of scenes. Our goal here is to isolate the value
added by explicitly modeling the relationships. We will compare the proposed probabilistic framework with a framework that uses semantic features but models spatial information implicitly: we will implement a pattern recognition classifier (e.g., a Support Vector Machine) and compare it to the chosen inference mechanism. For example, an SVM classifier with block-based belief features encodes spatial relationships implicitly (as position within the feature vector).
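As a sketch of what such an implicit-encoding baseline could look like (the grid size, material list, and use of scikit-learn are assumptions for illustration, not a committed design), per-block material beliefs are concatenated so that spatial layout is carried only by position in the feature vector:

```python
import numpy as np
from sklearn import svm

# Sketch: implicit spatial encoding via block-based belief features + SVM.
# Grid size, material list, and the fake data below are illustrative only.
MATERIALS = ["sky", "water", "sand", "building", "road"]
GRID = 4   # 4 x 4 blocks

def block_features(belief_maps):
    """belief_maps: dict material -> (H, W) array of per-pixel beliefs in [0, 1].
    Returns a GRID*GRID*len(MATERIALS) vector of block-averaged beliefs."""
    feats = []
    for m in MATERIALS:
        bmap = belief_maps[m]
        h, w = bmap.shape
        for by in range(GRID):
            for bx in range(GRID):
                block = bmap[by * h // GRID:(by + 1) * h // GRID,
                             bx * w // GRID:(bx + 1) * w // GRID]
                feats.append(float(block.mean()))
    return np.array(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake training data: random belief maps standing in for detector output.
    X = np.stack([block_features({m: rng.random((64, 64)) for m in MATERIALS})
                  for _ in range(20)])
    y = rng.integers(0, 3, size=20)      # fake scene labels: 0=beach, 1=lake, 2=urban
    clf = svm.SVC(kernel="rbf").fit(X, y)
    print(clf.predict(X[:3]))
```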
5.1.6 Evaluate Semantic Features
In order to validate our claims about semantic features and spatial relationships, we must compare the efficacy of our system with a system using low-level features and pattern classification techniques. We plan to use a benchmark set of outdoor scene images.
5.1.7 Generalize to Other Scene Classes
Our proposed framework is not specific to outdoor scenes; the probabilities of the presence of materials, and of the spatial relationships between them, are learned from data. We plan to experiment with other object-centric scene types that have spatial structure. One example is outdoor scenes dominated by a main subject (e.g., picnics, backyards). Since the main subject in typical consumer images is a person, adding a person detector could handle these cases. The only extension to the scene model would be to include a model of background (non-people, non-object) regions, including statistics of their spatial relationships with the modeled objects. A second, more challenging example is indoor scenes. Office environments are dominated by desks and people. Living room scenes tend to contain people on sofas. Conference rooms contain a number of people arranged around a table. While the spatial relationships vary more in scenes captured at close range, this will test the generalizability of our methods.
5.1.8 Explore Potential Long-term Directions
We have identified a number of potential dimensions along which we can expand our thesis.

• Temporal modeling: photographs are not taken in a vacuum. Scenes occur in loosely-connected groupings of events, like a comic strip or "storyboard". Exploring these groupings probabilistically gives a third, temporal dimension to our modeling.

• Exploring the use of camera metadata (timestamps, focal distance, flash firing) to improve classification. This evidence is probabilistic in nature and must be incorporated appropriately.

• Exploring the combination of low- or mid-level features (e.g., color, texture, depth cues, other elements of the "spatial envelope" [47]) and semantic features: framework, control, cost. Some researchers, in particular Oliva and Torralba [47], argue that human observers can recover scene identity even when objects are unrecognizable in isolation, and thus propose the "scene-centered" features discussed in Section 2.2.3. This appears to be true in certain cases, but I agree with Oliva and Torralba's assessment that their approach is complementary to approaches like the proposed one, which are based on semantic objects. An interesting direction would be to begin the process of uniting the two complementary approaches.
• Comparison of the two schemes in Section 2.3.1 (AVS and Weighted Walkthrough) for computing qualitative relations in order to achieve a balance between computational cost and perceptual integrity.
5.2 Issues Not Addressed in This Thesis
Our experiments revealed some ambiguity in the distinction between real beach and lake photographs. There is more to semantic understanding than materials and spatial relations. For example, if I take a picture while standing on the beach looking out over the water, and no sand is in the image, we think of it as a lake scene. If a large sand region dominates the foreground, then we consider it to be a beach scene. If the camera is positioned above the water and its field of view contains the beach, it may be classified as either: if I am standing ten feet from the shore, it would be a beach semantically; if I am on a boat three miles from shore, it would be a lake semantically. In either case, our system classifies the image as a lake, because sand appears above water in the image. One limitation of our system on real images is that we consider material presence and spatial relationships only; the relative size of regions also appears to be helpful [44].
5.3 Research Schedule
The proposed research schedule includes the following:

• Compare Bayesian networks and Markov Random Fields (2 months).
• Implement the chosen framework (2 months).
• Performance analysis on natural images (1 semester).
• Evaluate semantic features by implementing a low-level framework and comparing it to the proposed system (2 months).
• Evaluate explicit modeling of spatial relationships by implementing a pattern recognition classifier and comparing it to the proposed framework (2 months).
• Temporal context (or other appropriate) extension (1 semester).
• Write dissertation (1 semester).

Accounting for one summer working in industry, the anticipated defense date is May 2005.
6 Acknowledgments
This research was supported by a grant from Eastman Kodak Company, by the NSF under grant number EIA-0080124, and by the Department of Education (GAANN) under grant number P200A000306.
Bibliography
[1] Dana H Ballard. An Introduction to Natural Computation. MIT Press, Cambridge, MA, 1997. [2] Dana H Ballard and Christopher M Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, NJ, 1982. [3] J. Batlle, A Casals, J Freixenet, and J Mart. A review on strategies for recognizing natural objects in colour images of outdoor scenes. Image and Vision Computing, 18(6-7):515–530, May 2000. [4] D.C. Becalick. Natural scene classification that avoids a segmentation stage. Technical report, ECE Department, Imperial College of Science, Technology, and Medicine, 1996. [5] S. Berretti, A. Del Bimbo, and E. Vicario. Spatial arrangement of color in retrieval by visual similarity. Pattern Recognition, 35:1661–1674, 2001. [6] Matthew Boutell, Jiebo Luo, and Christopher Brown. Review of the state of the art in semantic scene classification. Technical Report 799, University of Rochester, Rochester, NY, December 2002. [7] Matthew Boutell and Brandon Sanders. Markov random fields: An overview. Technical report, University of Rochester, Rochester, NY, 2003. in preparation.
[8] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998. [9] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Recognition of images in large databases using a learning framework. Technical Report 97-939, U.C. Berkeley, 1997. [10] C. Carson, S. Thomas, M. Belongie, J.M. Hellerstein, and J. Malik. Blobworld: a system for region-based image indexing and retrieval. In Third Intl. Conf. on Visual Information Systems. Springer-Verlag, June 1999. [11] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In ACM SIGMOD, 1998. [12] Peng Chang and John Krumm. Object recognition with color co-occurrence histograms. In IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, June 23-25 1999. [13] Eugene Charniak. Bayesian networks without tears. AI Magazine, 12(4):50–63, April 1991. [14] P. Cheeseman. An inquiry into computer understanding (with discussion). Computational Intelligence, 4:57–142, 1988. Further discussion appears in 6:179–192. [15] Rama Chellappa and Anil Jain, editors. Markov Random Fields: Theory and Application. Academic Press, San Diego, CA, 1993. [16] Paul Chou. The Theory and Practice of Bayesian Image Labeling. PhD thesis, University of Rochester, Rochester, NY, 1988. [17] Robert Cowell. Advanced inference in bayesian networks. In Michael I. Jordan, editor, Learning in Graphical Models, volume 89 of NATO Science Series D. Kluwer Academic Publishers, 1998.
[18] R. Duda, R. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, Inc., New York, 2nd edition, 2001. [19] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient matching of pictorial structures. In IEEE Computer Vision and Pattern Recognition, pages 66–73, 2000. [20] Forsyth and Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2002. [21] D. Forsyth, J. L. Mundy, A. Zisserman, C. Coelho, A. Heller, and C. Rothwell. Invariant descriptors for 3-d object recognition and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:971–991, October 1991. [22] W.T. Freeman, E.C. Pasztor, and O.T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):24–57, October 2000. [23] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, November 1984. [24] A. Hauptmann and M. Smith. Text, speech, and vision for video segmentation: The informedia project. In AAAI Symposium on Computational Models for Integrating Language and Vision, Fall 1995. [25] M. Henrion. Propagation of uncertainty by probabilistic logic sampling in bayes’ networks. In J. F. Lemmer and L. N. Kanal, editors, Uncertainty in Artificial Intelligence, pages 149–164. Elsevier Science Publishers B.V., 1988. [26] P. Hong, T. Huang, and R. Wang. Learning patterns from images by combining soft decisions and hard decisions. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 78–83, Hilton Head, SC, June 13-15 2000.
[27] Q. Iqbal and J. K. Aggarwal. Combining structure, color and texture for image retrieval: A performance evaluation. In International Conference on Pattern Recognition (ICPR), August 2002. [28] A.K. Jain, R.P.W. Duin, and Jianchang Mao. Statistical pattern recognition: a review. IEEE PAMI, 22(1):4–37, January 2000. [29] Michael I. Jordan, editor. Learning in graphical models. NATO Advanced Study Institute on Learning in Graphical Models. Kluwer Academic Publisher, Boston, 1998. [30] Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Applications, volume 1. American Mathematical Society, Providence, RI, 1980. [31] T. Kohonen. Improved versions of learning vector quantization. In Proc. International Joint Conference on Neural Networks, pages 545–550, San Diego, June 1990. [32] T. Kohonen, J. Kangas, J. Laaksonen, and K. Torkkola. Lvq pak: A program package for the correct application of learning vector quantization algorithms. In Proc. International Joint Conference on Neural Networks, volume 1, pages 725–730, Baltimore, June 1992. [33] S. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992. [34] P. Lipson. Context and Configuration-Based Scene Classification. PhD thesis, MIT, Cambridge, MA, 1996. [35] P. Lipson, E. Grimson, and P. Sinha. Configuration based scene classification and image indexing, 1997.
[36] Y. Lu, C. Hu, X. Zhu, H. J. Zhang, and Q. Yang. A unified framework for semantics and feature based relevance feedback in image retrieval systems. In ACM Multimedia Conference, Los Angeles, October 2000. [37] J. Luo and A. Savakis. Indoor vs. outdoor classification of consumer photographs using low-level and semantic features. In IEEE International Conference on Image Processing, Thessaloniki, Greece, October 2001. [38] Jiebo Luo and Stephen Etz. A physics-motivated approach to detecting sky in photographs. In International Conference on Pattern Recognition, volume 1, Quebec City, QC, Canada, August 11 - 15 2002. [39] Jiebo Luo, Amit Singhal, and Weiyu Zhu. Natural object detection in outdoor scenes based on probabilistic spatial context models. In ICME, Baltimore, MD, July 2003. [40] Jiebo Luo, Amit Singhal, and Weiyu Zhu. Towards holistic scene content classification using spatial context-aware scene models. In IEEE Conference on Computer Vision and Pattern Recognition, Madison, WI, June 2003. [41] D. J. C. Mackay. Introduction to monte carlo methods. In Michael I. Jordan, editor, Learning in Graphical Models, volume 89 of NATO Science Series. Kluwer Academic Publishers, 1998. [42] Stephane Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–692, July 1989. [43] David Marr. Vision - A Computational Investigation into the Human Representation and Processing of VIsual Information. Freeman, San Francisco, 1982.
[44] Philippe Mulhem, Wee Kheng Leow, and Yoong Keok Lee. Fuzzy conceptual graphs for matching images of natural scenes. In IJCAI, pages 1397–1404, 2001. [45] R. M. Neal. Probabilistic inference using markov chain monte carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993. [46] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001. [47] A. Oliva and A Torralba. Scene-centered description from spatial envelope properties. In 2nd Workshop on Biologically Motivated Computer Vision, Lecture Notes in Computer Science, Tuebingen, Germany, 2002. [48] S. Paek and S.-F. Chang. A knowledge engineering approach for image classification based on probabilistic reasoning systems. In IEEE International Conference on Multimedia and Expo. (ICME-2000), volume II, pages 1133– 1136, New York City, NY, Jul 30-Aug 2 2000. [49] G. Pass, R. Zabih, and J. Miller. Comparing images using color coherence vectors. In Proceedings of the 4th ACM International Conference on Multimedia, pages 65–73, Boston, Massachusetts, November 1996. [50] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, 1988. [51] T. Randen and J.M. Husoy. Filtering for texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4):291–310, April 1999.
[52] A. Rao, R. Srihari, L. Zhu, and A. Zhang. A theory for measuring the complexity of image databases. IEEE Transactions on Multimedia, 4(2):160– 173, June 2002. [53] Rajesh Rao and Dana Ballard. Efficient encoding of natural time varying images produces oriented space-time receptive fields. Technical Report 97.4, University of Rochester, Rochester, NY, August 1997. [54] A. Ratan and W.E.L. Grimson. Training templates for scene classification using a few examples. In Proceedings of IEEE Content Based Access of Image and Video Libraries, San Juan, 1997. [55] Terry Regier and Laura Carlson. Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General, 130(2):273–298, 2001. [56] R. D. Rimey and C. M. Brown. Control of selective perception using bayes nets and decision theory. International Journal of Computer Vision, Special Issue on Active Vision, 1994. [57] Raymond D. Rimey. Control of Selective Perception using Bayes Nets and Decision Theory. PhD thesis, Computer Science Dept., U. Rochester, Rochester, NY, December 1993. [58] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 22 2000. [59] Cordelia Schmid. Constructing models for content-based image retrieval. In Computer Vision and Pattern Recognition, Kauai, Hawaii USA, December 2001.
[60] Henry Schneiderman and Takeo Kanade. A statistical method for 3d object detection applied to faces and cars. In IEEE Conference on Computer Vision and Pattern Recognition, 2000. [61] B. Scholkopf, C. Burges, and A. Smola. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999. [62] Andrea Selinger. Analysis and Applications of Feature-Based Object Recognition. PhD thesis, University of Rochester, Rochester, NY, 2001. [63] Andrea Selinger and Randal C. Nelson. A perceptual grouping hierarchy for appearance-based 3d object recognition. Computer Vision and Image Understanding, 76(1):83–92, October 1999. [64] Satoshi Semba, Masayoshi Shimizu, and Shoji Suzuki. Skin color based lightness correction method for digital color images. In PICS, pages 399–402, 2001. [65] Navid Serrano, Andreas Savakis, and Jiebo Luo. A computationally efficient approach to indoor/outdoor scene classification. In International Conference on Pattern Recognition, September 2002. [66] A. Singhal. Bayesian Evidence Combination for Region Labeling. PhD thesis, University of Rochester, Rochester, NY, 2001. [67] Amit Singhal and Jiebo Luo. Hybrid approach to classifying sky regions in natural images. In Proceedings of the SPIE, volume 5022, July 2003. [68] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Contentbased image retrieval at the end of early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, December 2000.
[69] J. R. Smith and C.-S. Li. Image classification and querying using composite region templates. Computer Vision and Image Understanding, 75(1/2):165 – 174, July/August 1999. [70] Y. Song and A. Zhang. Analyzing scenery images by monotonic tree. ACM Multimedia Systems Journal, 2002. [71] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. Brooks/Cole Publishing, Pacific Grove, CA, 2 edition, 1999. [72] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1), 1991. [73] Martin Szummer and Rosalind W. Picard. Indoor-outdoor image classification. In IEEE International Workshop on Content-based Access of Image and Video Databases, Bombay, India, 1998. [74] A. Torralba and P. Sinha. Recognizing indoor scenes. Technical Report AI Memo 2001-015, CBCL Memo 202, MIT, July 2001. [75] A. Torralba and P. Sinha. Statistical context priming for object detection. In Proceedings of the International Conference on Computer Vision, pages 763–770, Vancouver, Canada, 2001. [76] A. Vailaya. Semantic Classification in Image Databases. PhD thesis, Michigan State University, East Lansing, MI, 2000. [77] A. Vailaya, M. Figueiredo, A. Jain, and H.J. Zhang. Content-based hierarchical classification of vacation images. In Proc. IEEE Multimedia Systems ’99 (International Conference on Multimedia Computing and Systems), Florence, Italy, June 1999.
[78] A. Vailaya, A. K. Jain, and H.-J. Zhang. On image classification: City images vs. landscapes. Pattern Recognition, 31:1921–1936, December 1998. [79] A. Vailaya, H.J. Zhang, and A. Jain. Automatic image orientation detection. In Proc. IEEE International Conference on Image Processing, Kobe, Japan, October 1999. [80] J. Wang, J. Li, and G Wiederhold. Simplicity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001. [81] Y. Weiss and W. T. Freeman. Belief propagation and revision in networks with loops. Technical Report 1616, MIT, Cambridge, MA, 1998. [82] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. International Joint Conference on Artificial Intelligence (IJCAI), Distinguished Presentations, August 2001. [83] H. Yu and W.E.L. Grimson. Combining configurational and statistical approaches in image retrieval. In 2nd IEEE Pacific-Rim Conference on Multimedia, Beijing, October 2001.
A Natural Scene Statistics
We give the spatial relationship counts calculated from the 220 labeled beach and urban scenes. The row is the landmark object and the column is the trajector object, so for instance, Above(foliage, sky) = 24 from the first table means that sky was located above foliage in 24 beach images. Blue patch (of sky) was incorporated into sky; we do not differentiate based on cloudy/clear sky.

BEACH
Relation above
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0     1     0     0     0     0     0     0     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     1     0     0     0     0     0     0     0
foliage      24     0     0     0     2     0     8     0     0     1     1
water        97     0     0    12     0     0     5     8     1     7     0
snow          0     0     0     0     0     0     0     0     0     0     0
sand         16     0     0     5    51     0     0     9     0     4    10
rock         11     0     0     4     3     0     3     0     0     0     0
road          1     0     0     0     1     0     0     0     0     1     0
building     13     0     0     1     0     0     0     1     0     0     0
crowd         3     0     0     2     6     0     0     0     0     2     0
Relation far above
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0     1     0     0     0     0     0     0     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         3     0     0     0     2     0     1     0     0     0     0
foliage       9     0     0     0     6     0     0     0     0     3     1
water        23     0     1    17     0     0     0     2     0     6     0
snow          0     0     0     0     0     0     0     0     0     0     0
sand        104     0     1    24    32     0     0     7     1    11     1
rock         11     0     0     2     4     0     1     0     0     1     0
road          4     0     0     1     2     0     0     0     0     1     0
building      0     0     0     0     0     0     0     0     0     0     0
crowd        15     0     0     3     1     0     0     3     0     5     0
Relation beside
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0     9     0     0     0     2     0     2     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     1     0     0     1     0     1     0     0
foliage       9     0     1     0     7     0     7     2     3     0     0
water         0     0     0     7     0     0    32     7     1     2    11
snow          0     0     0     0     0     0     0     0     0     0     0
sand          0     0     1     7    32     0     0     4     4     0     7
rock          2     0     0     2     7     0     4     0     1     0     0
road          0     0     1     3     1     0     4     1     0     0     0
building      2     0     0     0     2     0     0     0     0     0     0
crowd         0     0     0     0    11     0     7     0     0     0     0
Relation far below
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     3     9    23     0   104    11     4     0    15
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     0     1     0     1     0     0     0     0
foliage       1     0     0     0    17     0    24     2     1     0     3
water         0     0     2     6     0     0    32     4     2     0     1
snow          0     0     0     0     0     0     0     0     0     0     0
sand          0     0     1     0     0     0     0     1     0     0     0
rock          0     0     0     0     2     0     7     0     0     0     3
road          0     0     0     0     0     0     1     0     0     0     0
building      0     0     0     3     6     0    11     1     1     0     5
crowd         0     0     0     1     0     0     1     0     0     0     0
Relation below
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0    24    97     0    16    11     1    13     3
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     0     0     0     0     0     0     0     0
foliage       1     0     1     0    12     0     5     4     0     1     2
water         0     0     0     2     0     0    51     3     1     0     6
snow          0     0     0     0     0     0     0     0     0     0     0
sand          0     0     0     8     5     0     0     3     0     0     0
rock          0     0     0     0     8     0     9     0     0     1     0
road          0     0     0     0     1     0     0     0     0     0     0
building      0     0     0     1     7     0     4     0     1     0     2
crowd         0     0     0     1     0     0    10     0     0     0     0
Individual material occurrence in 120 images:

sky          120
bluePatch      0
grass          3
foliage       44
water        120
snow           0
sand         120
rock          24
road           5
building      15
crowd         18
URBAN
Relation above
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0     1     0     0     0     0     0     0     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     6     0     0     0     0     0     1     0
foliage       9     0     0     0     1     0     0     0     0    22     0
water         6     0     0     0     0     0     0     0     0    14     0
snow          0     0     0     0     0     0     0     0     0     0     0
sand          0     0     0     0     0     0     0     0     0     0     0
rock          0     0     0     0     0     0     0     0     0     0     0
road          1     0     1     4     2     0     0     0     0    19     0
building     77     0     0     2     1     0     0     1     0     0     0
crowd         0     0     0     0     0     0     0     0     0     0     0
Relation far above
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0     0     0     0     0     0     0     0     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         9     0     0     1     3     0     0     0     0     7     0
foliage      23     0     0     0     0     0     0     0     0     2     0
water        20     0     0     3     0     0     0     0     0     8     0
snow          0     0     0     0     0     0     0     0     0     0     0
sand          1     0     0     0     0     0     0     0     0     1     0
rock          3     0     0     1     0     0     0     0     0     2     0
road         33     0     0    12     2     0     0     0     0    16     0
building      1     0     0     1     0     0     0     0     0     0     0
crowd         0     0     0     0     0     0     0     0     0     0     0
Relation beside
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0     5     1     0     0     0     0    19     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     0     0     0     1     0     3     1     0
foliage       5     0     0     0     1     0     1     0     4    11     0
water         1     0     0     1     0     0     1     2     0     5     0
snow          0     0     0     0     0     0     0     0     0     0     0
sand          0     0     1     1     1     0     0     0     1     0     0
rock          0     0     0     0     2     0     0     0     0     0     0
road          0     0     3     4     0     0     1     0     0     0     0
building     19     0     1    11     5     0     0     0     0     0     0
crowd         0     0     0     0     0     0     0     0     0     0     0
Relation far below
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     9    23    20     0     1     3    33     1     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     0     0     0     0     0     0     0     0
foliage       0     0     1     0     3     0     0     1    12     1     0
water         0     0     3     0     0     0     0     0     2     0     0
snow          0     0     0     0     0     0     0     0     0     0     0
sand          0     0     0     0     0     0     0     0     0     0     0
rock          0     0     0     0     0     0     0     0     0     0     0
road          0     0     0     0     0     0     0     0     0     0     0
building      0     0     7     2     8     0     1     2    16     0     0
crowd         0     0     0     0     0     0     0     0     0     0     0
Relation below
            sky   blu   gra   fol   wat   sno   san   roc   roa   bui   cro
sky           0     0     0     9     6     0     0     0     1    77     0
bluePatch     0     0     0     0     0     0     0     0     0     0     0
grass         0     0     0     0     0     0     0     0     1     0     0
foliage       1     0     6     0     0     0     0     0     4     2     0
water         0     0     0     1     0     0     0     0     2     1     0
snow          0     0     0     0     0     0     0     0     0     0     0
sand          0     0     0     0     0     0     0     0     0     0     0
rock          0     0     0     0     0     0     0     0     0     1     0
road          0     0     0     0     0     0     0     0     0     0     0
building      0     0     1    22    14     0     0     0    19     0     0
crowd         0     0     0     0     0     0     0     0     0     0     0
Individual material occurrence in 100 images:

sky           97
bluePatch      0
grass          9
foliage       38
water         28
snow           0
sand           1
rock           3
road          35
building     100
crowd          0
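For illustration only, counts such as these could be converted into the conditional probabilities used during training by simple frequency estimation; the helper below and its add-one smoothing are assumptions, and co-occurrence is crudely bounded because it is not tabulated in this appendix:

```python
# Sketch: turning appendix counts into conditional probabilities (illustrative).
# P(rel(A, B) holds | scene) estimated from how often the relation was observed
# relative to the images containing both materials, with add-one smoothing.

BEACH_OCCURRENCE = {"sky": 120, "water": 120, "sand": 120, "foliage": 44}

def relation_probability(count, n_landmark, n_trajector):
    # Co-occurrence is not tabulated, so bound it by the rarer material (crude).
    co_occurrence = min(n_landmark, n_trajector)
    return (count + 1) / (co_occurrence + 2)

# Example: Above(foliage, sky) was observed 24 times in the 120 beach images.
p = relation_probability(24, BEACH_OCCURRENCE["foliage"], BEACH_OCCURRENCE["sky"])
print(round(p, 3))   # ~0.543 under these assumptions
```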
B Detector Characteristics
Each row gives the probability of the detector firing and not firing on each true material.

Sky_Detector
  True material   P(fires)   P(does not fire)
  Sky               0.95         0.05
  Sand              0.03         0.97
  Road              0.05         0.95
  Water             0.18         0.82
  Building          0.05         0.95
  Background        0.01         0.99

Sand_Detector
  True material   P(fires)   P(does not fire)
  Sky               0.05         0.95
  Sand              0.95         0.05
  Road              0.20         0.80
  Water             0.20         0.80
  Building          0.25         0.75
  Background        0.30         0.70

Road_Detector
  True material   P(fires)   P(does not fire)
  Sky               0.05         0.95
  Sand              0.20         0.80
  Road              0.85         0.15
  Water             0.15         0.85
  Building          0.40         0.60
  Background        0.25         0.75

Water_Detector
  True material   P(fires)   P(does not fire)
  Sky               0.48         0.52
  Sand              0.20         0.80
  Road              0.20         0.80
  Water             0.85         0.15
  Building          0.15         0.85
  Background        0.15         0.85

Building_Detector
  True material   P(fires)   P(does not fire)
  Sky               0.05         0.95
  Sand              0.10         0.90
  Road              0.15         0.85
  Water             0.05         0.95
  Building          0.80         0.20
  Background        0.10         0.90