Antonio Henrique Pinto Selvatici

Building Object-Based Maps for Robot Navigation

Thesis presented to the Escola Politécnica da Universidade de São Paulo in partial fulfillment of the requirements for the degree of Doctor in Electrical Engineering. Concentration area: Digital Systems. Advisor:

Profa. Livre Docente Anna Helena Reali Costa

São Paulo 2009

Selvatici, Antonio Henrique Pinto. This copy has been revised and modified with respect to the original version, under the sole responsibility of the author and with the consent of his advisor. São Paulo, April 22, 2009.

Author's signature    Advisor's signature

CATALOG CARD

Selvatici, Antonio Henrique Pinto. Building Object-Based Maps for Robot Navigation / A. H. P. Selvatici. – rev. ed. – São Paulo, 2009. 154 p. Thesis (Doctorate) — Escola Politécnica da Universidade de São Paulo. Departamento de Engenharia de Computação e Sistemas Digitais. 1. Robotics 2. Computer vision 3. Bayesian inference and belief networks I. Universidade de São Paulo. Escola Politécnica. Departamento de Engenharia de Computação e Sistemas Digitais II. t.

To Giselle, for all her love and care

Acknowledgments

I am really thankful to my adviser Anna Reali for the great experience I have had as her student over the last six years, first pursuing my Master's, and then as a Ph.D. student. Thanks for your dedication, for the fruitful discussions about the different themes I studied over this time, for always trusting me, and for being patient.

I am also thankful to Frank Dellaert, my tutor during my pleasant stay at the BORG lab. Thanks to the treatment I received there, I felt as if I were in my regular workplace. Above all, though, thanks for treating me as your student, from which I learned a lot.

Special thanks to my family, in particular to my parents. Thanks for all the emotional support, the family meetings, and the relaxing time in Londrina. It has been a hard time away from you. I also thank my parents for the financial aid: it helped a lot, making a scholarship-funded Ph.D. student's life less tough.

I am in debt to the friends I made during my academic journey. Thanks to my friends from LTI for the time we spent together in social meetings, philosophical or technical discussions, and for the fun time at the lab. Thanks to Valguima, Bianchi, Diana and Valdinei, the old-time lab friends. Thanks to Humberto, my great beer partner during these years, and also to Leandro, for the good time we spent together. I also thank Prof. Jun Okamoto and Fabiano, from LPA, for lending space and equipment for the practical experiments reported in this thesis. Fabiano has also been a good friend throughout my grad journey.

Of course, I will not forget to thank the GREAT people I met at Georgia Tech. Thanks to you, I did not feel (too much...) homesick and spent a very good time in the USA. Special thanks to Misha, my great Russian beer pal, who taught me real English and gave me huge support, for which I also acknowledge his family. Thanks also to Gian Luca, for his friendship and Italian-fashioned dinners at his house. Thanks to Michael, Mingxuan, Ananth, Richards, Grant, Sang Min, Kai and the other highly internationalized people I met at the TSRB.

To be fair, a special section of this chapter should be dedicated to my fiancée Giselle. She has been playing a very important role in my life during all the time I have been pursuing the Ph.D. degree, and gave me extra support during the last months, taking care of nearly all practical aspects of my daily life. Giselle, thank you very much, and, please, accept my dedication of this manuscript to you.

I am also thankful for the financial support of CNPq, Brazil, which funded me during the last four years through processes 140868/2005-4 (GD grant) and 201571/2007-2 (SWE grant).

Resumo

As the complexity of the tasks performed by mobile robots increases every day, robot perception must be able to capture richer information from the environment, allowing complex decision making. Among the possible types of information that can be obtained from the environment, geometric and semantic information play important roles in most tasks assigned to robots. While geometric information reveals how objects and obstacles are distributed in space, semantic information captures the presence of complex structures and ongoing events in the environment and condenses them into abstract descriptions. This thesis proposes a new probabilistic technique to build an object-based representation of the environment from images captured by a navigating robot carrying an attached video camera. This representation, which provides geometric and semantic descriptions of objects, is called the O-Map, and it is the first of its kind in the robot navigation context. The proposed mapping technique is also new, and concurrently solves the localization, mapping, and object classification problems that arise when building O-Maps using images processed by imperfect object detectors and no global localization sensor. For this reason, the proposed technique is called O-SLAM, and it is the first algorithm to simultaneously solve the localization and mapping problems using only odometry and the output of object recognition algorithms. The results obtained by applying O-SLAM to images processed by a simple object detection technique show that the proposed algorithm is able to build maps that consistently describe the objects in the environment, provided that the computer vision system detects them regularly. In particular, O-SLAM is effective in closing large loops in the robot trajectory, and it succeeds even if the object detection system is faulty, reporting false positives and sometimes mistaking the object class, fixing these errors. Thus, O-SLAM is a step towards the integrated solution of the localization, mapping, and object recognition problem, which should drop the need for an off-the-shelf object recognition system and generate O-Maps solely by fusing geometric and visual information gathered by the robot.

Keywords: Robotics. Computer vision. Bayesian inference and belief networks.

Abstract

As the tasks performed by mobile robots increase in complexity, robot perception must be able to capture richer information from the environment in order to allow complex decision making. Among the possible types of information that can be retrieved from the environment, geometric and semantic information play important roles in most of the tasks assigned to robots. While geometric information reveals how objects and obstacles are distributed in space, semantic information captures the presence of complex structures and ongoing events in the environment and summarizes them in abstract descriptions. This thesis proposes a new probabilistic technique to build an object-based representation of the robot's surrounding environment using images captured by an attached video camera. This representation, which provides geometric and semantic descriptions of the objects, is called the O-Map, and is the first of its kind in the robot navigation context. The proposed mapping technique is also new, and concurrently solves the localization, mapping, and object classification problems that arise when building O-Maps using images processed by imperfect object detectors and no global localization sensor. Thus, the proposed technique is called O-SLAM, and is the first algorithm to solve the simultaneous localization and mapping problem using solely odometers and the output of object recognition algorithms. The results obtained by applying O-SLAM to images processed by a simple object detection technique show that the proposed algorithm is able to build consistent maps describing the objects in the environment, provided that the computer vision system is able to detect them on a regular basis. In particular, O-SLAM is effective in closing large loops in the trajectory, and is able to perform well even if the object detection system makes spurious detections and reports wrong object classes, fixing these errors. Thus, O-SLAM is a step towards the solution of the simultaneous localization, mapping, and object recognition problem, which must drop the need for an off-the-shelf object recognition system and generate O-Maps only by fusing geometric and appearance information gathered by the robot.

Keywords: Robotics. Computer vision. Bayesian inference and belief networks.

List of Figures

1.1 Example of a simple O-Map of a living room, showing object heights and average widths, as well as class labels. All represented objects are considered to have been detected with probability 1.

1.2 Effect of the robot localization error on the data association problem. In (a), (c), and (e), stars represent the true feature positions, and the unfilled circle represents the robot pose, with the internal trace showing its orientation: the continuous stroke indicates the actual pose, while the dashed stroke shows the believed pose. In (b), (d), and (f), the black ball shows the origin of the measurement coordinates, crosses represent the projected measurements from time t, and squares connected to the origin represent the actual measurements.

2.1 Classification of the most common types of maps used in robotics according to their abstraction level. (a) Occupancy-grid map, representing metric maps, at the lowest level. (b) Feature-point map indicating the location of trees and the robot trajectory, extracted from (SELVATICI; COSTA, 2007a). (c) Topological map that connects landmarks with the control commands used to move from one to another, extracted from (VASSALLO, 2004).

2.2 Examples of geometric feature maps. (a) Map of ceiling lines, extracted from (FOLKESSON; JENSFELT; CHRISTENSEN, 2005). (b) Textured map of planes, extracted from (THRUN et al., 2004).

3.1 Example of an O-Map, as proposed in this thesis. In the example, objects are represented by parallelepipeds, described by width, height, and depth. The green shape represents the floor plane, which is not part of the map. Transparency in the object representation indicates uncertainty about its existence.

3.2 Illustration of how an O-Map can be used to create a common vocabulary for human-robot interaction. This example is merely illustrative, and has not been performed experimentally. (a) Top view of part of a kitchen represented by its objects. (b) Sample of an image used to build the map, showing the detected cups. The association between each detection and an object in the map is used by the human user to specify the desired cup.

3.3 Illustration of the scenario considered by the O-SLAM technique. The detected shape in the example is simplified as a rectangle, which corresponds to the projection of a longitudinal section of a cylinder.

3.4 Illustration of how the correspondence function J partitions the set of measurements into N subsets (here, N = 4). In the figure, circles represent robot poses at different instants i = 1, 2, . . . , 8, and stars represent the detected objects with indices j = 1, 2, 3, 4. Lines represent the available measurements, indexed by the pairs (i, k), which are shown only in part in order not to clutter the figure. The lines are distributed among the objects by J so that J(i, k) = j. For instance, we have J(2, 1) = 1 and J(4, 2) = 4 in the figure.

3.5 Illustration of how knowledge about the object size and position can be used to estimate the relative distance from the camera. (a) The decreasing size of the wooden box in the images, combined with prior knowledge about the size of wooden boxes, informs about its increasing distance from the camera. (b) Graphical probabilistic model illustrating the dependencies among object position and size and robot camera pose (unknowns are represented by ellipses) given the object recognition output (rectangles).

4.1 Fragment of the Bayesian network that represents the probabilistic model for O-SLAM. The rectangles represent known variables, and the circles represent the unknowns. We basically assume that the object class c_j gives a rough idea about its geometry g_j, and may give clues about its location l_j. In geometric SLAM, the detected class a_{i,k} is considered equal to the actual object class c_{J(i,k)}, so that P(a_{i,k} | c_{J(i,k)}) = δ(a_{i,k}, c_{J(i,k)}). This model admits neither strong occlusion among objects nor moving objects.

4.2 Representation of how to match the same object identities in two maps generated assuming different solutions for the data association. Objects 1, 3, and 4 have the same identity in both correspondence functions represented in (a) and (b), while object 2 in (a) and objects 2 and 5 in (b) cannot be matched using the present definition of object identity.

4.3 Example plot of a mirrored sigmoid function with parameters τ = 0.1 and t0 = 0.3.

4.4 Toy example proposed to illustrate how the data association probability model works.

4.5 Plot of the residuals res1 and res2 for the values of σo and σr indicated on the graphs.

4.6 Comparison of the plots of the model evidence considering the data associations J1 and J2, with varying standard deviations.

5.1 Illustration of how the MH algorithm sampling peaked distributions yields poor-mixing chains. (a) Target density, with peaks centered at x = 4 and x = 12. (b) Plot of the samples drawn by the MH algorithm, starting at x = 4 and proposing new states in a fixed window around the current one. The states belonging to the right-most peak are not sampled, indicating very poor chain mixing. Figures are extracted from (RANGANATHAN, 2008).

5.2 Illustration of how MC3 with heated chains helps improve mixing when peaked distributions are sampled. (a) Target density at different temperatures: T = 1 (black line), representing the original density, the same as in Figure 5.1; T = 11 (blue); and T = 21 (red). (b) Graph of the samples drawn by the MC3 algorithm, taking the same starting state and proposal density as in the example of Figure 5.1. In this case, the sampler quickly alternates between both peaks, indicating good chain mixing. Figures are extracted from (RANGANATHAN, 2008).

6.1 Example of an annotated image from the Cogniron data set. Objects are marked with bounding polygons and are given a label.

6.2 Illustration of the measurement data and object model used in the geometric SLAM experiments. (a) Cylindrical model adopted to represent the objects, described by height, width, and base point position. (b) Detected object positions and sizes in images. The size is described by the object bounding slice, extracted from the annotated polygon, consisting of the radial size and the angular width. The spots indicate the projection of the object base points, used to describe their apparent positions in images.

6.3 Results of geometric O-SLAM obtained with the Cogniron data set. (a) Perspective view of the obtained O-Map together with the inferred trajectory. (b) Bird's-eye view of the built object models. The black-dotted line represents the inferred trajectory, while the dark-green-dotted line is the trajectory obtained by laser-based SLAM, which we consider as our ground truth. The house blueprint was extracted from the paper of Zivkovic, Booij & Kröse (2007).

6.4 More results obtained with the Cogniron data set, illustrating the performance of the proposed approach for geometric SLAM. (a) Comparison among trajectories. The inferred trajectory is closer to the reference, and its error remains limited. The odometry error increases over time. (b) Example of the projection of the objects in the map onto the image plane.

6.5 Pictures of the robot and of an object present in the LPA image set. (a) The Pioneer 3-DX robot carries an omnidirectional imaging system, consisting of a video camera coupled to a hyperbolic mirror. (b) Example of an object (green box) detected by CMVision in the experiments. Each object class is represented by a different color.

6.6 Detected object positions and sizes in images belonging to the LPA data set. Observe that the size of the "orange door" is imprecisely detected.

6.7 Results of geometric SLAM obtained with the hand-adjusted LPA data set. (a) Perspective view of the obtained O-Map together with the inferred trajectory. The object colors are the average ones used to detect them in images. (b) Bird's-eye view of the built object models. The black-dotted line represents the inferred trajectory, while the green line is the trajectory obtained by laser-based SLAM, considered as the ground truth. The occupancy-grid map was generated using DP-SLAM. Note that the green box overlaps most of the laser-detected object.

6.8 More results obtained with the LPA data set, illustrating the performance of the geometric version of O-SLAM. (a) Comparison between trajectories. The inferred trajectory is closer to the reference, having limited error. (b) Example of the projection of the objects in the map onto the image plane.

6.9 Results of O-SLAM obtained with the LPA data set and known object classes, though with an unknown data association solution. (a) Perspective view of the obtained O-Map together with the inferred trajectory. The object colors are the average ones used to detect them in images. Spurious objects are circled. (b) Bird's-eye view of the built O-Map. The black-dotted line represents the inferred trajectory, while the green line is the trajectory obtained by DP-SLAM. (c) Detail of the region around the non-spurious green box in Figure 6.9b, showing three objects of that class (two of them perfectly overlapped). The red triangles indicate the centers. The transparency would indicate object existence probability, but two objects are so transparent that they cannot even be seen.

6.10 Top view of the O-Map obtained considering the proposed model of object spuriousness. (a) Full map and trajectory, where transparency indicates the object probability of existence. (b) Plot of the objects with πID > 0.5.

6.11 Example of the projection of the objects in the map of Figure 6.10b onto the image plane. Observe that the leg of the person in the image is detected as a "green box".

6.12 Comparison between sample histograms obtained by the MC3 and MH algorithms. (a) Histograms obtained with MC3. (b) Histograms obtained using the MH algorithm processing the same data.

6.13 O-Map and trajectory obtained using the LPA data set, with the wrong assumption of perfect object classes. The displayed objects have existence probability πID > 0.1. (a) Bird's-eye view of the map, also showing the occupancy grid obtained with DP-SLAM. (b) Perspective view of the map, showing overlapping objects corresponding to the same physical one (orange ball) detected as belonging to two different classes ("orange ball" and "orange door").

6.14 O-Map and trajectory obtained using the LPA data set and allowing for confusion among classes. The displayed objects have existence probability πID > 0.1. (a) Bird's-eye view of the map, also showing the occupancy grid obtained with DP-SLAM. (b) Perspective view of the map. Now, the orange ball is correctly mapped.

6.15 Detected object positions and sizes in images belonging to the LPA data set. Observe that the orange ball is detected twice: once as belonging to the correct class, and once as an "orange door" object.

6.16 Detected object positions and sizes in images belonging to the long-run LPA data set. (a) Image showing only non-spurious detections. (b) Image showing spurious measurements taken from the door and the red objects at the left part of the image, which are not intended to be detected in this experiment.

6.17 Results of O-SLAM obtained with the long-run LPA data set, showing the objects with πID > 0.5. (a) Perspective view of part of the obtained O-Map, showing the objects detected inside the lab room. The spurious objects are circled. (b) Bird's-eye view of the built object models. The black-dotted line represents the inferred trajectory, while the green line is the trajectory obtained by laser-based SLAM, considered as the reference. The occupancy-grid map was generated using DP-SLAM. Note that no objects are shown outside the room.

6.18 More results obtained with the long-run LPA data set, illustrating the performance of the proposed algorithm in estimating the correct geometric parameters. (a) Comparison between trajectories. The inferred trajectory is closer to the reference. In this figure, all mapped objects are shown, represented by the red triangles. (b) Examples of the projection of the objects with low spuriousness probability onto the image plane. The image at the left emphasizes projections of the correctly mapped objects, while the image at the right emphasizes the projections of the spurious objects registered with a low spuriousness probability in the map.

A.1 Illustration of the variables involved in the robot movement model. The circle represents the robot, and the line segment inside it indicates the orientation with respect to the global coordinate system.

A.2 Schematic representation of the omnidirectional camera used in the experiments of Chapter 6, extracted from (ZIVKOVIC; BOOIJ, 2005).

A.3 Schematic representation of how a point in the world is projected onto the image plane.

List of Tables

6.1 Classes detected in the Cogniron image set and the assumed prior belief on their instance sizes.

6.2 Classes detected in the LPA image set and the assumed prior belief on their instance sizes.

6.3 Comparison between the object sizes inferred with the proposed geometric SLAM algorithm and the correct ones. Each object belongs to a different class. The correct diameters of the hexahedral (six-faced) objects, namely the green and yellow boxes and the orange door, were calculated by taking the Euclidean norm of the corresponding length and width values in Table 6.2.

6.4 Classes detected in the LPA image set and the assumed prior belief on their instance sizes.

List of Algorithms

1 O-SLAM algorithm considering perfect object detection and recognition
2 Building an O-Map from data association samples Jn and data Z
3 Building an object-based map from correspondence function and object class samples (Cn, Jn)
4 The Metropolis-Hastings algorithm for sampling P(C, J|Z)
5 Proposing a merge move
6 Proposing a split move
7 Proposing a switch-class move
8 The Metropolis-coupled MCMC algorithm applied to sampling P(C, J|Z)
9 Strategy proposed to improve the sampling process

List of Acronyms

AR Augmented Reality
BIC Bayesian Information Criterion
BMA Bayesian Model Averaging
BMS Bayesian Model Selection
CRF Conditional Random Fields
EKF Extended Kalman Filter
HCI Human-Computer Interaction
HMM Hidden Markov Model
ICP Iterative Closest Point
iSAM Incremental Smoothing and Mapping
JCBB Joint-Compatibility Branch and Bound
LS Least-Squares
MAP Maximum a Posteriori
MCMC Markov Chain Monte Carlo
MC3 Metropolis-Coupled Markov Chain Monte Carlo
MH Metropolis-Hastings
ML Maximum Likelihood
NN Nearest Neighbor
PTM Probabilistic Topological Map
RANSAC Random Sample Consensus
ROI Region of Interest
SAM Smoothing and Mapping
SFM Structure From Motion
SIFT Scale-Invariant Feature Transform
√SAM Square-Root Smoothing and Mapping
SLAM Simultaneous Localization and Mapping

List of Symbols

P(.)  Probability distribution or density function
P(.|.)  Conditional probability distribution or density function
δ(., .)  Kronecker delta, defined by X = X0 → δ(X, X0) = 1 and X ≠ X0 → δ(X, X0) = 0
Ns  Number of samples
𝔅(K)  Bell number of order K
S(K, N)  Stirling number of the second kind, representing the number of possible ways to partition K elements into N subsets
M  O-Map representation
j  General-purpose index, used to count objects in the map
o_j  jth object in an object-based map
N  Number of objects in a map
l_j  jth object location in space
g_j  jth object geometric parameters
c_j  jth object class
π_j  jth object existence probability
C  Space of possible class labels
T  Number of images captured by the robot
X  Robot trajectory
i  General-purpose index, mostly used to count data acquisition time instants
x_i  ith robot pose in the trajectory
V  Odometer data
v_i  Robot displacement measured by the odometers from time i − 1 to i
Z  Data acquired by object detection
k  Index used to count object detections in an image
K_i  Number of objects detected in the image captured at time i
z_{i,k}  Measurement taken from the kth object detected at time i
u_{i,k}  Measured 2D position of the kth object detected at time i
s_{i,k}  Measured 2D size of the kth object detected at time i
a_{i,k}  Detected class of the kth object detected at time i
J  Correspondence function
C  Set of object classes in an O-Map
L  Set of object locations in an O-Map
G  Set of object geometries in an O-Map
N  Normal distribution
γ(.)  Class-dependent a priori object location in the world
Γ(.)  Class-dependent covariance of the prior on object location
ς(.)  Class-dependent a priori object size in the world
Σ(.)  Class-dependent covariance of the prior on object size
f(., .)  Deterministic robot motion function
Q_i  Covariance representing the believed robot motion from time i − 1 to i
h_u(., .)  Deterministic transform that projects the 3D object location onto the image plane
R_{i,k}  Covariance representing the believed error in the object projection position on the image plane
h_s(., ., .)  Deterministic model that projects the 3D object size onto the image plane
W_{i,k}  Covariance representing the believed error in the object size on the image plane
P  Block-diagonal covariance matrix of the Gaussian unnormalized posterior P(Z|X, L, G, C, J) P(X, L, G|C, J)
θ  Vector of unknown variables
Ā  Coefficient matrix of the linear system whose solution maximizes P(Z|X, L, G, C, J) P(X, L, G|C, J)
b̄  Independent coefficient vector of the linear system whose solution maximizes P(Z|X, L, G, C, J) P(X, L, G|C, J)
I_{d×d}  Identity matrix with d diagonal elements
Q, R  Matrices resulting from the QR decomposition of Ā, where R is the upper-triangular Cholesky factor
θ_J  Variables vector that assumes the correspondence function J
ID  Object identity defined for all correspondence functions
l̂_ID  Estimated 3D location of object ID in the map
ĝ_ID  Estimated 3D size of object ID in the map
ĉ_ID  Estimated class of object ID in the map
π_ID  Existence probability of object ID in the map
n  General-purpose index, mostly used to count samples
J_n  nth correspondence function sample
C_n  nth object classes sample
θ_n  Variables vector that assumes the data association solution given by J_n and the object classes given by C_n

Contents

1 Introduction
  1.1 Object-based map for robot navigation
  1.2 SLAM problem and loop closing
  1.3 Object-based SLAM
  1.4 Objectives
  1.5 Contributions
  1.6 Organization

2 Robotic Mapping
  2.1 Map types
    2.1.1 Metric maps
    2.1.2 Topological maps
    2.1.3 Feature maps
    2.1.4 Object-based maps
  2.2 The SLAM problem
    2.2.1 Related problems
    2.2.2 Estimation techniques
    2.2.3 Data association problem

3 The O-Map
  3.1 Applications of O-Maps
  3.2 Framework for O-SLAM
    3.2.1 Assumptions about the object detection system
    3.2.2 Coupling the unknowns through probabilistic modeling
    3.2.3 Computer vision algorithms compatible with O-SLAM
  3.3 Comparison with other object-based maps

4 Inference in Object-based SLAM
  4.1 Efficient solution for geometric SLAM
    4.1.1 Using linearized Gaussian models
    4.1.2 Inference using QR decomposition
  4.2 Probabilistic data association
    4.2.1 Object identities
    4.2.2 Approximating the distribution on data association
  4.3 Treating imperfect object detection
    4.3.1 Modeling confusion between classes
    4.3.2 Filtering out spurious objects
  4.4 Probability distribution on data association solutions and object classes
    4.4.1 Calculating the evidence

5 MCMC Sampling
  5.1 The Metropolis-Hastings algorithm
  5.2 Proposal distribution
    5.2.1 Proposed state transitions
  5.3 Practical issues about the proposed sampling algorithm
    5.3.1 Improving mixing with the Metropolis-Coupled algorithm
    5.3.2 Improving state transition efficiency
    5.3.3 Driving the chain convergence

6 Experimental Results of O-SLAM
  6.1 Experimental results of geometric SLAM
    6.1.1 Cogniron data set
    6.1.2 LPA data set
  6.2 Results on probabilistic data association
  6.3 Results of O-SLAM assuming class confusion

7 Conclusion and Future Work
  7.1 Contributions
  7.2 Perspectives

References

Appendix A -- Models used in geometric SLAM
  A.1 Motion model
  A.2 Observation model

1 Introduction

Mobile robots are embodied systems that move autonomously and interact with the world to perform their tasks. Such robots are now fairly common, and the tasks assigned to them are increasing in complexity. Mine clearance and demining, autonomous surveillance, floor cleaning, and assisting elderly or disabled people are among the tasks currently performed by robots, requiring them to know their localization in space and also the actions to take in each possible situation.

As the tasks performed by robots increase in complexity, robot perception must be able to capture richer information from the environment so that more complex decisions can be made. Since research in robotics was mainly focused on the robot navigation problem during the past decade, robot localization and metric maps of obstacles were the main concerns of robot perception works. While robot localization and metric map building are still active research fields, other types of perception are now being investigated more deeply, such as object recognition and mapping, place classification, natural language processing, and face and gesture recognition, among others. Particularly strong emphasis is being placed on processing visual information, since a camera is a low-cost, low-energy sensor that is very flexible, providing rich information that can be used for several purposes. Furthermore, great progress has been made in the computer vision field, from which many techniques can be employed to obtain robot perceptions.

A particular topic in computer vision that has seen great advances in the past few years is object recognition and its variants, such as object detection and object categorization. Probabilistic techniques that deal with a great amount of data and select the most significant features in images have been proposed lately, producing impressive results in image segmentation and object recognition or categorization (PINZ, 2005). In robotics, the use of computer vision techniques to recognize and map objects in the environment is becoming frequent, providing good detection and recognition rates (EKVALL; KRAGIC; JENSFELT, 2007; FORSSEN et al., 2008).


However, very few works exploit the robot movement in the object recognition and mapping task. The good performance of today's object recognition techniques suggests using them as just another robot sensor, so that their output is taken as-is and the computer vision algorithm is treated as a black box. In spite of that, object recognition techniques are not black boxes, and they are subject to failure. The robot motion model thus serves as a powerful tool that can complement the computer vision output when recognizing and mapping objects. The interaction of the robot with the world while it continuously acquires data provides a rich source of information that may corroborate or contradict the computer vision output, leading to stronger recognition results. As a consequence of this integrated approach, objects are recognized at a certain place, instead of being sought in an image or tracked over an image sequence.

The main goal of the present work is to develop a framework that produces simple models of the objects in the environment by integrating the output of a specific class of object recognition algorithms with information about the robot movement in space. Here, the computer vision algorithms are not treated as black boxes; instead, the identities of the objects in the environment, as well as their locations and sizes, are considered the unknowns of the problem. The knowledge about the object recognition limitations, integrated with the knowledge about how the robot moves and interacts with the environment, constrains both the robot and object localizations, as well as the object sizes and classes, making the inference of those unknowns possible. In the traditional robotics literature, the identities of the objects are considered as data, output by the object recognition algorithm. Thus, this work advances the current state of the art in robotics research by proposing the first technique that maps objects in the environment while estimating the robot localization without using any other external sensor, stereo images, or additional features beyond the object detections in images.

The present research represents a step towards a scenario where the objects in the environment are modeled by fusing information directly extracted from images with information about the robot movement, dropping the need to rely on some off-the-shelf object recognition algorithm. In the current work, however, such an algorithm is still necessary. The difference from related work is that fusing its output with the knowledge about the robot movement now allows the revision of the information provided by object recognition, which simultaneously constrains the robot localization and the locations and sizes of the objects in the environment.


Since the focus here is on showing how robot embodiment and computer vision techniques can be coupled to map the objects in a certain place without using any auxiliary external sensor, the proposed approach is demonstrated using a very simple object recognition technique that assumes a controlled environment. The errors of object recognition are then balanced by integrating the information gathered by the robot while it moves around the environment, so that maps that coherently describe the objects can be produced. Implementing a general-purpose and robust object recognition algorithm and integrating it into the proposed framework, to guarantee the quality of the produced results in uncontrolled real-world conditions, is out of the scope of this work.

At the same time, this thesis aims to contribute to the robot mapping research area. The proposed approach generates a map containing the 3D geometry of the objects in the environment together with their classes. An object class is a semantic label indicating what type of object it is: for instance, whether it is a door, a table, a ball, etc. The resulting map is called an O-Map, and it advances over other maps of objects used in robotics because of its consistent and compact representation, which allows robot localization without any auxiliary structures other than the objects themselves. Therefore, the contribution of this work resides in both the proposed map and the process used to build it, in which the geometric and semantic parts are obtained in a coupled fashion.

1.1 Object-based map for robot navigation

Robot navigation is a broad topic and involves different issues to be addressed. Broadly speaking, mobile robot navigation corresponds to the task of autonomously moving from a specific position, or a region, to another position or region. This task may involve just moving from the children's bedroom to the kitchen in a house, moving from one building to another on a university campus, or even traveling from one city to another. Depending on the robot locomotion method, there are also different constraints on the robot path: a terrestrial robot moving in the city should take sidewalks (for low-speed robots) or streets (for high-speed ones), but an aerial robot can usually go directly to the target. If the area where the robot can move is constrained by walls, objects, people, or other obstacles, it must find its way to avoid them and reach the target position. Finally, the robot must always know its position with respect to its final goal.

Figure 1.1: Example of a simple O-Map of a living room (labeled objects: cabinet, living-room dining table, bookcase, TV, computer desk, and armchair), showing object heights and average widths, as well as class labels. All represented objects are considered to have been detected with probability 1.

Many of the issues involved in robot navigation can be addressed with the aid of a map. Regardless of its scale, a map is a spatial representation used to find the location of elements of interest. In autonomous navigation, the elements represented in a map can be used for several purposes. For instance, they can be used to plan a path from the robot's current position to its goal, especially if the map represents the areas where the robot is allowed to navigate. Maps can also be used to localize the robot: by matching the elements sensed in the environment with those registered in the map, the robot can infer the place, or the possible places, where it is.

In this work, a new type of map for robot navigation is proposed, called the O-Map. The O-Map contains geometric and semantic information about certain objects in the environment, as depicted in Figure 1.1. The geometric information corresponds to the size and position of the objects, modeled by simple 3D geometric forms, and the semantic information is given by the object classes. The proposed map also provides a measure of certainty about having detected the objects, a feature that is further explained in Chapter 3. Throughout this work, object geometries are assumed to be well described by cylinders, but the proposed approach could be extended to any geometric form.

The O-Map has some advantages when compared to traditional maps used to localize the robot. Usually, those maps register feature points or lines in the environment, which are present in greater quantity than objects. Although this number of elements depends on the method used to detect points and lines in images or laser scans, it is true, in general, that objects are sparser. The greater sparseness of objects compared to more basic geometric elements favors the efficiency of loop-closing and robot localization algorithms, since the space of possible matches between the elements detected by the robot and those registered in the map is much smaller in this case. Finding the correct matches is the bottleneck for these tasks in general.


However, applications for O-Maps go beyond robot localization. Since they contain semantic information, they can be used as input for logic tasks or for path planning algorithms based on objects. The proposed map can be especially useful for systems that interact with humans, as in Augmented Reality (AR) and Human-Computer Interaction (HCI) applications and in service robots. Their world representation should share symbols with that of humans, and there is evidence that people themselves use objects to represent indoor spaces (VASUDEVAN et al., 2007). Even if the geometric representation is not realistic, the semantic labels together with the object positions are very informative, allowing a user to understand the represented space and to associate the mapped objects with the real ones depicted in an image.

Although other maps that represent objects in the environment have been proposed in the robotics literature, the representation adopted in this work and the process used to obtain it are novel, and they constitute the main contribution of this thesis. Instead of just inserting the detected objects into a map built using traditional approaches, here the whole map and the robot trajectory are jointly inferred from the acquired data. In particular, the semantic and geometric parts of the map influence each other, being calculated in a coupled fashion.

This thesis also proposes an algorithm to build O-Maps from visual input captured by a mobile robot that does not know its localization precisely. Obtaining the object-based map concurrently with the robot localization corresponds to a SLAM problem. SLAM stands for Simultaneous Localization and Mapping, and is the problem of determining the robot localization and a map of the environment from data acquired by the robot during its exploration. Thus, the proposed algorithm to obtain O-Maps, called O-SLAM, faces many of the challenges of traditional SLAM techniques.
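To make the representation concrete before turning to those challenges: each object in an O-Map carries the per-object quantities listed in the List of Symbols, namely its location l_j, geometric parameters g_j (a cylinder in this work), class c_j, and existence probability π_j. The sketch below is illustrative Python, not code from the thesis; all names and values are hypothetical.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class MapObject:
        """One object o_j of an O-Map (hypothetical field names)."""
        location: Tuple[float, float, float]  # l_j: 3D position in the world frame
        width: float                          # part of g_j: cylinder diameter, meters
        height: float                         # part of g_j: cylinder height, meters
        class_label: str                      # c_j: semantic label such as "door"
        p_exists: float                       # pi_j: probability of not being spurious

    # An O-Map is then a collection of such records (toy values):
    o_map: List[MapObject] = [
        MapObject((1.0, 2.5, 0.0), 0.8, 2.0, "door", 0.97),
        MapObject((3.2, 0.4, 0.0), 1.2, 0.7, "table", 0.88),
    ]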

1.2 SLAM problem and loop closing

The problem of SLAM arises whenever a robot is required to build a map of the environment without knowing its localization precisely. It is a very active research field, and many variants of the problem have been studied so far. The complete solution of this problem is regarded as one of the challenges in robotics (DISSANAYAKE et al., 2001; FOLKESSON; JENSFELT; CHRISTENSEN, 2005).
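In the notation of the List of Symbols, and anticipating Chapter 4, the geometric core that this thesis repeatedly solves can be stated compactly: for a fixed correspondence function J and class assignment C, the trajectory X and the object locations and geometries L, G are estimated as the maximizer of the unnormalized posterior named in the symbol definitions of Ā and b̄:

    θ* = argmax_{X, L, G} P(Z | X, L, G, C, J) P(X, L, G | C, J)

The full problem additionally requires inferring C and J themselves, which is what makes the sampling machinery of later chapters necessary.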


Among the sub-problems that compose the complete SLAM problem, a very challenging one is the data association problem (THRUN, 2002). It usually arises in tracking and mapping problems, where certain features detected in the sensor input data must either be tracked over time or have their locations registered in a map. In these problems, the measurement generated by a certain feature detection is used to update the estimated location of that feature. In any case, all feature detections have to be associated among themselves, defining which detections are due to a specific feature; this gives rise to the data association problem. For instance, if the robot has a "cup" detector and detects a "cup" in two different images, the data association problem corresponds to determining whether both detections correspond to the same "cup" or whether two different "cups" were detected, based on the measurements generated by those detections. Although several techniques have been proposed to solve this problem within the SLAM context, finding the optimal data association solution is still very challenging.

The data association problem is formally and exactly expressed as a set partitioning problem, studied in set theory (ROTA, 1964). Given a set of feature detections, where a feature can be a point, a line, an object, or any elementary structure represented in the map, the data association problem consists of grouping them into the correct subsets, each one corresponding to a different physical feature perceived in the world. The measurements generated by the feature detections belonging to a certain subset are then attributed to a unique feature in the world, and are used to build a representation of that feature in the map. Thus, the size of the space of possible data association solutions is equal to the combinatorial number of possible ways to partition the set of feature detections. It is worth noting that, given a certain solution of the data association problem, the number of features registered in the map is equal to the number of non-empty subsets into which the feature detections are grouped.

In general, associating measurements taken in successive feature detections is much easier than associating measurements taken over a long period of time. This is illustrated in Figure 1.2 for the specific SLAM problem where the robot detects the range and bearing of special feature points in the scenario. In the SLAM case, the shape of the robot trajectory is usually well known within short patches. Thus, supposing that certain features are detected and measured at time instant t, the predicted measurements of those features will lie very close to the actual measurements made at the successive instants t + 1, t + 2, . . . , t + n, as depicted in Figures 1.2c and 1.2d. In this case, it is easy to match the new measurements to the previous ones. However, as time passes and the robot localization becomes uncertain, the predicted projections of the measurements made at time t will lie far from the actual measurements, and deciding whether the same features or different ones have been detected becomes a more difficult task, as depicted in Figures 1.2e and 1.2f.

Figure 1.2: Effect of the robot localization error on the data association problem. (a) Instant t, when the robot detects four features. (b) Measurements acquired at instant t. (c) Scenario at instant t + 1, when the robot localization error is small. (d) At instant t + 1, predicted and actual measurements are close to each other; it is then straightforward to match new and old measurements. (e) After some time moving around, the robot localization error is large. (f) As time passes, the predicted measurements of the features detected at instant t become wrong and are difficult to match to the actual measurements. In (a), (c), and (e), stars represent the true feature positions, and the unfilled circle represents the robot pose, with the internal trace showing its orientation: the continuous stroke indicates the actual pose, while the dashed stroke shows the believed pose. In (b), (d), and (f), the black ball shows the origin of the measurement coordinates, crosses represent the projected measurements from time t, and squares connected to the origin represent the actual measurements.


This effect occurs in the SLAM problem in general, independently of the type of sensors used. Thus, for the easier case, when the data association considers measurements taken within a short period of time or when the robot pose does not change much, nearest neighbor (NN) techniques work well. However, a more difficult case is when the robot revisits previously visited areas and new measurements have to be associated with those taken before. The data association problem in this case is also known as the loop closing problem, and it requires more sophisticated techniques to be solved. In general, these techniques are exploratory, and search the space of data association solutions for the best correspondence among detections according to some criterion. However, enumerating the whole data association space is infeasible, and any exploratory technique must rely on a good search strategy, so as to prune the search space whenever possible; the sketch at the end of this section illustrates how quickly this space grows.

In this work, the approach used for data association focuses on solving the loop closing problem. However, instead of determining a single data association, this work proposes that a handful of probable data association solutions be found. Since this approach generates different maps from the same data, possibly with a different number of objects each, the final O-Map is built by combining these maps, taking the expectation of the mapped object parameters. This can be useful when there is not enough data to determine a single good data association solution. For instance, there might be cases where it is not possible to assure whether there is a single "clock" in the room or two similar "clocks".

To find probable data association solutions, that space is sampled using Markov Chain Monte Carlo (MCMC) techniques, which use the theoretical properties of Markov chains to draw samples from any target distribution. MCMC methods provide a well-founded framework for approximate inference in cases where exact inference is intractable, replacing the ad hoc methods or heuristics that would be necessary otherwise (GILKS; RICHARDSON; SPIEGELHALTER, 1996a). Related work shows that these methods can provide practical algorithms to successfully sample over the challenging space of data association solutions if well tuned using specific knowledge about the problem (DELLAERT, 2001; KAESS; DELLAERT, 2005; RANGANATHAN, 2008).
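As the promised illustration of scale (an illustration only, not an experiment from the thesis): the number of ways to partition K detections is the Bell number 𝔅(K) defined in the List of Symbols, and it grows explosively. A minimal Python sketch using the standard Bell-triangle recurrence:

    def bell(k: int) -> int:
        """Bell number B(k): the number of ways to partition a set of k
        elements, computed with the Bell triangle recurrence."""
        row = [1]
        for _ in range(k - 1):
            new_row = [row[-1]]  # each row starts with the previous row's last entry
            for value in row:
                new_row.append(new_row[-1] + value)
            row = new_row
        return row[-1]

    # Even modest numbers of detections yield astronomically many partitions:
    for k in (5, 10, 20, 40):
        print(k, bell(k))
    # 5 -> 52; 10 -> 115975; 20 -> 51724158235372; 40 -> about 1.6e35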

1.3 Object-based SLAM

In this work, the features represented in the map correspond to the objects detected by the robot while exploring the environment. The proposed O-SLAM algorithm considers that the robot captures images from a video camera and runs object recognition algorithms specialized in detecting certain classes of objects in those images. Besides the presumed class, the measurements given by the detectors consist of object apparent sizes and positions in images. An important assumption behind this work is that, in spite of the good results modern object recognition algorithms can produce, they are still subject to failure. Therefore, while O-SLAM uses existing off-the-shelf object detectors as basic sensors, the output of those algorithms is not completely trusted; instead, it is used to feed a probabilistic inference mechanism that revises and filters the returned measurements at the same time as it builds the map and finds the robot trajectory. Here, we model the behavior of the object detectors by assuming that their error modes correspond to:

• imprecise object size and position in the image;

• misclassification, characterized by detecting an object as belonging to the wrong class; and

• spurious detection, characterized by detecting an image patch that does not correspond to an actual instance of the target object class.

This work proposes that an MCMC technique be used to perform approximate inference of the object classes, analogously to what is done for the data association solution. The key idea is to couple the object class probabilities with the geometric parameters of the map. Therefore, the model developed to describe their probability distribution does not allow the independent inference of the object classes, locations, and geometries and the robot localization. As a consequence, the inference procedure must explore the joint space of object classes, analogously to the inference of the data association solution. To do so, O-SLAM uses MCMC to sample over the joint space of object classes and data association solutions, adding a new dimension to the problem.

Given a sample of the pair data association solution and object classes, O-SLAM treats the geometric measurement imprecision using the Square-Root Smoothing and Mapping (√SAM) framework (DELLAERT; KAESS, 2006).


√SAM is a state-of-the-art technique for geometric estimation in SLAM that models the problem of jointly obtaining the robot trajectory and the map as a non-linear least-squares (LS) optimization, treated by iterative methods that solve a linearized version of the problem at each step. Since the Hessian matrix of the linearized system is sparse, it is efficiently solved using QR factorization, a linear algebra technique especially suitable for sparse systems (GOLUB; LOAN, 1996). √SAM can also be made incremental (KAESS; RANGANATHAN; DELLAERT, 2007) and out-of-core (NI; STEEDLY; DELLAERT, 2007). The proposed algorithm uses the same framework to model the geometric relationships between object apparent positions and sizes and robot poses, and applies similar techniques to solve the resulting linearized LS problem.

Finally, spurious measurements are treated by identifying the spurious objects in the map. Some objects, or patches of objects, in the environment may look similar to instances of a certain class, and are sometimes detected as such. However, if the object detectors are well designed, which is a premise of this thesis, these objects are not detected in a consistent way, so that highly probable data association solutions attribute few measurements to them. Moreover, these objects are usually detected only from a restricted point of view. Since a good estimate of the object location can be achieved only if measurements from several points of view are considered, we expect the uncertainty about the location of these spurious objects to be high. Thus, the probability of an object being spurious, given a sample of the pair data association solution and object classes, is modeled as depending on the posterior covariance matrix of its location.
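Chapter 4 formalizes this dependence with a mirrored sigmoid over the object's location uncertainty (Figure 4.3 plots one with parameters τ = 0.1 and t0 = 0.3). The Python sketch below conveys the idea only; the reduction of the covariance to a scalar and the parameter values are placeholders, not the thesis's definitions.

    import numpy as np

    def existence_probability(location_cov: np.ndarray,
                              t0: float = 0.3, tau: float = 0.1) -> float:
        """Map an object's posterior location covariance to a probability of
        NOT being spurious, using a mirrored (decreasing) sigmoid.
        Illustrative: the scalar uncertainty measure used here (the largest
        marginal standard deviation) stands in for the model of Chapter 4."""
        # Collapse the 3x3 covariance to one number: the largest marginal std dev.
        sigma_max = float(np.sqrt(np.max(np.linalg.eigvalsh(location_cov))))
        # Mirrored sigmoid: near 1 for small uncertainty, dropping around t0.
        return 1.0 / (1.0 + np.exp((sigma_max - t0) / tau))

    print(existence_probability(0.01 * np.eye(3)))  # seen from many views: ~0.88
    print(existence_probability(1.00 * np.eye(3)))  # one restricted view: ~0.001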

1.4 Objectives

The goals of this thesis are:

1. introducing O-Maps;

2. defining O-SLAM, an off-line probabilistic SLAM algorithm that builds O-Maps with static objects from the data output by specialized object detectors and the robot odometers; and

3. showing that, if measurements are taken from the objects of interest from different points of view, O-SLAM can build consistent O-Maps and provide the approximately correct robot trajectory for the period when sensor data was collected.


The proposed algorithm, O-SLAM, considers that the robot captures images from a video camera while it explores the environment, and tries to detect certain classes of objects in them. Besides the presumed object class, the measurements returned by the object detectors consist of the apparent sizes and positions of the objects in the images. Here, the assumed object detector error modes correspond to imprecise detected sizes and positions, and misdetections, that is, detections that do not correspond to an actual instance of the target object class. Thus, O-SLAM relies on probabilistic models describing its sensor error modes to infer object classes and build their geometric models.

As mentioned in Section 1.2, data association is addressed as a sampling problem. Instead of seeking the best data association solution, the proposed SLAM algorithm samples over that space using an MCMC technique. Extending results from other works, the space of object class assignments is also sampled, adding a new dimension to the problem. The generated samples are then post-processed to build a single map of the environment.

In this work, O-Maps are considered consistent if only actual objects, present in the environment, are mapped with a low probability of being spurious. Approximately correct trajectories are those that do not diverge from the ground truth trajectory, so that the robot localization error remains bounded over time. The important aspect of this definition is to make clear that the proposed technique is able to recognize when the robot passes by a previously visited place. It is worth emphasizing that achieving these results also depends on the use of good object detectors. For instance, if they do not detect some objects of interest in the images, or detect them few times or from a very restricted point of view, these objects will not be mapped, or will have a high probability of being spurious, causing the resulting map not to be consistent with the environment.

In order to demonstrate the capabilities of O-SLAM, goal number 3 is divided into three sub-goals:

• demonstrating the geometric model consistency: if the correct data association solution and object class assignments are provided, and only non-spurious measurements are used, object-based SLAM generates consistent object-based maps, including object locations and sizes in the environment, and approximately correct robot trajectories;

• demonstrating the success of O-SLAM in closing loops: the MCMC approach provides practical algorithms for solving the data association problem in object-based SLAM even when the robot closes loops in the trajectory; and

• demonstrating the effect of modeling object detector errors: if the data association solution and the correct object classes are unknown, consistent object-based maps are feasible to obtain by jointly inferring the geometric and semantic information, overcoming the problem of imperfect object detection.

1.5 Contributions

This thesis presents two main contributions: 1) the introduction of O-Maps and 2) the definition of the O-SLAM algorithm. The O-Map constitutes a new environment representation in the robot navigation context, and presents some potential advantages when compared to traditional representations. O-SLAM is a new off-line SLAM algorithm that solves the problems posed by the construction of O-Maps, like finding the most probable object classes and assigning a spuriousness probability to each object. No other SLAM algorithm uses only the output of object detectors and odometers to build object-based maps and localize the robot all at once.

The first contribution of this thesis is the introduction of O-Maps as an alternative to traditional maps, providing a semantic and geometric representation of the objects in the environment. Although other types of maps describing the objects in the environment exist, the proposed representation is novel in the robotics context, since no other type of map informs object positions, sizes, and classes at once, besides representing the uncertainty of these parameters. The proposed object-based map presents some potential advantages over other types of maps proposed in the literature in tasks related to robot navigation, path and task planning, and interaction with humans. When compared to traditional point or line feature-based maps, O-Maps are more suitable for closing large loops. As mentioned in Section 1.2, the greater sparseness of objects when compared to elementary geometric features like points or lines is favorable to robot localization, because data association is easier to resolve in the sparse case. Furthermore, compared with other maps of objects proposed in the literature, the proposed object-based map allows the robot to better deal with perceptual aliasing, because apparent object sizes naturally lead to a notion of the distance from the robot to the objects. Path and task planning take advantage of the semantic information represented by object classes, since higher-level plans can be obtained. Finally, humans can better communicate with robots using the common vocabulary of objects, which are referred to by their names in natural language.

Another new feature of O-Maps is the representation of the objects' non-spuriousness probability. Since the proposed map is built from different hypotheses about the data association, the number of objects may be uncertain. Here, this uncertainty is represented by assigning to each object in the map a marginal probability of having actually been detected in the environment. Because this involves matching the same physical objects among maps obtained with different hypotheses about the data association solution, the proposed SLAM algorithm defines a concept of object identity valid across the different associations. The presence of the same object identity in the maps generated assuming the most probable data association solutions increases its non-spuriousness probability. Besides, the parameters of objects with the same identity, obtained assuming different data association solutions, are combined to define that object's parameters in the O-Map.

The other contribution of this thesis is the O-SLAM technique, which builds the proposed object-based map from visual input data and recovers the robot trajectory. The novelty of the proposed technique resides in the way semantic and geometric information is used together to build the map, so that object geometries and classes are tightly coupled. Thus, object classes influence and are influenced by the geometric parameters of the map. Another new characteristic of O-SLAM is that, instead of building a map with point-wise features, the proposed algorithm builds maps consisting of simple geometric models of the objects in the environment. Although the maps and trajectories generated by O-SLAM are not as precise as those generated by conventional point-based SLAM techniques, there is no significant loss of the information necessary for robot localization in environments containing objects. People themselves do not know the exact dimensions and locations of objects, but have just a vague notion about them, and even so are able to self-localize in the environment using objects as landmarks. Hence, the less precise geometric representation in O-Maps is still enough for localization at a more abstract level, which can be verified by its loop closing ability.

Among the procedures involved in the O-SLAM algorithm, it is worth emphasizing the specific contribution on how to efficiently use MCMC techniques to sample the data association space. Although much is inspired by previous work (KAESS; DELLAERT, 2005; RANGANATHAN, 2008), some new procedures are proposed, designed to increase the efficiency of the sampling process and to allow sampling on the augmented space of data association solutions and object classes.
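As an illustration of how a non-spuriousness probability can be estimated from sampled data association hypotheses, consider the following minimal sketch (Python; the representation of the samples and the helper name are hypothetical, and the thesis' actual procedure also combines the matched objects' parameters):

```python
from collections import Counter

def non_spuriousness(samples, min_measurements=2):
    """Estimate each object's non-spuriousness probability from samples.

    `samples` is a list of data association solutions, each mapping a
    persistent object identity to the set of measurement indices it
    explains (identities assumed matched across samples, as proposed
    above).  An identity supported by enough measurements in a sample
    counts as 'present' in that sample; the fraction of samples in
    which it is present approximates its non-spuriousness probability.
    """
    present = Counter()
    for assoc in samples:
        for identity, measurements in assoc.items():
            if len(measurements) >= min_measurements:
                present[identity] += 1
    return {identity: count / len(samples)
            for identity, count in present.items()}
```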

1.6 Organization

The remainder of this thesis is organized in three parts. The first one, comprising Chapter 2 and Chapter 3, introduces the concepts used here and situates this work in the related literature. Chapter 2 gives a brief review of the robot mapping field, introducing the main problems faced in this area, the concepts involved, and the most relevant proposed solutions. Chapter 3 describes the new map proposed in this thesis, relating it to similar work in the literature. The second part, formed by Chapter 4 and Chapter 5, constitutes the technical part of the work, where the proposed object-based SLAM algorithm is detailed. In Chapter 4, the solution to infer the map, comprising object positions, geometries, and classes, as well as the robot trajectory, is detailed. For that, samples from the distribution over object classes and data association solutions are considered given. Chapter 5 describes the proposed solution to generate such samples using the MCMC approach, and also discusses some practical implementation issues. Finally, Chapter 6 and Chapter 7 form the last part of this work, where the results and conclusions of the proposed technique are discussed. Chapter 6 describes the results of experiments performed with images captured by real robots, using a simple object detection and segmentation algorithm. The conclusions of this work are presented in Chapter 7, which also discusses possible future developments of the present work.


2 Robotic Mapping

This chapter situates this work in the research field of robotic mapping. It has been a very active field for the past three decades, and is still under continuous development. Since there are many applications involving robot mapping, many branches of this research field have been proposed, investigating specific aspects of the area. Thus, the goal of this chapter is to specify the branches where this thesis best fits, and to which it contributes the most.

Mapping is the name given to the task of building environment representations from data acquired from the environment. This kind of task is common in several human activities, such as architecture, civil construction, cartography, aviation, geography, geology, and military activities, among others. When it comes to robotic mapping, the term corresponds to the task of building an environment map in an autonomous way by a mobile robot, using its sensors for this purpose. In this case, the robot processes its sensor data, extracting relevant information, and registers it in the map.

Maps express the information they contain by assigning values to their elements. The nature of these elements varies depending on the map type, and may correspond to one in a large spectrum of possible structures. Thus, the mapping task corresponds to assigning values to the map elements, relating each one to a certain position in the map. Maps used in robotics may receive different names according to the elements they use to describe the environment. In Section 2.1, some of the most common map classes are discussed. Section 2.2 discusses the mapping process in robotics, emphasizing those processes that do not assume precise knowledge about the robot localization.


Figure 2.1: Classification of the most common types of maps used in robotics according to their abstraction level. (a) Occupancy-grid map, representing metric maps, at the lowest level. (b) Feature points map indicating the location of trees and the robot trajectory, extracted from (SELVATICI; COSTA, 2007a). (c) Topological map that connects landmarks with the control commands used to move from one to another, extracted from (VASSALLO, 2004).

2.1 Map types

Several types of maps have been proposed in the robotics literature. The ideal map type to be used or built by a mobile robot depends on the task and the environment where the robot is inserted. It also depends on robot characteristics, such as the type of sensors it has, as well as the way it moves. In any case, it is possible to classify the basic map types according to the abstraction level of the elements used in their construction. Figure 2.1 illustrates this classification, based on the map taxonomy proposed by Dudek & Jenkin (2000). Metric maps are those at the lowest level of abstraction, and their elements correspond to cells indicating the value of some local feature. Topological, or relational, maps are at the other extreme, presenting the highest abstraction level. They represent places or objects and connect them by some kind of relationship, without concern for their precise locations. Geometric or feature maps are at an intermediary abstraction level, since they represent the locations of some features in the environment while not assigning a value to each point in the map.

Despite the great variety of map types that can be constructed, each one adequate to a given task or environment, there is a very clear distinction between the processes of building metric maps and topological maps. In the case of metric map building, the purpose is to obtain a detailed map of the environment, with information about the shape and size of objects and the limits of free areas for navigation, such as corridors, rooms, trails, and roads. In order to represent such complex and detailed information, the map is generally divided into a dense grid in such a way that each cell contains information about the occupancy of that space by an object, and possibly other environment characteristics being mapped. In the case of topological maps, the aim is to construct a relational structure, typically a graph, relating particular locations in the environment and defined objects or marks, whether natural to the environment or built with the purpose of being mapped. It is easy to realize that the information in a topological map is less detailed and sparsely distributed, concentrated only on some points of interest. Thus, it is very selective concerning the information it registers. The expressive difference between metric and topological maps often leads to practical solutions that try to conciliate metric elements in relational maps, or vice versa, in an attempt to combine the advantages of the different approaches, either by building hybrid maps (NIETO; GUIVANT; NEBOT, 2004; BLANCO; FERNANDEZ-MADRIGAL; GONZALEZ, 2008) or by using multiple abstraction levels to represent different aspects of the environment (DUDEK, 1996).

2.1.1 Metric maps

The purpose of metric maps is to represent the environment with a high level of detail, encoding information about the shape and size of objects and the limits of free areas for navigation. For that, metric maps divide the 2D or 3D environment space into regular cells, which can represent any local feature of interest, working, for instance, as a blueprint of a room or a building floor, indicating the objects and delimiting the areas where the robot can navigate. A classic example of a metric map is the evidence grid map (ELFES, 1987), which represents the environment as a matrix in which each cell contains the probability of that space being occupied by an obstacle for the robot. This concept was extended to 3D evidence maps, which divide the 3D space into cells representing the occupancy probability (MORAVEC, 1996).

A less common example of a metric map is the visual appearance map, which represents the image captured by the robot for each pose it can assume. In this case, each cell of the map directly represents the sensory information the robot perceives, instead of a local feature of the environment that does not depend on the robot perception. An instance of such a map is the one proposed by Porta & Kröse (2006), where images are encoded by the Discrete Cosine Transform.

In general, metric maps require a large amount of memory, and each update step in the mapping process is computationally expensive, because it involves processing several cells per step. In addition, even if each mapping iteration is simple and involves few cells at a time, complete map building requires iterated operations over every cell. However, as the metric map saves information in the sensor signal domain itself, the localization task is executed in a simple way. Moreover, due to the high density of information in these maps, the localization results are more accurate and less subject to ambiguities.
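To illustrate the cell-wise update in evidence grid maps, the following minimal sketch shows the standard log-odds formulation of the Bayesian occupancy update (a common textbook form, not tied to any specific work cited above):

```python
import numpy as np

def update_cell(logodds, p_hit):
    """Bayesian update of one evidence-grid cell in log-odds form.

    `p_hit` is the sensor model's probability that the cell is occupied
    given the current measurement; accumulating log-odds turns the
    Bayesian update into a simple addition per observed cell.
    """
    return logodds + np.log(p_hit / (1.0 - p_hit))

def occupancy(logodds):
    """Recover the occupancy probability from the accumulated log-odds."""
    return 1.0 - 1.0 / (1.0 + np.exp(logodds))
```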

2.1.2 Topological maps

At the other extreme of the taxonomy presented in Figure 2.1 are topological maps, also called relational maps, which provide relationships among several locations or objects present in the environment, generally in the form of a graph. Generally speaking, topological maps register information concerning particular elements or places of the environment, called landmarks. Thus, landmarks and relationships are the elements of topological maps. These relationships can be of several types. They may indicate the relative displacement between two landmarks, the existence of a path between them, the control commands that move a robot from one landmark to another, semantic labels giving high-level instructions to go from one to the other, and so on. Examples of topological maps include the seminal work of Kuipers (1978) on the TOUR model, which consists of semantic labels indicating the high-level strategy for a person or robot to go from one landmark to another; Probabilistic Topological Maps (PTMs) (RANGANATHAN; DELLAERT, 2004), which show metric landmarks and the paths between them; and visual-motor maps (VASSALLO, 2004), which describe landmarks by a characteristic panoramic image and link them by edges annotated with the heading control commands that take the robot from one to the other.

Topological maps require much smaller memory space when compared to metric maps. For this reason, they are more appropriate for mapping large areas. On the other hand, even though an iteration of an algorithm that builds topological maps has to process only a few nodes of a graph, the mapping process is somewhat more complex. This is due to the need to process robot sensor data in a more abstract way, in order to extract information about the relevant localities that must be considered as landmarks. Furthermore, extra processing must be performed to determine the relationships between landmarks.
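As a concrete illustration of the relational structure described above, a graph with landmarks as nodes and control commands annotating the edges can be as simple as the following sketch (landmark names and commands are hypothetical):

```python
# A minimal topological map in the style of a visual-motor map: nodes are
# landmarks, and each directed edge stores the control command that takes
# the robot from one landmark to the next.  Purely illustrative values.
topological_map = {
    "entrance": {"corridor": "forward 3m"},
    "corridor": {"entrance": "turn 180deg; forward 3m",
                 "kitchen": "forward 5m; turn -90deg"},
    "kitchen":  {"corridor": "turn 90deg; forward 5m"},
}

def command_to(src, dst):
    """Return the stored control command for a direct edge, if any."""
    return topological_map.get(src, {}).get(dst)
```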


Figure 2.2: Examples of geometric feature maps. (a) Map of ceiling lines, extracted from (FOLKESSON; JENSFELT; CHRISTENSEN, 2005). (b) Textured map of planes, extracted from (THRUN et al., 2004).

In fact, there is a trade-off between the complexity of the information contained in relational maps and the great quantity of elements that need to be registered in metric maps. A map containing simple relationships may require simpler data processing, but also requires a greater number of elements to provide enough information. At one limit we have metric maps, in which relationships among map elements are negligible and the information each element contains is as simple as a local feature. At the other limit, where very few places in the environment need to be represented, but the information about them and the relationships among places become very complex, we have purely relational maps.

2.1.3 Features maps

The extremes represented by the metric and topological approaches lead to the use of intermediary solutions. A possible one corresponds to features maps, which represent some special features of the environment, generally described by geometric elements. Figure 2.2 shows examples of this map type. Differently from purely metric maps, the environment space is not fully represented by a features map, but just some special structures, like points, wall borders, or objects, as in the case of this work. On the other hand, the represented features have their absolute positions assigned, as well as other eventual information related to them, without representing the relations among different features.


Despite being generally classified as a type of metric map, features maps have strong similarities to topological maps, especially concerning the respective mapping processes. While in topological mapping the places where the robot has been are linked together, in features mapping the locations of the features that the robot has seen are found by relating them to the places the robot visited. Put differently, topological mapping involves building a graph connecting places through which the robot passed, and features mapping involves building a graph connecting features detected by the robot to the places from which it detected them (DELLAERT, 2005).

The most common features used in maps for robot navigation are special points detected in the environment. 2D or 3D point-wise features were adopted in some seminal works (SMITH; SELF; CHEESEMAN, 1990; LEONARD; COX; DURRANT-WHYTE, 1992; DISSANAYAKE et al., 2001; SE; LOWE; LITTLE, 2002; DAVISON, 2003) and have since been largely adopted. They carry little information, especially when detected by range finders such as sonars or laser scanners, since they inform only point positions in the robot motion plane or in space. Thus, many feature points are necessary to describe the environment with enough detail for tasks such as robot navigation. If points are detected in images captured by the robot, some local image descriptors may also be assigned to the mapped features.

Lines and planes are other common geometric features used in robotic mapping (PFISTER; ROUMELIOTIS; BURDICK, 2003; THRUN et al., 2004; FOLKESSON; JENSFELT; CHRISTENSEN, 2005). They describe more general structures than points, requiring much less memory space to represent the same area. For instance, a wall in a corridor can be described using just one line segment (in 2D maps) or plane patch (in 3D ones), while several points would be required to do the same. However, using lines or planes to describe the environment requires detecting more complex structures and matching them across different scans or images. Hence, there is a trade-off between using simpler or more complex features, analogous to the trade-off between metric and topological maps.

2.1.4 Object-based maps

Recently, objects have also been considered as features in robot navigation maps. Although the recognition and reconstruction of geometric objects in the environment had already been a research topic in some isolated works, e.g., (ALLEN; LEGGETT, 1995), it did not enter mainstream robotics research until the development of robust object recognition techniques from computer vision. In contrast to simple geometric elements, objects embed semantic information, considerably augmenting the potential applications of maps.

The works of Vasudevan et al. (2007) and Ranganathan & Dellaert (2007) are examples of feature maps that model the objects in the environment. While the former work represents the positions of objects detected in large indoor spaces with respect to a local coordinate frame, the latter augments the object representation with a size parameter, though only objects depicted in a single image are modeled. Although the image can be either the result of a single pinhole camera shot, or a panoramic one resulting from aligning images with non-overlapping objects, the method proposed by Ranganathan & Dellaert (2007) cannot build a single map from several overlapping images of a large area.

However, most works concerned with object-based mapping rely on auxiliary structures to represent the environment, using objects to augment them. Ekvall, Kragic & Jensfelt (2007) use objects recognized in images to augment a map of line features, while Forssen et al. (2008) use an occupancy-grid map. Both works represent objects by their recognized identity and location in the map. Some works pose different degrees of dependence on other map structures. Galindo et al. (2005) describe a hierarchical representation that uses object descriptions and metric maps as the perceptual layer, subordinated to semantically labeled places, which correspond to the cognitive layer. The objects are described using the simple low-level features needed to recognize them in images. Limketkai, Liao & Fox (2005) and Friedman, Fox & Pasula (2007) even drop the need for visual representations of objects. In both works, respectively, objects and places are recognized in previously built metric maps using trained discriminative random nets.

However, none of the cited works generates a probabilistic object representation. It may be the case that the method adopted to detect and identify objects in the scenario or in a prebuilt map is not completely reliable, so that a reliability measure for the mapped objects becomes desirable. The object-based map presented in this work, the O-Map, allows the assignment of a reliability degree to each object represented in the map. The O-Map also uses only objects as features, without relying on any other structure to represent the environment. Analogously to the work of Ranganathan & Dellaert (2007), objects are represented with their 3D locations, sizes, and classes. However, the object sizes in this work comprise the parameters of some geometric volume, increasing the dimensionality of the mapping problem. Moreover, as with the work of Vasudevan et al. (2007), the proposed map can represent the objects present in large areas.

2.2 The SLAM problem

As the mapping task consists in relating information extracted from the environment to a location in the map, mapping errors correspond to the accumulation of sensory and robot localization errors. The mapping task does not pose a hard problem when the robot localization history is known accurately (THRUN, 2002). In this case, the mapping error is intrinsically linked to the error of the sensors that capture environment information. Because measurements of these sensors at two different moments are usually independent, the mapping errors are limited. The same applies to the cases where the robot localization error is bounded by an acceptable value.

On the other hand, when the robot localization is given with cumulative, unbounded error, mapping errors become unbounded too. This is the case where localization is estimated by dead reckoning, a technique that calculates the robot localization by just integrating data from internal sensors, which detect displacements with no reference to an external element (DUDEK; JENKIN, 2000). In this case, in order to obtain the map, it is necessary to correct the localization error using data provided by external sensors. However, this corresponds precisely to the localization task, which requires a previous map of the environment to be executed. As executing the two tasks separately leads to errors that accumulate over time, mapping and localization should be carried out together, which leads to the SLAM problem.

This section addresses the SLAM problem in a more detailed way, comprising a bibliographic review of the topic. Although research in robot mapping has been active since at least the 1970s, the first proper treatment of the SLAM problem dates from the late 1980s, with the seminal work of Smith & Cheeseman (1986), which develops a probabilistic framework to treat the uncertain spatial relationships involving the robot positions and detected beacons.

2.2.1 Related problems

The complex problem of SLAM can be addressed according to its different related sub-problems (THRUN, 2002). The first one corresponds to the estimation problem: errors from internal and external sensors contribute to both mapping and localization errors during the mapping process, which thus become dependent on each other. These errors must therefore be resolved in a mutual way, by the joint estimation of the robot localization and the map parameters, characterizing the simultaneous localization and mapping problem.

On the other hand, the solution of the estimation problem poses another one: the data association problem, mentioned in Section 1.2 for the feature mapping case. In the robot localization problem alone, where a map of the environment is available, the data association problem arises when the robot has to decide which elements in the map its sensor perceptions correspond to. In the SLAM problem, this difficulty is aggravated by the use of an incomplete map to localize the robot. In this case, the robot also has to decide whether its perceptions in fact correspond to elements that already exist in the map, or whether new elements must be added to it.

Another problem that directly affects the mapping and localization problems in general, but becomes critical when performing SLAM, is the dimensionality problem. While the number of map elements to be calculated can be very large, for instance, of the order of millions of cells in some metric maps, the joint estimation of the map and localization demands higher computational effort than the execution of either task individually. Hence, one of the challenges related to the development of a technique to solve the SLAM problem is keeping its computational complexity tractable.

The SLAM problem becomes even harder when a dynamic environment is considered. When we allow for environmental changes during the mapping process, new explanations for the perceived data become possible. This makes the data association problem even more difficult to solve. For instance, if the robot obtains unexpected sensor data, it must decide whether it is due to the localization error, to sensor errors, to an unvisited location, or to a change in the environment state. This work assumes that the objects being mapped are static. In fact, most modern SLAM techniques still assume a static environment.

Besides all the problems involved in processing the data, there is also the problem of how the data is obtained. Since the environment map is not known initially, it is not possible to make a previous plan to be used by the robot to explore the environment. Thus, autonomous exploration can only be obtained by programming robot behaviors, which are processing modules that map robot perceptions to the execution of an action (MURPHY, 2000). It is desirable that exploratory behaviors be purposeful, so that the way data is acquired makes the solution of the SLAM problem easier. In general, SLAM techniques do not present an integrated solution for environment exploration; rather, exploratory techniques are based on some SLAM method in order to optimize movement commands as well as pieces of trajectory, in a way that minimizes mapping errors or accelerates map building (STACHNISS; GRISETTI; BURGARD, 2005; KOLLAR; ROY, 2006). Therefore, the topic of exploration in the SLAM problem is traditionally treated separately from the other ones. As with the dynamic environment problem, this work does not treat the exploration topic in SLAM.

2.2.2 Estimation techniques

Although the literature presents different probabilistic models for variable estimation in the SLAM problem, virtually all of them can be derived from a general-purpose formulation using elements of a Hidden Markov Model (HMM) (HAGEN; KRÖSE, 2002). The theory of HMMs is traditionally used in signal processing, as is the case with handwriting and speech recognition problems (RABINER, 1989). The literature concerning these problems is well established, and several techniques have been proposed. However, mainly because of the differences in the application domain, there are few examples of direct use of HMM techniques for robot mapping and localization. In this context, the best known techniques are Markov Localization (FOX, 1998) and the use of Expectation Maximization (EM) for SLAM in the work of Thrun, Fox & Burgard (1998). All of these techniques adopt the same general model for the system, in which a discrete space for the robot localization and the map is assumed and there are no restrictions on the variables' probability distributions or on the behavior of the motion and observation models. Thus, the computational complexity of these algorithms is high compared to related techniques that assume some simplifications of the problem domain.

For historical reasons and theoretical guarantees (DISSANAYAKE et al., 2001), one of the most popular techniques for estimation in SLAM is the Extended Kalman Filter (EKF), an inference technique applicable to certain continuous state-space HMMs, introduced to the SLAM context by the seminal works of Smith, Self & Cheeseman (1990) and Leonard, Durrant-Whyte & Cox (1990), and inspiring a great part of the works in the field. The EKF is used to keep an estimate of the current robot localization and the locations of some geometric features, most often points. However, the use of the EKF to solve the estimation problem in SLAM poses some limitations. Although some of them have been overcome or attenuated by further modifications, others are intrinsic to the representation used by the EKF, and thus require deeper changes in order to be resolved. Examples of such limitations are the assumption that the unknowns are well modeled by Gaussian variables, the assumption that the robot motion and observation models are satisfactorily approximated by their linearized versions, and the lack of an integrated solution for the data association problem.

All the cited limitations can be thought of as representation problems. To overcome them, several SLAM algorithms have been created based on samples, which constitute more general representations of probability distributions. DP-SLAM (ELIAZAR; PARR, 2003; ELIAZAR; PARR, 2004) uses a particle filter to keep the belief on the robot localization and an evidence-grid map, with the aid of a special structure called the ancestry tree. FastSLAM (MONTEMERLO et al., 2002) uses a Rao-Blackwellized particle filter to maintain samples over the full robot trajectory. In fact, each sample represents more than a trajectory: it also carries small EKFs associated with each feature. In both cases, each sample assumes a different solution for the data association, so that these methods present an integrated way to solve the data association problem.

Rao-Blackwellized particle filters, introduced to the SLAM problem by Doucet et al. (2000), make use of the Rao-Blackwell theorem to improve the estimation of the quantities of interest. The Rao-Blackwell theorem, developed within the frequentist approach, states that estimators derived by taking the analytical expectation of crude estimators with respect to a sufficient statistic of the data samples have lower variance. Porting this statement to the Bayesian approach, where samples can be understood as an approximate representation of a probability distribution, we have, as a consequence, that a posterior joint distribution is better represented by sampling only part of the random variables, representing the others by analytical formulae conditioned on the samples whenever possible. For instance, let the posterior of interest be $P(A, B \mid D)$, where $A$ and $B$ are generic random variables and $D$ is the observed data. Instead of approximating it by ordinary samples, $P(A, B \mid D) \approx \frac{1}{N_s}\sum_{n=1}^{N_s} \delta(A, A_n)\,\delta(B, B_n)$, where $\delta(\cdot, \cdot)$ is the Kronecker delta, the Rao-Blackwell theorem implies that better estimates are obtained if the samples of $B$ are used as the sufficient statistic of the full set of samples, and $P(A, B \mid D) \approx \frac{1}{N_s}\sum_{n=1}^{N_s} P(A \mid B_n, D)\,\delta(B, B_n)$. In the case of FastSLAM, the samples $B_n$ correspond to trajectories, while the analytic distributions $P(A \mid B_n, D)$ correspond to a product of independent Gaussian distributions over the map features, updated by independent EKFs.

However, particle filters also present drawbacks. Even if not used for real-time localization and mapping, they may suffer from particle degeneracy, a problem that prevents the filter from sampling correctly over the support of the target probability. If used for real-time applications, keeping a sufficient number of samples may consume more computational time per update step than is available. Moreover, the techniques used to avoid the sample degeneracy problem require additional processing time.

Another important limitation of filtering techniques in general is that no revision of past decisions is done. By definition, filtering techniques infer the current state of an HMM given the available data (CAPPÉ; MOULINES; RYDÉN, 2005), and use only new data to update the belief on the current robot localization and map state. Hence, the assumed data association solution or solutions, stored in the samples maintained by filtering techniques, are only updated, but cannot be revised. A way to overcome this limitation is to infer the entire history of the robot localization and the map all at once, performing batch optimization to find the optimal or quasi-optimal solution. Techniques that estimate the history of states of an HMM are called smoothing techniques (CAPPÉ; MOULINES; RYDÉN, 2005).
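The Rao-Blackwellized factorization above maps directly onto a simple data structure; the sketch below (Python, with hypothetical field names) shows the FastSLAM-style sample layout, with the trajectory as the sampled part and one small Gaussian per feature as the analytically maintained part:

```python
from dataclasses import dataclass, field

@dataclass
class Particle:
    """One Rao-Blackwellized sample in the FastSLAM style (sketch).

    The trajectory is the sampled part (a B_n above); each map feature
    keeps its own small Gaussian (mean, covariance) updated analytically
    by an independent EKF, i.e. the P(A | B_n, D) factors.
    """
    trajectory: list = field(default_factory=list)  # sampled robot poses
    features: dict = field(default_factory=dict)    # feature id -> (mean, cov)
    weight: float = 1.0                             # importance weight
```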

When the SLAM problem is addressed as a batch optimization problem, it is commonly referred to as complete SLAM, off-line SLAM, or SAM, an acronym for Smoothing and Mapping (DELLAERT, 2005), and constitutes a problem distinct from the more common on-line SLAM. Although off-line SLAM methods cannot in general be applied to real-time localization and mapping, they allow the bundle adjustment of the map and trajectory, usually rendering finer results. Examples of off-line SLAM techniques are the work of Lu & Milios (1997), who proposed a specific solution for the alignment of laser scans, extended by the work of Gutmann & Konolige (1999), who proposed a technique to close large loops in the map. Folkesson & Christensen (2004) proposed the Graph-SLAM technique, which performs gradient descent optimization in the joint space of robot poses, feature locations, and data association solutions. Dellaert (2005) proposed the √SAM technique, an efficient estimation algorithm for the complete SLAM problem, whose on-line version can replace the most commonly used EKF-based filtering algorithms with great advantage (KAESS; RANGANATHAN; DELLAERT, 2007).

In fact, the original EKF formulation poses serious efficiency issues when applied to estimation in SLAM problems. Since the computational complexity of the original EKF formulation grows with the square of the number of map features, some variations have been proposed that explicitly consider a graphical model of the SLAM problem and update it in constant time, under some realistic approximations (PASKIN, 2003; LIU; THRUN, 2003). In turn, incremental √SAM (iSAM), the on-line version of √SAM, is capable of exactly updating the map in constant time; however, it needs to execute √SAM on a regular basis, which runs in O(n²) in the number of features (KAESS; RANGANATHAN; DELLAERT, 2007). Nevertheless, experiments show that this approach is perfectly suitable for processing a great amount of data at run-time (KAESS, 2008).

2.2.3 Data association problem

Many techniques have been proposed to solve the data association problem in SLAM. The key factor that most influences the success of a given technique is the density of the map under construction. In general, distinguishing between local and global data association problems is suitable when building metric or densely populated feature maps, in both on-line and off-line SLAM. In this case, local data association corresponds to keeping the correct alignment of consecutive sensor readings, and global data association corresponds to fixing coarse errors and closing large loops in the trajectory, usually triggered by the detection of sensor inconsistency. In the case of building sparse feature maps and topological maps, no distinction is usually made. A notable exception is DP-SLAM (ELIAZAR; PARR, 2003; ELIAZAR; PARR, 2004), which builds occupancy-grid maps using laser scans and does not distinguish between local and global data association.

Common techniques for local alignment of sensor data are the Iterative Closest Point (ICP) method (ZHANG, 1994), the laser scan matching method proposed by Lu & Milios (1997), and the use of the Random Sample Consensus (RANSAC) method (FISCHLER; BOLLES, 1981) to match feature points detected in consecutive images to the camera model (SE; LOWE; LITTLE, 2002). In the global data association case, especially when building metric maps, exploring the robot localization state space (GUTMANN; KONOLIGE, 1999) can resolve sensor inconsistencies; however, usually more abstract auxiliary structures are used to improve loop-closing efficiency (HÄHNEL et al., 2003; ELIAZAR; PARR, 2003; BLANCO; FERNANDEZ-MADRIGAL; GONZALEZ, 2008).

When it comes to features and topological mapping, data association assumes the form of a search problem. In on-line SLAM, this problem consists in finding the best mapped feature to associate a given new measurement with: when the robot makes a new measurement, it has to decide whether the measurement was generated by a feature in the map or by a non-mapped one, or even whether it corresponds to a spurious measurement. In off-line SLAM, data association consists in grouping the available measurements into sets, so that measurements belonging to the same set are considered as taken from the same feature.

Most on-line SLAM methods that build feature maps consider heuristic approaches to the data association problem. The most common ones are Nearest Neighbor (NN) and Maximum Likelihood (ML). In the NN approach, each feature in the map is projected into the measurement space and compared to the new measurement taken by the robot. Then, the projected feature closest to the new measurement is associated with it. If no feature is close enough to the new measurement, it is used to map a new feature or is considered a spurious measurement. In the ML approach, the new measurement is tentatively associated with every mapped feature, and the overall map likelihood is calculated for each tested association. The association that produces the highest likelihood is taken as true. If the likelihood stays below a certain threshold, the measurement is considered as being generated by a new feature (DISSANAYAKE et al., 2001).
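The NN heuristic described above can be summarized in a few lines; the sketch below (Python; names and the gate value are illustrative, not taken from any of the cited works) associates a new measurement with the closest compatible projected feature using a Mahalanobis-distance gate:

```python
import numpy as np

def nearest_neighbor_association(z_new, projected_features, gate=9.0):
    """Nearest-neighbor data association with a chi-square-style gate.

    `projected_features` maps feature ids to (predicted measurement,
    innovation covariance).  Returns the id of the closest compatible
    feature, or None if the new measurement should start a new feature
    or be treated as spurious.
    """
    best_id, best_d2 = None, gate
    for fid, (z_pred, S) in projected_features.items():
        innov = z_new - z_pred
        d2 = innov @ np.linalg.solve(S, innov)  # squared Mahalanobis distance
        if d2 < best_d2:
            best_id, best_d2 = fid, d2
    return best_id
```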

The Joint Compatibility Branch and Bound (JCBB) method extends the ML approach by testing groups of new measurements against groups of features in the map, improving the alignment of measurements taken successively (NEIRA; TARDOS, 2001). These heuristic methods are all subject to errors when the uncertainty about the robot localization increases, because, in this case, the main piece of information used to associate new measurements with mapped features becomes unreliable. In addition, these methods deal with the data association problem in a local way: the association is decided based on the analysis of candidate pairs, and no additional information is extracted from the rest of the map. On the other hand, off-line SLAM techniques allow the repair of early associations, since they process all data in batch mode. However, there are very few works in the literature that perform the revision of these errors using the exact formulation of the data association problem, which requires exploring the space of possible data association solutions, a space that grows exponentially with the number of measurements (KAESS; DELLAERT, 2005; RANGANATHAN; DELLAERT, 2004). The only practical solutions that perform this exploration in a consistent way are based on MCMC methods.

Dellaert (2001) introduced MCMC techniques to the constrained data association problem in Structure From Motion (SFM), the computer vision version of SLAM. In that work, a known complex object described by the positions of its vertices is reconstructed based on several images processed in order to detect such vertices. MCMC is used to sample over the possible associations between detected and reconstructed vertices, yielding several object reconstructions, each one obtained with a given data association solution sample. The obtained objects are then combined to form a marginal object reconstruction, independent of any given data association solution.

In turn, Ranganathan & Dellaert (2004) and Kaess & Dellaert (2005) used MCMC to treat the unconstrained data association problem, which is more common in robotic SLAM and more difficult to solve. The unconstrained data association problem corresponds to partitioning the set of measurements obtained by the robot into subsets, where each subset corresponds to a certain physical feature detected by the robot. In the unconstrained data association problem, the number of features is initially unknown, and the space of possible solutions is even larger. From set theory, the number of possible partitions of a set with $K$ elements is given by the Bell number $\mathfrak{B}(K) = \sum_{N=1}^{K} S(K, N)$. In turn, $S(K, N)$ is the Stirling number of the second kind, which corresponds to the number of possible ways to partition $K$ elements into $N$ subsets (ROTA, 1964). $S(K, N)$ grows exponentially with the number of elements $K$, and so does the number of possible data association solutions with respect to the number of measurements to associate.

Here, the unconstrained data association problem is treated following the approach proposed by Ranganathan & Dellaert (2004) and Kaess & Dellaert (2005). However, this work extends the use of MCMC sampling to solving the object-based SLAM problem, which involves resolving the object recognition in order to build object-based maps.
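To give a feel for how fast this space grows, the following sketch computes $S(K, N)$ by its standard recurrence and the Bell number as its sum (a textbook computation, included only to illustrate the combinatorial explosion):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(k, n):
    """Stirling number of the second kind S(k, n): the number of ways to
    partition k measurements into n non-empty groups (features)."""
    if n == 0:
        return 1 if k == 0 else 0
    if n > k:
        return 0
    if n == 1 or n == k:
        return 1
    # Standard recurrence: the k-th element either joins one of the n
    # existing groups or forms a new one.
    return n * stirling2(k - 1, n) + stirling2(k - 1, n - 1)

def bell(k):
    """Bell number B(k): total number of unconstrained data association
    solutions for k measurements."""
    return sum(stirling2(k, n) for n in range(1, k + 1))

# Even 15 measurements admit over a billion candidate partitions:
print(bell(15))  # 1382958545
```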


3 The O-Map

This chapter introduces the object-based map proposed in this thesis, namely the O-Map. Even though the concepts involved in the map definition are simple, the proposed environment representation is novel in the robotics context. In fact, the proposal of object-based maps that mix geometric and semantic information, and of techniques to build them, is a recent topic, motivated by the development of practical algorithms for object recognition in images and semantic segmentation of metric maps.

The proposed map encodes geometric and semantic descriptions of the objects in the environment, as well as the uncertainty belief about the existence of those objects in the real world. Figure 3.1 illustrates the proposed map structure. The O-Map is composed of a set of objects, assumed to be well described by semantic labels associated with simple three-dimensional geometric volumes, like spheres, cylinders, or pyramids. A semantic label, expressed in natural language, describes to what class an object belongs, i.e., whether it is a clock, a table, a chair, etc. Each object also has an associated non-spuriousness probability, which indicates the degree of certainty that it actually exists in the environment, rather than having been built from spurious measurements.

Mathematically, the O-Map is defined by $M \triangleq \{o_j\}$, a set of $N$ objects $o_j$. Each object $o_j \triangleq (l_j, g_j, c_j)$ is described by its 3D location $l_j = (x_j, y_j, z_j)$, geometry $g_j$, and semantic class label $c_j$. The object geometry $g_j$ describes the parameters of the geometric volume that represents it. For instance, in case objects are represented by parallelepipeds, $g_j$ are three-dimensional vectors comprising object lengths, widths, and heights. The class label $c_j$ must correspond to an instance taken from a predefined set of possible labels $\mathcal{C}$. Besides its geometric and semantic descriptions, each object can have an associated non-spuriousness probability $\pi_j \in (0, 1]$.

Although the proposed map representation is not suitable to describe all possible objects the robot can face, it contains the necessary elements to characterize many of the most common objects in indoor spaces.
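The definition above translates directly into a simple data structure; the sketch below (Python, with hypothetical field names mirroring $l_j$, $g_j$, $c_j$, and $\pi_j$) shows one possible encoding of an O-Map entry:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MapObject:
    """One O-Map entry o_j = (l_j, g_j, c_j) with non-spuriousness pi_j.

    location: 3D position (x, y, z) of the object.
    geometry: parameters of the simple volume, e.g. (radius, height)
              for the cylinders adopted in this work.
    label:    semantic class drawn from the predefined set C.
    non_spuriousness: probability in (0, 1] that the object actually
              exists rather than arising from spurious detections.
    """
    location: np.ndarray
    geometry: np.ndarray
    label: str
    non_spuriousness: float = 1.0

# An O-Map is simply a set of such objects (illustrative values):
o_map = [
    MapObject(np.array([2.0, 1.5, 0.4]), np.array([0.3, 0.8]), "chair", 0.95),
]
```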


Figure 3.1: Example of an O-Map, as proposed in this thesis. In the example, objects are represented by parallelepipeds, described by width, height, and depth. The green shape represents the floor plane, which is not part of the map. Transparency in the object representation indicates uncertainty about its existence.

A more general O-Map would allow each object to be represented by an appropriate geometric volume, and object positions would be replaced by object poses. Poses encode, besides object locations, their attitude angles: roll, pitch, and yaw. However, many objects of interest are well described by upright volumes, like prisms with their bases parallel to the floor plane. Furthermore, considering the purposes of the proposed map, which exclude, for instance, rendering texture images of the objects, their orientations with respect to the floor plane can be omitted.

In this work, for the sake of simplicity, objects are represented by cylinders. Thus, the measurements resulting from the observation of the objects by the robot do not depend on the angle from which the objects are observed, simplifying the mapping process. The drawback of this assumption is that elongated objects such as doors, desks, and couches are described by a height and an average width, which does not reflect their flat shape. Nevertheless, the cylindrical representation still reflects object sizes well enough for the purposes of this work, better explained in Section 3.1.

3.1 Applications of O-Maps

O-Maps encode different aspects of the objects they represent, being useful for different applications. As maps for robot navigation, they can be used to help the robot localize itself in an environment and plan a trajectory where objects are way-points. Moreover, because of the map's semantic content, it can also help in the interaction and communication with humans. In some applications, like service robotics and Augmented Reality (AR), an automatic system is supposed to communicate and interact with a human user. In case the concepts involved in this communication are objects, the proposed object-based map can be useful. Thus, O-Maps can be applied in more complex tasks, where navigation or object recognition is just part of the problem.

To illustrate the multifunctional nature of O-Maps, an example of how they could be used in service robotics is given. Consider the situation where a user asks a service robot to pick up a cup on the kitchen table. The robot is supposed to go to the kitchen, explore it, identify the kitchen table and the possibly multiple cups on it, and ask the user which cup to grab via a remote desktop. An O-Map could be used in all the tasks involved.

First of all, the service robot of the example must know how to get to the kitchen. If it has a built-in O-Map of the house, the kitchen can be recognized by the objects it contains, using a place modeling technique like the one proposed by Ranganathan & Dellaert (2007). Then, among the objects considered as belonging to the kitchen, the table is selected as the target towards which the robot has to go. If a reactive navigation and obstacle-avoidance algorithm such as the one described by Selvatici & Costa (2007b) is used, it is not necessary to provide any other map to the robot. Such an algorithm just needs a goal point and a map for localization, while it navigates avoiding obstacles using sonar sensors.

Once the robot detects the kitchen table and the cup or cups that are on it, it has to associate them with the cups registered in the map. If the data association procedure fails to match the detected cups with the mapped ones, then new cups are mapped by running the O-SLAM algorithm, described in Chapter 4. Then, the robot sends an image of the cups to the remote desktop by which the user communicates with the robot, highlighting the detected cups, as depicted in Figure 3.2. At that point, there is a direct association between each detected cup in the image and the corresponding mapped one, so that when the user indicates the cup he or she wants, the robot knows exactly which one to pick, and is able to maneuver its gripper towards it even if the target object goes out of sight. Finally, the robot uses the O-Map to navigate back to the user, bringing the desired cup.

This simple example shows that the map proposed in this thesis is suitable for tasks related to robot navigation and also as a basis for the communication between robots and humans, meeting some of the needs of current robotics research, which is also concerned with human-robot interaction issues.


Figure 3.2: Illustration of how an O-Map can be used to create a common vocabulary for human-robot interaction. This example is merely illustrative, and has not been performed experimentally. (a) Top view of part of a kitchen represented by its objects. (b) Sample of an image used to build the map, showing the detected cups. The association between each detection and an object in the map is used by the human user to specify the desired cup.

In this work, however, only the navigation task is treated, namely the problem of localizing the robot and building the object-based map.

3.2 Framework for O-SLAM

Although the data structure used in the O-Map is new to the robot navigation context, the major contribution of this thesis is the probabilistic technique that builds it. Since the proposed technique does not depend on knowing the robot localization precisely, which is inferred together with the detected object parameters, it is called O-SLAM. The proposed SLAM resolution technique considers that the robot explores an environment with static objects while recording odometry data and capturing images with a video camera attached to it, as depicted in Figure 3.3. The captured images are processed by object detection algorithms, specialized in detecting and segmenting instances of certain object classes. Each detection algorithm recognizes objects belonging to a specific class in images. Since object-based maps represent the objects as simple volumetric shapes, each segmented object shape in an image is transformed into a simple 2D shape corresponding to a simplified projection of the assumed volumetric shape on the image plane.


Figure 3.3: Illustration of the scenario considered by the O-SLAM technique. The detected shape in the example is simplified as a rectangle, which corresponds to the projection of a cylinder's longitudinal section.

The data generated by this process consists of the apparent object positions in the images, the parameters describing the simplified object shapes in the images, and the class labels assigned by the corresponding object detectors. To simplify the notation, all data is considered to be acquired together at each acquisition step.

Consider a scenario where the robot captures $T$ images at different instants of time, so that it performs a trajectory described by $X = \{x_i\}_{i=1}^{T}$, where $x_i$ is the robot pose at instant $i$. The odometry data is $V = \{v_i\}_{i=2}^{T}$, corresponding to the traversed distance and heading difference between two consecutive poses, measured by encoders on the robot wheels. The acquired computer vision data comprises $K_i$ object detections at each time instant $i$, represented by $Z = \{z_i\}_{i=1}^{T}$, where $z_i = \{z_{i,k}\}_{k=1}^{K_i}$. Considering the assumptions about the data acquired by the object detection algorithms, illustrated in Figure 3.3, we have

$$z_{i,k} = (u_{i,k}, s_{i,k}, a_{i,k}), \qquad (3.1)$$

where each measurement $z_{i,k}$ provides a 2D position in the image, $u_{i,k}$, the respective apparent shape parameters, $s_{i,k}$, and the detected class $a_{i,k} \in \mathcal{C}$.
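In code, a measurement as defined by Eq. (3.1) can be represented as a simple record; the sketch below (Python, hypothetical type names) mirrors the $(u_{i,k}, s_{i,k}, a_{i,k})$ tuple:

```python
from typing import NamedTuple
import numpy as np

class Measurement(NamedTuple):
    """One object detection z_{i,k} = (u, s, a), as in Eq. (3.1)."""
    u: np.ndarray  # 2D position of the detection in the image
    s: np.ndarray  # apparent shape parameters (e.g. width, height in pixels)
    a: str         # class label assigned by the detector, drawn from C

z = Measurement(u=np.array([320.0, 240.0]), s=np.array([40.0, 90.0]), a="chair")
```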

3.2.1 Assumptions about the object detection system

The way the data acquired by the robot sensors is processed in order to obtain the O-Map depends on the assumed accuracy of the computer vision algorithms, which directly influences the mapping accuracy and the computational burden.


In the simplest case, the object detection algorithms are assumed to perfectly detect and uniquely recognize the objects in the environment. This means that the identity of every detected object is known, and all measurements taken from a certain physical object are assigned the same identity. Thus, the data association solution consists in directly associating all measurements with the same identity. Both the assumed data association solution and the object identities are expressed by the correspondence function $J : \{(i, k),\ i = 1, \dots, T,\ k = 1, \dots, K_i\} \to \mathbb{N}^{+}$, which is a mapping from time indices $i$ and measurement indices $k$ to positive object indices $j$, such that $o_j$, with $j = J(i, k)$, is the object detected in the image acquired at instant $i$ giving rise to the measurement $z_{i,k}$.

The correspondence function $J$ can also be seen as a way to partition the set of measurement indices, as illustrated in Figure 3.4, and as sketched in code after this paragraph. The measurements $z_{i,k}$ whose indices $(i, k)$ are mapped to the same object index $j$ form a subset of the measurements $Z$. Analogously to a set partition, a valid data association solution associates at least one measurement with each object, and cannot associate the same measurement with different objects. In Figure 3.4, measurements are represented by lines connecting circles, which represent robot pose indices, with stars, which represent the objects detected by the measurements. Circles are annotated with pose indices $i$, ranging from 1 to 8 in the figure, while stars are annotated with object indices $j$, ranging from 1 to 4. The correspondence function $J$ groups the lines, annotated with measurement indices $(i, k)$, into different objects, partitioning the set of lines.

Real-world object detection algorithms are not perfect, so a practical object-based SLAM algorithm must allow for errors in object identities and classes. Taking $C = \{c_j\}_{j=1}^{N}$ as the set of all mapped object classes, there are four possible problem setups to consider:

1. Both the classes $C$ and the correspondence function $J$ are known;

2. The classes $C$ are known, but the correspondence function $J$ is unknown;

3. The correspondence function $J$ is known, but the classes $C$ are unknown; and

4. Both the classes $C$ and the correspondence function $J$ are unknown.
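A correspondence function is conveniently stored as a dictionary from measurement indices to object indices; the sketch below (Python, hypothetical helper name) checks the partition properties just described and recovers the induced measurement groups:

```python
def validate_correspondence(J, all_indices):
    """Check that J is a valid data association solution.

    J maps measurement indices (i, k) to object indices j; `all_indices`
    is the collection of every (i, k) pair.  Being a dict, J cannot map
    one measurement to two objects; we additionally check that every
    measurement is covered, then return the induced partition.
    """
    assert set(J.keys()) == set(all_indices), "every measurement must be associated"
    # The blocks of the partition are the per-object measurement groups;
    # each block is non-empty by construction.
    objects = {}
    for (i, k), j in J.items():
        objects.setdefault(j, set()).add((i, k))
    return objects

# Example consistent with Figure 3.4: J[(2, 1)] = 1 and J[(4, 2)] = 4, etc.
```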

3.2 Framework for O-SLAM

60

2

1

1

3

2

(8,3)

(4,3)

(4,1)

(1,1)

4

5

6

7 (8,2)

(3,3) (1,2)

(4,2)

8

3

4 Figure 3.4: Illustration of how the correspondence function J partitions the set of measurement into N subsets (here, N = 4). In the figure, circles represent robot poses at different instants i = 1, 2, . . . , 8, and stars represent the detected objects with indices j = 1, 2, 3, 4. Lines represent the available measurements, indexed by the pairs (i, k), which are shown only in part in order not to clutter the figure. The lines are distributed among the objects by J so that J(i, k) = j. For instance, we have J(2, 1) = 1 and J(4, 2) = 4 in the figure.

recognition is done by hand, with the aid of a human. Setup number 2 corresponds to situations where object are correctly detected, but the number of objects of each class is unknown. The single-class case fits this setup, where no distinction between classes is performed. Finally, setup number 3 occurs when objects are perfectly identified, but their classes are unknown. This case, however, has no interest to the proposed framework, since object classes are assumed to be easily inferred if object identities can be perfectly recognized.

3.2.2

Coupling the unknowns through probabilistic modeling

The proposed approach to recover the map and the robot trajectory is to model the behavior of the robot odometers and the object detectors using probabilistic models, and then infer the desired parameters. In all cases described in Section 3.2.1, the odometers measurements are considered to be corrupted by white noise, as well as the parameters of the apparent object positions and sizes in images. In the case of the problem setup 1, where the object classes and identities are all known, the only unknown variables are the robot trajectory and the object geometries and locations. Because only geometric parameters need to be inferred in this case,

3.2 Framework for O-SLAM

61

the object-based SLAM problem with considering setup 1 is here-on called geometric SLAM, for which the first solution is proposed here. No other known work considers the problem of modeling volumetric objects using image sequences. On the other hand, in the case of setup 2, we also have to solve the data association problem, so that object identities are considered unknown. This work considers the worst case for the data association problem, where data Z does not contain direct measurements of object identities. Thus, if there are two or more detections of a couch, there is no measured data that indicates whether these detections are due to the same couch or to different ones. Here, a Bayesian probabilistic model is assumed for the correspondence function J, which is better explained in Section 4.2. In case we assume the problem setup 4, which considers imperfect computer vision algorithms, object detection errors are modeled by two different error modes: confusion between classes or spurious detection. The former error mode occurs when an object detector identifies an object belonging to a different class, similar to the one it is specialized in. The later one corresponds to the erroneous assignment of a certain object class to a random image patch. Both errors may occur because of changes in the environment light, noisy image, or other factors. Therefore, differently from almost all other object mapping approaches from the robotics literature, that blindly relies on the object recognition system, the object-based SLAM technique jointly infers the robot trajectory together with the objects geometry and classes, detecting possibly spurious objects. The probabilistic model used in object-based SLAM also considers a prior knowledge about the size of the objects, and possibly their locations, depending on their classes. The main idea behind it is that, if one roughly knows the average real-world size of the objects belonging to a certain class, the apparent size of an instance in the image leads to a coarse range estimate from the robot to the object. The interesting difference with point-based monocular visual SLAM (DAVISON, 2003) is that apparent size now yields range to objects even by a single sighting. After several sightings both object dimensions and position will be sharply determined by triangulation, obsoleting the coarse priors and constraining the object class, coupling it with the geometric parameters. Figure 3.5 illustrates this model. This idea was also used by Hoiem, Efros & Hebert (2006) to recover the camera viewpoint and the position of determined objects on the floor plane from single images, while performing object recognition using a third-party technique, strongly constraining the object class to the regions they appear

3.2 Framework for O-SLAM

62

in the image. However, here the constrains on the object locations, sizes and classes are more relaxed in order to recover larger scale models using several images.

(a) Detected Object Class Object Size

Apparent Object Size

3D Object Position

Apparent Object Position

Object class

Camera Pose

(b) Figure 3.5: Illustration of how knowledge about the object size and position can be used to estimate the relative distance from the camera. (a) The decreasing size of the wooden box in the images combined with prior knowledge about the size of wooden boxes informs about its increasing distance to the camera. (b) Graphical probabilistic model illustrating the dependencies among object position and size, and robot camera pose (unknowns are represented by ellipses) given object recognition output (rectangles).

3.2.3

Computer vision algorithms compatible with O-SLAM

This work assumes that the object recognition system that processes images captured by the robot is able to detect and segment instances of the object classes of interest. Thus, image classification algorithms, which labels a whole image according to the most prominent object in it, are incompatible with the approach proposed here. There are two classes of algorithm that can be used with O-SLAM: pixel-labeling methods and object detectors. Pixel-labeling methods assign a label to each pixel in the image. These methods are usually based on discriminative probabilistic approaches that consider the value

3.2 Framework for O-SLAM

63

of each pixel and its neighbors to infer the best label for it. They have faced great boost during the past few years, after Conditional Random Fields (CRFs) (LAFFERTY; MCCALLUM; PEREIRA,

2001) and their variants have been adapted from the problem of

labeling sequentially chained states to labeling pixels in a grid (QUATTONI; COLLINS; DARREL,

2004; CRIMINISI et al., 2006; WINN; SHOTTON, 2006; SHOTTON; JOHNSON; CIPOLLA,

2008). However, if considered to be ported to object-based SLAM, these algorithms must be used with care. Since they suppose a training phase where completely labeled images are used as training samples, the learned discriminative model labels the pixels belonging to the objects taking the object boundary pixels into consideration. It poses a problem when the same object must be detected from different points of view, since the background changes from an image to the other, also changing the value of object boundary pixels. Object detectors are more suitable for use with O-SLAM. These techniques use local and global features to detect and segment out objects in images, independently from the background. Local features correspond to pieces of information that describe single points in image, such as pixel color, color gradient, corner detection, etc. Global features are pieces of information that represent parts of an image, or even the entire one, such as the mean color, histograms, etc. The simplest object detectors use only color information, such as color histograms (SWAIN; BALLARD, 1990) and color blobs (BRUCE; BALCH; VELOSO, 2000). The former technique has been successfully used in object-based SLAM (EKVALL; KRAGIC; JENSFELT,

2007), conjugated with the use of local SIFT features as a refinement step to

filter out spurious detections. SIFT stands for Scale-Invariant Feature Transform, a technique that detects features in images widely used in modern object recognition algorithms (LOWE, 1999). Other object detectors used in object-based mapping use SIFT features with great success (FORSSEN et al., 2008; VASUDEVAN et al., 2007). Recently, promising techniques for robust detection of multiple classes in images have been proposed (MIKOLAJCZYK; LEIBE; SCHIELE, 2006; THOMAS et al., 2006), being very suitable for integration with O-SLAM. Despite their potentially high computational cost, due to performing extensive search operations over the whole image, those techniques can be optimized for detecting the same objects in consecutive images, mitigating the need for such extensive search operations. Furthermore, the present form of O-SLAM corresponds to a batch algorithm, dropping the need for meeting tight processing time requirements.

3.3 Comparison with other object-based maps

3.3

64

Comparison with other object-based maps

This thesis claims to introduce a novel object-based map for robot navigation and the first SLAM algorithm that treats object recognition errors concurrently with 3D objects mapping and the robot localization. Thus, few aspects of the present work can be compared with related object-based mapping works. Strictly speaking, no other work solves the SLAM problem based only in the objects detected in the images. Most of related works rely on a traditional SLAM technique for robot localization and map building (FORSSEN et al., 2008; EKVALL; KRAGIC; JENSFELT,

2007), using exclusively odometers and the laser sensor for that. The location

of detected objects are then assigned in the map, but they do not interfere in robot localization. Therefore, those works perform object-based mapping, not SLAM. The only related work that uses only object detections to build the map localizes the robot only by dead reckoning, and thus do not constitute a SLAM technique (VASUDEVAN et al., 2007). Although there is no advantage in localizing the robot using only odometers and object detection from the accuracy point of view, the object-based SLAM framework developed in this thesis can be easily integrated to any other probabilistic SLAM algorithm, improving the map and localization accuracy and helping to close large loops in the trajectory. This work also differs from related visual SLAM approaches, which map only some interest points in the environment (DAVISON, 2003; KLEIN; MURRAY, 2007; WILLIAMS; KLEIN; REID,

2007), without providing objects representation. Slightly closer to the

present approach, (CASTLE et al., 2007) also incorporates some higher-level entities in the map, corresponding to recognized planar patches. However, these patches only indicate that certain points in the map belongs to a specific structure, making their associations throughout images more reliable due to this additional patch-level of matching test. Our work improves on that by using apparent object size and class information rather than just image point positions in the SLAM process. Although the work of (RANGANATHAN; DELLAERT, 2007) is not concerned about mapping objects detected in multiple images captured by a mobile robot, but rather extracting a 3D representation of the objects present in a single image, it holds some similarity with the present work. Instead of relying on an out-of-the-box object detection algorithm, that work integrates different appearance and geometric clues from an image captured by a stereo pair —providing color and depth information— to map the 3D

3.3 Comparison with other object-based maps

65

location, scale and class of the depicted objects. The used geometric clues include the model of how objects appear in images and prior information about object sizes. O-SLAM and the place modeling method of Ranganathan & Dellaert (2007) are, then complementary approaches. While that work performs probabilistic object segmentation and classification, the approach proposed here is limited to perform probabilistic object classification, relying on the deterministic object segmentation output by the computer vision algorithm. Nevertheless, O-SLAM is able to match the same physical object in different non-stereo images and recover the camera trajectory, while that approach is not. Despite the differences of both works, they point towards the same ultimate goal: integrating appearance and geometric clues in order to recognize and map the objects in the environment.

66

4

Inference in Object-based SLAM

This chapter introduces the proposed algorithm for obtaining object-based maps; namely, the O-SLAM algorithm. Following the choice of using a probabilistic approach to solve the object-based SLAM and classification problem, in this chapter the probabilistic problem model is developed, together with the inference technique used to estimate the map parameters. Although the proposed algorithm is intended to solve the SLAM problem without knowing the solution for the data association problem and considering imperfect object detection, the solution for the setups 1,2, and 4 explained in Section 3.2.1 are presented in sequence, since the solution for a more difficult setup depends on the solution of the easier ones. The first setup to be considered is setup 1, which implies the geometric inference problem alone, being referred to as geometric SLAM in this work. This problem, whose solution is described in Section 4.1, consists in building an O-Map considering that object detection and recognition are perfect, so that the correct object classes and identities are known. Section 4.2 presents the proposed solution for setup 2, which considers perfect the object detection algorithm, but assumes unknown the solution for the data association problem. Since the goal is inferring the O-Map and the robot trajectory, the variables of interest are obtaining by taking the expectation with respect to the possible data association solutions. Section 4.3 extends the solution proposed for setup 2 to setup 4, where the used object detection algorithms are assumed imperfect, and may output wrong measurements. Besides the noise in the apparent positions and sizes, treated in the proposed solution for the geometric SLAM problem (setup 1), two error modes are admitted: spurious detections, corresponding to false positives and partially occluded measurements, and wrongly detected classes.

4.1 Efficient solution for geometric SLAM

4.1

67

Efficient solution for geometric SLAM

This section shows how to solve the geometric SLAM problem assuming the case of perfect object detection and recognition. The solution proposed here is to perform Maximum a Posteriori (MAP) inference using measurement data Z provided by the object detection system and the correspondence function J provided by the knowledge of object identities, according stated in the problem setup 1 in Section 3.2.1. The MAP estimate (X ∗ .M ∗ ) is defined by (X ∗ , M ∗ ) = arg max P(X, M∣J, Z). ∆

(4.1)

(X,M)

The probabilistic approach implies that object classes, locations, and associated geometries are inferred together with the robot poses, tightly coupling these variables. ∆

Recall the map definition from Chapter 3, whereby M = {o j }Nj=1 , where each object o j = (l j , g j , c j ) is described by its location l j , geometry g j , class label c j ∈ C . The num∆

ber of objects in the map, N, depends on the different object identities detected by object recognition. Assuming problem setup 1, where correct object identities and classes are known, an approach similar to that used in traditional point-based SLAM is adopted, except that the map now includes object geometries and classes. Since object classes are directly determined from the perfectly detected ones ai,k in Z, and the object identities given by J, the posterior (4.1) is expressed by P(X, M∣C, J, Z) ∝ P(Z∣X, M,C, J)P(X, M∣C, J) = P(X, L, G∣C, J)P(Z∣X, L, G,C, J),

(4.2)

where P(X, L, G∣C, J) is a prior density over the geometric map parameters, conditioned on the known object class labels C = {c j }Nj=1 and correspondence function J. P(Z∣X, M,C, J) is the joint likelihood of the detected object projections in images. Because we consider the object classes C known for now, the only parameters to infer are trajectory X and the geometric part of the map, comprising object locations L = {l j }Nj=1 and their geometry G = {g j }Nj=1 . Hence, we define the vector of variables under infer∆

ence: θ = (X, L, G). To make the assumed model more precise, the relationships between the variables and measurements are shown in Figure 4.1. Besides the measurement model, consisting of the projection model of objects onto the image plane, we assume a prior model over object sizes depending on their classes. The prior density on the robot

4.1 Efficient solution for geometric SLAM

vi-1

...

68

vi

vi+1

x i-1

z i,2

...

z i-1,K i-1

...

o j-1

z i,1

u i,2 si,2 ai,2

lj

oj

...

x i+1

xi

...

z i+1,1

gj

... cj

Figure 4.1: Fragment of the Bayesian Network that represents the probabilistic model for O-SLAM. The rectangles represent known variables, and the circles represent the unknowns. We basically assume that the object class c j give a rough idea about its geometry g j , and may give clues about its location l j . In geometric SLAM, the detected class ai,k is considered equal to the actual object class, cJ(i,k) , so that P(ai,k ∣cJ(i,k) ) = δ (ai,k , cJ(i,k) ). This model does not admit strong occlusion among objects, neither moving objects.

trajectory and the objects geometric model can be written as P(θ ∣C, J) = P(X)P(L∣C)P(G∣C) N { } = P(X) ∏ P(l j ∣c j )P(g j ∣c j ) ,

(4.3)

j=1

and odometer information V casts a prior on the robot poses of the form T

P(X) = P(x1 )

∏ P(xi∣xi−1, vi).

(4.4)

i=2

Since no absolute localization sensor used, the obtained map may have any reference frame. A common choice for that is defining the first pose x1 as a constant with any value, sometimes clamping it to the origin, and making all other variables estimated with respect to it. The measurements likelihood are modeled by considering that measurements are statistically independent on each other, and that object positions in images depend on the relative displacement between the robot and objects, and also on the robot orientation. The object shape is assumed independent of its position in image. Finally,

4.1 Efficient solution for geometric SLAM

69

we consider that the actual class labels C perfectly generate the detected ones, so that: T Ki { } P(Z∣X, M,C, J) = ∏ ∏ P(ui,k ∣xi , lJ(i,k) )P(si,k ∣xi , lJ(i,k) , gJ(i,k) ) .

(4.5)

i=1 k=1

As a result, the posterior in (4.1) is given by the generative model T N { } P(X, M∣J, Z) ∝ P(x1 ) ∏ P(xi ∣xi−1 , vi ) ∏ P(l j ∣c j )P(g j ∣c j ) i=2

T

j=1

Ki

(4.6)

{ } × ∏ ∏ P(ui,k ∣xi , lJ(i,k) )P(si,k ∣xi , lJ(i,k) , gJ(i,k) ) . i=1 k=1

While (4.6) gives the general form of the posterior in (4.1), it is necessary to specify the shapes of the involved probabilities densities and come up with a practical model from which we can infer the desired geometric parameters. For such, the framework for √ SAM is used (DELLAERT; KAESS, 2006). According to that framework, the posterior in (4.6) is factorized as product of Gaussian probabilities, which naturally leads (4.1) to be formulated as a linearized LS problem. Since we are treating an off-line SLAM problem, solving the linearized problem is part of an iterative non-linear optimization strategy, like Levenberg-Marquardt (LM) (KELLY, 1999) . In this section, only the linearized part of the solution is discussed.

4.1.1

Using linearized Gaussian models

To assure the posterior (4.6) is expressed as a product of Gaussian densities, the probabilistic geometric model is defined considering that all measurements and prior knowledge are normally distributed. Thus, the prior belief over object locations and sizes are given by l j = γ(c j ) + elj ,

elj ∼ N (0, Γ(c j )) and

(4.7)

g j = ς (c j ) + egj ,

egj ∼ N (0, Σ(c j )),

(4.8)

where elj and egj are the errors on the priors over objects location and size, respectively. Odometer readings and measurements are also disturbed by white noise, so that we consider xi = f (xi−1 , vi ) + exi , exi ∼ N (0, Qi ),

(4.9)

ui,k = hu (xi , lJ(i,k) ) + eui,k , eui,k ∼ N (0, Ri,k ), and

(4.10)

si,k = hs (xi , lJ(i,k) , gJ(i,k) ) + esi,k , esi,k ∼ N (0,Wi,k ),

(4.11)

4.1 Efficient solution for geometric SLAM

70

where exi , eui,ki and esi,ki are, respectively, the odometers error, and the errors in the object position and size in image. The deterministic motion function f (., .) and the covariances Qi define the assumed probabilistic motion model, which encodes our belief on where the robot can be after being imposed a displacement command vi starting from pose xi−1 . Analogously, the deterministic projection transforms hu (., .) and hs (., .), and the associated covariance matrices Ri,k and Wi,k , define the assumed probabilistic observation model, which encodes our belief about the measured object position and size in image given that the robot is at pose xi and the object real-world location and size are lJ(i,k) and gJ(i,k) respectively. Since the functions f , hu and hs are, in general, non-linear, linearized versions of them are used to assure a Gaussian posterior density over the map geometric parameters, L and G, and the robot trajectory, X. Replacing the linearized version of the densities (4.7-4.11) in (4.6) yields the posterior: { 1 1 N [ √ exp − ∣∣δ l j + l 0j − γ(c j )∣∣2Γ(c j ) + P(θ ∣C, J, Z) ∝ ∑ 2 j=1 ∣2πP∣ ] T 0 ∣∣δ g j + g0j − ς (c j )∣∣2Σ(c j ) + ∑ ∣∣Fix δ xi−1 − δ xi + f (xi−1 , vi ) − xi0 ∣∣2Qi i=2

T

+∑

Ki



[ x l 0 ∣∣Ui,k δ xi +Ui,k δ lJ(i,k) + hu (xi0 , lJ(i,k) ) − ui,k ∣∣2Ri,k +

i=1 ki =1

g x l 0 2 ∣∣Si,k δ xi + Si,k δ lJ(i,k) + Si,k δ gJ(i,k) + hs (xi0 , lJ(i,k) , g0J(i,k) ) − si,k ∣∣W i,k

with N

T

T

i=2

,

(4.12)

Ki

∣P∣ = ∏ (∣Γ(c j )∣∣ς (c j )∣) ∏ ∣Qi ∣ ∏ ∏ (∣Ri,k ∣∣Wi,k ∣), j=1

]}

(4.13)

i=1 k=1

where the superscript 0 indicates the linearization point of the respective variables, and δ α means the variation of the variable α around its linearization point α 0 , so that α = α 0 + δ α. The capital letters with superscripts and subscripts represent the Jacobians x of the deterministic functions: Fix is the Jacobian of f (xi−1 , vi ) with respect to xi−1 ; Ui,k l are, respectively, the Jacobians of hu (x , l x and Ui,k i J(i,k) ) with respect to xi and lJ(i,k) ; Si,k , l , and Sg are the Jacobians of hs (x , l Si,k i J(i,k) , gJ(i,k) ) with respect to xi , lJ(i,k) , and gJ(i,k) , i,k

respectively. The notation ∣∣α∣∣2Σ is used to indicate the squared Mahalanobis norm of α with respect to Σ, given by α T Σ−1 α. Note that, given the linearization points {xi0 }Ti=1 ∆

and {l 0j , g0j }Nj=1 , the variables subject to inference become δ θ = (δ X, δ L, δ G), where δ X = {δ xi }Ti=2 , δ L = {δ l j }Nj=1 , and δ G = {δ g j }Nj=1 . Since the initial pose is considered known, we have δ x1 = (0, 0, 0).

4.1 Efficient solution for geometric SLAM

4.1.2

71

Inference using QR decomposition

The MAP inference on the posterior (4.14) can be transformed into an LS problem, which can be efficiently solved using QR factorization. Re-writing the posterior (4.12) in the matrix form yields: } { 1 1 2 P(θ ∣C, J, Z) ∝ √ exp − ∣∣Aδ θ − b∣∣P , 2 ∣2πP∣

(4.14)

where each block-line in A and b corresponds to one of the summand terms in (4.12), and P is a block-diagonal matrix with the covariances that weigh the summands. For instance, here are fragments of these matrices and vectors:



F x −I3×3 ⎢ .2 .. ⎢ .. . ⎢ ⎢ x A=⎢ ⎢ 0 U2,1 ⎢ x ⎢ 0 S2,1 ⎣ .. .. . .

⋅⋅⋅ .. .

0 .. .

l ⋅ ⋅ ⋅ U2,1

0 .. . 0

g l ⋅ ⋅ ⋅ S2,1 S2,1 .. .. .. . . .





⎤ ⎡ δ x2 x20 − f2 (x10 , v2 ) ⎥ ⎢ . ⎥ ⎢ .. ⎢ .. ⎥ ⎢ ⋅⋅⋅ ⎥ . ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ u 0 0 ⎢ ⎥ ⎢ ⋅⋅⋅ ⎥ ⎥ , δ θ = ⎢ δ l1 ⎥ , b = ⎢ u2,1 − h (x2 , l1 ) ⎥ ⎢ ⎥ ⎢ ⎢ δ g1 ⎥ ⎢ s2,1 − hs (x20 , l10 , g01 ) ⋅⋅⋅ ⎥ ⎦ ⎣ ⎦ ⎣ .. .. .. . . . ⋅⋅⋅



and

Q2 ⋅ ⋅ ⋅ 0 0 ⎢ . . .. ⎢ .. . . . .. . ⎢ ⎢ P=⎢ ⎢ 0 ⋅ ⋅ ⋅ R2,1 0 ⎢ ⎢ 0 ⋅ ⋅ ⋅ 0 W2,1 ⎣ .. .. .. .. . . . .

⋅⋅⋅

⎤ ⎥ ⎥ ⎥ ⎥ ⎥, ⎥ ⎥ ⎥ ⎦



⎥ ⋅⋅⋅ ⎥ ⎥ ⎥ ⋅⋅⋅ ⎥ ⎥, ⎥ ⋅⋅⋅ ⎥ ⎦ .. .

where Id×d is the identity matrix with d diagonal elements. Maximizing the posterior (4.14) corresponds to finding δ θ ∗ = arg min ∣∣Aδ θ − b∣∣2P ,

(4.15)

δθ

which is also the posterior parameters mean, with the posterior covariance expressed by Cθ = (AT P−1 A)−1 . Due to the sparseness of A, QR factorization is an efficient way to solve (4.15) (DELLAERT; KAESS, 2006). QR factorization represents an m × n matrix A, with m ≥ n, by [ ] R , where Q is an a multiplication of other two matrices (GOLUB; LOAN, 1996), A = Q 0 m × m orthonormal matrix, and R is the n × n upper-triangular Cholesky factor of AT A.

4.2 Probabilistic data association

72

Let us rewrite (4.15) as a minimization of an Euclidean distance by incorporating the covariance matrix into the other terms: ¯ 2, ¯ θ − b∣∣ ∣∣Aδ θ − b∣∣2P = ∣∣Aδ

(4.16)

1 1 1 where A¯ = P− 2 A, b¯ = P− 2 b, and P− 2 is the Cholesky factor of P−1 . The QR factorization of A¯ allows us to rewrite (4.15) in the form

[ δ θ ∗ = arg min ∣∣Q θ

0 ([

= arg min ∣∣Q θ

[ where

c

R

] ¯ 2 δ θ − b∣∣ R 0

]

[ δθ −

c

]) )∣∣2 ,

(4.17)

r

] ¯ = QT b.

r Once R is upper-triangular square, and full-rank since A¯ poses an over-determined linear system, the problem solution is obvious: it is given by the solution of Rδ θ ∗ = c, leaving ∣∣r∣∣2 as the total squared residual. If the posterior covariance is required, it can be recovered from the Cholesky factor R: Cθ = (AT P−1 A)−1 = (RT R)−1 = R−1 (R−1 )T .

(4.18)

Now we have the all the mathematical tools to solve the geometric SLAM problem. Algorithm 1 describes the O-SLAM algorithm for the setup 1, where the correspondence function J and object classes C are given. In the next section, we treat the case where the correct J is unknown, but we know how to sample from the distribution over possible J instances. Chapter 5 is devoted to describing how to perform such sampling.

4.2

Probabilistic data association

In typical scenarios, object identities are not available, and thus the data association solution J (including statements about object identity made in Section 3.2.1) must be inferred together with the geometric variables. In this section, the O-SLAM algorithm for the problem setup 2 is addressed, considering the case of O-Maps building without assuming the correct data association solution, although object classes are still considered known for now.

4.2 Probabilistic data association

73

Algorithm 1 O-SLAM algorithm considering perfect object detection and recognition Input: data Z, the correspondence function J, the motion and observation models defined by the functions f , hu , and hs and the covariances Qi , Ri,k , and Wi,k Output: the optimal trajectory X ∗ , the optimal map M ∗ , and optionally the variables covariance matrix Cθ 1. Determine the classes C = {c j }Nj=1 so that ∀zi,k ∈ Z, cJ(i,k) = ai,k 2. Choose the initial linearization points 3. Run an iterative non-linear optimization algorithm, such as LM, so that, in each step: (a) Build the matrices A and P and the vector b using the linearized motion and observation models (b) Solve the linear system defined by A, P and b using QR factorization (c) Determine the new linearization points according to the rules of the chosen optimization algorithm (d) If the linearization points change is bellow a certain threshold, stop iterating and return the current values for A, b, and P and the linearization points 4. Solve the system formed by A, P and b using the QR factorization, obtaining the variables vector δ θ 5. Find (X ∗ , L∗ , G∗ ) = δ θ + θ 0 , where θ 0 are the variables linearization points, and M ∗ = (L∗ , G∗ ,C) 6. If required, calculate Cθ according to (4.18)

4.2 Probabilistic data association

74

Because J is subject to inference, the variables vector must include it. Thus, it is ∆



re-defined as θ = (J, θJ ), where θJ = (X, LJ , GJ ) is the geometric parameters of the map and trajectory assuming the data association solution given by J. Now, the variables we want to infer remain the same, namely the robot trajectory X and the object locations L and sizes G, which do not depend on knowing the specific data association solution. If all variable vectors θJ had the same dimensionality and nature, i. e., every position in θJ corresponded to the same physical unknown regardless of the value of J, we would estimate the unknowns by finding the expectation of the geometric variables with respect to the possible data association solutions: ˆ ∆ ˆ θJ = E[θJ ∣Z] = ∑ P(J∣Z) θJ P(θJ ∣C, J, Z) dθJ , J

(4.19)

θJ

In computer vision, this approach is known as correspondence-less structure-frommotion (DELLAERT et al., 2000). It can be used when the nature of the unknown vector θJ is known a priori, i. e., θJ has a fixed size and each of its components corresponds to a specific variable of the problem, configuring the constrained data association problem described in Section 2.2.3. The advantage of the correspondence-less structure-frommotion approach resides in taking advantage of all information that can be gathered from the data Z to infer the variables of interest, yielding optimal results even if the available data is not sufficient to certainly determine a single good data association solution. The correspondence-less structure-from-motion can be used to infer the robot trajectory X. Since it always denotes the sequence of robot poses from time i = 1 to i = T , regardless of the data association solution J, we can define the target trajectory as the expectation ˆ ∆

Xˆ = E[X∣Z] =

∑ P(J∣Z) J

=

X P(X∣C, J, Z)dX X ˆ

X P(θJ ∣J, Z) dθJ

∑ P(J∣Z) J

(4.20)

θJ

However, when it comes to the map parameters L and G, the dimensionality of the variables vector becomes unknown, and depends on the observed data Z and the assumed object identities given by J, configuring the unconstrained data association

4.2 Probabilistic data association

75

problem. It basically occurs because each value of J implies a different number of objects in the map. For instance, consider the case where data Z contains some detections of the object class "clock". If all detections are associated to a single object, the variables in θJ are related to a single clock; on the other hand, if the "clock" detections are associated to two different objects, there are two sets of variables in θJ related to clocks: one set describes one clock, and the other set describes the other clock. Moreover, even if two different assumptions about the data association lead to vectors with the same number of objects of each class, there is no straight-forward way to match all the objects in both vectors. This incompatibility of variable vectors obtained considering different data association solutions is expressed by assuming that, if P(θJ1 ∣J1 , Z) > 0, then P(θJn ∣Jn , Z) = 0, for all n ∕= 1. To solve the problem of estimating the map parameters without knowing the correct number of objects in the map, this thesis proposes that each object have its parameters calculated separately, i. e., calculating the expectation on each individual object parameters instead of taking the expectation on the whole map at once. For such, it is necessary to develop a criterion to match the same physical object represented in variable vectors θJ for different values of J, assigning the same object identity to them.

4.2.1

Object identities

Object identities are the proposed solution to the problem of taking the expectation over a map with unknown number of objects. To take the expectation over a single object parameters, we have to sum their expected values for all values of J where that object is considered to occur. Let ID be the variable that indicate the identity of certain object, which does not depend on a specific data association solution. Also, let J(ID) represent the index of the object oID parameters in a certain variables vector θJ . Thus lJ(ID) is the location of oID considering the correspondence function J. Frequently, it is possible to match part the same object parameters in two vectors θJ generated with different correspondence functions in a natural way. Following the rationale of Section 3.2.1, that makes the correspondence between the data association solution and object identities, this work proposes that the identity of a single object given a certain correspondence function be defined by the measurements associated to it. In other words, if two instances of J agree that a certain group of measurements

4.2 Probabilistic data association

2

1

1

76

3

2

5

4

6

2

1

7

8

1

3

2

5

4

3

6

5

7

8

3

4

4 (a)

(b)

Figure 4.2: Representation of how to match the same object identities in two maps generated assuming different solutions for the data association. Objects 1, 3, and 4 have the same identity in both correspondence functions represented in (a) and (b), while object 2 in (a) and objects 2 and 5 in (b) cannot be matched used the present definition of object identity.

correspond to a single object, that object is exactly the same in both vectors. For instance, objects 1, 3 and 4 have the same identities in both correspondence functions represented in Figures 4.2a and 4.2b. ∆

The object identity variable is defined as ID = {(i, k)1 , . . . , (i, k)m } a set of measurement indices pairs (i, k). The correspondence function J is said to yield an object index ID if and only if: ∀(i, k) ∈ ID, J(i, k) = j and ∀(i, k) ∈ / ID, J(i, k) ∕= j, where j ∈ N+ can be ∆

any object index. We also define J(ID) = j∣∀(i, k) ∈ ID, J(i, k) = j, which is the object index j that corresponds to the identity ID in the data association function J .Note that the proposed definition of object identity is restrictive: although objects with the same ID though in different variable vectors are, for sure, referred as the same physical object, the objects with different ID may also correspond to the same physical object. For instance, consider object 2 in Figure 4.2a and object 2 on Figure 4.2b. While both could be referred as the same physical object using a more liberal criterion, they do not present the same ID, i. e., the are formed by different measurements subsets. This work does not develop a definition of identity that perfectly matches the same physical object in all different variable vectors, but the proposed definition showed to be quite satisfactory for the purpose of this thesis. The values of the parameters related to a certain object ID are defined by the conditional expectations, which consider only the values of J where the object identity ID is found

4.2 Probabilistic data association

77

ˆ ∆ lˆID = E[lID ∣ID, Z] =

∑ P(J∣ID, Z) J



gˆID = E[gID ∣ID, Z] =

lJ(ID) P(θJ ∣C, J, Z) dθJ

(4.21)

gJ(ID) P(θJ ∣C, J, Z) dθJ

(4.22)

ˆ

θJ

∑ P(J∣ID, Z) J

θJ

where P(J∣ID, Z) is the conditional probability of a certain correspondence function J assuming that the object identity ID is yielded by J (P(J∣ID, Z) = 0 if J does not yield ID). The object classes cID are directly determined from the measurements {ai,k ∣(i, k) ∈ ID}, that must all agree with the same detected class under the assumption of perfect object detection. Because an object identity is yielded by some values of J, but not by others, the belief over the possible values of J implies a certain trust level πID in that a certain ID actually represent an object in the environment. Thus, we can assign a probability to the event of a certain ID occur in the environment: ∆

(4.23)

πID = P(ID∣Z) = ∑ P(ID∣J, Z)P(J∣Z), J

where P(ID∣J, Z) = 1J (ID), the function that indicates whether J yields ID or not. In the strict sense, (4.23) is the marginal probability of the measurements in ID actually refer to the same object. The interpretation given here is that (4.23) corresponds to the probability of the object generated by measurements {zi,k ∣(i, k) ∈ ID} actually existing in the environment, and not being generated by a spurious group of measurements.

4.2.2

Approximating the distribution on data association

Enumerating the data association space is not practical. Since the number of possible associations grows exponentially with the number of measurements, as explained in Section 3.2.1, performing the exact summations in (4.20)-(4.22) is not an option. Nevertheless, most of possible solutions for the data association are quite improbable, specially those that would require the robot trajectory to be a very distorted version of the a priori one, obtained using odometers only. Thus, the probability distribution on data association solutions can be well approximated by a sampled version, concentrated in few instances. Hence, we approximate P(J∣Z) ≈

1 Ns ∑n δ (J, Jn ),

with Ns being

4.3 Treating imperfect object detection

78

the number of samples, and 1 Ns ∑ P(ID∣Jn, Z) Ns n=1

(4.24)

1 Ns ˆ ∑ Xn Ns n=1

(4.25)

lˆID ≈

1 Ns ˆn ∑ lIDP(ID∣Jn, Z) Ns n=1

(4.26)

gˆID ≈

1 Ns n ∑ gˆIDP(ID∣Jn, Z) Ns n=1

(4.27)

XP(θJn ∣Jn , Z) dθJn

(4.28)

lJn (ID) P(θJn ∣, Jn , Z) dθJn

(4.29)

gJn (ID) P(θJn ∣, Jn , Z) dθJn

(4.30)

πID ≈ Xˆ ≈

with ˆ ∆ Xˆn = θJn

ˆ

n lˆID



= θJn

ˆ

gˆnID



= θJn

n , Since the density P(θJn ∣Jn , Z) is assumed to be Gaussian, the expectations Xˆn , lˆID

and gˆnID are simply the corresponding estimated parameters in the variables vector ∆

θn∗ = arg maxθJn P(θJn ∣Jn , Z), calculated according to Section 4.1. The resulting O-Map is ∆ ∆ defined as Mˆ = {oˆID ∣P(ID∣Z) > 0}, where oˆID = (lˆID , gˆID , cID , πID ). The algorithm to build an object-based map from samples Jn considering known object classes is described in Algorithm 2.

4.3

Treating imperfect object detection

Although the object recognition algorithms have noticeably improved much in the past years, there is no flawless practical algorithm reported in the literature, yet. Thus, a practical object-based SLAM solution must allow for errors from the objects detection system and fix them whenever possible. This is the case of problem setup 4, which considers imperfect object detection and unknown data association, which corresponds to most of practical cases. In this work, the computer vision errors are treated in a probabilistic framework, being solved by estimating the probably correct map via probabilistic inference in the model.

4.3 Treating imperfect object detection

79

Algorithm 2 Building an O-Map from data association samples Jn and data Z Input: data Z, and the set of samples{Jn }Ns n=1 Output: the estimated trajectory Xˆ and map Mˆ 1. Let ID_list ← {} 2. For n ranging from 1 to Ns: (a) Calculate the posterior P(θ ∣Jn , Z), with mean θn∗ = (Xˆn , Lˆ n , Gˆ n ) and covariance Cθn according to Algorithm 1 (b) For j ranging from 1 to Nn i. Determine ID so that Jn (ID) = j n and gˆn by directly accessing the corresponding parameter ii. Determine lˆID ID values in θn∗ iii. Set P(ID∣Jn , Z) ← 1 iv. Let IDlist ← ID_list ∪ {ID} 3. For each ID ∈ ID_list, find πID , lˆID , and gˆID using (4.24), (4.26) , and (4.27), respectively 4. Find Xˆ using (4.25), and Mˆ = {(lˆID , gˆID , cID , πID )∣ID ∈ ID_list}

Two types of errors are considered here. The first one is the confusion between two classes, that may occur due to difference in the object image caused by changes in the environment light, image noise, or other reasons. The other one is the presence of spurious measurements due to false positives in object detections. The developed probabilistic model for both errors are described in the Sections 4.3.1 and 4.3.2.

4.3.1

Modeling confusion between classes

So far, the object detected by a detector specialized in a certain class was considered as belonging to that class with probability 1. However, there might be classes having similar instances, so that objects belonging to a certain class may be detected by algorithms specialized in objects of other class. For instance, a detector designed to find chairs in images might, sometimes, detect a door in the place of a chair. This configures the confusion between classes. Keeping the approach adopted in Section 4.1, whereby the probability of detecting a certain object class depends only on its actual class, the data likelihood (4.5) is now

4.3 Treating imperfect object detection

80

given by T Ki { } P(Z∣X, L, G,C, J) = ∏ ∏ P(ui,k ∣xi , lJ(i,k) )P(si,k ∣xi , lJ(i,k) , gJ(i,k) )P(ai,k ∣cJ(i,k) ) ,

(4.31)

i=1 k=1

which is now a multi-modal distribution, where each mode corresponds to a certain combination of object classes C. However, for a specific value of C, the posterior (4.6) does not change its shape, being just multiplied by the penalty factor T

Ki

∏ ∏ P(ai,k ∣cJ(i,k)).

(4.32)

i=1 k=1

Considering that object classes are not crisply given by the measurements anymore, inference becomes ˆ ˆ = arg max ∑ (θˆ , C) C

J

θJ P(X, L, G,C∣J, Z)P(J∣Z) dθJ ,

(4.33)

θJ

since it is not possible to take the expectancy over the object classes C given their discrete nature. However, because one cannot combine variable vectors θJ generated by different data association solutions J, as discussed in Section 4.2, it is necessary to define each parameter estimate in face of the new uncertainty model. While the geometric parameters are still given by expectation with respect to the others, each object class is given by the MAP estimate, according to: ˆ ∆ Xˆ =

J C



P(ID∣Z) =

∑ ∑ P(ID∣J, Z)P(C, J∣Z)

gˆID =

∑∑

lJ(ID) P(θJ ∣C, J, Z)P(ID∣J, Z)P(C, J∣Z) dθJ

(4.36)

gJ(ID) P(θJ ∣C, J, Z)P(ID∣J, Z)P(C, J∣Z) dθJ

(4.37)

ˆ

θJ

∑∑ J C

(4.35)

ˆ

J C



(4.34)

θJ

J C

∆ lˆID =

X P(θJ ∣C, J, Z)P(C, J∣Z) dθJ

∑∑

θJ



cˆID = arg max ∑ c



P(ID∣J, Z)P(C, J∣cJ(ID) = c, Z),

(4.38)

J C−{cJ(ID) }

where C − {cJ(ID) } is the set of variables representing all object classes except cJ(ID) , the class of object oJ(ID) . Analogously to the space of data association solutions, the space of object classes

4.3 Treating imperfect object detection

81

also grows exponentially with the number of objects in the map. The number of possible values for C given the correspondence function J is ∣C ∣NJ , where ∣C ∣ is the number of possible classes an object may belong to. Thus the summation over that space is unfeasible, and a sampled version of P(C, J∣Z) = P(C∣J, Z)P(J∣Z) is considered. It is expected that the object classes under consideration are not so similar, and their instances are not confused so often, and the distribution P(C∣J, Z) is also peaked around some values, so that a sampled version should represent it well. 1 Considering the approximation P(C, J∣Z) ≈ Ns ∑n δ (C,Cn )δ (J, Jn ), the estimates (4.34-

4.37) are still approximated by (4.24-4.27), but now ˆ ∆ ˆ Xn = XP(θJn ∣Cn , Jn , Z) dθJn

(4.39)

θJn

ˆ

n lˆID



=

lJn (ID) P(θJn ∣Cn , Jn , Z) dθJn

(4.40)

gJn (ID) P(θJn ∣Cn , Jn , Z) dθJn

(4.41)

θJn

ˆ

gˆnID



= θJn

Again, since the density P(θJn ∣Cn , Jn , Z) is assumed to be Gaussian, the expectations Xˆn , lˆn , and gˆn are simply the corresponding estimated parameters in the vector ID

θn∗

ID

= arg maxθJn ∣ P(θJn ∣Cn , Jn , Z), calculated according to the procedure described in Sec-

tion 4.1. The estimate of cID is approximated by cˆID ≈ arg max ∑ P(ID∣Jn , Z)δ (c, cJ(ID) ), c

n

which is the most frequently sampled class for a certain object ID.

4.3.2

Filtering out spurious objects

There is no point in considering spurious objects part of the map, if they can be detected. Since there are no perfect object detectors, some objects may not be detected and there may be some spurious measurements in data Z. While the effect of ignored objects is not important for our approach, spurious detections yield objects in the map which are not intended to be mapped. Since these spurious objects are usually detected few times, their positions would be very uncertain, specially if no prior knowledge is considered. Actually, the prior on sizes for the spurious objects should always be unconsidered, since they differ considerably from the object actual sizes,

4.3 Treating imperfect object detection

82

since these objects do not actually belong to the classes that cast such priors. Hence, this work proposes a probabilistic manner to identify spurious objects based on the uncertainty about their positions in the map after removing the effect of the prior knowledge. If this uncertainty is large for a certain object, the chance of this object be spurious is large. The reason is that objects that have been frequently detected, specially by different points of view, have their parameters inferred with a higher certainty degree. This idea can be modeled by assigning a non-spuriousness binary variable ξ j to each object o j , so that P(ξ j ∣C, J, Z) depends on the uncertainty on l j disregarding the prior belief P(l j , g j ∣c j ) on the object geometric parameters. Let θ j = (l j , g j ) be the geometric parameters of the object o j in a certain variables vector θ assuming that the data association solution is given by the correspondence N(J)

function J. Considering the prior density P(θ ∣C, J) = ∏ j=1 P(θ j ∣c j , J) a product of independent densities, the posterior on a specific θ j0 disregarding its prior is 1 ¯ j0 ∣C, J, Z) = P(θ P(Z∣C, J)

ˆ

j0 −1

P(Z∣θ , J)

j=1

θ −{θ j0 }

1 = P(Z∣C, J)

ˆ

θ −{θ j0 }



N(J)

P(θ j ∣c j , J)

∏0

P(θ j ∣c j , J)

j= j +1

P(Z∣θ , J) N(J) P(θ j ∣c j , J) P(θ j0 ∣c j0 ) ∏ j=1

= P(θ j0 ∣C, J, Z)/P(θ j0 ∣c j0 , J),

(4.42)

where P(θ j0 ∣C, J, Z) is the posterior density on θ j0 considering the prior, and θ − {θ j0 } is the variables vector containing the map and trajectory geometric parameters, excluding object o j0 position and geometry. Since both posterior and prior on θ j0 are assumed Gaussian, the covariance matrix of θ j0 disregarding the prior is given by C¯θ−10 = Cθ−10 − diag({Γ(c j0 )−1 , Σ(c j0 )−1 }) j

(4.43)

j

where Cθ j0 is the posterior covariance of θ j0 . Here, the probability of the non-spuriousness variable is defined as ∆

P(ξ j0 ∣C, J, Z) = S(∣∣C¯l j0 ∣∣2 ),

(4.44)

where C¯l j0 is the posterior covariance of l j0 disregarding the prior on θ j0 , and ∣∣.∣∣2 takes the 2-norm of the argument, given by the modulus of its largest eigenvalue. The function S(.) is the mirrored sigmoid function, defined as S : R → (0, 1), S(t) = (1 + exp{(t − t0 )/τ})−1 , where τ and x0 are the function parameters. Figure 4.3 illus-

4.4 Probability distribution on data association solutions and object classes

83

Mirrored sigmoid function 1 0.9 0.8 0.7

S(t)

0.6 0.5 0.4 0.3 0.2 0.1 0 −0.4

−0.2

0

0.2

0.3 t

0.4

0.6

0.8

1

Figure 4.3: Example plot of a mirrored sigmoid function with parameters τ = 0.1 and t0 = 0.3.

trates the shape of the mirrored sigmoid function. To implement the filtering of spurious objects into the framework to obtain objectbased maps, the non-spuriousness variable is taken into account when calculating the object existence probability π, so that ∆

πID =

∑ ∑ P(ID∣J)P(ξJ(ID)∣C, J, Z),

or

(4.45)

J C

πID ≈

∑ P(ID∣Jn)P(ξJn(ID)∣Cn, Jn, Z),

(4.46)

n

if the distribution on J and C is approximated by a sampled version. The algorithm to build an object-based map from samples {(Cn , Jn )}Ns n=1 is described in Algorithm 3.

4.4

Probability distribution on data association solutions and object classes

In this work, the joint probability P(C, J∣Z) is expressed using the Bayes law. Considering that the correspondence function J and the object classes C define a model ∆

with parameters θJ = (X, L, G), the probability of the model defined by J and C is given by marginalizing out its parameters (HOETING et al., 1999)

4.4 Probability distribution on data association solutions and object classes

84

Algorithm 3 Building an objects-based map from correspondence function and object classes samples (Cn , Jn ) Input: data Z, and the set of samples{(Cn , Jn )}Ns n=1 Output: the estimated trajectory Xˆ and map Mˆ 1. Let ID_list ← {} 2. For n ranging from 1 to Ns: (a) Calculate the posterior P(θ ∣Cn , Jn , Z), with mean θn∗ and covariance Cθn (b) For j ranging from 1 to Nn i. Determine ID so that Jn (ID) = j n and gˆn by directly accessing the corresponding parameter ii. Determine lˆID ID ∗ values in θn iii. If the histogram of classes assumed by cID does not exist, create it and initialize all entries to zero iv. Increment the entry corresponding to cnID in the histogram of classes assumed by cID v. Set P(ID∣Jn , Z) ← 1 vi. Let ID_list ← ID_list ∪ {ID} 3. For each ID ∈ ID_list, find: (a) πID using (4.46); (b) lˆID using (4.26) and (4.40); (c) gˆID using (4.27) and (4.41); and (d) cˆID by taking the entry in the histogram of classes assumed by cID with highest count 4. Find Xˆ using (4.25) and (4.39), and Mˆ = {(lˆID , gˆID , cˆID , πID )∣ID ∈ ID_list}

4.4 Probability distribution on data association solutions and object classes

85

ˆ P(C, J∣Z) =

P(θJ ,C, J∣Z) ˆ

θJ

P(Z∣θJ ,C, J)P(θJ ∣C, J)P(C, J) dθJ



(4.47)

θJ

The term P(Z∣C, J)P(C, J) =

´

θJ P(Z∣θJ ,C, J)P(θJ ∣C, J)P(C, J) dθJ

corresponds to the un-

normalized posterior, and is also called the evidence for the model defined by J and C (MINKA, 2000). In Bayesian Model Selection (BMS) schemes, the evidence is used to select the best model that fits the available data among a pool of candidates. If enough data is used, a simplified version for the evidence is used in the Bayesian Information Criterion (BIC) to select the most adequate model complexity (SCHWARZ, 1978). In this work, choosing a single hypothesis for the data association is not the case; rather, the approach adopted here is the Bayesian Model Averaging (BMA), since the sought parameter values are calculated by the marginalization of all possible assumptions on the data association solution and object classes (see (4.34-4.37)). In the BMA approach, the evidence is used as the shape of the target distribution P(C, J∣Z), used to take the expectation of a certain parameter. This approach is supported by theoretical and practical results, among which the main reason that encourages it is the fact that the evidence for the correct model tends to become infinitely higher than the others evidence as far as the size of the data set grows to infinity (BERGER; PERICCHI, 1995).

4.4.1

Calculating the evidence

The evidence to be calculated is ˆ P(Z∣C, J) = P(Z∣θJ ,C, J)P(θJ ∣C, J) dθJ ,

(4.48)

θJ

which is the same that integrating the non-normalized posterior (4.6), except that now, because the detected and assumed object classes may be different, the posterior is now penalized by a constant factor, as discussed in Section 4.3.1. Although it does not modify the geometric SLAM inference procedure for given C and J, the penalty must be taken into account for the evidence calculation. In Section 4.1.2, the integrand is assumed to be proportional to a Gaussian density, so that we consider the integration of a Gaussian-shaped function. Let g(x) =

4.4 Probability distribution on data association solutions and object classes

86

} { 1 2 β exp − 2 ∣∣x − µx ∣∣Σx be the general form of a multivariate Gaussian-shaped function, where β is a constant:

ˆ g(x) dx = β



∣2πΣx ∣

(4.49)

x

For β = 1, (4.49) corresponds to the normalizing factor of the multivariate Gaussian density. We can also isolate the constant β by making x = µx , resulting g(µx ) = β . In our case, the Gaussian function to be integrated is g(θJ ;C, J) = P(Z∣θJ ,C, J)P(θJ ∣C, J)

(4.50)

In order to calculate the target integral, we need to determine the distribution peak value g(θJ∗ ;C, J) and the determinant of the correspondent parameter Σθ . The ma1

trix of (4.50) can be rewritten in terms of the QR factorization P− 2 A = [ representation ] R Q , so that 0 K

i P(aik ∣cJ(i,k) ) − 1 (∣∣Rδ θ −c∣∣2 +∣∣r∣∣2 ) ∏Ti=1 ∏k=1 J √ g(θJ ;C, J) = e 2 ∣2πP∣

(4.51)

The peak value g(θJ∗ ;C, J) occurs when the negative log of (4.51) is minimized, which, as stated in Section 4.1.2, happens when Rδ θJ − c = 0, leaving ∣∣r∣∣2 as the total square residual. Thus, K

g(θJ∗ ;C, J) =

i P(aik ∣cJ(i,k) ) − 1 (∣∣r∣∣2 ) ∏Ti=1 ∏k=1 √ e 2 ∣2πP∣

(4.52)

which is immediately computed if the optimal parameters θJ∗ are determined via QR factorization of the associated linear LS problem. Moreover, because g(θJ ;C, J) ∝ P(θJ ∣C, J), which is a Gaussian posterior, the parameter Σθ corresponds to the posterior covariance, given by (4.18). Since the Cholesky factor R is a square matrix, the covariance determinant is easily calculated if such factor is available: ∣Σθ ∣ = ∣R∣−2 = ∏ rii−2

(4.53)

i

where rii is the ith diagonal element of R. To illustrate the effect of the chosen formulation for P(C, J∣Z), a simple example is proposed. For sake of simplicity, the example concerns about the probability of assuming different solutions for the data association, since this is the factor that influences

4.4 Probability distribution on data association solutions and object classes

Robot

87

obstacle

r1 r2 0

x

Figure 4.4: Toy example proposed to illustrate how the data association probability model work.

the complexity of the model, i. e., the number of objects assumed in the environment. Consider the case where a static robot has a single sonar beam, being able to detect the distance to an obstacle right in front of it, as depicted in Figure 4.4. The robot is placed in the origin of the coordinates, oriented towards the x axis. Its sensor detects the distance to frontal obstacles with an uncertainty represented by the standard error σr . After taking two readings in different instants of time, r1 and r2 , the data association problem corresponds to determining whether both readings correspond to the same obstacle position (J1 ), or if the obstacle has moved in the meanwhile (J2 ). Thus, according to the notation defined so far, J1 (1, 1) = 1 and J1 (2, 1) = 1, while J2 (1, 1) = 1 and J2 (2, 1) = 2. In the case J1 corresponds to the correct data association solution, the variables vector comprises only the position of the detected obstacle xo . Assuming a Gaussian prior on xo , P(xo ) ∼ N (0, σo2 ), its posterior density is } { 1 1 ∗ 2 exp − ∣∣xo − xo ∣∣Σ1 P(xo ∣r1 , r2 , J1 ) = √ 2 ∣2πΣ1 ∣ ⎧ ⎡   xo /σo  ⎨ 1 1 ⎢ ∝ √ exp − ⎢ ⎣ (xo − r1 )/σr  (2π)3 σr4 σo2  2  ⎩ (xo − r2 )/σr with xo∗ =

σo2 σr4 σo2 σr2 (r1 + r2 ) and Σ = 1 2σo2 σr2 + σr4 2σo2 σr2 + σr4

⎤ 2 ⎫    ⎥ ⎬ ⎥ ⎦    ⎭

(4.54)

(4.55)

In turn, if J2 is the correct data association, θJ2 = (xo1 , xo2 ), comprising the obstacle

4.4 Probability distribution on data association solutions and object classes

88

positions in two different instants. Assuming independent Gaussian priors on xo1 and xo2 , with P(xo1 ) ∼ N (0, σo2 ) and P(xo2 ) ∼ N (0, σo2 ), the posterior density is 1 1 exp{− ∣∣θJ2 − θJ∗2 ∣∣2Σ2 } 2 ∣2πΣ2 ∣ ⎧ ⎡   xo1 /σo    ⎢  ⎨ ⎢ xo2 /σo 1 ⎢ 1 exp − ⎢ ∝ √ 4 4 4  2 ⎢ (xo1 − r1 )/σr (2π) σr σo   ⎣    ⎩ (xo2 − r2 )/σr

P(θ ∣r1 , r2 , J2 ) = √

with xo∗1 =

⎤ 2 ⎫    ⎥   ⎥  ⎬ ⎥ ⎥ ⎥   ⎦     ⎭

σr2 r1 σr4 σr2 r2 ∗ , x = , and Σ = I 2 2x2 2 σr2 + σo2 o2 σr2 + σo2 σr + σo2

(4.56)

(4.57)

Supposing that the prior probability distribution on J is uniform (P(J1 ) = P(J2 ) = 0.5), the evidences for both associations are given by √ 2 1 0.5 1 √ (4.58) ∣2πΣ1 ∣ P(r1 , r2 ∣xo∗ , J1 )P(xo∗ )P(J1 ) = e− 2 res1 2π 2σo2 σr2 + σr4 √ 0.5 σr2 /σo2 − 1 res22 P(r1 , r2 ∣J2 ) = ∣2πΣ2 ∣ P(r1 , r2 ∣xo∗1 , xo∗2 , J2 )P(xo∗1 , xo∗2 )P(J2 ) = e 2 (4.59) 2π σr2 + σo2

P(r1 , r2 ∣J1 ) =

where res21 and res22 are the residuals of the squared norms in (4.54) and (4.56), respectively. Although one expects the prior density on the parameters P(θ ∣J) not to influence on the data association evidence as much as the data likelihood P(Z∣θ , J), a totally uninformative prior cannot be used. Looking at the math, the ratio P(r1 , r2 ∣J1 )/P(r1 , r2 , ∣J) depends directly on the prior ratio P(xo∗ )/P(xo∗1 , xo∗2 ) =



{ } 1 [ ∗ 2 ( ∗ )2 ( ∗ )2 ] 2πσo exp − (xo ) − xo1 − xo2 2

(4.60)

The expression for the residuals res1 and res2 are too cumbersome, and will not be shown for the sake of simplicity. However, Figure 4.5 compares the plots of both residuals as a function of r2 , considering r1 = 1m and sensible values for the standard deviations σr and σo . One can see that the residual in the single obstacle position case is minimum when r2 = r1 . In addition, assuming J1 always implies greater residual than assuming J2 , except in a small window around r2 = r1 . J1 produces greater residuals because it yields a more rigid model by constraining both readings to the same obstacle position, while J2 softens the model by allowing for free obstacle movement.

Figure 4.5: Plots of the residuals res1 and res2 as functions of r2 (with r1 = 1 m), for σo = 5 m, σr = 0.1 m and for σo = 2 m, σr = 0.1 m.
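The residual and evidence comparison of this example is easy to reproduce numerically. The sketch below is ours (the thesis's own experiments were coded in Matlab); it evaluates the expressions (4.55)-(4.59) as reconstructed above for a few values of r2, and all variable names are illustrative.

import numpy as np

def evidences(r1, r2, sigma_r, sigma_o):
    """Evidence of J1 (same obstacle position) vs. J2 (obstacle moved)
    for the two-reading sonar example, following (4.54)-(4.59)."""
    # J1: a single obstacle position with Gaussian prior N(0, sigma_o^2)
    den1 = 2 * sigma_o**2 * sigma_r**2 + sigma_r**4
    xo = sigma_o**2 * sigma_r**2 * (r1 + r2) / den1            # posterior mean (4.55)
    res1_sq = (xo / sigma_o)**2 + ((xo - r1)**2 + (xo - r2)**2) / sigma_r**2
    ev1 = 0.5 / (2 * np.pi * np.sqrt(den1)) * np.exp(-0.5 * res1_sq)   # (4.58)
    # J2: two independent obstacle positions, same prior on each
    den2 = sigma_r**2 + sigma_o**2
    xo1, xo2 = sigma_o**2 * r1 / den2, sigma_o**2 * r2 / den2  # posterior means (4.57)
    res2_sq = ((xo1**2 + xo2**2) / sigma_o**2
               + ((xo1 - r1)**2 + (xo2 - r2)**2) / sigma_r**2)
    ev2 = 0.5 / (2 * np.pi * den2) * np.exp(-0.5 * res2_sq)            # (4.59)
    return ev1, ev2

r1 = 1.0
for r2 in (0.95, 1.0, 1.2, 1.5):
    e1, e2 = evidences(r1, r2, sigma_r=0.1, sigma_o=5.0)
    print(f"r2={r2:.2f}  P(r|J1)={e1:.2e}  P(r|J2)={e2:.2e}  ->  {'J1' if e1 > e2 else 'J2'}")

Near r2 = r1 the constrained model J1 has higher evidence; for readings far apart the evidence switches to J2, reproducing the Occam's razor behavior discussed next.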

To analyze the evidence behavior, Figure 4.6 compares the plots of the evidences due to J1 and J2 considering different values of σo and σr. In all plots, there is a window around r2 = r1 where P(r1, r2|J1) > P(r1, r2|J2). In fact, the closer r1 and r2 are, the more plausible the single-obstacle-position hypothesis is. This observation corroborates the explanation of Minka (2000) about how BMS works: when the available data is likely to be generated either by a simple (more constrained) or by a complex (more relaxed) model, the simpler one has higher evidence. If the simpler model is not very likely to have generated the data, then the more complex one is preferred. Thus, BMS implements an automatic Occam's razor when it selects the model with greater evidence.

The standard deviations σr and σo also influence the regions where one model has higher evidence than the other. If the measurement uncertainty is high, it gets harder to distinguish between different obstacle positions, analogously to what happens to a shortsighted person trying to distinguish different objects at long range. On the other hand, if the measurements are sharp, it is easier to distinguish different obstacle positions, and the model assuming J2 is preferred to J1 unless r1 and r2 are very close.

In turn, the influence of the prior standard deviation σo is smoother. Typically, values for σo are high, since the data should be more relevant than the prior belief for estimating the model parameters. Hence, the prior factor (4.60) tends to penalize the more complex model yielded by assuming J2. In the limit, when a totally uninformative prior is used, σo → ∞, and the simplest model, yielded by assuming J1, has posterior probability P(J1|Z) = 1. Therefore, even though the prior on the parameters does not much affect the inferred parameter values given the data, it is important for model selection (BERGER; PERICCHI, 1995). As a consequence, default uninformative priors cannot be used, and some care should be taken to define that prior density.

Figure 4.6: Comparison of the plots of the model evidence considering the data associations J1 and J2, with varying standard deviations: σr = 0.1 m, σo = 10 m; σr = 0.1 m, σo = 1.5 m; σr = 0.5 m, σo = 5 m; and σr = 0.01 m, σo = 5 m.


Despite the methods proposed by some authors to define objective default priors (see the paper of Kass & Wasserman (1996) for a review of the topic), this work adopts the subjective view that P(θJ) must reflect the designer's belief. Besides the relative stability of the evidence w.r.t. variations in σo, as depicted in Figure 4.6, here the prior on θ implies a scale at which the model is observed. If the prior is uninformative, the observation scale is very large and the difference between the models with one or two obstacle positions becomes irrelevant, making the Occam's razor decide for the simpler model. On the other hand, if the prior focuses on a limited area where the obstacle can be, the differences between the two models become more perceptible when both models agree that the obstacle is inside that area.

5 MCMC Sampling

This chapter is devoted to showing how to draw samples from the probability distribution over data association solutions and object classes, defined in Section 4.4. In this work, the MCMC approach is adopted, since there are theoretical and practical reasons to believe that it is a promising methodology for performing approximate inference, especially in the combinatorial data association space (DELLAERT, 2001; RANGANATHAN, 2008).

MCMC methods work by simulating a Markov chain over the state space to be sampled, with the property of ultimately converging to the distribution of interest. Markov chains are stochastic processes over time, represented by a sequence of random state variables {S0, S1, ...} whose values obey the Markov property. This property states that the probability of future state values given the current state does not depend on past states. Thus, the properties of a Markov chain are described by its transition probabilities P(St+1|St), which are usually the same regardless of the time t.

Some classes of Markov chains have the property of converging to a stationary distribution. This means that, for those chains, the marginal distribution on the current state value, P(St), does not depend on the time if t is large enough, regardless of the initial state. In this case, if the state space corresponds to the one we want to sample from, using the stationary Markov chain states as samples ensures that they are drawn from the target distribution. MCMC methods exploit this property by defining a Markov chain that samples from a certain distribution of interest, without the need to explicitly specify the chain's transition probabilities beforehand.
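As a toy numerical illustration of this convergence property (unrelated to the thesis's own state space; the transition matrix below is arbitrary), the marginal P(St) can be propagated until it reaches the stationary distribution:

import numpy as np

# Arbitrary 3-state transition matrix: P[s, s'] = P(S_{t+1} = s' | S_t = s)
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

p = np.array([1.0, 0.0, 0.0])      # start deterministically in state 0
for t in range(100):
    p = p @ P                      # propagate the marginal P(S_t)
print("marginal after 100 steps:", p)

# The stationary distribution is the left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
print("stationary distribution:  ", pi / pi.sum())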

5.1 The Metropolis-Hastings algorithm

To understand how the MCMC approach works, let us introduce the Metropolis-Hastings (MH) algorithm (HASTINGS, 1970), which is the basis of all MCMC methods (GILKS; RICHARDSON; SPIEGELHALTER, 1996a).


Algorithm 4 The Metropolis-Hastings algorithm for sampling P(C, J|Z)

1. Start the chain with a valid state (C0, J0), and set the sample index n ← 0
2. For t ranging from 1 to Nb + Ns, where Ns is the desired number of samples and Nb is a burn-in period, do:
   (a) Propose a new state (C′, J′) according to an appropriate proposal distribution q(Ct, Jt → C′, J′)
   (b) Calculate the acceptance ratio
       $$ \alpha = \min\left(1, \frac{P(Z \mid C', J')\,P(C', J')\,q(C', J' \to C_t, J_t)}{P(Z \mid C_t, J_t)\,P(C_t, J_t)\,q(C_t, J_t \to C', J')}\right) \qquad (5.1) $$
   (c) With probability α, accept (C′, J′) and set Ct+1 ← C′ and Jt+1 ← J′. Otherwise, set Ct+1 ← Ct and Jt+1 ← Jt, characterizing a self-transition in the chain
   (d) If t ≥ Nb, set the sample index n ← n + 1 and assign the sample (Cn, Jn) ← (Ct+1, Jt+1)
   (e) Set t ← t + 1

The MH algorithm implements a Markov chain by proposing new states and accepting them or not according to a statistical criterion. The current chain state is returned as a sample at every state transition, usually after a burn-in period, in which the chain is considered to be in a transient regime. Consider a Markov chain whose state space corresponds to the one we want to sample from. Given the current chain state St, the MH algorithm works by accepting or rejecting a proposed new state S′ generated by a proposal function, which implements a (usually simple) algorithm to sample from a proposal distribution over the states, q(St → S′). The proposed state is accepted with probability

$$ \alpha = \min\left(1, \frac{P(S')\,q(S' \to S_t)}{P(S_t)\,q(S_t \to S')}\right), $$

where P(S) is the target distribution we want to draw samples from. The MH algorithm applied to sampling the joint space of data association solutions and object classes is described in Algorithm 4. Note that, besides the theoretical guarantee of sampling from the correct probability distribution when t is large enough, the MH algorithm requires P(S) to be computable only up to a proportionality constant, which is essential for the problem addressed in this thesis (see the target probability distribution formulation (4.47)).
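The following is a minimal, generic sketch of the MH loop (ours; the toy target and random-walk proposal are illustrative placeholders for the far more structured P(C, J|Z) and the moves of Section 5.2):

import random

def metropolis_hastings(target, propose, q_ratio, s0, n_burn, n_samples):
    """Generic MH sampler. `target(s)` is the unnormalized target density,
    `propose(s)` draws s' ~ q(s -> .), and `q_ratio(s, sp)` returns
    q(s' -> s) / q(s -> s')."""
    s, samples = s0, []
    for t in range(n_burn + n_samples):
        sp = propose(s)
        alpha = min(1.0, target(sp) / target(s) * q_ratio(s, sp))
        if random.random() < alpha:
            s = sp                           # accept; otherwise self-transition
        if t >= n_burn:
            samples.append(s)
    return samples

# Toy usage: sample states 0..9 with unnormalized weights i + 1, using a
# symmetric random walk on a ring (so q_ratio is identically 1).
samples = metropolis_hastings(target=lambda s: s + 1.0,
                              propose=lambda s: (s + random.choice([-1, 1])) % 10,
                              q_ratio=lambda s, sp: 1.0,
                              s0=0, n_burn=1000, n_samples=20000)
print([round(samples.count(i) / len(samples), 3) for i in range(10)])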


To ensure that the Markov chain defined by the MH algorithm converges to the desired stationary distribution over states, some requirements on the proposal distribution must be observed (ROBERTS, 1996). The first one is that the proposed state transitions must be reversible, i.e., q(St → S′) > 0 ⇒ q(S′ → St) > 0. The other one is the possibility of reaching any state with positive probability after a finite number of transitions, regardless of the initial state S0, making the chain irreducible. In other words, the proposal distribution must allow going from any state to any other by a finite sequence of proposed states.

To implement the MH algorithm, it is necessary to define two probability distributions: the target one (just up to a proportionality constant) and the proposal one, which describes how the proposal function picks states to propose. While the target distribution is specific to the addressed problem, there are no restrictions on the proposal distribution other than the reversibility condition and the chain irreducibility guarantee. However, although the frequency of the samples extracted from the consecutive chain states obeys P(S) almost surely only as the number of samples goes to infinity, we want this distribution to be well represented by the fewest possible samples, since computing and evaluating them demands computational effort.

Well-designed proposal distributions can help reduce the necessary number of samples by proposing state transitions that are more likely to be accepted, while exploring the state space and visiting the most probable states more frequently. Thus, good proposal distributions are designed in an attempt to mimic the target distribution, making the acceptance ratio (5.1) closer to 1 than general-purpose ones. Ideally, the proposal distribution would be the target distribution itself; however, if directly sampling it were a reasonable option, the MCMC approach would be unnecessary. Usually, computing the target distribution is computationally prohibitive, especially when it is known only up to a normalization factor, since calculating this factor requires a summation over the whole variable space. On the other hand, if very simple and computationally inexpensive proposal distributions are used, like the uniform distribution used in the random-walk algorithm (METROPOLIS et al., 1953), so many samples may be necessary to acceptably represent the target distribution that the MCMC approach becomes infeasible.

Because MCMC methods are based on running a Markov chain over the space to be sampled, some notion of state neighborhood has to be defined, at least implicitly, to implement the proposal function: given the current chain state, the states that can be proposed from it define such a neighborhood. There is evidence that the plausible data association solutions (those with non-negligible probability) are similar to each other,


and are placed in a relatively small neighborhood if a certain type of proposal function is used (RANGANATHAN, 2008). That is an extra motivation for using the MCMC approach to sample from the space of data association solutions and object classes.

5.2 Proposal distribution

The correspondence function J, which defines the solution for the data association problem, can be seen as a partition of the set of object detections, where each part gives rise to an object. The same idea was used to define the object identities across different correspondence functions in Section 4.2.1, assuming that equal partitions always correspond to the same physical objects. Hence, the definition of the data association problem in this work (and in the SLAM problem in general) is closely related to the topological mapping problem, where a set of landmark detections has to be partitioned into distinct landmarks.

In topological maps, landmarks are special places the robot has passed through, linked by a traversability relation. They are most commonly represented by a graph where nodes are the landmarks and edges represent a path from one landmark to another. The specific landmark locations are usually not very important, so that the ultimate goal of topological mapping is finding the graph topology. Because of its success in drawing samples from the space of topological maps, the proposal distribution used in this work is inspired by the work on Probabilistic Topological Maps (PTMs) (RANGANATHAN, 2008). In that work, an MCMC method samples over the space of topologies, rendering a histogram-like probability distribution, motivated by the fact that the available input data may be insufficient to determine a single good topology.
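To make the partition view concrete, a correspondence function J can be stored, for instance, as a mapping from detection indices (i, ki) to object indices, from which the detection subsets that define each object are recovered. The sketch below is our own illustration with hypothetical toy data, not the thesis's actual data structure:

from collections import defaultdict

# A correspondence function J as a dict: (time step i, detection k_i) -> object index.
# Hypothetical toy data: five detections grouped into two objects.
J = {(1, 1): 1, (1, 2): 2, (2, 1): 1, (3, 1): 1, (3, 2): 2}

def partition(J):
    """Recover the partition of the detection set induced by J:
    one subset of (i, k_i) pairs per object index."""
    parts = defaultdict(set)
    for detection, obj in J.items():
        parts[obj].add(detection)
    return dict(parts)

print(partition(J))
# Two correspondence functions inducing equal partitions are taken to
# describe the same physical objects, as assumed in Section 4.2.1.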

5.2.1 Proposed state transitions

In the work on PTMs, a new topology is proposed according to a simple transformation of the current one, referred to as a move. There are two possible moves: the merge move, which merges two subsets of detected landmarks into a single landmark locality, and the split move, which splits the subset of observed landmarks corresponding to a certain locality into two disjoint subsets. The same strategy is adopted in this work, merging or splitting subsets of the object detection set.


For the sake of simplicity, the reference to merging or splitting objects is used to express the act of merging or splitting object detection subsets, which give rise to different objects in a map. Although the kind of chain state transition used here is similar to the one proposed by Ranganathan (2008), this work also considers the case of unknown object classes, which requires sampling over that space as well. Furthermore, a different way to choose the objects to be split or merged is proposed here, even in the case of perfect object detection. This thesis is only concerned with data-driven proposal distributions, which means that the input data is always used to propose new chain states. Check the work of Ranganathan (2008) for a comparative study on how data-driven and non-data-driven distributions influence the performance of MCMC methods in approximating the target distribution.

The proposal function works by choosing one of three possible moves: merge, split, or switch class, with probabilities Pmerge, Psplit, and Pswitch, respectively. Each move is designed in an attempt to mimic the target distribution behavior, P(C, J|Z), discussed in Section 4.4.1. As a general rule, the proposal function tries to propose good (highly probable) samples most of the time, while allowing not-so-good ones to be proposed sometimes. However, controlling whether a proposal is good or not is difficult in general. Building a heuristic that does this task very well might be almost as costly as sampling directly from the target distribution, so the proposed moves apply only some intuitive rules that cannot precisely propose good and bad samples at will.

The intuitive rule for the merge move is that close objects may be merged without increasing the measurements residual very much, while simplifying the model, probably yielding good samples. The merge move also favors merging objects belonging to the same class, because mixing detections of different object classes in a single object yields highly penalized models, according to the data likelihood formulation (4.31). The procedure taken to propose merge moves is described in Algorithm 5.

The measurements residual reflects the difference between the actual measurements Z and the predicted measurements given the inferred map and trajectory. It is roughly equivalent to the residual of the overdetermined linear system (4.15) used to solve the geometric SLAM problem. Although the system residual is also influenced by the differences between the prior belief about the object geometric parameters and the mapped ones, the fraction of the residual due to these differences is usually irrelevant when compared to the fraction caused by measurement differences.


Algorithm 5 Proposing a merge move

1. Let (Xt*, Mt*) = θt* = arg maxθ P(θ|Ct, Jt, Z) be the map and trajectory obtained using the current state (Ct, Jt).
2. Determine dt = {dj}, j = 1, ..., Nt, where Nt is the number of objects in Mt* and dj is the number of objects of the same class as oj ∈ Mt* that are closer to it than a certain distance threshold (e.g., 50 cm).
3. Transform dt into a probability distribution by normalization and sample an object index j1 from it.
4. Determine et = {exp(−ej/2)}, j = 1, ..., Nt, j ≠ j1, where ej = ||lj − l_{j1}||²/σE², with the length σE chosen accordingly. If cj ≠ c_{j1}, multiply ej by a penalty factor (e.g., 3).
5. If et is empty, re-propose J′ = Jt and C′ = Ct, and return.
6. Transform et into a probability distribution by normalization and sample an object index j2 from it.
7. Propose a new correspondence function J′ by merging the objects o_{j1} and o_{j2} into o_{j1}, doing, for all (i, ki) ∈ {i × Ki, i = 1, ..., T}:
   • If Jt(i, ki) < max(j1, j2), J′(i, ki) = Jt(i, ki)
   • If Jt(i, ki) = max(j1, j2), J′(i, ki) = min(j1, j2)
   • If Jt(i, ki) > max(j1, j2), J′(i, ki) = Jt(i, ki) − 1
8. Calculate the class proposal distribution q(c_{j1}|J′, Z) ∝ ∏_{i=1}^{T} ∏_{ki : J′(i,ki) = j1} P(a_{i,ki}|c_{j1}).
9. Sample the new class c′_{j1} according to q(c_{j1}|J′, Z).
10. Calculate the probability of the proposed transition:
    q(Ct, Jt → C′, J′) = Pmerge · (dt(j1) / Σ_{j=1}^{Nt} dt(j)) · (et(j2) / Σ_j et(j)) · q(c′_{j1}|J′, Z).
11. Since the reverse transition can only be achieved by performing a split move, calculate its probability q(C′, J′ → Ct, Jt) by performing steps 1, 2, 6, and 8 of Algorithm 6, assuming jsplit = min(j1, j2).
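The index bookkeeping of step 7 (keep indices below the merged pair, map the larger index onto the smaller one, shift higher indices down) can be sketched as follows; representing J as a Python dict is our own illustrative choice:

def merge_objects(J, j1, j2):
    """Return a new correspondence function J' with objects j1 and j2
    merged, following step 7 of Algorithm 5."""
    lo, hi = min(j1, j2), max(j1, j2)
    J_new = {}
    for det, j in J.items():
        if j < hi:
            J_new[det] = j            # indices below the removed one are kept
        elif j == hi:
            J_new[det] = lo           # merged detections take the smaller index
        else:
            J_new[det] = j - 1        # higher indices shift down by one
    return J_new

# Toy example: merge objects 1 and 3 out of {1, 2, 3}
J = {(1, 1): 1, (2, 1): 3, (2, 2): 2, (3, 1): 3}
print(merge_objects(J, 1, 3))  # object 3's detections now map to object 1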


The split move is designed based on the idea that the objects responsible for a great portion of the measurements residual are probably stressing the model too much, and splitting them is likely to generate good samples. The residual per object is calculated by taking the norm of the residuals of all system equations that describe measurements of a single object, given by

$$
r^2(j) = \sum_{i=1}^{T} \; \sum_{\substack{k_i = 1 \\ J(i,k_i) = j}}^{K_i} \Big[ \big\| U^x_{i,k_i}\, \delta x_i^* + U^l_{i,k_i}\, \delta l_j^* + h_u(x_i^0, l_j^0) - u_{i,k_i} \big\|^2_{R_{i,k_i}} + \big\| S^x_{i,k_i}\, \delta x_i^* + S^l_{i,k_i}\, \delta l_j^* + S^g_{i,k_i}\, \delta g_j^* + h_s(x_i^0, l_j^0, g_j^0) - s_{i,k_i} \big\|^2_{W_{i,k_i}} \Big], \qquad (5.2)
$$

where δxi*, δlj*, and δgj* are the inferred variational trajectory and geometric object parameters given the corresponding linearization points and the correspondence function J. The object itself, however, is split randomly. Algorithm 6 explains how to perform a split move.

Proposing a switch class move, which explores the space of object classes, is easier. For a given correspondence function J, the behavior of the target distribution P(C, J|Z) (4.47) is strongly ruled by the penalty factor due to object class confusions (4.32), which is feasible to calculate. Moreover, strongly penalized objects are more prone to belong to the wrong class than weakly penalized ones. Thus, the switch class move tends to choose a highly penalized object and switch its class to a more suitable one, as described in Algorithm 7.

5.3 Practical issues about the proposed sampling algorithm

Because of the general-purpose nature of MCMC methods, their performance strongly depends on the specific implementation being used. More specifically, it depends on the behavior of the underlying Markov chain and on the computational cost of running the state transitions. An important issue for efficient MCMC implementations is the definition of data-driven proposal distributions, discussed in Section 5.1, which encourage the chain to explore the most probable states with higher probability. An important consequence of designing good proposal functions is improving the chain mixing.


Algorithm 6 Proposing a split move

1. Let (Xt*, Mt*) = θt* = arg maxθ P(θ|Ct, Jt, Z) be the map and trajectory obtained using the current state (Ct, Jt).
2. Calculate the total square residual per object, rt², using (5.2).
3. Transform rt² into a probability distribution by normalization, and use it to sample the index jsplit.
4. Let Ψ ← {(i, ki) | Jt(i, ki) = jsplit} be the subset of the measurements set that is assigned to object o_{jsplit}. Partition Ψ into two disjoint subsets Ψ1 and Ψ2 so that:
   • The measurement index (imin, 1) is assigned to subset Ψ1, with imin = min{i | ∃ki ∈ Ki, Jt(i, ki) = jsplit}
   • A randomly chosen measurement (i2, ki2) ∈ Ψ − {(imin, 1)} is assigned to subset Ψ2
   • The remaining elements (i, ki) ∈ Ψ − {(imin, 1), (i2, ki2)} are randomly assigned either to set Ψ1 or to set Ψ2
5. Propose a new data association J′ by splitting the measurement subset referring to object o_{jsplit}, doing, for all (i, ki) ∈ {i × Ki, i = 1, ..., T}:
   • If (i, ki) ∉ Ψ, J′(i, ki) = Jt(i, ki)
   • If (i, ki) ∈ Ψ1, J′(i, ki) = jsplit
   • If (i, ki) ∈ Ψ2, J′(i, ki) = Nt + 1
6. Calculate the class proposal distributions q(c_{jsplit}|J′, Z) ∝ ∏_{i=1}^{T} ∏_{ki : J′(i,ki) = jsplit} P(a_{i,ki}|c_{jsplit}) and q(c_{N′}|J′, Z) ∝ ∏_{i=1}^{T} ∏_{ki : J′(i,ki) = N′} P(a_{i,ki}|c_{N′}), where N′ is the number of objects of a map obtained using the proposed correspondence function J′.
7. Sample the new classes c′_{jsplit} and c′_{N′} according to q(c_{jsplit}|J′, Z) and q(c_{N′}|J′, Z), respectively.
8. Calculate the probability of the proposed transition:
   q(Ct, Jt → C′, J′) = Psplit · (rt²(jsplit) / Σ_{j=1}^{Nt} rt²(j)) · (|Ψ| − 1)^{−1} · (1/2)^{|Ψ|−2} · q(c′_{jsplit}|J′, Z) · q(c′_{N′}|J′, Z).
9. Since the reverse move can only be achieved by performing a merge move, calculate its probability q(C′, J′ → Ct, Jt) by performing steps 1, 2, 4, 8, and 10 of Algorithm 5 twice: once considering j1 = jsplit and j2 = N′, rendering q1(C′, J′ → Ct, Jt), and another time considering j1 = N′ and j2 = jsplit, rendering q2(C′, J′ → Ct, Jt). Then add both probabilities: q(C′, J′ → Ct, Jt) = q1(C′, J′ → Ct, Jt) + q2(C′, J′ → Ct, Jt).
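The random bipartition of step 4, together with the probability of the drawn partition that enters step 8, can be sketched as follows (the list representation of Ψ is our own illustrative choice, and Ψ is assumed to contain at least two detections):

import random

def random_split(psi):
    """Randomly bipartition the detection subset psi (a list of (i, k_i)
    pairs whose first element plays the role of (i_min, 1)), following
    step 4 of Algorithm 6. psi must contain at least two detections.
    Returns (psi1, psi2, probability of drawing this particular partition)."""
    first, rest = psi[0], list(psi[1:])
    seed2 = rest.pop(random.randrange(len(rest)))   # uniformly chosen seed of Psi_2
    psi1, psi2 = [first], [seed2]
    for det in rest:                                # remaining detections: fair coin each
        (psi1 if random.random() < 0.5 else psi2).append(det)
    prob = 1.0 / (len(psi) - 1) * 0.5 ** (len(psi) - 2)
    return psi1, psi2, prob

print(random_split([(1, 1), (2, 1), (3, 1), (4, 2)]))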


Algorithm 7 Proposing a switch class move

1. Let (Xt*, Mt*) = θt* = arg maxθ P(θ|Ct, Jt, Z) be the map and trajectory obtained using the current sample (Ct, Jt).
2. Propose J′ = Jt.
3. For j ranging from 1 to Nt, calculate the vector pt(j) = 1 − ∏_{i=1}^{T} ∏_{ki : Jt(i,ki) = j} P(a_{i,ki}|c_j^t)
4. Transform pt into a probability distribution by normalization and sample an object index jswitch from it.
5. Calculate the class proposal distribution q(c_{jswitch}|J′, Z) ∝ ∏_{i=1}^{T} ∏_{ki : J′(i,ki) = jswitch} P(a_{i,ki}|c_{jswitch})
6. Sample the new class c′_{jswitch} according to q(c_{jswitch}|J′, Z)
7. Calculate the probability of the proposed transition: q(Ct, Jt → C′, J′) = Pswitch · (pt(jswitch) / Σ_{j=1}^{Nt} pt(j)) · q(c′_{jswitch}|J′, Z).
8. Since the reverse move can only be achieved by performing another switch class move, calculate its probability q(C′, J′ → Ct, Jt) by performing steps 1, 3, 5, and 7, assuming the same jswitch.


The mixing rate is a measure of how fast the chain switches among the non-negligible states (GILKS; RICHARDSON; SPIEGELHALTER, 1996a); check the paper of Tierney (1996) for formal definitions. Indirectly, it indicates how fast the resulting distribution, consisting of the normalized samples histogram, approximates the target distribution as the number of samples grows. Thus, a chain with a good mixing rate alternates among high-probability states most of the time, and the samples histogram comes to resemble the target distribution very quickly.

The efficiency of generating and evaluating samples, and of driving the chain to a converged state, are other important issues in the implementation of MCMC methods. The chain is considered converged when the marginal probability of sampling a certain state no longer depends on the initial state. When the state space includes the space of data association solutions, chain convergence is associated with reaching the neighborhood where most non-negligible solutions are. However, even after chain convergence, infinitely many samples would be necessary to perfectly mimic the distribution. Thus, a great number of samples is needed to approximate the target posterior well enough, and efficient methods to generate and evaluate samples are required to make the sampling process computationally feasible.

Finally, as with any sampling method, implementing MCMC techniques requires defining the number of samples to be drawn. The general rule consists in guaranteeing that the normalized samples histogram approximates the target distribution well enough, which implies that the meaning of "well enough" must also be specified (GILKS; RICHARDSON; SPIEGELHALTER, 1996a). The most obvious, but erroneous, way one could think of to decide when MCMC should stop sampling is directly comparing the normalized samples histogram with the normalized target posterior, calculated for the sampled states. However, both functions might be very similar at the very beginning of the sampling process (for instance, when only one or two different states have been sampled), diverge later, and converge again with many more samples. A correct and general-purpose procedure corresponds to running several chains in parallel, starting from different states, and comparing their resulting samples histograms: when all histograms look like each other, the sampling process stops (GILKS; RICHARDSON; SPIEGELHALTER, 1996a). However, this approach does not fit the problem treated in this work well, since calculating and evaluating state transitions is computationally expensive. In this work, no theoretically sound criterion is used to determine the number of samples, which is chosen following the intuition that the greater the state space, the more samples are necessary to represent the posterior.

Figure 5.1: Illustration of how the MH algorithm sampling peaked distributions yields poor-mixing chains. (a) Target density P(x), with peaks centered at x = 4 and x = 12. (b) Plot of the samples x(t) drawn by the MH algorithm, starting at x = 4 and proposing new states in a fixed window around the current one. The states belonging to the right-most peak are not sampled, indicating very poor chain mixing. Figures extracted from (RANGANATHAN, 2008).

5.3.1 Improving mixing with the Metropolis-Coupled algorithm

While MCMC techniques build Markov chains with an ensured convergence property, achieving good mixing rates is quite difficult, especially when the target distribution has large peaks and valleys, as depicted in Figure 5.1 (RANGANATHAN, 2008). If the designer knew a priori which states are at the probability peaks, designing proposal distributions that sample at those peaks would produce good mixing chains. However, MCMC techniques are often used as "last resort" general-purpose inference methods, motivated in part by ignorance about the distribution shape (otherwise, it could be possible to sample the distribution directly), so obtaining good mixing chains is a challenging task.

An intuitive way to improve mixing in this case is to attenuate the effect of the peaks and valleys in the target posterior, allowing the sampler to move around its support more freely. This can be done by "heating" the distribution, which consists in defining a new target distribution P_heat(S) ∝ P(S)^{1/T}, where T is a relative temperature. The higher the temperature T, the more attenuated the peaks and valleys are. This process is referred to as simulated tempering, in analogy to the tempering process used in materials science, in which steel is intensely heated and then quickly cooled to change its mechanical properties. Sampling the heated distribution yields better chain mixing.
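The attenuation effect is easy to verify numerically. The sketch below is a standalone illustration (the bimodal density mimics Figure 5.1 but is otherwise arbitrary); it heats a distribution at the temperatures used in Figure 5.2 and measures how much the valley between the two modes is raised:

import numpy as np

x = np.linspace(0, 16, 321)
p = np.exp(-0.5 * ((x - 4) / 0.5) ** 2) + np.exp(-0.5 * ((x - 12) / 0.5) ** 2)
p /= p.sum()

for T in (1.0, 11.0, 21.0):
    heated = p ** (1.0 / T)          # P_heat(S) proportional to P(S)^(1/T)
    heated /= heated.sum()
    # Depth of the valley between the modes, relative to the peaks:
    valley = heated[np.argmin(np.abs(x - 8))] / heated.max()
    print(f"T={T:>4}: valley/peak ratio = {valley:.3f}")

As T grows, the valley/peak ratio approaches 1, so a sampler can cross between the two modes much more easily.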

Figure 5.2: Illustration of how MC³ with heated chains helps improve mixing when peaked distributions are sampled. (a) Target density at different temperatures: T = 1 (black line), representing the original density, the same as in Figure 5.1; T = 11 (blue); and T = 21 (red). (b) Plot of the samples drawn by the MC³ algorithm, taking the same starting state and proposal density as in the example of Figure 5.1. In this case, the sampler quickly alternates between both peaks, indicating good chain mixing. Figures extracted from (RANGANATHAN, 2008).

However, the heated distribution is not the one we want to sample. To take advantage of its fast mixing properties, we can use the Metropolis-Coupled MCMC algorithm (also known as MCMCMC or MC³) (GEYER, 1991; GILKS; ROBERTS, 1996). It works by running a certain number of Markov chains in parallel, occasionally swapping states among them. This algorithm is completely derived from Metropolis-Hastings, and is intended to take advantage of the properties of different chains to sample the target distribution, including, as a particular case, the use of chains resulting from applying the MH algorithm to heated versions of the target distribution. The MC³ algorithm applied to simulated tempering is described in Algorithm 8, and Figure 5.2 illustrates how it works. Note that chains yielded by distributions with very different temperatures rarely manage to exchange their states, so the simulated tempering version of MC³ works better if a handful of chains with small temperature differences is used (GILKS; ROBERTS, 1996). This effect can be understood intuitively: cooler chains tend to concentrate on higher-probability states, while hotter ones move around the state space more freely, so that exchanging states tends to be a bad deal for the cooler chain. If this trend is exacerbated, the swapping move becomes extremely unlikely to happen.


Algorithm 8 The Metropolis-Coupled MCMC algorithm applied to sampling P(C, J|Z)

1. Start Nc chains with a valid state (C0, J0), and set the sample index n ← 0
2. For t ranging from 1 to Nb + Ns, where Ns is the desired number of samples and Nb is a burn-in period, do:
   (a) For each chain i ranging from 1 to Nc do:
       i. Calculate the heating exponent βi = 1/(1 + ΔT·(i − 1)), where ΔT is a fixed temperature increment
       ii. Run one step of the MH algorithm, corresponding to steps (a)-(c) in Algorithm 4, considering the unnormalized target posterior as (P(Z|C, J)P(C, J))^{βi} instead of P(Z|C, J)P(C, J)
   (b) Set t ← t + 1
   (c) For each pair of chains (i, i − 1), starting from i = Nc, swap the states with probability
       $$ \alpha_{swap} = \min\left(1, \frac{\big(P(Z \mid C_t^i, J_t^i)\,P(C_t^i, J_t^i)\big)^{\beta_{i-1}} \big(P(Z \mid C_t^{i-1}, J_t^{i-1})\,P(C_t^{i-1}, J_t^{i-1})\big)^{\beta_i}}{\big(P(Z \mid C_t^i, J_t^i)\,P(C_t^i, J_t^i)\big)^{\beta_i} \big(P(Z \mid C_t^{i-1}, J_t^{i-1})\,P(C_t^{i-1}, J_t^{i-1})\big)^{\beta_{i-1}}}\right) \qquad (5.3) $$
       where (C_t^i, J_t^i) is the state of chain i at time t
   (d) If t > Nb, set the sample index n ← n + 1 and assign the sample (Cn, Jn) ← (C_t^1, J_t^1)
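The swap step 2(c) can be written compactly in log space, where the terms of (5.3) that do not change under the swap cancel out. The sketch below is a generic illustration of ours (states, target, and the toy usage are placeholders; ΔT = 3.33 is borrowed from the experiments of Chapter 6):

import math
import random

def swap_states(chains, log_target, betas):
    """One pass of step 2(c) of Algorithm 8: try to swap each pair of
    adjacent chains, from the hottest pair down to the coldest.
    `chains` is a list of states, `betas[i]` the heating exponent of chain i."""
    for i in range(len(chains) - 1, 0, -1):
        li, lj = log_target(chains[i]), log_target(chains[i - 1])
        # log of the swap acceptance ratio (5.3); unswapped factors cancel
        log_alpha = (betas[i - 1] - betas[i]) * (li - lj)
        if random.random() < math.exp(min(0.0, log_alpha)):
            chains[i], chains[i - 1] = chains[i - 1], chains[i]

# Toy usage: three chains over real-valued states, unnormalized target N(0, 1)
dT = 3.33
betas = [1.0 / (1.0 + dT * i) for i in range(3)]
chains = [0.1, 2.0, 5.0]
swap_states(chains, lambda s: -0.5 * s * s, betas)
print(chains)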

5.3.2 Improving state transition efficiency

To propose and evaluate new samples, the geometric SLAM problem must be solved twice: once using the data association solution and object classes from the current state, and once using the same parameters from the proposed state. The first solution is used to calculate the auxiliary structures necessary to propose a new sample, according to Algorithms 5, 6, and 7, and to evaluate the current state. The second solution is used to calculate the same auxiliary structures, necessary to compute the reverse move probability, and to evaluate the proposed state. However, solving the geometric SLAM problem twice at each state transition is a naive approach. Whenever a new state is accepted, the map and trajectory obtained from the parameters (C, J) of that state, together with the calculated auxiliary structures p, d, and e, can be stored, so that proposing new states requires solving the geometric SLAM problem just once, considering only the proposed object classes and data association solution.

Another efficiency issue involving the solution of the geometric SLAM problem is its processing time. Even using a very efficient approach, solving it may turn out to be time consuming, since many iterations may be required for the non-linear optimization to converge, and calculating the overdetermined linear system equations may require processing very complex models. Besides, since the cost of solving the geometric SLAM problem increases with the number of measurements used, using data consisting of too many object detections makes the sampling process even slower.

In this work, some actions are proposed to reduce the processing time needed to solve the geometric SLAM problem at each state transition. The first one is an attempt to eliminate spurious measurements from the data used to calculate the state evidence. To this end, a very restrictive data association algorithm is run on the raw data, associating similar measurements present in consecutive images, which yields small measurement sets. Then, only measurements belonging to sets having a minimum number of elements are kept; the others are discarded. The rationale behind this choice is that measurements generated by actual object detections are more consistent across consecutive images. The more measurements are present in the raw data, the greater the minimum number of measurements per set can be made, making the process more selective against spurious measurements.
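The pruning rule itself is a one-liner; the sketch below assumes tracks are already available as lists of measurements (the track format and the threshold of three measurements are illustrative choices of ours):

def prune_tracks(tracks, min_length=3):
    """Keep only object tracks with at least `min_length` associated
    measurements; shorter tracks are treated as spurious detections."""
    return [track for track in tracks if len(track) >= min_length]

# Toy example: four tracks, two of which are too short to be trusted
tracks = [[("img1", 1), ("img2", 1), ("img3", 2)],
          [("img5", 1)],
          [("img2", 2), ("img3", 1), ("img4", 1), ("img5", 2)],
          [("img7", 1), ("img8", 1)]]
print(len(prune_tracks(tracks)))  # -> 2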


The measurement sets generated by the preliminary data association step are also used as atomic sets, which cannot be split in the probabilistic data association process. These atomic sets represent object tracks, since the contained measurements are tracked across consecutive images, and work the same way as the feature tracks used by Kaess & Dellaert (2005). Since solving the data association problem now consists in finding subsets of object tracks, which are far fewer than the measurements, pre-grouping the measurements has the advantage of significantly reducing the state space size, increasing the probability of proposing good states and thus improving the resulting mixing and convergence.

The other action to reduce the time taken to solve the geometric SLAM problem consists in avoiding the multiple iterations of the non-linear optimization process. After the burn-in period, the chain is considered to be in a "converged" state, which, in the problem addressed here, belongs to a restricted neighborhood of states. Thus, data association solutions belonging to that neighborhood are similar to one another, and generate maps whose common parameter values change only slightly when compared to each other. To take advantage of this feature, the variable linearization points are simply copied from the current-state SLAM solution to the proposed one, so that the geometric SLAM problem is approximately solved by applying the linearized optimization step described in Section 4.1 only once.

However, when a split or merge move is proposed, the number of objects in the map changes, and not all objects resulting from the current data association solution match the objects resulting from the proposed one. To deal with the different number of objects, we assume that a split or merge move between two non-negligible states either merges two close and similar objects or splits an object into two similar ones. This assumption is also used by the proposal function described in Section 5.2.1. Thus, when a merge move is proposed, the linearization point of the new object takes the value of the linearization point of the first-seen originating object in the current map. The linearization point of an object refers to the linearization values of its geometric variables, namely the position lj and size gj. The originating objects are those whose measurement sets were merged to form the new object, and the first-seen originating object is the one that generated the measurement with the smallest time index i among the measurements that refer to the merged object in the proposed map. On the other hand, if a split move is proposed, the linearization points of both new objects are set equal to the linearization point of the originating object.


As a consequence, just a small part of the linear system equations used to solve the geometric SLAM problem needs to be recalculated at each move, increasing the sampling process efficiency. Actually, when a split move is proposed, no equation coefficients need to be recalculated: the equations referring to the measurements of the originating objects keep the same coefficients, which just need to be displaced in the system matrix A. When a merge move is proposed, however, the equations referring to the object that was eliminated from the map, i.e., the one not used to generate the linearization point of the new object, have to be recalculated.

Because only part of the set of equations must now be recalculated when a new data association solution is proposed, a method for incrementally updating the Cholesky factor R of the linear system was investigated, but with discouraging results. Similarly to the work of Khan, Balch & Dellaert (2006), whenever a move was proposed, Givens rotations were used to update the Cholesky factor after adding new equations to the system, and hyperbolic rotations were used to down-date R after removing old equations from the system (GOLUB; LOAN, 1996). Even though performing Cholesky down-dating by applying hyperbolic rotations is numerically unstable, the worst problem faced when applying Cholesky updating or down-dating was a performance decrease. Although Cholesky updating or down-dating a single equation has asymptotic time complexity of O(n) in the number of variables, these procedures become less efficient than performing a full QR factorization of the sparse matrix, using Householder reflections and column reordering, when the number of equations to be updated and down-dated increases. In the case of this work, because the same objects are detected many times in different images, each proposed move requires many equations to be updated and down-dated in the Cholesky factor, causing the performance degradation. In the problem addressed by Khan, Balch & Dellaert (2005), only a few equations were updated and down-dated at a time, keeping the process efficient.

5.3.3 Driving the chain convergence

When exploring the space of correspondence functions, the Markov chains yielded by applying the MCMC algorithm have slow convergence and slow mixing. Because the chain convergence requires its state to be among those with non-negligible probability, the burn-in period corresponds to the estimated time for the chain to explore the state space, guided by the proposal distribution, and reach a "converged" state. This process can be very slow in sparsely peaked distributions, even when good proposal distributions are used, because of the exploratory nature of MCMC methods.


While determining an initial state is not an important issue in MCMC applications in general (GILKS; RICHARDSON; SPIEGELHALTER, 1996a), difficult cases that yield slow convergence and badly mixing chains can benefit from initially driving the state to a "converged" one, i.e., a state S belonging to the support of P(S), the region where states have non-negligible probability. This can save many state transitions, which is especially worthwhile in this work, where state transitions are computationally expensive and there is little clue about where highly probable states are.

In this work, knowledge about the problem can be used to make the chain converge faster. Since there are two basic moves that explore the space of data association solutions, a highly probable state can be found by going from one extreme state towards the other: either starting with all object tracks split and trying to merge them, or starting with all tracks merged and trying to split them. Here the former option is used to accelerate chain convergence, since the proposal function is more likely to propose good merge moves than good split moves; that is, merge moves are more likely to increase the state probability than split moves. The reason is the lack of a criterion for splitting a certain object into two track sets, while the distance between two objects is an obvious criterion for merging them.

The strategy for driving the chain convergence is implemented using a hill-climbing optimization technique. First, each object track is considered to generate one object. Then, the proposal function described in Section 5.2 is used to propose moves in the state space, with merge moves proposed much more often, and a move is accepted if and only if it leads to a higher-probability state. After a certain number of steps, the regular MCMC sampling procedure starts. However, since greedy optimization alone is not assured to lead to a high-probability state, performing a short burn-in period in MCMC is still a good practice, though less necessary in this case. The overall strategy to improve the efficiency and efficacy of the sampling process is described in Algorithm 9.


Algorithm 9 Strategy proposed to improve the sampling process

1. Find a fundamental data association solution J0 by associating similar measurements of the same object class in consecutive images, grouping them into object tracks
2. Eliminate from the data set the measurements belonging to tracks having fewer than a minimum number of associated measurements
3. Using the state (C0, J0), composed of the fundamental data association solution J0 and the object classes C0 determined from the apparent classes, find the map and trajectory geometric parameters using non-linear optimization
4. Considering the object tracks yielded by the correspondence function J0 as unsplittable, perform hill-climbing optimization in the state space (C, J), proposing new states according to the procedure described in Sections 5.2 and 5.3.2 and emphasizing the proposal of merge moves. Perform non-linear optimization of the geometric map and trajectory parameters after a certain number of hill-climbing steps
5. Start sampling with the MC³ algorithm, proposing new states according to the procedure described in Sections 5.2 and 5.3.2, proposing split or merge moves with the same probability. Consider the first 10% of samples as burn-in, and discard them
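A schematic version of the hill-climbing stage in step 4 is sketched below; the `propose` and `log_posterior` callables stand in for the machinery of Chapters 4 and 5 and are assumptions of this sketch, as is the 0.9 merge probability:

import random

def hill_climb(state, propose, log_posterior, n_steps, p_merge=0.9):
    """Greedy initialization: propose moves (mostly merges) and accept a
    move only if it increases the unnormalized posterior."""
    best = log_posterior(state)
    for _ in range(n_steps):
        move = "merge" if random.random() < p_merge else "split"
        candidate = propose(state, move)
        score = log_posterior(candidate)
        if score > best:              # accept only strict improvements
            state, best = candidate, score
    return state

# Toy usage with stub callables: states are numbers of objects, and the
# posterior simply favors fewer objects (so merging always helps here).
propose = lambda n, move: max(1, n - 1) if move == "merge" else n + 1
log_post = lambda n: -float(n)
print(hill_climb(12, propose, log_post, n_steps=50))  # -> 1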

6 Experimental Results of O-SLAM

This chapter presents and discusses the experimental results of the proposed approach for building O-Maps. The maps were built using different data sets, and three different levels of prior knowledge about the objects were assumed, corresponding to the setups described in Section 3.2.1. The presented approach for O-SLAM was tested using real panoramic images captured by robots carrying omnidirectional cameras. The proposed approach clearly benefits from using panoramic images, since several sights of an object are obtained without the need to actively keep it framed in the image. All experiments were coded and executed in Matlab 2007 running on a personal computer (PC) equipped with a 2.66 GHz Intel Core2 Quad. However, no multithreading technique was used, so only one core was used in the experiments.

6.1 Experimental results of geometric SLAM

The first set of experiments is intended to evaluate the performance of O-SLAM when object classes and identities are known. Such a situation corresponds to setup 1, in which objects are perfectly detected and identified. Although this is rarely the case in practice, object-based SLAM assuming setup 1 corresponds to the geometric SLAM problem, and is the basis for solving all other cases. Two different sets of images captured by robots carrying omnidirectional cameras were used to test this approach.

6.1.1 Cogniron data set

The first set comprises images from a real-world annotated data set, obtained online from the Cogniron project website (ZIVKOVIC et al., 2008). The image sequence was captured by an omnidirectional camera using a hyperbolic mirror.

Figure 6.1: Example of an annotated image from the Cogniron data set. Objects are marked with bounding polygons and are given a label (e.g., livingroom dining table, cabinet, book case, couch, computer desk, TV, armchair).

Figure 6.1 shows an example of an annotated image from the used data set. The objects in the images were manually segmented using bounding polygons, and had their classes listed in an auxiliary file. In the experiments, the objects in the scene are described by cylinders, with their locations represented by the 3D position of the base point, as depicted in Figure 6.2a. In order to obtain measurements compatible with the adopted object representation, the measurement data Z is obtained by taking the bounding slice of each polygon, since bounding slices are the projection of cylinders onto panoramic images. Figure 6.2b illustrates how measurements were taken from the annotated data set.

The acquired data and odometer readings were used to build an object-based map and recover the robot trajectory using the proposed approach for geometric SLAM. The movement and observation models, defined by the functions f(x, y), hu(x, l), and hs(x, l, g), are detailed in Appendix A. Implementing the geometric version of the O-SLAM algorithm also requires the definition of the uncertainty parameters; see Appendix A for how the covariance matrices Qi, Rik, and Wik are calculated from them.

Figure 6.2: Illustration of the measurement data and object model used in the geometric SLAM experiments. (a) Cylindrical model adopted to represent the objects, described by height, diameter (average width), and base point position. (b) Detected object positions and sizes in images. The size is described by the object bounding slice, extracted from the annotated polygon, consisting of the radial length and the angular width. The spots indicate the projection of the object base points, used to describe their apparent positions in images.

Class                    | Prior mean (diameter, height) | Prior std. dev. (diameter, height)
Couch                    | (1.8 m, 1.2 m)                | (1 m, 1 m)
TV                       | (0.5 m, 0.5 m)                | (1 m, 1 m)
Armchair                 | (0.8 m, 1.2 m)                | (1 m, 1 m)
Computer desk            | (1.0 m, 1.5 m)                | (1 m, 1 m)
Cabinet                  | (1.0 m, 1.7 m)                | (1 m, 1 m)
Livingroom dining table  | (1.8 m, 0.8 m)                | (1 m, 1 m)
Book case                | (1.5 m, 2.3 m)                | (1 m, 1 m)
Coat rack                | (0.6 m, 2.0 m)                | (1 m, 1 m)
Wardrobe                 | (0.8 m, 2.2 m)                | (1 m, 1 m)
Kitchen dining table     | (1.8 m, 0.7 m)                | (1 m, 1 m)
Bed                      | (2.0 m, 0.7 m)                | (1 m, 1 m)
Kitchen work desk        | (2.0 m, 1.2 m)                | (1 m, 1 m)
Washing machine          | (0.6 m, 0.8 m)                | (1 m, 1 m)
Refrigerator             | (0.8 m, 2.0 m)                | (1 m, 1 m)

Table 6.1: Classes detected in the Cogniron image set and the assumed prior belief on their instance sizes.

The object classes detected in the data set images and the prior belief on their size parameters are described in Table 6.1. The prior belief about the object sizes corresponds to just a guess based on the available images; to reflect this uncertainty, the variance of the prior parameters is set to a large value.

The obtained results are shown in Figures 6.3 and 6.4. As a basis for comparison, the trajectory obtained with a traditional SLAM algorithm using the laser scans available in the data set was used as reference because of its higher accuracy. This can be done because the proposed technique is not intended to provide more accurate results than other SLAM techniques. Maps produced by O-SLAM must reflect the placement of the objects in the environment, and the obtained trajectory must show how the robot moved around those objects. Metric accuracy can only be obtained by using accurate sensors like a laser rangefinder or feature point detectors in images, which is not the case in this work. Thus, the proposed approach is successful when the robot localization error remains limited. Results on solving the geometric SLAM problem with the proposed approach were presented in a paper (SELVATICI; COSTA; DELLAERT, 2008), though without performing non-linear optimization.

The results show that the proposed approach effectively yields maps that approximate the actual objects in the environment. Figure 6.4a shows that the proposed algorithm improves the recovered trajectory w.r.t. the a priori odometer-based one, while Figure 6.4b shows that the mapped object projections onto the images are close to the corresponding actual apparent objects in most cases.

Figure 6.3: Results of geometric O-SLAM obtained with the Cogniron data set. (a) Perspective view of the obtained O-Map together with the inferred trajectory. (b) Bird's-eye view of the built object model. The black-dotted line represents the inferred trajectory, while the dark-green-dotted line is the trajectory obtained by laser-based SLAM, which we consider as our ground truth. The house blueprint was extracted from the paper of Zivkovic, Booij & Kröse (2007).

Figure 6.4: More results obtained with the Cogniron data set, illustrating the performance of the proposed approach for geometric SLAM. (a) Comparison among the odometer-based, SLAM, and ground-truth trajectories (x and y in meters). The inferred trajectory is closer to the reference, and its error remains limited, whereas the odometry error increases over time. (b) Example of projection of the objects in the map onto the image plane.


However, the adopted object model does not approximate some of the mapped objects very well. There are classes whose instances have flattened shapes, which are not well modeled by cylinders. In these cases, the mapped object width corresponds rather to an "intermediary width", obtained by averaging the object length and depth according to the viewpoints from which the robot detected it. Moreover, as a consequence of this model weakness, the trajectory correction by the inference engine was not good enough to maintain its quality when the robot moved far away from the starting position, causing the degradation of the object positions too.

6.1.2 LPA data set

The proposed algorithm to solve the geometric SLAM problem was also tested using images processed by a simple object detection technique, broadly known in the robotics community. In this experiment, simple colored objects were placed around the environment, and a Pioneer 3-DX robot (see Figure 6.5), carrying an omnidirectional camera system, grabbed panoramic images while exploring the environment. Then, CMVision (BRUCE; BALCH; VELOSO, 2000) was used to find color blobs in the panoramic images, corresponding to object detections. Odometer, sonar, and laser scan data were also recorded. The image set is named the LPA data set because data acquisition took place at the Laboratório de Percepção Avançada (LPA), the Advanced Perception Lab, at the University of São Paulo.

The robot movement during data acquisition was manually driven, using a wireless remote control. Data was acquired at discrete steps: after the robot was commanded to either move or turn a small amount, it stopped and then recorded the sensor outputs. The translation commands made the robot move 20 cm forwards or backwards, while the rotation commands made it turn 10° clockwise or counter-clockwise.

CMVision detects color blobs in images by finding connected regions whose color falls in a certain interval of the YUV space. In the experiments, these intervals were determined by selecting a region of interest (ROI) from images of the objects to be identified and calculating the average and the standard deviation for each channel. The average color defines the interval center, while the interval boundaries are placed three standard deviations away from it.
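The interval construction can be written down directly; the sketch below illustrates the rule as described above, not CMVision's actual configuration code, and the ROI is a hypothetical array standing in for an image crop:

import numpy as np

def color_interval(roi_yuv, n_sigmas=3.0):
    """Given an ROI as an (H, W, 3) array of YUV pixels, return per-channel
    (lower, upper) bounds: mean +/- n_sigmas * standard deviation."""
    pixels = roi_yuv.reshape(-1, 3).astype(float)
    mean, std = pixels.mean(axis=0), pixels.std(axis=0)
    return mean - n_sigmas * std, mean + n_sigmas * std

# Hypothetical ROI: random YUV values standing in for an image crop
roi = np.random.randint(0, 256, size=(20, 30, 3))
lo, hi = color_interval(roi)
print(lo, hi)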

Figure 6.5: Pictures of the robot and of an object present in the LPA image set. (a) The Pioneer 3-DX robot carries an omnidirectional imaging system, consisting of a video camera coupled to a hyperbolic mirror. (b) Example of an object (green box) detected by CMVision in the experiments. Each object class is represented by a different color.

Because a real computer vision algorithm was used to detect objects in images, object identities had to be manually set, and object detection errors had to be manually fixed for this experiment, since it assumes perfect object recognition. For this purpose, a small, equally spaced subset of the grabbed images was used to obtain measurements, discarding the others, and the correct object detections were manually selected in those images. As in the previous experiment, the objects in the scene are described by cylinders, as depicted in Figure 6.2a, and the geometric measurements are described by the parameters of the apparent object bounding slices. Figure 6.6 shows an example image and the detected objects.

The object classes detected in the data set images and the prior belief on their size parameters are described in Table 6.2. Because the objects were actually seen, a good visual estimate of their sizes could be guessed, which is reflected in the prior variances of the size parameters. The correct sizes are also given as a basis for comparison. The obtained results are shown in Figures 6.7 and 6.8. As a basis for comparison, the trajectory obtained with the DP-SLAM algorithm (ELIAZAR; PARR, 2004) using laser scans was used as the reference trajectory.

Figure 6.6: Detected object positions and sizes in images belonging to the LPA data set (orange door, yellow box, green box, Magellan, pink ball, blue ball). Observe that the size of the "orange door" is imprecisely detected.

Class       | Prior mean (diameter, height) | Prior std. dev. (diameter, height) | Correct size (length, width, height)
green box   | (0.6 m, 0.6 m)    | (0.3 m, 0.15 m)   | (0.49 m, 0.47 m, 0.57 m)
yellow box  | (0.3 m, 0.3 m)    | (0.2 m, 0.15 m)   | (0.23 m, 0.14 m, 0.23 m)
pink ball   | (0.25 m, 0.25 m)  | (0.15 m, 0.15 m)  | (0.20 m, 0.20 m, 0.20 m)
blue ball   | (0.25 m, 0.25 m)  | (0.15 m, 0.15 m)  | (0.20 m, 0.20 m, 0.20 m)
orange ball | (0.25 m, 0.25 m)  | (0.15 m, 0.15 m)  | (0.20 m, 0.20 m, 0.20 m)
Magellan    | (0.5 m, 0.3 m)    | (0.2 m, 0.2 m)    | (0.40 m, 0.40 m, 0.25 m)
orange door | (1.0 m, 2.2 m)    | (0.3 m, 0.5 m)    | (1.04 m, 0.00 m, 2.17 m)

Table 6.2: Classes detected in the LPA image set and the assumed prior belief on their instance sizes.

Class       | Inferred size (diameter, height) | Correct size (diameter, height)
green box   | (0.64 m, 0.66 m)  | (0.68 m, 0.57 m)
yellow box  | (0.24 m, 0.25 m)  | (0.26 m, 0.23 m)
pink ball   | (0.24 m, 0.23 m)  | (0.20 m, 0.20 m)
blue ball   | (0.23 m, 0.23 m)  | (0.20 m, 0.20 m)
orange ball | (0.20 m, 0.19 m)  | (0.20 m, 0.20 m)
Magellan    | (0.43 m, 0.23 m)  | (0.40 m, 0.25 m)
orange door | (0.98 m, 2.12 m)  | (1.04 m, 2.17 m)

Table 6.3: Comparison between the object sizes inferred with the proposed geometric SLAM algorithm and the correct ones. Each object belongs to a different class. The correct diameter of hexahedral objects (volumes with six faces), namely the green and yellow boxes and the orange door, was calculated by taking the Euclidean norm of the corresponding length and width values in Table 6.2.

Table 6.3 compares the object sizes obtained with the presented approach for geometric SLAM against the correct sizes, measured with a ruler. As in the experiment of Section 6.1.1, the proposed approach to solve the geometric SLAM problem was effective in generating a map that approximates the positions and sizes of the objects in the environment, and in reducing the localization error. In this experiment, though, the object sizes were reproduced with greater fidelity, since the objects used were better approximated by the cylindrical model than the ones in the Cogniron data set. This can be checked in Figure 6.8b: while the projections of the objects onto the images were sometimes not in the correct place, their sizes matched the objects in the image very well.

A notable exception is the "orange door" object. Since it is a flat object, it is well modeled by a cylinder only if seen by the camera from a frontal point of view. If the object is seen from a skewed point of view, its apparent width becomes smaller, suggesting to the assumed model that it is more distant from the robot than it actually is. This explains why the orange door is distant from the wall in Figure 6.7b, while its ground-truth position must coincide with the wall. The robot trajectory is also affected by the door shift, shifting towards it as well.

Although the proposed approach to solving the geometric SLAM problem does not generate maps as accurate as solutions based on more precise sensors, it is successful in recovering the robot localization in the global frame. Figure 6.7 shows that the obtained object positions and robot trajectory differ significantly from the ground truth, mainly because of the model and measurement uncertainty. For instance, Figure 6.6 shows an example of an imprecise size measurement of an object. Such large imprecision does not occur with point or line detectors in images, such as SIFT feature detectors (LOWE, 1999).

Figure 6.7: Results of geometric SLAM obtained with the hand-adjusted LPA data set. (a) Perspective view of the obtained O-Map together with the inferred trajectory. The object colors are the average ones used to detect them in images. (b) Bird-eye view of the built object models. The black dotted line represents the inferred trajectory, while the green line is the trajectory obtained by laser-based SLAM, considered as the ground truth. The occupancy grid map was generated using DP-SLAM. Note that the green box overlaps most of the laser-detected object.


Figure 6.8: More results obtained with the LPA data set, illustrating the performance of the geometric version of O-SLAM. (a) Comparison between trajectories. The inferred trajectory is closer to the reference, having limited error. (b) Example of projection of the objects in the map onto the image plane.


such as SIFT feature detectors (LOWE, 1999). Even so, the proposed approach improves the robot localization when compared to the odometers, showing that, once the correct data association is provided, it is able to recover the approximate robot localization even when the odometers indicate an inconsistent location.

6.2 Results on probabilistic data association

In this set of experiments, O-Maps were obtained assuming perfect object recognition, i.e., the detected object classes were taken as correct. The LPA data set was used to test the O-SLAM algorithm in the case of unknown object identities. In this case, instead of manually selecting good measurements, as in Section 6.1.2, spurious ones were automatically filtered out using the method proposed in Section 5.3.2. However, detections of objects belonging to classes that are easily confused with another one were removed from the data set, since object detectors are assumed to detect only objects belonging to a specific class. This is the case of the "orange ball" object, which was frequently detected in place of the "orange door". In order to build the map, the proposal function described in Section 5.2.1 was used together with the MC3 algorithm and the strategy presented in Section 5.3 to sample over the space of correspondence functions J. The MC3 algorithm was run with three chains, with a temperature increment of ∆T = 3.33 between them. All admissible data association solutions J, i.e., those that neither merge measurements reporting different object classes nor split object tracks, are considered to have the same prior probability, so that P(J) ∝ 1. The parameters of the proposal distribution are the probability of proposing a merge move, Pmerge = 0.5, implying Psplit = 0.5, and the distance threshold for close objects, used in step 2 of Algorithm 5, set to 0.5m. Since the object classes are determined directly from the detections, no exploration of this space is done. Therefore, the probability of proposing switch class moves is set to zero (Pswitch = 0), and objects belonging to different classes are never merged. This is enforced by setting the penalty factor for e_j, used in step 4 of Algorithm 5, to infinity. As explained in Section 4.4.1, Gaussian prior densities have to be defined for all unknown geometric parameters. While the prior mean for all object positions is assumed to be γ(c_j) = (0, 0, 0)^T, each object class casts a different prior belief on the size parameters.
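As a side note, the chain-coupling step at the heart of the MC3 sampler described above is small enough to sketch. The code below is a minimal illustration under stated assumptions, not the implementation used in the experiments: the state representation, the log_posterior callable, and the temperature ladder T_k = 1 + k∆T (with ∆T = 3.33, following the increment quoted above) are all assumptions. Each chain runs its own MH updates on a tempered target, and adjacent chains periodically propose to exchange states:

```python
import math
import random

def mc3_swap_step(states, temps, log_posterior):
    """One swap attempt between two adjacent tempered chains (MC3).

    states        -- current state of each chain (data association
                     hypotheses J); index 0 is the cold chain
    temps         -- chain temperatures, e.g. [1.0, 4.33, 7.66]
    log_posterior -- assumed callable returning log P(J | data)
    """
    i = random.randrange(len(states) - 1)      # pick an adjacent pair
    li = log_posterior(states[i])
    lj = log_posterior(states[i + 1])
    # Acceptance ratio for exchanging states between the tempered
    # targets p(J)^(1/T_i) and p(J)^(1/T_{i+1})
    log_alpha = (lj - li) * (1.0 / temps[i] - 1.0 / temps[i + 1])
    if log_alpha >= 0.0 or random.random() < math.exp(log_alpha):
        states[i], states[i + 1] = states[i + 1], states[i]
```

Only the samples drawn by the cold chain (states[0]) are kept as posterior samples; the hotter chains exist solely to help the cold one escape local maxima of the data association posterior.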

6.2 Results on probabilistic data association

123

The same prior belief parameters listed in Table 6.2 were used in these experiments, except for the "orange ball" class, whose object detections were removed from the data set.

In order to filter out cluttered detections, similar measurements in consecutive images were first associated into atomic object detection sets. Two measurements in consecutive images, $z_{i,k}$ and $z_{i+1,k}$, were merged together if

$$\begin{bmatrix} |u^r_{i+1,k} - u^r_{i,k}| \\ |u^\phi_{i+1,k} - u^\phi_{i,k}| \\ |s^r_{i+1,k} - s^r_{i,k}| \\ |s^\phi_{i+1,k} - s^\phi_{i,k}| \end{bmatrix} < \begin{bmatrix} 30\ \text{pixels} \\ 20^\circ \\ 0.2\,\min(s^r_{i+1,k},\, s^r_{i,k}) \\ 0.2\,\min(s^\phi_{i+1,k},\, s^\phi_{i,k}) \end{bmatrix}, \qquad (6.1)$$

where $u^r_{i,k}$ and $s^r_{i,k}$ are respectively the measured object radial position and length in the image captured at instant $i$, and $u^\phi_{i,k}$ and $s^\phi_{i,k}$ are respectively the measured object angular position and width. Tracks composed of fewer than two detections were eliminated.

The described configuration was used to process the same images as in the experiment of Section 6.1.2, discarding measurements detected as "orange ball". 100 steps of the hill climbing chain optimization algorithm were run, applying the nonlinear optimization of the geometric parameters every 50 steps, and considering $P_{merge} = 0.9$ and $P_{split} = 0.1$ in this phase. Then, the MC3 algorithm was used to draw 667 samples, discarding the first 70 samples as burn-in.
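To make the pre-association test concrete, a minimal sketch of the check in (6.1) is given below. The measurement container and the units (pixels for the radial quantities, degrees for the angular ones) are assumptions; only the thresholds come from (6.1):

```python
def same_track(z_a, z_b):
    """Test (6.1): decide whether two detections of the same class in
    consecutive images belong to the same atomic object detection set.
    Each measurement is assumed to be a dict holding the apparent
    position (u_r, u_phi) and apparent size (s_r, s_phi)."""
    return (abs(z_b["u_r"] - z_a["u_r"]) < 30.0                    # pixels
            and abs(z_b["u_phi"] - z_a["u_phi"]) < 20.0            # degrees
            and abs(z_b["s_r"] - z_a["s_r"]) < 0.2 * min(z_a["s_r"], z_b["s_r"])
            and abs(z_b["s_phi"] - z_a["s_phi"]) < 0.2 * min(z_a["s_phi"], z_b["s_phi"]))
```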

The obtained map is shown in Figure 6.9, where objects with existence probability π < 0.1 are not shown. The results show that, despite the presence of many spurious objects in the map, the robot trajectory and the geometric parameters of the correct objects did not change significantly compared to the case where the ground truth data association solution was used. The objects marked as spurious with an ellipse in Figure 6.9a are due to spurious measurements that were not successfully removed from the raw data by the proposed filtering procedure. All non-spurious objects were generated correctly. Figure 6.9c shows how the resulting map expresses the multiple hypotheses about the data association solution involving the non-spurious "green box" measurements. Although only one cylinder can be seen in the figure, there are three different "green boxes" represented there. The visible one, generated by the most probable data association solution, has high existence probability. The other two, represented with high transparency in the map, were generated assuming low-probability data association solutions.


Figure 6.9: Results of O-SLAM obtained with the LPA data set with known object classes but unknown data association solution. (a) Perspective view of the obtained O-Map together with the inferred trajectory. The object colors are the average ones used to detect them in images. Spurious objects are circled. (b) Bird-eye view of the built O-Map. The black dotted line represents the inferred trajectory, while the green line is the trajectory obtained by DP-SLAM. (c) Detail of the region around the non-spurious green box in Figure 6.9b, showing three objects of that class (two of them perfectly overlapped). The red triangles indicate the object centers. Transparency indicates object existence probability; two of the objects are so transparent that they cannot even be seen.


Although all object positions are indicated with a red triangle, two of them overlap so perfectly that they appear as a single one in the figure. Intuitively, they represent the same physical object and could be combined in the map, even though their IDs are different. This suggests that a better definition of object identity across different data association solutions would be suitable for handling cases where one physical object appears as several low-probability ones in the map. Spurious object detections correspond to a frequent error mode of the CMVision-based object detectors used in the experiment. Section 4.3.2 proposes a probabilistic model to describe whether the mapped objects are likely to have been generated by spurious measurements or by valid ones, embedding this belief in the object existence probability. Figure 6.10 shows the result of applying this model to the computation of the final map, assuming the parameters of the sigmoidal spuriousness probability function as t0 = (0.4m)² and τ = (0.2m)². Similar results were submitted to a conference (SELVATICI; COSTA; DELLAERT), although, there, the MH algorithm was used, and objects were sharply classified as spurious or non-spurious, which is equivalent to using τ = 0. The results shown in Figure 6.10 indicate that the assumed model to detect spurious objects was effective, although an unexpected object, identified as "green box", still received high existence probability. Figure 6.11 points out a plausible reason: the measurements referring to the undesired object were consistently taken from an actual object in the environment, namely part of a person's leg, even though this object was not intended to be mapped. Finally, to analyze the effect of improving the sampler chain mixing with the MC3 algorithm, the same data set and algorithm parameters were used to sample over the data association space using the MH algorithm. However, for a fairer comparison, three times more samples were drawn, since three chains running in parallel were used by the MC3 algorithm. The sample histograms are presented in Figure 6.12, compared with the respective normalized evidence plots, which, in the ideal case, equal the normalized sample histograms. Indeed, the MC3 algorithm improved the sampler mixing, visiting each state according to the target posterior, while the MH algorithm visited states with negligible probability more frequently.
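Regarding the sigmoidal spuriousness function used above: its exact form is defined in Section 4.3.2, and only its parameters are quoted here, so the sketch below assumes a logistic shape in a variance-like consistency statistic v, with soft threshold t0 and softness τ. The assumed form is consistent with the remark that τ = 0 degenerates to a sharp spurious/non-spurious classification:

```python
import math

def spuriousness(v, t0=0.4 ** 2, tau=0.2 ** 2):
    """Hedged sketch of a sigmoidal spuriousness probability.
    v is assumed to be a variance-like statistic (in m^2) of the mapped
    object; t0 = (0.4 m)^2 and tau = (0.2 m)^2 as in the experiment.
    As tau -> 0 this approaches a hard threshold at t0."""
    return 1.0 / (1.0 + math.exp(-(v - t0) / tau))
```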


Figure 6.10: Top view of the O-Map obtained considering the proposed model of object spuriousness. (a) Full map and trajectory, where transparency indicates the object probability of existence. (b) Plot of the objects with πID > 0.5.


Figure 6.11: Example of projection of the objects in the map of Figure 6.10b onto the image plane. Observe that the leg of the person in the image is detected as a "green box".


Figure 6.12: Comparison between the sample histograms obtained by the MC3 and MH algorithms. (a) Histograms obtained with MC3. (b) Histograms obtained using the MH algorithm processing the same data.


6.3 Results of O-SLAM assuming classes confusion

The same LPA image set was used to test the proposed approach for treating imperfect object detection, considering the case where objects of a certain class might be detected as objects belonging to another class. To show the impact of modeling this error mode in the map, the O-SLAM algorithm was run twice: once assuming perfect object classes, and once allowing for confusion among classes. In this experiment, a better tuning of the CMVision blob detector was performed, and object detections of all classes were considered, including those reporting "orange ball". To eliminate more spurious detections from the data, only object tracks with at least three measurements were considered, removing the measurements not belonging to these tracks. The confusion probability, P(a_{i,k_i} | c_{J(i,k_i)}), is computed from the Mahalanobis distance between the two classes in color space (a sketch of one plausible realization is given after this paragraph). The same parameters of the previous experiments were used. In the case of imperfect classes, switch class moves are also proposed by the proposal function, and the adopted move proposal probabilities were Pmerge = 0.4, Psplit = 0.4 and Pswitch = 0.2. The resulting object maps are depicted in Figures 6.13 and 6.14. When using the model that treats object misclassification, the resulting map shows all high-probability objects with their correct classes, including the orange ball. On the other hand, when perfect object classes are wrongly assumed, the orange ball is mapped twice: once as an "orange ball" object and once as an "orange door" object. This is because the same physical orange ball was sometimes detected by the "orange door" detector, as shown in Figure 6.15. Thus, the proposed approach has the ability to fix wrong object classes by jointly inferring them with the robot trajectory and the geometric parameters of the map. In the maps of Figures 6.13 and 6.14, an apparently new spurious object has been mapped with high existence probability, identified as a "yellow box" at the left of the robot trajectory. This object corresponds, in fact, to a true yellow box that was detected thanks to the new tuning of CMVision. The O-SLAM algorithm was also tested using a long-run image set, also obtained at LPA, in order to test its ability to close large loops in the trajectory. In this experiment, the robot performed a long loop while detecting objects in images. The blob finder algorithm was tuned to detect the different classes listed, together with the prior belief on their size parameters, in Table 6.4.
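The sketch below illustrates one plausible realization of the Mahalanobis-based confusion probability mentioned above. The color model (per-class mean colors and a shared covariance) and the exp(-d²/2) mapping from distance to probability are assumptions made here for illustration, not the thesis's exact formula:

```python
import numpy as np

def confusion_probability(mu_detected, mu_true, color_cov):
    """Sketch: likelihood of reporting class 'detected' for an object of
    class 'true', decreasing with the Mahalanobis distance between the
    two class color means (e.g., in RGB space)."""
    diff = np.asarray(mu_detected, float) - np.asarray(mu_true, float)
    d2 = float(diff @ np.linalg.solve(color_cov, diff))  # squared Mahalanobis distance
    return float(np.exp(-0.5 * d2))
```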


Figure 6.13: O-Map and trajectory obtained using the LPA data set, with the wrong assumption of perfect object classes. The displayed objects have existence probability πID > 0.1. (a) Bird-eye view of the map, also showing the occupancy grid obtained with DP-SLAM. (b) Perspective view of the map, showing overlapping objects corresponding to the same physical one (orange ball) detected as belonging to two different classes ("orange ball" and "orange door").


Figure 6.14: O-Map and trajectory obtained using the LPA data set and allowing for confusion among classes. The displayed objects have existence probability πID > 0.1. (a) Bird-eye view of the map, also showing the occupancy grid obtained with DP-SLAM. (b) Perspective view of the map. Now, the orange ball is correctly mapped.


Figure 6.15: Detected object positions and sizes in images belonging to the LPA data set. Observe that the orange ball is detected twice: once as belonging to the correct class, and once as an "orange door" object.

Figure 6.16 shows examples of images captured by the robot, as well as the detected objects. Figures 6.17 and 6.18 illustrate the results obtained by applying the proposed algorithm to these images. Because a larger data set was used, 2000 samples were now drawn using the MC3 algorithm, discarding 200 as burn-in, and performing 400 steps of hill climbing optimization to drive the chain convergence. In this experiment, the proposed algorithm showed the ability to correct a large error in the trajectory, even though few objects were mapped with low spuriousness probability. Objects outside the room could not be consistently detected because of the low light in the corridor, so all objects placed there were considered spurious, even if they were intended to be mapped. On the other hand, besides mapping the correct objects inside the lab room, some undesired ones appear in the map with high probability. As seen in Figure 6.18b, these objects correspond either to red objects close to the person in the image, or were caused by fragmented detections of the orange door, exemplified in Figure 6.16b. These spurious measurements were consistently taken in this data set, violating the principles assumed by the proposed approach. On the other hand, even if no object was mapped with high existence probability outside the lab room, these (considered) spurious objects helped in the mapping process, serving as landmarks for robot localization. Hence, the robot did not lose


Class          Prior mean            Prior std. dev.
               (diameter, height)    (diameter, height)
green box      (0.6m, 0.6m)          (0.3m, 0.15m)
yellow box     (0.3m, 0.3m)          (0.2m, 0.2m)
pink ball      (0.25m, 0.25m)        (0.1m, 0.1m)
blue ball      (0.25m, 0.25m)        (0.1m, 0.1m)
orange ball    (0.25m, 0.25m)        (0.1m, 0.1m)
red bottle     (0.1m, 0.3m)          (0.05m, 0.1m)
green bucket   (0.3m, 0.4m)          (0.2m, 0.2m)
blue bucket    (0.3m, 0.4m)          (0.2m, 0.2m)
orange bucket  (0.3m, 0.4m)          (0.2m, 0.2m)

Table 6.4: Classes detected in the long-run LPA image set and the assumed prior belief on their instance sizes.


Figure 6.16: Detected object positions and sizes in images belonging to the long-run LPA data set. (a) Image showing only non-spurious detections. (b) Image showing spurious measurements taken from the door and the red objects at the left part of the image, which were not intended to be detected in this experiment.


Figure 6.17: Results of O-SLAM obtained with the long-run LPA data set, showing the objects with πID > 0.5. (a) Perspective view of part of the obtained O-Map, showing the objects detected inside the lab room. The spurious objects are circled. (b) Bird-eye view of the built object models. The black dotted line represents the inferred trajectory, while the green line is the trajectory obtained by laser-based SLAM, considered as the reference. The occupancy grid map was generated using DP-SLAM. Note that no objects are shown outside the room.


Figure 6.18: More results obtained with the long-run LPA data set, illustrating the performance of the proposed algorithm in estimating the correct geometric parameters. (a) Comparison between trajectories. The inferred trajectory is closer to the reference. In this figure, all mapped objects are shown, represented by the red triangles. (b) Examples of projection of the objects with low spuriousness probability onto the image plane. The image at the left emphasizes projections of the correctly mapped objects, while the image at the right emphasizes the projections of the spurious objects registered with low spuriousness probability in the map.


track of its pose outside the lab, and was able to correctly match object detections performed at the beginning of the trajectory with those performed at the end.


7 Conclusion and Future Work

This thesis introduces the O-Map, an object-based map for robot navigation, as well as the O-SLAM algorithm, an off-line SLAM technique to build it from visual input data. The map contains simple geometric models of the objects in the environment, as well as semantic labels describing them and associated probabilities for the actual existence of the objects in the environment. This map structure is novel in the mobile robotics context and, besides being useful for navigation tasks, it can serve as a basis for communication between the robot and human users, where the objects in the map represent the common vocabulary. The proposed algorithm builds a map that successfully approximates the objects in the environment, jointly inferring the semantic and geometric parts using a probabilistic model that tightly couples them, while recovering the robot trajectory precisely, even when the correct solution for the data association is unknown and the object detectors used sometimes generate wrong detections. The claims about O-SLAM made in the Introduction (Section 1.4) were demonstrated in this thesis. First, it was shown that consistent O-Maps are obtained if the correct data association solution and object classes are provided. Moreover, when the object geometries are well modeled by the assumed geometric volume, their estimated geometric parameters are good approximations of the corresponding real-world ones. The approximate robot trajectory is also recovered, improving the estimate of the robot localization with respect to that obtained from the odometers, thus confirming the effectiveness of the proposed approach to the geometric SLAM problem. In fact, the robot trajectory is always recovered, or at least improved significantly when compared to the odometry-based one, even if the correct data association solution and object classes are unknown and the robot performs a large loop in the trajectory. It was also demonstrated that jointly inferring the objects' geometries and classes is the suitable way to build object-based maps when the data association solution is unknown and the object classes are imperfectly known. To make this possible, this work


presented a probabilistic model that tightly connects the robot trajectory, object geometries and classes, and the data association solution, as well as an inference method to estimate these unknowns. It was shown by an experimental example that proceeding this way can generate better maps than considering the data association and object classification problems individually, assuming that the same data and models are used in both cases. The experiments performed in this work also show that MCMC methods are able to successfully produce correct object-based maps by sampling over the space of data association solutions and object classes, advancing the results of related work (DELLAERT, 2001; KAESS; DELLAERT, 2005; RANGANATHAN, 2008). In those works, MCMC is used for inference in the space of data association (or topology) solutions, but here an additional dimension is added to the problem. However, in any case, MCMC must be used in limited-size problems only, because it scales very badly with the number of measurements or tracks to be associated. Although this work proposes strategies to improve MCMC efficiency, also extending the results of related work, doing so only increases the maximum size of the problems that can be solved in reasonable time; it does not solve the bad scaling inherent to inference in the space of data association solutions.

7.1 Contributions

This thesis contributes to the robot mapping literature by presenting a novel map representation of the objects in the environment, as well as the technique to build it. Furthermore, the O-SLAM algorithm is the first one that uses only objects detected in images to build a map and recover the robot trajectory, without the need for any auxiliary sensor or stereo images. Thus, the proposed approach can be used in a broad range of situations, from small to large robots, since it does not rely on heavy and energy-consuming laser or sonar sensors. If a precise map and trajectory are sought, the proposed method can easily be integrated with other probabilistic SLAM approaches that use more precise sensors, even helping them close loops in the trajectory. Besides extending the results of related work that uses MCMC to solve the data association problem in a probabilistic fashion by sampling over the object classes in addition to the correspondence function space, this work presents new strategies to


make the use of MCMC practical in the addressed problem. They include preprocessing the raw data to remove inconsistent measurements and build object tracks, an initial hill climbing optimization phase, and the data-driven proposal distribution. Although some of these strategies were, in part, inspired by other works, new elements are introduced in the present work, especially the hill climbing optimization performed before starting to sample with MCMC and the part of the proposal distribution related to sampling over the object class space. The object tracks produced in the preprocessing phase were inspired by the feature tracks used by Kaess & Dellaert (2005). However, the idea of using the tracks to remove probably spurious measurements is original, as is the strict data association method used to build the tracks. Because it is too strict, the proposed pre-association algorithm also filters out many non-spurious measurements, besides splitting object tracks that could have been merged from the start, affecting the performance of the O-SLAM algorithm. For instance, if an object is consistently detected for several image frames, is occluded for a few frames, and is detected again later, the pre-association algorithm will split the measurements of that object into two tracks, while a better data association algorithm would be able to merge both tracks. The initial hill climbing optimization phase, proposed to find an initial chain state for the MCMC sampler, is an original contribution. Although the definition of an initial state for the chain is not a relevant problem in general (GILKS; RICHARDSON; SPIEGELHALTER, 1996b), it accelerates the sampler convergence when exploring a very large state space with many local maxima. Without a good starting state, the chain might explore the state space for a long time before reaching a "converged" state. Finally, the presented probabilistic model for spurious objects is an original idea that proved to work very well in practice, although it requires the definition of a soft threshold that determines how conservative the approach is, i.e., how likely it is to assign a high spuriousness probability to an object. In practical cases, however, the same threshold can be reused for a given robot and object detection algorithm, since it is strongly connected with the precision required from the map. Thus, if very precise object detectors are used to obtain accurate maps, a low threshold can be defined and used for all environments, provided that the objects of interest are actively sought and enough measurements are taken from them.

7.2 Perspectives

Since this thesis presents the first technique for object-based SLAM, the present work opens many branches in which future research can be undertaken. This section discusses three possible ones: map representation, object recognition, and inference. The present work can be thought of as a starting point for these research branches, and there is still much to do. The representation of objects by O-Maps can be much improved by allowing more generic geometric forms to be represented. For instance, mixing objects with more basic geometric features, like points, lines, and planes, could augment the map representation with walls, corridors, corners, etc. Moreover, the use of more complex geometric shapes would enrich the map representation, allowing better human-robot interaction and possibly providing more realistic place visualization. The object detection technique used in this work presents serious limitations for use in non-controlled environments. However, replacing CMVision with more robust object detectors would not advance the idea behind the proposed approach, though it would produce more impressive results. In turn, investigating methods for object recognition that drop the need for an off-the-shelf algorithm would be very interesting future work in this branch. In this sense, the work of Ranganathan & Dellaert (2007) complements this thesis, and merging the probabilistic models used here with those proposed in that work may be a good way to go. Although much remains to be done about map representation in O-Maps and object recognition in O-SLAM, the most challenging topic to be further developed corresponds to the inference branch. In many works, the exact formulation of the unconstrained data association problem has been replaced by a heuristic one, such as NN or ML, because of its intractability (HÄHNEL et al., 2003). However, inference techniques for factored-form distributions have seen great advances during the past decade (WEISS; FREEMAN, 1999; YEDIDIA; FREEMAN; WEISS, 2000; WAINWRIGHT; JAAKKOLA; WILLSKY, 2003; WAINWRIGHT; JAAKKOLA; WILLSKY, 2005), but have not yet met an extensive application field in the robotics domain. Investigating how these methods can be applied to inference over the space of data association solutions is a very interesting continuation of the present work. Variational Bayes is a promising approach. While MCMC performs approximate inference using the exact formulation of the probability distribution, Variational Bayes methods perform exact or approximate inference in an


approximated formulation of the target distribution. Therefore, the Variational Bayes approach may provide a method for applying recently developed inference techniques to a factorized approximation of the posterior over data association solutions.


References

ALLEN, C. R.; LEGGETT, I. C. 3D scene reconstruction and object recognition for use with autonomously guided vehicles (AGVs). In: IEEE. Proc. IEEE IECON 21st International Conference on Industrial Electronics, Control, and Instrumentation. Orlando, FL, 1995. v. 2, p. 1219–1224.

BERGER, J. O.; PERICCHI, L. R. On the justification of Default and Intrinsic Bayes Factor. West Lafayette, IN, April 1995.

BLANCO, J. L.; FERNANDEZ-MADRIGAL, J. A.; GONZALEZ, J. Toward a unified bayesian approach to hybrid metric–topological slam. IEEE Transactions on Robotics, v. 24, n. 2, p. 259–270, April 2008.

BRUCE, J.; BALCH, T.; VELOSO, M. Fast and inexpensive color image segmentation for interactive robots. In: IEEE COMPUTER SOCIETY. Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2000. Takamatsu, Japan, 2000.

CAPPÉ, O.; MOULINES, E.; RYDÉN, T. Inference in Hidden Markov Models. Heidelberg, Germany: Springer, 2005. (Statistics). ISBN 0387402640.

CASTLE, R. O.; GAWLEY, D. J.; KLEIN, G.; MURRAY, D. W. Towards simultaneous recognition, localization and mapping for hand-held and wearable cameras. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2007. Rome, Italy, 2007. p. 4102–4107.

CRIMINISI, A.; CROSS, G.; BLAKE, A.; KOLMOGOROV, V. Bilayer segmentation of live video. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2006. New York, NY, 2006. v. 1, p. 53–60.

DAVISON, A. J. Real-time simultaneous localisation and mapping with a single camera. In: IEEE COMPUTER SOCIETY. Proc. IEEE International Conference on Computer Vision. Nice, France, 2003. p. 1403–1410.

DELLAERT, F. Monte Carlo EM for Data Association and its Applications in Computer Vision. Tese (Doutorado) — School of Computer Science, Carnegie Mellon, September 2001. Also available as Technical Report CMU-CS-01-153.

DELLAERT, F. Square Root SAM. In: Proceedings of Robotics: Science and Systems I. Cambridge, MA: Roboticsproceedings.org, 2005.

DELLAERT, F.; KAESS, M. Square Root SAM: Simultaneous localization and mapping via square root information smoothing. International Journal of Robotics Research, v. 25, n. 12, Dec 2006.


DELLAERT, F.; SEITZ, S. M.; THORPE, C. E.; THRUN, S. Structure from motion without correspondence. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2000. Hilton Head, SC, 2000. v. 2, p. 557–564.

DISSANAYAKE, M. W. M. G.; NEWMAN, P.; CLARK, S.; DURRANT-WHYTE, H. F.; CSORBA, M. A solution to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, v. 17, n. 3, p. 229–241, June 2001.

DOUCET, A.; FREITAS, N. de; MURPHY, K.; RUSSELL, S. Rao-Blackwellised particle filtering for dynamic bayesian networks. In: Proc. of the 16th Conference on Uncertainty in Artificial Intelligence (UAI). San Francisco, CA: Morgan Kaufmann, 2000. p. 176–183.

DUDEK, G.; JENKIN, M. Computational Principles of Mobile Robotics. Cambridge, United Kingdom: Cambridge University Press, 2000.

DUDEK, G. L. Environment representation using multiple abstraction levels. Proceedings of the IEEE, v. 84, n. 11, p. 1684–1704, Nov. 1996.

EKVALL, S.; KRAGIC, D.; JENSFELT, P. Object detection and mapping for service robot tasks. Robotica, Cambridge University Press, v. 25, p. 175–187, 2007.

ELFES, A. Sonar-based real-world mapping and navigation. IEEE Journal of Robotics and Automation, v. 3, n. 3, p. 249–265, Jun 1987.

ELIAZAR, A. I.; PARR, R. DP-SLAM: Fast, robust simultaneous localization and mapping without predetermined landmarks. In: IJCAI. Proc. of the 18th International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico, 2003. p. 1135–1142.

ELIAZAR, A. I.; PARR, R. DP-SLAM 2.0. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2004. New Orleans, LA, 2004. v. 2, p. 1314–1320.

FISCHLER, M.; BOLLES, R. Random sample consensus: a paradigm for model fitting with application to image analysis and automated cartography. Communications of the Association for Computing Machinery, v. 24, p. 381–395, 1981.

FOLKESSON, J.; CHRISTENSEN, H. Graphical slam - a self-correcting map. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2004. New Orleans, LA, 2004. v. 1, p. 383–390.

FOLKESSON, J.; JENSFELT, P.; CHRISTENSEN, H. I. Graphical slam using vision and the measurement subspace. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2005. Edmonton, Canada, 2005. p. 325–330.

FORSSEN, P. E.; MEGER, D.; LAI, K.; HELMER, S.; LITTLE, J. J.; LOWE, D. G. Informed visual search: Combining attention and object recognition. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2008. Pasadena, CA, 2008. p. 935–942.


FOX, D. Markov Localization: A Probabilistic Framework for Mobile Robot Localization and Navigation. Tese (Doutorado) — Dept. of Computer Science, University of Bonn, Germany, 1998.

FRIEDMAN, S.; FOX, D.; PASULA, H. Voronoi random fields: Extracting the topological structure of indoor environments via place labeling. In: IJCAI. Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI). Hyderabad, India, 2007. p. 2109–2114.

GALINDO, C.; SAFFIOTTI, A.; CORADESCHI, S.; BUSCHKA, P.; FERNÁNDEZ-MADRIGAL, J.; GONZÁLEZ, J. Multi-hierarchical semantic maps for mobile robotics. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2005. Edmonton, Canada, 2005. p. 3492–3497.

GEYER, C. J. Markov chain Monte Carlo maximum likelihood. In: KERAMIDAS, E. M. (Ed.). Proc. of the 23rd Symposium on the Interface. Fairfax Station: Interface Foundation, 1991. (Computing Science and Statistics), p. 156–163.

GILKS, W.; RICHARDSON, S.; SPIEGELHALTER, D. Introducing Markov chain Monte Carlo. In: GILKS, W.; RICHARDSON, S.; SPIEGELHALTER, D. (Ed.). Markov chain Monte Carlo in practice. London, UK: Chapman and Hall, 1996. cap. 1, p. 1–20. ISBN 0 412 05551 1.

GILKS, W.; RICHARDSON, S.; SPIEGELHALTER, D. (Ed.). Markov chain Monte Carlo in practice. London, UK: Chapman and Hall, 1996. ISBN 0 412 05551 1.

GILKS, W.; ROBERTS, G. O. Strategies for improving MCMC. In: GILKS, W.; RICHARDSON, S.; SPIEGELHALTER, D. (Ed.). Markov chain Monte Carlo in practice. London, UK: Chapman and Hall, 1996. cap. 6, p. 89–114. ISBN 0 412 05551 1.

GOLUB, G.; LOAN, C. V. Matrix Computations. Third. Baltimore, MD: Johns Hopkins University Press, 1996.

GUTMANN, J. S.; KONOLIGE, K. Incremental mapping of large cyclic environments. In: Proc. IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA '99. Kyongju, South Korea: IEEE Robotics and Automation Society, 1999. p. 318–325.

HAGEN, S. ten; KRÖSE, B. Trajectory reconstruction for self-localization and map building. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2002. Washington, DC, USA, 2002. v. 2, p. 1796–1801.

HÄHNEL, D.; BURGARD, W.; WEGBREIT, B.; THRUN, S. Towards lazy data association in SLAM. In: Proceedings of the 11th International Symposium of Robotics Research (ISRR'03). Sienna, Italy: Springer, 2003.

HASTINGS, W. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, v. 57, p. 97–109, 1970.


HOETING, J. A.; MADIGAN, D.; RAFTERY, A. E.; VOLINSKY, C. T. Bayesian model averaging: A tutorial. Statistical Science, v. 14, n. 4, p. 382–417, 1999.

HOIEM, D.; EFROS, A. A.; HEBERT, M. Putting objects in perspective. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2006. New York, NY, 2006. v. 2, p. 2137–2144.

IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2006.

IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2003.

IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2004.

IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2005.

IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2007.

IJCAI. Proc. of the 18th International Joint Conference on Artificial Intelligence (IJCAI).

KAESS, M. Incremental Smoothing and Mapping. Tese (Ph.D.) — Georgia Institute of Technology, Atlanta, GA, Dec 2008.

KAESS, M.; DELLAERT, F. A markov chain monte carlo approach to closing the loop in slam. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2005. Barcelona, Spain, 2005. p. 643–648.

KAESS, M.; RANGANATHAN, A.; DELLAERT, F. iSAM: Fast incremental smoothing and mapping with efficient data association. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2007. Rome, Italy, 2007. p. 1670–1677.

KASS, R. E.; WASSERMAN, L. The selection of prior distributions by formal rules. Journal of the American Statistical Association, American Statistical Association, v. 91, n. 435, p. 1343–1370, 1996.

KELLEY, C. T. Iterative Methods for Optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999. (Frontiers in Applied Mathematics, v. 18).

KHAN, Z.; BALCH, T.; DELLAERT, F. Multitarget tracking with split and merged measurements. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2005. San Diego, CA, 2005.

KHAN, Z.; BALCH, T.; DELLAERT, F. MCMC data association and sparse factorization updating for real time multitarget tracking with merged and multiple measurements. IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 28, n. 12, p. 1960–1972, Dec. 2006.


KLEIN, G.; MURRAY, D. Parallel tracking and mapping for small AR workspaces. In: IEEE. Proc. 6th IEEE and ACM International Symposium on Mixed and Augmented Reality ISMAR 2007. Nara, Japan, 2007. p. 225–234.

KOLLAR, T.; ROY, N. Using reinforcement learning to improve exploration trajectories for error minimization. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2006. Orlando, FL, 2006. p. 3338–3343.

KUIPERS, B. Modeling spatial knowledge. Cognitive Science, v. 2, p. 129–153, 1978.

LAFFERTY, J.; MCCALLUM, A.; PEREIRA, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. of the 18th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001. p. 282–289.

LEEN, T. K.; DIETTERICH, T. G.; TRESP, V. (Ed.). Advances in Neural Information Processing Systems 13. Vancouver, Canada: The MIT Press, 2000.

LEONARD, J.; COX, I.; DURRANT-WHYTE, H. Dynamic map building for an autonomous mobile robot. International Journal of Robotics Research, v. 11, n. 4, p. 286–289, 1992.

LEONARD, J.; DURRANT-WHYTE, H.; COX, I. J. Dynamic map building for autonomous mobile robot. In: IEEE. Proc. IEEE International Workshop on Intelligent Robots and Systems '90. 'Towards a New Frontier of Applications' (IROS '90). Ibaraki, Japan, 1990. p. 89–96.

LIMKETKAI, B.; LIAO, L.; FOX, D. Relational object maps for mobile robots. In: IJCAI. Proc. of the 19th International Joint Conference on Artificial Intelligence (IJCAI). Edinburgh, Scotland, 2005. p. 1471–1476.

LIU, Y.; THRUN, S. Results for outdoor-SLAM using sparse extended information filters. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2003. Taipei, Taiwan, 2003. p. 1227–1233.

LOWE, D. G. Object recognition from local scale-invariant features. In: IEEE COMPUTER SOCIETY. Proc. IEEE International Conference on Computer Vision. Kerkyra, Greece, 1999. v. 2, p. 1150–1157.

LU, F.; MILIOS, E. Globally consistent range scan alignment for environment mapping. Autonomous Robots, Springer Netherlands, v. 4, n. 4, p. 333–349, October 1997. ISSN 0929-5593.

METROPOLIS, N.; ROSENBLUTH, A.; ROSENBLUTH, M.; TELLER, A.; TELLER, E. Equation of state calculations by fast computing machines. Journal of Chemical Physics, v. 21, p. 1087–1091, 1953.

MIKOLAJCZYK, K.; LEIBE, B.; SCHIELE, B. Multiple object class detection with a generative model. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2006. New York, NY, 2006. v. 1, p. 26–36.


MINKA, T. Automatic choice of dimensionality for PCA. In: LEEN, T. K.; DIETTERICH, T. G.; TRESP, V. (Ed.). Advances in Neural Information Processing Systems 13. Vancouver, Canada: The MIT Press, 2000. p. 598–604.

MONTEMERLO, M.; THRUN, S.; KOLLER, D.; WEGBREIT, B. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In: Proceedings of the AAAI National Conference on Artificial Intelligence. Edmonton, Canada: AAAI, 2002.

MORAVEC, H. Robot Spatial Perception by Stereoscopic Vision and 3D Evidence Grids. Pittsburg, PA, September 1996.

MURPHY, R. Introduction to AI Robotics. Cambridge, MA: The MIT Press, 2000.

NEIRA, J.; TARDOS, J. Data association in stochastic mapping using the joint compatibility test. IEEE Transactions on Robotics and Automation, v. 17, n. 6, p. 890–897, December 2001.

NI, K.; STEEDLY, D.; DELLAERT, F. Tectonic SAM: Exact, out-of-core, submap-based SLAM. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2007. Rome, Italy, 2007. p. 1678–1685.

NIETO, J. I.; GUIVANT, J. E.; NEBOT, E. M. The hybrid metric maps (HYMMs): a novel map representation for DenseSLAM. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2004. New Orleans, LA, 2004. v. 1, p. 391–396.

PASKIN, M. A. Thin junction tree filters for simultaneous localization and mapping. In: IJCAI. Proc. of the 18th International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico, 2003. p. 1157–1164.

PFISTER, S. T.; ROUMELIOTIS, S. I.; BURDICK, J. W. Weighted line fitting algorithms for mobile robot map building and efficient data representation. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE International Conference on Robotics and Automation (ICRA) 2003. Taipei, Taiwan, 2003. v. 1, p. 1304–1311.

PINZ, A. Object categorization. Foundations and Trends in Computer Graphics and Vision, v. 1, n. 4, 2005.

PORTA, J. M.; KRÖSE, B. J. A. Appearance-based concurrent map building and localization. Journal of Robotics and Autonomous Systems, v. 54, n. 2, p. 159–164, 2006.

PROC. of Robotics: Science and Systems I. Cambridge, MA: Roboticsproceedings.org, 2005.

QUATTONI, A.; COLLINS, M.; DARREL, T. Conditional random fields for object recognition. In: SAUL, L. K.; WEISS, Y.; BOTTOU, L. (Ed.). Advances in Neural Information Processing Systems 17. Vancouver, Canada: The MIT Press, 2004.

RABINER, L. R. A tutorial on hidden markov models and selected application in speech recognition. Proceedings of IEEE, v. 77, n. 2, p. 257–286, February 1989.


RANGANATHAN, A. Probabilistic Topological Maps. Tese (Doutorado) — College of Computing, Georgia Institute of Technology, Atlanta, GA, 2008.

RANGANATHAN, A.; DELLAERT, F. Inference in the space of topological maps: an mcmc-based approach. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2004. Sendai, Japan, 2004. v. 2, p. 1518–1523.

RANGANATHAN, A.; DELLAERT, F. Semantic modeling of places using objects. In: Proceedings of Robotics: Science and Systems III. Atlanta, GA: Roboticsproceedings.org, 2007.

ROBERTS, G. Markov chain concepts related to sampling algorithms. In: GILKS, W.; RICHARDSON, S.; SPIEGELHALTER, D. (Ed.). Markov chain Monte Carlo in practice. London, UK: Chapman and Hall, 1996. cap. 3, p. 45–58. ISBN 0 412 05551 1.

ROTA, G. C. The number of partitions of a set. The American Mathematical Monthly, v. 71, n. 5, p. 498–504, May 1964.

SCHWARZ, G. Estimating the dimension of a model. Annals of Statistics, v. 6, n. 2, p. 461–464, 1978.

SE, S.; LOWE, D.; LITTLE, J. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. International Journal of Robotics Research, v. 21, n. 8, p. 735–758, August 2002.

SELVATICI, A. H. P.; COSTA, A. H. R. Fast loopy belief propagation for topological SAM. In: IEEE ROBOTICS AND AUTOMATION SOCIETY. Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2007. San Diego, CA, 2007. p. 664–669.

SELVATICI, A. H. P.; COSTA, A. H. R. A hybrid adaptive architecture for mobile robots based on reactive behaviours. In: NEDJAH, N.; COELHO, L.; MOURELLE, L. (Ed.). Mobile Robots: The Evolutionary Approach. The Netherlands: Springer Verlag, 2007. cap. 11, p. 161–184.

SELVATICI, A. H. P.; COSTA, A. H. R.; DELLAERT, F. Building object-based maps from visual input. Submitted to Robotics: Science and Systems 2009 (RSS'09).

SELVATICI, A. H. P.; COSTA, A. H. R.; DELLAERT, F. Object-based visual slam: How object identity informs geometry. In: IV Workshop de Visão Computacional. Bauru-SP, Brazil: Faculdade de Ciências da UNESP, 2008. ISBN 978-85-99728-33-8.

SHOTTON, J.; JOHNSON, M.; CIPOLLA, R. Semantic texton forests for image categorization and segmentation. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2008. Anchorage, AK, 2008. p. 1–8.

SMITH, R.; SELF, M.; CHEESEMAN, P. Estimating uncertain spatial relationships in Robotics. In: COX, I.; WILFONG, G. (Ed.). Autonomous Robot Vehicles. The Netherlands: Springer-Verlag, 1990. p. 167–193.


SMITH, R. C.; CHEESEMAN, P. On the representation and estimation of spatial uncertainty. International Journal of Robotics Research, SAGE Publications, v. 5, n. 4, p. 56–68, 1986.

STACHNISS, C.; GRISETTI, G.; BURGARD, W. Information gain-based exploration using rao-blackwellized particle filters. In: Proceedings of Robotics: Science and Systems I. Cambridge, MA: Roboticsproceedings.org, 2005.

SWAIN, M.; BALLARD, D. Indexing via color histograms. In: IEEE COMPUTER SOCIETY. Proc. IEEE International Conference on Computer Vision. Osaka, Japan, 1990. p. 390–393.

THOMAS, A.; FERRAR, V.; LEIBE, B.; TUYTELAARS, T.; SCHIEL, B.; GOOL, L. V. Towards multi-view object class detection. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2006. New York, NY, 2006. v. 2, p. 1589–1596.

THRUN, S. Robotic mapping: A survey. In: LAKEMEYER, G.; NEBEL, B. (Ed.). Exploring Artificial Intelligence in the New Millenium. San Francisco, CA: Morgan Kaufmann, 2002. cap. 1, p. 1–38.

THRUN, S.; BURGARD, W.; FOX, D. Probabilistic Robotics. Cambridge, MA: The MIT Press, 2005.

THRUN, S.; FOX, D.; BURGARD, W. A probabilistic approach to concurrent mapping and localization for mobile robots. Machine Learning, v. 31, p. 29–53, 1998.

THRUN, S.; MARTIN, C.; LIU, Y.; HAHNEL, D.; EMERY-MONTEMERLO, R.; CHAKRABARTI, D.; BURGARD, W. A real-time expectation-maximization algorithm for acquiring multiplanar maps of indoor environments with mobile robots. IEEE Transactions on Robotics and Automation, v. 20, n. 3, p. 433–443, June 2004.


WEISS, Y.; FREEMAN, W. T. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In: SOLLA, S. A.; LEEN, T. K.; MÜLLER, K.-R. (Ed.). Advances in Neural Information Processing Systems 13. Vancouver, Canada: The MIT Press, 1999. p. 673–679.

WILLIAMS, B.; KLEIN, G.; REID, I. Real-time SLAM relocalisation. In: Proc. IEEE International Conference on Computer Vision 2007. Rio de Janeiro, Brazil: IEEE Computer Society, 2007. p. 1–8.

WINN, J.; SHOTTON, J. The layout consistent random field for recognizing and segmenting partially occluded objects. In: IEEE COMPUTER SOCIETY. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2006. New York, NY, 2006. v. 1, p. 37–44.

YEDIDIA, J.; FREEMAN, W.; WEISS, Y. Generalized belief propagation. In: LEEN, T. K.; DIETTERICH, T. G.; TRESP, V. (Ed.). Advances in Neural Information Processing Systems 13. Vancouver, Canada: The MIT Press, 2000. p. 689–695.

ZHANG, Z. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, Kluwer Academic Publishers, Hingham, MA, USA, v. 13, n. 2, p. 119–152, 1994. ISSN 0920-5691.

ZIVKOVIC, Z.; BOOIJ, O. How did we built our hyperbolic mirror omnidirectional camera — practical issues and basic geometry. The Netherlands, 2005. Tech report number IAS-UVA-05-04.

ZIVKOVIC, Z.; BOOIJ, O.; KRÖSE, B. From images to rooms. Journal of Robotics and Autonomous Systems, North-Holland Publishing Co., Amsterdam, The Netherlands, v. 55, n. 5, p. 411–418, 2007. ISSN 0921-8890.

ZIVKOVIC, Z.; BOOIJ, O.; KRÖSE, B.; TOPP, E. A.; CHRISTENSEN, H. I. From sensors to human spatial concepts: An annotated data set. IEEE Transactions on Robotics, v. 24, n. 2, p. 501–505, April 2008. Data set url: http://staff.science.uva.nl/~zivkovic/FS2HSC/dataset.html.


Appendix A -- Models used in geometric SLAM

The solution for geometric SLAM adopted in this work depends on the definition of the robot motion and observation models, represented by the functions $f(x_{i-1}, v_i)$, $h_u(x_i, l_j)$, and $h_s(x_i, l_j, g_j)$, as well as the covariance matrices $Q_i$, $R_{i,k}$, and $W_{i,k}$, as defined in Section 4.1. This appendix describes how these models are defined and how to calculate their parameters.

A.1 Motion model

The robot movement model describes how an input control command $v_i$ takes the robot from pose $x_{i-1}$ to pose $x_i$. Here, we assume that this model corresponds to a deterministic function $x_i = f(x_{i-1}, v_i)$ corrupted by white noise with covariance $Q_i$. Supposing the robot moves on the floor plane, the control has the form $v_i = (v_i^r, v_i^\phi)$, where $v_i^r$ is a displacement command and $v_i^\phi$ is a heading command. Following the book of Thrun, Burgard & Fox (2005), the present model assumes that displacement commands can be directly determined from odometer readings.

Let the robot pose be defined as $x \triangleq (x, y, \theta)$, as depicted in Figure A.1. If a constant forward speed $\nu$ and turn ratio $\omega$ are imposed on the robot, the continuous-time equations for the robot pose are

$$\dot{x} = \nu \cos(\theta) \qquad (A.1)$$
$$\dot{y} = \nu \sin(\theta) \qquad (A.2)$$
$$\dot{\theta} = \omega. \qquad (A.3)$$

Assuming that the control command defined by $v_i$ imposes a constant speed and turn ratio from time $i-1$ to $i$, so that $v_i^r = \nu$ and $v_i^\phi = \omega$, the displacement of the robot given


Figure A.1: Illustration of the variables involved in the robot movement model. The circle represents the robot, and the line segment inside it indicates the orientation with respect to the global coordinate system.

by integrating (A.1)-(A.3) over this time period is

$$x_i - x_{i-1} = \frac{v_i^r}{v_i^\phi}\sin(\theta_{i-1} + v_i^\phi) - \frac{v_i^r}{v_i^\phi}\sin(\theta_{i-1}) \approx v_i^r \cos(\theta_{i-1} + v_i^\phi/2) \qquad (A.4)$$
$$y_i - y_{i-1} = -\frac{v_i^r}{v_i^\phi}\cos(\theta_{i-1} + v_i^\phi) + \frac{v_i^r}{v_i^\phi}\cos(\theta_{i-1}) \approx v_i^r \sin(\theta_{i-1} + v_i^\phi/2) \qquad (A.5)$$
$$\theta_i - \theta_{i-1} = v_i^\phi \qquad (A.6)$$

for small enough $|v_i^\phi|$ (the model works well for heading changes below 10 degrees). However, if the robot displacement is negligible, with $v_i^r \approx 0$, any value for the heading turn fits the model well. The uncertainty model of the robot movement assumes that the disparity between the commanded movement and the actual one is roughly proportional to the commanded displacement. The covariance matrix of the movement error is given by $Q_i = Q_i^{add} + Q_i^{mult}\, v_i$, where $Q_i^{add}$ represents a minimum error level, and $Q_i^{mult}$ is a factor that multiplies the command input. For both robots used in the experiments, these parameters are

$$Q_i^{add} = \begin{bmatrix} 5\,\text{mm} & 0 & 0 \\ 0 & 5\,\text{mm} & 0 \\ 0 & 0 & 1^\circ \end{bmatrix} \quad \text{and} \quad Q_i^{mult} = \begin{bmatrix} 0.1 & 0 & 0 \\ 0 & 0.01 & 0 \\ 0 & 0 & 0.15 \end{bmatrix}.$$
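A minimal sketch of the deterministic part of this motion model, using the small-heading-change approximation on the right-hand sides of (A.4)-(A.6), is given below; the noise with covariance $Q_i$ would be added on top of this prediction:

```python
import math

def motion_model(pose, control):
    """Approximate motion model f(x_{i-1}, v_i) from (A.4)-(A.6).
    pose = (x, y, theta); control = (v_r, v_phi), with v_phi small."""
    x, y, theta = pose
    v_r, v_phi = control
    return (x + v_r * math.cos(theta + 0.5 * v_phi),
            y + v_r * math.sin(theta + 0.5 * v_phi),
            theta + v_phi)
```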


Figure A.2: Schematic representation of the omnidirectional camera used in the experiments of Chapter 6, extracted from (ZIVKOVIC; BOOIJ, 2005).

A.2 Observation model

The observation model describes how a measurement $z_{i,k}$ is obtained by the robot at pose $x_i$ when it detects object $o_j$, where $j = J(i,k)$ and $J$ is the assumed correspondence function. Besides the detected object class $a_{i,k}$, each measurement is composed of the apparent object position in the image, $u_{i,k}$, and the size of the object projection on the image plane, $s_{i,k}$. In all experiments described in this work, an omnidirectional camera was used, corresponding to a catadioptric system composed of a video camera and a hyperbolic mirror, as depicted in Figure A.2. Here, we assume that apparent position and size measurements are generated by deterministic functions $u_{i,k} = h_u(x_i, l_{J(i,k)})$ and $s_{i,k} = h_s(x_i, l_{J(i,k)}, g_{J(i,k)})$, corrupted by white noise with covariances $R_{i,k}$ and $W_{i,k}$, respectively. Figure A.3 illustrates the geometric basics of how these measurements are taken. Objects are modeled as cylinders,


and, for the sake of simplicity, measurements are considered as generated by the projection of the object's longitudinal section. Errors generated by this inexact representation are treated by embedding the model uncertainty into the size measurement noise. The apparent object position in the image, $u_{i,k} \triangleq (u^r_{i,k}, u^\phi_{i,k})$, given in polar coordinates, is the direct projection of the center of the cylinder base $l_j \triangleq (x_j, y_j, z_j)$. The apparent object size, $s_{i,k} \triangleq (s^r_{i,k}, s^\phi_{i,k})$, is parametrized by the radial length $s^r_{i,k}$ and the angular width $s^\phi_{i,k}$. The radial length is the projection of the object height $h_j$, and the angular width the projection of the object diameter $d_j$. The measurements are given by

$$u^r_{i,k} = T\!\left(\sqrt{(x_j - x_i)^2 + (y_j - y_i)^2},\; z_j\right) \qquad (A.7)$$
$$u^\phi_{i,k} = \arctan\!\left(\frac{y_j - y_i}{x_j - x_i}\right) - \theta_i \qquad (A.8)$$
$$s^r_{i,k} = T\!\left(\sqrt{(x_j - x_i)^2 + (y_j - y_i)^2},\; z_j + h_j\right) - T\!\left(\sqrt{(x_j - x_i)^2 + (y_j - y_i)^2},\; z_j\right) \qquad (A.9)$$
$$s^\phi_{i,k} = 2\arctan\!\left(\frac{d_j}{2\sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}}\right), \qquad (A.10)$$

where $T(r, z)$ is the distance from the image center, in pixels, resulting from the projection of a point at height $z$ and distance $r$ from the mirror axis, in world coordinates. The uncertainty model of the observations assumes a single parameter $E$, in pixels, that reflects the uncertainty about a point projection in the image. The measurement covariance matrices are given by

$$R_{i,k} = \begin{bmatrix} E^2 & 0 \\ 0 & \left(E / u^r_{i,k}\right)^2 \end{bmatrix} \quad \text{and} \quad W_{i,k} = 2\, R_{i,k}. \qquad (A.11)$$

In all experiments, the uncertainty in the image plane used was $E = 30$ pixels. The projection of a point in the world onto the image plane is explained by Zivkovic & Booij (2005), and depicted in Figure A.3. Here, the catadioptric system is assumed to use a hyperbolic mirror with circular cross section and hyperbolic longitudinal section, described by the equation

$$\left(\frac{z - z_m + e}{a}\right)^2 - \left(\frac{r}{b}\right)^2 = 1, \qquad (A.12)$$

where $r^2 = x^2 + y^2$ gives the radial distance from the mirror axis, and $a$ and $b$ are mirror parameters denoting the semi-major and semi-minor axes of the hyperbola, respectively. The mirror eccentricity is $e = \sqrt{a^2 + b^2}$, and $z_m$ is the height of the mirror focal point. For a point in the world located at height $z$ and distance $r$ from the mirror axis, its



Figure A.3: Schematic representation of how a point in the world is projected in the image plane.

projection distance from the image center is given by

$$T(r, z) = \frac{f}{c}\,\frac{\lambda(r,z)\, r}{\lambda(r,z)(z - z_m) + 2e},$$

where $f$ is the camera lens focal distance, $c$ is the pixel size, and

$$\lambda(r, z) = \frac{b^2\left(e(z - z_m) - a\sqrt{(z - z_m)^2 + r^2}\right)}{b^2 (z - z_m)^2 - a^2 r^2}.$$

The parameters of the camera and mirror used in this work are:

• Cogniron data set: a = 42.09mm, b = 25.09mm, zm = 1.0615m, and f/c = 1715 pixels
• LPA data set: a = 39.29mm, b = 19.65mm, zm = 0.76m, and f/c = 1800 pixels
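For reference, T(r, z) and λ(r, z) translate directly into code. The defaults below are the LPA parameters listed above, converted to meters; this is a sketch, and the thesis's own implementation may differ in details:

```python
import math

def T(r, z, a=0.03929, b=0.01965, z_m=0.76, f_over_c=1800.0):
    """Projection distance from the image center, in pixels, of a world
    point at radial distance r and height z (both in meters)."""
    e = math.sqrt(a * a + b * b)                  # mirror eccentricity
    dz = z - z_m
    lam = (b * b * (e * dz - a * math.sqrt(dz * dz + r * r))
           / (b * b * dz * dz - a * a * r * r))   # lambda(r, z)
    return f_over_c * lam * r / (lam * dz + 2.0 * e)
```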
