Camera Based Hand and Body Driven Interaction with Very Large Displays

Tom Cassey, Tommy Chheng, Jeffery Lien
{tcassey, tcchheng, jwlien}@ucsd.edu

June 15, 2007

Abstract

We introduce a system that tracks a user's hand and head in 3D and in real time for use with a large tiled display. The system uses overhead stereo cameras and does not require the user to wear any equipment. Hands are detected using Haar-like features with an Adaboost classifier, and heads are detected using a Hough transform generalized to circles. Three-dimensional positions for the hand and head are obtained by correspondence matching between the paired cameras. Finally, a 3D vector is extrapolated from the centroid of the head to the hand; the projection of this vector onto the large tiled display gives the pointing location.

Contents

1 Introduction
2 Hardware Setup
  2.1 Stereo Cameras
  2.2 IR Filters
3 Camera Calibration
  3.1 Why is Calibration Needed?
  3.2 Extrinsic Parameters
    3.2.1 Homogeneous Coordinates
    3.2.2 Homogeneous Extrinsic Parameters
  3.3 Intrinsic Parameters
  3.4 Rectification
  3.5 Calibration Implementation
4 Hand Detection
  4.1 Cascaded Adaboost Classifiers
    4.1.1 Adaboost
    4.1.2 Cascaded Classifiers
  4.2 Training Data and Results
5 Head Detection
  5.1 Edge Detection [?]
    5.1.1 Image Gradient
    5.1.2 Canny Edge Detector [?]
    5.1.3 Hough Transform
  5.2 Head Detection Results
6 Tracking
  6.1 Tokens
  6.2 Token Updates
  6.3 Object Selection
7 Stereopsis
  7.1 Overview
  7.2 Epipolar Geometry
  7.3 Correspondence Problem
  7.4 Range Calculations
  7.5 Range Resolution
8 Vector Extrapolation
9 Conclusion and Future Work

Chapter 1

Introduction

Today, scientific computing generates large data sets and high resolution imagery. A biological sample under an electron microscope can easily produce gigabytes of high resolution 3D images, and advances in remote sensing have made large volumes of high resolution geospatial imagery widely available. These factors have created the need to visualize such data. Many research groups have built large tiled displays, but human-computer interaction with a large tiled display remains an open problem: navigating the data with a traditional mouse is impractical. The NCMIR group at Calit2, located at the University of California, San Diego, has built an alternative human-computer interaction system using a hand controller and head gear, but requiring such equipment places a restrictive burden on the user. Our objective is to create a hand tracking system, driven by overhead stereo cameras, to control a large display. Our system is vision-based and does not require any external sensors on the user. A few constraints ease our objective: the cameras are fixed overhead, which reduces background noise and aids hand detection. We developed a hand and arm tracking system called HATS. An overview of our system is depicted in Figure 1.1.


Figure 1.1: System Overview


Chapter 2

Hardware Setup

We place our camera system above the user. Other possible configurations would have been placing both cameras in front of the user, or one camera above and one in front. We selected the overhead configuration for its better performance and ease of use: because the cameras point at the ground there is very little background noise. Had we placed the cameras in front of the user, we would have had to deal with a noisy background, e.g. people walking by.

2.1 Stereo Cameras

Two Point Grey Dragon Fly cameras, shown in Figure 2.1, were used to form our stereo pair. The cameras run at a frame rate of 120 fps with a resolution of 640x480 pixels, each having a 6 mm focal length and a pixel width of 0.0074 mm. The cameras are mounted 364.57 mm apart and are slightly verged inward in order to have the largest shared field of view. Each camera is verged in by 0.0392 radians, so the two cameras are focused on the same point in space about 4.6 m below them. This configuration was chosen because it provides good range resolution while also having a large shared field of view. The choice of these parameters is discussed further in the Stereopsis chapter.
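As a quick sanity check (not part of the original report), the quoted baseline and vergence angle are consistent with the stated convergence distance of about 4.6 m; a minimal sketch:

```python
# Sketch (not from the report): verify that a 364.57 mm baseline and a 0.0392 rad
# vergence per camera put the crossing point of the optical axes near 4.6 m.
import math

baseline_mm = 364.57   # distance between the two cameras
verge_rad = 0.0392     # inward rotation of each camera

# Each optical axis crosses the midline at roughly (b / 2) / tan(theta).
convergence_m = (baseline_mm / 2.0) / math.tan(verge_rad) / 1000.0
print(f"optical axes cross about {convergence_m:.2f} m below the cameras")  # ~4.65 m
```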

2.2 IR Filters

Infrared filters and illuminators were used in our setup to provide enhanced contrast between the user and the background; skin in particular shows up with high intensity. Additionally, using infrared illumination nullifies variation in ambient lighting and provides proper illumination for the cameras without changing the lighting perceived by the user, so the setup can function properly even in a totally dark room. The IR filters block visible light up to 650 nm and transmit 90% of light in the 730 nm to 2000 nm range (near IR). One thing to note is that incandescent bulbs (and perhaps some other lamp types) and the sun also emit light in this range and can overexpose the cameras if not carefully controlled.


Figure 2.1: Two Point Grey Research Dragon Fly cameras are used to form a stereo pair. The IR filters enhance the illumination of skin tones.


Chapter 3

Camera Calibration

3.1 Why is Calibration Needed?

The goal of camera calibration is to determine a transformation between the two independent pixel coordinate systems of the two cameras and the real world coordinate system. Once the real world coordinate of each pixel can be determined, it is possible to perform stereopsis on the images. Stereopsis allows the depth of each pixel, and thus the 3D coordinate of each point, to be calculated. The stereopsis process is described in more detail in Chapter 7 of this report.

3.2 Extrinsic Parameters

The extrinsic camera parameters describe the position and orientation of the camera with respect to the world coordinate system. In the stereo system we are working with, the world coordinate system is taken to be the coordinate system of the left camera. The extrinsic parameters therefore describe the transformation between the right camera's coordinate system and the left camera's coordinate system.

Figure 3.1: Two Coordinate Systems

Figure 3.1 above shows two different coordinate systems. The transformation between the two coordinate systems in the diagram is an affine transformation, which consists of a rotation followed by a translation.

y = Rx + t  (3.1)

The rotation component of the transformation is given by

R = \begin{pmatrix} i_A \cdot i_B & i_A \cdot j_B & i_A \cdot k_B \\ j_A \cdot i_B & j_A \cdot j_B & j_A \cdot k_B \\ k_A \cdot i_B & k_A \cdot j_B & k_A \cdot k_B \end{pmatrix}  (3.2)

The translation component of the transformation is given by

T = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}  (3.3)

Under the Euclidean coordinate system, an affine transformation has to be expressed as two separate operations, as shown in Equation 3.1. However, if the image coordinates are first converted to the homogeneous coordinate system, then the transformation can be expressed as a single matrix multiplication.

3.2.1 Homogeneous Coordinates

The homogeneous coordinate system adds an extra dimension to the Euclidean coordinate system, where the extra coordinate is a normalising factor. Equation 3.4 shows the transformation from the Euclidean coordinate system to the homogeneous coordinate system.

(x, y, z) ⇒ (x, y, z, w) = (x, y, z, 1)  (3.4)

Equation 3.5 shows the transformation from the homogeneous coordinate system back to the Euclidean coordinate system.

(x, y, z, w) ⇒ (x/w, y/w, z/w)  (3.5)

3.2.2 Homogeneous Extrinsic Parameters

M_{ext} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix}  (3.6)
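As an illustration (a sketch, not the report's code), the affine transform of Equation 3.1 becomes a single multiplication once points are written in homogeneous coordinates and the extrinsic parameters are packed into the 4x4 matrix of Equation 3.6; the rotation and translation values below are placeholders:

```python
# Minimal sketch: y = Rx + t (Eq. 3.1) expressed as one homogeneous multiply (Eq. 3.6).
import numpy as np

def make_extrinsic(R, t):
    """Pack a 3x3 rotation R and a 3-vector translation t into the 4x4 matrix M_ext."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M

R = np.eye(3)                        # placeholder rotation (identity)
t = np.array([0.36457, 0.0, 0.0])    # placeholder translation, e.g. along the baseline (m)

x = np.array([1.0, 2.0, 3.0])        # a point in the right camera's frame
x_h = np.append(x, 1.0)              # Euclidean -> homogeneous (Eq. 3.4)
y_h = make_extrinsic(R, t) @ x_h     # single matrix multiplication
y = y_h[:3] / y_h[3]                 # homogeneous -> Euclidean (Eq. 3.5)
```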

3.3 Intrinsic Parameters

The intrinsic parameters describe the transformation from camera coordinates to pixel coordinates in the image. This transformation depends on the optical, geometric and digital parameters of the camera capturing a given image. The following equations describe the transformation from the camera coordinate system to pixels in the image.

x_{im} = -\frac{f}{s_x} x + o_x  (3.7)

y_{im} = -\frac{f}{s_y} y + o_y  (3.8)

where f is the focal length of the camera, (o_x, o_y) is the pixel coordinate of the image centre (the principal point), s_x is the width of the pixels in millimetres, and s_y is the height of the pixels in millimetres. Written in matrix form, Equations 3.7 and 3.8 yield the following matrix, which is known as the intrinsic matrix.

M_{int} = \begin{pmatrix} -f/s_x & 0 & o_x \\ 0 & -f/s_y & o_y \\ 0 & 0 & 1 \end{pmatrix}  (3.9)
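For illustration only, the intrinsic matrix of Equation 3.9 can be applied to a camera-frame point as sketched below; the principal point is an assumed placeholder, and dividing by the third (depth) component gives the perspective projection to pixels:

```python
# Sketch: project a camera-frame point to pixel coordinates with M_int (Eq. 3.9).
# f and pixel size are the report's values; the image centre (ox, oy) is assumed.
import numpy as np

f, sx, sy = 6.0, 0.0074, 0.0074     # focal length and pixel size, in mm
ox, oy = 320.0, 240.0               # assumed principal point for a 640x480 image

M_int = np.array([[-f / sx, 0.0,      ox],
                  [0.0,     -f / sy,  oy],
                  [0.0,      0.0,     1.0]])

X_cam = np.array([0.10, 0.05, 4.6])    # a point 4.6 m below the camera (metres)
p = M_int @ X_cam
x_im, y_im = p[0] / p[2], p[1] / p[2]  # divide by depth to obtain pixel coordinates
```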

3.4 Rectification

Image rectification is a process that transforms a pair of images so that they share a common image plane. Once this process has been performed on the stereo image pair captured by the overhead cameras, the stereopsis process reduces to a triangulation problem, which is discussed in more detail in Chapter 7.


The image rectification process makes use of the extrinsic and intrinsic parameters of the two cameras to define the transformation that maps the right camera image onto the same image plane as the left camera image. Whilst this process is required to simplify the stereopsis process, it does result in the rectified objects being warped and disfigured. Rectification is performed by combining the extrinsic and intrinsic parameters of the system into one matrix, known as the fundamental matrix. This matrix provides the linear transformation between the original images and the rectified images that can then be used to perform stereopsis. The next section of this chapter discusses the process used to determine the parameters of this matrix.

3.5 Calibration Implementation

Calibration was performed with the SVS Videre tool. Several images of a checkerboard pattern were captured at different angles, from which the tool estimates the calibration parameters and outputs rectified image pairs that correct for both the vergence of the cameras and the distortion of the lenses (see Section 7.2).


Chapter 4

Hand Detection

Using Haar-like features for object detection has several advantages over other features or direct image correlation. Simple image features such as edges, color or contours work for basic object detection, but they tend to break down as the object becomes more complex. Papageorgiou [?] used Haar-like features as the representation of an image for object detection. Since then, Viola and Jones, and later Lienhart, added an extended set of Haar-like features. They are shown in Figure 4.1.

Figure 4.1: Haar-like features and an extended set.

For each Haar-like feature, the value is determined by the difference between the two differently colored areas. For example, in Figure 4.1(a), the value is the sum of all the pixels in the black rectangle minus the sum of all the pixels in the white rectangle. Viola and Jones introduced the "integral image" to allow fast computation of these Haar-like features. The integral image at any (x, y) is the sum of all pixels above and to the left, as described by Equation 4.1:

ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')  (4.1)

where ii(x, y) is the integral image and i(x, y) is the original image. The integral image can be computed in one pass with the recurrence relations:

s(x, y) = s(x, y - 1) + i(x, y)  (4.2)

ii(x, y) = ii(x - 1, y) + s(x, y)  (4.3)

where s(x, y) is the cumulative row sum, s(x, -1) = 0 and ii(-1, y) = 0. With the integral image, any of the Haar-like features can be computed quickly. As shown in Figure 4.2, the vertical rectangle edge feature in Figure 4.2(a) is simply 4 - (1 + 3).

Figure 4.2: Haar-like features can be computed quickly using an Integral Image.
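The following sketch (not the report's code) shows the idea behind Equations 4.1-4.3: an integral image lets any rectangle sum, and therefore any two-rectangle Haar feature, be evaluated with a handful of lookups:

```python
# Sketch of the integral image and a two-rectangle Haar-like edge feature.
import numpy as np

def integral_image(img):
    """ii(x, y): sum of all pixels above and to the left of (x, y), inclusive (Eq. 4.1)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of img[y0:y1+1, x0:x1+1] using four integral-image lookups."""
    total = ii[y1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

img = np.random.randint(0, 256, (24, 24))
ii = integral_image(img)
# Vertical edge feature over a 24x24 window: left half minus right half.
feature = rect_sum(ii, 0, 0, 11, 23) - rect_sum(ii, 12, 0, 23, 23)
```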

4.1 Cascaded Adaboost Classifiers

4.1.1 Adaboost

The Adaboost algorithm [?] combines simple weak classifiers into a strong one. Each simple classifier is weak because it may classify only slightly better than chance, perhaps 51% of the training data, correctly. The final classifier becomes strong because the training process weights the weak classifiers accordingly.

4.1.2 Cascaded Classifiers

Viola and Jones [?] introduced an algorithm to construct a cascade of Adaboost classifiers, as depicted in Figure 4.3. The end result gives increased detection rates and reduced computation time.

Figure 4.3: A cascaded classifier allows early rejection to increase speed.

Each Adaboost classifier can reject the search window early on. The succeeding classifiers face the more difficult task of distinguishing the remaining features. Because the early classifiers reject the majority of search windows, a larger portion of the computational power can be focused on the later classifiers. The training algorithm can target how much each stage is expected to find and eliminate; the trade-off is that a higher expected hit rate gives more false positives, while a lower expected hit rate gives fewer false positives. An empirical study can be found in the work of Lienhart et al. [?].


4.2 Training Data and Results

Below are the different training data sets and qualitative notes on how each performed in real-time video capture.

Produced XML cascade file    Image dim  Training size  nstages  nsplits  minhitrate  maxfalsealarm
hand classifier 6.xml        20x20      357            15       2        .99         .03
hand pointing 2 10.xml       20x20      400            15       2        .99         .03
wrists 40.xml                40x40      400            15       2        .99         .03
hand classifier 8.xml        20x20      400            15       2        .99         .03

hand classifier 6.xml: high hit rate but also high false positives.
hand pointing 2 10.xml: low positives but low false positives.
wrists 40.xml: can detect arms but the bounding box is too big.
hand classifier 8.xml: low positives but low false positives.

[show good results and false positives]
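For reference, a hedged sketch of how one of the trained cascades could be run on a live frame with OpenCV; the file name corresponds to the first cascade in the table (underscores assumed), and the detection parameters are placeholders rather than the values used in the project:

```python
# Sketch: run a trained Haar cascade over a frame and draw candidate hand regions.
import cv2

cascade = cv2.CascadeClassifier("hand_classifier_6.xml")   # assumed file name
frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# scaleFactor / minNeighbors trade hit rate against false positives, mirroring the
# minhitrate / maxfalsealarm trade-off used during training.
hands = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3, minSize=(20, 20))
for (x, y, w, h) in hands:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```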

Chapter 5

Head Detection

The head detection module relies on the shape of the head being close to a circle. Generally speaking, the head is the most circular object in the scene when a user is interacting with the tiled displays. To exploit this geometric property, a circle detector is applied to the set of edge points of the input image. These edge points trace the skeleton or outline of objects in the image, and thus can be used to match geometric shapes to the objects.

Figure 5.1: Head Detection Flowchart

The flowchart above shows the stages involved in detecting regions that are likely to contain a head; these stages are discussed in more detail in the following sections.

5.1 Edge Detection [?]

Boundaries between objects in images are generally marked by a sharp change in image intensity at the location of the boundary; the process of detecting these changes is known as edge detection. Boundaries are not, however, the only cause of sharp intensity changes. Sharp intensity changes (edges) are caused by:

• discontinuities in depth
• discontinuities in surface orientation
• changes in material properties
• variations in scene illumination

The first three causes occur (though not exclusively) at boundaries between two objects. The fourth, variation in scene illumination, is generally caused by objects occluding the light source, which casts shadows over regions of the image. Because edges are marked by a sharp change in image intensity, the gradient at a point where an edge occurs has a large magnitude.

5.1.1 Image Gradient

Since the images are 2-dimensional, the gradient with respect to the x and y directions must be computed separately. To speed up the computation of the gradient at each point, these two operations are implemented as kernels which are convolved with the image to extract the x and y components of the gradient. The Sobel kernel representations of the two discrete gradient operators are shown below.

G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}  (5.1)

Given the x and y components of the gradient at a point, the gradient magnitude at that point can be calculated as

\|G\| = \sqrt{G_x^2 + G_y^2}  (5.2)

The gradient direction at each point is also required for certain edge detection algorithms such as the Canny edge detector. The direction of the gradient at each point is given by

\alpha = \arctan\left(\frac{G_y}{G_x}\right)  (5.3)
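A minimal sketch of Equations 5.1-5.3 using OpenCV's Sobel operator (an illustration, not the project's implementation):

```python
# Sketch: Sobel gradients, gradient magnitude (Eq. 5.2) and direction (Eq. 5.3).
import cv2
import numpy as np

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # convolution with G_x
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # convolution with G_y

magnitude = np.sqrt(gx ** 2 + gy ** 2)
direction = np.arctan2(gy, gx)                    # atan2 avoids division by zero
```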

5.1.2 Canny Edge Detector [?]

The Canny edge detector provides a robust method for locating edge points within an image based on the image gradient. The Canny detector also applies additional processing to the calculated gradient image in order to improve the locality of the detected edges. The detector is broken down into multiple stages, which are outlined in the flowchart below.

Figure 5.2: Canny Edge Detector Flowchart

The previous section outlined how to calculate the gradient at each point, and how to use this information to calculate the magnitude and direction of the gradient at each point in the image. However, applying this operation to the raw image generally results in many spurious edges being detected, and hence an inaccurate or unusable edge image. These spurious edges are caused by noise introduced when the image is captured. To reduce the effect of this noise, the image must first be blurred, which involves replacing the value of each pixel with a weighted average calculated over a small window around the pixel. This procedure has the unwanted side effect of smearing edges within the image, which makes them harder to detect.

Oriented non-maximal suppression

The additional processing provided by the Canny edge detector is known as oriented non-maximal suppression. This process uses the previously calculated gradient magnitude and direction to ensure that the edge detector responds only once for each edge in the image, and that the locality of the detected edge is accurate. This stage of the algorithm loops over all the detected edge pixels in the image and looks at the pixels that lie on either side of the pixel in the direction of the gradient at that point; if the current point is a local maximum it is classified as an edge point, otherwise it is discarded.
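A short sketch of this pipeline (blur, then edge detection); the kernel size and thresholds are assumed values, and OpenCV's Canny performs the non-maximal suppression and thresholding internally:

```python
# Sketch: suppress sensor noise with a Gaussian blur, then extract Canny edges.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)        # assumed kernel size and sigma
edges = cv2.Canny(blurred, 50, 150)                  # assumed hysteresis thresholds
```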


[Figure: (a) Edge image with no blurring, (b) Edge image with prior blur]

Once the pixels that lie on edges have been found, they need to be processed in order to find the circles that best fit the edge information. Since the features (edge points) computed by the edge detector come from the whole image and are not localised to the edges that correspond to the head, the circle matching process must be resilient to outliers (features that do not conform to the circle being matched) in the input data. The method employed for detecting circles within the camera frames is a generalised Hough transform. The Hough transform is a technique for matching a particular geometric shape to a set of features found within an image; the equation that the input points are matched to is shown below.

r^2 = (x - u_x)^2 + (y - u_y)^2  (5.4)

5.1.3 Hough Transform

The Hough transform was originally developed for detecting lines within images; however, the method can be extended to detect any shape, and in this case it has been used to detect circles. From the equation of a circle and the diagram below, it can be seen that three parameters are required to describe a circle: the radius and the two components of the centre point. The Hough transform uses an accumulator array (or search space) with one dimension per parameter, so detecting circles requires a 3-dimensional accumulator array.

Figure 5.3: Circle Diagram

The efficiency of the Hough transform for matching a particular shape to a set of points is directly linked to the dimensionality of the search space, and thus to the number of parameters required to describe the shape. To achieve real-time processing, the images are down-sampled by a factor of 4 before the circle detector is applied; because the algorithm is third order, this reduces the number of computations by a factor of 64. The transform assumes that each point in the input set (the set of edge points) lies on the circumference of a circle. The parameters of all possible circles whose circumference passes through the current point are calculated, and for each potential circle the corresponding value in the accumulator array is incremented. Once all the points in the input set have been processed, the accumulator array is checked for elements with enough votes to be considered a circle in the image; the parameters corresponding to these elements are returned as regions of the image that could potentially contain a head.
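A hedged sketch of the circle search using OpenCV's Hough transform; the down-sampling factor of 4 follows the report, while the radius bounds and accumulator threshold are assumptions:

```python
# Sketch: down-sample by 4, run a Hough circle transform, and scale candidates back up.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
small = cv2.resize(gray, None, fx=0.25, fy=0.25)     # factor-of-4 down-sampling

circles = cv2.HoughCircles(small, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=150, param2=25, minRadius=5, maxRadius=40)
candidate_heads = []
if circles is not None:
    # Each circle is (u_x, u_y, r); rescale to full-resolution pixel coordinates.
    candidate_heads = [(4 * x, 4 * y, 4 * r) for x, y, r in circles[0]]
```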


5.2 Head Detection Results

The head detection method described above identifies potential head regions within the input scene. Although the detector performs well in most situations, noise and other circular objects in the scene occasionally cause it to skip a frame or to return multiple head regions. To overcome these limitations, post-processing is applied to the results to interpolate when a frame is missed and to select the most probable head location when multiple head regions are returned. These post-processing techniques are implemented in the object tracking module, which is discussed next.



Chapter 6

Tracking

Ideally, the object detectors employed to find the head and hand regions would be 100% accurate: they would always locate the correct objects within the scene and never misclassify a region. Realistically this is not the case, and the detection methods employed are inevitably prone to errors. The object tracker module supplements the detection modules with additional processing that limits the effect of erroneous detections and skipped frames. It does this by maintaining, using tokens, the state of all objects that have been detected by a particular object detector within a given time interval, and then using this information to determine which detected region is most likely to correspond to the object being tracked.

6.1 Tokens

Within the token tracker, tokens are data structures that store the state information of a particular detected object. The following information is encapsulated within a token:

• location of the object
• size of the object (area in pixels)
• time since the object was last observed (in frames)
• number of times the object has been observed during its lifetime

When an object is detected it is first compared to the list of tokens. If the properties of the object are a close enough match to one of the objects described by the tokens, that token is updated with the new state information; if no match is found, a new token is initialised with the properties of the newly detected object. This process is applied to every object returned by the object detector.

6.2 Token Updates

When a match is made between a detected object and a pre-existing token, the token must be updated to reflect the new state of the object. During this update the count of the number of times the object has been seen is incremented and the number of frames since the last detection is reset to zero. Once all the objects have been processed and the corresponding tokens updated or initialised, the list of tokens is filtered to remove any dead tokens. A dead token is any token corresponding to an object that has not recently been seen, which can happen for three reasons. First, the token could have been initialised for an erroneously detected object; in that case the object is likely to be detected once and then never again. Second, the object that the token was tracking may no longer be present in the scene. Third, the object's properties may have changed too rapidly between detections, so that no match was made with the existing token and a new token was initialised to represent the object. Once all the dead tokens have been filtered out, the age (time since seen) of every remaining token is incremented, and the process repeats for the next set of detected objects.

6.3 Object Selection

At any given moment the object tracker may be tracking multiple objects, some correctly classified and others erroneously detected. For the system to let the user interact with the display, there must be a mechanism to discern which of these objects is the 'correct' one. As mentioned before, the ideal case would be for the object detectors to identify object regions with 100% accuracy. With the object tracker supplementing the detection systems, however, this requirement can be relaxed: for a detector to be admissible it need only meet the following requirements.

• The object must be correctly classified in more frames than erroneous regions are classified.
• The object must be classified in more frames than it is missed.

If these two criteria are met, the token tracking the object will receive more updates than any other token, and so the correct token will have a higher vote (observed count) than all the others. The location of the object can therefore be read from the token with the highest vote count. A sketch of this logic is given below.
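A minimal sketch of the tracker (an assumed structure, not the report's code): detections are matched to tokens, tokens are aged and pruned, and the most frequently observed token is selected.

```python
# Sketch of the token tracker: match, update, prune, age, and select.
from dataclasses import dataclass

@dataclass
class Token:
    x: float
    y: float
    size: float
    frames_since_seen: int = 0
    observations: int = 1

def update_tokens(tokens, detections, match_dist=30.0, max_age=10):
    for (x, y, size) in detections:
        match = next((t for t in tokens
                      if abs(t.x - x) < match_dist and abs(t.y - y) < match_dist), None)
        if match:                                   # close enough: update the token
            match.x, match.y, match.size = x, y, size
            match.frames_since_seen = 0
            match.observations += 1
        else:                                       # otherwise start tracking a new object
            tokens.append(Token(x, y, size))
    tokens[:] = [t for t in tokens if t.frames_since_seen <= max_age]   # drop dead tokens
    for t in tokens:
        t.frames_since_seen += 1                    # age every surviving token
    return tokens

def select_object(tokens):
    """The 'correct' object is taken to be the token with the highest observed count."""
    return max(tokens, key=lambda t: t.observations, default=None)
```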


Chapter 7

Stereopsis

Now that the head and hands have been detected and tracked, obtaining their actual location in space is the last step before the output vector can be calculated. Calculating the depth of the user's hand or head is accomplished by using two cameras mounted on the same horizontal axis a certain distance apart, called the baseline. Much as humans perceive depth with two eyes, the computer can calculate depth from two properly aligned cameras. The method through which this is accomplished is explained in this chapter.

7.1 Overview

Using a stereo pair of cameras gives us two perspectives of the same scene. By comparing these two perspectives, we gain additional information about the scene. Assuming there is no major occlusion, we can find a point in the right image and find that same point in the left image; with these two points we can triangulate the depth of the point in real space. One can imagine that if two cameras are mounted side by side, the left and right images of a far away object will look almost identical. In fact, as the distance of the object from the cameras increases past a certain threshold, there is no discernible difference between the two images. On the other hand, if the object is very close, the two perspectives of the object will be very different. By using this difference between the images, the depth can be extracted. Thus, the goal is to find and quantify this difference.

7.2 Epipolar Geometry

An easy way to quantify this difference is to compare the locations at which a point of the object is projected onto the two images. The difference between these two locations gives a measure of how far away the point is. This distance is called the disparity, and is measured as the difference between the two projections of the same point onto the cameras' CCDs. Thus, given a point in one image, it is necessary to find the corresponding point in the other image. It is desirable for these two points, representing the same point in real space, to lie on the same horizontal line in the images, so that the search for the corresponding point can be limited to one dimension. To use stereo cameras to extract depth, the left and right images must be calibrated and aligned so that they share the same horizontal scan lines; that is, any point in real space corresponds to a point in the right image and a point in the left image on the same horizontal line. Mounting the two cameras parallel on the same horizontal axis achieves this easily. However, since our cameras are slightly verged in, the points projected into the two images do not lie on the same horizontal line. To rectify this, we employ the SVS Videre tool, which applies a transformation to the images that effectively makes the two cameras parallel. This calibration process requires several images of a checkerboard pattern displayed at different angles, and then outputs rectified images that correct not only for the verged cameras but also for distortion in the lenses.

7.3 Correspondence Problem

Given a point in one image (a hand or a head, for example), how do we locate that same point in the other image? Since the images have been rectified, the point must be on the same horizontal line in both images, so the search for the corresponding point is narrowed down to one dimension. As a measure of correspondence, the Normalized Cross Correlation (NCC) is used. A window around the object of interest in one image is taken and slid across the entire horizontal line, computing the NCC at each horizontal position along the line. The horizontal position with the largest NCC value is then taken to be the corresponding point. A sketch of this search is shown below.

Figure 7.1: Correspondence Problem
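A sketch of the one-dimensional correspondence search; OpenCV's normalized cross-correlation template matching stands in for whatever NCC implementation the project used, and the window size is an assumption:

```python
# Sketch: slide a window from the left image along the same scan line of the right
# image and keep the column with the highest normalized cross-correlation.
import cv2
import numpy as np

def find_correspondence(left, right, x, y, win=15):
    """Return the column in `right` that best matches the window centred at (x, y) in `left`."""
    h = win // 2
    template = left[y - h:y + h + 1, x - h:x + h + 1]
    strip = right[y - h:y + h + 1, :]                       # search only the same scan line
    scores = cv2.matchTemplate(strip, template, cv2.TM_CCORR_NORMED)
    return int(np.argmax(scores[0])) + h                    # centre of the best match
```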

7.4 Range Calculations

Figure 7.2 shows the setup of the cameras. All coordinates will now be with respect to the labeled axes.

Figure 7.2: Camera Setup

• A right-handed coordinate system is used, where the z axis points down and out of the cameras, the x axis points outward, and the y axis points into the page.
• θ_vergenceL and θ_vergenceR are the angles by which the cameras are rotated inward.
• b is the baseline, or distance between the two cameras.
• f is the focal length of the cameras.
• X_L and X_R are the known distances between the projected points on the CCD and the centre of the CCD, given by the correspondence process.
• θ_L and θ_R are the two angles adjacent to the x-axis; these angles characterize the distance Z.

θ_L and θ_R can be calculated given X_L, X_R, θ_vergenceL, θ_vergenceR, and the focal length:

\theta_L = \arctan\left(\frac{X_L}{f}\right) + \theta_{vergenceL}, \quad \theta_R = \arctan\left(\frac{X_R}{f}\right) + \theta_{vergenceR}

Using similar triangles, Z can be found as

Z = \frac{b}{\tan\theta_R - \tan\theta_L}

Notice that as θ_vergenceL,R approach zero, Z becomes inversely proportional to the disparity. After Z is found, X and Y are easily computed:

X = \frac{Z}{\tan\theta_L}, \quad Y = \frac{Y_L \cdot Z}{f}

where Y_L is the same as X_L but in the y-direction. Given these three values (X, Y, Z), a 3D point of the object of interest is formed.
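A sketch of the triangulation above, using the report's camera parameters; X_L, X_R and Y_L are signed offsets (in mm) of the projections from the CCD centre, and the sign conventions follow the axis definitions of Figure 7.2:

```python
# Sketch: recover a 3D point (X, Y, Z) from corresponding CCD offsets.
import math

F_MM = 6.0                    # focal length
BASELINE_MM = 364.57          # b
VERGE_L = VERGE_R = 0.0392    # vergence angles (radians)

def triangulate(x_left_mm, x_right_mm, y_left_mm):
    theta_l = math.atan(x_left_mm / F_MM) + VERGE_L
    theta_r = math.atan(x_right_mm / F_MM) + VERGE_R
    z = BASELINE_MM / (math.tan(theta_r) - math.tan(theta_l))
    x = z / math.tan(theta_l)
    y = y_left_mm * z / F_MM
    return x, y, z            # 3D point in the camera coordinate frame, in mm
```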

7.5 Range Resolution

The range resolution depends on many factors, including how far the object being resolved is from the cameras (Z), the baseline of the cameras, the focal length, and the width of a pixel on the CCD. Figure 7.3 shows the approximately Z^2 relationship of the resolution.

Figure 7.3: Range Resolution
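As a hedged illustration (not the report's derivation), for near-parallel cameras Z = f b / d, so a one-pixel error in disparity changes Z by roughly Z^2 Δd / (f b), which gives the Z^2 dependence shown in Figure 7.3:

```python
# Sketch: approximate depth uncertainty for a one-pixel disparity error.
F_MM, BASELINE_MM, PIXEL_MM = 6.0, 364.57, 0.0074

def range_resolution_mm(z_mm):
    return (z_mm ** 2) * PIXEL_MM / (F_MM * BASELINE_MM)

print(range_resolution_mm(4600.0))   # roughly 72 mm of depth uncertainty at 4.6 m
```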


Chapter 8

Vector Extrapolation

Given a 3D point for the head location and a 3D point for the hand location, a vector indicating where the user is pointing can be extrapolated. Taking the difference between the hand's 3D point and the head's 3D point and normalizing its magnitude to 1 meter forms a unit vector in the direction of the display. This vector, together with the 3D location of the user's head, is then output to the interface, where the user's intended pointing direction can be projected onto the tiled display.
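A minimal sketch of this step (an illustration, not the report's code):

```python
# Sketch: unit pointing vector from the head's 3D point through the hand's 3D point.
import numpy as np

def pointing_vector(head_xyz, hand_xyz):
    direction = np.asarray(hand_xyz, dtype=float) - np.asarray(head_xyz, dtype=float)
    unit = direction / np.linalg.norm(direction)   # normalize the magnitude to 1 m
    return head_xyz, unit                          # head position and pointing direction
```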


Chapter 9

Conclusion and Future Work

We have presented an approach for a marker-free human-computer interaction system using overhead stereo cameras, and developed a promising demonstration that can be enhanced to drive large tiled displays. More work is required for this project to become a truly robust system: speed needs to be improved and gesture recognition needs to be accomplished. For gesture recognition, we did not get good results with an overhead camera system; it would be a good experiment to add a third camera in front of the user solely for gesture recognition. Porting the code to take advantage of multi-core processors such as the Cell Broadband Engine would likely allow parallel computation of classifiers for different gestures.

