Camera-Based Hand- and Body-Driven Interaction with Very Large Displays
Tom Cassey, Tommy Chheng, Jeffrey Lien {tcassey,tcchheng,jwlien}@ucsd.edu
Advisors: Raj Singh, Ruth West
Sponsor: National Center for Microscopy and Imaging Research (NCMIR)
University of California, San Diego

Objective
We introduce a system that tracks a user's hand and head in 3D and in real time for use with a large tiled display. The system uses overhead stereo cameras and does not require the user to wear any equipment.
Approach
System overview
Two overhead cameras capture the user. Our system uses the right frame as the primary source for head and hand detection; the left frame is a support frame used to recover 3D information.

Correspondence Matching
Three-dimensional positions for the hand and head are obtained by correspondence matching between the stereo cameras. Once the hand or head has been detected in the right image, we search for the matching region along the same row in the left image. We use normalized cross-correlation (NCC) as the metric to determine the best match:

NCC = Σ (L − μ_L)(R − μ_R) / √[ Σ (L − μ_L)² · Σ (R − μ_R)² ]

where L and R are the corresponding left- and right-image patches and μ_L, μ_R are their mean intensities.
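As an illustration of this step, the sketch below assumes roughly rectified frames and OpenCV's Python bindings; the function name, bounding-box format, and the use of cv2.matchTemplate are assumptions, not the original implementation.

    # Hypothetical sketch: find the left-image match for a right-image detection
    # by normalized cross-correlation along the same rows.
    import cv2

    def match_along_row(left_gray, right_gray, box):
        """Return the best-matching left-image column and the pixel disparity."""
        x, y, w, h = box                                   # detection in the right frame
        template = right_gray[y:y + h, x:x + w]
        strip = left_gray[y:y + h, :]                      # same rows, all columns
        scores = cv2.matchTemplate(strip, template, cv2.TM_CCOEFF_NORMED)  # NCC scores
        _, best_score, _, best_loc = cv2.minMaxLoc(scores)
        x_left = best_loc[0]
        disparity = abs(x_left - x)                        # horizontal offset between views
        return x_left, disparity, best_score

The resulting disparity, together with the calibration parameters, gives the depth of the detected object.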
Hardware Setup
We use two Point Grey Research Dragonfly Express cameras to form a stereo pair. IR emitters, together with IR filters on the cameras, help distinguish skin tones. The cameras are mounted about two meters above the user and capture video at a resolution of 640x480 at 15 frames per second. The video is processed on a Sun workstation with an AMD Opteron 246 processor and 3 GB of memory.
Camera Calibration
The cameras were calibrated so that 3D properties can be inferred from the stereo pair. We used the SRI Small Vision System calibration [1]. Using a known checkerboard pattern, the SRI API computes the extrinsic and intrinsic parameters of the cameras; these parameters correct for lens distortion and camera misalignment.

Vector Extrapolation
After correspondence matching, we have a 3D vector from the camera to both the head and the hand. We subtract these two vectors to obtain the user's intended pointing direction. This 3D vector is projected onto the large tiled display through a series of transformations from the camera coordinate system to the screen coordinate system.
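A minimal sketch of this step is shown below, assuming the head and hand positions are already available as 3D points in camera coordinates and that the display plane has been measured in the same frame; the function name and plane parameters are illustrative assumptions.

    # Hypothetical sketch: intersect the head-to-hand ray with the display plane.
    import numpy as np

    def pointing_target(head, hand, plane_point, plane_normal):
        """Return the 3D point where the head->hand ray meets the display plane."""
        direction = hand - head                          # the user's pointing direction
        denom = np.dot(plane_normal, direction)
        if abs(denom) < 1e-9:                            # ray parallel to the display
            return None
        t = np.dot(plane_normal, plane_point - head) / denom
        if t < 0:                                        # display is behind the user
            return None
        return head + t * direction                      # point of interest on the wall

The resulting point can then be mapped to pixel coordinates on the tiled display using the measured position and size of the screens.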
Background
Many research groups have built large tiled displays to visualize large, high-resolution imagery, such as a detailed biological sample under an electron microscope. However, few intuitive control systems have been created for them. Our project attempts to solve this interface problem by using overhead stereo cameras to project the user's intended point of interest onto the screens. NCMIR's BioWall, a 20-tile display, is an ideal candidate for our system.

Hand Detection
Hands are detected using Haar-like features with a cascade of AdaBoost classifiers. This approach allows robust detection and a degree of invariance to scale and rotation. We used the OpenCV libraries [2] to train and create the cascade of classifiers for our hand detection.
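A hypothetical sketch of the detection call is shown below, using OpenCV's Haar-cascade detector; the cascade file name and the detection parameters are assumptions rather than the values used in the original system.

    # Hypothetical sketch: detect hand candidates with a trained Haar cascade.
    import cv2

    hand_cascade = cv2.CascadeClassifier("hand_cascade.xml")   # trained cascade (assumed path)

    def detect_hands(frame_bgr):
        """Return bounding boxes (x, y, w, h) of hand candidates in one frame."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        gray = cv2.equalizeHist(gray)            # even out the IR illumination
        return hand_cascade.detectMultiScale(
            gray,
            scaleFactor=1.1,                     # image-pyramid step between scales
            minNeighbors=4,                      # overlapping detections required to accept
            minSize=(24, 24),                    # ignore very small responses
        )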
Head Detection
Heads are detected using the Hough transform generalized for circles. Seen from above, heads appear roughly circular, and this geometric property can be exploited. Circular regions are detected with the generalized Hough transform using the circle equation:

r² = (x − a)² + (y − b)²

where (a, b) is the circle centre and r its radius.
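As an illustration, the sketch below uses OpenCV's circular Hough transform; the radius limits and accumulator thresholds are rough guesses for a head viewed from about two meters overhead, not values from the original system.

    # Hypothetical sketch: find circular (head-like) regions with the Hough transform.
    import cv2

    def detect_heads(frame_bgr):
        """Return (a, b, r) circle candidates for heads seen from above."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        gray = cv2.medianBlur(gray, 5)           # suppress noise before edge detection
        circles = cv2.HoughCircles(
            gray,
            cv2.HOUGH_GRADIENT,                  # gradient-based circular Hough transform
            dp=1,                                # accumulator at full image resolution
            minDist=80,                          # minimum spacing between detected centres
            param1=100,                          # Canny high threshold
            param2=30,                           # accumulator vote threshold
            minRadius=30,
            maxRadius=90,
        )
        return [] if circles is None else circles[0]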
Tracking
We use token tracking to make our object detection more robust. Tokens are data structures that store state information (position and area) about detected objects in the image. When an object is detected, its location and area are compared to existing tokens; if the new region is similar enough to an existing token, the token's data is updated. This lets us select the best detected object over time. Token tracking supports both hand and head detection.
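The sketch below illustrates one way such tokens could be represented; the field names and similarity thresholds are assumptions, not the original data structure.

    # Hypothetical sketch of a token and its update rule.
    from dataclasses import dataclass

    @dataclass
    class Token:
        x: float            # last known centre position (pixels)
        y: float
        area: float         # last known region area (pixels^2)
        age: int = 0        # consecutive frames this token has been matched

    def update_tokens(tokens, detections, max_dist=40.0, max_area_ratio=1.5):
        """Match new detections (x, y, area) against existing tokens."""
        for dx, dy, darea in detections:
            best = None
            for tok in tokens:
                close = ((tok.x - dx) ** 2 + (tok.y - dy) ** 2) ** 0.5 <= max_dist
                similar = max(tok.area, darea) / max(min(tok.area, darea), 1.0) <= max_area_ratio
                if close and similar and (best is None or tok.age > best.age):
                    best = tok                   # prefer the longest-lived matching token
            if best is not None:
                best.x, best.y, best.area = dx, dy, darea
                best.age += 1                    # existing token confirmed and updated
            else:
                tokens.append(Token(dx, dy, darea, age=1))   # start tracking a new object
        return tokens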
Conclusion
We developed a testing program, HATS, to visualize the stereo pair and show the currently detected head and hand. Using a cascade of AdaBoost classifiers with Haar-like features for hand detection and a generalized Hough transform for circles for head detection, we compute 3D positions by correspondence matching and extract a 3D vector of the user's point of interest. Our system works well enough for users to trace their hands on a drawing board. NCMIR will interface with our system for 3D scientific applications on large tiled displays at research conferences.
Future Work
More work is required for this to be a truly robust system. Higher processing speed is needed for greater control accuracy, and a robust gesture recognition system would allow the user to trigger actions on the display.
References
[1] SRI Small Vision System, http://www.ai.sri.com/~konolige/svs/
[2] OpenCV, http://www.intel.com/technology/computing/opencv/overview.htm