E27/CS27 Lab 5: Camera Based Mouse

Ben Mitchell
Dan Crosta
Zach Pezzementi

Abstract

For this project, we implemented a camera-based mouse system. We used r-g color histograms to threshold an image, and then processed that image to find hands in a pointing configuration. For this task we found that using oriented and scaled versions of the captured image improved the pointing accuracy of the system. Having found such a hand, we used median and Kalman filters to smooth the motion, and moved the mouse cursor on the screen according to the movement of the fingertip in the image. We also allowed a user to send a left click by extending the thumb from the standard pointing position.

Thresholding

In order to separate hands from the background, we used histograms in the r-g colorspace. This space is intensity invariant, which helps compensate for the fact that the cameras used had an auto-balance feature that could not be turned off. We also discovered that a light-colored background actually worked better than a dark one: because the hand and background were more similar, the cameras did not adjust their white/contrast/brightness balances as much, which made thresholding easier.
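
As a minimal sketch of this conversion (the pixel type and function names below are illustrative, not our actual code):

    #include <cstdint>

    struct RGB { uint8_t r, g, b; };

    // Map an RGB pixel to (r, g) chromaticity coordinates in [0, 1].
    // Dividing by the total intensity R+G+B discards brightness, which
    // is what makes the space intensity invariant.
    inline void rgChromaticity(RGB p, float &r, float &g) {
        float sum = float(p.r) + float(p.g) + float(p.b);
        if (sum == 0.0f) { r = g = 0.0f; return; }  // black pixel: avoid divide-by-zero
        r = p.r / sum;
        g = p.g / sum;
    }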

In order to find "skin" color for the camera setup, we first took a picture of a hand against the background and manually masked out the non-hand part of the image. (Isodata thresholding was found to work against a black background, but not against a light one; also, the training background must be the same as the testing background, or the white balance decreases accuracy.) We then scanned through the image, and for each pixel we incremented a two-dimensional histogram at the bucket corresponding to that pixel's value in the r-g colorspace. Having built up this histogram, we then scaled all the values relative to the maximum bucket value to create a histogram of rough probabilities that a pixel of a given color is part of a hand. This was all done offline.
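
The training step might look roughly like the following sketch, where maskedPixels holds the (r, g) values of the hand pixels that survived masking (the bucket count is a placeholder, not our tuned value):

    #include <algorithm>
    #include <utility>
    #include <vector>

    const int BINS = 32;  // buckets per axis; placeholder value

    // Build a BINS x BINS histogram over (r, g) in [0, 1], then divide by
    // the maximum bucket so each entry is a rough P(hand | color).
    std::vector<float> trainHistogram(const std::vector<std::pair<float, float>> &maskedPixels) {
        std::vector<float> hist(BINS * BINS, 0.0f);
        for (const auto &p : maskedPixels) {
            int i = std::min(int(p.first * BINS), BINS - 1);
            int j = std::min(int(p.second * BINS), BINS - 1);
            hist[i * BINS + j] += 1.0f;
        }
        float maxVal = *std::max_element(hist.begin(), hist.end());
        if (maxVal > 0.0f)
            for (float &h : hist) h /= maxVal;
        return hist;
    }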

Once we had a histogram, we used the probabilities to threshold each frame from the camera into a binary image, which we then segmented. We tried smoothing the frames to eliminate noise, but did not find it helpful.
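
Continuing the sketches above, per-frame thresholding reduces to a histogram lookup per pixel (the cutoff here is a placeholder for our empirically chosen value):

    const float SKIN_THRESH = 0.1f;  // placeholder cutoff

    // Threshold a frame into a binary mask (1 = skin candidate), using
    // rgChromaticity() and the trained histogram from the sketches above.
    std::vector<uint8_t> thresholdFrame(const std::vector<RGB> &frame,
                                        const std::vector<float> &hist) {
        std::vector<uint8_t> mask(frame.size());
        for (size_t k = 0; k < frame.size(); ++k) {
            float r, g;
            rgChromaticity(frame[k], r, g);
            int i = std::min(int(r * BINS), BINS - 1);
            int j = std::min(int(g * BINS), BINS - 1);
            mask[k] = (hist[i * BINS + j] > SKIN_THRESH) ? 1 : 0;
        }
        return mask;
    }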


The original image

Hand-masked version from which the histogram was generated.

Thresholded binary image

Segmentation and Image Processing

In order to analyze the objects in the thresholded image, we first had to segment the image. Before doing this, however, we used a grassfire transform to first shrink and then grow the binary image. The grassfire algorithm allows us to grow or shrink by more than one pixel at a time, and since it is a two-pass algorithm, it is more efficient than growing or shrinking one pixel at a time whenever the amount is more than two pixels. The reason for shrinking and then growing is to eliminate noise in the image, since the thresholding is not perfect. Shrinking causes small regions (i.e., small patches of background that were misclassified as hand) to disappear; growing again then fills in small holes in the hand region, so that it is continuous.
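
A sketch of the two-pass grassfire transform, here with city-block (4-connected) distances; the connectivity and names are illustrative, and growing by n works the same way with the mask inverted:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // For each pixel, compute the city-block distance to the nearest
    // background (0) pixel; background pixels get distance 0.
    std::vector<int> grassfire(const std::vector<uint8_t> &mask, int w, int h) {
        const int INF = w + h;  // larger than any possible distance
        std::vector<int> d(w * h);
        // Pass 1: top-left to bottom-right, looking up and left.
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                int k = y * w + x;
                if (!mask[k]) { d[k] = 0; continue; }
                d[k] = INF;
                if (x > 0) d[k] = std::min(d[k], d[k - 1] + 1);
                if (y > 0) d[k] = std::min(d[k], d[k - w] + 1);
            }
        // Pass 2: bottom-right to top-left, looking down and right.
        for (int y = h - 1; y >= 0; --y)
            for (int x = w - 1; x >= 0; --x) {
                int k = y * w + x;
                if (x < w - 1) d[k] = std::min(d[k], d[k + 1] + 1);
                if (y < h - 1) d[k] = std::min(d[k], d[k + w] + 1);
            }
        return d;
    }

    // Shrink by n pixels: keep only pixels more than n from the background.
    std::vector<uint8_t> shrink(const std::vector<uint8_t> &mask, int w, int h, int n) {
        std::vector<int> d = grassfire(mask, w, h);
        std::vector<uint8_t> out(mask.size());
        for (size_t k = 0; k < mask.size(); ++k) out[k] = (d[k] > n) ? 1 : 0;
        return out;
    }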

Having cleaned up the image, we then use a two-pass segmentation algorithm to create a region map. For each region, we then check to see whether there is a hand in the mouse position. We wanted a rotation-invariant system, both to correct for rotation of the camera setup itself (our camera was mounted to the monitor, which rotates easily) and to allow users to point in a comfortable, rather than easy-to-recognize, position. To this end, we found the oriented bounding box surrounding the hand, and projected the pixels in this box to a 40x40 compressed, rotated image.
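
The two-pass segmentation might be sketched as follows, with a union-find table resolving label equivalences between the passes (4-connectivity assumed; names are illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Find the root of a label, with path halving.
    static int findRoot(std::vector<int> &parent, int a) {
        while (parent[a] != a) a = parent[a] = parent[parent[a]];
        return a;
    }

    // Two-pass connected-component labeling: pass 1 assigns provisional
    // labels and records equivalences; pass 2 resolves them to roots.
    std::vector<int> segment(const std::vector<uint8_t> &mask, int w, int h) {
        std::vector<int> label(w * h, 0);
        std::vector<int> parent(1, 0);  // index 0 reserved for background
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                int k = y * w + x;
                if (!mask[k]) continue;
                int left = (x > 0) ? label[k - 1] : 0;
                int up   = (y > 0) ? label[k - w] : 0;
                if (!left && !up) {             // start a new region
                    label[k] = int(parent.size());
                    parent.push_back(label[k]);
                } else if (left && up) {        // both neighbors labeled: merge
                    int rl = findRoot(parent, left), ru = findRoot(parent, up);
                    parent[std::max(rl, ru)] = std::min(rl, ru);
                    label[k] = std::min(rl, ru);
                } else {
                    label[k] = left ? left : up;
                }
            }
        for (int &l : label)                    // pass 2: flatten to roots
            if (l) l = findRoot(parent, l);
        return label;
    }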

We found the oriented bounding box as follows: we first find the major and minor axes of a region using moments, and then we use the chain-code algorithm to walk around the outside of the region, projecting each boundary pixel onto each of the object's axes. Taking the extreme coordinates from this process gives us an oriented bounding box around the object.
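
Finding the major axis from moments can be sketched like this; the chain-code walk (not shown) then projects each boundary pixel onto this axis and its perpendicular, and the extreme projections define the oriented box:

    #include <cmath>
    #include <utility>
    #include <vector>

    // Angle of a region's major axis, from its second central moments.
    double majorAxisAngle(const std::vector<std::pair<int, int>> &pixels) {
        double cx = 0.0, cy = 0.0;
        for (const auto &p : pixels) { cx += p.first; cy += p.second; }
        cx /= pixels.size(); cy /= pixels.size();
        double mu20 = 0.0, mu02 = 0.0, mu11 = 0.0;
        for (const auto &p : pixels) {
            double dx = p.first - cx, dy = p.second - cy;
            mu20 += dx * dx; mu02 += dy * dy; mu11 += dx * dy;
        }
        return 0.5 * std::atan2(2.0 * mu11, mu20 - mu02);
    }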


Segmented image, with extrema of the bounding box along with the major and minor axes

We can use the angle of the major axis to get a rotation transform that will take us from the camera-space into the object-space, and we can use the bounding box to give us a translation. Since we also want to make the images scale invariant, we can add a scaling factor to our transformation such that the size of the bounding box will be taken to a fixed (small) size. This leaves us with a transform that rotates, scales, and translates, all in two dimensions.
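
Composed, the camera-to-object mapping might look like this sketch (names are illustrative; size would be 40 for our compressed images):

    #include <cmath>

    struct Point { double x, y; };

    // Map a camera-space point into the compressed object image: translate
    // the box corner to the origin, rotate by -theta (the major-axis angle),
    // and scale so the boxW x boxH box fills a size x size image.
    Point cameraToObject(Point p, Point boxOrigin, double theta,
                         double boxW, double boxH, double size) {
        double dx = p.x - boxOrigin.x, dy = p.y - boxOrigin.y;  // translate
        double c = std::cos(-theta), s = std::sin(-theta);
        double rx = c * dx - s * dy, ry = s * dx + c * dy;      // rotate
        return { rx * size / boxW, ry * size / boxH };          // scale
    }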

We then use this transform to create a compressed image of the object, oriented along the major axis. In order to do this, we must sample the original image, but because the new image is of lower resolution, a simple sampling would lead to terrible aliasing. To avoid this, we project the corners of each pixel in compressed space back into camera-space, and average all the pixels inside a bounding box around those coordinates. This average value we then assign to the pixel in question in compressed space. We do this for each pixel in the object-image, so that we have a complete, scaled, rotated representation of the object.
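
A sketch of the averaging step; objectToCamera is assumed to be the inverse of the transform sketched above, and the averaging footprint here is the axis-aligned box around the back-projected corners, as described:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    Point objectToCamera(Point p);  // inverse of cameraToObject above (assumed)

    // Compute one compressed-space pixel (ox, oy) by averaging the
    // camera-space pixels under its back-projected footprint.
    uint8_t samplePixel(const std::vector<uint8_t> &img, int w, int h,
                        int ox, int oy) {
        // Back-project the four corners of the compressed pixel.
        Point c[4] = { objectToCamera({double(ox),     double(oy)}),
                       objectToCamera({double(ox + 1), double(oy)}),
                       objectToCamera({double(ox),     double(oy + 1)}),
                       objectToCamera({double(ox + 1), double(oy + 1)}) };
        // Bounding box of the footprint, clamped to the image.
        int x0 = w, x1 = 0, y0 = h, y1 = 0;
        for (const Point &p : c) {
            x0 = std::min(x0, int(std::floor(p.x)));
            x1 = std::max(x1, int(std::ceil(p.x)));
            y0 = std::min(y0, int(std::floor(p.y)));
            y1 = std::max(y1, int(std::ceil(p.y)));
        }
        x0 = std::max(x0, 0); y0 = std::max(y0, 0);
        x1 = std::min(x1, w - 1); y1 = std::min(y1, h - 1);
        long sum = 0, count = 0;
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) { sum += img[y * w + x]; ++count; }
        return count ? uint8_t(sum / count) : 0;
    }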


A compressed version of the above hand; note the finger orientation relative to the bounding box (above)

Handshape Recognition

Now that we have our compressed image, we can use a simple finite state automaton to examine that image. Since all images are the same size and orientation, it becomes fairly easy to match things which look like a hand in the mouse position, which we defined to be a closed hand with extended index finger. The finite state automaton works in basically the same way as the one described by Mysliwiec, except that we use our compressed image rather than the entire image, which makes our system scale and rotation invariant.
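
We do not reproduce Mysliwiec's automaton here, but a much-simplified sketch of the idea might look like the following, scanning the compressed image slice by slice along the finger direction (the states, widths, and row counts are placeholders, not the actual transition rules, and we assume the finger runs along the image rows after orientation):

    #include <cstdint>
    #include <vector>

    enum State { SEARCHING, IN_FINGER, IN_PALM, REJECT };

    // A pointing hand should show a narrow run of rows (the finger)
    // followed by a wide run (the palm) in the oriented 40x40 image.
    bool looksLikePointingHand(const std::vector<uint8_t> &img) {
        const int N = 40;
        const int FINGER_MAX_W = 8, PALM_MIN_W = 16;  // placeholder widths
        State s = SEARCHING;
        int fingerRows = 0;
        for (int y = 0; y < N; ++y) {
            int width = 0;  // foreground pixels in this row
            for (int x = 0; x < N; ++x)
                if (img[y * N + x]) ++width;
            switch (s) {
            case SEARCHING:
                if (width > 0 && width <= FINGER_MAX_W) s = IN_FINGER;
                else if (width > FINGER_MAX_W) s = REJECT;
                break;
            case IN_FINGER:
                ++fingerRows;
                if (width >= PALM_MIN_W) s = IN_PALM;  // finger meets palm
                else if (width == 0) s = REJECT;       // finger ended without a palm
                break;
            default:
                break;
            }
            if (s == REJECT) return false;
        }
        return s == IN_PALM && fingerRows >= 6;  // placeholder minimum finger length
    }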

We also examined the images to find hands in the clicking position (thumb and forefinger extended at a right angle). We first tried an FSA to recognize a click-hand, but found that it incorrectly identified a pointing hand as a clicking hand, due to the scaling, which "stretches" the pointing hand relative to the clicking hand. So, rather than using the FSA, we applied a simple sum of squared differences (SSD) between a reference click image and the compressed image in question. If the SSD is below an empirically derived threshold, the hand is considered to be in the clicking position; otherwise it is not.
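
The click test itself is only a few lines; the threshold below is a placeholder for our empirically derived value:

    #include <cstdint>
    #include <vector>

    const long CLICK_THRESH = 100000;  // placeholder; derived empirically

    // Sum of squared differences between the compressed image and the
    // stored reference click image (both the same size).
    bool isClickHand(const std::vector<uint8_t> &img,
                     const std::vector<uint8_t> &ref) {
        long ssd = 0;
        for (size_t k = 0; k < img.size(); ++k) {
            long d = long(img[k]) - long(ref[k]);
            ssd += d * d;
        }
        return ssd < CLICK_THRESH;
    }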


Compressed image of a hand in mousing position

Compressed reference image of a hand in clicking position

Cursor Movement and Clicking

If we find a hand in the mouse position, we use the fingertip to move the cursor. After finding the fingertip in the compressed space (which is easy, due to the small scale of the space), we transform its coordinates back into camera space. We then take the difference between the current point and the previous point, and use it to move the cursor a proportional distance on the screen. To ensure that the user can take his or her hand away and then begin mousing again in a different part of the image, the system re-initializes the previous-position variables if more than 15 frames in a row fail to contain a hand in the mouse position. The 15-frame leeway corrects for salt-and-pepper noise in the system, mostly coming from imperfections in the color histogramming and thresholding. This noise also causes a jitter in the cursor position when the hand is still; to correct for it, we first apply a median filter, which takes the median of the last 15 positions, and then a Kalman filter, which further smooths the pointer motion.
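
A sketch of the smoothing pipeline, applied independently to the x and y coordinates each frame (the median output feeds the Kalman update); the Kalman noise parameters are placeholders, not our tuned values:

    #include <algorithm>
    #include <deque>
    #include <vector>

    // Median filter over the last 15 positions (one window per coordinate).
    double medianFilter(std::deque<double> &window, double v) {
        window.push_back(v);
        if (window.size() > 15) window.pop_front();
        std::vector<double> sorted(window.begin(), window.end());
        std::sort(sorted.begin(), sorted.end());
        return sorted[sorted.size() / 2];
    }

    // One-dimensional Kalman filter with a constant-position model.
    struct Kalman1D {
        double x = 0.0, p = 1.0;   // state estimate and its variance
        double q = 0.01, r = 4.0;  // process / measurement noise (placeholders)
        double update(double z) {
            p += q;                  // predict: uncertainty grows
            double k = p / (p + r);  // Kalman gain
            x += k * (z - x);        // correct toward the measurement z
            p *= (1.0 - k);
            return x;
        }
    };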

Clicking was accomplished by looking for consecutive images in the clicking position. Upon seeing the first such image, we send a mouse-button-press event, and upon seeing an image which is not in the clicking position, we send a mouse-button-release event. The motion events were sent using the function XWarpPointer, which is part of the standard X11 library, and the clicks were sent using XTestFakeButtonEvent, which comes from the XTest extension.
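
The event-sending code reduces to a few Xlib/XTest calls, roughly as follows (link with -lX11 -lXtst; the wrapper names are illustrative):

    #include <X11/Xlib.h>
    #include <X11/extensions/XTest.h>

    // Move the cursor by (dx, dy); passing None for both windows makes
    // XWarpPointer interpret the coordinates as a relative move.
    void moveCursorBy(Display *dpy, int dx, int dy) {
        XWarpPointer(dpy, None, None, 0, 0, 0, 0, dx, dy);
        XFlush(dpy);
    }

    // Press (or release) the left mouse button via the XTest extension.
    void sendLeftButton(Display *dpy, bool press) {
        XTestFakeButtonEvent(dpy, 1, press ? True : False, CurrentTime);
        XFlush(dpy);
    }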

Results

For lack of a better application, we used Bruce's mouseTest program to evaluate the performance of our camera mouse. However, the mouseTest program and the camera mouse program each require a lot of processing time, which means that the camera mouse is very laggy and has a very poor frame rate (and consequently response time) when the mouseTest program is run at the same time. The normal mouse has dedicated hardware, so the mouseTest program's use of the CPU is not a problem for it. Also, clicking precisely was discovered to be very difficult, because changing from the pointing position to the clicking position tended to move the cursor. With a good frame rate, this could be compensated for with a little practice, but with a poor frame rate it was nearly impossible. Therefore, we had users send click events from the real mouse with their off hand when evaluating the camera mouse. While not entirely realistic, the alternative would have put the camera mouse at such a severe disadvantage that the comparison would not have been meaningful. Even so, these results should be taken with a grain of salt, since the camera mouse performs better in practice. Note that it works best alongside I/O-bound processes; due to the lack of dedicated hardware, CPU-bound processes tend to drop the frame rate, and therefore lower the responsiveness of the system.

NOTE: we hacked the source for mouseTest so that it waits for I/O rather than polling for it. This made the frame rate for our camera mouse much, much better, and had the effect of roughly cutting average response times and distances in half. A copy of the modified version is in CVS. It does a few things that are a little annoying, but overall it is much more usable. As users with a little prior experience with the camera mouse (having tested it as we built it), Ben, Dan, and Zach had an advantage over the other users in this test.

                  Real mouse                      Camera mouse
                  avg. resp. (s)  avg. dist (pix) avg. resp. (s)  avg. dist (pix)
Ben               1.16            2.91            2.82            5.23
Dan               0.58            1.79            3.12            3.40
Zach              0.60            1.31            2.37            2.78
Ethan             0.63            1.21            3.27            2.96
Cari              0.63            1.54            2.97            2.41
Mean              0.72            2.10            2.91            3.35
Std. deviation    0.25            0.78            0.35            1.11

Conclusions

While our mouse is certainly not as fast or accurate as a standard optical mouse, it is still useful. First, we must consider that the standard mouse has dedicated hardware support for something we are doing in software, and that the users are all accustomed to it, and not to our system. Also, the cameras used were significantly less than ideal, not only in white balancing but also in field of view and resolution. A better camera with a wider field of view would make it much easier to match the speed of a normal mouse without sacrificing precision.

These things aside, our system has some real advantages over a standard mouse, and there are settings in which the advantages outweigh the disadvantages. The main advantage is that no physical mouse is required, which is good for several reasons. First of all, a lot of stress is put on the body by having to repeatedly shift hand and arm position to use a mouse; many people who work with computers for long stretches of time notice their mouse hand and arm becoming stiff or tired far faster than the other. But in addition to comfort, there is the issue of the presence of the hardware. The requirement for a mouse means a requirement for a surface to use that mouse on; even devices like trackballs or touch pads merely reduce the required surface area. Our system, rather than requiring desk space, requires only an environment in which a camera can be mounted looking at the area of interest. It is therefore useful in all kinds of applications where a cursor is required but a place to mouse is not available. In this context, it could be combined with a voice recognition system for character input to create a system requiring neither mouse nor keyboard, allowing it to be used in a variety of environments where neither peripheral would fit easily.

Extensions

While we're using an FSA, we apply transforms so that the images are essentially scale and rotation invariant, which makes our recognizer considerably more robust than the one described in the paper. Also, we implemented median and Kalman filters to smooth out our cursor movement, which was quite successful and makes the system a lot more useful. We also allow the user to click the left mouse button, without having to move his hand from the pointing position, by simply extending the thumb.