Go to root: PhD-3D-Object-Tracking

One of the most important fields of robotics is computer vision. A machine that is able to "see" its environment can have a lot of applications. For example, think of a robot in a production line that grabs some parts and moves them somewhere else. Or a surveillance system that is able to recognize how many people are in a room. Or a biped robot making its way through a room, evading obstacles such as tables, chairs or other people.

For many years, the most common sensors for computer vision were 2D cameras, that retrieved a RGB image of the scene (like all the digital cameras that are so common nowadays, in our laptops or smarphones). Algorithms exist that are able to find an object in a picture, even if it is rotated or scaled, or that can retrieve motion data from a stream of video, and even perform 3D analysis to track the camera's position. All these years' worth of research is now available in libraries like OpenCV (Open Computer Vision).

3D sensors are available, too. The biggest advantage they offer is that it is trivial to get measurements about distances and motion, but the addition of a new dimension makes calculations expensive. Working with the data they retrieve is a lot different that working with a 2D image, and texture information is rarely used.

During the next tutorials I will explain how to get a common depth sensor working with a 3D processing library.

Depth sensors

3D or depth sensors give you precise information about the distance to the "points" in the scene (a point would be the 3D equivalent of a pixel). There are several types of depth sensors, each working with a different technique. Some sensors are slow, some are fast. Some give accurate, high-res measurements, some are noisy and low-res. Some are expensive, some can be bought for a hundred bucks. There is no "perfect" sensor and which one you choose will depend on your budget and the project that you want to implement.

Stereo cameras

Example of stereo camera.

Stereo cameras are the only passive measurement device of the list. They are essentially two identical cameras assembled together (some centimeters apart), that capture slightly different scenes. By computing the differences between both scenes, it is possible to infer depth information about the points in each image.

A stereo pair is cheap, but perhaps the least accurate sensor. Ideally, it would require perfect calibration of both cameras, which is unfeasible in practice. Bad light conditions will render it useless. Also, because of the way the algorithm works (detecting corresponding points of interest in both images), they give poor results with empty scenes or objects that have plain textures, where few interest points. Some models circumvent this by projecting a grid or texture of light on the scene (active stereo vision).

I will not go into detail about the math involved with the triangulation process, as you can find it on the internet.

Time-of-flight

LIDAR mounted on a mobile robot (image from Wikimedia Commons).

Time-of-flight camera (image from Wikimedia Commons).

Microsoft Kinect.

ASUS Xtion PRO.

Infrarred light pattern (dots) projected by Kinect.

Time-of-flight (ToF) sensors work by measuring the time it has taken a ray or pulse of light to travel a certain distance. Because the speed of light is a known constant, a simple formula can be used to obtain the range to the object. These sensors are not affected by light conditions and have the potential to be very precise.

LIDAR

A LIDAR (light+radar) is just a common laser range finder mounted on a platform that is able to rotate very fast, scanning the scene point by point. They are very precise sensors, but also expensive, and they do not retrieve texture information. They have been used for decades in many different fields like meteorology, archaeology or astronomy. LIDAR devices can be mounted on satellites, planes or mobile robots. The data retrieved by a LIDAR has very high resolution, so some processing is needed in order to use it for real-time applications.

ToF cameras

A time-of-flight camera does not perform point-by-point scans like a LIDAR does. Instead, it employs a single pulse of light to capture the whole scene, once per frame. Thanks to that, they can work a lot faster, with some models topping above 100 Hz. The price to pay is a low resolution of 320×240 or even less. Depth measurements have an accuracy of about 1cm. ToF cameras cost less than LIDAR sensors, but we are talking about 4000 $ or so. Color information is not retrieved.

Structured light

Structured light sensors (like Kinect and Xtion) work by projecting a pattern of infrarred light (for example, a grid of lines, or a "constellation" of points) on top of the scene's objects. This pattern is seen distorted when looked from a perspective different from the projector's. By analysing this distortion, information about the depth can be retrieved, and the surface(s) reconstructed.

The resolution and speed of these sensors are similar to common VGA cameras, usually 640x480 at 30 FPS (the new Kinect model works at 1080p). Precision is similar to that of ToF cameras, 1cm more or less, with a maximum range of 3-6m. They are a lot cheaper than any other sensors, with a first generation Kinect now costing around 100 $, and hence they have become a lot popular in the last years. In the next tutorial, I will explain the installation and usage of one of these cameras.

Point clouds

Go to root: PhD-3D-Object-Tracking

Links to articles:

PCL/OpenNI tutorial 0: The very basics

PCL/OpenNI tutorial 1: Installing and testing

PCL/OpenNI tutorial 2: Cloud processing (basic)

PCL/OpenNI troubleshooting

PCL/OpenNI tutorial 0: The very basics

Contents

Depth sensors

Stereo cameras

Time-of-flight

LIDAR

ToF cameras

Structured light

Point clouds

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Resources

Activities

Tools