The stereo tracker algorithm constructs accurate 3D maps from the information coming from two cameras mounted as a fixed stereo rig. The system builds the
map during the survey (on-line process) using feature-based registration techniques (i.e.
SURF and SIFT)
extended to a stereo framework. The algorithm then runs a final bundle adjustment
optimization (off-line) that refines the structure and the camera motion by minimizing the reprojection
error of the 3D estimates onto the cameras.

Before executing any survey, the stereo system must be calibrated in order to
obtain the intrinsic parameters of each camera and the extrinsic parameters relating the two cameras.
The calibration process also estimates the non-linear image distortion parameters (radial and tangential),
which enable us to remove these distortions afterwards. Furthermore, after calibration,
each image pair obtained by the stereo rig can be rectified. The rectification process transforms
the image pairs into new pairs that can be considered as acquired by a fronto-parallel stereo system.
In such a system, corresponding points always lie on the same scan-line. This introduces
a constraint that reduces the correspondence search from 2D to 1D.
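The radial and tangential distortion mentioned above can be sketched with the standard Brown-Conrady model. The snippet below is a minimal pure-Python illustration, not the calibration code used by the system; the parameter names (`k1`, `k2` for radial, `p1`, `p2` for tangential) and the fixed-point inversion are conventional choices assumed here:

```python
def distort_point(x, y, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to a
    normalized image point (x, y), following the Brown-Conrady model."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

def undistort_point(x_d, y_d, k1, k2, p1, p2, iters=20):
    """Invert the distortion by fixed-point iteration: start from the
    distorted point and repeatedly subtract the estimated distortion
    residual. Converges quickly for moderate distortion coefficients."""
    x, y = x_d, y_d
    for _ in range(iters):
        xt, yt = distort_point(x, y, k1, k2, p1, p2)
        x += x_d - xt
        y += y_d - yt
    return x, y
```

Undistorting every pixel this way (plus the rectifying homographies from calibration) is what turns the raw pairs into the fronto-parallel pairs used for matching.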

Figure 1 shows, on the left, two image pairs taken at time t and time t+1. On the right,
it shows the result after correcting the non-linear distortions and rectifying.


Fig 1. The left column shows the images acquired at time t and t+1; the right column shows the same images after the non-linear distortion correction and the stereo rectification process.

To achieve the reconstruction, the Stereo Tracker Algorithm executes the following actions at each step:

Feature Detection

Feature Matching

Triangulation

3D Registration and ego-motion estimation

Figure 2 depicts the features detected using SURF
in the quadruplet of rectified images gathered by the stereo rig at time t and time t+1. The best features are
selected, evenly spread within the images, by means of a non-maximal suppression algorithm.
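One way to read "best features, evenly spread" is a greedy non-maximal suppression over detector responses. The sketch below assumes features are `(x, y, response)` tuples and a minimum-separation radius; the actual SURF implementation may select features differently:

```python
import math

def select_features(features, radius, max_n):
    """Greedy non-maximal suppression: visit candidate features
    (x, y, response) in decreasing response order and keep one only if
    it is at least `radius` pixels from every feature already kept.
    This spreads the strongest features evenly over the image."""
    kept = []
    for x, y, resp in sorted(features, key=lambda f: -f[2]):
        if all(math.hypot(x - kx, y - ky) >= radius for kx, ky, _ in kept):
            kept.append((x, y, resp))
            if len(kept) == max_n:
                break
    return kept
```

The radius trades density for coverage: a larger radius yields fewer but better-distributed features, which helps the later registration stay well conditioned.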

Once the features are detected, SURF
descriptors are computed around them. The feature descriptors are matched in image pairs: a)
left-right at time t; b) left-right at time t+1; c) left-left at times t and t+1;
and d) right-right at times t and t+1.
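Descriptor matching itself is typically nearest-neighbour search in descriptor space. The sketch below treats SURF descriptors as plain float vectors and adds a distance-ratio test to discard ambiguous matches; the ratio test is a common practice assumed here, not something the text specifies:

```python
import math

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour matching between two descriptor lists (each
    descriptor a list of floats). A match is accepted only if the best
    distance is clearly smaller than the second best (ratio test),
    which rejects features with ambiguous appearance."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
        elif len(dists) == 1:
            matches.append((i, dists[0][1]))
    return matches
```

The same routine would be run over each of the four pairings a)-d); the geometric filters described next then prune the survivors.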

Figure 3 shows, on the left, all the matches found and,
on the right, the matches that remain after applying the epipolar,
disparity and quadruplet constraints. The epipolar constraint only allows a
correspondence at (x+Δx, y) in the right image for a feature at
(x, y) in the left one. The disparity constraint bounds Δx
to a limited range, restricting matches to nearby depths. Finally, the quadruplet constraint filters out
quadruplet matches that do not form a closed set; that is, if we select a matched
feature in the left image at time t and follow the matching links
as edges in a graph, after four steps we must end up at the initial
feature. In other words, linking a set of
4 matched features must produce a closed quadrilateral.
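These three filters can be sketched as two small predicates. The dictionary representation of the matching links (one map per traversal direction) and the numeric thresholds are illustrative assumptions, not the system's actual data structures:

```python
def stereo_pair_valid(xl, yl, xr, yr, max_dy=1.0, min_disp=1.0, max_disp=64.0):
    """Epipolar constraint: on rectified images a valid match lies on the
    same scan-line (|yl - yr| small). Disparity constraint: the offset
    d = xl - xr must be positive and within a plausible range."""
    d = xl - xr
    return abs(yl - yr) <= max_dy and min_disp <= d <= max_disp

def quadruplet_consistent(f, left_to_right_t, right_t_to_t1,
                          right_to_left_t1, left_t1_to_t):
    """Quadruplet constraint: follow the matching links of feature `f`
    (left image, time t) around the loop
    left_t -> right_t -> right_{t+1} -> left_{t+1} -> left_t.
    Keep the match only if four steps return to the starting feature."""
    g = left_to_right_t.get(f)       # left_t  -> right_t
    if g is not None:
        g = right_t_to_t1.get(g)     # right_t -> right_{t+1}
    if g is not None:
        g = right_to_left_t1.get(g)  # right_{t+1} -> left_{t+1}
    if g is not None:
        g = left_t1_to_t.get(g)      # left_{t+1} -> left_t
    return g is not None and g == f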

Fig 3. The left image shows all the correspondences
obtained after the matching process. The right image shows the
filtered matches using the epipolar, the disparity and the
quadruplet constraints.

Using the pairwise matches at time t, we triangulate the 3D position of every matched
pair of points. The same process is applied to the matched features at time t+1. After this step we obtain
two sets of 3D points, together with the correspondences between them, since we have computed the time t to t+1
matches. Using this information, a robust 3D registration is performed by running
RANSAC over the
absolute orientation
algorithm to obtain the stereo rig motion (rotation and translation) between time t and time t+1.
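For a rectified (fronto-parallel) rig, triangulation reduces to simple disparity arithmetic. The sketch below assumes a pinhole model with focal length `f` (pixels), principal point `(cx, cy)`, and baseline `B` (metres); the depth follows from Z = f·B/d with disparity d = xl − xr:

```python
def triangulate(xl, yl, xr, f, cx, cy, baseline):
    """Triangulate one rectified stereo match into 3D camera coordinates.
    For a fronto-parallel rig the disparity d = xl - xr gives:
        Z = f * B / d,  X = (xl - cx) * Z / f,  Y = (yl - cy) * Z / f."""
    d = xl - xr
    if d <= 0:
        raise ValueError("non-positive disparity: point at or behind infinity")
    Z = f * baseline / d
    return ((xl - cx) * Z / f, (yl - cy) * Z / f, Z)
```

Running this over the filtered matches at t and at t+1 yields the two corresponded 3D point sets that the RANSAC/absolute-orientation registration aligns.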

The incrementally estimated trajectory drifts over time; that is,
as more measurements are incorporated, more error accumulates. We improve the solution, slightly
reducing this drift, by executing an off-line optimization
(bundle adjustment) step.
To incorporate more information into the optimization, the 2D features are tracked across
consecutive frames, providing two or more 2D projections for each 3D point. We use an
optimized
C/C++ Sparse-Bundle implementation, extended to account
for stereo data, focused on optimizing structure-from-motion
data with multiple 2D projections of a single 3D point across frames.

Bundle adjustment computes the maximum-likelihood estimate for the structure-from-motion
problem by minimizing the reprojection error; however, its cost is cubic in
the number of parameters.
A sparse bundle adjustment algorithm exploits the
block-sparse structure of the Jacobian matrix in order to save time
and memory. Moreover, we use the sparse bundle
adjustment to optimize the error in 3D instead of 2D,
saving a little more time and resources, since the same
problem formulated in 3D terms has a smaller size
than in 2D terms.
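The objective that bundle adjustment minimizes can be written as a short sketch. This is an illustration of the reprojection-error cost only, not the optimizer itself; the pose representation (rotation as a 3x3 nested list, translation as a 3-vector) is an assumption for readability:

```python
def project(point, R, t, f, cx, cy):
    """Pinhole projection of a 3D point after the rigid transform (R, t)."""
    X = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    return (f * X[0] / X[2] + cx, f * X[1] / X[2] + cy)

def reprojection_error(points3d, observations, poses, f, cx, cy):
    """Mean squared reprojection error over all (point, camera) pairs --
    the cost that bundle adjustment minimizes over both the 3D points
    and the camera poses. `observations` maps
    (point_index, camera_index) -> observed pixel (u, v)."""
    total, n = 0.0, 0
    for (pi, ci), (u, v) in observations.items():
        R, t = poses[ci]
        up, vp = project(points3d[pi], R, t, f, cx, cy)
        total += (up - u) ** 2 + (vp - v) ** 2
        n += 1
    return total / n if n else 0.0
```

Because each residual involves only one point and one camera, the Jacobian of this cost is block sparse, which is exactly the structure the sparse bundle adjustment exploits.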

Figure 4 illustrates the reconstruction process run over an outdoor dataset.

Fig 4. This figure is a video sequence. The top row shows the stereo
input sequence at 25 FPS. The bottom-left figure shows consecutive pairs of stereo shots at a
subsampled rate (1 out of every 20 stereo pairs) with the stereo feature pairs tracked along time. The bottom-right picture illustrates
the evolution of the reconstruction. The blue ellipsoid represents the camera position uncertainty (99% confidence) and the green ellipsoid
the uncertainty of the scene reconstructed at the last shot (99% confidence).
The uncertainty in the reconstruction allows detecting the loop-closure
point and relating the first and the last stereo-frame features. Then,
a global optimization (stereo sparse bundle adjustment) is performed. The
final camera trajectory and the images are fed to a
dense reconstruction algorithm.