Batch based Monocular SLAM for Egocentric Videos

Abstract

Simultaneous Localization and Mapping (SLAM) from a monocular camera is a well-researched area. However, reliably estimating camera pose and 3D geometry for egocentric videos still remains a challenge [25, 40, 42]. Common causes of failure are dominant 3D rotations and low parallax between successive frames, which result in unreliable pose and 3D estimates. For forward-moving cameras with no opportunities for loop closure, the resulting drift eventually causes traditional feature-based and direct SLAM techniques to fail. We propose a novel batch-mode, structure-from-motion-based technique for robust SLAM in such scenarios. In contrast to most existing techniques, we process frames in short batches, wherein we exploit short loop closures arising from the to-and-fro motion of the wearer’s head, and stabilize the egomotion estimates with 2D batch-mode techniques such as motion averaging on pairwise epipolar results. Once pose estimates are obtained reliably over a batch, we refine the 3D estimate by triangulation and batch-mode Bundle Adjustment (BA). Finally, we merge the batches using 3D correspondences and carry out a BA refinement after merging. We present both qualitative and quantitative comparisons of our method on various public first- and third-person video datasets to establish the robustness and accuracy of our algorithm relative to the state of the art.

1 Introduction

Figure 1: Wild camera motions in egocentric videos make traditional SLAM techniques challenging to apply. We posit that the lack of parallax makes 3D estimation unreliable in consecutive frames of such videos. Incremental addition of frames, as is done in traditional SLAM techniques, results in incorrect estimation of trajectories from unreliable 3D points, leading to failure of the whole pipeline. We propose a batch-mode method which first stabilizes camera poses using 2D methods such as motion averaging, before estimating 3D geometry. This results in a robust SLAM technique for wild videos, including egocentric ones. The figure shows 3D point clouds and the trajectory estimated over 12000 frames from a Hyperlapse [25] sequence on which all the other SLAM techniques have been reported to fail.

Simultaneous Localization and Mapping (SLAM) has received a lot of attention from computer vision researchers because of its applicability in robotics, defense, unmanned vehicles, augmented reality, etc. Egocentric or first-person cameras [17, 18, 30] are wearable cameras, typically mounted on the wearer’s head. Their first-person perspective and always-on nature have made these cameras popular in extreme sports, law enforcement, life logging, home automation and assistive vision applications. This has drawn considerable interest in novel egocentric video applications [1, 2, 5, 13, 14, 15, 23, 29, 37, 39, 43, 46, 49]. Sharp head rotations, which cause quick changes in the camera view, as well as forward motion cause most visual SLAM techniques to fail on such videos, as claimed in [25, 40, 42]. In this paper we revisit the monocular SLAM problem with a special emphasis on egocentric videos [25, 40, 42]. To show the general applicability of our method, we also demonstrate it on other standard SLAM applications such as vehicle-mounted and hand-held videos.

Most of the current techniques for visual SLAM [4, 11, 33] deal with the problem incrementally, by picking one frame at a time from a video stream, and estimating the camera pose with respect to the 3d structure obtained so far, especially from the last few frames.

Feature based techniques [4, 9, 10, 27, 32, 33, 47, 50, 51] use resectioning [52] to first estimate the pose with respect to the existing 3D structure; in a second step, the estimated geometry is incrementally updated using Bundle Adjustment (BA) [52, 58]. This refines the camera pose and scene geometry simultaneously by minimizing the reprojection errors with the new image added. The process is repeated for every new frame. In some cases an additional loop closure step recognizes places visited earlier and refines the pose estimates over the loop so as to make them consistent at the intersection [54].

Dense and semi-dense visual odometry techniques [11] use all or a substantial subset of image pixels to register a new frame using a Gauss-Newton procedure. The optimization is carried out over both 3D structure and camera pose, using a Lie-algebraic formulation for the latter. In the second stage, a batch-mode loop closure over key-frames is used to fix the scale ambiguities.

It has been widely reported that both classes of techniques fail for egocentric videos [25, 40, 41, 42]. We observe that the incremental nature of both the styles is unsuitable for the low parallax due to dominant 3D rotations between successive frames in an egocentric video. Neither a feature based BA like strategy, nor a direct technique based on image registration can stabilize the 3d structure well due to errors in triangulation. Ultimately, relying on such 3D points causes drifts in the estimated camera trajectories leading to failure of the whole pipeline. In fact, [25] could only address this problem by carrying out bundle adjustment [45, 57] over large batches of 1400 frames, thereby making the problem well conditioned. We analyze these problems and suggest a novel pipeline for robust and accurate structure estimation from a fast moving, low parallax video such as from egocentric or a vehicle mounted camera. The specific contributions of this paper are:

We analyze the failure of existing SLAM techniques for egocentric videos. We posit that computing geometry from unreliable pose estimation is the primary cause of such failures.

We propose a batch mode technique which first stabilizes pose estimation before computing 3d structure using it. We compute the camera poses in small batches using local loop closures based on motion averaging [6] of initial estimates obtained using multiple epipolar relationships. The technique does not use unreliable 3D estimates at this stage.

We show that the proposed technique can reliably estimate camera pose and 3D structure from long public egocentric videos, which is not possible with any of the current SLAM methods.

We show that our technique works comparably, qualitatively as well as quantitatively, on other regular SLAM applications like vehicle mounted and scanning videos.

For simplicity and speed we use optical flow for image matching. However, our pipeline does not preclude the use of feature descriptor based matching for relocalization and mapping applications (see Section 5.3 for a discussion).

2 Related Work

Based on the method of feature selection for pose estimation, SLAM algorithms from a monocular camera can be classified into feature based, dense, semi-dense or hybrid methods. Feature based methods, both filtering based [26] and key frame based [4, 24, 33], use sparse features like SIFT [28], ORB [44], SURF [3] etc for tracking. The sparse feature correspondences are then used to refine the pose using structure-from-motion techniques like bundle adjustment. Due to the incremental nature of all these approaches, a large number of points are often lost during the resectioning phase [52].

Dense methods initialize the entire or a significant portion of an image for tracking [34]. The camera poses are estimated in an expectation maximization framework, where in one iteration the tracking is improved through pose refinement by minimizing the photometric error, and, in alternate iterations, the 3D structure is refined using the improved tracking. To increase the accuracy of estimation, semi-dense methods perform photometric error minimization only in regions of sufficient gradient [11, 12]. However, these methods do not fare well in cases of low parallax and wild camera motions mainly because structure estimation cannot be decoupled from pose refinement.

SLAM techniques also differ on the kinds of scene being tracked: road scenes captured from vehicle mounted cameras, indoor scans from a hand held camera, and from head mounted egocentric cameras usually accompanied by sharp head rotations of the wearer. Visual odometry algorithms have been quite successful for hand-held or vehicle mounted cameras [11, 12, 22, 24, 33, 34], but do not fare well for egocentric videos due to unrestrained camera motion, wide variety of indoor and outdoor scenes and presence of moving objects [25, 40, 41, 42].

In recent years, the structure-from-motion (SfM) techniques have seen a lot of progress, using the concepts of rotation averaging (RA) [6] and translation averaging (TA) [19, 21, 31, 56]. The computational cost being linear in number of cameras, these techniques are fast, robust and well suited for small image sets. They provide good initial estimates for camera pose and structure using pairwise epipolar geometry, which can be refined further using standard SfM techniques.

Loop closures in SLAM are detected using three major approaches [54]: map-to-map, image-to-image and image-to-map. Clemente et al. [7] use a map-to-map approach where they find correspondences between common features in two sub-maps. Cummins et al. [8] use visual features for image-to-image loop-closures. Matching is performed based on presence or absence of these features from a visual vocabulary. Williams et al. [55] use an image-to-map approach and find loop-closure using re-localization of camera by estimating the pose relative to map correspondences.

3 Background

The pose of a camera I′ with respect to a reference image I is denoted by a 3×3 rotation matrix $R \in SO(3)$ and a 3×1 translation direction vector $t$. The pairwise pose can be estimated by decomposing the essential matrix $E$, which binds the two views through pairwise epipolar geometry such that $E = [t]_\times R$ [20, 36]. Here $[t]_\times$ is the skew-symmetric matrix corresponding to the vector $t$. A view graph has the images as nodes and the pairwise epipolar relationships as edges.
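As a small illustration of the relation E=[t]×R, the following sketch builds an essential matrix from a known relative pose and checks that a corresponding point pair satisfies the epipolar constraint x′ᵀEx = 0. The pose convention X′ = RX + t, the helper names and the numeric values are assumptions made for this example and are not taken from the paper.

```python
import math


def skew(t):
    """Skew-symmetric matrix [t]x, so that [t]x applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(R, t):
    """E = [t]x R, assuming the relative pose maps X to X' = R X + t."""
    return matmul(skew(t), R)


def epipolar_residual(E, x, x_prime):
    """x'^T E x for homogeneous normalized image points x (in I), x' (in I')."""
    Ex = matvec(E, x)
    return sum(a * b for a, b in zip(x_prime, Ex))


# Hypothetical pose: I' is rotated 10 degrees about the y-axis and
# translated along the x-axis relative to I.
a = math.radians(10.0)
R = [[math.cos(a), 0.0, math.sin(a)],
     [0.0, 1.0, 0.0],
     [-math.sin(a), 0.0, math.cos(a)]]
t = [1.0, 0.0, 0.0]

X = [0.3, -0.2, 4.0]                                   # 3D point in the frame of I
X_prime = [c + d for c, d in zip(matvec(R, X), t)]     # same point in the frame of I'
x = [c / X[2] for c in X]                              # normalized projection in I
x_prime = [c / X_prime[2] for c in X_prime]            # normalized projection in I'

residual = epipolar_residual(essential_matrix(R, t), x, x_prime)
print(residual)  # zero up to floating-point rounding
```

In practice the direction of this relation is reversed: E is estimated from point correspondences and then decomposed into R and t, which is recoverable only up to scale.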

3.1 Motion Averaging

Given such a view graph, the camera poses can be embedded into a global frame of reference using motion averaging [6, 21, 31, 56]. The motion between a pair of cameras $i$ and $j$ can be expressed in terms of the pairwise rotation $R_{ij}$ and translation direction $t_{ij}$ as $M_{ij} = \begin{bmatrix} R_{ij} & s\,t_{ij} \\ 0 & 1 \end{bmatrix}$, where $s$ is the scale of the translation. If $M_i$ and $M_j$ are the motion parameters of cameras $i$ and $j$ respectively in the global frame of reference, then the pairwise and global camera motions are related by $M_{ij} = M_j M_i^{-1}$.
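The relation between pairwise and global motions can be checked numerically. The sketch below, with hypothetical helper names and example poses, builds two global 4×4 motion matrices, forms the pairwise motion as the product of the second with the inverse of the first, and verifies that composing it with the first global motion returns the second.

```python
import math


def rot_z(angle):
    """Rotation by `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def make_motion(R, t, s=1.0):
    """Build the 4x4 motion matrix [[R, s*t], [0, 1]]."""
    return [R[i] + [s * t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_motion(M):
    """Inverse of a rigid motion: [[R^T, -R^T t], [0, 1]]."""
    Rt = [[M[j][i] for j in range(3)] for i in range(3)]
    t = [M[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_motion(M_i, M_j):
    """Pairwise motion M_ij = M_j * inverse(M_i)."""
    return matmul4(M_j, invert_motion(M_i))


# Hypothetical global motions of two cameras.
M_i = make_motion(rot_z(0.2), [1.0, 0.0, 0.0])
M_j = make_motion(rot_z(0.5), [2.0, 1.0, 0.0])
M_ij = relative_motion(M_i, M_j)

# Composing the pairwise motion with M_i should give back M_j.
M_j_check = matmul4(M_ij, M_i)
```

Motion averaging solves the inverse problem: given many noisy pairwise motions on the edges of the view graph, it looks for the global motions that best satisfy this relation on every edge.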

Rotation averaging: Using the above expression, the relationship between global and pairwise rotations can be derived as $R_{ij} = R_j R_i^{-1}$, where $R_i$ and $R_j$ are the global rotations of cameras $i$ and $j$. From a given set of pairwise rotation estimates $R_{ij}$, we can estimate the absolute rotations of all the cameras by minimising a robust sum of discrepancies between the estimated relative rotations $R_{ij}$ and the relative rotations suggested by the terms $R_j R_i^{-1}$ [6].

\[ \{R_1, \dots, R_N\} = \operatorname*{argmin}_{\{R_1, \dots, R_N\}} \sum_{(i,j)} d\left(R_j R_i^{-1},\, R_{ij}\right) \tag{1} \]

where $d(R_1, R_2) = \frac{1}{\sqrt{2}} \left\| \log\left(R_2 R_1^{-1}\right) \right\|_F$ is the intrinsic bivariate distance measure defined on the manifold of 3D rotations, SO(3).
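For intuition, this distance equals the angle of the relative rotation between the two inputs: the matrix logarithm of a rotation by angle θ about a unit axis n is θ[n]×, whose Frobenius norm is √2·θ. The sketch below uses that identity to evaluate the distance from the trace of the relative rotation instead of computing a matrix logarithm; the function names and test rotations are illustrative assumptions.

```python
import math


def rot_z(angle):
    """Rotation by `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_distance(R1, R2):
    """Angle (radians) of the relative rotation R2 * R1^{-1}.

    This equals (1/sqrt(2)) * ||log(R2 R1^{-1})||_F for rotation matrices.
    """
    # trace(R2 R1^T) is the sum of elementwise products of R2 and R1.
    trace = sum(R2[i][k] * R1[i][k] for i in range(3) for k in range(3))
    # Clamp to [-1, 1] to guard against rounding before taking acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    # Note: acos is ill-conditioned very close to 0 and pi.
    return math.acos(cos_theta)


d = rotation_distance(rot_z(0.3), rot_z(0.5))
print(d)  # approximately 0.2
```

A robust rotation averaging scheme sums a robust function of this quantity over all edges of the view graph, so that a few grossly wrong pairwise estimates do not dominate the solution.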

Translation averaging: The global translations $T_i$ and $T_j$ are related to the pairwise translation direction $t_{ij}$ by $t_{ij} \times (T_j - R_{ij} T_i) = 0$. Global camera positions ($C_i = R_i^T T_i$) can be obtained as [56]: $\{C_i\} = \operatorname*{argmin}_{\{C_i\}} \sum_{(i,j)} d\left(R_j^T t_{ij},\, \frac{C_i - C_j}{\|C_i - C_j\|}\right)$, where $d(\cdot,\cdot)$ here denotes a distance between direction vectors and the summation is over all the camera-camera and camera-point constraints derived from 3D points. Wilson and Snavely [56] use a gradient descent approach to solve the minimization problem.

3.2 Bundle Adjustment

An alternative way to estimate camera pose is to use Structure-from-Motion (SfM) which recovers both camera poses and 3D structure by minimizing the reprojection error (described in (2)) using bundle adjustment [52].

\[ \min_{c_j,\, b_i} \sum_{i=1}^{n} \sum_{j=1}^{m} V_{ij}\, D\left(P(c_j, b_i),\; x_{ij}\,\Psi(x_{ij})\right) \tag{2} \]

where $V_{ij} \in \{0, 1\}$ is the visibility of the $i$th 3D point in the $j$th camera, $P$ is the function which projects a 3D point $b_i$ onto camera $c_j$, which is modelled using 7 parameters (1 for focal length, 3 for rotation, 3 for position), $x_{ij}$ is the actual projection of the $i$th point on the $j$th camera, $\Psi(x_{ij}) = 1 + r \|x_{ij}\|^2$ is the single-parameter ($r$) distortion function, and $D$ is the Euclidean distance.
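To make the cost in (2) concrete, the following sketch evaluates it for a tiny hypothetical scene. Several choices here are assumptions made for illustration only: the rotation is passed as a 3×3 matrix instead of a 3-parameter encoding, a simple pinhole projection f·(X/Z, Y/Z) with the camera looking along +z is used, and visibility is encoded by which (point, camera) pairs appear in the observation dictionary.

```python
import math


def project(camera, b):
    """P(c, b): pinhole projection of the 3D point b into the camera.

    `camera` holds focal length "f", rotation matrix "R" and position "C".
    """
    f, R, C = camera["f"], camera["R"], camera["C"]
    d = [b[k] - C[k] for k in range(3)]
    X = [sum(R[i][k] * d[k] for k in range(3)) for i in range(3)]
    return (f * X[0] / X[2], f * X[1] / X[2])


def distortion(x, r):
    """Psi(x) = 1 + r * ||x||^2, the single-parameter distortion function."""
    return 1.0 + r * (x[0] ** 2 + x[1] ** 2)


def reprojection_cost(cameras, points, observations, r=0.0):
    """Sum of Euclidean distances D(P(c_j, b_i), x_ij * Psi(x_ij)).

    `observations` maps (i, j) to the measured 2D point x_ij; a pair that
    is absent from the dictionary corresponds to V_ij = 0.
    """
    total = 0.0
    for (i, j), x in observations.items():
        p = project(cameras[j], points[i])
        psi = distortion(x, r)
        total += math.hypot(p[0] - psi * x[0], p[1] - psi * x[1])
    return total


# Hypothetical scene: one camera at the origin, two points.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [{"f": 500.0, "R": identity, "C": [0.0, 0.0, 0.0]}]
points = [[0.1, -0.2, 2.0], [0.0, 0.0, 4.0]]
observations = {
    (0, 0): (25.0, -50.0),   # matches the projection of point 0
    (1, 0): (3.0, 4.0),      # 5 pixels away from the projection of point 1
}

cost = reprojection_cost(cameras, points, observations)
print(cost)  # approximately 5.0
```

Bundle adjustment minimizes this quantity jointly over all camera parameters and 3D points; many implementations minimize the sum of squared distances instead, which changes the weighting but not the structure of the problem.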

Two kinds of bundle adjustment methods are used in the literature to minimize (2):

Incremental BA (IBA): Incremental bundle adjustment traverses the view graph sequentially, starting from an image pair as a seed for the optimization and then adding images one at a time through resectioning of 3D-2D correspondences. This technique is used in the majority of SLAM algorithms [4, 24, 33].

Global or Batch-mode BA (BBA): Batch-mode bundle adjustment optimizes all the camera poses at once by minimizing (2) globally. Because all cameras are optimized jointly, the approach is less susceptible to discontinuities in the reconstruction or to drift. However, it requires an initialization of the camera parameters and 3D structure, which can be provided through motion averaging and linear triangulation [20].

4 Proposed Algorithm

Figure 2: Flow chart of the proposed technique

4.1 Key Framing

We process the frames in order and designate a new key-frame whenever there is sufficient parallax with respect to the previous key-frame. The designation is governed by two criteria: a new key-frame is declared either when P frames have elapsed since the last key-frame, or when the average optical flow crosses a threshold of m pixels, which is sufficient to ensure good parallax between key-frames. Whichever criterion is met first defines the new key-frame. In our experiments we typically choose P between 10 and 30, based on the assumption that the camera has a frame rate of 30-60 fps, and m is typically chosen as 20 pixels. The rationale behind using 20 pixels of average optical flow in defining key-frames is to make our method adaptive to wild motions. For example, on the Hyperlapse videos [25], whenever there are wild turns, every frame becomes a key-frame, whereas when we use the same method on a walking video, the key-frames are farther apart.
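The key-frame rule can be sketched as a simple loop. In this illustration the input is assumed to be the mean optical-flow magnitude between each pair of consecutive frames, and the flow since the last key-frame is approximated by accumulating these values; the paper does not specify this detail, and the function and parameter names are hypothetical.

```python
def select_keyframes(mean_flow_per_frame, max_gap=20, flow_threshold=20.0):
    """Return the indices of frames chosen as key-frames.

    mean_flow_per_frame[k] is the mean optical-flow magnitude (pixels)
    between frame k-1 and frame k; entry 0 is ignored. A new key-frame is
    declared when either `max_gap` frames (P) have elapsed or the flow
    accumulated since the last key-frame reaches `flow_threshold` (m),
    whichever happens first.
    """
    keyframes = [0]          # the first frame is always a key-frame
    accumulated = 0.0
    for k in range(1, len(mean_flow_per_frame)):
        accumulated += mean_flow_per_frame[k]
        if k - keyframes[-1] >= max_gap or accumulated >= flow_threshold:
            keyframes.append(k)
            accumulated = 0.0
    return keyframes


# Slow motion: the frame-gap criterion triggers every 10 frames.
slow = select_keyframes([0.5] * 50, max_gap=10, flow_threshold=20.0)
print(slow)   # [0, 10, 20, 30, 40]

# Wild motion: the flow criterion makes every frame a key-frame.
wild = select_keyframes([25.0] * 5, max_gap=10, flow_threshold=20.0)
print(wild)   # [0, 1, 2, 3, 4]
```

The two example calls mirror the behaviour described above: dense key-frames during wild turns and sparser ones during steady walking.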

4.2 Batch Generation

Figure 3: Incremental 3D and trajectory estimation is problematic for egocentric videos due to lack of parallax between successive frames. We propose batch mode processing to stabilize the trajectory estimation first. (a) Incremental estimation (batch size 1) (b) Ours with a batch size of 30 key-frames. Note that large batch size may also cause problems in motion averaging convergence and breaks in structure estimation as shown in (c). The sequence is taken from Huji dataset [40].

We perform pose estimation in batches of key-frames. Batch-mode processing keeps the motion and structure estimation problem well constrained when the parallax between successive frames is small due to dominant 3D rotations, as is common in egocentric videos. We allocate a number of key-frames to a batch and then process each batch independently. Typically each batch contains around 10-30 key-frames, with successive key-frames separated by about 5-7 frames when the wearer is walking. While the lack of parallax justifies creating batches, making a batch too large is also problematic, for the following reasons:

Motion averaging works more stably on smaller batches.

A smaller batch size helps in controlling drifts and breaks in structure estimation.

Figure 3 shows the effect of batch size on trajectory estimation. Both incremental processing (Figure 3a) and processing the entire sequence of about 500 key-frames at once (Figure 3c) result in breaks in the trajectory and structure. In contrast, performing structure and pose estimation in smaller batches yields a consistent trajectory and structure without breaks (Figure 3b).

Note that batch-mode processing does not necessarily delay the computation of the output. We can use a sliding window of key-frames with significant overlap between successive batches. In the extreme case, two successive batches differ only in their two end key-frames. In addition, since the pose and structure estimates of a previous batch provide a good initialization for a new overlapping batch, this does not add significantly to the computational burden either. Most of the results in this paper have been produced with non-overlapping batches.
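A minimal sketch of how key-frames might be grouped into batches, covering both the non-overlapping case and the sliding-window case, is given below. The function name and its parameters are hypothetical and chosen only for illustration.

```python
def make_batches(keyframes, batch_size=30, overlap=0):
    """Split a list of key-frame indices into batches.

    `overlap` is the number of key-frames shared by successive batches:
    0 gives non-overlapping batches, and batch_size - 1 gives a sliding
    window in which successive batches differ only in their end key-frames.
    """
    if not 0 <= overlap < batch_size:
        raise ValueError("overlap must satisfy 0 <= overlap < batch_size")
    step = batch_size - overlap
    batches = []
    start = 0
    while start < len(keyframes):
        batches.append(keyframes[start:start + batch_size])
        if start + batch_size >= len(keyframes):
            break                      # the last key-frame has been covered
        start += step
    return batches


keyframes = list(range(10))

disjoint = make_batches(keyframes, batch_size=4, overlap=0)
print(disjoint)      # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

sliding = make_batches(keyframes, batch_size=4, overlap=3)
print(len(sliding))  # 7
print(sliding[1])    # [1, 2, 3, 4]
```

With a large overlap, the poses estimated for the shared key-frames of the previous batch can serve as the initialization for the next one.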

4.3 Local Loop Closures

Figure 4: Loop closures are an important step in a SLAM algorithm to fix accumulated errors. However, such global loop closures may never come in an egocentric video because of usual forward motion of the wearer. We observe that an egocentric camera scans a scene multiple times due to to-and-fro motion of the wearer’s head. We suggest local loop closures to exploit such typical motion profile. (a) Structure estimation without local loop closures (b) Estimation with local loop closures. A difference in consistency of scale is visible.

To handle large rotations, we use the concept of local loop closures, which give extra constraints for stabilising the camera estimates. For a classical SLAM problem on a hand-held video, global loop closure is an important step for fixing the errors accumulated over individual pose estimates. However, in egocentric videos, where the wearer typically moves forward along a roughly linear path, the user may never revisit a particular scene point, which makes global loop closure impossible. In addition, given the wild nature of egocentric videos, the camera pose estimates and trajectories tend to drift quickly unless fixed by loop closures. We note that a wearer’s head typically scans the scene from left to right and back during natural walking. The camera therefore looks at the same scene multiple times, providing opportunities for a series of short local loop closures. We take advantage of this phenomenon to improve the accuracy of the estimated camera poses. Similarly, in the case of pure forward motion the camera sees the same scene continuously, giving enough constraints through local loop closures.

To use this concept, we maintain a window of past key-frames. When considering a new key-frame I′, we estimate its pairwise pose with respect to the existing key-frames $I_{t-1}, I_{t-2}, \dots, I_{t-\mathrm{batchsize}}$ in this window in order to establish redundant paths. This constitutes a local loop closure. Additionally, the loop closures are detected on sliding windows to impose a smoothness constraint on the trajectory. Figure 4 shows the effect of loop closures on the estimation: in their absence (Figure 4a) the structure is deformed in scale and shifts above the ground, whereas with loop closures (Figure 4b) the estimate is consistent with the rest of the structure.
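The redundancy created by local loop closures can be seen by listing the view-graph edges they generate. The sketch below connects every key-frame to the previous `window` key-frames and counts the independent loops in the resulting graph. It assumes that every candidate pair yields a usable epipolar estimate, which a real pipeline would have to verify; the function names are hypothetical.

```python
def local_loop_closure_edges(num_keyframes, window):
    """Edges (i, j), i < j, linking each key-frame j to the `window`
    key-frames that precede it."""
    edges = []
    for j in range(num_keyframes):
        for i in range(max(0, j - window), j):
            edges.append((i, j))
    return edges


def num_independent_loops(num_keyframes, edges):
    """Number of independent cycles, |E| - |V| + 1, of a connected graph."""
    return len(edges) - num_keyframes + 1


# Window of 1: a simple chain with no redundant paths.
chain = local_loop_closure_edges(4, window=1)
print(chain)                              # [(0, 1), (1, 2), (2, 3)]
print(num_independent_loops(4, chain))    # 0

# Window of 2: each new key-frame closes a short loop.
looped = local_loop_closure_edges(4, window=2)
print(looped)                             # [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
print(num_independent_loops(4, looped))   # 2
```

Each independent loop contributes a consistency constraint that motion averaging can exploit, which is why a purely sequential chain offers no such correction.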

4.4 Camera Motion and 3D Structure Initialization

We use the five-point algorithm [36] to estimate the pairwise epipolar geometry and create the initial view graph for a batch. Once the pairwise estimates are obtained with local loop closures, we have enough redundant paths through every camera in the view graph of the current batch. This provides sufficient constraints for motion averaging, which we apply as described in Section 3.1. We first use rotation averaging to find global rotation estimates, followed by translation averaging. We combine two different methods to average the translations robustly: we generate an initial guess for the global translations using the global convex optimization technique of [38], and subsequently refine the solution using the approach of [56]. This provides a very good initial estimate of the camera pose. The 3D structure is initialized using linear triangulation as specified in [20].

4.5 SfM Refinement

The initial structure and camera poses are further refined using a final run of batch-mode bundle adjustment, which converges very quickly because of the good initialization described in the previous section.

4.6 Merging and BBA Refinement with Resectioning

Finally, the batches are successively merged using a 7-DoF alignment based on SVD, as in [53]. During merging, points that were previously left out because they were not visible in most cameras are added back, since they become stable now that more cameras view them. A final round of global BBA-based refinement is run whenever the cross-batch reprojection errors become high. This leads to a non-linear refinement of the scale of the estimated structure and poses. The complete algorithm is summarized in Figure 2.

5 Experiments and Results

In this section we present results of our technique on some challenging egocentric video datasets. Since classical SLAM algorithms like LSD [11] and ORB SLAM [33] do not work for egocentric videos, we compare with the state of the art on regular hand-held videos. We also perform a careful experimental analysis to justify our choice of parameters such as key-frame separation and batch size.

We have implemented portions of our algorithm in C++ and MATLAB. All the experiments have been carried out on a regular desktop with Core i7 2.3 GHz processor (containing 4 cores) and 32 GB RAM, running Ubuntu 14.04.

Our algorithm requires the intrinsic parameters of the cameras for SfM estimation. For the sequences taken from public sources, we have used the calibration information for the camera make and version provided on the respective websites.

5.1 Qualitative Results

Figure 5: Comparison of the estimated structure on the challenging Hyperlapse climbing03 sequence [25]. The wearer is climbing a hill with wild motions and sharp head rotations, causing state-of-the-art SLAM algorithms to fail. The authors of Hyperlapse report using an SfM algorithm after manually dividing the sequence into batches of 1400 frames. Our algorithm works without failure on the complete sequence. (a) Dense depth map generated by [25] using CMVS [59], (b) corresponding dense depth map generated by our method, (c) a reference view.

Figure 6: Our result on another challenging sequence from the Huji dataset [40]. Here the wearer is walking in a narrow alley and even makes a sharp 360 degree turn. The batch mode allows the proposed method to work robustly even on such a sequence. (a) Estimated trajectory superimposed on a Google map, (b) dense depth map of a portion obtained using CMVS [59] from our initial sparse reconstruction and camera poses, (c) corresponding scene image for visual comparison.

We have tested our algorithm on various Hyperlapse sequences [25]. The bike01 [25] video in the dataset is a very challenging sequence with wild head motions, fast forward movements and sharp turns. Both [4, 33] fail at the very wild motions at frames 1907-1920, whereas [11] works for around 2600 frames. Our method works smoothly for 12000 frames and beyond. We show the computed trajectory for up to 12000 frames in Figure 1. In the same figure we show the 3D map obtained by dense reconstruction of some portions using CMVS [59], based on the camera poses and the sparse structure computed using our algorithm. Note that CMVS can produce high quality output only if the pose and the initial structure estimates are correct. In Figure 5 we compare the dense 3D structure of a portion computed using our method with the one given in Hyperlapse. Note that in [25] the pose and 3D structure are computed using SfM over batches of 1400 frames.

We present similar results on the similarly challenging Huji_Yair_5 sequence from the HUJI egoseg dataset in Figure 6. All the state-of-the-art SLAM techniques have been reported to fail on these datasets [40, 41, 42].

5.2 Quantitative Analysis

Figure 7: Our result on the TUM benchmark [48] fr3_str_tex_far data set: (a) the sparse depth with the trajectory (white), (b) the dense depth map computed using CMVS [59], and (c) comparison with the ground truth trajectory after 7-DoF alignment.

Dataset            RMSE (cm) of key-frame trajectory
                   Ours      ORB-SLAM    LSD SLAM
fr3_str_tex_far    2.21      0.77        7.95
fr1_floor          5.5       2.99        38.07

Table 1: The focus of this paper is SLAM for egocentric cameras. Since most current state-of-the-art SLAM techniques do not work on egocentric videos, we perform the quantitative comparison on hand-held videos from the TUM benchmark dataset [48]. The table shows the RMSE with respect to the ground truth trajectory on two sequences from the dataset. Our error is lower than that of LSD SLAM on these sequences, while ORB SLAM achieves a lower error than ours.

None of the egocentric datasets we encountered has ground truth trajectories available, which makes it difficult to carry out a quantitative analysis of the proposed algorithm on such videos. However, multiple third-person datasets are available for quantitative analysis of a SLAM algorithm. We have used the TUM Visual odometry dataset [48] for this purpose. Figure 7 shows the dense reconstruction and the trajectory estimated by the proposed method. Note that the graph shown in the figure also contains the ground truth trajectory, but the estimated trajectory is so closely aligned with the ground truth that it hides it.

The TUM dataset also allows us to compute the RMS error of the computed trajectory with respect to the ground truth trajectory. Table 1 shows the error for our algorithm as well as the errors reported by state-of-the-art algorithms on the same sequences. Though our algorithm is targeted at egocentric videos, on these regular hand-held videos it is more accurate than LSD SLAM, while ORB SLAM remains more accurate than ours.
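The RMS trajectory error used in this comparison can be sketched as follows. The example assumes that the estimated key-frame positions have already been brought into the ground truth frame (for instance by a 7-DoF alignment) and that the two lists are in one-to-one correspondence; the function name and the sample coordinates are hypothetical.

```python
import math


def trajectory_rmse(estimated, ground_truth):
    """Root-mean-square position error between two aligned trajectories.

    Both arguments are equal-length lists of (x, y, z) positions expressed
    in the same coordinate frame and units.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p, q))
        for p, q in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical positions in metres: errors of 3 cm and 4 cm.
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]

rmse_cm = 100.0 * trajectory_rmse(estimated, ground_truth)
print(rmse_cm)  # approximately 3.54
```

Because a monocular reconstruction is defined only up to a similarity transform, the alignment step has to precede this computation for the error to be meaningful.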

5.3 Relocalization

Figure 8: Our pipeline does not preclude the use of feature descriptors for relocalization. We train a vocabulary tree [35] using the SIFT [28] features computed from the key-frames, match the features of a novel frame against the key-frames using the vocabulary tree, and then estimate the pose of the novel frame using 3D-2D correspondences. The figure shows the localized novel cameras on the precomputed trajectory. Estimated locations very near the trajectory indicate successful localization. We use the TUM fr3_str_tex_far sequence for this experiment.

Method        Rotation (deg.)        Position (cm)
              Mean      Median       Mean      Median
Without BA    0.0198    0.0216       1.004     1.040
With BA       0.0062    0.0051       0.975     0.977

Table 2: Quantitative analysis of relocalization error. We perform relocalization as shown in Figure 8 and compute error in camera rotation (degrees) and absolute position (cm) after relocalization for novel frames. Small error in the estimation indicates successful localization.

Table 3: Though our focus is on SLAM for egocentric videos, the algorithm is also applicable to other scenarios where the parallax between consecutive frames is low. Videos from vehicle-mounted cameras are one such case. We test our algorithm on sequences from the KITTI dataset [16] and compare with the state of the art. The table shows the RMSE in meters with respect to the ground truth. The numbers show that we improve on the state of the art on such videos as well. A thorough analysis and comparison is out of scope for this paper.

A popular way to measure the accuracy of the estimated 3D structure is the relocalization error. In our algorithm, we use optical flow for image matching for the sake of simplicity and speed. Since optical flow vectors do not have associated feature descriptors, they cannot be used for relocalization and mapping. However, our pipeline does not preclude the use of feature descriptors for relocalization. To demonstrate relocalization using our framework, we train a vocabulary tree [35] using the SIFT [28] features computed from the key-frames of the TUM fr3_str_tex_far sequence [48]. We then use a set of frames which are not key-frames to calculate the relocalization error. We carry out feature matching with the key-frames using the vocabulary tree, reject outliers using the pre-computed trajectory of the key-frames, and estimate the pose of the unknown frames using 3D-2D correspondences. In Figure 8, we plot the relocalized unknown frames on the computed trajectory. The location of the frames on the trajectory indicates the correctness of the relocalization. In Table 2 we show the accuracy of relocalization with respect to the ground truth, both with and without a final BA refinement.

5.4 Vehicle Mounted Cameras

Though the focus of this paper is on egocentric videos, our algorithm is equally applicable to other capture scenarios where there is low parallax between consecutive frames. One such popular case arises from a forward-looking camera mounted on a vehicle. We use the popular KITTI dataset [16] for this comparison. Figure 9 shows the trajectories computed using our algorithm along with the ground truth trajectories on various sequences from the dataset.

Table 3 shows the RMSE of the computed trajectory with respect to the ground truth. Comparison with the state-of-the-art ORB SLAM [33] indicates that we perform better than the state of the art on such videos as well. Note that LSD-SLAM [11] does not work on KITTI videos.

6 Conclusion

Despite the tremendous progress made in SLAM techniques, running such algorithms on many categories of videos in the wild still remains a challenge. We believe that careful case-by-case analysis of such challenging videos may give insights into solving the problem. Egocentric, or first-person, videos are one such category, and the one we focus on in this paper. We observe that the incremental estimation employed in most current SLAM techniques often causes unreliable 3D estimates to be used within trajectory estimation. We suggest first stabilizing the trajectory using 2D techniques and then proceeding to structure estimation. We also exploit domain-specific heuristics such as local loop closure. Interestingly, we observe that the proposed technique improves on the state of the art not only for the targeted egocentric videos but also for videos captured from vehicle-mounted cameras. We believe that robust trajectory and structure estimation from the proposed technique will help many current and novel egocentric applications.