Image and Vision Computing 21 (2003) 359–381

A survey of video processing techniques for trafﬁc applications
V. Kastrinaki, M. Zervakis*, K. Kalaitzakis
Digital Image and Signal Processing Laboratory, Department of Electronics and Computer Engineering, Technical University of Crete, Chania 73100, Greece

Received 29 October 2001; received in revised form 18 December 2002; accepted 15 January 2003

Abstract

Video sensors become particularly important in traffic applications mainly due to their fast response, easy installation, operation and maintenance, and their ability to monitor wide areas. Research in several fields of traffic applications has resulted in a wealth of video processing and analysis methods. Two of the most demanding and widely studied applications relate to traffic monitoring and automatic vehicle guidance. In general, systems developed for these areas must integrate, amongst their other tasks, the analysis of their static environment (automatic lane finding) and the detection of static or moving obstacles (object detection) within their space of interest. In this paper we present an overview of image processing and analysis tools used in these applications and we relate these tools with complete systems developed for specific traffic applications. More specifically, we categorize processing methods based on the intrinsic organization of their input data (feature-driven, area-driven, or model-based) and the domain of processing (spatial/frame or temporal/video). Furthermore, we discriminate between the cases of static and mobile camera. Based on this categorization of processing tools, we present representative systems that have been deployed for operation. Thus, the purpose of the paper is threefold. First, to classify image-processing methods used in traffic applications. Second, to provide the advantages and disadvantages of these algorithms. Third, from this integrated consideration, to attempt an evaluation of shortcomings and general needs in this field of active research.
© 2003 Elsevier Science B.V. All rights reserved.
Keywords: Trafﬁc monitoring; Automatic vehicle guidance; Automatic lane ﬁnding; Object detection; Dynamic scene analysis

Image processing also ﬁnds extensive applications in the related ﬁeld of autonomous vehicle guidance, mainly for determining the vehicle’s relative position in the lane and for obstacle detection. The problem of autonomous vehicle guidance involves solving different problems at different abstraction levels. The vision system can aid the accurate localization of the vehicle with respect to its environment, which is composed of the appropriate lane and obstacles or other moving vehicles. Both lane and obstacle detection are based on estimation procedures for recognizing the borders of the lane and determining the path of the vehicle. The estimation is often performed by matching the observations (images) to an assumed road and/or vehicle model. Video systems for either trafﬁc monitoring or autonomous vehicle guidance normally involve two major tasks of perception: (a) estimation of road geometry and (b) vehicle and obstacle detection. Road trafﬁc monitoring aims at the acquisition and analysis of trafﬁc ﬁgures, such as presence and numbers of vehicles, speed distribution data, turning trafﬁc ﬂows at intersections, queue-lengths, space and time occupancy rates, etc. Thus, for trafﬁc monitoring it is essential to detect the lane of the road and then sense

and identify presence and/or motion parameters of a vehicle. Similarly, in autonomous vehicle guidance, the knowledge about road geometry allows a vehicle to follow its route, and the detection of road obstacles becomes a necessary and important task for avoiding other vehicles present on the road. In this paper we focus on video systems considering both areas of road traffic monitoring and automatic vehicle guidance. We attempt a state-of-the-art survey of algorithms and tools for the two major subtasks involved in traffic applications, i.e. automatic lane finding (estimation of the lane and/or central line) and vehicle detection (moving or stationary object/obstacle). With the progress of research in computer vision, it might appear that these tasks should by now be trivial. The reality is not so simple; a vision-based system for such traffic applications must offer short processing time, low processing cost and high reliability [2]. Moreover, the techniques employed must be robust enough to tolerate inaccuracies in the 3D reconstruction of the scene, noise caused by vehicle movement and calibration drifts in the acquisition system. The image acquisition process can be regarded as a perspective transform from the 3D world space to the 2D image space. The inverse transform, which represents a 3D reconstruction of the world from a 2D image, is usually indeterminate (an ill-posed problem) because information is lost in the acquisition mapping. Thus, an important task of video systems is to remove the inherent perspective effect from acquired images [3,4]. This task requires additional spatio-temporal information by means of additional sensors (stereo vision or other types of sensors) or the analysis of temporal information from a sequence of images. Stereo vision and optical flow methods aid the regularization of the inversion process and help recover scene depth. Some of the lane or object detection problems have already been solved, as presented in the next sections.
Others, such as the handling of uncertainty and the fusion of information from different sensors, are still open problems as presented in Section 4 that traces the future trends. In our analysis of video systems we distinguish between two situations. The ﬁrst one is the case in which a static camera observes a dynamic road scene for the purpose of trafﬁc surveillance. In this case, the static camera generally has a good view of the road objects because of the high position of the camera. Therefore, 2D intensity images may contain enough information for the model-based recognition of road objects. The second situation is the case in which one or more vision sensors are mounted on a mobile vehicle that moves in a dynamic road scene. In this case, the vision sensors may not be in the best position for observing a road scene. Then, it is necessary to correlate video information with sensors that provide the actual state of the vehicle, or to combine multisensory data in order to detect road obstacles efﬁciently [2]. Both lane and object detection become quite different in the cases of stationary (trafﬁc monitoring) and moving

camera (automatic vehicle guidance), conceptually and algorithmically. In traffic monitoring, the lane and the objects (vehicles) have to be detected on the image plane, at the camera coordinates. Alternatively, in vehicle guidance, the lane and the object (obstacle) positions must be located in the actual 3D space. Hence, the two cases, i.e. stationary and moving cameras, require different processing approaches, as illustrated in Sections 2 and 3 of the paper. The techniques used for moving cameras can also be used for stationary cameras. Nevertheless, due to their complexity and computational cost, they are not well suited for the relatively simpler applications of stationary video analysis. Research in the field started as early as the 1970s, with the advent of computers and the development of efficient image processing techniques. There is a wealth of methods for either traffic monitoring or terrain monitoring for vehicle guidance. Some of them share common characteristics and some originate from quite diverse approaches. The purpose of this paper is threefold. First, to classify image-processing methods used in traffic applications. Second, to provide the advantages and disadvantages of these algorithms. Third, from this integrated consideration, to attempt an evaluation of shortcomings and general needs in this field of active research. The paper proceeds by considering the problem of automatic lane finding in Section 2 and that of vehicle detection in Section 3. In Section 4 we provide a critical comparison and relate processing algorithms with complete systems developed for specific traffic applications. The paper concludes by projecting future trends and developments motivated by the demands of the field and the shortcomings of the available tools.
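The removal of the perspective effect discussed above can be made concrete with a small sketch. The following is an illustrative flat-road inverse mapping, not an implementation from any of the cited systems; the camera parameters (focal length, principal point, mounting height, pitch) are hypothetical placeholder values for a calibrated pinhole camera looking down a flat road.

```python
import math

def image_to_ground(u, v, f=800.0, cx=320.0, cy=240.0,
                    cam_height=5.0, pitch_deg=10.0):
    """Map an image pixel (u, v) to flat-road coordinates (X, Z) in metres.

    Assumes a pinhole camera mounted cam_height metres above a flat road,
    tilted down by pitch_deg. Camera frame: x right, y down, z forward.
    Returns None for pixels on or above the horizon, whose viewing rays
    never intersect the ground plane (the ill-posed part of the inversion).
    """
    dx = (u - cx) / f                      # ray direction in camera frame
    dy = (v - cy) / f
    t = math.radians(pitch_deg)
    yw = dy * math.cos(t) + math.sin(t)    # downward component after tilt
    zw = math.cos(t) - dy * math.sin(t)    # forward component after tilt
    if yw <= 1e-9:                         # ray parallel to / above horizon
        return None
    s = cam_height / yw                    # scale at which the ray hits the road
    return s * dx, s * zw                  # lateral and longitudinal distance
```

Pixels near the bottom of the image map to nearby road points, while rows approaching the horizon map to rapidly growing distances; this non-uniformity is precisely the perspective effect that such a mapping removes.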

2. Automatic lane finding

2.1. Stationary camera

A critical objective in the development of a road monitoring system based upon image analysis is adaptability. The ability of the system to react to a changing scene while carrying out a variety of goals is a key issue in designing replacements for the existing methods of traffic data collection. This adaptability can only be brought about by a generalized approach to the problem which incorporates little or no a priori knowledge of the analyzed scene. Such a system will be able to adapt to ‘changing circumstances’, which may include the following: changing light levels, i.e. night–day or sunny–cloudy; a deliberately altered camera scene, perhaps altered remotely by an operator; an accidentally altered camera position, i.e. buffeting by the wind or knocks due to foreign bodies; and changing analysis goals, i.e. from traffic flow to counting or occupancy measurement. Moreover, an adaptive system would ease installation of the equipment due to its ability for self-initialization [1]. Automatic lane finding (ALF) is an important task for an adaptive traffic monitoring system.

ALF can assist and simplify the installation of a detection system. It enables the system to adapt to different environmental conditions and camera viewing positions. It also enables applications in active vision systems, where the camera viewing angle and the focal length of the camera lens may be controlled by the system operator to find an optimum view [5].

The aspects that characterize a traffic lane are its visual difference from the environment and the relatively dense motion of vehicles along the lane. Thus, features that can be easily inferred are the lane characteristics themselves (lane markings and/or road edges) and the continuous change of the scene along the lane area. Based on these features, we can distinguish two classes of approaches in lane detection. The first class relates the detection of the lane with the changing intensity distribution along the region of a lane. The second class can be further separated, based on the method of describing the lane characteristics, namely into lane-region detection and lane-border detection (lane markings and road edges). It should be emphasized that the first class considers just changes in the gray-scale values within an image sequence, whereas the second class considers directly the spatial detection of lane characteristics. Two general subclasses involve model-driven approaches, in which deformable templates are iteratively modified to match the road edges, and feature-driven approaches, in which lane features are extracted, localized and combined to meaningful characteristics.

2.2. Moving camera

In the case of automatic vehicle guidance, the lane detection process is designed to (a) provide estimates for the position and orientation of the car within the lane and (b) infer a reference system for locating other vehicles or obstacles in the path of that vehicle. In general, both tasks require two major estimation procedures, one regarding the recognition of the borders of the lane and the second for the prediction of the path of the vehicle. The derivation of the path of the vehicle requires temporal information concerning the vehicle motion, as well as modeling of the state of the car (dynamics and kinematics). This process, however, requires knowledge of the vehicle dynamics, the vehicle suspension, and the performance of the navigation and control systems. Alternatively, the lane recognition task can be based on spatial visual information, at least for the short-range estimation of the lane position, without addressing the problem of motion estimation.

Real-time road segmentation is complicated by the great variability of vehicle and environmental conditions. Changing seasons or weather conditions, time of the day, shadows, dirt on the road, spectral reflection when the sun is at a low angle and man-made changes (tarmac patches used to repair road segments) complicate the segmentation process. Because of these combined effects, robust segmentation is very demanding. Although some systems have been designed to work on completely unstructured roads and terrain, lane detection has generally been reduced to the localization of specific features, such as lane markings painted on the road surface. Several features of structured roads, such as color and texture, have been used to distinguish between road and non-road regions in each individual frame.

Single-frame analysis has been extensively considered not only in monocular but also in stereo vision systems. The approaches used in stereo vision often involve independent processing on the left and right images and projection of the result to the ground plane through the Helmholtz shear equation, making the assumption of a flat road and using piecewise road geometry models (such as clothoids) [7,8]. In a different scheme, the inverse perspective mapping can be used to simplify the process of lane detection. The inverse perspective mapping essentially re-projects the two images onto a common plane (the road plane) and provides a single image with common lane structure.

Certain assumptions facilitate the lane detection task and/or speed-up the processing [6]:

† Instead of processing entire images, a computer vision system can analyze specific regions (the ‘focus of attention’) to identify and extract the features of interest.

† The system can assume a fixed or smoothly varying lane width and thereby limit its search to almost-parallel lane markings, separated by short spatial distances.

† A system can exploit its knowledge of the camera and an assumption of a precise 3D road model (for example, a flat road without bumps) to localize features easier and simplify the mapping between image pixels and their corresponding world coordinates.

In the case of a moving vehicle, the lane recognition process must be repeated continuously on a sequence of frames. A rough prediction of the lane position at subsequent video frames can highly accelerate the lane detection process. In order to accelerate the lane detection process, there is a need to restrict the computation to a reduced region of interest (ROI). There are two general approaches in this direction. The first restricts the search on the predicted path of the vehicle by defining a search region within a trapezoid on the image plane, which is located through the perspective transform, similar to the process of object detection considered in Section 3 [4]. The second approach defines small search windows located at the expected position of the lane. Thus, the estimated lane borders at the previous frame can be expanded, making the lane virtually wider, so that the actual lane borders at the next frame are searched for within this expanded ROI [9]. The latter approach limits the computation-intensive processing of images to simply extracting features of interest. Alternatively, road tracking can facilitate road segmentation based on previous information.
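The second ROI scheme (expanding the previous frame's detected lane borders so that the next frame is searched only inside a 'virtually wider' lane) can be sketched as follows. This is an illustrative fragment, not code from the cited work; the margin value is an arbitrary placeholder.

```python
import numpy as np

def lane_search_roi(prev_left, prev_right, img_width, margin=12):
    """Build per-row search intervals from the previous frame's lane borders.

    prev_left / prev_right hold the column positions of the left and right
    lane borders at each image row, as detected in the previous frame.
    The margin (in pixels, an illustrative choice) makes the lane
    'virtually wider'; the next frame's borders are searched only inside
    the returned [lo, hi] column interval of each row.
    """
    prev_left = np.asarray(prev_left)
    prev_right = np.asarray(prev_right)
    lo = np.clip(prev_left - margin, 0, img_width - 1)   # expand leftwards
    hi = np.clip(prev_right + margin, 0, img_width - 1)  # expand rightwards
    return np.stack([lo, hi], axis=1)                    # one (lo, hi) per row
```

Restricting edge extraction to these intervals is what keeps the per-frame cost low: the expensive operators run on a narrow band rather than on the full image.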

2.3. Automatic lane finding approaches

The fundamental aspects of ALF approaches are considered and reviewed in this section. These approaches are classified into lane-region detection, feature-driven and model-driven approaches.

2.3.1. Lane-region detection

One method of automatic lane finding with a stationary camera can be based upon accumulating a map of significant scene change [5]. The so-called activity map distinguishes between active areas of the scene where motion is occurring (the road) and inactive areas of no significant motion (e.g. verges, central reservation). To prevent saturation and allow adaptation to changes of the scene, the map generation also incorporates a simple decay mechanism, through which previously active areas slowly fade from the map. Once formed, the activity map can be used by a lane finding algorithm to extract the lane positions [1].

The lane-region analysis can also be modeled as a classification problem, which labels image pixels into road and non-road classes based on particular features. A typical classification problem involves the steps of feature extraction, feature decorrelation and reduction, clustering and segmentation. For road segmentation applications, two particular features have been used, namely color and texture [11,12].

In the case of color, the features are defined by the spectral response of the illumination at the red, green and blue bands. At each pixel, the (R,G,B) value defines the feature vector and the classification can be performed directly on the (R,G,B) scatter diagram of the image [12]. The green band contributes very little in the separation of classes in natural scenes, and on the (R,G,B) plane classification can be performed through a linear discriminant function [12]. In more general cases, the classification process can be based on piece-wise linear discriminant functions. Gaussian distributions have been used to model the color classes [11]. Each class is represented by its mean and variance of (R,G,B) values and its a priori likelihood based on the expected number of pixels in each class. One can define many classes representing road and/or non-road segments, in order to account for varying color conditions on the road (shading, reflections, etc.) [12]. The hue, saturation, gray-value (HSV) space has also been used as more effective for classification [14].

The apparent color of an object is not consistent, due to several factors. It depends on the illuminant color, the reflectivity of the object, the illumination and viewing geometry and the sensor parameters. The color of a scene may vary with time, cloud cover and other atmospheric conditions, as well as with the camera position and orientation. Thus, color as a feature for classification requires special treatment and normalization to ensure consistency of the classification results. Moreover, in order to handle shadow interior and boundaries, the color statistics of the road and off-road models need be modified in each class, adapting the process to changing conditions [13].

Besides color, the local texture of the image has been used as a feature for classification [11,12]. The texture of the road is normally smoother than that of the environment. The texture calculation can be based on the amplitude of the gradient operator at each image area. Ref. [11] uses a normalized gradient measure based on a high-resolution and a low-resolution (smoothed) image. Texture classification is performed through stochastic pattern recognition techniques and unsupervised clustering. Texture classification can also be effectively combined with color classification, based on the confidence of the two classification schemes [11].

The road segmentation can also be performed using stochastic pattern recognition approaches, i.e. clustering and segmentation. Since the road surface is poorly textured and differs significantly from objects (vehicles) and background, grey-level segmentation is likely to discriminate the road surface area from other areas of interest; road pixels cluster nicely, distinct from non-road pixels, allowing for region separation in the feature space. Unsupervised clustering on the basis of the C-means algorithm or the Kohonen self-organizing maps can be employed on a 3D input space of features. Two of these features signify the position and the third signifies the grey-level of each pixel under consideration. In essence, the classifier groups together neighboring pixels of similar intensities [16]. The classification step must be succeeded by a region merging procedure, so as to combine similar small regions under a single label. Region merging may utilize other sources of information, such as motion. Along these lines, a map of static regions obtained by simple frame differencing can provide information about the motion activity of neighboring patches candidate for merging [16].

Following the process of lane detection on the image plane, along with the object detection process, the result must be mapped on the road (world) coordinate system for navigation purposes. Once the road has been localized in an image, and by assuming a flat road model, the distance of a 3D-scene point on the road plane can be readily computed if we know the transformation matrix between the camera and the vehicle coordinate systems. In general, the road geometry has to be estimated in order to derive the transformation matrix between the vehicle and the road coordinate systems. The aspects of relative position estimation are further considered in Section 3.

2.3.2. Feature-driven approaches

This class of approaches is based on the detection of edges in the image and the organization of edges into meaningful structures (lanes or lane markings) [17]. This class involves, in general, two levels of processing, i.e. feature detection and feature aggregation. Furthermore, a least squares linear fit can be used to extrapolate lane markings and locate the new search windows at the next frames [10].
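The activity-map idea described above (accumulate significant scene change, with a decay so that previously active areas fade) can be sketched in a few lines. This is an illustrative update step, not the implementation of the cited systems; the threshold, gain and decay rate are arbitrary placeholder values.

```python
import numpy as np

def update_activity_map(activity, prev_frame, frame,
                        diff_thresh=15, gain=1.0, decay=0.98):
    """One update step of an activity map for lane-region finding.

    Grey-level frames are compared by absolute differencing; cells where
    significant change occurs (the road, as vehicles pass) are reinforced,
    while the whole map decays slowly, so previously active areas fade
    and the map adapts to a changing scene.
    """
    moving = np.abs(frame.astype(np.int16) -
                    prev_frame.astype(np.int16)) > diff_thresh
    activity = activity * decay          # decay: old activity slowly fades
    activity[moving] += gain             # accumulate significant change
    return activity
```

Lane positions can then be read off the accumulated map, e.g. by keeping the image columns or regions whose activity exceeds some fraction of the map's maximum.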

The feature detection part aims at extracting intensity discontinuities. To detect edges that are possible markings or road boundaries, a first step of image enhancement is performed, followed by a gradient operator. The dominant edges are extracted based on thresholding of the gradient magnitude and they are refined through thinning operators. The edge detection can be efficiently performed through morphological operators [21–23]. Often in practice, however, the strongest edges are not the road edges. Shadow edges can appear quite strong, highly affecting the line tracking approach, so that the detected edges do not necessarily fit a straight-line or a smoothly varying model. Hence, feature-driven approaches are highly dependent on the methods used to extract features and they suffer from noise effects and irrelevant feature structures. To make the detection more effective, other features have been proposed that capture information about the orientation of the edges. Along these lines, the direction of edges at each pixel can be computed based on the phase of the gradient, and a curvature of line segments can be estimated based on neighborhood relations.

Feature aggregation organizes edge segments into meaningful structures (lane markings) based on short-range or long-range attributes of the lane. Short-range aggregation considers local lane fitting into the edge structure of the image. A realistic assumption that is often used requires that the lane (or the lane marking) width does not change drastically. The approach in Ref. [10] operates on search windows located along the estimated position of the lane markings. For each search window, the edges of the lane marking are determined as the locations of maximum positive and negative horizontal changes in illumination. In order to fit the lane-width model, these edge points are aggregated as boundaries of the lane marking (paint stripe) based on their spacing. At this stage, a circular arc is approximated by a second-order parabola for small to moderate curvatures, with its parameters estimated via a recursive least squares (RLS) filter fit on candidate edge points, to provide an estimated lane-marking location for placing the subsequent search windows. The location of the road markings, along with the state of the vehicle, are used in two different Kalman filters to estimate the near and far-range road geometry ahead of the vehicle [10].

Ref. [4] detects lane markings through a horizontal (linear) edge detector and enhances vertical edges via a morphological operator. It then forms correspondences of edge points to a two-lane road model (three lane markings) and identifies, through a histogram analysis, the most frequent lane width along the image, which should approximate the lane-marking width. For each horizontal line, meaningful edges of the video image are thus located at a certain distance apart. All pairs of edge pixels (along each horizontal line) that fall within some limits around this width are considered as lane markings, and corresponding points on different scan lines are aggregated together as lines of the road. The detected lanes at near-range are extrapolated to far-range via a linear least-squares fit. A similar approach is used in Ref. [14] for road boundaries and lane markings. Simpler linear models are used in Ref. [18] for auto-calibration of the camera module.

Long-range aggregation is based on a line intersection model: gross road boundaries and markings must be directed towards a specific point in the image, the focus of expansion (FOE) of the camera system. Ref. [20] detects brightness discontinuities and retains only long straight lines that point toward the FOE; this is done to preserve only straight lines that point towards the specific direction of the FOE. The feature aggregation is then performed through correlation with a synthetic image that encodes the road structure for the specific FOE.

The Road Markings Analysis (ROMA) system is based on aggregation of the gradient direction at edge pixels in real-time [19]. For each edge point, it preserves the edge direction and the neighboring line curvature and performs a first elimination of edges based on thresholding of the direction and curvature. Then, along these directions, it employs a contour following algorithm based on the range of acceptable gradient directions. This range is adapted in real-time to the current state variables of the road model. The system can cope with discontinuities of the road borders and can track road intersections.

2.3.3. Model-driven approaches

In model-driven approaches the aim is to match a deformable template, defining some scene characteristic, to the observed image, so as to derive the parameters of the model that match the observations. The deformable template introduces a priori information, whose parameters must be estimated. Prior knowledge of the road geometry imposes strong constraints on the likely location and orientation of the lanes. The pavement edges and lane markings are often approximated by circular arcs on a flat-ground plane, based on the assumption of smooth road curvature. Bayesian optimization procedures are often used for the estimation of the model parameters, whereas the feature vectors are used to compute the likelihood probability. The LANA algorithm [24] uses frequency-domain features rather than features directly related to the detected edges. These feature vectors are used along with a deformable-template model of the lane markers in a Bayesian estimation setting; they capture the relevant edge information but are not affected drastically by extraneous edges. The parameters of the deformable template are estimated by optimizing the resulting maximum a posteriori objective function [24]. The estimation can be performed on the image plane [27] or on the ground plane [24] after the appropriate perspective mapping. More flexible approaches have been considered in Refs. [25,26], using snakes and splines to model road segments. In contrast to other deformable line models, Ref. [26] uses a spline-based model that describes the perspective effect of parallel lines, considering simultaneously both-side borders of the road lane.
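The short-range fitting step discussed above, where a circular arc is approximated by a second-order parabola fitted to candidate edge points, can be sketched as follows. For simplicity this uses a batch least-squares fit in place of the recursive (RLS) formulation of the cited system; coordinates and the helper names are illustrative.

```python
import numpy as np

def fit_lane_parabola(rows, cols):
    """Fit x = a*y**2 + b*y + c to candidate lane-marking edge points.

    For small to moderate curvatures a circular arc is well approximated
    by this second-order parabola. rows (y) and cols (x) are the image
    coordinates of edge points accepted inside the search windows.
    Returns (a, b, c); an RLS filter would update the same parameters
    point-by-point instead of refitting in one batch.
    """
    a, b, c = np.polyfit(np.asarray(rows, float),
                         np.asarray(cols, float), deg=2)
    return a, b, c

def lane_column(coeffs, y):
    """Predicted lane-marking column at image row y, used to place
    the search window for the next frame."""
    a, b, c = coeffs
    return a * y * y + b * y + c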

or piecewise constant speed. for instance. the horizontal road curvature changes slowly (almost linearly) and the vertical curvature is insigniﬁcant. The Kalman ﬁltering algorithm is employed in Ref. such as the hill-and-dale and the zero-bank models have been considered for road geometry reconstruction [12. Even more unstructured road geometry is studied in Ref. However. The previous model assumes no vertical curvature and no vertical deviation of the camera with respect to the road. because on real highway trafﬁc scenes the feature extraction procedure almost always returns a number of extraneous features that are not part of the lane structure. the lane tracker applies a robust ﬁtting procedure using the Hough transform. where the estimation of the 3D structure is also possible.29]. Stationary camera In road trafﬁc monitoring. The road model consists of skeletal lines
pieced together from clothoids (i. [34]. the road boundaries are parallel with constant width. the curvature parameters and their association with the ego-motion of the camera can be formulated into a compact system of differential equations. The 3D model of the road can also be used in modeling the road parameters through differential equations that relate motion with spatial changes. i. In automatic vehicle guidance. They are placed on posts above the ground to obtain optimal view of the road and the passing vehicles. The Helmholtz shear equation is used to verify that candidate lane markers actually lie on the ground plane [28]. and forces the road model to move up or down from the ﬂat-road plane so as to retain a constant road width. Kastrinaki et al. Thus. Thus. the camera heading relative to the road direction. [28] the lane tracker predicts where the lane markers should appear in the current image based on its previous estimates of the lane position.
. Once a set of candidate lane markers has been recovered. Assuming slow speed changes. / Image and Vision Computing 21 (2003) 359–381
Model-based approaches for lane finding have been extensively employed in stereo vision systems. Such approaches assume a parametric model of the lane geometry, and a tracking algorithm estimates the parameters of this model from feature measurements in the left and right images [28]. The road assumptions define a general highway scene, where the ground plane is flat, the width of the lane markers in the image changes linearly as a function of the distance from the camera (or the location of the image row considered), and the temporal change of curvature is linearly related to the speed of the vehicle. These assumptions imply a flat-road geometry model, which is of limited use in practice. The lane markers are modeled as white bars of a particular width against a darker background. Regions in the image that satisfy this intensity profile can be identified through a template matching procedure, in both the left and right images; thus, different templates are used at different image locations along the length of the road. The system then extracts possible lane markers from the left and right images. These feature measurements are passed to a robust estimation procedure, which recovers the parameters of the lane along with the orientation and height of the stereo rig with respect to the ground plane. A robust fitting strategy is absolutely essential in traffic applications, since extra features can come from a variety of sources: other vehicles on the highway, shadows or cracks in the roadway, etc.

More elaborate road models relax the flat-road assumption. The hill-and-dale model uses the flat-road model for the two roadway points closest to the vehicle in the image. The zero-bank assumption models the road as a space ribbon generated by a central line-spine and horizontal line-segments of constant width cutting the spine at their midpoint at a normal to the spine's 3D direction. Other rigorous models describe the road borders through arcs with constant curvature change over their run length, providing a dynamic model for these parameters. Such approaches using state-variable estimation (Kalman filtering) are developed in Refs. [30,31]. The location of the road boundaries in the image is determined by three state variables, including the vehicle lateral offset from the lane center and the horizontal road curvature. A similar scheme is employed in Ref. [32] to estimate the state-variables of the road and reconstruct the 3D location of the road boundaries, where all local road parameters are involved in the state-variable estimation process.

Another class of model-driven approaches involves the stochastic modeling of lane parameters and the use of Bayesian inference to match a road model to the observed scene. The position and configuration of the road can be considered as variables to be inferred from the observation and the a posteriori probability conditioned on this observation [25,33]. This requires the description of the road using small segments and the derivation of probability distributions for the relative positions of these segments on regular road scenes (prior distribution on road geometry). Moreover, it requires the specification of probability distributions for observed segments, obtained using an edge detector on the observed image, conditioned on the possible positions of the road segments (a posteriori distribution of segments). Such distributions can be derived from test data [29]. In this form, inference amounts to finding the set of model parameters which best match the observed data [28].

Model-driven approaches provide powerful means for the analysis of road edges and markings. Nevertheless, the use of a model has certain drawbacks, such as the difficulty in choosing and maintaining an appropriate model for the road structure, the inefficiency in matching complex road structures and the high computational complexity.

3. Object detection

In traffic monitoring the video acquisition cameras are stationary, whereas in automatic vehicle guidance the cameras are moving with the vehicle. In both applications it is essential to analyze the dynamic change of the environment and its contents, as well as the dynamic change of the camera itself. In general, object detection from a stationary camera is simpler in that it involves fewer estimation procedures.
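The white-bar intensity profile used for lane-marker template matching above can be made concrete with a small sketch: a one-dimensional dark–bright–dark template is correlated along an image row, with the bar width growing linearly with the row (distance) index. All sizes, intensity values and the 0.8 correlation threshold below are illustrative assumptions, not parameters of the cited systems.

```python
import numpy as np

def marker_template(width, margin=2):
    """Zero-mean 1D template: a bright bar of `width` pixels with darker margins."""
    t = np.full(width + 2 * margin, -1.0)
    t[margin:margin + width] = 1.0
    return t - t.mean()

def find_lane_markers(row, width, thresh=0.8):
    """Columns where the row correlates strongly with the bar template."""
    t = marker_template(width)
    n = len(t)
    hits = []
    for c in range(len(row) - n + 1):
        w = row[c:c + n].astype(float)
        w = w - w.mean()
        denom = np.linalg.norm(w) * np.linalg.norm(t)
        if denom > 0 and np.dot(w, t) / denom > thresh:
            hits.append(c + n // 2)  # centre of the matched window
    return hits

# Synthetic road rows: the marker width grows linearly with the row index.
H, W = 8, 64
img = np.full((H, W), 50.0)
for r in range(H):
    img[r, 30:30 + 2 + r] = 200.0          # assumed linear width model
hits = find_lane_markers(img[4], width=6)  # row 4 -> expected width 2 + 4 = 6
```

Because the normalized correlation is scale- and offset-invariant, the same template tolerates moderate lighting changes; only the assumed width must be adapted per row.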

3.1. Stationary camera

In traffic monitoring the video acquisition cameras are stationary. Using a sequence of images, the detection principle is based essentially on the fact that the objects to be searched for are in motion. Thus, the detection deals mainly with the analysis of variations in time of one and the same pixel, rather than with the information given by the environment of a pixel in one image [35]. These methods prioritize temporal characteristics compared with spatial characteristics. Several approaches considering different aspects of object and motion perception from a stationary camera are considered in Section 3.3.

3.2. Moving camera

Autonomous vehicle guidance requires the solution of different problems at different abstraction levels. The vision system can aid the accurate localization of the vehicle with respect to its environment, using visual cues and a priori knowledge about the scene. We can identify two major problems with the efficient recognition of the road environment, namely the restricted processing time for real-time applications and the limited amount of information from the environment. For efficient processing we need to limit the ROI within each frame and process only relevant features within this ROI instead of the entire image. Since the scene in traffic applications does not change drastically, the prediction of the ROI from previously processed frames becomes of paramount importance. Several efficient methods presented in the following are based on dynamic scene prediction using motion and road models.

The problem of the limited amount of information in each frame stems from the fact that each frame represents a non-invertible projection of the dynamically changing 3D world onto the camera plane. It is possible from monocular vision to extract certain 3D information from a single 2D-projection image, which could, however, be easily misinterpreted. In such systems, obstacle determination is limited to the localization of vehicles by means of a search for specific patterns, possibly supported by other features such as shape, symmetry, or the use of a bounding box [38–40]. Since single frames encode only partial information, the systems for autonomous vehicle guidance require additional information in the form of a knowledge-base that models the 3D environment and its changes (self/ego motion or relative motion of other objects).

A first source of additional information is the temporal evolution of the observed image. The most common techniques involve processing two or more images: stereo images acquired simultaneously from different points of view [37], or multiple images acquired at different times [36]. Stereo image techniques identify the correspondences between pixels in the different images. Optical-flow-based techniques detect obstacles indirectly by analyzing the velocity field; the joint consideration of a frame sequence provides meaningful constraints of spatial features over time, or vice versa. For instance, Ref. [42] employs smoothness constraints on the motion vectors, which are imposed by the gray-scale spatial distribution. Such constraints convey the realistic assumption that compact objects should preserve smoothly varying displacement vectors. Stereovision has advantages in that it can detect obstacles directly and, unlike optical-flow-field analysis, is not constrained by speed. True 3D modeling, however, is not possible with monocular vision and single-frame analysis.

Initial approaches in this field involve spatial, temporal and spatio-temporal analysis of video sequences. The availability of only partial information in 2D images necessitates the use of robust approaches able to infer a complete scene representation from only partial representations. Essentially, one must handle differences between the representation of the acquired data and the projected representation of the models to be recognized. This problem concerns the matching of a low-abstraction image to a high-abstraction and complexity object, by means of matching observations (acquired images) over time, matching a single observation to a road model, or even matching a sequence of observations to a dynamic model. A priori knowledge is necessary in order to bridge the gap between these two representations [41]; matching can also relate observations (possibly from different sensors or information sources) and models at different abstraction levels (or projections) [41]. Two general directions have been proposed. The first one considers the dynamic matching of low-abstraction (2D image-level) features between the data and the model: it matches the observations with the expected projection of the world onto the camera system and propagates the error for correcting the current (model) hypothesis [31]. Although it keeps continuous track of changes in the 3D model using both road and motion modeling (features in a 3½D space), it propagates the current 2D representation of the model in accordance with the current state of the camera with respect to the road [44]. The initial form of integrated spatio-temporal analysis operates on a so-called 2½D feature space, where 2D features are tracked in time, which enables the tracking of features over time. This possibility paved the way to fully integrated spatio-temporal processing, where objects are treated as 3D motion processes in space and time. More advanced and effective approaches consider object modeling and tracking using state-space estimation procedures for matching the model to the observations and for estimating the next state of the object. With the latest advances in computer architecture and hardware, it becomes possible to consider even the dynamic modeling of 3D objects. Additional constraints can be imposed through the consideration of 3D models for the construction of the environment (full 3D space reconstruction) and the matching of 2D data (observations) with the 3D representation of these models. In other words, forward projection of 3D models and matching with 2D observations is used to derive the structure and location of obstacles, or their projection on the camera coordinates (pose estimation problem). Such model information enables the consideration and matching of relative object poses [43]. The second direction uses a full 4D model. Geometric shape descriptors together with generic models for motion form the basis for this integrated (4D or dynamic vision) analysis [45]. Based on this representation one can search for features in the 4D-space [45].

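The stationary-camera principle above — analyzing the variation in time of one and the same pixel — can be sketched with a per-pixel temporal statistic. The choice of the standard deviation, the factor k and the synthetic sequence are illustrative assumptions.

```python
import numpy as np

def temporal_motion_mask(frames, k=3.0):
    """Flag pixels whose gray value varies over time well above the typical level.

    frames: (T, H, W) stack recorded by a stationary camera.
    """
    var = frames.astype(float).std(axis=0)   # per-pixel temporal variation
    return var > k * np.median(var + 1e-9)   # robust, global threshold

# Static 16x16 background; one 4x4 patch brightens over time ("passing vehicle").
T = 10
frames = np.full((T, 16, 16), 40.0)
for t in range(T):
    frames[t, 6:10, 6:10] = 40.0 + 15.0 * t
mask = temporal_motion_mask(frames)
```

Note that the decision at each pixel uses only that pixel's history, not its spatial neighborhood — exactly the trait (and the limitation) of this class of methods.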
3.3. Object detection approaches

Some fundamental issues of object detection are considered and reviewed in this section. Approaches have been categorized according to the method used to isolate the object from the background on a single frame or a sequence of frames. Some relevant approaches for moving object detection from a moving camera are summarized in the last subsections. The evolution of these techniques and their abilities is summarized in Table 1, which is further discussed in the conclusion.

3.3.1. Thresholding

This is one of the simplest, but less effective, techniques. It is based on the notion that vehicles are compact objects having different intensity from their background. Thus, by thresholding intensities in small regions we can separate the vehicle from the background. This approach depends heavily on the threshold used, which must be selected appropriately for a certain vehicle and its background. Adaptive thresholding can be used to account for lighting changes, but cannot avoid the false detection of shadows or the missed detection of parts of the vehicle with similar intensities as its environment [46]. To aid the thresholding process, binary mathematical morphology can be used to aggregate close pixels into a unified object [47]. Alternatively, gray-scale morphological operators have been proposed for object detection and identification that are insensitive to lighting variation [48].

3.3.2. Multigrid identification of regions of interest

A method of directing attention to regions of interest based on multiresolution images is developed in Ref. [5]. This method first generates a hierarchy of images at different resolutions. Compact objects that differ from their background remain distinguishable in the low-resolution image, whereas noise and small intensity variations tend to disappear at this level. Thus, the low-resolution image can immediately direct attention to the pixels that correspond to such objects in the initial image. A region search begins at the top level (coarse to fine), where each pixel of interest is selected according to some interest function, which may be a function of the intensity values of its adjacent pixels.

3.3.3. Edge-based detection (spatial differentiation)

Approaches in this class are based on the edge-features of objects. Edge-based vehicle detection is often more effective than other background removal or thresholding approaches, since the edge information remains significant even in variations of ambient lighting [55]. In traffic scenes, the results of an edge detector generally highlight vehicles as complex groups of edges, whereas road areas yield relatively low edge content. Thus, the presence of vehicles may be detected by the edge complexity within the road area, which can be quantified through analysis of the histogram [51]. Morphological edge-detection schemes have been extensively applied, since they exhibit superior performance [4]. They can be applied to single images to detect the edge structure of even still vehicles [49], or to successive frame differences for motion analysis [5]. In order to detect the vehicle from its background, the algorithm must identify relevant features (often line segments) and define a grouping strategy that allows the identification of feature sets, each of which may correspond to an object of interest (e.g. potential vehicle or road obstacle). Vertical edges are more likely to form dominant line segments corresponding to the vertical boundaries of the profile of a road obstacle. Moreover, a dominant line segment of a vehicle must have other line segments in its neighborhood that are detected in nearly perpendicular directions. Symmetry provides an additional useful feature for relating these line segments, since vehicle rears are generally contour- and region-symmetric about a vertical central line [54]. Thus, the detection of vehicles and/or obstacles can simply consist of finding the rectangles that enclose the dominant line segments and their neighbors in the image plane [2,18,30]. Subsequently, the edges can be grouped together to form the vehicle's boundary. To improve the shape of object regions, Refs. [52,53] employ the Hough transform to extract consistent contour lines and morphological operations to restore small breaks on the detected contours.

3.3.4. Space signature

In this detection method, which operates on still images, the objects to be identified (vehicles) are described by their characteristics (forms, dimensions, luminosity), which allow identification in their environment [56,57]. Ref. [57] employs a logistic regression approach using characteristics, such as edge strength, extracted from the vehicle signature. Alternatively, the space signatures are defined in Ref. [58] by means of the vehicle outlines projected from a certain number of positions (poses) on the image plane from a certain geometrical vehicle model. A camera model is employed to project the 3D object model onto the camera coordinates at each expected position. The linear edge segments on each observed image are then matched to the model by evaluating the presence of attributes of an outline for each of the pre-established object positions (poses).
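The thresholding-plus-morphology scheme of Section 3.3.1 can be sketched in a few lines: threshold the intensities, then aggregate nearby fragments with a 3×3 binary closing. The threshold value and the synthetic image are illustrative assumptions.

```python
import numpy as np

def threshold_and_aggregate(img, thresh):
    """Threshold intensities, then aggregate nearby pixels with a 3x3 binary closing."""
    mask = img > thresh

    def dilate(m):
        out = m.copy()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out |= np.roll(np.roll(m, dy, axis=0), dx, axis=1)
        return out  # np.roll wraps at borders; fine while objects stay inside

    def erode(m):
        return ~dilate(~m)

    return erode(dilate(mask))  # closing: fills small gaps between fragments

# A bright "vehicle" split into two fragments one column apart on a dark road.
img = np.full((12, 12), 30.0)
img[4:8, 2:5] = 180.0
img[4:8, 6:9] = 180.0  # gap at column 5
mask = threshold_and_aggregate(img, thresh=100.0)
```

The closing fills the one-pixel gap so the two thresholded fragments emerge as a single compact region, which is exactly the aggregation effect attributed to binary morphology above.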

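The edge-complexity cue of Section 3.3.3 — vehicles produce dense edge content while empty road yields almost none — can be sketched as a gradient-magnitude density inside a road region of interest. The gradient operator, the threshold and the synthetic scene are illustrative assumptions.

```python
import numpy as np

def edge_density(img, roi, grad_thresh=20.0):
    """Fraction of strong-gradient pixels inside a road region of interest."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    r0, r1, c0, c1 = roi
    return float((mag[r0:r1, c0:c1] > grad_thresh).mean())

road = np.full((20, 20), 90.0)   # empty road: flat intensity, no edges
occupied = road.copy()
occupied[8:14, 8:14] = 200.0     # vehicle body introduces strong edges
empty_score = edge_density(road, (5, 15, 5, 15))
occupied_score = edge_density(occupied, (5, 15, 5, 15))
```

A simple decision rule would declare the road area occupied whenever the density exceeds a calibrated fraction, mirroring the histogram-based quantification cited above.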
In a related form, Ref. [59] projects the 3D model at different poses to sparse 2D arrays; these arrays are used for matching with the image data. Space signatures can also be identified in an image through correlation or template matching techniques, using directly the typical gray-scale signature of vehicles [60]. Due to the inflexible nature of template matching, however, a specific template must be created for each type of vehicle to be recognized. The template mask assumes that there is little change in the intensity signature of vehicles. This creates a problem, since there are many geometrical shapes for vehicles contained in the same vehicle-class. In fact, changes in ambient lighting, occlusion, shadows and severe light reflection on the vehicle body panels generate serious variation in the spatial signatures of same-type vehicles. To overcome this problem, the TRIP II system [58,61] employs neural networks for recalling space signatures, essentially encoding information about the projected edges, and exploits their ability to interpolate among different known shapes [62]. In a similar framework, the adaptable time-delay neural network developed for the Urban Traffic Assistant (UTA) system is designed and trained for processing complete image sequences [71]. The network is applied for the detection of general obstacles in the course of the UTA vehicle. Despite its inefficiencies, vehicle detection based on sign patterns does not require high computational effort. Moreover, it enables the system to deal with the tracking process and keep the vehicle in track by continuously sensing its sign pattern in real time. Thus, the one task can benefit from the results of the other in terms of reducing the overall computational complexity and increasing the robustness of analysis [70].

3.3.5. Time signature

This method encodes the intensity profile of a moving vehicle as a function of time. The profile is computed at several positions on the road as the average intensity of pixels within a small window located at each measurement point. The time signal of light intensity on each point is analyzed by means of a model with pre-recorded and periodically updated characteristics. The analysis of the time signature recorded on these points is used to derive the presence or absence of vehicles [69]. Spatial correlation of time signatures allows further reinforcement of detection. Moreover, the joint consideration of spatial and time signatures provides valuable information for both object detection and tracking.

3.3.6. Inter-frame differencing

This is the most direct method for making immobile objects disappear and preserving only the traces of objects in motion between two successive frames. Thresholding is performed in order to obtain presence/absence information of an object in motion [5,35,38]. The immediate consequence is that stationary or slow-moving objects are not detected. Inter-frame differencing thus provides a crude but simple tool for estimating moving regions. The inter-frame difference succeeds in detecting motion when temporal changes are evident; it fails, however, when the moving objects are not sufficiently textured and preserve uniform regions with the background. To overcome such problems, the inter-frame difference is described using a statistical framework, often employing spatial Markov random fields [64–66]. In Ref. [64] the inter-frame difference is modeled through a two-component mixture density; the two components are zero-mean, corresponding to the static (background) and changing (moving object) parts of the image. The resulting mask of moving regions can be further refined with color segmentation [68], or with accurate motion estimation by means of optical flow estimation and optimization of the displaced frame difference [16,67]. This process can be complemented with background frame differencing to improve the estimation accuracy [67].

3.3.7. Background frame differencing

In the preceding methods the image of motionless objects (background image) is insignificant. On the contrary, this method is based on forming a precise background image and using it for separating moving objects from their background. The background image is specified either manually, by taking an image without vehicles, or is detected in real time by forming a mathematical or exponential average of successive images. The detection is then achieved by means of subtracting the reference image from the current image. In practice, the background can change significantly with shadows cast by buildings and clouds, or simply due to changes in lighting conditions. With these changing environmental conditions, the background frame is required to be updated regularly. There are several background updating techniques; the most commonly used are averaging and selective updating. In averaging, the background is built gradually by taking the average of the previous background with the current frame; if we form a weighted average between the previous background and the current frame, the background is built through exponential updating [63]. In selective updating, the background is replaced by the current frame only at regions with no motion detected, where the difference between the current and the previous frames is smaller than a threshold [63]. Selective updating can be performed in a more robust averaging form, where the stationary regions of the background are replaced by the average of the current frame and the previous background [50].

3.3.8. Feature aggregation and object tracking

These techniques can operate on the feature space to either identify an object, or track characteristic points of the object [32].
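The background differencing and selective exponential updating just described can be sketched as follows; the blending factor, thresholds and synthetic frames are illustrative assumptions.

```python
import numpy as np

def detect_moving(frame, bg, thresh=25.0):
    """Background frame differencing: mask of pixels far from the reference image."""
    return np.abs(frame.astype(float) - bg) > thresh

def update_background(bg, frame, alpha=0.1, motion_thresh=25.0):
    """Selective exponential updating: blend only where no motion is detected."""
    static = np.abs(frame.astype(float) - bg) < motion_thresh
    out = bg.copy()
    out[static] = (1.0 - alpha) * bg[static] + alpha * frame[static]
    return out

bg = np.full((10, 10), 60.0)   # reference (empty-road) image
frame = bg.copy()
frame[3:6, 3:6] = 200.0        # moving vehicle
mask = detect_moving(frame, bg)
bg2 = update_background(bg, frame)  # vehicle pixels are NOT absorbed into bg
```

Restricting the blend to static regions is what keeps slow-moving vehicles from being gradually "learned" into the reference image, while still letting the background track gradual lighting changes.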

They are often used in object detection to improve the robustness and reliability of detection and reduce false detection rates. In most cases, pixels of interest are segmented prior to matching using background removal, edge detection or inter-frame difference. Essentially, this operation can be interpreted as a pattern recognition task. The aggregation step handles features previously detected; the features are aggregated with respect to the vehicle's geometrical characteristics, in order to find the vehicles themselves or the vehicle queues (in case of congestion). Two general approaches have been employed for feature aggregation, namely motion-based and model-based approaches [64]. Motion-based approaches group together visual motion consistencies over time [64,72,73]. Feature-based approaches consider the organization (clustering) of pixels into crude object structures in each frame and subsequently compute motion vectors by matching these structures in the sequence of frames. Motion estimation is only performed at distinguishable points, such as corners [72,73], along contours of segmented objects [75], or within segmented regions of similar texture [14,76]. Model-based approaches match the representations of objects within the image sequence to 3D models or their 2D projections from different directions (poses) [44]. Several model-based approaches have been proposed employing simple 2D region models (mainly rectangles), active contours and polygonal approximations for the contour of the object, 3D models that can be tracked in time, and 4D models for full spatio-temporal representation of the object [73,74]. In general, features for tracking encode boundary (edge-based) or region (object motion, texture or shape) properties of the tracked object. Active contours, such as snakes and geodesic contours, are often employed for the description of boundaries and their evolution over the sequence of frames. For region-based features, tracking is based on correspondences among the associated target regions at different time instances [64,67,70]. Line segments or points can also be tracked in the 3D space by estimating their 3D displacements via a Kalman filter designed for depth estimation [18,73].

Following the detection of features, the objects are tracked. Two alternative methods of tracking are employed in Ref. [32], namely numeric signature tracking and symbolic tracking. In signature tracking, a set of intensity- and geometry-based signature features is extracted for each detected object. These features are correlated in the next frame to update the location of the objects. Next, the signatures are updated to accommodate for changes in range, perspective and occlusion. In symbolic tracking, objects are independently detected in each frame and a symbolic correspondence is made between the sets of objects detected in a frame pair. A time-sequenced trajectory of each matched object provides a track of the object [32].

3.3.9. Optical flow field

Approaches in this class exploit the fact that the appearance of a rigid object changes little during motion. The optical flow field u(x, t) is computed by mapping the gray-value g(x − uΔt, t − Δt), recorded at time t − Δt at the image point x − uΔt, onto the gray-value g(x, t) recorded at location x at time t. The optical flow field encodes the temporal displacement of observable gray-scale structures within an image sequence. It comprises information not only about the relative displacement of pixels, but also about the spatial structure of the scene. Various approaches have been proposed for the efficient estimation of the optical flow field [42,64,72,78–80]. In general, they can be characterized as (i) gradient-based, (ii) correlation-based, (iii) feature-based and (iv) multigrid methods.

Gradient-based techniques compute the flow on a pixel-by-pixel basis through the temporal gradient of the image sequence. In general, the intensity variations alone do not provide sufficient information to completely determine both components (magnitude and direction) of the optical flow field u(x, t) [81]. Smoothness constraints facilitate the estimation of optical flow fields even for areas with constant or linearly distributed intensities [78–80,82]. Gradient-based techniques yield poor results for poor-texture images and in the presence of shocks and vibrations [83]. Correlation-based techniques search for the shift around each pixel that maximizes the correlation of gray-level patterns between two consecutive frames; they usually derive more accurate results, but such procedures are quite expensive in terms of computational complexity. Attempts to speed up the computation at the cost of resolution often imply subsampling of the image and computation of the motion field at fewer image points [83]. The accuracy of these techniques is affected by sensor noise (quantization), algorithmic disturbances and, more importantly, perspective distortions and occlusion resulting from typical camera positions.

A robust feature-based method for the estimation of optical flow vectors has been developed by Kories and Zimmermann [84]. Each frame is first subjected to a bandpass filter. Blobs representing local maxima and minima of the gray-level are identified as features, and the centroids of the detected blobs are tracked through subsequent frames. A related technique is considered in Ref. [85]. Under such conditions, the methods are suitable for on-line qualitative monitoring, operating at much faster speeds than human operators and without the problem of limited attention spans [85]. Multigrid methods are designed for fast estimation of the relevant motion vectors at low resolution and hierarchical refinement of the motion flow field at higher resolution levels [86]. The multigrid approach in Ref. [5] relies upon the organization of similar pixel-intensities into objects: it identifies object structures at low-resolution levels, where it also computes a crude estimate of the motion field from the low-resolution image sequence, and the motion vector field is then refined hierarchically at higher resolution levels.
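A minimal gradient-based sketch illustrates the least-squares idea behind this class of techniques (a Lucas–Kanade-style patch solver is used here as a representative example; patch size and the synthetic frames are illustrative assumptions).

```python
import numpy as np

def lucas_kanade_patch(f0, f1):
    """Least-squares flow (u, v) for one patch from spatial/temporal gradients.

    Solves  min over (u, v) of  sum (Ix*u + Iy*v + It)^2  on the whole patch.
    """
    Iy, Ix = np.gradient(f0.astype(float))   # note np.gradient's axis order
    It = f1.astype(float) - f0.astype(float)
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    sol, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return sol[0], sol[1]                    # horizontal, vertical displacement

# Intensity ramp translated one pixel to the right between the two frames.
x = np.arange(16, dtype=float)
f0 = np.tile(x, (16, 1))
f1 = np.tile(x - 1.0, (16, 1))               # f1(col) = f0(col - 1)
u, v = lucas_kanade_patch(f0, f1)
```

On a texture-free (constant) patch the system above becomes rank-deficient and the flow is undetermined — precisely the failure mode for poor-texture images noted in the text, and the reason smoothness constraints are imposed.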

3.3.10. Motion parallax

When the camera is moving forward towards an object, its image moves differently from the immediate background: the object's projection on the 2D image plane also moves relative to the image coordinate system, so that the motion of points on the same object appears different relative to the background. This difference is called motion parallax [87]. The detection of moving objects in image sequences taken from a moving camera becomes much more difficult due to the camera motion; for a still camera, moving objects are readily identified by thresholding the optical flow field. If a camera is translating through a stationary environment, the directions of all optical-flow vectors intersect at one point in the image plane, the focus of expansion or the epipole [81]. These displacements due to camera motion form a radial field centered at the epipole. Independently moving objects can be recovered by verifying whether the displacement at any given point is directed away from the epipole [91]. If another moving object becomes visible by the translating camera, the optical flow field resulting from this additional motion will interfere with the optical flow field of the ego-motion. This interference can be detected by testing if the calculated optical-flow vectors have the same direction as the estimated ego-motion model vectors [81].

The detection of obstacles from a moving camera based on the optical flow field is generally divided into two steps. The ego-motion is first computed from the analysis of the optical flow. Then, moving or stationary obstacles are detected by analyzing the difference between the expected and the real velocity fields [36]. The estimation of ego-motion can be based on parametric models of the motion field. When the car bearing the camera is moving in a stationary environment along a flat road and the camera axis is parallel to the ground, the motion field (due to ego-motion) is expected to have an almost quadratic structure [83]. For more general motion of the camera, e.g. motion on a planar road, at most eight parameters can characterize the motion field. These parameters can be estimated by optimizing an error measure on two subsequent frames using a gradient-based estimation approach [66,87]. The optimization process is often applied on a multiresolution representation of the frames, to provide robust performance of the algorithm [90]. The estimated fields are re-projected to the 3D road coordinate system using a model of the road (usually a flat straight road) [88,89].

If we compensate the ego-motion of the camera, then differences observed on the motion vectors can be used to derive information regarding the objects moving within the scene, and independently moving (or stationary) obstacles can be readily detected. If we use the displacement field of the road to displace the object, all points in the image that are not on the ground plane will be erroneously predicted. If an object extends vertically from the ground plane, a clear difference between the predicted and the actual position of the object is experienced. Thus, the prediction error (above an acceptable threshold) indicates locations of vertically extended objects in the scene [87]. When the scene is piecewise planar, or is composed of a few distinct portions at different depths, the ego-motion can be estimated in layers of 2D parametric motion estimation. Each layer estimates motion at a certain depth due to the camera and removes the associated portions of the image. Image regions that cannot be aligned in two frames at any depth are segmented into independently moving objects [90]. For planar motion with no parallax (no significant depth variations), the ego-motion effect can be decomposed into the planar and the parallax parts; after compensating for the planar 2D motion, the residual parallax displacements in two subsequent frames are primarily due to translational motion of the camera.

The problem of recovering the optical flow from time-varying image sequences is ill-posed, and additional constraints must often be imposed to derive satisfactory solutions. Smoothness constraints stem from the fact that uniformly moving objects possess slightly changing motion fields. Such constraints have been used in a joint spatio-temporal domain of analysis [92]. Ref. [93] first calculates the optical flow and, after smoothing the displacement vectors in both the temporal and the spatial domains, merges regions of relatively uniform optical flow. It employs a voting process over time in each spatial location regarding the direction of the displacement vectors, to derive consistent trends in the evolution of the optical flow field and, thus, define consistently moving objects. Ref. [94] starts from similarity in the spatial domain, similar to the feature-based approaches. It defines patches of similar spatial characteristics in each frame and uses local voting over the output of a correlation-type motion detector to detect moving objects. Similar flow vectors are grouped together and compared to the spatial features, in order to verify not only temporal but also spatial consistency of detected moving objects. It also uses the inverse perspective mapping to eliminate motion effects on the ground plane due to the ego-motion of the camera [94]. In a similar form, characteristic features (such as corners and edges) can be defined for each frame and matched between the present and the previous frame to derive a list of flow vectors. A related approach is used in the ACTIONS system, where the optical flow vectors are clustered in order to incrementally create candidate moving-objects in the picture domain [81].
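The epipole consistency test described above — ego-motion flow must point radially away from the focus of expansion, while independent movers violate this — can be sketched as a cosine test. The FOE location, points, flows and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np

def consistent_with_foe(points, flows, foe, cos_thresh=0.9):
    """True where the flow at a point is directed away from the focus of expansion."""
    radial = points - foe                      # expected radial flow direction
    cos = np.sum(radial * flows, axis=1) / (
        np.linalg.norm(radial, axis=1) * np.linalg.norm(flows, axis=1) + 1e-12)
    return cos > cos_thresh

foe = np.array([32.0, 32.0])                   # assumed focus of expansion
points = np.array([[40.0, 32.0], [32.0, 50.0], [20.0, 32.0]])
flows = np.array([[2.0, 0.0],                  # away from FOE: ego-motion
                  [0.0, 3.0],                  # away from FOE: ego-motion
                  [4.0, 0.0]])                 # towards FOE: independent mover
ok = consistent_with_foe(points, flows, foe)
```

Points flagged inconsistent are candidates for independently moving objects; in practice the FOE itself must first be estimated from the dominant (ego-motion) flow.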

Stereo vision

The detection of stationary or moving objects in traffic applications has also been considered through stereo vision systems. Under the assumption of a flat road, stereo vision can be effectively used for the reconstruction of the 3D space ahead of the vehicle. The reconstruction is based on correspondences between points in the left and right images; the disparity between matched points in the two stereo images relates directly to the distance of the actual 3D location from the cameras. The approach in the Path project [28] considers the matching of structural characteristics (vertical edges), where candidate matches in the left and right images are evaluated by computing the correlation between a window of pixels centered on each edge [28]. The matching can also be based on stochastic modeling, which can take into consideration the spatial intra- and inter-correlation of the stereo images [97]. Once a match has been established, the 3D coordinates of the matched point can be computed via a re-projection transform. Considering first a stationary camera, this re-projection process is quite straightforward (triangulation transform). In the case of general road conditions, the road geometry has to be estimated first in order to derive the re-projection transform from the camera to the road coordinate system; the road geometry can either be estimated from visual data [10,11] or assumed from a given model of the road in front of the vehicle (e.g. a flat straight road) [6,7].

For all points lying on a plane, the disparity between the two stereo images is a linear function of the image coordinates (Helmholtz shear equation). This relation highly simplifies the computation of stereo disparity and may be used to re-map the right image onto the left, or both images onto the road coordinate system. All points on the ground plane then appear with zero disparity, whereas residual disparities indicate objects lying above the ground plane, which can become potential obstacles.
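The triangulation and Helmholtz-shear relations above can be illustrated with a minimal sketch for a rectified stereo pair; the focal length, baseline and ground-plane coefficients are assumed to be known from calibration:

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Triangulation for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return f_px * baseline_m / disparity_px

def residual_disparity(y, d, a, b):
    """Helmholtz-shear check: on the ground plane the disparity is a linear
    function of the image row, d = a*y + b; a non-zero residual marks a
    point lying above the road plane (a potential obstacle)."""
    return d - (a * y + b)
```

A point with near-zero residual is accepted as road surface; thresholding the residual implements the obstacle test sketched in the text.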
Inverse perspective mapping

A promising approach to real-time object detection from video images is to remove the inherent perspective effect from acquired single or stereo images. The perspective effect relates differently the 3D points of the road (world) coordinate system with the 2D pixels of the image plane, and thus associates different information content to different image pixels. The inverse perspective mapping aims at inverting the perspective effect, forcing a homogeneous distribution of information within the image plane. To remove the perspective effect it is essential to know the image acquisition structure with respect to the road coordinates (camera position, orientation, etc.) and the road geometry (the flat-road assumption highly simplifies the problem); for this purpose it is necessary to know the exact relationship among the camera, vehicle and road coordinate systems. Using this information, the localization of the lane and the detection of generic obstacles on the road can be performed without any 3D world reconstruction [4].

The inverse perspective mapping can also be applied to stereo vision [4], by re-mapping both the right and the left image onto a common (road) domain. Obstacles located above the ground plane appear identical in the two camera images except for their different location, which depends on their distance from the camera. A simple threshold can thus be used to identify these objects in the difference of the re-mapped images: the difference of the re-mapped views transforms relatively square obstacles into two neighboring triangles, corresponding to the vertical boundaries of the object, which can be easily detected on a polar histogram of the difference image. Nevertheless, such a system relies on, and is highly affected by, brightness changes, shadows and shades on the road structure [95]. Besides the projection of images onto the ground plane, progressive scanning and delaying one of the camera signals also makes the detection of obstacles possible: in this configuration an obstacle generates the same time signature on both cameras, whereas figures on the road appear different on the two cameras and generate different time signatures [96].
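As a rough sketch of inverse perspective mapping under the flat-road assumption, the following maps an image pixel to road-plane coordinates for an idealized pinhole camera at height h, pitched down by a known angle. All parameter names are illustrative; a real system must also model roll, yaw and lens distortion:

```python
import math

def pixel_to_road(u, v, f, cx, cy, h, pitch):
    """Map pixel (u, v) to road-plane coordinates (X forward, Y right) for a
    pinhole camera with focal length f (pixels), principal point (cx, cy),
    mounted at height h and pitched down by `pitch` radians.
    Returns None for pixels at or above the horizon."""
    dx = (u - cx) / f                       # camera-frame ray: x to the right
    dy = (v - cy) / f                       #                   y downwards
    denom = dy * math.cos(pitch) + math.sin(pitch)
    if denom <= 0:
        return None                         # ray never hits the ground ahead
    t = h / denom                           # ray length to the ground plane
    X = t * (math.cos(pitch) - dy * math.sin(pitch))
    Y = t * dx
    return X, Y
```

Applying this mapping to every pixel of both stereo views produces the common road-domain images whose difference is thresholded in the scheme described above.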
Model-based approaches employ a parameterized 3D vehicle model for both the structural (shape) characteristics and the motion of the vehicle [73,76]. Using this approach, two major problems must be solved, namely the model matching and the motion estimation. The model matching process aims at finding the best match between the observed image and the 3D model projected onto the camera plane. This step is essentially a pose identification process, which derives the 3D position of the vehicle relative to the camera coordinates. The vehicle model often assumes straight line segments represented by their length and mid-point location [44]; the line segments extracted from the image are matched to the model segments projected on the 2D camera plane. The matching can be based on the optimization of distance measures between the observation and the model; for instance, the Mahalanobis distance is used in Ref. [44].
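The segment-matching step can be sketched as below, assuming each segment is summarized by its mid-point and length and that the variance of each component is known; the gating threshold and the greedy nearest-neighbour strategy are illustrative simplifications of the cited matching schemes:

```python
import math

def mahalanobis_sq(seg, model, var):
    """Squared Mahalanobis distance between an observed segment and a
    projected model segment, each given as (mid_x, mid_y, length);
    `var` holds the assumed variance of each component."""
    return sum((s - m) ** 2 / v for s, m, v in zip(seg, model, var))

def match_segments(observed, model, var, gate=9.0):
    """Greedy nearest-neighbour matching of observed segments to model
    segments; pairs whose distance exceeds the gate remain unmatched."""
    matches = []
    for i, s in enumerate(observed):
        d, j = min((mahalanobis_sq(s, m, var), j) for j, m in enumerate(model))
        if d <= gate:
            matches.append((i, j))
    return matches
```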

The motion estimation process is based on models that describe the vehicle motion. The motion of obstacles can be decomposed into a translation and a rotation over their center of gravity, and the motion parameters of this model are estimated using a time-recursive estimation process. A Kalman filter can be used to predict the vector of the state estimates based on the vectors of measurements and control variables; a maximum a posteriori (MAP) estimator has also been employed, whereas the extended Kalman filter is used in Ref. [98]. The estimation of motion and shape parameters can be combined in a more general (overall) state estimation process [98].

3D modeling and forward mapping

The previous approaches reflect attempts to invert the 3D projection for a sequence of images and reconstruct the actual (world) spatial arrangement and motion of objects, based on their 2D projections. The class of model-based techniques considered here takes a different approach: it tries to solve the analysis task by carrying out an iterative synthesis with prediction-error feedback, using spatio-temporal world models, and thereby avoids the ill-posed approximation of the non-unique inverse projection transform. In a computerized system, generic models of objects from the real world can be stored as three-dimensional structures carrying visible features at different spatial positions relative to their center of gravity. Moreover, the changing views of objects during self- or ego-motion reveal different aspects of the 3D geometry of the objects and their surrounding environment. From this knowledge, and applying the laws of forward projection (which is done much more easily than the inverse), the position and orientation of visual features in the image can be matched to those of the projected model [31,34] by prediction-error feedback [31]. The measurement equation has to be computed only in the forward direction, from the state variables (3D world) to the measurement space (image plane). The partial derivatives for the parameters of each object at its current spatial position are collected in the Jacobian matrix, which reflects the observed image variation and serves as detailed information for interpreting the observed image.

The dynamic consideration of a world model allows not only the computation of the present vehicle position, but also the computation of the effects of each component of the relative state vector on the vehicle position. The ego-motion dynamics can be computed from the actuators of the moving vehicle; this requires exact knowledge of the state of the car (yaw rate, vehicle speed, steering angle, etc.), which can be provided by appropriate sensors of the vehicle. From the ego-motion dynamics, the relative position of the moving vehicle and its camera can be inferred, while the dynamics of other moving obstacles can be modeled by stochastic disturbance variables. By applying this scheme to each object in the environment in parallel, an internal representation of the actual environment can be maintained in the interpretation process. This information can be used for estimating future vehicle positions: we can proceed with a prediction of the vehicle and obstacle states for the next time instant, and when new measurements are taken, the discrepancy between prediction and measurement should be small. If the cycle time of the measurement and control process is small and the state of the object is well known, a linear approximation to the non-linear equations of the model should be sufficient for capturing the essential inter-relationships of the estimation process [31]; for linear models, the recursive state estimation is efficiently performed through least-squares processes. In the case of a moving camera, the forward projection mapping is easily evaluated under the flat-road model [31]; other road models, including the Hill-and-Dale, Zero-Bank and Modified Zero-Bank models, have been considered along with the inverse and/or forward mapping [12,34]. It becomes obvious that knowledge about the structure of the environment and the dynamics of motion are relevant components in real-time vision, through the fusion of dynamic models (that describe spatial motion) with 3D shape models (that describe the spatial distribution of visual features).
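A minimal sketch of the forward measurement equation and its Jacobian follows, using a deliberately simplified two-component state (longitudinal and lateral offset) and assumed camera constants; real systems such as that of Ref. [31] use far richer state vectors and analytic derivatives:

```python
def project(state, feature):
    """Forward projection of a 3D model feature (x forward, y right, z up,
    in the vehicle frame) onto the image plane of a camera with focal
    length F at height H; `state` is (longitudinal, lateral) offset.
    F and H are illustrative camera constants."""
    F, H = 500.0, 1.5
    dx, dy = state
    x, y, z = feature
    xf, yf = x + dx, y + dy                # feature after hypothesised offset
    if xf <= 0:
        raise ValueError("feature behind the camera")
    u = F * yf / xf                        # image column
    v = F * (H - z) / xf                   # image row
    return u, v

def jacobian(state, feature, eps=1e-6):
    """Numeric Jacobian of the forward projection with respect to the state:
    the partial derivatives collected for prediction-error feedback."""
    u0, v0 = project(state, feature)
    J = []
    for k in range(len(state)):
        s = list(state)
        s[k] += eps
        u1, v1 = project(s, feature)
        J.append(((u1 - u0) / eps, (v1 - v0) / eps))
    return J  # J[k] = (du/ds_k, dv/ds_k)
```

The innovation (measured minus predicted image position), weighted through this Jacobian, drives the recursive state update described in the text.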
In this way, the spatial state estimation through vision can be performed through recursive least-squares estimation and Kalman filtering schemes. In a different form, Ref. [31] models the remaining difference image of two consecutive frames, after ego-motion compensation, as a Markov random field (MRF) that incorporates the stochastic model of the hypothesis that a pixel is either static (background) or mobile (vehicle). The MRF also induces spatial and temporal smoothness constraints. The optimization of the energy function of the resulting Gibbs posterior distribution provides the motion-detection map at every pixel [66].

4. Representative systems and future trends

Based on the previous categorization of video analysis methods, we attempt a brief review of existing systems for traffic monitoring and automatic vehicle guidance. It should be mentioned that this review does not by any means cover all existing systems; it rather considers representative systems that highlight the major trends in the area. We also attempt a categorization of these systems in terms of their domain of operation, the basic processing techniques used and their major applications. More specifically, we first indicate the status of the camera (static or moving) and categorize the applications as traffic monitoring, automatic lane finding, lane following, vehicle following or autonomous vehicle guidance. In terms of their major applications, the review focuses on two areas, namely traffic monitoring and automatic vehicle guidance, and attempts to compile the differences in the requirements and the constraints of these two areas. Moreover, the operating domain of each system is classified in terms of the nature of the features utilized: spatial features with temporal projection of the feature locations, temporal features (optical flow field) constrained on spatial characteristics (mainly 2½D), and joint spatio-temporal operation. For simplicity, the first case is considered as operation simply in the spatial domain. Whenever important, we also emphasize the estimation of the vehicle's state variables. Furthermore, the fundamental processing techniques from Sections 2 and 3 are summarized for each system. This categorization is summarized in Table 2.

Image processing and analysis tools become essential components of automated systems in traffic applications. First of all, video sensors have been shown to offer the following advantages: competitive cost, non-intrusive sensing, lower installation cost and installation/operation during construction, and lower maintenance and operation costs. Video sensors have also demonstrated the ability to obtain traffic measurements more efficiently than other conventional sensors; moreover, because video sensors have the potential of wide-area viewing, they are capable of more than merely emulating conventional sensors. In cases emulating conventional sensors, most of the algorithms employed are rather simple, emphasizing more on high processing speeds than on accuracy of the results. Some additional measurements needed for adaptive traffic management are: approach queue length, ramp queue length, approach flow profile, vehicle deceleration, and automatic measurement of turning movements. With the rapid progress in the electronics industry, modern hardware systems allow for more sophisticated and accurate algorithms to be employed that capitalize on the real advantages of machine vision, in order to extract useful information from video sensors. This evolution is graphically depicted in Table 1, which relates the applications to the required pieces of information, depending on the complexity of each application.

A vision-based guidance system applied to outdoor navigation usually involves two main tasks of perception, namely finding the road geometry and detecting the road obstacles. The knowledge about the road geometry allows a vehicle to follow its route, i.e. to determine the appropriate turn curvature so as to keep the vehicle at the middle of the road, while the detection of road obstacles is a necessary and important task in order to avoid other vehicles present on the road. The complexity of the navigation problem is quite high; this is evident from the comprehensive overview of systems and their properties in Table 2.

Table 2 (continued)

(continuation of the previous entry)
Processing techniques: † Edge detection individually on each stereo image for object detection † Detection of lane and object through inverse perspective mapping

System: VAMORS [31] (Prometheus project)
Operating domain: † Spatio-temporal processing for ALF and vehicle guidance † Temporal estimation of vehicle's state variables † Model-driven approach
Processing techniques: † 3D object modeling and forward perspective mapping † State-variable estimation of road skeletal lines for ALF † State-variable estimation of 3D model structure for object detection
Major applications: † Autonomous vehicle guidance † Moving camera

System: UTA [14]
Operating domain: † Spatio-temporal processing for ALF and object detection based on neural networks † Feature-driven approach
Processing techniques: † Use of spatio-temporal signature
Major applications: † Autonomous vehicle guidance † Moving camera

System: Ref. [14]
Operating domain: † Spatial processing for ALF and object detection † Feature tracking in temporal domain † Feature-driven approach
Processing techniques: † Color road detection and lane detection via RLS fitting † Interframe differencing and edge detection for locating potential object templates † Feature tracking via RLS
Major applications: † Autonomous vehicle guidance † Moving camera
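The recursive least-squares (RLS) lane fitting listed among the processing techniques above can be sketched for a straight-lane model x = a*y + b, updated one edge point at a time; the forgetting factor and the initialization are illustrative choices:

```python
class RLSLine:
    """Recursive least-squares fit of a lane boundary x = a*y + b."""

    def __init__(self, forget=1.0):
        self.theta = [0.0, 0.0]                    # parameters (a, b)
        self.P = [[1e6, 0.0], [0.0, 1e6]]          # large initial covariance
        self.lam = forget                          # forgetting factor

    def update(self, y, x):
        phi = (y, 1.0)                             # regressor for this point
        # gain K = P*phi / (lam + phi' P phi)
        Pp = [self.P[0][0] * phi[0] + self.P[0][1] * phi[1],
              self.P[1][0] * phi[0] + self.P[1][1] * phi[1]]
        denom = self.lam + phi[0] * Pp[0] + phi[1] * Pp[1]
        K = [Pp[0] / denom, Pp[1] / denom]
        err = x - (self.theta[0] * phi[0] + self.theta[1] * phi[1])
        self.theta = [self.theta[0] + K[0] * err,
                      self.theta[1] + K[1] * err]
        # covariance update P = (P - K * (P*phi)') / lam
        self.P = [[(self.P[i][j] - K[i] * Pp[j]) / self.lam
                   for j in range(2)] for i in range(2)]
```

A forgetting factor slightly below 1 lets the fit track slowly curving lane boundaries from frame to frame.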
The complexity of the underlying vision tasks is high, since the actual task is to reconstruct an inherent 3D representation of the spatial environment from the observed 2D images; this leads to different processing techniques on various levels of information abstraction. The levels of abstraction and information analysis range from 2D frame processing to full 4D spatio-temporal analysis. From the methods presented in this survey, the big thrust is based on traditional image processing techniques that employ either similarity or edge information to detect roads and vehicles and separate them from their environment. Vision also provides powerful means for collecting information regarding the environment and its actual state during autonomous locomotion. Subsequently, as computational complexity becomes less restrictive for real-time applications, throughout this work we have emphasized the trend towards utilizing more and more information.

This increasing exploitation of information allows the development of applications from simple lane detection to the extremely complex task of autonomous vehicle guidance. A question that arises at this point concerns the future developments in the field. Towards the improvement of the image-processing stage itself, we can expect morphological operators to be used more widely for both the segmentation of smooth structures and the detection of edges; such nonlinear operators provide algorithmic robustness and increased discrimination ability in complex scenes. We can also expect increased use of multiresolution techniques, which provide not only detailed localized features in the scale-space or wavelet domains but also abstract overall information. Individual processing algorithms, however, provide specific partial solutions under given constraints. Beyond the improvement of individual algorithms, two major requirements from advanced systems are expected to emerge, namely (i) adaptation to environmental changes and (ii) the ability of combining (fusing) information from different sources.

The first issue, adaptability, can be dealt with either by systems that can be trained in diverse conditions or by systems that tolerate uncertainty. The first class comprises systems that can be trained in some conditions and have the ability to interpolate among learnt situations when an unknown condition is presented. Neural networks form powerful structures for capturing knowledge and interpretation skills by training, and they are expected to dominate in systems that can intelligently adapt to the environment. In traffic applications there are enough data for extensive training, which can be gathered on-line from a fixed sensor (camera) position or while a human operator drives the vehicle. The second class of adaptable systems can deal with more general scenarios, where the measurements may not allow for indisputable inference. Most sophisticated image processing approaches adopt the underlying assumption that there exists hard evidence in the measurements (image) to provide characteristic features for classification and further mapping to certain world models. Recursive estimation schemes (employing Kalman filters) proceed in a probabilistic mode that first derives the most likely location of road lanes and vehicles and then matches the measured data to these estimated states; but they also rely heavily on the measurements and the degree of discrimination that can be inferred from them, treating the measurements as hard evidence that provides indisputable information. Real-world traffic applications, however, must account for several aspects that accept diverse interpretation, and the measurements convey a large degree of uncertainty. It is difficult for a vision algorithm to account for all kinds of variations, such as the weather, light, shadowing, noise, manmade interventions, and static or moving objects on the road, and it is impossible to define thresholds and other needed parameters for feature extraction and parameter estimation under all different situations. Under such circumstances, 'illusion' patterns on the scene may be easily misinterpreted as vehicles or roadway structures, and only a few approaches deal with difficult visual conditions (shades, changing lighting). Since measurements in real life are amenable to dispute, approaches that deal with uncertainty, possibly based on fuzzy-set theory, become attractive; such approaches have not yet been studied in transportation applications. Since they provide powerful means of incorporating possibility and linguistic interpretations expressed by human experts, it is only natural to expect the development of systems for traffic applications that are based on so-called soft computing techniques, simulating more closely the human perception.

The second requirement, information fusion, is still in primitive stages. Due to the complexity of a dynamically changing road scene, vision sensors may not provide enough information to analyze the scene under diverse conditions, like rainy or extremely hot conditions. In these cases it is necessary to combine multisensory data in order to detect road characteristics and obstacles efficiently. The fusion of different sources of information, mainly video and range data, has been considered as a means of providing more evidence in reconstructing the 3D world in robotic applications. We can distinguish two kinds of multisensory cooperation: active and intelligent sensing (sequential sensor operation) and fusion-oriented sensing (simultaneous sensor operation). The former uses the results of 2D image analysis to guide range sensing. Pioneer work with this idea is presented in Refs. [121,122]; Ref. [121] shows how to recognize 3D objects by first detecting characteristic points in images and then using them to guide the acquisition of 3D information around these points. The second scheme adopts the strategy of combining both intensity and range information to facilitate the perception task [33,119,120], where the video image guides the focus of attention of the range finder or vice versa [33,108]. Along these lines, Ref. [123] develops and uses a (combined) range-intensity histogram for the purpose of identifying military vehicles, whereas Ref. [124] uses a trainable neural network structure for fusing information from a video sensor and a range finder. Representative systems are developed in the OMNI project for traffic management, which combines video sensors with information from vehicles equipped with GPS/GSM [18], and in the autonomous vehicle of Ref. [118], which combines vision and laser radar systems with DGPS localization data and maps. Nevertheless, most of these approaches use data sources in sequential operation. Similar strategies can be used for fusing the results of different image processing techniques on the same set of data. The results of such algorithms are not