Autofocus: contrast detection

In the preceding applet we considered active and
passive methods of autofocusing a camera, and we studied in detail the
phase detection method commonly used in single-lens reflex (SLR)
cameras. In this applet we consider contrast detection, another
passive autofocus method. This method is commonly used in point-and-shoot
cameras and in those cell phone cameras that have movable lenses. Contrast
detection is also used in SLR cameras when they are in Live preview mode (called
"Live View" on Canon cameras). In this mode the reflex mirror is flipped up,
thereby disabling the autofocus module, so only the main sensor is available to
help the camera focus, and contrast detection must be used.

The optical path

The optical path for contrast detection is much simpler than for phase
detection. There is no half-silvered mirror, no microlenses, and no chip
crammed with one-dimensional CCDs. Instead, the image falling on the main
sensor is captured and analyzed directly to determine whether it is well
focused. As in phase detection, this analysis can be
performed at one or more distinct positions in the camera's field of view at
once, and these results can be combined to help the camera decide where to
focus. We could call these positions autofocus points by analogy
to the phase detection method, except that in contrast detection there is no
limit on the number and placement of these points. In the simulated viewfinder
image at the right side of the applet, we've highlighted one position near the
center of the field of view. Let's consider the formation of the image at this
position.

If you follow the red bundle of rays in the applet, they start at an unseen
object lying on the optical axis (black horizontal line) to the left of the
applet, pass through the main lens, and reconverge on the optical axis. With
the applet in its reset position, these rays come to a focus before the main
sensor, then spread out again and strike the sensor. Although the simulated
viewfinder shows a coin, let us assume for the purposes of the ray diagram that
the unseen object is a single bright point on a black background. In this case
the image captured by the sensor would be a broad spot that tapers to black at
its boundaries. A 1D plot through this spot is shown to the right, where it
looks like a low hump. Such a broad low hump is said to have low contrast.

The autofocusing process

Use the slider to move the lens left and right. As you do so the position
where the red rays converge will also move. As their focus moves closer to the
sensor, the breadth of the spot formed on the sensor decreases and its center
becomes brighter, as shown in the 1D plot. When the rays' focus coincides with
the sensor, the spot is tightest, and its peak is highest. Unfortunately, we
can't use the height of this peak to decide when the system is well focused,
since the object could be any color, light or dark. Instead, we could examine
the slope of the plot, i.e. the gradient within a small neighborhood of pixels,
estimating this gradient by comparing the intensities in adjacent pixels. We
would then declare the system well focused when this gradient exceeds a certain
threshold. However, if the object naturally has slowly varying intensities,
like skies or human skin, then its gradients will be modest even if it is in
good focus.
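To make the gradient idea concrete, here is a minimal sketch of a contrast metric: the sum of squared intensity differences between adjacent pixels within a neighborhood. The exact formula is an illustrative assumption; real cameras use their own (typically proprietary) metrics.

```python
def focus_measure(pixels):
    """Sum of squared differences between adjacent pixels.

    `pixels` is a 2-D list of gray levels. A sharply focused spot
    produces large pixel-to-pixel differences, hence a large score;
    a broad, low hump produces a small one. This particular metric
    is an illustrative assumption, not any camera's actual formula.
    """
    rows, cols = len(pixels), len(pixels[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:                 # horizontal gradient term
                total += (pixels[r][c + 1] - pixels[r][c]) ** 2
            if r + 1 < rows:                 # vertical gradient term
                total += (pixels[r + 1][c] - pixels[r][c]) ** 2
    return total
```

Applied to the bright-point example above, a tight spot (all its energy in one pixel) scores higher than the same energy spread over a broad hump, which is exactly the behavior a maximum-seeking autofocus routine needs.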

The most successful method is to move the lens back and forth until the
intensity or its gradient reaches a maximum, meaning that moving the lens
forward or backward would decrease the intensity in a pixel or the gradient in
a small neighborhood. This "local maximum" is the position of best focus. For
images of natural scenes (like the coin), rather than bright spots on a dark
background, even more complicated methods can be employed, and it's beyond the
scope of this applet to describe them all. Moreover, the algorithm used in any
commercial camera is proprietary and secret, and therefore unknown to us.

Regardless of the method employed,
if you're trained in computer vision you'll recognize these methods as
shape from focus algorithms.
Compare this to phase detection, which, because it
uses only two views at each autofocus point, is more like a shape from
stereo algorithm.

Characteristics of contrast detection systems

This maximum-seeking method has one advantage and one big disadvantage. Its
advantage is that it requires only local operations, meaning that it looks at
only a few pixels in the vicinity of the desired evaluation position. Thus, it
requires relatively little computation (hence battery power). Its disadvantage
is that it requires capturing multiple images. From a single image it can't
tell you whether the camera is well focused. It also can't tell you how far to
move the lens to get it into focus. It can't even tell you which direction to
move the lens! (Compare this to phase detection,
in which a single observation suffices in principle to focus the lens
accurately.) Thus, contrast detection systems must capture an image, estimate
the amount of misfocus, move the lens by a small amount, capture another image,
estimate misfocus again, and decide if things are getting better or worse. If
they're getting better, then the lens is moved again by a small amount. If
they're getting worse, then the lens is moved the other way. If they've been
getting better for a while and then suddenly start getting worse, we've
overshot the in-focus position, so the lens should be moved the other way, but
only a little.
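The hunting procedure just described amounts to a simple hill climb: step in one direction while the measure improves, and reverse with a smaller step on overshoot. The following is a minimal sketch under stated assumptions; the step sizes, the halving rule, and the measure function are illustrative choices, not any camera's actual algorithm.

```python
def hunt_for_focus(measure, start=0.0, step=1.0, min_step=0.01, max_iters=100):
    """Hill-climb the lens position by iterative 'hunting'.

    `measure(pos)` stands in for capturing an image at lens position
    `pos` and computing its contrast (a hypothetical stand-in for the
    camera's capture-and-evaluate cycle).
    """
    pos = start
    best = measure(pos)
    direction = 1.0
    for _ in range(max_iters):
        candidate = pos + direction * step
        score = measure(candidate)
        if score > best:            # getting better: keep moving this way
            pos, best = candidate, score
        else:                       # getting worse: overshot the peak,
            direction = -direction  # so reverse direction...
            step *= 0.5             # ...and take smaller steps
            if step < min_step:
                break               # step is tiny: close enough to focus
    return pos
```

Each call to `measure` here corresponds to one capture-move-evaluate cycle in the text above, which is why the process takes noticeable time on a real camera.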

By this iterative "hunting" process we eventually find the in-focus
position we are seeking. However, hunting takes time. That's why we complain
about how long it takes to autofocus a point-and-shoot camera. That's also why
high-end SLRs that shoot movies in Live View mode either don't offer
autofocusing or do it poorly, because they are compelled to use contrast
detection. By the way, professional moviemakers don't use autofocus cameras;
they have a person called a focus puller who stands next
to the camera and manually moves the focus ring by a precise, pre-measured
amount when called for in the script.

If you're trained in computer vision, you might ask at this point why cameras
don't use shape from defocus, in which images are captured at only
two lens positions. By comparing their amounts of misfocus, one can in
principle estimate how far to move the lens to bring the image into good focus,
thereby avoiding hunting. However, this estimation requires assuming something
about the frequency content of the scene (How sharp is the object being
imaged?), and such assumptions are seldom trustworthy enough for everyday
photography.
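Under a deliberately simplistic model, the two-capture idea can be sketched as follows. Assume the blur width shrinks linearly as the lens approaches the in-focus position, with both measurement positions on the same side of focus; the function name and the linear-blur model are assumptions for illustration, not a real camera's method.

```python
def estimate_focus_position(p1, b1, p2, b2):
    """Shape-from-defocus sketch: estimate focus from two captures.

    Assumed model: blur width b = c * (p_focus - p) for lens positions
    p1, p2 on the same side of focus. Two (position, blur) measurements
    determine the unknown scale c and the in-focus position p_focus
    directly, with no hunting.
    """
    c = (b1 - b2) / (p2 - p1)       # blur shrinks as we approach focus
    return p1 + b1 / c
```

The catch, as the text notes, is that measuring "blur width" from a real image requires assumptions about how sharp the scene itself is, which is where the method becomes unreliable in practice.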