A New Segmentation Paradigm (Inspired by the Human Visual System)

Motivation

In this work, we ask a fundamental question about segmenting a scene (or image): does it make sense to segment an entire image or scene into regions? Our answer is no. We explain why with an example below. In the process, we describe a new segmentation approach that is biologically motivated and connects segmentation with visual attention.

Let us first try to understand why segmenting an entire scene is not a well-defined problem. Look at the leftmost image above. It is a natural scene with, say, two prominent objects: a horse and a pair of trees. The middle and rightmost images are the outputs of the normalized-cut-based image segmentation algorithm for different values of its input parameter (the expected number of regions). Now, which of the two is the desired segmentation of the scene? The right answer is: "it depends!".

The segmentation shown in the rightmost image breaks the horse into a couple of regions, whereas the segmentation shown in the middle image merges the horse with the region corresponding to the grass. So, if the horse is of interest to the problem at hand, the rightmost segmentation is the appropriate one. By the same logic, the segmentation shown in the middle image is appropriate if the trees are of interest.

So, the definition of an appropriate segmentation of a scene is strongly linked to the object of interest in the scene. In other words, the object of interest should be identified even before the segmentation process begins. At this point, this may appear to be a chicken-and-egg problem. But it is not.

The huge literature on visual attention comes to the rescue. The human visual system is known to have an attention module that efficiently uses low-level visual cues (such as color and texture) to quickly find the salient locations in the scene. The human eyes are then drawn to these salient points (also called fixations). Fixations are made even while looking at a still picture. We use these fixation points as the identification markers for the objects of interest in the scene.

We propose a segmentation algorithm that takes a fixation point as its input and outputs the region containing that fixation point in the scene. For example, for the two fixation points, indicated by the green crosses, on two different objects (see Fig. (a) and (c)), our method segments the corresponding regions enclosing those fixation points (see Fig. (b) and (d)). For more details on the algorithm, refer to our ICCV paper [PDF].

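Fig. (a), (c): fixation points (green crosses) on two different objects. Fig. (b), (d): the segmented regions enclosing those fixation points.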

The primary difference between the proposed segmentation framework and the standard approach is that we always segment one region/object at a time. To segment multiple objects, the segmentation process is repeated for a fixation point inside each object of interest. The diagram below shows the critical role that fixation plays in our segmentation process: it identifies the object of interest.

Description

For a given fixation point, segmenting the region/object containing that point is a two-step process.

Cue Processing: Visual cues such as color, texture, motion and stereo are used to generate a probabilistic boundary edge map, in which the intensity of each pixel stores the probability that the pixel lies on the boundary of some object in the scene.

Segmentation: For a given fixation point, the optimal closed contour around that point is found in the probabilistic boundary edge map. This process is carried out in polar space to make the segmentation scale invariant.
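To make the role of polar space concrete, here is a minimal sketch of resampling an edge map around a fixation point. It is only an illustration under stated assumptions: the names `to_polar` and `circle_edge_map`, the nearest-neighbour sampling, and the synthetic circular edge maps are all hypothetical and not taken from the ICCV paper. It shows that a circle centred on the fixation becomes a straight line of constant radius in polar space, whatever its size.

```python
import numpy as np

def to_polar(edge_map, fixation, n_theta=360):
    """Resample edge_map on a polar grid centred at fixation = (row, col).

    Output shape is (n_theta, n_r): one row per angle, one column per
    integer radius in pixels. Samples outside the image are set to 0.
    """
    h, w = edge_map.shape
    fy, fx = fixation
    n_r = int(np.ceil(np.hypot(h, w)))  # long enough to reach any corner
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    radii = np.arange(n_r)
    # Nearest-neighbour sampling along each ray leaving the fixation.
    ys = np.rint(fy + np.outer(np.sin(thetas), radii)).astype(int)
    xs = np.rint(fx + np.outer(np.cos(thetas), radii)).astype(int)
    inside = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    polar = np.zeros(ys.shape, dtype=float)
    polar[inside] = edge_map[ys[inside], xs[inside]]
    return polar

def circle_edge_map(size, centre, radius):
    """Synthetic edge map (assumption): 1 on a thin circle, 0 elsewhere."""
    yy, xx = np.mgrid[0:size, 0:size]
    dist = np.hypot(yy - centre[0], xx - centre[1])
    return (np.abs(dist - radius) < 1.0).astype(float)

small = to_polar(circle_edge_map(101, (50, 50), 20), (50, 50))
large = to_polar(circle_edge_map(101, (50, 50), 40), (50, 50))
# Each circle fills one whole column (constant radius) of its polar map.
print(small[:, 20].min(), large[:, 40].min())
```

In this representation every closed contour around the fixation crosses each angle row, so a small and a large contour have the same length in polar space. That is one way to read the scale-invariance claim above.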

Note that the segmentation step is cue-independent: it does not interact with the visual cues directly. The interaction is indirect. The segmentation takes as input the probabilistic boundary edge map of a given scene, which is generated from the visual cues. This means the segmentation step does not change as the input changes from a video to a stereo pair to a single image. A change in the input only changes the available visual cues. So, what remains is to decide how to combine the visual cues to generate the probabilistic boundary edge map, in which the intensity of a pixel is proportional to the likelihood that the pixel lies on an object boundary.

Case 1:
We have a video (on the left below), so we use color, texture and motion cues to generate the probabilistic boundary edge map (on the right below). Here the object boundaries are also motion boundaries, which can be detected as discontinuities in the optical flow of the scene.
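As a rough illustration of detecting motion boundaries as flow discontinuities, the sketch below scores each pixel by the spatial gradient magnitude of a dense optical flow field. The function name `motion_boundary_map`, the max-normalisation, and the synthetic flow are assumptions for illustration only. How the motion cue is combined with color and texture is not shown here.

```python
import numpy as np

def motion_boundary_map(flow):
    """Score motion boundaries from a dense flow field of shape (H, W, 2).

    flow[..., 0] is the horizontal displacement u and flow[..., 1] the
    vertical displacement v. The score is the gradient magnitude of
    (u, v), divided by its maximum so that values lie in [0, 1].
    """
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    mag = np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)
    peak = mag.max()
    return mag / peak if peak > 0 else mag

# Synthetic flow (assumption): a square moving 3 pixels to the right
# in front of a static background.
flow = np.zeros((40, 40, 2))
flow[10:30, 10:30, 0] = 3.0
motion_pb = motion_boundary_map(flow)
# Zero inside the square and in the background, high along its border.
print(motion_pb[20, 20], motion_pb[2, 2], motion_pb[20, 10])
```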

Case 2:
We have a stereo pair, so we use color, texture and the depth map to generate the probabilistic boundary edge map. With Kinect, it has become easy to obtain RGB-D input. The data shown below were obtained from solutionsInPerceptionChallenge, a vision challenge currently being organized by Willow Garage and NIST. Most of the pixels along the object boundaries have a depth discontinuity and can thus be detected by finding discontinuities in the depth map.
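One simple way to turn depth discontinuities into boundary scores is to measure the depth jump between neighbouring pixels and squash it into [0, 1). The sketch below does this. The function name `depth_boundary_map`, the exponential squashing, and the scale `sigma` are hypothetical choices for illustration, not the mapping used in this work.

```python
import numpy as np

def depth_boundary_map(depth, sigma=0.05):
    """Score depth discontinuities in a depth map of shape (H, W), in metres.

    sigma (assumption) is the depth jump, in metres, at which the score
    reaches about 0.63. Each jump is assigned to both pixels it separates.
    """
    depth = np.asarray(depth, dtype=float)  # avoid unsigned wrap-around
    jump = np.zeros_like(depth)
    dy = np.abs(np.diff(depth, axis=0))     # jumps between vertical neighbours
    dx = np.abs(np.diff(depth, axis=1))     # jumps between horizontal neighbours
    jump[:-1, :] = np.maximum(jump[:-1, :], dy)
    jump[1:, :] = np.maximum(jump[1:, :], dy)
    jump[:, :-1] = np.maximum(jump[:, :-1], dx)
    jump[:, 1:] = np.maximum(jump[:, 1:], dx)
    return 1.0 - np.exp(-jump / sigma)

# Synthetic depth (assumption): a box 1 m away in front of a wall 2 m away.
depth = np.full((40, 40), 2.0)
depth[10:30, 10:30] = 1.0
depth_pb = depth_boundary_map(depth)
# Zero on the flat box surface, close to 1 on both sides of its border.
print(depth_pb[20, 20], depth_pb[20, 10], depth_pb[20, 9])
```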

Case 3:
We have only a single image. In this case, the probabilistic boundary edge map can simply be the color and texture gradient values scaled between 0 and 1. In the example below, the gradient-based probabilistic boundary edge map is good enough, as the pixels on the object boundaries are dark whereas those inside the objects are dim.
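A minimal sketch of the single-image case follows, assuming only the color gradient is used (the texture gradient is left out for brevity) and that "scaled between 0 and 1" means min-max scaling. The function name `gradient_boundary_map` and the synthetic image are hypothetical.

```python
import numpy as np

def gradient_boundary_map(image):
    """Color gradient magnitude of an (H, W) or (H, W, C) image, in [0, 1]."""
    image = np.asarray(image, dtype=float)
    if image.ndim == 2:
        image = image[..., None]              # treat grayscale as one channel
    gy, gx = np.gradient(image, axis=(0, 1))  # per-channel spatial derivatives
    mag = np.sqrt((gx**2 + gy**2).sum(axis=-1))
    lo, hi = mag.min(), mag.max()
    if hi == lo:                              # flat image: no boundaries at all
        return np.zeros_like(mag)
    return (mag - lo) / (hi - lo)             # min-max scaling (assumption)

# Synthetic image (assumption): an orange square on a black background.
image = np.zeros((32, 32, 3))
image[8:24, 8:24] = [1.0, 0.5, 0.0]
image_pb = gradient_boundary_map(image)
# Zero inside the square, positive on its border, and 1 at the strongest edge.
print(image_pb[16, 16], image_pb[16, 8] > 0, image_pb.max())
```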

Segment a fixated region
The segmentation is oblivious to how the probabilistic boundary edge map is created. It simply takes the edge and orientation maps as input, along with a fixation point, and finds the closed boundary around that point. Below are examples of segmentation for different fixation points in the different cases. Note that the segmentation process is not affected by the actual size of the closed contours, owing to the scale-invariance property of polar space. (To learn more about it, read our ICCV paper.)
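The sketch below illustrates the interface of this step: an edge map and a fixation point go in, and a region mask comes out. It is not the optimisation used in the ICCV paper. As a stand-in, it picks one radius per angle by dynamic programming so that the summed edge probability is maximal while the radius changes slowly from one angle to the next. The function name `segment_fixated_region`, the parameters `n_theta` and `max_jump`, and the synthetic edge map are assumptions. The orientation map is ignored, and the first and last angles are not forced to meet.

```python
import numpy as np

def segment_fixated_region(edge_map, fixation, n_theta=180, max_jump=1):
    """Boolean mask of the region inside a contour found around fixation."""
    h, w = edge_map.shape
    fy, fx = fixation
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    radii = np.arange(1, int(np.hypot(h, w)))   # skip r = 0 (the fixation)
    n = radii.size
    ys = np.rint(fy + np.outer(np.sin(thetas), radii)).astype(int)
    xs = np.rint(fx + np.outer(np.cos(thetas), radii)).astype(int)
    inside = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
    polar = np.full(ys.shape, -1.0)             # penalise leaving the image
    polar[inside] = edge_map[ys[inside], xs[inside]]

    # Forward pass: best score of a path ending at each radius, per angle.
    score = polar[0].copy()
    back = np.zeros(polar.shape, dtype=int)
    for t in range(1, n_theta):
        best = np.full(n, -np.inf)
        arg = np.zeros(n, dtype=int)
        for d in range(-max_jump, max_jump + 1):
            shifted = np.full(n, -np.inf)
            lo, hi = max(0, d), n - max(0, -d)
            shifted[lo:hi] = score[lo - d:hi - d]   # previous index is j - d
            better = shifted > best
            best[better] = shifted[better]
            arg[better] = (np.arange(n) - d)[better]
        back[t] = arg
        score = best + polar[t]

    # Backward pass: recover the chosen radius index for every angle.
    path = np.empty(n_theta, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_theta - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    contour_r = radii[path]

    # A pixel is inside if it is closer to the fixation than the contour
    # along the ray at its angle.
    yy, xx = np.mgrid[0:h, 0:w]
    ang = np.arctan2(yy - fy, xx - fx) % (2.0 * np.pi)
    idx = np.rint(ang / (2.0 * np.pi) * n_theta).astype(int) % n_theta
    return np.hypot(yy - fy, xx - fx) <= contour_r[idx]

# Synthetic edge map (assumption): a circle of radius 15 centred at (32, 32).
yy, xx = np.mgrid[0:64, 0:64]
edges = (np.abs(np.hypot(yy - 32, xx - 32) - 15) < 1.5).astype(float)
mask_centre = segment_fixated_region(edges, (32, 32))
mask_offset = segment_fixated_region(edges, (30, 35))
print(mask_centre.sum(), mask_offset.sum())
```

To segment several objects, the same function would be called once per fixation point, one inside each object of interest.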

FAQs

Q1. What is the biggest benefit of the new fixation based formulation of segmentation?
Ans: The segmentation becomes a fully automatic process that finds regions without any user input. These regions can then be used for mid-level visual processing. A region with a closed contour is more informative and discriminative than an image patch at any point in the image.

Q2. Will there be a real-time implementation of the segmentation algorithm?
Ans: Yes. We are working on a C and CUDA implementation of the algorithm, which will run in real time.