Image Acquisition (Introduction to Video and Image Processing) Part 2

The Image Sensor

The light reflected from the object of interest is focused by some optics and now needs to be recorded by the camera. For this purpose an image sensor is used. An image sensor consists of a 2D array of cells as seen in Fig. 2.13. Each of these cells is denoted a pixel and is capable of measuring the amount of incident light and converting that into a voltage, which in turn is converted into a digital number.

The more incident light, the higher the voltage and the higher the digital number. Before a camera can capture an image, all cells are emptied, meaning that no charge is present. When the camera is to capture an image, light is allowed to enter and charges start accumulating in each cell. After a certain amount of time, known as the exposure time and controlled by the shutter, the incident light is shut out again. If the exposure time is too short or too long, the result is an underexposed or overexposed image, respectively, see Fig. 2.14.

Many cameras have a built-in intelligent system that tries to ensure the image is not over- or underexposed. This is done by measuring the amount of incoming light and, if it is too low or too high, correcting the image accordingly, either by changing the exposure time or, more often, by an automatic gain control. While the former improves the image by changing the camera settings, the latter is rather a post-processing step. Both can provide more pleasing video for the human eye to watch, but for automatic video analysis you are very often better off disabling such features. This might sound counterintuitive, but since automatic video/image processing is all about manipulating the incoming light, we need to understand and be able to foresee the incoming light in different situations, and this can be hard if the camera interferes beyond our control and understanding. This might be easier to understand after reading the next chapter. The point is that when choosing a camera you need to remember to check whether the automatic gain control is mandatory or can be disabled. Go for a camera where it can be disabled. It should of course be added that if you capture video in situations where the amount of light can change significantly, then you have to enable the camera’s automatic settings in order to obtain a usable image.

Fig. 2.12 Examples of how different settings for focal length, aperture and distance to object result in different depth-of-fields. For a given combination of the three settings the optics are focused so that the object (person) is in focus. The focused checkers then represent the depth-of-field for that particular setting, i.e., the range in which the object will be in focus. The figure is based on a Canon 400D

Fig. 2.13 The sensor consists of an array of interconnected cells. Each cell consists of a housing which holds a filter, a sensor and an output. The filter controls which type of energy is allowed to enter the sensor. The sensor measures the amount of energy as a voltage, which is converted into a digital number through an analog-to-digital converter (ADC)

Fig. 2.14 The input image was taken with the correct amount of exposure. The over- and underexposed images are too bright and too dark, respectively, which makes it hard to see details in them. If the object or camera is moved during the exposure time, it produces motion blur as demonstrated in the last image

Another aspect related to the exposure time is when the object of interest is in motion. Here the exposure time in general needs to be short in order to avoid motion blur, where light from a certain point on the object is spread out over several cells, see Fig. 2.14.

The accumulated charges are converted into digital form using an analog-to-digital converter. This process takes the continuous world outside the camera and converts it into a digital representation, which is required for storing it in a computer. Or in other words, this is where the image becomes digital. To fully comprehend the difference, have a look at Fig. 2.15.

To the left we see where the incident light hits the different cells and how many times (the more times the brighter the value). This results in the shape of the object and its intensity. Let us first consider the shape of the object. A cell is sensitive to incident light hitting the cell, but not sensitive to where exactly the light hits the cell.

Fig. 2.15 To the left the amount of light which hits each cell is shown. To the right the resulting image of the measured light is shown

Fig. 2.16 The effect of spatial resolution. The spatial resolution is from left to right: 256 x 256, 64 x 64, and 16 x 16

So if the shape should be preserved, the size of the cells should be infinitely small. From this it follows that the image will be infinitely large in both the x- and y-direction. This is not tractable and therefore a cell, of course, has a finite size. This leads to loss of data/precision and this process is termed spatial quantization. The effect is the blocky shape of the object in the figure to the right. The number of pixels used to represent an image is also called the spatial resolution of the image. A high resolution means that a large number of pixels is used, resulting in fine details in the image. A low resolution means that a relatively low number of pixels is used. Sometimes the words fine and coarse resolution are used. The visual effect of the spatial resolution can be seen in Fig. 2.16. Overall we have a trade-off between memory and shape/detail preservation. It is possible to change the resolution of an image by a process called image resampling. This can be used to create a low resolution image from a high resolution image. However, it is normally not possible to create a high resolution image from a low resolution image.
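As a rough illustration of resampling to a lower spatial resolution, the sketch below averages blocks of neighboring pixels. Block averaging is only one of several possible resampling schemes and is chosen here as an assumption; the function name downsample and the small test image are hypothetical.

```python
def downsample(image, factor):
    """Reduce spatial resolution by averaging factor x factor blocks.

    `image` is a list of rows (image[y][x]); its width and height are
    assumed to be divisible by `factor`.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            # Collect all pixel values inside the current block.
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(round(sum(block) / len(block)))
        result.append(row)
    return result


# A made-up 4 x 4 image with a bright square in the top-left corner.
image = [
    [200, 200, 0, 0],
    [200, 200, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 100, 20],
]
print(downsample(image, 2))  # [[200, 0], [0, 30]]
```

Note how the two distinct values 100 and 20 end up in a single output pixel: the detail is lost, which is why the reverse step from low to high resolution cannot normally recover it.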

Fig. 2.17 The effect of gray-level resolution. The gray-level resolution is from left to right: 256, 16, and 4 gray levels

A similar situation is present for the representation of the amount of incident light within a cell. The number of photons hitting a cell can be tremendously high, requiring an equally high digital number to represent this information. However, since the human eye is not even close to being able to distinguish the exact number of photons, we can quantize the number of photons hitting a cell. Often this quantization results in a representation of one byte (8 bits), since one byte corresponds to the way memory is organized inside a computer. In the case of 8-bit quantization, a charge of 0 volts will be quantized to 0 and a high charge quantized to 255. Other gray-level quantizations are sometimes used. The effect of changing the gray-level quantization (also called the gray-level resolution) can be seen in Fig. 2.17. Down to 16 gray levels the image will frequently still look realistic, but with a clearly visible quantization effect. The gray-level resolution is usually specified in number of bits. While typical gray-level resolutions are 8-, 10-, and 12-bit, corresponding to 256, 1024, and 4096 gray levels, 8-bit images are the most common and are the topic of this text.
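The relation between bits and gray levels, and the effect of reducing the gray-level resolution of an 8-bit value, can be sketched as follows. The helper names gray_levels and requantize are hypothetical, and the uniform binning used here is an assumption; it is one simple way to produce images like those in Fig. 2.17, not necessarily the one used for the figure.

```python
def gray_levels(bits):
    """Number of distinct gray levels available with `bits` bits per pixel."""
    return 2 ** bits


def requantize(value, levels):
    """Map an 8-bit intensity (0-255) onto `levels` evenly spaced gray levels.

    Assumes `levels` is a power of two between 2 and 256. The result stays
    on the 0-255 scale so that it can still be displayed as a shade of gray.
    """
    step = 256 // levels                 # width of each quantization bin
    index = value // step                # bin that the value falls into
    return index * 255 // (levels - 1)   # spread the bins over 0..255


print([gray_levels(b) for b in (8, 10, 12)])               # [256, 1024, 4096]
print([requantize(v, 4) for v in (0, 63, 64, 128, 255)])   # [0, 0, 85, 170, 255]
```

With 4 levels, the intensities 0 and 63 become indistinguishable, which is the visible quantization effect described above.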

In the case of an overexposed image, a number of cells might have charges above the maximum measurable charge. These cells are all quantized to 255. There is no way of knowing just how much incident light entered such a cell and we therefore say that the cell is saturated. This situation should be avoided by adjusting the shutter (and/or aperture), and saturated cells should be handled carefully in any video and image processing system. When a cell is saturated it can affect the neighboring pixels by increasing their charges. This is known as blooming and is yet another argument for avoiding saturation.
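A minimal sketch of saturation, assuming a simple linear conversion from charge to an 8-bit value, is given below. The parameter full_scale (the maximum measurable charge) and the charge values are made up for illustration. A pixel at 255 is not guaranteed to be saturated, but it is the usual warning sign.

```python
MAX_VALUE = 255  # largest value an 8-bit pixel can hold


def quantize(charge, full_scale):
    """Convert a cell charge to an 8-bit value, clipping at saturation.

    `full_scale` is the (hypothetical) maximum measurable charge.
    """
    return min(MAX_VALUE, int(charge / full_scale * MAX_VALUE))


def saturated_fraction(image):
    """Fraction of pixels sitting at the maximum value."""
    pixels = [value for row in image for value in row]
    return sum(value == MAX_VALUE for value in pixels) / len(pixels)


# Made-up charges; the last one is three times the measurable maximum.
charges = [[0.0, 0.5], [1.0, 3.0]]
image = [[quantize(c, full_scale=1.0) for c in row] for row in charges]
print(image)                      # [[0, 127], [255, 255]]
print(saturated_fraction(image))  # 0.5
```

The charges 1.0 and 3.0 both end up as 255, so the information about how much light actually entered the second cell is gone.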

The Digital Image

To transform the information from the sensor into an image, each cell content is now converted into a pixel value in the range: [0, 255]. Such a value is interpreted as the amount of light hitting a cell during the exposure time. This is denoted the intensity of a pixel. It is visualized as a shade of gray denoted a gray-scale value or gray-level value ranging from black (0) to white (255), see Fig. 2.18.

Fig. 2.18 The relationship between the intensity values and the different shades of gray

Fig. 2.19 Definition of the image coordinate system

A gray-scale image (as opposed to a color image, which is the topic of Chap. 3) is a 2D array of pixels (corresponding to the 2D array of cells in Fig. 2.13) each having a number between 0 and 255. In this text the coordinate system of the image is defined as illustrated in Fig. 2.19 and the image is represented as f(x, y), where x is the horizontal position of the pixel and y the vertical position. For the small image in Fig. 2.19, f(0, 0) = 10, f(3, 1) = 95 and f(2, 3) = 19.

So whenever you see a gray-scale image you must remember that what you are actually seeing is a 2D array of numbers as illustrated in Fig. 2.20.
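In code, such a 2D array is often stored row by row, so the vertical index comes first when accessing the data. The sketch below assumes that layout. Only the three values f(0, 0), f(3, 1) and f(2, 3) are taken from the text; the 4 x 4 size and the remaining zeros are placeholders, since the full content of Fig. 2.19 is not reproduced here.

```python
# Each inner list is one row of the image; image[y][x] holds the pixel at
# horizontal position x and vertical position y.
image = [
    [10, 0, 0, 0],   # y = 0
    [0, 0, 0, 95],   # y = 1
    [0, 0, 0, 0],    # y = 2
    [0, 0, 19, 0],   # y = 3
]


def f(x, y):
    """Intensity at column x and row y, matching the f(x, y) notation."""
    return image[y][x]


print(f(0, 0), f(3, 1), f(2, 3))  # 10 95 19
```

Swapping the two indices is a common source of errors, so it is worth wrapping the access in a small function like f when following the notation of the text.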

The Region of Interest (ROI)

As digital cameras are sold in larger and larger numbers, the development within sensor technology has resulted in many new products, including larger and larger numbers of pixels within one sensor. The number of pixels is normally given as the size of the image that can be captured by a sensor, i.e., the number of pixels in the vertical direction multiplied by the number of pixels in the horizontal direction. Having a large number of pixels can result in high quality images and has made, for example, digital zoom a reality.

When it comes to image processing, a larger image size is not always a benefit. Unless you are interested in tiny details or require very accurate measurements in the image, you are better off using a smaller image. The reason is that when we start to process images we have to process each pixel, i.e., perform some math on each pixel. Due to the large number of pixels, that quickly adds up to quite a large number of mathematical operations, which in turn means a high computational load on your computer.

Say you have an image which is 500 x 500 pixels. That means that you have 500 · 500 = 250,000 pixels. Now say that you are processing video with 50 images per second. That means that you have to process 50 · 250,000 = 12,500,000 pixels per second. Say that your algorithm requires 10 mathematical operations per pixel, then in total your computer has to do 10 · 12,500,000 = 125,000,000 operations per second. That is quite a number even for today’s powerful computers. So when you choose your camera do not make the mistake of thinking that bigger is always better!
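The arithmetic above is easy to repeat for other image sizes and frame rates. The helper below, with the hypothetical name operations_per_second, simply multiplies the four quantities, assuming every pixel of every frame is processed.

```python
def operations_per_second(width, height, frames_per_second, ops_per_pixel):
    """Total operations per second when every pixel of every frame is processed."""
    pixels_per_frame = width * height
    pixels_per_second = pixels_per_frame * frames_per_second
    return pixels_per_second * ops_per_pixel


# The example from the text: 500 x 500 pixels, 50 images per second,
# 10 operations per pixel.
print(operations_per_second(500, 500, 50, 10))    # 125000000

# Doubling the image size in both directions quadruples the load.
print(operations_per_second(1000, 1000, 50, 10))  # 500000000
```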

Fig. 2.20 A gray-scale image and part of the image described as a 2D array, where the cells represent pixels and the value in a cell represents the intensity of that pixel

Besides picking a camera with a reasonable size you should also consider introducing a region-of-interest (ROI). An ROI is simply a region (normally a rectangle) within the image which defines the pixels of interest. Those pixels not included in the region are ignored altogether and less processing is therefore required. An ROI is illustrated in Fig. 2.21.

The ROI can sometimes be defined for a camera, meaning that the camera only captures those pixels within the region, but usually it is something you as a designer define in software. Say that you have put up a camera in your home in order to detect if someone comes through one of the windows while you are on holiday. You could then define an ROI for each window seen in the image and only process these pixels. When you start playing around with video and image processing you will soon realize the need for an ROI.
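Defining a rectangular ROI in software amounts to selecting a sub-array of the image. The sketch below assumes the image is stored as a list of rows and that the ROI is given by its top-left corner together with a width and a height; the function name crop_roi and the test image are hypothetical.

```python
def crop_roi(image, x, y, width, height):
    """Return the pixels inside the rectangle with top-left corner (x, y)."""
    return [row[x:x + width] for row in image[y:y + height]]


# A made-up 6 x 4 image in which each value encodes its own position
# (value = x + 10 * y), so the result is easy to check by eye.
image = [[x + 10 * y for x in range(6)] for y in range(4)]

roi = crop_roi(image, x=2, y=1, width=3, height=2)
print(roi)  # [[12, 13, 14], [22, 23, 24]]

# Only 6 of the 24 pixels remain to be processed.
print(sum(len(row) for row in roi), "of", sum(len(row) for row in image))
```

For the window example, one such rectangle would be defined per window and only the pixels returned by crop_roi would be passed on to the detection step.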

Further Information

As hinted at in this chapter, the camera and especially the optics are complicated, and much more information is required to comprehend them in depth. While a full understanding of the capturing process is mainly based on electrical engineering, understanding optics requires a study of physics and how light interacts with the physical world. An easier way into these fields can be via the FCam [1], which is a software platform for understanding and teaching different aspects of a camera.

Fig. 2.21 The white rectangle defines a region-of-interest (ROI), i.e., this part of the image is the only one being processed

Another way into these fields is to pick up a book on Machine Vision. Here you will often find a practical approach to understanding the camera and guidelines on picking the right camera and optics.

While this chapter (and the rest of the book) focuses solely on images formed by visible light, it should be mentioned that other wavelengths from the electromagnetic spectrum can also be converted into digital images and processed by the methods in the following chapters. Two examples are X-ray images and thermographic images, see Fig. 2.22. An X-ray image is formed by placing an object between an X-ray emitter and an X-ray receiver. The receiver measures the energy level of the X-rays at different positions. The energy level depends on the physical properties of the object, e.g., bones stop the X-rays while blood does not. Thermographic images capture middle- or far-infrared rays. Heat is emitted from all objects via such wavelengths, meaning that the intensity in each pixel in a thermographic image corresponds directly to the temperature of the observed object, see Fig. 2.22. Other types of image not directly based on the electromagnetic spectrum can also be captured and processed, and in general all 2D signals that can be measured can be represented as an image. Examples are MR and CT images known from hospitals, and 3D (or depth) images obtained by a laser scanner, a time-of-flight camera or the Kinect sensor developed for gaming, see Fig. 2.22.

Fig. 2.22 Three different types of image. (a) X-ray image. Note the ring on the finger. (b) Thermographic image. The more reddish the higher the temperature. (c) 3D image. The more blueish the closer to the camera