Image Understanding and Computer Vision

Image understanding and computer vision are two closely related multidisciplinary research fields concerned with the use of computer algorithms to modify or analyze digital images, drawing on signal and image processing, machine learning, and artificial intelligence techniques to accomplish specific tasks or applications. Both fields seek to produce numerical and symbolic information, in the form of decisions, through scene analysis and image understanding. Furthermore, one of the main goals of image understanding and computer vision is to duplicate the abilities of human vision by electronically perceiving and understanding an image.

Digital image processing has many applications, ranging from video transmission on our cell phones to probing the universe. One of the first applications of digital images was digitized newspaper pictures sent by undersea cable between London and New York in the early 1920s, as shown in Figure 1(a). Pictures were coded by a Bartlane cable picture transmission system1 for cable transmission and then reconstructed at the receiving end on a telegraph printer fitted with typefaces simulating a halftone pattern. The early Bartlane systems were capable of coding images in five distinct brightness levels, which was increased to fifteen levels in 1929. Although improvements to methods for transmitting digital pictures continued to be investigated over the next thirty-five years, it took the combined advents of large-scale digital computers and the space program to bring into focus the potential of digital imaging concepts. Work involving computer techniques for improving images from a space probe began at the Jet Propulsion Laboratory in 1964, when pictures of the Moon transmitted by Ranger 7 were processed by a computer to correct various types of image distortion inherent in the onboard television camera, as shown in Figure 1(b). In the 1970s, digital image processing proliferated as cheaper computers and dedicated hardware became available. With the emergence of fast multicore computers and dedicated signal processing chips in the 2000s, digital image processing has become the most common form of processing images in real time.

More recent imaging sensors operating at different spectral ranges have also provided unforeseen applications for the U.S. Army as well as U.S. commercial enterprises. Typical applications that routinely employ image processing techniques today include automatic target detection and classification, military reconnaissance, surveillance, robot vision, character recognition, industrial robots for product assembly and inspection, multimodal image fusion, cross-modality image understanding, automatic processing of biometric sensor data (e.g., face, iris, and fingerprints), screening of x-rays and blood samples, and machine processing of aerial and satellite imagery for military use, weather prediction, and crop assessment.

The U.S. Army Research Laboratory (ARL) has been at the forefront of research in image understanding and computer vision using many different sensor modalities. ARL has been performing research on a large number of topics, such as automatic target recognition, humanitarian mine detection, personnel detection, super-resolution, face recognition, object tracking from video sequences, and the use of biometrics for human identification. For more than two decades, ARL researchers have studied and advanced target detection and classification by exploiting concepts and theories from statistical signal processing, neural networks, machine learning, and image understanding. Most of these algorithms were optimized and applied to FLIR, SAR, MMW, and hyperspectral imaging sensor data. One major shortcoming of the previous methods was that they were mainly based on linear detection and classification approaches. More recently at ARL, researchers have developed nonlinear target (anomaly) detection methods2 using the concepts of kernels from statistical learning theory (see Figure 2 for an example). In the last couple of years, sparse representation target classifiers have been implemented using convex optimization techniques that outperform conventional matched filters.3 Other limitations of traditional classifiers are that (a) they need to be retrained when new labeled training data becomes available, and (b) a large training data set is usually needed to design a robust classifier. Using the dictionary-based classifiers recently developed by ARL researchers, new data can easily be incorporated into a classifier without retraining. Furthermore, using the ideas of semisupervised and active learning, the classifiers developed by ARL can be trained continuously using only a limited number of labeled samples.
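To make the sparse-representation idea concrete, the toy sketch below illustrates the general scheme (it is not ARL's implementation, and it substitutes a greedy orthogonal matching pursuit solver for the convex optimization used in the cited work): a test signal is sparse-coded over a dictionary whose columns are training samples, and the predicted class is the one whose atoms alone reconstruct the signal with the smallest residual.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y ~ D @ x with at most
    k nonzero coefficients (stand-in for a convex l1 solver)."""
    support, residual = [], y.copy()
    coef = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def src_classify(D, labels, y, k=3):
    """Sparse-representation classification: report the class whose atoms
    reconstruct the test sample with the smallest residual."""
    x = omp(D, y, k)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - D @ np.where(labels == c, x, 0.0))
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy demo: 10 random unit-norm atoms, 5 per class; the probe is built from
# two class-0 atoms, so SRC should return class 0.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 10))
D /= np.linalg.norm(D, axis=0)
labels = np.array([0]*5 + [1]*5)
probe = 0.8*D[:, 1] + 0.2*D[:, 3]
pred = src_classify(D, labels, probe)
```

Because new training samples simply become new dictionary columns, this kind of classifier can absorb new labeled data without any retraining step, which is the property highlighted above.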
Also using hyperspectral imagery, ARL researchers have been able to detect subpixel or camouflaged targets using the spectral signatures of the desired targets. An overview of the current and future challenges in target detection using hyperspectral imagery can be found in a recent IEEE paper.4 Issues such as the use of computational imaging ideas, advanced nonlinear detection methods, hyperspectral band selection, dimensionality reduction, fusion of hyperspectral sensors with other modalities, mixtures of synthetic and real imagery, and full evaluation of the recent machine-learning-based techniques from laboratory experiments to real field trials are still important research areas to be investigated.
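For context, the classical Reed-Xiaoli (RX) detector is the standard linear baseline that the kernel-based nonlinear anomaly detectors mentioned above extend. A minimal sketch, with a synthetic random cube standing in for real hyperspectral data, scores each pixel's spectrum by its Mahalanobis distance from the global background statistics:

```python
import numpy as np

def rx_detector(cube):
    """Reed-Xiaoli (RX) anomaly detector: score each pixel spectrum by its
    Mahalanobis distance from the scene-wide mean and covariance."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(bands))  # regularize for stability
    diff = X - mu
    # per-pixel quadratic form diff @ cov_inv @ diff
    scores = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return scores.reshape(rows, cols)

# Toy demo: Gaussian background clutter with one implanted anomalous pixel;
# the detector should peak at that pixel.
rng = np.random.default_rng(1)
cube = rng.normal(size=(8, 8, 5))
cube[3, 4, :] += 10.0
scores = rx_detector(cube)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
```

Because RX assumes a single Gaussian background, it degrades in cluttered scenes; that limitation is one motivation for the kernel-based and machine-learning detectors discussed above.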

ARL research activities have recently concentrated on applications such as multisensor data fusion for display and decision making, multimodal biometrics for personnel identification, human detection in image and video data, object tracking, three-dimensional visualization of buildings and cities, human activity classification, and the use of super-resolution to improve the quality of images.5,6,7 The problem of using multiple sensors concurrently and making a joint classification decision remains open in image understanding. One aspect of my current research is the use of multimodal biometrics for personnel identification using the theory of joint sparse representation.6 For example, fingerprint, iris, ear, face, and other biometric signatures, as shown in Figure 3, can be used simultaneously to identify an individual.
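One common way to realize joint sparse representation, sketched below on synthetic data (this is an illustrative toy, not the cited method's implementation), is simultaneous orthogonal matching pursuit: the modalities are forced to share one support by scoring each atom on its summed correlation with every modality's residual, and the identity is the class with the smallest total reconstruction error.

```python
import numpy as np

def somp(dicts, signals, k):
    """Simultaneous OMP: greedily pick ONE shared support across modalities,
    scoring each atom by its summed correlation with all residuals."""
    support = []
    residuals = [y.copy() for y in signals]
    coefs = [np.zeros(0) for _ in signals]
    for _ in range(k):
        score = sum(np.abs(D.T @ r) for D, r in zip(dicts, residuals))
        j = int(np.argmax(score))
        if j not in support:
            support.append(j)
        for m, (D, y) in enumerate(zip(dicts, signals)):
            c, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
            coefs[m] = c
            residuals[m] = y - D[:, support] @ c
    return support, coefs

def joint_classify(dicts, labels, signals, k=2):
    """Identity = class whose atoms (within the shared support) give the
    smallest total reconstruction error summed over all modalities."""
    support, coefs = somp(dicts, signals, k)
    best, best_err = None, np.inf
    for cls in np.unique(labels):
        pos = [i for i, j in enumerate(support) if labels[j] == cls]
        err = 0.0
        for D, y, c in zip(dicts, signals, coefs):
            approx = D[:, [support[i] for i in pos]] @ c[pos] if pos else np.zeros_like(y)
            err += np.linalg.norm(y - approx) ** 2
        if err < best_err:
            best, best_err = cls, err
    return best

# Toy demo: two modalities (say, face and iris features), 8 atoms with
# 4 per subject; the probe pair is built from subject-0 atoms in both modalities.
rng = np.random.default_rng(2)
labels = np.array([0]*4 + [1]*4)
D1 = rng.normal(size=(12, 8)); D1 /= np.linalg.norm(D1, axis=0)
D2 = rng.normal(size=(10, 8)); D2 /= np.linalg.norm(D2, axis=0)
probe = [D1[:, 0] + 0.5*D1[:, 2], D2[:, 0] + 0.5*D2[:, 2]]
pred = joint_classify([D1, D2], labels, probe, k=2)
```

The shared-support constraint is what makes the decision joint: a subject must be supported by the evidence in every modality simultaneously, rather than by each biometric voting separately.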

Cross-modality face recognition, or more generally heterogeneous domain adaptation, is an important application for night-vision surveillance. Here, the goal is to recognize target faces in long-wave or midwave IR camera imagery by leveraging a large amount of labeled training data from another existing domain, such as visible-spectrum face images collected previously. In our current approach to solving this problem, we first learn two dictionaries that can sparsely represent the two modalities with the same sparse representation codes (cross-modality invariant hidden variables), which are then used to train a support vector machine (SVM) classifier. During testing, the sparse codes computed from the available input modality are fed to the SVM classifier to identify the input probe face image.
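The structure of that pipeline can be sketched with off-the-shelf tools. The toy version below (synthetic stand-in features and scikit-learn's generic DictionaryLearning, not the coupled learning algorithm described above) enforces shared codes by learning one dictionary over stacked visible+IR vectors, trains an SVM on those codes, and at test time sparse-codes IR-only probes against the IR half of the dictionary:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import orthogonal_mp
from sklearn.svm import LinearSVC

# Synthetic stand-ins for visible and IR face features of two subjects:
# both modalities observe the same per-subject latent pattern, differently scaled.
rng = np.random.default_rng(0)
n_per, dim = 30, 16
subjects = np.repeat([0, 1], n_per)
latent = rng.normal(size=(2, dim))
visible = latent[subjects] + 0.05 * rng.normal(size=(2*n_per, dim))
ir = 0.5 * latent[subjects] + 0.05 * rng.normal(size=(2*n_per, dim))

# Coupling trick: learn ONE dictionary on stacked [visible | IR] vectors, so the
# two halves of every atom share the same sparse code by construction.
stacked = np.hstack([visible, ir])
dl = DictionaryLearning(n_components=8, transform_algorithm='omp',
                        transform_n_nonzero_coefs=3, random_state=0).fit(stacked)
D_vis, D_ir = dl.components_[:, :dim], dl.components_[:, dim:]

# Train the SVM on the shared sparse codes of the labeled gallery.
svm = LinearSVC().fit(dl.transform(stacked), subjects)

# Test time: only IR imagery is available; code each probe against the IR half
# of the dictionary and feed the resulting codes to the SVM.
ir_codes = np.stack([orthogonal_mp(D_ir.T, y, n_nonzero_coefs=3) for y in ir])
acc = float((svm.predict(ir_codes) == subjects).mean())
```

The point of the sketch is the division of labor: the dictionary pair maps both modalities into one code space, so the SVM trained on labeled visible data can classify IR probes it has never seen in pixel space.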

Image understanding algorithms have so far been able to address only a limited number of applications, typically in restricted or laboratory-based settings. To solve more practical applications, new mathematical models and techniques are still needed. These methods should incorporate concepts from the architecture of the human visual cortex, cognitive science, deep learning, nonlinear signal processing, and intelligent search engines, and be implemented on dedicated high-performance computers. Finally, the critical element in the success of image understanding for developing solutions to real military applications is the coalescence of scientific tools and novel techniques from several multidisciplinary research fields.