Abstract

We report on the image formation pipeline developed to efficiently form gigapixel-scale imagery generated by the AWARE-2 multiscale camera. The AWARE-2 camera consists of 98 “microcameras” imaging through a shared spherical objective, covering a 120° x 50° field of view with approximately 40 microradian instantaneous field of view (the angular extent of a pixel). The pipeline is scalable, capable of producing imagery ranging in scope from “live” one megapixel views to full resolution gigapixel images. Architectural choices that enable trivially parallelizable algorithms for rapid image formation and on-the-fly microcamera alignment compensation are discussed.

1. Introduction

The recent development of multiscale imaging systems for wide Field of View (FoV), high resolution cameras [1] has prompted the need for image formation techniques capable of efficiently handling gigapixel-scale data sets. In multiscale design, the full FoV of the system is imaged by a single monocentric objective [2], forming a spherical intermediate image plane. This intermediate plane is sampled in parallel by an array of “microcameras” that are tiled on a hemispherical surface centered on the objective [3]. Each microcamera reimages a small portion of the full FoV onto a detector array. The microcamera FoVs overlap slightly, such that there are no gaps in the overall FoV. The primary advantage of this architecture is a reduction in system resources—mass, volume, and cost—compared with an equivalent, but traditional, optical system capable of the same FoV and resolution.

A prototype gigapixel multiscale imager, termed AWARE-2, has been developed as part of the Advanced Wide field-of-view Architectures for image Reconstruction and Exploitation (AWARE) program at DARPA. The system covers a 120° x 50° horizontally-oriented FoV with 98 microcameras. Each microcamera is capable of capturing a 14 megapixel (MP) image at 10 Hz, resulting in 1.4 gigabytes of data per frame. Due to overlap and unused portions of the sensors, the fusion of these individual pictures produces a final image of approximately 1 gigapixel.

The primary software design goal is to develop a powerful and scalable image formation pipeline that can be applied to any multiscale imaging system. Ultimately, AWARE is intended to simultaneously deliver near real-time frame rates (10 Hz) to multiple, independent users. This is a prodigious amount of data. There is currently no practical way to form images of this scale at a 10 Hz frame rate. Further, forming a gigapixel image ten times a second would fill a 10 terabyte array in 17 minutes, severely limiting the utility of generating imagery in such quantities. We further note that no display device is capable of presenting a one gigapixel image. A typical user currently has a display roughly one megapixel in size. Even with current trends towards high-dots-per-inch systems, individual displays are unlikely to significantly exceed ten megapixels in the foreseeable future, largely as a result of the limited visual acuity of the human eye.
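The data volumes quoted above follow from simple arithmetic. The sketch below reproduces them, assuming 8-bit samples (one byte per pixel), decimal units, and a one-gigapixel composite; the variable names are illustrative only.

```python
# Back-of-the-envelope data rates for AWARE-2.
MICROCAMERAS = 98
PIXELS_PER_MICROCAMERA = 14e6   # 14 MP sensor
FRAME_RATE_HZ = 10
BYTES_PER_PIXEL = 1             # assumption: 8-bit samples

# Raw sensor data produced by the whole array for one frame.
raw_bytes_per_frame = MICROCAMERAS * PIXELS_PER_MICROCAMERA * BYTES_PER_PIXEL

# Time to fill a 10 TB array with one-gigapixel composites at 10 Hz.
composite_bytes_per_second = 1e9 * BYTES_PER_PIXEL * FRAME_RATE_HZ
minutes_to_fill = 10e12 / composite_bytes_per_second / 60

print(f"raw data per frame: {raw_bytes_per_frame / 1e9:.2f} GB")
print(f"10 TB array fills in {minutes_to_fill:.1f} minutes")
```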

This limitation on display scale is the key insight that provides not only an understanding of how the nature of human-image interaction must change in the era of gigapixel-scale imagery, but also yields a response to the potential data-overload and processing constraints. The solution is to provide the user with a “live” interface that primarily presents display-scale imagery (approximately one megapixel) at the system frame rate, while also providing the capability for the user to flag a frame for archival storage at the native gigapixel resolution. The live interface acts as a “window” into the full set of imagery data, allowing a user to pan and zoom as desired.

Significant data-transfer savings can be achieved when the microcamera electronics are engineered to deliver image data over sub-camera regions-of-interest (ROIs) and at various resolution scales. To understand this, we consider two limiting cases. In the first case, the user is fully zoomed out in order to observe the entire system FoV. Here, every microcamera in the array must deliver data corresponding to the full microcamera FoV, but may provide that data in an extremely low resolution form. In the second case, the user is fully zoomed in (albeit perhaps still spanning multiple microcameras). Therefore, only the relevant microcameras must deliver data. This data will be at the native pixel scale of the microcameras, but need only contain the specific sub-regions of the individual microcamera FoVs that would fall in the user’s display. Any intermediate case will provide some mixture of data savings via resolution reduction or microcamera ROIs. As a result, the net data bandwidth required per user is on the order of five megapixels per frame (50 megabytes/sec for 8 bit grayscale imagery at 10 Hz), which is well within the realm of practicality.
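The two limiting cases can be illustrated with a short sketch that picks a downsampling factor for a requested view and converts a per-frame pixel budget into a data rate. The 40 microradian pixel scale comes from the abstract; the power-of-two restriction and the helper names are assumptions made for illustration, not documented properties of the AWARE electronics.

```python
import math

IFOV_RAD = 40e-6  # instantaneous field of view per native pixel

def decimation_factor(view_width_deg, display_width_px):
    """Largest power-of-two downsampling that still fills the display.

    Hypothetical helper: the power-of-two restriction is an assumption.
    """
    native_px = math.radians(view_width_deg) / IFOV_RAD  # native pixels across the view
    ratio = native_px / display_width_px
    if ratio <= 1:
        return 1  # fully zoomed in: native resolution, ROI only
    return 2 ** int(math.floor(math.log2(ratio)))

def bandwidth_mb_per_s(pixels_per_frame, frame_rate_hz=10, bytes_per_pixel=1):
    """Data rate in megabytes per second for a given per-frame pixel budget."""
    return pixels_per_frame * bytes_per_pixel * frame_rate_hz / 1e6

print(decimation_factor(120.0, 1000))  # fully zoomed out: heavy downsampling
print(decimation_factor(2.0, 1000))    # zoomed in: native pixels
print(bandwidth_mb_per_s(5e6))         # five megapixels per frame
```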

This approach naturally leads to a parallelizable architecture. To the fullest extent possible, real-time operations should be distributed over many computational work units to rapidly generate the live windows into the array for multiple users. Furthermore, model-based image formation methods can reduce the computation time spent by a single work unit. Simultaneously, the processing time from more complex operations can be amortized over many frames as background processes. Here we describe the image formation pipeline we have developed to efficiently perform image formation in a multiscale system. It conceptually mirrors the multiscale design philosophy found in the AWARE optical and electronic systems [2]. The work of imaging, transmitting, and processing portions of the FoV is distributed among the array of microcameras, electronics, and computational units. In this manuscript, we present a general processing pipeline developed to achieve these goals.

2. Gigapixel image formation

A multiscale image is fundamentally formed by the combination of many smaller images that cover the full FoV of the camera. In the AWARE-2 prototype, the output image is formed from 98 microcamera images, tiled as shown in Fig. 1(a). The tiling on the sphere is designed to prevent gaps in the FoV while ensuring that there is sufficient overlap between microcameras to align them with each other (see Section 3). At the same time, the tiling should not waste detector elements with unnecessary amounts of overlap. No regular tiling of a spherical surface exists; the layout presented in Fig. 1(a) is the AWARE-2 solution for optimizing the overlap regions. Though AWARE-2 was originally designed to cover a 120° cone in object space, the prototype is most often used in a horizon-pointing configuration. Hence, only those microcameras that cover this region, highlighted in Fig. 1(a), are populated in the prototype. The microcameras are held in a machined aluminum geodesic dome, shown in Fig. 1(b), which matches the tiling layout [4].

Fig. 1. The microcameras are tiled on the surface of a hemisphere to optimally cover the full system FoV without gaps. The projection of the microcamera FoVs into object space is shown in (a), where the microcameras populated in AWARE-2 have been highlighted. The machined aluminum geodesic dome, approximately 11.5” in diameter, which holds the microcameras in this configuration, is shown in (b).

Combining the contributions from many images to form a single image (termed “compositing” or “stitching”) is done by projecting data from the local coordinate system of a microcamera into a shared spherical coordinate system representing object space. The shared coordinate system can be projected into the coordinate system of a display device in a variety of ways [5]. This projection must be applied to all microcameras, taking into account any optical characteristics that might affect the projection (e.g. optical distortion). At every point in object space (equivalent to a pixel in the composited image), the projected image-space pixels must be unified in a way that utilizes information from overlapping microcameras to accurately estimate the true intensity of that location in object space. The key to efficient image formation in the AWARE system is to decompose both sides of this problem into trivially parallelizable steps that allow the processing to be done much more quickly.

2.1 MapReduce framework

The AWARE processing architecture is based around the MapReduce concept, first popularized by Google in 2004. The algorithm is divided into two stages—map and reduce—that together form the framework for processing the data. Ideally, each map operation (or reduce operation), regardless of its form, is independent of the others, allowing the process to be fully parallelizable. In general, a MapReduce approach simply converts a list of key/value pairs into another list of key/value pairs. First, the map step transforms each pair into an intermediate pair via some arbitrary operation. Subsequently, the reduce step groups intermediate pairs with the same key value and produces a single resultant key/value pair for each group, again via some arbitrary operation [6]. The principal benefit of this method is that extremely time consuming calculations can be broken down into many trivial independent calculations. With the abundance of computing power commonly available, these calculations can be performed in parallel, thereby dramatically reducing the total computing time. Though not all algorithms, nor all image formation techniques, are amenable to the MapReduce approach, we have explicitly developed image formation in AWARE to be posed in such a form, as we describe below.

In the AWARE context, the input key is a tuple describing the microcamera location in the array and the pixel location within that microcamera. This may be viewed as describing the location of the pixel in a local coordinate system. As shown in Fig. 2, AWARE data is then viewed as a large list of key/value pairs ([microcamera_number, pixel_number], pixel value). Figure 2 then demonstrates that the map function transforms each member of the list of input pairs into an intermediate key/value pair by projecting those local coordinates into the shared coordinate system. The projection of a given pixel on a given microcamera is completely independent of any other pixel on any other microcamera, thus the map operation can be performed on every pixel in the array in parallel. This forms a list of key/value pairs ([θ_x, θ_y], [m, b, t]), where θ_x and θ_y are the angular departure from the system optical axis in x and y, m is the pixel value, b is the predicted relative illumination for that pixel (discussed in Section 2.2), and t is the exposure time for that microcamera. More accurately, θ_x and θ_y are a Cartesian description of an object space angle’s spherical coordinates that have been projected onto a plane and quantized as pixels in the final image. Due to microcamera overlap and aliasing, pixel values that originated from different input keys may map to the same output coordinate. The reduce function is applied to each output location to combine the list of values that represent information coming from that angle in object space. The final output is a list of values, i.e. an array that forms the image of the desired region in object space.
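The key/value flow described above can be sketched in a few lines. This is a minimal serial illustration, not the AWARE implementation: the `project`, `illumination`, and `exposure` inputs are hypothetical stand-ins for the optical model and camera settings, and the reduce step is a simple normalized mean used as a placeholder for the estimator developed in Section 2.2.

```python
from collections import defaultdict

def map_pixel(key, value, project, illumination, exposure):
    """Map ([camera, pixel], m) -> ([theta_x, theta_y], [m, b, t])."""
    camera, pixel = key
    out_key = project(camera, pixel)  # quantized output-image coordinate
    return out_key, (value, illumination(camera, pixel), exposure[camera])

def reduce_values(values):
    """Placeholder reduce: exposure- and illumination-normalized mean."""
    return sum(m / (b * t) for m, b, t in values) / len(values)

def composite(pairs, project, illumination, exposure):
    groups = defaultdict(list)
    for key, value in pairs:  # each map call is independent of the others
        out_key, out_value = map_pixel(key, value, project, illumination, exposure)
        groups[out_key].append(out_value)
    # Each reduce call touches only the values sharing one output key.
    return {k: reduce_values(v) for k, v in groups.items()}

# Toy example: two 3-pixel "microcameras" that overlap at one output location.
pairs = [((0, 0), 10), ((0, 1), 20), ((0, 2), 30),
         ((1, 0), 60), ((1, 1), 80), ((1, 2), 100)]
result = composite(
    pairs,
    project=lambda cam, pix: (2 * cam + pix, 0),
    illumination=lambda cam, pix: 1.0,
    exposure={0: 1.0, 1: 2.0},  # camera 1 exposed twice as long
)
print(result)
```

In the toy data, the last pixel of camera 0 and the first pixel of camera 1 share an output key, so the reduce step merges them after normalizing by exposure time.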

Fig. 2. The MapReduce approach breaks the image formation process into two parts. The map step transforms a list of key/value pairs that represent the intensity value for a given pixel on a given microcamera into an intermediate list of key/value pairs which represent the intensity value for a given location in object space. This location corresponds directly to a pixel in the output image. The reduce step combines key/value pairs sharing the same key to form an estimate of the intensity that was present at that single location in object space.

Though the next two sections will provide more detail on the specifics of the calculations, we wish to emphasize that the ability to express the gigapixel image formation computation in terms of many simple independent operations is what makes the process approachable on a live timescale. This is supported by an electronics architecture that allows the compositor to request image data at the appropriate scale to minimize bandwidth and processing time. Though Google and others have applied MapReduce to computationally intensive problems in other disciplines, we believe bringing it to bear in an image formation context significantly changes how we think about such problems.

2.2 Optical modeling

The map operation is more than a simple coordinate transformation from Cartesian (detector) coordinates to spherical (object space) coordinates. It must also account for image distortion resulting from aberrations in the optics. In other words, magnification variation (as a function of field angle) must be removed before performing the projection into spherical coordinates. However, we can predict the form of this variation via analysis of the optical design using optical modeling software (e.g. ZEMAX). A low-dimensional polynomial model is then fit to the data to produce an algorithmic description of the distortion. The use of a low-dimensional parametric representation is important, as formation of a one gigapixel composite image will consist of roughly 1.4 billion mapping operations; more complicated models become extremely computationally expensive on this scale. However, to preserve the quality in the stitched image, the model must be accurate enough to predict the distortion to within a detector pixel. This trade-off led to a 9th degree polynomial distortion model for the first AWARE-2 prototype. As shown in Fig. 3(a), polynomials of lower order are insufficient to obtain the desired level of fidelity. We have found, empirically, that an odd function produces the best results, which is in line with the standard formulations in photogrammetry and computer vision [7, 8]. The map operation is then a combination of this distortion model and the coordinate transformation from local Cartesian coordinates to global Cartesian angular coordinates (defined as Θ, distinct from θ used to denote altitude in global spherical coordinates). In effect, the map converts a radial physical distance from the center of the detector to a radial angle from the optical axis in object space, as demonstrated by Eq. (1). Applying this map locally to all cameras around their optical axis, regardless of their location on the sphere, allows the same calculation to be used throughout the array.
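A radial map of this kind can be sketched as an odd polynomial in the detector radius, evaluated identically for every microcamera. The coefficients below are hypothetical placeholders chosen only to make the example run; real values would come from a fit to ray-trace data for the actual design, and the function names are illustrative.

```python
import numpy as np

# Hypothetical coefficients for the odd powers r, r^3, r^5, r^7, r^9
# (radians per mm^n). These are NOT the AWARE-2 values.
ODD_COEFFS = np.array([2.0e-2, -1.0e-5, 3.0e-8, 0.0, 0.0])

def radial_angle(r_mm, coeffs=ODD_COEFFS):
    """Odd 9th-degree polynomial: radial detector distance -> radial field angle."""
    r = np.asarray(r_mm, dtype=float)
    powers = np.stack([r ** n for n in (1, 3, 5, 7, 9)], axis=-1)
    return powers @ coeffs

def map_to_local_angles(x_mm, y_mm, coeffs=ODD_COEFFS):
    """Map detector (x, y) to local angular coordinates (Theta_x, Theta_y)."""
    x = np.asarray(x_mm, dtype=float)
    y = np.asarray(y_mm, dtype=float)
    theta = radial_angle(np.hypot(x, y), coeffs)
    azimuth = np.arctan2(y, x)  # the azimuth about the optical axis is preserved
    return theta * np.cos(azimuth), theta * np.sin(azimuth)

tx, ty = map_to_local_angles(np.array([0.0, 3.0]), np.array([0.0, 4.0]))
print(tx, ty)
```

Because only odd powers appear, the map is antisymmetric about the optical axis, and the model assumes radial symmetry.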

Fig. 3. Parametric models are used to predict the distortion and relative illumination as a function of radial position on a sensor. (a) Comparison of several polynomial functions to the distortion found in a ZEMAX simulation, demonstrating that a 9th order polynomial is sufficient to achieve a pixel-accurate distortion prediction. (b) Fit for the relative illumination model using an 8th order polynomial.

The analysis shown in Fig. 3 was performed with distortion data obtained from the ZEMAX simulation. The actual microcameras will, of course, deviate from this nominal model. However, the specific choice of model (assuming minor deviations) will not dramatically affect the efficacy of a polynomial of this form. It should be noted that this polynomial modeling also makes two important simplifications. The first is that the distortion is radially symmetric. This is certainly not the case with physical hardware, but we have observed that the resultant errors are minor, as will be discussed in Section 2.5. Second, focus changes (see Section 4) will change the magnification of the cameras, and presumably the distortion as well. We may be able to calibrate (see Section 2.5) for this effect to introduce a focus-dependent magnification term in the distortion model. To avoid the effect, the next generation AWARE camera has been designed to be telecentric in image space.

For a given (θ,ϕ) location in the reconstruction, the reduce operation must combine a set {m_k} of pixel values that represent measurements corresponding to that location. The value of a specific pixel, m_k, is related to the source irradiance I(θ,ϕ) via

Equation (2) is known as the general linear model, and it can be shown that the maximum likelihood estimator (MLE) is the optimal unbiased linear estimator for that case [9]. Using the notation above, the MLE for the source irradiance Î(θ,ϕ) takes the form
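One plausible form of the model and estimator, consistent with the symbols used in this section, is sketched below. It assumes that each of the K overlapping measurements is the source irradiance scaled by the relative illumination b_k and exposure time t_k, plus zero-mean noise n_k of equal variance. The noise term and the equal-variance assumption are introduced here for illustration; under them the estimator reduces to ordinary least squares.

```latex
m_k = b_k \, t_k \, I(\theta,\phi) + n_k, \qquad k = 1, \dots, K

\hat{I}(\theta,\phi) = \frac{\sum_{k=1}^{K} b_k \, t_k \, m_k}{\sum_{k=1}^{K} \left( b_k \, t_k \right)^2}
```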

We model the relative illumination, b_k, for the system in the same way we model the distortion: via a ZEMAX analysis. The relative illumination model for the first AWARE-2 prototype is shown in Fig. 3(b) and stated explicitly in Eq. (5). Like the distortion model, it is a function of radial physical distance from the center of the detector and produces an illumination value relative to the peak. In this case, an 8th order polynomial was necessary to empirically achieve a satisfactory fit, as the illumination does not follow the classical cos^4 roll-off [10].

Every microcamera can have individual gain and exposure settings, giving rise to a synthesized high dynamic range composited image (see Section 4). These dynamic settings are accounted for in Eq. (4) by scaling the pixel values by the exposure time (t_k) of their microcameras. Though we do not currently vary gain in the microcameras, a factor to include gain could similarly be included. Again, this can all be done completely in parallel.
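A reduce step of this kind might look like the following sketch. It assumes the linear measurement model m_k = b_k t_k I + noise with equal noise variance for every overlapping pixel, in which case the estimate is a least-squares ratio; the function name and the equal-variance assumption are illustrative rather than taken from the AWARE code.

```python
import numpy as np

def reduce_mle(m, b, t):
    """Least-squares estimate of irradiance from overlapping measurements.

    Assumes m_k = b_k * t_k * I + noise, with equal noise variance for all k.
    """
    m, b, t = (np.asarray(v, dtype=float) for v in (m, b, t))
    w = b * t  # combined illumination and exposure weight per measurement
    return float(np.sum(w * m) / np.sum(w * w))

# Two overlapping pixels viewing the same irradiance (100 in arbitrary units)
# with different relative illumination and exposure time.
b = [1.0, 0.8]
t = [0.01, 0.02]
m = [1.0, 1.6]  # noise-free values of b * t * 100
print(reduce_mle(m, b, t))
```

Measurements with low relative illumination or short exposure carry less weight, so dim edge pixels contribute less to the estimate than well-exposed ones. Each output location is reduced independently, which keeps the step parallel.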

2.3 Image composition

By applying these functions to the MapReduce framework, a collection of microcamera images can be mapped and combined into an output image. The map function does, however, require knowledge of the global orientation of each microcamera optical axis (“pointing”), as well as the angular rotation of the microcamera focal plane about that axis (“clocking”). There is a default Look-Up Table (LUT) for the as-designed system, specifying the pointing and clocking of each microcamera. Deviations from this as-designed table are covered below in Sections 2.5 and 3. Assuming we have a valid LUT, we can combine any number of microcamera images in parallel to rapidly form a single stitched output image, as schematically shown in Fig. 4.
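The role of pointing and clocking can be sketched as a rotation that carries a ray from a microcamera's local frame into the shared frame. The axis conventions, the angle parameterization (azimuth, elevation, clocking), and the function names below are assumptions chosen for illustration; the actual LUT format is not specified here.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def local_to_global(ray_local, azimuth, elevation, clocking):
    """Rotate a unit ray from a microcamera frame into the shared frame.

    Convention (assumed): +z is the optical axis, clocking is a roll about z,
    elevation is a rotation about x, and azimuth is a rotation about y.
    """
    R = rot_y(azimuth) @ rot_x(-elevation) @ rot_z(clocking)
    return R @ np.asarray(ray_local, dtype=float)

# Hypothetical LUT entry: (azimuth, elevation, clocking) in radians.
lut = {17: (np.radians(30.0), np.radians(10.0), np.radians(2.0))}
print(local_to_global([0.0, 0.0, 1.0], *lut[17]))
```

The on-axis ray is unaffected by clocking, so its global direction depends only on the pointing angles. Off-axis rays depend on all three.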

Fig. 4. Values from every pixel from every microcamera are mapped into object space. Pixels that overlap in the shared coordinate system are reduced to a single value. These operations can run in parallel on every pixel to quickly form a stitched output image. The set of images on the left depicts a collection of detector outputs. The image on the right is a portion of the final image generated from this group of individual microcamera images that have been positioned with an understanding of the geometry of the array.

2.4 Flat-field illumination compensation

The models presented in Section 2.2 are developed using a ZEMAX analysis of the AWARE-2 optical system. These models share the fundamental limitation that they can only capture the as-designed characteristics of the system. Manufacturing and assembly tolerances will naturally perturb the as-designed model for every microcamera individually. If not corrected, this deviation can have a significant impact on the quality of the stitched image. We have qualitatively seen that the impact on the distortion model is relatively minor. Though some variation with respect to the design surely exists, it does not rise to a level that noticeably degrades the final imagery.

The as-built illumination performance, however, currently deviates dramatically from the as-designed model. The change here is partially driven by the aforementioned tolerances, but perhaps more importantly by significant stray light present in the system. A cursory examination of a representative microcamera image, shown in Fig. 5(a), reveals significant stray light effects. The ring artifacts on the outer edges of the microcameras are a direct result of unwanted rays traversing the system. These effects are not captured in the ZEMAX-based illumination model and thus remain uncorrected in the stitched image shown in Fig. 5(b). In the next-generation array, this issue can be resolved by incorporating more stringent stray light control into the system design. For AWARE-2, however, we mitigate the effect by correcting for the illumination pattern with flat-field microcamera measurements.

Fig. 5. (a) Deviations from the as-designed illumination model produce microcamera imagery with stray light artifacts. (b) Compositing with this data creates rings in the stitched image. (c) Flat-field measurements can be taken on a per-microcamera basis. (d) Camera data can be corrected with the flat fields to reduce stray light effects. (e) Compositing with the corrected data produces a nearly-seamless composite image.

To perform the flat-field measurements, the system objective lens is covered with a plastic hemispherical dome and pointed at a relatively uniform light source (e.g. cloudless sky approximately 90° from the sun) to create an approximately angularly-invariant light source. The image recorded by each microcamera is then a flat-field measurement for the microcamera illumination pattern, shown in Fig. 5(c). This is essentially a pixel-wise measurement of the illumination profile (as opposed to a low-dimensional parametric model) that is used for the b_k term in Eq. (4) instead of the value predicted by the model. Unlike the map step, however, illumination compensation is computationally simple; using the raw data, as opposed to a parametric model, does not significantly impact processing time. A microcamera image corrected via the flat field is shown in Fig. 5(d). An image stitched with this compensation method, shown in Fig. 5(e), exhibits an immediately noticeable improvement in image quality.
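For a single microcamera image, as in Fig. 5(d), the correction amounts to dividing by the normalized flat field. The sketch below assumes the flat is normalized to its peak; the `floor` parameter, which limits noise amplification in very dark regions, is an added assumption rather than part of the described procedure.

```python
import numpy as np

def flat_field_correct(raw, flat, floor=1e-3):
    """Divide a microcamera image by its normalized flat-field measurement."""
    raw = np.asarray(raw, dtype=float)
    flat = np.asarray(flat, dtype=float)
    b = flat / flat.max()     # per-pixel relative illumination, peak = 1
    b = np.maximum(b, floor)  # avoid amplifying noise where the flat is near zero
    return raw / b

# Toy 2x2 example: a uniform scene of value 50 seen through uneven illumination.
flat = 200.0 * np.array([[1.0, 0.5], [0.25, 1.0]])
raw = 50.0 * flat / flat.max()
print(flat_field_correct(raw, flat))
```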

Though the compensation works extremely well in the example shown, one can see less effective uses at other places in this manuscript. This is due to the same stray light problems in the microcamera optics. Though the stray light generally conforms to the demonstrated ring pattern, there are variations based on current lighting conditions. As a result, microcamera boundaries will be least visible with data taken in lighting conditions similar to those under which the flat field data was taken. This is especially challenging with outdoor scenes, but we believe that better stray light control in the next generation camera will eliminate this variation.

2.5 Calibration

Flat-field illumination compensation is essentially a calibration procedure for the array; it measures the current illumination profile of the system. As we develop more precise methods of performing this calibration, we can use the flat-field, as-built data to generate updated low dimensional illumination models, in the same way that we originally generated such models from as-designed data extracted from the ZEMAX prescription. We are also developing methods to calibrate the system to characterize the distortion and pointing on a per-microcamera basis.

The first goal of this calibration would be to find the optical axis for every microcamera. This is made slightly more complicated by any potential detector decenter with respect to the microcamera optical axis. However, that decenter can be calibrated during microcamera assembly and subtracted from calibration measurements (as well as composites). This allows us to update the LUT described in Section 2.3 to the as-built parameters.

The second calibration goal will be to identify the best distortion model for each microcamera. From our experience with ZEMAX analyses, we can identify a set of approximately 20 object angle locations in a single microcamera FoV that provide sufficient information to generate an accurate distortion model. Although we do not currently observe effects from distortion model mismatch, they will become more important as microcamera image quality improves in later generations of the array.

2.6 Computational performance timing

Our primary goal while architecting the image formation pipeline was to ensure that the computationally intensive portions were trivially parallelizable. This led to our use of the MapReduce framework, described in Section 2.1. Because we have framed the problem as an assembly of independent operations, they can be run by as many independent processes as are available. In principle, one process per map operation would be optimal; in practice, communication overhead means that each process must handle a much larger unit of work.

For the AWARE-2 system, we currently utilize a 9-node cluster; each node has dual 6-core Intel Xeon processors. A single node is dedicated as the “root” for the compositing processes; each processor on the remaining nodes is a dedicated “worker.” The parallel composite is managed via the Message Passing Interface (MPI) framework [11]. The root node handles communication with the user via a MATLAB Graphical User Interface (GUI) and distribution of jobs to the workers. Figure 6 is a schematic of the MPI framework, which utilizes an internal job queue that allows it to disseminate the work evenly among the available resources. The map and reduce operations are distributed via this process. This architecture has allowed us to reduce the total computation time in the AWARE computational cluster for a one megapixel window to approximately 0.8 seconds. A full resolution one gigapixel image can be generated in approximately 3 minutes.
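The root/worker job queue can be illustrated with a small sketch. Python's standard-library thread pool stands in for MPI here, purely to show the queueing pattern; the tiling of the output image into jobs, the tile size, and all function names are assumptions, and `composite_tile` is a placeholder for the real map and reduce work.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tiles(width, height, tile):
    """Yield (x0, y0, x1, y1) output-image tiles; each tile is one job."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))

def composite_tile(bounds):
    """Placeholder for the map and reduce work on one tile (returns pixel count)."""
    x0, y0, x1, y1 = bounds
    return (x1 - x0) * (y1 - y0)

def run_root(width, height, tile, workers):
    """Root role: build the job list and let the pool hand jobs to free workers."""
    jobs = list(split_into_tiles(width, height, tile))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(composite_tile, jobs))

results = run_root(1000, 1000, 256, workers=4)
print(len(results), sum(results))
```

Grouping pixels into tiles reflects the point made in Section 2.6 that communication overhead favors jobs much larger than a single map operation.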

Fig. 6. AWARE utilizes the Message Passing Interface (MPI) framework to distribute compositing work among a pool of workers. Each processing core in each server is designated a worker. The root node receives commands via some user interface and distributes the jobs to the workers.

2.7 Graphics Processing Units (GPUs)

Graphics Processing Units (GPUs) are becoming an increasingly common solution for addressing highly-parallelizable computational tasks. For certain tasks, the extremely high number of computational cores (>1500 in the latest-generation cards [12]) can dramatically reduce computation time. The utility of GPU solutions can be limited in high-memory applications, as a result of memory-bandwidth issues. However, for AWARE-2 we have found the memory transfer time to be a minor (<10%) portion of the total computation.

An NVIDIA GTX 570 desktop GPU with 480 cores is able to perform a live one megapixel composite approximately three times faster than the full 80-processor cluster described above. Further, it accomplishes this with dramatically reduced cost and power. Moreover, as GPUs rapidly become more powerful, we expect that the increased number of cores should further reduce computation time. Figure 7 demonstrates the decrease in compositing time as more cores are applied to the computation; an approximate power-law relationship between the two is readily observed, although the performance gains begin to saturate for very large numbers of workers. We hypothesize that this is the result of a fixed data transition time overhead, as well as the fact that the number of workers was beginning to exceed the number of physical cores in the GTX 570 GPU.
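The hypothesized saturation can be expressed as a toy timing model: a fixed overhead plus parallel work that divides across workers only until the physical cores are exhausted. This is an illustrative model of the hypothesis stated above, not a fit to the measured data, and the parameter values in the example are arbitrary.

```python
def composite_time(workers, parallel_time, fixed_overhead, physical_cores=480):
    """Toy model: parallel work divides across cores until the cores run out."""
    effective = min(workers, physical_cores)
    return fixed_overhead + parallel_time / effective

# Arbitrary units: 100 units of parallelizable work, 0.05 units of fixed overhead.
for n in (1, 16, 128, 480, 960):
    print(n, composite_time(n, parallel_time=100.0, fixed_overhead=0.05))
```

In this model the time falls roughly as 1/workers while the parallel term dominates, then flattens once the fixed overhead or the core count becomes the limit.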

Fig. 7. The relative time to composite an image decreases as more workers are used in the computation. This experiment was done on an NVIDIA GTX 570 with 480 cores; requesting more workers than are available therefore results in a reduced performance gain.

3. Registration

Gigapixel image composition relies on knowledge of the relative pointing and clocking of the microcameras in the array. For the general stitching problem, there are two primary ways in which the relationship between images is found. Many commercial stitching software packages make no assumptions about the geometry of the image acquisition. Programs such as the Microsoft Research Image Composite Editor (ICE) use a variety of techniques to find commonalities among a set of images in order to identify the most likely alignment of those images with respect to each other [13, 14].

An alternative method is to take advantage of some known geometry of the imagery. Applications on smart phones and other position-aware devices are beginning to utilize this technique to limit the solution space when finding relationships between images in the set. AWARE has the advantage of being able to extend this concept further due to the fixed mechanical layout of the microcameras. Knowing the pointing and clocking of every microcamera, to within manufacturing and assembly tolerances, dramatically reduces the set of possible solutions for microcamera registration. However, it remains true that there will be some deviation from the designed alignment of the microcameras. The aforementioned tolerances, as well as mechanical drift due to physical shocks or thermal variation, will inevitably alter the relative positions of the microcameras. Figure 8 demonstrates the drift that occurs in an AWARE camera. For wide FoV composites, potential mismatches are often not visible, as shown in Fig. 8(a). However, Fig. 8(b) is a composite of a much smaller FoV, performed with the as-designed alignment parameters. The double images of several people in the crowd demonstrate a drift in microcamera alignment. Clearly, some monitoring of the current alignment of the system is necessary for accurate compositing.

Fig. 8. (a) Registration errors are not noticeable when viewing a wide FoV. (b) In narrow FoV composites, misalignment between neighboring microcameras can be clearly seen.

3.1 Feature extraction

There are several methods for the registration of images sharing some portion of their scenes [15, 16]. One powerful method is to extract features from the images and geometrically match corresponding features to identify the geometric alignment of the images with respect to each other. The most common contemporary algorithms for extracting distinctive features from natural imagery are the Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) [17, 18]. Our processing utilizes both algorithms to increase the reliability of the extracted feature clusters. These clusters are then mapped into object space (via the same mechanisms that map the image data during compositing).

The general registration problem attempts to match these clusters with each other by adjusting the microcamera alignment parameters in the compositing LUT, as well as allowing minor modifications to the assumed microcamera magnification. Some threshold must be set on the quality of the features to limit the number of potential combinations. Commercial registration programs must find a balance between accuracy (few features) and robustness (many features). As discussed above, the physical structure of the camera limits the subset of features that can match between two microcameras to the known geometry plus some mechanical drift. For AWARE-2, we limit the allowable search space by restricting microcamera pointing variation to within one degree (significantly larger than the mechanical tolerances will allow). Figure 9 demonstrates registration analysis for two neighboring microcameras. Features in both images are indicated with markers. These features are projected into object space and one set is adjusted to match the other. The process will be imperfect, resulting in a residual registration error that can be used as a metric for the quality of the matches.
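The constrained pairwise match can be sketched as follows. This simplified example matches features by position alone in object space, rejects candidate pairs separated by more than one degree, and fits a translation only. A real pipeline would also compare SIFT or SURF descriptors and adjust clocking and magnification, so the function names and the translation-only model are assumptions.

```python
import numpy as np

def match_features(a, b, max_sep_deg=1.0):
    """Pair each feature in `a` with its nearest neighbor in `b`.

    `a` and `b` are (N, 2) and (M, 2) arrays of object-space angles in degrees.
    Pairs farther apart than `max_sep_deg` are rejected.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(a)), nearest] <= max_sep_deg
    return np.flatnonzero(keep), nearest[keep]

def registration_offset(a, b, max_sep_deg=1.0):
    """Best-fit pointing shift of `b` onto `a` and the RMS residual (degrees)."""
    ia, ib = match_features(a, b, max_sep_deg)
    delta = np.asarray(a, dtype=float)[ia] - np.asarray(b, dtype=float)[ib]
    shift = delta.mean(axis=0)
    residual = np.sqrt(np.mean(np.sum((delta - shift) ** 2, axis=1)))
    return shift, residual

# Toy data: camera B sees the same three features displaced by a small drift,
# plus one unrelated feature that lies outside the one-degree search window.
feats_a = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
feats_b = np.vstack([feats_a + [0.1, -0.05], [[20.0, 20.0]]])
shift, residual = registration_offset(feats_a, feats_b)
print(shift, residual)
```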

Fig. 9. SIFT and SURF algorithms are used to identify clusters of features (shown as markers in the images) in neighboring microcameras. The clusters are transformed into object space and compared to calculate a registration error. The transformation parameters are adjusted to minimize the error.

3.2 Global error minimization

The method described above produces a best-fit for a pair of microcameras and a resultant residual error. However, the parameters by which a microcamera best matches one neighbor may not be the same parameters that produce the best matches with its five other neighbors. As the registration corrections propagate throughout the array, microcameras can be pulled in contradictory directions while attempting to match their neighbors individually. A solution must therefore be implemented that takes into account the condition of the entire array when minimizing the registration error. To this end, we calculate a global error metric, which is the sum of the residual errors resulting from each microcamera pair. A global optimization routine (non-linear least squares) is then used to simultaneously adjust the alignment parameters for all of the microcameras to minimize this array-wide error. By performing this operation, we dramatically improve the composited image, as shown when comparing the as-designed model used in Fig. 10(a) with the as-built model used in Fig. 10(b).

Fig. 10. (a) A composite formed with unregistered camera angles will have stitching errors due to mechanical and thermal drift, as shown in this overlap region between three cameras. (b) The extracted features can be used to find a globally optimal registration, leading to an improved composited image.
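
The difference between pairwise and global fitting can be illustrated with a small sketch. It is deliberately simplified and should not be read as the routine used for AWARE-2: each microcamera is given only a two-component pointing offset, which makes the problem linear, so a plain iterative relaxation (each camera repeatedly moves to the mean position requested by its neighbors) stands in for the non-linear least squares optimizer described above. The `pair_shifts` input, the anchored camera, and all names are assumptions made for illustration.

```python
def global_error(offsets, pair_shifts):
    """Array-wide metric: sum of squared residuals over every camera pair."""
    total = 0.0
    for (i, j), (m_az, m_el) in pair_shifts.items():
        r_az = offsets[j][0] - offsets[i][0] - m_az
        r_el = offsets[j][1] - offsets[i][1] - m_el
        total += r_az ** 2 + r_el ** 2
    return total


def solve_global(n_cams, pair_shifts, anchor=0, sweeps=200):
    """Jointly adjust all camera offsets to minimize global_error.

    pair_shifts maps (i, j) -> measured shift of camera j relative to
    camera i, in degrees, as (azimuth, elevation). The anchor camera is
    held at zero so the overall pointing of the array is fixed.
    """
    offsets = [[0.0, 0.0] for _ in range(n_cams)]
    for _ in range(sweeps):
        for k in range(n_cams):
            if k == anchor:
                continue
            # Each pair involving camera k "votes" for where k should sit.
            votes = []
            for (i, j), m in pair_shifts.items():
                if j == k:
                    votes.append((offsets[i][0] + m[0], offsets[i][1] + m[1]))
                elif i == k:
                    votes.append((offsets[j][0] - m[0], offsets[j][1] - m[1]))
            if votes:
                # The mean of the votes minimizes the squared error for k.
                offsets[k][0] = sum(v[0] for v in votes) / len(votes)
                offsets[k][1] = sum(v[1] for v in votes) / len(votes)
    return offsets


# Three mutually overlapping cameras whose pairwise fits disagree slightly.
shifts = {(0, 1): (0.10, 0.0), (1, 2): (0.10, 0.0), (0, 2): (0.26, 0.0)}
offsets = solve_global(3, shifts)
chained = [[0.0, 0.0], [0.10, 0.0], [0.20, 0.0]]  # neighbor-by-neighbor result
```

Chaining the pairwise fits satisfies two of the three pairs exactly and leaves the full 0.06° disagreement on the third. The global solution spreads that disagreement evenly across all three pairs, which gives a lower value of `global_error`.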

We anticipate that as our algorithms advance further we will be able to expand the optimization to include not just the basic alignment parameters for each microcamera, but also the parameters in the per-microcamera distortion models. This will produce perturbations to the models found via the calibration procedure discussed in Section 2.5. It should be noted that these algorithms are computationally intensive and cannot be run in real time. However, we have observed that mechanical and thermal drifts in the prototype system have a relatively long time constant. The parallelized processing pipeline is designed to perform registration operations in the background while the array is generating live imagery. As the current microcamera alignment is calculated, updates will be pushed to the image formation algorithms to be used with new data.
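
One way to picture this background update scheme is sketched below. It is an assumption-laden illustration rather than a description of the AWARE-2 software, which distributes work over MPI workers: here a single Python thread plays the role of the registration task, and `AlignmentStore`, `registration_worker`, and `composite_live` are hypothetical names. The point shown is only that the live path reads whichever alignment is current and does not wait for registration to finish.

```python
import threading


class AlignmentStore:
    """Holds the alignment table currently used by the live compositor."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._table = dict(initial)
        self.version = 0

    def current(self):
        # Return a snapshot so a frame never sees a half-applied update.
        with self._lock:
            return self.version, dict(self._table)

    def publish(self, new_table):
        # Swap in a complete new table and bump the version counter.
        with self._lock:
            self._table = dict(new_table)
            self.version += 1


def registration_worker(store, estimate_alignment, frames):
    """Background task: run the slow registration and push each result."""
    for frame in frames:
        store.publish(estimate_alignment(frame))


def composite_live(store, frame):
    """Live path: composite with whatever alignment is current right now."""
    version, table = store.current()
    return {"frame": frame, "alignment_version": version, "cameras": len(table)}


store = AlignmentStore({0: (0.0, 0.0), 1: (0.0, 0.0)})
worker = threading.Thread(
    target=registration_worker,
    # Stand-in for the expensive registration: drift grows with frame index.
    args=(store, lambda f: {0: (0.0, 0.0), 1: (0.01 * f, 0.0)}, [1, 2]),
)
worker.start()
live = composite_live(store, frame=0)  # proceeds while registration runs
worker.join()
```

Because drift has a long time constant, a live frame composited with a slightly stale table is acceptable in this picture; new data simply picks up the next published version.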

3.3 Image quality

Distortion modeling, flat field compensation, and registration have improved the final quality of the composited images; however, a number of residual issues remain, some of which can be seen in Fig. 10. As mentioned in Section 2.4, the flat field compensation works best when the lighting conditions match those of the flat field data. This is a physical limitation, not an artifact of the image formation framework, and will be mitigated by stronger stray light control in future systems. Figure 10 also exhibits some streaking toward the middle of the image. This is due to motion blur in the native imagery and is not a result of the image processing. One could imagine applying post-processing compensation techniques to the imagery, but this breaks from the “live” image formation model. Lastly, a viewer might question the true pixel count of the system due to the noticeably poor MTF in the native imagery. This too is a physical consideration of the current array and, as such, is outside the scope of this manuscript. The performance gains achieved by this architecture are independent of the quality of the camera. Interested readers can find substantial detail on the optical performance of the prototype and other test systems in a previously published work [19].

4. Focus and exposure variation across the FoV

As we have seen, a multiscale optical system has a number of characteristics that alter the image formation process significantly from that of a traditional camera. These changes generally shift some of the work of image formation from the optical system to the electronic and image processing systems. In addition, the multiscale architecture presents several unique advantages that are not available to traditional panoramic systems, such as pan and scan systems [20] or focal plane scanning systems [21].

Chief among these advantages is the fact that the array is composed of many independently controllable microcameras. This permits simultaneous per-microcamera adjustment of parameters such as exposure, gain, and focus. This becomes extremely important when examining a composited image from AWARE-2 of a natural scene, such as the one shown in Fig. 11. Such scenes in general will have both significant variation in object distance and large intensity gradients between sunlit and shadowed regions. A traditional camera with a single exposure time and focal position is incapable of capturing the full range of the scene.

Fig. 11. A composited, tone-mapped HDR image from the AWARE-2 camera using the proposed image formation architecture. Each microcamera in the array automatically chooses a focal position and exposure time optimized for the distances and intensities found in the portion of the scene it is imaging.

In a conventional camera, a global focus is chosen to image as much of the scene in focus as possible. AWARE-2, in contrast, focuses each microcamera for targets in its field of view. This allows AWARE-2 to generate snapshot, synthesized extended depth of field (EDoF) imagery unavailable to a standard wide view camera. Similarly, variation in scene brightness is traditionally handled with high dynamic range (HDR) photography techniques [22], wherein a series of pictures of varied exposure times are taken sequentially. As with focus, AWARE-2 is able to make localized adjustments and create a gigapixel HDR image in a single snapshot. All of the AWARE-2 images shown in this manuscript have been generated with this snapshot HDR technique. A complete discussion of the techniques developed for gigapixel snapshot HDR image formation will be presented in a forthcoming paper [23].
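
As a rough illustration of per-microcamera exposure control, the sketch below shows one simple way a microcamera could pick its own exposure time and how differently exposed pixels might then be placed on a common scale before compositing. None of this is taken from the AWARE-2 implementation, whose method is deferred to [23]: the mean-level control rule, the assumption of a linear sensor response, the exposure limits, and all names are hypothetical.

```python
def choose_exposure(mean_level, current_exposure_s, target_level=0.5,
                    min_s=1e-5, max_s=0.1):
    """Scale the exposure so the mean pixel level moves toward target_level.

    mean_level is the mean of the last frame, normalized to [0, 1].
    The limits min_s and max_s are illustrative values only.
    """
    if mean_level <= 0.0:
        return max_s  # completely dark frame: use the longest exposure
    wanted = current_exposure_s * target_level / mean_level
    return max(min_s, min(max_s, wanted))


def to_relative_radiance(pixel_level, exposure_s, gain=1.0):
    """Put a pixel on a common scale (assumes a linear sensor response)."""
    return pixel_level / (exposure_s * gain)


# A shadowed microcamera lengthens its exposure; a sunlit one shortens it.
shadow_exposure = choose_exposure(mean_level=0.05, current_exposure_s=1e-3)
sunlit_exposure = choose_exposure(mean_level=0.90, current_exposure_s=1e-3)

# The same scene point seen by two cameras in an overlap region should map
# to the same relative radiance despite the different exposure times.
a = to_relative_radiance(0.2, exposure_s=0.001)
b = to_relative_radiance(0.4, exposure_s=0.002)
```

Under these assumptions, each microcamera adapts independently within a single snapshot, and the normalization step is what allows neighboring cameras with different exposure times to be merged into one HDR composite.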

5. Conclusion

We have presented an image formation architecture designed generally for multiscale imaging systems and specifically for the AWARE-2 snapshot gigapixel camera. The architecture is designed to take advantage of the naturally parallel structure that is present in the optics and electronics of a multiscale imager. The parallelization allows a computationally costly task to be easily divided among many workers. AWARE-2 can deliver a live one megapixel image to a user in less than a second, and a full resolution (one gigapixel) image in approximately 3 minutes. We anticipate that a single desktop workstation with a current generation GPU card will be capable of generating live data at the designed 10 Hz frame rate and gigapixel imagery in 2 minutes. Challenges introduced by a multiscale system, such as microcamera alignment and microcamera-dependent distortion and illumination models, are managed by calibrating for the approximate state of the system before operation. That calibration can also be updated while the camera is in operation, based on deductions from the current imagery. The image formation pipeline is designed to perform these processing-intensive updates to the system, as well as full resolution composites, in the background without interrupting real-time interaction with the array. AWARE-2 also presents distinct advantages due to its architecture, such as snapshot HDR and EDoF. We hope to further expand our exploitation of these features of the design. Moreover, there exists a host of computationally based image processing and acquisition concepts that can expand the range of image formation options available to the AWARE project.

Though much of this framework has been discussed in the context of the AWARE-2 prototype, it should be noted that the framework is meant to be generally applicable as research interest in gigapixel imagers grows [24]. The architecture is independent of the number of pixels, the quality of the optics, or the computational power available. For monocentric, >100 megapixel scale image formation, we present this architecture as a candidate to dramatically decrease computation time, approaching real time. The same architecture allows for the generation of full resolution HDR imagery at a rate in excess of 0.5 gigapixels/minute. This framework allows us to begin to position large pixel count systems as video cameras with live, interactive displays with user-driven or automated fields of view that are also capable of full resolution snapshots.

Acknowledgments

The AWARE Wide Field of View Program is supported by the Defense Advanced Research Projects Agency contract HR0011-10-C-0073.
