Geospatial Image To Learn (GeoLearn)

The motivation for developing Geospatial Image To Learn (GeoLearn) comes from hydroclimatology and terrestrial
hydrology. Research and development in this area addresses scientific questions
about the causes and consequences of hydrologic variability through phenomenology,
modeling, and synthesis.

GeoLearn has been prototyped as a novel simulation and exploratory environment for predictive modeling from remote
sensing imagery and large geospatial raster and vector data. The GeoLearn framework can: read
data sets from local and remote sites; extract features such as slope from elevation; mosaic tiles; perform quality
assurance of remotely sensed images; integrate images; spatially
select pixels by masking with boundaries, geo-points, maps with categorical variables, thresholded maps with continuous
variables, or regions painted with drawing primitives; extract pixels over a mask; perform data-driven modeling using
machine learning techniques; interpret models in terms of variable relevance; and visualize a variety of input,
output, and intermediate data.

Geospatial Image To Learn can be viewed as an encapsulated workflow for:

loading multiple raster files (images) from local or remote locations,

pre-processing the files based on quality control requirements and performing feature extraction if necessary,

integrating and mosaicking all raster data sets to form a stack with consistent spatial
and temporal resolution as well as geographic projection,

loading other files (boundaries, points or images) to create a mask for pixel selection purposes,

integrating the existing stack of raster images with other masking information,

selecting boundaries or image regions of interest and extracting variables from the stack of images,

building a data-driven model and analyzing it to assign a relevance coefficient to each input variable, and

mapping the data-driven model at the pixel level back to the spatial domain.
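The steps above can be sketched as a minimal pipeline over in-memory arrays. All class and method names below are illustrative stand-ins, not the actual Im2Learn/GeoLearn API:

```java
// Illustrative sketch of the GeoLearn workflow; names are hypothetical.
import java.util.Arrays;
import java.util.List;

public class WorkflowSketch {
    // Steps 1-3: stack co-registered rasters (here: same-size 1-D grids for brevity).
    static double[][] stackRasters(List<double[]> rasters) {
        return rasters.toArray(new double[0][]);
    }

    // Steps 4-5: build a mask from one layer, e.g., an elevation threshold.
    static boolean[] createMask(double[] layer, double threshold) {
        boolean[] mask = new boolean[layer.length];
        for (int i = 0; i < layer.length; i++) mask[i] = layer[i] >= threshold;
        return mask;
    }

    // Step 6: extract the masked pixels from every layer into a sample table
    // (one row per selected pixel, one column per variable).
    static double[][] extractSamples(double[][] stack, boolean[] mask) {
        int n = 0;
        for (boolean m : mask) if (m) n++;
        double[][] samples = new double[n][stack.length];
        for (int i = 0, row = 0; i < mask.length; i++) {
            if (!mask[i]) continue;
            for (int v = 0; v < stack.length; v++) samples[row][v] = stack[v][i];
            row++;
        }
        return samples;
    }

    public static void main(String[] args) {
        double[] elevation = {10, 120, 300, 50};
        double[] ndvi = {0.1, 0.5, 0.7, 0.2};
        double[][] stack = stackRasters(Arrays.asList(elevation, ndvi));
        boolean[] mask = createMask(elevation, 100);      // keep pixels >= 100 m
        double[][] samples = extractSamples(stack, mask); // 2 pixels x 2 variables
        System.out.println(samples.length + " masked samples"); // prints "2 masked samples"
    }
}
```

The extracted sample table is what the later modeling steps consume; the real framework performs the same stacking and extraction out-of-core rather than on full in-memory arrays.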

All aforementioned steps are supported by visualizations (gray-scale or pseudo-color) of
input, intermediate, and output data sets, as well as the data models. An overview of the
functionality is provided in Figure 1.

GeoLearn is a Java-based, open-source desktop application because the primary users of
the GeoLearn exploratory framework are scientists without access to high-performance computing resources.
This allows users to run the code on any platform and, if needed, modify and extend the framework.
However, this approach is effective only for those components that were developed by the authors
or leveraged as open-source code. The tradeoffs between resources and functionality led to
the use of third-party software.

The majority of the code is based on our own
Image to Learn (Im2Learn) software, which provides the
out-of-core image
representation, data integration, image manipulation, and visualization, with additional calls to
Data To Knowledge (D2K, Version 4.1.2), developed by
NCSA, for decision tree modeling.

Figure 2 shows the GeoLearn workflow interface that guides a user through the five major steps:
Load Raster, Create Mask, Attribute Selection, Modeling, and Visualization.
Multiple file formats (HDF, netCDF, GeoTIFF, DEM, SRTM) can be loaded. Additionally, the files can be retrieved not only
from a desktop disk or a networked disk but also from a remote site using the OPeNDAP protocol (e.g., from a NASA DAAC). The software also
performs feature extraction (from elevation to slope, aspect, flow direction, and cumulative flow) and quality assurance
and quality control (QA/QC).
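As one illustration of the elevation-derived features, slope at an interior DEM cell can be estimated with central differences over the local elevation gradient. This is a minimal sketch under that assumption, not necessarily the exact algorithm GeoLearn uses:

```java
// Minimal slope-from-elevation sketch using central differences.
public class SlopeSketch {
    /**
     * Slope in degrees at interior cell (r, c) of a DEM.
     * cellSize is the grid spacing in the same units as elevation (e.g., meters).
     */
    static double slopeDeg(double[][] dem, int r, int c, double cellSize) {
        double dzdx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cellSize); // east-west gradient
        double dzdy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cellSize); // north-south gradient
        return Math.toDegrees(Math.atan(Math.sqrt(dzdx * dzdx + dzdy * dzdy)));
    }

    public static void main(String[] args) {
        // A plane rising 1 m per 10 m eastward: slope = atan(0.1) ~ 5.71 degrees.
        double[][] dem = {{0, 1, 2}, {0, 1, 2}, {0, 1, 2}};
        System.out.printf("%.2f%n", slopeDeg(dem, 1, 1, 10.0)); // prints "5.71"
    }
}
```

Aspect follows from the same two gradients via `atan2(dzdy, dzdx)`; flow direction and cumulative flow require neighborhood comparisons beyond this sketch.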

Figure 2: GeoLearn workflow interface. The ellipse highlights the total memory footprint (around 30 MB)
for a large number of ingested files using the out-of-core data representation approach.
Large input files are represented as multiple tiles (data chunks) that are paged in and out of
a desktop computer's RAM. Thus, even files that cannot fit into RAM can be processed
on a desktop computer by keeping only a small portion of
the data in RAM at any time.
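The tile-paging idea in the caption can be sketched as a fixed-size, least-recently-used tile cache; the tile dimensions and eviction policy below are illustrative assumptions, not GeoLearn's actual ones:

```java
// Sketch of out-of-core tile paging with an LRU cache of fixed capacity.
import java.util.LinkedHashMap;
import java.util.Map;

public class TileCache {
    private int loads = 0; // how many tiles were "paged in" so far
    private final LinkedHashMap<Integer, double[]> cache;

    TileCache(int maxTiles) {
        // Access-order LinkedHashMap evicts the least-recently-used tile
        // once more than maxTiles are resident.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Integer, double[]> e) {
                return size() > maxTiles;
            }
        };
    }

    /** Return tile data, paging it in (simulated here) on a cache miss. */
    double[] getTile(int tileId) {
        double[] tile = cache.get(tileId);   // hit: also refreshes LRU order
        if (tile == null) {
            tile = new double[256 * 256];    // simulate reading a 256x256 tile from disk
            loads++;
            cache.put(tileId, tile);
        }
        return tile;
    }

    int loads()    { return loads; }
    int resident() { return cache.size(); }

    public static void main(String[] args) {
        TileCache cache = new TileCache(4);  // pretend only 4 tiles fit in RAM
        for (int i = 0; i < 10; i++) cache.getTile(i);
        System.out.println(cache.resident() + " tiles resident, " + cache.loads() + " loads");
        // prints "4 tiles resident, 10 loads"
    }
}
```

Whatever the real paging policy, the key property is the one the caption states: memory use is bounded by the cache capacity, not by the total size of the ingested files.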

GeoLearn enables several tradeoff studies related to data integration in terms of the projection and spatial resolution
parameters of integrated data sets. Different temporal integration schemes have been implemented. Similarly,
spatial integration uses different interpolation schemes. Examples of two types of
spatial data integration are shown in Figure 3.
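As a sketch of one such interpolation choice, nearest-neighbor resampling maps each cell of a target grid to the closest source cell over the same geographic extent; the actual GeoLearn schemes may differ:

```java
// Nearest-neighbor resampling sketch for spatial integration.
public class ResampleSketch {
    /**
     * Resample src (any size) onto a dstRows x dstCols grid covering the same
     * extent, taking the source cell whose center is nearest to each target cell.
     */
    static double[][] nearest(double[][] src, int dstRows, int dstCols) {
        int srcRows = src.length, srcCols = src[0].length;
        double[][] dst = new double[dstRows][dstCols];
        for (int r = 0; r < dstRows; r++) {
            // map target row center back into source row coordinates
            int sr = Math.min(srcRows - 1, (int) ((r + 0.5) * srcRows / dstRows));
            for (int c = 0; c < dstCols; c++) {
                int sc = Math.min(srcCols - 1, (int) ((c + 0.5) * srcCols / dstCols));
                dst[r][c] = src[sr][sc];
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        // Upsample a 2x2 raster to 4x4: each source value fills a 2x2 block.
        double[][] up = nearest(new double[][]{{1, 2}, {3, 4}}, 4, 4);
        System.out.println(up[0][0] + " " + up[0][3] + " " + up[3][0] + " " + up[3][3]);
        // prints "1.0 2.0 3.0 4.0"
    }
}
```

Bilinear or area-weighted interpolation would replace the single nearest lookup with a weighted combination of neighboring source cells; the choice matters most for categorical maps, where only nearest-neighbor preserves valid class labels.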

GeoLearn provides five masking methods for pixel subset selection:
boundary-based (Shapefile), point-based (Table), categorical map-based (Categorical), continuous
map-based (Threshold), and user-defined (User defined), as well as any Boolean combination of already created masks.
Using the masking functionality, one can explore and compare region-based or point-based models
with or without constraints (e.g., a land cover label or an elevation range).
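The masking methods compose naturally as Boolean operations over per-pixel bit masks. The sketch below assumes rasters flattened to 1-D arrays and combines a continuous-map threshold mask with a categorical-map mask; the class labels are hypothetical:

```java
// Sketch of threshold and categorical masks plus Boolean combination.
public class MaskSketch {
    /** Mask from a continuous map: keep pixels whose value lies in [lo, hi]. */
    static boolean[] thresholdMask(double[] map, double lo, double hi) {
        boolean[] m = new boolean[map.length];
        for (int i = 0; i < map.length; i++) m[i] = map[i] >= lo && map[i] <= hi;
        return m;
    }

    /** Mask from a categorical map: keep pixels with the given class label. */
    static boolean[] categoryMask(int[] labels, int cls) {
        boolean[] m = new boolean[labels.length];
        for (int i = 0; i < labels.length; i++) m[i] = labels[i] == cls;
        return m;
    }

    /** Boolean AND of two already created masks. */
    static boolean[] and(boolean[] a, boolean[] b) {
        boolean[] m = new boolean[a.length];
        for (int i = 0; i < a.length; i++) m[i] = a[i] && b[i];
        return m;
    }

    public static void main(String[] args) {
        double[] elevation = {50, 250, 800, 1500};
        int[] landCover = {1, 2, 2, 2};   // hypothetical labels; say class 2 = forest
        // Select forest pixels between 200 m and 1000 m elevation.
        boolean[] mask = and(thresholdMask(elevation, 200, 1000), categoryMask(landCover, 2));
        System.out.println(mask[0] + " " + mask[1] + " " + mask[2] + " " + mask[3]);
        // prints "false true true false"
    }
}
```

OR and NOT follow the same element-wise pattern, which is why arbitrary Boolean combinations of existing masks come essentially for free.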

It is often not known how Earth observations are related and how those relationships vary over space and time.
Similarly, in terms of modeling, there is no universally superior machine learning technique. GeoLearn enables users to compare
data-driven modeling results obtained by exploring multiple
possible relationships among variables and by investigating regression tree, support vector machine, and k-nearest
neighbor machine learning algorithms.
In addition, we designed a methodology and an algorithm for ranking input variables based on their relevance for predicting
output variables. Figure 5 illustrates not only the process of data-driven modeling and visualization
but also the assistance in interpreting relevance as a function of space (Fig. 5, bottom right). In this case,
vegetation greenness (the NDVI index) is predicted from the leaf area index (LAI), the fraction of photosynthetically active
radiation (FPAR) absorbed by the plant canopy, and snow cover. As expected, LAI is ranked with the highest relevance
at the majority of pixels, and snow cover is never ranked with the highest relevance. This type of analysis was used to
validate the correctness of the relevance assignment against our prior knowledge.
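Our relevance algorithm is not reproduced here. As a simple stand-in for the idea, one could rank each input by the absolute Pearson correlation with the output over the selected pixels; the sketch below is that proxy, not GeoLearn's actual method:

```java
// Illustrative variable-relevance proxy: absolute Pearson correlation.
public class RelevanceSketch {
    /** |Pearson correlation| between input x and output y over the same pixels. */
    static double absCorr(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy; sxx += dx * dx; syy += dy * dy;
        }
        return Math.abs(sxy / Math.sqrt(sxx * syy));
    }

    /** Index of the input variable ranked most relevant under this proxy. */
    static int mostRelevant(double[][] inputs, double[] output) {
        int best = 0;
        for (int v = 1; v < inputs.length; v++)
            if (absCorr(inputs[v], output) > absCorr(inputs[best], output)) best = v;
        return best;
    }

    public static void main(String[] args) {
        double[] lai  = {1, 2, 3, 4};     // toy values, not real MODIS data
        double[] snow = {1, -1, 1, -1};
        double[] ndvi = {2, 4, 6, 8};     // perfectly tracks LAI in this toy example
        System.out.println("most relevant input: " + mostRelevant(new double[][]{lai, snow}, ndvi));
        // prints "most relevant input: 0"
    }
}
```

Running such a ranking separately on the pixels of each region (or in a moving window) yields relevance as a function of space, which is the kind of per-pixel relevance map shown in Figure 5.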

Projects using GeoLearn software:

Geospatial Image To Learn was created as a joint collaboration between the
Civil and Environmental Engineering Department
(CEE) at the University of Illinois at Urbana-Champaign
(UIUC) and the National Center for Supercomputing Applications (NCSA) at UIUC.

Additional help with the software requirement specifications and code release came from
Amanda White (CEE), Ben Ruddel (CEE), and Tim Nee (NCSA).

Acknowledgments

Funding support was provided by the National Aeronautics and Space Administration (NASA),
the National Archives and Records Administration (NARA), and the National Science Foundation (NSF).
The NASA project was led by the principal investigators Praveen Kumar and Peter Bajcsy. Praveen Kumar
served as the principal investigator for the NSF project.
The NARA project was led by the principal investigator Peter Bajcsy.