
Speaker Abstracts

+-
Kirk Borne, Ph.D. - Abstract 1

Session: Astroinformatics: Learning from Data in the Astronomical Sciences 1

Citizen Science and Astroinformatics - Data Science at the Frontiers of Astronomy

Kirk Borne, George Mason University

I will describe the synergy between computational algorithms and human computation in addressing some of the challenges of learning from Big Data. I will focus on two topics, Astroinformatics (which is Data Science for Astronomy) and Citizen Science in Astronomy, within the broader context of collaborative annotation of massive data for search and discovery. The application of Data Science machine learning algorithms to the collections of labels and tags produced through human-machine interaction will enable novel and unexpected discoveries.

+-
John Wallin, Ph.D.

Session: Astroinformatics: Learning from Data in the Astronomical Sciences 1

John Wallin, Center for Computational Science, Department of Physics and Astronomy, Middle Tennessee State University

The Zooniverse project was created to connect scientists with volunteers from around the world who help them analyze large data sets. In these projects, Citizen Scientist volunteers classify, characterize, and transcribe image and sound data from a wide variety of scientific disciplines. As computational intelligence algorithms improve, the tasks given to Citizen Scientists need to change and evolve. We present some recent results showing how data from Citizen Science projects can be used to validate and train computational intelligence algorithms, and how pairing crowdsourcing with computational intelligence can produce scalable solutions to the extreme data challenges facing us in the future.

+-
Dan Burger, Ph.D.

Session: Astroinformatics: Learning from Data in the Astronomical Sciences 1

Filtergraph is a web application being developed by the Vanderbilt Initiative in Data-intensive Astrophysics (VIDA) to flexibly handle a large variety of astronomy datasets. While current datasets at Vanderbilt are being used to search for eclipsing binaries and extrasolar planets, this system can be easily reconfigured for a wide variety of data sources. The user loads a flat-file dataset into Filtergraph, which instantly generates an interactive data portal that can be easily shared with others. From this portal, the user can immediately generate scatter plots, histograms, and tables based on the dataset. Key features of the portal include the ability to filter the data in real time through user-specified criteria, the ability to select data by dragging on the screen, and the ability to perform arithmetic operations on the data in real time. The application is being optimized for speed in the context of very large datasets: for instance, plots generated from a stellar database of 3.1 million entries render in less than 2 seconds on a standard web server platform. This web application has been created using the Web2py web framework, based on the Python programming language. Filtergraph is freely available at http://filtergraph.vanderbilt.edu/

+-
Matthias Katzfuss, Ph.D.

Session: Data Science and Climate 1

Low-Rank Spatial Models for Large Remote-Sensing Datasets

Matthias Katzfuss, University of Heidelberg

With the proliferation of modern high-resolution measuring instruments mounted on satellites, planes, ground-based vehicles and monitoring stations, a need has arisen for statistical methods suitable for the analysis of large spatial datasets observed on large, heterogeneous spatial domains.

Many statistical approaches to this problem rely on low-rank models, for which the process of interest is modeled as a linear combination of spatial basis functions plus a fine-scale-variation term. For the full-scale approximation, a type of low-rank model that uses a so-called parent covariance and a set of knots to parameterize the model components, I will discuss two extensions: First, I will describe how to perform Bayesian inference on the set of knots. Second, I will argue that it is often advantageous to use a nonstationary parent covariance, and propose a generalization of the Matérn covariance to the sphere that can be used for global satellite CO2 measurements.
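The low-rank structure described above (a linear combination of spatial basis functions plus a fine-scale-variation term) can be sketched in a few lines. The Gaussian bases, 1-D grid, and identity parent covariance below are illustrative assumptions, not the actual full-scale approximation:

```python
import numpy as np

def gaussian_basis(locs, knots, bandwidth=0.5):
    """Evaluate a Gaussian basis function centered at each knot."""
    d2 = (locs[:, None] - knots[None, :]) ** 2
    return np.exp(-0.5 * d2 / bandwidth**2)

def low_rank_covariance(locs, knots, K, tau2=0.1):
    """Sigma = B K B' + tau2 * I  (low-rank part plus fine-scale variation)."""
    B = gaussian_basis(locs, knots)
    return B @ K @ B.T + tau2 * np.eye(len(locs))

locs = np.linspace(0.0, 10.0, 200)   # 1-D observation locations for simplicity
knots = np.linspace(0.0, 10.0, 8)    # a small set of knots
K = np.eye(len(knots))               # stand-in for the parent covariance
Sigma = low_rank_covariance(locs, knots, K)
# Sigma is 200x200, but its non-diagonal part has rank at most 8, so
# solves can exploit the Sherman-Morrison-Woodbury identity.
```

The computational payoff is that inference scales with the number of knots rather than the number of observations.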

+-
Amy Braverman, Ph.D.

Session: Data Science and Climate 1

Likelihood-based Climate Model Evaluation

Amy Braverman, Jet Propulsion Laboratory


+-
Lukas Mandrake, Ph.D.

Session: Data Science and Climate 1

Informing Climate Retrieval Development Using Data Mining

Lukas Mandrake, Jet Propulsion Laboratory

From the humblest methods may sometimes come great insight. Tasked to assist atmospheric scientists working on an advanced algorithm that deduces atmospheric carbon dioxide from satellite-based atmospheric spectra, we quickly found that the interface between data mining methods and the climate researcher's paradigm would be the greatest bottleneck, along with the strong correlation, high dimensionality, and large volume of the data itself. To alleviate this concern, we reformulated the data mining problem to more closely resemble an automated version of the typical threshold filters and linear fits employed by climate scientists in their data analysis, and used genetic algorithms to explore the trade-off space of performance versus percent data accepted. This simple system can then be used to perform feature selection, to isolate data features that cause algorithm failure, and to analyze upstream data quality. Our results shed light on the dominant sources of error, those being partial or thin clouds, remove the justification for trying to fit away errors in the final output, and provide a performance curve for the evaluation of differing versions of the algorithm itself. Examples will be shown from the Greenhouse gases Observing SATellite (GOSAT).
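As a rough illustration of the trade-off space described above, the sketch below sweeps a simple threshold filter on a synthetic data-quality feature and records performance versus percent of data accepted. A plain grid sweep stands in for the genetic-algorithm search, and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
cloudiness = rng.uniform(0, 1, n)               # hypothetical quality feature
error = rng.normal(0, 0.5 + 2.0 * cloudiness)   # retrieval error grows with cloudiness

def tradeoff_curve(feature, error, thresholds):
    """For each threshold, record (fraction accepted, RMSE of accepted subset)."""
    points = []
    for t in thresholds:
        accepted = feature <= t
        frac = accepted.mean()
        rmse = np.sqrt(np.mean(error[accepted] ** 2)) if frac > 0 else np.nan
        points.append((frac, rmse))
    return points

curve = tradeoff_curve(cloudiness, error, np.linspace(0.1, 1.0, 10))
# Stricter thresholds accept less data but yield lower RMSE; the resulting
# curve is the kind of performance-vs-throughput trade-off described above.
```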

+-
Kirk Borne, Ph.D. - Abstract 2

Session: Learning from Data

Learning from Data, Big and Small

Kirk Borne, George Mason University

The volume of data has grown to the point that politicians, educators, business people, scientists, social media specialists, and countless others are paying attention to this exponential flood of information. It is indeed exponential since the data growth rate is proportional to the existing data volume, with a doubling time of less than one year in many contexts.
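The growth-rate claim above is the defining property of exponential growth: if dV/dt = kV, then V(t) = V0*e^(k*t), and the volume doubles every ln(2)/k. A one-year doubling time therefore implies roughly a thousandfold increase per decade:

```python
import math

# Rate constant implied by a one-year doubling time: k = ln(2) / t_double.
k = math.log(2) / 1.0

# Growth factor over a decade: e^(10k) = 2^10 = 1024.
growth_per_decade = math.exp(k * 10)
```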

The explosion of interest in the topic is saturating the discussion in all data-intensive domains. The challenges associated with Big Data are technological, algorithmic, and sociological. I will address the fundamental challenges that Big Data pose to scientific research and education.

An informatics approach to scientific research includes a variety of data science disciplines, including statistics, visualization, machine learning, data mining, data modeling, data indexing, data structures, and more. Accordingly, science education must evolve to incorporate these emerging methods and algorithms within traditional programs. I will describe some approaches to this revolution in scientific research and education.

+-
Michael Mahoney, Ph.D. - Abstract 1

Session: Learning from Data

+-
Arun Vedachalam, Ph.D.

Session: Learning from Data

+-
Pawan K. Bhartia, Ph.D. & Joanna Joiner, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

Experience in extracting scientific information from the data collected by back-scattered ultraviolet (BUV) instruments flown on satellites since 1970

Pawan K. Bhartia, NASA Goddard Space Flight Center

Joanna Joiner, NASA Goddard Space Flight Center

We will discuss our experience in extracting scientific information from more than a dozen back-scattered ultraviolet (BUV) instruments flown on satellites since April 1970. The data from the BUV series of instruments are not only one of the longest Earth science data records collected from satellites; the volume of data from these instruments and their demand for processing power have also increased as fast as the cost of storage and processing has decreased. The first TOMS instrument, launched on NASA's Nimbus-7 satellite in October 1978, produced less than 1 kbit of data per second. Yet at the time of launch it was estimated that processing all the data it was collecting would require almost the entire capacity of the largest IBM 360 computer then operating at NASA Goddard Space Flight Center (GSFC). Since this computer served the needs of all the scientists working at GSFC, it was considered infeasible to process all the data from TOMS. Though the situation was remedied by optimizing the processing code at substantial cost, the volume of data TOMS produced was considered so large that many potential users of the data chose not to invest their resources in analyzing it. Some information technologists have attributed the delay in finding the Antarctic ozone hole in the TOMS data to the lack of adequate visualization and data mining capability at NASA at the time. Though this is a misrepresentation of what actually happened, it is nevertheless true that our ability to extract meaningful information from large datasets has not expanded as fast as the volume of the data we are collecting. In many cases we are analyzing large datasets with tools and techniques developed for much smaller datasets. We will discuss our experience in applying modern methods of extracting information from large datasets, including neural networks and principal component analysis.

+-
Hongbin Yu, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

Evidence of aerosol intercontinental transport (ICT) is both widespread and compelling. Model simulations suggest that ICT could significantly affect regional air quality and climate, but the broad inter-model spread of results underscores a need to constrain model simulations with measurements. Satellites have inherent advantages over in situ measurements for characterizing aerosol ICT because of their spatial and temporal coverage. Significant progress in satellite remote sensing of aerosol properties during the Earth Observing System (EOS) era offers the opportunity to move quantitative characterization and estimates of aerosol ICT beyond the capability of pre-EOS era satellites, which could only qualitatively track aerosol plumes. EOS satellites also observe emission strengths and injection heights of some aerosols, aerosol precursors, and aerosol-related gases, which can help characterize aerosol ICT. In this talk, we will show how a synergy of three-dimensional observations of aerosols from multiple EOS satellites, supplemented by model simulations, provides insight into the relative contributions of imported (via ICT) and domestic aerosols over North America. Implications of the findings for climate and air quality will also be discussed.

+-
Peter R. Colarco, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

A primary consideration for future aerosol satellite missions is the spatial coverage provided by the measurements. For a polar orbiting satellite, for example, two important questions that arise are how much of the Earth is sampled per orbit, and how long it takes to achieve a global sample. In the current generation of sensors, the Moderate-Resolution Imaging Spectroradiometer (MODIS) instrument has a wide swath (~2300 km) and provides nearly daily global coverage. On the other hand, the Multi-Angle Imaging Spectroradiometer (MISR) has a much narrower swath (~380 km) and takes about eight days to sample the globe. At the extreme, the nadir-only view provided by the Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) has an orbital repeat cycle of 16 days, but samples very little of the Earth’s surface in achieving this “global coverage.” Each of these instruments has different capabilities for detecting aerosols, and it is generally understood that there is often a trade-off between coverage and capability.

The fundamental question in this study is whether spatial coverage matters to the statistics of the aerosol optical thickness (AOT), a primary quantity desired from satellite measurements. We investigate this using AOT fields obtained from the MODIS data record. In our study, the full-swath MODIS data set is sampled to extract MISR- and CALIOP-like spatial coverage versions of the data set. We investigate simple observability questions, such as where it is and is not possible to measure aerosols because of the spatial sampling choice. We additionally investigate the suitability of narrow-swath measurements for detecting trends in the AOT field.
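A toy version of this sampling experiment can be sketched as follows. The synthetic AOT field, swath widths, and daily precession are invented for illustration and are not the actual MODIS/MISR orbital parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n_days, n_lon = 160, 360
aot = 0.15 + 0.05 * rng.standard_normal((n_days, n_lon))  # synthetic daily AOT

def swath_sample(field, swath_width, shift_per_day):
    """Keep only `swath_width` contiguous longitudes per day, advancing
    the swath by `shift_per_day` columns each day (orbital precession)."""
    n_days, n_lon = field.shape
    sampled = np.full_like(field, np.nan)
    for d in range(n_days):
        start = (d * shift_per_day) % n_lon
        cols = (np.arange(swath_width) + start) % n_lon
        sampled[d, cols] = field[d, cols]
    return sampled

wide = swath_sample(aot, swath_width=300, shift_per_day=23)    # wide-swath sampler
narrow = swath_sample(aot, swath_width=45, shift_per_day=23)   # narrow-swath sampler

true_mean = aot.mean()
wide_mean = np.nanmean(wide)
narrow_mean = np.nanmean(narrow)
# The narrow swath sees far fewer pixels per day, so statistics computed
# from it are noisier and some locations go unobserved for many days.
```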

+-
Jason Cohen, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

Rapid and wide-scale changes are occurring to the land surface in Southeast Asia, due to both urban and agricultural expansion. These changes are not well constrained, and their effects are starting to be observable at climatological scales. Some of these influences are natural in origin, while others are anthropogenic, and they include three major drivers: (a) changes as a response to various phases of the Monsoon; (b) human-induced fires; and (c) permanent alteration of the land for urbanization and other economic activities. These changes cause a significant portion of the global Black Carbon (BC) and Organic Carbon (OC) aerosols emitted into the atmosphere, both of which are highly variable in space and time.

Therefore, to better quantify the properties of the land surface at large spatial scales, and the properties of the aerosols resulting from changes in these land surfaces, it is imperative to look at the problem over a sufficiently long time-scale to capture the processes important in this region of the world. In this work, data reaching back as far as possible, in most cases at least 10 years, have been used. However, when working with so much data, new quantitative methods of analysis are required. The purpose of this presentation is to introduce two such proof-of-concept approaches, as well as some initial and interesting results.

The first proof-of-concept approach uses a Kalman Filter based on a coupled climate/radiation/aerosol/urbanization model, with data consisting of BC concentrations and remotely sensed AAODs. This work has produced the first global average estimate, with uncertainty, of annual BC emissions, yielding an optimized range of 200% to 300% of the emissions currently used by the IPCC, AEROCOM, and Bond et al. An important additional point related to fires and land-use change is also elucidated: the emissions, concentrations, and AAODs in an annual average sense are significantly underestimated in Southeast Asia, which also happens to be impacted by large-scale fires.

The second proof-of-concept approach uses PCA as a tool to extract the standing modes of multiple remotely sensed datasets and to analyze those that contribute the most variance. This tool allows both the spatial and the associated temporal structure of the dataset to be elucidated. The approach has been applied to AOD data from MISR and MODIS; NDVI, LAI, and EVI from MODIS; precipitation from TRMM; and various aerosol products from CALIPSO.

Combining the first and second techniques has led to the determination of unique temporally and spatially varying properties that correspond one-to-one with all the known large-scale fire events in Southeast Asia over the past decade. Running these new results through the same modeling system allows for a comparison against known datasets, and these results will be presented. It will be demonstrated that the inter-seasonal and inter-annual variations can be better captured with this new technique than with other commonly used fire-based a priori emissions sets, such as those based on GFED or MODIS fire radiative power.

Finally, other applications of the second technique have allowed a few important connections to be made between the properties of the land surface and the climate system over Southeast Asia. Two of these results, encompassing both natural and anthropogenic signals, will be discussed: interactions between fires and precipitation, and interactions between the monsoons and larger-scale land-surface properties.

+-
George Djorgovski, Ph.D.

Session: Astroinformatics: Learning from Data in the Astronomical Sciences 2

+-
Ashish Mahabal, Ph.D.

Session: Astroinformatics: Learning from Data in the Astronomical Sciences 2

Finding Rare Astronomical Objects Using Efficient Bayesian Networks

Ashish Mahabal, California Institute of Technology

Current time-domain surveys find tens of transients per night with the detection threshold set high (well over a magnitude). Surveys in the near future are set to find several orders of magnitude more. A vast majority of these belong to well-understood types. The challenge is in identifying the rarer types and concentrating the scarce follow-up resources on those. Early characterization/classification often has to be done from scarce data. This includes (1) fluxes at one or two recent epochs, perhaps an archival, co-added flux, often in the form of an upper limit; (2) position, with an error depending on the wavelength of discovery; and (3) archival parameters such as the nearest radio source. Such a list can run to several tens of parameters, most of which are unavailable for any given transient. New discriminating follow-up is not only expensive to obtain, but for some of the rapid transients it can mean an unacceptable delay. Bayesian methods allow one to deal with missing parameters, but learning from data is an expensive if not impossible task given the large number of parameters. Naive Bayesian networks work to an extent, but do not deal well with redundancy. The ideal solution is smaller networks for each class with well-crafted parameters based on domain knowledge. Here we detail the concept with a three-parameter network that uses only archival parameters to discriminate between supernovae and non-supernovae. Thus it can operate in real time without needing any follow-up observations. We describe such a deployment as part of the Catalina Real-Time Transient Survey (CRTS). We further detail how the small binary network fits into a larger multi-class network.
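The key property used here, that a Bayesian classifier can score a transient even when most parameters are missing, can be illustrated with a tiny Gaussian naive Bayes sketch. The features, class statistics, and priors below are invented for illustration and are not the CRTS network parameters:

```python
import numpy as np

# Hypothetical per-class (mean, std) for three archival features, e.g.
# (galaxy proximity, magnitude change, nearest-radio-source distance).
class_stats = {
    "supernova":     [(0.5, 0.3), (2.0, 1.0), (5.0, 2.0)],
    "non-supernova": [(2.0, 1.0), (0.5, 0.5), (1.0, 1.0)],
}
priors = {"supernova": 0.1, "non-supernova": 0.9}

def log_gauss(x, mu, sigma):
    """Log-density of a Gaussian, used as the per-feature likelihood."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def classify(features):
    """features: list of floats, with np.nan marking missing parameters."""
    scores = {}
    for cls, stats in class_stats.items():
        score = np.log(priors[cls])
        for x, (mu, sigma) in zip(features, stats):
            if not np.isnan(x):        # missing parameters are simply skipped
                score += log_gauss(x, mu, sigma)
        scores[cls] = score
    return max(scores, key=scores.get)

# Only one of three parameters is observed; classification still proceeds.
label = classify([0.4, np.nan, np.nan])
```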

+-
Umaa Rebbapragada, Ph.D.

Session: Astroinformatics: Learning from Data in the Astronomical Sciences 2

This talk presents real-time machine learning systems for triage of big data streams generated by photometric and image-differencing pipelines. Our first system is a transient event detection system in development for the Palomar Transient Factory (PTF), a fully-automated synoptic sky survey that has demonstrated real-time discovery of optical transient events. The system is tasked with discriminating between real astronomical objects and artifacts of the image differencing pipeline. We performed a machine learning forensics investigation into the initial PTF classification system that led to training data improvements and the development of new features, both of which dramatically improved the false positive and false negative rates. The second machine learning system is a real-time classification engine of transients and variables in development for the Australian Square Kilometre Array Pathfinder (ASKAP), an upcoming wide-field radio survey with unprecedented ability to investigate the radio transient sky. The goal of our system is to classify light curves into known classes with as few observations as possible in order to trigger follow-up on costlier assets. We discuss the violation of standard machine learning assumptions incurred by this task, and propose the use of ensemble and hierarchical machine learning classifiers that make the most robust predictions.

+-
Padma A. Yanamandra-Fisher, Ph.D.

Session: Astroinformatics: Learning from Data in the Astronomical Sciences 2

Application of PCA to the Atmospheres of Jupiter and Saturn: Temporal and Seasonal Changes

Given the wealth of observations of Jupiter and Saturn since Galileo’s first telescopic observations in 1610, there remain important unanswered questions about their atmospheres and the dynamics of various processes that are still not understood. The high-resolution data returned from several spacecraft missions, placed in the context of the larger timeline of ground-based observations, in principle allow us to develop insight into these processes. Recently, Jupiter and Saturn have been exhibiting dramatic atmospheric changes nearly continuously since 2007. The underlying basis for these changes may be a common driver of atmospheric disturbances in Jovian planets. Yet access to these data sets alone is not sufficient to develop unique models of the various physical and chemical processes that govern the planets. Dramatic changes in their atmospheres, from discrete localized features to global regions, and the availability of large telescopes (and therefore higher spatial/spectral resolution data than before) require a new paradigm of rapid exploratory data mining that can be corroborated/validated with standard physical and theoretical models. Statistical models, like PCA and empirical orthogonal analysis, provide unbiased, rapid examination of data and identification of key trends in the latent variables that influence the state of the atmosphere. Application of PCA to Jupiter’s Great Red Spot (GRS)-Oval(s) periodic interactions, and to seasonal changes in the brightness temperatures on Saturn, showcases its versatility for the identification of the drivers/triggers that influence the observed changes in their atmospheres. I highlight the importance of: (i) placing the observations in the context of various timescales (seasonal, periodic or episodic); (ii) ground-based and spacecraft observations; and (iii) the synergy between professional and amateur astronomers (a new direction in Citizen Science).

I gratefully acknowledge the assistance of various colleagues and student interns in our project and support from NASA/Planetary Astronomy Program.

In recent years, the availability of large satellite datasets has provided an extraordinary opportunity to improve our understanding of the mechanisms controlling atmospheric composition. In particular, these datasets have contributed toward improving the representation of fire emissions in climate and air quality models and assessments. In this talk, I will present several examples that applied computational analysis to the multi-year record of satellite aerosol observations, which have enabled characterization of smoke fire processes and their impacts. I will discuss (a) the determination of smoke plume heights from fires over North America; (b) the investigation of the main physical factors that determine smoke plume rise; and (c) the assessment of fire impacts on aerosol loading and air quality over Colorado.

+-
Mark Nakamura, Ph.D.

Statistical Downscaling of Two-Dimensional Wind Fields

Mark Nakamura, University of California, Los Angeles

Global Climate Models (GCMs) are dynamical computer programs that model the physical interactions governing Earth's climate. The power of these GCMs lies in that they allow us to make future climate predictions. GCMs produce vast amounts of data by predicting a myriad of climatic variables throughout the three dimensions of our oceans and atmosphere. One drawback of modeling a system this complex is the computational expense. The end result is global predictions at a low resolution that speak more to the general overall climate than to climatological impacts at a local level.

To produce local high-resolution predictions there are two options. The first is dynamical downscaling, which involves nesting another, higher-resolution dynamical model and using the initial GCM data as starting inputs. The second, statistical downscaling, creates statistical models that examine the relationship between local high-resolution (prediction-level) variables and the corresponding low-resolution GCM data. Dynamical downscaling produces accurate estimates but requires heavy computational knowledge, resources, and time. Statistical downscaling techniques can be employed much faster and require less knowledge of the computational coding procedures of the dynamical model.

The focus of this talk will be on the statistical downscaling of two-dimensional wind fields. Wind fields pose a unique problem in that they include the prediction of directional data (wind direction). Directional data cannot be treated with typical off-the-shelf statistical techniques because of the data's cyclical nature (e.g., 359 degrees is very close to 1 degree). To account for this, I predict wind magnitude and direction separately, using a circular tree-based regression for wind direction and a generalized linear model for wind magnitude. To account for the non-stationarity of a prediction point (i.e., each prediction point has a unique surrounding topography), I create unique models for each prediction point. Within the prediction for one location, days are first clustered into similar wind regimes. This decreases variance within the model and ensures areas of statistical influence are captured.
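The cyclical-data point above can be made concrete with a circular distance and a vector-averaged circular mean, a minimal sketch of why off-the-shelf arithmetic fails for directions:

```python
import numpy as np

def circular_distance(a, b):
    """Smallest angular separation between two directions, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def circular_mean(angles_deg):
    """Mean direction via unit-vector averaging, returned in [0, 360)."""
    rad = np.deg2rad(angles_deg)
    mean = np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean()))
    return mean % 360.0

d_naive = abs(359.0 - 1.0)              # 358: misleading linear distance
d_circ = circular_distance(359.0, 1.0)  # 2: the true angular separation
m = circular_mean([359.0, 1.0])         # near 0, not the naive mean of 180
```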

+-
Vipin Kumar, Ph.D.

Understanding Climate Change: Opportunities and Challenges for Data Driven Research

Vipin Kumar, William Norris Professor and Head, Department of Computer Science and Engineering, University of Minnesota

This talk will present an overview of research being done in a large interdisciplinary project on the development of novel data driven approaches that take advantage of the wealth of climate and ecosystem data now available from satellite and ground-based sensors, the observational record for atmospheric, oceanic, and terrestrial processes, and physics-based climate model simulations. These information-rich datasets offer huge potential for monitoring, understanding, and predicting the behavior of the Earth's ecosystem and for advancing the science of climate change. This talk will discuss some of the challenges in analyzing such data sets and our early research results.

+-
Dan Crichton, Ph.D.

+-
Arnold Goodman, Ph.D. - Abstract 1

+-
Susan Paddock, Ph.D.

+-
Arnold Goodman, Ph.D. - Abstract 2

+-
Eric Chi, Ph.D.

+-
Benjamin Shaby, Ph.D.

+-
Heike Hofmann, Ph.D.

+-
Kiri L. Wagstaff, Ph.D. & Michael J. Garay, Ph.D.

Scientific Discovery and Anomaly Detection in Large Aerosol Data Sets

In the era of large scientific data sets, when it is impossible for an individual to examine every observation in detail, there is an urgent need for methods to automatically prioritize data for review. However, any such automated method must make decisions in a trustworthy, comprehensible manner. In this talk, I will describe the Discovery through Eigenbasis Modeling of Uninteresting Data (DEMUD) method, which uses principal component modeling and reconstruction error to prioritize data by its novelty. Uniquely, DEMUD also provides individual reasons for each priority decision. I will share results obtained when using DEMUD to analyze aerosol retrievals from MISR satellite data, enabling us to quickly identify interesting and unusual observations in both space and time. I will also discuss how DEMUD handles situations when data is missing, which commonly occurs in satellite aerosol retrievals for scenes with extensive cloud cover.
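The core mechanism, principal-component modeling with reconstruction error as a novelty score and per-feature residuals as the "reason", can be sketched as below. This is a simplified illustration in the spirit of DEMUD, not the published algorithm, and all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
# "Uninteresting" data: 200 items whose variance lives mostly in 2 features.
seen = rng.normal(0, 1, (200, 10)) @ np.diag([3.0, 2.0] + [0.1] * 8)
# One candidate carries novelty in feature 7; the rest are small noise.
anomaly = np.zeros(10)
anomaly[7] = 8.0
candidates = np.vstack([rng.normal(0, 0.1, (20, 10)), anomaly])

# Eigenbasis (top-2 principal directions) of the seen data.
mean = seen.mean(axis=0)
_, _, Vt = np.linalg.svd(seen - mean, full_matrices=False)
U = Vt[:2].T

def novelty(x):
    """Return (reconstruction-error score, per-feature residual 'reason')."""
    centered = x - mean
    resid = centered - U @ (U.T @ centered)   # part the eigenbasis cannot explain
    return np.linalg.norm(resid), resid

scores = [novelty(c)[0] for c in candidates]
most_novel = int(np.argmax(scores))           # index of the highest-priority item
reason = int(np.argmax(np.abs(novelty(candidates[most_novel])[1])))  # feature 7
```

The residual vector is what makes each priority decision explainable: it points at the features the model of "uninteresting" data failed to reconstruct.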

+-
Christopher Lynnes, Ph.D.

Giovanni-4: the next generation of an online tool for satellite data visualization, exploration and intercomparison

Christopher Lynnes, NASA/GSFC

One of the most time-consuming phases of scientific research is the identification of useful data for tackling the problem at hand. This is not simply locating datasets via spatial, temporal, and semantic information. Typically it involves an intensive (often lengthy) phase of examining the data for key signatures of the phenomena under study. Earth science remote sensing makes this problem challenging due to the large data volumes and the complex data formats and structures employed. The Geospatial Interactive Online Analysis and Visualization Interface (Giovanni) was designed to provide a server-side tool for this exploratory data analysis phase. Deployed at the Goddard Earth Sciences Data and Information Services Center (GES DISC) for over a decade, Giovanni offers a variety of services to visualize the content of Earth science data archived at the GES DISC and select other data centers. Services range from basic time-averaged maps to correlation maps of variable pairs and an interactive scatterplot+map. Currently, Giovanni is undergoing a rearchitecture to enhance support for data exploration by adding more interactive visualizations and improving speed, moving closer to a true interactive user experience (UX). A longer-term goal is to support user-contributed content in Giovanni.

Developing global maps of carbon dioxide (CO2) concentration near the surface can help identify locations where major amounts of CO2 are entering and exiting the atmosphere. No single instrument currently provides this information, but inferences can be made by considering a weighted difference between total column CO2 concentration observed by the Greenhouse gases Observing Satellite (GOSAT) and mid-tropospheric CO2 concentration observed by the Atmospheric InfraRed Sounder (AIRS) on the Aqua satellite. In the past, attempts to combine satellite information have been hindered by the instruments' different spatial supports and by the typically massive size of the remote sensing datasets. We describe a spatio-temporal data-fusion methodology, based on the Kalman filter and smoother, that can combine complementary datasets from multiple sources and properly account for spatial and temporal dependencies in order to produce more complete and accurate inferences. The resulting optimal predictors have computational complexity that is linear with respect to the number of observations at each time point.
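A scalar analogue of the Kalman-filter fusion described above: two instruments with different noise levels observe the same slowly varying quantity, and sequential updates combine them. Real applications involve spatial fields and differing footprints; this sketch shows only the linear sequential update, and all values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200
truth = np.cumsum(rng.normal(0, 0.1, T)) + 390.0   # slowly drifting CO2-like state
obs_a = truth + rng.normal(0, 2.0, T)              # noisier instrument
obs_b = truth + rng.normal(0, 1.0, T)              # more precise instrument

def kalman_fuse(obs_list, obs_vars, q=0.01, x0=390.0, p0=10.0):
    """Sequentially assimilate several observation streams at each time step."""
    x, p = x0, p0
    estimates = []
    for t in range(len(obs_list[0])):
        p = p + q                        # predict: random-walk state model
        for y, r in zip(obs_list, obs_vars):
            k = p / (p + r)              # Kalman gain for this instrument
            x = x + k * (y[t] - x)       # update toward the observation
            p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

fused = kalman_fuse([obs_a, obs_b], [4.0, 1.0])
rmse_fused = np.sqrt(np.mean((fused - truth) ** 2))
rmse_b = np.sqrt(np.mean((obs_b - truth) ** 2))
# Fusing both streams beats even the better single instrument alone, and the
# per-step cost is linear in the number of observations, as noted above.
```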

+-
William Cleveland, Ph.D. - Abstract 1

+-
Rob Gould, Ph.D.

+-
Jeff Hammerbacher, Ph.D.

+-
Daniel R. Jeske, Ph.D.

Co-clustering Spatial Data Using a Generalized Linear Mixed Model With Application to Integrated Pest Management

Daniel R. Jeske, Department of Statistics, University of California - Riverside

Co-clustering has been broadly applied to many domains such as bioinformatics and text mining. However, model-based spatial co-clustering has not been studied. In this paper, we develop a co-clustering method using a generalized linear mixed model for spatial data. To avoid the high computational demands associated with global optimization, we propose a heuristic optimization algorithm to search for a near optimal co-clustering. For an application pertinent to Integrated Pest Management, we combine the spatial co-clustering technique with a statistical inference method to make assessment of pest densities more accurate. We demonstrate the utility and power of our proposed pest assessment procedure through simulation studies and apply the procedure to studies of the persea mite (Oligonychus perseae), a pest of avocado trees, and the citricola scale (Coccus pseudomagnoliarum), a pest of citrus trees.

+-
Yaming Yu, Ph.D.

Using state-space models for variance matrices to study climate patterns

Yaming Yu, University of California, Irvine

The global climate system is dominated by large-scale spatial patterns of atmospheric and oceanic variability, which are often defined in terms of Empirical Orthogonal Function (EOF) analysis. EOF analysis has two main limitations: the assumption of stationarity over a long period of time and the absence of associated measures of uncertainty. We build a model for the spatial variance-covariance matrix with parametric basis functions and adopt a Bayesian approach to estimate the parameters. Posterior simulation using Markov chain Monte Carlo yields both the parameter estimates and the associated measures of uncertainty. A state-space model is applied to smaller time windows with less information, capturing the smoothly changing nature of the pattern over time by linking some of the parameters at successive time windows through system equations. We explore these methods and illustrate them with both simulations and real data.
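
As background for readers new to EOFs: the patterns referred to here are the right singular vectors of the space-time anomaly matrix. A minimal sketch on synthetic data (the field, its size, and the planted pattern are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic field: 120 monthly time steps at 50 spatial points,
# dominated by one known spatial pattern plus weak noise.
pattern = np.sin(np.linspace(0, np.pi, 50))
amplitude = rng.normal(size=120)
field = np.outer(amplitude, pattern) + 0.1 * rng.normal(size=(120, 50))

anomaly = field - field.mean(axis=0)           # remove the time mean
U, s, Vt = np.linalg.svd(anomaly, full_matrices=False)
eof1 = Vt[0]                                   # leading spatial pattern (EOF 1)
explained = s[0]**2 / np.sum(s**2)             # fraction of variance explained
```

The stationarity limitation mentioned above arises because this decomposition assumes one fixed set of patterns for the whole record; the state-space approach relaxes that by letting the patterns evolve across time windows.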

+-
Barbara A. Bailey, Ph.D.

Nonlinear Models for Predicting Plankton Ecosystem Dynamics

Barbara A. Bailey, San Diego State University

Time series of physical and biological properties of the ocean are a valuable resource for developing models for ecological forecasting and ecosystem-based management. Both the physics of the oceans and the organisms living in them can exhibit nonlinear dynamics. We describe the development of a nonlinear model that predicts the abundance of the important zooplankton species Calanus finmarchicus from hydrographic data from the Gulf of Maine. The results of a neural network model, including model diagnostics, forecasts, and dynamical quantities, are presented. The best neural network model based on generalized cross validation includes variables of C. finmarchicus abundance, herring abundance, and the state of the Gulf of Maine waters, with meaningful time lags. Forecasts are constructed for the model fit to 1978-2003 bimonthly data, and corresponding forecast intervals are obtained by the stationary bootstrap.

+-
James Harner, Ph.D.

+-
Michael Limcaco, Ph.D.

+-
Gayn B. Winters, Ph.D.

+-
Robert Allen, Ph.D.

Session: Climate Data Analysis: From Satellites to Climate Models

Heterogeneous Warming Agents and Widening of the Tropical Belt

Robert Allen, University of California, Riverside

Observational analyses have shown the width of the tropical belt increasing in recent decades as the world has warmed. This expansion is important because it is associated with shifts in large-scale atmospheric circulation and major climate zones. Although recent studies have attributed tropical expansion in the Southern Hemisphere to ozone depletion, the drivers of Northern Hemisphere expansion are not well known and the expansion has not so far been reproduced by climate models. Here we use a climate model with detailed aerosol physics to show that increases in heterogeneous warming agents, including black carbon aerosols and tropospheric ozone, are noticeably better than greenhouse gases at driving expansion, and can account for the observed summertime maximum in tropical expansion. Mechanistically, atmospheric heating from black carbon and tropospheric ozone has occurred at the mid-latitudes, generating a poleward shift of the tropospheric jet, thereby relocating the main division between tropical and temperate air masses. Although we still underestimate tropical expansion, the true aerosol forcing is poorly known and could also be underestimated. Thus, although the insensitivity of models needs further investigation, black carbon and tropospheric ozone, both of which are strongly influenced by human activities, are the most likely causes of observed Northern Hemisphere tropical expansion.

+-
Charlie Zender, Ph.D., Pedro Vicente, WenShan Wang

Session: Climate Data Analysis: From Satellites to Climate Models

The Future of Model Evaluation

Charlie Zender, University of California, Irvine

Pedro Vicente

WenShan Wang

Geoscientific model evaluation often means comparing model simulations in netCDF format to satellite observations in HDF format. Analysis techniques that exploit the hierarchical structure of these self-describing data formats can be more intuitive, simple, and efficient than traditional analysis techniques. To unleash the full power of HDF/netCDF4 storage capabilities, one must have tools to manipulate and aggregate disparate datasets into larger structures that facilitate parallel processing. We describe our recent progress extending the netCDF Operators (NCO) to process netCDF and HDF-EOS datasets that use hierarchical groups. We illustrate our approach by showing how much easier it is to characterize, evaluate, and intercompare Earth System Model-simulated (CMIP5 archive) and satellite-retrieved trends with the group and HDF-EOS "aware" NCO compared to previous methods.

+-
Joel Norris, Ph.D.

Joel Norris, Scripps Institution of Oceanography, University of California, San Diego

Clouds play a key role in the climate system and are one of the biggest uncertainties in our understanding of climate change. Investigation of cloud changes in recent decades is severely impeded by inhomogeneities in the observational record. Recent work by the presenter, however, demonstrates that statistical techniques are able to remove spurious variability from the satellite cloud record. This enables identification of regional patterns of cloud change that resemble those projected by climate models to occur for global warming.

+-
Toshihisa Matsui, Ph.D.

Session: Climate Data Analysis: From Satellites to Climate Models

+-
Ian Misner, Ph.D.

+-
Atanas Radenski, Ph.D. & Louis Ehwerhemuepha, Ph.D.

+-
Gennady Verkhivker, Ph.D.

+-
Michael Mahoney, Ph.D. - Abstract 2

Motivated by problems in large-scale data analysis, randomized algorithms for matrix problems such as regression and low-rank matrix approximation have been the focus of a great deal of attention in recent years. These algorithms exploit novel random sampling and random projection methods; and implementations of these algorithms have already proven superior to traditional state-of-the-art algorithms, as implemented in LAPACK and other high-quality scientific computing software, for moderately large problems stored in RAM on a single machine. Here, we describe the extension of these methods to computing high-precision solutions in parallel and distributed environments that are more common in very large-scale data analysis applications.

In particular, we consider both the Least Squares Approximation problem and the Least Absolute Deviation problem, and we develop and implement randomized algorithms that take advantage of modern computer architectures in order to achieve improved communication profiles. Our iterative least-squares solver, LSRN, is competitive with state-of-the-art implementations on moderately large problems; and, when coupled with the Chebyshev semi-iterative method, it scales well for solving large problems on clusters that have high communication costs, such as an Amazon Elastic Compute Cloud cluster. Our iterative least-absolute-deviations solver is based on fast ellipsoidal rounding, random sampling, and interior-point cutting-plane methods; we demonstrate significant improvements over traditional algorithms on MapReduce. In addition, this algorithm can be extended to solve more general convex problems on MapReduce.
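
The simplest member of this family of randomized methods, sketch-and-solve least squares with a dense Gaussian projection, illustrates the role of random projection (a sketch of the idea only; LSRN instead uses the random projection to build a preconditioner for an iterative solver, and all sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 10                           # tall, skinny least-squares problem
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=n)

# Sketch-and-solve: compress the n rows down to m << n with a Gaussian
# random projection, then solve the much smaller least-squares problem.
m = 200                                   # sketch size
S = rng.normal(size=(m, n)) / np.sqrt(m)  # Gaussian sketching matrix
x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
```

The sketched solve touches only an m-by-d system, which is the communication saving that the parallel and MapReduce variants in the abstract exploit at much larger scale.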

+-
Miles Lopes, Ph.D.

Session: Random Solutions to Big Problems

A More Powerful Two-Sample Test in High-dimensions using Random Projection

Miles Lopes, University of California, Berkeley

We study the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing the data dimension p to exceed the sample size n. Specifically, we propose a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T^2 statistic. Working under a high-dimensional framework with (p, n) tending to infinity, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Lastly, using ROC curves generated from simulated data, we demonstrate superior performance over competing tests in the parameter regimes anticipated by our theoretical results.
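
A rough sketch of the idea: project the p-dimensional samples down to k dimensions, where the pooled covariance becomes invertible, and apply Hotelling's T^2 there (the dimensions, the mean shift, and the single fixed Gaussian projection below are illustrative simplifications of the proposed test):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, k = 500, 100, 10            # dimension p exceeds group size n; project to k
X = rng.normal(size=(n, p))               # group 1: standard normal, mean 0
Y = rng.normal(size=(n, p)) + 0.5         # group 2: every coordinate shifted

R = rng.normal(size=(p, k)) / np.sqrt(k)  # random projection matrix
Xk, Yk = X @ R, Y @ R                     # projected samples, now k-dimensional

diff = Xk.mean(axis=0) - Yk.mean(axis=0)
# Pooled sample covariance in the projected space; invertible since k < n.
S = (np.cov(Xk, rowvar=False) + np.cov(Yk, rowvar=False)) / 2.0
T2 = (n * n / (2.0 * n)) * diff @ np.linalg.solve(S, diff)  # Hotelling T^2
```

In the original p-dimensional space the pooled covariance is singular when p > 2n - 2, so the classical statistic cannot even be formed; the projection restores a well-posed test.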

+-
George Papandreou, Ph.D.

Session: Random Solutions to Big Problems

Being Friends with Noise: Probabilistic Machine Learning in Computer Vision and Multimodal Perception

George Papandreou, University of California, Los Angeles

Machine learning allows us to automatically reason about data. It plays an increasingly important role in building computer vision and multimodal perception systems which are able to interpret the ever growing volume of images and videos available in digital form. In these domains we typically deal with complex sensory signals that feature strongly stochastic aspects such as missing data and noisy measurements.

Probabilistic Bayesian machine learning methods are particularly well suited to describing such ambiguous data, providing a natural conceptual framework for quantifying the uncertainty in interpreting them. I will illustrate, with examples from my work in image modeling, computer vision, and audiovisual speech recognition, that the Bayesian approach can be very fruitful in practical applications, allowing us to learn model parameters and adaptively fuse heterogeneous information sources in a principled fashion.

Despite these advantages, applying probabilistic techniques to large-scale data such as those arising in computer vision can pose significant computational challenges and alternative optimization-based deterministic methods are often preferred. I will describe my recent research on Perturb-and-MAP random sampling which brings powerful techniques from optimization into probabilistic modeling, making Bayesian inference computationally tractable for challenging computer vision problems such as image inpainting and deblurring, image segmentation, and scene labeling.
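
In its simplest, unstructured form, Perturb-and-MAP reduces to the Gumbel-max trick: perturb the log-probabilities with independent Gumbel noise and return the MAP (argmax) configuration, which is then an exact sample from the distribution. A minimal sketch (the target distribution below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
logits = np.log(np.array([0.2, 0.5, 0.3]))   # target discrete distribution

def perturb_and_map(logits, rng):
    """Draw one exact sample by perturbing log-probabilities with Gumbel
    noise and taking the MAP (argmax) of the perturbed scores."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return int(np.argmax(logits + gumbel))

draws = [perturb_and_map(logits, rng) for _ in range(20000)]
freq = np.bincount(draws, minlength=3) / len(draws)
```

The appeal in vision problems is that the argmax step can be handled by the same fast MAP solvers (e.g., graph cuts) used in deterministic optimization, turning optimization machinery into a sampler.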

+-
Daven Henze, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

+-
Tracey Holloway

Session: From Large Earth Science Datasets to Compelling Scientific Results

Satellite and Model Data to Support Air Quality Management

Tracey Holloway, University of Wisconsin

+-
Jaechoul Lee, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

+-
Simon Urbanek, Ph.D.

Session: Visualization of Big Data

+-
William Cleveland, Ph.D. - Abstract 2

Session: Visualization of Big Data

+-
Nicholas Lewin-Koh, Ph.D.

Session: Visualization of Big Data

+-
Maria Val Martin, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

In recent years, the availability of large satellite datasets has provided an extraordinary opportunity to improve our understanding of the mechanisms controlling atmospheric composition. In particular, these datasets have contributed toward improving the representation of fire emissions in climate and air quality models and assessments. In this talk, I will present several examples in which computational analysis of the multi-year record of satellite aerosol observations has enabled characterization of fire smoke processes and their impacts. I will discuss (a) the determination of smoke plume heights from fires over North America; (b) the investigation of the main physical factors that determine smoke plume rise; and (c) the assessment of fire impacts on aerosol loading and air quality over Colorado.

+-
Charles Ichoku, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

Atmospheric aerosols are routinely retrieved from measurements acquired by such spaceborne sensors as MODIS on the Terra and Aqua satellites, MISR on Terra, OMI on Aura, POLDER on the French PARASOL, CALIOP on CALIPSO, and SeaWiFS on SeaStar. The aerosol measurements collected by these instruments over the last decade constitute the most complete set of complementary aerosol measurements ever acquired. Overall, there are 11 different products from these 7 spaceborne sensors, because aerosols are retrieved from MODIS over land using different algorithms. To derive the full scientific benefit of this diversity of measurements by using them synergistically, they are being carefully and uniformly analyzed in a comparative manner, in order to understand their uncertainties and limitations using coincident ground-based aerosol measurements from the Aerosol Robotic Network (AERONET). In this presentation, we will show results of detailed statistical analysis of these products, which reveal their relative strengths and limitations over different locations around the world, thereby illustrating which measurements are most reliable in different regions and over different land-cover types.

+-
Mian Chin, Ph.D.

Session: From Large Earth Science Datasets to Compelling Scientific Results

Multi-decadal variations of aerosols from multi-platform data and models from 1980 to 2009

Mian Chin, NASA Goddard Space Flight Center

We present a global model analysis of aerosol trends from 1980 to 2009 in different land and oceanic regions of the world, assessing the impact of anthropogenic and natural emissions on those trends. Aerosol optical depths simulated by the global model GOCART are compared with long-term data from satellite retrievals (AVHRR, TOMS, SeaWiFS, MODIS, and MISR) and ground-based sunphotometer (AERONET) measurements, and simulated surface concentrations are compared with measurements from the IMPROVE network in the U.S., the EMEP network in Europe, and the University of Miami-managed sites on islands in the oceans. We examine the relationship between emissions, surface concentrations, and column AOD in pollution-, dust-, and biomass-burning-dominated source regions and downwind areas, and assess the anthropogenic impact on global and regional aerosol trends.

+-
Kyle Caudle

Data streams provide unique challenges that are not normally encountered in standard statistical analysis. Foremost is the fact that data arrive at such a high rate that storing them for later analysis is no longer feasible. Understanding the underlying distribution of the data helps us understand how systems operate under stable, status-quo conditions. We propose a method for performing multivariate wavelet density estimation in large dimensions (i.e., 5 or more) by processing each of the wavelet and scaling functions in parallel and then piecing the results back together. Current work utilizes code-generating software that automatically produces multidimensional code based on the selected number of dimensions. Once the underlying density is constructed, tests are performed to check for changes in the underlying density function as new data arrive.
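
In one dimension, the scaling-function part of such an estimator is easy to write down; the Haar sketch below (my illustration, not the authors' multivariate parallel code) coincides with a histogram at the chosen resolution:

```python
import numpy as np

def haar_density(samples, level=4):
    """Haar scaling-function density estimate on [0, 1).

    At resolution `level` this equals a histogram with 2**level equal-width
    bins; the coefficients c_k are the empirical scaling coefficients."""
    nbins = 2 ** level
    scale = np.sqrt(nbins)                  # height of phi_{J,k} on its bin
    idx = np.clip((np.asarray(samples) * nbins).astype(int), 0, nbins - 1)
    coeffs = scale * np.bincount(idx, minlength=nbins) / len(samples)

    def fhat(x):
        k = np.clip((np.asarray(x) * nbins).astype(int), 0, nbins - 1)
        return coeffs[k] * scale            # sum_k c_k * phi_{J,k}(x)
    return coeffs, fhat

rng = np.random.default_rng(3)
coeffs, fhat = haar_density(rng.uniform(size=10000))
```

Because each coefficient is an independent empirical average, the coefficients in higher dimensions can be computed in parallel and the estimate reassembled afterward, which is the structure the talk exploits.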

L.O. Mearns, Institute for Mathematical Applications to the Geosciences, National Center for Atmospheric Research, Boulder, CO

Surface air temperature, precipitation, and surface insolation, the three key fields in shaping surface hydrology and atmosphere-land interaction, as simulated by the multiple RCMs that participated in the NARCCAP hindcast experiment, are evaluated for the conterminous U.S. for the period 1980-2003 using the Regional Climate Model Evaluation System (RCMES). Findings in this study illustrate that all models reasonably simulate the spatial pattern and variability of the annual-mean climatology of these three fields in the conterminous U.S., and that model performance in simulating the spatial variability varies more widely across RCMs than performance for the spatial pattern. A number of systematic model biases in simulating these variables have also been found. For the annual-mean climatology, all five RCMs generate a warm bias over the Great Plains and the California Central Valley and a cold bias over the coastal regions along the Atlantic Ocean and the Gulf of Mexico. The warm bias over the Great Plains occurs in both summer and winter; however, the model bias in other regions varies considerably with model and season. The most notable errors common to the majority of these RCMs in simulating the annual-mean precipitation include a wet bias in the Pacific Northwest region and a dry bias in the Gulf of Mexico and southern Great Plains regions, especially for the inland Pacific Northwest in all seasons and for the Arizona-New Mexico region in summer. In terms of the normalized RMSE, all RCMs perform better for the eastern half of the U.S. than for the western half. Most RCMs overestimate the annual-mean surface insolation over the conterminous U.S.; all RCMs show either a larger positive bias or a smaller negative bias in the eastern half of the conterminous U.S. than in the western half. For all RCMs and their ensemble, the spatial pattern of the insolation bias is negatively correlated with that of the precipitation bias, suggesting that the biases in precipitation and surface insolation are related, most likely via the cloud fields. For the three fields evaluated in this study, the multi-model ensemble is among the best performers for all metrics, regions, and seasons. The systematic variations in model errors with region, season, variable, and metric found in this study suggest that bias correction, a key step in applying climate model data to assess climate impacts on various sectors, may have to be performed separately for each region, season, variable, and assessment model.
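
For reference, the two headline metrics in this evaluation, mean bias and RMSE normalized by the observed variability, can be computed as in the generic sketch below (not the RCMES implementation; the data are illustrative):

```python
import numpy as np

def bias_and_nrmse(model, obs):
    """Mean bias and RMSE normalized by the observed standard deviation,
    two common metrics for ranking models against observations."""
    model, obs = np.asarray(model, float), np.asarray(obs, float)
    bias = np.mean(model - obs)
    rmse = np.sqrt(np.mean((model - obs) ** 2))
    return bias, rmse / np.std(obs)

# Illustrative case: a "model" that is uniformly 1.5 degrees too warm.
obs = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
model = obs + 1.5
```

Normalizing the RMSE by the observed standard deviation is what makes scores comparable across variables and regions with very different natural variability.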

+-
Robert Walko, Ph.D.

Session: Massive Data Challenges in Numerical Weather Modeling

Use of variable-resolution gridding in the Ocean-Land-Atmosphere Model (OLAM) for optimal utilization of resources on large and small computers

R.L. Walko, Rosenstiel School of Marine and Atmospheric Science, University of Miami, Miami, FL

D.M. Medvigy, Department of Geosciences and Program in Atmospheric and Oceanic Sciences, Princeton University, Princeton, NJ

R. Avissar, Rosenstiel School of Marine and Atmospheric Science, University of Miami, Miami, FL

Atmospheric, oceanic, hydrological, ecosystem, and other environmental modeling systems are capable of consuming an enormous number of computing cycles and generating an enormous quantity of digital output available for subsequent analysis and post-processing. Management of model simulations and the related data communication, storage, and processing rank among the major 'Big Data' challenges in the environmental sciences, alongside storage and processing of the ever-increasing stream of observational data. It has been argued that along with Big Data comes a growing need and obligation for 'Big Judgment', which for numerical modeling includes planning ahead and exercising insight and intuition in order to optimally design model simulations for maximum yield of useful information for a given allocation of computing resources. Variable-resolution computational grids provide one means of increasing the benefit-to-cost ratio in many modeling applications. A common example is regional climate modeling, which concentrates high resolution over geographic regions of key interest or importance while covering the remainder of the planet with lower grid resolution that is much less costly in both computational cycles and data storage. The Ocean-Land-Atmosphere Model (OLAM), a novel environmental simulation system that incorporates seamless variable-resolution grid methods, is used to describe and demonstrate applications of this technique. We present examples from both regional and global modeling applications where spatially selective higher resolution provides substantial overall benefits compared to uniform resolution. Advantages of the seamless grid over the more traditional nested-grid technique are also discussed.

+-
Craig Tremback, Ph.D.

Session: Massive Data Challenges in Numerical Weather Modeling

+-
Seon K. Park, Ph.D.

Session: Massive Data Challenges in Numerical Weather Modeling

Development of an Integrated Prediction System for Climate-Environment-Ecosystem Interactions and Corresponding GIS-based Database and Web Display System

Climate change affects the various components of the global/regional environmental system, including the atmosphere, hydrosphere, biosphere, and land surfaces, which interact nonlinearly with each other. These components in turn exert an impact on climate change itself via positive/negative feedback processes. There has been comparatively little effort to investigate interactions between the environmental system and climate change; in particular, the feedback processes associated with environmental components, acting through macro- and micro-scale changes, remain poorly understood.

The Center for Climate/Environment Change Prediction Research (CCCPR) aims at developing an integrative prediction system for climate-environment-ecosystem interactions. We conduct core research on identifying nonlinear interactions and related feedback processes in the climate-environment system. To achieve our goal, the research efforts of the center are divided into three strongly connected themes: 1) climate/atmospheric environment prediction; 2) ecology/water environment prediction; and 3) development of an interaction diagnosis/prediction system.

For climate/atmospheric environment prediction, we conduct research on climate change analysis and scenario production as well as atmospheric chemistry/aerosol analysis and prediction. In particular, nonlinear feedback processes between climate and the atmospheric environment are studied in depth. In studying ecology/water environment prediction, we focus on 1) analysis and prediction of vegetation and ecosystem responses to climate change; 2) analysis of changes in water and soil chemistry characteristics; and 3) development of an ecosystem/water environment prediction model. The feedback of changes in water and soil chemistry to the ecosystem and water environment, which in turn affects climate, is studied in detail. In developing the interaction diagnosis/prediction system, we are developing 1) coupled atmosphere-land surface process modeling; 2) remote sensing observation and monitoring techniques; and 3) an interface and integrated database (DB). In this task, a data interface is developed, and the GIS-based integrated database and web display system is operated to consolidate all product data for sharing and distribution within CCCPR and among other research communities. Further details will be discussed.