Keynote at the Sixth International Workshop on Cloud Data Management (CloudDB 2014), Chicago, March 31, 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope across industry, government, and research areas. We look at their structure from several points of view, or facets, covering problem architecture, analytics kernels, micro-system usage such as flops per byte, application class (GIS, expectation maximization) and, very importantly, data source.
We then propose that in many cases it is wise to combine the well-known commodity best-practice (often Apache) Big Data Stack (with ~120 software subsystems) with high-performance computing technologies.
We describe this and give early results based on clustering run with different paradigms.
We identify key layers where HPC-Apache integration is particularly important: file systems, cluster resource management, file and object data management, inter-process and thread communication, analytics libraries, workflow, and monitoring.

4.
NIST Requirements and Use Case Subgroup
• Part of NIST Big Data Public Working Group (NBD-PWG) June-September 2013
http://bigdatawg.nist.gov/
• Leaders of activity
– Wo Chang, NIST
– Robert Marcus, ET-Strategies
– Chaitanya Baru, UC San Diego
The focus is to form a community of interest from industry, academia,
and government, with the goal of developing a consensus list of Big
Data requirements across all stakeholders. This includes gathering and
understanding various use cases from diversified application domains.
Tasks
• Gather use case input from all stakeholders
• Derive Big Data requirements from each use case.
• Analyze and prioritize a list of challenging general requirements that may delay or
prevent adoption of Big Data deployments
• Develop a set of general patterns capturing the “essence” of the use cases (to do)
• Work with the Reference Architecture subgroup to validate requirements and the reference
architecture by explicitly implementing some patterns based on use cases

5.
Big Data Definition
• More consensus on the definition of Data Science than on that of Big Data
• Big Data refers to digital data volume, velocity and/or variety that:
• Enable novel approaches to frontier questions previously
inaccessible or impractical using current or conventional methods;
and/or
• Exceed the storage capacity or analysis capability of current or
conventional methods and systems; and
• Differentiate by storing and analyzing population data rather than
just samples
• Needs management requiring scalability across coupled
horizontal resources
• Everybody says their data is big (!) – perhaps how it is used is most
important

6.
What is Data Science?
• I was impressed by the number of NIST working group members who
were self-declared data scientists
• I was also impressed by the universal adoption by participants of
Apache technologies – see later
• McKinsey says there are lots of jobs (1.65M by 2018 in the USA) but
that’s not enough! Is this a field – what is it and what is its core?
• The emergence of the 4th, or data-driven, paradigm of science
illustrates its significance – http://research.microsoft.com/en-us/collaboration/fourthparadigm/
• Discovery is guided by data rather than by a model
• “The End of (traditional) Science” http://www.wired.com/wired/issue/16-07 is famous here
• Another example is recommender systems in Netflix, e-commerce etc.,
where pure data (user ratings of movies or products) allows an
empirical prediction of what users like

8.
Data Science Definition
• Data Science is the extraction of actionable knowledge directly from data
through a process of discovery, hypothesis formulation, and hypothesis
testing.
• A Data Scientist is a practitioner who has sufficient knowledge of the
overlapping regimes of expertise in business needs, domain knowledge,
analytical skills, and programming expertise to manage the end-to-end
scientific method process through each stage in the big data lifecycle.

12.
3: Census Bureau Statistical Survey
Response Improvement (Adaptive Design)
• Application: Survey costs are increasing as survey response declines. The goal of this
work is to use advanced “recommendation system techniques” that are open and
scientifically objective, using data mashed up from several sources and historical
survey paradata (administrative data about the survey), to drive operational
processes in an effort to increase quality and reduce the cost of field surveys (a toy
propensity-scoring sketch follows below).
• Current Approach: About a petabyte of data comes from surveys and other
government administrative sources. Approximately 150 million records of field data
are streamed continuously during the decennial census. All data must be both
confidential and secure, and all processes must be auditable for security and
confidentiality as required by various legal statutes. Data quality should be high and
statistically checked for accuracy and reliability throughout the collection process.
Software used includes Hadoop, Spark, Hive, R, SAS, Mahout, AllegroGraph, MySQL,
Oracle, Storm, BigMemory, Cassandra, and Pig.
• Futures: Analytics need to be developed that give statistical estimations with more
detail, on a nearer real-time basis, at lower cost. The reliability of estimated statistics
from such “mashed up” sources still must be evaluated.
Government
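A minimal sketch of the adaptive-design idea above, assuming a recommendation-style propensity model: score open survey cases by predicted response propensity from paradata, then prioritize field follow-up. The features, the scikit-learn logistic regression, and all numbers are invented for illustration, not the Census Bureau's actual system.

```python
# Hypothetical propensity scoring for adaptive survey design (illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy paradata features: [prior contact attempts, days since last attempt, urban flag]
X_hist = rng.normal(size=(1000, 3))                       # historical cases
y_hist = (X_hist[:, 0] - 0.5 * X_hist[:, 1]
          + rng.normal(size=1000) > 0).astype(int)        # 1 = responded

model = LogisticRegression().fit(X_hist, y_hist)

open_cases = rng.normal(size=(20, 3))                     # cases still in the field
propensity = model.predict_proba(open_cases)[:, 1]        # P(response | paradata)

# Prioritize follow-up visits for the cases most likely to convert.
priority = np.argsort(-propensity)
print(priority[:5], np.round(propensity[priority[:5]], 2))
```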

13.
7: Netflix Movie Service
• Application: Allow streaming of user-selected movies to satisfy multiple objectives (for
different stakeholders) – especially retaining subscribers. Find the best possible ordering of a
set of videos for a user (household) within a given context in real time; maximize movie
consumption. Digital movies are stored in the cloud with metadata; user profiles and rankings
exist for a small fraction of movies for each user. Use multiple criteria – content-based
recommender system, user-based recommender system, diversity. Refine algorithms
continuously with A/B testing.
• Current Approach: Recommender systems and streaming video delivery are core Netflix
technologies. Recommender systems are always personalized and use logistic/linear
regression, elastic nets, matrix factorization, clustering, latent Dirichlet allocation,
association rules, gradient boosted decision trees, etc. (a toy matrix factorization sketch
follows below). The winner of the Netflix Prize competition (to improve rating prediction by
10%) combined over 100 different algorithms. Uses SQL, NoSQL, and MapReduce on
Amazon Web Services. Netflix recommender systems have features in common with
e-commerce sites such as Amazon. Streaming video has features in common with
other content-providing services like iTunes, Google Play, Pandora and Last.fm.
• Futures: Very competitive business. Need to be aware of other companies and trends in
both content (which movies are hot) and technology. Need to investigate new business
initiatives such as Netflix-sponsored content.
Commercial
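A minimal sketch of one technique named on the slide, matrix factorization for a recommender, trained by stochastic gradient descent on a toy explicit-ratings matrix. The rank, learning rate, regularization, and ratings are illustrative assumptions, not Netflix's settings.

```python
# Toy matrix factorization recommender trained by SGD (illustration only).
import numpy as np

rng = np.random.default_rng(1)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)        # users x items, 0 = unrated
mask = R > 0
k, lr, reg = 2, 0.01, 0.02                        # latent rank, step size, L2 penalty

U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # item factors

for _ in range(2000):                             # SGD over the observed ratings
    for u, i in zip(*np.nonzero(mask)):
        pu, qi = U[u].copy(), V[i].copy()
        err = R[u, i] - pu @ qi
        U[u] += lr * (err * qi - reg * pu)
        V[i] += lr * (err * pu - reg * qi)

# Predicted ratings; the previously unrated cells are what would be ranked for each user.
print(np.round(U @ V.T, 1))
```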

14.
15: Intelligence Data
Processing and Analysis
• Application: Allow Intelligence Analysts to a) Identify relationships between entities
(people, organizations, places, equipment) b) Spot trends in sentiment or intent for either
general population or leadership group (state, non-state actors) c) Find location of and
possibly timing of hostile actions (including implantation of IEDs) d) Track the location and
actions of (potentially) hostile actors e) Ability to reason against and derive knowledge
from diverse, disconnected, and frequently unstructured (e.g. text) data sources f) Ability
to process data close to the point of collection and allow data to be shared easily to/from
individual soldiers, forward deployed units, and senior leadership in garrison.
• Current Approach: Software includes Hadoop, Accumulo (BigTable), Solr, natural
language processing, Puppet (for deployment and security), and Storm, running on
medium-size clusters. Data sizes are in the 10s of terabytes to 100s of petabytes, with an
imagery intelligence device gathering a petabyte in a few hours. Dismounted warfighters
would have at most 1–100s of gigabytes (typically handheld data storage). A toy entity
co-occurrence sketch follows below.
• Futures: Data currently exists in disparate silos, which must be made accessible through a
semantically integrated data space. There is a wide variety of data types, sources, structures,
and quality, which will span domains and requires integrated search and reasoning. The most
critical data is either unstructured or imagery/video, which requires significant processing
to extract entities and information. Network quality, provenance, and security are essential.
Defense
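A minimal sketch (a toy stand-in for the NLP and Accumulo/Solr stack on the slide) of the first requirement, identifying relationships between entities: build an entity co-occurrence graph from free-text reports. The entity list and reports are invented for illustration.

```python
# Toy entity co-occurrence graph built from free-text reports (illustration only).
from collections import Counter
from itertools import combinations

ENTITIES = {"Alpha Group", "Bridge 7", "Depot 12"}   # assumed known entity names

reports = [
    "Alpha Group observed near Bridge 7 at dawn.",
    "Vehicles moved from Depot 12 toward Bridge 7.",
    "Alpha Group seen again at Depot 12.",
]

edges = Counter()
for text in reports:
    present = sorted(e for e in ENTITIES if e in text)
    for a, b in combinations(present, 2):            # entities mentioned together
        edges[(a, b)] += 1

for (a, b), n in edges.most_common():                # strongest relationships first
    print(f"{a} -- {b}: {n} co-mention(s)")
```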

15.
26: Large-scale Deep Learning
• Application: Large models (e.g., neural networks with more neurons and connections) combined with
large datasets are increasingly the top performers in benchmark tasks for vision, speech, and Natural
Language Processing. One needs to train a deep neural network from a large (>>1TB) corpus of data
(typically imagery, video, audio, or text). Such training procedures often require customization of the
neural network architecture, learning criteria, and dataset pre-processing. In addition to the
computational expense demanded by the learning algorithms, the need for rapid prototyping and
ease of development is extremely high.
• Current Approach: The largest applications so far are to image recognition and scientific studies of
unsupervised learning, with 10 million images and up to 11 billion parameters on a 64-GPU HPC
InfiniBand cluster. Both supervised (using existing classified images) and unsupervised applications
are being investigated.
Deep Learning
Social Networking
• Futures: Large datasets of 100TB or more may be
necessary in order to exploit the representational power
of the larger models. Training a self-driving car could take
100 million images at megapixel resolution. Deep
Learning shares many characteristics with the broader
field of machine learning. The paramount requirements
are high computational throughput for mostly dense
linear algebra operations, and extremely high productivity
for researcher exploration. One needs integration of high-
performance libraries with high-level (Python) prototyping
environments (a minimal NumPy sketch of the dense
linear algebra at the core follows below).
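A minimal sketch of why this workload is dominated by dense linear algebra: one forward/backward pass of a tiny two-layer network written directly in NumPy. The layer sizes, data, and learning rate are illustrative assumptions; production systems run the same GEMM-heavy pattern on GPUs via high-performance libraries.

```python
# One forward/backward step of a tiny dense network in NumPy (illustration only).
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_hid, d_out = 256, 1024, 4096, 10      # assumed toy sizes
lr = 0.1

X = rng.normal(size=(batch, d_in))
y = rng.integers(0, d_out, size=batch)
W1 = rng.normal(scale=0.01, size=(d_in, d_hid))
W2 = rng.normal(scale=0.01, size=(d_hid, d_out))

H = np.maximum(X @ W1, 0.0)                          # forward: dense GEMM + ReLU
logits = H @ W2                                      # forward: dense GEMM
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)                    # softmax
loss = -np.log(P[np.arange(batch), y]).mean()

dlogits = P
dlogits[np.arange(batch), y] -= 1.0
dlogits /= batch
gW2 = H.T @ dlogits                                  # backward: more dense GEMMs
dH = (dlogits @ W2.T) * (H > 0)
gW1 = X.T @ dH
W1 -= lr * gW1
W2 -= lr * gW2

print(f"cross-entropy loss at this step: {loss:.3f}")
```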

16.
35: Light source beamlines
• Application: Samples are exposed to X-rays from light sources in a variety of
configurations depending on the experiment. Detectors (essentially high-speed
digital cameras) collect the data. The data are then analyzed to reconstruct a
view of the sample or process being studied.
• Current Approach: A variety of commercial and open source software is used for
data analysis – examples including Octopus for Tomographic Reconstruction,
Avizo (http://vsg3d.com) and FIJI (a distribution of ImageJ) for Visualization and
Analysis. Data transfer is accomplished using physical transport of portable
media (severely limits performance) or using high-performance GridFTP,
managed by Globus Online or workflow systems such as SPADE.
• Futures: Camera resolution is continually increasing. Data transfer to large-scale
computing facilities is becoming necessary because of the computational power
required to conduct the analysis on time scales useful to the experiment. The large
number of beamlines (e.g. 39 at the LBNL ALS) means that the total data load is likely to
increase significantly and require a generalized infrastructure for analyzing
gigabytes per second of data from many beamline detectors at multiple
facilities (a back-of-envelope rate estimate follows below).
Research Ecosystem
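A back-of-envelope sketch of the aggregate data-rate claim above; every number except the 39-beamline count quoted on the slide is an assumption, not an ALS specification.

```python
# Aggregate detector data rate, back of the envelope (all values assumed).
beamlines = 39                       # count quoted for the LBNL ALS on the slide
frame_pixels = 2048 * 2048           # assumed detector resolution
bytes_per_pixel = 2                  # assumed 16-bit depth
frames_per_second = 100              # assumed camera frame rate

per_beamline_gbs = frame_pixels * bytes_per_pixel * frames_per_second / 1e9
total_gbs = per_beamline_gbs * beamlines

print(f"per beamline: {per_beamline_gbs:.2f} GB/s, facility total: {total_gbs:.1f} GB/s")
```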

17.
36: Catalina Real-Time Transient Survey (CRTS):
a digital, panoramic, synoptic sky survey I
• Application: The survey explores the variable universe in the visible light regime, on time
scales ranging from minutes to years, by searching for variable and transient sources. It
discovers a broad variety of astrophysical objects and phenomena, including various types
of cosmic explosions (e.g., supernovae), variable stars, phenomena associated with
accretion onto massive black holes (active galactic nuclei) and their relativistic jets, high
proper motion stars, etc. The data are collected from 3 telescopes (2 in Arizona and 1 in
Australia), with additional ones expected in the near future (in Chile).
• Current Approach: The survey generates up to ~0.1 TB on a clear night, with a total of
~100 TB in current data holdings. The data are preprocessed at the telescope and
transferred to the Univ. of Arizona and Caltech for further analysis, distribution, and archiving.
The data are processed in real time, and detected transient events are published
electronically through a variety of dissemination mechanisms, with no proprietary
withholding period (CRTS has a completely open data policy); a toy image-differencing
sketch follows below. Further data analysis includes classification of the detected transient
events, additional observations using other telescopes, scientific interpretation, and
publishing. In this process, heavy use is made of archival data (several PBs) from a wide
variety of geographically distributed resources connected through the Virtual Observatory
(VO) framework.
Astronomy & Physics
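A minimal sketch of the core idea behind real-time transient detection, differencing a new exposure against a reference image and flagging large residuals. The synthetic images, the absence of PSF matching, and the 5-sigma cut are simplifications invented for illustration.

```python
# Toy transient detection by image differencing (illustration only).
import numpy as np

rng = np.random.default_rng(2)
reference = rng.normal(loc=100.0, scale=5.0, size=(64, 64))    # archival sky image
new_frame = reference + rng.normal(scale=5.0, size=(64, 64))   # tonight's exposure
new_frame[40, 17] += 80.0                                      # injected transient

diff = new_frame - reference
sigma = diff.std()
candidates = np.argwhere(np.abs(diff) > 5 * sigma)             # flag 5-sigma residuals

for y, x in candidates:
    print(f"transient candidate at pixel ({y}, {x}), amplitude {diff[y, x]:.1f}")
```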

19.
47: Atmospheric Turbulence - Event
Discovery and Predictive Analytics
• Application: This builds data mining on top of reanalysis products, including the North
American Regional Reanalysis (NARR) and NASA's Modern-Era Retrospective Analysis for
Research and Applications (MERRA), where the latter was described earlier. The analytics
correlate aircraft reports of turbulence (either from pilot reports or from automated aircraft
measurements of eddy dissipation rates) with recently completed atmospheric reanalyses.
This is of value to the aviation industry and to weather forecasters. There are no standards for
reanalysis products, which complicates the system; MapReduce is being investigated. The
reanalysis data is hundreds of terabytes and slowly updated, whereas the turbulence data is
smaller in size and implemented as a streaming service.
Earth, Environmental
and Polar Science
• Current Approach: The current 200TB dataset can
be analyzed with MapReduce or the like, using
SciDB or another scientific database (a toy
MapReduce-style correlation sketch follows below).
• Futures: The dataset will reach 500TB in 5
years. The initial turbulence case can be
extended to other ocean/atmosphere
phenomena but the analytics would be
different in each case.
[Figure: typical NASA image of turbulent waves]
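A minimal MapReduce-style sketch of the correlation step described above: map turbulence reports and reanalysis values to a common (grid cell, time) key, then reduce by key to pair them. The grid resolution and all records are invented for illustration.

```python
# MapReduce-style pairing of turbulence reports with reanalysis cells (illustration only).
from collections import defaultdict

reports = [          # (lat, lon, hour, eddy dissipation rate) -- invented aircraft data
    (40.1, -88.3, 12, 0.45),
    (40.4, -88.1, 12, 0.30),
]
reanalysis = [       # (lat, lon, hour, wind shear) -- invented reanalysis values
    (40.0, -88.0, 12, 7.2),
    (41.0, -89.0, 12, 3.1),
]

def cell(lat, lon, hour, res=1.0):
    """Map a point to a coarse (grid cell, hour) key."""
    return (round(lat / res), round(lon / res), hour)

grouped = defaultdict(lambda: {"edr": [], "shear": []})
for lat, lon, hr, edr in reports:                    # "map" phase
    grouped[cell(lat, lon, hr)]["edr"].append(edr)
for lat, lon, hr, shear in reanalysis:
    grouped[cell(lat, lon, hr)]["shear"].append(shear)

for key, vals in grouped.items():                    # "reduce" phase: co-located pairs
    if vals["edr"] and vals["shear"]:
        mean_edr = sum(vals["edr"]) / len(vals["edr"])
        print(key, "mean EDR", round(mean_edr, 2), "vs shear", vals["shear"])
```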

27.
Would like to capture the “essence of
these use cases”:
“small” kernels, mini-apps
or classify applications into patterns
Do it from an HPC background, not a database viewpoint
e.g. focus on cases with detailed analytics
Section 5 of my class
https://bigdatacoursespring2014.appspot.com/preview classifies the
51 use cases with Ogre facets

28.
What are “mini-Applications”?
• Used for benchmarks of computers and software (is my
parallel compiler any good?); a toy timing sketch follows below
• In parallel computing, this is well established
– Linpack for measuring performance to rank machines in Top500
(changing?)
– NAS Parallel Benchmarks (originally a pencil-and-paper
specification to allow optimal implementations; then an MPI library)
– Other specialized benchmark sets keep changing and are used to
guide procurements
• The last 2 NSF hardware solicitations had NO preset benchmarks –
perhaps because there is no agreement on key applications for clouds
and data-intensive applications
– Berkeley dwarfs capture different structures that any approach
to parallel computing must address
– Templates used to capture parallel computing patterns
• I’ll let experts comment on database benchmarks like TPC
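A minimal sketch of what a kernel benchmark such as Linpack measures: time a dense solve and convert to GFLOP/s. The matrix size is arbitrary and this is not the HPL code itself; the (2/3)n^3 flop count is the usual convention for LU factorization.

```python
# Toy Linpack-style timing: dense solve converted to GFLOP/s (illustration only).
import time
import numpy as np

n = 2000                              # arbitrary problem size
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
b = rng.normal(size=n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)             # LU factorization + triangular solves
elapsed = time.perf_counter() - t0

flops = (2.0 / 3.0) * n ** 3          # conventional count for the factorization
print(f"n={n}: {elapsed:.3f} s, {flops / elapsed / 1e9:.1f} GFLOP/s, "
      f"residual {np.linalg.norm(A @ x - b):.2e}")
```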

37.
18: Computational
Bioimaging
• Application: Data delivered from bioimaging is increasingly automated, higher
resolution, and multi-modal. This has created a data analysis bottleneck that, if
resolved, can advance bioscience discovery through Big Data techniques.
• Current Approach: The current piecemeal analysis approach does not scale to a
situation where a single scan on emerging machines is 32TB and medical
diagnostic imaging is annually around 70 PB even excluding cardiology. One
needs a web-based one-stop shop for high-performance, high-throughput image
processing for producers and consumers of models built on bioimaging data.
• Futures: The goal is to solve that bottleneck with extreme-scale computing and
community-focused science gateways supporting the application of massive data
analysis to massive imaging data sets. Workflow components include data
acquisition, storage, enhancement, noise minimization, segmentation of regions of
interest, crowd-based selection and extraction of features, and object
classification, organization, and search (a minimal segmentation sketch follows
below). Software used includes ImageJ, OMERO, VolRover, and advanced
segmentation and feature detection packages.
Healthcare
Life Sciences
Largely Local Machine Learning
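A minimal sketch of the segmentation step in the workflow above (denoise, threshold, label regions of interest) using SciPy; the synthetic image and threshold choice are illustrative, not the ImageJ/OMERO pipeline.

```python
# Toy denoise -> threshold -> label segmentation with SciPy (illustration only).
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(3)
image = rng.normal(loc=10.0, scale=2.0, size=(128, 128))    # noisy background
image[30:40, 50:60] += 25.0                                  # a bright "cell"
image[80:95, 20:30] += 25.0                                  # another one

smoothed = ndimage.gaussian_filter(image, sigma=2)           # minimize noise
mask = smoothed > smoothed.mean() + 3 * smoothed.std()       # crude global threshold
labels, n_objects = ndimage.label(mask)                      # regions of interest

areas = ndimage.sum(mask, labels, index=range(1, n_objects + 1))
print(f"found {n_objects} objects with pixel areas {areas}")
```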

38.
27: Organizing large-scale, unstructured
collections of consumer photos I
• Application: Produce 3D reconstructions of scenes using collections
of millions to billions of consumer images, where neither the scene
structure nor the camera positions are known a priori. Use the resulting
3D models to allow efficient browsing of large-scale photo
collections by geographic position. Geolocate new images by
matching them to 3D models. Perform object recognition on each image.
3D reconstruction is posed as a robust non-linear least squares
optimization problem in which observed relations between images
are constraints and the unknowns are the 6-D camera pose of each image
and the 3-D position of each point in the scene (a toy least-squares
sketch follows below).
• Current Approach: Hadoop cluster with 480 cores processing data
of initial applications. Note over 500 billion images on Facebook
and over 5 billion on Flickr with over 500 million images added to
social media sites each day.
Deep Learning
Social Networking
Global Machine Learning after Initial Local steps
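A toy version of the optimization posed above: recover an unknown camera offset and 3-D point positions from 2-D projections by robust nonlinear least squares with SciPy. The pinhole geometry, focal length, noise level, and gauge fixing are invented for illustration; real bundle adjustment handles full 6-D poses and millions of points.

```python
# Toy "bundle adjustment": robust nonlinear least squares with SciPy (illustration only).
import numpy as np
from scipy.optimize import least_squares

f = 500.0                                    # assumed focal length in pixels
cam_x_true = np.array([0.0, 1.0, 2.1])       # camera offsets along x; the third is "unknown"
pts_true = np.array([[0.5, 0.2, 6.0],
                     [-1.0, 0.4, 8.0],
                     [1.5, -0.3, 7.0],
                     [0.0, 1.0, 9.0]])       # 3-D points to recover

def project(points, cam_x):
    """Pinhole projection of 3-D points for cameras translated along x."""
    X = points[None, :, 0] - cam_x[:, None]            # (n_cams, n_pts)
    Y = np.broadcast_to(points[None, :, 1], X.shape)
    Z = np.broadcast_to(points[None, :, 2], X.shape)
    return f * np.stack([X / Z, Y / Z], axis=-1)       # (n_cams, n_pts, 2) pixel coords

obs = project(pts_true, cam_x_true)
obs = obs + np.random.default_rng(0).normal(scale=0.5, size=obs.shape)  # noisy matches

def residuals(params):
    cam_x = np.array([0.0, 1.0, params[0]])            # first two cameras fix the gauge
    points = params[1:].reshape(-1, 3)
    return (project(points, cam_x) - obs).ravel()      # reprojection errors

x0 = np.concatenate([[1.5], (pts_true + 0.3).ravel()]) # perturbed initial guess
sol = least_squares(residuals, x0, loss="huber")       # robust to outlier matches
print("recovered third-camera offset:", round(sol.x[0], 3))
```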

40.
This Facet of Ogres has Features
• These core analytics/kernels can be classified by features
such as:
• (a) Flops per byte (a worked example follows below)
• (b) Communication interconnect requirements
• (c) Is the application (graph) constant or dynamic?
• (d) Most applications consist of a set of interconnected
entities; is this regular, as in a set of pixels, or is it a
complicated irregular graph?
• (e) Is communication BSP or asynchronous? In the latter case
shared memory may be attractive
• (f) Are algorithms iterative or not?
• (g) Are data points in metric or non-metric spaces?
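A worked example of facet (a), flops per byte (arithmetic intensity), for two common kernels; the double-precision element size is standard and the choice of n is arbitrary.

```python
# Arithmetic intensity (flops per byte) for two kernels, double precision.
n = 4096
bytes_per_double = 8

# DAXPY: y = a*x + y  -> 2n flops over 3n doubles moved (read x, read y, write y)
daxpy = (2 * n) / (3 * n * bytes_per_double)
print(f"DAXPY: {daxpy:.3f} flops/byte")          # ~0.083: memory bound

# Dense matrix multiply C = A*B -> 2n^3 flops over 3n^2 doubles of matrix data
gemm = (2 * n ** 3) / (3 * n ** 2 * bytes_per_double)
print(f"GEMM:  {gemm:.1f} flops/byte")           # ~341: compute bound
```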