3.
About ISS
• Headquartered in Colorado Springs
• Other offices located in Washington DC, Hampton VA, Tampa FL, and Rome NY
Innovative Solutions from “Space to Mud and Everything Between”
• Sole prime on multiple Air Force Research Labs IDIQ programs
• Currently executing more than 100 software development projects
• Over 800 employees
• Strength in solutions development and deployment
Consistently Recognized as a Leader
• Recognized as a Deloitte Fast 50 Colorado company and a Deloitte Fast 500 company over eight consecutive years
• Three-time Inc. Magazine 500 winner
• 2009 Defense Company of the Year

4.
The data challenge
• Most electronic information is not relational, but unstructured (textual, binary) or semi-structured (spreadsheet, RSS feed).
  – In 2007, the estimated information content of all human knowledge was 295 exabytes (295 million terabytes)
  – Data production will be 44 times greater in 2020 than in 2009
    • Approximately 35 zettabytes total (35 billion terabytes)
  – A majority of the data produced in the future will be unstructured
• Unstructured data is easily processed by human beings, but is more difficult for machines.
• A tremendous amount of information and knowledge is dormant within unstructured data.

7.
The need
• Analysts are looking to extract knowledge from massive heterogeneous data sets, providing “actionable intelligence”
• Search and NLP techniques are key enablers: they allow an analyst to reliably search for the information they know about, and assist them in discovering the information they don’t know about
• It is critical (especially in tactical environments) to give the analyst tools that “shrink the haystack” to a more digestible size, and that seed that information into an analytics pipeline targeted at a particular problem domain (e.g. counter-terrorism, counter-narcotics)
  – Time-to-live on the relevance of collected data can be very short
  – It’s not about finding the needle in the haystack; it’s about giving a trained analyst the tools to present the most relevant information in a timely manner, allowing them to make an informed decision

11.
Additional Solr features that we find useful
• Synonym (aka “Semantic Search” to us)
  – Allows us to load pre-defined hierarchical synonym sets (driven by lexicons) to provide search that is tuned for a particular customer domain
    • For example, a search for “weapon” finds various gun types (AK-47, M-16)
  – Currently implemented at index time
  – Simple feature to implement, but has proven very powerful as a “practical analytic”
• Geospatial resolution (used in NLP pipeline)
  – Loaded GeoNames dataset into a separate Solr core
  – Allows for quick lookups in geospatial entity resolution
    • e.g. resolving “Paris” to a latitude/longitude-based geo-coordinate
  – Can boost based on general rules, or customer-specific ones
    • For example, which “Paris” is it? The one in France or Texas?
      – Population could be the boost parameter that returns Paris, France over Paris, Texas
    • Allows us to easily override for local conditions
      – For example, if a customer wants all geo resolution to be focused on a particular region of the world (i.e. their area of responsibility, AOR)
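
The two features on this slide can be sketched in plain Python to make the logic concrete. This is a minimal sketch under stated assumptions, not the Solr configuration the team used: the lexicon contents, the `Place` records (approximate coordinates and populations), the bounding-box form of the AOR, and the names `expand_at_index_time`, `resolve`, and `in_box` are all hypothetical. It shows (1) adding the broader lexicon term next to each narrower term at index time, and (2) choosing among same-named places by population, with an AOR that overrides the general rule.

```python
from dataclasses import dataclass

# Hypothetical lexicon: broader term -> narrower terms. Values are illustrative.
LEXICON = {"weapon": ["ak-47", "m-16"]}


def expand_at_index_time(tokens, lexicon=LEXICON):
    """Lowercase tokens and add the broader term after each narrower term."""
    broader_of = {}
    for broader, narrower_terms in lexicon.items():
        for term in narrower_terms:
            broader_of.setdefault(term, set()).add(broader)
    expanded = []
    for token in tokens:
        key = token.lower()
        expanded.append(key)
        expanded.extend(sorted(broader_of.get(key, ())))
    return expanded


@dataclass(frozen=True)
class Place:
    name: str
    region: str
    lat: float
    lon: float
    population: int


# Approximate, illustrative records; a real lookup would query the GeoNames core.
PLACES = [
    Place("Paris", "France", 48.85, 2.35, 2_100_000),
    Place("Paris", "Texas", 33.66, -95.56, 25_000),
]


def in_box(place, box):
    """True if the place lies inside a (south, west, north, east) box."""
    south, west, north, east = box
    return south <= place.lat <= north and west <= place.lon <= east


def resolve(name, places=PLACES, aor=None):
    """Pick the best same-named place: inside the AOR first, then by population."""
    matches = [p for p in places if p.name.lower() == name.lower()]
    if not matches:
        return None
    return max(
        matches,
        key=lambda p: (aor is not None and in_box(p, aor), p.population),
    )
```

With these assumptions, a document mentioning “AK-47” is indexed with the extra token “weapon”, so a plain query for “weapon” matches it. `resolve("Paris")` prefers France on population alone, while passing a box that roughly covers the continental United States, such as `aor=(24.0, -125.0, 50.0, -66.0)`, makes Texas win.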

12.
NLP techniques we use
• Leverage Unstructured Information Management Architecture (UIMA) for NLP pipeline
  – Supports analysis engines for both GATE/Gazetteer and OpenNLP/SML (supervised machine learning) techniques
  – Starting to use UIMA-AS (Asynchronous Scaleout) to help in scaling out various pipeline steps
  – Abstracts vendor-specific NLP engine details, allowing you to plug in different implementations without much disruption
• GATE/Gazetteer approach
  – Essentially dictionaries containing key terms used for categorization (facets)
  – Can have any number of “categories”, both generic and customer-domain defined
• OpenNLP/Supervised Machine Learning approach
  – “Context aware” models that are trained by data scientists/SMEs
  – Based on probabilistic theory (Maximum Entropy)
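
The pluggable-engine idea and the gazetteer approach can be sketched as follows. This is an illustration only: `AnalysisEngine`, `GazetteerEngine`, and `run_pipeline` are hypothetical names and do not reproduce the UIMA or GATE APIs. The sketch assumes single-word gazetteer terms and whitespace tokenization, which a real pipeline would replace with proper tokenization and multi-word matching.

```python
from typing import Dict, Iterable, List, Protocol, Set


class AnalysisEngine(Protocol):
    """Hypothetical stand-in for one pipeline step; not the UIMA API."""

    def annotate(self, text: str) -> Dict[str, Set[str]]:
        ...


class GazetteerEngine:
    """Dictionary lookup: category (facet) -> key terms that signal it."""

    def __init__(self, gazetteer: Dict[str, List[str]]):
        self._terms = {
            category: {term.lower() for term in terms}
            for category, terms in gazetteer.items()
        }

    def annotate(self, text: str) -> Dict[str, Set[str]]:
        # Naive tokenization: split on whitespace, strip common punctuation.
        tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
        facets: Dict[str, Set[str]] = {}
        for category, terms in self._terms.items():
            hits = tokens & terms
            if hits:
                facets[category] = hits
        return facets


def run_pipeline(text: str, engines: Iterable[AnalysisEngine]) -> Dict[str, Set[str]]:
    """Run every engine and merge their facets; engines are interchangeable."""
    merged: Dict[str, Set[str]] = {}
    for engine in engines:
        for category, hits in engine.annotate(text).items():
            merged.setdefault(category, set()).update(hits)
    return merged
```

Because `run_pipeline` depends only on the `annotate` method, a statistical engine could be added to the list next to the gazetteer without changing the calling code, which is the kind of decoupling the slide attributes to UIMA. Adding a category is just a new dictionary entry, for example `GazetteerEngine({"weapon": ["AK-47", "M-16"], "location": ["Paris"]})`.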

13.
Why use both NLP approaches?
• Both approaches have their pros and cons
• Gazetteer approach
  – Pros
    • Good precision: you are going to find what is important to you
    • Simple for an analyst to “tune”; does not require a data scientist
    • Quick and easy to add new categories to a problem domain
  – Cons
    • Only as good as the gazetteer
    • Not context aware
• Supervised Machine Learning approach
  – Pros
    • Once properly trained, good at finding new concepts in context
  – Cons
    • Requires a data scientist/SME to produce quality models
    • Can be tedious to train
• Bottom line: a combined approach helps you find the things you know are relevant, and also helps you find relevant things that you may not know about
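
A toy version of the combined approach is sketched below. It is a sketch under heavy assumptions: the weights in `WEIGHTS` are made up by hand and stand in for a trained maximum entropy model, the two context features are invented for illustration, and none of the names come from OpenNLP or GATE. The gazetteer catches the terms you already know; the model scores every other token from its context using the standard maximum entropy form, p(label | features) proportional to exp(sum of the weights of the active features).

```python
import math

# Known weapon terms (illustrative gazetteer).
GAZETTEER = {"ak-47", "m-16"}

# Hand-set weights standing in for a trained model; the values are made up.
WEIGHTS = {
    ("prev=fired", "WEAPON"): 2.0,
    ("has_digit", "WEAPON"): 0.5,
    ("prev=flight", "OTHER"): 1.5,
}
LABELS = ("WEAPON", "OTHER")


def features(tokens, i):
    """Context features for token i: the previous word and a digit flag."""
    feats = []
    if i > 0:
        feats.append("prev=" + tokens[i - 1])
    if any(ch.isdigit() for ch in tokens[i]):
        feats.append("has_digit")
    return feats


def maxent_probs(feats):
    """Softmax over per-label sums of the weights of the active features."""
    scores = {
        label: sum(WEIGHTS.get((f, label), 0.0) for f in feats)
        for label in LABELS
    }
    z = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / z for label, s in scores.items()}


def tag_weapons(text, threshold=0.7):
    """Return {token: source}, where source is 'gazetteer' or 'model'."""
    tokens = text.lower().split()
    found = {}
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:
            found[tok] = "gazetteer"
        elif maxent_probs(features(tokens, i))["WEAPON"] >= threshold:
            found[tok] = "model"
    return found
```

For the sentence “they fired rpg-7 rounds and carried an ak-47”, the gazetteer supplies “ak-47” and the model supplies “rpg-7”, a term the gazetteer has never seen, because it follows “fired” and contains a digit. In “flight 447 was delayed” the digit alone is not enough and nothing is tagged, which is the context awareness the slide refers to.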