Stream, score and store data in context

How to identify the data that matters to your organization

By Catherine Clark, SAS Federal Engagement Manager

Intelligence data, like data at all government agencies, has been growing at an alarming rate for years, and the trend appears set to continue for the foreseeable future. The intelligence community urgently needs to manage big data so that analysts and agents can focus on the clues and information that solve and stop crime, rather than spending all their time finding those clues.

A contextual scoring process that immediately analyzes the usefulness of new information gets everyone closer to the answer. As each new piece of data comes in, it is scored and prioritized based on a mission's actions, interests and queries. Cloud computing combined with high-performance analytics can organize big data into manageable, contextual and prioritized views for the mission.

Where is big data coming from?

Though the figure is a moving target, current data set sizes are on the order of terabytes, exabytes and zettabytes. They continue to grow because data is increasingly gathered by ubiquitous information-sensing mobile devices, sensory technologies, software logs, cameras, microphones, wireless sensor networks and so on. Every day, 2.5 quintillion bytes of data are created; an astonishing 90 percent of the data in the world today was created within the past two years. According to a September 2011 Economist Intelligence Unit survey of 586 executives, 99 percent of respondents reported increases in the amount of data their organizations collected.

A big data solution to the intelligence challenge – and others like it – must focus on how to prioritize the relevant data so that information gets to analysts sooner. By minimizing information overload and equipping analysts with the comprehensive information they need, organizations can make solid decisions faster.

How do you sort your mail?

In the physical world, everyone processes incoming information differently; however, most people perform some type of analytic process to manage their data. For example, when mail arrives, a person typically reviews the letters, bills and catalogues, and then decides whether an item should go in the trash, be processed immediately or put on hold for later review.

In the digital world, this rarely happens. As data arrives, no automated process reviews it. It is just pushed into some type of storage mechanism, leaving the sorting to the already overwhelmed analyst. This "store, then query" paradigm creates more analytic challenges as the data grows. The lack of processing turns organizations into information hoarders, with every bit of information treated as equal in importance to every other bit.

As with any problem of this magnitude, the issues with big data will drive innovative strategies in prioritization and relevance. But what if we could continuously evaluate and score enterprise data as a function of an organization's knowledge and domain rules?

Extract, transform, analyze and load

The typical process for data ingestion is to extract, transform and load, or ETL. Yet this ETL process is missing an important step: analyze. This analytic step, also called contextual scoring, evaluates the data for relevance before it is stored and provides the following capabilities:

Data normalization – the data attributes within a data model are organized to increase the cohesion of entity types. In other words, the goal of data normalization is to reduce – even eliminate – data redundancy.

Content categorization – applies natural language processing and advanced linguistic techniques to identify key topics and phrases in electronic text so that large volumes of multilingual content acquired, generated or existing in a repository can be automatically categorized.

Data enrichment – creates complete and accurate information on a worldwide level; enhances and verifies data with geographic, demographic, linguistic and other details.

Relevance scoring – detects the importance of each unique entity and assigns a relevance score in the range 0-1 (1 being the most relevant and important). This relevance is determined by an organization's mission, as expressed through its actions. Reports and queries generated by the analysts can be fed back into the system so that incoming data can be compared against them and assigned a point value based on similarities.
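The "analyze" step described above can be sketched in a few lines. This is a minimal illustration and not the author's or any vendor's implementation: it assumes that relevance is approximated by cosine similarity between word counts in the incoming record and in past analyst queries, and the names `tokenize`, `cosine`, `relevance` and `ingest` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity of two term-count vectors, clamped to the range 0-1."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return min(1.0, dot / norm) if norm else 0.0


def relevance(record_text, analyst_queries):
    """Score a record as its best match against any past analyst query."""
    record = Counter(tokenize(record_text))
    return max(
        (cosine(record, Counter(tokenize(q))) for q in analyst_queries),
        default=0.0,
    )


def ingest(raw_records, analyst_queries):
    """Extract, transform, analyze, load: score each record before storing it."""
    store = []
    for raw in raw_records:                        # extract
        text = " ".join(raw.split())               # transform: normalize whitespace
        score = relevance(text, analyst_queries)   # analyze: contextual score (assumed metric)
        store.append({"text": text, "score": score})  # load
    # Present the highest-scoring records first.
    store.sort(key=lambda r: r["score"], reverse=True)
    return store


queries = ["port shipment schedule"]
records = ["Weather   report for Tuesday", "Shipment schedule for the port"]
store = ingest(records, queries)
```

Because the score is attached before the record is stored, a later query can sort or filter on it directly instead of re-reading every record.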

Relevance scoring adds an important innovation to this process. As an organization generates reports, stores internal documents and sends internal emails, the contextual scoring process can automatically evaluate this information to determine the interests of the organization. The system performs statistical and semantic analysis on structured and unstructured data to form the organizational knowledge "dialogue." The output of this dialogue feeds into the analysis process when new data comes in. Similar to popular retail websites that suggest new products to shoppers, if the data is important to the mission, a message (such as, "This new information is related to items in which you have already shown interest.") will pop up on the analyst's screen.
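The feedback loop in this paragraph, where the organization's own documents shape how new data is scored, might look roughly like the sketch below. It is an assumption-laden illustration: the `InterestProfile` class, the `alert` function and the 0.3 threshold are hypothetical, and simple word counts stand in for the statistical and semantic analysis the article describes.

```python
import math
import re
from collections import Counter


class InterestProfile:
    """Running term counts drawn from an organization's reports, emails and queries."""

    def __init__(self):
        self.terms = Counter()

    def learn(self, text):
        """Fold one internal document or query into the profile."""
        self.terms.update(re.findall(r"[a-z]+", text.lower()))

    def score(self, text):
        """Cosine similarity (0-1) between incoming text and the profile."""
        doc = Counter(re.findall(r"[a-z]+", text.lower()))
        dot = sum(count * self.terms[term] for term, count in doc.items())
        norm = math.sqrt(sum(v * v for v in doc.values())) * math.sqrt(
            sum(v * v for v in self.terms.values())
        )
        return min(1.0, dot / norm) if norm else 0.0


def alert(profile, text, threshold=0.3):
    """Return a notification if the text scores at or above the (assumed) threshold."""
    if profile.score(text) >= threshold:
        return ("This new information is related to items in which "
                "you have already shown interest.")
    return None


profile = InterestProfile()
incoming = "cargo vessel arrived at the harbor"
before = profile.score(incoming)   # nothing learned yet
profile.learn("Report on cargo vessel movements near the harbor")
after = profile.score(incoming)    # the profile now shares terms with the text
```

Each call to `learn` shifts the profile, so the same incoming text can score differently as the organization's interests change.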

As a result, analysts are given relevant data. And the relevance becomes more refined over time as the system learns what is of interest to each person, scoring and presenting the data in the most meaningful way. Analysts and other intelligence officers no longer have to search through hundreds or thousands of documents to find the single item they really need.

Adding to the complexity of prioritizing intelligence data is the fact that incoming information is often in languages other than English. Since the number of available linguists and translators is limited, most organizations have to prioritize which data to translate first. By scoring the documents for relevance, a linguist can focus on the most important content. When budgets are tighter than ever, and with the volume of data always growing, it is critical to spend time only on the most pressing matters and get answers quickly.
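As a rough sketch of how relevance scores could drive translation priority, the snippet below picks the highest-scoring non-English documents up to a linguist's capacity. The field names (`id`, `language`, `score`), the `translation_order` function and the sample values are hypothetical and chosen only for illustration.

```python
import heapq


def translation_order(documents, capacity):
    """Return up to `capacity` non-English documents, highest relevance first."""
    pending = [d for d in documents if d["language"] != "en"]
    return heapq.nlargest(capacity, pending, key=lambda d: d["score"])


documents = [
    {"id": "a", "language": "en", "score": 0.9},
    {"id": "b", "language": "ar", "score": 0.2},
    {"id": "c", "language": "ru", "score": 0.8},
    {"id": "d", "language": "fa", "score": 0.5},
]
queue = translation_order(documents, capacity=2)
```

Documents that do not make the cut are not discarded; they stay in storage with their scores and can be picked up when capacity allows.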

Conclusion

Intelligence organizations spend millions of dollars storing irrelevant data. For many, the volume and velocity of data outstrip the storage and computing capacity needed for accurate and timely decision making. Simply put, most organizations are not really getting a handle on their information; they are just squeezing more data into the cloud without addressing its real-time priority and relevance to their business operations and workflow.

An enterprisewide approach to storing and retrieving information can tap into the enterprise dialogue and "think" like the people in the organization. A contextual scoring process can organize big data into manageable, contextual and prioritized views for the mission so that decisions are accurate and timely.

Bio: Catherine Clark is an Engagement Manager at SAS Federal, focused on the US Intelligence Community (IC) and the Department of Defense (DoD). She has 20 years of experience solving the unique analytic and big data challenges of this community. Clark was a Solutions Architect and Software Engineer at VSTI.