
January 31, 2007

In the past, when the US Department of Defense (DOD) sensors collected information, this information was first processed (forwarded to systems and people for analysis) then later posted (once the analytic product was finalized). The users (planners, warfighters, etc.) could not access the information until processing was completed and the final product posted.

Waiting for analysis and finished work products took too long – this resulted in varying degrees of strategic and operational error (e.g., bombs dropped in the wrong place). It became obvious this strategy must be changed.

As part of its network-centric warfare strategy, the DOD established the principle of "posting before processing." This means the data is made available to the users at the moment it becomes available. Analysts get this information at the same time and when their work product is finalized, it is then posted as additional information. By posting before processing, users have been able to make real-time operational decisions based on more current, albeit raw and uncorrelated, information. This has been hugely successful.

And while this transformation was well executed and a big step forward, I envision the next giant step forward will involve processing and posting at the same time. This means as theater sensors collect information, this information will be immediately placed into context with the historical data previously collected, previous analytic work products, open source and so on. This rich context will be constructed in real-time and posted in real-time.

From a network-centric warfare point of view – processing and posting at the same time is a form of Perpetual Analytics.

When DOD is able to process and post simultaneously, two very critical operational capabilities will emerge: 1) Persistent Context will be available to the user providing a uniquely comprehensive real-time operational picture, and 2) more importantly, as sensor data arrives and is contextualized, when selected conditions are met the user will be immediately notified of relevance. This second point is paramount because we cannot expect users to ask every question, rather, the data must find the data and the relevance must find the user.

Scenario: A military unit is moving from point A to point B. Along this vector a collection system has just recognized some significant changes (e.g., nine new trucks parked alongside the road or the presence of RF [radio frequency] broadcasting that did not exist at any earlier time). Perpetual analytics can detect such conditions in real time and push such intelligence to the military unit – instantly.
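A minimal sketch of this "relevance finds the user" pattern, under invented assumptions: the observation and subscription types, field names, and the unit/segment identifiers below are all hypothetical, not drawn from any real DOD system.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """A single sensor report, contextualized on arrival."""
    kind: str        # e.g., "vehicles", "rf_emission"
    location: str    # route segment or grid square
    detail: str

@dataclass
class Subscription:
    """A standing interest registered by a user (e.g., a unit on the move)."""
    user: str
    locations: set   # route segments the user cares about
    kinds: set       # observation types considered significant

def ingest(obs: Observation, context: list, subs: list) -> list:
    """Treat each new observation like a query: add it to the persistent
    context, then push it to every user whose standing interest it matches."""
    context.append(obs)
    return [s.user for s in subs
            if obs.location in s.locations and obs.kind in s.kinds]

# A unit moving A -> B registers interest in its route segments.
subs = [Subscription("unit-7", {"seg-3", "seg-4"}, {"vehicles", "rf_emission"})]
context = []
alerts = ingest(Observation("vehicles", "seg-4", "nine new trucks"), context, subs)
print(alerts)  # the relevance finds the user: ['unit-7']
```

The point of the sketch is the inversion: the user never asks a question; the arriving data is matched against standing interests the instant it is contextualized.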

January 25, 2007

When a key piece of data changes in the enterprise, one must first treat this new data like a query (i.e., what does this new data mean in relation to what the enterprise already knows). And if new data is not treated first like a query, one will never know if this new information matters unless someone asks. I often refer to this notion as Perpetual Analytics – a world where the "data finds the data and the relevance finds the user."

So exactly how would such a system be constructed? Many folks have suggested that this can be solved using "federated queries." Federated queries are solutions that interact with all of the islands of operational, reference and historical data scattered across the enterprise (often leveraging very smart middleware). This approach uses a query to interrogate enterprise data stores in order to gather related records. Think of federated queries as an example of "just-in-time-context."
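As a toy illustration of "just-in-time-context," here is what a federated query amounts to: the same probe is sent to every silo and the answers are merged at query time. The silo names and record layouts below are invented for illustration.

```python
# Three pretend data silos, each with its own records (invented data).
SILOS = {
    "payroll":      [{"name": "J. Smith", "phone": "555-0101"}],
    "reservations": [{"name": "John Smith", "phone": "555-0101"}],
    "orders":       [{"name": "J Smith", "phone": "555-0199"}],
}

def federated_query(field: str, value: str) -> list:
    """Interrogate every data store for matching records and merge the
    answers -- context assembled only at the moment of the query."""
    hits = []
    for silo, records in SILOS.items():
        hits += [dict(r, _source=silo) for r in records if r.get(field) == value]
    return hits

print(federated_query("phone", "555-0101"))
```

Note that every query touches every silo; that cost, paid on each lookup, is what the scalability arguments below turn on.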

If you want to evaluate new information against what the enterprise already knows, federated queries simply don't cut it for most missions. And the greater the number of data silos and queries, the more intractable federated query systems become.

To explain why federated search breaks down at scale, I'll need to get a bit technical here … so if you are not technical, the balance of this post is not for you.

TWO PRIMARY REASONS FEDERATED QUERY SYSTEMS DON'T SCALE

1. Operational systems and their underlying silos were originally designed to handle a specific operational mission. And the larger these systems, the more constrained their computational cycles. In other words, they do not have the free processing (or disk I/O) cycles to answer hundreds, thousands or millions of additional inquiries a day. Additionally, because operational systems were designed only to handle queries necessary to deliver specific business functionality, they cannot efficiently answer queries that they were not designed to support. This is because they do not have the indexes needed for fast lookup on every relevant field, which in turn necessitates the use of database table scans for record location. (If you are not technical and are still reading this, table scan=very slow.) Let’s take a payroll system for example. Payroll systems are designed to locate employee records based on mission specific fields, e.g., employee ID, name, tax ID, etc. A payroll system will not generally have an index to enable the efficient search on such fields as phone number or address. And if it did have an index to support a search on employee phone number, it would not likely have an index on the phone number of the employee’s emergency contact! This incomplete index problem holds true for most operational systems – from reservation systems, to sales and order entry systems, to accounts payable systems, and so on. In short, most operational systems cannot answer the necessary queries, or in any case, not quickly.
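The incomplete-index problem in point 1 can be seen directly with an in-memory SQLite database. The schema is invented (real payroll systems differ), but the query planner's behavior is the point: an indexed field gets a fast index search, an unindexed field forces a full table scan.

```python
import sqlite3

# A pretend payroll table: indexed on name, not on phone (invented schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, "
            "name TEXT, phone TEXT)")
con.execute("CREATE INDEX idx_name ON employees(name)")

def plan(sql: str) -> str:
    """Return SQLite's query plan for a statement."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Indexed field: the planner uses idx_name for a fast lookup.
print(plan("SELECT * FROM employees WHERE name = 'J. Smith'"))
# Unindexed field: the planner falls back to scanning every row.
print(plan("SELECT * FROM employees WHERE phone = '555-0101'"))
```

On tables with millions of rows, that second plan is exactly the "table scan=very slow" case described above.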

2. Even if all of the operational systems could answer all the queries quickly, there is a secondary scalability problem that necessitates recursive processing. This is easiest to explain by example. If one performs a federated query to discover enterprise records related to a specific person – say starting with a specific person’s name and date of birth – should the federated query return some new attributes for this person, e.g., a few addresses and phone numbers – one has just learned something. To be thorough one must take what one has learned about this person and perform another enterprise-wide federated query in case there are some additional records that can now be located based on the new data points. Now, what if during this second federated query another address, a few more ways to spell the name, and an aka or two are discovered. To be thorough, each time something is learned that might enable the discovery of a previously missed record, the process must perform another federated query. I have seen this at scale where the organization had something like 2,000 internal data sets, all tethered together with very smart middleware. Their recursive process had an artificial time limit at which point it would abandon additional attempts to locate the remaining records even though there were possibly more records in the enterprise for the same person!
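The recursion in point 2 can be sketched in a few lines: every newly learned attribute (address, phone, alias) becomes another federated probe, repeated until no new records turn up. The silo contents below are invented; note that the third record is only reachable after two rounds of re-querying.

```python
# Pretend silos (invented data): each record links to the next only
# through an attribute learned from the previous round of querying.
SILOS = [
    [{"name": "J. Smith", "dob": "1970-01-01", "phone": "555-0101"}],
    [{"phone": "555-0101", "address": "12 Elm St"}],
    [{"address": "12 Elm St", "alias": "Johnny S."}],
]

def probe(key: str, value: str) -> list:
    """One federated query: ask every silo for records with this attribute."""
    return [r for silo in SILOS for r in silo if r.get(key) == value]

def resolve(seed: dict) -> list:
    """Recursively re-query on every newly learned attribute until no
    further records are found (a fixpoint)."""
    found, frontier = [], list(seed.items())
    seen_probes, seen_records = set(), set()
    while frontier:
        key, value = frontier.pop()
        if (key, value) in seen_probes:
            continue
        seen_probes.add((key, value))
        for rec in probe(key, value):
            rec_id = tuple(sorted(rec.items()))
            if rec_id in seen_records:
                continue
            seen_records.add(rec_id)
            found.append(rec)
            frontier.extend(rec.items())  # everything learned becomes a new probe
    return found

records = resolve({"name": "J. Smith", "dob": "1970-01-01"})
print(len(records))  # 3 -- the alias record is reached only via recursion
```

With three tiny silos the fixpoint arrives instantly; with 2,000 silos, each probe is an enterprise-wide round trip, which is why real systems impose the artificial time limits described above.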

These two points make federated query systems challenging at scale.

Now imagine perpetual analytics where every new piece of key enterprise data is first treated like a query. How exactly is one going to use this federated approach at the scale of hundreds or thousands of queries a second? Having observed the behavior of such systems up close, I say scalable intelligent systems cannot be achieved via federated query. Those attempting to enable enterprise discovery or enterprise intelligence through a federated query solution will very likely come up short despite Herculean investment.

So if federated search does not answer the mail, what does? You guessed it: Persistent Context. Persistent context solves the scalability and accuracy challenges associated with trying to assemble context just-in-time using federated queries.

Perpetual Analytics requires persistent context. And persistent context is all about the librarian and the central index (catalog, directory or whatever you want to call this thing).

Persistent context enables instant, enterprise-wide discovery. And discovery enables the essential federating activity – "federated fetch." Simply speaking, once one finds related records, Source Attribution is used to determine where the records are physically located. One then fetches specific records from specific data stores in a federated manner. This form of federation scales.
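A sketch of how discovery plus "federated fetch" might look, under invented assumptions: the central index maps key attributes to source-attributed record locators, so discovery is a single index lookup and only the specific records are then pulled from their home systems. None of these names reflect any real product's API.

```python
# The persistent central index (the librarian's catalog): each key
# attribute maps to source-attributed record locators (invented data).
CENTRAL_INDEX = {
    "555-0101": [("payroll", "emp-42"), ("reservations", "res-9")],
    "J. Smith": [("payroll", "emp-42")],
}

# The operational systems themselves, addressed by record id.
SOURCES = {
    "payroll":      {"emp-42": {"name": "J. Smith", "phone": "555-0101"}},
    "reservations": {"res-9":  {"guest": "John Smith", "phone": "555-0101"}},
}

def discover(key: str) -> list:
    """Instant enterprise-wide discovery: one lookup in the persistent index."""
    return CENTRAL_INDEX.get(key, [])

def federated_fetch(locators: list) -> list:
    """Use source attribution to pull the specific records from the specific
    systems that hold them -- no enterprise-wide interrogation required."""
    return [SOURCES[system][rec_id] for system, rec_id in locators]

print(federated_fetch(discover("555-0101")))
```

The contrast with federated query is the cost model: the operational systems are touched only for records already known to exist, which is why this form of federation scales.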

Whether an enterprise is interested in improving its use of disparate information assets to improve health care outcomes, better serve customers, fight fraud or protect the country, solutions involving persistent context are, I think, how it will have to be done at the end of the day.

[One final technical point: Even if the operational systems expose their metadata in a fully cross-referenced index (e.g., a specialized search/discovery "appliance") to solve the missing index problem, the recursive costs to construct just-in-time context (each time new information is discovered) still make federated queries an unattractive approach. To boot, there are a few other incremental risks associated with using externalized indexes conjoined to each operational system. If you care to discuss this point, drop me an email.]

January 21, 2007

Wednesday, January 10th, 2007, the Senate Judiciary Committee held a hearing entitled "Balancing Privacy and Security: The Privacy Implications of Government Data Mining Programs."

This session again proved that what data mining means depends on whom you ask. And, as such, this poses a real problem for those trying to have a rational conversation on the subject. And I worry that if lawmakers get this wrong … poor laws will follow.

Jim Harper of Cato Institute submitted written testimony, which referenced the paper he and I recently released titled "Effective Counterterrorism and the Limited Role of Predictive Data Mining." Our paper was intended not to describe data mining at large; rather, we selected the term "Predictive Data Mining" to describe a certain kind of data mining, specifically "…the process of searching data for previously unknown patterns and often using these patterns to predict future outcomes." As our paper posits – using machines to find hidden patterns based on historical data is not useful in the context of terrorism when there are so few terrorist incidents from which to draw. We could have just as easily called this "Data Mining for Predictive Patterns."

Kim Taipale, the executive director of the Center for Advanced Studies in Science and Technology Policy, submitted this written testimony. Kim argues that, broadly speaking, data mining is any automated analysis of information that reveals output that otherwise would "remain unnoticed using traditional manual means of investigation." Therefore, data mining is "simply a productivity tool that when properly employed can increase human analytic capacity and make better use of limited security resources." This definition includes link-based analysis (e.g., who’s talking to whom, who’s financing whom, etc.), pattern-based analysis (e.g., anticipated signatures of terrorist planning) and predicate-based analysis (e.g., higher interest in those who graduated from Afghanistan terror training camps). He also goes on to say "… patterns can be inferred from lower-level precursor activity – for example, illegal immigration, identity theft … attendance in training camps, targeting and surveillance activity…."

Leslie Harris, the executive director of the Center for Democracy and Technology, submitted this written testimony. Leslie chose this definition for data mining: "use of computer tools to extract useful knowledge from large sets of data." Leslie differentiates data mining into two categories: pattern-based data mining and subject-based data mining. Pattern-based is then described as data mining "which seeks to find a pattern, anomaly or signature among oceans of personal transactional data." Subject-based data mining is described as a form "which seeks information about a particular individual who is already under suspicion." Her testimony goes on to say, "As a general matter, the value of subject-based approaches is more readily apparent, and there are fewer privacy concerns associated with data searches that begin with particularized suspicion."

Robert Barr, the executive director of Liberty Strategies, submitted this written testimony in which he expresses concern over various government programs and notes that "Data mining presents many serious threats to the First, Second, Fourth and Fifth Amendments to the Constitution." Although data mining is not defined by his testimony, it appears (based on the programs he mentions) that he uses the term "data mining" to mean any effort by the government to access and/or collect data.

Dr. James Carafano, a senior research fellow at the Heritage Foundation specializing in national security, defense and counterterrorism, submitted this written testimony. In part he writes, "Because technology is going to be an important part of any set of counterterrorism tools, and because our lives in the information age are so dependent on many of the systems and databases in which these technologies will look for information about terrorists, we also need a set of rules to guide how we implement the basic principles of long-war fighting in the electronic world." And while this testimony does not attempt to define data mining, neither does it imply a broad nor narrow definition.

Despite the fact there is no agreement on what data mining means, I cannot help but notice a high degree of consensus (e.g., watch listing, link analysis and predicate-based analytics can be useful and are less invasive). In any case, when the government starts writing data mining laws … these things come to my mind:

1. We should be talking about authorization, oversight and accountability related to programs involving U.S. persons only. There is much less concern with respect to analytics and information collected abroad (unrelated to U.S. persons). New data mining policy related to program disclosure that does not differentiate between U.S. and non-U.S. persons would be a huge mistake.

2. Data mining has many valuable uses at both the aggregate and person-centric level in areas outside of counter-terrorism. For example, healthcare research, bio-surveillance, benchmarking efficacy of various educational programs, and so on. Any government policy stating data mining should only be used in support of counter-terrorism would also be a huge mistake.

3. And finally, any policy that emerges that regulates data mining or mandates reporting had better define it. Because under one definition of data mining, even something as simple as using a computer to look up your name on a reservation list (e.g., at the hotel during check-in) is considered data mining. So if this type of activity gets added to the data mining reporting requirements, those in charge of monitoring data mining programs will have to sift through so many reports (i.e., false positives) they may never find or have time to appropriately respond to the programs that are more problematic.

By the way, the debate about which data sets the government can peer into is another debate – an important debate – but (in my opinion) not a data mining debate.

And for the record, in my opinion, at least in relation to programs designed to target specific people: predicting which people should be targeted for additional scrutiny or action should not be based on machine-discovered patterns when so little historical training data exists. However, this method does become useful when starting with qualified predicates (e.g., subjects who attended terrorist training camps). This can materially help organizations and governments focus their finite investigatory resources.

Furthermore, I don’t know of a single federal program that is attempting to detect hidden terrorist patterns using historical terrorist event data. Either such programs are hidden from sight or, more likely, these organizations already recognize that there are better ways to attack the counter-terrorism mission.

January 14, 2007

Apparently, your brain creates a very specific electrical response, known as P300, when you are presented with information that is already contained in your mind. If you recognize the information (i.e., it is familiar to you), you will have a P300 response. There is no way to avoid this; it is a biological/electrical stimulus-response event. Sort of like a lie detector, only (reportedly) always accurate.

Think of this as Mind-Reading 1.0.

Question: "Did you murder John Doe?"

Answer: "No."

Question: "Have you ever been inside this house?" [While presenting a picture of the front of a house]

Answer: "No."

Question: "Did you commit this murder?" [While presenting a picture of the murder scene, which took place in the bedroom]

Answer "No."

"Sir, you are under arrest – you had a P300 response to the murder scene."

Maybe you are feeling some comfort knowing this is not being accepted in the courtroom. Think again. P300 is already being used in court as admissible evidence by both defense and prosecuting attorneys.

Now what happens when science improves and Mind-Reading 2.0 is available? Maybe this new technology will answer the question "Have you ever had this thought before?"

Or, better yet … Dr. Joyce Brothers reportedly once said, "My husband and I have never thought about divorce – murder, perhaps, but not divorce."

Now juxtapose this emerging technology with how our courts have interpreted our rights against unreasonable searches and seizures guaranteed in the Fourth Amendment of our Constitution. For example, the Supreme Court long ago ruled (Smith v. Maryland) that we have no "reasonable expectation of privacy" in the telephone numbers we dial.

Now that the technology exists to "read" P300 responses, I have to wonder where we are going to draw the line.

Is it reasonable to believe that what is going on in your skull is private? Will our Fifth Amendment right against self-incrimination protect us from the Mind-Reading 2.0? Is it possible the Supreme Court could someday rule that certain brain activity is not private?

If this seems far-fetched, let me share one plausible journey that might just make this true. The Court has held (Katz v. United States) that an expectation of privacy is not "reasonable" unless both: (1) a person can claim "a legitimate expectation of privacy" over a particular type of information; and (2) this expectation is one that society is prepared to recognize as "reasonable." And, of course, what society sees as "reasonable" changes over time, and courts can change their interpretation of reasonableness as time and technology march on.

On this principle, when high school science projects make mind-reading devices and disposable units can be bought at Walmart in twelve-packs … will some court then assert that we should no longer have an expectation of privacy over our thoughts?

January 09, 2007

I travel a lot. So learning the little nuances of each airport I frequent comes in very handy. I have been able to pull off little stunts like driving five miles, parking my car, getting my boarding pass at a kiosk, clearing security and getting to my gate in 17 minutes. Of course, this takes a lot of luck, too. One long line and poof … you miss a flight, especially when allowing less than 45 minutes between airport arrival and the plane’s departure.

As I mentioned in my op-ed, I planned on taking the plunge and therefore enrolled in the Clear Registered Traveler program yesterday. Handing over my biometrics and subjecting myself to a background check will be well worth shorter airport lines, especially if this program turns up at airports I actually frequent.

Before I made this leap, for once I read the Privacy Policy. It seems they have done a decent job communicating their intentions and protections. Obviously if you have an outstanding criminal warrant, this would not be a good program for you!

When I think about this program and programs like this – while there is some additional risk introduced anytime one reveals more of his/her personal information – the other [big] risk is what happens when a good guy who gets the "Approved Security Threat Assessment" stamp from the Transportation Security Administration decides to become a bad guy later.

When good guys go bad (the "insider threat") is a nasty problem. And while this program has wisely opted for a perpetual credentialing process, something I strongly advocate, detecting the lone gunman scenario, where the bad guy works alone, is a very hard problem. I hope the perpetual credentialing program they have in place is top-notch.

Nonetheless, I’m eager to have as many airport shortcuts as I can get my hands on!