A New Data Mantra: Capture Everything

In conversations with insurers globally, we at Celent are hearing of a new approach to analytics. It isn't called big data, but it shares a similar spirit: leverage data far more quickly and be more tolerant of the errors in that data. A shift toward the idea that all data is useful is under way, but in baby steps so far; I still often hear about truth, fact and consistent data in these discussions.

When thinking about data, this idea of truth has always bothered me: the idea that system data represents the facts, the unassailable truth. One of the key activities in establishing classic analytics processes is deciding which data is the truth, and there are always arguments about which data is accurate and can be trusted. In this process, inaccurate data (or data that doesn't contribute to this truth) is ignored, removed or lost. The result is a negotiation, the output of which is often called the single version of the truth: a report that all stakeholders agree to. The strange thing about this process is that it acknowledges multiple viewpoints, yet seeks a single truth regardless. Relational database design and modern user interfaces push us toward this line of thinking; there is only one field to fill in, one answer to each question, after all.

I suggest that there is value in capturing the half-truths and the outright lies, and technology now lets us analyze them semantically.

It’s easy to come up with examples from the insurance industry where we regularly accept that data is likely flawed. For instance, the original quote data says the vehicle is a standard build, but the claims adjuster spots the alloy wheels and rear parking sensors. In many motor claims, the insured makes a statement that there was a crash and the other driver was in error; the other driver makes a similar statement, saying that there was a crash and the insured was at fault. Most modern systems capture all of this data, the different views over time and from different stakeholders, but most systems and processes still assume that at a given point there is one set of valid data, one driver at fault.

Now that customers are posting to social media, insurers face more questions: What if what an insured stated at the time of purchase is contradicted in their Facebook profile? Was that tweet accurate or just posturing on the part of the customer? How should the insurer, or rather the automated systems analyzing this data, treat these contrary positions?

There is factual data that is true: the fact that the witness statements were made, the date and time when they were captured, who made them and regarding which case. What of the pertinent data, though, the data the humans actually use in determining the case or what should be done next, the data that allows us to reason about the case and to make a judgement? This information is typically stored in free-text formats, requiring humans to interpret the data and do what humans do well: establish hypotheses and test them, ultimately selecting the one they feel fits best and recording that result as fact. For example, it’s a fact that Bob, the claims handler, felt on Jan. 1, 2012 that the insured wasn’t at fault, but is that what is recorded? Or is it just the assertion that the insured wasn’t at fault, recorded as the truth rather than as a hypothesis with an audit trail to who updated the system?
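To make the distinction concrete, one way to preserve both the facts and the judgements is an append-only log of assertions with provenance, so Bob's view is stored as a dated hypothesis rather than overwriting a field. This is a minimal sketch, assuming a Python data model of my own invention; the names (`Assertion`, the case ID, the `kind` labels) are illustrative, not from any real claims system.

```python
from dataclasses import dataclass
from datetime import date

# Each statement is stored as an assertion with provenance
# (who, when, about which case), never overwriting earlier views.
@dataclass(frozen=True)
class Assertion:
    case_id: str    # which claim the statement concerns
    author: str     # who made or recorded the statement
    made_on: date   # when it was captured
    claim: str      # the content of the statement
    kind: str       # "statement" or "hypothesis", not bare "truth"

# Append-only log: Bob's conclusion is a dated hypothesis,
# sitting alongside the contradictory statements it interprets.
log = [
    Assertion("CLM-042", "insured", date(2011, 12, 28),
              "other driver at fault", "statement"),
    Assertion("CLM-042", "other driver", date(2011, 12, 29),
              "insured at fault", "statement"),
    Assertion("CLM-042", "Bob (claims handler)", date(2012, 1, 1),
              "insured not at fault", "hypothesis"),
]

# Later systems can reason over all recorded views of the case,
# instead of seeing only one overwritten "at fault" field.
views = [(a.author, a.claim) for a in log if a.case_id == "CLM-042"]
```

The design choice is the point: the system records that an assertion was made, by whom and when, and keeps the contradictions, rather than collapsing them into a single field.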

If the big data movement has taught us anything, along with the exploits of Google, Amazon, etc., it is that all data is useful. Capture everything.

Why, you ask? One example: there exist algorithms and systems that enable the analysis of competing hypotheses, capturing how credible or likely an assertion is based on the believability of the source of the underlying data. What if your system could highlight how plausible the insured’s data or statement is, or a witness’s testimony, or data from a third party, based on the information at hand? What if your core system presented options rather than an answer derived by assuming everything in the system is correct?
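As a toy illustration of the idea, not any particular vendor's algorithm, one could weight each competing hypothesis by the assumed believability of its supporting sources and present the ranked options rather than a single answer. The sources, credibility numbers and the "at least one reliable source" scoring rule below are all invented assumptions for the sketch.

```python
# Assumed credibility of each source, between 0 and 1
# (illustrative numbers only, not calibrated values).
credibility = {
    "insured": 0.5,              # interested party
    "other driver": 0.5,         # interested party
    "independent witness": 0.8,
    "telematics data": 0.9,      # third-party sensor data
}

# Which sources support each competing hypothesis.
support = {
    "insured at fault": ["other driver"],
    "other driver at fault": ["insured", "independent witness",
                              "telematics data"],
}

def plausibility(sources):
    # Probability that at least one supporting source is reliable:
    # 1 minus the product of (1 - credibility) over the sources.
    p = 1.0
    for s in sources:
        p *= 1.0 - credibility[s]
    return 1.0 - p

# Present ranked options instead of a single "truth".
ranked = sorted(((h, plausibility(srcs)) for h, srcs in support.items()),
                key=lambda hp: hp[1], reverse=True)
```

Here the system's output is a ranked list of hypotheses with plausibility scores, leaving the final judgement, and its audit trail, to the handler.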

Truth, then, is not something best derived from raw data after the fact, but rather something that requires consideration as the data is being collected. Data, knowledge and information collected in the right way will allow future systems to help insurer staff reason about the data and be more effective. The insurance industry is, however, sitting on a gold mine of raw data, and as it starts to mine that data for new insights, I suggest insurers seek new models to better understand the knowledge therein.

The insurers that emerge as leaders will capture all the data they can, understand that some of that data is contradictory, and model it in such a way that software can support decisions about the data rather than leaving the grey areas to the human operators.

What’s your view: Is there a single version of the truth in insurance? How are you dealing with contrary data? Have you already solved this?