Big Data lets insurers mine for fraud throughout claims cycle

By James Ruotolo
June 18, 2015

Credit and banking industries offer best practices for fraud fighters

Abstract: The methods for rooting out the estimated $80 billion in yearly insurance fraud have stayed largely the same since insurers drafted the first red flags many years ago. While modern fraud-detection systems have helped, the process is still labor-intensive and doesn’t stop enough claims-payout leakage. Simple business rules applied to the underwriting and claims process are as likely to churn out false positives as to detect true fraud. But the advent of so-called Big Data and new computing techniques is changing that. New storage and analysis techniques can comb structured and unstructured data to find previously unseen links between fraudsters and patterns that traditional methods can’t detect. Drawing on advances in other industries, insurers can use more-robust data to mine claim notes and social media at all stages of the insurance life cycle.

The Coalition Against Insurance Fraud published the second edition of its “State of Insurance Fraud Technology” survey in September 2014. Insurers have widely adopted technology to help thwart insurance fraud, the study reveals in one key finding.

In fact, 95 percent of responding insurers said they use some form of fraud-detection technology. That is up from 88 percent two years earlier. We finally have reached a point where nearly all major insurers use some form of technology or automation to help ferret out suspicious claims.

But these numbers tell only part of the story. If we dig into the details, we discover a much more complex situation. Most insurers use automated business rules as part of their technology platform, the Coalition study says. But fewer than half use more-advanced techniques such as predictive modeling, text mining or anomaly detection. Grappling with so-called Big Data is a key reason for this lag.

Complex Big Data sets

Big Data doesn’t just mean lots of records. Big Data is defined as data sets so large or complex that gaining anti-fraud insights becomes challenging. That certainly describes the amount and types of data that insurers gather in an environment that is growing increasingly data-rich and will continue to expand for all insurers.

From the initial information collected on an application to the detailed notes, photos and videos an adjuster takes when a claim is filed, even relatively small insurers amass large volumes of internal data. Then layer in outside data from police reports, medical records, fraud bureau lists, social media and telemetry data. Insurers have hit the trifecta that further defines Big Data: volume, velocity and variety.

A lot of data (volume) is flowing into insurance companies very quickly (velocity), and its variety (text, notes and video) is difficult to store and analyze. For most insurers, all three factors — volume, velocity and variety — are increasing at unprecedented rates, further complicating fraud detection efforts.

The data often arrives and just sits there, despite its potential for valuable insight and context in anti-fraud operations. Or the data is discarded. Vehicle telematics (i.e., black boxes), for instance, produce an enormous volume of content faster than insurers can consume or analyze it.

Databases grow exponentially, but investigators lack a simple way to integrate the data into their investigation process. Instead, investigators ping the database for each investigation. Staged-crash ring members might use the same words to describe their fake neck injuries when interviewed, but that information languishes in individual case files that aren’t compared against one another unless an investigator senses a pattern.

Investigators see this firsthand through the introduction of link analysis in the fraud investigation process. At first, link analysis was applied to individual cases to map the relationships among entities. This helped identify large-scale organized fraud rings.

But more-advanced social-network analysis automates this linking and can grab entities from across huge swaths of data to identify relationships and patterns that are otherwise difficult or impossible to recognize. Yet linking entities alone is not enough. To successfully wrangle Big Data, insurers must extract insights and prioritize actionable work — such as suspicious claims alerts. Predictive modeling is one way to achieve this.
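The automated linking described above can be pictured as a graph problem: claims that share an identifying attribute, directly or through a chain of other claims, fall into the same cluster. What follows is a minimal sketch under stated assumptions, not any vendor's method. The claim records, the field names (`phone`, `address`, `provider`) and the `link_claims` helper are all hypothetical.

```python
from collections import defaultdict

def link_claims(claims, keys=("phone", "address", "provider")):
    """Group claim IDs that share any attribute value, directly or transitively."""
    parent = {c["id"]: c["id"] for c in claims}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # (field, value) -> first claim ID carrying that value
    for c in claims:
        for k in keys:
            v = c.get(k)
            if not v:
                continue
            if (k, v) in first_seen:
                union(c["id"], first_seen[(k, v)])
            else:
                first_seen[(k, v)] = c["id"]

    clusters = defaultdict(set)
    for cid in parent:
        clusters[find(cid)].add(cid)
    # Only clusters with more than one claim are of interest to investigators.
    return sorted(sorted(members) for members in clusters.values() if len(members) > 1)

# Hypothetical claims: C1 and C2 share an address, C2 and C3 share a phone.
claims = [
    {"id": "C1", "phone": "555-0101", "address": "12 Elm St", "provider": "P9"},
    {"id": "C2", "phone": "555-0199", "address": "12 Elm St", "provider": "P3"},
    {"id": "C3", "phone": "555-0199", "address": "80 Oak Ave", "provider": "P4"},
    {"id": "C4", "phone": "555-0142", "address": "7 Pine Rd", "provider": "P5"},
]
rings = link_claims(claims)
```

Here C1 and C3 have nothing in common directly, yet they land in one cluster because each is tied to C2. That transitive connection is the kind of relationship that is hard to spot when cases are reviewed one at a time.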

Thwarting credit-card thieves

Another industry beset by fraud seemed to provide a new approach about a decade ago. Financial firms were beating back credit-card fraud with great success.

Despite the volume of credit card transactions and high velocity of their arrival, banks successfully harnessed predictive models to catch fraudsters in the act — in real time — and shut down their accounts before fraudulent charges were completed.

Banks aggregated payment data and trained predictive models to identify transactions likely to be fraudulent. If your credit card company has called to ask whether you were buying a laptop at a store hundreds of miles from your house, you understand how this works.

The Big Data harnessed for credit-card fraud models has very high volume and velocity — many transactions accumulate quickly — but practically no variety. The information is uniform across the entire industry. There are limited variables to consider, enormous amounts of accurate fraudulent history to draw from, and no unstructured text such as the adjuster field notes that insurers accumulate.

Early struggles with tech

Meanwhile, insurers worked with basic rules-based technology to help uncover potentially fraudulent claims and sketchy insurance-policy applications. They tested each transaction against a predefined set of algorithms or business rules to detect known types of fraud based on specific patterns of activity.

These systems flag claims that look suspicious due to their aggregate scores or relation to threshold values. Eighty percent of insurers use this technology today, though all rely to some extent on policy underwriters and claim adjusters to manually identify fraudulent scenarios.
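A rules engine of this kind can be sketched in a few lines. The example below is illustrative only: the four rules, their point values and the `THRESHOLD` cutoff are assumptions made up for the sketch, not figures from the Coalition study or from any insurer's system.

```python
# Hypothetical red-flag rules: (description, predicate, points).
RULES = [
    ("claim filed within 30 days of policy start",
     lambda c: c["days_since_policy_start"] <= 30, 30),
    ("no police report on file",
     lambda c: not c["police_report"], 20),
    ("three or more prior claims",
     lambda c: c["prior_claims"] >= 3, 25),
    ("claimed amount above 20,000",
     lambda c: c["amount"] > 20_000, 15),
]
THRESHOLD = 50  # assumed cutoff; real systems tune this value

def score_claim(claim):
    """Return the aggregate score and the descriptions of the rules that fired."""
    fired = [(name, points) for name, predicate, points in RULES if predicate(claim)]
    return sum(points for _, points in fired), [name for name, _ in fired]

def is_flagged(claim):
    """A claim is flagged when its aggregate score reaches the threshold."""
    score, _ = score_claim(claim)
    return score >= THRESHOLD

claim = {"days_since_policy_start": 12, "police_report": False,
         "prior_claims": 1, "amount": 8_500}
score, reasons = score_claim(claim)

# The same claim filed on day 31 no longer trips the first rule.
later_claim = dict(claim, days_since_policy_start=31)
```

Because each rule is a fixed test, shifting a single value (filing on day 31 rather than day 12) is enough to drop this claim below the cutoff.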

Sophisticated fraudsters and ordinary customers know how to work around those rules. Dishonest vehicle owners apply for insurance illicitly using a rural vacation home as their primary residence because they know the premiums are lower in that locale. Criminal fraud rings also know the red flags and work to fly under the radar.

Rules-based technology thus has slowly been losing its effectiveness for several years.

Likewise, link analysis was great for analyzing relationships in large organized fraud cases. But the process was cumbersome, manual and required a fair amount of technical skill to manipulate charts. The process was a great improvement over the old manual method of corkboard and pushpins, but it didn’t scale to accommodate true Big Data.

The credit card industry’s success against scams emboldened insurers and technology companies to turn to predictive modeling. But the technology that dealt with the volume and velocity of credit card transactions doesn’t work on insurance fraud for one very big reason: variety.

Insurers gather a lot of data, though much of it is text-based. And so is outside data from some medical records, police reports, fraud bureaus and social media. Even building the anomaly models to run against claims initially failed because claims fraud happens in more ways than credit card fraud. There also are fewer claims than credit card purchases. Early predictive models just didn’t work because insurers failed to account for data variety.

The Coalition survey provides some context and hints at the early challenges insurers face when trying to use predictive modeling for fraud detection. While limited IT resources are a challenge for any insurer anti-fraud technology project, the other top challenges for implementing fraud-detection technology are excessive false positives and data quality.

One of the biggest challenges for insurers involves accessing the right data sources. Efforts often are hampered by:

Third-party vendor systems that house critical data such as medical billing information (but aren’t easy to access or integrate);

Unstructured text fields from data such as claim notes;

Hand-keyed data fields often containing spelling and transposition errors; and

Daily flow from high-volume sources such as social media and mobile devices.

Even if these sources are accessible, insurers must deal with several major data-quality challenges. By some estimates, up to 80 percent of insurer data comes in unstructured text format. Modelers need to extract useful information from text sources such as claim notes to gain fraud-related insights from this valuable content.

Yet insurers often lack a single unique identifier for a person across their various systems. Data-quality routines and entity resolution are critical to solving this problem.
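Entity resolution can take many forms. One simple version, sketched below, cleans up hand-keyed names and then compares them with a similarity score from Python's standard `difflib` module. The record layout, the `same_entity` helper and the 0.85 threshold are assumptions for illustration. A production routine would weigh more fields than a name and a birth date.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(name.split())

def same_entity(a, b, threshold=0.85):
    """Treat two records as one person if birth dates agree and names are close."""
    if a["dob"] != b["dob"]:
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold

# Hypothetical records from two systems that share no common identifier.
policy_rec = {"name": "Jonathan  Smith", "dob": "1980-04-02"}
claim_rec = {"name": "Jonathon Smith.", "dob": "1980-04-02"}
other_rec = {"name": "Joan Smythe", "dob": "1980-04-02"}

match_typo = same_entity(policy_rec, claim_rec)    # spelling variant of one name
match_other = same_entity(policy_rec, other_rec)   # a different name
```

The first pair differs by one keystroke and some stray punctuation, so it clears the threshold. The second pair shares a birth date but not a close enough name, so it stays separate.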

Without addressing these major data-quality concerns, any model is likely to produce too many false positives.

As the Coalition survey suggests, excessive false positives are a major challenge of implementing anti-fraud analytics technology.

Advanced fraud-detection methods like anomaly detection, predictive modeling and social-network analysis also are only as good as the data quality. You will get poor results if you feed poor-quality data into these applications. The best detection techniques in the world are not a replacement for good data management.

New technology emerges

New technologies, or enhancements of older technologies, help make Big Data fraud analytics a reality for insurers.

Text analysis. In many insurance-fraud detection projects, from one-third to one-half of variables used in the fraud-detection model come from unstructured text information.

This is especially useful for long-tail claims such as injury claims, because the best data often is found in claim notes. Text mining is more than just keyword searching. Good text-analytics tools interpret the meaning of the words to provide context. Technology that is adept at processing natural language can help extract variables from the unstructured text that can be used for further fraud modeling.
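To see why context matters, consider a toy extractor that turns a claim note into model variables and handles one piece of context, negation. It is a deliberately small sketch and falls far short of a commercial text-analytics tool. The concept list, the negation words and the `extract_variables` helper are hypothetical.

```python
import re

# Hypothetical concepts and the phrases that signal them in adjuster notes.
CONCEPTS = {
    "attorney_involved": r"\b(attorney|lawyer|counsel)\b",
    "soft_tissue_injury": r"\b(whiplash|neck pain|soft[- ]tissue|back strain)\b",
    "police_report": r"\bpolice report\b",
}
NEGATION = re.compile(r"\b(no|not|without|denies|denied)\b")

def extract_variables(note):
    """Turn a free-text note into binary model variables, sentence by sentence."""
    features = {name: 0 for name in CONCEPTS}
    for sentence in re.split(r"[.!?]+", note.lower()):
        for name, pattern in CONCEPTS.items():
            match = re.search(pattern, sentence)
            if not match:
                continue
            # A negation word earlier in the same sentence cancels the mention.
            if NEGATION.search(sentence[:match.start()]):
                continue
            features[name] = 1
    return features

note = ("Claimant reports whiplash after low-speed impact. "
        "No police report was filed. Claimant has retained an attorney.")
variables = extract_variables(note)
```

A plain keyword search would count "police report" as present in this note. The sketch records it as absent because the phrase follows "No" in the same sentence.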

Data management. No matter where your data is stored, from legacy systems to the powerful data-storage framework Hadoop, a data management system can help insurers create reusable data rules. These rules provide a standard, repeatable method of improving and integrating data. Ideally, you want a system that connects to multiple data sources. It should have streamlined data federation, migration, synchronization, administration and visual monitoring.

Event stream processing. This helps insurers analyze and process events in motion (i.e., event streams). Instead of storing data and running queries against data, you store the queries and stream the data through them. This is foundational to both real-time fraud detection (refreshing fraud scoring) and effective use of large high-velocity data sources like vehicle telematics.
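The idea of storing the query and streaming the data through it can be shown with a small standing query over a telematics feed. This is a sketch under assumptions: the event fields (`vehicle_id`, `ts`, `accel`), the `HardBrakeQuery` class and its thresholds are invented for illustration and do not describe any particular event stream processing product.

```python
from collections import defaultdict, deque

class HardBrakeQuery:
    """Standing query: alert when a vehicle logs N hard-braking events within a window."""

    def __init__(self, max_events=3, window_seconds=60, decel_threshold=-8.0):
        self.max_events = max_events
        self.window = window_seconds
        self.threshold = decel_threshold  # m/s^2, an assumed cutoff
        self.recent = defaultdict(deque)  # vehicle_id -> timestamps of hard brakes

    def process(self, event):
        """Consume one event and return an alert dict, or None if nothing fires."""
        if event["accel"] > self.threshold:
            return None  # not a hard brake, so nothing is kept
        times = self.recent[event["vehicle_id"]]
        times.append(event["ts"])
        # Drop timestamps that have fallen out of the sliding window.
        while times and event["ts"] - times[0] > self.window:
            times.popleft()
        if len(times) >= self.max_events:
            return {"vehicle_id": event["vehicle_id"], "ts": event["ts"],
                    "count": len(times)}
        return None

def run(stream, queries):
    """Push each incoming event through every registered query as it arrives."""
    for event in stream:
        for query in queries:
            alert = query.process(event)
            if alert:
                yield alert

# Hypothetical telematics events, in arrival order.
stream = [
    {"vehicle_id": "V1", "ts": 0, "accel": -9.1},
    {"vehicle_id": "V1", "ts": 20, "accel": -1.2},
    {"vehicle_id": "V2", "ts": 25, "accel": -8.5},
    {"vehicle_id": "V1", "ts": 30, "accel": -8.4},
    {"vehicle_id": "V1", "ts": 55, "accel": -10.0},
]
alerts = list(run(iter(stream), [HardBrakeQuery()]))
```

The query keeps only the handful of timestamps it needs for its window, so the full feed never has to be stored before it is analyzed.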

Hadoop. A free programming framework that supports the processing of large data sets in a distributed computing environment.

It offers massive data storage and super-fast processing at roughly 5 percent of the cost of traditional, less-flexible databases.

Hadoop’s signature strength is the ability to handle structured and unstructured data (including audio, visual and text), and in large volumes. Insurers either can hire Hadoop specialists to take advantage of the framework or buy products that bridge to existing databases and data warehouses.

This is the foundational technology for creating predictive-analytics models that stay one step ahead of fraudsters and leakage of paid-out claims money. The transaction-monitoring optimization technology used to fight often-sophisticated money laundering uses Hadoop as a core storage and organizing technology. Complex staged-crash rings and medical mills, for example, are deploying increasingly sophisticated methods of laundering money stolen from auto insurers.

In-memory. In-memory analytics is a computing style in which all data used by an application is stored within the main memory of the computing environment. Rather than being accessible on a disk, data remains suspended in the memory of a powerful set of computers.

Multiple users can share this data across multiple applications in a rapid, secure and concurrent manner. In-memory analytics also takes advantage of multi-threading and distributed computing. This means users can distribute the data (and complex workloads that process the data) across multiple machines in a cluster or within a single-server environment.

In-memory analytics deals with queries and data exploration, but also is used with more-complex processes such as predictive analytics, machine learning and text analytics. The kinds of neural-network analytics that help insurers find links among suspects perpetrating premium and claim fraud rely on this type of processing.

Software as a service (SaaS). Until recently, predictive modeling and other high-end analytics were available only to large insurers willing to install the technology onsite. Software-as-a-service has evolved to where even fairly small insurers can take advantage of Big Data analytics. Insurers subscribe to a service run by a vendor rather than pay for the large purchase, installation and maintenance of in-house systems. SaaS also is termed “on-demand software.”

Using Big Data in real world

An insurer with a large book of auto business wanted to better understand and identify its fraud exposure in several no-fault states in which it did business. The insurer was saddled with a lot of internal data that didn’t sync up. The insurer wanted to incorporate third-party data from the National Insurance Crime Bureau and ISO while presenting the information in a user-friendly and prioritized way to analysts and investigators.

The missing data was available for some claims in a different system, but the insurer couldn’t integrate and weave it into its existing data. To achieve this, the insurer generated unique provider ID tags that would sync up provider information — whether from its internal database or medical claims information from outside sources.

For other claims, for which missing data could not be located, insurance-fraud domain experts created heuristic rules (in effect, educated guesses) that leveraged the available variables to enhance the model. Highlighting data-quality issues also allowed the insurer to devise new ways to address these issues in its operating environment. This made future analytics projects easier and more accurate.

The insurer successfully merged internal watch lists with NICB ForeWarn and MedAware alerts to immediately spot more than $2 million in suspicious medical-provider claims it had missed. The company maximized its investment in NICB membership by automating the alerting process. All of the information was presented in an easy-to-use web-based format. From one screen, analysts and investigators could quickly access all the information they needed from multiple sources — claims and policy information, NICB alerts, ISO prior claim history, vehicle data, and medical provider and billing detail. This dramatically decreased the time required to triage an alert.

Once the data management and Big Data analytics efforts were deployed, the insurer accepted three of every four alerts presented, identified more than $5 million in previously unidentified questionable PIP claims, and recorded a 162-percent increase in suspicious provider investigations compared to its old manual process.

New paradigm needed

Forward-thinking insurers that grasp the possibilities of intelligent use of Big Data are convinced there can be a huge return on investment by incorporating more data, and more sources of data.

In the Coalition survey, more than one-third of insurers were skeptical of the ROI of anti-fraud technology in 2012, while only about 10 percent remained concerned in 2014.