How to Mine Big Data like a Pro

For hospitals, mining data effectively isn’t an academic question. Hospitals incurred about $36 billion in uncompensated care costs in 2015, the bulk of it from unpaid patient bills.

One solution to this problem is to limit the costs related to the operation—but how? Machine learning. Hospitals now use predictive analytics to forecast average stays and potential complications for operations such as hip surgery.

Businesses everywhere are seeing similar real-world effects from using machine learning to analyze data. The problem is, they often come up short.

As Mike Gualtieri, an analyst with Forrester Research, noted, machine learning isn’t like traditional business intelligence, in which results are guaranteed. “If you’re looking for a machine learning model you can say ‘I’ll try’ but you may not be able to,” he said. “Companies do have to understand that just because you wished you had a model that predicted the stock market doesn’t mean you’ll have one.”

Rags Raghavendra, head of DXC Technology’s Analytics Data Labs, a global hub of data scientists focused on consulting and finding ways to operationalize analytics, said businesses are frustrated because they’re often taking on too much. “Clients are trying to boil the ocean in terms of trying to extract meaning out of every type of data that they have access to,” he said. “What we recommend is to look at the data that you have and that’s easily accessible and then go on to the next step.”

Companies that have tried and failed to glean insights from data should first of all accept that failure and iteration are part of the process. But they can maximize their odds of success by getting smarter about using machine learning. Here are eight ways to do it:

Begin with the problem you want to solve. Diving straight into data and waiting for insights to jump out at you is the wrong approach. All good data stories start with identifying the right performance metric—one that links a business outcome to a data-centric question. The chosen metric, however, shouldn’t be too broad or too granular. For instance, when DXC recently worked with a media company to explain why subscribers were leaving, the most obvious metric was change in subscriber base. As it turned out, the more relevant metric was average revenue per user (ARPU), which was directly connected to the company’s larger business goal of enhancing revenue.
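A small worked example shows why the choice of metric matters. The figures below are invented for illustration—the article doesn’t disclose the media company’s numbers—but they mimic the pattern described: a subscriber count that barely moves while the revenue signal deteriorates.

```python
# Hypothetical illustration: ARPU can tell a different story than raw
# subscriber counts. All figures here are invented for the sketch.

def arpu(total_revenue, subscribers):
    """Average Revenue Per User = total revenue / number of subscribers."""
    return total_revenue / subscribers

# Quarter 1: 100,000 subscribers generating $2.0M in revenue.
# Quarter 2: the subscriber base barely moved, but revenue fell.
q1 = arpu(2_000_000, 100_000)
q2 = arpu(1_750_000, 98_000)

change_in_subscribers = (98_000 - 100_000) / 100_000  # -2%: looks mild
change_in_arpu = (q2 - q1) / q1                       # about -11%: the real signal

print(f"Subscriber change: {change_in_subscribers:.1%}")
print(f"ARPU change:       {change_in_arpu:.1%}")
```

Tracking only the subscriber count would suggest a minor problem; the per-user revenue metric surfaces a much sharper decline tied directly to the business goal.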

Industrialize the process of machine learning. “This whole process of big data analysis is not industrialized,” said Raghavendra, whose lab supports a wide range of industries, including manufacturing, telecom, automotive, airline, energy, financial services and healthcare. “Many times, you’re repeating the analysis all over again or are unable to scale it.” DXC is a strong proponent of an efficient, simplified approach it calls Industrialized Machine Learning, which holds that every stage of an analysis—from ingesting and cleansing data to building algorithms, putting them into production and generating insights—should be reusable and deployable on enterprise-scale technologies.
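One way to picture the idea of reusable stages is a simple pipeline abstraction, where each step of an analysis is a named, swappable unit that can be re-run on new data. The classes and step names below are our own minimal sketch, not DXC’s actual tooling.

```python
# A dependency-free sketch of the "industrialized" idea: each stage of an
# analysis is a named, reusable step, and the whole chain can be re-run on
# fresh data as a unit. Names are illustrative, not DXC's real platform.

class Step:
    def __init__(self, name, func):
        self.name, self.func = name, func

    def run(self, data):
        return self.func(data)

class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def run(self, data):
        for step in self.steps:
            data = step.run(data)  # each stage feeds the next
        return data

# Stages: cleanse the raw records, then derive an insight (here, an average).
pipeline = Pipeline([
    Step("cleanse", lambda rows: [r for r in rows if r is not None]),
    Step("insight", lambda rows: sum(rows) / len(rows)),
])

print(pipeline.run([10, None, 20, 30]))  # → 20.0
```

The point is structural: because cleansing and modeling are separate, reusable steps, the same pipeline can be pointed at next quarter’s data, or a step can be swapped out, without rebuilding the analysis from scratch.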

Don’t get hung up on silos. Silos are the bane of many corporate data-mining programs because they prevent access to a uniform pool of data. But silos aren’t as big an obstacle as some believe. “If you have a smart data and platform strategy, you should worry less about silos,” said Raghavendra. Simply put, you don’t need to worry about silos as long as they’re not an issue for the problem you’ve chosen to solve. You should, however, prepare for the next set of problems in the pipeline by providing for integration of disparate data sources. “There are flexible and modular platforms that allow you to integrate data when required,” Raghavendra added.

Think outside-in. You don’t always have to own all the information, talent, analytics and intelligence. This is an ecosystem story, and those who tap the matrix of capabilities around them will win. Crowdsourced data scientists, machine learning-as-a-service and external data sets all hold powerful potential.

Use data lakes. Data lakes are repositories in which you can store all your existing data as-is, irrespective of its format. Raghavendra said companies should get in the practice of putting all their data in a data lake even if they aren’t sure at first how they’ll use it. “Don’t think about structuring it right at the start,” he said.
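The “store it as-is” principle can be sketched at the file-system level: raw payloads land in the lake untouched, partitioned by source and date, and any parsing or structuring is deferred until read time. The paths and function name below are illustrative, not a reference to any particular lake product.

```python
# A minimal file-system sketch of a data lake's "store it as-is" idea:
# raw payloads are written without parsing, partitioned by source and date.
# Paths, names, and sample payloads are purely illustrative.
import os
import tempfile
from datetime import date

def land_raw(lake_root, source, payload, ext):
    """Write a payload into the lake without restructuring it."""
    partition = os.path.join(lake_root, source, date.today().isoformat())
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, f"part-0000.{ext}")
    with open(path, "w") as f:
        f.write(payload)
    return path

lake = tempfile.mkdtemp()

# Heterogeneous formats coexist; no schema is imposed at ingest time.
csv_path = land_raw(lake, "billing", "id,amount\n1,9.99\n", "csv")
json_path = land_raw(lake, "sensors", '{"turbine": 7, "kw": 1520}', "json")
print(csv_path)
print(json_path)
```

Because nothing is structured on the way in, a CSV billing extract and a JSON sensor feed can sit side by side, and each future analysis decides for itself how to interpret them.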

Perform exploratory data analysis (EDA) with a goal in mind. The first stage of data mining is EDA, which seeks to summarize data visually and non-visually. “What I’ve often seen is the exploratory data analysis part is siloed,” said Bharathan Shamasundar, senior data scientist with DXC. “The purpose of EDA is to drive insights about patterns in data and take informed positions on what to do thereafter. Often, companies go about doing this perfunctorily.” DXC’s experience with an energy utility underscored the importance of smart EDA. The utility was looking for an accurate forecast of how much energy its wind turbines would generate. Because it grounded its algorithms in EDA, DXC’s team beat existing benchmarks for turbine performance 95% of the time despite using fewer variables to make its calculations. That experience shows that meaningful EDA, done in advance, will more often lead to algorithms that are appropriate to the data on hand.
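A goal-directed EDA pass can be very small: before modeling turbine output, summarize a candidate variable and check how strongly it tracks the target, so only variables that earn their place enter the model. The numbers below are invented; the real turbine analysis is not public.

```python
# A tiny goal-directed EDA sketch: summarize a candidate predictor and
# measure its correlation with the target before committing it to a model.
# Wind-speed and power figures are invented for illustration.
import statistics

wind_speed = [4.1, 5.0, 6.2, 7.8, 9.1, 10.4]    # m/s (illustrative)
power_kw   = [310, 420, 690, 1150, 1610, 2050]  # kW  (illustrative)

def summarize(xs):
    """Non-visual summary: the basic shape of one variable."""
    return {"mean": statistics.mean(xs), "stdev": statistics.stdev(xs),
            "min": min(xs), "max": max(xs)}

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(summarize(wind_speed))
print(f"correlation with power: {pearson_r(wind_speed, power_kw):.3f}")
```

A variable with a strong, stable relationship to the target earns a place in the model; one that doesn’t can be dropped early, which is how fewer, better-chosen variables can still beat a benchmark.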

Use intelligent sampling. One reason that companies have trouble extracting insights from big data is that they’re using too much of it. “Sampling has become a bad word,” Shamasundar said. “Sampling data is being smart about working with data.” Often, what looks like “big data” is chock full of redundant information. For a commodity trading company, DXC found that a large portion of the data in storage was redundant: 94% of all its trading deals were based on a much smaller subset of data. Evaluating quality and relevance, in other words, is an important component of data strategy.
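One simple form of intelligent sampling is deduplication: keeping one representative per unique combination of terms and measuring how much of the raw volume carried no new information. The toy trade records below are invented, but they mimic the pattern the article describes, where most deals repeat a small set of underlying terms.

```python
# A sketch of one form of "intelligent sampling": measuring redundancy by
# deduplicating records that carry no new information. The trade records
# are invented for illustration.

trades = [
    ("wheat", "FOB", 30), ("wheat", "FOB", 30), ("wheat", "FOB", 30),
    ("corn",  "CIF", 45), ("corn",  "CIF", 45),
    ("soy",   "FOB", 60),
]

# Keep one representative per unique combination of terms.
unique_terms = sorted(set(trades))
redundancy = 1 - len(unique_terms) / len(trades)

print(f"{len(trades)} records, {len(unique_terms)} unique -> "
      f"{redundancy:.0%} redundant")
```

In practice the “unique combination” would be defined over the deal attributes that actually matter to the analysis, but the principle is the same: measure redundancy first, then work with the informative subset.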

Consider a flexible operating model for your data science program. Raghavendra advises: Don’t hold off on launching a data analysis program just because you can’t hire a data scientist. Demand for data scientists is currently 60 percent greater than supply, and there are no signs that this gap is narrowing. If a business can’t staff enough data scientists, Raghavendra said, it should consider using a mix of partner organizations providing specialist analytics support and “citizen data scientists.” A citizen data scientist is someone who understands the domain and the business of their employer. They can perform reasonable analysis using off-the-shelf analytics platforms that have now simplified certain tasks of data mining. As businesses apply analytics to solve problems, partner organizations can help them scale up their programs and build deeper capabilities in multiple areas.

Though following these guidelines makes success more likely, businesses need to remember that the possibility of failure is real. Data science makes use of the scientific method, which is based on proving or disproving a hypothesis. Harnessing data then should be considered an R&D activity. “It’s best to have six or a dozen ideas and run them in parallel,” said Gualtieri of data-based inquiries, “because not all of them are going to work.”

The challenge will get harder as the amount of data keeps increasing. On the other hand, the more data you have, the greater the potential rewards.

According to Dave Aron, Head of Research at the Leading Edge Forum, DXC’s thought leadership arm, too many companies still view their most important assets as physical and financial.

“Businesses set to thrive in the next decade recognize information as an asset, and they build and continually improve their analytics and learning platforms,” Aron said. “The Internet of Things and increased data protection legislation are making this ever more critical.”

Obtaining benefits from data – whether you’re a hospital or a utility or any other kind of business – will take a deliberate approach, a lot of grit and respect for the scientific method.