Data scientists building predictive models and machine learning algorithms often have to do more data preparation work upfront than is necessary in conventional analytics applications.

In more and more businesses, the drive to set up big data architectures that can support predictive analytics, data mining and machine learning applications is changing the shape of the data pipeline as well as the data preparation steps required to feed it.

"We used to live in a very straightforward world where data moved in one direction -- it was a data flow into a data warehouse," independent consultant and industry analyst Dave Wells said. "Now we have data warehouses, data lakes and data scientists' sandboxes. There are many sources, and they're processed in many ways. And the data pipelines now are multidirectional."

The overall effect is that strictly linear approaches to data flows are breaking down. Data management teams now also have to serve data scientists and other users whose analytical interests are exploratory or discovery-oriented, explained Wells, who wrote a report on data preparation software and tools for managing data pipelines that was published last November by consulting firm Eckerson Group.

The nature of predictive analytics changes the way analysts deal with data, according to Jin On, a data scientist at Geneia LLC, a Harrisburg, Pa., company that provides analytics software and services to clients in the healthcare industry. "At the beginning of my career, the [analytical] models I built out were more about descriptive statistics," On said. "You'd ask a strict question about how many people have diabetes and get a blatant answer."

Predictive modeling is different, she added. For example, one of the applications she works on is aimed at predicting how likely it is that individual patients will need to be readmitted to a hospital. "For that type of analytics, you need more creativity," On explained. "You have to look at the actual data first and see what it's really telling you about the attributes that contribute most to the likelihood of a readmission."

Back for more on data preparation

Some of On's work takes her into the realm of machine learning, which often requires that raw data be maintained as is and then filtered in different ways to meet particular analytical needs. She said that after assessing the characteristics of the available data, the next step is to look at the types of machine learning algorithms that can be used to boost the accuracy of the planned model's predictions.

The data requirements formulated by On, who uses SAS software to prepare data and build predictive models, can vary with different machine learning algorithms. A case in point is the Random Forest algorithm, which places restrictions on the number of levels that certain categories of data variables can have, according to On. That often means additional data preparation steps to massage the data to match her specifications. "You have to go back into data preparation in that case to tweak the data so it works with a particular algorithm," she noted. "That's one reason I want to [examine] my data first, before I start exploring it."
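To make the kind of tweak On describes concrete, here is a minimal sketch of one common way to reduce the number of levels in a categorical variable: keep the most frequent levels and fold the rest into a catch-all bucket. It is written in Python purely for illustration (On works in SAS), and the function name `cap_levels`, the `OTHER` label and the sample values are all hypothetical. The actual level limit depends on the Random Forest implementation in use and is not given in the article.

```python
from collections import Counter


def cap_levels(values, max_levels, other_label="OTHER"):
    """Return a copy of `values` with at most `max_levels` distinct levels.

    Hypothetical helper: keeps the (max_levels - 1) most frequent levels
    and maps every other level to `other_label`.
    """
    if max_levels < 2:
        raise ValueError("max_levels must be at least 2")
    values = list(values)
    counts = Counter(values)
    # Nothing to do if the variable already fits within the limit.
    if len(counts) <= max_levels:
        return values
    # Reserve one slot for the catch-all bucket.
    keep = {level for level, _ in counts.most_common(max_levels - 1)}
    return [v if v in keep else other_label for v in values]


# Made-up example values, for illustration only.
codes = ["A", "A", "A", "B", "B", "C", "D"]
capped = cap_levels(codes, max_levels=3)
print(capped)
```

Whether folding rare levels together is acceptable is itself an analytical decision, which is one reason to look at the data before settling on an algorithm.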

Trash for some, treasure for others

For data managers, new approaches to data preparation -- ones that are far from cookie-cutter methods -- are necessary to support advanced analytics needs, said Eric King, CEO of The Modeling Agency LLC, an analytics consulting and training services company.

Even one of computing's most time-tested concepts may be in for reworking: garbage in, garbage out (GIGO), which holds that users will never get good results out of bad data. "Back in the day, it was all about GIGO, but it isn't anymore," said King, whose company teaches courses on the subject of data preparation for predictive analytics. He said prescribed data preparation steps typically involve a lot of binning, smoothing and fitting -- much of it meant to discard outlier data as a way of "separating the signal from the noise."

In big data environments, though, "such cleansing may not be what the data scientist or the successful predictive modeler wants," King said. "At the same time, new algorithms can handle a good bit of noise and garbage." As a result, he added, "it can be wasteful to over-cleanse the data." There are steps that need to be taken -- but when preparing data for analytics uses, sometimes it pays to be judicious.
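One way to stay judicious is to mark suspect values rather than delete them, so the raw data stays intact and each model can decide what to filter. The sketch below illustrates that idea in Python. The 1.5 x interquartile-range rule it uses is a common textbook convention chosen here as an assumption, not something King prescribes, and the function name `flag_outliers` and the sample numbers are hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Pair each value with a flag instead of dropping anything.

    Hypothetical helper: a value is flagged when it lies more than
    k * IQR below the first quartile or above the third quartile.
    """
    values = list(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(v, v < low or v > high) for v in values]


# Made-up example values, for illustration only.
raw = [10, 11, 12, 13, 14, 15, 16, 100]
flagged = flag_outliers(raw)

# A traditional "cleansed" view for a model that is sensitive to outliers...
cleansed_view = [v for v, is_outlier in flagged if not is_outlier]
# ...and the untouched view for an algorithm that tolerates noise.
full_view = [v for v, _ in flagged]
```

Because the flag travels with the data, the choice to discard outliers is deferred to the modeler instead of being baked into the pipeline.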
