Machine learning. Artificial Intelligence

Data Science and Machine Learning Predictions

This is the time of year when everyone looks to the year ahead. Here are four things in data science and machine learning that are utterly and completely predictable in 2018.

Data Science Matures

In the Pleistocene Era of Data Science, there were Heroes and Hackers: lone souls working on ad hoc projects with Pig, Hive, Mahout, Java, and a few prayers. For asset management, organizations used thumb drives and email. Collaboration was a non-issue because there were few others, if any, to collaborate with.

In time, organizations hired more data scientists. Heroes and Hackers evolved into Data Science Guerrillas armed with laptops and notebooks. IT didn’t want anything to do with data science; it’s messy and complicated, so it was easier to simply pretend it didn’t exist. Responsible team leaders asked contributors to store assets on Git; some complied, some didn’t, but it hardly mattered because the Git repository was a disorganized mess. For tooling, data scientists used a quodlibet of languages, packages, and notebooks, which made cross-checking and peer review problematic. Nobody could agree on a common set of tools, so collaboration was rare.

“Guerrilla Data Science”

Today, data science has matured to the point that organizations expect a return on their investment. They want to see faster turnaround, and more value. Nobody cares if you won Kaggle; we want to see a minimum viable data product while we’re young.

Smart organizations adopt a collaborative model of data science. The collaborative model recognizes that the data scientist is one member of a larger team that may include business analysts, data engineers, developers, machine learning engineers, DevOps specialists, compliance specialists, security professionals, and many others all pulling together to deliver a working application.

“Collaborative Data Science”

The rise of collaborative data science leads organizations to adopt open data science platforms that do the following:

Provide a shared platform for all data science contributors

Facilitate the use of open data science tools (such as Python and R) at scale

Provide self-service access to data, storage, and compute

Support a complete pipeline from data to deployment

Include collaborative development tools

Ensure asset management and reproducibility

There are now multiple offerings in the market from vendors including Amazon Web Services, Anaconda, Cloudera, DataScience, Domino, Google, IBM, and Microsoft. In 2017, venture capitalists funded several startups in the category, which suggests that there is strong growth potential.

In 2018, look for more organizations to adopt a collaborative model of data science, and invest in an open data science platform.

Automated Machine Learning Gets Real

Several months ago, a data scientist explained to me why it takes him weeks to build a predictive model. “I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters. We cross-check everything manually to make sure there are no mistakes.”

After listening to this for an hour, I was ready to kill myself.

Automated machine learning does not eliminate the hard parts of a data scientist’s job, such as listening to clients, understanding the business problem, and figuring out how to craft a solution. It automates the stupid parts of the job. Like repetitive programming. The kind of stuff researchers delegate to interns and new hires.
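To make "repetitive programming" concrete, here is a minimal sketch of the kind of loop an automated learner runs on your behalf. It is not how any of the commercial or open source tools work internally; the toy nearest-neighbor model, the synthetic data, and names such as `make_data` and `search` are all assumptions made up for illustration. The point is only that "a hundred experiments" reduces to iterating over a parameter grid and keeping score.

```python
import itertools
import random


def make_data(n, seed):
    """Synthetic 1-D binary data (illustrative only): label is 1 when x > 0.5, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:  # flip roughly 10% of the labels
            label = 1 - label
        data.append((x, label))
    return data


def knn_predict(train, x, k, weighted):
    """Classify x by a vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if weighted:
        # Closer neighbors get a larger vote.
        votes = sum((1 if y == 1 else -1) / (abs(px - x) + 1e-9) for px, y in neighbors)
    else:
        votes = sum(1 if y == 1 else -1 for _, y in neighbors)
    return int(votes > 0)


def accuracy(train, valid, k, weighted):
    """Share of validation points classified correctly."""
    hits = sum(knn_predict(train, x, k, weighted) == y for x, y in valid)
    return hits / len(valid)


def search(train, valid, grid):
    """Run one experiment per parameter combination; return (best_score, best_params, results)."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((accuracy(train, valid, **params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_score, best_params, results


train = make_data(200, seed=0)
valid = make_data(100, seed=1)
grid = {"k": [1, 3, 5, 9, 15], "weighted": [False, True]}  # 5 x 2 = 10 experiments

best_score, best_params, results = search(train, valid, grid)
print(len(results), "experiments; best:", best_params, round(best_score, 3))
```

Real tools add smarter search strategies, cross-validation, and feature preprocessing, but the human judgment about which problem to solve stays outside the loop.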

Think of it like this. We’ve had robotic heart surgery for 20 years, but you don’t see cardiac surgeons standing by freeway exits holding signs that say Will Work for Food. If I have a heart problem, I’m not calling Watson — I’m going to see Dr. Angina down at University Hospital.

“The surgical robot will see you now.”

It’s the same with data science. When the CEO needs answers to really important questions, she’s not calling Watson. She’s calling the CAO or the Chief Data Scientist or whatever. Someone with skin in the game. Because when real executives delegate a task, they delegate it to someone they trust.

Organizations that want to invest in automated machine learning have plenty of commercial and open source options. Amazon Web Services, DataRobot, Google, H2O.ai, IBM, and SAS all offer automated learners; some of these are much better than others (but I’d rather hold a detailed discussion of the differences for a later post). In the open source ecosystem, we have auto-sklearn, Auto Tune Models, Auto-Weka, machine-JS, and TPOT.

Prediction: in 2018 we’re going to see many more offerings, and more organizations will adopt the tools.

Data Scientists Discover GDPR Applies to Data Scientists

On May 25, 2018, the European Union’s General Data Protection Regulation (GDPR) takes effect. The reaction in the data science community will be something like this:

February: nothing

March: nothing

April: WTF is GDPR?

May: hair on fire

Data scientists discover that GDPR applies to them.

As I’ve written elsewhere, much of the commentary about GDPR misstates the likely impact on data science. There’s a lot of talk about the “right to an explanation,” which is actually a “right to human-in-the-loop decision-making.” But this provision applies to a narrow set of transactions, and affects front-office customer interactions more than data scientists.

GDPR’s greatest impact on data science practice is the obligation it imposes to avoid bias in predictive models used in decisions about consumers. In practice, this means that data science teams must survive an audit of their methods and procedures. Reproducibility and data lineage will be de rigueur.

That’s one more reason to put Heroes, Hackers, and Guerrillas behind you, and adopt a mature model of data science.
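What might "reproducibility and data lineage" look like at the level of a single model run? One possible approach, sketched below, is to write a small manifest next to every trained model that records a fingerprint of the training data, the parameters, and the random seed. GDPR does not prescribe this format; the function names, the manifest fields, and the file name are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of the training rows.

    Key order inside each row is normalized; row order still matters.
    """
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_manifest(rows, params, seed, source):
    """Collect what an auditor would need to re-run this experiment."""
    return {
        "data_sha256": fingerprint(rows),
        "n_rows": len(rows),
        "params": params,
        "seed": seed,
        "source": source,
    }


# Hypothetical training data and settings, for illustration only.
rows = [
    {"age": 34, "income": 52000, "approved": 1},
    {"age": 51, "income": 38000, "approved": 0},
]
manifest = build_manifest(
    rows,
    params={"model": "logistic_regression", "C": 1.0},
    seed=42,
    source="loans_2017_q4.csv (hypothetical)",
)
print(json.dumps(manifest, indent=2))
```

If the data changes by a single value, the fingerprint changes, so a reviewer can tell whether a model was really trained on the data the team says it was.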

While GDPR sets out general principles, it leaves many details to the European Data Protection Board (EDPB). This secretariat will issue detailed guidance for controllers and processors – for example, on the data portability right, Data Protection Impact Assessments, certifications, and the role of Data Protection Officers. Like any regulator, EDPB will issue guidance over time, and the rules may be complex. Thus, compliance won’t be a matter of learning a few principles once; it will be an ongoing effort to understand requirements as they evolve.

GDPR Compliance Officer (*)

Meet the new boss, your GDPR Compliance Officer. She’s up on all the latest rulings, as well as legal requirements imposed by the separate states in which your organization operates. She’s going to engage in all of your data science projects, and she’ll tell you what you need to do to comply with the regulations. You’re going to do whatever she tells you to do, or your work will never see the light of day.

(*) Yeah, I know — it’s Natalia Poklonskaya. No hidden political message there, I just like the picture.

Cloud, Blah, Blah, Blah, Blah…

Cloud is neither a great platform for data science nor a good platform. It’s the only logical platform.

Think of it like this. It makes sense for organizations to invest in IT infrastructure for workloads that are persistent, predictable, and mission-critical. Everything else should go to the cloud.

If you live in Manhattan and want to visit Grandma in Shrewsbury twice a year, you don’t buy a Tesla unless you’re filthy rich. You rent a ZipCar, or take an Uber.

Are data science workloads persistent, predictable and/or mission-critical? If you answered “none of the above” go to the head of the class. Data science projects are time-boxed and short-term. They require brief massive bursts of computing power. And they are rarely mission-critical.

I’m tempted to “predict” that data science will move to the cloud in 2018. Except that data science moved to the cloud a long time ago. I don’t have statistics, but here are some anecdotes:

2010: RazorFish, the digital marketing agency, pulls the plug on its server and moves everything to AWS

2014: Data scientists at a leading US bank say they’ve moved 100% of model development to the cloud

2015: A leading strategy consultancy uses a Virtual Private Cloud for 100% of its data science workloads

Analytic service providers and consultants led the way into the cloud. As variable-cost organizations, they had a huge incentive to stop investing in IT infrastructure. And, they had the skills to use the cloud back when it was hard.

It’s getting easier to use the cloud, so economic logic prevails.

Yes, there are some holdouts: organizations that prohibit use of cloud, or take a go-slow approach. But they are increasingly rare.

Predicting that data science will continue moving to the cloud is like predicting that the Mississippi River will continue flowing into the Gulf of Mexico.

IBM: Four More Quarters of Decline. Oh, Wait…

I was going to predict four more quarters of declining revenue for IBM. But then the company threw a monkey wrench into the works and reported increased sales in Q4. So, let’s offer a round of golf applause for the folks at Armonk.

But remember: the U.S.S. Arizona stopped sinking when it settled into the mud.

Does this mean IBM’s big investment in Watson is finally paying off? Well, no. Take a squint at the numbers. The big jump in revenue comes from the Systems business, where IBM reports a big jump in…wait for it…System Z boxes, aka mainframes. And, in the Cognitive Solutions segment, IBM says that security and transaction processing software drove the revenue increase. You know, stuff like CICS that runs on mainframes.

So, the handful of organizations that account for most of IBM’s revenue decided that it’s easier to upgrade some of their old boxes than it is to replace them wholesale with modern architecture.

Not that there’s anything wrong with that.

Why, you ask, does IBM include software for mainframe transaction processing in its “Cognitive Solutions” business unit? Good question. One theory: when IBM reorganized, the most important consideration was to make sure that each of CEO Ginni Rometty’s one-downs had a big enough fief to justify super-sized compensation. IBM had to throw the kitchen sink into “Cognitive Solutions” to make it a suitable prince-bishopric.

Which explains why “Cognitive Solutions” has a 4% growth rate. 4% isn’t a growth story. It’s a “we’re just keeping our heads above water” story. It’s tough to grow when your business is sandbagged with the dogs IBM has collected over the years. Yes, Virginia, there is still a Red Brick Warehouse.

Each quarter, IBM breathlessly announces “wins” for Watson. Scan through the 10-Qs, however, and you know what you don’t see? The words “Watson” and “revenue” in close conjunction. That’s because auditors actually care about such things as “revenue recognition” and “materiality” and keeping BS out of the financial statements. Lest they get sued. Wall Street pleads with IBM to show some results from Watson. So you figure that if IBM actually had material Watson revenue, you’d see it in the financials.

IBM reports revenue of ~$20 billion annually for the “Cognitive Solutions” business. But industry analyst IDC estimates IBM’s actual revenue from cognitive and AI software at about $160 million. Which means that the IBM cognitive story is one part reality and 125 parts window dressing.
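For anyone who wants to check the arithmetic, here it is using only the two figures quoted above. Treating both as comparable annual numbers is an assumption on my part; the variable names are just labels.

```python
# Figures as quoted in the paragraph above (both treated as annual, which is an assumption).
segment_revenue = 20_000_000_000  # ~$20 billion reported for "Cognitive Solutions"
ai_revenue = 160_000_000          # ~$160 million, the IDC estimate for cognitive and AI software

ratio = segment_revenue / ai_revenue  # segment revenue per dollar of AI revenue
share = ai_revenue / segment_revenue  # AI revenue as a fraction of the segment

print(f"{ratio:.0f}x, {share:.1%}")  # prints "125x, 0.8%"
```

In other words, by these numbers less than one percent of the segment is the cognitive software that gives it its name.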

Keep that in mind the next time an IBM executive wants to talk to you about the power of cognitive computing.

Natalia Poklonskaya on your photo is a deputy of the Russian parliament. Yes, she used to be a prosecutor. But this is a very odious person, she is a militant obscurantist, adheres to sectarian views. Even in the Russian parliament, she is considered crazy, although this parliament does not include adequate people. It is better to replace this photo, otherwise the readers will laugh.

Regards, Andrey Sukhobokov

You wrote January 30, 2018, 22:06:19:

Thomas W. Dinsmore posted: “This is the time of year when everyone looks to the year ahead. Here are five four things in data science and machine learning that are utterly and completely predictable in 2018. Data Science Matures In the Pleistocene Era of Data Science, there were H”

Thanks for the post. If machine learning is becoming more accurate because of the number of iterations, which we can call considerations, how will we ever be able to go through the millions, possibly billions, of them to check how the results came to be? Does that mean we are entering an era where being accurate is more important than understanding?

Thanks for the reply Thomas. I agree accuracy and understanding are different problems.

With the increase of quality big data and processing power of computers, I feel accuracy (better than human accuracy) is only a matter of time. As you said the last iteration is all that matters if you get it right.

The bigger problem, in my opinion, is understanding how this accuracy is derived. It is an irreducible problem, meaning the black box, where the improvement is refined many times over, is not exactly something we can go through.

The question will become: are we happy to have a superhumanly accurate result without understanding it? Certainly if a doctor told me he recommends a treatment and explains how it works and that it has a 75% chance of success, and then gives me another option of an algorithm’s recommendation which he cannot explain, but has a 95% chance, I would choose the algorithm’s recommendation every time.

I’m sure most corporations will only care about accuracy, since this will put them ahead of the pack. The ones that try to understand may well be left behind.

Suppose a doctor or a black box recommends no treatment. I suspect that you will want an explanation.

As a rule, people want explanations for bad news, not good news. When a credit card issuer declines you for credit, they will usually tell you why. (You really should have paid off that loan last year.) When they grant you credit, they don’t say why, and you’re not likely to ask.