notes on numbers and other randomness

Tag: machine learning

The key to a successful analytical model is having a robust set of variables against which to test for their predictive capabilities. And the key to having a robust set of variables from which to test is to get the business users engaged early in the process.

Machine learning startup Wise.io, whose founders are University of California, Berkeley, astrophysicists, has raised a $2.5 million series A round of venture capital. The company claims its software, which was built to analyze telescope imagery, can simplify the process of predicting customer behavior.

Read how one fast growing Internet company turned to Wise to get a Lead Scoring Application, scrapped their plans to hire a Data Scientist and replaced their custom code with an end-to-end Leading scoring Application in a couple of weeks.

Uh oh. Pretty soon no one’s going to hire data scientists!*

[Aside: Editors will still be needed. How about “lead scoring application” instead of “Leading scoring Application?” Heh. Aside to the aside: predictive lead scoring is among the easiest of data science problems currently confronting humans.]

Frame problems
Data science projects do not arrive on an analyst’s desk like graduate school data analysis projects, with the question you are supposed to answer given to you. Instead, data scientists work with business experts to identify areas potentially amenable to optimization with data-based approaches. Then the data scientist does the difficult work of turning an intuition about how data might help into a machine learning problem. She formulates decision problems in terms of expected value calculations. She selects machine learning algorithms that may be useful. She figures out what data might be relevant. She gathers it and preps it (see below) and explores it.
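The expected value framing above can be made concrete with a toy sketch. Everything here is a hypothetical: the lead names, conversion probabilities, deal value, and contact cost are invented for illustration, not figures from any real project.

```python
# Hypothetical sketch: framing "should we contact this lead?" as an
# expected value calculation. All numbers below are illustrative
# assumptions, not figures from a real lead scoring system.

def expected_value(p_convert, value_if_convert, contact_cost):
    """Expected net value of contacting one lead."""
    return p_convert * value_if_convert - contact_cost

# A model scores each lead; we act only when expected value is positive.
leads = {"A": 0.30, "B": 0.02, "C": 0.10}  # predicted conversion probabilities
VALUE, COST = 500.0, 20.0                  # assumed deal value and contact cost

decisions = {name: expected_value(p, VALUE, COST) > 0 for name, p in leads.items()}
print(decisions)  # A and C clear the bar; B does not
```

The point is not the arithmetic but the translation: a fuzzy business intuition ("call the good leads") becomes a decision rule a model can serve.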

Design features
Just as real-world data science projects do not arrive with a neat question and a preselected algorithm, they also do not arrive with variables all prepped and ready. You start with a set of transaction records (or file system full of logs… or text corpus… or image database) and then you think, “O human self, how can I make this mess of numbers and characters and yeses and nos representative of what’s going on in the real world?” You might log-transform or bin currency amounts. You might need to think about time and how to represent changing trends (an exponentially weighted moving average?). You might distill your data – mapping countries to regions, for example. You might enrich it – gathering firmographics from Dun & Bradstreet.
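The transformations named above can be sketched in a few lines of pandas. The column names, bin edges, and country-to-region mapping are made up for illustration:

```python
# A minimal sketch of the feature-design steps above, using pandas.
# Column names, bin edges, and the region mapping are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount":  [12.0, 250.0, 8000.0, 45.0, 120000.0],
    "country": ["DE", "FR", "US", "JP", "BR"],
})

# Log-transform a skewed currency amount...
df["log_amount"] = np.log1p(df["amount"])

# ...or bin it into coarse categories.
df["amount_bin"] = pd.cut(df["amount"], bins=[0, 100, 10_000, np.inf],
                          labels=["small", "medium", "large"])

# Represent a changing trend with an exponentially weighted moving average.
df["amount_ewma"] = df["amount"].ewm(span=3).mean()

# Distill: map countries to regions (mapping invented for the example).
region = {"DE": "EMEA", "FR": "EMEA", "US": "AMER", "BR": "AMER", "JP": "APAC"}
df["region"] = df["country"].map(region)

print(df[["log_amount", "amount_bin", "amount_ewma", "region"]])
```

Each line encodes a human judgment about the problem domain – which is exactly the point of the section.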

You also must think about missing values and outliers. Do you ignore them? Impute and replace them? Machines can be given instruction in how to handle these but they may do it crudely, without understanding the problem domain and without being able to investigate why certain values are missing or outlandish.
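A small sketch of doing this by hand, so the choices stay visible instead of buried in a tool's defaults (the numbers are toy data):

```python
# Hedged sketch: handling missing values and outliers explicitly.
# The series below is toy data invented for illustration.
import numpy as np
import pandas as pd

s = pd.Series([25.0, 34.0, np.nan, 29.0, 4100.0, 31.0])

# Option 1: drop missing values outright.
dropped = s.dropna()

# Option 2: impute with the median (more robust to the 4100.0
# outlier than the mean would be).
imputed = s.fillna(s.median())

# Flag outliers rather than silently clipping them, so a human can ask
# *why* the value is outlandish before deciding what to do with it.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print(s[is_outlier])  # the 4100.0 surfaces for investigation
```

A machine can run every one of these lines; deciding which option fits the problem domain is the part it cannot do.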

Unfool themselves
We humans have an uncanny ability to fool ourselves. And in data analysis, it seems easier than ever to do so. We fool ourselves with leakage. We fool ourselves by cross-validating with correlated data. We fool ourselves when we think a just-so story captures truth. We fool ourselves when we think that big data means we can analyze everything and we don’t need to sample.

[Aside: We are always sampling. We sample in time. We sample from confluences of causes. Analyzing all the data that exists doesn’t mean that we have analyzed all the data that a particular data generation mechanism could generate.]
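One concrete way we fool ourselves with correlated data: rows from the same customer land on both sides of a train/test split, and the score flatters us. A minimal sketch of splitting by group instead of by row (customer IDs and values are invented):

```python
# Minimal sketch of a group-aware split: hold out whole customers,
# so correlated rows never straddle the train/test boundary.
# Customer names and values are illustrative.
import random

rows = [
    {"customer": "acme", "x": 1}, {"customer": "acme", "x": 2},
    {"customer": "bolt", "x": 3}, {"customer": "bolt", "x": 4},
    {"customer": "core", "x": 5}, {"customer": "dune", "x": 6},
]

def group_split(rows, test_fraction=0.5, seed=0):
    """Hold out whole groups rather than individual rows."""
    groups = sorted({r["customer"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r["customer"] not in test_groups]
    test = [r for r in rows if r["customer"] in test_groups]
    return train, test

train, test = group_split(rows)
# No customer appears on both sides of the split.
assert not ({r["customer"] for r in train} & {r["customer"] for r in test})
```

Libraries offer the same idea ready-made (e.g., grouped cross-validation splitters), but the unfooling step is noticing that the rows are correlated in the first place.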

Care: that’s what humans bring to machine learning and machines do not. We care about the problem domain and the question we want to answer. We care about developing meaningful measures of things going on in that domain. We care about ensuring we don’t fool ourselves.

* Even if we can toolify data science and I have no doubt we will move ever towards that, the tool vendors will still need data scientists. I predict continued full employment. But I may be fooling myself.

Target, Pregnancy, and Predictive Analytics, Part II [Dean Abbott/Data Mining and Predictive Analytics]. The Target story was interesting for what it says about the possibilities and perils of analytics. This was my favorite writeup, for its overview of how to succeed with data analysis:

1) understand the data,
2) understand why the models are focusing on particular input patterns,
3) ask lots of questions (why does the model like these fields best? why not these other fields?),
4) be forensic (now that’s interesting… or that’s odd… I wonder…),
5) be prepared to iterate (how can we predict better for those customers we don’t characterize well?),
6) be prepared to learn during the modeling process.

We have to “notice” patterns in the data and connect them to behavior. This is one reason I like to build multiple models: different algorithms can find different kinds of patterns. Regression is a global predictor (one continuous equation for all data), whereas decision trees and kNN are local estimators.
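The global-vs-local distinction is easy to see on synthetic data: fit a step-shaped signal with one global line and with a nearest-neighbor estimator. Everything below is a toy constructed for the illustration:

```python
# Toy illustration of global vs. local predictors on a step-shaped
# signal. All data here is synthetic.
import numpy as np

x = np.linspace(0, 1, 20)
y = (x > 0.5).astype(float)          # a sharp, local pattern

# Global: one ordinary least squares line y = a*x + b over ALL the data.
A = np.vstack([x, np.ones_like(x)]).T
a, b = np.linalg.lstsq(A, y, rcond=None)[0]
line_pred = a * x + b

# Local: 1-nearest-neighbor prediction, driven only by nearby points.
def knn_predict(x_query, k=1):
    idx = np.argsort(np.abs(x - x_query))[:k]
    return y[idx].mean()

knn_pred = np.array([knn_predict(xi) for xi in x])

print(np.mean((line_pred - y) ** 2))  # the single line smears the step
print(np.mean((knn_pred - y) ** 2))  # the neighbors capture it exactly
```

Neither family is better in general – which is exactly the argument for building multiple models and noticing which kinds of patterns each one finds.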

Rather than merely providing a “cookbook” approach to, say, building a “who to follow” recommendation system for Twitter, it takes the time to explain the methodology behind the algorithms and gives the reader a better basis for understanding why these methods work (and, equally importantly, how they can go wrong).

What’s new? Exuberance for novelty has benefits [John Tierney/The New York Times]. In a longitudinal study, people who combined novelty-seeking with persistence and “self-transcendence” showed the most success over the years (good health, lots of friends, few emotional problems, greatest satisfaction with life).

In The Magicians [1], Lev Grossman describes magic as it might exist, but he could as well be describing the real-world practice of statistical analysis or software development:

As much as it was like anything, magic was like a language. And like a language, textbooks and teachers treated it as an orderly system for the purposes of teaching it, but in reality it was complex and chaotic and organic. It obeyed rules only to the extent that it felt like it, and there were almost as many special cases and one-time variations as there were rules. These Exceptions were indicated by rows of asterisks and daggers and other more obscure typographical fauna which invited the reader to peruse the many footnotes that cluttered up the margins of magical reference books like Talmudic commentary.

It was Mayakovsky’s [the teacher’s] intention to make them memorize all these minutiae, and not only to memorize them but to absorb and internalize them. The very best spellcasters had talent, he told his captive, silent audience, but they also had unusual under-the-hood mental machinery, the delicate but powerful correlating and cross-checking engines necessary to access and manipulate and manage this vast body of information. (p149)

To be a good data scientist, whether using traditional statistical techniques or machine learning algorithms (or both), you must know all the rules and approach it first as an orderly system. Then you begin to learn all the special cases and one-time variations and you study and study and practice and practice until you can almost unconsciously adjust to each unique situation that arises.

When I took ANOVA in my Ph.D. program, I could hardly believe there was an entire course devoted to it. But it was much like Grossman’s description above. Each week we learned new special cases and one-time variations. I did ANOVA in so many different Circumstances that now I have absorbed and internalized its application as well as the design of studies that would usefully be analyzed with it or with some more flexible variation of it (e.g., hierarchical linear modeling). It felt cookbook at the beginning, but at the end of the course, I felt like I’d begun to develop that “unusual under-the-hood mental machinery” that Grossman suggested an effective magician in his imagined world would need.
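For readers who never sat through that course, the simplest of ANOVA's many special cases – one-way, balanced groups – fits in a few lines. The measurements below are made up:

```python
# Bare-bones one-way ANOVA F statistic, computed from scratch.
# The three groups of measurements are invented for illustration.
import numpy as np

groups = [
    np.array([5.1, 4.9, 5.4, 5.0]),
    np.array([5.8, 6.1, 5.9, 6.2]),
    np.array([4.2, 4.0, 4.4, 4.1]),
]

grand = np.concatenate(groups).mean()
k = len(groups)
n = sum(len(g) for g in groups)

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (between-group mean square) / (within-group mean square).
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_stat)  # large F: group means differ far more than noise explains
```

Every special case taught in that course – unbalanced designs, repeated measures, nested factors – is a variation on this same between-versus-within comparison.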

That’s not to say that there aren’t important universal principles and practices and foundational knowledge to understand if you are to be an effective statistician or data miner or machine learning programmer; it’s not to say that awareness of Circumstances and methodical practice are all you need. It is to say that data science is ultimately a practice, not a philosophy, and you reach expertise in it through doing things over and over again, each time in slightly different ways.

In The Magicians, protagonist Quentin practices Legrand’s Hammer Charm, under thousands of different Circumstances:

Page by page the Circumstances listed in the book became more and more esoteric and counterfactual. He cast Legrand’s Hammer Charm at noon and at midnight, in summer and winter, on mountaintops and a thousand yards beneath the earth’s surface. He cast the spell underwater and on the surface of the moon. He cast it in early evening during a blizzard on a beach on the island of Mangareva, which would almost certainly never happen since Mangareva is part of French Polynesia, in the South Pacific. He cast the spell as a man, as a woman, and once–was this really relevant?–as a hermaphrodite. He cast it in anger, with ambivalence, and with bitter regret. (pp150-151)

Sometimes I feel like I have fit logistic regression in all these situations (perhaps not as a hermaphrodite). The next logistic regression I fit, I will say to myself “Wax on, wax off” as Quentin did when faced with a new spell that he had to practice according to each set of Circumstances.

[1] Highly recommended, but with caveats. Read it last summer — loved it — sent it to my 15-year-old son at camp. He loved it too and bought me the sequel for Christmas. After reading the second one, I had to re-read the first. It’s a polarizing book. Don’t pick it up if you are offended by heavy drinking, gratuitous sex, and a wandering plot. Do pick it up if you felt like your young adulthood was marked by heavy drinking, gratuitous sex, a wandering plot, and not nearly enough magic. My son tends to read adult books so I didn’t hesitate to share it with him, but it probably would not be appropriate for most teenagers.