notes on numbers and other randomness

Machine learning startup Wise.io, whose founders are University of California, Berkeley, astrophysicists, has raised a $2.5 million series A round of venture capital. The company claims its software, which was built to analyze telescope imagery, can simplify the process of predicting customer behavior.

Read how one fast growing Internet company turned to Wise to get a Lead Scoring Application, scrapped their plans to hire a Data Scientist and replaced their custom code with an end-to-end Leading scoring Application in a couple of weeks.

Uh oh. Pretty soon no one’s going to hire data scientists!*

[Aside: Editors will still be needed. How about “lead scoring application” instead of “Leading scoring Application?” Heh. Aside to the aside: predictive lead scoring is among the easiest of data science problems currently confronting humans.]

Frame problems
Data science projects do not arrive on an analyst’s desk like graduate school data analysis projects, with the question you are supposed to answer given to you. Instead, data scientists work with business experts to identify areas potentially amenable to optimization with data-based approaches. Then the data scientist does the difficult work of turning an intuition about how data might help into a machine learning problem. She formulates decision problems in terms of expected value calculations. She selects machine learning algorithms that may be useful. She figures out what data might be relevant. She gathers it and preps it (see below) and explores it.
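A minimal sketch of what framing a decision as an expected value calculation might look like for lead scoring, in Python. The function names, the $500 deal value, the $20 contact cost, and the conversion probabilities are all made-up assumptions for illustration, not anything taken from Wise.io or a real project.

```python
def expected_value_of_contact(p_convert, deal_value, contact_cost):
    """Expected profit from contacting one lead."""
    return p_convert * deal_value - contact_cost


def should_contact(p_convert, deal_value, contact_cost):
    """Contact the lead only when doing so beats doing nothing (value 0)."""
    return expected_value_of_contact(p_convert, deal_value, contact_cost) > 0


# Hypothetical numbers: a $500 deal and a $20 sales call.
leads = {"lead_a": 0.10, "lead_b": 0.02}
decisions = {name: should_contact(p, 500.0, 20.0) for name, p in leads.items()}
print(decisions)  # {'lead_a': True, 'lead_b': False}
```

The model only supplies the probability. Deciding what a conversion is worth and what a contact costs is the framing work that still falls to a person.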

Design features
Just as real-world data science projects do not arrive with a neat question and a preselected algorithm, they also do not arrive with variables all prepped and ready. You start with a set of transaction records (or file system full of logs… or text corpus… or image database) and then you think, “O human self, how can I make this mess of numbers and characters and yeses and nos representative of what’s going on in the real world?” You might log-transform or bin currency amounts. You might need to think about time and how to represent changing trends (exponentially weighted moving average?). You might distill your data, mapping countries to regions for example. You might enrich it, gathering firmographics from Dun & Bradstreet.
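For the curious, here is a rough sketch of three of those moves: a log transform for currency amounts, an exponentially weighted moving average for trends, and a lookup that distills countries into regions. The smoothing weight, the region table, and the spend figures are hypothetical choices made only for illustration.

```python
import math


def log_transform(amount):
    """log(1 + x) compresses large currency amounts and is defined at zero."""
    return math.log1p(amount)


def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; newer values count more.

    alpha is an assumed smoothing weight between 0 and 1.
    """
    smoothed = []
    current = None
    for v in values:
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical lookup for distilling countries into regions.
REGION = {"FR": "EMEA", "DE": "EMEA", "US": "AMER", "JP": "APAC"}

weekly_spend = [100.0, 100.0, 400.0]
print([round(log_transform(x), 3) for x in weekly_spend])
print(ewma(weekly_spend))
print(REGION.get("FR", "OTHER"))
```

None of these choices is automatic. Whether a spike in spend should be smoothed away or treated as the signal depends on the question being asked.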

You also must think about missing values and outliers. Do you ignore them? Impute and replace them? Machines can be given instruction in how to handle these but they may do it crudely, without understanding the problem domain and without being able to investigate why certain values are missing or outlandish.
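One common, admittedly crude, way to impute is to fill gaps with the median and keep a flag recording that the value was missing. The sketch below assumes missing values arrive as None and that at least one value is observed; the employee counts are invented.

```python
from statistics import median


def impute_with_flag(values):
    """Replace None with the median of observed values and keep a flag.

    The flag column lets a model (or a person) see that the value was
    missing, which is often informative in its own right.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


# Hypothetical employee counts for five accounts; two are unknown.
employees = [12, None, 40, 7, None]
filled, was_missing = impute_with_flag(employees)
print(filled)       # [12, 12, 40, 7, 12]
print(was_missing)  # [False, True, False, False, True]
```

This is exactly the kind of mechanical rule a machine can apply. It cannot tell you whether the two unknown accounts are missing because they are tiny, brand new, or mis-keyed.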

Unfool themselves
We humans have an uncanny ability to fool ourselves. And in data analysis, it seems easier than ever to do so. We fool ourselves with leakage. We fool ourselves by cross-validating with correlated data. We fool ourselves when we think a just-so story captures truth. We fool ourselves when we think that big data means we can analyze everything and we don’t need to sample.
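Cross-validating with correlated data usually means rows from the same customer (or student, or session) land on both sides of the train/test split. A minimal sketch of one guard against that, assuming each row carries a group label, is to deal out whole groups to folds. The function name and the toy labels are hypothetical; libraries such as scikit-learn offer a GroupKFold splitter built on the same idea.

```python
from collections import defaultdict


def group_k_fold(groups, k=3):
    """Yield (train_idx, test_idx) so that no group is split across the two.

    groups[i] is the group label (say, a customer id) of row i.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    # Deal whole groups into folds round-robin, in sorted order.
    folds = [[] for _ in range(k)]
    for n, g in enumerate(sorted(by_group)):
        folds[n % k].extend(by_group[g])

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, test


# Hypothetical: six rows from three customers, two rows each.
customer = ["a", "a", "b", "b", "c", "c"]
for train, test in group_k_fold(customer, k=3):
    print(train, test)
```

Knowing which column defines the group is the human part. The splitter cannot guess that two rows share a customer, a household, or a week.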

[Aside: We are always sampling. We sample in time. We sample from confluences of causes. Analyzing all the data that exists doesn’t mean that we have analyzed all the data that a particular data generation mechanism could generate.]

Care: that’s what humans bring to machine learning and machines do not. We care about the problem domain and the question we want to answer. We care about developing meaningful measures of things going on in that domain. We care about ensuring we don’t fool ourselves.

* Even if we can toolify data science, and I have no doubt we will move ever towards that, the tool vendors will still need data scientists. I predict continued full employment. But I may be fooling myself.

Does Malcolm Gladwell’s brand of storytelling have any lessons for data scientists? Or is it unscientific pop-sci pablum?

Gladwell specializes in uncovering exciting and surprising regularities about the world — you don’t need to reach a lot of people to spread your ideas (The Tipping Point), your intuition wields more power than you imagined (Blink), and success depends on historical or other accident as much as individual talent (Outliers).

[Gladwell] excels at telling just-so stories and cherry-picking science to back them. In “The Tipping Point” (2000), he enthused about a study that showed facial expressions to be such powerful subliminal persuaders that ABC News anchor Peter Jennings made people vote for Ronald Reagan in 1984 just by smiling more when he reported on him than when he reported on his opponent, Walter Mondale. In “Blink” (2005), Mr. Gladwell wrote that a psychologist with a “love lab” could watch married couples interact for just 15 minutes and predict with shocking accuracy whether they would divorce within 15 years. In neither case was there rigorous evidence for such claims. [Christopher Chabris, The Wall Street Journal]

On his blog, Chabris further critiques Gladwell’s approach, defining a hidden rule as “a counterintuitive, causal mechanism behind the workings of the world.” Social scientists like Chabris are all too well aware that to really know what’s happening causally in the world we need replicable experimentation, not cherry-picked studies wrapped up in overblown stories.

Humans love hidden rules. We want to know if there are some counterintuitive practices we should be following, practices that will make our personal and business lives rock.

Data scientists are often called upon to discover hidden rules. Predictive models potentially combine many more variables than our puny minds can handle, often doing so in interesting and unexpected ways. Predictive and other correlational analyses may identify counterintuitive rules that you might not follow if you didn’t have a machine helping you. We learned this from Moneyball. The player stats that baseball cognoscenti thought worked for identifying the best players turned out to be less effective than stats identified by predictive modeling in putting together a winning team.

I am sympathetic to Chabris’ complaints. When I build a predictive model, a natural urge is to deconstruct it and see what it is saying about regularities in our world. What hidden rules did it identify that we didn’t know about? How can we use those rules to work better? But the best predictive models often don’t tell us accurate or useful things about the world. They just make good predictions about what will happen — if the world keeps behaving like it behaved in the past. Using them to generate hidden, counterintuitive rules feels somehow wrong.

As those of you who are social scientists surely already know, ideas are like stone soup. Even a bad idea, if it gets you thinking, can move you forward. For example: is that 10,000 hour thing true? I dunno. We’ll see what happens to Steven Levitt’s golfing buddy. (Amazingly enough, Levitt says he’s spent 5000 hours practicing golf. That comes to 5 hours every Saturday . . . for 20 years. That’s a lot of golf! A lot lot lot lot of golf. Steven Levitt really really loves golf.) But, whether or not the 10,000-hour claim really has truth, it certainly gets you thinking about the value of practice. Chris Chabris and others could quite reasonably argue that everyone already knows that practice helps. But there’s something about that 10,000 hour number that sticks in the mind.

When we move from heuristic business rules to predictive models there’s a need to get people thinking with more depth and nuance about how the world works. Telling stories with predictive or other data analytic models can promote that, even if the stories are only qualifiedly true.

If the structure and outputs of a predictive model can be used to get people thinking in more creative and less rigid ways about their actions, I’m in favor. Doesn’t mean I’m going to let go of my belief in the ideal of experimentation or other careful research designs for figuring out what really works, but it does mean maybe there’s some truth to the proposition that data scientists should be storytellers. Finding and communicating hidden rules a la Gladwell can complement careful science.

In a tour de force on the opportunities and challenges of big data, Butterworth apparently demolishes the idea of small sample data analysis or (more questionable?) the use of anecdotes and thoughtfulness to argue points of controversy. But finding correlations in massive amounts of data doesn’t mean that the difficulty of finding causality (what’s really going on) has disappeared. It doesn’t mean we abandon anecdote and argument and thoughtful explanation. It means only that we can calculate correlations on bigger data sets. Sometimes, Angrist-and-Pischke style, we can do something akin to experiment. Still, such efforts require much more than mere counting, more than mere enumeration.

His first example, gender bias in the media:

Pre-Big Crit, you might have had pundits setting the air on fire with a mixture of anecdote and data; or a thoughtful article in The Atlantic or The Economist or Slate, reflecting a mixture of anecdote, academic observation and maybe a survey or two; or, if you were lucky, a content analysis of the media which looked for gender bias in several hundred or even several thousand news stories, and took a lot of time, effort, and money to undertake, and which—providing its methodology is good and its sample representative—might be able to give us a best possible answer within the bounds of human effort and timeliness.

The Bristol-Cardiff team, on the other hand, looked at 2,490,429 stories from 498 English language publications over 10 months in 2010. Not literally looked at—that would have taken them, cumulatively, 9.47 years, assuming they could have read and calculated the gender ratios for each story in just two minutes; instead, after ten months assembling the database, answering this question took about two hours. And yes, the media is testosterone fueled, with men dominating as subjects and sources in practically every topic analyzed from sports to science, politics to even reports about the weather. The closest women get to an equal narrative footing with men is—surprise—fashion. Closest. The malestream media couldn’t even grant women tokenistic majority status in fashion reporting. If HBO were to do a sitcom about the voices of this generation that reflected just who had the power to speak, it would, after aggregation, be called “Boys.”
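The 9.47-year figure in the passage above can be checked with a few lines of arithmetic. The sketch assumes someone reading around the clock, with a 365.25-day year; neither assumption is stated in the quoted article.

```python
stories = 2_490_429
minutes_per_story = 2

total_minutes = stories * minutes_per_story
total_hours = total_minutes / 60
# Assumes reading around the clock, 24 hours a day, 365.25 days a year.
total_years = total_hours / 24 / 365.25

print(round(total_years, 2))  # 9.47
```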

How is this useful analysis, that news stories are more likely to be about men than about women? And how is this evidence of gender bias in news stories? There is only gender bias here if the actual news had been unfairly represented by the stories, that is, if women in fact made as much news as men. Yet we know that women don’t, for a myriad of reasons. Women are busy with family. Women don’t face the same opportunity structures as men. Women face bias in the professional and political worlds. The presence of a lopsided gender ratio in magazine and news stories does not necessarily point to gender bias in journalism.

It’s just not that easy to tease truth out of numbers. I hate to restate a cliche everyone should already know and which is too often stated uncritically, but I will anyway. Correlation is not causation.

Patterns are not truth.

Big data does not, in fact, allow us to answer really big questions because most really big questions are questions about causality: Do women face unfair bias — is their unequal representation the result of bias apart from real world factors that would otherwise tend to reduce their representation (and in what context, what country, what career?) Or from my current job context — does social engagement improve academic outcomes (in what context, in what country, in what courses, in what classroom)? Big data is not so useful in answering such questions. Mere correlation in a specific context doesn’t tell you much. Broad-scale big data correlation, even less.

Here’s an example of a study that demonstrated the clear presence of gender bias. Merely changing the gender of a name from male to female on a resume led to lower rankings on hireability, competency, and mentoring. No big data required. This is essentially experimental: everything was held constant except the gender of the applicant’s name. Big data doesn’t make experiments that control for outside factors more likely. It may reduce their use if it makes us think that big data has something more to tell us than small-data experimentation.

To elaborate on that example from my current job: since students who post to discussion threads show better grades (let’s say they are more “engaged”), will increasing discussion thread posts improve grades? Maybe – but only in very limited contexts, where discussion thread participation actually improves students’ ability to make sense of content, produce better work, spend more time in class, and ultimately do better in class. In most cases, there is only correlation, not causation. Better students post more. They are more conscientious. They are already more engaged. You can make more discussion threads mandatory but I’m skeptical that will improve outcomes.
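A toy simulation can make the point concrete. In the sketch below, conscientiousness drives both posting and grades, and posting has no effect on grades at all. That causal structure, and every coefficient in it, is an assumption invented for illustration, not an estimate from real student data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n=5000, forced_posts=None, seed=0):
    """Return (posts, grades) for n simulated students.

    Assumption baked in: grades depend on conscientiousness only,
    never on posts. If forced_posts is set, every student posts that much.
    """
    rng = random.Random(seed)
    posts, grades = [], []
    for _ in range(n):
        conscientious = rng.gauss(0, 1)
        natural_posts = 5 + 2 * conscientious + rng.gauss(0, 1)
        posts.append(natural_posts if forced_posts is None else forced_posts)
        grades.append(75 + 8 * conscientious + rng.gauss(0, 5))
    return posts, grades


posts, grades = simulate()
print("correlation(posts, grades):", round(pearson(posts, grades), 2))

# Same students (same seed), but now everyone is required to post 20 times.
_, grades_forced = simulate(forced_posts=20)
print("mean grade, observed:", round(sum(grades) / len(grades), 1))
print("mean grade, 20 posts required:", round(sum(grades_forced) / len(grades_forced), 1))
```

The observational data show a strong positive correlation, yet mandating posts leaves grades untouched, because in this made-up world the posts never caused anything. Real classrooms may differ, which is why the question calls for an experiment.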

We can count and then we can correlate counts but to make sense of those associations requires more work. It requires understanding, explanation, context, sometimes anecdote — and ideally experimentation too.

I just watched this video of Hilary Mason* talking about data mining. Aside from the obvious thoughts of what I could have done with my life if (1) I had majored in computer science instead of philosophy/economics and (2) hadn’t spent all of the zeroes having babies, buying/selling houses, and living out an island retirement fantasy thirty years before my time, I found myself musing about her comments on the “data scientist” term. She said she’s gotten into arguments about it. I guess some people think it doesn’t really mean anything — it’s just hype — who needs it? Someone’s a computer scientist or a statistician or a business intelligence analyst, right? Why make up some new name?

I dunno, I rather like the term. My official title at work is “data scientist” — thank you to my management for that — and it seems more appropriate than statistician or business intelligence analyst or senior software developer or whatever else you might want to call me. The fact is, I do way more than statistical analysis. I know SQL all too well and (as my manager knows from my frequent complaints) spend 75% + of my time writing extract-transform-load code. I use traditional statistical methods like factor analysis and logistic regression (heavily) but if needed I use techniques from machine learning. I try to keep on top of the latest online learning research and I incorporate that into our analytics plans and models. Lately I’ve been spending time looking at what sort of big data architectures might support the scale of analytics we want to do. I don’t just need to know what statistical or ML methods to use — I need to figure out how to make them scalable and real-time and — this is critical — useful in the educational context. That doesn’t sound like pure statistics to me, so don’t just call me a statistician**.

I do way more than data analysis and I’m capable of way more, thanks to my meandering career path that’s taken me from risk assessment (heavy machinery accident analysis at Failure Analysis, now Exponent) to database app development (ERP apps at Oracle) to education (AP calculus and remedial algebra teaching at the Denver School of Science and Technology) and now to Pearson (online learning analytics). Along the way I earned a couple of degrees in mathematical statistics and applied statistics/research design/psychometrics.

Drew Conway's Venn diagram of data science

None of what I did made sense at the time I was wandering the path — and yet it all adds up to something useful and rare in my current position. Data science requires an alchemistic mixture of domain knowledge, data analysis capability, and a hacker’s mindset (see Drew Conway’s Venn diagram of data science reproduced here). Any term that only incorporates one or two of these circles doesn’t really capture what we do. I’m an educational researcher, a statistician, a programmer, a business analyst. I’m all these things.

In the end, I don’t really care what you call me, so long as I get the chance to ask interesting questions, gather the data to answer them, and then give you an answer you can use — an answer that is grounded in quantitative rigor and human meaning.

*Yes, I do have a girl-crush on Hilary. I think she’s awesome.

** Also, my kids cannot seem to pronounce the word “statistician.” I need a job title they can tell people without stumbling over it. I hope to inspire them to pursue careers that are as rewarding and engaging, intellectually and socially, as my own has been.

In my Ph.D. program, I learned all about how to analyze small data. I learned rules of thumb for how much data I needed to run a particular analysis, and what to do if I didn’t have enough. I worked with what seem now (and even seemed then) to be toy data sets, except they weren’t toys, because when you’re running a psychological experiment you might be lucky to have 30 participants and when you’re analyzing the whole world’s math performance (think TIMSS), you can do it with a data set less than a gigabyte in size.

image courtesy Victorian Web

I did some program evaluation while I finished my degree and sometimes colleagues would lament, “we don’t have enough data!” Usually, we did. We could thank Sir Ronald Aylmer Fisher for that. He worked with small data in agricultural settings and gave us the venerable ANOVA technique, which works fine with just a handful of cases per group (assuming balanced group sizes, normality, and homoscedasticity). We might give a nod to William Sealy Gosset, too, for introducing Student’s t distribution, which helps when the Central Limit Theorem hasn’t kicked in yet and brought us to normality.
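To show how little data ANOVA asks for, here is a sketch of the one-way F statistic computed by hand on three groups of three cases. The plot yields are invented, and the code stops at the F statistic; turning it into a p-value needs the F distribution, for example from a statistics library such as SciPy.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((x - mean_g) ** 2 for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical yields from three plots, three cases per group.
plots = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(plots)
print(round(f_stat, 2), df_b, df_w)  # 13.0 2 6
```

Here the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6, so F = 13 / 1 = 13.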

But Sir Ronald and Student can’t help me now. I’m down a rabbit hole… in some sort of web-scale wonderland of big data. I feel like Alice after drinking the magic potion, too small to reach the key on the table. The data is so much broader and bigger than I am, so much broader and bigger than my puny methods and my puny desktop R environment that wants to suck everything into memory in order to analyze it.

I stay awake at night thinking how to analyze all this data and deliver on its promise, how to analyze across schools and courses and so many, many students, not to mention all their clickstreams. How can I get through the locked door and experience the rest of wonderland when I’m so small and the data’s so big? I could sample the data, I think, and then I’d be in the realm where I’m comfortable, dealing with sampling distributions and generalizing to a population and applying the small-data methods I know already. Perhaps I can extract it by subset: by subsets of like courses, perhaps, or by school (I’m doing that already; it’s not scalable and doesn’t address some of the most interesting questions). What about trying out Revolution Analytics’ big data support for R? Or maybe I can apply haute big-data techniques: Hadoop-ify it (Hive, Pig, HBase???) then use simplistic (embarrassingly parallel) algorithms with MapReduce. Problem is, none of the methods that I like to use and that seem appropriate for educational settings (multilevel modeling for example) are easily parallelized. I’m stumped.
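To make "embarrassingly parallel" concrete, here is a sketch of a mean and variance computed MapReduce-style in plain Python rather than on Hadoop. Each chunk is summarized on its own and the summaries simply add. The chunking by school and the scores are hypothetical, and the sum-of-squares shortcut used here can lose precision on very large or very similar values.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarize one chunk as (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: summaries add, so order and grouping don't matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(chunks):
    """Population mean and variance from independently summarized chunks."""
    n, total, total_sq = reduce(combine, map(map_chunk, chunks))
    mean = total / n
    return mean, total_sq / n - mean ** 2


# Hypothetical scores stored in three separate chunks (say, one per school).
chunks = [[70, 80], [90], [60, 100]]
print(mean_and_variance(chunks))  # (80.0, 200.0)
```

Counts and sums split cleanly across machines because no chunk needs to know about any other. A multilevel model is harder because the estimate for one school borrows strength from all the others, so the pieces cannot be fitted in isolation and then added up.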

It’s okay to be stumped, I think — part of creation is living with uncertainty:

Most people who are not consummate creators avoid tension. They want quick answers. They don’t like living in the realm of not knowing something they want to know. They have an intolerance for those moments in the creative process in which you have no idea how to get from where you are to where you want to be. Actually, this is one of the very best moments there are. This is when something completely original can be born, when you go beyond your usual ways of addressing similar situations, where you can drive the creative process into high gear. [Robert Fritz on supercharging the creative process]

Alice ate some cake that made her bigger. Is there some cake that will make me and my methods big enough to answer the questions I want answered? For now I’m in the realm of not knowing but I hope in 2012 I will have some answers: first, answers about how to make myself big again, and second, answers from the data.