To own up to the title of data scientist, practitioners, vendors, and organizations must be held accountable for their use of the word science, just as every other scientific discipline is. What makes science such a powerful approach to discovery and prediction is that its definition is fully independent of human concerns. Yes, we apply science to the areas we care about, and we are not immune to bias or even falsification of results. But these deviations from the practice do not survive the scientific approach; they are weeded out by the self-consistent, testable mechanisms that underlie the scientific method. Science has a natural, self-correcting momentum, and the reason it can do this is simple: what survives is the truth. The truth, whether in line with our wishes or not, is simply the way the world works.

Opinions, tools of the trade, programming languages, and ‘best’ practices come and go, but what always survives is the underlying truth that governs how complex systems operate: the ‘thing’ that does work in real-world settings, the concept that does explain behavior with enough predictive accuracy to solve challenges and help organizations compete. This requires discovery — not engineered systems, business acumen, or vendor software. Those toolsets and approaches are only as powerful as the science that drives their execution and provides their modeled behavior. It is not a product that defines data science, but an intangible ability to conduct quality research that turns raw resources into usable technology.

Why are we doing this? To make our software better – to help it learn about the world and then, based on that learning, improve business outcomes:

The software of tomorrow isn’t ‘simple’ logic programmed into machines to produce some automated output. It uses probabilistic approaches and numerical and statistical methods to ‘learn’ behavior and act accordingly. The software of tomorrow is aware of the market in which it operates and takes actions in line with the models under its hood — models built from intense research on some underlying phenomenon the software interacts with. Science is now being called upon to be a directly involved piece of real-world products, and for that reason, like never before in history, the demand for science to help enterprises compete is exploding.

Any time someone equates data science with storytelling I get worked up. Science is not storytelling and neither is data science. There is science to figuring out how the world works and how to make things better based on knowing how it works.

Does Malcolm Gladwell’s brand of storytelling have any lessons for data scientists? Or is it unscientific pop-sci pablum?

Gladwell specializes in uncovering exciting and surprising regularities about the world — you don’t need to reach a lot of people to spread your ideas (The Tipping Point), your intuition wields more power than you imagined (Blink), and success depends on historical or other accident as much as individual talent (Outliers).

[Gladwell] excels at telling just-so stories and cherry-picking science to back them. In “The Tipping Point” (2000), he enthused about a study that showed facial expressions to be such powerful subliminal persuaders that ABC News anchor Peter Jennings made people vote for Ronald Reagan in 1984 just by smiling more when he reported on him than when he reported on his opponent, Walter Mondale. In “Blink” (2005), Mr. Gladwell wrote that a psychologist with a “love lab” could watch married couples interact for just 15 minutes and predict with shocking accuracy whether they would divorce within 15 years. In neither case was there rigorous evidence for such claims. [Christopher Chabris, The Wall Street Journal]

On his blog, Chabris further critiques Gladwell’s approach, defining a hidden rule as “a counterintuitive, causal mechanism behind the workings of the world.” Social scientists like Chabris are all too well aware that to really know what’s happening causally in the world we need replicable experimentation, not cherry-picked studies wrapped up in overblown stories.

Humans love hidden rules. We want to know if there are counterintuitive practices we should be following, practices that will make our personal and business lives rock.

Data scientists are often called upon to discover hidden rules. Predictive models potentially combine many more variables than our puny minds can handle, often doing so in interesting and unexpected ways. Predictive and other correlational analyses may identify counterintuitive rules that you might not follow if you didn’t have a machine helping you. We learned this from Moneyball. The player stats that baseball cognoscenti thought worked for identifying the best players turned out to be less effective than stats identified by predictive modeling in putting together a winning team.

I am sympathetic to Chabris’ complaints. When I build a predictive model, a natural urge is to deconstruct it and see what it is saying about regularities in our world. What hidden rules did it identify that we didn’t know about? How can we use those rules to work better? But the best predictive models often don’t tell us accurate or useful things about the world. They just make good predictions about what will happen — if the world keeps behaving like it behaved in the past. Using them to generate hidden, counterintuitive rules feels somehow wrong.

As those of you who are social scientists surely already know, ideas are like stone soup. Even a bad idea, if it gets you thinking, can move you forward. For example: is that 10,000 hour thing true? I dunno. We’ll see what happens to Steven Levitt’s golfing buddy. (Amazingly enough, Levitt says he’s spent 5000 hours practicing golf. That comes to 5 hours every Saturday . . . for 20 years. That’s a lot of golf! A lot lot lot lot of golf. Steven Levitt really really loves golf.) But, whether or not the 10,000-hour claim really has truth, it certainly gets you thinking about the value of practice. Chris Chabris and others could quite reasonably argue that everyone already knows that practice helps. But there’s something about that 10,000 hour number that sticks in the mind.

When we move from heuristic business rules to predictive models there’s a need to get people thinking with more depth and nuance about how the world works. Telling stories with predictive or other data analytic models can promote that, even if the stories are only qualifiedly true.

If the structure and outputs of a predictive model can be used to get people thinking in more creative and less rigid ways about their actions, I’m in favor. Doesn’t mean I’m going to let go of my belief in the ideal of experimentation or other careful research designs for figuring out what really works, but it does mean maybe there’s some truth to the proposition that data scientists should be storytellers. Finding and communicating hidden rules a la Gladwell can complement careful science.

In a tour de force on the opportunities and challenges of big data, Butterworth apparently demolishes the idea of small-sample data analysis or (more questionable?) the use of anecdote and thoughtfulness to argue points of controversy. But finding correlations in massive amounts of data doesn’t mean that the difficulty of finding causality — what’s really going on — has disappeared. It doesn’t mean we abandon anecdote and argument and thoughtful explanation. It means only that we can calculate correlations on bigger data sets. Sometimes — Angrist-and-Pischke style — we can do something akin to experiment. Still, such efforts require much more than mere counting, more than mere enumeration.

His first example, gender bias in the media:

Pre-Big Crit, you might have had pundits setting the air on fire with a mixture of anecdote and data; or a thoughtful article in The Atlantic or The Economist or Slate, reflecting a mixture of anecdote, academic observation and maybe a survey or two; or, if you were lucky, a content analysis of the media which looked for gender bias in several hundred or even several thousand news stories, and took a lot of time, effort, and money to undertake, and which—providing its methodology is good and its sample representative—might be able to give us a best possible answer within the bounds of human effort and timeliness.

The Bristol-Cardiff team, on the other hand, looked at 2,490,429 stories from 498 English language publications over 10 months in 2010. Not literally looked at—that would have taken them, cumulatively, 9.47 years, assuming they could have read and calculated the gender ratios for each story in just two minutes; instead, after ten months assembling the database, answering this question took about two hours. And yes, the media is testosterone fueled, with men dominating as subjects and sources in practically every topic analyzed from sports to science, politics to even reports about the weather. The closest women get to an equal narrative footing with men is—surprise—fashion. Closest. The malestream media couldn’t even grant women tokenistic majority status in fashion reporting. If HBO were to do a sitcom about the voices of this generation that reflected just who had the power to speak, it would, after aggregation, be called “Boys.”

How is this a useful analysis — that news stories are more likely to be about men than about women? And how is it evidence of gender bias in news stories? There is gender bias here only if the actual news has been unfairly represented by the stories — if somehow women made as much news as men. Yet we know that women don’t, for myriad reasons. Women are busy with family. Women don’t face the same opportunity structures as men. Women face bias in the professional and political worlds. The presence of a lopsided gender ratio in magazine and news stories does not necessarily point to gender bias in journalism.

It’s just not that easy to tease truth out of numbers. I hate to restate a cliche everyone should already know and which is too often stated uncritically, but I will anyway. Correlation is not causation.

Patterns are not truth.

Big data does not, in fact, allow us to answer really big questions, because most really big questions are questions about causality. Do women face unfair bias — is their unequal representation the result of bias, apart from real-world factors that would otherwise tend to reduce their representation (and in what context, what country, what career)? Or, from my current job context: does social engagement improve academic outcomes (in what context, in what country, in what courses, in what classroom)? Big data is not so useful in answering such questions. Mere correlation in a specific context doesn’t tell you much. Broad-scale big data correlation, even less.

Here’s an example of a study that demonstrated the clear presence of gender bias. Merely changing the gender of a name from male to female on a resume led to lower rankings on hireability, competency, and mentoring. No big data required. This is essentially experimental — everything was held constant except the gender of the applicant’s name. Big data doesn’t make experiments that control for outside factors more likely. It may reduce their use if it makes us think that big data has something more to tell us than small-data experimentation.
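The logic of that kind of experiment can be sketched with a permutation test — the ratings below are invented for illustration. Because the gendered name is randomly assigned, shuffling the labels simulates the null world in which the name made no difference, so we can ask how often chance alone would produce a gap as big as the one observed.

```python
import random

random.seed(1)

# Hypothetical hireability ratings (1-7 scale) for identical resumes
# differing only in the gender of the name. Numbers invented for illustration.
male = [5.5, 6.0, 5.0, 6.5, 5.5, 6.0, 5.0, 5.5]
female = [4.5, 5.0, 4.0, 5.5, 4.5, 5.0, 4.5, 4.0]

observed = sum(male) / len(male) - sum(female) / len(female)

# Permutation test: randomly reassign the "male"/"female" labels many times.
# If the name truly didn't matter, the observed gap should be unremarkable
# among the shuffled gaps.
pooled = male + female
more_extreme = 0
n_perms = 10_000
for _ in range(n_perms):
    random.shuffle(pooled)
    diff = sum(pooled[:8]) / 8 - sum(pooled[8:]) / 8
    if diff >= observed:
        more_extreme += 1

p_value = more_extreme / n_perms
print(f"observed gap = {observed:.2f}, permutation p = {p_value:.4f}")
```

The point is the design, not the arithmetic: randomization is what licenses the causal reading, and no amount of observational volume substitutes for it.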

To elaborate on that example from my current job: since students who post to discussion threads show better grades (let’s say they are more “engaged”), will increasing discussion thread posts improve grades? Maybe – but only in very limited contexts, where discussion thread participation actually improves students’ ability to make sense of content, produce better work, spend more time in class, and ultimately do better in class. In most cases, there is only correlation, not causation. Better students post more. They are more conscientious. They are already more engaged. You can make more discussion threads mandatory but I’m skeptical that will improve outcomes.
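A toy simulation (with invented numbers) makes the confounding story concrete: if conscientiousness drives both posting and grades, posts and grades correlate strongly even though posts have no causal effect on grades at all.

```python
import random

random.seed(0)

# Hypothetical data-generating process: conscientiousness causes BOTH
# discussion posts and grades; posts themselves do nothing.
n = 10_000
conscientiousness = [random.gauss(0, 1) for _ in range(n)]
posts = [c + random.gauss(0, 1) for c in conscientiousness]
grades = [c + random.gauss(0, 1) for c in conscientiousness]

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Posts and grades correlate at roughly 0.5, yet mandating more posts
# would change nothing: the association flows entirely through the confounder.
print(round(corr(posts, grades), 2))
```

This is exactly why “better students post more” and “posting makes students better” look identical in correlational data.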

We can count and then we can correlate counts but to make sense of those associations requires more work. It requires understanding, explanation, context, sometimes anecdote — and ideally experimentation too.

I just watched this video of Hilary Mason* talking about data mining. Aside from the obvious thoughts of what I could have done with my life if (1) I had majored in computer science instead of philosophy/economics and (2) hadn’t spent all of the zeroes having babies, buying/selling houses, and living out an island retirement fantasy thirty years before my time, I found myself musing about her comments on the “data scientist” term. She said she’s gotten into arguments about it. I guess some people think it doesn’t really mean anything — it’s just hype — who needs it? Someone’s a computer scientist or a statistician or a business intelligence analyst, right? Why make up some new name?

I dunno, I rather like the term. My official title at work is “data scientist” — thank you to my management for that — and it seems more appropriate than statistician or business intelligence analyst or senior software developer or whatever else you might want to call me. The fact is, I do way more than statistical analysis. I know SQL all too well and (as my manager knows from my frequent complaints) spend 75%+ of my time writing extract-transform-load code. I use traditional statistical methods like factor analysis and logistic regression (heavily), but when needed I use techniques from machine learning. I try to keep on top of the latest online learning research, and I incorporate it into our analytics plans and models. Lately I’ve been spending time looking at what sort of big data architectures might support the scale of analytics we want to do. I don’t just need to know which statistical or ML methods to use — I need to figure out how to make them scalable and real-time and — this is critical — useful in the educational context. That doesn’t sound like pure statistics to me, so don’t just call me a statistician**.

I do way more than data analysis and I’m capable of way more, thanks to my meandering career path that’s taken me from risk assessment (heavy machinery accident analysis at Failure Analysis, now Exponent) to database app development (ERP apps at Oracle) to education (AP calculus and remedial algebra teaching at the Denver School of Science and Technology) and now to Pearson (online learning analytics). Along the way, I earned a couple of degrees in mathematical statistics and applied statistics/research design/psychometrics.

Drew Conway's Venn diagram of data science

None of what I did made sense at the time I was wandering the path — and yet it all adds up to something useful and rare in my current position. Data science requires an alchemistic mixture of domain knowledge, data analysis capability, and a hacker’s mindset (see Drew Conway’s Venn diagram of data science reproduced here). Any term that only incorporates one or two of these circles doesn’t really capture what we do. I’m an educational researcher, a statistician, a programmer, a business analyst. I’m all these things.

In the end, I don’t really care what you call me, so long as I get the chance to ask interesting questions, gather the data to answer them, and then give you an answer you can use — an answer that is grounded in quantitative rigor and human meaning.

*Yes, I do have a girl-crush on Hilary. I think she’s awesome.

** Also, my kids cannot seem to pronounce the word “statistician.” I need a job title they can tell people without stumbling over it. I hope to inspire them to pursue careers that are as rewarding and engaging, intellectually and socially, as my own has been.

In The Magicians[1], Lev Grossman describes magic as it might exist, but he could as well be describing the real-world practice of statistical analysis or software development:

As much as it was like anything, magic was like a language. And like a language, textbooks and teachers treated it as an orderly system for the purposes of teaching it, but in reality it was complex and chaotic and organic. It obeyed rules only to the extent that it felt like it, and there were almost as many special cases and one-time variations as there were rules. These Exceptions were indicated by rows of asterisks and daggers and other more obscure typographical fauna which invited the reader to peruse the many footnotes that cluttered up the margins of magical reference books like Talmudic commentary.

It was Mayakovsky’s [the teacher’s] intention to make them memorize all these minutiae, and not only to memorize them but to absorb and internalize them. The very best spellcasters had talent, he told his captive, silent audience, but they also had unusual under-the-hood mental machinery, the delicate but powerful correlating and cross-checking engines necessary to access and manipulate and manage this vast body of information. (p149)

To be a good data scientist, whether using traditional statistical techniques or machine learning algorithms (or both), you must know all the rules and approach it first as an orderly system. Then you begin to learn all the special cases and one-time variations and you study and study and practice and practice until you can almost unconsciously adjust to each unique situation that arises.

When I took ANOVA in my Ph.D. program, I could hardly believe there was an entire course devoted to it. But it was much like Grossman’s description above. Each week we learned new special cases and one-time variations. I did ANOVA in so many different Circumstances that now I have absorbed and internalized its application, as well as the design of studies that would usefully be analyzed with it or with some more flexible variation of it (e.g., hierarchical linear modeling). It felt cookbook at the beginning, but by the end of the course I felt like I’d begun to develop that “unusual under-the-hood mental machinery” that Grossman suggested an effective magician in his imagined world would need.

That’s not to say that there aren’t important universal principles and practices and foundational knowledge to understand if you are to be an effective statistician or data miner or machine learning programmer; it’s not to say that awareness of Circumstances and methodical practice are all you need. It is to say that data science is ultimately a practice, not a philosophy, and you reach expertise in it by doing things over and over again, each time in slightly different ways.

In The Magicians, protagonist Quentin practices Legrand’s Hammer Charm, under thousands of different Circumstances:

Page by page the Circumstances listed in the book became more and more esoteric and counterfactual. He cast Legrand’s Hammer Charm at noon and at midnight, in summer and winter, on mountaintops and a thousand yards beneath the earth’s surface. He cast the spell underwater and on the surface of the moon. He cast it in early evening during a blizzard on a beach on the island of Mangareva, which would almost certainly never happen since Mangareva is part of French Polynesia, in the South Pacific. He cast the spell as a man, as a woman, and once–was this really relevant?–as a hermaphrodite. He cast it in anger, with ambivalence, and with bitter regret. (pp150-151)

Sometimes I feel like I have fit logistic regression in all these situations (perhaps not as a hermaphrodite). The next logistic regression I fit, I will say to myself “Wax on, wax off” as Quentin did when faced with a new spell that he had to practice according to each set of Circumstances.

[1]Highly recommended, but with caveats. Read it last summer — loved it — sent it to my 15-year-old son at camp. He loved it too and bought me the sequel for Christmas. After reading the second one, I had to re-read the first. It’s a polarizing book. Don’t pick it up if you are offended by heavy drinking, gratuitous sex, and a wandering plot. Do pick it up if you felt like your young adulthood was marked by heavy drinking, gratuitous sex, a wandering plot, and not nearly enough magic. My son tends to read adult books so I didn’t hesitate to share it with him, but it probably would not be appropriate for most teenagers.

In my Ph.D. program, I learned all about how to analyze small data. I learned rules of thumb for how much data I needed to run a particular analysis, and what to do if I didn’t have enough. I worked with what seem now (and even seemed then) to be toy data sets, except they weren’t toys, because when you’re running a psychological experiment you might be lucky to have 30 participants, and when you’re analyzing the whole world’s math performance (think TIMSS), you can do it with a data set less than a gigabyte in size.


I did some program evaluation while I finished my degree, and sometimes colleagues would lament, “we don’t have enough data!” Usually, we did. We could thank Sir Ronald Aylmer Fisher for that. He worked with small data in agricultural settings and gave us the venerable ANOVA technique, which works fine with just a handful of cases per group (assuming balanced group sizes, normality, and homoscedasticity). We might also give a nod to William Sealy Gosset for introducing Student’s t distribution, which helps when the Central Limit Theorem hasn’t yet kicked in to bring us to normality.
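As a sketch of how little data ANOVA needs, here is Fisher’s one-way F statistic computed by hand on three hypothetical plots with five yields each (the numbers are invented for illustration).

```python
# Hand-rolled one-way ANOVA -- the small-data workhorse Fisher gave us.
# Toy yields from three hypothetical agricultural plots, five cases per group.
groups = {
    "plot_a": [4.1, 4.8, 5.0, 4.4, 4.7],
    "plot_b": [5.9, 6.2, 5.5, 6.0, 6.4],
    "plot_c": [4.3, 4.0, 4.6, 4.9, 4.2],
}

all_vals = [v for g in groups.values() for v in g]
grand_mean = sum(all_vals) / len(all_vals)

# Between-group sum of squares: how far each group mean sits from the grand mean.
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)
# Within-group sum of squares: scatter of cases around their own group mean.
ss_within = sum(
    sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups.values()
)

df_between = len(groups) - 1             # k - 1
df_within = len(all_vals) - len(groups)  # N - k
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
```

Fifteen observations total, and the F ratio already speaks clearly — which is the whole charm of the small-data tradition.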

But Sir Ronald and Student can’t help me now. I’m down a rabbit hole… in some sort of web-scale wonderland of big data. I feel like Alice after drinking the magic potion, too small to reach the key on the table. The data is so much broader and bigger than I am, so much broader and bigger than my puny methods and my puny desktop R environment that wants to suck everything into memory in order to analyze it.

I stay awake at night thinking about how to analyze all this data and deliver on its promise — how to analyze across schools and courses and so many, many students, not to mention all their clickstreams. How can I get through the locked door and experience the rest of wonderland when I’m so small and the data’s so big? I could sample the data, I think, and then I’d be in the realm where I’m comfortable, dealing with sampling distributions and generalizing to a population and applying the small-data methods I already know. Perhaps I can extract it by subset — subsets of like courses, perhaps, or by school (I’m doing that already — not scalable, and it doesn’t address some of the most interesting questions). What about trying out Revolutions’ big data support for R? Or maybe I can apply haute big-data techniques: Hadoop-ify it (Hive, Pig, HBase???) and then use simplistic (embarrassingly parallel) algorithms with MapReduce. The problem is, none of the methods I like to use and that seem appropriate for educational settings (multilevel modeling, for example) are easily parallelized. I’m stumped.
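The “embarrassingly parallel” idea can be sketched in a few lines: many statistics reduce to per-chunk partial sums that combine associatively, so each mapper only ever sees its own shard of data too big for one machine’s memory (the chunks and numbers here are invented).

```python
from functools import reduce

# MapReduce-style sufficient statistics: each "mapper" computes
# (count, sum, sum of squares) for its chunk; a "reducer" adds them up.
def map_chunk(chunk):
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def reduce_stats(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

# Pretend these chunks live on different workers.
chunks = [[2.0, 4.0], [6.0, 8.0], [10.0]]
n, s, ss = reduce(reduce_stats, map(map_chunk, chunks))

mean = s / n
variance = ss / n - mean ** 2  # population variance from the partial sums
print(mean, variance)
```

Means, counts, and simple regressions parallelize this way; the iterative, hierarchical fitting in multilevel models does not decompose so cleanly — which is exactly the bind described above.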

It’s okay to be stumped, I think — part of creation is living with uncertainty:

Most people who are not consummate creators avoid tension. They want quick answers. They don’t like living in the realm of not knowing something they want to know. They have an intolerance for those moments in the creative process in which you have no idea how to get from where you are to where you want to be. Actually, this is one of the very best moments there are. This is when something completely original can be born, when you go beyond your usual ways of addressing similar situations, where you can drive the creative process into high gear. [Robert Fritz on supercharging the creative process]

Alice ate some cake that made her bigger. Is there some cake that will make me and my methods big enough to answer the questions I want answered? For now I’m in the realm of not knowing but I hope in 2012 I will have some answers: first, answers about how to make myself big again, and second, answers from the data.

The actual working title of my dissertation is: Modeling Social Participation as Predictive of Life Satisfaction and Social Connectedness: Scale or Index?

When I tell people my topic, I usually start with the domain area: social participation as related to life satisfaction in older U.S. adults (my data set is people age 65 and over from the Health and Retirement Study), but really, the topic is a statistical and measurement one. Participation happens to be something I’m personally interested in and fits the statistical problem area, but I could do this same project in a variety of domains with a range of constructs. Maybe I ought to change my elevator speech to start with the statistical/measurement part.

Most psychometrics concerns itself with the measurement of latent psychological constructs like attitudes, intelligence, academic achievement and so forth. Psychometricians have developed sophisticated means of constructing instruments (surveys or assessments, for example) that can measure these latent constructs. The approach taken is often based on either classical test theory or item response theory. Either way, the assumption is that observed data (such as a student’s answers to test questions or a subject’s survey responses) are caused by whatever unobserved trait is intended to be measured.

However, there are some things we want to measure that don’t fit this model. Social participation is one of them. Participation instruments generally ask the respondent to report his or her level of participation in various activities. In a latent factor setting, you would then assume some underlying level of participation that gave rise to the observed frequencies of participation. That’s not quite right though. If someone increases their participation in some area — say by joining an investment club — their overall level of participation goes up. The increase in participation in the investment club seems causally prior to the increase in overall participation. This is the opposite direction of causality than that proposed by traditional psychometric models.

Some people call a measurement instrument built by summing disparate items an index rather than a scale; a scale is one that follows the latent factor model. Index development follows a so-called formative measurement model, in which what you’re trying to measure is formed from what you observe. Scale development, in contrast, follows a reflective measurement model, in which what you observe reflects the underlying latent factor of interest. In the diagram, the first figure represents formative measurement (observed indicators x1–x3 cause the latent construct eta 1) and the second represents reflective measurement (observed indicators y1–y3 reflect the level of the latent construct).

There has been plenty of criticism of formative measurement, but I think it can be made useful, and that’s the aim of my dissertation project. I’m now at the analysis stage and just beginning to really understand the usefulness and potential of formative indexes.

As an aside, I don’t like to call formative measurement “measurement.” I prefer to think of it as “modeling.” I think what you’re doing with index development is constructing a one- or few-number summary of a lot of individual data items in a way that predicts outcomes of interest. Think of the Apgar score as a good example. It gives you a one number summary of the health of the baby and its likelihood to survive and thrive, but you’re not measuring one thing in particular about the baby. Well, maybe you are measuring overall health. Hmmmm.
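The Apgar example can be sketched as a tiny formative index (scoring details simplified for illustration): each observed sign is scored 0–2 and the index is simply their sum. The items compose the score; no single latent trait is assumed to cause them.

```python
# A minimal formative-index sketch modeled on the Apgar score.
# Each sign is observed directly and scored 0, 1, or 2; the index is the sum.
SIGNS = ["appearance", "pulse", "grimace", "activity", "respiration"]

def apgar(scores):
    """Sum the five sign scores into a 0-10 index."""
    assert set(scores) == set(SIGNS), "all five signs must be scored"
    assert all(s in (0, 1, 2) for s in scores.values()), "each sign is 0-2"
    return sum(scores.values())

# Hypothetical newborn: strong on most signs, slightly reduced reflex response.
baby = {"appearance": 2, "pulse": 2, "grimace": 1, "activity": 2, "respiration": 2}
print(apgar(baby))  # → 9
```

Note the direction of the arrows: raising any one component raises the index, just as joining an investment club raises overall participation — the causality runs from indicators to construct, not the other way around.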