To live up to the title of data scientist, practitioners, vendors and organizations must be held accountable for using the term science, just as is expected in every other scientific discipline. What makes science such a powerful approach to discovery and prediction is the fact that its definition is fully independent of human concerns. Yes, we apply science to the areas we are interested in, and we are not immune to bias and even falsification of results. But these deviations in practice do not survive the scientific approach. They are weeded out by the self-consistent and testable mechanisms that underlie the scientific method. There is a natural momentum to science that self-corrects, and its ability to do this is fully understandable because what survives is the truth. The truth, whether in line with our wishes or not, is simply the way the world works.

Opinions, tools of the trade, programming languages and ‘best’ practices come and go, but what always survives is the underlying truth that governs how complex systems operate. That ‘thing’ that does work in real-world settings. That concept that does explain the behavior with enough predictive accuracy to solve challenges and help organizations compete. This requires discovery; not engineered systems, business acumen, or vendor software. Those toolsets and approaches are only as powerful as the science that drives their execution and provides them their modeled behavior. It is not a product that defines data science, but an intangible ability to conduct quality research that turns raw resources into usable technology.

Why are we doing this? To make our software better – to help it learn about the world and then, based on that learning, improve business outcomes:

The software of tomorrow isn’t programming ‘simple’ logic into machines to produce some automated output. It is using probabilistic approaches and numerical and statistical methods to ‘learn’ the behavior and act accordingly. The software of tomorrow is aware of the market in which it operates and takes actions that are in line with the models sitting under its hood; models that have been built from intense research on some underlying phenomenon that the software interacts with. Science is now being called upon to be a directly involved piece of real-world products and for that reason, like never before in history, the demand for ushering in science to help enterprises compete is exploding.

Any time someone equates data science with storytelling I get worked up. Science is not storytelling and neither is data science. There is science to figuring out how the world works and how to make things better based on knowing how it works.

Does Malcolm Gladwell’s brand of storytelling have any lessons for data scientists? Or is it unscientific pop-sci pablum?

Gladwell specializes in uncovering exciting and surprising regularities about the world — you don’t need to reach a lot of people to spread your ideas (The Tipping Point), your intuition wields more power than you imagined (Blink), and success depends on historical or other accident as much as individual talent (Outliers).

[Gladwell] excels at telling just-so stories and cherry-picking science to back them. In “The Tipping Point” (2000), he enthused about a study that showed facial expressions to be such powerful subliminal persuaders that ABC News anchor Peter Jennings made people vote for Ronald Reagan in 1984 just by smiling more when he reported on him than when he reported on his opponent, Walter Mondale. In “Blink” (2005), Mr. Gladwell wrote that a psychologist with a “love lab” could watch married couples interact for just 15 minutes and predict with shocking accuracy whether they would divorce within 15 years. In neither case was there rigorous evidence for such claims. [Christopher Chabris, The Wall Street Journal]

On his blog, Chabris further critiques Gladwell’s approach, defining a hidden rule as “a counterintuitive, causal mechanism behind the workings of the world.” Social scientists like Chabris are all too well aware that to really know what’s happening causally in the world we need replicable experimentation, not cherry-picked studies wrapped up in overblown stories.

Humans love hidden rules. We want to know if there are some counterintuitive practices we should be following, practices that will make our personal and business lives rock.

Data scientists are often called upon to discover hidden rules. Predictive models potentially combine many more variables than our puny minds can handle, often doing so in interesting and unexpected ways. Predictive and other correlational analyses may identify counterintuitive rules that you might not follow if you didn’t have a machine helping you. We learned this from Moneyball. The player stats that baseball cognoscenti thought worked for identifying the best players turned out to be less effective than stats identified by predictive modeling in putting together a winning team.

I am sympathetic to Chabris’ complaints. When I build a predictive model, a natural urge is to deconstruct it and see what it is saying about regularities in our world. What hidden rules did it identify that we didn’t know about? How can we use those rules to work better? But the best predictive models often don’t tell us accurate or useful things about the world. They just make good predictions about what will happen — if the world keeps behaving like it behaved in the past. Using them to generate hidden, counterintuitive rules feels somehow wrong.

As those of you who are social scientists surely already know, ideas are like stone soup. Even a bad idea, if it gets you thinking, can move you forward. For example: is that 10,000 hour thing true? I dunno. We’ll see what happens to Steven Levitt’s golfing buddy. (Amazingly enough, Levitt says he’s spent 5000 hours practicing golf. That comes to 5 hours every Saturday . . . for 20 years. That’s a lot of golf! A lot lot lot lot of golf. Steven Levitt really really loves golf.) But, whether or not the 10,000-hour claim really has truth, it certainly gets you thinking about the value of practice. Chris Chabris and others could quite reasonably argue that everyone already knows that practice helps. But there’s something about that 10,000 hour number that sticks in the mind.

When we move from heuristic business rules to predictive models there’s a need to get people thinking with more depth and nuance about how the world works. Telling stories with predictive or other data analytic models can promote that, even if the stories are only qualifiedly true.

If the structure and outputs of a predictive model can be used to get people thinking in more creative and less rigid ways about their actions, I’m in favor. Doesn’t mean I’m going to let go of my belief in the ideal of experimentation or other careful research designs for figuring out what really works, but it does mean maybe there’s some truth to the proposition that data scientists should be storytellers. Finding and communicating hidden rules a la Gladwell can complement careful science.

I’m ready for spring. I’ve almost finished winter quarter classes — just the writeup of my HLM project left to turn in. The snow is finally melting off my front lawn. The crabapple trees are putting out buds.

Once having traversed the threshold, the hero moves in a dream landscape of curiously fluid, ambiguous forms, where he must survive a succession of trials…. The original departure into the land of trials represented only the beginning of the long and really perilous path of initiatory conquests and moments of illumination. Dragons have now to be slain and surprising barriers passed – again, again, and again. Meanwhile there will be a multitude of preliminary victories, unretainable ecstasies and momentary glimpses of the wonderful land.

It did feel like a dream landscape, absurdly odd with strange events and people and strange projects too. The trials came from inside me: sticking out a class that I despised when I’m used to quitting whatever doesn’t please me, confronting demons from my past as I spiraled back on old decisions and events that led me to this place, stifling my familiar ways of approach so that I could build new relationships for the journey ahead.

Were there “preliminary victories, unretainable ecstasies and momentary glimpses of the wonderful land”? Yes and yes and yes. My professor’s telling me my project had a good chance of yielding publishable results, then sharing my proposal with the class as an example of how one should be done. A raucously funny altercation in psychometrics that made me laugh so hard I really did cry, not just over the unexpected outburst but because of the good friends I have more unexpectedly found. Monthly department meetings, begun at my suggestion, where, for the first time, I could imagine myself as some sort of academic. Dreaded group projects that turned into opportunities for intellectual sparring and companionship, things I have so thoroughly missed over the past ten years.

Thinking into the future of this mythical story, I wonder what Atonement with the Father could possibly mean for me. I know it doesn’t mean atonement with my actual father, as we are not opposed. I think what it means is atonement with the father inside me. Figures in the hero’s journey are symbolic, not literal. The Goddess is not always female, woman as temptress doesn’t necessarily mean some other woman (isn’t the temptress inside me? yes), and each person has masculine and feminine inside.

For the past ten years, the mother inside me has been dominant, as I had babies and raised them and put work mostly to the side. Even before that, I made choices from the female side of myself: dropping my plan to get a Ph.D. in favor of my boyfriend’s career and our plans to buy a house, having my first child at a relatively early age thus stalling my career progress, moving to Virginia to be closer to family, choosing to have a third child even though with only two I could have gotten back to work earlier.

I can even see how my stereotypically female side has actually been in charge for far longer than that; before I married, I lost myself in romance again and again.

It would be too pat if spring brought atonement and rebirth. More likely, I will have to keep slaying dragons and passing surprising barriers as I look for the ultimate treasure. And what is that treasure? Perhaps I have already found it. As Michael Foley in The Age of Absurdity says, it is the journey itself:

The search for meaning is itself the meaning, the Way is the destination, the quest is the grail.

Given the strong requirements in terms of model specification and measurement, the enterprise of “opening the black box” or “exploring causal pathways” using endogenous mediators is largely a rhetorical exercise.

But what is social science anyway? To what extent can we find the “truth” about complex social systems that involve agents with free will and myriad complex, interlinked influences on them?

Perhaps social science is just rhetoric of an advanced sort, carefully constructed arguments based on theory, prior research, data analysis and hunches that describe how the world might work. Over time, some of these arguments are shown to be false, so (ideally) we fix the story up and make it better fit what we’ve observed and what we can deduce from the build-up of evidence and argument so far.

An example from the psychology of narrative identity processing

Pals’ (2006) study of narrative identity processing and adult development is an example of mediation analysis as advanced rhetoric. Here’s the abstract:

Difficult life experiences in adulthood constitute a challenge to the narrative construction of identity. Individual differences in how adults respond to this challenge were conceptualized in terms of two dimensions of narrative identity processing: exploratory narrative processing and coherent positive resolution. These dimensions, coded from narratives of difficult experiences reported by the women of the Mills Longitudinal Study (Helson, 1967) at age 52, were expected to be related to personality traits and to have implications for pathways of personality development and physical health. First, the exploratory narrative processing of difficult experiences mediated the relationship between the trait of coping openness in young adulthood (age 21) and the outcome of maturity in late midlife (age 61). Second, coherent positive resolution predicted increasing ego-resiliency between young adulthood and midlife (age 52), and this pattern of increasing ego-resiliency, in turn, mediated the relationship between coherent positive resolution and life satisfaction in late midlife. Finally, the integration of exploratory narrative processing and coherent positive resolution predicted positive self-transformation within narratives of difficult experiences. In turn, positive self-transformation uniquely predicted optimal development (composite of maturity and life satisfaction) and physical health.

This study was correlational, so that’s the first reason that strict causalists would dispose of it. It also studied mediation, so even if it were some sort of randomized experiment, there would be questions about its suggestions of causality. But the researcher doesn’t just run the mediational analysis and then declare that she’s shown what she wanted to show. She places the correlational findings in the context of theory and makes an overall argument for her hypothesis while noting the limitations of the approach:

A second limitation of this study is that although the hypotheses reflect theoretically driven ideas about cause-effect relations (e.g., coping openness stimulates exploratory narrative processing; coherent positive resolution leads to increased ego-resiliency), the correlational design did not allow for analyses that would support conclusive statements regarding causality. The longitudinal findings were consistent with causal patterns unfolding over time but did not prove them. Thus, an important direction for future research on narrative identity processing will be to examine its causal impact, ideally through studies that closely examine the connection between changes in narrative identity and changes in relevant outcomes. In one recent study, for example, individuals who wrote about a traumatic experience for several days displayed an increase in self-reported personal growth and self-acceptance, whereas those who wrote about trivial topics did not show this pattern of positive self-transformation (Hemenover, 2003). This finding supports the idea that when people fully engage in the narrative processing of a difficult experience, their understanding of themselves and their lives may transform in ways that will make them more mature, resilient, and satisfied with their lives. Findings such as these reflect the growing view that the narrative interpretation of past experiences—the cornerstone of narrative identity—constitutes one way adults may intentionally guide development and bring about change in their lives (Bauer et al., 2005).

Is this research useful even if causality and mediation have not been proven? I think it is. We don’t know for sure which way causality runs among the various traits and behaviors studied (it probably runs in multiple directions), but Pals makes a good argument that someone with coping openness may engage in exploratory narrative processing of difficult life events and this, in turn, may drive a maturing process. In the second mediational hypothesis, she argues that developing coherent positive resolutions in that narrative processing of life events might lead to increased ego-resiliency. Are these analyses and arguments of practical use? I think yes.

Development of the stories should use an open and exploratory style rather than closed and defensive.

The ending of the story should reflect some sort of positive resolution.

So mediation analysis, even of the non-experimental sort, can be useful. Okay so maybe it’s not like the scientific finding that lack of Vitamin C causes scurvy, but that doesn’t make it useless or unscientific.
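For readers who haven’t run one, here’s a minimal sketch of what a basic regression-based mediation analysis computes. To be clear about the assumptions: this is the textbook product-of-coefficients approach on simulated data, not Pals’ actual procedure or data. The variable names (x, m, y), the helper functions and the coefficient values used to generate the data are all made up for illustration.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Population covariance of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def slope(x, y):
    """OLS slope of y on x (intercept included implicitly by centering)."""
    return cov(x, y) / cov(x, x)

def two_predictor_slopes(x, m, y):
    """OLS slopes of y on x and m together, from the 2x2 normal equations."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    det = sxx * smm - sxm ** 2
    b_x = (sxy * smm - smy * sxm) / det
    b_m = (smy * sxx - sxy * sxm) / det
    return b_x, b_m

def mediation(x, m, y):
    a = slope(x, m)                             # path from predictor to mediator
    total = slope(x, y)                         # total association of x with y
    direct, b = two_predictor_slopes(x, m, y)   # x -> y and m -> y, each adjusted for the other
    return {"a": a, "b": b, "total": total, "direct": direct, "indirect": a * b}

# Simulated data (hypothetical): x influences m, and m influences y.
rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.4 * mi + 0.1 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

result = mediation(x, m, y)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this linear setup the total effect splits exactly into the direct effect plus the indirect effect (a times b). That decomposition is just arithmetic on covariances. It would come out the same way on data where causality ran in some other direction, which is why the causal story has to come from theory and design rather than from the regression itself.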

Philosophers of science would have something more sophisticated to say about this. My point is that science doesn’t happen exactly according to the “scientific method” you learned in high school. In many ways it is just advanced rhetoric that’s (ideally) grounded in careful analysis, thoughtful theorizing, and an understanding of prior research.

References

Green, D. P., Ha, S. E., and Bullock, J. G. (2009). Enough Already About “Black Box” Experiments: Studying Mediation is More Difficult than Most Scholars Suppose. Annals of the American Academy of Political and Social Science, 628, 200-08. Available at SSRN: http://ssrn.com/abstract=1544416

Pals, J. L. (2006). Narrative identity processing of difficult life experiences: Pathways of personality development and positive self-transformation in adulthood. Journal of Personality, 74(4).
