Attack of the numbers

There is something tremendously interesting to say, I expect, about the general increase in the amount of data we are able to collect, analyse and understand, in all walks of life. The buzzword (or buzzexpression, I suppose) for this is “Big Data,” and the NYT has a decent summary of the phenomenon here. Excerpt:

There is a lot more data, all the time, growing at 50 percent a year, or more than doubling every two years, estimates IDC, a technology research firm. It’s not just more streams of data, but entirely new ones. For example, there are now countless digital sensors worldwide in industrial equipment, automobiles, electrical meters and shipping crates. They can measure and communicate location, movement, vibration, temperature, humidity, even chemical changes in the air.

Thing is, I’ve been wanting to blog about this data phenomenon for a long while now, but I’m still not sure what that tremendously interesting thing to say about Big Data actually is. Perhaps part of it is the sheer range of applications of data-intensive research techniques. Besides scientific and social-scientific research, Data is branching out into the arts and sport as well. Moneyball (book and film) is the obvious example for sport, but it seems that all of the major American sports have their arcane sets of statistics by which to judge players. I can’t say I know much about them, although I gained a little appreciation for American football statistics during the most recent season.

In British sport, cricket is tremendously well suited to statistical analysis, although it’s also a great example of the limitations inherent in reducing performance to numerical outputs. For instance, the statistics would suggest that Stuart Broad and Andrew Flintoff are roughly equivalent as all-rounders: Broad has a slightly better bowling average (31.25 vs. 32.78) but a slightly worse batting average (28.32 vs. 31.77). However, Flintoff is an absolute colossus of English cricket while Broad, now 26, remains, I would say, a good but not great player. Broad, of course, still has time – Flintoff was 28 when he hit his pinnacle in the 2005 Ashes – but I very much doubt that Broad will ever sway a match, and certainly not a series, to the extent that Flintoff was then able (his team’s most wickets and third-most runs, in the greatest series ever played). So we still have to rely on intangibles in cricket – and despite Moneyball, I’m sure they still do to some extent in baseball.
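For anyone unfamiliar with how those averages are arrived at, they are just simple ratios – which is part of why they flatten so much context. A minimal sketch, using the standard cricket definitions but with made-up career totals (the real runs-and-wickets figures behind Broad’s and Flintoff’s numbers aren’t quoted here):

```python
# Standard cricket definitions:
#   bowling average = runs conceded / wickets taken   (lower is better)
#   batting average = runs scored  / times dismissed  (higher is better)
# The totals below are invented purely for illustration.

def bowling_average(runs_conceded: int, wickets: int) -> float:
    return runs_conceded / wickets

def batting_average(runs_scored: int, dismissals: int) -> float:
    return runs_scored / dismissals

# A hypothetical all-rounder: 5000 runs conceded for 160 wickets,
# 3400 runs scored over 120 dismissals.
print(round(bowling_average(5000, 160), 2))  # 31.25
print(round(batting_average(3400, 120), 2))  # 28.33
```

Note what the ratio throws away: a bowler’s wickets in a dead rubber count exactly the same as wickets that swing an Ashes series.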

Football (soccer), though, is a much tougher realm for statistical analysis. Some players’ and teams’ real-life dominance is reflected in statistics: see Ronaldo and Messi’s goalscoring over the past couple of seasons. But most other metrics aren’t much help. It may be a freak case, and we probably need more data before we can understand it better, but since Martin O’Neill took over at Sunderland they have hardly improved on many counts, have got worse on others, yet have seen a complete turnaround in results. From Jonathan Wilson at the Guardian:

Sunderland score 1.44 goals a game, the fifth best in the league, under O’Neill, as opposed to 1.15, the 13th best, under Bruce. That again, though, is a result. A difference in process is harder to ascertain. Opta’s stats show that Sunderland have fewer shots under O’Neill (the second worst in the league, remarkably) than they did under Bruce; they have less possession (42% as opposed to 47%) and make fewer passes, of which they complete fewer; they put in fewer crosses (the 12th most as opposed to the third most); and they win a lower percentage of their tackles.

We go on to discover that they may well be winning lucky – they score a lot from outside the box – but it also seems completely clear from match reports that they look a better-organized, more confident side. But no statistic seems able to express that confidence and organization in a way that would make it comparable to other teams. [Update: here’s a very detailed look at football stats from Simon Kuper].

Finally then, arts. This NYT article looks at efforts to understand the English language as used in speech, fiction and other forms of writing:

Has a vernacular style become the standard for the typical fiction writer? Or is literary language still a distinct and peculiar beast?

Scholars in the growing field of digital humanities can tackle this question by analyzing enormous numbers of texts at once. When books and other written documents are gathered into an electronic corpus, one “subcorpus” can be compared with another: all the digitized fiction, for instance, can be stacked up against other genres of writing, like news reports, academic papers or blog posts.

There already seem to be some interesting findings emerging, which give insight into the originality (or otherwise) of literary writing. Researchers have been able to identify authors’ coinages and adaptations of earlier phrases, as well as which phrases are overused in literature (“bolt upright” stands out particularly). This sort of research does seem to be a promising way of gaining a more robust understanding of slippery concepts like style. I’d particularly like someone to do an analysis of all of Martin Amis’s published fiction: he’s a famous hater of cliché and apparently tries never to write a sentence that someone else has got to first. We should quickly be able to tell whether he’s as original as he would like to be.
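The subcorpus comparison the researchers describe can be sketched in a few lines: count how often a phrase occurs per million words in one collection versus another. A toy version, where the two “corpora” are tiny invented strings standing in for the digitized fiction and news collections:

```python
# Toy subcorpus comparison: rate of a phrase per million words in each corpus.
# The corpora here are invented one-line stand-ins, not real datasets.
import re

def phrase_rate_per_million(phrase: str, corpus: str) -> float:
    words = re.findall(r"[a-z']+", corpus.lower())
    hits = len(re.findall(re.escape(phrase.lower()), corpus.lower()))
    return 1_000_000 * hits / len(words)

fiction = "He sat bolt upright in the dark. She sat bolt upright too."
news = "The committee sat through the report and rose at noon."

print(phrase_rate_per_million("bolt upright", fiction) >
      phrase_rate_per_million("bolt upright", news))  # True
```

At corpus scale the same idea – normalised phrase frequencies compared across genres – is what lets you say a phrase like “bolt upright” is overrepresented in fiction relative to other writing.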