DATA ANALYSIS: SOCIETY

Towards a data analytic society

A city of 600k, drug-tested in one fell swoop. No exceptions, no consent sought. Dosage of everything from sugar to crystal meth recorded, tabulated, sorted and analysed.

Not a scene from a paranoid TV drama, but a good example of scientific research interacting noninvasively with social policy. At the time of writing the project has not, so far as I know, been formally published, though it was presented to the American Chemical Society in August[1] and there has been plenty of press commentary[2][3]. Forty communities in Oregon were initially tested, with more planned, by analysing small samples of water entering sewage treatment plants. And, interestingly, this ‘community urinalysis’ won at least guarded acceptance across the spectrum of opinion, from enforcement agencies to drug users.

Shift up several levels, for example to the UNAIDS[4] mapping of AIDS incidence[5] in terms of continental populations, and there is little public attention at all. Shift down to the level where those citizens are themselves units of analysis, and they become less happy – but, despite all those paranoid TV dramas, less so than would once have been the case.

There is increasing acceptance at all levels of the data analytic as a default approach to life. The fact of analysis occurring in societal contexts, in and of itself, has social impact. Computing in general, and scientific computing in particular, have brought many changes, but a data analytic view of the world is the one which will most separate the future from the past. We are in the middle of a revolution in the way populations and individuals think about the world, and computerised science is the trigger.

There are three main components here: application of scientific methods to social policy, subliminal acceptance by individuals of a statistical view of themselves, and adoption of such views by individuals in looking outward onto their world.

The AIDS mapping seems, to most people not affected by the issue, to have little to do with them at all. Most would say, if asked, that the figures illustrated are too big and too far away to handle. But such illustrations have, through campaign posters and public education leaflets, become part of the background informational wallpaper of life and this, in itself, acclimatises us to the normality of data analytic presentation. This acceptance of the top (social policy) layer of analysis is what progressively extends the middle layer where community urinalysis is now cautiously accepted.

That acceptance could not always have been taken for granted. The apparent generality of the results, describing whole communities in broad brush, makes the whole exercise seem abstract – not so very different from the international AIDS mapping. Not so long ago, though, it would have seemed an invasion, and civil rights concerns would have loomed far larger. The perception has shifted: statistical aggregate scientific description of a social state has moved down from the international to the intranational level or even lower, because at this degree of aggregation it looks much the same.

Community urinalysis makes it possible to define and compare very closely the level of (for example) cocaine use in Portland and Salem. Better still, those levels can be tracked precisely over time, using frequent sample extraction. Daily sampling, or even several times a day, has already been suggested as a way to track the spread of new substances – methamphetamines being a topical chronic example; localised batches of badly cut heroin with toxic fillers are a recurrent acute case. In the Oregon study, fine-grained analysis of this kind showed methamphetamine usage to be geographically heterogeneous to a high degree and comparatively stable over time, while cocaine use peaks and troughs on a weekly cycle.
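A weekly cycle of this kind is easy to surface once frequent samples exist. The sketch below uses invented concentration figures, not data from the Oregon study, and simply groups simulated daily readings by day of the week:

```python
# Sketch: surfacing a weekly cycle in daily drug-residue concentrations.
# The figures are synthetic, for illustration only -- not Oregon data.
from statistics import mean

# 28 days of simulated metabolite concentrations (ng/l), with a
# weekend peak superimposed on a flat baseline.
days = range(28)
conc = [100 + (40 if d % 7 in (5, 6) else 0) for d in days]

# Group readings by day of week and compare averages.
by_weekday = {wd: [c for d, c in zip(days, conc) if d % 7 == wd]
              for wd in range(7)}
weekday_means = {wd: mean(v) for wd, v in by_weekday.items()}

weekend = mean(weekday_means[wd] for wd in (5, 6))
midweek = mean(weekday_means[wd] for wd in range(5))
print(f"midweek mean: {midweek:.0f} ng/l, weekend mean: {weekend:.0f} ng/l")
```

The same grouping, applied to real sampling data, is all that is needed to separate a stable baseline (the methamphetamine pattern) from a periodic one (the cocaine pattern).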

All of this, while staying general, produces specific changes that impact individuals. The fact that information on distribution of proscribed substance usage exists will inevitably influence the distribution of infrastructure funding. At least one European health authority is looking at the work in relation to sociospatial resource targeting. At least one police force has, as a direct result of the Oregon study reports, expensively seconded an officer to commercial premises with the broader agenda of acquiring expertise in the use of Sanitas groundwater analysis software.

At least one military intelligence agency, with existing in-house analytic expertise in using SAS software to explore the strategic implications of water shortage, is interested in taking urinalysis down to smaller units than the city.

Historically, there is a Darwinian social entropy in application of technologies and scientific methods. As long as they yield results, they diffuse through a society and become part of its fabric. The process is generally irreversible: the society becomes dependent, and withdrawal would be too traumatic. This is not doom-mongering: the diffusions are usually, at least in the long run and the broad picture, advantageous, and this one will be no different. There are cases (DDT comes to mind) where reversals do occur, but the principle is there: once community urinalysis has been accepted, it will progress and is unlikely to be abandoned. This principle applies also to the extending acceptance of statistical approaches to issues, and in particular to computer driven data analysis.

Not only does it now underpin everything we do, but it propagates through what we are at a rate which far outpaces any other in history. The world is, of course, always changing and always has done. Sometimes, however, it changes in such a way that it becomes a different world: the agrarian, capitalist and industrial revolutions, or the invention of printing, are examples. I’m fairly sure that future historians will point to the early 21st century as a time when the ways in which both societies and their individual members think about themselves went through that sort of radical shift within a very short time. And they will identify the transfer of data analytic approaches from scientific computing specialists to the general population as the responsible agent.

The penetration of individual social perception by scientific approaches begins, naturally enough, with scientists. Sam Roberts, application engineer at the MathWorks with a previous background in big pharma, recently mentioned to me in passing that he went into that field partly out of concern over the ethics of animal testing. He didn’t go onto the streets, nor into politics: he sought to shift the area of his concern from physical to conceptual realms. The microarray is, perhaps, the best symbol of modern science – which would have seemed quite extraordinary to practitioners of even 30 years ago.

Looking at the vertical markets of a company like the MathWorks, and the products which have evolved to serve them, is instructive from a sociodynamic viewpoint. Matlab, as I’ve discovered over the last year or two, is as likely to be used in finance as in automotive and aeronautic control engineering.

Indeed, control engineering is a concept that has escaped from its box, a term as likely to be used by molecular chemists or biologists as by the designers of jet aircraft. SimBiology, a market-specific Matlab extension into life science, applies Gillespie-style discrete event simulation modelling and other stochastic approaches to the study of biological systems and their components. If this all seems commonplace and obvious to you, ask yourself two questions: when did it become so, and what does your answer tell you about the rate of change in unconscious habits of thought in the society of which you are part?
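For readers who haven’t met it, the Gillespie direct method is simple to sketch. The toy simulation below, of a single degradation reaction, is my own minimal Python illustration of the stochastic approach – not SimBiology code, and the reaction and rate constant are invented:

```python
# A toy Gillespie direct-method simulation of one reaction: protein
# degradation A -> 0 at rate k per molecule. Illustrative sketch only.
import random

def gillespie_decay(n0, k, t_end, seed=42):
    """Simulate A -> 0; return a list of (time, molecule count) points."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    trace = [(t, n)]
    while n > 0 and t < t_end:
        propensity = k * n                 # total rate of the next event
        t += rng.expovariate(propensity)   # exponentially distributed wait
        n -= 1                             # one molecule decays
        trace.append((t, n))
    return trace

trace = gillespie_decay(n0=100, k=0.1, t_end=50.0)
print(f"{len(trace) - 1} decay events simulated")
```

The essential point, for the argument here, is how little machinery separates this from a spreadsheet exercise: each event is a random wait followed by a bookkeeping update.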

This move from directly thought models of the world to statistically evolved ones, though a direct result of scientific method, is adopted and adapted by those who would never think of themselves as scientists or even scientific. Computerisation doesn’t just make data analytic approaches ever more rapid and efficient; it drives their osmotic spread throughout human society. Gradually, such approaches become more and more commonly accepted as bases for decisions and judgements – not only by governments, but the governed too. In itself, this can only be a good thing, but it does release social forces that are not always predictable. I could follow the chain on from pharmaceuticals through industry to general commerce and into politics, the place of personal cellular communications, and so on, but space doesn’t permit it. Let’s, instead, make a leap straight down to the bottom of the pyramid.

Scientists are also people, with families and friends and children: the conversation with Sam Roberts, above, started from a shared interest in out-of-hours work with teachers who want to encourage scientific thinking in children. Teaching is, as Postman and Weingartner[6] told us, a subversive activity; it is now the infection vector bringing together analytic thinking with the spread of cheap high technology.

In SCW’s website education pages, last year[7], a teacher described an experiment in which 10- and 11-year-olds analytically considered development options within their school, relating funding options to costs and benefits. She commented that they were ‘interested in... using such methods to explore problem solving choices’, and ‘was astonished at... the degree of sophistication in their handling’. This is welcome news to those, like me, who are exercised by lack of critical thinking skills in new undergraduates; it may be seen as a mixed blessing to governments making such funding allocations on behalf of electorates.

In the last couple of weeks I have seen this approach applied closer to the bone, by children of the same age using personal computers to evaluate through administrative spreadsheets the effectiveness of their own teachers. It’s not easy to hand over power in that way, but it’s certainly a good way to start building a critically-aware citizenry of the future.

From computer access to a connected machine in every school bag, 24 hours a day and seven days a week, is a huge leap – but thoroughly serious programmes are under way to bring it about. This has many implications, but I’m concerned here with those that flow from resulting universal access to easily used data stores and analytic tools. For most people, the first doorway through the wardrobe into data analysis is the ubiquitous spreadsheet.

In affluent societies the initial running is being made by ‘netbook’ machines based on Intel’s Classmate pattern. One manifestation of this is a range of durable subnotebook machines from Asus (see the separate review on the SCW website[8] for more detail), with prices starting from roughly €200. A public falling out with the ‘One Laptop Per Child’ (OLPC) project[9], dedicated to supplying children in the developing world with laptop computers, has intensified rather than cooled the race to put a processor in every pocket.

The OLPC’s XO laptop. On-screen photographic image by Janet Hynes.

OLPC machines are rugged, use low power consumption processors (supplied, in the absence of Intel, by AMD), charge from hand-cranked generators, have built-in wireless internet, and cost just over €130 (target cost, as economies of scale kick in, about €70). And their bundled office suite includes the all-important spreadsheet. The social implications of this are incalculable. Many rural African children have no paper and share slates between them; books and teachers may be in short supply; classes may be large and the student centred ideal an impossible dream. The arrival of OLPC in such places will be an even greater revolution than in industrialised societies, leapfrogging more than a century of educational evolution.

How governments buying these machines will be affected by their future results is anybody’s guess. M, a teacher in a developing world school that he doesn’t want identified, describes how he used a new supply of laptops in teaching the principles of simple bookkeeping for sole trader businesses. The following week, the students returned to him with their own spreadsheets applying the same principles to national economics and regional investment imbalances.
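The bookkeeping exercise itself is worth pausing on, because it shows how small the conceptual step is from classroom ledger to national accounts. A minimal sketch, with entries invented for illustration, amounts to a list of dated transactions and a running balance:

```python
# Sketch of a sole-trader bookkeeping exercise reduced to code: a
# ledger of dated entries and a running balance. Entries are invented.
ledger = [
    ("2008-03-01", "seed stock",   -40.00),
    ("2008-03-08", "market sales",  65.00),
    ("2008-03-15", "stall fee",     -5.00),
]

balance = 0.0
for date, item, amount in ledger:
    balance += amount
    print(f"{date}  {item:<14} {amount:>8.2f}  balance {balance:>8.2f}")
```

Swap the row labels for revenue streams and regional allocations, and the students’ leap to national economics is exactly the substitution M describes.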

Over the past few months, I’ve been exploring the impact of ICT saturation on small groups of both children and the adults (teachers, relatives, neighbours) in direct contact with them. I’ve had a set of Asus machines to play with, moving them around small projects with spreadsheets and SysQuake installed. One class, in cooperation with the local water company, conducted a community urinalysis of their school’s sewage outflow – though they tested for dietary and biological byproducts rather than proscribed substances. I’ve also been fortunate in my access to M’s school, where machines have arrived in quantity. In each case, the result of leaving the same machine in the same child’s hands 24 hours a day, seven days a week, has been a marked increase in data analytic approaches across all activity boundaries, at school and outside it.

Having tools is not, of course, the same as having the material on which to use them; increasingly sophisticated analytic views demand increasingly sophisticated data access.

This, too, trickles down from the top tier, where data extraction runs into much the same problems as in the hard sciences – and reaches for the same solutions. Admire (Advanced Data Mining and Integration Research for Europe), a three-year project coordinated from the University of Edinburgh, is doing for social data what the University of Portsmouth’s Helen Xiang was doing through the NGS for astronomy[10] in the last issue: unifying queries on disparate, distributed and heterogeneous data sources.

This sort of query currently involves ruinously expensive time spent on minutely detailed specification of strategies, sources, and mechanisms. Admire seeks to subsume all that under a structure of internet and grid gateways, communicating through Infrastructure Service Bus-mediated services under high level language control. Crucial to this are semantic technologies, which are key to all future data access at my two lower levels. One of the initial proving ground scenarios returns to the theme of water, with an integrated application to make flood predictions from meteorological forecasts.
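The underlying idea of query federation can be caricatured in a few lines. The sketch below is a toy of my own, with every name and data source hypothetical – Admire’s actual gateway and service-bus architecture is far more elaborate – but it shows the shape of the thing: one query interface over sources that store their data quite differently:

```python
# Toy illustration of query federation: one interface over two
# heterogeneous sources. All names and data here are hypothetical;
# Admire's gateway/service-bus architecture is far more elaborate.
import csv, io

class CsvSource:
    """Source backed by CSV text."""
    def __init__(self, text):
        self.rows = list(csv.DictReader(io.StringIO(text)))
    def query(self, **where):
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in where.items())]

class DictSource:
    """Source backed by in-memory records."""
    def __init__(self, records):
        self.rows = records
    def query(self, **where):
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in where.items())]

def federated_query(sources, **where):
    """Run the same query against every source and merge the results."""
    return [row for s in sources for row in s.query(**where)]

rainfall = CsvSource("station,mm\nEdinburgh,12\nPortsmouth,3\n")
rivers = DictSource([{"station": "Edinburgh", "level": "high"}])
results = federated_query([rainfall, rivers], station="Edinburgh")
print(results)
```

The expensive part that Admire attacks is everything this toy elides: discovering the sources, reconciling their semantics, and moving the data efficiently.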

The flow of information, and the restriction or shaping of that flow, have always been crucial to balances of power. In the 15th century, movable type paved the way for the Enlightenment; mass literacy and numeracy were engines of social change; the internet is the bane of repressive states, and rapidly modernising societies struggle to maintain equilibrium as they come to terms with it. In the long run, extensions of data access will work their way down to individual level and meet the universally available computing power. In a time of globalisation, that sets the scene for data analytic outlooks to produce a similar revolution in social structure whose outcome is impossible to guess.

A record of a small scale sewage outflow ‘community urinalysis’ study, conducted by school pupils and analysed using personal Asus EEE PC computers. This illustration, produced by spreadsheet, was part of an individual contribution to class discussion of policy on fizzy drinks in school vending machines.