I’m going to return to 2014’s approach of dividing the best visualisation of data (dataviz!) from the best visualisation of methods (methodviz!).

In the first category, as soon as I saw Jill Pelto’s watercolour data paintings I was bowled over. Time series of environmental data are superimposed and form familiar but disturbing landscapes. I’m delighted to have a print of Landscape of Change hanging in the living room at Chateau Grant. Pelto studies glaciers and spends a lot of time on intrepid-sounding field trips, so she sees the effects of climate change first hand in a way that the rest of us don’t. There’s a NatGeo article on her work here.

In the methodviz category, Fernanda Viegas and Martin Wattenberg made a truly ground-breaking website for Google’s TensorFlow project (open source deep learning software). This shows you how artificial neural networks of the simple feedforward variety work, and allows you to mess about with their design to a certain extent. I was really impressed with how the hardest aspect to communicate — the emergence of non-linear functions of the inputs — is just simple, intuitive and obvious for users. I’m sure it will continue to help people learn about this super-trendy but apparently obscure method for years to come, and it would be great to have more pages like this for algorithmic analytical methods. You can watch them present it here.

This time I’m going to take a closer look at one of the data visualisations I’ve been filling my spare time with for fun, not profit. I have two bird feeders in our garden, and you can watch the consumption of seeds, updated with every top-up, at this page. This started when I wrote about Dear Data (viz o’ the year 2015) and recommended playing around with any odd local data you can get your hands on. I thought it would just be a cutesy dataviz exercise, but it ended up as a neat microcosm of another issue that has occupied me somewhat this year: inference and explanation.

Briefly, statistical inference says “the rate of bird seed consumption is 0.41 cm/day, and if birds throughout suburban England consume at one rate, that should be between 0.33 and 0.50, with 95% probability”, or “the bird seed consumption has changed, in fact more than is plausibly due to random variation, so it is probably some systematic effect”. But explanation is different and is all about why it changed. Explanation doesn’t have to match statistics. A compelling statistical inference with a minuscule p-value could bug the hell out of you because you just can’t see why it should be in the direction it is, or of the strength it is. Or an unconvincing, borderline statistical inference could cry out to you, “I am a fact. Just the way you hoped I would be!”
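To make that first flavour of inference concrete, here is a minimal sketch with made-up daily readings (not the real feeder data, and the numbers won’t reproduce the interval quoted above), using a plain normal-approximation interval for the mean rate:

```python
import math

# Hypothetical daily consumption readings (cm/day) -- invented for illustration
rates = [0.35, 0.52, 0.41, 0.30, 0.47, 0.44, 0.38, 0.49, 0.36, 0.43]

n = len(rates)
mean = sum(rates) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / (n - 1))  # sample SD
se = sd / math.sqrt(n)                                          # standard error

# Normal-approximation 95% interval: 1.96 standard errors either side
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"rate = {mean:.2f} cm/day, 95% interval ({lo:.2f}, {hi:.2f})")
```

The interval answers the narrow question of how precisely the rate is estimated; nothing in it tells you why the rate is what it is, which is the whole point of the distinction above.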

The problem here is that we try to be systematic in doing our statistical inferences, so that we don’t fall prey to cognitive biases: we pre-specify what we’re going to do and have to raise the bar if we start doing lots of tests. However, there’s no systematic approach like that to explanation. In fact, it’s not at all clear where these explanations come from, apart from thunderbolts of inspiration, and it’s only somewhat understood how we judge a good explanation from a poor one (as ever, I refer you to Peter Lipton’s book Inference To The Best Explanation for further reading).

When you get a great, satisfying explanation, it’s tempting to stop looking, but when you have compelling stats that don’t lead to a nice explanation, you might keep poking at the data, looking for patterns you like better, that suggest just such a nice explanation to you. Then, all the statistical work is no more sound than the explanatory thunderbolts.

Sad to relate, dear Reader, even Robert falls into these traps. On the web page, I wrote an explanation of the pattern of seed consumption, without giving too much thought to it:

I interpret the pattern along these lines: in mid-summer, the consumption increases massively as all the chicks leave the nest and start learning how to feed themselves. The sparrows in particular move around and feed in flocks of up to 20 birds. Once seeds and berries are available in the country though, it is safer for them to move out there than to stay in the suburbs with prowling cats everywhere. But as the new year arrives, the food runs out and they move back in gradually, still in large flocks, before splitting into small territories to build nests. Cycle of life and all that.

That was based on unsystematic observation before I started collecting the data. My hunches were informed by sketchy information about the habits of house sparrows, gleaned from goodness-knows-where, and they were backed up by the first year of data. I felt pretty smug. On this basis, one would feel confident predicting the future. But then, things started to unravel. The pattern no longer fit and there were multiple competing explanations for this. The data alone could not choose between them. Fundamentally, I realised I was sleepwalking into ignoring one of my own rules: don’t treat complex adaptive systems like physical experiments. An ecosystem — some gardens, parks and terrain vague, plus a bunch of songbirds, raptors, insects, squirrels, humans and cats — is a complex adaptive system, and the same issues beset any research in society. Causal relationships are non-linear, highly interdependent, and there are intelligent agents in the system. This all contributes to the same input producing very different outputs on different occasions, because the rules of the system change.

If it is foolish to declare an explanation on the basis of one year’s data that happen to match prior beliefs, it is equally so to declare an explanation for why 2016 shifted from 2015’s pattern after just a few months. It’s also foolish to say that after March 2016 consumption dropped and that coincided with a new roof being built on our garage, so that is the cause — yet I did just that:

The sharp drop in March 2016 was the result of work going on to replace the roof on our garage, which introduced scary humans into the garden all day. [emphasis added]

How embarrassing. Yes, it’s a nice explanation, but it doesn’t really have any competitors because there are no other observed causes, and there are no other observed effects either (I don’t sit out there all day taking notes like Thoreau). It’s only likely insofar as it is in a set of one and there are no likelier competitors. It’s only lovely insofar as it explains the data, and there’s nothing else to explain. And when we get later in the year, the congruence it enjoys with prior beliefs of causal mechanisms deteriorates: surely those birds would get used to the new roof and come back?

But we are all capable of remarkable mental gymnastics to keep our favourite explanation in the running. My neighbour had a tree cut down in June, so that would upset things further, and would be a permanent shift in the system. It was a good year for insects, following a frost-free winter, so there was less pressure to go feeding in comparatively dangerous gardens. And so on, getting ever more fanciful. The evidence for any of these mechanisms is thin to say the least. We can’t guard against this mental laxity because it’s the same process that helped our family members long ago to eat the springbok in the bushes but not get eaten by the corresponding lion, and now it’s hard-wired, but we can at least acknowledge that science* is somewhat subjective, even though we try our best to impose strict lab-notes-style hypothetico-deduction on it (this does not necessarily imply the use of decision theoretic devices like significance tests; the birdfeeder page only does splines which is basically non-parametric descriptive statistics), and not pretend to be able to know the Secrett of Nature by way of Experiment.

* – while simple physical sciences — of the sort you and I did in high school — might lead to Secretts, life and social sciences certainly don’t, and in fact the modern physical sciences involve pushing instruments to their limits and then statistics comes in to help sift the signal from the noise, so they too are a step removed from claiming to have proven physical laws.
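The splines mentioned above can be sketched like this — a smoothing-spline pass over made-up consumption numbers, using scipy’s UnivariateSpline as a stand-in for whatever the birdfeeder page actually runs:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Made-up weekly consumption readings (cm/day) -- not the real feeder data
week = np.arange(10, dtype=float)
rate = np.array([0.2, 0.3, 0.8, 1.1, 0.9, 0.5, 0.3, 0.2, 0.4, 0.6])

# s is the smoothing penalty: s=0 interpolates every point exactly,
# larger s trades fidelity for a smoother descriptive curve
smooth = UnivariateSpline(week, rate, k=3, s=0.5)
print([round(float(smooth(w)), 2) for w in week])
```

This is descriptive, not inferential: the curve summarises the data without asserting any mechanism, which is exactly why it dodges the explanation traps above.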

Also, this exercise has something to say about data collection, or appropriation. I made some ground rules that were not very well specified, but in essence, I thought it best to write down how much seed I put in, rather than adjust for why it had come out. Spilled seeds were not to be separated from eaten seeds. But then in July 2015, I found a whole big feeder gone after filling it up in the morning, and ascribed this (on no evidence) to a squirrel. I thought about disregarding the day, but decided not to in the end. For one thing, it turned out in the cold light of analysis to be not so different to other high-consumption days. Maybe it was just especially ravenous sparrows. Or maybe Cyril the Squirrel had been at work all along. Once again, there was no way to choose one explanation from another. Now, the more you learn about the source of the data in all its messy glory, the more you question. But without that information, you wouldn’t. Another subjective, mutable aspect appears, one which is more relevant in this age of readily available and reused data.

All of these bird seed problems also appear in real research and analysis, but there, drawing the wrong conclusions can cause real harm. In each of the cock-ups above, it is the explanation that causes the problem, not the stats.

As I write this, I feel like I keep banging the same subjectivity and explanation drums, but, frankly, I don’t see much evidence of practice changing. I think the replication efforts of recent years in psychology are somewhat helpful, but are limited to fighting on a very narrow front. It probably helps to terrorise researchers more generally regarding poor practices but what we also need is a friendly acceptance of subjectivity and the role explanation plays. Science is hard.

What I like about this is how he had to think outside the box to show changes in rankings over time for a selection of tennis players. Lines would be like spaghetti. Stream charts suck. Parallel coordinates: you cannot be serious. Different symbols, sizes or colours would all use poorly-read mappings of data to visual dimensions. So, something new was needed. And, of course, it had to conform to house style. In doing so, he uncovers some nice little stories that are not so clear from the stark headlines, about runs at number one and comebacks, and whether it really is tougher than it used to be. Sweet annotation too. John has absolutely aced this one. (Sorry, couldn’t resist.)

I’m working on a noise pollution map of central London. Noise is an interesting public health topic, overlooked and of debatable cause and effect but understandable to everyone. To realise it as interactive online content, I get to play around with Mapbox as well as D3 over Leaflet [1] and some novel forms of visualisation, audio delivery and interaction.

The basic idea is that, whenever the need arises to get from A to B, and I could do it by walking, I record the ambient sound and also capture a detailed GPS trail. Then, I process those two sets of data back at bayescamp and run some sweet tricks to make them into the map. I have about 15 hours of walking so far, and am prototyping the code to process the data. The map doesn’t exist yet, but in a future post on this subject, I’ll include a sketch of what it might look like. The map below shows some of my walks (not all). As I collect and process the files, I will update the image here, so it should be close to live.

I’d like it to become crowd-sourced, in the sense that someone else could follow my procedure for data capture, copy the website and add their own data before sharing it back. GitHub feels like the ideal tool for this. Then, the ultimate output is a tool for people to assemble their own noise-pollution data.

As I make gradual progress in my spare time, I’ll blog about it here with the ‘noise pollution’ tag. To start with, I’ll take a look at:

The equipment

Clearly, some kind of portable audio recorder is needed. For several years, when I made the occasional bit of sound art, I used a minidisc recorder [2] but now have a Roland R-05 digital recorder. This has an excellent battery life and enough storage for at least a couple of long walks. At present, you can get one from Amazon for GBP 159. When plugged into USB, it looks and behaves just like a memory stick. I have been saving CD-quality audio in .wav format, mindful that you can always degrade it later, but you can’t come back. That is pretty much the lowest quality the R-05 will capture anyway (barring .mp3 format, which I decided against because I don’t want the recorder to dedicate computing power to compressing the sound data), so it occupies as little space on the device as possible. It will tuck away in a jacket pocket easily so there’s no need to be encumbered by kit like you’re Chris Watson.

Pretty much any decent microphone, plus serious wind shielding, would do, but my personal preference is for binaurals, which are worn in the ear like earphones and capture a very realistic stereo image. Mine are Roland CS-10EM which you can get for GBP 76. The wind shielding options are more limited for binaurals than a hand-held mic, because they are so small. I am still using the foam covers that come with the mics (pic below), and wind remains something of a consideration in the procedure of capturing data, which I’ll come back to another time.

On the GPS side, there are loads of options and they can be quite cheap without sacrificing quality. I wanted something small that allowed me to access the data in a generic format, and chose the Canmore GT-730FL. This looks like a USB stick, recharges when plugged in, can happily log (every second!) for about 8 hours on a single charge, and allows you to plug it in and download your trail in CSV or KML format. The precision of the trail was far superior to my mobile phone at the time when I got it, though the difference is less marked now even with a Samsung J5 (J stands for Junior (not really)). There is a single button on the side, which adds a flag to the current location datum when you press it. That flag shows up in KML format in its own field, but is absent from CSV. They cost GBP 37 at present. There are two major drawbacks: the documentation is awful (Remember when you used to get appliances from Japan in the 80s and none of the instructions made sense? Get ready for some nostalgia.) and the data transfer is by virtual serial port, which is straightforward on Windows with the manufacturer’s Canway software but a whole weekend’s worth of StackOverflow and swearing on Linux/OS X. Furthermore, I have not been able to get the software working on anything but an ancient Windows Vista PC (can you imagine the horror). Still, it is worth it to get that trail. There is a nice blog by Peter Dean (click here), which details what to do with the Canmore and its software, and compares it empirically to other products. The Canway software is quite neat in that it shows you a zoomable map of each trail, and is only a couple of clicks away from exporting to CSV or KML.
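As a rough illustration of getting at the KML trail programmatically, here is a toy sketch using the standard KML namespace. It only pulls coordinates; the exact field the Canmore export uses for the button flag is not reproduced here, and the snippet below is hand-written rather than a real export:

```python
import xml.etree.ElementTree as ET

# A tiny hand-written KML snippet standing in for a Canmore export --
# real files carry timestamps and other metadata omitted here
KML = """<?xml version="1.0"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><Point><coordinates>-0.1276,51.5072,10</coordinates></Point></Placemark>
    <Placemark><Point><coordinates>-0.1280,51.5075,11</coordinates></Point></Placemark>
  </Document>
</kml>"""

def trail_points(kml_text):
    """Return (lon, lat, alt) tuples for every coordinates element."""
    root = ET.fromstring(kml_text)
    pts = []
    for coords in root.iter("{http://www.opengis.net/kml/2.2}coordinates"):
        lon, lat, alt = (float(v) for v in coords.text.strip().split(","))
        pts.append((lon, lat, alt))
    return pts

print(trail_points(KML))
```

Note the KML convention of longitude before latitude, which trips up almost everyone at least once.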

Having obtained the .kml file for the trail plus starting point, the .csv file for the trail in simpler format, and the .wav file for the sound, the next step is synchronising them, trimming to the relevant parts and then summarising the sound levels. For this, I do a little data-focussed programming, which is the topic for next time.
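In outline, the summarising step might look something like this — a toy sketch that assumes the audio and the per-second GPS fixes have already been trimmed to start together, which is the hard part in practice:

```python
import math

def rms_per_second(samples, sample_rate):
    """Root-mean-square level for each one-second window of mono samples."""
    levels = []
    for start in range(0, len(samples), sample_rate):
        window = samples[start:start + sample_rate]
        levels.append(math.sqrt(sum(s * s for s in window) / len(window)))
    return levels

def attach_levels(trail, levels):
    """Pair each per-second GPS fix with the matching sound level.

    Assumes second n of audio corresponds to fix n -- in practice you
    would first align both streams on a shared sync event.
    """
    return [(lat, lon, lvl) for (lat, lon), lvl in zip(trail, levels)]

# Two seconds of fake audio at a toy 4 Hz "sample rate", plus two fixes
rate = 4
samples = [0, 1, 0, -1, 0, 2, 0, -2]
trail = [(51.5072, -0.1276), (51.5075, -0.1280)]
print(attach_levels(trail, rms_per_second(samples, rate)))
```

Real CD-quality audio is 44,100 samples per second per channel, so the same windowing idea applies, just with much bigger windows.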

Footnotes

1 – these are JavaScript libraries that are really useful for flexible representations of data and maps. If you aren’t interested in that part of the process, just ignore them. There will be plenty of other procedural and analytic considerations to come that might tickle you more.

2 – unfairly maligned; I heard someone on the radio say recently that, back around 2000, if you dropped a minidisc on the floor, it was debatable whether it was worth the effort to pick it up

This recent BBC Radio 4 “Farming Today” show (available to listen online) visited Rothamsted Research Station, former home of stats pioneer Ronald Fisher, and considered the role of remote sensing, rovers, drones and the like in agriculture and, perhaps most interestingly for you readers, the big data that result.

Agrimetrics (a partnership of Rothamsted and other academic organisations) chief executive David Flanders said of big data (about 19 minutes into the show):

I think originally in the dark ages of computing, when it was invented, it had some very pedantic definition that involved more than the amount of data that one computer can handle with one program or something. I think that’s gone by the wayside now. The definition I like is that it gives you answers to questions you hadn’t even thought of.

which I found confusing and somewhat alarming. I assume he knows a lot more about big data than I do, as he runs a ‘big data centre of excellence’ and I run a few computers (although his LinkedIn profile features the boardroom over the lab), but I’m not sure why he plays down the computational challenge of data exceeding memory. That seems to me to be the real point of big data. Sure, we have tools to simplify distributed computing, and if you want to do something based on binning or moments, then it’s all pretty straightforward. But efficient algorithms to scale up more complex statistical models are still being developed, and it is by no means a thing of the past. Perhaps the emphasis on heuristic algorithms for local optima in the business world has driven this view that distributed data and computation are done and dusted. I am always amazed at how models I feel are simple are sometimes regarded as mind-blowing in the machine learning / business analytics world. It may be because they don’t scale so well (yet) and don’t come pre-packaged in big data software (yet).
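To show why moment-based statistics distribute so easily, here is a toy sketch: each machine reduces its chunk to a count, sum and sum of squares, and those summaries combine exactly into the overall mean and variance:

```python
def chunk_moments(chunk):
    """Per-chunk count, sum and sum of squares -- what each worker computes."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(parts):
    """Merge per-chunk summaries into the overall mean and (population) variance."""
    n = sum(p[0] for p in parts)
    s = sum(p[1] for p in parts)
    ss = sum(p[2] for p in parts)
    mean = s / n
    var = ss / n - mean * mean
    return mean, var

# Data split across three "machines"; the combined result is exact,
# identical to computing on the pooled data
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
print(combine([chunk_moments(c) for c in chunks]))
```

Nothing like this exactness is available for, say, fitting a mixed model or a spline across shards, which is where the genuinely open computational work lies.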

In contrast, the view that, with enough data, truths will present themselves unbidden to the analyst, is a much more dangerous one. Here we find enormous potential for overt and cryptic multiplicity (which has been discussed ad nauseam elsewhere), and although I can understand how a marketing department in a medium-sized business would be seduced by such promises from the software company, it’s odd, irresponsible even, to hear a scientist say it to the public. Agrimetrics’ website says

data in themselves do not provide useful insight until they are translated into knowledge

and hurrah for that. It sounds like a platitude but is quite profound. Only with contextual information, discussion and involvement of experts from all parts of the organisation generating and using the data do you really get a grip on what’s going on. These three points were originally a kind of office joke, like buzzword bingo, when I worked on clinical guidelines, but I later realised they were accidentally the answer to making proper use of data:

or, less facetiously, talk to everyone about these data (not just the boss), get them all involved in discussions to define questions and interpret the results, and then do the same in translating it to recommendations for action. No matter how big your data are, this does not go away.

I used to have an office door until this week when we moved to open plan space elsewhere in the medical school. I used to stick a chart of the week on that door, a student’s suggestion that proved to be a bottomless mine of goodies. So, I thought I would carry on visualising here.

We begin with some physical dataviz courtesy of Steven Gnagni and spotted by Helen Drury. Spoor of human activity etc etc. More like this next week.