Dealing with adapted data

Reader Daniel T. is unhappy about this analysis of intraday Internet usage by OS and device type. He doesn't like their choice of index, which I'll get to in a second post. (Link appears here when ready.)

There is something else wrong with this type of analysis.

Let's do a thought experiment. If you are a marketer interested in the diurnal variability in Internet usage, what are some of the factors you might investigate? My list would include whether the user is logging in from work or from home; whether the user is working or unemployed or on vacation; whether the user is male or female, young or old, a student, retired, etc.

Operating system is not on that list. But OS is exactly what the blogger analyzed, and thousands of marketers around the world do the same every day. That's because they use whatever data they can get their hands on. Web log data are adapted: they were collected by engineers for debugging, and are now used by marketers to explain consumer behavior. It's not hard to see why such data cannot tell the full story.

This goes back to the O and the A in my OCCAM framework for Big Data (link). Web log data are the prototypical example of data collected indiscriminately by tracking devices, without purposeful design, and then adapted to marketing applications.

***

One way to cope with using adapted data is to be clear about our model of the world. Assume OS really does affect Internet usage. How does OS affect Internet usage? Are you assuming that the features of an OS directly condition a user's behavior? Or are you assuming that the choice of OS is an indicator of the type of user?
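To make the second possibility concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the user types, the labels os_a and os_b, and the hours are my assumptions, not numbers from any web log. In this invented world, user type alone drives evening usage, and OS merely tracks user type.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (user_type, os, evening_hours_online).
# All values are invented for illustration; none come from real web logs.
# By construction, evening usage depends only on user type, never on OS.
records = (
    [("student", "os_a", 4.0)] * 80
    + [("student", "os_b", 4.0)] * 20
    + [("office_worker", "os_a", 1.0)] * 20
    + [("office_worker", "os_b", 1.0)] * 80
)


def mean_by(rows, key):
    """Average evening hours for each group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: mean(v) for k, v in groups.items()}


# What an OS-only web log would let you compute.
by_os = mean_by(records, lambda r: r[1])

# What you could compute if you also knew the user type.
by_type_and_os = mean_by(records, lambda r: (r[0], r[1]))

print(by_os)
print(by_type_and_os)
```

Looking at OS alone, os_a users appear to spend far more evening time online than os_b users. Within each user type, the gap disappears, because in this toy world OS is only an indicator of who the user is. Real data will not be this clean, but the two models of the world lead to very different marketing conclusions.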

Another way to cope with adapted data is to find or collect the data you really want (e.g. demographics, occupation) rather than analyzing data you don't understand. Recall Sean Taylor's advice to collect your own data (link).