Living with Statistics

Off the Beat: Bruce Byfield's Blog

Feb 20, 2012 GMT

Bruce Byfield

One of the prices of software freedom is the impossibility of getting accurate figures for usage. As a user, I consider that a small price to pay for not having to register or activate software. However, as a journalist I'm often frustrated, because accurate figures can be useful for establishing a point or debunking rumors.

The questions for which I would like accurate stats include: how many GNU/Linux users are there? Has Linux Mint really overtaken Ubuntu as the most popular distribution? Has GNOME gained or lost users with the start of its third release series? All these questions and more would benefit from reliable figures, yet we don't have any. Instead, we have a series of indicators that are approximate at best, and completely unreliable at worst.

One problem is external biases. For example, when NetApplications places Linux usage at 1.6%, that total is derived "from the browsers of site visitors to our exclusive on-demand network of live stats customers." But when I consider that the same methodology based on visits to my personal blog would suggest a figure of 19% for Linux, I have to wonder if NetApplications' figures aren't as skewed as mine, but in the opposite direction.

Similarly, since NetApplications' headquarters are in California, American companies are probably the most likely to use its services. Unofficially, I am always told that free software usage is lightest in North America, Microsoft's home, and higher in Europe or in developing countries. If that is true, a customer base weighted towards North America would understate Linux usage worldwide.

However, other problems arise when I rely on sources that are more friendly to free software, such as Distrowatch's page views for distributions. My guess is that most people who visit Distrowatch are already familiar with free and open source software (FOSS), so that its figures reflect only the tastes of relatively experienced users.

Yet even that assumption may be questionable. Page views might tell what distributions people are curious about, but that might be only a rough indicator of what people are downloading and using.

Moreover, Distrowatch's numbers are small enough that a new release or a lively discussion elsewhere online can skew results for days or weeks at a time. A handful of fans might easily distort results, although nothing indicates that such an effort has ever been made. Armed with such doubts, you can easily dismiss Distrowatch figures altogether, as Canonical employee Michael Hall did when Distrowatch reported Linux Mint as receiving more views than Ubuntu.

User surveys share some of the problems of Distrowatch's figures, but also come with their own problems. For instance, FLOSSPOLS' survey of gender in the community frames all discussions of women's under-representation in FOSS. Yet the FLOSSPOLS data was collected seven to eight years ago, making it decidedly obsolete, especially in a field that changes as rapidly as FOSS. Today, we have no idea whether the situation in the community is better than the survey reports (it could hardly be worse).

Still, at least the FLOSSPOLS survey was designed according to research standards. Community surveys, such as the Linux Journal's Readers' Choice Awards or the LinuxQuestions' Members Choice Awards, can't even claim that. In both, participants are self-selected and answers are open-ended. The number of participants may or may not be given, and margins of error never are -- although, if they were, they might be as high as five percent. If so, then in many cases where GNOME was declared the most popular desktop environment over KDE, or Mozilla the most popular web browser over Chrome, a more accurate result would probably be to declare a tie.
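To make the point about ties concrete, here is a minimal sketch in Python of the usual textbook approximation for a margin of error. Everything in it is an assumption for illustration: the function names are mine, the sample size of 400 and the two shares are made up, and none of the surveys mentioned above publishes figures like these. The formula also assumes a simple random sample, which a self-selected survey is not, so the real uncertainty is likely to be larger than what this calculates.

```python
import math


def margin_of_error(share, n, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample of n respondents; z=1.96 corresponds
    to the conventional 95% confidence level.
    """
    return z * math.sqrt(share * (1.0 - share) / n)


def is_statistical_tie(share_a, share_b, n, z=1.96):
    """Treat two shares as tied when their gap is no larger than the sum
    of their individual margins of error (a deliberately cautious rule)."""
    gap = abs(share_a - share_b)
    combined = margin_of_error(share_a, n, z) + margin_of_error(share_b, n, z)
    return gap <= combined


# Hypothetical numbers, not taken from any real survey.
respondents = 400
first_choice = 0.35   # made-up share for the "winner"
second_choice = 0.32  # made-up share for the runner-up

print(round(margin_of_error(0.5, respondents), 3))
print(is_statistical_tie(first_choice, second_choice, respondents))
```

With 400 respondents, the margin of error for a share near 50 percent comes out at roughly five percent, so a three-point lead like the made-up one here cannot be distinguished from a tie.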

None of what I am saying is meant to be a reflection upon those who collect the data. With the exception of FLOSSPOLS and NetApplications, none of these sources has ever claimed to be providing scientifically reliable information. In some cases, entertainment is probably more of a motivation than anything else. But for those of us in search of accurate information, the shortcomings of what is available are annoying, to say the least.

Living with Imperfection

So what's a writer to do? The high road would be to ignore such sources of information, and learn to live with uncertainty. As much as I want accurate information about FOSS, I might have to accept that it just doesn't exist.

However, that is hardly a solution. Even if I ignore these figures, others don't. Such sources as are available are always being cited to support various arguments, and, if nothing else, I might want to debunk the argument with something more than the reasonable doubt of meta-arguments.

Besides, the issues that such sources touch upon are ones that I -- and many other people -- want to talk about. As limited as these information sources may be, they at least give some context to discussions that would otherwise be even less informed.

As a result, the way I use these figures is an uneasy compromise. However briefly, I try to indicate that they're not reliable. I try not to make arguments that depend on a couple of percentage points of difference.

Most of all, I try not to base an argument on any single set of results. If a survey gets the same results several years running, I'm more likely to trust the figures than if they appear in a single year. Better yet are times when more than one source shows similar results over several years.
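The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the habit, not a statistical method: the function name, the thresholds, and the data are all hypothetical, with "A" and "B" standing in for any two options in a poll.

```python
from collections import Counter


def consistent_leader(results, min_sets=3, min_share=0.75):
    """Return the option that leads in most result sets, or None.

    results: list of dicts mapping an option name to its reported share,
    one dict per survey year or per source.
    min_sets: refuse to conclude anything from fewer result sets than this.
    min_share: fraction of result sets the same option must lead.
    Both thresholds are arbitrary choices for this sketch.
    """
    if len(results) < min_sets:
        return None
    leaders = Counter(max(r, key=r.get) for r in results)
    name, count = leaders.most_common(1)[0]
    return name if count / len(results) >= min_share else None


# Made-up shares for illustration only.
several_years = [
    {"A": 0.40, "B": 0.35},
    {"A": 0.42, "B": 0.30},
    {"A": 0.38, "B": 0.37},
    {"A": 0.39, "B": 0.41},
]
single_year = [{"A": 0.40, "B": 0.35}]

print(consistent_leader(several_years))
print(consistent_leader(single_year))
```

A single result set yields no conclusion at all, while the same option leading in three of four sets is treated as a trend worth mentioning, with the usual caveats.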

Of course, if I were paranoid enough, I might worry about whether all surveys were being manipulated by a small group of users or corporate employees. Realistically, though, I think that, under the conditions I describe, these statistical sources can indicate general trends to a degree that no other sources of information can. But I try not to forget that these sources are tentative, and can never be used with any precision.