TIOBE or not TIOBE – “Lies, damned lies, and statistics”

“Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.’”
– Mark Twain

I’ve been meaning to write a post about the suspect methodology of the TIOBE Index, but Andrew Sterling Hanenkamp beat me to it (via Perl Buzz).

The index rests on two assumptions:

that the number of search engine hits for the phrase “foo programming” is proportional to the “popularity” of that language.

that the proportionality is the same for different languages.

It’s not hard to pick holes in both of those assumptions.
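To see why the second assumption matters, here is a minimal sketch using made-up numbers. The languages "A" and "B", the user counts, and the hits-per-user figures are all hypothetical, invented purely for illustration. Two languages with identical numbers of users end up with very different shares of the hits simply because their communities happen to use the phrase “foo programming” at different rates.

```python
# Hypothetical figures, invented purely for illustration.
# users: the true number of people using each language (unknowable in practice)
# hits_per_user: how many "foo programming" pages each user generates on average
users = {"A": 1000, "B": 1000}
hits_per_user = {"A": 3.0, "B": 1.0}

# Hits are what a search engine would report for each language.
hits = {lang: users[lang] * hits_per_user[lang] for lang in users}

# A hit-based rating is each language's share of the total hits.
total = sum(hits.values())
share = {lang: hits[lang] / total for lang in hits}

print(share)  # {'A': 0.75, 'B': 0.25}
```

Both languages are equally "popular" by construction, yet the hit-based rating puts one at three times the other.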

They also claim that “The ratings are based on the number of skilled engineers world-wide, courses and third party vendors” but I can’t see anything in their methodology that supports that claim.
I presume they’re just pointing out the kinds of sites that are more likely to contain the “foo programming” phrase.

Even if you can accept their assumptions as valid, can you trust their maths? Back in Jan 2008, when I was researching views of Perl, TIOBE was mentioned, so I took a look at it.

At the time Python had just risen above Perl, prompting TIOBE to declare Python the “programming language of the year”. When I did a manual search, using the method they described, the results didn’t fit.

I wrote an e-mail to Paul Jansen, the Managing Director and author of the TIOBE Index. Here’s most of it:

Take perl and python, for example:

I get 923,000 hits from google for +"python programming" and 3,030,000 for +"perl programming". (The hits for Jython, IronPython, and pypy programming are tiny.) As reported by the “X-Y of approx Z results” at the top of the search results page.

Using google blog search I get 139,887 for +"python programming" and 491,267 for +"perl programming". (The hits for Jython, IronPython, and pypy programming are tiny.)

So roughly 3-to-1 in perl’s favor from those two sources. It’s hard to imagine that “MSN, Yahoo!, and YouTube” would yield very different ratios.
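The “roughly 3-to-1” figure can be checked directly from the hit counts quoted above. This is just a sketch of the arithmetic; the dictionary layout and the function name are my own, and only the four hit counts come from the e-mail.

```python
# Hit counts quoted in the e-mail above (Jan 2008).
hits = {
    "google web search": {"perl": 3_030_000, "python": 923_000},
    "google blog search": {"perl": 491_267, "python": 139_887},
}

def perl_to_python_ratio(counts):
    """Return perl hits divided by python hits for one source."""
    return counts["perl"] / counts["python"]

ratios = {source: perl_to_python_ratio(counts) for source, counts in hits.items()}

for source, ratio in ratios.items():
    print(f"{source}: {ratio:.2f} to 1 in perl's favor")
# google web search: 3.28 to 1 in perl's favor
# google blog search: 3.51 to 1 in perl's favor
```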

23 thoughts on “TIOBE or not TIOBE – “Lies, damned lies, and statistics””

The TIOBE numbers are why I wrote “Don’t compare percentages”. People use percentages to lie, and I think TIOBE is purposely trying to distort the picture. They’d be more interesting if they supplied raw numbers.

The most damning thing about TIOBE and any other hit counting is that it relies on a third party to decide what to index. Most of the stuff that comes back from Google Blogsearch, for instance, is reposted crap for link attractors and other search engine optimization shenanigans. Counting the same original content in multiple places distorts the data and makes the hit counting just about worthless. Note that TIOBE had to change its index in April 2004 just for this reason (see the FAQ at the bottom of http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html).

Interesting analysis. I hadn’t paid much attention to Tiobe until this May. On the misinformation of an outsider, they removed ColdFusion from the list claiming it wasn’t a programming language. Much to their chagrin they were immediately corrected, but on the realization that the technically “correct” way to refer to ColdFusion-the-language was “CFML” (which almost nobody actually calls it) they changed their search terms and it promptly fell into oblivion.

It would kind of be like Googling for “bathroom tissue” (technically correct) instead of “Kleenex” (most common vernacular) and wondering why you got so many fewer results.

“Sure. The measurement has flaws. But do you have a better idea? Tiobe is the best I know of. Can you beat it? Or are you just whining?”

When the flaws are so egregious, being the best is not good enough. Come to think of it, what does it mean to be the best at doing something wrong, anyway? Bringing to our attention that Tiobe is unreliable at every step of their methodology (underlying assumptions, search aggregation, even the math) is useful by itself, without trying to “beat it”.

Personally, I think that the TIOBE index is very skewed by the beginner effect. I believe that most of the hits for the top languages are in fact just student/hobbyist programmers asking for help etc. I’m sure there are a lot more people posting questions and answers for Visual Basic and Delphi than there are posting about Cobol or AS/400 CL. Then again, when you think about it, finding the amount of current chatter about a language does seem to be a fairly good indication of how many people are working with or learning it.

Aside from that, who cares? Why get upset when somebody says Java is more “popular” than Python? If you’re making money, that’s the most important thing. If you really want to blow off TIOBE’s findings, then make the argument that really, for any given architecture, there is only *one* language – everything else is just a kind of pre-processor.