Revisiting the Dataists Programming Language Rankings

In December of last year, Drew Conway set out to quantitatively explore programming language popularity. Rather than choose between GitHub-derived raw project volume numbers and the proxy of community discussion metrics extracted from Stack Overflow, he proposed to measure and compare them both.

The resulting analysis was fascinating, featuring both a high correlation and an obvious stratification of language popularity. Ten months later, we repeated this analysis to see what, if anything, had changed.

Working backwards from the original dataset, we recompiled the list using the GitHub website and the R script written by Conway. Here’s our updated data (note that higher rankings are better):

From there, we replicated the original scatterplot, as seen here.

Like Conway, we found both a high correlation – it’s slightly higher this time around, actually, at .79 versus .78 – and easily detectable language tiers. The first point tells us that GitHub and Stack Overflow generally agree; although as the original post notes, it’s impossible to say whether language popularity is a product of community traction or the reverse. The second point is interesting and worth exploring further. According to our plot, we essentially have four tiers of languages, in terms of popularity.
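For readers who want to try this kind of comparison themselves, here is a minimal sketch of a rank correlation between two popularity measures. It is not Conway's R script: the language names and counts are made up, the helper names `to_ranks` and `pearson` are hypothetical, and correlating ranks with the Pearson formula is an assumption about the method rather than a description of it.

```python
from math import sqrt


def to_ranks(counts):
    """Map each language to a rank where the largest count gets the highest rank.

    Ties are not handled specially here; tied counts get adjacent ranks.
    """
    ordered = sorted(counts, key=lambda name: counts[name])
    return {name: position for position, name in enumerate(ordered, start=1)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical counts for four made-up languages, not the real data.
github_projects = {"lang_a": 1000, "lang_b": 800, "lang_c": 50, "lang_d": 300}
stack_overflow_tags = {"lang_a": 900, "lang_b": 1200, "lang_c": 20, "lang_d": 100}

github_rank = to_ranks(github_projects)
overflow_rank = to_ranks(stack_overflow_tags)

# Align both rankings on the same language order before correlating.
languages = sorted(github_projects)
correlation = pearson(
    [github_rank[name] for name in languages],
    [overflow_rank[name] for name in languages],
)
print(f"rank correlation: {correlation:.2f}")
```

Working on ranks rather than raw counts keeps a single very large community from dominating the result, which is one reason a rank-based comparison is a reasonable choice for data like this.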

Before we get into what the above means, a few caveats. First, this is understood to be an incomplete list of programming languages: the omission of COBOL, still very much in use, should make that obvious. Second, this is a measure of two specific communities, and therefore reflects the usage biases of each. This kind of analysis is observational in nature, and cannot be considered representative of the market as a whole. And last, a quick logistical note: we dropped two languages from the original list – sclang and duby – because they had dropped off GitHub’s project list.

With that out of the way, our results – like Conway’s before – pass my basic sniff test. There are a few mild surprises – Erlang and D are a bit lower than I expected, Go slightly higher – but there are no glaring errors to my eye. If this is reliably the case, we will have gained an important tool in the triage that inevitably results from runtime fragmentation coverage.

Besides repeating last December’s analysis, we were also able to compare our findings with the raw numbers from the previous study to look for trends. The changes in GitHub rankings were generally minor, particularly amongst the top 10. There were some interesting tidbits in the growth rates of Stack Overflow tags, however. The highest growth – our apologies to Bryan Cantrill – came from CoffeeScript, which has seen 1527% growth in related tags since December. This is misleading, however, as CoffeeScript’s actual growth was 336 tags over that span. Filtering for languages with a minimum of 5,000 tags, then, here is what’s left.

The filtered list is essentially our Tier 1 languages plus Delphi, R and Scala. If we exclude C#’s remarkable performance, the average growth for this list is 87%. Healthy, relatively undifferentiated numbers for each, with R leading the non-C# pack at 136%. None can touch the growth of C# on Stack Overflow, however. With 823% growth since December, Stack Overflow has added more C# tags than there are in total for Java, JavaScript or PHP. This type of growth is worth exploring, and likely of interest to vendors like Microsoft or Xamarin.
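To make the small-base caveat concrete, here is a rough sketch of the growth arithmetic and the minimum-size filter. The CoffeeScript figures (336 tags added, 1527% growth) come from the text; everything else is hypothetical, including the helper names, the two sample languages, and the assumption that the 5,000-tag threshold applies to the current count rather than the December one.

```python
def growth_pct(old, new):
    """Percentage growth from an old count to a new count."""
    return (new - old) / old * 100


def implied_baseline(added, pct):
    """Starting count implied by an absolute increase and a growth percentage."""
    return added / (pct / 100)


# Figures from the text: 336 tags added for 1527% growth.
# The implied starting point is only about 22 tags, which is why the
# percentage looks dramatic while the absolute change is small.
coffeescript_baseline = implied_baseline(336, 1527)
print(f"implied CoffeeScript baseline: {coffeescript_baseline:.0f} tags")

# Hypothetical (old, new) tag counts for two made-up languages.
tag_counts = {
    "big_lang": (40000, 70000),
    "tiny_lang": (20, 400),
}

# Assumption: the threshold is applied to the current tag count.
MIN_TAGS = 5000

filtered = {
    name: growth_pct(old, new)
    for name, (old, new) in tag_counts.items()
    if new >= MIN_TAGS
}
print(filtered)
```

Filtering on absolute size before comparing percentages keeps languages with a tiny starting base from crowding out the ones where the growth represents a meaningful number of tags.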

The Takeaways

As an industry, we have the first tier languages mostly correct. While arguments can be constructed that Clojure, R, Scala et al may be considered first tier languages in certain contexts, the numbers don’t quite justify this within the studied communities. Further, the metrics strongly indicate that legacy compiled languages are not being replaced by interpreted alternatives, but rather are coexisting with them.

The strong correlation between GitHub and Stack Overflow indicates that using community behaviors as a proxy for actual language traction is a viable approach. This validates the idea that we can infer developer behaviors by tracking associated community trends; this is, in part, the central assumption RedMonk Analytics is built upon.

Quantitative analysis of programming language usage – and importantly, trending – will become an increasingly useful tool over time to both buy and sell side technologists, albeit for differing reasons.

There is something unique happening within the Stack Overflow community with respect to C#; the cause of the surge in related commentary there is worth exploration.

The presence of Objective-C in the Tier 1 grouping – and potentially the continued growth of C# and Java – hints at the importance of mobile development in runtime and tool selection.

What do you see in the above?

Credit: All credit for the idea behind this analysis belongs to Drew Conway; we’re merely replicating the study he originally conducted. All data used, meanwhile, is courtesy GitHub and Stack Overflow.

Disclosure: GitHub and Microsoft are RedMonk clients, while Stack Overflow and Xamarin are not.

13 comments

It was immediately obvious to me that the growth rate for C# did NOT pass the sniff test so I did a brief investigation.

In fact, the C# growth rate is an artifact of a problem in Conway’s original data collection. Jon Skeet (from SO) pointed out this problem in a comment and Conway corrected the graph but failed to update the raw data. I suspect you are comparing your data to Conway’s uncorrected raw data.

In Conway’s raw data, C# is rank 51, on his corrected graph it is in the top position (#56 approx) just like on your graph.

If you can get corrected data from Conway, I suspect you will find that C# is in line with growth rates for other languages.

Language usage as reflected by community behavior would be strongly biased towards newer languages (more to prove, spread through community action), and those languages favored by smaller businesses/teams (serviced by the community, not by vendors).

I would expect to see strong correlation across all forms of community behavior, from Q&A sites to IRC channels, from blogs to Twitter, and of course Github.

Github calls itself “Social Coding” for a reason. Even if not all projects hosted there are open source, the profile of developers/teams that use Github is strongly biased towards community behavior.

In other words, I would expect to see strong correlation between “teams that ask/answer questions online”, “teams that post/read blogs” and “teams that use Github”.

–

I’d be interested in seeing what the correlation is like between trends in community behavior, as depicted in this chart, and trends in professionally developed software.

I wonder how strongly they influence each other. Can we say banks will be using as much Ruby as Java in the next few years? Is it safe to bring server-side JavaScript to the ERP?

Seems to me that the global bias there is that both StackOverflow and GitHub cater to a certain type of crowd that probably doesn’t overlap much with enterprise and is very bleeding-edge/web2.0. (Even still, most folks aren’t on git.) How’s this stuff match up with the Black Duck data?

@Assaf: it’s worth pointing out, with regard to your contention that newer languages are advantaged, that both C and C++ – not exactly new or hot – actually perform well across both metrics.

as for the correlation between community and professionally developed software, i’m not sure how you define those categories. much community software these days is developed in communities. and conversely, many communities have professional implementations of their software.

if you mean software developed within enterprises, however, the answer is we just don’t know. they’re largely opaque to us from a research standpoint.

what we do know, however, is that the community trends are steadily infiltrating the enterprise ranks, as evidenced by the continued traction of dynamic languages amongst enterprise vendor software portfolios (e.g. CloudFoundry).

@Donnie Berkholz: excellent question. i’ll take a look in the near future. though it must be said that the public Black Duck data, being forge oriented, doesn’t offer us much better visibility into enterprise trends.

I’m not making a distinction between professional and community; a lot of software is developed professionally within the community (e.g. Linux, WebKit).

The distinction is more between teams that engage in community behavior, and teams that don’t. And yes, enterprise software is more of the latter, dark matter and such.

Community behavior would slant towards languages that don’t have a strong vendor base, where you rely on the community instead. C/C++ would rank high in community, ABAP and PL/SQL would rank high in vendor support (we barely see these on Github).

It would also slant towards newer languages because it’s easier to experiment when you have a lot of outside help. I’m guessing in the larger professional sphere, Visual Basic would rank much higher than Scala or Haskell, even than Shell (tier 1). I’d expect to see much more FORTRAN than Erlang.

No doubt community trends are infiltrating the enterprise ranks. Take Java, for example. It was a strong community/hacker language and clawed its way into the enterprise against the likes of C, PowerBuilder and such. The C/C++ it came to replace was before it a community/hacker language …

Anonymous Coward says:

I expect Objective-C to go down as Android tablets start to eat into Apple’s market share in that area. The take-off of Objective-C coincided with the launch of the iPad, but Android seems to be repeating in the tablet market its evolution from the smartphone market; that’s what my opinion is based on.