Evaluating Open Source Participation by Email Traffic

Dalibor Topic was the one to give me this idea, though I’m not sure if he’d remember the tweet. He was, however, the one who pointed me at MarkMail’s archive of open source list traffic, which I’d seen before, using a by-domain constraint, which I hadn’t. The idea is simple: MarkMail maintains a searchable index of the mailing lists for a number of open source projects (these, specifically). As a means of demonstrating the value of its MarkLogic Server, it parses the individual messages into XML and renders them queryable according to specific dimensions.

Given this ability, I thought it would be interesting to see what we might learn by examining – and in some cases comparing – general participation by domain. While this is not intended to be a comprehensive or authoritative statement on actual levels of engagement, the datapoints are interesting if nothing else. You can replicate these queries at MarkMail yourself using the “type:development from:domain.com” syntax. I’ve chosen development rather than commits because I wanted a broader sense of engagement than putbacks, but the latter would be interesting to study as well.
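For anyone repeating these searches across several domains, here is a minimal sketch of how the query strings might be assembled. Only the “type:development from:domain.com” syntax comes from the text above; the search URL and both helper names are assumptions for illustration and should be checked against the MarkMail site itself.

```python
from urllib.parse import urlencode

# Assumed search endpoint; verify against the MarkMail site before relying on it.
SEARCH_URL = "https://markmail.org/search/"


def build_query(domain, list_type="development"):
    """Return a MarkMail-style query string for one sender domain."""
    return f"type:{list_type} from:{domain}"


def build_search_url(domain, list_type="development"):
    """Return a URL-encoded search link for the query (hypothetical helper)."""
    return SEARCH_URL + "?" + urlencode({"q": build_query(domain, list_type)})


if __name__ == "__main__":
    # Print the query and link for each domain being compared.
    for domain in ("hp.com", "ibm.com"):
        print(build_query(domain), "->", build_search_url(domain))
```

Swapping `list_type` for another value would, presumably, give the commits-only view mentioned above, though I have not confirmed which type names MarkMail accepts.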

Before proceeding, a few caveats:

MarkMail is currently indexing 8,146 sources. This is clearly not all of the open source mailing lists, so the picture is incomplete. It’s been a little while since their blog has been updated, as well.

As has been documented in other discussions, such as measurement of contributions to the Linux kernel, many developers – though employed by a given entity – may prefer to use their own email address for on-list communications, which obviously skews the graphs below. Nor is a given domain likely to capture all of a company’s employees.

Not so much a caveat as an FYI, the graphs below aren’t normalized, as you’ll be able to tell quickly. Nor does MarkMail, as far as I can tell, expose the dataset for external processing. So pay close attention to the Y axis in all of the graphs.
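If you did want comparable curves, one simple option is to scale each domain’s counts to its own peak. The sketch below shows that idea only; since MarkMail does not appear to expose the dataset, the counts would have to be transcribed by hand, and the numbers and domain names here are made up for illustration.

```python
def normalize_to_peak(counts):
    """Scale a list of per-period message counts so the peak equals 1.0."""
    peak = max(counts, default=0)
    if peak == 0:
        # Empty or all-zero series: nothing to scale.
        return [0.0 for _ in counts]
    return [c / peak for c in counts]


# Made-up monthly counts for two hypothetical domains (not MarkMail data).
series = {
    "big.example": [200, 400, 800, 600],
    "small.example": [5, 10, 20, 15],
}
normalized = {domain: normalize_to_peak(c) for domain, c in series.items()}
```

Both hypothetical series collapse to the same shape once scaled, which is the point: peak normalization shows the trend but throws away the difference in volume, so it complements the raw graphs rather than replacing them.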

Anyway, on to the data. Let’s look at how the big boys compare first.

HP vs IBM

Here’s HP:

And here’s IBM:

As might be predicted, IBM’s measurable list traffic exceeds HP’s. The level of that disparity is a bit of a surprise, but the most notable feature of both graphs is the general downward trend in participation beginning in 2009. The timing raises the question: has large system provider participation in open source been negatively impacted by the recession? We can’t answer that with this dataset, unfortunately.

One question we can attempt to answer with the available data: how has the acquisition of Sun by Oracle affected the participation of both companies in open source?

Oracle & Sun

Here’s Sun:

And here’s Oracle:

The answer, according to this dataset, is that while Oracle’s level of participation in open source communities spiked following the acquisition, it fails to replicate Sun’s performance as an independent, which was, to be fair, among the highest participation observed. Oracle’s list activity, post-Sun, exceeds IBM’s at present, where it fell short up through 2009.

Moving on from the large systems vendors, what does the participation of Linux vendors look like?

The Linux Vendors

Here’s Canonical:

And here’s Novell:

Last, but obviously not least, Red Hat:

Red Hat dominates, as expected. With a broader portfolio of open source middleware in addition to the operating system, Red Hat’s observable participation is among the highest in the industry. What I didn’t anticipate, however, was either the decline from Novell or that they would be eclipsed by Canonical. It’s important to keep this in perspective, of course: the number of messages does not equate to the number of contributions to the kernel, for example (the Linux Foundation has detailed that here). Still, the above is worth some thought.

How about some notable proprietary vendors? What does their participation look like?

Microsoft & VMware

Here’s Microsoft:

And here’s VMware:

The peaks and valleys, particularly for Microsoft, are a bit curious, but otherwise, there’s not much to be seen here. Minimal involvement, as expected.

Hardware players?

AMD vs. Intel

Here’s AMD:

And here’s Intel:

While the disparity between the levels of participation here is worth noting, the most important aspect of the above graphs, to me, is the trendline. Participation is escalating significantly, which is indicative, perhaps, of the growing importance of open source both to customers and within the firms themselves.

What of the internet firms, who have been heavily dependent on open source, historically?

The Internet Firms

Here’s Facebook:

Google:

And Twitter:

With the exception of the slight tail to Google’s participation, I’m not sure there’s much to be extracted here. Neither Facebook nor Twitter has thousands of employees, so the volume is not terribly indicative. It is marginally interesting, however, that Twitter’s Y axis indicates a level up from Facebook.

One last question: it’s often been remarked that open source developers, in spite of being passionate about that software, are heavily dependent on proprietary webmail systems, principally Gmail. So let’s look at that.

The Webmail Providers

Here’s Gmail:

Hotmail:

And Yahoo Mail:

Several things jump out. First, Hotmail and Yahoo Mail have been flat to declining for a minimum of two years. Second, Gmail’s trajectory up until that time was sustained growth; since, it has also plateaued. Last, Gmail is massively more popular than either; than both combined, actually. Gmail, in point of fact, is the most popular single domain in this study. It seems plausible, therefore, that a Gmail address is in fact the address of choice for the open source population.

Disclosure: of the mentioned companies, RedMonk clients include IBM, Microsoft, and Red Hat.

23 comments

Most folks have multiple addresses; which one they use depends on the corporate culture and other circumstances.
For example, I used to post using my @sun.com at Java.Net but when we got acquired, I switched to using my @dev.java.net account.
As another example, the Apache culture is focused on individuals, and most contributors there will use their @apache.org address.

I think you’re stretching a little with the GMail conclusion. I would believe it more if I could see it as a percentage of total posts. Another thing worth mentioning is that GMail allows posting from arbitrary non-gmail.com addresses, so it might be better to check something besides the From: header.

Thanks, interesting numbers! Another metric that would be valuable would be the number of comments/patches by company in bug tracking systems. I know that as an Eclipse committer, I have most of my discussions within Bugzilla, not mailing lists. I would echo the comments regarding gmail addresses. I know several committers who use gmail to manage their open source mail traffic simply because it has good filtering tools. My employer encourages us to use our corporate email addresses to highlight the contribution we make to open source.

I have also heard anecdotes of some very large companies using a handful of developers to interface with a larger community and make contributions on behalf of many more developers (I believe I heard this about Webkit specifically).


“More than culture: some of these companies have, or have had in the past, explicit policies that employees should contribute from a non-company address.”

In a sense, though, that makes the numbers ‘accurate’, because in that situation it seems reasonable to consider the contribution as having been a private volunteer effort by someone who just happens to work for a company, not a contribution from that company. Presumably companies who have such a policy are ones who do not encourage or see value in contributions to open source projects, so it would not be right to ‘credit’ the company with that contribution…

That’s not true for the cases I had in mind, Adam. I was thinking of two scenarios.

* At apache the culture is to use @apache.org addresses. That was the case for Sun and IBM guys.

* In some companies, the company encourages (or mandates?) contributions from their employees from either apache.org or eclipse.org addresses to foster a sense of “the community interest is first; the company is second”.

Sun didn’t have a mandate, but it was certainly OK to use either address. Ian was referring to the second case.

Finally, since in many (most?) OSS forges you need to use their ID anyhow, and they have email addresses, it may just be more convenient to configure your email client to route all the accounts into there. That’s what I do.

Jef Spaleta says:

This is potentially misleading because there is a proclivity among vendors to be heavily involved in the projects they themselves manage. Some vendors have more transparent development processes for their pet projects than others, so it skews the activity curve a little.

If you want to look at collaboration and participation outside of single vendor gardens you need to exclude lists from domains that a corporate entity hosts for internal projects.