With GitHub’s 3 millionth user just announced, the time was right to more deeply examine GitHub’s growth since its start back in 2008. Thanks to Francis Irving’s work (graph) at ScraperWiki, I found a way to query monthly growth rather than just relying on periodic announcements. (For those playing along at home, note the search syntax has since changed.)

My goal was to come up with a model for GitHub’s growth, both to understand what kind of rules its growth followed and to better predict the future.

GitHub users as a population

My first assumption was that I could handle this as a population using standard approaches like the exponential or logistic equations, but I started by plotting a couple of things: users over time, and the log of users over time. Getting a feel for the shape of the data should be the starting point for any analysis.

If it’s a population experiencing exponential growth, it should be log-linear (plotting the log of users on the Y-axis rather than raw users should give a straight line), but it’s not, so the growth cannot be modeled as a simple exponential. Since GitHub users increase faster than exponential growth allows (tighter curvature on the graph of users over time, higher slope on the log-linear graph), we need a superexponential model to fit the growth.
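The log-linearity check can be sketched in a few lines. This is a minimal illustration on synthetic series, not the GitHub data: the helper name `linear_fit_residual` and both sample series are assumptions made for the example. It fits a straight line to log(users) against time and reports how far the points stray from that line.

```python
import math

def linear_fit_residual(ts, ys):
    """Least-squares fit of log(y) = a + b*t; returns (slope b, RMS residual)."""
    logs = [math.log(y) for y in ys]
    n = len(ts)
    t_mean = sum(ts) / n
    l_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (l - l_mean) for t, l in zip(ts, logs))
    slope = sxy / sxx
    intercept = l_mean - slope * t_mean
    sq_err = sum((l - (intercept + slope * t)) ** 2 for t, l in zip(ts, logs))
    return slope, math.sqrt(sq_err / n)

# Synthetic monthly samples over five years (illustrative only, not GitHub data).
ts = [month / 12 for month in range(1, 61)]
pure_exponential = [1000 * math.exp(0.9 * t) for t in ts]
with_extra_factor = [1000 * t * math.exp(0.9 * t) for t in ts]

slope_exp, rms_exp = linear_fit_residual(ts, pure_exponential)
slope_other, rms_other = linear_fit_residual(ts, with_extra_factor)

print(f"pure exponential: slope={slope_exp:.3f}, RMS residual={rms_exp:.2e}")
print(f"extra factor of t: slope={slope_other:.3f}, RMS residual={rms_other:.2e}")
```

A truly exponential series leaves essentially no residual around the fitted line, while any series that is not log-linear leaves a visible one. Running the same check on real monthly user counts would be the numerical counterpart of eyeballing the log plot.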

After digging into this for a while, I finally discovered a model of populations that might fit — coalition-based growth, described by von Foerster and colleagues in 1960 in a publication in Science. Its essence is in game theory, considering the entire community as a single group in a two-entity game against its environment, due to members’ high level of communications enabling them to form tightly linked coalitions, rather than independent individuals trying to survive. To me, the parallel with collaborative software development seemed quite strong.

The best fit to the data I’ve found so far is described by superexponential growth following the coalition-based equation

$$P(t) = P_0 \, t \, e^{kt}$$
$P_0 = 49{,}100 \pm 1750$; $k = 0.54 \pm 0.009$

where P is the population at time t, P0 is the initial population, and k is a growth constant (i.e., the rate at which the population grows by a factor of e per unit time). This equation differs from the more typical exponential growth [$P(t) = P_0 e^{kt}$] by an additional factor of t, which is inserted to account for network effects and indicates that the rate of growth actually increases with time. The results generally make sense, which is always a good check to make: for example, the initial population is much closer to zero than 3 million.
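To get a feel for what the extra factor of t does, here is a small sketch that evaluates the coalition-based equation next to a plain exponential using the fitted constants quoted above. The function names are made up for the example, and it assumes t is measured in years since GitHub's start, which the later "year 10" projections suggest but the fit itself does not state.

```python
import math

P0 = 49_100  # fitted P0 quoted above
K = 0.54     # fitted growth constant k quoted above (assumed to be per year)

def coalition_users(t):
    """Coalition-based model: P(t) = P0 * t * e^(k t)."""
    return P0 * t * math.exp(K * t)

def exponential_users(t):
    """Plain exponential with the same constants, for comparison only."""
    return P0 * math.exp(K * t)

for year in (1, 3, 5):
    print(
        f"year {year}: coalition={coalition_users(year):,.0f}  "
        f"exponential={exponential_users(year):,.0f}"
    )
```

With the same constants, the two curves differ by exactly the factor t, so the gap between them widens every year.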

GitHub adoption as diffusion of a new innovation

In the meantime, I’d thrown out a request on Twitter for suggestions on how to model this, and Adrian Cockcroft suggested treating it as diffusion atop a pre-existing social network. This seemed reasonable too, so I started looking into it. It turns out that the logistic function is also used to describe diffusion of innovations, but in its early phase it’s again log-linear, which doesn’t fit the GitHub adoption data. Then I combined this with some of my previous thoughts that there must be alternate ways to model GitHub based on social-network analysis rather than population dynamics.

When I looked more deeply into the theory of diffusion of innovations, I discovered that it’s often treated using the Bass model. This is really just a combination of exponential and logistic equations with two coefficients, p and q, to model diffusion via broadcast advertising and social interactions, respectively. The Bass model does account for social networks, but its main shortcoming is that it treats them as fully connected and homogeneous (everyone knows everyone, and all people are identical), when in reality they’re often small world / scale-free. That said, I figured it would make sense to start with the simplest possible approximation and see how it did, and here are the results:

Intriguingly, the Bass model produced a nearly identical fit to the coalition-based model using the following equation:

where P again is the population at time t, p and q are coefficients for advertising and social-network effects, respectively, and m is the total size of the market.
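The equation itself does not survive in this text. For reference, the standard closed form of the Bass model for cumulative adopters, written in the symbols defined above, is given below. This assumes the fit used that textbook cumulative form, which the text does not confirm.

```latex
P(t) = m \, \frac{1 - e^{-(p+q)t}}{1 + \frac{q}{p} \, e^{-(p+q)t}}
```

In this form P(0) = 0, and P(t) approaches m as t grows, which gives the S-shaped curve typical of diffusion models.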

Under that model, you would attribute nearly all of GitHub’s popularity to social effects (word-of-mouth and friends) and nearly none to broadcast advertising. Again, it’s good to see the results generally make intuitive sense.
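To make the "nearly all social" reading concrete, here is a minimal sketch of the standard Bass model. The parameter values are hypothetical placeholders chosen so that q is much larger than p. They are not the fitted values, which are not listed here. The function names are also made up for the example.

```python
import math

def bass_users(t, p, q, m):
    """Standard Bass closed form for cumulative adopters at time t."""
    decay = math.exp(-(p + q) * t)
    return m * (1 - decay) / (1 + (q / p) * decay)

def social_share(adopted_fraction, p, q):
    """Share of the instantaneous adoption rate that comes from the q term.

    In the Bass model the adoption rate is (p + q*F) * (m - P), where
    F = P/m is the fraction of the market that has already adopted.
    """
    social = q * adopted_fraction
    return social / (p + social)

# Hypothetical parameters for illustration only (NOT the fitted values).
P_ADVERTISING = 0.001
Q_SOCIAL = 1.0
MARKET_SIZE = 10_000_000

for year in (2, 5, 10):
    users = bass_users(year, P_ADVERTISING, Q_SOCIAL, MARKET_SIZE)
    share = social_share(users / MARKET_SIZE, P_ADVERTISING, Q_SOCIAL)
    print(f"year {year}: users={users:,.0f}  social share of new adoption={share:.1%}")
```

When q dominates p like this, the advertising term only matters at the very start, before there are enough existing users for word of mouth to take over.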

Regarding the market size m, it’s critical to note that it is commonly underestimated, particularly with the paucity of data here (only a partial curve and no inflection point).

In summary, modeling GitHub adoption as diffusion of an innovation seems to work pretty well, too, despite the obvious simplifications regarding the social network and static market sizing, advertising, pricing, etc.

What about the future?

Understanding the past is useful, but what we really want to do is predict something. So, do these models enable us to do that? Sure — let’s plot the models out to year 10 and see what things look like:

Neither fit is perfect, with some clear systematic errors, but I suspect that won’t be fixable without a more complex model (e.g. heterogeneous social networks) or more data. The coalition model says things increase faster and faster forever (which seems just a tad unrealistic), predicting 100 million users after 10 years. Although I don’t necessarily discard 100 million developers out of hand, I’m definitely skeptical about 2.5 billion at year 15, which makes the model as a whole a little weak. The Bass model, on the other hand, is a more typical S-shaped curve that’s clearly moving toward a maximum, predicting 20 million users at year 10 and 21 million at year 15.

Now, don’t take those numbers as hard figures, because there are huge amounts of uncertainty associated with them. The purpose of this exercise was more about understanding GitHub’s growth model and possibly some near-term prediction.

In the near term, I’d estimate, based on my Bass model, that GitHub will hit 4 million users near August and 5 million near December.
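Milestone estimates like these come from solving a fitted model for the time at which it crosses a target. Here is a rough sketch of that step using bisection. Because the fitted Bass parameters are not listed, it uses the coalition-based equation and its quoted constants instead, so it will not reproduce the dates above. The helper `time_to_reach` is a made-up name, and t is assumed to be in years.

```python
import math

P0 = 49_100  # fitted P0 from the coalition-based model
K = 0.54     # fitted growth constant k (assumed to be per year)

def coalition_users(t):
    """Coalition-based model: P(t) = P0 * t * e^(k t)."""
    return P0 * t * math.exp(K * t)

def time_to_reach(target, model, lo=0.0, hi=50.0, tol=1e-9):
    """Bisection for model(t) == target; assumes model is increasing on [lo, hi]."""
    if not (model(lo) <= target <= model(hi)):
        raise ValueError("target is outside the bracketed range")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if model(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for target in (4_000_000, 5_000_000):
    t_hit = time_to_reach(target, coalition_users)
    print(f"{target:,} users reached at t = {t_hit:.2f} years")
```

Swapping in a different model only requires passing another increasing function of t, such as a Bass curve with its own fitted parameters.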