Open-source developers all over the world contribute to millions of projects every day: writing and reviewing code, filing and discussing bug reports, updating documentation and project wikis, and so forth. The data generated from this activity can reveal interesting trends across many industries, including popularity of programming languages over time, defect rates, contribution metrics, and popularity of specific frameworks and libraries.

The challenge in extracting these trends is gathering the data. Each project has its own distributed workflow, code repositories, and conventions. Having hosted dozens of my own projects on GitHub, I've long wanted to analyze the developer activity from the 2.6M+ public projects hosted on GitHub. Hence, earlier this year GitHub Archive was born!

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. Each day it archives over 120,000 public activities, ranging from new commits and fork events to opening and closing tickets, each with detailed metadata.

Once I collected the data, I needed a tool to analyze it, and that is when I found Google BigQuery. Based on the research behind Dremel, a popular internal tool at Google for analyzing web-scale datasets, BigQuery allowed me to easily import the entire dataset and use a familiar SQL like syntax to comb through the gigabytes of data in seconds. Plus the tool will scale to terabyte datasets, so there is plenty of room to grow!

The best news is that thanks to collaboration from the GitHub and BigQuery teams, the GitHub dataset is now public and available for you to slice and dice in any way you like. No need to worry about data gathering or database schemas: BigQuery will do all the heavy lifting, and you can just compose your queries to be executed in realtime.

Here's a real-world example. What are the most popular programming languages on GitHub over the past month?

Ilya Grigorik is a Web Performance Engineer and Advocate at Google, an open-source evangelist, and an analytics geek. You can find him on GitHub under igrigorik, and blogging about web performance at igvita.com.