Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

GitHub provides 20+ event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. These events are aggregated into hourly archives, which you can access with any HTTP client:

Activity for 1/1/2015 @ 3PM UTC:

wget http://data.gharchive.org/2015-01-01-15.json.gz

Activity for 1/1/2015:

wget http://data.gharchive.org/2015-01-01-{0..23}.json.gz

Activity for all of January 2015:

wget http://data.gharchive.org/2015-01-{01..31}-{0..23}.json.gz

Each archive contains JSON-encoded events as reported by the GitHub API. You can download the raw data and apply your own processing to it - e.g. write a custom aggregation script, import it into a database, and so on. For example, a short Ruby script can download and iterate over a single archive.
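One way to do this is sketched below using only the Ruby standard library (the `each_event` helper name and the printed `type` field are illustrative choices, not part of GH Archive itself). Each line of an archive file is one gzip-compressed, JSON-encoded event:

```ruby
require 'zlib'
require 'json'

# Yield each event from a gzipped GH Archive stream.
# Every line in an archive is a single JSON-encoded event object.
def each_event(io)
  Zlib::GzipReader.new(io).each_line do |line|
    yield JSON.parse(line)
  end
end
```

Combined with open-uri from the standard library, `URI.open('http://data.gharchive.org/2015-01-01-15.json.gz') { |gz| each_event(gz) { |event| puts event['type'] } }` would download one hourly archive and print each event's type.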

Activity archives are available starting 2/12/2011.

Activity archives for dates between 2/12/2011 and 12/31/2014 were recorded from the (now deprecated) Timeline API.

Activity archives for dates starting 1/1/2015 are recorded from the Events API.

For the curious, check out The Changelog episode #144 for an in-depth interview about the history of GH Archive, integration with BigQuery, where the project is heading, and more.

The entire GH Archive is also available as a public dataset on Google BigQuery. The dataset is updated automatically every hour and enables you to run arbitrary SQL-like queries over the entire dataset in seconds.

The schema of these datasets contains distinct columns for common activity fields (matching the GitHub API response format), a "payload" string field containing the JSON-encoded activity description, and an "other" string field containing all remaining fields.

The content of the "payload" field is different for each event type and may be updated by GitHub at any point, hence it is kept as a serialized JSON string value in BigQuery. Use the provided JSON functions, e.g. JSON_EXTRACT(), to extract and access data in this field.
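As an illustration, a query along these lines could pull a value out of the payload with JSON_EXTRACT() (the daily table name and the specific event/field are assumptions for the example, not prescribed by this page):

```sql
-- Count issue actions for one day; the table name is an assumed example.
SELECT
  JSON_EXTRACT(payload, '$.action') AS action,
  COUNT(*) AS n
FROM `githubarchive.day.20150101`
WHERE type = 'IssuesEvent'
GROUP BY action
ORDER BY n DESC;
```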

The content of the "other" field is a JSON string which contains all other data provided by GitHub that does not match the predefined BigQuery schema - e.g. if GitHub adds a new field, it will show up in "other" unless and until the schema is extended to support it.

Changelog Nightly is the new and improved version of the daily email reports powered by the GH Archive data. These reports ship each day at 10pm CT and unearth the hottest new repos on GitHub. Alternatively, if you want something curated and less frequent, subscribe to Changelog Weekly.

GitLogs is a daily GitHub newsletter curated with a peak-detection algorithm, plus a sleek interface for searching topics and trends on GitHub.
