github

1) GitHub is *the thing*. It has a modern UI that follows current trends. It is easy to use and has only one version control mechanism, which is of course Git. It has its own culture and fans (e.g. Octocat, gadgets, stickers, etc.). Although it is sometimes blocked (e.g. in China) and suffers short outages, it is highly reliable and refreshes the data on its web pages immediately after a single change made from the git client / protocol (yes, Git is also a protocol).

2) GitHub has the largest number of users and projects, more than SourceForge.

3) GitHub does not show advertisements on its website, and it never will, as there is no need for them. While SourceForge is currently packed with wide blocks of various advertisements (probably to keep itself funded), the GitHub website is clean and feature-oriented.

4) Probably most important: a rich API for developers and researchers. It made solutions like GHTorrent (http://ghtorrent.org/) possible. It allowed Google BigQuery to use GitHub timeline data. It is possible to create your own local MongoDB or MySQL database holding all events from the GitHub timeline. Thanks to fast and secure OAuth for web apps, applications like the Open Source Report Card (https://osrc.dfm.io/) could be created.

5) Trend analysis on Google Scholar supports my point. The number of papers involving GitHub is increasing, while the number of articles on SourceForge is decreasing. A small number of people in the world do high-quality FLOSS* research using only GitHub data, and their work is cited quite often even though it is new research (papers from 2014 and 2015).

Source: self-made in Feb 2015

6) There are many external apps that support continuous integration and the management of OSS teams. One example of an automatic build system is drone.io. There is academic research on possible task-assignment strategies in OSS teams, as well as on creating central planners for work distribution. Most importantly, there are papers on possible quality models in FLOSS teams and on results from analyzing teams on GitHub.

7) GitHub employees are present at many important conferences on FLOSS and / or web technologies. Ivan Žužak will be a speaker at one of the workshops at the 11th Intl. Conf. on Open Source Systems (Florence, 2015). They are very keen on making an impact and helping the open-source community.

There are 257 people from all over the globe working at GitHub. Meet them here.

8) There is a high-quality manual for mining GitHub, as well as know-how on avoiding perils in OSS analysis (e.g. forks vs. parent repositories, push model vs. fork-push). Check out:

Report on the current state of the art in researching open source software and teams on GitHub

The main idea of this article is to present the people who research teams and OSS created on GitHub, and the software that uses GitHub-driven data to reveal less visible characteristics of code repositories hosted on GitHub.

– 4th March 2014

Software implementations, tools and practical establishments

“The Open Source Report Card” is a portal available at http://osrc.dfm.io/. It is also an open source project developed on GitHub and licensed under the MIT License. At the top of the page they state a warning: “Dear recruiters, GitHub is not your C.V. and that these stats only provide a biased and one-sided view.” Their website is also powered by a FusionAds template (http://fusionads.net/). In the centre it has a text box where you can enter a valid GitHub user login, which leads to an individual report card. Users are analysed in the following categories: languages, schedule, organization membership, and recent activity. The nickname of the most similar user is shown, along with a list of the 5 users most similar in activity.

GitHub visualizer is a website available at http://ghv.artzub.com/. It lets you create a visualization of the work done in a chosen GitHub repository.

Coderstats.net is another example of a report card; the portal is available at http://coderstats.net. It shows a short summary of repositories and languages used. It gives no information about hours of activity or similarity to other GitHub users.

Ohloh.net is a huge project concerning open source. The portal is available at http://www.ohloh.net/. They say they index 663,168 open source projects. The website lets you browse people, projects, and organizations through rankings or a search engine. They collect data not only from GitHub but also from other repository providers. The portal encourages users to create “FLOSS resumes”. It basically works by “claiming” contributions, which ultimately links a person to his/her projects.

Octoboard (www.octoboard.com) is a GitHub activity dashboard. Octoboard is based on GitHub Archive: each day, it scans new GitHub event archives and computes a few stats, with a 15-day history. You can see some general data on its main page, or use the menu for more information about languages and history. Octoboard is an open source project built for the GitHub Data Challenge by Denis Roussel.

GitHub employees and their work

The GitHub Developer Program was announced on the 6th of March 2014. They state: “By joining the Developer Program, you’ll receive ongoing notifications about changes to our API. You’ll be eligible to receive early access on select feature releases, and can request a development license for GitHub Enterprise. You can also submit your work for consideration on the integrations page.” I believe the main advantages of joining the program are: official recognition of your work, access to the newest API before its public release, and the possibility of getting a plan for private repositories for free. Team members get the “developer” badge on their profile page. Moreover, it gives a licence to use the GitHub and Octocat logos legally. An example of a developer badge owner is Rafał Chmiel (https://github.com/rafalchmiel) (update 02.05.14: it seems he has resigned from the program, as the badge is no longer visible). He created github-cheat-sheet, a witty list of lesser-known and useful tricks for GitHub.

Brian Doll (GitHub) and Ilya Grigorik (Google) gave a joint talk on “Analyzing Millions of GitHub Commits – what makes developers happy, angry, and everything in between?”

Data mining, querying, and more

GitHub HTML resources (e.g. use the Python package “Beautiful Soup” or Selenium). All data is visible in a browser, but some of it is drawn on demand. Many elements are loaded through AJAX, so it is not always easy to get rich HTML data, and one needs to simulate browser behaviour. Still, we use this mechanism, for example to create a dataset of dialogs (software available at https://github.com/wikiteams/linda-nlp).
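To make the scraping idea concrete, here is a minimal sketch that pulls link targets out of a static HTML page. It deliberately swaps Beautiful Soup for the standard library `html.parser` so it runs without extra packages; the names `LinkCollector` and `extract_links` are made up for this illustration. Like any plain parser, it only sees markup that is already in the page source, so content loaded later through AJAX would still need a browser-driving tool such as Selenium.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="/wikiteams/linda-nlp">repo</a> <a name="x">anchor</a></p>'
    print(extract_links(page))
```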

GitHub API

Self-explanatory: a programming interface for asking GitHub for details. It is limited by quotas (https://developer.github.com/v3/rate_limit/). We have used it extensively in our projects and found it reliable. The quota is no longer a problem for us, as we switch between accounts during script execution.
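As an alternative to switching accounts, a script can simply wait until the quota window resets. The sketch below is not the code from our projects; it only shows how the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers could be turned into a sleep time. The function name `seconds_until_reset` is an assumption for this example, and the headers are passed in as a plain dictionary, so no network access is needed.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before sending the next API request.

    `headers` is a mapping of response headers. A plain dict is
    case-sensitive, so the keys must be spelled exactly as below.
    `now` is a Unix timestamp; it defaults to the current time.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    # Quota exhausted: wait until the reset time, never a negative amount.
    return max(0.0, reset_at - now)


if __name__ == "__main__":
    sample = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1060"}
    print(seconds_until_reset(sample, now=1000))
```

In a real script the returned value would be passed to `time.sleep()` before the next request is sent.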

GitHub Archive

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. We downloaded the whole GitHub Archive and loaded the JSON documents into a MongoDB database on our servers.
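For a feel of what processing an archive file looks like, here is a small sketch that counts events per type. It assumes the file is gzip-compressed with one JSON event per line and that each event carries a `type` field; some older archive files may not follow this layout exactly. The function name `count_event_types` is made up, and the sample data is built in memory instead of being downloaded.

```python
import gzip
import io
import json
from collections import Counter


def count_event_types(gz_bytes):
    """Count events per type in one gzipped, newline-delimited JSON file."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            event = json.loads(line)
            counts[event.get("type", "Unknown")] += 1
    return counts


if __name__ == "__main__":
    # Tiny hand-made sample standing in for a downloaded archive file.
    raw = b'{"type": "PushEvent"}\n{"type": "WatchEvent"}\n{"type": "PushEvent"}\n'
    print(count_event_types(gzip.compress(raw)))
```

The same loop is where each parsed document could be handed to a database driver instead of a counter.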

GHTorrent project

GHTorrent monitors the GitHub public event timeline. For each event, it retrieves its contents and their dependencies, exhaustively. It then stores the raw JSON responses in a MongoDB database, while also extracting their structure into a MySQL database. They offer downloadable archives with database dumps, as well as online query tools for both MySQL and MongoDB. The project is documented with a database relationship schema. Still, all the data is based on events, so don’t expect to find everything that is available on GitHub itself.
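The "raw JSON plus extracted structure" idea can be illustrated in a few lines. This is only a sketch of the general pattern, not GHTorrent's actual code or schema: the `events` table and its columns are invented for this example, and an in-memory SQLite database stands in for both MongoDB and MySQL.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id TEXT PRIMARY KEY, type TEXT, actor TEXT, repo TEXT, raw TEXT)"
    )
    return conn


def store_event(conn, raw_json):
    """Keep the raw JSON and also extract a few fields into columns."""
    event = json.loads(raw_json)
    conn.execute(
        "INSERT INTO events (id, type, actor, repo, raw) VALUES (?, ?, ?, ?, ?)",
        (
            event["id"],
            event["type"],
            event.get("actor", {}).get("login"),
            event.get("repo", {}).get("name"),
            raw_json,
        ),
    )


if __name__ == "__main__":
    conn = open_store()
    store_event(
        conn,
        '{"id": "1", "type": "ForkEvent", '
        '"actor": {"login": "alice"}, "repo": {"name": "alice/demo"}}',
    )
    print(conn.execute("SELECT type, actor, repo FROM events").fetchall())
```

Keeping the raw document next to the extracted columns means fields that were not anticipated in the schema can still be recovered later.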

GitHub Big Query

BigQuery is a RESTful web service that enables interactive analysis of massively large datasets, working in conjunction with Google Storage. It is an Infrastructure as a Service (IaaS) that may be used complementarily with MapReduce. Google BigQuery added the GitHub timeline (data on all events; be sure to check the explanation of GitHub event types here: https://developer.github.com/v3/activity/events/types/). Mining a lot of data, especially with sophisticated queries, is expensive and requires setting up a paid Google account. BigQuery (BQ) is reportedly based on Dremel, a scalable, interactive ad hoc query system for the analysis of read-only nested data. To use your own data in BigQuery, it must first be uploaded to Google Storage and then imported using the BigQuery HTTP API. BigQuery requires all requests to be authenticated, supporting a number of Google-proprietary mechanisms as well as OAuth.

3rd party tools

Object-oriented GitHub API for Java – http://github.jcabi.com/. Version 0.7.5 as of 1st May 2014. They say that despite the fact “there are a few other Java adapters of Github API, our implementation has its advantages, including: all classes are private and implement public interfaces, out-of-the-box in-memory mock of Github server, all classes are truly immutable and thread-safe, every Github object gives GET/PATCH access to its raw JSON, HTTP request is accessible for modifications, and finally: entire Github API is available, at least through a configurable HTTP request”.

Programming language statistics – http://langpop.com/ – is a portal where language popularity data is aggregated from many sources, including GitHub, the Google search engine, Google Code, etc. A very informative and interesting portal.

Coderwall (https://coderwall.com) lets a developer build a profile page, yet it does not aggregate data to create the profile, so it is a limited but interesting data source.

Past and upcoming conferences

Publications

Here I list the papers I found on open-source software research. The only requirement was that the article must mention GitHub in its text and keywords. Later I will make a more precise list that filters out research not using GitHub-driven data (i.e. SourceForge is still a basis for research, but this is changing).

Legit – simplifying git by reducing its workflow to only a couple of instructions

“Legit is a complementary command-line interface for Git, optimized for workflow simplicity. It is heavily inspired by GitHub for Mac.” As I quote these words, in March 2014, this OSS project already has 3,055 stars. I like the idea of reducing the code-revision workflow to 5 snappy commands, and I recently introduced it to my students. There is a small drawback: installing this addition requires sudo on the machine, as git-legit is not part of PyPI packaging.