GitHub, Academia, and Collaborative Writing

Hello! My name is Harrison Massey, and I'm quite proud to be one of the 2014 HASTAC scholars. There's a lot I want to talk about with regard to open access and open-source software, but something that's been on my mind a lot recently has been the use of GitHub in academia. As the most-used open-source project hosting site in the world, GitHub is sure to be a vital component not only for open-source academic software projects, but also for open education initiatives. Git's distributed architecture and the layer of relatively simple tools provided by GitHub make the website an excellent tool for synchronizing document work amongst many teammates and outside collaborators.

However, despite the features of GitHub that make Git more accessible, there are still a number of conceptual barriers that have to be overcome to use and understand Git:

Decentralization

Git's decentralized nature makes it an excellent tool for coordinating collaborative editing. It 1) keeps collaborators from stepping on each other's toes and 2) allows the hosting service for the "central" repository collaborators are working on to be changed at a moment's notice. However, for people who are more accustomed to working on the same copy of a document simultaneously (such as on Google Drive or Wikipedia), the process of working with the distributed architecture can be jarring.

The easiest way to think about this process is to consider how you might perform analog review or annotation of a document. A friend wants you to edit their paper, so they print off a copy and hand it to you. You take the copy, bleed red onto the page to indicate changes, then hand the paper back to them. They can then look at those changes, which have not affected their original document file, and decide which changes they want to include in their original.

Git and GitHub operate in a similar fashion: if I want to change something, I make a copy of it to change. Once I'm done making my changes, I submit them for inclusion in the original. The names used for this process in Git, however, are not very intuitive, and the model also picks up a number of intermediary steps that make the process difficult to understand at first.

Specialized language

Repository

A repository is a group of documents that share a folder and revision history. A single repository might encompass the code files for a single software program or the HTML source for a website. In Git, a repository is the main target of operation, so most of the actions I'm discussing are applied to whole repositories.

Forking

A forked repository in GitHub

When your friend's document gets printed off and handed to you, a "fork" essentially happens. When you fork a project in GitHub, the entire repository gets copied from the owner's GitHub account to your GitHub account, including the files and the change history. This allows you to make edits to their files without disrupting anything they are working on.

The process of forking is at the heart of open collaboration on GitHub. It allows an individual or core team to coordinate closely on a work while still inviting contributions from those outside the team. It also avoids the need for features like Google Drive's "sharing" access controls, since changes don't automatically affect the original document.

Cloning

Creating a clone in GitHub for Windows

While it is possible to edit text files right on GitHub, I have to "clone" the repository from GitHub to my local machine if I want to use my own text editor or upload other types of files. This makes a copy of the entire repository, including its change history, on my machine. This can be difficult to understand at first, especially if you are using one of the GitHub desktop apps. The apps will show you all of your repositories, making it seem like they are available for editing, when there is actually an additional step to perform before edits can be made. To extend our previous metaphor, a clone is akin to making a photocopy: you now have a new sheet of paper that you can edit separately.

Commits

Creating a commit in GitHub for Windows

A "commit" is a set of changes made to a repository. Unlike a change entry in Google Drive or Wikipedia, a commit's changes can span any number of files in the repository. These commit groupings are purely defined by the user. When you want to define a commit, you choose ("add") which changes you want to include in a particular commit. You then add a description, which is used to describe what the group of changes accomplishes.

Unfortunately, our metaphor falls apart a bit at this point. Imagine, perhaps, that you sit down to edit the paper across multiple sessions and use a differently colored pen each time. Each color grouping would represent a commit.
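For readers who find code easier to follow than pens, here is a toy model of the idea. It is only a sketch of the concept and not how Git actually stores anything: the class names `Commit` and `ToyRepository`, and the file names, are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Commit:
    """A described group of changes, possibly spanning several files."""
    description: str
    changes: Dict[str, str]  # file path -> new content


@dataclass
class ToyRepository:
    """Hypothetical stand-in for a repository; not Git's real data model."""
    pending: Dict[str, str] = field(default_factory=dict)  # edits not yet chosen
    staged: Dict[str, str] = field(default_factory=dict)   # edits chosen ("added")
    history: List[Commit] = field(default_factory=list)    # commits made so far

    def edit(self, path: str, content: str) -> None:
        """Record an edit to a file without choosing it for a commit yet."""
        self.pending[path] = content

    def add(self, path: str) -> None:
        """Choose a pending edit for inclusion in the next commit."""
        self.staged[path] = self.pending.pop(path)

    def commit(self, description: str) -> Commit:
        """Bundle every chosen edit into one commit with a description."""
        new_commit = Commit(description, dict(self.staged))
        self.staged.clear()
        self.history.append(new_commit)
        return new_commit


repo = ToyRepository()
repo.edit("intro.md", "Reworded opening paragraph")
repo.edit("methods.md", "Fixed a citation")
repo.edit("notes.md", "Half-finished thoughts")

# Only two of the three edits go into this commit; the third stays pending.
repo.add("intro.md")
repo.add("methods.md")
repo.commit("Tidy up introduction and methods")
```

The point of the sketch is that the grouping is entirely up to you: the edit to `notes.md` is left out of the commit and can go into a later one with its own description.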

Pushing

The "Sync" button in GitHub for Windows

Once you have one or more commits on your clone, you can "push" them to GitHub. This uploads the commits on your machine to the original repository on GitHub from which you created the clone. In our metaphor, this copies edits from your photocopy to your original printout (let's pretend we have magic ink that can do this). It's important to note that until you do this, your changes will not appear on GitHub.

If you're using the official GitHub desktop application, this will be labeled as "publish" or "sync." "Sync" actually downloads the latest version of your repository from GitHub, then uploads your commits to GitHub, but you will likely primarily use the button for uploading.

Pull Requests

The branch comparison screen for creating a pull request

Finally, a "pull request" does exactly what it says on the somewhat strangely-phrased tin: it asks the owner of the repository you forked to include your changes in the original, much like handing a red ink-stained page back to a writer. This request encapsulates any commits you created yourself along with a name and description for the request, letting the original owner know what the commits accomplish in aggregate. It's generally a good idea to make a pull request that accomplishes a single task as opposed to several; that way, if the owner likes one change and doesn't like another, it's easier to just include the part they like.

You can create a pull request using the green compare button near the top-left of the repository screen or by pressing the "new pull request" button in the pull request view on a GitHub repository. You then tell GitHub to "compare across forks," and you select your fork from the "head fork" dropdown. This compares your copy with the original repository, allowing you to select commits for the request.

Issues

When you create a pull request, it will also show up as an "issue" on the original repository. An issue can be anything from a bug report on a software application to a discussion question for a document writing team. Issues are numbered, which allows you to refer to them from commits or other issues using "#" and the issue number (e.g. "#23"). Issues can also be tagged with "labels" such as "bug" or "discussion," which allow you to categorize issues into a versatile set of purposes.
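To make the "#23" convention concrete, here is a minimal sketch that pulls issue numbers out of a commit description. The pattern and the function name `find_issue_refs` are my own assumptions for illustration; GitHub's actual matching rules are more involved than this.

```python
import re
from typing import List

# A "#" followed by digits, ignored when the "#" is glued to the end of a word
# (so that something like "c#3" is not treated as a reference).
ISSUE_REF = re.compile(r"(?<!\w)#(\d+)\b")


def find_issue_refs(message: str) -> List[int]:
    """Return the issue numbers mentioned in a commit description, in order."""
    return [int(number) for number in ISSUE_REF.findall(message)]


print(find_issue_refs("Fix typo in the introduction, see #23 and #7"))
```

Writing a reference like this in a commit description or another issue is what lets GitHub link the two together.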

Putting it all together

So, to contribute to another person's GitHub repository, you usually follow these steps:

Fork the target repository

Clone your fork of the repository to your computer

Make your edits

Create a commit (or commits) to encapsulate your changes

Push your changes to your fork on GitHub

Create a pull request that asks the original owner to include the commits from your fork in their original copy

Wait for the repository owner to respond, and conduct any discussion about the pull request in the corresponding issue.
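If you would rather use the command line than a desktop app, the clone, commit, and push steps above correspond roughly to the Git commands assembled below. This is a sketch under some assumptions: the repository URL, directory, file name, and commit description are placeholders, and the function `contribution_commands` is hypothetical. Forking and creating the pull request happen on the GitHub website, so they have no command here, and your edits happen between the `clone` and `add` commands. The script only prints the commands; it does not run them.

```python
import shlex
from typing import List


def contribution_commands(fork_url: str, directory: str,
                          files: List[str], message: str) -> List[List[str]]:
    """Return the Git commands for the clone, commit, and push steps."""
    return [
        # Copy your fork from GitHub to a directory on your machine.
        ["git", "clone", fork_url, directory],
        # Choose which edited files belong in the commit.
        ["git", "-C", directory, "add", *files],
        # Create the commit with a description.
        ["git", "-C", directory, "commit", "-m", message],
        # Upload the current branch to your fork ("origin") on GitHub.
        ["git", "-C", directory, "push", "origin", "HEAD"],
    ]


commands = contribution_commands(
    "https://github.com/your-username/example-repo.git",  # placeholder URL
    "example-repo",                                        # placeholder directory
    ["post.md"],                                           # placeholder file
    "Fix typos in the introduction",
)
for command in commands:
    print(shlex.join(command))
```

The `-C` option tells Git which directory to work in, which saves a separate step for changing into the cloned folder.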

It is possible to skip a number of these steps by using GitHub's browser-based text editor. However, I recommend going through the full process at least once before using the online editor. The editor uses a lot of this terminology to explain what's going on behind the scenes, which makes the interface confusing if you're unfamiliar with the process.

When Git might not be the best collaboration system

Git and GitHub are very good tools for asynchronous, open collaboration. However, if you're trying to run a live editing session or keep your documents private, you may want to choose a different tool.

For live, synchronous editing, you probably want a system with features such as cursor sharing and chat. Etherpad, Etherpad Lite, and Google Drive are all better for this than GitHub, since changes show up immediately and don't require "commits."

It's also worth noting that Git and GitHub have a longer setup time than many other collaborative tools, which is fine if all your collaborators have used them before. If one or more participants don't have them set up, though, you may want to opt for a speedier option if time is of the essence.

Git started!

If you're unfamiliar with Git, the easiest way to get started is to sign up on the GitHub front page and then download one of the GitHub desktop applications (Windows or Mac). Then, if you are a HASTAC scholar and leave a comment here (or maybe we should start a forum post) with your GitHub username, I have a GitHub organization set up for us that I'll add you to.

You can also view the Markdown file for this post and all my future blog posts in my hastac-blog repository, where I'm posting them to share the source and solicit corrections to spelling, grammar, and the like.

The terminology surrounding Git and GitHub can be scary and confusing, but once you get used to it, GitHub can become a key part of your collaborative workflow. The abilities to solicit edits in an organized fashion and conduct discussion through associated "issues" are quite valuable, and the tool is flexible enough to fit a number of collaborative processes.

If any of you are already using GitHub for academic projects, I'd love to hear about it! I've barely scratched the surface of Git's abilities and uses, so continued discussion would be great.

Thanks for the link, Chris! I hadn't seen that before, and I'm definitely interested in seeing where it goes. Plus, their style guide has a lot of generally good advice for anyone writing in plain text.

I've been working on a couple of collaborative projects of late and we are using GitHub as the place to store our datasets and workflow. I find it helpful to think like a developer when I am doing research. It keeps me on track and focuses my attention.

I think that the best writing medium for a GitHub-hosted project is TeX/LaTeX. It still does not handle live editing (though there are platforms for live editing of TeX documents), but since it is plaintext with markup, it works with Git the way that any other plaintext works. I find it relatively easy to have multiple authors working on the same document without conflicts. All changes are tracked in a really transparent way, which is a nice bonus. Plus, there is the added benefit of attracting comments if you keep the repos open. Datasets in open repos would be a problem if they are not de-identified, obviously.

I would love to talk more about this with you in the future. I've been poking around at the idea of studying GitHub as an authoring platform all summer. Maybe a bunch of us can collaborate on that.

Interesting. Yes, deidentification is important for datasets, especially since once something is committed and pushed, you can't really get rid of it without deleting the repository from GitHub. It's quite simple to do a search and find all the places where people have committed encryption keys; I can't imagine identifiable data would be much different.

In my mind, TeX is overkill for short-form writing such as this blog post, but it's great for research papers and longer articles. I've been working on learning it myself from WriteLaTeX's posted tutorials, and there's a lot of power there, especially for quantitative research.

At any rate, I'd definitely love to talk about that. Just from looking at these responses I already have some ideas for code tools for plaintext + Git authoring.

Hey Harrison, very interesting blog post, and I am very happy to see more and more posts on Open Science and "Forking the Academy". I was wondering if you have considered using Authorea (https://www.authorea.com/) to overcome some of the roadblocks you talk about. Authorea uses Git in the background, and it now offers direct GitHub integration, allowing Git experts to work with non-Git experts (who might consider learning Git an overhead for their work).

I appreciate the ideas about how to use Git (and GitHub). I've been trying to use them for non-coding projects for a while. I had also never seen Authorea, but I like the look of it. It's not exactly related, except for the collaborative writing piece, but I'm wondering if folks have ever tried https://gingkoapp.com/ and whether there is a Git-related connection behind that app? I haven't given it a whirl yet, but it looks like a cool way to write.

We thought about using a Git approach for creating interactive courses. So we built a quick prototype. You can copy and push "change requests". You can also add quizzes and earn some points whenever someone answers your quizzes or you answer a quiz from someone else (some little gamification elements). Let me know what you think!

Harrison, somehow I missed this until now -- but thanks for sharing it!

Do you think GitHub would be as conducive to systems mapping as it would be to blog posts? I'm about to begin a project that will map relationships among institutions, individuals, technologies and laws in order to suggest avenues for future policymaking -- sort of like Actor Network Theory, but policy-focused.

I've been looking for a tool that would enable me to create a draft and invite others to offer their own iteration, but I'm unsure whether GitHub would accommodate a project that was formatted to look more like a map than a traditional chunk of text or code.

Thanks so much for a great guide with metaphors. I'm late to the party, but wanted to contribute to the discussion.

There's one pretty big issue that I didn't see in all of this discussion so far, and that is the non-trivial problem of resolving merge conflicts in Git. Perhaps it's useful to recommend some heuristics to avoid conflicts in the first place; otherwise, the pull request steps are not simple. I've done a few complex merges with LaTeX code and find it really cognitively challenging (!), perhaps because the content is text, but the format is a typesetting language. Git's default diff and merge tools compare files line by line, so depending on the nature of the conflict, it is not straightforward (especially when text gets moved around or paragraphs are re-wrapped). Here are a few strategies I've come across:

avoid conflicts in the first place:

separate the project into sub-files (e.g., for each document section) which are then included into the main LaTeX file.

(not really the git core philosophy) agree on not modifying the same sub-file between commits (if you only have one file, this is hard).

if conflicts do arise, learn to use a tool such as kdiff3, which makes merging LaTeX files less painful.
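As one possible way to act on the sub-file strategy above, the sketch below splits the body of a LaTeX document at each `\section` command and produces a main body that pulls the pieces back in with `\input`. It rests on several assumptions: the function name `split_sections` and the file-naming scheme are made up, the input is only the text between `\begin{document}` and `\end{document}`, and each `\section{...}` starts at the beginning of a line with no nested braces in its title.

```python
import re
from typing import Dict, Tuple

# A \section{Title} command at the start of a line (simple titles only).
SECTION = re.compile(r"^\\section\{(.+?)\}", re.MULTILINE)


def split_sections(body: str) -> Tuple[str, Dict[str, str]]:
    """Split a LaTeX body into one sub-file per section.

    Returns the new main body (with one input line per section) and a
    mapping from sub-file name to that section's text.
    """
    matches = list(SECTION.finditer(body))
    if not matches:
        return body, {}

    main_parts = [body[:matches[0].start()]]  # text before the first section
    files: Dict[str, str] = {}
    for index, match in enumerate(matches):
        # A section runs until the next section starts, or to the end.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(body)
        slug = re.sub(r"[^a-z0-9]+", "-", match.group(1).lower()).strip("-")
        name = f"{index + 1:02d}-{slug}"
        files[name + ".tex"] = body[match.start():end]
        main_parts.append(f"\\input{{{name}}}\n")
    return "".join(main_parts), files


example = (
    "\\section{Introduction}\nHello.\n"
    "\\section{Methods and Data}\nWe did things.\n"
)
main_body, sub_files = split_sections(example)
print(main_body)
for filename in sub_files:
    print(filename)
```

With one file per section, two authors editing different sections touch different files, so their commits are much less likely to collide.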