Centralized Workflow

As your research project moves from conception, through data collection,
modeling and analysis, to publishing and other forms of dissemination, its
components can fracture, lose their development history, and, worst of
all, become conflicted or lost. This lesson introduces a high-level strategy for
organizing your collaborative workflow, along with the necessary software and
cloud solutions. Called “the centralized workflow”, this strategy targets
distributed work by equal contributors on a shared codebase and is widespread in
collaborative research.

A central hub stores project files and their history. Researchers are spokes
on the wheel, each working on a local clone of the project. Project
integrity is maintained by rules for synchronizing commits between the hub
and spokes when users execute a push or pull on their local clone.

Reproducible Pipeline

The result of reproducible research is more than a published paper; it includes
the whole data-to-document pipeline. In a typical socio-environmental
synthesis project, a finished pipeline includes the following
steps:

Workflow vs. Pipeline (a weak analogy)

Workflow describes how your team collaboratively creates the code, software
environment and integrations that comprise the pipeline. By analogy to a physical pipeline that moves raw material to finished product, your workflow involves everything from drafting plans to testing the product.

Collaborative workflows require communication—developing a pipeline under
version control facilitates it.

Specific Achievements

RStudio + git

RStudio provides a GUI to the core tools provided by git. Log in to your RStudio Server account and upload handouts.zip. Click on “handouts.Rproj” to open the directory as a project.

Initialize git

Convert your RStudio project to a git repository by enabling version control,
available from the menu bar under “Tools” > “Version Control” > “Project Setup”.

Adding a git repository creates a hidden folder in your project called “.git”,
storing all the data about your project’s current and past state.
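For readers who prefer a terminal, the menu action above corresponds roughly to running git init inside the project folder. The sketch below uses a throwaway directory as a stand-in for the “handouts” project (the path and file contents are placeholders, not part of the lesson) to show the hidden folder appearing.

```shell
# Scratch folder standing in for the "handouts" project (hypothetical).
project=$(mktemp -d)
cd "$project"
echo "# handouts" > README.md

# Turn the folder into a git repository; this step creates ".git".
git init -q

# The hidden folder only shows up when listing all files.
ls -a
```

Deleting the “.git” folder would remove the version history while leaving the project files themselves in place.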

Commit

Once RStudio refreshes your project, there will be a “Git” tab in the same
window as the Environment tab. The window shows files that have content not
already committed in the current state of your project. Choose “Commit” to open a
new window for easy staging and committing.

check README.md and handouts.Rproj

write a commit message

commit (but heed the warning!)

Saving, staging, and committing are each separate steps, none of which implies any
of the others. This may seem like a hassle, but is a good thing! As your project
grows larger, you will frequently save changes you don’t want to commit: staging
lets you choose what changes get packaged into a commit.
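The three separate steps can also be seen from the command line. This is a minimal sketch in a scratch repository; the file “notes.txt” and the author identity are made-up placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "J. Doe"             # placeholder identity
git config user.email "jdoe@example.com"  # placeholder identity

# Save: two files exist on disk, but git is not tracking either yet.
echo "# handouts" > README.md
echo "scratch work" > notes.txt

# Stage: choose only the file that belongs in the next commit.
git add README.md

# Commit: package the staged change; notes.txt stays uncommitted.
git commit -q -m "Add README"

# notes.txt is still reported as untracked.
git status --short
```

Because only README.md was staged, the saved but unstaged file stays out of the project history until you decide otherwise.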

History

The history of your project shows a single commit; every new commit will be
chained on top of a preceding commit. Note the “Author” data is probably not
going to be recognized by GitHub and linked to your account.

Revisit the commit history to confirm that the author information has been
amended for the first commit. In the future, configure your user.name and user.email before starting a project, so you do not have to amend any commits.
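One way to set the author information and amend the most recent commit from a terminal is sketched below. The name and email are placeholders, and the settings are applied to a scratch repository only; adding --global to git config would apply them to every project on the machine.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "# handouts" > README.md
git add README.md

# First commit, made with a stand-in for an unrecognized author.
git -c user.name="rstudio" -c user.email="rstudio@localhost" \
    commit -q -m "Initial commit"

# Configure the identity for this repository (placeholder values) ...
git config user.name "J. Doe"
git config user.email "jdoe@example.com"

# ... and rewrite the most recent commit so it carries that identity.
git commit -q --amend --reset-author --no-edit

# Show the author recorded on the amended commit.
git log -1 --format='%an <%ae>'
```

Amending replaces the commit with a new one, so it is best done before the commit has been shared with anyone.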

Create the Hub

You have created a repository that has no history—it will accept the commits
made in RStudio without conflict. The quick start information provided by GitHub
explains how to finish configuration of your local git repo.

Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to contribute work of her own.

A git repository is a network of commits, although the current network is a tree
with just one branch. After a worker creates a clone, the local repo is
in the same state as the origin.
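The hub-and-spoke relationship can be tried out without a network connection. In this sketch a local bare repository stands in for the GitHub repo, so the path given to git remote add is an assumption; with GitHub it would be the URL from the quick start page. Directory names and the identity are placeholders.

```shell
work=$(mktemp -d)
cd "$work"

# A bare repository plays the part of the hub.
git init -q --bare hub.git

# The first spoke: an existing local repo with one commit.
git init -q spoke1
cd spoke1
git config user.name "J. Doe"             # placeholder identity
git config user.email "jdoe@example.com"  # placeholder identity
echo "# handouts" > README.md
git add README.md
git commit -q -m "Add README"

# Point the local repo at the hub and push the current branch.
git remote add origin "$work/hub.git"
git push -q -u origin HEAD

# A second spoke starts as a clone: the initial pull of everything.
cd "$work"
git clone -q hub.git spoke2
git -C spoke2 log --oneline
```

After the clone, both spokes point at the same commit, which is what “in the same state as the origin” means in practice.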

An essential component of the centralized workflow is the ability to merge
commit histories that have diverged. Any fork in the history has to be
re-integrated, and git does this automatically through merging.

Collaborators

Collaboration that goes beyond commenting on a final report—integrated work on
a project from start to finish—raises workflow challenges.

Data, script, or report: who has the most up-to-date version?

Will a collaborator’s work overwrite your own?

How do you recover a working version of a broken pipeline?

Centralized workflows, managed by git, help to answer these questions.

Project Integrity

The origin becomes the official up-to-date repo, even if your work is a
few commits ahead.

Diverging files are usually automatically merged by git.

Manual re-integration is aided by the ability to “checkout” the project at any
commit.
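As a rough illustration of checking out the project at an earlier commit, the sketch below makes two commits in a scratch repository and steps back to the first. The file name and identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "J. Doe"             # placeholder identity
git config user.email "jdoe@example.com"  # placeholder identity

echo "version 1" > analysis.txt
git add analysis.txt
git commit -q -m "First version"
first=$(git rev-parse HEAD)   # remember the first commit's ID

echo "version 2" > analysis.txt
git commit -q -am "Second version"

# Step back to the project as it was at the first commit ...
git checkout -q "$first"
at_first=$(cat analysis.txt)
echo "$at_first"

# ... then return to the branch that was checked out before.
git checkout -q -
cat analysis.txt
```

Nothing is lost by looking at an old commit; the newer commits remain in the history and reappear when you switch back.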

Version control software works well with text files. Large, non-text
components of your project (e.g. very big or binary data files) can slow down
any cloning, merging or branch switching. For that reason, data rarely live in a
repository with code. Keeping data and code separate also facilitates data
reuse, since the data are not tied to one pipeline.

Add a section where you can list collaborators to the README.md file. Our aim is
to let your collaborators update this list with their own name, so only include yourself. You can use any text editor, and RStudio’s is quite handy.

## Collaborators
- J. Doe

Stage

Before you can commit changes involving a new file, you have to tell git which modifications you want to commit by staging.

Go to the “Git” tab in RStudio.

Select “Commit” to open the “Review Changes” window.

Check the “Staged” box next to “README.md” to add its modifications (hence the “M” status).

Commit

Enter a brief (<50 chars) descriptive message about the commit.

Commit!

Close the “Review Changes” window.

Push

Look at the “Git” tab again and notice that your branch is “ahead of
origin/master”. Push the commit to your GitHub repo.

GitHub Collaborators

Even on public GitHub repos, only the owner has “push access” by default. The
owner can allow any other GitHub user to push by inviting collaborators under
the settings tab.

Introduce yourself to your neighbor and split the two roles below between you. Be sure to watch each other perform the steps assigned to your individual roles.

Owner: add your neighbor as a collaborator.

Collaborator: accept your neighbor’s emailed invitation.

Create a Second Spoke

The Collaborator now needs to create a new project in RStudio by cloning the
Owner’s project. Under the “File” menu item, choose to create a New Project,
and then choose “Version Control”.

You cannot use the same name for two project folders! Before cloning, the
Collaborators should close their project and rename or delete their “handouts”
folder.

Push & Pull

Collaborator: Add your name to the list in the “README.md”.

Collaborator: Stage, commit, and push (up arrow) your modifications.

Owner: Pull (down arrow) to apply your neighbor’s commit.

Merge

You both realize it would have been good to include your affiliation along with
your name. Do you need to circulate “README.md” to each collaborator in sequence
for an update? No!

Owner AND Collaborator: edit your entry in the “README.md”

Owner AND Collaborator: stage, commit, & push.

Owner OR Collaborator: if you receive an error message, it tells you
exactly what to do.
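The sequence in this exercise (both people commit, one push is rejected, a pull merges, the push then succeeds) can be rehearsed locally. The sketch below makes several assumptions: a bare repository stands in for GitHub, the identities and file contents are placeholders, and the two edits touch different files so that git can merge them without help. Edits to the same lines of one file would instead stop with a conflict to resolve by hand.

```shell
work=$(mktemp -d)
cd "$work"
git init -q --bare hub.git   # local stand-in for the GitHub repo

# Owner: start the project and push it to the hub.
git init -q owner
cd owner
git config user.name "Owner"
git config user.email "owner@example.com"
echo "- J. Doe" > README.md
echo "x <- 1" > script.R
git add README.md script.R
git commit -q -m "Start project"
git remote add origin "$work/hub.git"
git push -q -u origin HEAD

# Collaborator: clone, edit one file, and push first.
cd "$work"
git clone -q hub.git collaborator
cd collaborator
git config user.name "Collaborator"
git config user.email "collab@example.com"
echo "x <- 2" > script.R
git commit -q -am "Update script"
git push -q

# Owner: edit a different file; the hub has moved on in the meantime.
cd "$work/owner"
echo "- J. Doe, Example University" > README.md
git commit -q -am "Add affiliation"
if git push -q 2>/dev/null; then
    echo "push accepted"
else
    echo "push rejected: pull first"
fi

# Owner: pull (fetch + merge), then push the merged history.
git pull -q --no-rebase --no-edit
git push -q
git log --oneline --graph
```

The rejected push is the error message the exercise refers to: git refuses to overwrite work on the hub and asks for a pull first.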

What about Data?

The scripts that execute your pipeline are plain text files, but the project may need other assets, such as data, figures, or private configurations.

Non-text files get little benefit from git, and have large costs.

Large data files should not be version controlled by git and usually live outside the repo (as an “integration”).

Private information should not be committed to git.

External Data

The most common pipeline integration is shared data storage.

Local area network file share (e.g. “Z:\\…”)

Cloud storage (e.g. Dropbox, Google Drive)

Database (e.g. a PostgreSQL server)

Link to the Data

One good practice is creating “symbolic links” (a.k.a. shortcuts) to data files
that live outside a project repo; these let your code use paths that point
inside the repo for data.

file.symlink(from=...,to='data')

The shortcut works like a normal path to your data, which could be risky on
certain operating systems or early versions (before 1.6) of “git”. It is
sometimes possible to add all your data to a commit by accident with “git add .”.
To avoid this, update git or “ignore” all files and folders below data/ by
adding the line /data/** to the “.gitignore” file in your repo. The leading
/ refers to the root of the git repository, not to the root of your
filesystem.
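The effect of the ignore rule can be checked in a scratch repository. In this sketch an ordinary folder named data stands in for the symbolic link, and the file names are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Stand-in data folder; in practice this could be a symbolic link
# to storage outside the repo.
mkdir data
echo "1,2,3" > data/big.csv
echo "x <- 1" > script.R

# Ignore everything below data/ relative to the repo root.
echo "/data/**" > .gitignore

# Even a sweeping "git add ." now leaves the data out.
git add .
git status --short
```

Only “.gitignore” and “script.R” end up staged; the contents of data/ never enter the repository.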

Future Directions

Share your work for reuse and extension.

Make trying new analysis as easy as branching.

Contribute beyond your own projects.

The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) makes it accessible not only to your collaborators but also to your research community for review and extension.

Using advanced git to manage contributions to the project as a branching and merging “tree” of commits accomplishes two objectives. First, work can safely proceed in parallel, on separate branches if necessary. Second, a recoverable (and auditable) trail of changes is immediately available in the project history.
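A small sketch of “trying new analysis as easy as branching” follows. The branch name try-nonlinear, the file model.R, and the identity are hypothetical, chosen only for illustration.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "J. Doe"             # placeholder identity
git config user.email "jdoe@example.com"  # placeholder identity

echo "model <- 'linear'" > model.R
git add model.R
git commit -q -m "Baseline model"

# Name of the starting branch ("master" or "main", depending on setup).
base=$(git symbolic-ref --short HEAD)

# Try a new analysis on its own branch.
git checkout -q -b try-nonlinear
echo "model <- 'nonlinear'" > model.R
git commit -q -am "Try nonlinear model"

# The original branch is untouched until the experiment is merged.
git checkout -q "$base"
before_merge=$(cat model.R)
echo "$before_merge"

# Keep the experiment by merging it back.
git merge -q try-nonlinear
cat model.R
```

If the experiment does not work out, the branch can simply be left unmerged and the baseline stays as it was.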

The latest software for modeling and analysis in your research field may already
be developed in a public git repository, for example on GitHub. Build better
pipelines by contributing bug reports or even pull requests to projects integral
to your own work.
