GitHub for Science

At Riffyn, we believe very deeply in the idea of scientific source code. Our whole enterprise is founded on it. Our software embodies it. Our team delivers and teaches it. Heck, we preach it.

But we didn’t invent the idea from nothing. Like all ideas, it has parents. In this case, one parent of Riffyn’s idea for scientific source code was the napkin sketch in Figure 1 below. This sketch was shared between an industrial process development scientist and a manufacturing scientist who were trying to troubleshoot a problem in full-scale manufacturing. They used the sketch to communicate what they were doing and to identify where the process development lab and manufacturing differed.

Figure 1: Napkin sketch that two industrial scientists used to troubleshoot process scale-up.

This flow diagram, together with our years of observing how scientists work and lessons from data science and measurement systems analysis, gave rise to our idea for scientific source code.

A toolset for scientific source code

Our idea took form as a hypothesis in 2013. Here is that hypothesis, as it was originally written.

Figure 2: Original 2013 hypothesis for scientific source code and why it matters.

Over the months and years, that hypothesis has taken shape as process-flow objects embedded in the core of Riffyn’s Scientific Development Environment (SDE) cloud software. Examples of Riffyn SDE are shown in part 1 and in Figure 3 below, which shows Measure mode, the mode used to collect lab data.

Figure 3: Riffyn process-flow objects form the foundation for scientific source code, experiment sharing, and accelerated data analytics. Shown here is a process being executed as an experiment with four bioreactor batches. All materials, equipment, samples, and data are collected, structured, and traceable across the process object – a standardized, yet flexible, container for the data.

The Riffyn SDE encodes experiments as process-flow objects that capture every detail of the protocol, the reactions, the parameter settings, the sample flow, and the measured data points. These flow objects, and their associated experiment data, are unambiguous and machine readable.

Some people see this process-based representation of an experiment and say: “What the hell does this have to do with an experiment?” Others see it for the first time and say “Riffyn, where have you been all my life?”

Whether you find it instinctively right or wrong, it’s most certainly a paradigm shift. And it’s a shift that’s extraordinarily powerful. What makes it so powerful is that it can be stored, version controlled, copied, compared, reviewed, merged, and shared just like computer code. Even more, the process object provides a standardized, yet flexible, container for experimental data – it can be used to automatically contextualize and shape experiment data for analytics. In short, process objects are the source code for science, and the blueprints for machine-learning-ready data. And as source code, these process objects create a foundation for a GitHub for science. We realize that that statement is not self-evident, so let’s see this in action in a few illustrative examples below.

GitHub for science

Underlying a process flow representation of experiments in Riffyn is a machine-readable object. A fragment of that object is shown below. That object is a software-independent representation of the experiment; it can be read by any program and it fully describes the experiment. Embedded in that structure is also a complete genealogical record of the ancestor processes from which it came. That means that when one process is copied and modified to make a new process, its ancestry is tracked and included in the process information.

Figure 4: Excerpt of the machine-readable encoding of a Riffyn process object that is used as the source code for scientific experiments.
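
To make the idea concrete, here is a minimal sketch in Python of what a machine-readable process object with tracked ancestry might look like. It is not Riffyn's actual schema: the field names (`id`, `parent_id`, `steps`, `parameters`), the example values, and the `lineage` helper are all hypothetical, chosen only for illustration.

```python
import json

# Hypothetical process objects. Field names and values are illustrative
# assumptions, not Riffyn's actual encoding.
PROCESSES = {
    "proc-v1": {
        "id": "proc-v1",
        "parent_id": None,  # no ancestor: this is the original design
        "steps": [
            {"name": "Fermentation",
             "parameters": {"temperature_C": 30, "pH": 5.0}},
        ],
    },
    "proc-v2": {
        "id": "proc-v2",
        "parent_id": "proc-v1",  # copied from v1, then modified
        "steps": [
            {"name": "Fermentation",
             "parameters": {"temperature_C": 32, "pH": 5.0}},
            {"name": "Harvest", "parameters": {"speed_rpm": 4000}},
        ],
    },
    "proc-v3": {
        "id": "proc-v3",
        "parent_id": "proc-v2",
        "steps": [
            {"name": "Fermentation",
             "parameters": {"temperature_C": 32, "pH": 5.5}},
            {"name": "Harvest", "parameters": {"speed_rpm": 4000}},
        ],
    },
}


def lineage(process_id, registry):
    """Return process ids from the oldest ancestor down to process_id."""
    chain = []
    current = process_id
    while current is not None:
        chain.append(current)
        current = registry[current]["parent_id"]
    return list(reversed(chain))


# Round trip through JSON to show that the object is plain data that
# any program can read, independent of the software that produced it.
restored = json.loads(json.dumps(PROCESSES))
print(lineage("proc-v3", restored))
```

Because each object in this sketch carries a pointer to its parent, the ancestry of any process can be recovered from the data alone.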

This machine-readable encoding enables three capabilities:

1. Data integration. The process object provides an instruction set to automatically contextualize, integrate, and reshape measurement data across experimental flows, scientific disciplines, collaborating teams, and time. This provides extraordinary power and speed for applying machine learning to everyday science. See example in this video.

2. Process and experiment genealogy. It is possible to construct the entire family tree of all process and experiment design variants that led to a current experiment or process design or were derived from it. Moreover, all the experimental data from the entire family tree can be combined into one data structure for analysis. See Figure 5 below and the example in this video. This deep traceability can be critical in a product development setting when transferring technologies to manufacturing or preparing data for regulatory submission.

3. Diff and merge of experiment designs. In other words, redlining experiment designs. It is possible to automatically identify every variable, parameter, or other element that is changed between two versions of a process. See Figure 5 below and the example in this video. Once identified, these differences can be accepted, rejected, and merged into a new version of the process.

Figure 5: Top panel – Process Differences Report: With Riffyn’s source code for science, differences between any two processes are automatically detected and presented in a diff report showing what steps, materials, parameters, specifications, or units were added, deleted, or changed. Bottom panel – Process Genealogical Trees: This shows the full genealogical tree (ancestor processes on the left moving toward child processes on the right).
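
The diff idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the report in Figure 5: the two process versions, their field names, and the `flatten` and `diff_processes` helpers are hypothetical.

```python
# Two hypothetical versions of a process; the structure is an assumption
# made for illustration, not Riffyn's actual encoding.
V1 = {
    "id": "proc-v1",
    "steps": [
        {"name": "Fermentation",
         "parameters": {"temperature_C": 30, "pH": 5.0}},
    ],
}
V2 = {
    "id": "proc-v2",
    "steps": [
        {"name": "Fermentation",
         "parameters": {"temperature_C": 32, "pH": 5.0}},
        {"name": "Harvest", "parameters": {"speed_rpm": 4000}},
    ],
}


def flatten(process):
    """Map each step and each 'step/parameter' path to a comparable value."""
    flat = {}
    for step in process["steps"]:
        flat[step["name"]] = "step"
        for key, value in step["parameters"].items():
            flat[f'{step["name"]}/{key}'] = value
    return flat


def diff_processes(old, new):
    """Report what was added, deleted, or changed between two versions."""
    a, b = flatten(old), flatten(new)
    return {
        "added": {k: b[k] for k in b.keys() - a.keys()},
        "deleted": {k: a[k] for k in a.keys() - b.keys()},
        "changed": {k: (a[k], b[k])
                    for k in a.keys() & b.keys() if a[k] != b[k]},
    }


report = diff_processes(V1, V2)
for kind in ("added", "deleted", "changed"):
    for path in sorted(report[kind]):
        print(kind, path, report[kind][path])
```

Flattening each version into path-value pairs turns the comparison into simple set operations on dictionary keys. A merge step could then apply only the accepted entries of such a report to produce a new version.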

Although they are just the beginning of many more great things to come, these three capabilities form the foundation needed to construct a GitHub for Science.

It is worth noting that all of the examples above were executed using open-source code running on software platforms outside of Riffyn SDE. The role of Riffyn SDE is to provide the system-agnostic source code and data for the requested experiments. That data can be read, used, and modified by any other software. Thus, this source code for science is free to live and evolve outside of the boundaries of the system that initially generated it – just as true source code should be.
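
As a rough illustration of what such outside code might look like, the Python sketch below reads an exported experiment as JSON and uses the process definition to put each measurement in context, producing one flat row per batch. It is not the code used for the examples above. The export structure, field names, values, and the `contextualize` helper are assumptions made for illustration.

```python
import json

# Hypothetical export of one process and its measurements. The structure
# and values are illustrative assumptions, not a real Riffyn export.
EXPORT = json.loads("""
{
  "process": {
    "id": "proc-v2",
    "steps": [
      {"name": "Fermentation", "parameters": {"temperature_C": 32, "pH": 5.0}},
      {"name": "Harvest", "parameters": {"speed_rpm": 4000}}
    ]
  },
  "measurements": [
    {"run": "batch-1", "step": "Fermentation", "name": "titer_g_per_L", "value": 4.1},
    {"run": "batch-2", "step": "Fermentation", "name": "titer_g_per_L", "value": 4.6},
    {"run": "batch-1", "step": "Harvest", "name": "yield_pct", "value": 91.0}
  ]
}
""")


def contextualize(export):
    """Build one flat row per run, with values keyed as 'step.name'."""
    process = export["process"]
    settings = {s["name"]: s["parameters"] for s in process["steps"]}
    rows = {}
    for m in export["measurements"]:
        row = rows.setdefault(
            m["run"], {"run": m["run"], "process": process["id"]}
        )
        step = m["step"]
        # Attach the set points of the step where the measurement was taken.
        for key, value in settings[step].items():
            row[f"{step}.{key}"] = value
        row[f'{step}.{m["name"]}'] = m["value"]
    return list(rows.values())


rows = contextualize(EXPORT)
for row in rows:
    print(row)
```

In this sketch the process definition acts as the lookup that tells the program which parameter settings belong with which measurement, so the resulting rows are ready to load into an analysis tool.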

From GitHub to manufacturing hub

In software engineering, GitHub is not the final resting place for source code. To be useful, the code has to be transferred to a production environment. That is accomplished by automated systems that deploy the code to the servers and clients that run it. Similarly, in the design and manufacturing of hardware, a CAD file can be transferred to a CAM machine or 3D printer for automated or semi-automated fabrication.

Likewise, scientific source code makes automated transfer from R&D to manufacturing possible. For example, as analytical chemistry methods or bioprocesses evolve in R&D, they approach manufacturing-ready form. The process objects (source code) that capture methodological and process details can then be digitally transferred to manufacturing execution systems to operate production facilities (Figure 6).

Figure 6: With source code for science, it becomes possible to automate the digital transfer of R&D innovations and data to manufacturing.

Not only would this reduce errors and accelerate timelines, it would also provide a means to map manufacturing data directly to development data. Thus, performance differences between development-scale processes and manufacturing can be compared to quickly diagnose deviations and find opportunities for process improvement. Ultimately this will lead to better product quality, more predictable product development, and lower costs for drug, chemical, material, and biotech product development.

A new era in scientific experimentation

We owe a great societal debt to the wonders of science and the profound power of the scientific method that underlies it. That method, and the traditional means of recording and reporting results, have delivered a high quality of life for people. But as our society has grown larger and more complex, and the pace and pressures of society have increased, so have our demands on science to deliver faster and more profound innovation. The 400-year-old toolset of scientific communication needs an upgrade.

The foundation for that upgrade is source code for science, and it’s already here. It’s been used to transform the operations of Riffyn’s customers. It has already taken biotech products to market faster than ever before. Yet this source code-based approach has only taken its tentative first steps. We can’t wait to see what it will deliver when it hits full stride.

Timothy Gardner

Tim Gardner is the Founder and CEO of Riffyn. He was previously Vice President of Research & Development at Amyris, where he led the engineering of yeast strain and process technology for large-scale bio-manufacturing of renewable chemicals. Tim has been recognized for his pioneering work in Synthetic Biology by Scientific American, the New Scientist, Nature, Technology Review, and the New York Times. He also served as an advisor to the European Union Scientific Committees and the Boston University Engineering Alumni Advisory Board. Tim enjoys hockey, running, mountain biking, and being beaten by his sons in almost everything.