Source Code for Science

April 23, 2020 | Timothy Gardner

By Timothy Gardner and Embriette Hyde

Dating apps have changed the way people romance and forge lasting relationships. Like people, ideas also need to romance each other: they meet, mix and produce offspring. But sadly, the breeding of scientific ideas is being shortchanged. We need a new kind of dating app: a dating app for nerdy ideas.

How ideas breed

In today’s distributed world, the breeding of ideas is heavily assisted by software and cloud collaboration tools like GitHub and Google Docs. These tools have dramatically reshaped the means and speed with which ideas are bred in our personal and work lives. Try collaborating on a document without them – it’s slow and inhibitory.

Yet this is exactly what we do with scientific ideas each and every day. Science still breeds ideas using the whiteboard, reports, and PowerPoint presentations in group meetings. These are essential and valuable tools, but not nearly enough. Science needs more. Science needs a source code.

Source code is the lifeblood of computer science and engineering and therefore at the heart of all of the technology we use today. It is the reason we have Google, Netflix, and, of course, our dating apps.

Good source code is editable, trackable, changeable, and shareable. It enables engineers to find and duplicate code, change it, easily see changes, and merge changes into live systems – all with a trackable history. Software engineering would be light-years behind where it is now without tools that facilitate such source code tinkering.

But how does this apply to science? What would a source code for science even look like?

What must the source code for science capture?

To answer this question, we first need to resolve a couple of other fundamental questions. First, and most importantly: “What do scientists make?” Do scientists make knowledge, or theories, or data, or experiments? Or are some of those just means to an end? The answer is what the science source code will represent.

Let’s answer this via analogy to other fields. A software engineer makes code that effects actions on data or machines. An architect makes blueprints (CAD drawings) that are used to build and operate a structure. A film writer makes a script that is used to produce a movie. Hopefully you see the pattern here: the creative product is the code, the blueprint, or the script. And then there are the outcomes that are generated from, operated with, or interpreted using the creative product.

So, what do scientists make? Experiment designs that are used to generate data and conclusions.

Creative products (i.e., experiment designs) are completely within the control of the creator(s). This is good news. Conversely, outcomes are based on the creative product but are not entirely controlled by the creator. How well we generate and understand outcomes that are ultimately out of our control comes down to how well we build the creative product. An auto engineer could not repair a car without access to the design documents (or at least a help guide that is derived from them). A writer could not refine a film or sign actors without sharing and iterating on the script. Likewise, a scientist cannot properly interpret any generated data without understanding the experiment design that generated it.

The role of source code (or of blueprints or scripts) is to capture these creative products in tangible form and to provide a systematic means for the creator(s) to share and evolve them. Scientists make experiment designs. The source code for science must unambiguously capture everything that entails.

What does the source code for science look like?

We have arrived at our second fundamental question – what does source code for science look like? The answer may seem obvious – scientists do experiments, and when they do experiments, they write down a list of materials, methods to be performed, and observations to be recorded. Science has had a source code for 400 years.

Or has it?

Anyone who has spent any time at all in any lab on the planet knows how completely inadequate ad hoc materials and methods are. They know that spreadsheets and written or electronic notebooks contain what are usually uninterpretable, incomplete, or ambiguous statements of what was actually done in an experiment. They are the equivalent of trying to describe a car part with paragraphs and spreadsheets – that approach doesn’t work for cars, and it doesn’t work for science. Yet defining source code for science isn’t as simple as designing blueprints for a car part. Experiments are multidimensional and potentially unconstrained. How do we encode them in a form that gives us all of the transparency and breed-ability of computer source code?

Fortunately, every scientific experiment shares something fundamental that allows us to define its source code. Every experiment is a sequence of actions (a protocol) in which inputs (things the scientist can control or study) are transformed into outputs (things the scientist can make or observe). String a series of such actions together and you have an experiment. The data generated from an experiment describes those inputs and outputs. When we draw conclusions, it is by correlating that input/output data in some way, shape, or form.

Figure 1. Every scientific experiment can be described as a sequence of actions where inputs are transformed to outputs.

So now we see the answer to our second fundamental question: the source code for science looks like code that captures a process flow (Figure 2) composed of a sequence of transformation steps. Source code for science provides an explicit expression of all the actions of an experimental protocol, all of their inputs and outputs, all the properties that are set or measured on those inputs and outputs, and the flow of materials and data from one action to the next across the process.

Figure 2. A scientific experiment can be described as a process flow diagram – a sequence of transformative steps with inputs, outputs, parameters, and measurements. This can serve as the source code for science because it is complete, unambiguous, and can be encoded in machine-readable form for sharing, comparison, analysis, and reuse.
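To make the process-flow idea concrete, here is a minimal sketch in Python of how an experiment design might be encoded in machine-readable form. It is an illustration only, not Riffyn's actual data model: the class names (Step, Experiment), the helper changed_parameters, and the growth-assay example with its parameter values are all hypothetical. Each step records its inputs, outputs, the parameters that are set, and the measurements that are taken; the experiment is an ordered sequence of such steps.

```python
from __future__ import annotations

import copy
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    """One action in a protocol: inputs are transformed into outputs."""
    name: str
    inputs: list[str]
    outputs: list[str]
    parameters: dict[str, float] = field(default_factory=dict)  # properties set
    measurements: list[str] = field(default_factory=list)       # properties observed


@dataclass
class Experiment:
    """An ordered sequence of steps (hypothetical model, for illustration)."""
    name: str
    steps: list[Step]

    def check_flow(self, starting_materials: set[str]) -> list[str]:
        """Return inputs that neither a starting material nor an earlier step provides."""
        available = set(starting_materials)
        missing = []
        for step in self.steps:
            for item in step.inputs:
                if item not in available:
                    missing.append(f"{step.name}: {item}")
            available.update(step.outputs)
        return missing

    def to_json(self) -> str:
        """Serialize the design so it can be shared, stored, and compared."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def changed_parameters(old: Experiment, new: Experiment) -> dict[str, tuple]:
    """List parameters whose values differ between two versions of a design."""
    old_p = {(s.name, k): v for s in old.steps for k, v in s.parameters.items()}
    new_p = {(s.name, k): v for s in new.steps for k, v in s.parameters.items()}
    changes = {}
    for key in sorted(old_p.keys() | new_p.keys()):
        before, after = old_p.get(key), new_p.get(key)
        if before != after:
            changes[f"{key[0]}.{key[1]}"] = (before, after)
    return changes


# Hypothetical example: a three-step growth assay.
v1 = Experiment("growth_assay", [
    Step("inoculate", ["strain", "medium"], ["culture"], {"volume_ml": 50.0}),
    Step("incubate", ["culture"], ["grown_culture"],
         {"temperature_c": 30.0, "hours": 24.0}),
    Step("measure", ["grown_culture"], ["od_reading"],
         measurements=["optical_density_600nm"]),
])

# A second version of the same design with one parameter changed.
v2 = copy.deepcopy(v1)
v2.steps[1].parameters["temperature_c"] = 37.0

print(v1.check_flow({"strain", "medium"}))  # [] means every input is accounted for
print(changed_parameters(v1, v2))           # which settings differ between versions
```

Because the design is explicit data rather than free text, two versions can be compared setting by setting, much as a software engineer compares two revisions of code. A real system would also need units, identifiers for materials, and links from each measurement to the data it produced; this sketch leaves those out.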

At Riffyn, we believe very deeply in the idea of scientific source code – our whole enterprise is founded on it. But we didn’t invent the idea from nothing. Like all ideas, it has parents. In part 2 of this post, we will uncover the roots of the idea of scientific source code and how it could lead us to something profound – a GitHub for science where ideas can be bred, and a transformation of the lifecycle of scientific product development.

For more information or to request a demo, please visit riffyn.com, and follow us on LinkedIn to keep track of the latest Riffyn happenings.

Timothy Gardner

Tim Gardner is the Founder and CEO of Riffyn. He was previously Vice President of Research & Development at Amyris, where he led the engineering of yeast strain and process technology for large-scale bio-manufacturing of renewable chemicals. Tim has been recognized for his pioneering work in Synthetic Biology by Scientific American, the New Scientist, Nature, Technology Review, and the New York Times. He also served as an advisor to the European Union Scientific Committees and the Boston University Engineering Alumni Advisory Board. Tim enjoys hockey, running, mountain biking, and being beaten by his sons in almost everything.