Nodebook

Kevin Zielnicki

July 26, 2017
- San Francisco, CA

Reproducible analysis has been a topic of frequent discussion in both academia and industry.1 This is a good thing. Irreproducible work can’t be verified and is difficult to build on. Even if it solves the problem at hand, it is too tied to the specific moment and person who produced it. In contrast, reproducible analysis can be verified or extended in the future. This is important not only when the future user is someone else, but also when it is yourself trying to understand what you did last month.

Even better, I would argue that thinking about reproducibility leads to better analysis. I find that if I think of some analysis as ephemeral, I am less likely to catch mistakes than if I think of it as a permanent artifact.

Reproducibility isn’t controversial, and yet irreproducible analysis is everywhere. I’ve certainly created plenty of it. Why does this happen, despite good intentions? Because, in the short term, it is easier and more expedient not to worry about reproducibility. But this isn’t a moral failing so much as a failing of our tools. Tools can, and should, help make reproducible analysis the natural thing to do.

This brings us to Jupyter Notebook, one of my favorite pieces of software. Jupyter is widely used by data scientists, and for good reason. By providing a visual interactive interface to the versatility and power of Python, it makes a great many tasks pleasantly quick and easy. Want to pull in some data from an API, run it through an MCMC-trained Bayesian classifier, and then visually diagnose the performance of the classifier? No problem! Want to query a database, check on histogram plots of various fields, build a regression model and plot its predictions? Easy! Interested in exploring some deep learning models? We can do that too! Want to do all that in a notebook you can hand off to somebody else without them having to reason about what lines you ran, and how many times, and in which order? Well, you’d better be careful.

Of course, it is completely possible to write clean, reproducible code in a Jupyter notebook. But this is not the default. More commonly, a notebook will accumulate a mess of false starts, out-of-order statements, and unused legacy code. Jupyter’s amazing flexibility is also its undoing here. It gives us a paradigm where it is all too easy to write something thoroughly unreadable and uninterpretable. The usual remedy is to tidy up and re-run the notebook after finishing a piece of analysis, producing a clean deliverable.2 Yet this can become a significant burden, and sometimes it never gets done. What if, instead, the notebook itself constrained us to write more repeatable code in the first place?

REPL-Y All

To understand how we got here, let’s consider the REPL. Short for read-eval-print loop, the REPL provides an environment for interactive programming. It works by allowing the user to enter a command, then evaluating the command and printing the results. This paradigm informed IPython, and through it, Jupyter Notebook. REPLs are handy for interactive development because they give immediate feedback, but they are inherently ephemeral. After running code in a REPL, you have at most a log of your command entries and outputs – hardly a useful artifact for reproducible analysis.

Jupyter Notebook addressed this shortcoming by persisting code in cells which can be easily re-run later. This results in something more like a traditional script, except still with a critical difference. A script has statements that run in a well-defined order, while cells in a notebook can be run in any order and modify the global environment at the time they are run (and every time they are run). While this is great for flexibility, it often leads to a notebook existing in a state of apparent inconsistency and irreproducibility.

This is exactly why I developed Nodebook, an extension to Jupyter Notebook.

The Nodebook

Nodebook attempts to go a step further in ensuring consistency by making a notebook behave more like a script in that cells depend only on the cells above them. Yet it also maintains the useful flexibility of being able to inspect intermediate results and insert and modify cells at any time. By way of example, suppose we have four cells:

[1] x = 7
[2] x += 10
[3] print(x)
[4] x = 1

If this were a script, these 4 statements would be run predictably in order: print would output 17 and at the end of the script x would contain the value 1. In Jupyter, we could run the 4 cells in order one at a time for the same result, but we can also run them out of order or multiple times. For example, if we run cell 3 after running cell 4, print will output 1. If we run cell 2 a few times, we’ll keep incrementing x and quickly lose track of what value it contains. While this is a contrived example, conceptually similar situations frequently emerge in notebooks as a byproduct of exploratory analysis.
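The behavior described above can be imitated outside Jupyter. The following is only a rough sketch: it treats each cell as a string and runs it with exec against one shared dictionary, which stands in for the kernel's global environment. The names CELLS and run_cells are made up for this illustration.

```python
import contextlib
import io

# The four example cells, keyed by cell number.
CELLS = {1: "x = 7", 2: "x += 10", 3: "print(x)", 4: "x = 1"}


def run_cells(order):
    """Run cells in the given order against a single shared namespace.

    Returns whatever the cells printed (as a list of strings) and the
    final value of x.
    """
    namespace = {}          # stands in for the kernel's global environment
    printed = io.StringIO()
    with contextlib.redirect_stdout(printed):
        for number in order:
            exec(CELLS[number], namespace)
    return printed.getvalue().split(), namespace["x"]


script_order = run_cells([1, 2, 3, 4])   # (['17'], 1), same as a script
out_of_order = run_cells([1, 2, 4, 3])   # (['1'], 1), cell 3 run after cell 4
repeated = run_cells([1, 2, 2, 2, 3])    # (['37'], 37), cell 2 run three times
```

Only the first ordering matches what a script would do; the other two depend entirely on the history of which cells were run.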

In Nodebook, on the other hand, the behavior is more like what one would expect from a script. The number of times a cell is run and the order cells are run in do not matter, only the position of the cells relative to each other. No matter whether cell 4 was run or how many times cell 2 was run, the print in cell 3 will always output 17. Additionally, Nodebook tracks changes in cell inputs and outputs. So if cell 1 is changed to x = 3 and then cell 3 is run, Nodebook will detect that an input has changed and will re-run cell 2, and print will output 13, consistent with what one would expect from a script. Here’s a short demo illustrating the difference:

To accomplish this behavior, Nodebook needs to keep track of the inputs and outputs associated with each cell. To track inputs, Nodebook parses the cell’s AST (abstract syntax tree) to find any variables which are not defined locally. Any such variables are labeled as cell inputs. To track outputs, Nodebook serializes (storing either to memory or disk) and hashes each cell’s environment before and after it is run, and looks for changed hashes.3 To reconcile inputs and outputs, Nodebook traverses a linked list of cell nodes (hence the name Nodebook) and matches each input to the most recent output. A cell run is associated with a particular input hash, which allows detecting changes to upstream data. This change-tracking is used to automatically re-run cells when needed to maintain consistency when cell inputs have changed.
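To make the two tracking steps concrete, here is a minimal sketch of the idea, not Nodebook's actual implementation. It assumes pickle for serialization and SHA-256 for hashing, neither of which is named above, and the helpers cell_inputs, env_hashes, and run_cell are hypothetical names for this illustration. The input detection is deliberately simplified: it only looks at plain variable names and ignores imports, function scopes, and comprehensions.

```python
import ast
import builtins
import hashlib
import pickle


def cell_inputs(source):
    """Names a cell reads before defining them itself (simplified)."""
    defined, inputs = set(), set()
    for statement in ast.parse(source).body:
        loads, stores = set(), set()
        for node in ast.walk(statement):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Load):
                    loads.add(node.id)
                elif isinstance(node.ctx, ast.Store):
                    stores.add(node.id)
            elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
                loads.add(node.target.id)  # `x += 1` reads x as well as writing it
        # Anything read here that is neither local nor a builtin is an input.
        inputs |= {n for n in loads if n not in defined and not hasattr(builtins, n)}
        defined |= stores
    return inputs


def env_hashes(env):
    """Hash the serialized form of every variable in an environment."""
    return {name: hashlib.sha256(pickle.dumps(value)).hexdigest()
            for name, value in env.items() if not name.startswith("__")}


def run_cell(source, env):
    """Run a cell on a serialized copy of env; report names whose hash changed."""
    before = env_hashes(env)
    working = pickle.loads(pickle.dumps(env))  # copy, so upstream state is untouched
    exec(source, working)
    working.pop("__builtins__", None)          # added by exec, not part of the cell's state
    after = env_hashes(working)
    outputs = {name for name in after if before.get(name) != after[name]}
    return working, outputs


# Walk the example cells top to bottom, feeding each one the environment
# produced by the cell above it.
env, report = {}, []
for source in ["x = 7", "x += 10", "print(x)", "x = 1"]:
    inputs = cell_inputs(source)
    env, outputs = run_cell(source, env)
    report.append((source, sorted(inputs), sorted(outputs)))
```

In this sketch, cell 3 reports x as an input and no outputs, while cells 1, 2, and 4 each report x as an output. Matching each input to the nearest upstream output of the same name is what would let a tool decide which cells to re-run after an edit.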

Although the Nodebook approach primarily focuses on enforcing sensible constraints to make notebooks more repeatable and maintainable, it has a few side benefits as well. Because output is serialized and optionally stored to disk, notebooks are no longer ephemeral: you can restart a notebook and pick up right where you left off without reloading source data or re-running intermediate computations. Also, the output change-tracking capability of Nodebook makes it easy to change a variable at the top of your notebook (for example, to try a different hyperparameter), and re-run a summary statistic at the end of the notebook, without worrying about which cells need to be re-run in between. Similarly, because Nodebook knows exactly which code is used to create any given output, it can automatically extract that code into a standalone Python module. This lends itself to a pattern that I find extremely useful: you can prototype logic quickly in a local notebook on a small dataset, then have Nodebook package up the relevant code into a module and run it on a remote Docker container on the full dataset, and store a final artifact for later use.

Try It

Nodebook is open source and you can try it out right now! Nodebook is in development and has some rough edges, but contributions and issue reports are welcome.

Footnotes

2 Some reasonable approaches are described in http://compbio.ucsd.edu/reproducible-analysis-automated-jupyter-notebook-pipelines/ and https://svds.com/jupyter-notebook-best-practices-for-data-science/

3 Serialization isn't free, so this introduces a couple of caveats. On large objects (more than a few hundred MB), serialization will be noticeably slow, but it is typically a better pattern to iterate on a smaller sample of a larger dataset anyway. Also, there are a few types of objects (most notably generators) that aren't currently supported.