Collaborative and reproducible simulation and data analysis with Sumatra

Andrew Davison (UNIC, CNRS)

The increasing complexity of neuronal models and simulations (due to a high degree of biological detail and/or large scale) and of neuronal data analysis (due to the large number of parallel channels from multi-electrode arrays, or to complex, multi-dimensional stimulation and/or behavioural protocols) increasingly requires collaboration between multiple researchers, potentially in widely separated labs. The complexity of the models/analyses, together with the necessity for collaboration, tends to lead to more complex code, and hence to a greater likelihood of bugs and greater difficulty in reproducing the results, whether for other researchers or for the same researcher six months later. A more systematic approach is therefore needed for tracking exactly which code, which tool versions and which data were used for a given simulation or analysis.

Among existing attempts to solve these problems, the best developed are workflow engines such as Taverna, Kepler, LONI Pipeline and VisTrails. These allow rapid development of scientific workflows based on standardized components (each component being a discrete step in the analysis), and then execution of these workflows, often on grid computing architectures. They are widely used in neuroimaging and bioinformatics analyses, much less so in modelling, simulation and electrophysiology. The principal limitation of workflow engines is the need to adapt analysis methods to fit within the workflow framework: adding a new component generally carries a certain administrative overhead, and there is limited capture of information about software dependencies or about the computational environment.

If not using a workflow engine, scientific computations are usually programmed through a domain-specific graphical interface or through writing code. For such computations, we have developed a tool, Sumatra (http://neuralensemble.org/sumatra), which aims to provide automated capture of all the information needed to reproduce a given computational experiment (simulation or analysis) for arbitrary command-line launched programs/scripts (e.g. Python, MATLAB), or for GUI-based tools that incorporate the Sumatra library. There is no need to modify your analysis/simulation script - provided it is launched using Sumatra, all the details of the computational environment will be recorded. If it is written in a language Sumatra understands (currently Python, Hoc and the GENESIS Script Language - MATLAB/Octave support is in development), then all the external dependencies (libraries, modules) of the script, together with their versions, will additionally be recorded. All this information about a simulation/analysis is stored in a record store, which can later be browsed/queried/annotated using either a command-line or web-based interface, and hence functions as an electronic laboratory notebook.
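As a sketch of the intended workflow, using Sumatra's `smt` command-line tool (the project name, script and parameter file shown here are illustrative, and the exact options may differ between Sumatra versions):

```shell
# create a Sumatra project in a directory under version control
smt init MyProject

# run the simulation/analysis through Sumatra rather than directly;
# Sumatra records the code version, dependencies, parameters and
# platform details alongside the computation's outputs
smt run --executable=python main.py default.param

# later: browse the captured records, or annotate the most recent one
smt list
smt comment "varied the synaptic time constant"
```

The script itself is unchanged; only the launch command differs, which is what allows the capture to be automatic.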

Recently we have added greater support for collaborative projects using Sumatra, allowing multiple users and multiple projects to share a common record store. We have also added a web-service record store, based on the Django web framework, which allows Sumatra, or any other client, to store and retrieve computation records over HTTP using the JSON data-interchange format. Any client with web access can therefore use the record store, opening up the possibility of long-distance collaboration with minimal configuration (such a service is unlikely to be blocked by firewalls, for example) and of portal-style centralised databases/collaborative lab notebooks for simulation- and/or analysis-based projects.
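To make the JSON-over-HTTP idea concrete, the following Python sketch builds a computation record and posts it to a record-store URL. The field names and URL layout here are illustrative approximations, not the exact schema used by Sumatra's web-service store:

```python
import json
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def make_record(label, executable, script, version, reason=""):
    """Build a computation record as a plain dict.

    The fields shown (label, timestamp, executable, script, version,
    dependencies) approximate the kind of information Sumatra captures;
    the real schema may differ in names and detail.
    """
    return {
        "label": label,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
        "executable": executable,   # e.g. {"name": "python", "version": "2.7"}
        "script": script,           # path to the main script
        "version": version,         # VCS revision of the code
        "dependencies": [],         # filled in by language-specific dependency finding
    }


def post_record(base_url, project, record):
    """POST a JSON-encoded record to a (hypothetical) record-store URL."""
    url = "%s/%s/%s/" % (base_url.rstrip("/"), project, record["label"])
    req = Request(url, data=json.dumps(record).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    return urlopen(req)  # any HTTP-capable client could do the same


record = make_record("run_001",
                     executable={"name": "python", "version": "2.7"},
                     script="simulate.py",
                     version="abc123",
                     reason="test new synapse model")
payload = json.dumps(record)
```

Because the interchange format is plain JSON over HTTP, a record like this could equally be created by a MATLAB script, a shell script using curl, or a web portal, which is what makes the store language- and tool-neutral.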

In this presentation we describe Sumatra's collaboration features. We illustrate use of the web-service record store using Sumatra and other clients, describe the JSON-based interchange format, and demonstrate how the record store can be integrated with other Django components to develop a portal for collaborative modelling and data analysis.