I’ve already announced as much on Twitter, but I’m excited to present a poster
talk this year at SciPy. It’ll be my first time attending, and I’m looking
forward to meeting others who are passionate about advancing science through
software, and in particular with Python.

I’ll be presenting a software package I’ve been working on called
MDSynthesis, which has vastly improved the way I do science with molecular
dynamics (MD) simulations. MDSynthesis addresses an important bottleneck in
MD research: going from raw simulation data (perhaps many terabytes, spread
over tens to hundreds of simulations) to information that allows us to answer
biophysical questions. I’ll explain…

One of the obstacles to using modern data science tools like pandas to analyze
MD data is the multitude of formats the MD ecosystem trades in. CHARMM and
NAMD use DCD files, AMBER uses a NetCDF-derived format, and GROMACS uses
an XDR format; all told, there are at least 13 different formats used for
storage of MD trajectory data, each with unique strengths and limitations.
MDAnalysis is a Python package that provides a common interface to many of
these formats, turning trajectory data into NumPy arrays that can be handled
with the full power of the Python universe.

But the diversity of trajectory formats isn’t the only obstacle to distilling
information from MD data; the variety of inputs available for building any
particular simulation system is also a problem. For example, when simulating
a single protein, I have many choices to make: force field, starting
conformation, protonation states, solvent, ions, temperature and pressure
algorithms… the list goes on. The picture becomes more complicated when one
wants to run different types of MD, as there is also a wide variety of
enhanced-sampling methods available for use.

And that’s still not all: trajectory data can take a while to churn through
to extract the measures we are interested in, depending on the measure and on
the number and length of trajectories. It’s therefore useful to store
intermediate data so we can explore it interactively.
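
To make the idea of storing intermediate data concrete, here is a minimal sketch of a compute-once, load-afterwards cache. It is only an illustration under stated assumptions: the helper `cached`, the function `slow_analysis`, and the file name `distances.pkl` are all hypothetical, and plain pickle files stand in for whatever storage format you actually prefer.

```python
import pickle
import tempfile
from pathlib import Path


def cached(path, compute):
    """Return the result stored at `path`, computing and saving it on first use."""
    path = Path(path)
    if path.exists():
        # A previous run already paid for the analysis; just load it.
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # records how many times the expensive step really ran


def slow_analysis():
    # Hypothetical stand-in for an expensive pass over a trajectory.
    calls.append(1)
    return [0.1 * frame for frame in range(5)]


workdir = Path(tempfile.mkdtemp())
first = cached(workdir / "distances.pkl", slow_analysis)   # computed, then saved
second = cached(workdir / "distances.pkl", slow_analysis)  # loaded from disk
```

The second call never touches the "trajectory"; it only reads the stored result, which is what makes interactive exploration practical.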

Managing this complexity is burdensome, and frankly, boring. I’d rather spend
my limited time and energy doing science than managing my ever-growing
collection of data. Furthermore, I want quick, specific, and easy access to the
data I have so that I can begin answering questions.

MDSynthesis has done this for me. The basic idea behind the package is to
provide persistent objects that serve as data storage units, called
containers. One such container is the Sim object. This can store any
number of MDAnalysis Universe definitions (topologies + trajectories), along
with atom selections for later use. Sims store their state directly to disk in
a thin HDF5 database (using PyTables), allowing recall of the same Sim
instance later, or at the same time in another Python session. Most
importantly, Sims give an interface for easily storing pandas and NumPy data
structures in HDF5 format with no fuss, with just as easy recall. Almost any
other Python data structure can also be stored just as easily; the container
will pickle what it can’t serialize to HDF5.
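
To give a feel for what a persistent container means in practice, here is a toy sketch. It is not the MDSynthesis API: the class `TinyContainer` and its methods are hypothetical, and it swaps in a JSON state file plus pickle files where the real package uses HDF5 via PyTables. The point is only that the object’s state lives on disk, so a second handle on the same location sees the same thing.

```python
import json
import pickle
import tempfile
from pathlib import Path


class TinyContainer:
    """Hypothetical persistent container; NOT the MDSynthesis Sim API."""

    def __init__(self, location):
        self.location = Path(location)
        self.location.mkdir(parents=True, exist_ok=True)
        self._state_file = self.location / "state.json"
        if not self._state_file.exists():
            self._write_state({"selections": {}})

    def _read_state(self):
        return json.loads(self._state_file.read_text())

    def _write_state(self, state):
        self._state_file.write_text(json.dumps(state, indent=2))

    def add_selection(self, name, selection):
        # State is re-read and re-written on every change, so nothing
        # important lives only in memory.
        state = self._read_state()
        state["selections"][name] = selection
        self._write_state(state)

    @property
    def selections(self):
        return self._read_state()["selections"]

    def store(self, key, data):
        with (self.location / f"{key}.pkl").open("wb") as fh:
            pickle.dump(data, fh)

    def retrieve(self, key):
        with (self.location / f"{key}.pkl").open("rb") as fh:
            return pickle.load(fh)


root = Path(tempfile.mkdtemp()) / "protein_run"
sim = TinyContainer(root)
sim.add_selection("backbone", "name CA C N")
sim.store("rmsd", {"frame": [0, 1, 2], "rmsd": [0.0, 1.2, 1.5]})

# A second handle on the same directory recalls the same state and data.
same_sim = TinyContainer(root)
```

Unlike this toy, which simply overwrites its state file, a real container also has to cope with two sessions writing at the same time.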

Beyond Sims, there are also Groups, which can store Sims and other Groups as
members for easy recall of whole ensembles of containers and easy aggregation
of their stored data.
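
The nesting and aggregation idea can be sketched in a few lines. Again, this is an assumption-laden illustration rather than the package’s interface: `Member`, `Bundle`, and `collect` are hypothetical names, and data is kept in memory to keep the example short.

```python
class Member:
    """Hypothetical leaf container holding named datasets in memory."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})

    def collect(self, key):
        # A leaf reports its own dataset, if it has one under this key.
        return {self.name: self.data[key]} if key in self.data else {}


class Bundle:
    """Hypothetical group whose members may be leaves or other groups."""

    def __init__(self, name, members=()):
        self.name = name
        self.members = list(members)

    def collect(self, key):
        # Recurse through members, prefixing each result with this group's name.
        found = {}
        for member in self.members:
            for member_name, value in member.collect(key).items():
                found[f"{self.name}/{member_name}"] = value
        return found


apo = Bundle("apo", [Member("run1", {"rmsd": 1.2}), Member("run2", {"rmsd": 1.4})])
holo = Bundle("holo", [Member("run1", {"rmsd": 0.9})])
everything = Bundle("all", [apo, holo])

rmsd_by_run = everything.collect("rmsd")
```

Because leaves and groups answer the same `collect` call, a group of groups aggregates the stored data of a whole ensemble in one step.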

Those are the basic elements; more details can be found in the docs. Last week
we made an alpha release of the package, which is already usable for daily
work, but the project is still very young. What’s particularly exciting
for me is that development of the package has already fed back into development
of MDAnalysis, with even more performance and persistence functionality on the way.

If you find this software useful, let me know! If it’s missing something
that it sorely needs, feel free to submit an issue and we’ll get
cracking on it. Pull requests are also welcome!