illuminate 0.5.10

Python module and utilities to parse the metrics binaries output by Illumina sequencers.

Illuminate parses the metrics binaries that result from Illumina sequencer runs, and provides usable data in the form of python dictionaries and dataframes.
Intended to emulate the output of Illumina SAV, illuminate allows you to print sequencing run metrics to the command line as well as work with the data programmatically.

This package was built with versatility in mind. There is a section in this README for each of the following typical use cases:

Running illuminate on the command line
Using illuminate as a python module
Parsing orphan binaries (e.g. just ErrorMetrics.bin)

Also, as of version 0.5, Illuminate supports the reading of active (in-progress) sequencing runs for Tile, Index, and Quality metrics.

But first you’ll need to get set up. Jump to “Requirements” below.

Supported machines and files

Currently, the following Illumina machines are supported (any number of indices):

HiSeq
MiSeq

IMPORTANT NOTE: Illumina v4 software (RTA 2.0) is incompletely supported as of illuminate
version 0.5.10. The key difference is that RTA 2.0 allows Quality metrics to be “binned”.

Requirements

You’ll need a UNIX-like environment to use this package. Both OS X and Linux have been confirmed to work.
Illuminate relies on five open-source packages available through the Python cheeseshop:

numpy
pandas
bitstring
docopt
xmltodict

Please let the maintainer of this package (Naomi.Most@invitae.com) know if any of these requirements make it difficult to use and integrate Illuminate in your software; this is useful feedback.

Note: if you must use a version of pandas prior to 0.14, you should pin
your version of illuminate to 0.5.7.3.

Optional but Recommended: IPython

Illuminate itself is not currently geared towards interactive usage, so if you want to play
with the data, your best bet is to use IPython. All of the parsers that can be run from the
command line were written with loading into IPython in mind.

Once you have IPython installed, you’ll be able to run illuminate or any of the
standalone parsers on your data and immediately (well, after a few seconds of parsing)
have a data dictionary and a dataframe at your disposal. See also “Parsing Orphan Binaries”.

How To Install Illuminate via Pip

The latest stable version of Illuminate can be installed from the Python cheeseshop.
However, some vagaries of Python package management make automatic installation of all of
the dependencies a bit problematic.

You’ll need to explicitly install numpy and pandas first:

$ sudo pip install numpy pandas

Once this completes, you can try:

$ sudo pip install illuminate

The remaining requirements (bitstring and docopt) should come along for the ride,
and you’ll be good to go. Jump down to “Illuminate as a Command Line Tool”
to immediately start illuminating your own data.

If you want some sample data to play with, grab Illuminate from its mercurial
repository on bitbucket.org (see next section).

How To Install Illuminate from BitBucket

The latest development versions of illuminate come from its repository on bitbucket.org.

Clone this repository using Mercurial (hg):

$ hg clone https://hg@bitbucket.org/invitae/illuminate

For integrated use in other code as well as for running the command-line utilities, it is
recommended (though not required) to use virtualenv to create a virtual Python environment
in which to set up this package’s dependencies.

Another option for filename output is --timestamp / -t, which stamps each file with the
current time (datetime.now()) expressed as seconds since the Unix epoch. This timestamp will
be the same for each parsed file per illuminate run (in other words, you’ll get matching
timestamps for each metrics file produced).
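
The shared-timestamp behaviour can be pictured with a minimal sketch. This is not illuminate's own code: the function name stamped_names, the example filenames, and the "name.timestamp" layout are all hypothetical, chosen only to show why taking the timestamp once per run gives matching stamps on every file.

```python
import time


def stamped_names(names, stamp=None):
    """Append one shared Unix timestamp to every output filename."""
    # Take the timestamp a single time, so every file from this run matches.
    if stamp is None:
        stamp = int(time.time())
    # Hypothetical naming scheme: "<name>.<seconds since epoch>".
    return ["{0}.{1}".format(name, stamp) for name in names]


# A fixed stamp is passed here only to make the example reproducible.
outputs = stamped_names(["tile_metrics.csv", "quality_metrics.csv"], stamp=1400000000)
print(outputs)
```

Because the stamp is computed before the loop rather than once per file, a slow parse cannot produce files whose timestamps drift apart.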

You can get more verbose status messages during the parsing process
by specifying --verbose / -v.

The --debug / -d option currently does nothing other than produce timestamps and raise the
verbosity of the output (same as -v). These messages are placed so that you can use
the timestamps to evaluate how long parsing takes.

Finally, a fun way to explore the data is to use the --interactive / -i option to load
the dataset object directly into IPython. (This suppresses the normal printouts.)

(ve)$ illuminate -i /path/to/dataset

Within IPython, you’ll have the myDataset object at your disposal. This leads us naturally
to a discussion of how to use illuminate in code.

Using Illuminate as a Python Module

Illuminate was made to be integrated in code to make it easy to report on sequencing runs.

The usual way to start is to instantiate a “dataset” through the InteropDataset class,
providing it with a valid run path, like so:

Note that not all run data will contain all binaries. In particular, ErrorMetrics.bin will be
missing if no errors were recorded / reported by the sequencer.
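
Since some binaries may be absent, reporting code usually wants to gather whatever parsers are available and note the rest. The sketch below shows one way to do that against any dataset-like object. It makes several assumptions: that each parser is reachable as a no-argument method named after its binary (only TileMetrics is confirmed in this README), that a missing file surfaces either as an exception or as a None result, and that catching broadly is acceptable. collect_parsers and FakeDataset are hypothetical names, and FakeDataset merely stands in for a real InteropDataset.

```python
def collect_parsers(dataset, names=("TileMetrics", "QualityMetrics", "ErrorMetrics")):
    """Call each named parser method on `dataset`; report which were unavailable."""
    parsed, missing = {}, []
    for name in names:
        try:
            result = getattr(dataset, name)()
        except Exception:
            # Assumption: the exception raised for a missing binary is not
            # documented here, so this sketch catches broadly.
            result = None
        if result is None:
            missing.append(name)
        else:
            parsed[name] = result
    return parsed, missing


class FakeDataset(object):
    """Hypothetical stand-in for a run directory with no ErrorMetrics.bin."""

    def TileMetrics(self):
        return {"source": "TileMetrics.bin"}

    def QualityMetrics(self):
        return {"source": "QualityMetrics.bin"}

    def ErrorMetrics(self):
        raise IOError("ErrorMetrics.bin not found")


parsed, missing = collect_parsers(FakeDataset())
print(sorted(parsed))  # ['QualityMetrics', 'TileMetrics']
print(missing)         # ['ErrorMetrics']
```

In real use you would pass your own dataset object instead of FakeDataset and narrow the except clause once you know which error your version raises.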

In the vast majority of cases, variables and data structures closely resemble the names
and structures in the XML and BIN files that they came from. All XML information comes
through the InteropMetadata class, which can be accessed through the meta attribute of
InteropDataset:

metadata = myDataset.meta

InteropDataset caches parsed data after the first run. To get a fresh re-parse of any
file, pass True as the sole argument to any parser method:

tm = myDataset.TileMetrics(True)
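
To make the caching behaviour concrete, here is a small illustration of the same pattern: parse once, reuse the result, and re-parse only when asked. It is a sketch of the idea, not illuminate's implementation. CachingDataset, its get method, the reload parameter name, and fake_parse are all hypothetical.

```python
class CachingDataset(object):
    """Illustrative cache: parse each file once unless a reload is requested."""

    def __init__(self, parse_fn):
        self._parse_fn = parse_fn  # callable that does the (slow) parsing
        self._cache = {}

    def get(self, name, reload=False):
        # Re-parse when forced, or when this file has never been parsed.
        if reload or name not in self._cache:
            self._cache[name] = self._parse_fn(name)
        return self._cache[name]


calls = []


def fake_parse(name):
    """Hypothetical parser that records how many times it has been called."""
    calls.append(name)
    return {"file": name, "parse_number": len(calls)}


ds = CachingDataset(fake_parse)
first = ds.get("TileMetrics")        # parses
second = ds.get("TileMetrics")       # served from the cache
fresh = ds.get("TileMetrics", True)  # forced re-parse
print(first is second)  # True
print(len(calls))       # 2
```

A forced re-parse is mainly useful for in-progress runs, where the binaries on disk keep growing after the first read.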

Using the Results

The two main attributes you have access to in every parser class are the data dictionary
and the DataFrame, accessed as .data and .df respectively.

Each parser produces a “data” dictionary from the raw data. The data dict reflects
the format of the binary itself, so each parser has a slightly different set of keys.
For example:

The parsers are designed to exist apart from their parent dataset, so it’s possible to call
any one of them without having the entire dataset directory at hand. However, some parsers
(like TileMetrics and QualityMetrics) rely on information about the Read Configuration and/or
Flowcell Layout (both pieces of data coming from the XML).

Illuminate has been seeded with some typical defaults for MiSeq, but if you are using a HiSeq,
or you know you have a different configuration, supply read_config and flowcell_layout as named
arguments to these parsers, like so:

More Sample Data

More sample data from MiSeq and HiSeq machines can be found in the Downloads
section of this bitbucket repository.

If you’d like to contribute sample data, contact the maintainer of
this repository (naomi.most@invitae.com) and include a brief description.

Support and Maintenance

Illumina’s metrics data, until recently, could only be parsed and interpreted via Illumina’s
proprietary “SAV” software which only runs on Windows and can’t be sourced programmatically.

This library was developed in-house at InVitae, a CLIA-certified genetic diagnostics
company that offers customizable, clinically-relevant sequencing panels, as a response to
the need to emulate Illumina SAV’s output in a program-accessible way.

Invitae currently uses these parsers in conjunction with site-specific reporting scripts to
produce automated sequencing run metrics as a check on the health of the run and the machines
themselves.

This tool was intended from the beginning to be generalizable and open-sourced to the public.
It comes with the MIT License, meaning you are free to modify it for commercial and
non-commercial uses; just don’t try to sell it as-is.