Genomedata provides a way to store and access large-scale functional
genomics data in a format which is both space-efficient and allows
efficient random-access. Genomedata archives are currently write-once,
although we are working to fix this.

Under the surface, Genomedata is implemented
as one or more HDF5 files, but Genomedata provides a transparent
interface to interact with your underlying data without having to
worry about the mess of repeatedly parsing large data files or having
to keep them in memory for random access.

The Genomedata hierarchy:

Each Genome contains many Chromosomes

Each Chromosome contains many Supercontigs

Each Supercontig contains one continuous data set

Each continuous data set is a numpy.array of floating
point numbers with a column for each data track and a row
for each base in the data set.

Why have Supercontigs?

Genomic data seldom covers the entire genome but instead tends to be defined
in large but scattered regions. In order to avoid storing the undefined
data between the regions, chromosomes are divided into separate supercontigs
when regions of defined data are far enough apart. They also serve as
a convenient chunk since they can usually fit entirely in memory.

Genomedata archives are implemented as one or more HDF5 files.
The API handles both single-file
and directory archives transparently, but the implementation options
exist for several performance reasons.

Use a directory with few chromosomes/scaffolds:

Parallel load/access

Smaller file sizes

Use a single file with many chromosomes/scaffolds:

More efficient access with many chromosomes/scaffolds

Easier archive distribution

Implementing the archive as a directory makes it easier to
parallelize access to the data. In particular, it makes it easy to
create the archives in parallel with one chromosome on each
machine. It also reduces the likelihood of running into the
2 GB file limit applicable to older applications and older versions
of 32-bit UNIX. We are currently using an 81-track Genomedata
archive for our research which has a total size of 18 GB, but the
largest single file (chr1) is only 1.6 GB.

A directory-based Genomedata archive is not ideal for all circumstances,
however, such as when working with genomes with many chromosomes, contigs,
or scaffolds. In these situations, a single file implementation would be much
more efficient. Additionally, having the archive as a single file allows
the archive to be distributed much more easily (without tar/zip/etc).

Note

The default behavior is to implement the Genomedata archive as a
directory if there are fewer than 100 sequences being loaded and as
a single file otherwise.

A Genomedata archive contains sequence and may also contain
numerical data associated with that sequence. You can easily load
sequence and numerical data into a Genomedata archive with the
genomedata-load command (see command details additional details):

This command is a user-friendly shortcut to the typical workflow. The
underlying commands are still installed and may be used if more
fine-grained control is required (for instance, parallel data loading
or adding additional tracks later). The commands and required ordering
are:

As of the current version, Genomedata archives must include the
underlying genomic sequence and can only be created with
genomedata-load-seq. A Genomedata archive can be created
without any tracks, however, using the following pipeline:

New in version 1.2: The ability to create an archive without any data tracks.

Additionally, you may remove portions of data from tracks by hardmasking the
specified data tracks. This can be done anytime after loading in data and
unless specified otherwise will automatically close the archive as well. A
track can be loaded and filtered with the following pipeline:

The data in Genomedata is accessed through the hierarchy described in
Overview. A full Python API is
also available.

Note

The Python API expects that a genomedata archive has already been created.
This can be done manually via the genomedata-load command.
Alternatively, this can be done programmatically using
:_load_seq:load_seq.

To appreciate the full benefit of Genomedata,
it is most easily used as a contextmanager:

If Genome is used as a context manager, it will clean up any opened
Chromosomes automatically.
If not, the Genome object (and all opened chromosomes) should be
closed manually with a call to Genome.close().

This is a convenience script that will do everything necessary to
create a Genomedata archive. This script takes as input:

assembly files in either FASTA (.fa or .fa.gz) format (where the
sequence identifiers are the names of the chromosomes/scaffolds to
create), or assembly files in AGP format (when used with
--assembly). This is mandatory, despite having an
option interface.

trackname, datafile pairs (specified as trackname=datafile), where:

trackname is a string identifier (e.g. broad.h3k27me3)

datafile contains signal data for this data track
in one of the following formats: WIG, BED3+1, bedGraph, or a gzip’d
form of any of the preceding

the chromosomes/scaffolds referred to in the datafile MUST be identical
to those found in the sequence files

usage:genomedata-load[-h][-v][--verbose]-sSEQUENCE-tNAME=FILE[--assembly|--sizes][-f|-d]GENOMEDATAFILECreateGenomedataarchivenamedGENOMEDATAFILEbyloadingspecifiedtrackdataandsequences.IfGENOMEDATAFILEalreadyexists,itwillbeoverwritten.--trackand--sequencemayberepeatedtospecifymultipletrackname=trackfilepairingsandsequencefiles,respectively.Example:genomedata-load-thigh=signal.high.wig-tlow=signal.low.bed.gz-schrX.fa-schrY.fa.gzgdarchivepositionalarguments:GENOMEDATAFILEgenomedataarchiveoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exitFlags:--verbosePrintstatusupdatesanddiagnosticmessagesInputdata:-sSEQUENCE,--sequenceSEQUENCEAddthesequencedatainthespecifiedfileorfiles(mayuseUNIXglobwildcardsyntax)-tNAME=FILE,--trackNAME=FILEAdddatafromFILEasthetrackNAME,suchas:-tsignal=signal.wig-mMASKFILE,--maskfileMASKFILEABEDfilecontainingregionstomaskoutfromtracksbeforeloading--assemblysequencefilescontainassembly(AGP)filesinsteadofsequence--sizessequencefilescontainlistofsizesinsteadofsequenceImplementation:-f,--file-modeIfspecified,theGenomedataarchivewillbeimplementedasasinglefile,withaseparateh5groupforeachChromosome.ThisisrecommendediftherearealargenumberofChromosomes.Thedefaultbehavioristouseasinglefileifthereareatleast100Chromosomesbeingadded.-d,--directory-modeIfspecified,theGenomedataarchivewillbeimplementedasadirectory,withaseparatefileforeachChromosome.ThisisrecommendedifthereareasmallnumberofChromosomes.Thedefaultbehavioristouseadirectoryiftherearefewerthan100Chromosomesbeingadded.

Alternately, as described in Overview, the underlying
Python and C load scripts are also accessible for more finely-grained control.
This can be especially useful for parallelizing Genomedata loading over a
cluster.

You can use wildcards when specifying sequence files, such as in
genomedata-load-seq-s'chr*.fa'. You must be sure to quote the
wildcards so that they are not expanded by your shell. For most
shells, this means using single quotes ('chr*.fa') instead of
double quotes ("chr*.fa").

If you aren’t going to use the sequence later on, loading the assembly from
an AGP file will be faster and take less memory during loading, and
disk space afterward.

This command adds the provided sequence files to the specified
Genomedata, archive creating it if it does not already exist. Sequence
files should be in FASTA (.fa or .fa.gz) format. Gaps of >= 100,000
base pairs in the reference
sequence, are used to divide the sequence into supercontigs. The FASTA
definition line will be used as the name for the chromosomes/scaffolds
created within the Genomedata archive and must be consistent between
these sequence files and the data loaded later with
genomedata-load-data. See this example for details.

usage:genomedata-load-seq[-h][-v][-a][-s][-f][-d][--verbose]GENOMEDATAFILEseqfiles[seqfiles...]StartaGenomedataarchiveatGENOMEDATAFILEwiththeprovidedsequences.SEQFILEsshouldbeinfastaformat,andaseparateChromosomewillbecreatedforeachdefinitionline.positionalarguments:gdarchivegenomedataarchiveseqfilessequencesinFASTAformatoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exit-a,--assemblySEQFILEcontainsassembly(AGP)filesinsteadofsequence-s,--sizesSEQFILEcontainslistofsizesinsteadofsequence-f,--file-modeIfspecified,theGenomedataarchivewillbeimplementedasasinglefile,withaseparateh5groupforeachChromosome.ThisisrecommendediftherearealargenumberofChromosomes.Thedefaultbehavioristouseasinglefileifthereareatleast100Chromosomesbeingadded.-d,--directory-modeIfspecified,theGenomedataarchivewillbeimplementedasadirectory,withaseparatefileforeachChromosome.ThisisrecommendedifthereareasmallnumberofChromosomes.Thedefaultbehavioristouseadirectoryiftherearefewerthan100Chromosomesbeingadded.--verbosePrintstatusupdatesanddiagnosticmessages

This command opens the specified tracks in the Genomedata archive,
allowing data for those tracks to be loaded with genomedata-load-data.

usage:genomedata-open-data[-h][-v]--tracknameTRACKNAME[TRACKNAME...][--verbose]gdarchiveOpenoneormoretracksinthespecifiedGenomedataarchive.positionalarguments:gdarchivegenomedataarchiveoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exit--tracknameTRACKNAME[TRACKNAME...]tracknamestoopen--verbosePrintstatusupdatesanddiagnosticmessages

This command permanently and irreversibly masks out regions from tracks in the
Genomedata archive. Due to slow performance, it is not recommended for masking
large genome-wide datasets. In the case of very large datasets, it is
recommended you mask or filter your data first, then load the masked data with
genomedata-load-data.

usage:genomedata-hardmask[-h][-v][-tTRACKNAME[TRACKNAME...]][--hardmaskOPERATOR][--no-close][--dry-run][--verbose]maskfilegdarchivePermanentlymaskTRACKNAME(s)fromagenomedataarchivewithMASKFILEusinganoptionalfilteroperator.positionalarguments:maskfileinputmaskfilegdarchivegenomedataarchiveoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exit-tTRACKNAME[TRACKNAME...],--tracknameTRACKNAME[TRACKNAME...]Track(s)tobefiltered(default:all)--hardmaskOPERATORSpecifyacomparisonoperationonavaluetomaskout(e.g."lt0.5"willmaskallvalueslessthan0.5).Seethebashcomparisonoperatorsforthetwoletteroperations(default:allvaluesmasked)--no-closeDonotclosethegenomedataarchiveaftermasking--dry-runDonotperformanymasking.Usefulwithverbositysettoseewhatregionswouldbefiltered--verbosePrintstatusanddiagnosticmessages

This command loads data from stdin into Genomedata under the given trackname.
The input data must be in one of these supported datatypes:
WIG, BED3+1, bedGraph. The chromosome/scaffold references in these files must
match the sequence identifiers in the sequence files loaded with
genomedata-load-seq. See this example
for details. A chunk-size can be specified to control the
size of hdf5 chunks (the smallest data read size, like a page size).
Larger values of chunk-size can increase the level
of compression, but they also increase the minimum amount of data
that must be read to access a single value.

BED3+1 format is interpreted the same ways as bedGraph, except that
the track definition line is not required.

Usage: genomedata-load-data [OPTION...] GENOMEDATAFILE TRACKNAME
Loads data into Genomedata format
Takes track data in on stdin
-c, --chunk-size=NROWS Chunk hdf5 data into blocks of NROWS. A higher
value increases compression but slows random
access. Must always be smaller than the max size
for a dataset. [default: 10000]
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

usage:genomedata-close-data[-h][-v][--verbose]gdarchiveComputesummarystatisticsfordatainGenomedataarchiveandreadyforaccessing.positionalarguments:gdarchivegenomedataarchiveoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exit--verbosePrintstatusupdatesanddiagnosticmessages

usage:genomedata-erase-data[-h][-v]--tracknameTRACKNAME[TRACKNAME...][--verbose]gdarchiveErasethespecifiedtracksfromtheGenomedataarchiveinsuchawaythatthetrackdatacanbereplaced(viagenomedata-load-data).positionalarguments:gdarchivegenomedataarchiveoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exit--tracknameTRACKNAME[TRACKNAME...]tracknamestoerase--verbosePrintstatusupdatesanddiagnosticmessages

This command generates a tab-delimited file containing chromosome name and
sizes, suitable for use as a UCSC “chrom sizes” file:

genomedata-infosizesgenomedata

usage:genomedata-info[-h][-v]{tracknames,tracknames_continuous,contigs,sizes}gdarchivePrintinformationaboutagenomedataarchive.positionalarguments:{tracknames,tracknames_continuous,contigs,sizes}availablecommandsgdarchivegenomedataarchiveoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exit

Prints data from a genomedata archive, for the track TRACKNAME,
on CHROM, in the region BEGIN-END (0-based,
half-open indexing). Intended as a convenience function only;
this is much slower than the Python interface,
so it should not be used for large regions.

usage:genomedata-query[-h][-v]gdarchivetracknamechrombeginendprintdatafromgenomedataarchiveinspecifiedtracknameandcoordinatespositionalarguments:gdarchivegenomedataarchivetracknametracknamechromchromosomenamebeginchromosomestartendchromosomeendoptionalarguments:-h,--helpshowthishelpmessageandexit-v,--versionshowprogram's version number and exit

Genomedata uses an HDF5 data store. The data is stored in chunks.
The chunk size is 10,000 bp and one data track of 32-bit
single-precision floats, which makes the chunk 40 kB. Each chunk is
gzip compressed so on disk it will be smaller. To read a single
position you have to read its entire chunk off of the disk and then
decompress it. There is a tradeoff here between latency and
throughput. Larger chunk sizes mean more latency but better throughput
and better compression.

The only disk storage overhead is that compression is slightly less
efficient than compressing the whole binary data file when you break
it into chunks. This is far outweighed by the efficient random access
capability. If you have different needs, then it should be possible to
change the chunk shape (genomedata.CONTINUOUS_CHUNK_SHAPE) or
compression method (genomedata._util.FILTERS_GZIP).

The memory overhead is dominated by the chunk cache defined by
PyTables. On the version of PyTables we use, this is 2 MiB. You can
change this by setting tables.parameters.CHUNK_CACHE_SIZE.

There is currently an interaction between Genomedata and PyTables that
can result in the emission of PerformanceWarnings when a Genomedata
file is opened. These can be ignored. We would like to fix these at
some point.

For other support with Genomedata, or to provide feedback, please write
contact the authors directly. We are interested in all comments regarding the
package and the ease of use of installation and documentation.