GenomicTables are a cloud-optimized medium for storing large amounts of tabular data and querying it by genomic coordinates and other indices. The DNAnexus platform can use GenomicTables as a common format for genomic datasets, including reads, mappings, and variants.

What functionality do GenomicTables offer?

The DNAnexus platform has built-in functionality for GenomicTables that will relieve you of many common challenges in processing genomic datasets:

The API supports streaming of GenomicTable data to and from cloud storage, so that you don't have to deal with transferring and compressing massive files.

Many compute nodes can concurrently read or write data to one GenomicTable.

GenomicTable data can be visualized through the Genome Browser and remotely manipulated through the command-line client.

The platform can automatically sort and index GenomicTable data so that you can query it efficiently through the API. For example, when you create a GenomicTable containing mapped reads, the platform processes the mappings so that they can be queried efficiently by genomic coordinate. In fact, to visualize mappings, the Genome Browser simply uses the API to query a GenomicTable for the selected genomic region on-demand.

Overall, using GenomicTables will both make it much easier to develop scalable Apps, and promote interoperability with other Apps and platform features. Of course, the platform also provides a comprehensive set of tools for converting between common text/binary file formats and GenomicTables.

Columns

Each column of a GenomicTable must contain data of a particular type. Valid types are the following:

Type | Description | Size consumed
boolean | true or false | 1 byte
uint8 | integers in the range 0 to 255 | 1 byte
int16 | integers in the range −32,768 to 32,767 | 2 bytes
uint16 | integers in the range 0 to 65,535 | 2 bytes
int32 | integers in the range −2,147,483,648 to 2,147,483,647 | 4 bytes
uint32 | integers in the range 0 to 4,294,967,295 | 4 bytes
int64 | integers between -2^63 and 2^63-1 that can be represented by an IEEE 754 double-precision number. This includes, but is not limited to, all integers between -9,007,199,254,740,992 and 9,007,199,254,740,992. The name "int" is also an alias for "int64". WARNING: this type does not have the full range of a signed 64-bit integer. | 8 bytes
float | single-precision floating point numbers as defined in IEEE 754 | 4 bytes
double | double-precision floating point numbers as defined in IEEE 754 | 8 bytes
string | Unicode strings of variable length | (Length of UTF-8 encoding of string) + 4 bytes
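The warning on int64 can be checked directly. The following Python sketch tests whether an integer survives a round trip through an IEEE 754 double; the helper name is hypothetical and is not part of the platform API.

```python
def survives_double_round_trip(n: int) -> bool:
    """Return True if n is unchanged after conversion to an IEEE 754 double."""
    return int(float(n)) == n


LIMIT = 2 ** 53  # 9,007,199,254,740,992

print(survives_double_round_trip(LIMIT))      # inside the guaranteed range
print(survives_double_round_trip(LIMIT + 1))  # rounds to a neighbouring value
print(survives_double_round_trip(2 ** 62))    # larger, but a power of two
```

Integers beyond ±2^53 are only safe to store when they happen to be exactly representable, which is why the type does not offer the full signed 64-bit range.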

The columns (their names and types, in order) need to be specified in advance when the GenomicTable object is first created, and remain fixed for the lifetime of the GenomicTable.

ACCOUNTING NOTE: The number of bytes consumed by a GenomicTable is the sum of the consumption of each table cell, calculated according to the “Size consumed” values listed above.
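As a rough illustration of the accounting rule, the following Python sketch sums the per-cell sizes listed above. The function names are hypothetical, and the sketch only estimates consumption on the client side; the platform's reported size remains authoritative.

```python
# Fixed per-cell sizes in bytes, taken from the "Size consumed" values above.
FIXED_SIZES = {
    "boolean": 1, "uint8": 1, "int16": 2, "uint16": 2,
    "int32": 4, "uint32": 4, "int64": 8, "int": 8,
    "float": 4, "double": 8,
}


def cell_size(col_type, value):
    """Bytes consumed by one cell of the given column type."""
    if col_type == "string":
        # Length of the UTF-8 encoding plus 4 bytes.
        return len(value.encode("utf-8")) + 4
    return FIXED_SIZES[col_type]


def row_size(col_types, row):
    """Bytes consumed by one row; the automatic __id__ column is not counted."""
    return sum(cell_size(t, v) for t, v in zip(col_types, row))


print(row_size(["string", "int32", "int32"], ["chr1", 100, 200]))
```

For the example row, the string "chr1" accounts for 4 + 4 bytes and each int32 for 4 bytes.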

In addition to all the columns explicitly specified during GenomicTable creation, the GenomicTable contains an additional special column called "__id__" of type int64, which appears before all other columns, is automatically populated, and does not count towards the space consumed by the GenomicTable.

A column name may be any string, except for strings that match the reserved pattern __.*__.
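A client can reject bad column names before calling the API. The Python sketch below is one possible check: it combines the reserved pattern with the allowed character set given in the /gtable/new input description. Treating both patterns as whole-string matches is an assumption, and the function name is hypothetical.

```python
import re

# Allowed characters, as stated for column names in /gtable/new.
ALLOWED = re.compile(r"[-./A-Za-z0-9_]+")
# Reserved pattern; assumed here to apply to the whole name.
RESERVED = re.compile(r"__.*__")


def is_valid_column_name(name):
    """Return True if name uses only allowed characters and is not reserved."""
    return (ALLOWED.fullmatch(name) is not None
            and RESERVED.fullmatch(name) is None)


for candidate in ["chr", "lo", "__id__", "read name"]:
    print(candidate, is_valid_column_name(candidate))
```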

Lifecycle

GenomicTable objects are stateful. The following diagram represents the possible states (boxes) and actions. When a new GenomicTable object is made (by calling new), it is initially empty and its state is “open”. In that state, GenomicTable rows can be added (by calling addRows), until a request is made to finalize the GenomicTable (by calling close). GenomicTable object finalization is not instantaneous; hence, the GenomicTable object advances to the “closing” state and remains in that state for as long as it is needed. In that state, rows may not be added or retrieved until the system has finalized the GenomicTable. Once finalization is done by the system, the GenomicTable object will advance to the “closed” state. In that state, GenomicTable rows can be retrieved by calling get.

Closing a very large table that must be sorted and indexed (e.g. one billion mappings) can take several hours. The platform performs the closing operation using a distributed algorithm specifically designed and tuned for the cloud environment.
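The state progression described above can be modelled with a small in-memory class. This Python sketch is a toy model only: the class and its finish_closing method are hypothetical stand-ins (the platform performs finalization asynchronously), and the exceptions merely mimic the behaviour described in the text.

```python
class TableLifecycle:
    """Toy model of the open -> closing -> closed progression."""

    def __init__(self):
        self.state = "open"
        self.rows = []

    def add_rows(self, rows):
        if self.state != "open":
            raise RuntimeError("rows can only be added while the table is open")
        self.rows.extend(rows)

    def close(self):
        if self.state != "open":
            raise RuntimeError("only an open table can be closed")
        self.state = "closing"

    def finish_closing(self):
        # Stand-in for the platform's asynchronous finalization step.
        if self.state != "closing":
            raise RuntimeError("table is not in the closing state")
        self.state = "closed"

    def get(self):
        if self.state != "closed":
            raise RuntimeError("rows can only be retrieved once the table is closed")
        return list(self.rows)


table = TableLifecycle()
table.add_rows([["chr1", 0, 10]])
table.close()           # state is now "closing"; no adding or reading
table.finish_closing()  # state is now "closed"
print(table.get())
```

In practice a client would poll the describe method until the state becomes "closed" rather than call anything like finish_closing itself.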

Default row ordering and Fetching

Once it is closed, the rows in the GenomicTable will have a particular, fixed order, determined by the data in the GenomicTable and, if it exists, the first (primary) index. The special column “__id__” contains a row counter, which is equal to 0 for the first row, 1 for the second row, etc. You can use the get method with the "starting" and "limit" options to fetch a contiguous (according to that order) sequence of rows from the GenomicTable.

Indexing

The system supports column indexing for GenomicTable objects. Creating an index enables row retrieval in additional ways that would otherwise be unavailable without an index. Different types of indices support different kinds of queries. Indices must be specified when a GenomicTable object is created, and remain fixed for the lifetime of the GenomicTable.

A GenomicTable can have multiple indices; each index has a unique name. Two
types of indices are supported:

A Genomic Range Index is useful when each row is associated with a
particular interval on a chromosome. The index supports queries for rows that
overlap, or are enclosed by, a specific query range.

A Lexicographic Index supports a more general kind of query against one
or more columns, which may be of any type (string, numeric, or boolean). The
index names one or more columns, which specify a sort order for the rows.
This sort is executed lexicographically. That is, for sort columns A and B,
the rows are sorted by A, with ties being broken by comparing B. You can
issue queries against a prefix of the columns. For example, you can filter
for rows by the value of A (without restricting B), or by the values of the
columns in both A and B.

More details about each type of index are provided below.

We distinguish between the following:

The primary index is the first index specified in the "new" call. The primary index may be of any type (Genomic Range Index or Lexicographic Index).

A secondary index refers to the second and any subsequent indices specified in the "new" call. The secondary indices must be of type Lexicographic Index.

You can specify an index at GenomicTable creation time in the following ways (see the "new" method for more information about declaring indices, and see the "get" method for more information about querying them):

Genomic Range Index

This type of index allows queries that are suitable for bioinformatics applications that refer to a genomic coordinate system. This is a composite index on three different columns:

A column of type string, traditionally representing the name of a chromosome. This documentation will refer to this column as the “chr” column (though it can have any name in the actual GenomicTable).

Two columns of integer type (uint8, int16, uint16, int32, uint32, int64), traditionally representing the low and high boundaries of a genomic interval on that chromosome. This documentation will refer to these columns as the “lo” and “hi” columns (though they can have any name in the actual GenomicTable). DNAnexus suggests the following convention for interpreting the “lo” and “hi” values:

The beginning of the chromosome is marked as 0. In this example, the first nucleotide “g” is denoted by an interval whose “lo” is 0 and “hi” is 1. The dinucleotide “tt” is denoted by an interval whose “lo” is 2 and “hi” is 4. Finally, the point between the two “t” nucleotides is denoted by an interval whose “lo” is 3 and “hi” is 3 (such a segment could be used when describing an insertion in that position, for example). This scheme is also known as “0-based half-open” (see also http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1 )
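The 0-based half-open convention matches Python slice semantics, which makes it easy to experiment with. In the sketch below the sequence "gattaca" is hypothetical; only the positions of "g" and "tt" follow the description above.

```python
sequence = "gattaca"  # hypothetical sequence used for illustration


def bases(seq, lo, hi):
    """Nucleotides covered by the half-open interval [lo, hi)."""
    return seq[lo:hi]


def interval_length(lo, hi):
    """Number of nucleotides covered; 0 for a point between two bases."""
    return hi - lo


print(bases(sequence, 0, 1))   # the first nucleotide
print(bases(sequence, 2, 4))   # the dinucleotide
print(interval_length(3, 3))   # the point between the two "t" nucleotides
```

A zero-length interval such as (3, 3) covers no bases, which is what makes it suitable for describing an insertion point.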

A genomic range index allows for querying rows that “overlap” (or “are enclosed by”) a particular genomic interval, i.e. allows for fetching all the rows whose value of the “chr” column matches a particular query string, and whose “lo” and “hi” columns “overlap” (or “are enclosed by”) a particular query interval. More specifically, if the user provides the query values CHR, LO and HI, then the system will return the rows (chr, lo, hi) that match the following criteria:

Lexicographic Index

In the specification of a lexicographic index, each COL_i is a string giving the name of a column. The hash for each column may also contain the following fields:

order: one of the strings "asc" or "desc" (case insensitive). This
field is optional and defaults to "asc". Specifies whether to order the
values in this column in ascending or descending order, which affects the
row ordering for the table (and therefore, the order in which rows will be
returned when queries are made against this index).

caseSensitive: one of the values true or false. This field is optional
and defaults to true, and may ONLY be supplied on an entry that corresponds
to a string column. Specifies whether to index this column case sensitively.
See the notes for row ordering and lexicographic queries (below) for the
implications of case-insensitive indexing.

For example, for the following column specification:

"columns": [
{"name": "A"},
{"name": "B", "order": "desc"}
]

the rows will be sorted in increasing order of the value in column "A" (either
numerically, or, if "A" is a string column, the strings are themselves compared
lexicographically as sequences of Unicode code points). Rows with the same
value in "A" will appear consecutively, sorted in decreasing order of the value
in column "B".
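The ordering produced by this specification can be reproduced locally. The Python sketch below is an illustration with made-up rows, not the platform's sorting code; it relies on Python's stable sort, applying the sort keys from last to first.

```python
rows = [
    {"A": 2, "B": 5},
    {"A": 1, "B": 3},
    {"A": 1, "B": 9},
    {"A": 2, "B": 7},
]


def sort_rows(rows, spec):
    """Sort by a list of (column, order) pairs, where order is "asc" or "desc".

    Python's sort is stable (also with reverse=True), so sorting by the last
    key first and the first key last yields a lexicographic ordering.
    """
    result = list(rows)
    for column, order in reversed(spec):
        result.sort(key=lambda r: r[column], reverse=(order == "desc"))
    return result


ordered = sort_rows(rows, [("A", "asc"), ("B", "desc")])
for row in ordered:
    print(row)
```

Rows with A equal to 1 come first, with B descending (9, then 3), followed by rows with A equal to 2 (7, then 5).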

A query against a lexicographic index, as provided to the query.parameters field of the /gtable-xxxx/get method, consists of any of the following (the query language is inspired by that of MongoDB):

A hash of the form {"column1": CONSTRAINT1, ...} containing constraints
that must be matched for each of the specified columns. Each CONSTRAINT may
take any of the following forms:

VALUE: the value in column1 must equal VALUE (a string, number, or
boolean).

{"$eq": VALUE}: same as just specifying VALUE, see above.

{"$gte": VALUE}: the value in column1 must be greater than or equal
to VALUE. If the specified column is a string column, strings are
compared lexicographically by their Unicode code points.

{"$gt": VALUE}: the value in column1 must be greater than VALUE. If
the specified column is a string column, strings are compared
lexicographically by their Unicode code points.

{"$lte": VALUE}: the value in column1 must be less than or equal to
VALUE. If the specified column is a string column, strings are compared
lexicographically by their Unicode code points.

{"$lt": VALUE}: the value in column1 must be less than VALUE. If
the specified column is a string column, strings are compared
lexicographically by their Unicode code points.

{"$startsWith": VALUE}: the value in column1 (which must be a string
column) must begin with VALUE.

A hash of the form {"$and": [QUERY1, ...]} where the array specifies any
number of queries, recursively. Rows that are returned must match all of the
specified queries. Note that if you specify a hash with column names and
values using the syntax above, the constraints on individual columns are also
logically combined using "and", but only the $and operator allows you to
supply multiple constraints on the same column.

If any string column is indexed case-insensitively, then queries on that column
are matched in a case-insensitive way, and string inequality operators ($gt,
$gte, $lt, $lte) compare the lowercased version of the value to the
lowercased version of the operand as a sequence of Unicode code points when
comparing strings that are not the same (after normalizing for case).

The /gtable-xxxx/get method does not support arbitrary queries composed of
the operators described above: only queries that match a consecutive sequence
of rows in the ordering specified by the index. In particular, queries must be
on some prefix of the indexed columns, and each column except the last one
specified must have an equality constraint.
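The prefix rule can be expressed as a small check. In this Python sketch the encoding of constraints as "eq" or "range" and the function name are hypothetical; the sketch only tests the shape of a query and does not reproduce the platform's own validation.

```python
def is_consecutive_query(index_columns, constraints):
    """Check the prefix rule for a lexicographic index.

    index_columns: column names in index order.
    constraints: dict mapping a column name to "eq" (equality constraint)
                 or "range" (any other constraint).
    """
    prefix = index_columns[:len(constraints)]
    # The constrained columns must be exactly a prefix of the indexed columns.
    if set(prefix) != set(constraints):
        return False
    # Every constrained column except the last must use equality.
    return all(constraints[column] == "eq" for column in prefix[:-1])


index = ["A", "B", "C", "D"]
print(is_consecutive_query(index, {"A": "eq", "B": "eq", "C": "range"}))
print(is_consecutive_query(index, {"A": "eq", "C": "range"}))   # skips B
print(is_consecutive_query(index, {"A": "range", "B": "eq"}))   # range before last
```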

For example, if you have an index on four columns with the specification
[["A", "asc"], ["B", "asc"], ["C", "asc"], ["D", "asc"]], you can issue a
query to find rows where A is 125, B is "DNA", and C is between 25 (inclusive)
and 30 (exclusive). Columns A and B carry equality constraints, and only the
last constrained column, C, carries a range.
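One way to write such a query with the operators described earlier is sketched below in Python. The $and form is used because two constraints are placed on column C; whether the platform expects exactly this shape is an assumption. The matches function is a hypothetical in-memory evaluator, included only to make the semantics of the operators concrete.

```python
import operator

# Operators from the query language described above.
OPS = {
    "$eq": operator.eq,
    "$gte": operator.ge,
    "$gt": operator.gt,
    "$lte": operator.le,
    "$lt": operator.lt,
    "$startsWith": lambda value, prefix: value.startswith(prefix),
}


def matches(row, query):
    """Evaluate a query hash against a row given as a dict of column -> value."""
    for key, constraint in query.items():
        if key == "$and":
            if not all(matches(row, sub) for sub in constraint):
                return False
        elif isinstance(constraint, dict):
            for op, operand in constraint.items():
                if not OPS[op](row[key], operand):
                    return False
        elif row[key] != constraint:
            return False
    return True


query = {"$and": [
    {"A": 125},
    {"B": "DNA"},
    {"C": {"$gte": 25}},
    {"C": {"$lt": 30}},
]}

print(matches({"A": 125, "B": "DNA", "C": 25, "D": 1}, query))
print(matches({"A": 125, "B": "DNA", "C": 30, "D": 1}, query))
```

The first row satisfies every constraint; the second fails because the upper bound on C is exclusive.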

Indexing and Row Ordering

If any indices are specified upon GenomicTable creation, then when the GenomicTable is closed, the rows will be reordered using the row ordering algorithm for the primary index. The row ordering algorithm is based on the type of the index and is described below. The row ID of the rows will reflect this new order.

If no indices are given, the rows will be ordered in increasing order of the part number in which they were added (within each part, the ordering of the rows will be preserved). The row ID of the rows will reflect this new order.

The row ordering algorithm of each index type is the following:

Genomic Range Index

Rows in the table are reordered according to the following strategy: first, rows are ordered by comparing the Unicode code point sequence of the values in the "chr" column. Ties are resolved by comparing the contents of the “lo” column, and further ties are resolved by comparing the contents of the “hi” column. Further ties are broken arbitrarily.
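This ordering corresponds to sorting on the tuple (chr, lo, hi). The Python sketch below uses made-up rows in a hypothetical [chr, lo, hi] layout; note that comparing chromosome names by code point places "chr10" before "chr2".

```python
rows = [
    ["chr2", 5, 10],
    ["chr10", 0, 3],
    ["chr2", 5, 7],
    ["chr2", 1, 20],
]

# Python compares strings by Unicode code point, matching the rule above.
ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))

# After reordering, row IDs count up from 0 in the new order.
with_ids = [[row_id] + row for row_id, row in enumerate(ordered)]

for row in with_ids:
    print(row)
```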

Lexicographic Index

Rows in the table are reordered lexicographically by the tuple containing the
values being indexed. That is, the rows are sorted by the first indexed column,
going to the second indexed column if there is a tie in the first column, etc.

The sort process respects the asc or desc ordering for each column being
indexed. For string columns indexed case-sensitively (the default), the
individual strings are compared as Unicode code point sequences. For string
columns indexed case-insensitively, strings are converted to lowercase and then
compared as Unicode code point sequences; if the strings are the same when
compared case-insensitively, then they are compared by the original,
non-case-normalized values.

Ties are broken arbitrarily.
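The case-insensitive comparison rule can be sketched as a sort key. This Python example assumes that str.lower() is an adequate stand-in for the platform's lowercase conversion, which may differ for some Unicode characters; the function name is hypothetical.

```python
def lex_key(value, case_sensitive=True):
    """Sort key for one indexed value.

    For case-insensitive string columns, compare the lowercased string first
    and fall back to the original string when the lowercased forms are equal.
    """
    if isinstance(value, str) and not case_sensitive:
        return (value.lower(), value)
    return value


names = ["b", "A", "a", "B"]
print(sorted(names))                                         # case-sensitive
print(sorted(names, key=lambda s: lex_key(s, False)))        # case-insensitive
```

With case-sensitive ordering all uppercase letters sort before lowercase ones; with case-insensitive ordering "A" and "a" are adjacent, and the original values decide their relative order.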

Secondary Index Considerations

Queries on secondary indices (which, for the moment, must be lexicographic
indices) are fastest when each column that is to be retrieved is among the
indexed columns (or the row ID column, "__id__"). To obtain columns that are
not among the indexed columns, the GenomicTable must internally perform
additional random-access queries to retrieve that data. Queries that
request such columns will incur a significant performance penalty and are
recommended only for interactive use and not for bulk queries.

In order to make queries on secondary indices faster, consider doing the
following:

Request only the subset of columns you need.

If you frequently need to retrieve additional columns, append those columns
to your lexicographic index column specification. At the cost of making the
index larger and slightly slower, you will be able to obtain the values for
the columns of interest much more quickly.

List of API Methods

GenomicTable API Methods

The following API methods are specific to GenomicTables or have behavior specific to GenomicTables.

GenomicTable API Method Specifications

API method: /gtable/new

Specification

Creates a new GenomicTable object. The GenomicTable is initially in the “open” state. Refer to the Lifecycle section for more information on states.

Inputs

project string ID of the project or container to which the gtable should belong (e.g. the string "project-xxxx")

name string (optional, default is the new ID) The name of the object

tags array of strings (optional) Tags to associate with the object

types array of strings (optional) Types to associate with the object

hidden boolean (optional, default false) Whether the object should be hidden

properties mapping (optional) Properties to associate with the object

key Property name

value string Property value

details mapping or array (optional, default { }) JSON object or array that is to be associated with the object; see the Object Details section for details on valid input

folder string (optional, default "/") Full path of the folder that is to contain the new object

parents boolean (optional, default false) Whether all folders in the path provided in folder should be created if they do not exist

columns array of mappings List of column descriptors. The
order of elements in the array is used to determine the order of
columns in the created GenomicTable. Column names must be unique.
Each column descriptor has the following key/values:

name string The column name (a string that matches the
regular expression [-./A-Za-z0-9_]+ and does not match the
reserved column pattern __.*__)

type string The column type (must be one of the allowed types)

indices array of mappings (optional) List of index
descriptors. If provided, the first index will be used to reorder
the rows upon closing the GenomicTable. An index descriptor must be
in one of the formats specified above in the
Indexing and Row Ordering section.

initializeFrom mapping (optional) Indicate an existing GenomicTable from which to use the metadata as default values for all fields that are not given:

project string ID of the project or container containing the GenomicTable to use

id string ID of the GenomicTable to use; this table can be
in any state

Inherited metadata includes column and index specifications. If
provided, metadata fields and specifications from the existing table
can be overridden by setting them explicitly. For example, to use
the column specs but not the indices, as well as removing the types
from an existing table, one would set both indices and types to
the empty array [ ]. Note that this allows initialization of the
metadata but not the data in the resulting GenomicTable object; it
will be an empty table with no rows.

Outputs

id string ID of the created GenomicTable object (i.e. a string in the form “gtable-xxxx”).

Errors

InvalidInput

A reserved linking string (“$dnanexus_link”) appears as a key in
a hash in “details” but is not the only key in the hash

A reserved linking string (“$dnanexus_link”) appears as the only
key in a hash in “details” but has value other than a string

The “columns” array is empty

A column name appears more than once in “columns”

Any column name contains invalid characters or matches the
reserved pattern __.*__

The string for a column type is not one of the known types

An index name appears more than once in “indices”

Any index descriptor in “indices” is not valid (a specified column
name is not described with a valid column descriptor in the
“columns” array, or the index descriptor is not of one of the
allowed formats as described above in Indexing)

“initializeFrom” is not a hash or does not have both “project” and
“id” keys that are nonempty strings

For each property key-value pair, the size, encoded in UTF-8, of
the property key may not exceed 100 bytes and the property value
may not exceed 700 bytes

PermissionDenied (UPLOAD access required)

InvalidType (“initializeFrom” is specified with an object ID that is not a gtable)

ResourceNotFound (the specified project is not found, or the route in “folder” does not exist while “parents” is false)

API method: /gtable-xxxx/nextPart

Specification

Returns a part ID. Any two calls to nextPart on the same table are guaranteed to return different part IDs. You can therefore use this route to obtain unique part IDs.

Note that nextPart does not check that the part ID that it returns has not previously been written (nor does it protect against someone else writing that part). Therefore it is not, in general, safe to use part IDs obtained via nextPart if, for any addRows call, you supplied a part ID that was NOT obtained via nextPart.

Inputs

None

Outputs

part int A part ID (integer from 1 to 250,000)

Errors

ResourceNotFound (the specified object does not exist)

PermissionDenied (UPLOAD access required)

InvalidState (the GenomicTable object is not in the “open” state, or 250,000 parts have already been allocated).

API method: /gtable-xxxx/addRows

Specification

Adds rows to a GenomicTable. To enable parallelism and robustness, DNAnexus follows an approach where row addition is done in parts. This method receives a part ID, as well as an array of rows corresponding to that part. When the GenomicTable is closed, parts are concatenated according to their part IDs.

If this method has not been called by the time the “close” method is called, the resulting GenomicTable will be empty.

Unlike file objects, for which a separate URL is provided for data upload, calling “addRows” on a GenomicTable object requires supplying the row data with the call (in the “data” input field). If the client aborts during the HTTP request, the partially transmitted data are discarded. If the HTTP request is completed successfully, the rows are added to the GenomicTable, unless another request has been already completed for the same part ID, in which case the system responds with InvalidInput. In other words, if this method is called multiple times for the same part ID, only the first successful request will matter. The system keeps track of successfully added parts, and this information is returned by the “describe” method.

The data given to this method need to correspond to the GenomicTable columns.

Inputs

part int (optional, default 1) Part ID that is being uploaded
in this call

data array of arrays List of rows to be added. Each row is an
array consisting of values that correspond to the GenomicTable
columns. Since all the input is given in JSON, values for columns of
type “string” need to be strings, values for columns of type
"boolean" need to be booleans, and values for columns of all other
types need to be numbers.
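Rows can be checked on the client before they are sent. The Python sketch below verifies the count and JSON type of each value as described above; the additional range check for integer columns is an assumption based on the type table, and the function names are hypothetical.

```python
# Integer ranges from the column type table; enforcing them here is an assumption.
INTEGER_RANGES = {
    "uint8": (0, 2**8 - 1),
    "int16": (-2**15, 2**15 - 1),
    "uint16": (0, 2**16 - 1),
    "int32": (-2**31, 2**31 - 1),
    "uint32": (0, 2**32 - 1),
    "int64": (-2**63, 2**63 - 1),
    "int": (-2**63, 2**63 - 1),
}


def is_valid_value(col_type, value):
    """Check one value against its column type."""
    if col_type == "string":
        return isinstance(value, str)
    if col_type == "boolean":
        return isinstance(value, bool)
    # bool is a subclass of int in Python, so exclude it from numeric columns.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    if col_type in INTEGER_RANGES:
        lo, hi = INTEGER_RANGES[col_type]
        return lo <= value <= hi and float(value).is_integer()
    return col_type in ("float", "double")


def is_valid_row(col_types, row):
    """A row is valid if it has one value of the right kind per column."""
    return len(row) == len(col_types) and all(
        is_valid_value(t, v) for t, v in zip(col_types, row)
    )


col_types = ["string", "int32", "boolean"]
print(is_valid_row(col_types, ["chr1", 5, True]))
print(is_valid_row(col_types, ["chr1", 5]))        # wrong number of values
print(is_valid_row(col_types, ["chr1", "5", True]))  # wrong type
```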

Outputs

id string ID of the manipulated object (i.e. the string “gtable-xxxx”)

Errors

ResourceNotFound (the specified object does not exist)

PermissionDenied (UPLOAD access required)

InvalidInput (the input is not a hash, or the key part (if provided) is not an integer in 1-250000, or a part with that ID has already been uploaded, or data is missing, or is not an array, or its members (rows) are not arrays, or at least one member (row) is invalid [does not have the correct count and type of values])

InvalidState (the GenomicTable object is not in the “open” state)

API method: /gtable-xxxx/describe

Specification

Describes a GenomicTable object (see also /record-xxxx/describe). Returns, among other things, the column and index descriptors, as well as the state of the GenomicTable object. If the GenomicTable object is in the “closing” or “closed” states, the length (in number of rows) is reported as well. If the GenomicTable object is in the “open” state, the response contains a “parts” key, whose value is a hash describing the parts that have been successfully added. Only parts for which the “addRows” method has been successfully called (i.e. the request has been performed and a successful response has been issued) are present in the hash. For each part, the length of the part (in number of rows) is reported.

Inputs

project string (optional) Project or container ID to be used as
a hint for finding the object in an accessible project

defaultFields boolean (optional, default false if fields is
supplied, true otherwise) Whether to include the default set of fields
in the output (the default fields are described in the "Outputs"
section below). The selections are overridden by any fields explicitly
named in fields.

fields mapping (optional) Include or exclude the specified
fields from the output. These selections override the settings in
defaultFields.

The following options are deprecated (and will not be respected if
fields is present):

properties boolean (optional, default false) Whether the
properties should be returned

details boolean (optional, default false) Whether the details
should also be returned

Outputs

id string The object ID (i.e. the string "gtable-xxxx")

The following fields are included by default (but can be disabled using
fields or defaultFields):

project string ID of the project or container in which the
object was found

class string The value "gtable"

types array of strings Types associated with the object

created timestamp Time at which this object was created

state string The value "open", "closing", or "closed"

hidden boolean Whether the object is hidden or not

links array of strings The object IDs that are pointed to from this object

name string The name of the object

folder string The full path to the folder containing the object

sponsored boolean Whether the object is sponsored by DNAnexus

tags array of strings Tags associated with the object

modified timestamp Time at which the user-provided metadata of the object was last modified

createdBy mapping How the object was created

user string ID of the user who created the object or
launched an execution which created the object

job string (present if a job created the object) ID of the job that created the object

executable string (present if a job created the object) ID of the app or applet that the job was running

columns array of mappings List of column descriptors
representing the columns of the GenomicTable. The special column
“__id__” of type int64 precedes all other columns.

indices array of mappings (present if applicable) List of
index descriptors as provided in the indices field of the “new”
method. The primary index, if it exists, appears in the first
position. The order of the remaining indices, if any, is
unspecified.

size int The size (in bytes) of the GenomicTable; this is updated as rows are added

The following field (included by default) is available if the object is
in the "open" state:

parts mapping Information on the parts that have been uploaded

key Part ID that was provided to a successful “addRows” call

value mapping Mapping with key/values:

length int The length (in rows) of the part

The following field (included by default) is available if the object is
in the "closing" or "closed" state:

length int The length (in rows) of the GenomicTable

The following field (included by default) is available if the object is
sponsored by a third party:

sponsoredUntil timestamp Indicates the expiration time of data
sponsorship (this field is only set if the object is currently sponsored, and
if set, the specified time is always in the future)

The following fields are only returned if the corresponding field in the
fields input is set to true:

properties mapping Properties associated with the object

key Property name

value string Property value

details mapping or array Contents of the object’s details

Errors

ResourceNotFound (the specified object does not exist or the specified project does not exist)

InvalidInput (the input is not a hash, project (if supplied) is not a string, or the value of properties (if supplied) is not a boolean)

PermissionDenied (VIEW access required for the project provided (if any), and VIEW access required for some project containing the specified object (not necessarily the same as the hint provided))

API method: /gtable-xxxx/close

Specification

Initiates finalization of the GenomicTable object. If this call is successful, it will return immediately and the GenomicTable object will advance to the “closing” state. The system will “concatenate” the rows of the parts, in order of increasing part ID (part IDs do not have to be consecutive). Once the system is done, the GenomicTable object will advance to the “closed” state.

If the GenomicTable was created by calling “new”, then the system does not perform any more checks and just concatenates all the rows, according to the part IDs. If the GenomicTable has an index, then rows are further re-ordered according to that index.
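The concatenation step for a table without indices can be pictured as follows. This Python sketch uses a hypothetical in-memory dictionary of parts; it illustrates the ordering rule only and is not the platform's distributed implementation.

```python
def concatenate_parts(parts):
    """Concatenate rows in order of increasing part ID.

    parts: dict mapping part ID -> list of rows. Part IDs need not be
    consecutive, and the row order within each part is preserved.
    """
    return [row for part_id in sorted(parts) for row in parts[part_id]]


parts = {
    7: [["c"]],
    2: [["a"], ["b"]],
    40: [["d"]],
}

rows = concatenate_parts(parts)
print(rows)
```

Because only the relative order of part IDs matters, parallel writers can upload their parts in any order and at any time before the table is closed.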

Inputs

None

Outputs

id string ID of the manipulated object (i.e. the string “gtable-xxxx”)

Errors

ResourceNotFound (the specified object does not exist)

PermissionDenied (UPLOAD access required in the project)

InvalidState (the GenomicTable object is not in the “open” state)

API method: /gtable-xxxx/get

Specification

Retrieves rows from the GenomicTable.

If the query parameter is missing from the input, then this request retrieves consecutive rows from the GenomicTable, starting from the row whose row ID equals the “starting” parameter, and returning at most as many rows as the “limit” parameter allows.

If the query parameter is present, its parameters must be compatible with the structure of the selected index. Only rows that satisfy the query will be returned. The returned rows are ordered by the row ordering algorithm for the selected index (see the section titled "Indexing and Row Ordering" above). As an example, queries against a Genomic Range Index will return rows that are ordered by their leftmost coordinate in the genome. Note in particular that queries on the primary index will return rows in order of increasing row ID, but queries on secondary indices will not.

The limit parameter is used to limit the number of rows returned in the result. If more results would have been returned had limit been higher, the field next contains the row ID of the next row that would have been returned. The value of next is suitable to use as the starting parameter of a subsequent request (with the same query parameters) if you want to continue fetching rows where you left off. This works both for regular requests and for queries against an index (genomic or lexicographic).
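The paging pattern described above can be sketched as a loop that feeds next back into starting. In this Python example, fake_get is a hypothetical in-memory stand-in for /gtable-xxxx/get without a query; a real client would issue an API request instead, and for index queries the value of next should be treated as opaque.

```python
# Hypothetical table contents: [__id__, chr, lo]
ROWS = [[i, "chr1", i * 10] for i in range(25)]


def fake_get(starting=0, limit=1000):
    """In-memory stand-in for /gtable-xxxx/get (no query)."""
    chunk = ROWS[starting:starting + limit]
    end = starting + len(chunk)
    return {
        "length": len(chunk),
        "next": end if end < len(ROWS) else None,
        "data": chunk,
    }


def fetch_all(get, limit):
    """Fetch every row by following the next value until it is null."""
    rows = []
    starting = 0  # row IDs start at 0
    while True:
        response = get(starting=starting, limit=limit)
        rows.extend(response["data"])
        if response["next"] is None:
            return rows
        starting = response["next"]


all_rows = fetch_all(fake_get, limit=10)
print(len(all_rows))
```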

Inputs

starting int (optional) Either

the lowest row ID to be returned (if query is not provided),
or

the non-null value that was returned as next in the output of a
previous query that had reached its limit, in order to continue
that query.

limit int (optional, default 1000) The maximum number (between
1 and 100,000,000) of rows that may be returned

query mapping (optional) A query suitable for one of the
table's indices; if not provided, rows will be fetched by original
row ID. A valid query contains the following keys/values:

index string Name of the index that is to be used to answer this query

parameters mapping or array The query parameters. The
format of this value depends on the type of the index:

coords array of [string, int, int] Genomic range
coordinates of the form [CHR, LO, HI]. Please refer to
the Indexing section for more information about how these
values are used to perform a range query.

Lexicographic index: array List of MongoDB-style
queries. See the Indexing section for the query language that is
supported.

columns array of strings (optional) The names of the columns
that will be included in the result and returned in the order
specified. If not provided, all columns will be included in the
original order and preceded by the special column “__id__”.

Outputs

length int The number of rows included in this response, i.e. the length of the data array

next value or null If null, all row results were reported in
data. If non-null, represents the next result (generally as an
opaque int64 value) that could not be returned because limit
results have already been returned. This value should be passed
directly to starting in a subsequent query if more results are
desired.

data array of arrays List of rows; each row is an array
containing the row ID and the values corresponding to
columns. Values of string columns are strings, and values of
numeric columns are numbers.

Errors

ResourceNotFound (the specified object does not exist)

PermissionDenied (VIEW access required)

InvalidInput (the input is not a hash, or the starting value (if provided) is not an integer, or the limit value (if provided) is not an integer between 1 and 100,000,000, or the query (if provided) does not supply a valid index name, or the query (if provided) is not of a form that is compatible with the named index, or the columns parameter (if provided) is not an array of strings, or one of the columns named in the columns array is not present in the GenomicTable, or a column name appears multiple times in the columns array)