Navigation

The real power in the GEMINI framework lies in the fact that all of your
genetic variants have been stored in a convenient database in the context of a
wealth of genome annotations that facilitate variant interpretation. The
expressive power of SQL allows one to pose intricate questions of one’s variation
data.

Note

If you are unfamiliar with SQL, sqlzoo has a decent
online tutorial describing the basics. Really all you need to learn is the
SELECT statement, and the examples below will give you a flavor of how to
compose base SQL queries against the GEMINI framework.

GEMINI has a specific tool for querying a gemini database that has been loaded
using the geminiload command. That’s right, the tool is called
geminiquery. Below are a few basic queries that give you a sense of how to
interact with the gemini database using the query tool.

The above examples illustrate ad hoc queries that do not request or filter
upon the genotypes of individual samples. Since GEMINI stores the genotype
information for each variant in compressed arrays that are stored as BLOBs
in the database, standard SQL queries cannot directly access individual
genotypes. However, we have enhanced the SQL syntax to support such queries
with C “struct-like” access. For example, to retrieve the alleles for a given
sample’s (in this case, sample 1094PC0009), one would add gts.1094PC0009
to the select statement.

Here is an example of selecting the genotype alleles for four
different samples (note the examples below use the test.snpEff.vcf.db
file that is created in the ./test directory when you run the
bash master-test.sh command as described above):

The above examples demonstrate how one can select individual sample genotype
information by explicitly listing each column and sample that one wishes to see.
Obviously, this can become tedious when a project involves hundreds or thousands of samples
— if you wanted to see genotype information for the 345 of 1145 affected samples in your study,
you would have to type each and every column.sample name out. Brutal.

To get around this, one can bulk-select sample genotype information using “wildcards”. The column and the wildcard must each be surrounded with parentheses and separated by a period. The “*” is a shortcut (wildcard) meaning “all samples”.

Note

The syntax for SELECTing genotype columns using a wildcard is (COLUMN).(WILDCARD).

For example, a shortcut to reporting the genotype for all samples (in this case 4) in the study, one could do the following:

Now, we often want to focus only on variants where a given sample has a
specific genotype (e.g., looking for homozygous variants in family trios).
Unfortunately, we cannot directly do this in the SQL query, but the gemini query
tool has an option called –gt-filter that allows one to specify filters to
apply to the returned rows. The rules followed in the –gt-filter option
follow Python syntax.

Tip

As you will see from the examples below, appropriate use of the –gt-filter
option will allow you to compose queries that return variants meeting
inheritance patterns that are relevant to the disease model of interest
in your study.

As an example, let’s only return rows where sample
1094PC0012 is heterozygous. In order to do this, we apply a filter to the
gt_types columns for this individual:

Many times, we want to be able to apply the same rule to multiple samples
without having to enter the rule over and over again for each sample. For example,
let’s imaging there are 100 samples in your study and you only want to report variants
where every sample has an observed alignment depth of at least 20 reads. Traditionally,
one would have enter each of the 100 samples from the command line as follows:

One can also apply wildcards that select samples based on the values in specific
columns in the samples table. For example, let’s imagine we wanted to require that variants
are returned only in cases where ALL the affected individuals in the study (i.e., the phenotype column in the samples table is equal to 2) have non-reference genotypes. We could do the following:

The examples provided thus far have deomnstrated how to enforce that ALL of the samples meeting specific criteria meet the same genotype column restrictions. However, clearly there are cases where one would want to be less restrictive. In order to accomondate such queries, there are three other enforcement operators allowed: any, none, and count.

For example, perhaps we want to relax the above query a bit and only require that at least one (i.e., any) of the affected samples has depth > 20:

The system is fairly flexible in that it allows one to wildcard-select samples based on custom columns
that have been added to the samples table based upon a custom PED file. For example, let’s imaging our
custom PED file had an extra column defining the hair color of each sample. We coukd use that to restrict interesting variants to those where samples with blue hair were heterozygous:

While exploring your data you might hit on a set of interesting variants and want to know
which of your samples have that variant in them. You can display the samples containing
a variant with the –show-sample-variants flag:

If you’d like to be able to export variants plus associated sample metadata into
a downstream tool like R or pandas for additional exploration, adding the
--formatsampledetail command flattens all found samples and attached
metadata information:

The --sample-filter option allows you to select samples that a variant
must be in by doing a SQL query on the samples table. For example if you
wanted to show the set of variants that appear in all samples with
a phenotype status of 2, you could do that query with:

By default –sample-filter will show the variant if at least one sample contains the
variant. You can change this behavior by using the --in option along with
--sample-filter. --inall will return a variant if all samples matching
the query have the variant. innone will return a variant if the variant
does not appear in any of the matching samples. --inonly will return a variant
if the variant is only in the matching samples and not in any of the non-matching
samples. --inonlyall will show all of the variant which are in all of the
matching samples and not in any of the non-matching samples.

The --family-wise flag applies the --sample-filter and --in behavior
on a family-wise basis. For example to show all variants that are only in samples
with a phenotype status of 2 in at least one family:

You can also specify that a variant passes this filter in multiple families with
the --min-kindreds option. So if you want to do the same query above, but restrict it
such that at least three families have to pass the filter:

If the PED file you loaded has extra fields in it, those will also work with the
--sample-filter option. For example if you had a hair_color extended field,
you could query on that as well as phenotype:

One can modify the default comma delimiter used by the --show-samples
option through the use of the --sample-delim option. For example, to use
a semi-colon instead of a comma, one would do the following:

You can pass –header to get a header to see which samples have which
variant. To use the TPED format you also need to generate a corresponing TFAM
file from your data as well, which you can get from the GEMINI dump tool:

For prioritizing variants sometimes it is useful to have summary counts of
the carrier status for all samples with a variant stratified across a phenotype.
--carrier-summary-by-phenotype takes a column in the samples table that you
want to summarize the carrier status of and adds a set of counts of
carrier/non-carrier status for each phenotype in the given column. For example,
to get a summary of how a set of variants segregate with affected status:

The gene tables viz. gene_detailedtable and the gene_summarytable have been built on version 73 of the ensembl genes. The column specifications are
available at The GEMINI database schema. These tables contain gene specific information e.g. gene synonyms, RVIS percentile scores(Petrovski et.al 2013), strand specifications, cancer gene census etc. While the former is more detailed, the later lacks transcript wise information and summarizes some aspects of the former. For e.g. while the gene_detailed table lists all transcripts of a gene with their start and end co-ordinates, the gene_summary table reports only the minimum start and maximum end co-ordinates of the gene transcripts. The chrom, gene and the transcript columns of the gene tables may be used to join on the variants and the variant_impacts tables.

Since the current available transcript sets are more than one (e.g. RefSeq, ENSEMBL and UCSC)
we support information (e.g pathways tool) for the ENSEMBL transcripts but provide a mapping of
these transcripts to the consensus set agreed upon by all the above three mentioned groups viz.
transcriptshavingavalidCCDS_ID. Here we show, how we can return variants and their
impacts for only these restricted set of transcripts using the gene_detailed table.

Most genes are popular by their common names while the representation of gene names
in the GEMINI database is mostly HGNC. For e.g ARTEMIS would be DCLRE1C in
the GEMINI database. As such one may miss out on variants if looking for specific
genes by their common names. While, joining the main tables with the gene tables for
synonym information would be useful (as shown in the previous examples), the gene
tables may also serve as a quick look up for alternate names of a gene, which
could then be looked up in the database.