1. Getting Started

This section describes the dataset, required software, and computational environment for the exome association analysis demonstrated in this tutorial.

1.1 Data Source

We use the exome data from the 1000 genomes project. The entire dataset can be found at the NCBI ftp site. This release (version 3.0, April 30th, 2012) contains phased genotype calls for 1092 samples in VCF format, with 38.2M SNVs, 3.9M short indels and 14K large deletions, pooled from multiple data sources. Targets for exome sequencing can be found at the 1000 genomes ftp site.

We created a snapshot dataset for this demonstration with the original exome data plus simulated genotype quality scores and phenotype values. The dataset is 2.1GB and can be automatically downloaded and loaded through the vtools snapshot mechanism (snapshot vt_qc).

Genotype attributes

The VCF file from the 1000 genomes project has two genotype attributes: genotype dosage from MaCH/Thunder and genotype likelihoods. To demonstrate the data QC procedures we need additional attributes such as genotype depth (GD) and genotype quality (GQ). We randomly simulated GD and GQ scores, based on neither genotype dosage nor genotype likelihoods, and appended them to each genotype call. These scores will be used in QC for demonstration purposes.
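The tutorial does not specify how the GD and GQ scores were generated, so the distributions below are assumptions; this is only a minimal sketch of how such per-call attributes could be simulated:

```python
import random

def simulate_genotype_scores(n_calls, mean_depth=30, max_gq=99, seed=1):
    """Simulate genotype depth (GD) and genotype quality (GQ) scores,
    independent of genotype dosage and likelihoods (assumed distributions)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_calls):
        # Depth: a binomial count loosely approximating sequencing coverage.
        gd = sum(rng.random() < 0.5 for _ in range(2 * mean_depth))
        # Quality: drawn uniformly over the usual phred-scaled GQ range.
        gq = rng.randint(0, max_gq)
        scores.append((gd, gq))
    return scores
```

Scores like these are then appended to each genotype call as extra genotype fields.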

Phenotypes

Two simulated phenotypes BMI and smoking combined with the original PED file will be used as phenotype data for this demonstration.

1.2 Required Software

ANNOVAR

We will use the ANNOVAR program to determine which variants should be analyzed in association tests based on their functionality (nonsynonymous, splicing sites, etc).

KING

We will use the KING program for phenotype level quality control, i.e., to infer kinship and population structure. The PLINK program is needed to convert output to a KING compatible format.

R with ggplot2

vtools report requires R with ggplot2 in order to produce QQ plots and Manhattan plots. To install the package in R:

install.packages("ggplot2")

1.3 Computer Environment

The analyses demonstrated in this tutorial are not computationally intensive and can be carried out efficiently on a reasonably powered computer. We have applied similar analyses to ~6,500 whole exome sequencing samples with a total of ~2 million variants and experienced satisfactory performance on a regular Debian GNU/Linux based workstation.

Workstation configuration:

CPU
    product: AMD FX(tm)-8150 Eight-Core Processor
    vendor: Advanced Micro Devices [AMD]
    size: 3600MHz
    capacity: 3600MHz
    width: 64 bits
    clock: 200MHz

RAM (4x4GB = 16GB)
    description: DIMM Synchronous 1066 MHz (0.9 ns)
    size: 4GiB each (16GB total)
    width: 64 bits
    clock: 1066MHz (0.9ns)

Cache
    L1 cache: size 384KiB, capacity 384KiB, clock 1GHz (1.0ns)
    L2 cache: size 8MiB, capacity 8MiB, clock 1GHz (1.0ns)
    L3 cache: size 8MiB, capacity 8MiB, clock 1GHz (1.0ns)

Motherboard
    product: Crosshair V Formula
    vendor: ASUSTeK COMPUTER INC.

Harddrive (2x2TB)
    product: ST32000641AS
    vendor: Seagate
    version: CC13
    serial: 9WM6ZQJG
    size: 1863GiB

2. Initializing a Project

For convenience and portability of commands, we set two bash variables:

script_dir="./cache"
data_dir="./data"

2.1 Preparing input format configuration file

A "data format configuration file" (*.fmt) is required to build a project. This mechanism allows importing variants/samples from arbitrary input formats. It is always good practice to review and adapt a template fmt file for a specific dataset. We provide a few fmt templates, among which is a vcf format template. We created g1000vcf.fmt, a modified version of the template vcf.fmt, by extracting the meta-information of the source vcf file and changing the template accordingly. You should find this file in your project directory if you have loaded our snapshot vt_qc. Using this snapshot, you do not need to import any data for this tutorial; however, you can modify and reuse the configuration files for your other projects.

g1000vcf.fmt is configured such that it will not import the variant INFO fields (we will create a separate annotation database for them) or the genotype likelihood (GL) field (we will not use it in the analysis).

Optimize the performance

If you have a large volume of data to import, it is important to set "runtime optimization" parameters to speed up the process. Please read this documentation for details.

Import from multiple sources

If your dataset is stored in multiple VCF files, please read this documentation on properly importing such datasets.

The snapshot we provide intentionally leaves out the INFO field of the VCF file. That information is available in the file variants.tsv.gz in this snapshot, along with the g1000.ann file used to build an annotation database from the INFO field of the VCF file.

variant.variant_id
variant.bin
variant.chr
variant.pos
variant.ref
variant.alt
g1000.chr         Chromosome
g1000.pos         1-based position
g1000.dbsnp_id    variant id (rs number or other IDs)
g1000.ref         Reference allele, '-' for insertion.
g1000.alt         Alternative allele, '-' for deletion.
g1000.qual        phred-scaled quality score for the assertion made in ALT. High QUAL scores indicate high confidence calls.
g1000.filter      PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail.
g1000.LDAF        MLE Allele Frequency Accounting for LD
g1000.AVGPOST     Average posterior probability from MaCH/Thunder
g1000.RSQ         Genotype imputation quality from MaCH/Thunder
g1000.ERATE       Per-marker Mutation rate from MaCH/Thunder
g1000.THETA       Per-marker Transition rate from MaCH/Thunder
g1000.CIEND       Confidence interval around END for imprecise variants
g1000.CIPOS       Confidence interval around POS for imprecise variants
g1000.END         End position of the variant described in this record
g1000.HOMLEN      Length of base pair identical micro-homology at event breakpoints
g1000.HOMSEQ      Sequence of base pair identical micro-homology at event breakpoints
g1000.SVLEN       Difference in length between REF and ALT alleles
g1000.SVTYPE      Type of structural variant
g1000.AN          Total Allele Count
g1000.AC          Alternate Allele Count
g1000.AA          Ancestral Allele, see ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README
g1000.AF          Global Allele Frequency based on AC/AN
g1000.AMR_AF      Allele Frequency for samples from AMR based on AC/AN
g1000.ASN_AF      Allele Frequency for samples from ASN based on AC/AN
g1000.AFR_AF      Allele Frequency for samples from AFR based on AC/AN
g1000.EUR_AF      Allele Frequency for samples from EUR based on AC/AN
g1000.VT          indicates what type of variant the line represents (eg: INDEL, SNP)
g1000.SNPSOURCE   indicates if a snp was called when analysing the low coverage or exome alignment data (eg: EXOME, LOWCOV)
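The annotation database essentially tabulates the semicolon-separated INFO column of the VCF file into fields like those above. A minimal sketch of that extraction (not the actual vtools implementation; the INFO string below is a hypothetical example):

```python
def parse_info(info):
    """Split a VCF INFO string into a dict of fields;
    flag-style entries without '=' map to True."""
    fields = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        fields[key] = value if sep else True
    return fields

# Hypothetical INFO string using field names from the list above.
row = parse_info('AC=3;AN=2184;AF=0.0014;VT=SNP;SNPSOURCE=EXOME')
```

Each parsed key becomes a queryable annotation field such as g1000.AC or g1000.VT.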

ANNOVAR Annotations

We want to annotate variants with ANNOVAR for variant functionality so that we can focus on functional variants in rare variant association analysis. Although all variants in this dataset can be found in the large collection of readily available annotation databases in variant annotation tools, we demonstrate here the use of customized text-based annotation sources. For readily usable annotations please refer to the documentation of variant annotation tools.

vtools show format ANNOVAR_exonic_variant_function and vtools show format ANNOVAR_variant_function display the details of these configuration formats.

vtools show format ANNOVAR_exonic_variant_function

Format: ANNOVAR_exonic_variant_function
Description: Output from ANNOVAR, generated from command
"path/to/annovar/annotate_variation.pl annovar.txt
path/to/annovar/humandb/". This format imports chr, pos, ref, alt
and ANNOVAR annotations. For details please refer to
http://www.openbioinformatics.org/annovar/annovar_gene.html
Columns:
None defined, cannot export to this format
variant:
chr Chromosome
pos 1-based position, hg18
ref Reference allele, '-' for insertion.
alt Alternative allele, '-' for deletion.
Variant info:
mut_type the functional consequences of the variant.
Other fields (usable through parameters):
genename Gene name (for the first exon if the variant is in more than one
exons, but usually the names for all exons are the
same).
function the gene name, the transcript identifier and the sequence change in
the corresponding transcript
Format parameters:
var_info Fields to be outputted, can be one or both of mut_type and function.
(default: mut_type)

vtools show format ANNOVAR_variant_function

Format: ANNOVAR_variant_function
Description: Output from ANNOVAR for files of type "*.variant_function", generated
from command "path/to/annovar/annotate_variation.pl annovar.txt
path/to/annovar/humandb/". This format imports chr, pos, ref, alt
and ANNOVAR annotations. For details please refer to
http://www.openbioinformatics.org/annovar/annovar_gene.html
Columns:
None defined, cannot export to this format
variant:
chr Chromosome
pos 1-based position, hg18
ref Reference allele, '-' for insertion.
alt Alternative allele, '-' for deletion.
Variant info:
region_type The genomic region type (i.e., intergenic, ncRNA_intronic, etc) where
this variant lies.
Other fields (usable through parameters):
region_name Genomic region name that corresponds to the region_type. If the
variant lies in an intergenic region, this field will
specify the closest known regions upstream and
downstream of this variant.
Format parameters:
var_info Fields to be outputted, can be one or both of region_type and
region_name. (default: region_type)

A few additional fields regarding variant functionality are added to the variant table. They will be useful for selecting, filtering and grouping variants for association analysis.

3. Basic Quality Control & Data Preprocessing

Quality control is a crucial yet tedious step in association analysis. To facilitate this step, we provide a variety of summary statistics along with a number of sample/variant selection and filtering operations for quality control purposes. In this section we will demonstrate some basic quality control procedures. Please refer to our data exploration tutorial for advanced summary statistics and QC measures.

3.1 Variant & Genotype Level QC

The vtools select and vtools exclude commands implement variant level data selection and filtering. Variants can be subsetted on the basis of criteria defined by variant properties (variant information, annotations, summary statistics, etc.) displayed by vtools show fields. We can then either focus on subsets of variants of interest or remove non-informative subsets of variants.

Low quality variants filter

We have made the FILTER column from the original vcf file available as the annotation field g1000.filter. It can be used directly to remove low quality variants from the database. We select variants of good quality into a separate table, and use this table (instead of table variant) for the next steps of the analysis.

vtools select variant "g1000.filter='PASS'" -t variant_pass

An alternative approach is to apply vtools remove so that variants of low quality are removed from the variant table. Note that the remove command is irreversible.

The GT and DS genotype fields are genotypes and genotype dosages from MaCH/Thunder; both come from the original 1000 genomes project data. GQ and GD are the simulated genotype quality and genotype depth of coverage. Genotype level filtering can be performed by removing genotype calls below a given quality measure, which irreversibly removes genotypes of low coverage and quality:

vtools remove genotypes "GD<10 or GQ<20"

In many cases, however, we do not want to remove genotypes at an early stage of the analysis, since we may later decide to change the QC criteria. Instead, various statistics can always be calculated conditioning on genotypes defined by expressions such as GD>10 and GQ>20, passed via the --genotypes option.

In this way we obtain statistics over genotype calls excluding the low quality genotypes. This approach is more flexible and allows evaluation of the same statistic conditional on different criteria.
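The idea of conditional counting can be illustrated with a small sketch; the per-call tuples below are hypothetical, mirroring the GD>10 and GQ>20 condition:

```python
def conditional_counts(calls, min_gd=10, min_gq=20):
    """Count genotypes and alternative alleles at a variant site,
    skipping calls that fail the condition GD > min_gd and GQ > min_gq."""
    total = alt = 0
    for gt, gd, gq in calls:   # gt: number of alternative alleles (0/1/2)
        if gd > min_gd and gq > min_gq:
            total += 1
            alt += gt
    return total, alt

# Hypothetical genotype calls as (gt, GD, GQ) tuples.
calls = [(0, 30, 90), (1, 5, 90), (2, 30, 10), (1, 30, 90)]
```

Here two of the four calls fail the condition, so the conditional total and alternative allele counts are smaller than the unconditional ones.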

QC statistics conditioned on genotype information

We demonstrate here the calculation of summary statistics conditioning on genotype information. We set up a genotype condition expression from the genotype information fields GD and GQ. The calculation of various statistics will skip genotypes failing to satisfy this expression, as can be seen by comparing the total genotype counts and minor allele counts computed with and without the condition.

Looking at the first variant "1:69536", the original total is 1092 genotype calls; after conditioning on GD>10 and GQ>20, 4 genotype calls are skipped, resulting in a count of 1088. If all genotype calls at a variant site fail the genotype quality constraint, the total count will be zero; such variants can be identified by selecting on the conditional count fields.

The proportion of genotypes having low quality scores can also be calculated per variant. Such variant level statistics can be useful for creating customized variant level quality control criteria.

The statistics calculated above are for demonstration only and will not be used in subsequent analysis steps. We will not pursue further QC related summary statistic calculations for now, since groups of samples have to be defined before statistics for subsets of samples can be calculated (e.g., population specific minor allele frequency). In the next section we demonstrate manipulation of sample information.

3.2 Processing Phenotype / Sample Level Information

We discuss in this section the organization and use of sample phenotype information. The snapshot provides a phenotype file from which phenotype data can be loaded using the vtools phenotype command.

Sample information thus created can be useful for individual genotype level quality control. For example, we may want to remove samples having too many low quality or missing genotype calls from the analysis (vtools remove samples). Advanced uses of the phenotype --from_stat command include calculating sample level transition-transversion ratios or synonymous-nonsynonymous SNV ratios; see the data exploration tutorial.
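As a sketch of one such sample level statistic, the transition-transversion ratio over a sample's SNVs can be computed as follows (the (ref, alt) pairs here are illustrative):

```python
PURINES = {'A', 'G'}
PYRIMIDINES = {'C', 'T'}

def titv_ratio(snvs):
    """Transition/transversion ratio for a list of (ref, alt) SNV pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
            ti += 1   # A<->G or C<->T: transition
        else:
            tv += 1   # purine <-> pyrimidine: transversion
    return ti / tv if tv else float('inf')
```

A sample whose Ti/Tv ratio deviates strongly from the expected range for exome data can be flagged for closer inspection.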

There are 268 samples having a yes tag for being unrelated Europeans. Within these European samples there are population substructures, which we want to control for by incorporating principal components (PCs) from MDS analysis as covariates. We calculated the PCs for the 268 samples using the KING program and imported the output into the database as additional phenotype fields.

Determination of population structure and selection of unrelated samples are very important in exome association analysis. Even though the 1000 genomes data comes with self-reported kinship and population information, it is still necessary to conduct population structure and kinship analysis. One can use the same techniques as are applied in GWAS studies.

3.3 Variant and Sample Selection for Association Analysis

Alternative allele frequency calculations

We calculate AF within the 268 European samples by first calculating the total number of genotypes and the number of alternative alleles, then computing AF from them.

As previously discussed, genotype conditions (--genotypes) can be applied with specified criteria.

Several other fields num_ie, het_ie, hom_ie, other_ie are calculated and will be used in HWE analysis.

Allele frequency thus calculated, af=num/(total * 2.0), will be wrong for variants on the X chromosome if males are present in the sample. Correct allele frequencies can be calculated using a more involved vtools update command, which we will not demonstrate here.
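For illustration, a sketch of the corrected X-chromosome computation; the tuple layout and sex coding (1 = male, 2 = female) are assumptions, not the vtools implementation:

```python
def allele_frequency_x(calls):
    """Alternative allele frequency on the X chromosome.
    calls: list of (alt_allele_count, sex); males (sex=1) carry one
    X chromosome, females (sex=2) carry two."""
    alt = chroms = 0
    for gt, sex in calls:
        if sex == 1:
            alt += min(gt, 1)   # hemizygous male: at most one alt allele
            chroms += 1
        else:
            alt += gt
            chroms += 2
    return alt / chroms if chroms else 0.0
```

For example, with calls [(1, 1), (2, 2), (0, 2)] the corrected estimate is 3/5, while the naive num/(total * 2.0) would give 3/6.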

A new allele frequency field is added to the database. We can look at variants having a non-trivial af_ie field and compare it with the allele frequency estimate based on the entire dataset.

An hwe field is created with p-values from the HWE test. We want to filter out variant sites with HWE p-value < 5e-8; there are 1132 such variants. It turns out most of them have MAF greater than 0.01.

vtools select variant 'hwe<5e-8' -c
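For reference, the hwe p-value can be sketched with a one-degree-of-freedom chi-square test on observed genotype counts (the actual vtools implementation may use an exact test; this sketch assumes both alleles are observed):

```python
import math

def hwe_chisq_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square test (1 df) for Hardy-Weinberg equilibrium
    from observed genotype counts at a biallelic site."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)   # reference allele frequency
    q = 1 - p
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(x2 / 2))
```

Counts exactly at Hardy-Weinberg proportions give a p-value of 1, while a site with no heterozygotes at an intermediate allele frequency falls far below the 5e-8 cutoff used above.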

Creating common and rare variant subsets

Unlike traditional GWAS analysis, exome association analysis studies both common variants (variants having relatively high minor allele frequency) and rare variants (variants having relatively low minor allele frequency). Statistical methods for analyzing common and rare variants differ, so it is necessary to subset variants based on minor allele frequency. We define common variants as having MAF greater than 1%, and rare variants as having MAF smaller than 1%.

The 1% cutoff for rare/common variants is arbitrary. For this small demonstration dataset (60k variants, 268 individuals), the 1% cutoff gives 58,062 rare variants and 2,825 common variants. Alternatively, for single variant association tests (common variant analysis) one could analyze all variants except singletons and doubletons regardless of their MAF classification; for aggregated variant analysis (rare variant analysis) one may want to use other MAF cutoffs depending on the available sample size. In particular, for variable thresholds or RareCover methods, the cutoff should be higher (say, 5%).
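The subsetting logic described above can be sketched as follows (the classification labels are illustrative, not vtools table names):

```python
def classify_by_maf(alt_count, n_samples, cutoff=0.01):
    """Classify a variant as 'rare' or 'common' by minor allele
    frequency; singletons and doubletons are flagged separately since
    single-variant tests are usually not applied to them."""
    af = alt_count / (2.0 * n_samples)
    maf = min(af, 1 - af)
    if alt_count in (1, 2):
        return 'singleton/doubleton'
    return 'rare' if maf < cutoff else 'common'
```

With 268 samples, a variant seen on 5 of the 536 chromosomes (MAF ~0.9%) falls below the 1% cutoff and is classified as rare.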

Selecting functional rare variants

Rare variant analysis typically focuses only on variants that affect protein function. We have already annotated the variants with ANNOVAR; the annotation fields can be used to select functional rare variants for association studies.

We end up with 1,711 common variants and 23,757 rare variants for association analysis.

3.4 Additional QC for Missing Data

Although there are no missing genotype calls in our test data (missing genotype calls have already been imputed), in practice this is not always the case. In addition to missing calls, low quality genotypes are effectively treated as missing. As discussed above, we can use --genotypes conditions to skip low quality genotypes. The overall missingness of the data can be evaluated at both the variant level and the individual level.
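A sketch of the missingness computation, treating quality-failing calls as missing (the tuple layout and thresholds are hypothetical; the same function applies per variant or, transposed, per sample):

```python
def missing_ratio(genotypes, min_gd=10, min_gq=20):
    """Fraction of genotype calls that are missing or fail the quality
    condition (and are therefore treated as missing in effect).
    genotypes: list of (gt, gd, gq); gt is None for a missing call."""
    n_missing = sum(1 for gt, gd, gq in genotypes
                    if gt is None or gd < min_gd or gq < min_gq)
    return n_missing / len(genotypes)
```

The resulting ratio is the kind of value stored in a missing_ratio field for variant/sample level filtering.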

After the variant/sample information field missing_ratio is calculated, overall variant/sample level filtering can be performed. However, we do not recommend overall filtering for rare variant association analysis. In practice the distribution of missing genotype calls is not uniform: some samples may lack data on a particular genetic region that other samples have (due to different exome capture arrays). To fully exploit the data we will filter variants and samples for missing data specific to each small genomic unit while carrying out association tests, as introduced in the next section of this tutorial.