Appendix 1. Fabric Genomics Variant Data Specification

Introduction

This appendix defines the structure and content of data that Fabric Genomics requires in a genome variation data file in order for it to be successfully processed by the Opal annotation pipeline.

To reduce the risk of error during parsing and processing of the input variation data, formatting of the data should comply with the examples shown below. All indicated data fields should be provided. Compliance with this specification will ensure more complete, accurate results, and shorter turnaround time.

General Requirements

General data requirements are described here. Different data formats have different mechanisms for providing this information.

Variant Type

Each data row of the input file must specify a single variation (SNP or indel), indicating the reference allele (or sequence) and the variant allele(s) (or sequence). Indel variations are insertions or deletions of at most 50 nucleotides. Structural variants and copy number variants are now supported as well.

Multiple alleles may be present, for example in the case of triallelic SNPs or when the reference allele is not the most common variant.

A given data row should not contain multiple variants. For example, here is an attempt to specify two SNPs and an indel within a single sequence string, supplied on a single row in an input data file:

            Location    Reference Sequence    Variant Sequence
Variant X   1           ccgTatcGaaCCCatta     ccgGatcTaaCCCCatta

Instead, the different variants within that string should be provided separately, on multiple rows of the input data file:

             Location    Reference Sequence    Variant Sequence
Variant X1   4           T                     G
Variant X2   8           G                     T
Variant X3   13          C                     CC

Variant Genome Location and Zygosity

The genomic position of each variant must indicate the chromosome and the starting and ending sequence coordinates where the variant is located with respect to the indicated reference genome build. The submitted data should use the conventional type of sequence coordinates (1-based or zero-based) for the specific data format being used.

The zygosity of each variant call must be indicated (homozygous, heterozygous, hemizygous).

Variant Quality

Each variant in the input data file should include a Phred-scaled (or ‘Phred-like’) quality score (integer or float) representing the confidence of the variant call. This is expressed as -10 times the base-10 logarithm of the probability that the variant call is wrong (i.e. that the position is actually the homozygous reference sequence):

VCF Quality: QUAL field (field 6)
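As an illustration of Phred scaling, the following minimal sketch converts between an error probability and a quality score, assuming the standard convention Q = -10 * log10(P). The function names are hypothetical and are not part of Opal.

```python
import math


def phred_from_error_probability(p_error: float) -> float:
    """Convert the probability that a call is wrong into a Phred-scaled score."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability_from_phred(qual: float) -> float:
    """Convert a Phred-scaled score back into an error probability."""
    return 10.0 ** (-qual / 10.0)
```

Under this convention, a QUAL of 30 corresponds to a 1 in 1000 chance that the call is wrong.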

Sequence Coverage

Coverage information consists of the number of filtered reads used for calling and should be provided for both the reference and variant alleles. Opal requires a depth representation that breaks out the reference reads and alt reads; we do not use the DP single value read depth field.

Coverage information can be provided in the following ways:

VCF Method 1 - GENOTYPE field - AD (GATK)

Field 9: Uses an ‘AD’ (allelic depth) GENOTYPE sub-field

Field 10: general data format for the AD sub-field is ‘<reference reads>, <variant reads>’

Example: ‘GT:AD’ ‘0/1:14,3’
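A minimal sketch of how the GATK-style AD sub-field could be read from fields 9 and 10 is shown here. The function name is hypothetical, and the sketch assumes that every AD value is an integer (missing values such as ‘.’ are not handled).

```python
from typing import List, Tuple


def parse_ad_gatk(format_field: str, sample_field: str) -> Tuple[int, List[int]]:
    """Return (reference reads, [variant reads, ...]) from a GATK-style AD value."""
    # Pair each FORMAT key (field 9) with its sample value (field 10).
    sample = dict(zip(format_field.split(":"), sample_field.split(":")))
    depths = [int(x) for x in sample["AD"].split(",")]
    # The first entry is the reference depth; the rest are variant depths.
    return depths[0], depths[1:]
```

For the example above, `parse_ad_gatk("GT:AD", "0/1:14,3")` gives 14 reference reads and a single variant depth of 3.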

VCF Method 2 - GENOTYPE field - RS

Field 9: Uses an ‘RS’ (RTG format) GENOTYPE sub-field.

Field 10: The general format for the data is ‘<allele1>,<read depth1>,<quality1>,<allele2>,<read depth2>,<quality2>[,…]’. The allele entries are in no particular order; each entry's allele is matched against the reference and variant alleles to find the read depth for that allele. The quality value is not used.

Example: ‘GT:RS’ ‘0|1:G,51,0.166,T,46,0.107’
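The allele lookup described for the RS sub-field can be sketched as follows. The function names are hypothetical, and returning 0 for an allele that has no entry is an assumption of this sketch rather than documented Opal behaviour.

```python
from typing import Dict, Tuple


def parse_rs(rs_value: str) -> Dict[str, int]:
    """Map each allele in an RS value to its read depth, ignoring the quality."""
    tokens = [t.strip() for t in rs_value.split(",")]
    if len(tokens) % 3 != 0:
        raise ValueError("RS value must hold allele,depth,quality triples")
    # Step through the triples; tokens[i + 2] is the unused quality value.
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 3)}


def rs_depths(rs_value: str, ref: str, alt: str) -> Tuple[int, int]:
    """Return (reference reads, variant reads) by matching alleles, not positions."""
    depths = parse_rs(rs_value)
    # Assumption: an allele with no RS entry is treated as having 0 reads.
    return depths.get(ref, 0), depths.get(alt, 0)
```

With a G reference and a T variant, `rs_depths("G,51,0.166,T,46,0.107", "G", "T")` yields 51 reference reads and 46 variant reads, whichever order the triples appear in.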

VCF Method 3 - GENOTYPE field - AU

Field 9: Uses an ‘AU’ GENOTYPE sub-field.

Field 10: The general format is a comma-separated ordered list of reads in the following order: ‘<A reads>,<C reads>,<G reads>,<T reads>’. The reference and variant alleles are used as lookup keys into this ordered list to find the associated number of reads for each allele.

Example: ‘GT:AU’ ‘0/1:40,0,0,30’
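A minimal sketch of the AU lookup, assuming single-nucleotide reference and variant alleles, might look like this. The names `AU_ORDER` and `au_depths` are hypothetical.

```python
from typing import Tuple

# Fixed base order of the four AU read counts, as stated in the format above.
AU_ORDER = ("A", "C", "G", "T")


def au_depths(au_value: str, ref: str, alt: str) -> Tuple[int, int]:
    """Return (reference reads, variant reads) from an AU value such as '40,0,0,30'."""
    counts = [int(x) for x in au_value.split(",")]
    if len(counts) != len(AU_ORDER):
        raise ValueError("AU value must contain exactly four read counts")
    by_base = dict(zip(AU_ORDER, counts))
    # The alleles act as lookup keys into the ordered list.
    return by_base[ref.upper()], by_base[alt.upper()]
```

For an A reference and a T variant, the example value ‘40,0,0,30’ resolves to 40 reference reads and 30 variant reads.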

VCF Method 4 - INFO field - DP4

Field 8: Uses a ‘DP4’ (4-way read depth) INFO sub-field.

The general format for the data is ‘<forward reference reads>,<reverse reference reads>,<forward non-reference reads>,<reverse non-reference reads>’. The forward and reverse reads are added together to get the total reads for the reference and non-reference alleles.

Example: ‘DP4=2,2,3,2’
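The summation of forward and reverse reads can be sketched as below. The function name is hypothetical, and the sketch assumes a semicolon-separated INFO field (field 8).

```python
from typing import Tuple


def dp4_depths(info_field: str) -> Tuple[int, int]:
    """Return (reference reads, non-reference reads) from the DP4 INFO sub-field."""
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP4":
            fwd_ref, rev_ref, fwd_alt, rev_alt = (int(x) for x in value.split(","))
            # Forward and reverse counts are summed per allele class.
            return fwd_ref + rev_ref, fwd_alt + rev_alt
    raise KeyError("DP4 not found in INFO field")
```

For ‘DP4=2,2,3,2’ this gives 2 + 2 = 4 reference reads and 3 + 2 = 5 non-reference reads.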

VCF Method 5 - INFO field - DP_<actg>

Field 8: Uses the ‘DP_<actg>’ (Spiral Genetics format) INFO sub-fields.

The general format for the data is ‘DP_A=<reads>; DP_C=<reads>; DP_T=<reads>; DP_G=<reads>’. The reference and variant alleles are used as lookup keys into this list to find the associated reads for each allele.

Example: ‘DP_A=0;DP_C=20;DP_T=40;DP_G=0’
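A minimal sketch of the per-base lookup, again assuming single-nucleotide alleles and a semicolon-separated INFO field, is shown here. The function name is hypothetical.

```python
from typing import Dict, Tuple


def dp_base_depths(info_field: str, ref: str, alt: str) -> Tuple[int, int]:
    """Return (reference reads, variant reads) from DP_A/DP_C/DP_G/DP_T entries."""
    by_base: Dict[str, int] = {}
    for entry in info_field.split(";"):
        key, _, value = entry.strip().partition("=")
        # Keep only keys of the form DP_<single base>, e.g. DP_A.
        if key.startswith("DP_") and len(key) == 4:
            by_base[key[3].upper()] = int(value)
    return by_base[ref.upper()], by_base[alt.upper()]
```

For a C reference and a T variant, the example above resolves to 20 reference reads and 40 variant reads.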

VCF Method 6 - GENOTYPE field - AD (Complete Genomics)

Field 9: Uses an ‘AD’ (allelic depth) GENOTYPE sub-field

Field 10: general data format for the AD sub-field is a list of ‘first genotype allele reads, second genotype allele reads’

Example: ‘GT:AD’ ‘1/2:14,3’

Both Complete Genomics and GATK support an AD field, but the formats are different. If either of the following header lines is found in the VCF, the AD field is interpreted as a Complete Genomics formatted field. If neither is found, the field is assumed to be in the GATK format (see VCF Method 1 above).

##center=Complete Genomics

##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allelic depths (number of reads in each observed allele)">
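The header check could be sketched as follows. The function and constant names are hypothetical, and exact string matching after trimming whitespace is an assumption of this sketch; the actual matching rules used by Opal are not specified here.

```python
from typing import Iterable

CG_CENTER_LINE = "##center=Complete Genomics"
CG_AD_FORMAT_LINE = (
    '##FORMAT=<ID=AD,Number=2,Type=Integer,'
    'Description="Allelic depths (number of reads in each observed allele)">'
)


def ad_flavor(header_lines: Iterable[str]) -> str:
    """Return 'complete_genomics' if either marker line is present, else 'gatk'."""
    for line in header_lines:
        # Assumption: a header line counts only if it matches a marker exactly.
        if line.strip() in (CG_CENTER_LINE, CG_AD_FORMAT_LINE):
            return "complete_genomics"
    return "gatk"
```

A caller would pass the ‘##’ meta-information lines of the VCF and then choose the matching AD interpretation.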

VCF Method 7 - GENOTYPE field - NR/NV

Field 9: Uses the ‘NR’ sub-field: number of reference reads

Field 9: Uses the ‘NV’ sub-field: number of variant reads

Field 10: general data format for the NR field is an integer for reference read depth

Field 10: general data format for the NV field is an integer for variant read depth

Example: 'GT:NR:NV' '0/1:14:3'

VCF Method 8 - GENOTYPE field - RR/VR/DP

Field 9: Uses the ‘RR’ sub-field: number of reference reads

Field 9: Uses the ‘VR’ sub-field: number of variant reads

Field 9: Uses the ‘DP’ sub-field: total read depth

Field 10: general data format for the RR sub-field is an integer for reference read depth

Field 10: general data format for the VR sub-field is an integer for variant read depth

Field 10: general data format for the DP sub-field is an integer for total read depth

Example: 'GT:RR:VR:DP' '0/1:14:3:17'
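Integer sub-fields such as RR, VR and DP (or NR and NV in Method 7) can be pulled out of fields 9 and 10 with one small helper. This is a minimal sketch with a hypothetical function name; it assumes every requested sub-field is present and holds a single integer.

```python
from typing import Dict, Sequence


def sample_int_fields(
    format_field: str, sample_field: str, names: Sequence[str]
) -> Dict[str, int]:
    """Return the requested integer sub-fields from a FORMAT/sample column pair."""
    # Field 9 lists the sub-field names; field 10 lists the values in the same order.
    sample = dict(zip(format_field.split(":"), sample_field.split(":")))
    return {name: int(sample[name]) for name in names}
```

For the example above, requesting RR, VR and DP returns 14, 3 and 17 respectively.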

VCF Method 9 - INFO field - FRO/FAO

Field 8: Uses the ‘FRO’ sub-field: number of reference reads.

Field 8: Uses the ‘FAO’ sub-field: number of alternate reads. This becomes a comma-separated list of reads if more than one alternate allele is provided for the variant.

Example: FRO=14;FAO=3

If the genotype contains more than one alternate allele (for example, GT is 1/2), FAO is a 1-based list of read counts indexed by allele number. For instance, with FAO=10,12 and a SAMPLE GT sub-field of 1/2, 10 is the read count for alternate allele 1 and 12 is the read count for alternate allele 2.
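The 1-based indexing of FAO by allele number can be sketched as follows. The function name is hypothetical, and the sketch assumes a fully called genotype (no ‘.’ alleles) and a semicolon-separated INFO field.

```python
import re
from typing import Dict


def fro_fao_depths(info_field: str, gt: str) -> Dict[int, int]:
    """Map each allele number in GT to its read count using FRO and FAO."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        if sep:
            info[key] = value
    ref_reads = int(info["FRO"])
    alt_reads = [int(x) for x in info["FAO"].split(",")]
    depths = {}
    for allele in re.split(r"[/|]", gt):
        number = int(allele)
        # Allele 0 is the reference; alternate allele n is FAO entry n (1-based).
        depths[number] = ref_reads if number == 0 else alt_reads[number - 1]
    return depths
```

With ‘FRO=14;FAO=3’ and a GT of 0/1, allele 0 maps to 14 reads and allele 1 to 3 reads. With a GT of 1/2 and FAO=10,12, allele 1 maps to 10 reads and allele 2 to 12 reads.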

VCF Method 10 - GENOTYPE field - FRO/FAO

Field 9: Uses the ‘FRO’ sub-field: number of reference reads.

Field 9: Uses the ‘FAO’ sub-field: number of alternate reads. This becomes a comma-separated list of reads if more than one alternate allele is provided for the variant.

Field 10: general data format for the FRO field is an integer that is the number of reference reads

Field 10: general data format for the FAO field is an integer that is the number of alt reads. If more than one alt is given, this is a 1-based list indexed by the allele number.

Example: ‘GT:FRO:FAO’ ‘0/1:14:3’

If the genotype contains more than one alternate allele (for example, GT is 1/2), FAO is a 1-based list of read counts indexed by allele number. For instance, with an FAO value of 10,12 and a SAMPLE GT sub-field of 1/2, 10 is the read count for alternate allele 1 and 12 is the read count for alternate allele 2.