Meta-information lines

Each meta-information line must have the form ##KEY=VALUE
and cannot contain white-space. The first meta-information line must
specify the VCF version number (version 4.2 in the example).
Additional meta-information lines are optional, but are often
included to describe terms used in the FILTER, INFO, and FORMAT fields.
In the example, the additional meta-information lines say that
GT means genotype,
GP means the probability of each possible
genotype call, and GL means
the likelihood of each possible genotype call.
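
As a minimal sketch of the ##KEY=VALUE rule above, the following Python
function splits a meta-information line into its key and value. The function
name parse_meta_line and the sample line are illustrative assumptions, not
part of any particular tool.

```python
def parse_meta_line(line):
    """Split a '##KEY=VALUE' meta-information line into (key, value)."""
    if not line.startswith("##"):
        raise ValueError("meta-information lines must start with '##'")
    # Split at the first '=' only, so values may themselves contain '='.
    key, sep, value = line[2:].rstrip("\n").partition("=")
    if not sep or not key:
        raise ValueError("expected the form ##KEY=VALUE")
    return key, value


# Hypothetical first line of a VCF version 4.2 file.
print(parse_meta_line("##fileformat=VCFv4.2"))  # ('fileformat', 'VCFv4.2')
```

Splitting at the first '=' keeps structured values intact, so a line whose
value contains further '=' characters is still returned as a single value.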

Marker information

The first nine columns of the header line and data lines describe the variants:

CHROM

the chromosome.

POS

the genome coordinate of the first base in the variant.
Within a chromosome, VCF records
are sorted in order of increasing position.

ID

a semicolon-separated list of marker identifiers.

REF

the reference allele expressed as a sequence of one
or more A/C/G/T nucleotides (e.g. "A" or "AAC").

ALT

the alternate allele expressed as a sequence of one
or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If
there is more than one alternate allele, the field
should be a comma-separated list of alternate alleles.

QUAL

probability that the ALT allele is incorrectly specified,
expressed on the phred scale (-10 * log10(probability)).

FILTER

Either "PASS" or a semicolon-separated list of failed quality control filters.
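
The fields described above can be read from a data line with a few lines of
Python. This is a sketch under stated assumptions: columns are tab-delimited
and a missing QUAL is written as ".", as in the VCF specification; the
function names and the sample record are hypothetical.

```python
def qual_to_error_probability(qual):
    """Invert the phred scale: QUAL = -10 * log10(probability)."""
    return 10.0 ** (-qual / 10.0)


def parse_fixed_fields(line):
    """Return the CHROM through FILTER fields of one VCF data line."""
    cols = line.rstrip("\n").split("\t")  # assumes tab-delimited columns
    chrom, pos, ids, ref, alt, qual, flt = cols[:7]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": ids.split(";"),        # semicolon-separated identifiers
        "REF": ref,
        "ALT": alt.split(","),       # comma-separated alternate alleles
        "QUAL": None if qual == "." else float(qual),  # "." assumed missing
        "FILTER": flt.split(";"),    # "PASS" or failed filters
    }


# Hypothetical record, not taken from the example file.
record = parse_fixed_fields("20\t14370\trs6054257\tG\tA,T\t30\tPASS")
print(record["ALT"], qual_to_error_probability(record["QUAL"]))
```

With a QUAL of 30, the implied probability that the ALT allele is
incorrectly specified is about 0.001.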

Sample data

After the nine fixed columns, the remaining columns contain the
sample identifier and the colon-separated data subfields for each
individual. The data subfields in a record must match that
record's format subfields.

The most common format subfield is GT (genotype) data. If
the GT subfield is present, it must be the first subfield.
In the sample data, genotype alleles are numeric: the REF allele is 0,
the first ALT allele is 1, and so on. The allele separator is '/'
for unphased genotypes and '|' for phased genotypes. In the example,
all genotypes are unphased, and the genotypes for SAMP001 are
homozygous reference, heterozygous, and missing in the first, second,
and third records, respectively.
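
The GT encoding described above can be decoded as follows. This is only a
sketch: the names parse_gt and describe_genotype are hypothetical, and it
assumes that a missing allele is written as ".", as in the VCF specification.

```python
def parse_gt(gt):
    """Return (alleles, phased) for a GT value such as '0/1' or '1|0'."""
    phased = "|" in gt
    sep = "|" if phased else "/"
    # Allele indices: 0 is REF, 1 is the first ALT allele, and so on.
    # "." is assumed to mark a missing allele and is returned as None.
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased


def describe_genotype(alleles):
    """Classify a list of allele indices."""
    if any(a is None for a in alleles):
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == 0 else "homozygous alternate"


for gt in ("0/0", "0/1", "./.", "1|0"):
    alleles, phased = parse_gt(gt)
    print(gt, describe_genotype(alleles), "phased" if phased else "unphased")
```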

The second record contains a GP (genotype probability) format
subfield, and the third record contains a PL
(phred-scaled genotype likelihood) format subfield. GP and GL data
subfields are three comma-separated values corresponding to the
REF/REF, REF/ALT, and ALT/ALT genotypes in that order. To convert
a phred-scaled likelihood P to a raw likelihood L, use the formula
L = 10^(-P/10).

In the second record of the Example, the GP data subfield is missing for
SAMP001 and the GP subfield for SAMP002 has probabilities of
0.03, 0.97, and 0 for the REF/REF, REF/ALT, and ALT/ALT genotypes.

In the third record of the Example, the GL data subfield is missing for
SAMP001. The GL subfield for SAMP002 has phred-scaled likelihoods
of 10, 5, and 0 and raw likelihoods of 0.1, 0.316, and 1 for
the REF/REF, REF/ALT, and ALT/ALT genotypes. It is not necessary for
the genotype likelihoods to sum to 1.0.
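
The conversion from phred-scaled to raw likelihoods can be checked with a
short Python sketch. The function names are hypothetical; the input string
reproduces the three phred-scaled values quoted above for SAMP002.

```python
def phred_to_likelihood(p):
    """Convert a phred-scaled likelihood P to a raw likelihood 10^(-P/10)."""
    return 10.0 ** (-p / 10.0)


def parse_phred_subfield(value):
    """Convert a comma-separated phred-scaled subfield to raw likelihoods."""
    return [phred_to_likelihood(float(x)) for x in value.split(",")]


# REF/REF, REF/ALT, and ALT/ALT values for SAMP002 in the third record.
likelihoods = parse_phred_subfield("10,5,0")
print([round(x, 3) for x in likelihoods])  # [0.1, 0.316, 1.0]
```

The three raw likelihoods sum to about 1.416, which illustrates that they
are not normalized to sum to 1.0.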