The VCF format allows for multiple alternative alleles in a single variant record. The alternative alleles are specified as a comma-separated list of their bases, so one can easily estimate the distribution of the number of alternative alleles per record from the command line using the following one-line script:

Here we use bcftools query from the bcftools package for rapid extraction of alternative alleles from a VCF file. We need to know only the number of commas in each line, so we remove all other symbols using tr. Finally, we count how many lines contain each number of commas.
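The same counting idea can be sketched in plain Python without any external tools. This is only an illustration, not the one-line script itself: the function name alt_allele_distribution is ours, the input is assumed to be an uncompressed, tab-delimited VCF given as an iterable of lines, and a lone "." in the ALT column is treated as zero alternative alleles.

```python
from collections import Counter
from typing import Iterable


def alt_allele_distribution(vcf_lines: Iterable[str]) -> Counter:
    """Count VCF records by their number of alternative alleles."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        alt = line.rstrip("\n").split("\t")[4]  # ALT is the fifth column
        # Assumption: a lone "." means no alternative allele is recorded.
        n_alt = 0 if alt == "." else alt.count(",") + 1
        counts[n_alt] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up records, used only to show the call.
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "22\t100\t.\tA\tG\t50\tPASS\t.",
        "22\t200\t.\tC\tT,G\t60\tPASS\t.",
        "22\t300\t.\tG\tA\t40\tPASS\t.",
    ]
    for n_alt, n_records in sorted(alt_allele_distribution(example).items()):
        print(n_alt, n_records)
```

For a real file, the lines could come from an opened file object; a compressed VCF would first need to be opened with gzip.open in text mode.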

Example: 1000 Genomes variants on chromosome 22

Let us demonstrate the script using the VCF file of 1000 Genomes variants on chromosome 22. The file contains 1,103,547 variants, including 1,060,388 SNPs and 43,230 indels.

According to the output, most of the variants in the file are biallelic (i.e., having a reference allele and a single alternative allele) and less than 1% of them are multiallelic. Most of the multiallelic variants are triallelic (i.e., having a reference allele and two alternative alleles) and only 275 multiallelic variants have more than two alternative alleles.

The VCF format specifies quality scores (QUAL) for each variable position (variant) in a genome. The QUAL value is the Phred quality score for the assertion that alternative bases of a variant are correct, that is, $\mathrm{QUAL} = -10 \log_{10} p$, where $p$ is the probability that the alternative base calls are wrong. Using the QUAL scores, one may easily calculate the probability that all variant calls in a VCF file are correct.

Here we give an equation for that probability, a Python script that implements it and an example of its usage.
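As a minimal sketch of the calculation, assume that the calls are independent (our assumption). A variant with score $Q_i$ is then correct with probability $1 - 10^{-Q_i/10}$, and the probability that all calls are correct is the product of these terms. The function name prob_all_calls_correct is hypothetical, and records with a missing QUAL (".") are assumed to have been filtered out by the caller.

```python
import math
from typing import Iterable


def prob_all_calls_correct(quals: Iterable[float]) -> float:
    """Probability that every call is correct, assuming independent calls."""
    log_prob = 0.0
    for q in quals:
        p_wrong = 10.0 ** (-q / 10.0)  # invert the Phred scale
        if p_wrong >= 1.0:
            return 0.0  # a call with QUAL <= 0 is certainly wrong
        # Sum logarithms instead of multiplying to avoid underflow
        # when the file contains very many variants.
        log_prob += math.log1p(-p_wrong)
    return math.exp(log_prob)


if __name__ == "__main__":
    # Made-up scores, used only to show the call.
    print(prob_all_calls_correct([10, 20, 30]))
```

Working in log space matters in practice: a file with millions of variants multiplies millions of factors slightly below one, and the running sum of logarithms stays well conditioned where a running product would lose precision.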

The AGP format is used to describe the assembly structure in the NCBI Genome database. Since AGP is a plain-text tabular data format that specifies positions of smaller sequence objects on larger ones (e.g., contigs on scaffolds), AGP files can be converted to the BED format for further processing.
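One way such a conversion might look is sketched below. The choices here are our assumptions: gap lines (component types N and U) are skipped, the component identifier becomes the BED name, the score is set to 0, and any orientation other than "+" or "-" is written as ".". The 1-based, inclusive AGP start is shifted to the 0-based, half-open BED convention. The function name agp_to_bed is hypothetical.

```python
from typing import Iterable, Iterator

GAP_TYPES = {"N", "U"}  # AGP component types that denote gaps


def agp_to_bed(agp_lines: Iterable[str]) -> Iterator[str]:
    """Yield one BED6 line per non-gap component line of an AGP file."""
    for line in agp_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        obj, obj_beg, obj_end, _part, comp_type = fields[:5]
        if comp_type in GAP_TYPES:
            continue  # gap lines carry no component identifier
        comp_id, orientation = fields[5], fields[8]
        strand = orientation if orientation in {"+", "-"} else "."
        start = int(obj_beg) - 1  # 1-based inclusive -> 0-based half-open
        yield "\t".join([obj, str(start), obj_end, comp_id, "0", strand])


if __name__ == "__main__":
    # Made-up AGP lines, used only to show the call.
    example = [
        "scaf1\t1\t100\t1\tW\tctg1\t1\t100\t+",
        "scaf1\t101\t150\t2\tN\t50\tscaffold\tyes\tpaired-ends",
        "scaf1\t151\t300\t3\tW\tctg2\t1\t150\t-",
    ]
    for bed_line in agp_to_bed(example):
        print(bed_line)
```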

The bcftools and vcftools packages provide routines for merging or concatenating multiple VCF files. However, passing a large number of input VCF files at once may cause these tools to fail, because the operating system limits how many files a process can keep open. This problem can be overcome by combining the files iteratively: first, pairs of the original VCF files are processed, then pairs of the resulting files are processed, and so on until a single VCF file remains.

Here we describe an iterative scheme for merging or concatenating VCF files using bcftools and GNU parallel and present a Python script that implements it.
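The pairwise scheme can be sketched as a planner that only builds the command lines, round by round, without running anything. This is a sketch under our own assumptions, not the script mentioned above: the function name plan_pairwise_merge and the intermediate file naming are hypothetical, the inputs are assumed to be bgzip-compressed, and each intermediate file would still have to be indexed (for example with bcftools index) before the next round of bcftools merge.

```python
from typing import List, Tuple

Command = List[str]


def plan_pairwise_merge(
    inputs: List[str], prefix: str = "tmp"
) -> Tuple[List[List[Command]], str]:
    """Return rounds of pairwise merge commands and the final file name."""
    if not inputs:
        raise ValueError("at least one input file is required")
    rounds: List[List[Command]] = []
    current = list(inputs)
    round_no = 0
    while len(current) > 1:
        round_no += 1
        commands: List[Command] = []
        next_files: List[str] = []
        for i in range(0, len(current) - 1, 2):
            out = f"{prefix}.r{round_no}.{i // 2}.vcf.gz"  # hypothetical naming
            commands.append(
                ["bcftools", "merge", "-O", "z", "-o", out,
                 current[i], current[i + 1]]
            )
            next_files.append(out)
        if len(current) % 2 == 1:
            next_files.append(current[-1])  # the unpaired file is carried over
        rounds.append(commands)
        current = next_files
    return rounds, current[0]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the call.
    plan, final = plan_pairwise_merge([f"s{i}.vcf.gz" for i in range(5)])
    for number, commands in enumerate(plan, start=1):
        for command in commands:
            print(f"round {number}:", " ".join(command))
    print("final file:", final)
```

Each command opens only two input files, so the open-file limit is never approached, and the number of rounds grows logarithmically with the number of inputs. Commands within one round are independent of each other, which is what makes them suitable for running in parallel.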

Despite its name, Swiss PDB Viewer implements a number of features besides visualization of protein molecules. One such feature is side-chain reconstruction for protein structures that contain only backbone atoms. However, Swiss PDB Viewer does not write model records to its output PDB files, which may cause problems with other PDB-processing programs.

In this post, we present a Python script that adds proper model records to a PDB file produced by Swiss PDB Viewer.
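A minimal sketch of the idea follows, assuming that "model records" refers to the MODEL and ENDMDL records of the PDB format and that the file holds a single model. The function name add_model_records is hypothetical. The coordinate section (ATOM, HETATM, ANISOU and TER records) is wrapped in one MODEL/ENDMDL pair, and files that already contain a MODEL record are returned unchanged.

```python
from typing import Iterable, List

COORD_RECORDS = ("ATOM", "HETATM", "ANISOU", "TER")


def add_model_records(pdb_lines: Iterable[str], serial: int = 1) -> List[str]:
    """Wrap the coordinate section of a single-model PDB file in MODEL/ENDMDL."""
    lines = [line.rstrip("\n") for line in pdb_lines]
    coord_idx = [
        i for i, line in enumerate(lines) if line.startswith(COORD_RECORDS)
    ]
    if not coord_idx or any(line.startswith("MODEL") for line in lines):
        return lines  # nothing to wrap, or MODEL records already present
    first, last = coord_idx[0], coord_idx[-1]
    # The model serial number is right-justified in columns 11-14.
    model = f"MODEL     {serial:>4d}"
    return (
        lines[:first] + [model] + lines[first:last + 1]
        + ["ENDMDL"] + lines[last + 1:]
    )


if __name__ == "__main__":
    # Truncated made-up records, used only to show the call.
    example = ["REMARK example", "ATOM      1  N   ALA A   1", "TER", "END"]
    print("\n".join(add_model_records(example)))
```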

Isaac Variant Caller implements a fast variant-calling algorithm and can be considered an alternative to the GATK or samtools variant callers. Unfortunately, it seems to have no manual that describes its command-line options.

Here we give the list of the Isaac Variant Caller command-line options obtained from its source code, which is publicly available on GitHub.