Adventures in Bioinformaticshttps://gtamazian.com
Notes on bioinformatics, programming, mathematics and scienceTue, 20 Mar 2018 01:50:57 +0000enhourly1http://wordpress.com/https://s2.wp.com/i/buttonw-com.pngAdventures in Bioinformaticshttps://gtamazian.com
One-liner to get distribution of the alternative allele numbers in a VCF filehttps://gtamazian.com/2016/10/07/one-liner-to-get-distribution-of-the-alternative-allele-numbers-in-a-vcf-file/
https://gtamazian.com/2016/10/07/one-liner-to-get-distribution-of-the-alternative-allele-numbers-in-a-vcf-file/#respondFri, 07 Oct 2016 20:40:55 +0000http://gtamazian.com/?p=2653The VCF format allows for multiple alternative alleles in a single variant record. The alternative alleles are specified as a comma-separated list of their bases, so one may easily estimate the distribution of the alternative allele numbers in a command line using the following one-line script:

Here we use bcftools query from the bcftools package for rapid extraction of alternative alleles from a VCF file. We need to know only the number of commas in each line, so we remove all other symbols using tr. Finally, we count lines containing the particular numbers of commas.

Example: 1000 Genomes variants on chromosome 22

Let us demonstrate the script using the VCF file of 1000 Genomes variants on chromosome 22. The file contains 1,103,547 variants, including 1,060,388 SNPs and 43,230 indels.

According to the output, most of the variants in the file are biallelic (i.e., having a reference allele and a single alternative allele) and less than 1% of them are multiallelic. Most of the multiallelic variants are triallelic (i.e., having a reference allele and two alternative alleles) and only 275 multiallelic variants have more than two alternative alleles.

]]>https://gtamazian.com/2016/10/07/one-liner-to-get-distribution-of-the-alternative-allele-numbers-in-a-vcf-file/feed/0gtamazianbcftools plugin to convert a VCF file to the BED formathttps://gtamazian.com/2016/09/07/bcftools-plugin-to-convert-a-vcf-file-to-the-bed-format/
https://gtamazian.com/2016/09/07/bcftools-plugin-to-convert-a-vcf-file-to-the-bed-format/#respondWed, 07 Sep 2016 17:11:03 +0000http://gtamazian.com/?p=2583bcftools provides a convenient way to extend its functionality using plugins. Technically, the bcftools plugins are dynamic libraries that are executed when a user launches the bcftools plugin tool.

Here I present a simple plugin named vcf2bed that converts a VCF file to the BED format.

The plugin is launched in the following way:

bcftools plugin vcf2bed input.vcf > output.bed

or

bcftools +vcf2bed input.vcf > output.bed

The output BED file contains four columns:

the chromosome name;

the starting position of a variant that is equal to its POS field value minus 1 because coordinates in BED files are zero-based;

the ending position of a variant that is equal to its starting position plus its REF field length;

the variant ID.

The plugin source code is given below.

To compile the plugin, one should put it to the plugins directory of the bcftools source tree and run make. The obtained vcf2bed.so shared object file should be moved to the directory containing other bcftools plugins.

]]>https://gtamazian.com/2016/09/07/bcftools-plugin-to-convert-a-vcf-file-to-the-bed-format/feed/0gtamazianObtaining probability of all variant calls being correcthttps://gtamazian.com/2016/07/25/obtaining-probability-of-all-variant-calls-being-correct/
https://gtamazian.com/2016/07/25/obtaining-probability-of-all-variant-calls-being-correct/#respondMon, 25 Jul 2016 20:33:16 +0000http://gtamazian.com/?p=2421The VCF format specifies quality scores (QUAL) for each variable position (variant) in a genome. The QUAL value is the Phred quality score for the assertion that alternative bases of a variant are correct, that is, , where is the probability that the alternative base calls are wrong. Using the QUAL scores, one may easily calculate the probability that all variant calls in a VCF file are correct.

Here we give an equation for that probability, a Python script that implements it and an example of its usage.

Equation

Let us have variants which QUAL scores are . Then the probability that alternative base calls of the variants are correct is given by the following formula: .

Python script

The probability value specified above can be calculated using the following Python script.

The script produces a summary on the specified VCF file. Also it provides two additional options:

-t or –table produces a table of three columns: the QUAL value threshold (increases from the least to the greatest QUAL value in the file), the percentage of the variants which QUAL values are greater than the threshold (tends to zero) and the probability that all that variants are correct (tends to one);

-p or –plot produces the plot showing the variant percentage and the probability of all of them being correct as a function of the QUAL threshold.

Example: DoGSD wolf sample

We used a sample from the Dog Genome SNP Database to demonstrate the Python script. The VCF file CHW.g.vcf.tar.gz of the wolf genomic variants was downloaded from the database; non-variable sites were removed from it using the grep command in the following way:

The output plot CHW_qual.png, that is shown below, indicates that the probability of all sites being correct is close to one starting from the QUAL threshold of 70. For more precise estimation, we should use the output table file.

The number of the wolf variant sites and the probability of them being correct as a function of the QUAL threshold.

The output table file CHW_qual.txt, which fragment is given below, shows that the QUAL threshold of 75 corresponds to the 99% probability of all such variant calls to be correct and the percentage of such variants is 90.1%.

]]>https://gtamazian.com/2016/07/25/obtaining-probability-of-all-variant-calls-being-correct/feed/0gtamazianwolf-variant-sites-number-probability-qualConverting an AGP file to the BED formathttps://gtamazian.com/2016/06/23/converting-an-agp-file-to-the-bed-format/
https://gtamazian.com/2016/06/23/converting-an-agp-file-to-the-bed-format/#respondThu, 23 Jun 2016 17:53:34 +0000http://gtamazian.com/?p=1607The AGP format is used to describe the assembly structure in the NCBI Genome database. Since AGP is a plain-text tabular data format that specifies positions of smaller sequence objects on larger ones (e.g., contigs on scaffolds), AGP files can be converted to the BED format for their further processing.

Two ways to convert AGP to BED

An AGP file specifies the structure of two sets of sequence objects. Thus, a single AGP file can be converted to a pair of BED files based either on large sequence objects (e.g., scaffolds or chromosomes) or smaller sequence objects (e.g., contigs).

Let us consider the following example: the SEQ1 sequence composed of three fragments: FRAG1, FRAG2 and FRAG3 with two gaps of 30 bp. The lengths of FRAG1, FRAG2 and FRAG3 are 300, 700 and 250 bp, respectively.

The example of a large sequence object composed from three smaller ones.

Note that the smaller sequence object-based BED file can be obtained from the BED file based on larger sequence objects by switching columns 1 and 4 (sequence names) and 2-3 and 7-8 (sequence coordinates).

Python script converting AGP files to the BED format

The Python script given below implements the conversion of AGP files to the BED format as described previously. The script provides the following options:

The bar plot below shows the distribution of the genes on the fragments.

Distribution of genes on the human assembly fragments.

]]>https://gtamazian.com/2016/06/23/converting-an-agp-file-to-the-bed-format/feed/0gtamazianlarge-sequence-object-exampledistribution-of-genes-on-human-assembly-fragmentsCombining a large number of VCF fileshttps://gtamazian.com/2016/06/16/combining-a-large-number-of-vcf-files/
https://gtamazian.com/2016/06/16/combining-a-large-number-of-vcf-files/#respondThu, 16 Jun 2016 18:55:11 +0000http://gtamazian.com/?p=2097The bcftools and vcftools packages provide routines for merging or concatenating multiple VCF files. However, specifying a large number of input VCF files may terminate their processing because an operating system will not be able to keep so many files opened. This problem can be overcome by iterative combining of files: first, pairs of the original VCF files are processed, then pairs of the obtained files are processed and so on until we get the resulting VCF file.

Here we describe an iterative scheme for merging or concatenating VCF files using bcftools and GNU paralleland present a Python script that implements it.

Iterative scheme for combining VCF files

We implement an iterative scheme that will combine the lesser number of VCF files simultaneously. To process VCF files in parallel, we use GNU parallel with the -a argument that specifies the input source for it. Since the bcftools routines do not index their output VCF files, we use the tabix tool for this purpose.

At some stages, the number of VCF files to be combined may be odd. In this case, one file will be skipped from the combining procedure at the current stage and moved to the next iteration (see the figure below).

The iterative scheme for combining multiple VCF files.

Python script for combining VCF files

The Python script implementing the described iterative scheme is given below. It provides the following options:

-c or –command specifies how input VCF files are combined (merged or concatenated);

-j or –jobs sets the number of the bcftools processes to be launched in parallel;

-k or –keep prevents removal of temporary files created by the script.

The script creates temporary files of two types:

plain-text lists of file pairs;

intermediate combined VCF files.

For their names, the script randomly creates a random 5-letter prefix, e.g., tchkc. This prefix is followed by the iteration number starting from 1. Temporary VCF files also have serial numbers in their names. For example, the name of the pair list file for the second iteration might be tchkc_2.txt and the corresponding VCF files may have names like tchkc_2_3.vcf.gz.

]]>https://gtamazian.com/2016/06/16/combining-a-large-number-of-vcf-files/feed/0gtamazianiterative-scheme-for-combining-multiple-VCF-filesRestoring models in protein structure files by Swiss PDB Viewerhttps://gtamazian.com/2016/06/05/restoring-models-in-protein-structure-files-by-swiss-pdb-viewer/
https://gtamazian.com/2016/06/05/restoring-models-in-protein-structure-files-by-swiss-pdb-viewer/#respondSun, 05 Jun 2016 15:30:33 +0000http://gtamazian.com/?p=2044Despite its name, Swiss PDB Viewer implements a number of features besides visualization of protein molecules. One of such features is side chain reconstruction for protein structures that contain only backbone atoms. However, Swiss PDB Viewer does not write model records to its output PDB files that may cause problems with other PDB-processing programs.

In this post, we present a Python script that adds proper model records to a PDB file produced by Swiss PDB Viewer.

The script also provides the –keep option to retain or remove PDB records specific to Swiss PDB Viewer (the ones that start from SPDBV).

]]>https://gtamazian.com/2016/06/05/restoring-models-in-protein-structure-files-by-swiss-pdb-viewer/feed/0gtamazianCommand-line options of Isaac Variant Callerhttps://gtamazian.com/2016/06/01/command-line-options-of-isaac-variant-caller/
https://gtamazian.com/2016/06/01/command-line-options-of-isaac-variant-caller/#respondWed, 01 Jun 2016 20:54:24 +0000http://gtamazian.com/?p=1756Isaac Variant Caller implements the fast variant-calling algorithm and can be considered as an alternative to GATK or samtools variant callers. Unfortunately, it seems to have no manual that would describe its command-line options.

Here we give the list of the Isaac Variant Caller command-line options obtained from its source codes that are publicly available on GitHub.

Genotyping options

There are two options: –snp-theta and –indel-theta that define values of the theta parameters used for genotyping; their default values are 0.001 and 0.0001, respectively. These parameters are explained in the paper on the Isaac suite.

gVCF options

The following options specify producing the output file in the gVCF format.

–chrom-depth-file – use the mean read depth values for each chromosome from the specified file for high-depth filtration; the chromosome depth file should contain one line per chromosome like chrom_name<TAB>depth;

–gvcf-max-depth-factor – if the chromosome depth file is specified (see above), then the loci which depth exceeds the mean chromosome depth the specified number of times are filtered. The default value is 3;

–gvcf-max-snv-hpol – SNVs are filtered if they are located within a homopolymer context greater that the specified length value; a negative value disables the filter. By default, the filter is disabled;

–gvcf-max-indel-ref-repeat – indels are filtered if they lengthen or contract a homopolymer or dinucleotide with reference repeat length greater than the specified value; a negative value disables the filter. By default, the filter is disabled;

–gvcf-min-blockable-nonref – prevents joining of sites into a non-variant block if it contains more than the specified fraction of non-reference alleles. The default value is 0.2;

–gvcf-include-hapscore – include haplotype scores at SNV positions in the gVCF output;

–gvcf-no-block-compression – turn off block compression in the gVCF output;

Haplotype options

Non-reference model options

The following options specify how non-reference alleles are processed.

–nonref-test-file – test for non-reference alleles at any frequency and write the results to the specified file;

–nonref-sites-file – print the results of the non-reference allele test at every site to the specified file;

–nonref-variant-rate – the expected non-reference variant frequency used for the non-reference test. The default value is 9.99e-07;

–min-nonref-freq – the minimum non-reference allele frequency considered in the non-reference test. The default value is 0;

–nonref-site-error-rate – the expected rate of erroneous non-reference allele sites applied to the non-reference model. At error sites a non-reference allele is expected in the frequency range from 0 to the value specified by the –nonref-site-error-decay-freq option (see below) with a probability that linearly decays at the –non-site-error-decay-freq value to 0. The default value is 0.0001;

–nonref-site-error-decay-freq – the parameter used to estimate the error site probability as described above. The default value is 0.01.

Contig options

This group contains three options that specify how contiguous sequences (contigs) of aligned reads are processed.

–min-contig-open-end-support(default: 0);

–min-contig-edge-alignment (default: 7);

–min-contig-contiguous-match (default: 14).

The –min-contig-open-end-support option filters out any open-ended contig with an unaligned breakpoint sequence length of less than its argument value. The –min-contig-edge-alignment option filters out any contig with an edge match segment shorter than its argument value. The –min-contig-contiguous-match option filters out any contig without a match segment of length at least its argument value.

Realignment options

There are two options in this group: –max-indel-toggle-depth and –skip-realignment. The –max-indel-toggle-depth option controls the realignment stringency; lowering its value increases the realignment speed and decreases quality of the called indels. The default value of –max-indel-toggle-depthis 5.

The –skip-realignment option disables the read realignment; it is accepted if no indel-calling and contig options are specified.

Indel-calling options

This group contains options related to indel calling.

–max-candidate-indel-depth – the maximum estimated read depth for an indel to be considered; if this value is exceeded for any sample, then the indel is filtered. The default value is 10000;

–min-candidate-open-length – the minimum length of an open-ended breakpoint sequence required to become a breakpoint candidate. The default value is 20;

–candidate-indel-input-vcf – add candidate indels from the specified VCF file. The option can be provided multiple times to combine evidence from multiple VCF files;

–force-output-vcf – write to output a record for each site or indel in the provided VCF file even if no variant is found. The option can be provided multiple times to combine multiple VCF files;

–upstream-oligo-size – process reads as if they have an upstream oligo anchor for purposes of meeting minimum breakpoint overlap in support of an indel.

Variant-calling window options

The –variant-window-flank-file option outputs regional average basecall statistics at variant sites within a window of the variant call of the specified size. The option is provided with a pair of arguments: the window flank size and the output file name, for example, –variant-window-flank-file 10 window10.txt. This option can be specified several times for various window sizes.

Compatibility options

The –eland-compatibility option enables checking of input reads for an optional AS field corresponding to the ELAND PE map score.

Input options

This group contains two options: –max-input-depth and –ignore-conflicting-read-names. The –max-input-depth option specifies the maximum allowed read depth per sample (prior to the realignment procedure); by default, there is no limit. The –ignore-conflicting-read-names option disables reporting an error if two input reads share the same QNAME and read number.

Other options

There is a pair of options in this group: –report-file and –remap-input-softclip. The first option, –report-file, reports non-error run information and statistics to the file specified as its argument. The second option, –remap-input-softclip, attempts to realign all soft-clipped segments in input reads.

Legacy options

This group contains options that are marked legacy.

-bam-file – analyze reads from the specified sorted BAM file;

-bam-seq-name – analyze reads from a BAM file that are aligned to the chromosome with the specified name;

-samtools-reference – get the reference sequence from the specified multisequence FASTA file that follows the samtools reference conventions;

-bsnp-diploid-het-bias – specify the bias term for the heterozygous state in the bsnp model, so that heterozygotes are expected at allele ratios in the range 0.5±x, where x is the parameter value. The default value is 0;

-bsnp-diploid-file – run the Bayesian diploid genotype model and write its results to the specified file;

-bsnp-disploid-allele-file – write the most probable genotype at every position to the specified file;

-min-qscore – do not use a base if its qscore is less than the specified value. The default value is 17;

-max-window-mismatchn m – do not use a base if the mismatch count within the window of m flanking bases is greater than n;

-min-single-align-score – mark the reads which single align scores are less than the specified value as single-end failed. By default, such reads are excluded from consideration unless a paired score is present; this behavior can be modified by options -single-align-score-exclude-mode and -single-align-score-rescue-mode (see below). The default value is 10;

-single-align-score-exclude-mode – exclude single-end failed reads even when a paired score is present and the read is not paired-end failed;

-single-align-score-rescue-mode – include non single-end failed reads even when a paired score is present and the read is paired-end failed;

-min-paired-align-score – reads which paired align score is less than the specified value are marked as paired-end failed if a paired score is present. By default such reads are excluded from consideration, but may still be used if the single-score rescue mode is enabled. The default score value is 6;

-filter-unanchored – prevent using unanchored read pairs (that is, the read pairs that have a single-read mapping score of zero in both its reads) during variant calling;

-include-singleton – include paired-end reads with unmapped mates;

-include-anomalous – include paired-end reads that are not part of a proper pair (anomalous orientation or incorrect insert size);

-counts – write observation counts for every position to the specified file;

-print-all-site-evidence – print the observed data for all sites (indels not included);

-bindel-diploid-het-bias – set the bias term for the heterozygous state in the bindel model, so that heterozygotes are expected at allele ratios in the range 0.5±x, where x is the parameter value. The default value is 0;

-bindel-diploid-file – run the Bayesian diploid genotype caller and write the results to the specified file;

-indel-contigs – the contig file produced by the GROUPER indel-finding tool. This option must be specified together with -indel-contig-reads;

-indel-contig-reads – the contig reads file produced by the GROUPER indel-finding tool. This option must be specified together with -indel-contigs;

-indel-error-rate – for indel calling, set the indel error rate to a constant value equal to the one specified in the option. The default indel error rate is estimated from an empirical function accounting for the homopolymer length and the indel type (insertion or deletion). This option overrides the default behavior;

-indel-nonsite-match-prob – the probability of a base matching the reference in an average mismapped read; this value is used only by the indel caller. The default value is 0.25;

-report-range-begin – event reports and coverage begin at the specified base. The default value is 0;

-report-range-end – event reports and coverage end after the specified base or the reference size, if specified. The default value is the reference size;

-report-range-reference – event reports and coverage span the entire reference sequence; a reference sequence is required to use this option. This option can not be combined with -report-range-begin and -report-range-end;

-genome-size – the total number of non-ambiguous bases in the genome to which the input reads have been aligned. This option is used in indel calling;

-min-candidate-indel-reads – the minimal number of indel-supporting reads to consider the indel a candidate for realignment and indel calling. A read counts if it is either a contig read or a genomic read that passes the mapping score threshold. The default value is 3;

-min-candidate-indel-read-frac – the minimal number of intersecting reads containing an indel to consider it a candidate for realignment and indel calling. The criteria for counting a read are the same as for -min-candidate-indel-reads. Only genomic reads that pass the mapping score threshold are used for the denominator of this metric. The default value is 0.02;

-max-small-candidate-indel-read-frac – an additional indel candidacy filter for small indels (no more than 4 bases): the minimal fraction of intersecting reads containing the small indel to consider it a candidate for realignment and indel calling. The criteria for counting a read are the same as for -min-candidate-indel-reads. Only genomic reads that pass the mapping score threshold are used for the denominator of this metric. The default value is 0.1;

-max-candidate-indel-density – if there are more than the specified number of candidate indels per base intersecting a read, then realignment is truncated to only allow individual indel toggles of the starting alignments for that read. The default value is 0.15;

-candidate-indel-file – write to the specified file all candidate indels before realignment and genotyping;

-max-indel-size – the maximum size for indels processed for calling and realignment. Note that increasing this value should lead to an approximately linear increase in memory consumption. The default value is 150;

-skip-variable-metadata – do not print command-line or time stamps in the output file metadata.

]]>https://gtamazian.com/2016/06/01/command-line-options-of-isaac-variant-caller/feed/0gtamazianAdding rs numbers to VCF filehttps://gtamazian.com/2016/05/29/adding-rs-numbers-to-vcf-file/
https://gtamazian.com/2016/05/29/adding-rs-numbers-to-vcf-file/#commentsSun, 29 May 2016 18:09:06 +0000http://gtamazian.com/?p=1685The VCF format provides a fixed field for a variant ID. It is recommended to use IDs from the NCBI dbSNP database (so-called rs numbers)for variants that have been already described in it. Here we describe how to add rs numbers to a custom VCF file using the bcftools package.

Step 1. Obtain dbSNP VCF file

To add rs numbers to a VCF file, we need the dbSNP VCF file that contains that numbers. The file can be downloaded from the NCBI FTP server as described here.

Step 2. (Optional) Remove existing IDs from VCF file

You may skip this step if you would like to preserve existing IDs in your VCF file. Otherwise, the existing variant IDs can be removed from the VCF file using the bcftools annotate tool with the –remove option.

]]>https://gtamazian.com/2016/05/29/adding-rs-numbers-to-vcf-file/feed/2gtamazianR’s matplot function in MATLABhttps://gtamazian.com/2016/05/28/rs-matplot-function-in-matlab/
https://gtamazian.com/2016/05/28/rs-matplot-function-in-matlab/#respondSat, 28 May 2016 17:47:56 +0000http://gtamazian.com/?p=1683By default, MATLAB’s plot function draws no markers in the figure that it produces. One may explicitly specify a marker and a line style following the line specification string syntax; however, only one marker type and line style may be applied to a single data set.

Further we present a simple MATLAB function that implements the same functionality as R’s matplotfunction and allows to set style for each data line shown.

Source code

The source code of the MATLAB implementation of matplot is given below. The function provides options type, lty, lwd, col and pch to customize the line type, style, width, color and the marker type, respectively.

Examples

Here we give some examples of using MATLAB’s matplot function presented above.

Default plot settings

By default, matplot does not draw lines and uses various colors and marker types for the data points.

matplot(cumsum(rand(10, 5), 2));

Data plot by matplot with its default settings.

Single line style and color

The following command specifies the same line style (solid) and color (black) for all data lines; the markers are different.

matplot(cumsum(rand(10, 5), 2), 'type', 'b', 'lty', '-', 'col', 'k');

Data plot by matplot with the single color and style for lines.

Alternate line styles and colors

The following command specifies alternate line styles (solid and dashed) and colors (black, red and green) for the data lines; the marker type (circle) is the same for all lines.

For some assemblies, their chromosome-from-scaffold AGP files may be missing although the chromosomes were assembled from the scaffolds. In that case, one may reconstruct the AGP file of scaffolds on chromosomes using chromosome-from-components and scaffold-from-components AGP files.

Further we describe how to perform such a reconstruction and present a Python script implementing it.

Python script to get chromosome-from-scaffolds AGP file

To obtain the chromosome-from-scaffolds AGP file, we scan the chromosome-from-contigs AGP file and replace contigs with the corresponding scaffolds. Also we remove gaps between contigs within a scaffold using the gap_type value in the seventh column of an AGP file: if the value is scaffold, then the gap is located between two contigs within a scaffold.

Note that the script requires a single chromosome-from-contigs AGP file that can be obtained by merging per-chromosome AGP files with cat.