Abstract

The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.

Rare diseases can be difficult to diagnose due to low incidence and incomplete penetrance of implicated alleles however variant analysis of whole genome sequencing can identify underlying genetic events responsible for the disease (Nature, 2015). However, a large cohort is required for many WGS association studies in order to produce enough statistical power for interpretation (see post and here). To this effect major sequencing projects have been initiated worldwide including:

And although sequencing costs have dramatically been reduced over the years, the costs to determine the functional consequences of such variants remains high, as thorough basic research studies must be conducted to validate the interpretation of variant data with respect to the underlying disease, as only a small fraction of variants from a genome sequencing project will encode for a functional protein. Correct annotation of sequences and variants, identification of correct corresponding reference genes or transcripts in GENCODE or RefSeq respectively offer compelling challenges to the proper identification of sequenced variants as potential functional variants.

To this effect, the authors developed the Ensembl Variant Effect Predictor (VEP), which is a software suite that performs annotations and analysis of most types of genomic variation in coding and non-coding regions of the genome.

Species and assembly/genomic database support: VEP can analyze data from any species with assembled genome sequence and annotated gene set. VEP supports chromosome assemblies such as the latest GRCh38, FASTA, as well as transcripts from RefSeq as well as user-derived sequences

Noncoding Annotation: VEP reports variants in noncoding regions including genomic regulatory regions, intronic regions, transcription binding motifs. Data from ENCODE, BLUEPRINT, and NIH Epigenetics RoadMap are used for primary annotation. Plugins to the Perl coding are also available to link other databases which annotate noncoding sequence features.

Frequency, phenotype, and citation annotation: VEP searches Ensembl databases containing a large amount of germline variant information and checks variants against the dbSNP single nucleotide polymorphism database. VEP integrates with mutational databases such as COSMIC, the Human Gene Mutation Database, and structural and copy number variants from Database of Genomic Variants. Allele Frequencies are reported from 1000 Genomes and NHLBI and integrates with PubMed for literature annotation. Phenotype information is from OMIM, Orphanet, GWAS and clinical information of variants from ClinVar.

Flexible Input and Output Formats: VEP supports input data format called “variant call format” or VCP, a standard in next-gen sequencing. VEP has the ability to process variant identifiers from other database formats. Output formats are tab deliminated and give the user choices in presentation of results (HTML or text based)

VEP script: VEP is available as a downloadable PERL script (see below for link) and can process large amounts of data rapidly. This interface is powerfully flexible with the ability to integrate multiple plugins available from Ensembl and GitHub. The ability to alter the PERL code and add plugins and code functions allows the flexibility to modify any feature of VEP.

VEP REST API: provides robust computational access to any programming language and returns basic variant annotation. Can make use of external plugins.

Watch Video on VES Web Version training on How to Analyze Your Sequence in VEP

Availability of data and materials

The dataset supporting the conclusions of this article is available from Illumina’s Platinum Genomes [93] and using the Ensembl release 75 gene set. Pre-built data sets are available for all Ensembl and Ensembl Genomes species [94]. They can also be downloaded automatically during set up whilst installing the VEP.

Updated 11/15/2018

Research Points to Caution in Use of Variant Effect Prediction Bioinformatic Tools

Although we have the ability to use high throughput sequencing to identify allelic variants occurring in rare disease, correlation of these variants with the underlying disease is often difficult due to a few concerns:

As Whole Exome Sequencing (WES) returns a considerable number of variants, how to differentiate the normal allelic variation found in the human population from disease-causing pathogenic alleles

For rare diseases, pathogenic allele frequencies are generally low

Therefore, for these rare pathogenic alleles, the use of bioinformatics tools in order to predict the resulting changes in gene function may provide insight into disease etiology when validation of these allelic changes might be experimentally difficult.

In a 2017 Genes & Immunity paper, Line Lykke Andersen and Rune Hartmann tested the reliability of various bioinformatic software to predict the functional consequence of variants of six different genes involved in interferon induction and sixteen allelic variants of the IFNLR1 gene. These variants were found in cohorts of patients presenting with herpes simplex encephalitis (HSE). Most of the adult population is seropositive for Herpes Simplex Virus (HSV) however a minor fraction (1 in 250,000 individuals per year) of HSV infected individuals will develop HSE (Hjalmarsson et al., 2007). It has been suggested that HSE occurs in individuals with rare primary immunodeficiencies caused by gene defects affecting innate immunity through reduced production of interferons (IFN) (Zhang et al., Lim et al.).

We selected two sets of naturally occurring human missense allelic variants within innate immune genes. The first set represented eleven non-synonymous variants in six different genes involved in interferon (IFN) induction, present in a cohort of patients suffering from herpes simplex encephalitis (HSE) and the second set represented sixteen allelic variants of the IFNLR1 gene. We recreated the variants in vitro and tested their effect on protein function in a HEK293T cell based assay. We then used an array of 14 available bioinformatics tools to predict the effect of these variants upon protein function. To our surprise two of the most commonly used tools, CADD and SIFT, produced a high rate of false positives, whereas SNPs&GO exhibited the lowest rate of false positives in our test. As the problem in our test in general was false positive variants, inclusion of mutation significance cutoff (MSC) did not improve accuracy.

Methodology

Identification of rare variants

Genomes of nineteen Dutch patients with a history of HSE sequenced by WES and identification of novel HSE causing variants determined by filtering the single nucleotide polymorphisms (SNPs) that had a frequency below 1% in the NHBLI Exome Sequencing Project Exome Variant Server and the 1000 Genomes Project and were present within 204 genes involved in the immune response to HSV.

In order to validate the predictive value of the software, HEK293T cells, deficient in IRF3, MAVS, and IKKe/TBK1, were cotransfected with the nine variants of the aforementioned genes and a luciferase reporter under control of the IFN-b promoter and luciferase activity measured as an indicator of IFN signaling function. Western blot was performed to confirm the expression of the constructs.