Abstract

The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.

Rare diseases can be difficult to diagnose because of the low incidence and incomplete penetrance of the implicated alleles; however, variant analysis of whole genome sequencing (WGS) data can identify the underlying genetic events responsible for the disease (Nature, 2015). A large cohort is nevertheless required for many WGS association studies to provide enough statistical power for interpretation (see post and here). To this end, major sequencing projects have been initiated worldwide including:

Although sequencing costs have fallen dramatically over the years, the cost of determining the functional consequences of the resulting variants remains high. Thorough basic research studies must be conducted to validate the interpretation of variant data with respect to the underlying disease, because only a small fraction of the variants from a genome sequencing project will have a functional effect on a protein. Correctly annotating sequences and variants, and identifying the correct corresponding reference genes or transcripts in GENCODE or RefSeq, pose considerable challenges to identifying which sequenced variants are potentially functional.

To this end, the authors developed the Ensembl Variant Effect Predictor (VEP), a software suite that annotates and analyzes most types of genomic variation in coding and non-coding regions of the genome.

Species and assembly/genomic database support: VEP can analyze data from any species with an assembled genome sequence and an annotated gene set. VEP supports chromosome assemblies such as the latest GRCh38, FASTA files, transcripts from RefSeq, and user-derived sequences.

Noncoding Annotation: VEP reports variants in noncoding regions, including genomic regulatory regions, intronic regions, and transcription factor binding motifs. Data from ENCODE, BLUEPRINT, and the NIH Roadmap Epigenomics project are used for primary annotation. Perl plugins are also available to link other databases that annotate noncoding sequence features.

Frequency, phenotype, and citation annotation: VEP searches Ensembl databases containing a large amount of germline variant information and checks variants against the dbSNP single nucleotide polymorphism database. VEP integrates with mutational databases such as COSMIC and the Human Gene Mutation Database, and with structural and copy number variants from the Database of Genomic Variants. Allele frequencies are reported from 1000 Genomes and the NHLBI Exome Sequencing Project, and VEP integrates with PubMed for literature annotation. Phenotype information comes from OMIM, Orphanet, and the GWAS Catalog, and clinical information on variants comes from ClinVar.

Flexible Input and Output Formats: VEP supports the input data format called “variant call format” or VCF, a standard in next-generation sequencing. VEP can also process variant identifiers from other database formats. Output is tab-delimited, and the user can choose how results are presented (HTML or text based).
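To make the input format concrete, here is a minimal sketch of reading the first five fixed columns (CHROM, POS, ID, REF, ALT) of VCF data lines in Python. It is not VEP's own parser: the `Variant` type and the function names are hypothetical, the example record is made up, and real VCF files carry further columns and header metadata that this sketch ignores.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical container for the first five VCF columns."""
    chrom: str
    pos: int
    var_id: str
    ref: str
    alts: List[str]


def parse_vcf_line(line: str) -> Variant:
    """Parse the tab-separated CHROM, POS, ID, REF and ALT fields of one data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("a VCF data line needs at least CHROM, POS, ID, REF, ALT")
    chrom, pos, var_id, ref, alt = fields[:5]
    # Multiple alternate alleles are comma-separated in the ALT column.
    return Variant(chrom, int(pos), var_id, ref, alt.split(","))


def read_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield variants, skipping '#' header lines and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


# Made-up example input: two header lines and one data line.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG,T\t.\t.\t.",
]
variants = list(read_variants(example))
```

Each parsed record gives the genomic position and alleles that an annotation tool needs in order to look up overlapping transcripts and regulatory features.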

VEP script: VEP is available as a downloadable Perl script (see below for link) and can process large amounts of data rapidly. This interface is highly flexible and can integrate multiple plugins available from Ensembl and GitHub. Because the Perl code can be altered and plugins and functions added, any feature of VEP can be modified.

VEP REST API: provides robust computational access from any programming language and returns basic variant annotation. It can make use of external plugins.
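As an illustration of REST access, the following Python sketch builds a request URL for a variant given in HGVS notation and, optionally, fetches the JSON response. The server address (`https://rest.ensembl.org`), the `/vep/human/hgvs/` path and the example HGVS string are assumptions not stated in the text above; they follow the public Ensembl REST documentation as best understood here and should be checked against the current documentation before use. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed public Ensembl REST server; not stated in the text above.
ENSEMBL_REST = "https://rest.ensembl.org"


def build_vep_hgvs_url(hgvs: str, species: str = "human") -> str:
    """Build the URL for a VEP query by HGVS notation (assumed endpoint path)."""
    # Percent-encode characters such as '>' while leaving ':' readable.
    encoded = urllib.parse.quote(hgvs, safe=":")
    return f"{ENSEMBL_REST}/vep/{species}/hgvs/{encoded}"


def fetch_vep_annotation(hgvs: str, species: str = "human"):
    """Send the request and decode the JSON reply (requires network access)."""
    request = urllib.request.Request(
        build_vep_hgvs_url(hgvs, species),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Usage (performs a network call, so it is left commented out):
# annotations = fetch_vep_annotation("ENST00000366667:c.803C>T")
url = build_vep_hgvs_url("ENST00000366667:c.803C>T")
```

Separating URL construction from the network call keeps the first part easy to check offline, which is convenient when a pipeline has to generate many queries.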


Availability of data and materials

The dataset supporting the conclusions of this article is available from Illumina’s Platinum Genomes [93] and using the Ensembl release 75 gene set. Pre-built data sets are available for all Ensembl and Ensembl Genomes species [94]. They can also be downloaded automatically during set up whilst installing the VEP.

Updated 11/15/2018

Research Points to Caution in Use of Variant Effect Prediction Bioinformatic Tools

Although we have the ability to use high throughput sequencing to identify allelic variants occurring in rare disease, correlation of these variants with the underlying disease is often difficult due to a few concerns:

As Whole Exome Sequencing (WES) returns a considerable number of variants, how to differentiate the normal allelic variation found in the human population from disease-causing pathogenic alleles

For rare diseases, pathogenic allele frequencies are generally low

Therefore, for these rare pathogenic alleles, the use of bioinformatics tools in order to predict the resulting changes in gene function may provide insight into disease etiology when validation of these allelic changes might be experimentally difficult.

In a 2017 Genes & Immunity paper, Line Lykke Andersen and Rune Hartmann tested the reliability of various bioinformatic tools in predicting the functional consequences of variants in six different genes involved in interferon induction and of sixteen allelic variants of the IFNLR1 gene. These variants were found in cohorts of patients presenting with herpes simplex encephalitis (HSE). Most of the adult population is seropositive for Herpes Simplex Virus (HSV); however, a minor fraction (1 in 250,000 individuals per year) of HSV-infected individuals will develop HSE (Hjalmarsson et al., 2007). It has been suggested that HSE occurs in individuals with rare primary immunodeficiencies caused by gene defects affecting innate immunity through reduced production of interferons (IFN) (Zhang et al., Lim et al.).

We selected two sets of naturally occurring human missense allelic variants within innate immune genes. The first set represented eleven non-synonymous variants in six different genes involved in interferon (IFN) induction, present in a cohort of patients suffering from herpes simplex encephalitis (HSE) and the second set represented sixteen allelic variants of the IFNLR1 gene. We recreated the variants in vitro and tested their effect on protein function in a HEK293T cell based assay. We then used an array of 14 available bioinformatics tools to predict the effect of these variants upon protein function. To our surprise two of the most commonly used tools, CADD and SIFT, produced a high rate of false positives, whereas SNPs&GO exhibited the lowest rate of false positives in our test. As the problem in our test in general was false positive variants, inclusion of mutation significance cutoff (MSC) did not improve accuracy.
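The comparison described above can be sketched as a false positive rate calculation. The sketch assumes that a "false positive" is a variant a tool predicts to be damaging although the functional assay shows normal protein function; the function name and the toy data are hypothetical and do not reproduce the paper's results.

```python
from typing import Sequence


def false_positive_rate(predicted_damaging: Sequence[bool],
                        functionally_impaired: Sequence[bool]) -> float:
    """Return FP / (FP + TN), using the functional assay as ground truth."""
    if len(predicted_damaging) != len(functionally_impaired):
        raise ValueError("both sequences must describe the same variants")
    pairs = list(zip(predicted_damaging, functionally_impaired))
    # False positive: predicted damaging, but the assay shows normal function.
    false_positives = sum(1 for pred, truth in pairs if pred and not truth)
    # True negative: predicted benign, and the assay shows normal function.
    true_negatives = sum(1 for pred, truth in pairs if not pred and not truth)
    negatives = false_positives + true_negatives
    if negatives == 0:
        raise ValueError("no functionally normal variants to evaluate")
    return false_positives / negatives


# Toy data for four hypothetical variants (not taken from the paper).
tool_calls = [True, True, False, True]      # tool says "damaging"?
assay_calls = [True, False, False, False]   # assay shows impaired function?
rate = false_positive_rate(tool_calls, assay_calls)
```

Running the same function once per prediction tool gives a simple way to rank tools by how often they flag functionally normal variants.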

Methodology

Identification of rare variants

The genomes of nineteen Dutch patients with a history of HSE were sequenced by WES. Novel candidate HSE-causing variants were identified by filtering for single nucleotide polymorphisms (SNPs) that had a frequency below 1% in the NHLBI Exome Sequencing Project Exome Variant Server and the 1000 Genomes Project and were present within 204 genes involved in the immune response to HSV.
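The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `Candidate` type, the field names and the frequencies are hypothetical, a variant absent from a database is treated as rare, and the frequency must fall below the threshold in both databases.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one SNP found by exome sequencing."""
    gene: str
    esp_freq: Optional[float]  # frequency in the Exome Variant Server; None if absent
    kg_freq: Optional[float]   # frequency in the 1000 Genomes Project; None if absent


def is_rare(freq: Optional[float], threshold: float) -> bool:
    """Assumption: a variant missing from a database counts as rare."""
    return freq is None or freq < threshold


def filter_candidates(variants: Iterable[Candidate],
                      gene_panel: Set[str],
                      threshold: float = 0.01) -> List[Candidate]:
    """Keep variants that are rare in both databases and lie in a panel gene."""
    return [
        v for v in variants
        if v.gene in gene_panel
        and is_rare(v.esp_freq, threshold)
        and is_rare(v.kg_freq, threshold)
    ]


# Toy input with made-up frequencies; the real panel contained 204 genes.
panel = {"IRF3", "MAVS", "TBK1"}
snps = [
    Candidate("IRF3", 0.002, None),    # rare and in panel: kept
    Candidate("MAVS", 0.150, 0.120),   # common: removed
    Candidate("BRCA1", 0.001, 0.001),  # rare but outside the panel: removed
]
kept = filter_candidates(snps, panel)
```

The 1% threshold is exposed as a parameter so that stricter or looser cutoffs can be compared without changing the filter itself.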

In order to validate the predictive value of the software, HEK293T cells deficient in IRF3, MAVS, and IKKε/TBK1 were cotransfected with the nine variants of the aforementioned genes and a luciferase reporter under control of the IFN-β promoter, and luciferase activity was measured as an indicator of IFN signaling function. Western blotting was performed to confirm expression of the constructs.

Abstract

The successor to the Human Genome Project intends to establish, by international cooperation, an encyclopedic catalog of sequence variants indexed to the human genome sequence.

Introduction

Genomics is not just for rich countries any more. Anyone can contribute to the Human Variome Project (HVP; see Commentary, page 433). Indeed, the project might just be ambitious enough that everyone really will need to contribute. By stating that all human genetics and genomics contributes to a single aim, the HVP essentially reduces duplication of effort while increasing credit for participation.

However, it will have to find ways to coordinate the disparate activities of clinicians, researchers, database curators and bioinformaticians by providing the means and incentives to lodge the variants they have found in public databases. Variome aims to get all to use compatible nomenclature and phenotype reporting systems and to index variant and phenotype data to gene models in the coordinate system generated by the Human Genome Project. Automation and expert curation, and open comment and expert review, will all have a place in this endeavor. How will we do this without creating more than a necessary minimum of new databases, procedures and bureaucracy?

A very important point, but a tough one to get across, is that much of the necessary work is currently happening across the globe—but is just insufficiently coordinated. The individuals already hard at work aren’t getting the credit they deserve. In a sense, the rest of the world’s geneticists deserve the kind of service that US researchers receive from the excellent coordinating work of the National Human Genome Research Institute and the repositories of the National Center for Biotechnology Information (NCBI), together with the kind of attention afforded by international journals. If only these kinds of coordination, recording and attention could be brought to bear, however briefly, on publication units as small as single instances of a variant gene! Thus, Variome aims to add value to databases such as OMIM, GenBank, dbSNP, dbGAP and the HapMap and organizations including NCBI and the European Bioinformatics Institute (EBI) by working with them all. It will start gene by gene, evaluating variants already found and curated for mendelian diseases, and will add rare and common variants in common diseases as they are reported. As it does so, HVP participants will develop mechanisms to expedite and automate reporting of variants and their occurrence.

In the consensus-building exercise of the first Human Variome meeting (page 433), delegates constructed a wish list of recommendations that numerically exceeded the number of participants at the meeting. We think that two points emerge as particularly important to the success of the project: publication and credit.

To be successful in persuading clinical and diagnostic laboratories to contribute variations and persuading researchers to evaluate the pathogenic potential of each variant, the HVP will need to introduce publishing innovations at both ends of the citation spectrum. It will need to track the citation of each variant’s accession code in papers, database entries and across the web. This closing of the online publication loop might be termed microattribution. Perhaps existing journals could be persuaded to take responsibility for monitoring and highlighting the citation of database entries in their papers, so that the HVP can readily aggregate this information. A journal devoted to the human variome could commission peer-reviewed, gene-based synopses of mendelian mutations based on information in locus-specific databases (see pages 425 and 427), meta-analyses of association studies and resequencing data such as those reported by Jonathan Cohen and colleagues in this issue (page 513, with News and Views on page 439). Phenotypic and diagnostic information might be linked to these synopses from existing databases such as the dysmorphology databases, PharmGKB (page 426) and GeneTests (http://www.genetests.org). Genome browsers including Ensembl and UCSC might then be persuaded to display a Variome track. We envisage such synopses to be a gene-based extension of the disease-based annual synopses for association studies we proposed last year (Nat. Genet. 38, 1; 2006). The first of these, on Alzheimer disease, was published by Lars Bertram and colleagues (Nat. Genet. 39, 17–23; 2007) using their newly created AlzGene database.

Which genes should the HVP annotate first to demonstrate the utility and impact of its coordinating activities? Perhaps we can learn from one of the most impressive recent exercises in evidence-based medicine: namely, the American College of Medical Genetics' systematic prioritization of genes for newborn screening (http://mchb.hrsa.gov/screening/). Variome synopses would take into account the prevalence, seriousness and treatability of the clinical condition(s), the value added by combining all three types of genetic study listed above and the availability of all three kinds of evidence in existing laboratories, databases and publications.

There are, inevitably, limits to what can be achieved by a gene-based view of human variation. Gene models are revised and re-annotated, and structural genomic variation plays havoc with reference genome builds and the context within which point variants and haplotypes are found. Physicians and the general public will want a disease-based view—and the associated diagnostic genetic tests, rather than genome annotation. Delaying the appearance of such alternative views, there is often a many-to-many correspondence between genes and disease phenotypes. On the brighter side, this complexity should provide good business for database designers and review journals.

As the participants of the Variome meeting note in their Commentary, the effort to index and evaluate all of human variation will provide many new opportunities in genomics for researchers whose home countries did not participate in the initial human genome sequencing project. They are right that this is both the project and the time to achieve the globalization of genomics.

Our Vision for the Future

Imagine you are sick. For many, this is not a difficult task. Now imagine you are sick and none of your doctors know why. Your symptoms suggest that you have a rare genetic disease, and you’ve been tested for a mutation in the gene responsible, but the results are inconclusive. The laboratory found a change in your genetic sequence, but is unable to definitively state that it’s what’s causing your symptoms. And with no definitive result from the test, your doctor, and your insurance company, are unwilling to prescribe the expensive course of drugs needed to control your symptoms.

While many people might be willing to dismiss the chances of this happening to them, when you start to look at the facts, things start to get a little frightening. There are over 6,000 diseases that can be caused by a mutation in a single gene and it is estimated that 1 child in every 200 born will suffer from one of these diseases. Add to that the number of cancers that have an inherited genetic component and the chances of you, or someone you know, being in this position are quite high.

Now imagine that the information the laboratory and your doctor needed to make an accurate diagnosis was out there, but it wasn’t accessible to them: it was hidden away in an obscure academic paper, or in some researcher’s forgotten notes.

Unfortunately, this is the situation that is currently facing thousands of people across the globe who are suffering the devastating effects of genetic illnesses.

The role that our genes play in our health and well-being is well known. The genetic makeup of an individual can cause a host of genetic disorders that can manifest from early childhood (cystic fibrosis, Prader-Willi Syndrome, Fragile X Syndrome) to adulthood (Alzheimer’s disease, polycystic kidney disease, Huntington’s disease) as well as significantly increase the risk of contracting more common diseases such as schizophrenia, diabetes, depression and cancer.

The world is rapidly moving towards an era where it is both economically and scientifically feasible to sequence the genome of every patient presenting with a chronic condition; already in the past decade the cost of a whole-genome sequence has dropped from several billion dollars to a few thousand.

But being able to sequence the genome of a patient cheaply and easily will be useless if we are unable to determine if the variations present in a sequence have an effect on human health. We are suffering from a critical lack of information about the consequences of the vast majority of the mutations possible within the human genome. And, even more concerning, is the fact that even when that information exists, it is not being shared and captured by the global medical research community in a manner that guarantees widespread dissemination and long-term preservation.

The Human Variome Project is trying to change this. We strongly believe in the free and open sharing of information on genetic variation and its consequences and are dedicated to developing and maintaining the standards, systems and infrastructure that will embed information sharing into routine clinical practice. We envision a world where the availability of, and access to, genetic variation information is not an impediment to diagnosis and treatment; where the burden of genetic disease on the human population is significantly decreased; where never again will a doctor have to look at a genetic sequence and ask, “What does this change mean for my patient?”

The Human Variome Project is motivated by the knowledge that by working together, we will be able to significantly reduce the needless physical, psychological, emotional and economic suffering of millions of people.

Human Variome Project International Limited is a not-for-profit Australian public company limited by guarantee that was founded in 2010 to provide central coordination efforts to the global Human Variome Project effort and run the International Coordinating Office. The company has no shareholders and is endorsed by the Australian Tax Office as a deductible gift recipient as a Health Project Charity.

Human Variome Project International Limited, as a company limited by guarantee, is a public unlisted company. It must file accounts annually with the Australian Securities and Investments Commission, it must be audited and, as a public company, the directors and officers of the company must comply with all the duties and responsibilities set out in the Australian Corporations Act. UNESCO also stipulates strict conditions for compliance with its functions and operation as a non-government and non-profit making organisation.

Human Variome Project International’s objects and powers include:

to promote the prevention or the control of diseases in human beings

to develop and provide educational programs, training and courses in public administration, public sector management, public policy, public affairs and any other related fields

to alleviate human suffering by collecting, organising and sharing data on genetic variation;

to further the Human Variome Project

to act as the co-ordinating office for the Human Variome Project

to attract and employ academics, researchers, practitioners and other staff as required to provide and support the services to further the objects of the Company

to provide facilities for research, study and education related to the Human Variome Project

to carry out and conduct the business of provider of administrative and consulting services;

to seek, encourage and accept gifts, grants, donations or endorsements

to affiliate with and enter into co-operative agreements with research educational institutions, government, local governments, practitioner bodies, non-government organisations, commercial, cultural and any other institutions or bodies

Company Members

Mr David Abraham

Professor Richard Cotton

Sir John Burn

Dr David Rimoin

Dr Eric Haan

Professor Jean-Jacques Cassiman

(representative of) National Institute of Gene Science and Technology Development (China)

The Board of Directors is advised by the Scientific Advisory Committee in matters of strategic scientific direction for current and future projects. The Scientific Advisory Committee has a variety of roles and responsibilities, as well as the delegated authority of the Board of Directors on the publication of all HVP Standards and Guidelines, and the arbitration of any dispute resolution processes in the generation of HVP Standards and Guidelines. The Scientific Advisory Committee consists of twelve members including one Chair. The Scientific Advisory Committee members are elected by the two Advisory Councils every two years, with half the positions on the Committee becoming vacant every two years. The Chair of the Scientific Advisory Committee is appointed by the Coordinating Office from among the members of the Scientific Advisory Committee. Membership of the Committee, in an ex-officio capacity, is also extended to:

the Scientific Director of the Human Variome Project Coordinating Office;

the President of the Human Genome Variation Society;

the President of the International Federation of Human Genetics Societies; and

a representative from the central genetic databases, chosen from amongst themselves.

Any Individual Member of the Human Variome Project Consortium is eligible to stand for election to the Scientific Advisory Committee. Candidates must be nominated and seconded by a member of either of the Advisory Councils.

The Scientific Advisory Committee meets on a face–to–face basis once per year, usually in conjunction with the HVP Fora series. The Scientific Advisory Committee also regularly meets via telephone/video–conference.

Table of contents

November 2012, Volume 44, No. 11, pp. 1171–1285

News and Views

Tracking the evolution of cancer methylomes, pp. 1173–1174

Arnaud R Krebs & Dirk Schübeler

doi:10.1038/ng.2451

Cellular transformation in cancer has long been associated with aberrant DNA methylation, most notably, hypermethylation of promoter sequences. A new study uses a clever approach of selective high-resolution profiling to follow DNA methylation over a time course of cellular transformation and challenges the notion that hypermethylation in cancer arises in an orchestrated fashion.

Older males beget more mutations, pp. 1174–1176

Matthew Hurles

doi:10.1038/ng.2448

Three papers characterizing human germline mutation rates bolster evidence for a relatively low rate of base substitution in modern humans and highlight a central role for paternal age in determining rates of mutation. These studies represent the advent of a transformation in our understanding of mutation rates and processes, which may ultimately have public health implications.

FOXA1 and breast cancer risk, pp. 1176–1177

Kerstin B Meyer & Jason S Carroll

doi:10.1038/ng.2449

Many SNPs associated with human disease are located in non-coding regions of the genome. A new study shows that SNPs associated with breast cancer risk are located in enhancer regions and alter binding affinity for the pioneer factor FOXA1.

Nick Orr and colleagues report a genome-wide association study for male breast cancer. They identify a new susceptibility locus at RAD51B and examine association evidence for known female breast cancer loci in these cohorts.

Adrienne Flanagan and colleagues identify a common variant in the T gene associated with strong risk of chordoma, a rare malignant bone tumor. The risk variant alters an amino acid in the DNA-binding domain of the T transcription factor and is associated with differential expression of T and its downstream targets.

Jan Molenaar and colleagues show that LIN28B is overexpressed and amplified in human neuroblastomas and that LIN28B regulates let-7 family miRNAs and MYCN. They create a transgenic mouse model of LIN28B overexpression and show that these mice develop neuroblastoma tumors.

Amos Tanay and colleagues characterize DNA methylation polymorphism within cell populations and track immortalized fibroblasts in culture for over 300 generations to show that formation of differentially methylated regions occurs through a stochastic process and nearly deterministic epigenetic remodeling.

Gordon Dougan and colleagues report whole-genome sequencing of a global collection of 179 Salmonella Typhimurium isolates, including 129 diverse sub-Saharan African isolates associated with invasive disease. They determine the phylogenetic structure of invasive Salmonella Typhimurium in sub-Saharan Africa and find that the majority are from two closely related highly conserved lineages, which emerged in the last 60 years in close temporal association with the current HIV epidemic.

Mayumi Tamari and colleagues report a genome-wide association study for atopic dermatitis, a chronic inflammatory skin disease, in a Japanese population. They identify eight new susceptibility loci for atopic dermatitis and compare their results to those of previous studies in European and Chinese populations.

Peter Gregersen and colleagues identify a regulatory variant in CSK, coding for an intracellular kinase that physically interacts with Lyp (PTPN22), associated with systemic lupus erythematosus (SLE). Their work suggests that the Lyp-Csk complex influences susceptibility to SLE through regulation of B-cell signaling, maturation and activation.

José Martin-Subero and colleagues report whole-genome bisulfite sequencing and methylome analysis of two CLLs and three B-cell subpopulations using high-density microarrays on 139 CLLs. They identify widespread hypomethylation in the gene body that is largely associated with intragenic enhancer elements.

Yanick Crow and colleagues show that mutations in ADAR1 cause the autoimmune disorder Aicardi-Goutières syndrome, accompanied by upregulation of interferon-stimulated genes. ADAR1 encodes an enzyme that catalyzes the deamination of adenosine to inosine in double-stranded RNA, and the findings suggest a possible role for RNA editing in limiting the accumulation of repeat-derived RNA species.

Harry Dietz and colleagues report the identification of mutations in SKI in Shprintzen-Goldberg syndrome, which shares features with Marfan syndrome and Loeys-Dietz syndrome. SKI encodes a known repressor of TGF-β activity, and this work provides evidence for paradoxical increased TGF-β signaling as the mechanism underlying these related syndromes.

Rima Nabbout and colleagues report the identification of de novo mutations in the KCNT1 potassium channel gene in individuals with malignant migrating partial seizures of infancy, a rare epileptic encephalopathy with pharmacoresistant seizures and developmental delay. The authors show that the mutations have a gain-of-function effect on KCNT1 channel activity.

Evan Eichler and colleagues report an estimate of the mutation rate in humans that is based on the whole-genome sequences of five parent-offspring trios from a Hutterite population and genotyping data from an extended pedigree. They use a new approach for estimating the mutation rate over multiple generations that takes into account the extensive autozygosity in this founder population.

Patrick Chinnery, Nils-Goran Larsson and colleagues show that mitochondrial heteroplasmy levels are principally determined prenatally within the developing female germline in mice transmitting a heteroplasmic single base-pair deletion in the mitochondrial tRNAMet gene.

Sequencing of the human genome via massive programs such as the Cancer Genome Atlas Program (CGAP) and the Encyclopedia of DNA Elements (ENCODE) consortium, in conjunction with considerable bioinformatics efforts led by the National Center for Biotechnology Information (NCBI), has unlocked a myriad of as yet unclassified genes (for a good review see (2)). The ENCODE project encompasses 32 institutions worldwide which, so far, have generated 1640 data sets, initially depending on microarray platforms but now moving to the more cost-effective new sequencing technology. Initially the ENCODE project focused on three types of cells: an immature white blood cell line GM12878, the leukemic line K562, and an approved human embryonic cell line H1-hESC. The analysis was rapidly expanded to another 140 cell types. DNA sequencing had revealed 20,687 known coding regions with hints of 50 more coding regions. Another 11,224 DNA stretches were classified as pseudogenes. The ENCODE project reveals that many genes encode an RNA rather than a protein product, the so-called regulatory RNAs.

However, some of the most recent and interesting results focus on the noncoding regions of the human genome, previously dismissed as uninteresting or “junk” DNA. Only 2% of the human genome contains coding regions, while the remaining 98%, the noncoding part of the genome, is actually found to be highly active “with about 4 million constantly communicating switches” (3). Some of these “switches” in the noncoding portion contain small, repetitive elements which are mobile throughout the genome and can control gene expression and/or predispose to diseases such as cancer. These mobile elements, found in almost all organisms, are classified as transposable elements (TEs), which insert themselves into far-reaching regions of the genome. Retro-transposons are capable of generating new insertions through RNA intermediates. Transposable elements are normally kept immobile by epigenetic mechanisms (4-6); however, some TEs can escape epigenetic repression and insert into other areas of the genome, a process described as insertional mutagenesis because it can lead to the gene alterations seen in disease (7). In addition, insertional mutagenesis can lead to the transformation of cells and, as described in Post 2, act as a model system to determine drivers of oncogenesis. Insertional mutagenesis is a different mechanism from the other genetic alterations and rearrangements seen in cancer, such as the recombination and fusion of gene fragments seen with the Philadelphia chromosome and the BCR/ABL fusion protein (8). The mechanism of transposition and its putative effects leading to mutagenesis are described in the following figure:

Figure. Insertional mutagenesis based on a transposon-mediated mechanism. A) The basic structure of a transposon contains a gene/sequence flanked by two inverted repeats (IR) and/or direct repeats (DR). An enzyme, the transposase (red hexagon), binds and cuts at the IR/DR, and the transposon is pasted at another site in the DNA containing an insertion site. B) Multiple transpositions may result in oncogenic events by inserting in promoters, leading to altered expression of genes driving oncogenesis, or by inserting within coding regions and inactivating tumor suppressors or activating oncogenes. Deep sequencing of the resultant tumor genomes (based on nested PCR from IR/DRs) may reveal common insertion sites (CIS), from which oncogenic mutations could be identified.

In a bioinformatics study, Eunjung Lee et al. (1), in collaboration with the Cancer Genome Atlas Research Network, analyzed 43 high-coverage whole-genome sequencing datasets from five cancer types to determine transposable element insertion sites. Using a novel computational method, the authors identified 194 high-confidence somatic TE insertion sites present in cancers of epithelial origin such as colorectal, prostate and ovarian, but not in brain or blood cancers. Sixty-four of the 194 detected somatic TE insertions were located within 62 annotated genes. Genes with TE insertions in colon cancers commonly have high mutation rates, and the enriched genes were associated with cell adhesion functions (CDH12, ROBO2, NRXN3, FPR2, COL1A1, NEGR1, NTM and CTNNA2) or tumor suppressor functions (NELL1, ROBO2, DBC1, and PARK2). None of the somatic events were located within coding regions; the TE sequences were detected in untranslated regions (UTRs) or intronic regions. Previous studies had shown that insertions in these regions (UTR or intronic) can disrupt gene expression (9). Interestingly, most of the genes with insertion sites were down-regulated, consistent with recent work showing that local changes in the methylation status of transposable elements can drive retro-transposition (10,11). Indeed, the authors found that somatic insertions are biased toward the hypomethylated regions in cancer cell DNA. By analyzing 44 normal genomes (41 normal blood samples from cancer patients and three healthy individuals), the authors also confirmed that the insertion sites were unique to cancer and were somatic rather than germline (inherited) in origin.

The authors conclude:

“that some TE insertions provide a selective advantage during tumorigenesis, rather than being merely passenger events that precede clonal expansion(1).”

The authors also suggest that more bioinformatics studies, which utilize the expansive genomic and epigenetic databases, could determine functional consequences of such transposable elements in cancer. The following Post will describe how use of transposon-mediated insertional mutagenesis is leading to discoveries of the drivers (main genetic events) leading to oncogenesis.