A better understanding of transcription, the first stage of gene expression, could have sweeping implications for the prevention and treatment of cancer.

But despite decades of intensive research in labs around the globe, myriad mysteries continue to shroud the genetic code.

With the help of Artificial Intelligence, two Russian researchers have made it their mission to unlock the secrets hidden within the transcription process – conducting research that could have a sweeping impact on what we know about genetics in general, and about maladies like cancer in particular.

The hidden world of DNA transcription

All of our physical traits are dictated by instructions contained within our DNA.

In transcription, bits of genetic code contained within a strand of DNA are copied by enzymes called RNA polymerases, which in turn create messenger RNA (mRNA) based on these copies.

Initially, the mRNA contains non-coding sections called introns. These are cut out during the so-called splicing process, leaving behind only coding sections – or sections that contain instructions for creating proteins.

The instructions hidden within the DNA then become tangible as the mRNA strand is translated into protein.
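
The three steps described above can be mimicked in a few lines of Python. This is only a toy sketch: the DNA sequence, the intron coordinates and the five-entry codon table are invented for illustration, and transcription is simplified to copying the coding strand with T replaced by U.

```python
# Toy model of transcription -> splicing -> translation.
# The sequence, intron positions and partial codon table are illustrative assumptions.

CODON_TABLE = {"AUG": "M", "GCU": "A", "AAA": "K", "UGG": "W", "UAA": "*"}


def transcribe(coding_strand: str) -> str:
    """Return the pre-mRNA for a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, introns: list) -> str:
    """Cut out introns given as half-open (start, end) index pairs."""
    pieces, cursor = [], 0
    for start, end in sorted(introns):
        pieces.append(pre_mrna[cursor:start])  # keep the exon before this intron
        cursor = end                           # skip over the intron
    pieces.append(pre_mrna[cursor:])           # keep the final exon
    return "".join(pieces)


def translate(mrna: str) -> str:
    """Read codons three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# Exon 1 (6 bases), a made-up intron (9 bases), exon 2 (9 bases).
dna = "ATGGCT" + "GTAAGTCAG" + "AAATGGTAA"
introns = [(6, 15)]

mature_mrna = splice(transcribe(dna), introns)
protein = translate(mature_mrna)
```

Here the mature mRNA retains only the two coding sections, and the resulting protein is a four-letter string of amino acid codes.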

Yet key steps of this process remain poorly understood. For instance, we know that transcription starts when RNA polymerase is bound to DNA by hundreds of proteins called transcription factors. But scientists have not yet figured out why transcription is initiated in certain tissues and not in others.

Further down the line, during the splicing process, scientists can predict the sites at which non-coding sections may be cut. But they cannot yet predict with certainty which sections will actually be cut out or why.

Enter Ilya Vorontsov, a former computer programmer who is currently earning his PhD at the Vavilov Institute of General Genetics (VIGG) in Moscow, and Dmitry Svetlichnyy, a former medical doctor who now conducts postdoctoral research at Skoltech.

Vorontsov has devoted his studies to deciphering the secrets of transcription factors and what regulates the initiation of the process, while Svetlichnyy focuses on splicing.

Both rely on Machine Learning to make sense of these unsolved mysteries. Machine Learning – a form of Artificial Intelligence where computers are able to effectively learn new information without relying on specific programming – enables researchers to sift through massive volumes of data, and thereby cover a great deal of ground in a relatively short timeframe.

And the groundbreaking work both students have turned out has earned them international awards and grant money, shoring up their pools of resources and ensuring that they’ll be able to continue making waves in the field of genetics for years to come.

“This is one of the best programs in Russia,” CDIB Assistant Director Maria Kolesnikova said of the fellowship. “It has been transformed from a contest organized by the Dynasty foundation and has maintained a minimal level of bureaucracy and top-notch expertise.”

Making sense of transcription factors

Vorontsov has zeroed in on how certain variations can affect the initiation of transcription. These variations include mutations – or changes in genetic structures – and polymorphisms – or genetic variations within a population.

“Most mutations and polymorphisms occur in non-coding regions of DNA and thus cannot affect protein sequences. But such mutations can disrupt or create binding sites in gene promoters or enhancers, thereby altering gene expression patterns,” Vorontsov said in a recent interview.

In some cases, these variations can have a dramatic impact on an organism’s life and wellbeing, while in others they have no effect at all.

Vorontsov’s immediate goal is to develop a methodology for distinguishing the harmful non-coding variations from the benign.

To identify points at which transcription factors bind to DNA, he and his team rely on public data generated through ChIP-sequencing, a method used to analyze how proteins interact with DNA.

However, they have only developed an understanding of a tiny portion of the binding preferences related to certain cell types.

“In order to carry out ChIP-sequencing, someone must perform experiments with respect to upwards of a thousand transcription factors for any given tissue, which is not realistic,” Vorontsov said.

To work around this stumbling block, Vorontsov and his team are developing a computational method to deduce transcription factor binding sites in a given cell type based on binding sites in other cell types.

Vorontsov said that his biggest struggle in this project has been improving methods little by little, only to find at times through this process of trial and error that hypotheses the team had once embraced have become obsolete.

“In order to cope with these difficulties, we try to automate as many stages of our analyses as possible so that each of our attempts to enhance methods or to strengthen criteria can be done with as little manual work as feasible,” Vorontsov said.

He and his team have developed an algorithm that can estimate how certain variations might affect the likelihood of transcription initiation.

Among other things, this algorithm enables them to generalize ChIP-sequencing results for all tissues, thereby allowing them to predict the specific site where a given transcription factor will initiate the process in a certain tissue.
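
To give a rough sense of how a variation's effect on a binding site can be estimated, here is a minimal Python sketch. It is not the team's algorithm: it uses a generic textbook approach, scoring sequences against a position weight matrix with log-odds, and the motif frequencies, background probability and example sequence are all invented for illustration. It also scans only one DNA strand.

```python
import math

# Hypothetical frequency matrix for a 4-base binding motif (one dict per position).
# All numbers are invented for illustration.
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def site_score(site: str) -> float:
    """Log-odds score of one candidate site against the motif."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PFM, site))


def best_score(seq: str) -> float:
    """Best motif score over every window of the sequence (forward strand only)."""
    width = len(PFM)
    return max(site_score(seq[i:i + width]) for i in range(len(seq) - width + 1))


def variant_effect(seq: str, pos: int, alt: str) -> float:
    """Change in best score when the base at `pos` is replaced by `alt`."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    return best_score(mutated) - best_score(seq)


reference = "TTAGCTAA"                               # contains the motif AGCT at index 2
inside_site = variant_effect(reference, 3, "T")      # hits the motif
outside_site = variant_effect(reference, 0, "C")     # falls outside the motif
```

A negative value suggests the variation weakens the predicted binding site, while a value of zero suggests it leaves the best site untouched. Real methods must also decide which score changes are statistically meaningful, which this sketch ignores.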

Vorontsov and his team entered an ENCODE-DREAM challenge in 2016 – an international contest for the creation of an algorithm capable of predicting transcription factor binding sites – where they were selected as top performers.

A splicing causality conundrum

As mentioned above, Svetlichnyy focuses on the splicing process. More specifically, his research centers on the fact that one strand of DNA can create many different types of proteins depending on which sections are cut out before the remaining exons are spliced together. This is referred to as alternative splicing.
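
The combinatorics of alternative splicing can be sketched in Python as follows. The exon sequences and the choice of which exons are optional are made up for illustration; the sketch only enumerates the possible transcripts and says nothing about which ones a cell actually produces.

```python
from itertools import product


def isoforms(exons, optional):
    """Yield every transcript obtained by keeping or skipping the optional exons."""
    optional_idx = sorted(optional)
    for choices in product([True, False], repeat=len(optional_idx)):
        # Indices of optional exons that are skipped in this combination.
        skipped = {i for i, keep in zip(optional_idx, choices) if not keep}
        yield "".join(e for i, e in enumerate(exons) if i not in skipped)


# Hypothetical gene with four exons; the two middle exons may be skipped.
exons = ["AUGGCU", "AAA", "UGG", "UAA"]
variants = list(isoforms(exons, optional={1, 2}))
```

With two optional exons, a single gene already yields four distinct transcripts, and the number doubles with each additional optional exon.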

He and his team are striving to shed light on the murky world of alternative splicing regulation: why are some sections cut out while others are left to create proteins, and what factors determine the outcome of this process?

In conducting his research, Svetlichnyy has focused on several cancer cell lines. Because cancer is a global plight that every country wishes to eradicate, laboratories around the world have poured vast resources into studying the disease. As such, an abundance of high-quality data produced by top-flight laboratories is publicly available for research.

The wealth of available information on cancer cell lines opens new avenues for researchers: “We have genome and transcriptome [Ed: the full range of mRNA molecules expressed by an organism’s genes] data for cancer patients. This allows us to study how variations occurring in cancer may affect the regulation of alternative splicing,” Svetlichnyy said.

In addition to relying on Machine Learning methods to make efficient use of the available data, Svetlichnyy and his team are also working to develop a Machine Learning model to predict the impact of mutations that occur in cancer patients on the binding of certain proteins, and to link these changes with the mRNA splicing process.

In the several months that he has been involved in the project, Svetlichnyy has developed a general understanding of which RNA binding proteins are responsible for the regulation of alternative splicing, and has identified several such regulators.

“Out of 80 candidates, we have identified four or five regulators which we believe are very much responsible for the regulation of exon exclusion or inclusion from the transcript,” he said.

But knowing which proteins are responsible isn’t enough; the team is bracing for an uphill battle in the quest to understand the complex causality at play when different types of proteins interact during the regulatory process.

“We’re trying to understand how one factor can affect the binding of another factor, how they work together, which factor had the primary role in the binding and which factor facilitated the binding of another one,” he said. “So from a biological standpoint, causality issues are probably the most important things to focus on in this project.”

But Svetlichnyy believes that the abundance of available information will help him and his team surmount these obstacles. “We believe that existing high-throughput datasets (eCLIP-seq), generated in the framework of the ENCODE project, coupled with the power of machine-learning algorithms and statistics will help us to figure out the rules governing the combinatorial binding of proteins to RNA and the combined effect it has on the splicing process,” he said.

Svetlichnyy’s immediate goal is to focus on several key regulators in order to develop sound evidence that certain proteins are correlated in terms of alternative splicing and subsequent gene expression.

Research for the greater good

Both young researchers are committed to the ideal of improving our understanding of the genetic causes and effects of cancer and other illnesses.

Relying on the algorithm that set Vorontsov and his team ahead of the pack at ENCODE-DREAM, as well as the testing of the statistical properties of transcription factor binding sites, Vorontsov has devoted his attention to studying selection pressure with respect to binding sites in cases of cancer.

“In the long term, I’m going to study the tolerance of binding sites to specific mutational processes so that we can predict which binding sites will be broken in particular cancer types,” he said.

“The ultimate goal of my work is to reduce the set of known variations that could potentially cause diseases like cancer and type 2 diabetes. Ideally, the set of known variations will ultimately be small enough for biologists to verify them experimentally. This will help to determine the mechanisms behind these diseases,” he added.

In the longer term, Svetlichnyy hopes that his research will play a role in fine-tuning our understanding of certain types of cancer.

“We hope to make sense of exactly how cancer can affect the splicing of related genes, so as to find examples of genomic variations and determine whether these genes are known to be involved in different types of cancer,” he said.

“We will try to understand for a particular cancer type if there is a set of genes affected by a mutation and how that mutation affects splice variants,” he added, noting in particular that such research could help scientists connect particular mutations with the creation of cancer-related proteins.

By linking mutations to cancer types, researchers could ultimately help patients who are genetically predisposed to these and other maladies take prophylactic measures.