Plagnol lab's blog

Friday, 25 November 2016

I am delighted to see on BioRxiv.org the publication from my PhD student Jack Humphrey. Jack works in collaboration between the UCL Genetics Institute and the UCL Institute of Neurology.

Jack's work, during the first year of his PhD, has focused on understanding the molecular mechanisms behind devastating neurodegenerative disorders like ALS and FTD. These neurodegenerative disorders raise interesting questions for computational biologists. Many of the variants (mostly rare) associated with these disorders implicate genes that play a role in RNA processing, for example RNA binding proteins. However, this is far from the whole story and many questions remain about the precise molecular mechanisms that initiate the disease process.

Back in August 2015, Ling et al published in Science an exciting paper showing that when a gene called TDP-43 is not expressed at the usual level, small sections of the genome become unexpectedly expressed, as if new uncontrolled exons were appearing in the genome. TDP-43 is implicated in ALS and FTD in two ways. Firstly, rare variants in that gene are associated with familial forms of ALS-FTD, which provides a reliable causal relationship between that gene and disease. Secondly, TDP-43 inclusions, i.e. accumulation of that protein, are almost systematically found in the post mortem brains of ALS and FTD patients.

This Ling et al paper is obviously interesting because this cryptic exon observation could provide a mechanism linking TDP-43 and disease. Not taking anything away from that publication, the results are not highly quantitative in the sense that the authors did not make full use of the biological replicates, a pillar of quantitative biology. We felt that this publication opened many interesting questions that were worth investigating using a computational approach, which would provide a good starting point for Jack's PhD.

The BioRxiv publication looks at available TDP-43 datasets and mines these data to better quantify the cryptic exon mechanism. Jack's analysis confirms that impaired competitive RNA binding between TDP-43 and other splicing factors is the most likely mechanism leading to cryptic exons. In addition, Jack shows that the genes containing these cryptic exons are also, on average, down-regulated, most likely because of nonsense-mediated decay induced by these cryptic exons. This down-regulation suggests that the impact of cryptic exons on the regulatory machinery is potentially large, strengthening the view that this specific mechanism is a plausible candidate for pathogenesis.

Taken together, these results confirm many of the findings of Ling et al and provide new directions. We plan to take these results to the next stage and see how that signal can be detected in other neurodegenerative disease models that are not directly TDP-43 related. Most importantly, we will be looking at human data to assess whether similar signals can be identified. However, the challenges in human post mortem brains are much greater, mostly because of the much more degraded RNA and the extent of neuronal loss in late stage patients. Time will tell how far we will be able to go with these questions.

Wednesday, 9 July 2014

Kitty Lo, a postdoctoral researcher in my group at UCL, just published a new bioinformatics application note and an associated R package called RAPIDR (available on CRAN) to streamline the bioinformatics of non-invasive pre-natal testing (NIPT). It is an opportunity to advertise the package and say a bit more on NIPT and what is happening in the UK with regard to this relatively new technology.

What is non-invasive pre-natal testing (NIPT) and how does it work?

Much has been written on the impact of DNA sequencing technologies on people’s lives, and some fields like genetic diagnosis for rare disorders clearly have been radically changed. But (thankfully) this type of application only concerns a small fraction of the general population. There is one application, however, that has the potential to affect thousands of people: the ability to diagnose chromosomal abnormalities, such as Down syndrome, from the blood of the pregnant mother.

While the technical aspects are clearly non-trivial, the concept is relatively simple. It turns out that there are bits of DNA floating in the plasma of the mother. Some of these DNA fragments originate from the fetus (typically 10%, but this number varies quite a lot). If the fetus has an extra chromosome 21, sequencing this DNA will yield more reads mapping to chromosome 21 than for a normal fetus. Statistics can detect this, and generate a firm diagnosis for the parents.
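To make the read-counting idea concrete, here is a minimal Python sketch of one common way to frame it: compare the share of reads on chromosome 21 in a test sample with the same share in a set of unaffected reference samples, and express the difference as a z-score. The function names, the made-up read counts and the cut-off of 3 are assumptions for illustration only; this is not the code of any particular NIPT pipeline.

```python
from statistics import mean, stdev

def chr21_fraction(chr21_reads, total_reads):
    """Share of all mapped reads that fall on chromosome 21."""
    return chr21_reads / total_reads

def chr21_zscore(test, references):
    """Z-score of the test sample's chr21 share against reference samples.

    `test` and each element of `references` are (chr21_reads, total_reads)
    tuples; the references are assumed to come from unaffected pregnancies.
    """
    ref_fractions = [chr21_fraction(*r) for r in references]
    return (chr21_fraction(*test) - mean(ref_fractions)) / stdev(ref_fractions)

# Hypothetical counts. With a 10% fetal fraction, a trisomy raises the
# chr21 share by roughly 5% (three fetal copies instead of two).
references = [(13000, 1000000), (13050, 1000000), (12950, 1000000),
              (13020, 1000000), (12980, 1000000)]
z = chr21_zscore((13650, 1000000), references)
flagged = z > 3  # assumed cut-off, not a clinical recommendation
```

The point of the z-score is that the excess is tiny in absolute terms, so it only becomes visible once it is measured against the sample-to-sample variability of the references.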

The key advantage of this non-invasive test of course is to avoid the invasive test, which carries a small but significant risk of miscarriage. It should be possible to test more women, hence increase the ability to detect Down syndrome while avoiding the risk associated with existing methodologies. There are also fierce and ongoing legal battles around the intellectual property associated with these tests but I am not best placed to comment on this.

NIPT in the UK: the RAPID project

Much of the development of NIPT for aneuploidy has been driven by US private companies. As a consequence, a large proportion of tests currently performed in the UK simply ship samples to the US for analysis. Furthermore, NIPT is currently only available through the private sector in the UK. Implementing it in routine clinical practice in a public sector health service will require more evaluation to see where in the care pathway it might fit, how we educate women and health professionals, the health economic aspects etc. The RAPID programme, which is funded by the NIHR PGfAR, is now evaluating some of these aspects and will report to the National Screening Committee in the next year.

As part of this project, my group (and in particular postdoctoral researcher Kitty Lo) collaborates with the chief investigator of the RAPID programme Prof Lyn Chitty in the development of the bioinformatics and statistical aspects of the RAPID project.

Which women participate in the RAPID evaluation?

The RAPID project is working with selected maternity units located in the South of England and, more recently, Dundee. As traditionally done in the NHS, all pregnant women who opt for Down syndrome screening receive a risk based on the routine 12-week scan. Current NHS guidelines offer the option of an invasive test to women for whom the estimated risk is greater than 1/150 (about 3% of pregnancies). In the context of the evaluation study, NIPT is being offered to women undergoing Down syndrome screening who receive a risk of 1/1,000 or greater; we estimate that this will be around 12% of pregnancies. A blood sample is then obtained and sent to the Regional Genetics Laboratory, where cell-free DNA is extracted and sequenced. Research midwives give the result of the NIPT to the mother. If NIPT suggests a chromosomal abnormality, an invasive test is recommended as there are a number of reasons why there may be discordance between the result obtained from maternal blood and the karyotype of the fetus.

Does it work?

Remarkably well. It is already well understood that the science behind NIPT is robust and reliable but, as discussed above, the NHS needs data that go beyond laboratory performance before deciding whether and how to implement this on a large scale. The study seems to be going well and is welcomed by women. An exhaustive description of the outcome of the study will be published in a few months.

Bioinformatics and the next steps

The bioinformatics are relatively simple but nevertheless need to be performed correctly. We think that an open source R package that researchers can use and modify is a useful step toward making the technique reliable and widespread.

A key challenge is the noise of sequencing assays, with parameters such as GC content of the DNA sequence creating noise and potential false positives. The RAPIDR package has been designed to correct for this using several previously published strategies. It is also designed to generate quality control measurements to identify unreliable samples, which are another source of false positives. The package is freely available on CRAN and we are keen to see it used, so please get in touch if you are interested in trying it out. Ongoing work tackles the much more challenging issues of smaller abnormalities (partial chromosome deletions or duplications). These are harder to detect but are nevertheless important for the parents to make informed choices.
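As an illustration of what a GC correction can look like, here is a minimal Python sketch of one simple strategy: group genomic bins by GC content and rescale the read counts so that every GC stratum ends up with the same median. This is only one of several possible approaches; the function name, the bin layout and the stratum width are assumptions, and this is not a description of what RAPIDR does internally.

```python
from collections import defaultdict
from statistics import median

def gc_correct(bin_counts, bin_gc, gc_step=0.01):
    """Rescale per-bin read counts so all GC strata share the same median.

    `bin_counts` holds the read count of each genomic bin and `bin_gc` the
    GC fraction (0 to 1) of the same bins. `gc_step` is the assumed width
    of a GC stratum.
    """
    strata = defaultdict(list)
    for count, gc in zip(bin_counts, bin_gc):
        strata[round(gc / gc_step)].append(count)

    global_median = median(bin_counts)
    stratum_median = {key: median(values) for key, values in strata.items()}

    corrected = []
    for count, gc in zip(bin_counts, bin_gc):
        m = stratum_median[round(gc / gc_step)]
        # Leave the bin untouched if its stratum has a zero median.
        corrected.append(count * global_median / m if m > 0 else count)
    return corrected

# Hypothetical bins: GC-rich bins attract twice as many reads.
counts = [100, 110, 90, 200, 220, 180]
gc = [0.40, 0.40, 0.40, 0.50, 0.50, 0.50]
corrected = gc_correct(counts, gc)
```

After the correction, bins that differ only in their GC content have comparable counts, so a real change in copy number is less likely to be masked (or mimicked) by the GC bias.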

There seems little doubt, based on worldwide experience, that NIPT is the future for pre-natal diagnosis. Implementation into routine maternity care will no doubt take time but hopefully our ongoing work (in the lab as well as for the bioinformatics) will help expedite these changes.

Sunday, 26 January 2014

With clinical colleagues at UCL, we are looking for motivated applicants to take on three job opportunities with a significant computational/statistical/bioinformatics component. I am involved in each of these projects but the main lead is clinical. Two are at the postdoctoral level and one at the PhD level. The closest deadline is the PhD application (January 31st, which is this coming Friday); the two postdoc deadlines are later and the positions are about to be advertised. Do not hesitate to contact either myself or one of the other listed investigators directly if interested.

Three year postdoc: Transcriptome analysis in blood and iPS cells of retinitis pigmentosa (RP) patients

This project has the strongest mol bio/wet lab component but the applicant should either have or be willing to gain familiarity with the analysis of RNA sequencing data. It will investigate retinitis pigmentosa (RP), a Mendelian disorder that causes retinal degeneration and blindness. It is caused by mutations in many specific genes that affect the development and/or maintenance of rod photoreceptors, the most abundant light-transducing cell in the human retina. Although many genes have already been identified, discovering the biological link between mutant or absent protein on the one hand, and specific dysfunction and death of rod photoreceptors on the other, is challenging. Without understanding the detailed molecular pathology, novel treatments will not be possible.

The project benefits from a large resource of such patients and families managed at Moorfields Eye Hospital and investigates the pathophysiology of those affected by mutations in splicing factors. These comprise a significant proportion of affected patients including RP11, RP9, RP13 and RP18. The underlying genes encode proteins that make up the spliceosome, and are expressed in all eukaryotic cells. One enigmatic feature of these disorders is the fact that heterozygous mutation (they are all autosomal dominant disorders) causes specific problems with the retina and not, as far as is known, any other organ or cell-type. One further interesting feature of these disorders is the variability in manifestation, and sometimes non-penetrance, in gene-carriers. By investigating this further, ameliorating factors might be identified.

The project plans to address these questions by the thorough analysis of the transcriptome of affected gene-carriers compared to those gene-carriers without disease and controls (ethnically matched non-carriers). We intend to use RNA-Seq experiments to explore this comprehensively, in collaboration with Dr Vincent Plagnol and colleagues at UCL Genetics Institute. Secondly, we intend to generate iPS cells from patients’ skin biopsies and thereafter retinal pigment epithelial cells and photoreceptor progenitor cells to explore their transcriptomes and phenotypes. This will be under the supervision of Professor Pete Coffey and team at UCL Institute of Ophthalmology.

This represents an opportunity to understand eukaryotic splicing and its aberration in human retinal disease. It will require a willingness to generate and work with RNA-Seq data, understand human inherited disease and be able to develop iPS and derived cells in cell culture. Hence the successful candidate will gain a wide experience in molecular and cell biology as well as bioinformatics.

Three year postdoc: Genetics of epilepsy

Investigator: Sanjay Sisodiya, UCL Institute of Neurology and myself

Funding: Wellcome Trust, European Union, MRC

Epilepsy refers to a complex and heterogeneous set of disorders, and its etiology remains hard to elucidate. Sanjay Sisodiya at the UCL Institute of Neurology (IoN) leads a successful research program that investigates the genetic basis of these diseases and how this genetic basis affects response to treatments. Cases that are part of the study are very deeply phenotyped, including novel experimental techniques and extensive long term follow-up. High throughput sequence data (a lot of exome sequences, > 200 whole genome sequences) are being generated and are now becoming available. We are looking for a postdoctoral researcher to work on these data. There will be plenty of scope to develop new bioinformatics/statistical methods. The size of the dataset continues to increase and a lot of additional information about these patients is also available, which provides an opportunity to tackle pharmacogenomics questions.

Four year PhD studentship: Systems biology for graft versus host disease (GVHD)

This project uses a systems biology approach and mouse models to investigate the basis of graft-vs-host disease, i.e. the process of donor immune cells attacking the host organs, a severe complication following bone marrow transplant. As part of our research program, we are tracking clonal or polyclonal donor T cell responses across multiple sites in pre-clinical models of GVHD and evaluating transcriptional profiles and/or T cell receptor (TCR) sequences of purified cell populations. The student will use computational methods in a ‘dry lab’ environment (in collaboration with V Plagnol) to assess these microarray and TCR sequence data and to evaluate the extent to which GVHD is driven by TCR repertoire-dependent or independent factors. Systematic methods will be used to compare differentially expressed genes in our models with (1) multiple ChIP-seq maps that provide a genome-wide view of transcription factor binding sites and (2) extensive, publicly available gene expression datasets for human and murine T cells in other inflammatory conditions. These approaches will be used to identify novel regulatory or downstream effector pathways implicated in GVHD. Extensive cross talk between ‘dry’ and ‘wet’ lab researchers in our LLR program will permit further experimentation or data analysis as new hypotheses are formulated and tested.

Tuesday, 12 November 2013

With a few colleagues, we published in 2012 a paper describing ExomeDepth, an R package designed to call CNVs from exome sequencing experiments. The ExomeDepth paper has received some citations (not hundreds, but still something substantial), which is obviously something I am happy to see. When a user asks questions, I always feel like there may be others with the same issues. Therefore, I decided to put together a small FAQ which I will edit as time goes by and when the package is updated. The main point I want to make is really that I encourage people to get in touch with me and report issues and results that seem incorrect. I am really happy to help if I can.

Q: ExomeDepth is advertised for exome sequencing. But can it be used on smaller targets than exomes?

Yes, it will work without any substantial difference, but you need a substantial number of genes (or rather exons) to get the parameters estimated properly. I guess that ~ 20 genes will do the job.

Q: How do I know if my ExomeDepth results look fine?

For exome data, I like to look at the overlap with the common CNVs flagged in the Conrad et al paper. This dataset is already part of the ExomeDepth package and the vignette shows how to use it. In my initial tests using exome data generated by the BGI, about 70-80% of CNV calls had at least one Conrad CNV overlapping at least 50% of the call. In these cases I had about 180 CNV calls per sample, two thirds of them deletions. This is really the best case scenario and in such situations the CNV calls are quite reliable. Anything close to these numbers probably means that ExomeDepth ran properly and gave a meaningful output.
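For readers who want to compute this kind of overlap statistic outside of the package, the logic fits in a few lines. The Python sketch below is a rough illustration and not the ExomeDepth annotation code: the tuple layout, the half-open coordinates and the function names are assumptions, and the coordinates are made up.

```python
def overlap_fraction(call, known):
    """Fraction of a call covered by one known CNV.

    Both arguments are (chromosome, start, end) tuples with half-open
    coordinates, i.e. the interval covers start up to but excluding end.
    """
    if call[0] != known[0]:
        return 0.0
    shared = min(call[2], known[2]) - max(call[1], known[1])
    return max(shared, 0) / (call[2] - call[1])

def fraction_of_calls_supported(calls, known_cnvs, min_overlap=0.5):
    """Share of calls with at least one known CNV covering >= min_overlap."""
    supported = sum(
        any(overlap_fraction(call, known) >= min_overlap for known in known_cnvs)
        for call in calls
    )
    return supported / len(calls)

# Hypothetical coordinates, for illustration only.
calls = [("1", 100, 200), ("1", 500, 600), ("2", 100, 200)]
known = [("1", 120, 300), ("1", 590, 700)]
support = fraction_of_calls_supported(calls, known)
```

Note that the overlap is measured relative to the length of the call, not of the known CNV, which matches the "overlapping at least 50% of the call" criterion described above.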

If you consider a smaller set of targeted genes, it is not absurd (but a very rough approximation) to get a prior estimate of the number of CNV calls by scaling the genome-wide number down to the smaller gene set of interest to you (i.e. 200 genes should give about 2 CNV calls per sample on average, but of course it will depend on the genes being targeted).
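The arithmetic behind this rough scaling is simply a proportion. In the Python sketch below, the total of 20,000 protein-coding genes is my own round-number assumption, and the 180 calls per sample is the exome-wide figure quoted above.

```python
def expected_cnv_calls(targeted_genes, genome_wide_calls=180, total_genes=20000):
    """Very rough prior for the number of CNV calls in a targeted panel.

    Scales the exome-wide number of calls by the share of genes targeted.
    `total_genes` is an assumed round number, not a measured quantity.
    """
    return genome_wide_calls * targeted_genes / total_genes

calls_for_200_genes = expected_cnv_calls(200)  # 180 * 200 / 20000
```

This ignores gene size and the uneven distribution of common CNVs across the genome, which is why it should be treated as an order of magnitude only.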

Q: But sometimes things really don’t look right. My number of CNVs is off, and the overlap with common known CNVs is too small. Why is that?

It may be an issue with the data. For example the correlation between the test sample and the combined reference must be very high (correlation coefficient at least 0.98 is recommended) otherwise the inference simply cannot work. In the next version of the package I plan to return warnings when that is not the case.
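Until such warnings exist, the check is easy to do by hand on the per-exon read counts. Here is a minimal Python sketch of the idea, computing the Pearson correlation between the test counts and the combined reference counts. The function names and the example counts are hypothetical, and this is not part of the ExomeDepth API.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def reference_is_adequate(test_counts, reference_counts, threshold=0.98):
    """True if test and reference read counts correlate above the threshold."""
    return pearson(test_counts, reference_counts) >= threshold

# Hypothetical per-exon read counts.
test_counts = [100, 200, 300, 400, 500]
good_reference = [110, 190, 310, 390, 520]
poor_reference = [300, 100, 500, 200, 400]
ok = reference_is_adequate(test_counts, good_reference)
not_ok = reference_is_adequate(test_counts, poor_reference)
```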

This being said, I have seen data that look right by any obvious measure but which were clearly returning too few CNVs (about 40-50 per sample, and not a great overlap with the Conrad set). I do not have an answer to explain such poor results. Correlations between tests and reference looked high… just odd. I am wondering if it is an issue with recent library prep and capture experiments. This is something I need to investigate, and feedback is welcome.

CoNIFER (as far as I understand, and the authors should correct me otherwise!) corrects for differences between the "test" exome and the "reference" ones using a principal component type analysis. It is a perfectly valid approach of course. However, it probably requires a substantial number of exome samples to do this correction reliably. Moreover, my personal view is that usually these technical artifacts are so complex and difficult to correct genome-wide that this task is really challenging. ExomeDepth takes the alternative approach of mining through the "reference" exomes to find the one(s) that are best matched to the "test" exome, and performs a comparison between both. So rather than correcting for differences, I propose to start with something with minimum variability. Which one works best will probably depend on the data, but I think CoNIFER tends to be more conservative. It may be possible to combine both strategies in fact.

Q: I have just read a paper reporting what looks like a high false positive rate. Is that true?

It goes without saying that I am not very convinced (nor happy!) with this paper. First of all, ExomeDepth expects exomes generated from the same batch, and I have no idea what was actually done in that case. The authors may have done something a bit silly with the tool, like comparing exomes generated at different dates (or perhaps not, I just can’t say). They also report the validation rate on the whole set of CNVs (1.4K CNVs in 12 samples, which seems about right). But they compare numbers with CoNIFER which reports a much smaller number (I see n = 38 in Table 2). CNV calls returned by ExomeDepth come with a Bayes factor that supports the calls. The higher the Bayes factor, the more confident the call is. If the authors had ranked CNVs by Bayes factor and reported validation for the confidently called large CNVs, I presume things would have been much closer to the CoNIFER numbers. The next vignette will highlight the use of Bayes factors, as I think this point is not sufficiently advertised.
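To show what I mean by ranking, here is a minimal Python sketch that keeps the large, confidently called CNVs and sorts them by Bayes factor. The dictionary keys, the thresholds and the example calls are all assumptions made for the illustration; sensible cut-offs will depend on the data.

```python
def confident_calls(calls, min_bayes_factor=20.0, min_exons=3):
    """Keep large, well-supported CNV calls, most confident first.

    Each call is a dict with (hypothetical) keys "id", "BF" for the Bayes
    factor and "nexons" for the number of exons spanned by the call.
    """
    kept = [
        call for call in calls
        if call["BF"] >= min_bayes_factor and call["nexons"] >= min_exons
    ]
    return sorted(kept, key=lambda call: call["BF"], reverse=True)

# Made-up calls, for illustration only.
calls = [
    {"id": "a", "BF": 5.0, "nexons": 1},
    {"id": "b", "BF": 25.0, "nexons": 4},
    {"id": "c", "BF": 50.0, "nexons": 10},
]
top = confident_calls(calls)
```

A validation rate computed on the top of such a ranked list is a much fairer basis for comparison with a more conservative tool than a rate computed on every call.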

I have no problem with comparisons, negative or positive, I just feel like it’s a bit random in this case. I can absolutely see that ExomeDepth will sometimes deliver unreliable results, but I don’t think this paper shows the pros and cons well.

Q: I never see a CNV call in the first exon of a chromosome. Is this a bug?

Not really. The explanation for this is that I had issues with CNVs overlapping chromosome boundaries. The whole downstream analysis, in particular the annotation steps, was negatively affected. Therefore, in the coding of the underlying Markov chain, I decided to force an absence of CNV (i.e. normal copy number 2) at the start of each chromosome. It is not an optimal way to address the problem but at least it is clean to code. This is not a high priority but some fix should be implemented at some point.

Q: Will ExomeDepth be updated? And when?

I plan to. For one thing some packages ExomeDepth depends on (like aod) have changed the syntax. I will probably need to update ExomeDepth using aods3 instead. While there are a few small things to improve, I would like to add more diagnostic tools to understand when and why ExomeDepth returns results with lower accuracy. That is the main thing to do really, other things are more minor incremental improvements.

Monday, 20 May 2013

In this latest paper (just submitted to arXiv, led by PhD student Claudia Giambartolomei) we want to answer the following question: given two genetic association studies both showing some association signal at a locus, how likely is it that the same variant is responsible for both associations?

We care about this because a shared causal variant is likely to imply an etiological link between the traits being considered. An obvious application consists of comparing a gene expression study and a disease trait. If one can show that the same variant is affecting both measurements, then it is very likely that the expression of this gene is affecting disease pathogenesis. It also provides information about the tissue type where the effect is mediated. This is key information for a drug design process.

Previous work that led to this manuscript

A while back, I started a discussion with my colleague (and co-author on this manuscript) Eric Schadt about the involvement of a gene named RPS26 in type 1 diabetes. We came up with tests of co-localisation, which were later improved by my colleague (and co-author as well) Chris Wallace, based in Cambridge. These tests are somewhat dated now. The earliest version considered situations with a very small number of SNPs, and was not well suited for densely typed regions, in particular as a result of imputation procedures.

This SNP density problem can be overcome to some extent, and Chris Wallace discusses how to do this here. However, a more fundamental issue is the Bayesian/frequentist difference. These earlier tests were testing the null hypothesis of a shared causal variant. Failing to reject the null could be the result of either a lack of power, or a true shared causal variant. In this newer Bayesian framework, the probability of each scenario is computed, including the “lack of power” case. It then becomes easier to interpret the outcome of the test. The tests are about to be released in the latest version of the coloc package (which is maintained by Chris Wallace).

In this latest paper, the underlying model is closely linked to the one proposed by Matthew Stephens and colleagues in a recent PLoS Genetics paper. However, co-localisation was more a side story in this paper, whereas it is the central point of our work. In particular, we show that it is possible to use single SNP P-values to obtain a very good approximation of the correct answer. As discussed below, this has important practical applications.
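To give a flavour of how single SNP P-values can be turned into a probability for each scenario, here is a rough Python sketch. It converts each P-value into a Wakefield-style approximate Bayes factor and then combines these across SNPs, assuming at most one causal variant per trait in the region. Everything specific in it is an assumption made for illustration: the prior probabilities, the prior standard deviation of the effect, the variance inputs and the function names. It is not the API of the coloc package.

```python
from math import exp, log, log1p
from statistics import NormalDist

def log_abf(p_value, variance, prior_sd=0.15):
    """Log approximate Bayes factor for one SNP from its two-sided P-value.

    `variance` is the variance of the effect estimate and `prior_sd` the
    assumed prior standard deviation of the true effect.
    """
    z = -NormalDist().inv_cdf(p_value / 2)
    r = prior_sd ** 2 / (prior_sd ** 2 + variance)
    return 0.5 * (log(1 - r) + r * z * z)

def log_sum(values):
    """log(sum(exp(v))) computed without overflow."""
    top = max(values)
    return top + log(sum(exp(v - top) for v in values))

def log_diff(x, y):
    """log(exp(x) - exp(y)), or -inf when the difference is not positive."""
    ratio = exp(y - x)
    return x + log1p(-ratio) if ratio < 1.0 else float("-inf")

def colocalisation_posteriors(p1, p2, var1, var2,
                              prior1=1e-4, prior2=1e-4, prior12=1e-5):
    """Posterior probability of five scenarios for one region.

    H0: no association, H1: trait 1 only, H2: trait 2 only,
    H3: both traits, distinct variants, H4: both traits, shared variant.
    """
    l1 = [log_abf(p, v) for p, v in zip(p1, var1)]
    l2 = [log_abf(p, v) for p, v in zip(p2, var2)]
    both = log_sum([a + b for a, b in zip(l1, l2)])
    log_h = {
        "H0": 0.0,
        "H1": log(prior1) + log_sum(l1),
        "H2": log(prior2) + log_sum(l2),
        "H3": log(prior1) + log(prior2)
              + log_diff(log_sum(l1) + log_sum(l2), both),
        "H4": log(prior12) + both,
    }
    total = log_sum(list(log_h.values()))
    return {h: exp(v - total) for h, v in log_h.items()}

# Hypothetical region of five SNPs, same top SNP for both traits.
variances = [0.01] * 5
posteriors = colocalisation_posteriors(
    [0.5, 0.3, 1e-10, 0.2, 0.6], [0.4, 0.6, 1e-10, 0.5, 0.7],
    variances, variances)
```

The key design point is that every SNP in the region contributes, including those that are flat for one trait: a SNP strongly associated with one trait but not the other moves probability from the shared-variant scenario to the distinct-variants scenario.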

Another closely related work is the software Sherlock. Sherlock also uses P-values, and also tries to match a gene expression dataset with another GWAS. However, Sherlock does not really perform a co-localisation test but rather a general matching between a gene and a GWAS. In particular, in the Sherlock framework, only the variants significantly associated with gene expression contribute to the final test statistic. In contrast, a variant flat for the expression trait but strongly associated with disease provides strong support against co-localisation. Our work incorporates this information, by adding support to the “distinct association peaks” scenario.

A warning about the interpretation

As always in statistics, correlation does not imply causality. And what we quantify here are correlations. We can find very strong evidence that the same variant is affecting two traits, but what we cannot conclude without doubt is that the two traits (say, expression of a gene and disease outcome) are causally related. It may be likely, but we are not testing this.

An illustration of the complexity of this is the commonly observed case where a single variant (or haplotype) appears to affect the expression of a group of genes in the same chromosome region. Our test may, in such a situation, provide strong evidence of co-localisation for several of these genes with a disease GWAS. However, most of the time the expression of only one of these genes will actually causally affect the disease trait of interest. It does not mean that the test is wrong but one just has to understand what it is actually testing. Precisely, two traits affected by the same causal variant may suggest a causal link between both, but it does not have to be the case.

Two limitations of this approach

There are two additional limitations to mention. One is that the causal variant should be typed or imputed. We use simulations to show that if this is not the case, the behaviour of the test becomes very conservative.

A second issue is the presence of more than one association for the same trait at a locus. If both associations have approximately the same level of significance, the test can misbehave. In addition, identifying co-localisation with the secondary association requires conditional P-values. We give a nice example of this in the paper. However, if only P-values are available (which is key for what we want to do), this requires using approximate methods. Things are much easier if the genotype level data are available and a proper conditional regression can be implemented.

Why it is important to use summary statistics

Data sharing is always a contentious issue in human genetics. I am incredibly frustrated by the lack of willingness displayed by some groups to share data, even though the claim is that they do. It is a topic for another post. Eric Schadt has been extremely helpful by sharing the liver gene expression dataset with us, but this is rather uncommon behaviour. In most cases, data are hidden behind various “regulations” and “data access committees” that rarely meet and extensively delay the process of data sharing.

Given this frustration, being able to base tests on P-values makes it much easier to interact with other groups and share data. The success of large scale meta-analyses is an example of this. This is why we worked out the statistics so that P-values alone are sufficient to derive the probabilities for each scenario.

A practical implication is that it becomes possible to build a web-based server that will take P-values uploaded by users, compare these P-values with a set of GWAS datasets stored on the server (typically expression studies but perhaps other data types) and return statistics about the overlapping association signals.

We have initiated that process and the coloc server is now live (http://coloc.cs.ucl.ac.uk/coloc/), with a lot of help from the Computer Science department at UCL. We have only loaded the liver dataset that we used in this preprint as of now, but we are in the process of adding a brain gene expression study, led by my colleagues Mike Weale, John Hardy and Mina Ryten. We very much welcome collaborations, and if other datasets, for gene expression or any other relevant traits, are available, we would love to collaborate and incorporate these data into our server.

From genome-wide to “phenome-wide”

What we really want to do with this tool in the near future is mine dozens of GWAS studies using single variant P-values summary data, and search for connections that have been missed by previous investigators. Perhaps there are lipid traits that can be linked to neurodegenerative conditions, like the well known APOE result? Perhaps some T cell genes have an unexpected effect on a cardiovascular trait? Obviously these are not likely events but the genome-wide analysis of many association studies is likely to show several results of this type. The idea is to not only work genome-wide but also “phenome-wide”, comparing as many pairs of traits as possible. Again, this is definitely a collaborative work and we would be excited if we could bring more datasets to make these comparisons more powerful. So don’t hesitate to get in touch.

Wednesday, 1 May 2013

We just advertised a new two year position available to work at the UGI on a range of medical genetics research projects. The main source of funding is a collaboration between University of Virginia, the Wellcome Trust Sanger Institute and the University of Cambridge on the genetics of type 1 diabetes. But this is only part of the story and the aim is to integrate the successful applicant into a broader research program that includes a collaboration with the Institute of Neurology on the genetics of ALS.

A range of applicants are welcome, and the applicant's field can range from computer science to more abstract mathematics. Some experience with programming and scientific computation is expected however.

The link for this application is here. Please get in touch with me if you are potentially interested in working with me at UCL and have any questions about the application process.

Why I was worried

This finding was really a bit of a miracle. As a statistician, I am the person supposed to tell clinicians that small underpowered association studies are pointless. This one pushed the limit quite far, with 40 patients in the discovery set. It was even a candidate study because the initial sequencing work only considered the brachyury gene. This gene is a strong candidate for chordoma owing to the role of duplications in familial forms of the disease. When my colleague Nischalan Pillay came to me with these results, I did not believe them at all. It took me a while and several failures at challenging the result to start believing. The addition of exome sequence data helped rule out technical artifacts. But even with the paper accepted, I must say I had doubts. Now it's OK.

What 23AndMe did

23AndMe is in this amazing position where they can gather information a posteriori on rare diseases like chordoma in a very large cohort of patients. They could rapidly put together a panel of 22 cases (and many controls obviously) that confirmed our result. They describe the work in their blog post which is well worth a quick read. I am really excited by the possibilities offered by the 23AndMe experimental design. There is really much to do with this resource. It is also a good but somewhat puzzling thought that any association result published can be almost right away challenged and checked by 23AndMe. This reminds me of the ongoing position of Decode which could publish an incredible amount of results by taking advantage of a resource that no one else could match.

What it means for chordoma research

The finding for the research on chordoma is quite significant. From a heritability point of view, it is obviously a massive chunk explained. A rs2305089-CC individual has almost no chance of developing chordoma, whereas a TT individual has substantially higher risk (about 25 fold compared to CC). In fact the 23AndMe description is a bit misleading when they state that CC represents the typical odds of developing chordoma: the variant is so common that the heterozygous group is probably a better representation of the typical odds. This being said the disease is still very rare, so even a TT genotype (like me in fact) should not panic. This is still a very unlikely disease to develop.

Now does that lead to a treatment? Far from that, and in fact the effect of that high risk variant may occur early in development and be absolutely impossible to target from a therapeutic point of view. But still, this is a possibility to explore and it is one important piece of the pattern of inheritance of bone cancers.

The 240 controls were chosen to not report any cancer and to provide the best ancestry match for the cases (as determined by the first 5 components of a PCA analysis). A simple Fisher exact test, which in R looks like this:

yields a two-tailed P-value of 0.047 (and half of that of course to test the exact replication hypothesis of the Nature Genetics paper). I note that the odds ratio is lower than our estimates (point estimate OR = 2 in this replication set). It may well be that we have a case of "winner's curse" for our Nat Gen paper but only the future will tell.
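For readers less familiar with the test, here is a minimal Python sketch of the mechanics of a two-sided Fisher exact test on a 2x2 table of counts. The counts in the example are made up and are not the 23andMe data, and the function names are my own.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        p for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)  # small tolerance for rounding
    )

def sample_odds_ratio(a, b, c, d):
    """Cross-product odds ratio; undefined when b or c is zero."""
    return (a * d) / (b * c)

# Made-up table: rows are cases and controls, columns are the two alleles.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
odds = sample_odds_ratio(3, 1, 1, 3)
```

Halving the two-sided value to get a one-sided test, as mentioned above, is only appropriate when the direction of the effect was specified in advance, which is the case for a replication.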

Very many thanks to Nick Eriksson and the 23andMe team for sharing these data.