Caveats: I have not taken notes in every talk of every session, a lack of notes for a particular speaker does not constitute disinterest on my part, I simply took notes for the talks that were directly related to my current work. If I have misquoted, misrepresented or misunderstood anything, and you are the speaker concerned, or a member of the team involved in the work, please leave a comment on the post, and I will rectify the situation accordingly.

Another talk from Mark who was an excellent chair for some conference sessions as well. One of the biggest problems with personalized medicine is that some data is already silo’d, or at very best fragmented.

In the UK getting science into clinical practice within the NHS is really predicated on the evidence that it reduces costs, is transformational in terms of treatment and adds value to the current system. So the bar is set quite high.

This was contrasted with the INCa Tumour Molecular Profiling Programme which is running in France with colorectal and lung cancers. This is drawing on 28 labs around Europe. INCa appears to be run under the auspices of the Institut National du Cancer.

Critical resource: http://www.e-cancer.fr/en

Mark felt that empowering patient advocacy was going to be an important drive in NHS uptake of new technologies and tests. But equally important was increasing personalized medicine literacy amongst GPs, policymakers and the insurance industry.

Nazneen is obviously interested in testing germline mutations, unlike much of the rest of the cancer programme, which was focused on somatic mutation detection. Consequently she is working with blood draws and not biopsy material.

There are >100 predisposition genes implicated in 40+ cancers and there is variable contribution depending on the mutation and the cancer type. 15% of ovarian cancers result from germline variants, and this falls to 2-3% of all cancers. For this kind of screening a negative result is just as important as a positive one.

On the NHS testing for about half these predisposition genes is already available but even basic BRAF testing is not rolled out completely so tests have ‘restricted access’.

What is really needed is more samples. Increased sample throughput drives ‘mainstreaming of cancer genetics’. And three phases need to be tested – data generation, data analysis and data interpretation.

Critical resource: http://mcgprogramme.com/

They are using a targeted panel (CAPPA – which I believe is a TruSight Cancer Panel) where every base must be covered to at least 50x, which means mean target coverage of samples approaches 1000x even for germline detection. There’s a requirement for a turnaround time (TAT) of <8 weeks, and both positive and negative calls must be made. It was acknowledged that there will be a switch to whole exome sequencing (WEX/WES) ‘in time’ when it is cheap.
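The 'every base at 50x' rule is a constraint on the minimum depth, not the mean, which is why the mean ends up so high. A minimal sketch of that check, assuming per-base depths for a target region are already available as a list of integers (the function names and example depths are mine, not from the talk):

```python
def bases_below_threshold(depths, min_depth=50):
    """Return the 0-based offsets in a region whose depth is below min_depth."""
    return [i for i, depth in enumerate(depths) if depth < min_depth]


def region_passes(depths, min_depth=50):
    """A region passes only if every base reaches min_depth."""
    return len(depths) > 0 and min(depths) >= min_depth


# Hypothetical depths: the mean is high, but one base still fails the rule.
example_depths = [812, 640, 55, 49, 1020]
failing = bases_below_threshold(example_depths)
mean_depth = sum(example_depths) / len(example_depths)
```

A region that fails on a handful of bases like this is presumably the kind of gap that ends up needing 'Sanger filling'.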

The lab runs rapid runs on a HiSeq 2500 at a density of 48 samples per run. This gives a capacity of 500+ samples per week (so I assume there’s more than one 2500 available!). 50ng of starting DNA is required and there is a very low failure rate. 2.5k samples have been run to date. 384 of these were for BRCA1/2. 3 samples have failed and 15 required ‘Sanger filling’.
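As a quick back-of-envelope check on the capacity figure, this is a sketch using only the two numbers quoted above. Run duration was not in my notes, so this gives runs per week rather than an instrument count:

```python
import math


def runs_needed(samples_per_week, samples_per_run):
    """Smallest whole number of runs that covers the weekly sample load."""
    return math.ceil(samples_per_week / samples_per_run)


weekly_runs = runs_needed(500, 48)
print(weekly_runs)  # 11
```

Eleven runs a week is more than one and a half runs a day, which is why I assume there is more than one instrument.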

In terms of analysis, Stampy is used as the aligner and Platypus for variant calling due to its superior handling of indels. A modified version of ExomeDepth is used for CNV calling, and internal development produced coverage evaluation tools and HGVS parsers. All pathogenic mutations are still validated with Sanger or another validation method.

Data interpretation is the bottleneck now. It’s intensive work for pathogenic variants, and VOUS (variants of unknown significance) are an issue – they cannot be analysed in a context independent fashion and are ‘guilty until proven innocent’ in the clinician’s mind.

They have also performed exome sequencing of 1k samples, and observed an average of 117 variants per individual of clinical significance to cancer and 16% of the population has a rare BRCA variant.

Nazneen prefers to assume that VOUS are not implicated in advance: we should stick to reporting what is known until such time as a previous VOUS is declared to be pathogenic in some form. But we should be able to autoclassify 95% of the obvious variants, reducing some of the interpretation burden. Any interpretation pipeline needs to be dynamic and iteratively improved, with decision trees built into the software. As such, control variant data is important: ethnic variation is a common trigger for VOUS, where the variant is not in the reference sequence but is a population level variant for an ethnic group.

Incorporating gene level information is desirable but rarely used. For instance information about how variable a gene is would be useful in assessing whether something was likely to be pathogenic – against a background which may be highly changeable vs. one that changes little.

Although variants are generally stratified into 5 levels of significance, they really need to be collapsed down into a binary state of ‘do something’ or ‘do nothing’. A number of programs help in the classification, including SIFT, PolyPhen, MAPP, Align-GVGD, NN-Splice and MutationTaster. The report also has Google Scholar link outs (considered to be easier to query sanely than PubMed).

To speed analysis all the tools are used to precompute scores for every base substitution possible in the panel design.
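The precompute idea can be sketched as enumerating every possible single base substitution across a target region and caching a score for each. This is my own illustration, not their pipeline: the `scorer` callable stands in for whichever prediction tool is being wrapped, and the coordinates are made up.

```python
BASES = "ACGT"


def all_substitutions(chrom, start, ref_seq):
    """Yield (chrom, pos, ref, alt) for every possible SNV in a region.

    start is the 1-based coordinate of the first base of ref_seq.
    Ambiguous reference bases (e.g. N) are skipped.
    """
    for offset, ref in enumerate(ref_seq.upper()):
        if ref not in BASES:
            continue
        for alt in BASES:
            if alt != ref:
                yield (chrom, start + offset, ref, alt)


def precompute(chrom, start, ref_seq, scorer):
    """Build a lookup table of scores keyed by substitution."""
    return {v: scorer(*v) for v in all_substitutions(chrom, start, ref_seq)}


# Hypothetical placeholder scorer; a real one would call a prediction tool.
table = precompute("chr17", 100, "ACN", lambda chrom, pos, ref, alt: 0.0)
```

Each base contributes three substitutions, so the table grows linearly with panel size and interpretation becomes a dictionary lookup rather than a fresh tool run.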

5.3 Timothy Caulfield, University of Alberta, Canada: “Marketing the Myth of Personalised Prevention in the Age of Genomics”

No notes here, but an honorable mention for Tim, who gave what was easily the most entertaining talk of the conference, focusing on the misappropriation of genomic health by the snake oil industries of genomic matched dating, genomic influenced exercise regimes and variant led diets. He also asked the dangerous question: if you 1) eat healthily 2) don’t smoke 3) drink in moderation 4) exercise, is there really any value in personalized medicine except for a few edge cases? Health advice hasn’t changed much in decades. And people still live unhealthily. You won’t change this by offering them a genetic test and asking them to modify their behavior. If you ever have a chance to see Tim speak, it’s worth attending. He asked for a show of hands for who had done 23andMe. Quite shockingly for a genetics conference, only 3 people had their hands in the air: myself, Tim and one of the other speakers.

2.1 Lillian Su, University of Toronto: “Prioritising Therapeutic Targets in the Context of Intratumour Heterogeneity”

The central question is how we can move towards molecular profiling of a patient. Heterogeneity of cancers includes not just inter-patient difference but also intra-patient differences, either within a tumour itself, or a primary tumour and its secondary metastases.

Lillian was reporting on the IMPACT study, which has no fresh biopsy material available, so works exclusively from FFPE samples. Their initial work has been using a 40 gene TruSeq Custom Amplicon hotspot panel, but they are in the process of developing their own ‘550 gene’ panel, which will have the report integrated with the EHR system.

Lillian went on to outline the difference between trial types and the effects of inter-individual differences. Patients can be stratified into ‘umbrella’ trials, which are histology led, or ‘basket’ trials, which are led by genetic mutations (as well as N-of-1 studies where you have unmatched comparisons of drugs).

But none of this addresses the intra-patient heterogeneity, which is not really considered in clinical trial design. Not all genes have good concordance in terms of the mutation spectra between primary and metastatic sites (PIK3CA was given as an example). What is really required is a knowledge base of tumour heterogeneity before a truly effective trial design can be constructed. And how do you link alterations to clinical actions?

Lillian outlined the filtering strategy for variants from FFPE and matched bloods. This was a minor allele frequency (MAF) of <1% in 1kG data, a variant allele fraction (VAF) of >5% and a depth (DP) of >500x for the tumour, and DP >50x in matched bloods. Data was cross-referenced with COSMIC, TCGA, LSDBs and existing clinical trials, and missense mutations were characterized with PolyPhen, SIFT, LRT (likelihood ratio test) and MutationTaster.
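The thresholds above translate fairly directly into a filter. A minimal sketch, assuming each call has already been annotated with the four values; the `Call` record and its field names are my own invention, and the example calls are made up:

```python
from dataclasses import dataclass


@dataclass
class Call:
    gene: str
    pop_maf: float   # minor allele frequency in 1kG data (0.0 if unseen)
    vaf: float       # variant allele fraction in the tumour
    tumour_dp: int   # read depth at the site in the tumour
    blood_dp: int    # read depth at the site in the matched blood


def passes_filters(call):
    """Apply the four thresholds quoted in the talk."""
    return (
        call.pop_maf < 0.01
        and call.vaf > 0.05
        and call.tumour_dp > 500
        and call.blood_dp > 50
    )


calls = [
    Call("KRAS", 0.0, 0.12, 900, 80),     # rare, well supported: kept
    Call("TP53", 0.02, 0.30, 900, 80),    # too common in 1kG: dropped
    Call("PIK3CA", 0.0, 0.03, 900, 80),   # VAF too low: dropped
    Call("BRAF", 0.0, 0.20, 400, 80),     # tumour depth too low: dropped
]
kept = [c.gene for c in calls if passes_filters(c)]
```

The cross-referencing against COSMIC, TCGA and the other resources would then happen on whatever survives this step.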

They are able to pick out events like KRAS G12 mutations that are enriched on treatment, and this is a driver mutation, so the treatment enriches the driver over time.

Lillian sees WES/WGS, rather than panels, as an important long term investment, along with the use of RNA-Seq in investigating heterogeneity. Ideally you want a machine learning system overlaid on the NGS datasets. Deep sequencing of tumours early might give you some idea of whether the tumour heterogeneity is pre-existing or is a result of tumoural selection over time. It was acknowledged that this was hard to do for every patient, but it would answer more long standing questions about whether resistant subclones are present and stable at the start of tumourigenesis.

PDX stands for “Patient Derived Xenografts”. This was an amazing talk, and as such I have few notes. The basic premise here is to take a tumour from a patient, segment it and implant the segments into immunodeficient mice where the tumours can grow. There was a lot of detail on the mouse strains involved, but the applications for this seem to be huge. Tumours can be treated in situ with a number of compounds and this information used to stratify patient treatment. The material can be used for CNV work, grown up for biobanking, expression profiling etc.

Fitting in with the previous talk, this model can also be used for investigating tumour heterogeneity as you can transplant different sections of the same tumour and then follow e.g. size in response to drug dosage in a number of animals all harbouring parts of the same original tumour.

Importantly this is not just limited to solid tumour work as AML human cell lines can also be established in the mice in a matter of weeks.

The quote that stayed with me from the beginning of the talk was “Precision cancer medicine stands on exceptions”. The success stories of genomic guided medicine in cancer such as EGFR and ALK mutations are actually present in very small subsets of tumours. The ALK mutation is important in NSCLC tumours, but this is only 4% of tumours and only 2% respond. Colorectal cancer (CRC) is characterized by EGFR mutations and disruption of the RAS/RAF pathway.

However the situation is that you can’t just use mutation data to predict the response to a chemotherapeutic agent. BRAF mutations give different responses to drugs in melanomas vs. CRC because the melanomas have no expression of EGFR, owing to the differences in their embryonic origin.

Consequently, in cell-line studies the important question to ask is whether the gene expression profiles of the cell line are appropriate to the tumour. This may determine the response to treatment, which may or may not be the same depending on how the cell line has developed during its time in culture. Are cell lines actually a good model at all?

Frederica made a point that RNA-Seq might not be the best for determining outlier gene expression and immunohistochemistry was their preferred route to determine whether the cell line and tumour were still in sync in terms of gene expression/drug response.

Nick started off talking about the various classes of risk alleles that exist for breast cancer. At the top of the list there are the high penetrance risk alleles in BRCA1 and BRCA2. In the middle there are moderate risk alleles at relatively low frequency in ATM and PALB2. Then there is a whole suite of common variants that are low risk, but population wide (FGFR2 mutations cited as an example).

With breast cancer the family history is still the most important predictive factor, but even so 50% of clearly familial breast cancer cases are genetically unexplained.

He went on to talk about the COGS study which has a website at http://nature.com/icogs which involved a large GWAS study of 10k cases and 12k controls. This was then followed up in a replication study of 45k cases and 45k controls.

Nick has been involved in the fine mapping follow up of the COGS data, but one of the important data points was an 11q13 association with TERT and FGFR2.

Data was presented on the fine mapping work that shows associated SNPs mapping to DNase I hypersensitivity sites in MCF7 (a metastatic breast cancer cell line) as well as to transcription factor binding sites. This work relied on information from RegulomeDB: http://regulomedb.org/.

One of the most impressive feats of this talk was Nick reeling off 7 digit rsIDs repeatedly during his slides without stumbling over the numbers.

Work has also been performed to generate eQTLs. The GWAS loci are largely cis acting regulators of transcription factors.

1.1 Pui-Yan Kwok, UCSF: “Structural Variations in the Human Genome”

The talk focused on structural variant detection. The challenges were outlined as being:

Short reads

Repeats

CNVs

Haplotyping for compound heterozygote identification

Difficulty of analysis of SVs

Currently the approach is to map short reads to an imperfect assembly. It is imperfect because it is haploid, composite and incomplete with regards to gaps, N’s and repeat sizes.

There are 1000 structural variations per genome, accruing to 24Mb/person, and 11,000 common ones in the population covering 4% of the genome (i.e. more than your exome).

ArrayCGH dup/del arrays don’t tell you about the location of your duplications and deletions. Sequencing only identifies the boundaries.

He presented a model of single molecule analysis on the BioNano Genomics Irys platform. Briefly, this uses a nicking enzyme to introduce single stranded nicks in the DNA, which are then fluorescently labelled. These are then passed down a channel and resolved optically to create a set of sequence motif maps, very much akin to an optical restriction endonuclease map. This process requires high molecular weight DNA, so it is presumably not suitable for FFPE/archival samples.

There are some technical considerations – the labelling efficiency is not 100% (a mismatch problem on alignment), and some nicks are too close together for optical resolution. The nicking process can make some sites fragile, causing breakup of the DNA into smaller fragments. The ‘assembly’ is still an algorithmic approach and by no means a perfect solution.

However this approach shows a great synergy with NGS for combinatorial data analysis.

They took the classic CEPH trio (NA12878/891/892) and made de novo assembled genome maps for the three individuals, generating ~259Gbases of data per sample. 99% of the data maps back to the GRCh38 assembly (I assume this is done via generating a profile of GRCh38 using an in silico nickase approach). The N50 of the assemblies is 5Mbases, and 96% of GRCh38 is covered by the assembled genomes.
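My assumption about the in silico nickase step can be sketched as finding every occurrence of the enzyme's recognition motif on either strand of the reference and recording the spacing between sites, which is the pattern the optical maps would be aligned against. The motif below (GCTCTTC) is my assumption for illustration; the enzyme was not named in my notes, and the toy sequence is made up.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def label_sites(seq, motif="GCTCTTC"):
    """Return sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    sites = set()
    for pattern in {motif, revcomp(motif)}:
        pos = seq.find(pattern)
        while pos != -1:
            sites.add(pos)
            pos = seq.find(pattern, pos + 1)
    return sorted(sites)


def label_spacings(sites):
    """Distances between consecutive label sites."""
    return [b - a for a, b in zip(sites, sites[1:])]


# Toy reference: one forward-strand site and one reverse-strand site.
toy_ref = "AA" + "GCTCTTC" + "AAAA" + "GAAGAGC" + "TT"
sites = label_sites(toy_ref)
spacings = label_spacings(sites)
```

A real comparison would also have to tolerate missing and extra labels, given the imperfect labelling efficiency mentioned above.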

This obviously enables things like gap sizing in the current reference genome. They were able to validate 120/156 known deletions, and identified 135 new ones. For insertions they validated 43/59 and found 242 new ones. A number of other mismatches were identified – 6 were switched insertion/deletion events, 9 were low coverage and 31 there was no evidence for.

The strength of the system is the ability to do tandem duplications, inversions and even complex rearrangements followed by tandem duplications. It also supports haplotyping, but critically you can tell where a CNV has arrived in the genome. This would enable applications like baiting the sequences in CNV regions and mapping the flanks. This allows you to produce diploid genome maps.

This platform therefore allows assessment of things like DUF1220-Domain copy number repeats, implicated in autism spectrum disorders and schizophrenia (repeat number increases in ASD, and decreases in schizophrenia).

Stephen spoke about new NCBI services including simplified dbGAP data requests and the option to look for alleles of interest in other databases by Beacon services.

dbGAP is a genotype/phenotype database for researchers that presents its data consistent with the terms of the original patient consent. “GRU” items are “general research use” – these are broadly consented and genotyped or sequenced datasets that are available to all. This consists of CNV, SNP, exome (3.8k cases) and imputed data. PHS000688 is the top level ID for GRU items.

The Beacon system should be the jumping point for studies looking for causative mutations in disease to find out what other studies the alleles have been observed in rather than relying on 1KG/EVS data. This is part of the GA4GH project and really exists so a researcher can ask a resource if it has a particular variant.

At some point of genome sequencing we will probably have observed a SNP event in one in every two bases, i.e. there will be a database of 1.5 billion variant events. And critically we lack the kind of infrastructure to support this level of data presentation. And the presentation is the wrong way around. We concern ourselves with project/study level data organization but this should be “variant” led – i.e. you want to identify which holdings have your SNP of interest. This is not currently possible, but the Beacon system would allow this kind of interaction between researchers.
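The 'variant led' organization amounts to an inverted index: key on the variant, and store which holdings have seen it. A toy, purely local sketch of that idea follows. It is not the GA4GH Beacon API, and the class, holding names and coordinates are all hypothetical:

```python
from collections import defaultdict


class ToyVariantIndex:
    """Maps a variant to the set of holdings in which it was observed."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, holding, chrom, pos, ref, alt):
        self._index[(chrom, pos, ref, alt)].add(holding)

    def exists(self, chrom, pos, ref, alt):
        """The Beacon-style question: has anyone here seen this variant?"""
        return (chrom, pos, ref, alt) in self._index

    def holdings(self, chrom, pos, ref, alt):
        """Which holdings to contact; read-only, so no empty keys are created."""
        return sorted(self._index.get((chrom, pos, ref, alt), set()))


index = ToyVariantIndex()
index.add("study_A", "13", 32900000, "G", "A")   # hypothetical holdings
index.add("study_B", "13", 32900000, "G", "A")
seen = index.exists("13", 32900000, "G", "A")
unseen = index.exists("13", 32900001, "G", "A")
who = index.holdings("13", 32900000, "G", "A")
```

The yes/no answer carries no context on its own; the list of holdings is what tells you who to contact.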

There are a number of Beacons online, which are sharing public holdings such as 1KG. The NCBI, GA4GH, Broad, EBI are involved. There is even a meta-Beacon that allows you to query multiple Beacons.

This introduces a new workflow – really it allows you to open a dialogue between yourself and the data holder. The existence of a variant is still devoid of context, but you can contact the data holder and then enter a controlled access agreement for the metadata, or information down to the read level.

Machine mining of Beacon resources is prohibited. However the SRA toolkit allows access to dbGAP with security tokens which allows automatic query of SRA related material with local caching.

The steps involved are building a fosmid library, which is then plated out. Molecular inversion probes (MIPs) are used to identify fosmids from the region of interest. Single clones are then extracted and sequenced extensively. This obviously means you need a fosmid library for each individual you’re looking at, and it is not a hybridization extraction method like using BACs as baits for large regions.

Sequencing is done on PacBio both for speed (faster than a MiSeq) and read length. At this point the data can be assembled by Velvet, or even by the venerable Phrap/Consed approaches. About 40-100 PacBio reads are required to assemble a fosmid clone.

Quiver can be used to find a consensus sequence, and once a fosmid has been assembled, it can be coassembled with other fosmids that have been similarly reconstructed to get regions of 800kb.

The question was raised whether it might be possible to bypass the fosmid step with other recombineering approaches to work directly with gDNA and MIPs.

Peter talked about the prediction of splice mutation effects with particular reference to the collagen genes. 20% of collagen mutations are splice site mutations (these genes have lots of exons). These are pathogenic in a spread of osteogenesis imperfecta (OI) disorders. It is complex because we not only have to consider the effects on splice donor and splice acceptor sites but also the effects on lariat sequences within introns.

Consequently there are a number of downstream effects – the production of cryptic splice sites, intron retention, exon skipping (which tends to lead to more severe phenotypes). But this is made more complex again by the fact a single variant can have multiple outcomes and there’s no clear explanation for this.

This complexity means that it is hard to produce a computational prediction program that takes into account all the uncertainties of the system, especially at locations 3, 4 or 5 bases outside the splice site.

SplicePort and Asseda were tested, and Asseda came out on top in the tests, with a mere 29% of events wrongly predicted when compared with experimental evidence. So what is happening to make these predictions incorrect?

Peter explained that the order of intron removal is specific to the gene but shared between individuals. There is no global model for what that order might be, though it must be encoded in some way by intronic sequence. The speed of intron removal and the effects on the mature mRNA are incredibly important to the pathogenesis of the disease. It was clearly shown that the splicing events under study were determined by the speed of intron removal as the RNA matured.

If you want to predict the splicing effect of a mutation, you therefore need some information about the order of intron processing in the gene you’re looking at to have a completely holistic view of the system. How do you generate this information systematically? It’s a very labour intensive piece of work, and Peter was looking for suggestions on how best to mine RNA-Seq data to get to the bottom of this line of enquiry. Is it possible even to do homology based predictions of splicing speed and therefore splicing order?