Biography

Biography:

Dr Kim-Anh Lê Cao was awarded her Ph.D in 2008 by the Université de Toulouse, France. Her Ph.D thesis received the 2009 "Marie-Jeanne Laurent-Duhamel" prize of the Société Française de Statistique (French Statistical Society).

She started her postdoc in late 2008 at the Institute for Molecular Biosciences (University of Queensland), then worked as a research-only academic at QFAB Bioinformatics. In 2014 she took up a computational biologist position at the University of Queensland Diamantina Institute (UQDI) and was awarded an NHMRC Career Development Fellowship in 2015. She now leads the Computational Biostatistics Methods group at UQDI.

Since the beginning of her Ph.D, Kim-Anh has initiated a wide range of valuable collaborative and research opportunities in both statistics and molecular biology. Her research interests are multidisciplinary: they focus on the characterisation of molecular biological systems using mathematical statistics, and on developing sound statistical methods to answer new biological questions arising from frontier molecular technologies. Her main research focus is on variable selection for biological data ('omics data) coming from different functional levels by means of multivariate dimension reduction approaches. Since 2009, her team has been developing statistical software dedicated to the integrative analysis of 'omics data, to help researchers make sense of biological big data (http://www.mixOmics.org).

She has been teaching Statistics at UQ for five years and is currently teaching Applications for Computational Statistics (STAT7174) for the UQ Bioinformatics Master’s program. She regularly runs statistical training workshops and short seminar series for bioresearchers (http://www.imb.uq.edu.au/statistics), as well as workshops on multivariate analysis.

Areas of expertise:

Statistical data integration

Biomarker discovery

Microbiome

Techniques used / available:

Statistical multivariate analysis

R programming

Projects

Research Projects:

Data integration, biomarker discovery, meta-analysis and applications

Project Title:

Data integration for biomarker discovery

Project Information:

Motivation

With the decreasing cost of high-throughput technologies, experiments performed on the same patients but on several types of platforms are becoming more and more common, not only to try to unravel the relationship between the different types of omics data (transcriptomics, proteomics, metabolomics, etc.) but also to obtain a better biomarker signature than from a single type of omics data alone. One of the main analytical challenges is to combine these different data sets and extract the relevant information from this enormous amount of data. What is the multi-omic biomarker panel that will give an accurate prediction of the diagnosis or prognosis of the disease?

Methods

We continue developing integrative multivariate methods that simultaneously integrate 'omics data sets (measured on the same samples) and identify biomarkers correlated across 'omics data sets. Our recent developments include a biomarker discovery approach which we successfully applied to a TCGA breast cancer multi-'omics study. We identified a highly correlated multi-'omics molecular signature, composed of proteins, transcripts, methylation regions (CpG) and miRNA, that can predict four subtypes of breast cancer tumours, with a classification accuracy of 85% in the training set and 81.5% in an independent test set (in preparation).

We are also investigating a knowledge-driven approach to include pathway knowledge in our statistical model. In breast cancer we started with the Homologous Recombination (HR) DNA repair pathway. The HR pathway is crucial for the repair of double strand breaks generated during DNA replication, and dysregulation of this pathway has been frequently observed in tumours: a downregulated HR pathway can sensitise tumour cells to DNA-damaging chemical and radiological therapies, while an upregulated HR pathway can cause resistance to these therapies. Our latest results show how we can quantify the dysregulation of such a pathway using a multivariate method called Principal Curve Analysis and relate those results to chromosomal instability.

Case studies

I am working closely with the groups of Dr Michelle Hill (prostate and oesophageal cancers) and Prof. Maher Gandhi (lymphoma), both at UQDI, to establish multi-marker and multi-omics biomarker panels capable of predicting early-stage cancers.

Project Title:

Data integration of longitudinal studies

Project Information:

Motivation

Longitudinal experiments are also becoming more common in clinical studies to try to understand the functional dynamics of a biological system. Taking the time component into consideration, together with the large number of variables and the small number of samples in several multi-omics data sets, is a numerical and analytical challenge in statistical analysis that has not been addressed previously. Our aim is to identify groups of different biological entities (e.g. expression levels of transcripts, proteins, metabolites, etc.) that are correlated and possibly controlled by the same molecular biological mechanism across several time points, and to answer the question: What is the set of omics biological features that act in concert across time and can predict the outcome of a disease?

Methods

Incorporating time dependency between the expression profiles during the data integration process is challenging: it involves modelling the profiles and then analysing the modelled profiles statistically, including clustering, differential expression analysis and time warping analysis. We are tackling this using linear mixed model splines and the Fast Fourier Transform.
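As an illustration only, one way a Fast Fourier Transform can be used on time profiles is to estimate the delay between two trajectories through their cross-correlation. The sketch below is not the group's method. It assumes equally spaced time points, a number of time points that is a power of two, and a circular (wrap-around) shift, and the function names `fft` and `estimate_lag` are hypothetical.

```python
import cmath
import math


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    sign = 1 if inverse else -1
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def estimate_lag(reference, delayed):
    """Circular shift (in time points) by which `delayed` trails `reference`."""
    n = len(reference)
    ref_mean = sum(reference) / n
    del_mean = sum(delayed) / n
    ref = [x - ref_mean for x in reference]
    dly = [x - del_mean for x in delayed]
    f_ref = fft(ref)
    f_del = fft(dly)
    # Cross-correlation theorem: corr[s] = sum_t ref[t] * dly[(t + s) mod n]
    cross = fft([a.conjugate() * b for a, b in zip(f_ref, f_del)], inverse=True)
    corr = [c.real / n for c in cross]
    best = max(range(n), key=lambda s: corr[s])
    # Report shifts beyond half the series length as negative lags.
    return best if best <= n // 2 else best - n


# Hypothetical expression profile and a copy delayed by two time points.
profile = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0]
late_profile = [profile[(t - 2) % 8] for t in range(8)]
lag = estimate_lag(profile, late_profile)
```

In practice the profiles would first be smoothed and interpolated onto a common time grid (for example with fitted splines), since real sampling times are rarely equally spaced.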

Case studies

My group has access to valuable longitudinal data sets on kidney transplantation from the Centre of Excellence for Prevention of Organ Failure (PROOF Centre, University of British Columbia, http://www.proofcentre.ca/), and we are also collaborating closely with Dr. K. Ruggiero (University of Auckland, NZ) on a pancreatitis study. We are working with Prof. Ranjeny Thomas (UQDI) on a longitudinal diabetes study and on the effect of a drug for rheumatoid arthritis.

Project Title:

Meta-analysis and cross-platform comparison

Project Information:

Motivation

Most clinical studies are performed on very small sets of patients, which reduces the precision of the estimated parameters. As a result, most signatures are not reproducible from one study to another. Combining different studies can increase statistical power, but batch effects due to platform, lab or experimenter need to be taken into account and adjusted for. With my collaborators, I have started to address this issue in cell line experiments with a new normalisation called ‘YuGene’ (available on CRAN), by answering the question ‘Where is my gene of interest expressed?’. More challenges remain to be addressed, such as a) the identification of a unifying robust signature across similar platforms and b) the generalisation to very different platforms at the gene level (microarrays and RNA-seq data) or even across different species. Our goal is to be able to combine several studies that were not performed on the same patients or in the same labs, and derive a selection of robust biomarkers across these studies.

Methods

We are currently developing a wide range of multivariate dimension reduction methods to address those challenges. See our preprint (our results are summarised below).

Case studies

With my postdoctoral fellow Dr Florian Rohart, we are working closely with the group of Prof Christine Wells (AIBN, UQ, now University of Melbourne), the Stemformatics team (www.stemformatics.org), and Dr S. Bougeard and Dr A. Eslami (ANSES and ONIRIS, France) to develop multivariate approaches and apply them to stem cell microarray data sets.

We combined independent transcriptomics studies to identify a gene signature defining human Mesenchymal Stromal Cells (MSC). This is a topical question in stem cell biology, as MSCs are a poorly defined group of stromal cells despite their increasingly recognised clinical importance. We integrated 84 highly curated public gene expression data sets representing 125 MSC and 510 non-MSC samples across 13 microarray platforms. The resulting platform-agnostic signature of 16 genes gave a classification accuracy of 97.8% on the training set and 93.5% on an external test set (187 MSC and 474 non-MSC). The workflow is available to the biological community as an R package (‘bootsPLS’) and was implemented in the Stemformatics web interface. The molecular signature has brought novel insights into the origin and function of MSC, and our signature predictor is currently being used and validated by many of our stem cell collaborators through Stemformatics.

Project Title:

mixOmics software

Project Information:

The approaches I develop and apply to real case studies are all based on multivariate dimension reduction approaches (e.g. sPLS-DA, sGCCA). These data integration approaches are extremely efficient for large data sets and are currently implemented in the mixOmics R package that my group is improving and maintaining. An interactive web-interface is being implemented and will be publicly available to the research community.
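To give a flavour of how a sparse dimension reduction method can select variables, the sketch below computes a single sparse loading vector for one response by soft-thresholding feature-response covariances, which is the general idea behind sparse PLS. It is a simplified illustration and not the mixOmics implementation. The function names and the `keep` parameter are hypothetical, and it assumes a single numeric response (for a two-class problem, a 0/1 class indicator).

```python
import math


def soft_threshold(value, penalty):
    """Shrink `value` towards zero by `penalty`; values within the penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def sparse_loading(X, y, keep):
    """First sparse loading vector for a single response.

    X: list of samples, each a list of feature values.
    y: list of responses, one per sample.
    keep: number of features allowed a non-zero weight (barring ties).
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Cross-product between each centred feature and the centred response.
    scores = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    # Use the (keep + 1)-th largest absolute score as the penalty,
    # so that only the top `keep` features survive the shrinkage.
    ranked = sorted((abs(s) for s in scores), reverse=True)
    penalty = ranked[keep] if keep < p else 0.0
    shrunk = [soft_threshold(s, penalty) for s in scores]
    norm = math.sqrt(sum(w * w for w in shrunk))
    return [w / norm for w in shrunk] if norm > 0 else shrunk


# Hypothetical toy data: 4 samples, 3 features, two classes coded 0/1.
X = [[1.0, 0.5, 0.2], [1.2, 0.4, 0.1], [3.0, 0.5, 0.3], [3.2, 0.4, 0.2]]
y = [0, 0, 1, 1]
weights = sparse_loading(X, y, keep=1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
```

Features with a non-zero weight form the selected signature. Further components would be obtained by deflating X and repeating the step.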

mixOmics was released on CRAN (http://cran.r-project.org/) in 2009 and is in constant expansion. Our user base is growing rapidly (~10,000 unique IP downloads in 2015 compared to ~4,000 in 2014, based on CRAN downloads). The mixOmics team includes collaborators from the Université de Toulouse, France, as well as numerous contributors (University of British Columbia, Canada; University of Western Australia) and users around the world.

We run mixOmics 2-day workshops throughout the year. See www.mixOmics.org for more details.

Project Title:

Microbiome statistical analysis

Project Information:

Motivation

The Human Microbiome Project (HMP) has uncovered not only a diversity of microbes across human beings that seem to assist in maintaining processes necessary for a healthy body, but also unique communities of microbes that live at different body sites (e.g. skin, gut). Microbial imbalance has also been shown to be associated with disease (e.g. Crohn's disease, inflammatory bowel disease). However, microbiome research is still in its infancy, and substantial statistical and computational developments are needed to fully understand how these trillions of bacteria live, interact and regulate their ecosystem.

Our project focuses on identifying microorganisms that characterise different environments or habitats, spatial locations or temporal changes, and on understanding the biological functions of groups of microbial communities in relation to the host (e.g. human gut).

Methods

We are currently developing a set of univariate and multivariate methods for the analysis of microbiome data. We are particularly interested in understanding the interplay between the microbiome and the host, by statistically integrating meta-'omics ('omics measured on the microbiome) and the 'omics from the host.