Dr. Lisa Porter, Department of Biological Sciences, University of Windsor

EVIDENCE OF PROGRESS

Task 1 – Data collection and preprocessing:
We have obtained the full METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) data set, which covers 10 subtypes and consists of single-nucleotide polymorphism (SNP), copy-number variation (CNV), copy-number aberration (CNA), and gene expression profiles from 2000 primary breast tumors with long-term clinical outcome. An additional full gene expression data set consisting of 5 breast cancer subtypes and 158 samples has also been downloaded. Human protein-protein interaction (PPI) network data and cellular pathway data have been collected from the HPRD and KEGG databases and other repositories. All of these data have been preprocessed so that the driver genes and diagnostic biomarkers of each breast cancer subtype can be obtained from them.

Task 2 – Data Integration:

The preprocessed breast cancer data described above have been integrated with the human PPI network so that the driver genes and diagnostic biomarkers of the breast cancer subtypes can be found. We first devised a set of algorithms to identify, from the gene expression data and the variation data, the candidate driver genes: genes that show both significant alteration and significant differential expression in each subtype. The final step of the integration process was to map all the genes of the breast cancer data onto the human PPI network; this will allow us to determine the functional relationships between the genes during the biomarker identification phase.
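The two integration steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the algorithms devised in this project: it assumes that per-gene p-values for alteration and for differential expression are already available for one subtype, that a simple significance threshold (`alpha`, a hypothetical parameter) is used, and that the PPI network is given as a list of gene pairs. All function names and gene identifiers are placeholders.

```python
from typing import Dict, Iterable, Set, Tuple


def candidate_driver_genes(
    alteration_pvalues: Dict[str, float],
    expression_pvalues: Dict[str, float],
    alpha: float = 0.05,  # hypothetical threshold, not taken from the report
) -> Set[str]:
    """Return genes that are significant in BOTH tests for one subtype."""
    altered = {g for g, p in alteration_pvalues.items() if p < alpha}
    expressed = {g for g, p in expression_pvalues.items() if p < alpha}
    # A candidate driver must be significantly altered and differentially expressed.
    return altered & expressed


def map_to_ppi(
    genes: Iterable[str],
    ppi_edges: Iterable[Tuple[str, str]],
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Map genes onto a PPI network given as (gene, gene) pairs.

    Returns the genes that are present in the network and the interactions
    whose two endpoints are both among those genes.
    """
    gene_set = set(genes)
    edges = list(ppi_edges)
    network_nodes = {node for edge in edges for node in edge}
    mapped = gene_set & network_nodes
    induced = {(a, b) for a, b in edges if a in mapped and b in mapped}
    return mapped, induced
```

With toy inputs, `candidate_driver_genes` keeps only the genes that pass both tests, and `map_to_ppi` returns the subset found in the network together with the interactions among them, which is the information needed to reason about functional relationships later on.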

Task 3 – Biomarker Identification:

We have proposed two prediction methods that find the informative diagnostic biomarkers of the breast cancer subtypes. We have obtained at least 9 sets of highly predictive diagnostic biomarkers, with accuracies ranging from 90% to 99%. Although their results are excellent, the two methods yield different results on the same data as well as on different data. We have identified possible solutions to this lack of reproducibility, and each is being implemented and tested. Task 3 is the most important (and also the most time-consuming and difficult) phase of this Seeds4Hope research, as it requires studying, implementing and testing different prediction models to be run on the large METABRIC data set.
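One common way to quantify how much two biomarker sets agree is the Jaccard index (size of the intersection divided by size of the union). The sketch below is not one of the solutions under test in this project, which are not detailed here; it only illustrates, under the assumption that each biomarker set is a collection of gene identifiers, how the reproducibility problem could be measured. The function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Sequence


def jaccard(set_a: Iterable[str], set_b: Iterable[str]) -> float:
    """Jaccard index of two gene sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(biomarker_sets: Sequence[Iterable[str]]) -> float:
    """Average Jaccard index over all pairs of biomarker sets.

    Values near 1 mean the sets largely agree; values near 0 mean that
    different methods or data sets select mostly different genes.
    """
    pairs = list(combinations(biomarker_sets, 2))
    if not pairs:
        raise ValueError("at least two biomarker sets are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score of this kind can be tracked alongside prediction accuracy, so that a candidate fix is judged both on how well it classifies subtypes and on how stable the selected genes are.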

The goals of the program for the upcoming 2015/2016 year are as follows:

A) Completion of Task 3: Task 3 above will be completed in Year 2.

B) Validation of the Biomarker Sets: Each biomarker set will be validated to assess whether the link between gene expression and the gene interactions in the set is biologically sound. This validation phase will also involve interrogating each biomarker set for its biological functions. The final list of driver genes will be identified as those belonging to the most significant biomarker sets that are differentially expressed in each subtype.

C) Biological Validations: Dr. Lisa Porter’s lab will be used to validate the driver function of the identified biomarker sets. The protein products of each identified gene will be individually knocked down or overexpressed. Changes to proliferation will be noted, alterations to treatment will be tested, and changes to metastatic potential will be assessed. Investigating the biological relationships between the identified biomarker sets is a future direction. This task will also be done in collaboration with Dr. Caroline Hamm.