Informatics Team

Team Members

Warren Kaplan

Shane Husson

Dmitry Degrave

Derrick Lin

Manuel Sopena-Ballesteros

Tansel Ersavas

Background

Every precision medicine program relies on genomic sequencing of disease cohorts. Analytical insights from these cohorts lead to determining pathogenicity of genetic variants, contribute to known Genotype-Phenotype correlations and inform patient diagnoses.

Beautiful visions of High Definition Medicine (Torkamani et al. 2017) and explicit calls to realise this vision (Topol 2015) inspire our group to contribute to this future by building the Platform to Support Precision Medicine programs at any scale.

Mission Statement

The KCCG Informatics Program aims to organise the world’s genome information and make it universally accessible and useful to authorised users.

Core Activities

Genome Cohorts

To enable easy interrogation of genome cohorts of any size, we have built the Vectis variant atlas platform. This platform was developed to suit the needs of diverse users, including clinicians, patients, scientists and bioinformaticians.

Capabilities

Search: Query specific chromosome co-ordinates, gene names and annotations in a given cohort.

Beacon: Locate specific variants in different studies across the Global Alliance for Genomics and Health Beacon Network.

Explore: Highly interactive real-time exploration of cohort summary statistics of genetic variants, including variant type, average allele frequencies, and reference and alternate alleles. Supports the querying of 40 million variants in real time.

Integrated web notebooks: Enabling bioinformaticians to run their own scripts and analyses in situ, while preserving their code, figures and results.

Variant annotations: Including links out to the original supporting evidence.

Clinical filtering: Subset patients based on clinical attributes and query specific genotypes at the individual level.
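To make the Search, Beacon and Explore capabilities more concrete, the following is a minimal sketch of the kinds of queries they describe, run against a tiny in-memory cohort. It is not the platform's actual API: the `Variant` record, the function names and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical cohort-level variant record (not the platform's schema)."""
    chrom: str
    pos: int            # 1-based position
    ref: str            # reference allele
    alt: str            # alternate allele
    alt_count: int      # alternate alleles observed in the cohort
    allele_number: int  # total called alleles in the cohort

    @property
    def allele_frequency(self):
        if self.allele_number == 0:
            return 0.0
        return self.alt_count / self.allele_number


def query_region(variants, chrom, start, end):
    """Search: variants on `chrom` with start <= pos <= end (inclusive)."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


def beacon_exists(variants, chrom, pos, ref, alt):
    """Beacon-style yes/no answer: is this exact allele seen in the cohort?"""
    return any(
        (v.chrom, v.pos, v.ref, v.alt) == (chrom, pos, ref, alt)
        and v.alt_count > 0
        for v in variants
    )


def mean_allele_frequency(variants):
    """Explore: average allele frequency over a set of variants."""
    if not variants:
        return 0.0
    return sum(v.allele_frequency for v in variants) / len(variants)


# Toy cohort; all values are invented for illustration.
cohort = [
    Variant("1", 1000, "A", "G", 5, 200),
    Variant("1", 1500, "C", "T", 20, 200),
    Variant("2", 300, "G", "GA", 1, 200),
]

hits = query_region(cohort, "1", 900, 1600)
print(len(hits), mean_allele_frequency(hits))
print(beacon_exists(cohort, "2", 300, "G", "GA"))
```

A real deployment would answer these queries from an indexed store rather than a linear scan, which is what makes real-time exploration of tens of millions of variants feasible.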

GA4GH Beacon Network

Interactive Visualisations

Web Notebooks for Bioinformatics Researchers

Clinical Filtering

Deep Learning Initiative

The Deep Learning Initiative is a group of projects that aims to transition Garvan into the coming “Age of Artificial Intelligence”. The initiative covers projects that demonstrate the abilities of deep learning and other soft computing techniques in the biological and medical sciences, and organises talks, seminars and workshops, from introductory to advanced level, to popularise the use of deep learning at Garvan. The initiative was started and is currently led by Tansel Ersavas under the supervision of Dr. Warren Kaplan. The leading project of the initiative is “MitoWisdom”, which uses advanced deep learning techniques.

Mitochondria are critical to cell survival as the host cell’s energy source and in regulating cell metabolism. The role of mitochondria in cancers, degenerative diseases and ageing is increasing in prominence, and better analytic tools are required to further identify their contributions to such conditions. We are developing a clustering mechanism that uses a novel deep learning system and unsupervised learning to extract features from mitochondrial genome data at multiple dimensions. This system can then be quickly and easily re-trained to analyse mitochondria in multiple ways, with minimal sample data, for specialised classification of any condition or trait. We use a “convolutional autoencoder” to reduce the dimensionality of the data, and use the encoder (reducer) part of the autoencoder as the basis of a trained deep learning system. The generated encoder represents mitochondria and can be used as a knowledge source that can be applied to any mitochondria-related problem with minimal supervised training. The technique we use for the mitochondrial genome is general and is applicable to the whole genome or any selected portions of it. This project is currently being implemented by Tansel Ersavas with data supplied by Dr. Mark Pinese, in consultation with Prof. Aleksandra Filipovska of the University of Western Australia.
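As a rough illustration of the convolutional autoencoder idea described above, the sketch below shows only an untrained forward pass in NumPy: a DNA sequence is one-hot encoded, compressed by a strided 1D convolution (the encoder), and expanded back by a transposed convolution (the decoder). The MitoWisdom architecture is not described here, so the layer sizes, function names, random weights and toy sequence are all assumptions made for illustration, and training (minimising the reconstruction error) is omitted.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (4, length) array; unknown bases stay all-zero."""
    x = np.zeros((len(BASES), len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            x[j, i] = 1.0
    return x


def encode(x, w_enc):
    """Strided 1D convolution + ReLU: (c_in, L) -> (c_code, L // stride).

    w_enc has shape (c_code, c_in, stride); windows do not overlap.
    """
    c_in, length = x.shape
    stride = w_enc.shape[2]
    n = length // stride
    windows = x[:, : n * stride].reshape(c_in, n, stride)
    z = np.einsum("kcs,cns->kn", w_enc, windows)
    return np.maximum(z, 0.0)


def decode(z, w_dec):
    """Transposed 1D convolution: (c_code, n) -> (c_out, n * stride).

    w_dec has shape (c_code, c_out, stride).
    """
    y = np.einsum("kn,kcs->cns", z, w_dec)
    c_out, n, stride = y.shape
    return y.reshape(c_out, n * stride)


# Hypothetical sizes: 64 bases compressed to a 2 x 8 code (256 -> 16 values).
rng = np.random.default_rng(0)
seq = "ACGT" * 16                       # toy sequence, not real mitochondrial data
x = one_hot(seq)
w_enc = rng.normal(scale=0.1, size=(2, 4, 8)).astype(np.float32)
w_dec = rng.normal(scale=0.1, size=(2, 4, 8)).astype(np.float32)

z = encode(x, w_enc)                    # low-dimensional representation
x_hat = decode(z, w_dec)                # reconstruction of the input
loss = float(np.mean((x - x_hat) ** 2)) # error that training would minimise
print(x.shape, z.shape, x_hat.shape)
```

After training, only `encode` would be kept: its output serves as the compact representation that a small supervised classifier can be trained on with few labelled samples.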

Data Intensive Computer Engineering (DICE)

The increasingly rapid turn-around and plummeting costs of genome sequencing mean that most of the expense associated with genomes will lie not in their sequencing, but in their analysis and in the scale-out computing systems needed to analyse them. The disruptive change to computing brought about by commercial cloud providers like Amazon, Google and Microsoft brings great potential and opportunities for genomics and medicine, but requires a deep understanding of the nuances of cloud usage. The Garvan Data Intensive Computer Engineering (DICE) Group was established to design solutions to meet these challenges.

About DICE

The Data Intensive Computer Engineering (DICE) group is part of the Garvan Institute of Medical Research in Sydney, Australia. DICE is a provider of innovative computing solutions for genomics data. DICE supports Garvan’s factory-scale accredited genome sequencing operation, the single cell studies of the Garvan Weizmann Centre for Cellular Genomics, and other big genome data solutions.

DICE comprises engineers Derrick Lin and Manuel Sopena Ballesteros, who report to Warren Kaplan (KCCG Informatics leader, Garvan Chief of Informatics). DICE’s computing infrastructure is supported by over $2 million in grants written by the DICE team.

Driven by the scale and economic model of a specific problem, DICE builds solutions to run on local infrastructure, supercomputing facilities and commercial cloud environments.

Our solutions extend from hardware and networking, through software infrastructure layers (like Apache Spark and Hadoop), to bioinformatics applications. We do not limit our solutions to local infrastructure, but include designs that incorporate supercomputing facilities, commercial clouds and fast Wide Area Networks too.

While focussing primarily on genomics data, DICE has close working relationships with other niche markets that include Finance, Agriculture and Defence. DICE also works closely with other research institutes that look to emulate the role of DICE, and acts as an expert solution provider for bespoke genome data and computation challenges.

Since 2010, DICE has built customised solutions for the Garvan Institute. These solutions and related achievements include:

Bioinformatics analysis environments using GenePattern and Galaxy

DICE Wolfpack Cluster

The computing infrastructure for accredited whole genome sequencing at Genome.One

The building of a Science Demilitarised Zone (DMZ) for the safe transfer of data in partnership with UNSW Sydney IT.

Successful writing of over $2 million in grants to build our in-house infrastructure

Regular invitations to Big Data Conferences

The DICE Approach

From a technical perspective DICE works in very diverse technologies that include:

HPC (Rocks cluster)

Panasas storage

Mellanox Networking

Ansible

OpenStack Cloud (Nova, Neutron, Heat, Magnum)

Ceph storage

Docker

Apache Spark

Hadoop (HBase)

Apache Kudu

Four of these technologies are used in production (the Rocks HPC cluster, Panasas storage, Mellanox networking and Ansible), while the others are used to develop new products in collaboration with Garvan researchers. This approach is very similar to Google’s approach to research (Spector et al. 2012). An important aspect of DICE’s business is change, with regular reconfigurations needed across the computing stack to support it.

Product Development and Applications

We have deployed our Genome Analysis Platform in the following production environments: