CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.

We look forward to a lively contest!

Hi-Res Cancer Data Integration Challenge

Drawing on the comprehensive description of genomic, transcriptomic and epigenomic changes in cancers provided by the Genomic Data Commons (GDC, formerly at TCGA), the main goal of this challenge is to develop and demonstrate new methods for gaining novel biological insights or improving support for Precision Medicine. Innovation can come from

Examine algorithm performance in real-world clinical settings! We know that many approaches work well on some data sets yet not on others. Here we challenge you to demonstrate a single unified approach that matches or outperforms the current state-of-the-art for

CMap Drug Safety Challenge

Due to safety and toxicity issues, attrition in drug discovery and development remains a significant concern, and there are strong efforts to identify and mitigate risk as early as possible. Drug-induced liver injury (DILI) is one of the primary liabilities in drug development and regulatory clearance due to the limited performance of mandated preclinical models. There is a pressing need to evaluate alternative methods for predicting severe DILI, the main concern of the regulatory agencies. Increasing evidence suggests that multiple factors, including the interactions between drug properties and host factors (i.e., patient information), contribute to the DILI effect of a drug (Journal of Hepatology 63). Great hopes are being placed in modern approaches from statistics and machine learning applied to genome-scale profiling data. Whether we can better integrate, understand, and exploit information from multiple complementary studies of chemical compounds thus remains a critical question. Specifically, this means exploring chemical descriptors of the drugs (Mold2, Journal of Chemical Information and Modeling 48), cell-based screening of pathway perturbations of the drugs (Toxicology in the 21st Century/Tox21, Nature Communications 7), gene expression patterns induced by them (Broad Institute Connectivity Map/CMap, Science 313, Nature Reviews Cancer 7, Cell 171), as well as host factors from the FDA Adverse Event Reporting System database (FAERS).

This CAMDA challenge focuses on understanding or predicting a drug’s potential to cause acute liver failure, the most severe type of DILI. To support the development of supervised machine learning approaches, we retrieved DILI severity information from FDA-approved drug labeling and now provide a new set of training labels for 422 drugs, indicating their potential to cause acute liver failure. In addition, we acquired a validation set of 195 drugs with blinded labels, which are to be predicted. In the 2020 challenge, instead of relying solely on gene expression data, we extended the predictors with Mold2 chemical descriptors, host factor information (age and gender of the patients) from the FDA FAERS database, and pathway perturbation data from Tox21. Moreover, we have narrowed down last year's CMap L1000 gene expression data set to six cell lines potentially most relevant to liver (i.e. PHH, HEPG2, HA1E, A375, MCF7, PC3). Analysis teams are encouraged to develop models using these predictors individually and/or in combination.

Analysis suggestions:

Integration of potentially complementary assays. Assessment of the relative value of the complementary data types for prediction.

Identification and interpretation of differences in cell-line response across drugs and across different predictors.
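
As one way to approach the first suggestion, the relative value of predictor types can be estimated by cross-validating the same simple model on each data type alone and on their combination. The following is a minimal sketch under stated assumptions, not a prescribed method: the feature blocks and labels are synthetic toy data, the block names "chemical" and "expression" are hypothetical stand-ins for the challenge predictors, and the nearest-centroid classifier is chosen only because it needs no external libraries.

```python
import random
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Return the label whose class centroid is closest to x (squared Euclidean)."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [row for row, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def loo_accuracy(X, y):
    """Leave-one-out accuracy: each drug is predicted from all the others."""
    hits = 0
    for i in range(len(X)):
        pred = nearest_centroid_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
        hits += pred == y[i]
    return hits / len(X)


def combine(*blocks):
    """Early integration: concatenate the feature vectors of each drug."""
    return [[v for block in blocks for v in block[i]] for i in range(len(blocks[0]))]


# Synthetic toy data (NOT challenge data): 40 hypothetical drugs, binary labels.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
blocks = {
    "chemical": [[lab + rng.gauss(0, 0.8)] for lab in labels],
    "expression": [[0.5 * lab + rng.gauss(0, 0.8), rng.gauss(0, 1)] for lab in labels],
}

for name, X in blocks.items():
    print(name, round(loo_accuracy(X, labels), 2))
print("combined", round(loo_accuracy(combine(*blocks.values()), labels), 2))
```

With real predictors, features measured on different scales would normally be standardized before concatenation, and a metric that accounts for class imbalance may be more informative than plain accuracy.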

Metagenomic Geolocation Challenge

MetaSUB is creating a global genetic cartography of urban spaces, based on extensive sampling of mass-transit systems and other public areas across the globe. In a strategic partnership, an extended set of data from the global City Sampling Days is first introduced through the annual CAMDA contests.
CAMDA delegates thus receive access to over a thousand novel MetaSUB samples, comprising over a terabase of whole genome shotgun (WGS) metagenomics data. The primary data set covers over 20 cities around the world, with tens of samples per city (over 1000 samples in total), providing a unique resource for the study of biodiversity within and across geographic locations as well as ecological niches.

To better understand the relation between metagenomic profiles and location specificity or ecological niche, a set of over a thousand features describing the climate conditions is provided, as well as a classification of city and neighbouring biomes.

Together, this unique multi-source data set will make it possible to build novel models that predict the ecological niche type or even the origin of samples from cities seen for the very first time. Performance can be tested on an independent test set of over 50 new 'mystery' samples, including locations from cities not sampled before.

Analysis suggestions:
A key challenge in metagenomic forensics is the construction of a microbiome fingerprint that allows the prediction of the geographical origin of a sample even when no reference samples from that location are known.

Typical considerations include:

How can we exploit metagenomic fingerprints for identifying the origin of a sample?

How reliable are such predictions of sample origins?
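
To make the two questions above concrete, here is a minimal sketch under stated assumptions: each sample is reduced to a vector of relative taxon abundances (a simple stand-in for a fingerprint), origins are predicted by a k-nearest-neighbour vote under Bray-Curtis dissimilarity, and reliability is estimated by leave-one-out testing. The profiles and the city names "A" and "B" are made-up toy values, not contest data, and this sketch only covers cities that have reference samples.

```python
from collections import Counter


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(p, q))
    den = sum(a + b for a, b in zip(p, q))
    return num / den if den else 0.0


def predict_origin(profiles, cities, query, k=3):
    """Return (predicted city, vote fraction) from the k most similar references."""
    ranked = sorted(range(len(profiles)), key=lambda i: bray_curtis(profiles[i], query))
    top = ranked[:k]
    votes = Counter(cities[i] for i in top)
    city, count = votes.most_common(1)[0]
    return city, count / len(top)


def leave_one_out_accuracy(profiles, cities, k=3):
    """Fraction of samples whose city is recovered from the remaining samples."""
    hits = 0
    for i in range(len(profiles)):
        rest_p = profiles[:i] + profiles[i + 1:]
        rest_c = cities[:i] + cities[i + 1:]
        pred, _ = predict_origin(rest_p, rest_c, profiles[i], k)
        hits += pred == cities[i]
    return hits / len(profiles)


# Toy relative-abundance profiles over three hypothetical taxa.
profiles = [
    [0.70, 0.20, 0.10], [0.60, 0.30, 0.10], [0.65, 0.25, 0.10],
    [0.10, 0.20, 0.70], [0.20, 0.10, 0.70], [0.15, 0.15, 0.70],
]
cities = ["A", "A", "A", "B", "B", "B"]

print(predict_origin(profiles, cities, [0.68, 0.22, 0.10]))
print(leave_one_out_accuracy(profiles, cities))
```

The vote fraction gives a rough per-sample confidence. For cities without any reference samples, holding out whole cities instead of single samples would be a closer match to the mystery-sample setting.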

The primary data set is now available. It contains: i) hundreds of samples with WGS raw reads from urban locations from the MetaSUB Consortium, ii) over a thousand weather/climate features for cities, as well as a classification of city and neighbouring biomes.

In addition, the 16S sequencing-based OTUs for thousands of soil samples from all over the world, from the two projects mentioned, are also available.