Identify and validate a consensus signature using gene expression data

Summary

Do phenotypically different expression datasets share a common signature? Can the signature distinguish phenotypes in an independent dataset?

This recipe provides one method for identifying a consensus gene signature from a training set of several phenotypically distinct gene expression datasets. The recipe then validates the ability of the consensus signature to distinguish phenotypes by using an independent test gene expression dataset. An example use case is an investigator who wants to develop a gene expression signature to predict a specific phenotype, such as cancer or another disease.

Background information: What is a consensus gene expression signature?

A gene expression signature is the pattern of expression in a specific group of genes, usually ones that are related by function, position or other biological process. A consensus gene signature is an expression pattern for a specific group of genes, which is shared among different samples or across different phenotypes. For example, a group of genes regulating immune response could be similarly up-regulated during many different, unrelated infections. There are several types of consensus signatures; those that can be derived from gene expression data are called transcriptional consensus signatures. Consensus signatures can be created by overlapping individual gene signatures derived from multiple datasets. Compared to individual gene expression signatures, consensus signatures may be more accurate at distinguishing different phenotypes, such as diseased vs. normal samples.

To identify a consensus signature to predict sensitivity to JQ1 treatment, two training datasets and one test dataset were used. The training datasets included acute myeloid leukemia (AML) and multiple myeloma (MM) cell lines, which had been treated with either DMSO (control) or JQ1 (treatment). The test dataset included MYCN-amplified and MYCN-nonamplified neuroblastoma primary tumor samples. GenePattern was used to analyze the AML and MM cell lines; for each dataset, a gene expression signature was derived to identify JQ1 response in the cell line. Using Galaxy, the two signatures were then overlapped to determine the consensus signature shared by the two cell line datasets.

GenePattern was used to validate the ability of this JQ1-associated consensus signature to differentiate between phenotypes, by using the signature to hierarchically cluster the test dataset (neuroblastoma). Since the MYCN-amplified and MYCN-nonamplified neuroblastoma samples should have differing expression profiles, it was hypothesized that the consensus signature would be able to separate the samples by phenotype. Indeed, the consensus signature clustered most of the MYCN-amplified samples apart from the MYCN-nonamplified samples, suggesting that the consensus signature distinguishes the sensitivity-to-JQ1 phenotype.

Inputs

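To complete this recipe, we will need several gene expression datasets: the AML dataset (GSE29799), the MM dataset (GSE31365), and the neuroblastoma dataset (GSE12460). Each dataset consists of a GCT expression file and a CLS phenotype file.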

We will use the ComparativeMarkerSelection module to identify genes that are differentially expressed and can distinguish between two phenotypes (e.g., DMSO control vs. JQ1-treated), separately for the acute myeloid leukemia (AML) and multiple myeloma (MM) datasets. This module takes a GCT file (expression values) and a CLS file (phenotype labels) as input.
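For readers unfamiliar with the two input formats, the following is a minimal Python sketch of how a GCT file (version 1.2) and a categorical CLS file are laid out and could be read. It is not part of the recipe and is not how GenePattern reads these files; the function names and the toy data (gene names, sample names, values) are hypothetical and only illustrate the layout.

```python
import io

def read_gct(handle):
    """Parse a GCT v1.2 file into (sample_names, {gene_name: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version: {version!r}")
    # Line 2 holds the number of data rows and the number of sample columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # the first two columns are Name and Description
    data = {}
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(samples) != n_cols or len(data) != n_rows:
        raise ValueError("GCT dimensions do not match the header line")
    return samples, data

def read_cls(handle):
    """Parse a categorical CLS file into (class_names, per_sample_labels)."""
    # Line 1: number of samples, number of classes, and the constant 1.
    n_samples, n_classes = (int(x) for x in handle.readline().split()[:2])
    class_names = handle.readline().lstrip("#").split()
    labels = handle.readline().split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS counts do not match the header line")
    return class_names, labels

# Hypothetical toy data: two genes, four samples (two control, two treated).
toy_gct = (
    "#1.2\n2\t4\n"
    "Name\tDescription\tS1\tS2\tS3\tS4\n"
    "GENE_A\tna\t1.0\t1.2\t3.1\t2.9\n"
    "GENE_B\tna\t5.0\t4.8\t5.1\t5.2\n"
)
toy_cls = "4 2 1\n# DMSO JQ1\n0 0 1 1\n"

samples, expression = read_gct(io.StringIO(toy_gct))
class_names, labels = read_cls(io.StringIO(toy_cls))
```

The CLS labels are listed in the same order as the sample columns of the GCT file, which is why the two files must always be filtered together.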

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Click on the GenePattern icon to launch the tool.

Change to the Modules tab, and search for "ComparativeMarkerSelection". Once the module is loaded, change the following parameters:

input file: load the AML GCT file, e.g., GSE29799GPL6244_RNA_ORIGINALGENE_XXXXX.gct. To do this, use the GenomeSpace tab to navigate to Public > RecipeData > ExpressionData > GSE29799_AML, then drag the file to the input box.

cls file: load the AML treatment CLS file, e.g., treatment.cls, also located in the GSE29799_AML directory.

log transformed data: yes

output filename: AML_genes.comp.marker.odf

Click Run to run ComparativeMarkerSelection on the AML dataset.

Once the job has finished running, save the resulting file back to GenomeSpace:

Click on the file, and choose Save to GenomeSpace.

Navigate to a directory of your choice and choose Save.

Repeat these steps to identify differentially expressed genes for the MM dataset, GSE31365. Change the following parameters:

input file: load the MM GCT file, e.g., GSE31365GPL6244_RNA_ORIGINALGENE_XXXXX.gct. To do this, use the GenomeSpace tab to navigate to Public > RecipeData > ExpressionData > GSE31365_MM, then drag the file to the input box.

cls file: load the MM treatment CLS file, e.g., treatment.cls, also located in the GSE31365_MM directory.

log transformed data: yes

output filename: MM_genes.comp.marker.odf

Click Run to run ComparativeMarkerSelection on the MM dataset.

Once the job has finished running, save the resulting file back to GenomeSpace, as before.

We will load the two sets of differentially expressed genes from the AML and MM datasets into Galaxy. Then, we will use a pre-built GenomeSpace workflow to process the datasets, filtering and removing features that do not pass certain cutoffs. Finally, we will create a consensus signature and send a list of gene symbols back to GenomeSpace for additional analysis.

NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Click on the Galaxy icon to launch the tool.

Navigate to the following menu: Get Data > GenomeSpace import

Select the AML_genes.comp.marker.odf and MM_genes.comp.marker.odf files.

Click Send to Galaxy.

Once the files have been loaded, change the attributes for each file, by clicking the pencil icon and changing the following parameters:

We will use a pre-built GenomeSpace workflow to identify the consensus gene signature. This pre-built GenomeSpace workflow uses several steps to determine the overlap between the AML and MM datasets. First, we filter the AML and MM datasets to the top genes using the following cutoffs: (1) >= 1.5 differential expression; and (2) FDR < 0.05 as calculated by ComparativeMarkerSelection (GenePattern).
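The filter-then-overlap logic described above can be sketched in Python. This is only an illustration under stated assumptions, not the Galaxy workflow itself: the field names (gene, fold_change, fdr) are hypothetical stand-ins for the ComparativeMarkerSelection output columns, the toy values are invented, and the sketch assumes that ">= 1.5 differential expression" means a fold-change ratio of at least 1.5 in either direction.

```python
def passes_cutoffs(record, min_fold=1.5, max_fdr=0.05):
    """Return True if a gene passes both cutoffs.

    Assumption: fold_change is a ratio, so a gene counts as changed when
    the ratio is >= min_fold (up) or <= 1/min_fold (down).
    """
    fold = record["fold_change"]
    changed = fold >= min_fold or fold <= 1.0 / min_fold
    return changed and record["fdr"] < max_fdr

def signature(records, **cutoffs):
    """Set of gene symbols that pass the cutoffs in one dataset."""
    return {r["gene"] for r in records if passes_cutoffs(r, **cutoffs)}

# Hypothetical toy results for the two training datasets.
aml = [
    {"gene": "GENE_A", "fold_change": 2.0, "fdr": 0.01},
    {"gene": "GENE_B", "fold_change": 1.2, "fdr": 0.01},  # fold change too small
    {"gene": "GENE_C", "fold_change": 0.5, "fdr": 0.03},
    {"gene": "GENE_D", "fold_change": 3.0, "fdr": 0.20},  # FDR too high
]
mm = [
    {"gene": "GENE_A", "fold_change": 1.8, "fdr": 0.02},
    {"gene": "GENE_C", "fold_change": 0.4, "fdr": 0.04},
    {"gene": "GENE_D", "fold_change": 2.5, "fdr": 0.01},
]

# The consensus signature is the overlap of the two filtered gene sets.
consensus = sorted(signature(aml) & signature(mm))
```

Note that this sketch does not check whether a gene changes in the same direction in both datasets; whether the actual workflow does so is not stated in this recipe.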

We will use several GenePattern modules to extract the relevant information from our test dataset, which is the MYCN-amplified and MYCN-nonamplified neuroblastoma dataset. Then, we will project the consensus signature onto the neuroblastoma dataset and evaluate its ability to distinguish the two phenotypes (MYCN-amplified and MYCN-nonamplified) by clustering the resulting dataset.

We will use SelectFeaturesColumns to filter the neuroblastoma dataset to only those samples that are MYCN-amplified or MYCN-nonamplified. There is a third group of samples (called 'NILL'), in which MYCN amplification status was not determined; therefore, we filter these samples out and work only with the annotated data.

Click on the GenePattern icon to launch the tool.

Change to the Modules tab, and search for SelectFeaturesColumns. Once the module is loaded, change the following parameters:

input filename: use the GenomeSpace tab to navigate to Public > RecipeData > ExpressionData > GSE12460, then click and drag to load the neuroblastoma GCT file, e.g., GSE12460GPL750_RNA_ORIGINALGENE_XXXX.gct.

columns: 0-2, 4-6, 8-16, 18-19, 21-25, 28, 30-36, 38-54

output: MYCN.gene.exp.gct

Click Run.

Change to the Jobs tab, and reload the SelectFeaturesColumns module by clicking on the job and choosing Reload Job.

Once the module is loaded, change the following parameters:

input filename: load the neuroblastoma CLS file, e.g., Myc.Expression.cls. To do this, click the icon next to the input filename parameter to remove the GCT file from the module. Then, use the GenomeSpace tab to navigate to the GSE12460_MYCN directory: Public > RecipeData > ExpressionData > GSE12460, then drag the file to the input box.

output: MYCN.gene.exp.cls

Click Run.

Once the CLS file has been generated, click on the file and choose Save File to download a copy of this to your local computer. We will need this file to be downloaded locally for a later step.
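To make the columns parameter used in this step concrete, here is a small Python sketch of how a range specification such as the one above expands into individual column indices, and how the same indices can be applied to both the expression values and the sample names. It assumes the indices are zero-based and the ranges inclusive; the function names and toy data are hypothetical, and this is not the SelectFeaturesColumns implementation.

```python
def parse_index_spec(spec):
    """Expand a spec like '0-2, 4-6, 28' into a list of integer indices.

    Assumption: ranges are inclusive on both ends.
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(part))
    return indices

def select_columns(samples, rows, indices):
    """Keep only the given sample columns in the names and in every row."""
    kept_samples = [samples[i] for i in indices]
    kept_rows = {gene: [values[i] for i in indices] for gene, values in rows.items()}
    return kept_samples, kept_rows

# The column specification from this recipe.
spec = "0-2, 4-6, 8-16, 18-19, 21-25, 28, 30-36, 38-54"
kept = parse_index_spec(spec)
dropped = [i for i in range(55) if i not in kept]

# Hypothetical toy example with four samples, keeping columns 0 and 2-3.
toy_samples = ["S0", "S1", "S2", "S3"]
toy_rows = {"GENE_A": [1.0, 2.0, 3.0, 4.0]}
toy_kept_samples, toy_kept_rows = select_columns(
    toy_samples, toy_rows, parse_index_spec("0, 2-3")
)
```

Under these assumptions the specification keeps 47 of the 55 columns; the gaps in the ranges correspond to the samples that are filtered out.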

We will use SelectFeaturesRows to filter the neuroblastoma dataset to only those gene symbols which are in the consensus signature.

Change to the Modules tab, and search for "SelectFeaturesRows". Once the module is loaded, change the following parameters:

input filename: MYCN.gene.exp.gct, the previously filtered neuroblastoma GCT file. To do this, use the Jobs tab to find the previous job results, then drag the file to the input box.

list filename: consensus.genelist.txt, the consensus signature gene list. To do this, use the GenomeSpace tab to navigate to the directory containing the consensus signature gene list, then drag the file to the input box.
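The row filtering performed in this step amounts to keeping only the genes named in the list file. A minimal Python sketch follows, assuming the gene list contains one gene symbol per line; the function names and toy gene symbols are hypothetical, and this is not the SelectFeaturesRows implementation.

```python
def read_gene_list(text):
    """Return the non-empty lines of a gene list, one symbol per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def select_rows(rows, gene_list):
    """Keep rows whose gene symbol is in gene_list.

    Also returns the symbols that were requested but not found, which is
    worth checking when the signature and the test dataset come from
    different platforms.
    """
    wanted = set(gene_list)
    kept = {gene: values for gene, values in rows.items() if gene in wanted}
    missing = sorted(wanted - set(rows))
    return kept, missing

# Hypothetical toy expression rows and consensus gene list.
rows = {
    "GENE_A": [1.0, 2.0],
    "GENE_B": [3.0, 4.0],
    "GENE_C": [5.0, 6.0],
}
gene_list = read_gene_list("GENE_A\nGENE_C\nGENE_X\n")
kept, missing = select_rows(rows, gene_list)
```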

We will use HierarchicalClustering in GenePattern to create a dendrogram of the filtered neuroblastoma dataset, clustering the samples to determine how well the consensus signature can distinguish between phenotypes (MYCN-amplified vs. MYCN-nonamplified). Then, we will use HierarchicalClusteringViewer to view the results of the clustering algorithm and to label samples by phenotype.

Change to the Modules tab, and search for HierarchicalClustering.

Once the module is loaded, change to the Jobs tab, then change the following parameters:

input file: MYCN.consensus.gct (output from SelectFeaturesRows).

column distance measure: Euclidean distance

row distance measure: No row clustering

clustering method: Pairwise complete-linkage

Click Run.

Once the job has finished running, change to the Modules tab and search for HierarchicalClusteringViewer.

Once the module is loaded, change to the Jobs tab, then change the following parameters:

cdt file: MYCN.consensus.cdt

atr file: MYCN.consensus.atr

Click Run.

Once HierarchicalClusteringViewer has loaded, you should see a heatmap of the original MYCN.consensus.gct file. To color the samples by phenotype, change the following parameters:

Click the add/edit labels button.

Make sure the "Samples" type of label is selected, then click the "Add Label" button.

Choose the MYCN.gene.exp.cls file you downloaded to your local computer in Step 5.

Click "OK". This should color the samples according to the labels in the CLS file.
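The clustering settings used above (Euclidean distance between sample columns, pairwise complete linkage, no row clustering) can be illustrated with a short pure-Python sketch. It is a simplified illustration of complete-linkage agglomerative clustering, not the GenePattern implementation; the sample names and expression values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(profiles):
    """Agglomerative clustering of samples; returns the merge history.

    profiles maps sample name -> expression vector (one value per gene).
    Each merge is (members_of_cluster_1, members_of_cluster_2, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest distance between any pair of their members.
                d = max(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical toy profiles: two signature genes measured in four samples.
profiles = {
    "amp1": [5.0, 5.0],
    "amp2": [5.0, 6.0],
    "non1": [0.0, 0.0],
    "non2": [2.0, 0.0],
}
merges = complete_linkage(profiles)
```

In this toy example the two "amp" samples merge first, then the two "non" samples, and the two groups join last at the largest distance, which is the pattern one hopes to see if a signature separates two phenotypes.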

Results Interpretation

This is an example interpretation of the results from this recipe. First, we identified a consensus gene signature of JQ1 activity by finding genes that became differentially expressed due to JQ1 treatment in both acute myeloid leukemia (AML) and multiple myeloma (MM). Then, we projected this consensus signature on a test dataset of neuroblastoma cells which were not treated with JQ1, but were either MYCN-amplified or MYCN-nonamplified. Since MYCN amplification is associated with an increased sensitivity to BET bromodomain inhibitors, such as JQ1, we expected that a signature of JQ1 activity would be able to separate MYCN-amplified and MYCN-nonamplified phenotypes.

These results suggest that the JQ1 consensus signature is capable of differentiating between MYCN-amplified and MYCN-nonamplified neuroblastoma samples. In particular, hierarchical clustering yields three distinct groups of samples: (1) the majority of the MYCN-amplified samples (left cluster, light blue); (2) MYCN-nonamplified samples that are similar to MYCN-amplified samples (middle cluster, dark blue); and (3) MYCN-nonamplified samples that are distinct from MYCN-amplified samples (right cluster, dark blue). The significance of this grouping would need further confirmation.