The ProteoRE Galaxy instance provides the tools needed to run a complete annotation pipeline on a list of proteins identified by LC-MS/MS experiments. This tutorial introduces these tools and guides you through a simple pipeline using example datasets based on the following study: “Proteomic characterization of human exhaled breath condensate” by Lacombe et al., European Journal of Breath, 2018.

The estimated time to complete this tutorial is 60 minutes. If you have any questions, problems or feedback, please contact us at contact@proteore.org.

Objective

The objective of this tutorial is to annotate and explore a proteomic dataset by answering the following questions:

How to filter out technical contaminants?

How to check for tissue-specificity?

How to perform enrichment analysis?

How to map your protein list to pathways (Reactome)?

How to compare your proteome with other studies?

Requirements

To follow this tutorial, general knowledge of the Galaxy environment is necessary. Please read the Galaxy introduction if you are not familiar with this environment.

Input datasets

For this tutorial, we will use three datasets:

The list of proteins identified by LC-MS/MS in the exhaled breath condensate (EBC) from Lacombe et al.:

Once a proteome has been identified and/or quantified using an MS-based approach, interpreting it is an important step to characterize the content of the sample in terms of functional properties and to extend the biological knowledge related to this sample. In this tutorial, we illustrate the annotation and the exploration of the EBC proteome by performing the following steps:

A group of 10 proteins were identified in both “technical” control samples with an enrichment in EBC samples below a fixed threshold. These proteins were thus considered to be technical contaminants (see list of proteins in Table 4 in Lacombe et al. 2018) and have to be removed from the initial dataset.

Step 3. Click the Insert Filter by keywords box to add the list of keywords to be filtered out. In this case, the keywords are a list of Uniprot accession numbers.

Step 4. Fill in the parameters in the Filter by keywords section:

The column of the input dataset to which the filter will be applied; in this case it is the column that contains the Uniprot accession numbers (c1 by default).

You can perform an exact or a partial match with the keywords entered. Partial match is set by default. We keep the default option (No) in this tutorial.

You can either copy and paste a list of keywords (separated by ";") into the text area or choose a file that contains the keywords in text format, with one keyword per line. Here we copy and paste the following list of Uniprot accession numbers: P04264;P35908;P13645;Q5D862;Q5T749;Q8IW75;P81605;P22531;P59666;P78386
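The keyword filter described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of the ProteoRE tool: the function names `parse_keywords` and `filter_rows` are hypothetical, and the two `EXAMPLE` accessions are made-up placeholders.

```python
def parse_keywords(text, sep=";"):
    """Split a separator-delimited keyword string, dropping stray spaces and empty items."""
    return [kw.strip() for kw in text.split(sep) if kw.strip()]


def filter_rows(rows, keywords, column=0, exact=False):
    """Return the rows whose value in `column` matches none of the keywords.

    exact=False mimics a partial match (keyword found anywhere in the cell);
    exact=True requires the whole cell to equal a keyword.
    """
    kept = []
    for row in rows:
        cell = row[column]
        if exact:
            hit = cell in keywords
        else:
            hit = any(kw in cell for kw in keywords)
        if not hit:
            kept.append(row)
    return kept


# The contaminant list from the tutorial, pasted with its irregular spacing.
contaminants = parse_keywords(
    "P04264;P35908;P13645; Q5D862 ;Q5T749; Q8IW75;P81605;P22531; P59666; P78386"
)

# Toy input: two accessions from the contaminant list plus two made-up placeholders.
# Galaxy's column c1 corresponds to index 0 here.
rows = [["P04264"], ["EXAMPLE1"], ["P78386"], ["EXAMPLE2"]]
cleaned = filter_rows(rows, contaminants, column=0)
print(len(contaminants), cleaned)
```

Stripping whitespace before matching matters here: without it, a keyword such as " Q5D862 " would never match the accession in the table.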

As EBC samples are obtained from air exhaled through the oral cavity, contamination with salivary proteins had to be assessed, even though the RTube collection device contains a saliva trap to separate saliva from the exhaled breath. We decided to check the expression pattern of each protein of the "core" EBC proteome using the Human Protein Atlas (HPA). As HPA is indexed by Ensembl gene identifier (ENSG), we first need to convert the Uniprot IDs to Ensembl gene IDs (ENSG). Second, we check for proteins that are highly expressed in the salivary glands as reported by HPA. Third, we filter out these proteins.

Step 3. A wide range of information can be extracted from the HPA source files; see the user documentation at the end of the tool's submission form for a more detailed description. In this tutorial, we select Gene name, Gene description, RNA tissue category (according to HPA) and RNA tissue specificity abundance in "Transcript Per Million".

Then click the Execute button.

In the History panel, a new file named Add expression data to your protein list on data 8 will be created:

Four columns were added (n°5, 6, 7 and 8), corresponding to the HPA information previously selected. Scroll down the table and note that, at the end of the list, column n°8 reports AMY1B, CALML5, PIP, ZG16B, CST4, MUC7, CST1 and CST2 as highly enriched in salivary gland, each with an elevated RNA transcript specific TPM value, suggesting that these proteins may come from saliva and not from the exhaled breath condensate. We will therefore remove these biological contaminants from our initial protein set.

Step 3. Click the Insert Filter by keywords box to add the list of keywords to be filtered out. In this step, we filter out the lines that contain "salivary" in the RNA transcript specific TPM column.

Step 4. Fill in the parameters in the Filter by keywords section:

The column of the input dataset to which the filter will be applied; in this case it is the RNA transcript specific TPM column: c8

You can perform an exact or a partial match with the keywords entered. Partial match is set by default. We keep the default option (No) in this tutorial.

You can either copy and paste a list of keywords (separated by ";") into the text area or choose a file that contains the keywords in text format, with one keyword per line. Here we type "salivary" in the text area.

Then click the Execute button.

Two output files are created:

Filter by keywords or numerical value on Add expression data to your protein list on data 8 - Filtered lines: 10 proteins have been removed from the EBC list.
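The way a partial match on "salivary" separates the table into kept lines and filtered lines can be illustrated as follows. This is a hedged sketch, not the tool's implementation: `split_by_keyword` is a hypothetical helper, the case-insensitive comparison is an assumption, and the cell contents of the toy table are illustrative placeholders rather than real HPA values.

```python
def split_by_keyword(rows, keyword, column):
    """Partition rows into (kept, filtered) using a case-insensitive substring match."""
    kept, filtered = [], []
    needle = keyword.lower()
    for row in rows:
        # A row goes to `filtered` when the keyword occurs anywhere in the chosen cell.
        target = filtered if needle in row[column].lower() else kept
        target.append(row)
    return kept, filtered


# Toy two-column table. In the tutorial the filter is applied to column c8,
# which would be index 7 in Python; here the annotation sits at index 1.
# MUC7 and CST1 are named in the tutorial; GENE_A and GENE_B are placeholders.
rows = [
    ["GENE_A", "lung: placeholder value"],
    ["MUC7", "salivary gland: placeholder value"],
    ["GENE_B", ""],
    ["CST1", "salivary gland: placeholder value"],
]

kept, filtered = split_by_keyword(rows, "salivary", column=1)
print([r[0] for r in kept])
print([r[0] for r in filtered])
```

Keeping the discarded rows in a second list mirrors the separate "Filtered lines" output, which makes it easy to verify which proteins were removed.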

Note also that a list of gene names (selected on the basis of their TPM value) could have been entered and applied to column n°5, instead of the keyword "salivary" applied to column n°8, as was done in Lacombe et al., 2018.

The resulting list of 141 proteins identified in the two pooled EBC samples (excluding the 10 salivary proteins) is now submitted to Gene Ontology (GO)-term enrichment analysis to determine which functions are significantly enriched in our EBC proteomic dataset compared to the lung proteome (corresponding to tissue-specific genes extracted from the Human Protein Atlas). To do so, we first build a lung reference proteome, which should be more representative of the studied sample than a full human proteome. This reference is then used for the enrichment analysis performed with the ClusterProfiler tool (based on the R package clusterProfiler).

Experimental data source: Two experimental data sources are available (expression data from immunohistochemistry (IHC) and from RNAseq experiments, both from HPA); here we retrieve information based on IHC (default parameter)

Tissue: The dropdown menu lets you select tissues of interest from a list of 58 tissues; click Lung, then repeat by clicking Bronchus

Expression level: Ranges from High to Not detected (according to HPA criteria), here only High, Medium and Low are selected

Reliability score: Indicates how reliable the expression/detection level is; here we select Enhanced and Supported, which are the most reliable scores according to HPA. See the user documentation at the end of the tool's submission form for a more detailed description
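The combined effect of the tissue, expression level and reliability parameters can be pictured as a simple record filter. The sketch below is an assumption-laden illustration, not the tool's code: the `Record` fields, the `build_reference` function and all `ENSG_TOY_*` identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    gene: str         # gene identifier (toy placeholder, not a real ENSG ID)
    tissue: str
    level: str        # expression level, e.g. "High" or "Not detected"
    reliability: str  # reliability score, e.g. "Enhanced" or "Supported"


def build_reference(records, tissues, levels, reliabilities):
    """Return the sorted, de-duplicated gene IDs that pass all three filters."""
    return sorted({
        r.gene
        for r in records
        if r.tissue in tissues and r.level in levels and r.reliability in reliabilities
    })


# Toy records, invented for illustration only.
records = [
    Record("ENSG_TOY_1", "lung", "High", "Enhanced"),
    Record("ENSG_TOY_2", "bronchus", "Low", "Supported"),
    Record("ENSG_TOY_3", "lung", "Not detected", "Enhanced"),  # rejected: level
    Record("ENSG_TOY_4", "liver", "High", "Enhanced"),         # rejected: tissue
    Record("ENSG_TOY_1", "bronchus", "Medium", "Supported"),   # duplicate gene
    Record("ENSG_TOY_5", "lung", "Medium", "Uncertain"),       # rejected: reliability
]

reference = build_reference(
    records,
    tissues={"lung", "bronchus"},
    levels={"High", "Medium", "Low"},
    reliabilities={"Enhanced", "Supported"},
)
print(reference)
```

A gene detected in both lung and bronchus appears only once in the resulting background, which is why the result is built from a set.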

Note that expression information for respiratory cell types is retrieved (column 4; e.g. macrophages, pneumocytes, respiratory epithelial cells) that could be used for further refinement of your reference background.

As the ClusterProfiler tool (which we are going to use for the enrichment analysis) does not accept ENSG (Ensembl gene) identifiers as input, we need to convert these IDs into either Entrez gene IDs or Uniprot accession numbers, both of which it accepts.
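Conceptually, an ID conversion of this kind is a lookup in a mapping table in which one gene may correspond to several proteins and some genes may have no match. The following sketch assumes such a table is available as a Python dictionary; `convert_ids` and every identifier shown are hypothetical placeholders, not real Ensembl or Uniprot entries.

```python
def convert_ids(ids, mapping):
    """Map each ID through `mapping`; return (converted, unmapped).

    `converted` keeps the first-seen order and contains no duplicates.
    """
    converted, unmapped = [], []
    for identifier in ids:
        targets = mapping.get(identifier)
        if targets:
            converted.extend(targets)
        else:
            unmapped.append(identifier)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(converted)), unmapped


# Toy mapping table, invented for illustration only.
mapping = {
    "ENSG_TOY_1": ["UNIPROT_TOY_A"],
    "ENSG_TOY_2": ["UNIPROT_TOY_B", "UNIPROT_TOY_C"],  # one gene, two proteins
}

converted, unmapped = convert_ids(
    ["ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_3", "ENSG_TOY_1"], mapping
)
print(converted, unmapped)
```

Reporting the unmapped identifiers separately is useful because any ID lost at this stage silently shrinks the reference background used for enrichment.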

The History panel now contains a new text output file and a new list of graphical outputs.

The suffix "GGO" (GroupGO) corresponds to the results of the "GO categories representation analysis" option (which classifies genes/proteins based on GO distribution at a specific level), while the suffix "EGO" (EnrichGO) corresponds to the results of the enrichment analysis (based on an over-representation test of GO terms against the lung reference background). Two types of graphical output are provided: bar plots and dot plots.
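An over-representation test of this kind is commonly formulated with the hypergeometric distribution: given a background of N proteins of which K carry a GO term, it asks how likely it is to see at least k annotated proteins in a list of n. The sketch below assumes that formulation; it is not the ClusterProfiler implementation, the function name `hypergeom_pvalue` is hypothetical, and the counts are made up.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: size of the reference background
    K: background proteins annotated with the GO term
    n: size of the protein list being tested
    k: proteins in the list annotated with the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Made-up counts: background of 20 proteins, 5 annotated; list of 4, 2 annotated.
p = hypergeom_pvalue(k=2, n=4, K=5, N=20)
print(round(p, 4))
```

In a real analysis one such p-value is computed per GO term and then corrected for multiple testing, which this sketch does not cover.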

According to this analysis, the main biological processes that were found over-represented in EBC compared to lung were numerous immune system processes and exocytosis (see EGO.BP.dot.png, for Enriched Biological Process GO terms dot-plot representation in png format). Below you can click on Go to dataset to view the diagrams of MF category.

The 141 proteins identified in EBC samples are now mapped to biological pathways and visualized via the web service of Reactome, an open access, manually curated and peer-reviewed human pathway database that aims to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge.

Input file: the EBC proteome to be analyzed after removal of technical and biological contaminants

Header: Yes

Column number: c1

Then click the Execute button.

From the History panel, click the View data button of the new output to display the access to Reactome in the central panel. Click the Analyse button to open the Reactome analysis tools page via the web service and display the results. Browse the biological pathways in which EBC proteins are highlighted (e.g. immune system pathways) using the functionalities of the Reactome interface. Here you can click on Go to dataset > Analyse to open the web service page.

Our experimental design and the dataset produced (i.e. the list of 151 proteins identified in both pooled EBC samples including the 10 salivary proteins) were compared to the two most extensive EBC proteome maps previously described for healthy subjects (Mucilli et al., 2015 ; Bredberg et al., 2012). To do so, a Venn diagram showing the overlap between our dataset and the two previous EBC characterizations in healthy donors is drawn using the Jvenn tool from ProteoRE.

A text output and a graphical output are now created. From the Venn diagram, we can see the number of proteins that are common to, or unique to, each combination of lists (click on Go to dataset to view the Venn diagram).
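The counts shown in a Venn diagram correspond to a partition of all proteins according to the exact combination of lists in which each one appears. A minimal sketch of that computation follows; it is not the Jvenn code, `venn_regions` is a hypothetical helper, and the list names and protein labels are toy placeholders.

```python
def venn_regions(named_sets):
    """Return {frozenset of list names: items found in exactly those lists}."""
    regions = {}
    all_items = set().union(*named_sets.values())
    for item in all_items:
        # The region of an item is the set of all lists that contain it.
        members = frozenset(name for name, s in named_sets.items() if item in s)
        regions.setdefault(members, set()).add(item)
    return regions


# Toy lists, invented for illustration only.
lists = {
    "this study": {"P1", "P2", "P3", "P4"},
    "study A": {"P2", "P3", "P5"},
    "study B": {"P3", "P4", "P6"},
}

regions = venn_regions(lists)
for members, items in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(members), len(items))
```

Each protein lands in exactly one region, so the region sizes add up to the number of distinct proteins across all lists.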