Formatting Expression Matrix TSV files

If you are importing expression data from an external source or want to populate a file with your own data, please ensure that it is formatted properly for use with KBase. An Expression Matrix TSV (tab-separated values) file is a tab-delimited text file with one gene per row and one sample per column. The first label in the header row must be “feature_ids”, followed by tab-delimited labels for the samples.
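The layout rules above can be checked before upload with a few lines of Python. This is a minimal sketch, not part of KBase: the function name check_expression_tsv is hypothetical, and it only checks the header label and the column count of each row.

```python
import csv
import io


def check_expression_tsv(text):
    """Return a list of layout problems found in TSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    # The first label of the header row must be "feature_ids".
    if not header or header[0] != "feature_ids":
        problems.append("first header label must be 'feature_ids'")
    # At least one sample label has to follow it.
    if len(header) < 2:
        problems.append("header has no sample labels")
    # Every data row needs one cell per header column.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
    return problems


if __name__ == "__main__":
    sample = "feature_ids\tsample_1\tsample_2\nSO_0009\t1.0\t2.0\n"
    print(check_expression_tsv(sample))
```

To check a file on disk, read it first, for example with open("matrix.tsv", newline="").read(), and pass the resulting text to the function.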

For this example, we will use the Shewanella_oneidensis_MR-1 genome and the Shewanella_MR-1_M3D_ExpData expression data set from the Example data tab. (When working with your own data, you can start with an empty Expression Matrix template.)

To add the Genome to your Narrative, find the Data Panel along the left side of the screen and click the Add Data (or red “+”) button. This will open the Data Browser slideout. Select the Example tab at the top of the slideout and search for the “Shewanella_oneidensis_MR-1” Genome. Mouse over the Genome name and click the blue Add button.
The S. oneidensis Genome object should appear in your Data Panel.

Do the same for the Shewanella_MR-1_M3D_ExpData expression data set: find it in the Example tab and click the “Add” button to add it to your Narrative. You should now see both the Genome and the Expression Matrix in your Data Panel.

Each gene measured in the expression dataset should have an identifier listed in the first column of the TSV file. To ensure that the gene identifiers listed in your dataset correspond to the aliases in KBase, click the name of the Genome to open up the genome viewer. Click on the tab labeled Browse Features and locate a gene of interest by searching for the name of the function or protein associated with the gene. In this example, we’ll search for ‘DNA Polymerase’ and identify SO_0009 as a gene of interest.

Click the Feature ID of the gene of interest to open up a new tab with additional information about the gene. Locate the section titled Aliases and crosscheck the gene labels contained within your expression dataset with either the Feature ID or one of these aliases to ensure that these labels will correspond to features in KBase.
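For a large dataset, the same crosscheck can be done in bulk once you have a list of the Genome's feature IDs and aliases. The sketch below assumes you have collected those identifiers yourself (for example, by copying them from the genome viewer); it does not call any KBase API, and the function name and the ID "not_a_gene" are purely illustrative.

```python
def find_unmatched_ids(matrix_ids, genome_ids):
    """Return the matrix IDs that do not appear among the genome's IDs or aliases.

    matrix_ids: gene labels from the first column of your TSV file.
    genome_ids: feature IDs and aliases gathered from the Genome (assumed input).
    """
    known = set(genome_ids)
    # Preserve the original order so unmatched labels are easy to find in the file.
    return [gene_id for gene_id in matrix_ids if gene_id not in known]


if __name__ == "__main__":
    genome_ids = {"SO_0009"}                  # ID taken from the example above
    matrix_ids = ["SO_0009", "not_a_gene"]    # second label is made up
    print(find_unmatched_ids(matrix_ids, genome_ids))
```

Any label the function returns should be corrected or mapped to a Feature ID or alias before the file is uploaded.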

To see an example Expression Matrix format, download the Shewanella_MR-1_M3D_ExpData expression data set from your Data Panel. Unzip the downloaded file and examine the file matrix.tsv. Opened in Excel, the file has one row per gene and one column per sample.

The first column matches feature IDs from the genome. This column could also have contained gene aliases. Some of the gene aliases supported by KBase include NCBI, EMBL, UniProt, BioCyc, and ASAP. The column headings are sample conditions. The remaining cells in the table contain expression values for the appropriate gene and sample. Be sure to exclude gene features that are missing all expression values or whose expression values do not change across the samples.
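Excluding uninformative genes can be automated. The following sketch makes two assumptions that are not stated in this guide: a missing value is represented as None or NaN, and a gene counts as non-changing when all of its non-missing values are identical. The function name is hypothetical.

```python
import math


def keep_informative_rows(rows):
    """Drop genes whose values are all missing or identical across samples.

    rows: list of (feature_id, values) pairs, where values is a list of
    floats and a missing value is None or NaN (an assumption of this sketch).
    """
    kept = []
    for feature_id, values in rows:
        present = [v for v in values if v is not None and not math.isnan(v)]
        if not present:
            continue  # every expression value is missing
        if len(set(present)) == 1:
            continue  # all non-missing values are the same
        kept.append((feature_id, values))
    return kept


if __name__ == "__main__":
    rows = [
        ("gene_a", [1.0, 2.0, 3.0]),    # varies: kept
        ("gene_b", [None, None, None]), # all missing: dropped
        ("gene_c", [5.0, 5.0, 5.0]),    # constant: dropped
    ]
    print([feature_id for feature_id, _ in keep_informative_rows(rows)])
```

If your data treats a single measured value with the rest missing as meaningful, relax the second check accordingly.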

Below is an example of a properly formatted expression data file in TSV format. In this case, the gene IDs in the first column correspond to gene identifiers for E. coli K-12 MG1655 genes, and the sample conditions are derived from the Many Microbe Microarrays Database (M3D).

feature_ids	dinI_U_N0025_r1	dinI_U_N0025_r2	dinI_U_N0025_r3
b4634	9.05367	9.07827	9.10114
b3241	7.20924	7.08695	7.07071
b3240	7.21535	7.14312	7.19478

If you want to build an expression matrix with your own data, download an empty template and populate it with your expression data.
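If your expression values already live in a script rather than a spreadsheet, the file can also be generated programmatically. The sketch below uses Python's standard csv module; the function name matrix_to_tsv and the dictionary layout are assumptions made for illustration, and the values come from the E. coli example above.

```python
import csv
import io


def matrix_to_tsv(sample_labels, expression):
    """Serialize {feature_id: [values]} to TSV text with a 'feature_ids' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row: "feature_ids" first, then one label per sample.
    writer.writerow(["feature_ids"] + list(sample_labels))
    for feature_id, values in expression.items():
        if len(values) != len(sample_labels):
            raise ValueError(
                f"{feature_id}: expected {len(sample_labels)} values, got {len(values)}"
            )
        writer.writerow([feature_id] + list(values))
    return buffer.getvalue()


if __name__ == "__main__":
    samples = ["dinI_U_N0025_r1", "dinI_U_N0025_r2", "dinI_U_N0025_r3"]
    expression = {
        "b4634": [9.05367, 9.07827, 9.10114],
        "b3241": [7.20924, 7.08695, 7.07071],
    }
    print(matrix_to_tsv(samples, expression), end="")
```

To save the result, write the returned text to a file, for example with open("matrix.tsv", "w", newline="").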

Additional Information for Plant Expression Data

For KBase plant genomes, the gene IDs retain the data structure from the external source databases (Ensembl or Phytozome) and do not have aliases as mentioned above. When constructing an expression dataset, append your gene IDs with the transcript IDs followed by “.CDS” as seen in the screenshot below. You can check that you have the correct gene IDs using the same method detailed in the Formatting Expression Matrix TSV files section.

Upload an Expression Matrix from a TSV-formatted file

For this example, we will upload the expression dataset from above, which contains expression values for Shewanella oneidensis MR-1.

Expression Matrices can be uploaded into KBase without a Genome, but they will have limited use. To link an Expression Matrix to its Genome during import, first add the Genome that the Expression Matrix refers to. For the example we’ve been using, the Genome was added above from the Example data tab.

Now open the new Import tab in the Data Slideout and drag the example expression matrix file that you just downloaded (matrix.tsv) into your Staging Area. (Go here if you need instructions on doing that.)

Drag & Drop Limitations

Drag & drop from your local computer works for many files, but there is a size limit that depends on your computer and browser. Some users have reported problems with files of around 20 gigabytes. For larger files, use the Globus Online transfer.

Now the expression matrix is in your Staging area and you can import it from there into your Narrative. Open the pulldown menu to the right of the filename in your Staging Area and select “Expression Matrix”:

Click the import icon (up arrow) to the right of “Expression Matrix”. The data slideout will close and an app called “Import TSV File as Expression Matrix From Staging Area” will be added to your Narrative.

Notice that the name of the Tab-delimited (TSV) file is already filled in, as is a suggested name for the Expression Matrix data object that will be created by the import (you can change that if you like).

At this point, the corresponding genome is optional and hasn’t been linked to the expression matrix. To add the name of the Genome, click on the ‘show advanced’ link to the right of ‘Input Objects’ in the Import app:

Use the Genome dropdown to select the Shewanella_oneidensis_MR-1 genome. Adjust any of the other advanced options if needed, then click the green Run button to start the import. When the import is finished, your Data Panel will update to show the new Expression Matrix object, and a report will appear in the import app cell.

Compressed/zipped files

In the example above, we used a Genome that was in the Example data. You can also import your own Genome data from a file (go here for more information).

The Genome import can handle gzipped (.gz) input files. However, .zip files require special handling and .Z files are not yet supported by the importers (we are working on adding that). You can upload a zip file to your Staging Area, but it is recommended that you use the “uncompress” button to its left (the one with the diagonal arrows) to unzip it before trying to import it.