The TISSUES resource for the tissue-specificity of genes

Last week a seminal paper on tissue-specificity of the transcriptome, proteome, knowledgeome, and literome was published [1], along with an accompanying webapp. This study is notable for intelligently merging expression studies and performing informative comparisons between studies and data types.

We have already processed Bgee for anatomy-specific transcription. Compared to Bgee, the anatomical coverage of TISSUES is much lower, but edges are likely of higher quality. Therefore I think Bgee and TISSUES will be complementary. Additionally, TISSUES contains other measures of tissue-specific gene expression such as UniProtKB, proteomics, and literature mining. We are particularly interested in including these data sources.

Our first step is to retrieve the data for each of the four methods. We want entities encoded using identifiers rather than names or symbols. We will need to pick a confidence threshold. And for each method, we would like the broadest tissue coverage permitted by that method. Advice appreciated!

The TISSUES resource uses the proteins in the latest version of STRING as baseline. If you need to map the Ensembl protein identifiers to other database identifiers, the best thing to do is thus to use either the STRING alias file or one of the specific mapping files available here.
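As a rough sketch of using the STRING alias file for identifier mapping: the snippet below assumes a tab-separated layout of (STRING/Ensembl protein ID, alias, source), which should be verified against the actual file from the STRING download page; the sample rows and the `build_alias_map` helper are illustrative.

```python
import csv
import io

# Hypothetical excerpt of a STRING alias file: tab-separated columns
# (string_protein_id, alias, source). Check the real layout on the
# STRING download page before relying on it.
alias_text = """\
9606.ENSP00000269305\tTP53\tBLAST_UniProt_GN
9606.ENSP00000269305\t7157\tBLAST_UniProt_GN Entrez
9606.ENSP00000335153\t3064\tEntrez
"""

def build_alias_map(handle, source_filter="Entrez"):
    """Map Ensembl protein identifiers to aliases from one chosen source."""
    mapping = {}
    for string_id, alias, source in csv.reader(handle, delimiter="\t"):
        # The source column can list several space-separated sources.
        if source_filter in source.split():
            mapping.setdefault(string_id, set()).add(alias)
    return mapping

alias_map = build_alias_map(io.StringIO(alias_text))
```

Filtering on the source column keeps the mapping restricted to one identifier namespace (here, Entrez Gene IDs) rather than mixing gene symbols and accessions.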

Introducing a score cutoff does not sound like the right way to go about it to me. The problem is that there is no right cutoff; wherever you put it, what scores just above the cutoff is almost exactly as reliable as what scores just below the cutoff.

I know that life is much simpler if you do not have to deal with confidence scores. However, the moment you take associations with confidence scores and make them binary by applying an arbitrary cutoff, you are throwing away information. For this reason, you will almost always be better off having a method that can deal with confidence scores in a sensible manner and only apply a cutoff on your predictions in the very end after all available evidence has been combined.

Introducing a score cutoff does not sound like the right way to go about it to me.

I totally agree that cutoffs are suboptimal because they require arbitrary decision-making, conflate levels of confidence, and throw away information. In the long term, we hope to modify our method to enable weighted edges and to investigate other methods that allow weights (such as data fusion [1, 2]). However, in the short term, I want to proceed with unweighted edges and understand the sacrifice.

@larsjuhljensen, you may disagree with implementing cutoffs, but by providing "confidence scores that are comparable between datasets" [3], you have made the life of binners like me much more pleasant (:

For the experimental dataset from TISSUES, one cutoff I envision is 2 or more sources reporting scores ≥ 3 per gene–tissue relation. Does this sound reasonable?

The TISSUES resource uses the proteins in the latest version of STRING as baseline.

My gut feeling is that your cutoff is too stringent. If you want support from at least two different experimental datasets, I would not put the cutoff at 3. I would at most put it at 2.

However, I think there are more fundamental problems with that approach than just the numeric cutoff. Not all tissues were included in all datasets. This means that some tissues will be entirely lost if you require support from two datasets. Also, I do not understand why you would want to exclude the other channels (knowledge and text mining). Having, for example, text mining to support something that would otherwise be based on only a single dataset is very valuable.

If you want to enforce a hard cutoff to make things binary, I would urge you to at least take the integrated scores, which take everything into account, and apply the cutoff to those. In this case a score of 3 might be appropriate. Applying cutoffs to the scores of individual datasets before combining them is in my opinion a fundamentally bad idea.

@larsjuhljensen, we will use the integrated dataset as the primary resource. However, the integrated dataset is subject to knowledge biases (stemming from text mining and UniProtKB). Since we want the option to subset the network to only include knowledge-unbiased edges, we would also like a consolidated score using only the experimental dataset.

In other words, if a gene–tissue edge scores above the cutoff in the consolidated experimental dataset, it's considered unbiased. However, if it only passes the cutoff in the integrated dataset, it's considered biased.
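This decision rule can be written down as a tiny helper (the function name and the cutoff default are illustrative, not part of our pipeline):

```python
def classify_edge(score_experiment, score_integrated, cutoff=3.0):
    """Classify a gene-tissue edge by which consolidated score passes the cutoff.

    'unbiased' -- the experiment-only consolidated score passes the cutoff
    'biased'   -- only the integrated score (which folds in knowledge and
                  text mining) passes the cutoff
    None       -- neither score passes; the edge is excluded
    """
    if score_experiment >= cutoff:
        return "unbiased"
    if score_integrated >= cutoff:
        return "biased"
    return None
```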

So that leaves one remaining question: how to create an integrated score using only experimental evidence?

some tissues will be entirely lost if you require support from two datasets.

Let's not worry too much about uniform coverage of tissues. Our approach can handle nonuniform network sparsity, and uniform coverage is infeasible in most cases.

Lars Juhl Jensen: Please note that using only the experiments channel will not eliminate knowledge biases. In particular, the immunohistochemical staining data from the Human Protein Atlas are heavily biased, since it depends strongly on the number and quality of antibodies available for each protein.

The way we currently calculate the integrated score can obviously be applied to any subset of evidence channels and sources (e.g. all sources in the experiments channel, or all sources in the experiments channel except HPA-IHC).

First, we convert all the confidence scores ($$s_{ijk}$$) between 0 and 5 to pseudo-probabilities ($$p_{ijk}$$) between 0 and 1 by simply dividing by 5. Here $$i$$ and $$j$$ are the two entities (proteins, tissues, diseases, etc.) and $$k$$ is the channel or source. Next, assuming independence between the different types of evidence, we define the combined pseudo-probability for two entities as:

$$$p_{ij} = 1-\prod_{k}{(1-p_{ijk})}$$$

Finally, we convert $$p_{ij}$$ back to $$s_{ij}$$ by simply multiplying by 5.
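The scheme above fits in a few lines of Python; this is a direct transcription of the formula (the function name is my own):

```python
def integrate_scores(scores, max_score=5.0):
    """Combine per-channel confidence scores on a 0-5 scale into one score.

    Each score s_ijk is rescaled to a pseudo-probability p_ijk = s_ijk / 5,
    the pseudo-probabilities are combined assuming independence via
    p_ij = 1 - prod_k(1 - p_ijk), and the result is mapped back to 0-5.
    """
    p_not = 1.0  # probability that no channel supports the association
    for s in scores:
        p_not *= 1.0 - s / max_score
    return max_score * (1.0 - p_not)
```

Note that a single maximal score (5) forces the combined score to 5, and adding any positive evidence can only increase the result.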

This is admittedly ad hoc and has not yet been benchmarked or otherwise compared to other alternatives. The major assumption here is that you can convert confidence scores to some sort of probabilities by simply dividing by 5, which is obviously an oversimplification. There is also the assumption of independence, but I believe this is less of a problem. The formula for combining probabilities is very similar to the STRING scoring scheme, which has been extensively tested.

Daniel Himmelstein: In human_tissue_integrated_full.tsv I treated HPA as referring to HPA-IHC. Is this correct?

Lars Juhl Jensen: If you meant experiments instead of integrated in that filename, then yes.

Initial release of processed TISSUES data

We have completed an initial processing of the TISSUES data (repository, notebook) [1]. The main output of our analysis is merged.tsv.gz, a table where each row is a tissue (Uberon)–gene (Entrez) pair. For each pair, we provide 5 scores:

score_text: score from the text mining channel

score_knowledge: score from the UniProtKB/knowledge channel

score_experiment: integrated score from the experimental channel

score_experiment_unbiased: integrated score from the experimental channel without immunohistochemical staining data from the Human Protein Atlas

score_integrated: integrated score combining everything

Integrations (score_experiment and score_experiment_unbiased) were calculated using the above formula.
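To show how the table might be consumed downstream, here is a toy stand-in for merged.tsv.gz with the five score columns listed above; the identifier column names and all values are made up for illustration, and the hard cutoff of 3 follows the discussion earlier in the thread:

```python
import io

import pandas as pd

# Toy stand-in for merged.tsv.gz. Identifier column names (uberon_id,
# entrez_id) and all values are hypothetical; the five score columns
# match the list above.
tsv = """\
uberon_id\tentrez_id\tscore_text\tscore_knowledge\tscore_experiment\tscore_experiment_unbiased\tscore_integrated
UBERON:0002107\t7157\t4.2\t5.0\t3.8\t3.1\t4.9
UBERON:0000955\t3064\t1.0\t0.0\t2.1\t1.8\t2.4
"""
merged_df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Binarize with a hard cutoff on the integrated score, flagging whether
# the knowledge-unbiased experimental score would also have passed.
cutoff = 3.0
edges = merged_df[merged_df["score_integrated"] >= cutoff].copy()
edges["unbiased"] = edges["score_experiment_unbiased"] >= cutoff
```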

Visualizing channel concordance

We visualized the relationships between scores. The off-diagonal plots show a 2D histogram, using hexagonal bins. The diagonal of the grid contains 1D histograms for the x-variable. Bin counts for all panels are log-transformed.

You may be surprised to see points where y < x for the integrated 2D histograms. This occurs because Uberon and Entrez Gene mappings are not always one-to-one.