PsyGeNET database information

The PsyGeNET database integrates information on psychiatric disorders and their genes
(Gutiérrez-Sacristán et al., Bioinformatics 2015).
This second release of PsyGeNET contains updated information on depression, bipolar disorder,
alcohol use disorders and cocaine use disorders, and has been expanded to cover other psychiatric diseases of interest.
The database has been developed by automatic extraction of information from the literature using the text
mining tool BeFree (http://ibi.imim.es/befree/),
followed by curation by experts in the domain.

PsyGeNET data are classified into Psycur15 (from the first release of PsyGeNET) and Psycur16 (current 2.0 release).
Note that all the information contained in the database has been curated by experts.

Psycur15: Genes associated with alcohol use disorders, bipolar disorders and related disorders, depressive disorders
and cocaine use disorders. It contains 1537 associations between 579 genes and 32 psychiatric disease concepts.
The information was extracted from MEDLINE abstracts published between 1980 and 2013 by text mining, followed by
expert curation. The curation process is described in
Gutiérrez-Sacristán et al., Bioinformatics 2015.

During curation of gene-disease associations (GDAs), we found publications that supported the
association between the gene and the disease, while other works found just the opposite (that the gene is
not associated with the disease). The latter is what is generally referred to as a negative finding in the
literature. In PsyGeNET we think that it is important to keep track of both the “positive” and the “negative”
findings, and let users make their own judgements based on the available evidence. Thus, for each
GDA and each supporting publication, we include the Association type to provide this information.
According to the evidence, there are two types: “Association” and “No Association” (i.e. the “negative
findings”). This information is available in the “All associations evidences” tab.

In addition to indicating the association type, we reflect the variety in the evidence for a gene-disease
association in the “Evidence index” (EI). This index, like a traffic light, is green when all the evidence
reviewed by the experts supports the existence of an association between the gene and the disease
(Association, EI = 1), is yellow when there is contradictory evidence for the GDA (some publications
support the association while other publications do not support it, 1 > EI > 0), and is red when all the
evidence reviewed by the experts reports that there is no association between the gene and the disease
(No Association, EI = 0). Note that the experts validated a maximum of 5 publications for each GDA,
selected as the most recent ones.
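
The traffic-light logic described above can be sketched in a few lines of Python. This is only an illustration:
the text does not give the exact EI formula, so the sketch assumes that EI is the fraction of reviewed
publications labelled “Association”. The function names evidence_index and traffic_light are hypothetical and
are not part of PsyGeNET or psygenet2r.

```python
from typing import List


def evidence_index(labels: List[str]) -> float:
    """Assumed EI: fraction of reviewed publications labelled "Association"."""
    if not labels:
        raise ValueError("at least one reviewed publication is required")
    supporting = sum(1 for label in labels if label == "Association")
    return supporting / len(labels)


def traffic_light(ei: float) -> str:
    """Map an EI value to the colour scheme described in the text."""
    if ei == 1:
        return "green"   # all reviewed evidence supports the association
    if ei == 0:
        return "red"     # all reviewed evidence reports no association
    return "yellow"      # contradictory evidence, 0 < EI < 1


# Hypothetical GDA with three reviewed publications.
labels = ["Association", "Association", "No Association"]
ei = evidence_index(labels)
print(round(ei, 2), traffic_light(ei))  # 0.67 yellow
```

Under this assumption, a GDA with two supporting publications and one negative finding is shown in yellow.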

This information is available in the “Summary of All Associations” tab in a numeric format. Note that, given
a set of genes of interest, the psygenet2r package can display the evidence index as a heatmap in which
genes are placed on the X axis and disorders on the Y axis, and each cell is colored red, yellow or green
according to the EI value (more details in R package: psygenet2r).

In PsyGeNET there are genes that are associated with all the disease classes (i.e. they are
more pleiotropic), while others are more specific to a disease class. The Gene Pleiotropy ranges
from 12.5 to 100 and is proportional to the percentage of different disease classes a gene is
associated with. Thus, a gene associated with diseases of diverse classes (such as DRD2, associated
with 6 disease classes: alcohol UD, bipolar disorder, depression, schizophrenia, cocaine UD and cannabis UD)
will have a high Gene Pleiotropy value. Conversely, ADRB1 is associated with only 1
disease class (depression) and has a low Gene Pleiotropy value.
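
As a rough illustration of the Gene Pleiotropy index, the following Python sketch assumes that the value is
simply the percentage of disorder classes a gene is associated with, and that there are 8 classes in total.
Both are assumptions: the count of 8 is inferred from the stated minimum of 12.5 (one class out of eight), and
the function name gene_pleiotropy is hypothetical.

```python
TOTAL_DISEASE_CLASSES = 8  # assumption inferred from the minimum value of 12.5


def gene_pleiotropy(n_classes: int, total: int = TOTAL_DISEASE_CLASSES) -> float:
    """Assumed index: percentage of disorder classes a gene is associated with."""
    if not 1 <= n_classes <= total:
        raise ValueError("n_classes must be between 1 and the total number of classes")
    return 100.0 * n_classes / total


print(gene_pleiotropy(6))  # DRD2, 6 classes   -> 75.0
print(gene_pleiotropy(1))  # ADRB1, 1 class    -> 12.5
```

Under these assumptions a gene present in every class scores 100, and a gene found in a single class scores
the minimum of 12.5.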

Each disorder class in PsyGeNET, such as Schizophrenia or Depression, is defined as a set of
diseases identified by UMLS CUIs. Some of these diseases (UMLS CUIs) are associated with
many genes in the same disease class, while other diseases are associated with only a small
number of genes. The Disease Load is a measure of this property of the diseases. It is the
number of genes associated with a disease divided by the total number of genes
associated with its disease class. For example, the Schizophrenia class is defined by 24 UMLS
concepts. One of these concepts, Schizophrenia (umls:C0036341), has the largest Disease Load
in its class (0.95) because it is annotated to 861 of the 903 genes in the class. On the
other hand, Catatonic schizophrenia (umls:C0036344) has a much smaller Disease Load since it is
associated with only 4 genes.
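
The Disease Load calculation can be reproduced with a short Python sketch that uses the gene counts quoted
above. The function name disease_load is hypothetical and is used here only for illustration.

```python
def disease_load(genes_for_disease: int, genes_for_class: int) -> float:
    """Genes associated with one disease (CUI) over all genes in its disease class."""
    if genes_for_class <= 0:
        raise ValueError("the disease class must contain at least one gene")
    if not 0 <= genes_for_disease <= genes_for_class:
        raise ValueError("a disease cannot have more genes than its class")
    return genes_for_disease / genes_for_class


# Counts taken from the Schizophrenia class example in the text.
print(round(disease_load(861, 903), 2))  # Schizophrenia (umls:C0036341)           -> 0.95
print(round(disease_load(4, 903), 4))    # Catatonic schizophrenia (umls:C0036344) -> 0.0044
```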

The diseases in the current release of PsyGeNET are standardized with the Unified Medical
Language System® (UMLS®) Metathesaurus Concept Unique Identifiers (CUIs). In this way,
each disease is identified by a unique CUI. Each disorder class in PsyGeNET, namely alcohol
use disorders, bipolar disorders and related disorders, depressive disorders, schizophrenia
spectrum and other psychotic disorders, cocaine use disorders, substance-induced depressive
disorder, cannabis use disorders and substance-induced psychosis, is defined in terms of a set of CUIs.

Genes

The vocabulary used for genes in the current release of PsyGeNET comprises the NCBI Official Full Name, the NCBI Entrez
gene identifier and the UniProt accession. In addition, genes are classified according to the
PANTHER protein ontology.

The update of the PsyGeNET database involved: i) the recruitment
of a team of experts to curate the information extracted by text mining; ii) the extraction of
gene-disease associations (GDAs) from the literature using the text mining system BeFree
(Bravo et al., 2015);
iii) the development of a curation workflow; iv) the development of a web-based
annotation tool to facilitate the curation task; and v) the definition of detailed guidelines to assist the
curation task.

We put in place a curation workflow including a pilot phase and two curation and analysis phases (see
Figure 1). During the pilot phase, the curators received their initial training, including how to use
the curation tool. After this process, both the curation tool and the annotation guidelines were improved
and the first curation phase was launched (Curation Phase I) to evaluate 2,507 GDAs identified by text
mining and supported by 4,065 publications (from 1980 to 2015). The results of the curation were analyzed to
estimate the inter-annotator agreement at the level of the abstract. The validations for which no agreement
was reached in Curation Phase I were then reviewed by a third expert during Curation Phase II (results
not reported here). Four experts participated in this phase. Only the validations on which at
least 2 experts agreed have been included in the database.
For more detailed information on the process, see: Gutiérrez-Sacristán et al.
Text mining and expert curation to develop a database on psychiatric diseases and their genes. Proceedings of
the 7th International Symposium on Semantic Mining in Biomedicine. Potsdam, Germany, August
4-5, 2016.
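
The inclusion rule used in the curation workflow (keep a validation only when at least 2 experts agree) can be
sketched as follows. This is a minimal Python illustration, not the actual curation tool: the function name
consensus_label and the handling of ties (treated as no agreement) are assumptions.

```python
from collections import Counter
from typing import List, Optional

MIN_AGREEMENT = 2  # at least 2 experts must give the same validation


def consensus_label(validations: List[str]) -> Optional[str]:
    """Return the agreed validation, or None when no agreement is reached."""
    if not validations:
        return None
    ranked = Counter(validations).most_common(2)
    top_label, top_count = ranked[0]
    if top_count < MIN_AGREEMENT:
        return None  # fewer than 2 experts agree
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie between two labels (assumed to mean no agreement)
    return top_label


# Phase I: two curators disagree, so the abstract goes to a third expert.
print(consensus_label(["Association", "No Association"]))                 # None
# Phase II: the third expert's validation resolves the disagreement.
print(consensus_label(["Association", "No Association", "Association"]))  # Association
```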

A team of 22 experts from different domains (such as psychiatry, neuroscience, medicine, psychology and
biology) was recruited from the Spanish Network of Addiction and other collaborators of the coordination
team (Research Group on Integrative Biomedical Informatics (GRIB)) to participate in the curation
process.