In this article, we talk with one of our users, Antonio Messina from the High-Performance Computing and Networking Institute of the Italian National Research Council (ICAR-CNR).

Antonio (@xMAnton on Twitter) is a Computer Science Engineer who works as an Applied Scientist at the largest public research institution in Italy. His area of expertise includes (No)SQL databases and advanced Unix systems administration, and he likes to get his hands dirty coding mainly in Java and Node.js. He is enthusiastic about technologies such as graph databases and Docker and is constantly looking for innovation in IT.

Recently, Antonio successfully submitted a paper that describes a practical use case for GRAKN.AI. The paper, “BioGrakn: A Knowledge Graph-based Semantic Database for Biomedical Sciences,” will be published after the CISIS 2017 conference that takes place in July.

We asked Antonio to tell us some more about BioGrakn, and how it is the first step in using the power of knowledge graphs and machine reasoning to solve common problems in the domain of biomedical science.

Q: What problem did you need to solve?

A: Nowadays, the amount of biological data available online is huge, but integrating and connecting related information from different sources to gain new knowledge is a challenge.

We’ve identified a need for tools to aggregate, integrate, and model data while managing significant complexity and contextual specificity.

Some of the most common problems include locating resources, differing data formats, ambiguity and duplication, relationships between data, and the sheer volume and granularity of the information. As yet, there is no standard storage and query format for this kind of data, so each resource usually requires a different approach to be handled properly.

Q: Typically, what kind of data storage do you work with?

A: Several classes of bio-molecular data, such as transcriptional regulatory networks and protein-protein interaction networks, form complex networks. They can usually be modeled as graphs, where nodes (and their attributes) model biological entities and edges represent relationships between these entities. Examples of graph databases adopted in bioinformatics are ncRNA-DB, Bio4j, and BioGraphDB.

ncRNA-DB is a NoSQL database based on OrientDB that combines many biological resources to deal with several classes of ncRNA such as miRNA, long-noncoding RNA (lncRNA), circular RNA (circRNA), and their interactions with genes and diseases.

Bio4j is an integrated cloud-based data platform, based on a Java library and built upon a graph structure on top of Neo4j. For now, it includes data about proteins, Gene Ontology (GO) terms, and enzymes.

BioGraphDB integrates several types of data sources to perform bioinformatics analysis using a comprehensive system built on top of OrientDB. It includes data about genes, proteins, microRNAs, molecular pathways, functional annotations, and associations between microRNAs and cancer diseases.
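As a rough illustration of the graph model described in this answer, the sketch below stores entities as nodes with attributes and relationships as typed edges. It is not taken from any of the databases mentioned; the class names, relation names, and sample identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A biological entity; attributes hold its properties."""
    node_id: str
    kind: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A typed, directed relationship between two entities."""
    source: str
    target: str
    relation: str


# Hypothetical sample data, for illustration only.
nodes = {
    "P1": Node("P1", "protein", {"name": "protein one"}),
    "P2": Node("P2", "protein", {"name": "protein two"}),
    "G1": Node("G1", "gene", {"symbol": "GENE1"}),
}
edges = [
    Edge("P1", "P2", "interacts_with"),
    Edge("G1", "P1", "encodes"),
]


def neighbours(node_id, relation):
    """Return ids of nodes reached from node_id over outgoing edges of one relation type."""
    return [e.target for e in edges if e.source == node_id and e.relation == relation]


print(neighbours("P1", "interacts_with"))
```

A real graph database adds indexing, persistence, and a query language on top of this idea, but the underlying shape of the data is the same.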

Q: So, what is BioGrakn?

A: In short, BioGrakn is a graph-based semantic database that takes advantage of the power of knowledge graphs and machine reasoning to solve problems in the domain of biomedical science. We address the major issue of semantic integrity — that is, interpreting the real meaning of data derived from multiple sources or manipulated by various tools.

BioGrakn has been built on top of GRAKN.AI, a distributed knowledge graph database which allows complex data modeling, verification, scaling, querying, and analysis. A key step is the definition of an ontology, which facilitates the modeling of complex datasets and guarantees information consistency. Inference rules allow the extraction of implicit information from explicit data, to achieve logical reasoning over the represented knowledge.

Q: What data sources did you use?

A: The data sources we chose are almost the same as those used by BioGraphDB. This way, we can build an integrated database containing resources related to genes, proteins, miRNAs, and metabolic pathways. References can be found at the end of the article.

NCBI Entrez Gene provides extensive gene data, such as interactions with other genes, genomic context, and annotated pathways.

UniProt Knowledgebase (UniProtKB) is the largest public collection of annotated functional information on proteins.

Reactome contains validated metabolic pathways, each annotated as a set of biological events, dealing with genes and proteins.

miRBase provides all known miRNA sequences and annotations, associated with names, keywords, genomic locations, and references.

mirCancer contains associations between miRNAs and human cancers.

miRNASNP is a resource of miRNA-related mutations (SNPs) in human and other species.

mirTarBase is a list of experimentally validated miRNA-target interactions.

miRanda is a list of putative miRNA-target interactions.

HGNC, the HUGO Gene Nomenclature Committee database, contains, for each gene symbol, a list of synonyms and a list of corresponding entries in the most popular gene databases.

Q: How did you import the data?

A: Much of the above data is in TSV format, a simple text format for tabular data in which each record is one line of the text file and each field value is separated from the next by a tab character. By contrast, miRBase, GO, and UniProtKB are distributed in EMBL text file format or XML format.
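To make the TSV layout concrete, here is a minimal Python sketch that parses tab-separated text with the standard `csv` module. The sample records and column names are hypothetical and do not come from any of the data sources listed above.

```python
import csv
import io

# Hypothetical two-record sample; the column names are illustrative.
SAMPLE_TSV = "symbol\tname\nGENE1\tfirst example gene\nGENE2\tsecond example gene\n"


def read_tsv(text):
    """Parse tab-separated text into one dict per record, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


records = read_tsv(SAMPLE_TSV)
for record in records:
    print(record["symbol"], "->", record["name"])
```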

GRAKN.AI does import TSV, but EMBL and XML source data files are not currently supported, so we developed an ad-hoc set of Extract-Transform-Load (ETL) tools. Data consistency and proper relations between entities were guaranteed by running the ETL tools in a precise order. This way, when a data source refers to others, all the resources it depends on are already present in the database.
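One common way to derive such an execution order is a topological sort over the dependencies between sources. The sketch below uses Python's standard `graphlib` module; it is an assumption about how an ordering could be computed, not a description of the actual BioGrakn ETL tools, and the dependency map is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each source lists the sources it refers to.
# The real BioGrakn ETL order is not given in the article.
DEPENDS_ON = {
    "genes": set(),
    "proteins": {"genes"},
    "pathways": {"proteins"},
    "mirna_targets": {"genes"},
}


def etl_order(depends_on):
    """Return a load order in which every source follows the sources it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


order = etl_order(DEPENDS_ON)
print(order)
```

Several valid orders may exist; the only guarantee is that a source is never loaded before the sources it refers to.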

Q: What does the ontology look like?

A: A Graql ontology specifies the relevant concepts and their meaningful associations and must be clearly defined before loading data into a graph. Objects and relationships are categorized into distinct types, enabling automatic reasoning over the represented knowledge, such as inference (extraction of implicit information from explicit data) and validation (discovery of inconsistencies in the data).

The ontology has four types of concepts to model the domain. The categorization of concept types is enforced by declaring every concept type as a subtype of exactly one of the four corresponding built-in concept types: entity, relation, role, and resource.
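The subtype rule can be illustrated outside Graql with a small Python sketch that follows supertype links until it reaches one of the four built-in types. This is not Graql syntax, and the declared type names below are hypothetical rather than taken from the BioGrakn ontology.

```python
BUILT_IN = {"entity", "relation", "role", "resource"}

# Hypothetical type declarations: each concept type names its direct supertype.
SUPERTYPE = {
    "gene": "entity",
    "mirna": "entity",
    "mature-mirna": "mirna",  # subtype of a user-defined type
    "annotation": "relation",
    "annotated-entity": "role",
    "symbol": "resource",
}


def built_in_root(type_name):
    """Follow supertype links until a built-in concept type is reached."""
    seen = set()
    current = type_name
    while current not in BUILT_IN:
        if current in seen or current not in SUPERTYPE:
            raise ValueError(f"{type_name!r} does not descend from a built-in type")
        seen.add(current)
        current = SUPERTYPE[current]
    return current


for name in SUPERTYPE:
    print(name, "->", built_in_root(name))
```

Because each type has a single direct supertype, every chain ends at exactly one built-in type, which is what makes the categorization unambiguous.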

Here’s a screenshot of the ontology we used. You can find the text version on GitHub.

Let’s consider the Gene Ontology annotation “platelet activating factor biosynthetic process”, which has the identifier GO:0006663. To find the annotated genes, we start from the annotation relation whose functional annotation member equals our starting identifier. It points to all the related annotated entities, from which we extract the genes and print their symbols and names. The following Graql query returns the desired results:

At first sight, this seems like the previous problem. However, genes cannot be directly linked to pathways, because Reactome only provides pathway-to-protein associations. Therefore, we have to go through two relations:

As expected, the graphic results now show direct links from genes to pathways.

Reasoning on gene-pathway links.
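The idea of deriving gene-pathway links by going through two relations can be mimicked in plain Python as a join on the shared protein. This is only an analogue of what a Graql inference rule achieves, not the rule used in BioGrakn; the relation names and sample facts are hypothetical.

```python
# Hypothetical explicit facts, standing in for the two stored relations.
GENE_ENCODES_PROTEIN = [("GENE1", "PROT1"), ("GENE2", "PROT2")]
PATHWAY_CONTAINS_PROTEIN = [
    ("PATHWAY_A", "PROT1"),
    ("PATHWAY_B", "PROT1"),
    ("PATHWAY_B", "PROT2"),
]


def infer_gene_pathway_links(gene_protein, pathway_protein):
    """Join the two relations on the shared protein to derive gene-pathway pairs."""
    pathways_by_protein = {}
    for pathway, protein in pathway_protein:
        pathways_by_protein.setdefault(protein, set()).add(pathway)
    links = set()
    for gene, protein in gene_protein:
        for pathway in pathways_by_protein.get(protein, ()):
            links.add((gene, pathway))
    return links


links = infer_gene_pathway_links(GENE_ENCODES_PROTEIN, PATHWAY_CONTAINS_PROTEIN)
for gene, pathway in sorted(links):
    print(gene, "->", pathway)
```

The derived pairs are never stored explicitly in the input; they exist only as a consequence of the two relations, which is the sense in which the links are inferred.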

Summary

In this article, we have looked at how GRAKN.AI was used to build a prototype of a bioinformatics semantic database. We’ve discussed how BioGrakn takes advantage of the power of knowledge graphs and machine reasoning to solve problems in the domain of biomedical science. We address the major issue of semantic integrity, that is, interpreting the real meaning of data derived from multiple sources or manipulated by various tools.

What’s Next?

In the short term, further developments are expected, such as the integration of other publicly available biological resources, the use of the native GRAKN.AI migration tools for data migration procedures, and the deployment of a user-friendly web interface.