
Thematic working days, October 15-17, 2012

Due to the increasing online availability of various biomedical data sources, the ability to federate heterogeneous and distributed data sources has become critical to supporting multi-centric studies and translational research in medicine. The CrEDIBLE project organized three thematic working days on October 15-17, 2012 in Sophia Antipolis (France), where experts were invited to present their latest work and discuss their approaches. The aim was to gather scientists from all disciplines involved in setting up distributed and heterogeneous medical image data sharing systems, to provide an overview of this broad and complex area, to assess the state-of-the-art methods and technologies addressing it, and to discuss the open scientific questions it raises.

The methods for biomedical data distribution considered in the context of CrEDIBLE are:

Federation: the (virtual) fusion of geographically spread data stores which should appear to end users as a unique and coherent data source.

Mediation: the semantic alignment of heterogeneous data sources, which were often designed independently from each other.

Querying: the description of distributed data sets, defined through data retrieval queries that apply to the whole federated system.

Data flow: the exploitation and enrichment of the federated data stores through data processing pipelines.
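As an illustration only, the first three notions can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the two sites, their field names, and the value mappings are invented for the example and do not describe any CrEDIBLE data source. Mediation is reduced here to a simple renaming of fields and values, which is far weaker than ontology-based alignment.

```python
# Two hypothetical sources, designed independently, with different vocabularies.
SITE_A = [
    {"patient_id": "A1", "modality": "MRI"},
    {"patient_id": "A2", "modality": "CT"},
]
SITE_B = [
    {"subject": "B7", "imaging_type": "magnetic resonance"},
    {"subject": "B8", "imaging_type": "computed tomography"},
]

# Mediation (assumed mappings): align site B's local terms with site A's vocabulary.
FIELD_MAP_B = {"subject": "patient_id", "imaging_type": "modality"}
VALUE_MAP_B = {"magnetic resonance": "MRI", "computed tomography": "CT"}


def mediate_b(record):
    """Translate one site B record into the shared vocabulary."""
    return {
        FIELD_MAP_B.get(field, field): VALUE_MAP_B.get(value, value)
        for field, value in record.items()
    }


def federated_query(predicate):
    """Federation + querying: evaluate one query over all sources,
    which appear to the caller as a single coherent data source."""
    unified = list(SITE_A) + [mediate_b(r) for r in SITE_B]
    return [r for r in unified if predicate(r)]


mri_records = federated_query(lambda r: r["modality"] == "MRI")
mri_ids = sorted(r["patient_id"] for r in mri_records)
print(mri_ids)
```

The caller writes a single query against the shared vocabulary and never sees which site a record came from. A real federation would typically push the query down to each site rather than copy the records, which this sketch does not attempt.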

Location

On Monday afternoon, Tuesday afternoon and Wednesday morning, the workshop will be held in the conference room of the I3S laboratory in Sophia Antipolis (France). See the following map to locate the laboratory. The conference room is located on the ground floor of I3S (room number 007).

On Tuesday morning, the workshop will be held in room 101-001 on the first floor of the “Templiers Ouest” building. Please refer to the map below to identify the building.

Content

Semantic Web technologies play a critical role in representing, interpreting and querying data, and ultimately in achieving data mediation. The themes of knowledge modeling through ontologies and the reuse of existing ontologies will be addressed more specifically. The challenges related to the integration and the federation of heterogeneous databases, including the data representation models and their impact on system performance, will be studied. Downstream, the study will also consider the exploitation and the production of new knowledge in the context of data processing workflows. Some feedback on existing tools and their capabilities and limitations is also expected.

Data integration

The data sources to be integrated are related yet heterogeneous, using different semantic references (vocabularies…), different representations (files, relational / triple / XML databases…) and even different data models (relational, knowledge graphs…). Data integration will also be constrained by the requirements of medical applications, in particular the setup of multi-centric studies and the support of translational research. Data security and fine-grained access control are another important related problem. The need to process several data representation models simultaneously makes data security particularly challenging.

Ontologies

The ontology defines conceptual primitives which represent data semantics (images, test and questionnaire results) by integrating their production context (study, examination, subject, medical practitioner, data acquisition protocol, processing, acquisition device, parameterization, scientific publications). Such an ontology spans different domains (different entity classes) and includes hundreds of concepts. It is structured in modules with different abstraction levels, so that generic primitives can formalize several domains, which eases ontology maintenance.

The ontology design involves: reusing (completely or partially) existing ontological modules (at different abstraction levels); designing new modules (in particular to represent knowledge related to particular medical domains); managing the module life cycle; documenting the modules to ease reuse. There are different means of exploitation: ontological alignment to federate data that rely on different semantics (addressing problems related to the level of detail or even discrepancies in the entities considered); data processing assistance (checking the compatibility of data with processing tools, producing data provenance information); query-based and/or visualization-based data access. Each usage scenario might involve adapting the ontology representation to the tool used (inference engine, visualizer) and its language.

Data representation models and reasoning

Usually, medical data are stored in relational databases, which allow fast access to data, while metadata are formalized through graph-based knowledge representation models designed for the Semantic Web, which enable reasoning through inferences based on the ontologies used to model this knowledge. The main challenges are the mixed use of different representations and the scalability of data storage and reasoners. The scalability problem is well known in the Web of Data community. Promising approaches rely on graph-oriented databases, on adapting the inferences performed to the size of the manipulated data stores, and on querying and reasoning techniques adapted to distributed stores.
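The mixed use of the two representations can be illustrated with a minimal Python sketch, under strong simplifying assumptions: the relational side is an in-memory SQLite table, the graph side is a plain set of triples, and the only inference performed is transitive subclass reasoning. The table layout, identifiers and class names are all invented for the example.

```python
import sqlite3

# Relational side (hypothetical schema): raw data with fast keyed access.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (id TEXT PRIMARY KEY, size_mb REAL)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [("img1", 12.5), ("img2", 48.0), ("img3", 7.2)],
)

# Graph side (hypothetical ontology): metadata as subject/predicate/object triples.
TRIPLES = {
    ("img1", "type", "T1WeightedMRI"),
    ("img2", "type", "CTScan"),
    ("img3", "type", "DiffusionMRI"),
    ("T1WeightedMRI", "subClassOf", "MRI"),
    ("DiffusionMRI", "subClassOf", "MRI"),
    ("MRI", "subClassOf", "MedicalImage"),
    ("CTScan", "subClassOf", "MedicalImage"),
}


def subclasses_of(cls):
    """Inference: return cls together with all its transitive subclasses."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in TRIPLES:
            if p == "subClassOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def instances_of(cls):
    """Resources typed with cls or any inferred subclass of it."""
    classes = subclasses_of(cls)
    return sorted(s for s, p, o in TRIPLES if p == "type" and o in classes)


def total_size_mb(cls):
    """Mixed query: select by inferred class, aggregate in the relational store."""
    ids = instances_of(cls)
    if not ids:
        return 0.0
    placeholders = ",".join("?" for _ in ids)
    row = conn.execute(
        f"SELECT SUM(size_mb) FROM image WHERE id IN ({placeholders})", ids
    ).fetchone()
    return row[0]


mri_ids = instances_of("MRI")
mri_total = total_size_mb("MRI")
print(mri_ids, mri_total)
```

No image is annotated directly as "MRI"; the two matches are found only through the subclass inference. The scan over all triples in `subclasses_of` is also a small-scale picture of why reasoning cost grows with the size of the store.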

Semantic workflows

The acquisition and the representation of knowledge related to the manipulated data are tightly linked to the data processing and transformation tools applied. Knowledge acquired on data may be used to validate or filter the processing tools applied to this data. Conversely, knowledge acquired on processing tools can be used to infer new knowledge on data, in particular the data produced through this processing. Knowledge exploitation can happen at different stages of the life cycle of scientific processing pipelines: at design time through editing assistance (static validation, assisted composition) and at run time (dynamic validation, new knowledge creation).
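Design-time static validation can be sketched as follows in Python. This is only a toy illustration: the class hierarchy, the tool names and their input/output signatures are hypothetical, and a real system would query an ontology and a reasoner rather than hard-coded dictionaries.

```python
# Hypothetical knowledge on data: each class maps to its direct parent class.
SUBCLASS_OF = {
    "T1WeightedMRI": "MRI",
    "MRI": "MedicalImage",
    "BrainMask": "MedicalImage",
}

# Hypothetical knowledge on tools: expected input class and produced output class.
TOOLS = {
    "skull_strip": {"input": "MRI", "output": "BrainMask"},
    "volume_estimate": {"input": "BrainMask", "output": "VolumeMeasurement"},
}


def is_a(cls, expected):
    """True if cls equals expected or is one of its (transitive) subclasses."""
    while cls is not None:
        if cls == expected:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def validate_pipeline(input_type, steps):
    """Check a pipeline before execution.

    Returns (errors, output_type): the list of incompatibilities found and
    the class inferred for the final result from the tool descriptions.
    """
    errors = []
    current = input_type
    for name in steps:
        tool = TOOLS[name]
        if not is_a(current, tool["input"]):
            errors.append(f"{name}: expected {tool['input']}, got {current}")
        current = tool["output"]
    return errors, current


ok_errors, ok_output = validate_pipeline(
    "T1WeightedMRI", ["skull_strip", "volume_estimate"]
)
bad_errors, _ = validate_pipeline("T1WeightedMRI", ["volume_estimate"])
print(ok_errors, ok_output)
print(bad_errors)
```

The same tool descriptions serve both directions described above: knowledge on data filters the applicable tools, and knowledge on tools yields the class of the data they produce without running anything.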

Knowledge on both data and processing tools is also often used to describe data provenance information. Provenance is then described as semantic annotations tracing the execution path. Provenance is tightly related to the nature of the data processed. It facilitates the reuse and the interconnection of data from different sources. It can make use of several domain ontologies and facilitate interoperability between different data processing engines.
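A minimal Python sketch of provenance recorded as annotations tracing the execution path is given below. The identifiers, the two processing steps and the naming scheme for intermediate results are hypothetical. The predicate names `wasGeneratedBy` and `wasDerivedFrom` are borrowed loosely from the W3C PROV vocabulary as an assumption of this example; the sketch does not claim to be PROV-compliant.

```python
def run_with_provenance(steps, input_id, input_value):
    """Run the steps in order and record provenance triples
    (subject, predicate, object) for every intermediate result."""
    provenance = []
    current_id, current_value = input_id, input_value
    for index, (tool_name, func) in enumerate(steps, start=1):
        output_id = f"{input_id}_step{index}"  # hypothetical naming scheme
        output_value = func(current_value)
        provenance.append((output_id, "wasGeneratedBy", tool_name))
        provenance.append((output_id, "wasDerivedFrom", current_id))
        current_id, current_value = output_id, output_value
    return current_id, current_value, provenance


def lineage(data_id, provenance):
    """Follow wasDerivedFrom links back to the original input."""
    derived_from = {s: o for s, p, o in provenance if p == "wasDerivedFrom"}
    chain = [data_id]
    while chain[-1] in derived_from:
        chain.append(derived_from[chain[-1]])
    return chain


# Two toy processing steps standing in for real image processing tools.
steps = [
    ("normalize", lambda xs: [x / max(xs) for x in xs]),
    ("threshold", lambda xs: [x >= 0.5 for x in xs]),
]
final_id, final_value, prov = run_with_provenance(steps, "scan42", [2, 4, 8])
print(final_value)
print(lineage(final_id, prov))
```

Because the trace is a set of triples, it can be stored alongside the other semantic annotations and queried with the same tools, which is what allows different processing engines to exchange it.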

Alignment of foundational ontologies (e.g. DOLCE and BFO): are there techniques to align foundational ontologies and facilitate the reuse of ontologies grounded on them?

2) Application ontology (overall project ontology)

Ontology modular structure: what structure may facilitate reuse of existing ontologies (designed and maintained by other organizations), and favor the reuse of ontology modules in other projects?

Reuse of existing ontologies: what part can be reused (depending on the project needs) and at what level of reuse (what needs to be modified to take into account project needs and the overall ontology coherency)? For instance, FMA (Foundational Model of Anatomy) describes normal anatomy. How can it be reused to represent pathological structures? Is the Hoehndorf 2007 solution acceptable?

Creation of new modules, usually to cover the medical areas considered: how should such multidisciplinary work be organized?

Expressiveness of representation languages: how to properly manipulate relations between Universals (e.g. the head is part of a person; the left hemisphere and the right hemisphere compose part of the brain) and relations between Universals and Individuals (e.g. the hydrogen concentration of a solution, the processing class performed by a software tool), especially at the operations level, to implement inferences (see OWL dialects)?