Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Dr. Hasso Plattner. If you are interested in our work or want to join our team, please contact Dr. Matthias Uflacker.

Our team offers a series of lectures and seminars focused on enterprise systems design and in-memory data management. Strong links to industry ensure a close connection between theory and its real-world implementation.

Our research focuses on the technical aspects of business software and the integration of different software systems to meet customer requirements. This involves studying the conceptual and technological aspects of in-memory databases, design principles, and programming methods for enterprise applications.

We continually strive to translate our research into practical outputs that improve the quality of enterprise applications. A close link to industry partners ensures the relevance and impact of our work.

Intelligent Support for Document Annotation using Semi-Supervised Learning

General information

Motivation

The goal of this master project is to develop a system to support manual annotation of documents and linking of entities to database records. Manual annotation of textual documents is often necessary for building corpora to support training and evaluation of natural language processing applications. For instance, corpora have been developed for the extraction of a variety of entities, e.g., genes/proteins, as well as relationships, e.g., protein-protein interactions. Although there are many tools for document annotation [2], they do not suggest pre-annotations based on text mining and machine learning and do not provide real-time learning.

Curation tools support the extraction of data on a given topic from text collections [1]. For instance, biological databases need precise information extracted from publications, which is then stored in the database and made available to users via a Web interface. This is a time-consuming and complex task that requires careful reading of many publications.

For performance reasons, the tool will be built on top of the SAP HANA in-memory database, given its potential for processing large datasets in real time and its built-in text analysis functionality. Users will interact with the system by uploading a document or a collection of documents. The system will include a text mining pipeline that automatically processes documents and suggests annotations. This pipeline will contain two components: recognition of pre-defined entity types and extraction of pre-defined relationships between two or more entity types.
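The two pipeline components can be illustrated with a minimal Python sketch. It is not based on SAP HANA's text analysis; it assumes a small hand-made dictionary for entity recognition and treats sentence-level co-occurrence as a stand-in for relationship extraction. The lexicon entries, the entity types (GENE, DISEASE) and all function names are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from itertools import combinations

# Hypothetical dictionary of pre-defined entity types (illustrative only).
LEXICON = {
    "BRCA1": "GENE",
    "TP53": "GENE",
    "breast cancer": "DISEASE",
}


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


def recognize_entities(text, lexicon=LEXICON):
    """Return dictionary matches as suggested entity annotations."""
    entities = []
    for term, label in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            entities.append(Entity(m.group(0), label, m.start(), m.end()))
    return sorted(entities, key=lambda e: e.start)


def split_sentences(text):
    """Yield (offset, sentence) pairs using a naive punctuation split."""
    for m in re.finditer(r"[^.!?]+[.!?]?", text):
        yield m.start(), m.group(0)


def extract_relations(text, entities, allowed=(("GENE", "DISEASE"),)):
    """Suggest a relation for each allowed type pair found in one sentence."""
    relations = []
    for offset, sentence in split_sentences(text):
        end = offset + len(sentence)
        in_sentence = [e for e in entities if offset <= e.start and e.end <= end]
        for a, b in combinations(in_sentence, 2):
            if (a.label, b.label) in allowed:
                relations.append((a, b))
            elif (b.label, a.label) in allowed:
                relations.append((b, a))
    return relations


def suggest_annotations(text):
    """Run both components and return (entities, relations) as suggestions."""
    entities = recognize_entities(text)
    return entities, extract_relations(text, entities)


if __name__ == "__main__":
    doc = "BRCA1 is linked to breast cancer. TP53 was also studied."
    found_entities, found_relations = suggest_annotations(doc)
    for entity in found_entities:
        print(entity)
    for left, right in found_relations:
        print(left.text, "<->", right.text)
```

In the envisioned tool, suggestions of this kind would be presented to the annotator for confirmation or correction rather than stored directly.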

Further, ongoing annotations will be used for active learning of user preferences, for updating predictions of annotations and indicating which document to annotate next. This learning process will rely on existing machine learning algorithms implemented in the SAP HANA database, which will need to be adapted for on-line learning. Implementation of state-of-the-art on-line learning algorithms will also be considered.
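The learning loop described above can be sketched as follows. This is an illustrative stand-in, not the SAP HANA algorithms: it assumes a binary annotation decision, a bag-of-words representation, logistic regression updated by stochastic gradient descent as the on-line learner, and uncertainty sampling as the active learning strategy. All class and function names are hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Binary classifier updated one labelled example at a time (SGD)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict_proba(self, features):
        score = self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in features.items()
        )
        score = max(-30.0, min(30.0, score))  # keep math.exp in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        """Apply one gradient step for a single (features, label) pair."""
        error = label - self.predict_proba(features)
        self.bias += self.learning_rate * error
        for name, value in features.items():
            self.weights[name] = (
                self.weights.get(name, 0.0) + self.learning_rate * error * value
            )


def bag_of_words(text):
    """Map a document to token counts (a deliberately simple representation)."""
    features = {}
    for token in text.lower().split():
        features[token] = features.get(token, 0.0) + 1.0
    return features


def next_document(model, unlabelled):
    """Uncertainty sampling: pick the document predicted closest to 0.5."""
    return min(
        unlabelled,
        key=lambda doc: abs(model.predict_proba(bag_of_words(doc)) - 0.5),
    )


if __name__ == "__main__":
    model = OnlineLogisticRegression()
    # Each confirmed or rejected annotation becomes one on-line update.
    for _ in range(5):
        model.update(bag_of_words("protein binds receptor"), 1)
        model.update(bag_of_words("weather is nice"), 0)
    pool = ["protein binds kinase", "weather is cold", "stock prices fell"]
    print(next_document(model, pool))
```

Because each update touches only the features of one example, the model can be refreshed after every user action, which is the property the project needs for updating predictions during annotation.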

Project Goals

Develop a Web application for annotation of documents and validation of data derived from text mining/machine learning

Technology and Skills

Participants should have knowledge of SQL, of at least one programming language (preferably C++, Python or Java) and of Web development, as well as interest in database technologies, machine learning and natural language processing.
