For bachelor students, we offer German-language lectures on database systems, complemented by paper- and project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master's students, we offer courses on information integration, data profiling, search engines, and information retrieval, complemented by specialized seminars, master's projects, and supervised master's theses.

The Web Science group focuses on various topics related to the Web, such as Information Retrieval, Natural Language Processing, Data Mining, Knowledge Discovery, Social Network Analysis, Entity Linking, and Recommender Systems. The group is particularly interested in Text Mining to deal with the vast amount of unstructured and semi-structured information available on the Web.

Most of our research is conducted in the context of larger research projects, in collaboration among students, groups, and universities. We strive to make most of our datasets and source code publicly available.

In recent years, the ever-growing number of documents on the Web and in digital libraries has led to a considerable increase in valuable textual information about entities. Harvesting entity knowledge from these large text collections is a major challenge: it requires linking the textual mentions within the documents to the real-world entities they refer to. This process is called entity linking.

This project aims at the automatic creation of entity links from texts to a knowledge base. In contrast to most recent research, which balances linking correctness (precision) against linking coverage (recall), this project focuses on creating reliable links by favoring precision. Precision is the decisive factor for subsequent tasks that build upon the linking results, such as text summarization, document classification, or topic-based clustering.
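As a minimal, illustrative sketch (not the project's actual implementation), precision-favoring linking can be thought of as emitting a link only when the top candidate's confidence clearly clears a threshold; an uncertain mention is deliberately left unlinked, trading recall for precision. The function name, scores, and entity identifiers below are hypothetical:

```python
# Hypothetical precision-first linking sketch. A real system would
# draw candidate entities and scores from a knowledge base such as
# YAGO or Wikipedia; here they are hard-coded for illustration.

def link_mention(mention, candidates, threshold=0.9):
    """Return the best candidate entity only if its score clears the
    threshold; otherwise return None (no link).

    candidates: dict mapping entity id -> confidence score in [0, 1].
    Refusing low-confidence links favors precision over recall.
    """
    if not candidates:
        return None
    best_entity, best_score = max(candidates.items(), key=lambda kv: kv[1])
    return best_entity if best_score >= threshold else None

# A mention with one clearly dominant candidate is linked:
link_mention("Paris", {"Paris_(France)": 0.95, "Paris_(Texas)": 0.03})
# A genuinely ambiguous mention is left unlinked rather than risk an error:
link_mention("Paris", {"Paris_(France)": 0.55, "Paris_(Texas)": 0.45})
```

The threshold is the knob that trades coverage for reliability: raising it produces fewer, but more trustworthy, links for the downstream tasks named above.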

CohEEL Distributed

This project aims at enabling entity recognition and alignment for huge text collections with tens of millions of documents. To this end, the distributed implementation of CohEEL is built on the Apache Flink framework and uses Wikipedia as its knowledge base. The source code is available on GitHub.

People

Datasets

News: The news article dataset contains 100 randomly selected Reuters articles from the CoNLL-YAGO dataset [1]. The articles were carefully annotated by hand with YAGO entities by our team members and can be found here.

Encyclopedic: The encyclopedic text corpus consists of Wikipedia articles selected in 2006 by Silviu Cucerzan [2]. The original annotations are available here. However, some of the original Wikipedia articles were missing, and the alignments to YAGO entities had to be determined anew. The updated dataset with the annotated YAGO entities can be found here.

Micro: The synthetic micro corpus consists of 50 short text snippets and was produced in the realm of the AIDA project [4]. Every snippet consists of a few (usually one) hand-crafted sentences about different ambiguous mentions of named entities, with properties similar to content on microblogging platforms such as Twitter. It is available as the KORE dataset here.