For bachelor students, we offer lectures in German on database systems, complemented by paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, search engines, and information retrieval, complemented by specialized seminars, master projects, and supervised master theses.

The Web Science group focuses on various topics related to the Web, such as Information Retrieval, Natural Language Processing, Data Mining, Knowledge Discovery, Social Network Analysis, Entity Linking, and Recommender Systems. The group is particularly interested in Text Mining to deal with the vast amount of unstructured and semi-structured information available on the Web.

Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our data sets and source code.

Profiling Dynamic Data

Data Profiling

Data profiling is the process of examining a given dataset for its structural metadata. These metadata include simple statistics, such as value distributions, which are easy to compute. More complex metadata usually involve multiple columns, which makes them significantly harder to discover. Particularly important multi-column metadata types are unique column combinations, inclusion dependencies, and functional dependencies. Because such metadata have many use cases, such as schema matching, data cleansing, query optimization, and data exploration, data profiling is a frequent activity for any IT professional.
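To make two of these metadata types concrete, the following minimal Python sketch checks whether a functional dependency or a unique column combination holds on a small in-memory table. The function names (holds_fd, is_unique) and the sample rows are illustrative assumptions, not part of any existing profiling tool, and the checks only validate a given candidate rather than discover all dependencies.

```python
from typing import Hashable, Sequence

# A row maps column names to values (hypothetical in-memory representation).
Row = dict[str, Hashable]


def holds_fd(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in rows.

    The FD holds if any two rows that agree on all lhs columns
    also agree on the rhs column.
    """
    seen: dict[tuple, Hashable] = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs values, different rhs value
        seen.setdefault(key, row[rhs])
    return True


def is_unique(rows: Sequence[Row], columns: Sequence[str]) -> bool:
    """Return True if no two rows share the same values in columns."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))


# Illustrative sample data (made up for this sketch).
rows = [
    {"zip": "14482", "city": "Potsdam", "name": "Ada"},
    {"zip": "14482", "city": "Potsdam", "name": "Bob"},
    {"zip": "10115", "city": "Berlin", "name": "Ada"},
]

print(holds_fd(rows, ["zip"], "city"))    # True: each zip maps to one city
print(holds_fd(rows, ["name"], "city"))   # False: "Ada" maps to two cities
print(is_unique(rows, ["zip", "name"]))   # True: no duplicate (zip, name) pair
```

Validating a single candidate like this takes one pass over the data. The hard part of profiling is that the number of candidate column combinations grows exponentially with the number of columns.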

Dynamic Data

Real-world data constantly change in daily business, thereby rendering existing metadata out of date. To keep up with the changes, profiled metadata must be updated continuously or at least frequently. So far, database research has proposed many profiling algorithms to efficiently discover certain types of metadata for fixed datasets, but re-executing these algorithms for every change is too costly and time-consuming. Consequently, new profiling algorithms are needed to efficiently maintain the metadata of such ever-changing datasets.

Project Goals

Given a relational dataset and its metadata, our objective is to monitor insert, update, and delete operations on the dataset in order to update the metadata accordingly. The metadata updates need to be fast enough to cope with possibly high change rates of the data. While incremental metadata updates are an algorithmic challenge for every type of metadata, we shall focus on functional dependencies (FDs). The project consists of the following subgoals:

Literature research: Review different profiling algorithms from previous research and assess how well each of them could be made incremental.
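As a rough illustration of what an incremental update means for a single FD, the following Python sketch keeps one candidate dependency up to date under insert, update, and delete operations without rescanning the dataset. The class FDMonitor and its methods are hypothetical names chosen for this example. The sketch is not taken from Metanome or any published algorithm, and it only monitors one given candidate rather than maintaining the full set of minimal FDs.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence

Row = dict[str, Hashable]


class FDMonitor:
    """Tracks whether one candidate FD lhs -> rhs holds under changes.

    For every distinct lhs value combination, a counter records how often
    each rhs value occurs. The FD holds as long as no lhs combination is
    associated with more than one distinct rhs value.
    """

    def __init__(self, lhs: Sequence[str], rhs: str) -> None:
        self.lhs = tuple(lhs)
        self.rhs = rhs
        self._groups: dict[tuple, Counter] = defaultdict(Counter)
        self._violating = 0  # number of lhs groups with 2 or more rhs values

    def _key(self, row: Row) -> tuple:
        return tuple(row[c] for c in self.lhs)

    def insert(self, row: Row) -> None:
        group = self._groups[self._key(row)]
        before = len(group)
        group[row[self.rhs]] += 1
        if before == 1 and len(group) == 2:
            self._violating += 1  # group just became a violation

    def delete(self, row: Row) -> None:
        key, value = self._key(row), row[self.rhs]
        group = self._groups.get(key)
        if group is None or group[value] == 0:
            raise KeyError("row is not present in the monitored data")
        before = len(group)
        group[value] -= 1
        if group[value] == 0:
            del group[value]
        if before == 2 and len(group) == 1:
            self._violating -= 1  # group no longer violates the FD
        if not group:
            del self._groups[key]

    def update(self, old: Row, new: Row) -> None:
        # An update is treated as a delete followed by an insert.
        self.delete(old)
        self.insert(new)

    @property
    def holds(self) -> bool:
        return self._violating == 0


monitor = FDMonitor(["zip"], "city")
monitor.insert({"zip": "14482", "city": "Potsdam"})
monitor.insert({"zip": "14482", "city": "Berlin"})
print(monitor.holds)  # False: zip 14482 maps to two cities
monitor.update({"zip": "14482", "city": "Berlin"},
               {"zip": "10115", "city": "Berlin"})
print(monitor.holds)  # True: the conflicting row was changed
```

Each operation costs constant time per monitored candidate (beyond hashing the row), at the price of memory proportional to the number of distinct value combinations. Maintaining all minimal FDs of a dataset is harder, because a change can both invalidate existing FDs and make new ones valid.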

With the HPI Metanome data profiling framework (www.metanome.de), we have access to many existing profiling algorithms and can probably reuse previous work for our new task. Ultimately, we aim to publish our results at a major scientific conference.