In distributed stream environments, data may arrive late or not at all when technical issues occur or data providers fail. Consequently, timely monitoring of a situation can be difficult. In collaboration with the University of Queensland, Brisbane, we will study a use case involving transportation scan data. We aim to investigate the following questions:

How can one detect potential incompleteness in streams?

How can one design systems that can assess incompleteness of queries over large databases in a timely manner?

How can one design monitoring tools to better warn of situations where incompleteness may lead to wrong understandings of the situation?
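As a point of reference for the first question, the simplest conceivable baseline flags a stream as potentially incomplete whenever two consecutive events lie further apart than some expected interval. The sketch below illustrates only this naive idea and is not the method pursued in the project; the function name `find_gaps`, the threshold `max_gap`, and the sample scan times are hypothetical.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs of consecutive events more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]

# Hypothetical scan events from one sensor, assumed to report about once a minute.
scans = [datetime(2024, 1, 1, 8, m) for m in (0, 1, 2, 9, 10)]

# Assumed tolerance: anything longer than three minutes counts as a suspicious gap.
gaps = find_gaps(scans, max_gap=timedelta(minutes=3))
```

A fixed threshold like this cannot tell a quiet period from missing data, which is part of what makes the questions above non-trivial.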

This project is executed in collaboration with the University of Queensland.

2. Completeness of Complex Objects

Information about database objects is often split across tables, and is often incomplete in human-intensive workflows. In a use case of a local public agency, construction projects have related information such as calls, contracts, billing documents or photos, which are often incomplete.

Users who are unaware of this incompleteness may then make wrong decisions.

Our goal is to find semi-supervised methods for assessing the completeness of such complex objects, which could then help users understand how complete the information about them is.

This project is executed in collaboration with Google Research, AT&T Labs-Research, the Hasso-Plattner Institute and the University of Trento.

3. Recall of Knowledge Bases

A third line of research we would like to pursue is the completeness analysis of general-purpose knowledge bases. Large knowledge bases such as Wikidata, NELL, or the Google Knowledge Graph usually place strong emphasis on the correctness of their information.

In contrast, little is known about the recall of these knowledge bases. Not surprisingly, the recall is often very good on popular topics (e.g., Nobel Prize laureates in Physics), but poor on many others (e.g., the average number of children per person is 0.02 according to DBpedia).
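To make the notion of recall concrete, one can compare the average number of facts per entity found in the knowledge base with the average one would expect in reality. The following is a minimal sketch under that assumption, not the estimation method of the project; the function `estimated_recall` is hypothetical, and the expected value of 2.0 children per person is an illustrative placeholder rather than a measured figure.

```python
def estimated_recall(observed_avg, expected_avg):
    """Ratio of facts per entity in the KB to facts per entity expected in reality."""
    if expected_avg <= 0:
        raise ValueError("expected_avg must be positive")
    # Cap at 1.0: a KB cannot recall more than all of the true facts.
    return min(1.0, observed_avg / expected_avg)

observed = 0.02  # average number of children per person, as quoted above for DBpedia
expected = 2.0   # ASSUMPTION: illustrative real-world average, not a measured value

recall = estimated_recall(observed, expected)
```

Under this assumed reference value, the knowledge base would contain roughly one in a hundred of the child relationships that exist. The hard part in practice is obtaining a trustworthy expected value for each topic.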

Our goal is to estimate the recall of knowledge bases on specific topics. We intend to use a mix of web extraction and machine learning techniques for this purpose.

This project is executed in collaboration with Telecom ParisTech University.