When: Friday, March 27, at 14.00
Where: PCRI Building, room 445
Who: Helena Galhardas, INESC-ID and IST/University of Lisbon
Title: Speeding up information extraction programs – a holistic optimizer and a learning-based approach to rank documents

Abstract:
A wealth of information produced by individuals and organisations is expressed in natural language text. Text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools that discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task.

In this talk, I will first present our proposal to optimize information extraction programs. It consists of a holistic approach that focuses on: (i) optimizing all key aspects of the information extraction process collectively and in a coordinated manner, rather than focusing on individual subtasks in isolation; (ii) accurately predicting the execution time, recall, and precision for each information extraction execution plan; and (iii) using these predictions to choose the best execution plan to execute a given information extraction program.
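To make point (iii) concrete, here is a minimal Python sketch of choosing among execution plans once their execution time, recall, and precision have been predicted. It is an illustration only, not the optimizer presented in the talk: the `PlanEstimate` type, the plan names, the numbers, and the selection rule (fastest plan that meets user-given recall and precision thresholds) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PlanEstimate:
    """Predicted behaviour of one execution plan (hypothetical structure)."""
    name: str
    time_s: float      # predicted execution time, in seconds
    recall: float      # predicted recall, in [0, 1]
    precision: float   # predicted precision, in [0, 1]


def choose_plan(
    plans: Iterable[PlanEstimate],
    min_recall: float,
    min_precision: float,
) -> Optional[PlanEstimate]:
    """Return the fastest plan whose predictions meet both quality thresholds.

    Assumed selection rule: quality thresholds are hard constraints and
    predicted time is the objective. Returns None if no plan qualifies.
    """
    feasible = [
        p for p in plans
        if p.recall >= min_recall and p.precision >= min_precision
    ]
    return min(feasible, key=lambda p: p.time_s, default=None)


# Made-up predictions for three hypothetical plans.
candidates = [
    PlanEstimate("scan_all", time_s=900.0, recall=0.95, precision=0.80),
    PlanEstimate("filtered_scan", time_s=300.0, recall=0.85, precision=0.82),
    PlanEstimate("keyword_query", time_s=60.0, recall=0.40, precision=0.90),
]

best = choose_plan(candidates, min_recall=0.8, min_precision=0.8)
print(best.name if best else "no feasible plan")
```

Other selection rules are equally plausible, for example maximizing recall under a time budget; the sketch only shows how per-plan predictions can drive the choice.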

Then, I will briefly present a principled, learning-based approach for ranking documents according to their potential usefulness for an extraction task. Our online learning-to-rank methods exploit the information collected during extraction, as we process new documents and the fine-grained characteristics of the useful documents are revealed. These methods also decide when the ranking model should be updated, thereby significantly improving the document ranking quality over time.

This is joint work with Gonçalo Simões, INESC-ID and IST/University of Lisbon, and Pablo Barrio and Luis Gravano from Columbia University, NY.

Permanent link to this article: https://team.inria.fr/oak/2015/03/27/helena-galhardas-speeding-up-information-extraction-programs-a-holistic-optimizer-and-a-learning-based-approach-to-rank-documents/

Abstract:
Data is incomplete when it contains missing/unknown information, or more generally when it is only partially available, e.g. because of restrictions on data access.

Incompleteness is receiving a renewed interest as it is naturally generated in data interoperation, a very common framework for today’s data-centric applications. In this setting data is decentralized, needs to be integrated from several sources and exchanged between different applications. Incompleteness arises from the semantic and syntactic heterogeneity of different data sources.

Querying incomplete data is usually an expensive task. In this talk we survey the state of the art and recent developments on the tractability of querying incomplete data, under different possible interpretations of incompleteness.

Permanent link to this article: https://team.inria.fr/oak/2015/03/11/cristina-sirangelo-querying-incomplete-data/