Abstract

Computing has been an enormous accelerator for science and has led to an information explosion in many different fields. The unprecedented volume of data acquired by sensors, derived by simulations and analysis processes, and shared on the Web opens up new opportunities, but it also creates many challenges when it comes to managing and making sense of these data. In this talk, I discuss the importance of maintaining detailed provenance (also referred to as lineage or pedigree) for digital data. Provenance provides important documentation that is key to preserving data, determining their quality and authorship, and understanding, reproducing, and validating results. I will review some state-of-the-art techniques, as well as research challenges and open problems involved in managing provenance throughout the data life cycle. I will also discuss benefits of provenance that go beyond reproducibility and present, in a live demo, techniques and tools we have developed that leverage provenance information to support reflective reasoning and collaborative data exploration and visualization. I conclude with a discussion of new applications that are enabled by provenance. In particular, I will show how provenance can be used to aid in teaching, to create reproducible publications, and as the basis for social data analysis.

Abstract

Evaluating conjunctive queries over a relational database is a central problem of database theory. This problem is also closely related to constraint satisfaction problems in artificial intelligence. We discuss decomposition methods, which are efficient ways to cope with the computational intractability of these problems. We then discuss semantic interoperability problems in coalitions of autonomous sources, where we present the advantages of adopting a constraint-optimization-based framework. We also discuss other related questions in Web data management, in particular entity matching in Web document collections and Twitter streams.
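To make the problem concrete, here is a minimal, illustrative sketch (not from the talk, and deliberately naive) of evaluating the conjunctive query Q(x, z) :- R(x, y), S(y, z) by joining relations atom by atom; it is this kind of nested evaluation whose worst-case cost decomposition methods aim to tame.

```python
# Naive conjunctive query evaluation: relations are lists of tuples,
# each atom pairs a tuple of variable names with a relation.

def join(atoms):
    """Join a list of (variables, relation) atoms into variable bindings."""
    results = [{}]  # start with the single empty binding
    for variables, relation in atoms:
        new_results = []
        for binding in results:
            for tup in relation:
                extended = dict(binding)
                consistent = True
                for var, val in zip(variables, tup):
                    if extended.get(var, val) != val:
                        consistent = False  # conflicts with an earlier atom
                        break
                    extended[var] = val
                if consistent:
                    new_results.append(extended)
        results = new_results
    return results

R = [(1, 2), (2, 3)]
S = [(2, 4), (3, 5)]
answers = join([(("x", "y"), R), (("y", "z"), S)])
print(sorted((b["x"], b["z"]) for b in answers))  # [(1, 4), (2, 5)]
```

The nesting over atoms is exactly where exponential blow-up can occur when many atoms share variables; decomposition methods bound this by exploiting the query's (hyper)graph structure.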

In many scientific domains, researchers are turning to large-scale behavioral simulations to better understand important real-world phenomena. These phenomena emerge as the result of myriad interactions among large numbers of interdependent agents in a complex system, such as a transportation network or an ecological system. While there has been a great deal of work on simulation tools in the high-performance computing community, behavioral simulations remain challenging to program and to scale automatically in parallel environments. In this talk, I will show how database techniques can address this challenge by offering simulation developers a programmable environment that automatically provides scalability and durability.

My talk will be organized in two parts. In the first, more detailed part of my talk, I will discuss the design of BRACE, the Big Red Agent-based Computation Engine. BRACE leverages spatial locality to treat behavioral simulations as iterated spatial joins and thus greatly reduces communication between nodes in a cluster. While this set-at-a-time processing model can be very efficient, it is much simpler for the domain scientist to program the behavior of a single agent. As a consequence, BRACE includes a high-level language called BRASIL (the Big Red Agent SImulation Language). BRASIL has object-oriented features for programming simulations, but can be compiled to a database-style representation for automatic parallelization and optimization. In the second part of my talk, I will discuss techniques to provide efficient durability for behavioral simulations. The problem is challenging because these systems must sustain extremely high update rates, often hundreds of thousands of updates per second or more. We leverage the observation that simulations have frequent points of consistency to develop novel checkpoint-recovery algorithms that trade additional space in main memory for significantly lower overhead and latency than existing methods. After presenting the database approach taken in BRACE, I will discuss directions of ongoing work on complex simulation models over cloud computing environments, as well as directions of future work.
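The "iterated spatial join" idea above can be sketched in a few lines. This is a hypothetical simplification (the names, the grid index, and the toy flocking rule are ours, not BRACE's): each simulation tick is a spatial self-join in which every agent sees only neighbors within a fixed radius, and all updates are computed from the old state before being swapped in, which is what makes the model set-at-a-time.

```python
# One simulation tick as a spatial self-join over a uniform grid index,
# avoiding the all-pairs comparison a naive neighbor search would need.
from collections import defaultdict

RADIUS = 1.0  # visibility radius of each agent (assumed)

def tick(agents):
    """agents: list of dicts with 'x', 'y'. Returns the next state."""
    cell = lambda a: (int(a["x"] // RADIUS), int(a["y"] // RADIUS))
    grid = defaultdict(list)
    for a in agents:
        grid[cell(a)].append(a)

    def neighbors(a):
        cx, cy = cell(a)
        for dx in (-1, 0, 1):          # a radius-R query only needs to
            for dy in (-1, 0, 1):      # inspect the 3x3 surrounding cells
                for b in grid[(cx + dx, cy + dy)]:
                    if b is not a and \
                       (a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2 <= RADIUS ** 2:
                        yield b

    # Set-at-a-time: every update reads the OLD state, then all swap at once.
    new_state = []
    for a in agents:
        nbrs = list(neighbors(a))
        if nbrs:
            # toy behavior: drift halfway toward the average neighbor position
            ax = sum(b["x"] for b in nbrs) / len(nbrs)
            ay = sum(b["y"] for b in nbrs) / len(nbrs)
            new_state.append({"x": (a["x"] + ax) / 2, "y": (a["y"] + ay) / 2})
        else:
            new_state.append(dict(a))
    return new_state
```

Because each tick only joins agents with their spatial neighborhood, the grid can be partitioned across cluster nodes and communication confined to cells along partition boundaries, which is the locality argument made in the talk.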

Short bio

Marcos Vaz Salles is a postdoc at Cornell University. His research targets building novel data-driven systems that bring classic database benefits, such as scalability and ease of programming, to new domains. At Cornell, Marcos is currently working on data management techniques for computer games and behavioral simulations. During his PhD in the Systems Group at ETH Zurich, he investigated hybrid search and data integration architectures for personal dataspace management in the iMeMex project. Previously, Marcos obtained his MSc from PUC-Rio, Brazil, and his BSc from UNICAMP, Brazil.

Abstract

The representation of imperfection has always been a major issue in the field of database management. It is well recognized that if we cannot represent imperfection, such as missing, imprecise, and uncertain values, we risk losing much valuable information. The topic of this talk is the management of uncertain data, and more specifically data integration with uncertainty. Uncertainty is a state of limited knowledge, where we do not know which of two or more alternative statements is true – for instance, we may not know with certainty the value of a data item when two heterogeneous databases show different values for it. Data integration is the process of providing the user with a unified view of data residing at different sources. While traditional data integration methods more or less explicitly consider uncertainty as a problem, as something to be avoided, some recent approaches treat uncertainty as an additional source of information, one that is sometimes precious and should be preserved. In this talk I will focus on the status of research on uncertainty management in data integration, presenting some recent results.

Abstract

An increasing number of applications are making use of explicit knowledge about words and the entities they represent. This talk presents three data integration methods to obtain such knowledge. The first involves learning models to disambiguate word meanings. The second reconciles equivalence and distinctness information about entities from multiple sources. The third method adds a comprehensive taxonomic hierarchy, reflecting how different entities relate to each other. Together, they can be used to produce a large-scale multilingual knowledge base semantically describing over 5 million entities and over 16 million natural language words and names in more than 200 different languages.
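The second method, reconciling equivalence and distinctness information, can be illustrated with a small sketch. This is our own simplification of the general idea, not the method from the talk: "same-as" assertions are merged with a union-find structure, while "distinct-from" assertions act as constraints that veto conflicting merges.

```python
# Reconciling equivalence ("same-as") and distinctness ("distinct-from")
# assertions about entities using union-find with conflict checks.
class Reconciler:
    def __init__(self):
        self.parent = {}
        self.distinct = set()  # frozensets of representative pairs

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def assert_distinct(self, a, b):
        self.distinct.add(frozenset((self.find(a), self.find(b))))

    def assert_same(self, a, b):
        """Merge a and b; return False if that contradicts distinctness."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if frozenset((ra, rb)) in self.distinct:
            return False  # conflict: reject this equivalence claim
        self.parent[ra] = rb
        # re-point distinctness constraints that mentioned the merged root
        self.distinct = {frozenset(self.find(x) for x in pair)
                         for pair in self.distinct}
        return True

r = Reconciler()
r.assert_distinct("Paris_Texas", "Paris_France")   # from one source
r.assert_same("Paris", "Paris_France")             # accepted
print(r.assert_same("Paris", "Paris_Texas"))       # False: rejected merge
```

In practice one would also weigh the reliability of each source when deciding which of two conflicting assertions to reject, rather than simply trusting whichever arrived first as this sketch does.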

Short bio

Gerard de Melo is a post-doctoral researcher at the Max Planck Institute for Informatics in the Databases and Information Systems group led by Gerhard Weikum. Gerard received his doctoral degree from Saarland University. He has published over 15 papers at conferences like CIKM and ACL, and has won two Best Paper awards (ICGL 2008, CIKM 2010). For more information, please visit http://www.mpi-inf.mpg.de/~gdemelo/.

Abstract

One of the major issues faced by Web applications is the management of evolving data. In this thesis, we consider this problem, and in particular the evolution of active documents. Active documents are a formalism that describes the evolution of XML documents by activating Web service calls included in the document. The formalism has already been used in the context of distributed data management.

The main contributions of this thesis are theoretical studies motivated by two systems, for managing stream applications and workflow applications respectively. In the first contribution, we study the problem of view maintenance over active documents. The results served as the basis for an implementation of stream processors based on active documents, called Axlog widgets. In the second, we view active documents as the core of data-centric workflows and consider various ways of expressing constraints on the evolution of documents. The implementation, called Axart, validated the approach of a data-centric workflow system based on active documents.

Abstract

Student teams competed to build a distributed query engine in the SIGMOD 2010 Programming Contest. In this talk, I will describe the task briefly and present the winning system. Specifically, I will cover the design choices and implementation issues for the query planner and executor. I will also discuss several optimizations and their effectiveness.