Abstract

Workflow models have been essentially operation-centric for many years, almost completely ignoring the data aspects. Recently, a new paradigm of data-centric workflows, called business artifacts, was introduced by Nigam and Caswell. We follow this approach and propose a model where artifacts are XML documents that evolve over time through interactions with their environment, i.e., human users or Web services. This paper proposes the AXART system, a distributed platform for collaborative work that harnesses the power of our model. We illustrate AXART with an example taken from the movie industry: applying for a role in a film is a typical collaborative process that involves various participants, both inside and outside the film company. The demonstration scenario covers both a standard workflow process and dynamic workflow modifications, based on two extension mechanisms: workflow specialization and workflow exception. The workflows, modeled using artifacts, are supported by the AXART system by combining techniques specific to active documents, such as view maintenance, with security techniques to manage access rights.

Abstract

This talk shall consist of two parts. In the first part, we shall deal with the problem of replicating data items in an unstructured peer-to-peer network. We present a simple distributed greedy algorithm that aims to optimize the probability of successfully retrieving the requested items. This is the first work to determine both the degrees of replication and the placement of the replicas in a provably near-optimal way. We prove that our algorithm (coined P2R2) guarantees a successful-search probability that is within a factor of 1/2 of the optimal solution.
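The flavor of such a greedy replication scheme can be sketched in a few lines. This is only an illustration, not the actual P2R2 algorithm: the query rates, the random-probe success model, and all function names below are assumptions made for the sake of the example.

```python
def greedy_replication(query_rates, capacity, n_peers, probes):
    """Toy greedy replica placement (illustrative, not the actual P2R2
    algorithm): repeatedly give the next replica to the item whose extra
    copy most increases the expected success probability of a search.

    With r replicas of an item spread over n_peers peers, a search that
    probes `probes` random peers succeeds with probability roughly
    1 - (1 - r/n_peers)**probes.
    """
    def success(r):
        return 1.0 - (1.0 - r / n_peers) ** probes

    replicas = {item: 1 for item in query_rates}   # every item stored once
    for _ in range(capacity - len(replicas)):      # spend the remaining budget
        # marginal gain of one more replica, weighted by the item's query rate
        item = max(query_rates,
                   key=lambda i: query_rates[i] *
                                 (success(replicas[i] + 1) - success(replicas[i])))
        replicas[item] += 1
    return replicas
```

Because the marginal gain of an extra replica shrinks as an item accumulates copies, the greedy loop naturally spreads the budget: popular items get more replicas, but not the entire capacity.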

We then present another distributed algorithm for a similar problem, which dramatically improves upon the greedy algorithm in terms of the number of communication rounds, reducing them from linear to poly-logarithmic. Our algorithm is based on efficiently solving a linear-program formulation of the problem in a distributed environment, and bears the interesting feature that it can be easily implemented in MapReduce. Our experiments show the viability and effectiveness of our approaches.

We conclude our talk by showing how our general methodology for solving a linear program in a distributed environment represents a valuable tool for solving many different problems, in particular in information extraction.

Abstract

Peer Data Management Systems (PDMS) consist of a volatile set of peers. Each peer answers queries against its own schema by exploiting local data and by passing queries to neighboring peers along so-called schema mappings. PDMS are highly flexible due to their decentralized nature, but query answering scales poorly due to the massive redundancy in the paths along which queries are routed. Additionally, repeated query rewriting often leads to increasing information loss.

Our work is based on the idea of trading completeness of query answers for speed of execution, thus turning completeness from a requirement into an optimization goal. To this end, peers can prune, during query answering, those paths for which they estimate a poor cost/benefit ratio. However, estimating this ratio in highly distributed systems such as PDMS is difficult. We present a technique based on self-adaptive multidimensional histograms that are updated by exploiting the queries passing through the network. Based on these histograms, we present several techniques to trade off benefit against cost. One approach limits the time budget available for query answering. An orthogonal strategy exploits statistics on overlap between data to reduce redundancy in query processing. Experiments with our own PDMS, “System P”, show efficiency gains of an order of magnitude or more.
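The budget-limited strategy can be sketched as follows. This is a generic illustration, not System P's actual technique: the per-neighbor cost and benefit estimates (which the histograms would supply) and all names are assumptions.

```python
def prune_neighbors(neighbors, time_budget):
    """Illustrative cost/benefit pruning (a sketch, not System P's actual
    strategy): rank neighboring peers by estimated benefit per unit cost,
    as would be predicted from query-feedback histograms, and forward the
    query only to those that fit within the time budget."""
    ranked = sorted(neighbors,
                    key=lambda n: n["benefit"] / n["cost"],
                    reverse=True)
    chosen, spent = [], 0.0
    for n in ranked:
        if spent + n["cost"] <= time_budget:
            chosen.append(n["peer"])
            spent += n["cost"]
    return chosen
```

A neighbor with high estimated cost and little expected benefit is simply never contacted, which is exactly the completeness-for-speed trade described above.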

Short Bio

Armin Roth is an external doctorate candidate at the Humboldt-Universität zu Berlin, Germany. His advisors are Ulf Leser and Felix Naumann from the Hasso Plattner Institut in Potsdam, Germany. Armin joined the IBM Lab in Boeblingen, Germany as a development engineer in 2009, where he contributes to the IBM Infosphere Information Server. He received his diploma in mechanical engineering from Universität Stuttgart, Germany and finished postgraduate studies in practical computer science at FernUniversität Hagen, Germany. His focus is on information integration and data quality, both in industry and academia.

During this talk, he will focus on his research performed independently from IBM.

Abstract

While Ajax-based programming enables better performance and higher interface quality than pure server-side programming, it is demanding and error-prone, since each action that partially updates the page requires custom, ad-hoc code.

The problem is exacerbated by distributed programming across browser and server, where the developer uses JavaScript to access the page state and Java/SQL for the database. The FORWARD framework simplifies the development of Ajax pages by treating them as rendered views: the developer declares a view using an extension of SQL, together with page units that map to the view and render the data in the browser. Such a declarative approach leads to significantly less code, as the framework automatically solves performance optimization problems that the developer would otherwise hand-code. Since pages are fueled by views, FORWARD leverages years of database research on incremental view maintenance, extending these optimization techniques to the needs of pages (nesting, variability, ordering) and thereby achieving performance comparable to hand-coded JavaScript/Java applications.
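The core idea behind incremental view maintenance can be shown in a few lines. This is a generic textbook illustration, not FORWARD's actual engine, and the class and method names are invented for the example.

```python
class GroupCountView:
    """Minimal incremental view maintenance sketch (generic, not
    FORWARD's actual machinery): a view equivalent to
        SELECT key, COUNT(*) FROM base GROUP BY key
    kept up to date by applying each base-table delta to the single
    affected group, instead of recomputing the view from scratch."""

    def __init__(self):
        self.counts = {}

    def on_insert(self, key):
        # A base-table insertion touches exactly one group of the view.
        self.counts[key] = self.counts.get(key, 0) + 1

    def on_delete(self, key):
        # Likewise, a deletion only adjusts (or removes) one group.
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.counts[key]
```

The same delta-propagation principle, extended to handle nesting, variability, and ordering, is what lets a page refresh only the fragments whose underlying data actually changed.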

Abstract

Visual analytics aims at combining interactive data visualization with data analysis tasks. Given the explosion in the volume and complexity of scientific data, e.g., data associated with biological or physical processes, social networks, etc., visual analytics is called upon to play an important role in scientific data management.

Most visual analytics platforms, however, are memory-based and are therefore limited in the volume of data they can handle. Moreover, each new algorithm (e.g., for clustering) must be integrated into the platform by hand. Finally, such platforms lack the ability to define and deploy well-structured processes in which users with different roles interact in a coordinated way, sharing the same data and possibly the same visualizations.

We have designed and implemented EdiFlow, the first workflow platform for visual analytics. EdiFlow uses a simple structured process model and is backed by a persistent database storing both process information and process instance data. EdiFlow processes provide the usual process features (roles, structured control) and may integrate visual analytics tasks as activities. We present its architecture, its deployment on a sample application, and the main technical challenges involved.

Abstract

ViP2P is a fully functional Java-based platform for the efficient, scalable management of XML documents in structured peer-to-peer networks based on distributed hash table (DHT) indices. We exploit indices (or materialized views), deployed in the P2P network independently by the peers, to answer an interesting dialect of tree pattern queries. The platform includes a query (and view) language, a rewriting algorithm, DHT-based view definition indexing strategies, and much more…

You can find out more details about ViP2P by visiting the ViP2P website.

Abstract
XML projection is one of the main optimization techniques adopted for reducing memory consumption in XQuery in-memory engines. The main idea behind this technique is quite simple: given a query Q over an XML document D, instead of evaluating Q on D, the query Q is evaluated on a smaller document D’ obtained from D by pruning out, at loading time, the parts of D that are irrelevant for Q. The queried document D’ is thus a projection of the original one, and is often much smaller than D, since queries tend to be quite selective in general.
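The pruning step can be illustrated with a small sketch. This is a deliberately simplified illustration of the projection idea, not the actual mechanism of the work: the path representation and the function name are assumptions for the example.

```python
import xml.etree.ElementTree as ET

def project(elem, paths, prefix=""):
    """Toy XML projection (illustrative only): return a pruned copy of
    `elem` keeping only the subtrees needed by `paths`, a set of
    /-separated tag paths extracted from the query."""
    path = f"{prefix}/{elem.tag}" if prefix else elem.tag
    # Keep the whole subtree if its path is required by the query.
    if path in paths:
        return ET.fromstring(ET.tostring(elem))  # deep copy of the subtree
    # Keep a bare copy of the node if it lies on the way to a required path.
    if any(p.startswith(path + "/") for p in paths):
        copy = ET.Element(elem.tag, elem.attrib)
        for child in elem:
            pruned = project(child, paths, path)
            if pruned is not None:
                copy.append(pruned)
        return copy
    return None  # irrelevant for the query: pruned at load time
```

Everything outside the query's paths never makes it into memory, which is exactly where the space savings come from; the challenge for updates is that pruned parts must be merged back into the stored document afterwards.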

While projection techniques have been extensively investigated for XML querying, we are not aware of applications to XML updating. The purpose of this work is to investigate a projection-based optimization mechanism for XML updates.

Abstract

This talk will present YAGO, a large ontology that currently contains more than 2 million entities and close to 20 million facts about them. The talk will explain how the ontology was constructed automatically from Wikipedia and WordNet. It will also introduce the SOFIE project, which uses logical reasoning to extract new information for YAGO from Web sources and reconciles it with the existing data.

Abstract

Online reviews are an important asset for users deciding whether to buy a product, see a movie, or go to a restaurant, as well as for businesses tracking user feedback. However, most reviews are written in free-text format, and are therefore difficult for computer systems to understand, analyze, and aggregate. One consequence of this lack of structure is that searching text reviews is often frustrating for users; keyword searches typically do not provide good results, as the same keywords routinely appear in both good and bad reviews. User experience would be greatly improved if the structure and sentiment information conveyed in the content of the reviews were taken into account. Our work focuses on identifying this structure and sentiment information in free-text reviews, and on using this knowledge to improve user experience in accessing reviews. Specifically, we focus on improving recommendation accuracy in a restaurant review scenario.

We report on our classification effort, and on the insights into user reviewing behavior that we gained in the process. We propose new ad-hoc and regression-based recommendation measures that take into account the textual component of user reviews. Our results show that using textual information yields better general and personalized restaurant score predictions than those derived from the numerical star ratings given by the users.
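To make the idea of a text-aware score concrete, here is a deliberately tiny sketch. The lexicon, weights, and function name are all invented for the illustration; the actual work trains classifiers and regression models on labeled reviews rather than using a fixed word list.

```python
# Hypothetical toy lexicon, not the paper's trained model.
POSITIVE = {"great", "delicious", "friendly", "excellent"}
NEGATIVE = {"bland", "slow", "rude", "overpriced"}

def predict_stars(review, baseline=3.0, weight=0.5):
    """Toy text-aware score predictor (illustrative only): shift a neutral
    baseline rating by the balance of positive vs. negative words in the
    review text, clamped to the 1-5 star range."""
    words = review.lower().split()
    balance = (sum(w in POSITIVE for w in words)
               - sum(w in NEGATIVE for w in words))
    return max(1.0, min(5.0, baseline + weight * balance))
```

Even this crude signal shows why text helps: two reviews with the same star rating can carry very different sentiment, and only the text reveals it.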

Abstract

We examine the challenges behind recommendation in social content sites. We use collaborative tagging sites (think del.icio.us, YouTube, and Yahoo! Travel) as our application and report on our experiments in harvesting collective tagging behavior to serve relevant content (think URLs, videos, travel destinations) to users. We address well-known and lesser-known problems in recommender systems, such as over-specialization and data management for the masses. We conclude with open questions.
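A baseline for serving content from tagging behavior can be sketched as follows. This is a generic illustration, not the talk's actual method: the tag-count representation and names are assumptions.

```python
from math import sqrt

def recommend(user_tags, items, k=2):
    """Sketch of tag-based recommendation (illustrative only): score each
    item by the cosine similarity between the user's tag counts and the
    item's tag counts, and return the top-k item ids."""
    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0) for t in a)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return sorted(items, key=lambda i: cosine(user_tags, items[i]),
                  reverse=True)[:k]
```

A pure similarity ranking like this is also a good way to see the over-specialization problem mentioned above: it keeps recommending more of what the user has already tagged.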