Abstract

The advent of cloud computing technologies shows great promise for web engineering and facilitates the development of flexible, distributed, and scalable web applications. Data integration can notably benefit from cloud computing because integrating web data is usually an expensive task. This paper introduces CloudFuice, a data integration system that follows a mashup-like specification of advanced dataflows for data integration. CloudFuice’s task-based execution approach allows for an efficient, asynchronous, and parallel execution of dataflows in the cloud and utilizes recent cloud-based web engineering instruments. We demonstrate and evaluate CloudFuice’s applicability for mashup-based data integration in the cloud with the help of a first prototype implementation.

...s. All operators and tasks are implemented as REST-based web services and use JSON as the data exchange format. The prototype employs Google App Engine’s datastore, which implements the Bigtable data model =-=[4]-=-. Task results can be stored concurrently using a unique task id that is assigned during task generation. Each task result contains a reference to the corresponding operator and, thus, operator resul...
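
The storage scheme described in this excerpt can be illustrated with a minimal sketch. It is not the prototype's code: `TaskResultStore`, `new_task_id`, and the record fields are hypothetical names, and a lock-protected dictionary stands in for the App Engine datastore. The sketch only shows the idea that each task writes a JSON record under its own unique id and carries a reference to its operator, so that concurrent writes do not collide and results can be collected per operator.

```python
import json
import threading
import uuid


class TaskResultStore:
    """In-memory stand-in for a datastore (hypothetical, not the App Engine API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # task id -> JSON string

    def put(self, task_id, operator_id, entities):
        # Each record references its operator so results can be grouped later.
        record = json.dumps(
            {"task_id": task_id, "operator": operator_id, "result": entities}
        )
        with self._lock:
            self._records[task_id] = record

    def results_for_operator(self, operator_id):
        with self._lock:
            records = [json.loads(r) for r in self._records.values()]
        return [r for r in records if r["operator"] == operator_id]


def new_task_id():
    """Assign a unique id at task generation time (assumption: a random UUID)."""
    return uuid.uuid4().hex


store = TaskResultStore()
ids = [new_task_id() for _ in range(3)]

# Three tasks of the same operator store their results concurrently.
threads = [
    threading.Thread(target=store.put, args=(task_id, "op-1", [i]))
    for i, task_id in enumerate(ids)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every write uses a distinct key, tasks never overwrite each other, and `store.results_for_operator("op-1")` gathers the partial results of one operator regardless of the order in which the tasks finished.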

...s following in the dataflow as we will detail below. To support efficient, parallel execution of dataflows, we support both intra- and inter-operator parallelism, similar to parallel database systems =-=[7]-=-. In addition to pipeline parallelism between adjacent operators, data partitioning is utilized to run independent operators on different data in parallel and to parallelize operators on disjoint data...
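
As a rough illustration of parallelizing one operator over disjoint data partitions, the following sketch applies the same function to each partition in a thread pool. The operator `normalize` and the helper `run_partitioned` are hypothetical and chosen only for the example; the excerpt does not say how CloudFuice schedules its tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(partition):
    """Hypothetical operator: trim and lower-case each entity name."""
    return [name.strip().lower() for name in partition]


def run_partitioned(operator, partitions, max_workers=4):
    """Apply one operator to disjoint partitions in parallel.

    Executor.map returns results in input order, so partition i of the
    output corresponds to partition i of the input.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(operator, partitions))


partitions = [[" Alice ", "BOB"], ["Carol"], []]
results = run_partitioned(normalize, partitions)
```

Since the partitions are disjoint, the operator instances share no state and need no coordination beyond collecting their outputs.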

...nown as cloud) has triggered research in parallel data processing in a wide variety of applications. Driving forces are simple and powerful parallel programming models such as MapReduce [6] and Dryad =-=[12]-=-. Although the MapReduce programming model is limited to two dataflow primitives (map and reduce), it has proven to be very powerful for a wide range of applications. On the other hand, Dryad supports gene...

...oop further stimulate the popularity of distributed data processing. Higher-level languages can be layered on top of these infrastructures. Examples include the high-level dataflow language Pig Latin =-=[19]-=-, Nephele/PACTs [1] (based on MapReduce), DryadLINQ [24] as well as SCOPE [3] that offers a SQL-like scripting language on top of Microsoft’s distributed computing platform Cosmos. Similar to Cloudfui...

... processing. Higher-level languages can be layered on top of these infrastructures. Examples include the high-level dataflow language Pig Latin [19], Nephele/PACTs [1] (based on MapReduce), DryadLINQ =-=[24]-=- as well as SCOPE [3] that offers a SQL-like scripting language on top of Microsoft’s distributed computing platform Cosmos. Similar to CloudFuice, a high-level program is decomposed into small buildi...

...vel languages can be layered on top of these infrastructures. Examples include the high-level dataflow language Pig Latin [19], Nephele/PACTs [1] (based on MapReduce), DryadLINQ [24] as well as SCOPE =-=[3]-=- that offers a SQL-like scripting language on top of Microsoft’s distributed computing platform Cosmos. Similar to CloudFuice, a high-level program is decomposed into small building blocks (tasks) th...

...e the popularity of distributed data processing. Higher-level languages can be layered on top of these infrastructures. Examples include the high-level dataflow language Pig Latin [19], Nephele/PACTs =-=[1]-=- (based on MapReduce), DryadLINQ [24] as well as SCOPE [3] that offers a SQL-like scripting language on top of Microsoft’s distributed computing platform Cosmos. Similar to CloudFuice, a high-level pr...

...d [17]. Applications need components for the data, process, and presentation levels [18], and CloudFuice mainly focuses on the data level. A popular approach is pipes, e.g., as used in Yahoo! Pipes, Damia =-=[21]-=-, or [15], that process entity sets via relatively simple user-specified dataflows. These tools have proven applicable in different settings since they offer a powerful and easy-to-use inte...

...rk Mashup-based data integration has become very popular in recent years and many tools and frameworks have been developed [17]. Applications need components for the data, process, and presentation levels =-=[18]-=-, and CloudFuice mainly focuses on the data level. A popular approach is pipes, e.g., as used in Yahoo! Pipes, Damia [21], or [15], that process entity sets via relatively simple user-specified dataflo...

...pplications need components for the data, process, and presentation levels [18], and CloudFuice mainly focuses on the data level. A popular approach is pipes, e.g., as used in Yahoo! Pipes, Damia [21], or =-=[15]-=-, that process entity sets via relatively simple user-specified dataflows. These tools have proven applicable in different settings since they offer a powerful and easy-to-use interface for...

...rdware (also known as cloud) has triggered research in parallel data processing in a wide variety of applications. Driving forces are simple and powerful parallel programming models such as MapReduce =-=[6]-=- and Dryad [12]. Although the MapReduce programming model is limited to two dataflow primitives (map and reduce), it has proven to be very powerful for a wide range of applications. On the other hand, Drya...

... few large tasks that may not fully exploit the power of the available computing resources. For intra-operator parallelism, we evaluate size-based partitioning functions for entity matching similar to =-=[13]-=-. A maximal block size b ensures that each match task only processes a limited part of the Cartesian product, i.e., at most b × b entities per task. For our experiment, we employ attrMatch with two entit...
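
The effect of the maximal block size b can be sketched as follows. This is an illustrative reading of size-based partitioning, not the paper's implementation: the helper names `partition` and `match_tasks` are assumptions, and plain integers stand in for entities. Each input is cut into blocks of at most b entities, and one match task is generated per pair of blocks, so no task compares more than b × b entity pairs.

```python
from itertools import product


def partition(entities, b):
    """Split a list into consecutive blocks of at most b entities."""
    if b < 1:
        raise ValueError("block size b must be positive")
    return [entities[i:i + b] for i in range(0, len(entities), b)]


def match_tasks(source_a, source_b, b):
    """Yield one (block_a, block_b) pair per match task.

    Together the tasks cover the full Cartesian product of the two
    inputs, and each task covers at most b * b entity pairs.
    """
    for block_a, block_b in product(partition(source_a, b), partition(source_b, b)):
        yield block_a, block_b


# Toy inputs: 5 and 3 entities with b = 2 give 3 x 2 = 6 match tasks.
tasks = list(match_tasks(list(range(5)), list(range(3)), b=2))
```

A smaller b produces more, smaller tasks and therefore more room for parallel execution, at the cost of more per-task overhead; a larger b does the opposite.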

...ts, in contrast to synchronized offline data analysis and processing. For example, MapReduce enforces synchronization between the map and reduce phases (a disadvantage that has recently been addressed in =-=[8]-=-).

7 Conclusions and Future Work

We presented CloudFuice, a flexible system for specification and execution of dataflows for data integration. The task-based execution approach allows for an efficient...

...es. However, the use of [2 Andreas Thor and Erhard Rahm] existing search engines may require several queries for more complex integration tasks to obtain a sufficient number of relevant result entities =-=[9]-=-. Execution of many queries needs to be reliable, i.e., it requires handling of failed queries (e.g., due to network congestion) as well as dealing with source restrictions, such as access quota. Mash...

...It is based on a general mashup dataflow model with operators that can be executed on different nodes. Dynamic scheduling takes into account several parameters such as network and users. The AMMORE =-=[11]-=- system even modifies original mashup dataflows to avoid duplicate computations and unnecessary data retrievals. To this end, AMMORE identifies common operator sequences in different mashups and execu...

...ent. CloudFuice therefore strives for a good balance between powerful data integration operators and a simple scripting language. A few recent mashup platforms also deal with mashup efficiency. CoMaP =-=[10]-=- targets a distributed mashup execution that minimizes the overall mashup execution time of multiple hosted mashups. It is based on a general mashup dataflow model with operators that can be executed ...

...aper whereas “B Smith” relates to the Cloud and the Matching paper. Correspondences can be annotated with similarity values or other metadata, e.g., to reflect a confidence level for entity matching =-=[22]-=-. In this paper, we restrict the mapping definition to entity pairs without annotations, but we present an extension in Appendix A. Queries are the third type of data structures because they are key to ...
