Abstract

In recent years, ad hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. This paper discusses the opportunities and challenges for efficient parallel data processing in clouds and presents the research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines, which are automatically instantiated and terminated during job execution. Based on this new framework, extended evaluations of MapReduce-inspired processing jobs are performed on an IaaS cloud system, and the results are compared to the popular data processing framework Hadoop.