SEEP: Stateful Big Data Processing

As users of big data applications expect fresh results, we witness a new breed of stream processing systems that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges:

to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases;

failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times yet low per-machine overheads.

An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.

The SEEP project explores large scale stream data processing in cloud architectures. It explores new data management and system mechanisms that enable elastic data processing, provisioning more resources when computing demand increases and decommisioning them when are no longer necessary. Elasticity is one important property due to the cost-per-cycle of the cloud; equally important is fault tolerance. The system needs to recover after failures to cope with the incoming data, and the process needs to be efficient, since failures are common in large clusters.

The main challenge is to provide these two important properties when we handle stateful operators, which improve the applicability of the system to more domains and thus the applications that can benefit from it. We propose an integrated approach to scale out stateful operators while maintaining fault tolerance. In SEEP we make the internal state of operators explicit, exposing it to the system, in order to scale out operators and recover them after failures. This contribution allows us to build more complex models and applications, that can benefit now from the state externalisation.