Abstract [en]

Community Network Cloud is an emerging distributed cloud infrastructure that is built on top of a community network. The infrastructure consists of a number of geographically distributed compute and storage resources, contributed by community members, that are linked together through the community network. Stream processing is an important enabling technology that, if provided in a Community Network Cloud, would enable a new class of applications, such as social analysis, anomaly detection, and smart home power management. However, modern stream processing engines are designed to be used inside a data center, where servers communicate over a fast and reliable network. In this work, we evaluate the Apache Storm stream processing framework in an emulated Community Network Cloud in order to identify the challenges and bottlenecks that exist in the current implementation. The community network emulation was performed using data collected from the Guifi.net community network in Spain. Our evaluation results show that, with proper configuration of the heartbeats, it is possible to run Apache Storm in a Community Network Cloud. The performance is sensitive to the placement of the Storm components in the network. Deploying the management components on well-connected nodes improves the Storm topology scheduling time, fault tolerance, and recovery time. Our evaluation also indicates that the Storm scheduler and the stream groupings need to be aware of the network topology and the location of stream sources in order to place Storm spouts and bolts optimally and improve performance.

Peiro Sajjad, Hooman

Abstract [en]

In this thesis, our goal is to enable and achieve effective and efficient real-time stream processing in a geo-distributed infrastructure, by combining the power of central data centers and micro data centers. Our research focus is to address the challenges of distributing the stream processing applications and placing them closer to data sources and sinks. We enable applications to run in a geo-distributed setting and provide solutions for the network-aware placement of distributed stream processing applications across geo-distributed infrastructures.

First, we evaluate Apache Storm, a widely used open-source distributed stream processing system, in the Community Network Cloud, as an example of a geo-distributed infrastructure. Our evaluation exposes new requirements for stream processing systems to function in a geo-distributed infrastructure. Second, we propose a solution to facilitate the optimal placement of the stream processing components on geo-distributed infrastructures. We present a novel method for partitioning a geo-distributed infrastructure into a set of computing clusters, each called a micro data center. According to our results, we can increase the minimum available bandwidth in the network and, likewise, reduce the average latency to less than 50%. Next, we propose a parallel and distributed graph partitioner, called HoVerCut, for fast partitioning of streaming graphs. Since a lot of data can be represented in the form of a graph, graph partitioning can be used to assign the graph elements to different data centers to provide data locality for efficient processing. Last, we provide an approach, called SpanEdge, that enables stream processing systems to work on a geo-distributed infrastructure. SpanEdge unifies stream processing over the central and near-the-edge data centers (micro data centers). As a proof of concept, we implement SpanEdge by extending Apache Storm so that it can run across multiple data centers.
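
To illustrate what partitioning a streaming graph involves, the following is a minimal sketch of a generic greedy heuristic that assigns each arriving edge to a partition in a single pass. It is not the HoVerCut algorithm, whose details are not given in this abstract. The function names, the `slack` balance parameter, and the idea of treating each partition as a data center are assumptions made only for this example.

```python
from collections import defaultdict


def greedy_stream_partition(edges, num_parts, slack=1):
    """Assign each edge of a stream to one partition in a single pass.

    Generic greedy heuristic (an illustration, not HoVerCut): prefer a
    partition that already holds both endpoints, then one that holds either
    endpoint, as long as that partition is not overloaded by more than
    `slack` edges relative to the least loaded partition.
    """
    replicas = defaultdict(set)  # vertex -> partitions holding a copy of it
    loads = [0] * num_parts      # number of edges stored per partition
    assignment = {}              # edge -> chosen partition

    for u, v in edges:
        lightest = min(loads)
        # Partitions that may still accept an edge without breaking balance.
        allowed = {p for p in range(num_parts) if loads[p] <= lightest + slack}
        common = replicas[u] & replicas[v] & allowed
        either = (replicas[u] | replicas[v]) & allowed
        candidates = common or either or allowed
        # Break ties by load, then by partition index, for determinism.
        target = min(candidates, key=lambda p: (loads[p], p))
        assignment[(u, v)] = target
        replicas[u].add(target)
        replicas[v].add(target)
        loads[target] += 1

    return assignment, replicas, loads


def replication_factor(replicas):
    """Average number of partitions on which a vertex is replicated."""
    return sum(len(parts) for parts in replicas.values()) / len(replicas)


if __name__ == "__main__":
    stream = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
    assignment, replicas, loads = greedy_stream_partition(stream, num_parts=2)
    print(assignment, loads, replication_factor(replicas))
```

A lower replication factor means fewer vertices are copied across partitions, which is one common way to quantify the data locality mentioned above; the `slack` parameter trades that locality against load balance.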