Datix: A System for Scalable Network Analytics

The ever-increasing Internet traffic poses challenges to network operators and administrators, who have to analyze large network datasets in a timely manner to make decisions regarding network routing, dimensioning, accountability and security. Network datasets collected at large networks such as Internet Service Providers (ISPs) or Internet Exchange Points (IXPs) can be on the order of terabytes per hour. Unfortunately, most current network analysis approaches are ad hoc and centralized, and thus not scalable. In this paper, we present Datix, a fully decentralized, open-source analytics system for network traffic data that relies on smart partitioning storage schemes to support fast join algorithms and efficient execution of filtering queries. We outline the architecture and design of Datix and present its evaluation using real traces from an operational IXP. Datix addresses an important problem at the intersection of data management and network monitoring while utilizing state-of-the-art distributed processing engines. In brief, Datix efficiently answers queries within minutes, compared to more than 24 hours of processing when executing existing Python-based code in single-node setups. Datix also achieves a speedup of nearly 70% compared to baseline query implementations on popular big data analytics engines such as Hive and Shark.

Public Review By:

Marco Mellia

Public Review for Datix: A System for Scalable Network Analytics

Dimitrios Sarlis, Nikolaos Papailiou, Ioannis Konstantinou, Georgios Smaragdakis, and Nectarios Koziris

Big Data is a hot topic, and the Internet is one of the few sources from which large amounts of data can be collected. It is not surprising, then, to see researchers trying to exploit Big Data techniques to analyze Internet data. This work goes in this direction and applies Big Data methodologies to network monitoring and management. The authors propose Datix, a fully decentralized network traffic analytics engine for querying very large datasets using existing MapReduce infrastructures. The key contribution is the ability to perform efficient distributed joins between network traffic data (in this case, sFlow packet samples) and metadata about fields in that data (e.g., IP to AS number mappings), a key primitive operation in many network traffic analysis studies. The data model is a star schema with a large log table and smaller dimension tables, which are partitioned by key at load time. At runtime, each query is mapped to the partitions that contain the relevant data, and the resulting query is passed to Shark or Hive for execution. The result is a fast and scalable system that is particularly well suited to the analysis of network management traces.

Reviewers found this paper to be interesting and well motivated, even if incremental. Despite the limited novelty of the proposed work, reviewers found Datix to be an important contribution that allows existing infrastructure to be applied to very common network measurement tasks, for which MapReduce is somewhat underutilized in practice. In addition, Datix is open source and available on GitHub.
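To make the partition-then-join idea described above concrete, the following is a minimal single-process sketch in Python, not Datix's actual implementation. It assumes that both the log table and one dimension table are range-partitioned on the same integer key at load time, and that a range filter on that key is first mapped to the partitions it can touch before any join work is done. The boundary values, the field names (`src_ip`, `ip`, `asn`), and the helper functions are all hypothetical, and the exact-match IP lookup is a simplification of real IP-to-AS mapping.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical partition boundaries on the join key (an IP address as an int).
# Partition 0 holds keys < 1000, partition 1 holds [1000, 2000), and so on.
BOUNDARIES = [1000, 2000, 3000]


def partition_of(key):
    """Return the id of the range partition that holds this key."""
    return bisect_right(BOUNDARIES, key)


def load(rows, key_field):
    """Split rows into partitions by key, as would happen at load time."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key_field])].append(row)
    return parts


def relevant_partitions(lo, hi):
    """Partition ids that can contain keys in the inclusive range [lo, hi]."""
    return range(partition_of(lo), partition_of(hi) + 1)


def filtered_join(log_parts, dim_parts, lo, hi):
    """Join log records with the dimension table, reading only the
    partitions that the filter lo <= src_ip <= hi can touch."""
    out = []
    for pid in relevant_partitions(lo, hi):
        # Both tables share the partitioning, so the join is partition-local.
        lookup = {d["ip"]: d["asn"] for d in dim_parts.get(pid, [])}
        for rec in log_parts.get(pid, []):
            if lo <= rec["src_ip"] <= hi and rec["src_ip"] in lookup:
                out.append({**rec, "src_asn": lookup[rec["src_ip"]]})
    return out


# Illustrative data only: four sampled packets and a toy IP-to-AS table.
log_rows = [
    {"src_ip": 500, "bytes": 10},
    {"src_ip": 1500, "bytes": 20},
    {"src_ip": 2500, "bytes": 30},
    {"src_ip": 3500, "bytes": 40},
]
dim_rows = [
    {"ip": 500, "asn": 64500},
    {"ip": 1500, "asn": 64501},
    {"ip": 2500, "asn": 64502},
    {"ip": 3500, "asn": 64503},
]

log_parts = load(log_rows, "src_ip")
dim_parts = load(dim_rows, "ip")
result = filtered_join(log_parts, dim_parts, 1200, 2600)
print(result)
```

In this sketch the filter touches only two of the four partitions, and each partition of the log table is joined against the matching partition of the dimension table alone. In a distributed setting, the same pruning would reduce the data handed to an engine such as Hive or Shark.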