Month: January, 2016

Congrats to all the Apache Spark contributors out there! I just read Databricks’ Spark 2015 Year In Review. What a change Spark has gone through this last year. I started learning Spark at the beginning of the year, and about a year on the APIs have changed a great deal. The biggest change from my perspective looks like Machine Learning Pipelines and the creation of the ml package over mllib.

In my efforts to design and implement a Kalman filter using Apache Spark, I’ve had to explore some of the design limits of basic distributed computing. At first, without thinking, I attempted to convert every operation on a collection to its equivalent “parallelized” operation. It quickly became apparent to me that the differential equation solver needed for time series filtering would not translate in this way. The simple numerical integration of a data set or function is easily distributed, because each sub-interval can be summed independently. However, the propagation of an initial state through a series of transformations is not. Each step could be processed by a separate node, but the steps must all be completed in sequence, each taking the result of the previous computation as input.
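To make the contrast concrete, here is a minimal sketch in plain Python (not Spark, and not the filter I was actually building). All of the names (`trapezoid_chunk`, `integrate_in_chunks`, `kalman_step`, `kalman_filter`) and the noise parameters `q` and `r` are made up for illustration, and the filter assumes a scalar, constant-state model. The integration splits into chunks that never look at each other, while the filter is a fold in which every step needs the state produced by the one before it.

```python
from functools import reduce


def trapezoid_chunk(f, a, b, n):
    """Trapezoid-rule integral of f over [a, b] using n equal intervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def integrate_in_chunks(f, a, b, chunks, n_per_chunk):
    """Split [a, b] into independent sub-intervals and add the partial results.

    Each chunk depends only on its own bounds, so the map step is the part
    that could be handed to separate workers.
    """
    width = (b - a) / chunks
    bounds = [(a + k * width, a + (k + 1) * width) for k in range(chunks)]
    partials = map(lambda ab: trapezoid_chunk(f, ab[0], ab[1], n_per_chunk), bounds)
    return sum(partials)


def kalman_step(state, z, q=1e-4, r=0.25):
    """One predict/update step of a scalar Kalman filter (hypothetical q, r)."""
    x, p = state
    p = p + q              # predict: state unchanged, uncertainty grows
    k = p / (p + r)        # Kalman gain
    x = x + k * (z - x)    # correct the estimate with the measurement residual
    p = (1.0 - k) * p      # uncertainty shrinks after the update
    return (x, p)


def kalman_filter(measurements, x0=0.0, p0=1.0):
    """Fold the measurements through kalman_step, strictly in order."""
    return reduce(kalman_step, measurements, (x0, p0))
```

The `map` in `integrate_in_chunks` is the piece that parallelizes naturally, because the partial sums can be computed in any order. The `reduce` in `kalman_filter` cannot be reordered the same way, since `kalman_step` is not an associative combination of two measurements: it needs the running `(x, p)` state on the left at every step.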

Having run into this wall for the first time, I now can’t stop thinking about designing across the boundary of distributed vs. local computing. As the velocity and veracity of time series data fluctuate, tension can mount between the potential gains of farming out computation and the need to integrate and filter data for use locally. Distribution and networking offer an arbitrarily large brain that is often just out of reach given an aggressive local need for real-time processed data and decision making. Decision makers in a complex data flow must integrate system-level information: not just latency, available resources, priorities, etc., but also the characteristics of the work needing to be done.