In my efforts to design and implement a Kalman filter using Apache Spark, I’ve had to explore some of the design limits of basic distributed computing. At first, without thinking, I attempted to convert every operation on collections to its “parallelized” equivalent. It quickly became apparent that the differential equation solver needed for time series filtering would not translate this way. Simple numerical integration of a data set or function is easily distributed. The propagation of an initial state through a series of transformations is not: each step could be processed by a separate node, but the steps must be completed in sequence, with each taking the result of the previous computation as input.
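The contrast can be sketched in plain Python standing in for Spark's collection operations (the `step` function here is a hypothetical stand-in for a real filter update, not actual Kalman math):

```python
from functools import reduce

samples = [0.1 * i for i in range(100)]

# Integration (a sum of independent terms) parallelizes naturally:
# each chunk could be summed on a separate node, then combined.
chunks = [samples[i:i + 25] for i in range(0, 100, 25)]
partial_sums = [sum(chunk) for chunk in chunks]  # embarrassingly parallel
total = sum(partial_sums)                        # cheap final combine

# State propagation does not: each step needs the previous result,
# so the whole computation is a strict left fold over the data.
def step(state, measurement):
    # toy update rule; a real Kalman predict/update would go here
    return 0.9 * state + 0.1 * measurement

final_state = reduce(step, samples, 0.0)
```

The first computation is associative, so the chunks can be combined in any order on any node; the second is an ordered chain, and handing each link to a different node buys nothing, since no node can start until its predecessor finishes.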
Since running into this wall for the first time, I can’t stop thinking about designing across the boundary of distributed vs. local computing. As the velocity and veracity of time series data fluctuate, tension mounts between the potential gains of farming out computation and the need to integrate and filter data for use locally. Distribution and networking offer an arbitrarily large brain that is often just out of reach, given an aggressive local need for real-time processed data and decision making. Decision makers in a complex data flow must integrate system-level information: not just latency, available resources, and priorities, but also the characteristics of the work needing to be done.