Tuesday, October 18, 2011

Stream oriented and parallel programming

Good software design comes down to applying simple patterns; for many years now my simple pattern has been stream oriented programming and, in functional programming, monads. In this post, I will introduce the concept of stream oriented programming by showing the value it brings to parallel programming.
Parallel programming design has three basic patterns: sequential, parallel, and hierarchical execution.

The first two are classic: operations happen sequentially along the execution flow, or operations happen in parallel along synchronized or unsynchronized execution flows. One way to present the third pattern is through hierarchical state machines (HSMs), where each state can itself be another HSM, recursively to any depth.

Hierarchical patterns of execution are important because they express higher order properties of the solution space. Sequential and parallel patterns are good: they get work done, even if they make things more complex; but they do not “shape” the specific nature of an algorithm; hierarchical patterns do. A good way to introduce the notion of stream oriented programming is to describe the same example hierarchical system in two different ways: state machine oriented and stream oriented:

The “hierarchical state machine” oriented way: the system is a machine, running an application, with a socket connection that transports packets of messages.

The “stream oriented” way: there is a machine; it starts up; it starts an application, which opens a socket, which sends a packet that contains multiple messages. Later it sends more packets with more messages; the socket closes down, then starts up again, and later closes down again with the termination of the application. The application is restarted, the socket starts up, and so on. Later still, the machine is rebooted, and the pattern continues.

Note how the hierarchical stream pattern is built from many sequential patterns used at multiple levels: the messages are sent sequentially, and so are the packets; the application is run multiple times; the machine is restarted time after time. The messages, packets, application, etc. can also all be seen as happening in parallel. Yet looking at the system as separate sequential and parallel flows is not good enough: from a state machine perspective it would be like treating each element of the system as a separate state machine, which they are not. What is really happening is that the system is one big sequence, and we choose to see this sequence as a structure of hierarchical levels. It is exactly this notion of a sequence “over a hierarchy” that the hierarchical execution pattern captures.

Most classical programming languages are built around the state of data, while “execution” state is implicit and ideally hidden. Indeed, one of the first efforts in program development was to “hide” the program counter, first with subroutines and then with structured programming. As a side note, in many ways what separates software from hardware is this notion of “execution state”: hardware makes execution state explicit.

Execution state is central to the notion of shared access to resources: a locked resource really defines a specific exclusion rule over the execution state. SQL and its variants extend the notion of implicitly managing these exclusion rules for shared resources. Multi-threaded programming is supported by various locks and basic thread operations. Yet execution state is still hidden, and for good reason: anybody who has attempted to use thread suspension as part of their design knows how impossibly difficult it is to reason about certain execution states.

Unfortunately, the “hidden” nature of execution state also makes it very hard to structure and scale the execution of multi-threaded and parallel programs. So while a good architect may map certain computations to thread resources, it is hard to do this in a systematic way; and even then, it is hard to foresee how well a parallel design will perform. The solution taken by GPU processors has been to impose a very simple execution model on the developer. The execution state is still implicit, but with GPUs it has an imposed structure: the execution states of many threads advance together in lock-step. Languages like Erlang have taken a similar approach, but they too impose strong restrictions on the execution model, limiting their applicability.

An attentive developer will have noted the steady but sure growth of stream oriented and functional style programming as a “solution” to the difficulties of multi-threaded and parallel programming. Functional programming has been around for fifty years, yet until recently it would never have been recommended for efficient parallel execution, so clearly there is something more than “just” functional programming that makes it a parallel complexity slayer. The answer lies in explicitly managing the state of execution! This deserves a little explaining.

One way to define a stream is as follows:

Is repeatedly triggered by an input

Performs I/O and computation sequentially

Holds local readable and writable state

Runs within a constant, read-only environment

Drives an output

(Except for the first item and the notion of synchronicity, all of the above are optional.)

Now here is the key insight: a stream has an overall state of execution! To be more precise, each triggered execution on a stream event has its own execution state, and so does the overall, “bigger” life cycle of the stream. This is just like a socket, which can be opened or closed, and has individual packets running over it.

The functional style programmer will have noted that the notion of a stream is not far from the notion of a monad. Indeed, monadic programming shares very much this ability to have two execution states: the state seen within each individual monadic expression, and the whole execution state when “running” the monad. It is this property that allows monads to be a basic building block for parallel programming.

Looking again at an overall system: at any given time, a stream oriented system can be seen as a hierarchy of stream flows, and each of these streams is associated with an execution state. The key insight is that an optimal use of the parallel resources of a system is found by optimally assigning these execution states to those parallel resources.

You might ask: “Why was it necessary to introduce streams? Can we not write parallel systems with only hierarchical state machines?” As we have seen in the system description above, HSMs and hierarchical stream patterns are really very similar. In fact, the stream pattern can be seen as the edges of the transition graph of the state machine (a form of data differentiation of the hierarchical state machine pattern). A graph usually has many edges, therefore by assembling many stream oriented patterns we can build equivalent HSMs. We now know that stream patterns have two levels of execution state, that good parallel design is about smartly assigning execution state to parallel resources, and finally that by combining many stream patterns we can build hierarchical state machines. The result is that stream oriented programming is more modular and flexible than direct HSM style programming:

Each stream's execution state can be assigned locally,

The compositional property of streams allows them to adapt both to algorithms and to parallel topologies,

The dynamic nature of their assembly allows them to be reconfigured quickly.

2 comments:

Łukasz said...

Hi James, I've read your post a few times but am still not sure if I understand. I've been scribbling an application to get myself more used to parallel programming, and I am trying to fit your comments into the problem I ran into. As far as I understand, the 'stream' picture reflects well a situation where events triggering messages appear in real time, and the parallel application is supposed to respond more or less synchronously to them. But this is not always the case: suppose you have one thread doing all input-output including file access (I figured that otherwise Qt or the OS does funny things), and another owning the data structure that is supposed to accommodate what is serialized in a file, which can be huge. As far as I understand, one can do either of two things: flood the worker thread with all the data chunks up-front by pushing messages from the IO thread (which I would rather avoid), or have the worker thread pull chunks one by one as parsing is being done, which seems more reasonable.

Now, I have a problem with implementing the second idea. The scenario is the following: the worker thread receives the message with the first chunk of data, so it builds a stream object using the Boost iostreams infrastructure, and starts to parse from that stream with the usual operator>>, since this seems to be the natural way to deserialize, and deserialization should be orthogonal, i.e. know nothing about threads etc. Other messages appearing in the meantime, not relating to this operation, are supposed to wait in the queue until parsing finishes. Now comes the problem: once the first chunk of data is exhausted, the application lands in a function call to "std::streamsize InputStreamProxy::read(char* s, std::streamsize n);" that is supposed to provide data that will be parsed as soon as it returns. I see no practical way of using the thread's messaging infrastructure, i.e. the message queue, to do what is needed: send a request for more data to the IO thread and wait till it comes.

Actually, I have a dirty hack here in my application, that I don't like, but know no better. Can this two-way special communication be somehow accommodated within your big picture, or am I completely missing your point?

Regards, Łukasz

I'll follow up with a longer answer, but for the moment I can say the following:

- Your messages are parsed, processed, and trigger output; if a single thread can do the job, use a single thread! The context switch between your threads is sufficiently expensive that you really need a heavy load, or other constraints, to justify a multi-threaded pipe.

- With a linear stream (one that does not split or merge) and no complex dependency (processing does not depend on an early part of the processing chain), there is no real difference between a push and a pull approach. So here again, keep it simple and use one pattern if you can.

- C++ makes it a pain to implement streams: only very simple streams can be implemented as expressions. What you need to do is capture your stream (parser, processing, and output) with classes (one per "stream node"). You could have special "class" stream nodes to enqueue and dequeue for multithreading. As you want each step of the stream to be able to work with different types, you want all these "stream node classes" to be template parameterized with the node's return value.