I'm working on a library that lets me write operations on an input "stream" of data (I don't call them that, but it's a potentially unbounded input regardless; think data coming from a socket).

I might have one operation that, e.g., applies a frequency shift to the incoming data, and then another that applies a frequency-selective filter to that result. I'm writing in C++, so my syntax might look something like this:

input >> tune >> filter >> output;

My problem is that different operations might require an unknown number of data points to compute their output. So, e.g., tune can work with an arbitrary number of inputs at a time, but filter requires some minimum number of samples before it can produce any output.
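To make the mismatch concrete, the two operations effectively have shapes like this (hypothetical signatures, purely illustrative):

    #include <complex>
    #include <optional>
    #include <span>
    #include <vector>

    // Hypothetical signatures, only to illustrate the mismatch:
    // tune can map any number of input samples 1:1...
    std::vector<std::complex<float>> tune(std::span<const std::complex<float>> in);

    // ...but filter may return nothing until it has accumulated enough samples.
    std::optional<std::vector<std::complex<float>>>
    filter(std::span<const std::complex<float>> in);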

The easiest answer is to run each filter in a thread and connect them with some sort of thread-safe pipe or equivalent. If possible, though, I'd like to avoid threading.

Is anyone aware of an alternative pattern or research on composing streaming/batch operations on a stream without resorting to threads and blocking I/O?

I should have clarified: I want to do this in-process.
– gct, Feb 4 '19 at 14:32

The big question here is: is there any reason why you have all these arbitrary self-imposed restrictions, like no multi-threading and no multi-processing? Writing a single-threaded pipeline is possible with coroutines or event-based programming, but those are fairly tricky to get right. It's often a lot easier to let the system multiplex the threads in a pipeline. Can you elaborate why (you think) you have to do this single-threaded?
– Lie Ryan, Feb 4 '19 at 15:06
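As a sketch of the coroutine approach mentioned above, assuming C++23 std::generator is available (the stage bodies are placeholders, not real DSP):

    #include <cstddef>
    #include <cstdio>
    #include <generator>  // C++23
    #include <vector>

    // Single-threaded coroutine pipeline sketch: each stage pulls from the
    // previous one, suspending between samples instead of blocking a thread.
    std::generator<float> source() {
        for (float t = 0.0f; t < 8.0f; t += 1.0f)
            co_yield t;  // stand-in for samples arriving from a socket
    }

    std::generator<float> tune(std::generator<float> in) {
        for (float s : in)
            co_yield s * 2.0f;  // placeholder for a frequency shift
    }

    std::generator<std::vector<float>> filter(std::generator<float> in,
                                              std::size_t n) {
        std::vector<float> block;
        for (float s : in) {
            block.push_back(s);
            if (block.size() == n) {  // only emit once enough samples buffered
                co_yield block;
                block.clear();
            }
        }
    }

    int main() {
        for (const auto& block : filter(tune(source()), 4))
            std::printf("block of %zu samples\n", block.size());
    }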

I'll be doing multi-process stuff at a higher level; these are low-level primitives for which it doesn't make sense to incur the overhead of copying data between processes. I'd like to avoid threads because the ultimate users of this code aren't software engineers; they're mathematicians and physicists who don't necessarily understand the intricacies of thread-safe programming.
– gct, Feb 4 '19 at 15:09

Understand that with what you're asking for, you're merely trading the complexities of multi-threaded programming for the complexities of cooperative multitasking. Personally, I find that in general the subtleties of multi-threading are easier for non-programmers to work with than the subtleties of cooperative multitasking, especially if you're using the pipeline paradigm already.
– Lie Ryan, Feb 4 '19 at 15:18

2 Answers

Streaming frameworks (such as the built-in streams in Java 8) generally work by setting up a pipeline structure where each node can pull data from the previous node. This requires each node to support a method that can produce a single data point, as well as a reference to the previous node.

For example (pseudo-code):

tune.produce() {
    // call previous.produce() once
    // return the processed output
}

filter.produce() {
    // call previous.produce() as many times as necessary
    // return the next data point of processed output
    // you might keep some state about the last N points
}

When you start consuming the output of the pipeline, it begins a kind of chain reaction that bubbles all the way to the input, and each time you call produce() on the output, it will consume just enough input data points to produce one output data point.
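In C++, a minimal sketch of this pull model might look like the following (all names are illustrative, and the processing bodies are toy stand-ins):

    #include <cstdio>
    #include <vector>

    // Pull-based pipeline sketch: each node holds a pointer to its upstream
    // node and pulls exactly as much input as it needs per output point.
    struct Node {
        Node* previous = nullptr;
        virtual float produce() = 0;  // pull one output data point
        virtual ~Node() = default;
    };

    struct Source : Node {
        float t = 0.0f;
        float produce() override { return t += 1.0f; }  // stand-in for input
    };

    struct Tune : Node {
        float produce() override {
            // consumes exactly one input point per output point
            return previous->produce() * 2.0f;  // placeholder frequency shift
        }
    };

    struct Filter : Node {
        std::vector<float> history;  // state about the last N points
        float produce() override {
            // pull as many inputs as needed for a full 4-point window
            while (history.size() < 4)
                history.push_back(previous->produce());
            float sum = 0.0f;
            for (float v : history) sum += v;  // toy moving average stand-in
            history.erase(history.begin());    // slide the window by one
            return sum / 4.0f;
        }
    };

    int main() {
        Source src;
        Tune tune;    tune.previous = &src;
        Filter filt;  filt.previous = &tune;
        for (int i = 0; i < 8; ++i)
            std::printf("%f\n", filt.produce());  // each call pulls the chain
    }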

Edit: If performance is a concern, then you can batch samples as Erik Eidt mentioned, so each data point can be a list of samples. Then your nodes might look like this:
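tune.produce() {
    // (sketch: produce() now hands over a whole batch of samples)
    // call previous.produce() once to get a batch
    // return the processed batch
}

filter.produce() {
    // call previous.produce() until enough samples have been buffered
    // return the next batch of processed output
    // keep any leftover samples around for the next call
}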

My problem with that is I generally can't produce a single output at a time. The canonical example would be computing a convolution, which you'd use an FFT to do. You have to have N samples before you can compute the M samples of output. The overhead of going a single sample at a time would almost certainly be too high as well.
– gct, Feb 2 '19 at 14:38


@SeanMcAllister, so maybe the elements in your stream are sets or lists of samples instead of individual samples.
– Erik Eidt, Feb 2 '19 at 16:08

Maybe. I'm trying to think how I could make this work; I think if my buffers between stages were large enough, I could do it...
– gct, Feb 2 '19 at 16:11

Appreciate that. I think I need an intermediate buffer because, e.g., one primitive might need to operate on 2000 samples, but the previous one only produces 1024 at a time because of, e.g., FFT sizes, so someone has to buffer them.
– gct, Feb 2 '19 at 19:11
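To make that buffering idea concrete, an intermediate stage could be shaped something like this sketch (illustrative only, not an existing API): it accepts whatever chunk sizes the upstream produces and releases fixed-size blocks once enough samples have accumulated.

    #include <cstddef>
    #include <deque>
    #include <optional>
    #include <vector>

    // Illustrative rebuffering stage: upstream pushes chunks of any size
    // (e.g. 1024), downstream pops fixed-size blocks (e.g. 2000) once enough
    // samples are queued.
    class Rebuffer {
    public:
        explicit Rebuffer(std::size_t block) : block_(block) {}

        void push(const std::vector<float>& chunk) {
            queue_.insert(queue_.end(), chunk.begin(), chunk.end());
        }

        std::optional<std::vector<float>> pop() {
            if (queue_.size() < block_) return std::nullopt;  // not enough yet
            std::vector<float> out(queue_.begin(), queue_.begin() + block_);
            queue_.erase(queue_.begin(), queue_.begin() + block_);
            return out;
        }

    private:
        std::size_t block_;
        std::deque<float> queue_;
    };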

The general approach that I favor for this type of work is called Functional Reactive Programming (FRP), or just Reactive Programming. A program is built up from observables (things to observe, such as data or events), using operators to compose functions together into an observer chain.

A number of libraries implement reactive extensions for various languages. For instance, RxJava is an extension library for Java. Many of these libraries are modelled after the Reactive Extensions for CLR languages.

The composition operators can process streams of data. More importantly, they can also handle processing across thread boundaries, allowing an observer chain to move data from one thread to another as processing requires it. This, in my mind, is the biggest advantage of the extension libraries: they allow you to connect smaller functions into larger ones in a thread-safe manner.
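As a rough sketch of what this can look like in C++ with RxCpp (the lambdas are placeholders for real processing):

    #include "rxcpp/rx.hpp"  // https://github.com/ReactiveX/RxCpp
    #include <cstdio>
    #include <vector>

    int main() {
        // Sketch only: map() stands in for tune, and buffer() batches samples
        // so a filter-like stage can see a whole block at once.
        auto samples = rxcpp::observable<>::range(1, 8);
        samples
            .map([](int s) { return s * 2; })  // placeholder "tune"
            .buffer(4)                         // collect 4 samples per batch
            .subscribe([](const std::vector<int>& batch) {
                std::printf("got a batch of %zu samples\n", batch.size());
            });
    }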

In my head, FRP is equivalent to building up a dataflow graph, but that still leaves open the question of how you actually execute the graph.
– gct, Feb 4 '19 at 15:04

The FRP framework implements a subscription mechanism that starts the observer chain running. In RxJava it is called subscribe(). Some C++ reactive frameworks use a similar name; see github.com/ReactiveX/RxCpp, for instance.
– BobDalgleish, Feb 4 '19 at 16:28