Consider this problem: We have a hypothetical startup that, like so many other unimaginative clones of each other, provides some marginal benefit in exchange for tracking user locations. We want to mine that location data.

For the purposes of this brief blog post, we might have a file that looks like this:

The first column is a pseudo-anonymous hash identifying a user. The second is a pseudo-anonymous hash representing a location. If we eyeball the first 14 lines, we can see that user 1a2ddc2 visited 5f2b932, bd11537, 5890595, 5f2b932, bd11537, then 5890595. Meanwhile, user f1a543f visited 5890595, 5f2b932, bd11537, 5890595, 5f2b932, bd11537, and then 5890595. And so forth.

Let’s say we’re interested in learning where people tend to go. We are looking for the most popular transitions. So given that user 1a2ddc2 visited 5f2b932, bd11537, 5890595, 5f2b932, bd11537, then 5890595, we count the transitions as:

5f2b932 -> bd11537

bd11537 -> 5890595

5890595 -> 5f2b932

5f2b932 -> bd11537

bd11537 -> 5890595

Notice that we have to track the locations by user in order to get the correct transitions. Next, we’re interested in the most popular transitions, so we’ll count them:

5f2b932 -> bd11537 appears twice

bd11537 -> 5890595 also appears twice

5890595 -> 5f2b932 only appears once

Now all we have to do is count all the transitions across all users, and report the most popular transition. We’ll look at three different approaches:

Now we have each step in our process consisting of applying a single function to the return value of the previous function application. But let’s take the next step. We have a mapping from users to their transitions, but we don’t care about the users, just the transitions, so let’s fold them back together:

Now we want to count the occurrences of each transition. We’ll reduce our new list to a pairing between the highest count and a list of transitions that match. To facilitate this, we’ll turn the arrays for each transition into a string:1

The very nice thing is that we have decomposed our solution into a simple pipe that takes some data in at one end, and performs a succession of transformations on it, until what emerges at the other end is the result we want.

Each step can be easily checked and tested, and each step as a well-understood and explicit input, followed by an explicit and well-understood output. There are no side-effects to confuse our reasoning.

But there is a dark side, of course. If we care very deeply about memory, at each step but the last, we construct a data structure of roughly equal size to the input data.

We would use much less data if we wrote a single fold that had a lot of internal moving parts, but only iterated over the data in one pass. Let’s try it.

The single pass approach

In production systems, memory and performance can matter greatly, especially for an algorithm that may be analyzing data at scale. We can transform our “staged” solution into a single pass with a bit of care.

What about obtaining transitions from the locations for each user? Strictly speaking, we don’t have to worry about slicing the list if we know that the current set of locations has at least two elements. So we’ll just take a transition for granted, then we’ll discard the oldest location we’ve seen for this user, as it can no longer figure in any future transitions:3

Folding the transitions per user back into one stream would be sheer simplicity, but we can actually skip it since we have the transition we care about. What’s the next step that matters? Getting a string from the transition:

We get the same solution, but with a single pass through the data and requiring space proportional to the number of users, not a multiple of the size of the data. But note that although the code now looks somewhat different, it actually does the exact same steps as the staged approach, in the same order.

That’s because we wrote (and debugged!) the pipeline, and then refactored it to a single pass. We did all of the hard reasoning while working with the easier-to-reason-about and factor code, before we wrote the everything-entangled code.

Obviously, there’s a trade-off involved. Maximum readability and easiest to reason about? Or performance? Or is it obvious?

What if we could have it both ways?

The stream approach

Our staged approach sets up a pipeline of functions, each of which has a well-defined input and a well-defined output:

This is an excellent model of computation, it’s decomposed nicely, it’s easy to test, it’s easy to reuse the components, and we get names for things that matter. The drawback is that the inputs and outputs of each function are bundles of data the size of the entire input data.

If this were a car factory, we would have an assembly line, but instead of making one frame at a time in the first stage, then adding one engine at a time in the second stage, and so on, this pipeline makes frames for all the cars at the first stage before passing the frames to have all the engines added at the second, and so forth.

Terrible!

Ideally, an automobile factory passes the cars along one at a time, so that at each station, inputs are arriving continuously and outputs are being passed to the next station continuously. We can do the same thing in JavaScript, but instead of working with lists, we work with iterables.

So instead of starting with a massive string that we split into lines, we would start with an iterator over the lines in the log. This could be a library function that reads a physical file a line at a time, or it could be a series of log lines arriving asynchronously from a service that monitors our servers. For testing purposes, we’ll take our string and wrap it in a little function that returns an iterable over its lines, but won’t let us treat it like a list:

asStream has no functional purpose, it exists merely to constrain us to work with a stream of values rather than with lists.

With this in hand, we can follow the same general path that we did with writing a one pass algorithm: We go through our existing staged approach and rewrite each step. Only instead of combining them all into one function, we’ll turn them from ordinary functions into generators, functions that generate streams of values. Let’s get cracking!

Our original staged approach mapped its inputs several times. We can’t call .map on an iterable, so let’s write a convenience function to do it for us:

Now we’re ready for something interesting. Our original code performed a reduce, folding a list into a map from users to locations. We are working with a stream, of course, and we absolutely do not want to reduce all the elements of the stream to a single object.

collating our locations

Consider the metaphor of the assembly line. Log lines enter at the beginning, and are converted into arrays by datumizeStream. Instead of bundling everything up into a box, we want to process the lines, but we need to collate the items so we can process them in order for each user.

One way to do this while processing one line at a time is to create a series of parallel streams, one per user. We direct each line into the appropriate stream and do some processing on it. We then merge the outputs back into a single stream for more processing.

If we stop and think about it, this is what we actually wanted to do when we created a map to begin with. We just need to code that intention directly. So we will write a function that takes a stream and divides it (metaphorically) into multiple streams according to a function that takes each value and returns a string key.

The key function is simplicity itself:

constuserKey=([user,_])=>user;

We plan to will apply this to each value as it comes in, and streams will be created for each distinct key. Then, a transforming function will be applied to each stream. Our mapping functions so far were stateless, and mapped one value to another. But we’re going to do both of these things differently. Our transforming functions will have state, and they will map each value into a list of zero or one value, which will then be merged to form our resulting stream.

Our function looks a lot like the code we wrote for extracting transitions from our single pass solution, only we don’t keep the locations per user in a map, and we either return a transition in a list, or an empty list:

This function take a location at a time, and returns either an empty list or a transition in a list. We can use it to iterate over locations one by one, and get transitions. Which is exactly what we’re going to do. Mind you, it isn’t quite ready, because while it does maintain state (in the locations variable), we will need a different state for each user. In order to have as many of these as we like, we’ll wrap the whole thing in a function:

Now we can call transitionsMaker for each user, and get a function that can map the locations for that user into transitions.

Armed with a function for turning a user and location into a key, and transitionsMaker, we can write our collating function. It takes a function that makes a stateful mapping function and a function that extracts keys from values as arguments, and returns a function that transforms a stream of values:

Why is sortedFlatMap called a “flat map?” A function that maps a value to zero or more values is called a flat map. There’s actually more to this idea if we dive into functional programming a little more deeply, we can think of putting values in lists as “wrapping” them, and if we have an operation that takes a value and then returns a wrapped value, “flat map” is a function that performs the operation on a value and unwraps the result.

In our case, we take values and map them to zero or one transition, which we represent with an empty list or a list with a transition. sortedFlatMap “flattens” or “unwraps” these lists using yield *, which yields the contents of an iterable, in our case, a list with zero or one element.

Continuing our practise of writing our “stream” solution with the same steps as our “pipeline” solution, we transform the transitions into strings we can use as keys:

To recap what we have accomplished: We are processing the data step by step, just like our original staged approach, but we are also handling the locations one by one without processing the entire data set in each step, just like our single pass approach.

We have harvested the best parts of each approach.

Now, it’s true that we have does a bunch of things that people call “functional programming,” but that wasn’t the goal. The goal, the benefit we can inspect, is that we have decomposed the algorithm into a series of steps, each of which has well-defined inputs and outputs. And, we have arranged our code such that we are not making copies of the entire data set with each of our steps.

The end goal, as always, is to decompose the algorithm into smaller parts that can be named, tested, and perhaps reused elsewhere. Using iterables and generators to implement a stream approach can help us achieve our goals without compromising practical considerations like memory footprint.

notes

It would be nice if JavaScript gave us a Deep JSON Equality function, but it doesn’t. We could go down a rabbit-hole of writing our own comparison functions and maps and what-not, but it’s simpler to convert the transitions to strings before counting them. That’s because JavaScript acts as if strings are canonicalized, so they make great keys for objects and maps. ↩

Note that if we are reading the file from disc, we can actually iterate over the lines directly, instead of calling .split('\n') on the contents. ↩

We could also tidy up some extra variable references, but we’re trying to make this code map somewhat reasonably to our staged approach, and the extra names make it more obvious. Compared to the overhead of making multiple copies of the data, the extra work for these is negligible. ↩