Explaining Babbage

ReadyForZero recently released a library
called Babbage, and I thought I'd take
a few minutes to describe the problem it solves and how it
solves it.

We use Clojure at ReadyForZero, and one of the great things about it is
the ability to explore data at the REPL (I talk about one reason
why here). At
ReadyForZero we collect a lot of data about how people use the site
and track how it's working, and for whom. Clojure has powerful
sequence manipulation operators, and the tools for computing
statistics and accumulating over sequences are right at hand. For
example, suppose we are curious about visitors to our website; we
might execute the following in our REPL:
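(The original inline snippet is missing from this copy of the post; here is a plain-Clojure sketch of the sort of REPL session meant, assuming a seq of flat visitor maps with a :spend key.)

```clojure
;; visitors: a sequence of flat maps, one per visitor (sample data)
(def visitors [{:browser :chrome  :spend 25}
               {:browser :firefox :spend 10}
               {:browser :chrome  :spend 0}])

;; average spend across all visitors
(let [spends (map :spend visitors)]
  (double (/ (reduce + spends) (count spends))))
;; => 11.666666666666666
```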
But now, if you want to break down average spend and compare it
across different groups, you might find yourself writing the same thing again for each set:
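(Again, the original inline snippet is gone; a plain-Clojure reconstruction of the repetition being described, assuming the same visitor maps as before.)

```clojure
;; the same computation, copied once per group
(let [spends (map :spend (filter #(= :chrome (:browser %)) visitors))]
  (double (/ (reduce + spends) (count spends))))

(let [spends (map :spend (filter #(= :firefox (:browser %)) visitors))]
  (double (/ (reduce + spends) (count spends))))

;; ...and again for every other browser we care about
```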
Having to write out all the code for each set sucks! A few of the shortcomings:

Unscalable - as you add more groups
that you want to look at, it becomes unwieldy to write the same form
over and over

Inefficient - each seq is being processed repeatedly

Verbose - it's hard to distinguish different lines from each other because there is so much boilerplate

We thought about how to formalize this process, and like all good
reductions, ended up breaking it down into three steps.

The Babbage Model

Babbage seeks to abstract the process of collecting and comparing statistics into 3 steps:

Create a list of records (maps) that contain the relevant inputs to your accumulators and set predicates.

Partition the records into subsets that you want to consider. These can overlap, so one record can be in multiple sets.

Aggregate fields of interest. Specify both which raw values to extract from each record and which aggregations you want (e.g. mean, sum, histogram).

Let's take these in turn with the example above.

Creating the input

Clojure is really well suited to working with maps and sequences, so
it's a good idea to start any "flow" or manipulation with a
sequence of flat (as opposed to deeply structured) maps. Building up
the required sequence often requires several function calls. For
this, Babbage provides a mechanism to declare the dependencies of
functions [1]. Here is a simple example:
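(The inline example didn't survive; here's a sketch in its spirit, assuming the defgraphfn/run-graph names from Babbage's README — worth double-checking the exact signatures there.)

```clojure
(require '[babbage.graph :refer [defgraphfn run-graph]])

;; a graphfn's argument names declare its dependencies;
;; its own name is the key it produces in the result map
(defgraphfn spends [raw-visitors]
  (map :spend raw-visitors))

(defgraphfn browser [raw-visitors]
  (frequencies (map :browser raw-visitors)))

(defgraphfn visitors [raw-visitors spends]
  (map #(assoc %1 :spend %2) raw-visitors spends))

;; seed the graph with the raw data; run-graph fills in the rest
(run-graph {:raw-visitors [{:browser :chrome :spend 25}
                           {:browser :firefox :spend 10}]}
           spends browser visitors)
;; => a map containing :spends, :browser, and :visitors
```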
This case is a bit contrived, since it could more easily be done in a
single pass through raw-visitors, but
defining these dependencies as graphfns has several
advantages over regular functions:

Parallelism - two functions that don't depend on each other can be executed in
parallel. In this case, spends and browser can be
executed in parallel.

Laziness - optionally, run-graph can be run in a
mode where nothing is actually computed until one of the keys in the
resulting map is dereferenced. Here's an example of that:
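(I'm not reproducing Babbage's exact lazy invocation here — see the README for the strategy-passing variant of run-graph — but the semantics are like wrapping each node in a delay:)

```clojure
;; the idea: each node is a delay; work happens only on deref
(def result
  {:spends (delay (do (println "computing spends") [25 10 0]))})

;; nothing has been printed yet; forcing the key runs the computation
@(:spends result)
;; prints "computing spends", then returns [25 10 0]
```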

Composability - you can write smaller functions that can
be composed by run-graph, avoiding computation when you don't
need it.

Structuring the input computation as a graph makes creating the input
easier, and it sets up nicely for the next step: computing aggregates
over different groupings of this sequence.

Partitioning the records into subsets

It's common to want to compare statistics over different subsets, and
as we saw above in our "spending per browser" example, computing these
by traversing a sequence each time is unscalable, inefficient, and
verbose.

With Babbage, you can define the subsets you're interested in
declaratively, by defining predicate functions that indicate
membership. For example, continuing our example, we can take the
output of the previous section and compute the "spend" for different subsets:
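(The inline snippet is missing; a reconstruction along the lines of the README — I'm assuming the stats/sets/calculate names and the sets-then-fields argument order, so verify against the README.)

```clojure
(require '[babbage.core :refer [stats calculate sets mean]])

;; sample input, as produced by the previous step
(def visitors [{:browser :chrome  :spend 25}
               {:browser :firefox :spend 10}
               {:browser :chrome  :spend 0}])

;; which field to extract, and how to aggregate it
(def fields {:spend (stats :spend mean)})

;; membership predicates define the (possibly overlapping) subsets
(def subsets (sets {:chrome  #(= :chrome (:browser %))
                    :firefox #(= :firefox (:browser %))}))

(calculate subsets fields visitors)
;; => mean spend for :chrome, :firefox, and :all, in one pass
```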

In addition to defining sets directly with predicate
functions, you can build up more complicated set compositions
declaratively using the standard set composition operators
like union and intersection. There are plenty of
examples in the README.
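For instance, something like the following (the composition operator names here, intersections and complement, are from my reading of the README; the keys generated for the derived sets may differ):

```clojure
(require '[babbage.core :as b])

(def subsets
  (-> (b/sets {:chrome      #(= :chrome (:browser %))
               :big-spender #(> (:spend % 0) 20)})
      ;; derived subsets, computed from the predicate results
      ;; without re-scanning the input sequence
      (b/intersections :chrome :big-spender)
      (b/complement :big-spender)))
```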

This makes defining different subsets efficient, since
each predicate is computed just once, and the aggregations for
different subsets happen in the same pass over the sequence. More
importantly, the definition of partitions is succinct.

Aggregating field values

We've seen above how aggregations are used, for example the "mean"
above is computed on the "spend" field. You can specify multiple
aggregation functions per field, and use any function you want to
extract the value from a
record. The README
has more.
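For example, multiple measures on one field, and a plain function as the extractor (mean and sum are the measure names I'd expect from the README, and I'm assuming the two-argument calculate that uses only the default :all set; treat both as assumptions):

```clojure
(require '[babbage.core :refer [stats calculate mean sum]])

(def visitors [{:browser :chrome  :spend 25}
               {:browser :firefox :spend 10}])

(def fields
  ;; several aggregations of the same field...
  {:spend   (stats :spend mean sum)
   ;; ...and an arbitrary extraction function instead of a keyword
   :is-big? (stats #(if (> (:spend %) 20) 1 0) sum)})

(calculate fields visitors)
;; => {:all {:spend {:mean ... :sum ...} :is-big? {:sum ...}}}
```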

Babbage defines these statistics using monoids, a simple formalism
that lends itself to parallelizing the reduction of these
statistics. If there's interest, let me know in the comments and I can
write about it and how it interacts with the upcoming Clojure reducers
library and distributed aggregation.
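To make the monoid point concrete: mean itself isn't associative, but carrying (sum, count) pairs is, so partial results can be combined in any order. (This illustrates the general idea, not Babbage's exact internals.)

```clojure
;; mean as a monoid: carry [sum count] pairs; the identity is [0 0]
(defn mean-op [[s1 c1] [s2 c2]]
  [(+ s1 s2) (+ c1 c2)])

;; two chunks reduced independently (even on different machines)
(def chunk-a (reduce mean-op [0 0] (map (fn [x] [x 1]) [25 10])))
(def chunk-b (reduce mean-op [0 0] (map (fn [x] [x 1]) [0])))

;; combining the partial results gives the overall mean
(let [[s c] (mean-op chunk-a chunk-b)]
  (double (/ s c)))
;; => 11.666666666666666
```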

Advantages of this model

I've tried to demonstrate how using Babbage breaks down the process
of accumulating statistics into 3 distinct pieces, which are
completely composable and orthogonal. This makes it faster to
develop, more efficient to run, simpler to reason about, and easier
to change.

At ReadyForZero, we've found a 3-4X development time reduction from
thinking about aggregations in this way. If you're doing the kind of
stuff at the top of this post, give Babbage a shot and let us know how
it works for you or what we can do to improve it.