Haskell, electronics et al.

Data analysis with Monoids

This post expresses the key ideas of a talk I gave at FP-SYD this week.

Monoids are a pretty simple concept in Haskell. Some years ago I learnt of them through the excellent Typeclassopedia, looked at the examples, and understood them quickly (which is more than can be said for many of the new ideas that one learns in Haskell). However, that was it. Having learnt the idea, I realised that monoids are everywhere in programming, but I’d not found much use for the Monoid typeclass abstraction itself. Recently, I’ve found they can be a useful tool for data analysis…

Monoids

First a quick recap. A monoid is a type with a binary operation, and an identity element:

class Monoid a where
    mempty :: a
    mappend :: a -> a -> a

It must satisfy a simple set of laws, specifically that the binary operation must be associative, and the identity element must actually be the identity for the given operation:
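Written out as checkable properties, the laws look roughly like this. This is a minimal sketch: the property names are my own, and it assumes a current GHC, where the binary operation is spelled `<>` (with `mappend` as a synonym).

```haskell
-- The monoid laws as Boolean properties.
-- The function names are illustrative; they are not part of any library.

-- Left identity: combining mempty on the left changes nothing.
leftIdentity :: (Eq a, Monoid a) => a -> Bool
leftIdentity x = mempty <> x == x

-- Right identity: combining mempty on the right changes nothing.
rightIdentity :: (Eq a, Monoid a) => a -> Bool
rightIdentity x = x <> mempty == x

-- Associativity: the grouping of the operation does not matter.
associativity :: (Eq a, Monoid a) => a -> a -> a -> Bool
associativity x y z = (x <> y) <> z == x <> (y <> z)
```

The compiler does not check these laws; they are obligations on whoever writes an instance.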

As is hinted by the names of the typeclass functions, lists are an obvious Monoid instance:

instance Monoid [a] where
    mempty = []
    mappend = (++)

However, many types can be Monoids. In fact, often a type can be a monoid in multiple ways. Numbers are monoids under both addition and multiplication, with 0 and 1 as their respective identity elements. In the Haskell standard libraries, rather than choosing one kind of monoid for numbers, newtype declarations are used to give instances for both:
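The two wrappers live in `Data.Monoid` as `Sum` and `Product`. A small sketch of how they behave (the names `total` and `prod` are just for illustration):

```haskell
import Data.Monoid (Sum(..), Product(..))

-- Sum combines numbers by addition; its identity element is Sum 0.
total :: Int
total = getSum (Sum 2 <> Sum 3 <> mempty)

-- Product combines numbers by multiplication; its identity element is Product 1.
prod :: Int
prod = getProduct (Product 2 <> Product 3 <> mempty)
```

Here `total` evaluates to 5 and `prod` to 6; the `mempty` in each expression resolves to the identity of the wrapper in use.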

We’ve now established and codified the common structure for a few monoids, but it’s not yet clear what it has gained us. The Sum and Product instances are unwieldy – you are unlikely to want to use Sum directly to add two numbers:
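A sketch of both the clumsy direct use and a set of small helper functions for building monoid values. The names `count`, `min`, `max`, `Count`, `Min`, `Max`, `MinEmpty` and `MaxEmpty` appear elsewhere in this post and its comments, but the definitions below are my assumptions about their shape, written for a current GHC (which wants a `Semigroup` instance alongside each `Monoid`).

```haskell
import Prelude hiding (sum, product, min, max)
import qualified Prelude as P
import Data.Monoid (Sum(..), Product(..))

-- Adding two numbers through Sum directly is noisy.
clumsy :: Int
clumsy = getSum (Sum 3 <> Sum 4)

-- Helpers of shape (a -> m): each lifts one value into a monoid.
sum :: Num a => a -> Sum a
sum = Sum

product :: Num a => a -> Product a
product = Product

-- Count ignores the value itself and contributes 1.
newtype Count = Count Int deriving (Eq, Show)
instance Semigroup Count where Count a <> Count b = Count (a + b)
instance Monoid Count where mempty = Count 0

count :: a -> Count
count _ = Count 1

-- Min and Max need an explicit "nothing seen yet" value as identity.
data Min a = MinEmpty | Min a deriving (Eq, Show)
instance Ord a => Semigroup (Min a) where
  MinEmpty <> m = m
  m <> MinEmpty = m
  Min a <> Min b = Min (P.min a b)
instance Ord a => Monoid (Min a) where mempty = MinEmpty

data Max a = MaxEmpty | Max a deriving (Eq, Show)
instance Ord a => Semigroup (Max a) where
  MaxEmpty <> m = m
  m <> MaxEmpty = m
  Max a <> Max b = Max (P.max a b)
instance Ord a => Monoid (Max a) where mempty = MaxEmpty

min :: a -> Min a
min = Min

max :: a -> Max a
max = Max
```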

These functions are trivial, but they put a consistent interface on creating monoid values. They all have a signature (a -> m) where m is some monoid. For lack of a better name, I’ll call functions with such signatures "monoid functions".

Foldable

It’s time to introduce another typeclass, Foldable. This class abstracts the classic foldr and foldl functions away from lists, making them applicable to arbitrary structures. (There’s a robust debate going on right now about the merits of replacing the list-specific fold functions in the standard prelude with the more general versions from Foldable.) Foldable is a large typeclass – here’s the key function of interest to us:
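The function in question is `foldMap`. Its signature is shown in the comment below, followed by a list-only version (`foldMapList`, a name I have made up) to illustrate what it does; this is a sketch of the behaviour, not the library's implementation.

```haskell
import Data.Monoid (Sum(..))

-- Signature of the Foldable method, as declared in Data.Foldable:
--   foldMap :: (Foldable t, Monoid m) => (a -> m) -> t a -> m

-- A list-specialised equivalent: map every element into the monoid,
-- then combine the results, starting from mempty.
foldMapList :: Monoid m => (a -> m) -> [a] -> m
foldMapList f = foldr (\x acc -> f x <> acc) mempty

-- Example use: sum a list by mapping each element into Sum.
six :: Int
six = getSum (foldMapList Sum [1, 2, 3])
```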

foldMap takes a monoid function and a Foldable structure, and reduces the structure down to a single value of the monoid. Lists are, of course, instances of Foldable, so we can demo our helper functions:
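A small demo along these lines, using the `Sum` and `Product` constructors from `Data.Monoid` directly as the monoid functions so that the snippet stands alone (the list `xs` and the result names are made up for illustration):

```haskell
import Data.Monoid (Sum(..), Product(..))

xs :: [Int]
xs = [1, 2, 3, 4]

-- Sum of the elements: 1 + 2 + 3 + 4
sumOfXs :: Int
sumOfXs = getSum (foldMap Sum xs)

-- Product of the elements: 1 * 2 * 3 * 4
productOfXs :: Int
productOfXs = getProduct (foldMap Product xs)

-- Number of elements: every element contributes Sum 1.
countOfXs :: Int
countOfXs = getSum (foldMap (const (Sum 1)) xs)
```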

It’s worth noting here that the composite computations are done in a single traversal of the input list.
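The single traversal falls out of the fact that tuples of monoids are themselves monoids. A sketch of the product combinators `a2` and `a3` mentioned later in the post; their names come from the post, but the definitions here are my assumption:

```haskell
import Data.Monoid (Sum(..), Product(..))

-- Run two monoid functions on the same input and pair the results.
a2 :: (a -> b) -> (a -> c) -> a -> (b, c)
a2 f g x = (f x, g x)

-- The same idea for three monoid functions.
a3 :: (a -> b) -> (a -> c) -> (a -> d) -> a -> (b, c, d)
a3 f g h x = (f x, g x, h x)

-- One foldMap, hence one pass over the list, computes both results,
-- because base provides Monoid instances for tuples of monoids.
sumAndProduct :: [Int] -> (Sum Int, Product Int)
sumAndProduct = foldMap (a2 Sum Product)
```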

More complex calculations

Happy with this, I decide to extend my set of basic computations with the arithmetic mean. There is a problem, however. The arithmetic mean doesn’t "fit" as a monoid – there’s no binary operation such that the mean for a combined set of data can be calculated from the means of the two subsets alone.

What to do? Well, the mean is the sum divided by the count, both of which are monoids:
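A minimal sketch of that idea: accumulate the running sum and count together, and divide only at the end. A `Mean` type is referred to in the comments on this post, but the field layout and the helper `getMean` below are my assumptions.

```haskell
-- Running sum and element count; the mean is derived from these later.
data Mean a = Mean a Int deriving (Eq, Show)

instance Num a => Semigroup (Mean a) where
  Mean s1 n1 <> Mean s2 n2 = Mean (s1 + s2) (n1 + n2)

instance Num a => Monoid (Mean a) where
  mempty = Mean 0 0

-- The monoid function: one value contributes itself and a count of 1.
mean :: a -> Mean a
mean x = Mean x 1

-- The final division is not part of the monoid.
-- For an empty input this divides by zero (NaN for Double).
getMean :: Fractional a => Mean a -> a
getMean (Mean s n) = s / fromIntegral n
```

The monoid part is fine on its own; what is awkward is that the result needs a separate extraction step.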

The Aggregation type class

For calculations like mean, I need something more than a monoid. I need a monoid for accumulating the values, and then, once the accumulation is complete, a postprocessing function to compute the final result. Hence a new typeclass to extend Monoid:
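A sketch of what such a class can look like. The names `Aggregation`, `AggResult`, `aggResult` and `afoldMap` are used in this post; the exact declarations and the `Mean` instance are my reconstruction and should be read as assumptions.

```haskell
{-# LANGUAGE TypeFamilies #-}

-- A monoid for accumulating, plus a postprocessing step for the result.
class Monoid a => Aggregation a where
  type AggResult a
  aggResult :: a -> AggResult a

-- Mean accumulates a sum and a count, and reports a Double at the end.
data Mean = Mean Double Int

instance Semigroup Mean where
  Mean s1 n1 <> Mean s2 n2 = Mean (s1 + s2) (n1 + n2)

instance Monoid Mean where
  mempty = Mean 0 0

instance Aggregation Mean where
  type AggResult Mean = Double
  aggResult (Mean s n) = s / fromIntegral n

mean :: Double -> Mean
mean x = Mean x 1

-- foldMap followed by the postprocessing step.
afoldMap :: (Foldable t, Aggregation m) => (a -> m) -> t a -> AggResult m
afoldMap f = aggResult . foldMap f
```

With these definitions, `afoldMap mean [1, 2, 3, 4]` yields a plain `Double`.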

This makes use of the type families GHC extension. We need this to express the fact that our postprocessing function aggResult has a different return type to the type of the monoid. In the above definition:

aggResult is a function that gives you the value of the final result from the value of the monoid

AggResult is a type function that gives you the type of the final result from the type of the monoid

In order to use the monoids we defined before (sum, product, etc.) we need to define Aggregation instances for them also. Even though they are trivial, it turns out to be useful, as we can make the aggResult function strip off the newtype constructors that were put there to enable the Monoid typeclass:
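A sketch of such instances, with one possible set of four computations (sum, product, count and mean) combined through a hypothetical `a4` combinator. The choice of computations, `a4`, and the definitions of `Count` and `Mean` are assumptions made for illustration.

```haskell
{-# LANGUAGE TypeFamilies #-}
import Data.Monoid (Sum(..), Product(..))

class Monoid a => Aggregation a where
  type AggResult a
  aggResult :: a -> AggResult a

-- Trivial instances: the result is the wrapped value, constructor removed.
instance Num a => Aggregation (Sum a) where
  type AggResult (Sum a) = a
  aggResult = getSum

instance Num a => Aggregation (Product a) where
  type AggResult (Product a) = a
  aggResult = getProduct

newtype Count = Count Int
instance Semigroup Count where Count a <> Count b = Count (a + b)
instance Monoid Count where mempty = Count 0
instance Aggregation Count where
  type AggResult Count = Int
  aggResult (Count n) = n

data Mean = Mean Double Int
instance Semigroup Mean where Mean s1 n1 <> Mean s2 n2 = Mean (s1 + s2) (n1 + n2)
instance Monoid Mean where mempty = Mean 0 0
instance Aggregation Mean where
  type AggResult Mean = Double
  aggResult (Mean s n) = s / fromIntegral n

-- A tuple of aggregations is an aggregation, postprocessed componentwise.
instance (Aggregation a, Aggregation b, Aggregation c, Aggregation d)
    => Aggregation (a, b, c, d) where
  type AggResult (a, b, c, d) = (AggResult a, AggResult b, AggResult c, AggResult d)
  aggResult (a, b, c, d) = (aggResult a, aggResult b, aggResult c, aggResult d)

afoldMap :: (Foldable t, Aggregation m) => (a -> m) -> t a -> AggResult m
afoldMap f = aggResult . foldMap f

-- Hypothetical four-way product combinator, in the style of a2 and a3.
a4 :: (a -> b) -> (a -> c) -> (a -> d) -> (a -> e) -> a -> (b, c, d, e)
a4 f g h i x = (f x, g x, h x, i x)

summary :: [Double] -> (Double, Double, Int, Double)
summary = afoldMap (a4 Sum Product (const (Count 1)) (\x -> Mean x 1))
```

Note that the result type of `summary` mentions no `Sum`, `Product`, `Count` or `Mean` at all.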

The 4 computations have been calculated all in a single pass over the input list, and the results are free of the type constructors that are no longer required once the aggregation is complete.

Another example of an Aggregation where we need to postprocess the result is counting the number of unique items. For this we will keep a set of the items seen, and then return the size of this set at the end:
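A sketch of this aggregation using `Data.Set` from the containers package. The names `CountUnique` and `countUnique` are my own, and the class is repeated so that the snippet stands alone.

```haskell
{-# LANGUAGE TypeFamilies #-}
import qualified Data.Set as Set

class Monoid a => Aggregation a where
  type AggResult a
  aggResult :: a -> AggResult a

-- Accumulate the set of distinct items seen so far.
newtype CountUnique a = CountUnique (Set.Set a)

instance Ord a => Semigroup (CountUnique a) where
  CountUnique s1 <> CountUnique s2 = CountUnique (Set.union s1 s2)

instance Ord a => Monoid (CountUnique a) where
  mempty = CountUnique Set.empty

-- The final result is just the size of the accumulated set.
instance Ord a => Aggregation (CountUnique a) where
  type AggResult (CountUnique a) = Int
  aggResult (CountUnique s) = Set.size s

countUnique :: a -> CountUnique a
countUnique = CountUnique . Set.singleton

afoldMap :: (Foldable t, Aggregation m) => (a -> m) -> t a -> AggResult m
afoldMap f = aggResult . foldMap f
```

For example, `afoldMap countUnique "mississippi"` counts the four distinct letters m, i, s and p.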

Higher order aggregation functions

All of the calculations seen so far have worked consistently across all values in the source data structure. We can make use of the mempty monoid value in order to filter our data set, and/or aggregate in groups. Here’s a couple of higher order monoid functions for this:
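A sketch of two such combinators. `groupBy` and the `MMap` type are named in this post; `filterBy` is a name I have invented, and all the definitions are assumptions. `MMap` wraps `Data.Map` because the standard `Map` monoid keeps the left value on a key clash rather than combining the two values.

```haskell
import qualified Data.Map as Map
import Data.Monoid (Sum(..))

-- A map whose monoid operation combines the values of matching keys.
newtype MMap k v = MMap (Map.Map k v) deriving (Eq, Show)

instance (Ord k, Semigroup v) => Semigroup (MMap k v) where
  MMap a <> MMap b = MMap (Map.unionWith (<>) a b)

instance (Ord k, Semigroup v) => Monoid (MMap k v) where
  mempty = MMap Map.empty

-- Filtering: values that fail the predicate contribute mempty.
filterBy :: Monoid m => (a -> Bool) -> (a -> m) -> a -> m
filterBy p f x = if p x then f x else mempty

-- Grouping: each value becomes a one-entry map under its key;
-- the MMap monoid then merges the entries key by key.
groupBy :: Ord k => (a -> k) -> (a -> m) -> a -> MMap k m
groupBy key f x = MMap (Map.singleton (key x) (f x))

-- Example: sum only the even numbers.
sumOfEvens :: [Int] -> Int
sumOfEvens = getSum . foldMap (filterBy even Sum)
```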

groupBy takes a key function and a monoid function. It partitions the data set using the key function, and applies the monoid function to each subset, returning all of the results in a map. Non-numeric data works better as an example here. Let’s take a set of words as input, and for each starting letter, calculate the number of words with that letter, the length of the shortest word, and the length of the longest word:

*Examples> let as = words "monoids are a pretty simple concept in haskell some years ago i learnt of them through the excellent typeclassopedia looked at the examples and understood them straight away which is more than can be said for many of the new ideas that one learns in haskell"
*Examples> :t groupBy head (a3 count (min.length) (max.length))
groupBy head (a3 count (min.length) (max.length))
  :: Ord k => [k] -> MMap k (Count, Min Int, Max Int)
*Examples> afoldMap (groupBy head (a3 count (min.length) (max.length))) as
fromList [('a',(6,1,4)),('b',(1,2,2)),('c',(2,3,7)),('e',(2,8,9)),('f',(1,3,3)),('h',(2,7,7)),('i',(5,1,5)),('l',(3,6,6)),('m',(3,4,7)),('n',(1,3,3)),('o',(3,2,3)),('p',(1,6,6)),('s',(4,4,8)),('t',(9,3,15)),('u',(1,10,10)),('w',(1,5,5)),('y',(1,5,5))]

Many useful data analysis functions can be written through simple function application and composition using these primitive monoid functions, the product combinators a2 and a3 and these new filtering and grouping combinators.

Disk-based data

As pointed out before, regardless of the complexity of the computation, it’s done with a single traversal of the input data. This means that we don’t need to limit ourselves to lists and other in-memory Foldable data structures. Here’s a function similar to foldMap, but that works over the lines in a file:
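A minimal sketch of such a function. The name `foldFile` is from the post; the signature, in particular taking the file path as a third argument, is my assumption, as is the helper `parseInt`. It uses lazy `readFile` and makes no attempt to deal with strictness.

```haskell
import Data.Monoid (Sum(..))
import Text.Read (readMaybe)

-- Parse each line, apply the monoid function to the lines that parse,
-- and combine everything. Lines that fail to parse contribute mempty.
foldFile :: Monoid m => (String -> Maybe a) -> (a -> m) -> FilePath -> IO m
foldFile parse f path = do
  contents <- readFile path
  pure (foldMap (maybe mempty f . parse) (lines contents))

-- Hypothetical line parser for a file with one integer per line.
parseInt :: String -> Maybe Int
parseInt = readMaybe

-- Example: add up all the integers in a file, ignoring malformed lines.
sumFile :: FilePath -> IO Int
sumFile path = getSum <$> foldFile parseInt Sum path
```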

foldFile takes two parameters: a function to parse each line of the file, and the monoid function that does the aggregation. Lines that fail to parse are skipped. (I can hear questions in the background: "What about strictness and space leaks?" I’ll come back to that.) As an example usage of aFoldFile, I’ll analyse some stock data. Assume that I have it in a CSV file, and I’ve got a function to parse one CSV line into a sensible data value:
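To make the shape of this concrete, here is a purely hypothetical record and parser. The `Quote` type, its fields, the three-column layout and every function name below are invented for illustration and are not the data or code used in the talk.

```haskell
import Data.Monoid (Sum(..))
import Text.Read (readMaybe)

-- Hypothetical record for one line of stock data: date, close, volume.
data Quote = Quote
  { qDate   :: String
  , qClose  :: Double
  , qVolume :: Int
  } deriving (Eq, Show)

-- Split a line on commas (no support for quoted fields).
splitCommas :: String -> [String]
splitCommas s = case break (== ',') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitCommas rest

-- Parse one CSV line; anything that does not fit yields Nothing.
parseQuote :: String -> Maybe Quote
parseQuote line = case splitCommas line of
  [d, c, v] -> Quote d <$> readMaybe c <*> readMaybe v
  _         -> Nothing

-- Total traded volume over the lines that parse.
totalVolume :: [String] -> Int
totalVolume = getSum . foldMap (maybe mempty (Sum . qVolume) . parseQuote)
```

The same `parseQuote` could be handed to a file-based fold in place of working on an in-memory list of lines.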

Conclusion

So, I hope I’ve shown that monoids are useful indeed. They can form the core of a framework for cleanly specifying quite complex data analysis tasks.

An additional typeclass which I called "Aggregation" extends Monoid and provides for a broader range of computations and also cleaner result types (thanks to type families). There was some discussion when I presented this talk as to whether a single-method typeclass like Aggregation was a "true" abstraction, given it has no associated laws. This is a valid point; however, using it simplifies the syntax and usage of monoidal calculations significantly, and for me, this makes it worth having.

There remains an elephant in the room, however, and this is space leakage. Lazy evaluation means that, as written, most of the calculations shown run in space proportional to the input data set. Appropriate strictness annotations and related modifications will fix this, but it turns out to be slightly irritating. This blog post is already long enough, so I’ll address space leaks in a subsequent post…

9 thoughts on “Data analysis with Monoids”

I don’t understand Haskell well enough (former SML-NJ hacker, though), but it seems like the identity for the ‘min’ operator should be the MaxBound. That is, minimum of sensible number and +inf is the sensible number. Am I missing something?

Perhaps the names for the bounding values on my min and max monoids are less than ideal. I called it MinBound as it was the bounding value for the min operation, however you are right in that it has to behave as if it is bigger than all other values. It’s just a label though – I believe the code is correct.

The names “MinBound” and “MaxBound” on the Min and Max monoids have caused confusion, given the minBound and maxBound methods of the standard Bounded typeclass. Hence I’ve updated the post and renamed them to be “MinEmpty” and “MaxEmpty”.

I’m not sure that the Aggregation type class is the right choice. It’s possible to extract several statistics at once from a single monoid. For example, from Mean it’s possible to extract both count and mean. With a single-pass variance estimator it’s possible to get count, mean, unbiased (N-1 in denominator) and biased (N in denominator) variances and the corresponding standard deviations. It calls for a multiparameter type class, but MPTCs without dependencies between types are difficult to work with.

Also fact that estimators form monoids reflects monoidal structure of samples. Empty sample is identity and joining of two samples is composition. Also if order of elements doesn’t matter estimator will form commutative monoid.

There are several options. The first is to write a bunch of monomorphic functions which do the extraction. It doesn’t scale well, so I don’t really consider it. Another option is to make Aggregation multiparameter and create a bunch of newtypes. (I hope WP won’t treat code snippets too badly.)

It does have the problem common to MPTCs without fundeps or type families: in (extract . foldMap f) the type of the monoid couldn’t be inferred in many cases. This approach could be used to extract statistics from distributions as well. Currently Statistics.Distributions uses a bunch of ad hoc type classes. So far I don’t know of a satisfactory solution.

I’ve played with this idea although in slightly different way but never finished it