This blog is interested in imperative, functional, procedural, logic-based, and all sorts of ways of thinking about programming. I write mostly about C++, my bread-and-butter.
Recent articles have focussed around functional programming in C++; this is one paradigm C++ programmers often neglect. Many believe that it is not possible or efficient to do. I challenge this assertion by example!

Friday, December 14, 2012

Quick and Easy -- Manipulating C++ Containers Functionally.

Update: Added examples for dup and foldMap.

Probably the most useful parts of the standard C++ library are its containers and algorithms. Who has worked in C++ for any non-trivial amount of time without using std::vector, list, map, or any of the others? <algorithm>, on the other hand, is more of something everyone should know. It solves many of the problems that C++ developers encounter on a daily basis.

"How do I test if there exists an element, x, where p(x) is true?" : std::any_of
"How do I copy each element, x, where p(x)?" : std::copy_if
"How do I remove each element, x, where p(x)?" : std::remove_if
"How do I move elements from one container to another?" : std::move, <algorithm> version.
"How do I find a subsequence?" : std::search
"How do I sort an array?" : std::sort
"How do I find the sum of an array?" : std::accumulate (which lives in <numeric>)

Any programmer worth half their salt could write any of these functions in their sleep--they're basic--and the thing is that these algorithms do get written, over and over and over again, either because one does not realize a specific <algorithm> function exists, or because one is thinking on a low level and unable to see the higher level abstractions.

What I like most about the STL is that the only requirements for adapting any data type to a sequence are (1) define an iterator, and (2) define begin() and end(). After that, most (if not all) of <algorithm> becomes instantly usable with that type, as does the range-based for loop. This makes it incredibly generic and useful.
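To make those two requirements concrete, here is a minimal sketch. IntRange is a hypothetical type invented for this illustration (it is not from the standard library): it defines an input iterator plus begin() and end(), and that is enough for std::accumulate, std::any_of, and the range-based for loop to accept it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>

// Hypothetical half-open integer range [first, last), used only for illustration.
struct IntRange {
    // Requirement (1): an iterator.
    struct iterator {
        // Member types that std::iterator_traits looks for.
        using iterator_category = std::input_iterator_tag;
        using value_type        = int;
        using difference_type   = std::ptrdiff_t;
        using pointer           = const int*;
        using reference         = int;

        int value;

        int operator*() const { return value; }
        iterator& operator++() { ++value; return *this; }
        iterator operator++(int) { iterator old = *this; ++value; return old; }
        bool operator==(const iterator& other) const { return value == other.value; }
        bool operator!=(const iterator& other) const { return value != other.value; }
    };

    int first;
    int last;

    // Requirement (2): begin() and end().
    iterator begin() const { return iterator{first}; }
    iterator end() const { return iterator{last}; }
};

void int_range_example() {
    IntRange r{1, 5};  // represents 1, 2, 3, 4

    // <numeric> and <algorithm> work on it unchanged.
    assert(std::accumulate(r.begin(), r.end(), 0) == 10);
    assert(std::any_of(r.begin(), r.end(), [](int x) { return x == 3; }));

    // So does the range-based for loop.
    int count = 0;
    for (int x : r) { (void)x; ++count; }
    assert(count == 4);
}
```

Algorithms that need stronger iterator categories (std::sort wants random access, for instance) would still reject this type, which is why "most" is the safer claim.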

That's what this article is about: an abstraction over the STL that lends itself to writing terser code without losing any clarity. This abstraction is less general, by design, because it works on entire containers, not iterators. I am not writing about a replacement for any <algorithm> functions, but an alternative inspired by functional programming.

However, I do go over many <algorithm> functions, so this can also be thought of as a review.

Filtering, Taking, and Dropping: Collecting data.

I've always found the erase-remove idiom an unintuitive solution to such a common problem. I certainly would not have figured it out on my own without the help of the C++ community to point it out. Requiring containers to define a predicated erase wouldn't be generic, and <algorithm> knows only of iterators, not containers, so the standard library can't offer anything simpler. filter fills this gap by combining its knowledge of containers and iterators.
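One way filter could be written is sketched below. It assumes the container is a sequence container with an erase(iterator, iterator) member (std::vector, std::deque, std::list, std::string), and it takes the container by value so the caller's copy is left alone. The exact signature is my assumption for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Keep only the elements of s for which p(x) is true.
// s is taken by value, so the argument passed in is not modified.
template <class P, class S>
S filter(P p, S s) {
    using T = typename S::value_type;
    // The erase-remove idiom, hidden behind one call:
    // remove_if shifts the elements to keep to the front, erase trims the rest.
    s.erase(std::remove_if(std::begin(s), std::end(s),
                           [&p](const T& x) { return !p(x); }),
            std::end(s));
    return s;
}

void filter_example() {
    const std::vector<int> xs = {1, 2, 3, 4, 5, 6};
    const auto evens = filter([](int x) { return x % 2 == 0; }, xs);
    assert((evens == std::vector<int>{2, 4, 6}));
    assert(xs.size() == 6);  // the original is untouched
}
```

Associative containers such as std::map or std::set would need a different implementation, since std::remove_if cannot reorder their elements.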

take breaks the convention of returning the type of its input sequence, s. There's a reason for that: infinite lists. Consider this Haskell code:

take 10 [1..] == [1,2,3,4,5,6,7,8,9,10]

[1..] is an infinite list, starting at one. Obviously, it doesn't actually exist in memory. take returns a finite list that does.

The concept of iterators that represent infinite ranges in C++ isn't new, but neither is it common. std::insert_iterator could insert a theoretically infinite number of elements into a container. std::istream_iterator and std::ostream_iterator may read from or write to a file infinitely.

We can create pseudo-containers to represent infinite ranges and plug them into take.
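As a rough sketch of that idea, the hypothetical CountFrom below plays the role of Haskell's [1..]: its iterator never compares equal to end(). The take shown here is my assumed form, returning a std::vector regardless of the input type, so the result is always finite.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical pseudo-container: the integers start, start+1, ... with no end.
struct CountFrom {
    using value_type = int;

    struct iterator {
        using iterator_category = std::input_iterator_tag;
        using value_type        = int;
        using difference_type   = std::ptrdiff_t;
        using pointer           = const int*;
        using reference         = int;

        int value;

        int operator*() const { return value; }
        iterator& operator++() { ++value; return *this; }
        // An infinite range never reaches its end.
        bool operator==(const iterator&) const { return false; }
        bool operator!=(const iterator&) const { return true; }
    };

    int start;

    iterator begin() const { return iterator{start}; }
    iterator end() const { return iterator{0}; }  // placeholder, never reached
};

// Copy at most n elements from the front of s into a vector.
template <class S>
std::vector<typename S::value_type> take(std::size_t n, const S& s) {
    std::vector<typename S::value_type> result;
    auto it = std::begin(s);
    const auto last = std::end(s);
    while (n > 0 && it != last) {
        result.push_back(*it);
        // Only advance if another element is wanted, so an input
        // iterator does not consume one element too many.
        if (--n > 0) ++it;
    }
    return result;
}

void take_example() {
    // The C++ counterpart of: take 10 [1..]
    const auto firstTen = take(10, CountFrom{1});
    assert((firstTen == std::vector<int>{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}));
}
```

Because take stops on whichever comes first, n elements or the end of the sequence, the same function works on ordinary finite containers too.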

drop makes no promises about infinite lists, but unlike most container- or range-based algorithms, it can work on them. For example, dropping two elements from a range that reads integers from std::cin reads two integers, and their values are lost.
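Here is one way that could look. Everything named here is an assumption made for the sketch: StreamRange is a hypothetical pseudo-container over an input stream, IterRange is a hypothetical pair-of-iterators view, and drop returns that view rather than a container so that it never has to walk to the end. A std::istringstream stands in for std::cin to keep the example reproducible.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <vector>

// Hypothetical view: just a pair of iterators with begin() and end().
template <class I>
struct IterRange {
    I first;
    I last;
    I begin() const { return first; }
    I end() const { return last; }
};

// Hypothetical pseudo-container over whitespace-separated values of type T.
template <class T>
struct StreamRange {
    std::istream& in;
    std::istream_iterator<T> begin() const { return std::istream_iterator<T>(in); }
    std::istream_iterator<T> end() const { return std::istream_iterator<T>(); }
};

// Skip up to n elements and return a view of whatever follows.
// It only ever advances n times, so it is safe on endless ranges.
template <class S>
auto drop(std::size_t n, const S& s) -> IterRange<decltype(std::begin(s))> {
    auto it = std::begin(s);
    const auto last = std::end(s);
    for (; n > 0 && it != last; --n) ++it;
    return {it, last};
}

void drop_example() {
    std::istringstream in("1 2 3 4");  // stands in for std::cin
    StreamRange<int> ints{in};

    const auto rest = drop(2, ints);   // 1 and 2 are read and discarded
    const std::vector<int> v(rest.begin(), rest.end());
    assert((v == std::vector<int>{3, 4}));
}
```

With std::cin in place of the string stream, the same call would block until two integers had been typed and then throw them away.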

Folding: Reducing a list from many to one. (std::accumulate)

Accumulating is the "imperative" description of folding. Historically, you'd call the variable you update with the results of each calculation the accumulator. To accumulate, then, is to iterate through a sequence, updating the accumulator with each iteration.

Folding is another way to think of it. A fold is a transformation from a list of values to just one value. Haskell defines foldl and foldr, meaning "fold left" and "right".

foldl is really just another name for accumulate. The accumulation function (here, std::minus) expects the accumulator as the left argument and the value to accumulate as its right. foldr is reversed: not only does it iterate in reverse, it also expects the accumulator in the right-hand argument.
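The difference in argument order is easiest to see with a non-commutative function like std::minus. The sketch below assumes these signatures for foldl and foldr (function, initial value, sequence) and requires the sequence to have rbegin() and rend() for the right fold.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <numeric>
#include <vector>

// Left fold: f(f(f(x0, a), b), c). This is exactly std::accumulate.
template <class F, class X, class S>
X foldl(F f, X x0, const S& s) {
    return std::accumulate(std::begin(s), std::end(s), x0, f);
}

// Right fold: f(a, f(b, f(c, x0))).
// Walks the sequence backwards and passes the accumulator on the right.
template <class F, class X, class S>
X foldr(F f, X x0, const S& s) {
    for (auto it = s.rbegin(); it != s.rend(); ++it)
        x0 = f(*it, x0);
    return x0;
}

void fold_example() {
    const std::vector<int> v = {1, 2, 3};

    // ((10 - 1) - 2) - 3
    assert(foldl(std::minus<int>(), 10, v) == 4);

    // 1 - (2 - (3 - 10))
    assert(foldr(std::minus<int>(), 10, v) == -8);
}
```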

Functional programmers also like to build lists using fold. They build lists starting at the tail, so they typically prefer foldr to foldl. std::forward_list works like [] in Haskell and linked lists in other functional languages. This snippet simply copies the values from the std::vector, v.
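A sketch of that kind of list building follows. cons is a hypothetical helper named after the Lisp and Haskell operation; foldr is repeated here, with the same assumed signature, so the block stands alone.

```cpp
#include <cassert>
#include <forward_list>
#include <utility>
#include <vector>

// Right fold: f(a, f(b, f(c, x0))), accumulator in the right-hand argument.
template <class F, class X, class S>
X foldr(F f, X x0, const S& s) {
    for (auto it = s.rbegin(); it != s.rend(); ++it)
        x0 = f(*it, std::move(x0));
    return x0;
}

// Hypothetical helper: prepend x to xs, like (x : xs) in Haskell.
std::forward_list<int> cons(int x, std::forward_list<int> xs) {
    xs.push_front(x);
    return xs;
}

void build_list_example() {
    const std::vector<int> v = {1, 2, 3};

    // Starts from the empty list and prepends 3, then 2, then 1,
    // so the result has the same order as v.
    const auto copy = foldr(cons, std::forward_list<int>{}, v);
    assert((copy == std::forward_list<int>{1, 2, 3}));
}
```

Using foldl with the arguments flipped would build the list in reverse, which is why the right fold is the natural choice when the only cheap insertion is at the head.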

Zip and Map: many to many. (std::transform)

To zip two sequences together by some function is the same as calling std::transform. Transform implies modifying each member by some function. Zip implies the same, but with the visual metaphor of combining two lists into one, starting at one end and working up.
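A minimal sketch of zip on top of std::transform is below. For simplicity it assumes both inputs and the result share one container type, S, that supports size() and push_back, and it stops at the end of the shorter input; those choices are mine, not necessarily the original definition.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <vector>

// Combine xs and ys element by element: { f(x0,y0), f(x1,y1), ... }.
template <class F, class S>
S zip(F f, const S& xs, const S& ys) {
    using D = typename S::difference_type;
    S result;
    // Only walk as far as the shorter of the two sequences.
    const auto n = static_cast<D>(std::min(xs.size(), ys.size()));
    std::transform(std::begin(xs), std::next(std::begin(xs), n),
                   std::begin(ys), std::back_inserter(result), f);
    return result;
}

void zip_example() {
    const std::vector<int> xs = {1, 2, 3};
    const std::vector<int> ys = {10, 20, 30};
    assert((zip(std::plus<int>(), xs, ys) == std::vector<int>{11, 22, 33}));
}
```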

Note: The only way I have discovered to write zip variadically is with tuples. Since this article is not on tuples, refer to the definition of transform in "Zipping and Mapping Tuples".

Note2: An in-place version of this function is possible, but showing both general and optimized versions of each function would be redundant, and the topic of optimization is worth discussing on its own.

Mapping is similar to zipping--in fact the two-argument forms of zip(f,xs) and map(f,xs) should be equivalent. The three argument form, like map(f,xs,ys), applies f to every combination of x and y.

map(f,{x,y},{a,b}) == { f(x,a), f(x,b), f(y,a), f(y,b) }

If xs is size N and ys is of size M, then map(f,xs,ys) returns a sequence of size N x M.
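The two forms of map might be sketched as follows. I have put them in a hypothetical namespace, fn, purely to keep the name apart from std::map, and both assume the input and output share one container type with push_back.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <vector>

namespace fn {  // hypothetical namespace for this sketch

// Two-argument form: { f(x) for each x in xs }.
template <class F, class S>
S map(F f, const S& xs) {
    S result;
    std::transform(std::begin(xs), std::end(xs), std::back_inserter(result), f);
    return result;
}

// Three-argument form: f applied to every combination of x and y,
// with xs as the outer loop, giving xs.size() * ys.size() results.
template <class F, class S>
S map(F f, const S& xs, const S& ys) {
    S result;
    for (const auto& x : xs)
        for (const auto& y : ys)
            result.push_back(f(x, y));
    return result;
}

}  // namespace fn

void map_example() {
    const std::vector<int> xs = {1, 2};
    const std::vector<int> ys = {10, 100};

    const auto doubled = fn::map([](int x) { return 2 * x; }, xs);
    assert((doubled == std::vector<int>{2, 4}));

    // { f(1,10), f(1,100), f(2,10), f(2,100) }
    const auto products = fn::map(std::multiplies<int>(), xs, ys);
    assert((products == std::vector<int>{10, 100, 20, 200}));
    assert(products.size() == xs.size() * ys.size());
}
```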

While composing whole-container functions this way may turn an algorithm from one-pass (update and add if valid) to two-pass (update all states, then filter), it also yields simpler algorithms that the compiler can sometimes optimize more easily. For example,

// or: auto r = filter( pred, map(std::multiplies<int>(),xs,ys) );

While only profiling can tell in any given instance, the second example may be faster under some circumstances. The compiler may be able to vectorize the call to map, but have difficulties applying the same optimization to the first because it cannot evaluate both the multiplication and predicate in one vectorized step.
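To put the two shapes side by side, here is a sketch of a one-pass loop and the two-pass filter-over-map version. products_one_pass and products_two_pass are hypothetical names for this comparison, and filter and map live in a hypothetical namespace, fn. The sketch only shows that both produce the same result; it makes no claim about which is faster.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <vector>

namespace fn {  // hypothetical namespace for this sketch

// Keep the elements of s that satisfy p.
template <class P, class S>
S filter(P p, S s) {
    using T = typename S::value_type;
    s.erase(std::remove_if(std::begin(s), std::end(s),
                           [&p](const T& x) { return !p(x); }),
            std::end(s));
    return s;
}

// Apply f to every combination of x and y.
template <class F, class S>
S map(F f, const S& xs, const S& ys) {
    S result;
    for (const auto& x : xs)
        for (const auto& y : ys)
            result.push_back(f(x, y));
    return result;
}

}  // namespace fn

// One pass: compute each product and keep it only if it is valid.
template <class P>
std::vector<int> products_one_pass(P pred, const std::vector<int>& xs,
                                   const std::vector<int>& ys) {
    std::vector<int> r;
    for (int x : xs)
        for (int y : ys) {
            const int z = x * y;
            if (pred(z)) r.push_back(z);
        }
    return r;
}

// Two passes: compute every product, then filter.
template <class P>
std::vector<int> products_two_pass(P pred, const std::vector<int>& xs,
                                   const std::vector<int>& ys) {
    return fn::filter(pred, fn::map(std::multiplies<int>(), xs, ys));
}

void passes_example() {
    const std::vector<int> xs = {1, 2, 3};
    const std::vector<int> ys = {4, 5};
    const auto even = [](int z) { return z % 2 == 0; };

    const auto a = products_one_pass(even, xs, ys);
    const auto b = products_two_pass(even, xs, ys);
    assert((a == std::vector<int>{4, 8, 10, 12}));
    assert(a == b);
}
```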

Sometimes, the goal is to calculate something given the data, rather than map it. Naively, one might write something like

auto r = fold( f, map(g,xs) );

But isn't creating the new container inefficient? Even if an in-place version of map were implemented, wouldn't transforming xs before folding still be inefficient? This is where foldMap is useful: it applies g to each element as it folds, so no intermediate container is ever built.
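A sketch of foldMap under one assumption of mine: it takes an explicit initial value, x0, rather than seeding the fold with the first element. Each element is transformed by g and folded in immediately, so xs is neither copied nor modified.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Fold g(x) for every x in xs, without storing the mapped values:
// f(f(f(x0, g(a)), g(b)), g(c))
template <class F, class G, class X, class S>
X foldMap(F f, G g, X x0, const S& xs) {
    for (const auto& x : xs)
        x0 = f(x0, g(x));
    return x0;
}

void fold_map_example() {
    const std::vector<int> xs = {1, 2, 3};
    const auto square = [](int x) { return x * x; };

    // Sum of squares: 1 + 4 + 9
    assert(foldMap(std::plus<int>(), square, 0, xs) == 14);
}
```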

Conclusions.

Haskell's Data.List is actually a lot like <algorithm>, though on a higher level of abstraction. There are some things that can only be done with iterators, but many others that can only be done with whole containers. Data.List gives some good inspiration for helpful algorithms, even in C++.

But unlike C++, Haskell uses simple linked lists by default, and all of Data.List's functions work only on linked lists. This gives both Haskell and functional programming a bad name when people compare Haskell code using linked lists to C++ code using std::vector. (See "C++ Benchmark -- std::vector vs. std::list vs. std::deque".) When libraries are written to optimize inefficiencies in the linked list, like Data.Text, they re-implement Data.List's interface and often achieve efficiency equivalent to well-optimized C, but not without plenty of redundancy.

In C++, we can write one static interface that is both generic and efficient. Writing functional code does not mean writing slow code. The mathematical nature of these operations can even help the compiler optimize. The high-level interface of Data.List fits snugly atop the low-level interface of iterators.