Producing combinations, part three

All right, we have an immutable stack of Booleans and we wish to produce all such stacks of size n that have exactly k true elements. As always, a recursive algorithm has the following parts:

Is the problem trivial? If so, solve it.

Otherwise, break the problem down into one or more smaller problems, solve them recursively, and aggregate those solutions into the solution of the larger problem.

Let’s start with a signature. What do we have? Integers n and k. What do we want? A sequence of Boolean stacks such that each stack’s size is n and it has exactly k bits turned on. We therefore require that n and k both be non-negative, and that n be greater than or equal to k.

UPDATE: A commenter noted that though my program worked, the private helper method did not actually fulfill its contract, and its caller was therefore “working by accident”. Apologies for the error; I’ve fixed it.

What is the trivial case? If we are trying to make a sequence of zero bits, with zero of them on, there is only one possibility, which is the empty sequence. And we know that if n is smaller than k, then we are trying to make a sequence of n bits that has more than all of them true, which is impossible, so the resulting sequence of stacks is empty. Those sound pretty trivial! Since this is a private method we can eschew the error checking; the caller will be responsible for ensuring that the arguments are non-negative.

I’ve divided that into two parts based on whether the first element is true or false. Being able to find the point where you “divide and conquer” is the key to a lot of recursive algorithms, and this seems like a good spot to do it. For the cases where the first element is true, the remaining four elements have to be from the sequences of four Booleans where exactly two are true. For the cases where the first element is false, the remaining four elements have to be from the sequences of four Booleans where exactly three are true. In both cases we’ve decreased the size of the sequence and are therefore heading rapidly towards our recursive base case. We can now easily write the rest of the algorithm based on this insight:
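The C# code blocks didn’t survive in this copy of the post, but the divide-and-conquer just described can be sketched in Python (the name bit_stacks and the use of tuples as stand-ins for the immutable stacks are mine, not the article’s):

```python
def bit_stacks(n, k):
    """Yield every length-n tuple of Booleans containing exactly k Trues.

    A hypothetical Python analogue of the article's private C# helper;
    tuples play the role of the immutable stacks.
    """
    # Trivial cases: zero bits with zero on has exactly one solution,
    # the empty stack; n < k is impossible, so yield nothing at all.
    if n == 0 and k == 0:
        yield ()
        return
    if n < k:
        return
    # Divide on the first element. If it is True, the remaining n-1
    # bits must contain k-1 Trues; if False, they must contain k Trues.
    if k > 0:                      # a True first element is possible
        for rest in bit_stacks(n - 1, k - 1):
            yield (True,) + rest
    if n > k:                      # a False first element is possible
        for rest in bit_stacks(n - 1, k):
            yield (False,) + rest

# 5 bits, 3 on: C(5, 3) = 10 stacks
print(len(list(bit_stacks(5, 3))))  # 10
```

Note that the two guards also answer the article’s parenthetical question: once n < k has been ruled out, k > 0 implies n > 0, and n > k implies n > 0.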

(Do you see why we don’t need to check whether n > 0 in either recursive case?)

We’ve solved a problem equivalent to the one we were originally given; the original problem was to produce combinations of a sequence, not of bits. Let’s actually solve that problem, shall we? We need a combination of the Zip and Where extension methods on sequences, which is easily written. (Error handling omitted again.)
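The ZipWhere snippet is also missing from this copy; as a rough stand-in, this is what the zip-plus-filter combination looks like in Python (the name zip_where is mine):

```python
def zip_where(items, selectors):
    """Keep each item whose parallel Boolean selector is True:
    a sketch of what the article calls ZipWhere, built from
    Python's zip and a generator-expression filter."""
    return (item for item, keep in zip(items, selectors) if keep)

# Select the elements flagged True:
print(list(zip_where([50, 60, 70, 80, 90], [True, False, True, True, False])))
# [50, 70, 80]
```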


38 thoughts on “Producing combinations, part three”

For a while I was thinking “But Eric, we iterate through the stacks backwards! Won’t our combinations be output backwards?”, but then I realized that that’s perfectly fine, because we’re dealing with sets. I then realized it wasn’t even happening, because only the stacks of bools were reversed, not the resulting subsets!
And then I realized that the stack of bools was created in such a way that even though the Combinations method seems to want to choose true first whenever possible, it’s only adding that true to the stack after recursing, so the stacks still come out in the intuitive order. It seems like the call stack is being used as a temporary upside-down stack so that the final stack is right-side-up.

How much of this “working out” was intentional, and how much was just a product of the way you approached the problem?

Hah, it is very funny that you point this out, as this will become the point of the fifth episode in the series.

My primary aim here was simply to make the algorithm clear and obviously correct, but I also wanted to have the nice property that the sets I enumerated were in “increasing order”, because that helps make it clear to the reader that yes, we really are getting all possible element combinations. That is, if we concatenate the values we have the order 506070, 506080, 506090, … and they go up from lowest to highest. Interestingly, this is also the order where the bit sets I listed, when interpreted as numbers with the least significant bit *on the right*, are *decreasing* numbers. So the answer to your question is: I wrote up a couple different versions of the algorithm and then chose the one to present first that I thought was most pedagogically interesting.

However this is not the only possible algorithm for producing combinations that has a nice “lexical” ordering property. Next time in this series I’ll give another data structure that represents an immutable set by manipulating bits, and then in the last episode I’ll use this data structure to enumerate the combinations in the other lexical ordering.

“For a while I was thinking ‘But Eric, we iterate through the stacks backwards!’”

You can only iterate through a stack in one direction: LIFO. Whether that’s “backwards” depends on what you put in the stack first. As you eventually realized, with recursion you put the last computed result in first.

“‘Won’t our combinations be output backwards?’, but then I realized that that’s perfectly fine, because we’re dealing with sets.”

That wouldn’t be perfectly fine if you want to preserve the order of the items, which is desirable.

“I then realized it wasn’t even happening, because only the stacks of bools were reversed, not the resulting subsets!”

That would be a disaster, as then you would select the wrong items. ZipWhere requires that the lists be parallel.

“How much of this “working out” was intentional, and how much was just a product of the way you approached the problem?”

That it works out follows directly from the breakdown of the problem … he did the divide-and-conquer based on whether the first (leftmost) item should be included, therefore the first bool on the stack necessarily applies to that item. This is the way it is with good functional programming … it’s driven by a description of the problem, rather than by dealing with grubby implementation details.

The parts of a recursive algorithm would be more correctly stated as follows:
Solve a problem recursively:
If the problem is trivial, solve it.
Otherwise, break it into smaller problems and, for each one, solve that problem recursively.

Hi Eric. In the interest of understanding better what’s going on, I rewrote this in python. I think this really helped me. Here is the pastebin (27 lines): http://ideone.com/Lb88ry

I also tried caching the result of choose(), and while it’s much faster, it quickly runs out of memory on large combinations like (25 choose 10).

Of course, if you needed combinations() in a real-world python setting, my implementation (including the cached version) is 4-6 orders of magnitude slower (let alone the algorithmic complexity) than what you should be using: itertools.combinations. Also, your ZipWhere is renamed compress(), since that’s what it’s called in itertools.
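For readers who want to try the standard-library route the commenter recommends, both functions live in itertools:

```python
from itertools import combinations, compress

# itertools.combinations enumerates the k-subsets directly...
print(list(combinations("abcd", 2)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]

# ...and itertools.compress is the library's name for ZipWhere.
print(list(compress("abcde", [1, 0, 1, 1, 0])))
# ['a', 'c', 'd']
```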

Interesting, thanks! As a non-python-programmer, it strikes me that your code is extremely similar to the C# code you ported it from. Obviously that’s not surprising. My question is: if a python programmer had started to implement this algorithm from scratch, rather than translating a C# program, would it have been more or less the same? or is your code clearly “C# written in python” to the reader accustomed to “pythonic” code?

Barring itertools and assuming a recursive implementation, I think my code is reasonably pythonic. I’ve worked on medium sized python projects in the past, but I’m not sure I would consider myself an expert. Also I think I’m a bit biased in favor of generator functions in general; lazy evaluation is pretty awesome.

If I were asked to make it more pythonic, I might have written compress in terms of zip, filter, and map instead. And Python 3.3 added the new(ish) ‘yield from’ generator delegation expression, which could be used in place of lines 7-10.
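To illustrate the two suggestions (these are my own toy examples, not the commenter’s pastebin): compress built from zip, filter, and map, and ‘yield from’ delegating to a sub-generator:

```python
# compress expressed with zip, filter, and map, as suggested above
def compress2(items, selectors):
    pairs = zip(items, selectors)
    return map(lambda pair: pair[0], filter(lambda pair: pair[1], pairs))

# 'yield from' (Python 3.3+) replaces an explicit loop that re-yields
# everything produced by a recursive call
def countdown(n):
    if n > 0:
        yield n
        yield from countdown(n - 1)

print(list(compress2("abcde", [1, 0, 1, 1, 0])))  # ['a', 'c', 'd']
print(list(countdown(3)))                          # [3, 2, 1]
```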

Hi Eric,
I assume the ZipWhere function directly manipulates the enumerators [rather than using Zip(), Where() and Select()] so it doesn’t allocate lots of temporary item/selector tuples which are garbage collected shortly thereafter?

After seeing you’ll be using an immutable data structure in part2 (and viewing what kind of data structure) I toyed with it a little bit in F# because immutability + pattern matching WTF 😀
You can see my experiments here: https://dotnetfiddle.net/XTPXux (warning: can burn eyes)
The code (at least the first one, the most readable) is a direct translation of the “rules” of the algorithm ; so I think it should be understandable to almost anyone (past the cryptic and unusual syntax for a newcomer)

Today, finally seeing part three, I reworked that with a direct port of your C# code (heavily type-annotated even if not required) in F# (rewriting ImmutableStack, or at least what I needed of it), along with a simplified version of my experiments to fit with your code. (So much better than my first attempt in C# ^^)
Here’s the link: https://dotnetfiddle.net/0QCI6j

Mostly out of curiosity, is there a benefit to shortened identifiers in F#? When I write C# in Visual Studio, I always get an auto-complete suggestion box with the appropriate names, and so I generally only need to write the first few letters anyway. It would certainly help with that eye burning if the function names were in English!

Also, my understanding is that “x :: xs” is named that way because “x” is a typical variable name and “xs” is pronounced like “excess”; you’re matching the first element of the list to “x”, and the rest (or the excess) of the list to “xs”. But I’m having trouble applying that pattern (ha ha ha) to “xxs”. Did you choose that name because it’s like xs, but with an extra x on the front to indicate some sort of nesting?

I don’t mean to be critical; it’s just that you mentioned readability. I think it would be extra readable with more explicitly descriptive names!

I studied Orwell in 1989, a language very much inspired by Miranda and a forerunner of Haskell. The (x:xs) convention for the head and tail of a list was already in place but xs wasn’t excess, it was just the plural of x. Hence the convention for a list of lists was xss, the plural of xs.

First, as carlos said, xs is just the plural of x, and yes, I should have used xss instead of xxs (my bad).
I know the auto-complete capabilities of Visual Studio are good enough to ease the use of long names, but I tend to use F# as a scripting tool, mainly directly in a prompt or in various online editors.
Furthermore, I see the functional paradigm as closely related to maths (lambda calculus), where you don’t care about the name of a “thing”; you care about the shape and “properties” (in a mathematical sense) of those “things”.
That is somewhat the direction the language tends to send you, with possibilities like wildcards in identifiers (and pattern matching in identifiers in general), operators (composition, pipe, or “tupled-pipe” ||>), and custom operators (e.g. some kind of monadic design with bind defined as >>= and applicative functor as <*>).

Initially, my first link wasn’t intended to be published, which explains (in part) the cryptic syntax.

You’re right, good names help understanding, but to someone who knows a little FP those are common names (like the use of i, j, k as counters in C#).
Another point: I tend to use those “unnamed names” where the code is intended to be generic, but when I know I’m manipulating an identified type (e.g. a Person list) I tend to use meaningful names (e.g. persons or people).

Now I know I’ll have to pay attention to details when posting (I confess posting isn’t something I do very often, especially in a foreign language) 😉

PS: In my defense, I also want to say that when you switch from trying to do some “pseudo-functional” stuff in C# or VB.Net (mainly, for me) to F#, you also tend to switch from verbosity land to succinctness world 😀

The “scripting mode”/non-public comment makes sense. I was actually only saying “use longer names” when referring to the outer function names, which will get reused. I agree that (for the most part) it’s not necessary to change things like “impl” and “x :: xs”, because there’s only one thing they can mean in that context. Although I do like to use things like “countOfTurkeys” and “turkeyListIndex” as my loop variables, to differentiate their roles beyond just “loop variable”.

In the same sense, functional programming does tend to the more generic, and thus the more “shape vs detail” approach; however, I think problem-specific variables can be named without losing that. If you’re defining map, you can’t label the incoming list, but if you’re working on a turkey tracker, you know your list contains turkeys, and it’s good to label it that way.

I agree with EM — that’s an interesting way to think about it. Here’s another way to think about it. Suppose we are interested in bit sequences of length 5 with 3 bits on. Construct the following graph: A node is a pair of integers where the numbers are non-negative and non-increasing. Start with (5, 3) and then construct the graph where there is an arrow from (n, k) to (n-1, k-1) labeled “1”, and an arrow from (n, k) to (n-1, k) labeled “0”. This is a directed acyclic graph; trees are also DAGs. The set of all possible bit sequences is the set of all possible paths that start at (5,3). The recursive algorithm is essentially just a traversal of this graph.
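Eric’s DAG can be walked directly; here is a hypothetical Python rendering of that traversal (the names are mine):

```python
def paths(n, k):
    """Yield the label sequence of every path from node (n, k) to (0, 0).

    Following the description above: an edge (n, k) -> (n-1, k-1)
    labeled 1, and an edge (n, k) -> (n-1, k) labeled 0, restricted
    to nodes with 0 <= k <= n.
    """
    if k < 0 or k > n:
        return            # node is not in the graph: no paths
    if n == 0:
        yield ()          # the unique empty path at (0, 0)
        return
    for rest in paths(n - 1, k - 1):
        yield (1,) + rest     # took the "1" edge: this bit is on
    for rest in paths(n - 1, k):
        yield (0,) + rest     # took the "0" edge: this bit is off

# Starting at (5, 3), the paths are exactly the 5-bit sequences with 3 ones.
print(len(list(paths(5, 3))))  # 10
```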

I should clarify the tree structure I am thinking about is not of the input numbers but of the true / false selection criteria.

I don’t think you actually need a tree structure. You have a virtual tree structure defined in code. In the 4th code block (where true or false is pushed and returned) you have defined that for whatever node you are currently on (Empty, True, or False) you are going to evaluate the result of what happens if you push True and what happens if you push False. When you find branches that meet your requirements (depth of 5 with 3 nodes being true) then you yield that result.

I would lean pretty strongly towards having the Combinations method accept an IReadOnlyCollection rather than an IEnumerable. Clearly your method needs the count, in addition to being able to iterate the items, which is behavior that IReadOnlyCollection will give you.

The other issue is that if you accept an IEnumerable you end up being forced to iterate the source sequence twice. This could be problematic in any number of ways: the sequence might not have the same size when iterated multiple times, it could be expensive to compute, it could cause side effects that aren’t expected to occur more than once, and so on. Accepting a collection rather than an IEnumerable would force the caller to materialize the sequence into a collection to be able to call your method when they are in any of those situations. The alternative behavior is to materialize the collection into a list internally within the method, but I find that, in most cases, if the source sequence needs to be materialized it’s preferable to indicate that to the caller.

It’s also worth noting that we’re computing combinations, which is going to scale *horribly* as the size of the input increases, so the input is going to need to be fairly small. It won’t be important to be able to stream the input data and avoid materializing the whole sequence in memory. If we couldn’t trivially fit it into memory then the size of the output would be entirely unmanageable anyway.
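To put a number on that scaling, Python’s math.comb shows how quickly the middle binomial coefficient grows with n:

```python
import math

# C(n, n//2) is the worst case for a given n; it grows roughly
# like 2**n / sqrt(n), so each added element nearly doubles the count.
for n in (10, 20, 30, 40):
    print(n, math.comb(n, n // 2))
```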

These are good points and I considered restricting to collections for the purposes of this exercise; as you’ll see in the next couple of episodes I show that by restricting to sequences of fewer than 32 elements, we can represent the combination in an int.

Your point about the number of combinations scaling (roughly speaking) exponentially is also well-taken, but of course we don’t know exactly what the caller is going to do with the sequence of combinations. They might not be enumerating all of it. Or, to look at it another way: in the case where the number of elements is large and the number of combinations enumerated is enormous, the cost of calling “Count” once seems likely to be lost in the noise.

This helps emphasize something I find interesting about IEnumerable: they can be used to store infinite sequences. You can write some code that creates an IEnumerable object that represents the entire Fibonacci Sequence, while only using a couple of variables (and a handful of bytes of overhead). You can then enumerate that sequence to an arbitrary element.
Of course, the sequence members grow roughly exponentially, so you’ll need a way to store large integers, and adding them might get slow. Still, that seems powerful to me. So, as Eric said, you might not need to worry about space, even with large n. Then again, all of the recursion will start to use a lot of stack, but that’s just O(n) anyway.
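The same trick works with Python generators, which are the closest analogue to a lazy IEnumerable; a sketch:

```python
import itertools

def fibonacci():
    """The entire Fibonacci sequence, represented lazily in two variables."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Enumerate out to an arbitrary element without materializing the sequence:
print(list(itertools.islice(fibonacci(), 10)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```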

I wish .NET would add a means by which interfaces could specify default implementations for methods, and by which (as a consequence) new versions of an interface could be implemented by code written for old ones, provided that any new members included default implementations. The performance and safety of many LINQ methods could be greatly enhanced if IEnumerable and IEnumerator had some extra methods which a default implementation could handle, but which some implementations might handle better.

For example, there should be an easy way to ask any enumerable or enumerator “What can you tell me about your count (total items, or items left)”, with possible answers including “I can’t tell you anything”, or various combinations of “It’s not infinite”, “It’s not finite”, “I can say what it is”, “If you ask what it is, I’ll cache the result”, “It will never change”, “It’s very likely to change”. Any enumerator could legitimately implement such a method by simply returning a response that says “I can’t tell you anything”.

Adding a few such methods to IEnumerable, as well as an `int Move(int)` method to IEnumerator, would greatly enhance the performance and safety of methods like Count, and would also allow infinite sequences to inform methods like ToList() that they should throw an exception *before* running the system out of memory. [Note that compiler support for an `int Move(int)` method could improve the performance of iterators, since `yield return x;` could translate as something like `if (iterationsToSkip) iterationsToSkip--; else { …code for yield return… }`.]

In the absence of any means by which an IEnumerable or IEnumerator can indicate that it is infinite, I would consider it wise to have sequence generators include a parameter at construction indicating the maximum number of items they will return. While this creates the possibility of bugs if an overly-small number is guessed, it may help avoid the possibility of an out-of-memory failure resulting from code calling `ToList` on a sequence as a means of trying to ensure that any information on a remote server will get eagerly fetched.

I agree with the statement that if k is 0 then the result set would have no items in it (if you are choosing none). However I would have expected the stack to have n false items in it. From your later usage in ZipWhere it will do the right thing (because end of list and remaining list all being false will be the same) but from the point of view of looking at the data structure as being your list of true/falses it seems that it should return a stack of n falses…

It occurs to me after posting that given this is an internal implementation detail then the absolute correctness of the intermediate stages is not important and that compactness of data may be more desirable.

No, you are absolutely right. I’ve messed this up. The code *works*, but it does so by accident, not because the invariants which I wish to maintain are actually maintained. Thanks for noticing this! I’ll fix it.