Wednesday, July 11, 2012

Comparing objects in C++

The standard C++ pattern for comparing and ordering objects is to overload operator<() for your class. This function is what all the standard containers (std::set, std::priority_queue, etc.) and algorithms (std::sort, std::binary_search, etc.) use for comparing objects.

So if you have a class that you would like to put in a standard container, you have to overload operator<(). Your class usually has more than one member, and you usually want to order your objects first by the value of one member, then by the value of another member, and so on. For example, if you have a class person with string members firstname and lastname, and you would like a "phone book" ordering, you would order by lastname first, then by firstname, e.g.:

This code is not optimal. To see why, consider a case where you have two persons with the same lastname. The first check will fall through (lastname < other.lastname is false); so will the second (other.lastname < lastname is also false). Each check had to loop over the lastname when a single loop should have sufficed.

Fortunately, std::string happens to implement a member function compare() which returns an int. The semantics are similar to C's strcmp(a, b), which also returns an int. The return value is less than 0 if a < b, 0 if they are equal, or greater than 0 if a > b. Using this function, we can implement a much more efficient operator<():

To see the difference in running time, we can implement a simple loop which inserts the same element many times to a multiset:

int main(int argc, char *argv[])
{
std::multiset<person> persons;
for (unsigned int i = 0; i < 1000000; ++i)
persons.insert(person("Thomas", "Anderson with a rather long last name for the sake of the argument"));
}

We exaggerate the number of elements and the length of the lastname in order to amplify the difference. Running this on my laptop, it takes approximately 3.9 seconds to run the original code and 2.9 seconds to run the new version.

This is all well and good, but the actual problem runs a bit deeper than this. What if our code uses other compound data types than strings? Take e.g. std::pair. It is hopelessly broken, precisely because it doesn't implement a compare() function! std::pair only implements operator<() and only expects its elements to implement operator<() too. Have a look at the libstdc++ implementation (simplified):

Can you spot the performance bug? Yep, it calls checks both __x.first < __y.first and __y.first < __x.first in the worst case. This also means that std::pair<std::string, std::string> has exactly the same problem as our first implementation of struct person.

In order to show just how bad this really is, consider the case where we have nested pairs:

Intuitively, we should only need to perform 128 comparisons in the worst case (two comparisons for each pair of corresponding elements from the two objects). If we had two arrays of 64 ints, we would do it like this:

In order to answer the question posed ("How many comparisons between two ints are we actually making when we compare x6 against itself?"), we can simply program it using a wrapper class for int that counts the number of comparisons:

Because std::pair has three comparisons in the worst case, instead of just two (which would have been sufficient), the complexity of comparing nested pairs has just climbed from 2^n (where n is the number of levels of pairs) to 3^n. That's pretty bad!

Part of the solution, as employed by std::string and our improved struct person, is to define the most fundamental comparison between objects as a function that can tell you whether a < b, a == b, or a > b in a single call rather than determine this using two calls. It is clear that the two are semantically equivalent (you can both define operator<() in terms of compare() and the other way around), but one is computationally superior for compound objects.

This is still not the full story, however. We wouldn't get any performance improvement from std::set if we changed it to call compare() instead of operator<() — because it doesn't need to compare both ways! In fact, if we changed it, we would now find that std::set<int> operations take more time, since calling compare() on an int would have to check both a < b and b < a even when std::set only needs to know the result of one of them! (This is of course tighly coupled to the std::set implementation — implemented as a tree, lookups would typically traverse the tree by comparing the element that is searched for with the current node, and update the current node depending on the result of the comparison.)

The real solution, therefore, seems to be to implement bothoperator<() and compare() for all types. std::pair's would look something like this:

Now containers and algorithms can use the most efficient one — std::set would still use operator<() on its elements, but std::pair::operator<() would now use compare() (for its first element), and so on.

@bert hubert If it's performance-sensitive and/or you have a lot of nested ties, you may consider just implementing your own tuple/tie that does the same thing using a compare() instead. But I like the idea, it's indeed a clever trick! Your website/blog is nice, by the way :-)