C++11's Move Semantics Are Not Free

Overview

A question on StackOverflow asked: "I was wondering whether lambdas together with move semantics or any other new feature can do as good as ETs. Any thoughts?"

So I answered the question: no, expression templates are still needed! This blog post is a rewritten version of my response on StackOverflow.

C++11 Move Operations Are Not Free

If you are new to C++11, think of move semantics as optimized copy operations. Never think of them as operations with zero overhead, because they do have overhead. In the worst case, moving something has the same complexity as copying it: i.e., O(moving data) = O(copying data). In the best case (for classes/structs with data members), the cost of a move is at least the cost of copying the pointers to the internal data structures: i.e., Ω(moving data) = Ω(copying pointers to internal data structures). In no case, except when the class/struct has no data members at all, will the cost of moving something be zero: i.e., Ω(moving data) ≠ 0, just as Ω(copying data) ≠ 0 under the same circumstances.

If it ever appears that the cost of moving or copying data in C++ is zero, it is because the compiler is performing the copy elision optimization or the return value optimization (RVO), or, in a very niche set of cases, because your classes/structs have no member variables.

The example presented below, and in my StackOverflow answer, shows why the cost of moves needs to be considered. In the example, a math_vector class uses a statically sized std::array as its internal representation. Thus, the cost of moving a math_vector object is the same as copying it, so avoiding any extra copies or moves involving temporaries is of the utmost importance. However, I don't stop there, as the next question one would likely ask is: how does one get rid of the extra copies and/or moves? The method used to accomplish this is the C++ template metaprogramming technique called expression templates.

The Objective and the Problem

When code is written, the goal usually sought is maximum efficiency while keeping the code understandable and maintainable by humans. Object-oriented and object-based programming can help achieve this goal, as it groups together the operations that can be performed on each type. This leads to code that looks like:

i.e., CASE 1 requires four explicitly constructed instances using initializer lists (i.e., the "initlist" items), the "result" variable (i.e., `0x7fff8d6eddd0`), and an additional three objects for copying and moving; CASE 2 is better, requiring only three extra objects for moving.

But one can do better: zero extra temporaries should be possible, with the one caveat that all explicitly instantiated objects would still be created (i.e., the four "initlist" objects and "result"). This will be achieved using the expression template technique by creating:

a proxy math_vector_expr<LeftExpr,BinaryOp,RightExpr> class to hold an expression that has not yet been computed,

a proxy plus_op class to hold the addition operation that has not yet been computed,

a constructor in math_vector that accepts a math_vector_expr object,

"starter" member functions in math_vector that trigger the creation of the expression template (e.g., operator + and operator +=), and

"end" member functions in math_vector that force the computation to occur (e.g., by assigning a math_vector_expr to a math_vector or by passing a math_vector_expr to a math_vector constructor).

The key is that care was taken to ensure that all function arguments are perfectly forwarded and all return values are moved, both in the proxy objects and in all code invoked through their use. Additionally, to ensure that copies are never needlessly made, overloads were written for rvalue references, not just constant references, and returned values are explicitly moved when they are lvalues referring to rvalue references.

With expression templates, the only code that appears between the initialization of the four "initlist" variables and the output of the "result" data is CPU instructions for loading and adding floating-point numbers: absolutely no function calls are made. That is the reason this is an "even better" result, and it is what one would produce by hand-optimizing the code, e.g., unrolling loops, etc.

Expression Templates

Before showing the code below, a short explanation of expression templates is in order. Expression templates are a form of template metaprogramming. Template metaprogramming is a style of programming that uses C++ templates to make the compiler compute, at compile time, the code to generate. This is possible because the C++ template mechanism is itself Turing-complete. Unlike "normal" run-time programming, it operates on types and compile-time constants.

Aside: Template metaprogramming is not easy overall, so if you are new to C++, wait until you are at least an intermediate-skilled C++ programmer before jumping into the C++ metaprogramming world, as you will need to understand some key, subtle points of the language. Many of my other C++ posts on this site use template metaprogramming if you want to see some of the things that can be done and some of what it entails.

The expression template technique builds parse trees of recognized expressions using user-defined types. Each type in these parse trees is a proxy object representing a value to be computed at a later point in time. Thus, each proxy object stores its arguments and the operations to be performed without actually performing them until doing so absolutely needs to be done or is otherwise desired.

In the code below, this is exactly what the math_vector_expr class does. Notice that it stores the "left expression" and the "right expression" for some "binary operation". Both the left and right expressions are stored, but the binary operation is not, since the plus_op class has no state in this example. If it did, the binary operation would be stored as well.

The actual construction of the parse tree is triggered by the operator + code in the math_vector class, since it returns a math_vector_expr. Notice that the math_vector_expr class never computes anything: it merely stores its arguments and returns another math_vector_expr. The key here is to make all math_vector_expr instances rvalues, or at least constant references, to exploit compiler optimizations and avoid generating code with run-time overhead. Finally, notice that only by passing a math_vector_expr to the math_vector constructor, or by assigning one to a math_vector, is the math_vector_expr finally asked to compute the value of each vector element. Coupled with C++11 compiler optimizations, the end result is the elision of all compiler temporaries and totally efficient code. It is very cool!

C++ Code

The following C++ code is the example code used to produce all of the output above. Be sure to compile the code with at least level-one optimizations. This code will work with both clang++ v3.1 and g++ v4.8 (and likely v4.7). I compiled the code using these options: -std=c++11 -O3.

If you want to build the code so that it does not use expression templates, then #define DONT_USE_EXPR_TEMPL; otherwise, ensure that macro is not defined.