The Name Return Value Optimization

I received the following mail the other day in response to my writing that the Visual C++ compiler has [finally!] implemented the name return value optimization (NRVO) in the Visual Studio 2003 release:

Sender: James Slaughter

I was in the process of catching up with your blog entries … when I spotted your claim that VC 7.1 implements the named return value optimization.I've so far failed to write code that takes advantage of NRVO, with any optimisation settings that I know of. Are you mistaken, or have I overlooked something? I realise that this could be a breaking change for badly written code, so is there an un(der)-documented compiler switch?

The fact that James has to ask -- that there is no source level way to be certain whether the optimization is or is not being applied -- underscores what I believe is a fundamental mistake we made in its provision, and why I would urge the new ISO C++ committee to revisit the whole issue. Let me give some background.

Given a function that returns a class object by value, such as the Matrix addition operator:

Matrix operator+( const Matrix &m1, const Matrix &m2)

{

Matrix sum;

// default Matrix constructor applied

// do the math

return sum;

}

the function is generally transformed internally into

// Psuedo C++ Code:

// internal transformation of function

void operator+( Matrix &_retvalue,

const Matrix &m1, const Matrix &m2)

{

Matrix sum;

// invoke default constructor

sum.Matrix::Matrix();

// do the math

// copy construct sum into return value

_retvalue.Matrix::Matrix( sum );

}

The complaint with this code, by inspection, is the realization that the final copy construction of the local Matrix object is unnecessary. It would be more efficient to directly compute the result into the internally generated return value.

This inability of the programmer to return a class object efficiently was considered a weakness of the C++ language. One proposed solution, by Michael Tiemann, was a language extension to name the function’s returning class object. For example,

Matrix

operator+( const Matrix& m1, const Matrix& m2 )

name result

{

// no longer write: Matrix result;

//...

// no longer write: return result;

}

The compiler would then rewrite the function internally to take a third reference parameter:

// internally rewritten function

// under proposed language extension

void

operator+( Matrix &result,

const Matrix& m1, const Matrix& m2 )

{

// directly compute into result

}

and transform all uses of the function to compute the result directly in the target class object. For example,

Matrix c = a + b;

becomes transformed internally into

Matrix c;

operator+( c, a, b );

Michael made the unsupported claim in his presentation of this language extension that the compiler, without this explicit syntactic cookie, could not consistently carry out this optimization. Others in the community took exception to this, and claimed that the optimization could be provided without the necessity of extending the language. [This was discussed at some length in the commentary of Section 12 of Stroustrup & Ellis’ Annotated Reference Manual. Walter Bright was the first to implement the NRVO, in his Zortech C++ compiler, without Michael’s extension. Subsequently, Nancy Wilkinson implemented it in a next release of cfront, and these were taken as demonstrations that a language extension was not needed.

This name return value extension never became part of the language — but the optimization did. It was realized that a compiler could recognize the return of the class object and provide the return value transformation without requiring an explicit language extension. That is, if all the exit points of a function return the same named object. For example, given a function of the general form

classType

functionName( paramList )

{

classType localResult;

// do the work ...

return localResult;

}

the compiler transforms both the function and all its uses internally to the form

void functionName( classType &namedResult, paramList )

{

// still must construct result object

// replacing the call on the local object …

namedResult.classType();

// directly compute into namedResult

}

eliminating both the return by value of the class object and the need to invoke the class copy constructor. To trigger it, the class object returned must be the same named object at each return point within the function. To give a sense of the performance gain, here is a measure I made a bit of a long time ago on a piece of code that heavily returned objects [timing is based on the UNIX timex command],

Not AppliedAppliedApplied + -O

1:48.5246.7346.05

As you can see, what’s right with the optimization is the performance gain with little or no additional work. What’s wrong with the optimization is that you don’t know if it will or will not be triggered, and this is a real problem. For example, under cfront, it was only triggered at the top level of the local function. If you nested it within a local block, it was not generated; this surprised people. What surprised people even more was that, up until recently, the two most widely used compilers, Visual C++ and GNU C++, did not provide it. I have been told that Visual C++ 7.1 does implement it, but James cannot verify that by an inspection of the source level language or of compiler itself, apparently.

There are other side-effects as to whether this optimization is applied or not: you cannot inspect your source-level code and know whether there is a balance between the copy constructor and destructor of your class; therefore, any programming that hand-shakes between the two may or may not be semantically valid depending on whether the optimization is or is not turned on – this, I presume, is what James meant when he said,

I realise that this could be a breaking change for badly written code

What happened, I suspect, is that everyone got caught up with Michael’s assertion that the extension was required in order for the compiler to do its work. This seems to have been overstated and, once it was demonstrated not to be the case, the question of whether the extension was necessary was dropped. And yet it seems to me that the extension is necessary – but not for the reason Michael put forward. Rather, it is necessary in order to know at the source level whether or not the optimization is being applied.

Michael’s solution is probably not sufficient, however, because there are cases in which one which to elide the copy constructor but which one cannot pin the name to, such as

Matrix mat = Matrix( 4, 4 );

In this case, no one wants a temporary 4x4 Matrix constructed and then copy-constructed into mat, but Michael’s syntax doesn’t necessarily cover this case. One possibility is to have the class designer place it on the copy constructor itself.

This, by the way, is the reason that an initialization statement is able to eliminate a temporary object while an assignment cannot. For example, when we write,

Matrix composite = m1 + m2;

the compiler must internally transform this to match the transformation of the Matrix addition operator. One possible transformation is the following:

// One possible internal transformation

// Psuedo C++ Code

Matrix temp;

operator+( temp, m1, m2 );

Matrix composite( temp );

temp.Matrix::~Matrix();

In order for this to work, of course, the definition of

Matrix temp; // no associated constructor call

must not be accompanied by an invocation of the default constructor. This is necessary, remember, because temp is constructed within the addition operator and we cannot initialize an object twice. temp, that is, must represent raw storage.While this transformation works, it can be considerably improved when the compiler recognizes that composite also represents raw storage. Rather than generating a temporary to copy construct into composite, it can pass composite directly into the addition operator:

// A more aggressive internal transformation

// Psuedo C++ Code

Matrix composite;

operator+( composite, m1, m2 );

Again, this is possible because composite, at this point, represents raw storage. (The internal definition of composite, above, only allocates storage; the compiler suppresses an invocation of the default Matrix constructor, the same as it did when defining temp.) The assignment is a totally different puppy:

composite = m1 + m2;

composite, at this point, represents initialized storage. It cannot be directly passed into the addition operator, which would result in a second initialization. Rather, the compiler must generate raw storage, then copy assign the result to composite:

// internal transformation of assignment

// Psuedo C++ Code

Matrix temp;

operator+( temp, m1, m2 );

composite.Matrix::operator=( temp );

temp.Matrix::~Matrix();

Because the C programmer by the C declaration syntax must always, within the C language, write

Matrix composite;

composite = m1 + m2;

and typically carries that habit into C++, the C programmer tends to write code that is inexplicably inefficient while on the surface perfectly correct.

So, how can we help James discover whether Visual C++, or any C++ compiler that he uses, implements the NRVO? Well, you need to write a piece of code in which the copy constructor is either present [indicating the absence of NRVO] or it is elided [indicating the NRVO having been applied]. One can either do this by executing the instrumented copy constructor [here I am, or let its silence be deafening], or by examining the generated assembly. For example,

#include <iostream>

usingnamespace std;

class foo {

public:

foo() { cout << "foo::foo()\n"; }

foo( const foo& ){ cout << "foo::foo( const foo& )\n"; }

~foo(){ cout << "foo::~foo()\n"; }

};

foo bar(){ foo local_foo; return local_foo; }

int main()

{

foo f = bar();

}

When I execute this, without any special settings, under either Debug or Release, using the Visual Studio IDE, the output is

foo::foo()

foo::foo( const foo& )

foo::~foo()

foo::~foo()

and this indicates that the NRVO is not being turned on by default, which is the only behavior that I would expect as a programmer. So, this helps underscore my belief that the ISO committee must revisit this decision.

For what it’s worth, the test you gave was similar to one I produced when resolving a performance issue in some code ported from GCC: it turns out that recent versions do in fact elide the copy construction. In that specific case, I was able to take advantage of VC’s support for URVO instead of passing by reference.