Description

Welcome to another installment of C9 Lectures covering the powerful, general-purpose C++ library, the STL. Your host is the great Stephan T. Lavavej, Microsoft's keeper of the STL cloth (meaning he manages the partnership between Microsoft and the owners of the STL, Dinkumware, including, of course, bug fixes and enhancements to the STL that ships as part of Visual C++). Simply put, Stephan is a C++ library developer.

As is Stephan's nature, he elaborates on technical details in a very substantive way. The Standard Template Library, or STL, is a C++ library of container classes, algorithms, and iterators; it provides many fundamental algorithms and data structures. The STL is a general-purpose library: its components are heavily parameterized, such that almost every component in it is a template.

This is not an introductory series on C++, so it is expected that viewers already possess a working knowledge of the language. Go ahead and open the VC++ IDE and follow along--there are many examples provided on-screen, and Stephan will monitor this thread and answer questions.

The remover struct provides the template machinery so that there only needs to be one definition of erase_if(). I prefer this to having an overload for each container type, although the fact that it doesn't work for the set containers shows that this isn't the perfect solution either.

It would be nice to have a more comprehensive set of traits classes in the STL to make writing this kind of code easier.

great lecture and i want more much much more STL videos with c++0x goodness

btw, will the initializer list be implemented in visual studio soon ? like in a coming service pack or small update or do we have to wait several years for the next visual studio release ? I'm very tempted to go over to gcc, which has most of c++0x implemented already (see http://gcc.gnu.org/gcc-4.6/cxx0x_status.html) - yet another reason, among many others, for open source: much faster releases and updates.

[NotFredSafe]> the predicate "e % 2 == 1" does not work correctly with negative numbers. I would replace it with "e % 2 != 0"

Good catch. I was going to write != 0, but while I was talking I thought that removing the negation would make things easier to understand. Fail!
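For anyone following along, here's a tiny demonstration of the problem (assuming VC's behavior for % with a negative left operand, which truncates toward zero; C++0x requires this):

    #include <iostream>

    int main() {
        int e = -3;                              // a negative odd number
        std::cout << (e % 2) << "\n";            // prints -1, not 1
        std::cout << (e % 2 == 1) << "\n";       // 0: -3 is wrongly classified as "not odd"
        std::cout << (e % 2 != 0) << "\n";       // 1: correctly classified as odd
    }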

> Here are my proposed solutions to the homework

Your erase(vector<T, A>&, const T&) requires the vector's element type and the given value's type to be identical. If they're even slightly different, template argument deduction will fail - for example, when you try to erase 0 from a vector<unsigned char>,
0 is of type int. std::remove(first, last, value) doesn't require *first and value to have the same type; in fact, they can be radically different, as long as elem == value works.
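A minimal sketch of what I mean (not necessarily the posted homework code, and the names are arbitrary): deducing the value's type U separately from the element type T lets erase(v, 0) compile even when v is a vector<unsigned char>:

    #include <algorithm>
    #include <vector>

    template <typename T, typename A, typename U>
    void erase(std::vector<T, A>& v, const U& value) {
        // remove() shifts the kept elements to the front; erase() chops off the rest
        v.erase(std::remove(v.begin(), v.end(), value), v.end());
    }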

This applies to erase(deque<T, A>&, const T&) too.

erase([forward_]list<T, A>&, const T&) isn't quite the same, because [forward_]list<T, A>::remove() does take const T&. However, because of template argument deduction, you still want to permit different types here.

Your erase([multi]set<T, C, A>&, const T&) could be more efficient. [multi]set<T, C, A>::erase(const T&) is the dedicated member function for this. Again, you'll want to permit different types.

Good work on avoiding iterator invalidation in erase_if([multi]set<T, C, A>&, P). Many people get that wrong.
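For reference, one correct way to write it (a sketch, not necessarily identical to the posted solution): advance the iterator before erasing, so the erased node's iterator is never touched afterwards:

    #include <set>

    template <typename T, typename C, typename A, typename Pred>
    void erase_if(std::set<T, C, A>& s, Pred pred) {
        for (typename std::set<T, C, A>::iterator i = s.begin(); i != s.end(); ) {
            if (pred(*i)) {
                s.erase(i++);   // i advances first, then the old position is erased
            } else {
                ++i;
            }
        }
    }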

[benyaboy]> I'd love to see a talk about using the STL with multi-threading.

I may be able to cover this, but I'd need to introduce the Parallel Patterns Library (or the lower-level boost::thread, since std::thread won't be available until VC11).

> Is there a way to insert elements into a std::map like container in parallel.

Not directly. map, like everything else in the STL, supports "the usual thread safety guarantee". That is, multiple threads can simultaneously read a single object, and they can simultaneously read/write different objects. Const operations are reads and
non-const operations are writes (construction and destruction count as writes).
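So, to insert from multiple threads, you have to synchronize the writes yourself. A sketch of the idea (std::mutex is C++0x and isn't in VC10 - there you'd reach for boost or the PPL's synchronization primitives - but the shape is the same):

    #include <map>
    #include <mutex>
    #include <string>

    std::map<int, std::string> g_table;
    std::mutex g_table_mutex;

    void insert_entry(int key, const std::string& value) {
        std::lock_guard<std::mutex> lock(g_table_mutex);   // one writer at a time
        g_table[key] = value;                              // non-const operation = write
    }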

[JeanGa]> Great article! Thank you for the great technical C++0x content on Channel9!! I'd love to get even more on those.

You're welcome, and more are on their way.

> Have you considered the following topics? - Parallel programming with STL

Looks like there's some demand for this. It's relatively advanced (first you have to be able to use the STL in serial programming) but I'll try to cover it in a few more parts.

I'll probably refer to them occasionally, but first I have to cover the basics of using the STL.

[philgates]> Here's the solution I came up with for what NotFredSafe called erase_if().

Random-access containers with elements that are pairs, like vector<pair<A, B>>, will confuse your partial specializations of remover. Additionally, bidi-or-weaker containers with elements that are pairs, like list<pair<A, B>>, will select your second partial
specialization, when you really want your primary template to be selected.

(Also, you're missing a few typenames in erase_if(), although VC isn't very good about requiring their presence.)

> It would be nice to have a more comprehensive set of traits classes in the STL to make writing this kind of code easier.

This homework problem actually highlights one of the least generic parts of the STL: its containers have several different erasure behaviors.

[Mr Crash]> great lecture and i want more much much more STL videos with c++0x goodness

Glad to hear it!

> btw, will the initializer list be implemented in visual studio soon ?

Magic 8 Ball says: "Ask again later."

(I really want initializer lists, but they're actually not in my top three list of desired features. Also, please ignore the <initializer_list> header that I accidentally left in VC10 RTM. It doesn't do anything.)

> like in a coming service pack or small update

You shouldn't expect new features to appear between major releases. We make exceptions to this rule on a case-by-case basis, but they are rare. Doubly rare for the compiler.

Great lectures, much better presentation than other MSDN videos I've seen.

Never having used much STL, I'm surprised that map doesn't have a const operator[], returning a const reference. Is this because things would get tricky if the supplied key isn't present in the map? Would an exception not suffice here? It could be considered
undefined behaviour, just like the out-of-range vector subscript you highlighted in the first video. Perhaps the intention is to guide programmers down a more robust path, since find() can indicate 'not found' through its return value without resorting to
exceptions or asserts.

Thanks for such a good talk. The introduction to map was really good, I had mostly used sets before this.

A lot of folks come from C or do not use deep C++ (like templates & STL). If you can talk about some useful idioms that can pull them into STL (and why they should), that would be cool! IMO this would also reach out to a huge audience, all those who do C-like
or C-with-classes style of programming in C++.

One example is how C-style arrays can be used almost everywhere in STL-land as vectors and thus gain the ability to be sorted and searched for free by using STL.

Looks good to me. My solution will be functionally equivalent to this, although I'll handle a few more containers.

[Andrew McDonald]> Great lectures, much better presentation than other MSDN videos I've seen.

Thanks - that means a lot to me.

> Perhaps the intention is to guide programmers down a more robust path, since find() can indicate 'not found' through its return value without resorting to exceptions or asserts.

Bingo.
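To spell it out with a small example (hypothetical names): on a const map you reach for find(), which reports "not found" through its return value:

    #include <iostream>
    #include <map>
    #include <string>

    void print_age(const std::map<std::string, int>& ages, const std::string& name) {
        std::map<std::string, int>::const_iterator i = ages.find(name);

        if (i != ages.end()) {
            std::cout << name << " is " << i->second << "\n";
        } else {
            std::cout << name << " not found\n";   // no insertion, no exception
        }
    }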

[ashwinn]> Thanks for such a good talk. The introduction to map was really good, I had mostly used sets before this.

The great thing about map and set is that because they're backed by the same data structure and they have almost the same interface, after you learn how to use one, all of your knowledge transfers over to the other.

Similarly, after you've learned how to iterate through one STL container, you've learned how to iterate through them all (as long as you're using != instead of < of course; I'll use < for indices but never for iterator loops).
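For example, these two loops have exactly the same shape; only the iterator type changes:

    #include <iostream>
    #include <set>
    #include <vector>

    int main() {
        std::vector<int> v;
        v.push_back(10);
        v.push_back(20);

        for (std::vector<int>::const_iterator i = v.begin(); i != v.end(); ++i) {
            std::cout << *i << "\n";
        }

        std::set<int> s;
        s.insert(10);
        s.insert(20);

        for (std::set<int>::const_iterator i = s.begin(); i != s.end(); ++i) {
            std::cout << *i << "\n";
        }
    }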

In math and physics, the realization that "all of these different things are really the same thing" is absurdly powerful. The STL, as designed by Stepanov, was strongly influenced by this concept. Explained poorly*, the STL can seem too abstract and "too
mathy", but my aim in these videos has been to demonstrate that this mathematical influence is actually a good thing. Almost all of the stuff in the STL behaves consistently, so it's easy to learn and easy to remember.

(* For example, starting off with iterator categories. I've mentioned them in passing, but understanding them isn't necessary to begin using vectors, which have iterators equivalent in power to pointers. I love abstraction, but the way my brain works, I
want to start off by understanding something concrete before generalizing. This is also why I explain lambdas differently than other people do. Other people start ranting about closures and captures and blah blah blah. I repeatedly hammer just one idea: lambdas
stamp out function objects, which are classes with data members (sometimes), and everybody knows how classes and data members work.)
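A small illustration of that one idea (the names here are made up): the lambda below and the handwritten function object do the same thing, and the captured variable is just a data member:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct GreaterThan {
        explicit GreaterThan(int limit) : m_limit(limit) { }
        bool operator()(int e) const { return e > m_limit; }
        int m_limit;   // the "captured" value, stored as an ordinary data member
    };

    std::ptrdiff_t count_big(const std::vector<int>& v, int limit) {
        // handwritten function object:
        std::ptrdiff_t a = std::count_if(v.begin(), v.end(), GreaterThan(limit));

        // lambda capturing limit by value - the compiler stamps out a class much like GreaterThan:
        std::ptrdiff_t b = std::count_if(v.begin(), v.end(),
            [limit](int e) { return e > limit; });

        return (a == b) ? a : -1;   // the counts are always equal
    }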

> One example is how C-style arrays can be used almost everywhere in STL-land as vectors and thus gain the ability to be sorted and searched for free by using STL.
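A minimal illustration of the quoted point: pointers are random-access iterators, so the algorithms work directly on a C-style array:

    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    int main() {
        int a[] = { 3, 1, 4, 1, 5, 9, 2, 6 };
        const std::size_t n = sizeof(a) / sizeof(a[0]);

        std::sort(a, a + n);                                  // sorted in place
        const bool found = std::binary_search(a, a + n, 5);   // then searched in O(log N)

        std::cout << (found ? "found 5" : "no 5") << "\n";
    }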

why would vector need a linear time (38:10) to re-condense itself after erasure of an element? It is contiguous, and all it holds are pointers - why couldn't memmove() be used just to move them all back one notch at once, in constant time? Of course this
is only for the case of non-inlined values, where you have an extra indirection level necessary for that, just like for the rvalue references. I guess in its quest for "efficiency" STL stores "light" values inside vector cells themselves, instead of boxing
them up and storing just a pointer to the actual value on heap. Still, the implementation could distinguish between the two cases, right?

> why couldn't memmove() be used just to move them all back one notch at once, in constant time?

Erm... "one notch at a time" and "in constant time" contradict each other. The time required to move bytes around depends on the number of bytes, thus it cannot be constant time. If you remove one pointer from a vector of 1000 pointers, 999 pointers have to be moved, right? That's linear time, not constant time.

"move them all back one notch [,] at once". I always thought of memmove() as an atomic operation (unless we move really big chunks of memory). Just like Stephan says, CPUs like to deal with contiguous blocks of memory, which fit in their caches - then moving
one or a 1000 takes the same time. Supposedly. The key here is we're not moving the 999 ptrs one by one - they all live in one contiguous block of memory, of size 4000 or 8000 bytes (32- or 64-bit ptrs), so we just need to move one memory block of size 999*4
, 4 bytes down. The target and source regions overlap, but that's OK (I think).

So you see, here each element is copied one by one - and this involves the copy constructor in C++, which will actually create a new object to be copied (the problem which move semantics came to address). This is obviously unimaginably worse than just calling memmove() once for each 100,000-long block of pointers or so (provided that the vector does actually store pointers to the actual objects somewhere on the heap) - even if pointers get copied one by one inside memmove(), still no temporary object creation/destruction is going on at all. That's exactly what move semantics does. And that's just what memmove() does, isn't it?

> I'd like to see an episode on how and why the STL has changed over the last couple of years (I think of LWG's work) and what's planned in the future.

I'll be introducing shared_ptr soon.

> Thanks for popularizing C++.

Thanks for watching.

[Will48]> why would vector need a linear time (38:10) to re-condense itself after erasure of an element?

Because it obeys the laws of physics.

> and all it holds are pointers

vector's representation looks like this (I would link to my BoostCon 2010 presentation, but their website is down):

    T * m_begin;
    size_t m_size;
    size_t m_capacity;

m_begin points to a contiguous chunk of memory that has m_size initialized elements and ultimately has space for m_capacity elements (at that point, the chunk of memory will be full, and the vector must undergo reallocation in order to store any more).

Our vector has been implemented slightly differently (our T * _Myfirst is m_begin, our T * _Mylast is m_begin + m_size, and our T * _Myend is m_begin + m_capacity; this representation is equivalent and more convenient for us), but this model is perfectly
valid and it's the one you should keep in your head.

> why couldn't memmove() be used just to move them all back one notch at once, in constant time?

memcpy() and memmove() take O(N) time (i.e. linear time) to copy N bytes. (They do the same thing, except that memcpy() forbids overlapping source and destination, while memmove() permits that.) They can be implemented with ultrafast assembly code, but they
have the same asymptotic complexity as a handwritten for-loop. It's not possible for them to be faster than O(N) with our computer architecture.

> Of course this is only for the case of non-inlined values, where you have an extra indirection level necessary for that,

There appears to be serious confusion here. Please watch Part 1 carefully - I'm pretty sure I sketched all of this out with sufficient clarity there.

> I guess in its quest for "efficiency"

The STL cares about wall-clock efficiency. No need for scare quotes.

> STL stores "light" values inside vector cells themselves, instead of boxing them up and storing just a pointer to the actual value on heap.

vector does no such thing. (You may be thinking of the Small String Optimization.)

> I always thought of memmove() as an atomic operation

It's not constant time. If you graph the number of cycles versus the number of bytes, it'll look like a straight line (possibly jagged, since it's strongly influenced by processor behavior).

Please note that the word "atomic" has a very special meaning in multithreading, and memmove() is certainly not atomic.

> Just like Stephan says, CPUs like to deal with contiguous blocks of memory, which fit in their caches

This is true, but I should clarify: computers love sequential memory access, even when the memory being accessed is bigger than the caches. If everything fits into cache, that's really awesome, but even if it doesn't, things will still be really fast. (The
reason is latency versus bandwidth. By the standards of CPUs, main memory has terrible latency but reasonable bandwidth. When accessing memory sequentially, you're limited by bandwidth. When accessing memory randomly, you're limited by latency. Caches, which
have good latency and insanely high bandwidth, attempt to hide how bad main memory is.)

The _Scalar_ptr_iterator_tag machinery is how scalars (int/double/etc.), for which this optimization is safe, are distinguished from things for which it is not (std::string/etc.). As I may have mentioned, we could be slightly more aggressive here, but currently
we aren't.

(This machinery is advanced, so I won't be explaining it in the videos any time soon. If anyone really wants extra credit, try to figure out where we're spending unnecessary cycles and instructions in the above function. Fixing this is on my todo list for
VC11.)

Such a scheme would have an odd effect: erasing the first element would reduce the vector's capacity, but erasing any other element would not. (It would also increase the size of vector objects by 33%.)

The Standard forbids std::vector from being implemented like that. You could always write your own container that does that, but it would be much better to use std::deque or std::list when you need push_front() and pop_front() (of course, they are not contiguous).
boost::circular_buffer may also be appropriate.

If you simply have to talk about contiguous subranges (i.e. you don't want to look at some elements, but you also don't need to physically erase them), you can use pair<vector<T>::iterator, vector<T>::iterator> or pair<T *, T *>.
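For instance (a hypothetical sketch), a subrange named by a pair of iterators can be handed to any algorithm without touching the rest of the vector:

    #include <algorithm>
    #include <utility>
    #include <vector>

    typedef std::pair<std::vector<int>::iterator, std::vector<int>::iterator> Range;

    void sort_subrange(const Range& r) {
        std::sort(r.first, r.second);   // operates on just that slice
    }

    // Usage: sort elements [2, 7) of v, leaving the others alone.
    // sort_subrange(Range(v.begin() + 2, v.begin() + 7));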

As for part 3, an interesting thing to cover might be unique_ptr - I've found it very useful, especially in a vector<unique_ptr<T>> where T is a complex type. The guarantees are interesting - you get random access to the elements, exception safety, and the pointed-to objects never change memory address even if the vector reallocates. (A common bug I've seen is someone takes the raw address of an element in a vector, does something which reallocates the vector, and then the pointer is invalid - but only sometimes, when the vector happens to reallocate.) Perhaps that could tie in to rvalue references if you mention it.
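A quick sketch of that guarantee (a hypothetical example): reallocation moves the unique_ptrs themselves, but the objects they own stay put, so a raw pointer to an owned object remains valid:

    #include <iostream>
    #include <memory>
    #include <vector>

    int main() {
        std::vector<std::unique_ptr<int>> v;
        v.push_back(std::unique_ptr<int>(new int(42)));

        int* raw = v[0].get();               // address of the owned int

        for (int i = 0; i < 1000; ++i) {     // force several reallocations
            v.push_back(std::unique_ptr<int>(new int(i)));
        }

        std::cout << (raw == v[0].get() ? "still valid" : "moved") << "\n";   // "still valid"
    }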

yes, memmove() is linear, but with an incredibly smaller constant factor (to the point, I was hoping, of insignificance) than assigning/copying objects one by one in a loop. My point was to always use indirection (with T** m_begin), thus making it possible to call memmove() instead (i.e. T* is scalar just like int, and std::string* is scalar too). Then each erase() will hopefully take a very small amount of time. I'm guessing, but I think for vectors of lengths in the low thousands it would make the "dumb" approach acceptably fast. Of course the smart algo with two iterators doing all the work in one pass is always better. Thanks for the explanations.

yes, but in practice they do. But Stephan did clarify something to me (that should've been altogether clear anyway) - with indirection the objects indeed won't be stored contiguously. So adding/erasing elements at random positions would improve, but access would become slower. So I guess it depends on what a given vector is mainly used for. If its elements get shuffled a lot by a given algorithm, maybe my scheme would still be beneficial.

"(This machinery is advanced, so I won't be explaining it in the videos any time soon. If anyone really wants extra credit, try to figure out where we're spending unnecessary cycles and instructions in the above function."

Isn't that micro/cycle optimization? That's what you're talking about, right?

[Will48]> If its elements get shuffled a lot by a given algorithm, maybe my scheme would still be beneficial.

One of the great things about the STL is that it lets you be very precise when requesting data structure representations. In this case, you're thinking about vector<unique_ptr<T>> or vector<shared_ptr<T>>.

[Mr Crash]> Isn't that micro/cycle optimization? That's what you're talking about, right?

Yes, it's micro-optimization. In the absence of profiling data proving that you're looking at a bottleneck, micro-optimization is usually a bad idea, but the Standard Library is one of the few places where it's reasonable. Because it has so many users, it's
likely that someone somewhere will really care about each individual chunk of code.

> But I do want to hear about how you're going to optimize it (optimization stories/talks are very interesting)

If you look at the generated assembler, it's performing unnecessary pointer arithmetic. memmove() works on bytes, so we can use unsigned char * internally. My usual saying is "you know what they say about casts, they conceal something broken", but when wielded
by experts they are useful tools.

[Sys64738]> Is boost::thread a valid free alternative?

Yes - it served as the basis for std::thread, and is actively maintained again.

[BigHands]> Thanks for posting this great 2nd lecture, Stephan. You're giving us all a fantastic look at the STL.

I'm glad you find it useful. We're filming Part 3 on Friday, August 6.

[Mr Crash]> Well, I think that every single chunk of code should be as perfect as possible.

> "Every piece of code should get its fair share of love"

All code should be correct. Most code should not be micro-optimized. Premature optimization is the root of all evil, and constantly worrying about cycles and bytes tends to damage code's correctness and make it harder to understand. Avoiding gratuitous
pessimization is also important, but that's different.

> How did you find it ?

I was looking at that code for another reason, and wondered how efficient it was, so I dumped the generated assembly (compile with /FAsc).

> Do you use a (special ?) profiler or do you go over the assembler code manually (hardcore style) ?

I've used VS's profiler, and I've looked at assembly, but I'm not a profiling expert, and I can barely read assembly.

> Perhaps make a video of finding and optimizing (hard to find) bottlenecks ?

> (And perhaps teach how to avoid common mistakes)

I've already talked about asymptotic complexity (getting that right is very important - all of the micro-optimization in the world won't save you if you're using a quadratic algorithm when a linear one with reasonable constants is possible). When it comes
to shared_ptr I'll be talking about make_shared<T>()'s performance impact. However, because this is an introductory series, I will be strongly focusing on writing correct code. When you can do that quickly and easily, you have more time left to improve the
performance of your application.

I wonder if we could get one of our performance gurus, Bruce Dawson (from IE Perf, formerly from Xbox Perf) to do a Channel 9 video. Maybe he already has and I haven't heard about it.

erm, maybe you use makefiles instead of VS2010 project files, since manually writing project files is a waste of time and not really practical.

> I've already talked about asymptotic complexity (getting that right is very important - all of the micro-optimization in the world won't save you if you're using a quadratic algorithm when a linear one with reasonable constants is possible).

Since you mentioned it, perhaps you can teach how to measure/analyze algorithms to figure out whether they are linear, quadratic, log n, etc.

When I hear log n, linear, quadratic, I get a bit confused, since I don't come from a computer science background, so can you explain a bit more and go a bit into how to measure algorithms, or is this too complex a topic for this series? A hint for future videos, perhaps.

> I was looking at that code for another reason, and wondered how efficient it was, so I dumped the generated assembly (compile with /FAsc).

That's good to know, thanks

Is this how you normally/a good way to analyze bottlenecks (compile and go over the assembly)? Or are there only special cases where this is necessary, and what are the hints to look for that will tell you that this technique is needed?

> I wonder if we could get one of our performance gurus, Bruce Dawson (from IE Perf, formerly from Xbox Perf) to do a Channel 9 video. Maybe he already has and I haven't heard about it.

- Warning for microsoft lovers, beware

Is this one of the guys that worked on IE 8 ?

I know you understood me, but let's just talk about the elephant in the room. I meant good performance as in fast, not how to make the worst kind of bloatware possible that you can't uninstall completely (always some crap left behind because it's "needed") and that you can't remove anyway, since you can't get to the update site without IE (Microsoft really knows how to force people to use their products: just make it impossible for the competition to actually compete, *hint hint* EU, here's your next target).

Oh my, this elephant is big, but let's stop here.

So to the point: is he the right guy? A real performance guru? If he's an IE performance guru, aka bloatware guru, then I think I'll pass.

No offense meant, just pointing at the obvious elephant in the room.

- Warning End

Yes, I am bitter about IE and the IE-exclusive update site (to name just a few Microsoft failures).

Metapad. It wraps a Rich Edit control, so it's more powerful than Notepad, which wraps an Edit control.

> Also, I'm guessing you use the command line to compile?

VS (like Windows and Office) is built from the command line. "build -cZP" kicks off our build (and I don't pretend to understand how it works - I know how to hack bits and pieces of it). When compiling examples or test cases (consisting of a single file,
or a small handful at most), I also use the command line. "cl /EHsc /nologo /W4 meow.cpp" is my minimal command line; it enables conformant exception handling, silences a noisy banner, and uses warning level 4. At home, I use handwritten makefiles (which are
smarter than they sound, since they automatically build new files and automatically generate dependencies). I can't tolerate build system complexity (I have a limited amount of patience for complexity, and I need to spend it all on C++), so I avoid it as much
as possible.

> Since you mentioned it, perhaps you can teach how to measure/analyze algorithms to figure out whether they are linear, quadratic, log n, etc.

Computational complexity is a vast field, but the basics are pretty easy to learn, and the advanced stuff almost never comes up in day-to-day programming.

You've got a problem of size N. N is the amount of stuff you have to deal with. For example, you've got a vector of N elements. Then, you write an algorithm to do stuff to the stuff. That algorithm is going to take some amount of time to do its stuff. In
the real world, we care about wall clock time (the actual amount of time that it takes). That's messy, because some fundamental operations are faster than others. (For example, incrementing a register takes a cycle or less, while dereferencing a pointer may
take many cycles if the destination isn't cached and we have to go all the way to main memory.) Fortunately, when analyzing asymptotic complexity, we get to ignore the details. All we have to do is count the fundamental operations that the algorithm performs.

It gets better - we get to ignore almost everything. Imagine that we can prove that the algorithm will perform at most 3 N^2 + 17 N + 42 operations (we're typically interested in the worst case scenario, but it's also possible to analyze the "expected" scenario
for things like hash containers). We keep the highest power of N, and we throw away its coefficient too. So, that algorithm would be O(N^2). (This is
http://en.wikipedia.org/wiki/Big_O_notation .)

In practice, if the algorithm isn't especially twisty, you can usually look at what it's doing and figure out the complexity without doing anything like counting operations.

O(1), constant time - This means that the time taken (number of operations performed) doesn't depend on the problem size N. It could still be big, but it won't get bigger as N gets bigger. Fundamental operations (increments, dereferences, everything in the
core language) are obviously constant time. Using op[]() to get the Ith element from a vector of size N is constant time - the vector takes a pointer, adds I to it (constant time), then dereferences it (also constant time), and gives you the result. Because
we throw away coefficients, we don't call this O(2) or anything like that, it's just O(1), because it doesn't depend on how big N is.

Coefficients matter, but constant time can be thought of as blazingly fast.

O(log N), logarithmic time - This often comes up when dealing with balanced binary trees. If you put N elements in a balanced binary tree, it's going to be log N nodes tall. (A bit of math: because we're throwing away coefficients, it turns out that we don't
care about the base of the logarithm. That's why we don't bother writing it out, although usually the base is 2.) map/multimap/set/multiset are balanced binary trees, so their find() is O(log N), because it has to walk from the root to a node, and that node
might be at the bottom of the tree. Their data structure also guarantees O(log N) insertion and erasure - this part is magic (you have to carefully follow the algorithm to prove that it works and has that complexity), but you can treat it as a given since
the STL has already been implemented for you.

Binary searches (e.g. std::lower_bound() on a sorted vector) are also logarithmic time.

Logarithmic time is still very fast. If you have a map with 1000 elements, it's about 10 nodes tall. If the map grows to 1,000,000 elements, it's only about 20 nodes tall. Your problem got 1,000 times bigger, but your algorithm got 2 times slower.

O(N), linear time - This one should be the easiest to understand. It's what happens when you take N elements and do something that takes constant time to each of them. std::count_if() is one of many examples. It looks at each element, and counts how many
satisfy some predicate (a function object taking an element and returning bool).

O(N log N), somewhat jokingly called linearithmic time - In general, sorting N elements takes O(N log N) time. It can't be faster (although you can write algorithms that are slower, obviously), unless you know something special about the elements (like,
they're the suffixes of a string). std::sort() is now guaranteed to be O(N log N). The algorithm quicksort is usually O(N log N) (and its coefficients are small, meaning that it's actually quite fast in wall clock time), but it sometimes takes O(N^2) on "bad"
or pathological inputs. In C++98/03, std::sort() was allowed to display that behavior, but it is now required by C++0x to never be slower than O(N log N). (VC has done this since at least VC8.)

Because log N is cheap, N log N is comparable to linear. It's not quite as good, of course.

O(N^2), quadratic time - This happens when you take N elements, and do something that requires O(N) time to each of them. For example, naively scanning for duplicates has this behavior.

Quadratic time is generally bad (in some applications, quadratic is the best you can do, but if you can avoid it you generally should). It has the undesirable property of making algorithms take forever for seemingly small inputs. For example, suppose N grows
from 1000 to 10,000 (maybe it's the number of mail messages in my account). The problem became 10 times bigger, but a quadratic algorithm would be 100 times slower. Users will notice that.

There are other notable complexities (O(2^N), exponential time, makes quadratic look fast), but these are the common ones encountered every day.

> Is this how you normally/a good way to analyze bottlenecks (compile and go over the assembly)? Or are there only special cases where this is necessary, and what are the hints to look for that will tell you that this technique is needed?

Profiling will tell you where your time's going. These days, common offenders are things like the network and the disk, external stuff that takes forever. If the culprit is a poor algorithm, fixing that will be a massive win - that's macro-optimization.
Understanding asymptotic complexity makes it easier for you to identify and fix poor algorithms, and especially to avoid writing them in the first place.

If computation is actually your bottleneck (as opposed to non-computational things like the network), and your algorithm is smart, then it's time to look at codegen, and assembly is a good way to do that.

> Is this one of the guys that worked on IE 8 ?

He's working on IE9. I'm not sure when on the IE timeline he joined the team.

> So to the point: is he the right guy ? a real performance guru ?

If you think I know what I'm doing, then you should believe me when I say that he knows what he's doing.

I read several more mathematically oriented explanations, but they were too abstract and more "smoky", more vague; yours, instead, is very clear and effective.

About performance, I'm looking forward to the interview with Bruce Dawson (BTW: I agree the performance of IE8 is not very good - I do hope that his contributions will make a difference in IE9).

Speaking of performance, I used to think that when you really want the top performance, you should abandon STL and use a more C-like programming style. For example, there was an interesting series of blog posts about a Chinese/English Dictionary reader:

[...] Of course he used available lower level libraries to do this, but that's still a lot of work. Can you call what's left an STL program? I don't think so [...]

So, is it correct that when top performance is required (e.g. in a tight loop executed lots of times or in some other critical portion of code) we should abandon STL and use lower level C-like programming styles? Or is it possible to still use STL successfully
in these scenarios?

In Microsoft you guys write software used by millions of people worldwide: do you use STL in performance critical parts of your software?

> So, is it correct that when top performance is required (e.g. in a tight loop executed lots of times or in some other critical portion of code) we should abandon STL and use lower level C-like programming styles?

As I mentioned in Part 1, the STL is a general library for pure computation. Because it was designed by a genius (Stepanov) and implemented by experts (e.g. Dinkumware) who can wield the full power of the Core Language (and the occasional magic compiler
hook), it's extremely efficient in a vast range of scenarios (when used properly, as this series is trying to introduce). Whenever it uses algorithms (e.g. search_n()), data structures (e.g. map), or implementation tricks (e.g. make_shared<T>()) that are beyond
the reach of non-library programmers, it can easily outperform their handwritten low-level code. However, because the STL is a general library, it can't be perfectly tuned for every scenario. With enough skill and effort, handwritten low-level code can always
outperform it. That has costs, though - not only the skill and effort required, but the resulting fragility and resistance to modification of the low-level code. The job of an STL maintainer (like me) is to minimize the frequency with which library users have
to resort to low-level code.

The vast majority of a typical C++ program should use the STL. In hotspots where the STL is introducing overhead (the vast majority of code is not a bottleneck, and in some bottlenecks the STL will not be imposing any overhead - as few as possible if your
STL maintainers have done their jobs well), or where you have special knowledge that a dedicated algorithm or data structure would do better, you can use handwritten low-level code. Very, very rarely, you have to go all the way to assembly. (One obvious example
is the inner loop of a video encoder.)

Note that because the STL is a library, not a framework, it never gets in your way when you need to fall back to lower-level code. It just wants to help you avoid that as much as possible.

In your linked blog posts, Raymond Chen was bottlenecked by iostreams' getline() and code conversions. iostreams (which is part of the C++ Standard Library, but not what I call "the STL proper") is a disgusting mess in terms of implementation and performance.
It is widely agreed (even by Standardization Committee members) that the specification of iostreams makes it really, really difficult to write efficient implementations. (The main offenders, as I understand it, are lots of virtuals and lots and lots of customization
points.) We fixed some iostreams perf problems in VC10 (replacing single-character operations with multi-character operations) and we're fixing more in VC11 (replacing locks with interlocked operations), but it's like pushing a cubic kilometer of jello.

Another bottleneck has been solved; in part 5, Raymond talked about copying strings. C++0x/VC10's move semantics fixes this, usually automatically, sometimes with a move() or two. (Unnecessary copies were the biggest performance problem with the STL proper
in C++98/03.) Even constrained by the C++98/03 Core Language in VC8 and VC9, our STL implementation did the best that it could. When faced with a vector<string> or vector<vector<T>> (or any element type itself from the STL), vector would swap instead of copying
during reallocation (which I called the Swaptimization).

> In Microsoft you guys write software used by millions of people worldwide: do you use STL in performance critical parts of your software?

Some teams do. For example, the compiler front-end (C1XX, which parses C++) uses the STL extensively. Other teams have what we politely refer to as "legacy codebases"; for example, the compiler back-end (C2, aka UTC, which optimizes and generates assembly)
is essentially C. They're slowly converting it to C++. Then there are teams that are suffering from exception denial - Outlook (my former team) was an example. After I left, though, they did the work necessary to enable exception handling, and are now beginning
to use the STL - replacing some of their data structures with the STL significantly increased performance.

In fact, during VC8 and VC9, the major performance headache with the STL was that we enabled our security checks in release mode by default (_SECURE_SCL), which in worst-case scenarios could harm perf by around 2x. That has been solved in VC10, where the
security checks default to OFF in release mode (and have been reorganized into the easier-to-understand _ITERATOR_DEBUG_LEVEL setting).

I really dislike the code cout << "(" << i->first << … I would rather read it as cout << (*i), or even cout << (map) for the whole loop, because this is what you are essentially trying to do.

Thanks for sharing your thoughts about IOStreams. This is the part designed by His Majesty Bjarne I, and if I did not know that I would say it was written by an electrical engineer. Isn't it time to do something about that mess?

I am missing the possibility to use std::set<int[2], Predicate>.

I am missing the possibility to replace the key with another one that compares equal to the original. This is an action that would not damage the structure.