Performance Quiz #7 — Generics Improvements and Costs

Time for another quiz for you all, this is just a micro-benchmark so we want to be careful not to conclude too much but it's useful for understanding the costs and their origins. Consider the two snippets shown below:

// choice #1, using Generics

int total = 0;

List<int> li = new List<int>();

for (int i = 0; i < 20; i++) li.Add(i);

// start timer

for (int i = 0; i < 10000000; i++) foreach (int el in li) total += el;

// end timer

// choice #2, using ArrayList

ArrayList al = new ArrayList();

for (int i = 0; i < 20; i++) al.Add(i);

// start timer

for (int i = 0; i < 10000000; i++) foreach (int el in al) total += el;

// end timer
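Assembled into one runnable sketch, with `System.Diagnostics.Stopwatch` standing in for the start/end timer comments (the iteration counts are the ones from the post; the harness itself is illustrative, not the original test code):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

public class Quiz7
{
    // choice #1: iterate a generic List<int>
    public static int SumGeneric(List<int> li, int reps)
    {
        int total = 0;
        for (int i = 0; i < reps; i++)
            foreach (int el in li) total += el;
        return total;
    }

    // choice #2: iterate an ArrayList (every element read is an unbox)
    public static int SumArrayList(ArrayList al, int reps)
    {
        int total = 0;
        for (int i = 0; i < reps; i++)
            foreach (int el in al) total += el;
        return total;
    }

    public static void Main()
    {
        const int Reps = 10000000;

        List<int> li = new List<int>();
        for (int i = 0; i < 20; i++) li.Add(i);
        ArrayList al = new ArrayList();
        for (int i = 0; i < 20; i++) al.Add(i);   // each Add boxes the int

        Stopwatch sw = Stopwatch.StartNew();       // start timer
        int t1 = SumGeneric(li, Reps);
        sw.Stop();                                 // end timer
        Console.WriteLine("List<int>: {0} ms (total {1})", sw.ElapsedMilliseconds, t1);

        sw = Stopwatch.StartNew();                 // start timer
        int t2 = SumArrayList(al, Reps);
        sw.Stop();                                 // end timer
        Console.WriteLine("ArrayList: {0} ms (total {1})", sw.ElapsedMilliseconds, t2);
    }
}
```

As in the post, the timers start after the collections are built, so construction and boxing of the twenty elements is not part of the measurement.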

Now try to answer these questions (no peeking at the comments until you've done your work :))

1) The struct enumerator of List<int> (List<int>.Enumerator) versus the class enumerator of ArrayList. The first allows direct, fast pattern-based invocation while the second makes slower interface calls.
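The difference is visible if you expand roughly what the compiler generates for each foreach — a sketch (the summing loops are illustrative, but the enumerator types are the real BCL ones):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class DispatchDemo
{
    // foreach over List<int> binds to the value-type List<int>.Enumerator,
    // so MoveNext/Current are direct calls on a struct -- no interface dispatch:
    public static int SumDirect(List<int> li)
    {
        int total = 0;
        List<int>.Enumerator e = li.GetEnumerator();   // a struct, lives on the stack
        while (e.MoveNext()) total += e.Current;
        return total;
    }

    // foreach over ArrayList goes through IEnumerator -- a heap-allocated
    // class whose MoveNext/Current are interface calls:
    public static int SumViaInterface(ArrayList al)
    {
        int total = 0;
        IEnumerator e = al.GetEnumerator();            // a class, allocated per loop
        while (e.MoveNext()) total += (int)e.Current;  // plus an unbox per element
        return total;
    }

    public static void Main()
    {
        List<int> li = new List<int> { 1, 2, 3 };
        ArrayList al = new ArrayList { 1, 2, 3 };
        Console.WriteLine("{0} {1}", SumDirect(li), SumViaInterface(al)); // 6 6
    }
}
```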

2) Allocated objects versus unallocated objects. The generic version allocates no additional memory in the loop, while the ArrayList version allocates a new enumerator object ten million times.

3) Int copying in List<int> versus int unboxing and type-checking in ArrayList. Unboxing an int is not as costly as boxing one, but there is still a null-check and a type-check that need to be performed.
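A small illustration of that cost — the ArrayList stores each int as a boxed heap object, and reading it back performs the checked unbox (the names here are illustrative):

```csharp
using System;
using System.Collections;

public class UnboxDemo
{
    public static int RoundTrip(int value)
    {
        ArrayList al = new ArrayList();
        al.Add(value);        // boxing: a new heap object wraps the int
        object boxed = al[0];
        return (int)boxed;    // unboxing: null-check + exact-type-check, then copy
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip(42));   // 42

        object boxed = 42;
        // The type-check is exact -- unboxing a boxed int as long throws:
        try { long bad = (long)boxed; Console.WriteLine(bad); }
        catch (InvalidCastException) { Console.WriteLine("exact type required"); }
    }
}
```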

There may also be additional virtual calls in the ArrayList version not present in the generic List version, such as the calls to GetEnumerator.

I thought that this might be a trick question… especially since object references are just as efficient for the machine to manipulate as integers, but the disadvantages that I pointed out earlier lead me to believe the List<int> implementation performs faster.

As a dev on the F1 Profiler team, I figured that I would take a whack at using our tool to profile this performance quiz. I posted my findings to my blog in an entry here. I ended up finding some of the issues that have already been mentioned above.

Hidden allocations and memory leaks in loops in languages like C++ are a real problem. I wouldn’t want my app to allocate and leak 280MB while it runs.

But in .NET, is this really an issue? Is the 280MB really all allocated and "held," or was the same small memory block just reused a million times? Or was the memory allocated and set aside for the GC to come along later?

One other comment: List<int>.Enumerator may be instantiated for the first time at the beginning of the foreach loop, in which case there would be an additional start-up cost for JIT compilation at the beginning of the test.

In this case, the performance of the generic list would probably be significantly improved if we ran an additional foreach loop before performing the timing tests, thereby moving the instantiation costs out of the measurement.
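That warm-up idea can be sketched like this — one untimed pass pays any JIT/instantiation cost before the stopwatch starts (Stopwatch and the counts here are assumptions for illustration, not the original test code):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public class WarmupDemo
{
    public static int SumPass(List<int> li, int reps)
    {
        int total = 0;
        for (int i = 0; i < reps; i++)
            foreach (int el in li) total += el;
        return total;
    }

    public static void Main()
    {
        List<int> li = new List<int>();
        for (int i = 0; i < 20; i++) li.Add(i);

        // Warm-up: any one-time JIT cost for the List<int> instantiation
        // is paid here, outside the timed region.
        SumPass(li, 1);

        Stopwatch sw = Stopwatch.StartNew();
        int total = SumPass(li, 10000000);
        sw.Stop();
        Console.WriteLine("{0} ms (total {1})", sw.ElapsedMilliseconds, total);
    }
}
```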

I doubt that this occurs, though, because I believe mscorlib already contains an instantiation of List<int>.

Presumably all the code for ArrayList is ngened, so if you’re not the only process on the machine using ArrayList, there’s no impact on your private memory usage for using that code. (Although I think you’ll still be paying a per-process hit for having used ArrayList type at all – ngen just means you get to share the code. If I’ve understood correctly, you don’t get to share the data structures the CLR creates for managing the type.)

But what about List<T>? I have to admit that I don’t actually know whether NGEN has some mechanism for allowing the code generated for specific instances of a generic type to be shared. As I understand it, while generics can share one set of generated methods across all reference types, the code has to be generated anew for each value type you instantiate the generic for.

So taking a wild stab in the dark, I’d guess that you’re allocating private bytes to hold JITed code specific to the List<int> instantiation of List<T>, a cost you won’t be paying with ArrayList. (And of course there may well be startup time issues here too – if other processes are using ArrayList, the ngened code in question will already be resident.)

Of course it’s possible that the calls into the collection classes might be inlined, in which case the code-sharing-thanks-to-NGEN argument doesn’t hold. If I had to guess I’d say that the call to GetCurrent in the generic case will be, but MoveNext looks big and hairy enough in ILDASM that it probably won’t be inlined.

Given where you’ve placed the "start timer" / "end timer" comments, some or possibly all of this discussion is completely irrelevant. Unless of course one of the points of this post is to illustrate how you can get misleading results if you start your stopwatch in the wrong place… You’re starting the stopwatch after creating your collection objects, so there are a whole load of costs you’re not even measuring!

>>I suspect that all the samples in CLRStubOrUnknownAddress really indicate time when execution is suspended because the garbage collector is running.

>>If this is the case, the profiler is not doing a very good job of indicating the amount of time spent in GC

That is not correct. I haven’t confirmed this yet but I’m 99% sure that the stubs in question in this case have to do with virtual dispatch over an interface. We use a sparse representation for these guys in Whidbey because of the space savings we get.

Time in GC would be charged to the GC in mscorwks — even with no symbols at all it would not appear to be a stub as there is a module to charge the time to at the very least. With the publicly available symbols you’ll get a strong indication that it is the collector.

>>You’re starting the stopwatch after creating your collection objects, so there are a whole load of costs you’re not even measuring!

I don’t think I could measure the difference even if I moved the stopwatch. We do twenty adds to the collection then we do two hundred million read operations. Those adds would have to be pretty hefty indeed to make a difference 🙂

With only a few objects allocated, the GC wouldn’t have much work to do, even if it’s invoked hundreds of times per second.

However, you’re running through a large chunk of memory a few hundred times. My money for the main culprit would be on loss of locality, with tons of cache misses and expensive RAM accesses as a result.

>>My money for the main culprit would be on loss of locality, with tons of cache misses and expensive RAM accesses as a result

That was a very good thought. When I started doing the quiz I expected the costs to be more evenly divided between the extra virtuals and the memory allocation. The measurement seems to indicate otherwise though.

First of all, it wasn’t the adds I was thinking of. I was thinking more about the price paid before you even get to start to run the method – all the stuff that’ll happen at JIT time.

And to be fair you did ask "What price do we pay… ?" Where I was going with this is that there might be startup costs which are amortized away in this particular example, but which may still be relevant in a Windows Forms application where startup time is a big deal…

There are very few Windows apps (managed or unmanaged) that I use on a regular basis where the load time is fast enough not to be annoying. So pleading that you’re iterating 20,000,000 times seems like cheating – the startup costs can still matter a great deal.

So the kind of thing I had in mind was "How much extra stuff do I have to wait for the disk to load before the code executing here even starts." In the world I happen to occupy, 10-20 second application startup time is a problem that plagues me far more than burning too many CPU cycles once I’m up and running.