Thursday, January 15, 2009

Jeff is Wrong and don't listen to him

I rarely (I hope) post specific corrections to misleading information out there on the web, not least because there's so much of it and it's usually a futile effort, but Jeff Atwood's latest post really rubbed me up the wrong way.

In his post, Jeff writes some unidiomatic C and bewails its ugliness and pain etc.:

b1 = (double *)malloc(m*sizeof(double));

By way of comparison, he then writes some C# (I expect that's what it's supposed to be) that doesn't actually use the GC, since it's not allocated on the heap:

Double b1;

(Of course, if it's not C#, and is in fact supposed to be e.g. Java, it's still not an example since it hasn't been initialized.)

Now Jeff's core message (at least at the start of the post) that GC is a Good Thing, I'm all in favour of. I strongly believe that manual memory allocation has good reasons to be used only in about 5% (or less) of programming tasks, usually restricted to things like operating systems and embedded devices. I consider well-defined, trivially provably correct zoned allocation to be a good 80% of the way to GC, so I would include good uses of that in the 95% case.

Jeff's final point, however, about "disposal anxiety" is where he casually reveals that he doesn't pay much attention to a very important issue: disposal of resources. He mocks a particular piece of code that has at least the core concept right:

sqlConnection.Close();
sqlConnection.Dispose();
sqlConnection = null;

Two-thirds of this code is redundant - the last line is always so unless sqlConnection is read later in the same routine, while either of the first two would do for resource disposal. This wouldn't be so bad but for Jeff saying:
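The idiomatic fix, rather than the three-line incantation, is scope-based disposal. In Java terms (C#'s using block is the direct analogue), it looks something like this - FakeConnection is a made-up stand-in for illustration, not a real library class:

```java
// FakeConnection is a hypothetical stand-in for a disposable handle
// such as a SqlConnection.
class FakeConnection implements AutoCloseable {
    boolean open = true;
    @Override public void close() { open = false; } // idempotent
}

public class DisposeDemo {
    public static void main(String[] args) {
        FakeConnection handle;
        try (FakeConnection conn = new FakeConnection()) {
            handle = conn;
            // ... use the connection ...
        } // close() runs here exactly once, even if an exception is thrown
        System.out.println("open after block: " + handle.open);
        // → open after block: false
    }
}
```

One call site, deterministic, exception-safe - and nothing for the GC to worry about.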

Personally, I view explicit disposal as more of an optimization than anything else, but it can be a pretty important optimization on a heavily loaded webserver, or a performance intensive desktop application plowing through gigabytes of data.

Disposal of resources in a long-running application using GC is not a performance issue. It's a correctness issue.

Garbage collection knows about memory. With most collectors, GC is only invoked when the GC is asked for more memory than is immediately available, taking specific tuning parameters into account. In other words, the GC is only sensitive to memory pressure. It doesn't know about resources that it sees only as pointer-sized handles, it doesn't know how much they "cost", and indeed those resources might be on a different machine or even spread across many different machines.

More critically, garbage collection gets its performance surplus over and above manual memory management by not collecting until as late as reasonably possible, where "reasonable" is usually a function of how much free memory is available without swapping to disk. The long-run-average optimal collector for a program that has a machine all to itself won't collect at all until every last remaining byte of RAM has been allocated, which may of course take some time. (Added to clarify: this is a theoretical optimum, and not how most GCs act in practice. They collect much sooner - e.g. the youngest generation may fit in L2 cache and so be very fast to collect.)

Precise tracing garbage collectors work somewhat paradoxically not by collecting garbage, but by collecting live objects. The "garbage" is everything left over after all the live objects have been collected. The more garbage as a fraction of live objects there is, the cheaper, proportionally, it has been to collect. (Added to clarify: this means that the collection of garbage is amortized; any amount of garbage costs the same to collect, providing the set of live objects is held constant.) This is how GCs outperform manual memory allocation on average and with sufficient free memory. Ideally, the GC never runs at all, and program termination cleans up.
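A toy illustration of that point - a mark phase whose cost is proportional to the live set, no matter how much garbage exists. This is a hypothetical sketch, not any production collector:

```java
import java.util.*;

// A toy heap: nodes reference other nodes; "collection" visits only
// what is reachable from the roots.
class Node {
    List<Node> refs = new ArrayList<>();
    boolean marked = false;
}

public class MarkDemo {
    // Mark everything reachable from the roots. The work done is
    // proportional to the live set, not to the amount of garbage.
    static int mark(List<Node> roots) {
        int visited = 0;
        Deque<Node> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.marked) continue;
            n.marked = true;
            visited++;
            stack.addAll(n.refs);
        }
        return visited;
    }

    public static void main(String[] args) {
        Node root = new Node();
        Node live = new Node();
        root.refs.add(live);
        // Create lots of "garbage": never reachable from the root,
        // never touched by the mark phase.
        for (int i = 0; i < 100_000; i++) new Node();
        System.out.println("nodes visited: " + mark(List.of(root)));
        // → nodes visited: 2
    }
}
```

The 100,000 dead nodes cost the mark phase nothing - which is exactly the amortization argument above.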

With this insight under your belt, it should be clear that expecting the GC to clean up resources is to be ignoring one of the key benefits of GC. Not only that, but you shouldn't be expecting the GC to finalize your objects at all. If your resources must be disposed of - and almost all resources should, e.g. TCP sockets, file handles, etc. - then you need to take care of that yourself, and deterministically. Leaving e.g. file handles to be disposed of by the GC is opening up the program (and possibly the user) to odd non-deterministic failures when they find files on disk are still locked, even though the program should have closed them.
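A concrete example of the correctness failure, sketched in Java (the same logic applies to C# and IDisposable): buffered writes that never reach the disk because close() was left to the GC. The file is a throwaway temp file, purely for illustration:

```java
import java.io.*;
import java.nio.file.*;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("gc-demo", ".txt");

        // Relying on the GC: this writer is never closed, so the
        // buffered characters may never be flushed to disk - a
        // correctness bug, not a missed optimization.
        new BufferedWriter(new FileWriter(p.toFile())).write("lost?");

        // Deterministic disposal: close() runs when the block exits,
        // flushing the buffer.
        try (BufferedWriter w = Files.newBufferedWriter(p)) {
            w.write("written");
        }
        System.out.println(Files.readString(p)); // → written
    }
}
```

The first write simply vanishes; whether and when the GC would ever have cleaned up behind it is unknowable from inside the program.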

32 comments:

If that SqlConnection example is supposed to be C#, then it's even more wrong: the correct approach to managing disposal of that sort of handle is almost always a using (...) { } block. If the handle lives longer than a given scope, it should be a member of some other disposable object and cleaned up in Dispose.

I think the whole thing is a little silly myself. I think there is a whole generation of programmers today who agonize over the aesthetics of the code file and NOT the resultant assembly.

I understand there are significant code maintenance problems and issues involving legacy code... not to mention the minor time cost in physically typing less.

I think the value of knowing more primitive internal operations and having them exposed in the code is vastly underestimated. Especially when the next generation growing up on a brand new sequence of languages inherits the chimera of legacy code from this one.

I know you're trying to satirically decry the paying of attention to the leaky abstractions inherent in all programming, but it's not washing with me.

Cleaning up resources is a correctness problem. Resources are usually external to the program; creating them and cleaning them up are I/O operations. I/O is a side-effect. Optimizations shouldn't induce side-effects in ways that affect correctness.

Jeff gave a reasonable explanation for manually garbage-collecting that connection. It was not to improve performance or free up memory, it was because "your database server may be powerful, but it doesn't support an infinitely large number of concurrent connections, either."

However, setting the variable to null was almost certainly unnecessary, and I presume he meant something like

Anonymous - you don't speak for me. Jeff doesn't usually annoy me. Most of the time - particularly when he's talking about higher-level things - I tend to agree to a greater or lesser extent, and at the very least I enjoy the thoughts.

His writing gets a lot weaker when he gets into lower-level issues, though, whether it's deep CS or fundamental operations.

Right, except using the Dispose method is only tangentially related to garbage collection.

Dispose does not initiate a GC sweep; it only guarantees (or is supposed to guarantee, by interface contract) that any external resources allocated within that object are gracefully released. That could be anything from file locks to releasing the DB connection back into the connection pool. Releasing these resources usually means that the GC finalization routines can be suppressed for that object, but that's neither required nor guaranteed as part of the IDisposable interface contract.
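In Java terms, that contract can be sketched with java.lang.ref.Cleaner: close() releases the resource now and cancels the GC-time safety net, much as a well-written Dispose() calls GC.SuppressFinalize in .NET. The Resource class below is illustrative only:

```java
import java.lang.ref.Cleaner;

public class CleanerDemo {
    private static final Cleaner CLEANER = Cleaner.create();

    static class Resource implements AutoCloseable {
        // The cleanup state must not reference the Resource itself,
        // or the Resource would never become unreachable.
        static class State implements Runnable {
            boolean released = false;
            @Override public void run() { released = true; } // safety net
        }
        private final State state = new State();
        private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

        @Override public void close() {
            // Release now AND cancel the GC-time cleanup; clean() runs
            // the action at most once, so double-close is harmless.
            cleanable.clean();
        }
        boolean isReleased() { return state.released; }
    }

    public static void main(String[] args) {
        Resource r = new Resource();
        r.close();                          // deterministic release
        System.out.println(r.isReleased()); // → true, no GC involved
    }
}
```

Note that no collection happens anywhere in this code path - disposal and garbage collection are separate mechanisms that merely cooperate.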

I think you're missing the point. The point of his post is that memory management is easier with higher-level programming languages. While ultimately probably a no-duh topic, he does make the point that you don't need those three redundant lines of code just to clear the SQL connection from memory. I get the impression you skimmed over his post, felt like you got the gist of it, and then started ripping into your straw man.

Jeff's posts are often interesting, but it gets dicey when he starts pulling out real code. That's where the serious gaps in his abilities come to center stage. That's when the defenders of Jeff span out to tell the world that despite how grossly wrong Jeff was throughout that post, he *meant* something different...some higher meaning that you just aren't getting. It really is extraordinary the stretches people will make.

His lack of "optimization" would quickly lead to serious application faults because of locked files, an exhausted connection pool, and so on (you hit these limits much more quickly than you'd imagine). I consider what Jeff is pushing there seriously dangerous.

Most modern garbage collectors begin collecting memory long before the memory pressure gets high. There are two reasons for this: when the heap is small, it is often quicker to collect, and you want to play nicely with the rest of the system rather than just grabbing everything.

Second, that is not the reason for the speed of garbage collectors. GC can be faster than manual allocation primarily because it can amortize its operations: allocating memory is a pointer move and a check. Collecting memory is more expensive, but when it does happen, you collect more than a single object, and most of the garbage is reclaimed without ever being touched ;)
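A minimal sketch of that pointer-bump allocation, assuming a toy heap (nothing here resembles a real VM's allocator):

```java
// Toy bump allocator: allocation in a GC'd nursery is just a pointer
// move plus a limit check, which is why allocation can beat malloc.
class BumpAllocator {
    private final byte[] heap;
    private int top = 0;

    BumpAllocator(int size) { heap = new byte[size]; }

    // Returns the offset of the new block, or -1 when the nursery is
    // exhausted - the point at which a real VM would trigger a collection.
    int allocate(int size) {
        if (top + size > heap.length) return -1;
        int addr = top;
        top += size;
        return addr;
    }

    public static void main(String[] args) {
        BumpAllocator a = new BumpAllocator(32);
        System.out.println(a.allocate(16)); // → 0
        System.out.println(a.allocate(16)); // → 16
        System.out.println(a.allocate(1));  // → -1 (would trigger GC)
    }
}
```

Compare that to a malloc implementation walking free lists: the fast path here is two additions and a comparison.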

Calling Dispose() explicitly is not necessary, it is called automatically by the garbage collector. So it's not true that calling it explicitly is for correctness's sake, but like he said for some kind of optimization.

Jlouis - I said that the GC is only sensitive to memory pressure. I didn't say that it'll only run when the memory pressure is very high. For the CLR, for example, there has to be some pressure on gen0.

As to your second point, you've clearly misunderstood me. I explicitly pointed out that (most) GCs collect live objects, and by delaying GC until there's a good fraction of live to dead space (even if it's only in the first generation, in a generational collector), it guarantees that it's getting a good divisor on that amortization calculation.

In other words, all you are doing is saying what I've been saying, but in a slightly different way.

See this paper for more details, wherein everything is made very clear.

Let's pretend the Obj class is a resource (it has a finalizer). The Console.ReadLine call represents an arbitrary amount of time the application is blocked for. Notice how, when you run the application, the obj instance is not collected. This is because there are no allocations going on. The GC is only going to get called when there's some memory pressure. There isn't any. Thus the resource is not getting freed in a timely manner.
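The code sample the comment refers to isn't reproduced above; a Java analogue of the same experiment might look like the sketch below. A WeakReference stands in for the finalizable Obj: if the weak reference still holds, the GC hasn't run, and neither has any finalizer or cleaner:

```java
import java.lang.ref.WeakReference;

public class NoPressureDemo {
    public static void main(String[] args) {
        Object obj = new Object();
        WeakReference<Object> ref = new WeakReference<>(obj);
        obj = null; // the object is now garbage...

        // ...but with no further allocations there is no memory
        // pressure, nothing triggers a collection, and the object -
        // plus any resource it guards - just lingers.
        System.out.println("collected: " + (ref.get() == null));
        // prints "collected: false": no allocation, no GC
    }
}
```

Replace the plain Object with something holding a file lock and the lingering stops being harmless.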

If you don't understand this, I don't know how to make the point more explicitly. Jeff is wrong; it's not an optimization. If this Obj instance was e.g. locking a file on disk, and only going to unlock it in a Dispose or finalizer, this would be a flat-out bug.

Jeff Atwood is just embarrassing as a programming guru. I regularly listen to the Stack Overflow podcast because I have a great deal of time for Joel Spolsky, and it's painfully apparent that the guy is regularly out of his depth. His most recent major fiasco was a discussion of NP-completeness where Jeff plainly had no idea what the problem actually was and proceeded to define an NP-complete problem as 'just a hard algorithm that no-one has solved yet'. He's obviously got talents as a tech entertainment blogger and community builder, but programmer? No.

Anonymous @9:55, I don't quite agree with your assessment of Jeff. I think he does occasionally get some fundamental details wrong, and certainly in the more esoteric details of computer science (NP-completeness isn't generally useful in the majority of programming), but that doesn't stop talented programmers from getting things done. A good portion of talent lies in solving the right problems, while making things useful and working for the 80% cases for the end user typically does not involve deep CS insights. Even knowledge of things like Turing completeness and the halting problem aren't really necessities, save as a rule of thumb in writing compilers or code analyzers - the limiting factor on what you can deduce from code.