2011-09-30

A note on distributed computing diagnostics

I've just stuck in my first mildly interesting co-authored Hadoop patch in a while, HADOOP-7466: "Add a standard handler for socket connection problems which improves diagnostics".

It tries to address two problems:

Inadequate Diagnostics in the Java Runtime

Despite Java being "the language of the Internet" or whatever Sun used to call it, when you hit any kind of networking problem (Connection Refused, No Route to Host, Unknown Host, Socket Timed Out), the standard Java exception messages don't bother to tell you which host isn't responding, what port is refusing connections, or anything else useful. In a room with 2000 machines, it's not much use to know that one of them can't talk to another. You need to know which machine is having problems, which other machine it is trying to talk to, and whether the failure is at the HDFS level or something above it. But no, the exception text never gets any better; whoever wrote those messages clearly hadn't read Waldo's A Note on Distributed Computing, and assumed that if two machines are near each other, nothing can possibly go wrong.
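To see the problem concretely, here is a minimal sketch: open a socket to a port where nothing is listening and look at what the JDK gives you. (The exact message text varies by JDK version and platform, but the point stands: no host, no port.)

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Demonstrates the bare JDK diagnostics: the exception message for a
// refused connection does not name the host or port involved.
public class BareMessage {
    public static void main(String[] args) {
        try (Socket s = new Socket()) {
            // Port 1 on localhost: nothing should be listening here.
            s.connect(new InetSocketAddress("127.0.0.1", 1), 1000);
        } catch (IOException e) {
            // Typically prints just "Connection refused" -- which of the
            // 2000 machines? Which port? The message doesn't say.
            System.out.println(e.getMessage());
        }
    }
}
```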

Whatever they were thinking, if they tried to submit exception messages like that to the Hadoop codebase today, the review process would probably bounce them back with a "make this vaguely useful". The patch tries to fix this by taking the exception and the (hostname, port) of the source and destination (if known), and then includes these details in the exception text. This helps people like me know what's gone wrong with our configuration and/or network.
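The technique is straightforward to sketch. The helper below is illustrative only (the method name, hosts, and ports are mine, not the actual patch API): it rebuilds the exception with both endpoints in the message, keeping the original exception class so callers can still catch the specific type, and chaining the original as the cause.

```java
import java.io.IOException;
import java.net.ConnectException;

// Sketch of the idea behind HADOOP-7466: rewrap a low-level socket
// exception so its message names the endpoints involved.
public class DiagnosticWrap {

    /** Wrap an IOException so its message names both endpoints. */
    static IOException wrapWithAddress(String destHost, int destPort,
                                       String localHost, IOException e) {
        String detail = e.getMessage()
                + "; destination: " + destHost + ":" + destPort
                + "; local host: " + localHost;
        if (e instanceof ConnectException) {
            // Preserve the exception class so existing catch blocks
            // for ConnectException still work.
            return (IOException) new ConnectException(detail).initCause(e);
        }
        return new IOException(detail, e);
    }

    public static void main(String[] args) {
        IOException raw = new ConnectException("Connection refused");
        IOException wrapped = wrapWithAddress("namenode.example.org", 8020,
                                              "worker042", raw);
        // Now the message says which machine couldn't reach which service.
        System.out.println(wrapped.getMessage());
    }
}
```

The useful design point is that the wrapped exception is still a ConnectException, so nothing upstream breaks; only the message gets better.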

Inadequate understanding of the fundamental network error messages

This is something I despair of. There are people out there who haven't done enough homework to know what a ConnectionRefused exception means, and who ask for help when they see it. Again and again. The same goes for all the other common error messages.

The people trying to set up Hadoop clusters who don't yet know what these error messages mean are way out of their depth. That should be an appendix to Waldo's paper: the many layers of historical code underneath are not transparent. It helps to have read Tanenbaum's "Computer Networks", and it helps to have spent some time writing code at the socket layer, just to understand what goes wrong down there. Trying to download the Hadoop artifacts and push them out to a small set of machines without this basic knowledge dooms these people to days of confusion, which inevitably propagates to the mailing lists and bug trackers. Usually someone posts a stack trace to the -user and -dev lists, then reposts it every hour until someone answers; the total cost of the wasted time is surprisingly high.