Just when you think you know some code.

Once again, it has been ages since I have touched this site. And once
again I promise to be more active. … what’s up? Oh, yeah! BLOG!

I have recently been looking into the glibc resolver code. It
started out like any other troubleshooting effort, just trying to get a
good foothold and identify where things could go wrong and how to
ensure they went right. Once I got in the code… it was a real “took the red pill” sort of moment.

I often deal with how the resolver is configured, but had never needed
to consider where it lived. As it turns out, I had been imagining a
sort of PFM magic bubble whenever I thought about name resolution on
*nix-based operating systems. I generally understood that name
resolution was not handled by the kernel, but I also never imagined
that resolution occurred purely in user space. I’m not sure what I
imagined wedged between user space and the kernel. I think I envisioned
a sort of stateful shared resolver living under the mystic veil of
glibc. As you might guess, that is not the case. Every process for
itself.

Sort of…

The nscd process helps out, using its own user-land resolver to provide
resolution services over a local unix domain socket. Every other
process’s resolver can then forgo doing the work itself and just pass
requests off to nscd, which may have already done the lookup within the
result’s TTL, eliminating the need to resolve the same name for
multiple processes. Shazam! A stateful shared resolver! Except I knew
nscd was a completely optional service and still imagined a non-caching
single resolver living somewhere.
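To make the "shared resolver with a TTL" idea concrete, here is a toy sketch in Python. It is not how nscd is implemented; the class name, the `lookup` backend, and the injectable `clock` are all made up for illustration. It only shows the behavior described above: a repeat request inside the TTL is answered from the cache, and the real lookup runs again once the entry has expired.

```python
import time
from typing import Callable, Dict, Tuple


class SharedLookupCache:
    """Toy stand-in for a shared caching daemon; not nscd's real design."""

    def __init__(
        self,
        lookup: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # Hypothetical backend: takes a name, returns (address, ttl_seconds).
        self._lookup = lookup
        self._clock = clock
        # name -> (address, time at which the entry expires)
        self._entries: Dict[str, Tuple[str, float]] = {}
        # Counts how often the backend was really consulted.
        self.backend_calls = 0

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            # Still inside the TTL: answer without doing the lookup again.
            return cached[0]
        address, ttl = self._lookup(name)
        self.backend_calls += 1
        self._entries[name] = (address, now + ttl)
        return address
```

Any number of callers sharing one `SharedLookupCache` pay for the lookup once per TTL window, which is the whole appeal of pushing requests to a daemon like nscd.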

It is, indeed, every process for itself. The gethostbyname and
getaddrinfo functions (along with some group and user related
resolvers) create an instance of the resolver entirely within the
process. res_init() or, more accurately, one of its internal calls
(__res_maybe_init() is maybe my favorite) is called, initializing the
resolver. The initialization involves reading /etc/resolv.conf to load
the search suffixes, nameservers, and other configs. This could very
well be the last time that information is ever read by the process.
This is the source of the trouble I was trying to… shoot.
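The "read once, maybe never again" behavior is easier to see in a small model. The Python sketch below is not glibc's code: `parse_resolv_conf`, `ProcessResolver`, and the `read_config` callback are hypothetical names, the parser only understands `nameserver` and `search` lines, and the candidate ordering ignores details such as `ndots`. It assumes a process that initializes lazily on its first lookup and never re-reads the file afterwards.

```python
from typing import Callable, Dict, List, Optional


def parse_resolv_conf(text: str) -> Dict[str, List[str]]:
    """Pull nameserver and search entries out of resolv.conf-style text."""
    config: Dict[str, List[str]] = {"nameserver": [], "search": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 2:
            continue  # keyword without a value
        keyword, values = fields[0], fields[1:]
        if keyword == "nameserver":
            config["nameserver"].append(values[0])
        elif keyword == "search":
            config["search"] = values  # the last search line wins
    return config


class ProcessResolver:
    """Per-process resolver state, loaded on first use and then kept."""

    def __init__(self, read_config: Callable[[], str]) -> None:
        # Hypothetical callback standing in for reading /etc/resolv.conf.
        self._read_config = read_config
        self._config: Optional[Dict[str, List[str]]] = None

    def _maybe_init(self) -> Dict[str, List[str]]:
        # Only the first call reads the config; later calls reuse the snapshot.
        if self._config is None:
            self._config = parse_resolv_conf(self._read_config())
        return self._config

    def candidates(self, name: str) -> List[str]:
        """Names this process would try, in order, for a given short name."""
        config = self._maybe_init()
        if name.endswith("."):
            return [name]  # already fully qualified
        return [f"{name}.{suffix}" for suffix in config["search"]] + [name]
```

In this model, a resolver created before a search suffix is added keeps producing the old candidate list, while a resolver created afterwards (think: a freshly started process) picks up the new suffix right away.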

… wobbly transition to flashback …

Changes needed to be made to resolv.conf to add a search suffix. After
the change, nscd was restarted and the server seemed completely
functional. Command line tests worked. Our PHP code running under
Apache httpd was now able to resolve the hosts with the new suffix. All
was well.

Are you sure I can’t just take the blue pill?

Days later we started seeing periodic “unknown host” errors from PHP applications.

Why would it have worked initially, but start failing a couple days later?

I will update this post in a couple days with what we found.

In the meantime, here are a couple links to other interesting DNS resolution information