Thursday, 17 September 2015

I'm
getting increasingly tired of the network administrators at a certain
LEA. I'm going to venture that they aren't really qualified to run the
LEA's WAN...

Back at the start of July, one of our
customers reported that an application was intermittently extremely slow
or completely failed. Originally the customer thought that it was a
firewalling problem, but we identified a DNS problem as the cause - the
LEA's DNS server was taking a few orders of magnitude longer than you'd
expect to respond to AAAA record lookups for two domains that were used
by the app, and eventually responded with a failure.

A
quick explanation is probably in order here: when a client needs to
connect to a web server on the internet, it has to convert the domain
name (e.g. www.example.com) into the numerical IP address(es) of the
server(s). It does this through the Domain Name System (DNS).
Typically a (modern) client requests both "A" records and "AAAA" records
from a DNS server - "A" records list the (legacy) IPv4 addresses for
the web server, whilst "AAAA" records list its (newer) IPv6 addresses. A
web server may not have both IPv4 and IPv6 addresses, but the DNS
server still has to produce a successful response to tell the client
this.
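
To make the two lookups a bit more concrete, here is a rough sketch in Python of what the queries look like on the wire. It is an illustration only, not what any particular client does: `build_query` and the constant names are made up for this example and the query ID is arbitrary. The record type numbers (1 for A, 28 for AAAA) come from the DNS standards.

```python
import struct

# Record type codes from the DNS standards (A = 1, AAAA = 28).
QTYPE_A = 1
QTYPE_AAAA = 28
QCLASS_IN = 1  # the "internet" class


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for one name and one record type.

    Hypothetical helper for illustration; the query ID is arbitrary.
    """
    # Header: ID, flags (only "recursion desired" set), 1 question,
    # and zero answer/authority/additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is sent as length-prefixed labels, ended by a zero byte.
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, QCLASS_IN)
    return header + question


a_query = build_query("www.example.com", QTYPE_A)
aaaa_query = build_query("www.example.com", QTYPE_AAAA)
```

The two packets are identical apart from the two-byte record type field, which is why a server that answers one kind of query has no obvious excuse for mishandling the other.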

Importantly, even if the client only has a legacy
IPv4 internet connection, it may not know that it can't contact a
server using IPv6 until it actually tries, so it will usually still ask
for "AAAA" records so that it can get an address and try it. Also,
whether or not the DNS server has an IPv6 connection is irrelevant - if
AAAA records exist it is required to reply with them, and if they don't
it's required to reply saying they don't; the LEA DNS server was doing
neither.
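
As a rough illustration of the outcomes a client can expect, the sketch below reads just the 12-byte header of a DNS response and sorts it into one of a few cases. `classify_response` and its return strings are hypothetical names for this example, and it is a simplification: it assumes the reply comes from a recursive resolver and it doesn't look inside the answer section.

```python
import struct


def classify_response(packet: bytes) -> str:
    """Classify a DNS response using only its header (a simplification)."""
    if len(packet) < 12:
        raise ValueError("a DNS message header is 12 bytes long")
    _id, flags, _questions, answers, _authority, _additional = struct.unpack(
        "!HHHHHH", packet[:12]
    )
    rcode = flags & 0x000F  # the response code is the low 4 bits of the flags
    if rcode == 0:
        # Success. An empty answer section is the well-behaved way of saying
        # "this name exists but has no records of the type you asked for".
        return "records returned" if answers > 0 else "no records of this type"
    if rcode == 2:
        return "server failure"
    if rcode == 3:
        return "name does not exist"
    return f"other error (rcode {rcode})"
```

Either of the first two results is a perfectly good answer to an AAAA query; a long wait followed by an error is not.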

The LEA were informed that their DNS server
was breaking when queried for the AAAA records, so they replied saying
they had pinged a few things and done some nslookups (presumably
only for the A record!) and they couldn't see a problem.

We
ran more tests - looking up the A records worked fine (successful
response in 9 milliseconds), looking up the AAAA records failed (failure
response in 17 seconds) and looking up the AAAA records through a
different DNS server worked fine (successful response in 23 milliseconds). We even
gave them transcripts of the tests so that they would know exactly how
we tested it and would be able to reproduce it themselves.
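
For anyone wanting to try something similar, a timing test along these lines can be sketched in a few lines of Python. This is not the transcript we sent, just a minimal example under some assumptions: `time_lookup`, `system_lookup` and `report` are names invented here, and `socket.getaddrinfo` asks whatever resolver the operating system is configured to use rather than a DNS server you pick, so it can't compare two servers and it reports "no AAAA records" as a failure rather than as an empty answer.

```python
import socket
import time


def time_lookup(resolve, name):
    """Run resolve(name) and return (outcome, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        result = resolve(name)
        outcome = "ok" if result else "empty"
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        outcome = f"failed: {exc}"
    return outcome, time.perf_counter() - start


def system_lookup(family):
    """Return a resolver that asks the OS for addresses of one family."""
    def resolve(name):
        infos = socket.getaddrinfo(
            name, None, family=family, type=socket.SOCK_STREAM
        )
        # Each entry ends with a socket address whose first item is the IP.
        return sorted({info[4][0] for info in infos})
    return resolve


def report(name):
    """Print how long the IPv4 (A) and IPv6 (AAAA) lookups take."""
    for label, family in (("A", socket.AF_INET), ("AAAA", socket.AF_INET6)):
        outcome, seconds = time_lookup(system_lookup(family), name)
        print(f"{label:4} {seconds * 1000:8.0f} ms  {outcome}")
```

Calling `report("www.example.com")` from a machine on the affected network would show whether one kind of lookup takes wildly longer than the other.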

The
LEA responded with words to the effect of "well no one else has
reported this problem", so we ran the same tests from a different school
within the same area and demonstrated that they had the same issues.
Again, we sent transcripts of the tests to the LEA (* see footnote).

The
LEA then started asking whether the school was using a transparent
proxy and what the school's internal domain name is - none of this is
relevant to the problem being reported. We weren't reporting problems
with the transparent proxy, or any of the school's internal servers, we
were specifically reporting a problem with the LEA's DNS server.

We
did some further investigation and got more detail on which DNS lookups
were failing, sent this to the LEA together with more transcripts of
tests and an offer to work with them to help. Rather than asking for
our help, the LEA closed the ticket as "resolved", but provided no
explanation. We reran the tests and sent them another transcript
demonstrating that nothing had been fixed.

The problem
was originally reported at the start of the summer holidays. Two months
later the new term started - still the problems weren't fixed, still
the LEA hadn't taken us up on our offers to help them (for free!) and
now it transpires a lot more domains are affected than we originally
investigated. It's causing really serious problems for the school, so
the school started banging heads together and someone from the LEA
actually called us. I explain the problem yet again and he goes off
saying he needs to look up some more information.

Then
they start talking about transparent proxying again, and again I have to
point out that we are reporting a problem with the DNS server and that
this has nothing to do with the transparent proxy. Again, I send them
an email describing the problem, providing transcripts of tests, etc.
LEA techie tells the school that I didn't send any information and that I
just forwarded his email back to him - I'm a bit stunned about this
since it means that (1) he has never seen an email with inline comments
before, and (2) he didn't read past the first line of the email. So the
email gets resent to him.

The LEA reply with some screenshots of some tests they have done which they say show that there's no problem:

They logged into the leased line router and queried the network interface statistics that show no line errors.

They pinged a few machines.

They tracerouted to somewhere.

i.e. they didn't test the thing we actually reported being faulty.

The
LEA suggests that this is happening because they don't provide IPv6
connectivity (as mentioned above, whether or not IPv6 is available
doesn't actually change anything from a DNS perspective - clients still
look up AAAA records and DNS servers are still expected to reply).

Now
they say they've poked lots of holes in their firewall because they
"have no information on what port AAAA records would be using" (errm,
53, the same as every other DNS request in the world?!) and could we
retest - unsurprisingly it's still broken.

As far as I can see:

They haven't actually run the tests (which we've told them how to run!) to try and reproduce the problem. They've tested a few other things that were never a problem to begin with.

They don't understand enough about DNS (which is an extremely
fundamental internet protocol) to diagnose the issues - they seem to
have entered a "change something at random and see if it fixes it" phase
instead of trying to get to the root of the problem.

They are completely out of their depth - if they want to run a
reliable WAN, they need someone who is actually qualified to administer a
network. That means someone who understands how to reproduce problems,
use debugging tools such as Wireshark, etc.

They haven't handled this in a timely way at all - they had the
whole of summer to investigate, and didn't actually start looking at
anything in earnest until after the start of term.

I have spent literally hours on this problem, mostly
repeating the same explanations and tests over and over (although
strictly speaking this isn't "our problem", diagnosing and liaising with
the LEA is something we're handling as part of the customer's advanced
support contract, so we're not really being paid by the LEA for this
level of hand-holding). I honestly can't see them resolving this
problem until they reproduce it themselves and do some proper
diagnostics.

Footnote

As mentioned, part
of the LEA's defense is basically "no one else has reported a problem" -
now, not looking into a problem because it isn't affecting many people
is a pretty crummy attitude to begin with, but there are reasons why
some people would be affected and some not.

Fundamentally,
how services, such as DNS, are expected to behave is defined by
standards. These boil down to rules like "when a client sends a request
like this, the server must send a response like that".
Software that relies on these services is written to expect them to
follow the rules laid out by the standards, and there is no standard set
of rules saying how to handle a service that is breaking the rules - it
is extremely difficult to draw up a standard explaining how to deal
with something breaking the standards, simply because there are so many
ways the standards could be broken!

So you may have two different pieces of software that do basically the same job, call them A and B.
In an environment where everything is sticking to the rules, they both
work equally well since this behaviour is standardised. However, if
some service isn't sticking to the standards then they will often handle
this differently - maybe software A still works fine, but software B breaks. In a different situation the roles may be reversed, with software B working ok.

So
it's possible for a real problem, such as this, to go unreported simply
because a lot of people happen to be using software that, by chance,
isn't badly affected by the broken service. It's also possible for
problems to go unreported because people write off the problem as
"software A is broken" and so don't report the issue to the operator of the broken server.