Today we published the final report “DNSSEC in SURFdomeinen”, describing our DNSSEC deployment. This will be the last post on our DNSSEC deployment for a while; we are going to continue later this year. In the meantime, I will try to post updates when I have interesting information available, for instance about the validation rate as the year progresses.

Last Wednesday I spent some time tweaking the Cacti plugin for Unbound and managed to incorporate a live graph of the validation rate. It will be interesting to monitor the progression of this graph over the coming months, and I will try to post an update every now and again. To give something away, here is a screenshot of today’s graph (showing the validation rate as a percentage of total queries):

MTU woes

I got some news last week that initially got me worried; several colleagues were experiencing DNS problems at home since our secure delegation had been made active (or so it seemed). They had big problems resolving names under the surfnet.nl domain.

We launched an investigation and it soon became apparent that these colleagues shared a common denominator: they all had the same ISP. Luckily, we were able to contact the ISP and convince them to look into it. They quickly came back to us with the cause: UDP packets over a certain size weren’t getting back to their resolvers. Some network component had decided that fragmented UDP was bad UDP and was discarding large packets. Initially, the ISP suggested that we limit the packet size on our authoritative name servers. Although this sounds plausible, it is not a solution. They didn’t only have problems with our authoritative answers, but also with other DNSSEC-signed domains. Tracking down every single one of them to convince them to lower their EDNS0 buffer size simply isn’t feasible; besides, it amounts to solving a problem in your own network by asking other people to change theirs. Luckily, we were able to convince them of this, and they have now lowered the EDNS0 buffer size their resolvers advertise and are looking into resolving the UDP fragment problem (which is the root cause).
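To illustrate the mechanism at stake: a resolver advertises the UDP response size it can handle in the EDNS0 OPT record of its queries, and lowering that advertised size (a common choice is 1232 bytes) keeps answers below the fragmentation threshold. The sketch below builds a minimal DNS query with such an OPT record by hand; it is an illustrative stdlib-only sketch, not code from our deployment, and the function name is ours.

```python
import struct

def build_edns0_query(qname, payload_size=1232, want_dnssec=True):
    """Build a minimal DNS query packet (wire format) with an EDNS0
    OPT record advertising the given UDP buffer size. Hypothetical
    helper for illustration only."""
    # DNS header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0,
    # ARCOUNT=1 (the OPT pseudo-record lives in the additional section)
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 1)
    # Question section: length-prefixed labels, then QTYPE=A, QCLASS=IN
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack(">HH", 1, 1)
    # EDNS0 OPT record (RFC 6891): root name, TYPE=41, CLASS carries the
    # advertised UDP payload size, the TTL field carries extended
    # RCODE (1 byte), version (1 byte) and flags (DO bit = 0x8000),
    # and RDLENGTH=0 (no options)
    flags = 0x8000 if want_dnssec else 0
    opt = b"\x00" + struct.pack(">HHBBHH", 41, payload_size, 0, 0, flags, 0)
    return header + question + opt
```

A resolver behind a fragment-dropping network would send `build_edns0_query("surfnet.nl", payload_size=1232)` instead of advertising, say, 4096 bytes, so that signed answers arrive unfragmented (or fall back to TCP via truncation).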

The merits of re-signing

For the second time since becoming operational, the .be zone has DNSSEC trouble. Earlier this autumn, they had a problem where their zone (or parts of it) wasn’t getting re-signed. Today I noticed lots of validation failures for .be domains, thanks to a Nagios alarm on one of our resolvers that only triggers if the failed validation rate exceeds a certain level. It turns out that at least some of their NSEC3 records have expired signatures, which is very bad news. This affects all domains in the .be zone, not just the ones that are signed. And because domains are interconnected through shared secondary name servers, this also affects non-.be domains (!).

This once again shows the importance of sound procedures and of monitoring; it’s one thing to have a policy on when you re-sign, etc., it’s another to verify that your policy becomes hard technical fact by regularly checking your zone as it goes out on the Internet…
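Such a check can be as simple as comparing the expiration field of published RRSIGs against the clock. A minimal sketch (hypothetical, not our actual Nagios check; the timestamp is in standard RRSIG presentation format, YYYYMMDDHHMMSS in UTC, and the threshold is an assumed value):

```python
from datetime import datetime, timezone

def rrsig_seconds_left(expiration_field, now=None):
    """Seconds until an RRSIG expires, given its expiration field in
    presentation format (YYYYMMDDHHMMSS, UTC); negative means expired."""
    exp = datetime.strptime(expiration_field, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (exp - now).total_seconds()

def alarm(expiration_field, warn_threshold=86400, now=None):
    """Hypothetical alarm: fire when a signature has expired or will
    expire within warn_threshold seconds (default: one day)."""
    return rrsig_seconds_left(expiration_field, now) < warn_threshold
```

Run against the zone as served by each authoritative name server, this catches a stalled re-signing process well before validating resolvers start returning SERVFAIL.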

Hugo Salgado from Chile came up with a plausible explanation: the spike is due to both DLV and the root trust anchor being used concurrently. I consulted NLnet Labs to confirm this theory and they answered the following:

“DLV lookups are done using recursion. So if a TLD is in DLV (and has no DS in the root zone) every validatable lookup for that domain counts twice because the DLV is signed as well. As soon as the TLD moves its trust anchor to the root zone, this double counting goes away. In the period the data is for this happened for both .se as well as .gov”
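In other words, as long as a TLD’s trust anchor lives only in DLV, every validatable lookup under it is counted twice. A toy model of the effect on the counters (an illustrative assumption of the arithmetic, not how Unbound actually keeps its statistics):

```python
def measured_validations(lookups_per_tld, dlv_only_tlds):
    """Sum validation counts per TLD, doubling those anchored only via
    DLV, since each validatable answer there also triggers a validated
    DLV lookup."""
    return sum(
        count * (2 if tld in dlv_only_tlds else 1)
        for tld, count in lookups_per_tld.items()
    )
```

So with, say, 100 validatable .se lookups and 50 .nl lookups, the counter reads 250 while .se is anchored only in DLV, and drops to 150 once .se has a DS in the root zone, even though real traffic is unchanged. That is exactly the drop in the graph.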

Since there was no easy way to send Hugo a real pie, he accepted a token ASCII-art one; the proof is below:

I’ve created a new validation rate graph based on the statistics gathered on our resolvers. The data is up-to-date until week 43 (this week). The graph shows an interesting trend:

In the graph you can see three lines representing three different resolver locations spread out across The Netherlands. What is interesting is that the validation rate steadily rose until week 39 and then showed a significant drop.

I am, of course, very curious to find out what caused this drop. We use both the root trust anchor and ISC’s DLV repository to retrieve trust anchors that cannot be reached from the root down.

Anyway, I will try my best to send the first person who can give me a satisfactory explanation a nice pie ;-). Winner to be announced here.

OpenDNSSEC is much more dependent on proper timing than plain DNS, mainly because of the regular rollover of keys. The last thing we would want is a validating resolver that has cached parts of a secure domain, but not enough of it to validate properly. What this means is that a lot of care must go into the design of timing, and especially into the TTLs of DNS records.

If you care for a state diagram, please click on the snippet image on the right.

To get a zone signed, you therefore need to go through a number of stages, and insert the proper delays between steps:

To go from the unsigned state to having the zone signed, we basically run the zone through OpenDNSSEC and wait until a signed zone pops up at the other end. We are cautious, however: “at the other end” means that all authoritative name servers for the zone have been updated with the signed zone.

We now proceed to publish the DS in the parent zone immediately. There is no danger of caches missing out on the signed state, as no cache should have been asked for DNSKEY or RRSIG records before the DS record is public. And since the DS is not published before all our authoritatives serve the signed zone, we are certain that answers to DNSKEY or RRSIG queries will be available to caches.

We now wait until the parent has published the DS on all its authoritative name servers. At this point we call the zone verifiable, because it is now possible for clients to validate it. We do not report this state back to the zone owner yet, but our responsibility to ensure the continuous proper state of the zone does start here. Since we sign through OpenDNSSEC, the management of this state is easily delegated to its KASP Enforcer component.

We are still not completely satisfied when a zone is verifiable; caches may hold the parent’s NS records without a matching DS. So after the DS is published, we wait the maximum caching time of the records in the signed zone. Only after that timer has expired will we announce to the zone owner that the zone is now verified.
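The progression above can be sketched as a small state machine; the state and event names below are ours for illustration, not OpenDNSSEC terminology:

```python
from enum import Enum

class ZoneState(Enum):
    UNSIGNED = "unsigned"      # plain zone, no signatures
    SIGNED = "signed"          # all authoritatives serve the signed zone
    VERIFIABLE = "verifiable"  # DS published on all parent authoritatives
    VERIFIED = "verified"      # maximum caching time has passed since then

# Each transition only fires once the condition it names has actually
# been observed to hold, e.g. by polling every authoritative involved.
TRANSITIONS = {
    (ZoneState.UNSIGNED, "signed_zone_on_all_auths"): ZoneState.SIGNED,
    (ZoneState.SIGNED, "ds_on_all_parent_auths"): ZoneState.VERIFIABLE,
    (ZoneState.VERIFIABLE, "max_zone_ttl_elapsed"): ZoneState.VERIFIED,
}

def advance(state, event):
    """Move the zone one step forward, refusing shortcuts: there is no
    path from unsigned straight to verified."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event!r}")
```

The point of encoding it this way is that skipping a wait (for instance, announcing “verified” before the maximum caching time has elapsed) is simply not expressible.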

While in this verified state, we can rest assured that the zone is either properly signed or drops off the Internet. As said above, we now rely on OpenDNSSEC to manage the zone for us, keeping the signatures fresh and the zone signed and available. If the signer stops for whatever reason, we need to repair it quickly, before zone signatures start to expire. This is why we have set up a redundant architecture.

In the verified state, we can edit the zone as always. Added zones will be included: OpenDNSSEC will pick them up and sign them in a newly published zone version. If we’d care to move to another registrar, we would have to add a DNSKEY for the new registrar’s public key; if we’d move a zone to our service, we might have to do the opposite, and/or publish the old registrar’s DNSKEY for some time. And if we decide to roll over the Key Signing Key, we’d have to communicate with the parent zone to replace the DS stored for our zone.

Reverting the zone back to unsigned state is a similar exercise, but in the opposite order. This too must be done with care, because dropping signatures too soon would make the domain disappear from the Internet from the perspective of caches that still expect a signed zone. So here are the steps we follow:

The first thing we do is retract the DS from the parent. This can be done without further ado, because it breaks the chain at the top in a safe way: the parent now serves a signed statement that our zone opts out of DNSSEC. Once all authoritatives of the parent have dropped the DS record, we consider the zone retracted.

We are not ready at this point to tell OpenDNSSEC to stop signing the zone; caches may still hold the old DS and may still rely on signed zones, so OpenDNSSEC remains responsible for keeping the zone signed. We now wait until all caches have expired any remaining DS records before we consider the zone to have arrived in the insecure state.

Since no cache now relies on signed zone records, we can remove the signatures that OpenDNSSEC created. In short, we revert to publishing the unsigned zone, leading it back into the unsigned state from which it can be signed once again if its owner is so inclined.
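The waits in this reverse path are driven by the DS TTL. A rough timeline helper, under simplifying assumptions (the parent’s propagation delay and the DS TTL are taken as known inputs; a real deployment should instead poll the parent’s authoritatives directly):

```python
from datetime import datetime, timedelta, timezone

def unsigning_timeline(ds_retraction_requested, parent_propagation, ds_ttl):
    """Compute when the zone reaches 'retracted' (DS gone from all
    parent authoritatives) and 'insecure' (DS also expired from all
    caches). Signatures may only be dropped after the latter."""
    retracted = ds_retraction_requested + parent_propagation
    insecure = retracted + ds_ttl
    return {"retracted": retracted, "insecure": insecure}
```

For example, with a two-hour propagation delay at the parent and a one-day DS TTL, signatures must keep flowing for at least 26 hours after the retraction request.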

It should be clear that publishing signed zones takes some effort, and is not to be taken lightly. On the light side though, it is perfectly possible to automate the procedure and just sit back and wait while computers roll through the scenario.

When?
This procedure is needed when one (not both) of the Hardware Security Modules (HSMs) has failed. Before starting, it should be established that repair is not possible.

What?
The replacement of a single HSM in high-availability mode is foreseen by the HSM vendor, so their procedures can be followed.

Why?
The purpose of replacing an HSM is to return to a situation where the secure key management hardware is redundant. Running on a single HSM should be considered a fragile mode of operation, because the key backups are then the only thing standing between fully functioning DNSSEC and total anarchy (a.k.a. unsigned DNS).
To maintain high-availability mode, it is vital that a failed HSM is replaced with the utmost speed. Monitoring facilities will be required to detect the need for this procedure.

How?

Establish that an HSM has failed, and that no recovery is possible.

Follow the procedures from the HSM manual to fence the broken HSM.

Order a new HSM; a service-level agreement may help to speed up this step. Have it delivered directly to the site where it will replace the broken one.

If the backup token was used with the broken HSM, consider shipping any backup-related hardware to the other location. If the ordering time is short, the only service disruption from not being able to back up is that no new keypairs can be put to use; in other words, signing will continue, but key rollovers and possibly the creation of new zones will have to wait.

Upon arrival of the new HSM, set it up immediately and integrate it with the other HSM, following the HSM manual.

Verify that monitoring tools pick up on the new HSM and feel free to take a deep breath.

When?
This is a nasty procedure that must only be performed if private key material of the ZSK and/or KSK is (or may have been) compromised. It always leads to a temporarily unsatisfactory situation: either the domain drops out of validating resolvers, or it becomes insecure. This is why our architecture virtually eliminates the chances of it ever being needed.

What?

The suspicious key material will no longer be a basis of trust in the zones it used to sign.

After the procedure, all domains are back to secure mode.

Why?
This theoretical case implies a tough decision. The contact person for a zone must decide which is worse: being invisible to part of the Internet, or being insecure. Note that the Kaminsky attack is the only widely known form of attack that makes DNSSEC a necessity, and it is known to always succeed, but only if it is given weeks of time. Weighing the two forms of badness is a human decision, and it is important to know who takes it, or to work out procedures matching your organisation’s security goals.

How?

ZSK problem.

Request a ZSK rollover from OpenDNSSEC.

Push for immediate publication of the new records.

Wait for the ZSK signatures to have expired from caches before fully trusting the affected zones.

KSK problem — temporarily dropping from the internet.

Without delays between steps, remove the affected DS from the parent, remove the affected KSK DNSKEY record from the zone, generate a new KSK, upload the new DS to the parent using their emergency procedure if they have one. Do not wait for the registry, cache expirations or anything else.

Sign the domain with the new KSK.

Publish the domain as soon as possible, actively reloading authoritatives and flushing caches where possible.

Wait for the old DS TTL to expire from caches before fully trusting the affected zones.

KSK problem — temporarily going to insecure mode.

Perform a regular KSK rollover, but force it to happen right away.

Update the parent, removing the DS as soon as possible.

Wait until the old DS has expired out of caches before fully trusting the zone’s security again.
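In every variant, the moment at which the zone can be fully trusted again is bounded by cache expiry of the old records. A back-of-the-envelope helper (the TTL values in the example are assumptions; use your zone’s actual TTLs):

```python
def emergency_wait(old_ds_ttl, old_rrsig_ttl, ksk_compromised):
    """Seconds to wait before fully trusting the zone again: the old
    DS TTL for a KSK problem, the old signature TTL for a ZSK problem."""
    return old_ds_ttl if ksk_compromised else old_rrsig_ttl

def trust_restored_after(rollover_complete_at, old_record_ttl):
    """Earliest time (same clock/unit as the inputs) at which no cache
    can still hold the old record: one full TTL after the last
    authoritative stopped serving it."""
    return rollover_complete_at + old_record_ttl
```

So with a one-hour DS TTL at the parent and five-minute signature TTLs, an emergency KSK rollover leaves a full hour in which some validating resolvers may still be judging the zone by the old, distrusted key.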