Posted
by
kdawson
on Friday August 29, 2008 @07:14AM
from the but-it's-the-right-character dept.

An anonymous reader writes "According to a thread on the bind-users mailing list, there is nothing inherent in the DNS protocol that would cause the massive vulnerability discussed at length here and elsewhere. As it turns out, it appears to be a simple off-by-one error in BIND, which favors new NS records over cached ones (even if the cached TTL is not yet expired). The patch changes this in favor of still-valid cached records, removing the attacker's ability to successfully poison the cache outside the small window of opportunity afforded by an expiring TTL, which is the way things used to be before the Kaminsky debacle. Source port randomization is nice, but removing the root cause of the attack's effectiveness is better." Update: 08/29 20:11 GMT by KD: Dan Kaminsky sent this note: "What Gabriel suggests is interesting and was considered, but a) doesn't work and b) creates fatal reliability issues. I've responded in a post here."

There is a small window of time when a malicious record could be cached by ANY DNS server. (Port randomization makes guessing the correct port to hit much harder) Bind (and only bind) has/had a huge fucking bug that opened that window of time.

I don't think hacking every DNS server has ever been the solution of choice.
Maybe update your record and serial number, then reload the authoritative server if needed. And for the ones you don't control, well, wait.

So don't keep your TTL so high? On the other hand we changed from a T1 to a fiber line the other day and all our IP addresses changed. It's been a nightmare trying to get them changed properly. I had them preemptively set to a low TTL for a few days to give things time to clear from cache but we still are getting weird fluctuations. Some servers are showing the old addresses still more than a week later. Some are alternating between the new and old address. Some have just decided to give addresses that are

Bind is effectively the reference implementation, so probably, or they made the same mistake at any rate. That's not surprising, this is a very subtle bug that requires knowledge of the Kaminsky attack to recognise. It's worth pointing out however that djbdns had source-port randomisation from the start as a defensive measure, and thus remained very resistant to this attack.

No, this solution is basically breaking the DNS functionality that Kaminsky exploited. By design, the referral records were supposed to overwrite the cache (which some organizations do use). This patch breaks that.

That seems accurate to me. After all, what happens when a DNS record gets updated? With the new behavior, you won't see the change until your cached record expires. That may be preferable to a gaping security hole which lets attackers poison your cache, but I don't think it's accurate to call the issue a bug in BIND. I believe BIND was working as intended to allow updated records to overwrite older ones.

This is what the DNS books I've read say happens. When I first started playing with DNS I was always surprised and could never explain why my updated records became active before the old record's TTL expired. Sounds like a bug that's been needing to be fixed for a long time now.

These are the TTLs of the name servers for a domain. One hopes there's more than one. Further, one hopes they are widely separated. This is totally DNS "standard operating procedure". We're not talking about a highly coherent transaction processing system, here. Availability is the goal. Achieved admirably, given the age and design history of DNS...

... what happens when a DNS record gets updated? With the new behavior, you won't see the change until your cached record expires.... I believe BIND was working as intended to allow updated records to overwrite older ones.

That's what I thought too, at first. It seems right to replace cached records with newer ones. And perhaps that's why the code is written as it is.

But then I got to thinking: if the server has a cached record already, it wouldn't have asked for an update. So why is the new information

how come the same vulnerability is present on other DNS servers as well?

It isn't. djbdns [cr.yp.to] for example, is not affected. I don't think maradns is affected either.

Do they all use the same code from BIND for this particular 'feature'?

Very likely.

BIND has a very permissive license; most other DNS servers exist to facilitate lock-in with a particular vendor's stack, or to push some enhanced feature set, so they'd be considered foolish if they didn't copy BIND's source code where they could.

If this is indeed not a protocol flaw,

Well, I'm not sure it is unfair to call this a protocol flaw. Maybe a design flaw.

BIND has resisted port randomization because "the RFC said so", never mind that they wrote the RFC, and that no clients bother checking. Because it stopped spoofing attacks ten years ago, and it stops them today, most DNS servers, including those derived from BIND, do this.

BIND also uses these very complicated credibility rules for determining if it can override existing cache-knowledge. This can presumably save one or two queries per dot, but surely it would be safer to only cache answers to questions that were asked. That is, by the way, what djbdns does.
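A minimal sketch of that "only cache answers to questions you asked" rule, with hypothetical structure and function names (this is not djbdns's actual code, just an illustration of the acceptance check):

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Hypothetical, simplified view of an outstanding query. */
struct question {
    char name[256];   /* e.g. "www.example.com" */
    int  qtype;       /* e.g. 1 for an A record */
};

/* Accept a response record into the cache only if it answers exactly
   what we asked; unsolicited NS/glue records for other names are
   dropped instead of being run through per-case credibility rules. */
static bool may_cache(const struct question *asked,
                      const char *answer_name, int answer_type) {
    return answer_type == asked->qtype &&
           strcasecmp(answer_name, asked->name) == 0;
}
```

The trade-off mentioned above is visible here: dropping unsolicited records costs an extra query or two per delegation, but nothing an attacker stuffs into the additional section can reach the cache.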

Most DNS spoofing attacks could also be solved by solving blind spoofing attacks in general. There's a little reluctance to do so, because it would make things like DNSSEC largely obsolete for their intended audience. As a result, we see a lot of chest-thumping and stomping temper tantrums. You can tell when you're about to get into one because they start by saying "If we had just switched to DNSSEC by now, we wouldn't be having this problem."

Of course, since BGP peers now route-filter everywhere on the internet (they didn't used to!), mandatory source filtering is a completely possible and realistic way to stop this and other similar problems...

I'm sorry, that link indicates that there's another bug in BIND. It doesn't say djbdns and MaraDNS are vulnerable to this attack; merely that source-port randomization is not enough to withstand a 1 Gb/sec sustained attack.

Please point to a link that says that attack works on djbdns and MaraDNS. I cannot find anything that supports your statement.

I haven't looked at MaraDNS, which is why I said I don't know about it.

I don't see how that attack would work against djbdns because djbdns doesn't accept answers

I'm not sure what world you're in but BGP peers do not route-filter everywhere.

Prove it.

Publish a route for 207.68.160.190/32 to any AS besides AS8075. Your publishing must be visible from AS21863. I will monitor and log ALL BGP announcements I receive for 207.68.160.190/32 for the next two weeks.

If you can do this, I would love to know what your ISP is.

Every ISP I've dealt with route-filters their peers to prevent you from doing that. It would be a small thing for them to source filter packets as well, and

If this is indeed not a protocol flaw, how come the same vulnerability is present on other DNS servers as well?

Do they all use the same code from BIND for this particular 'feature'?

No.

The /. description of that thread is inaccurate, and the behavior of BIND in breaking trustworthiness ties (which are set up by RFC 2181) in favor of apparently newer records is not a bug, but rather a behavior which has been operationally useful and normal for most of the history of DNS. If you look closely at Dan Kaminsky's discussion of how he came to recognize the vulnerability, it becomes clear that he was using that normal behavior and put together all of the pieces of the attack from the fact that

This is not the first time a huge security vulnerability was fixed by changing a single character!

Yeah. I once wrote a web application and in one of the auth checks, I put a '==' where I wanted a '!='. Fortunately in that case one of the testers caught it and it never actually went to production, but sometimes we all make silly mistakes.
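A toy version of that one-character auth bug (hypothetical code, not the poster's actual application): '==' where '!=' was intended inverts the check, granting access to exactly the user who should be denied.

```c
#include <stdbool.h>

/* BUG: authorizes only the banned user -- '==' should be '!='. */
static bool buggy_is_authorized(int uid, int banned_uid) {
    return uid == banned_uid;
}

/* One character later: everyone except the banned user is allowed. */
static bool fixed_is_authorized(int uid, int banned_uid) {
    return uid != banned_uid;
}
```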

There are a lot of bugs fixed by changing one character... It is a very common occurrence. Either you comment out a feature that isn't needed but is causing a problem, or you change a default variable or a constant to a different value. E.g., original code (just making it up on the fly) with a possible security-hole bug:

char x[9];                          /* x is populated from a char * variable */
for (register int i = 0; i <= 9; i++) {
    /* doing some stuff on x[i] */
}
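The one-character fix: '<' instead of '<=' keeps the index in 0..8, the only valid positions in a 9-element array. A sketch mirroring the made-up example above:

```c
/* Touch every element of a 9-byte array without walking past the end. */
static int touch_all(void) {
    char x[9];
    int touched = 0;
    for (int i = 0; i < 9; i++) {   /* was: i <= 9, one element past the end */
        x[i] = (char)i;             /* doing some stuff on x[i] */
        touched++;
    }
    return touched;                 /* 9 iterations, no overflow */
}
```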

Which will return the size of a char * (pointer to character) on your system (typically 4 bytes), _not_ the length of the array. There is no way in C to get the length of an array after it's been allocated. Arrays are 'stupid' chunks of memory, not objects with properties.

Which will return the size of a char * (pointer to character) on your system (typically 4 bytes), _not_ the length of the array. There is no way in C to get the length of an array after it's been allocated. Arrays are 'stupid' chunks of memory, not objects with properties.

Huh?

char x[9];
printf("%d\n", (int)sizeof x);

will print 9 exactly as required.

There are a handful of cases where arrays do not decay to pointers. This is one of them.

I've used that construct (and in many cases sizeof(x)/sizeof(x[0]) for getting count of elements that aren't byte sized ) for more than a decade over many compilers without issues. You need to relearn C/C++.

Actually, I'd just as soon forget it. C was a very thin layer over assembly language, and C++ is an absolute abomination containing the worst of procedural and object-oriented languages (new and malloc in the same language? Oh my!) What I was describing happens after you've used your variable as a parameter to a function. As soon as you're no longer in the same block as the declaration, you lose the original size data. That's a fairly dangerous thing, IMHO.
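Both sides of this exchange can be reconciled in a few lines: in the declaring scope, sizeof sees the whole array; once the array is passed as a function parameter, it has decayed to a pointer and the original size is gone.

```c
#include <stddef.h>

/* In the declaring scope, sizeof sees the whole array. */
static size_t size_in_scope(void) {
    char x[9];
    return sizeof x;            /* 9: the full array */
}

/* An "array" parameter has already decayed to char *, so sizeof
   yields the pointer size (typically 4 or 8), not 9. */
static size_t size_after_decay(char x[9]) {
    return sizeof x;            /* sizeof(char *), NOT the array length */
}
```

The sizeof(x)/sizeof(x[0]) idiom mentioned above is safe only in the first situation, which is why it is usually written as a macro used at the array's declaration scope.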

It's not the first time that a huge defect has been caused by a single character. Quoting Code Complete [wikipedia.org], which in turn was referencing an article from the early '80s (Kill That Code!, Gerald Weinberg): "...three of the most expensive software errors of all time, costing $1.8 billion, $900 million and $245 million, involved the change of a single character in a previously correct program."
So on the one hand, it's easy to sort of snicker that they were so close to having a correct implementation, but just m

It does not fix the case where the attacker first tries to poison aaaaaa.example.com, aaaaab.example.com, ..., fc4dss.example.com until he succeeds, with the glue record being the real evil. In that case there is no previous cache entry to rely on.

It is indeed an issue: the injected record is trusted because it originates from example.com, but the evil bits are in the glue record, which goes on to hijack the www.example.com record. Without really knowing BIND, I assume the patch does not work in that case.

Yes, the whole point of this patch is to fix this problem. Previously, if I successfully passed a bad record for safdsaus.example.com, I could send glue records for www.example.com that would overwrite your cached record for www.example.com no matter what. With this patch I can only pass bad glue records if the TTL on your cached www.example.com record has expired. This gives an attacker a very narrow window during which they could mount this type of attack, likely making it not worth the effort.
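A sketch of the patched tie-break, under the assumption that cache entries carry an absolute expiry time (the names here are made up for illustration, not BIND's internals):

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical cache entry: expiry is insertion time plus TTL. */
struct cache_entry {
    time_t expires;
};

/* Patched behavior as described above: an unsolicited NS/glue record
   may replace a cached one only once the cached record's TTL has run
   out; before that, the still-valid cached record wins. */
static bool accept_overwrite(const struct cache_entry *cached, time_t now) {
    if (cached == NULL)
        return true;               /* nothing cached: accept the new record */
    return now >= cached->expires; /* otherwise only after the TTL expires */
}
```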

What's the downside to my patch? I guess we are now holding an authoritative server to the promise not to change the NS record for the duration of the TTL, which is kinda what the TTL is for in the first place :)

I wonder if this is an issue. Otherwise it seems Kaminsky may really have missed the point.

It does sound like an issue. Suppose an authoritative server responds to a query with a TTL of five minutes. That means it must not change the record during the next five minutes. After one minute the domain owner makes some change. Okay, there will be a lag of four minutes before it fully takes effect. Fine. But what if a second request is received a minute after the change? The authoritative server has to know that it has a change queued up to take effect in three minutes' time, and serve a reply w

That's not how caches work. There is no guarantee that the authoritative server won't give out different responses before the TTL expires. The TTL just means that the resolver may cache the value for that duration. If the value changes during that time, the effect is just like when the server does DNS round-robin load balancing: this resolver uses a different value than other resolvers. Whether that is a problem depends on the validity of the resource, not on a server-side decision to stick with an answer or to change it before the old value's TTL expires. When you change DNS records, you always keep the old resource up until you see only a low volume of requests to the old resource. There are way too many caches which ignore the server-defined TTL and use their own minimum TTLs.
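Those resolver-side minimum TTLs amount to clamping the server-supplied value into a local range, roughly like this (the bounds are hypothetical policy values, not from any particular resolver):

```c
#include <stdint.h>

/* Clamp a server-supplied TTL into a locally configured range. */
static uint32_t effective_ttl(uint32_t server_ttl) {
    const uint32_t min_ttl = 60;     /* floor: how caches "ignore" short TTLs */
    const uint32_t max_ttl = 86400;  /* cap: one day */
    if (server_ttl < min_ttl)
        return min_ttl;
    if (server_ttl > max_ttl)
        return max_ttl;
    return server_ttl;
}
```

This is exactly why lowering a TTL before a migration, as described earlier in the thread, doesn't always propagate: a cache with a 60-second floor honors your change, but one with a larger floor keeps serving stale data.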

Thank you! A TTL is not a promise to never change the record. A true authoritative source can change and push new information. A TTL is the amount of time that a cached record can live before the holder of the cache needs to check back for new information, which usually has not changed.

They would have to know all of the caches in order to push the changes to them, and since caches can cache for caches, it's unrealistic that a normal site could know this, and unlikely that a specially designed site would.

The cache should not cache answers to questions it didn't ask, and that includes new authorities for the domain.

OK, yes, somewhat. I wasn't as clear as I should have been. All caches are not known, as you said, and DNS isn't a push system. But there are cases of things like stealth masters that do keep track of all of their slaves, and these can tell the slaves to come look for new information. Not allowing updates to the slaves because of TTLs would create an unneeded time gap in propagation.

But there are cases of things like stealth masters that do keep track of all of their slaves, and these can tell the slaves to come look for new information. Not allowing updates to the slaves because of TTLs would create an unneeded time gap in propagation.

That's a terrible reason to allow such a large security hole.

You should have to list all of your ignore-ttl-from hosts, and src-filter communication to those sites before you should be allowed to do this.

But that's not what the TTL is for in the first place. The TTL was not intended to mean "I will hold this record for this duration, ignoring any other updates in the meantime". It was meant to mean, "I will not under any circumstances remember this record any longer than this duration". The difference has practical implications for DNS operations.

Just to be at the same time informative and to the point: the 7 replies so far have been about as positive as the replies to this patch [iu.edu] on the linux-kernel mailing list a few years ago.

OMG, is that guy for real?? I mean, I haven't still read through all of the replies, but... trying to un-UNIX Linux? Either he is one of the biggest morons to ever roam the earth, or he deserves a special place in the Trolls hall of fame...

OMG, is that guy for real?? I mean, I haven't still read through all of the replies, but... trying to un-UNIX Linux? Either he is one of the biggest morons to ever roam the earth, or he deserves a special place in the Trolls hall of fame...

Don't know, but after getting the old Al Viro treatment he hid under a rock, and I hear he's still there.

I'm not an expert, so please ignore (or better... explain) this if I'm way off, but Google advises a TTL of 1 week when using their Google Apps mail servers. So... what if I really, really want to change my MX after 2 days? :-/

A manufacturer had a problem with one of the older machines on their line. It shut down the line and held up production, costing many thousands of dollars in lost production. Since it was older equipment it was hard to find someone knowledgeable in repairing the machine, and nobody on-site knew what the problem could be. They found a technician with knowledge of the machine and hired him to come in and fix it.

When the technician arrived on site he listened to the client's description of the problem, examined the machine, opened a panel, and turned a single screw. He restarted the machine and it was back to full function. The line was up and running and the manufacturer was happy.
A week later the manufacturer received a bill for services: $1000. They called the technician and demanded an explanation - after all, they reasoned, he had only turned one screw to fix the problem. He agreed to re-bill, this time with itemized charges. The next bill contained two lines.

This is a derivative (or descendant) of a story that I read about a small town in Vermont. They had their own power generation facility for the town, and it went on the fritz, plunging the small hamlet into darkness. The only person who knew anything about the machinery had long since retired, but the townspeople were desperate, so they gave the old guy a call. He came out and took a look at the equipment. He then took a small hammer from his old toolbox and gently tapped on a certain point of the aged

In cases where www.foo.com is not cached, DNS resolvers are vulnerable to the much more trivial attack of simply forging the answer www.foo.com IN A 66.6.66.6. Of course, they have to hope to guess the proper transaction ID in the first query, because if they fail, the proper answer will be cached.

Poisoning an uncached name is fairly easy and doesn't require Kaminsky's trick. Kaminsky's trick relies on caching the answers to questions you didn't ask, rather than not caching them or preferring the cached answer over the uncached one. I think you even called this the "elephant in the room" at Usenix Security. :-)
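Rough odds for the blind forgery discussed above: a spoofed reply must match the 16-bit transaction ID, and, with source-port randomization, the query's source port as well. A back-of-the-envelope sketch (the ephemeral-port count is an assumption; resolvers differ, and real attacks send many packets at once):

```c
/* Size of the search space a blind spoofer must cover per query. */
static double spoof_search_space(int randomized_ports) {
    const double txids = 65536.0;    /* 2^16 transaction IDs */
    const double ports = 64512.0;    /* assumed ephemeral-port range */
    return randomized_ports ? txids * ports : txids;
}
```

Without port randomization the space is 65,536 guesses; with it, over four billion, which is why randomization raises the cost of the attack so dramatically even though it doesn't remove the underlying cache-acceptance problem.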

They stopped random UDP port use, and now use a static pool of UDP ports for queries. Note that they have come out with a P2 release that addresses a completely different issue that the first patch caused. I was able to essentially cause a DoS on a BIND server that was patched with P1 by sending more than 10,000 queries to the system. It ran out of usable UDP ports and puked. The same issue exists in the Windows patch, and especially on Windows 2003 SBS. There was way more than one line of code, or a single character, changed.

I didn't have time to read everyone's posts here, but the one thing I think everyone is forgetting or overlooking is why caching was created/used in the first place. 1st: could you imagine the amount of extra traffic on the internet and core devices if every lookup had to traverse the net? It was designed because there is no need to change your DNS RRs constantly; they should be static, or failing that, only need to be updated once in a while. 2nd: what happens if my DNS servers go down (yes, assuming both of them

I think maybe what he's saying is that the fix isn't good enough. It's not very elegant as it breaks long-standing functionality in DNS, *and* it doesn't fully address the issue. Perhaps the gist of what he's saying is "let's think on this one a little harder before committing lame fixes and thereby shooting ourselves in the collective foot... again". (Of course, feel free to correct me if I'm wrong, Dan!)

I'd just like to say that during all of this, the one man that deserves the community's recognition is DJB. I've been following this on and off since the exploit erupted, and throughout all of it the one thing that has been missing is significant, heartfelt praise for DJB. He's often maligned by the open source community for releasing his code to the public domain, but the fact remains the guy produces and ships kick-ass code. Qmail and Tinydns absolutely rock and I think it's a great shame the man doesn't ge

This has more to do with an oversight in the DNS standard - doesn't have anything to do with any single implementation. Windows, Linux, and any other networked system that uses DNS are equally affected.

Besides, it doesn't matter if your operating system is Open Source. You can write closed or open source software on any platform you want, and just because the source is available does not necessarily mean that bugs will be noticed and fixed. This situation just shows that even if there are no 'bugs' in an implementation of a standard, the original design may still be flawed.

I haven't been following this situation very closely, so perhaps I'm a bit off with the details, but I'd be happy for someone to put me right if that's the case.

Favouring cached DNS records doesn't seem to me to be a great idea in all situations. It depends on the length of the TTL setting on your DNS server, though. I'm not sure what expiry time would be sensible for an ISP to use. You have to balance the fact that you want up-to-date records against the amount of overhead that will be generated by all the DNS traffic.