Top level domain nonsense and how it can break your stuff

Call me old school, but I really (I mean REALLY) don’t like the recent explosion of top level domains. I understand that most good names are taken in the .com, .org, and .net zones, but do we really need all those .blue, .parts, and .yoga TLDs?

Why am I whining about all this all of a sudden? I’ll tell you why. Because a new top level domain – .aws – is about to be introduced, and it already broke something for me in a non-obvious manner.

I manage a few Virtual Private Clouds on Amazon AWS. Many of these rely on a hostname naming convention (yeah, I’m familiar with the pets vs. cattle idea). Imagine you have a few servers, which are separated into generic infrastructure and client segments, like so:

bastion.aws.example.com

firewall.aws.example.com

lb.aws.example.com

web.client1.example.com

db.client1.example.com

web.client2.example.com

db.client2.example.com

… and so on.

Working with such long FQDNs (fully qualified domain names) isn’t very convenient. So add “search example.com” to your /etc/resolv.conf file, and now you can use short hostnames like firewall.aws and web.client1. And life is beautiful …
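For the curious, here is a rough sketch of how a stub resolver expands a short name using the search list. It assumes the behavior documented in resolv.conf(5) for the glibc resolver with the default “ndots:1” option; the function name candidate_names is made up for illustration, and other resolvers may order things differently.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the names a stub resolver would try, in order.

    Hypothetical helper: models the documented glibc behavior only,
    it does not perform any DNS lookups.
    """
    # A trailing dot marks the name as already fully qualified.
    if name.endswith("."):
        return [name]

    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search_domains]

    # With at least `ndots` dots in the name, it is tried as-is first,
    # and the search domains are appended only if that lookup fails.
    if name.count(".") >= ndots:
        return [absolute] + searched

    # Otherwise the search domains come first, the bare name last.
    return searched + [absolute]


print(candidate_names("firewall.aws", ["example.com"]))
print(candidate_names("bastion", ["example.com"]))
```

Note the ordering for names that already contain a dot: under these assumptions, firewall.aws is looked up as an absolute name before example.com is ever appended, so the short form only works as long as nothing answers for the bare name.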

… until one day, when you see the following:

user@bastion.aws$> ssh firewall.aws
Permission denied (publickey).

And that’s when your heart misses a beat, the world freezes, and you go: “WTF?”. All kinds of thoughts are rushing through your head. Is it a typo? Am I in the right place? Did the server get compromised? How’s that for a little panic …

Trying a few things here and there, you manage to get into the server from somewhere else. You are very careful. You are looking around for any traces of a break-in, but you see nothing. You dig through the logs both on the server and off it. Still nothing. You even dive into all those logwatch and cron messages in your Trash, the ones you were automatically deleting because things had been working fine for so long. There! You find that cron was complaining that the backup script couldn’t get into this machine. Uh-oh. This has been happening for a few days now. A black cloud of combined worry for the compromised machine and outdated backup kills the sunlight in your life. Dammit!

A few minutes later, you establish that the problem is not limited to that particular machine. All your .aws hosts share this headache. A few more minutes later, you learn that ‘ssh firewall.aws.example.com’ works fine, while ‘ssh firewall.aws’ still doesn’t.

That points toward a hostname resolution issue. With that, it takes only a few more moments to see the following: