Fear and Loathing in AWS or Adventures in Partition Resizing

Nov 26, 2016

Note - This is very, very long. It isn’t done yet and it was 3,589 words when I thought to add this note. At my normal metric of 250 words per page, which I learned in college, that’s 14.5 printed pages. The reason for not splitting it up is that if you ever have these issues then you want all of this in one place.

A few days ago I made a mistake – a significant one. I had a large data load running into a MySQL database and I wasn’t really thinking about the storage implications of it. To make matters worse, this was the only box in the cluster that wasn’t running the monitoring software I like – Inspeqtor. The db server had been the first box I configured when I brought up all of our boxes and I never went back and installed Inspeqtor. I know, I know – an amateur’s mistake at best. But, and it is no excuse, we’ve been running hard and fast for a while and, ultimately, this always catches up to you.

So you know where this is going – we ran out of space. And out of space on a db server is generally a bad thing. Sometimes it falls into the category of “ok bad” and other times it falls into the category of “oh shite bad”. This was “oh shite bad”. I spent about 12 hours wondering if I’d get my data back, not when. Happily, my luck held, it was when, and that when is likely about 24 more hours from the time of this writing. A bunch of the data has already been recovered and, with that, it is time to write the retrospective mea culpa and maybe cast some light on what is honestly a fairly crappy aspect of AWS.

Now, unlike most of my AWS writings which I structure as tutorials, this is not a tutorial. It is more of an essay or perhaps advice. I couldn’t fully document the meandering path that I took as it was done under pressure and with a fair to moderate level of cursing.

Note: When I went through all this I was fairly dissatisfied with AWS in this regard and there are some real quirks and minor issues to storage, but when I think about what this would have taken with a classical data center, I applaud Amazon. My guess is that if I had to get professional operations people involved to help with this, I would quickly be out a few thousand dollars in high priced on call labor.

Understand This Well, Very Well, Here There Be Dragons

The first thing to understand is that the steps I took and the tools I used are dangerous. Had I taken a slightly wrong path I would likely have lost everything. If you are going to walk this type of road then I strongly recommend:

Do it with a second set of eyes. I am an unabashed fan of pair programming and whenever things are bad I prefer to have a pairing buddy. My normal pairing partner, Dv, wasn’t available when all this was going on so I brought on Nick as a consultant for this ordeal and he was fantastic. At two critical junctures he kept me from going astray. Highly recommended.

Take your time. When data recovery is an issue you need to take your time and actually think. If people are yelling at you to get it done and hurry up then, well, don’t listen to them. If I hadn’t brought this back up it would have likely cost us between 1 and 3 months of lost effort. With those kind of stakes I’m going to take my damn time and so should you.

When in doubt – stop. It is always tempting to plunge forward since you are almost there. Nope. When in doubt my advice is rest, coffee or even sleep. Some things just can’t be rushed and understanding that is key.

Initial Mistakes Made

As I look back at this I can see that I made one real mistake early on in setting up this, our first AWS box: I didn’t really analyze our data volumes and growth rate and do some projections. I’ve done that now, at least a bit, and I wish I had done it at the start. For this project we’re growing at about 250+ gigs per month. When I set up our database server I went for a very simple configuration – a 2 terabyte volume configured as the boot volume. Apparently, even today, there are still hard and fast limits, such as not being able to have a boot volume greater than two terabytes. Sigh. What I should have done is have a small boot volume and then a data volume that was 2 terabytes. I can’t prove it but I think that if I had done this my woes in partition resizing would not have happened.

Take away advice:

For a db server always have a small boot volume

Make your database store its data on a non-boot volume; yes that’s obvious based on #1 but I’m being explicit.

I suspect that if I had used LVM things might have been better but I’ve never understood LVM very well.

Whatever you do, make absolutely certain that if you are using MySQL, the version you are using either has innodb_file_per_table turned on or has it as the default. At my last data center I had this manually turned on in my.cnf, and it became the default in MySQL 5.6. I never dreamed when I set up my boxes at AWS this past fall that I wouldn’t get it as part of a normal apt-get install mariadb operation. This option defines whether you have a single gigantic blob of disc space that stores all your tables or a blob of disc space per table. When you use innodb_file_per_table, despite some issues, it means that you can address storage needs far more granularly by applying symlinks to different file systems. This is an incredible advantage for handling large amounts of data.

Note: If you understand the issues of innodb_file_per_table then you can likely ignore #4. And, honestly, if you understand it then perhaps you should drop me a resume; I’m always looking for talent.
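For reference – the actual MySQL variable is spelled innodb_file_per_table – here is a quick way to check what your running server actually has. This is a sketch: the helper name is mine, and it assumes the stock mysql client can connect without extra flags.

```shell
# Confirm per-table tablespaces are on for the running server.
check_file_per_table() {
  mysql -N -e "SHOW VARIABLES LIKE 'innodb_file_per_table'" | grep -qi ON
}
```

In my.cnf it is just a one-liner under the [mysqld] section: innodb_file_per_table = 1.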

But Why Didn’t You Use RDS?

I’m sure someone out there is shouting at their phone or tablet saying “Why, oh dear lord, why didn’t you use RDS?” Well I actually tried to use RDS, specifically Aurora, but found that, for our data, RDS led to silent data loss on data load. I don’t know why but I verified it myself and I was in a hurry so I just set up my own DB server. Yes RDS would have been easier but not if it loses data.

Take away advice:

RDS is still a new technology; be wary

Make sure, if you have large or complex data loads, that it actually loads everything you give it. Verify your row counts between old and new servers.
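That row count verification is easy to script. A sketch, assuming you can reach both servers with the stock mysql client and credentials in ~/.my.cnf – the host names, database name and helper names here are all made up:

```shell
# Count rows for one table on a given host.
count_rows() {  # usage: count_rows <host> <db> <table>
  mysql -h "$1" -N -e "SELECT COUNT(*) FROM \`$3\`" "$2"
}

# Compare a table between old and new servers and complain loudly on mismatch.
verify_table() {  # usage: verify_table <table>
  local old new
  old=$(count_rows old-db.internal mydb "$1")
  new=$(count_rows new-db.internal mydb "$1")
  if [ "$old" = "$new" ]; then
    echo "OK $1 ($old rows)"
  else
    echo "MISMATCH $1: old=$old new=$new"
  fi
}
```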

Step 1 - Reboot the Server

When a database server runs out of space, the database software itself generally goes down and that’s actually good. The first step is always, always, always to reboot the server. A reboot generally clears up at least some disc space, often enough to get things operational again. I wouldn’t recommend actually adding more info to the database until you solve the storage issues but you can at least get access to the system.

Take away advice:

Reboot your database server after problems to clear up some space.

If your database software isn’t quickly accessible it is likely recovering from the crash. Check the logs and be patient. MySQL, my database of choice, has always been excellent at recovery in my experience; I can’t speak to other databases.
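Once the box is back, two things are worth running straight away: df to see what the reboot actually bought you, and the MySQL error log to watch recovery grind along (the log path shown is the Debian/Ubuntu default; yours may differ):

```shell
# How much space did the reboot actually free up?
df -h /

# Watch crash recovery progress (commented out here since the path varies
# by distro):
#   sudo tail -f /var/log/mysql/error.log
```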

Step 2 - Get a Second Set of Eyes

As described above I strongly recommend getting a second set of eyes early in the process. You really should do this. It made all the difference in keeping things on track.

Step 3 - Understanding Storage in the AWS World

Most of us are familiar with storage in terms of our personal systems but AWS is a bit of a different thing. Here is what I learned through trial and error:

Any EC2 server can have multiple chunks of storage attached to it.

These chunks of storage are called volumes.

When you need more storage you can just create a volume.

Newly created volumes are entirely blank and do NOT have a file system on them. For people from a Windows background in particular this can be odd.

Before you can use a volume you need to attach it to an instance.

Volumes then need to be mounted, either with the mount command or via an entry in fstab.

When you need to transfer a LOT of data from EC2 server 1 to EC2 server 2 then:

You can create a volume

You can attach it to the source of the data (say a machine you are using for running mysqldump)

You can make a filesystem on it

You can mount it

You can dump your data to the new volume

You can detach the volume

You can attach it to the destination of the data

You can mount it

You can load the data

If you need to bring parallelism to things there is nothing stopping you from creating multiple volumes, putting data in one, moving it to the destination and then using another volume to continue the process.

All of your old school Unix commands come in handy.

Snapshots are quickly created backups that you can use to initialize a volume with data.

If you need a way to visualize AWS storage then think of it this way:

Volumes are external hard drives that you can create freely and attach to servers.

You can have as many as you want within reason; you aren’t constrained by the number of ports you have in your machine.

Attaching and then mounting is equivalent to plugging the volume in
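The dump-and-ship sequence above can be sketched end to end as a shell function. Everything here – device name, mount point, database name – is an example rather than gospel (and note the sdf versus xvdf naming gotcha covered later):

```shell
# Sketch of the whole volume shuffle between two EC2 boxes.
ship_data_sketch() {
  # --- on the source box, after attaching the fresh volume in the console ---
  sudo mkfs -t ext4 /dev/xvdf            # brand new volumes have no filesystem
  sudo mkdir -p /mnt/transfer
  sudo mount /dev/xvdf /mnt/transfer
  mysqldump mydb | gzip > /mnt/transfer/mydb.sql.gz
  sudo umount /mnt/transfer              # now detach in the console...

  # --- ...attach to the destination box in the console, then ---
  sudo mkdir -p /mnt/transfer
  sudo mount /dev/xvdf /mnt/transfer     # filesystem already exists this time
  gunzip < /mnt/transfer/mydb.sql.gz | mysql mydb
  sudo umount /mnt/transfer
}
```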

Step 4 - Experiment Before You Do Anything With Your Data

One of the brilliant aspects of AWS is that you can experiment with abandon. Need to see how to mess about with a 4 terabyte partition? You can just create one and then try things. I’d strongly recommend experimenting with the process in full before you risk your data.

Step 5 - The Low Level Unix Commands in Question

There are several underlying low level Unix commands that I had to use during the course of this.

The fstab File

Fstab isn’t a command; it is a user-defined ASCII data file in /etc, i.e. /etc/fstab, that defines how logical volumes are attached to the computer you’re working on and where they are mounted. What you have to do is create a master directory such as /mnt and then subdirectories where you want discs to be attached, such as /mnt/old. The instructions in fstab then connect a low level volume such as /dev/xvdf to that directory, allowing you to ls, cd and so on.
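An example entry, using the device and mount point from above – nofail is worth knowing about since it keeps the box bootable even if the volume goes missing:

```
# device     mount point   type   options          dump  pass
/dev/xvdf    /mnt/old      ext4   defaults,nofail  0     2
```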

Note: I’m going to add noatime and nodiratime into my mount statements shortly. This is what I’m going to add to my fstab:

rw,noatime,nodiratime,nobarrier,data=ordered

which I sourced from here. Once that’s set in /etc/fstab then I’m going to:

sudo service mysql stop (making certain that nothing is using the db first)
sudo mount -o remount /mnt/data (remount needs the mount point as its target; /mnt/data here is a stand-in for yours)

You can also do it this way without modifying fstab but then you lose the options on reboot so that’s kind of suckass.

Mounting a Drive - mount

The mount command mounts the volumes identified in /etc/fstab. You need to use this after you have attached a drive using the AWS console in order to make it available to the system. If it helps you to understand this then think of the AWS Console’s attach command as equivalent to plugging in a hardware cable to a drive whereas mount is the software side.

sudo mount -a

You can also mount without an fstab file like this:

sudo mount /dev/sdg /vol -t ext4

Unmounting a Drive - umount

When you need to detach a volume with the AWS console you first need to unmount it with umount.

sudo umount /dev/xvdf

When you can’t detach a volume with the AWS console, what do you do – you do this:

sudo umount /dev/xvda1

And when you get

umount: /: device is busy.

then you need to figure out what process still has this open. Two tools for this are lsof and fuser (I was much more successful with fuser).
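A sketch of that hunt, wrapped in a helper whose name is mine; fuser -m asks about everything on a given filesystem rather than just the mount point itself:

```shell
whats_holding() {  # usage: whats_holding /mnt/old
  sudo fuser -vm "$1"     # every process with something open on that filesystem
  sudo lsof +f -- "$1"    # lsof's answer to the same question
  # last resort, and dangerous: sudo fuser -km "$1" kills those processes
}
```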

List Block Devices - lsblk

The lsblk command shows you the logical block devices on your system. A block device essentially means a disk but it could be something else. This one was new to me or at least something I don’t think I’ve used since I ran Fedora on my ThinkPad back in 2000. Here’s an example:

This is showing all the block devices. Notice xvda: it shows a physical volume and then a logical volume within it. It is also interesting to note that docker actually shows up as a block device. It makes sense that Docker went this route but until I saw it myself I don’t think I really appreciated that; very, very cool.

File System, File System, What do We Have - file

The file command tells you what file system your disk volume contains. When I first moved to Linux professionally from Windows, as opposed to just mucking about with Linux, I remember being a bit baffled by the wealth of file system options – ext2, ext3, reiser / murderfs, etc. You really didn’t see this much if at all in Windows at the time so it was disconcerting. The basic idea behind having multiple filesystems is that different filesystems are good at different things. If, for example, the year is 2005 and you want to put 100,000 files in a single directory then your only real option is ReiserFS. Reiser was fantastic at this. Similarly if you need a maximum file size of 16 exabytes then you need Btrfs, and that’s likely the only filesystem on the planet other than ZFS which can do this.

Given that all data on a computer is ultimately written to a file at some point, if you can get better performance from a different filesystem, then changing filesystems is a really easy way to get better performance that affects everything. When you improve your code you affect only your application, but if you move to a filesystem with, say, 10% better write speed, then everything you do on that machine improves. That’s the power of changing your filesystem. Now, that said, this is something where you absolutely need to understand every single issue. For example Btrfs looks great but only use it on openSUSE or Oracle Linux. On other platforms it is still under active development.

Here’s an example of using file on a newly created volume:

sudo file -s /dev/xvdf
/dev/xvdf: data

Notice that all it says is data. This means that the volume is newly created and has no filesystem yet, so you’ll need to use mkfs, covered below.

Making a File System - mkfs

The mkfs command actually creates a filesystem. Years ago this was actually slow but in 2016 even with an enormous 4TB volume it was rippingly fast:

sudo mkfs -t ext4 /dev/xvdf

Step 6 - And Now We Come to Partition Resizing

It may be hard to believe that it has taken this many words to get ourselves ready for partition resizing. Damn but I can be a wordy son of a bitch at times – apologies. Anyway I’m going to give the punch line first:

I was never, ever able to resize a 2 TB AWS boot volume to a bigger size. According to the AWS docs this should be possible but they are actually fairly crappy with respect to volume issues. No matter what I did I kept getting superblock / magic number errors that I could not get past.

The basic process should have been something like this:

Create a snapshot of the current volume

Create a new volume of the desired size

Import the snapshot into the new volume

Use parted / gpt to adjust the core partition to the new size

Use resize2fs to finish the process

Profit!!!
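For what it’s worth, here is that happy path sketched with real commands. The IDs, availability zone and device names are placeholders, and – as I said – this is the process that should work; it never actually did for my 2 TB boot volume:

```shell
resize_sketch() {
  # snapshot the existing volume first -- non-negotiable
  aws ec2 create-snapshot --volume-id vol-XXXX --description "pre-resize"
  # build a bigger volume from that snapshot
  aws ec2 create-volume --snapshot-id snap-XXXX --size 4096 \
      --availability-zone us-east-1a
  # attach the new volume in the console, then on the instance:
  sudo parted /dev/xvdf resizepart 1 100%   # grow partition 1 to fill the volume
  sudo resize2fs /dev/xvdf1                 # grow the ext4 filesystem to match
}
```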

Honestly I was really, really surprised that in 2016 trying to resize a Linux partition was such an absolute shite show. Perhaps it was me but the other person pairing on this had no better luck than I. It should not have been this hard.

So, somewhat unfortunately, I ended up having to do a full dump / restore of my database to get around this. And while that sucked monkey chunks it did end up with some positive things as covered in the next section.

AWS Gotchas

While at this step we hit a number of gotchas that were really confusing at first. The first of these was the fact that logical device names differ between what the AWS console shows and what your instance’s kernel actually uses.

(Here it is /dev/sdf)

(Here it is /dev/xvdf)

(Here is the warning)

I find it absolutely inconceivable that Amazon can’t do better than this. Given the possibility for destructive errors by getting things wrong this should be much, much better. Given that this is a controlled cloud computing environment, doesn’t Amazon know the kernel?

Note: I’m a self admitted Amazon fan boy for AWS, so for me to say this means I think it’s really, really serious.

When you create a disc volume there is no option for naming / tagging it at creation time:

That is equally sucky. Do this a lot and you end up with a bunch of things named vol-64bb3234d and vol-534343bc and you’re scratching your head going “hm…”. Yes you can name them afterwards but that increases the chance that they never get named.

My Path If It’s Useful To You

Here is my rough, meandering path thru resize2fs, parted and gpt in case anyone out there wants to tell me where I went wrong.

Step 7 - Why Having Partition Resizing Fail was Ok

The reason that I never lost my mind was that in the process of troubleshooting all this I realized that I had failed to set the innodb_file_per_table option when my db server was initially set up. The innodb_file_per_table option only applies to tables created after it is enabled, so if I wanted it for my existing data I didn’t have a lot of options beyond a full dump and reload. And, since it was a holiday and downtime was sort of naturally going on, it was actually about the best possible time to do all this. Keep in mind that while it isn’t pleasant to go from mincing onions for stuffing to checking a database dump and then to making cranberry orange relish, you can actually do this without losing your mind. By the end of it I was timing database dump routines alongside recipes with ease i.e. “Ok I can make a caramel apple cobbler in the same time it takes to dump the table page2016_q2s”. Certainly Thanksgiving 2016, also known in my head as dbpocalypse, is a holiday I won’t soon forget.

Step 8 - Generating Your MySQL Dump Files

Due to the size generated and the need for load concurrency I generated my mysqldump files with a rake task. All it has to do is dump each table to its own file.

The advantage to this approach is rather than a single massive database dump you end up with a file for each table which lets you reload multiple tables in parallel.
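The rake task itself isn’t reproduced here, but it boils down to something like this loop – the database name and output directory are placeholders:

```shell
dump_all_tables() {  # usage: dump_all_tables <db> <outdir>
  mkdir -p "$2"
  for t in $(mysql -N -e "SHOW TABLES" "$1"); do
    mysqldump "$1" "$t" > "$2/$t.sql"   # one file per table
  done
}
```

On reload you can then feed several of those files into mysql from separate terminals to get the concurrency.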

Step 9 - Verify Your Dump Files with Tail

Here’s a general bit of advice when you are dealing with multi-hour long / multi-day long mysqldump processes – make sure that they actually complete. After all when you’re looking at a directory listing and all you see is this:

215121716510 page2016_q2s.sql

How are you to know that a file you basically identify as byte size / filename actually finished successfully? An easy way to check this is to just tail the file like this:

That’s what the tail of a successful MySQL dump should look like. The key thing is – Dump completed on 2016-11-23 7:50:34.
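That check is easy to script too: a clean dump always ends with that Dump completed line, so just grep the tail of the file. The helper name is mine:

```shell
verify_dump() {  # usage: verify_dump <file.sql>
  if tail -c 200 "$1" | grep -q 'Dump completed'; then
    echo "OK: $1"
  else
    echo "INCOMPLETE: $1"
  fi
}
```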

Step 10 - Why Didn’t I Try maatkit or something else?

You’ll notice that I used straight up mysqldump as my backup tool. Why would I do that when there are higher performance alternatives like maatkit, mydumper or mysqlhotcopy? Well it all comes down to trust:

mysqldump, for all its issues, is solid. I’ve used it for more than a decade and it has never given me issues – when I used it correctly. Any issues I’ve had have been user error and that hasn’t happened for at least a decade.

mysqldump is regularly maintained. maatkit seems to have last been maintained in 2013 while mydumper is about a year past its last update. While I’m not one to take the position that old software is dead software, for something as critical as a backup tool it does make me a little bit squeamish.

All of this happened over the U.S. Thanksgiving holiday and while I had time it was in dribs and drabs – often while I was in the middle of cooking – the last thing I had time for was sitting down and poring thru tech notes to figure out why the command line options for mydumper were different from mysqldump. Honestly – why would you change these? I could simply set mysqldump going on a half dozen different terminals front ended by tmux to keep them alive. That gave me concurrency if not performance.

Sidebar for Open Source Authors - Currency Matters

There are tons and tons of different open source projects that are actually fantastic but haven’t really been updated recently. That doesn’t mean that you shouldn’t use them by any means. But, if you are an open source author, you should really be aware that the potential users of your software often look at your project in terms of its git repository. And one of the first things you notice is the datestamp which github gives you as “2 years ago”, “a month ago” and so on. I’m going to illustrate this with an example from a project I really, really like: Inspeqtor. Inspeqtor is an open source monitoring tool I’ve written about before. Honestly it is monit with an easier configuration approach. And it is from one of my open source heroes, Mike Perham, so of course I like it. But, the first time I considered it, it hadn’t been updated in about a year and I have to say that did give me pause. If I hadn’t known exactly who Mike Perham was then I likely would have raced over to Monit and used it instead.

Now the secret trick for an open source author is really simple – you don’t have to touch the code for github to update the last modified timestamp. All you have to do is update the readme or some other documentation file. A lot of the time, what you want as an open source user is to know you aren’t adopting abandonware. When you choose an open source tool you’re making an investment. It may be an investment of time rather than money but it damn well is an investment. And when things are abandonware the nature of that investment changes. Even if I am 90% likely to never reach out to a project in the form of an issue or support request, knowing that the project is still alive means that I can. And about the only positive signal that a project is alive comes from the damn timestamp that github reports so gleefully.

Step 11 - Why Didn’t I Get a Real Professional?

I have to admit that it is pretty damn inconceivable to me that in 2016 you can’t resize an AWS volume dynamically, particularly when you can so easily unmount it and operate on it. And it is important to know that I am not an expert in this area. I’m not a full time ops guy nor am I a true sysadmin – I just play one on the internet from time to time. So the logical thing to do would have been to reach out to a real professional or even AWS support, so why didn’t I? Well it comes down to this:

Timing. This happened just prior to a holiday so people’s availability was severely constrained. I didn’t want any resource involved that couldn’t see it thru end to end. I knew that I was fully available and I was confident enough that no matter what happened I would get thru it.

No Backups. There were no database backups available (see next section) which means that any solution had to be 100% safe. And I could only guarantee that if I did things myself.

Pricing. The older I get the more sensitive I get to being, what I feel, unfairly screwed. The last time I got significant talent on the spot market – i.e. when you need it immediately – that person charged me either $300 or $400 per hour for about two hours of what amounted to configuration support. And that’s fine, it was his right to do that. But that degree of what I felt was dramatic unfairness meant that I will never, ever contact him again before the heat death of the universe. And the pity is that this person could easily have taken less per hour and rolled it into an ongoing relationship where I’d still be relying on him today. I mean I am still running, daily, the code he wrote. Sigh. I saw no reason to raise my blood pressure by being potentially screwed once again.

Step 12 - Flying without a Net or Life Without Database Backups

It may surprise some people that I did not have a decent backup system in place and there’s a story here. This is a project that is an outgrowth of a very poorly funded project that I’ve been continuously working on since 2010. The aggregate code base size is north of 400K lines of Ruby and all the code growth has been organic, not planned. Although other people have touched the code base at times I’ve been the sole author of probably 95% of it. As an incredibly poorly funded project there was never much of an ops budget, and since this was a 24x7x365 system there wasn’t a lot of opportunity for offline time for backup. There was about a 2 year period, when we changed to one data center, where I was told backups were being done. Unfortunately there were enough times when I was told that “we have a backup issue and we need to reboot all of your servers to address it” that I never once trusted it.

During the past 7 years there was only one time when an important table was accidentally dropped. That was the one case when we might have called for data recovery, and we had a local copy of the same data that was only a few hours out of sync so we didn’t bother. So, at least from a track record basis, I have a pretty good record for not screwing things up. One of the secrets to my success though is that I applied a very strict trust metric to database access – if I wouldn’t trust you to watch my kids then I wouldn’t give you access to the actual database. And, surprisingly, that has actually worked out pretty well.

Now, does any of that excuse there not being backups? Absolutely not. This is now both better funded and far more mission critical so I’ll get this figured out in short order. If I end up rolling my own backup solution then I’ll do it on my own time and open source it. While there are different backup solutions out there, our data sets have some unique characteristics and rolling some code to automate this might be an interesting challenge. I suspect using dynamodb for tracking the backup catalog might be interesting.

Step 13 - Engineer Color Code Thyself

The biggest mistake in something like this, beyond the fact that it happened at all, is likely to come from yourself. People, particularly tired, frustrated people (and you will be both tired and frustrated by this), make mistakes. A lot of this work happens inside terminal tabs that look exactly alike. It is incredibly easy to glance at the wrong tab, mistake source for destination and then – WHAM – you have a real problem. And I know you are thinking that this cannot happen or “how stupid is this guy; I am smarter than that.” Well you might be, I certainly am, but it still happens. And, honestly, if something like that has even a chance of happening you are better off setting things up so that it can’t.

Just as an example, even though I’d like to think I’m smarter than that, all of this started just prior to Thanksgiving 2016 so Turkey Day 2016 was spent running back and forth between Thanksgiving cooking and seeing if certain long running exports had finished yet. Just as a 22 pound turkey seems to take forever to cook so, too, does a table dump that has this many bytes: 215121716510. Since I was in and out of things so many times, statistically, the chance that I’d make a mistake I think was actually fairly high.

When you are dealing with old and new systems that would normally look identical the best trick I’ve ever found is really, really simple – color code things. If you look at how much of the brain is related to vision then this actually makes sense. Making dangerous things highly visual means that they really stand out. All I did was set the background color differently for each terminal. I made all terminals related to source be solarized and all terminals related to destination be normal (black). Here is what that looks like:

Given that I might have 3 or 4 terminals open on source and the same number on destination this makes everything so much easier.

Sidebar: I’ve been looking, for years, for a way to define my background terminal color on login to a different system automatically specifically to prevent this. This should happen automatically at login based on an environment variable. If anyone out there knows a unix shell scripting trick for this I’d bloody well love to hear it. Thanks!
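The closest thing I know of: most xterm-compatible terminals honor the OSC 11 escape sequence, which sets the background color, so you can key it off the hostname in .bashrc on each box. The hostname patterns and colors here are made up – substitute your own:

```shell
# In ~/.bashrc (or a shared dotfile): recolor the terminal on login.
case "$(hostname)" in
  db-old*) printf '\033]11;#002b36\007' ;;   # solarized dark for source boxes
  db-new*) printf '\033]11;#000000\007' ;;   # plain black for destination boxes
esac
```

It’s not bulletproof – tmux and some terminals need passthrough tweaks – but it beats eyeballing identical prompts.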

Useful Urls

Here are some of the useful urls I came across in the process of dealing with the raft of shite associated with this minor debacle.