As part of Percona Remote DBA for MySQL service we recognize that reliable backups are one of the most important things we can bring to the table. In my experience handling emergencies, the single worst thing that can happen is finding out you don’t have backups available when some sort of data loss or catastrophic event occurs.

With our Remote DBA service we can take care of backups for you, what follows are some of the internals of our implementation.

What kind of outages can happen?

Someone runs UPDATE or DELETE and forgets the where clause or filters weren’t quite right

The application had a bug causing data to be removed or overwritten

A table (or entire schema) was dropped accidentally

Your InnoDB table was corrupt and mysql shuts down

Your server or RAID controller crashes and all data is lost on that server

A disk failed, and RAID array does not recover

You run into a InnoDB corruption bug that propagates via replication (not common, but does happen)

You lose your entire SAN and all your DB servers were located there. Let’s hope your backups are somewhere else!

You lose a PSU or network switch in your datacenter and some or all of your servers go down in that location

Your entire datacenter loses power and the generators do not start, which happens more often than you might think

What tools do we use in Remote DBA?

Philosophy on backups

It is a good idea to schedule both logical and binary backups. They each have their use cases and add redundancy to your backups. If there is an issue with your backup, it’s likely not to affect the other tool.

Store your backups on more than one server.

In addition to local copies, store backups offsite. Look at the cost of S3 or S3+Glacier, it’s worth the peace of mind!

Test your backups, and if you have a test environment, load them there periodically. You can also spin up an EC2 instance to load your backups onto. In addition, you can binlog rollforward 24 hours of binlogs as a good test.

Store your binlogs off your primary server so you can perform point in time recovery.

Store your binlogs offsite for disaster recovery scenarios.

Run pt-table-checksum periodically (i.e. once a month) and make sure your servers data stays consistent. Checksumming is important, as backups are typically pulled off a slave and it’s vital that it has the same data.

How do we use these components to give our customers reliable backups?

Think about the 10 example outages listed above. Each tool has it’s strong points given the conditions.

Percona XtraBackup for MySQL for binary backups.

Strong Points:

It can restore an entire server very fast. Often the limiter of how fast this can be restored to another server, is how fast you can transfer data over your network. If you have 1GB network and you have 1TB of data, it could take awhile.

It can compress the DB on the fly

It can backup a server at approximately the maximum rate the server allows, given it’s IO system

It can typically execute a backup with little to no major impact on the server. For example in xtrabackup 2.0.5+, the time taken for “FLUSH TABLES WITH RAED LOCK” is normally under 1 second.

Tips/Tricks:

If you have a lot of non-transactional tables (i.e. myisam), use –rsync option. This will rsync a copy of all the frm files and all the MYD/MYI files. It then does a second rsync while under a global lock. This means where you may have been locked for hours where you had many non-transactional tables, now you can be locked sub-second. Even with innodb only this can greatly cut down on the lock time by syncing the frm files.

Enable –slave-info when backing up from a slave so you know what the position you are in the master’s bin logs

–compress option, compresses on the fly using qpress under the hood.

When do we typically use xtrbackup restores:

Setting up new slaves

When we lose an entire server due to hardware failure, corruption, etc

When the majority of data on the server was lost. e.g. there is one primary schema and that schema was dropped. Basically when restoring may take less time that trying to load a logical backup.

Restoring your data from backup is another topic. Piecing together data after accidental data loss is one of Percona’s specialties, and there are many different techniques depending on the scenario. I will go through some of these in detail in a future blog post.

Mydumper for logical backups

Strong Points:

Very fast for logical backups – compared to mysqldump

Consistent backups between myisam and innodb tables. Global read lock only held until myisam tables are dumped.

We are researching into how we could further improve lock times here when non-transactional tables are
used

Almost no locking, if not using myisam tables

Built in compression

Each table is dumped to a separate file. This is very important to make restoring single tables easy. You can quickly restore a single table, instead of restoring your entire backup just to find a tiny table you need. This is actually the most common type of restore needed, so it’s important to make this operation as painless as possible.

Compressed mydumper typically 3x-5x smaller vs compressed xtrabackup

Typically we upload mydumper backups to s3 vs xtrabackup given the time needed to upload/download. Though it depends on the available bandwidth and should be factored into your restore time.

Problems:

You can’t rely on mydumper to dump schema’s. It does not handle views/triggers/procedures etc. Run with –no-schemas, instead use mysqldump for the schemas and rely on mydumper for data only.

You will have to compile it yourself as binary packages aren’t distributed

Be careful with importing a dump from a server running in a different timezone. We have a fix here.

Details on how we dump schemas:

loop through each DB

write out ALTER DATABASE DEFAULT CHARACTER SET <charset> to the schema file, putting in the current charset

mysqldump … -d -R –skip-triggers, out to the schema file

create a schema-post file that has the triggers # mysqldump … -d -t

How to restore mydumper data:

Load the schema file

Run myloader –threads=x

Load the schema-post file

I will get into specifics on the tips/tricks to restore data in a future blog post.

Tips/Tricks:

run with –kill-long-queries to avoid nasty problems with “FLUSH TABLES WITH READ LOCK”

–compress, compresses tables per file and should typically be enabled by default. The time needed to uncompress is not a limiting factor on restore time when done inline.

When do we typically use mydumper restores:

Restoring a single file

Restoring a single schema or rolling forward a single schema to a point in time

Restoring data while automatically replicating out to all slaves

mysqlbinlog 5.6

Last year Percona IT director Tamas Kozak had a great blog post that showed how mysqlbinlog in 5.6 could be used. With mysqlbinlog 5.6, you can now pull binary logs in real time to another server using “mysqlbinlog … –read-from-remote-server –raw –stop-never”

Useful to mirror the binlogs on the master to a second server.

Allows you to roll forward backups even after losing the master

Very useful for disaster recovery.

You can have your backups in S3 and mysqlbinlog –stop-never running on a small ec2 instance. This can allow for a very low cost disaster recovery plan to ensure you will not lose data even in the worst case scenarios.

Takes very little resources to run, can run about anywhere with disk space and writes out binlog files sequentially.

Tips/Tricks (how we run this):

Ensure it stays running, restart it if it appears to be hanging

Verify the file is the same on master and slave

Re-transfer files that are partially transferred

Compress the files after successful transfer

Amazon S3 for MySQL

I discuss S3 here but other cloud based storage can be used as well. S3 is just the most popular in this category and is in wide use.

Details:

s3cmd – we have been using the version from github, Mostly for multi-part upload support. This prevents us from having to split files up before uploading to S3.

You can now set bucket lifecycle properties so data over X days is archived to Glacier and data over Y days is removed. This is very convenient feature and allows you to cost effectively store long term backups with little additional work

Tips/Tricks:

–add-header=x-amz-server-side-encryption:AES256 to use the server side encryption feature which helps with some types of compliance. We also have the capability to encrypt all files with gpg prior to upload via a separate script

use_https = True, especially if your data is not encrypted before transfer

Monitoring

Monitoring is the most important piece to tie all of these process together. We employ nsca nagios alerts for all of the backup processes.

freshness_threshold should be set so if your nsca hasn’t checked in within a certain period it will alert you. For example if you backup once a day a good threshold could be 36 hours.

For our mysqlbinlog processes, we have it sending nsca alerts every 30 seconds and have it alert when nothing has been received for 15 minutes -> 1 hour

If backups throw an error and are aborted, we send a critical alert immediately to be investigated

The number one cause of backup alerts are due to problems with “FLUSH TABLE WITH READ LOCK”. Namely when a select is blocking the flush from completing and queuing all requests behind it. Our current solution to deal with this issue is we have a guardian process that runs during a backup. It then kills any process that causes a stall of the flush. We are also researching into other ways that could improve this in the future.