Data seem sometimes to have their own life and will, and they refuse to behave as we wish.
Then, you need a firm hand to tame the wild data and turn them into quiet and obeying pets.

Friday, November 12, 2010

Filtering binary logs with MySQL Sandbox and replication

A few days ago, a friend of mine asked me if I knew of a way of filtering a bunch of binary logs, to extract only statements related to a single table. The task was about filtering a few hundred binary log files.

It's a tricky problem. Even with my experience with regular expressions, I knew that using a script to extract statements related to a single table was going to be a nasty business.
However, I know of an entity that can filter by table name efficiently, and that's the MySQL replication system. So I suggested using replication to a sandbox with a replicate-wild-do-table statement to get the job done.
My friend was skeptical and did not want to go that way. I was busy writing an article for an Italian magazine and did not follow up immediately. But today, with the article safely in the editor's hands, I did a quick test, and guess what? It works!
Here is a step-by-step procedure to do it. I started with a server built with MySQL Sandbox, using MySQL 5.5.7. I used the employees test database to create a large enough binlog, and soon I had a database containing 160 MB of data and a binary log of about the same size.
Then I decided that I wanted to filter the binlog, to get only statements about the employees table. Thus, I issued this command:

The "-c" option transfers its argument to the sandbox configuration file.
At the end of this operation, I had one server with the same version of the server that I had filled with the employee database. The server is ready to filter replicated streams, accepting only commands that affect the table 'employees' within the database 'employees'.

The second step was to create an empty database in the second server, with the Innodb tables converted to BlackHole (to avoid wasting unnecessary space).

Combining the flexibility of the sandbox with some command line skills, the operation requires just one command.
Before starting the replication, I needed to avoid re-creating the tables, or my blackhole trick would have been useless. So I looked at the binary log, and found where the CREATE TABLE statements ended:
$ ./my sqlbinlog ./data/mysql-bin.000001 |less
BEGIN
/*!*/;
# at 3057
#101112 9:48:45 server id 1 end_log_pos 3364 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1289551725/*!*/;
INSERT INTO `departments` VALUES ('d001','Marketing'),('d002','Finance'),('d003','Human Resources'),('d004','Production'),('d005','Development'),('d006','Quality Management'),('d007','Sales'),('d008','Research'),('d009','Customer Service')/*!*/;

Armed with this knowledge, I logged in the second server and did the following:

FLUSH BINARY LOGS; # to get a binlog with only the statements that I need
CHANGE MASTER TO
master_host='127.0.0.1',
master_port=5570,
master_user='msandbox',
master_password='msandbox',
master_log_file='mysql-bin.000001',
master_log_pos=3057; # this is the position after all the CREATE TABLE
# statements in the master
START SLAVE;

After a few seconds, I issued a "SHOW SLAVE STATUS". All was OK.
I flushed the logs again and inspected the second binary log. As expected, it contained only the statements related to the employees table.
Total cost of the operation: 5 minutes. Way quicker than writing this report!

@Shantanu,You will filter all the commands that contain "tbl_company", even inside a query itself. For example, if the table name is also inside the text being inserted into a table, you will filter the wrong query.Bit that is not the most important part. The biggest problem is that the query depends on several statements that are executed before the command, such as SET TIMESTAMP and SET SQL_MODE, and so on.So, unless you want to implement the replication server itself, a simple regular expression will not get you off the hook.

I had to do something like this in a data recovery scenario recently. We had hundreds of binary logs to dig through to extract changes for only a subset of tables. Using MySQL replication to filter the binary logs was very effective.

What about real scenario with production server?For eg. i have 20, 1Gb binary logs, the oldest one was created 2 weeks ago. and i have a backup 2 weeks old too.After recovery from backup, i want to do point-in-time recovery exactly for 1 table.Could you suggest the easy way to achieve this goal?