Musings of a remote software developer…

MySQL – Processing 8.5 Million Rows In a Reasonable Amount of Time

I had to crunch through a database of approximately 8.5 million records computing a hash, validating a few fields, and then updating a column with the results of the hash. Because I needed to compute the hash, I couldn’t just use an UPDATE statement to work on all the rows – I had to read and update each of the 8.5 million rows in my script. Sounds painful already!

On my initial attempt (SELECT the record, calculate the hash, then UPDATE), I was able to process about 150 rows per second… Let's see, that'll take about 20 hours… We can do better than that…

Parallelization & Optimization

The first thing I wanted to do was make use of all 8 logical cores (a Core i7 with hyper-threading), which is why I chose Sidekiq. With it, multiple workers could crunch different chunks of data and hopefully saturate both IO and CPU. I used Boson to turn my app into a simple command-line application, but I could just as easily have used rake tasks.

My strategy was to have a secondary table (output_table) that I would dump my results into and join against the primary table (work_table). With a LEFT OUTER JOIN, I could search for rows in work_table that had NULL in the results field, indicating they had not been processed yet. I tried Sequel for this project because I had never used it before and thought it would be nice to have a Ruby-model way of accessing the data; it turned out to be fairly pleasant. The data was divided into chunks among the workers, each chunk_size rows large (100 in the original example).
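The chunk query might look something like this sketch (table names follow the description above; the result column name is an assumption):

```sql
-- Find unprocessed rows: rows in work_table with no result yet.
SELECT w.*
FROM work_table w
LEFT OUTER JOIN output_table o ON o.id = w.id
WHERE o.result_hash IS NULL
ORDER BY w.id
LIMIT 100;  -- chunk_size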

This algorithm and implementation did not turn out so well, though! Because I was SELECT-ing and processing data non-atomically, if worker 1 finished rows 100-199 and asked for another set, it would start working on rows 200-299 at the same time as worker 2, which was supposed to be working on that data…

Bad, bad, bad race conditions… inefficient processing… still only about 300 rows per second… 10 hours is still too long for this job… back to Google, StackOverflow, and the MySQL manual…

I avoided the race condition by using LIMIT and OFFSET so that each worker always advances through the table rather than basing its query on data that is still being processed.
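As a sketch, each worker's query advances purely by offset (column names are assumptions):

```sql
-- Each worker claims the next window and always moves forward, so rows
-- still being processed never shift the window boundaries.
SELECT w.id, w.payload
FROM work_table w
ORDER BY w.id
LIMIT 100 OFFSET 4000;  -- OFFSET = chunk_number * chunk_size
```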

We write each processed batch to a CSV file and have MySQL import it with LOAD DATA INFILE.
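A minimal sketch of that hand-off, assuming the table/column names used above (the batch data here is made up for illustration):

```ruby
require "csv"
require "tmpdir"

# Each worker dumps its processed batch ([id, computed_hash] pairs) to a
# CSV file instead of issuing row-by-row INSERTs.
batch = [[1, "a1b2c3"], [2, "d4e5f6"]]

csv_path = File.join(Dir.tmpdir, "batch_worker_1.csv")
CSV.open(csv_path, "w") do |csv|
  batch.each { |row| csv << row }
end

# The statement the worker would then run over its DB connection to have
# MySQL bulk-load the file:
load_sql = <<~SQL
  LOAD DATA LOCAL INFILE '#{csv_path}'
  INTO TABLE output_table
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\\n'
  (id, result_hash)
SQL
```

Bulk loading a file is dramatically cheaper than 8.5 million individual INSERT round-trips, which is where the speedup comes from.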

Let’s give it a run…

Worker #1 reports validate job took 2153123.5738263847 milliseconds

Yay! We were finally able to process the full dataset in a reasonable amount of time! Half an hour is not too bad for 8.5 million records, right? Actually, we can probably optimize this a bit more…

Scanning By Primary ID instead of LIMIT/OFFSET

I noticed that workers processing data at the beginning of the table were returning quickly, in 1-5 seconds, but as we reached the end of the table, around the 4 million mark, queries got extremely slow and the CPU started working really hard.

I did a little digging and found that the issue is that LIMIT/OFFSET has to scan the ENTIRE dataset up to the point you asked for, then discard everything before the offset and return just the rows you want. In other words, MySQL was reading through 4,001,000 records to give me records 4,000,000-4,001,000, then through 4,002,000 records for the next chunk, and so forth. No wonder the query was getting slower and slower!
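Scanning by primary-key range avoids the repeated scan entirely. A minimal sketch of carving the ID space into worker chunks (the method name is my own, not from the original code):

```ruby
# Split the ID space [1, max_id] into fixed-size ranges so each worker
# can run: SELECT ... WHERE id BETWEEN lo AND hi
# Gaps left by deleted rows simply produce empty (and fast) queries.
def id_ranges(max_id, chunk_size)
  (1..max_id).step(chunk_size).map do |lo|
    [lo, [lo + chunk_size - 1, max_id].min]
  end
end
```

For example, `id_ranges(8_500_000, 20_000).first` yields `[1, 20000]`, and each `WHERE id BETWEEN lo AND hi` lookup is a cheap index-range scan no matter how deep into the table it is.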

We now scan the table by primary ID from 1 up to the AUTO_INCREMENT counter, so we are guaranteed to cover all the rows. This SELECT method was fast! If a range contained no data, the query returned almost instantaneously. I had gaps in my table from deletes, so it was critical that empty-range queries come back quickly; overall, I probably lost only a few trivial seconds skipping over deleted IDs. Let's run our benchmark again:

Worker #9 reports validate job took 208492.26823838233 milliseconds

Wow! We’re able to process through 8.5 million records within FOUR MINUTES! That is certainly fast enough for me to be working with this database on a regular basis.

Applying What We Learned to UPDATEs

It is unfortunate that we can't use LOAD DATA INFILE for updates, but let's see if we can apply the primary-key scan and parallelization techniques to UPDATE-ing the existing output_table that we created.

For this job, I added the commonly recommended tweaks of turning autocommit, unique_checks, and foreign_key_checks off, but with my database I don't think I saw any improvement; I wasn't really using unique or foreign keys much.
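Those session tweaks look something like this sketch, run per connection around each batch of UPDATEs:

```sql
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

-- ... batch of UPDATE statements ...

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
```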

Anyway, after running this job, I was able to make 8.5 million updates in 26 minutes:

Worker #6 reports validate job took 1557370.9786546987 milliseconds

Not bad, but could we make this faster?

LOAD DATA INFILE to temp table + UPDATE by JOIN

Doing more research, I saw that you can UPDATE a column in one table from another table via a JOIN. That gave me the idea of loading the column I wanted to change into a temp table and then overwriting the corresponding column in output_table. Would it work?

My strategy was to do exactly the same LOAD DATA INFILE dance, but into a tmp_table; then, when all the workers were done, merge the data from tmp_table into output_table. I had the last worker to finish run that final query, by keeping a count of completed workers.
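The final merge, as a sketch (tmp_table and result_hash are assumed names):

```sql
-- Multi-table UPDATE: overwrite output_table's column in one statement
-- from the freshly bulk-loaded temp table.
UPDATE output_table o
JOIN tmp_table t ON t.id = o.id
SET o.result_hash = t.result_hash;

DROP TABLE tmp_table;
```

A single joined UPDATE lets the server do the work in one pass instead of millions of individual UPDATE round-trips.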

What’s the runtime?

Worker #7 reports validate job took 394662.35620292247 milliseconds

We got updates down to under 7 minutes!

Final Optimization Notes

To tune the system, pay close attention to CPU time and iowait, which you can see with top. If iowait is high, consider lowering the chunk size to work on smaller chunks at a time; I was able to keep my iowait under 5%. If the CPU isn't fully loaded, feel free to bump up the number of workers.

For my system:

i7 3770 3.4Ghz
16GB RAM
256GB Samsung 850 Pro
MySQL 5.5

I ended up with 20 workers and a chunk size of 20,000 (rows selected at once). When I tried increasing the number of workers beyond that, it actually hurt the benchmark, so there is a point past which more workers stop helping. If you have more RAM, I would also consider tuning the MySQL server, even trying to keep the whole DB buffered in memory (innodb_buffer_pool_size). MySQLTuner is a good resource, as is dba.stackexchange.com.