Data Reduction, Part 2

Last month, we left off part way through a data reduction effort with ten million Apache log records in a single MySQL table that was taking up far too much disk space and memory. We analyzed the data, found ways to normalize the schema to reduce the space required, and created the new tables. Now let's finish the job by creating a script that can intelligently move data from the old table into the new ones.

The Plan

Migrating data from the old table to the normalized set of tables should be both simple and not overly burdensome on the database server. We should never do anything that locks the original table for more than a second or so; otherwise, it would prevent hits from being logged in a timely fashion and would cause Apache processes to block while waiting for the database.

The method we’ll use is to migrate entries in chronological order, from oldest to newest, and we’ll do so in small batches, pausing between each batch to reduce the overall impact of converting the data. It will take a non-trivial amount of time, but it should be a smooth transition. To reduce the number of queries sent to MySQL, we’ll try to cache frequently used data in memory along the way.

The code is included with this article as Listing One. Let’s walk through the code and see how it works.

The Code

Lines 1-10 tell Perl to enable warnings, unbuffer standard output, and pull in the Hits::DBI package. Hits::DBI is a simple sub-class of DBIx::DWIW, which is a handy module that makes using Perl’s DBI even easier than it already is. The relevant code in Hits::DBI looks like this:
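A minimal sketch of that sub-class, following the connection-defaults pattern DBIx::DWIW documents (the user, password, and host values here are placeholders, not the real credentials):

```perl
package Hits::DBI;

use strict;
use DBIx::DWIW;
our @ISA = ('DBIx::DWIW');

# Connection defaults picked up by DBIx::DWIW's Connect() method.
# The user/pass/host values are placeholders.
sub DefaultDB   { "jeremy_zawodny_com" }
sub DefaultUser { "hits"               }
sub DefaultPass { "secret"             }
sub DefaultHost { "localhost"          }

1;
```

With those defaults in place, the script can connect with nothing more than `my $db = Hits::DBI->Connect();`.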

The DB attribute contains the name of the database we'll connect to. Here, jeremy_zawodny_com contains the new tables.

From there, we set up some variables that control what the script does. The $old_tbl table is where we're pulling records from. $interval specifies the time range (in seconds) we'll use when selecting records. That's 24 minutes' worth of traffic. (Initial test runs of the code showed that using a full hour was a bit too heavy.) $delay specifies how long the code sleeps at the end of the processing loop before it requests another batch of records to process.
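Those settings might look like the following sketch (the table name and the delay value are illustrative guesses; only the 24-minute interval comes from the description above):

```perl
# Table we're migrating away from (name is a guess).
my $old_tbl  = 'access_log';

# Seconds of traffic to process per batch: 24 minutes.
my $interval = 24 * 60;

# Hypothetical pause (in seconds) between batches.
my $delay    = 10;
```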

Lines 18-20 define several global caches that we'll use to increase the performance of the conversion. There's a cache for each of the lookup tables (agent, referer, and uri) that we created last month.
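The caches are just hashes mapping each string to its numeric id, so repeat lookups never have to hit MySQL. A sketch:

```perl
# In-memory caches mapping strings to their numeric ids.
my %agent_id;    # user-agent string -> agent.id
my %referer_id;  # referer URL       -> referer.id
my %uri_id;      # request URI       -> uri.id
```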

Lines 24-26 bootstrap things by finding the timestamp of the oldest record in the table. From there, we’ll walk forward in time using $interval.
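The bootstrap query might look like this sketch, assuming a timestamp column named time_stamp and DBIx::DWIW's Scalar() convenience method (which returns the first value of the first row):

```perl
# Find the oldest record so we can walk forward in
# $interval-sized windows.  The column name is an assumption.
my $mintime = $db->Scalar("SELECT MIN(time_stamp) FROM $old_tbl");

die "no records to migrate\n" unless defined $mintime;
```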

Line 28 is a sanity check to make sure the script never touches data that’s less than one day old. This means the script could be safely run once a day via cron and it should always leave the latest 24 hours worth of data in the original table.
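That check amounts to comparing the oldest timestamp against a cutoff one day in the past, along these lines:

```perl
# Never touch anything newer than 24 hours old, so a daily
# cron run always leaves the most recent day's hits behind.
my $cutoff = time() - 86_400;   # one day ago, in seconds

die "nothing old enough to migrate\n" if $mintime >= $cutoff;
```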

Lines 30-38 set up the SQL queries we'll use to fetch a batch of hits and then insert them into the hit table. On line 42, we initialize the $total counter to zero. It's used to keep track of how many records we migrate.
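The two queries are probably along these lines; the column names in both statements are illustrative guesses based on last month's schema, not copies of the real listing:

```perl
# Fetch one $interval-sized window of hits, oldest first.
my $fetch_sql = qq[
    SELECT *
      FROM $old_tbl
     WHERE time_stamp >= ? AND time_stamp < ?
     ORDER BY time_stamp
];

# Insert a normalized row into the new hit table.
my $insert_sql = qq[
    INSERT INTO hit (time_stamp, agent_id, referer_id, uri_id)
    VALUES (?, ?, ?, ?)
];

my $total = 0;   # rows migrated so far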

The main loop in the code begins on line 44 and continues to line 90. It starts by fetching a batch of records. The records come back as a list of hash refs stored in @r. For each record, we extract the agent, referer, and URI, and convert them to numeric identifiers using the agent(), referer(), and uri() functions. (We’ll discuss those shortly.)

Starting at line 62, we use the accumulated data to add a new record to the hit table. We then track and count up any errors and increment the total number of rows migrated in $total. After each row is processed, we use the record’s timestamp to calculate $hightime, which is used to bound the results returned from the SQL query to fetch the next batch of rows. At the end of the loop (line 87), we pause for a few seconds before starting the next batch.
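Put together, the main loop has roughly this shape. This is a condensed sketch, assuming DBIx::DWIW's Hashes() and Execute() methods accept bind values after the SQL; the error handling in the real listing is more thorough:

```perl
my $hightime = $mintime;

while (1)
{
    # Fetch the next window of records as a list of hash refs.
    my @r = $db->Hashes($fetch_sql, $mintime, $mintime + $interval);
    last unless @r;   # caught up (the real script also honors the cutoff)

    for my $rec (@r)
    {
        # Map strings to ids, consulting the caches first.
        my $agent_id   = agent($rec->{agent});
        my $referer_id = referer($rec->{referer});
        my $uri_id     = uri($rec->{uri});

        $db->Execute($insert_sql, $rec->{time_stamp},
                     $agent_id, $referer_id, $uri_id)
            or warn "insert failed at $rec->{time_stamp}\n";

        $total++;
        $hightime = $rec->{time_stamp};   # bounds the next fetch
    }

    $mintime = $hightime;
    sleep $delay;   # be gentle on the server between batches
}
```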

The rest of the code is composed of the agent(), referer(), and uri() functions. All three functions perform nearly identical tasks and probably should be combined to reduce the amount of code. (But doing so would add some complexity to the code and make it a bit harder to understand at first glance.)

The uri() function, for example, is passed a URI and is tasked with returning the URI's id from the uri table. If it doesn't exist, it will add the URI and return the newly created id. First, we check the global cache. If it's found, we return immediately. Otherwise, we check the database table, inserting a new record and returning the new id. Here's the code that performs those steps:
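A sketch of those steps, again assuming DBIx::DWIW's Scalar() and Execute() methods and using MySQL's LAST_INSERT_ID() to recover the auto-generated id:

```perl
sub uri
{
    my $uri = shift;

    # Fast path: we've seen this URI before.
    return $uri_id{$uri} if exists $uri_id{$uri};

    # Next, try the uri lookup table itself.
    my $id = $db->Scalar("SELECT id FROM uri WHERE uri = ?", $uri);

    if (not defined $id)
    {
        # Not there yet: insert it and grab the new id.
        $db->Execute("INSERT INTO uri (uri) VALUES (?)", $uri);
        $id = $db->Scalar("SELECT LAST_INSERT_ID()");
    }

    return $uri_id{$uri} = $id;   # cache for next time
}
```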

You might notice that the global caches are never cleared. In testing this code, the memory usage grew to roughly 26 MB and stabilized. There wasn't sufficient diversity in any of the lookup tables to cause the caching to become a problem. If it were, you could simply empty the caches (%agent_id, %referer_id, and %uri_id) every time $total grows by 100,000 or so (if not $total % 100000).
