Meta

Month: April 2013

By coincidence, my last two jobs I’ve been in have seen me administer a very high traffic ad server. Or maybe not so much a coincidence, since InnoDB is a great OLTP database with a focus on consistent response time – not just raw throughput. It is also very well optimized for the simpler style queries (primary key lookups, simple secondary key etc) which seems to be all that is needed to serve ads.

Anyway, in 2011/2012 I found myself managing OpenX – and discovered some quirks with it.

OpenX 2.6

When I first inherited an OpenX 2.6 install, it was configured in a mode called distributed statistics. What this means is that you have one central database server, and then a series of slave MySQL servers. The impression and click recording web servers write to a slave server, which then has a cron that batches up and writes to the master database server. The master database server always has a copy of all the data – and has a cron that runs a summarization of the data to feed reports.

What I didn’t like about this design is that the master always required all of the data anyway. so the slaves weren’t necessarily adding any capacity. What they were doing however was not without benefit – they were a poor man’s message queue. As data arrived there was some opportunity for flow control so your peaks would be smoothed out. It also enable you to optimize for throughput rather than response and put load on the master without impact to ad serving.

Message queues are a great part of any architecture – I am in love with them. However, in OpenX particular implementation, since the slaves didn’t have all the data, and the master’s summarization used temporary tables it also caused a lot of problems. We couldn’t upgrade to row-based replication because the summarization process created temporary tables with the TYPE=X syntax which had since been deprecated – locking us into an earlier version of MySQL as well.

I also profiled the code and noticed that our performance was very erratic. Sometimes requests were super fast, other times they would wait minutes. What I discovered was that the cron that ran on the slave servers was blocking – when it ran every 5 minutes (configurable) it would CHECK TABLE, then REPAIR TABLE if required! – a paranoid design that is centered around MyISAM. InnoDB has internal consistency checking with page checksums and doesn’t require this. So the first thing I did to improve OpenX 2.6 performance was comment out CHECK TABLE/REPAIR TABLE, and things were instantly much better.

When Open 2.6 was released in 2008, maybe more people had single core machines and web servers processed less traffic – so they may not have noticed exactly how hot of a lock waiting on CHECK TABLE/REPAIR TABLE was – which is a single threaded process inside MySQL. Since our slave servers sat on the web servers, we had these locks fairly well distributed, and since I don’t think the servers synced with NTP, this managed to stay functional for years because machines would fall in and out of service from the load balancer on different cycles. However, as traffic per server increased, it also got much worse.

I also changed the slave servers to not be on the web servers – but a separate tier of servers, which supported x3 web servers per slave without any issues. To me this was more elegant architecturally, because of the time consuming process of rebuilding slaves as they failed due to the master’s use of temporary tables, and that database servers tend to have uniquely different needs to web servers (more memory, faster disks, CPUs less important). However, my little architecture optimization presented me with another problem – the cron that ran on the slaves was consuming too much memory and dying!

What I discovered was that it was simply reading from a few tables, and then inserting those rows on the master (like I said above – a poor man’s message queue). But it did this in PHP and used all sorts of associative arrays that were not memory efficient. I managed to reverse engineer and change this cron to be just a simple shell script which used the MySQL CLI and was incredibly more efficient. It also meant that my CHECK TABLE/REPAIR TABLE patch was obsolete – which made me comfortable having slightly less OpenX code to maintain.

However, OpenX 2.6 was still a beast at best. The ‘monthly’ part of the maintenance script on the master couldn’t delete old impression rows as fast as new ones came in – and the total impressions table was over 1TB and caused a lot of locking inside InnoDB! So I commented out the monthly cleanup code and wrote a standalone script to constantly cleanup old versions of the data. However, my standalone script wasn’t fast enough either, so every month we would temporarily disable the slave crons and master cron while we created a new table like the previous table and then inserted just the small fraction of data we required, renaming and dropping the old table. It was very high effort to maintain, and not particularly operationally friendly.

OpenX 2.8

OpenX obviously figured out on their own that there was a problem with how they were collecting data, because with 2.8 they changed from using raw impression logging of one row per impression to something they called Bucket Based Logging . What they would do, is record only a running count of impressions per ad-zone/creative-id/time-period, and perform an insert on duplicate key update count = count+1. They still had the ability to have “distributed” statistics with the slaves having a part of the data, and the maintenance on the slaves would have a special update to the master with an insert on duplicate key update count = count + my_number.

Taking from my earlier thought process where the slaves are really just message queues, I managed to find a really non invasive way to modify the OpenX code to not write to MySQL, but instead write to a message queue – in my case ZeroMQ. I then wrote some proof of concept code (that a colleague of mine later improved) which would receive from the message queue, and then merge requests by ad-zone/creative-id/time-period and then insert directly onto the master server. Eliminating those pesky slaves!

I left before the project could go live – and from what I heard afterwards, it unfortunately never did. But when I reflect on it, I really don’t like OpenX’s 2.8 bucket based logging and from an operational perspective – I think this would have still been hire maintenance. If one slave was too slow to phone home statistics before the master ran its summary cron, you will be missing statistics. And since the on duplicate key update count = count+1 style design is not idempotent – you can’t easily regenerate statistics to fix data. You are stuck with inaccurate results.

Then I would have written a cron to find all log files not currently open, and then summarize them similar to how my message queue reader did + upload the log file to permanent storage (in our case Amazon S3 would have worked). I don’t think that MySQL is the correct place to store the raw data – but it’s essential to be able to reprocess as necessary:

Keeping these log files in plain text on S3 make them easy victims for Amazon EMR – which can read and process directly, and then store the results in MySQL. I would have loved to have eliminated the summarization cron that ran on the master and turned it into a hive query. Then I would have used sqoop to write the results back to MySQL. I am sure refactoring the summarization/maintenance cron to Hive would have been difficult – there were a lot of strange behaviors in it, but it would have made things idempotent, reduced a lot of load on my MySQL servers (which are not good at the complex summarization queries) and meant I could avoid the temporary tables which made replication slaves hard. I would have kept one replication slave for HA.

Also – the more I think about it, being able to keep raw log files is so important. They are incredibly easy to compress and retain, and often take up a lot less space than any (row-store) database indexed equivalent would.