Development and UX from Michael Mahemoff. Maker of Player FM. Previously: Google, BT, O'Reilly author.

Menu

Firefighting an RSS Aggregator’s Performance

That’s a before-and-after shot of the database server’s CPU! I was watching it slowly creep up, planning to inspect it after some other work, before receiving mails from Linode that the virtual server is running over 102% capacity, then 110, 120, …

Three things made the difference in fixing this:

Feed Item Keys Must Be Unique

The most important thing was to nail down keys, which I noticed from looking at logs and the oddly cyclic nature of the graph above. I later on ran a query to see how many items were being stored for each feed, and sure enough, certain feeds had thousands of items and counting.

The RSS 2.0 spec (as official a spec as there is) says of individual items: “All elements of an item are optional, however at least one of title or description must be present.”. What’s missing there is a primary key! Fortunately, most feeds do have a unique <link>, <guid>, or both. But if you’re trying to be robust and handle unanticipated feeds, it gets tricky. There were also some boundary cases involving feeds which had changed their strategy (fortunately, improved it by adding guids) at some point, but never updated the old items. So the feed was a hybrid.

The net effect was a gigantic number of items being accumulated for certain feeds. Every hour, the server checked for updates, it decided that yes these key-less feeds had totally changed and we need to pull all the posts in again and save a record of it. That’s why you see the hourly cycles in the “before” picture. I still need to go and cleanse the database of those duplicate items.

By taking a step back and looking at what makes the items truly unique, and with the help of Rails’ handy collection methods, it was possible to make feed items unique again and smooth out crawling.

Indexing

Inspecting a handful of anomalous feeds once an hour, due to the problem mentioned above, is not the worst thing in the world. What made the server veer towards FUBAR was certain query that was being performed each time in the absence of indexes. I was able to see the heaviest queries in the Rails log using the grep/sed command posted here yesterday. I added those indexes and the queries went from ~ 1200ms to 20ms, with the overall throughput for a feed dropping down to about 20% of its former time.

Validation

A third issue was forcing the database to run the wheel all the time. This wasn’t a major hourly thrashing like above, but a few feeds that were being polled every few minutes.

I got a sniff of this problem when I noticed the same set of feeds would keep appearing when I looked at the logs. After grepping, I realised they were not obeying the rule of waiting an hour to re-check, but were in fact taking their turn to poll the external feed, then jumping right back in line for another go.

This really wasn’t having much performance impact, because these feeds weren’t adding new items with each check (as the item keys were sound). But with more feeds like this, it could have an impact, and more to the point, being polled every few minutes is not good for my bandwidth or the people on the receiving end!

The cause turned out to be some trivial problems with feed items, which were being blocked by Rails’ validation when trying to save the items. Because scheduling info is &emdash; for convenience &emdash; tied to the items’ records, the scheduling info was being lost. A bit of overkill to isolate out the scheduling info at this stage, so I switched the validation to a before_save which did some cleansing to ensure the format is right.

Update: IO Rate

Another issue I still had to fix was the IO rate. You can see it above, not in the spikes – which reflect me making the fixes above – but in the small wave pattern on the left. Those are actually very high in absolute terms, at around 1K blocks per second being transferred between disk and memory. This is due to swap thrashing and required updates to my.cnf. In particular, decreasing key_buffer. Also, I decreased max_connections, such that (with key_buffer change), https://github.com/rackerhacker/MySQLTuner-perl was content with the memory required, and also increasing innodb_buffer_pool_size. I haven’t measured the effect of that yet, need to let it run for a while in order to get that.

I’m sure plenty of other optimisations are possible, but the good news is that IO Rate has gone right down to near-zero and swap rate likewise. So no more thrashing.

6 thoughts on Firefighting an RSS Aggregator’s Performance

If I understand your post (and I’m not sure I do), your problem is in trusting the feeds. You need to hash feed items as they come in (I combine the title, url and content and use MD5 if I remember correctly) then store that hash in an indexed field which you can use to compare on the next update. If an item changes in content, it’ll change the hash. You can’t ever rely on a feed’s guid to be unique.

This has been floating in my head all day – I use SHA1 as the hash (not MD5) as it’s faster and produces a smaller output. The chances of collision within a single feed source is remotely small so it works perfectly.

Also at some point you’ll need to purge older feed items because the storage gets insanely big really fast. I subscribe to only a hundred or so feeds, but it’ll easily fill up a few GBs of database storage over a couple months.

When you truncate the items, you should leave the ids, urls, titles and hashes in the DB, that way there’s always a record of old content and you won’t serve up duplicates from feeds that update only rarely.

BTW, the best library for feed parsing hands-down is simplepie.org. It’s in PHP, but it’s got all the edge cases covered. It’ll detect which of the various versions of RSS or Atom a feed is, figure out things like descriptions vs. summaries, auhors, dates, character encoding issues, etc. Even if you don’t use it, a quick look through the code will make you aware of stuff to look out for.

Regarding full text, did a few calculations using variants of the following against a MySQL corpus:

SHOW TABLE STATUS LIKE ‘%’;
SELECT SUM(LENGTH(content)) from posts;

Very roughly, it’s ~50% of the total size for this corpus, which is not too big and also has the benefit of requiring indexes (which may be take up as much storage as the table itself). In order to enable full-text search, text indexing is indeed required, but at least it can be offloaded to a dedicated server like Solr and not clog up the DB.

Overall, the text ~doubles the overall size. As long as it doesn’t 10x it, that’s not too drastic if it’s adding value.

G’Day

Welcome to Michael Mahemoff's blog, soapboxing on software and the web since 2004. I'm presently using HTML5 and the web to make podcasts easier to share, play, and discover at Player FM. I've previously worked at Google and Osmosoft, and built the Ajax Patterns wiki and corresponding book, "Ajax Design Patterns" (O'Reilly 2006).
For avoidance of doubt, I'm not a female, nor ever have been to my knowledge. The title of this blog alludes to English As She Is Spoke, a book so profoundly flawed it reminded me of the maturity of the software industry when this blog began in 2004. I believe the industry has become more sophisticated since then, particularly the importance of UX.
Follow @mahemoff