Development and UX from Michael Mahemoff. Maker of Player FM. Previously: Google, BT, O'Reilly author.

Menu

Tag Archives: Feeds

Background: Has the Feed Changed?

I previously mentioned some work I did to cut down processing and IO on an RSS client. Yesterday, I was able to continue this effort with some more enhancements geared around checking if the feed has changed. These changes are not just important for my server’s performance, but also for being a good internet citizen and not hammering others’ machines with gratuitous requests. Note everything in this article will be basic hygiene for anyone whose written any kind of high-scale bot, but documenting here as it was useful learning to me.

Normally, a fetch requires the client to compare the incoming feed against what has been stored. This requires lookup on the database and a comparison process. It’s read-only, so not hugely expensive, but does require reading a lot — all items in the feed — and at frequent intervals.

All this comparison effort would be unnecessary if we could guarantee the feed hasn’t changed since the last fetch. And of course, most of the time, it won’t have changed. If we’re fetching feeds hourly, and the feed changes on average once a week, then we can theoretically skip the whole comparison 99.4% of the time!

So how can we check if the feed has changed?

Feed Hash

The brute-force way to check if the feed has changed is to compare the feed content with the one we received last time. We could store the incoming feed in a file, and if it’s the same as the one we just sucked down, we can safely next it.

Storing a jillion feed files is expensive and unnecessary. (Though some people might temporarily store them if they’ve separated the fetching from the comparison, to prevent blockages, which I haven’t done here). If all we need the files for is a comparison, we can instead store a hash. With a decent hash, the chance of a false positive is extremely low and the severity in this context also extremely low.

HTTP if-not-modified-since

The HTTP protocol provides its own support for this kind of thing, via the if-not-modified-since request header. So we should send this header, and we can then expect a 304 response in the likely event no change has happened. This will save transferring the actual file as well as bypassing the hash check above. (However, since this is not at all supported everywhere, we still do need the above check as an extra precaution.)

ETag

Another HTTPism is ETag, a value that, like our hash, is guaranteed to change if the feed content changes. So to be extra-sure we’re not re-processing the same feed, and hopefully not even fetching the whole feed, we can save the ETag and include it in each request. It works like if-not-modified-since; if the server is still serving the same ETag, it will respond with an empty 304.

For the record, about half of the feeds I’ve tested — mostly from fairly popular sources, many of them commercial — include ETags. And of those, at least some of them change the ETag unnecessarily often, which renders it useless in those cases (actually worse than useless, since it consumes unnecessary resources). Given that level of support, I’m not actually convinced it adds much value over just using if-not-modified-since, but I’ll leave it in for now. I’m sure managers of those servers which do support it would prefer it be used.

Share this:

I’m getting more convinced Plus is the new Twitter, and also the new Posterous. I’ve been posting things on there I previously would have stuck on the Twitter or the Posterous, and so it was time to integrate Plus on my homepage alongside the existing Twitter and Posterous links.

Latest Post

It was pretty easy to integrate my latest Google Plus post (we don’t really have a name for a Plus post yet; a plust?), as I already have a framework in place for showing the last post from an Atom or RSS feed.

Share this:

I just discovered a new feed meta-search: TalkDigger (via Data Mining).
It’s Ajax search all the way (buzzword overload!).

The site shows how ideal Ajax is for meta-search. Each time you enter a query, the browser fires off multiple queries – one for each engine it searches. That means the results all come back in parallel – no bottlenecks.

Back in the day, metacrawler and others were smart enough to start writing out the page straightaway, so users start seeing some results while others are still pending. The Ajax meta-search improves on the situation by directly morphing the result panels, so the page structure remains fixed even as all the results are populated. Each panel gets its own Progress Indicator.

This is an example of Multi-Stage Download – set up a few empty blocks and populate them with separate queries. When I initially created the pattern, it was pure speculation, but TalkDigger now makes the third real example I know of. I recently created a Multi-Stage Download Demo.

Another nice feature of TalkDigger, which fits well with meta-search, is the use of Microlinks: You can click on the results to immediately expand out a summary.

There’s some more features I’m hoping to see:

The results page definitely needs work – it’s nice seeing a brief summary of all results and having them expandable, but it’s difficult to get an overall feel. An “Expand All” would help, or showing at least one posting for each search engine.

The results are broken up by an ad. To me, that’s counter-productive as they look like two separate panels. I think most users will mentally filter out the ad anyway and just see the results as broken into two.

[Sortable columns](http://ajaxpatterns.org/Query-Report Table] – so I could sort by engine name or feed count.

Share this:

G’Day

Welcome to Michael Mahemoff's blog, soapboxing on software and the web since 2004. I'm presently using HTML5 and the web to make podcasts easier to share, play, and discover at Player FM. I've previously worked at Google and Osmosoft, and built the Ajax Patterns wiki and corresponding book, "Ajax Design Patterns" (O'Reilly 2006).
For avoidance of doubt, I'm not a female, nor ever have been to my knowledge. The title of this blog alludes to English As She Is Spoke, a book so profoundly flawed it reminded me of the maturity of the software industry when this blog began in 2004. I believe the industry has become more sophisticated since then, particularly the importance of UX.
Follow @mahemoff