When you add a web site like Flickr or Google Reader to FriendFeed, FriendFeed's servers constantly download your feed from the service to get your updates as quickly as possible. FriendFeed's user base has grown quite a bit since launch, and our servers now download millions of feeds from over 43 services every hour.

One of the limitations of this approach is that it is difficult to get updates from services quickly without FriendFeed's crawler overloading other sites' servers with update checks. Gary Burd and I have thought quite a bit about ways we could augment existing feed formats like Atom and RSS to make fetching updates faster and more efficient. Our proposal, which we have named Simple Update Protocol, or SUP, is below. You can read more details and check out sample code on Google Code. Discuss the proposal in the SUP FriendFeed room.

SUP is just a proposal at this stage. We are eager to get feedback and ideas, and we expect to update the protocol based on feedback over the next few months.

Simple Update Protocol

SUP (Simple Update Protocol) is a simple and compact "ping feed" that web services can produce in order to alert the consumers of their feeds when a feed has been updated. This reduces update latency and improves efficiency by eliminating the need for frequent polling.

Benefits include:

Simple to implement. Most sites can add support with only a few lines of code if their database already stores timestamps.

Works over HTTP, so it's very easy to publish and consume.

Cacheable. A SUP feed can be generated by a cron job and served from a static text file or from memcached.

Compact. Updates can be about 21 bytes each (8 bytes with gzip encoding).

SUP is designed to be especially easy for feed publishers to create. It's not ideal for small feed consumers because they will only be interested in a tiny fraction of the updates. However, intermediate services such as Gnip or others could easily consume a SUP feed and convert it into a subscribe/push model using XMPP or HTTP callbacks.

Sites wishing to produce a SUP feed must do two things:

Add a special <link> tag to their SUP-enabled Atom or RSS feeds. This <link> tag includes the feed's SUP-ID and the URL of the appropriate SUP feed.

Generate a SUP feed which lists the SUP-IDs of all recently updated feeds.
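As a sketch of the producer side, here is roughly what "a few lines of code" looks like in Python, assuming a database query that returns (SUP-ID, last-updated timestamp) pairs for recently changed feeds. The field names follow the draft on Google Code but should be treated as illustrative, not authoritative:

```python
import json
import time

def generate_sup_feed(recent_updates, period=60):
    """Build a SUP document from (sup_id, unix_timestamp) pairs.

    recent_updates: the feeds updated within the last `period` seconds,
    e.g. the result of a query like
    SELECT sup_id, updated_at FROM feeds WHERE updated_at > :now - :period.
    """
    now = int(time.time())
    return json.dumps({
        "period": period,
        "since_time": now - period,
        "updated_time": now,
        # Each entry pairs a SUP-ID with an opaque update token;
        # the update timestamp works fine as the token.
        "updates": [[sup_id, str(ts)] for sup_id, ts in recent_updates],
    })

print(generate_sup_feed([("44123b23", 1220045987), ("53924729", 1220045990)]))
```

A cron job could run this every minute and write the result to a static file, which is all the "Cacheable" bullet above requires.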

Feed consumers can add SUP support by:

Storing the SUP-IDs of the Atom/RSS feeds they consume.

Watching for those SUP-IDs in their associated SUP feeds.
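The consumer side can be sketched just as briefly. The SUP-IDs, feed URLs, and SUP endpoint below are examples only; the point is that one request to the SUP URL covers every watched feed:

```python
import json
import urllib.request

# SUP-IDs extracted from the <link> tags of feeds we crawl,
# mapped back to the feed URL to re-fetch when an update appears.
WATCHED = {
    "44123b23": "http://friendfeed.com/bret?format=atom",
    "53924729": "http://friendfeed.com/paul?format=atom",
}

def feeds_to_refetch(sup_document):
    """Return the watched feed URLs whose SUP-IDs appear in a SUP document."""
    doc = json.loads(sup_document)
    updated_ids = {entry[0] for entry in doc["updates"]}
    return [url for sup_id, url in WATCHED.items() if sup_id in updated_ids]

def poll(sup_url="http://friendfeed.com/api/sup.json"):
    # One HTTP request covers all watched feeds on this service.
    with urllib.request.urlopen(sup_url) as resp:
        for feed_url in feeds_to_refetch(resp.read()):
            print("refetch", feed_url)
```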

By using SUP-IDs instead of feed urls, we avoid having to expose the feed url, avoid URL canonicalization issues, and produce a more compact update feed (because SUP-IDs can be a database id or some other short token assigned by the service).

Because it is still possible to miss updates due to server errors or other malfunctions, SUP does not completely eliminate the need for polling. However, when using SUP, feed consumers can reduce polling frequency while simultaneously reducing update latency. For example, if a site such as FriendFeed switched from polling feeds every 30 minutes to polling every 300 minutes (5 hours), and also monitored the appropriate SUP feed every 3 minutes, the total amount of feed polling would be reduced by about 90%, and new updates would typically appear 10 times as fast.

Update: Several people have asked how using SUP compares with using HTTP If-Modified-Since headers. The two features are complementary. With SUP, feed consumers can monitor thousands of feeds with a single HTTP request (to fetch the latest SUP document) instead of having to request each feed individually. For example, each user's feed on FriendFeed has a unique SUP-ID (mine is "53924729"), but all of the feeds point to a single SUP URL, http://friendfeed.com/api/sup.json. Therefore, it's possible to watch for activity on thousands of separate FriendFeed URLs by polling just one URL, http://friendfeed.com/api/sup.json. If my SUP-ID appears in that SUP document, then you know that my feed has updated and it's time to fetch a new copy. This is substantially more efficient than polling each of those thousands of URLs individually.

David and Mike: our crawler already does HTTP If-Modified-Since requests. The main issue is that we still need to do one request for every feed we fetch. So if 80,000 Flickr users have connected their accounts to FriendFeed, we need to do 80,000 requests to Flickr to look for updates, and Flickr needs to check its database for every request (to see if an update has occurred). If we want at most 10 minutes of latency between publishing a photo and the update appearing on FriendFeed, that is 133 requests per second. With SUP, we can look for updates for ALL feeds with a single request, which reduces load on both ends. Does that make sense?

RaghuL: With SUP, when we first crawl a feed for a particular user, e.g., http://friendfeed.com/bret?format=atom, we get the SUP-ID out of the feed. In the feed above, the SUP-ID given is "44123b23". We can then monitor the SUP feed, which gives the most recent updates from all feeds, identified by SUP-ID: http://friendfeed.com/api/sup.json. When we see "44123b23" in the SUP feed, we know http://friendfeed.com/bret?format=atom has been updated, so we fetch it again to get the updates. Since we only need to monitor one feed (http://friendfeed.com/api/sup.json) to look for all the SUP-IDs we are watching, we only need to poll one URL to trigger updates for all the feeds we are monitoring. We get updates almost immediately after they happen, and we don't need to poll every feed to accomplish it.

This is one of those simple ideas that one wonders why no one thought of before. It's a good proposal and a required one in the rapidly growing aggregation/Lifestreaming world. I am sure the proposal will be widely and quickly SUPported.

Sounds a bit like what NEWNEWS did for NNTP, helping news traverse to the various NNTP servers faster. Here's a crazy thought: maybe a new HTTP verb "POLL" in combination with If-Modified-Since. Used against a single feed, it would limit the transmitted items to only the new items. Used against a domain, it could return the modified pages/feeds for that domain. It could even speed up search-engine crawlers. The Accept header could further limit what types of updates are returned via POLL.

""""On July 21st, 2008, FriendFeed crawled Flickr 2.9 million times to get the latest photos of 45,754 users of which 6,721 of that 45,754 visited Flickr in that 24 hour period, and could have 'potentially' uploaded a photo."

Source: http://www.slideshare.net/kellan/beyond-rest (Slide 16)

RSS is simply not going to get us there."""

Given that this still requires the services y'all are scraping to Do Something, why not become the killer-app for XMPP?

Your first and third benefits incorrectly assume that all information resides in a single database, which is not the case for any site large enough to benefit from your proposal. Most large architectures are built on sharding: groups of systems each handle a smaller subset of the overall user population, without needing to know about each other. Unless I am misunderstanding some vital component, what you are asking providers to do is create a single point of aggregation that must know about all updates within all shards. That in itself is a far more complex engineering problem than an end site making 200 requests a second (which, remember, in a sharded architecture scales by adding a few extra queries to each of a number of autonomous systems).

SUP would also multiply the volume of raw computing work that a data provider must do. In an RSS architecture, a system generates output only for the users that are actively being polled. In a "SUP" system, resources must be spent generating output for all users, regardless of whether anyone comes looking for it. Your claim that it reduces load on both ends is false; the only reduction in work is on the consuming end.

Your claim of not exposing secret feed URLs is also incorrect: as far as I can tell, there is no way to actually fetch the contents of an update without knowing the URL that originally led you to a SUP-ID.

Oh god... this is not a "simple idea that no one thought of before". This is weblogs.com's changes.xml (courtesy of Dave Winer) all over again.

In fact, the inherent scaling problems in such a solution led to the RSS <cloud> element, which attempted to solve the problem at the source, but through XML-RPC, which is why the cool kids don't support it.

(That and that it breaks if you get 1000s of timeouts from unavailable RPC endpoints)

Also: what Mike D said in his 2nd post.

His first one missed the point. HTTP HEAD is not going to solve this problem.

Please take this back to the drawing board. You're right, this problem needs a solution.

But the solution will not be simple, because it's a really complex problem.

I'm still having trouble envisioning what a provider's implementation of this SUP feed would look like. MikeD makes good points about federated databases making this far more complicated and expensive. Paul Watson rightly points out that this feed would be gigantic. If it hypothetically contained 80,000 items, Flickr's would completely turn over every 25 minutes or so.

Rather than continuing to complain endlessly, I'd like to hear from FriendFeed themselves how they would go about creating a SUP stream for FriendFeed itself. I want to be convinced that this is a valid easy-way-out of the much longer road of popularizing XMPP.

I like this idea: simple and elegant. I agree that constant polling sucks and is a terrible solution for what services like FriendFeed or Feedheads are doing; however, I struggle to imagine something like this being promptly adopted by most feed producers.

Google Reader's shared feed doesn't even make use of If-Modified-Since headers. They're Google!!

Re: multiple databases. First, you could serve multiple SUP feeds corresponding to your multiple databases. Second, if a client like FriendFeed or Gnip is hitting all your feeds, generating a single SUP feed is provably easier on your servers: you could poll your own RSS feeds on the server generating the SUP.
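To illustrate the sharding point: each shard could report its own recently updated SUP-IDs, and the SUP endpoint could merge those per-shard lists into one document. This is a hypothetical sketch of that aggregation step, not part of the proposal itself:

```python
import json

def merge_shard_updates(shard_docs, period=60):
    """Merge per-shard SUP-style update lists into one SUP document.

    shard_docs: one dict per database shard, each shaped like
    {"updates": [[sup_id, token], ...]} and covering only that
    shard's own users.
    """
    merged = []
    for doc in shard_docs:
        merged.extend(doc["updates"])
    return {"period": period, "updates": merged}

shards = [
    {"updates": [["a1", "100"]]},
    {"updates": [["b2", "101"], ["c3", "102"]]},
]
print(json.dumps(merge_shard_updates(shards)))
```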

re: "In an RSS architecture, a system must generate output for only the users that are actively being polled." SUP is useful when two conditions are satisfied: a lot of your feeds are being hit by a single client, and many of those feeds are not updated during the polling interval. SUP is not designed for the use case you're talking about.

Re: "it appears there is no way to actually fetch the contents of an update without knowing the URL that originally led you to a SUP-ID" Yes, that's the point.

Re: push solutions. Maintaining and producing/consuming an open connection is harder for both ends.

I'm not sure SUP would help much for a large-scale generator of feed-based activity like Netflix without some notion of subscription, resulting in a SUP feed per feed consumer. The personalized Netflix feeds generate about 6 million posts per day (about 2M each of queue adds, shipped DVDs, and received DVDs). Given that any given feed consumer is likely only interested in a small fraction of the 8.4M+ subscribers, the signal-to-noise ratio in a general SUP feed would be quite low.

But what do you do when you're polling thousands of SUP feeds individually? Obviously you'll need a way to see if any of the feeds in any of the SUPs you are tracking have updated in order to save even more bandwidth. Hmm... maybe we need another protocol ;-)

Might be fun to add a track capability with PubSub, though. But I digress.

As regards data exchange, XMPP PubSub looks like an interesting solution: unlike the Six Apart Atom stream, FriendFeed's servers would only receive messages from subscribed XMPP Flickr accounts rather than all the data, which would avoid filtering.

What's wrong with having a simple Atom "update" feed listing recently changed feeds? That feed could use RFC 5005, and consumers wouldn't need to worry about polling faster than the SUP feed turns over completely.

If I understand this right, we have a network efficiency problem and limitations based on data structure. Large providers may have millions or potentially billions of individual nodes that have a status update flag. Currently, efficient polling is done by comparing their remote timestamp with the node's last update time. Excuse the modest description; this problem space is new to me (but important to what I'm working on).

Is there a viral hub hierarchy? This would be a limited number of remote servers that simply pass along the last update time to all subscribers (and any polling system could also serve the last node update time). You would subscribe to a virtual server represented by a dynamic list of your nearest neighbors: a network similar to BitTorrent, but concerned only with last update times.

The storage could get problematic for large databases (even of a simple timestamp), so each serving hub could keep what it needs for internal use and just push all the times to its short list of clients.

I'm trying to think through a decentralized status architecture, and I would appreciate expert opinions.