It's all spinning wheels and self-doubt until the first pot of coffee.

More on ignorant feed handling

Part of the reason this whole "must ignore" thing with respect to feeds has me a bit fired up is that so few feed processing tools out there seem to embrace the idea. And because of that, these tools are unfortunately brittle and prone to future shock.

For example, take Syndication.framework on OS X: amongst the monkeys and ninjas and pirates and robots, you've got your standard title-date-link-description columns with a few other bits for good measure. But where's the data from the iTunes RSS extensions? Nowhere, gone, lost. If it was in the feed when Syndication.framework found it, it wasn't understood, and so it wasn't retained after the parser finished chewing up the data and spitting it into that DB table.

I've written about shared feed processing foundations before, but I don't think I've totally gotten the idea to gel in my head until now. Here's the thing: If you want feed processing tools that are useful for the general case, they have to be tolerant of things not understood. Rather than intrusively breaking apart and recasting feed data into a predetermined data structure, you've got to remain hands-off as much as possible.

This is what I did in FeedSpool. This code can subscribe to feeds, poll feed data periodically, and even work out which items in a feed are new—but it punts on everything else by only caring about where a feed starts and ends, and where its individual entries start and end. The rest is left in its original XML form. So, if there was data in there for iTunes? It's still there, because FeedSpool didn't know enough to do anything to it.
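The hands-off split can be sketched in a few lines of Python. To be clear, this isn't FeedSpool's actual code—the sample feed, the itunes:duration element, and the split_entries helper are all invented for illustration—but it shows the shape of the idea: locate the entries, serialize each one back out verbatim, and let unknown extensions ride along.

```python
# A sketch of hands-off entry splitting, assuming a simple RSS 2.0 feed.
# Nothing here is FeedSpool's real code; the point is that each <item>
# is kept as raw XML, so extensions the parser never heard of survive.
import xml.etree.ElementTree as ET

# Keep the familiar prefix when serializing, instead of an ns0: autogen.
ET.register_namespace("itunes", "http://www.itunes.com/dtds/podcast-1.0.dtd")

FEED = """<rss version="2.0"
     xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1</title>
      <itunes:duration>12:34</itunes:duration>
    </item>
  </channel>
</rss>"""

def split_entries(feed_xml):
    """Return (channel title, list of raw entry XML strings)."""
    channel = ET.fromstring(feed_xml).find("channel")
    # Serialize each <item> back to text rather than mapping its children
    # into fixed title/date/link/description slots.
    raw_items = [ET.tostring(item, encoding="unicode")
                 for item in channel.findall("item")]
    return channel.findtext("title"), raw_items

title, entries = split_entries(FEED)
assert "itunes" in entries[0] and "12:34" in entries[0]  # extension survives
```

The only "understanding" the code needs is where items start and end; everything inside an item is opaque payload, which is exactly the tolerance the general case demands.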

This is the major difference between the two feed processing models. Syndication.framework loses information when it encounters things it doesn't understand. FeedSpool retains that information, because it leaves things alone when it doesn't know any better.

Now, this is not to pick on just Syndication.framework. Despite the general-sounding name, this framework is pretty much just around to power Safari RSS and not iTunes or anything else. And as I said, nearly every other feed processing framework and tool works the same way. Just about everybody uses a destructive process when they parse and marshal feed data into local-idiom structures.

At any rate, there's one sentence in this overview that gives me hope for RSS as a general service in Windows Vista:

"It is also possible to access the item XML for applications that want to perform operations on the XML instead of using the item's properties."

So, dig that. If it works the way I hope it does, RSS in Vista will take care of subscriptions for you, poll the feed data, grab new stuff—but then leave the data intact for you to process whatever new and unanticipated feed payloads may arrive.

That's how it should work.

Archived Comments

If you want feed processing tools that are useful for the general case, they have to be tolerant of things not understood. Rather than intrusively breaking apart and recasting feed data into a predetermined data structure, you’ve got to remain hands-off as much as possible.

It sounds to me like you're arguing for a triplestore and SPARQL. You'd stash everything in the triple store, and then your front end app just needs to be able to construct the appropriate query and process the results. More general and extensible than creating custom classes for filtering by specific fields, and I think if you were to ever write a general filter app where users can specify filter fields and values, you'd basically be reimplementing SPARQL.
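For what it's worth, the triple idea can be mocked up without a real triplestore. The sketch below is not SPARQL—just Python tuples and a wildcard match, with invented subjects and predicates—but it shows why the front end only needs to build queries: new predicates land in the store without any schema change.

```python
# A toy triple store: feed facts as (subject, predicate, object) tuples.
# All names here are invented for illustration; a real system would use
# an RDF store queried with SPARQL.
triples = [
    ("entry:1", "rss:title",       "Hello"),
    ("entry:1", "dc:subject",      "python"),
    ("entry:2", "rss:title",       "World"),
    ("entry:2", "itunes:duration", "12:34"),  # an "unanticipated" extension
]

def match(pattern):
    """Return triples matching an (s, p, o) pattern; None is a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Filtering by field is just a pattern -- no custom class per field needed.
tagged = match((None, "dc:subject", None))
durations = match((None, "itunes:duration", None))
```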

Well, a triplestore would be great if these feeds were RDF. But alas, with the exception of RSS 1.0, they're XML. I could play with trying to make transformations from XML to RDF, but that's getting back to a dangerously unlazy level of intelligence required to map unanticipated future feed extensions to RDF equivalents.

Well, a triplestore would be great if these feeds were RDF. But alas, with the exception of RSS 1.0, they’re XML. I could play with trying to make transformations from XML to RDF, but that’s getting back to a dangerously unlazy level of intelligence required to map unanticipated future feed extensions to RDF equivalents.

Isn't there already a clean mapping between Atom and RDF? There's a list of integration ideas here.

On a side note, while Googling for Atom/RDF notes, I came across BlogSieve, which claims to be:

...a free web-based tool that creates new feeds by filtering, merging and sorting existing feeds. The BlogSieve engine accepts virtually every (valid) feed format, processed results are then exported into any feed format you choose.

Maybe, though I don't think it's official. And even if it is, it leaves out RSS. But even if it worked for RSS too, what about all the feed extensions that might be? I think RSS 1.0 had the right idea for extension modules in the RDF universe, but the world seems to be settling for XML.

I haven’t tried [Blogsieve], but it claims to allow filtering.

It does filter, but it does so destructively. (And that's not to mention the 7-8 step form I had to go through to start filtering. Definitely not a URL-line application.) But, with respect to their filtering and conversion, check out these feeds:

If you compare these to each other, you'll find information loss and even just plain corruption. The dc:subject elements encoding del.icio.us tags are gone, even in the RSS-1.0-to-RSS-1.0 transformation. And somehow, in the Atom version, they managed to jumble up titles and authors. Granted, my stuff doesn't do conversion yet, but I wouldn't want to do it like this.

I haven't tried it yet, but I'd have to guess that a podcast feed with iTunes and/or Yahoo! Media elements would get mangled in a very nasty way.

Sean: Ooh, nice link! It's been a while since I read about IFF, and I don't think I ever quite understood the concept.

But, this part certainly caught my eye:

Our task is similarly to store high level information and preserve as much content as practical while moving it between programs. But we need to span a larger universe of data types and cannot expect to centrally define them all. Fortunately, we don't need to make programs preserve information that they don't understand.

Maybe, though I don’t think [a clean mapping between Atom and RDF] is official. And even if it is, it leaves out RSS. But even if it worked for RSS too, what about all the feed extensions that might be?

You could store the fully qualified entries, with the appropriate namespaces, and then define equivalences using owl:equivalentClass. Then (I believe) a SPARQL query that extracted the rss:entry resources would also pick up the entries from Atom and the various RSS flavours. Although at that point you'd need an OWL-capable triplestore and library.
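The owl:equivalentClass suggestion boils down to closing a query over declared equivalences. Here's a hedged sketch in plain Python—the class names are made up, and a real OWL-capable store would do this inference itself—showing how a query for rss:entry could also sweep in the Atom and RSS 1.0 flavours:

```python
# A toy take on owl:equivalentClass: expand a class name through declared
# equivalences before querying. Class names are invented for illustration;
# an OWL-capable triplestore would perform this inference itself.
equivalences = {
    ("rss:entry", "atom:entry"),
    ("atom:entry", "rss1:item"),
}

def equivalent_classes(cls):
    """All classes reachable from cls; equivalence is symmetric and transitive."""
    seen = {cls}
    changed = True
    while changed:
        changed = False
        for a, b in equivalences:
            for x, y in ((a, b), (b, a)):  # owl:equivalentClass is symmetric
                if x in seen and y not in seen:
                    seen.add(y)
                    changed = True
    return seen

# A query for rss:entry would then match any of these classes:
classes = equivalent_classes("rss:entry")
```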