XML syndication the easy way

It’s all very well mashing up other people’s content, but it’s even better to enable them to reuse your own. This is an ideal job for XML – and our PHP expert, Paul Hudson

You’ve got a website that rocks and it’s time to tell the world how great you are. You don’t need a million-dollar advertising campaign and you don’t need a team of astroturfers blowing your trumpet – in fact, you don’t need any marketing budget at all. The best way to get your great content out to the world is to offer it up for XML syndication, which means anyone can take it and do as they please with it.

There are two main ways people can use your content: desktop syndication and mashups. The best way to accommodate both is to target the first group: people who want to read your content on their desktop. If your content is already in a news format, you can convert it into XML so that people can read it in their RSS readers. Once they’re doing this, it’s hopefully only a matter of time until other developers syndicate it. This may involve nothing more than printing parts of your content on their site, but if your content is unique and rich with meta information, you could find yourself part of a mashup. Don’t worry about targeting mashup developers; if you make your content open with XML for desktop users, they will do the hard work of figuring out how to get it into their projects. We’re going to look at how to convert your content into XML, targeting both RSS and our own XML language that provides maximum meta information for mashups.

Meta matters

That’s perfectly valid HTML, and it would look just fine when rendered in a web browser. But what does it mean to a computer? The computer can tell that there are line breaks in there, and that the first line should probably be rendered in bold, but that’s about it. It can’t tell that it’s a to-do list because we’re using formatting tags (<strong>, <br />, etc.) rather than semantic tags. Give some thought to what a computer would make of this code instead:

Sure, it’s no longer HTML, but really XHTML is just one XML vocabulary with an agreed meaning and styling for each element. The point is this: if you want people to do cool things with your content, you need to present it in a format that can easily be understood by computers. Our XML to-do list can be transformed with XSLT to produce HTML content (we’ll be looking at this next issue), or it can be processed into a database.
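To make the point concrete, here is a minimal sketch of how little code a program needs to read a semantic to-do list. It is written in Python rather than PHP, and the element and attribute names (todolist, task, priority) are assumptions made up for illustration, not the markup from the original listing.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic markup for a to-do list; the element and attribute
# names are illustrative assumptions, not the article's original listing.
TODO_XML = """
<todolist>
    <task priority="high">Buy milk</task>
    <task priority="low">Wash the car</task>
</todolist>
"""

def read_tasks(xml_text):
    """Return (priority, description) pairs for every <task> element."""
    root = ET.fromstring(xml_text.strip())
    return [(task.get("priority"), task.text) for task in root.findall("task")]

if __name__ == "__main__":
    for priority, description in read_tasks(TODO_XML):
        print(f"{priority}: {description}")
```

Because each task is labelled as a task, the same few lines could just as easily feed a database insert or a reminder service. Getting that out of <strong> and <br /> tags would mean guessing.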

When we work with RSS, we have to throw away a lot of information, but later on we’re going to create our own XML syndication schema that people can read from and mash up as they please. This is where meta information matters most; you don’t know what people are going to do with your content, so you should avoid saying to yourself, “Oh, no one needs to know XYZ, so I’ll just leave it out”.

Converting to RSS

Let’s get cracking with our first task: converting some content to RSS. If you edit the MySQL connection details in this next script, it will set up the database we’re going to use for our RSS feed:

If this were a real website, we’d also want to create tables to handle the different categories and authors – we’re storing them as numbers in news_items, but they ought to be served up as text. The PostedAt times are set to the current time but, again, on a real website you’d set this time as each news item was added. It’s important to remember that RSS expects dates in the RFC 822 style, for example ‘Thu, 26 Oct 2006 21:17:49 +0100’, so make sure you always use this format.
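If you ever need to generate that date format outside PHP, most languages have a helper for it. The sketch below uses Python’s standard email.utils module; the rss_date function name is our own invention, and the timestamp is simply the example date from the text with a one-hour offset.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def rss_date(moment):
    """Format a timezone-aware datetime in the RFC 822 style that RSS expects."""
    return format_datetime(moment)

# The example date from the text, in a zone one hour ahead of UTC.
posted_at = datetime(2006, 10, 26, 21, 17, 49, tzinfo=timezone(timedelta(hours=1)))
print(rss_date(posted_at))  # Thu, 26 Oct 2006 21:17:49 +0100
```

Pass in a timezone-aware datetime: a naive one is formatted with a -0000 offset, which tells readers nothing about when the story was really posted.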

Now, the most basic RSS feed needs the following information: our site’s name (for example, ‘.net magazine’), our site’s link (http://www.netmag.co.uk) and our site’s description (‘the world’s finest internet resource’). We then need to provide a list of news items, each of which must have a title (‘Police called to babysitters’), a description (‘Find three-year-old resisting a rest’), a link (http://www.example.com/stories/1) and a globally unique identifier – usually shortened to GUID.

GUIDs are simply individual, unique codes that identify each of your news items so that aggregators don’t end up repeating the same story twice. You can use whatever you like for your GUIDs, as long as each item has its own unique GUID. In practice, the easiest way to ensure a GUID is unique is to use the URL to the full story, as each link usually contains just one story. So, with all that in mind, here’s a basic script to serve up some RSS from a database – save it as ‘db2rss.php’:

We can split that code into three parts: connecting to the database, outputting general site information, then outputting individual stories. There’s nothing surprising in any of those sections but the end result is magic. If you run that on the command line, you’ll see your RSS feed printed out, neatly indented. If you like to save as much bandwidth as possible, you can remove all the white space from the feed and it will still work, but it will also be a lot harder to debug!
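For readers who want to see the same three-part shape in another language, here is a rough Python sketch. It is not the db2rss.php listing: an in-memory SQLite database stands in for MySQL, and the Title and Description column names are assumptions, since only news_items and PostedAt are named in the text.

```python
import sqlite3
from xml.sax.saxutils import escape

SITE = {
    "title": ".net magazine",
    "link": "http://www.netmag.co.uk",
    "description": "The world's finest internet resource",
}

def open_database():
    """Part 1: connect. In-memory SQLite stands in for MySQL (an assumption)."""
    db = sqlite3.connect(":memory:")
    # Column names other than the table name are hypothetical.
    db.execute(
        "CREATE TABLE news_items (ID INTEGER PRIMARY KEY, Title TEXT, Description TEXT)"
    )
    db.execute(
        "INSERT INTO news_items (Title, Description) VALUES (?, ?)",
        ("Police called to babysitters", "Find three-year-old resisting a rest"),
    )
    return db

def build_feed(db):
    """Return the feed as a neatly indented string."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', '<rss version="2.0">', "  <channel>"]
    # Part 2: general site information.
    for tag in ("title", "link", "description"):
        lines.append(f"    <{tag}>{escape(SITE[tag])}</{tag}>")
    # Part 3: one <item> per story, reusing the story URL as its GUID.
    query = "SELECT ID, Title, Description FROM news_items ORDER BY ID"
    for item_id, title, description in db.execute(query):
        url = f"http://www.example.com/stories/{item_id}"
        lines += [
            "    <item>",
            f"      <title>{escape(title)}</title>",
            f"      <description>{escape(description)}</description>",
            f"      <link>{url}</link>",
            f"      <guid>{url}</guid>",
            "    </item>",
        ]
    lines += ["  </channel>", "</rss>"]
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_feed(open_database()))
```

Note the escape() calls: titles and descriptions pulled from a database can contain & or <, and either one would break the feed if written out raw.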

To make sure everything is working, copy and paste your RSS text into the ‘validate by direct input’ box at validator.w3.org/feed and check through the output for errors. Speaking as a self-acknowledged lazy person, I can tell you that feeds will go out without the site description or the item GUIDs, but the validator will flag both omissions. Our RSS feed is valid, which means we can push on with the real test: doing a test subscription through a desktop RSS reader.
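The online validator is the real authority, but a quick local check can catch the lazy omissions before you even get that far. This Python sketch only looks for the handful of elements discussed above; the find_missing name and the sample feed are made up for illustration, and it is in no way a substitute for full validation.

```python
import xml.etree.ElementTree as ET

CHANNEL_REQUIRED = ("title", "link", "description")
ITEM_EXPECTED = ("title", "description", "link", "guid")

def find_missing(feed_xml):
    """Return a list of human-readable problems found in an RSS feed."""
    problems = []
    channel = ET.fromstring(feed_xml).find("channel")
    if channel is None:
        return ["no <channel> element"]
    for tag in CHANNEL_REQUIRED:
        if channel.find(tag) is None:
            problems.append(f"channel is missing <{tag}>")
    for number, item in enumerate(channel.findall("item"), start=1):
        for tag in ITEM_EXPECTED:
            if item.find(tag) is None:
                problems.append(f"item {number} is missing <{tag}>")
    return problems

# A deliberately lazy feed: no site description and no item GUID.
SAMPLE = """<rss version="2.0"><channel>
  <title>.net magazine</title>
  <link>http://www.netmag.co.uk</link>
  <item>
    <title>Police called to babysitters</title>
    <description>Find three-year-old resisting a rest</description>
    <link>http://www.example.com/stories/1</link>
  </item>
</channel></rss>"""

if __name__ == "__main__":
    for problem in find_missing(SAMPLE):
        print(problem)
```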

Desktop syndication

If you don’t already have an RSS reader, head over to www.rssreader.com to get a free one to try out your feed. In RssReader, press Ctrl+Shift+F to add a new feed, and enter the URL of the previous script. The feed will be downloaded from your server at this point, and if everything is working OK, RssReader will ask you to set the title you want to use for the feed.

Now the next test: RssReader will parse your news items and should display them individually in the top-right frame, along with a summary of all the items and their contents in the bottom-right frame. Not bad, but it’s not perfect either: if you scroll to the right of the top pane, you’ll see there’s no published date, author or comments for our stories. The first two are self-evident, but the last one is actually a URL where comments about this news story can be posted.

We can add each of these to our existing script by adding five lines just before the </item> in the current script:

From the top-left pane, right-click on the title of your news feed and choose ‘Clear feed history’. Then right-click on it again and choose ‘Get new feed headlines’ – this will flush the existing stories, then refetch them all. As long as you updated the code correctly, you should now see the new information displayed.
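If you’re wondering what the three extra elements look like once generated, the Python sketch below builds them for a single story. RSS 2.0 names them pubDate, author and comments, and expects the author as an email address; the function name, the address and the comments URL are placeholders we’ve made up.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from xml.sax.saxutils import escape

def extra_item_lines(posted_at, author, comments_url):
    """Build the optional <pubDate>, <author> and <comments> lines for an item."""
    return [
        f"<pubDate>{format_datetime(posted_at)}</pubDate>",
        f"<author>{escape(author)}</author>",
        f"<comments>{escape(comments_url)}</comments>",
    ]

if __name__ == "__main__":
    lines = extra_item_lines(
        datetime(2006, 10, 26, 20, 17, 49, tzinfo=timezone.utc),
        "paul@example.com (Paul Hudson)",             # placeholder address
        "http://www.example.com/stories/1/comments",  # hypothetical URL
    )
    print("\n".join(lines))
```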

Limit data reading

The nice thing about starting up your own XML syndication format is that you don’t need to follow any rules. One thing to be careful of, though: unless the appeal of your content is its size, you’ll probably find it best to limit how much data people can read from you. If you’re serving up news items, serve only the last ten or so. If you’re serving up shorter content, then you can provide 20 or more without stressing your connection too much. This is particularly important in mashups, where thousands (even millions) of people could be hitting the mashup site, which in turn means thousands or millions of requests hitting your rapidly melting server!
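The simplest place to enforce that limit is in the database query itself. Here’s a minimal Python sketch, again using in-memory SQLite as a stand-in for MySQL; the Title column, the integer PostedAt timestamps and the 25 dummy stories are assumptions for the sake of the demonstration.

```python
import sqlite3

MAX_ITEMS = 10  # "the last ten or so"

def latest_items(db, limit=MAX_ITEMS):
    """Fetch only the newest stories, newest first."""
    query = "SELECT ID, Title FROM news_items ORDER BY PostedAt DESC, ID DESC LIMIT ?"
    return db.execute(query, (limit,)).fetchall()

# Demo data: 25 dummy stories with increasing timestamps.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE news_items (ID INTEGER PRIMARY KEY, Title TEXT, PostedAt INTEGER)")
db.executemany(
    "INSERT INTO news_items (Title, PostedAt) VALUES (?, ?)",
    [(f"Story {n}", 1161893869 + n) for n in range(1, 26)],
)

if __name__ == "__main__":
    items = latest_items(db)
    print(len(items))  # 10
    print(items[0])    # (25, 'Story 25')
```

However many stories pile up in the table, each request reads and sends only ten of them, which keeps the cost per request flat.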