RSS reader – Part One

All right, so I’ve already figured out how to write an executable script that prints “Hello World!” to the command line. Now I need to figure out how to do something interesting.

So I was surfing the Internet and reading some Python documentation, trying to come up with a project, but nothing seemed interesting enough or easy enough until I visited one of my favorite websites: slashdot.org.

It was there that I noticed the RSS icon that Firefox shows whenever I come across a website with an RSS feed (like this one).

So I thought to myself, ‘Hmm I wonder how hard it would be to create a simple RSS reader in Python?’ Read on to discover the results.

First I had to look at Slashdot’s RSS XML to see what it looks like and what my code should be looking for. If you copy and paste http://rss.slashdot.org/Slashdot/slashdot into your web browser you will be able to take a look at the XML. The format is basically as follows:

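[code lang=”xml”]
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF>
  <item>
    <title>The Title</title>
    <link>The Link</link>
    <description>The Description</description>
  </item>
  <item>
    <title>The Title</title>
    <link>The Link</link>
    <description>The Description</description>
  </item>
</rdf:RDF>
[/code]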

For this simple RSS reader I’m only going to pay attention to each <item> and ignore all other tags. In each <item> I want to display the <title> and the <description>. That way, when we run this program we’ll see the titles and descriptions of all the items on Slashdot’s main page.

Now that I knew what I wanted to do, I fired up my trusty copy of gedit and began searching the web for information. By far the best site I’ve found for Python-based information is the source, python.org. There is a ton of documentation and information on this site, everything a new programmer needs. There may be sites out there that are better, but I haven’t found them yet.

With a plan in place, I needed to find some libraries to help me with my task, since there was no way I was going to do all of this myself. Fortunately for me, Python has many built-in libraries that make complicated things a lot easier.

The first thing I needed to do was create a blank shell of a script based on my first executable Python script. I called my new file pythonRSS.py and edited it so that it only contained the following:

[code lang=”python”]
#! /usr/bin/env python
[/code]

After that I needed to include (sorry for the C++ terminology) the libraries for my RSS reader. That takes two lines: import urllib2, followed by from xml.dom import minidom, Node.

The first line gives me access to the urllib2 library, and the second imports the minidom submodule and the Node class from the xml.dom package and makes them available without the package prefix. Had I just used the following line:

[code lang=”python”]
import xml.dom
[/code]

I would have had to reference Node using the package prefix, as xml.dom.Node (and minidom, being a submodule, would have needed its own import xml.dom.minidom). Instead I am able to reference them directly as Node and minidom.
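A quick sketch to confirm that the two import styles reach the very same objects. The variable names same_node and same_minidom are mine, purely for illustration.

```python
# Style 1: pull the names straight into this script's namespace.
from xml.dom import minidom, Node

# Style 2: import by full name; everything is then reached through the prefix.
# (minidom is a submodule, so it is named explicitly here.)
import xml.dom
import xml.dom.minidom

# Both styles refer to exactly the same objects.
same_node = xml.dom.Node is Node
same_minidom = xml.dom.minidom is minidom
print(same_node, same_minidom)
```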

Now to get down to the programming business. The first piece of the puzzle that I needed was Slashdot’s RSS XML, which is pretty easy to get using the urllib2 library. We download the RSS feed using the urlopen function and keep the result in url_info: url_info = urllib2.urlopen("http://rss.slashdot.org/Slashdot/slashdot")

This gives us a “file-like” object stored in url_info. If we wanted to run through the returned XML line by line, we could actually use the following:

[code lang=”python”]
for line in url_info:
    print(line)
[/code]

But since we have a fancy XML library at our disposal, manually running through the file wouldn’t make much sense.
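To see what “file-like” means without touching the network, here is a minimal sketch that swaps url_info for an io.BytesIO holding a few made-up lines. The sample bytes are an assumption; the point is only that the same for loop works on anything that behaves like an open file.

```python
import io

# Stand-in for the object returned by urlopen(); an assumption made so the
# snippet runs offline. Like a real download, it hands back bytes.
sample_feed = b"<rss>\n<item>one</item>\n<item>two</item>\n</rss>\n"
url_info = io.BytesIO(sample_feed)

# Iterating over a file-like object yields one line at a time.
lines = [line.decode("utf-8").rstrip("\n") for line in url_info]
for line in lines:
    print(line)
```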

Now that we have the RSS XML, we are going to pass it to the minidom module and have it parse the feed into a document:

[code lang=”python”]
xmldoc = minidom.parse(url_info)
[/code]

The parse function reads the XML into the Document Object Model, or DOM (as in minidom or xml.dom), and returns the resulting document to us. For more information on this, please see the Document Object Model specification.
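As a small illustration of what parse hands back, the sketch below feeds it a made-up, cut-down feed through io.BytesIO instead of the live url_info. Both the sample XML and the BytesIO stand-in are assumptions for the sake of an offline example. It also shows parseString, the companion call for XML that is already in memory.

```python
import io
from xml.dom import minidom

# A made-up, cut-down feed (not Slashdot's real output).
sample_feed = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<item><title>First story</title></item>
</rdf:RDF>"""

# parse() accepts a file name or a file-like object...
xmldoc = minidom.parse(io.BytesIO(sample_feed))
# ...while parseString() takes the XML text directly.
same_doc = minidom.parseString(sample_feed)

# documentElement is the root node of the parsed document.
root_name = xmldoc.documentElement.nodeName
print(root_name)
```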

Now that we have the document, we’re going to get the root node (<rdf:RDF>) and then loop through all of its child nodes looking for <item> nodes. This is actually really simple and intuitive:
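Going by the full listing at the end of this post, that loop looks roughly like the sketch below. The inline sample feed and the item_nodes list are my own stand-ins so that it runs without a network connection; the real script works on the parsed Slashdot feed and acts on each item straight away.

```python
from xml.dom import minidom

# Made-up sample standing in for the downloaded feed (an assumption, so the
# loop can run offline).
xmldoc = minidom.parseString(
    "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
    "<channel><title>Slashdot</title></channel>"
    "<item><title>First story</title></item>"
    "<item><title>Second story</title></item>"
    "</rdf:RDF>"
)

# Get the root node, then walk its direct children looking for <item> nodes.
rootNode = xmldoc.documentElement
item_nodes = []
for node in rootNode.childNodes:
    if node.nodeName == "item":
        item_nodes.append(node)

print(len(item_nodes))
```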

The above is something that I am definitely learning to love about Python: the ability to iterate through things easily. Instead of having to redo the same iteration code for different classes and different types (like I have to in C++), Python makes this easy and intuitive using for loops or while loops.

Once we’ve found an <item> node, we’re just going to do the exact same thing we did above, except this time we are going to look for <title> and <description> nodes:

[code lang=”python”]
for item_node in node.childNodes:
    if (item_node.nodeName == "title"):
        title = ""
        # Collect the text held in the title's child text nodes
        for text_node in item_node.childNodes:
            if (text_node.nodeType == Node.TEXT_NODE):
                title += text_node.nodeValue
        if (len(title) > 0):
            print(title)
[/code]

The above code is basically just a variation of the code that we used to iterate through all the children of the root node. The only difference in this case is what we do when we find the nodes we want, which is to iterate through all of their child nodes that are text nodes and then store that value for printing. This is the only part of the script that felt a bit weird to me; shouldn’t it be more like node.Text or something like that? Either way it works, and that is basically the end of the RSS reader. Just feed it into Python or make it executable and that’s it:

$ python pythonRSS.py

All in all this code didn’t take that long to write; a few Google searches and I was on my way. Truthfully, writing this blog post took longer than writing the code did.

Please keep in mind that there are many things wrong with this code; the most obvious is that it makes no use of functions to simplify the code. But this is only part one of my RSS reader, and with time I will fix up the code.
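As a rough idea of where that cleanup could go, here is one possible sketch that moves the repeated text-node loop into a function. The names get_text and read_items, and the sample feed, are my own inventions for illustration and not part of the original script.

```python
from xml.dom import minidom, Node


def get_text(element):
    """Join the values of an element's direct text-node children."""
    return "".join(
        child.nodeValue
        for child in element.childNodes
        if child.nodeType == Node.TEXT_NODE
    )


def read_items(xmldoc):
    """Return a (title, description) pair for every <item> under the root."""
    items = []
    for node in xmldoc.documentElement.childNodes:
        if node.nodeName != "item":
            continue
        fields = {"title": "", "description": ""}
        for item_node in node.childNodes:
            if item_node.nodeName in fields:
                fields[item_node.nodeName] = get_text(item_node)
        items.append((fields["title"], fields["description"]))
    return items


# Made-up sample feed so the sketch runs offline.
sample = minidom.parseString(
    "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
    "<item><title>First story</title>"
    "<description>Something happened.</description></item>"
    "</rdf:RDF>"
)
for title, description in read_items(sample):
    print(title)
    print(description + "\n")
```

With a helper like this, the title and description branches would no longer need their own copies of the inner loop.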

Here is the pythonRSS.py script in its entirety with some comments thrown in. Alternatively, you could also download the code as a text file:

[code lang=”python”]
#! /usr/bin/env python

import urllib2
from xml.dom import minidom, Node

# Download the Slashdot RSS feed
url_info = urllib2.urlopen("http://rss.slashdot.org/Slashdot/slashdot")

if (url_info):
    # We have the RSS XML, let's try to parse it
    xmldoc = minidom.parse(url_info)
    if (xmldoc):
        # We have the doc, get the root node
        rootNode = xmldoc.documentElement
        # Iterate over the child nodes
        for node in rootNode.childNodes:
            # We only care about "item" entries
            if (node.nodeName == "item"):
                # Now iterate through all of the <item>'s children
                for item_node in node.childNodes:
                    if (item_node.nodeName == "title"):
                        # Loop through the title text nodes to get
                        # the actual title
                        title = ""
                        for text_node in item_node.childNodes:
                            if (text_node.nodeType == Node.TEXT_NODE):
                                title += text_node.nodeValue
                        # Now print the title if we have one
                        if (len(title) > 0):
                            print(title)

                    if (item_node.nodeName == "description"):
                        # Loop through the description text nodes to get
                        # the actual description
                        description = ""
                        for text_node in item_node.childNodes:
                            if (text_node.nodeType == Node.TEXT_NODE):
                                description += text_node.nodeValue
                        # Now print the description if we have one.
                        # The trailing "\n" adds a blank line so it looks better
                        if (len(description) > 0):
                            print(description + "\n")
    else:
        print("Error getting XML document!")
else:
    print("Error! Getting URL")
[/code]