Friday, 22 April 2011

How to Build a Dataset in R using an RSS feed or Web page

I recently wanted to build a dataset from content in an RSS feed - the feed of crimes in Newark provided by SpotCrime. (They have feeds for lots of US cities, but I just wanted Newark. Please read their Terms of Service before using this code on their feed.) After some tinkering, I got it to work using the XML package in R.
The first step is to read in the RSS feed XML file:
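A minimal sketch of that step, under stated assumptions. The feed URL is not given here, so a trimmed, hypothetical copy of the feed (`feed_xml`) is parsed from a string with `asText = TRUE`; with a real feed you would pass a file name or URL as the first argument instead. `useInternalNodes = TRUE` is also an assumption of this sketch, chosen so that XPath queries can run directly on the result.

```r
library(XML)

# Trimmed, hypothetical stand-in for the feed (assumption: the real feed
# would be read from its URL or a saved file instead of a string).
feed_xml <- paste0(
  '<rss version="2.0" xmlns:georss="http://www.georss.org/georss">',
  '<channel>',
  '<title>Spotcrime.com Crime Listing - Newark, NJ</title>',
  '<item>',
  '<title>Robbery on EASTON AVENUE, Franklin, NJ (via spotcrime.com)</title>',
  '<pubDate>Mon, 18 Apr 2011 00:00:00 -0400</pubDate>',
  '<georss:point>40.5242061 -74.495662</georss:point>',
  '</item>',
  '</channel>',
  '</rss>'
)

# asText = TRUE parses the string itself rather than treating it as a file name.
doc <- xmlTreeParse(feed_xml, asText = TRUE, useInternalNodes = TRUE)

# The top-level node of the parsed tree is <rss>.
root <- xmlRoot(doc)
xmlName(root)
```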

The xmlTreeParse command "parses an XML or HTML file or string containing XML/HTML content, and generates an R structure representing the XML/HTML tree." There are tons of optional arguments, but as you can see, I didn't use any of them, and frankly, I don't understand many of them. But the function did what I wanted.
Next, I used the command xmlRoot to isolate the "top level XMLNode object resulting from parsing an XML document." Now is a good time to look at what we have:
> xmlRoot(doc)
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:georss="http://www.georss.org/georss">
<channel>
<atom:link href="http://spotcrime.com" rel="self" type="application/rss+xml"/>
<title>Spotcrime.com Crime Listing - Newark, NJ</title>
<description>Crime feed - RSS - 5 incidents. To see more visit http://spotcrime.com</description>
<language>en-us</language>
<link>http://spotcrime.com</link>
<ttl>180</ttl>
<copyright>ReportSee, Inc.</copyright>
<item>
<guid isPermaLink="true">http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj</guid>
<link>http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj</link>
<pubDate>Mon, 18 Apr 2011 00:00:00 -0400</pubDate>
<title>Robbery on EASTON AVENUE, Franklin, NJ (via spotcrime.com)</title>
<description>Police are seeking a man who robbed the Financial Resources Federal Credit Union</description>
<georss:point>40.5242061 -74.495662</georss:point>
<geo:Point>
<geo:lat>40.5242061</geo:lat>
<geo:long>-74.495662</geo:long>
</geo:Point>
</item>

This is only a portion of the full output - there are more <item> nodes, one for each crime.
So the feed starts with a header full of stuff we don't need, followed by the content in the <item> node, which is the good stuff: a link to the crime on SpotCrime, the publication date (more on this later), the crime "title," a description, and the Lat/Lon, in two different formats. How do we get at that meaty stuff, and put it into a friendly R dataframe? We'll use the xpathApply command:

src <- xpathApply(xmlRoot(doc), "//item")

xpathApply is a "way to find XML nodes that match a particular criterion" using XPath syntax. XPath is a way to navigate XML trees. My approach for a project like this is to aim, first and foremost, for code that works, and worry about advanced techniques later. So I did a simple search for nodes identified as "item," ignoring all the other possible arguments to xpathApply. src is now a list with 5 elements, one for each "item" node in the feed (recall that above, I only showed the first item node - four more followed). We can now iterate through the 5 elements of src and convert the data into a dataframe:
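A sketch of such a loop, not necessarily the exact code used for the feed. It assumes a hypothetical two-item feed parsed from a string with `useInternalNodes = TRUE`; the names `src`, `foo`, and `DATA` follow the text, while `feed_xml` and `row` are illustrative.

```r
library(XML)

# Hypothetical two-item feed, standing in for the real one.
feed_xml <- paste0(
  '<rss version="2.0"><channel>',
  '<item><title>Crime A</title><description>First example</description></item>',
  '<item><title>Crime B</title><description>Second example</description></item>',
  '</channel></rss>'
)
doc <- xmlTreeParse(feed_xml, asText = TRUE, useInternalNodes = TRUE)

# One list element per <item> node.
src <- xpathApply(xmlRoot(doc), "//item")

for (i in seq_along(src)) {
  # Named character vector: the text of each child node of item i.
  foo <- xmlSApply(src[[i]], xmlValue)
  # One-row data.frame; column names come from the child node names.
  row <- data.frame(t(foo), stringsAsFactors = FALSE)
  if (i == 1) {
    DATA <- row              # first pass: create the data.frame
  } else {
    DATA <- rbind(DATA, row) # later passes: append a row
  }
}
DATA
```

Growing a data.frame with rbind inside a loop is fine for a handful of items; for large feeds, collecting the rows in a list and binding them once would likely be faster.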

xmlSApply applies a function to the subnodes of an XML node. In this case, the function is xmlValue, which returns the raw contents of a node. So foo becomes a character vector containing all of those nice data bits for crime i. We then transpose foo into a matrix and convert it to a (1 row) data.frame. The stringsAsFactors=FALSE argument prevents R from treating the strings as factors, which makes sense in this case - it might not in yours.
The first time through the loop, we want to create the data.frame; subsequent iterations, we just want to rbind a row on the bottom. When we're done, we have what we want: the data from the RSS feed nicely formatted in a data.frame named (descriptively) DATA.

Now, returning to the date and time. SpotCrime reports the publication date and time, not the date and time that the crime actually occurred. What can we do? It looks like SpotCrime reports the date and time we want on the webpage for the crime, the link to which was helpfully provided in the RSS feed.

So, let's read in the html for that page, and grab the correct date and time!

Here, we used many of the same commands we used for the RSS feed. The real date and time were stored in a node called "title," so we just grabbed that node for each crime, stuck it into the appropriate slot in a vector, and slapped that vector onto our DATA data.frame.
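A sketch of that step under stated assumptions. The two HTML strings in `pages` are hypothetical stand-ins for the crime pages, and their title text is made up for illustration; in practice each page would be read from the URL stored in the link column of DATA. `htmlParse` is used here as one way to parse HTML with the XML package, and the names `pages` and `page_title` are illustrative.

```r
library(XML)

# Hypothetical page contents (assumption: the real pages would be fetched
# from the links in the feed, and their titles would look different).
pages <- c(
  "<html><head><title>Example crime page A</title></head><body><p>text</p></body></html>",
  "<html><head><title>Example crime page B</title></head><body><p>text</p></body></html>"
)
DATA <- data.frame(link = c("page-a", "page-b"), stringsAsFactors = FALSE)

# One slot per crime, filled in as each page is parsed.
page_title <- character(nrow(DATA))
for (i in seq_len(nrow(DATA))) {
  page <- htmlParse(pages[i], asText = TRUE)
  # Grab the text of the <title> node.
  title_nodes <- xpathApply(page, "//title", xmlValue)
  page_title[i] <- title_nodes[[1]]
}

# Attach the vector to the data.frame as a new column.
DATA$page_title <- page_title
DATA
```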
With a little string processing to extract and convert lat/lon and date/time to appropriate data types, the data collection code is finished!
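A sketch of what that string processing might look like, using the georss:point and pubDate strings from the feed excerpt shown earlier. The format of the date on the crime page is not shown, so the pubDate string stands in for it here; the variable names and the choice of `tz = "GMT"` are assumptions.

```r
# Values copied from the feed excerpt; in practice these would be columns
# of DATA (the exact column names depend on the feed).
point <- "40.5242061 -74.495662"
pub_date <- "Mon, 18 Apr 2011 00:00:00 -0400"

# Split "lat lon" on the space and convert both pieces to numbers.
coords <- as.numeric(strsplit(point, " ", fixed = TRUE)[[1]])
lat <- coords[1]
lon <- coords[2]

# %a and %b match English day and month names only in an English or C
# locale, so switch the time locale before parsing.
invisible(Sys.setlocale("LC_TIME", "C"))

# %z reads the numeric UTC offset at the end of the string.
when <- as.POSIXct(
  strptime(pub_date, format = "%a, %d %b %Y %H:%M:%S %z", tz = "GMT")
)
when
```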

3 comments:

Thanks for the post! This was incredibly useful for me. I used your code as the basis for a function to grab news stories from Google for a given stock. I posted my code to Stack Overflow, and I'd love to hear what you think!

Glad it helped! I like your idea of grabbing stock-related news stories from Google. What are you going to do with the results?

For the timezone issue, did you try removing the tz="GMT" argument from strptime and just letting it default? I found this in ?strptime

tz A character string specifying the timezone to be used for the conversion. System-specific (see as.POSIXlt), but "" is the current time zone, and "GMT" is UTC.

Also, I wouldn't throw out the link field just yet, depending on what you want to use the data for. E.g., if the title and synopsis for a given story aren't giving you enough info, you could go to the full article (via the link) and mine through the article text.