I am currently in the market to buy my first home, so I've been spending a lot of time on various real estate websites, searching through listings for the perfect property. I live in a competitive housing market, so it is important that I'm informed whenever a new property becomes available. Logging onto any number of real estate websites each day to check for new listings is repetitive and time-consuming. Fortunately, this information can be gathered automatically using a technique called screen scraping.

Since most web pages are simply made of HTML, it is easy for a computer to parse and store the information contained within these documents. Most programming languages have a host of libraries to assist in the screen scraping/parsing process, and Ruby is no exception. To create simple screen scrapers in Ruby, I have been using a library called scRUBYt!. scRUBYt! provides methods to access a given website and scrape its content; all the programmer needs to do is provide the XPath string to the desired information.

Using the scRUBYt! library has allowed me to write a small screen scraper script to access the FranklyMLS.com website, check for new listings, and then report back with the results. This has saved me a lot of time and effort. Let's dive into some code to see how this is done.

First, we'll need to create a simple class to store the information that we scrape from the FranklyMLS.com website. The Property class will hold various property-related information (price, MLS number, square footage, etc.):
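The original class definition isn't reproduced here; a minimal sketch might look like the following (field names beyond the ones mentioned above are illustrative assumptions):

```ruby
# A simple value object for one scraped listing. The price, MLS number,
# and square footage fields come from the description above; the address
# field is an assumed extra for illustration.
class Property
  attr_accessor :mls_number, :price, :square_footage, :address

  def initialize(mls_number, price, square_footage, address)
    @mls_number     = mls_number
    @price          = price
    @square_footage = square_footage
    @address        = address
  end

  def to_s
    "#{@mls_number}: #{@address} - #{@price} (#{@square_footage} sq ft)"
  end
end
```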

Next, we'll need to make sure that scRUBYt! is installed. If you don't already have GitHub set up as one of your gem repositories, do so now by executing the following command:

gem sources -a http://gems.github.com

Then install the scRUBYt! gem:

gem install jspradlin-scrubyt

Side note: I've built some additional functionality into the scRUBYt! library, so you will need to grab the gem from my GitHub repository (i.e. jspradlin-scrubyt). I've spoken with the lead developer on the scRUBYt! project, and it looks like my changes might make it into a future version of the official gem.

At this point we need to give scRUBYt! the URL of the website that we wish to scrape. FranklyMLS.com has its own URL query syntax that displays only properties meeting specific criteria. For example, if we only wanted to find active listings in the zip codes 22201, 22202, and 22203, the FranklyMLS.com URL would be:

http://franklymls.com/default.aspx?m=R&s=(22201,22202,22203)+active

We can dynamically generate a URL with our specific housing criteria by including the following code in our script:
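The original snippet isn't shown here; a sketch that assembles the query URL from a list of zip codes and a listing status (both hypothetical variable names) could look like this:

```ruby
# Example search criteria; in the real script these would be whatever
# zip codes and listing status you care about.
zip_codes = %w[22201 22202 22203]
status    = "active"

# Assemble the FranklyMLS.com query URL described above.
base_url = "http://franklymls.com/default.aspx"
query    = "m=R&s=(#{zip_codes.join(',')})+#{status}"
url      = "#{base_url}?#{query}"
# => "http://franklymls.com/default.aspx?m=R&s=(22201,22202,22203)+active"
```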

Now we're ready to scrape some housing data. Once the FranklyMLS.com property page loads, we are presented with a table that contains information about the listings that meet our criteria (image modified to save space):

The HTML that generates this table would appear like this (modified to save space):

If you look at the table, the HTML code, and the Ruby code, you'll see that I've color-coordinated each separate piece of information to illustrate how it is parsed and then stored. The scRUBYt! library will "fetch" the given URL, locate the HTML elements by the given XPath expressions, and then store the data.
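The original scRUBYt! extractor and table HTML aren't reproduced here. As an illustration of the same XPath-driven extraction, here is a sketch using Ruby's standard-library REXML against a simplified, assumed version of the listings table; the real markup would differ, and scRUBYt! (or Hpricot/Nokogiri) copes with malformed real-world HTML far better than REXML does:

```ruby
require 'rexml/document'

# Simplified, well-formed stand-in for the FranklyMLS.com listings table.
# The real page's markup and column layout are assumptions here.
html = <<-HTML
<table>
  <tr><td>AR1234567</td><td>$450,000</td><td>1500</td></tr>
  <tr><td>AR7654321</td><td>$399,900</td><td>1200</td></tr>
</table>
HTML

doc = REXML::Document.new(html)

# Locate each row by XPath and pull out its cells, much as the
# scRUBYt! extractor does with the XPaths we hand it.
listings = REXML::XPath.match(doc, '//tr').map do |row|
  cells = row.elements.to_a('td').map(&:text)
  { mls_number: cells[0], price: cells[1], square_footage: cells[2] }
end

listings.each { |l| puts "#{l[:mls_number]} - #{l[:price]}" }
```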

Once we have all of the data collected, we may want to do something useful with it, such as converting it into an RSS feed. We can accomplish this using the Hpricot and Builder libraries (which should be installed as dependencies of scRUBYt!). The code for the RSS conversion would look like this:
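The original Hpricot/Builder code isn't reproduced here. As a stand-in that shows the same conversion, this sketch uses Ruby's bundled rss library instead; the listing data and feed titles are example values:

```ruby
require 'rss'

# Example scraped data; in the real script these would come from the
# Property objects populated by the scraper.
properties = [
  { mls_number: "AR1234567", price: "$450,000" },
  { mls_number: "AR7654321", price: "$399,900" },
]

# Build an RSS 2.0 feed with one <item> per listing.
feed = RSS::Maker.make("2.0") do |maker|
  maker.channel.title       = "FranklyMLS Listings"
  maker.channel.link        = "http://franklymls.com/"
  maker.channel.description = "New listings matching my search criteria"

  properties.each do |prop|
    maker.items.new_item do |item|
      item.title = "#{prop[:mls_number]} - #{prop[:price]}"
      item.link  = "http://franklymls.com/"
    end
  end
end

puts feed
```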

To make sure I get routine updates, I run this Ruby script on my server every hour using a cron job and redirect its output to an RSS feed file. I am subscribed to the generated feed, so now I know exactly when a new property becomes available in my area!
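A crontab entry for this setup might look like the following (the file paths are hypothetical):

```
# Run the scraper at the top of every hour and write the feed
# where the web server can serve it.
0 * * * * ruby /home/user/scraper/franklymls.rb > /var/www/listings.xml
```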

Overall, scRUBYt! is very easy to use, and it works well for simple screen scraping tasks. However, I have found that it can run into problems when the HTML gets complex. In those cases I would recommend using Hpricot for fine-grained scraping.

To view the source code for this entry, along with other screen scraping code that I have written, check out my GitHub page.

If you'd like to see another example of scRUBYt! in action, feel free to read the post I wrote for my company's blog.


6 Responses to “Ruby Screen Scraping with scRUBYt!”

Pretty cool. I have used franklymls to find open houses in Northern Virginia.

Interesting point on screen scrapers. For simple stuff I use Python to screen scrape, but for larger projects I used the extractingdata.com screen scraper, which worked great; they build custom screen scrapers and data-extraction programs.

The syntax is a little funky for sure, but it is valid. I used an XML parser library called Hpricot for this example. Hpricot takes a block of XML and allows you to parse out individual elements by referencing their XPath. For the example you gave above, if the XML looked like this:

As far as the errors are concerned, a lot has happened since I last used this script. For one, Ruby gems are no longer hosted on GitHub, so my customized jspradlin-scrubyt gem may no longer be available.

Anyway, I do most of my screen scraping these days using a library called Nokogiri. I'd check that library out; I find its syntax a little more intuitive.