Scraping Google search results

Then run this script to scrape the first two pages of Google search results for "ruby":

require 'rubygems'
require 'scrubyt'
# See http://scrubyt.org/example-specification-from-the-page-known-issues-and-pitfalls/
# Create a learning extractor
data = Scrubyt::Extractor.define do
  fetch('http://www.google.com/')
  fill_textfield 'q', 'ruby'
  submit
  # Teach Scrubyt what we want to retrieve.
  # In this case we want Scrubyt to find all search results,
  # and "Ruby Programming Language" happens to be the first
  # link in the result list. Change "Ruby Programming Language"
  # to whatever you want Scrubyt to find.
  link do
    name "Ruby Programming Language"
    url "href", :type => :attribute
  end
  # Click "Next" until we're on the second page.
  next_page "Next", :limit => 2
end
# Print out what Scrubyt found
puts data.to_xml
puts "Your production scraper has been created: data_extractor_export.rb."
# Export the production version of the scraper
data.export(__FILE__)

Finding the correct XPath

The learning mode is pretty good at finding the XPath of HTML elements, but if you have difficulties getting Scrubyt to extract exactly what you want, simply install Firebug and use the Inspect feature to select the item you want to extract the value from. Then right-click on it in the Firebug window and choose Copy XPath.

Note that there's a gotcha when copying an element's XPath with Firebug. Firebug works on Firefox's internal, normalized DOM, which might not match the real-world HTML structure. For example, the tbody tag is usually added by Firefox/Firebug, so it should be removed from the copied XPath if it isn't present in the actual HTML.
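For instance, the copied XPath can be cleaned up with a simple string substitution before handing it to your scraper (the XPath below is just a made-up example):

```ruby
# Hypothetical XPath as copied from Firebug, including the
# tbody element that Firefox inserted into its own DOM.
copied_xpath = '/html/body/table/tbody/tr[3]/td[2]'

# Strip the /tbody step so the XPath matches the real HTML source.
cleaned_xpath = copied_xpath.gsub('/tbody', '')

puts cleaned_xpath  # => /html/body/table/tr[3]/td[2]
```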

Another option that I haven't tried myself is to use the XPather extension.

Using Hpricot to find the XPath

If you're really having trouble finding the right XPath for an element, you can also use Hpricot to find it. In the following example, the code prints out the XPath of every table cell containing the text "51,999":