Login

Google Docs and Xpath Data Functions

In this tutorial, we will teach you how to use Xpath data functions (such as ImportXML) to retrieve information from Google Docs and other outside sources.

Google docs has already proven itself to be useful to office workers highly dependent on the Internet for sharing information. It allows users to share spreadsheets, active collaboration, and create questionnaires/forms more easily than any other solution out there.

This tutorial will take a look at some of the rare and important uses of Google docs which many ordinary users may not know. These are called external data functions. These functions are meant to retrieve data from external sources outside of your Google docs spreadsheet. For example, suppose you are working on a Google docs spreadsheet and you need to retrieve the list of hyperlink URLs found on web page X, or you need to grab all of the top ranking URLs from the search result of keyword X.

This tutorial will focus on the implementation of the ImportXML function, one of the important external data functions available in Google Docs used to retrieve data from outside sources.

Basic Definition of ImportXML Function

The syntax of this function is:

importXML("URL";"query")

Definition:

URL- the URL of the website HTML page. It can also be the URL of an XML page.Query – the Xpath query you would like to run that defines which specific data you would like to retrieve in a URL or XML page.

If you are completely new to Xpath queries, I recommended you read the tutorial over at W3Schools first: http://www.w3schools.com/xpath/xpath_syntax.asp for reference.

Example 1: Suppose you want to grab the URL of all the hyperlinks on this page: http://www.php-developer.org/site-map/

Solution:

A hyperlink uses this format: <a href="http://www.example.com/">Anchor text</a>. To extract the URL of the hyperlinks, you should use this Xpath query:

Xpath query = //a/@href

Discussion: According to the Xpath documentation at: http://www.w3schools.com/xpath/xpath_syntax.asp , “//” instructs “to select nodes in the document from the current node that match the selection no matter where they are.” Since you are interested in picking up hyperlink URLs regardless of where they appear on the page, you should start the query with “//”.

This also implies that “//a” means to select all “a” elements no matter where they are on the page. Once the “a” element has been selected, the rest of the query /@href means to extract the value of the attribute href of the element.

Therefore the final instruction of the Xpath query: //a/@href means literally to “extract all the values of the href attribute element regardless of where they appear on the page: http://www.php-developer.org/site-map/”

Implementation in Google docs:

To implement the above function in your Google docs spreadsheet as well as to retrieve the data, follow the steps below:

1. Log in to http://docs.google.com/ with your Google account.

2. In the empty cell A1, paste the function below in the formula window:

=importXML("http://www.php-developer.org/site-map/";"//a/@href")

3. Google docs will automatically populate the columns with all the URLs of the hyperlinks that are retrieved from the page. See the screen shot below:

{mospagebreak title=Parsing Data with the Xpath ImportXML Function}

But what if you are interested in creating a more complicated Xpath query to retrieve data from external sources?

Example2: Suppose you are interested in automatically retrieving the top 100 ranking URLs in the Yahoo search engine for the keyword “php developer,” as indicated in the screen shot shown below:

Of course, manually retrieving each of the URLs from position 1 to position 100 is a tedious and time-consuming activity. You can use the ImportXML function to automate this process neatly within the Google Docs environment and save time.

First, you need to define the URL from which to import the data. To do this, simply go to www.yahoo.com and type the keywords "php developer" in the Yahoo search box. Once you see the results, click “Options” (located beside the search button), and then click “Advanced Search.” Under “Number of Results,” change “10 results” to “100 results” and click “Yahoo Search.”.

The search results will then be updated to provide the Top 100 results. Now get the Yahoo search result URL you see in the browser address bar, which should look something like this:

This is value of the URL which you will be using in your ImportXML function.

Finally, you will need to define your Xpath query. As you may have suspected, this can be quite complex because the Yahoo search result page contains a lot of elements that you can grab and analyze.

To easily formulate your Xpath query for such complex pages as this, you can use your favorite browser developer tools, such as Firebug or the Inspect Element feature in Google Chrome. There are also developer tools in Internet Explorer which you can use as well.

Inspect/focus it on the area of interest until you find the div section of the search results. In this example the div region covering the search result is:

<div id="main">

<!–Yahoo search result within this div–>

</div>

Below is a screen shot of the affected code region (encircled in red):

The next thing you will want to take note of is under the element: <div id="main">; here you will see two hyperlinks which are a child or a descendant of the element <div id="main">. The first hyperlink element contains the correct URL you need to retrieve (enclosed within the bigger red circle), while the other hyperlink element is not the correct one because it is pointing to this URL: http://search.yahoo.com/r/_ylt, which is not part of the search result URLs pointing to external websites.

What’s more interesting is that the correct hyperlink you need to retrieve contains a unique identifiable attribute:

class=’yschttl spt’

If you are not going to specify this attribute in your Xpath query later, keep in mind that it will retrieve the wrong hyperlink element.

Finally, to retrieve the URL of the hyperlink, you need to grab the value of the href attribute.

So what is the final Xpath query? You can find it below:

//div[@id='web']//a[@class='yschttl spt']/@href

This tells the ImportXML function to “extract all the values of the href attribute in the hyperlink element a where it has an attribute of “class” with value=’yschttl spt’, which is also descendant of the div element with an attribute id whose value is equal to ‘web’”

This is the exact Xpath query required to retrieve the URLs of the ranking pages in the Yahoo search result.

Implementation On Google Docs

1. Implement the function below by pasting it into the formula window in your Google Docs spreadsheet:

2. It will automatically parse the URL of the Top 100 URLs in Yahoo search result for the keyword “php developer.” You can then apply any formatting and labels to the Google Docs spreadsheet so that the report will be presentable and clear.

You can view the actual Google Docs spreadsheet here: https://spreadsheets.google.com/ccc?key=0Aihbrnddtq2EdGxlbFVOTTh5NURNTHJjVDVpYnlmSWc&hl=en&authkey=CLGFqYQI illustrating Example1 and Example2 as explained in this tutorial.