Introduction

As part of the research for my thesis, I had to run a lot of image search queries against the most popular search engines. The Yahoo! guys provide the Yahoo! SDK API, which also supports image search and was really useful to me. Google, however, for some obscure reason, provides an API only for the regular search engine and nothing for image search. A few weeks ago, I came across a very simple implementation of a Google web translation service by Peter Bromberg. So I thought, why not do something similar for Google image search? Well, I did that, and here it is. The regular expressions used are far more complex than the one used in the translation service, but in the end, the approach is basically the same.

Note: The article has been translated to Chinese, and the translation is available here.

What's in the source files?

The source files include two projects. The Ilan.Google.API project includes a DLL which can be used to query Google image search programmatically. The Ilan.Test.Google.API project includes a simple application that enables you to run a query and display all the resulting images dynamically on the form. When you double-click a thumbnail the full original image is displayed. This application is aimed at showing how simple it is to use the API. I wouldn't recommend it as a real searching tool (at least as is) because:

It fetches all images (thumbnails) on one single thread. Although it does let you view the full image before all the thumbnails have been downloaded, a "real" application would have to download several thumbnails at the same time to significantly boost performance.

Only simple queries are supported (space-separated words). Special characters are not handled. If you need to support more complicated queries, you'll have to parse the query and transform it into the URL format that Google expects.

QuickStart - How do I use the API?

If you're not interested in how it works, but just want to use the library, this section is especially for you :-)

Add a reference to the Ilan.Google.API library in your project.

Add the using directive:

using Ilan.Google.API.ImageSearch;

When you need to run a query, make sure it conforms to the URL format supported by Google. For this, I suggest you check the query on Google Image Search and look at how they build the URL. For instance, if you need to support only simple queries of space-separated words, you just need to transform the query into a list of words separated by the plus (+) sign. For example, the query "apple cake" must be transformed to "apple+cake". Notice that a run of several space characters must be transformed into a single + sign, so I wouldn't recommend a simple string.Replace call. You could use a regular expression for this, as I have done in my demo project.
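A minimal sketch of such a transformation follows. It is not necessarily the exact expression used in the demo project; the class and method names (QueryFormatter, ToGoogleQuery) are made up for illustration, and special characters are still not escaped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QueryFormatter
{
    // Hypothetical helper: trims the query, then collapses every run of
    // whitespace into a single '+'. Special characters are NOT escaped.
    public static string ToGoogleQuery(string query)
    {
        if (query == null) throw new ArgumentNullException("query");
        return Regex.Replace(query.Trim(), @"\s+", "+");
    }

    public static void Main()
    {
        Console.WriteLine(ToGoogleQuery("apple   cake"));   // prints apple+cake
    }
}
```

Matching `\s+` rather than a single space is what makes several consecutive spaces collapse into one + sign, which a plain string.Replace would not do.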

The response object holds *all* of the first 50 results for the given query (or fewer if there are no more results). For example, you can retrieve the URL of the first image through the response.Results array:

string firstImageUrl = response.Results[0].ImageUrl;

The parameters for the SearchService.SearchImages method are:

(string) query - The query to be sent.

(int) startPosition - The index of the first item to be retrieved (must be positive).

(int) resultsRequested - The number of results to be retrieved (must be between 1 and (1000 - startPosition)).

(bool) filterSimilarResults - Set to 'true' if you want Google to automatically omit similar entries. Set to 'false' if you want to retrieve every matching image.

Well, I think that should be it. Just remember that Google does not return results beyond the first 1000 results for a query, so if you're trying to get a result that exceeds the first 1000 I'll throw you an exception...
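The ranges described above could be checked with something like the following. This is only a sketch of the documented rules, not the library's actual code: the SearchArguments class is hypothetical, and treating a startPosition of zero as invalid is my reading of "must be positive".

```csharp
using System;

public static class SearchArguments
{
    // Google does not return results beyond the first 1000 for a query.
    public const int MaxResults = 1000;

    // Hypothetical helper mirroring the parameter ranges listed in the article.
    public static void Validate(int startPosition, int resultsRequested)
    {
        if (startPosition <= 0)
            throw new ArgumentOutOfRangeException(
                "startPosition", "Must be positive.");

        if (resultsRequested < 1 || resultsRequested > MaxResults - startPosition)
            throw new ArgumentOutOfRangeException(
                "resultsRequested",
                "Must be between 1 and (1000 - startPosition).");
    }

    public static void Main()
    {
        Validate(1, 50);                       // within range, no exception
        try
        {
            Validate(990, 50);                 // would exceed the 1000 limit
        }
        catch (ArgumentOutOfRangeException ex)
        {
            Console.WriteLine("Rejected: " + ex.ParamName);
        }
    }
}
```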

And if you want to know why and how it works...

... then I hope this section will be of help to you.

Returned objects

The SearchResponse and SearchResult classes are pretty straightforward. A query returns one SearchResponse, which holds the total number of available results for the query as well as an array of SearchResult objects, each representing a separate image returned by Google. The SearchResult objects hold the URL of the thumbnail of the image (located somewhere at Google) and the URL of the actual image (at its source).
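To make the shape of these objects concrete, here is a rough sketch. Only the Results and ImageUrl names appear in this article; ThumbnailUrl and TotalResultsAvailable are assumed names, and the real classes in the library may well look different.

```csharp
using System;

public class SearchResult
{
    public string ImageUrl;        // URL of the original image (name used in the article)
    public string ThumbnailUrl;    // URL of the thumbnail at Google (assumed name)
}

public class SearchResponse
{
    public int TotalResultsAvailable;   // total hits for the query (assumed name)
    public SearchResult[] Results;      // one entry per retrieved image
}

public static class ResponseDemo
{
    public static void Main()
    {
        // Hand-built response with placeholder URLs, just to show the access pattern.
        SearchResult first = new SearchResult();
        first.ImageUrl = "http://example.com/cake.jpg";
        first.ThumbnailUrl = "http://thumbs.example/cake.jpg";

        SearchResponse response = new SearchResponse();
        response.TotalResultsAvailable = 1;
        response.Results = new SearchResult[] { first };

        Console.WriteLine(response.Results[0].ImageUrl);
    }
}
```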

Building the query URL

Digging a little further, you can fetch results at a certain position by adding "&start=". So, if you want to fetch results from 21 to 40 (i.e. the second page if you were using their web site), the URL should be this.

Note that the index of the images is 0-based, so to start with the 21st result you must specify "&start=20". Then, I found out that there is a default filter that omits results if they resemble the previous results. If you want to disable this filter, you need to add "&filter=0". A quick test will show you that "&filter=1" turns the filter on. To see the effect of the filter, I suggest you run the following two queries, which return the results starting at result number 900:
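As an illustration, URLs with these parameters could be assembled as follows. The "&start=" and "&filter=" parameters are the ones described above; the "q" parameter name is an assumption, the base URL in Main is only a placeholder (substitute whatever Google Image Search actually uses), and SearchUrlBuilder is a made-up name.

```csharp
using System;
using System.Globalization;

public static class SearchUrlBuilder
{
    // Hypothetical helper. formattedQuery is expected to be URL-ready
    // already (for example "apple+cake"); start is the 0-based result index.
    public static string Build(string baseUrl, string formattedQuery,
                               int start, bool filterSimilarResults)
    {
        return baseUrl
            + "?q=" + formattedQuery                                // "q" is an assumption
            + "&start=" + start.ToString(CultureInfo.InvariantCulture)
            + "&filter=" + (filterSimilarResults ? "1" : "0");
    }

    public static void Main()
    {
        string baseUrl = "http://images.example/images";            // placeholder, not the real URL

        // The same query starting at result 900, with the filter off and on.
        Console.WriteLine(Build(baseUrl, "apple+cake", 900, false));
        Console.WriteLine(Build(baseUrl, "apple+cake", 900, true));
    }
}
```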

Extracting information from the retrieved HTML

Here comes the ugly part. We have to parse the HTML and extract the number of available results for the query as well as information about each of the retrieved images. After analyzing the HTML I got from Google, I managed to find a recurring pattern that lets you locate each of these interesting bits of information accurately in the HTML. Needless to say, if Google changes the format of the returned HTML, the parsing will fail! Of course, I relied on regular expressions to parse the text. Following are the different patterns used in the API:

This pattern is used to retrieve information about each image. The URL of the original image is captured into the "imgurl" group, the URL of the thumbnail is captured into the "images" group and the width and height of the thumbnail image are captured in the "width" and "height" groups respectively.

This pattern is used to retrieve additional information about each image - the original images' widths, heights and sizes (in groups "width", "height" and "size" respectively). I didn't find a way to use a single pattern for all of the images' information - I guess there is one, but I gave up searching after a while...

This pattern is used to retrieve the total number of results available for the query (can be found on the upper right portion of the HTML when you look at the result of a query). I have also extracted the last result index - to find when there are no more results.
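To show the general technique of reading named groups, here is a small sketch. The group names "imgurl", "images", "width" and "height" are the ones mentioned above, but both the HTML fragment and the pattern are invented for this example; they are not the markup Google returns, nor the patterns shipped with the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class ImageInfo
{
    public string ImageUrl;
    public string ThumbnailUrl;
    public int ThumbnailWidth;
    public int ThumbnailHeight;
}

public static class ResultParser
{
    // Illustrative pattern for the made-up markup in Main below.
    private static readonly Regex ImagePattern = new Regex(
        @"imgurl=(?<imgurl>[^&""]+)[^>]*>\s*<img\s+src=""(?<images>[^""]+)""" +
        @"\s+width=""(?<width>\d+)""\s+height=""(?<height>\d+)""",
        RegexOptions.IgnoreCase);

    public static List<ImageInfo> Parse(string html)
    {
        List<ImageInfo> found = new List<ImageInfo>();
        foreach (Match m in ImagePattern.Matches(html))
        {
            ImageInfo info = new ImageInfo();
            info.ImageUrl = m.Groups["imgurl"].Value;
            info.ThumbnailUrl = m.Groups["images"].Value;
            info.ThumbnailWidth = int.Parse(m.Groups["width"].Value);
            info.ThumbnailHeight = int.Parse(m.Groups["height"].Value);
            found.Add(info);
        }
        return found;
    }

    public static void Main()
    {
        // Invented fragment with two results.
        string html =
            "<a href=\"/imgres?imgurl=http://example.com/a.jpg&imgrefurl=http://example.com/\">" +
            "<img src=\"http://thumbs.example/t1.jpg\" width=\"120\" height=\"90\"></a>" +
            "<a href=\"/imgres?imgurl=http://example.org/b.png&imgrefurl=http://example.org/\">" +
            "<img src=\"http://thumbs.example/t2.jpg\" width=\"100\" height=\"75\"></a>";

        foreach (ImageInfo info in Parse(html))
            Console.WriteLine(info.ImageUrl + " -> " + info.ThumbnailUrl);
    }
}
```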

What about the 20-results-per-query limit?

That's straightforward. Once you know the "start=" portion of the URL, you can run the queries in a loop until you reach the requested number of results.
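A sketch of such a loop is shown below. The page fetcher is passed in as a delegate so that the example runs without touching the network; ResultPager, Collect and FakePage are hypothetical names, and the real library fetches and parses the HTML at that point instead.

```csharp
using System;
using System.Collections.Generic;

public static class ResultPager
{
    public const int PageSize = 20;   // results returned per query

    // Keeps requesting pages until enough results are collected or the
    // source runs dry. fetchPage takes the 0-based "start=" value.
    public static List<string> Collect(Func<int, IList<string>> fetchPage,
                                       int startPosition, int resultsRequested)
    {
        List<string> results = new List<string>();
        int start = startPosition;

        while (results.Count < resultsRequested)
        {
            IList<string> page = fetchPage(start);
            foreach (string item in page)
            {
                if (results.Count == resultsRequested) break;
                results.Add(item);
            }
            // A short (or empty) page means there is nothing left to fetch.
            if (page.Count < PageSize) break;
            start += PageSize;
        }
        return results;
    }

    // Stand-in for a real query: pretends only 45 images exist.
    public static IList<string> FakePage(int start)
    {
        List<string> page = new List<string>();
        for (int i = start; i < Math.Min(start + PageSize, 45); i++)
            page.Add("image" + i);
        return page;
    }

    public static void Main()
    {
        Console.WriteLine(Collect(FakePage, 0, 100).Count);   // prints 45
        Console.WriteLine(Collect(FakePage, 0, 30).Count);    // prints 30
    }
}
```

Stopping on a short page avoids returning the same few results over and over when a query has fewer matches than requested.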

And the 1000 results limit?

Hmmm, sorry. I didn't find a way to work around that one. I assume, however, that in virtually all applications 1000 results should be more than enough. Besides, by the time you get to the last results, most of them are totally irrelevant to the actual query anyway...

Thanks and apologies

I wish to thank Roy Osherove and his Regulator. I have used regular expressions a few times in the past, but mostly with very simple patterns. The expressions used here are by far the most complicated ones I have ever written, and I couldn't have written and tested them successfully without the help of "The Regulator". Which brings me to the apologies - there is a good chance that the patterns I'm using could be simplified. If you find a way to simplify them (with no performance penalty), then please let me know, and I'll update the code. Finally, I would like to thank "pedrito68", who has provided me with very useful comments (and code) based on the first version of this API, which I have added in the current version (see History section).

Conclusions

The Google Image Search API is essentially a tool that you can use if you need to perform an image search against Google programmatically. Since it parses the HTML returned by Google, if the format of this HTML changes, the library's implementation will have to change accordingly. The implementation is rather simple. It shows a simple example of how to send a URL to a web server (using the HttpWebRequest object) and retrieve the HTML returned by the web server. It also uses regular expressions (using the System.Text.RegularExpressions.Regex class) with some pretty complicated patterns to extract the interesting data from the HTML. Finally, the demo application shows how to use the API.

On a personal note - I have been using this API for the past few days to run over 40,000 single-word queries. It has proven to be very accurate, and the regular expressions never broke. One very interesting feature is that, unlike the regular SDKs, it does not suffer from any quantity limit. For instance, Google's web search API won't let you run more than 1000 queries with the same key in a single day (24 hours). Similarly, Yahoo! has a limit of 5000 queries per day. It might be good to adapt this API to provide regular search capabilities and work around Google's 1000-queries-per-day limitation, or adapt it to Yahoo! and work around their limitation...

If this article was useful to you, please don't forget to vote. I'd like it to get out of the 'unedited' section as soon as possible. Also, you're welcome to visit my blog.

History

October 5th, 2005 - First version.

October 9th, 2005 - Added changes recommended by pedrito68 and some bug fixes:

Extract more information about each image: file size/width/height/name/extension, thumbnail width/height.

Get thumbnails on separate threads.

Double-clicking a thumbnail pops up the full image.

Support for SafeSearch.

Bug fix - when a query had only a few results and more were requested, the same results were returned multiple times. For example, if a query returned only three images and we requested 100, we would get 15 results (the same 3 results repeated 5 times).

January 4th, 2006 – Updated the second regular expression due to changes in the format of the HTML returned by Google.

December 27th, 2006 – Updated source code:

Works under .NET 2.0 instead of .NET 1.1 (and VS2005 accordingly).

The three regular expressions used by the API are now loaded from an external text file, whose name is read from the config file. So now, you can change the regular expressions without needing to recompile or even restart the application; just update the regex in the text file. To reload the regular expressions, call the new SearchService.LoadRegexStrings() function. I've added a button in the sample application that does just that, so it's easy to see how it works. I put the regexes in a text file rather than directly inside the config file in order to keep them simple, without having to complicate them further to comply with the XML format.

January 28, 2007 - Updated source code.

Support for the new format of the results returned by Google Image Search

Thumbnails are downloaded on separate threads (test application)

Better UI thread handling (test application)

March 11, 2007

Updated to comply with new format of results returned by Google Image Search

About the Author

I am an MSc. student at the Interdisciplinary Center Herzlia, Israel (www.idc.ac.il). Also, I work as a private consultant in the fields of OOP/OOD, C++, C#, SQL Server and solving complex problems with AI and Machine Learning methods. See my Blog at: http://ilanas.blogspot.com/