Writing Google Desktop Search Plugins

The Google Desktop Search (GDS) engine is a tool created by Google that indexes
the files on your (Microsoft Windows-based) computer and then lets you search
them. It indexes files written to disk (text files, web pages, media files,
etc.), email, instant messages, and web pages and media files viewed on the
Web. GDS creates a deskbar in the toolbar, which enables quick searching on
criteria you specify. It returns search results by directing your default
browser to a web server running on your machine. The browser-based search
interface has an obviously Googlish look and feel. Interestingly, if you have
GDS installed, you will have a Desktop search option (in addition to Web,
Images, Groups, News, Froogle, and Local) when you visit Google. When you
perform a search on the main Google page, GDS matches for that search may also
show up in the form of "<Some number of> results stored on your computer"
as the first search result.

As cool as this is, an even cooler aspect of GDS is that it is an extensible
framework. Google has released an SDK so developers can write
plugins for GDS. One such plugin is Kongulo, a web
spider. Kongulo provides a command-line interface to crawl, starting at a
specified URL, and index the resources it finds there within GDS. Command-line
options include depth, URL match, loop, sleep time between loops, and
passwords. Kongulo can be a useful tool for indexing intranets or private wikis
... or to see an example of a good plugin written for GDS.

How does a plugin tie into GDS? Through COM, which makes sense given that GDS
is an application for MS Windows systems. On Friday, May 27, 2005, Google
released the source code for Kongulo. What follows is the meat of how Kongulo
pushes spidered web pages to GDS. (The pieces of the code that pertain
specifically to spidering are interesting, but this article won't detail that
aspect of Kongulo.)

Kongulo starts by calling CreateEvent to create an event object for each page.
The first argument the crawler passes into CreateEvent is the GUID that Kongulo
registers for itself the first time it runs. The second argument is a text
string containing the fully qualified name of the type of event. Kongulo uses only
Google.Desktop.WebPage, but other options include
Google.Desktop.Indexable (which is the parent of all of
the following indexable resources),
Google.Desktop.Email,
Google.Desktop.IM,
Google.Desktop.File, and
Google.Desktop.MediaFile.
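
As a rough sketch of that call, and not Kongulo's actual code: the stand-in classes, the placeholder GUID, and the helper name create_web_page_event are all hypothetical. They exist only so the example runs on a machine without GDS. In a real plugin, the factory would be a COM object supplied by GDS.

```python
# Hypothetical placeholder; a real plugin registers its own GUID with GDS.
PLUGIN_GUID = '{00000000-0000-0000-0000-000000000000}'

# One of the event type names listed in the article.
WEB_PAGE = 'Google.Desktop.WebPage'


class StubEvent:
    """Stand-in for the COM event object (not part of the GDS SDK)."""

    def __init__(self, guid, schema):
        self.guid = guid
        self.schema = schema


class StubFactory:
    """Stand-in for the COM object that exposes CreateEvent."""

    def CreateEvent(self, guid, schema):
        return StubEvent(guid, schema)


def create_web_page_event(factory, guid=PLUGIN_GUID):
    # First argument: the plugin's GUID.
    # Second argument: the fully qualified event type name.
    return factory.CreateEvent(guid, WEB_PAGE)


event = create_web_page_event(StubFactory())
print(event.schema)
```

Passing the factory in as a parameter keeps the COM dependency at the edge of the program, so the rest of a crawler can be exercised on any platform.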

The next steps entail adding properties. The event object has
an AddProperty method that takes two arguments: a property name
and a property value. The crawler adds four properties to every
page it finds: doctype, content, uri, and last_modified_time.

doctype is the document type, pulled from the HTTP headers;
Kongulo indexes only documents of the type text/html or
text/plain. content is the body of the web page.
uri is the web location of the resource.
last_modified_time is currently set to the current local time, although
a note in the source code says to use the Last-Modified HTTP header
instead.
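
The property step might look roughly like the sketch below. RecordingEvent and add_page_properties are hypothetical names, and the example URLs are made up. The property keys simply mirror the names used in this article, so treat the exact strings GDS expects as an assumption to check against the SDK.

```python
import datetime

# The only document types the article says Kongulo indexes.
INDEXABLE_TYPES = ('text/html', 'text/plain')


class RecordingEvent:
    """Stand-in for the GDS event object; it only records AddProperty calls."""

    def __init__(self):
        self.properties = {}

    def AddProperty(self, name, value):
        self.properties[name] = value


def add_page_properties(event, content_type_header, body, url):
    """Add the four common properties; return False if the type is skipped."""
    # Drop parameters such as "; charset=utf-8" from the Content-Type header.
    doctype = content_type_header.split(';')[0].strip().lower()
    if doctype not in INDEXABLE_TYPES:
        return False
    event.AddProperty('doctype', doctype)
    event.AddProperty('content', body)
    event.AddProperty('uri', url)
    # Current local time, as described above; the Last-Modified header
    # would be the more accurate source.
    event.AddProperty('last_modified_time', datetime.datetime.now())
    return True


page = RecordingEvent()
added = add_page_properties(page, 'text/html; charset=utf-8',
                            '<html><body>Hello</body></html>',
                            'http://intranet.example/index.html')
skipped = add_page_properties(RecordingEvent(), 'image/png', '',
                              'http://intranet.example/logo.png')
print(added, skipped)
```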

The crawler adds the following property for HTML pages that contain a
title:

event.AddProperty('title', title)

Interestingly, Kongulo uses regular expressions to find titles, frames, and
links, as opposed to using an HTML parser. The Kongulo team felt this would
make its processing of web pages more forgiving of malformed markup.
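
To illustrate why a regular expression can be more forgiving than a strict parser, here is a minimal sketch of title extraction. The pattern and the helper name extract_title are illustrative assumptions, not the expressions Kongulo itself uses.

```python
import re

# Tolerant pattern: any attributes on <title>, any letter case, and
# title text that spans several lines.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return ' '.join(match.group(1).split())


# Malformed page: uppercase tags, no closing </head>, an unclosed <p>.
sloppy_html = '<HTML><head><TITLE>\n  Team   Wiki\n</TITLE><body><p>unclosed'
print(extract_title(sloppy_html))
```

The pattern finds the title even though the rest of the page is not well formed. The trade-off is that a regular expression can also match text a real parser would reject, such as a title inside a comment.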

The final step is to send the page to GDS, like this:

event.Send(0x01)

Send expects a bitwise OR of the following
values:

EventFlagIndexable = 0x00000001
EventFlagHistorical = 0x00000010

where EventFlagIndexable marks an event that GDS
should index, and EventFlagHistorical marks a historical event (as opposed to
an event that is happening in real time). The Kongulo source code
indicates that if the crawler passes in the historical flag, GDS will not
process the event until the user's system becomes idle.
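
Combining the two flags is an ordinary bitwise OR. The sketch below uses the values given above; the Python constant names and the send_flags helper are hypothetical.

```python
# Flag values as listed in the article.
EVENT_FLAG_INDEXABLE = 0x00000001
EVENT_FLAG_HISTORICAL = 0x00000010


def send_flags(historical=False):
    """Build the value to pass to Send for an indexable event."""
    flags = EVENT_FLAG_INDEXABLE
    if historical:
        # Ask GDS to defer processing until the system is idle.
        flags |= EVENT_FLAG_HISTORICAL
    return flags


print(hex(send_flags()))                 # 0x1
print(hex(send_flags(historical=True)))  # 0x11
```

A crawler working through a backlog of pages could pass the second value to event.Send, so that indexing does not compete with whatever the user is doing.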

At this point, GDS has the web page and it is available for searching.
That's all there is to it.

The GDS team has done an excellent job of providing a great tool that is
easy to extend. The more I play with GDS, the more it impresses me. Of course,
I would play with it more if it ran on Linux (hint, hint). Likewise, the
Kongulo team has done an excellent job of providing a useful plugin to GDS, but
more importantly, of providing clean, readable source code (being written
in Python doesn't hurt its readability) to serve as an example of how to write
a plugin for GDS. While there are plenty of plugins already available for GDS,
this ease of creating a plugin makes me expect many more in the future.

Jeremy Jones
is a software engineer who works for Predictix. His weapon of choice is Python.
