Background

Searcharoo 1 describes building a simple search engine that crawls the file system. A basic design and object model was developed to support simple, single-word searches, with results displayed in a rudimentary query/results page.

Searcharoo 2 focused on adding a 'spider' to find data to index by following web links (downloading files via HTTP and parsing the HTML). It also discussed how the results for multiple search words are combined into a single set of 'matches'.

Searcharoo 3 implemented a 'save to disk' function for the catalog, so it could be reloaded across IIS application restarts without having to be generated each time. It also spidered FRAMESETs and added Stop words, Go words and Stemming to the indexer.

Introduction to version 5

This article is shorter than most, covering just two topics:

Allowing Searcharoo to run on websites where the ASP.NET application is restricted to Medium Trust. The remote-indexing console app in v4 was intended to address this issue - but just building the catalog remotely isn't enough, because you cannot binary-deserialize the file under Medium Trust. Rather than advise people to try and get the trust level on their server changed or customised (difficult!), the file format has been changed (to XML) to allow it to work in Medium Trust.

Extending the Document object hierarchy introduced in v4 to index Office 2007 (OpenXML) file types. I received a *.docx file from a colleague recently, and since I don't intend to upgrade to Office 2007 any time soon, it seemed like a good idea to investigate how the file could be indexed/searched without having the application/IFilter installed.

ASP.NET has 'Trust Issues'

When Searcharoo v4 is run under Medium Trust, you get one of these errors:

WebPermission denied if Search.aspx cannot find a catalog file and triggers SearchSpider.aspx (accessing other websites or web services is not allowed under Medium Trust by default).

SecurityPermission denied if Search.aspx finds an existing catalog file and tries to binary-deserialize it (binary deserialization requires permissions that are not granted under Medium Trust).

The combination of these errors - cannot create a new catalog, and cannot load an existing catalog file (even if it was generated elsewhere) - means that Searcharoo v4 doesn't work under Medium Trust. There are two options for fixing this problem:

Update your server with a custom Code Access Security policy to allow the Searcharoo code to perform these functions. This could be very difficult if your site is on shared hosting and you need to convince the ISP to make changes 'just for you'.

Make changes to Searcharoo so that at least one of those errors does not occur.

We'll do #2, since it's easier! There was a long discussion in v4 about why Binary Serialization was a good idea and Xml Serialization was bad: in this article we'll turn that around by fixing the problems with the Xml output, so that the catalog can be built remotely using the Indexer Console Application and then uploaded to a Medium Trust website. Xml-serialized data can be de-serialized even under Medium Trust, so it can be loaded and searched.

About Option #2: Xml redux

Original (v4) Xml Catalog format

Way back in v4, the Xml-serialized Catalog object was dismissed as bloated, inefficient and (as implemented) unable to be de-serialized. It looked like this:
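Roughly speaking, the serialized output had the following shape. The element names and the sample words here are purely illustrative, not copied from the real v4 file:

    <Word>
      <Text>africa</Text>
      <Files>
        <File>
          <Url>http://localhost:3359/content/Kilimanjaro.pdf</Url>
          <Title>Kilimanjaro</Title>
          <!-- ...description, size, crawl date, etc... -->
        </File>
        <!-- ...every other File containing this word, serialized in full... -->
      </Files>
    </Word>
    <Word>
      <Text>kilimanjaro</Text>
      <Files>
        <File>
          <Url>http://localhost:3359/content/Kilimanjaro.pdf</Url>
          <Title>Kilimanjaro</Title>
          <!-- ...exactly the same File data, repeated in full... -->
        </File>
      </Files>
    </Word>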

Recall that each Word object contains a collection (Hashtable) of File objects, indicating which File/s that Word appeared in. That works in memory because the File objects in Word._FileCollection are references - there's only one File 'object' per indexed file.

The problem with the resulting Xml is that the File object references are 'flattened out' (repeated every time they are referenced). You can see above that the document http://localhost:3359/content/Kilimanjaro.pdf is represented twice in the small excerpt. This repetition occurs for EVERY WORD in each File, creating a huge amount of redundant data in the Catalog file.

What's needed is a more succinct way to represent the relationship between Word and File: a 'foreign key' in database terms.

New (v5) Xml Format

This 'foreign key' will be represented by a new object CatalogWordFile, which will act as a 'proxy' for Word objects (which we will no longer serialize). The Word object will continue to be the basis of the Catalog, but when we load and save it via Xml Serialization, we will use attributes to ignore Word and treat the File and CatalogWordFile like two 'database tables' joined by a 'foreign key': the FileId.
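A minimal sketch of what such a proxy might look like is shown below; the member names are assumptions rather than the exact v5 source, which may also carry ranking data for each word/file pair.

    // Sketch only: a 'foreign key' proxy, one instance per Word-to-File link.
    // Member names are assumptions; the real Searcharoo v5 class may differ.
    public class CatalogWordFile
    {
        public string Word;   // the indexed word itself
        public int FileId;    // position of the File in the serialized File array
    }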

Now the File objects are serialized once, and their FileId is their implicit order in the serialized Xml (starting from zero, of course). The content we mentioned above - http://localhost:3359/content/Kilimanjaro.pdf - appears just once in the new Xml, as FileId=2 (below).
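Illustratively (the neighbouring entries are invented for the example), the File section now has this shape:

    <File>   <!-- implicit FileId = 0 -->
      <!-- ... -->
    </File>
    <File>   <!-- implicit FileId = 1 -->
      <!-- ... -->
    </File>
    <File>   <!-- implicit FileId = 2 -->
      <Url>http://localhost:3359/content/Kilimanjaro.pdf</Url>
      <Title>Kilimanjaro</Title>
      <!-- ... -->
    </File>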

In the same Xml file the individual CatalogWordFile objects reference just the FileId, resulting in a significantly smaller Xml than when Word objects were used.

Comparing with the original (v4) Xml Catalog example, the same two words appear again below, this time with just the FileId rather than a whole serialized File object.
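Again, element names and values are illustrative:

    <CatalogWordFile>
      <Word>africa</Word>
      <FileId>2</FileId>
    </CatalogWordFile>
    <CatalogWordFile>
      <Word>kilimanjaro</Word>
      <FileId>2</FileId>
    </CatalogWordFile>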

Note that the markup shown still uses complete element names; in the actual code the element names are overridden with the attributes [XmlElement("w")] and [XmlElement("f")] to further reduce the Xml file size.

The test data used during development created a 178 Kb file when Binary Serialization was used. This equated to a 1.1 Mb Xml file in the old format.

Using the new, improved Xml format, the file shrank to 194 Kb; after applying XmlElement attributes to shorten the element names it shrank even further, to 97 Kb - actually smaller than the Binary version.

Behind the Xml-serialization Scenes

So that's the Xml format we need - how do we get it? Unfortunately, just replacing the Word[] with CatalogWordFile[] isn't all we needed to do to make this work. The FileId needs to be 'in-sync' between the CatalogWordFile and File arrays, but we don't really know what order the XmlSerializer will access the properties (nor whether they'll be accessed multiple times). To avoid having to populate the internal CatalogWordFile collection unnecessarily, we use pre/post methods in the Property accessors to create it on-demand.

The two property declarations look like this (below): PrepareForSerialization() does the work of 'flattening' the _Index Hashtable of Word objects into CatalogWordFile proxies with FileIds; it is called in both get accessors to ensure they return the 'synchronized' collections.
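Here is a sketch of how those members might be declared. The _Index Hashtable and the [XmlElement] names come from the article; the property and field names (Files, WordFiles, _FileArray, _WordFileArray) are assumptions:

    // Sketch of the Catalog's serialization members (System.Xml.Serialization attributes).
    // Field/property names other than _Index are assumptions, not the exact v5 source.
    [XmlElement("f")]
    public File[] Files
    {
        get
        {
            PrepareForSerialization();   // flatten _Index into the two arrays (idempotent)
            return _FileArray;
        }
        set
        {
            _FileArray = value;
            PostDeserialization();       // only rebuilds _Index once both arrays have arrived
        }
    }

    [XmlElement("w")]
    public CatalogWordFile[] WordFiles
    {
        get
        {
            PrepareForSerialization();
            return _WordFileArray;
        }
        set
        {
            _WordFileArray = value;
            PostDeserialization();
        }
    }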

The PostDeserialization() method waits until both the File and WordFile set accessors have been called (because we need both collections to re-build the original _Index via our 'foreign key'), then loops through the data, calling the Add() method just as the Spider does when it builds the Catalog while indexing.
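Continuing the sketch above (and assuming an Add(word, file) signature, which may differ from the real code):

    // Sketch only: rebuild the in-memory _Index after de-serialization. The order in
    // which XmlSerializer calls the set accessors isn't guaranteed, so bail out until
    // both collections are present.
    private void PostDeserialization()
    {
        if (_FileArray == null || _WordFileArray == null)
            return;   // the other collection hasn't been de-serialized yet

        foreach (CatalogWordFile wf in _WordFileArray)
        {
            // Resolve the 'foreign key' back to the single shared File reference and
            // re-add the word, just as the Spider does while indexing.
            Add(wf.Word, _FileArray[wf.FileId]);
        }
    }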

One final note: rather than remove the Binary Serialization feature, both methods are still available, controlled by a new web.config/app.config setting (for your Website and Indexer Console application).

If set to True, the Catalog will be saved as an Xml file; if set to False it will be written as a binary *.dat file. Don't forget to update the other .config file settings to match your environment - including the Searcharoo_VirtualRoot, Searcharoo_CatalogFilepath and Searcharoo_TempFilepath settings, which are used by the DownloadDocument class discussed in the remainder of this article...
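For reference, the relevant appSettings entries look something like the following. The key name used for the Xml/binary toggle below is a placeholder - check the .config files in the download for the exact spelling - and the values are examples only:

    <appSettings>
      <!-- Placeholder key name: True = save the Catalog as Xml, False = binary *.dat -->
      <add key="Searcharoo_InMediumTrust" value="True" />

      <!-- Settings named in the article; example values, adjust to your environment -->
      <add key="Searcharoo_VirtualRoot" value="http://localhost:3359/content/" />
      <add key="Searcharoo_CatalogFilepath" value="C:\Temp\searcharoo.xml" />
      <add key="Searcharoo_TempFilepath" value="C:\Temp\" />
    </appSettings>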

Office 2007 File Formats

The rest of the article discusses indexing the new Office 2007 file formats.

Microsoft Word Docx file 'structure'

This blog on getting started with OpenXML discusses how to use the Open XML File Formats. It explains the basic structure of OpenXML documents: they are actually a series of related Xml (and other) files, 'hidden' inside a single ZIP file with an Office 2007 file extension (docx, xlsx, pptx, etc.).

'Inside' the ZIP, a Microsoft Word 2007 file contains a number of parts: [Content_Types].xml, the _rels relationship files, document properties and the word folder, among others. You can read all about the details of the format in the references, but the key file we're interested in is the document.xml part. To search it, we'll need to do the following steps:

Download the OpenXML file/ZIP archive from the web link

Extract the file we need from the ZIP archive

Learn a bit about the Xml format so we can extract the plaintext we want to index, and ignore all the formatting and other data.

Step 1: Subclassing Document to share download code

The v4 article describes how the FilterDocument needed to download files for IFilter processing (whereas previously downloads were loaded into, and parsed from, a MemoryStream). The new Office 2007 classes need the same behaviour, so the SaveDownloadedFile method is pushed up into a superclass (DownloadDocument) that they can all inherit.
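Step 2: Extract document.xml from the ZIP archive

The second step from the list above is pulling the word/document.xml part out of the downloaded file. Here is a minimal sketch using System.IO.Compression (which requires a later framework than the Searcharoo source targets, so the shipped code will handle the ZIP differently - treat this purely as an illustration, and the helper name as hypothetical):

    using System.IO;
    using System.IO.Compression;

    // Sketch: open the downloaded *.docx (really a ZIP archive) and read the
    // word/document.xml part into a string. Helper name is hypothetical.
    static string ExtractDocumentXml(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null) return string.Empty;   // not a Word 2007 package

            using (StreamReader reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }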

Step 3: Extract text

It turns out the Word 2007 OpenXML format is very Html-like in its treatment of formatting and content: the document structure and formatting present in document.xml are carried in the Xml elements and attributes, and the relevant plain text sits in the InnerXml of each element. For our purposes, we'll assume that's all the text we wish to index (more research is required to determine whether headers/footers/tables/references are included, and more work would be required to detect and index other embedded Office documents).
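As a rough illustration of that idea (the helper name is hypothetical, and joining the text runs with spaces is a simplification of whatever the real parser does):

    using System.Text;
    using System.Xml;

    // Sketch: collect the <w:t> text runs from document.xml and join them with spaces
    // so that words from different runs/paragraphs don't get glued together.
    static string GetTextFromWordXml(string documentXml)
    {
        const string wordMlNamespace =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(documentXml);

        StringBuilder text = new StringBuilder();
        foreach (XmlNode node in doc.GetElementsByTagName("t", wordMlNamespace))
        {
            text.Append(node.InnerText);
            text.Append(' ');
        }
        return text.ToString();
    }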

DocxDocument in 3 easy steps

The new Docx file indexer inherits most of its functionality from the abstract Document and DownloadDocument classes. All we really need to do is override the GetResponse() method to extract the file contents and set the WordsOnly property, which is used to generate the Catalog.
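In outline, the override might look something like this; the base-class member signatures and the two helpers are assumptions carried over from the sketches above, not the shipped code:

    // Sketch of a DocxDocument subclass. GetResponse(), WordsOnly, SaveDownloadedFile and
    // the Document/DownloadDocument hierarchy are named in the article; their exact
    // signatures, and the two helper methods, are assumptions for illustration.
    public class DocxDocument : DownloadDocument
    {
        private string _WordsOnly;

        public override string WordsOnly
        {
            get { return _WordsOnly; }
        }

        public override void GetResponse(System.Net.HttpWebResponse webresponse)
        {
            // 1. Save the *.docx response stream to a temporary file
            //    (SaveDownloadedFile lives in the DownloadDocument superclass).
            string filename = SaveDownloadedFile(webresponse);

            // 2. Pull word/document.xml out of the ZIP (hypothetical helper, Step 2).
            string documentXml = ExtractDocumentXml(filename);

            // 3. Reduce the markup to plain text for the Catalog (hypothetical helper, Step 3).
            _WordsOnly = GetTextFromWordXml(documentXml);
        }
    }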

This same pattern can be easily applied to PowerPoint 2007 (.pptx files) and Excel 2007 (.xlsx files) - see the XlsxDocument and PptxDocument code for the additional work that was required to loop through sheets/slides to get all the text in those file types.

Wrap-up

These additions to Searcharoo are quite minor, and have been posted mainly to help anyone wishing to use the code under Medium Trust. Many users may have Office 2007 installed (or the relevant IFilter on their server) and may not even need the additional Document subclasses - if this is the case, simply remove the new case statements from DocumentFactory and let the existing FilterDocument handle those file types for the Indexer.

Comments and Discussions

It's impossible to even guess what the problem might be here without a lot more information. It's generally a good idea when posting questions here or on any newsgroup to provide as much information as possible.

First of all, set the searcharoo log level (in .config files) to 5 and see what exception messages are generated.

You might also check web.config settings for the save location of TEMP files, which is where Searcharoo downloads PDF/DOC/PPT files to before it can open them for indexing - your code must have read/write permission to that directory.

If you can post the complete error log and your web.config settings, perhaps then we can diagnose the problem more effectively.

Hey Craig! i was just testing your newest application (version 5) and it seems to work fine sometimes but i'm having problems with the indexing part. It seems to recognize everything but the documents (PDF, PPT, DOC....) do you have any idea why is this? please let me know. Thanks in advance once again!

There's something i'm not sure of....and i've got some questions. Hope you can help me with this.
This project is also a resource for searching a determined virtual directory, right? so, how can you restrict some folders inside that virtual directory. Let's say that you have your applications folder with the folders you say in this project and in the application's one, there's a txt file called: "robots.txt" where you can place the restricted searching areas but how would you write those folders in the txt file?? do i have to change anything in the code to tell searcharoo not to look for those folders inside the virtual directory?
if you can give an example, i would really appreciate that.

thanks for your answer, i'm not sure about where to place the file that you say. In this project there are lots of folders and there's specially one called: "WebApplication" so, i'm not sure where this file goes and if i have to change something in the code to tell the spider where to look.Please let know.
Thanks a lot once again!

is there any built-in mechanism to tell the parser where to start and stop parsing (e.g. by a special tag)? my problem is that my website uses asp.net 2.0 master pages. the basic master page is inherited by all pages and contains navigation, etc. so some words that occur in the navigation tree are actually on every page in the website. so it would be great if I could tell the parser to start where the main page content begins and stop parsing after it ends.

<html><div id="navi">...</div><div id="mainContent">

<!-- Start parsing here -->

blablabla

<!-- Stop parsing here -->

</div><div id="footer">...</div></html>

I'm not sure if you have already implemented this and I've only overlooked it. But if not, this could be a feature everyone working with master pages would like to see.