Drupal Search: How indexing works

This article explores the process of taking HTML content from Drupal nodes and indexing it for the purpose of search and text retrieval at a later time. The code examples apply to Drupal 6.

Finding what to index

Indexing happens when cron.php is run, and the search module starts the whole process by invoking hook_update_index. This gives each module the chance to keep its search index needs up to date. This article follows what happens in the node module's node_update_index function.

The node_update_index function first finds out what node ids are new, have been updated, or have new comments. These node ids are then sent to _node_index_node which prepares them to be indexed by loading them, and building them in a way that is particular to the needs of indexing.

The $node->build_mode flag is set to NODE_BUILD_SEARCH_INDEX. This value, and its cousin NODE_BUILD_SEARCH_RESULT, don't play any role in Drupal core, but they could theoretically be used in hook_view and hook_nodeapi implementations to build the node differently for the benefit of the search index.

The body of the node is then built with calls to node_build_content and drupal_render. This is followed by passing the built node along to any module that implements the 'update index' operation of hook_nodeapi. This is an important moment in the process because modules like comment and taxonomy use the opportunity to add things (such as comments, CCK fields, and taxonomy terms) to the text that will get indexed for this node.
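As a rough illustration of that hook, here is a minimal sketch of a module contributing extra text at indexing time. The module name `mymodule` and the `$node->mymodule_notes` property are invented for the example, and the sketch assumes that whatever string the 'update index' operation returns is appended to the text handed to the indexer.

```php
<?php
/**
 * Hypothetical implementation of hook_nodeapi() for a module named "mymodule".
 */
function mymodule_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  switch ($op) {
    case 'update index':
      // Assumption: the returned string is appended to the node's indexed text.
      // $node->mymodule_notes is an invented property for this example.
      if (!empty($node->mymodule_notes)) {
        // Wrapping the text in a heading tag would boost its words.
        return '<h2>' . htmlspecialchars($node->mymodule_notes) . '</h2>';
      }
      break;
  }
}

// Stand-alone demonstration, outside of Drupal.
$node = new stdClass();
$node->mymodule_notes = 'Badgers & ponies';
echo mymodule_nodeapi($node, 'update index'), "\n";
```

Inside a real module you would probably use Drupal's own escaping helpers rather than htmlspecialchars; it is used here only so the snippet runs on its own.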

Text is sent into search_index

After the node is built and prepared to be indexed, it is sent to the search_index function where all the fun happens. Drupal is chiefly focused on handling HTML content, so it is therefore not surprising that search_index is best equipped to deal with HTML. Drupal's search evaluates the relative importance of any word in a document based on a number of criteria, one of which is its enclosing HTML tag. The HTML tags to which Drupal assigns value are the following:

The array keys are the tag names, and the values are weights that will affect the scoring of the words associated with these tags.

The next step in search_index is to prepare the text for tokenization by inserting spaces in between tags and texts:

Figure 3: Prepare to be tokenized.

Any tags not in the $tags variable are removed. The text is then split in such a way that the whole document becomes an array that alternates between text and tag fragments. This is an example of cunning and clever code. For example, the following text gets split into the array in figure 1.

Look carefully and you will see that the even positions in the array are occupied with text (or are empty), and the odd positions in the array have tag contents (literally whatever is in between the < and the >). With this structured array of tokens dependably in place, the tokens are iterated over and the $tag variable is used as an even/odd counter directing the processing into either the tag handling area or the text handling area.

The values in position 9-11 (a, (), and a) are artifacts from search_nodeapi() which would otherwise be adding in link text from incoming links found in other nodes. This is discussed in detail below.
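The padding, stripping, and splitting steps described above can be sketched in a few lines. This is an approximation written for this article rather than a copy of search.module: the function name `sketch_tokenize`, the regular expression, and the three-tag `$tags` subset (with placeholder weights) are all assumptions.

```php
<?php
// Illustrative subset of tags to keep; the weights are placeholders.
$tags = array('h1' => 25, 'b' => 3, 'a' => 10);

function sketch_tokenize($html, array $tags) {
  // Pad tags with spaces so words on either side do not run together.
  $html = str_replace(array('<', '>'), array(' <', '> '), $html);
  // Drop every tag that carries no weight.
  $html = strip_tags($html, '<' . implode('><', array_keys($tags)) . '>');
  // Split on tags, capturing whatever sits between < and >.
  return preg_split('/\s*<([^>]+?)>\s*/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
}

$tokens = sketch_tokenize('<h1>Taxonomy</h1><p>I like <b>ponies</b>.</p>', $tags);
foreach ($tokens as $i => $token) {
  // Even positions hold text (possibly empty), odd positions hold tags.
  echo $i, ($i % 2 ? ' tag:  ' : ' text: '), trim($token), "\n";
}
```

Because the delimiter is captured, preg_split always returns text at the even positions and tag contents at the odd ones, which is what makes the even/odd counter in the token loop reliable.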

Tag handling

Once the text is tokenized as shown in Figure 3, search_index loops over the tokens, alternating between tags and the text in the middle. Within the context of the token loop, tag handling is primarily about watching the opening and closing of tags, in a nested fashion, and ascribing the appropriately boosted score to the text in between. Following the array in Figure 3, for example, the item in position #1, h1, will get pushed onto an array called $tagstack. At the same time, the $score will be increased according to that tag's value in the $tags array.

search.module: search_index()

Scoring words

The token loop then continues, carrying the updated $score and $tagstack into the next iteration. The next item, in position #2 from Figure 3, is the text Taxonomy. In the text handling portion of the token loop, the first thing that normally happens is the text is split using the function search_index_split. search_index_split does a certain amount of processing on the text, which is covered in its own section below. What is returned is an array of individual words. This array is iterated over and the individual words are inserted into a further array that tracks each unique word for the whole document as well as its accumulated score. This array is the $results array.

In addition to the word => score array that gets built for $results, an accumulated list of all tokenized and processed words is created and stored in the form of the $accum variable. This is used later to build the search_dataset table.

search.module: search_index()

<?php
// Add word to accumulator
$accum .= $word . ' ';
?>

Figure 6: Building the accumulated string of all a document's words.

The scoring of a word within a document is very important to obtaining relevant search results later. If multiple occurrences of a word are found in a document, each will contribute to the total score for that unique word within the document. How much each occurrence contributes depends on two factors. The first is the tag within which the word is found. The second is the position of the word within the document: the lower in the document the word is found, the less it adds to the score. The amount that any individual word will add to the total score can be expressed with this formula:

<?php $total = $score * $focus; ?>

The $score is the value that was being incremented in Figure 5, and $focus is a decimal value between 0 and 1 that is a calculation of the word's position in the document. Here is how focus is calculated for each word:

search.module: search_index()

<?php
// Focus is a decaying value in terms of the amount of unique words up to this point.
// From 100 words and more, it decays, to e.g. 0.5 at 500 words and 0.3 at 1000 words.
$focus = min(1, .01 + 3.5 / (2 + count($results[0]) * .015));
?>

In Figures 7 through 10, the $results array is the array of unique words from this document and their scores. With both $score and $focus now determined, the word's score can be added to the total score for this unique word within the document:

<?php $results[0][$word] += $score * $focus; ?>

Figure 8: Accumulating the score for a word.

Thus when the word Taxonomy from position 2 arrives here, the values in Figure 4 look like this:

<?php $results[0]['taxonomy'] += 26 * 1; ?>

Figure 9: The title word "Taxonomy" gets a score of 26 and a focus of 1.

"Taxonomy" was the title of this node, thus it was in <h1> tags, and its score was therefore 26. At the beginning of the document $focus is 1, and with the formula above it remains 1 until roughly the first hundred unique words have been indexed (it first drops below 1 at 103 unique words). After that it decreases, leveling off near 0.01 for very long documents. Here is the same calculation for the word "ponies":

<?php $results[0]['ponies'] += 1 * 1; ?>

Figure 10: The word "ponies" gets a score of 1.

Note that if "ponies" had been lower in the document (after the first hundred or so unique words), the focus would have been a number less than 1 but greater than zero.
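To get a feel for how quickly focus decays, the expression from the focus calculation above can be evaluated for a few unique-word counts. The helper name `sketch_focus` is made up for this example; the arithmetic is the same as in the quoted line.

```php
<?php
// Same expression as the focus line quoted above, wrapped in a
// hypothetical helper so it can be evaluated for different counts.
function sketch_focus($unique_words) {
  return min(1, .01 + 3.5 / (2 + $unique_words * .015));
}

foreach (array(0, 50, 102, 103, 233, 500, 1000) as $n) {
  printf("%4d unique words: focus %.3f\n", $n, sketch_focus($n));
}
```

With these constants the value is capped at 1 through 102 unique words, is roughly 0.997 at 103, about 0.38 at 500, and about 0.22 at 1000. Because of the constant .01 term it flattens out toward 0.01 rather than reaching zero.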

Closing tags

After each of the words in this particular token has added its score to the accumulated total, it is time for the next token, which is, by the nature of the token array, a tag. In fact it will be a closing tag, as seen in position #3 from Figure 3. The handling of closing tags works pretty much the reverse of how opening tags were dealt with. The tag is popped off the tag stack and the score variable is reduced.

search.module: search_index()

Figure 11: When a closing tag is being handled the $score is reduced and the tag is removed from the stack.
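The interplay between opening tags, text, and closing tags is easier to see in one piece. The following is a simplified sketch, not the code from search.module: it ignores focus, assumes well-formed nesting, and uses placeholder weights (only the h1 weight of 25 is implied by the score of 26 discussed earlier).

```php
<?php
// Hypothetical weights; only h1 = 25 is implied by the score of 26 above.
$tags = array('h1' => 25, 'b' => 3);
// Alternating text/tag tokens, in the shape produced by the split step.
$tokens = array('', 'h1', 'Taxonomy', '/h1', 'I like ponies and', 'b', 'badgers', '/b', '');

$score = 1;           // Base score for untagged text.
$tagstack = array();  // Currently open tags, innermost first.
$scores = array();    // word => accumulated score (focus ignored here).
$is_tag = FALSE;      // Even/odd switch between text and tag handling.

foreach ($tokens as $value) {
  if ($is_tag) {
    if ($value[0] == '/') {
      // Closing tag: pop the stack and take that tag's weight back off.
      $score = max(1, $score - $tags[array_shift($tagstack)]);
    }
    else {
      // Opening tag: push it on the stack and boost the score.
      array_unshift($tagstack, $value);
      $score += $tags[$value];
    }
  }
  elseif ($value != '') {
    foreach (explode(' ', strtolower($value)) as $word) {
      $scores[$word] = (isset($scores[$word]) ? $scores[$word] : 0) + $score;
    }
  }
  $is_tag = !$is_tag;
}
print_r($scores);
```

In this run "taxonomy" ends up with 26, "badgers" with 4, and the untagged words with 1 each. The sketch has no recovery for mismatched or unclosed tags, which a production indexer would need.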

Link handling

So far hyperlink handling in documents has been ignored, yet this is an important part of building relevancy. When the search indexer finds a hyperlink that references another node on the same site the words in the hyperlink will be added to the list of words that identify the target node. Stated differently, if node 1 has the link <a href="node/2">Green goo</a>, the search terms "green" and "goo" will both find node/2. Link text also counts towards the node in which it is found, albeit at an 80% reduced focus value.

To accomplish this, in the tag handling phase of the token loop, anchor tags are spotted and receive extra handling. It is determined whether the link references a node on the same site. If it does, the target node's title, node id, revision id, and filter format are fetched from the database. If the filter format is cacheable, the variable $link is set to TRUE to indicate later that the text phase is dealing with an internal link, and the variable $linktitle is set to contain the title of the target node.

Then, in the text phase of the token loop, a regular expression is run to see if the text within the link is the same as the link's URL. This is the case when using the URL filter, for example, so it comes up pretty often. If the text is a URL, the $linktitle variable containing the referenced node's title replaces the text token ($value). This way there is still relevant text that can be ascribed to the current document as well as the target.

search.module: search_index()

<?php
if ($link) {
  // Check to see if the node link text is its URL. If so, we use the target node title instead.
  if (preg_match('!^https?://!i', $value)) {
    $value = $linktitle;
  }
}
?>

Figure 13: For link text that is a URL, grab the target node's title and use it as the token $value.

Finally, the $results array is extended to contain information about link texts, keyed by the target node id. The $focus is reduced by 80% for the text as it applies to the current document; as the code comment explains, links score mainly for the target.

Figure 14: Link text words are collected for later storage in their own database table.
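A small sketch of how link words might be recorded for both the target and the linking document follows. The `<a>` weight of 10, the fixed focus of 1, and the variable layout are assumptions made for illustration; the real loop also recalculates focus after every word.

```php
<?php
// $results[0] holds word => score for the current document; any other key is
// a target node id collecting the words of links that point at that node.
$results = array(0 => array());
$score = 1 + 10;  // Hypothetical: base score plus an assumed weight of 10 for <a>.
$focus = 1;       // Assumed constant here to keep the arithmetic visible.
$linknid = 2;     // Node id parsed from href="node/2".

foreach (array('green', 'goo') as $word) {
  // The link text is collected for the target node...
  $results[$linknid][] = $word;
  // ...and counts at 20% of the usual focus for the document containing the link.
  $previous = isset($results[0][$word]) ? $results[0][$word] : 0;
  $results[0][$word] = $previous + $score * $focus * 0.2;
}
print_r($results);
```

Keeping the target's words under their own key is what later allows them to be written to a separate table from the current document's word scores.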

That concludes the processing in the token loop. The entire document has now been tokenized, analyzed, scored, and saved in the $results array. Furthermore, the $accum variable now contains every word in the document in a space delimited string. Using the information stored in these two variables, the remainder of search_index is dedicated to persisting the search data to the four search related database tables: search_index, search_dataset, search_node_links, and search_total.

Storing the search data

First let's look at the four search related tables and discuss their roles in searching.

The search_index database table

The search_index table is the primary table used in searching. Here is what it will contain after processing the tokens from Figure 3:

Noteworthy are the calls to search_wipe and search_dirty. The search_wipe function is very straightforward; before starting to insert lots of words and scores into the index for a document, search_wipe gets rid of that document's previous data from the search_index, search_dataset, and search_node_links tables.

The search_dirty call is more interesting. Part of the algorithm used at search time calls for the global totals of words. These are stored in the search_total table, and the search_dirty function keeps track of which words in the table are now out of date due to the most recent document being indexed. Later, these words will be recalculated and the search_total table updated.
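One simple way to keep such a list of out-of-date words for the duration of a request is a function with a static array. The sketch below is a hypothetical stand-in named `sketch_dirty`, written to match the way search_dirty() is consumed in the search_update_totals() listing further down (words as array keys); it is not claimed to be the core implementation.

```php
<?php
// Hypothetical stand-in for the dirty-word tracker: called with a word it
// remembers it, called with no argument it returns everything remembered.
function sketch_dirty($word = NULL) {
  static $dirty = array();
  if ($word !== NULL) {
    // Using the word as the key means duplicates collapse automatically.
    $dirty[$word] = TRUE;
    return NULL;
  }
  return $dirty;
}

foreach (array('taxonomy', 'ponies', 'ponies') as $word) {
  sketch_dirty($word);
}
print_r(array_keys(sketch_dirty()));
```

Storing the words as keys keeps the list free of duplicates, so each word's total only needs to be recalculated once no matter how often it appeared during indexing.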

The search_dataset table

Here's what is in the search_dataset table after indexing the tokens in Figure 3:

Figure 17: The contents of the search_dataset table after processing the tokens in Figure 3.

As you can see, the data column of search_dataset is all of the words that were encountered in this document, in the order that they were encountered. This string is used for executing phrase queries, but it will take writing a different article on the search process to show how that works!

The search_node_links table

The tokens in Figure 3 didn't contain any links. Here's some text that links back to node 2 (the node that talks about ponies and badgers):

Figure 19: The search_nodeapi function takes care that the text from referencing links gets added to a node's text at indexing time.

Here you can see that before a node gets indexed, its text is extended with all of the link text from nodes that reference it. The referring link texts get wrapped in <a> tags to boost the score of the words appropriately.

The search_total table

Take a look at what the search_total table contains after indexing example nodes 2 and 3:

After each time that indexing occurs, all of the words that have been marked by search_dirty will be updated in the search_total table. The count value is a normalization according to Zipf's law that says a word's value to the search index is inversely proportionate to its overall frequency therein. Here is the count calculation expressed in code:

Note that more frequent words will have a lower count value, thus making them less valuable in the index. Normally words like "and", "the", and "are" will naturally have very low values due to their high frequency in the English language, whereas on most websites, words like "badgers" and "ponies" would have high counts. This limited sample gets things a bit backwards because of the author's obsession with ponies and badgers ;-)
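The count formula can be tried out in isolation. The helper name `sketch_word_count` and the per-word totals below are invented for illustration; the expression itself is the one that appears in the search_update_totals() listing later in this article.

```php
<?php
// Same expression as in search_update_totals(), in a hypothetical helper.
function sketch_word_count($total_score) {
  return log10(1 + 1 / max(1, $total_score));
}

// Invented totals: the summed score of each word across the whole index.
foreach (array('badgers' => 2, 'ponies' => 40, 'the' => 5000) as $word => $total) {
  printf("%-8s total %5d  count %.5f\n", $word, $total, sketch_word_count($total));
}
```

The value is largest (about 0.301) for a word whose total score is 1 or less and shrinks toward zero as the total grows, so words that appear rarely in the index carry the most weight.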

Word level processing

It's not easy for a word to get into the search index unscathed. There are several imposed limitations and transformations that can happen to an unsuspecting word, and these all lurk in the benign sounding search_simplify function. To keep things in context, search_simplify gets called from search_index_split, which in turn gets called during the text token phase of search_index (it's a bit like playing mental twister, isn't it?).

// With the exception of the rules above, we consider all punctuation,
// marks, spacers, etc, to be a word boundary.
$text = preg_replace('/[' . PREG_CLASS_SEARCH_EXCLUDE . ']+/u', ' ', $text);

return $text;
}
?>

Figure 22: The search_simplify function transforms text to the norms of the search index.

When a string of text comes into search_simplify, three things happen right away: its entities are decoded, it is put into lowercase, and it is subjected to the whims of hook_search_preprocess. This hook allows any module to make a transformation on the text in a way intended to improve its utility in searching. The best example of a module that uses hook_search_preprocess is the Porter Stemmer module which applies stemming rules to the text.

The next transformation, Simple CJK handling, applies only to Asian languages whose character-based writing systems require special tokenizing.

The next three transformations attempt to bring structure to the vast diversity of numerical data and punctuation that might come into the indexer. It would be nice if searching for dates would work as expected, as well as acronyms (P.E.T.A and PETA should return the same results). This has its share of side effects as well as benefits. For example, the word "low-budget" will get indexed as word "lowbudget" so neither "low" nor "budget" will return results. A possible solution to this problem would be to remove the hyphen and add "low" and "budget" separately.

After all these transformations, the text is sent back to search_index for the rest of the indexing process.
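The effect of these transformations on acronyms, hyphenated words, and dates can be reproduced with a much simplified, ASCII-only approximation. The function name `sketch_simplify` and its character classes are assumptions; the sketch skips hook_search_preprocess and CJK handling entirely and would mangle non-ASCII letters.

```php
<?php
// Simplified, ASCII-only approximation of the transformations described
// above. The character classes are stand-ins, not the ones core uses.
function sketch_simplify($text) {
  // Decode entities and lowercase.
  $text = strtolower(html_entity_decode($text, ENT_QUOTES, 'UTF-8'));
  // Join groups of digits separated only by punctuation (dates, versions).
  $text = preg_replace('/([0-9]+)[[:punct:]]+(?=[0-9])/', '\1', $text);
  // Remove dots, underscores and dashes so acronyms collapse to one word.
  $text = preg_replace('/[._-]+/', '', $text);
  // Treat everything else that is not a letter or digit as a word boundary.
  return preg_replace('/[^a-z0-9]+/', ' ', $text);
}

foreach (array('P.E.T.A', 'PETA', 'low-budget', '12/25/2009') as $input) {
  echo $input, ' => ', sketch_simplify($input), "\n";
}
```

Both spellings of the acronym come out as "peta", while "low-budget" becomes the single word "lowbudget", which is the side effect described above.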

Updating the totals

The last thing that happens in indexing is the updating of the search_total table. As explained previously, a word is considered more valuable (higher count value) the rarer it is in the index. The following code first selects the summed score of every word that was marked by search_dirty during indexing. It then calculates the inverse document frequency (IDF) of the word and updates the search_total table. Sometimes words have been deleted during indexing (via search_wipe), and housecleaning dictates these be removed from search_total as well.

search.module: search_update_totals()

<?php
/**
 * This function is called on shutdown to ensure that search_total is always
 * up to date (even if cron times out or otherwise fails).
 */
function search_update_totals() {
  // Update word IDF (Inverse Document Frequency) counts for new/changed words
  foreach (search_dirty() as $word => $dummy) {
    // Get total count
    $total = db_result(db_query("SELECT SUM(score) FROM {search_index} WHERE word = '%s'", $word));
    // Apply Zipf's law to equalize the probability distribution
    $total = log10(1 + 1/(max(1, $total)));
    db_query("UPDATE {search_total} SET count = %f WHERE word = '%s'", $total, $word);
    if (!db_affected_rows()) {
      db_query("INSERT INTO {search_total} (word, count) VALUES ('%s', %f)", $word, $total);
    }
  }
  // Find words that were deleted from search_index, but are still in
  // search_total. We use a LEFT JOIN between the two tables and keep only the
  // rows which fail to join.
  $result = db_query("SELECT t.word AS realword, i.word FROM {search_total} t LEFT JOIN {search_index} i ON t.word = i.word WHERE i.word IS NULL");
  while ($word = db_fetch_object($result)) {
    db_query("DELETE FROM {search_total} WHERE word = '%s'", $word->realword);
  }
}
?>

Figure 23: Updating the search totals to reflect the inverse document frequency value of each word in the index.

Conclusion

Drupal's text indexing has a lot to it. The complexity of the code makes it a somewhat formidable subject of study, but understanding how it all works is a sure way to build appreciation for the Drupal search framework. This article steps through the indexing process from beginning to end and serves as a guide to your further study.

Wow. This has to be the authoritative document on indexing. Very
well done.

A doc like this generates as many questions as answers. What is the
search query like for a single word? For a phrase? Also, it would be
nice to do a unified such query that includes users/profiles and other
non node data.

Thank you for your article! I am new to Drupal and still hacking
PHP. I have content that is pulled from a database using PHP. Will
Drupal index that database? For example, a page has a list (from the
database) of nonprofits and the funds they receive. A user would be
interested in searching the site for a specific nonprofit.

Drupal indexes node-based content by default. If your content isn't
inside of a node, it won't get indexed. You can use the indexer to
create your own index and search, however, and this is described very
well in Pro Drupal
Development, which I highly recommend. If your goal is to have the
stuff in your other database show up as part of the normal
q=search/node search, then you need to find a way to import it into
nodes. One way would be to have a node type that loads and renders the
content from the external database (a proxy node, so to speak). Then
you'd just need your module to implement the 'update index' $op of
hook_nodeapi.

The node_update_index function first finds out what
node ids are new, have been updated, or have new
comments.

I have been using the "update index" case in hook_nodeapi to index
related/child nodes.

The problem is that when one of the related/child nodes is updated,
Drupal doesn't know to reindex that parent node.

What is the correct way to flag a node to be reindexed? My
assumption is that I just need to set the reindex column in the
search_dataset table to the current timestamp for the parent node when
a child node is added or updated.

Thanks for the great article and everything you are doing to
improve Drupal search!

I have a problem where the indexer won't index CCK fields on nodes
on occasion (Drupal 6.9 / CCK 2.2) I skimmed through the code and got
a better grip thanks to this article. Great work!

I have a question regarding this line:

"This is followed by passing the built node along to any module
that implements the 'update index' operation of hook_nodeapi. This is
an important moment in the process because modules like comment and
taxonomy use the opportunity to add things (such as comments, cck
fields, and taxonomy terms) to the text that will get indexed for this
node."

Maybe I'm wrong (these are my first ventures into Drupal code) but
CCK doesn't add field data to $node->content in 'update index' on
nodeapi. This is done in the 'view' op of hook_nodeapi. Moreover, the
update index op isn't even implemented in CCK and when it's called, it
is just ignored by CCK.

Nonetheless, the result should be the same: CCK fields and extra
data are hooked on $node->content and the data gets indexed.