Spreading activation is a neat technique for building search engines that return accurate results for a query even when there is no exact keyword match. The engine works by building a data structure called a context graph, which is a giant network of document and term nodes. All document nodes are connected to the terms that occur in that document; similarly, every term node is connected to all of the document nodes that term occurs in. We search the graph by starting at a query node and distributing a set amount of energy to its neighbor nodes. Then we recurse, diminishing the energy at each stage, until this spreading energy falls below a given threshold. Each node keeps track of accumulated energy, and this serves as our measure of relevance.

This means that documents that have many words in common will appear similar to the search engine. Likewise, words that occur together in many documents will be perceived as semantically related. Especially with larger, coherent document collections, the search engine can be quite effective at recognizing synonyms and finding useful relationships between documents. You can read a full description of the algorithm at http://www.nitle.org/papers/Contextual_Network_Graphs.pdf.

The search engine gives expanded recall (relevant results even when there is no keyword match) without incurring the kind of computational and patent issues posed by latent semantic indexing (LSI). The technique used here was originally described in a 1981 dissertation by Scott Preece.

When true, tells the module to use compiled C internals. This reduces memory requirements by about 60%, but actually runs a little slower than the pure Perl version. Don't bother to turn it on unless you have a huge graph. Default is pure Perl.

using the compiled version makes it impossible to store the graph to disk.

Load documents from a directory. Takes two arguments, a directory path and an optional parsing subroutine. If the parsing subroutine is passed an argument, it will use it to extract term tokens from the file. By default, the file is split on whitespace and stripped of numbers and punctuation.

Opens and loads a term-document matrix (TDM) file to initialize the graph. The TDM encodes information about term-to-document links. This is a legacy method mainly for the convenience of the module author. For notes on the proper file format, see the README file.

You can tell the graph to cut off searches after a certain distance from the query node. This can speed up searches on very large graphs, and has little adverse effect, especially if you are interested in just the first few search results. Set this value to undef to restore the default (10^8).

Adds a document from a file. By default, uses the PATH provided as the document identifier, and parses the file by splitting on whitespace. If a fancier title, or more elegant parsing behavior is desired, pass in named arguments as indicated. NAME can be any string, CODE should be a reference to a subroutine that takes one argument (the contents of the file) and returns an array of tokens, or a hash in the form TOKEN => COUNT, or a reference to the same.

Add documents to the graph in bulk. Takes as an argument a hash whose keys are document identifiers, and values are references to hashes in the form { WORD1 => COUNT, WORD2 => COUNT...} This method is faster than adding in documents one by one if you have auto_rebalance turned on.

Given a list of nodes, returns a hash of nearest nodes with relevance values, in the format NODE => RELEVANCE, for all nodes above the threshold value. (You probably want one of search, find_similar, or mixed_search instead).

Iterates through the graph, calculating edge weights and normalizing around nodes. This method is automatically called every time a document is added, removed, or updated, unless you turn the option off with auto_reweight(0). When adding a lot of docs, this can be time consuming, so either set auto_reweight to off or use the bulk_add method to add lots of docs at once

Dumps internal state in term-document matrix (TDM) format, which looks like this:

A B C B C B C
A B C B C B C
A B C B C B C

Where each row represents a document, A is the number of terms in the document, B is the term node and C is the edge weight between the doc node and B. Mostly used as a legacy format by the module author. Doc and term nodes are printed in ASCII-betical sorted order, zero-based indexing. Up to you to keep track of the ID => title mappings, neener-neener! Use doc_list and term_list to get an equivalently sorted list

Searches the graph for all of the words in @QUERY. Use find_similar if you want to do a document similarity instead, or mixed_search if you want to search on any combination of words and documents. Returns a pair of hashrefs: the first a reference to a hash of docs and relevance values, the second to a hash of words and relevance values.

Given an array of document identifiers, performs a similarity search and returns a pair of hashrefs. First hashref is to a hash of docs and relevance values, second is to a hash of words and relevance values.

Given a hashref in the form: { docs => [ 'Title 1', 'Title 2' ], terms => ['buffalo', 'fox' ], } } Runs a combined search on the terms and documents provided, and returns a pair of hashrefs. The first hashref is to a hash of docs and relevance values, second is to a hash of words and relevance values.