New tag feature enabled

After rebuilding my site to run mostly off static HTML files, with a healthy
dose of server-side includes, I have been experimenting with how to add other
features. For example, I used Google’s Custom Search Engine to handle
searches, and I added a feature to suggest similar pages, custom Atom feeds,
and more.

I was also interested in re-enabling tags, but wasn’t sure of the best way to
do it. Ideally, the system would be intelligent — suggesting tags for new
content, providing a mechanism to analyze the tag framework as it grows to
find redundancy, and ensuring that pages are tagged appropriately.

I’ve started working on a system. To back up for a moment, though, I want to
describe the similar pages feature. It creates a “vector” for each page in
the site, with one entry per word based on the number of times that word
occurs in the document. The words are stemmed, certain common words are
removed, and everything is lower-cased. The vector is then scaled to a length
of 1, which is why three words that each occur once end up with a weight of
1/sqrt(3). For example, a document consisting of “The Cow jumped over the
Moon” would become:

cow 0.577350269189626
moon 0.577350269189626
jump 0.577350269189626

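Because the comparison only ever multiplies matching entries together, a word
that is missing from a document contributes nothing to the result. That means
you don’t need to store a “0” for each word that occurs only in other
documents, which keeps each page’s vector small and makes things much easier.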
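
Here is a minimal sketch of how such a vector could be built. It is not the
code that runs the site: the stop word list, the suffix-stripping toy_stem()
helper, and the page_vector() name are all assumptions made for illustration,
and a real stemmer would do a much better job.

```python
import math
import re
from collections import Counter

# Assumed stop list; the real list of "common words" is not shown in the post.
STOP_WORDS = {"the", "over", "a", "an", "and", "of", "to", "in"}


def toy_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def page_vector(text):
    """Return a sparse, unit-length term vector as a dict of word -> weight."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(toy_stem(w) for w in words if w not in STOP_WORDS)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    # Only words that actually occur get an entry; zeros are never stored.
    return {term: c / norm for term, c in counts.items()}


vec = page_vector("The Cow jumped over the Moon")
# vec has the keys "cow", "jump" and "moon", each weighted 1/sqrt(3)
```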

To compare two documents, one simply calculates the cosine of the angle
between the two respective vectors. The closer the vectors align with each
other, the closer the cosine is to 1 (a perfect match).
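
As a rough sketch, the cosine can be computed directly on the sparse vectors.
The cosine() function name and the dict-based storage are assumptions for
illustration. Only words present in both documents add to the dot product,
which is why the missing zeros never matter.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = {"cow": 1, "moon": 1, "jump": 1}
doc_b = {"cow": 1, "moon": 1}
doc_c = {"ice": 1, "cream": 1}

same = cosine(doc_a, doc_a)       # identical documents
partial = cosine(doc_a, doc_b)    # two of three words shared
unrelated = cosine(doc_a, doc_c)  # no words in common
```

If the vectors are already scaled to a length of 1, the two norms are both 1
and the cosine reduces to the plain dot product.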

The way I suggest similar pages is to compare each document with every other
document and calculate their similarities. When a page is shown, this list is
scanned, and matches are found that exceed a certain threshold. I also cap the
list at the top 5 choices so a visitor isn’t overwhelmed with 12 similar
pages.
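
The lookup described above might be sketched like this. The threshold of 0.2,
the similar_pages() name, and the sample page names are assumptions; only the
cap of 5 comes from the description. The vectors are assumed to be unit
length already, so the dot product is the cosine.

```python
def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def similar_pages(vectors, threshold=0.2, limit=5):
    """Map each page to its most similar pages, best match first.

    vectors: dict of page name -> unit-length sparse vector.
    Every page is compared with every other page, so this is quadratic
    in the number of pages and best run ahead of time.
    """
    result = {}
    for page, vec in vectors.items():
        scored = [
            (dot(vec, other), name)
            for name, other in vectors.items()
            if name != page
        ]
        matches = [(s, n) for s, n in scored if s > threshold]
        matches.sort(key=lambda pair: (-pair[0], pair[1]))
        result[page] = [name for _, name in matches[:limit]]
    return result


pages = {
    "cow.html": {"cow": 1.0},
    "rhyme.html": {"cow": 0.6, "moon": 0.8},
    "dessert.html": {"ice": 1.0},
}
related = similar_pages(pages)
```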

Back to tags. I have a routine I can run locally that compares pages to the
overall vector for the pages contained within each tag. In other words, I can
find pages that look similar to pages already in a tag, with the idea that
perhaps this new page belongs there as well. This can also suggest tags for a
new document that has yet to be posted (thanks, TextMate!).
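
A sketch of what such a routine could look like follows. How the “overall
vector” for a tag is built is not spelled out here, so this version assumes it
is the sum of the member pages’ vectors, rescaled to a length of 1. The
function names, the 0.3 threshold, and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tag_vector(page_vectors):
    """Assumed definition: sum the member pages' vectors, then normalize."""
    total = defaultdict(float)
    for vec in page_vectors:
        for term, weight in vec.items():
            total[term] += weight
    return normalize(total)


def suggest_tags(new_vec, tags, vectors, threshold=0.3):
    """Return (tag, score) pairs whose tag vector resembles new_vec."""
    new_vec = normalize(new_vec)
    scores = []
    for tag, members in tags.items():
        overall = tag_vector(vectors[name] for name in members)
        score = sum(w * overall.get(t, 0.0) for t, w in new_vec.items())
        if score > threshold:
            scores.append((tag, score))
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


vectors = {
    "sundae.html": {"ice": 0.6, "cream": 0.8},
    "gelato.html": {"ice": 1.0},
    "barn.html": {"cow": 1.0},
}
tags = {"dessert": ["sundae.html", "gelato.html"], "farm": ["barn.html"]}

draft = {"ice": 2, "cream": 1}  # raw word counts for an unposted page
suggestions = suggest_tags(draft, tags, vectors)
```

Because the draft only needs a vector, not a published page, the same call
works for a document that has yet to be posted.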

I still would like to create a tool to visualize the relationships between
pages based on similarities and tags to more formally look for a structure to
the pages and the relationships between them. This might help suggest new tags
that aren’t being used — for example, if there is a cluster of closely
related pages about “ice cream”, that would likely be a tag that should be
added.