More progress on soft tags

I added enough functionality to TaQuilla this week to make it fully functional as a soft-tagging extension. I’m now running it regularly in my main profile, though I am only soft tagging for Personal at the moment. I’m very happy with how well the statistics are working: the classification seems quite accurate after just a little training.

Soft tags are enabled in a setup screen that lists all of my tags, whether they are enabled, and also whether I want to show diagnostic and status columns in the thread pane.

Checking “Status Column” shows the soft-tagging status of each message in the thread pane, which can be either hard-tagged (showing as a check) or soft-tagged (showing as a sigma). Checking “Percent Column” shows a numerical column in the thread pane, which is the percent match of the message to the tag as determined by the bayes filter. “Enabled” determines whether the soft-tagging calculations are done for the particular tag. The numerical percent is the setpoint in the bayes calculation for deciding whether to apply the soft tag or not. So far, I have always just used 50 as the setpoint.
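To make the setpoint rule concrete, here is a rough sketch in Python. It is not TaQuilla's actual code (the extension itself is JavaScript); the names `SoftTagSettings` and `tag_status` are made up for illustration, and treating a score exactly equal to the setpoint as a match is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoftTagSettings:
    """Hypothetical per-tag options mirroring the setup screen."""
    enabled: bool = True      # "Enabled" checkbox
    setpoint: float = 50.0    # percent threshold for applying the soft tag


def tag_status(percent_match: float, hard_tagged: bool,
               settings: SoftTagSettings) -> Optional[str]:
    """Return 'hard', 'soft', or None for one message and one tag."""
    if hard_tagged:
        # A tag applied by the user always wins over the calculation.
        return "hard"
    # Assumption: a score at or above the setpoint counts as a match.
    if settings.enabled and percent_match >= settings.setpoint:
        return "soft"
    return None
```

With the default setpoint of 50, a message scoring 80 would show as soft-tagged, and a message scoring 30 would show nothing in the status column.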

In addition to the options, I added commands to calculate soft tags on selections and folders, similar to the junkCommands.js routines for junk management.

I’m beginning to believe though that applying soft tagging is going to be as simple as saying “yes” to enable it for a particular tag. So I’ll probably extend all of the tag definition dialogs with an “enable soft tags” column, and reserve the more detailed setup to TaQuilla Options.

So at this point, I am happy with the basic operation. But a number of issues remain, many dealing with the backend of the Mozilla mailnews code.

I am relying on database listeners in the code to detect changes to the message that determine whether I need to soft tag a new message, or train the bayes filter when the user manually tags a message. Although this mostly works, it is fairly inefficient. Also, the listeners are not fired reliably in the backend code. I may need to add back-end events that are specific to tag and trait management to increase efficiency and reliability.

The bayes tokenizer currently sees the applied keyword as a token to include in its calculation, which leads to some strange feedback effects (see bug 472005). I have a patch for that now to allow extensions to selectively enable and disable including specific headers in the bayes calculations.
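The feedback effect is easy to see in a toy tokenizer. The sketch below is illustrative Python, not the mailnews tokenizer or the patch itself; `tokenize_headers` is a hypothetical helper, and the `X-Mozilla-Keys` header is used here only as an example of where an applied keyword might show up.

```python
def tokenize_headers(headers, excluded=frozenset()):
    """Turn header values into 'name:word' tokens, skipping excluded headers.

    `excluded` holds lowercase header names that should not feed the
    bayes calculation (a hypothetical stand-in for the per-header switch).
    """
    tokens = []
    for name, value in headers.items():
        key = name.lower()
        if key in excluded:
            continue  # this header contributes no tokens at all
        for word in value.lower().split():
            tokens.append(f"{key}:{word}")
    return tokens


message = {"Subject": "Lunch plans", "X-Mozilla-Keys": "personal"}

# Without exclusion, the applied keyword itself becomes a token, so the
# tag ends up as evidence for its own classification.
with_keyword = tokenize_headers(message)

# With the keyword header excluded, only real content is tokenized.
without_keyword = tokenize_headers(message, excluded={"x-mozilla-keys"})
```

Once a message is tagged, a token like `x-mozilla-keys:personal` would be a near-perfect predictor of the tag, which is the kind of loop the header switch is meant to break.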

The bayes calculation currently returns 50% until training is done both on emails that match a trait, as well as those that do not match a trait. This makes it more difficult to start up the soft tagging. I have a patch I’m using that returns 100% if only emails that match the trait are trained. This makes it easier to implement a simple user interface for soft tags, so that users merely have to enable soft tagging for a particular tag, tag at least one message, then correct any errors that occur in the future. Without this patch, you need to convince the user to manually mark some messages as NOT matching a trait, which is counter-intuitive and difficult to explain.
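Here is a simplified sketch of that fallback in Python. The scoring part is a plain naive Bayes with add-one smoothing and equal priors, chosen only to keep the example short; it is not the calculation the mailnews backend uses. The function name, its parameters, and the choice to return 50 when only non-matching messages have been trained are all assumptions for illustration.

```python
import math


def trait_percent(tokens, match_counts, nonmatch_counts,
                  n_match, n_nonmatch, patched=True):
    """Percent match of a message to a trait.

    match_counts / nonmatch_counts: token -> number of trained messages
    containing it; n_match / n_nonmatch: trained message totals.
    """
    if n_nonmatch == 0:
        if n_match > 0 and patched:
            return 100.0   # patched behavior: positive-only training
        return 50.0        # unpatched behavior: no opinion yet
    if n_match == 0:
        return 50.0        # assumption: negative-only training stays neutral

    # Toy naive Bayes in log-odds space (add-one smoothing, equal priors).
    log_odds = 0.0
    for token in tokens:
        p_match = (match_counts.get(token, 0) + 1) / (n_match + 2)
        p_nonmatch = (nonmatch_counts.get(token, 0) + 1) / (n_nonmatch + 2)
        log_odds += math.log(p_match) - math.log(p_nonmatch)
    return 100.0 / (1.0 + math.exp(-log_odds))
```

With `patched=True`, tagging a single message is enough for later messages to score 100 and pick up the soft tag, after which the user only has to correct mistakes. With `patched=False`, the same call returns 50 until some message has been explicitly trained as not matching.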

As I try to use tags in multiple dimensions (that is, one dimension for status, and another for category) the limitations of the existing tagging interface in TB become more apparent. I can fix that in an extension though. As a start, I will probably add a column that shows applied tags – but only using the first character of the tags to save space.
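The compact column could be as simple as the following Python sketch. `tag_initials` is a hypothetical name, and keeping the tags in the order given and the original letter case are assumptions.

```python
def tag_initials(tags):
    """Build the compact column text from the first character of each tag."""
    # Empty tag names are skipped so they cannot raise an IndexError.
    return "".join(tag[0] for tag in tags if tag)


column_text = tag_initials(["Personal", "ToDo", "Work"])
```

A message tagged Personal, ToDo, and Work would then show just three characters in the column, one per tag.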

One of the more interesting applications of soft tagging would be to automatically categorize interesting blogs, to assist the reader in knowing which blogs to read. This would be particularly interesting in a “planet” style of aggregated blog. Although currently the mailnews backend does not apply bayesian calculations to feeds, I investigated this some and believe I could add that through an extension. I’ll probably do that as part of TaQuilla eventually.

“Traits”? The use of the term “features” in Bayesian classification and classification theory in general has a long history; moving away from that seems misdirected. However, it seems that you are using “trait” as what is normally called “category”. In spam classification the presence of a particular “token” is what is called a “feature”.

I don’t come at this with a personal “long history” in Bayesian classification, and the term “feature” is pretty overloaded in the computer world. Even in the example you gave, “feature” has one meaning in the Bayes world, and a different one in the spam world. I was trying to select a more neutral term. But had I known that “feature” was a well-understood term in the Bayesian nomenclature, I probably would have left it.