Using Mathematical Probability to Improve Linux News Aggregation

Spam filters (such as SpamBayes and bogofilter) based on the principles of Bayesian inference have changed the face of spam fighting, by bringing intelligent learning to mail clients. Can the same concepts be applied to other areas, namely Linux news aggregation?

This (LXer) is a Linux news site whose primary function is to read the countless stories out there, select the very best ones, and present them to the reader in a simple, easy-to-scan format. To accomplish this goal, I draw upon my years of experience watching, serving, and participating in the Linux community.

In the old days, I would simply browse all the usual news sites, looking for new stories with information useful to my readers. Since founding LXer, however, I have taken a different approach, centered on the philosophy that if the computer can be programmed to do something for me, it should. As a result, rather than endlessly browsing news websites all day long, I now let the LXerBot go out and hit hundreds of websites every half hour.

When it finds something that contains the keywords I've programmed it to look for, it splits out the headline, lead paragraph, author, and URL of the resource, and guesses the categories in which the story probably belongs.
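To make that step concrete, here is a minimal sketch of how such a keyword check and category guess could look. It is not the LXerBot's actual code: the keyword lists, the category names, and the process_story function are all made up for illustration.

```python
import re

# Hypothetical keyword lists; the real bot's lists are not shown in this article.
WATCH_KEYWORDS = ["linux", "open source", "gnu"]
CATEGORY_KEYWORDS = {
    "Kernel": ["kernel"],
    "Red Hat": ["red hat", "fedora"],
    "Debian": ["debian"],
}

def process_story(headline, lead, author, url):
    """Return a pending-queue record, or None if no watched keyword appears."""
    text = f"{headline} {lead}".lower()
    # Match whole words or phrases only, so "gnu" does not match "gnumeric".
    matched = any(
        re.search(r"\b" + re.escape(keyword) + r"\b", text)
        for keyword in WATCH_KEYWORDS
    )
    if not matched:
        return None
    # Guess every category whose keywords show up in the headline or lead.
    categories = [
        category
        for category, words in CATEGORY_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    return {
        "headline": headline,
        "lead": lead,
        "author": author,
        "url": url,
        "categories": categories,
        "status": "pending",
    }

record = process_story(
    "Red Hat ships new Linux kernel",
    "The update brings driver fixes.",
    "A. Writer",
    "http://example.com/story/1",
)
```

A story that matches none of the watched keywords returns None and never reaches the queue; everything else lands there with status "pending".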

Ultimately, I am left with a "pending" queue of stories, waiting for me to axe them or crown them with the glory of appearing on the LXer newswire and on webpages worldwide where our OpenContent-licensed database is syndicated.

This last step consumes 75% of the time I spend on this site, however, and I have to wonder if I am missing something. Can technology make this job easier and faster, so I can spend more time writing editorials and programming fun and exciting new features for the site? Maybe yes!

Now, I compare posting stories here to reading email. Both the inbox and the story queue are populated automatically, usually without my assistance. I wake up in the morning and find each one full of new stuff for me to check out. Either I delete an email (decline a story), or I find it interesting enough to keep around (postpone), or I even act on it with a reply or a save (approve).

From the first day of this website in January 2004 to the present moment, I have had 9,998 stories go through my pending queue. Fewer than half (4,605, or about 46%) have been deemed worthy of being posted to the newswire, while the rest are doomed to spend their lives sitting with status "declined" in my story database table.

Bayesian spam filters do a tremendous job of weeding out the garbage: things that are obviously not interesting to me are stopped before they even enter my inbox. The more mail I read, the better the filter works, because it continues to learn what kind of mail I like and what kind of mail I dislike.

Can the same approach be used for managing a web-based news site? I wake up in the morning and find 80 stories in the queue, where perhaps 20 are obvious junk, 20 are questionable, 20 more are somewhat interesting, and 20 are definitely hot stories. What if I had a Bayesian-style system that would examine each story, automatically kill the bottom 20%, and float the top 20% to the top of my queue, marked as "potentially interesting"?
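The kill-and-float idea can be sketched in a few lines, assuming each pending story already carries a score between 0 and 1 from some filter. The triage function and its cut parameter are illustrative names, not part of the site's code.

```python
def triage(scored_stories, cut=0.20):
    """Split (title, score) pairs into flagged, review, and declined lists.

    The highest-scoring `cut` fraction is flagged as potentially interesting,
    the lowest-scoring `cut` fraction is declined, and the rest is left for
    manual review.
    """
    ranked = sorted(scored_stories, key=lambda pair: pair[1], reverse=True)
    n = int(len(ranked) * cut)           # how many stories fall in each tail
    flagged = ranked[:n]
    declined = ranked[len(ranked) - n:]  # safe even when n == 0
    review = ranked[n:len(ranked) - n]
    return flagged, review, declined

# Ten made-up stories with scores 0.0, 0.1, ..., 0.9.
queue = [(f"story {i}", i / 10) for i in range(10)]
flagged, review, declined = triage(queue)
```

With an 80-story morning queue and the default cut, 16 stories would be declined automatically, 16 would float to the top, and 48 would remain for a human decision.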

As I continue using the system, it would learn more and more about my editorial preferences and what kind of stories I'm interested in. So a story about "Red Hat Linux" would get a good score, while a story mentioning "Red Hat Society" would obviously get dumped post-haste.
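For readers who want to see how such learning might work, below is a minimal naive Bayes sketch with add-one smoothing. It is an illustration under stated assumptions, not the filter running on this site: the StoryFilter class, the tokenizer, and the four training headlines are all invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class StoryFilter:
    """Tiny naive Bayes classifier over two labels: approved and declined."""

    LABELS = ("approved", "declined")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.story_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one editorial decision (label) for a story's text."""
        self.story_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def score(self, text):
        """Estimate the probability (0 to 1) that the story would be approved."""
        vocab = set(self.word_counts["approved"]) | set(self.word_counts["declined"])
        total_stories = sum(self.story_counts.values())
        log_prob = {}
        for label in self.LABELS:
            # Smoothed prior: how often this decision has been made so far.
            lp = math.log((self.story_counts[label] + 1) / (total_stories + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the product.
                lp += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_prob[label] = lp
        # Turn the two log scores into one probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_prob["declined"] - log_prob["approved"]))
        return 1.0 / (1.0 + math.exp(diff))

story_filter = StoryFilter()
story_filter.train("Red Hat Linux kernel update released", "approved")
story_filter.train("Debian Linux server security patch", "approved")
story_filter.train("Red Hat Society ladies luncheon", "declined")
story_filter.train("Celebrity fashion hats this season", "declined")

linux_score = story_filter.score("New Red Hat Linux release")
society_score = story_filter.score("Red Hat Society meeting")
```

Even with only four training stories, the "Red Hat Linux" headline scores above 0.5 and the "Red Hat Society" one below it, because "linux" has only been seen in approved stories and "society" only in declined ones. Every further approve or decline decision is just another call to train.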

For long-term opportunities, the sky is the limit. Suppose each user had their own custom Bayesian filter preferences: every time you voted on a story (whether you thought it was worth reading or not), the system would remember your vote and, using that data as the seed, compile a filtered list of news on the homepage, customized specifically for your preferences!

On and on we can go, and the possibilities are exciting and intellectually stimulating. I have already implemented Bayesian filtering for my pending stories queue, and I must brag that it is working tremendously well and has helped me out quite a bit. Let me hear your thoughts on the matter, if this interests you. Perhaps some collaboration is in order to really get this thing working beautifully.

As a final note: this story that you just finished reading, while in my pending queue, received a score of 58.8%, which means that the system thinks there is an above-average chance that I would find it interesting enough to post. :)