Wednesday, August 09, 2006

Web spam, AIRWeb, and SIGIR

I am giving a very short talk at AIRWeb (a workshop on web spam) at SIGIR.

I thought some readers of this weblog might be interested in web spam but unable to make it to this workshop. It might be fun to discuss the topic a bit here.

I wanted to use my short talk to bring up three topics for discussion:

First, I wanted to talk about the scope of weblog junk and spam, especially if "junk and spam" weblogs are loosely defined as "any weblog not of general interest."

The primary data points on which I will focus are that Technorati reported 19.6M weblogs in Oct 2005, but the dominant feed reader, Bloglines, reported that only 1.4M of those weblogs have any subscribers on Bloglines and a mere 37k have twenty or more subscribers.

This seems to suggest that over 95% of weblogs, possibly over 99%, are not of general interest. The quality of the long tail of weblogs may be much worse than previously described.

Second, I wanted to bring up the profit motive behind spam. Specifically, I will mention that scale attracts spam -- that the tipping point for attracting spam seems to be when the mainstream pours in -- and that this has implications for many community-driven sites that currently only have an early adopter audience.

Third, I wanted to discuss how "winner takes all" encourages spam. When spam succeeds in getting the top slot, everyone sees the spam. It is like winning the jackpot.

If different people saw different search results -- perhaps using personalization based on history to generate individualized relevance ranks -- this winner takes all effect should fade and the incentive to spam decline.
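As an illustration of how that might work, here is a toy re-ranking sketch in Python. All document names, topics, and scores are invented; the point is only that blending a shared relevance score with each user's own history gives different users different top results, so a spam page can no longer "win the jackpot" for everyone at once.

```python
def personalized_rank(results, user_history, blend=0.5):
    """Re-rank (doc, topic, base_score) tuples for one user.

    user_history maps topic -> fraction of the user's past clicks.
    The final score blends the shared base score with topic affinity.
    """
    def score(item):
        doc, topic, base = item
        affinity = user_history.get(topic, 0.0)
        return (1 - blend) * base + blend * affinity
    return sorted(results, key=score, reverse=True)

results = [("spammy-page", "pills", 0.9),   # spam that won the global rank
           ("good-article", "search", 0.8),
           ("niche-blog", "search", 0.6)]

alice = {"search": 0.9}   # Alice mostly reads about search
bob = {"sports": 0.8}     # Bob's history boosts none of these results

print([d for d, _, _ in personalized_rank(results, alice)][0])  # good-article
print([d for d, _, _ in personalized_rank(results, bob)][0])    # spammy-page
```

Alice's history demotes the spam below the results she actually cares about, while Bob still sees the globally top-ranked page; the spammer's reward is no longer everyone's attention.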

What do you think? I would enjoy getting a discussion on web spam going in the comments!

Update: The talk went well. Let me briefly summarize some of the comments I received at the workshop on what I said above.

On the Technorati vs. Bloglines numbers, a few people correctly pointed out that this could be seen as recall vs. precision and that, for some applications, it may be important to list every single weblog, even if that weblog has no readers. At least a couple of others disputed whether weblogs without readers were important at all. One mentioned that reader counts could be faked, which might give spammers a way to attack this type of filter.

On the "winner takes all" and personalization as a potential solution, some seemed skeptical that there was enough variation in individual perceptions of relevance to make a big enough impact. Others seemed intrigued by the possibility of using user behavior and recommender systems to filter out spam.

I enjoyed talking about web spam in such a prestigious group! Great fun!

Update: See also the paper, "Adversarial Information Retrieval on the Web" (PDF), which gives a good summary of the discussions at the workshop.

8 comments:

Seems to me that Paul Graham's "A Plan for Spam" made the same argument about how personalization would save us from e-mail spam. Yet the bad guys found ways around Bayesian analysis: first by exploiting particular algorithms, and second by overwhelmingly increasing the amount of spam being sent to make up for the decreased percentage that gets through. I doubt personalization on the web will be any more successful in the long run - not that it's a bad idea to try!

Personalization is nice, but let's think about what differentiates a real user from a spammer. You can't just say "my interest is in Computer Science, so non-Computer Science content is spam for me." Nah! It won't work, and indeed, it does not. I think that Paul Graham's plan for spam was a brilliant idea and it worked, to a point, but it is not enough and it won't be enough.

The easy way to differentiate spam from real content is by association. I have a network of friends and these people never send me spam, by definition. In turn they know people who never send them spam and so on. Go 6 degrees and you pretty much cover all the people I would trust to send me email. I think that trying to differentiate spam from non-spam using the content itself is a battle that cannot be won, in the end.

Let's take most collaborative filtering algorithms, such as Slope One, PageRank and so on. They are built by aggregating expressed interests from unrelated users. This tends to favor spammers who have no friends but lots of time on their hands - at least as long as expressing your interest in something does not cost too much money.
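Since Slope One is named above, here is a minimal sketch of it (the users and ratings are invented) that makes the point concrete: predictions come from averaging rating *differences* across all users who rated both items, with no notion of who knows whom, which is exactly the aggregation-across-strangers property being criticized.

```python
from collections import defaultdict

def slope_one_predict(ratings, user, item):
    """Minimal Slope One: ratings is {user: {item: rating}}.

    For each other item j, compute the average difference
    (rating of `item` - rating of j) over all users who rated both,
    then combine those deviations with `user`'s own ratings,
    weighted by how many users supported each deviation.
    """
    diffs, counts = defaultdict(float), defaultdict(int)
    for r in ratings.values():
        if item in r:
            for j, v in r.items():
                if j != item:
                    diffs[j] += r[item] - v
                    counts[j] += 1
    num = den = 0.0
    for j, v in ratings[user].items():
        if counts[j]:
            num += (v + diffs[j] / counts[j]) * counts[j]
            den += counts[j]
    return num / den if den else None

ratings = {"u1": {"a": 4, "b": 3},
           "u2": {"a": 4, "b": 2},
           "u3": {"b": 2}}
print(slope_one_predict(ratings, "u3", "a"))  # 3.5
```

Note that u1 and u2 are total strangers to u3, yet their ratings fully determine u3's prediction; a spammer running many fake accounts could shift those average deviations just as easily.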

What we need is to enforce the concept of a trust network. There has been lots and lots of academic work on the issue. However, most implementations are centralized (unlike the web or email) and so can't scale enough to compete with decentralized solutions.

In some respect, we are still waiting for the Tim Berners-Lee of collaborative filtering, if you ask me. The academic solutions are there, but nobody has quite gotten it right on a large scale.

Initiatives like FOAF had the right idea. They failed because people focused on solutions, without trying to solve problems.

Findory is fine and all that, and I figure you have a financial interest in such a centralized site. But let's think this through...

Why can't we filter the content through our network of trusted friends? If I ever start getting spam through my connection with Greg, I remove Greg from my network, and voilà! This ought to work. To a point, it works informally in the blogosphere. It is the old push versus pull debate.

Implementing this, at least the initial implementation, is stupidly simple. We just need to markup, using XML, who we trust, and the stuff we like (and dislike)... then, through our first order or second order network, we could filter all content. It can even be partially centralized, as long as everyone agrees to use the same exchange formats.
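A minimal sketch of that filter in Python, with a made-up trust graph standing in for the XML markup: expand the trust network out to second-order connections, then accept content only from authors inside it.

```python
def trusted_within(trust, me, max_degree=2):
    """Return everyone reachable from `me` in at most `max_degree` hops
    through the trust graph (a dict mapping person -> set of trusted people)."""
    frontier, seen = {me}, {me}
    for _ in range(max_degree):
        frontier = {t for person in frontier
                    for t in trust.get(person, set())} - seen
        seen |= frontier
    return seen - {me}

# Hypothetical trust graph: I trust Greg, who trusts Daniel.
trust = {"me": {"greg"}, "greg": {"daniel"}, "daniel": {"greg"}}
circle = trusted_within(trust, "me")

posts = [("greg", "AIRWeb recap"), ("spammer", "cheap pills")]
print([text for author, text in posts if author in circle])  # ['AIRWeb recap']
```

The spammer, whom nobody in the network trusts, is filtered without ever inspecting the content itself - and if Greg ever relays spam, removing him from the graph cuts off everything downstream of him, the "remove Greg and voilà" step described above.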

In other words, if you include "who you know and who you trust" in the personalization of the content you want to browse, then I think personalization ought to work as a solution against spammers... but if you only take into account my past history and my age, you will fail.

'general interest' sounds like MSM to me. Is the age of the long tail over already? Definitions of spam should rather focus on the intention behind the content. Was it intended for human consumption, or for tricking search engines and/or people for financial gain?

Just a quick clarification. When I was talking about personalization helping to reduce spam, I mostly meant that different people would see different search results, so the winner takes all effect would be reduced. I did not mean that all "non-Computer Science content is spam" for someone who mostly is interested in computer science.

I think we do want to surface interesting content from the tail. However, if you are ranking by readership, junk and spam is much worse as you dig into the tail. Separating the good from the bad is tricky when there is so much bad.

I wouldn't go quite as far as to say that ignoring blogs without any readers is like focusing on the MSM. Bloglines claimed 1.4M weblogs had at least one reader in their system. That's quite a pool of content. Even the smaller set of 37k weblogs with more than 20 readers represents a large pool of bloggy goodness.

"This seems to suggest that over 95% of weblogs, possibly over 99%, are not of general interest. The quality of the long tail of weblogs may be much worse than previously described."

The assumption that the rest of the blogs are not of general interest seems wrong to me. The readership count may be a metric for evaluating a blog's popularity, but it should not be the rule of thumb. There are quite a few interesting reads out there waiting to be explored. Most of us look at the blogosphere through a pin-hole consisting of the widely popular blogs. This keeps us from exploring the long tail of the blogosphere, which is as interesting as the 37k blogs.

More about the pinhole concept can be read here: http://semanticvoid.com/blog/2006/10/07/sheeple-of-the-blogosphere/

Daniel said:"The easy way to differentiate spam from real content is by association. I have a network of friends and these people never send me spam, by definition. In turn they know people who never send them spam and so on. Go 6 degrees and you pretty much cover all the people I would trust to send me email. I think that trying to differentiate spam from non-spam using the content itself is a battle that cannot be won, in the end."

This sounds a lot like what Tailrank appears to be doing, since they're only growing their index, at least in one dimension, according to links from previously trusted blogs/posts. Since good blogs don't link to spam, Tailrank is able to avoid polluting its index.