When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user's query.

Choose the PDF version to read, and when the paper opens, scroll down to page 6, to the boxed section titled "trawling the web for emerging cyber-communities."

In that section, the authors describe a pattern of linking that is pretty interesting - communities can often interlink amongst themselves in ways that tend to ignore a lot of the rest of the Web, and don't point to the "Authorities" and "Hubs" that you might talk about in something like HITS. These community-based interlinkings are referred to as "bipartite graphs." I think that section of the paper explains them pretty well.
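To make that linking pattern a little more concrete, here is a minimal sketch of spotting a small bipartite "core": a group of pages that all link to the same set of other pages. It isn't the algorithm from the paper or the patent. The function name `find_bipartite_cores`, the example domains, and the size limits are all made up for illustration, and the brute-force search would only work on a tiny graph.

```python
from itertools import combinations

def find_bipartite_cores(outlinks, min_fans=2, min_centers=2):
    """Return (fans, centers) pairs where every fan links to every center.

    outlinks maps a page to the list of pages it links to (hypothetical data).
    """
    cores = []
    # Try every group of `min_fans` pages and see which targets they share.
    for fans in combinations(sorted(outlinks), min_fans):
        shared = set.intersection(*(set(outlinks[f]) for f in fans))
        shared -= set(fans)  # ignore links among the fans themselves
        if len(shared) >= min_centers:
            cores.append((fans, tuple(sorted(shared))))
    return cores

# Hypothetical link data: two pages point at the same two targets.
links = {
    "fan-a.example": ["c1.example", "c2.example", "c3.example"],
    "fan-b.example": ["c1.example", "c2.example"],
    "fan-c.example": ["other.example"],
}
cores = find_bipartite_cores(links)
```

With this toy data, the only core found pairs fan-a.example and fan-b.example with the two pages they both link to.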

It's possible that not only do communities tend to link like that to each other, but link-based spam may also exhibit some of the same behavior - with interlinking amongst pages, and perhaps a number of links pointed into a cluster of those pages from outside, from places like guestbooks and blog comments (the patent was originally filed in 2003 - I wonder whether a more recent filing would mention blog comments alongside guestbook comments).

The patent tells us that clusters can be identified other ways, too. Once clusters are created, it looks at links pointing into the clusters as well as signals of "manipulation" within documents, such as machine generated text, heavy keyword repetition in meta tags, redirects, historical data about content and links and site ownership, and other information on pages and about pages.

Clusters that are identified can be examined to see if they have manipulative signals pointing towards them, and if documents within them show signs of being manipulated. Clusters that reach a certain threshold might be manually reviewed. Clusters that reach a higher threshold might instead be marked as manipulative.

A page within one of these clusters, or a subset of the cluster, that doesn't have any manipulative signals on it, but is clearly related to pages that do, may be treated in the same way as those other pages.
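Here is a rough sketch of how the two thresholds and the "guilt by association" step might fit together. The patent doesn't give numbers or a scoring formula, so the threshold values, the averaging of page scores, and the name `classify_cluster` are all my own assumptions.

```python
# Hypothetical thresholds; the patent does not state actual values.
REVIEW_THRESHOLD = 0.4
MANIPULATED_THRESHOLD = 0.7

def classify_cluster(page_scores):
    """Label every page in a cluster from the cluster's average signal score.

    page_scores maps a page to a manipulation score between 0 and 1
    (an assumed scale, for illustration only).
    """
    cluster_score = sum(page_scores.values()) / len(page_scores)
    if cluster_score >= MANIPULATED_THRESHOLD:
        label = "manipulated"
    elif cluster_score >= REVIEW_THRESHOLD:
        label = "manual_review"
    else:
        label = "clean"
    # Every page gets the cluster's label, even one with no signals of its own.
    return {page: label for page in page_scores}

spammy = classify_cluster({"a": 1.0, "b": 1.0, "c": 0.9, "innocent": 0.0})
borderline = classify_cluster({"x": 0.6, "y": 0.5, "z": 0.4})
```

In the first cluster, the page called "innocent" has a score of zero, but it ends up labeled "manipulated" along with its neighbors. The second cluster only reaches the lower threshold, so its pages would go to manual review.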

That treatment may involve having rankings reduced, being removed from a search index, or losing the ability to pass on something like PageRank.

I found the patent interesting because it seems to describe some things that we may have been seeing from Google for the past few years, and it presents some interesting approaches to fighting spam.

There have been some other papers and patents over the past couple of years that also provide some interesting approaches to fighting spam, from places like the AIRWeb (Adversarial Information Retrieval on the Web) workshops, which I don't think we've talked about much. You can find papers from those by following the "proceedings" links from each. There are some pretty good ones in there.

On the ads, those kinds of interstitial advertisements are a pain, aren't they? From a usability perspective, they break your flow while you're trying to accomplish a task.

Imagine walking into a store, and before you can get to what you wanted to buy, a salesclerk comes up to you and starts telling you what you should buy. It's not a sales tactic that I'm really pleased with.

I'd call that a manipulative signal.

Forbes also doesn't seem to link out too much to sites on the web besides other Forbes properties, and advertisements.

They have a nice little link farm at the bottom of their pages, with links that go to places like forbestravel.com, forbesautos.com, and sites like investopia.com that have similar link farms listed at the bottom of their pages. So, they have one of these "bipartite graphs" going on too, where they all sort of link to each other, but rarely link out to anywhere else.

Forbes is one of the sites that did see a drop in their Toolbar PageRank recently.

When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user's query.

What if the user was searching for pornography in the first place? That quote reads to me as if they are saying that any search result that leads to pornography is manipulated, even if the user was searching for it in the first place.

Towards the bottom of the patent, they explain the kinds of things that they might do to pages within a cluster if they think that it has a high manipulation signal, such as lowering rankings of those pages, or removing them from the index, or not letting them pass along PageRank.

But, check the sentences that I highlighted below:

A manipulation indicator can be associated with every document in a cluster or subset of the cluster determined to be manipulated.

This manipulation indicator can then be used during the retrieval and ranking phase by the search engine 120 in a variety of ways.

For example, a manipulation indicator can be used in a ranking function to lower the rank of a document.

Alternatively, a manipulation indicator can be used as an indication that the document should be removed entirely from the search results.

Additionally, a manipulation indicator can be used to treat the document differently, such as not using the document in a hyperlink structure-based ranking calculation, such as PageRank™ from Google, Inc.

Further, a manipulation indicator can be used depending on the query. For example, if the query relates to pornography, the manipulation indicator may not be used.

Manipulated indicators can be used in a variety of other ways during the retrieval and ranking processes.

So, someone actually looking for pornography might be able to find pages that trick you into arriving at pages that deliver pornography to you.
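The options in that quoted passage could be sketched roughly like this. It's only an illustration of the idea, not how the search engine in the patent works: the function `rank_results`, the `mode` and `penalty` parameters, the `query_is_adult` flag, and the example URLs are all assumptions of mine.

```python
def rank_results(results, query_is_adult=False, mode="demote", penalty=0.5):
    """Order results, applying a manipulation indicator in one of two ways.

    results is a list of dicts with "url", "score", and an optional
    "manipulated" flag (a hypothetical data shape).
    """
    ranked = []
    for doc in results:
        score = doc["score"]
        # The indicator is ignored when the query itself relates to pornography.
        flagged = doc.get("manipulated", False) and not query_is_adult
        if flagged and mode == "remove":
            continue  # drop the document from the results entirely
        if flagged and mode == "demote":
            score *= penalty  # lower the document's rank
        ranked.append((doc["url"], score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

results = [
    {"url": "spam.example", "score": 0.9, "manipulated": True},
    {"url": "ok.example", "score": 0.6},
]
demoted = [url for url, _ in rank_results(results)]
removed = [url for url, _ in rank_results(results, mode="remove")]
adult_query = [url for url, _ in rank_results(results, query_is_adult=True)]
```

In this toy example, the flagged page drops below the clean one when demoted, disappears when removed, and keeps its top spot when the query is treated as adult, which is the case the comment above is asking about.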