How a Search Engine May Identify Undesirable Web Pages By Analyzing Inlinks

The term “undesirable web pages” is used in a Yahoo patent application published today to refer to pages that rank highly in search results because of links pointed at them solely to increase their rankings for specific queries, even though those pages may not be very relevant for the query terms in question.

“Undesirable” appears to mean pages that Yahoo doesn’t want ranking well in its search results.

So, what might Yahoo (and possibly other search engines) look at to determine whether a page is undesirable based upon the links it sees to that page?

Analyzing Inlinks for Manipulation

When search engines show pages to searchers in response to a query, those pages are placed in an order intended to reflect a combination of relevance and importance (or quality).

One way that search engines determine how important a page might be is based upon the number and importance of pages linking to that page. Search engines may also pay attention to the text used in a link, often referred to as “anchor text,” while determining how relevant a page might be for a certain keyword term or phrase.

But there’s a problem in relying too much upon links pointing to pages to determine how relevant and important a page might be. By giving links such value, the search engines have turned links into a commodity that may determine how highly a page might rank in search results.

Many links pointing to pages are created not to bring direct traffic to a page, or to refer to a page in a specific context, but solely to improve that page’s ranking. The result can be “artificially promoted web pages” ranking highly in search results even though they may not be very relevant to a searcher’s query.

In response to this problem, search engines may weigh the value of some links differently. The patent application from Yahoo describes how that search engine might distinguish between links pointing to a page, also known as “inlinks” to that page, based upon a statistical analysis of information about those links. The filing describes a system that includes:

a search engine operative to index a set of incoming links (“inlinks”) which reference the resource,

a log module coupled with the search engine and configured to store log data associated with the set of inlinks,

a partitioning module coupled with log module and operative to partition the set of inlinks into a plurality of groups of inlinks based on at least one partitioning scheme,

a statistics module coupled with the partitioning module and operative to compute a statistic associated with the inlinks within each of the plurality of groups of inlinks, and

a computation module coupled with the statistics module and operative to process the computed statistic associated with the inlinks of each of the plurality of groups of inlinks and compute a metric associated with the set of inlinks where the metric indicates a level of uniformity of a distribution of values of the respective computed statistics among the plurality of groups of inlinks, and where the search engine places a list of search results, generated in response to a search query, in a pattern based on the metric.

In analyzing links pointing to a page to try to identify the artificial manipulation of links, the search engine may look at information associated with those links to try to see if there are any unnatural patterns associated with those links.

The search engine might look at information such as:

An internet protocol (IP) address segment of the source of each inlink

The domain name of the source of each inlink

The top-level domain name associated with each inlink, such as “.com” or “.edu,” or a country-code top-level domain

The written language used in each inlink (e.g., English, French, or German)

A geographic region associated with the source of each inlink

A network routing group associated with the source of each inlink

The anchor text (i.e., the clickable text) contained within each inlink
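The partition-then-measure-uniformity idea from the claims can be sketched in a few lines of Python. This is purely illustrative: the field names and sample data are invented, and the patent doesn’t specify which statistic is computed, though normalized entropy is one natural way to express “level of uniformity” of a distribution across groups (a value near 0 means the inlinks are concentrated in one group; a value near 1 means they are spread evenly):

```python
import math
from collections import Counter

def uniformity_metric(inlinks, key):
    """Partition inlinks by an attribute and return the normalized
    Shannon entropy of the group sizes: 0.0 = all inlinks fall into
    one group, 1.0 = inlinks are spread perfectly evenly."""
    groups = Counter(key(link) for link in inlinks)
    if len(groups) < 2:
        return 0.0
    total = sum(groups.values())
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in groups.values())
    return entropy / math.log2(len(groups))

# Hypothetical inlink records; the attributes mirror the list above.
inlinks = [
    {"domain": "example-blog.com", "tld": "com", "anchor": "best widgets"},
    {"domain": "foo.edu",          "tld": "edu", "anchor": "widget research"},
    {"domain": "bar.org",          "tld": "org", "anchor": "widgets"},
    {"domain": "spam1.info",       "tld": "info", "anchor": "best widgets"},
    {"domain": "spam2.info",       "tld": "info", "anchor": "best widgets"},
]

tld_uniformity = uniformity_metric(inlinks, key=lambda l: l["tld"])
anchor_uniformity = uniformity_metric(inlinks, key=lambda l: l["anchor"])
```

In this toy data the anchor-text distribution is more concentrated (three of five links share the phrase “best widgets”) than the TLD distribution, so its uniformity score comes out lower, the kind of skew a search engine might treat as a manipulation signal.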

The patent filing tells us that a quality or importance ranking score might be given to pages based upon link-based ranking approaches such as PageRank, a system that gives more weight to newer links and less weight to older ones, or some other algorithm.
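The original PageRank formulation has been published, so a bare-bones sketch is easy to show. This power-iteration version captures only the basic idea (each page shares its score among the pages it links to, with a damping factor); it does not include the age-weighting variant the filing mentions:

```python
def pagerank(links, damping=0.85, iters=50):
    """Minimal power-iteration PageRank over a dict mapping
    each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                # Share this page's rank among its outlinks.
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank evenly everywhere.
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

# Tiny three-page web: "c" is linked to by both "a" and "b".
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Page “c” ends up with the highest score here because it receives links from two pages, which is the intuition the filing builds on before asking whether those links were earned or manufactured.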

A statistical analysis of information about links pointing to a page over time may result in a demotion of ranking of that page if abnormal patterns appear that seem to indicate that links to a page have been manipulated solely to increase the ranking of that page.

Conclusion

I’ve written about a number of other patents and whitepapers from the major search engines that can be found in my category on Web Spam, but this Yahoo patent filing provides details about some specific kinds of information that Yahoo might analyze that many of those papers or patent filings haven’t mentioned.

Chances are that Google and Bing may perform some similar types of analysis when looking at links pointing to pages.

Comments

And once again we see how clueless the search engineering community can be when it comes to Web spam. The application was filed in October 2008 and yet it provides a wholly inaccurate description of a “link farm” in section [0015]. What they are describing is a “link network” (a network of sites that are used to boost the importance of sites outside the network). A link farm is a group or network of sites that all link to each other.

Link farms don’t have “sole purposes” — by definition they cannot have a sole purpose.

One of the key elements of this patent application appears to be their entropic analysis of backlink behavior — how Web documents accrue links (or link out) over time. They may be using quarterly data points to analyze linking trends.
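That sort of time-bucketed look at link accrual might be sketched as follows. The quarterly grouping follows the comment above, but everything else, including the burst threshold `factor`, is an invented illustration rather than anything stated in the filing:

```python
from collections import Counter
from datetime import date

def quarterly_link_counts(link_dates):
    """Bucket inlink acquisition dates into (year, quarter) groups."""
    return Counter((d.year, (d.month - 1) // 3 + 1) for d in link_dates)

def looks_bursty(counts, factor=5.0):
    """Flag a profile where one quarter holds far more new links than
    the average quarter -- a possible sign of a sudden link-buying
    campaign. The factor of 5 is an arbitrary illustrative threshold."""
    if not counts:
        return False
    avg = sum(counts.values()) / len(counts)
    return max(counts.values()) > factor * avg

# Steady accrual: one new link per quarter over two years.
steady = [date(2009, m, 1) for m in (1, 4, 7, 10)] + \
         [date(2010, m, 1) for m in (1, 4, 7, 10)]
# Same profile plus a sudden pile of links in late 2010.
bursty = steady + [date(2010, 12, d) for d in range(1, 31)]
```

A real system would presumably look at many such distributions at once, but even this toy check separates the steady profile from the bursty one.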

Of course, even if they were doing that 2-1/2 years ago, they could have replaced this method with something else.

But I have often noted that people should be looking at how their linking resources (and the destinations being promoted) behave because the search engines look at Website behavior.

Definitely have to take the top-level domain into consideration. You have to assume that the relevant sites want those TLDs and will acquire them. As for .edu and .gov, those can’t be registered publicly anyway.

The definition of a link farm in the patent was pretty much off base, and the authors of the patent seemed to have a very narrow and negative view of SEO, even though Yahoo practices SEO on its own sites, participates in search marketing conferences, and has even published at least one patent on how it would automate aspects of search engine optimization.

I agree with you that the patent does make a strong point about people paying more care and attention to links pointing to their pages, and to how and where they may acquire links for their sites.

It would be interesting to see some examples of the kinds of patterns that have been found by the people who wrote this patent, including examples about the growth of links from specific top level domains, but they didn’t include any. There are limitations on the ability of people to register .edu and .gov sites. Unfortunately, a lot of web spam and manipulative links end up on old message boards and other places on .edu domains that aren’t monitored or controlled very carefully.

I suspect that this kind of analysis has been going on for a while. The search engines do collect an incredible amount of data about the Web. The greatest difficulty might be in identifying what information might be most helpful in finding manipulative linking. The kinds of information that we were told about in the patent filing may be some of the things that they look at, but I would guess that they’ve explored many others as well. Hopefully it will make a difference.

These are the signs that are maddening about SEO. You’re really guessing anyway about how the algorithms manifest themselves so you build up a strategy and the SEs flip the script on you just when you have a usable plan. Maddening.

Interesting article to say the least. I know that search engines are actively seeking ways to reduce “link spam.” But I am a little insecure about trusting patents, at least when it comes to recipes. Think of it like this: if Coca-Cola had patented their formula for Coke, do you think someone would be making a copy by now? Aren’t algorithms much like recipes? I’m not much on patent law and that sort of thing, so I could be way off base.

It’s interesting to me that the authors of this patent would take a negative view of search engine optimization. I would imagine that with the thousands of SEOs out there who are constantly coming up with innovative ways of pushing clients to top SERP positions, we are likely forcing search engineers to constantly tweak and innovate their algorithms and therefore making their ‘product’ better and better. Then again, given the fact that the patent is outlining ways to curb the value of inorganic linking, I suppose it’s unsurprising that its authors would take a slightly negative view of SEOs…

Regardless of how search engines may change the algorithms they use, their ultimate goals don’t waver much – they want to try to find the best sites that they can that match the intent behind a search.

Build a strategy around that, and around links that not only help the ranking of a site, but also provide meaningful traffic, and you become much more immune to shifts in search engine algorithms.

Search engines primarily file patents to protect their intellectual property, but they don’t patent every idea that they may have, and they don’t provide all of the details of something they implement that they may have patented. If you don’t trust the patents, at least trust the clear signals behind ones like this one – search engines don’t like links that are created solely to manipulate search engine rankings.

The patent filing describes a way of patching holes in an approach to assigning value to pages that has some flaws. A link based ranking system is prone to being attacked in ways that probably should have been anticipated by its inventors, but weren’t. I’m not advocating the use of those methods of abusing such a ranking system, and I like many of the ideas that they suggest for identifying links that are created solely to manipulate such a system.

I have a query. I see in my WordPress spam queue that I am getting a lot of trackbacks from spammy sites. Will that affect my site? Will Google see that as blog comment spam? Will those inlinks affect my ranking?

I get a lot of trackbacks from spammy sites, too. Sometimes they aren’t even real trackbacks, but are formatted like them. Sometimes they are trackbacks where there was an original site that linked to me, and they scraped that site, sending me a trackback when they published it. I sometimes don’t even get a trackback from the original.

I try very hard not to publish those when I see them. I’ll usually copy a snippet of text from one of those, put it in quotation marks, and search for it to see if I can find out where it originally came from.

If it’s a trackback from a site that’s pretty much just spam, I usually just delete it.

This patent is from Yahoo rather than from Google, but the concern you have is real to a degree. Work on attracting and acquiring links from legitimate non-spam sources. The better a job you do of that, the less likely any links pointing to your pages can hurt you.

How do trackbacks work? If we delete those trackbacks they will be gone from our page(s), but they will still be there on the spammy site, right? Won’t Google consider that mass blog comment spam? Because they still act as backlinks to our site even though they’re spam.

And by the way you should really add a “Subscribe to comments” plugin to notify us for followup comments.

Trackbacks work by having a website that links to yours pinging your website to let it know that someone linked to it. Their website needs to be set up to send out that ping, and yours needs to be set up to accept the trackback, and leave a notification of it on your site. We can moderate trackbacks and delete them before they even get published. The link still exists on the other site. Since it hasn’t been published on our site, we aren’t linking back.
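The ping described above follows the published Six Apart Trackback specification: the linking site sends an HTTP POST of a few form-encoded fields to the target page’s trackback URL, and the target replies with a small XML document. A minimal sketch (the URLs and field values here are placeholders, not real endpoints):

```python
import urllib.parse
import urllib.request

def build_trackback_payload(source_url, title, excerpt, blog_name):
    """Form-encode the fields a Trackback ping carries, per the
    Six Apart Trackback specification."""
    return urllib.parse.urlencode({
        "url": source_url,        # the page that contains the link
        "title": title,
        "excerpt": excerpt,
        "blog_name": blog_name,
    }).encode("utf-8")

def send_trackback(trackback_url, payload):
    """POST the ping to the target's trackback URL; the XML reply
    contains <error>0</error> when the trackback is accepted."""
    req = urllib.request.Request(
        trackback_url,
        data=payload,
        headers={"Content-Type":
                 "application/x-www-form-urlencoded; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Build (but don't send) a ping for a hypothetical post.
payload = build_trackback_payload(
    "https://example.com/my-post",   # placeholder source URL
    "My Post", "A short excerpt", "Example Blog")
```

The receiving end is free to moderate: as noted above, a blog can hold pings in a queue and delete spammy ones before they are ever published, so the spammer’s link never appears on the target site.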

I had a subscribe to comments plugin, but removed it because I didn’t like the way that it sent out emails from my site. I’ve been reading about how email filtering programs work, and am concerned about how they might treat emails from my site if I use a plugin like that.