Search Engine Algorithm Research & Testing

November 20th, 2011

Performing tests on search engine algorithms can help you stay current on what the search engines are looking for and help guide your SEO strategies. The hardest part is devising tests that isolate the effect of individual ranking factors.

Because change is a factor in itself, it’s nearly impossible to test a single ranking factor independently.

Here are some suggestions and warnings on conducting a search engine algorithm test.

Defining the Metrics

Let’s start by looking at some metrics you could use:

Ranking: Combines too many factors, so any result should be reported with its probable level of certainty. Pinpointing the effect of an individual factor is harder than proving the effectiveness of an entire SEO activity.

Indexation (cache:… or site:… ): Being indexed doesn’t mean being able to rank for anything, but you can see if a search engine follows certain links.

PageRank: Rough metric with slow update frequency, but you can get an idea of how PageRank (PR) is distributed through your website.

Information from Google Webmaster Tools: Error handling, auto canonicalization, and backlink reporting give many insights into the ranking factors used by Google. They are, however, limited by the level of openness Google is willing to provide.

Penalties: Only show when you’ve gone too far. They could even show signs of manual intervention. The triggers for manual checks are harder to reverse engineer.

When you use ranking as a metric for a specific change between two timeframes, you won’t get a clear answer on the importance and performance of that change. Why?

Your competitors usually aren’t standing still – and could be doing things that you can’t monitor.

The search engine could change the weighting of specific ranking factors.

Some of your changes need to ripen over time.

Bottom line: Be careful when making any conclusions based on a ranking increase/decrease.
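One way to soften (not solve) this problem is to compare the pages you changed against similar pages you left alone over the same timeframe. The sketch below is only an illustration: the function names and all positions are made up, and it assumes both groups are exposed to the same competitor and algorithm shifts, which is rarely fully true.

```python
from statistics import mean

def mean_rank_change(before, after):
    """Average change in position; a negative value means pages moved up."""
    return mean(a - b for b, a in zip(before, after))

def adjusted_effect(test_before, test_after, control_before, control_after):
    """Rank change of the changed pages minus the change of untouched pages.

    Assumption: the untouched (control) pages absorb the same outside
    movement as the changed pages.
    """
    return (mean_rank_change(test_before, test_after)
            - mean_rank_change(control_before, control_after))

# Hypothetical positions for three changed pages and three untouched pages.
test_before, test_after = [12, 8, 15], [9, 6, 11]
control_before, control_after = [10, 7, 20], [9, 6, 19]

effect = adjusted_effect(test_before, test_after, control_before, control_after)
print(effect)  # negative: the changed pages gained more than the untouched ones
```

Even an adjusted number like this says nothing about why the difference occurred, so the warning above still applies.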

As for penalties as a metric, well, getting banned from the index is a sure signal! Google will often communicate the main reason for such a penalty. The main reason, however, isn’t the only reason. The threshold is determined by many factors combined.

Metric Difference Between … ?

To reverse engineer specific ranking factors, it’s important to know the scope of their effect. Slightly increasing search term focus on a single page affects that single page, but does little for all pages in that theme.

Getting a relevant link partner affects multiple pages through your internal links and might even have a domain-wide effect. These assumptions are testable, but ranking is regrettably the only metric at your disposal.

Affected Broadness

Certain ranking factors are domain based, some page based, and some affect only certain types of queries:

Domain based: Domains provide patterns on a larger scale and hold a certain amount of trust.

Query based: Is ranking influenced only for a specific result type (for instance, the remaining normal results when universal results are shown) or a specific query? Additional filtering, for instance with negative matches, also plays a role: “Loans” vs. “loans -credit site:com inurl:loan” doesn’t only remove specific results, but also changes the sequence of the remaining ones.
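To check whether a filtered query really reshuffles the remaining results, you can compare the order of the URLs that both result lists share. This is a minimal sketch: the function name and the URLs are hypothetical, and you would have to collect the two result lists yourself.

```python
def shared_order_preserved(results_a, results_b):
    """Return True if the URLs present in both lists appear in the same order."""
    shared = set(results_a) & set(results_b)
    order_a = [url for url in results_a if url in shared]
    order_b = [url for url in results_b if url in shared]
    return order_a == order_b

# Hypothetical result lists for the plain query and the filtered query.
plain = ["a.example/loans", "b.example/credit", "c.example/loan", "d.example/loans"]
filtered = ["c.example/loan", "a.example/loans", "d.example/loans"]

preserved = shared_order_preserved(plain, filtered)
print(preserved)  # False here: the filter changed the sequence, not just the set
```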

Measuring the ranking effect between pages within your site will give some indication of the broadness, but you should monitor the competition on all these queries separately. If they are taking a similar action across the full scope of queries, ranking provides some indication.

Google Panda Update & Ranking Factors

The Panda Update provides great opportunities to test this approach. Signals so far show a ranking difference within specific query types of specific domains.

Experts speculate that the duplicate content percentages for individual pages across a single domain trigger a domain-wide dampening for a factor that mainly influences ranking for certain query types. Because Panda consists of a combination of multiple new viewpoints on quality sites, it’s hard to separate the individual ranking factors involved.

Another commonality between affected websites seems to be the narrowness of the collection of their incoming links. Among sites with the same low-quality content, those with a full and natural link profile (with multiple types of links and apparent methods of initial acquisition) seem to be left alone, and some even increased in ranking when others dropped.

Collecting an enormous amount of data from hundreds of affected sites (providing free consultancy in exchange for their openness) gave me just some common denominators. The amount of data apparently only made it harder to generalize on the effects of Panda, and I currently see a continuing shift in ranking.

Threshold

Some factors gradually decrease in efficiency and have some kind of ideal situation, which is somewhere in the middle. Deviating from the ideal on either side makes it less than perfect. This goes for “keyword density,” for example.

The detection of spam either makes certain efforts less effective or damages a site entirely. The certainty with which spam is seen as intentional creates a sliding scale. If you increase a spammy activity over time, you either hit a threshold after which you see a huge drop, or you see a gradual decrease in ranking.
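If you want to track a factor like this during a test, you first need a consistent way to measure it. Here is a minimal sketch that assumes one simple definition of keyword density (occurrences of a single word divided by the total word count). There is no single agreed definition, the function name is my own, and the ideal value is unknown.

```python
import re

def keyword_density(text, keyword):
    """Share of words in `text` equal to `keyword` (single word, case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

# Made-up sample copy: 11 words, 3 of them "loans".
sample = "Cheap loans for everyone. Compare loans and apply for loans online."
density = keyword_density(sample, "loans")
print(round(density, 3))
```

Measuring the same way before and after each change at least tells you which side of your previous value you moved to.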

Changes Outside the Scope of Your Test

Besides the test you’re running, the world probably isn’t standing still. If a test requires time (and most do), you need to take some things into account.

Time is a factor: For example, the ripening of links makes them more valuable. Quick link acquisition has a better short-term effect; links from news articles sink deeper and deeper in that site.

Proof stacks: You might be providing more incentives for a more detailed scan of your activities.

Other websites increase/decrease in ranking: If ranking is the selected metric, this is the biggest weakness of the test.

Algorithms change: Some bigger algorithm changes are announced, but shifts in thresholds, the ideal situation, and the weight of certain ranking factors and signals happen continuously.

Successful Tests

So far, this all sounds pretty gloomy. While testing is nearly impossible, here’s an example of a successful test.

PageRank Distribution and nofollow

When nofollow was introduced, some SEO enthusiasts wanted to use it for so-called “PageRank sculpting” (the intentional steering of link value through a website). To verify their assumptions, the following test was devised (and not broadly shared until today).

We used Toolbar PageRank as a metric, because there aren’t many factors besides link value in its calculation. We did the same test on more than 200 websites from different owners in different industries. With various launch dates, the tests eventually ran simultaneously for more than six months, and this is what they showed.

The starting point was a single page somewhere in those 200 websites, all varying between PageRank 3 and 5. That page linked to only three other pages, which showed the same PageRank as their siblings. After that, we created three different situations.

Situation A: Only one link on the page. The linked page showed a PageRank similar to its parent’s, or just one number below it.

Situation B: 100 links on the page. The linked pages showed a PageRank at least one number lower than their parent’s, and often two.

Situation C: 100 links on the page, 1 normal and 99 with a nofollow attribute.

Was the PageRank equal to that of a page in situation A or B? A would indicate that you save PageRank by using nofollow on unimportant links; B would mean you can’t (by this metric).

The answer was … B
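The two possible outcomes can be written down with the simplified textbook PageRank model, in which a page splits its value evenly over its links. This is a sketch under assumptions: it is not Google’s actual implementation, the damping factor of 0.85 comes from the original PageRank paper, the function name is hypothetical, and Toolbar PageRank is a rounded number rather than this raw value.

```python
DAMPING = 0.85  # assumption: damping factor from the original PageRank paper

def value_per_followed_link(page_value, followed, nofollowed, nofollow_counts):
    """Value passed through each followed link in a simplified PageRank model.

    nofollow_counts=True: nofollowed links still take a share (which is lost).
    nofollow_counts=False: nofollowed links are ignored, so sculpting would work.
    """
    total = followed + nofollowed if nofollow_counts else followed
    if followed == 0 or total == 0:
        return 0.0
    return DAMPING * page_value / total

situation_a = value_per_followed_link(1.0, 1, 0, nofollow_counts=True)
situation_b = value_per_followed_link(1.0, 100, 0, nofollow_counts=True)

# Situation C under each hypothesis: 1 normal link and 99 nofollowed links.
c_if_sculpting_works = value_per_followed_link(1.0, 1, 99, nofollow_counts=False)
c_if_sculpting_fails = value_per_followed_link(1.0, 1, 99, nofollow_counts=True)

print(situation_a, situation_b, c_if_sculpting_works, c_if_sculpting_fails)
```

In this model, the test result matches the second hypothesis: the single normal link in situation C gets the same share as one of the 100 links in situation B.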

Monitoring Many Potential Factors Across Many Sites

One expensive mistake I made in the past was helping create a system that monitored ranking and many potential signals for more than 20,000 websites (which should have been a much greater number) and many of their pages. With each ranking change (especially shifts around big algorithm updates) we tried to let the system calculate what movers and shakers had in common.

After it showed many false positives and negatives without clear patterns, the research project was soon abandoned by many of the SEO specialists involved.

Maybe the idea of testing this way wasn’t too far-fetched, but the willingness to do it on an incredibly huge scale involving statisticians and supercomputers isn’t something the collaborative SEM community seems to have. Besides, who can gain when everybody is involved?