Surprising SEO A/B Test Results - Whiteboard Friday

The author's views are entirely his or her own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.

You can make all the tweaks and changes in the world, but how do you know they're the best choice for the site you're working on? Without data to support your hypotheses, it's hard to say. In this week's edition of Whiteboard Friday, Will Critchlow explains a bit about what A/B testing for SEO entails and describes some of the surprising results he's seen that prove you can't always trust your instinct in our industry.

Click on the whiteboard image above to open a high-resolution version in a new tab!

Video Transcription

Hi, everyone. Welcome to another British Whiteboard Friday. My name is Will Critchlow. I'm the founder and CEO at Distilled. At Distilled, one of the things that we've been working on recently is building an SEO A/B testing platform. It's called the ODN, the Optimization Delivery Network. We're now deployed on a bunch of big sites, and we've been running these SEO A/B tests for a little while. I want to tell you about some of the surprising results that we've seen.

What is SEO A/B testing?

We're going to link to some resources that will show you more about what SEO A/B testing is. But very quickly, the general principle is that you take a site section, so a bunch of pages that have a similar structure and layout and template and so forth, and you split those pages into control and variant, so a group of A pages and a group of B pages.

Then you make the change that you're hypothesizing is going to make a difference just to one of those groups of pages, and you leave the other set unchanged. Then, using your analytics data, you build a forecast of what would have happened to the variant pages if you hadn't made any changes to them, and you compare what actually happens to the forecast. Out of that you get some statistical confidence intervals, and you get to say, yes, this is an uplift, or there was no difference, or no, this hurt the performance of your site.
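As a rough illustration of that forecast-and-compare step (a simplified sketch, not how the ODN actually does it), you could model the variant group's daily organic sessions from the control group's sessions over the pre-test period, then forecast the post-change period and compare it to what actually happened. The column names and the simple linear counterfactual below are illustrative assumptions:

```python
# A minimal sketch of the general principle, not Distilled's ODN implementation.
# Assumes a DataFrame of daily organic sessions with a datetime "date" column and
# "control_sessions" / "variant_sessions" columns (names are illustrative).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def estimate_uplift(df: pd.DataFrame, change_date: str):
    cutoff = pd.Timestamp(change_date)
    pre = df[df["date"] < cutoff]    # before the change went live
    post = df[df["date"] >= cutoff]  # after the change went live

    # Fit the pre-period relationship: variant traffic as a function of control traffic.
    model = sm.OLS(pre["variant_sessions"],
                   sm.add_constant(pre["control_sessions"])).fit()

    # Forecast what the variant pages would have done with no change, driven by the
    # control pages' actual post-period traffic.
    pred = model.get_prediction(sm.add_constant(post["control_sessions"]))
    expected = pred.predicted_mean
    ci = np.asarray(pred.conf_int(alpha=0.05))  # daily 95% intervals on the forecast

    actual = post["variant_sessions"].to_numpy()
    uplift = actual.sum() / expected.sum() - 1  # relative lift vs. the counterfactual

    # Crude read of significance: is cumulative actual traffic outside the
    # cumulative forecast interval?
    significant = actual.sum() < ci[:, 0].sum() or actual.sum() > ci[:, 1].sum()
    return uplift, significant
```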

This is data that we've never really had in SEO before, because this is very different to running a controlled experiment in a kind of lab environment or on a test domain. This is in the wild, on real, actual, live websites. So let's get to the material. The first surprising result I want to talk about is based off some of the most basic advice that you've ever seen.

Result #1: Targeting higher-volume keywords can actually result in traffic drops

I've stood on stage and given this advice. I have recommended this stuff to clients. Probably you have too. You know that process where you do some keyword research and you find that there's one particular way of searching for whatever it is that you offer that has more search volume than the way that you're talking about it on your website right now, so higher search volume for a particular way of phrasing?

You make the recommendation, "Let's talk about this stuff on our website the way that people are searching for it. Let's put this kind of phrasing in our title and elsewhere on our pages." I've made those recommendations. You've probably made those recommendations. They don't always work. We've now tested this kind of process a few times and seen what are actually dramatic drops.

We saw up to 20-plus-percent drops in organic traffic after updating meta information in titles and so forth to target the more commonly-searched-for variant. Various different reasons for this. Maybe you end up with a worse click-through rate from the search results. So maybe you rank where you used to, but get a worse click-through rate. Maybe you improve your ranking for the higher volume target term and you move up a little bit, but you move down for the other one and the new one is more competitive.

So yes, you've moved up a little bit, but you're still out of the running, and so it's a net loss. Or maybe you end up ranking for fewer variations of key phrases on these pages. However it happens, you can't be certain that just putting the higher-volume keyword phrasing on your pages is going to perform better. So that's surprising result number one. Surprising result number two is possibly not that surprising, but pretty important I think.

Result #2: 30–40% of common tech audit recommendations make no difference

So this is the finding that as many as 30% or 40% of the common recommendations in a classic tech audit make no difference. You do all of this work auditing the website. You follow SEO best practices. You find a thing that, in theory, makes the website better. You go and make the change. You test it.

Nothing, flatlines. You get the same performance as the forecast, as if you had made no change. This is a big deal because it's making these kinds of recommendations that damages trust with engineers and product teams. You're constantly asking them to do stuff. They feel like it's pointless. They do all this stuff, and there's no difference. That is what burns authority with engineering teams too often.

This is one of the reasons why we built the platform: we can take our 20 recommendations and hypotheses, test them all, find the 5 or 6 that move the needle, and only go to the engineering team to build those ones. That builds so much trust in the relationship over time, and they get to work on stuff that moves the needle on the product side.

So the big deal there is really to be a bit skeptical about some of this stuff. The best practices, at the limit, probably make a difference. If everything else is equal and you make that one tiny, little tweak to the alt attribute of a particular image somewhere deep on the page, if everything else had been equal, maybe that would have made the difference.

But is it going to move you up in a competitive ranking environment? That's what we need to be skeptical about.

Result #3: Many lessons don't generalize

So surprising result number three is how many lessons do not generalize. We've seen this broadly across different sections on the same website, and even across different industries. Some of this is about the competitive dynamics of the industry.

Some of it is probably just the complexity of the ranking algorithm these days. But we see this in particular with things like this. Who's seen SEO text on a category page? It's the kind of thing where you've got all of your products, and then somebody says, "You know what? We need 200 or 250 words that mention our key phrase a bunch of times down at the bottom of the page." Sometimes, helpfully, your engineers will even put this in an SEO-text div for you.

So we see this pretty often, and we've tested removing it. We said, "You know what? No users are looking at this. We know that overstuffing the keyword on the page can be a negative ranking signal. I wonder if we'll do better if we just cut that div." So we remove it, and the first time we did it, plus 6% result. This was a good thing.

The pages are better without it. They're now ranking better. We're getting better performance. So we say, "You know what? We've learnt this lesson. You should remove this really low-quality text from the bottom of your category pages." But then we tested it on another site, and we see there's a drop, a small one admittedly, but it was helping on these particular pages.

So I think what that's telling us is that we need to be testing these recommendations every time, and we need to build testing into our core methodologies. I think this trend is only going to continue: the more complex the ranking algorithms get, the more machine learning is baked in and the less deterministic it is. And the more competitive the markets get, the narrower the gap between you and your competitors, the less stable all this stuff is, the smaller the differences will be, and the bigger the opportunity for something that works in one place to be null or negative in another.

So I hope I have inspired you to check out some SEO A/B testing. We're going to link to some of the resources that describe how you do it, how you can do it yourself, and how you can build a program around this as well as some other of our case studies and lessons that we've learnt. But I hope you enjoyed this journey on surprising results from SEO A/B tests.

#1 Sometimes you make small changes, like titles and meta descriptions, trying to optimize for new keywords or keywords with higher search volume, and the result is a dramatic drop. I think that sometimes it's better not to touch anything that works.

#2 Other times, you lose a lot of time improving some things that don't bring any tangible result.

#3 Actions that work well for one project will not necessarily work well for others.

Yeah - and a large part of my point is that if you are trying to do 100 things 1% better, you *absolutely* need to be testing, because there's a very big risk of wiping out your gains. I wrote a little more about that here.

Really gets me thinking about how A/B testing can help, especially with bigger sites where it is at times difficult to convince the client to implement changes on their site. With A/B testing we can validate our recommendations, which will also help us stop wasting time on strategies that don't really work. As you said in point #3, A/B testing will really help test different strategies on different sites, as the same strategy doesn't always work on all websites (I think that's where most SEO professionals go wrong: they assume that if something works for Site A it will work for Site B).

There are some case studies linked at the bottom - but the issue unfortunately is that what we are seeing is that even the most "obvious" of recommendations fail to move the needle sometimes. The things that work (sometimes) are the things that you expect - but they don't always work and sometimes even move you backwards (I wrote a little more about that here).

We will definitely keep on sharing stories of things that do work though.

Wow, these are some crazy findings, Will. I have also been told that you should target the more trafficked keyword by implementing it in your meta title and description. I was shocked to learn that this can actually decrease website traffic and rankings, although with your explanation I see a couple of reasons why this could have happened. Great article!

I have personally encountered numerous situations where changing 10+ things doesn't really make a noticeable impact on SEO performance, whereas there are some parameters which, when changed even by a fraction of a percent, bring a significant, noticeable impact on SEO performance.

Therefore, it's highly recommended to focus especially on the low-hanging-fruit fixes, which really matter a lot when it comes to SEO.

Some of the things that help tune up SEO performance are listed below:

1. Improving indexing and crawling rates by eliminating render-blocking elements

Thanks for the great and interesting topic. I remember you mentioned a little bit about result no. 3 at SearchLove London last year, though.

Are you saying 30-40% of common recommendations don't work even across different industries? I'm pretty sure that includes missing/duplicate descriptions, pagination tags, sitemaps, etc. We all know some of them don't change anything. But in my opinion a categorized sitemap is much more important for bigger sites and industries such as e-commerce.

So, yes - it's consistent across different sites and industries, but unfortunately it's not always the same 30-40% (in which case we could just stop doing those things and move on). This is why we are finding that we need to continue testing on as many of our projects as possible.

Great British Whiteboard Friday, and I agree that what we see in SEO best practices or in theory sometimes really doesn't hold for every website. The best methodology that works can only be found through A/B testing!

A very recent example: we made certain changes to the page length on one of our client's websites, and as a result we observed a significant improvement in session duration, and the associated event improved by 6.87% compared to the old layout.

For this we used optimise.google as the tool for the client.

Today, SEO is not just about continually optimising; it's about continually testing as well!

Nice work with your testing - worth noting that the testing I am talking about here is specifically testing search performance rather than user experience or conversion rate performance. The latter can be tested with tools like the one you mention - whereas I am talking about testing groups of pages with server-side changes. You can read more about the differences here: what is SEO split testing? Hope that helps.

Can you share some of the 'common SEO techniques' you talk about that make no difference on a regular basis? I have my assumptions, but I'd be interested in hearing if you're finding any type of pattern there!

I wrote a little more about some losing and null tests here. Hope that helps - the overriding point, though, is that it's not about things never working, but rather that you have to test, because few things work consistently everywhere / every time.

This is fascinating and I love the look of the tool! Would love to try it out sometime. It would be ideal for me, as a risk-averse SEO managing an enterprise-level site with millions of organic visits at stake through sitewide changes. Testing on a small area of the site would not normally be possible as we only technically have a few templates but millions of pages - so the way this tool works like a CDN is intriguing...

Really good stuff here, Will. I am a big advocate of A/B testing, and it can definitely be frustrating to find out that something that works on one website could actually hurt another website. Interesting results indeed; thank you for sharing your A/B test results here!

There are some great philosophical questions embedded in trying to figure out what exactly we'd mean by "one algorithm" - I mean we know that however it works, it takes into account some domain-level features and some keyword-level features, so even if it is "one" algorithm, it'll likely behave differently in different verticals / niches and with different levels of competition. We have done some preliminary work around this (see this presentation, video here) but it's hard to separate out algorithm complexity from marketplace complexity.

The good thing is that the kind of split testing I talk about in this Whiteboard Friday works regardless of these underpinning differences...

Hi Will, I just read your article about marginal losses making the true difference in the long run. Very thoughtful and interesting.

So basically, SEO is getting more complicated and time-consuming because things sometimes work and sometimes don't, depending on the circumstances.

What is the solution then?

Testing?

But you can't do that for every page and keyword, especially as a smaller blogger who wants to improve their SEO but whose life's purpose isn't to figure out the algorithm.

Do you recommend SEO best practice for low-priority pages (hoping for the best) and split testing for crucial pages? Or is there a way to automate the process, given the obvious value in split testing everything (marginal gains add up quickly when done at scale)?

Thanks for the article.

It's really interesting to me because the concept of SEO becoming a black box is new to me; it's real and I want to be prepared.

It's a real problem - and there aren't perfect easy answers. Here are some thoughts:

On smaller sites especially, there are often far bigger gains to be had from creating new content than from getting existing pages absolutely perfect - I'd skew my efforts more towards new content creation on a smaller site

Smaller sites are much more likely to see the needle moved substantially by growing in authority than larger sites (where step-changes in authority are much harder). My colleague Tom Capper wrote a bit about how links are likely a more important ranking factor for those who don't have many of them than they are in hyper-competitive spaces where many massive sites are battling to rank. This implies that there is a greater likelihood that focusing on off-site strategies will be effective while your site is smaller and less authoritative

Although you can't run split-tests at smaller scale, you can keep a closer eye on specific performance - so sometimes you can make a change and then watch the rankings and traffic for that specific page (there's a rough sketch of that kind of check after these notes). This will be subject to other factors (e.g. seasonality, competitor actions, algorithm updates etc.) but it may be better than nothing

Beyond all of that, although it's not perfect, for all the reasons I highlighted in this video, I think the best you can do is learn from the tests run on larger sites and extrapolate them to your situation as best you can
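For what it's worth, that kind of rough pre/post check can be sketched in a few lines. This is just an illustrative simplification of mine, not a controlled split test: it assumes you can export daily organic sessions for the page you changed and for a comparable page you left alone, so that site-wide swings show up in both.

```python
# A rough pre/post comparison, not a controlled split test. Assumes daily organic
# sessions exported as a pandas Series indexed by date.
import pandas as pd

def pre_post_change(sessions: pd.Series, change_date: str, window_days: int = 28) -> float:
    """Percentage change in mean daily sessions: window before vs. same-length window after."""
    cutoff = pd.Timestamp(change_date)
    before = sessions[(sessions.index >= cutoff - pd.Timedelta(days=window_days)) &
                      (sessions.index < cutoff)].mean()
    after = sessions[(sessions.index >= cutoff) &
                     (sessions.index < cutoff + pd.Timedelta(days=window_days))].mean()
    return (after / before - 1) * 100

# Run it on the page you changed and on an untouched reference page; if the reference
# page moved by a similar amount, the shift probably wasn't caused by your change.
```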

Good to see some data supporting this, as it does often feel like the case: messing with what is working can sometimes produce negative results or no results, and it's easy to get lost investing time in these improvements, resulting in demotivation.

When I saw you explaining result #3, with how removing text produced two different results, my thoughts immediately went to TF-IDF and LSI. I know this has been debunked by Rand in earlier Whiteboard sessions, so these terms might not be the best way to describe it, but I find it better than calling it keyword density.

I have personally seen this work: adding words and phrases that were used semantically by the top 3 ranking websites, for brands I've done SEO for in the European market. Meaning that 500 words could rank better than a 1,000-word page.

Yeah - I sometimes think people get the wrong end of the stick when experienced SEOs talk about how keyword density is totally not a thing. It's correct - density is absolutely not a useful metric for a search engine to use - but that doesn't mean that they aren't using any on-page keyword-related metrics, and it's obviously true that there are better and worse amounts you can mention the thing you are writing about (none is obviously bad, more is better, until you get to too much -- though I realise this isn't helpful!).

It seems to me that the reason these kinds of content change can work on one site and have the opposite result on another is about two things:

They aren't identical changes - although in both cases we are talking about adding (or removing) a block of text -- the content we are actually changing clearly differs on different sites so it's not a perfect extrapolation

There is so much non-linearity - as I said, never mentioning the thing you are talking about is unlikely to work, but it's also possible to stuff too many mentions onto a page. Somewhere in the middle is the "right" amount for any given query and given page, but it's impossible to say what that right amount is without experimentation.

The way we approach this is to define the site section in such a way that it includes not only the pages that exist at the beginning of the test, but will also contain new pages added during the test.

Then, randomly allocate initial pages and newly created pages to the control or variant groups.

This does obviously add in more statistical noise, and so somewhat reduces the power of the test (in other words, you will only be able to measure more dramatic changes), but the basic principles hold, and if you have enough traffic and enough volume of these pages being created and removed, then it still works.
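One simple way to keep that allocation stable as pages are created mid-test is to bucket deterministically on the URL, so a new page always lands in the same group every time it's requested. Here's a minimal sketch; the hashing scheme is an illustrative assumption rather than a description of how the ODN actually assigns pages:

```python
# Deterministic allocation by URL hash: pages created mid-test fall consistently into
# control or variant. The scheme below is illustrative, not the ODN's actual method.
import hashlib

def bucket(url: str, test_id: str = "category-seo-text") -> str:
    """Assign a page to 'control' or 'variant' based only on its URL and the test ID."""
    digest = hashlib.sha256(f"{test_id}:{url}".encode("utf-8")).hexdigest()
    return "variant" if int(digest, 16) % 2 == 0 else "control"

# Any page in the section - whether it existed at launch or was created later -
# gets the same answer on every request:
print(bucket("https://example.com/category/blue-widgets"))
```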

Can you go into a little detail about how you keep search engines from seeing the control page? I'm imagining a canonical tag pointing back to the variant, unless you don't let these two pages exist at one time and are comparing historic performance.

Also, to paraphrase Dr. Ian Malcolm, "SEO is the essence of chaos theory."

The control pages and variant pages are different pages (and search engines see both). This is intentionally different to how CRO / UX testing works - which is the variant with which you are probably more familiar.

I use an approach on this matter, so can you tell me whether it is the right or wrong approach? Basically, I target a few primary keywords and secondary keywords for backlinking to my client's site. Then I use long-tail keywords which contain both the primary and secondary keywords. I use all of these keywords for backlinks, and I also use the primary and secondary keywords in the client's web content at about 2% of the total content.

When it comes to A/B testing, have you found that even on the same websites, what you test on page A doesn't necessarily have the same success/failure on page B?

As you say, it's not necessarily true that what is successful on site 1 is going to be successful on site 2. It's also the case, is it not, that even on the same site, what you do on one set of category pages isn't necessarily going to succeed on a different set of category pages?

Or have you found that if you do have success with your changes, when/if you roll that out to other pages on the same site, they also see the same success?

It's very interesting stuff and obviously any opportunity to test is beneficial, I'm just intrigued to know more about subsequent challenges once a test has been considered successful enough to roll out to the entire site?

Hey Dan. There are absolutely risks that, like you say, things will be effective on some pages / in some areas of the site but not in others. With our testing, we are taking an averaged view across the whole site section (which isn't necessarily the whole of the site, and could be arbitrarily small subject to it having enough organic search traffic to measure the uplift). What that means is that we are looking at the net impact - that the benefits to some pages outweigh any downsides for other pages. It's really tough to disaggregate further in a guaranteed way - though you may be able to identify specific losing pages via e.g. rank tracking - that's something that we will have a think about some more.

In general, we have found that the rollout has gone pretty smoothly - because we randomise whether specific pages receive the control or variant treatment, and so, on site sections large enough to test on, the law of large numbers means that the result on the variant pages tends to generalise well to the initial control group. So even if it's a blend of ups and downs, we get a similar blend on the remaining 50%. Hope that makes sense...

You confirm what I was observing in my own tests: With travelmemo.com I removed all "recommended posts" at the bottom of each post (I don't get those with AMP anyway) as well as the prev / next links on all reviews. I also removed a lot of navigational internal linking. That's hundreds of links.

But no measure moved the needle whatsoever on any KPI, be it bounce rate, time on page, organic traffic, anything. I performed the changes on three different WordPress blogs that operate in different languages in different markets, and none showed any change at all.

This experience encouraged me to try much more radical changes than before.

It is essential to do A/B testing to improve the user experience in our projects. Google Optimize is working to offer new features to make the job easier. The results obtained are the best testing ground you can have, since it is the user who really decides.

Hi Agus. I just wanted to emphasise that what I am talking about here is SEO A/B testing whereas tools like Google Optimize and others are focused on testing the effect on conversion rates and user experience. Hope that helps.

Thanks for some great points, Will. I particularly appreciate the idea of building A/B testing into your operations and processes. I also thought it was a really great observation that the smarter the engines get, the narrower the gap becomes between ranking better or worse.

Result #1: Targeting higher-volume keywords can actually result in traffic drops.

Yes, it can, but if a website has a brand name and quality content then the traffic will not drop; they just have to make sure that the content is really useful for users and has some unique angle on the targeted keyword.

I'm afraid our testing shows that it's not that straightforward - I suspect in large part because it's very hard to be sure how much of the long tail you are targeting with each of two variants, and how those different sections of the long tail compare to one another in search volume.

I have tested the same case on 2 brands, and both got a great response. I added content based on queries users actually ask and didn't stuff the content; you could say almost no keyword was used, just the information. So do you think that it is still a good strategy?

It sounds as though it worked! The challenge we have found - and the reason we are doing so much split testing is that it can be really hard to tell if a tactic has been effective without the testing, simply because there are so many confounding variables.

I'm a new member of this blog. I had no idea about A/B testing; I'd heard about it, with marketers using this approach before launching a new product or something like that. Now I understand the value of A/B testing, and I hope it helps me understand market value, customer demands, product activity and more about my business. Thank you.

Really interesting how a strategy based on data can make the difference. In my case I work in affiliate marketing in some competitive niches, and this article makes a lot of sense to me: 50% of my work consists of research, and high-volume traffic alone means nothing to me.

Hah - well - no - you do need to do the audits to discover things that could make a difference, but what I'm saying here is that you then need to test those changes to make sure you are applying beneficial changes (and not deploying harmful changes - per this article I wrote recently and linked elsewhere in the thread).

Let's take the keyword 'NGO Jobs' for example. It must be very popular and very competitive worldwide, but when you use Google Keyword Planner, you realize that it has a lot of searches per month and very low competition in Cameroon, South Africa and Tanzania. On the other hand, the keywords for the individual countries, i.e. 'NGO jobs in Tanzania' or 'NGO jobs in South Africa', do not have good search volume at all. So you decide that you want to target those three countries only with the keyword 'NGO Jobs'. How best would you do that with a '.com' domain, without creating several different websites for individual countries, knowing well that Google Search Console only allows you to target one country?