Audit WP Exposes Potential SEO Issues With WP Engine Hosting

This week Audit WP launched a new SEO consultancy, headed up by SEO strategist Jacob King. The firm’s first order of business was to publish a post exposing what King perceives to be serious SEO and privacy concerns for customers of managed WordPress host WP Engine.

I spoke with WP Engine founder Jason Cohen and asked if they consider this to be a privacy concern for their customers. Cohen said that it doesn’t constitute a privacy issue, given that the indexed domains are already public. “Those emails were literally already published on the Internet,” he said. “That is, the reason those were available to be scraped, is that they were already public, and scrape-able, by any person or robot.”

The issue with WP Engine’s staging sites being indexed was corrected some time ago; however, developers sometimes create other versions of their sites on subdomains without fully understanding that these sites are discoverable via a Google search. Sites hosted with WP Engine are especially easy to find, given that they all end in *.wpengine.com. When not properly hidden from search engines, sites in progress are exposed.

In one particularly high-profile example, the staging site for Harvard Law Review was public. If you’re curious about what the next iteration might look like, King posted a screenshot in his post. Developers of that site have since made it private, but unfortunately, while a new WordPress install was being set up, someone got in and created an explicit website, ostensibly to prove a point. I captured a screenshot before it was taken down.

This illustrates the very real danger of leaving your subdomains public while they are still works in progress. Ultimately, hiding these sites from Google is the developer’s responsibility, but Cohen said WP Engine will be acting on some of the post’s suggestions, as he noted in a comment on it:

However, your suggestion that it’s better to 301 that domain is still *also* very valid. Also, not all search engines are aware of this scenario, and thus one of the take-aways we have from your article is that we should auto-force robots.txt for the XYZ.wpengine.com domains just as we do for the staging domains

The Issue of Duplicate Content

You don’t have to be very technical to know that search engines regard duplicate content as a cardinal sin and will swiftly penalize you for it. Although WP Engine forces a “deny robots everything” in the robots.txt file on staging, not all search engines will respect this.
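For reference, a “deny robots everything” robots.txt amounts to a blanket disallow for every user agent. A minimal sketch of such a file (the exact contents WP Engine serves on staging may differ):

```
User-agent: *
Disallow: /
```

The `Disallow: /` rule tells compliant crawlers not to fetch any path on the domain — though, as noted above, not every search engine respects it.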

However, Jason Cohen is hanging his hat on personal assurances he received from Matt Cutts regarding the practice:

Google maintains a set of root domains that they know are companies that do exactly what we and many other hosting companies do. Included in that list are WordPress.com, SquareSpace, and us. When they detect “duplicate content” on subdomains from that list, they know that’s not actually duplicate content. You can see it in Google Search, but it’s not counted against you.

We have had a dialog directly with Matt Cutts on this point, so this is not conjecture, but fact.

King stated that he doesn’t trust any information from Matt Cutts to be “fact.” He replied that Audit WP doesn’t think it’s wise to let Google determine which site is the duplicate, which is why the post offers step-by-step instructions to help WP Engine customers set up the proper subdomain redirects to prevent indexing.

A Long Standing Issue with Subdomains

The post on Audit WP sent WP Engine into damage-control mode, and the company launched a massive Twitter campaign of replies disputing the article’s main points. So far, it has not tweeted any acknowledgment of the suggestions it plans to implement.

I spoke with Audit WP founder Jacob King to find out his motivation for publicizing his discovery of 1.5 million indexed subdomains.

“Well, we were a bit floored upon the initial discovery,” King said. “As someone with web scraping experience, the ease of access to WP Engine user names, emails, and other information troubled me greatly.”

He also felt that it was important to go public with the information given his previous interactions with WP Engine support:

I was hosting my personal blog there so I have a good amount of experience with the system. One big issue was my actual monthly human traffic being massively different from the traffic stats WP Engine was recording. A very large difference, analytics showing ~25k monthly visits, yet WP engine was showing well over 100k. I brought it up on Facebook and one day on Twitter as well, I was told it’s from Bot traffic and search engine spiders.

When King commented that he had blocked all common bots and asked for further explanation of his bot traffic being more than 5 times his human visitors, he received no reply.

“We never discussed anything specific to the indexation issue,” King said. “I went with my gut which told me they wouldn’t give me the time of day if we didn’t make it public.”

However, Audit WP is not the first to publicize problems with WP Engine’s subdomains. Although WordPress SEO expert Joost de Valk chimed in on the post to condemn King’s public handling of the situation, he himself tweeted last April about what he perceived to be an SEO mistake by WP Engine.

WP Engine co-founder Ben Metcalfe responded in a blog post at that time, clarifying that clients can redirect traffic arriving at the WP Engine subdomain to the primary domain via the .htaccess file or the client portal. He also clarified why this is not done by default: “We don’t do this by default because it would then prevent us accessing the site via the sub-domain during a support call/etc, should the DNS on the primary domain fail.”
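For customers who take the .htaccess route Metcalfe describes, the redirect might look something like the following sketch, assuming Apache with mod_rewrite enabled (abc.wpengine.com and example.com are placeholder names, not details from the post):

```
# Sketch: permanently redirect requests arriving on the WP Engine
# subdomain to the primary domain. Hostnames are placeholders.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^abc\.wpengine\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

Because the condition matches only the subdomain’s hostname, the rule is inert for visitors who arrive via the primary domain; the [R=301,L] flags issue a permanent redirect and stop further rule processing.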

Joost de Valk replied:

If it were truly that common I wouldn’t have tweeted it. It’s an issue, it’s something you’re aware of because there’s a setting for it, but it’s something you could & should prevent from happening altogether. That’s what the “managed” in managed hosting stands for in my eyes.

WP Site Care founder Ryan Sullivan reiterated that this issue has been a customer concern for quite some time:

I’ve brought this issue up to WP Engine on several different occasions through several different channels. In fact, it was Rob who pointed the issue out to me early last year, and I’ve never once been given a clear answer on plans to solve it. And with the number of sites we host on WP Engine (I give them a lot of money every month), this is a legitimate concern that should have been addressed a long time ago.

The post on Audit WP brought attention to an issue that was originally discussed in April of 2013 without a satisfactory response from WP Engine. Metcalfe’s post addressing the issue concluded by stating that it’s a common practice for all web hosts that don’t use VirtualHosts.

Forthcoming Changes at WP Engine

When I spoke with Jason Cohen, he confirmed that the post has spurred them to make some changes at WP Engine.

“First, we’re making a change to /robots.txt when served from our canonical domain (i.e. the ABC.wpengine.com style domains) so that spiders won’t index or crawl those domains.” Although it is WP Engine’s official position that there is no SEO penalty for what they are already doing, the company has decided that it’s a good idea to make this change anyway.

“Second, the point was made that our customers ought to 301-redirect their ABC.wpengine.com domains to their proper domains,” Cohen said. “We agree that’s a best-practice. While that’s trivial to do in our User Portal, we do NOT do a good job TELLING customers that this is a good idea.”

WP Engine is also considering a more push-button approach to suggest that developers make their staging sites private via a plugin of some kind. They are in the process of creating a “Best Practices” document which they will include in their public knowledgebase. “We are going to proactively link to it inside our User Portal for all customers to see,” Cohen said. The article will include suggestions with screenshots so that customers will be better-informed.

Cohen said that if they make any sweeping product changes, they will email customers. However, they do not have plans to proactively email the owners of the 1.5 million subdomains that have been indexed by Google, given that some of those are intentionally public.

If your site is on the list compiled by Audit WP and your subdomain isn’t meant to be public, you may want to double-check your robots.txt file and set up the proper 301 redirects. Ultimately, no matter how many conveniences a managed host provides its customers, the responsibility for any subdomains falls to the site owner.
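As a quick sanity check, Python’s standard-library robots.txt parser can confirm that a staging subdomain’s rules actually block crawlers. A minimal sketch (example.wpengine.com is a hypothetical subdomain, not one taken from Audit WP’s list):

```python
from urllib.robotparser import RobotFileParser

def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text disallows `agent` from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)

# A "deny robots everything" file like the one WP Engine forces on staging:
DENY_ALL = "User-agent: *\nDisallow: /\n"

print(is_blocked(DENY_ALL, "http://example.wpengine.com/"))  # → True
```

In practice you would point the parser at the live file (via `RobotFileParser.set_url()` and `read()`) rather than supply the text inline.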


4 Comments

My biggest concern is how many of those subdomains with fresh, ready to install copies of WordPress exist. I could potentially scrape based on those subdomains that are indexed, locate what I’m guessing would be hundreds to thousands of that same scenario, and go ballistic.

Something certainly needs to be changed here. A staging area is good, but exposing all of those is a serious potential security risk.

Thanks for the follow-up Sarah. I had some reservations about rolling with Jacob’s post because I knew there would be an inevitable fallout of some sort. What really pushed us to move forward with it was the fact that pushing it out would mean helping a ton of people. To me, that meant the benefits outweighed any potential backlash.

No matter what Jason, or Joost, or anyone says about “fallacies”, the truth is that the issues Jacob pointed out are real. Just like anything in SEO, the severity can be debated but in my experience duplicate content can really screw you, especially if you’re a relatively new site. A huge site might be able to afford leaking out a bit of their authority here and there. On the other hand, someone just getting off the ground needs every advantage they can get. Hanging on to every ounce of domain equity can really help in that department and sometimes can be a make-or-break factor.

I’m really happy to hear that WP Engine is going to be working on implementing some changes to correct this stuff. The primary goal of this post was to help WP Engine’s customers take care of this issue themselves in case WP Engine decided it wasn’t enough of a problem to change their system. The secondary reason was to push WP Engine to make some real core-level changes to their server setup and it appears that we’ve managed to do that. Seems like a win-win situation to me. :)

I think the “duplicate content” issue is overblown. There’s no such thing as a “duplicate content penalty” otherwise thousands of WordPress sites would get penalized right out of the box. Duplicate content is displayed on archive pages, index pages, search results, in addition to the main article page, by default.

I generally see this manifest itself in Google results when things like /page/12/ are ranked ahead of the main article. It’s probably best to use an SEO plugin to clean that sort of stuff up. Canonical URLs (a WordPress default) help too, but they’re not a guarantee.

While this is a separate issue since we’re talking about content on separate domains, the alleged Matt Cutts conversation about a list of “root domains” for sites like WP Engine is interesting, as I’ve never heard of anything like that before, but it makes sense.

Just because it’s indexed doesn’t mean it ranks for any meaningful keywords. I see a lot of WP Engine-hosted sites in search results, and I don’t think I’ve ever seen the .wpengine.com version rank at all. So I tend to believe something is going on to prevent this, whether it be the “root domain” list or some other algorithmic check to make sure main sites rank and staging sites don’t.

On the privacy issues, yeah, it’s definitely a little too easy to grab all those *.wpengine.com staging URLs with a simple Google query. There’s no reason for those to be indexed at all, as support can always access the subdomain directly.

But it’s also worth noting you can do reverse IP checks and find out the URLs hosted on ANY server pretty easily as well. Is it really WP Engine’s fault that some unlaunched Harvard website exposed itself to the entire internet? That’s the developer’s responsibility to keep things under wraps, not the host.

I’m not even going to comment on how the whole #FailboatGate situation played out on Twitter and blog comments elsewhere, but both sides could’ve probably handled things a bit better.