Google SEO News and Discussion Forum

I posted about this in the back room but I think this needs to be brought into public view. This is happening right now and could happen to you!

Over the weekend my index page and now some internal pages were proxy hijacked [webmasterworld.com] within Google's results. My well ranked index page dropped from the results and has no title, description or cache. A search for "My Company Name" brings up (now two) listings of the malicious proxy at the top of the results.

The URL of the proxy is formatted as such: https://www.scumbagproxy.com/cgi-bin/nph-ssl.cgi/000100A/http/www.mysite.com

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" now brings up 55,000+ results, when on Saturday it was 13,000 and on Sunday it was 30,000. The number of sites affected is increasing exponentially and your site could be next.

Take preventative action now by doing the following...

1. Add this to the <head> section of all of your pages:

<base href="http://www.yoursite.com/" />

and if you see an attempted hijack...

2. Block the site via .htaccess:

RewriteEngine On
RewriteCond %{HTTP_REFERER} yourproblemproxy\.com [NC]
RewriteRule .* - [F]

3. Block the IP address of the proxy

Order allow,deny
Deny from 11.22.33.44
Allow from all

4. Do your research and file a spam report with Google. [google.com...]

This is an important, but challenging topic. I'm glad you brought it up.

First thing I want to clarify -- as I understand your post, you are not talking about someone directly hijacking your traffic through some kind of DNS exploit. You're also not talking about someone stealing your content, although that can play into this picture at times.

Instead, you are talking about a proxy domain that points to yours taking over your position in the SERPs, sometimes a position that your url has held for a long time.

We've had some threads in this forum about it in the past, but there is nothing like a definitive address to the issue at this time. Yes, Google "should" fix it, and ultimately it is their job -- but what do you do while you're waiting for Google to fix it? Might be a long wait, you know?

For reference, here are the two recent threads about proxy hijacks. Feel free to fold the issues raised there into this discussion. As I said before, I don't think anyone has the definitive solution so far, but there are a lot of good ideas.

I'd also like to talk about the idea of blocking the IP address of the proxy server. In some cases, I think that the IP address may not be static, it may be spoofed, etc. Be careful if you do decide to block by IP that you've got all the technical information pinned down accurately, you don't want to harm your legitimate traffic, whether bot or human.

I had the exact same issue. Sometimes it happens with 'weaker' sites, and other times it happens with strong sites with decent rankings.

The best thing I did, on advice from this forum, was to periodically search Google for content from my website. If a proxy was stealing content, I added its IP to my .htaccess, which effectively served a Forbidden error for that page, and soon enough the page was removed.

Sometimes when this happens I will also make a change to the page, rewriting the index.

For the purposes of full disclosure, Synergy and I have exchanged stickymails on this issue.

Here is what has worked in the past:

The first step is to attempt a dialog with the site owner. Frequently a hole in their setup is being used to do a bit of mayhem. In this case it appears that someone found that port 443 was open to indexing and fed the search engines some links.

Don't give them a lot of time to fix it on their end.

Second is to get to their hosting provider if the first step results in no change.

When you contact the ISP it is also time to file a DMCA request with Google.

But somewhere between steps one and two, file a very large spam report with Google noting the many sites copied, especially any site containing content that Google might not wish to show in its index.

Currently, I can't get any response from the site that has all of the boosted content on it. Maybe a change is in the works.

Given an understanding of what a proxy *is* and how it works, the only step really needed is to verify that user-agents claiming to be Googlebot are in fact coming from Google IP addresses, and to deny access to requests that fail this test.

If the purported-Googlebot requests are not coming from Google IP addresses, then one of two things is likely happening:

1) It is a spoofed user-agent, and not really Googlebot.

2) It *is* Googlebot, but it is crawling your site through a proxy.

The latter is how sites get 'proxy hijacked' in the Google SERPs -- Googlebot will see your content on the proxy's domain.

The most fool-proof and low-maintenance method to validate Googlebot requests is to do a double reverse-DNS lookup on the IP address claiming to be Googlebot: if the IP address points to a Google hostname, and looking up that hostname then returns the original IP address, then it is a legitimate Googlebot request.

This is the method recently recommended by Google in their Webmaster help -- doubtless due to this very problem.

However, some servers are configured such that the Webmaster cannot do rDNS lookups. In that case, just using a simple list of the IP addresses that Google usually crawls from is a viable solution -- IF you keep a sharp eye out for Google changing or adding to the list of IP addresses that Googlebot uses to crawl.

Stopping these proxy hijacks at the front door will eliminate the need to repeatedly chase your content in search, prevent the need to file potentially-false DMCA complaints, etc. A pinch of prevention is worth a pound of cure...
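To make the above concrete, here is a minimal Python sketch of the double reverse-DNS check. The googlebot.com / google.com hostname suffixes are an assumption based on Google's published guidance, and the function name is mine; adapt the suffix list if you want to validate other crawlers the same way.

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Double reverse-DNS check: the IP must reverse-resolve to a Google
    hostname, and that hostname must forward-resolve back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)           # reverse (PTR) lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward (A) lookup
    except OSError:
        return False
    return ip in forward_ips
```

Run this only on requests whose user-agent claims to be Googlebot, and cache the result per IP so you aren't doing DNS lookups on every hit.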

It's happened to one of my sites through another site altogether - also a proxy, and it looks like it's taken my homepage out of the index altogether. There isn't an identifiable search string like the other site that's doing it, but the number of victimized pages is growing daily.

It just doesn't make sense to me to leave that entire domain of swiped pages in the index, so that tens of thousands of individual webmasters have to file a DMCA, when most of them won't know what hit them, let alone know what to look for or do in the first place.

The average webmaster or site owner just knows when all their traffic is gone, they usually don't do an inurl: search as a matter of course and most probably have never heard of it.

Perhaps someone can explain how leaving those sites in the index, while tens of thousands of legitimate site pages that actually have value disappear altogether, is supposed to be any good for the quality of search for users.

Instead, you are talking about a proxy domain that points to yours taking over your position in the SERPs, sometimes a position that your url has held for a long time

Exactly. It seems a bit difficult for people to wrap their heads around... hijacked within Google. The symptoms: A search for "My Company Name" would bring up the hijacker's site and not mine, all rankings for all keywords gone, massive traffic and sales loss.

Good news: either Google has taken action on my spam report or the coding mentioned in the first post has reversed the action of the hijacker. My index page is no longer hijacked and I am back in the rankings (although slightly lower ranked) for my big money keyword.

I plan to implement the double reverse DNS check to prevent this in future. I've also set up Google alerts to monitor inurl:mysite.com on a daily basis.

The hijacking site hosts an online proxy web browser through which you can anonymously browse the web, and I think that someone used it to browse my website recently. In order to serve the page privately, the proxy sends a spider out to scrape the page and "host" it locally via a 302 redirect. Google's spider comes across this new 302 redirect and thinks that the content has a new location, confusing the original for the duplicate. In order to tell Google that the scraped page is not the original, you have to ban the 302 connection via .htaccess by URL and IP address.

The problem lies with Google as they cannot tell the difference between a real 302 redirect and a scraped 302 redirect.

No special "browser" or "robot" or software is needed to do this hijack thing. All that is needed is to configure the server as a proxy. A proxy does what its name implies: It "stands in" for another client or host.

In this case, the proxy is configured such that it forwards requests from a client --in this case, Googlebot-- to your server, and forwards the responses from your server back to the client. Your server sees the proxy as the client, and Googlebot sees the proxy as your server.

The forwarding may be completely transparent, or may employ some insidious software -- Perhaps substituting a fake robots.txt file if one is requested by the client, doing a bit of cloaking for fun and profit, etc.

But leaving that detail for the moment, Google crawls your site through this proxy and lists your content at the proxy's address (domain).

Since your server sees the proxy as the client, it's easy enough to check that a client claiming to be Googlebot is coming from a Google address and not from/through some proxy's address, as previously described.

Proxies are not necessarily good or evil, they're just proxies. Some are used to hijack content, and some are used by information-starved people stuck behind firewalls to get to the real and unfiltered world-wide-web.

We had (I think) exactly this problem on one of our sites. Someone had a joke script that scrapes then re-writes content - to turn 'the' into 'da', 'with' into 'wiv' etc. Somehow Google had got hold of hundreds of pages they had viewed using this script.

What was staggering is (a) that all the pages were non-Supplemental when the actual parent site the script was installed on was dead and low PR, and (b) even more remarkable than fully indexing all this newer, near-dupe content, Google was dumping the older real pages from SERPs because of it. Overnight we lost all rankings for all but 2 terms.

It wasn't malicious but that didn't help us. What we did was track the site owners down and by chance they lived locally (same city). After we spoke to them they took all the pages down. Luckily enough Google actually crawled them and us within a few days of this and all our rankings came back.

It's also now still working with a regular 302 redirect. I've found one that returns a 302 on the URL with the go.php and shows that URL in the search result along with the snippet from the site being redirected to.

A quick search in Google for "cgi-bin/nph-ssl.cgi/000100A/" brings up now 55,000+ results when Saturday it was 13,000 and Sunday it was 30,000.

It's now up to 79,200 since this morning, and going strong.

That one may be an unintentional mishap on their part, but based on what's been uncovered, it doesn't appear that the "other one" that's doing the same thing is quite so innocent.

I asked someone well versed in such matters earlier in the day, and according to them dealing with IP numbers is futile, since IP numbers can easily be swapped around, and cloaked info served to Google.

Also, the one doing the hijacking with a 302 go.php has 12,800 pages indexed, with the ones not serving the redirected site running Adsense on their own pages with the scraped content.

Marcia, your search that hauls up 74,000 pages isn't just the proxy-like site; it gets any mention of the pages on other sites as well.

The particular site (and likely not the only one running that script) that Synergy was talking about has over 50,000 of the pages and is no longer responding to any request that I make. I suspect they may have closed the hole or that part of the net is off line to me.

On the go.php, what can I say? It isn't difficult to handle redirection in any scripting system.

If your site's traffic is 100 percent U.S. based, and if most of these proxies are located in certain countries such as China, Russia, and the Middle East, wouldn't the best solution be to block those IP ranges?

The only thing is, where do you find an all-inclusive list of IP ranges per country? I DON'T MEAN doing the IP lookup thing, but rather banning entire countries using a reliable list.

Would this sort of thing be less likely to happen on a blogger blog? In other words, would a blog site hosted on a google server be less likely to get jacked? I mean, the issue of where the content sits, where it should be indexed from, and the domain it should be attributed to should be fairly clear if the content sits on a server operated by google itself. What do you think, Tedster?

That's not a hijack, that's just someone linking to your site with a bogus query string -- probably hoping to pick up a link from your 'stats' page if it is publicly-accessible, or from the page(s) where that link appears.

Servers themselves ignore query strings attached to URLs -- They have meaning only if the URL points to (or is rewritten to) a script and if that script ascribes some meaning to the query string; The server itself doesn't use the query string at all, which is why that link displays your page.

In order to prevent the appearance of legitimacy of that link, you can detect and redirect to remove such bogus query strings. How you do that depends on what kind of server you're hosted on, and whether any other pages/scripts on your site actually use query strings.
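As one illustration of that decision logic, here is a minimal Python sketch, assuming no page on your site legitimately uses query strings (the function name canonical_target is hypothetical; the redirect itself would be issued by your server or framework):

```python
from urllib.parse import urlsplit

def canonical_target(request_uri: str):
    """Return the clean path to 301-redirect to if the request URI
    carries a bogus query string, or None if it is already canonical.
    Assumes no URL on this site legitimately takes a query string."""
    parts = urlsplit(request_uri)
    if parts.query:
        return parts.path  # redirect here to shed the bogus string
    return None
```

For example, a request for /page.html?ref=spam would be 301-redirected to /page.html, removing any appearance of legitimacy from the bogus link.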

You're not blocking the site, you're blocking an actual visitor that may have landed on a hijacked page, not a wise idea to punish the wrong person IMO. Fact is, it's highly unlikely you'll ever see the proxy as an actual referrer as they mask their presence, so that step is a total waste of time.

The most fool-proof and low-maintenance method to validate Googlebot requests is to do a double reverse-DNS lookup on the IP address claiming to be Googlebot: if the IP address points to a Google hostname, and looking up that hostname then returns the original IP address, then it is a legitimate Googlebot request.

BINGO!

If you validate Googlebot (or any other crawler) with reverse/forward DNS checking the proxy hijacking simply goes away.

Not only can you stop a page hijacking, but when you detect this condition you can feed the search engine special pages via the proxy server and turn a thwarted hijacking to your advantage!

I'll leave what you feed the proxy server up to your imagination ;)

Otherwise, if you just let the page get returned via the proxy, Google sees your content coming from a different address and may assign ownership of the content to that proxy address.

Why Google, after all these years, is incapable of and/or refuses to block crawling via these sites, which appear to have obvious detectable fingerprints, is beyond me.

You're not blocking the site, you're blocking an actual visitor that may have landed on a hijacked page, not a wise idea to punish the wrong person IMO. Fact is, it's highly unlikely you'll ever see the proxy as an actual referrer as they mask their presence, so that step is a total waste of time.

If you do the .htaccess blocking, granted, you are denying legit access, but aren't you also letting G know it's a bad link when they visit the proxy server site?

If you do the .htaccess blocking, granted, you are denying legit access, but aren't you also letting G know it's a bad link when they visit the proxy server site?

If you block the IP of the proxy, yes.

If you block the proxy referrer, no.

A good proxy never divulges its domain name, as every request for every link is filtered via the proxy, so your site should never see their referrer, unless something else was going on, like they performed a redirect to your site instead of routing it through the proxy for some reason.

"Not only can you stop a page hijacking, but when you detect this condition you can feed the search engine special pages via the proxy server and turn a thwarted hijacking to your advantage!"

incrediBill, how about feeding it its own tail?

However you will never see Googlebot doing its thing through those scripts unless the user of said thingies is totally out of it.

What I see after I have located one of the little babies is frequently an entry like this:

127.0.0.1 - - [28/Jun/2007:22:45:59 -0400] "GET /robots.txt HTTP/1.0" 200 27 "-" "-"

The only difference is that I am running a modified copy of one of the little goody scripts, and as you can see the referer and agent are gone.

This is a little 6 line version of one that normally caches server side on the site or desktop using it.
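For what it's worth, entries like the one above are easy to flag programmatically. Here is a minimal Python sketch (the function name is hypothetical) that spots Apache combined-log-format requests where both the referer and user-agent are blank, as in that proxy-script hit:

```python
import re

# Apache combined log format: IP, identity, user, [time], "request",
# status, bytes, "referer", "user-agent"
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<req>[^"]*)" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def looks_like_proxy_script(line: str) -> bool:
    """Flag log entries whose referer and user-agent are both blank ('-'),
    as in the stripped proxy-script request shown above."""
    m = COMBINED.match(line)
    return bool(m) and m.group("referer") == "-" and m.group("agent") == "-"
```

A blank referer alone is common for normal visitors, so the two blanks together is the signal worth scanning for; treat hits as candidates for a closer look, not automatic blocks.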