Can you help with this mystery? Old site's links attributed to new site.

The site, http://bit.ly/WB3S68, was affected by Penguin. The site owner wanted to start over and did the following to create a new site (http://bit.ly/11xqetB):

-Used the Google URL removal tool to remove the old site from the index.
-Used robots.txt on the old site to disallow search engines.
-When the old site was gone from the index, placed the new site online. The new site has a different home page than the old but the inner pages are all the same.
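
For anyone retracing the robots.txt step, here is a minimal Python sketch of a blanket disallow and a check that it really blocks Googlebot. The robots.txt body, the `is_blocked` helper and the `old-site.example` domain are assumptions for illustration; the actual file used on the old site isn't shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body that blocks every crawler from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if robots_body forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return not parser.can_fetch(user_agent, url)


# old-site.example is a placeholder domain, not the real old site.
print(is_blocked(ROBOTS_TXT, "Googlebot", "http://old-site.example/inner-page.html"))
```

This only checks the rules as Python's standard-library parser reads them; it says nothing about how Google treats URLs that are already indexed.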

A couple of weeks later, according to WMT the new site suddenly had 10,000+ links that previously had pointed at the old site's home page. There is a line underneath each link saying "Via this intermediate link: [old site]"

There was definitely never any redirect from the old to the new.
I am figuring the problem is related to what is described here, http://dejanseo.com.au/mind-blowing-hack/, where Dejan SEO noticed that if you copied a page completely, Google would show that page's links in your WMT.

It turns out that when you remove a site from the index using Google's URL removal tool, if you choose to remove the whole site it removes it from the index but not from the cache. When I searched for "webcache.googleusercontent.com/[old site]" it would display the cache for the new site! So, I went in and removed each individual URL from the cache using the tool. (You only get the option to remove from cache if you remove individual URLs.)

Now, 2 days later, the cache search for the old site throws a 404. Yet, all 10,000+ of the links to the old are still pointing at the new according to WMT.

My guess is that as these links get crawled they will disappear, but I would have expected at least some of them to be gone by now. I know that WMT is slow to update and I'm hoping that this is why we're still seeing them.

Questions:

1. What do you guys think of these "via another site" links appearing in WMT? Do you think Google is passing link juice through them? In the Dejan SEO article, those via links appeared even if a low PageRank site copied a high PageRank site's article, so I am guessing that the links aren't really passing juice and therefore don't really count as pointing to your site.

2. I know no one can answer this for sure, but how likely do you think it is that this site would be affected by Penguin if it refreshes soon?

Thanks KP. I'm quite certain this was a Penguin issue. I didn't have any analytics or server logs to dig through, but there was a distinct drop in traffic at the end of April. The site lost positions for the keywords that they had been building links for. There were thousands of links from blog comments, articles and paid links all with keyword anchor text. I also didn't find any obvious Panda issues on the site.

The new site has a different home page than the old but the inner pages are all the same.

IMHO
The inner pages are the problem here.
The issue is that although the pages are de-indexed and out of the cache, these are only 2 layers of the Google beast. Think of them as just the customer-facing layers.

Google has become quite clever over the years at recognizing duplicate content, even if spun. It can only do this if its data core has a bit of history in there as well. If that article is correct, and I have no doubt it is, then I am pretty sure it's matching the old inner pages up with the new but identical inner pages and treating them as the same.

The old pages were subject to a penalty, so this new but duplicate content should be treated the same, as Google has already identified them as being bad in its eyes.

The only solution in my view is to rewrite a couple of the pages and see if the links drop in GWT. My guess is they will once the rewritten pages get indexed.

Pain in the arse, I know, and to be honest a bit unfair in this case, but unless you can get someone at Google to manually fix things I would say the inner pages need rewriting as well.

Thanks guys. Good thoughts here. Sorry I couldn't weigh in earlier. A stupid stomach bug has thrown our family for a loop the last couple of days.

This is not a Panda issue for sure. And as Darren mentioned, if it was Penguin then it's the links that are the culprit and not the content so the site itself doesn't deserve to be punished.

At this point we'll give the site a bit more time to see if the "via this intermediate link" messages go away once the links get crawled again. WMT is notoriously slow at updating so I am hoping that one day we'll just wake up and see that those links are gone.

The next option is to disavow those links. But that would be a pain. Plus the links will always be there clouding up WMT even if there is a disavow.txt present.
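
If it does come to a disavow, the file is plain text with one URL or `domain:` entry per line and `#` for comments. Here is a rough Python sketch for building it, assuming the WMT link export has already been read into a list of URLs; `build_disavow` and the example hosts are made up for illustration.

```python
from urllib.parse import urlparse


def build_disavow(link_urls, note="Links inherited from the old site"):
    """Build disavow file text with one domain: line per unique linking host."""
    hosts = set()
    for url in link_urls:
        host = urlparse(url).hostname
        if host:  # skip anything that does not parse to a hostname
            hosts.add(host)
    lines = [f"# {note}"] + [f"domain:{host}" for host in sorted(hosts)]
    return "\n".join(lines) + "\n"


# Placeholder URLs standing in for rows from a WMT link export.
sample_links = [
    "http://spammy-blog.example/post?id=1",
    "http://article-farm.example/seo-widgets",
    "http://spammy-blog.example/post?id=2",
]
print(build_disavow(sample_links))
```

Disavowing whole domains rather than individual URLs keeps the file short when one blog has thousands of comment links, but it is a blunt choice and worth reviewing by hand before uploading.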

Comments on this post

DarrenHaye agrees
: It's such an unusual issue, hopefully things will get reprocessed and they can get a clean start.

Chedders agrees
: will be interesting to know any time scales when they get removed

I thought I'd update this thread. An interesting thing has happened. Google is once again showing us the "via this intermediate link --> url of old site" message for a number of pages. The old site is completely erased other than an index page that says "We have moved" and gives the url (but no link) of the new page. The inner pages of the old site are completely gone.

Yet, when we do a site: search of the old site's domain name, Google shows all of the pages that we had formerly used the url removal tool to remove and says, "A description for this result is not available because of this site's robots.txt".

Apparently the url removal tool only works for 90 days. After that time, if the page appears to be still active, Google will reinclude it in the index. I am guessing that what happened is that the robots.txt file blocked Google from crawling the site and as such they can't tell that all of the inner pages have been removed and that the home page has been changed.

We're going to remove the robots.txt file and then use the url removal tool again to remove the pages from the index and the cache.

My gut instinct is that these "via this intermediate link" links do not pass pagerank and as such don't contribute to Penguin. We've got the new site ranking #1-#2 for most terms so it doesn't appear to be affected by Penguin. And, if the url removal tool only kept the old pages out of the index for 90 days then Penguin would have refreshed a couple of times since Google reinstated the pages in the cache.

You know what is baffling me: how can a page that is noindex, nofollow still have a PageRank of 2? I also do not understand why you would not 301 (permanent) redirect it to the new address, so that it is removed from Google once the new page is crawled, along with letting Google know you have moved address via Webmaster Tools.
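
For reference, the 301 approach suggested here could look something like the Python sketch below, using only the standard library. `NEW_SITE`, `redirect_location` and `PermanentRedirect` are hypothetical names, and on a real host this would more likely be done in the web server config. It only illustrates the mechanics.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_SITE = "http://new-site.example"  # hypothetical new domain


def redirect_location(path: str) -> str:
    """Map a path on the old site to the same path on the new site."""
    return NEW_SITE + path


class PermanentRedirect(BaseHTTPRequestHandler):
    """Answer every GET or HEAD request with a 301 to the new domain."""

    def do_GET(self):
        self.send_response(301)  # 301 = Moved Permanently
        self.send_header("Location", redirect_location(self.path))
        self.end_headers()

    do_HEAD = do_GET  # HEAD gets the same status line and headers
```

To try it locally, something like `HTTPServer(("", 8080), PermanentRedirect).serve_forever()` would answer every request with a 301 to the same path on the new domain.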

stagingworks(dot)ca/splash.png: not sure if you noindexed/nofollowed this page, or whether this would also be treated as a separate URL which is keeping that homepage indexed. I am probably clutching at straws.

I was always under the impression that once you noindex a page the PR drops to N/A. Weird, unless it is being kept online via the splash.png image URL, which is also a direct link.