Google Granted New Similarity Patent

Google’s new similarity patent means your search engine optimization campaign may get a little harder. Don’t you just love it?

Google was granted a new patent this morning that describes a method for determining the similarity between Web pages. The patent describes a system where a similarity engine can identify duplicate content by spidering a page and generating a sketch of it. The sketches will be weighted and can then be compared for similarities. If a page is deemed "too similar," Google can opt not to index it.
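For readers who want a feel for what "generating a sketch and comparing it" might look like, here is a minimal illustration in Python. It is not the method from the patent, whose details aren't covered here. It uses a generic shingling approach: break the text into overlapping word runs, hash them, and keep the smallest hashes as the sketch. The function names, the sketch size, the shingle width and the 0.9 cutoff are all assumptions made up for the example, and the weighting step the patent mentions is left out.

```python
import hashlib
import re


def shingles(text, width=4):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < width:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}


def sketch(text, size=64, width=4):
    """Hypothetical page sketch: the `size` smallest 64-bit shingle hashes."""
    hashes = {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, width)
    }
    return set(sorted(hashes)[:size])


def similarity(sketch_a, sketch_b):
    """Jaccard overlap of two sketches: 0.0 is disjoint, 1.0 is identical."""
    if not sketch_a and not sketch_b:
        return 1.0
    return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)


TOO_SIMILAR = 0.9  # assumed cutoff, purely illustrative


def is_near_duplicate(page_a, page_b, threshold=TOO_SIMILAR):
    """Flag two pages whose sketches overlap at or above the threshold."""
    return similarity(sketch(page_a), sketch(page_b)) >= threshold
```

Calling `is_near_duplicate(text_a, text_b)` on the visible text of two pages returns True when their sketches overlap heavily. Where a real engine would draw that line is exactly the open question.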

Doing this means Google will be able to streamline its indexing process and help to reduce the amount of duplicate content on the Web. It also means that if you’re not careful with breadcrumb navigation, dynamic URLs or any of the other techniques commonly associated with duplicate content issues, you may find all your search engine optimization dollars officially wasted when Google decides not to index your site. Awesome, right?

Though the system may be imperfect at first, this would be an easy way for Google to quickly compare pages to determine their similarity. This of course means that if you don’t want your pages booted out of a user’s search, you’d best be offering a high level of unique content to differentiate yourself from others out there. Otherwise, Google gets to decide not to crawl your site.

The entire similarity concept is a bit too math geeky for me to comprehend, but the idea of creating a virtual geek map of your page to determine its likeness to others is pretty cool. So cool that Google’s not the only one who has filed a patent for this technology.

Anything that limits the amount of duplicate content I’m seeing in my search results is okay by me. I think it would be very interesting to see something like this in place, mainly because it would mean Google would have to form a definitive answer as to how much duplication or resemblance is too much. No one has been able to do that to date and it’d be telling to see how Google views it.

Looking down the road a bit, I think it would also be fun to watch similarity become the new PageRank. Everyone will be speculating as to what factors are given the most weight and how similar is too similar according to Google. You have to think Google will leave room for some degree of similarity. Just because every site is using meta tags or search engine friendly design doesn’t make them similar or even related. And what about forums? Won’t they look relatively the same to a search engine?

Of course, who knows if Google will ever implement this in its algorithm? They may have just applied for the patent to block all the other companies toying with similar technology. Either way, it’s worth speculating about, especially if it means we can talk about the dangers of duplicate content. (Duplicate content is bad. Very bad.)

7 responses to “Google Granted New Similarity Patent”

Portland — I agree completely. I imagine Google will only be looking at site text, not the design. Otherwise, most sites will appear to be dups.

Handsome Rob — Time will tell. No matter what Google does, dedicated scrapers will always find a way around it. All we can do is keep making the wall higher so it’s harder to climb… and hope the scrapers hurt themselves when they fall.

Assuming that these “sketches” apply only to copy and related tags (as opposed to a full site design and stylesheet sketch), will it be a weighted system that is less stringent for pages that have less copy? Will scrapers then simply jumble their text every few words or make more pages with less actual copy on them? How will this impact popular CMS platforms like WordPress if tagging structure and layout are part of the “sketches”?

What concerns me about this technology is how it will impact ecommerce operations that use their store to sell a small series of items that are basically the same. And to boot, the ecommerce webapps out there are usually spitting out dynamic URLs left and right…

Hmmmm… the patent was filed in 2001, so my expectation is that this technology has been in use for a number of years already. As you can see, it takes a while for patents to work their way through the system. Most of the IP in a new gadget or piece of software will have patents still being processed, but companies can’t and won’t wait until the patent is awarded.

I know – opinions go both ways. I agree that Google should not be a censor, but it is their index, and duplicated results diminish the value of their top 10 results when they are all substantially the same. My view sways, but in the end I lean towards providing quality unique content that satisfies the searchers even if it aggravates website owners. So which is worse: make users write their own content or reward plagiarism?

It’s Google’s job to present users with the best, most expert information for any given query. If your site is 80 percent duplicate to someone else’s, you’re not presenting quality information and your site doesn’t deserve to rank well. I look at this as an incentive for site owners to create better sites, not an example of Google trying to control what content you’re allowed to view.

Oh come on… Google needs to be careful or it may find itself not so friendly in the end. I thought it was Google’s job to find and list sites, not to determine what sites they think I should be able to see. So if Google doesn’t like your site, then I can’t find it?
