Google Turns to Deep Learning Classification to Fight Web Spam

In the past few years, Google has been busy building what has become known as the Google Brain team, which started out by having its deep learning system watch videos until it learned to recognize cats.

Web Spam Classification Patent

Receiving an input comprising a plurality of features of a resource, wherein each feature is a value of a respective attribute of the resource

Processing each of the features using a respective embedding function to generate one or more numeric values

Processing the numeric values using one or more neural network layers to generate an alternative representation of the features of the resource, wherein processing the floating point values comprises applying one or more non-linear transformations to the floating point values

Processing the alternative representation of the input using a classifier to generate a respective category score for each category in a pre-determined set of categories, wherein each of the respective category scores measures a predicted likelihood that the resource belongs to the corresponding category
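Taken together, the four steps above can be sketched as a small pipeline. Everything concrete here, the embedding scheme, the layer size, and the category names, is an assumption for illustration; the patent leaves all of those choices open:

```python
import math
import random

def embed(feature, dim=4):
    # Hypothetical embedding function: deterministically maps a feature
    # (a value of an attribute of the resource) to a few floating point values.
    rng = random.Random(str(feature))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def relu(vec):
    # One non-linear transformation applied to the numeric values.
    return [max(0.0, x) for x in vec]

def softmax(scores):
    # Normalize raw classifier outputs into category scores that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, categories):
    # Step 2: embed each feature and concatenate the numeric values.
    numeric = [x for f in features for x in embed(f)]
    # Step 3: one "neural network layer" applying a non-linear transformation.
    hidden = relu(numeric)
    # Step 4: a linear classifier plus softmax yields one score per category.
    raw = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return dict(zip(categories, softmax(raw)))

categories = ["spam", "not_spam"]
features = ["domain:example.com", "title:cheap pills now", "site_age:30_days"]
rng = random.Random(0)
weights = [[rng.uniform(-1.0, 1.0) for _ in range(12)] for _ in categories]
scores = classify(features, weights, categories)
```

In a real system the embeddings and layer weights would of course be learned from labeled examples, not drawn at random; the sketch only shows the shape of the claimed pipeline.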

That “pre-determined set of categories” can include a search engine spam category. The category score for the “resource” measures a predicted likelihood that the resource is a search engine spam resource.

The pre-determined set of categories can include a respective category for each of a plurality of types of search engine spam. The pre-determined set of categories includes a respective category for each resource type in a group of resource types. Category scores can be used to:

Determine whether or not to index resources in a search engine index.

Generate and order search results in response to received search queries.

A deep network can be effectively used to classify resources into categories. For example, resources can be effectively classified as being spam or not spam, as being one of several different types of spam, or as being one of two or more resource types. The patent tells us:

Using the deep network to classify resources into categories may result in a search engine being able to better satisfy users’ informational needs, e.g., by effectively detecting spam resources and refraining from providing search results identifying those resources to users or by providing search results that identify resources that belong to categories that better match the user’s informational needs.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for scoring concept terms using a deep network.


The patent tells us that this resource classification system could classify resources as “search engine spam resources or not search engine spam resources.” It doesn’t define web spam in much detail, but does tell us that it might look at typical types of spam such as:

Content spam

Link spam

Cloaking spam, and

So on

The features of a resource can include words from the content of the site in tokenized form, URLs from the site, the title of the site, its domain name, categories or entity types relevant to the site, and the age of the site. Each of these many features might be used to calculate a probability that the site is spam, which could determine whether or not it gets indexed, or whether it is reduced in rankings:
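As a rough sketch of what gathering those features might look like, here's a minimal example. The field names and the `extract_features` helper are my own invention for illustration, not anything the patent specifies:

```python
from urllib.parse import urlparse

def extract_features(page):
    """Hypothetical feature extraction covering the inputs the patent lists:
    tokenized content words, the title, the domain name, relevant entity
    types, and the age of the site."""
    features = []
    # Words from the content of the site, in tokenized form.
    features.extend(("token", word.lower()) for word in page["content"].split())
    features.append(("title", page["title"]))
    features.append(("domain", urlparse(page["url"]).netloc))
    features.append(("age_days", page["age_days"]))
    # Categories or entity types relevant to the site.
    for entity in page.get("entities", []):
        features.append(("entity", entity))
    return features

page = {
    "url": "http://example.com/buy-now",
    "title": "Cheap Watches",
    "content": "Buy cheap watches now",
    "age_days": 12,
    "entities": ["watches"],
}
features = extract_features(page)
```

Each (attribute, value) pair would then be fed through its respective embedding function, as in the claims quoted above.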

For example, when the scores represent a likelihood that a resource is a search engine spam resource, the search system can use the score in a decision process so that a resource that is more likely to be spam is less likely to be indexed in the index database. As another example, when the scores represent likelihoods that a resource is one of several different types of search engine spam, the search system can determine that resources having a score that exceeds a threshold score for one of the types not be indexed in the index database.
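The per-type threshold decision in that passage could be sketched like this; the threshold value and the function name are assumptions, since the patent gives neither:

```python
SPAM_THRESHOLD = 0.8  # illustrative only; the patent doesn't specify a value

def should_index(spam_type_scores, threshold=SPAM_THRESHOLD):
    """Keep a resource out of the index database if its score for any
    type of search engine spam exceeds the threshold, as the quoted
    passage describes."""
    return all(score <= threshold for score in spam_type_scores.values())

decision = should_index({"content_spam": 0.95, "link_spam": 0.10})
```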

In some other implementations, the search system can make use of the generated scores in generating search results for particular queries. For example, when the scores represent a likelihood that a resource is a search engine spam resource, the search system can use the score for a given resource to determine whether or not to remove a search result identifying the resource before providing the search results for presentation to the user or to demote the search result identifying the resource in an order of the search results. Similarly, when the scores represent a likelihood that a resource belongs to one of a pre-determined group of resource types, the search system can use the scores to promote or demote search results identifying the resource in an order of search results generated in response to particular search queries, e.g., search queries that have been determined to be seeking resources of a particular type.
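The remove-or-demote behavior described there might look something like the sketch below. The two cutoff values and the tuple layout are assumptions for illustration; the patent only says a score can trigger removal or demotion, not how:

```python
REMOVE_ABOVE = 0.9   # assumed: near-certain spam is dropped from the results
DEMOTE_ABOVE = 0.5   # assumed: likely spam is pushed down the ordering

def rank_results(results):
    """results: list of (url, relevance, spam_score) tuples. Remove
    near-certain spam, then order what remains by relevance, demoting
    results whose spam score is merely likely."""
    kept = [r for r in results if r[2] <= REMOVE_ABOVE]
    # Sort non-demoted results first, then by descending relevance.
    return sorted(kept, key=lambda r: (r[2] > DEMOTE_ABOVE, -r[1]))

results = [
    ("http://a.example", 0.9, 0.95),  # removed outright
    ("http://b.example", 0.8, 0.60),  # kept but demoted
    ("http://c.example", 0.5, 0.05),  # kept, ranked first
]
ranking = [url for url, _, _ in rank_results(results)]
```

Promotion for resources matching a sought resource type would work the same way in reverse, with a score boosting a result's position instead of lowering it.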

While the patent doesn’t provide much in the way of detail on training and classification of features under this machine learning model, it does refer to a paper that does.

Google’s longtime head of Web Spam, Distinguished Engineer Matt Cutts, has been on his first extended leave after 15 years at Google. He is due to return in October. That’s pretty interesting timing, with the patent released during his first real vacation in years. I wonder if it was turned on while he was gone?


Comments

Bill, when you think about it, fighting spam in this manner not only theoretically improves results, it saves Google a lot of cycles and bandwidth costs, and that equals $$$$$$.

Agreed completely. I feel like I’m seeing some smart stuff coming out from Google lately. I was talking with my sister and mother this morning about how hard it was to find a puppy two years ago when they were looking. My sister told me that search was so difficult that she gave up on Google for a while. Same search today, enter the dog breed in the search box and one of the top two results is the American Kennel Club, which has a site link to a search box where you can find puppies of different types. It’s a lot easier.

The only patent I could find that’s already published that seemed like a good fit for Penguin is an old one from 2003 or 2004. It’s on annoying and manipulative links, found through dense subgraphs of links of the type that usually indicate the existence of things like doorway pages. This one is better.

The diagram is from the patent office, and is one of the pictures that was filed with the patent.

The patent tells us that part of these category scores involves looking at a lot of features around a website and looking for patterns among them that might be a sign of spam. There’s not a lot of description in the patent about how this process works, but if you have the time, I’d recommend reading Ray Kurzweil’s book “How to Create a Mind.” There’s a section in there where he talks about the architecture of the neocortex and how it captures patterns of information and builds upon them. I suspect that this deep learning approach does something very similar.

There are a lot of people out there who have automated spamming methods and attempt to manipulate Google’s search results. Hopefully Google can make it cost too much to keep on doing that. I think that’s one of their aims.

I’m just not sure that there is enough in the patent itself that would give us an idea of how it might go about trying to distinguish actual spam on a page or site from spam that might be the result of a negative SEO campaign.

It’s possible that this system might create a “negative SEO” category, covering whoever it might be who is acting to make another site look like it is spamming.

Again, the patent itself doesn’t address the issue of negative SEO enough to give us an idea of how it might address it.

This is awesome. I just found your blog and am ecstatic to say the least. Having a technical analysis of the algorithms and a review of the patents being put out by Google is something I have yet to see many people write about. I’m curious to see what you think about Dwayne’s negative SEO comment above.

If Google could actually make their webspam algos work, then they wouldn’t need such a massive manual-action web spam team manually taking action against the networks it deems to have violated its T&Cs. The fact that Matt Cutts and his team do much of this manually indicates that the algos can’t detect half as much as they would have us believe?
Neg SEO is a whole can of worms that hasn’t even begun to bite yet, but given time it will, I believe, be seen in more and more high-profile cases.