Google Patent on Web Spam, Doorway Pages, and Manipulative Articles

A patent granted to Google today explores Web spam and the manipulation of documents and links on the Web. It describes how the rankings of pages may be influenced if they are identified as “manipulative.”

The identification of manipulative documents, how they might be grouped together, and how they could be treated by the search engine is described in some detail. That treatment might include removal of pages from the search index, reductions in rankings for pages, and possibly a change in how quality scores (PageRank) are calculated for links from manipulative pages.

The patent was filed almost 4 years ago, on December 10, 2003, and wasn’t granted until today.

A good number of papers and patent applications on Web spam have been published since then, and they explore more detailed approaches. This patent is still interesting in that it captures some aspects of how Google may have been detecting and fighting Web spam over the past few years (and may still be).

Systems and methods that identify manipulated articles are described. In one embodiment, a search engine implements a method comprising determining at least one cluster comprising a plurality of articles, analyzing signals to determine an overall signal for the cluster, and determining if the articles are manipulated articles based at least in part on the overall signal.

These are a few of the manipulative techniques that the patent identifies:

Use of a domain name of a once legitimate site,

Filling the text of pages or anchor text from links in the page with certain popular query terms,

Automatically creating links from other pages to the manipulated page, and;

Showing a different page to a web crawler than to human visitors.

The patent also provides us with one definition of Web spam:

These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user’s query.

The harm to a search engine from such manipulation is that a searcher’s experience with the search engine can be degraded if the search engine returns a search results set containing manipulated documents.

Clustered pages

One part of this process might be for the search engine to identify clusters of documents that may be related to each other somehow, such as being on the same web host, or being interlinked as doorway pages and the articles targeted by those doorway pages.

For clusters that are identified, signals that manipulation is happening on pages within those clusters are explored, and an overall signal for the cluster may be determined. Signals can exist within documents contained in a cluster, and from documents outside of the cluster pointing into it:

The signals can comprise outside signals and document signals. Outside signals can be signals associated with the cluster, but not from the individual documents in the cluster and document signals can be signals from the documents in the cluster. In one embodiment, the overall signal is determined for a subset of articles in the cluster.

A score, or “overall signal,” can be calculated in a few different ways, and the magnitude of that signal may determine how the search engine ends up treating pages within a cluster:

The overall signal is used at least partly to determine if the articles are manipulated. The overall signal can represent to what grade the page is considered to be manipulated or it can be used together with a threshold to determine whether the article is manipulated. The overall signal can be used at least in part in a ranking of an article in the cluster in response to a search query.

In one embodiment, a cluster is determined by computing a dense bipartite subgraph of articles comprising doorway articles and target articles, wherein the doorway articles contain links to the target articles. In one embodiment, a cluster is determined at least in part by identifying all of the documents on a host, where the host is likely to contain manipulated articles.

If doorway pages are found that target documents located on other domains, it’s possible that those other domains might be included within the same cluster.

A cluster may also be created by “performing a search for manipulated documents and forming a cluster based on the search result set.”
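
To make the doorway/target idea a little more concrete, here is a minimal Python sketch of one way a dense bipartite group could be pulled out of a link graph. The patent does not spell out an algorithm, so everything here is an assumption made for illustration: the function name `dense_bipartite_core`, the `min_out` and `min_in` thresholds, and the simple iterative pruning heuristic (repeatedly dropping pages that link to, or are linked from, too few surviving candidates).

```python
def dense_bipartite_core(links, min_out=2, min_in=2):
    """Prune a link graph down to a dense doorway/target core.

    links: dict mapping a page to the set of pages it links to.
    Returns (doorways, targets). Hypothetical heuristic, not the
    patent's method.
    """
    doorways = set(links)
    targets = set().union(*links.values())
    changed = True
    while changed:
        changed = False
        # Drop candidate doorways that link to too few surviving targets.
        for d in list(doorways):
            if len(links[d] & targets) < min_out:
                doorways.discard(d)
                changed = True
        # Drop candidate targets linked from too few surviving doorways.
        for t in list(targets):
            in_count = sum(1 for d in doorways if t in links[d])
            if in_count < min_in:
                targets.discard(t)
                changed = True
    return doorways, targets


example_links = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t2"},
    "d3": {"t1", "x"},
    "blog": {"news"},
}
doorways, targets = dense_bipartite_core(example_links)
```

In this toy graph, only `d1` and `d2` survive as doorways and `t1` and `t2` as targets, because every other page has too few connections into the group. A real system would presumably need something far more scalable and more tolerant of noise.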

Outside Signals Pointing to Clusters

There may be some outside signals that point to a cluster that might tell a search engine if it is being manipulated, such as a high number of links on guestbook pages pointing to pages within the cluster. The patent tells us that other outside signals might also be determined.

One potential issue with this approach that the patent doesn’t discuss is that those “outside” signals might not be created by the owners of the pages being pointed towards, or anyone acting on their behalf.

Can links pointed towards your pages harm your site, even if someone else creates those links? Perhaps that’s partially why the patent also looks at signals of manipulation found upon pages within clusters, too.

Individual Document Signals

These signals may be determined automatically for all of the pages in the cluster, or only a subset of pages may be evaluated for document signals. Document signals are ones that indicate that the document may be manipulated.

Here are some examples provided in the patent:

The text of the document – whether it appears to be normal English (or another language) or appears to have been generated by a computer, such as by containing a large number of keywords and no sentences.

Meta tags — whether the page has meta tags, and if so whether those contain a large number of repeated keywords.

Redirects — whether there are scripts in the document that redirect a user to another document when they access that page.

Similarly colored text and background — whether there is a large amount of text in the document that is the same color as the background of the document.

A large number of random links — whether the document contains a large number of unrelated links, though what “unrelated” might mean isn’t defined here.

History of the document — whether the text of the document, the link structure of the document, or the ownership of the website on which the document appears has changed recently.

Anchor text — whether a lot of links appear on the page, and there is very little or no text that is not anchor text.
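
As a rough illustration of how one of these document signals might be measured, the sketch below estimates how much of a page's visible text is anchor text, using Python's standard `html.parser` module. The class and function names are hypothetical, and the patent does not say how Google computes this. It is also a simplification: it counts characters rather than words and does not exclude script or style content.

```python
from html.parser import HTMLParser


class AnchorTextCounter(HTMLParser):
    """Counts text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of <a> elements currently open
        self.anchor_chars = 0   # characters of text inside links
        self.total_chars = 0    # characters of all text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.depth > 0:
            self.anchor_chars += n


def anchor_text_fraction(html):
    """Return the share of text (0.0 to 1.0) that sits inside links."""
    parser = AnchorTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.anchor_chars / parser.total_chars


page = '<p><a href="/x">cheap pills</a> <a href="/y">buy now</a></p>'
fraction = anchor_text_fraction(page)
```

A page made up of nothing but links, like the one above, comes out with a fraction of 1.0, while an ordinary paragraph with no links comes out at 0.0.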

Determining an overall signal for a cluster

The outside signals and the document signals are evaluated, and an overall signal (or score) is created for each cluster.

As an example, the percentage of documents within the cluster that contain a certain document signal might play into this determination, such as:

The percentage of documents that only contain text that is anchor text,

the percentage of documents that have meta tags, and;

the percentage of documents that cause a redirect.

Other signals might also be looked at in combination with those, such as whether a document’s ownership has recently changed and whether the document’s outlink structure has also recently changed.
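
The percentage idea can be sketched in a few lines of Python. The patent does not say how the percentages are combined, so the weighted average below, the signal names, and the weights are all assumptions made purely for illustration.

```python
def signal_percentages(cluster):
    """cluster: list of dicts mapping signal name -> bool, one per document.

    Returns the percentage of documents showing each signal.
    """
    if not cluster:
        return {}
    names = set().union(*(doc.keys() for doc in cluster))
    return {
        name: 100.0 * sum(1 for doc in cluster if doc.get(name)) / len(cluster)
        for name in names
    }


def overall_signal(percentages, weights):
    """Weighted average of per-signal percentages, scaled to 0..1.

    The weighting scheme is a hypothetical choice, not the patent's.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(w * percentages.get(name, 0.0) for name, w in weights.items())
    return weighted / (100.0 * total_weight)


cluster = [
    {"only_anchor_text": True, "redirects": True},
    {"only_anchor_text": True, "redirects": False},
    {"only_anchor_text": False, "redirects": False},
    {"only_anchor_text": True, "redirects": False},
]
percentages = signal_percentages(cluster)
score = overall_signal(percentages, {"only_anchor_text": 1, "redirects": 1})
```

Here three of four documents contain only anchor text (75%) and one of four redirects (25%), so with equal weights the cluster's score works out to 0.5.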

Marking clusters or subsets within clusters as manipulated

If a cluster or subset of that cluster is determined to be manipulated, either manually, or through an automated machine learning process, documents within the cluster or subset might be marked as manipulated.

If the signal involving manipulation is weak, a manual review might take place to check for manipulation. If it is stronger, then all of the documents within the cluster or subset may be declared manipulated without manual review.
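
That weak-versus-strong distinction amounts to a two-threshold decision, which might look something like the sketch below. The function name, the labels, and the threshold values of 0.4 and 0.8 are hypothetical. The patent gives no numbers.

```python
def triage(overall_signal, review_threshold=0.4, auto_threshold=0.8):
    """Decide what to do with a cluster given its overall signal (0..1).

    Threshold values are illustrative assumptions only.
    """
    if overall_signal >= auto_threshold:
        # Strong signal: mark every document without manual review.
        return "manipulated"
    if overall_signal >= review_threshold:
        # Weaker signal: send the cluster to a human reviewer.
        return "manual_review"
    return "not_flagged"


decisions = [triage(s) for s in (0.1, 0.5, 0.9)]
```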

How a manipulation indicator can impact a page during retrieval and ranking by the search engine

Being marked as a manipulated page might mean that indicator will be:

a) Used in a ranking function to lower the rank of a page.

b) Used as an indication that the page should be removed entirely from the search results.

c) Used to treat the document differently, like not including it in a hyperlink structure-based ranking calculation, such as PageRank. (The patent here doesn’t say anything about just lowering the Toolbar PageRank, but that could be a possibility, too.)

d) Used differently depending on the query, so for example, if the query relates to pornography, the manipulation indicator may not be used.

e) Used in a variety of other ways during the retrieval and ranking processes.
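
A minimal sketch of options (a), (b), and (d) follows, assuming a result list of `(url, score)` pairs and a set of flagged URLs. The function name `rerank`, the `mode` and `penalty` parameters, and the multiplicative demotion are invented for illustration. The patent doesn't describe the ranking function in this kind of detail.

```python
def rerank(results, manipulated, query_is_adult=False, mode="demote", penalty=0.5):
    """Apply a manipulation indicator to search results.

    results: list of (url, score) pairs.
    manipulated: set of URLs marked as manipulated.
    mode: "demote" lowers the score, "remove" drops the page entirely.
    All parameters are hypothetical.
    """
    if query_is_adult:
        # Option (d): for some query types the indicator is ignored.
        return sorted(results, key=lambda r: r[1], reverse=True)
    adjusted = []
    for url, score in results:
        if url in manipulated:
            if mode == "remove":
                continue          # option (b): remove from the results
            score *= penalty      # option (a): lower the rank
        adjusted.append((url, score))
    return sorted(adjusted, key=lambda r: r[1], reverse=True)


results = [("spam.example/a", 0.9), ("good.example/b", 0.6)]
flagged = {"spam.example/a"}
demoted = rerank(results, flagged)
removed = rerank(results, flagged, mode="remove")
ignored = rerank(results, flagged, query_is_adult=True)
```

With these made-up scores, demotion moves the flagged page below the clean one, removal drops it altogether, and the adult-query case leaves the original order untouched.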

Why Cluster?

Some pages within a host, or within an interlinked set of manipulative pages, may have no signals of manipulation attached to them. But if they are clearly enough related to pages that do, then chances are that they should be included within that cluster and marked as manipulated as well.

Other signals might also be identified by looking at the percentage of documents within a cluster that have a particular signal.

References within the Patent

Granted patents commonly have a “References Cited” section which includes lists of documents, granted patents, and patent applications that may have been referred to by the inventors, or explored by the patent examiner during the patent process.

I’ve listed and linked to most of the documents in the reference section of this patent except for three Wikipedia pages – one of them seems to no longer exist, and the other two have been changed since the time that they were accessed.

The documents aren’t necessarily tied to the creation of this patent, but I started looking through them, and found them interesting enough to include in this post.

The text accompanying each of the patents and patent applications is the abstract from that document.

A method of organizing information in which the search activity of a user is monitored and such activity is used to organize articles in a subsequent search by the same or another user who enters a similar search query. The invention operates by assigning scores to articles under the key terms in the index. As users enter search queries and select articles, the scores are altered.

The scores are then used in subsequent searches to organize the articles that match a search query. As millions of people use the Internet, type in millions of search queries, and display or select from the many articles available over the Internet, the invention ranks the information available over the Internet through an evolutionary process. The invention includes additional embodiments which incorporate category key terms and rating key terms.

A method of organizing information in which the search activity of users is monitored and such activity is used to suggest additional key terms for addition to a search query. The invention operates by assigning scores to key term groupings in an index. As users enter search queries of two or more key terms, the scores are altered.

The scores are then used in subsequent searches to suggest other key terms which can be added to the search query to narrow the search. As millions of people use the Internet and type in millions of search queries, the invention learns which key terms should be suggested for addition to a search query through an evolutionary process.

An interactive system for analyzing and displaying information contained in a plurality of documents employing both term-based analysis and conceptual-representation analysis. Particulars of the invention are especially effective for analyzing patent texts, such as patent claims, abstracts and other portions of a patent document.

A method of organizing information in which the search activity of a user is monitored and such activity is used to organize articles in a subsequent search. The invention operates by assigning scores to articles under key term components in an index. As users enter search queries and select articles, the scores are altered according to, among other things, the amount of time spent inspecting an article, whether the article is the last one inspected, how many articles are ranked higher than the article, how many articles have been previously inspected by the user, and whether an advertising banner was selected by the user.

The scores are then used in subsequent searches to organize the articles that match a search query. As millions of people use the Internet, type in millions of search queries, and display or select the many articles available over the Internet, the present invention uses this search activity to rank information available over the Internet through an evolutionary process. The invention includes additional embodiments which incorporate category key terms and rating key terms.

A method of organizing information in which the search activity of previous users is monitored and such activity is used to organize articles for future users. Personal data about future users can be used to provide different article rankings depending on the search activity and personal data of the previous users.

The present invention is directed to a data mining method and apparatus that dynamically initiates the counting of sets of items (itemsets) at any time during the pass over the records of a database and terminates the counting at the same location in the next pass.

In this manner, the present invention begins to count itemsets early and finishes counting early while keeping the number of different itemsets which are being counted in any pass relatively low.

A method assigns importance ranks to nodes in a linked database, such as any database of documents containing citations, the world wide web or any other hypermedia database. The rank assigned to a document is calculated from the ranks of documents citing it.

In addition, the rank of a document is calculated from a constant representing the probability that a browser through the database will randomly jump to the document. The method is particularly useful in enhancing the performance of search engine results for hypermedia databases, such as the world wide web, whose documents have a large variation in quality.

This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a “target profile” for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a “target profile interest summary” for each user, which target profile interest summary describes the user’s interest level in various types of target objects.

The system then evaluates the target profiles against the users’ target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media.

Users’ target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically-based pseudonym proxy server is provided to ensure the privacy of a user’s target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.

A search engine for searching a corpus improves the relevancy of the results by refining a standard relevancy score based on the interconnectivity of the initially returned set of documents. The search engine obtains an initial set of relevant documents by matching a user’s search terms to an index of a corpus.

A re-ranking component in the search engine then refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set.

A system allows a user to submit an ambiguous search query and to receive potentially disambiguated search results. In one implementation, a search engine’s conventional alphanumeric index is translated into a second index that is ambiguated in the same manner as which the user’s input is ambiguated.

The user’s ambiguous search query is compared to this ambiguated index, and the corresponding documents are provided to the user as search results.

A system and method for browsing, retrieving, and recommending information from a collection uses multi-modal features of the documents in the collection, as well as an analysis of users’ prior browsing and retrieval behavior.

The system and method are premised on various disclosed methods for quantitatively representing documents in a document collection as vectors in multi-dimensional vector spaces, quantitatively determining similarity between documents, and clustering documents according to those similarities.

The system and method also rely on methods for quantitatively representing users in a user population, quantitatively determining similarity between users, clustering users according to those similarities, and visually representing clusters of users by analogy to clusters of documents.

An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query.

In one embodiment, query-relevant information or text (also referred to as “snippets”) is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.

Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.

Techniques for extracting information from a database are provided. A database such as the Web is searched for occurrences of tuples of information. The occurrences of the tuples of information that were found in the database are analyzed to identify a pattern in which the tuples of information were stored.

Additional tuples of information can then be extracted from the database utilizing the pattern. This process can be repeated with the additional tuples of information, if desired.

A search engine for searching a corpus improves the relevancy of the results by refining a standard relevancy score based on the interconnectivity of the initially returned set of documents. The search engine obtains an initial set of relevant documents by matching a user’s search terms to an index of a corpus.

A re-ranking component in the search engine then refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set.

Techniques for finding related hyperlinked documents using link-based analysis are provided. Backlink and forwardlink sets can be utilized to find web pages that are related to a selected web page.

The scores for links from web pages that are from the same host and links from web pages with numerous links can be reduced to achieve a better list of related web pages. The list of related web pages can be utilized as a feature to a word-based search engine or an addition to a web browser.

An improved human user computer interface system, providing a graphic representation of a hierarchy populated with naturally classified objects, having included therein at least one associated object having a distinct classification. Preferably, a collaborative filter is employed to define the appropriate associated object. The associated object preferably comprises a sponsored object, generating a subsidy or revenue.

A system allows a user to submit an ambiguous search query and to receive potentially disambiguated search results. In one implementation, a search engine’s conventional alphanumeric index is translated into a second index that is ambiguated in the same manner as which the user’s input is ambiguated. The user’s ambiguous search query is compared to this ambiguated index, and the corresponding documents are provided to the user as search results.

A process for fabricating a ceramic electroactive transducer of a predetermined shape is disclosed. The process comprises the steps of providing a suitably shaped core having an outer surface, attaching a first conductor to the outer surface of the core, coating an inner conductive electrode on the outer surface of the core such that the inner conductive electrode is in electrical communication with the first conductor, coating a ceramic layer onto the inner electrode, thereafter sintering the ceramic layer, coating an outer electrode onto the sintered ceramic layer to produce an outer electrode that is not in electrical communication with the first conductor, and then poling the sintered ceramic layer across the inner electrode and the outer electrode to produce the ceramic electrode.

A method and system for placing a purchase order with a product trader for a product over a communication network comprises accessing an information site on a shopping server using a shopping client and downloading program code for executing a shopping cart to the shopping client from the shopping server.

The programme code when executed on the shopping client generates a purchase order interface for the user enabling a user to input product selection data and payment data. Order data is generated using the received selection data and payment data and at least the payment data is encrypted. The order data is then transmitted from the shopping client over the communications network to a location for reception by the product trader.

Methods and apparatus consistent with the invention provide improved organization of documents responsive to a search query. In one embodiment, a search query is received and a list of responsive documents is identified. The responsive documents are organized based in whole or in part on usage statistics.

A search system provides search results to searchers in response to search queries and the search results are ranked. The ranking is determined by an automated ranking process in combination with human editorial input.

A search system might comprise a query server for receiving a current query, a corpus of documents to which the current query is applied, ranking data storage for storing information from an editorial session involving a human editor and a reviewed query at least similar to the current query, and a rank adjuster for generating a ranking of documents returned from the corpus responsive to the current query taking into account at least the information from the editorial session.

Methods and apparatus are described for viewing and responding to electronic messages. In one embodiment, when an electronic message is displayed, a portion of the electronic message is elided to aid in the viewing experience.

In one embodiment, a method of viewing a first electronic message, comprises: identifying an extraneous portion within a second electronic message; eliding the extraneous portion within the second electronic message; and generating the first electronic message wherein the first electronic message includes the second electronic message with the extraneous portion of the second electronic message suppressed.

The present invention is directed to a computer-implemented method and apparatus for searching in response to Internet-based search queries using a search engine and an electronic database.

According to one example embodiment of the present invention, data sets representing published items are input, for example, scanned-in or sent electronically, and stored in a searchable database. Each data set includes text from at least one published item. Responsive to the search query, a search engine searches for and identifies relevant web pages and data sets representing published items and, in a more specific embodiment, ranked characterizations are returned for the relevant web pages and published items. An electronic path can be provided with the published item for accessing further information about the published item.

In one embodiment, the electronic path is a hyperlink from a characterization of a relevant published item to a more complete electronic representation of the relevant published item. Publishers provide authorization to display copyrighted materials through a permission protocol.

Advertisers are permitted to put targeted ads on, or to serve ads in association with, various content such as search results pages, Web pages, e-mail, etc., without requiring the advertiser to enter and/or maintain certain targeting information, such as keyword targeting.

This may be accomplished by using a searchable data structure, such as an inverted index for example, of available advertiser Web information. The advertiser Web information may include terms and/or phrase extracted from the advertiser’s Website. In particular, a search query may be used to search for matching advertisers, and therefore matching ads.

For example, the search query can be used to search an inverted index including words and/or phrases extracted from advertiser Websites. The advertiser Web page, or some other identifier, can be used as a key to search for an associated ad.

A system forms search results clustered by address or telephone number. When clustering by address, the system may receive a search query and identify a geographical area of interest based, at least in part, on the search query.

The system may identify documents that are associated with addresses located within the geographical area of interest, group the identified documents into clusters based, at least in part, on the addresses located within the geographical area of interest, and present the clusters as the search results.

When clustering by telephone number, the system may receive a search query that includes at least one portion of a telephone number and identify documents that are associated with telephone numbers that match the at least one portion of the telephone number. The system may group the identified documents into clusters based on the telephone numbers included in the identified documents and present the clusters as the search results.

A system and method for automatically targeting Web-based advertisements is described. Advertisements are identified relative to a query, wherein identified advertisements describe characteristics relative to at least one of a product and a service. The advertisements are scored according to match between the query and the characteristics of the identified advertisements. At least some of the advertisements are provided as Web-based content.

A system identifies a document and obtains one or more types of history data associated with the document. The system may generate a score for the document based, at least in part, on the one or more types of history data.

Concept similarity may be used to help resolve ambiguities with respect to ads served using, at least, keyword targeting. More specifically, concept similarity may be used to help determine ad relevancy and/or ad scores.

One of the contributing factors that I thought might have had something to do with the Florida update is in the list of patents referenced – the one (actually, it was applied for twice, and is listed twice) on local interconnectivity.

But it might not be a stretch to consider how the clustering, rerankings, removals, and reductions in PageRank discussed in this patent could have had a big impact if implemented during one of Google’s regular updates during that period.

I didn’t see any drops in rankings during the Florida update in 2003, but I wasn’t using doorway pages, redirects, repetitious keywords in meta keywords tags, and some of the other tactics that they mention.

Hi Charlie,
The Google God that is controlling everything is really getting on my nerves lately. They always seem to get what they want. I’m sure one of these days I will have a great idea, and I pray they don’t find a loophole to take it away from me with all this patent craziness. I will always take extra precautions when coming into a situation of this matter, and I hope you do too. Take extra care! = )

People are going to read too much into this patent. It does suggest why some spam techniques now fail (placing unconnected doorways on a domain, link bombing in guest books, etc.) but the evidence was there all along that Google has been making statistical comparisons to determine which pages may be falling out of the norm.

It really doesn’t have any relevance to modern SEO practices. Sure, there are plenty of people running around the Web burning domains and dropping links, but they are living in the past.

I think you’re right that people might make too much of this, but I think it does provide some insight into some of the things that we may have seen and experienced in the past, and it gives us a search engineer’s perspective rather than a marketer’s. I find a lot of value in that.

I also appreciated the insight into the possible targeting of bipartite sub-graphs, rather than an analysis of manipulative activity on just a page or host level.

Thanks for sharing your research! Your materials are always enlightening on how to better understand the search engines. I realise this comment may be 3+ years late but you can thank Tony Verre for pointing us to you.

You’re welcome. I’ve been spending a lot of time personally going through a lot of my old posts as well. Some of the ideas in those older posts and some of the patents I wrote about look a little different with the passage of some time.