10 Most Important SEO Patents: Part 2 – The Original Historical Data Patent Filing and its Children

Imagine gathering together 10 extremely knowledgeable search engineers, locking them into a room for a couple of days with walls filled with whiteboards, with the intent of having them brainstorm ways to limit stale content and web spam from ranking highly in search results. Add to their challenge that the methods they come up with should focus upon “the nature and extent of changes over time” to web sites. Once they’ve finished, then imagine taking what appears on those whiteboards and condensing it into a patent.

The end result would likely look like Google’s patent Information Retrieval based on Historical Data. When this patent was originally published as a pending patent application awaiting prosecution and approval back on March 31, 2005, it caused quite a stir in the SEO community. Here are a few of the many reactions in forums and blog posts as a result:

New Google Patent Details Many Google Techniques – msgraph at Webmasterworld appears to have been the first to spot this patent and write about it publicly, and it caused quite a stir amongst the members of the forum when he did.

Google’s Patent – Information Retrieval Based on Historical Data – Rand very quickly worked through the patent, and captured many of the details of what it covered at SEOmoz.

Google’s War on SEO – Documented – A Threadwatch post discusses the impacts of many aspects of the patent, with one member calling it “The most important SEO related document in the last 5 years.”

The impact of this patent continues today. It is likely behind the recent Google freshness update. It may affect the rankings of a page when its content changes and the anchor text pointing to it no longer matches up well. It may also bear on whether Google considers some pages to be doorway pages when they are purchased, links are added to them, and their topics change. There’s also possibly some SEO mythology that sprang up from this patent, seen in forums, blog posts and other places, such as the idea that Google uses the length of registration of a domain name as a sign of whether a site might be web spam or not.

I’ve written a few posts since the Historic Data patent was published, including some very recent ones, that either covered it or one of its children, including the following:

I mentioned that this patent had children. The patent probably covered too much ground, and had too many claims shooting off in different directions, and Google filed a number of followup “divisional” patents that focused more narrowly upon smaller parts of what was covered by the original patent. While a number of these share a name and the descriptions sections are the same, the claims listed within them are different. Some of these updates may be responsible for things like Google’s freshness update from the beginning of November.

Here are the patent filings that came out of the Historical Data patent. Note that for many of them that have been refiled under the same name, the claims have been changed. For a few of those, the claims have been changed considerably. For others, the majority of the claims were canceled for one reason or another. If the abstract remained the same for those, I’m only showing it once.

A method may include receiving a document and an initial score for the document; determining that there has been a decrease in a rate or quantity of new links that point to the document over time; classifying the document as stale in response to the determining; decreasing the initial score for the document, resulting in an updated score; and ranking the document with regard to at least one other document based, at least in part, on the score.

A system may determine time-varying behavior of links pointing to a document, generate a score for the document based, at least in part, on the time-varying behavior of the links pointing to the document, and rank the document with regard to at least one other document based, at least in part, on the score.

A system may determine an extent to which a document is selected when the document is included in a set of search results, generate a score for the document based, at least in part, on the extent to which the document is selected when the document is included in a set of search results; and rank the document with regard to at least one other document based, at least in part, on the score.

A system determines a freshness of a first document. The system determines whether a freshness attribute is associated with the first document. The system identifies, based on the determination, a set of second documents that each contain a link to the first document. The system assigns a freshness score to the first document based on a freshness attribute associated with each document of the set of second documents or the freshness attribute associated with the first document.

A system may determine a document inception date associated with a document, generate a score for the document based, at least in part, on the document inception date, and rank the document with regard to at least one other document based, at least in part, on the score.

A system may determine a measure of how a content of a document changes over time, generate a score for the document based, at least in part, on the measure of how the content of the document changes over time, and rank the document with regard to at least one other document based, at least in part, on the score.

The Original Provisional Historical Data Patent Application

I’ve written about a few of the new patents that followed from the granted Historical Data patent, but while I was researching this post I came across the original filing for that patent in the USPTO Patent Application Information Retrieval database, which as far as I can tell hasn’t been published on the Web before. As the Historical Data patent notes:

This application claims priority under 35 U.S.C. § 119 based on U.S. Provisional Application No. 60/507,617, filed Sep. 30, 2003, the disclosure of which is incorporated herein by reference.

While there’s still a fair amount of technical language in the provisional version of the patent filing, I thought it was worth publishing and sharing here. In many ways, it’s more readable than the version of the patent that was granted, and provides some insights that might not be as clear from the one that was granted.

It’s possible that aspects of it or any of those from the other granted and pending patent filings I’ve pointed to haven’t been implemented by Google, but I think it does a great job of giving us a window into what people were thinking at Google back on September 30, 2003, when it was filed, and a number of practices that may still be in place at Google today.

Keep in mind as you read this that Google has changed considerably since 2003, that the child patent filings that sprang from this one added a number of changes, some of them very recently, that parts of this patent filing might never have been implemented by Google, and that the version of this patent which was granted by the US patent office is different in a number of ways from this original provisional patent application.

The present invention relates to the field of search engines, and more particularly, to methods and systems for using temporal information to assess relevancy of (potential) search results by a search engine.

BACKGROUND OF THE INVENTION

Improving mechanisms for searching vast numbers of documents, such as that available via the World Wide Web (“web”), for information has increasingly been an area of focus and research. This is especially due to the continued growth in the number of computer users, services and applications offered for example on the web, and of course, the amount of information being added to the web and other databases.

A search engine is a widely used mechanism for allowing web users to search the web for information. Modern search engines, such as those that search the web, typically index crawled documents as a document corpus through which they search in response to a user query.

The basic functionality of a search engine involves receiving a search query (e.g., one or more search terms or “keywords” describing the desired information) from a user and then generating for output to the user one or more relevant results. A typical search engine is one that uses a query-dependent technique, generating its search mainly based on matching characters, words, or phrases in a query to those in a corpus of pre-stored documents. Depending on factors involving how often, where, in what format, etc., that one or more search terms occur in a document, such a search technique will then rank, sort and then return the search results for output to the user, usually as a list of hyperlinks sorted according to their relevance.

Ideally, a search engine, in response to a given user’s query, will provide the user the most relevant results as quickly as possible. However, an exclusively query-dependent, “intra-document” search technique (i.e., one exclusively based on how terms in a query match terms in a document) suffers from a number of drawbacks that may often result in irrelevant documents being included as high ranked results. For example, one might “spam” such a search engine merely by including search terms in highly weighted locations and formats (e.g., in a URL or title, in bold/large font, etc.). In some situations, such terms may even have nothing to do with the document’s actual content but may be used by a spamming document to obtain higher placement in ranked results returned by a search engine.

For this and other reasons, search engines which factor criteria other than query-dependent information retrieval (IR) have been developed. One example of such a search engine has been proposed in a paper entitled, “The Anatomy of a Large-Scale Hypertextual Search Engine,” by Sergey Brin and Larry Page (also see U.S. Patent No. 6,285,999, issued September 4, 2001 to Larry Page). The search engine Brin et al. propose analyzes “off document” factors, and in particular, analyzes links to or from a document to score the document, with the premise that documents for which more/better links exist are likely to be more important and thus should be scored accordingly. Because such a scheme of ranking is not dependent on a given query (and in fact, can be performed by a search engine prior to receipt of queries), overall search quality can be significantly improved over less advanced search techniques.

Nonetheless, even more advanced search engines that use off-document scoring criteria may produce undesirable results in some situations. For example, a search engine such as that proposed by Brin et al. may be susceptible to “artificial” or “spammed” links that are generated in volume in an aim to make the document seem important to a search engine, and thereby increase the document’s rank score. In other circumstances a “stale” document may have many links associated with it because of its age and thus obtain a relatively high score by such a search engine, even though a “fresher” document, in association with which there may not be as many links, may in fact be a more relevant document for a given query.

Thus, there remains a need to improve the quality of results generated by search engines.

BRIEF DESCRIPTION OF THE FIGURES (note – the images from the filing didn’t really seem all that useful, so I didn’t include them here)

Figure 1 is a flow diagram of a method for using temporal information to score documents for search, in accordance with one embodiment of the invention.

Figure 2 is a diagram illustrating an exemplary system in which concepts consistent with the present invention may be implemented.

SUMMARY OF THE INVENTION

The present invention provides methods and systems for using historical (e.g., the nature and extent of change over time of) information associated with one or more documents and/or queries for search thereof to score the one or more documents. Such documents can be stored and/or searched locally (e.g., on a client computer system) and/or throughout a network (e.g., the Internet, an intranet, etc.).

DETAILED DESCRIPTION

The present invention provides methods and systems for improving the quality of search results generated by a search engine for searching a document database that may include linked documents, e.g., the web or other citation or hypermedia-based document database. In accordance with one aspect of the invention, temporal information (e.g., when/how often/what/to what extent information associated with a document is time-varying) is used to score the relevance of one or more documents. In accordance with another aspect of the invention, other criteria may be used to rank document relevance.

OVERVIEW OF ONE ASPECT OF THE INVENTION – TEMPORAL-BASED SCORING

Figure 1 is a flow diagram of a method for using temporal information to score documents for search, in accordance with one embodiment of the invention.

At block 102, a search engine derives temporal information relating to a document. As used herein, a document is to be broadly interpreted to include any machine-readable and machine-storable work product. A document may be an email, a file, a combination of files, one or more files with embedded links to other files, etc. In the context of the Internet, a common document is a web page or associated website, e.g., in HTML, PDF, or one or a combination of other typical web document formats. Web pages may include one or more types of content and may include embedded information (such as meta information, hyperlinks, etc.) and/or embedded instructions (such as Javascript, etc.).

At block 104, the search engine, based on the time based information, assigns a score to the document. In one embodiment, the score is used to rank the document (e.g., relative to other documents) for search operations.

A search engine in which the present invention is implemented may take into consideration one or a combination of novel time-based factors to derive a score for documents. Without limitation, such factors may include the following:

1. DOCUMENT INCEPTION DATE

In one embodiment of the invention, the set of one or more temporal criteria used to score a document includes a document’s inception date. Though the present invention may take into account a document’s inception date as provided by a “biased” source (e.g., a webserver of a document, which can update inception date so that it is always current), other indications of inception date as described below may also be used in one or more embodiments of the invention.

In one embodiment of the invention, the inception date of a document is the date a search engine first learns of or indexes the document. The term “date” is used broadly here, and may thus include time and date measurements. The search engine may discover the document through crawling, submission of the document (or representation/summary thereof) to the search engine from an “outside” source, a combination of crawl or submission-based indexing techniques, etc.

In another embodiment, an indirect measure of inception date may be used. For example, in one embodiment, the date that a domain in association with which a document is registered may be used as an indication of the inception date.

In another embodiment, the first time a document is referenced in another document, such as a news article, newsgroup, mailing list, or a combination of one or more such documents, may be used to infer an inception date.

In yet another embodiment, the date the search engine first discovers a link to a document may be used to determine the document’s inception date.

In one embodiment, the inception date of a document is used for link-based scoring of the document. In this embodiment, the invention may assume that a document having a fairly recent inception date, but for which a significant number of links exist (e.g., from other documents), may be considered more important than a much older document for which a larger number of links exist. For example, a document that is created yesterday but to which 10 links exist may be more important than a document to which 100 links exist but whose inception date is 10 years ago, since the rate of link growth for the former is relatively higher. However, as further described below, in accordance with one aspect of the invention, a spiky rate of growth in a factor typically used by a search engine to score documents, e.g., the number of links, may also signal an attempt to spam the search engine. Accordingly, it should be appreciated that the search engine may actually lower the score of a document(s) in such situations to prevent spamming.

Thus, in one embodiment of the invention, a search engine may use the inception date to determine a rate at which links to a document are created (e.g., as an average per unit time based on the number created since the inception date or some window in that period). This rate can then be used to score a document, for example, giving more weight to documents to which links are generated more often.

In one embodiment, adjusting document scoring in a link-based search engine is accomplished by solving the following equation:

H = L/log (F+2),

where H is the history-adjusted link score, L is the link score given to the document, which can be derived using any known link scoring technique (e.g., as described by Brin et al. in the article mentioned above) that assigns a score to a document based on links to/from the document, and F is elapsed time measured from the inception date associated with the document.
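
As a rough illustration of the equation above, here is a minimal Python sketch that computes H. The filing does not state the base of the logarithm or the unit of F, so the natural logarithm and an elapsed time measured in days are assumptions, and the function name is hypothetical.

```python
import math


def history_adjusted_link_score(link_score: float, elapsed_days: float) -> float:
    """Return H = L / log(F + 2).

    Assumptions (not stated in the filing): natural logarithm, and
    F measured in days since the document's inception date.
    """
    if elapsed_days < 0:
        raise ValueError("elapsed time must be non-negative")
    # The "+ 2" keeps the denominator positive even when F is 0.
    return link_score / math.log(elapsed_days + 2)


if __name__ == "__main__":
    # Same link score L, different ages: the newer document keeps more of it.
    print(history_adjusted_link_score(10.0, 1))
    print(history_adjusted_link_score(10.0, 3650))
```

For a fixed link score, H shrinks as the document gets older, which matches the stated aim of giving younger documents credit for links they have gathered quickly.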

2. CONTENT UPDATES/CHANGES

In one embodiment of the invention, the set of one or more temporal criteria used to score a document may include the manner in which a document’s content changes over time. In one embodiment, a document whose content is edited often may be scored differently than one that remains static over time. In one embodiment, a document having a relatively large amount of its content updated over time might be scored higher than one having a relatively small amount of its content updated over time.

In one embodiment, a content update score, U, is determined as a function of both an update frequency score, UF, and also an update amount score, UA:

U = f(UF, UA).

In one embodiment, UF represents how often a document is updated, which may be determined in a number of ways, including as an (average) time between updates, or the number of updates in a given time period, etc.

Moreover, with these and other temporal criteria described herein, the rate of change in a current time period can be compared to the rate of change in another (e.g., previous) time period, for example, to determine whether there is an acceleration or deceleration trend. Documents for which there is an increase in the rate of change might be scored higher than those which have a steady rate of change, even if that rate of change is relatively high.

The update amount score, UA, represents how much of a document (e.g., a web page, a web site, etc.) has changed over time. This score can also be determined in one or more ways, including without limitation, (1) the number of “new” or unique pages on a site over a period of time; (2) the ratio of the number of new or unique pages on a site over a period of time versus the total number of pages on that site; (3) the (average) amount that the document is updated over one or more periods of time (e.g., n % of a web page’s visible content may change over a period t). In one embodiment, UA is determined as a function of (1), (2), the average monthly amount of change, and the amount of change in the most recent n days.

In one embodiment, UA may be determined as a function of differently weighted portions of document content. For instance, in one embodiment, changes to content such as Javascript, comments, advertisements, navigational elements, boilerplate, or date/time tags may be given relatively little weight or even ignored altogether when determining UA. In one embodiment, any content that is determined to be a date, time or advertisement in a document is ignored. On the other hand, content deemed to be important if updated (e.g., more often, more recently, more extensively, etc.) could be given more weight when determining UA. For example, in one embodiment of the invention, changes in the title or outgoing anchor text of a document are given more weight than changes in other text.
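
To make the idea of differently weighted portions concrete, here is a minimal Python sketch. It assumes a document has already been split into named sections, and the section names and weight values are hypothetical. The filing only says that titles and outgoing anchor text may count for more, and that dates and advertisements may be ignored.

```python
# Hypothetical weights: higher for title and outgoing anchor text,
# zero for dates and advertisements, as the passage above suggests.
SECTION_WEIGHTS = {
    "title": 3.0,
    "anchor_text": 3.0,
    "body": 1.0,
    "date": 0.0,
    "ads": 0.0,
}


def update_amount(old: dict, new: dict, weights: dict = SECTION_WEIGHTS) -> float:
    """Return the weighted fraction of sections that changed (0.0 to 1.0)."""
    total = 0.0
    changed = 0.0
    for section in set(old) | set(new):
        weight = weights.get(section, 1.0)  # unknown sections get weight 1.0
        total += weight
        if old.get(section) != new.get(section):
            changed += weight
    return changed / total if total else 0.0


if __name__ == "__main__":
    before = {"title": "Widgets", "body": "All about widgets.", "date": "Monday"}
    after = {"title": "Widgets", "body": "All about widgets.", "date": "Tuesday"}
    # Only the date changed, and dates carry zero weight in this sketch.
    print(update_amount(before, after))
```

A real system would more likely measure how much of each section changed rather than whether it changed at all. The all-or-nothing comparison here just keeps the sketch short.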

To efficiently manage data storage resources when monitoring content changes to documents, in one embodiment of the invention, a search engine system stores and uses “signatures” of documents instead of the (entire) documents themselves to detect changes to document content. In one embodiment, a term vector for a document, e.g., a web page, is stored and monitored for relatively large changes. In another embodiment, a relatively small amount of the documents that are determined to be important or the most frequently occurring (excluding “stop words”) may be stored and monitored. In yet another embodiment, a summary or other representation of a document may be maintained and monitored for change thereto. In one embodiment, a simhash, for example, for detecting near-duplication of a document, may be computed and monitored for change, since even a relatively small change in a simhash may be considered by the search engine to indicate a relatively large change in its associated document. It should be appreciated that such techniques may not be implemented in various embodiments of the invention, for example, if adequate data storage resources exist to perform one or more of the techniques described herein.
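
The filing does not spell out how a simhash is computed. As a sketch only, the following Python code uses one common formulation: hash every token, let each token vote on every bit, and keep the majority bit. The tokenizer (whitespace splitting), the 64-bit width and the use of SHA-256 for token hashes are all assumptions, and the function names are hypothetical.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """Return a 64-bit simhash-style fingerprint of whitespace-split tokens."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for i in range(BITS):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set in the fingerprint when more tokens voted 1 than 0.
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    old = simhash("google patent on historical data and link freshness")
    new = simhash("google patent on historical data and anchor text")
    print(hamming_distance(old, new))
```

Storing only the fingerprint from each crawl and comparing Hamming distances between crawls is one way to monitor for change without keeping whole documents.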

Again, using these techniques, various portions of a document’s content may be given different weighting.

3. QUERY ANALYSIS

In one embodiment of the invention, one or more query-based factors, which may or may not be temporal, are used to score documents that relate to the query. For example, one query-based factor that may be used in an embodiment of the invention is the extent to which a document is selected over time, when the document is included in a set of search results. In this embodiment, documents selected relatively more often/increasingly by users might be scored higher than others.

Another query-based factor that may be taken into account in scoring documents is the occurrence of certain search terms appearing in queries over time; for instance, if over a period of time a particular set of terms is increasingly appearing in queries (e.g., the terms could relate to a “hot” topic that is gaining/has gained popularity, such as an in-demand news event), then documents associated with such queries may be scored higher than those not containing such terms.

Similarly, a change over time in the number of search results generated by similar queries may be used to score the documents corresponding to those results; again, a significant increase, for example, might indicate a hot topic and cause the search engine to increase the rankings of documents related to such queries.

In one embodiment of the invention, if, for example, a query remains relatively constant (e.g., “world series champions”) but the results change over time (e.g., documents relating to a particular team dominate search results in a given year/time of year), such change may be monitored and used to score documents accordingly. This query-based factor may also depend on other factors to signal that a document is “stale,” including without limitation diminishment in anchor growth, traffic, content change, in/out link growth, etc. Additionally, if over time a particular document is included in mostly topical queries (e.g., “World Series Champions”) versus more specific queries (e.g., “New York Yankees”), then this query-based factor, by itself or with others mentioned above, may be used to lower a score for a document that appears to be stale.

In one embodiment, a search engine may monitor and score a document depending on the extent to which, rate, etc., the document appears in results for different queries. In other words, the entropy of queries for one or more documents may be monitored and used as a basis for scoring. In one embodiment, if the queries are discordant, this may (though not necessarily) be considered a signal that the document is spam, in which case the search engine may score the document relatively lower.
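
The filing does not define how the “entropy of queries” is measured. One plausible reading, shown here as a minimal Python sketch, is the Shannon entropy of the distribution of queries for which a document appears in results. That interpretation, the function name and the use of exact query strings are all assumptions.

```python
import math
from collections import Counter


def query_entropy(queries: list) -> float:
    """Return the Shannon entropy (in bits) of a list of query strings.

    Each entry is one occasion on which the document appeared in the
    results for that query. Higher values mean the appearances are
    spread across more distinct queries.
    """
    counts = Counter(queries)
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )


if __name__ == "__main__":
    focused = ["world series champions"] * 4
    scattered = ["cheap loans", "world series", "diet pills", "used cars"]
    print(query_entropy(focused), query_entropy(scattered))
```

Entropy computed this way only captures how spread out the queries are, not whether they are topically unrelated, so a real “discordance” signal would presumably need some measure of query similarity as well.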

In some situations, what might be considered a “stale” document, as determined by one or a combination of the temporal, query-based or other criteria described herein, should not necessarily be scored lower than a relatively “fresh” document (e.g., one in connection with which data is updated often, at an increasing rate, to a great extent, etc.). In other words, there may be a need to determine whether, how (e.g., positively or negatively), and to what extent temporal criteria should affect the scoring of a document. Thus, in one embodiment of the invention, one or more factors, which may or may not be query-based, may be used to determine whether, how or to what extent temporal information should be used to score stale pages versus “fresh” documents. For example, if for a given query, users over time tend to select a lower ranked relatively stale result over a higher ranked though relatively fresh result, this may be used by the search engine as an indication to adjust a score of the stale document up.

In another embodiment of the invention, determining how to use temporal information relating to a document in a set of search results may be determined as a function of the temporal information associated with the other documents in the set of search results. For instance, in one embodiment, to the extent a score is assigned to a document based on one or more temporal factors, the score may be adjusted based on the difference between one or more temporal factors relating to that document and to the (average) of the set of search results in which that document is included; such temporal factors can include one or more of the factors described herein, such as inception date, content update over time, etc.

Still, the invention in its various embodiments may use other ways to adjust scores for giving/not giving preference to fresh or stale documents. For example, a link based score (such as the one proposed by Brin et al.), since typically biased toward relatively stale sites to which a number of links may have accumulated over time, may be adjusted by some factor. Alternatively, an information retrieval (IR) score can be adjusted to account for a bias toward fresh or stale documents.

4. LINK-BASED CRITERIA

In one embodiment of the invention, the time-varying behavior of links may be used as a basis for assessing a document’s “freshness” and in turn, performing a ranking operation based on that behavior.

In one embodiment, a search engine may monitor the time/date when a link (e.g., a hyperlink) to a document appears or disappears in a crawl or index update operation. Using this date as a reference, the search engine may then monitor the time-varying behavior of links to the document: e.g., whether and at what rate links appear or disappear over time, how many links appear or disappear during a given time period, whether there is a trend toward appearance of new links versus disappearance of existing links to the document, etc.

Using the time-varying behavior of links to (and/or from) a document, the search engine may rank the document accordingly. For example, a downward trend in the number or rate of new links (e.g., based on a comparison of number or rate of new links in a recent time period versus an older time period) over time could signal to the search engine that a document is stale, in which case a search engine may decrease the document’s (relevancy) score. Conversely, an upward trend may signal a “fresh” document that might be considered more relevant, depending on the particular situation and implementation of the invention.
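
The comparison of a recent period against an older one can be sketched in a few lines of Python. This is only an illustration: the 30-day window, the representation of each link by its age in days, and the function name are assumptions, and the filing does not say how large a difference should count as a trend.

```python
def new_link_trend(link_ages_days: list, window_days: int = 30) -> int:
    """Compare new links in the latest window with the window before it.

    link_ages_days holds, for each inlink, the number of days since the
    link was first seen. A positive result suggests an upward trend in
    new links, a negative result a downward one.
    """
    recent = sum(1 for age in link_ages_days if 0 <= age < window_days)
    older = sum(
        1 for age in link_ages_days if window_days <= age < 2 * window_days
    )
    return recent - older


if __name__ == "__main__":
    # Three links first seen this month, one the month before.
    print(new_link_trend([1, 2, 3, 40]))
```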

Additionally, the inception date and/or “freshness” of documents in which links to a document of interest appear may also be used to assess that document’s freshness, which in turn, is used to determine its relevance. The age distribution, for example, of links (or the documents in which they appear) to a document may be used by a search engine to determine whether that document is fresh or not. Fresh documents may be considered those to which links have appeared more recently (i.e., links are relatively “young”), have an upward trend in the number or rate of new links thereto being established, etc., and such documents may be scored relatively high or low depending on the circumstance.

By analyzing the change in the number of or rate of increase/decrease of inlinks of a document (e.g., a web page or site) over time, a search engine in which a method consistent with an embodiment of the invention may be implemented may derive a valuable signal of how fresh the document is. For example, if such analysis is reflected by a curve that is dropping off, this may signal that the document may be dormant (stale), that is, no longer updated, diminished in importance, superseded by another document, etc.

In one embodiment, the analysis depends on the number of new links to a document: first, the search engine may monitor the number of new links to a document in the last n days, compared to the number of new links since the document was first found, or alternatively, the oldest age of the most recent n% of links compared to the age of the first link found.

For the purpose of illustration, consider n = 10 and two documents (web sites in this example), site A and site B, that were both first found 100 days ago. For site A, 10% of the links were found less than 10 days ago, while for site B 0% of the links were found less than 10 days ago (they were all found earlier). In this case the metric results in 0.1 for site A and 0 for site B. The metric may be scaled appropriately.
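
The two sites in the illustration above can be reproduced with a short Python sketch. The link ages used for each site are made up to match the stated percentages, and the function name is hypothetical.

```python
def recent_link_fraction(link_ages_days: list, n: int = 10) -> float:
    """Return the fraction of links first found less than n days ago."""
    if not link_ages_days:
        return 0.0
    recent = sum(1 for age in link_ages_days if age < n)
    return recent / len(link_ages_days)


if __name__ == "__main__":
    # Hypothetical data: ten links per site, both sites first found 100 days ago.
    site_a = [5] + [50] * 9  # one of ten links is under 10 days old
    site_b = [50] * 10       # every link is older than 10 days
    print(recent_link_fraction(site_a), recent_link_fraction(site_b))
```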

In one embodiment where such a metric is employed, the metric may be improved by performing a relatively more detailed analysis of the distribution of link dates, for example building models that predict if a particular distribution signifies a particular type of site (e.g., no longer updated, increasing or decreasing in popularity, superseded, etc.).

In one embodiment, each link is weighted by a function which increases with the freshness of the link. As such, the relevancy score of a document to which there are links will be raised or lowered as a function of the sum of the weights of the links pointing to it. In one embodiment, this technique may be employed recursively: for example, assuming a document S is 2 years old, S may be considered fresh if n% of the links to S are fresh or if the documents containing the in-links to S are considered fresh. The latter can be checked by using a combination of the creation date of the page and this technique applied recursively.
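
The filing does not name the weighting function. As a minimal sketch, the Python code below assumes an exponential decay with a 180-day half-life, so that a brand-new link has weight 1 and the weight halves for every further 180 days of age. The decay shape, the half-life and the function names are all assumptions, and the recursive variant is not covered.

```python
def link_weight(age_days: float, half_life_days: float = 180.0) -> float:
    """Return a weight that is higher for fresher (younger) links.

    Assumed form: exponential decay with the given half-life.
    """
    return 0.5 ** (age_days / half_life_days)


def freshness_weighted_link_sum(
    link_ages_days: list, half_life_days: float = 180.0
) -> float:
    """Sum the freshness weights of all links pointing to a document."""
    return sum(link_weight(age, half_life_days) for age in link_ages_days)


if __name__ == "__main__":
    # Two documents with three inlinks each: one set fresh, one set old.
    print(freshness_weighted_link_sum([1, 5, 10]))
    print(freshness_weighted_link_sum([700, 800, 900]))
```

With the same number of inlinks, the document whose links are younger ends up with the larger sum, which is the behavior the passage describes.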

The dates of links can also be used to detect “spam” where owners of websites or their colleagues create links to their own site for the purpose of boosting their rank score by a search engine. A typical, “legitimate” website attracts inlinks slowly. A large spike in the quantity of inlinks may signal a topical phenomenon (e.g., the CDC web site may develop many links quickly after an outbreak such as SARS), or signal attempts to “spam” search engines (to obtain a higher ranking and thus placement in search results) by exchanging links, purchasing links, or gaining links from web sites without editorial discretion on making links. Examples of web sites that give links without editorial discretion include guestbooks, referrer logs, and “free for all” pages that let anyone add a link to a page.

As such, to the extent a search engine uses the number of links to a document to rank a document (hereinafter, a “link-rank”), in one embodiment of the invention, the time-varying behavior of this factor may be used to detect spam, hot topics, etc. For example, in one embodiment of the invention, hysteresis may be employed to allow a link-rank to grow at a certain rate. In another embodiment, the link-rank for a given document may be allowed a certain maximum threshold of growth over a predefined window of time. Any or a combination of these techniques may prevent spamming.
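A maximum threshold of growth per window could be sketched as a simple ceiling on how far a link-rank may rise between two measurements. The 25% limit, the small absolute allowance for documents starting near zero, and the function name are all assumptions for illustration.

```python
def capped_link_rank(previous_rank, observed_rank,
                     max_growth_ratio=0.25, floor_allowance=1.0):
    """Limit how much a link-rank may rise within one time window."""
    # Allow growth of up to max_growth_ratio, but never less than a small
    # absolute allowance so that new documents can still gain rank.
    ceiling = max(previous_rank * (1.0 + max_growth_ratio),
                  previous_rank + floor_allowance)
    return min(observed_rank, ceiling)

print(capped_link_rank(100.0, 400.0))  # 125.0, the spike is held back
print(capped_link_rank(100.0, 110.0))  # 110.0, normal growth passes through
```

Applying the cap window after window lets a genuinely popular document reach its full link-rank eventually, while a sudden spike only takes effect gradually.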

Moreover, to allow topical phenomenon to be distinguished from spam, in one embodiment of the invention where link-rank is only allowed a certain amount of growth over time, an exception may be made to the extent documents in which link growth is taking place are determined to be authoritative in some respect. For example, if an unusual spike in the number or rate of increase of links to a document occurs in government web sites (e.g., .gov sites), a web directory (e.g., Yahoo), or in documents which themselves have a relatively steady and high link-rank over time, then the search engine may consider such a document not to be spam and thus allow a relatively high or even no threshold for (growth of) its link-rank (over time).

On the other hand, in one embodiment, the date at which one or more links to a document disappear, the number of links that disappear in a given window of time, or some other time-varying decrease in the number of links (or links/updates to the documents containing such links) to a document, may be monitored by a search engine to identify documents that may be considered stale. Once a document has been determined to be stale, the links contained in that document may be discounted or ignored by a link-rank mechanism of a search engine.

5. ANCHORTEXT

In one embodiment of the invention, the time-varying behavior of anchortext (e.g., the text in which a hyperlink is embedded, typically underlined or otherwise highlighted in a document) associated with a document may be used to score the document. For example, in one embodiment, changes over time in anchortext corresponding to inlinks to a document may be used as an indication that there has been update or even change of focus in the document; a relevancy score may take this change(s) into account.

Moreover, because some search engines use anchortext as a factor in scoring documents, in one embodiment of the invention, the time-varying behavior of documents and/or anchortexts inbound thereto may be used by such a search engine to detect when a domain changes and in turn help prevent generating results based on outdated anchortexts. Changes in anchortext can signal that the focus of the linked document has changed. Alternatively, if the content of a document changes very much from associated inbound anchortext, then, for example in the case the document is a web site, the domain for the site may have changed completely from a previous incarnation. For example, this may occur if a domain has expired and a different party purchases the domain.

Then the domain may continue to be included in search results for queries that are no longer on-topic, since in some ranking schemes, anchortext may be considered to be part of the document it points to. In one embodiment of the invention, a search engine addresses this situation by estimating the date that a domain changed its focus (e.g., based on when the text on the page changed significantly and/or when the text in the new anchortext changed significantly), and discounting or ignoring all links/anchortext from before that date.
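Once a focus-change date has been estimated, discounting or ignoring older anchortext is a simple filter. The sketch below assumes each anchor is stored with the date it was first seen; the data layout, the function name, and the sample anchors are hypothetical.

```python
from datetime import date

def usable_anchors(anchors, focus_change_date, old_weight=0.0):
    """Return (text, weight) pairs, down-weighting anchors that were
    first seen before the estimated focus-change date."""
    weighted = []
    for text, first_seen in anchors:
        weight = old_weight if first_seen < focus_change_date else 1.0
        if weight > 0:  # a weight of 0 means the anchor is ignored
            weighted.append((text, weight))
    return weighted

anchors = [
    ("ice fishing tips", date(2003, 1, 1)),
    ("blackjack strategy", date(2005, 2, 1)),
]
change = date(2004, 6, 1)  # assumed date the domain changed focus

print(usable_anchors(anchors, change))                  # ignore old anchors
print(usable_anchors(anchors, change, old_weight=0.2))  # discount them instead
```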

6. TRAFFIC

In one embodiment of the invention, time-varying characteristics of traffic to, or other “use” of, a document by one or more users is factored into the scoring of that document. For example, a web site that has experienced a large reduction in traffic may no longer be updated or may be superseded by another site.

In one embodiment of the invention, a search engine compares the average traffic to a site over the last n days (n may equal 30, for example) to the average traffic during the month where the site received the most traffic, optionally adjusted for seasonal changes, or during the last m days (e.g., m may equal 365). Optionally, in one embodiment of the invention, a search engine may identify repeating traffic patterns or perhaps a change in traffic patterns over time; e.g., a document may be more or less popular (i.e., have more or less traffic) during summer, weekends or some other seasonal time period, during and outside of which the search engine may adjust its relevancy score accordingly.
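The comparison of recent traffic against a longer baseline can be sketched as a ratio. The example below uses the simpler of the two baselines mentioned, the average over the last m days, and leaves out seasonal adjustment; the function name and the sample visit counts are assumptions.

```python
def traffic_trend(daily_visits, n=30, m=365):
    """Ratio of average daily visits over the last n days to the
    average over the last m days. Returns None without usable data."""
    recent = daily_visits[-n:]
    baseline = daily_visits[-m:]
    if not recent or not baseline:
        return None
    baseline_avg = sum(baseline) / len(baseline)
    if baseline_avg == 0:
        return None
    return (sum(recent) / len(recent)) / baseline_avg

steady = [100] * 365                 # same traffic all year
declining = [100] * 335 + [50] * 30  # traffic halves in the last month

print(traffic_trend(steady))     # 1.0
print(traffic_trend(declining))  # below 1.0, a possible staleness signal
```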

Additionally, in one embodiment, time-varying factors relating to “advertising traffic” for a particular document(s) may be monitored and used for scoring a document. For example, a search engine may monitor one or a combination of the following to make scoring decisions about a document: (1) the extent to and rate at which advertisements are presented or updated by a given document over time; (2) the quality of the advertisers (e.g., a document whose ads refer/link to documents known to the search engine over time to have relatively high traffic and trust, such as amazon.com, may be given relatively more weight than those whose ads refer to low traffic/untrustworthy documents, such as a new pornographic site); (3) the extent to which the advertisements generate user traffic to the documents to which they relate (e.g., their click-through rate), etc.

7. USER BEHAVIOR

In one embodiment, individual or aggregate user behavior over time may be used to score one or more documents. For example, in one embodiment of the invention, the number of times a document is selected from a set of search results and/or the amount of time one or more users spend on the document may be used to score that document. For example, if a web page is returned for a certain query, and over time or in a given time window, users spend either more or less time on average on the document given the same or similar query, then this situation may be used as an indication that the document is fresh or stale, respectively. The search engine may score the document accordingly.

8. DOMAIN-RELATED INFORMATION, DNS/WHOIS

In one embodiment, information relating to how a document is served over a computer network (e.g., the Internet, an intranet or other network or database of documents), which information may or may not be time-based, may be used to score the relevance of the document.

For example, those who attempt to deceive search engines often use throwaway or “doorway” domains, and attempt to obtain as much traffic as possible before being caught. Signals that distinguish between these fly-by-night types of domains can be used in scoring. For example, domains can be renewed up to a period of 10 years, and valuable domains are often paid for several years in advance, while doorway domains rarely are used for more than a year. The date when a domain expires in the future can be used as a factor in legitimacy of a document(s) associated therewith.

In one embodiment, the DNS (domain name system) record for a document may be monitored to score the document. The domain name record contains details of who registered the domain, administrative and technical addresses, and the addresses of name servers (machines that resolve the domain name into an IP address). By analyzing this data over time for a domain, spamming or other “sham” domains may be identified and scored accordingly. For instance, a search engine may monitor whether physically correct address information exists over a period of time, whether contact information for the domain changes relatively often, and whether there is a relatively high number of shifts between different nameservers and hosting companies. In one embodiment, a list of known-bad contact information and known-bad nameservers and IP addresses may be identified, stored and used as scoring factors.

In one embodiment, the age of a nameserver may also be a factor in scoring. A “good” nameserver (one that should be assigned a relatively higher score) will typically have a mix of different domains from different registrars and have a history of hosting those domains, while a “bad” nameserver (one that should receive a relatively lower score) might host mainly porn or doorway domains or domains with commercial words (a common indicator of spam), or might be brand new, or might host primarily bulk domains from a single registrar. Again, the newness of a nameserver might not automatically be a negative factor in scoring, but in combination with other factors, such as ones described herein, it could be.

9. RANKING HISTORY

In one embodiment, the time-varying behavior of how a document is ranked in response to search queries to a search engine may be used to adjust the score of that document. Referring to an exemplary embodiment of the invention as implemented by a search engine for searching the Internet, the search engine may determine that a domain which jumps in rankings across many queries might be a topical site or it could signal an attempt to “spam” the search engine.

Thus, the quantity or rate that a site moves over a period of time might be used as a scoring factor. In one embodiment, for each set of search results, a domain may be weighted according to its position in the top N search results. For N=30, one example function might be [((N+1)-SLOT)/N]^4. Then a #1 result will receive a score of 1.0, down to a score near 0 at the Nth position. A query set (e.g. of commercial queries) can be repeated, and sites that gained more than M% in the rankings may be flagged, or the percentage growth in ranking may be used as a signal in ranking.
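The position weighting and the M% flag can be sketched in a few lines, taking the example function to be ((N+1) - SLOT)/N raised to the fourth power, which gives 1.0 for the top slot and a value close to 0 for slot N. The function names, the 50% threshold, and the sample domains are assumptions for illustration.

```python
def slot_weight(slot, n=30):
    """Weight for a result at position `slot` (1-based) in the top n."""
    if not 1 <= slot <= n:
        return 0.0
    return (((n + 1) - slot) / n) ** 4

def flag_gainers(before, after, m_percent=50.0):
    """Flag domains whose summed slot weights over a query set grew by
    more than m_percent between two runs of the same queries."""
    flagged = []
    for domain, new_score in after.items():
        old_score = before.get(domain, 0.0)
        # Domains with no earlier score are skipped here, because a
        # percentage gain from zero is undefined.
        if old_score > 0 and (new_score - old_score) / old_score * 100 > m_percent:
            flagged.append(domain)
    return flagged

print(slot_weight(1))   # 1.0
print(slot_weight(30))  # close to 0

before = {"a.example": 1.0, "b.example": 1.0}
after = {"a.example": 2.0, "b.example": 1.1}
print(flag_gainers(before, after))  # ['a.example']
```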

In one embodiment, the search engine may determine that a query is likely commercial if the average (median) information retrieval (IR) score of the top results is relatively high and there is a significant amount of change in the top ten results from month to month. In one embodiment, churn may also be monitored and factored by a search engine as an indication of a commercial query.

In one embodiment, in addition to history of positions (or ranking) of documents for a given query, a search engine may score a document (and in the case of an Internet document, this may be done on a page, host, site, and domain basis) based on one or a combination of other time-based factors. Such factors may include the number of queries for which, and the rate at which (increasing/decreasing) a document is generated as a search result over time; seasonality, burstiness and other patterns over time that a document is generated as a search result; or changes in IR scores over time for a URL-query pair.

Alternatively or in addition, in one embodiment, a number of document (e.g., URL) independent query-based criteria may be monitored over time to improve search results. For example, in one embodiment, the average IR score among a top set of results generated in response to a given query or set of queries may be used to adjust the score of that set of results (and/or the other results) generated in response to the given query or set of queries. Moreover, the number of results generated for a particular query(ies) may be monitored over time, and, for example, if the number is increasing or there is a change in the rate of increase, those results which are generated may be scored higher (e.g., such an increase may be an indication to the search engine of a “hot topic” or other phenomenon).

10. USER MAINTAINED/GENERATED DATA (E.G., BOOKMARKS)

In one embodiment of the invention, data maintained or generated by a user may be monitored over time and used to score one or more documents by a search engine. For example, in one embodiment of the invention where the search engine, either directly or indirectly, has access to the “bookmarks” or “favorites” lists maintained by users’ browser programs, the search engine may monitor upward and downward trends, rates thereof, etc., that a document (or more specifically, a path thereto) is added or deleted to, or accessed through, such lists. For example, if a number of users are adding a particular document to their list of favorite documents or often accessing the document through such lists over time, this may be considered a signal to the search engine to score the document as a relatively important document. On the other hand, if a number of users are decreasingly accessing a document indicated in their favorites list or are increasingly deleting/replacing the path to such document from their lists, this may be taken as a signal that the document is outdated, unpopular, etc., in which case the search engine may decrease the score of the document.

It should be appreciated that in alternative embodiments of the invention, other user data that would indicate an increase or decrease in user interest in a particular document over time, which in turn could be used in alternative embodiments of the invention to score the document higher or lower, respectively, could be monitored by a search engine. For example, the “temp” or cache files associated with users could be monitored by a search engine to monitor an increase or decrease of a document being added over time. Similarly, storage and use of cookies associated with a particular web page/site might be monitored for a number of users to score the corresponding documents based on whether there is an upward or downward trend in interest in such document(s).

11. UNIQUE WORDS, BIGRAMS, PHRASES IN ANCHORTEXT

In one embodiment, the link or web graphs and their behavior over time may be monitored and used for scoring, spam detection or other purposes by a search engine. Naturally developed web graphs typically involve independent decisions. Synthetically generated web graphs, usually indicative of an intent to spam a search engine, are based on coordinated decisions; as such, the profile of growth in anchor words/bigrams/phrases is likely to be relatively spiky in this instance.

One reason for such spikiness may be the addition of a large number of identical anchors from many places; another possibility may be addition of deliberately different anchors from a lot of places. With this in mind, in one embodiment of the invention, this information could be monitored and factored into scoring a document by capping the impact of suspect anchors associated with links thereto on the associated document score (a binary decision).

In another embodiment, a continuous scale for the likelihood of synthetic generation is used, and a multiplicative factor to scale the score for the document is derived.
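The two treatments described above, a binary cap and a continuous multiplicative factor, might look like the following. How a search engine would decide that anchors are suspect, or estimate the likelihood of synthetic generation, is not specified; the function names, the cap value, and the "1 minus likelihood" factor are assumptions.

```python
def capped_anchor_score(anchor_score, is_suspect, cap=1.0):
    """Binary decision: suspect anchors contribute at most `cap`."""
    return min(anchor_score, cap) if is_suspect else anchor_score

def scaled_document_score(document_score, synthetic_likelihood):
    """Continuous scale: multiply the score by (1 - likelihood),
    with the likelihood clamped to the range 0..1."""
    likelihood = min(max(synthetic_likelihood, 0.0), 1.0)
    return document_score * (1.0 - likelihood)

print(capped_anchor_score(5.0, is_suspect=True))   # 1.0
print(capped_anchor_score(5.0, is_suspect=False))  # 5.0
print(scaled_document_score(10.0, 0.25))           # 7.5
```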

12. LINKAGE OF INDEPENDENT PEERS

A sudden growth in the number of apparently independent peers (e.g., unrelated web sites), incoming and/or outgoing, with large number of links to individual pages may indicate a potentially synthetic web graph, which in turn may signal an attempt to spam the search engine. This indication may be strengthened if the growth corresponds to anchortext that is unusually coherent or discordant. This information can be used to demote the impact of such links (e.g., in a link-based ranking system such as the one proposed by Brin et al.), either as a binary decision item (e.g., demote score by fixed amount) or a multiplicative factor.

13. DOCUMENT TOPIC

In one embodiment of the invention, topic extraction (e.g., through categorization, URL analysis, content analysis, clustering, summarization, set of unique low frequency words, or some other means of topic extraction) may be performed and the topic of a document monitored over time and used for scoring purposes. In one embodiment, if there is a significant change over time in the set of topics associated with a document, the search engine may consider this as an indication that link-based ranking, anchortext, or other data external to the document but associated therewith and present prior to such change should be discounted.

Similarly, a spike in the number of topics could indicate spam. For example, if a particular site is associated with a set of one or more topics over what may be considered a “stable” period of time, then if there is a (sudden) spike in the number of topics associated with the site, this may be an indication that the site has been taken over by “doorway” documents. Another indication may include the disappearance of the original topics associated with the site. In one embodiment of the invention, if one or more of these situations are detected, the search engine may reduce the relative score of such documents and/or the links, anchortexts or other data associated therewith and used for scoring the document.
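Both indications, a spike in the number of topics and the disappearance of the original topics, can be expressed as set comparisons once topics have been extracted. The sketch below assumes topics arrive as plain strings and uses a hypothetical spike ratio of 2; neither detail comes from the patent.

```python
def topic_change_signals(stable_topics, current_topics, spike_ratio=2.0):
    """Compare a site's long-standing topics with its current ones."""
    stable, current = set(stable_topics), set(current_topics)
    return {
        # Many more topics now than during the stable period.
        "topic_spike": len(stable) > 0
        and len(current) >= spike_ratio * len(stable),
        # None of the original topics are present any more.
        "original_topics_gone": bool(stable) and not (stable & current),
    }

print(topic_change_signals({"ice fishing"},
                           {"blackjack", "poker", "loans"}))
print(topic_change_signals({"ice fishing"},
                           {"ice fishing", "winter camping"}))
```

In the first call both signals are raised; in the second only the topic count has doubled while the original topic remains, which is why the two checks are reported separately.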

HARDWARE/SYSTEM OVERVIEW

Figure 2 is a diagram illustrating an exemplary system in which concepts consistent with the present invention may be implemented. The system includes multiple client devices 202, a server device 210, and a network 201, which may be, for example, the Internet. Client devices 202 each include a computer-readable medium 209, such as random access memory and/or read-only memory, coupled to a processor 208. Processor 208 executes program instructions stored in memory 209. Client devices 202 may also include a number of additional external or internal devices, such as, without limitation, a mouse, a keyboard, microphone, other user input device(s); a display, speakers, other user output device(s); a CD/DVD, diskette or other read or read-write data storage device(s).

Through client devices 202, users 205 may be able to communicate over network 201 with each other or with other systems and devices coupled to network 201, such as server device 210. Similar to client devices 202, server device 210 may include a processor 211 coupled to a computer-readable memory 212. Server device 210 may additionally include a secondary storage element, such as database 230.

Client processors 208 and server processor 211 can be any of a number of well known computer processors. In general, client device 202 may be any type of computing platform connected to a network and that interacts with application programs, including without limitation a desktop or portable personal computer, a digital assistant or a “smart” cellular telephone or pager. Server 210, although depicted as a single computer system, may be implemented as a network of computer processors. Memory 212 contains a search engine program 220. Search engine program 220 locates relevant information in response to search queries from users 205.

In one embodiment of the invention, the search engine program 220 is specialized to search for a specific type or category of information, such as products or product categories, or music, or video, etc. In alternative embodiments, the search engine program 220 may be more general to the extent that it can be used to search for various unrelated categories of information. Users 205 send search queries to server device 210, which responds by returning a list of relevant information, the search results, to user 205. Typically, users 205 ask server device 210 to locate documents relating to a particular topic (e.g., product-related information in the case of a product search engine implementation of the invention) and stored at other devices or systems connected to network 201. Search engine 220 includes document locator 221 and a ranking component 222. In general, document locator 221 finds a set of documents whose contents match a user search query. Ranking component 222 may rank the located set of documents based on relevance and may generate a relevance score for each document that indicates a level of relevance. Search engine 220 may then return a list of links pointing to the set of documents determined by document locator 221. The list of links may be sorted based on the relevance scores determined by ranking component 222.

Document locator 221 may initially locate documents from a document corpus stored in database 230 by comparing the terms in the user’s search query to the documents in the corpus. In general, processes for indexing web documents and searching the indexed corpus of web documents to return a set of documents containing the searched terms are well known in the art. Accordingly, this functionality of document locator 221 is not described further herein.

Ranking component 222 assists search engine 220 in returning relevant documents to the user by ranking the set of documents identified by document locator 221. This ranking may take the form of assigning a numerical value, called a relevance score, corresponding to the calculated relevance of each document identified by document locator 221. There are a number of suitable ranking algorithms known in the art, one of which is described in the article by Brin and Page, as mentioned in the Background of the Invention section of this disclosure. Alternatively, the functions of ranking component 222 and document locator 221 may be combined so that document locator 221 produces a set of relevant documents each having rank values.

In accordance with the present invention, the ranking component 222 may include a time-based or historical ranking component (not shown) which may detect, store and/or monitor one or more criteria associated with documents, such as one or more of time-based criteria described above, and use the behavior of such criteria over time to score the documents or other data associated therewith. This component may be implemented through instructions and data stored in one or more data storage areas of one or more devices; such instructions, when executed by one or more processors, would cause the processors to perform one or more of the methods of the invention.

GENERAL

It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” or “an aspect” of the invention means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention.

Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” or “an aspect” in various portions of this specification are NOT necessarily all referring to the same embodiment.

Furthermore, the particular features, structures or characteristics of one or more embodiments or aspects described may be combined or implemented independently of each other as suitable in one or more embodiments of the invention. It will be apparent to one of ordinary skill in the art that aspects of the invention, as described above, may be implemented in many different forms of software, firmware, and hardware in the implementations illustrated in the figures.

The actual software code or specialized control hardware used to implement aspects consistent with the present invention is not limiting of the present invention. Thus, the operation and behavior of the aspects were described without reference to the specific software code–it being understood that a person of ordinary skill in the art would be able to design software and control hardware to implement the aspects based on the description herein.

The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.

No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. Unless expressly stated otherwise, “or” means “and/or” herein.

It should further be appreciated that, in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the invention requires more features than would be expressly recited in each claim.

CONCLUSION

As can be seen, in accordance with one aspect of the invention, a method is provided in which an information retrieval system (e.g., a search engine) monitors the behavior of one or a combination of factors associated with a document(s) over time and based on such behavior, scores the document(s) for search. It should be appreciated that the invention is not limited to any particular factor, method of monitoring of such factor, or using such factor to perform scoring/ranking operations.

Comments

I like the line about sites turning to doorway link whales (e.g. Anyone bought a blog post recently!)

“In one embodiment, if there is a significant change over time in the set of topics associated with a document, the search engine may consider this as an indication that link-based ranking, anchortext, or other data external to the document but associated therewith and present prior to such change should be discounted.

Similarly, a spike in the number of topics could indicate spam. For example, if a particular site is associated with a set of one or more topics over what may be considered a “stable” period of time, then if there is a (sudden) spike in the number of topics associated with the site, this may be an indication that the site has been taken over by “doorway” documents. Another indication may include the disappearance of the original topics associated with the site. In one embodiment of the invention, if one or more of these situations are detected, the search engine may reduce the relative score of such documents and/or the links, anchortexts or other data associated therewith and used for scoring the document.”

The freshness update has me thinking about something that I did to one of my blogs. I enabled a plugin on WordPress that recycled old posts and brought them to the fore.

If they are timeless posts with no date attached to them other than the revised publishing date, I wonder how Google would view that? Would re-dating it benefit or harm the post from a “freshness” perspective?

Bill, Mark asks a good question. What is your thought on that? Does Google just look at the file date or does it compare copy on the page to see if it has been changed? I glanced over another page of yours on the freshness update which refers to ads on the page. But what about the main content itself? If you covered that already, a simple link to your answer would be terrific.

I read somewhere that Google might think you are trying to spam when you update old blog posts; however, I do that frequently because I don’t want stale or outdated info on my site to MISLEAD readers. So I hope Google is smarter than that. If that’s the case, they should issue a new date and call it fresh if it has been changed. But how much of a change will make it an update to Google, and how much will they think is just an attempt to spam? They’ll have to figure this out and so will we. Thoughts?

This patent really does address a lot of topics. Don’t take everything in it as something that Google is actually doing, though.

As I mentioned at the start of the post, the version of this patent that was published as a pending patent application caused a lot of arguments in many forums. I think the important things to keep in mind are the questions that it raises. Don’t rely upon the patent to answer them, but rather keep an eye out for the answers to those questions while you’re optimizing pages, while reading through other whitepapers or patents, or reading blog posts or watching videos from Google.

Matt Cutts has actually come out and stated that Google doesn’t look at things like the length of registration of domain names to determine if a page is web spam, for instance, even though that possibility is mentioned in the patent.

As I mentioned to Girish, don’t take everything that you find in this patent as gospel. The patent covers an incredible amount of ground, but it’s quite possible that Google isn’t doing many of the things mentioned in the patent. But definitely keep the points it raises in mind, and you may see other ways that Google might be working to remove stale content and web spam from search results.

I don’t like publishing posts without dates. When I read a post that doesn’t have a date, I can’t help but wonder if it’s brand new or 5 years old, and it bugs me. 🙂

Google does try to monitor changes to pages it finds at specific URLs, and track those changes, so removing the date might not make much of a difference when it comes to how old or new Google might perceive a page to be. Google is much more likely to assign a date to a page based upon the first time it crawled that page than a “republished” date or a server “last updated” time stamp.

In the blog post How Google Might Track Changes on Webpages, there’s a section in the middle of the post about “document inception dates” which discusses how Google might determine how old a page might be. Often, that’s when the search engine first learns about a page, usually by crawling it. It may also track changes to that page over time as well. The document inception date concept did originally come out of the historical data patent.

I did recently write a post on how Google might decide that an older page has become a doorway page when certain changes are made to that page. That post was inspired by a recently filed update to one of the patents that came out of the Historical Data patent as well. If the changes to the blog post cover the same topic, and don’t do things like add a number of new topics, and possibly unrelated links, then those changes probably aren’t a problem.

So, if you write a blog post about ice fishing, and you then change it to be about how to win at blackjack and include some links to some casino sites, that could be a serious problem.

Google was just granted one of the “Systems and methods for determining document freshness” pending patent applications this week, and it seems to define the freshness of a “document” differently than what Google told us about how they calculate freshness for their freshness update. The patent tells us that they look at the “last modified” date on a server. During the Freshness update, Google was telling us that they use the “first time crawled” date to determine how old a page might be, and then track changes on that page. I’m trying to reconcile the two different statements, but that’s one of the things that makes spending time with these patents interesting – we get some idea of the challenges search engineers are faced with when they work to index and rank pages. 🙂

“I like looking at patents and whitepapers and other primary sources from search engines to help me in my practice of SEO. I’ve been writing about them for more than 5 years now, and am putting together this series of the 10 Most important SEO patents to share some of what I’ve learned during that time. These aren’t patents about SEO, but rather ones that I would recommend to anyone interested in learning more about SEO by looking at patents from sources like Google or Microsoft or Yahoo.”

I think this statement should be placed on the first post of the 10 articles that you are going to write, I mean you should place the above text here:

I agree. I’ve moved that paragraph from the top of this post to the top of the first post in the series. It was the introduction that I should have started everything off with, and might have if I wrote all of the posts at the same time. 🙂