Google Patent on Anchor Text and Different Crawling Rates

How does a search engine use information from anchor text in links pointed to pages?

Why and how do some pages get crawled more frequently than others?

How might links that use permanent and temporary redirects be treated differently by a search engine?

A newly granted patent from Google, originally filed in 2003, explores these topics, and provides some interesting answers, and even some surprising ones.

Of course, this is a patent, and may not necessarily describe the actual processes in use by Google. It is possible that they are being used, or were at one point in time, but there has been plenty of time since the patent was filed for changes to be made to the processes described.

It has long been observed and understood that different pages on the web get indexed at different rates, and that anchor text in hyperlinks pointing to pages can influence what a page may rank for in search results.

Why Use Anchor Text to Determine Relevancy?

When you search, you expect a short list of highly relevant web pages to be returned. The authors of this patent tell us that previous search engine systems only associated the contents of the web page itself with the web page when indexing that page.

They also tell us that there is valuable information about pages that can be found outside the contents of the web page itself in hyperlinks that point to those pages. This information can be especially valuable when the page being pointed towards contains little or no textual information itself.

For image files, videos, programs, and other documents, the textual information on pages linking to those documents may be the only source of textual information about them.

Using anchor text, and information associated with links can also make it possible to index a web page before the web page has been crawled.

Creating Link Logs and Anchor Maps

Link Log – A large part of the process involved in this patent includes the creation of a link log during the crawling and indexing process, which contains a number of link records, identifying source documents, including URLs, and the identity of pages targeted by those links.

Anchor Map – Information about text associated with links ends up in a sorted anchor map, which includes a number of anchor records, indicating the source URL of anchor text, and the URL of targeted pages. The anchor information is sorted by the documents that they target. The anchor records may also include some annotation information.

In the creation of a link log and anchor map, a page ranker process may be used to determine a PageRank, or some other query-independent relevance metric, for particular pages. This page ranking may help determine crawling rates and crawling priorities.

Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents is accessed.

A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.

Different Layers and Different Crawling Rates

Base Layer

The base layer of this data structure is made up of a sequence of segments. Each of those might cover more than two hundred million uniform resource locations (URLs), though that number may have changed since this patent was written. Together, segments represent a substantial percentage of the addressable URLs in the entire Internet. Periodically (e.g., daily) one of the segments may be crawled.

Daily Crawl Layer

In addition to segments, there exists a daily crawl layer. In one version of this process, the daily crawl layer might cover more than fifty million URLs. The URLs in this daily crawl layer are crawled more frequently than the URLs in segments. These can also include high priority URLs that are discovered by system during a current crawling period.

Optional Real-Time Layer

Another layer might be an optional real-time layer, that could be comprised of more than five million URLs. The URLs in a real-time layer are those URLs that are to be crawled multiple times during a given epoch (e.g., multiple times per day). Some URLs in an optional real-time layer are crawled every few minutes. Real-time layer also comprises newly discovered URLs that have not been crawled but should be crawled as soon as possible.

Same Robots for All Layers

The URLs in the different layers are all crawled by the same robots, but the results of the crawl are placed in indexes that correspond to those different layers. A scheduling program bases the crawls for those layers based on the historical (or expected) frequency of change of the content of the web pages at the URLs and on a measure of URL importance.

URL Discovery

The sources for URLs used to populate this data structure include:

Direct submission of URLs by users to the search engine system

Discovery of outgoing links on crawled pages

Submissions from third parties who have agreed to provide content, and can give links as they are published, updated, or changed – from sources like RSS feeds

It’s not unusual to see blog posts in the web index these days that indicate they are only a few hours old. It’s quite possible that those posts are included in the real time layer, and may even be entered into the index through an RSS feed submission.

Processing of URLs and Content

Before this information is stored in a data structure, a URL (and the content of the corresponding page) is processed by programs designed to ensure content uniformity and to prevent the indexing of duplicate pages.

The syntax of specific URLs might be looked at, and a host duplicate detection program might be used to determine which hosts are complete duplicates of each other by examining their incoming URLs.

Parts of the Process

Rather than describing this process step-by-step, I’m pointing out some of the key terms, and some of the processes that it covers. I’ve skipped over a few aspects of the patents that increase the speed and efficiency of the process, such as partitioning, and the incremental addition of data through a few processes.

It’s recommended that you read the full text of the patent if want to see more of the technical aspects involved in this patent.

Epochs – a predetermined period of time, such as a day, in which events in this process take place.

Active Segment – the segment from the base layer that gets crawled during a specific epoch. A different segment is selected each epoch, so that over the course of several epochs, all the segments are selected for crawling in a round-robin style.

Movement between Daily Layer and Optional Real-Time Layer – Some URLs might be moved from one layer to another based upon information in history logs that indicate how frequently the content associated with the URLs is changing, as well as individual URL page ranks that are set by page rankers.

The determination as to what URLs are placed in those layers could be made by computing a daily score like this – daily score=[page rank].sup.2*URL change frequency

URL Change Frequency Data – When a URL is accessed by a robot, the information is passed through content filters, which may determine whether content at a URL has changed and when that URL was last accessed by a robot. This information is placed in history logs and then go to the URL scheduler.

A frequency of change may then be calculated, and supplemental information about a URL may also be reviewed, such as a record of sites whose content is known to change quickly.

PageRank – a query-independent score (also called a document score) is computed for each URL by URL page rankers, which compute a page rank for a given URL by looking at the number of URLs that reference a given URL and the page rank of those referencing URLs. This PageRank data can be obtained from URL managers.

URL History Log – can contain URLs not found in data structure, such as within log records for URL’s that no longer exist, or which will no longer be scheduled for crawling because of such things as requests by a site owner that the URL not be crawled.

Placement into Base Layer – When the URL scheduler determines that a URL should be placed in a segment of base layer, an effort is made to ensure that the placement of the URL into a given segment of base layer is random (or pseudo-random), so that the URLs to be crawled are evenly distributed (or approximately evenly distributed) over the segments.

Processing rules might be used to distribute URLs randomly into the appropriate base, daily, and realtime layers.

When All URLs Cannot be Crawled During a Given Epoch – This could be addressed using two different approaches:

A crawl score could be computed for each URL in active segment, daily layer, and real-time layer, and only URLs with high crawl scores are passed on to the URL managers.

In the second, the URL scheduler decides upon an optimum crawl frequency for each such URL and passes the crawl frequency information on to URL managers – which is used by URL managers to decide which URLs to crawl. These approaches could be used by themselves, or in a combined approach to prioritize the URLs to crawl.

Factors Determining a Crawl Score — where a crawl score is computed, URLs receiving a high crawl score are passed on to the next stage, and URLs with a low crawl score are not passed on to the next stage during the given epoch. Factors that could be used to compute a crawl score might include:

The current location of the URL (active segment, daily segment or real-time segment),

URL page rank, and;

URL crawl history – can be computed as: crawl score=[page rank].sup.2*(change frequency)*(time since last crawl).

Other modifications may impact the crawl score. For example, the crawl score of URLs that have not been crawled in a relatively long period of time can be up-weighted so that the minimum refresh time for a URL is a predetermined period of time, such as two months.

Where Crawl Frequency is Used – The URL scheduler may set and refine a URL crawl frequency for each URL in the data structure. This frequency represents a selected or computed crawl frequency for each URL. Crawl frequencies for URLs in daily layer and real-time layer will tend to be shorter than the crawl frequency of URLs in the base layer.

The range of crawl frequencies for any given URL can range from a minute or less to a periods that can take months. The crawl frequency for a URL is computed based on the historical change frequency of the URL and the page rank of the URL.

Dropping URLs – The URL scheduler determines which URLs are deleted from data structure and therefore dropped from system. URLs might be removed to make room for new URLs. A “keep score” might be computed for each URL, and the URLs could then be sorted by the “keep score.”

URLs with a low “keep score” could be eliminated as newly discovered URLs are added. The “keep score” could be the page rank of a URL determined by page rankers.

Crawl Interval – A target frequency that a URL should be crawled. For a URL with a crawl interval of two hours, the URL manager will attempt to crawl the URL every two hours. There are a number of criteria that can be used to prioritize which URLs that will be delivered to the URL server, including “URL characteristics” such as the category of the URL.

Url Server Requests – there are times when the URL server requests URLs from URL managers. The URL server may sometimes request specific types of URLs from the URL managers based upon some policy, such as eighty percent foreign URLs/twenty percent news URLs.

Robots – Programs which visit a document at a URL, and recursively retrieve all documents referenced by the retrieved document. Each robot crawls the documents assigned to it by the URL server, and passes them to content filters, which process the links in the downloaded pages.

The URL scheduler determines which pages to crawl as instructed by the URL server, which takes the URLs from the content filters. Robots differ from normal web browsers – a robot will not automatically retrieve content such as images embedded in the document, and are not necessarily configured to follow “permanent redirects”.

Host Load Server – used to keep from overloading any particular target server accessed by robots.

Avoiding DNS Bottleneck Problems – a dedicated local DNS database may be used to resolve IP addresses with domain names so that domain names for IP addresses that have been crawled before don’t have to be looked up every time that a robot goes to visit a URL.

Handling Permanent and Temporary Redirects – Robots do not follow permanent redirects that are found at URLs that they have been requested to crawl, but instead send the source and target (redirected) URLs of the redirect to the content filters.

The content filters take the redirect URLs and place them in link logs where they are passed back to URL managers. It is the URL managers that determine when and if such redirect URLs will be assigned to a robot for crawling. Robots are set to follow temporary redirects, and obtain page information from the temporary redirects.

Content Filters – robots send the content of pages to a number of content filters. Those content filter sends information about each page to a DupServer to uncover duplicate pages, including:

The URL fingerprint of the page,

The content fingerprint of the page,

The page’s page rank, and;

An indicator as to whether the page is source for a temporary or permanent redirect.

Handling Duplicates – when found, the page rankings of the duplicate pages at different URLs are compared and the “canonical” page for the set of duplicate pages is identified. If the page sent to the DupServer is not the canonical page, it is not forwarded for indexing.

Instead, an entry might be made for the page in the history log, and the content filter may then cease to work on the page. The DupServer also assists in the handling of both temporary and permanent redirects encountered by the robots.

Link Records and Text Surrounding Links – a link log contains one link record per URL document. A URL document is a document obtained from a URL by a robot and passed to content filter. Each link record lists the URL fingerprints of all the links (URLs) that are found in the URL document associated with a record, and the text that surrounds the link.

For example, a link pointing to a picture of Mount Everest might read “to see a picture of Mount Everest click here.” The anchor text might be the “click here” but the additional text “to see a picture of Mount Everest” could be included in the link record.

RTlog Matching Document PageRanks with Source URLs – an RTlog stores the documents obtained by robots, and each document is coupled with the page rank assigned to the source URL of the document to form a pair. A document obtained from URL “XYZ” is paired with the page rank assigned to the URL “XYZ” and this pair is stored in an RTlog.

There are three RTlogs, one for the active segment of the base layer, one for the daily layer, and one for the realtime layer.

Creation of Link Maps – the global state manager reads the link logs, and uses the information from the log files to create link maps and anchor maps. The records in the link map are similar to the records in the link log, except that the text is stripped. The link maps are used by page rankers to adjust the page rank of URLs within the data structure. These page rankings persist between epochs.

Creation of Anchor Maps – The global state manager also creates anchor maps. Anchor maps are used by the indexers at the different layers to facilitate the indexing of “anchor text” as well as to facilitate the indexing of URLs that do not contain words.

Text Passages – Each record in the link log includes a list of annotations, such as the text from an anchor tag pointing to a target page. The text included in an annotation can be a continuous block of text from the source document, in which case it is referred to as a text passage.

Annotations may also include text outside the anchor tag in the document referred to by a URL. That text could be determined by being within a predetermined distance of an anchor tag in a source document. The predetermined distance might be be based on:

A number of characters in the HTML code of the source document,

The placement of other anchor tags in the source document, or;

Other predefined criteria, referred to as anchor text identification criteria.

Other Annotations – annotations may also include a list of attributes of the text they include. For example, when the text in an annotation is composed in HTML, examples of attributes may include:

Emphasized text – <em>

Citations – <cite>

Variable names – <var>

Strongly Emphasized – <strong>

Source Code – <CODE>

Text position,

Number of characters in the text passage,

Number of words in the text passage, and;

Others.

Annotations as Delete Entries – Sometimes an annotation is a delete entry, which is generated by the global state manager when it determines that a link no longer exists.

Use of Anchor Text from Duplicates – sometimes anchor text pointed to duplicates is used to index the canonical version of the page. We are told that this can be useful when one or more of the links to one or more of the non-canonical pages has anchor text in a different language than the anchor text of the links to the canonical page.

Conclusion

This patent appears to illustrate a lot of the process behind how anchor text may be used to add relevancy to pages that are pointed at by links which use that anchor text. Many of the observations and assumptions about links, that people who watch search engines have made, are addressed within the patent in one way or another.

The information about how a crawler might handle permanent and temporary redirects differently is interesting, as is the annotation information that a search engine might use. I’m not sure that I’ve seen mention or hint of the use of anchor text pointing to duplicates to provide relavance to the canonical versions of pages before.

The information about crawling rates, and the possible role of PageRank along with frequency of changes in content, which could influence how frequently a page is crawled is also the most detailed that I can recall seeing in a patent from Google.

Please don’t take any of the information from this patent as gospel – but keep in mind that it is a document created by people from Google, and that if the processes described within it aren’t being used, that they were seriously enough considered to protect them as intellectual property of the search engine.

The patent provides some details that I don’t believe I’ve seen before elsewhere, and reinforces a lot of assumptions and educated guesses based upon observation and some testing. The four years since it was filed does feel like a long time ago, though.

I must agree with William ,this was four years back , since it was documented by Google : I don’t know what to believe.
BUT nice detailing , i have been looking for a glossary which can tell me more than just ‘PPC’ and ‘CPC’

Very interesting reading! Most of what has been laid out in this patent seems consistent with what we’ve been seeing over the last years, but some does indeed need some closer observation to find out if it is actually followed. Thanks for sharing this, Bill!

This patent definitely reinforces some of the more common SEO assumptions. Figure 2 isn’t easy to digest but interesting to study. Among other things I found the concept of the link record most insightful. I think a lot of people always assumed that there must be a link record of some sort to complement the anchor text but looking at this patent I think it is very likely Google actually implemented it.

I think that this is very interesting, since it is naturally inefficient (why bother breaking the urls up into groups like this?). I assume they designed it like this to enable them to run it effectively using the google file system, where they want sequential access to queues.

With this in mind, I wouldn’t be surprised if it is still done like this, they do like to partition problems.

I think that’s a good point about efficiency under the google file system, Tim. The timing is about right, too. This patent was filed in 2003, and the publication that I linked to was also out that year, too.

We do know that Google went some infrastructure changes since then, even naming one of the “Big Daddy,” but have no idea if that would change something like this kind of set up. The processes in this patent are core to what Google does, but if the patent protects some “new” process, then they were likely to have been using a process that was somehow different earlier than this.

I’ve noticed crawl rates increasing by quite a bit recently. All of my sites that were partially indexed (floating around 250 pages) have been full indexed out to around 400 pages now. I have a feeling that this is further obfuscation to deny us a look at supplemental pages though.

Tying patents together into a timeline would be a great project to study the back end development of Google. Tie that to front end/product launches and you’d have a pretty good base for predictive modelling for the Google machine.

It’s a pity there are no exact formulas or pointers to what is good for ranking and what is bad – but anyhow this is a patent, a document that is made not to tall about commercial secrets 🙂

@Handling Duplicates – when found, the page rankings of the duplicate pages at different URLs are compared and the “canonical” page for the set of duplicate pages is identified. If the page sent to the DupServer is not the canonical page, it is not forwarded for indexing.@

There are still many indexed absolutely equal pages in google’s index, so this rule seemed to change I think…

Thanks for this post, I wonder how long it took to write such a giant article 🙂

Greater indexing by the Google isn’t a bad thing as I see it. I do miss the “supplemental” tag as a way of telling that you may have some pages that you might want to get more links to. But you should probably have some idea through your log files and visits if you have some underperforming pages.

Jeffrey,

Thanks. I do find looking at these patent filings to be one way to try to get a different perspective on what search engines are doing, or are trying to do, and I think that looking over them provides value to people who are trying to work on the web. And I got a kick out of that screenshot from the patent which does look a lot like a mindmap.

Search◊ Engines Web,

I’d like to know why some blogs are updated in search results so frequently, too. I’ve also seen a number of news sources listed at Google News within 8 minutes of being posted online, which is really incredible.

Life Design,

I’ve considered a timeline like that. With almost 600 patent filings from Google, it’s a considerable undertaking. Thanks.

You’re absolutely correct in that there’s a fine line between trade secrets and patents. A patent does have to be detailed enough to protect an idea within the context of some working process or method, but it doesn’t have to, and shouldn’t necessarily, give away the actual commercial practice involved.

The issues involving duplicates are difficult, and it can be hard to tell which published version of a page is the “canonical” one, and which should be considered a duplicate and not indexed. I think that is a little easier when the pages involved are on the same domain, but I don’t want to have to rely upon a search engine making that decision for me.

Really a facinating breakdown of this patent with great detail. You manage to write in such a way as to help all levels understand how patents work. Thanks for such a great post on this and other Google patents.

Trustworthiness may play a role, and quality may be considered by looking at the PageRank for a page.

Indexing rates are tied to frequency of change on a site as well – so a site that is considered to be high quality and trustworthy that doesn’t change much over time isn’t going to get re-indexed very frequently.

We can’t say that this is exactly how Google crawls pages, but it provides some interesting insights into what they may be doing, and the assumptions behind their possible approach. If anything though, I would guess that the actual process is even more complicated than what is presented in this patent.

Sorry about any headaches. 🙂 The really interesting thing about this is that it was filed in 2003. The patent gives us some insight into a pretty detailed process, but it’s possible that they may have transformed this process over the past 6 years.

I’m not sure that we are going to see search engines looking to anchor text any less in the future. Both links and anchor text as such an important part of the Web, that I believe we’re likely going to see both used as ways search engines rank pages. We may see search engines adopt other ranking signals, such as those involving user behavior data as well, and possibly some kind of trust rankings involing people participating on social networks.

Even though it was posted back in 2007 Google still places a lot of importance in anchor text, I’m sure you’d agree.

In your opinion is it more natural to link to, say, Sheilas Wheels (a female car insurance company in UK) with the anchor text ‘Sheilas Wheels’, or with ‘car insurance’? I think the former, but Google obviously thinks the latter. What’s your take?

One last general question on patents. I thought a patent meant that another company couldn’t use the same technique/technology? We know that all the search engines use anchor text as a relevance signal but shouldn’t patent like these reserve it’s use for Google? And… if it did how could Google ever be sure it wasn’t being used elsewhere since a ranking algorithm has to remain secret.

I think it’s probably pretty natural to link to your example site either way – the name of the company, or a term that describes what they actually do.

Search engines have been using anchor text to determine the relevance of a page being linked to since the World Wide Web Worm started doing it in 1994, and I don’t beleive that they patented the process behind how they do that. Google’s patents like this one don’t give them the exclusive right to use anchor text as a relevance ranking factor, but rather give them an exclusive right to the more detailed process behind what is described within the patent.

Chances are good that at an organization the size of Google or Bing or Yahoo, there are likely analysts who keep an eye on what their competitors are doing, and how they may be ranking pages and what kinds of practices they follow, just as any site owner might look at what their competitors are doing.

Trackbacks

[…] 11th, 2007 by Loren Baker, Editor | No Comments submit_url =”http://www.searchenginejournal.com/google-granted-patent-on-anchor-text-link-crawling/6087/”; Google has recently been granted a patent by the US government which the company originally filed in 2003. The patent, writes Bill Slawski of SEO by the SEA, explores how Google would (a patent does not always have to be used) use anchor text links to rank and classify pages, Google crawls some sites more frequently than others, and how the search engine would treat links which use a redirect different than a natural link. […]

[…] Wie für ein Patent üblich, ist der Inhalt sehr ausführlich und detailliert beschrieben. seobythesea.com hat sich abermals die Mühe gemacht, ein Google-Patent wiederzugeben und zu analysieren. Please don’t take any of the information from this patent as gospel – but keep in mind that it is a document created by people from Google, and that if the processes described within it aren’t being used, that they were seriously enough considered to protect them as intellectual property of the search engine. […]

[…] In a patent dating back to 2003 (granted in December 2007) Google entitled Anchor tag indexing in a web crawler system (analysis) Google explains how they could place urls in a series of crawl layers to determine how often the page needed to be crawled. For example news.bbc.co.uk would be in the “real time” crawl layer to be crawled almost continually whereas an average blog homepage might be in the “daily” crawl layer to be visited once per day. […]

[…] It can be used in a number of ways by a search engine, such as being combined with relevance factors to rank search results, or to determine which web pages to crawl (pdf) and how frequently to crawl them, or which part of a database a document should be placed within. […]

[…] Especially beyond the beginner stage, you need to make it a habit to research and read as much as possible. For advanced SEO knowledge, my favorite sources of research material are search engine–related patents and information retrieval research papers. In order to avoid getting lost while reading such documents, I recommend first reading the excellent tutorials on linear algebra and Information Retrieval from Dr. Garcia. Bill Slawsky also has an excellent blog where he unearths useful patents and provides excellent commentary. In fact Bill unearthed a very relevant patent, which provided a valuable insight for my Toolbar PageRank challenge. It is called: Anchor tag indexing in a web crawler system. Here is part of Bill’s conclusion: The information about crawling rates, and the possible role of PageRank along with frequency of changes in content, which could influence how frequently a page is crawled is also the most detailed that I can recall seeing in a patent from Google. […]

[…] We have known for some time that Google updates portions of its indexes at different rates. You see very active blog sites like SEO Theory, SearchEngineLand, etc. move into the Main Web Index within minutes or hours of posting new content, whereas your static content sites (or less popular blogs) may take days or weeks to update in the Main Web Index. […]

[…] example, last December I wrote a post titled Google Patent on Anchor Text and Different Crawling Rates, about a Google patent filed in 2003 which gave us a look at how the search engine crawled web […]

[…] of a crawler on web pages, Anchor tag indexing in a web crawler system, which I wrote about in Google Patent on Anchor Text and Different Crawling Rates back in 2007. 4:03 pm | Tags: crawling, google, google patent, page speed, patent, robots | […]