Updating Google’s Historical Data Patent, Part 1 – Freshness

I discussed Google’s historical data patent on forums when the original patent application came out in March of 2005, but I didn’t provide a write-up of the document here. I realized a few weeks ago that I probably should.

The historical data patent is important because it discusses a large number of techniques that a search engine might use to fight “spamming techniques” that might artificially “inflate” the rankings of web sites, and to identify “stale” sites that may be ranked higher than fresher sites containing more recently updated information.

I’ll be writing a few posts over the next few weeks about the patent, and I’ll try to include some updates that have happened since it was first published. This first post looks at how the “freshness” of a page or document might influence its rankings in search results.

Fresh and Stale Web Pages

How does a search engine tell how fresh a web page might be, or how stale it is? What do those words even mean? Why is it important? What difference does the age of a page make? Does staleness or freshness depend upon the content on a page?

The Constitution of the United States is an old document, but it’s not stale. A news article about the “World Series” from 1918 may not be what a baseball fan wants to see when searching for “World Series” this October.

While Babe Ruth is well known as a feared slugger for the New York Yankees, he’s not as well remembered for his earlier days as a Boston Red Sox pitcher who threw a shutout in that 1918 World Series. Interesting information, but again, not what a searcher is likely to be looking for in an October 2008 search for “World Series.”

Just how do we tell the age of a document, and determine whether or not it is stale? What types of things would be used to give a score to a document based upon that age?

A search engine might look at information from different sources to learn about:

The age of a document

The age of links leading to and from that document

Determining the Age of a Document

The history of a document, such as its age and information about links to it, can influence ranking scores under this historical data patent. A search engine needs a starting date for a document, also referred to as a document inception date.

A search engine might look at the following to decide how old a page might be:

When it is first crawled by the search engine

When it is first submitted to the search engine

When the search engine first discovers a link to the document

When the domain was registered

When the page was first referenced in another document

When a document first reaches a certain number of pages

By the time stamp of the document on the server it is hosted upon
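Here is a minimal sketch of how those signals might be combined into a single document inception date. The patent lists the signals but, as far as this post describes it, doesn’t say how to pick among them, so taking the earliest known date is purely an assumption on my part. The AgeSignals class and all of its field names are hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class AgeSignals:
    """Hypothetical container for the date signals listed above."""
    first_crawled: Optional[date] = None
    first_submitted: Optional[date] = None
    first_link_discovered: Optional[date] = None
    domain_registered: Optional[date] = None
    first_referenced: Optional[date] = None
    reached_page_threshold: Optional[date] = None
    server_timestamp: Optional[date] = None


def inception_date(signals: AgeSignals) -> Optional[date]:
    """Return the earliest known date, or None if no signal is available.

    Choosing the earliest date is an assumption made for this sketch,
    not a rule stated in the patent.
    """
    known = [getattr(signals, f.name) for f in fields(signals)]
    known = [d for d in known if d is not None]
    return min(known) if known else None


# Example: the first link was seen before the first crawl.
signals = AgeSignals(
    first_crawled=date(2005, 3, 31),
    first_link_discovered=date(2005, 1, 15),
)
estimated = inception_date(signals)
```

A real system would probably weigh these signals differently, since something like a server time stamp is easy to change.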

Under a link-based ranking system that doesn’t use age-based information, a document with fewer links to and from it may rank lower than a document with more links to and from it.

In a system that does use age-based information, a document with fewer links that can be determined to be newer, based upon its document inception date, could possibly rank higher than an older document with more links if the newer document’s links are growing at a higher rate.

But too many links coming too quickly to the newer document may also be a sign that some type of spamming is happening (See how Yahoo may handle this issue).

So, how is that rate determined, and how much does it influence the overall ranking of a page?

This complicated-looking formula is given as one way of determining how the age of a document might influence how it ranks:

H=L/log(F+2),

where H may refer to the history-adjusted link score, L may refer to the link score given to the document, which can be derived using any known link scoring technique (e.g., the scoring technique described in U.S. Pat. No. 6,285,999) that assigns a score to a document based on links to/from the document, and F may refer to elapsed time measured from the inception date associated with the document (or a window within this period).
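To get a feel for how that formula behaves, here is a small sketch in Python. The patent text quoted above doesn’t state the base of the logarithm or the unit of time for F, so the natural log and months used here are assumptions, and the link scores are made-up numbers.

```python
import math


def history_adjusted_link_score(link_score: float, elapsed: float) -> float:
    """Compute H = L / log(F + 2).

    link_score is L, from any link scoring technique.
    elapsed is F, the time since the inception date (months here,
    which is an assumption; the unit isn't specified above).
    The natural logarithm is also an assumption.
    """
    if elapsed < 0:
        raise ValueError("elapsed time cannot be negative")
    return link_score / math.log(elapsed + 2)


# A newer document with a lower link score (made-up values).
newer = history_adjusted_link_score(link_score=8.0, elapsed=1)

# An older document with a higher link score (made-up values).
older = history_adjusted_link_score(link_score=12.0, elapsed=60)
```

With these numbers the newer document ends up with the higher history-adjusted score, even though its raw link score is lower. Note that adding 2 to F keeps the logarithm above zero for a brand new document (F = 0), so the division is always defined.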

The historical data patent goes on to explain that sometimes some “older documents may be more favorable than newer ones” and that some sets of results can be fairly mature. The scores of documents can be influenced (positively or negatively) by the difference between the document’s age, and the average age of documents resulting from a query.

So, a fairly new site that appears amongst a set of results that are, on average, fairly old may find itself negatively influenced by that difference in age.
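A rough sketch of that comparison follows. Only the idea of comparing a document’s age to the average age of the result set comes from the patent. The linear adjustment, the weight value, and the function names are hypothetical, chosen just to show a “mature” result set where older-than-average documents get a boost.

```python
from statistics import mean


def age_differences(ages):
    """Return each document's age minus the average age of the result set."""
    average_age = mean(ages)
    return [age - average_age for age in ages]


def age_adjusted_score(score, age_difference, weight=0.01):
    """Hypothetical adjustment: a positive weight favors older documents.

    A negative weight would favor newer documents instead. The linear
    form and the default weight are assumptions for illustration only.
    """
    return score * (1.0 + weight * age_difference)


# Ages in months for four results; the last one is a fairly new site.
ages = [120, 96, 108, 6]
differences = age_differences(ages)

# The new site is well below the average age, so its score drops.
new_site_score = age_adjusted_score(1.0, differences[-1])
```

For a query where fresher results are preferred, the same difference could just as easily be used to push the new site up rather than down.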

This followup patent application from Monika Henzinger added another way of looking at how fresh a document might be. It takes a look at how fresh the pages and links pointing to a document are to determine the freshness of that document.

A New York Times article from last year, Google Keeps Tweaking Its Search Engine, uncovered a Google initiative that goes by the name QDF, or Query Deserves Freshness. It discusses whether topics are “hot,” whether people are writing about those topics in the news and in blogs, and whether searchers are looking for information about them.

Looking at user behavior and click-throughs to pages are other ways of determining whether a document is fresh or stale, and the patent includes those as other ways of determining just how fresh pages might be. I’ll address those topics in a future post.

Should freshness be part of how pages are ranked by a search engine? If you run a website, how fresh are the pages of your site, and how can you make them fresher if they seem on the stale side?

I think it is a topic that’s worth revisiting, and exploring in a few different ways. It covers such a wide range of topics, from the weight of links, to the impact of anchor text, to how click-throughs might influence rankings, and more, that it seemed to make sense to break it into parts.

I’m a little sad to see summer end, but autumn seems like it’s going to bring some interesting days ahead. I hope yours is enjoyable, and I suspect that you’ll miss having the kids around.

It pains me to see the Babe in the wrong uniform, especially after the events of the last couple of days.

Good post as always Bill.

I do think the idea of freshness should play a role in search results, though naturally it should depend on the query. I suspect it will be easier to organize the results based on freshness than it will be to understand when a query’s intent should lead to a freshness-based result.

Then again I’ll often type a year into a query because I do want something fresh and still find some very old results. Probably picking up a copyright date or something similar.

You’re right, this topic can be explored and expanded into a wider range of topics just for the link analysis alone.

When I found the picture of Babe Ruth, the Red Sox uniform looked pretty out of place to me too, even though it was exactly what I was hoping to find.

Experimenting with freshness in searches leads to some interesting results. A Google search for “superbowl” shows a lot of 2008 and 2009 results with many of the 2009 results being pages selling tickets for the event. A Google search for “presidential elections” shows some 2008 specific results, some “fresh” news stories in the middle of the search results, and some history/reference resources that look pretty up to date with an exception or two.

I think the impact of freshness does depend upon the query being searched for. There likely are some kinds of queries that are impacted less by “freshness” and that’s probably a good thing. It’s something to keep in mind when looking at query terms though.

There are some interesting parts of the patent when it comes to different aspects of links and anchor text. I’m looking forward to digging deeper into those, too.

Thanks for your kind words. I’m hoping that asking lots of questions, trying to provide some answers, and looking at some new patent filings and articles and blog posts that might seem to be related can spur some discussion.

And that it can provide some ideas for people who visit here regardless of whether they are marketers, site owners, or people who just have an interest in search and want to understand a little more about what might be happening after they type something into a search box at a search engine and then receive a set of search results.

So many of us rely upon search engines, and yet we don’t know what goes into the delivery of the results that we see in response to our searches. Freshness is an important aspect of how relevant results might be, but it’s something that isn’t discussed much. It probably should be talked about more.

Would freshness also be determined by whether the actual content on the page has changed? For websites like blogs and news pages this kind of freshness is definitely a good factor for determining a page’s usefulness… But some pages may just be “done” and their content doesn’t need to be updated or freshened up.

I wonder how much of the content has to be refreshed to make it more “appealing” to the Search Engines.

Sure, some parts of a web site are likely to hardly ever update, but other portions of the site could be “freshened” up from time to time. Do you think this would help, or are they more interested in seeing new content appearing on the site, with the older content not getting refreshed at all?

That is one of the topics that I will be addressing in a future post on the patent. Changes in content are something that a search engine does consider, and a search engine may look at content on different topics differently based upon the content being considered. The genre, to use that term, like “blogs” or “news” may be important. It’s interesting that Google is now showing dates of many blog posts in search results.

I think that the importance of freshness of content may have to be viewed not only in the context of an individual page or site, but also in its framework – the other pages and documents on the Web that may involve similar content.

Some sites go stale naturally. I’m sure there will be blogs about the 2008 election that will hibernate, and I’m pretty sure that people in 2016 will search for “why McCain lost to Obama in 2008” and I hope that the top results are blogs and not Wikipedia.

The time stamp issue is one that could cause problems, which is why the patent filing will consider some other indications of time, and why the followup patent application from Monika Henzinger provides an alternative approach which looks at the age of sites and links to a page, instead of a timestamp upon that page.

Bill, I am glad you made the comment about the Constitution. I get questions all the time about whether frequent changes on a page are good for ranking. My answer is always an unequivocal “maybe”. I suspect that for certain verticals, it might be helpful, but it shouldn’t be. A page should be considered fresh if new links are still coming in. In fact, unchanged text and fresh links is probably the ultimate signal of authority.

It is interesting to see some of the other news that has come out after the historical data patent was published. No doubt that freshness is playing an important role in what search engines are trying to do, and it may be essential that they find a way to filter search results by considerations like that in addition to “relevance” for a query submitted. If the search results that they show to searchers aren’t “fresh” then those searchers may look elsewhere.

I think that a “maybe” is often the best answer for many questions regarding what a search engine is doing, and will do in the future. Having said that, I think you make a very good point about a page that hasn’t changed much over time and is still getting new links pointed towards it – that seems to be a decent indication that people find the content on that page to be of lasting and timely value.

I’m not certain that a search engine gives higher rankings based upon the age of a domain either, but there may be something in the value of a history of content, and of the acquisition of links pointing to the pages of a domain over time, and other indications of being established that may aid it.

We know that one aspect of ranking – the link analysis part – requires getting links to a page – those provide a search engine with an idea of how important a page might be (on the assumption that if important pages link to another page, it’s likely the page they are pointing to is important, too). While it’s possible that people may point links to empty domains, it’s more likely that they will link to a page that contains content that they find interesting or useful or helpful or controversial or worth returning to for one reason or another. An empty page won’t do that.

We also know that search engines will consider a wide variety of ranking signals, including visible ones such as content upon the pages themselves. While a domain name may help, especially if people link to a page using the domain name in the anchor text to the links, by itself it may not produce a very strong signal to search engines that it should be considered relevant for a query that someone is searching for, regardless of how long it may have been registered as a domain.

Thanks for your reply. I understood it well. But there is actually no statistical proof to show that search engines do provide higher domain ranking. Yes, I do agree about domains over years of usage, even if they are just static. They do have PR if the domain is a highly searched keyword.

I too agree that content is one of the most important parts of SEO and ranking in Google. Anchor text especially, from high-PR sites, does affect a site’s ranking a lot. I strongly believe that backlinks really do pass the power juice of PR to a site from a higher-PR site.

Just for the fun of it, I will try to get a highly searched keyword domain, and do nothing but leave some keywords on the domain. Then, over time, say two years, maybe we can run a real test on this, as I really do not believe that search engines provide higher domain ranking.

Unless we have statistical proof that there are real domains with this kind of ranking, based just on search engines.

Waiting for your kind reply. First and foremost, I mean no offence, Mr. Bill.
Anyway… I love your blog and I love reading it. It’s just that there are too many pages and links for me to read from day to night. Cheers.

There are so many ranking signals for search engines that isolating one factor, like the age of a domain, may not be something that can be realistically done. Testing in an environment where controls can’t be put into place, and where there are so many factors that we are unaware of, may make such an experiment virtually impossible.

There are CMS systems that do include dates in URLs, and it’s possible that those might be viewed as an indication of document ages, much like post dates in blogs, and the dates associated with edits in wiki software.

One concern I would have with looking at a date expressed in a URL (or blog post date) would be that someone could come in and change the content of such a page, and the date wouldn’t be an accurate indication of the age of the content on the page. But you raise a good point – we may not anticipate the form of some information that could provide us with dates and times that something might be created.

A very informative article, thank you, again. I’m not sure if freshness should be a huge consideration. I’ve done websites before for customers who want to promote their shop / restaurant / pub. The sites usually have a page describing the business, an overview of their services and a contact us page – maybe a menu if they serve food. Sometimes this is used where an ecommerce site isn’t necessary or wanted.

If Google made this play a heavy part in their algorithm then it wouldn’t help those sites very much, yet they are there to serve a purpose and are probably of a lot of use to people.

It is possible that freshness plays more of a role for some kinds of sites than others.

For example, a page about sports news might be one that should be boosted when it contains fresh information, and a page about a restaurant might not, or at least freshness might not carry as much weight.