How search engines could get granular

One area that is worth looking at for SEO in 2009 (for me at least) is page segmentation. This approach really isn’t new; I came across papers dating back to 1997 and earlier. But unsurprisingly, most IR methods don’t just appear overnight. The big three (of search) have each had various research papers and patents dating back as far as 2003-4. The idea seems to have some traction, and it is sensible as well.

Essentially, page segmentation is when a search engine breaks a given web page down into its component parts. It could analyze a web page and assign relevance or importance scores to different regions of the page. Some of the methods include fixed-length page segmentation (FixedPS), DOM-based segmentation (DomPS), vision-based segmentation that uses location and white space (VIPS) and even a combined approach (CombPS).
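To make the simplest of these concrete, here is a minimal sketch of the fixed-length idea: cut the page text into windows of a set number of words. The function name, the 200-word default and the optional overlap are my own assumptions for illustration, not values taken from any paper or patent.

```python
def fixed_length_segments(text, window=200, overlap=0):
    """Split text into windows of `window` words (hypothetical helper).

    `overlap` is the number of words shared by neighbouring windows.
    Both defaults are illustrative assumptions.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    words = text.split()
    step = window - overlap
    # Each segment starts `step` words after the previous one.
    return [" ".join(words[i:i + window]) for i in range(0, len(words), step)]


if __name__ == "__main__":
    for segment in fixed_length_segments("a b c d e", window=2):
        print(segment)
```

A fixed window ignores the page structure entirely, which is presumably why the DOM and vision based approaches exist at all.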

As with many IR methodologies, the aim is to improve the signal to noise ratio. In this case, by identifying the noisy segments, resources can be focused on the relevant areas of a web page. Furthermore, most people do tend to understand web pages in a segmented or structured view. When you arrived at this page did you instinctively know where to find the main content? Were you aware of common locations for navigation and other elements? Banner blindness? You get the idea.

Advantages of page segmentation

The main advantages are increased relevance and streamlined processing. Search engines hope to use page segmentation to assess a more granular understanding of a given page’s relevancy, but also (theoretically) to be capable of dealing with multi-topic pages, semantically related or not.

The second advantage, processing and resource management, can be achieved as they could define site templates in an attempt to only crawl/index the relevant parts of the page and not the boilerplate elements.

Now, while there are a few ways of going about it, what’s important here is that such systems are sensible not only from a relevancy perspective, but could also help crawling and indexing resource management.

One has to imagine new ideas at the big three will be tempered in a volatile economy. Once a template has been established, indexing a site on a regular basis could be far easier on a search engine (and site owner as well). Just have a little ‘template bot’ crawl a few pages now and again to ensure the profile is unchanged.. but I’m rambling now…

Another implementation (as noted by the Google patent) could be pages that have a number of listings that are geographic in nature. A search for ‘stone oven pizza, Toronto’ could produce better results, as larger listings of pizza shops in Toronto could be segmented and digested by more granular parameters than normal.

“The text associated with the smallest hierarchical level surrounding a business listing may be associated with that business listing” – Patent; Document segmentation based on visual gaps

Segmenting the page

The nuts and bolts I shan’t trouble you with (links later as always), but the methods vary from code analysis (DOM) approaches to vision-based ones. The main idea is establishing common (boilerplate) segments of a web page… And from there the systems can be tuned to even more granular levels to find an optimal setting (playing with the dials).

Boilerplate-type elements can include headers, footers and navigational elements found throughout a site or on a single page. When looking at multiple pages, common elements can also be identified, such as the phrase ‘Copyright 2009’. Within a piece of content, common boilerplate elements such as a copyright notice or navigational links (‘Home’, ‘Contact Us’) can also be identified and disregarded as needed. The same can be said of advertisements and other blocks of information found throughout a website.

By disregarding such boilerplate text during indexation, the search engine can also attain greater relevance and save processing resources.
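A rough sketch of how the ‘common elements across pages’ idea might work: count how many pages of a site each text block appears on, and treat anything that shows up on most of them as boilerplate. The function names and the 80% threshold are assumptions of mine; a real system would surely be more forgiving of small variations in the text.

```python
from collections import Counter


def normalize(block):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(block.lower().split())


def find_boilerplate(pages, min_share=0.8):
    """Return the normalized blocks found on at least `min_share` of pages.

    `pages` is a list of pages, each a list of text blocks.
    The 0.8 threshold is an illustrative assumption.
    """
    counts = Counter()
    for blocks in pages:
        # Use a set so a block repeated on one page is counted once.
        counts.update({normalize(b) for b in blocks})
    threshold = min_share * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


def strip_boilerplate(blocks, boilerplate):
    """Keep only the blocks that were not flagged as boilerplate."""
    return [b for b in blocks if normalize(b) not in boilerplate]


if __name__ == "__main__":
    site = [
        ["Home | Contact Us", "Pizza ovens explained", "Copyright 2009"],
        ["Home | Contact Us", "Choosing firewood", "Copyright 2009"],
        ["Home | Contact Us", "Dough basics", "Copyright 2009"],
    ]
    common = find_boilerplate(site)
    print(strip_boilerplate(site[0], common))
```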

Now, obviously we’re a content-centric bunch in the SEO world, and so understanding how they might look to identify ‘content’ areas of the page is paramount (more later on the why). Elements often cited are:

Number of images in the block

Size of the images

Number of links

Anchor text length

Number of words

Length of form elements

Text formatting elements (<strong> <text> <i> <em>)

Other code elements (<table> <p> <hr> <ul> <td>)

Background color of a node (or child node)

Also, the size and position of the block can give added signals as to where the core content of the page resides. In most situations, this is the space search optimizers deal in. This is where we really play – this isn’t ’03, and site-wide footer links sucked long ago.
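For the curious, several of the signals listed above can be pulled out of a block of HTML with nothing more than Python’s standard html.parser module. This is only a sketch under my own assumptions: the class name, the choice of features and the ‘link density’ ratio are illustrative and not taken from any search engine’s method.

```python
from html.parser import HTMLParser


class BlockFeatures(HTMLParser):
    """Collect simple per-block signals (hypothetical feature set)."""

    FORMATTING = {"strong", "b", "i", "em"}

    def __init__(self):
        super().__init__()
        self.words = 0          # all words in the block
        self.links = 0          # number of <a> elements
        self.anchor_words = 0   # words that sit inside links
        self.images = 0         # number of <img> elements
        self.formatting = 0     # number of text formatting elements
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
            self._anchor_depth += 1
        elif tag == "img":
            self.images += 1
        elif tag in self.FORMATTING:
            self.formatting += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        count = len(data.split())
        self.words += count
        if self._anchor_depth:
            self.anchor_words += count


def block_features(html):
    """Return a dict of signals for one block of HTML."""
    parser = BlockFeatures()
    parser.feed(html)
    parser.close()
    # Share of the block's words that are anchor text.
    density = parser.anchor_words / parser.words if parser.words else 0.0
    return {
        "words": parser.words,
        "links": parser.links,
        "images": parser.images,
        "formatting": parser.formatting,
        "link_density": density,
    }


if __name__ == "__main__":
    nav = '<ul><li><a href="/">Home</a></li><li><a href="/contact">Contact Us</a></li></ul>'
    print(block_features(nav))
```

A navigation block tends to be nearly all anchor text, while a content block has plenty of words and few links, which is the sort of contrast a threshold could be set against.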

Now it’s not all dill and pickles…there are a few potential issues with page segmentation systems.

Problems

Not all people value the same parts of a page, given the different types of sites out there. For example, some may look at the stories on the home page of a social media site while others may look to the latest comments… because of this, setting the thresholds and valuations of segments is problematic. Even the boilerplate concepts suffer from this (as the ‘Top 10’ and comments are on every page at a site like Sphinn).

One might also attach varying importance to navigational elements when not directly on a page of interest (thus editorial areas can also vary). As with the above example, navigating to the ‘search marketing’ segment may be of greatest interest to me. What of a side panel element of ‘related topics’, which may or may not be of great importance to the viewer? The point being that it’s not a golden egg entirely.

As you can see, adoption still has a few mountains to traverse, but even in a limited capacity there are uses… one really jumps to mind (and shame on you if you weren’t already asking)…

Linked implications

Yes, that’s right… you know we just had to end up here right? One of the things that really drew me into this topic more and more over the last year is the potential implication for links. One would have to imagine the link dominant search engines would welcome a system that could potentially provide more granular levels of link valuations (on site and off).

To begin with, page segmentation can help bolster link analysis methods such as PageRank, HITS and their ilk. Or so the story goes. Consider a page with a variety of semantically or not so related content, complete with links (internal or external). Traditional analysis tells us that the page is treated as a whole, and thus link relevance can be affected by the lack of a focused theme. If search engines can begin to break out blocks of information, independent of the whole, new valuations can be had for links from within a single document. In short, there could be more link juice to go around.

Another interesting element would be the ability to build links to a multi-semantic page with diverse anchor texts. Many times in SEO one creates target page(s) built around terms and builds related links to that page. This has always made ecommerce SEO a struggle between clicks to purchase and SEO readiness as far as structural elements and ‘landing pages’ are concerned. Page segmentation methods mean we could build more diversified link profiles to a given page (such as a main category page in the case of the ecommerce example).

Think of links from a block-to-page level and page-to-block (instead of say PageRank which is page-to-page). One can see how greater relevance from link analysis can be had.
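To picture the difference, here is a toy sketch in which each page’s vote is first divided among its blocks according to a block weight, and only then among the links inside each block. To be clear, this is not the published block-level PageRank algorithm; the function, the weights, the damping factor and the example graph are all assumptions for illustration.

```python
def block_weighted_rank(pages, damping=0.85, iterations=50):
    """Toy PageRank variant that splits a page's vote by block weight.

    `pages` maps a page name to a list of (block_weight, [target pages]).
    All values used here are illustrative assumptions.
    """
    names = list(pages)
    n = len(names)
    rank = {p: 1.0 / n for p in names}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in names}
        for page, blocks in pages.items():
            # Keep only blocks with a positive weight and known targets.
            live = [(w, [t for t in targets if t in pages]) for w, targets in blocks]
            live = [(w, ts) for w, ts in live if w > 0 and ts]
            total = sum(w for w, _ in live)
            if total == 0:
                # No usable links: spread this page's vote evenly.
                for p in names:
                    new[p] += damping * rank[page] / n
                continue
            for w, ts in live:
                # The block gets a share of the vote in proportion to its weight.
                share = damping * rank[page] * w / total
                for t in ts:
                    new[t] += share / len(ts)
        rank = new
    return rank


if __name__ == "__main__":
    graph = {
        "hub": [(0.9, ["article"]), (0.1, ["sponsor"])],  # content block vs. ad block
        "article": [(1.0, ["hub"])],
        "sponsor": [(1.0, ["hub"])],
    }
    for page, score in block_weighted_rank(graph).items():
        print(page, round(score, 3))
```

In this made-up graph the link from the heavily weighted content block passes along far more value than the one from the ad block, even though both sit on the same page.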

Now this can be a double-edged sword, in that block-level link analysis (such as BLPR) could play into the valuation of links. This could mean wholesale devaluation or dampening of certain link types. This could include:

Advertising blocks

Blog comment links

Header/footer links

Site wide links

Forum signatures

As you can imagine, this would put a high premium on editorial (diverse) links within content. The ‘boilerplate’ models could also easily pick out mass paid link programs and article marketing links as common (unnatural) elements.

Spam busting

To me, much of what has been looked at in page segmentation has some interesting applications in spam detection. For starters, more complex template and content analysis is bound to turn up many boilerplate websites such as those produced by web spammers. On a granular level, poorly generated content for spam sites could in itself create boilerplate text right within the content.

Beyond that, certain template systems employed by spammers over and over can be identified and cross-referenced with other factors (link spam analysis, IPs, whois) to profile spammers. And speaking of link spam, these systems could also identify common locations in which boilerplate link texts are showing up for a given link profile. Ultimately, any IR system should have some spam detection capabilities, and these methods satisfy that on a few levels.

As mentioned above some elements of block level link analysis could be used to identify linking schemes in concert with existing methods. Consider large scale paid blog reviews or article marketing campaigns where the template changes from site to site, blog to blog, but the main content (once identified) contains identical anchor texts/author bios. Analysis on a page by page basis wouldn’t be as effective as a block by block analysis.
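As a rough sketch of that last idea, suppose each link has already been labelled with the type of block it was found in. One could then look for the same anchor text and target turning up inside the main content of many different domains. The tuple layout, the ‘content’ label and the three-domain threshold are all assumptions of mine.

```python
from collections import defaultdict


def repeated_content_links(links, min_domains=3):
    """Find anchor/target pairs repeated in content blocks across domains.

    `links` is an iterable of (domain, block_type, anchor_text, target).
    The block labels and the threshold are illustrative assumptions.
    """
    seen = defaultdict(set)
    for domain, block_type, anchor, target in links:
        if block_type != "content":
            continue  # template-level links are handled elsewhere
        key = (" ".join(anchor.lower().split()), target)
        seen[key].add(domain)
    # Keep only the pairs seen on enough distinct domains.
    return {key: sorted(domains) for key, domains in seen.items()
            if len(domains) >= min_domains}


if __name__ == "__main__":
    sample = [
        ("blog-a.example", "content", "Cheap Widgets", "http://widgets.example/"),
        ("blog-b.example", "content", "cheap  widgets", "http://widgets.example/"),
        ("blog-c.example", "content", "Cheap widgets", "http://widgets.example/"),
        ("blog-c.example", "footer", "Home", "http://blog-c.example/"),
    ]
    print(repeated_content_links(sample))
```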

What could it mean to your SEO?

And so what does any of this mean to you? Well, to be honest, we don’t know if any of the big three implement page segmentation concepts on any level, though Microsoft certainly has had a strong addiction to the topic over the years. As with many search applications, the end user experience is a running concern. Many of the adaptations we’d make with such methods in mind would ultimately make for a better and more concise end user experience. At the very least, we can improve usability and prepare for potential search evolutions all at the same time.

Some key takeaways could be:

Create distinct blocks when constructing pages and ensure it is obvious where the content is.

Ultimately, if such systems did gain traction, it would become increasingly possible to rank a single page for a variety of terms, beyond the abilities we currently have for targeting. One place these concepts might apply is ecommerce applications with varied (though semantically related) product lines. Let’s look at the following example:

In an instance such as this, formatting and targeting the text within each of the blocks becomes of great importance. We could also consider alternating the background colors of the product nodes to define them as unique segments. You can also ensure the upper and lower segments are properly targeted as a parent or child node of the overall presentation.

For all we know, search engines could look at top performing pages in a query space and analyze them for semantic blocks and other segmentation variables to create new signals for other pages in the set (query space). We can say that there is interest, potential and even likely advantages from processing and spam detection standpoints. What we can’t say is how deep or valuable page segmentation will be to search engines in the future.

So far Microsoft and Yahoo seem to have the most interest, although I wouldn’t count Google out as what I read of ‘block-level-pagerank’ seemed promising. I wouldn’t go changing how you optimize just yet, but tuck this one away. It is something to watch for in 2009… just in case.

Comments

I'd doubt it as nofollow is 'really' only supposed to be about not voting for a given page, ie; not sure of the quality of the site

It shouldn't have any bearing on the relevance of the content. Not many papers/patents really delved into the link implications, which is strange as I think it is a HUGE potential benefit of such a system.

Now... how do we go about testing to see what if any of the methodologies are out there in the wild? Dats my next question...

It's vaguely a point regarding page segmentation, but not wholly. I have thought for a long time now that image size should be supported by an increased 'allowance' for anchor text size, i.e. the bigger the image, the bigger the allowance for anchor text. Matt Cutts once recommended 4-5 words I think, but really, if the only way a system/annotated machine/etc can be displayed is via an image, 4-5 words doesn't really help.

You would think that page segmentation is an opportunity for bots to read into this area a little more?

Good presentation Dave. I would add that segmentation also benefits the visitor by acting as a "guide" in the decision making process. I don't have any numbers but I would bet money that a segmented ecom website will convert much better than a non-segmented site.

Thanks for a fantastic post on this concept. Although I have read and thought about these methods for a while, this is one of the best concise summaries of it I have found.

At a higher level, different search engines and their implementations of block level analysis should be able to build up strong patterns between links and content themes to identify spamming and link trading, which may add little to 'page' value.

Page segmentation and its relationship to SEO and search indexing is certainly an interesting area, and one I will pay more attention to. I think it could well be a bit of a 'playing field leveller', as search engines could more intelligently filter out low value pages and sites, and possibly reward sites that haven't undertaken 'textbook seo' practices to boost their traffic, but nonetheless have strong relevant and helpful content to consumers.

I really appreciate the quality of the thinking that went into this post. Nicely done. Seems like there are some interesting CMS implications to this theory. I can imagine we'll see a lot of interesting block-related plugins show up in the future.

It is certainly an interesting area of study, although not new as I mentioned.

For me, at this point, I think digging into how the system may be manipulated is the next step. Assessing the likelihood of any new signal means seeing what the spam related costs may be in relation to it.

I believe we all know that editorial links within the content are king and that strong layout encourages user satisfaction. Thus nothing very new would come from adapting to deal with such methods. It's simply an interesting area of discussion and worth doing.

@Nathan - I must agree that it could help weed out the spammy link builders and that's a good thing IMO - I believe in simply creating great content, resources and products for the end user and that's where quality links come from. Such methods certainly do lend themselves to such approaches and would make me a happy SEO :0)

For a page to rank for more terms might lead to user confusion if the segment is not the primary content, so if they are looking for "puppy rescues" and there is an article buried in the page somewhere (properly segmented), then we'll be losing the happy user experience.

It is an interesting concept, and I've been of the belief that Google has been using it for at least a year... based on how things get re-indexed with changes.

Yes, there certainly are issues such as that as well, but unless people start to semantically mark up pages, it will remain so. It takes more work and unfortunately that is likely to hold it back from greater adoption.

If Google is applying such methods, they're not great at it IMO. I've seen instances of side-bar blog rolls in SERPs as well as text from embedded Twitter feeds. So, this does hint at it either not being implemented, or still having issues if it is (with Google).

I think that the granular approach will gain traction as search engines try to make sense of a noisy signal. Right now we have page density, but that is not the best. Filter out the ads and banners to get to the meat. :idea:

Now how about interlinking of pages with the block segmentation? There has been a lot of stuff like PR conservation and sculpting. Is block-page analysis more effective in preserving the PR? Then how? Is it a good idea to provide nofollow links within block-page & page-block? Anyway, thanks for the great post. Love it!

Very thorough post, and how you've broken down the process is particularly helpful. Page segmentation is such a complex element of SEO and one that is rarely discussed. Interesting that Google doesn't use it more. I look forward to hearing more about this from you in the future.