Search Indexing Dead Ends: IBM Patent Explores Dangling Nodes

Someone is randomly clicking on links through your site, and looking at and reading those pages. They come to a page that has no links on it. Where do they go next? What kind of impact do those pages have on rankings for your site’s pages.

Nodes and Dangling Nodes in Ranking Web Pages

Many ranking systems use graphs to represent and analyze information structures. On the Web, pages can be seen as “nodes” and the links between pages as “edges.” Ranking those nodes, or pages, in those graphs by their quality or importance can be a valuable approach.

PageRank is one of the systems that uses the idea of nodes and edges to score pages on the web, and the technique of PageRank can also be applied to the scoring of nodes in other types of networks.

A patent from IBM attempts to provide some improvements to the PageRank algorithm, and other similar information graphs when looking at a specific type of node, known as a “dangling node.”

Sometimes nodes either have no links out to other pages, or their outlinks are not accessible to a ranking processor. When that happens, those nodes are known as “dangling nodes”.

The newly granted patent explores dangling nodes in depth, and provides a list of pages that may be treated by search engines as dangling nodes. A node may be dangling for a variety of reasons:

The search engine may know about the page, but it may have not yet been crawled.

The page may genuinely have no outlinks.

The page may be linked to, but disallowed from crawling by a robots.txt file.

A link may exist to the page, but the page may have been deleted by its author.

Postscript or PDF files, which rarely have embedded outlinks.

If the page requires authentication.

Pages that return error messages (http header error messages in the 400 and 500 classes) at the time of a crawl, indicating that a page isn’t available.

Problems with Dangling Nodes

The patent filing looks at what it calls problems with how dangling nodes are treated.

Identical Treatment – Conventional approaches to graph ranking techniques usually treat all types of dangling nodes identically, and remove dangling nodes from a graph before calculating rankings and then add the dangling nodes back into the graph ranking analysis. See:

The inventors listed in this patent tell us that “removing the dangling nodes entirely skews the results on the non-dangling nodes somewhat since the outdegrees from the non-dangling nodes are adjusted to reflect the lack of links to dangling nodes.”

Web Crawling – These conventional approaches to ranking with dangling nodes do not account for the various types of dangling nodes. Understanding these different types of dangling nodes may be helpful when decisions need to be made involving web crawlers and decisions about which links to dangling web pages should be followed, and how to assign ranks to those dangling web pages. See:

Node Rot – Pages sometimes get deleted. When a page is deleted, links that point to it will become “broken”. The existence of broken links on a page may be taken to be an “indication of low standards on part of its author.” The URL to the deleted page may be considered a dangling node, and the page linking to it may be seen as less valuable. See:

The patent tells us that a system is needed to rank dangling nodes in a graph, and adjust rankings for something that they call “penalty pages,” which are pages with links to pages that return error messages when a crawler comes to visit and send information back to a search engine indexer.

A dangling web page processing system ranks dangling web pages on the web.

The system ranks dangling web pages of high quality that cannot be crawled by a crawler.

In addition, the system adjusts ranks to penalize dangling web pages that return errors when links on the dangling web pages are crawled.

By providing a rank for dangling web pages, the present system allows the concentration of crawling resources on those dangling web pages that have the highest rank in the uncrawled region.

The system operates locally to the dangling web pages, providing efficient determination of ranks for the dangling web pages. The system explicitly discriminates against web pages on the basis of whether they point to penalty pages, i.e., pages that return an error when a link is followed.

By incorporating more fine-grained information such as this into ranking, the system can improve the quality of individual search results and better manage resources for crawling.

Conclusion

There’s a lot of discussion in this patent about how the different types of dangling nodes (or pages) I’ve listed above can be ranked differently, and how the rankings of pages which link to those can be affected by the existence of those dangling nodes.

Under the system described, links to deleted pages, and links to pages that produce server errors can have a negative impact upon the pages of your site that provide those links.

If you’ve ever wondered how pages that a search engine hasn’t crawled can be ranked highly in search results, or how a pdf file might have a certain PageRank, it’s worth working through the detailed description of this patent.

While the major commercial search engines may not use the methods described in this patent, it provides some insights into assumptions made by search engineers involving the rankings of those pages that can’t be crawled by search engines for one reason or another, or for pages that have no links pointing out from them.

We had a nice discussion about PDF files in June, where I brought up some issues involving dangling Nodes and PDF files – Monetizing Pdf’s. If you have some thoughts on the topic, please feel free to add to the conversation.

Some additional papers

There are a lot of papers referred to in this document. Since the patent was filed in 2004, it doesn’t include references in its description to newer papers that explore issues related to dangling nodes.

There were a number of additional documents referred to in the patent in a section labeled “Other References.” These weren’t necessariliy cited by the authors of the document, and some of them were provided by the patent examiner. I was able to find links to all but two of them. They may provide some additional insight into this patent, so I include the list here:

Yes. Dangling nodes can be pages that have no links on them pointing out at all, such as most PDFs, PostScript pages, HTML pages with no links, text pages, and others. But dangling nodes can also be pages that do have links pointing out from them that a search engine doesn’t know about, if a search engine knows about links pointing to those pages.

Google may consider a page to be a dangling node if the page has a link pointing to it, but Google cannot crawl the page because:

The search engine is disallowed from crawling the page because of a robots.txt disallow statement, or;
There’s a meta robot statement in the head of a page indicating “noindex/nofollow” or even “index/nofollow,” or;
There’s a link to the page, but the page is behind a required login, so the search engine doesn’t seen any links pointing out from it, or;
The page is linked to, but no longer exists, or;
The page is linked to, but there’s a server error keeping the page from resolving, or;
The search engine has discovered a link to a page, but hasn’t crawled the page yet, or;
There’s a link, but the URL in the link is misformed and there’s no real page on the other side of the link, or;
Other reasons.

The inventors listed in the patent filing have also written a paper on this topic, which is an easier read than the patent. The paper is at:

The interesting question isn’t so much whether or not all of these different kinds of pages are dangling nodes or not, but rather how a search engine might account for them when calculating something like PageRank. Some dangling node pages, like PDFs can have fairly high PageRanks, and can rank well in search results. Pages that a search engine are disallowed from being crawled, or haven’t been crawled yet can still have PageRank calculated for them, and may be indexed based upon the URLs found linking to the pages. One major difficulty is in deciding how and at what stage to calculate PageRank for dangling nodes.