Mapping the Deep Web

There are a lot of pages on the Web that conventional search engines can’t find, crawl, index, and show to searchers. The University of California (UC), funded partially by the US Government, has been working to change that.

When you search the Web at Google or Yahoo or Bing, you really aren’t searching the Web, but rather the indices that those search engines have created of the Web. To some degree, it’s like searching on a map of a place instead of the place itself. The map is only as good as the people mapping it.

Map makers have consistently worked to develop new ways to get more information about the areas that they survey. For example, a New Deal program in the 1930s under the Agricultural Adjustment Administration led to the creation of a $3,000,000 map.

In 1937, 36 photographic crews flew 375,000 square miles (970,000 square km), and by late 1941 AAA officials had acquired coverage of more than 90 percent of the country’s agricultural land.

From its initial goal of promoting compliance, the Agriculture Department’s aerial photography program became a tool for conservation and land planning as well as an instrument of fair and accurate measurement.

Local administration and a widely perceived need to increase farm income fostered public acceptance of a potentially intrusive program of overhead surveillance.

The map was pieced together from a very large number of prints, scaled so that one inch on the map equaled roughly 660 ft on the ground (a scale of about 1:7,920). This post includes images from the Library of Congress of people working on preparing that map, and measuring images from it.

Regardless of efforts like that, there are still blind spots on maps.

Regardless of the efforts of search engines, there are also many blind spots in their indices of the Web. The number of pages online that search engines know about is probably much smaller than the number of pages that they don’t know about.

Some of those blind spots come from webmasters inadvertently blocking content on their web pages, by using JavaScript in navigation that search engines can’t crawl, by requiring visitors to accept cookies to see pages when search engine crawling programs can’t accept cookies, or through a good number of other factors.

Some of those blind spots are from sites that might like to share the information contained on their pages, but have made it accessible only through search forms on their web sites.

A patent from three researchers on behalf of the Regents of the University of California explores ways to index pages on the Web that are publicly accessible, but require visitors to enter query terms in a search box to find pages. Of course, the search engines aren’t standing still on finding these kinds of pages either. See my post from a few years back: Google Diving into Indexing the Deep Web.

Inaccessible Web Pages

There are a few different ways that search engines get information about web pages. One of them is to use programs to spider or crawl Web pages and collect the addresses, or URLs, of other pages on the Web. Another is the XML sitemaps that site owners publish, which search engines use to learn about new pages. Search engines will also accept data from XML and RSS feeds from sites to discover new URLs, and information at those URLs.
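To make the sitemap route a little more concrete, here is a minimal sketch of how a crawler might pull URLs out of an XML sitemap. The sample sitemap and the function name urls_from_sitemap are made up for illustration; the namespace is the one defined by the sitemaps.org protocol. A real crawler would fetch the file over HTTP and handle sitemap index files as well, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, inlined so the example runs without a network.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url>
    <loc>https://example.com/articles/deep-web</loc>
    <lastmod>2010-06-01</lastmod>
  </url>
</urlset>"""


def urls_from_sitemap(xml_text):
    """Return the <loc> value of every <url> entry, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


if __name__ == "__main__":
    for url in urls_from_sitemap(SAMPLE_SITEMAP):
        print(url)
```

A sitemap only helps with pages the site owner chooses to list, which is why it doesn't solve the problem of pages that sit behind a search form.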

Places like the US Patent Office make copies of many granted patents and published pending patent applications available to the public, but those are hidden from search engines because the patent office has set up that information so that it can only be accessed by queries from searchers. Google started their own patent search engine, but it doesn’t have as timely information as the patent office’s databases.

The University of California patent uses PubMed, Amazon.com, and DMOZ as three examples of sites that search engines might have trouble with. I’m not sure that the setup of those sites is still difficult for search engines to index, but there are sites, like the patent office’s, that can’t be easily indexed by the search engines.

Why explore the Deep Web, and make more information available for search engines to index?

We’re told in the patent:

The method and system would improve the overall user experience by reducing wasted time and effort searching through a multitude of site-specific search interfaces for Hidden Web pages.

Finally, current search engines introduce a significant bias into search results because of the manner in which Web pages are indexed.

By making a larger fraction of the Web available for searching, the method and system is able to mitigate the bias introduced by the search engine to the search results.

Interestingly, one of the inventors listed on the patent is Junghoo Cho, whose paper Efficient Crawling Through URL Ordering, co-authored by Hector Garcia-Molina and Larry Page, was cited on a now missing page on Stanford’s site amongst a list of papers that influenced the early days of Google. It may be one of the first papers that explained what metrics a web crawling program might look at when deciding which pages to crawl on the Web when faced with a choice of URLs found on other pages.

That paper was published more than a decade ago, and while there has been news of Google experimenting with crawling through forms to find inaccessible pages to index, it’s hard to gauge how effective their efforts have been.

Identifying Queries at Site Search Interfaces

The UC patent describes how a crawling program might attempt to uncover pages behind site-specific search interfaces by entering query terms into those search forms. It might start with a “seed” term, possibly found on the search interface itself. After trying other queries, it may create a results index from searches that discover pages.

The next step would be to download those pages, and explore them for other possible queries that could be run on the site, estimating from words and phrases found on those pages which terms might be most efficient in finding other pages through the search interface.

Many site search interfaces include more than one form field for multiple attributes, and potential keywords might be identified for each of those attributes.

A method and system for autonomously downloading and indexing Hidden Web pages from Websites includes the steps of selecting a query term and issuing a query to a site-specific search interface containing Hidden Web pages.

A results index is then acquired and the Hidden Web pages are downloaded from the results index. A plurality of potential query terms are then identified from the downloaded Hidden Web pages.

The efficiency of each potential query term is then estimated and a next query term is selected from the plurality of potential query terms, wherein the next selected query term has the greatest efficiency.

The next selected query term is then issued to the site-specific search interface using the next query term. The process is repeated until all or most of the Hidden Web pages are discovered.
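The loop in that description can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the patent's method: SITE, search, and crawl_hidden_pages are hypothetical names, the "site" is an in-memory dictionary instead of a real search form, and the efficiency estimate is a crude stand-in (how many already-downloaded pages contain a term) for the estimators the inventors actually describe.

```python
from collections import Counter

# Hypothetical toy "site": page id -> page text. A real hidden-web site
# would only expose these pages through its search form.
SITE = {
    1: "patent search engine crawler",
    2: "patent claims inventor",
    3: "crawler index web pages",
    4: "health records database",
    5: "database index query",
}


def search(term):
    """Stand-in for the site-specific search interface.

    Returns the ids of pages whose text contains the term as a whole word.
    """
    return {pid for pid, text in SITE.items() if term in text.split()}


def crawl_hidden_pages(seed, max_queries=10):
    """Greedy query selection: issue a term, download results, pick the next term."""
    downloaded = {}  # page id -> text of every page found so far
    issued = []      # query terms already sent, in order
    term = seed
    while term is not None and len(issued) < max_queries:
        issued.append(term)
        for pid in search(term):
            downloaded.setdefault(pid, SITE[pid])  # "download" each result page

        # Count how many downloaded pages contain each word.
        doc_freq = Counter()
        for text in downloaded.values():
            doc_freq.update(set(text.split()))

        # ASSUMPTION: use that count as a rough efficiency score for terms
        # not yet issued. Ties go to the alphabetically first term.
        candidates = {t: n for t, n in doc_freq.items() if t not in issued}
        term = max(sorted(candidates), key=candidates.get) if candidates else None
    return downloaded, issued


if __name__ == "__main__":
    pages, queries = crawl_hidden_pages("patent")
    print(sorted(pages), queries)
```

Starting from the seed term "patent", this toy crawler ends up reaching all five pages, including ones that share no words with the seed, because each batch of downloaded pages supplies new terms to try. A real crawler would also need to weigh the cost of each query and be polite about how often it hits the site's search form.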

While the patent provides many details on how a hidden web crawler would work, the inventors of this patent have also published a much easier to read whitepaper that covers much of the same territory: Downloading Hidden Web Content.

Being able to access pages on the Web that contain publicly accessible documents like patents or health information can yield a lot of benefits, much like the 1937 mapping project I mentioned at the start of this post.

Comments

This is fascinating Bill. Do you think that the deep web is anything like space or the ocean? What percent of the deep web do you think search engines as a combined whole have indexed? 3%? 10%? 30%? More? I’m curious to know what you think.

Hey Jey, I’m sure Bill will agree that the ‘observable web’ will vastly outsize the hidden web due to the old motivator ‘cold hard cash’, just as libraries are full of inane and meaningless fictions compared with the small ‘reference library’ to be found typically within, tucked into a dusty corner 😉 Porn alone will probably represent the equivalent of ‘Dark Matter’ lol

Interesting thoughts. I would say that it is probably closer to 20-30% than 3%, but that is just my opinion. I think Google, Yahoo, etc. will eventually find most pages that are linked to in some way. You have to ask yourself if a page is out there on the deep web, with no pages linking to it, how important is it?

In some ways the deep web is like those unexplored territories. A recent news article that I read, but don’t remember the source for stated that there were more stars in the sky than grains of sand on Earth. Is the universe that vast? I don’t know, but that was a great way to illustrate the possibility.

We don’t know much about the bottoms of the oceans either, and the things we learn about them are sometimes surprising. What percentage of the deep web have the search engines indexed? I think we have to ask instead what percentage of the Web have search engines indexed, and what percentage is the Deep Web, or the parts of the Web that they haven’t indexed. I’m not sure that we can give a reasonable answer. What we don’t know about the things we don’t know does not easily conform to our expectations or our estimations.

I have seen estimates that more than 99% of the Web isn’t indexed, but I’m really not willing to venture a guess.

I have seen many pages on the Web that aren’t very well optimized for search engines, and which aren’t hidden behind logins or search boxes, and I don’t know if we can refer to those as part of the Deep Web.

There are pages formatted as Word Documents, Powerpoint Presentations, Excel spreadsheets, and so on that conventional search engines couldn’t index and now can.

There are also many pages that had long dynamic URLs that search engines had trouble with, but which search engines are doing a better job of indexing.

Search engines do provide results that include abstracts of many journal articles where the content of those journals isn’t accessible to visitors without paying to subscribe or to buy individual articles themselves. In some ways, those could be considered part of the Deep Web.

Search engines tend to avoid crawling many pages that may contain some useful information, but contain little unique information. For example, a site that might have many thousands of pages with short entries on them, consisting of a few unique words and a fair amount of duplicated boilerplate may not be something search engines value enough to crawl and index.

There are also pages on the Web that site owners have barred search engines from visiting and indexing through robots.txt rules or meta noindex elements.

If forced to compare the size of the visible web that search engines can visit and index with the size of the invisible or deep web that they can’t, I would guess that there are far more pages within the deep web than there are within the indexes of the search engines.

And I would agree with you that cash is a common reason for many pages not being indexable.

Many sites and businesses require paid subscriptions or membership logins to see their pages. This includes journals and magazines, paid forums, news archives, and many more. Many pages on the web are private for one reason or another, and don’t welcome search engines. Those can include personal blogs, photo galleries, documents generated by cloud computing applications, and more.

I would imagine that the owners of many pages on the Web wouldn’t want search engines to index the content of those pages, but there are some that do contain public documents that are hard, or impossible for search engines to access.

For example, the US patent office has technical limits on the number of searches that can be performed on their databases, and it’s quite possible that they couldn’t handle the load that might be forced upon their server, or servers if those documents could be indexed by search engines. They also have sections of their site that require a paid subscription.

You have to ask yourself if a page is out there on the deep web, with no pages linking to it, how important is it?

I know that some businesses have placed documents on the Web that have gotten them in trouble with the Securities and Exchange Commission, because they contain information that shouldn’t have been made public, even though the people placing them on the Web didn’t intend them to be public. Those were pretty important.

At the risk of confronting the citation analysis approach that Google bases PageRank upon, I’m not sure that a lack of links to a page is a measure of how important or unimportant that page might be. While links are one possible measure of a perceived importance, they aren’t a measure of actual importance.

Thank you. I think it’s a useful way of thinking about search engines – as a map of the Web rather than the Web itself. They determine what to show us based upon how they decide what to crawl and index and measure. Their mapping isn’t always precise, and it may not always bring us the best results, but like all map makers, they have to decide what techniques and methods they might use to bring the best results they possibly can. If Google doesn’t deliver results to us from the Deep Web just yet, that’s because those results are harder to reach and involve a lot more work.

I’m guessing that you’re referring to the print making machine in that first picture. I thought it was pretty odd looking myself, which is why I decided to use it for this post. I don’t think it was standard photography equipment back then. The film those map makers/photographers were using was around 6 inches wide, too.

At least now there’s a way to uncover those hidden web pages. Like you’ve mentioned, there are other webpages that contain a lot more useful information, but their owners just don’t know SEO. We may not even realize it, but there are a lot of great websites with a lot of potential that stay closed off because they lack SEO. Sometimes useful information is hard to find.

There are a lot of great websites that don’t show up in search results because the search engines have trouble crawling their pages. The search engines have been trying to come up with ways to help site owners, including initiatives like the canonical link element, XML sitemaps, and webmaster tools that can provide some helpful information.

Google even published an introductory guide to SEO (their Search Engine Optimization Starter Guide). It doesn’t cover every aspect of how to do SEO for a site, but it’s not a bad starting point for webmasters.

It would be great to see more pages from the deep web become available to searchers. Hopefully efforts like the one behind this patent, to crawl more of the deep web, will make that information more visible so that we can find it.

A reliable technology for crawling the “deep web” would really open up web searches to some truly interesting personal information.

Online public records searches are one of the main “walled gardens” that would be opened up by this technology, revealing all sorts of personal records information currently accessible only through online court records search and property records searches.

People search sites like 123People are getting better at finding and revealing this personal data, but are still far from opening up all of the public records information online that is potentially available. However, the whole potential of “Googling” people hasn’t even begun to be realized yet.

You raise some issues that are ones that we all should be considering much more carefully.

In my days working for the Courts in Delaware, we were the record holders for some information that could be accessed by the public in person, but it was never anticipated that the information contained in those records might one day be available on the Web to anyone who wanted to find it. In some ways, privacy was protected by limited accessibility.

That includes legal judgments, recorded deeds, fictitious names (trade names or “doing business as” names), licenses to carry concealed deadly weapons, civil and criminal case records, and others.

While most of those records were considered public, we were faced with the possibility of putting much of that information up on the Web where people could search for it, and had to spend a lot of time trying to decide what to put online, and whether or not it should be presented differently. For instance, should legal judgment records include social security numbers on them when they were placed online?

Many states and other governments haven’t placed a lot of that kind of information on the Web, yet. Chances are that more and more of it will become available in the near future. It’s going to make the collection of personal information easier in many ways, but it’s also probably going to transform how we think about public record information as well.

Yes, I’ve read somewhere that what these search engines have indexed combined doesn’t even reach half of the total billions of web pages out there. Even saying “reaching half” is already an overstatement, since I think the figure is more like 10% or less.

I’m not sure that there’s really any easy way to estimate the number of pages that might be in the Deep Web, but I have seen some people estimate that the total number is much higher than the number of pages on the visible web.

I do believe that bias in search results is more based upon the approaches that the search engines take to discovering pages on the Web than it is to any intentional desire to not show specific pages.

There’s a good chance that many of the most informative and scholarly pages on the Web about many subjects aren’t going to appear in search results because those pages are behind subscription logins. Some great sources of local history, such as the archives to newspapers like the New York Times are also behind walled gardens that one has to pay to see. We are gaining more and more access to those types of results through things like Google Scholar and Historical Google News Search results, that may lead us to paid or subscription-based results at times. The barriers aren’t always created by the search engines, but rather by the owners of much of that content.

@ Alex: “Never thought about the fact that Google is more like a lens you watch through. It’s not the real web. You only see the results that win by an algorithm defined by the search engine.

Google has so much power to influence what we see and no one questions that.”

Now, imagine how many good websites there are out there that no one knows about. But what we don’t know can’t hurt us, right? How can we miss something we don’t know exists?

It is, however, a disturbing thought that Google can funnel the search engine traffic wherever they feel like with their algorithm. We do know their policy, but we have no way of telling if they serve us the entire “truth” on any given subject, if you understand what I mean. Not accusing them of anything, but just think about it.

The Internet of Things will add considerable content to the deep web, creating huge information shadows for each device or thing connected. Couple this with the continued growth of mobile computing and you can see where this goes. I think the end result is that specialized search engines become more and more important, as only they will be able to traverse and catalog the content in a way that makes it accessible beyond a link. Call it consumable results instead of link results.

Highly insightful post. I personally have to agree with Shannon, in that I believe the total number of indexed pages does not reach the halfway line. But I guess that’s something so debatable that the de facto answer will most likely never be known.
Thanks for the article-tip, Dave (Tribbett). Enjoyed that too.

I think this will improve the quality of searches for all search engines. There are countless pages with good information that are not being crawled by the search engines for reasons like using JavaScript, Flash, etc. Hopefully this will change that.

Chances are that this kind of mining of the deep web will uncover more pages that would otherwise rarely be seen in search results, and likely increase the quality of those results by including within them more relevant pages. It may also cause some pages that were ranked more highly in the past to move down in search results, especially for long tail type results.

I actually have an article directory in a sub-folder in my site and I have found that submitting sitemaps is absolutely key. Google did not find tons of pages that were four folders deep until I submitted a sitemap specifically for that script. Once I told Google where they were, they began indexing them like crazy…literally thousands of pages. XML sitemaps are an absolute must…no exceptions.

Hi Bill
I have noticed that with my sites, even if I don’t link to them anywhere and don’t even write content, Google will find them eventually.
The same can’t be said about Yahoo and Bing, where I need to submit my sites to get indexed.
So I would say Google is very good at finding pages.

Thanks. Ideally you shouldn’t ever have to submit your pages to a search engine to have it find them. Keep in mind that Yahoo is using Bing’s crawl of the web to populate its search database these days, so if something isn’t crawled by Bing, it probably won’t be showing up in Yahoo either.

When you create a page or post, usually it’s linked to by other pages on your site, and a search engine should be able to find it from those links, even if you don’t have external links pointed at that new page.