The screenshots below are from a video produced by Yelp.com and TripAdvisor.com. They show what Google local search results would look like if powered by local review sites. According to the study published at the website FocusOnTheUser.eu (FOTU), 23% of users prefer local search results powered by local review sites. What do you think?

Here is an example of FOTU's proposed local results for the query [hotel bilbao spain].

As you can see, local results powered by local review sites lack important details local searchers want. Because these results use relevancy algorithms instead of localized algorithms, the wrong kinds of businesses and businesses in the wrong location will appear frequently.

Here is another example from the FOTU website. In this case, it shows Google local results for [pediatrician nyc] powered by Zocdoc.com.

Clearly the results in the screenshots above are not better for local searchers than Google local search results. How did FocusOnTheUser.eu reach the opposite conclusion?

The study is based on results from previous studies. One study was debunked earlier this year by SearchEngineLand.com's Founding Editor Danny Sullivan. The other study found no significant increases when local results were set up like the ones in the FOTU video.

Usability experts and Google agree: when you know the answer to a user's query, it is best not to make users "click anywhere." That is the idea behind Google "instant" search results and why Google does not make users click local results when it "knows" what the user wants. When users mouse over local search results at Google, the local information users want is displayed on the right side of the page.

It is impossible to compare Google local results without testing page functionality. FOTU did not address, test or measure Google local page functionality. The site claims to "preserve" a way for users to find information without having to click, but it provides no guidance for finding missing details. Even after clicking the links from the video, I could not find phone numbers or other important details. Instead of testing how Google local search results actually work, FOTU tested something else.

Instead of measuring success with user-focused metrics, FOTU measured success as increased click-through rates. The tools they used will tell you what users did but not what they intended to do. Users typically do not click away from information they intend to find, so clicks are not always the best measure of utility.

At 5:50 into the video, FocusOnTheUser.eu presents the results of its study. According to the video, the site proved local searchers prefer the proposed results over Google by 23%. These results are based on a "significant increase" in "click engagement," but where?

A number of clicks went to links created by the tool used for the study. These links do not exist in real world local search results.

Some clicks went to a cafe in a city 18 miles away. Other clicks went to a cafe clearly marked as being CLOSED. Still other clicks went to a duplicate of the first listing with different ratings and reviews. When you add all of these clicks together and add clicks caused by missing map pins, you come up with almost 23%.

According to the site, ratings and reviews are especially important to users. Ratings and reviews are important but only when they are for things users want to find. FOTU claims that having reviews and ratings exclusively provided by Google raises critical questions. That said, none of the data provided appears to reflect any clicks for ratings or reviews. In fact, most if not all of the clicks shown went to local business websites.

FOTU claims "Google promotes search results drawn from Google+ ahead of the more relevant ones you would get from using Google's organic search algorithm." I will talk about that more in a minute but the FOTU widget proves that Google does not promote Google+ ahead of other results. The FOTU widget does however exclude all sites other than Google and local review websites.

FOTU wants Google to use standard relevancy algorithms for localized search results, but Google local search results are based primarily on "relevance, distance and prominence," not just relevance. FOTU provides a widget to demonstrate the differences between Google's search algorithm and the Google Maps algorithm used for local search queries. I used the widget for the screenshots below. As shown below, using standard algorithms for localized search queries is not in the best interest of users.

Yelp.com, TripAdvisor.com and other local review websites are not local businesses and do not have local business addresses. For that reason these sites probably should not rank ahead of true local businesses. The FOTU widget essentially excludes all local business websites from appearing in local search results.

At the end of the day, these local sites want a "single conspicuous" link "directly to" their site from Google local results. It is important to remember that these are the same sites that blocked Google in the past and forced Google to invest billions of dollars in local search. Now they want Google to change things for them even though nothing is to stop them from doing the same thing again.

Before governments force companies to change things based on allegations from potential competitors, I think it is important for an unbiased investigation to be conducted.

The robots.txt was first officially rolled out 20 years ago today! Even though 20 years have passed, some folks continue to use robots.txt disallow like it is 1994.

Before jumping right into common robots.txt mistakes, it's important to understand why standards and protocols for robots exclusion were developed in the first place. In the early 1990s, websites were far more limited in terms of available bandwidth than they are today. Back then it was not uncommon for automated robots to accidentally crash websites by overwhelming a web server and consuming all available bandwidth. That is why the Standard for Robot Exclusion was created by consensus on June 30, 1994. The Robots Exclusion Protocol allows site owners to ask automated robots not to crawl certain portions of their website. By reducing robot traffic, site owners can free up more bandwidth for human users, reduce downtime and help to ensure accessibility for human users. In the early 1990s, site owners were far more concerned about bandwidth and accessibility than URLs appearing in search results.

Throughout internet history sites like WhiteHouse.gov, the Library of Congress, Nissan, Metallica and the California DMV have disallowed portions of their website from being crawled by automated robots. By leveraging robots.txt and the disallow directive, webmasters of sites like these reduced downtime, increased bandwidth and helped ensure accessibility for humans. Over the past 20 years this practice has proved quite successful for a number of websites, especially during peak traffic periods.

Using robots.txt disallow proved to be a helpful tool for webmasters; however, it spelled problems for search engines. For instance, any good search engine had to be able to return quality results for queries like [white house], [metallica], [nissan] and [CA DMV]. Returning quality results for a page is tricky if you cannot crawl the page. To address this issue, Google extracts text about URLs disallowed with robots.txt from sources that are not disallowed with robots.txt. Google compiles this text from allowed sources and associates it with URLs disallowed with robots.txt. As a result, Google is able to return URLs disallowed with robots.txt in search results. One side effect of using robots.txt disallow was that rankings for disallowed URLs would typically decline for some queries over time. This side effect is the result of not being able to crawl or detect content at URLs disallowed with robots.txt.

Not disallowing URLs 24 hours in advance. - In 2000 Google started checking robots.txt files once a day. Before 2000, Google only checked robots.txt files once a week. As a result, URLs disallowed via robots.txt were usually crawled and indexed during the weeklong gap between robots.txt updates. Today, Google usually checks robots.txt files every 24 hours but not always. Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers. Other search engines may take longer than 24 hours to check robots.txt files. Either way, it is entirely possible for content disallowed via robots.txt to be crawled during gaps between robots.txt checks during the first 24 hours. In order to prevent pages at URLs that should be disallowed with robots.txt from being crawled, the URLs must be added to robots.txt at least 24 hours in advance.
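To make the caching point concrete, here is a minimal Python sketch of how a crawler might pick a robots.txt cache lifetime from a Cache-Control header. The function name and the 24-hour fallback are assumptions for illustration; this is not Google's actual logic, only the general idea described above.

```python
import re
from typing import Optional

# Assumed fallback: the "usually every 24 hours" figure mentioned in the text.
DEFAULT_LIFETIME_SECONDS = 24 * 60 * 60


def robots_cache_lifetime(cache_control: Optional[str]) -> int:
    """Return how many seconds a crawler might cache a robots.txt file.

    Uses the max-age value from a Cache-Control header when one is present,
    and falls back to the assumed 24-hour default otherwise.
    """
    if cache_control:
        match = re.search(r"max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return DEFAULT_LIFETIME_SECONDS
```

Under this sketch, a robots.txt served with "max-age=3600" would be rechecked after an hour, while one served without the header would be rechecked after a day. Either way, a new disallow rule only takes effect after the cached copy expires.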

Disallowing a URL with robots.txt to prevent it from appearing in search results. - Disallowing a URL via robots.txt will not prevent it from being seen by searchers in search results pages. Crawling and indexing are two independent processes. URLs disallowed via robots.txt become indexed by search engines when they appear as links in pages not disallowed via robots.txt. Google is then able to associate text from other sources with disallowed URLs to return URLs disallowed via robots.txt in search results pages. This is done without crawling pages disallowed with robots.txt. To prevent URLs from appearing in Google search results, URLs must be crawlable and not disallowed with robots.txt. Once a URL is crawlable, noindex meta tags, password protection, X-Robots-Tag HTTP headers and/or other options can be implemented.
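The interaction between disallow and noindex can be sketched in Python with the standard library's urllib.robotparser. The helper name, the sample robots.txt strings and the example.com URL are all hypothetical, and Python's parser only approximates how a real search engine reads robots.txt. The point is simply that a noindex signal is only visible when the URL may be fetched.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for illustration.
ROBOTS_BLOCKING = "User-agent: *\nDisallow: /private/\n"
ROBOTS_OPEN = "User-agent: *\nDisallow:\n"


def noindex_will_be_seen(robots_txt, url, headers, html, user_agent="Googlebot"):
    """Return True only if a crawler could fetch the URL and find a noindex signal."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # The page is never downloaded, so neither headers nor meta tags are read.
        return False
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return True
    # Simplified check that expects the name attribute to use "robots" exactly.
    meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(0).lower())
```

With ROBOTS_BLOCKING, a page under /private/ that carries a noindex meta tag still returns False here, which mirrors why a disallowed URL can linger in search results.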

Using robots.txt disallow to remove URLs of pages that no longer exist from search results. - Again, the robots.txt file will not remove content from Google. Google does not assume that content no longer exists just because it is no longer accessible to search engines. Using robots.txt to disallow URLs of pages that have been indexed but no longer exist prevents Google from detecting that the page has been removed. As a result, these URLs will be treated just like any other disallowed URL and will probably linger in search results for some time. In order for Google to remove old pages from search results quickly, Googlebot must be able to crawl the page. In order for Google to crawl a page, it must not be disallowed with robots.txt. Until Google detects that content has been removed, keyword and link data for these pages will continue to appear in Google Webmaster Tools. When pages have been removed from a website and should be removed from search results pages, allow search engines to crawl the pages and return a 410 HTTP response. I was recently able to have 150,000 pages removed from search results in 7 days using this method.
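As a rough illustration of the 410 approach, the Python sketch below chooses a response status for a requested path. The path sets and the function name are hypothetical; a real site would look these up in its own routing or database, and would also need to make sure the removed URLs are not disallowed in robots.txt.

```python
from http import HTTPStatus

# Hypothetical URL sets for illustration only.
LIVE_PAGES = {"/", "/products/"}
REMOVED_PAGES = {"/old-promo/", "/discontinued-item/"}


def status_for(path: str) -> HTTPStatus:
    """Pick the HTTP status a server might return for a requested path."""
    if path in LIVE_PAGES:
        return HTTPStatus.OK
    if path in REMOVED_PAGES:
        # 410 Gone tells crawlers the page was removed on purpose.
        return HTTPStatus.GONE
    return HTTPStatus.NOT_FOUND
```

The design choice is to keep an explicit list of intentionally removed pages so they can answer with 410 rather than falling through to a generic 404.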

Disallowing URLs that redirect with robots.txt. - Disallowing a URL that redirects (returns a 301 or 302 HTTP response or uses a meta refresh) to another URL "disallows" search engines from detecting the redirect. Because the robots.txt file does not remove content from search engines' indexes, disallowing a URL that redirects to another URL typically results in the wrong URL appearing in search results. This in turn causes analytics data to be even further corrupted. For redirects to be handled by search engines correctly and not screw up analytics, redirected URLs should be accessible to search engines and not disallowed via robots.txt.

Using robots.txt to disallow URLs of pages with noindex meta tags - Disallowing URLs of pages with noindex meta tags will "disallow" engines from seeing the noindex meta tag. As a result and as mentioned earlier, disallowed URLs can appear indexed in search results. If you do not want the URL of a content page to be seen by users in search results, use the noindex meta tag in the page and allow the URL to be crawled.

Using robots.txt to disallow URLs of pages with rel=canonical or nofollow meta tags and X-Robots-Tags - Disallowing a URL prevents search engines from seeing HTTP headers and meta tags. As a result, none of these will be honored. In order for engines to honor HTTP response headers or meta tags, URLs must not be disallowed with robots.txt.

Disallowing Confidential Information via robots.txt. - Anyone who understands robots.txt can access the robots.txt file for a website. For instance, anyone can view google.com/robots.txt and apple.com/robots.txt. Clearly, the robots.txt was never intended as a mechanism for hiding information. The only way to prevent search engines from accessing confidential information online and displaying it to users in search results pages is to place that content behind a login.

Robots.txt postpone. - If Google tries to access a robots.txt file but does not receive a 200, 403 or 404 HTTP response, Google will postpone crawling until a later time. For that reason it is important to ensure that robots.txt URLs always return a 200, 403 or 404 HTTP response.

403 robots.txt. - Returning a 403 HTTP response for robots.txt indicates that no file exists. As a result, Googlebot can assume that it is safe to crawl any URL. If your robots.txt returns a 403 HTTP response and this is an issue, simply change the response to a 200 or 404.
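The status code handling described in the last two items can be summarized in a small Python sketch. The function name and the returned labels are made up for illustration, and the mapping reflects only what this article states, not an official specification.

```python
def robots_fetch_outcome(status_code: int) -> str:
    """Map the HTTP status of a robots.txt request to a crawler's likely behavior."""
    if status_code == 200:
        # A file was returned, so its rules are applied.
        return "apply rules"
    if status_code in (403, 404):
        # Treated as "no robots.txt exists", so any URL may be crawled.
        return "crawl without restrictions"
    # Anything else (for example a 500 or a timeout mapped to a code).
    return "postpone crawl"
```

Running a site's robots.txt status code through a check like this is a quick way to spot a server that answers with a 5xx error and is therefore stalling crawling.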

User-Agent directive override - When generic user-agent directives come before specific directives in robots.txt, later directives can override earlier directives as far as Googlebot is concerned. This is why it is best to test robots.txt in Google Webmaster Tools.
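One way to see how user-agent groups interact is to test a file locally. The Python sketch below uses the standard library's urllib.robotparser with a hypothetical robots.txt in which a generic group comes before a Googlebot group. Python's parser is an assumption here and may not match Googlebot in every case, so testing in Google Webmaster Tools remains the safer check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: generic group first, specific Googlebot group second.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /tmp/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if Python's parser lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In this sketch, allowed("Googlebot", "https://example.com/private/page.html") returns True because the Googlebot group replaces the generic group rather than adding to it. A webmaster who expected /private/ to stay blocked for Googlebot would need to repeat that rule in the Googlebot group.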

Robots.txt case sensitivity - The URL of the robots.txt file and URLs in the robots.txt file are case-sensitive. As a result, you can expect issues if your file is named ROBOTS.TXT and included URLs are accessible via mixed cases.
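Case sensitivity of paths inside the file can be checked with a short Python sketch. The robots.txt content, the bot name and the example.com URLs are hypothetical, and the behavior shown is that of Python's urllib.robotparser, used here as a stand-in for a real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule written with a capital "P".
CASE_ROBOTS_TXT = "User-agent: *\nDisallow: /Private/\n"


def path_is_blocked(url: str) -> bool:
    """Return True if the URL is disallowed by CASE_ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(CASE_ROBOTS_TXT.splitlines())
    return not parser.can_fetch("ExampleBot", url)
```

Here path_is_blocked("https://example.com/Private/report.html") is True while the same URL with a lowercase /private/ is not blocked, so a server that answers on both spellings would leave one of them open to crawling.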

Removing robots.txt file URLs from search results. - To prevent a robots.txt file from appearing in Google search results, webmasters can disallow robots.txt via robots.txt and then remove it via Google Webmaster Tools. Another way is by using x-robots-tag noindex in the HTTP header of the robots.txt file.

robots.txt Crawl-delay - Sites like http://cs.stanford.edu/robots.txt include a "Crawl-delay" in robots.txt but these are ignored by Google. In order to control Google crawling, use Google Webmaster Tools.
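Although Google ignores it, other crawlers may read the Crawl-delay value. The Python sketch below shows how a custom crawler could read it with urllib.robotparser (Python 3.6 or later). The robots.txt content is hypothetical and is not a copy of any real site's file.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical file asking crawlers to wait 10 seconds between requests.
DELAY_ROBOTS_TXT = "User-agent: *\nCrawl-delay: 10\nDisallow: /cgi-bin/\n"


def requested_delay(user_agent: str, robots_txt: str = DELAY_ROBOTS_TXT) -> Optional[int]:
    """Return the Crawl-delay (in seconds) that applies to a user agent, if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)
```

A crawler that chooses to honor the value could sleep for that many seconds between requests. A crawler that ignores it, as the article says Google does, would never call anything like requested_delay at all.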