Archives for February 2008

One of Google’s goals is that you should be able to throw just about anything into a search box (package tracking numbers, airline flight numbers, etc.) and Google will try to do something reasonable, such as return the status of a flight. Recently I was trying to reverse engineer a USB protocol and needed to convert some numbers between base 16 (hexadecimal) and base 10 (decimal). On a hunch, I threw the conversion into a Google search box. Sure enough, it worked fine.
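The same base-16/base-10 conversion is a two-liner in most languages; here is a small Python sketch of the arithmetic the search box performs (the value `"1A"` is just an illustrative example):

```python
# Convert between hexadecimal (base 16) and decimal (base 10).
hex_value = "1A"

# Parse the hex string as a base-16 integer.
decimal = int(hex_value, 16)
print(decimal)              # 26

# Convert back to hexadecimal (format() omits Python's "0x" prefix).
print(format(decimal, "X"))  # 1A
```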

One interesting thing about my job is that I get to see a lot of unusual claims. Recently I was on an email thread and the images team wanted to address a misconception. Google Images doesn’t have a dedicated blog right now, so I offered some space on my blog if someone wanted to do a guest post. Here’s the guest post:

Every now and then a story surfaces that Google has ‘censored’ images or web pages and removed them from our site without saying a word to anyone. For example, we noticed some sites in the Middle East and beyond are asking about Egyptian striker Mohammad Aboutrika’s goal celebration during Egypt’s African Cup of Nations match against Sudan. After scoring the 3rd goal, Aboutrika revealed a t-shirt with the message ‘Sympathize with Gaza’.

Well it turns out this image was difficult to find on images.google.com for the first few days after the match, and the story that’s gathered steam is that Google removed it. Some outlets said that this was under pressure from the Israeli government.

First of all, let’s set the story straight: we definitely didn’t do this. In fact, from the very beginning you could find the image quite easily on YouTube and also on Google News.

The reason for the delay in the image showing up on Google Images was that it can take a few days between when an image appears and when it’s crawled by the Googlebot, as explained here. It’s there now – you can find several copies of the image on a search for [Aboutrika] or [Aboutrika Gaza] quite easily.

No-one from any government has contacted us about this image, and we have no reason to remove it.

The Google EMEA Product Team

(This is Matt again. By the way, “EMEA” stands for “Europe, Middle East, and Africa.”) I can add a little more perspective on this as well. Google works hard to be comprehensive, which is great, and in my opinion Google has good coverage of the web in our index (including images). But we don’t crawl every single image or document across the entire web, and sometimes it takes time to discover a document. That’s just the way search engines work, and there’s no need to assume an ulterior motive on Google’s part.

Just to give another example, a few months ago I saw similar questions regarding image search results in Japanese. On October 14th, a Sunday morning TV show introduced a new virtual girl singer named Miku Hatsune. For a few days, Google didn’t have pictures of the character, but it wasn’t anything intentional on Google’s part; sometimes it does take a little while for our web or image crawl to discover a document. Happily, if you search for [Hatsune Miku] now you’ll find lots and lots of pictures.

So my takeaway is that even though search engines can be very comprehensive, it can still take time to discover documents; please don’t assume that Google has negative intentions just because you don’t see a particular image. When I joined Google in early 2000, we measured the time to update our index in months. Personally, I think it’s great that people now wonder why we don’t have a particular web page or image within just a few days. Over time, Google is getting fresher and fresher in my experience, but making a search engine work really well is a difficult task. Rest assured that we’ll keep working on improving freshness, coverage of the web, relevance, and the overall user experience.

By the way, if you followed the advice in my recent security tips for WordPress post, you wouldn’t have to read about the update on my blog. Instead, you would already be subscribed to the WordPress security/developers’ feed (Atom feed link) that is suitable for subscribing in Google Reader or your favorite feed reader. I highly recommend subscribing to that feed so that you’re less likely to be caught by surprise when there’s a security issue with WordPress.

Matt Cutts – Google. Not prepared remarks, just informal. High-order nits: what do people worry about? He often finds that honest webmasters worry about duplicate content when they don’t need to. Google tries to always return the “best” version of a page. Some people are less conscientious: one person claimed he was having problems with dupe content and not appearing in both Google and Yahoo – it turns out he had 2,500 domains. A lot of people ask about articles split into parts, plus printable versions. Do not worry about Google penalizing for this. Different top-level domains: if you own a .com and a .fr, for example, don’t worry about dupe content in this case. General rule of thumb: think of search engines as a sort of hyperactive four-year-old kid that is smart in some ways and not so smart in others. Use the KISS rule and keep it simple. Pick a preferred host and stick with it, such as domain.com or www.domain.com.
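The “pick a preferred host” advice above can be sketched in a few lines. This is a minimal illustration, not anything Google-specific: `example.com` and the `canonical_url` helper are hypothetical, and in practice you’d issue a 301 redirect at the web-server level rather than rewrite URLs in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical preferred host: always serve pages from the www version.
PREFERRED_HOST = "www.example.com"

def canonical_url(url):
    """Rewrite a URL onto the preferred host so that search engines
    only ever see one consistent version of each page."""
    parts = urlsplit(url)
    if parts.netloc.lower() in ("example.com", "www.example.com"):
        parts = parts._replace(netloc=PREFERRED_HOST)
    return urlunsplit(parts)

print(canonical_url("http://example.com/page"))
# http://www.example.com/page
```

The point is simply consistency: whichever host you pick, link to it and redirect to it everywhere, so the non-preferred variant never accumulates its own copies in the index.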

If this is an accurate summary, and I’m reading what you’re saying, then there’s no need to worry about duplicate content issues when submitting articles. Is that correct?

My response:

What I was saying was: I often get questions from whitehat sites who are worried that they might receive duplicate content penalties because they have the same article in different formats (e.g. a paginated version and a printer-ready version). While it’s helpful to try to pick one of those versions and exclude the other from indexing, typically a whitehat site doesn’t need to worry about 1-3 versions of an article on their own site. However, I would be mindful that taking all your articles and submitting them for syndication all over the place can make it more difficult to determine how much of the site’s content is original vs. syndicated. My advice would be 1) to avoid over-syndicating the articles that you write, and 2) if you do syndicate content, make sure that you include a link to the original content. That will help ensure that the original content has more PageRank, which will aid in picking the best documents in our index.

We use additional heuristics of course, but I figured other people might want to hear that take.