The
dark future of search
is foreshadowed by this Twitter vs. Google fight. The latest
Twitter volley at Google is this quote (seen on
GigaOm) from
Twitter CEO Dick Costolo:

"Google crawls us at a rate of 1300 hits per second... They've indexed
3 billion of our pages," Costolo said. "They have all the data they
need."

There's no doubt that 1,300 hits per second is a large number, but
let's put that in perspective:

In
February 2010,
Twitter was at 50 million tweets per day. This is just under 600
tweets per second.

In June 2011, Twitter was at 200 million tweets per day. This is
over 2,300 tweets per second.

In October 2011, Twitter hit 250 million tweets per day, or just
under 3,000 tweets per second.

They have spikes of over 7,000 tweets per second, with the
largest (so far) being just over 25,000 tweets per second.
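The per-second figures above are simple division; a quick sketch of the
arithmetic, using the daily volumes from the milestones:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Reported daily tweet volumes at each milestone
daily_volumes = {
    "Feb 2010": 50_000_000,
    "Jun 2011": 200_000_000,
    "Oct 2011": 250_000_000,
}

for date, tweets_per_day in daily_volumes.items():
    print(f"{date}: {tweets_per_day / SECONDS_PER_DAY:,.0f} tweets/second")
# Feb 2010: 579 tweets/second
# Jun 2011: 2,315 tweets/second
# Oct 2011: 2,894 tweets/second
```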

For part of 2010, Google was perhaps able to keep up with the stream
at 1,300 requests per second. Somewhere between February 2010 and June
2011, the average volume of tweets outpaced them.

Let's assume that they kept pace until June 2011, and that on June 1,
Twitter jumped from somewhere in the range of 1,300 tweets per second
to their reported 2,300 tweets per second. From that point, Google
falls behind by 1,000 tweets every second.

Over the next 180 days or so, Google misses 15.5 billion tweets. At
the June rate of 2,300 tweets per second, that backlog is more than
two months of tweets, assuming they didn't skip any and the volume did
not increase. But it did increase by 25% or so by October, and surely
it has grown more since then.
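A sketch of that back-of-the-envelope deficit, assuming a constant
1,000 tweets-per-second shortfall starting June 1 and running roughly
180 days:

```python
SECONDS_PER_DAY = 86_400

# Assumed shortfall: tweets arrive at 2,300/s, Google crawls at 1,300/s
deficit_per_second = 2_300 - 1_300

days = 180  # roughly June through November 2011
missed = deficit_per_second * SECONDS_PER_DAY * days
print(f"{missed / 1e9:.2f} billion tweets missed")  # 15.55 billion

# Size of the backlog measured against the June 2011 arrival rate
arrival_per_day = 2_300 * SECONDS_PER_DAY
print(f"{missed / arrival_per_day:.0f} days behind")  # 78 days, ~2.5 months
```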

If Google has indexed only 3 billion pages so far, that is
approximately 12 days of tweets at current volume. It's hard to
reconcile the 3 billion pages figure with the 1,300 requests per
second figure. Was Google indexing at a much slower rate before? Did
they not start until a few months ago?
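The 12-day figure falls out directly, taking the October 2011 volume
as "current":

```python
indexed_pages = 3_000_000_000   # the "3 billion pages" figure
tweets_per_day = 250_000_000    # October 2011 volume

days_of_tweets = indexed_pages / tweets_per_day
print(f"{days_of_tweets:.0f} days of tweets")  # 12 days
```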

Of course Google may be getting multiple tweets per request, perhaps
by crawling the timelines of important users. But this means that they
probably get a lot of requests that don't give them any new tweets, or
else the timeliness of the data is poor.

No matter how you slice it, it appears Google would be unable to keep
up. Even if they were keeping up now, Twitter's growth puts a time
limit on how long keeping up remains possible.

Perhaps Google is super clever, and can index only the right
tweets. I think that it's more probable they have "enough" data to
surface results for the super popular topics, and miss nearly
everything in the long tail of the distribution. I expect that this
adversely affects search quality, which one suspects is a high
priority for the world's best search engine.

Google is no saint. They are guilty of the same data hoarding. If you
ran these numbers for YouTube indexing, I think you would find the
situation is much worse. I imagine that most of these data silo
companies purposefully set their crawl rates too low for anyone else
to achieve high quality search results.

In the case of Twitter, the end result for users is even worse because
Twitter's own attempts at search are terrible and are getting worse
over time. At least Google makes a decent YouTube search, even if no
one else can.

Even if Google could get all the tweets, they still would have very
little to no Facebook data. I still think the best strategy in this
situation for them is to create their own social data and use that
instead. It's a tough road, but they seem to have little choice.

In the end, it's not about Google or Twitter or Facebook, but the
stifling of innovation and competition around data. We can only hope
that some federated solution or some data-liberal company wins out in
the end.

About the author

I'm a hacker and entrepreneur based in Albuquerque, New
Mexico. I have founded several startups built on XMPP
technology including Collecta, a real-time search engine for
the Web, and Chesspark, a real-time, multi-user gaming
platform. You can learn more about me on the about page.