Topsy: Now Searching Tweets Back To May 2008

Looking for old tweets? Look to Topsy. The service has just expanded to have what it claims to be the largest searchable collection of past tweets, over 5 billion of them, stretching back to at least May 2008. That makes it more comprehensive than Google’s Twitter search or even Twitter’s own Twitter Search.

Topsy will be sharing the news itself later today, on its blog. Beyond being comprehensive, another nice thing about Topsy is the ability to restrict a search using special “operators” or commands — such as “from” — to find tweets from a particular user or the ability to see tweets within a particular date range. Topsy has an advanced search page that makes it easy, as well as a list of commands.

Google lacks this type of filtering; Twitter has it, but only for tweets from about the past week or less. Of course, Topsy’s searches don’t always work as advertised. More on this, and how Topsy measures up against Google and Twitter, below.

Show Me The First Tweet By…

What was the first tweet from Ashton Kutcher? Heck, what was my first tweet? Finding the first tweet from a well-established Twitter account is a good test of comprehensiveness.

Using Twitter’s advanced search page, I can search for all tweets by Kutcher (from:aplusk), but the results only take me back 5 days.
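For readers who want to script searches like this, here is a minimal sketch of composing a query with the “from” operator and encoding it for use in a URL. The helper name build_query is hypothetical, and the sketch assumes only the from:username form shown above; it does not assume anything about any service's URL layout or API.

```python
from urllib.parse import quote_plus


def build_query(terms="", from_user=None):
    """Compose a search string, optionally restricted to one author.

    Hypothetical helper: it only assumes the "from:username" operator
    form used in the article (for example, from:aplusk).
    """
    parts = []
    if from_user:
        # Strip a leading "@" so "@aplusk" and "aplusk" behave the same.
        parts.append("from:" + from_user.lstrip("@"))
    if terms:
        parts.append(terms.strip())
    return " ".join(parts)


query = build_query(from_user="@aplusk")
encoded = quote_plus(query)  # safe to place in a URL query string
print(query)
print(encoded)
```

The encoded form matters because the colon in the operator is not a safe character in a query string, so it should be escaped before being appended to a search URL.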

How about Google? When Google’s Twitter archive search launched, it touted having tweets stretching back through February 11, 2010. That’s further back than Twitter search goes, but it won’t get me to Kutcher’s first tweet, not by a long shot. (A regular Google search for ashton kutcher first tweet, however, takes me right to his first one, on Jan. 15, 2009.)

Worse, there’s no “from” command at Google that lets me find tweets just from Kutcher. Instead, at best, you have to search for @aplusk, which brings back tweets from him plus anyone mentioning him. In addition, there may be non-Twitter updates mixed in with Google’s archive search, since other sources such as Facebook or MySpace also feed into it.

At Bing Social Search, the “from” command does work, so I can see all the tweets by Kutcher it has indexed, and just tweets, with nothing else mixed in. However, those only go back six days.

At Topsy, I can get the nearly 4,000 “All Time” tweets posted by Kutcher listed:

That sounds great, but getting to the oldest tweet is difficult. If you sort those tweets by “timeline,” so that the oldest tweet comes last, you’ll find that you can’t actually “page” your way back to it. Only pages 1 through 10 of search results are shown, currently getting you back to May 2010.

A trick is to search by specific date range. For example, here’s a search for all of January 2009, narrowed to those from Kutcher. The problem is that his first tweet, which happened in this period, doesn’t actually appear. Switching the two pages of results from “relevancy” to “timeline” view makes things worse, listing only links that may or may not have been from Kutcher (it’s hard to tell).
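The date-range trick can be applied systematically: since only ten pages of results are shown per search, a long span can be split into smaller windows that are searched one at a time. The sketch below is only an illustration of that splitting step. The function name month_windows is hypothetical, and it says nothing about how any search service accepts dates; it just produces start and end dates you could enter on an advanced search page.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end, one per calendar month."""
    cursor = start
    while cursor <= end:
        # Work out the first day of the following month.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        # The window ends at month end, or at the overall end date if sooner.
        last_day = min(next_month - timedelta(days=1), end)
        yield cursor, last_day
        cursor = next_month


for first_day, last_day in month_windows(date(2009, 1, 15), date(2009, 3, 10)):
    print(first_day.isoformat(), "to", last_day.isoformat())
```

If a single month still returns more than ten pages, the same idea works with weekly or daily windows.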

The only way I could find his first tweet, in the end, was to search for the text “dropping my first tweet,” which listed his first tweet in the top results at Topsy. However, it was listed without a time stamp, which doubles as a way to click directly to the actual tweet, making me suspect that Topsy has some database issues.

Behind The Scenes

Despite this, Topsy clearly has a lot of tweets that go back in time. I suspect that when the bugs get worked out, doing a search to find someone’s first tweet, or tweets made within a particular date range, will be really useful.

Topsy knows things need to improve and is working on it. In the meantime, it emphasizes the fact that the date range feature can be used to view “highlights” for a particular period, telling me:

Reverse chronology is not well supported in the current user interface, which focuses on relevance, but we plan to introduce option for this in an upcoming release.

When you choose timeline sort on Topsy, the results are sorted by newest first but filtered by quality — it’s the top 100 results in a given time period, by newest first and a good way to track new, high quality results on any query. Think of it as the highlights for a given time period.
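As a rough illustration of the behavior described in that statement (keep the top results in a time period, then list them newest first), here is a minimal sketch. It is not Topsy's implementation: the function name timeline_highlights, the dictionary layout, and the numeric "score" field standing in for quality are all assumptions, since the article does not say how quality is measured.

```python
from datetime import datetime


def timeline_highlights(tweets, start, end, limit=100):
    """Keep the `limit` best-scoring tweets in [start, end], listed newest first.

    Each tweet is a dict with a "time" (datetime) and a "score" (number).
    The "score" field is a hypothetical stand-in for a quality measure.
    """
    in_window = [t for t in tweets if start <= t["time"] <= end]
    # Filter by quality first: keep only the top `limit` by score.
    top = sorted(in_window, key=lambda t: t["score"], reverse=True)[:limit]
    # Then present the survivors in reverse chronological order.
    return sorted(top, key=lambda t: t["time"], reverse=True)


sample = [
    {"id": 1, "time": datetime(2009, 1, 15), "score": 0.9},
    {"id": 2, "time": datetime(2009, 1, 20), "score": 0.1},
    {"id": 3, "time": datetime(2009, 1, 25), "score": 0.5},
    {"id": 4, "time": datetime(2009, 2, 5), "score": 0.99},
]
picked = timeline_highlights(
    sample, datetime(2009, 1, 1), datetime(2009, 1, 31), limit=2
)
print([t["id"] for t in picked])
```

This also shows why a specific older tweet can go missing from such a view: anything outside the top-scoring set for the period is dropped before the date sort happens.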

As for how far back the archives go and how the data was gathered, Topsy told me:

We started collecting tweets in May 2008 by polling search.twitter.com for all tweets with links. Our first index was built this way.

Topsy became the first search engine to start indexing native retweets via Twitter’s retweet streaming API in December 2009. The index contains every native retweet since. We’ve recently signed a contract with Twitter to index the entire firehose [firehose is jargon for the ability to tap Twitter’s full stream of tweets].

The firehose does not contain all historical tweets (not for Topsy or Google). We do plan to work with Twitter to complete our index some day. Since the number of tweets per day has grown dramatically, the historical tweets will actually represent a pretty small part of the index.

By the way, while Topsy says you can go back through at least May 2008, I found some tweets that were older than that. I also could find data stretching way back through Dec. 2006 (by doing a date-restricted search for the word “the”). However, the further back you go, the more likely you’re getting only tweets associated with a link, as well as tweets that might not let you click from the date stamp to the actual tweet.

How They Stack Up

How do the major Twitter archive search services stack up? It’s really only Topsy versus Google, in this department. Twitter itself isn’t currently focused on trying to create a huge, searchable archive of tweets.

Make no mistake. Twitter has all the tweets people have done over time. They haven’t been lost. But when I spoke in June to Mike Abbott, Twitter’s vice president of engineering who oversees search, he explained to me that Twitter is focusing on building search products that others aren’t doing. With Google then, and Topsy now, focusing on comprehensive searching, Twitter is looking in other directions.

“Google doing it [archive search] takes some of the pressure off. Where do we want to innovate in this world and drive unique set of experiences?,” Abbott told me. He said such items would be finding ways to better connect Twitter users together with others of similar interest, or to do a search on Twitter that just shows tweets from your friends and followers.

Indeed, since I spoke with Abbott, Twitter has released new ways to find people to follow when searching or when browsing your Twitter home page. I’ve found the “Suggestions For You” feature to be incredibly useful. Our past articles below have more about these features:

So when I do the stack-up chart below, keep in mind that while I’m listing Twitter, it’s only as a benchmark, to show how far Google and Topsy go beyond standard Twitter Search in how comprehensively they search.

| Feature | Twitter | Google | Topsy |
|---|---|---|---|
| Farthest Back You Can Search | 4 to 7 days | Feb. 2010 | May 2008 (at least) |
| Search By Username | Yes | No | Yes |
| Date Range Search | Yes | Only by clicking in timelines | Yes (though buggy) |
| Sort Options | By Date | By Relevancy (Any time) & By Date (Latest) | By Relevancy (Relevance) & By Date (Timeline / All Time) |
| Show Only Photos? | No | Yes | Yes |

Note the last row — the ability to search for tweets containing photos. Topsy makes it especially easy to find images that have been tweeted and says it has over 300 million images indexed. It even has a special page just for photo searching, Topsy Photos. For other services that let you find photos shared via Twitter, see our Google Adds Images To Real-Time Results post. Topsy also says it has indexed 2.5 billion links that have been shared on Twitter.

In the future, I’ll expand the table above to include some other services. In the meantime, here are some past articles that cover Twitter-related searching in various aspects: