Lab Notes

How The Archivist Polls Twitter

You may be wondering how frequently The Archivist updates archives. Well, the answer to the question is more complicated than it may first appear. Let’s dig in.

The Archivist interacts with Twitter using the Twitter Search API, which it polls at variable intervals based on the frequency with which a particular archive is updated. We call this the elastic degrading polling function. This algorithm helps The Archivist be a good Twitter citizen, allowing us to poll Twitter conservatively while at the same time maintaining archives with the latest tweets.

Here’s how the algorithm works: When a user makes an archive ‘active’, the polling process begins. Every archive is inspected once an hour to determine how ‘hot’ it is. We determine how hot an archive is by recording how many results we get back each time we poll Twitter (the maximum we can pull at any one time is 1500). We use this number to determine how frequently to poll Twitter for that archive. Depending on the number, we either hold off on polling for a given interval or query again, based on the following buckets:

So, let’s look at an example. Say we have an archive going for the term ‘Wittgenstein’. When The Archivist checks on this archive at 10 AM, it discovers that the last query for Wittgenstein only returned 10 tweets. It also discovers that this archive was last updated at 9 AM. The Archivist won’t poll Twitter for this archive, because the tweet count isn’t high enough and the archive was queried within the last 24 hours. Since the archive is in the 24 hour bucket, the same thing will happen when The Archivist checks on this archive each hour.

Once 9 AM rolls around on the next day, since 24 hours have passed, Twitter will be polled for the Wittgenstein archive.
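The hourly inspection described above can be sketched roughly as follows. This is only an illustration, not The Archivist's actual code: the function names `is_due` and `first_due_check` and the calendar date are made up for the example, and it assumes the check simply compares the time elapsed since the last poll against the archive's current bucket.

```python
from datetime import datetime, timedelta


def is_due(last_polled: datetime, interval_hours: int, now: datetime) -> bool:
    """Return True once at least interval_hours have passed since the last poll."""
    return now - last_polled >= timedelta(hours=interval_hours)


def first_due_check(last_polled: datetime, interval_hours: int,
                    first_check: datetime) -> datetime:
    """Walk the hourly inspections and return the first one that triggers a poll."""
    check = first_check
    while not is_due(last_polled, interval_hours, check):
        check += timedelta(hours=1)  # the archive is inspected once an hour
    return check


# Hypothetical dates: last polled at 9 AM, first inspection at 10 AM.
last_polled = datetime(2011, 1, 1, 9, 0)
first_check = datetime(2011, 1, 1, 10, 0)

print(is_due(last_polled, 24, first_check))           # False
print(first_due_check(last_polled, 24, first_check))  # 2011-01-02 09:00:00
```

With the archive in the 24 hour bucket, every inspection from 10 AM onward is skipped until the one at 9 AM the next day.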

Now, let’s say for some reason there’s a flurry of tweets about Wittgenstein: when that archive was updated at 9 AM, it pulled 600 tweets. In this case, the archive adjusts because it has become hot. It is now in the 1 hour bucket instead of the 24 hour bucket. So, when 10 AM rolls around, the Wittgenstein archive gets updated again.

But let’s say at 10 AM it pulls only 250 tweets. Well, now the archive moves to the 8 hour bucket. So, the Wittgenstein archive will not be polled again until 6 PM. Let’s say that at 6 PM it pulls 1000 tweets. Well, it goes back to the 1 hour bucket, since it appears to be hot. At 7 PM the term is checked again. This time, the response is only 10 tweets. It seems to have cooled off quickly, so we’ll move it back to the 24 hour bucket.
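To make the bucket movement in this example concrete, here is a minimal sketch that replays the Wittgenstein timeline. The cut-off values `HOT_THRESHOLD` and `WARM_THRESHOLD` are assumptions chosen only so that the counts in the example (10, 250, 600, 1000) land in the buckets the text describes; the real thresholds are not given here. The function names and the calendar date are also hypothetical.

```python
from datetime import datetime, timedelta

# ASSUMED cut-offs, picked to agree with the worked example only.
HOT_THRESHOLD = 500   # at or above this: 1 hour bucket (assumption)
WARM_THRESHOLD = 100  # at or above this: 8 hour bucket (assumption)


def bucket_hours(tweet_count: int) -> int:
    """Map the size of the last poll's result set to a polling interval in hours."""
    if tweet_count >= HOT_THRESHOLD:
        return 1
    if tweet_count >= WARM_THRESHOLD:
        return 8
    return 24


def replay(start: datetime, counts: list[int]) -> list[tuple[datetime, int, int]]:
    """Poll at start, then at each computed next-poll time.

    Returns one (poll time, tweet count, bucket in hours) row per poll.
    """
    rows = []
    polled_at = start
    for count in counts:
        hours = bucket_hours(count)
        rows.append((polled_at, count, hours))
        polled_at += timedelta(hours=hours)
    return rows


# Tweet counts from the example: 600, then 250, then 1000, then 10.
for polled_at, count, hours in replay(datetime(2011, 1, 1, 9, 0), [600, 250, 1000, 10]):
    print(f"{polled_at:%H:%M}: {count} tweets -> {hours} hour bucket")
```

Under these assumed thresholds the polls fall at 9 AM, 10 AM, 6 PM and 7 PM, matching the walkthrough, and the archive ends up back in the 24 hour bucket.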

Some of you may notice that The Archivist could miss tweets when a term becomes hot. This is a reality of our architecture, and it is justified by the following: First, once a term gets hot, the amount of data can grow quickly. Ultimately, in that scenario, The Archivist becomes a statistical sample as opposed to a true historical record. Second, Twitter itself doesn’t guarantee that all tweets will be returned for a given search. See http://help.twitter.com/entries/66018-my-tweets-or-hashtags-are-missing-from-search and http://dev.twitter.com/doc/get/search for more on this. Consequently, there is no way that The Archivist can ever claim to be a true historical record. Third, The Archivist is optimized for following non-trending topics over a long period of time, as opposed to trending topics over a short time. For a tool optimized for the latter scenario, see Archivist Desktop. Another option would be to run your own instance of The Archivist Web and tweak the polling algorithm, which would be trivial to do. Contact me if you are interested in doing so.
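A rough back-of-the-envelope sketch of how tweets could go missing: it assumes (this is an interpretation, not something stated above) that a poll can only bring back the most recent 1500 results, that anything older than that is not recoverable on a later poll, and that tweets arrive at a steady rate. The function name `estimated_missed` is hypothetical.

```python
MAX_RESULTS_PER_POLL = 1500  # the per-poll maximum mentioned earlier


def estimated_missed(tweets_per_hour: int, hours_between_polls: int,
                     cap: int = MAX_RESULTS_PER_POLL) -> int:
    """Estimate tweets lost between two polls under the assumptions above."""
    posted = tweets_per_hour * hours_between_polls
    return max(0, posted - cap)


print(estimated_missed(10, 24))   # 0    (240 posted, under the cap)
print(estimated_missed(200, 24))  # 3300 (4800 posted, 1500 retrievable)
print(estimated_missed(200, 1))   # 0    (200 posted, under the cap)
```

In this simplified model, a quiet term sitting in the 24 hour bucket loses nothing, while a term that suddenly turns hot can overflow the cap before the next poll moves it into a shorter bucket.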

Why are you not just using the streaming API and capturing all of the tweets Twitter sends through? With a good queue in place you can easily handle 1,000,000+ tweets/hour with far fewer resources than polling Twitter themselves.

@Luke L - I spent a lot of time investigating the streaming API and decided against it in favor of a polling architecture for the following reasons:

1. I want to allow users to use the Search syntax enabled by the search API – Booleans, exclusions, from:, etc. This isn’t offered in the streaming API. Yes, I could implement that myself but if the Search API is already doing it...

2. I’m constantly adding/dropping filters which means I’ll be constantly connecting/disconnecting to the stream. This could result in quite a bit of data loss, churn and general instability.

3. While I can add filters to the Streaming API, the number of filters required for The Archivist (which monitors lots of terms and searches) requires special permission from Twitter.

Hello,
I saved 3 archives 29 hours ago, but they don't update. So there are still about 1000 Tweets.
Do you know why they don't update? Their archive status is "Archiving".

Thanks for your reply,
Greets. Seyhan

Jackie said on Aug 29, 2011

I have an archive set up and it says it hasn't updated in the last 46 hours. Shouldn't this never happen according to the chart above? I thought it updates automatically every 24 hours?

winenews said on Oct 21, 2011

I love this tool. Thank you.
I was searching in the French wine world (in English). Most popular words are often de, les, la, etc. Imagine same with all other languages.
"The" and "and" didn't show up. Am I just too early to ask to remove other lang prepositions?
Still it was cool (or distressing...) to see that with "wine tasting" the word "free" was very high in the word results.
Thank you.

As with some other commenters, I'm confused - my archive's over 50 hours old and has never updated. I hope this just means that your outside bucket is bigger than you'd said above, rather than some bigger bug.

I'm having the same problem as reported by others. An archive +32 hours old, and no updates. The worst is, I created it for a 2-day conference, and with the amount of tweets being sent, it should have (according to the algorithm above) updated at least once in the first hour, and then perhaps again in 8 hour increments, or no longer than 24 hours. It's rather disappointing, now that the conference is over, that only the first 312 tweets are actually showing up in the statistics and visualizations.

Gianfranco Cecconi said on Nov 18, 2011

Same problem as many others above: after the very first archival at search creation, 47 hours have passed without an update. I am archiving @guykawasaki, who's one of the most prolific Twitter users I know, so either your algorithm is broken or the engine that runs it is. We know it's just an alpha but let your users know if you can't cope with it please!

André said on May 11, 2012

Hmm... Only got nine hits on the #poplevison tag, which is at the moment trending worldwide, and my other archives haven't updated since their creation either.