What is the size of the Library of Twitter?

22 02 2011

The Library of Babel is a theoretical library that holds all books that can be written with (i) a given set of symbols and (ii) a given page limit. According to Wikipedia, the Library of Babel is based on a short story by the author and librarian Jorge Luis Borges (1899–1986). The idea is simple: the library holds every book that can be produced from every combinatorially possible sequence of symbols up to a certain book length. In Borges's case, the library is immensely large, since it contains all possible books of 410 pages. The American Scientist calculates:

… each book has 410 pages, with 40 lines of 80 characters on each page. Thus a book consists of 410 [pages] × 40 [lines] × 80 [characters] = 1,312,000 symbols. There are 25 choices for each of these symbols, and so the library’s collection consists of 25^1,312,000 books.
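The arithmetic in the quote can be checked with a few lines of Python. This is only a sketch: the helper name `sci_notation` is made up for illustration, and the number is handled through logarithms because writing out all of its roughly 1.8 million digits would not be very readable.

```python
import math

# Figures taken from the American Scientist quote above.
PAGES, LINES, CHARS, SYMBOLS = 410, 40, 80, 25

symbols_per_book = PAGES * LINES * CHARS


def sci_notation(base, exponent):
    """Return (mantissa, power_of_ten) for base**exponent, via logarithms."""
    log10_value = exponent * math.log10(base)
    power = math.floor(log10_value)
    mantissa = 10 ** (log10_value - power)
    return mantissa, power


mantissa, power = sci_notation(SYMBOLS, symbols_per_book)
print(f"{symbols_per_book:,} symbols per book")
print(f"25^{symbols_per_book:,} is about {mantissa:.3f} × 10^{power}")
```

The mantissa comes from floating-point logarithms, so only its first few digits should be trusted.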

But what is the size of a Library of Twitter, i.e. the size of the set of all theoretically possible tweets? It should be (i) much smaller and (ii) much easier to calculate due to the particular structure of tweets. Here’s a brief back-of-the-envelope calculation:

Given the 140-character limit of tweets, and assuming an English alphabet of 26 letters expanded by basic syntactical elements such as periods (.), commas (,), spaces ( ), at signs (@), hash signs (#) and a few others, we end up with all combinatorially possible sequences of 140 characters drawn from a vocabulary of maybe 50 symbols. Based on these (conservative) assumptions, the Library of Twitter holds at least 50^140 tweets.

In other words, the size of the Library of Twitter is at least 7.17 × 10^237 [1].
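Because Python integers have arbitrary precision, the full number can also be written out exactly. The sketch below assumes the same 50-symbol vocabulary and 140-character length as above; the variable names are only illustrative.

```python
VOCABULARY = 50      # assumed number of distinct symbols
TWEET_LENGTH = 140   # characters per tweet

# Exact integer: one choice out of 50 for each of the 140 positions.
library_size = VOCABULARY ** TWEET_LENGTH
digits = str(library_size)

print(f"{len(digits)} digits")
print(f"about {digits[0]}.{digits[1:3]} × 10^{len(digits) - 1}")
print(library_size)
```

Since 50^140 = 5^140 × 10^140, the written-out number ends in 140 zeros.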

While this number seems impressive, it pales in comparison to the size of the Library of Babel (which is 1.956 × 10^1834097). As with the Library of Babel, most of the contents of the Library of Twitter would be nonsensical. But on the upside, the library would also contain all tweets ever written in the past and all theoretically possible tweets to be written in the future. Thereby, 50^140 is an upper bound on the number of distinct messages that can be conveyed in 140 characters given a vocabulary of 50 symbols [2]. This first approximate upper bound should be informative for future studies of Twitter that try to answer questions such as: How many of the theoretically possible tweets have already been written? Or, in other words, how much is there left to write before we run out of (sensical) combinatorial options?
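To put the bound and the comparison into more familiar units, here is a rough sketch that converts 50^140 into bits per tweet and measures the gap to the Library of Babel in orders of magnitude. It assumes every one of the 50 symbols is equally likely at each position, which real tweets certainly do not satisfy, so the bit count is a ceiling rather than an estimate.

```python
import math

VOCABULARY = 50      # assumed symbol count, as above
TWEET_LENGTH = 140

# Information needed to single out one tweet among 50^140 possibilities.
bits_per_tweet = TWEET_LENGTH * math.log2(VOCABULARY)
bytes_per_tweet = math.ceil(bits_per_tweet / 8)

# Sizes of both libraries as powers of ten.
twitter_log10 = TWEET_LENGTH * math.log10(VOCABULARY)
babel_log10 = 410 * 40 * 80 * math.log10(25)
gap = babel_log10 - twitter_log10

print(f"{bits_per_tweet:.1f} bits per tweet, i.e. {bytes_per_tweet} bytes")
print(f"Library of Twitter: about 10^{twitter_log10:.2f} tweets")
print(f"Library of Babel is larger by about {gap:,.0f} orders of magnitude")
```

Under these assumptions a single tweet fits into roughly 790 bits, or 99 bytes.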

I’ll leave it to somebody else to calculate the number of bits and hard drives necessary to store, mine and search the Library of Twitter.

[1] All numbers were calculated with WolframAlpha.
[2] It is obvious that larger assumed vocabularies would significantly increase the size of the library.

9 responses

An important point in Borges’ story is that it’s told from the point of view of one inhabitant of the library, who doesn’t necessarily know the whole story. He narrates that there is a theory going around that the library contains every possible book, only once, and that therefore the library is finite. Another theory claims that the library’s contents repeat themselves, the library being an infinite Universe—if I recall correctly, some librarians spend their lives looking for proof of this, unsuccessfully.

Yeah, kind of. If they find two identical books, that refutes the theory that there’s only one copy of each, and nothing stops us from thinking then that the library is infinite. Some people instead choose to just walk in one direction, hoping to find the edge of the library at some point. But they haven’t been successful. Ultimately the story is an allegory of inquiry in a Universe that paradigmatically represents inquiry: a library.

Mind you, this is all from memory, and I’m not sure it’s entirely reliable :-)

About me

Markus Strohmaier, Full Professor of Web-Science at the Faculty of Computer Science at University of Koblenz-Landau (Germany) and Scientific Director at GESIS - the Leibniz Institute for the Social Sciences (Germany).

My research focuses on the World Wide Web, my interests include social computation, agents, online production systems and crowdsourcing.