Talk about big data: How the Library of Congress can index all 170 billion tweets ever posted

The Library of Congress has received a 133TB file containing 170 billion tweets -- every single post that's been shared on the social networking site -- and now it has to figure out how to index it for researchers.

In a report outlining the library's work thus far on the project, officials note their frustration regarding tools available on the market for managing such big data dumps. "It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data," the library says. "Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task."

If private organizations are having trouble managing big data, how is a budget-strapped, publicly funded institution -- even if it is the largest library in the world -- supposed to create a practical, affordable and easily accessible system to index 170 billion, and counting, tweets?

Twitter signed an agreement allowing the nation's library access to the full trove of updates posted on the social media site. Library officials say creating a system to allow researchers to access the data is critical since social media interactions are supplanting traditional forms of communication, such as journals and publications.

The first data dump came in the form of a 20TB file of 21 billion tweets posted between 2006, when Twitter was founded, and 2010, complete with metadata showing the place and description of tweets. More recently, the library got its second installment with all the tweets since 2010. Together, the two copies of the compressed files total 133.2TB. Going forward, the library is collecting new tweets on an hourly basis through partnering company Gnip. In February 2011 that amounted to about 140 million new tweets each day. By October of last year, it had grown to nearly a half-billion tweets per day.

Researchers are already clamoring for access to the data -- the library says it has had more than 400 inquiries. The project is running in parallel with efforts by Twitter to give users a record of their Twitter history, including an itemized list of every tweet they have posted from their account.

The Library of Congress is no stranger to managing big data: Since 2000, it has been collecting archives of websites containing government data, a repository already 300TB in size, it says. But Twitter archives pose a new problem, officials say, because the library wants to make the information easily searchable. In its current tape repository form, a single search of the 2006-2010 archive alone -- which is just one-eighth the size of the entire volume -- can take up to 24 hours. "The Twitter collection is not only very large, it also is expanding daily, and at a rapidly increasing velocity," the library notes. "The variety of tweets is also high, considering distinctions between original tweets, re-tweets using the Twitter software, re-tweets that are manually designated as such, tweets with embedded links or pictures and other varieties."

The solution is not easily apparent. The library has begun studying distributed and parallel computing programs, but it says they're too expensive. "To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost prohibitive and impractical for a public institution."
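A rough back-of-the-envelope calculation suggests why the server count climbs so quickly. The sketch below takes the article's figures (a 20TB archive, up to 24 hours per search) and assumes, purely for illustration, that a search is a sequential scan, that a terabyte is 10^12 bytes, and that search time shrinks linearly as the data is split across more servers. Real systems rarely scale that cleanly, and the function names are hypothetical.

```python
import math

TB = 10**12  # decimal terabyte (an assumption; the report does not specify)


def scan_rate_bytes_per_sec(archive_bytes, hours):
    """Effective throughput implied by scanning the whole archive in `hours`."""
    return archive_bytes / (hours * 3600)


def servers_needed(baseline_hours, target_seconds):
    """Servers required to hit `target_seconds`, assuming ideal linear scaling.

    Each server scans an equal shard in parallel, so the search time is the
    single-machine time divided by the number of servers.
    """
    return math.ceil(baseline_hours * 3600 / target_seconds)


# Figures from the article: 20TB for 2006-2010, up to 24 hours per search.
rate = scan_rate_bytes_per_sec(20 * TB, 24)
print(f"Implied scan rate: {rate / 10**6:.0f} MB/s")

for target in (3600, 600, 60):
    print(f"{target:>5}s search -> {servers_needed(24, target)} servers")
```

Under those assumptions, a one-minute search of the 2006-2010 archive alone would already call for well over a thousand machines, which is in line with the library's "hundreds if not thousands" estimate.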

So what's the library to do? Big data experts say there are a variety of options to consider. It would probably make the most sense for library officials to find a tool for storing the data, another for indexing it, and yet another to run queries against it, says Mark Phillips, director of community and developer evangelism at Basho, maker of Riak, an open source database tool with a simple, massively scalable key-value store.
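To make that three-part split concrete, here is a minimal in-memory sketch: a key-value store for the raw tweets, an inverted index that maps each term to the tweets containing it, and a query function that combines the two. It is a toy illustration only, not Riak's API or anything the library has built; the class names, the tokenizer, and the sample tweets are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple terms (hashtags and mentions kept)."""
    return re.findall(r"[a-z0-9#@_]+", text.lower())


class TweetStore:
    """Storage tier: a plain key-value map from tweet ID to raw text."""

    def __init__(self):
        self._data = {}

    def put(self, tweet_id, text):
        self._data[tweet_id] = text

    def get(self, tweet_id):
        return self._data[tweet_id]


class InvertedIndex:
    """Indexing tier: maps each term to the set of tweet IDs that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, tweet_id, text):
        for term in tokenize(text):
            self._postings[term].add(tweet_id)

    def lookup(self, term):
        return self._postings.get(term.lower(), set())


def search(store, index, query):
    """Query tier: return tweets containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matching_ids = set.intersection(*(index.lookup(t) for t in terms))
    return [store.get(tweet_id) for tweet_id in sorted(matching_ids)]


# Hypothetical sample data for illustration.
store, index = TweetStore(), InvertedIndex()
samples = {
    1: "Visiting the Library of Congress today",
    2: "Big data is hard #bigdata",
    3: "The library now archives every tweet",
}
for tweet_id, text in samples.items():
    store.put(tweet_id, text)
    index.add(tweet_id, text)

print(search(store, index, "library"))
```

The point of the separation is that a query touches only the small index entries for its terms and then fetches the handful of matching tweets, rather than scanning the whole archive.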

Big data management tools have turned into a robust industry with both proprietary and open source options available for different use cases and costs. One of the biggest questions Library of Congress officials will have to tackle is how hands-on they're willing to be in creating and managing the system. If the library wants to take an open source route, there are a variety of tools that can be used to create and manage databases -- everything from a Hadoop cluster to a Greenplum database that specializes in high input/output read/write capabilities. Those can be combined with Apache Solr, which is an open source search tool. Open source provides a free way for developers to take the source code and construct a system based on commodity hardware, but it can also take a lot of developer work on the back end. The library can also go the proprietary -- and more expensive -- route of using database software from the likes of Oracle or SAP.

Either way, the amount of data the library has for the Twitter project is not insurmountable. 133TB, and growing, is a large amount of data, but Basho has customers managing petabytes of data on its platform, Phillips says. If the library can track how much the database will be growing each month or quarter, then so long as it has the hardware capacity to store the data, the database software should be able to handle it.
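Tracking that growth can start with simple arithmetic. The sketch below estimates an average compressed size per tweet from the article's figures and projects how much storage one copy of the archive would need after a given number of days. It assumes that the 133.2TB figure covers two equal copies of roughly 170 billion tweets, that a terabyte is 10^12 bytes, and that the daily volume holds steady at the half-billion tweets cited above. None of those assumptions come from the library itself, and the function names are hypothetical.

```python
TB = 10**12  # decimal terabyte (an assumption)

# Figures from the article; the two-equal-copies reading is an assumption.
TOTAL_ARCHIVE_TB = 133.2
COPIES = 2
TOTAL_TWEETS = 170e9
TWEETS_PER_DAY = 500e6


def bytes_per_tweet(total_tb, copies, tweets):
    """Average compressed bytes per tweet in a single copy of the archive."""
    return (total_tb * TB / copies) / tweets


def projected_tb(current_tb, tweets_per_day, avg_bytes, days):
    """Storage for one copy after `days` of growth at a constant daily rate."""
    return current_tb + tweets_per_day * avg_bytes * days / TB


avg = bytes_per_tweet(TOTAL_ARCHIVE_TB, COPIES, TOTAL_TWEETS)
one_copy_tb = TOTAL_ARCHIVE_TB / COPIES

print(f"Average size: {avg:.0f} bytes per tweet")
for days in (30, 90, 365):
    size = projected_tb(one_copy_tb, TWEETS_PER_DAY, avg, days)
    print(f"After {days:>3} days: {size:.1f} TB per copy")
```

Since the daily tweet count more than tripled between February 2011 and last October, a constant-rate projection like this one is best read as a lower bound.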

Should the library use the cloud? Theoretically, the library could use a public cloud resource like Amazon Web Services to store all this data and just have AWS provide the constantly increasing amount of hardware capacity that's needed to store all these tweets. Seth Thomas, a Basho engineer, doesn't know if that would be cost-effective over the long term, though. A hybrid architecture is likely more fiscally wise since the library plans to keep this data forever. Perhaps storing the data on-site and using a cloud-based service for an analytics tool could work. That would allow the queries to dynamically scale resources as they are needed to execute a search, enabling the final system to handle the range of requests leveled upon it.

However the library decides to index the tweets, just remember next time you update your status on Twitter, it's being recorded somewhere.

Network World staff writer Brandon Butler covers cloud computing and social collaboration. He can be reached at BButler@nww.com and found on Twitter at @BButlerNWW.
