Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear fewer than 200 times.
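To make the idea of thresholded n-gram counts concrete, here is a minimal sketch in Python. It is a toy, single-machine illustration and not the distributed pipeline used to build the dataset; the function name `ngram_counts`, the `min_count` parameter, and the tiny example sentence are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n, min_count=1):
    """Count n-grams in a token list, keeping those seen at least min_count times.

    Hypothetical helper for illustration only; the real dataset applied
    thresholds of 40 (n-grams) and 200 (words) over a trillion tokens.
    """
    # Slide a window of width n over the tokens and tally each window.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Drop every n-gram that falls below the frequency threshold.
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus, whitespace-tokenized (real tokenization is more involved).
tokens = "serve as the input and serve as the index and serve as the input".split()
trigram_counts = ngram_counts(tokens, 3, min_count=2)
```

With a threshold of 2, only the trigrams that repeat in the toy sentence survive, which mirrors on a small scale how rare sequences are dropped from the published counts.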

Watch for an announcement at the Linguistic Data Consortium (LDC), which will be distributing it soon, and then order your set of 6 DVDs. And let us hear from you - we're excited to hear what you will do with the data, and we're always interested in feedback about this dataset, or other potential datasets that might be useful for the research community.

Update (22 Sept. 2006): The LDC now has the data available in their catalog. The counts are as follows:

File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
serve as the instructional 173
serve as the instructor 286
serve as the instructors 161
serve as the instrument 614
serve as the instruments 193
serve as the insurance 52
serve as the insurer 82
serve as the intake 70
serve as the integral 68
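Each sample entry above pairs a four-word sequence with its count. As a rough sketch of how such entries might be read, the Python below splits a line on its last run of whitespace. The exact delimiter used in the distributed files is an assumption here, and `parse_ngram_line` is a hypothetical helper, not part of any official tooling.

```python
def parse_ngram_line(line):
    """Split one 'w1 w2 w3 w4 count' line into (tuple_of_words, count).

    Assumes the count is the last whitespace-separated field; the
    delimiter in the actual distributed files may differ.
    """
    ngram, count = line.rsplit(None, 1)
    return tuple(ngram.split()), int(count)


# Four of the sample entries shown above.
sample = """serve as the initial 5331
serve as the inspiration 1390
serve as the input 1323
serve as the information 838"""

entries = dict(parse_ngram_line(line) for line in sample.splitlines())

# Most frequent continuation among these four lines only.
top = max(entries, key=entries.get)
total = sum(entries.values())
share = entries[top] / total
```

Note that `share` is only the fraction within these four lines; estimating a real conditional probability for a word following "serve as the" would need the counts of every continuation, or the corresponding trigram count.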