The Google Ngram Viewer is a free online program that provides the user with a visual representation of the relative frequencies of lexical bundles, or Ngrams, as they have occurred in published texts over the years. It is the result of Google's effort to capture the text of thousands of books and transform it into digital form. These texts are now being offered for sale by Google in this new format, but as an offshoot of this for-profit project, Google has made public a unique linguistic resource. It can be accessed on the Internet at: https://books.google.com/ngrams

The Ngram Viewer was made available by Google in mid-December 2010 and is purportedly intended for an academic audience. In fact, however, a scholarly audience would seem to be too narrow a focus for Google: the output of the Viewer appeals to the general public as much as to the academic sector of the population. As quoted in The New York Times, "The goal (of the Ngram Viewer) is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books." This is according to Erez Lieberman Aiden, who collaborated with Jean-Baptiste Michel and Google to spearhead the research project, which would ultimately "demonstrate how vast digital databases can transform our understanding of language, culture and the flow of ideas." Messrs. Aiden and Michel (both Harvard Fellows) describe this method as culturomics.

Their study, published in the journal Science shortly after the program's public release, aims to reveal the rich research opportunities that open up when 'very high-throughput data analysis' is applied. By entering a single- or multi-word phrase, anyone can view the published frequency of the entry over time. This enables the user to browse cultural trends throughout the publishing world from the year 1500 to the present. By extension, what has been published in theory reflects society as a whole. Interestingly, the same expression can produce different output depending on whether it is queried against the American English or the British English corpus. There are many examples of this, and the user is encouraged to experiment.
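The query just described is driven entirely by URL parameters, which makes the American/British comparison easy to script. The sketch below builds Viewer query URLs for the two corpora; the parameter names (content, year_start, year_end, corpus) and corpus labels mirror what appears in the browser's address bar when using the Viewer, and are an assumption rather than a documented API.

```python
from urllib.parse import urlencode

BASE = "https://books.google.com/ngrams/graph"

def ngram_url(phrase, corpus="en-2019", year_start=1800, year_end=2019):
    """Build a Google Ngram Viewer query URL for a phrase.

    Parameter names follow those visible in the Viewer's address
    bar; they are an assumption, not an officially documented API.
    """
    params = {
        "content": phrase,
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
    }
    return f"{BASE}?{urlencode(params)}"

# Compare the American and British English corpora for spelling variants
us = ngram_url("color", corpus="en-US-2019")
uk = ngram_url("colour", corpus="en-GB-2019")
print(us)
print(uk)
```

Pasting either URL into a browser should reproduce the corpus-specific charts described above, assuming the parameter names have not changed.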

In fact, what took years to accomplish was not writing the code to analyze the data but collecting the data itself. A very rough estimate of the words available in the data sets provided is 2 trillion. While the project's goal was not to serve as a tool for language learners, it demonstrates the enormous potential of marrying vast databases of words with powerful computerized analysis.

This manipulation of text that Google is making available to the public has several elements that guide the user along a predetermined path. Looking at the Ngram Viewer from a language learner's perspective, four features stand out:

Since this is a graphic representation of published usage of lexical bundles, or Ngrams, the Viewer does not include examples of authentic conversational language. Conversational language is distinct from written text in many ways and its absence is a significant gap for the serious language learner.

The Google Ngram Viewer does not suggest frequently used (but often invisible) linguistic 'chunks' or 'lexical bundles' as they have appeared in published texts. The user must supply the Ngram; the program then reveals the relative frequency with which that phrase or bundle has been used historically in published texts. If an Ngram is unknown to the user, it cannot be entered. This format is not directed towards the language learner, whose knowledge of the target language is limited.
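The input/output flow just described can be illustrated with a toy computation. This is a sketch of what "relative frequency" of an Ngram means in a single text, not Google's actual method, and it makes the limitation concrete: the function can only score a phrase the caller already knows to ask about.

```python
from collections import Counter

def relative_frequency(text, phrase):
    """Toy illustration: relative frequency of an n-gram in one text.

    The caller must supply the phrase; nothing here can suggest
    unknown bundles, mirroring the Viewer's limitation.
    """
    words = text.lower().split()
    target = tuple(phrase.lower().split())
    n = len(target)
    # All overlapping n-grams of the same length as the target phrase
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return counts[target] / len(ngrams) if ngrams else 0.0

# "the cat" occurs 2 times among the 8 bigrams of this 9-word text
print(relative_frequency("the cat sat on the mat the cat ran", "the cat"))  # 0.25
```

The Viewer performs the analogous count per publication year across its corpus, then plots the resulting proportions over time.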

What is referred to as 'raw data' appears to be text that has already been manipulated to such an extent that no further unique analysis by the user is possible.

The context within which the Ngram, or lexical bundle, is used is unknown.

While the view of Ngram usage on offer is peculiar to Google's perspective, the massive effort made by the company should not be minimized. To have catalogued billions of words in at least eight languages and dialects from thousands of books is an enormous task. It is an example of the cost and dedication required to assemble a corpus large enough to permit meaningful analysis. Google estimates that it has scanned over 10% of all books ever published to construct this corpus.