Linguistic Corpora at Brown

Brown is a member of the Linguistic Data Consortium (LDC), an open consortium of universities, libraries, corporations and government research laboratories that create and distribute a wide array of language resources.

You must fill out the Language Corpora Access Request Form (see link below) to be granted access to corpora from the LDC or other institutions.

A link to this form is also present in the library catalog record for each corpus.

What happens once I fill out the form?

After your request has been reviewed by the Library and, if applicable, your advisor, you will receive an email notifying you that access to the corpora has been granted. This email will include instructions on how to access the corpora.

Some corpora require you to sign a special user license agreement. If a corpus that you have requested requires such a license, you will receive an email with the license. Once you have returned the signed license, you will receive an email notifying you that access has been granted.

Once granted, your authorization may be restricted or limited, depending on your status, project, or other factors.

Corpora Basics

What are linguistic corpora?

Linguistic corpora are collections of language data, either written texts or transcriptions of recorded speech, selected according to external criteria to represent, as far as possible, a language or language variety. They provide a source of data for linguistic research. Many linguistic corpora are available electronically as machine-readable texts. Using and manipulating these data requires some familiarity with processing text files and writing code.
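As a rough illustration of the kind of text processing involved, the sketch below counts word frequencies and builds simple keyword-in-context (concordance) lines from plain text. It is a minimal example under stated assumptions, not a method prescribed by the LDC or any particular corpus: the function names (tokenize, word_frequencies, concordance) and the sample sentence are hypothetical, and real corpora often come with their own file formats, annotation schemes and tools.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a "word" is a run of letters, optionally joined by
    apostrophes (e.g. "don't"). Annotated corpora usually need a
    tokenizer that understands their markup.
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent tokens as (token, count) pairs."""
    return Counter(tokenize(text)).most_common(top_n)


def concordance(text, keyword, width=2):
    """Return (left context, keyword, right context) for each match.

    `width` is the number of tokens kept on each side of the keyword.
    """
    tokens = tokenize(text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text standing in for a corpus file.
sample = "The cat sat on the mat. The dog saw the cat."

print(word_frequencies(sample, top_n=2))
for left, keyword, right in concordance(sample, "cat"):
    print(f"{left:>10} [{keyword}] {right}")
```

To work with an actual corpus file, the sample string could be replaced with the contents of a text file read with Python's built-in open(), using the encoding stated in the corpus documentation.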

The Centre for English Corpus Linguistics (CECL) specializes in the collection and use of corpora for linguistic and pedagogical purposes. CECL has compiled three main types of corpora: learner corpora, pedagogical corpora, and multilingual corpora. The corpora are predominantly English, but other languages, including French, Dutch and Swedish, are also represented.

Brown is a member of the Linguistic Data Consortium (LDC), an organization that creates and distributes language resources. LDC provides tools for, and information about, 'unpacking' the data in their corpora. See the 'Linguistic Corpora at Brown' box on this page for more information about accessing LDC corpora.

Selected English-Language Corpora

The Dictionary of Old English Corpus (DOEC) is an online database containing at least one copy of every Old English text. In some cases, more than one copy is included when a copy is significant because of its dialect or date. In total, the DOEC represents about three million words of Old English and another two million words of Latin.

Research teams around the world have prepared electronic corpora of their national or regional variety of English according to a common design and a common scheme for grammatical annotation. Currently available corpora include those for Canada, Jamaica, Hong Kong, India and Ireland.

A collection of over 2,600 texts (mostly scholarly editions), ranging from classic works of French literature to various kinds of non-fiction prose and technical writing. Its search and retrieval engine, PhiloLogic3, lets scholars perform textual analysis and narrow or broaden both their searches and their search results.