Assessing Language Patterns: A Look at Texas Newspapers, 1829-2008

This visualization plots the language patterns embedded in 232,567 pages of historical Texas newspapers, as they evolved over time and space. For any date range and location, you can browse the most common words (word counts), named entities (people, places, etc.), and highly correlated words (topic models).

About this Project

“Mapping Texts” is a collaborative project between the University of North Texas and Stanford University. Its goal has been to develop a series of experimental new models that combine text-mining and geospatial analysis, so that researchers can develop better methods for finding and analyzing meaningful language patterns embedded within massive collections of historical newspapers.

This visualization plots the language patterns embedded in 232,567 pages of historical Texas newspapers, as they evolved over time and space. Using the date range slider to select any time period and the map of Texas to select any combination of locations, you can browse through three major categories of the newspapers’ language patterns: most common words (word counts), named entities (people, places, etc.), and highly correlated words (topic models).

For more information, see the project’s main site: Mapping Texts. For more information about the language analysis categories, explore the “about” link on each category below.

About the Word Counts

Word counts are one of the most widely used metrics for assessing language use in texts. For any given date range and set of locations that you select, you will see a ranked tally of the most frequently appearing words in the newspapers. You can choose between viewing these counts as either a ranked list or a word cloud.

You can see these word counts for any time period, and for any combination of individual cities or newspapers available during the years you select.
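The selection described above can be sketched in a few lines of Python. This is only an illustration: the page records, the `select_pages` function, and the sample dates and texts are hypothetical and are not taken from the project's data or code.

```python
from collections import Counter
from datetime import date

# Hypothetical page records: (publication date, city, page text).
pages = [
    (date(1836, 5, 1), "Houston", "cotton sale notice"),
    (date(1842, 3, 9), "Galveston", "cotton boxes received"),
    (date(1901, 7, 4), "Houston", "oil well drilled"),
]

def select_pages(pages, start, end, cities):
    """Keep the text of pages published from start to end (inclusive) in the given cities."""
    return [text for day, city, text in pages
            if start <= day <= end and city in cities]

# Choose a date range and a combination of locations, then tally the words.
selected = select_pages(pages, date(1830, 1, 1), date(1850, 12, 31),
                        {"Houston", "Galveston"})
counts = Counter(word for text in selected for word in text.split())
print(counts.most_common(2))
```

Narrowing the date range or the set of cities changes which pages feed the tally, which is why the ranked list shifts as you move the slider or click the map.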

In generating our word counts, we have removed "stop words," which are commonly appearing terms (such as "a," "the," "and," and "but") that by themselves have little meaning, but nonetheless appear quite frequently. This is a common technique in natural language processing, and is intended to help expose the most frequently appearing words that do have some useful meaning.
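As a rough illustration of stop-word removal, the following Python sketch tallies the words in a sentence after dropping stop words. The stop-word list, the `count_words` function, and the sample sentence are hypothetical and chosen only for illustration; they are not the project's actual list or code.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (not the project's actual list).
STOP_WORDS = {"a", "the", "and", "but", "of", "in", "to"}

def count_words(text, stop_words=STOP_WORDS):
    """Return a Counter of lowercase words in text, excluding stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

sample = "The cotton and the sugar arrived in Houston, but the cotton sold first."
counts = count_words(sample)

# Print the ranked tally, most frequent word first.
for word, n in counts.most_common(3):
    print(word, n)
```

Without the filter, "the" would top this list with three occurrences; with it, "cotton" rises to the top, which is the effect the technique is meant to have.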

Because our collection of digitized newspapers contains some "noise" from the digitization process (that is, scanned words were mistakenly jumbled by the computer during the transition from an image to electronic text), sometimes nonsensical words appear in the word count lists (such as "nnd" for "and"). For more on the digitization quality of the newspapers see "Assessing Newspaper Quality" and our White Paper.

About the Named Entity Counts

Named entity counts are tallies of the particular “entities” (proper nouns, such as the names of people or places) that appear in the text, counted just like basic word counts.

For any given date range and set of locations that you select, you will see a ranked tally of the most frequently appearing named entities in the newspapers. You can choose between viewing these counts as either a ranked list or a word cloud.

You can see these named entity counts for any time period, and for any combination of individual cities or newspapers available during the years you select.

To identify the named entities in each set of newspapers, we used the Stanford Named Entity Recognizer (http://www-nlp.stanford.edu/software/CRF-NER.shtml). During our testing of various potential parsers, the Stanford NER outperformed all others in accuracy, while also maintaining a processing speed comparable to that of the other taggers considered.
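Once a tagger has labeled each token, tallying named entities works much like tallying words. The Python sketch below assumes the tagger's output has already been reduced to (token, label) pairs, with "O" marking tokens that are not entities. The sample pairs and the `count_entities` function are hypothetical; this is not the Stanford NER API, and it is not the project's code.

```python
from collections import Counter
from itertools import groupby

# Hypothetical tagger output: (token, label) pairs, with "O" for non-entities.
tagged = [
    ("Sam", "PERSON"), ("Houston", "PERSON"), ("visited", "O"),
    ("Galveston", "LOCATION"), ("and", "O"), ("Austin", "LOCATION"),
    ("before", "O"), ("Sam", "PERSON"), ("Houston", "PERSON"),
    ("returned", "O"),
]

def count_entities(tagged_tokens):
    """Merge adjacent tokens that share a label into one entity, then tally."""
    counts = Counter()
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            name = " ".join(token for token, _ in group)
            counts[(name, label)] += 1
    return counts

entity_counts = count_entities(tagged)
print(entity_counts.most_common(1))
```

Merging adjacent tokens with the same label is a simplification: two different people named back to back would be fused into a single entity, so a fuller treatment would need a way to mark entity boundaries.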

Because our collection of digitized newspapers contains some "noise" from the digitization process (that is, scanned words were mistakenly jumbled by the computer during the transition from an image to electronic text), sometimes nonsensical words appear in the named entity lists, just as they do in the word count lists (such as "nnd" for "and"). For more on the digitization quality of the newspapers see "Assessing Newspaper Quality" and our White Paper.

About the Topic Modeling

Topic modeling is a method of text analysis that has grown in popularity among humanities scholars in recent years. The basic concept is to use statistical methods to uncover connections between collections of words (which are called “topics”) that appear in a given text.

So, for example, running a topic modeling program over a body of text will produce a series of “topics,” which are strings of words (such as “texas, street, address, good, wanted, Houston, office”) that may not necessarily appear next to one another within the text but nonetheless appear to have a statistical relationship to one another. Topic modeling, in other words, uses statistics to produce lists of words that appear to be highly correlated with one another, in the hope of exposing larger, wider patterns in language use than a close reading would be able to provide.

For each historical era, we used MALLET to generate the top ten topics, with 100 words per topic, for all Texas newspapers from that era. We also generated topics for each Texas city during those eras, so you can explore topic models either region-wide or drill down to individual cities.

These topics represent "statistical themes" that the MALLET program detects in the newspaper collections. The words in each topic appear in ranked order: the first word in a collection of 100 is the most relevant, the second word the second-most relevant, and so on. (One way to think about this would be to consider the words in a topic to be magnetic to one another, with the first word being the most magnetic, the second word the second-most magnetic, and so forth.)
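The idea of a ranked topic can be sketched in Python, assuming each word carries a numeric weight expressing how strongly it belongs to the topic. The weights below are invented for illustration and the `ranked_words` function is hypothetical; this is not MALLET's output format or the project's code.

```python
# Hypothetical word weights for one topic; the values are invented.
topic_weights = {"court": 0.021, "county": 0.034, "land": 0.018, "notice": 0.012}

def ranked_words(weights, top_n=100):
    """Return up to top_n words ordered from highest to lowest weight."""
    ordered = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ordered[:top_n]]

# The first word printed is the most relevant to the topic, and so on down the list.
print(ranked_words(topic_weights))
```

Setting `top_n` to 100 mirrors the 100 words shown per topic; a smaller value would simply cut the list off sooner.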

Sometimes a topic is a collection of nonsensical words (like “anu, ior, ethe, ahd, uui, auu, tfie” and so on). This happens when the algorithm finds a common thread among the “noise” (that is, words that were jumbled by the digitization process), recognizes a commonality between these non-words, and groups them into a “topic.”

More often, however, the topic models group together words that have a clear relationship to one another. If, for example, you were to select all the newspapers from the Republic of Texas era, one of the topic models offered includes “Texas, government, country, states, united, people, mexico, great, war . . .” which seems to suggest that a highly relevant theme in the newspapers during this era was the international disputes between the United States and Mexico over the future of the Texas region (and the threat of war that came with that). What is even more revealing, however, is that most of the other topic models suggest that this was only one concern, and perhaps even a lesser one, among the issues covered in the newspapers of 1830s and 1840s Texas, such as matters of the local economy (“sale, cotton, Houston, received, boxes, Galveston”), local government (“county, court, land, notice, persons, estate”), and social concerns (“man, time, men, great, life”).