Text-as-data journalism? Highlights from a decade of SOTU speech coverage

In a guest post for OJB, Barbara Maseda looks at how the media has used text-as-data to cover State of the Union addresses over the last decade.

State of the Union (SOTU) addresses are amply covered by the media, from traditional news reports and full transcripts to summaries and highlights. But like other events involving speeches, SOTU addresses can also be analyzed using natural language processing (NLP) techniques to identify and extract newsworthy patterns.

Every year, a new speech is added to this small collection of texts, which some newsrooms process to add a fresh angle to the avalanche of coverage.

NLTK includes a SOTU corpus, easily accessible with the corpus downloader, and Kaggle offers a dataset with a number of texts, but these are limited to speeches from 1945 to 2006, and 1989 to 2017, respectively.

NLTK’s State of the Union corpus is outdated

The following list includes examples from the last 10 years (in chronological order) from different media outlets.

Published: January 23, 2007
Media outlet/Author: The NYT
Type of analysis: Term frequency (comparison across speeches by the same speaker)
Text collection: SOTU speeches by President George W. Bush from 2001 to 2007 (i.e. 7 speeches)
Method: Frequency of terms visualized by speech as aggregates (left and right), as well as individual occurrences that the user can explore in context (left).

Published: January 28, 2010
Media outlet: FiveThirtyEight
Author: Nate Silver
Type of analysis: Term frequency (comparison across speeches by different speakers) / Speech similarity
Text collection: SOTU speeches made by every president since John F. Kennedy (1962) in advance of their respective midterm elections (14 speeches in total, from 1962 to 2010)
Method: 70 relevant keywords were broken down into six categories (topics): process, values, domestic policy, foreign policy, the economy, framing/narrative. Their frequencies were visualized in tables like the one below. A color code was also assigned to each cell to make the table easier to understand.

Each visualization was followed by explanations and insights like the following one:

‘One reason that Obama’s speeches may come across as a bit aloof is that they are quite devoid of values buzzwords and particularly the terms “free” or “freedom”, which were among the more frequently employed words by most of his predecessors. He’s also failed to make use of one of Bill Clinton’s favorite hobbyhorses, which is the term “opportunity”.’

Published: January 25, 2011
Media outlet/Author: The NYT
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: All the SOTU speeches from 1934 to 2011
Method: A number of terms of interest (single words and bigrams) were selected and counted in each speech. Seventeen, to be exact: jobs, invest, deficit, small business, social security, power, innovative, compete, health care, tax, bipartisan, cooperate, terror, enemies, freedom, Afghanistan, recommended. It’s not clear if a stemmer was used in the analysis, but each term comes associated with a series of words that share a common stem to let the reader know that all of those variations were taken into consideration.
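The counting approach described above, where each term of interest stands for a family of variants, can be sketched in Python as follows. This is only an illustration and not the NYT's actual method: the variant lists in `TERM_FAMILIES` and the function name `count_term_families` are assumptions made for the example, and the variants are listed by hand rather than produced by a stemmer.

```python
import re
from collections import Counter

# Hypothetical term families: each label maps to the surface forms
# that are counted together (single words and bigrams alike).
TERM_FAMILIES = {
    "invest": ["invest", "invests", "invested", "investing",
               "investment", "investments"],
    "small business": ["small business", "small businesses"],
    "tax": ["tax", "taxes"],
}


def count_term_families(text, families=TERM_FAMILIES):
    """Count whole-word, case-insensitive matches for each family of variants."""
    counts = Counter()
    for label, variants in families.items():
        # Try longer variants first so the most specific form is matched.
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
        counts[label] = len(pattern.findall(text))
    return counts


# Invented sample sentence, not taken from any speech.
sample = ("We will invest in small businesses. "
          "Investing in a small business lowers taxes and tax burdens.")
family_counts = count_term_families(sample)
print(family_counts)
```

Running the same function over each speech in a collection would give one row of counts per speech, which is the shape of data a frequency chart of this kind needs.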

Published: January 24, 2012
Media outlet: The National Post
Author: Richard Johnson
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: 12 SOTU speeches from 2001 to 2012 (8 speeches by President George W. Bush, and 4 by President Barack Obama)
Method: A series of selected “categories” (29) were counted and visualized according to their frequency in each speech. Some of the categories were treated as topics, like “Jobs/Employment”, while most of the rest correspond to a single term (and their common stems, like in the case of “Free/Freedom”).

Words that are less relevant in speech comparisons across different decades gain relevance in a smaller text collection (in connection with the historical context). Note examples like Saddam, Middle East, Iraq.

Published: January 18, 2015
Authors: Benjamin Schmidt and Mitch Fraas
Media outlet: The Atlantic
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: 224 SOTU addresses (i.e. all of them up to that moment, from Washington to Obama)
Method: Using the Bookworm platform for text analysis, the authors determined which were the terms that had the highest frequency in the collection.

HER: Sometimes the context in which a word is used tells us more than raw frequencies. Before the Civil War, many presidents used female pronouns to refer to foreign states. Language evolved, and her disappeared from State of the Union addresses.

Published: January 14, 2016
Media outlet: Vox
Author: Javier Zarracina
Type of analysis: Term frequency (comparison across speeches by a single speaker)
Text collection: 8 SOTU speeches (2009–2016)
Method: Stopwords were removed from the text collection and the most common terms were found (and compared across years). It’s worth noting how the author decided to eliminate terms like “America” and “United States,” which in this case (speeches made in and about this country) are expected to be very common and at the same time not meaningful when detecting frequent topics/issues.
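A rough Python sketch of this kind of stopword filtering is shown below. It is not the code behind the Vox piece: the two stopword sets are tiny illustrative lists, the name `top_terms` is an assumption, and treating “United States” as the two separate tokens “united” and “states” is a simplification.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one.
BASE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
                  "our", "we", "for", "will"}
# Corpus-specific terms that are frequent but say little about topics.
DOMAIN_STOPWORDS = {"america", "american", "united", "states"}


def top_terms(text, n=3, extra_stopwords=DOMAIN_STOPWORDS):
    """Return the n most common tokens after removing both stopword sets."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stopwords = BASE_STOPWORDS | set(extra_stopwords)
    kept = [t for t in tokens if t not in stopwords]
    return Counter(kept).most_common(n)


# Invented sample text, not taken from any speech.
sample = ("America is strong. Jobs and jobs for the United States economy, "
          "and jobs for America.")
print(top_terms(sample, n=2))                      # domain terms filtered out
print(top_terms(sample, n=2, extra_stopwords=()))  # domain terms kept
```

Comparing the two calls shows why the extra list matters: without it, a word like “america” crowds out terms that point to actual issues.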

Published: January 11, 2016
Media outlet: Vox
Author: Alvin Chang
Type of analysis: Word count comparison (comparison across speeches by different speakers)
Text collection: The entire collection of SOTU speeches up to that moment (1790–2016)
Method: Instead of working with keywords, this article focuses on the number of words in each speech. The interactive data visualization includes 13 bar charts connected in a narrative that starts with Obama and the context of his previous speeches, moves on to compare him with Clinton, and then goes back in time, highlighting different legal and historic facts. The final chart provides information about the length of each speech, its form of delivery (written or spoken), speaker and year. The data for this piece was scraped from The American Presidency Project.
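Measuring speech length is the simplest of the analyses listed here. A minimal Python sketch follows, assuming that a “word” is any whitespace-separated token; the article does not say how words were counted, and the records below are invented placeholders rather than real transcripts.

```python
def word_count(text):
    """Count whitespace-separated tokens, a rough definition of 'words'."""
    return len(text.split())


# Placeholder records: speakers and texts are stand-ins, not real data.
speeches = [
    {"year": 2015, "speaker": "Speaker A", "text": "one two three four five"},
    {"year": 2016, "speaker": "Speaker B", "text": "one two three"},
]

# Build (year, speaker, length) rows, longest speech first.
lengths = sorted(
    ((s["year"], s["speaker"], word_count(s["text"])) for s in speeches),
    key=lambda row: row[2],
    reverse=True,
)
for year, speaker, n_words in lengths:
    print(year, speaker, n_words)
```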

Published: January 12, 2016
Media outlet: The Washington Post
Authors: Kennedy Elliott, Ted Mellnik and Richard Johnson
Type of analysis: Term frequency (comparison across speeches by different speakers)
Text collection: 117 SOTU addresses (1900–2016)
Description: Based on an analysis by Wayne Fields, a professor of English and American Culture Studies at Washington University in St. Louis, and Mark Liberman, a linguist at the University of Pennsylvania, this piece looks at the frequency of a series of selected terms grouped into 7 topics: nationalism, issues, daily lexicon, foreign policy, rhetoric, economy, and who we are.

Published: January 30, 2018
Media outlet: The Washington Post
Authors: Reuben Fischer-Baum, Ted Mellnik and Kevin Schaul
Text collection:
Type of analysis: First time occurrence of terms (comparison across speeches by different speakers)
Method: Terms (i.e. stems) compared chronologically across speeches to determine earliest occurrence in the text collection. Speech transcripts were obtained from the American Presidency Project, and the stemmer used is the one provided by Natural Node (Porter and Lancaster stemmers). Text data cleaning included removal of people’s names, places, contractions and acronyms.

Users can examine individual words to see them in context, along with information about the speaker (Clinton in this case) and the year. / Reuben Fischer-Baum, Ted Mellnik and Kevin Schaul / Washington Post
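The first-occurrence idea behind this piece can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Post's pipeline: `crude_stem` is a deliberately naive suffix stripper standing in for a real Porter or Lancaster stemmer, the cleaning steps (names, places, contractions, acronyms) are omitted, and the sample records are invented.

```python
import re


def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real Porter or Lancaster stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def first_occurrences(speeches):
    """Map each stem to the (year, speaker) of the earliest speech that uses it.

    `speeches` is an iterable of (year, speaker, text) tuples in any order.
    """
    first_seen = {}
    for year, speaker, text in sorted(speeches, key=lambda s: s[0]):
        for token in re.findall(r"[a-z]+", text.lower()):
            # setdefault keeps the earliest entry and ignores later ones.
            first_seen.setdefault(crude_stem(token), (year, speaker))
    return first_seen


# Invented placeholder records, deliberately out of chronological order.
speeches = [
    (2000, "Speaker B", "Hackers and networks"),
    (1990, "Speaker A", "Roads and bridges"),
    (2010, "Speaker C", "Networks and hacking"),
]
result = first_occurrences(speeches)
print(result["network"], result["hack"])
```

Because speeches are sorted by year before scanning, the first time a stem is stored is also its earliest appearance in the collection.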

*This link was not working at the time of writing. That’s also the reason why the image for this interactive piece was taken from Nicholas Diakopoulos’ presentation “From Words to Pictures: Text Analysis and Visualization”.