Terrier

About

TERRIER (Temporally Extended, Regular, Reproducible International Event
Records) BETA is a new machine coded event dataset produced from a historical
corpus ranging from 1979 to 2016, available for download at
OSF. Event data consists of structured records of
political events described in text in the form of (1) a source actor (2)
committing an action (3) against a target. The political events recorded in the
dataset include a wide range of political behaviors: meetings, statements,
provision of aid, protests, attacks, and violence. This dataset is an initial
beta release of the data, including initial event geolocation. We encourage researchers
to carefully check the data they use and to contact our team with any issues
they uncover regarding the data by opening a thread on our discussion
forum.

The dataset was produced by a team at the University of Oklahoma as part of the
NSF RIDIR grant “Modernizing Political Event Data” SBE-SMA-1539302. Any
opinions, findings, and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect the views of
NSF or the U.S. government.

Getting Started

What is event data?

Event data, at its most basic, consists of a “triple” of information: an event, such as a protest or attack, performed by a source actor against a target. These events and actors are automatically recognized in text, extracted, and resolved to a defined set of codes, such that “demonstrated” and “rallied in the streets” would both be coded as a “Protest” event and “Angela Merkel” and “German Ministry of Defense” would both be represented as DEU GOV. Performing this process on many millions of documents produces a set of structured data that is much easier to analyze than the raw documents.

In producing event data, we build on the dominant paradigm of event coding in English, which consists of automatically comparing grammatically parsed sentence text with hand-defined dictionaries using an event coding tool. The tool follows instructions about how to combine the extracted noun and verb phrases into a directed event with a source and target, and resolves the extracted text to a specified set of codes defined in an ontology. An automated event coding system thus consists of two components: a set of dictionaries that map noun and verb phrases to their corresponding actor and event codes in an ontology, and an event coder that applies these dictionaries to the text and makes decisions about how to combine individual actors and actions into coded events.
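The two components described above can be illustrated with a minimal sketch. It is not the coder used to produce TERRIER: it assumes the noun and verb phrases have already been extracted by a parser, and the dictionaries, the `Event` class, and the `code_event` function are hypothetical names with toy entries (the “DEU EDU” code in particular is only illustrative).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """A coded (source, action, target) triple."""
    source: str
    action: str
    target: str


# Toy dictionaries for illustration only; real dictionaries are far larger.
ACTOR_DICT = {
    "angela merkel": "DEU GOV",
    "german ministry of defense": "DEU GOV",
    "german students": "DEU EDU",  # illustrative code, not taken from the source
}
VERB_DICT = {
    "demonstrated": "PROTEST",
    "rallied in the streets": "PROTEST",
}


def normalize(phrase: str) -> str:
    """Lowercase a phrase and collapse whitespace before dictionary lookup."""
    return " ".join(phrase.lower().split())


def code_event(source_phrase: str, verb_phrase: str,
               target_phrase: str) -> Optional[Event]:
    """Resolve three extracted phrases to codes; return None if any is unknown."""
    source = ACTOR_DICT.get(normalize(source_phrase))
    action = VERB_DICT.get(normalize(verb_phrase))
    target = ACTOR_DICT.get(normalize(target_phrase))
    if source is None or action is None or target is None:
        return None
    return Event(source, action, target)


print(code_event("German students", "rallied in the streets", "Angela Merkel"))
```

A production coder must also decide which phrases in a parsed sentence play the source and target roles, which this sketch sidesteps by taking them as arguments.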

The ontology we use is the CAMEO ontology, which is the current standard coding ontology for most event data. CAMEO enforces a requirement for a source actor and target actor to go along with each event. Actors and events are each assigned hierarchical codes. Actor codes begin with high-level information comprising their country or international status, followed by a functional role code, such as “GOV” or “MIL”, with secondary codes providing greater detail in some cases. Event codes can be aggregated into five top-level classes, 20 intermediate event types, or around 200 low-level codes. Each code is documented in the codebook, available here.
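Because the codes are hierarchical, coarser categories can be recovered from detailed ones by simple string operations. The sketch below assumes that event codes are stored as zero-padded strings whose first two digits give the intermediate event type, and that actor codes are concatenations of three-letter components; both storage conventions, and the function names, are assumptions to check against the actual files.

```python
from collections import Counter
from typing import List


def event_root(code: str) -> str:
    """Return the two-digit intermediate event type of a CAMEO event code."""
    return code[:2]


def split_actor(code: str) -> List[str]:
    """Split a concatenated actor code into its three-letter components."""
    return [code[i:i + 3] for i in range(0, len(code), 3)]


# Example low-level event codes, aggregated to their intermediate types.
codes = ["141", "1411", "145", "190", "010"]
root_counts = Counter(event_root(c) for c in codes)

print(root_counts["14"])         # number of events whose root type is 14
print(split_actor("DEUGOVMIL"))  # country, then role, then secondary role
```

Mapping the 20 intermediate types onto the top-level classes requires a lookup table from the codebook and is left out here.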

Terrier Dataset

The Terrier dataset was generated from roughly 300 million news stories from 500 news sources around the world, ranging in dates from 1979 to 2016. The raw text of each story was obtained from LexisNexis, first through special API access and later through bulk dumps provided by LexisNexis on mailed hard disks. From these news articles, we produced around 60 million events.

The number of events produced per month (below) shows a marked increase over
time, as more source text becomes available and as dictionary coverage
improves. This exponential increase is familiar from other event datasets and
poses challenges for researchers making over-time claims. Importantly, however,
it shows no major missing time periods.
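One common way to reduce the effect of growing volume on over-time comparisons is to work with each month's share of events rather than raw counts. The following is a minimal sketch of that idea, not a procedure prescribed by the TERRIER team; it assumes each event is available as a (date, event code) pair with dates formatted as “YYYY-MM-DD”, and `monthly_share` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def monthly_share(events: Iterable[Tuple[str, str]], root: str) -> Dict[str, float]:
    """Share of each month's events whose code starts with the given root."""
    totals = Counter()
    hits = Counter()
    for date, code in events:
        month = date[:7]  # "YYYY-MM"
        totals[month] += 1
        if code.startswith(root):
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Toy data: protest events (root 14) double in raw count between the two
# months, but so does the total number of events.
events = [
    ("1990-01-03", "141"), ("1990-01-09", "010"),
    ("2010-01-05", "141"), ("2010-01-06", "145"),
    ("2010-01-07", "010"), ("2010-01-08", "020"),
]
print(monthly_share(events, "14"))
```

In the toy data the raw count of protest events rises from one to two, while the share stays at one half in both months.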

The events in TERRIER have initial geolocation information attached to them.
The geolocation process arbitrarily selects one of the locations that CLIFF-CLAVIN extracts from the
sentence and assigns it to the event in question. As with many maps of human activity,
a high-level plot of geolocated events largely mirrors the world’s
population density.

Codebook

The data available on OSF comes in two formats: JSON
and TSV. The JSON data includes field names for each entry, and the TSV does
not. These are the fields available for each event, presented in the order they
occur within the TSV files. For more details on what the different codes
represent, please consult the CAMEO
manual.
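Since the TSV files carry no header row, field names have to be supplied when reading them. The sketch below shows one way to do that with the standard library. The names in `FIELD_NAMES` are placeholders, not the actual TERRIER fields; substitute the real field list in file order. `read_tsv` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List

# Placeholder field names; replace with the documented fields in file order.
FIELD_NAMES = ["event_id", "date", "source", "target", "code"]


def read_tsv(handle: Iterable[str], field_names: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per row of a header-less, tab-separated file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(field_names, row))


# In-memory stand-in for a downloaded file.
sample = io.StringIO("1\t1990-01-03\tDEUGOV\tFRAGOV\t141\n")
rows = list(read_tsv(sample, FIELD_NAMES))
print(rows[0]["source"], rows[0]["code"])
```

For a real file, pass an open file handle (opened with `newline=""`) in place of the `StringIO` object. All values are read as strings, so event codes keep any leading zeros.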

Sources

The English-language sources used in TERRIER include LexisNexis’ complete
collection of articles published by the following sources between 1979 and
early 2016. (Note that LexisNexis does not possess many articles for
these sources in the 1980s and 1990s.)

Technical Details

Producing event data requires passing text through a series of tools to
grammatically parse sentences, recognize events, and record them in a
structured format. This process depends on recognizing actors, events, and
targets in text and comparing them to hand-built dictionaries to produce
standardized actor and event codes. TERRIER was produced with several open
source tools, along with tools produced and maintained by the Open Event Data Alliance.

Grammatical parsing

The first step in producing event data is to annotate the text with grammatical
markup to provide information about the structure of sentences and the
syntactic relationships between different parts of the sentence. This step
automatically identifies noun and verb phrases and the relationships between
them, making it easier to determine who the actors are and what events are
occurring.

CoreNLP

To perform this step, we draw on the large body of work conducted in
computational linguistics and natural language processing over the past two
decades. Specifically, we use Stanford University’s
CoreNLP to provide a
constituency parse of each document.

Biryani

Because of the size of our corpus (~2TB, 300 million stories), running CoreNLP
was not a trivial task. We developed a distributed task-queue tool for
distributing CoreNLP jobs across a cluster of machines to speed
processing.[1] Our tool,
biryani, uses a Kalman filter to
dynamically adjust the batch size and thread count in processing. More details
are available in an article
here.

Event Coding

The second major step in producing event data is to recognize political events
in text: determining which words in the sentence correspond to actors, targets, and
events, and which codes to assign to each actor or event.

Once CoreNLP generates grammatical information on the document, we are left
with the task of determining which noun phrases correspond to our “source” and
“target” actors, and which verb phrases could be events. In addition to finding
these spans of text, we also want to resolve them to predefined categories to
make them easily analyzable for social science research.

Petrarch2

The heart of our event data pipeline is Petrarch2, which locates actors,
events, and targets in text, compares them to dictionaries that map short
phrases to actor and event codes, and returns a complete event. Petrarch2 is a
well-known workhorse in automated event data. It is available for download
here and is described in a white
paper
here.

Birdcage

We produced a new pipeline to run Petrarch2 at scale over many millions of
documents. Although Petrarch2 is quite fast, it is not natively parallel and also
requires slower pre- and post-processing steps, including geolocation and final
formatting. To bundle all of these steps together, we created
Birdcage, a distributed pipeline
that can quickly generate event data from CoreNLP-processed text.

[1] Although a nice Spark wrapper for CoreNLP exists, we preferred our simpler distributed approach because of its portability across systems and our desire to not depend on maintaining a Spark cluster.