— Command line tools

Red Hen uses the Linux operating system (Debian) and in some cases the Mac OS X shell with the standard GNU utilities from MacPorts.

While we work continually to expose more of the Red Hen dataset to the search engine interfaces, some data types and certain operations of search and analysis can be accessed only from the command line. Much of the cutting-edge work is done directly on the servers, and graduate students and researchers may have projects that benefit from command line access.

Navigation

The Cartago server keeps two versions of NewsScape -- a text-only tree and a full
tree with video and images. For text only, you navigate using the command "dday":

dday 2013-05-03
dday 5 (for five days ago)
dday - 1 (for one day earlier -- spaces on both sides of the minus sign)
dday + 4 (for four days later)

dday 2015-04-01_1600_US_CNN_Legal_View_With_Ashleigh_Banfield (the day of a file)

Similarly, for the full tree with video and images, you navigate using the command "sweep":

sweep 2013-05-03

The letter l (lowercase L) is an alias for ls -Ll to list files.

Standard tools

On the command line, you can use GNU core utilities like grep, sed, awk, find, cut, xargs, and tr, along with regular expressions, to search the text. In the default bash shell, you also have some additional functionality, such as string manipulation. For
instance, to examine all the Named Entities in the annotated CNN files for one day, issue

grep '|NER' *CNN*seg > ~/${PWD##*/}_CNN_NER.txt

The redirect symbol is '>' -- it sends the output to a file that you name. To
indicate your home directory, use '~/'. To include the date of the
files you are grepping, you can use the chopped string ${PWD##*/}
-- try issuing this in any directory:

echo $PWD
echo ${PWD##*/}

The variable $PWD is a so-called environment variable, which
contains local information about your context, in this case the
name of the present working directory -- the same as if you issue

pwd

Then ${PWD##*/} chops off the path, leaving just the present
directory name. You can also save the output -- say of all frames in a file -- to an identically named file with a new extension, as in this for loop:
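A minimal sketch of such a loop, assuming .seg files in the current directory, frame annotations on FRM_01 lines, and an illustrative .frm extension for the output:

for f in *.seg; do
  # same basename, new extension, saved in your home directory
  grep '|FRM_01' "$f" > ~/"${f%.seg}.frm"
done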

Search tools

In addition to the core GNU utilities, Red Hen has developed the dedicated search utilities peck, peck-segment, peck-filter, peck-intersect, and peck-clip. A union operation is also available through a combination of standard unix commands. These search utilities perform searches within the several forms of annotation present in the NewsScape corpus; see Current state of text tagging.

Peck searches files for regex patterns by primary tag. The output files (which we call seeds) present one hit, or seed, per line. Each seed is a sentence with metadata. The seeds are written to csv files with the pipe character (|) as the delimiter. Accordingly, seeds can be imported into a statistical software package such as R.

Peck-segment searches for a regex pattern within a segment type, such as commercials.

Peck-intersect finds the intersection of different seeds produced by peck and/or peck-segment.

Peck-filter filters seeds by a second set of search criteria. It can be run before or after peck-intersect. Peck-filter just removes search results that don't satisfy some second criteria; it's like putting your search through a strainer.

Peck-clip creates a video clip of each seed.

Union finds the union of different seeds, such as seeds that draw on different primary tags. Union is not a separate script but a recipe for using standard unix commands. To find the union of seeds, concatenate and sort them (using the standard utilities cat and sort). Filenames for news broadcasts in Red Hen follow a specific order, including date, time, country, name of show, etc. A line in a seed has in its first field the filename, and in its second the start-timestamp for the beginning of the utterance. Since sort operates on the entire line, it will by default sort all lines in a seed first by filename and, within filename, by timestamp. This sorting pattern can be forced with cat <seed files> | sort -t '|' -k 1,1 -k 2,2 > <output file>.
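For instance, to take the union of two of the seed files produced in the examples below (the output filename is just illustrative):

cat ~/TIME-frame.csv ~/a-JJ.csv | sort -t '|' -k 1,1 -k 2,2 > ~/TIME-OR-a-JJ.csv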

A line in a seed has the following form:

Filename | expression | hot-link to the moment in the broadcast when the expression is uttered, so one can see the full audiovisual presentation and performance | start-timestamp | end-timestamp | primary tag on which the search was conducted | output of the hit. Example below:

2014-10-12_0000_US_KNBC_NBC_Nightly_News.seg|>> I THINK YOU GOT TO HAVE THE BAD DAYS SO YOU CAN LOVE THE GOOD DAYS EVEN MORE.|https://tvnews.sscnet.ucla.edu/edge/video,7625e7bc-51ab-11e4-b579-089e01ba0326,3338|20141012005538.967|20141012005543.038|POS_02|I/PRP|THINK/VBP|YOU/PRP|GOT/VBD|TO/TO|HAVE/VB|THE/DT|BAD/JJ|DAYS/NNS|SO/IN|YOU/PRP|CAN/MD|LOVE/VB|THE/DT|GOOD/JJ|DAYS/NNS|EVEN/RB|MORE./VBP|

For example, use peck to search for the construction "ProperNoun is the (title)(ProperNounString) of":
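A sketch of what such a search might look like against the Stanford part-of-speech tags (POS_02), where captions are upper-case and proper nouns are tagged NNP; the regex and output filename are only illustrative, and this simplified pattern catches only single-word titles:

peck seg '\|[A-Z.]+/NNP\|IS/VBZ\|THE/DT\|[A-Z]+/NN[A-Z]*\|OF/IN\|' POS_02 ~/is-the-title-of.csv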

Each command operates on a
day's worth of files, typically around a hundred news shows. Let's say we start with two peck searches. First we look for
instances of the word "time" in the frame annotations:

peck seg "time" FRM_01 ~/time-frames.csv

Instead of an open-ended search for the word time, we can also
restrict the search to the frame TIME:

peck seg "FRM_01\|TIME\|" FRM_01 ~/TIME-frame.csv

or "TIME" as a semantic role:

peck seg "SRL\|TIME\|" FRM_01 ~/TIME-SRL.csv

or some relevant frame element:

peck seg "\|Measure_duration" FRM_01 ~/Measure_duration.csv

Then we look in the parts-of-speech annotations for a particular
construction, say the indefinite article followed by an adjective:

peck seg "a\|[a-zA-Z]+/JJ" POS_01 ~/a-JJ.csv

Once we have these seeds, we can use intersect to find temporal
expressions in sentences that contain this particular
construction:

peck-intersect ~/TIME-frame.csv ~/a-JJ.csv ~/a-JJ-TIME.csv

This file can be read into R for statistical analysis.
For some purposes, we'd want to remove multiple annotations of the
same caption line. Each line contains a link to the location in the video where the
sentence was spoken; this allows us to do multimodal research on
constructions.
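As a sketch, a seed file can be pulled into R straight from the shell; the filename and column handling here are just illustrative:

Rscript -e 'seeds <- read.delim("~/a-JJ-TIME.csv", sep = "|", header = FALSE, quote = "", comment.char = ""); str(seeds)'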

The very simple logical architecture of peck and peck-intersect is quite powerful. Intersection can be applied recursively, so we can build a complex
search in multiple steps, starting with peck.
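Here is a sketch of such a multi-step pipeline; the patterns repeat the examples above, and the step1/step2 filenames are just illustrative:

# two initial peck searches
peck seg "SRL\|TIME\|" FRM_01 ~/TIME-SRL.csv
peck seg "a\|[a-zA-Z]+/JJ" POS_01 ~/a-JJ.csv
# intersect them: sentences with a TIME role that also contain the a + adjective pattern
peck-intersect ~/TIME-SRL.csv ~/a-JJ.csv ~/step1.csv
# intersect the result with a third search to narrow it further
peck seg "\|Measure_duration" FRM_01 ~/Measure_duration.csv
peck-intersect ~/step1.csv ~/Measure_duration.csv ~/step2.csv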

peck-filter

The peck-filter script takes an existing seed file -- the csv output of a peck or peck-segment search -- and adds a second set of search criteria. The output is a file of seeds that meet both criteria -- a strict subset of the original search. peck-filter can be run before or after peck-intersect, and allows you to define complex combinations of search conditions.

We can now do repeated peck searches and combine the results (per-show OR), or intersect two searches (per-show AND). Peck-filter is a new mix-and-match module that handles sentence-level AND conditionality -- it takes the timestamps from a peck result and looks for patterns only within that timestamp, which is to say, within a single sentence.

The scripts peck, peck-segment, peck-filter, and peck-intersect are designed for automation and can be run incrementally. We could write
scripts that call them repeatedly at different dates and create
visualizations on the fly: the morning news in a new form.

Let's say we create two
or more peck searches that run on every day of the corpus, and we
use intersect to locate some complex construction. We aggregate the
result from every day into a single csv file. Then we set up a
crontab that runs the same pecks with intersect at 2am on incoming
files. We add the output to this single csv file, pipe it to R,
generate a graph, and post it online. Instant construction updates
or even discovery. For instance, we could run a monitor for "because
NOUN" and see when the media start using it. You could get an e-mail
when it's spotted.
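A minimal sketch of such a crontab entry, assuming a hypothetical wrapper script ~/bin/run-pecks.sh that runs the pecks and peck-intersect on the newest day's files and prints the resulting seeds:

# hypothetical crontab entry: append the 2am results to the aggregate csv
0 2 * * * $HOME/bin/run-pecks.sh >> $HOME/because-NOUN.csv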

We might even be able to create clickable graphs that allow people
to access the underlying communicative act from the graph, as an
access interface.

Using Regex to Locate a Pattern in the Tagged Data

The
possibilities for tagging and search in Red Hen are virtually unlimited. To begin
to use the command-line tools for search, two things are indispensable.

First, acquire a rudimentary familiarity with regular expressions (regex). There are many gentle introductions and tutorials, such as http://www.regular-expressions.info/tutorialcnt.html.
Learning regex is a little like learning long division or how to factor
a quadratic polynomial: it is for the most part easy, but takes some study
and practice. Basic work can be done in the Edge Search Engine, but
advanced work requires regex.

Second, begin
to use regex to match tagging. Finding a simple alphanumeric string,
e.g. "Napoleon," can be done directly in the Edge Search Engine. One can
also use the Edge Search Engine for some basic boolean searches and
some basic regex searches. See — How to use the Edge search engine.
But advanced work usually requires working from the command line to
conduct a regex search on a .seg file (or an .ocr file if you want to
search on-screen text). For example, one could use the Edge Search
Engine to find the strings "a dog" or "the dog" or even EITHER "a dog"
OR "the dog." But let us take a trivial example of a search for
something other than words: If you wanted to find examples of a
determiner (a or an or the or these, etc.)
followed by a singular noun, you would need to know that determiners are
marked DT and that singular nouns are marked NN. You would need to
know that MBSP (POS_01) would tag "a dog" as |a/DT/I-NP/O/a|dog/NN/I-NP/O/dog|, or, somewhat easier in this case, that the Stanford Part-of-Speech tagger (POS_02) would tag "a dog" as |A/DT|DOG/NN|. So,
one would need to use regex to search for
'\|[a-zA-Z]+\/DT\|[a-zA-Z]+\/NN\|' if you wanted to find a determiner
followed by a singular noun. One could also use the equivalent regex
pattern '\|\w+\/DT\|\w+\/NN\|'. Using peck, the command would be peck seg '\|\w+\/DT\|\w+\/NN\|' POS_02
<output file>. This says, "go pecking through the seg files in a
directory for the following pattern: a pipe (\|) followed by a word
(\w+) followed by the tag for a determiner (\/DT) followed by a pipe
(\|) followed by a word (\w+) followed by the tag for a singular noun
(\/NN) followed by a pipe (\|); and do this in lines that have as
primary tag POS_02 (that means, tagged by the Stanford POS parser); and
put all the hits in the named output file. (By the way, please do not
run this search as practice; it will produce very many hits.) It
produces hits like the following example, which actually has two
stretches that match the pattern, AN/DT|SUV/NN and THE/DT|TRACKS?/NN:

2015-02-05_1200_US_KNBC_KNBC_Early_Today.seg|WHY WAS AN SUV STOPPED ON THE TRACKS?|https://tvnews.sscnet.ucla.edu/edge/video,ca43190e-ad32-11e4-ac58-089e01ba0326,22|20150205120022.388|20150205120025.190|POS_02|WHY/WRB|WAS/VBD|AN/DT|SUV/NN|STOPPED/VBD|ON/RP|THE/DT|TRACKS?/NN

Note
also that peck and our other command-line tools deliver
character-separated value files (where the separator is a pipe |), so
they can be imported directly into the statistical software package R
for analysis and graphic presentation.