Red Hen data format

Red Hen aims to facilitate collaborative work across different types of expertise, geographical locations, and time. To support this, we have developed a shared data format, specified below. To simplify interoperability and ensure we can scale, we currently rely on flat files rather than databases, and on metadata stored in time-stamped single lines rather than in multi-line hierarchical formats such as XML.

File names

The basic unit in the Red Hen dataset is a video file, typically a one-hour news program, though it could also be a one-minute campaign ad. A series of files is then created around this video file, for instance:

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.frm.json

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.img

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.jpg

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.json

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.mp4

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.ocr

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.seg

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.txt

The text files here are .ocr (on-screen text from optical character recognition), .seg (metadata from NLP), and .txt (closed captioning or teletext from the television transport stream).

Main file types

Caption text (txt) (extracted from the television transport stream)

Online transcript (tpt) (downloaded and mechanically aligned)

On-screen text (ocr) (created through optical character recognition)

Annotated text (seg) (automated and manual tags)

Thumbnails (img directory) (extracted at ten-second intervals)

Image montages (jpg) (assembled from thumbnails)

Video (mp4) (compressed and resized)

Text file data structure

The data in the text files is structured as follows:

A header with file-level information

A legend with information about the different modules that have been run on the file

The data section

Header block

So, for instance, in a .seg file we have these field names and values in the header (see also example .seg files):
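
For illustration, a header block might look like this. The field names shown -- TOP for the start timestamp and file name, COL for collection, UID for a unique file identifier, DUR for duration, VID for video resolutions, SRC for recording source, TTL for program title, LBT for local broadcast time -- are typical of Red Hen caption files, and the values here are invented:

TOP|20151020220000|2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier
COL|Communication Studies Archive, UCLA
UID|50a2b540-77a8-11e5-85da-089e01ba0326
DUR|0:59:00
VID|640x352|1920x1080
SRC|Research computing, UCLA
TTL|Special Report with Bret Baier
LBT|2015-10-20 15:00 America/Los_Angeles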

You see the syntax: each line starts with a primary tag, say SEG_02. The primary tags are numbered to allow different techniques to generate the same kind of data; for instance, we might imagine two methods of speaker diarization using DIA_01 and DIA_02. The next field is the date the module was run, in the form YYYY-MM-DD HH:MM. The field separator is the pipe symbol. Next are fields for Source_Program, Source_Person, and Codebook, with labels for each field in the annotation. It is critical that we have good information in the legends, so that the significance of each column in the main data section is systematically tracked.
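
A hypothetical legend line following this syntax (the program and person names are invented for illustration):

SEG_02|2015-10-21 05:37|Source_Program=seg-annotate.py|Source_Person=Jane Doe|Codebook=Type=Story start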

The full specification of the data structure of a particular annotation is provided in the Edge2 Search Engine Definitions. These definitions are dynamically read by the Edge2 search engine at startup.

Main body

Again in .seg files, the main body of data follows the legends, using the primary tags they define. Each line has an absolute start time and an end time in Universal Time, a primary tag, and some content:
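
For example (an illustrative line -- the timestamps and content are invented):

20151020220132.000|20151020220405.000|SEG_02|Type=Commercial

The first two fields are UTC timestamps of the form YYYYMMDDHHMMSS.mmm, marking where the annotation starts and ends in the video.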

In the Edge2 search engine, we leave out the position information and the invariant labels in the search interface, which leaves these fields to search under FRM_01 (a hypothetical example line follows the list):

Token

Frame Name

Semantic Role (one or more SRL fields)

Frame Element (one or more FE fields)
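
Purely for illustration -- the actual field layout is specified in the Edge2 Search Engine Definitions, and the values below are invented -- a FRM_01 line might look like this:

20151020220412.000|20151020220414.000|FRM_01|Token=rid|Frame=Removing|SRL=ARG1|FE=Theme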

Parts of speech tags

The Treebank II page at http://www.clips.ua.ac.be/pages/mbsp-tags lists the tags used by the MBSP (Memory-Based Shallow Parser) engine, which generates our POS_01 tags. Each word is encoded with the original form, a part-of-speech tag, two chunk tags with relation tags, and a lemma. So, for instance, the caption line "CC1|GET RID OF PREPAID PROBLEMS." is encoded like this:
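
A plausible reconstruction of that encoding (the exact tags shown are assumptions; consult the Treebank II page for the full tag set):

GET/VB/B-VP/O/get|RID/VBN/I-VP/O/rid|OF/IN/B-PP/B-PNP/of|PREPAID/JJ/B-NP/I-PNP/prepaid|PROBLEMS/NNS/I-NP/I-PNP/problem|././O/O/.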

Within each group, the original token comes first and the lemma comes last. The token is followed by the part-of-speech tag. The two remaining tags are so-called chunk tags that provide information about the word's syntactic relations -- its roles in the sentence or phrase.

This is entirely systematic -- each key word follows a pipe symbol, and there are always exactly four annotations for each word, including for the final period, which is treated as if it were its own word. Each annotation is separated by a forward slash. Empty chunk values are indicated by a capital O.

To make these searchable, we might give them names as follows:

POS_01 word

POS_01 part of speech

POS_01 relation 1

POS_01 relation 2

POS_01 lemma

This should make all of these entries searchable. The user would of course need to know the Codebook; the help screens and tutorial could refer them to the Treebank II reference page for MBSP.

POS_02 is much simpler, since it has just the first two fields -- the word and its part-of-speech tag.
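
For the same caption line, a POS_02 encoding might therefore look like this (the tags are again illustrative):

GET/VB|RID/VBN|OF/IN|PREPAID/JJ|PROBLEMS/NNS|./.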

Audio Pipeline Tags

Red Hen's audio pipeline automatically tags audio features in recordings. Red Hen's work on audio detection began during her Google Summer of Code 2015, when a number of open-source student coders, mentored by more senior Red Hens, tackled individual projects in audio detection. Their code was integrated into a single pipeline by Xu He in early 2016. Red Hen is grateful to the Alexander von Humboldt Foundation for the funding to employ Xu He during this period. The Audio Pipeline produces a line in the Credit Block (see above), but otherwise creates tags in the main body, as follows.

GEN and SPK are results aligned to the speaker boundaries given by the speaker diarization algorithm, whereas GENR and SPKR are results produced for fixed 5-second segments.

Log Likelihood=-21.5807774474

is the natural logarithm of the likelihood under a mixture of Gaussians, and it indicates how confident the algorithm is about its recognition result: the higher the likelihood, the more confident it is.
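
For illustration, a gender tag line carrying this field might look as follows (the tag name, numbering, and field labels are assumptions, not verified pipeline output):

20150703230004.000|20150703230012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474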

End tag

Finally, the last line of the file should have this kind of information, with an end timestamp:

END|20150703232959|2015-07-03_2300_US_WKYC_Channel_3_News_at_7

The end timestamp is derived from the start time plus the video duration. This is useful for running quick checks that the entire file was processed and not truncated.
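
As a minimal sketch, such a check could be scripted in Python. The helper below is hypothetical rather than part of the Red Hen toolchain, and it only verifies that the last line is a well-formed END tag matching the file name:

import sys
from pathlib import Path

def has_valid_end_tag(path):
    """Check that the file's last line is END|<YYYYMMDDHHMMSS>|<basename>."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if not lines:
        return False
    parts = lines[-1].split("|")
    return (len(parts) == 3
            and parts[0] == "END"
            and len(parts[1]) == 14
            and parts[1].isdigit()
            and parts[2] == Path(path).stem)

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "ok" if has_valid_end_tag(name) else "truncated or malformed")

A fuller check would also confirm that the END timestamp equals the header's start time plus the video duration.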

We need a very high level of consistency in the output files, since they need to be reliably parsed by our statistical tools and search engines.

Not implemented

Tags may be derived from the downloaded CNN transcript, integrated into the .tpt files.

ELAN eaf files

In the summer of 2017, Peter Uhrig at FAU Erlangen created some 300,000 .eaf files -- the file format used by ELAN -- for English-language files from 2007 through 2016. These files have now been added to the Red Hen dataset. They integrate the output of the Gentle forced aligner with Sergiy Turchyn's computer-vision-based gesture detection code. They contain precise timestamps for the beginning and end of each word -- in this case, the word "the":
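
A minimal sketch of how such a word-level alignment appears in ELAN's EAF XML -- the tier name, IDs, and time values here are illustrative, not taken from an actual Red Hen file:

<TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="4230"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="4390"/>
</TIME_ORDER>
<TIER LINGUISTIC_TYPE_REF="default-lt" TIER_ID="words">
    <ANNOTATION>
        <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
            <ANNOTATION_VALUE>the</ANNOTATION_VALUE>
        </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
</TIER>

Here TIME_VALUE is in milliseconds from the start of the recording, so this occurrence of "the" runs from 4.230 to 4.390 seconds.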