How to build GUM and contribute corrections

If you notice errors in the GUM corpus, you can contribute corrections by forking the repository, editing specific files and submitting a pull request. The GUM build bot script will then propagate the changes to the other relevant corpus formats and merge them.

The build bot is also used to reconstruct reddit data, merging all annotations after plain text data has been restored using _build/process_reddit.py.

GUM is distributed in a variety of formats which contain partially overlapping information. For example, almost all formats contain part of speech tags, which need to be in sync across formats. The GUM build script keeps them in sync by regenerating the derived formats from a small set of source files. As a result, it's important to know exactly where to correct what.

Overview

Of the many formats available in the GUM repo, only four are actually used to generate the dataset, with other formats being dynamically generated from these.
They are found under the directory _build/src/ in the sub-directories:

rst/ - rhetorical structure theory analyses in the rs3 format as used by rstWeb

All other formats are generated from these files and cannot be edited directly (changes will be overwritten by the next build). References to source directories below (e.g. xml/) always refer to these sub-directories (e.g. _build/src/xml/).

Committing your corrections to GitHub

Because multiple people can contribute corrections simultaneously, merging corrections is managed over GitHub.
To contribute corrections directly, you should always:

Fork the dev branch

Edit, commit and push to your branch

Make a pull request into the origin dev branch

Alternatively, if you only have a few minor corrections, feel free to open an issue in our GitHub tracker and describe the requested changes as precisely as possible.

Correcting token strings

Token strings come from the first column of the files in xml/. These should normally not be changed. Changing token strings in any other format has no effect (the changes will be overwritten, or will cause a conflict that crashes the build).
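
To inspect the token strings a file in xml/ currently holds, a minimal sketch along the following lines may help. It assumes a tab-separated layout with one token per line and markup lines that start with "<". The function name read_tokens and the sample text are hypothetical and are not part of the build bot.

```python
def read_tokens(xml_text):
    """Return the first-column token strings from tab-separated token lines.

    Assumption: markup lines start with '<', and token lines are
    tab-separated with the token string in the first column.
    """
    tokens = []
    for line in xml_text.splitlines():
        line = line.strip()
        if not line or line.startswith("<"):
            continue  # skip blank lines and markup
        tokens.append(line.split("\t")[0])
    return tokens


# Hypothetical two-token sentence, for illustration only
sample = "<s>\nThe\tDT\tthe\ncats\tNNS\tcat\n</s>\n"
print(read_tokens(sample))
```

Comparing this list with the tokens found in another format is a quick way to spot a mismatch before running a full build.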

Correcting POS tags and lemmas

GUM contains lemmas and three types of POS tags for every token:

'Vanilla' PTB tags following Santorini (1990)

Extended PTB tags as used by TreeTagger (Schmid 1994)

CLAWS5 tags as used in the BNC

You can correct lemmas and extended PTB tags in the xml/ directory. Vanilla PTB tags are produced fully automatically from the extended tags and should not be corrected. Correct the extended tags instead.
CLAWS tags are produced by an automatic tagger, but are post-processed to remove errors based on the gold extended PTB tags. As a result, most CLAWS errors can be fixed by correcting the extended PTB tags. Direct corrections to CLAWS tags are likely to be overwritten by the build script. If you find a CLAWS error despite a correct PTB tag, please let us know so we can improve post-processing.
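
To illustrate why vanilla tags never need manual correction, here is a rough sketch of how an extended tag could be collapsed into a vanilla PTB tag. The correspondences are assumptions based on the TreeTagger English tagset, and the function name to_vanilla is hypothetical. The build bot's actual conversion may differ.

```python
import re

# Assumed correspondences between TreeTagger-style extended tags and
# vanilla PTB tags; illustrative only, not the build bot's actual table.
SPECIAL = {"NP": "NNP", "NPS": "NNPS", "PP": "PRP", "PP$": "PRP$", "SENT": "."}


def to_vanilla(extended_tag):
    """Map one extended tag to an (assumed) vanilla PTB equivalent."""
    if extended_tag in SPECIAL:
        return SPECIAL[extended_tag]
    # Collapse verb subclasses: VV* (lexical) and VH* (have) become VB*
    match = re.fullmatch(r"V[VH]([DGNPZ]?)", extended_tag)
    if match:
        return "VB" + match.group(1)
    # All other tags are assumed to be shared by both tag sets
    return extended_tag


for tag in ["VVD", "VHZ", "NP", "NN"]:
    print(tag, "->", to_vanilla(tag))
```

Because the mapping only runs one way, a correction made to an extended tag reaches the vanilla tag on the next build, while the reverse is not possible.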

Correcting TEI tags in xml/

The XML tags in the xml/ directory are based on the TEI vocabulary. Although the schema for GUM is much simpler than TEI, some nesting restrictions as well as naming conventions apply.
Corrections to XML tags can be submitted, but please make sure that the corrected file validates against the XSD schema in the _build directory. Corrections that don't validate will fail to merge.
If you feel the schema should be updated to accommodate some correction, please let us know.

You can't alter Universal Dependencies data in dep/ud/, since it is automatically generated from the Stanford Dependencies. If changes to Stanford Dependencies do not propagate as expected to UD data, please contact us.

const/ - Constituent trees

Constituent trees in const/ are generated automatically based on the tokenization, POS tags and sentence breaks from the XML files, and cannot be corrected manually at present. Note that token-less data for reddit documents is included in the release under target/const/ for convenience. This data can be used to restore reddit constituent parses using _build/process_reddit.py without having to re-run the Stanford Parser.

coref/ - Coreference and entities

Coreference and entity annotations are available in several formats, but all information is projected from the tsv/ directory.

rst/ - Rhetorical Structure Theory

The rst/ directory contains Rhetorical Structure Theory analyses in the rs3 format. You can edit the rhetorical relations and even make finer grained segments, but:

You cannot edit the tokenization, which is expressed by spaces inside each segment but is ultimately generated from the XML files in xml/

By convention, you are not allowed to make segments that contain multiple sentences according to the <s> elements in the XML files
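
The two restrictions above can be checked before submitting an edited rs3 file. The following is a minimal sketch, assuming the sentences and segments have already been extracted into plain Python lists. The function name check_segments and its input shapes are hypothetical, and no rs3 or XML parsing is shown.

```python
def check_segments(sentences, segments):
    """Return a list of problems found; an empty list means both rules hold.

    sentences: list of token lists, one per <s> element in the XML file
    segments:  list of strings, each a segment with space-separated tokens
    """
    problems = []
    flat = [tok for sent in sentences for tok in sent]
    seg_tokens = [seg.split(" ") for seg in segments]
    if [tok for seg in seg_tokens for tok in seg] != flat:
        problems.append("tokenization differs from the XML tokens")
        return problems  # offsets below would be meaningless

    # Token offsets at which each sentence ends
    sentence_ends = set()
    position = 0
    for sent in sentences:
        position += len(sent)
        sentence_ends.add(position)

    start = 0
    for index, seg in enumerate(seg_tokens):
        end = start + len(seg)
        # A boundary strictly inside the segment means it spans two sentences
        if any(start < b < end for b in sentence_ends):
            problems.append(f"segment {index} crosses a sentence boundary")
        start = end
    return problems


sents = [["I", "came", "."], ["I", "saw", "."]]
print(check_segments(sents, ["I came .", "I saw ."]))
print(check_segments(sents, ["I came . I", "saw ."]))
```

Splitting a sentence into several segments passes this check, which matches the rule that finer-grained segments are allowed.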

Overview

The build script in _build is run like this:

> python build_gum.py [-s SOURCE_DIR] [-t TARGET_DIR] [OPTIONS]

Source and target directories default to _build/src and _build/target if not supplied. Parsing and CLAWS re-tagging are optional if those data sources are already available and no POS tags, sentence borders or token strings have changed. See below for more option settings.
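
If you would rather drive the build from another Python script, the command line above can be assembled as in this sketch. It uses only the -s and -t options shown in the usage line. The helper name build_command and the example directory names are hypothetical.

```python
def build_command(source_dir=None, target_dir=None):
    """Assemble the argument list for build_gum.py.

    Only -s and -t are used, as they appear in the usage line; omitted
    directories are left to the script's own defaults.
    """
    command = ["python", "build_gum.py"]
    if source_dir is not None:
        command += ["-s", source_dir]
    if target_dir is not None:
        command += ["-t", target_dir]
    return command


# Hypothetical directory names, for illustration only
print(" ".join(build_command("my_src", "my_target")))
```

The resulting list can be passed to subprocess.run with cwd set to the _build directory. Nothing is executed in the sketch itself.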