Project description

Concrete-python is the Python interface to Concrete, a
natural language processing data format and set of service protocols
that work across different operating systems and programming languages
via Apache Thrift. Concrete-python contains generated Python
classes, utility classes and functions, and scripts. It does not contain the
Thrift schema for Concrete, which can be found in the
Concrete GitHub repository.

This document provides a quick tutorial of concrete-python installation and
usage. For more information, including an API reference and development
information, please see the online documentation.

License

Copyright 2012-2019 Johns Hopkins University HLTCOE. All rights
reserved. This software is released under the 2-clause BSD license.
Please see LICENSE for more information.

Requirements

concrete-python is tested on Python 3.5 and requires the
Thrift Python library, among other Python libraries. These are
installed automatically by setup.py or pip. The Thrift
compiler is not required.

Note: The accelerated protocol offers a (de)serialization speedup
of 10x or more; if you would like to use it, ensure a C++ compiler is
available on your system before installing concrete-python.
(If a compiler is not available, concrete-python will fall back to the
unaccelerated protocol automatically.) If you are on Linux, a suitable
C++ compiler will be listed as g++ or gcc-c++ in your package
manager.

If you are using macOS Mojave with the Homebrew package manager
(https://brew.sh), you can install the accelerated protocol using
the script install-mojave-homebrew-accelerated-thrift.sh.

Basic usage

Here and in the following sections we make use of an example Concrete
Communication file included in the concrete-python source distribution.
The Communication type represents an article, book, post, Tweet, or
any other kind of document that we might want to store and analyze.
Copy it from tests/testdata/serif_dog-bites-man.concrete if you
have the concrete-python source distribution or download it
separately here: serif_dog-bites-man.concrete.

First we use the concrete-inspect.py tool (explained in more detail
in the following section) to inspect some of the contents of the
Communication:

concrete-inspect.py --text serif_dog-bites-man.concrete

This command prints the text of the Communication to the console. In
our case the text is a short article formatted in SGML:

Reading Concrete

There are even more annotations stored in this Communication, but for
now we move on to demonstrate handling of the Communication in Python.
The example file contains a single Communication, but many (if
not most) files contain several. The same code can be used to read
Communications in a regular file, tar archive, or zip
archive:

Here we used get_tokens, which abstracts the process of extracting
a sequence of Tokens from a Tokenization, and lun (short for
“list un-none”), which returns its argument or, if its argument is
None, an empty list. Many fields in Concrete are optional,
including Communication.sectionList and Section.sentenceList;
checking for None quickly becomes tedious.
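
The idea behind lun can be sketched in a few lines of plain Python. This
is an illustrative reimplementation, not the library's source: the helper
name lun_sketch and the SimpleNamespace stand-ins for Communication and
Section objects are hypothetical.

```python
from types import SimpleNamespace


def lun_sketch(lst):
    """Return lst unchanged, or an empty list if lst is None."""
    return [] if lst is None else lst


# Hypothetical stand-ins for Concrete objects whose optional list fields
# may be None; real code would use Communications read from disk.
section_with_sentences = SimpleNamespace(
    sentenceList=['sentence-0', 'sentence-1'])
section_without_sentences = SimpleNamespace(sentenceList=None)
comm_stub = SimpleNamespace(
    sectionList=[section_with_sentences, section_without_sentences])
empty_comm_stub = SimpleNamespace(sectionList=None)


def count_sentences(comm):
    """Count sentences without any explicit None checks."""
    return sum(
        len(lun_sketch(section.sentenceList))
        for section in lun_sketch(comm.sectionList)
    )


print(count_sentences(comm_stub))
print(count_sentences(empty_comm_stub))
```

Wrapping every optional list this way lets nested loops run unchanged
whether a field is populated, empty, or unset.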

In this Communication the tokens have been annotated with
part-of-speech tags, as we saw previously using
concrete-inspect.py. We can print them with the following code:

Here we used AnalyticUUIDGeneratorFactory, which creates generators of
Concrete UUID objects (see Working with UUIDs for more information).
We also used now_timestamp, which returns a Concrete timestamp representing
the current time. But now how do we know which tagging is ours? Each
annotation’s metadata contains a tool name, and we can use it to
distinguish between competing annotations:
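
One way to make that selection is to filter a Tokenization's taggings on
metadata.tool. The sketch below uses SimpleNamespace stand-ins rather than
real Concrete objects so that it runs without a Communication on disk. The
field names tokenTaggingList, taggingType, and metadata.tool are assumed to
mirror the Concrete schema, and the tool names and the taggings_by_tool
helper are invented for illustration.

```python
from types import SimpleNamespace


def make_tagging(tool, tagging_type, tags):
    """Build a stand-in for a TokenTagging (not a real Concrete object)."""
    return SimpleNamespace(
        metadata=SimpleNamespace(tool=tool),
        taggingType=tagging_type,
        taggedTokenList=[
            SimpleNamespace(tokenIndex=i, tag=tag)
            for i, tag in enumerate(tags)
        ],
    )


def taggings_by_tool(tokenization, tool, tagging_type='POS'):
    """Return the taggings of the given type produced by the given tool."""
    return [
        tagging
        for tagging in (tokenization.tokenTaggingList or [])
        if tagging.taggingType == tagging_type
        and tagging.metadata.tool == tool
    ]


# Two competing part-of-speech taggings on the same stand-in Tokenization.
tokenization = SimpleNamespace(tokenTaggingList=[
    make_tagging('Existing Tagger', 'POS', ['NN', 'VBZ', 'NN']),
    make_tagging('My Tagger', 'POS', ['N', 'V', 'N']),
])

ours = taggings_by_tool(tokenization, 'My Tagger')
our_tags = [tagged_token.tag for tagged_token in ours[0].taggedTokenList]
print(our_tags)
```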

Finally, we can write the modified Communication back to disk:

from concrete.util import CommunicationWriter

with CommunicationWriter('serif_dog-bites-man.concrete') as writer:
    writer.write(comm)

Note there are many other useful classes and functions in the
concrete.util library. See the API reference in the
online documentation for details.

concrete-inspect.py

Use concrete-inspect.py to quickly explore the contents of a
Communication from the command line. concrete-inspect.py and other
scripts are installed to the path along with the concrete-python
library.

--id

Run the following command to print the unique ID of our modified
example Communication:

concrete-inspect.py --id serif_dog-bites-man.concrete

Output:

tests/testdata/serif_dog-bites-man.xml

--metadata

Use --metadata to print the stored annotations along with their
tool names:

Other options

Use --ner, --pos, --lemmas, and --dependency (together
or independently) to show respective token-level information in a
CoNLL-like format, and use --text to print the text of the
Communication, as described in a previous section.

Run concrete-inspect.py --help to show a detailed help message
explaining the options discussed above and others. All
concrete-python scripts have such help messages.

create-comm.py

Use create-comm.py to generate a simple Communication from a text
file. For example, create a file called history-of-the-world.txt
containing the following text:

The dog ran .
The cat jumped .
The dolphin teleported .

Then run the following command to convert it to a Concrete
Communication, creating Sections, Sentences, and Tokens based on
whitespace:
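
As a rough picture of what whitespace-based segmentation means, the sketch
below splits text into sections on blank lines, sentences on newlines, and
tokens on runs of whitespace. These splitting rules are an assumption made
for illustration, and segment_text is a hypothetical helper, not part of
concrete-python; it returns nested lists rather than Concrete objects.

```python
import re


def segment_text(text):
    """Split text into sections, sentences, and tokens by whitespace."""
    sections = []
    # Assumed rule: one or more blank lines separate sections.
    for section_text in re.split(r'\n\s*\n', text.strip()):
        sentences = [
            line.split()  # tokens: separated by runs of whitespace
            for line in section_text.split('\n')  # sentences: one per line
            if line.strip()
        ]
        if sentences:
            sections.append(sentences)
    return sections


text = 'The dog ran .\nThe cat jumped .\nThe dolphin teleported .\n'
sections = segment_text(text)
for sentence in sections[0]:
    print(sentence)
```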

Other scripts

concrete-python provides a number of other scripts, including but not
limited to:

concrete2json.py

reads in a Concrete Communication and prints a
JSON version of the Communication to stdout. The JSON is “pretty
printed” with indentation and whitespace, which makes the JSON
easier to read and to use for diffs.

create-comm-tarball.py

like create-comm.py but for multiple files: reads in a tar.gz
archive of text files, parses them into sections and sentences based
on whitespace, and writes them back out as Concrete Communications
in another tar.gz archive.

fetch-client.py

connects to a FetchCommunicationService, retrieves one or more
Communications (as specified on the command line), and writes them
to disk.

fetch-server.py

implements FetchCommunicationService, serving Communications to
clients from a file or directory of Communications on disk.

search-client.py

connects to a SearchService, reading queries from the console and
printing out results as Communication ids in a loop.

validate-communication.py

reads in a Concrete Communication file and prints out information
about any invalid fields. This script is a command-line wrapper
around the functionality in the concrete.validate library.

Use the --help flag for details about each script's command-line
arguments.

Working with UUIDs

Each UUID object contains a single string,
uuidString, which can be used as a universally unique identifier for the
object the UUID is attached to. The AnalyticUUIDGeneratorFactory produces
UUID generators for a Communication, one for each analytic (tool) used to
process the Communication. In contrast to the Python uuid library, the
AnalyticUUIDGeneratorFactory yields UUIDs that have common prefixes within a
Communication and within annotations produced by the same analytic, enabling
common compression algorithms to store the UUIDs in each Communication much
more efficiently. See the AnalyticUUIDGeneratorFactory class in the API
reference in the online documentation for more information.
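
The effect of shared prefixes on compression can be checked with the
standard library alone. The sketch below is not the scheme used by
AnalyticUUIDGeneratorFactory; it assumes a simplified, hypothetical layout
in which every identifier shares its first 24 characters, and compares the
zlib-compressed size against fully random UUIDs.

```python
import uuid
import zlib

NUM_IDS = 1000

# Fully random identifiers, as produced by the Python uuid library.
random_ids = [str(uuid.uuid4()) for _ in range(NUM_IDS)]

# Hypothetical scheme for illustration only: every identifier shares the
# first 24 characters of one UUID and varies only in its last 12 hex digits.
shared_prefix = str(uuid.uuid4())[:24]
prefixed_ids = [shared_prefix + uuid.uuid4().hex[:12] for _ in range(NUM_IDS)]


def compressed_size(ids):
    """Return the zlib-compressed size in bytes of newline-joined ids."""
    return len(zlib.compress('\n'.join(ids).encode('ascii')))


random_size = compressed_size(random_ids)
prefixed_size = compressed_size(prefixed_ids)
print('random UUIDs:        ', random_size, 'bytes')
print('shared-prefix UUIDs: ', prefixed_size, 'bytes')
```

The exact sizes vary from run to run because the identifiers are random,
but the shared-prefix set should compress to a noticeably smaller size.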

Note that uuidString is generated by
a random process, so running the same code twice will result in two
completely different sets of identifiers. Concretely, if you run a parser to
produce a part-of-speech TokenTagging for each Tokenization in a
Communication, save the modified Communication, then run the parser again on
the same original Communication, you will get two different identifiers for
each TokenTagging, even though the contents of each pair of
TokenTaggings (the part-of-speech tags) may be identical.

Validating Concrete Communications

The Python version of the Thrift libraries does not perform any
validation of Thrift objects. You should use the
validate_communication() function after reading and before writing
a Concrete Communication:

Other Concrete tools will raise an exception if a required field is
missing on deserialization or serialization, and will raise an
exception if a “default required” field is missing on serialization.
By default, concrete-python does not perform any validation of Thrift
objects on serialization or deserialization. The Python Thrift classes
do provide shallow validate() methods, but they only check for
explicitly required fields (not “default required” fields) and do
not validate nested objects.
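
The limitation of shallow validation can be illustrated without Thrift. In
the sketch below, Outer and Inner are hypothetical stand-in classes (not
generated Concrete classes), and ValueError stands in for whatever
exception a real validate() method raises. A shallow check of Outer passes
even though a nested Inner object is missing a required field, while a
recursive check reports the problem.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Inner:
    """Stand-in for a nested struct with one required field."""
    id: Optional[str] = None

    def validate(self):
        if self.id is None:
            raise ValueError('Inner.id is required')


@dataclass
class Outer:
    """Stand-in for a top-level struct holding nested structs."""
    id: Optional[str] = None
    children: Optional[List[Inner]] = None

    def validate(self):
        # Shallow: checks only this object's own required field.
        if self.id is None:
            raise ValueError('Outer.id is required')


def validate_recursively(outer):
    """Validate outer and each nested child; return a list of messages."""
    errors = []
    for obj in [outer] + list(outer.children or []):
        try:
            obj.validate()
        except ValueError as ex:
            errors.append(str(ex))
    return errors


outer = Outer(id='outer-1', children=[Inner(id='inner-1'), Inner()])
outer.validate()  # passes: nested objects are not checked
errors = validate_recursively(outer)
print(errors)
```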