Generic Features Library

This document describes the generic features library that is available as part
of ddlib, the utility library included with DeepDive (under
$DEEPDIVE_HOME/ddlib/ddlib).

By "generic features" we denote a set of features that are not application- or
domain-dependent and can be used to obtain good baseline quality for mention and
relation extraction. Feature engineering is one of the most time-consuming
operations in Knowledge Base Construction (KBC), and it is often difficult to
start building features from scratch. The goal of the generic features library
is to allow users of DeepDive who are not KBC experts to get their application
off the ground with good starting quality.

The generic features library builds its features from the Natural Language
Processing (NLP) annotations of the sentences in the corpus (Part of Speech
tags, Named Entity Recognition tags, dependency paths, ...). Examples of
features include: the dependency path between the two mentions composing a
relation mention, the Named Entity Recognition tags of the words composing a
mention, the dependency path between a mention and a keyword from a
user-specified dictionary, and many others. See below for the complete list.

The user of the library can optionally specify one or more dictionaries. These
are sets of words that the user believes are relevant for the correct
classification of mentions and relations, and are often
domain-/application-specific. The generic features library uses the dictionaries
to create additional features, allowing the inclusion of domain-knowledge in the
set of features. More details about dictionaries and their use in the library
are in the Using Dictionaries section below.

List of Generic Features

The generic features library creates one set of features for mentions and a
different set for relations, because the two kinds of objects differ in nature
and different features are relevant to each.

There are various "classes" of generic features, which can be distinguished by
their prefix.

The list of generic features for a mention is the following:

The set of Part of Speech tag(s) of the word(s) composing the mention (prefix:
POS_SEQ);

The set of Named Entity Recognition tag(s) of the word(s) composing the
mention (NER_SEQ);

The set of lemmas of the word(s) composing the mention (LEMMA_SEQ);

The set of word(s) composing the mention (WORD_SEQ);

The (sum of the) length(s) of the word(s) composing the mention (LENGTH);

A feature denoting whether the first word of the mention starts with a capital
letter (STARTS_WITH_CAPITAL);

The lemmas and the NER tags in a window of size up to 3 around the mention,
both on the left and on the right of the mention. These are also combined
(i.e., a window on the left and a window on the right are merged into a
single feature), to give a total of (up to) 15 features (3 on left, 3 on
right, 3 times 3 combinations of left and right) with lemmas, and 15 for
NERs (W);

Features denoting whether the mention (or a substring of it of length up to 3)
appears in a user-specified dictionary (IN_DICT);

Features indicating whether the sentence containing the mention also contains
some keyword that appears in a user-specified dictionary (KW_IND);

The shortest dependency path(s) between the mention and the keyword(s) from
user-specified dictionaries that appear in the sentence. Multiple variants
of the dependency path are used as features (edge labels and lemmas, edge
labels only, edge labels and lemmas replaced with the dictionary identifier
if the lemma is in a dictionary) (KW).
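The count of 15 window features in the list above can be checked with a small sketch. This is only an illustration of the counting, not ddlib's implementation: the function name window_features and the W_LEFT/W_RIGHT string layout are hypothetical, and only the W prefix comes from the list above.

```python
from itertools import product


def window_features(tokens, start, end, max_size=3):
    """Yield left, right, and combined window features around tokens[start:end].

    The W_LEFT/W_RIGHT labels and the string layout are hypothetical,
    chosen for illustration; they are not ddlib's actual output format.
    """
    # Left windows: the 1, 2, ..., max_size tokens just before the mention.
    lefts = [(n, " ".join(tokens[start - n:start]))
             for n in range(1, max_size + 1) if start - n >= 0]
    # Right windows: the 1, 2, ..., max_size tokens just after the mention.
    rights = [(n, " ".join(tokens[end:end + n]))
              for n in range(1, max_size + 1) if end + n <= len(tokens)]
    for n, text in lefts:
        yield "W_LEFT_{}_[{}]".format(n, text)
    for n, text in rights:
        yield "W_RIGHT_{}_[{}]".format(n, text)
    # Every left window paired with every right window.
    for (ln, ltext), (rn, rtext) in product(lefts, rights):
        yield "W_LEFT_{}_RIGHT_{}_[{}]_[{}]".format(ln, rn, ltext, rtext)


lemmas = ["the", "famous", "physicist", "Albert", "Einstein",
          "be", "bear", "in", "Ulm"]
# The mention "Albert Einstein" covers tokens 3 and 4.
features = list(window_features(lemmas, 3, 5))
print(len(features))  # 3 left + 3 right + 3 * 3 combined
```

A mention close to the start or end of the sentence has fewer windows available, which is why the list above says "up to" 15.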

The list of generic features for a relation is the following (the prefixes are
the same as the ones for the mentions, except where otherwise specified):

The set of Part of Speech tags for the words between the mentions in the
relation;

The set of Named Entity Recognition tags for the words between the mentions
in the relation;

The set of lemmas of the words between the mentions in the relation;

The set of words between the mentions in the relation;

The sum of the lengths of the words in the mentions;

Indicator feature for whether the mentions start with a capital letter;

The n-grams of size up to 3 of the lemmas and the NER tags of the words
between the mentions in the relation (prefix: NGRAM);

The lemmas and the NER tags in a window of size up to 3 around the mentions
composing the relation. Here the windows are used only in combination (i.e.,
a left window and a right window are merged into a single feature), giving a
total of (up to) 9 features for the lemmas, and 9 for the NER tags;

Features denoting whether the mentions in the relation (or substrings of them
of size up to 3) appear in some user-specified dictionaries;

Indicator features denoting whether the sentence containing the relation also
contains keywords appearing in user-specified dictionaries;

The shortest dependency paths between the mentions and keywords in
user-specified dictionaries that are in the sentence. Each feature is
composed of both dependency paths, one from each mention to the keyword.
Multiple variants of the paths are used, as in the mention case;

If the two mentions composing a relation are 'inverted' with respect to a
canonical order defined by the user, a prefix indicating this fact is prepended
to all the generic features.
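The n-gram features in the list above (sizes 1 to 3 over the words between the two mentions) can be sketched as follows. This is a minimal illustration, not ddlib's code: the function name ngram_features is hypothetical, and joining the tokens of an n-gram with a space is an assumption. Only the NGRAM prefix and the NGRAM_1_[word] shape come from this document.

```python
def ngram_features(tokens, max_n=3):
    """Yield one feature string per n-gram of size 1..max_n.

    Hypothetical helper: joining the tokens with a space is an assumption
    made for this sketch, not ddlib's documented format.
    """
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the sequence.
        for i in range(len(tokens) - n + 1):
            yield "NGRAM_{}_[{}]".format(n, " ".join(tokens[i:i + n]))


# Lemmas of the words between two mentions, e.g. "X is married to Y".
between = ["be", "marry", "to"]
ngrams = list(ngram_features(between))
for feature in ngrams:
    print(feature)
```

Three tokens give 3 unigrams, 2 bigrams and 1 trigram. A token that occurs twice in the span produces the same unigram feature twice.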

Using the generic feature library

In order to use the "generic features" functionality, the user must import
ddlib in her Python extractor:

import ddlib

For the import to succeed, $DEEPDIVE_HOME/ddlib/ddlib must appear in the user's
PYTHONPATH environment variable.

Loading dictionaries

As explained in the introduction of this document, the user may optionally
specify one or more dictionaries of keywords that are used to create generic
features and can be seen as a way to incorporate domain-/application-specific
knowledge to the set of generic features.

A dictionary is treated as a set of keywords that are all mapped to the same
dictionary identifier. In some features, keywords are replaced by their
dictionary identifier, which reduces sparsity. In practice, a dictionary is a
plain text file containing one keyword per line:

keyword1
keyword2
keyword3
...

Note that a keyword can consist of multiple words.

The user may load a dictionary by calling the ddlib.load_dictionary function.

The dict_id parameter is optional and allows the user to specify the
dictionary identifier. If it is not specified, the system uses an incremental
positive integer as the identifier. Multiple dictionaries can be loaded through
multiple calls, and they will all be used in the generic features.
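To make the file format and the identifier rule concrete, here is a self-contained sketch of the behavior described above. It is a stand-in written for illustration only: read_keyword_file and keyword_to_dict are hypothetical names, and the code shows neither ddlib's implementation nor the call signature of ddlib.load_dictionary.

```python
import itertools
import os
import tempfile

_next_id = itertools.count(1)  # source of incremental positive integers
keyword_to_dict = {}           # keyword -> dictionary identifier


def read_keyword_file(path, dict_id=None):
    """Hypothetical stand-in: map every keyword in the file to one identifier."""
    if dict_id is None:
        dict_id = next(_next_id)
    with open(path) as handle:
        for line in handle:
            keyword = line.strip()
            if keyword:  # ignore blank lines
                keyword_to_dict[keyword] = dict_id
    return dict_id


def write_temp_dictionary(keywords):
    """Write one keyword per line to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("\n".join(keywords) + "\n")
    return tmp.name


spouse_path = write_temp_dictionary(["wife", "husband", "married to"])
family_path = write_temp_dictionary(["brother", "sister"])

spouse_id = read_keyword_file(spouse_path, dict_id="spouse")  # explicit id
family_id = read_keyword_file(family_path)                    # automatic id

os.remove(spouse_path)
os.remove(family_path)
print(keyword_to_dict)
```

Because every keyword in a file shares one identifier, a feature that uses the identifier instead of the keyword is the same for "wife" and "husband", which is the sparsity reduction mentioned above.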

Generating the features

The library represents features as strings.

To obtain the generic features for a mention, the library provides the generator
ddlib.get_generic_features_mention, which can be used as follows:

import ddlib
...
for feature in ddlib.get_generic_features_mention(sentence, span):
    # do something with the feature, e.g., print it
    print(feature)

The first parameter, sentence, is an ordered list of ddlib.Word objects, where
each object represents a word in the sentence and the list follows the order of
the words in the sentence. The second parameter, span, is a ddlib.Span object
representing the text span corresponding to the mention. Consult the Pydoc
documentation (and the code) for ddlib for more information about these objects
and how to generate them (especially the get_sentence and get_span functions).

For relations, the user can obtain the generic features using the generator
ddlib.get_generic_features_relation as follows:

import ddlib
...
for feature in ddlib.get_generic_features_relation(sentence, span1, span2):
    # do something with the feature, e.g., print it
    print(feature)

The parameters are, respectively, an ordered list of ddlib.Word objects and the
two ddlib.Span objects representing the mentions composing the relation.

Note that ddlib.get_generic_features_mention and
ddlib.get_generic_features_relation are Python generators, so they should be
used in a loop.

Moreover, the generators may yield multiple copies of the same feature (e.g., if
a word appears twice between two mentions in a relation, the feature
NGRAM_1_[word] will be generated twice). It is the user's responsibility to
filter out duplicated features if needed.
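One simple way to drop duplicates is to wrap the generator in a filter that remembers which features it has already seen. The sketch below is an assumption-labeled example: unique is a hypothetical helper, and fake_feature_generator is a stand-in for a ddlib generator so that the code runs without ddlib installed.

```python
def unique(features):
    """Yield each feature once, in the order it first appears."""
    seen = set()
    for feature in features:
        if feature not in seen:
            seen.add(feature)
            yield feature


def fake_feature_generator():
    # Stand-in for a ddlib generator; it repeats one feature on purpose.
    yield "NGRAM_1_[the]"
    yield "NGRAM_1_[wife]"
    yield "NGRAM_1_[the]"


deduplicated = list(unique(fake_feature_generator()))
print(deduplicated)
```

In an extractor, the same wrapper could be applied to the real generators, for example unique(ddlib.get_generic_features_relation(sentence, span1, span2)). Features are strings, so they can be stored in a set directly.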