The project started 2019-04-13. The first release of Deep UD is planned before Syntaxfest 2019 (to be held in the last week of August in Paris). It will be numbered 2.4, referring to the release of Universal Dependencies it is based on, i.e., UD 2.4. In the future we plan to release a new version of Deep UD after each release of UD. At present, Deep UD is a result of automatic processing of data released in UD. Future releases of Deep UD may contain additional deep-syntactic, lexical and semantic annotations and resources.

Deep UD data will be released through the LINDAT/CLARIN repository, same as UD itself. Deep UD 2.4 will be accessible via the persistent URL http://hdl.handle.net/11234/1-3022.

Documentation

A Deep UD release contains all treebanks of the corresponding UD release except those that:

are not distributed together with the underlying text. For copyright reasons, several UD treebanks are distributed as hollow annotations and the user must obtain the underlying text corpus through other channels. In UD 2.4 this applies to the following six treebanks: Arabic NYUAD, English ESL, French FTB, Hindi-English HIENCS, Japanese BCCWJ, Mbya Guarani Dooley.

have incomplete or missing lemmatization. This applies to an additional 19 treebanks of UD 2.4. Treebanks are not excluded if all words have lemmas but the lemmas were predicted by a stochastic model rather than assigned by a human annotator.

Consequently, Deep UD 2.4 contains 121 treebanks of 73 languages.

All treebanks in Deep UD have enhanced graphs. In contrast, only some treebanks have enhanced graphs in UD (24 treebanks of 16 languages in UD 2.4), and if they do, they typically do not apply all five enhancement types that are defined in the guidelines (only 6 treebanks of 3 languages, viz. Dutch, English, and Swedish, have all five enhancement types); see also Table 1 in our Syntaxfest paper. We call the enhanced dependencies in the main UD release trusted annotations (they may or may not be manually checked but they are overseen by people directly responsible for the given treebank). In Deep UD 2.4, we copy the trusted enhanced graphs of the six treebanks where all five enhancements are present. For the other 18 treebanks we discard the trusted annotation and proceed as if there were no enhancements (in the future, we plan to make use of the trusted enhancements). We then generate enhanced graphs from the basic UD trees using the Stanford Enhancer (available as a part of the Stanford CoreNLP toolkit; TO DO: document how exactly we invoke the Stanford Enhancer).

While the enhanced graphs can be encoded in the standard CoNLL-U file format, we also add other annotations for which we use the extended CoNLL-U-Plus format. At present there are two extra columns: DEEP:PRED and DEEP:ARGS. These columns have non-empty values for tokens that are recognized as predicates; in release 2.4 we restrict ourselves to verbs and participles. The value of DEEP:PRED lexically identifies a predicate; often but not always it corresponds to the LEMMA of the token. The value of DEEP:ARGS contains links to the arguments of the predicate, if they correspond to one or more tokens of the same sentence.

Arguments are numbered arg1, arg2 etc. Diathesis is normalized; therefore arg1 corresponds to the subject in an active clause and to the oblique agent in a passive clause, while arg2 typically corresponds to the object in an active clause and to the subject in a passive clause. Our numbered arguments can be compared to the “canonical” subject and object of Candito et al. (2017); we prefer to avoid the terms subject and object in this context, as they are normally associated with certain syntactic rules that do not necessarily hold for arguments found in non-canonical positions.

Each numbered argument is followed by a numeric reference to the token that serves as the head of the phrase representing the argument. If the argument is represented by a coordination, there are links to all participating conjuncts. For example, arg1:33|arg2:12,27 means that argument 1 (possibly the agent) is headed by node 33, while argument 2 (possibly the patient) is a coordination whose conjuncts are headed by nodes 12 and 27, respectively.
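The DEEP:ARGS notation described above can be unpacked with a few lines of Python. This is only a minimal sketch and not part of the Deep UD tooling: the function name parse_deep_args is hypothetical, and it assumes that an empty column is written as an underscore, following the usual CoNLL-U convention.

```python
def parse_deep_args(value):
    """Parse a DEEP:ARGS value such as 'arg1:33|arg2:12,27'.

    Returns a dict that maps each argument label to the list of token IDs
    heading that argument (several IDs if the argument is a coordination).
    Assumption: an empty column is written as '_' as elsewhere in CoNLL-U.
    """
    if value in ("", "_"):
        return {}
    args = {}
    for item in value.split("|"):
        label, heads = item.split(":", 1)
        # IDs are kept as strings so that empty-node IDs such as '5.1' survive.
        args[label] = heads.split(",")
    return args


print(parse_deep_args("arg1:33|arg2:12,27"))
# {'arg1': ['33'], 'arg2': ['12', '27']}
```

Keeping the IDs as strings rather than integers is a design choice of this sketch; it avoids losing information if an argument happens to be headed by an empty node.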

The motivation for the predicate-argument annotation is to answer the main question of natural language understanding: “Who did what to whom?” It should hold for all occurrences of a predicate (verb sense) that the argument with a particular number corresponds to the same semantic role. However, the actual label of the role depends on the predicate and on the language. We do not say anything about the roles in release 2.4 but we plan on linking the frames to existing valency dictionaries in the future, in languages where such resources are available.

Note that thanks to grammatical coreference (Zikánová et al., 2015), a token may serve as an argument of multiple predicates in the sentence. For example, in “I saw the boy who was injured in the accident,” the boy is arg2 of saw and arg2 of injured, while it is neither subject nor object of injured. (Nevertheless, the UD v2 guidelines define an enhanced relation labeled nsubj:pass and going from injured to boy, so the enhanced graph will contain this relation. This relation should not be interpreted as “boy is the subject of injured.” It should rather be interpreted as “boy refers to the same entity as the subject of injured.”)

On the other hand, an argument will not be mentioned in the DEEP:ARGS column if there is no token in the current sentence that represents the argument. This can happen if the argument is unexpressed (which happens frequently to subjects in pro-drop languages) or if the token representing the argument appears in a different sentence.

Enhanced Graphs

Gapping and Stripping

Enhanced graphs should contain additional nodes (called “empty nodes” or “null nodes”) to represent elided copies of predicates in coordination. This is the only situation in which the UD v2 guidelines license an empty node, hence an occurrence of an empty node directly corresponds to this type of enhancement. 58 treebanks of 121 in Deep UD 2.4 have at least one empty node. Gapping is a rare phenomenon and some languages may not permit it at all, yet it is likely that there are other treebanks where gapping exists but has not been properly identified. The Stanford Enhancer relies on the orphan relation in the basic UD annotation, and some UD treebanks currently do not use the relation properly. See also the statistics of empty nodes in Deep UD treebanks.
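In the CoNLL-U format, an empty node is a row whose ID contains a decimal point (for example 5.1), so counting such rows is one way to check whether a treebank contains this enhancement. The following is a minimal sketch; the function name count_empty_nodes is hypothetical, and the sample sentence is hand-built for illustration rather than taken from Deep UD data.

```python
def count_empty_nodes(conllu_text):
    """Count rows whose ID column contains a decimal point (empty nodes)."""
    count = 0
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if "." in token_id:
            count += 1
    return count


# Hand-built example of gapping: "Mary won gold and Peter bronze."
rows = [
    ["1", "Mary", "Mary", "PROPN", "_", "_", "2", "nsubj", "2:nsubj", "_"],
    ["2", "won", "win", "VERB", "_", "_", "0", "root", "0:root", "_"],
    ["3", "gold", "gold", "NOUN", "_", "_", "2", "obj", "2:obj", "_"],
    ["4", "and", "and", "CCONJ", "_", "_", "5", "cc", "5.1:cc", "_"],
    ["5", "Peter", "Peter", "PROPN", "_", "_", "2", "conj", "5.1:nsubj", "_"],
    ["5.1", "won", "win", "VERB", "_", "_", "_", "_", "2:conj:and", "_"],
    ["6", "bronze", "bronze", "NOUN", "_", "_", "5", "orphan", "5.1:obj", "_"],
    ["7", ".", ".", "PUNCT", "_", "_", "2", "punct", "2:punct", "_"],
]
sample = "\n".join("\t".join(row) for row in rows) + "\n"

print(count_empty_nodes(sample))  # 1
```

Only the first column is inspected, so a punctuation token whose form is a period does not produce a false match, and multiword-token ranges such as 1-2 are ignored as well.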

Coordination

For every coordination structure, the edge incoming to the first conjunct is copied to all other conjuncts, providing a direct link between the parent of the coordination and each conjunct. Similar edge propagation occurs with outgoing edges but only those leading to dependents that are shared among the conjuncts (as opposed to private dependents of the first conjunct). 116 treebanks of 121 in Deep UD 2.4 have at least one example of a parent shared among conjuncts. The Tagalog and Warlpiri datasets are too small, with no examples of coordination. In Japanese, the semantic equivalent of coordination is analyzed as subordination, hence none of the three Japanese treebanks contains shared parents. Finally, in Galician CTG, there are only four instances of the conj relation, although there are 4266 coordinating conjunctions (CCONJ). This is in sharp contrast to the other Galician treebank and to most other languages. Probably due to a conversion error, conjuncts are connected using other relations, such as nmod. See also the statistics of enhanced annotation of coordination in Deep UD treebanks.
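The propagation of the incoming edge can be illustrated with a short sketch. It is not the Stanford Enhancer's implementation: the function name and the (id, head, deprel) input format are assumptions made for this example, and the sketch covers only the shared parent. Shared dependents are left out because deciding whether a dependent is shared or private cannot be read off the basic tree alone.

```python
def propagate_parent_to_conjuncts(tokens):
    """Copy the incoming edge of the first conjunct to the other conjuncts.

    tokens: list of (id, head, deprel) triples describing a basic UD tree.
    Returns a dict mapping each token ID to its set of (head, deprel) edges.
    """
    basic = {tid: (head, rel) for tid, head, rel in tokens}
    graph = {tid: {(head, rel)} for tid, head, rel in tokens}
    for tid, head, rel in tokens:
        # In UD, every non-first conjunct is attached to the first conjunct.
        if rel.split(":")[0] == "conj":
            graph[tid].add(basic[head])
    return graph


# "He reads books and newspapers"
tree = [
    (1, 2, "nsubj"),
    (2, 0, "root"),
    (3, 2, "obj"),
    (4, 5, "cc"),
    (5, 3, "conj"),
]
graph = propagate_parent_to_conjuncts(tree)
print(sorted(graph[5]))  # [(2, 'obj'), (3, 'conj')]
```

In the example, newspapers keeps its conj edge from books and additionally receives the obj edge from reads, which is the direct link between the parent of the coordination and the second conjunct.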

External Subjects

Certain types of non-finite, ‘open’ clausal complements inherit their subject from the subject, object, or oblique argument of the matrix clause. Example: “Susan wants to buy a book.” In the basic tree, Susan will be attached as nsubj of wants, while there will be no subject dependent of buy. In contrast, the enhanced graph will have an additional nsubj relation between buy and Susan. 115 treebanks of 121 in Deep UD 2.4 have at least one example of an inherited external subject. It is missing from all three Japanese corpora, from Korean GSD (but not Korean Kaist), from Tagalog TRG, and from Turkish IMST (but not Turkish GB). These treebanks have no or very few occurrences of xcomp. Note that not all occurrences of xcomp must lead to an external subject: in pro-drop languages, the subject may be unexpressed. See also the statistics of external subjects of open complements in Deep UD treebanks.

Relative Clauses

The noun modified by a relative clause plays a semantic role in the frame of the subordinate predicate. In the basic UD tree, it is represented by a relative pronoun. In the enhanced graph, however, the noun itself is linked from the subordinate predicate instead of the pronoun; the pronoun is detached from the predicate and attached to the noun it represents via a special relation ref. Only 54 treebanks of 121 in Deep UD 2.4 contain a ref relation in their enhanced graphs. The likely reason is that the Stanford Enhancer relies on the optional relation subtype acl:relcl to recognize relative clauses, but many treebanks use acl instead. It should be possible to improve the recognition of relative clauses in the future. See also the statistics of enhanced relative clauses in Deep UD treebanks.

Case Information

The labels of certain dependency relations in the enhanced graphs are augmented with case information (adposition and/or morphological feature). 120 treebanks of 121 in Deep UD 2.4 contain case-augmented relations. The exception is the Kyoto corpus of Classical Chinese. It is currently unclear why the Stanford Enhancer fails to augment oblique relations in this corpus. There are hundreds of occurrences of obl dependents that have an adposition among their children (the most frequent adpositions in this context are 於, 與, and 為). See also the statistics of case-augmented relations in Deep UD treebanks.
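As an illustration of what a case-augmented label looks like, the sketch below appends the lemma of an adposition to the obl or nmod relation of its head, turning obl into obl:in. It is a simplification under stated assumptions and not the Stanford Enhancer's actual logic: the function name and the dictionary-based token format are hypothetical, morphological case features are ignored, and joining several case markers with an underscore is a choice made for this example only.

```python
def augment_with_case(tokens):
    """Return a dict mapping token IDs to (possibly case-augmented) labels.

    tokens: list of dicts with the keys 'id', 'lemma', 'head' and 'deprel'.
    Only obl and nmod dependents are augmented in this sketch.
    """
    # Collect the lemmas of all 'case' children for each head.
    markers = {}
    for tok in tokens:
        if tok["deprel"] == "case":
            markers.setdefault(tok["head"], []).append(tok["lemma"].lower())

    labels = {}
    for tok in tokens:
        rel = tok["deprel"]
        if rel.split(":")[0] in ("obl", "nmod") and tok["id"] in markers:
            rel = rel + ":" + "_".join(markers[tok["id"]])
        labels[tok["id"]] = rel
    return labels


# "injured in the accident"
phrase = [
    {"id": 1, "lemma": "injure", "head": 0, "deprel": "root"},
    {"id": 2, "lemma": "in", "head": 4, "deprel": "case"},
    {"id": 3, "lemma": "the", "head": 4, "deprel": "det"},
    {"id": 4, "lemma": "accident", "head": 1, "deprel": "obl"},
]
labels = augment_with_case(phrase)
print(labels[4])  # obl:in
```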

Predicate-Argument Structure

Predicates

Only verbal predicates are processed and annotated in Deep UD version 2.4. By default, a predicate is identified by its lemma, which is copied from the LEMMA column to the new column DEEP:PRED. However, phrasal verbs in Germanic languages are presented together with the verbal particle; hence English come on is a separate predicate, distinct from come (note that the string in DEEP:PRED may contain spaces). Other types of compound verbs are treated similarly, and so are so-called pronominal (or inherently reflexive) verbs. See also the statistics of predicates in Deep UD treebanks.

Diathesis Normalization

At present we only distinguish active and passive clauses. We postpone normalization of other valency-changing operations to future work. A clause is considered passive if at least one of the children of the predicate is attached via a relation that has the :pass subtype; otherwise the clause is considered active. Such a subtype occurs with passive subjects, passive auxiliaries, and reflexive markers in reflexive passive/middle constructions. Note that in some languages a clause could be passive without any of these words (we should also check the feature Voice=Pass but we do not do it now); also note that the :pass subtype is optional. There might thus be passive clauses that we cannot recognize. 83 treebanks of 121 in Deep UD 2.4 contain at least one passive clause. See also the statistics of diathesis and arguments in Deep UD treebanks.
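The passive test and the resulting argument numbering can be sketched as follows. This is an illustrative simplification, not the code used to build Deep UD: the function names and the (token_id, relation) input format are hypothetical, only nsubj and obj are considered as core arguments, and the oblique agent is assumed to be marked with the optional obl:agent subtype, which many treebanks do not use.

```python
def is_passive(child_relations):
    """A clause is passive if any child relation carries the :pass subtype."""
    return any("pass" in rel.split(":")[1:] for rel in child_relations)


def number_arguments(children):
    """Map the children of a predicate to diathesis-normalized arguments.

    children: list of (token_id, relation) pairs for one predicate.
    Assumption: the passive agent is labeled obl:agent.
    """
    passive = is_passive(rel for _, rel in children)
    args = {}
    for tid, rel in children:
        base = rel.split(":")[0]
        if passive:
            if rel == "obl:agent":
                args.setdefault("arg1", []).append(tid)
            elif base == "nsubj":
                args.setdefault("arg2", []).append(tid)
        else:
            if base == "nsubj":
                args.setdefault("arg1", []).append(tid)
            elif base == "obj":
                args.setdefault("arg2", []).append(tid)
    return args


# Active: "The dog injured the boy" (children of "injured")
active = number_arguments([(2, "nsubj"), (5, "obj")])
# Passive: "The boy was injured by the dog" (children of "injured")
passive = number_arguments([(2, "nsubj:pass"), (3, "aux:pass"), (7, "obl:agent")])

print(active)   # {'arg1': [2], 'arg2': [5]}
print(passive)  # {'arg2': [2], 'arg1': [7]}
```

In both clauses the dog ends up as arg1 and the boy as arg2, which is the point of the normalization: the numbering follows the participant rather than the surface syntactic function.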