Worse, the official docs don't list the possible tags for most of these properties, let alone explain what any of them mean. They sometimes mention which tokenization standard they use, but these claims currently aren't entirely accurate, and the standards themselves are tricky to track down.

Part of speech tokens

The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.

More precisely, the tag_ property exposes Treebank tags, and the pos_ property exposes tags based upon the Google Universal POS Tags (although spaCy extends the list).
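To compare the two properties side by side, here is a minimal sketch. The helper name tag_rows and the model name en_core_web_sm are my own assumptions for illustration (the right model name depends on your spaCy version); the only spaCy attributes relied on are text, tag_ and pos_ on each token.

```python
from collections import namedtuple

# One row per token: surface text, fine-grained tag, coarse-grained tag.
TagRow = namedtuple("TagRow", ["text", "tag", "pos"])


def tag_rows(doc):
    """Collect (text, tag_, pos_) for every token in `doc`.

    `doc` can be any iterable of objects exposing .text, .tag_ and .pos_,
    which is what iterating over a spaCy Doc yields.
    """
    return [TagRow(token.text, token.tag_, token.pos_) for token in doc]


def demo(model_name="en_core_web_sm"):
    # Assumption: spaCy and this model are installed. The model name is
    # illustrative only; older releases used different names.
    import spacy

    nlp = spacy.load(model_name)
    for row in tag_rows(nlp("The quick brown fox jumps over the lazy dog.")):
        print(row.text, row.tag, row.pos)
```

Calling demo() prints one line per token, with the Treebank-style tag in the second column and the coarse-grained tag in the third.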

spaCy's docs seem to recommend that users who just want to dumbly use its results, rather than training their own models, should ignore the tag_ attribute and use only the pos_ one, stating that the tag_ attributes...

are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent.

That is to say, if spaCy releases an improved model trained on a new treebank, the tag_ attribute may take different values from the ones it had before. This clearly makes it unhelpful for users who want a consistent API across version upgrades. However, since the current tags are a variant of Penn Treebank, they are likely to mostly intersect with the set described in any Penn Treebank POS tag documentation, like this: http://web.mit.edu/6.863/www/PennTreebankTags.html

The more useful pos_ tags are

A coarse-grained, less detailed tag that represents the word-class of the token

However, we can see from spaCy's parts of speech module that it extends this schema with three additional POS constants, EOL, NO_TAG and SPACE, that are not part of the Universal POS Tag set. Of these:

NO_TAG is an error code. If you try parsing a sentence with a model you don't have installed, all Tokens get assigned this POS. For instance, I don't have spaCy's German model installed, and I see this locally if I try to use it:

RCMOD: relative clause modifier (not used by spaCy - relcl is used instead as noted below)

ROOT: root

XCOMP: open clausal complement

and also contains the actual linguistic definitions of these terms, complete with examples. However, as with part of speech tokens, spaCy doesn't quite adhere to the scheme it claims to adhere to. Looking in its symbols file, we can see that it defines a constant for each of the tokens above except PREDET, which spaCy doesn't use for some reason. Additionally, as noted in https://github.com/explosion/spaCy/issues/233, there are several dependency tokens that spaCy can emit that are neither included in the symbols file nor in the 2012 CLEAR documentation. These include acl, case, compound, dative, nummod, and relcl.

Fortunately, we can find at least brief descriptions of what these undocumented dependencies mean in code comments on the DEPTagEn interface inside the nlp4j (previously called ClearNLP) project, which spaCy uses to train its parser. For instance, the meanings of the tokens above:

acl - finite and non-finite clausal modifier.

case - case marker

compound - compound nouns/numbers

dative - dative

nummod - numeric modifiers

relcl - relative clause modifiers
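If you want these descriptions available in code, a small lookup table is enough. This is only a sketch: the names EXTRA_DEP_LABELS and describe_dep are hypothetical, and the descriptions are copied from the six labels listed above rather than from any spaCy API.

```python
# Descriptions for dependency labels that spaCy can emit but that are
# missing from its symbols file, copied from the list above.
EXTRA_DEP_LABELS = {
    "acl": "finite and non-finite clausal modifier",
    "case": "case marker",
    "compound": "compound nouns/numbers",
    "dative": "dative",
    "nummod": "numeric modifiers",
    "relcl": "relative clause modifiers",
}


def describe_dep(label, default="(no description known)"):
    """Return a short description for a dependency label.

    The lookup is case-insensitive; labels not in the table get `default`.
    """
    return EXTRA_DEP_LABELS.get(label.lower(), default)
```

With a parsed document you could call describe_dep(token.dep_) for each token and fall back to the CLEAR documentation for anything that returns the default.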

These admittedly aren't great descriptions, but at least they're something! The spaCy team is aware of the deficiencies of the documentation and working to fix it, so hopefully in a while we'll have better documentation all in one place.
