Introduction

The XMG system is what is usually called a “metagrammar compiler” (see below). More precisely, it is a tool for designing large-scale grammars for natural language. Given a compact representation of grammatical information, XMG combines elementary fragments of information to produce a fully redundant, strongly lexicalised grammar. Note that the name XMG refers to both:

a formalism allowing one to describe the linguistic information contained in a grammar,

a device computing grammar rules from a description based on this formalism.

What is a metagrammar?

The term was introduced at the end of the 1990s by M.-H. Candito. During her PhD, she proposed a new process for semi-automatically generating a Tree Adjoining Grammar (TAG) from a reduced description that captures the linguistic generalizations appearing among the trees of the grammar. This reduced description is the metagrammar.

What is a metagrammar compiler?

Once we have described the grammar rules by specifying the way structure is shared, i.e. by defining reusable fragments, we use a specific tool to combine these. Such a tool is called a metagrammar compiler.

What is XMG-2?

A distinction has to be made between XMG and XMG-2 (sometimes called XMG-NG).
XMG is a metagrammar compiler dedicated to the generation of Tree Adjoining Grammars and Interaction Grammars. XMG-2 is an entirely new project, developed at the LIFO (University of Orléans) and the SFB 991 (University of Düsseldorf). XMG-2 makes it possible to create new compilers, adapted to other generation tasks. Its modularity makes it easy to assemble Domain Specific Languages and to automatically generate the processing chain for these languages.

In other words, XMG-2 is a tool for generating compilers such as XMG: a metacompiler (or compiler compiler).

This user documentation of XMG-2 is based on the documentation for XMG, and includes the new features provided by the recent extensions.

Getting started

Installation

Option 1: standard installation

If you are using a Debian-based distribution (such as Ubuntu):

Git:

sudo apt-get install git

Download and install Gecode (4.0 not supported yet):

(if you only use XMG for Interaction Grammars, you can skip this step)

Using XMG without installing anything

Updating XMG-2

To get the latest version of XMG-2, regardless of the installation option you chose, you can type this command (in the xmg-ng directory):

git pull

Creating a first compiler

The instructions detailed here are equivalent to running the script reinstall.sh (see section Scripts). This means that you can skip this section by simply typing:

./reinstall.sh
(at the root of the XMG-2 installation directory)

Before compiling a metagrammar, a compiler needs to be created. XMG-2 assembles compilers by combining compiler fragments called bricks. These bricks are distributed into packages called contributions. For example:

the contribution core provides bricks offering support for the basic features of a compiler

the contribution treemg makes it possible to process tree descriptions

the contribution synsemCompiler makes the synsem compiler (equivalent to XMG-1) available

Installing a contribution with the install command makes all the bricks of this contribution available for assembly.

After these operations, the compiler synsem (Tree Adjoining Grammar with semantics based on predicate logic) is available.

Compiling a toy-metagrammar

The XMG system includes toy metagrammars that we highly recommend experimenting with. The files containing these metagrammars should be in the MetaGrammars directory of the XMG installation. To compile one of the synsem examples (suited to the compiler we just built), type:

xmg compile synsem MetaGrammars/synsem/TagExample.mg

(see also List of XMG's options below)
The result of this compilation will be a file named TagExample.xml

Options include

--force to generate the grammar even if an XML file already exists

--latin to manipulate metagrammars written in Latin encoding

--debug to print some useful information about compilation

--notype to disable the strong type checking (equivalent to XMG-1)

--more to generate additional files (type hierarchy, etc.)

To launch the GUI, type:

xmg gui tag

You can then open the grammar file (.xml) which was generated by the compiler (Fichier → Ouvrir un XML).

Compiling an existing metagrammar

To compile metagrammars that were created using XMG-1, it is usually necessary to use the --notype option to skip the type checking steps, which did not exist in XMG-1.

Writing a Metagrammar

Getting started

A metagrammar is composed of one or several text files, which usually have the extension .mg or .xmg. Any text editor can be used to write XMG code, although Emacs is recommended because of the different XMG modes created for it:

the Emacs and Vim modes for XMG-1 (only tree descriptions and predicate semantics).

new Emacs modes inspired by these, which are automatically generated when a compiler is built (in the file .install/yap/xmg/compiler/X/generated/emacs_mode, where X is the name of the compiler).

Including data from other files

To ease their development and reuse, metagrammars can be split into separate files. For example, all type declarations can be isolated in one file. To include the code of a file into another:

include file_to_include.mg

Principles and Constants

Principles

The first piece of information one has to give in a metagrammar is the set of principles that will be needed to compute the grammar structures. This is done with the statement use principle with (constraints) dims (dimensions). For instance, one may decide to force the syntactic structures of the output grammar to have the grammatical function gf with the value subj only once. This is expressed by:

use unicity with (gf = subj) dims (syn)

In the syn dimension, we use the unicity principle on the attribute-value pair gf = subj.
At the time of this writing, 3 principles are available in the XMG system, namely:

unicity: uniqueness of a specific attribute-value pair

rank: ordering of clitics by associating the rank property with nodes

color: automation of node merging by assigning colors to nodes

Note that principles take as parameters pieces of information that are associated with nodes as properties (see below).

Types and Constants

XMG includes light typing. By “light” we mean that one has to type the pieces of information that are used, but for now there is no strong type checking during compilation and execution (only syntax checking). There are 4 ways of defining types:

as an enumerated type, using the syntax type Id = {Val1,…,ValN} such as in:

type CAT={n,v,p}

(note that the values associated to a type are constants)

as an integer interval, using the syntax type Id = [I1 .. I2]

(this is useful when one wants to avoid having to define acceptable values for every single piece of information).

Note that XMG provides 3 predefined types: int, bool (whose associated values are + and -) and string.
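As a small illustration of the two declaration forms shown above, the following sketch declares one enumerated type and one integer interval. The type names NUMBER and PERS and their values are chosen for illustration only; they are not prescribed by XMG.

```xmg
% Illustrative declarations only: names and values are assumptions.
% Enumerated type: the values sg and pl are constants.
type NUMBER = {sg, pl}
% Integer interval: any integer from 1 to 3.
type PERS = [1 .. 3]
```

A type declared this way can then be used in property and feature declarations, in the same way as the predefined types.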

Once types have been defined, we can define typed properties that will be associated to the nodes used in the tree descriptions. The role of these properties is either (a) to provide specific information to the compiler so that additional treatments can be done on the output structures to ensure their well-formedness or (b) to decorate nodes with a label that is linked to the target formalism and that will appear in the output (see XMG's graphical output). The syntax used to define properties is property Id : Type, such as in:

property extraction : bool

Some syntactic sugar is also available for properties. For instance, one may want to avoid having to state extraction=+ several times. An alternative is to associate an abbreviation with the property (between curly brackets):

property extraction : bool {extra = +}

This means that using extra is equivalent to giving the value + to the property extraction of a node, i.e. equivalent to extraction=+.

Finally, we have to define typed features that are associated with nodes in several syntactic formalisms such as Feature-Based Tree Adjoining Grammars (FBTAG) or Interaction Grammars (IG). A feature is defined by writing feature Id : Type, such as in:

feature num : NUMBER

So far, we have seen the declarations that the compiler needs to perform different tasks (syntax checking, output processing, etc.). Next we will see the heart of the metagrammar: the definition of the clauses, i.e. the classes.

Classes

Here we will see how to define classes (i.e. the abstractions in the XMG formalism). Note that in TAG these classes refer to tree fragments. A class always begins with class Id, such as in:

class CanonicalSubj

N.B. A class may be parametrized; in that case the parameters are bracketed and separated by a colon, as presented in Miscellaneous.

Import

To reach a better factorization, a class can inherit from another one. This is done by invoking import Id (where Id is a class name), such as in:

import TopClass[]

That is to say, the metagrammar corresponds to an inheritance hierarchy. But what does inheriting mean here? The content of the imported class is made available to the daughter class. More precisely, a class uses identifiers to refer to specific pieces of information. When a class inherits from another one, it can reuse the identifiers of its mother class (provided they have been exported, see below). Thus, a node can be specialized, for instance by adding new features.

Note that XMG allows multiple inheritance, and besides it offers an extended control of the scope of the inherited identifiers, since one can restrict the import to specific identifiers, and also rename imported identifiers (see Miscellaneous).

N.B. When importing a class, even if it has parameters in its definition, these cannot be instantiated.

Export

As we just saw, we use identifiers in each class. One important point when defining a class is the scope we want these identifiers to have. More precisely, we can give (or not) external visibility to each identifier by using the export declaration. Only exported identifiers will be available when inheriting from or calling (i.e. instantiating) a class. Identifiers are exported using export id1 id2 … idn such as in:

export ?X ?Y

(The ? indicates that X and Y are variables, and not skolem constants, i.e. anonymous constants, which would have been prefixed with !.) Besides, when exporting an identifier you can rename it so that it can later be referred to by a new name (to avoid name conflicts). This is done by typing export id1=id1new, for example:

export ?X=?U ?Y

(here the X variable will be referred to by using ?U in the daughter class, ?Y will still be called ?Y)

Identifiers

In XMG, identifiers can refer either to a node, the value of a node property, or the value of a node feature. But whatever an identifier refers to, it must have been declared before by typing declare id1 id2 … idn, such as in:

declare ?X ?Y ?Z

Note that in the declare section the prefixes ? (for variables) and ! (for skolem constants) are mandatory.

Content

Once the identifiers have been declared and their scope defined, we can start describing the content of the class. Basically this content is given between curly-brackets. This content can either be:

a statement

a conjunction of statements represented by S1 ; S2 in the XMG formalism

a disjunction of statements represented by S1 | S2

a statement associated to an interface (see below)

By statement we mean:

an expression: E (that is a variable, a constant, an attribute-value matrix, a reference (by using a dot operator, see the example below), a disjunction of expressions, or an atomic disjunction of constant values such as @{n,v,s}),

a unification equation: E1=E2,

a class instantiation: ClassId[] (note that the square brackets after the class id are mandatory even if the instantiated class has no parameter),

a description belonging to a dimension: this is where the main description task takes place (see section Dimensions)
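Putting the previous subsections together, a class might be laid out as in the following sketch. The class names TopClass and CanonicalSubj come from the examples above; the variable names, the node contents, the assumption that TopClass exports ?X, and the ordering of the header declarations (which simply follows the order of the subsections above) are illustrative only.

```xmg
% Illustrative sketch; everything except the keywords is an assumption.
class CanonicalSubj
import TopClass[]      % ?X is assumed to be exported by TopClass
export ?Y              % ?Y becomes visible to inheriting or calling classes
declare ?Y             % identifiers local to this class must be declared
{
  <syn>{
    node ?Y (gf=subj)[cat=n] ;   % conjunction of two statements
    ?X -> ?Y
  }
}
```

The content between curly brackets is here a single description in the <syn> dimension, itself a conjunction of a node definition and a dominance statement.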

Valuations

Once all the classes have been defined, we can ask for the evaluation of the classes that will trigger the
combination of the fragments (i.e. classes calling classes that contain disjunctions and/or conjunctions of
fragments). For each of these specific classes, we obtain an accumulated tree description that may
lead to the building of 0, 1 or more TAG trees. The syntax of the evaluation instruction in XMG is value Id, such as in:

value n0Vn1

Dimensions

SYN: a tree description language

The <syn> dimension is used to describe trees, initially to create Tree Adjoining Grammars or Interaction Grammars. To use this language, you can either build a new compiler using the brick syn (contribution treemg) or use one of the existing compilers that include the dimension: synsem (contribution synsemCompiler, with predicate-based semantics) or synframe (contribution synframeCompiler, with frame-based semantics).

A syntactic description follows the pattern <syn>{ formulas }. What kind of formulas does a syntactic description contain? The answer is nodes, which are related to each other. In XMG, you may give a name to a node by using a variable, and also associate properties and/or features with it. The classic node definition is node ?id ( prop1=val1 , … , propN=valN ) [ feat1=val1 , … , featN=valN ] such as in:

node ?Y (gf=subj)[cat=n]

Here we have a node that we refer to by using the ?Y variable. This node has the property gf (grammatical function) associated with the value subj, and the feature structure [cat=n] (note that associating a variable to a node is optional).

Once you have defined the nodes of the tree fragment, you can describe how they are related to each other. To do this, you have the following operators:

-> strict dominance

->+ strict large dominance (transitive non-reflexive closure)

->* large dominance (transitive reflexive closure)

>> strict precedence

>>+ strict large precedence (transitive non-reflexive closure)

>>* large precedence (transitive reflexive closure)

= node equation

= node equation

Each subformula you define can be added conjunctively (using “;”) or disjunctively (using “|”) to the description. For instance, the fragment:
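A sketch of such a fragment, assuming the ASCII spellings -> (strict dominance) and >> (strict precedence) for the operators listed above: an S node dominates an N node and a V node, and the N node precedes the V node. The variable names are illustrative.

```xmg
% Illustrative sketch; variable names are assumptions.
<syn>{
  node ?X [cat=S] ;
  node ?Y [cat=N] ;
  node ?Z [cat=V] ;
  ?X -> ?Y ;        % S immediately dominates N
  ?X -> ?Z ;        % S immediately dominates V
  ?Y >> ?Z          % N immediately precedes V
}
```

All subformulas are combined conjunctively here; replacing one of the ";" separators with "|" would turn the corresponding combination into a disjunction.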

XMG also supports an alternative way of specifying how the nodes are related to each other. This alternative syntax allows the user to define the nodes and give their relations at the same time:

Thus the tree fragment above could be defined in the XMG syntax the following way:

class Example
{<syn>{
node [cat=S] {
node [cat=N]
node [cat=V]
}
}
}

Note that variables are no longer needed to refer to the nodes inside the fragment; nonetheless, we may want to assign variables to nodes in order to reuse them later through inheritance.

IFACE: connecting dimensions

Interfaces correspond to attribute-value matrices, allowing one to associate a global name to an identifier.
The syntax of the interface is the following (the interface is between square-brackets):

class Id
{ ... }*= [Name1=Id1, ... , NameN=IdN]

The *= operator represents unifying extension.
When a class is valuated, the descriptions (contained in the classes) it refers to are accumulated. At the
same time, the interfaces associated with these descriptions are accumulated. The semantics of their
accumulation may correspond to unification.

Let us see the use of an interface in an example. Consider the tree fragment used so far, and imagine we
want to refer to the N node outside of the class. To do so, we give this node a global name. We can do
this by using the following interface:
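One possible way to write this, following the pattern class Id { ... }*= [...] given above. The variable names are illustrative, and the exact placement of the interface (here directly after the <syn> description) is an assumption.

```xmg
% Illustrative sketch; variable names and interface placement are assumptions.
class Example
declare ?X ?Y ?Z
{
  <syn>{
    node ?X [cat=S] {
      node ?Y [cat=N]
      node ?Z [cat=V]
    }
  } *= [subj = ?Y]    % the N node now carries the global name subj
}
```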

In a class A which is combined with Example, you can force the identification of a local node ?X with the subj node of Example by reusing the feature subj in the interface of A: *=[subj=?X]
Note that the interface may also be used to give names to properties or feature values.

The interface can also be accessed as a regular dimension, meaning that the *= operator can be replaced as follows:
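A sketch of this dimension-style notation, assuming that the dimension is written <iface> and that it directly contains the attribute-value matrix. Both points are assumptions based on the way the other dimensions are written, and ?Y stands for a node variable declared in the enclosing class.

```xmg
% Assumed notation: the dimension name <iface> is not confirmed by the text above.
<iface>{
  [subj = ?Y]
}
```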

FRAME: describing semantics using typed feature structures

The <frame> dimension can be used in a compiler by using the frame brick (contribution framemg). Several pre-assembled compilers use this brick: synframeCompiler (with Tree Adjoining Grammars) and framelpcompiler (with morphological descriptions).

This dimension is used to describe typed feature structures. These structures use conjunctive types, which means that types are not atomic, but rather sets of elementary types. When two typed feature structures get unified, the type of the resulting structure is determined by a type hierarchy. In the simple case, and if the types are compatible, the resulting type is the union of both types.

Type hierarchies are defined in two steps. First, the declaration of the atomic types:

frame-types = {t1,t2,...,tn}

where t1, t2, …, tn are constants.

Second, the atomic types are organized into a hierarchy by specifying constraints:

frame-constraints = {c1, c2,..., cn }

where c1, c2, …, cn are type constraints. Several sorts of them are available:

The first three constraints are subsumption constraints: causation → event, for example, means that all frames of type causation also have type event. The next two constraints express incompatibilities between types, meaning for instance that a frame cannot have both types motion and causation.
activity motion → locomotion means that all frames having both types activity and motion also have type locomotion.
The last three constraints concern attributes. For instance, causation → cause:+ effect:+ makes sure that all frames of type causation have the attributes cause and effect, both with value +.
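As an illustration, the declarations below combine the constraints quoted in the preceding paragraph. The type names are the ones mentioned in the text and the keywords are spelled as above; writing the arrow as -> is an assumption (the text shows it typeset as →), and incompatibility constraints are left out because their concrete syntax is not shown here.

```xmg
% Illustrative sketch; the ASCII arrow -> is an assumption.
frame-types = {event, activity, motion, causation, locomotion}

frame-constraints = {
  causation -> event,               % subsumption
  activity motion -> locomotion,    % conjunction of types implies a third type
  causation -> cause:+ effect:+     % attribute constraint
}
```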

Exporting the type hierarchy

XMG computes the type hierarchy to handle the unification of typed feature structures during the compilation of the metagrammar. However, to be able to reuse this type hierarchy with the generated resource (with a parser for example), the hierarchy needs to be exported. When compiling the metagrammar, the option --more activates the export of additional useful resources, which in our case is the hierarchy. For a file called example.mg, the complete command is the following:

xmg compile synframe example.mg --force --more

The compiled grammar can then be found in the file example.xml and the type hierarchy in the file more.mac.

SEM: describing semantics using predicates

Here we will see how to describe semantic information with predicates. Basically, this dimension allows one to
describe:

That is to say, we define the class BinaryRel in which 3 variables and a skolem constant (prefixed
by “!”) are declared. This class only contains semantic information (dimension <sem>), more
precisely it contains a predicate (whose value is the variable ?P) of arity 2, its arguments are the
variables ?X and ?Y. !L represents the label associated to this predicate. Note that we use the interface
dimension to give the name pred to ?P. Further, this variable may be unified with a constant, and the
value of the predicate thus given.
Finally, it is possible to define a class containing both a semantic and a syntactic dimension, and these
dimensions may share identifiers. Identifiers may also be shared by using the interface
dimension. XMG thus provides efficient devices to define a syntax/semantics interface within the
metagrammar.
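The BinaryRel class described above might look as in the following sketch. It is reconstructed from the description only: the label:predicate(arguments) notation and the placement of the interface after the <sem> description are assumptions.

```xmg
% Reconstructed sketch; the concrete <sem> notation is an assumption.
class BinaryRel
declare ?X ?Y ?P !L          % 3 variables and 1 skolem constant
{
  <sem>{
    !L:?P(?X,?Y)             % predicate ?P of arity 2, labelled !L
  } *= [pred = ?P]           % the interface gives the name pred to ?P
}
```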

MORPH_LP: describing morphology with ordered fields

This dimension is used to form words by assembling morphemes. First, fields need to be defined and ordered (using constraints), then information can be added to the fields. The description language offered by this dimension consists of only one keyword and two operators.

A field suffix is created, and placed on the right of another field root (defined in another class). The string “s” is added into the new field. This complete example can be found in the MetaGrammars/lp_morph/example.mg file of XMG-2 (or on GitHub).

LEMMA: describing lexicons of lemmas

This dimension is used when parsing (with a TAG for example). The typical use of these lexicons is to list which TAG families are compatible with the lemmas of the language. The description language basically associates values with different attributes. An example of a class using the lemma dimension is as follows:
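A minimal sketch of such a class, assuming that the <lemma> dimension uses the same attribute <- value notation as the <morpho> dimension. The class name, the lemma "manger" and the family name n0Vn1 are illustrative choices.

```xmg
% Illustrative sketch; notation assumed to mirror the <morpho> dimension.
class LemmaManger
{
  <lemma> {
    entry <- "manger";
    cat   <- v;
    fam   <- n0Vn1
  }
}
```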

where entry is the lemma, fam a TAG family which can use this lemma as anchor, and cat the syntactic category of the anchor.

Metagrammars containing contributions to the <lemma> dimension (only) must be compiled with the compiler named lex:

xmg compile lex lemma.mg

MORPHO: describing lexicons of inflected forms

This dimension is used when parsing (with a TAG for example). The typical use of these lexicons is to list which lemmas are compatible with the inflected forms of the language. The description language basically associates values with different attributes. An example of a class using the morpho dimension is as follows:

class a
{
<morpho> {
morph <- "a";
lemma <- "avoir";
cat <- v
}
}

where morph is the inflected form, lemma is the lemma associated to this inflected form, and cat the syntactic category of the inflected form.

Metagrammars containing contributions to the <morpho> dimension (only) must be compiled with the compiler named mph:

xmg compile mph morph.mg

Examples

Simple TAG example

Now, we will see in detail how to write a metagrammar. We will define a metagrammar generating a
small TAG for French. This small TAG will contain 2 trees, namely the ones representing a transitive
verb either with a canonical subject or a subject in relative position.

Specifying data

The first thing to do is to define the principles, types, properties and features we will use. For the sake of clarity,
we will only constrain the produced trees to have no duplicate grammatical function. That is to say,
we will only activate the unicity principle with the gf property as parameter:

We will deal with few types in this example. We only pay attention to grammatical functions and syntactic categories. The former are a node property and the latter a node feature (i.e. part of the TAG formalism):
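The declarations described in the two paragraphs above could be sketched as follows, reusing the syntax introduced in the section Principles and Constants. The type names GF and CAT and their values are assumptions chosen to fit the transitive verb example.

```xmg
% Illustrative sketch; type names and values are assumptions.
% Principles: each grammatical function may occur only once per tree.
use unicity with (gf = subj) dims (syn)
use unicity with (gf = obj) dims (syn)

% Types for grammatical functions and syntactic categories.
type GF = {subj, obj}
type CAT = {s, n, v}

% gf is a node property, cat is a node feature.
property gf : GF
feature cat : CAT
```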

Defining blocks (tree fragments)

The metagrammatical rule we will use is the following:

transitive = (CanSubject | RelSubject) ; Active ; Object

So we will handle 4 tree fragments: Active, CanSubject, Object, and RelSubject. The class transitive will consist of an abstraction on a conjunctive combination including a disjunction on the subject that is used.
The Active class corresponds to the verbal spine:

class Active
export ?X ?Y
declare ?X ?Y
{<syn>{
?X -> ?Y
}
}

The CanSubject class corresponds to the Example class introduced previously:
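A possible definition of CanSubject, sketched under the assumption that it mirrors the Example class with lowercase category values and a gf property on the subject node. The choice of exported variables (?X for the s node and ?Y for the v node, matching the names exported by Active) is illustrative and not taken from the original metagrammar.

```xmg
% Illustrative sketch; not the original class definition.
class CanSubject
export ?X ?Y
declare ?X ?Y ?Z
{<syn>{
  node ?X [cat=s] {
    node ?Z (gf=subj) [cat=n]    % canonical subject, left of the verb
    node ?Y [cat=v]
  }
}
}
```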

In this class, we use the dot operator to associate a variable to the record of exported identifiers. For
instance, ?OB being the variable representing the Object class, ?OB.?X refers to the ?X variable of this class, provided it has been exported. In the transitive class we conjunctively combine 3 classes (Active, Object, and either CanSubject or RelSubject). We also unify their s and v nodes so that the tree fragments get merged. Note that we may prefer using a color system to semi-automate this node unification (see Controlling fragment combination semi automatically by coloring nodes).
Finally, we know that the transitive class contains all the information needed to build 2 TAG
trees. So we ask for its evaluation by invoking:

value transitive

As a result we obtain the following 2 trees (the first one represents the relative subject, and the second
one the canonical subject):

More examples

More examples can be found in the MetaGrammars folder of the XMG installation directory.

Bricks and contributions

As stated previously, XMG-2 compilers are built using bricks, which implement the compiling steps for parts of metagrammatical languages. For example, the avm brick contains all the support for feature structures, and the syn brick the support for the language of the <syn> dimension.
Bricks are distributed in contributions, for instance the core contribution, which contains all the basic features of XMG-2. The treemg contribution contains all the bricks which, in addition to the ones of the core contribution, make it possible to build the synsem compiler (equivalent to XMG-1).

Making bricks available is done by installing them. The install command takes as parameter a contribution and installs all the bricks provided by it.

All contributions with names ending in “compiler” are special, as they contain a different type of brick. These bricks contain descriptions of compilers, which need to be created before one can use them. Creating a compiler is done with the build command.

Compilers are usually named after the dimensions they feature (synframe provides both the <syn> and the <frame> dimensions). The following compilers can be installed: