Saturday, April 26, 2014

Esperanto.
It's a made up language. No really! It's a full-blown, complete language, but
it is entirely made up. It's what's sometimes referred to as a
synthetic or constructed language – as
opposed to a natural language,
such as English, Chinese, or Salish.
(Interesting note: it's a controversial issue, but, some people think that Salishan languages don't have nouns). Worldwide,
it has several million speakers, and has clubs formed around it,
blogs in it, newspapers, etc. You can take online or mail-in courses to learn it. There are many books written in it, or translated to it. Google
Translate even translates to/from Esperanto. The
language has a long history, being formed in the late 1800s by one
Ludwik Lazarus Zamenhof (a.k.a.
“Doktoro Esperanto”),
and is the world's most successful constructed language. The
language's name means
“one who hopes”, and it was the hope of the eponymous Doktoro
that it would be a neutral language, not tied to any place or
ideology, which could be used to promote peace and understanding.

It even has its own flag!

I
have had an off again, on again relationship with Esperanto. But,
don't cry for me, Esperanto, the truth is I've never left you. Sometimes
I have left it alone for years, and then come back to it to study it
with abandon for a time. I've always found it to be fascinating. The
grammar is clean, the pronunciation is regular and easy.
I find it fairly easy to read, at least with some of the simpler
sentence constructs, and I like the complete regularity of the
language. Esperanto is
a flexional language, and the declensions of the various word type
are completely, strictly regular. There are no special cases. There
are no “yes, but...” rules. There aren't any exceptions to the
grammar rules – none. Which has always made me think that this
would be an excellent language for use with computer programs. It
should, in theory, be easy to write a program that would parse a
sentence in Esperanto, and then return some sort of object describing
the parsed sentence. Add a little vocabulary, and you could provide a
language interface to your program. A little rudimentary language
learning on your part, and you've got a great way to control a
program, have a natural language interface to turn lights on and off
in your house, control a robot, change TV channels, or whatever
you like!

It
seems like somebody should have already written one of these. It
seems an obvious thing to do. I looked all over the web, but, I
couldn't find a simple parser for Esperanto, that would give me an
object describing a sentence. So I wrote one.

A word
about Grammar

The
thing is, people mean different things when they use the word
“grammar”. Many people take a prescriptive,
lawyerly approach, where they may judge some utterance to be either
“good” or “bad” grammar. This
type
of grammar, where rules are prescribed and are not to be broken,
is commonly referred to as a “classical grammar”, and that
approach has no use here.

Nope!

In
linguistics, there are many ways to approach grammar, and many
meanings to the word. The meaning I am using here is that of a
descriptive grammar – that is, I'm not concerned with what is right
or wrong, but rather, what are some rules that can describe the
underlying construct of a particular language? For instance, how
would I find a set of rules that, given a particular concept or
thought,
would tell me how to piece together an utterance in the target
language, and have it adhere to what other speakers of the language
expect? This is called a “generative grammar”, a concept
championed by Noam
Chomsky in several of his early books, notably Syntactic
Structures.
Dr.
Chomsky has moved somewhat away from his early theories, but, they
will work for our purposes. Another key concept in what I wrote is
termed a “transformational grammar” (Also
a concept from Chomsky),
which allows a structure in some form to be transformed into another
by some rule or set of rules. In practical terms, this allows us to
do things such as deal with repeating conjunctions (“this and that
and these and those”), deal with the question construct in
Esperanto (“Ĉu” at the beginning of a sentence turns it into a
question, and in our object model, and that should be a top level
node containing “Ĉu”, with the rest of the sentence in a
subnode), and so forth. Transformational rules allow us to do these
transforms.

Some
approaches to computational linguistics take the approach of attempting to discover these underlying grammatical constructs in a language. Generally, a
very large corpus of sample material in the target language is fed
into a computer program, and statistical and probabilistic methods
are used to find relationships
within the target material. This is certainly an interesting subject,
as it leads directly into parallel processing and massively parallel
systems (Hadoop and Google's MapReduce come to mind), but it is not the
approach I used here. We already know the structure of Esperanto. It
is well defined, and clearly laid out in many books, online courses,
and so forth. What I intended to do was to take that written form,
intended for humans, and turn that set of rules into a computer
program that could recognize and digest utterances in our
well-defined target language. Note also that this is one-way. The
program does not take concepts and generate utterances in Esperanto,
but
only pulls apart existing utterances.

As
the source of the rules and
grammar,
I used A
Complete Grammar of Esperanto,
by Ivy Kellerman Reed. This is one of the seminal works on Esperanto,
and is available many places online for free, such as this
one at Project Gutenberg. I
started reading the book as a refresher in Esperanto, and, as I went,
I wrote the computer program, using the examples and exercises in the
book as test inputs for the program. I haven't yet completed the
book, but, so far, so good! And so far, quite good enough for the
intended purpose, which is to be able to control computer programs or devices (such as a robot) by
typing Esperanto sentences.

Some Things This Program Does Not Do

It
doesn't care about meaning, for one. It strictly adheres to syntax
and syntactically decomposing utterances in Esperanto. Semantics
and deeper meaning are left to the program consuming the resultant
object from this program.

It
also is strongly bound to the syntax of the target language, and does
not learn. If it encounters an utterance that it cannot decipher, it
will produce an object
either partially or wholly in error – however, since the rules of
the language are so firm, it's pretty easy to identify things that do
not conform to those rules.
It's
not a translator. While it is conceivable that the output from this
program could be passed through another program with a different set
of rules and transforms, and produce utterances in some other
language, I have not written any program to do that. I mainly
intended this as input to some other program or device, in order to
direct its actions.

How it Works

Subphrase Grouping

The code does two passes through an utterance. The first is the Grouper, which does more or less low-level parsing, such as dividing sentences into subphrases on commas or other punctuation, determining word type via endings, or recognizing special marker words such as "kaj" ("and") or “Ĉu”(question marker). It also groups things into subphrases depending on the words' relationships to one another -- for instance, adjectives and nouns that are adjacent to each other and agree in case are grouped into a subphrase. At the end of this pass, which recursively looks at the phrases it parses, the phrases and subphrases are grouped into a tree structure, with the individual words as leaf nodes. The following picture illustrates:

In the above example, the sentence, "Fortaj ĉevaloj marŝas kaj kuras en la verdaj kampoj." ("[The,A] Strong horse walks and runs in the green fields.") is parsed. We can see that it has created a top-level group on punctuation (in this case, a period, "."), under which it has several subgroups corresponding to things like noun phrases and verb phrases. For more complicated sentences, we may also see prepositional phrases and so forth. This program keeps things simple as much as it can -- for instance, if a word like a pronoun can be parsed by regular syntactic rules, (such as, by its ending), it doesn't get special processing. Remember, the program parses syntactically only, and disregards meaning in its processing.

In this example, the noun phrase is expanded, and we can see that it has grouped "Fortaj ĉevaloj" together, since they agree in number and type. The words have further attributes assigned to them, such as Fortaj being recognized as a plural adjective relating to the sentence's agent ("subject" could be considered analogous, in other classification systems). These attributes are simply that -- text that is assigned to the word during processing, akin to a tag cloud or other classification system. This allows for other processing to be done past what this program does, opening up the door for things like statistical analysis of text, Kmeans classification, and so forth. Or, just simply acting on the input!

To illustrate, here is "Fortaj", in more detail:

We can see here that the program has recognized "Fortaj" as flexional, and that it has 100% confidence in its results. The program is set up to use probability to determine how to classify a word, but, as Esperanto is so regular, just simply recognizing the ending of the word has proved to be reliable so far, and probabilistic methods of identification have not yet been needed. The program has a lexicon, and further builds that lexicon during parsing, but, it does not have an entry for Fort[], which means that its Translation property on the word object is null. Again, the program operates syntactically only, and assumes that any translation or action done on words it parses is up to the consumer of its results. It's the calling program's responsibility to know what to do with the words and phrases it finds.

Transformations

After groupings have been done, the next processing pass does any transformations that need to be done. At this point, we have a tree structure, as described above, which should be completely parsed. The transfomation pass takes this parsed tree, and further applies rules in order to produce the final result. For instance, one of the rules we have been talking about is that “Ĉu” marks a question. The transformation for this is that “Ĉu” forms a top node, with the rest of the transformation underneath it. A somewhat similar rule may be to group subphrases related to conjunctions under that conjunction -- for example, the equivalent of

For the transform mentioned above, here are before and after examples. Note in the before example, the word "Ĉu" does not have anything special about it, although the grouper did recognize it as a word needing a transformation. In the second example, we can see that new subgroups have been created under the word (see near the highlighted line in the second picture, below). Similarly, the question word or any of its subwords and subphrases could be marked in any number of ways (for instance, attributes on the words) for further processing or as a marker to the consuming program to take some action.

After transform:

The transforms are implemented using the Visitor pattern, where each type of transform is implemented as a small class which is handed a Group object. The object does any transform, and returns true if it found anything on which to operate. The main loop recurses over the tree until all transforms return false, meaning they found nothing on which to make any changes. Each Transformer class is a small, simple class that does one operation on its input group.

Next Steps

The next steps that this program needs to take are more transformations, specifically ones that explicitly express grammar rules of Esperanto, or ones that mark leaf nodes in specific ways -- for instance, marking all leaf nodes that are related to a pronoun in the sentence, and so forth.

The program also needs to further encapsulate the chapters of the source text (A
Complete Grammar of Esperanto), and make sure that it covers any of the more esoteric rules of the language.

Other next steps could be things like loosening its cohesion to Esperanto specifically, and make it so that it can do data mining to find its own rules and transformations in sample text -- that is, make it learn on its own. This is, obviously, a large undertaking and I have purposely left it out of scope for this program.

Another possible expansion is to do the exact reverse of parsing -- take some programmatic object, and use the rules defined here to not parse text, but to generate correct text in Esperanto.

Where to Download the Code

The code for this project is available at the link below. Please let me know if you find it useful, and, if you expand on it, please let me know, so we can roll your changes in!

https://sourceforge.net/projects/esperantoparser/

Afterthoughts

Linguistics
is a fascinating subject, but it can be hard to figure out where to
start. For a good overview, I recommend getting a copy of Teach
Yourself Linguistics,
by
Jean Aitchison, who is
the Emeritus Rupert Murdoch Professor of Language and Communication
at Oxford. Despite her rather formidable title, Dr. Atchison has
written a very straightforward and easy to read introduction to
linguistics. For an interesting read on language formation and
development, The
Language Instinct by Steven
Pinker, a well known professor in the Psych department at
Harvard, is a classic. To go a bit deeper on a much more technical
level, Syntactic
Structures by Noam
Chomsky dramatically changed the field of linguistics when it was
published. It's worth following that link just to read the
introduction and overview. Finally, for a good layman's introduction to everything
related to cognition, neuroscience, and, of course, language and
linguistics, I highly recommend the
Brain Science Podcast,
by Dr. Ginger Campbell.
Dr. Campbell gives overviews of many fascinating books, far more than
I for one could ever read, and has interesting interviews with
leading researchers and authors working in fields such as cognitive
science and neuroscience. I really can't recommend her podcast
strongly enough. And she also goes to Dragon*Con,
so how cool is that?!

And one more thought. I know, I know, as you've been reading this, you have been thinking, "Yes, yes, this is all fine and good, but, what about a free editor for Ancient Egyptian hieroglyphics?" So, without further ado or non sequiturs, here it is. JSesh, open source and available for most platforms. It's pretty awesome!