After campaigning once or twice for a conlang SE, there finally is one! And it isn't getting enough questions. Writing a good question is turning out to be harder than I thought. Let's take a few questions that I personally have.

How do I write a machine-generated conlang? The hypothetical answer would have to be huge! Off the top of my head, one would need to cope with phonetic fonts, randomly generating words, filling up a dictionary, writing a formal grammar, writing a formal grammar that works both for machines and natural languages, possibly writing a parser (which is HARD), and using corpus management and concordance tools to examine texts. If those texts are machine generated, now you need a "lorem ipsum" generator, which is harder to write than a word generator. It would take a book to explain it all, and a few more books to explain regex, NLP libraries and just basic programming.
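
To make the "randomly generating words" step concrete, here is a minimal sketch of a word generator. The phonology (simple consonant-vowel syllables from a toki pona-like inventory) is an assumption for illustration, not anyone's official spec.

```python
import random

# Assumed inventory: toki pona-like onsets and vowels, CV syllables only.
ONSETS = ["p", "t", "k", "s", "m", "n", "l", "w", "j"]
VOWELS = ["a", "e", "i", "o", "u"]

def random_word(rng, syllables=2):
    """Build a word out of simple consonant-vowel syllables."""
    return "".join(rng.choice(ONSETS) + rng.choice(VOWELS)
                   for _ in range(syllables))

rng = random.Random(42)  # seeded so runs are repeatable
words = [random_word(rng) for _ in range(5)]
print(words)
```

A real generator would also enforce phonotactic constraints (banned clusters, syllable stress, minimum distinctiveness between words), which is where the actual work is.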

How should my next (hypothetical) language, Droid, express "performativity"? First off, it is a question I got, but the answer isn't going to be in someone else's head. How should a sculptor carve this statue's nose? Who knows! It is artistic prerogative. I suppose, objectively, one could list every method by which any language marks any grammatical thing. But even that wouldn't be a good answer, because languages tend to have a few strategies that they use over and over. Once you have a language with a syntax based mostly on word endings (like Latin), the answer for expressing any grammatical marker is probably yet another suffix.

How do I say "today is a good day to die" in Klingon/Na'vi/toki pona/Esperanto? This question calls for someone who has invested the 500 to 2000 hours of effort into learning Klingon. Those people tend to clump together on some specific forum or mailing list. Why they choose one platform over another, I don't know. The toki pona community has stampeded from a Yahoo mailing list to LiveJournal, to a phpBB forum, to Facebook. For the most part, people don't care about the platform as much as they care where the community's self-appointed grammarian is. Both toki pona and Klingon have examples of self-appointed grammarians, y'all know who you are! (And this isn't bad: without the self-appointed grammarians acting as essentially pro bono tutors, the community would die.)

What the language-specific questions call for is a single community on a single platform. We'll have to see if Conlang SE's tagging rises to the occasion.

One new thing on Conlang SE: when I ask a toki pona question, folks there want it to be glossed, so that people who've never studied toki pona have a hope of following the Q & A discussion.

Try it out!
Go try it out. It is in the wild west phase where the community hasn't finished deciding what a good sort of question is, so expect more than the usual number of close votes and people wanting to quibble about whether this or that is on topic rather than answering your question. Also, there are at least one or two people there who think their job is to find people wrong on the internet and set them straight, rather than down-voting an answer or ignoring a question whose premise bothers them.

That said, the SE format in other domains has created excellent questions and answers with a minimum of the flame wars and fluff that go with, say, a phpBB forum or mailing list.

The CountVectorizer is the most audacious model of human language. It counts the words and converts the entire sentence into a vector of word counts for each different word in the sentence (or text), and, by absence, zeros for all other words in the dictionary. When the number of words in the dictionary is small, the fact that one word was left out holds more information. This is important for small languages.
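
CountVectorizer is scikit-learn's name for this; here is a from-scratch sketch of the same idea, so you can see how little machinery a bag-of-words model needs. The toy dictionary and sentence are assumptions for illustration.

```python
from collections import Counter

def bag_of_words(sentence, dictionary):
    """Map a sentence to a vector of word counts over a fixed dictionary,
    with zeros for every dictionary word that is absent."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in dictionary]

# Assumed toy dictionary of six toki pona words.
dictionary = ["mi", "sina", "li", "pona", "moku", "soweli"]
vector = bag_of_words("mi moku li pona li pona", dictionary)
print(vector)  # [1, 0, 2, 2, 1, 0]
```

Note the zeros: in a six-word dictionary, the absence of "sina" and "soweli" is a third of the vector, which is the point about small languages above.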

Toki pona's adjective chains and pi chains are sort of like this. We clearly can project a chain-of-things or graph-of-things model with head words and successive groupings, but sometimes there are so many possible graphs or groupings that you might as well view the structure as a bag of words. A "pretty little girls school" has plausible interpretations for all possible graphs and groupings; might as well look at it as a bag of words that is about "pretty", "little", "girls" and "school", and not about octopi.

So a hypothetical small language that uses bag of words explicitly might work like this:

The basic unit is a word. Compound words are created per morphology and are explicitly joined. If we didn't join them, we'd have the jan pona problem: pairs of words that smell like lexemes but look like two words joined only by syntax.

The basic unit of syntax is a bag of words. One would need to see where a bag of words starts or ends, so you'd need audible parentheses, at least on one side, and at sentence start and finish. These markers should be consistently left- or right-branching, but not both. If they weren't, you'd get toki pona's parsing problems: sentence starts (no token; presumably uses English-like inflection or, in writing, inaudible punctuation) and la-phrase parsing (branches the wrong way).

The separator for each bag of words can act as a syntactic marker for whatever the metaphysical obsessions are, e.g. obsessing over who did it and who was on the receiving end (syntactic alignment), obsessing over who is experiencing it (an alternative to subject-object), or what-is-changing (polysynthetic languages, where the verb takes on all duties).

Bags of bags of words are unordered. This means we’d need a lot of bag markers.

The syntax explicitly allows parsers to project any structure you'd like over the words (i.e. you can force English adjective order onto what are semantically adjectives), but two bags of words with the same words are strongly equal. Any pattern detectable in one person's or another's bag of words is purely an implementation detail. If this rule isn't explicit, our brains will always create sentences with excess structure and find structure where there is none.
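
The "strongly equal" rule above can be sketched in a few lines: represent each bag as a multiset, and equality ignores whatever order the speaker happened to utter the words in. The example sentences are assumptions.

```python
from collections import Counter

def bag(words):
    """A bag of words is a multiset: order is an implementation detail."""
    return Counter(words.split())

# Same words, different speaker orderings: strongly equal.
assert bag("pona lili meli tomo sona") == bag("sona tomo meli lili pona")
# Counts still matter: a doubled word is a different bag.
assert bag("pona pona lili") != bag("pona lili")
print("bags compare equal regardless of word order")
```

Using a multiset rather than a set is a design choice: it preserves repetition (which could carry emphasis) while still erasing order.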

Obviation
Obviation is another audacious mechanism. When the collection of larger structures is undefined, a marker is made up to indicate that they reference the same thing. Anyhow, it just occurred to me as about the sloppiest way to coordinate a sentence.

x = BoW
BoW, BoW, x BoW

So a whole language can be made with:

x words, say 100
y bag markers, say 50-100

Likely usage pattern
Beginners would have large bags of words, and a discourse would have few bags. Experts would have many small bags of words and extensive obviation markers. Over time, the community would accidentally project more familiar structures over these bags of words, and we'd have (deep) tree-based syntaxes all over again.

Compound Words
So English has "goodbye", which most people would say is one word. But it used to be "god be with you", and somewhere between then and now the phrase turned into a single word. Professional linguists use a variety of tests to check whether a word is a word or a phrase made of words: for example, is the meaning opaque and not recoverable from inspecting the parts? Do you have to memorize it? Does it resist or prohibit being rearranged? Does it take a plural, e.g. "Say your goodbyes" vs. "? Say your god be with yous"?

(A similar process is going on with Icelandic "gerðu svo vel" and Swedish "varsågod/var så god", which, AFAIK, used to be the same phrase. In Icelandic, it is still sort of like a phrase; in Swedish, it is not a compound word any more, it is just a word with no internal syntactic structure at all.)

So toki pona has a lot of words of various degrees of opaqueness. If you don't memorize "jan pona", you won't necessarily guess it means friend. Also, as with English "black bird" vs. "blackbird", the compound word refers to something more specific than just any black bird; it refers to a specific species.

The phenomenon of "word building" using syntactically correct phrases exists, and linguists study it, but probably not as much as they should, because it is marginal in English, e.g. will-o'-the-wisp, chemin-de-fer. Notice that these real-life compound words are written with hyphens.

But we tell ourselves a lie to keep the number of words to 120, rather than being a bit more honest that to competently read toki pona, you need to memorize one or two thousand compound words.

Innovation.
We have all heard some accidental racist say "That savage tribe doesn't have a word for numbers" (because they are so stoopid, unlike myself). Toki pona has a running joke that toki pona doesn't have numbers. But this ignores the fact that anyone who understands math can think up a dozen ways to communicate exact numbers using just about any language, including one intentionally crippled to make it hard to do so. Toki pona was designed crippled, with only three root words for exact numbers (ala, wan, tu) plus some other root words suggestive of numeric, quantitative or logical things, like mute, suli, ale. Toki pona builds up new lexical units using compound words, e.g. tu wan, and by assigning new meanings to existing words, e.g. luka became 5; by analogy one can create a full body-part-based counting system. So for a language to *really* lack numbers, it *really* needs stupid speakers. All languages have the means to innovate and create new ways to express things.
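
As a sketch of how far three root words plus reassigned words can go, here is a greedy converter using the community's extended values (ale=100, mute=20, luka=5, tu=2, wan=1). Those values are community convention rather than official specification, so treat the table as an assumption.

```python
# Assumed community values for the "advanced" toki pona number system.
UNITS = [("ale", 100), ("mute", 20), ("luka", 5), ("tu", 2), ("wan", 1)]

def to_toki_pona(n):
    """Greedily decompose n into the largest available number words."""
    if n == 0:
        return "ala"
    words = []
    for word, value in UNITS:
        while n >= value:
            words.append(word)
            n -= value
    return " ".join(words)

print(to_toki_pona(42))  # mute mute tu
print(to_toki_pona(3))   # tu wan
```

The greedy strategy mirrors how speakers actually chain the words: biggest unit first, repeated as needed, like Roman numerals without the subtraction rule.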

Standards
What really is toki pona? Well, it doesn't have a governance structure like programming languages do. It doesn't have a standard implementation, because languages run on brains. Without a standards board, what toki pona is is how individuals use it. Corpus linguistics is a weird sort of evidence: something might never have been seen in the corpus of observed utterances, but people might feel that a never-before-seen plural is okay, or a certain error might be common but people are pretty sure it is an error. And then there is the great mass of data in between: words, phrases and constructions that appear in the corpus over and over. Those phrases really are the language, even if they violate the so-called official grammars. The official grammar is something of a small lie; it says "this grammar is the language", but really the language is the corpus of all utterances ever heard. Our brain translates that input into a tangle of brain cells. That tangle of brain cells is different for each person, and is capable of creating new toki pona sentences with a strong family resemblance to the initial corpus. At the moment, we can't observe or make sense of the specific configuration of each toki pona speaker's neural network to infer what the typical grammar *really* is. We do write formal grammars, and these formal grammars are a small lie, because the real grammar is too unwieldy to commit to paper.

Examples: people make decisions about the order of adjectives; these orders are sometimes strict, sometimes loosey-goosey, and these rules are just about never written into a formal grammar. Ordering of phrases (time-manner-place) is another area where the corpus has strong opinions but a formal grammar won't necessarily have strong opinions. (Or it might have a strong opinion; people are formalizing the hard-to-formalize parts of languages all the time, but it was written by a PhD in Linguistics and only he and his adviser really understand it. Turning a neural network into a formal language is no easy task for some corners of grammar.)

What to do? If one is making a new fake language, you should be thinking of making:

1) A corpus, preferably written by many people. It is rare to get many people to learn a fake language and write in it, though.
2) A formal grammar. It is handy for dealing with all the issues that a formal grammar deals well with.
3) Lots of examples showing the "correct"/"incorrect" ways of saying things. (Corpus linguistics, but with contrived texts with constructions explicitly marked as correct, as opposed to regular corpus texts that at best are "on average probably correct for most people".)

Esperanto is an invented language that is popular with people who enjoy learning languages and travel. It’s so successful that there are 2nd and 3rd generation speakers.

Passport Service (Passporta Servo) is a pre-internet bed and breakfast service aimed at Esperantists. It matched up people willing to host other Esperantists as a gesture of international goodwill and as an opportunity to use Esperanto. Passport Service still exists.

In terms of scale, AirBnB now dwarfs pre-internet services like Passport Service; nobody involved in Bed & Breakfasts of any sort can ignore the arrival of AirBnB. Even hotels can't ignore AirBnB. I've used AirBnB a few times and noticed you can filter hosts by what language they speak. (I will admit, sometimes it doesn't matter what language they speak; I've done AirBnB where I never met the host, because it wasn't a Bed and Breakfast so much as a small-scale hotel in the form of several homes.)

As a guest, I would go out of my way to find hosts who speak Esperanto, and odds are they would be interested in chatting in Esperanto.

As a host, I would probably continue to use Passport Service, especially if I were only interested in hosting other Esperantists. AirBnB isn't set up to let hosts be that picky about who is a guest.

Anyhow, I know that Esperanto is a rare language, but it is a rare language popular among the sort of people that would use AirBnB.

I wouldn't recommend Klingon, though; the Klingons track blood all over the place when they visit.

Seriously. Please add Esperanto to the list of host languages, it will create value.

The recommended way to use this, should I (or someone!) complete it and write a compiler, is that certain advanced toki pona users would write tp++ and compile it to ordinary, human-readable toki pona for posting long documents on forums. I suppose an on-the-fly compiler could be written for chatting. As such, anyone who doesn't want to think about tp++ doesn't have to; they can just read the compiled version. Could someone take tp++ source code and post it uncompiled to forums and mailing lists? Sure, but that would violate the spirit of this mini-project.

tp++ is a superset of tp
Almost all existing valid toki pona is valid tp++. A compiler may run in strict mode to make certain rules obligatory.
The compiled output of all tp++ is ordinary, valid toki pona.

Type Annotation
Numbers are annotated with a #. Depending on compiler settings, the compiler will compile numbers to stupid, half-stupid, advanced or poman numbers, since at the moment those are the only systems you see people use.
Dates can be proper modifier dates or any of the existing community proposals for dates. Dates would compile to suno/mun/sike suno with numbers as above.

Formatting
As a computer language, it needs to have a sense of formatting expressions. tp++ will use Markdown as the formatting syntax. HTML is too complex. Markdown has the advantage of either compiling down to nothing or to HTML.

Particle Innovations
There will be a set of particles with certain strong meanings that compile down to their weaker meanings.
mon compiles to pi when pi indicates personal ownership. soweli mon jan Mato. Matt’s cat.

Sentence busting.
Currently you can't easily put certain kinds of sentences together. You can have multiple subjects, multiple actions, multiple objects, but if you have a notion that requires a sentence, you can't embed it into one sentence; you have to split it into many sentences and rely on the reader to coordinate the ni's.

jan Mato li jo e soweli tanen soweli li ken moku e soweli lili ike.

compiles to

jan Mato li jo e soweli tan ni: soweli li ken moku e soweli lili ike.
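
A tp++ compiler rule like this one is just a rewrite over tokens. Here is a minimal sketch, assuming a rule table of regex-to-replacement pairs; "tanen" and its expansion come from the example above, but the rule-table mechanism is my assumption about how such a compiler might be organized.

```python
import re

# Assumed rule table: hypothetical tp++ particle -> plain toki pona idiom.
RULES = [(re.compile(r"\btanen\b"), "tan ni:")]

def compile_tp(source):
    """Apply each rewrite rule in order; output is plain toki pona."""
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return source

src = "jan Mato li jo e soweli tanen soweli li ken moku e soweli lili ike."
print(compile_tp(src))
```

Adding another particle means appending one more (pattern, replacement) pair; the compiler itself never changes.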

Arbitrary phrase ordering.
Currently, we can front phrases with la. In tp++, any phrase can be fronted by adding la to its head particle. En is the start of a subject phrase.

La phrases are not in a natural location and lack a "warning", creating unnecessary garden path parsings.

lani telo li anpa la mi li tawa ala.

compiles to

telo li anpa la mi tawa ala.

mi li tawa ala lani telo li anpa.

compiles to

telo li anpa la mi tawa ala.

Scope and Sentence Ends.
Sentence ends are obligatory and can be a period, semicolon, question mark, or exclamation mark.
! is emphatic.
?! or !? is surprising.
!! is a command. If the target is a machine, it should execute it.
? is a query. If the target is a machine, it should respond with matching facts.
?? is a rhetorical question and expects no response.
. is a fact. If the target is a machine it should be inserted into the current knowledge base.

Scopes begin with //[ and end with ]//. They compile down to nothing. Declarations exist only in a given scope.

Vocabulary
Proper nouns would be written in any supported natural language, at least including English. They would compile using a lookup dictionary for well-known proper nouns, and machine transliteration for unknown words.

Neologisms
Neologisms are words that the compiler hasn't seen before and that aren't expansions. For example, if official toki pona got the word apeja, it would be treated as a neologism by compilers written today. Since we want the compiler to continue to work, we need to let users specify that a word is a neologism and should be output as-is.

Expansions
Expansions are closely related to neologisms. An expansion is a word you invent that is automatically expanded into a valid toki pona fragment.

An expansion is a valid toki pona word that expands into a toki pona noun, verb or modifier phrase.
An expansion can be either contingent or noncontingent.
Contingent expansions can only occur in certain locations, such as verbs, or are different depending on whether they are a verb or a noun.
Noncontingent expansions always expand to the same phrase, no matter where they appear. This may cause problems when a phrase takes on an unexpected meaning in verb position, or when it is a modifier.

pasin li pona tawa mi.
//I like grain spirits.

compiles to

telo nasa pi kiwen lili pan li pona tawa mi.
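
Noncontingent expansions are the easy case to implement: a plain dictionary lookup per token. Here is a sketch; "pasin" and its expansion are the example above, while the dictionary mechanism is an assumption (contingent expansions would additionally need to know the token's syntactic position).

```python
# Assumed expansion table for noncontingent expansions.
EXPANSIONS = {"pasin": "telo nasa pi kiwen lili pan"}

def expand(sentence):
    """Replace each expansion word with its toki pona phrase, everywhere."""
    return " ".join(EXPANSIONS.get(word, word) for word in sentence.split())

print(expand("pasin li pona tawa mi."))
```

Because the lookup is position-blind, this is exactly where the "unexpected meaning in verb position" problem noted above would bite.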

Imports
@@ imports a text file, for example a collection of declarations.

@@tedious_variable_declarations.tp

Fragments
Fragments are permissible utterances so long as they start with a particle and end with a sentence terminator. They are included in the output as-is.

Comment fragments start with, contain, or are terminated by "…". These fragments are not parsed, because the ellipsis means the sentence is missing a word; for example, the transmission was interrupted. As such, a compiler can't be expected to make sense of one, but possibly a human could. So it behaves like a comment, except that it is included in compiled output.

Comments
Anything that starts with // is a comment. All comments are stripped from compiled text.
Block comments are between /* and */. These are also removed from compiled text.
Quotation marks (") delineate foreign text. Foreign text is preserved in output, but not parsed.
/// is toki pona that is a declaration or other toki pona that will not be a part of the final document.

Unnecessary irregularity
mi li and sina li are obligatory in tp++ and automatically compile to bare mi and sina.

mi li jan. //source code
mi jan. //result

Coordinated Values
A programming language does a lot of variable-to-value binding. For example, x=1+1 is evaluated and x binds to the value 2. In toki pona, there are not enough clues to allow for variable declaration, nor for binding. The closest thing we have is pronouns.

Declared Variables
Here is a declaration for the variable jan Mato for the scope of the entire application.

jan Mato: is the declaration, some noun phrase.
(jm) is the annotation. It has to be attached to each jan Mato or ona that refers to jan Mato, or else it refers to a different jan Mato.
The li chain is used for validating the pronoun. jan Mato can be represented by ona mija, ona wan. The animate/inanimate marker could be used for improving machine translation but would not necessarily change the toki pona output.

Imagine you had a script that took every English sentence and replaced "ain't" with "isn't". That would be a one-rule cross compiler. I think this idea is very powerful and applicable to evolving a conlang without actually teaching people fancy words like "isn't".
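
The one-rule cross compiler described above fits in a few lines. The word boundary matters: "paint" must survive untouched.

```python
import re

def cross_compile(text):
    """The whole compiler: one rewrite rule, superset English -> English."""
    return re.sub(r"\bain't\b", "isn't", text)

print(cross_compile("That ain't my cat."))  # That isn't my cat.
```

Everything the rule doesn't mention passes through unchanged, which is exactly the superset-to-subset property discussed below with TypeScript.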

All compilers create code that runs in a runtime or on an actual computer. Natural languages, like zombies, run on brains. If a runtime can't make sense of "isn't" (maybe they're from Alabama), then we would need a new runtime, or would have to train the person to understand the new syntax.

In the world of web browsers, there is a programming language called JavaScript. It executes code in your browser and might do one of a million things. The JavaScript runtime is the only runtime that exists on virtually every machine in the world, so software developers are keen to write for it. But they have to use the existing syntax, which is clunky and feature-impoverished, because the standard was set a very long time ago. People upgrade their browsers slowly. When a new feature is added to JavaScript, it needs to be added to about five different runtime flavors: Internet Explorer, Chrome, Firefox and so on. This takes huge amounts of time, so developers tend to target the oldest runtimes to make sure everyone's browser understands the syntax well enough to execute it. This barrier to progress has led to many strategies for dealing with the crappy but frozen language specification. (JavaScript's real politics are more complicated; I'm oversimplifying.)

In the world of languages, people learn a language's syntax and have a hard time dealing with innovations. Norwegian and Swedish are very close, but different enough for people to act like they are different languages. Changing a language by decree is pretty hard: there are so many runtimes (read: brains) out there that are already set in their habits using the old syntax, vocabulary and so on.

In JavaScript, a smart guy working for Microsoft wrote a cross compiler and language called TypeScript. Cross compilers translate one language to another. Here, the source language is a superset language.

A superset language is like the language of lawyers as compared to the language of children. A lawyer can understand a child's language, but the child is going to be blown away by the vocabulary and fiercely complex discourse of a lawyer. In computing, C++ is a superset of C: any C++ compiler can compile C code, but not vice versa; C++ has too many additions for C compilers to understand.

When a superset language compiles to a subset, things are erased or replaced with the corresponding idiom.

For example, in TypeScript, functions can signal to the compiler what type a variable is. This allows the compiler and tooling to catch mistakes.

function multiply(a: number, b: number) { return a * b; }

In this case, it is obvious that multiply doesn't involve breeding cats, but only numbers. You can see here why it is called type erasing: on compilation, some of the annotations and syntax are erased.

But we can't give TypeScript to browsers. No browser understands TypeScript. So we compile it down to:

function multiply(a, b) { return a * b; }

Note that it is exactly the same as ordinary JavaScript. It is not only executable by all existing browsers, but human readable as well. Some cross compilers achieve their result by creating something that runs but is otherwise wildly different from what handwritten code looks like.

How can we use this idea for a conlang? If the conlang is suitable for description with a formal grammar, then you can create a parse tree for a sentence. This allows you to do interesting things like colorizing certain words by part of speech, machine glossing to English, formatting as intralinear gloss and so on. But you still have to work in the constraints of the existing syntax.

When I wrote the parser for toki pona, I realized that it is extremely hard to identify prepositions and a few other situations. So I essentially created a few annotations and conventions, such as using # for numbers, putting commas before prepositions when used as prepositions, quotes for direct speech and foreign text, and dashes for compound words. These narrow the number of alternative parsings down to a manageable point where an amateur can create a parse tree for toki pona.

So where is this heading? Wouldn't it be cool to write a toki pona syntax that is a superset of existing toki pona, but compiles to ordinary toki pona readable by anyone with a basic understanding of toki pona?

With this sort of toki pona, you could write more tools for toki pona word processing and simplify certain steps. For example, one point of complexity is dealing with proper modifiers. There are thousands of cities, and their names are slightly different in each language. If they could be marked and written in regular English or French, then the toki pona compiler could automatically convert France to Kanse and Washington to Wasinton. This is just the tip of the iceberg; more in the next post.

What sort of mantra is worth reciting?
The number pi has nonrepeating digits, and is widely believed (though not proven) to contain every finite digit sequence. If that's so, then if you looked long enough, eventually you would find the digits that encode your name in ASCII. You would also find the digits that encode your picture as a JPG, a GIF and a PNG file. You would also find the string of pi digits that encodes all movies ever made, in all formats, both encrypted and decrypted, and so on.
There is a way to enumerate all rational numbers.

So I'm thinking about how one would create an enumeration of all possible toki pona sentences, excluding the uninteresting ones with infinitely repeating sections. That enumeration of all sentences would contain the biographies of everyone you know, and the answers to all the questions you've ever had. It would also include lies and slander, but mostly gibberish.

So let’s start enumerating!

Word li Word. There are 125*125 of these.
W li W [Prep W]. There are 125*125*6*125 of these. At 2 seconds per sentence, it would take a bit under a year to chant all these.
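
Checking the chanting arithmetic: with a 125-word lexicon and 6 prepositions, how long does the "W li W [Prep W]" pattern take at 2 seconds per sentence?

```python
# Counting the two sentence shapes above.
simple = 125 * 125               # "Word li Word"
with_prep = 125 * 125 * 6 * 125  # "W li W [Prep W]"

seconds = with_prep * 2          # 2 seconds per chanted sentence
days = seconds / (60 * 60 * 24)
print(simple, with_prep, round(days))  # 15625 11718750 271
```

About 271 days: a bit under a year, as claimed.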

Alternatively….

Sentences can be simple or compound, S, or S la S.
Sentences must contain a subject, a verb phrase, optionally some direct objects and optionally up to six different prepositional phrases.
Phrases can optionally have modifiers or pi chains.

So the whole of toki pona could be a chain of decisions starting with S and running until the maximum phrase size is reached. Enumerating systematically would result in a lot of similar sentences (ni li a. ni li pona. ni li soweli. ad nauseam). Enumerating them stochastically would be more interesting to read. Now, if we could map digits of pi to the choices in building up a sentence (i.e. compound or not, transitive or not, with prep phrase or not), then we could get a list of sentences that would eventually cover all possible toki pona sentences.

I just wrote a C# parser for toki pona. It uses an ad hoc grammar, meaning I didn't write a PEG or YACC or other formal description of the grammar and then process it into a parser. (I didn't use or write a compiler compiler.) Why? Because I kept feeling like a compiler compiler is aimed at the problem of translating one language to machine code. Also, I didn't get a comp-sci degree, so I don't actually follow how compiler compilers work.

From what I understand, I wrote an "order of operations" parser, which takes a paragraph, chops it at sentence breaks creating a string of sentences, then chops at the predicate marker creating subject/predicate arrays, and so on. This works for 90% of toki pona's parsing needs.
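
The first two chops described above can be sketched in a few lines (in Python here, rather than the original C#). This is only the happy path: real toki pona drops li after mi and sina, and e-phrases and prepositions need further chops, so treat this as the 90% case, not a full parser.

```python
import re

def parse(paragraph):
    """Chop a paragraph at sentence breaks, then chop each sentence at the
    predicate marker 'li' into a (subject, predicate) pair."""
    sentences = [s.strip() for s in re.split(r"[.!?]", paragraph) if s.strip()]
    return [tuple(part.strip() for part in s.split(" li ", 1))
            for s in sentences]

print(parse("soweli li moku. jan Mato li pona!"))
```

The appeal of the order-of-operations approach is visible even here: each chop is one line, and ambiguity is handled by simply chopping the highest-precedence marker first.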

Then other things came up and I stopped moving the C# parser forward. Now I'm learning C++, and mostly I keep thinking of how I could use C++ to do toki pona parsing. When I wrote the C# parser, I decided to favor powerful and expressive parsing over fast or general parsing (i.e. able to parse any language with a formal grammar). For example, if you have mixed English/toki pona texts, you can do dictionary lookups to determine which words are English. Dictionary lookups are computationally very slow. But C++ is supposed to be very fast, like 4x or more faster than C#.

After I wrote my parser, I wrote up some lessons learned.

1) Natural language processing is about processing discrete arrays of tokens. It superficially looks like string processing, but string processing means dealing with irrelevant white space, punctuation, capitalization and other crap that just gets in the way of dealing with higher-level concepts. You need to be able to do the same things you do with substrings, but with arrays of tokens. For example, finding a substring? Well, actually I need to find a sub-token-list. Need to do a string replacement? Actually, I need to do a token replacement. Surprisingly, arrays and lists don't normally support the full range of operations that string processing provides.
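
A sketch of the missing operation, in Python rather than C# or C++: a "find" over token lists that behaves like str.find over strings, returning the index of the first occurrence of a sub-token-list, or -1.

```python
def find_tokens(tokens, sub):
    """Return the index where sub first occurs in tokens, or -1.
    The token-list analogue of str.find."""
    for i in range(len(tokens) - len(sub) + 1):
        if tokens[i:i + len(sub)] == sub:
            return i
    return -1

tokens = "jan pona li moku e kili".split()
print(find_tokens(tokens, ["jan", "pona"]))  # 0
print(find_tokens(tokens, ["e", "kili"]))    # 4
print(find_tokens(tokens, ["soweli"]))       # -1
```

Token replacement, token split and the rest of the string-like API follow the same shape, which is why it's worth building once as a library.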

2) C++ allows for fast cache pre-fetching if you stick to memory-aligned data structures. In other words, if the words have to be looked up all over memory, you get the speed of main memory; but if the data moves through the pre-fetch cache, you get the speed of the cache, which is like 100x (1000x?) faster. In C#, I have no idea what the memory layout of my data is, so I can't exploit this. But since all my data is an adjacent stream of data, I should be able to exploit this in C++.

3) Perf optimization in C# was limited to looking at which routines used the most time. In my case, it was substring checks, which got faster after I told C# to stop taking international considerations into account when doing substring checks (the .NET framework was doing extra work just in case the string was German or Chinese). My other attempts to improve perf made no impact; memoization had no impact.

4) I know that stream processing is more efficient than the alternatives (i.e. if your data structure is a stream that the CPU does an operation on, then moves to the next, etc., as opposed to, say, a tree). My C# code encouraged using data structures that aren't streams. C++'s string library seems to encourage treating all strings as streams, i.e. more like C# StringBuilders.

5) My C# code works great on a server. But if I want to give it away, I'd have to create a REST API. What would be even better is if I could give away a JavaScript library. Then people could use it along with their favorite web framework, be it Python Flask, Ruby on Rails, or what have you. As it happens, reasonably efficient C++-to-JavaScript cross compiling has appeared on the scene.

6) My C# code was very toki pona centric. At the end, I could see which parts of the library could have been used by any conlang project.

7) The C# parser didn't have the concept of a runtime. When I speak English, the runtime is my human brain: I hear a cake recipe and that moves me to make a cake. I almost created a runtime that represented a sort of database of sentences that could be queried, but I didn't get far because I didn't succeed in making a template matcher.

8) Speaking of templates, templates were not first-class citizens. Imagine that the grammar had this seemingly unnecessary set of rules:

jan pona ==> jan-pona (the collocation translates to "friend" when it comes time to use the parse tree in the "runtime", which in my case was just text colorization and English glossing)
mi pilin e ni: S ==> complex-sentence. The other rules of the grammar can generate this pattern, but it looks like we humans use these templates as first-class citizens; we use them too often and too predictably to imagine that we thought up the template on the spot. The template has slots, just like a complex verb in other languages, e.g.
mi [pronoun modifier] [li] pilin [adverb] e ni: S ==> complex-sentence.
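
The slotted template above can be made first-class by compiling it to a pattern with named, optional slots. Here is a sketch in Python (my failed C# attempt aside); the slot names and the regex encoding are my assumptions, not part of any toki pona spec.

```python
import re

# The template "mi [pronoun modifier] [li] pilin [adverb] e ni: S",
# with each bracketed slot optional and S capturing the embedded sentence.
TEMPLATE = re.compile(
    r"^mi(?: (?P<mod>\w+))?(?: li)? pilin(?: (?P<adverb>\w+))? e ni: (?P<S>.+)$"
)

def match_template(sentence):
    """Return the filled slots if the sentence instantiates the template."""
    m = TEMPLATE.match(sentence)
    return m.groupdict() if m else None

print(match_template("mi pilin e ni: soweli li pona."))
print(match_template("mi mute li pilin pona e ni: ona li kama."))
```

A runtime could then index stored sentences by which templates they match, which is roughly the "retrieve sentences of a similar template" goal below.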

So here are my C++ goals:
1) An API that works with arrays of tokens as expressively as the string API works with char arrays.
2) Templates.
3) Immutable data structures.
4) Relies on knowledge (i.e. not just grammar rules, but long lists of collocations, English dictionary lookups– mini databases).
5) Has a runtime, and all utterances can manipulate the runtime (i.e. statement == store this sentence in the knowledge base; special statements ==> remove this sentence from the knowledge base, update a sentence, retrieve sentences of a similar template).
6) Still supports colorization scenarios, DOM-like APIs, etc.
7) Works for two languages– I choose Esperanto and toki pona, only because they are well documented.
8) Community-corpus-driven standards for “done” and “correctness”, i.e. it works because it works with typical community texts.
9) Will not try to deal with the problem of incorrect texts. (Doing parsing transformations that turn invalid text into the intended text is still too hard.)
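Goals 1 and 3 combine naturally. Here is a sketch (in Python for brevity, since the C++ library doesn’t exist yet) of a token sequence that supports string-like operations while staying immutable– every operation returns a new sequence. The class name and methods are my own invention.

```python
# Sketch of goal 1 + goal 3: a token sequence with string-like
# operations (find, replace) over whole tokens, kept immutable.
class TokenSeq:
    def __init__(self, tokens):
        self._tokens = tuple(tokens)          # immutable storage

    def find(self, sub):
        """Index of a sub-sequence of tokens, like str.find; -1 if absent."""
        n, m = len(self._tokens), len(sub)
        for i in range(n - m + 1):
            if self._tokens[i:i + m] == tuple(sub):
                return i
        return -1

    def replace(self, old, new):
        """Return a NEW TokenSeq with the first occurrence of old replaced."""
        i = self.find(old)
        if i == -1:
            return self
        return TokenSeq(self._tokens[:i] + tuple(new) +
                        self._tokens[i + len(old):])

    def __str__(self):
        return " ".join(self._tokens)

s = TokenSeq("jan pona li kama".split())
print(s.replace(["jan", "pona"], ["jan-pona"]))  # jan-pona li kama
```

The payoff of working over tokens instead of characters is that "jan pona" can never accidentally match inside another word, the way substring search on raw text would.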

1) Small lexicon. Incompletely described languages also have small lexicons; Klingon falls into this category. The lexicon can grow: Esperanto on day 365 was a small language. Something like a century later, it is a large-lexicon language.

2) Closed lexicons. All (?) languages exhibit the feature where some classes of words are closed, e.g. prepositions in English– you can’t make up your own. Proper nouns in English need only follow the phonotactic rules; make ‘em up all day. If a lexicon is small and closed, then there are still new lexemes, but they will be made of recognizable parts. It’s sort of like: you can’t use new ingredients, but if you make a new recipe, you have to show the recipe. The recipe could still be incoherent.

3) Small distance to your native tongue. This is what really makes a language easy. A condialect would be the easiest. The maligned relex is small in the sense that you really just need a lexicon and the rules for (possibly mechanically) mapping grammar from one language to the other.

4) Small phonetic inventory. This doesn’t make a language especially easy, though– cf. Hawaiian, with its long words of repeated vowels.

5) Small syntax. Regularization reduces size in two ways: irregular morphology can be looked at either as lexical syntax (a new word, say, for each form of a certain tense) or as a complex set of rules with exceptions, and exceptions to the exceptions. However, one of the magic things about syntax is that a small number of rules can, in the right hands, make a massive maximal sentence with enough complexity to be hard to read– and sometimes one more rule would make certain areas of complexity go away. This is essentially the story of the evolution of modern computer programming languages.

Verbs have valency, which is how many “arguments” they take. For example, intransitive means no arguments (or only one, the subject); transitive means one (or two, subject and object). So a typical lojban verb (gismu) works like this:

I eat an (object), with a (tool), like (similar object), at (a location), for (some goal, benefit), from (some place or causal reason), with (a collaborator). That is it. There are no more slots. Compare this to English, where we have many more slots via a much larger list of prepositions.
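The fixed-slot idea is easy to express as data. This sketch uses the slot names from my paraphrase above, not any official lojban place structure– the point is only that the verb carries a closed, ordered tuple of places, unlike an open-ended list of prepositions.

```python
# Fixed verb slots as data: every verb shares one closed tuple of places.
# (Slot names follow the paraphrase in the text, not real gismu definitions.)
PLACES = ("agent", "object", "tool", "similar", "location",
          "goal", "origin", "collaborator")

def frame(verb, *args):
    """Pair positional arguments with the fixed slots; unused slots stay None."""
    filled = dict(zip(PLACES, args))
    return {"verb": verb, **{p: filled.get(p) for p in PLACES}}

f = frame("eat", "mi", "kili", "ilo")
print(f["object"], f["location"])  # kili None
```

Because the slot list is closed, the parser never has to guess what a slot means– position alone decides, which is exactly the trade-off against English’s large open class of prepositions.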

The toki pona verb always follows the same pattern; the lojban one has different meanings for different slots depending on the verb (gismu).

Of course the toki pona phrase can be re-arranged, except for the e phrases, which must come first. Also, routinely, the prepositional phrases can be modifiers for a single content word, which doesn’t have an analogue in lojban, AFAIK (which isn’t much).

Also, another observation: if there are only 6 slots, marked by a particle on the head verb, wouldn’t these turn into case markers within, like, a single generation of human use?

I read about dictionary making for Algonquin, a highly synthetic language with few unbound morphemes. Everything of interest is a bound morpheme. Full words necessarily drag along with them a lot of other cruft– as if a dictionary had a definition for unsympathetically, but the word sympathetic wasn’t allowed to be a standalone word.

Surprisingly, toki pona is like that. toki pona has compound words, which, if you are a grumpy cat, you can call collocations (words that appear together commonly)– or just call them compound words, because they behave rather like two-stem words in languages with bound morphemes. Beyond that, we have “templates”.

Noun Phrases (content phrase)
jan pona. This is a perfect compound word. It takes modifiers, resists splitting, and it has two “slots”– stuff goes before it and after it.

These phrases have little internal structure. These are useful for machine parsing, the traditional dictionary just works. You could look up words by their head word and life is beautiful.

kin la. == really. Also a good compound word: it has two slots– you can put more la phrases before, a sentence after, and that is it.

Verb Phrase
Verb phrases are closer to templates because the head verb is one word.
kama sona. This isn’t a perfect compound word; it has three slots: [0] kama [1] sona [2]. The head verb is still kama, and you can add modals before, and negation, intensification and adverbs after kama. Stuff after sona describes sona, not the kama sona phrase.

Templates are a lousy fit for a traditional dictionary. The head word could be in a variety of places. Sometimes the template doesn’t rely on any specific word, e.g.

I don’t even know where to put that in dictionary alphabetical order. I feel like I’m back in Algonquin again.
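One way around the alphabetization problem– my own sketch, not an established lexicographic practice– is to index each template under every fixed word it contains, so lookup works no matter where the head word sits, with word-free templates sharing a single bucket.

```python
# Index templates under each fixed word they contain, so a dictionary
# lookup works regardless of where the head word sits in the template.
from collections import defaultdict

index = defaultdict(list)

def add_template(template):
    fixed = [t for t in template if not t.startswith("[")]
    for word in fixed or [""]:       # word-free templates go under ""
        index[word].append(template)

add_template(["kama", "[neg]", "sona", "[mod]"])
add_template(["mi", "pilin", "e", "ni:", "[S]"])
print([" ".join(t) for t in index["sona"]])  # ['kama [neg] sona [mod]']
```

A reader looking up either kama or sona finds the same entry, which is roughly what a dictionary of bound morphemes has to do anyway.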

Sentences
mun li pimeja e suno. Eclipse. This almost doesn’t feel like a template anymore. To use it in a sentence requires extensive rework. It has at least 4 template points, not counting all the optional things available to the maximal sentence.

Other patterns.
kule lon palisa luka. Fingernail polish. This is also a template with significant internal structure.

Advice:
Keep the templates separate from untemplated definitions.
Be explicit about the slots in templates.

Unrelated advice:
Be wary of unwarranted glosses and translations.
jan Sonja said telo is sauce, so I guess it is.
If I say telo means rocket fuel, it’s an unwarranted translation unless there is some text to set that up.

The problem: paragraph breaks in source texts are unreliable– e.g. a tab starting a paragraph is actually a few spaces. Sometimes those spaces disappear, and sometimes they are just spaces.
jan ilo li wile pali.

Solutions:
1) Explicit paragraph marker– for example, four dashes centered on a page, like the divider you see in some novels between “scenes”.
2) Assume a double space is a paragraph. This is wrong a lot of the time.
3) Synthetic paragraphs. Apply rules such as this: any sentence ending in ni: is in the same paragraph as the following sentence. Any vocative followed by a sentence is in the same paragraph. Quoted text initiates a new paragraph. However, I suspect this would be a lot of work and would fail, resulting in too many synthetic paragraphs that ‘consume’ the entire text.
4) Ignore the problem and turn everything into a series of sentences, or a huge single paragraph.
5) Two parsing modes: Strict and Loosey-Goosey. In strict mode, paragraphs are started by tabs. In Loosey-Goosey mode, tabs and blank lines are assumed to be paragraph breaks, and it is just accepted that this will be wrong a lot of the time.
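The strict versus loosey-goosey idea can be sketched in a few lines. This is a toy splitter of my own, not production code: strict mode trusts only tabs as paragraph starts, while loose mode also treats blank lines as breaks and accepts the resulting errors.

```python
# Toy paragraph splitter with the two modes described above:
# strict trusts only tabs; loose also breaks on blank lines.
def paragraphs(text, strict=True):
    paras, current = [], []
    for line in text.splitlines():
        starts_new = line.startswith("\t") or \
            (not strict and line.strip() == "")
        if starts_new and current:
            paras.append(" ".join(current))   # close previous paragraph
            current = []
        if line.strip():
            current.append(line.strip())
    if current:
        paras.append(" ".join(current))
    return paras

text = "jan li pona.\n\tjan ilo li wile pali.\n\nsina pona."
print(len(paragraphs(text, strict=True)))   # 2
print(len(paragraphs(text, strict=False)))  # 3
```

On the sample text the two modes already disagree, which is the whole problem in miniature: neither answer is provably right without knowing what the author intended.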

So I was wondering what a parser for Esperanto would look like if I only used the 14 rules.

I re-read them and quickly decided that, at the time of the 14 rules, the bulk of the language specification must have been in the dictionary and sample texts. Zamenhof didn’t know how to write a formal grammar; he’d have had to live another 50-75 years before formal grammars were in the popular imagination.

One fascinating and Na’vi-like feature of Esperanto is that modifiers can lead or follow the noun they modify. A lorem ipsum generator could help test whether these possibilities are workable. I suspect not– in a maximal phrase you wouldn’t be able to coordinate modifiers with what is being modified. I could be wrong, so let’s write a parser and find out.

Stems. Esperanto has something like 800 stems. This is a small lie of Esperanto, because with borrowing this has since turned into 8000+.

Words. Words are prefixes plus one or more stems plus derivational suffixes plus grammatical suffixes, which include part-of-speech suffixes.

Sentences. Sentences, it appears, are unordered collections of phrases. This is a little lie of Esperanto, because in practice people follow an order rigid enough to make the accusative unnecessary. Sentences can contain other sentences.

There is more to it, but I think I can write a mini-parser with just the above.
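The word rule above is the easiest piece to sketch. This toy splitter (my own, with a deliberately abbreviated ending list– real Esperanto has more endings, plus prefixes and derivational suffixes) peels the grammatical suffix off the end of a word, leaving the stem.

```python
# Mini version of the Esperanto word rule: peel the grammatical
# suffix off the end, leaving the stem. Ending list is abbreviated.
GRAMMATICAL = ["ojn", "oj", "on", "o", "ajn", "aj", "an", "a",
               "e", "as", "is", "os", "us", "u", "i"]

def split_word(word):
    for suffix in GRAMMATICAL:           # longest endings tried first
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""

print(split_word("hundojn"))  # ('hund', 'ojn')
print(split_word("parolas"))  # ('parol', 'as')
```

Because the endings also carry part of speech, this one function already recovers most of what the sentence-level parser needs– which is why the 16-word ending table does so much work in Esperanto.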

So I can open a novel and, as part of being a human raised in an English-speaking community, I pretty much understand everything. I can open a textbook on calculus or logic, and while I can read the whole thing in English– there are even awkward but grammatically correct ways to read off the formulae– I’m not going to understand it just because I know English. I think this is some pretty conservative evidence that math and logic are not really natural languages; they are more like a foreign language embedded in a natural language.

So I was trying to deal with conjunctions in toki pona. Sometimes they are made unnecessary by the “chain pattern”– one similar structure after another implies “and”. Sometimes they indicate discourse connectors, by tagging a sentence with “or” or “but”. Those two forms of logic are effortless to parse (except when people ignore the chain pattern and try to explicitly add “and” words). Finally, we get these monsters:

How to parse this? I have no idea; it reads like a logic puzzle, and you’d have to introduce a foreign logic system to do something with it. It looks syntactically valid. So I’m thinking my parser should represent a modifier chain as above, but make no claims about what it means. It parses one way, and if someone (ha! unlikely) ever decided to implement a logic subsystem, they could take this parse and transform it into all the possible meanings, truth tables and so on.

But for these applications, we don’t care:

Grammar check– it’s valid syntax.
Glossing– it glosses to English, and is equally ambiguous and unintelligible in English.
Syntax highlighting– you only need to recognize an “and”/“or”/“but” sequence to color the text; you don’t need to know what it means or parse it as just one parse tree.
Chat bot– a chat bot would never explore these corners of possible meaning in the universe of representable meanings that toki pona can represent.
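The syntax-highlighting case really is that shallow– a sketch, with made-up color names, shows all the machinery it needs: tag the conjunction words and leave everything else plain, no parse tree required.

```python
# Syntax highlighting for conjunctions: tag the word, skip the parse.
# (Color names are placeholders, not any real highlighter's palette.)
CONJUNCTIONS = {"en": "green", "anu": "orange", "taso": "red"}

def colorize(sentence):
    return [(tok, CONJUNCTIONS.get(tok, "plain"))
            for tok in sentence.split()]

for tok, color in colorize("jan li pali anu moku"):
    print(tok, color)
```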

Other Observations.

1) * jan li kepeken ilo en kepeken soweli. (Don’t use en to combine prep phrases.)
2) */? jan li tawa en kama. (Don’t use en when you can use li– but if this was a modifier chain and a predicate sentence, then it’s probably okay.)
3) * jan li kepeken ilo anu kepeken soweli. (Don’t “or” prep phrases.)
4) * jan li moku e ilo anu e soweli. Don’t use both anu and e; don’t use both taso and e. [Update: changed to moku, because kepeken has had some recent POS confusion from toki pona version pu.]
5) */? ante jan li kepeken e ilo. Don’t use anything but anu or taso as a tag-conjunction.
6) * en jan li kepeken ilo. Don’t start a sentence with en. (En is implied, although it would have made for a nice audible sentence demarcation.)
7) ? waso pi laso en pimeja li pona tawa mi. This is really hard to parse. “And”-ing modifiers in the subject slot is only sometimes distinguishable from mistakes and from “and”-ing subjects.
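Some of these rules are mechanical enough to check without a full parser. Here is a toy encoding of rules 1, 3 and 6– my own sketch over raw token pairs; a real checker would work on parse trees and cover all seven.

```python
# Rules 1, 3 and 6 above as mechanical checks over raw tokens.
# (A real grammar checker would operate on parse trees instead.)
def check(sentence):
    toks = sentence.split()
    errors = []
    if toks and toks[0] == "en":
        errors.append("rule 6: sentence starts with en")
    for a, b in zip(toks, toks[1:]):
        if a in ("en", "anu") and b == "kepeken":
            errors.append(f"rule 1/3: {a} joining prep phrases")
    return errors

print(check("en jan li kepeken ilo"))                 # rule 6 fires
print(check("jan li kepeken ilo en kepeken soweli"))  # rule 1/3 fires
print(check("jan li moku e kili"))                    # []
```

Even this crude pair-scan catches the starred examples above; the */? cases like 2) and 7) are exactly the ones where token-level rules stop working and you need the parse tree.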