
word prediction using openfst/opengrm-ngram

I would like to write an Android app that does word prediction and completion as you type, using an n-gram language model.

I find a lot of references on Google on how to build language models and evaluate their perplexity (srilm, mitlm, etc., and here, openfst and opengrm-ngram), but not a lot on how to apply them to a word prediction problem in practice. I am completely new to WFSTs. Is there perhaps a standard recipe for word prediction using openfst and opengrm-ngram?

From what I can gather, it seems that you must build the language model fst, then compose it with an fst that is built from the seen word sequence up to the point in question, and finally do a shortest path search. Perhaps something along the lines of the following? :

How would the resulting words.fst be processed to obtain the n-best predicted words? Perhaps using something like? :

<verbatim>
fstproject --project_output
</verbatim>
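
For instance, maybe a pipeline along these lines (hypothetical file names; just a sketch of the idea, I'm not sure of the exact flags):

<verbatim>
# take the n best paths, keep only the output labels, and print them
fstshortestpath --nshortest=5 words.fst |
  fstproject --project_output |
  fstrmepsilon |
  fstprint --osymbols=lm.syms
</verbatim>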

Regarding Android, I see that they have packaged openfst for the platform. Is a language model fst created by opengrm-ngram completely "backwards-compatible" with openfst, so that I can load the model fst and do composition and shortest path search with only the openfst API in Android?

There are some wrinkles to using a language model (LM) FST in the way you describe. But as to your last question: the language models that are produced are simply FSTs in OpenFst format, so you should be able to use them via the OpenFst API in Android.

The problem with the fstcompose approach that you detail is that the OpenGrm LM FST provides probabilities over whole sequences, not just the next word. One approach to building words.fsa would be to make it represent abc(sigma), where abc is your history and sigma ranges over your possible next words. But the probability that you would derive for each x would be for the string abcx, where x is the last word in the string; it would not include the probability of abcxy. Instead, to get the right probability over all possible continuations, you want words.fsa to represent abc(sigma)*. That would give you the right probability mass, but unfortunately it would be expensive to compute, especially after each word is entered.

The better option is to use the C++ library functionality to walk the model with your history, find the correct state in the model (the model is deterministic for each input when treating backoff arcs as failures), then collect the probabilities for all words from that state, following backoff arcs correctly to achieve appropriate smoothing. This requires being very aware of exactly how to walk the model and collect probabilities for all possible following words, then finding the most likely ones, etc., efficiently. It is possible, but non-trivial. As with other OpenFst functionality, it is there for you to create your own functions, but there is not much in the way of hand-holding to get you there.

Very interesting application, though; by the end you'd understand the FST n-gram model topology well and be able to build other interesting applications with it. Hope that answers at least some of your questions.

The 'farcompilestrings' command appears to work, but ngramcount fails with the error "ERROR: None of the input FSTs had a symbol table". I've tried with '-keep_symbols=0' and without any '-keep_symbols' argument, and the results seem to be the same.

I tried creating a symbol file with the UTF-8 characters in my 'corpus', but farcompilestrings doesn't like the space symbol in that file (it reports 'ERROR: SymbolTable::ReadText: Bad number of columns (1), file = animals.char.syms, line = 5:<>').

It seems like there must be an option to tell ngramcount to use bytes or UTF-8 characters as symbols (analogous to farcompilestrings '-token_type=utf8'), but I haven't found it.

ngramcount wants an explicit symbol table, which is why the utf8 token_type for farcompilestrings isn't working. And, correct, farcompilestrings uses whitespace as the symbol delimiter, so representing spaces is a problem. Agreed, it would be nice to have that option in ngramcount, perhaps in a subsequent version... In the meantime, we generally use underscore as a proxy for space for these sorts of LMs. So, convert whitespace to underscore, then whitespace delimit and run as with a standard corpus. Then you just have the chore of converting to/from underscore when using the model. As an aside, you might find Witten-Bell to be a good smoothing method for character-based LMs, or any scenario with a relatively small vocabulary and a large number of observations. You can set the witten_bell_k switch to be above 10, and that should give you better regularization. Hope that helps.
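
A rough sketch of that recipe (hypothetical file names; assumes GNU sed in a UTF-8 locale so that '.' matches a whole character, and order 5 is just an example):

<verbatim>
# replace spaces with underscore, then whitespace-delimit the characters
sed -e 's/ /_/g' -e 's/\(.\)/\1 /g' corpus.txt > corpus.chars
# build a symbol table, compile to a FAR, then count and make the model
ngramsymbols corpus.chars chars.syms
farcompilestrings --symbols=chars.syms --keep_symbols=1 corpus.chars > corpus.far
ngramcount --order=5 corpus.far > corpus.cnts
ngrammake --method=witten_bell --witten_bell_k=15 corpus.cnts > corpus.mod
</verbatim>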

It looks like ld is not finding the OpenFst library shared objects. It would be great if you could show us the ld command line that make was trying to run at that time. Also, could you tell us which configure/make commands/parameters you used to install OpenFst and to try to compile OpenGrm? Finally, which OS are you using?

Epsilon transitions are for backoff, and the weights on those arcs are the negative log of alpha, where alpha is the backoff constant that ensures normalization over all words leaving that state. Ideally, those backoff transitions are interpreted as failure transitions, i.e., only traversed if the symbol does not label a transition leaving the state. See the ngramapply command for an example of the use of a phi matcher to handle failure transitions.
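
For instance (hypothetical file names), applying the model with backoffs treated as failure (phi) transitions:

<verbatim>
# --bo_arc_type selects how the backoff arcs are interpreted during composition
ngramapply --bo_arc_type=phi lm.mod input.far > scored.far
</verbatim>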

Models without <s> or </s>

To build an automaton that has valid paths, you need to specify some kind of termination cost, which is what the final symbol allows for. So, you can't really do without having an entry for end-of-string in your raw n-gram counts; otherwise, it will represent this as an automaton that accepts no strings (empty) since no string can ever terminate. If you have a large set of counts without start or stop symbols, you should just add a unigram of the end-of-string symbol with some count (maybe just 1). The start symbol you can do without (though it is often an important context to include). Hope that helps.

If you run it in verbose mode (ngramperplexity --v=1, as shown in the quick tour), then you can see the -log probability of each item in the string. You can then read this text into your own code to calculate perplexity in whatever way you wish. However, the end-of-string marker is typically included in standard perplexity calculations, and should be included if you are comparing with other results. There are no options to exclude it in the library. Hope that helps.

ngramapply only applies the LM weights to your input token sequence, it does not compute perplexity.

The total weight output by ngramapply for each string should be equal to the logprob (base 10) generated in the verbose output of ngramperplexity (but note that the ngramapply value is a negative log using base e).
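
For example, a string whose verbose ngramperplexity output shows a logprob (base 10) of -3.0 corresponds to a total ngramapply weight of about 3.0 * ln(10) ≈ 6.9078.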

Experimenting with ngrammarginalize

I have been experimenting a bit with the new ngrammarginalize tool - very cool. I noticed that in some cases it can get stuck in the do{}while loop in the StationaryStateProbs function.

I trained a model using a small amount of data and witten_bell smoothing + pruning. It may be worth noting that the data is very noisy.
I have been digging around in the paper and source code a bit to figure out what might be the issue. The issue seems to be here:

where last_probs[st] can occasionally be a very small negative value. If I am following the paper correctly, this corresponds to the epsilon value mentioned at the end of Section 4.2. (I'm not entirely convinced I followed correctly to this point though so please correct me if I'm wrong).
Anyway, if last_probs[st] turns out to be negative, for whatever reason, then there is a tendency to get stuck in this block. This also seems to be affected by the fact that the comparison delta is computed relative to last_probs[st], so if last_probs[st] happens to get smaller with each iteration, then the comparison also becomes more 'sensitive', in the sense that a smaller absolute difference will still evaluate to 'true'. So as we iterate, it seems that some values become more sensitive to noise (maybe this is the numerical error you mention?) and the process gets stuck.

I noticed that if I set the comparison to an absolute:

<verbatim>
if (fabs(fabs((*probs)[st]) - fabs(last_probs[st])) > converge_eps)
</verbatim>

then it always terminates, and does so quickly for each iteration, even for very small values of converge_eps.

I have not convinced myself that this is a theoretically acceptable solution, nor have I validated the resulting models, but it does seem reasonable at a superficial level.

For the second issue, I would be interested in seeing the original ARPA format file and following it through to see what the potential issue might be. So please email or point me to that if you can.

For the first issue, I have seen some problems with the convergence of the steady-state probs for Witten-Bell models in particular, which is due, I believe, to under-regularization (as we say in the paper). Your tracing this to the value of last_probs[st] is something I hadn't observed, so that could be very useful. In answer to your question, I wonder if there's not a way to maintain the original convergence criterion but control for the probability falling below zero (presumably due to floating point issues). One way to do this would be to set a very small minimum probability for a state and set a state's probability to that minimum if the calculated value falls below it. Since last_probs[st] gets its value from (*probs)[st], it would never fall below zero either. I'll look into this, too, thanks.

Hi, The 2nd issue was my own embarrassing mistake. I was setting up a test distribution to share, and realized that I still had an old, duplicate version of ngramread on my $PATH. This was the source of the conversion issue. When I switched it, things worked as expected for all models.

When you apply the model to new text, you are typically using farcompilestrings to compile each string into an FST. It has a switch (--unknown_symbol) to map out-of-vocabulary (OOV) items to a particular symbol. The language model then needs to include that symbol, typically as some small probability at the unigram state. There are a couple of ways to do this -- see the discussion topic below entitled: Model Format: OOV arcs.
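
For instance (hypothetical file names, and assuming the model's symbol table already contains an "<unk>" entry):

<verbatim>
farcompilestrings --symbols=lm.syms --keep_symbols=1 --unknown_symbol="<unk>" test.txt > test.far
</verbatim>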

Thank you for your answer. I am no expert on FSTs, but I guess it should be possible to obtain the same result by somehow filtering the complete FST, through composition (?) with another FST. After all, it should look like an FST where some paths have already been walked through. Please excuse my vagueness.

Yes, this is the way to think about it. The complication comes from the random sampling algorithm, and the way that it interprets the backoff arcs in the model. The algorithm assumes a particular model topology during the sampling procedure, which will not be preserved in the composed machine that you propose. But, yes, essentially that is what would be done to modify the algorithm.

skip-grams

Skip k-grams generally require a model structure that is trickier to represent compactly in a WFST than standard n-gram models. This is because there is generally more than one state to back off to. For example, in a trigram state, if the specific trigram 'abc' doesn't exist, a standard backoff n-gram model will back off to the bigram state and look for 'bc'. With a skip-gram, there is also the skip-1 bigram 'a_c' to look for, i.e., there are two backoff directions. So, to answer your question, it is something we've thought about and are thinking about, but nothing imminent.

calculating perplexity for >10 utterances using example command

I am a newbie who wants to calculate perplexity for a text file consisting of more than 10 lines. The examples provided for that only work for < 10, and it is not obvious to me how to bypass that. (Yes, I know I can use a loop over the utterances; I just hope I do not have to.)

And specifying a text file as the first parameter for farcompilestrings does not work either:
<verbatim>
FATAL: FarCompileStrings: compiling string number 1 in file test.txt failed with token_type = symbol and entry_type = line
FATAL: STListReader::STListReader: wrong file type:
</verbatim>

I could not find related usage info. I'd appreciate some help.
Thanks!

The --generate_keys=N switch for farcompilestrings creates numeric keys for each of the FSTs in the FAR file, using N digits per key. (There is one FST for each string in your case.) So with N=1 you can index 0-9 strings, with N=2 you can index 0-99 strings, etc. For your example, you just need to set your --generate_keys argument to the number of digits in the total corpus count.
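
For example, with four digits you can index up to 10,000 strings (hypothetical file names):

<verbatim>
farcompilestrings --symbols=lm.syms --keep_symbols=1 --generate_keys=4 corpus.txt > corpus.far
</verbatim>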

One more question, just so that I can see more clearly how ngramperplexity works.
When I specify the "--OOV_probability=0.00001 --v=1" arguments, I get "[OOV] 9" for "ngram -logprob" in the output when using unigrams, and somewhat higher values for longer n-grams. What is the reason for this? (I guess I do not yet see how exactly the perplexity calculation works.)

The model will only have probability mass for the OOV symbol if there are counts for it in the training corpus. ngramperplexity does have a utility for including an OOV probability, but this is done on the fly, not in the model structure itself. If you want to provide probability mass at the unigram state for the OOV symbol, you could create a corpus consisting of just that symbol, train a unigram model, then use ngrammerge to mix (either counts or model) with your main model. Then there would be explicit probability mass allocated to that symbol. You can use merge parameters to dictate how much probability that symbol should have. Hope that helps.
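
A rough sketch of that recipe (hypothetical file names; "<unk>" must already be in the symbol table, and the mixture weights here are purely illustrative, if I have the switches right):

<verbatim>
echo "<unk>" > oov.txt
farcompilestrings --symbols=lm.syms --keep_symbols=1 oov.txt > oov.far
ngramcount --order=1 oov.far > oov.cnts
ngrammake oov.cnts > oov.mod
# mix the main model with the OOV unigram model
ngrammerge --alpha=1.0 --beta=0.001 main.mod oov.mod > merged.mod
</verbatim>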

A related question: would it be sensible to replace some (or all) singletons in the training corpus with the OOV symbol, thus learning OOV probabilities in context instead of solely at the unigram level? Is that approach likely to improve LM performance on unseen data?

Hi,
I built a 3-gram language model on a few English words.
My C++ program receives streaming characters one by one.
I would like to use the 3-gram model to score the upcoming character given the history context. What example can I start with? Is it possible not to convert to FAR strings each time?

This is one of the benefits of having the open-source library interface in C++: you can write functions of your own. We choose to score strings (when calculating perplexity, for example) by encoding the strings as FARs, but you could perform a similar function in your own C++ code. I would look at the code for ngramperplexity as a starting point and learn how to use the arc iterators. Once you understand the structure of the model, you should be able to make that work. Alternatively, print out the model using fstprint and read it into your own data structures. Good luck!
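
For the fstprint route, something like the following (hypothetical file names) dumps the model as text, one arc per line (source state, destination state, input label, output label, weight), which is easy to parse from your own code:

<verbatim>
fstprint --save_isymbols=lm.syms lm.mod > lm.txt
</verbatim>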

How to understand the order one count result?

I tried an order-1 ngram count with this simple text:

<verbatim>
Goose is hehe
Goose is hehe
Goose is
Goose
</verbatim>

But I don't understand the resulting count FST:

<verbatim>
0 -1.3863
0 0 Goose Goose -1.3863
0 0 is is -1.0986
0 0 hehe hehe -0.69315
</verbatim>

The documentation says "Transitions and final costs are weighted with the negative log count of the associated n-gram", but I can't make sense of these numbers. Can someone help me out? Thx!!

The counts are stored as negative natural log (base e), so -0.69315 is -log(2), -1.0986 is -log(3) and -1.3863 is -log(4). The count of each word is kept on arcs in a single state machine (since this is order 1) and the final cost is for the end of string (which occurred four times in your example). You printed this using fstprint, but you can also try ngramprint which in this case yields:

<verbatim>
<s> 4
Goose 4
is 3
hehe 2
</s> 4
</verbatim>

where <s> and </s> are the implicit begin of string and end of string events. These are implicit because we don't actually use the symbols to encode them in the fst.

Hope that clears it up for you. If not, try the link in the 'Model Format' section of the quick tour, to the page on 'precise details'.

You've introduced non-determinism into the n-gram models via your replace class modification. The ngrammerge (and ngramprint) commands are simple operations that expect a standard n-gram topology, hence the error messages. For more complex model topologies of the sort you have, you'll have to write your own model merge function that does the right thing when presented with non-determinism. The base library functions don't handle these complex examples, but the code should give you some indication of how to approach such a model mixture. Such is the benefit of open source!

It appears that you have n-grams ending in your stop symbol (probably </s>) that have backoff weights, i.e., the ARPA format has an n-gram that looks like:

<verbatim>
-1.583633 XYZ </s> -0.30103
</verbatim>

But </s> means end-of-string, which we encode as final cost, not an arc leading to a new state. Hence there is no state where that backoff cost would be used. (Think of it this way: what's the next word you predict after </s>? In the standard semantics of </s>, it is the last term predicted, so nothing comes afterwards.) Do you also have n-grams that start with </s>?

So, one fix on your ARPA format is just to remove the backoff weight after n-grams that end in </s>.
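
For example, a quick-and-dirty pass over a whitespace-delimited ARPA file (hypothetical file names; verify the result by running it through ngramread afterwards):

<verbatim>
# drop the trailing backoff field on lines whose n-gram ends in </s>
awk '{ if (NF > 2 && $(NF-1) == "</s>") NF = NF - 1; print }' lm.arpa > lm.fixed.arpa
</verbatim>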


Another case: there are some n-grams that start with </s> in my HTK LM. I think it is a bug in the HTK tool, but it is my only choice for training a class-based LM with automatic class clustering from large plain data. How do I fix it? Is it reasonable to simply remove these n-grams?

Infinity values / ill formed FST

I'm currently playing around with a test example, and I noticed that after ngrammake, if I call fstinfo (not ngraminfo) on the resulting language model, fstinfo complains about the model being ill-formed. This is due to transitions (typically on epsilons) that have "Infinity" weight, which does not seem to be supported by OpenFst. Is that "working as intended"? The problem is that later, if I call fstshortestpath to get e.g. the n most likely sentences from the model, the result contains not only "Infinity" weights but also "BadNumber", which might be a result of the infinite values.

Yes, under certain circumstances, some states in the model end up with infinite backoff cost, i.e., zero probability of backoff. In many cases this is, in fact, the correct weight to assign to backoff. For example, with a very small vocabulary and many observations, you might have a bigram state that has observations for every symbol in the vocabulary, hence no probability mass should be given to backoff. Still, this does cause some problems with OpenFst. In the next version (due to be released in the next month or so) we will by default have a minimum backoff probability of some very small epsilon (i.e., very large negative log probability). As a workaround in the meantime, I would suggest using fstprint to print the model to text, then use sed or perl or whatever to replace Infinity with some very large cost -- I think SRILM uses 99 in such cases, which would work fine.
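
A sketch of that workaround (hypothetical file names; --save_isymbols writes out the model's symbol table so the edited text can be recompiled with its symbols intact):

<verbatim>
fstprint --save_isymbols=lm.syms lm.mod | sed 's/Infinity/99/g' > lm.txt
fstcompile --isymbols=lm.syms --osymbols=lm.syms --keep_isymbols --keep_osymbols \
  lm.txt lm.fixed.mod
</verbatim>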

If I may add another quick question: when running fstshortestpath on the n-gram count language model (i.e., after ngramcount but before ngrammake), I was expecting to get the most frequent n-gram, but instead the algorithm never seems to terminate. Any idea why that is? I thought that shortest path over the tropical semiring should always terminate anyway.

The n-gram count FST contains arcs with negative log counts. Since the counts can be greater than one, the negative log counts can be less than zero. Hence the shortest path is an infinite string repeating the most frequent symbol. Each symbol emission shortens the path, hence non-termination.

Build failure on Fedora 17

Hi. I maintain several voice-recognition-related packages, including openfst, for the Fedora Linux distribution. I am working on an OpenGrm NGram package. My first attempt at building version 1.0.3 (with GCC 4.7.2 and glibc 2.15) failed:

<verbatim>
In file included from ngramrandgen.cc:32:0:
./../include/ngram/ngram-randgen.h:55:48: error: there are no arguments to 'getpid' that depend on a template parameter, so a declaration of 'getpid' must be available [-fpermissive]
./../include/ngram/ngram-randgen.h:55:48: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
</verbatim>

It appears that an explicit #include <unistd.h> is needed in ngram-randgen.h. That header was probably pulled in through some other header in previous versions of either gcc or glibc.

Expected result when using a lattice with ngram perplexity?

I was wondering what the expected result is when feeding a lattice, rather than a string/sentence, to the ngramperplexity utility? Is this supported? It seems to report the perplexity of an arbitrary path through the lattice.

ngramperplexity reports the perplexity of the path through the lattice that you get by taking the first arc out of each state that you reach. (Note that this is what you want for strings encoded as single-path automata.) Not sure what the preferred functionality should be for general lattices. Could make sense to show a warning or an error there; but at this point the onus is on the user to ensure that what is being scored is the same as what you get from farcompilestrings - unweighted, single-path automata. If you have an idea of what preferred functionality would be for non-string lattices, email me.

There is no single method; rather, there are several ways to perform composition with the model, depending on how you want to interpret the backoff arcs. The most straightforward way to do this in your own code is to look at src/bin/ngramapply.cc and use the composition method for the particular kind of backoff arc, e.g., ngram.FailLMCompose() when interpreting the backoff as a failure transition. In other words, write your own ngramapply method based on inspection of the ngramapply code.

Witten-Bell generalizes straightforwardly to fractional counts, as you point out. No immediate plans for new versions of other smoothing methods along those lines, so if that's something that you need urgently, you would need to implement it.

This is basically a floating-point precision issue: the system is trying to subtract two approximately equal numbers (while calculating backoff weights). The new version of the library coming out in a month or so has much improved floating-point precision, which will help. In the meantime, you can get this to work by modifying a constant value in src/include/ngram/ngram-model.h, which will allow these two numbers to be judged approximately equal. Look for:

<verbatim>
static const double kNormEps = 0.000001;
</verbatim>

near the top of that file. Change it to 0.0001, then recompile.

This sort of problem usually comes up when you train a model with a relatively small vocabulary (like a phone or POS-tag model) and a relatively large corpus. The n-gram counts end up not following the Good-Turing assumptions about what the distribution should look like (hence the odd discount values). In those cases, you're probably better off with Witten-Bell smoothing with --witten_bell_k=15 or something like that, or even trying an unsmoothed model.
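
For instance (hypothetical file names):

<verbatim>
ngrammake --method=witten_bell --witten_bell_k=15 corpus.cnts > corpus.mod
# or, an unsmoothed model:
ngrammake --method=unsmoothed corpus.cnts > corpus.unsmoothed.mod
</verbatim>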

And stay tuned for the next release, which deals more gracefully with some of these small vocabulary scenarios.

That error is coming from a sanity check that verifies that every state in the language model (other than the start and unigram states) is reached by exactly one 'ascending' arc, i.e., one that goes from a lower-order state to a higher-order state. ARPA format models can diverge from this by, for example, having 'holes' (e.g., bigrams pruned but trigrams with that bigram as a suffix retained). But ngramread should plug all of those. Maybe duplication? I'll email you about this.

Benoit found a case where certain 'holes' from a pruned ARPA model were not being filled appropriately in the conversion. The sanity check routines run when loading the model ensured that this anomaly was caught (causing the errors he mentioned), and we were able to find the cases where this was occurring and update the code. The updated conversion functions will be in the forthcoming update release of the library, within the next month or two. In the meantime, if anyone encounters this problem, let me know and I can provide a workaround.