Overview

In a previous article, I presented a maximum entropy modeling library called SharpEntropy, a C# port of a mature Java library called the MaxEnt toolkit. The Java MaxEnt library is used by another open source Java library, called OpenNLP, which provides a number of natural language processing tools based on maximum entropy models. This article shows you how to use my C# port of the OpenNLP library to generate parse trees for English language sentences, and explores some of the other features of the OpenNLP code. Please note that because the original Java OpenNLP library is published under the LGPL license, the source code to the C# OpenNLP library available to download with this article is also released under the LGPL. This means it can be used freely in software released under any sort of license, but if you make changes to the library itself and those changes are not for your private use, you must release the source code to those changes.

Introduction

OpenNLP is both the name of a group of open source projects related to natural language processing (NLP), and the name of a library of NLP tools written in Java by Jason Baldridge, Tom Morton, and Gann Bierner. My C# port is based upon the latest version (1.2.0) of the Java OpenNLP tools, released in April 2005. Development of the Java library is ongoing, and I hope to update the C# port as new developments occur.

Tools included in the C# port are: a sentence splitter, a tokenizer, a part-of-speech tagger, a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks"), a parser, and a name finder. The Java library also includes a tool for co-reference resolution, but the code for this feature is in flux and has not yet been ported to C#. All of these tools are driven by maximum entropy models processed by the SharpEntropy library.

Since this article was first written, the coreference tool has been ported to C# and is available, along with the latest version of the other tools, from the SharpNLP Project on CodePlex.

Setting up the OpenNLP library

Since this article was first written, the required binary data files have been made available for download from the SharpNLP Project on CodePlex. Instead of downloading the Java-compatible files from SourceForge and then converting them via the ModelConverter tool, you can download them directly in the required .nbin format.

The maximum entropy models that drive the OpenNLP library consist of a set of binary data files, totaling 123 MB. Because of their large size, it isn't possible to offer them for download from CodeProject. Unfortunately, this means that setting up the OpenNLP library on your machine requires more steps than simply downloading the Zip file, unpacking, and running the executables.

First, download the demo project Zip file and unzip its contents into a folder on your hard disk. Then, in your chosen folder, create a subfolder named "Models". Create two subfolders inside "Models", one called "Parser" and one called "NameFind".

Second, download the OpenNLP model files from the CVS repository belonging to the Java OpenNLP library project area on SourceForge. This can be done via a CVS client, or by using the web interface. Place the .bin files for the chunker (EnglishChunk.bin), the POS tagger (EnglishPOS.bin), the sentence splitter (EnglishSD.bin), and the tokenizer (EnglishTok.bin) in the Models folder you created in the first step. This screenshot shows the file arrangement required:

Place the .bin files for the name finder into the NameFind subfolder, like this:

Then, place the files required for the parser into the Parser subfolder. This includes the files called "tagdict" and "head_rules", as well as the four .bin files:

These models were created by the Java OpenNLP team in the original MaxEnt format. They must be converted into .NET format for them to work with the C# OpenNLP library. The article on SharpEntropy explains the different model formats understood by the SharpEntropy library and the reasons for using them.

The command line program ModelConverter.exe is provided as part of the demo project download for the purpose of converting the model files. Run it from the command prompt, specifying the location of the "Models" folder, and it will take each of the .bin files and create a new .nbin file from it. This process will typically take some time - several minutes or more, depending on your hardware configuration.

(This screenshot, like the folder screenshots above it, is taken from the Windows 98 virtual machine I used for testing. Of course, the code works on newer operating systems as well - my main development machine is Windows XP.)

Once the model converter has completed successfully, the demo executables should run correctly.

What does the demonstration project contain?

As well as the ModelConverter, the demonstration project provides two Windows Forms executables: ToolsExample.exe and ParseTree.exe. Both of these use OpenNLP.dll, which in turn relies on SharpEntropy.dll, the SharpEntropy library which I explored in my previous article. The Parse Tree demo also uses (a modified version of) the Netron Project's treeview control, called "Lithium", available from CodeProject here.

The Tools Example provides a simple interface to showcase the various natural language processing tools provided by the OpenNLP library. The Parse Tree demo uses the modified Lithium control to provide a more graphical demonstration of the English sentence parsing achievable with OpenNLP.

Running the code in source

The source code is provided for the two Windows Forms executables, the ModelConverter program, and the OpenNLP library (which is LGPL licensed). Source code is also included for the modified Lithium control, though the changes to the original CodeProject version are minimal. Source code for the SharpEntropy library can be obtained from my SharpEntropy article.

The source code is written so that the EXEs look for the "Models" folder inside the folder they are running from. This means that if you are running the projects from the development environment, you will either need to place the "Models" subfolder inside the appropriate "bin" directory created when you compile the code, or change the source code to look for a different location. This is the relevant code, from the MainForm constructor:
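A minimal sketch of that lookup (the mModelPath field name follows the code snippets quoted in the discussion below; the exact statement in the download may differ):

// Sketch only: assumes a string field named mModelPath on MainForm.
// The EXE expects to find the "Models" folder beneath its own directory.
mModelPath = System.IO.Path.Combine(
    System.AppDomain.CurrentDomain.BaseDirectory, "Models")
    + System.IO.Path.DirectorySeparatorChar;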

This could be replaced with your own scheme for calculating the location of the Models folder.

A note on performance

The OpenNLP code is set up to use a SharpEntropy.IO.IGisModelReader implementation that holds all of the model data in memory. This is unlikely to cause problems when using some of the simple tools, such as the sentence splitter or the tokenizer. More complex tools, such as the parser and name finder, use several large models. The maximum entropy model data for the English parser consumes approximately 250 MB of memory, so I would recommend that you use appropriately powerful hardware when running this code. If your PC runs out of physical memory and starts using the hard disk instead, you will obviously experience an extreme slowdown in performance.

Detecting the end of sentences

If we have a paragraph of text in a string variable input, a simple and limited way of dividing it into sentences would be to use input.Split('.') to obtain an array of strings. Extending this to input.Split('.', '!', '?') would handle more cases correctly. But while this is a reasonable list of punctuation characters that can end sentences, this technique does not recognize that they can appear in the middle of sentences too. Take the following simple paragraph:

Mr. Jones went shopping. His grocery bill came to $23.45.

Using the Split method on this input will result in an array with five elements, when we really want an array with only two. We can achieve this by treating each of the characters '.', '!' and '?' as potential rather than definite end-of-sentence markers. We scan through the input text, and each time we come to one of these characters, we need a way of deciding whether or not it marks the end of a sentence. This is where the maximum entropy model comes in useful. Various features, relating to the characters before and after the possible end-of-sentence markers, are used to generate a set of predicates for each possible end-of-sentence position. This set of predicates is then evaluated against the MaxEnt model. If the best outcome indicates a sentence break, then the characters up to and including the position of the end-of-sentence marker are separated off into a new sentence.

All of this functionality is packaged into the classes in the OpenNLP.Tools.SentenceDetect namespace, so all that is necessary to perform intelligent sentence splitting is to instantiate an EnglishMaximumEntropySentenceDetector object and call its SentenceDetect method:
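For example (a minimal sketch; mModelPath is assumed to hold the path to the converted model files, as in the code snippets quoted in the discussion below):

// Load the sentence detection model and split a paragraph into sentences.
OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector sentenceDetector =
    new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
// Each element of the returned array is one detected sentence.
string[] sentences = sentenceDetector.SentenceDetect(input);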

The simplest EnglishMaximumEntropySentenceDetector constructor takes one argument, a string containing the file path to the sentence detection MaxEnt model file. If the text shown in the simple example above is passed into the SentenceDetect method, the result will be an array with two elements: "Mr. Jones went shopping." and "His grocery bill came to $23.45."

The Tools Example executable illustrates the sentence splitting capabilities of the OpenNLP library. Enter a paragraph of text into the top textbox, and click the "Split" button. The split sentences will appear in the lower textbox, each on a separate line.

Tokenizing sentences

Having isolated a sentence, we may wish to apply some NLP technique to it - part-of-speech tagging, or full parsing, perhaps. The first step in this process is to split the sentence into "tokens" - that is, words and punctuation. Again, the Split method alone is not adequate to achieve this accurately. Instead, we can use the Tokenize method of the EnglishMaximumEntropyTokenizer object. This class, and the related classes in the OpenNLP.Tools.Tokenize namespace, use the same method for tokenizing sentences as I described in the second half of the SharpEntropy article, which I won't repeat here. As with the sentence detection classes, using this functionality is as simple as instantiating a class and calling a single method:
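For example (a sketch; EnglishTok.nbin is the tokenizer model converted during setup):

// Load the tokenization model and split one sentence into tokens.
OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer tokenizer =
    new OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer(mModelPath + "EnglishTok.nbin");
// Returns one string per token: words and punctuation.
string[] tokens = tokenizer.Tokenize(sentence);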

This tokenizer splits contracted words: for example, it splits "don't" into "do" and "n't", because it is designed to pass these tokens on to the other NLP tools, where "do" is recognized as a verb, and "n't" as a contraction of "not", an adverb modifying the preceding verb "do".

The "Tokenize" button in the Tools Example splits text in the top textbox into sentences, then tokenizes each sentence. The output, in the lower textbox, places pipe characters between the tokens.

Part-of-speech tagging

Part-of-speech tagging is the act of assigning a part of speech (sometimes abbreviated POS) to each word in a sentence. Having obtained an array of tokens from the tokenization process, we can feed that array to the part-of-speech tagger:
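A sketch of that call (the OpenNLP.Tools.PosTagger namespace qualification is an assumption; EnglishPOS.nbin is the converted POS model file):

// Load the POS model and tag the token array from the previous step.
OpenNLP.Tools.PosTagger.EnglishMaximumEntropyPosTagger posTagger =
    new OpenNLP.Tools.PosTagger.EnglishMaximumEntropyPosTagger(mModelPath + "EnglishPOS.nbin");
// tags[i] holds the Penn Treebank tag for tokens[i].
string[] tags = posTagger.Tag(tokens);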

The POS tags are returned in an array of the same length as the tokens array, where the tag at each index of the array matches the token found at the same index in the tokens array. The POS tags consist of coded abbreviations conforming to the scheme of the Penn Treebank, the linguistic corpus developed by the University of Pennsylvania. The list of possible tags can be obtained by calling the AllTags() method; here they are, followed by the Penn Treebank description:

The maximum entropy model used for the POS tagger was trained using text from the Wall Street Journal and the Brown Corpus. It is possible to further control the POS tagger by providing it with a POS lookup list. There are two alternative EnglishMaximumEntropyPosTagger constructors that specify a POS lookup list, either by a file path or by a PosLookupList object. The standard POS tagger does not use a lookup list, but the full parser does. The lookup list consists of a text file with a word and its possible POS tags on each line. If a word in the sentence you are tagging is found in the lookup list, the POS tagger can restrict the list of possible POS tags to those specified in the lookup list, making it more likely to choose the correct tag.

The Tag method has two versions, one taking an array of strings and a second taking an ArrayList. In addition to these methods, the EnglishMaximumEntropyPosTagger also has a TagSentence method. This method bypasses the tokenizing step, taking in an entire sentence and relying on a simple Split to find the tokens. It returns the result of the POS tagging as a single string, with each token followed by a '/' and then its tag - a format often used to display the results of POS tagging algorithms.

The Tools Example application splits an input paragraph into sentences, tokenizes each sentence, and then POS tags that sentence by using the Tag method. Here, we see the results on the first few sentences of G. K. Chesterton's novel, The Man Who Was Thursday. Each token is followed by a '/' character, and then the tag assigned to it by the maximum entropy model as the most likely part of speech.

Finding phrases ("chunking")

The OpenNLP chunker tool will group the tokens of a sentence into larger chunks, each chunk corresponding to a syntactic unit such as a noun phrase or a verb phrase. This is the next step on the way to full parsing, but it could also be useful in itself when looking for units of meaning in a sentence larger than the individual words. To perform the chunking task, a POS tagged set of tokens is required.

The EnglishTreebankChunker class has a Chunk method that takes in the string array of tokens and the string array of POS tags that we generated by calling the POS tagger, and returns a third string array, again with one entry for each token. This array requires some interpretation for it to be of use. The strings it contains begin either with "B-", indicating that this token begins a chunk, or "I-", indicating that the token is inside a chunk but is not the beginning of it. After this prefix comes a Penn Treebank tag indicating the type of chunk that the token belongs to (for example, "B-NP" marks a token that begins a noun phrase chunk, while "I-NP" marks one that continues it).
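In code, that looks something like this (a sketch; the OpenNLP.Tools.Chunker namespace qualification is an assumption):

// Load the chunker model and chunk the tagged tokens.
OpenNLP.Tools.Chunker.EnglishTreebankChunker chunker =
    new OpenNLP.Tools.Chunker.EnglishTreebankChunker(mModelPath + "EnglishChunk.nbin");
// chunks[i] pairs with tokens[i], e.g. "B-NP" or "I-NP".
string[] chunks = chunker.Chunk(tokens, tags);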

The Tools Example application uses the POS-tagging code to generate the string arrays of tokens and tags, and then passes them to the chunker. The result shows the POS tags indicated as before, but with the chunks shown by square-bracketed sections in the output sentences.

Full parsing

Producing a full parse tree is a task that builds on the NLP algorithms we have covered up until now, but which goes further in grouping the chunked phrases into a tree diagram that illustrates the structure of the sentence. The full parse algorithms implemented by the OpenNLP library use the sentence splitting and tokenizing steps, but perform the POS-tagging and chunking as part of a separate but related procedure driven by the models in the "Parser" subfolder of the "Models" folder. The full parse POS-tagging step uses a tag lookup list, found in the tagdict file.

The full parser is invoked by creating an object from the EnglishTreebankParser class, and then calling the DoParse method:
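A sketch of that sequence (the boolean argument values shown are illustrative; the constructor is described below):

// Path to the Models folder, use the tag lookup list, case insensitive.
OpenNLP.Tools.Parser.EnglishTreebankParser parser =
    new OpenNLP.Tools.Parser.EnglishTreebankParser(mModelPath, true, false);
// Returns the root node of the most probable parse tree.
OpenNLP.Tools.Parser.Parse parse = parser.DoParse(sentence);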

There are many constructors for the EnglishTreebankParser class, but one of the simplest takes three arguments: the path to the Models folder, and two boolean flags: the first to indicate if we are using the tag lookup list, and the second to indicate if the tag lookup list is case sensitive or not. The DoParse method also has a number of overloads, taking in either a single sentence, or a string array of sentences, and also optionally allowing you to request more than one of the top ranked parse trees (ranked with the most probable parse tree first). The simple version of the DoParse method takes in a single sentence, and returns an object of type OpenNLP.Tools.Parser.Parse. This object is the root in a tree of Parse objects representing the best guess parse of the sentence. The tree can be traversed by using the Parse object's GetChildren() method and the Parent property. The Penn Treebank tag of each parse node is found in the Type property, except for when the node represents one of the tokens in the sentence - in this case, the Type property will equal MaximumEntropyParser.TokenNode. The Span property indicates the section of the sentence to which the parse node corresponds. This property is of type OpenNLP.Tools.Util.Span, and has the Start and End properties indicating the characters of the portion of the sentence that the parse node represents.
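For instance, a recursive traversal of the result could be sketched like this, using only the members described above:

// Print the tag of each non-token node, indented by depth.
private void Walk(OpenNLP.Tools.Parser.Parse node, int depth)
{
    // Token nodes carry sentence text rather than a Treebank tag.
    if (node.Type != OpenNLP.Tools.Parser.MaximumEntropyParser.TokenNode)
    {
        System.Console.WriteLine(new string(' ', depth * 2) + node.Type);
    }
    foreach (OpenNLP.Tools.Parser.Parse child in node.GetChildren())
    {
        Walk(child, depth + 1);
    }
}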

The Parse Tree demo application shows how this Parse structure can be traversed and mapped onto a Lithium graph control, generating a graphical representation of the parse tree. The work is kicked off by the ShowParse() method of the MainForm class. This calls the recursive AddChildNodes() method to build the graph.

The Tools Example, meanwhile, uses the built-in Show() method of the root Parse object to produce a textual representation of the parse graph:

Name finding

"Name finding" is the term used by the OpenNLP library to refer to the identification of classes of entities within the sentence - for example, people's names, locations, dates, and so on. The name finder can find up to seven different types of entities, represented by the seven maximum entropy model files in the NameFind subfolder - date, location, money, organization, percentage, person, and time. It would, of course, be possible to train new models using the SharpEntropy library, to find other classes of entities. Since this algorithm is dependent on the use of training data, and there are many, many tokens that might come into a category such as "person" or "location", it is far from foolproof.

The name finding function is invoked by first creating an object of type OpenNLP.Tools.NameFind.EnglishNameFinder, and then passing it the path to the NameFind subfolder containing the name finding maximum entropy models. Then, call the GetNames() method, passing in a string array of entity types to look for, and the input sentence.
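A sketch of those two steps (it is assumed here that the seven model file names double as the entity type identifiers):

// Point the name finder at the folder of name-finding models.
OpenNLP.Tools.NameFind.EnglishNameFinder nameFinder =
    new OpenNLP.Tools.NameFind.EnglishNameFinder(mModelPath + @"NameFind\");
// Ask for all seven entity types in one pass.
string[] models = new string[] {"date", "location", "money",
    "organization", "percentage", "person", "time"};
string taggedSentence = nameFinder.GetNames(models, sentence);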

The result is a formatted sentence with XML-like tags indicating where entities have been found.

It is also possible to pass a Parse object, the root of a parse tree structure generated by the EnglishTreebankParser, rather than a string sentence. This will insert tags into the parse structure showing the entities found by the Name Finder.

Conclusion

My C# conversion of the OpenNLP library provides a set of tools that make some important natural language processing tasks simple to perform. The demo applications illustrate how easy it is to invoke the library's classes and get good results quickly. The library does rely on holding large maximum entropy model data files in memory, so the more complicated NLP tasks (full parsing and name finding) are memory-intensive. On machines with plenty of memory, performance is impressive: a 3.4 GHz Pentium 4 machine with 2 GB of RAM loaded the parse data into memory in 12 seconds. Querying the model once loaded, by passing sentence data to it, produced almost instantaneous parse results.

Work on the Java OpenNLP library is ongoing. The C# version now has a coreference tool and its development is also active, at the SharpNLP Project on CodePlex. Investigations into speedy ways of retrieving MaxEnt model data from disk rather than holding data in memory also continue.

"On tree diagrams and XML" - the NetronProject's Lithium treeview control [This article has since been removed from CodeProject by its author].

History

Third version (13th December 2006). Added references to the SharpNLP Project on CodePlex.

Second version (4th May 2006). Added .NET 2.0 versions of the code for download.

Initial version (30th October 2005).

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Richard Northedge is a senior developer with a UK Microsoft Gold Partner company. He has a postgraduate degree in English Literature, has been programming professionally since 1998 and has been an MCSD since 2000.

How about if I want to extract some specific information for which named entity recognition models are not available in the NameFind folder (person.bin, organization.bin, etc.)?

For example, I want to extract knowledge in the IT support (help desk) area.
For example:
I want to extract all indications of modem troubleshooting, their causes and their solutions.

Extract all indications of networking troubleshooting, their causes and their solutions.

Ex:
(1)
Cause: DNS Server Not Set
Indication: No IP, Cannot connect to other computer.
Solution: Set DNS Server Range IP for client.

(2)
Cause: Modem driver not compatible with the version of OS.
Indication: Modem detected but not functional; modem driver already installed but modem still not functional.
Solution: Find new driver update for the modem.

How could I extract this information from a text document? I want to perform automatic knowledge acquisition from text documents.

Should I build an 'indication' named entity recognizer, using machine learning, for each item of the IT support area?

I think what you're describing is essentially a document classification task. You can certainly use maximum entropy modelling techniques to perform document classification, but that's a different type of task from the name finding. It's a task for which there is no "higher level" module in SharpNLP, but you could certainly write something based on SharpEntropy to do it. Take a look at my SharpEntropy article rather than looking at this one.

Each support log entry would be an event, and your outcomes would be "Modem troubleshooting", "Network troubleshooting", and other categories as decided on by you. You would then generate a context by selecting features from the support log entry. Then you could train your model on a set of support log entries that you have classified in advance as being of a certain type (Modem, Network, etc.). Having trained a model, you can then apply it to support log entries you haven't manually classified and get it to tell you what type of log entry they are.

Whether it works well or not will depend on the sophistication you put into selecting features from the log entry that are good signals for determining the log entry type.
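To make that concrete, here is a rough sketch of the shape such code might take. The SharpEntropy type names here are from memory of that article, and SupportLogEventReader and GetFeatures are hypothetical pieces you would write yourself - check the SharpEntropy article for the exact API:

// Hypothetical reader: yields one training event per pre-classified
// log entry (outcome = "Modem", "Network", etc.; context = features).
SharpEntropy.ITrainingEventReader events =
    new SupportLogEventReader("classified_logs.txt");
SharpEntropy.GisTrainer trainer = new SharpEntropy.GisTrainer();
trainer.TrainModel(events);
SharpEntropy.GisModel model = new SharpEntropy.GisModel(trainer);

// Classify an unseen log entry from its features (GetFeatures is
// your own feature extraction code).
string[] features = GetFeatures(newLogEntry);
double[] probabilities = model.Evaluate(features);
string category = model.GetBestOutcome(probabilities);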

Could you give me a new link to download EnglishPOS.bin?
I have tried to find it at http://opennlp.sourceforge.net/models/english/ but I can't find it there. Could you help me please? You can send it to my email: ardi_skom@yahoo.com

I'm not sure how to ask this question, but in layman's terms, did I read somewhere that the taggers use references from the New York Times paper which have been manually tagged? And if so, is it possible you could provide a link?

Some time ago I emailed one of the developers working on the original Java version of OpenNLP regarding this question, and he replied: "The corpora that the models are trained on is mostly from copywritten texts which I can't distribute." Presumably because of licensing restrictions - a lot of tagged corpora cost quite a bit of money to use, because they cost money for people to do the tagging by hand.

Sorry, I don't fully understand. For my university project, I have used the name finder models. My lecturer is going to ask me what locations the location finder picks up. Does the nbin file reference a database of locations, or what is the basis upon which the locations are found?

Does the nbin file reference a database of locations, or what is the basis upon which the locations are found?

No, the nbin file is a maximum entropy model. It was created from a source file that consisted of a whole load of sentences, tagged to indicate where the locations were. For example:

The man came from <START>London<END>.
I went to <START>New York<END> for my holiday.
This sentence doesn't have a location.
etc...

This source file was then run through the OpenNLP.Tools.NameFind.DefaultNameContextGenerator class to create a set of events that were processed by SharpEntropy to create the maximum entropy model. If you look in the code in the DefaultNameContextGenerator class, you will see that it is recognising various word features, and using those word features to create a context leading to an outcome of either "is a location" or "isn't a location".

Thank you so much for this tool and tutorial. It has been very helpful for me. Also, would you know of any tool that "normalizes" words? For example, I would like all my past and future tense words to be present tense, and I want all my plural and possessive words to be singular.
Any help will be greatly appreciated.

You should be able to get some way towards what you want by using the morphological processor in SharpWordNet. If you've ever used the WordNet UI (e.g. the online one at http://wordnet.princeton.edu/perl/webwn), you'll see that when you do a search, it will "normalise" in the way you suggest - e.g. enter "apples", and it finds "apple"; enter "bought", and it finds "buy". The way it does this is by a simple set of suffix-changing rules (allowing apples -> apple), an exceptions file for irregular verbs etc. (allowing bought -> buy), and a few other things. SharpWordNet echoes this functionality by providing a set of morphology classes in the "Morph" code subfolder. You can find an example of these classes being used within the coreference code in OpenNLP.

The latest .NET OpenNLP release is based on version 1.3 of the Java OpenNLP. 1.3 is the latest Java release, but the .NET version doesn't contain all of it. It does contain the coreference tool, but not the Spanish language features or the NGram utility. I'm only interested in English language tools, so the Spanish isn't on my to-do list.

The changes after the Java version 1.3 seem mainly to do with the UIMA framework. Is UIMA the best way forward? How does it fit in with SharpNLP? Not sure yet.

Hello,
Thank you for your great work and your time.
I have a problem with the "parse" and the "find name" functions.
When I click on the Parse button or the Find Name button, my PC loads and loads for a long time (more than an hour), and finally I get a message telling me that there is not enough memory. What can I do now? My RAM is 256 MB.
Thank you for your help.

Sabrysoft
student at the Faculty of Computers and Information (FCI)
3rd year, Computer Science Department
Helwan University
Egypt

How are you?
Please could you help me?
I want to implement the virtual chat project, but I don't know how to convert English into predicate calculus.
Could you tell me how?
Or, if my thinking is wrong, could you advise me?
Thank you

Sabrysoft
student at the Faculty of Computers and Information
3rd year, Computer Science Department
Helwan University
Egypt

You would need two things; a set of maximum entropy models trained on Spanish data, and some classes to replace the English-specific algorithms with Spanish ones. Version 1.3 of the Java OpenNLP library does contain some Spanish-specific classes, but I haven't found any published maxent models that work with them.

Hi everyone,
I am not sure where the problem comes from, but the sentence splitter often keeps two sentences together. Here is an example of a simple text it has a problem with: My mother love to dance in the evening. She has driving her car the whole morning. I love eating in the kitchen apples. The only woman I know is seeking true love. it's new friend want to die like a hero. the truth is out there.
Here is what it sees as one sentence: it's new friend want to die like a hero. the truth is out there.

The code I am using is very simple:
mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");

/* The sentence splitter has a bug that returns a space at the end of every sentence. We fix it our way. */
string[] sentences = mSentenceDetector.SentenceDetect(paragraph);
return alignSentences(sentences);

Does anyone have a clue why it isn't working like it should, and can help me?

The sentence splitter examines each potential end of sentence, considers various properties of it, and then applies the maximum entropy method to assign a probability of its being a real end of sentence. Based on the maximum entropy model you are using (EnglishSD.nbin), the properties of the potential end of sentence at "a hero. the truth" are not making the probability grade, so it is not considered the end of a sentence. One of the things the code uses to determine the end of a sentence is whether the following word has a capital letter; the lower case "t" on "the truth" is probably reducing the probability significantly.

The relationship you're looking for is the hypernym / hyponym relationship. You can ask the question "What is red a kind of?" by finding all the hypernyms of "red"; and you can ask the question "What kinds of colors are there?" by finding all the hyponyms of "color".

However, this is complicated by the fact that in the WordNet database, both "red" and "color" are polysemous: they have more than one sense. You either need to determine in advance which sense of each word you need, or you need to scan all the senses for the information you are looking for.

As a general introduction to coding against the SharpWordNet library (taken from a discussion I had on Codeplex):

Create a WordNetEngine by instantiating one of its concrete implementations (there is only 1 currently; the DataFileEngine, which takes a path to the WordNet database files):
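(A minimal sketch; it is assumed the classes live in a SharpWordNet namespace, and the path shown is illustrative.)

// DataFileEngine takes the folder containing the WordNet database files.
SharpWordNet.WordNetEngine engine =
    new SharpWordNet.DataFileEngine(@"C:\Program Files\WordNet\2.1\dict");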

Take a look at the classes in the SharpEntropy.IO namespace in the SharpEntropy library. The GisModelReader class is a base class for loading GIS models, and contains source code and comments that explain the usual structure of GIS model files. The BinaryGisModelReader class is a class that inherits from GisModelReader and does the reading of .nbin format GIS model files.

Hi,
I've followed your articles through in doing statistical sentence parsing. The only difference is that I changed the code to VB instead of C#. When I reached the NameFind section, I got the following error

Hello!
I have some quick questions. First of all, we are in the process of building a POS tagger for Urdu. We don't have any large tagged data set available, so we will have to use the technique of bootstrapping. That is, with an initial repository of 10,000 tagged words, we will train the model and then expose it to unseen text. The automatically tagged text will be corrected, merged with the previous tagged text, and used to train again. The question is: will the model generated in the previous pass be of any use, or will we have to train from scratch again? I mean, is there any way to do incremental learning?

The second question: intuitively speaking, how much training data is required to achieve about 60% accuracy?

You can't add information to a model once it has been created. Think of it as "compiled" - if you want to make changes, you have to make changes to the source data (the training file) and recompile (retrain).

I'm afraid I wouldn't be able to even guess at quantifying how much training data you will need, because my knowledge of Urdu is nil.

Use constructors that take in an IMaximumEntropyModel parameter rather than a string containing a path to a model file, and pass in a SharpEntropy.GisModel object using an implementation of SharpEntropy.IO.IGisModelReader that doesn't load all the data into memory up front.

You can then use the SqliteGisModelReader provided with the SharpEntropy article, or write your own Reader/Writer pair using the Sqlite classes as a guide.

If you load the model data to memory up front, you will take a one-time performance hit for the load but accessing the models will be very fast. If you leave the model on disk and access it as needed, you won't have that up front performance hit, but accessing the models will necessarily be much slower. How much slower depends on your IGisModelReader implementation.

Yes, I have implementations of LuceneGisModelReader and LuceneGisModelWriter classes that use Lucene.Net. I'd be happy to mail them to you or make them available on CodePlex - under the LGPL of course. Testing suggested the performance was comparable with other persistent storage formats, but I wrote the code a while ago and I may have chosen a suboptimal way of using Lucene - it would be great if you found a better way.

jconwell wrote:

Where can I find info on the nbin file format?

Unfortunately I don't have any documentation on the file format other than the code itself, but the GIS models hold the following data:

Each model has a CorrectionConstant (integer) and CorrectionParameter (double).
Each model has a series of Outcomes, each of which has an Id (integer) and a label (string).
Each model has a series of Predicates, each of which has an Id (integer) and a label (string).
Each Predicate may have 0 or more PredicateParameters, each of which has a PredicateId (integer FK back to Predicate), an OutcomeId (integer FK back to Outcome), and a Parameter (double).

This structure is represented in various different ways in the different implementations of the GisModelReader / GisModelWriter.
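As an illustrative sketch only (these are not the actual SharpEntropy class names), the data described above could be modelled like this:

// Illustrative shape of a GIS model - not the SharpEntropy types.
class GisModelData
{
    public int CorrectionConstant;
    public double CorrectionParameter;
    public string[] Outcomes;     // index = outcome Id, value = label
    public string[] Predicates;   // index = predicate Id, value = label
    public PredicateParameter[] Parameters;
}

class PredicateParameter
{
    public int PredicateId;   // FK back to Predicates
    public int OutcomeId;     // FK back to Outcomes
    public double Parameter;
}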

1. What are the differences between your full parser and Link Grammar's parser?

There are a lot of differences - some of the main ones are:

(a) they attempt to classify the grammar of a sentence in very different ways. The SharpNLP parser assigns a category to each word and phrase in the sentence, but the Link Grammar parser classifies sentences by identifying relationships between words and classifying each relationship.

(b) the Link Grammar parser is largely rule-based - it has a large dictionary of words and rules about the legal relationships between those words, and uses these to resolve possible relationships within a sentence. The SharpNLP parser is based on statistical probability - it is trained on a set of example sentences, and uses the probability model generated from that training process to make inferences about a sentence.

Some of the tools do not use a dictionary at all, and the ones that do use it only as a supplement to the training model to cut down possible options, not the main input. The tagdict file, for instance, is a simple text file listing words and those words' possible parts of speech, as a sanity check and performance booster for the part-of-speech tagger. To change the main input (the model files), you would need to train new models from marked up sample data. See my comments on a previous thread about the Name Finder tool.

williamwlk wrote:

3. How may I reach you privately? I have some questions that I need to ask you privately and commercially.

I believe you can click "email" rather than "reply" on the links under my message.

Thank you for your explanation and your pointers. We do not have experience writing MT applications either. That is why I have been extensively reviewing all the technologies available on the net as far as English parsing is concerned.

SharpNLP will certainly help with syntactic analysis of English sentences, and the maximum entropy tools that underpin it may be of use in other aspects of the machine translation task if you want to use a statistical approach for them. Do you have any parser tools to handle the Myanmar language?

williamwlk wrote:

Or let me put it this way, do you approve/support my idea that I am going to use SharpNLP for our MT application?

Well, there are two caveats I suppose:

First, SharpNLP is not part of my day job, so my work on updates and issue fixing is sporadic, when I get the time;
and second, the licence is LGPL, so if you make changes to the library as part of your work, you are obliged to contribute those changes back to the project.

Myanmar script/text is different from that of English. There is no white space between words. We have so far developed a few good algorithms to tokenize Myanmar text based on syllables. However, we do not yet have a full parser that can identify words and phrases, lexically or syntactically. Perhaps that is the second phase of our project. It is not easy, I must say.

The first phase that I am going to kick off soon is to write an MT application which supports English to Myanmar only. Only one direction. From English to Myanmar. That is easier than the reverse direction because English parsers are widely available while we know both languages well.

Richard Northedge wrote:

First, SharpNLP is not part of my day job, so my work on updates and issue fixing is sporadic, when I get the time;
and second, the licence is LGPL, so if you make changes to the library as part of your work, you are obliged to contribute those changes back to the project.

Noted with thanks. I highly respect the LGPL license and it is our pleasure to comply with it.

I am currently compiling a budget report for this matter and the project time frame is optimistically 6 months and pessimistically one year.