Category: NLP

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.

Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Stanford CoreNLP is here and available on NuGet. It is probably the most powerful package of all The Stanford NLP Group’s software packages. Please read the usage overview on the Stanford CoreNLP home page to understand what it can do, how you can configure an annotation pipeline, which steps are available, which models you need, and so on.

The next thing we need to do is create a StanfordCoreNLP pipeline. But to instantiate a pipeline, we need to specify all required properties, or at least the paths to all the models used by the pipeline that are specified in the annotators string. Before starting with the samples, let’s define some helpers that will be used across all the source code pieces: jarRoot is the path to the folder where we extracted the files from stanford-corenlp-3.2.0-models.jar; modelsRoot is the path to the folder with all the model files; ‘!’ is an overloaded operator that converts a model name into a relative path to the model file.
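A minimal sketch of these helpers might look like this (the folder names are assumptions based on where the archive was unpacked, so adjust them to your own layout):

```fsharp
// Folder with the files extracted from stanford-corenlp-3.2.0-models.jar
// (an assumed location - adjust it to where you unpacked the jar)
let jarRoot = @"..\..\..\..\temp\stanford-corenlp-full-2013-06-20\stanford-corenlp-3.2.0-models\"

// All model files live under edu\stanford\nlp\models inside the extracted jar
let modelsRoot = jarRoot + @"edu\stanford\nlp\models\"

// '!' converts a model name into a path relative to the models folder
let (!) name = modelsRoot + name
```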

Now we are ready to instantiate the pipeline, but we need one small trick. The pipeline is configured to use the default model files (for simplicity), and all paths are specified relative to the root of stanford-corenlp-3.2.0-models.jar. To make things easier, we can temporarily change the current directory to jarRoot, instantiate the pipeline, and then change the current directory back. This trick dramatically decreases the number of lines of code.
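The trick might be sketched as follows (the annotator names come from the CoreNLP usage overview, and the jar path is an assumption about your layout; the code will only work with the extracted model files in place):

```fsharp
open System
open System.IO
open java.util
open edu.stanford.nlp.pipeline

// Assumed location of the extracted stanford-corenlp-3.2.0-models.jar
let jarRoot = @"..\..\..\..\temp\stanford-corenlp-full-2013-06-20\stanford-corenlp-3.2.0-models\"

// Annotators we want in the pipeline
let props = Properties()
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref") |> ignore

// Temporarily switch the current directory to the jar root so that the
// default relative model paths resolve, then switch back
let curDir = Environment.CurrentDirectory
Directory.SetCurrentDirectory(jarRoot)
let pipeline = StanfordCoreNLP(props)
Directory.SetCurrentDirectory(curDir)
```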

However, you do not have to do it this way; you can configure all models manually. The number of properties (especially paths to models) that you need to specify depends on the annotators value. Let’s assume for a moment that we are in the Java world and want to configure our pipeline in a custom way. Especially for this case, stanford-corenlp-3.2.0-models.jar contains StanfordCoreNLP.properties (you can find it in the folder with the extracted files), where you can specify new property values outside of code. Most of the properties that we need for configuration are already mentioned in this file, and you can easily understand what is what. But that alone is not enough to get it working; you also need to look into the source code of Stanford CoreNLP. By the way, some days ago Stanford moved the CoreNLP source code to GitHub, so now it is much easier to browse. Default paths to the models are specified in DefaultPaths.java, property keys are listed in Constants.java, and the mapping between paths and property names is contained in Dictionaries.java. Thus, you are able to dive deeper into pipeline configuration and do whatever you want. For lazy people, I already have a working sample.
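For illustration, a few custom lines in StanfordCoreNLP.properties might look like this (the property keys are the ones mentioned in that file; the paths echo the defaults from DefaultPaths.java, so treat them as assumptions about your models layout):

```properties
annotators = tokenize, ssplit, pos, lemma, ner
pos.model = edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
ner.model = edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
```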

C# Sample

Stanford Temporal Tagger (SUTime)

SUTime is a library for recognizing and normalizing time expressions. SUTime is available as part of the Stanford CoreNLP pipeline and can be used to annotate documents with temporal information. It is a deterministic rule-based system designed for extensibility.

There is one more useful thing that we can do with CoreNLP: time extraction. The way we use CoreNLP here is pretty similar to the previous sample. First, we create an annotation pipeline and add all the required annotators to it. (Note that this sample also uses the operator defined at the beginning of the post.)
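A sketch of such a pipeline might look like this (the annotator class names follow the SUTime documentation; the model path passed through ‘!’ is an assumption about the extracted models layout):

```fsharp
open java.util
open edu.stanford.nlp.pipeline
open edu.stanford.nlp.tagger.maxent
open edu.stanford.nlp.time

// Build the annotation pipeline manually, adding only the annotators we need;
// '!' is the model-path operator defined at the beginning of the post
let pipeline = AnnotationPipeline()
pipeline.addAnnotator(PTBTokenizerAnnotator(false))
pipeline.addAnnotator(WordsToSentencesAnnotator(false))
let tagger = MaxentTagger(!"pos-tagger/english-bidirectional/english-bidirectional-distsim.tagger")
pipeline.addAnnotator(POSTaggerAnnotator(tagger))
pipeline.addAnnotator(TimeAnnotator("sutime", Properties()))
```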


Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.

The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

Some weeks ago, Microsoft Research announced an NLP toolkit called MSR SPLAT. It is time to play with it and see what it can do.

The Statistical Parsing and Linguistic Analysis Toolkit (SPLAT) is a linguistic analysis toolkit. Its main goal is to allow easy access to the linguistic analysis tools produced by the Natural Language Processing group at Microsoft Research. The tools include both traditional linguistic analysis tools, such as part-of-speech taggers and parsers, and more recent developments, such as sentiment analysis (identifying whether a particular piece of text has positive or negative sentiment towards its focus).

In the first call we ask SPLAT to return the list of supported languages, splat.Languages(), and you will see [|”en”; “bg”|] (English and Bulgarian). The mysterious Bulgaria… I do not know why, but NLP folks like Bulgaria. There is something special in it for NLP :).

The next call is splat.Analyzers(“en”), which returns the list of all analyzers available for the English language (all of them are also available from the DEMO app):

“Base Forms-LexToDeriv-DerivFormsC#”

“Chunker-SpecializedChunks-ChunkerC++”

“Constituency_Forest-PennTreebank3-SplitMerge”

“Constituency_Tree-PennTreebank3-SplitMerge”

“Constituency_Tree_Score-Score-SplitMerge”

“CoRef-PennTreebank3-UsingMentionsAndHeadFinder”

“Dependency_Tree-PennTreebank3-ConvertFromConstTree”

“Katakana_Transliterator-Katakana_to_English-Perceptron”

“Lemmas-LexToLemma-LemmatizerC#”

“Named_Entities-CONLL-CRF”

“POS_Tags-PennTreebank3-cmm”

“Semantic_Roles-PropBank-kristout”

“Semantic_Roles_Scores-PropBank-kristout”

“Sentiment-PosNeg-MaxEntClassifier”

“Stemmer-PorterStemmer-PorterStemmerC#”

“Tokens-PennTreebank3-regexes”

“Triples-SimpleTriples-ExtractFromDeptree”

This is the list of full analyzer names that are available for now. The part of the analyzer’s name that you have to pass to the service to perform the corresponding analysis is highlighted in bold (the segment before the first dash). To perform the analysis, you need to have an access GUID and pass it as an email to the splat.Analyze method. It is probably a typo, but it is what it is. Let’s call all the analyzers on one of our favorite sentences, “All your types are belong to us”, and look at the result.

As you can see, the service returns the result as string[]. All the result strings are readable to human eyes and formatted according to “NLP standards”, but some of them are really hard to parse programmatically. FSharp.Data and its JSON Type Provider can help with the strings that contain valid JSON objects.

For example, if you need to use the “Sentiment-PosNeg-MaxEntClassifier” analyzer in a strongly typed way, you can do it as follows:
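For illustration only, assuming the analyzer returns a JSON string shaped roughly like {"Classification":"pos","Probability":0.95} (the field names here are a guess, so inspect the actual service response and adjust the sample), a JsonProvider-based sketch could be:

```fsharp
open FSharp.Data

// Hypothetical response shape - verify against the real service output
type Sentiment = JsonProvider<"""{"Classification":"pos","Probability":0.95}""">

// Parse one result string into a typed classification and score
let parseSentiment (json: string) =
    let s = Sentiment.Parse(json)
    s.Classification, s.Probability
```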

For analyzers like “Constituency_Tree-PennTreebank3-SplitMerge” you need to write a custom parser that processes the bracket expression (“(TOP (S (NP (PDT All) (PRP$ your) (NNS types)) (VP (VBP are) (VP (VB belong) (PP (TO to) (NP (PRP us)))))))”) and builds a tree for you. If you are too lazy to do it yourself (and you should be), you can download SilverlightSplatDemo.xap and decompile the source code: all the parsers are already implemented there for the DEMO app. But this approach is not as easy as it should be.
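If you prefer to roll your own, a small self-contained parser for such bracket expressions might be sketched like this (a simplified Tree type of my own, not the decompiled DEMO parsers):

```fsharp
open System

// A phrase-structure tree: an inner node with a label, or a leaf word
type Tree =
    | Node of string * Tree list
    | Leaf of string

// Parse a Penn-Treebank-style bracket expression such as
// "(NP (PDT All) (PRP$ your) (NNS types))" into a Tree
let parseTree (s: string) =
    let tokens =
        s.Replace("(", " ( ").Replace(")", " ) ")
         .Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
        |> List.ofArray
    // parse one subtree and return it together with the remaining tokens
    let rec parse tokens =
        match tokens with
        | "(" :: label :: rest ->
            let children, rest' = parseChildren rest []
            Node(label, children), rest'
        | word :: rest -> Leaf word, rest
        | [] -> failwith "unexpected end of input"
    and parseChildren tokens acc =
        match tokens with
        | ")" :: rest -> List.rev acc, rest
        | _ ->
            let child, rest = parse tokens
            parseChildren rest (child :: acc)
    parse tokens |> fst
```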

Summary

MSR SPLAT looks like a really powerful and promising toolkit. I hope that it continues growing.

My only wish is an API improvement. It should be possible to use the services in a strongly typed way. The easiest option would be an ability to get all results as JSON, without any CNF forms and so on. It could also be achieved by changing the WCF service to expose analysis results in a typed way instead of string[].

Some weeks ago, I announced FSharp.NLP.Stanford.Parser and now I want to clarify the goals of this project and show an example of usage.

First of all, this is not an attempt to re-implement any functionality of the Stanford Parser. It is just a tiny layer aimed at simplifying interaction with Java collections (especially the Iterable interface) and bringing the power of F# constructs (like pattern matching and discriminated unions) to the code that deals with tagging results.

Task

Let’s start with a sample NLP task: we want to show related questions before the user asks a new one (as it works on StackOverflow). There are many possible solutions for this task. Let’s look at one that, as a first step, tries to identify the key phrases of the question and then runs a search using them.

Approach

First of all, let’s choose some real questions from StackOverflow to analyze them:

Now we can use Stanford Parser GUI to visualize the structure of these questions:

As you can see, the first question is about “F# project” and “object browser”. The second question is about “WebSharper”, “Mono 3.0” and “Mac”. The third one is about “extra methods”, “type providers” and “F#”. The last one is about “MonoDevelop” and “F# projects”.

We can notice that all the phrases we selected are parts of noun phrases (NP). As a first solution, we can try to analyze the tags in the tree and select the NPs that contain word-level tags like NN, NNS, NNP and NNPS.
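As an illustrative sketch of that idea (using a simplified tree type of my own instead of the real Stanford Tree, so the details are assumptions):

```fsharp
// Simplified phrase-structure tree standing in for the Stanford parse tree
type PTree =
    | Phrase of string * PTree list   // phrase-level tag, e.g. "NP", and children
    | Word of string * string         // word-level tag, e.g. "NN", and the word

// Word-level tags that mark a noun phrase as a candidate key phrase
let nounTags = set [ "NN"; "NNS"; "NNP"; "NNPS" ]

// Collect the text of every NP that directly contains a noun-level tag
let rec keyPhrases tree =
    let rec words t =
        match t with
        | Word(_, w) -> [ w ]
        | Phrase(_, children) -> List.collect words children
    match tree with
    | Phrase("NP", children) when
        children |> List.exists (function Word(tag, _) -> nounTags.Contains tag | _ -> false) ->
        [ words tree |> String.concat " " ]
    | Phrase(_, children) -> List.collect keyPhrases children
    | Word _ -> []
```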

Both samples produce the same output. For example, if you start the program with these parameters:

text "A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads
text in some language and assigns parts of speech to each word (and other token),
such as noun, verb, adjective, etc., although generally computational
applications use more fine-grained POS tags like 'noun-plural'."

F# Sample

let main file =
    let classifier =
        CRFClassifier.getClassifierNoExceptions(
            @"..\..\..\..\temp\stanford-ner-2013-06-20\classifiers\english.all.3class.distsim.crf.ser.gz")
    // For either a file to annotate or for the hardcoded text example,
    // this demo file shows two ways to process the output, for teaching
    // purposes. For the file, it shows both how to run NER on a String
    // and how to run it on a whole file. For the hard-coded String,
    // it shows how to run it on a single sentence, and how to do this
    // and produce an inline XML output format.
    match file with
    | Some(fileName) ->
        let fileContents = File.ReadAllText(fileName)
        classifier.classify(fileContents)
        |> Iterable.toSeq
        |> Seq.cast<java.util.List>
        |> Seq.iter (fun sentence ->
            sentence
            |> Iterable.toSeq
            |> Seq.cast<CoreLabel>
            |> Seq.iter (fun word ->
                printf "%s/%O " (word.word()) (word.get(CoreAnnotations.AnswerAnnotation().getClass()))
            )
            printfn ""
        )
    | None ->
        let s1 = "Good afternoon Rajat Raina, how are you today?"
        let s2 = "I go to school at Stanford University, which is located in California."
        printfn "%s\n" (classifier.classifyToString(s1))
        printfn "%s\n" (classifier.classifyWithInlineXML(s2))
        printfn "%s\n" (classifier.classifyToString(s2, "xml", true))
        classifier.classify(s2)
        |> Iterable.toSeq
        |> Seq.iteri (fun i coreLabel ->
            printfn "%d\n:%O\n" i coreLabel
        )

C# Sample

class Program
{
    public static CRFClassifier Classifier =
        CRFClassifier.getClassifierNoExceptions(
            @"..\..\..\..\temp\stanford-ner-2013-06-20\classifiers\english.all.3class.distsim.crf.ser.gz");

    // For either a file to annotate or for the hardcoded text example,
    // this demo file shows two ways to process the output, for teaching
    // purposes. For the file, it shows both how to run NER on a String
    // and how to run it on a whole file. For the hard-coded String,
    // it shows how to run it on a single sentence, and how to do this
    // and produce an inline XML output format.
    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            var fileContent = File.ReadAllText(args[0]);
            foreach (List sentence in Classifier.classify(fileContent).toArray())
            {
                foreach (CoreLabel word in sentence.toArray())
                {
                    Console.Write("{0}/{1} ", word.word(), word.get(new CoreAnnotations.AnswerAnnotation().getClass()));
                }
                Console.WriteLine();
            }
        }
        else
        {
            const string S1 = "Good afternoon Rajat Raina, how are you today?";
            const string S2 = "I go to school at Stanford University, which is located in California.";
            Console.WriteLine("{0}\n", Classifier.classifyToString(S1));
            Console.WriteLine("{0}\n", Classifier.classifyWithInlineXML(S2));
            Console.WriteLine("{0}\n", Classifier.classifyToString(S2, "xml", true));
            var classification = Classifier.classify(S2).toArray();
            for (var i = 0; i < classification.Length; i++)
            {
                Console.WriteLine("{0}\n:{1}\n", i, classification[i]);
            }
        }
    }
}

As you can see, it is still not as simple as it should be. I have sometimes seen questions from C# folks about different NLP tasks, with answers pointing to my “The Stanford Natural Language Processing Samples, in F#” repository (like this one). It is probably not so easy to find the latest version of the IKVM.NET Compiler (it is not included in the IKVM.NET NuGet package) and to quickly rebuild the Stanford Parser from scratch for the first time.

I have decided to create a NuGet package for a clean port of the Stanford Parser to .NET, with strongly signed assemblies and without dependencies on F#. My primary goal has been to find a clear, simple and intuitive way to try NLP magic from .NET for all NLP lovers. Now, it is simpler than ever:

Stanford NER (also known as CRFClassifier) is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. The software provides a general (arbitrary order) implementation of linear chain Conditional Random Field (CRF) sequence models, coupled with well-engineered feature extractors for Named Entity Recognition. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) for a better introduction.) Included with the download are good 3 class (PERSON, ORGANIZATION, LOCATION) named entity recognizers for English (in versions with and without additional distributional similarity features) and another pair of models trained on the CoNLL 2003 English training data. The distributional similarity features improve performance but the models require considerably more memory.

Don Syme is an Australian computer scientist and a Principal Researcher at Microsoft Research, Cambridge, U.K. He is the designer and architect of the F# programming language, described by a reporter as being regarded as “the most original new face in computer languages since Bjarne Stroustrup developed C++ in the early 1980s”.

Earlier, Syme created generics in the .NET Common Language Runtime, including the initial design of generics for the C# programming language, along with others including Andrew Kennedy and later Anders Hejlsberg. Kennedy, Syme and Yu also formalized this widely used system.

He holds a Ph.D. from the University of Cambridge, and is a member of the WG2.8 working group on functional programming. He is a co-author of the book Expert F# 2.0.

In the past he also worked on formal specification, interactive proof, automated verification and proof description languages.