
natural

"Natural" is a general natural language facility for nodejs. Tokenizing,
stemming, classification, phonetics, tf-idf, WordNet, string similarity,
and some inflections are currently supported.

It's still in the early stages, so we're very interested in bug reports,
contributions and the like.

Note that many algorithms from Rob Ellis's node-nltools are
being merged into this project and will be maintained from here onward.

At the moment, most of the algorithms are English-specific, but in the long term some diversity
will be in order. Thanks to Polyakov Vladimir, Russian stemming has been added; thanks to David Przybilla, Spanish stemming has been added.

String Distance

Natural provides an implementation of the Jaro–Winkler string distance measuring algorithm.
This will return a number between 0 and 1 which tells how closely the strings match (0 = not at all, 1 = exact match):
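The library exposes this measure as natural.JaroWinklerDistance. For illustration, here is a standalone sketch of the computation itself; the 0.1 scaling factor and 4-character prefix cap are the standard Winkler parameters, and natural's implementation may differ in detail:

```javascript
// Standalone sketch of the Jaro–Winkler measure (illustrative only).
function jaroWinkler(s1, s2) {
  if (s1 === s2) return 1;
  if (s1.length === 0 || s2.length === 0) return 0;

  // Characters match if equal and within this sliding window of each other.
  var window = Math.max(Math.floor(Math.max(s1.length, s2.length) / 2) - 1, 0);
  var matched1 = new Array(s1.length).fill(false);
  var matched2 = new Array(s2.length).fill(false);

  var matches = 0;
  for (var i = 0; i < s1.length; i++) {
    var lo = Math.max(0, i - window);
    var hi = Math.min(s2.length - 1, i + window);
    for (var j = lo; j <= hi; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched1[i] = matched2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Transpositions: matched characters that appear in a different order.
  var transpositions = 0, k = 0;
  for (var i = 0; i < s1.length; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (s1[i] !== s2[k]) transpositions++;
    k++;
  }
  transpositions /= 2;

  var jaro = (matches / s1.length +
              matches / s2.length +
              (matches - transpositions) / matches) / 3;

  // Winkler boost: reward a shared prefix of up to 4 characters.
  var prefix = 0;
  while (prefix < 4 && prefix < s1.length && prefix < s2.length &&
         s1[prefix] === s2[prefix]) prefix++;
  return jaro + prefix * 0.1 * (1 - jaro);
}

console.log(jaroWinkler('MARTHA', 'MARHTA')); // ≈ 0.9611
```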

Stemmers

Currently stemming is supported via the Porter and Lancaster (Paice/Husk) algorithms.

var natural = require('natural');

This example uses a Porter stemmer. "word" is returned.

console.log(natural.PorterStemmer.stem("words")); // stem a single word

In Russian:

console.log(natural.PorterStemmerRu.stem("падший"));

In Spanish:

console.log(natural.PorterStemmerEs.stem("jugaría"));

attach() patches stem() and tokenizeAndStem() to String as a shortcut to
PorterStemmer.stem(token). tokenizeAndStem() breaks text up into single words
and returns an array of stemmed tokens.

natural.PorterStemmer.attach();
console.log("i am waking up to the sounds of chainsaws".tokenizeAndStem());
console.log("chainsaws".stem());

The same thing can be done with a Lancaster stemmer:

natural.LancasterStemmer.attach();
console.log("i am waking up to the sounds of chainsaws".tokenizeAndStem());
console.log("chainsaws".stem());

Classifiers

Two classifiers are currently supported: Naive Bayes and logistic regression.
The following examples use the BayesClassifier class, but the
LogisticRegressionClassifier class could be substituted instead.

The classifier can also be trained with and can classify arrays of tokens, strings, or
any mixture of the two. Arrays let you use entirely custom data with your own
tokenization/stemming, if you choose to implement it.

classifier.addDocument(['sell', 'gold'], 'sell');
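Only addDocument is shown above; to make the full train/classify cycle concrete, here is a standalone sketch of the bag-of-words Naive Bayes idea the classifier is built on. This is illustrative only; natural's BayesClassifier additionally handles tokenization, stemming, training events, and persistence.

```javascript
// Minimal bag-of-words Naive Bayes with Laplace smoothing (a sketch, not
// natural's actual implementation).
function NaiveBayes() {
  this.docCounts = {};   // label -> number of documents
  this.wordCounts = {};  // label -> { word -> count }
  this.vocab = {};       // global vocabulary
  this.total = 0;        // total documents seen
}

NaiveBayes.prototype.addDocument = function (tokens, label) {
  this.docCounts[label] = (this.docCounts[label] || 0) + 1;
  this.wordCounts[label] = this.wordCounts[label] || {};
  var counts = this.wordCounts[label];
  tokens.forEach(function (w) {
    counts[w] = (counts[w] || 0) + 1;
    this.vocab[w] = true;
  }, this);
  this.total++;
};

NaiveBayes.prototype.classify = function (tokens) {
  var vocabSize = Object.keys(this.vocab).length;
  var best = null, bestScore = -Infinity;
  for (var label in this.docCounts) {
    // score = log P(label) + sum over tokens of log P(word | label)
    var score = Math.log(this.docCounts[label] / this.total);
    var counts = this.wordCounts[label];
    var labelTotal = Object.keys(counts).reduce(function (sum, w) {
      return sum + counts[w];
    }, 0);
    tokens.forEach(function (w) {
      score += Math.log(((counts[w] || 0) + 1) / (labelTotal + vocabSize));
    });
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
};

var classifier = new NaiveBayes();
classifier.addDocument(['sell', 'gold'], 'sell');
classifier.addDocument(['sell', 'silver'], 'sell');
classifier.addDocument(['buy', 'gold'], 'buy');
classifier.addDocument(['buy', 'stocks'], 'buy');
console.log(classifier.classify(['sell', 'stocks'])); // 'sell'
```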

The training process can be monitored by subscribing to the trainedWithDocument event emitted by the classifier; it's emitted each time a document has finished being trained against:

classifier.events.on('trainedWithDocument', function (obj) {
    console.log(obj);
    /* {
     *   total: 23  // There are 23 total documents being trained against
     *   index: 12  // The index/number of the document that's just been trained against
     *   doc: {...} // The document that has just been indexed
     * } */
});

A classifier can also be persisted and recalled so you can reuse a training.

tf-idf

Term Frequency–Inverse Document Frequency (tf-idf) is implemented to determine how important a word (or words) is to a
document relative to a corpus. The following example will add four documents to
a corpus and determine the weight of the word "node" and then the weight of the
word "ruby" in each document.

Multiple terms can be measured as well, with their weights being added into
a single measure value. The following example determines that the last document
is the most relevant to the words "node" and "ruby".
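The example code itself isn't included above; the following standalone sketch uses the common tf × log(N/df) weighting (natural's TfIdf class may use a slightly different formula) to reproduce both measurements, with the last document, which mentions both words, scoring highest on the combined measure:

```javascript
// Standalone tf-idf sketch over four toy documents (an illustration, not
// natural's TfIdf class).
var documents = [
  'this document is about node',
  'this document is about ruby',
  'this document talks about node and node again',
  'this document is about ruby and node'
].map(function (text) { return text.split(/\s+/); });

// Raw term frequency: how often the term occurs in the document.
function tf(term, doc) {
  return doc.filter(function (w) { return w === term; }).length;
}

// Inverse document frequency: rare terms across the corpus weigh more.
function idf(term, docs) {
  var df = docs.filter(function (d) { return d.indexOf(term) !== -1; }).length;
  return df === 0 ? 0 : Math.log(docs.length / df);
}

// Multiple terms are combined by summing their individual weights.
function tfidf(terms, doc, docs) {
  return terms.reduce(function (sum, term) {
    return sum + tf(term, doc) * idf(term, docs);
  }, 0);
}

documents.forEach(function (doc, i) {
  console.log('document ' + i +
    ' node: ' + tfidf(['node'], doc, documents).toFixed(3) +
    ' ruby: ' + tfidf(['ruby'], doc, documents).toFixed(3) +
    ' node+ruby: ' + tfidf(['node', 'ruby'], doc, documents).toFixed(3));
});
```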

The examples above all use strings, which causes natural to tokenize the input automatically.
If you wish to perform your own tokenization or other kinds of processing, you
can do so and then pass in the resulting arrays directly. This approach allows you to bypass natural's
default preprocessing.

Tries

Tries are very efficient data structures used for prefix-based searches.
Natural comes packaged with a basic Trie implementation which can support match collection along a path,
existence search and prefix search.

Building The Trie

You need to add words to build up the dictionary of the Trie. This is an example of a basic Trie set-up:
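The set-up code itself is omitted above, so here is a standalone sketch of a basic trie supporting existence search and prefix search. The method names mirror natural's addString, contains, and keysWithPrefix, but this is an illustration, not the library's code:

```javascript
// Minimal trie sketch: each node is a plain object mapping characters to
// child nodes, with a $end marker for complete words.
function Trie() {
  this.root = {};
}

Trie.prototype.addString = function (word) {
  var node = this.root;
  for (var i = 0; i < word.length; i++) {
    node = node[word[i]] = node[word[i]] || {};
  }
  node.$end = true; // mark a complete word
};

// Existence search: walk the path and check for the end marker.
Trie.prototype.contains = function (word) {
  var node = this.root;
  for (var i = 0; i < word.length; i++) {
    node = node[word[i]];
    if (!node) return false;
  }
  return node.$end === true;
};

// Prefix search: walk to the prefix node, then collect all words below it.
Trie.prototype.keysWithPrefix = function (prefix) {
  var node = this.root;
  for (var i = 0; i < prefix.length; i++) {
    node = node[prefix[i]];
    if (!node) return [];
  }
  var results = [];
  (function collect(n, word) {
    if (n.$end) results.push(word);
    for (var ch in n) {
      if (ch !== '$end') collect(n[ch], word + ch);
    }
  })(node, prefix);
  return results;
};

var trie = new Trie();
['test', 'tester', 'talk'].forEach(function (w) { trie.addString(w); });
console.log(trie.contains('test'));     // existence search
console.log(trie.keysWithPrefix('te')); // prefix search
```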

ShortestPathTree

ShortestPathTree represents a data type for solving the single-source shortest paths problem in
edge-weighted directed acyclic graphs (DAGs).
The edge weights can be positive, negative, or zero. There are three APIs:
getDistTo(vertex),
hasPathTo(vertex),
pathTo(vertex).

digraph is an instance of EdgeWeightedDigraph; the second parameter is the start vertex of the DAG.
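The outputs below appear to come from Sedgewick's tinyEWDAG example with start vertex 5; since the construction of digraph and spt isn't shown, here is a standalone sketch that builds a shortest-path tree by relaxing edges in topological order (which is why negative weights are acceptable on a DAG). The edge list and internal details, such as how the source vertex itself is reported, are assumptions; natural's own classes may differ.

```javascript
// Single-source shortest paths on an edge-weighted DAG: topologically sort
// the vertices, then relax each vertex's edges in that order.
function ShortestPathTree(edges, source) {
  var adj = {}, indeg = {}, vertices = {};
  edges.forEach(function (e) {
    (adj[e[0]] = adj[e[0]] || []).push(e);
    indeg[e[1]] = (indeg[e[1]] || 0) + 1;
    vertices[e[0]] = vertices[e[1]] = true;
  });

  // Kahn's algorithm for a topological order.
  var order = [], queue = [];
  Object.keys(vertices).forEach(function (v) {
    if (!indeg[v]) queue.push(Number(v));
  });
  while (queue.length) {
    var v = queue.shift();
    order.push(v);
    (adj[v] || []).forEach(function (e) {
      if (--indeg[e[1]] === 0) queue.push(e[1]);
    });
  }

  // Relax edges in topological order; each vertex's distance is final
  // once all of its predecessors have been processed.
  this.dist = {}; this.edgeTo = {};
  this.dist[source] = 0;
  var self = this;
  order.forEach(function (v) {
    if (self.dist[v] === undefined) return; // unreachable from source
    (adj[v] || []).forEach(function (e) {
      var w = e[1], weight = e[2];
      if (self.dist[w] === undefined || self.dist[v] + weight < self.dist[w]) {
        self.dist[w] = self.dist[v] + weight;
        self.edgeTo[w] = v;
      }
    });
  });
}

ShortestPathTree.prototype.getDistTo = function (v) { return this.dist[v]; };
ShortestPathTree.prototype.hasPathTo = function (v) { return this.dist[v] !== undefined; };
ShortestPathTree.prototype.pathTo = function (v) {
  if (!this.hasPathTo(v)) return undefined;
  var path = [];
  for (var x = v; x !== undefined; x = this.edgeTo[x]) path.unshift(x);
  return path;
};

// Sedgewick's tinyEWDAG edge list: [from, to, weight].
var edges = [
  [5, 4, 0.35], [4, 7, 0.37], [5, 7, 0.28], [5, 1, 0.32], [4, 0, 0.38],
  [0, 2, 0.26], [3, 7, 0.39], [1, 3, 0.29], [7, 2, 0.34], [6, 2, 0.40],
  [3, 6, 0.52], [6, 0, 0.58], [6, 4, 0.93]
];
var spt = new ShortestPathTree(edges, 5);
console.log(spt.getDistTo(4)); // 0.35
console.log(spt.pathTo(4));    // [ 5, 4 ]
```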

getDistTo(vertex)

Returns the distance from the start vertex to vertex.

console.log(spt.getDistTo(4));

The output will be: 0.35

hasPathTo(vertex)

console.log(spt.hasPathTo(4));
console.log(spt.hasPathTo(5));

The output will be:

true
false

pathTo(vertex)

This will return the shortest path:

console.log(spt.pathTo(4));

The output will be:

[5, 4]

LongestPathTree

LongestPathTree represents a data type for solving the single-source longest paths problem in
edge-weighted directed acyclic graphs (DAGs).
The edge weights can be positive, negative, or zero. It has the same three APIs as ShortestPathTree:
getDistTo(vertex),
hasPathTo(vertex),
pathTo(vertex).

digraph is an instance of EdgeWeightedDigraph; the second parameter is the start vertex of the DAG.
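As above, the construction of digraph and spt isn't shown. The sketch below computes single-source longest paths over the same assumed tinyEWDAG edge list by relaxing edges in topological order with max instead of min; it's an illustration, and natural's internals may differ.

```javascript
// Single-source longest paths on an edge-weighted DAG: process vertices in
// topological order (via Kahn's algorithm), keeping the MAXIMUM distance.
function longestPaths(edges, source) {
  var adj = {}, indeg = {}, verts = {};
  edges.forEach(function (e) {
    (adj[e[0]] = adj[e[0]] || []).push(e);
    indeg[e[1]] = (indeg[e[1]] || 0) + 1;
    verts[e[0]] = verts[e[1]] = true;
  });

  var queue = Object.keys(verts).filter(function (v) { return !indeg[v]; })
                                .map(Number);
  var dist = {}, edgeTo = {};
  dist[source] = 0;
  while (queue.length) {
    var v = queue.shift(); // all predecessors of v are already processed
    (adj[v] || []).forEach(function (e) {
      var w = e[1];
      if (dist[v] !== undefined &&
          (dist[w] === undefined || dist[v] + e[2] > dist[w])) {
        dist[w] = dist[v] + e[2];
        edgeTo[w] = v;
      }
      if (--indeg[w] === 0) queue.push(w);
    });
  }

  return {
    getDistTo: function (v) { return dist[v]; },
    hasPathTo: function (v) { return dist[v] !== undefined; },
    pathTo: function (v) {
      if (dist[v] === undefined) return undefined;
      var path = [];
      for (var x = v; x !== undefined; x = edgeTo[x]) path.unshift(x);
      return path;
    }
  };
}

// Sedgewick's tinyEWDAG edge list: [from, to, weight].
var edges = [
  [5, 4, 0.35], [4, 7, 0.37], [5, 7, 0.28], [5, 1, 0.32], [4, 0, 0.38],
  [0, 2, 0.26], [3, 7, 0.39], [1, 3, 0.29], [7, 2, 0.34], [6, 2, 0.40],
  [3, 6, 0.52], [6, 0, 0.58], [6, 4, 0.93]
];
var lpt = longestPaths(edges, 5);
console.log(lpt.getDistTo(4)); // ≈ 2.06
console.log(lpt.pathTo(4));    // [ 5, 1, 3, 6, 4 ]
```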

getDistTo(vertex)

Returns the distance from the start vertex to vertex.

console.log(spt.getDistTo(4));

The output will be: 2.06

hasPathTo(vertex)

console.log(spt.hasPathTo(4));
console.log(spt.hasPathTo(5));

The output will be:

true
false

pathTo(vertex)

This will return the longest path:

console.log(spt.pathTo(4));

The output will be:

[5, 1, 3, 6, 4]

WordNet

One of the newest and most experimental features in natural is WordNet integration. Here's an
example of using natural to look up definitions of the word node. To use the WordNet module,
first install the WordNet database files using the WNdb module:

npm install WNdb

(For node < v0.6, please use 'npm install WNdb@3.0.0')

Keep in mind that the WordNet integration is to be considered experimental at this point,
and not production-ready. The API is also subject to change.

Spellcheck

The spellchecker suggests corrections (sorted by probability in descending order) that are up to a maximum edit distance away from the input word. According to Norvig, a maximum distance of 1 will cover 80% to 95% of spelling mistakes; beyond a distance of 2 it becomes very slow.
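A standalone sketch of the Norvig-style approach described above: generate every candidate within edit distance 1 and keep those found in a word-frequency table, sorted by frequency. The frequencies table here is a toy stand-in for real corpus counts, and this is not natural's actual spellchecker code:

```javascript
// Toy corpus counts; a real spellchecker would build this from a corpus.
var frequencies = { something: 100, soothing: 10 };

// All candidates at edit distance 1: deletes, transposes, replaces, inserts.
function edits1(word) {
  var letters = 'abcdefghijklmnopqrstuvwxyz', results = [];
  for (var i = 0; i <= word.length; i++) {
    var left = word.slice(0, i), right = word.slice(i);
    if (right) results.push(left + right.slice(1));                // delete
    if (right.length > 1)
      results.push(left + right[1] + right[0] + right.slice(2));   // transpose
    for (var j = 0; j < letters.length; j++) {
      if (right) results.push(left + letters[j] + right.slice(1)); // replace
      results.push(left + letters[j] + right);                     // insert
    }
  }
  return results;
}

// Keep known words only, deduplicated, most frequent first.
function corrections(word) {
  var seen = {};
  return edits1(word)
    .filter(function (c) {
      if (seen[c] || !frequencies[c]) return false;
      return (seen[c] = true);
    })
    .sort(function (a, b) { return frequencies[b] - frequencies[a]; });
}

console.log(corrections('soemthing')); // [ 'something' ]
```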

Development

The current configuration of the unit tests requires the following environment variable to be set:

export NODE_PATH=.

License

Copyright (c) 2011, 2012 Chris Umbel, Rob Ellis, Russell Mull

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

WordNet License

This license is available as the file LICENSE in any downloaded version of WordNet.
WordNet 3.0 license:

WordNet Release 3.0

This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions:

Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution.

WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.

THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.

The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.