salient

salient is a natural language processing and machine learning toolkit. It covers many common tasks, including sentiment analysis, part-of-speech tagging, tokenization, neural networks, regression analysis, Wiktionary parsing, logistic regression, language modeling, minimal perfect hashing (mphf), radix trees, vocabulary building and the potential for more awesomeness. It can be used for classification, categorization and many common text processing tasks, all in node.js :D

Tokenization

There are plenty of libraries that perform tokenization, and this library is
no different, except that it also performs the cleanup steps needed for
messy HTML, XML, Wiki, Twitter and other sources. More examples are in the
specs directory. Tokenizers in salient are built on top of each other and
handle the following:

Times, numerics (including ordinals such as 1st, 2nd, etc.), numerics with commas/decimals/percents, currency ($), hyphenated words, words with and without accents, words with and without apostrophes, punctuation, and optional emoticon preservation.
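For example, basic tokenization looks roughly like the sketch below; the WordPunctTokenizer name is an assumption here, so check the specs directory for the exact tokenizer classes and options.

```js
// Hypothetical usage sketch; the tokenizer class name here is an assumption -
// see the specs directory for the exact classes and behaviour.
var salient = require('salient');

var tokenizer = new salient.tokenizers.WordPunctTokenizer();
var tokens = tokenizer.tokenize("The 2nd quarter closed at $1,234.56 - that's crazy good :)");
console.log(tokens);
// Times, ordinals, currency, hyphenated/apostrophe words and (optionally)
// emoticons should survive as single tokens rather than being split apart.
```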

Part of Speech Tagging

Part of speech tagging is done primarily through a trigram hidden Markov model (HMM). While many methods have been developed since, the trigram HMM remains one of the easiest to implement while maintaining effective accuracy. The model was built using several online resources, including bootstrapping the vocabulary from Wiktionary (https://www.wiktionary.org/). Seeding the model with an existing dictionary of sorts is a common alternative to a fully unsupervised approach and gives the model a bit of an edge. In some cases, the dictionary can instead be generated from a part of speech corpus (tagged manually or automatically).

On top of Wiktionary, I am using several corpora to build the English language model, including the Brown Corpus, the Penn TreeBank and the Twitter TreeBank. These treebanks provide a resource for calculating and training the model in supervised learning cases. The actual tagging is done using the Viterbi path-finding algorithm, implemented for all standard models. The Spanish model is trained using the IULA Spanish LSP TreeBank. You will notice both models are stored in the bin directory.
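To make the decoding step concrete, the sketch below shows Viterbi decoding over a simplified bigram HMM. This is illustration code rather than the library's implementation, which uses the full trigram model.

```js
// Illustration only: Viterbi decoding over a simplified *bigram* HMM.
// salient's tagger uses a trigram model, but the path-finding idea is the same.
// transition[prev][tag] and emission[tag][word] are probabilities estimated
// from a tagged corpus; '<s>' marks the start of a sentence.
function viterbi(words, tags, transition, emission) {
  var V = [{}];
  var back = [{}];
  tags.forEach(function (tag) {
    V[0][tag] = ((transition['<s>'] || {})[tag] || 1e-8) *
                ((emission[tag] || {})[words[0]] || 1e-8);
  });
  for (var i = 1; i < words.length; i++) {
    V[i] = {};
    back[i] = {};
    tags.forEach(function (tag) {
      var best = -Infinity, bestPrev = null;
      tags.forEach(function (prev) {
        var score = V[i - 1][prev] *
                    ((transition[prev] || {})[tag] || 1e-8) *
                    ((emission[tag] || {})[words[i]] || 1e-8);
        if (score > best) { best = score; bestPrev = prev; }
      });
      V[i][tag] = best;
      back[i][tag] = bestPrev;
    });
  }
  // Trace back from the highest scoring final tag to recover the sequence.
  var n = words.length - 1;
  var path = [tags.reduce(function (a, b) { return V[n][a] > V[n][b] ? a : b; })];
  for (var j = n; j > 0; j--) {
    path.unshift(back[j][path[0]]);
  }
  return path;
}
```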

Glossary

The glossary is sometimes used for looking up concepts, terms or relationships between terms. It is far from perfect, but it provides a good use case for some information retrieval. You may find other libraries do this better, but I'm currently using this part of the library to build towards sentiment analysis use cases.

The glossary catalogues things I found useful when building the sentiment analysis portion, including copular verbs, linking verbs, terms that are often filtered (i.e. stop terms), question terms, time-sensitive nouns, amplifiers, clauses, coordinating conjunctions, negations, conditionals (ORs) and contractions. All of these proved very useful in the sentiment analysis stage, where the particular algorithm I implemented is described in more detail below.

As you can see in the code sample below, I have done a bit of chunking for terms, and with some common filtering rules I can combine filtered terms, determiners with their nouns, and so on. Additionally, the output of the glossary helps me debug when something gets parsed oddly. With this utility I can follow the flow of logic in a given sentence and more easily look at relationships between terms.
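A minimal sketch of what that looks like, assuming the glossary is exposed as salient.glossary.Glossary (the constructor and the shape of its output are assumptions here; the specs directory has the real API):

```js
// Hypothetical usage sketch; the Glossary constructor and the shape of its
// parsed output are assumptions - see the specs directory for the real API.
var salient = require('salient');

var glossary = new salient.glossary.Glossary();
glossary.parse('The quick brown fox jumps over the lazy dog');

// Inspecting the parsed structure is a handy way to debug odd parses; chunked
// phrases such as "the lazy dog" should show up grouped with their determiners.
console.log(glossary);
```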

The flow of logic is further helped when I include negations of any sort, as shown below. The inclusion of a negation breaks the flow of the logic and instead negates the rest of the sentence. This is explored further in the sentiment analysis stage of the library.
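Continuing the hypothetical sketch above:

```js
// Continuing the hypothetical sketch above, this time with a negation.
glossary.parse("The fox wasn't all that quick");

// The negation ("wasn't") shows up as a node that flips the interpretation
// of everything that follows it in the sentence.
console.log(glossary);
```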

You can additionally use the output of the glossary.parse function to retrieve a simple concepts map. This effectively looks for noun terms within the parsed output of the glossary. The above example would then look roughly like the following sketch.
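The node properties used here (root, next, tag, term) are assumptions about the parsed structure rather than the library's documented API:

```js
// Hypothetical sketch of pulling noun "concepts" out of the parsed glossary.
// The node properties used here (root, next, tag, term) are assumptions about
// the parsed structure, not the library's documented API.
function concepts(glossary) {
  var found = [];
  for (var node = glossary.root; node; node = node.next) {
    if (node.tag === 'NOUN') {
      found.push(node.term);
    }
  }
  return found;
}

glossary.parse('The quick brown fox jumps over the lazy dog');
console.log(concepts(glossary)); // noun terms such as 'fox' and 'dog'
```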

Sentiment Analysis

The approach I took to sentiment analysis builds on top of a rather simple naive Bayes classifier: a set of buckets classified between varying degrees of positive, negative and neutral, where the terms in each bucket may be up to n-grams. On top of this, the sentiment algorithm makes use of amplifiers (terms that tend to amplify or show additional excitement about an existing term, e.g. 'crazy good').

It makes use of most of the features obtained from the glossary above (including negations). The analyzer looks for scorable terms that the Bayes buckets specify, using LCS (longest common substring) matching, and determines whether terms are scorable at all given the filtering rules. Once all the possible terms in the text have been scored, the text is traversed again node by node (using a finite state machine) to detect things like conditionals, negations, semantic clauses, amplifiers, inclusions (ANDs), and final orientation terms such as hashtags, which may negate the entire text (common in the case of sarcastic tweets).

Finally, once we've determined the overall polarity of the text, we give it a cumulative score. This process also identifies semantic orientation on a per-term basis, which means we can go back and actually see the orientation of individual terms.
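For reference, running the analyzer looks roughly like the sketch below; the BayesSentimentAnalyser class and classify method are assumed names here, so check bin/tests for the actual entry points.

```js
// Hypothetical usage sketch; the analyser class and method names are
// assumptions - bin/tests shows the actual entry points.
var salient = require('salient');

var analyser = new salient.sentiment.BayesSentimentAnalyser();

// Amplifier + positive term: expect a positive cumulative score.
console.log(analyser.classify('This phone is crazy good'));

// A negation flips the polarity of the rest of the sentence.
console.log(analyser.classify("This phone isn't any good"));
```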

While these examples are cool, that doesn't mean this is some super magical system that gets it right all the time. Language is complex and finicky, and sentiment analysis requires inductive reasoning along with a load of other hard AI problems. That being said, a reasonable number of cases are shown in bin/tests.

Notes

It should be noted that most machine learning algorithms are better suited to environments that can take advantage of many cores, such as GPU-accelerated machine learning. This is necessary to speed up both training and the task at hand, since many of the complex linear algebra operations involved can be done efficiently in parallel. As a result, this project is more of an example implementation for a wide variety of machine learning and artificial intelligence problems. For more robust implementations, it is recommended that you glean from my implementations and others (e.g. Andrew Ng's Machine Learning course) and use that within the scope of your own projects. That said, techniques such as map-reduce may improve the performance of some of these operations within this package by running them across multiple cores and multiple systems in parallel.