Natural Language Processing (NLP)

Modified on: Mon, 24 Oct, 2016 at 3:00 AM

As with many types of analysis, Exaptive applications are well suited to building and interacting with Natural Language Processing (NLP) models and methods. NLP models tend to be complex and to produce deep result sets that are ripe for interactive exploration by your users. You may also want to crowdsource annotation or other kinds of active feedback from your users as they explore your models. This is where data applications really shine over static visualizations, and where the Exaptive platform can help you build a powerful NLP application.

Using NLP Components in a Xap

Getting to your data: There are a number of ways to make text available.

SQL: If you have your text data in a SQL database, there are components available that let you connect, run a general query, and use the results.
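
As a rough illustration of the connect, query, and use-the-results pattern described above, here is a minimal sketch using Python's built-in sqlite3 module. It is not the Exaptive SQL component itself; the `documents` table, its columns, and the `fetch_texts` helper are hypothetical names chosen for the example.

```python
import sqlite3

def fetch_texts(conn, query, params=()):
    """Run a query and return each row as a plain dict."""
    conn.row_factory = sqlite3.Row  # lets rows be addressed by column name
    cursor = conn.execute(query, params)
    return [dict(row) for row in cursor.fetchall()]

# Hypothetical in-memory database standing in for a real text store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [("The quick brown fox",), ("NLP models have deep result sets",)],
)
conn.commit()

# A general, parameterized query whose results can be passed downstream.
rows = fetch_texts(
    conn, "SELECT id, body FROM documents WHERE body LIKE ?", ("%NLP%",)
)
```

Returning rows as plain dicts keeps the results easy to hand to whatever consumes them next.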

Web APIs: There are several API components available to read from search sources such as Google Books and PubMed. There is also a component that can call any standard web-based REST API and return the results as data that you can use directly.

Elasticsearch: If the text data you want to work with is in Elasticsearch, you can use Exaptive components to serve data requests from searches directly inside your application.

Testing: If you know what kind of application you would like to build but do not yet have the text data available, there are components available that provide text data specifically for testing. These let you dial in performance-related parameters such as the number of documents, document size, use of multiple languages, and the types of character sets to include or exclude.
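
To give a feel for what adjustable test data can look like, the sketch below generates a small synthetic corpus in plain Python. It is only an assumption-laden stand-in for the testing components: the function `make_test_corpus`, its parameters, and the sample vocabulary are all hypothetical.

```python
import random

def make_test_corpus(num_docs, words_per_doc, vocabulary, seed=0):
    """Generate num_docs pseudo-documents of words_per_doc words each."""
    rng = random.Random(seed)  # fixed seed keeps test runs repeatable
    return [
        " ".join(rng.choice(vocabulary) for _ in range(words_per_doc))
        for _ in range(num_docs)
    ]

# Mixing languages and scripts in the vocabulary exercises non-ASCII handling.
vocabulary = ["data", "model", "texte", "données", "模型", "текст"]
corpus = make_test_corpus(num_docs=5, words_per_doc=20, vocabulary=vocabulary)
```

Raising `num_docs` or `words_per_doc` is a simple way to see how an application behaves as the corpus grows.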

Example Models and Use Cases:

Text Clustering: turn your documents into an interactive landscape with key terms positioned by their co-occurrence in your corpus, and clustered into similar topics.

LDA Topic Modeling: a powerful tool for discovering abstract topics that describe your document set. This is a deeply explorable model with a lot of context.

Sentiment Analysis: score each document or snippet on a sliding scale from negative through neutral to positive.

Named Entity Extraction: by calling APIs like IBM Watson's Alchemy directly from a component, you can easily integrate features such as entity and concept recognition.

String Distance: How similar are two given strings? Using metrics like Edit Distance and Jaccard Shingle Analysis, this tool can help identify where and how strings are similar, which is very useful in fuzzy-matching use cases.
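
The two metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the String Distance component's actual implementation; the function names and the choice of character shingles of length 2 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def shingles(text, k=2):
    """Set of overlapping character k-grams (the whole text if shorter than k)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard_similarity(a, b, k=2):
    """Shared shingles divided by all distinct shingles, between 0 and 1."""
    set_a, set_b = shingles(a, k), shingles(b, k)
    return len(set_a & set_b) / len(set_a | set_b)

distance = edit_distance("kitten", "sitting")
similarity = jaccard_similarity("night", "nacht")
```

Edit distance counts character-level changes, while the shingle-based score compares which fragments two strings share, so the two can disagree on reordered text.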

Text Network: imagine your documents as a network graph, with each document, author, and key term as a node. Each document is connected to the key terms it includes, to the author who wrote it, and to any other document that is substantially similar to it. Similar metrics connect the other parts of the corpus in a highly enlightening network.
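
The node and edge structure described above might be assembled roughly as follows. This is a sketch under stated assumptions: the input record layout, the edge labels, and the use of Jaccard overlap of key terms with a 0.5 threshold as the test for "substantially similar" are all hypothetical choices, not the platform's method.

```python
from itertools import combinations

def build_text_network(documents, similarity_threshold=0.5):
    """Return (nodes, edges) linking documents, authors, and key terms."""
    nodes, edges = set(), set()
    for doc in documents:
        doc_node = ("document", doc["id"])
        author_node = ("author", doc["author"])
        nodes.update([doc_node, author_node])
        edges.add((doc_node, author_node, "written_by"))
        for term in doc["terms"]:
            term_node = ("term", term)
            nodes.add(term_node)
            edges.add((doc_node, term_node, "includes"))
    # Assumed similarity test: Jaccard overlap of the two key-term sets.
    for a, b in combinations(documents, 2):
        terms_a, terms_b = set(a["terms"]), set(b["terms"])
        union = terms_a | terms_b
        if union and len(terms_a & terms_b) / len(union) >= similarity_threshold:
            edges.add((("document", a["id"]), ("document", b["id"]), "similar_to"))
    return nodes, edges

documents = [
    {"id": 1, "author": "Ada", "terms": ["topic", "model", "corpus"]},
    {"id": 2, "author": "Ada", "terms": ["topic", "model", "cluster"]},
    {"id": 3, "author": "Grace", "terms": ["sentiment"]},
]
nodes, edges = build_text_network(documents)
```

Tagging each node with its kind keeps a document and a term with the same label from colliding in the graph.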

Visualizing Results: since NLP models tend to have complex and deep results, there are a number of highly interactive and interconnectable visualization components at your fingertips. Doing some text clustering? Try visualizing the results in our Scatterplot or Word Cloud. Trying to find documents that are similar? Try coupling the String Distance algorithm component with our Overlap Diagram. How about Sentiment Analysis? Use the Scatterplot, mapping sentiment to the x-axis and term frequency to the y-axis, to build a sentiment landscape.

Creating your own NLP Component

If the Xap Store doesn't have components available for the specific NLP model you would like to use, or if you have a great idea for a new way to explore text data, building your own Exaptive component is easy. There is great documentation available on how to build a component, as well as language-specific component information for JavaScript, R, and Python. NLP components work largely the same way as any other component, but there are a few small quirks you will want to be familiar with as you build yours.

Libraries: major libraries for NLP tasks make short work of much of the functionality you will need. Any NLP library that can be installed from a package service (pip for Python, CRAN for R) is available. This means that for Python you can use libraries like NLTK, gensim, and scikit-learn, and for R you can use popular libraries like tm, stringr, and openNLP.

Unicode: In the Exaptive platform, all strings are communicated from component to component in full Unicode. This means easy integration with alternate character sets. However, if your component needs a specific byte encoding (e.g. UTF-8, UTF-16, Latin-1), conversions can easily be made in-component with standard conversion tools.
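
In Python, for example, such an in-component conversion can be done with the standard `str.encode` and `bytes.decode` methods. The sample string below is made up for illustration, and replacing unencodable characters is just one possible policy.

```python
text = "naïve café 模型"  # a Unicode string, as received by the component

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")  # Python prepends a byte-order mark
# Latin-1 cannot represent every character, so substitute "?" where needed.
latin1_bytes = text.encode("latin-1", errors="replace")

# Decode back to a Unicode string before passing data to other components.
round_tripped = utf8_bytes.decode("utf-8")
```

Encodings that cover all of Unicode, such as UTF-8 and UTF-16, round-trip without loss, whereas a narrower one such as Latin-1 may drop characters.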

NLTK assets: One common special case you will likely run into if you use the popular NLTK library is the need to use its internally registered assets as part of your processing. Unlike other common assets, NLTK assets are more than just files your code needs to reference. However, they are very simple to use if you follow this trick.

Exporting your model: Once your model is built, you will want to export it from the component so that other components can use it. For details, please see the guide on the Exaptive data model. In short, your model needs to be constructed as a JSON-serializable object.
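
As a rough sketch of what making a model JSON-serializable can involve in Python, the helper below converts tuples, sets, and non-string dictionary keys into types that the standard json module accepts. The `to_json_safe` function and the sample `model` are hypothetical, and the Exaptive data model guide remains the authority on the exact shape expected.

```python
import json

def to_json_safe(value):
    """Recursively convert common Python containers into JSON-friendly types."""
    if isinstance(value, dict):
        return {str(key): to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if isinstance(value, (set, frozenset)):
        return sorted(to_json_safe(item) for item in value)
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    raise TypeError(f"Cannot export value of type {type(value).__name__}")

# Hypothetical model output containing tuples, a set, and integer keys.
model = {
    "topics": {0: ("corpus", "model"), 1: ("cluster", "term")},
    "vocabulary": {"corpus", "model", "cluster", "term"},
    "num_documents": 120,
}
exported = json.dumps(to_json_safe(model), sort_keys=True)
```

Raising an error on unknown types makes a non-exportable value visible at export time rather than in a downstream component.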