
The Easiest-to-Use Free/Open Source Text Analysis Software

The good news about free and open-source solutions for text analytics is that there are a ton of them.

The bad news is that you’ll need a linguist working together with a data scientist to get some of them to work. “Some assembly required” is definitely true of many solutions for text analytics and sentiment analytics.

For this reason, we’re focusing on tools that a normal business user can actually get up and running within a few minutes. We promise, you won’t need to compile source code or master complex algorithms. You may need to watch a few YouTube demos, but you were probably expecting that.

I’ve personally demoed the following solutions to test their ease of use. Many tools were demoed, but few were selected. Here are the ones we’ll cover: RapidMiner, KNIME, Open Calais and AntWordProfiler.

RapidMiner

RapidMiner is a free, open-source platform for data science, including data mining, text mining and predictive analytics. The features of RapidMiner can be significantly enhanced with add-ons or extensions, many of which are also available for free.

Thomas Ott, marketing data scientist at RapidMiner, explains, “The beauty of RapidMiner is that it’s visual programming: You don’t have to write the code, and you don’t have to know the math behind it.”

Among other extensions, the RapidMiner Marketplace offers a very functional and user-friendly add-on for sentiment analytics developed by third-party vendor AYLIEN.

AYLIEN’s extension can automatically scrape data from Twitter (as can RapidMiner). It then analyzes tweets and scores them with a three-value sentiment scale: positive, negative or neutral.
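To make the three-value scale concrete, here is a minimal Python sketch of the idea. It is not AYLIEN’s method or API: the word lists and the `score_sentiment` function are hypothetical, and real tools rely on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"love", "great", "good", "awesome", "happy"}
NEGATIVE = {"hate", "bad", "awful", "terrible", "sad"}

def score_sentiment(text):
    """Return 'positive', 'negative' or 'neutral' for a short text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Positive hits minus negative hits decides the label.
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"

tweets = [
    "I love this new IPA, great stuff",
    "Terrible service, bad beer",
    "Picked up a six-pack on the way home",
]
labels = [score_sentiment(t) for t in tweets]
```

A tweet with no words from either list, or with equal hits on both, falls into the neutral bucket.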

In addition to reading from web sources such as Twitter, RapidMiner can also read directly from flat files such as CSV and Excel files or databases.

RapidMiner also offers its own extension for text analytics, which includes powerful text processing features that can be combined with advanced clustering algorithms and machine learning operators.

As Ott explains, “there are two main approaches to looking at text. One is doing a high-level overview: word counts, word frequency, where words occur in the corpus [the collection of documents being analyzed] etc. The other is more heavy-duty, e.g., sentiment analysis and other techniques in which you train a machine-learning algorithm on a data set.”
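The “high-level overview” Ott mentions can be approximated in a few lines of Python. This is only a sketch of word counting over a tiny made-up corpus. The `word_frequencies` helper is a hypothetical name, not a RapidMiner operator.

```python
from collections import Counter
import re

def word_frequencies(documents):
    """Count how often each lowercase word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

# A made-up two-document corpus.
corpus = ["The dollar rose against the euro.", "The euro fell."]
freq = word_frequencies(corpus)
top = freq.most_common(2)  # the two most frequent words with their counts
```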

Adding clustering algorithms to a text mining workflow in RapidMiner

As you can see in the above screenshot, adding advanced analytics to a basic text mining workflow in RapidMiner is as simple as dragging and dropping operators into the proper locations.

Once this is done, it’s possible to output complex visualizations. For example, you can create a network showing the relationships between a specific term you want to focus on (such as a brand name) and other terms in the document you’re analyzing.

The following screenshot is an example of this kind of visualization. Ott applied clustering algorithms to Federal Reserve Bank meeting minutes to understand relationships between the currencies and concepts under discussion in the meetings:

Using a clustering algorithm to show relationships between terms in a document
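For readers curious about what sits underneath a term network like this, here is a rough Python sketch. It uses simple sentence-level co-occurrence counts rather than the clustering algorithms Ott applied. The sample sentences and the `cooccurrence_edges` function are invented for illustration.

```python
from collections import Counter
import re

def cooccurrence_edges(sentences, focus):
    """Count how often each other word shares a sentence with `focus`."""
    edges = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if focus in words:
            # Every other word in the sentence gets one link to the focus term.
            edges.update(words - {focus})
    return edges

sentences = [
    "The dollar weakened against the euro",
    "Inflation pressure on the dollar eased",
    "The yen was stable",
]
edges = cooccurrence_edges(sentences, "dollar")
```

Each entry in `edges` can be read as a link from the focus term to another term, weighted by how many sentences they share.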

When combined with sentiment analysis, such clustering techniques can have transformative impacts on small businesses in traditional industries such as retail. Ott gives a great example:

“I’m a beer brewer, and did some Twitter analysis of the brands that people are talking about based on region. It turns out that people on the coasts talk about IPAs, people in the midwest talk about stouts, and people in the southwest talk about ales. This is a key thing for small businesses to look at. Say, for instance, that I’m a Kia dealer, and I find out that people in Michigan like red cars and people in Montana like blue—I can then adjust my stock accordingly.”

Thomas Ott, marketing data scientist at RapidMiner

Takeaway: RapidMiner is the easiest to use and most fully featured text mining tool of the platforms I demoed. With the AYLIEN extension, you’ll be able to perform basic sentiment analysis within minutes of downloading and installing.

KNIME

KNIME is another robust open-source data mining platform available in a free version with rich functionality.

Like RapidMiner, KNIME offers an intuitive visual workflow builder for “programming-free” data mining. It also offers a number of the same operators as RapidMiner (operators are known as “nodes” in KNIME).

With KNIME, users can perform tasks such as:

Stemming: Collapsing variations on key terms into basic forms

Stop word filtering: Removal of insignificant terms such as “in,” “for” and “the”
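Both tasks can be sketched in plain Python. This is not how KNIME’s nodes are implemented: the stop word set is a small illustrative subset, and `naive_stem` is a crude suffix stripper rather than a real stemming algorithm such as Porter’s.

```python
import re

STOP_WORDS = {"in", "for", "the", "a", "an", "of", "and", "to"}  # illustrative subset
SUFFIXES = ("ing", "ed", "es", "s")  # crude rules, not the Porter algorithm

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

result = preprocess("Searching for the brewers in the regions")
```

Filtering stop words before stemming keeps the stemmer from mangling short function words.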

KNIME is also comparable to RapidMiner in its ability to read directly from Twitter as well as flat files such as CSVs.

I personally found KNIME’s workflow interface to be somewhat more difficult to use than RapidMiner’s, despite their similarities. For example, the logic of the input/output pipeline strikes me as being implemented more naturally in RapidMiner than in KNIME.

Additionally, RapidMiner offers detailed automated suggestions around why operators in your workflow don’t connect, which make it easy for a complete novice to build a functional text mining pipeline and troubleshoot problems when they occur.

KNIME offers good descriptions of nodes, but explanations of why nodes won’t connect are often cryptic.

Finally, RapidMiner currently offers more text processing and sentiment analytics extensions than KNIME.

On the other hand, the free version of KNIME offers more extensive data processing capabilities than the free version of RapidMiner. The free version of KNIME places no limit on the number of rows you can process, which makes it better for large data sets, and no limit on the number of physical cores or logical processors it can use for data processing.

Takeaway: Experienced business analysts and data scientists will be comfortable using either RapidMiner or KNIME, and should demo both in order to make a decision based on the advanced functionality of these platforms. Novices will be better served by starting with RapidMiner. Not only is RapidMiner’s interface easier to learn, but there’s also more documentation out there on how to use it.

Open Calais

Open Calais is a cloud-based content tagging tool offered by Thomson Reuters. Unlike RapidMiner and KNIME, it’s not a data mining suite with text mining extensions, and it doesn’t do sentiment analysis. Instead, it excels in the realm of entity recognition and extraction.

You feed unstructured text into Open Calais, and it recognizes entities such as people, products and companies. Open Calais also recognizes relationships between entities and facts about entities. It even organizes entities into topics.

Open Calais can thus be used to quickly extract information from documents. This information can then be used to tag documents for classification.

Some use cases for this functionality include:

Tagging blog articles to improve navigation on a site

Tagging internal resources on a corporate intranet to help employees find them using search

Tagging knowledge base articles, academic archives and similar collections
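The general idea of tagging a document with the entities it mentions can be sketched as a dictionary lookup. This is a deliberately simplified stand-in, not the Open Calais API or its recognition method. The `GAZETTEER` entries and the `tag_document` function are hypothetical.

```python
# Hypothetical gazetteer: known entity name -> entity type.
GAZETTEER = {
    "thomson reuters": "Company",
    "rapidminer": "Company",
    "thomas ott": "Person",
}

def tag_document(text):
    """Return sorted (entity, type) tags for known names found in the text."""
    lowered = text.lower()
    return sorted((name, kind) for name, kind in GAZETTEER.items() if name in lowered)

tags = tag_document("Thomas Ott works at RapidMiner.")
```

A real entity recognizer also finds names it has never seen before, which a fixed lookup table cannot do.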

Takeaway: Unlike RapidMiner and KNIME, Open Calais won’t work for basic text processing or advanced sentiment analytics. It’s very good at recognizing entities for analysis of unstructured text, and is a robust tool for document tagging.

AntWordProfiler

AntWordProfiler is a freeware tool created by Laurence Anthony, a professor at Waseda University’s Center for English Language Education in Science and Engineering. Anthony has a PhD in linguistics, and the tool he’s created excels at quick vocabulary profiling of large files.

Determining word frequency with AntWordProfiler

AntWordProfiler uses preloaded vocabulary and thesaurus lists, which can be edited by the user, in order to determine word frequency. Users can also load custom vocabulary lists into the tool.

Results can then be saved in a text file formatted for easy viewing in Excel or another spreadsheet tool. There’s also a document viewer that highlights where terms in your vocabulary lists appear in the document.
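Vocabulary profiling of this kind can be sketched in Python as follows. This only illustrates the concept under simple assumptions and is not AntWordProfiler’s implementation. The vocabulary lists, `profile_vocabulary` and `to_tsv` are invented names.

```python
from collections import Counter
import re

def profile_vocabulary(text, vocab_lists):
    """Count tokens per vocabulary list; unmatched tokens go to 'off-list'."""
    profile = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for name, words in vocab_lists.items():
            if token in words:
                profile[name] += 1
                break
        else:
            # No list contained this token.
            profile["off-list"] += 1
    return profile

def to_tsv(profile):
    """Build tab-separated rows, which spreadsheet tools can open directly."""
    rows = ["list\ttokens"]
    rows += [f"{name}\t{count}" for name, count in sorted(profile.items())]
    return "\n".join(rows)

vocab_lists = {"basic": {"the", "cat", "sat"}, "academic": {"analyze", "data"}}
profile = profile_vocabulary("The cat sat; we analyze data quickly", vocab_lists)
report = to_tsv(profile)
```

Writing `report` to a `.txt` or `.tsv` file gives you something Excel can open with one column per field.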

Takeaway: AntWordProfiler can be used for quick counts of word frequency in complex, unstructured texts, as well as custom vocabulary profiling of unstructured texts. Unlike RapidMiner and KNIME, however, it’s not an end-to-end text mining solution.

Grab Bag: Even More Toys!

Here are a few other neat toys you should consider experimenting with:

Carrot2: A dedicated tool for applying clustering algorithms to documents. There’s a web-based interface for applying some common clustering algorithms that can help with organizing documents into thematic categories. Carrot2 also integrates with the APIs of popular search engines in order to automatically cluster the results of keyword searches. It can thus be used in search engine optimization (SEO).

AYLIEN Google Sheets add-on: AYLIEN, the same company that develops the sentiment analytics extension for RapidMiner that we examined, also offers an add-on for doing sentiment analysis directly within Google Sheets. This is one of the easiest ways to score sentiment in a spreadsheet-style interface, but the number of API calls you can make per day with the free plan is limited.

The Data Science Toolkit: A collection of easy-to-use, web-based text mining tools, including basic sentiment analysis. The sentiment analysis tool only supports analysis of short chunks of text at this point. There are also lots of tools for geocoding text. For instance, you can translate street addresses to coordinates. These tools are also available via API calls for advanced use cases.
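If you want a feel for what document clustering of the kind Carrot2 performs is trying to achieve, here is a toy Python sketch. It uses a simple greedy grouping by word overlap (Jaccard similarity), which is an assumption made for illustration and not one of Carrot2’s algorithms. The function names and the 0.3 threshold are arbitrary choices.

```python
import re

def tokens(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of words two sets have in common, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def greedy_cluster(documents, threshold=0.3):
    """Assign each document to the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_tokens, [document indexes])
    for index, doc in enumerate(documents):
        doc_tokens = tokens(doc)
        for seed, members in clusters:
            if jaccard(seed, doc_tokens) >= threshold:
                members.append(index)
                break
        else:
            # Nothing was close enough, so this document seeds a new cluster.
            clusters.append((doc_tokens, [index]))
    return [members for _, members in clusters]

docs = [
    "craft beer brewing tips",
    "used car dealer prices",
    "home beer brewing kit",
    "car dealer stock",
]
groups = greedy_cluster(docs)
```

Here the beer documents and the car documents end up in separate groups because they share no words with each other.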