Introduction

Language identification is a key task in the text mining process. Successful analysis of extracted text with natural language processing or machine learning requires a good language identification algorithm: if it fails to recognize the language, the error carries into every subsequent step and invalidates the results. NLP algorithms must be adjusted for different corpora and for the grammar of each language, and certain NLP software is best suited to certain languages. For example, NLTK is the most popular natural language processing package for English under Python, whereas FreeLing is best for Spanish. The efficiency of language processing depends on many factors.

A very high-level model for text analysis includes the following tasks:

Text Extraction
Text can be extracted by: scraping a web site, importing it in a specific format, getting it from a database, or accessing it via an API.

Text Identification
Text identification is the process of separating the interesting text from other content or formatting that adds noise to the analysis. For example, a blog can include advertising, menus, and other information besides the main content.

NLP
NLP is a set of algorithms to aid in the processing of different languages. See links to NLP software packages and articles here.

Machine Learning
Machine learning is a necessary step for tasks such as collaborative filtering, sentiment analysis and clustering.
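
To make the text identification step above more concrete, here is a minimal sketch in Python that keeps the visible text of an HTML page and skips elements that usually hold noise. It uses only the standard library's html.parser module. The NOISE_TAGS set and the names MainTextExtractor and extract_main_text are assumptions made for this illustration; real pages often need site-specific rules or a dedicated content extraction library.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold menus, ads, or code rather than main content.
NOISE_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0  # how many noise elements we are currently inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside noise elements.
        if self._noise_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_main_text(html):
    """Return the text of `html` with the content of noise elements removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling extract_main_text on a page whose menu sits in a nav element and whose advertising sits in an aside element returns only the text of the remaining elements, which can then be passed to the language identifier.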

Software Alternatives

There is a lot of language identification software available on the web. NLTK uses Crúbadán, while GATE includes TextCat. At Data Big Bang, we like to use the Google Language API because it is very accurate even for a single word. Its response also includes a measure of the accuracy of the detection.

Sadly, Google has deprecated the Google Language API Family and we have added them to our “Google NoAPI” list. They can be used until they are shut down.

The Google Language API for language identification is very easy to use. It used to be very permissive in terms of usage limits, but the rate limit status can now be found in the console.
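
For readers curious about what a local identifier does under the hood, the following is a rough sketch of the character n-gram ranking idea that tools in the TextCat family are based on: build a ranked profile of the most frequent n-grams per language, then pick the language whose profile is closest to the profile of the input. This is not TextCat's actual code. The function names, the parameter defaults, and the two toy training sentences are all assumptions for illustration, and real profiles need far more text per language.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map each of the most frequent character n-grams to its rank (0 = most frequent)."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, penalty):
    """Sum the rank differences; n-grams missing from lang_profile cost `penalty`."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, top=300):
    """Return the language whose profile has the smallest distance to the text."""
    doc_profile = ngram_profile(text, top=top)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang], top),
    )


# Toy training samples (hypothetical); far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs to the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y corre a la casa",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

A call such as identify("la casa es grande", PROFILES) returns one of the keys of PROFILES. With profiles this small the answer is not reliable, which is exactly why short inputs are a hard case for this kind of algorithm.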

Benchmarking

Different language identification algorithms can be easily benchmarked against Google's. Testing with single words and short sentences is a good indicator, especially if the algorithms will be used for services like Twitter, where the sentences are very short.
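
A benchmark of this kind can be sketched as a small harness that runs a candidate identifier and a reference identifier over the same samples and reports how often they agree. The names Detector and benchmark are assumptions for this example. Each detector is simply a function that takes a text and returns a language code, so the reference can wrap whatever service or labeled data set is available.

```python
from typing import Callable, List, Tuple

# A detector is any function that takes a text and returns a language code.
Detector = Callable[[str], str]


def benchmark(
    candidate: Detector, reference: Detector, samples: List[str]
) -> Tuple[float, List[str]]:
    """Return the agreement rate and the samples on which the detectors disagree."""
    if not samples:
        raise ValueError("samples must not be empty")
    disagreements = [
        text for text in samples if candidate(text) != reference(text)
    ]
    rate = (len(samples) - len(disagreements)) / len(samples)
    return rate, disagreements
```

Keeping the list of disagreements, and not only the rate, makes it easier to see whether a candidate fails mostly on single words, on short sentences, or on one particular language.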