Natural Language Processing Made Easy – Using spaCy (in Python)

Introduction

Natural Language Processing (NLP) is one of the principal areas of Artificial Intelligence. NLP plays a critical role in many intelligent applications such as automated chatbots, article summarizers, multilingual translation, and opinion mining from data. Every industry that exploits NLP to make sense of unstructured text data demands not just accuracy, but also speed in obtaining results.

Natural Language Processing is a vast field, and some of its tasks are text classification, entity detection, machine translation, question answering, and concept identification. In one of my previous articles, I discussed the various tools and components used in the implementation of NLP. Most of the components discussed in that article were described using the venerable NLTK (Natural Language Toolkit) library.

In this article, I will share my notes on one of the most powerful and advanced libraries used to implement NLP: spaCy.

Table of Contents

About spaCy and Installation

SpaCy pipeline and properties

Tokenization

Part-of-Speech Tagging

Entity Detection

Dependency Parsing

Noun Phrases

Word Vectors

Integrating spaCy with Machine Learning

Comparison with NLTK and CoreNLP

1. About spaCy and Installation

1.1 About

spaCy is written in Cython (a superset of Python designed to give C-like performance to Python programs) and is hence quite fast. spaCy provides a concise API to access its methods and properties, governed by trained machine (and deep) learning models.

1.2 Installation

spaCy, its data, and its models can be easily installed using the Python Package Index and setuptools. Use the following command to install spaCy on your machine:

sudo pip install spacy

If you are using Python 3, replace “pip” with “pip3” in the above command.

Alternatively, download the source from here and, after unzipping, run the following command:

python setup.py install

To download all the data and models, run the following command, after the installation:

python -m spacy.en.download all

You are now all set to explore and use spaCy.

2. SpaCy Pipeline and Properties

Implementation of spaCy and access to its different properties begins with creating a pipeline. A pipeline is created by loading a model. The package provides different types of models, each containing information about a language – vocabularies, trained vectors, syntax, and entities.

We will load the default English model, english-core-web.

import spacy
nlp = spacy.load("en")

The object “nlp” is used to create documents, access linguistic annotations, and explore different NLP properties. Let’s create a document by loading text data into our pipeline. I am using reviews of a hotel obtained from TripAdvisor’s website. The data file can be downloaded here.

The document is now part of spaCy’s English model class and is associated with a number of properties. The properties of a document (or its tokens) can be listed using the following command:

dir(document)
>> [ 'doc', 'ents', … 'mem']

This outputs a wide range of document properties, such as tokens, token reference indices, part-of-speech tags, entities, vectors, sentiment, vocabulary, etc. Let’s explore some of these properties.

2.1 Tokenization

Every spaCy document is tokenized into sentences and further into tokens, which can be accessed by iterating over the document:

# first token of the doc
document[0]
>> Nice
# last token of the doc
document[len(document)-5]
>> boston
# List of sentences of our doc
list(document.sents)
>> [ Nice place Better than some reviews give it credit for.,
Overall, the rooms were a bit small but nice.,
...
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).]

2.2 Part of Speech Tagging

Part-of-speech tags are properties of a word defined by its usage in a grammatically correct sentence. These tags can be used as text features in information filtering, statistical models, and rule-based parsing.

2.3 Entity Detection

spaCy includes a fast entity recognition model that is capable of identifying entity phrases in a document. Entities can be of different types, such as person, location, organization, date, numeral, etc. These entities can be accessed through the “.ents” property.

Let’s find all the types of named entities present in our document.

2.4 Dependency Parsing

One of the most powerful features of spaCy is its extremely fast and accurate syntactic dependency parser, which can be accessed via a lightweight API. The parser can also be used for sentence boundary detection and phrase chunking. The relations can be accessed through properties such as “.children”, “.root”, and “.ancestors”.

Let’s parse the dependency tree of all the sentences that contain the term “hotel” and check which adjectival tokens are used with it. I have created a custom function that parses a dependency tree and extracts the relevant POS tags.

4. Machine Learning with Text Using spaCy

Integrating spaCy into a machine learning model is pretty easy and straightforward. Let’s build a custom text classifier using scikit-learn. We will create an sklearn pipeline with the following components: cleaner, tokenizer, vectorizer, classifier. For the tokenizer and vectorizer we will build our own custom modules using spaCy.

Let’s now create a custom tokenizer function using the spaCy parser and some basic cleaning. One thing to note here is that the text features can be replaced with word vectors (which is especially beneficial in deep learning models).

5. Comparison with other libraries

spaCy is a very powerful, industrial-strength package for almost all natural language processing tasks. If you are wondering why, let’s compare spaCy with two other well-known tools for implementing NLP in Python: CoreNLP and NLTK.

Feature Availability

Feature                    spaCy   NLTK   CoreNLP
Easy installation            Y       Y       Y
Python API                   Y       Y       N
Multi-language support       N       Y       Y
Tokenization                 Y       Y       Y
Part-of-speech tagging       Y       Y       Y
Sentence segmentation        Y       Y       Y
Dependency parsing           Y       N       Y
Entity recognition           Y       Y       Y
Integrated word vectors      Y       N       N
Sentiment analysis           Y       Y       Y
Coreference resolution       N       N       Y

Speed: Key Functionalities – Tokenizer, Tagging, Parsing

Package    Tokenizer   Tagging   Parsing
spaCy      0.2ms       1ms       19ms
CoreNLP    2ms         10ms      49ms
NLTK       4ms         443ms     –

Accuracy: Entity Extraction

Package    Precision   Recall   F-Score
spaCy      0.72        0.65     0.69
CoreNLP    0.79        0.73     0.76
NLTK       0.51        0.65     0.58

End Notes

In this article, we discussed spaCy – a complete package for implementing NLP tasks in Python. We went through various examples showcasing spaCy’s usefulness, speed, and accuracy. Finally, we compared the package with other well-known NLP libraries: CoreNLP and NLTK.

Once the concepts described in this article are understood, one can tackle genuinely challenging problems involving text data and natural language processing.

I hope you enjoyed reading this article. Feel free to post your doubts, questions, or any thoughts in the comments section.

Shivam Bansal is a data scientist with extensive experience in Natural Language Processing and Machine Learning across several domains. He is passionate about learning and always looks forward to solving challenging analytical problems.
