Book Details

ISBN 13: 9781849513609

Paperback: 272 pages

Book Description

Natural Language Processing is used everywhere: in search engines, spell checkers, mobile phones, computer games, and even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. If you want to employ nothing less than the best techniques in Natural Language Processing, this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which walks you through Natural Language Processing techniques step by step. It demystifies the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts the preamble short so you can dive right into the science of text processing with a practical, hands-on approach.

Start by learning to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as the advanced features of stemming and lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to write custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to handle huge amounts of data easily, without any loss of efficiency or speed.

This book will teach you all that and more, in a hands-on, learn-by-doing manner. Make yourself an expert in using NLTK for Natural Language Processing with this handy companion.

Table of Contents

Chapter 1: Tokenizing Text and WordNet Basics

Introduction

Tokenizing text into sentences

Tokenizing sentences into words

Tokenizing sentences using regular expressions

Filtering stopwords in a tokenized sentence

Looking up synsets for a word in WordNet

Looking up lemmas and synonyms in WordNet

Calculating WordNet synset similarity

Discovering word collocations
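As a taste of the chapter, here is a minimal sketch of the regular-expression tokenizing recipe. The example sentence is invented, and unlike sent_tokenize or the WordNet lookups (which require downloaded NLTK data such as punkt and wordnet), RegexpTokenizer runs out of the box:

```python
from nltk.tokenize import RegexpTokenizer

# Split on runs of word characters and apostrophes; unlike sent_tokenize,
# this needs no downloaded NLTK data. The sentence is invented.
tokenizer = RegexpTokenizer(r"[\w']+")
tokens = tokenizer.tokenize("Can't is a contraction.")
print(tokens)  # ["Can't", 'is', 'a', 'contraction']
```

Note that the apostrophe in the character class keeps contractions like "Can't" together instead of splitting them.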

Chapter 2: Replacing and Correcting Words

Introduction

Stemming words

Lemmatizing words with WordNet

Translating text with Babelfish

Replacing words matching regular expressions

Removing repeating characters

Spelling correction with Enchant

Replacing synonyms with a single canonical word

Replacing negations with antonyms
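A minimal sketch of the stemming recipe: the Porter algorithm strips common suffixes and needs no corpus downloads, whereas lemmatizing with WordNet requires the wordnet data to be installed first. The sample words are invented:

```python
from nltk.stem import PorterStemmer

# Porter stemming reduces inflected forms to a common stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in ['cooking', 'cooked', 'books']]
print(stems)  # ['cook', 'cook', 'book']
```

Stems need not be dictionary words; a lemmatizer, by contrast, always returns a real word, at the cost of needing WordNet.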

Chapter 3: Creating Custom Corpora

Introduction

Setting up a custom corpus

Creating a word list corpus

Creating a part-of-speech tagged word corpus

Creating a chunked phrase corpus

Creating a categorized text corpus

Creating a categorized chunk corpus reader

Lazy corpus loading

Creating a custom corpus view

Creating a MongoDB backed corpus reader

Corpus editing with file locking

Chapter 4: Part-of-Speech Tagging

Introduction

Default tagging

Training a unigram part-of-speech tagger

Combining taggers with backoff tagging

Training and combining Ngram taggers

Creating a model of likely word tags

Tagging with regular expressions

Affix tagging

Training a Brill tagger

Training the TnT tagger

Using WordNet for tagging

Tagging proper names

Classifier based tagging
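A minimal sketch of backoff tagging: a unigram tagger trained on a tiny invented corpus falls back to a default tag for unseen words. The real recipes train on much larger tagged corpora such as treebank:

```python
from nltk.tag import DefaultTagger, UnigramTagger

# A unigram tagger learns the most likely tag per word from training
# data; DefaultTagger supplies 'NN' for words it has never seen.
train_sents = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]]
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(tagger.tag(['the', 'cat', 'meowed']))
# [('the', 'DT'), ('cat', 'NN'), ('meowed', 'NN')]
```

Chaining taggers this way (unigram, then bigram, then default) is the core idea behind combining Ngram taggers for higher accuracy.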

Chapter 5: Extracting Chunks

Introduction

Chunking and chinking with regular expressions

Merging and splitting chunks with regular expressions

Expanding and removing chunks with regular expressions

Partial parsing with regular expressions

Training a tagger-based chunker

Classification-based chunking

Extracting named entities

Extracting proper noun chunks

Extracting location chunks

Training a named entity chunker
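A minimal sketch of chunking with a regular-expression grammar. The part-of-speech tags are supplied by hand here, since pos_tag and the named entity chunker need trained models; also note that NLTK 2 spells Tree.label() as Tree.node:

```python
from nltk.chunk import RegexpParser

# Chunk a hand-tagged sentence into noun phrases: an optional
# determiner, any number of adjectives, then a noun.
chunker = RegexpParser('NP: {<DT>?<JJ>*<NN>}')
tagged = [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('ran', 'VBD')]
tree = chunker.parse(tagged)
print(tree)  # (S (NP the/DT quick/JJ fox/NN) ran/VBD)
```

Chinking works the same way with }{ braces, carving tokens out of a chunk instead of into one.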

Chapter 6: Transforming Chunks and Trees

Introduction

Filtering insignificant words

Correcting verb forms

Swapping verb phrases

Swapping noun cardinals

Swapping infinitive phrases

Singularizing plural nouns

Chaining chunk transformations

Converting a chunk tree to text

Flattening a deep tree

Creating a shallow tree

Converting tree nodes
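A minimal sketch of flattening a deep tree: the nested parse is invented, and replacing subtrees with their leaves yields a single-level tree without changing the tagged words. (Tree.label() is the NLTK 3 spelling of NLTK 2's Tree.node.)

```python
from nltk.tree import Tree

# Replace every subtree of a deep parse with its leaves, producing a
# single-level tree that keeps the same tagged words.
deep = Tree('S', [Tree('NP', [('Mike', 'NNP')]), Tree('VP', [('ran', 'VBD')])])
flat = Tree(deep.label(), deep.leaves())
print(flat)  # (S Mike/NNP ran/VBD)
```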

Chapter 7: Text Classification

Introduction

Bag of Words feature extraction

Training a naive Bayes classifier

Training a decision tree classifier

Training a maximum entropy classifier

Measuring precision and recall of a classifier

Calculating high information words

Combining classifiers with voting

Classifying with multiple binary classifiers
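A minimal sketch of bag of words feature extraction feeding a naive Bayes classifier. The two training documents and their sentiment labels are invented for the demo; real recipes train on labeled corpora such as movie_reviews:

```python
from nltk.classify import NaiveBayesClassifier

# Bag of words features map each word to True; words absent from a
# document are simply missing from its feature dict.
def bag_of_words(words):
    return dict((word, True) for word in words)

train_set = [
    (bag_of_words(['great', 'movie']), 'pos'),
    (bag_of_words(['awful', 'movie']), 'neg'),
]
classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(bag_of_words(['great', 'film'])))  # pos
```

The same (featureset, label) training format works for the decision tree and maximum entropy classifiers covered later in the chapter.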

Chapter 8: Distributed Processing and Handling Large Datasets

Introduction

Distributed tagging with execnet

Distributed chunking with execnet

Parallel list processing with execnet

Storing a frequency distribution in Redis

Storing a conditional frequency distribution in Redis

Storing an ordered dictionary in Redis

Distributed word scoring with Redis and execnet
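execnet distributes work across remote Python interpreters and cannot be shown self-contained here (it may not even be installed). As a plainly labeled stand-in, the same map-style parallel list processing can be sketched with the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

# A local stand-in for execnet-style parallel list processing: map a
# function over a list of items using a pool of workers, keeping order.
def measure(word):
    return (word, len(word))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(measure, ['parallel', 'list', 'processing']))
print(results)  # [('parallel', 8), ('list', 4), ('processing', 10)]
```

With execnet, the function body is instead shipped to gateway channels on other machines, but the map-reduce shape of the code stays the same.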

Chapter 9: Parsing Specific Data

Introduction

Parsing dates and times with Dateutil

Time zone lookup and conversion

Tagging temporal expressions with Timex

Extracting URLs from HTML with lxml

Cleaning and stripping HTML

Converting HTML entities with BeautifulSoup

Detecting and converting character encodings
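A minimal sketch of converting character encodings. The book's detection recipe relies on the third-party chardet module; this stand-in simply tries a list of likely encodings in order, and the candidate list is a guess:

```python
# Try each candidate encoding in order; latin-1 can decode any byte
# string, so it serves as the last resort.
def to_unicode(data, encodings=('utf-8', 'latin-1')):
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue

text, enc = to_unicode('café'.encode('latin-1'))
print(text, enc)  # café latin-1
```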

What You Will Learn

Learn text categorization and topic identification

Learn stemming and lemmatization, and how to go beyond the usual spell checker

Replace negations with antonyms in your text

Learn to tokenize text into lists of sentences and words, and gain insight into WordNet

Transform and manipulate chunks and trees

Learn advanced features of corpus readers and create your own custom corpora

Tag different parts of speech by creating, training, and using a part-of-speech tagger

Improve accuracy by combining multiple part-of-speech taggers

Learn how to do partial parsing to extract small chunks of text from a part-of-speech tagged sentence

Produce an alternative canonical form without changing the meaning by normalizing parsed chunks

Make your site more discoverable by learning how to automatically replace words with more searched equivalents

Parse dates, times, and HTML

Train and manipulate different types of classifiers

Authors

Jacob Perkins

Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go.

He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com.

To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.

Series & Level

We understand your time is important. Uniquely amongst the major publishers, we seek to develop and publish the broadest range of learning and information products on each technology. Every Packt product delivers a specific learning pathway, broadly defined by the Series type. This structured approach enables you to select the pathway which best suits your knowledge level, learning style and task objectives.

Learning

As a new user, these step-by-step tutorial guides will give you all the practical skills necessary to become competent and efficient.

Beginner's Guide

Friendly, informal tutorials that provide a practical introduction using examples, activities, and challenges.

Essentials

Fast paced, concentrated introductions showing the quickest way to put the tool to work in the real world.

Cookbook

A collection of practical self-contained recipes that all users of the technology will find useful for building more powerful and reliable systems.

Blueprints

Guides you through the most common types of project you'll encounter, giving you end-to-end guidance on how to build your specific solution quickly and reliably.

Mastering

Take your skills to the next level with advanced tutorials that will give you confidence to master the tool's most powerful features.

Starting

Accessible to readers adopting the topic, these titles get you into the tool or technology so that you can become an effective user.

Progressing

Building on core skills you already have, these titles share solutions and expertise so you become a highly productive power user.