
Introduction

This article will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees.

The functions detailed in these recipes modify data, as opposed to learning from it. That means it's not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this article, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What's important in this article is what you can do with a chunk, not where it came from.

Filtering insignificant words

Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase "the movie was terrible", the most significant words are "movie" and "terrible", while "the" and "was" are almost useless. You could get the same meaning if you took them out, such as "movie terrible" or "terrible movie". Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words, and keep the significant ones, by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix.

How to do it...

In transforms.py there is a function called filter_insignificant(). It takes a single chunk, which should be a list of tagged words, and returns a new chunk without any insignificant tagged words. It defaults to filtering out any tags that end with DT or CC.

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    good = []
    for word, tag in chunk:
        ok = True
        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break
        if ok:
            good.append((word, tag))
    return good

Now we can use it on the part-of-speech tagged version of "the terrible movie".
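As a quick check, the following sketch applies the function to a hand-tagged version of "the terrible movie". It uses a compact restatement of filter_insignificant() so that it runs on its own, and the tags are written by hand rather than produced by a tagger, which is an assumption made for illustration.

```python
def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    """Return a new chunk without words whose tag ends with one of tag_suffixes."""
    good = []
    for word, tag in chunk:
        # str.endswith accepts a tuple of suffixes, so no inner loop is needed
        if not tag.endswith(tuple(tag_suffixes)):
            good.append((word, tag))
    return good

# Hand-tagged chunk (assumed tags, not tagger output)
chunk = [('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')]
result = filter_insignificant(chunk)
print(result)  # [('terrible', 'JJ'), ('movie', 'NN')]
```

The determiner "the" is dropped, while the adjective and noun that carry the sentiment are kept.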

How it works...

filter_insignificant() iterates over the tagged words in the chunk. For each tag, it checks whether that tag ends with any of the tag_suffixes. If it does, the tagged word is skipped. Otherwise, the tagged word is appended to a new good chunk, which is returned at the end.

There's more...

The way filter_insignificant() is defined, you can pass in your own tag suffixes if DT and CC are not enough, or are incorrect for your case. For example, you might decide that possessive words and pronouns such as "you", "your", "their", and "theirs" are no good but DT and CC words are ok. The tag suffixes would then be PRP and PRP$. Following is an example of this function:
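A minimal sketch of passing custom suffixes might look like this. The function is restated in a shortened form so the example is self-contained, and the sample chunk and its tags are assumptions chosen for illustration.

```python
def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    """Shortened restatement: keep words whose tag matches none of the suffixes."""
    return [(word, tag) for word, tag in chunk
            if not tag.endswith(tuple(tag_suffixes))]

# Hand-tagged chunk for "your book is great" (assumed tags)
chunk = [('your', 'PRP$'), ('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]
filtered = filter_insignificant(chunk, tag_suffixes=['PRP', 'PRP$'])
print(filtered)  # [('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]
```

Because the default suffixes are replaced rather than extended, DT and CC words would pass through untouched with this call.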

Filtering insignificant words can be a good complement to stopword filtering for purposes such as search engine indexing, querying, and text classification.

Correcting verb forms

It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?". The verb "is" should only be used with singular nouns, while "are" is for plural nouns, such as "children". We can correct these mistakes by creating verb correction mappings that are used depending on whether there's a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one for plural to singular, and another for singular to plural.

Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics: is to are, was to were, and vice versa.
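The mappings described above might look like the following sketch. The dictionary names plural_verb_forms and singular_verb_forms, and the exact tags (VBZ, VBP, VBD), are assumptions based on the treebank tag set, since the definitions themselves are not shown here.

```python
# Used when the nearby noun is plural: singular verb form -> plural verb form
plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD'),
}

# Used when the nearby noun is singular: plural verb form -> singular verb form
singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD'),
}

# The two mappings are inverses of each other
inverted = {new: old for old, new in plural_verb_forms.items()}
print(inverted == singular_verb_forms)  # True
```

Keying on the (word, tag) pair rather than the word alone lets the tag be corrected together with the verb.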

How to do it...

In transforms.py there is a function called correct_verbs(). Pass it a chunk with incorrect verb forms, and you'll get a corrected chunk back. It uses a helper function first_chunk_index() to search the chunk for the position of the first tagged word where pred returns True.

In this case, "were" becomes "was" because "child" is a singular noun.

How it works...

The correct_verbs() function starts by looking for a verb in the chunk. If no verb is found, the chunk is returned with no changes. Once a verb is found, we keep the verb, its tag, and its index in the chunk. Then we look on either side of the verb to find the nearest noun, starting on the right, and only looking to the left if no noun is found on the right. If no noun is found at all, the chunk is returned as is. But if a noun is found, then we look up the correct verb form depending on whether or not the noun is plural.

Plural nouns are tagged with NNS, while singular nouns are tagged with NN. This means we can check the plurality of a noun by seeing if its tag ends with S. Once we get the corrected verb form, it is inserted into the chunk to replace the original verb form.

To make searching through the chunk easier, we define a function called first_chunk_index(). It takes a chunk, a lambda predicate, the starting index, and a step increment. The predicate function is called with each tagged word until it returns True. If it never returns True, then None is returned. The starting index defaults to zero and the step increment to one. As you'll see in upcoming recipes, we can search backwards by overriding start and setting step to -1. This small utility function will be a key part of subsequent transform functions.
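Putting the description together, first_chunk_index() and correct_verbs() could be sketched as follows. This is a reconstruction from the prose, not the actual contents of transforms.py: the mapping names and tags are assumptions, and unlike the in-place replacement described above, this version works on a copy of the chunk.

```python
plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD'),
}
singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD'),
}

def first_chunk_index(chunk, pred, start=0, step=1):
    """Return the index of the first tagged word for which pred is true, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def is_verb(tagged):
    return tagged[1].startswith('VB')

def is_noun(tagged):
    return tagged[1].startswith('NN')

def correct_verbs(chunk):
    chunk = list(chunk)  # assumption: copy so the caller's list is left untouched
    vbidx = first_chunk_index(chunk, is_verb)
    if vbidx is None:
        return chunk  # no verb found
    verb = chunk[vbidx]
    # nearest noun to the right of the verb, else the nearest one to the left
    nnidx = first_chunk_index(chunk, is_noun, start=vbidx + 1)
    if nnidx is None:
        nnidx = first_chunk_index(chunk, is_noun, start=vbidx - 1, step=-1)
    if nnidx is None:
        return chunk  # no noun found
    # plural noun tags end with S
    plural = chunk[nnidx][1].endswith('S')
    forms = plural_verb_forms if plural else singular_verb_forms
    chunk[vbidx] = forms.get(verb, verb)
    return chunk

fixed_plural = correct_verbs(
    [('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
fixed_singular = correct_verbs(
    [('our', 'PRP$'), ('child', 'NN'), ('were', 'VBD'), ('learning', 'VBG')])
print(fixed_plural)    # [('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]
print(fixed_singular)  # [('our', 'PRP$'), ('child', 'NN'), ('was', 'VBD'), ('learning', 'VBG')]
```

In the second call no noun follows "were", so the backwards search (step=-1) finds "child" on the left and the singular form is chosen.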

Swapping verb phrases

Swapping the words around a verb can shorten particular phrases by removing a linking verb such as "was". For example, "the book was great" can be transformed into "the great book".

How to do it...

In transforms.py there is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

The result is "great the book". This phrase clearly isn't grammatically correct, so read on to learn how to fix it.

How it works...

Using first_chunk_index() from the previous recipe, we start by finding the first matching verb that is not a gerund (a word that ends in "ing") tagged with VBG. Once we've found the verb, we return the chunk with the right side before the left, and remove the verb.

The reason we don't want to pivot around a gerund is that gerunds are commonly used to describe nouns, and pivoting around one would remove that description. Here's an example where you can see how not pivoting around a gerund is a good thing:
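The behavior described above can be sketched as follows. This is a reconstruction from the description rather than the actual transforms.py code, and the hand-tagged chunks are assumptions chosen for illustration.

```python
def first_chunk_index(chunk, pred, start=0, step=1):
    """Return the index of the first tagged word for which pred is true, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def is_pivot_verb(tagged):
    # any verb tag except the gerund tag VBG
    tag = tagged[1]
    return tag.startswith('VB') and tag != 'VBG'

def swap_verb_phrase(chunk):
    vbidx = first_chunk_index(chunk, is_pivot_verb)
    if vbidx is None:
        return chunk
    # right side first, then left side; the verb itself is dropped
    return chunk[vbidx + 1:] + chunk[:vbidx]

simple = swap_verb_phrase(
    [('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
with_gerund = swap_verb_phrase(
    [('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'),
     ('is', 'VBZ'), ('fantastic', 'JJ')])
print(simple)       # [('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]
print(with_gerund)  # [('fantastic', 'JJ'), ('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN')]
```

In the second chunk the pivot is "is", not "gripping", so the gerund stays attached to the noun it describes.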

Swapping noun cardinals

In a chunk, a cardinal word—tagged as CD—refers to a number, such as "10". These cardinals often occur before or after a noun. For normalization purposes, it can be useful to always put the cardinal before the noun.

How to do it...

The function swap_noun_cardinal() is defined in transforms.py. It swaps any cardinal that occurs immediately after a noun with the noun, so that the cardinal occurs immediately before the noun.

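def swap_noun_cardinal(chunk):
    # Python 3 lambdas cannot unpack a tuple argument, so index into the tagged word
    cdidx = first_chunk_index(chunk, lambda tagged: tagged[1] == 'CD')
    # cdidx must be > 0 and there must be a noun immediately before it
    if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
        return chunk
    # swap the noun and the cardinal in place (reconstructed from How it works)
    chunk[cdidx-1], chunk[cdidx] = chunk[cdidx], chunk[cdidx-1]
    return chunk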

The result is that the numbers are now in front of the noun, creating "10 Dec" and "the 10 top".

How it works...

We start by looking for a CD tag in the chunk. If no CD is found, or if the CD is at the beginning of the chunk, then the chunk is returned as is. There must also be a noun immediately before the CD. If we do find a CD with a noun preceding it, then we swap the noun and cardinal in place.
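To try the steps above end to end, here is a self-contained sketch. The helper first_chunk_index() is restated from its description, the chunks are tagged by hand, and this version swaps on a copy of the chunk rather than in place; all three are assumptions made for illustration.

```python
def first_chunk_index(chunk, pred, start=0, step=1):
    """Return the index of the first tagged word for which pred is true, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def swap_noun_cardinal(chunk):
    chunk = list(chunk)  # assumption: copy instead of modifying the argument
    cdidx = first_chunk_index(chunk, lambda tagged: tagged[1] == 'CD')
    # "not cdidx" covers both None (no CD) and 0 (CD at the start)
    if not cdidx or not chunk[cdidx - 1][1].startswith('NN'):
        return chunk
    chunk[cdidx - 1], chunk[cdidx] = chunk[cdidx], chunk[cdidx - 1]
    return chunk

date = swap_noun_cardinal([('Dec.', 'NNP'), ('10', 'CD')])
top = swap_noun_cardinal([('the', 'DT'), ('top', 'NN'), ('10', 'CD')])
print(date)  # [('10', 'CD'), ('Dec.', 'NNP')]
print(top)   # [('the', 'DT'), ('10', 'CD'), ('top', 'NN')]
```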

Swapping infinitive phrases

An infinitive phrase has the form "A of B", such as "book of recipes". These can often be transformed into a new form while retaining the same meaning, such as "recipes book".

How to do it...

An infinitive phrase can be found by looking for a word tagged with IN. The function swap_infinitive_phrase(), defined in transforms.py, will return a chunk that swaps the portion of the phrase after the IN word with the portion before the IN word.

How it works...

This function is similar to the swap_verb_phrase() function described in the Swapping verb phrases recipe. The inpred lambda is passed to first_chunk_index() to look for a word whose tag is IN. Next, nnpred is used to find the first noun that occurs before the IN word, so we can insert the portion of the chunk after the IN word between the noun and the beginning of the chunk. A more complicated example should demonstrate this:

We don't want the result to be "recipes delicious book". Instead, we want to insert "recipes" before the noun "book", but after the adjective "delicious". Hence, the need to find the nnidx occurring before the inidx.

There's more...

You'll notice that the inpred lambda checks to make sure the word is not "like". That's because "like" phrases must be treated differently, as transforming them the same way will result in an ungrammatical phrase. For example, "tastes like chicken" should not be transformed into "chicken tastes":
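The logic described in this recipe could be sketched as follows. It is a reconstruction from the prose, not the code in transforms.py: the predicate names inpred and nnpred come from the text, while the function bodies and the hand-tagged chunks are assumptions.

```python
def first_chunk_index(chunk, pred, start=0, step=1):
    """Return the index of the first tagged word for which pred is true, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def inpred(tagged):
    word, tag = tagged
    return tag == 'IN' and word != 'like'

def nnpred(tagged):
    return tagged[1].startswith('NN')

def swap_infinitive_phrase(chunk):
    inidx = first_chunk_index(chunk, inpred)
    if inidx is None:
        return chunk
    # search backwards from the IN word for the nearest noun
    nnidx = first_chunk_index(chunk, nnpred, start=inidx, step=-1)
    if nnidx is None:
        nnidx = 0  # no noun found: insert at the beginning of the chunk
    return chunk[:nnidx] + chunk[inidx + 1:] + chunk[nnidx:inidx]

simple = swap_infinitive_phrase(
    [('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')])
with_adjective = swap_infinitive_phrase(
    [('delicious', 'JJ'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')])
like_phrase = swap_infinitive_phrase(
    [('tastes', 'VBZ'), ('like', 'IN'), ('chicken', 'NN')])
print(simple)          # [('recipes', 'NNS'), ('book', 'NN')]
print(with_adjective)  # [('delicious', 'JJ'), ('recipes', 'NNS'), ('book', 'NN')]
print(like_phrase)     # [('tastes', 'VBZ'), ('like', 'IN'), ('chicken', 'NN')]
```

The second result shows "recipes" landing between the adjective and the noun, and the third shows that a "like" phrase is returned unchanged.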

Singularizing plural nouns

As we saw in the previous recipe, the transformation process can result in phrases such as "recipes book". This is an NNS followed by an NN, when a more proper version of the phrase would be "recipe book", which is an NN followed by another NN. We can do another transform to correct these improper plural nouns.

How to do it...

transforms.py defines a function called singularize_plural_noun(), which will de-pluralize a plural noun (tagged with NNS) that is followed by another noun.

How it works...

We start by looking for a plural noun with the tag NNS. If found, and if the next word is a noun (determined by making sure the tag starts with NN), then we de-pluralize the plural noun by removing an "s" from the right side of both the tag and the word.

The tag is assumed to be capitalized, so an uppercase "S" is removed from the right side of the tag, while a lowercase "s" is removed from the right side of the word.
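A minimal sketch of this transform, reconstructed from the description, is shown next. The helper first_chunk_index() is restated so the example runs on its own, and the choice to remove exactly one trailing "s" by slicing is an assumption. The rule is deliberately naive and will not handle irregular plurals.

```python
def first_chunk_index(chunk, pred, start=0, step=1):
    """Return the index of the first tagged word for which pred is true, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def singularize_plural_noun(chunk):
    chunk = list(chunk)  # assumption: copy instead of modifying the argument
    nnsidx = first_chunk_index(chunk, lambda tagged: tagged[1] == 'NNS')
    has_next = nnsidx is not None and nnsidx + 1 < len(chunk)
    if has_next and chunk[nnsidx + 1][1].startswith('NN'):
        noun, tag = chunk[nnsidx]
        # drop one lowercase "s" from the word and the uppercase "S" from the tag
        if noun.endswith('s'):
            noun = noun[:-1]
        chunk[nnsidx] = (noun, tag[:-1])
    return chunk

fixed = singularize_plural_noun([('recipes', 'NNS'), ('book', 'NN')])
untouched = singularize_plural_noun([('recipes', 'NNS'), ('are', 'VBP')])
print(fixed)      # [('recipe', 'NN'), ('book', 'NN')]
print(untouched)  # [('recipes', 'NNS'), ('are', 'VBP')]
```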

Chaining chunk transformations

The transform functions defined in the previous recipes can be chained together to normalize chunks. The resulting chunks are often shorter with no loss of meaning.

How to do it...

In transforms.py is the function transform_chunk(). It takes a single chunk and an optional list of transform functions. It calls each transform function on the chunk, one at a time, and returns the final chunk.
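The chaining idea can be sketched as follows. The four transforms are compact restatements reconstructed from the earlier recipes, and the default order of the chain is an assumption, since the actual defaults in transforms.py are not shown here.

```python
def first_chunk_index(chunk, pred, start=0, step=1):
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    return [(w, t) for w, t in chunk if not t.endswith(tuple(tag_suffixes))]

def swap_verb_phrase(chunk):
    i = first_chunk_index(
        chunk, lambda wt: wt[1].startswith('VB') and wt[1] != 'VBG')
    return chunk if i is None else chunk[i + 1:] + chunk[:i]

def swap_infinitive_phrase(chunk):
    i = first_chunk_index(chunk, lambda wt: wt[1] == 'IN' and wt[0] != 'like')
    if i is None:
        return chunk
    n = first_chunk_index(
        chunk, lambda wt: wt[1].startswith('NN'), start=i, step=-1) or 0
    return chunk[:n] + chunk[i + 1:] + chunk[n:i]

def singularize_plural_noun(chunk):
    i = first_chunk_index(chunk, lambda wt: wt[1] == 'NNS')
    if i is not None and i + 1 < len(chunk) and chunk[i + 1][1].startswith('NN'):
        word = chunk[i][0]
        word = word[:-1] if word.endswith('s') else word
        chunk = chunk[:i] + [(word, 'NN')] + chunk[i + 1:]
    return chunk

# Assumed default order; each transform receives the previous one's output
DEFAULT_CHAIN = (filter_insignificant, swap_verb_phrase,
                 swap_infinitive_phrase, singularize_plural_noun)

def transform_chunk(chunk, chain=DEFAULT_CHAIN):
    for transform in chain:
        chunk = transform(chunk)
    return chunk

chunk = [('the', 'DT'), ('book', 'NN'), ('of', 'IN'),
         ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]
result = transform_chunk(chunk)
print(result)  # [('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
```

Here "the book of recipes is delicious" is reduced to "delicious recipe book". The order matters: the plural noun can only be singularized after the infinitive swap has placed it directly before another noun.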

As you can see, the punctuation isn't quite right. The commas and period are treated as individual words, and so get the surrounding spaces as well. We can fix this using regular expression substitution. This is implemented in the chunk_tree_to_sent() function found in transforms.py.

Using this function results in a much cleaner sentence, with no space before each punctuation mark:

>>> from transforms import chunk_tree_to_sent
>>> chunk_tree_to_sent(tree)
'Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.'

How it works...

To correct the extra spaces in front of the punctuation, we create a regular expression punct_re that will match a space followed by any of the known punctuation characters. We have to escape both '.' and '?' with a '\' since they are special characters. The punctuation is surrounded by parentheses so we can use the matched group for substitution.

Once we have our regular expression, we define chunk_tree_to_sent(), whose first step is to join the words by a concatenation character that defaults to a space. Then we can call re.sub() to replace all the punctuation matches with just the punctuation group. This eliminates the space in front of the punctuation characters, resulting in a more correct string.
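The two steps just described could be sketched like this. The exact set of punctuation characters in punct_re is an assumption, and LeafList is a hypothetical stand-in that only provides the leaves() method, so the example runs without NLTK or a corpus. With NLTK available, a chunk Tree would be passed instead.

```python
import re

# A whitespace character followed by one punctuation character (assumed set)
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
    # tree.leaves() is expected to yield (word, tag) pairs, as a chunk Tree does
    s = concat.join(word for word, tag in tree.leaves())
    # replace "space + punctuation" with just the captured punctuation
    return punct_re.sub(r'\g<1>', s)

class LeafList:
    """Hypothetical stand-in for a chunk Tree: it only provides leaves()."""
    def __init__(self, tagged_words):
        self._tagged = list(tagged_words)

    def leaves(self):
        return list(self._tagged)

tree = LeafList([
    ('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
    ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
    ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('.', '.'),
])
sentence = chunk_tree_to_sent(tree)
print(sentence)  # Pierre Vinken, 61 years old, will join the board.
```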

There's more...

We can simplify this function a little by using nltk.tag.untag() to get words from the tree's leaves, instead of using our own list comprehension.

Flattening a deep tree

Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since IOB tag parsing is not designed for nested chunks. To make these trees usable for chunker training, we must flatten them.

Getting ready

We're going to use the first parsed sentence of the treebank corpus as our example. Here's a diagram showing how deeply nested this tree is:

You may notice that the part-of-speech tags are part of the tree structure, instead of being included with the word. This will be handled next using the Tree.pos() method, which was designed specifically for combining words with pre-terminal Tree nodes such as part-of-speech tags.

How to do it...

In transforms.py there is a function named flatten_deeptree(). It takes a single Tree and will return a new Tree that keeps only the lowest level trees. It uses a helper function flatten_childtrees() to do most of the work.

The result is a much flatter Tree that only includes NP phrases. Words that are not part of an NP phrase are separated. This flatter tree is shown as follows:

This Tree is quite similar to the first chunk Tree from the treebank_chunk corpus. The main difference is that the rightmost NP Tree is separated into two sub-trees in the previous diagram, one of them named NP-TMP.

The first tree from treebank_chunk is shown as follows for comparison:

How it works...

The solution is composed of two functions: flatten_deeptree() returns a new Tree from the given tree by calling flatten_childtrees() on each of the given tree's children.

flatten_childtrees() is a recursive function that drills down into the Tree until it finds child trees whose height() is equal to or less than three. A Tree whose height() is less than three looks like this:

>>> from nltk.tree import Tree
>>> Tree('NNP', ['Pierre']).height()
2

These short trees are converted into lists of tuples using the pos() function.

>>> Tree('NNP', ['Pierre']).pos()
[('Pierre', 'NNP')]

Trees whose height() is equal to three are the lowest level trees that we're interested in keeping. These trees look like this:

The recursive nature of flatten_childtrees() eliminates all trees whose height is greater than three.
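The recursion described above can be sketched as follows. To keep the example runnable without NLTK, it uses MiniTree, a hypothetical stand-in that mimics only the label(), height(), and pos() behavior of nltk.tree.Tree. The function bodies are reconstructed from the prose and the sample tree is a shortened, hand-built version of the example sentence, so treat both as assumptions.

```python
class MiniTree(list):
    """Hypothetical stand-in for the parts of nltk.tree.Tree used here."""
    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

    def height(self):
        # a tree holding only leaves has height 2
        heights = [c.height() if isinstance(c, MiniTree) else 1 for c in self]
        return 1 + max(heights, default=0)

    def pos(self):
        # pair each leaf with the label of its direct parent
        tagged = []
        for child in self:
            if isinstance(child, MiniTree):
                tagged.extend(child.pos())
            else:
                tagged.append((child, self._label))
        return tagged

def flatten_childtrees(trees):
    children = []
    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())                      # tagged word
        elif t.height() == 3:
            children.append(MiniTree(t.label(), t.pos())) # lowest-level phrase
        else:
            children.extend(flatten_childtrees(list(t)))  # drill down further
    return children

def flatten_deeptree(tree):
    return MiniTree(tree.label(), flatten_childtrees(list(tree)))

T = MiniTree
tree = T('S', [
    T('NP-SBJ', [
        T('NP', [T('NNP', ['Pierre']), T('NNP', ['Vinken'])]),
        T(',', [',']),
    ]),
    T('VP', [
        T('MD', ['will']),
        T('VP', [T('VB', ['join']),
                 T('NP', [T('DT', ['the']), T('NN', ['board'])])]),
    ]),
    T('.', ['.']),
])
flat = flatten_deeptree(tree)
print(tree.height(), '->', flat.height())  # 6 -> 3
```

Only the two NP sub-trees of height three survive. The NP-SBJ and VP nodes are dropped, and their remaining words become top-level tagged words.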

There's more...

Flattening a deep Tree allows us to call nltk.chunk.util.tree2conlltags() on the flattened Tree, a necessary step to train a chunker. If you try to call this function before flattening the Tree, you get a ValueError exception.

Being able to flatten trees opens up the possibility of training a chunker on corpora consisting of deep parse trees.

CESS-ESP and CESS-CAT treebank

The cess_esp and cess_cat corpora have parsed sentences, but no chunked sentences. In other words, they have deep trees that must be flattened in order to train a chunker. In fact, the trees are so deep that a diagram can't be shown, but the flattening can be demonstrated by showing the height() of the tree before and after flattening.

As in the previous recipe, the height of the new tree is three so it can be used for training a chunker.

How it works...

The shallow_tree() function iterates over each of the top-level sub-trees in order to create new child trees. If the height() of a sub-tree is less than three, then that sub-tree is replaced by a list of its part-of-speech tagged children. All other sub-trees are replaced by a new Tree whose children are the part-of-speech tagged leaves. This eliminates all nested sub-trees while retaining the top-level sub-trees.
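A sketch of shallow_tree(), reconstructed from this description, is given next. As an assumption made so the example runs without NLTK, it uses MiniTree, a hypothetical stand-in for nltk.tree.Tree that offers only label(), height(), and pos(). The sample tree is a small hand-built English tree rather than a cess_esp or cess_cat sentence.

```python
class MiniTree(list):
    """Hypothetical stand-in for the parts of nltk.tree.Tree used here."""
    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

    def height(self):
        heights = [c.height() if isinstance(c, MiniTree) else 1 for c in self]
        return 1 + max(heights, default=0)

    def pos(self):
        tagged = []
        for child in self:
            if isinstance(child, MiniTree):
                tagged.extend(child.pos())
            else:
                tagged.append((child, self._label))
        return tagged

def shallow_tree(tree):
    children = []
    for t in tree:
        if t.height() < 3:
            # a bare part-of-speech node becomes a tagged word
            children.extend(t.pos())
        else:
            # keep the top-level node, replace everything below with tagged leaves
            children.append(MiniTree(t.label(), t.pos()))
    return MiniTree(tree.label(), children)

T = MiniTree
tree = T('S', [
    T('NP-SBJ', [
        T('NP', [T('NNP', ['Pierre']), T('NNP', ['Vinken'])]),
        T(',', [',']),
    ]),
    T('VP', [
        T('MD', ['will']),
        T('VP', [T('VB', ['join']),
                 T('NP', [T('DT', ['the']), T('NN', ['board'])])]),
    ]),
    T('.', ['.']),
])
shallow = shallow_tree(tree)
labels = [c.label() for c in shallow if isinstance(c, MiniTree)]
print(tree.height(), '->', shallow.height())  # 6 -> 3
print(labels)  # ['NP-SBJ', 'VP']
```

In contrast to flattening, the top-level NP-SBJ and VP nodes are kept here, and the nested NP nodes inside them disappear.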

This function is an alternative to flatten_deeptree() from the previous recipe, for when you want to keep the higher level tree nodes and ignore the lower level nodes.

Converting tree nodes

As you've seen in previous recipes, parse trees often have a variety of Tree node types that are not present in chunk trees. If you want to use the parse trees to train a chunker, then you'll probably want to reduce this variety by converting some of these tree nodes to more common node types.

Getting ready

First, we have to decide what Tree nodes need to be converted. Let's take a look at that first Tree again:

Immediately you can see that there are two alternative NP sub-trees: NP-SBJ and NP-TMP. Let's convert both of those to NP. The mapping will be as follows:

Original Node    New Node
NP-SBJ           NP
NP-TMP           NP

How to do it...

In transforms.py there is a function convert_tree_nodes(). It takes two arguments: the Tree to convert, and a node conversion mapping. It returns a new Tree with all matching nodes replaced based on the values in the mapping.
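One way such a function might work is sketched here. It is an illustration under stated assumptions rather than the code in transforms.py: MiniTree is a hypothetical stand-in for nltk.tree.Tree that only stores a label and children, and the sample tree is a hand-built, already flattened version of the example sentence.

```python
class MiniTree(list):
    """Hypothetical stand-in for nltk.tree.Tree: a list of children plus a label."""
    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label

def convert_tree_nodes(tree, mapping):
    children = []
    for child in tree:
        if isinstance(child, MiniTree):
            # convert nested trees recursively
            children.append(convert_tree_nodes(child, mapping))
        else:
            children.append(child)
    # labels missing from the mapping are kept unchanged
    new_label = mapping.get(tree.label(), tree.label())
    return MiniTree(new_label, children)

mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
tree = MiniTree('S', [
    MiniTree('NP-SBJ', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
    ('will', 'MD'), ('join', 'VB'),
    MiniTree('NP', [('the', 'DT'), ('board', 'NN')]),
    MiniTree('NP-TMP', [('Nov.', 'NNP'), ('29', 'CD')]),
    ('.', '.'),
])
converted = convert_tree_nodes(tree, mapping)
labels = [c.label() for c in converted if isinstance(c, MiniTree)]
print(labels)  # ['NP', 'NP', 'NP']
```

Because a new tree is built at every level, the original tree keeps its NP-SBJ and NP-TMP labels.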
