In this article we will discuss the basic knowledge required when dealing with natural language. We will begin with basic terminology and statistical laws that help us understand how words behave in a natural-language corpus. A corpus is usually defined as a collection of documents, where each document is a natural-language text in our case. We will then move on to processing techniques like tokenization and normalization and discuss the technical difficulties one can run into. We will also briefly discuss the n-gram framework widely used in natural language processing. Anyone solving a problem that deals with natural language should be well familiar with the concepts discussed here. There are no prerequisites for this article other than a basic understanding of conditional probabilities. After reading it, you can proudly say you are familiar with the basics of Natural Language Processing. Once you are comfortable with these concepts, you can use natural-language tools like NLTK, which have well-maintained implementations of everything discussed here.

Behaviour of natural language

Function words and Content words

Function words carry little meaning of their own but serve as important elements of sentence structure

The winfy prunkilmonger from the glidgement mominkled and brangified all his levensers vederously.

We don't know any of the content words, yet the function words alone let us recover the structure of the sentence

Hyphens: used mainly to prevent incorrect parsing of a phrase, by making the grouping of words explicit. Some possible usages:

Noun modified by an ‘ed’-verb: case-based, hand-delivered

Entire expression as a modifier in a noun group:

three-to-five-year direct marketing plan

Language Specific Issues:

French:

l'ensemble: we want it to match un ensemble, so the clitic article l' must be split off

German:

Noun compounds are not segmented

Lebensversicherungsgesellschaftsangestellter

A compound splitter is required for German information retrieval

Chinese and Japanese: No Space Between Words

Sanskrit

Samāsa (compound formation) and sandhi (euphonic fusion of sounds at word boundaries) merge words together

In such cases we need something called “Word Segmentation”

Word Segmentation or Word Tokenization

Greedy algorithm for Chinese

Maximum Matching

Start a pointer at the beginning of the string

Find the longest word in the dictionary that matches the string starting at the pointer

Move the pointer past that word and repeat; if no dictionary word matches, treat the single character at the pointer as a word and advance by one
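A minimal sketch of Maximum Matching in Python; the toy dictionary and the un-spaced input below are illustrative assumptions, not a real lexicon:

def max_match(text, dictionary, max_len=8):
    # Greedy left-to-right longest-match segmentation.
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word matches: emit a single character.
            words.append(text[i])
            i += 1
    return words

print(max_match("thankyousachin", {"thank", "you", "sachin"}))
# -> ['thank', 'you', 'sachin']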

Word tokenization might also be required for English text, e.g. for hashtags:

#ThankYouSachin

Normalization

Indexed Text and Query Terms must have the same form.

U.S.A. and USA should be matched

We implicitly define equivalence classes of terms

Some possible basic rules

Reduce all letters to lower case

However, upper case mid-sentence may carry information: it can signal a proper noun or named entity

Also, keeping case protects US (the country) from us (the pronoun)
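A minimal case-folding sketch illustrating the point above. The heuristic here, lower-casing only the sentence-initial token so that mid-sentence capitals (potential named entities) survive, is an assumption for illustration, not a standard rule:

def case_fold(tokens):
    # Lower-case only the sentence-initial token; mid-sentence capitals
    # (potential proper nouns / named entities) are left intact.
    if not tokens:
        return tokens
    return [tokens[0].lower()] + tokens[1:]

print(case_fold(["The", "US", "helped", "us"]))
# -> ['the', 'US', 'helped', 'us']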

Lemmatization

Process of grouping together the different inflected forms of a word so they can be analysed as a single item

am, are, is → be

car, cars, car’s, cars’ → car

Have to find the correct dictionary headword form
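With NLTK (mentioned in the introduction) this takes a few lines; note that WordNetLemmatizer treats words as nouns unless told the part of speech, and it needs the wordnet corpus downloaded first:

from nltk.stem import WordNetLemmatizer  # run nltk.download('wordnet') once beforehand

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))          # -> 'car'  (noun is the default POS)
print(lemmatizer.lemmatize("is", pos="v"))   # -> 'be'
print(lemmatizer.lemmatize("are", pos="v"))  # -> 'be'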

Morphology

Study of internal structure of words

Words are built from smaller meaningful units called Morphemes

Free morphemes can stand alone as words, whereas bound morphemes cannot.

Two categories:

Stems:

The core meaning bearing units

Affixes

The bits and pieces adhering to stems to change their meanings and grammatical functions

Prefix: un-, anti-, etc.

Suffix: -ity, -ation, etc.

Infix: e.g. in Sanskrit, the infixed 'n' in 'vindati', contrasted with the root 'vid'

Stemming

Reducing terms to their stems, used in information retrieval

vs Lemmatization

Both aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
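The difference is easy to see with NLTK's Porter stemmer, a rule-based affix chopper; the example below is the classic one from the information-retrieval literature:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["operate", "operating", "operation", "operative"]:
    print(word, "->", stemmer.stem(word))
# All four collapse to the stem 'oper', which is not a dictionary word;
# a lemmatizer would instead return the headword 'operate' for the verb forms.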

N-grams

An n-gram is a contiguous sequence of n items from a given sequence of text.

When the items are words, n-grams may also be called shingles.

An n-gram of size 1 is referred to as a unigram; size 2 is a bigram (or, less commonly, a "digram"); size 3 is a trigram.

Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.
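Extracting n-grams from a token sequence is a one-liner; a minimal sketch (NLTK also ships an equivalent nltk.ngrams helper):

def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["i", "want", "english", "food"], 2))
# -> [('i', 'want'), ('want', 'english'), ('english', 'food')]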

Applications

A type of probabilistic language model for predicting the next item in a sequence (discussed in the next sub-section).

Used as features in various classification/modeling tasks, e.g. machine translation.

In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution

For language identification, sequences of characters/graphemes (e.g., letters of the alphabet) are modeled for different languages.

Although often criticized theoretically, in practice n-gram models have been shown to be extremely effective at modeling language data, and they are a core component of modern statistical language applications.

Probabilistic Language Modelling

Goal:

Compute probability of a sequence of words

Or, related task: probability of an upcoming word

Follow the chain rule:

P(w1, w2, …, wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · … · P(wn | w1, …, wn−1)

Hence:

P(w1, …, wn) = ∏_i P(wi | w1, …, wi−1)

Estimating these values directly from counts would give:

P(wi | w1, …, wi−1) = count(w1, …, wi) / count(w1, …, wi−1)

The problem is that beyond a short history length, most word sequences never occur in the data, so these counts (and hence the estimates) fall to zero.

Markov Assumption

k-th order model: condition only on the previous k words:

P(wi | w1, …, wi−1) ≈ P(wi | wi−k, …, wi−1)

The simplest case, the bigram model, conditions on just the previous word: P(wi | wi−1).

N-gram modelling is equivalent to an (n−1)-th order Markov model

We usually extend to trigrams, 4-grams, 5-grams…

But language has long-distance dependencies
The computer which I had just put into the machine room on the fifth floor crashed.

Such dependencies are relatively infrequent, though, so in practice we usually get away with n-gram models.

While estimating probabilities, we also count the start and end symbols (<s> and </s>) that pad each sentence; the maximum-likelihood estimate for a bigram is then:

P(wi | wi−1) = count(wi−1, wi) / count(wi−1)

Worked example: 9,222 restaurant sentences (the restaurant-query corpus used in Jurafsky and Martin's examples).

Bigram counts: [table omitted] a table of count(wi−1, wi) for a handful of frequent words such as i, want, to, eat, food.

Unigram counts: [table omitted] the corresponding count(wi−1) values.

Result: dividing each bigram count by the unigram count of its first word gives the conditional probability table P(wi | wi−1).

Computing a sentence probability is then just a product of bigram probabilities, e.g.:

P(<s> i want english food </s>) = P(i | <s>) × P(want | i) × P(english | want) × P(food | english) × P(</s> | food)
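A minimal bigram-model sketch in Python; the three-sentence corpus below is a made-up stand-in for the restaurant data:

from collections import Counter

corpus = [
    ["<s>", "i", "want", "english", "food", "</s>"],
    ["<s>", "i", "want", "chinese", "food", "</s>"],
    ["<s>", "i", "spend", "money", "</s>"],
]

# Unigram and bigram counts over the padded sentences.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sent):
    p = 1.0
    for prev, word in zip(sent, sent[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["<s>", "i", "want", "english", "food", "</s>"]))
# 3/3 * 2/3 * 1/2 * 1/1 * 2/2 = 0.333...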

Practical Issues

We do things in log space, since log(p1 × p2 × p3) = log p1 + log p2 + log p3

Avoids underflow: the product of many probabilities, each below 1, quickly becomes too small to represent

Adding is faster than multiplying

Handling Zeros

Use smoothing (discussed below)

Google released its Web 1T n-gram corpus in 2006.

Generalization and the problem of Zeros

The Shannon Visualization Method

Use the language model to generate word sequences

Choose a random bigram (<s>, w) according to its probability (draw a uniform random number in [0, 1] and pick the bigram whose cumulative-probability interval contains it)

Choose a random bigram (w, x) according to its conditional probability, sampled the same way

And so on until we choose </s>
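A sketch of this sampling loop, reusing the bigrams counter and bigram_prob from the earlier snippet:

import random

def generate(max_len=20):
    # Repeatedly sample the next word from P(x | current word) until </s>.
    word = "<s>"
    out = []
    while word != "</s>" and len(out) < max_len:
        candidates = [x for (w, x) in bigrams if w == word]
        weights = [bigram_prob(word, x) for x in candidates]
        word = random.choices(candidates, weights=weights)[0]
        out.append(word)
    return " ".join(out)

print(generate())  # e.g. 'i want chinese food </s>'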

Shakespeare as Corpus

884k tokens, 29k types

Shakespeare produced 300,000 bigram types out of V2 = 844 million possible bigrams.

So 99.96% of the possible bigrams were never seen (have zero entries in the table)

Quadrigrams are even worse: what comes out looks like Shakespeare because it is Shakespeare.

In the unigram sentences, there is no coherent relation between words, and in fact none of the sentences end in a period or other sentence-final punctuation.

The bigram sentences can be seen to have very local word-to-word coherence (especially if we consider that punctuation counts as a word)

The trigram and quadrigram sentences are beginning to look a lot like Shakespeare.

Indeed a careful investigation of the quadrigram sentences shows that they look a little too much like Shakespeare.

The words "It cannot be but so" are taken directly from King John

The above is a peril of overfitting

N-grams only work well for word prediction if the test corpus looks like the training corpus

In real life, it often doesn’t

We need to train robust models that generalize!

In the extreme case we even run into the problem of zeros

Problem of zeros:

Zero probability n-grams

P(offer | denied the) = 0

The test set will be assigned a probability of 0

And the perplexity can’t be computed
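For reference, perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w1, w2, …, wN)^(−1/N)

so a test set with probability 0 would make the perplexity infinite.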

Solution: Smoothing Techniques

Laplace Smoothing (Add-one Estimation)

Pretend we saw each word (each n-gram in our case) one more time than we actually did. For bigrams, with vocabulary size V, the smoothed estimate is:

P_add-1(wi | wi−1) = (count(wi−1, wi) + 1) / (count(wi−1) + V)
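As a sketch, the unsmoothed bigram_prob from the earlier snippet becomes (reusing its unigrams and bigrams counters):

def bigram_prob_add1(prev, word):
    # Add-one (Laplace) smoothed estimate of P(word | prev).
    V = len(unigrams)  # vocabulary size
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(bigram_prob_add1("want", "food"))
# ('want', 'food') was never seen in the toy corpus, yet P > 0 now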