
The Art of Tokenization

Tokenization

The process of segmenting running text into words and sentences.

Electronic text is a linear sequence of symbols (characters, words, or phrases). Naturally, before any real text processing can be done, the text needs to be segmented into linguistic units such as words, punctuation, numbers, and alphanumerics. This process is called tokenization.

In English, words are often separated from each other by blanks (white space), but not all white space is equal. Both “Los Angeles” and “rock 'n' roll” are single conceptual units despite containing multiple words and spaces. Conversely, we may need to separate a single orthographic word like “I'm” into the two words “I” and “am”.

Tokenization is a kind of pre-processing: the identification of the basic units to be processed. It is conventional to concentrate on pure analysis or generation while taking these basic units for granted, yet without clearly segregated basic units it is impossible to carry out any analysis or generation.

The identification of units that do not need to be further decomposed for subsequent processing is an extremely important task. Errors made at this stage are very likely to induce more errors at later stages of text processing and are therefore very dangerous.

What counts as a token in NLP?

The notion of a token must first be defined before computational processing can proceed. There is more to the issue than simply identifying strings delimited on both sides by spaces or punctuation.

What counts as a token depends on the objective at hand, and often on the language background in question.

A token is

Linguistically significant

Methodologically useful

Webster and Kit suggest that finding significant tokens depends on the ability to recognize patterns displaying significant collocation. Rather than simply relying on whether a string is bounded by delimiters on either side, segmentation into significant tokens relies on a kind of pattern recognition.

Consider this hypothetical speech transcription:

where is meadows dr who asked

Collocation patterns could help determine whether this is about Meadows Dr. (Drive) or Dr. (Doctor) Who.

Standard (White Space) Tokenization

Word tokenization may seem simple in a language that separates words by a special 'space' character. However, not every language does this (e.g. Chinese, Japanese, Thai), and a closer examination will make it clear that white space alone is not sufficient even for English.

Addressing Specific Challenges

Tokenization is generally considered easy relative to other tasks in natural language processing, and one of the more uninteresting ones (for English and other segmented languages). However, errors made in this phase propagate into later phases and cause problems there. To address this, a number of advanced methods dealing with specific challenges in tokenization have been developed to complement standard tokenizers.

Bob Carpenter notes that tokenization is particularly vexing in the bio-medical text domain, where a great many words (or at least phrasal lexical entries) contain parentheses, hyphens, and so on, and that this turned out to be a problem for WordNet.

Another challenge for tokenization is “dirty text”. Not all text has been passed through an editing and spell-check process. Text extracted automatically from PDFs, database fields, or other sources may contain inaccurately compounded tokens, spelling errors, and unexpected characters. When text is stored in a database in fixed fields, with multiple lines per object, the fields sometimes need to be reassembled, but the spaces have (inconsistently) been trimmed.

It is not safe to assume that source text will be perfect. A tokenizer must often be customized to the data in question.

Low-Level vs High-Level Tokenization

Determining whether two or more words should stand together to form a single token (like “Rational Software Architect”) would be a high-level tokenization task. High-level segmentation is much more linguistically motivated than 'low-level' segmentation, and requires (at a minimum) relatively shallow linguistic processing.

Steps in Low Level Tokenization

Step 1: Segmenting Text into Words

The first step in the majority of text processing applications is to segment text into words.

In all modern languages that use a Latin-, Cyrillic-, or Greek-based writing system, such as English and other European languages, word tokens are delimited by blank spaces. For such languages, called segmented languages, token boundary identification is a somewhat trivial task, since the majority of tokens are bounded by explicit separators like spaces and punctuation. A simple program which replaces white space with word boundaries and cuts off leading and trailing quotation marks, parentheses, and punctuation already produces reasonable performance.
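For illustration, here is a minimal sketch of such a program in Java (the class and method names are invented for this example):

    import java.util.ArrayList;
    import java.util.List;

    // A minimal whitespace tokenizer: split on white space, then strip
    // leading and trailing quotation marks, parentheses, and punctuation.
    public class WhitespaceTokenizer {

        private static final String STRIP = "\"'()[]{}<>,.;:!?";

        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String raw : text.split("\\s+")) {
                int start = 0;
                while (start < raw.length() && STRIP.indexOf(raw.charAt(start)) >= 0) start++;
                int end = raw.length();
                while (end > start && STRIP.indexOf(raw.charAt(end - 1)) >= 0) end--;
                if (end > start) tokens.add(raw.substring(start, end));
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints: [the, dr, lives, in, a, blue, box]
            System.out.println(tokenize("the dr. lives in a blue box."));
        }
    }

Note that even this simple program already mishandles the abbreviation “dr.”, which motivates the abbreviation handling described below.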

The majority of existing tokenizers signal token boundaries by white spaces. Thus, if such a tokenizer finds two tokens directly adjacent to each other, as, for instance, when a word is followed by a comma, it inserts a white space between them.

An example in a later section shows how a standard white space tokenizer fares on more complex input.

Step 2: Handling Abbreviations

In English and other Indo-European languages, although a period is directly attached to the preceding word, it is usually a separate token which signals the end of the sentence. However, when a period follows an abbreviation, it is an integral part of that abbreviation and should be tokenized together with it.

the dr. lives in a blue box.

Without addressing the challenge posed by abbreviations, this line would be segmented into:

the dr.
lives in a blue box.

Unfortunately, universally accepted standards for many abbreviations and acronyms do not exist.

The most widely adopted approach to the recognition of abbreviations is to maintain a list of known abbreviations. During tokenization, a word with a trailing period can be looked up in such a list; if it is found there, it is tokenized as a single token, otherwise the period is tokenized as a separate token. Naturally, the accuracy of this approach depends on how well the list of abbreviations is tailored to the text under processing. There will almost certainly be abbreviations in the text which are not included in the list. Also, abbreviations in the list can coincide with common words and trigger erroneous tokenization. For instance, 'in' can be an abbreviation for 'inches', 'no' for 'number', 'bus' for 'business', 'sun' for 'Sunday', etc.
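A minimal sketch of this look-up strategy, assuming a toy abbreviation list (a real list would be tailored to the corpus):

    import java.util.List;
    import java.util.Set;

    // List-based abbreviation handling: a word with a trailing period is
    // kept whole only if it appears in the known-abbreviation list;
    // otherwise the period is split off as a separate token.
    public class AbbreviationHandler {

        private static final Set<String> ABBREVIATIONS = Set.of("dr.", "mr.", "e.g.", "etc.");

        public static List<String> splitTrailingPeriod(String token) {
            String lower = token.toLowerCase();
            if (lower.endsWith(".") && !ABBREVIATIONS.contains(lower)) {
                return List.of(token.substring(0, token.length() - 1), ".");
            }
            return List.of(token); // a known abbreviation, or no trailing period
        }

        public static void main(String[] args) {
            System.out.println(splitTrailingPeriod("dr."));  // [dr.]
            System.out.println(splitTrailingPeriod("box.")); // [box, .]
        }
    }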

Step 3: Handling Hyphenated Words

Hyphenated segments present a case of ambiguity for a tokenizer: sometimes a hyphen is part of a token (e.g. self-assessment, F-15, forty-two) and sometimes it is not (e.g. Los Angeles-based).

Segmentation of hyphenated words is task dependent. For instance, part-of-speech taggers usually treat hyphenated words as single syntactic units and therefore prefer them to be tokenized as single tokens. On the other hand, named entity recognition (NER) systems attempt to split a named entity from the rest of a hyphenated fragment; e.g. in parsing the fragment 'Moscow-based', such a system needs 'Moscow' to be tokenized separately from 'based' in order to tag it as a location.

Types of Hyphens:

End-of-Line Hyphen

True Hyphen

Lexical Hyphen

Sententially Determined Hyphenation

End-of-Line Hyphen

End-of-line hyphens are used for splitting whole words into parts to justify text during typesetting. They should therefore be removed during tokenization, because they are not part of the word but rather layout instructions.

True Hyphen

True hyphens, on the other hand, are integral parts of complex tokens, e.g. forty-seven, and should therefore not be removed. When a hyphen occurs at the end of a line, it is sometimes difficult to distinguish a true hyphen from an end-of-line hyphen.
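A naive de-hyphenation pass might simply delete any hyphen followed by a line break, accepting that a true or lexical hyphen which happens to fall at the end of a line will be wrongly merged:

    // Naive end-of-line de-hyphenation: join "fine-\ntuning" back into
    // "finetuning" by deleting the hyphen and the line break after it.
    // A true hyphen at a line break is wrongly merged by this rule.
    public class Dehyphenator {

        public static String dehyphenate(String text) {
            return text.replaceAll("-\\s*\\r?\\n\\s*", "");
        }

        public static void main(String[] args) {
            System.out.println(dehyphenate("the co-operative was fine-\ntuning"));
            // Prints: the co-operative was finetuning
        }
    }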

Lexical Hyphen

Lexical hyphens appear in hyphenated compound words which have made their way into the standard language vocabulary. For instance, certain prefixes (and less commonly suffixes) are often written hyphenated, e.g. co-, pre-, meta-, multi-, etc.

Sententially Determined Hyphenation

Here hyphenated forms are created dynamically as a mechanism to prevent incorrect parsing of the phrase in which the words appear. There are several types of hyphenation in this class. One is created when a noun is modified by an 'ed'-verb to dynamically create an adjective, e.g. case-based, computer-linked, hand-delivered. Another case involves an entire expression used as a modifier in a noun group, as in a three-to-five-year direct marketing plan. In treating these cases a lexical look-up strategy is not much help; normally such expressions are treated as a single token, unless there is a need to recognize specific tokens such as dates, measures, and names, in which case they are handled by specialized subgrammars.

This hypothetical sentence poses many challenges:

the New York-based co-operative was fine-
tuning forty-two K-9-like models.

Token            Type
New York-based   Sentential
co-operative     Lexical
fine-tuning      End-of-line (could also be considered a lexical hyphen, depending on the author's stylistic preference)
forty-two        Lexical
K-9-like         Lexical and Sentential

Step 4: Numerical and Special Expressions

Examples:

Email addresses

URLs

Complex enumeration of items

Telephone Numbers

Dates

Time

Measures

Vehicle Licence Numbers

Paper and book citations

etc.

These can cause a tokenizer a great deal of confusion because they usually involve rather complex alphanumeric and punctuation syntax.

Take phone numbers, for example. A variety of formats exist:

123-456-7890

(123)-456-7890

123.456.7890

(123) 456-7890

etc.

A pre-processor can be designed to recognize phone numbers and normalize them. All phone numbers would then be in a single format, making the tokenizer's job easier.
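As a sketch, such a pre-processor could be built around a single regular expression covering the formats above (North American numbers only; the canonical output format chosen here is arbitrary):

    import java.util.regex.Pattern;

    // Normalize common North American phone-number formats to 123-456-7890.
    public class PhoneNormalizer {

        // Optional parentheses around the area code; '-', '.', or a space
        // between the digit groups.
        private static final Pattern PHONE =
            Pattern.compile("\\(?(\\d{3})\\)?[-. ]?(\\d{3})[-.](\\d{4})");

        public static String normalize(String text) {
            return PHONE.matcher(text).replaceAll("$1-$2-$3");
        }

        public static void main(String[] args) {
            System.out.println(normalize("call (123) 456-7890 or 123.456.7890"));
            // Prints: call 123-456-7890 or 123-456-7890
        }
    }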

Date/Time Formats:

8th-Feb

8-Feb-2013

02/08/13

February 8th, 2013

Feb 8th

etc.

A pre-processor could recognize all of these variations and normalize them into a single canonical expression.
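A sketch of such a pre-processor using java.time, trying a list of known input patterns and emitting one canonical form (the patterns and the canonical format are illustrative; ordinal suffixes like "8th" would have to be stripped in an earlier pass):

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.time.format.DateTimeParseException;
    import java.util.List;
    import java.util.Locale;

    // Try each known date pattern in turn; emit ISO yyyy-MM-dd on success.
    public class DateNormalizer {

        private static final List<DateTimeFormatter> INPUT_FORMATS = List.of(
            DateTimeFormatter.ofPattern("d-MMM-uuuu", Locale.ENGLISH),   // 8-Feb-2013
            DateTimeFormatter.ofPattern("MM/dd/uu", Locale.ENGLISH),     // 02/08/13
            DateTimeFormatter.ofPattern("MMMM d, uuuu", Locale.ENGLISH)  // February 8, 2013
        );

        public static String normalize(String dateText) {
            for (DateTimeFormatter format : INPUT_FORMATS) {
                try {
                    return LocalDate.parse(dateText, format)
                                    .format(DateTimeFormatter.ISO_LOCAL_DATE);
                } catch (DateTimeParseException e) {
                    // not this pattern; try the next one
                }
            }
            return dateText; // unrecognized: leave untouched
        }

        public static void main(String[] args) {
            System.out.println(normalize("8-Feb-2013"));       // 2013-02-08
            System.out.println(normalize("February 8, 2013")); // 2013-02-08
        }
    }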

The Stanford tokenizer does somewhat better than the OpenNLP tokenizer, which is to be expected. The custom parser (included in the appendix), shown in the fourth column, does a nearly perfect job, though without the enclitic expansion shown in the first hypothetical pass.

The more accurate (and complex) segmentation processes in the fourth and fifth columns require morphological parsing.

We can address some of these issues in the first three examples by treating punctuation, in addition to white space, as a word boundary. But punctuation often occurs word-internally, as in u.s.a., Ph.D., AT&T, ma'am, cap'n, 01/02/06, and stanford.edu. Similarly, assuming we want 7.1 or 82.4 to count as words, we can't segment on every period, since that would split them into "7" and "1" and "82" and "4". Should "data-base" be considered two separate tokens or a single token? The number "$2,023.74" should be considered a single token, but here the comma and period are not delimiters, while in other contexts they might be. And should the "$" sign be considered part of that token, or a separate token in its own right?
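One way to encode such exceptions is a token-matching regular expression that keeps numbers, currency amounts, and word-internal punctuation intact while splitting everything else. The following is a sketch only; trailing abbreviation periods are still left to the separate abbreviation step described earlier:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Punctuation becomes a boundary except where it is token-internal.
    public class PunctuationAwareTokenizer {

        private static final Pattern TOKEN = Pattern.compile(
            "\\$?\\d+(?:[.,/]\\d+)*"     // 7.1, 82.4, $2,023.74, 01/02/06
            + "|\\w+(?:[.'&-]\\w+)*"     // u.s.a, AT&T, ma'am, data-base, stanford.edu
            + "|[^\\w\\s]");             // any other punctuation, one char at a time

        public static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) tokens.add(m.group());
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("It cost $2,023.74, about 7.1% at stanford.edu."));
            // [It, cost, $2,023.74, ,, about, 7.1, %, at, stanford.edu, .]
        }
    }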

The java.util.StringTokenizer class in Java is an example of a white space tokenizer, in which you can define the set of characters that mark the boundaries of tokens. Another Java class, java.text.BreakIterator, can identify word or sentence boundaries, but still does not handle ambiguities.
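For example, walking word boundaries with BreakIterator looks like this; it finds boundaries, but, as noted, it does not resolve ambiguities such as abbreviations or multi-word names:

    import java.text.BreakIterator;
    import java.util.Locale;

    // Print each word-like segment that BreakIterator finds, one per line.
    public class BreakIteratorDemo {

        public static void main(String[] args) {
            String text = "Install Rational Software Architect for WebSphere on AIX 5.3";
            BreakIterator it = BreakIterator.getWordInstance(Locale.US);
            it.setText(text);

            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String token = text.substring(start, end).trim();
                if (!token.isEmpty()) System.out.println(token);
            }
        }
    }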

Named Entity Extraction

It's almost impossible to separate tokenization from named entity extraction. It really isn't possible to come up with a generic set of rules that will handle all ambiguous cases in English; the easiest approach is usually just to use multi-word expression dictionaries. Consider this sentence:

Install Rational Software Architect for WebSphere on AIX 5.3

#   Naïve Whitespace Parser   Hypothetical Tokenizer (Ideal Tokenization)
1   install                   install
2   rational                  rational software architect for websphere
3   software
4   architect
5   for
6   websphere
7   on                        on
8   aix                       aix 5.3
9   5.3

Dictionaries must exist that tell the tokenization process that "Rational Software Architect for WebSphere" is a single token (a product) and that "AIX 5.3" is likewise a single product.
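A sketch of this dictionary-driven pass, using greedy longest-match over a toy dictionary containing just those two products:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Greedy longest-match against a multi-word expression dictionary:
    // at each position, prefer the longest known entry; fall back to the
    // single word when nothing matches.
    public class MultiWordTokenizer {

        private static final List<String> DICTIONARY = List.of(
            "rational software architect for websphere",
            "aix 5.3");

        public static List<String> tokenize(String text) {
            String[] words = text.toLowerCase().split("\\s+");
            List<String> tokens = new ArrayList<>();
            int i = 0;
            while (i < words.length) {
                String match = null;
                int j;
                for (j = words.length; j > i; j--) { // longest span first
                    String span = String.join(" ", Arrays.copyOfRange(words, i, j));
                    if (DICTIONARY.contains(span)) { match = span; break; }
                }
                if (match != null) { tokens.add(match); i = j; }
                else { tokens.add(words[i]); i++; }
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize(
                "Install Rational Software Architect for WebSphere on AIX 5.3"));
            // [install, rational software architect for websphere, on, aix 5.3]
        }
    }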

The impact that tokenization has on the rest of the process cannot be overstated. A typical next step, following tokenization, is to send the segmented text to a deep parser. With the naïve whitespace tokenization in the first column, the Rational product would end up being deep parsed into a structure like this:
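Schematically, as a hypothetical bracketing (not the output of any particular parser):

    (VP Install
        (NP Rational Software Architect)
        (PP for (NP WebSphere))
        (PP on (NP AIX 5.3)))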

Note the formation of a prepositional phrase (PP) around "for WebSphere" and the noun phrase trigram "Rational Software Architect". If the sentence had been semantically segmented with the aid of a multi-word dictionary, the output from the deep parser would instead look like this:
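Again schematically, as a hypothetical bracketing:

    (VP Install
        (NP (NNP Rational Software Architect for WebSphere))
        (PP on (NP (NNP AIX 5.3))))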

There is a single noun phrase containing one noun (NNP = proper noun, singular).

English Enclitics

A clitic is a unit whose status lies between that of an affix and a word. The phonological behavior of clitics is like that of affixes: they tend to be short and unaccented. Their syntactic behavior is more like that of words, often acting as pronouns, articles, conjunctions, or verbs. Clitics preceding a word are called proclitics, and those following are called enclitics.

English enclitics include:

The abbreviated forms of be:

’m in I’m

’re in you’re

’s in she’s

The abbreviated forms of auxiliary verbs:

’ll in they’ll

’ve in they’ve

’d in you’d

Note that clitics in English are ambiguous. The word "she's" can mean "she has" or "she is".

A tokenizer can also be used to expand clitic contractions that are marked by apostrophes, for example:

what're => what are
we're => we are

This requires ambiguity resolution, since apostrophes are also used as genitive markers (as in "the book's cover" or "the containers' contents") or as quotative markers. While these contractions tend to be clitics, not all clitics are marked this way with contractions. In general, then, segmenting and expanding clitics can be done as part of a morphological parsing process.
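As a sketch, the unambiguous enclitics can be expanded with a simple substitution table, leaving ambiguous forms such as 's and 'd to morphological analysis (the class name and the word list are illustrative, and no genitive or quotative disambiguation is attempted):

    import java.util.Map;

    // Naive clitic expansion by string substitution. Ambiguous enclitics
    // ('s = is/has, 'd = would/had) are deliberately omitted.
    public class CliticExpander {

        private static final Map<String, String> ENCLITICS = Map.of(
            "'m", " am",
            "'re", " are",
            "'ll", " will",
            "'ve", " have");

        public static String expand(String text) {
            String result = text;
            for (Map.Entry<String, String> e : ENCLITICS.entrySet()) {
                result = result.replace(e.getKey(), e.getValue());
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(expand("what're we doing? we're fine"));
            // Prints: what are we doing? we are fine
        }
    }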