From time to time, I am asked to give an overview presentation of the technology area of Natural Language Processing (NLP), either to train newcomers or as a veteran's introduction to this fascinating field for friends in general. Most of the time, the presentation takes the form of a casual talk with an interactive Q&A session. After working in this area for 26 years, during which I have never had to change my career (very lucky indeed), every day still feels as fresh and bright, or even more intriguing, than ever before. I have decided to write a series of posts giving an NLP overview, to share my knowledge, based on years of experience and thinking, with new colleagues as well as the general public. This series also serves as my contribution to the popularization of science with regard to a real-life high-tech field, one now seen in everyday life (e.g. iPhone Siri in people's hands, Google Translate within every netizen's reach, and our own NLP-supported products for customer insights and sentiments from social media, globally distributed).

Happy reading, and I hope you enjoy it.

(The Chinese version of this series is being posted at the same time in my Chinese blog on ScienceNet.cn, the largest online community of Chinese professionals and scientists.)

OVERVIEW OF NATURAL LANGUAGE PROCESSING (1/5)

by Wei LI

(NLP word cloud, generated by our own NLP engine parsing social media)

An active technology area like NLP continually generates new concepts, along with new technical terms that are not standardized. Without a proper frame of reference, new researchers are often overwhelmed by the confusing list of technical terms. Different researchers sometimes use different terms for the same concept; on the other hand, some terms are ambiguous or mean different things to different people. Whether or not the community has reached a consensus on a term, the key is to decode the semantics or referents behind it, in its broad sense or narrow sense, together with an understanding of its possible ambiguity. Seasoned professionals are sensitive to new terms and can rapidly place them in the right spot in the existing concept hierarchy of the area. This series of NLP presentations will use four flowcharts, crafted by the author just for you, to carpet-comb a full load of NLP-related terms in a concept network. All technical terms mentioned in this series are underlined (acronyms in italics), and some are given hyperlinks for interested readers to explore further.

Let our NLP journey begin.

First, what is NLP? Let us start with some background on the general concept of Natural Language Processing (NLP) and where it belongs, as well as some of the interchangeable or sister terms of NLP.

Fairly straightforwardly, the broad concept behind the term NLP involves the problem area of natural language: as the name suggests, Natural Language Processing refers to the computer processing of natural languages, for whatever purpose and regardless of processing depth. So-called natural language refers to the languages we use in daily life, such as English, Russian, Japanese, and Chinese; it is synonymous with human language, and is mainly distinguished from formal language, including computer languages. Natural language is the most natural and most common form of human communication, not only in its spoken form but also in writing, which has grown exponentially in recent years as the mobile Internet has taken off with social media. Compared with formal language, natural language is much more complex, often with omissions and ambiguity, making it difficult to process (hence NLP as a highly skilled profession, a golden rice bowl for us NLP practitioners :=)).

The term used (almost) interchangeably with NLP is Computational Linguistics (CL). As the name suggests, computational linguistics is an interdisciplinary field between Computer Science (CS) and Linguistics. In fact, NLP and CL are two sides of the same coin: the focus of NLP is practice, while CL is a science (theory). One can also say that CL is the scientific basis of NLP, and NLP is CL's application. Unlike basic disciplines such as mathematics and physics, computational linguistics is by nature problem-oriented, which shortens the distance from theory to practice; hence CL and NLP really are the same thing in many use scenarios. Its practitioners can therefore call themselves NLP engineers in industry or computational linguists in academia. Of course, although computational linguists in academia also need to build NLP systems, their focus is on using experiments to support the study of theory and algorithms. The NLP engineers in industry, on the other hand, are mainly charged with implementing real-life systems or building production-quality software products. That difference allows NLP engineers to adopt whatever works for the case (following Deng's famous "cat theory": black or white, it is a good cat as long as it catches rats), less concerned with how sophisticated or popular a strategy or algorithm is.

Another term that is often used in parallel with NLP is Machine Learning (ML). Strictly speaking, machine learning and NLP are concepts at completely different levels: the former refers to a class of approaches, while the latter indicates a problem area. However, due to the "panacea" nature of machine learning, coupled with the fact that ML has dominated the mainstream of NLP (especially in academia), many people forget or simply ignore the existence of the other NLP approach, namely hand-crafted linguistic rules. Thus, it is no surprise that in these people's eyes, NLP is machine learning. In reality, of course, machine learning goes way beyond the field of NLP. The machine learning algorithms used for various language processing tasks can equally be used to accomplish other artificial intelligence (AI) tasks, such as stock market analysis and forecasting, credit card fraud detection, machine vision, DNA sequence classification, and even medical diagnosis.

In parallel to machine learning, the more traditional approach to NLP is hand-crafted rules, compiled and formalized by linguists or knowledge engineers. A whole set of such rules for a given NLP task forms a computational grammar, which can be compiled into a rule system. Machine learning and rule systems have their own advantages and disadvantages. Generally speaking, machine learning excels at coarse-grained tasks like document classification or clustering, while hand-crafted rules are good at fine-grained linguistic analysis, such as deep parsing. If we compare a language to a forest and the sentences of the language to trees, machine learning is an adequate tool for overviewing the forest, while the rule system sees each individual tree. In terms of data quality, machine learning is strong in recall (coverage of linguistic phenomena) while rules are good at precision (accuracy). So they really complement each other fairly naturally, but unfortunately, there are some "fundamentalist extremists" from both schools who are not willing to recognize the other's strengths and try to belittle or eliminate the other approach. Due to the complexity of natural language phenomena, a practical real-life NLP system often needs to balance precision against recall and coarse-grained against fine-grained analysis. Thus, some way of combining the two NLP approaches is often a wise strategy. A simple and effective method of combining them is to build a back-up model for each major task: grammars are first compiled and run for detailed analysis with high precision (at the cost of modest recall), and machine learning is then applied as a default sub-system to pick up the recall.
Keep in mind that both approaches face a resource issue, the so-called knowledge bottleneck: grammars require skilled labor (linguists) to write and test, while ML is insatiable for huge data: especially for the relatively mature method of supervised learning, sizable human-labeled data (an annotated corpus) are a precondition.
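The back-up model described above can be sketched in a few lines. This is a hypothetical toy illustration (the rule, word lists, and function names are all invented for this post, not from any real system): a high-precision hand-crafted rule fires first, and a crude statistical-style fallback catches the recall when it does not.

```python
import re

# Stage 1: a precise hand-crafted rule (high precision, modest recall).
# This toy rule catches explicit negation of a positive adjective.
NEGATION_RULE = re.compile(r"\bnot\s+(good|great|useful)\b", re.IGNORECASE)

# Stage 2: a crude bag-of-words fallback (broad recall, coarser judgment).
POSITIVE_WORDS = {"good", "great", "love", "excellent", "useful"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "broken", "useless"}

def classify_sentiment(sentence: str) -> str:
    # The grammar rule runs first; only if it does not fire do we
    # fall back to the default statistical-style sub-system.
    if NEGATION_RULE.search(sentence):
        return "negative"
    tokens = {t.strip(".,!?").lower() for t in sentence.split()}
    pos = len(tokens & POSITIVE_WORDS)
    neg = len(tokens & NEGATIVE_WORDS)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(classify_sentiment("This phone is not good at all"))  # negative (rule fires)
print(classify_sentiment("I love the battery life"))        # positive (fallback)
```

Note how the fallback alone would misread "not good" as positive; the rule layer supplies the precision, the fallback the coverage.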

It is worth mentioning that traditional AI also relies heavily on manually coded rule systems, but there is a fundamental difference between an AI rule system and a grammar-based system for linguistic analysis (parsing). Generally speaking, computational grammars are much more tractable and practical than AI rule systems. An AI rule system not only needs linguistic analysis done by a sub-system like a computational grammar, it also attempts to encode common sense (at least the core part of humanity's accumulated common sense) in its knowledge representation and reasoning, making AI much more sophisticated, and often cumbersome, for practical applications. In a sense, the relationship between ML and AI is like the relationship between NLP and CL: ML focuses on the application side of AI, and one would assume that AI should then be in a theoretical position to guide ML. In reality, though, that is not the case at all: (traditional) AI is heavily based on knowledge encoding and representation (knowledge engineering) and logical reasoning, with overhead often too big and too complicated to scale up, or too expensive to maintain, in real-life intelligent systems. The domain of building intelligent systems (NLP included) has been gradually occupied by machine learning, whose theoretical foundation involves statistics and information theory instead of logic. AI scientists such as the Cyc inventor Douglas Lenat have become rare today, at a time when statisticians dominate the space. Perhaps in the future there will be a revival of true AI, but for the foreseeable future, machine learning, which models human intelligence as a black box connecting observable input and output, clearly prevails.
Note that the difference between the impractical (or overly ambitious) AI knowledge engineering approach and the much more tractable computational grammar approach in NLP determines their different outcomes: while (traditional) AI has been practically superseded by ML in almost all intelligent systems, the computational grammar approach has proved viable and will continue to play a role in NLP for a long time (though it also faces constant prejudice as well as challenges from ML and statisticians).

There is also another technical term almost interchangeable with NLP, namely, Natural Language Understanding (NLU). Although the literal interpretation of NLU as a process for machines to understand natural languages may sound like science fiction with a strong AI flavor, in reality the use of NLP vs. NLU, just like the use of NLP vs. CL, is often just a different habit adopted in different circles. NLP and NLU are basically the same concept. By "basically", we note that NLP can refer to shallow or even trivial language processing (for example, shallow parsing, to be presented later in this series, including tasks like tokenization, splitting a sentence into words, and morphology processing such as stemming, splitting a word into stem and affix), but NLU by definition assumes deep analysis (deep parsing). Here is the thing: through the lens of AI, it is NLU; from the ML perspective, it should only be called NLP.
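The two shallow tasks just mentioned, tokenization and stemming, can be illustrated with a deliberately naive sketch. This is a toy written for this post only; real systems use far richer machinery (e.g. the Porter stemmer), and the suffix list here is invented for illustration.

```python
import re

def tokenize(sentence):
    # Tokenization: split a sentence into word tokens,
    # separating punctuation from words.
    return re.findall(r"\w+|[^\w\s]", sentence)

# A toy suffix list; a real stemmer has many ordered, conditioned rules.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    # Naive stemming: strip the first matching suffix,
    # provided a reasonably long stem remains.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("The parsers parsed the sentences.")
print(tokens)  # ['The', 'parsers', 'parsed', 'the', 'sentences', '.']
print([stem(t) for t in tokens])
```

Even this toy shows why shallow processing is "NLP but hardly NLU": it manipulates word forms without touching meaning.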

In addition, natural language technology, or simply language technology, is also commonly used to refer to NLP.

Since the NLP-equivalent CL has two parents, computer science and linguistics, it follows that NLP has two parents too: in fact, NLP can be seen as an application area of both computer science and linguistics. In the beginning, the general concept of Applied Linguistics was assumed to cover NLP, but given the well-established and distinguished status of computational linguistics as an independent discipline for decades now (with "Computational Linguistics" as the key journal, ACL as the community, and the ACL annual conference and COLING as the top research meetings), Applied Linguistics now refers mainly to language teaching and practical areas such as translation, and it is generally no longer considered a parent of NLP or computational linguistics.

Conceptually, NLP (as a problem area) and ML (as methodology) both belong to the same big category of artificial intelligence, especially when we use the NLP-equivalent term natural language understanding or when specific applications of NLP such as machine translation are involved. However, as mentioned above, the traditional AI, which emphasizes knowledge processing (including common-sense reasoning), is very different from the data-driven reality of the current ML and NLP systems.

Now that we have made clear where NLP belongs and what are the synonyms or sister terms of NLP, let us "grasp the key links" to look at NLP per se and to sort out the mystery of the related hierarchy of concepts and terms. I will present NLP at four levels in subsequent posts, using four flowcharts of a conceptual system architecture (shown below, and to be repeated in each subsequent chapter). The four levels are:

1. linguistic level;

2. extraction level;

3. mining level;

4. app level.

These four levels of (sub-)systems basically represent a bottom-up support relationship: 1 ==> 2 ==> 3 ==> 4. Clearly, the core engine of NLP (namely, a parser) sits in the first layer as an enabling technology, while the app level in the fourth layer includes applications like question answering, machine translation, and intelligent assistants such as Siri.
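The bottom-up support relationship among the four levels can be sketched as a pipeline of functions feeding one another. All four function bodies below are invented stand-ins (the real levels are presented in the subsequent chapters); the sketch only shows how each layer consumes the output of the layer below.

```python
def linguistic_level(text):
    # Level 1: analyze raw text into structures
    # (here, just crude sentence splitting and tokenization).
    return [s.split() for s in text.split(".") if s.strip()]

def extraction_level(parses):
    # Level 2: extract mentions/facts from the parses
    # (here, just capitalized tokens as stand-in "entities").
    return [tok for sent in parses for tok in sent if tok.istitle()]

def mining_level(mentions):
    # Level 3: mine/aggregate the extracted items into statistics.
    counts = {}
    for m in mentions:
        counts[m] = counts.get(m, 0) + 1
    return counts

def app_level(stats):
    # Level 4: surface the mined results to an application
    # (here, a ranked report).
    return sorted(stats.items(), key=lambda kv: -kv[1])

text = "Siri answers questions. Siri uses NLP. Google translates text."
print(app_level(mining_level(extraction_level(linguistic_level(text)))))
```

The point of the sketch is the composition: each level is useless without the layer beneath it, which is exactly the 1 ==> 2 ==> 3 ==> 4 support relationship.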

As a final note, since natural language takes two forms, speech (oral form) and text (written form), NLP naturally covers two important branches of speech processing: 1. speech recognition, designed to enable computers to understand human speech; and 2. speech synthesis, to teach computers to speak back to humans. As the author is no expert in speech, this series of talks will deal only with text-oriented NLP, assuming speech recognition as a preprocessor and speech synthesis as a post-processor of our theme, text processing. In fact, this is a valid division of labor even in the actual language systems we see today; for example, the popular NLP application in smartphones, iPhone Siri, uses speech recognition to first convert speech into text, which is then fed to the subsequent system for text analysis and understanding.

We show all four flowcharts below, and will present them one by one in detail in subsequent chapters of this series (coming soon, hopefully).

[Acknowledgement] Thanks to the free service of the online Google Translate, the original Chinese text of this series was first automatically translated into English to serve as a basis for my reorganization and polishing. This definitely saved me considerable time, although the English version may not flow as well as its Chinese counterpart.