In this book, Noah Smith sets out to ''bridge the gap between NLP [Natural Language Processing] and ML [Machine Learning]'' (p. xv). It is targeted at ''graduate students, NLP researchers wanting to better understand ML, and ML researchers wanting to better understand NLP,'' and as such, assumes a ''[b]asic familiarity'' (p. xv) with at least one of the areas.

Smith proposes the general framework of 'linguistic structure prediction' as an approach to NLP problems that makes clear the connections to ML. Within this framework, Smith discusses a wide variety of techniques and approaches to both supervised and unsupervised learning in considerable technical detail. He covers important 'traditional' ideas such as the semiring approach in NLP, by now 'standard' approaches such as Support Vector Machines (SVMs) and Maximum Entropy models, and the general idea of and differences between generative and discriminative models.

Chapter one introduces some typical NLP tasks, ranging from Part-of-Speech tagging to syntactic parsing. While providing a basic overview of NLP problems for people with a primary background in ML, Smith also shows how to formulate them as instances of a generic ML problem, showing the connections between the two fields. He proposes the abstract ''linguistic structure prediction framework'' that is made up of four steps:

First, ''[f]ormally define the inputs and outputs'' (p. 20). Smith provides concrete examples of how to do this in cases such as Part-of-Speech tagging or syntactic parsing, thus also introducing in an intuitive way notation that is used consistently throughout the remainder of the book. Second, ''[i]dentify a scoring function over input-output pairs, and an algorithm that can find the maximum-scoring output given an input'' (p. 20), the topic of Chapter two. Third, identify potential data that actually allows one ''to learn to predict outputs from inputs, and apply a learning algorithm'' (p. 20), discussed further in Chapters three to five. Finally, ''evaluate the model on an objective criterion measured on unseen test data'' (p. 21), a point that is addressed in one of the four appendices.

Chapter two starts with an explicit definition of the scoring function mentioned in step two. A scoring function needs to assign a single real number (a score) to every possible input-output pair for the given task; for example, in Part-of-Speech (PoS) tagging it needs to assign a score to every possible pair of word and PoS sequences of the same length. Given a formal representation of both input and output along the lines of Chapter one, these representations are first mapped to a feature vector, an ordered sequence of real numbers. As in Chapter one, Smith provides concrete and intuitive examples of how to define a feature function for specific NLP tasks, giving people with primarily an ML background an idea of what relevant features can look like in NLP. Given a feature vector V, a score is then simply assigned to the corresponding input-output pair by taking the dot product between the vector V and a given weight vector W. Thus, problems in NLP can be stated as standard optimization problems: for a given input I, find the output O such that the product of the corresponding V and a given weight vector W is as large as possible. Obviously, this raises two further questions, namely how one can efficiently find the optimal output value and where the weight vector W comes from. The rest of Chapter two is dedicated to the first problem, finding the optimal output if one already knows the weight vector, for which Smith introduces the technical term 'decoding'.
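The dot-product scoring just described can be sketched in a few lines of Python; the feature templates and weights below are invented for illustration and are not taken from the book.

```python
# A toy feature function for PoS tagging (feature templates are
# illustrative assumptions): features fire on word/tag pairs and on
# adjacent tag pairs.
def features(words, tags):
    feats = {}
    for i, (word, tag) in enumerate(zip(words, tags)):
        prev = tags[i - 1] if i > 0 else "<s>"
        for name in (f"word={word},tag={tag}", f"prev={prev},tag={tag}"):
            feats[name] = feats.get(name, 0.0) + 1.0
    return feats

def score(words, tags, weights):
    # The score is the dot product of the (sparse) feature vector V
    # with the weight vector W.
    return sum(weights.get(name, 0.0) * value
               for name, value in features(words, tags).items())

w = {"word=dog,tag=NOUN": 1.5, "prev=DET,tag=NOUN": 0.5}
print(score(("the", "dog"), ("DET", "NOUN"), w))  # 2.0
```

Decoding then amounts to searching for the tag sequence that maximizes this score, which is where the algorithms of the rest of the chapter come in.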

Before addressing how the problem of decoding can be solved efficiently, however, Smith introduces his 'five views on decoding', which serve to provide more intuitive ways to think about the rather abstract formulation of the scoring function as just a product of vectors. In particular, Smith shows how apparently different approaches such as Probabilistic Graphical Models, Parsing or Weighted Logic Programs can all be seen as (not necessarily equivalent) instantiations of the abstract idea of decoding just outlined, and that certain tasks are more naturally framed in one view than in another.

The rest of Chapter two provides an exhaustive discussion of Dynamic Programming, one of the key ideas behind many efficient algorithms that have a long history in NLP. Even for people familiar with specific instances of Dynamic Programming, e.g. the CKY or the Viterbi algorithm, Smith's presentation is interesting in that he uses the mathematical idea of a semiring to provide a very general look at Dynamic Programming. His use of important NLP algorithms as examples again makes it easy for people already familiar with them to understand the abstract approach, and provides people without a strong background in NLP with a clear understanding of some of the core algorithms of the field.
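The appeal of the semiring view can be made concrete: the same trellis recursion computes the Viterbi (max-product) and forward (sum-product) quantities for an HMM, differing only in which semiring operations are plugged in. The tiny two-state HMM below is a made-up illustration, not an example from the book.

```python
def chart_value(obs, states, start, trans, emit, plus, times, zero):
    """Generic trellis recursion; the semiring's `plus` decides whether we
    take the best path (max) or sum over all paths (+)."""
    prev = {s: times(start[s], emit[s][obs[0]]) for s in states}
    for o in obs[1:]:
        cur = {}
        for s in states:
            acc = zero
            for r in states:
                acc = plus(acc, times(prev[r], trans[r][s]))
            cur[s] = times(acc, emit[s][o])
        prev = cur
    total = zero
    for s in states:
        total = plus(total, prev[s])
    return total

# A made-up two-state weather HMM, purely for illustration.
states = ("H", "C")
start = {"H": 0.6, "C": 0.4}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {"hot": 0.8, "cold": 0.2}, "C": {"hot": 0.3, "cold": 0.7}}
obs = ("hot", "cold")

mul = lambda a, b: a * b
viterbi = chart_value(obs, states, start, trans, emit, max, mul, 0.0)  # best single path
forward = chart_value(obs, states, start, trans, emit,
                      lambda a, b: a + b, mul, 0.0)  # total probability of obs
```

Swapping in the tropical (max, +) semiring over log-probabilities, or a Boolean semiring for pure recognition, reuses the identical recursion, which is the generality the semiring formulation buys.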

Finally, Chapter two includes a quick overview of approximate algorithms that can be used to solve problems for which no efficient Dynamic Programming formulation is known, or in cases where solving the Dynamic Program exactly is still impractical, a situation that often arises in practice.
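One widely used approximation of this kind is beam search, which keeps only the k highest-scoring partial hypotheses at each position. The sketch below, with an invented local scoring function, is a generic illustration rather than an algorithm reproduced from the book.

```python
def beam_search(words, tagset, score_step, k=2):
    """Keep the k best partial tag sequences after each word."""
    beam = [((), 0.0)]  # (partial output, score so far)
    for word in words:
        candidates = [(tags + (tag,), s + score_step(word, tag, tags))
                      for tags, s in beam for tag in tagset]
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beam[0]  # best complete hypothesis found

def toy_score(word, tag, prev_tags):
    # invented local scores, purely for illustration
    return {("the", "DET"): 2.0, ("dog", "NOUN"): 2.0}.get((word, tag), 0.0)

best_tags, best_score = beam_search(("the", "dog"), ("DET", "NOUN"), toy_score)
print(best_tags, best_score)  # ('DET', 'NOUN') 4.0
```

Unlike exact Dynamic Programming, beam search can discard the true optimum when it looks unpromising early on, which is exactly the accuracy-for-speed trade-off such approximations make.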

Chapter three addresses the problem of finding a weight vector W, assuming many correct input-output pairs are available. This situation is known as supervised learning, and the first half of Chapter three provides a detailed discussion of how probabilistic models, both generative and conditional, can be used in this setting. Smith succeeds in making clear the difference between generative and conditional probabilistic models and their relative merits and shortcomings, as well as in relating them back to the abstract notion of decoding developed in Chapter two. He starts with basic and ''traditional'' approaches such as Hidden Markov Models, Probabilistic Context Free Grammars and Naïve Bayes, again providing an opportunity to get used, in a 'familiar' setting, to the intricate notation and the not exactly trivial mathematical details that necessarily come up in this context. It is also worth pointing out that Smith provides an introduction to the use of Dirichlet priors as an elegant approach to smoothing, an idea that has become increasingly popular in the NLP community over the last few years.
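To make the smoothing connection concrete: under a symmetric Dirichlet prior with parameter alpha, the MAP estimate of a multinomial amounts to adding alpha - 1 pseudo-counts to each outcome, so alpha = 2 recovers classic add-one (Laplace) smoothing. The sketch below uses toy counts and illustrates this standard fact; it is not code from the book.

```python
def map_estimate(counts, vocab, alpha=2.0):
    """MAP estimate of a multinomial under a symmetric Dirichlet(alpha)
    prior: equivalent to add-(alpha - 1) smoothing for alpha > 1."""
    total = sum(counts.get(w, 0) for w in vocab)
    denom = total + len(vocab) * (alpha - 1.0)
    return {w: (counts.get(w, 0) + alpha - 1.0) / denom for w in vocab}

# Toy counts: "fish" was never observed, yet still gets probability mass.
probs = map_estimate({"cat": 3, "dog": 1}, ["cat", "dog", "fish"])
print(probs)  # cat: 4/7, dog: 2/7, fish: 1/7
```

The attraction of the Bayesian formulation is that the same machinery generalizes beyond this simple MAP view, e.g. to the fully Bayesian methods taken up in Chapter four.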

The second half of Chapter three discusses fully discriminative models, in particular large margin methods, and Smith succeeds in explaining their general difference from probabilistic models and why one may choose to use one or the other. Starting with the very intuitive Perceptron algorithm, Smith goes on to sketch the more powerful framework of Support Vector Machines, pointing out the challenges these models pose in learning weight vectors from the training data.
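The perceptron's intuition fits in a few lines: predict with the current weights and, on a mistake, move the weights toward the gold label. The binary version below, with invented toy data, is a generic sketch rather than the structured variant the book develops.

```python
def perceptron_train(data, n_features, epochs=10):
    """Binary perceptron: mistake-driven additive updates."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in data:  # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # mistake: move weights toward gold label
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Toy linearly separable data, invented for illustration.
data = [((1.0, 0.0), 1), ((0.0, 1.0), -1)]
w, b = perceptron_train(data, 2)
```

The structured case replaces the sign of the activation with an argmax over outputs (decoding), updating toward the gold structure's features and away from the predicted one; SVMs then add the large-margin objective on top of this same linear scoring.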

Chapter four addresses the problem of finding a weight vector W in cases where one does not have any input-output pairs to learn from, commonly called unsupervised learning. It provides an excellent discussion of the Expectation Maximization algorithm, and how it can be applied to probabilistic generative models for unsupervised learning. Particularly noteworthy is the discussion of Bayesian Unsupervised Learning, a very recent trend in NLP. Smith both makes very clear the general difference between the Bayesian and the 'standard' (frequentist) approach and provides an explanation of the relevant formal background, including a quick discussion of Markov Chain Monte Carlo methods and Variational Inference.
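The EM idea can be illustrated with the classic two-coins example: alternately compute posteriors over which coin produced each trial (E step) and re-estimate each coin's bias from the resulting fractional counts (M step). The data and starting values below are invented, and the mixing proportion is fixed at 0.5 for simplicity.

```python
def em_two_coins(trials, theta_a, theta_b, iters=20):
    """EM for a 50/50 mixture of two biased coins; each trial is a
    (heads, tails) pair whose generating coin is unobserved."""
    for _ in range(iters):
        # E step: posterior probability that coin A produced each trial.
        ha = ta = hb = tb = 0.0
        for h, t in trials:
            la = theta_a ** h * (1 - theta_a) ** t
            lb = theta_b ** h * (1 - theta_b) ** t
            pa = la / (la + lb)
            ha += pa * h; ta += pa * t
            hb += (1 - pa) * h; tb += (1 - pa) * t
        # M step: re-estimate each coin's bias from fractional counts.
        theta_a = ha / (ha + ta)
        theta_b = hb / (hb + tb)
    return theta_a, theta_b

# Invented data: two head-heavy and two tail-heavy trials of ten flips.
trials = [(9, 1), (8, 2), (2, 8), (1, 9)]
theta_a, theta_b = em_two_coins(trials, 0.6, 0.4)
print(theta_a, theta_b)  # one coin near 0.85, the other near 0.15
```

Each iteration is guaranteed not to decrease the data likelihood, which is the property that makes EM attractive for the latent-variable models discussed in this chapter.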

In addition to unsupervised learning, Chapter four provides a short but insightful discussion of Hidden Variable Learning. This is the case where one actually has input-output pairs to learn from, as in Chapter three, but assumes that there is unknown hidden structure on the input side that is useful in predicting the output. Importantly, Hidden Variable Learning can also be applied to conditional and discriminative models whereas 'classical' unsupervised learning cannot, thus providing a connection to the discussions in Chapter three.

Chapter five provides additional mathematical details about some of the subproblems that arose in Chapters three and four. It is the mathematically most challenging part of the book, but it can also serve as a reference for specific mathematical questions that arise when actually implementing one of the techniques discussed in the previous chapters.

Finally, there are four appendices that provide additional background on some of the topics mentioned in the main text. The first appendix discusses numerical optimization methods, particularly important for Chapters three and four. The second appendix discusses experimental methodology in NLP, thus addressing the important question of how to do evaluation. The third and fourth appendices elaborate on Chapters three and four, discussing Maximum Entropy models and the difference between locally and globally normalized probabilistic models.

EVALUATION

''Linguistic Structure Prediction'' provides an exhaustive and detailed discussion of ML techniques that are currently used within NLP. It is clearly not an introductory book for people with absolutely no background in either field, but given that it defines its own target audience as graduate students and researchers, this is not a real criticism. The only thing that I would consider 'missing' is a discussion of directed probabilistic graphical models, as Smith focuses exclusively on undirected ones. Again, however, this is something that Smith points out himself, and his reason for focusing on the undirected case, ''general methods for decoding are [...] easier to understand in the undirected case'' (p. 29), is reasonable.

Particularly noteworthy is how up-to-date the book is. Even very recent developments are covered, e.g. Bayesian non-parametric approaches, and the extensive bibliography makes it easy for the reader to identify relevant and recent papers. Researchers in NLP will welcome it as a useful and up-to-date reference book for ML techniques, in particular given that most of the literature in ML uses examples that are often hard to connect back to the discrete and richly structured world of NLP problems.

To conclude, then, one can say that ''Linguistic Structure Prediction'' does a good job of bringing ML and NLP closer together.

ABOUT THE REVIEWER

Benjamin Börschinger received his M.A. in Philosophy and Computational
Linguistics from the University of Heidelberg and is currently working
towards his PhD in Computational Linguistics at Macquarie University,
Australia. His research focuses on computational models of human language
acquisition and unsupervised learning of linguistic structure.
Additionally, he is interested in conceptual questions that arise at the
interface of Linguistics and other disciplines, in particular the Cognitive
Sciences.