Who Should Read This Book


In the summer of 2008, I gave a talk at an international conference in Brighton. The talk was about constructions involving multiple hedging in American English (e.g., I’m gonna have to ask you to + VP). I remember this talk because even though I had every reason to be happy (the audience showed sustained interest and a major linguist in my field gave me positive feedback), I felt a pang of dissatisfaction. Because my research was mostly theoretical at the time, I had concluded my presentation with the phrase “pending empirical validation” one too many times. Of course, I had used examples gleaned from the renowned Corpus of Contemporary American English, but my sampling was not exhaustive and was certainly biased. Even though I felt I had answered my research questions, I had provided no quantitative summary. I went home convinced that it was time to buttress my research with corpus data. I craved a better understanding of corpora: their constitution, their assets, and their limits. I also wished to extract the data I needed the way I wanted, beyond what traditional, prepackaged corpus tools have to offer.

I soon realized that the kind of corpus linguistics I was looking for was technically demanding, especially for the armchair linguist that I was. In the summer of 2010, my lab offered to send me to a one-week boot camp in Texas whose instructor, Stefan Th. Gries (University of California, Santa Barbara), had just published Quantitative Corpus Linguistics with R. This boot camp was a career-changing opportunity. I went on to teach myself more elaborate corpus-linguistics techniques as well as the kinds of statistics that linguists generally have other colleagues do for them. This led me to collaborate with great people outside my field, such as mathematicians, computer engineers, and experimental linguists. All the while, I never set aside my research in theoretical linguistics. I can say that acquiring empirical skills has made me a better theoretical linguist.

If the above lines echo your own experience, this book is perfect for you. While written for a readership with little or no background in corpus linguistics, computer programming, or statistics, Corpus Linguistics and Statistics with R will also appeal to readers with more experience in these fields. Indeed, while presenting in detail the text-mining apparatus used in traditional corpus linguistics (frequency lists, concordance tables, collocations, etc.), the text also introduces the reader to some appealing techniques that I wish I had become acquainted with much earlier in my career (motion charts, word clouds, network graphs, etc.).

Goals

This is a book on empirical linguistics written from a theoretical linguist’s perspective. It provides both a theoretical discussion of what quantitative corpus linguistics entails and detailed, hands-on, step-by-step instructions to implement the techniques in the field.

Summary

The statistical methodology and R-based coding in this book teach readers basic and then more advanced skills for working with large data sets in their linguistic research and studies. Massive data sets are now more than ever the basis for work that ranges from usage-based linguistics to the far reaches of applied linguistics. The book presents much of its methodology within a corpus-based approach; these corpus-based methods are also essential components of recent developments in sociolinguistics, historical linguistics, computational linguistics, and psycholinguistics. The material will also appeal to researchers in the digital humanities and the many non-linguistic fields that use textual data analysis and text-based sensorimetrics. Chapters cover topics including corpus processing, frequency data, and clustering methods. Case studies illustrate each chapter, with accompanying data sets, R code, and exercises. The book is suitable for advanced undergraduate courses, graduate courses, and self-study.

Table of contents

Chapter 1. Introduction Pages 1-12

In this chapter, I explain the theoretical relevance of corpora. I answer three questions: What counts as a corpus? What do linguists do with corpora? What status does the corpus have in the linguist’s approach to language?

Part I

Chapter 2. R Fundamentals Pages 15-49

This chapter is designed to familiarize linguists with the R environment. First, it explains how to download and install R and R packages. It then teaches how to enter simple commands, use ready-made functions, and write user-defined functions. Finally, it introduces the basic R objects: the vector, the list, the matrix, and the data frame. Although meant for R beginners, this chapter can also serve as a refresher for readers who have some experience in R.
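To give a flavor of what these objects look like in practice, here is a minimal R sketch; the words, counts, and the pmw() helper are invented for illustration and do not come from the book.

```r
# Vectors are R's most basic object: ordered sequences of values of one type
freqs <- c(12, 7, 31)

# Lists can mix object types; elements may be named
corpus_info <- list(name = "COCA", size = 560e6)

# Data frames store tabular data: one row per observation
df <- data.frame(word = c("gonna", "hafta", "wanna"), freq = freqs)

# A user-defined function: frequency normalized per million words
pmw <- function(freq, corpus_size) freq / corpus_size * 1e6
pmw(df$freq, 560e6)
```

Normalized frequencies such as per-million-word counts make texts of different sizes comparable, which is why a small helper like this is often among the first functions a corpus linguist writes.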

Chapter 3. Digital Corpora Pages 51-67

A corpus is a digital text or collection of texts. After presenting a tentative typology of corpora, this chapter provides guidelines for compiling your own corpora before presenting the characteristics of ready-made, annotated corpora.

Chapter 4. Processing and Manipulating Character Strings Pages 69-86

In this chapter, you will learn techniques for handling text material with R. Some of these techniques involve regular expressions, i.e., patterns that describe sets of strings. After describing the individual functions, this chapter teaches you how to combine them.
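As a small taste of this kind of string processing, here is a sketch using base-R functions; the sentence and patterns are invented for illustration.

```r
sentence <- "I'm gonna have to ask you to leave."

# strsplit() with a regular expression tokenizes on runs of non-letter characters
tokens <- unlist(strsplit(tolower(sentence), "[^a-z']+"))

# grepl() returns TRUE for each element that matches the pattern
sum(grepl("^to$", tokens))    # how many tokens are exactly "to"

# gsub() replaces every match of a pattern
gsub("gonna", "going to", sentence)
```

Combining such functions — split, match, replace — is the core workflow for turning raw text into countable units.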

Chapter 5. Applied Character String Processing Pages 87-114

In this chapter, you will learn how to handle text material by combining the R techniques introduced in the previous chapter.

Chapter 6. Summary Graphics for Frequency Data Pages 115-135

In this chapter, you will learn how to process frequency data and represent your findings graphically.
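A minimal example of such a graphical summary, with invented counts purely for illustration:

```r
# Invented frequency counts of contracted semi-modals, for illustration only
freqs <- c(gonna = 120, wanna = 85, gotta = 60, hafta = 12)

# barplot() gives the simplest graphical summary of frequency data
barplot(sort(freqs, decreasing = TRUE),
        ylab = "raw frequency",
        main = "Illustrative counts of contracted semi-modals")
```

Sorting the counts before plotting makes the frequency ranking immediately visible, a convention worth adopting for most frequency plots.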

Part II

Chapter 7. Descriptive Statistics Pages 139-149

Descriptive statistics summarize information. In this chapter, we review two kinds of descriptive statistics: measures of central tendency and measures of dispersion. Measures of central tendency are meant to summarize the profile of a variable. Although widespread, these statistics are often misused. I provide guidelines for using them. Measures of dispersion are complementary: they are meant to assess how good a given measure of central tendency is at summarizing the variable.
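Both kinds of measures are one-liners in R. In this sketch the word lengths are invented for illustration:

```r
# Invented token lengths (in characters), for illustration only
word_lengths <- c(3, 5, 2, 9, 4, 4, 6)

# Central tendency: where the values cluster
mean(word_lengths)
median(word_lengths)    # 4; robust to outliers such as 9

# Dispersion: how well the central value summarizes the variable
sd(word_lengths)
IQR(word_lengths)
```

A large standard deviation relative to the mean is a warning that the mean alone paints a misleading picture — exactly the kind of misuse the chapter guards against.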

Chapter 8. Notions of Statistical Testing Pages 151-195

In this chapter, you will learn the basics of statistical thinking, namely inferential statistics and statistical testing. These fundamentals will serve as a basis for the following chapters.
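A classic instance of statistical testing in corpus work is the chi-squared test of independence. The counts below are invented for illustration:

```r
# Invented counts: a construction vs. other expressions in two subcorpora
counts <- matrix(c(85, 915, 40, 960), nrow = 2,
                 dimnames = list(c("construction", "other"),
                                 c("spoken", "written")))

# chisq.test() tests whether frequency is independent of register
test <- chisq.test(counts)
test$p.value    # a small p-value suggests the frequencies differ by register
```

Interpreting such a p-value correctly — what it does and does not tell you — is precisely the kind of statistical thinking the chapter builds up.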

Chapter 9. Association and Productivity Pages 197-238

This chapter covers association measures and productivity measures with respect to lexico-grammatical patterns.
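One widely used association measure is pointwise mutual information (PMI), which compares a pair's observed co-occurrence probability with what chance predicts. The contingency table below is invented for illustration:

```r
# Invented 2x2 contingency table of corpus counts
#               collocate   other words
# node word            30           970
# other words         200         98800
O <- matrix(c(30, 970, 200, 98800), nrow = 2, byrow = TRUE)
N <- sum(O)

# PMI: log-ratio of observed joint probability to expected-by-chance probability
pmi <- log2((O[1, 1] / N) / ((sum(O[1, ]) / N) * (sum(O[, 1]) / N)))
pmi    # positive: the pair co-occurs more often than chance predicts
```

PMI is only one option; the chapter's association measures weigh observed against expected frequencies in related but distinct ways.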

Chapter 10. Clustering Methods Pages 239-294

In this chapter, I introduce clustering techniques. Their aim is to form clusters of objects so that similar objects are grouped in the same cluster and dissimilar objects in different clusters. I also introduce the network graph which, although not a clustering technique, is a useful, related addition to your corpus-linguistics toolkit.
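Hierarchical clustering, one such technique, is available in base R. In this sketch the feature counts are invented for illustration:

```r
# Invented feature counts for five short texts
m <- matrix(c(10, 2,
              12, 3,
              1, 9,
              2, 11,
              11, 1),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("text", 1:5), c("featA", "featB")))

# Hierarchical agglomerative clustering on Euclidean distances
hc <- hclust(dist(m), method = "ward.D2")
plot(hc)    # dendrogram: similar texts merge low in the tree

# cutree() turns the tree into a flat partition with k clusters
cutree(hc, k = 2)
```

Here the dendrogram separates the featA-heavy texts from the featB-heavy ones; with real corpus data, choosing the distance measure and agglomeration method is where the linguistic judgment comes in.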