''Corpus Linguistics with BNCweb'' is the sixth in a series of titles from Peter Lang devoted to English corpus linguistics. BNC is an acronym for the British National Corpus (http://www.natcorp.ox.ac.uk/), which has been maintained at Lancaster University since the early 1990s and consists of 100 million words of both written (90%) and spoken (10%) British English in over 4000 texts, which are categorized by genre. A large corpus such as the BNC can provide accurate information on both a word's meaning and usage through the implementation of various query tools as explicitly described in this detailed guide to the BNCweb.

Emphasized at the outset is the fact that working with a corpus solves two problems for language researchers: how to base conclusions on actual usage rather than on mere introspection and how to consider a large amount of data without the time-consuming task of interviewing individual informants. A corpus, then, is not about what a researcher believes, but about what many people do with language. Lexical behavior is revealed in patterns that can be quickly and conveniently displayed in concordance lines through the use of sophisticated search tools such as the BNCweb, which is designed for working with words and phrases and their co-occurrence frequencies.

SUMMARY

Chapter 1 begins by describing the purpose of each of the subsequent chapters, advising readers to utilize the manual while on the BNCweb (http://www.bncweb.info/), although the exhaustive inclusion of screenshots for every sample query discussed renders such full participation unnecessary. The authors delineate both the advantages and limitations of working with a corpus and with what is essentially a search engine with various parameter settings, the BNCweb query tool.

For example, using the DISTRIBUTION feature to look at the behavior of 'shall' in the spoken portion of the corpus shows that the term co-occurs with either 'I' or 'we' in 90% of cases. Because the BNCweb also allows for separating data by such sociological features as age, gender, and class, it can be further determined that the declarative forms 'I/we shall' are more commonly used by older speakers, whereas the interrogative forms 'shall I/we' are more commonly used by younger speakers. This may provide a 'snapshot' of language change in progress, indicating that the declarative form is on its way out of the language, or it may simply attest to the fact that younger speakers ask more questions. It would be interesting to compare this British usage of 'shall' to North American usage. The corpus was not catalogued by race of the speakers, an unfortunate oversight in the data collection phase.

Some basic principles of corpus linguistics research such as representativeness and methodology are outlined in Chapter 2. A corpus is a principled collection of text, and no corpus can be truly representative of a language as a whole. The BNC, however, by incorporating a massive number of different text types from an array of social strata, strives to present a picture of late 20th century British English usage. It is described by the authors as ''a synchronic and static corpus which consists of a large number of text samples that are heavily marked-up with information about the texts, speakers, and writers, and annotated with linguistic information (e.g. parts of speech).''

Also, the authors make the important point here that corpus linguistics, although concerned only with performance data, does offer a way to expose linguistic competence. The example of complex (multi-word) prepositions such as 'in terms of' and 'in response to' is used to illustrate how frequent phrasal patterns can be indicative of mental chunking of ''indivisible units.'' Constituent boundaries are evidenced by the non-random distribution of filled pauses in the spoken portion of the corpus. The authors demonstrate that ''filled pauses occur very frequently both immediately before and after complex prepositions,'' but rarely in internal positions surrounding the noun. I found this particular example extremely relevant to my own corpus research on noun plus preposition clusters in academic writing.

Chapter 3 is largely cautionary as to how generalizable findings from any corpus, and from the BNC in particular, can be. After describing the BNC in some detail as a balanced reference corpus of 4000 files, the authors explain why they used both the highly accurate (98-99%) CLAWS POS tagset and a smaller, simplified tagset of only 11 tags, which facilitates searches such as one for ''any verb.'' All words in the corpus are also annotated for HEADWORD and LEMMA, using XML format in the underlying source files. A discussion of the significance of type/token ratios is also useful here.
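
As a rough illustration of the type/token ratio mentioned above (a sketch that is not taken from the book; the function name and the toy sentence are assumptions made for this example), the ratio is simply the number of distinct word forms divided by the number of running words:

```python
def type_token_ratio(tokens):
    """Return distinct word forms (types) divided by running words (tokens)."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty token list")
    return len(set(tokens)) / len(tokens)


# Toy sentence invented for illustration; it is not BNC data.
tokens = "the cat saw the dog".split()
print(type_token_ratio(tokens))  # 4 types over 5 tokens
```

Because the ratio tends to fall as a text gets longer, comparisons are usually only meaningful between samples of similar size.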

Chapters 4 and 5 focus on methodology. Chapter 4 is where the reader may want to begin actually sitting at a computer with access to the BNCweb, but screenshots are provided. Several alternative ways to conduct basic searches are covered along with some guidance in how to read and manipulate the display of concordance lines. The default view is of complete sentences, but the user can select the KWIC (''Key Word in Context'') view, which aligns the query item in a fixed, central position to facilitate detection of recurrent language patterns. Query results can also be displayed in random or corpus order and saved in QUERY HISTORY. The inclusion of hands-on exercises at the end of this and other chapters gives the reader a good idea of the kinds of specific questions that can be answered through corpus inquiry and enhances the suitability of this text for classroom instruction.
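
The KWIC layout described above can be sketched in a few lines. This is only an illustration of the display idea, not BNCweb's own implementation; the function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for the example.

```python
import re


def kwic(text, query, width=30):
    """Return concordance lines with the query item in a fixed central column."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every hit starts in the same column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Sample text invented for illustration; it is not BNC data.
sample = "I shall go now. Shall we dance? We shall see what they shall do."
for line in kwic(sample, "shall", width=20):
    print(line)
```

Padding the left context to a fixed width is what lines the hits up vertically, which makes recurrent patterns to the left and right of the query item easier to spot.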

Chapter 5 on ''the comparability and reliability of findings'' emphasizes why normalized frequencies and tests of statistical significance are fundamentally important when comparing corpora or subsections of a corpus: they ensure that apparent differences in frequency are not due simply to corpus size or to chance. Raw frequencies are meaningful only if you are dealing with corpora of the same size. In comparing the normalized frequencies of the discourse marker 'in fact' in the written and spoken subsections of the BNC, the authors demonstrate that it is almost twice as frequent in the spoken data, which is relatively scant compared to the written portion. The calculation of normalized frequencies is discussed in some detail because the authors contend that it is ''the number one source of error for novices in corpus linguistics.'' In the interest of reliability, there is also a 'Corpus Frequency Wizard' interface on-line for doing statistical calculations at http://sigil.collocations.de/wizard.html.
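
To make the point about raw versus normalized frequencies concrete, here is a minimal sketch. The raw counts are invented for illustration and are not the BNC figures for 'in fact'; the subcorpus sizes are only the approximate 90%/10% split of 100 million words mentioned above, and the function name is an assumption.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus size must be positive")
    return raw_count / corpus_size * 1_000_000


# Approximate subcorpus sizes implied by the 90%/10% split of 100 million words.
WRITTEN_SIZE = 90_000_000
SPOKEN_SIZE = 10_000_000

# Hypothetical raw counts, chosen only to show the effect of normalization.
written_raw = 13_500
spoken_raw = 2_900

written_pmw = per_million(written_raw, WRITTEN_SIZE)
spoken_pmw = per_million(spoken_raw, SPOKEN_SIZE)

print(f"written: {written_raw} raw, {written_pmw:.1f} per million words")
print(f"spoken:  {spoken_raw} raw, {spoken_pmw:.1f} per million words")
```

With these made-up numbers the raw count is far higher in the written data, yet the normalized rate is higher in the spoken data, which is exactly the kind of reversal that raw frequencies conceal.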

Chapter 6 outlines the use of ''Simple Query Syntax'' for more sophisticated searches of particular affixes, parts-of-speech, wildcards, and lexico-grammatical patterns using metacharacters.

Chapters 7 and 8 explain how search results can be further manipulated and analyzed for specific purposes. Chapter 7 describes the automated features of DISTRIBUTION and SORT. For example, 'because' is deemed to be ''overused'' in school essays because this is the only written genre showing frequencies comparable to those in the spoken genres of the corpus. Frequency breakdowns further allow the sorting of co-occurrence patterns by type and token.

Chapter 8, in which COLLOCATIONS are discussed in great detail, covers the automated analysis of concordance lines. A collocation is ''the habitual co-occurrence of two (or more) words,'' and ''collocational tendencies can arguably be seen as part of the meaning of a word.'' The concept of semantic prosody is discussed here using the example of the word 'cause', which is shown to have ''an overwhelming tendency to co-occur with events of a negative or unfortunate nature.'' The value of such idiomatic information to non-native speakers is appropriately mentioned here.
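
One common way to quantify ''habitual co-occurrence'' is a mutual information (MI) score, which compares how often two words are observed together with how often they would be expected together by chance. The sketch below is an assumption-laden illustration: this summary does not say which association measures BNCweb computes, and the function name, the window size, and the toy token list are all invented for the example.

```python
import math
from collections import Counter


def mi_scores(tokens, node, window=1):
    """Score each word found within `window` tokens of `node` by log2(observed/expected)."""
    n = len(tokens)
    freq = Counter(tokens)
    cooccurrence = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Count every neighbour inside the window, excluding the node itself.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                cooccurrence[tokens[j]] += 1
    scores = {}
    for word, observed in cooccurrence.items():
        # Expected co-occurrences if the two words were distributed independently.
        expected = freq[node] * freq[word] * 2 * window / n
        scores[word] = math.log2(observed / expected)
    return scores


# Toy text invented for illustration; it is not BNC data.
tokens = ("smoking can cause damage and stress can cause damage but "
          "the rain and the sun and the wind cause the delay").split()
for word, score in sorted(mi_scores(tokens, "cause").items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {score:6.2f}")
```

In this toy text 'damage' scores well above 'the', because 'the' is frequent everywhere and so is expected near 'cause' by chance alone. On real data, MI is known to favor rare words, so it is usually combined with a minimum frequency threshold.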

Chapter 9 explains how concordance lines may be manually annotated (tagged or classified) depending on the user's query results. Both advantages and disadvantages of categorizing queries are discussed. Users more familiar with Microsoft Excel will appreciate the inclusion of instructions on how to export query results to the spreadsheet program and re-import them.

Chapter 10 provides a detailed guide to ways of creating subcorpora in order to restrict searches to particular text types. All texts are classified according to domain, genre, time period, medium, and the sociological factors mentioned above.

Chapter 11 covers KEYWORD and FREQUENCY LIST features. A keyword is defined as one that occurs ''with significantly greater frequency in one part of the corpus than [in] another.'' A comparison between academic lectures and academic writing confirms the relatively high concentration of verbs in the former and nouns in the latter. Frequency lists are considered ''useful for detecting potentially salient linguistic items within the corpus.'' In written genres, 'the' is found to be the most frequent word (again attesting to the 'nouniness' of more formal registers), and pronouns such as 'I', 'you', and 'it' are the most frequent words in spoken genres. The more nominalized style of academic texts is also indicated by the relatively higher frequencies of prepositions such as 'of', 'in', 'by', and 'with' in this genre, another fact I found particularly supportive of my own corpus research.
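
''Significantly greater frequency'' has to be tested with some statistic. A log-likelihood score is one widely used option in corpus linguistics; whether it is the one applied in the book is an assumption here, and the function name and all counts below are invented for the sketch.

```python
import math


def log_likelihood(count_a, count_b, size_a, size_b):
    """Log-likelihood keyness of a word with count_a hits in a corpus of size_a
    words and count_b hits in a comparison corpus of size_b words."""
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Hypothetical counts: 50 hits in 1,000 words versus 100 hits in 10,000 words.
print(f"{log_likelihood(50, 100, 1_000, 10_000):.2f}")

# Same relative frequency in both parts, so the score should be close to zero.
print(f"{log_likelihood(10, 100, 1_000, 10_000):.2f}")
```

A score above 3.84 corresponds to p < 0.05 on a chi-square distribution with one degree of freedom, so the first hypothetical word would count as a keyword of the smaller part and the second would not.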

Chapter 12 discusses the Corpus Query Processor (CQP) for more advanced searches and experienced users. Also mentioned is the IMS Open Corpus Workbench (http://cwp.sourceforge.net/), which allows for searching any annotated corpus in the proper format.

Chapter 13 concerns the more practical aspects of running BNCweb for network administrators. Topics include administrative access, customizable configuration settings, the cache system of previous searches, and disk-space requirements.

Finally, a brief list of references is provided, noting seminal works in English corpus linguistics by Douglas Biber, Graeme Kennedy, Geoffrey Leech, Charles Meyer, Michael Scott, John Sinclair, and Michael Stubbs. There is also an 11-page glossary of computerese terms relevant to corpus inquiry. Four appendices provide all genre classifications for the texts in the corpus, part-of-speech tags (CLAWS), explication of the Simple Query Syntax, and HTML-entities for less common characters. A brief index is included as well.

EVALUATION

This is a general, introductory text suitable for an undergraduate and/or graduate class in corpus linguistics. It demonstrates how corpus work is very much a balance between what the tools can deliver and how the human user can manipulate those tools to answer very elaborate types of questions about lexico-syntactic patterns.

The greatest attribute of this text is that it is not just a corpus usage manual, but an explication of corpus linguistics theory and methodology. In clear prose and using many illustrative examples, the authors go into great detail in their discussions about conducting various search queries, customizing annotations, contrasting raw and normalized frequencies, and enhancing validity and reliability. Throughout the text, the authors point out that the reader/user should consider intuitively what they may expect to find with particular queries before doing the actual searches. This practice reinforces the value of corpus work in that our assumptions about language usage are frequently found to be in error or in need of some finer revision in light of the search results. Even though the BNCweb provides a wide range of search options, the web-based interface is attractive and quite easy to use.

Some may find it a tedious read, especially the latter chapters for advanced users and network administrators, but such is the nature of the beast. This volume keeps it interesting with numerous suggestions about the types of questions that can and cannot be answered through both simple and more complex queries, and the chapter-final exercises inspire innovative approaches to corpus linguistics. The potential for corpus linguistics discoveries about word/phrase frequencies has yet to be fully exploited, especially in the areas of lexicography, sociolinguistics, and second/foreign language teaching. A comparable, user-friendly mechanism for discovering and comparing the patterns of North American English usage would certainly be welcome on this side of the pond.

ABOUT THE REVIEWER

Elizabeth Craig is an experienced ESL/EFL teacher and teacher-trainer with
a master's degree in applied linguistics (TESOL) and a doctorate in second
language acquisition. She was the English Language Fellow to Paraguay in
2006-2007 for the U.S. State Department and is currently teaching English
and linguistics courses at The University of Georgia. Her dissertation
(http://linguistlist.org/pubs/diss/browse-diss-action.cfm?DissID=25900)
consists of an examination of N+P clusters in a corpus of native-speaker
freshman compositions in an effort to address preposition errors in second
language writing. Dr. Craig is also Supervising On-line Editor of 'English
around the World', a free, weekly newspaper insert for English language
educators in and around Asunción, Paraguay.