Figure 5.6 Pseudo-title frequency across samples (y-axis: frequency; x-axis: frequency per text, 1–25)
(i.e. the variable that changes, in this case the “type”: whether the construction
is a pseudo-title or corresponding apposition) is in the top-most row. Table 5.7
presents the results of a chi-square analysis of the data in table 5.5.
In essence, the chi-square test calculates the extent to which the distribution
in a given dataset either conﬁrms or disconﬁrms the “null hypothesis”: in this
case, whether or not there are differences in the distribution of pseudo-titles and
equivalent appositives in the four regional varieties of ICE being compared. To
perform this comparison, the chi-square test compares “observed” frequencies
in a given dataset with “expected” frequencies (i.e. the frequencies one would
expect to ﬁnd if there were no differences in the distribution of the data). The
higher the chi-square value, the more signiﬁcant the differences are.
The application of the chi-square test to the frequencies in table 5.5 yielded
a value of 65.686. To interpret this number accurately, one ﬁrst of all needs to
know the “degrees of freedom” in a given dataset (i.e. the number of data points
that may vary).

Table 5.7 Chi-square results for differences in the distribution of pseudo-titles
and corresponding appositives in the samples from ICE components

Statistical test   Value    Degrees of freedom   Significance level
Chi-square         65.686   3                    p = .000

Since table 5.5 contains four rows and two columns, the degrees
of freedom can be calculated using the formula below:
(4 −1) × (2 −1) = 3 degrees of freedom
With three degrees of freedom, the chi-square value of 65.686 is significant at
well below the .001 level.
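The observed-versus-expected computation described above can be sketched in a few lines of Python. Table 5.5 itself is not reproduced in this section, so the counts below are the per-variety totals of pseudo-titles and corresponding appositives reported later in tables 5.11 and 5.12; they reproduce the chi-square value of 65.686 given in table 5.7.

```python
# Pearson chi-square for a contingency table, computed from first principles.
# Counts are the per-variety totals of pseudo-titles and corresponding
# appositives (from tables 5.11 and 5.12 elsewhere in the chapter).

observed = {
    "pseudo-title": {"USA": 59, "Phil": 83, "NZ": 82, "GB": 23},
    "appositive":   {"USA": 51, "Phil": 38, "NZ": 31, "GB": 78},
}

types = list(observed)                      # the two rows
countries = list(observed["pseudo-title"])  # the four columns

row_totals = {t: sum(observed[t].values()) for t in types}
col_totals = {c: sum(observed[t][c] for t in types) for c in countries}
grand_total = sum(row_totals.values())

# Expected frequency for each cell: (row total * column total) / grand total
chi_square = 0.0
for t in types:
    for c in countries:
        expected = row_totals[t] * col_totals[c] / grand_total
        chi_square += (observed[t][c] - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(types) - 1) * (len(countries) - 1)

print(f"chi-square = {chi_square:.3f}, df = {df}")
# chi-square = 65.686, df = 3  (as reported in table 5.7)
```

The same result can be obtained with a statistics package (e.g. `scipy.stats.chi2_contingency`); the point of the sketch is simply that the test compares each observed cell with the frequency expected under the null hypothesis.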
While it is generally accepted that any level below .05 indicates statistical
signiﬁcance, it is quite common for more stringent signiﬁcance levels to be
employed (e.g. p ≤ .001). Because the significance level for the data in table 5.7
is considerably below either of these levels, it can be safely and reliably assumed
that there are highly signiﬁcant differences in the distributions of pseudo-titles
and appositives across the four varieties of English represented in the table.
The chi-square test applied to the data in table 5.5 simply suggests that
there are differences in the use of pseudo-titles in the four national varieties
of English being investigated. The chi-square test says nothing about differ-
ences between the individual varieties (e.g. whether ICE-USA differs from
ICE-NZ). To be more precise about how the individual varieties differ from
one another, it is necessary to compare the individual varieties themselves in
a series of 2×2 chi-square tables. However, in examining a single dataset as
exhaustively as this, it is important to adjust the level that must be reached for
statistical signiﬁcance because, as Sigley (1997: 231) observes, “If . . . many
tests are performed on the same data, there is a risk of obtaining spuriously
signiﬁcant results.” This adjustment can be made using the Bonferroni correc-
tion, which determines the appropriate signiﬁcance level by dividing the level
of signiﬁcance used in a given study by the number of different statistical tests
applied to the dataset. The Bonferroni-corrected critical value for the ICE data
being examined is given below and is based on the fact that to compare all
four ICE components individually, six different chi-square tests will have to be
performed:
.05 / 6 = .0083
Signiﬁcance level Number of tests performed Corrected value
Table 5.8 contains the results of the comparison, from most signiﬁcant differ-
ences down to least signiﬁcant differences.
The results in table 5.8 illustrate some notable differences in the use of
pseudo-titles and equivalent appositives in the various national varieties. First
of all, the use of these constructions in ICE-GB is very different from their
use in the other varieties: the levels of signiﬁcance are very high and reﬂect
the deeply ingrained stigma against the use of pseudo-titles in British press re-
portage, a stigma that does not exist in the other varieties.

Table 5.8 Comparison of the distribution of pseudo-titles and corresponding
appositives in individual ICE components

Countries     Statistical test   Degrees of freedom   Value    Significance level
NZ and GB     Chi-square         1                    50.938   p < .0001
Phil and GB   Chi-square         1                    44.511   p < .0001
US and GB     Chi-square         1                    19.832   p < .0001
US and NZ     Chi-square         1                    7.796    p = .005
US and Phil   Chi-square         1                    4.830    p = .028 (non-sig.)
Phil and NZ   Chi-square         1                    .273     p = .601 (non-sig.)

Second, even though
pseudo-titles may have originated in American press reportage, their use is
more widespread in ICE-NZ and ICE-Phil, though with the Bonferroni correction
the ICE-USA/ICE-Phil comparison (p = .028) falls short of the corrected
significance level of .0083, so no reliable difference between ICE-USA and
ICE-Phil can be claimed. Finally, there were no significant differences
between ICE-NZ and ICE-Phil. These results indicate that pseudo-title usage
is widespread, even in British-inﬂuenced varieties such as New Zealand En-
glish, and that there is a tendency for pseudo-titles to be used more widely than
equivalent appositives in those varieties other than British English into which
they have been transplanted from American English.
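The pairwise comparisons in table 5.8 can be sketched the same way. The example below applies Yates' corrected chi-square formula for a 2 × 2 table (the convention noted in footnote 6) to the NZ/GB counts and checks the result against the Bonferroni-corrected threshold; the counts are again the per-variety totals of pseudo-titles and corresponding appositives.

```python
import math

# One of the pairwise 2x2 comparisons from table 5.8 (NZ vs. GB), using
# Yates' continuity correction. Each tuple is
# (pseudo-title count, appositive count) for that variety.
table = {"NZ": (82, 31),
         "GB": (23, 78)}

(a, b), (c, d) = table["NZ"], table["GB"]
n = a + b + c + d

# Yates' corrected chi-square for a 2x2 table:
#   chi2 = n * (|ad - bc| - n/2)^2 / ((a+b)(c+d)(a+c)(b+d))
numerator = n * (abs(a * d - b * c) - n / 2) ** 2
denominator = (a + b) * (c + d) * (a + c) * (b + d)
chi_square = numerator / denominator      # ~50.938, as in table 5.8

# p-value for a chi-square value with 1 degree of freedom
p = math.erfc(math.sqrt(chi_square / 2))

# Bonferroni-corrected threshold for six pairwise tests
alpha = 0.05 / 6                          # ~.0083

print(f"chi-square = {chi_square:.3f}, significant = {p < alpha}")
```

Running the same computation on the other five country pairs reproduces the remaining rows of table 5.8.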
While the chi-square statistic is very useful for evaluating corpus data, it
does have its limitations. If the analyst is dealing with fairly small numbers
resulting in either empty cells or cells with low frequencies, then the reliability
of chi-square is reduced. Table 5.9 lists the correspondence relationships for
appositives in the four varieties of English examined.
Three of the cells in the category of “total equivalence” contain fewer than
ﬁve occurrences, making the chi-square statistic invalid for the data in table
5.9. One way around this problem is to combine variables in a principled
manner to increase the frequency for a given cell and thus make the results
Table 5.9 Correspondence relationships for appositives in the samples from
ICE components
Country Total equivalence Determiner deletion Partial equivalence Total
USA 1 (2%) 14 (28%) 36 (71%) 51 (101%)
Phil 1 (2.6%) 8 (21%) 29 (76%) 38 (100%)
NZ 0 (0%) 13 (42%) 18 (58%) 31 (100%)
GB 8 (10%) 22 (28%) 48 (62%) 78 (100%)
Total 10 (5%) 57 (29%) 131 (66%) 198 (100%)
6
In a 2 ×2 chi-square table, as Sigley (1997: 226) observes, the distribution is “binomial” rather
than “continuous.” It is therefore customary in a 2 × 2 table to use Yates’ correction
rather than the normal Pearson chi-square value.
Table 5.10 Correspondence relationships for appositives in the samples from
ICE components (with combined cells)
Total equivalence/
Country determiner deletion Partial equivalence Total
USA 15 (29%) 36 (71%) 51 (100%)
Phil 9 (24%) 29 (76%) 38 (100%)
NZ 13 (42%) 18 (58%) 31 (100%)
GB 30 (39%) 48 (62%) 78 (101%)
Total 67 (34%) 131 (66%) 198 (100%)
Statistical test   Value   Degrees of freedom   Significance level
Chi-square         3.849   3                    p = .278
of the chi-square statistic more valid. As was noted in section 5.1, one rea-
son for recording the particular correspondence relationship for an apposi-
tive was to study the stylistic relationship between pseudo-titles and various
types of equivalent appositives: to determine, for instance, whether a newspa-
per prohibiting pseudo-titles relied more heavily than those newspapers allow-
ing pseudo-titles on appositives related to pseudo-titles by either determiner
deletion (e.g. the acting director, Georgette Smith → acting director Geor-
gette Smith) or total equivalence (Georgette Smith, acting director → acting
director Georgette Smith). Because these two correspondence relationships in-
dicate similar stylistic choices, it is justiﬁable to combine the results for both
choices to increase the frequencies and make the chi-square test for the data
more valid.
Table 5.10 contains the combined results for the categories of “total equiv-
alence” and “determiner deletion.” This results in cells with high enough fre-
quencies to make the chi-square test valid. The results indicate, however, that
there was really no difference between the four varieties in terms of the cor-
respondence relationships that they exhibited: the chi-square value (3.849) is
relatively low and as a result the signiﬁcance level (.278) is above the level
necessary for statistical signiﬁcance.
It was expected that ICE-GB would contain more instances of appositives ex-
hibiting either total equivalence or determiner deletion, since in general British
newspapers do not favor pseudo-titles and would therefore favor alternative ap-
positive constructions. And indeed the newspapers in ICE-GB did contain more
instances. But the increased frequencies are merely a consequence of the fact
that, in general, the newspapers in ICE-GB contained more appositives than
the other varieties. Each variety followed a similar trend and contained fewer
appositives related by total equivalence or determiner deletion and more related
by partial equivalence. These ﬁndings call into question Bell’s (1988) notion
132 Analyzing a corpus
Table 5.11 The length of pseudo-titles in the various components of ICE
Country 1–4 words 5 or more words Total
USA 57 (97%) 2 (3%) 59 (100%)
Phil 71 (86%) 12 (15%) 83 (101%)
NZ 66 (81%) 16 (20%) 82 (101%)
GB 23 (100%) 0 (0%) 23 (100%)
Total 217 (88%) 30 (12%) 247 (100%)
Statistical test Value Degrees of freedom Signiﬁcance level
Chi-square 12.005 3 p = .007
Likelihood ratio 15.688 3 p = .001
of determiner deletion, since overall such constructions were not that com-
mon and whether a newspaper allowed or disallowed pseudo-titles had little
effect on the occurrence of appositives related by determiner deletion. Having
a greater effect on the particular correspondence relation that was found was
whether the appositive contained a genitive noun phrase or some kind of post-
modiﬁcation, structures that led to a partial correspondence with a pseudo-title
and that occurred very commonly in all varieties.
While combining values for variables can increase cell values, often such
a strategy does not succeed simply because so few constructions occur in a
particular category. In such cases, it is necessary to select a different statistical
test to evaluate the results. To record the length of a pseudo-title or appositive,
the original coding system had six values: one word in length, two words, three
words, four words, ﬁve words, and six or more words. It turned out that this
coding scheme was far too delicate and made distinctions that simply did not
exist in the data: many cells had frequencies that were too low to apply the chi-
square test. And combining categories, as is done in table 5.11, still resulted
in two cells with frequencies lower than ﬁve, making the chi-square results for
this dataset invalid.
In cases like this, it is necessary to apply a different statistical test: the log-
likelihood (or G²) test. Dunning (1993: 65–6) has argued that, in general, this
test is better than the chi-square test because it can be applied to “very much
smaller volumes of text . . . [and enable] comparisons to be made between the
signiﬁcance of the occurrences of both rare and common phenomena.” Dunning
(1993: 62–3) notes that the chi-square test was designed to work with larger
datasets that have items that are more evenly distributed, not with corpora
containing what he terms “rare events” (e.g. two instances in ICE-USA of
pseudo-titles of five or more words). Applied to the data in table 5.11, the
log-likelihood test (termed the “likelihood ratio” in SPSS parlance) conﬁrmed
that the length of pseudo-titles varied by variety.
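The G² statistic itself is straightforward to compute: for each cell, sum 2 × O × ln(O/E), where O and E are the observed and expected frequencies, with a zero cell contributing nothing. A minimal Python sketch using the counts from table 5.11 reproduces the likelihood ratio of 15.688:

```python
import math

# Log-likelihood (G2) test applied to the counts in table 5.11:
# pseudo-title length (1-4 words vs. 5 or more words) by national variety.
observed = {"USA": (57, 2), "Phil": (71, 12), "NZ": (66, 16), "GB": (23, 0)}

col_totals = [sum(row[i] for row in observed.values()) for i in (0, 1)]
grand_total = sum(col_totals)

g2 = 0.0
for short, long_ in observed.values():
    row_total = short + long_
    for obs, col_total in zip((short, long_), col_totals):
        expected = row_total * col_total / grand_total
        if obs > 0:          # a zero cell (e.g. GB) contributes nothing to G2
            g2 += 2 * obs * math.log(obs / expected)

df = (len(observed) - 1) * (2 - 1)   # (4 rows - 1) * (2 columns - 1)
print(f"G2 = {g2:.3f}, df = {df}")
# ~15.69, matching the likelihood ratio reported in table 5.11
```

Note that the zero cell and the cell with two occurrences, which invalidate the Pearson chi-square here, pose no problem for this computation.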
The results of the log-likelihood test point to a clear trend in table 5.11:
that lengthier pseudo-titles occur more frequently in ICE-Phil and NZ than in
ICE-USA and GB. In fact, ICE-GB had no pseudo-titles of five or more
words, and ICE-USA had only two instances.

Table 5.12 The length of appositives in the various components of ICE

Country   1–4 words   5 or more words   Total
USA       22 (43%)    29 (57%)          51 (100%)
Phil      14 (37%)    24 (63%)          38 (100%)
NZ        14 (45%)    17 (55%)          31 (100%)
GB        32 (41%)    46 (59%)          78 (100%)
Total     82 (41%)    116 (59%)         198 (100%)

Statistical test   Value   Degrees of freedom   Significance level
Chi-square         .574    3                    p = .902

These findings are reflected in
the examples in (14) and (15), which contain pseudo-titles of five or more
words that occurred predominantly in newspapers in ICE-Phil and ICE-NZ.
(14) a. Salamat and Presidential Adviser on Flagship Projects in Mindanao Robert
Aventajado (ICE-Philippines)
b. Time Magazine Asia bureau chief Sandra Burton (ICE-Philippines)
c. Marikina Metropolitan Trial Court judge Alex Ruiz (ICE-Philippines)
d. MILF Vice Chairman for Political Affairs Jadji Murad (ICE-Philippines)
e. Autonomous Region of Muslim Mindanao police chief Damming Unga (ICE-
Philippines)
(15) a. Oil and Gas planning and development manager Roger O’Brien (ICE-NZ)
b. New Plymouth Fire Service’s deputy chief fire officer Graeme Moody (ICE-NZ)
c. corporate planning and public affairs executive director Graeme Wilson (ICE-
NZ)
d. Federated Gisborne-Wairoa provincial president Richard Harris (ICE-NZ)
e. Wesley and former New Zealand coach Chris Grinter (ICE-NZ)
The pseudo-title is a relatively new and evolving structure in English. There-
fore, it is to be expected that its usage will show variation, in this case in the
length of pseudo-titles in the various components of ICE under investigation.
The appositive, on the other hand, is a well-established construction in English,
and if the length of appositives is considered, there were no differences between
the varieties, as is illustrated in table 5.12. Table 5.12 demonstrates that it is
more normal for appositives to be lengthier, and that while ICE-GB has more
appositives than the other varieties, the proportion of appositives of varying
lengths is similar to the other varieties.
One reason for the general difference in length of appositives and pseudo-
titles is that there is a complex interaction between the form of a given pseudo-
title or appositive and its length. In other words, three variables are interacting:
“type” (pseudo-title or appositive), “form” (simple noun phrase, genitive noun
phrase, noun phrase with post-modiﬁcation), and “length” (one to four words
or ﬁve words or more). Table 5.13 provides a cross tabulation of all of these
variables.
134 Analyzing a corpus
Table 5.13 The form and length of pseudo-titles and corresponding
appositives
Type Form 1-4 words 5 or more words Total
PT Simple NP 216 (90%) 23 (10%) 239 (100%)
Gen. NP 0 (0%) 0 (0%) 0 (0%)
Post. Mod. 1 (13%) 7 (87%) 8 (100%)
Total 217 (88%) 30 (12%) 247 (100%)
Appos Simple NP 52 (84%) 10 (16%) 62 (100%)
Gen. NP 18 (67%) 9 (33%) 27 (100%)
Post. Mod. 12 (11%) 97 (89%) 109 (100%)
Total 82 (41%) 116 (59%) 198 (100%)
A chi-square analysis of the trends in table 5.13 would be invalid not only
because some of the cells have values lower than ﬁve but because the chi-square
test cannot pinpoint speciﬁcally which variables are interacting. To determine
what the interactions are, it is more appropriate to conduct a loglinear analysis
of the results.
A loglinear analysis considers interactions between variables: whether, for
instance, there is an interaction between “type,” “form,” and “length”; between
“type” and “form”; between “form” and “length”; and so forth. In setting up
a loglinear analysis, one can either investigate a predetermined set of associa-
tions (i.e. only those associations that the analyst thinks exist in the data), or
base the analysis on a “saturated model”: a model that considers every possible
interaction the variables would allow. The drawback of a saturated model, as
Oakes (1998: 38) notes, is that because it “includes all the variables and inter-
actions required to account for the original data, there is a danger that we will
select a model that is ‘too good’. . . [and that ﬁnds] spurious relationships.” That
is, when all interactions are considered, it is likely that significant interactions
between some variables will be coincidental. Thus, it is important to find
linguistic motivations for any signiﬁcant associations that are found.
Because only three variables were being compared, it was decided to use a
saturated model to investigate associations. This model generated the following
potential associations:
(16) a. type*form*length
b. type*form
c. type*length
d. form*length
e. form
f. type
g. length
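Because the terms of a saturated model are simply all non-empty combinations of the variables, a list such as (16) can be generated mechanically. A minimal sketch (the single-variable terms come out in a trivially different order from (16e–g)):

```python
from itertools import combinations

# Generate the terms of a saturated loglinear model: every non-empty
# subset of the variables, from the full three-way interaction down to
# the single variables, as in (16a-g).
variables = ["type", "form", "length"]

terms = []
for size in range(len(variables), 0, -1):   # subset sizes 3, 2, 1
    for combo in combinations(variables, size):
        terms.append("*".join(combo))

print(terms)
# ['type*form*length', 'type*form', 'type*length', 'form*length',
#  'type', 'form', 'length']
```

For k variables this yields 2^k − 1 terms, which is why saturated models grow unwieldy, and prone to spurious associations, as variables are added.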
Likelihood ratio and chi-square tests were conducted to determine whether there
was a significant association between all three variables (16a), and between all
possible combinations of two-way interactions (16b–d). In addition, the variables
were analyzed individually to determine the extent to which they affected the
three- and two-way associations in 16a–d. The results are presented in table 5.14.

Table 5.14 Associations between various variables

K   Degrees of freedom   Likelihood ratio   Probability   Probability
3   2                    .155               .9254         .9300
2   7                    488.010            .0000         .0000
1   11                   825.593            .0000         .0000
The ﬁrst line in table 5.14 demonstrates that there were no associations
between the three variables: the likelihood ratio score had a probability where
p > .05. On the other hand, there were signiﬁcant associations between the
two-way and one-way variables.
To determine which of these associations were strongest, a procedure called
“backward elimination” was applied to the results. This procedure works in a
step-by-step manner, at each stage removing from the analysis the weakest
association and then testing the remaining associations to see which is
strongest. This procedure identified the two associations in table 5.15 as
the strongest of all the associations tested. Interpreted in conjunction with the
frequency distributions in table 5.13, the results in table 5.15 suggest that
while appositives are quite diverse in their linguistic form, pseudo-titles are
not. Even though a pseudo-title and corresponding appositive have roughly the
same meaning, a pseudo-title is mainly restricted to being a simple noun phrase
that is, in turn, relatively short in length. In contrast, the unit of an appositive
corresponding to a pseudo-title can be not just a simple noun phrase but a
genitive noun phrase or a noun phrase with post-modiﬁcation as well.
These linguistic differences are largely a consequence of the fact that the
structure of a pseudo-title is subject to the principle of “end-weight” (Quirk
et al. 1985: 1361–2). This principle stipulates that heavier constituents are best
placed at the end of a structure, rather than at the beginning of it. A pseudo-
title will always come at the start of the noun phrase in which it occurs. The
lengthier and more complex the pseudo-title, the more unbalanced the noun
phrase will become.

Table 5.15 Strongest associations between variables

              Degrees of freedom   Likelihood ratio   Probability
Type*form     2                    246.965            .0000
Length*form   2                    239.067            .0000

Therefore, pseudo-titles typically have forms (e.g. simple
noun phrases) that are short and non-complex structures, though as table
5.11 illustrated, usage does vary by national variety. In contrast, an appositive
consists of two units, one of which corresponds to a pseudo-title. Because this
unit is independent of the proper noun to which it is related – in speech it
occupies a separate tone unit, in writing it is separated by a comma from the
proper noun to which it is in apposition – it is not subject to the end-weight
principle. Consequently, the unit of an appositive corresponding to a pseudo-
title has more forms of varying lengths.
The loglinear analysis applied to the data in table 5.13 is very similar to the
logistic regression models used in the Varbrul programs: IVARB (for MS-DOS)
and GoldVarb (for the Macintosh) (cf. Sankoff 1987; Sigley 1997: 238–52).
These programs have been widely used in sociolinguistics to test the interaction
of linguistic variables. For instance, Tagliamonte and Lawrence (2000) used
GoldVarb to examine which of seven linguistic variables favored the use of
three linguistic forms to express the habitual past: a simple past-tense verb,
used to, or would. Tagliamonte and Lawrence (2000: 336) found, for instance,
that the type of subject used in a clause signiﬁcantly affected the choice of verb
form: the simple past was used if the subject was a second-person pronoun,
used to was used if the subject was a ﬁrst-person pronoun, and would was
used if the subject was a noun phrase with a noun or third-person pronoun as
head.
Although the Varbrul programs have been used primarily in sociolinguistics
to study the application of variable rules, Sigley (1997) demonstrates the value
of the programs for corpus analysis as well in his study of relative clauses in
the Wellington Corpus of New Zealand English. The advantage of the Varbrul
programs is that they were designed speciﬁcally for use in linguistic analy-
sis and are thus easier to use than generic statistical packages, such as SPSS
or SAS. But it is important to realize that these statistical packages, as the
loglinear analysis in this section demonstrated, can replicate the kinds of statis-
tical analyses done by the Varbrul programs. Moreover, as Oakes (1998: 1–51)
demonstrates, these packages can perform a range of additional statistical anal-
yses quite relevant to the concerns of corpus linguists, from those, such as the
Pearson product-moment correlation, that test correlations between variables, to
regression tests, which test the effects that independent variables have on
dependent variables.
5.6 Conclusions
To conduct a corpus analysis effectively, the analyst needs to plan out
the analysis carefully. It is important, ﬁrst of all, to begin the process with a very
clear research question in mind, so that the analysis involves more than simply
“counting” linguistic features. It is next necessary to select the appropriate
corpus for analysis: to make sure, for instance, that it contains the right kinds
of texts for the analysis and that the corpus samples to be examined are lengthy
enough. And if more than one corpus is to be compared, the corpora must
be comparable, or the analysis will not be valid. After these preparations are
made, the analyst must ﬁnd the appropriate software tools to conduct the study,
code the results, and then subject them to the appropriate statistical tests. If
all of these steps are followed, the analyst can rest assured that the results
obtained are valid and the generalizations that are made have a solid linguistic
basis.
Study questions
1. What is the danger of beginning a corpus analysis without a clearly thought-
out research question in mind?
2. How does the analyst determine whether a given corpus is appropriate for
the corpus analysis to be conducted?
3. What kinds of analyses are most efﬁciently conducted with a concordancing
program?
4. What kinds of information can be found in a tagged or parsed corpus that
cannot be found in a lexical corpus?
5. The data in the table below are adapted from a similar table in Meyer (1996:
38) and contain frequencies for the distribution of phrasal (e.g. John and
Mary) and clausal (e.g. We went to the store and our friends bought some
wine) coordination in various samples of speech and writing from the Inter-
national Corpus of English. Go to the web page given below at Georgetown
University and use the “Web Chi-square Calculator” on the page to deter-
mine whether there is a difference between speech and writing in the distribu-
tion of phrasal and clausal coordination: http://www.georgetown.edu/cball/
webtools/web_chi.html.
Syntactic structures in speech and writing
Medium Phrases Clauses Total
Speech 168 (37%) 289 (63%) 457 (100%)
Writing 530 (77%) 154 (23%) 684 (100%)
6 Future prospects in corpus linguistics
In describing the complexity of creating a corpus, Leech (1998: xvii)
remarks that “a great deal of spadework has to be done before the research
results [of a corpus analysis] can be harvested.” Creating a corpus, he comments,
“always takes twice as much time, and sometimes ten times as much effort”
because of all the work that is involved in designing a corpus, collecting texts,
and annotating them. And then, after a given period of time, Leech (1998: xviii)
continues, the corpus becomes “out of date,” requiring the corpus creator “to
discard the concept of a static corpus of a given length, and to continue to collect
and store corpus data indeﬁnitely into the future . . .” The process of analyzing a
corpus may be easier than the description Leech (1998) gives above of creating
a corpus, but still, many analyses have to be done manually, simply because we
do not have the technology that can extract complex linguistic structures from
corpora, no matter how extensively they are annotated. The challenge in corpus
linguistics, then, is to make it easier both to create and analyze a corpus. What
is the likelihood that this will happen?
Planning a corpus. As more and more corpora have been created, we have
gained considerable knowledge of how to construct a corpus that is balanced and
representative and that will yield reliable grammatical information. We know,
for instance, that what we plan to do with a corpus greatly determines how it is
constructed: vocabulary studies necessitate larger corpora, grammatical studies
(at least of relatively frequently occurring grammatical constructions) shorter
corpora. The British National Corpus is the culmination of all the knowledge
we have gained since the 1960s about what makes a good corpus.
But while it is of prime importance to descriptive corpus linguistics to create
valid and representative corpora, in the ﬁeld of natural language processing this
is an issue of less concern. Obviously, the two ﬁelds have different interests: it
does not require a balanced and representative corpus to train a parser or speech-
recognition system. But it would greatly beneﬁt the ﬁeld of corpus linguistics if
descriptive corpus linguists and more computationally oriented linguists and en-
gineers worked together to create corpora. The British National Corpus is a good
example of the kind of corpus that can be created when descriptive linguists,
computational linguists, and the publishing industry cooperate. The TalkBank
Project at Carnegie Mellon University and the University of Pennsylvania is a
multi-disciplinary effort designed to organize varying interest groups engaged
in the computational study of human and animal communication. One of the
interest groups, Linguistic Exploration, deals with the creation and annotation of
corpora for purposes of linguistic research. The Michigan Corpus of Academic
Spoken English (MICASE) is the result of a collaborative effort involving both
linguists and digital librarians at the University of Michigan. Cross-disciplinary
efforts such as these integrate the linguistic and computational expertise that
exists among the various individuals creating corpora, they help increase the
kinds and types of corpora that are created, and they make best use of the limited
resources available for the creation of corpora.
Collecting and computerizing written texts. Because so many written texts
are now available in computerized formats in easily accessible media, such as
the World Wide Web, the collection and computerization of written texts has
become much easier than in the past. It is no longer necessary for every written
text to be typed in by hand or scanned with an optical scanner and then the
scanning errors corrected. If texts are gathered from the World Wide Web, it
is still necessary to strip them of html formatting codes. But this process can
be automated with software that removes such markup from texts. Creating
a corpus of written texts is now an easy and straightforward enterprise. The
situation is more complicated for those creating corpora containing texts from
earlier periods of English: early printed editions of books are difﬁcult to scan
optically with any degree of accuracy; manuscripts that are handwritten need
to be manually retyped and the corpus creator must sometimes travel to the
library where the manuscript is housed. This situation might be eased in coming
years. There is an increased interest both among historical linguists and literary
scholars in computerizing texts from earlier periods of English. As projects
such as the Canterbury Project are completed, we may soon see an increased
number of computerized texts from earlier periods.
Collecting and computerizing spoken texts. While it is now easier to prepare
written texts for inclusion in a corpus, there is little hope for making the col-
lection and transcription of spoken texts easier. For the foreseeable future, it
will remain an arduous task to ﬁnd people who are willing to be recorded, to
make recordings, and to have the recordings transcribed. There are advantages
to digitizing spoken samples and using specialized software to transcribe them,
but still the transcriber has to listen to segments of speech over and over again
to achieve an accurate transcription. Advances in speech recognition might au-
tomate the transcription of certain kinds of speech (e.g. speeches and perhaps
broadcast news reports), but no software will be able to cope with the dysﬂu-
ency of a casual conversation. Easing the creation of spoken corpora remains
one of the great challenges in corpus linguistics, a challenge that will be with
us for some time in the future.
Copyright restrictions. Obtaining the rights to use copyrighted material has
been a perennial problem in corpus linguistics. The ﬁrst release of the British
National Corpus could not be obtained by anyone outside the European Union
because of restrictions placed by copyright holders on the distribution of certain
written texts. The BNC Sampler and second release of the entire corpus have
no distribution restrictions but only because the problematic texts are not in-
cluded in these releases. The distribution of ARCHERhas been on hold because
it has not been possible to obtain copyright permission for many of the texts
included in the corpus. As a result, access to the corpus is restricted to those
who participated in the actual creation of the corpus – a method of distribution
that does not violate copyright law. It is unlikely that this situation will ease
in the future. While texts are more widely available in electronic form, partic-
ularly on the World Wide Web, getting permission to use these texts involves
the same process as getting permission for printed texts, and current trends in
the electronic world suggest that access to texts will be more restrictive in the
future, not less. Therefore, the problem of copyright restrictions will continue
to haunt corpus linguists for the foreseeable future.
Annotating texts with structural markup. The development of SGML-based
annotation systems has been one of the great advances in corpus linguistics,
standardizing the annotation of many features of corpora so that they can be
unambiguously transferred from computer to computer. The Text Encoding
Initiative (TEI) has provided a system of corpus annotation that is both detailed
and ﬂexible, and the introduction of XML (the successor to HTML) to the ﬁeld
of corpus linguistics will eventually result in corpora that can be made available
on the World Wide Web. There exist tools to help in the insertion of SGML-
conformant markup into corpora, and it is likely that such tools will be improved
in the future. Nevertheless, much of this annotation has to be inserted manually,
requiring hours of work on the part of the corpus creator. We will have much
better annotated corpora in the future, but it will still be a major effort to insert
this annotation into texts.
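The kind of structural markup at issue can be pictured with a few lines of code that wrap speaker turns in SGML/XML-style elements. This is only an illustration of the general idea; the element names below are invented for the sketch and are not the actual TEI tag set:

```python
import xml.etree.ElementTree as ET

def mark_up(utterances):
    """Wrap (speaker, words) pairs in minimal SGML/XML-style structural markup.

    The <text> and <u> element names are illustrative only, not the TEI scheme.
    """
    root = ET.Element("text")
    for speaker, words in utterances:
        u = ET.SubElement(root, "u", attrib={"who": speaker})
        u.text = words
    return ET.tostring(root, encoding="unicode")

print(mark_up([("A", "you know I think so"), ("B", "uhm right")]))
```

Markup of this kind is what makes a corpus unambiguous when moved from computer to computer; the hours of manual work lie in deciding where such elements belong in a text, not in typing them.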
Tagging and parsing. Tagging is now a standard part of corpus creation, and
taggers are becoming increasingly accurate and easy to use. There will always be
constructions that will be difﬁcult to tag automatically and will require human
intervention to correct, but tagged corpora should become more widespread
in the future – we may even see the day when every corpus released has been
tagged.
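What a tagger produces can be pictured concretely: in many tagged corpora each word carries a part-of-speech label in a simple word_TAG format. A minimal sketch of reading such annotation (the Brown-style tags below are chosen only for illustration):

```python
def read_tagged(line):
    """Split a line of word_TAG annotation into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in line.split()]

pairs = read_tagged("The_AT cat_NN sat_VBD on_IN the_AT mat_NN")
# pairs[1] is ("cat", "NN")
```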
Parsing is improving too, but has a much lower accuracy than tagging. Therefore,
much human intervention is required to correct a parsed text, and we have
not yet reached the point where a team not responsible for designing the parser
can use it effortlessly. One reason the British component of ICE (ICE-GB) took
nearly ten years to complete was that considerable effort had to be expended
correcting the output of the parsing of the corpus, particularly the spoken part.
At some time in the future, parsing may be as routine as tagging, but because a
parser has a much more difﬁcult job than a tagger, we have some time to wait
before parsed corpora will be widely available.
Text analysis. The most common text analysis program for corpora, the con-
cordancer, has become an established ﬁxture for the analysis of corpora. There
are many such programs available for use on PCs, Macintoshes, and even the
World Wide Web. Such programs are best for retrieving sequences of strings
(such as words), but many can now search for particular tags in a corpus, and if
a corpus contains ﬁle header information, some concordancing programs can
sort ﬁles so that the analyst can specify what he or she wishes to analyze in a
given corpus: journalistic texts, for instance, but not other kinds of texts.
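The core operation of a concordancer – retrieving every occurrence of a string together with its surrounding context – can be sketched in a few lines. This toy key-word-in-context routine is not modeled on any particular program:

```python
def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of context on either side."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append(f"{left} [{w}] {right}".strip())
    return hits

for line in kwic("the cat sat on the mat near the door", "the"):
    print(line)
```

A real concordancer adds sorting of the context, searching by tag rather than word form, and filtering by file-header information, but the retrieval step itself is no more than this.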
More sophisticated text analysis programs, such as ICECUP, are rare, and
it is not likely that we will see as many programs of this nature in the future
as concordancers. And a major problem with programs such as ICECUP and
many concordancers is that they were designed to work on a speciﬁc corpus
computerized in a speciﬁc format. Consequently, ICECUP works only on the
British component of the International Corpus of English, and Sara on the
BNC (though there are plans to extend the use of Sara to other corpora as well).
The challenge is to systematize the design of corpora and concordancers so that
any concordancer can work on any corpus. Of course, it is highly likely that
the next generation of corpus linguists will have a much better background in
programming. Thus, these corpus linguists will be able to use their knowledge
of languages such as Perl or Visual Basic to write speciﬁc “scripts” to analyze
texts, and as these scripts proliferate and are passed from person to person, they
may eventually eliminate the need for purpose-built text analysis programs
altogether.
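The throwaway scripts envisioned here are often only a few lines long. A frequency count of the sort an analyst might dash off could look like this (sketched in Python rather than Perl, purely for illustration):

```python
from collections import Counter
import re

def word_frequencies(text, top=5):
    """Count word forms in a text, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

print(word_frequencies("The cat sat on the mat; the mat sagged."))
```

Once scripts like this circulate, the analyst composes them into pipelines rather than waiting for a dedicated program to support the analysis.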
Corpus linguistics has been one of the more exciting methodological devel-
opments in linguistics since the Chomskyan revolution of the 1950s. It reﬂects
changing attitudes among many linguists as to what constitutes an adequate
“empirical” study of language, and it has drawn upon recent developments in
technology to make feasible the kinds of empirical analyses of language that cor-
pus linguists wish to undertake. Of course, doing a corpus analysis will always
involve work – more work than sitting in one’s ofﬁce or study and making up the
data for a particular analysis – but doing a corpus analysis properly will always
have its rewards and will help us advance the study of human language, an area
of study that linguists of all persuasions would agree we still know relatively
little about.
Appendix 1
Corpus resources
Cross references to resources listed in this table are indicated in boldface. The
various resources are alphabetized by acronym or full name, depending upon which
usage is most common.
The publisher has used its best endeavors to ensure that the URLs for external websites
referred to in this book are correct and active at the time of going to press. However, the
publisher has no responsibility for the websites and can make no guarantee that a site
will remain live or that the content is or will remain appropriate.
Resource Description Availability
American National Corpus Currently in progress; is
intended to contain spoken
and written texts that
model as closely as
possible the texts in the
British National Corpus
Project website:
http://www.cs.vassar.edu/
∼ide/anc/
American Publishing
House for the Blind Corpus
25 million words of edited
written American English
Originally created by IBM;
described in Fillmore
(1992)
ARCHER (A
Representative Corpus of
English Historical
Registers)
1.7 million words
consisting of various
genres of British and
American English covering
the period 1650–1990
In-house corpus (due to
copyright restrictions); an
expanded version,
ARCHER II, is underway
Bank of English Corpus 415 million words of
speech and writing (as of
October 2000); texts are
continually added
Collins-Cobuild:
http://titania.cobuild.collins.
co.uk/boe info.html
Bergen Corpus of London
Teenage English (COLT)
500,000-word corpus of
the speech of London
teenagers from various
boroughs; available online
or as part of the British
National Corpus
Project website:
http://www.hd.uib.no/colt/
Birmingham Corpus 20 million words of written
English
Evolved into Bank of
English Corpus
British National Corpus
(BNC)
100 million words of
samples of varying length
containing spoken (10
million words) and written
(90 million words) British
English
BNC website:
http://info.ox.ac.uk/bnc/
index.html
British National Corpus
(BNC) Sampler
2 million words of speech
and writing representing
184 samples taken from the
British National Corpus
BNC website:
http://info.ox.ac.uk/bnc/
getting/sampler.html
Brown Corpus One million words of
edited written American
English; created in 1961;
divided into 2,000-word
samples from various
genres (e.g. press
reportage, ﬁction,
government documents)
See: ICAME CD-ROM
Cambridge International
Corpus
100 million words of
varying amounts of spoken
and written British and
American English, with
additional texts being
added continuously
CUP website:
http://uk.cambridge.org/elt/
reference/cic.htm
Cambridge Learners’
Corpus
10 million words of student
essay exams, with
additional texts being
added continuously
CUP website:
http://uk.cambridge.org/elt/
reference/clc.htm
Canterbury Project Ultimate goal of the project
is to make available in
electronic form all versions
of the Canterbury Tales
and to provide an interface
to enable, for instance,
comparisons of the various
versions
Project website:
http://www.cta.dmu.ac.uk/
projects/ctp/
Chemnitz Corpus A parallel corpus of
English and German
translations
Project website:
http://www.tu-chemnitz.de/
phil/english/real/
transcorpus/index.htm
Child Language Data
Exchange System
(CHILDES) Corpus
Large multi-lingual
database of spoken
language from children and
adults engaged in ﬁrst or
second language
acquisition
http://childes.psy.cmu.edu/
See also: MacWhinney
(2000)
Corpora Discussion List
(made available through
the Norwegian Computing
Centre for the Humanities)
Internet discussion list for
issues related to corpus
creation, analysis, tagging,
parsing, etc.
http://www.hit.uib.no/
corpora/welcome.txt
Corpus of Early
English Correspondence
Two versions of English
correspondence: the full
version (2.7 million words)
and a sampler version
(450,000 words)
Project website:
http://www.eng.helsinki.ﬁ/
doe/projects/ceec/
corpus.htm
Sampler version available
on ICAME CD-ROM
Corpus of Middle English
Prose and Verse
A large collection of
Middle English texts
available in electronic
format
Project website:
http://www.hti.umich.edu/c/
cme/about.html
Corpus of Spoken
Professional English
Approximately 2 million
words taken from spoken
transcripts of academic
meetings and White House
press conferences
Athelstan website:
http://www.athel.com/
cpsa.html
The Electronic Beowulf A digital version of the Old
English poem Beowulf that
can be searched
Project website:
http://www.uky.edu/
∼kiernan/eBeowulf/
guide.htm
English–Norwegian
Parallel Corpus
A parallel corpus of
English and Norwegian
translations: 30 samples of
ﬁction and 20 samples of
non-ﬁction in the original
and in translation
Project website:
http://www.hf.uio.no/iba/
prosjekt/
The Expert Advisory
Group on Language
Engineering Standards
(EAGLES)
Has developed “A Corpus
Encoding Standard”
containing guidelines for
the creation of corpora
Project website:
http://www.cs.vassar.edu/
CES/
FLOB (Freiburg–
Lancaster–Oslo–
Bergen) Corpus
One million words of
edited written British
English published in 1991;
divided into 2,000-word
samples in varying genres
intended to replicate the
LOB Corpus
See: ICAME CD-ROM
FROWN (Freiburg–Brown)
Corpus
One million words of
edited written American
English published in 1991;
divided into 2,000-word
samples in varying genres
intended to replicate the
Brown Corpus
See: ICAME CD-ROM
Helsinki Corpus Approximately 1.5 million
words of Old, Middle, and
Early Modern English
divided into samples of
varying length
See: ICAME CD-ROM
Helsinki Corpus of Older
Scots
Older Scots texts covering
the period 1450–1700;
described in Meurman-Solin
(1995)
See: ICAME CD-ROM
Hong Kong University of
Science and Technology
Learner Corpus
25 million words of learner
English written by
ﬁrst-year university
students whose native
language is Chinese
Contact:
John Milton, Project
Director, lcjohn@ust.hk
ICAME Bibliography Extensive bibliography of
corpus-based research
created by Bengt Altenberg
(Lund University, Sweden)
ICAME website:
–1989:
http://www.hd.uib.no/icame
/icame-bib2.txt
1990–8:
http://www.hd.uib.no/icame
/icame-bib3.htm
ICAME CD-ROM 20 different corpora (e.g.
Brown, LOB, Helsinki) in
various computerized
formats (DOS, Windows,
Macintosh and Unix)
ICAME website:
http://www.hit.uib.no/icame
/cd/
International Corpus of
English (ICE)
A variety of million-word
corpora (600,000 words of
speech, 400,000 words of
writing) representing the
various national varieties
of English (e.g. American,
British, Irish, Indian, etc.)
Three components now
complete:
Great Britain:
http://www.ucl.ac.uk/
english-usage/ice-
gb/index.htm
East Africa: http://www.tu-
chemnitz.de/phil/english/
real/eafrica/corpus.htm
New Zealand:
http://www.vuw.ac.nz/lals/
corpora.htm#The New
Zealand component of the
International
ICECUP (ICE Corpus
Utility Program)
Text retrieval software for
use with ICE-GB
Survey of English Usage
website:
http://www.ucl.ac.
uk/english-usage/ice-gb
/icecup.htm
ICE-GB (British
component of the
International Corpus of
English)
One million words of
spoken and written British
English fully tagged and
parsed
See: International Corpus
of English
International Corpus of
Learner English (ICLE)
Approximately 2 million
words of written English
composed by non-native
speakers of English from
14 different linguistic
backgrounds
Project website:
http://www.ﬂtr.ucl.ac.be/FL
TR/GERM/ETAN/CECL/
introduction.html
Lampeter Corpus Approximately 1.1 million
words of Early Modern
English tracts and
pamphlets taken from
various genres (e.g.
religion, politics) from the
period 1640–1740;
contains complete texts,
not text samples
Project website:
http://www.tu-
chemnitz.de/phil/english/
real/lampeter/
lamphome.htm
Lancaster Corpus A precursor to the
million-word
Lancaster–Oslo–Bergen
(LOB) Corpus of edited
written British English
See: Lancaster–Oslo–
Bergen (LOB) Corpus
Lancaster/IBM Spoken
English Corpus
53,000 words of spoken
British English (primarily
radio broadcasts); available
in various formats,
including a parsed version
See: ICAME CD-ROM
Lancaster Parsed Corpus A parsed corpus containing
approximately 140,000
words from various genres
in the Lancaster–Oslo–
Bergen (LOB) Corpus
See: ICAME CD-ROM
Lancaster–Oslo–Bergen
(LOB) Corpus
One million words of
edited written British
English published in 1961
and divided into 2,000-
word samples; modeled
after the Brown Corpus
See: ICAME CD-ROM
Linguistic Data
Consortium (LDC)
For an annual fee, makes
available to members a
variety of spoken and
written corpora of English
and many other languages
LDC website:
http://www.ldc.upenn.edu/
London Corpus The original corpus of
spoken and written British
English ﬁrst created in the
1960s by Randolph Quirk
at the Survey of English
Usage, University College
London; sections of the
spoken part are included in
the London–Lund Corpus
Can be used on-site at the
Survey of English Usage:
http://www.ucl.ac.uk/
english-usage/home.htm
London–Lund Corpus Approximately 500,000
words of spoken British
English from various
genres (e.g. spontaneous
dialogues, radio
broadcasts) that has been
prosodically transcribed
See: ICAME CD-ROM
Longman–Lancaster
Corpus
A corpus available in
orthographic form that
contains approximately 30
million words of written
English taken from various
varieties of English
world-wide
Longman website:
http://www.longman-
elt.com/dictionaries/corpus/
lclonlan.html
Longman Learner’s Corpus 10 million words of writing
by individuals from around
the world learning English
as a second or foreign
language
Longman website:
http://www.longman-
elt.com/dictionaries/corpus/
lclearn.html
The Longman Spoken and
Written English Corpus
(LSWE)
Approximately 40 million
words of samples of
spoken and written British
and American English
Described in Biber et al.
(1999)
Map Task Corpus Digitized transcriptions of
individuals engaged in
“task-oriented dialogues”
in which one speaker helps
another speaker replicate a
route on a map
Project website:
http://www.hcrc.ed.ac.uk/
maptask/
Michigan Corpus of
Academic Spoken English
(MICASE)
Various types of spoken
American English recorded
in academic contexts: class
lectures and discussions,
tutorials, dissertation
defenses
Searchable on the Web:
http://www.hti.umich.edu/
m/micase/
Nijmegen Corpus A 130,000-word parsed
corpus of written English
TOSCA Research website:
http://lands.let.kun.nl/
TSpublic/tosca/
research.html
The Northern Ireland
Transcribed Corpus of
Speech
400,000 words of
interviews with individuals
speaking Hiberno-English
from various regions of
Northern Ireland
See Kirk (1992)
Penn–Helsinki Parsed
Corpus of Middle English
1.3 million words of parsed
Middle English taken from
55 samples found in the
Helsinki Corpus
Project website:
http://www.ling.upenn.edu/
mideng/
Penn Treebank (Releases I
and II)
A heterogeneous collection
of speech and writing
totaling approximately 4.9
million words; sections
have been tagged and
parsed
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC95T7.html
Polytechnic of Wales
Corpus
A 65,000-word parsed
corpus of the speech of
children ages 6–12
conversing in playgroups
of three and in interviews
with adults
See: ICAME CD-ROM
Santa Barbara Corpus of
Spoken American English
Large corpus containing
samples of varying length
of different kinds of spoken
American English:
spontaneous dialogues,
monologues, speeches,
radio broadcasts, etc.
Project website:
http://www.linguistics.ucsb.
edu/research/sbcorpus/
default.htm
First release of corpus can
be purchased from the
Linguistic Data
Consortium (LDC):
http://www.ldc.upenn.edu/
Catalog/LDC2000S85.html
Susanne Corpus 130,000 words of written
English based on various
genres in the Brown
Corpus that have been
parsed and marked up
based on an “annotation
scheme” developed for the
project
Project website:
http://www.cogs.susx.ac.uk/
users/geoffs/RSue.html
Switchboard Corpus 2,400 telephone
conversations between two
speakers from various
dialect regions in the
United States; topics of
conversations were
suggested beforehand
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC97S62.html
TalkBank Project Cross-disciplinary effort to
use computational tools to
study human and animal
communication
Project website:
http://www.talkbank.org/
Tampere Corpus A corpus proposed to
consist of various kinds of
scientiﬁc writing for
specialized and
non-specialized audiences
Described in Norri and
Kytö (1996)
Text Encoding Initiative
(TEI)
Has developed standards
for the annotation of
electronic documents
Project website:
http://www.tei-c.org/
TIMIT Acoustic-Phonetic
Continuous Speech Corpus
Various speakers from
differing dialects of
American English reading
ten sentences containing
phonetically varied sounds
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC93S1.html
TIPSTER Corpus Collection of various kinds
of written English, such as
Wall Street Journal and
Associated Press news
stories; intended for
research in information
retrieval
Linguistic Data
Consortium (LDC)
website:
http://www.ldc.upenn.edu/
Catalog/LDC93T3A.html
Wellington Corpus One million words of
written New Zealand
English divided into genres
that parallel the Brown and
LOB corpora but that were
collected between 1986
and 1990
See: ICAME CD-ROM
York Corpus 1.5 million words taken
from sociolinguistic
interviews with speakers of
York English
See Tagliamonte (1998)
Appendix 2
Concordancing programs
PC/Macintosh-based programs
Conc (for the Macintosh)
John Thomson
available from Summer Institute of Linguistics
http://www.indiana.edu/∼letrs/help-services/QuickGuides/about-conc.html
Concordancer for Windows
Zdenek Martinek in collaboration with Les Siegrist
http://www.ifs.tu-darmstadt.de/sprachlit/wconcord.htm
Corpus Presenter
Raymond Hickey
http://www.uni-essen.de/∼lan300/corpus presenter.htm
Corpus Wizard
Kobe Phoenix Lab, Japan
http://www2d.biglobe.ne.jp/∼htakashi/software/CWNE.HTM
Lexa
Raymond Hickey
Available from Norwegian Computing Centre for the Humanities
http://www.hd.uib.no/lexainf.html
MonoConc Pro 2.0
Athelstan
http://www.athel.com/mono.html#monopro
ParaConc (for multilingual corpora)
Michael Barlow
http://www.athel.com/
Sara
British National Corpus
http://info.ox.ac.uk/bnc/sara/client.html
Tact
Centre for Computing in the Humanities, University of Toronto
http://www.chass.utoronto.ca:8080/cch/tact.html
WordCruncher (now called “Document Explorer”)
Hamilton-Locke, Inc. (an older version is also available on the ICAME
CD-ROM, 2nd edn.)
http://hamilton-locke.com/DocExplorer/Index.html
WordSmith
Mike Scott
Oxford University Press
http://www.oup.com/elt/global/catalogue/multimedia/wordsmithtools3/
Web-based programs
CobuildDirect
http://titania.cobuild.collins.co.uk/direct info.html
KWiCFinder
http://miniappolis.com/KWiCFinder/KWiCFinderHome.html
The Michigan Corpus of Academic Spoken English (MICASE)
http://www.hti.umich.edu/micase/
Sara
Online version of the British National Corpus
http://sara.natcorp.ox.ac.uk
TACTWeb
http://kh.hd.uib.no/tactweb/homeorg.htm
References
Aarts, Bas (1992) Small Clauses in English: The Nonverbal Types. Berlin and New York:
Mouton de Gruyter.
(2001) Corpus Linguistics, Chomsky, and Fuzzy Tree Fragments. In Mair and Hundt
(2001). 5–13.
Aarts, Bas and Charles F. Meyer (eds.) (1995) The Verb in Contemporary English.
Cambridge University Press.
Aarts, Jan and Willem Meijs (eds.) (1984) Corpus Linguistics: Recent Developments
in the Use of Computer Corpora. Amsterdam: Rodopi.
Aarts, Jan, Pieter de Haan, and Nelleke Oostdijk (eds.) (1993) English Language Cor-
pora: Design, Analysis, and Exploitation. Amsterdam: Rodopi.
Aarts, Jan, Hans van Halteren, and Nelleke Oostdijk (1996) The TOSCA Analysis
System. In Koster and Oltmans (1996). 181–91.
Aijmer, Karin and Bengt Altenberg (eds.) (1991) English Corpus Linguistics. London:
Longman.
Altenberg, Bengt and Marie Tapper (1998) The Use of Adverbial Connectors in
Advanced Swedish Learners’ Written English. In Granger (1998). 80–93.
Ammon, U., N. Dittmar, and K. J. Mattheier (eds.) (1987) Sociolinguistics: An Interna-
tional Handbook of the Science of Language and Society, vol. 2. Berlin: de Gruyter.
Aston, Guy and Lou Burnard (1998) The BNCHandbook: Exploring the British National
Corpus with SARA. Edinburgh University Press.
Atwell, E., G. Demetriou, J. Hughes, A. Schiffrin, C. Souter, and S. Wilcock (2000)
A Comparative Evaluation of Modern English Corpus Grammatical Annotation
Schemes. ICAME Journal 24. 7–23.
Barlow, Michael (1999) MonoConc 1.5 and ParaConc. International Journal of Corpus
Linguistics 4 (1). 319–27.
Bell, Alan (1988) The British Base and the American Connection in New Zealand Media
English. American Speech 63. 326–44.
Biber, Douglas (1988) Variation Across Speech and Writing. New York: Cambridge
University Press.
(1990) Methodological Issues Regarding Corpus-based Analyses of Linguistic
Variation. Literary and Linguistic Computing 5. 257–69.
(1993) Representativeness in Corpus Design. Literary and Linguistic Computing 8.
241–57.
(1995) Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge
University Press.
Biber, Douglas, Edward Finegan, and Dwight Atkinson (1994) ARCHER and its
Challenges: Compiling and Exploring a Representative Corpus of English
Historical Registers. In Fries, Tottie, and Schneider (1994). 1–13.
Biber, Douglas, Susan Conrad, and Randi Reppen (1998) Corpus Linguistics:
Investigating Language Structure and Language Use. Cambridge University
Press.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad, and Edward Finegan
(1999) The Longman Grammar of Spoken and Written English. London: Longman.
Biber, Douglas and Jená Burges (2000) Historical Change in the Language Use of
Women and Men: Gender Differences in Dramatic Dialogue. Journal of English
Linguistics 28 (1). 21–37.
Blachman, Edward, Charles F. Meyer, and Robert A. Morris (1996) The UMB Intelligent
ICE Markup Assistant. In Greenbaum (1996a). 54–64.
Brill, Eric (1992) A Simple Rule-Based Part-of-Speech Tagger. Proceedings of the 3rd
Conference on Applied Natural Language Processing. Trento: Italy.
Burnard, Lou (1995) The Text Encoding Initiative: An Overview. In Leech, Myers, and
Thomas (1995). 69–81.
(1998) The Pizza Chef: A TEI Tag Set Selector. http://www.hcu.ox.ac.uk/TEI/pizza.
html.
Burnard, Lou and C. M. Sperberg-McQueen (1995) TEI Lite: An Introduction to Text
Encoding for Interchange. http://www.tei-c.org/Lite/index.html.
Burnard, Lou and Tony McEnery (eds.) (2000) Rethinking Language Pedagogy from a
Corpus Perspective. Frankfurt: Peter Lang.
Chafe, Wallace (1994) Discourse, Consciousness, and Time. Chicago: University of
Chicago Press.
(1995) Adequacy, User-Friendliness, and Practicality in Transcribing. In Leech,
Myers, and Thomas (1995). 54–61.
Chafe, Wallace, John Du Bois, and Sandra Thompson (1991) Towards a New Corpus of
American English. In Aijmer and Altenberg (1991). 64–82.
Chomsky, Noam (1995) The Minimalist Program. Cambridge, MA: MIT Press.
Coates, Jennifer (1983) The Semantics of the Modal Auxiliaries. London: Croom
Helm.
Collins, Peter (1991a) The Modals of Obligation and Necessity in Australian English.
In Aijmer and Altenberg (1991). 145–65.
(1991b) Cleft and Pseudo-Cleft Constructions in English. Andover: Routledge.
Composition of the BNC. http://info.ox.ac.uk/bnc/what/balance.html.
Cook, Guy (1995) Theoretical Issues: Transcribing the Untranscribable. In Leech,
Myers, and Thomas (1995). 35–53.
Corpus Encoding Standard (2000) http://www.cs.vassar.edu/CES/122.
Crowdy, Steve (1993) Spoken Corpus Design. Literary and Linguistic Computing 8.
259–65.
Curme, G. (1947) English Grammar. New York: Harper and Row.
Davies, Mark (2001) Creating and Using Multi-million Word Corpora from Web-based
Newspapers. In Simpson and Swales (2001). 58–75.
de Haan, Pieter (1984) Problem-Oriented Tagging of English Corpus Data. In Aarts and
Meijs (1984). 123–39.
Du Bois, John, Stephan Schuetze-Coburn, Susanna Cumming, and Danae Paolino (1993)
Outline of Discourse Transcription. In Edwards and Lampert (1993). 45–89.
Dunning, Ted (1993) Accurate Methods for the Statistics of Surprise and Coincidence.
Computational Linguistics 19 (1). 61–74.
Eckman, Fred (ed.) (1977) Current Themes in Linguistics. Washington, DC: John Wiley.
Edwards, Jane (1993) Principles and Contrasting Systems of Discourse Transcription.
In Edwards and Lampert (1993). 3–31.
Edwards, Jane and Martin Lampert (eds.) (1993) Talking Data. Hillside, NJ: Lawrence
Erlbaum.
Ehlich, Konrad (1993) HIAT: A Transcription System for Discourse Data. In Edwards
and Lampert (1993). 123–48.
Elsness, J. (1997) The Perfect and the Preterite in Contemporary and Earlier English.
Berlin and New York: Mouton de Gruyter.
Fang, Alex (1996) AUTASYS: Automatic Tagging and Cross-Tagset Mapping. In
Greenbaum (1996a). 110–24.
Fernquest, Jon (2000) Corpus Mining: Perl Scripts and Code Snippets. http://www.
codearchive.com/home/jon/program.html.
Fillmore, Charles (1992) Corpus Linguistics or Computer-Aided Armchair Linguistics.
In Svartvik (1992). 35–60.
Finegan, Edward and Douglas Biber (1995) That and Zero Complementisers in Late
Modern English: Exploring ARCHER from 1650–1990. In Aarts and Meyer (1995).
241–57.
Francis, W. Nelson (1979) A Tagged Corpus – Problems and Prospects. In Greenbaum,
Leech, and Svartvik (1979). 192–209.
(1992) Language Corpora B.C. In Svartvik (1992). 17–32.
Francis, W. Nelson and H. Kučera (1982) Frequency Analysis of English Usage: Lexicon
and Grammar. Boston: Houghton Mifﬂin.
Fries, Udo, Gunnel Tottie, and Peter Schneider (eds.) (1994) Creating and Using English
Language Corpora. Amsterdam: Rodopi.
Garside, Roger, Geoffrey Leech, and Geoffrey Sampson (1987) The Computational
Analysis of English. London: Longman.
Garside, Roger, Geoffrey Leech, and Tamás Váradi (1992) Lancaster Parsed Corpus.
Manual to accompany the Lancaster Parsed Corpus. http://khnt.hit.uib.no/
icame/manuals/index.htm.
Garside, Roger, Geoffrey Leech, and Anthony McEnery (eds.) (1997) Corpus Annota-
tion. London: Longman.
Garside, Roger and Nicholas Smith (1997) A Hybrid Grammatical Tagger: CLAWS 4.
In Garside, Leech, and McEnery (1997). 102–121.
Gavioli, Laura (1997) Exploring Texts through the Concordancer: Guiding the Learner.
In Anne Wichmann, Steven Fligelstone, Tony McEnery, and Gerry Knowles (eds.)
(1997) Teaching and Language Corpora. London: Longman. 83–99.
Gillard, Patrick and Adam Gadsby (1998) Using a Learners’ Corpus in Compiling ELT
Dictionaries. In Granger (1998). 159–71.
Granger, Sylviane (1993) International Corpus of Learner English. In Aarts, de Haan,
and Oostdijk (1993). 57–71.
(1998) Learner English on Computer. London: Longman.
Greenbaum, Sidney (1973) Informant Elicitation of Data on Syntactic Variation. Lingua
31. 201–12.
(1975) Syntactic Frequency and Acceptability. Lingua 40. 99–113.
(1984) Corpus Analysis and Elicitation Tests. In Aarts and Meijs (1984). 195–201.
(1992) A New Corpus of English: ICE. In Svartvik (1992). 171–79.
(ed.) (1996a) Comparing English Worldwide: The International Corpus of English.
Oxford: Clarendon Press.
(1996b) The Oxford English Grammar. Oxford: Oxford University Press.
Greenbaum, Sidney, Geoffrey Leech, and Jan Svartvik (eds.) (1979) Studies in English
Linguistics. London: Longman.
Greenbaum, Sidney and Charles F. Meyer (1982) Ellipsis and Coordination: Norms and
Preferences. Language and Communication 2. 137–49.
Greenbaum, Sidney and Jan Svartvik (1990) The London–Lund Corpus of Spoken
English. In Svartvik (1990). 11–45.
Greenbaum, Sidney and Ni Yibin (1996) About the ICE Tagset. In Greenbaum (1996a).
92–109.
Greenbaum, Sidney, Gerald Nelson, and Michael Weizman (1996) Complement Clauses
in English. In Thomas and Short (1996). 76–91.
Greene, B. B. and G. M. Rubin (1971) Automatic Grammatical Tagging. Technical
Report. Department of Linguistics: Brown University.
Hadley, Gregory (1997) Sensing the Winds of Change: An Introduction to Data-Driven
Learning. http://web.bham.ac.uk/johnstf/winds.htm.
Haegeman, Liliane (1987) Register Variation in English: Some Theoretical Observations.
Journal of English Linguistics 20 (2). 230–48.
(1991) Introduction to Government and Binding Theory. Oxford: Blackwell.
Halteren, Hans van and Theo van den Heuvel (1990) Linguistic Exploitation of Syntactic
Databases. The Use of the Nijmegen Linguistic DataBase Program. Amsterdam:
Rodopi.
Haselrud, V. and Anna-Brita Stenström (1995) Colt: Mark-up and Trends. Hermes 13.
55–70.
Hasselgård, Hilde (1997) Sentence Openings in English and Norwegian. In Ljung (1997).
3–20.
Hickey, Raymond, Merja Kytö, Ian Lancashire, and Matti Rissanen (eds.) (1997) Tracing
the Trail of Time. Proceedings from the Second Diachronic Corpora Workshop.
Amsterdam: Rodopi.
Hockey, Susan (2000) Electronic Texts in the Humanities. Oxford: Oxford University
Press.
Järvinen, Timo (1994) Annotating 200 Million Words: The Bank of English Project.
Proceedings of COLING ’94, Kyoto, Japan. http://www.lingsoft.ﬁ/doc/engcg/
Bank-of-English.html.
Jespersen, Otto (1909–49) A Modern English Grammar on Historical Principles.
Copenhagen: Munksgaard.
Johansson, Stig and Knut Hoﬂand (1994) Towards an English–Norwegian Parallel
Corpus. In Fries, Tottie, and Schneider (1994). 25–37.
Johansson, Stig and Jarle Ebeling (1996) Exploring the English–Norwegian Parallel
Corpus. In Percy, Meyer, and Lancashire (1996).
Johns, Tim F. (1994) From Printout to Handout: Grammar and Vocabulary Teaching in
the Context of Data-driven Learning. In Odlin (1994). 293–313.
Kalton, Graham (1983) Introduction to Survey Sampling. Beverly Hills, CA: Sage.
Kennedy, Graeme (1996) Over Once Lightly. In Percy, Meyer, and Lancashire (1996).
253–62.
Kettemann, Bernhard (1995) On the Use of Concordancing in ELT. TELL&CALL 4.
4–15.
Kirk, John (1992) The Northern Ireland Transcribed Corpus of Speech. In Leitner
(1992). 65–73.
Koster, C. and E. Oltmans (eds.) (1996) Proceedings of the ﬁrst AGFL Workshop.
Nijmegen: CSI.
Kretzschmar, William A., Jr. (2000) Review of SPSS Student Version 9.0 for Windows.
Journal of English Linguistics 28 (3). 311–13.
Kretzschmar, William A., Jr. and E. Schneider (1996) Introduction to Quantitative Analysis
of Linguistic Survey Data. Los Angeles: Sage.
Kretzschmar, William A., Jr., Charles F. Meyer, and Dominique Ingegneri (1997) Uses
of Inferential Statistics in Corpus Linguistics. In Ljung (1997). 167–77.
Kytö, M. (1991) Variation and Diachrony, with Early American English in Focus. Studies
on ‘can’/‘may’ and ‘shall’/‘will’. University of Bamberg Studies in English
Linguistics 28. Frankfurt am Main: Peter Lang.
(1996) Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding
Conventions and Lists of Source Texts. 3rd edn. Department of English: University
of Helsinki.
Labov, W. (1972) The Transformation of Experience in Narrative Syntax. In Language
in the Inner City. Philadelphia: University of Pennsylvania Press. 354–96.
Landau, Sidney (1984) Dictionaries: The Art and Craft of Lexicography. New York:
Charles Scribner.
Leech, Geoffrey (1992) Corpora and Theories of Linguistic Performance. In Svartvik
(1992). 105–22.
(1997) Grammatical Tagging. In Garside, Leech, and McEnery (1997). 19–33.
(1998) Preface. In Granger (1998). xiv–xx.
Leech, Geoffrey, Roger Garside, and Eric Atwell (1983) The Automatic Grammatical
Tagging of the LOB Corpus. ICAME Journal 7. 13–33.
Leech, Geoffrey, Greg Myers, and Jenny Thomas (eds.) (1995) Spoken English on Com-
puter. Harlow, Essex: Longman.
Leech, Geoffrey and Elizabeth Eyes (1997) Syntactic Annotation: Treebanks. In Garside,
Leech, and McEnery (1997). 34–52.
Leitner, Gerhard (ed.) (1992) New Directions in English Language Corpora. Berlin:
Mouton de Gruyter.
León, Fernando Sánchez and Amalio F. Nieto Serrano (1997) Retargeting a Tagger. In
Garside, Leech, and McEnery (eds.). 151–65.
Ljung, Magnus (ed.) (1997) Corpus-based Studies in English. Amsterdam: Rodopi.
MacWhinney, Brian (1996) The CHILDES System. American Journal of Speech-Language Pathology 5. 5–14.
(2000) The CHILDES Project: Tools for Analyzing Talk. 3rd edn., vol. 1: Transcription Format and Programs, vol. 2: The Database. Mahwah, NJ: Erlbaum.
Mair, Christian (1990) Infinitival Complement Clauses in English. Cambridge University Press.
(1995) Changing Patterns of Complementation, and Concomitant Grammaticalisation, of the Verb Help in Present-Day British English. In Aarts and Meyer (1995). 258–72.
Mair, Christian and Marianne Hundt (eds.) (2001) Corpus Linguistics and Linguistic
Theory. Amsterdam: Rodopi.
Maniez, François (2000) Corpus of English Proverbs and Set Phrases. Message posted on the Corpora List, 24 January. http://www.hit.uib.no/corpora/2000–1/0057.html.
Marcus, M., B. Santorini, and M. Marcinkiewicz (1993) Building a Large Annotated
Corpus of English: The Penn Treebank. Computational Linguistics 19. 314–30.
Markus, Manfred (1997) Normalization of Middle English Prose in Practice. In Ljung
(1997). 211–26.
Melčuk, Igor A. (1987) Dependency Syntax: Theory and Practice. Albany: State University of New York Press.
Melamed, Dan (1996) 170 General Text Processing Tools (Mostly in PERL5).
http://www.cis.upenn.edu/∼melamed/genproc.html.
Meurman-Solin, Anneli (1995) A New Tool: The Helsinki Corpus of Older Scots (1450–1700). ICAME Journal 19. 49–62.
Meyer, Charles F. (1992) Apposition in Contemporary English. Cambridge University
Press.
(1995) Coordination Ellipsis in Spoken and Written American English. Language
Sciences 17. 241–69.
(1996) Coordinate Structures in the British and American Components of the International Corpus of English. World Englishes 15. 29–41.
(1997) Minimal Markup for ICE Texts. ICE NEWSLETTER 25. http://www.cs.umb.
edu/∼meyer/icenews2.html.
(1998) Studying Usage on the World Wide Web. http://www.cs.umb.edu/
∼meyer/usage.html.
Meyer, Charles F. and Richard Tenney (1993) Tagger: An Interactive Tagging Program.
In Souter and Atwell (1993). 302–12.
Meyer, Charles F., Edward Blachman, and Robert A. Morris (1994) Can You See Whose
Speech Is Overlapping? Visible Language 28 (2). 110–33.
Milton, John and Robert Freeman (1996) Lexical Variation in the Writing of Chinese
Learners of English. In Percy, Meyer, and Lancashire (1996). 121–31.
Mindt, Dieter (1995) An Empirical Grammar of the English Verb. Berlin: Cornelsen.
Mönnink, Inga de (1997) Using Corpus and Experimental Data: A Multimethod Approach. In Ljung (1997). 227–44.
Murray, Thomas E. and Carmen Ross-Murray (1992) On the Legality and Ethics of
Surreptitious Recording. Publication of the American Dialect Society 76. 15–75.
(1996) Under Cover of Law: More on the Legality of Surreptitious Recording.
Publication of the American Dialect Society 79. 1–82.
Nelson, Gerald (1996) Markup Systems. In Greenbaum (1996a). 36–53.
Nevalainen, Terttu (2000) Gender Differences in the Evolution of Standard English:
Evidence from the Corpus of Early English Correspondence. Journal of English
Linguistics 28 (1). 38–59.
Nevalainen, Terttu, and Helena Raumolin-Brunberg (eds.) (1996) Sociolinguistics and
Language History: Studies Based on the Corpus of Early English Correspondence.
Amsterdam: Rodopi.
Newmeyer, Frederick (1998) Language Form and Language Function. Cambridge,
MA: MIT Press.
Nguyen, Long, Spyros Matsoukas, Jason Devenport, Daben Liu, Jay Billa, Francis
Kubala, and John Makhoul (1999) Further Advances in Transcription of Broadcast
News. Proceedings of the 6th European Conference on Speech Communication and
Technology, vol. 2. Edited by G. Olaszy, G. Nemeth, K. Erdohegy (aka EuroSpeech
’99). European Speech Communication Association (ESCA). 667–70.
Norri, Juhani and Merja Kytö (1996) A Corpus of English for Specific Purposes: Work in Progress at the University of Tampere. In Percy, Meyer, and Lancashire (1996). 159–69.
Oakes, Michael P. (1998) Statistics for Corpus Linguistics. Edinburgh University Press.
Odlin, Terrence (ed.) (1994) Perspectives on Pedagogical Grammar. New York:
Cambridge University Press.
Ooi, Vincent (1998) Computer Corpus Lexicography. Edinburgh University Press.
Oostdijk, Nelleke (1991) Corpus Linguistics and the Automatic Analysis of English.
Amsterdam: Rodopi.
Pahta, Päivi and Saara Nevanlinna (1997) Re-phrasing in Early English. Expository Apposition with an Explicit Marker from 1350 to 1710. In Rissanen, Kytö, and Heikkonen (1997). 121–83.
Percy, Carol, Charles F. Meyer, and Ian Lancashire (eds.) (1996) Synchronic Corpus
Linguistics. Amsterdam: Rodopi.
Porter, Nick and Akiva Quinn (1996) Developing the ICE Corpus Utility Program. In
Greenbaum (1996a). 79–91.
Powell, Christina and Rita Simpson (2001) Collaboration between Corpus Linguists and
Digital Librarians for the MICASE Web Search Interface. In Simpson and Swales
(2001). 32–47.
Prescott, Andrew (1997) The Electronic Beowulf and Digital Restoration. Literary and
Linguistic Computing 12. 185–95.
Quinn, Akiva and Nick Porter (1996) Annotation Tools. In Greenbaum (1996a). 65–78.
Quirk, Randolph (1992) On Corpus Principles and Design. In Svartvik (1992). 457–69.
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik (1972) A Grammar of Contemporary English. London: Longman.
(1985) A Comprehensive Grammar of the English Language. London: Longman.
Renouf, Antoinette (1987) Corpus Development. In Sinclair (1987). 1–40.
Rissanen, Matti (1992) The Diachronic Corpus as a Window to the History of English.
In Svartvik (1992). 185–205.
(2000) The World of English Historical Corpora: from Cædmon to the Computer Age. Journal of English Linguistics 28 (1). 7–20.
Rissanen, Matti, Merja Kytö, and Kirsi Heikkonen (eds.) (1997) English in Transition: Corpus-based Studies in English Linguistics and Genre Styles. Topics in English Linguistics 23. Berlin and New York: Mouton de Gruyter.
Robinson, Peter (1998) New Methods of Editing, Exploring, and Reading The Canterbury Tales. http://www.cta.dmu.ac.uk/projects/ctp/desc2.html.
Rocha, Marco (1997) A Probabilistic Approach to Anaphora Resolution in Dialogues
in English. In Ljung (1997). 261–79.
Rydén, Mats (1975) Noun-Name Collocations in British Newspaper Language. Studia Neophilologica 67. 14–39.
Sampson, Geoffrey (1998) Corpus Linguistics User Needs. Message posted to the
Corpora List, 29 July. http://www.hd.uib.no/corpora/1998–3/0030.html.
Samuelsson, Christer and Atro Voutilainen (1997) Comparing a Linguistic and a
Stochastic Tagger. Proceedings of the 35th Annual Meeting of the Association for
Computational Linguistics and the 8th Conference of the European Chapter of the
Association for Computational Linguistics. Madrid: Association for Computational
Linguistics. 246–53.
Sánchez, Aquilino and Pascual Cantos (1997) Predictability of Word Forms (Types) and Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish. International Journal of Corpus Linguistics 2. 259–80.
Sanders, Gerald (1977) A Functional Typology of Elliptical Coordinations. In Eckman
(1977). 241–70.
Sankoff, David (1987). Variable rules. In Ammon, Dittmar, and Mattheier (1987).
984–97.
Schmied, Josef (1996) Second-Language Corpora. In Greenbaum (1996a). 182–96.
Schmied, Josef and Hildegard Schäffler (1996) Approaching Translationese Through Parallel and Translation Corpora. In Percy, Meyer, and Lancashire (1996). 41–55.
Schmied, Josef and Claudia Claridge (1997) Classifying Text- or Genre-Variation in the Lampeter Corpus of Early Modern English Texts. In Hickey, Kytö, Lancashire, and Rissanen (1997). 119–35.
Sigley, Robert J. (1997) Choosing Your Relatives: Relative Clauses in New Zealand
English. Unpublished PhD thesis. Wellington: Department of Linguistics, Victoria
University of Wellington.
Simpson, Rita, Bret Lucka, and Janine Ovens (2000) Methodological Challenges of
Planning a Spoken Corpus with Pedagogical Outcomes. In Burnard and McEnery
(2000). 43–9.
Simpson, Rita and John Swales (eds.) (2001) Corpus Linguistics in North America. Ann
Arbor: University of Michigan Press.
Sinclair, John (ed.) (1987) Looking Up: An Account of the COBUILD Project. London:
Collins.
(1991) Corpus, Concordance, Collocation. Oxford University Press.
(1992) Introduction. BBC English Dictionary. London: HarperCollins. x–xiii.
Smith, Nicholas (1997) Improving a Tagger. In Garside, Leech, and McEnery (1997). 137–50.
Souter, Clive and Eric Atwell (eds.) (1993) Corpus-Based Computational Linguistics.
Amsterdam: Rodopi.
Sperberg-McQueen, C. M. and Lou Burnard (eds.) (1994a) Guidelines for Electronic
Text Encoding and Interchange ( TEI P3). http://etext.virginia.edu/TEI.html.
(1994b) A Gentle Introduction to SGML. In Guidelines for Electronic Text En-
coding and Interchange ( TEI P3). http://etext.lib.virginia.edu/bin/tei-tocs?div=
DIV1&id=SG.
Stenström, Anna-Brita and Gisle Andersen (1996) More Trends in Teenage Talk: A Corpus-Based Investigation of the Discourse Items cos and innit. In Percy, Meyer, and Lancashire (1996). 189–203.
Svartvik, J. (ed.) (1990) The London–Lund Corpus of Spoken English. Lund University
Press.
(1992) Directions in Corpus Linguistics. Berlin: Mouton de Gruyter.
Svartvik, Jan and Randolph Quirk (eds.) (1980) A Corpus of English Conversation. Lund University Press.
Tagliamonte, Sali (1998) Was/Were Variation across the Generations: View from the
City of York. Language Variation and Change 10 (2). 153–91.
Tagliamonte, Sali and Helen Lawrence (2000) “I Used to Dance, but I Don’t Dance
Now”: The Habitual Past in English. Journal of English Linguistics 28 (4). 324–
53.
Tannen, D. (1989) Talking Voices: Repetition, Dialogue, and Imagery in Conversational
Discourse. Cambridge University Press.
Tapanainen, Pasi and Timo Järvinen (1997) A Non-projective Dependency Parser. http://www.conexor.fi/anlp97/anlp97.html [also published in Procs. ANLP-97. ACL. Washington, DC].
The Independent Style Manual, 2nd edn. (1988) London: The Independent.
The Spoken Component of the BNC. http://info.ox.ac.uk/bnc/what/spok design.html.
The Written Component of the BNC. http://info.ox.ac.uk/bnc/what/writ design.html.
Thomas, Jenny and Mick Short (eds.) (1996) Using Corpora for Language Research.
London: Longman.
Thompson, Henry S., Anne H. Anderson, and Miles Bader (1995) Publishing a Spoken
and Written Corpus on CD-ROM: The HCRC Map Task Experience. In Leech,
Myers, and Thomas (1995). 168–80.
Tottie, G. (1991) Negation in English Speech and Writing. A Study in Variation. San
Diego: Academic Press.
van Halteren, Hans and Theo van den Heuvel (1990) Linguistic Exploitation of Syntactic
Databases. Amsterdam: Rodopi.
Voutilainen, Atro (1999) A Short History of Tagging. In Hans van Halteren (ed.), Syntactic Wordclass Tagging. Dordrecht: Kluwer.
Voutilainen, Atro and Mikko Silvonen (1996) A Short Introduction to ENGCG. http://www.lingsoft.fi/doc/engcg/intro.
Wheatley, B., G. Doddington, C. Hemphill, J. Godfrey, E.C. Holliman, J. McDaniel, and
D. Fisher (1992) Robust Automatic Time Alignment of Orthographic Transcriptions
with Unconstrained Speech. Proceedings of ICASSP-92, vol. 1. 533–6.
Willis, Tim (1996) Analysing the Lancaster/IBM Spoken English Corpus (SEC) Using
the TOSCA Analysis System (for ICE): Some Impressions from a User. In Percy,
Meyer, and Lancashire (1996). 237–51.
Wilson, A. and Tony McEnery (1994) Teaching and Language Corpora. Technical
Report. Department of Modern English Language and Linguistics, University of
Lancaster.
Wilson, Andrew and Jenny Thomas (1997) Semantic Annotation. In Garside, Leech, and McEnery (1997). 53–65.
Woods, A., P. Fletcher, and A. Hughes (1986) Statistics in Language Studies. Cambridge University Press.
Index
Aarts, Bas, 4, 102
adequacy, 2–3, 10–11
age, 49–50
Altenberg, Bengt, 26–7
AMALGAM Tagging Project, 86–7, 89
American National Corpus, 24, 84, 142
American Publishing House for the Blind
Corpus, 17, 142
analyzing a corpus, 100
determining suitability, 103–7, 107t
exploring a corpus, 123–4
extracting information: defining parameters,
107–9; coding and recording, 109–14,
112t; locating relevant constructions,
114–19, 116f, 118f
framing research question, 101–3
future prospects, 140–1
see also pseudo-titles (corpus analysis
case study); statistical analysis
anaphors, 97
annotation, 98–9
future prospects, 140
grammatical markup, 81
parsing, 91–6, 98, 140
part-of-speech markup, 81
structural markup, 68–9, 81–6
tagging, 86–91, 97–8, 111, 117–18, 140
types, 81
appositions, 42, 98
see also pseudo-titles (corpus analysis
case study)
ARCHER (A Representative Corpus of
English Historical Registers), 21,
22, 79 n6, 140, 142
Aston, Guy, 19
AUTASYS Tagger, 87
Bank of English Corpus, 15, 96, 142
BBC English Dictionary, 15, 16–17
Bell, Alan, 100, 101–3, 104, 108, 110, 131
Bergen Corpus of London Teenage English
see COLT Corpus
Biber, Douglas, 10, 19–20, 22, 32, 33,
36, 39–40, 41, 42, 52, 78, 121,
122, 126
Biber, Douglas, et al. (1999), 14
Birmingham Corpus, 15, 142
Blachman, Edward, 76–7
BNC see British National Corpus
Brill, Eric, 86
Brill Tagger, 86–8
British National Corpus (BNC), 143
annotation, 84
composition, 18, 31t, 34, 36, 38, 40–1, 49
copyright, 139–40
planning, 30–2, 33, 43, 51, 138
record keeping, 66
research using, 15, 36
speech samples, 59
tagging, 87
time-frame, 45
British National Corpus (BNC) Sampler,
139–40, 143
Brown Corpus, xii, 1, 143
genre variation, 18
length, 32
research using, 6, 9–10, 12, 42, 98, 103
sampling methodology, 44
tagging, 87, 90
time-frame, 45
see also FROWN (Freiburg–Brown)
Corpus
Burges, Jen, 52
Burnard, Lou, 19, 82, 84, 85–6
Cambridge International Corpus, 15, 143
Cambridge Learners’ Corpus, 15, 143
Canterbury Project, 79, 143
Cantos, Pascual, 33 n2
Chafe, Wallace, 3, 32, 52, 72, 85
CHAT system (Codes for the Human Analysis
of Transcripts), 26, 113–14
Chemnitz Corpus, 23, 143
CHILDES (Child Language Data Exchange
System) Corpus, xiii, 26, 113, 144
Chomsky, Noam, 2, 3
CIA (contrastive interlanguage analysis), 26
CLAN software programs, 26
CLAWS tagger, 25, 87, 89–90
Coates, Jennifer, 12, 13
collecting data
general considerations, 55–6
record keeping, 64–6
speech samples, 56; broadcasts, 61;
future prospects, 139; microphones, 60;
“natural” speech, 56–8, 59; permission,
57; problems, 60–1; recording, 58–9;
sample length, 57–8; tape recorders, 59–60
writing samples: copyright, 38, 61–2, 79 n6,
139–40; electronic texts, 63–4; future
prospects, 139; sources, 62–4
see also sampling methodology
Collins, Peter, xii–xiii
Collins COBUILD English Dictionary, 15
Collins COBUILD Project, 14, 15
COLT Corpus (Bergen Corpus of London
Teenage English), xiii–xiv, 18, 49, 142
competence vs. performance, 4
computerizing data
directory structure, 67, 68f
file format, 66–7
markup, 67, 68–9 see also annotation
speech, see speech, computerizing
written texts, 78–80, 139
concordancing programs
KWIC format, 115–16, 116f
for language learning, 27–8
“lemma” searches, 116
programs, 115, 117, 150–1
with tagged or parsed corpus, 117–18
uses, 16, 86, 114
“wild card” searches, 116–17
Conrad, Susan, 126
contrastive analysis, 22–4
contrastive interlanguage analysis (CIA), 26
Cook, Guy, 72, 86
copyright, 38, 44, 57, 61–2, 79 n6, 139–40
Corpora Discussion List, 144
corpus (corpora)
balanced, xii
construction see planning corpus
construction
definitions, xi–xii
diachronic, 46
historical, 20–2, 37–8, 46, 51, 78–9
learner, 26–7
monitor, 15
multi-purpose, 36
parallel, 22–4
parsed, 96
resources, 142–9
special-purpose, 36
synchronic, 45–6
corpus linguistics, xi, xiii–xiv, 1–2, 3–4
Corpus of Early English Correspondence, 22,
37, 144
Corpus of Middle English Prose and Verse, 144
Corpus of Spoken Professional English, 71,
144
corpus-based research, 11
contrastive analysis, 22–4
grammatical studies, 11–13
historical linguistics, 20–2
language acquisition, 26–7
language pedagogy, 27–8
language variation, 17–20
lexicography, 14–17
limitations, 124
natural language processing (NLP), xiii,
24–6
reference grammars, 13–14
translation theory, 22–4
Crowdy, Steve, 43, 59
Curme, G., 13
data-driven learning, 27–8
de Haan, Pieter, 97–8
descriptive adequacy, 2, 3
diachronic corpora, 46
dialect variation, 51–2
dictionaries, 14–17
Du Bois, John, 32, 52, 85
Dunning, Ted, 132
EAGLES Project see Expert Advisory Group
on Language Engineering Standards, The
Ebeling, Jarle, 23
education, 50
Ehlich, Konrad, 77
Electronic Beowulf, The, 21, 144
electronic texts, 63–4
elliptical coordination
frequency, 7, 12
functional analysis, 6–11
genres, 6, 9–10
position, 6–7
repetition in speech, 9
serial position effect, 7–8, 8t
speech vs. writing, 8–9
suspense effect, 7–8, 8t
empty categories, 4–5
ENGCG Parser, 96
EngCG-2 tagger, 88
EngFDG parser, 91, 93–4, 93–4 n8, 96
English–Norwegian Parallel Corpus, 23,
62, 144
ethnographic information, 65–6
see also sociolinguistic variables
Expert Advisory Group on Language
Engineering Standards, The (EAGLES),
xi, 84, 144
explanatory adequacy, 2, 3, 10–11
Extensible Markup Language see XML
Eyes, Elizabeth, 91
Fernquest, Jon, 114
Fillmore, Charles, 4, 17
Finegan, Edward, 22
Fletcher, P., 121–2
FLOB (Freiburg–Lancaster–Oslo–Bergen)
Corpus, 21, 45, 145
“frame” semantics, 17
Francis, W. Nelson, 1, 88
FROWN (Freiburg–Brown) Corpus, 21, 145
FTF see fuzzy tree fragments
functional descriptions of language
elliptical coordination, 6–11, 8t, 12
repetition in speech, 9
voice, 5–6
fuzzy tree fragments (FTF), 119, 119f
Gadsby, Adam, 27
Garside, Roger, 88–9
Gavioli, Laura, 28
gender, 18, 22, 48–9
generative grammar, 1, 3–5
genre variation, 18, 19–20, 31t, 34–8, 35t, 40–2
Gillard, Patrick, 27
government and binding theory, 4–5
grammar
generative, 1, 3–5
universal, 2–3
“Grammar Safari”, 28
grammars, reference, 13–14
grammatical markup see parsers
grammatical studies, 11–13
Granger, Sylvianne, 26
Greenbaum, Sidney, 7, 14, 22, 35t, 64, 75, 95
Greene, B. B., 87, 88
Haegeman, Lilliane, 2–3, 4–5, 6
Hasselgård, Hilde, 23
Helsinki Corpus, 145
composition, 20–1, 38
planning, 46
research using, 22, 37, 51
symbols system, 67
Helsinki Corpus of Older Scots, 145
historical corpora, 20–2, 37–8, 46, 51, 78–9
see also ARCHER; Helsinki Corpus
Hofland, Knut, 23
Hong Kong University of Science and
Technology (HKUST) Learner Corpus,
26, 145
Hughes, A., 121–2
ICAME Bibliography, 145
ICAME CD-ROM, 67, 145
ICE (International Corpus of English), 146
annotation, 82–3, 84, 85, 87, 90
composition, 34, 35t, 36, 38, 39, 40–2, 104
computerizing data, 72, 73
copyright, 38, 44
criteria, 50
record keeping, 66
regional components, 104, 105–6, 106t,
110, 123, 124
research using, 6, 9 see also pseudo-titles
(corpus analysis case study)
sampling, 44, 56
time-frame, 45
see also ICECUP; ICE-GB; ICE-USA
ICE Markup Assistant, 85, 86
ICE Tree, 95
ICECUP (ICE Corpus Utility Program), 19,
116, 119, 146
ICE-East Africa, 106, 106t, 107t, 110,
123t, 124
ICE-GB, 146
annotation, 25, 83–4, 86, 92–3, 92f, 96,
117–18, 118f, 140
composition, 106t
computerizing data, 73
criteria, 50
record keeping, 64–5
research using, 14, 19, 115–16, 116f
see also pseudo-titles (corpus analysis
case study)
ICE-Jamaica, 106t, 107t, 110, 123t
ICE-New Zealand, 106, 106t, 107t, 123t,
125, 130–3
ICE-Philippines, 106, 106t, 107t, 110, 123t,
125, 130–3
ICE-Singapore, 106t, 110, 123t
ICE-USA
composition, 53, 106t
computerizing data, 70, 71, 73–4, 79
copyright, 62
criteria, 46–7
directory structure, 67–8, 68f
length, 32–3
record keeping, 64, 65
research using see pseudo-titles (corpus
analysis case study)
sampling, 58, 60–1
ICLE see International Corpus of Learner
English
Ingegneri, Dominique, 42–3
International Corpus of English see ICE
International Corpus of Learner English
(ICLE), 26, 27, 146
Jespersen, Otto, xii, 13
Johansson, Stig, 23
Kalton, Graham, 43
Kennedy, Graeme, 89
Kettemann, Bernhard, 27–8
Kirk, John, 52
Kolhapur Corpus of Indian English, 104
Kretzschmar, William A., Jr., 42–3
Kucera, Henry, 1
KWIC (key word in context), 115–16, 116f
Kytö, M., 37
Kytö, Merja, 42
Labov, W., 9
Lampeter Corpus, 38, 146
Lancaster Corpus, 12, 147
see also LOB (Lancaster–Oslo–Bergen)
Corpus
Lancaster–Oslo–Bergen Corpus see LOB
(Lancaster–Oslo–Bergen) Corpus
Lancaster Parsed Corpus, 91–2, 96, 147
Lancaster/IBM Spoken English Corpus, 96,
147
Landau, Sidney, 16
language acquisition, 26–7, 47
language pedagogy, 27–8
language variation, 3, 17–20
dialect variation, 51–2
genre variation, 18, 19–20, 31t, 34–8,
35t, 40–2
sociolinguistic variables, 18–19, 22, 48–53
style-shifting, 19
Lawrence, Helen, 52, 136
LDB see Linguistic Database Program
LDC see Linguistic Data Consortium
learner corpora, 26–7
Leech, Geoffrey, xi, 4, 87, 91, 138
lemmas, 16, 116, 116 n5
length
of corpus, 32–4, 126
of text samples, 38–40
lexicography, 14–17
Linguistic Data Consortium (LDC), 24,
98, 147
Linguistic Database (LDB) Program, 93, 115
linguistic theory
adequacy, 2–3, 10–11
corpus linguistics, xi, xiii–xiv, 1–2, 3–4
generative grammar, 1, 3–5
government and binding theory, 4–5
minimalist theory, 3
see also functional descriptions of language
LOB (Lancaster–Oslo–Bergen) Corpus, 12,
14–15, 19, 39, 45, 87, 147
see also FLOB
(Freiburg–Lancaster–Oslo–Bergen)
Corpus
London Corpus, 12, 13, 50, 103, 147
London–Lund Corpus, 147
annotation, 82
composition, 53
names in, 75
research using, 12, 19, 39, 42, 98
Longman Dictionary of American English, 15
Longman Dictionary of Contemporary
English, 15
Longman Essential Activator, 27
Longman–Lancaster Corpus, 12, 148
Longman Learner’s Corpus, 26, 27, 148
Longman Spoken and Written English Corpus,
The (LSWE), 14, 90, 148
LSWE see Longman Spoken and Written
English Corpus, The
Mair, Christian, 45
Map Task Corpus, 59, 148
markup, 67, 68–9
see also annotation
Markus, Manfred, 78, 79
Melčuk, Igor A., 93 n8
Meyer, Charles F., 6, 7, 28, 42–3, 76–7,
98, 101, 103
Michigan Corpus of Academic Spoken
English (MICASE), 148, 151
composition, 36, 53
computerization, 69, 72–3
planning, 139
record keeping, 65
Mindt, Dieter, 12–13
minimalist theory, 3
modal verbs, xii–xiii, 12–13
monitor corpora, 15
Morris, Robert A., 76–7
multi-purpose corpora, 36
Murray, James A. H., 16
native vs. non-native speakers, 46–8
natural language processing (NLP), xiii, 24–6
Nelson, Gerald, 22, 83
Nevalainen, Terttu, 22, 51
Nguyen, Long et al., 70
Nijmegen Corpus, 87, 92, 93, 96, 148
NLP see natural language processing
Norri, Juhani, 42
Northern Ireland Transcribed Corpus
of Speech, The, 52, 148
noun phrases, 5–6, 13–14
null-subject parameter, 2–3
Oakes, Michael P., 134, 136
observational adequacy, 2
Ooi, Vincent, 15
Oostdijk, Nelleke, 93
Oxford English Dictionary (OED), 16
ParaConc, 24, 150
parallel corpora, 22–4
parsed corpora, 96
parsers
probabilistic, 91
rule-based, 92–4, 93–4 n8, 95–6
parsing a corpus
accuracy, 91, 95
complexity, 93–5
disambiguation, 95
future prospects, 140
manual pre-processing, 95–6
normalization, 96
parsers, 91–4, 95
post-editing, 95
problem-oriented tagging, 97–8
speech, 94–5, 96
treebanks, 91–2
part-of-speech markup see taggers
PC Tagger, 98, 113, 113f
Penn–Helsinki Parsed Corpus of Middle
English, 96, 148
Penn Treebank, xii, xiii, 25, 37, 91, 96, 149
Perl programming language, 114
planning corpus construction, 30, 44–5, 53
British National Corpus, 30–2, 33, 43, 51,
138
future prospects, 138–9
genres, 31t, 34–8, 35t, 40–2
length of text samples, 38–40
native vs. non-native speakers, 46–8
number of texts, 40–3
overall length, 32–4
range of speakers and writers, 40–2, 43–4
sociolinguistic variables, 18–19, 48–53
time-frame, 45–6
Polytechnic of Wales Corpus, 96, 149
pro-drop, 2–3
programming languages, 114
pseudo-titles (corpus analysis case study),
101–2
determining suitability, 103–7, 107t
extracting information: defining parameters,
107–9; coding and recording, 109–14,
112t, 113f; locating relevant
constructions, 115, 117–19, 118f, 119f
framing research question, 101–3
statistical analysis: exploring a corpus,
123–4; using quantitative information,
125–36, 125t, 127t, 128f, 128t, 130t,
131t, 132t, 133t, 134t, 135t
Quirk, Randolph et al., 13–14, 101, 108, 135
record keeping, 64–6
reference grammars, 13–14
Reppen, Randi, 126
research see analyzing a corpus; corpus-based
research
Rissanen, Matti, 20, 21, 37–8, 46, 51
Rocha, Marco, 97
Rubin, G. M., 87, 88
Rydén, Mats, 101
sampling methodology
non-probability sampling, 44, 45
probability sampling, 43–4
sampling frames, 42–3
see also speech samples
Sampson, Geoffrey, 114
Sánchez, Aquilino, 33 n2
Sanders, Gerald, 6–7
Santa Barbara Corpus of Spoken American
English, 32, 44, 69, 71, 85, 149
Sara concordancing program, 18–19,
150, 151
scanners, 79
Schäffler, Hildegard, 23–4
Schmied, Josef, 23–4, 47
SEU see Survey of English Usage
SGML (Standard Generalized Markup
Language), 82–5, 86
Sigley, Robert J., 105 n3, 129, 136
Sinclair, John, 14, 15
small clauses, 4
Smith, Nicholas, 88–9, 90
social contexts and relationships, 52–3
sociolinguistic variables, 18–19
age, 49–50
dialect, 51–2
education, 50
gender, 18, 22, 48–9
social contexts and relationships, 52–3
see also ethnographic information
software programs, 18–19, 24, 26, 115
Someya, Yasumasa, 114
special-purpose corpora, 36
speech, computerizing
background noise, 75
detail, 71–2
extra-linguistic information, 72
future prospects, 139
iconicity and speech transcription, 75–8
lexicalized expressions, 72–3
linked expressions, 73
names of individuals, 75
partially uttered words, 73
principles, 72
punctuation, 74–5
repetitions, 73–4
speech-recognition programs, 24–6, 70–1
transcription programs, 69–70
transcription time, 71
unintelligible speech, 74
vocalized pauses, 72
speech, repetition in, 9
speech samples, 56
broadcasts, 61
future prospects, 139
microphones, 60
“natural” speech, 56–8, 59
parsing, 94–5, 96
permission, 57
problems, 60–1
recording, 58–9
sample length, 57–8
tape recorders, 59–60
see also planning corpus construction;
speech, computerizing
speech-recognition programs, 24–6, 70–1
Sperberg-McQueen, C. M., 82
Standard Generalized Markup Language see
SGML
statistical analysis, 119–21
backward elimination, 135
Bonferroni correction, 129
chi-square test, 127–32, 127t, 134, 135
cross tabulation, 125
degrees of freedom, 129
determining suitability, 121–2
exploring a corpus, 123–4
frequency counts, 120
frequency normalization, 126
kurtosis, 127, 127t
length of corpora, 126
linguistic motivation, 122
log-likelihood (G²) test, 132, 135
loglinear analysis, 134, 136
macroscopic analysis, 122
non-parametric tests, 126, 127
normal distribution, 126–7
programs, 120, 136
pseudo-titles (case study), 123–36, 125t,
127t, 128f, 128t, 130t, 131t, 132t,
133t, 134t, 135t
saturated models, 134
skewness, 127, 127t
using quantitative information, 125–36
Varbrul programs, 136
structural markup, 81
detail, 85–6
display, 86
intonation, 85
SGML, 82–5, 86
TEI, 67, 84, 85, 86, 98, 149
timing, 68–9
XML, 84, 86
style-shifting, 19
Survey of English Usage (SEU), 42, 98
Susanne Corpus, 96, 149
Svartvik, Jan, 75
Switchboard Corpus, 25, 149
synchronic corpora, 45–6
taggers, 86–90
accuracy, 89–90
probabilistic, 88–9
rule-based, 88
tagging a corpus, 86–91
accuracy, 89–90, 91
disambiguation, 88
discourse tagging, 97
future prospects, 140
limitations, 117–18
post-editing, 89
problem-oriented tagging, 97–8, 111
semantic tagging, 97
taggers, 86–90
tagsets, 86, 87, 90–1
see also SGML; Text Encoding Initiative
(TEI)
TAGGIT, 88
Tagliamonte, Sali, 52, 136
tagsets, 86, 87, 90–1
TalkBank Project, 138, 149
Tampere Corpus, 42, 150
Tannen, D., 9
Tapper, Marie, 26–7
Text Encoding Initiative (TEI), 67, 84, 85, 86,
98, 150
text samples
length, 38–40
number, 40–3
that, 22
Thomas, Jenny, 97
Thompson, Sandra, 32, 52, 85
time-frame, 45–6
TIMIT Acoustic–Phonetic Continuous Speech
Corpus, 24–5, 150
TIPSTER Corpus, 25, 150
TOSCA parser, 91, 92–3, 92f, 95–6
TOSCA tagset, 87
TOSCA Tree editor, 95
transcription programs, 69–70
translation theory, 22–4
tree editors, 95
treebanks, 91–2
see also Penn Treebank
universal grammar, 2–3
Varbrul programs, 136
verb complements, 45
verb phrases, 13
verbs, modal, xii–xiii, 12–13
voice, 5–6
Weizman, Michael, 22
Wellington Corpus, 89, 136, 150
Willis, Tim, 95
Wilson, Andrew, 97
Woods, A., 121–2
World Wide Web, 28, 63–4, 79–80,
140, 142–51
written texts
collecting data, 61–4, 139
computerizing, 78–80, 139
copyright, 38, 44, 61–2, 79 n6, 139–40
electronic texts, 63–4
see also planning corpus construction
XML (Extensible Markup Language), 84, 86
York Corpus, 52, 150