The Indian exception: complex prepositions in the Kolhapur Corpus

I am a big fan of old corpora. Of course, I do also appreciate XXL corpora compiled semi-blindly from the Web. But, comparatively speaking, the older corpora have the kind of spick-and-span internal structure that makes them pleasant to use.

Complex prepositions

In a sociolinguistics course that I taught this Fall semester, we studied World Englishes and decided to focus on Indian English and compare this variety to British and American English. To operationalize the comparison, we chose to replicate Hirschmüller’s work on complex prepositions (Hirschmüller 1989). We found the idea in Leitner (1991: 228), whose main research question goes as follows:

The focus of this paper derived from a typological interest in Indian English: is it one possible extension of English or is it one of the Englishes?

He adds:

(i) to what extent can the Kolhapur Corpus contribute to our understanding of the nature of Indian English (intra-varietal description); (ii) to what extent can the classical corpus tradition, of which the Kolhapur Corpus is a part, advance a typology of varieties (inter-varietal comparison), and (iii) are other complementary types of corpora required?

Prepositions are grammatical (or function) words whose purpose is to introduce nouns or noun phrases: on, in, at, since, for, ago, before, to, until, etc. (the list is very long).1 As function words, prepositions belong to a closed-class set (a society of words that admits no new members).

Complex prepositions are multiword expressions (i.e. expressions that consist of several words): ahead of, along with, apart from, such as, thanks to, together with, on account of, on behalf of, on top of. Linguists are not sure whether they belong to a closed-class set:

Prepositions can be regarded as occupying an intermediate position between open and closed sets. It has been argued that this is accounted for by the open-ended nature of complex prepositions, i. e. of two-, three-, or more- word sequences that function as prepositions, e. g. such as, in view of, in spite of, with regard to. (Leitner 1991, 224)

Intuitively, the use of complex prepositions is typical of formal registers. Leitner assumes that “non-native Englishes are often claimed to use a more formal register than native Englishes” (although he does not back up his claim with citations). If complex prepositions are a token of formality, then we should expect Indian English to use more complex prepositions than British English and American English, especially in formal contexts.

The Brown, LOB, and Kolhapur corpora

Hirschmüller (1989) studied the distribution of complex prepositions in three corpora: the Brown Corpus, the LOB Corpus, and the Kolhapur Corpus. Each corpus contains approximately one million word tokens.

The Brown Corpus is a corpus of American English. It was initially compiled in 1964, revised in 1971, and amplified in 1979 by W. N. Francis and H. Kučera. The LOB Corpus is the British counterpart to the Brown Corpus. It was compiled by Stig Johansson, Geoffrey Leech, and Helen Goodluck in 1978. Here is what I wrote recently about the nature of these two corpora:

Corpus-linguistics textbooks frequently present the Brown Corpus (Francis and Kučera 1979) and its British counterpart, the Lancaster-Oslo-Bergen corpus (Johansson, Leech, and Goodluck 1978) as paragons of balanced corpora. Each attempts to provide a representative and balanced collection of written English in the early 1960s. For example, the compilers of the Brown Corpus made sure that all the genres and subgenres of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum were represented. Balance was achieved by choosing the number of text samples to be included in each category. By way of example, because there were about thirteen times as many books in learned and scientific writing as in science fiction, 80 texts of the former genre were included, and only 6 of the latter genre. (Desagulier 2017, 4)

The Kolhapur Corpus is a corpus of Indian English. It was compiled in 1986 by S. V. Shastri, C. T. Patilkulkarni, and G. S. Shastri. Here is how the corpus compilers describe their work:

(…) it is felt that the value of the Indian corpus is immensely enhanced in general and in particular as a source for the description of Indian English as the Indianness of Indian English is a post-Independence phenomenon and may have reached a discernible stage in the thirty years after Independence. (http://clu.uni.no/icame/kolhapur/kolman.htm)

All three corpora were designed to be comparable as they share identical architectures.2 Tab. 1 below shows the distribution of texts per category.

Tab. 1. The parallel composition of the Brown, LOB, and Kolhapur corpora

| Label | Text category | Brown Corpus | LOB Corpus | Kolhapur Corpus |
| --- | --- | --- | --- | --- |
| A | Press: reportage | 44 | 44 | 44 |
| B | Press: editorial | 27 | 27 | 27 |
| C | Press: reviews | 17 | 17 | 17 |
| D | Religion | 17 | 17 | 17 |
| E | Skills, trades and hobbies | 36 | 38 | 38 |
| F | Popular lore | 48 | 44 | 44 |
| G | Belles lettres, biography, essays | 75 | 77 | 70 |
| H | Miscellaneous (docs, reports, etc.) | 30 | 30 | 37 |
| J | Learned and scientific writings | 80 | 80 | 80 |
| K | General fiction | 29 | 29 | 58 |
| L | Mystery and detective fiction | 24 | 24 | 24 |
| M | Science fiction | 6 | 6 | 2 |
| N | Adventure and western fiction | 29 | 29 | 15 |
| P | Romance and love story | 29 | 29 | 18 |
| R | Humour | 9 | 9 | 9 |
| Total | | 500 | 500 | 500 |

In practice, you have one text file per text category. The screenshot below shows the file hierarchy of the Kolhapur Corpus. The file hierarchy of the Brown and LOB corpora is strictly identical.

This is good news for the corpus linguist! Complex prepositions can be extracted easily from each file, linked to each corpus by means of the filename prefix, and linked to a text category by means of the alphabetic identifier.
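Here is a minimal sketch of that linking step. It assumes, hypothetically, file names of the form kol_a.txt, lob_j.txt, or brown_m.txt (a corpus prefix, an underscore, and the category letter from Tab. 1). The actual file names in the ICAME distribution may well differ, in which case the splitting logic needs adjusting. The helper name parse_corpus_file is mine.

```r
# Category letters and labels, as listed in Tab. 1
categories <- c(
  A = "Press: reportage",
  B = "Press: editorial",
  C = "Press: reviews",
  D = "Religion",
  E = "Skills, trades and hobbies",
  F = "Popular lore",
  G = "Belles lettres, biography, essays",
  H = "Miscellaneous (docs, reports, etc.)",
  J = "Learned and scientific writings",
  K = "General fiction",
  L = "Mystery and detective fiction",
  M = "Science fiction",
  N = "Adventure and western fiction",
  P = "Romance and love story",
  R = "Humour"
)

# ASSUMPTION: files are named <corpus prefix>_<category letter>.txt
parse_corpus_file <- function(filename) {
  # drop the directory and the .txt extension
  stem  <- sub("\\.txt$", "", basename(filename), ignore.case = TRUE)
  parts <- strsplit(stem, "_", fixed = TRUE)
  corpus <- vapply(parts, function(p) p[1], character(1))
  letter <- toupper(vapply(parts, function(p) p[2], character(1)))
  data.frame(file = filename,
             corpus = corpus,
             category = unname(categories[letter]),
             stringsAsFactors = FALSE)
}

parse_corpus_file(c("kol_a.txt", "lob_j.txt", "brown_m.txt"))
```

Each extracted preposition can then inherit the corpus and category values of the file it comes from.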

Hirschmüller’s results

Fig. 2 is a screenshot of Hirschmüller’s results, as found in Leitner (1991, 225). It displays the 10 most frequent complex prepositions found in all three corpora, out of a total of 235 complex prepositions, 81 of which consist of two words and 154 of three and more. If we compare the total columns, we observe a higher incidence of complex prepositions in the Kolhapur Corpus than in the other two corpora.

Fig. 2. The frequency and rank order of the 10 most common complex prepositions of three or more words in Kolhapur (KOL), Lancaster-Oslo/Bergen (LOB) and Brown (BRN). From Leitner (1991, 225)

Fig. 3 displays the distribution of the longest complex prepositions (three words or more). These are over-represented in the corpus of Indian English.

Fig. 3. The distribution of complex prepositions of three or more words in the Kolhapur (KOL), Lancaster-Oslo/Bergen (LOB) and Brown (BRN) corpora. From Leitner (1991, 225)

However, the above tables are tables of raw counts: we have no idea about the statistical significance of the distributions (i.e. we have no way of telling whether the observed differences are due to chance). Furthermore, in Fig. 3, the text categories are denoted by their respective alphabetic character, which makes it difficult to tell formal and less formal registers apart.
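One common way to address the first point is a χ² test of independence on the table of counts. The sketch below only shows the mechanics: the counts are made up for the occasion (they are not Hirschmüller's figures), and the object names toy and res are hypothetical.

```r
# Toy contingency table: rows = preposition length, columns = corpus.
# The counts are INVENTED for illustration only; they are not taken from
# Hirschmüller (1989) or Leitner (1991).
toy <- matrix(c(120, 100, 95,
                 60,  40, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(length = c("two_words", "three_plus"),
                              corpus = c("KOL", "LOB", "BRN")))

res <- chisq.test(toy)

res$statistic          # the chi-squared value
res$parameter          # degrees of freedom: (2 - 1) * (3 - 1)
res$p.value            # probability of a statistic at least this large
                       # if length and corpus were independent
round(res$stdres, 2)   # standardized residuals: which cells drive the result
```

A small p-value would suggest that the distribution of preposition lengths depends on the corpus; the residuals then indicate which corpus over- or under-uses which length.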

We stored all three corpora at the root of a hard drive, in a directory named corpora.

Fig. 4. How we stored all three corpora

I obtained these corpora a long time ago when I purchased the ICAME Corpus Collection on CD-ROM (version 2, 1999). Unfortunately, I cannot share them here. The good news is that Brown and LOB are available from the CLARINO Bergen Centre (an offshoot of the giant CLARIN project) as part of the Brown family. The Kolhapur Corpus is much harder to find online.

Replicating Hirschmüller (1989)

To replicate Hirschmüller (1989), we set out to extract all prepositions (simple and complex) from the three corpora. Because they are plain-text corpora, Brown, LOB, and Kolhapur are not POS-tagged (Fig. 5). This means that we cannot rely on tags to extract complex prepositions. We might have tagged the corpora with TreeTagger, but that would have been time-consuming and somewhat pointless, since complex prepositions would not have been tagged as such (there is no ready-made tag for complex prepositions).

Fig. 5. What a corpus file looks like (Kolhapur)

Instead, we used the inventory of English prepositions from the PDEP project to formulate our R query. Out of the 305 simple and complex prepositions retrieved from the PDEP database, 297 made the cut (8 were considered too rare and discarded). This is what the matching expression looks like (hang in there):
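The general idea behind such an expression can be illustrated on a handful of prepositions mentioned earlier in this post. This is only a toy sketch, not the 297-item expression itself: the vector preps and the helper build_prep_regex are hypothetical names, and the real query may be organized differently.

```r
# A tiny sample of simple and complex prepositions (NOT the PDEP inventory)
preps <- c("on", "in", "at", "in view of", "in spite of",
           "on behalf of", "apart from", "thanks to")

# Hypothetical helper: turn a vector of prepositions into one alternation
build_prep_regex <- function(preps) {
  # longest alternatives first, so that "in spite of" is tried before "in"
  ordered <- preps[order(nchar(preps), decreasing = TRUE)]
  # allow any run of whitespace between the words of a complex preposition
  alternatives <- gsub(" ", "\\\\s+", ordered)
  # word boundaries prevent matches inside longer words (e.g. "in" in "rain")
  paste0("\\b(", paste(alternatives, collapse = "|"), ")\\b")
}

rx  <- build_prep_regex(preps)
txt <- "He spoke on behalf of the board in spite of the rain."
hits <- regmatches(txt, gregexpr(rx, txt, ignore.case = TRUE, perl = TRUE))[[1]]
hits
```

Ordering the alternatives from longest to shortest matters: with the short forms first, the regex engine would match "on" and "in" and never see the complex prepositions they introduce.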

We placed the expression inside the extraction script from Section 5.3 of my Corpus Linguistics and Statistics with R book and ran it over each of the three corpora. We obtained three data sets, which are available for download from Nakala’s secure server:

Visualizing the table with correspondence analysis

Because the table is quite large, how about visualizing it with correspondence analysis? Correspondence analysis is my favorite clustering technique. I present it in detail in my book (Section 10.4).3

Before running a correspondence analysis, the table of nominal data has to be converted into a table of counts.
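In R, this kind of conversion is typically done with table(). The sketch below assumes a data frame with one row per match and two hypothetical columns, preposition and category. The toy rows are invented, and the real data sets may use other column names.

```r
# Toy data: one row per preposition match (INVENTED for illustration)
matches <- data.frame(
  preposition = c("in", "in", "in spite of",
                  "on behalf of", "in", "in spite of"),
  category    = c("press_reviews", "religion", "religion",
                  "learned_scientific", "learned_scientific",
                  "learned_scientific"),
  stringsAsFactors = FALSE
)

# cross-tabulate: rows = prepositions, columns = text categories
counts <- table(matches$preposition, matches$category)

# as.data.frame.matrix() keeps the two-way layout,
# with the prepositions as row names
dfcm_toy <- as.data.frame.matrix(counts)
dfcm_toy
```

The result has the same shape as the table used below: prepositions as row names and one column of counts per category.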

Additionally, we need to know how many words each preposition contains.

# store the row names (i.e. the prepositions) in a temporary column
dfcm$preposition <- rownames(dfcm)
# count the number of words per preposition with the stringi package
# this is done by counting the spaces between each word and adding 1
# install.packages("stringi") # run once if the package is not installed
library(stringi)
dfcm$prep.length <- stri_count_fixed(dfcm$preposition, " ") + 1
dfcm$prep.length <- as.factor(dfcm$prep.length)
# once this is done, remove the dfcm$preposition column
dfcm$preposition <- NULL

It is now time to run the correspondence analysis. For this task, my favorite R package is FactoMineR.
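A minimal sketch of such a call is given below, on a small made-up table rather than on the real data sets. CA() is FactoMineR's correspondence analysis function; the object names toy and ca are hypothetical, and the call is skipped if the package is not installed.

```r
# Toy table of counts (INVENTED for illustration):
# rows = preposition length, columns = corpus
toy <- matrix(c(900, 950, 940,
                200, 150, 160,
                 80,  45,  40,
                 20,   8,   7),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("prep.length.", 1:4),
                              c("KOL", "LOB", "BRN")))

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # graph = FALSE: compute the analysis without plotting right away
  ca <- FactoMineR::CA(as.data.frame(toy), graph = FALSE)
  print(ca$eig)        # eigenvalues and percentage of variance per dimension
  print(ca$row$coord)  # coordinates of the rows (preposition lengths)
  print(ca$col$coord)  # coordinates of the columns (corpora)
  # plot(ca) would then draw the CA map
}
```

With a three-column table, there are at most two dimensions to inspect; a larger table of categories and corpora yields more.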

The three corpora consist of written texts and the overall level of formality is therefore pretty high. Hirschmüller observed two things: (1) complex prepositions cluster in non-fictional texts, a preference that is amplified in the Kolhapur Corpus; (2) learned and bureaucratic writing shows a more pronounced pattern in the Kolhapur Corpus than in the British and American corpora.

The CA plot reflects these tendencies (Fig. 6). The first dimension (along the horizontal axis) accounts for 82.89% of the variance. It shows a clear divide between Brown and LOB (to the left) and Kolhapur (to the right). Long complex prepositions (three words and more: prep.length.3 and prep.length.4) are far more likely to occur in Indian English than in its British and US counterparts. No such preference is observed for one-word and two-word prepositions (prep.length.1 and prep.length.2). Very formal text categories cluster to the right, along with the Kolhapur Corpus: learned_scientific, press_reviews, religion, miscellaneous (governmental documents, foundation reports, industry reports, college catalogue, industry in-house publications). All in all, long complex prepositions are characteristic of the Kolhapur Corpus, especially in formal contexts.

Observing different results would have been surprising because we used the same data as Hirschmüller and tested the significance of our findings with the same χ² tests (correspondence analysis relies heavily on χ²). However, replicating a study made in the late 1980s makes sense in times of replication crisis.
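The link between correspondence analysis and χ² can be made concrete in a few lines of base R. The sketch below uses an invented table (not our data) and hypothetical object names. It decomposes the standardized residuals with a singular value decomposition, which is the core computation behind correspondence analysis, and checks that the total inertia equals the χ² statistic divided by the grand total.

```r
# Toy table of counts (INVENTED for illustration)
N <- matrix(c(900, 950, 940,
              200, 150, 160,
               80,  45,  40,
               20,   8,   7),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("prep.length.", 1:4),
                            c("KOL", "LOB", "BRN")))

n        <- sum(N)
P        <- N / n                     # correspondence matrix (proportions)
row_mass <- rowSums(P)
col_mass <- colSums(P)
E        <- outer(row_mass, col_mass) # expected proportions under independence
S        <- (P - E) / sqrt(E)         # standardized residuals

sv      <- svd(S)$d                   # singular values
inertia <- sv^2                       # principal inertias (eigenvalues)
keep    <- seq_len(min(dim(N)) - 1)   # number of meaningful dimensions
pct     <- 100 * inertia[keep] / sum(inertia)
round(pct, 2)                         # share of inertia per dimension

# total inertia = chi-squared statistic / grand total
chi2 <- unname(chisq.test(N, correct = FALSE)$statistic)
all.equal(sum(inertia), chi2 / n)
```

The percentage reported for the first dimension of a CA plot is obtained in the same way, from the first principal inertia.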

The status of English

The Eighth Schedule to the Constitution of India recognizes 22 official languages other than English, and there are discussions to include almost 40 more. The count does not even include non-official languages and dialects. Needless to say, India is a linguist’s paradise. English has a special place insofar as it came along with the British Empire. We expect it to be influenced by the contexts where it has been used the most: political, educational, and financial institutions.

The above gave me another research idea: studying the differences between France’s French and Pondicherry’s French. If you have heard of a case study that operationalizes this research question in an interesting way, please let me know!

2. There are differences. For example, according to the Kolhapur manual: “(…) there are some important differences dictated mainly by logistic and practical considerations. Firstly, as far as synchronicity is concerned there is a major departure in that, while the Brown and LOB corpora draw their samples from the materials published in the calendar year 1961 the Indian corpus is drawn from materials published in the year 1978. this decision was made after consultation with authors of the earlier corpora to make sure that the comparability will not suffer much as result of this.”