Aims and methods of quantitative linguistics

Overview

While the formal branches of linguistics use only qualitative mathematical means (algebra, set theory) and logic to model structural properties of language, quantitative linguistics (QL) studies the multitude of quantitative properties which are essential for describing and understanding the development and functioning of linguistic systems and their components. The objects of QL research therefore do not differ from those of other linguistic and textological disciplines; nor is there a difference in principle in epistemological interest. The difference lies rather in the ontological point of view (do we consider a language as a set of sentences with structures assigned to them, or do we see it as a system subject to evolutionary processes in analogy to biological organisms, etc.?) and, consequently, in the concepts which form the basis of the disciplines.

Differences of this kind shape the ability of researchers to perceive – or not – elements, phenomena, or properties in their area of study. A linguist accustomed to thinking in terms of quantities, probabilities and trends is more likely to find the study of properties such as length, frequency, age, degree of polysemy etc. interesting and necessary than a researcher who thinks in terms of set theory and algebra. There is, however, an immense number of properties and processes in language which can be detected and analysed only with quantitative methods on the basis of quantitative concepts: features and interrelations which can be expressed only by numbers or rankings.

There are also interrelations among these features which play central roles in the development of language(s), because their consequences form the structures and properties we can observe in language and text. Among these interrelations are, e.g., the dependence of the length (or complexity) of syntactic constructions on their frequency and on their ambiguity, of the homonymy of grammatical morphemes on their dispersion in their paradigm, of the length of expressions on their age, of the dynamics of the flow of information in a text on its size, and of the probability of change of a sound on its articulatory difficulty. In short, phenomena of this kind are predominant in every field and on each level of linguistic analysis: lexicon, phonology, morphology, syntax, text structure, semantics, pragmatics, dialectology, language change, psycho- and sociolinguistics, in prose and lyric poetry. They are observed in every language in the world and at all times.

Moreover, it can be shown that these properties of linguistic elements and their interrelations abide by universal laws, which can be formulated in a strict mathematical way – in analogy to the laws of the well-known natural sciences. It must be emphasised that these laws are stochastic; they do not capture single cases (this would be neither expected nor possible) but rather predict the probabilities of certain events or certain conditions in a whole. It is easy to find counter-examples to any of the examples cited above. However, this does not mean that they contradict the corresponding laws. Divergences from a statistical average are not only admissible but even necessary – they are themselves determined with quantitative exactness. This situation is, in principle, not different from that in the natural sciences, where the old deterministic ideas have long been abandoned and replaced by modern statistical/probabilistic models.
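The stochastic character of such laws can be illustrated with a minimal sketch in Python, using the dependence of length on frequency as the example. The toy sample, the function names and the pairwise counting scheme are assumptions introduced purely for illustration; they are not corpus data or an established QL procedure.

```python
from collections import Counter

# Toy sample invented for illustration only; not real corpus data.
TEXT = (
    "the cat sat on the mat and the dog sat on the rug "
    "a cat and a dog can be friends but the neighbourhood "
    "veterinarian recommends supervision"
)


def length_by_frequency(text):
    """Return {frequency: mean length in letters} over the word types."""
    counts = Counter(text.split())
    groups = {}
    for word, freq in counts.items():
        groups.setdefault(freq, []).append(len(word))
    return {f: sum(ls) / len(ls) for f, ls in sorted(groups.items())}


def pair_tendency(text):
    """Compare all pairs of word types.

    A pair supports the tendency if the more frequent word is the shorter
    one, and runs contrary to it if the more frequent word is the longer
    one. Pairs with equal frequency or equal length are ignored.
    """
    items = list(Counter(text.split()).items())
    supporting = contrary = 0
    for i, (w1, f1) in enumerate(items):
        for w2, f2 in items[i + 1:]:
            product = (f1 - f2) * (len(w1) - len(w2))
            if product < 0:
                supporting += 1
            elif product > 0:
                contrary += 1
    return supporting, contrary


if __name__ == "__main__":
    print(length_by_frequency(TEXT))
    print(pair_tendency(TEXT))
```

On this toy sample, 44 pairs follow the tendency and 7 run against it (for instance, "the" is both more frequent and longer than "a"). The individual counter-examples do not cancel the overall tendency, which is the sense in which a stochastic law makes statements about probabilities rather than single cases.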

The role of QL is now to unveil corresponding phenomena, to systematically describe them, and to find and formulate laws which explain the observed and described facts. Quantitative interrelations have an enormous value for fundamental research but they can also be used and applied in many fields such as computational linguistics and natural language processing, language teaching, optimisation of texts etc.

Early modern linguistics, in the time after the seminal contribution of de Saussure, was mainly interested in the structure of language. Consequently, linguists adopted the qualitative means of mathematics: logic, algebra, set theory. The historical development of linguistics and a subsequent one-sided emphasis on certain elements of the structuralist achievements resulted in the emergence of an absolutely static concept of system, which has prevailed to the present day. The aspects of systems which go beyond structure, viz. functions, dynamics, processes, were disregarded almost completely. To overcome this flaw, the quantitative parts of mathematics (e.g., analysis, probability theory and statistics, function theory, differential and difference equations) must be added to the qualitative ones, and this is the actual aim of QL.

Last but not least, important applications in the fields of language and text technology, computational linguistics etc. have adopted quantitative methods because purely qualitative means failed in practice. Nowadays, most working systems in these fields apply QL techniques, which are therefore attracting increasing interest among teachers and students as well.

Objectives of QL

As briefly mentioned above, QL cannot be characterised by a specific cognitive interest. QL researchers study the same scientific objects as other linguists. However, in contrast to other branches of linguistics, QL emphasises the introduction and application of additional, advanced scientific tools. In principle, linguistics tries, in the same way as other empirical (“factual”) sciences do in their fields, to find explanations for the properties, mechanisms, functions, development etc. of language(s). It would be a mistake, of course, to think of a “final” explanation which would help to grasp the “essence” of the objects (cf. Popper 1971: 23, Hempel 1952: 52ff; cf. also Kutschera 1972: 19f). Science strives for a hierarchy of explanations which lead to more and more general theories and cover more and more phenomena, without ever reaching an end of explanation. Due to the stochastic properties of language, quantification and probabilistic models play a crucial role in this process. Within the framework of this general aim, QL has a special status only because it makes special efforts to provide the methods necessary for this purpose, and it will have this status only as long as these methods are not yet common in all areas of language and text research. We can characterise this endeavour by two complementary aspects:

On the one hand, the development and the application of quantitative models and methods is indispensable in all cases where purely formal (algebraic, set-theoretical, and logical) methods fail, i.e. where the variability and vagueness of natural languages cannot be neglected, where tendencies and preferences dominate over rigid principles, where gradual changes debar the application of static/structural models. Briefly, quantitative approaches must be applied whenever the dramatic simplification, which is caused by the qualitative yes/no scale, cannot be justified or is inappropriate for a given investigation or application.

On the other hand, quantitative concepts and methods are superior to the qualitative ones on principled grounds, as has been shown above. The quantitative ones enable a more adequate description of reality by providing an arbitrarily fine resolution. Between the two extreme poles yes/no, true/false, 1/0 of qualitative concepts, as many grades as are needed can be distinguished up to the infinitely many “grades” of the continuum.

Generally speaking, the development of quantitative methods aims at improving the exactness and precision of the possible statements on the properties of linguistic and textual objects. Exactness depends, in fact, on two factors:

on the acuity of the definition of a concept and

on the quality of the measurement methods with which the given property can be determined.

Success in defining a linguistic property with sufficiently crisp concepts enables us to operate on it with mathematical means, provided the operations correspond to the scale level (cf. above) of the concepts.
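The requirement that operations correspond to the scale level can be made concrete with a small sketch. It assumes the conventional typology of nominal, ordinal, interval and ratio scales, which is not spelled out at this point in the text; the table of admissible statistics, the function name `summarise` and the sample values are illustrative assumptions.

```python
import statistics

# Summary statistics conventionally regarded as admissible at each scale
# level. This mapping is an assumption made for illustration.
ADMISSIBLE = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean"},
}


def summarise(values, scale, statistic):
    """Apply a statistic only if the scale level of the data permits it."""
    if statistic not in ADMISSIBLE[scale]:
        raise ValueError(
            f"'{statistic}' is not meaningful for {scale}-scaled data"
        )
    # statistics.mode, statistics.median and statistics.mean are all
    # standard-library functions.
    return getattr(statistics, statistic)(values)


if __name__ == "__main__":
    # Word length in syllables: a ratio-scaled property, so a mean is fine.
    print(summarise([1, 2, 2, 3], "ratio", "mean"))
    # Frequency ranks: ordinal, so the median is the appropriate centre.
    print(summarise([1, 2, 3, 4, 10], "ordinal", "median"))
    # Part-of-speech labels: nominal, so only the mode is available.
    print(summarise(["noun", "verb", "noun"], "nominal", "mode"))
```

Requesting the mean of the ordinal ranks raises a `ValueError` in this sketch, reflecting the point that averaging ranks presupposes equal distances between them, which an ordinal concept does not guarantee.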

Such operations help us derive new insights which would not be possible without them: appraisal criteria which at present exist only in a subjective, tentative form can be made objective and operationalised (e.g. in stylistics); interrelations between units and properties which remain invisible to qualitative methods can be detected; and workable methods can be found for technical and other fields of application where traditional linguistic methods fail or produce inappropriate results due to the stochastic properties of the data or to their sheer mass (e.g., in Natural Language Processing).