
Electronic Corpus of Welsh (CEG)

The original project was financed during 1993/4 with a £21,000 grant awarded by the Higher Education Funding Council for Wales to Ellis, O’Dochartaigh & Hicks from the IT Unit, Welsh Department and School of Psychology, University of Wales, Bangor.

It included 1,079,032 words of written Welsh prose, mainly from 1970 onwards, based on 500 samples of around 2,000 words each. The data was tagged and analysed for various linguistic studies, and the original files may still be accessed at http://www.bangor.ac.uk/canolfanbedwyr/ceg.php.en, now maintained by the staff of the Language Technologies Unit.

Because of the demand for a user-friendly, searchable interface for the corpus, in 2012 the Unit developed another version, using the Cysefin and Hebog platform to display the data. The texts displayed are the same in both versions; the only difference is in the methods of showing and searching the data. The version using Cysefin and Hebog may be accessed from the Welsh National Corpus Portal.

Brief Summary

This is a word frequency analysis of 1,079,032 words of written Welsh prose, based on 500 samples of approximately 2,000 words each, selected from a representative range of text types to illustrate modern (mainly post-1970) Welsh prose writing. It was conceived as providing a Welsh parallel to the Kucera and Francis analysis for American English, and the LOB corpus for British English, in the expectation that such an analysed corpus would provide research tools for a number of academic disciplines: psychology and psycholinguistics, child and second language acquisition, general linguistics, and the linguistics of Modern Welsh, including literary analysis.

The sample included materials from the fields of novels and short stories, religious writing, children’s literature both factual and fiction, non-fiction materials in the fields of education, science, business, leisure activities, etc., public lectures, newspapers and magazines, both national and local, reminiscences, academic writing, and general administrative materials (letters, reports, minutes of meetings).

The resultant corpus was analysed to produce frequency counts of words both in their raw form and as counts of lemmas where each token is demutated and tagged to its root. This analysis also derives basic information concerning the frequencies of different word classes, inflections, mutations, and other grammatical features.

Background

This project was funded for the academic year 1993-94 by a grant of £21K from the Higher Education Funding Council for Wales to Ellis, O’Dochartaigh & Hicks of the Welsh IT Unit and the School of Psychology, University of Wales, Bangor. The researchers began work on the project in October 1993, and after the sample range had been identified in collaboration with Professor Gwyn Thomas of the Department of Welsh, proceeded to collect the required range of texts. The original intention was that this range of materials would be acquired in an electronic form from Welsh language publishers and other bodies, such as local authorities, governmental organizations, and papurau bro (locally produced newspapers). However, it proved to be impossible to collect the necessary breadth of materials in an electronic form, primarily because at that time Welsh language publishers did not generally keep computer-based archive copies of books which they may have published using electronic means.

Under these circumstances, having acquired around 200 usable samples from various bodies, it was decided to input the remainder by using both typists and an OCR system. The task of checking such typed copy, and in particular of correcting the errors introduced by the OCR software, was carried out by the researcher, assisted by the on-going development of the Welsh spelling-checker, CySill. The additional costs of this work were borne by funding from the Welsh IT Unit at Bangor.

Where material was obtained directly from publishers or from individual authors, permission was sought for the data to be included in the project analysis, with the understanding that if the data were ever to be made available to a wider audience, then a formal request would be made to the copyright holders for this use. Where samples were taken either by typing or by OCR from published works, formal permission for their use has not yet been requested, as it was judged that samples of 2,000 words could in most cases be regarded as “fair dealing” for academic research purposes under the Copyright Acts. Any future public use of these materials will require the formal permission of their copyright holders.

It was decided to use the analytical software for Welsh which had been developed for a Welsh language spelling checker, then under way in the School of Psychology for Bwrdd yr Iaith Gymraeg / The Welsh Language Board. This spelling checker in its improved form involved a set of lemmatization algorithms for handling the language in a computer environment, and it was felt that these programs could be adapted for lemmatizing the CEG text samples. The basic program for the spelling checker was modified to allow it to process and analyze the texts in an interactive way. This required the ability to present the original text on screen for inspection by the researcher, and to offer interactive dialogue boxes to solve two fundamental problems with the software: the appearance of words or word forms which did not appear in the spelling checker’s own dictionary, and the possibility of homographs. The latter difficulty was solved by arranging for the software to identify a lemma by stripping off a particular ending and/or by demutating a word, then continuing to try possible endings and initial mutations in combination with other lemmas to check for possible homographs, effectively on the fly. Any such forms identified were presented on-screen to the researcher, with the original text still visible, to allow an informed choice to be made between the possibilities. In a similar way, the appearance of an unrecognized word or word form generated a dialogue box to allow the researcher to enter such words into a user dictionary, as well as allowing the forms to be incorporated into the tagged files which were produced from each separate text sample.
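The demutation step described above can be sketched in a few lines of Python. This is only an illustration, not the project's software: the function name `candidate_lemmas`, the reverse-mutation table, and the toy lexicon are all assumptions made for this sketch, and ending-stripping is left out. It reverses the standard Welsh initial mutations (meddal = soft, trwynol = nasal, llaes = aspirate), keeps every reading found in the lexicon, and so returns more than one reading when a form is a homograph.

```python
# Reverse mutation table: mutated initial -> list of (radical initial, mutation name).
# The mutation names reuse the Welsh labels seen in the tagged files.
REVERSE_MUTATIONS = {
    "ngh": [("c", "trwynol")],
    "mh": [("p", "trwynol")],
    "nh": [("t", "trwynol")],
    "ng": [("g", "trwynol")],
    "ch": [("c", "llaes")],
    "ph": [("p", "llaes")],
    "th": [("t", "llaes")],
    "dd": [("d", "meddal")],
    "b": [("p", "meddal")],
    "d": [("t", "meddal")],
    "g": [("c", "meddal")],
    "f": [("b", "meddal"), ("m", "meddal")],
    "l": [("ll", "meddal")],
    "r": [("rh", "meddal")],
    "m": [("b", "trwynol")],
    "n": [("d", "trwynol")],
}


def candidate_lemmas(word, lexicon):
    """Return every (radical form, mutation) reading of `word` found in `lexicon`."""
    readings = []
    if word in lexicon:
        readings.append((word, ""))  # unmutated reading
    for prefix, sources in REVERSE_MUTATIONS.items():
        if word.startswith(prefix):
            rest = word[len(prefix):]
            for radical_initial, mutation in sources:
                radical = radical_initial + rest
                if radical in lexicon:
                    readings.append((radical, mutation))
    # Soft mutation deletes an initial g, so also try restoring it.
    if "g" + word in lexicon:
        readings.append(("g" + word, "meddal"))
    return readings


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"ceg", "rhodio", "bod", "gan", "can"}

for form in ("geg", "cheg", "ngheg", "rodio", "gan"):
    print(form, candidate_lemmas(form, LEXICON))
```

With this toy lexicon, `gan` yields two readings (unmutated `gan` and soft-mutated `can`), which is the kind of case that would have been passed to the researcher for a decision.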

The main researcher worked on 350 out of the 500 samples, and a part-time researcher was employed through the Welsh IT Unit to analyze 150 of the samples. The average time for the analysis of each sample was around 1 hour, though the need to read over and correct typed or OCR-scanned text raised this to a figure of around 2 hours per sample.

Most users will probably only want to access the processed results: the frequency counts of word forms or lemmas presented below. However, we also provide the original text samples as ASCII files along with the 500 tagged files for those who need to find words or constructions in their original context or for scholars who wish to correct or take forward the analyses presented here.

Lemma [tab] Raw word [tab] Part Of Speech [tab] Mutation – if present [tab] Line Number

Each line shows the lemmatized form, the original word, the part of speech, type of mutation if present, and the location of the word (sample number, sentence number within sample, word number within sentence). For verbal forms, a number is used with the lemma to show the particular morphographemic form appearing.
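A line in this format could be read along the following lines. This is a hedged sketch rather than a tool shipped with the corpus: the names `Token` and `parse_tagged_line` are invented here, and it assumes that the mutation field is present but empty when there is no mutation, that the verb-form number is attached to the lemma with a colon (as in `bod:3`), and that the location is written as `[sample.sentence.word]`.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    lemma: str
    verb_code: Optional[str]  # e.g. "3" for "bod:3", None when absent
    raw: str
    pos: str
    mutation: str             # "" when the word is unmutated
    sample: int
    sentence: int
    word: int


# Assumed location layout: [sample.sentence.word], e.g. [74.2.6]
LOCATION = re.compile(r"\[(\d+)\.(\d+)\.(\d+)\]")


def parse_tagged_line(line: str) -> Token:
    """Split one tab-separated tagged line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    lemma_field, raw, pos, mutation, location = fields
    lemma, _, code = lemma_field.partition(":")
    match = LOCATION.fullmatch(location.strip())
    if match is None:
        raise ValueError(f"unrecognised location: {location!r}")
    sample, sentence, word = (int(part) for part in match.groups())
    return Token(lemma, code or None, raw, pos, mutation, sample, sentence, word)


print(parse_tagged_line("bod\tfod\tvb\tmeddal\t[74.2.6]"))
```

If the real files omit the mutation field entirely for unmutated words, the field-count check would need to be relaxed accordingly.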

An illustration of a sample sentence from a text, in this format, is given at the end of the Data quality section below.

We believe this text corpus is of value for an analysis of Welsh prose sentence patterns, for co-occurrence analyses of both individual lemmas and grammatical parts of speech in running texts, and for further linguistic analysis by specialist researchers in the field of Welsh syntax and child language acquisition. However, researchers must take note of some limitations in data quality, particularly regarding the accuracy of some of the lemma tags, which were affected by word form homography. These limitations are described below.

Data quality

We believe that the accuracy of the raw word forms in the database and their counts is quite high. Whatever errors (spelling or typographical) there were in the original samples will be carried over to the corpus. We must surely have introduced and failed to detect some additional errors in input, but we have tried hard to keep this number very low.

Tag quality is something of a different matter. The problems of high homography rates, a limited window template-matching lemmatiser with few rules, and the need for skilled linguistic analysis, compounded into a non-trivial number of tagging errors. A preliminary analysis of 5% of the corpus indicates that there is an error rate of 4% +/- 3%.

These tagging errors are by no means distributed equally across the database. For example, inaccuracies in the tagging of yn, bod/fod, and a (more generally, the high frequency closed class words) are much more common than inaccuracies with the open class words. Thus while the token error rate is perhaps 4%, the type error rate is much less than that. We do not have the resources to correct these miscodings. As well as noting the errors on a print-out of the output files, it would be necessary for any corrections to be written back to the files, and we estimate that a detailed correction of the full set would require two years’ work. Having tried to raise these resources, and waited too long, we have decided to release the database as it now stands; it is certainly better than nothing.
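The difference between the token error rate and the type error rate can be made concrete with a small sketch. All figures here are invented for illustration and are not measurements from the corpus; the function name `error_rates` is likewise an assumption. The point is only that when errors cluster on a few very frequent words, a few percent of tokens can be wrong while only a tiny fraction of distinct word types is affected.

```python
def error_rates(tagging):
    """`tagging` maps word type -> (token count, mis-tagged token count).

    Returns (token error rate, type error rate), where a type counts as
    erroneous if at least one of its tokens is mis-tagged.
    """
    total_tokens = sum(count for count, _ in tagging.values())
    wrong_tokens = sum(errors for _, errors in tagging.values())
    wrong_types = sum(1 for _, errors in tagging.values() if errors > 0)
    return wrong_tokens / total_tokens, wrong_types / len(tagging)


# Hypothetical data: three frequent closed-class words carry all the errors,
# and 997 rare open-class types are tagged correctly.
tagging = {"yn": (1000, 60), "bod": (1000, 60), "a": (1000, 60)}
tagging.update({f"open_{i}": (2, 0) for i in range(997)})

token_rate, type_rate = error_rates(tagging)
print(f"token error rate: {token_rate:.1%}, type error rate: {type_rate:.1%}")
```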

Nonetheless, researchers must take note of these limitations in data quality, particularly regarding the accuracy of some of the lemma tags.

Illustration of a sample sentence (sample 74, sentence 2) in the tagged file format:

Lemma      Raw word   Part of speech   Mutation   Location
a          a          part                        [74.2.1]
bod:3      ydi        vbf                         [74.2.2]
hynny      hynny      DemPron                     [74.2.3]
’n         ’n         vbadj                       [74.2.4]
golygu     golygu     vb                          [74.2.5]
bod        fod        vb               meddal     [74.2.6]
y          y          DefArt                      [74.2.7]
rhai       rhai       pron                        [74.2.8]
dagreuol   dagreuol   adj                         [74.2.9]
yn         yn         prep                        [74.2.10]
ein        ein        pron                        [74.2.11]
plith      plith      nm                          [74.2.12]
yn         yn         YnPred                      [74.2.13]
iach       iachach    CompAdj                     [74.2.14]
na         na         conj                        [74.2.15]
’r         ’r         DefArt                      [74.2.16]
rhai       rhai       pron                        [74.2.17]
sych       sych       adj                         [74.2.18]
?          ?          punct                       [74.2.19]

We believe the Counts of raw word forms to be highly accurate.

The Lemma Counts with analysis of inflections and mutations run at about 96% accuracy, with most problems on the high frequency closed class words.

Processed Results: Counts of Raw Word Forms

The word counts are based on the actual word forms occurring. These words include spellings which represent dialectal forms, informal spellings of Welsh forms (generally following the suggestions of Cymraeg Byw, though this is by no means a universally applied standard for informal writing), foreign words (particularly from English), as well as wrongly spelled Welsh words (that is, misprints in the original texts).

The total number of word form tokens in the corpus is 1,079,032.

The total number of separate word form types is 37,195.

The 50 most frequent raw word forms are:

Count   Word form     Count   Word form
55588   yn            3821    cael
45945   y             3754    yw
33327   i             3546    wrth
33231   a             3545    ni
32573   ’r            3463    hyn
26927   o             3023    na
15888   ar            2870    o+l
14990   ei            2721    hynny
14845   ’n            2646    fe
14523   yr            2613    er
11785   ac            2594    neu
9922    oedd          2585    nid
9338    bod           2542    at
9056    mae           2511    sy
7751    am            2417    ’w
7093    wedi          2401    hi
6118    ond           2360    dim
5568    un            2278    mynd
5415    ’i            2240    byddai
5294    eu            2160    gyda
4991    gan           2137    yng
4988    fel           2110    iawn
4578    mewn          2066    pob
4149    a+            2065    lle
4142    roedd         2027    pan

At the other end of the frequency range, there is a very long tail of single occurrence forms, with 44% of the total entries falling into this group, and between them, the numbers of single, double and triple occurrence words make up 64% of the total number of separate words (37,195). As might be expected, a large number of these very low frequency words consist of foreign borrowings, mis-spellings, dialectal forms and other types of variant spellings, and numbers. In most cases, the analysis program does distinguish between several of these categories (mis-spellings, foreign words, informal spellings), but such entries would require further checking if 100% accuracy were essential.
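The proportions quoted above are shares of word form types, not of tokens. A minimal sketch of how such a share could be computed from a list of tokens follows; the function name `low_frequency_share` and the tiny token list are assumptions for illustration, and the real figures would come from running the same calculation over the full corpus.

```python
from collections import Counter


def low_frequency_share(tokens, max_count):
    """Fraction of distinct word forms that occur at most `max_count` times."""
    counts = Counter(tokens)
    low = sum(1 for count in counts.values() if count <= max_count)
    return low / len(counts)


# Toy token list, for illustration only.
tokens = "yn y yn a y yn ceg".split()
print(low_frequency_share(tokens, 1))  # share of single occurrence forms
print(low_frequency_share(tokens, 3))  # share of forms occurring up to 3 times
```

Applied to the published totals, 44% and 64% of 37,195 types correspond to roughly 16,400 and 23,800 word forms respectively; these are approximations, since the percentages are themselves rounded.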

Lemma Counts with analyses of inflections and mutations

The lemmatizing software was used to demutate and uninflect word forms in order to track them back to their lemma. Examples of the resulting lemma analysis are shown for illustration in the table below:

Form              Count   Tag         Mutation
ceg               118
  ceg             118     n
    ceg           109     nf
      ceg         22      nf
      cheg        21      nf          llaes
      geg         56      nf          meddal
      ngheg       10      nf          trwynol
    cegau         9       npl
      cegau       9       npl
rhodio            16
  rhodio          16      vb
    rhodia        2       vbf
      rhodia      1       vbf :3
      rodia       1       vbf :3      meddal
    rhodiai       1       vbf
      rodiai      1       vbf :10     meddal
    rhodio        12      vb
      rhodio      7       vb
      rodio       5       vb          meddal
    rhodiwn       1       vbf
      rhodiwn     1       vbf :4.1

The lemma ceg appears 118 times, exclusively as a noun. 109 of these occurrences are as the feminine singular noun (ceg) and 9 as the plural noun (cegau). As the singular noun it appeared 22 times in unmutated form, 21 times with aspirate mutation, 56 times with soft mutation, and 10 times with nasal mutation.
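The sub-totals in such an entry can be checked mechanically. The sketch below copies the surface-form rows for ceg and rhodio from the table above into Python tuples; the tuple layout (form, count, tag, mutation) and the helper name `subtotal` are assumptions of this sketch, not the layout of the downloadable file.

```python
# Surface-form rows copied from the example table: (form, count, tag, mutation).
CEG_ROWS = [
    ("ceg", 22, "nf", ""),
    ("cheg", 21, "nf", "llaes"),
    ("geg", 56, "nf", "meddal"),
    ("ngheg", 10, "nf", "trwynol"),
    ("cegau", 9, "npl", ""),
]

RHODIO_ROWS = [
    ("rhodia", 1, "vbf :3", ""),
    ("rodia", 1, "vbf :3", "meddal"),
    ("rodiai", 1, "vbf :10", "meddal"),
    ("rhodio", 7, "vb", ""),
    ("rodio", 5, "vb", "meddal"),
    ("rhodiwn", 1, "vbf :4.1", ""),
]


def subtotal(rows, tag=None):
    """Sum the counts of all rows, or only of rows carrying the given tag."""
    return sum(count for _, count, row_tag, _ in rows
               if tag is None or row_tag == tag)


print(subtotal(CEG_ROWS, "nf"), subtotal(CEG_ROWS, "npl"), subtotal(CEG_ROWS))
print(subtotal(RHODIO_ROWS))
```

The singular forms add up to 109 and the plural to 9, giving the lemma total of 118; the rhodio rows add up to 16, matching the lemma totals in the table.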

The lemma rhodio appeared 16 times, always as a verb. Two of these occurrences were as the third person singular present (rhodia), once in unmutated form and once with soft mutation; 1 occurrence was as the third person singular imperfect in soft mutated form (rodiai); 12 occurrences were as the verb noun rhodio (7 times unmutated and 5 times with soft mutation); and 1 was as the first person plural present tense (rhodiwn). There are many verb forms for Welsh; the full list of verb form codes is shown below.

Verb-form Codes

The table of verb form codes is shown below:

Code   Ending   Description
1      af       present tense first person singular
2      i        present tense second person singular
3      a        present tense third person singular
4      wn       present tense first person plural
5      wch      present tense second person plural
6      ant      present tense third person plural
7      ir       present tense impersonal
8      it       imperfect tense first person singular
9      et       imperfect tense second person singular
10     ai       imperfect tense third person singular
11     em       imperfect tense first person plural
12     ech      imperfect tense second person plural
13     ent      imperfect tense third person plural
14     id       imperfect tense impersonal
15     ais      past tense first person singular
16     aist     past tense second person singular
17     odd      past tense third person singular
18     asom     past tense first person plural
19     asoch    past tense second person plural
20     asant    past tense third person plural
21     wyd      past tense impersonal
22     aswn     pluperfect first person singular
23     asit     pluperfect second person singular
24     aset     pluperfect second person singular
25     asai     pluperfect third person singular
26     asem     pluperfect first person plural
27     asech    pluperfect second person plural
28     asent    pluperfect third person plural
29     asid     pluperfect impersonal
30     ed       impersonal imperative
31     wyf      subjunctive first person singular
32     ych      subjunctive second person singular
33     o        subjunctive third person singular
34     om       subjunctive first person plural
35     och      subjunctive second person plural
36     ont      subjunctive third person plural
37     er       subjunctive second person singular
38     es       past tense first person singular
39     est      past tense first person singular
40     ith      Informal third person singular
41     iff      Informal Future third person singular
42     on       Informal Past third person plural
43     an       Informal Future third person plural

The file, Lemma Counts with Analysis, downloadable below, is tab-separated and can be imported into Excel where it can be readily manipulated to provide a wide range of analyses. One example, based on a sort of the final field (mutation), generates the following results for initial mutations.
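For readers who prefer a script to a spreadsheet, the same kind of tally could be produced as sketched below. The column layout assumed here (form, count, tag, mutation, with an empty final field for unmutated forms) is a guess modelled on the example table and may not match the downloadable file; the function name `mutation_tally` and the sample rows are likewise illustrative.

```python
import csv
import io
from collections import Counter


def mutation_tally(tsv_text):
    """Sum token counts per mutation type from tab-separated rows.

    Assumed row layout: form, count, tag, mutation (mutation may be empty).
    Rows with fewer than four fields are skipped.
    """
    tally = Counter()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 4:
            continue
        _form, count, _tag, mutation = row[:4]
        tally[mutation or "none"] += int(count)
    return tally


# Sample rows taken from the ceg and rhodio examples.
sample = (
    "ceg\t22\tnf\t\n"
    "cheg\t21\tnf\tllaes\n"
    "geg\t56\tnf\tmeddal\n"
    "ngheg\t10\tnf\ttrwynol\n"
    "rodio\t5\tvb\tmeddal\n"
)

tally = mutation_tally(sample)
total = sum(tally.values())
for mutation, count in tally.most_common():
    print(f"{mutation}\t{count}\t{count / total:.1%}")
```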

Initial mutations

Welsh words can exhibit one of four types of morphophonemic initial mutation, and the occurrences and relative frequencies of such forms in the sample are:

Use of these Materials

These materials have been produced on a small budget for academic research. You are welcome to use the materials for any non-commercial purpose. We have produced these analyses in good faith to the best of our abilities given the limited resources. As we have described above, you should be aware that there are some inaccuracies in the taggings. We bear no responsibility for any damaging consequences that may result from these.

We welcome further research to extend or correct these linguistic descriptions.