Tag: spelling standardization

The final aspect of standardization I will discuss is the forcing of common early modern spellings to their modern equivalents, decisions where the payoff of consistency outweighs slight data loss.

The VEP team decided to force bee > be, doe > do, and wee > we.

Naturally one can see the problems inherent to these forced standardizations.

Bee in early modern spelling can stand for the insect as well as the verb. Doe can likewise signify a deer or a verb, and wee can be either an adjective or a pronoun. We hypothesized for our drama corpus that 1) bee would overwhelmingly be the verb; 2) doe would overwhelmingly be the verb; 3) wee would overwhelmingly be the pronoun.

I searched for the first 1,000 instances of each spelling in the early modern drama corpus and in Key Words in Context.

I generated CSVs of the 1,000 hits of the spellings in question, including surrounding text to gain context and determine the word’s signification.

When I located a word that deviated from the meaning VEP projected the spelling would be associated with, I highlighted the entry and took notes in a column beside the line.

After I read through 1,000 instances of the spelling, I tallied the number of times the word did not match our hypothesized meaning.
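The sampling steps above can be sketched in a few lines of Python. This is only an illustration of the workflow, not the tooling I actually used: the function names, the tokenizing pattern, and the context window are all assumptions, and the sample sentence is invented for the demonstration.

```python
import csv
import io
import re


def first_hits(text, spelling, limit=1000, window=5):
    """Return up to `limit` keyword-in-context rows for `spelling`."""
    # Hypothetical tokenizer: keeps apostrophes and the TCP bullet inside words.
    tokens = re.findall(r"[\w'’•]+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == spelling:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # The empty last column is for hand-written notes on deviant senses.
            rows.append([left, token, right, ""])
            if len(rows) == limit:
                break
    return rows


def to_csv(rows):
    """Serialize the rows with a header, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["left", "hit", "right", "notes"])
    writer.writerows(rows)
    return buffer.getvalue()


sample = "Where the bee sucks there suck I and so shall it bee"
hits = first_hits(sample, "bee", limit=1000, window=3)
print(to_csv(hits))
```

Reading the rows and filling in the notes column still has to be done by hand; the script only gathers the hits in the order the text presents them.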

BEE > BE

| CORPUS | INSTANCES OF INSECT | INSTANCES OF SPELLING | PERCENTAGE OF ERROR |
| --- | --- | --- | --- |
| EM Drama | 17 | 1,000 | 1.7% |
| Key Words | 71 | 1,000 | 7.1% |

The insect sense of bee is more frequent in the first 1,000 hits of Key Words for reasons of genre. Key Words contains all of EEBO-TCP, including early dictionaries (Thomas Elyot) and husbandry texts (John Fitzherbert). Moreover, compilers like George Gascoigne recognized the metaphorical power of the bee’s work–travelling from flower to flower to make sweet honey–and used it as meta-commentary for their labor gathering the most delightful and edifying writing.

DOE > DO

| CORPUS | INSTANCES OF ANIMAL | INSTANCES OF SPELLING | PERCENTAGE OF ERROR |
| --- | --- | --- | --- |
| EM Drama | 0 | 1,000 | 0% |
| Key Words | 1 | 1,000 | 0.1% |

I further looked into variant spellings of the conjugation does in the drama corpus, to see how common the animal would be as opposed to the verb. Searching for does in the corpus yielded one instance of the animal in the first 1,000 instances of the spelling (0.1%). Searching for doe’s in the corpus yielded 158 instances of the spelling, all of which were the verb.

The above results suggest minimal data loss for standardizing all instances of doe to do in the drama corpus.

WEE > WE
It is harder to pin down figures for this decision.

Searching for wee in the early modern drama corpus, I identified 4 of the first 1,000 instances that were not the pronoun. One looked like it should have been well; another looked like an elision of God be with yee (God b’wee). The remaining two instances were French, which, standardized, should be oui.

The first 1,000 instances of wee in Key Words in Context contained too much noise to tally. It seems that the text you search in Key Words in Context doesn’t preserve the TCP notation for illegible characters, the bullet (•). In many places I had to look at the original TCP files to determine the signification of wee, because neither the pronoun we nor the adjective wee made sense. When consulting the files, I matched wee to words with illegible characters (e.g., we•e).

What do these standardizations mean for the drama corpus?
If you work on bee and deer imagery in early modern drama, you will want to look somewhere else. For the bee example, if the 17 in 1,000 instances of the spelling bee as insect holds steady over the 6,694 instances of bee in the drama corpus, that means ~114 of those 6,694 spellings of bee refer to the insect (6,694 × 0.017 ≈ 113.8). Overall, with an error rate of 1.7%, data loss in the corpus is minimal when the spelling bee is forced to be.
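The projection is simple arithmetic, sketched here in Python. The function names are my own, and the calculation assumes the rate observed in the first 1,000 hits holds across the whole corpus, which is an assumption rather than a measurement.

```python
def error_rate(deviant_hits, sample_size):
    """Share of sampled hits that did not match the hypothesized meaning."""
    return deviant_hits / sample_size


def projected_count(deviant_hits, sample_size, total_instances):
    """Scale the sampled rate up to every instance of the spelling."""
    return round(error_rate(deviant_hits, sample_size) * total_instances)


rate = error_rate(17, 1000)                # bee as insect in the drama sample
insects = projected_count(17, 1000, 6694)  # 6,694 spellings of bee in total
print(f"{rate:.1%} error rate, about {insects} insect readings")
```

The unrounded product is about 113.8, so the estimate is a rough count rather than an exact one.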

Granted, I looked at the first 1,000 instances of spellings in the corpus and in Key Words. Consequently I reviewed inconsistent portions of these corpora. The VEP team decided the sampling was telling enough for the context of the drama corpus. Another inconsistency is the order in which the files were searched in Key Words and in the drama corpus. Key Words doesn’t provide the user with options for ordering the results, so the words are displayed in chronological order. For the drama corpus, files were searched from smallest to largest TCP file number. Overall, the frequency of significations suggests small margins of error for the standardizations of bee, doe, and wee within the corpus.

Optimizing VARD for the early modern drama corpus required “forcing” lexical changes to create higher levels of standardization in the dataset. Jonathan Hope gave me editorial principles to follow as we considered which words and patterns VARD should have been changing but wasn’t. We wanted to standardize prepositions, expand elisions, and preserve verb endings. Unfortunately, preserving Early Modern verb endings (-st, -th) would require an overhaul of VARD’s dictionary.

There were three routes I followed to force standardization: manually selecting variants over others to change confidence scores; marking non-variants as variants and inputting their standardized form; adding words to the dictionary.

For the early modern drama corpus, the VEP team identified two grammatical features for forced standardization. We decided to implement consistent spelling for pronouns, adverbs, and prepositions, and to expand elisions that would interfere with algorithmic analysis, like topic modeling. Granted, more could have been changed, but we erred on the side of caution to see how effective the changes would be overall.

After documenting the forced changes, I will discuss their implications for the dataset in the next entry.

Of course, you will want to check how your VARD installation handles these words. VARD keeps a running list of changes it makes, which silently trains the program as it executes. It is good practice to examine what VARD changes certain words to. Your datasets will be different than mine. These changes are based on blatant errors the VEP team located in the early modern drama corpus. Datasets, based on their content, have different curation needs.

*The dataset we use is interspersed with different languages, especially Latin. I had to add foreign words–like nunc and qui–to the dictionary to prevent skewing the frequency of certain vocabulary.

Since I have discussed how VARD behaves with character encoding and symbols, I will devote space to explaining how I tweaked VARD to standardize Jonathan Hope’s early modern drama corpus.

Given the size of Hope’s corpus, I needed to automate the process of comparing VARD’s output to the original play files. Erin Winter wrote a case-sensitive Python script that generated a CSV recording all of VARD’s changes and their frequencies. I compared the original words to VARD’s normalizations, looking at only the highest frequencies. I looked at unique spellings changed within the frequency range of approximately 46,000 to 100 times, which amounted to nearly 3,000 cases. (There were approximately 58,000 unique spellings in the corpus changed 10 or fewer times.) To offer a glimpse, here are the 10 most frequent VARD normalizations for the early modern drama corpus:

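| ORIGINAL | NORMALIZED | FREQUENCY |
| --- | --- | --- |
| haue | have | 45680 |
| selfe | self | 18473 |
| Ile | Isle | 16095 |
| loue | love | 15666 |
| thinke | think | 10450 |
| mee | me | 10437 |
| vpon | upon | 10287 |
| owne | own | 10205 |
| vp | up | 9704 |
| ’tis | it is | 9691 |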

The CSV tracking normalizations proved a painless way to identify where VARD needed a gentle push in another direction. Note Ile in the above table. Yes, England is an island (of which writers were aware), but 16,095 changes to Isle seemed suspect. When I looked at files with VARD-inserted XML tags, it became obvious those Iles should have been standardized to I’lls. There, VARD was simply wrong. (I will devote the next post to where VARD goofs–sometimes amusingly–in standardization.)
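For readers who want to build a similar tally, here is a minimal sketch of how such a CSV could be produced. It is not Erin Winter’s script. It assumes VARD’s tagged output wraps each change in a tag of the form `<normalised orig="…">…</normalised>`, so check that against your own output files first; the function names and the sample string are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Assumed tag shape; verify it against your own VARD output before relying on it.
TAG = re.compile(r'<normalised\s+orig="([^"]*)"[^>]*>([^<]*)</normalised>')


def count_changes(tagged_text):
    """Count (original, normalized) pairs, keeping case distinctions."""
    return Counter(TAG.findall(tagged_text))


def changes_to_csv(counts):
    """Write the pairs to CSV text, most frequent change first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["original", "normalized", "frequency"])
    for (original, normalized), freq in counts.most_common():
        writer.writerow([original, normalized, freq])
    return buffer.getvalue()


sample = (
    'I <normalised orig="haue" auto="true">have</normalised> seen it, and '
    '<normalised orig="Ile" auto="true">Isle</normalised> '
    '<normalised orig="haue" auto="true">have</normalised> more.'
)
counts = count_changes(sample)
print(changes_to_csv(counts))
```

Sorting by frequency puts suspect high-volume changes, like Ile to Isle, at the top of the sheet where they are hard to miss.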

By researching questionable corrections, I was able to formulate standardization rules more “aggressive” than those the program starts with. (You can locate the default rules in the file “rules.txt,” in VARD’s “training” folder.) These rules dictate modern letter substitutions for common early modern letter combinations. Examples of the rules are as follows:

| CHARACTERS | CHANGE TO | LOCATION IN WORD |
| --- | --- | --- |
| vv | w | Anywhere |
| ie | y | Anywhere |

Given the above rules, when VARD processes the word alvvaies, the program may suggest multiple variants: alwaies and alvvays. This produces competing spellings of the same word across the standardized documents, a problem that proliferates when VARD handles early modern prepositions and adverbs, and even words with apostrophes (e.g., ne’er, ne’re, and nev’r normalize differently; should the apostrophe be eliminated or maintained?).
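To see why competing variants arise, here is a toy sketch that applies every combination of the two example rules to a word. It is my own illustration of the idea, not VARD’s actual candidate-generation algorithm, and the names are hypothetical.

```python
from itertools import combinations

# Illustrative subset of letter-substitution rules: (find, replace).
RULES = [("vv", "w"), ("ie", "y")]


def candidate_spellings(word, rules=RULES):
    """Apply every non-empty combination of rules and collect the results."""
    candidates = set()
    for size in range(1, len(rules) + 1):
        for combo in combinations(rules, size):
            spelling = word
            for find, replace in combo:
                spelling = spelling.replace(find, replace)
            if spelling != word:
                candidates.add(spelling)
    return candidates


print(sorted(candidate_spellings("alvvaies")))
```

Each rule on its own yields one of the partial forms (alwaies or alvvays), and only the two together reach always, which is why the program needs some way of ranking candidates.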

My additions to “rules.txt” aided not only spelling standardization but also the expansion of elisions. The rules mainly gave VARD an extra push to handle early modern English morphemes. While “rules.txt” contains the rule that ie at the end of words can be changed to y, it didn’t have a rule to help with standardizing the common adverb ending lie. Here is a table of the rules I added:

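| CHARACTERS | CHANGE TO | LOCATION IN WORD |
| --- | --- | --- |
| cyon | tion | End |
| lie | ly | End |
| shyp | ship | End |
| t’ | *to_ | Start |
| th’ | *the_ | Start |
| tiue | tive | End |
| vn | un | Start |
| vs | us | Anywhere |
| ynge | ing | End |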

While not comprehensive, the rules definitely aided VARD’s efforts. Of course, entering rules is only one step of the process. For the rules you add to the dictionary, you must manually train VARD to implement them.

*A final word regarding the entries I made to expand the elisions t’ and th’ when they begin words. In the table above, I typed an underscore (_) to show that there is a space after to and the in the rules. VARD will recognize spaces in rule input. In the GUI the rule will be displayed with an underscore, but you do not type the underscores in. The rules worked, and the program properly expanded words after some manual training. It changed th’ambassador to the ambassador, and t’change to to change.
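The effect of the location column can be sketched as follows. This is a simplified model of how a positional rule rewrites a word, written for illustration only: VARD treats rules as one input among several when ranking candidates, whereas this sketch applies every matching rule unconditionally. The function names are my own.

```python
# (characters, change to, location) triples from the table of added rules.
# A real space stands where the table shows an underscore.
ADDED_RULES = [
    ("cyon", "tion", "End"),
    ("lie", "ly", "End"),
    ("shyp", "ship", "End"),
    ("t’", "to ", "Start"),
    ("th’", "the ", "Start"),
    ("tiue", "tive", "End"),
    ("vn", "un", "Start"),
    ("vs", "us", "Anywhere"),
    ("ynge", "ing", "End"),
]


def apply_rule(word, find, replace, location):
    """Rewrite `word` only where the rule's location constraint is met."""
    if location == "Start" and word.startswith(find):
        return replace + word[len(find):]
    if location == "End" and word.endswith(find):
        return word[:-len(find)] + replace
    if location == "Anywhere":
        return word.replace(find, replace)
    return word


def apply_rules(word, rules=ADDED_RULES):
    """Run the word through each rule in table order."""
    for find, replace, location in rules:
        word = apply_rule(word, find, replace, location)
    return word


for example in ["th’ambassador", "t’change", "vnkind", "friendshyp"]:
    print(example, "->", apply_rules(example))
```

Because the Start rules insert a space, one token in the original becomes two in the output, which matters if you later align the standardized text with the source by token position.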