Abstract

Machine learning and text mining offer new models for text analysis in the humanities by
searching for meaningful patterns across many hundreds or thousands of documents. In this study,
we apply comparative text mining to a large database of 20th century Black Drama in an effort to
examine linguistic distinctiveness of gender, race, and nationality. We first run tests on the
plays of American versus non-American playwrights using a variety of learning techniques to
classify these works, identifying those which are incorrectly classified and the features which
distinguish the plays. We achieve a significant degree of accuracy in this classification task
and find features that may provide interpretive insights. Turning
our attention to the question of gendered writing, we classify plays by male and female authors
as well as the male and female characters depicted in these works. We again achieve significant
results which provide a variety of feature lists clearly distinguishing the lexical choices made
by male and female playwrights. While classification tasks such as these are successful and may
be illuminating, they also raise several critical issues. The most successful classifications
for author and character genders were accomplished by normalizing the data in various ways.
Doing so creates a kind of distance from the text as originally composed, which may limit the
interpretive utility of classification tools. By framing the classification tasks as binary
oppositions (male/female, etc.), the possibility arises of stereotypical or "lowest common
denominator" results which may gloss over important critical elements, and may also
reflect the experimental design. Text mining opens new avenues of textual and literary research
by looking for patterns in large collections of documents, but should be employed with close
attention to its methodological and critical limitations.

Introduction

The Black stage has been an important locus for exploring the evolution of Black identity and
self-representation in North America, Africa, and many African diaspora countries.
African-American playwrights have examined many of the most contentious issues in American
history since emancipation — such as migration, exploitation, racial relations, racial violence,
and civil rights activism — while writers outside of the United States have treated similar
themes arising from the history of colonialism, slavery, and apartheid. Alexander Street Press
(ASP), in collaboration with the ARTFL Project, has developed a database of over 1,200 plays
written from the mid-1800s to the present by more than 180 playwrights from North America, as
well as English-speaking Africa, the Caribbean, and other nations. The Black Drama collection is
remarkable for the wealth of information provided for each play as well as for the authors,
characters, and performance histories of these works. While such extensive metadata permits
sophisticated search and analysis of the collection, it also provides an environment that lends
itself well to experiments in machine learning and text mining.

Using the Black Drama data, we examine the degree to which machine learning can isolate
stylistic or content characteristics of authors and/or characters having particular attributes —
gender, race, and nationality — and the degree to which pairs of author/character attributes
interact. We attempt to discover whether lexical style or content markers can be found that
reliably distinguish plays or speeches broken down by a particular characteristic, such as
gender of character. A positive result would constitute strong evidence for distinctive
character voices, in this example male and female, in the sample of plays. Where distinctiveness
can be shown, we then seek some "characterization" of the differences found, in terms
of well-defined grammatical or semantic classes.[1] Secondarily, we attempt to see whether and
how plausibly comparative data mining tools could aid scholars doing literary critical work.
Beyond the statistical results, we examine features the various algorithms used to classify
texts and try to determine the degree to which they illuminate or deepen our understanding of
the texts in the database.

We find that comparative tools doing supervised learning are quite good at classifying plays
by American versus non-American authors. Even the outliers, or misclassified plays, demonstrate
that these algorithms are able to identify American writers by language usage with remarkable
reliability. We are slightly less successful when trying to distinguish the gender of author
and/or character. This gender experiment, however, does reveal differences in the ways male and
female authors and characters use language.

Equally important to the relative abilities of individual tools to classify texts, our
experiments have alerted us to potential concerns about data mining as a critical or
interpretive endeavor. Comparative classifiers quite powerfully lump together texts and objects
that they find "similar" or "dissimilar." However, as we go
forward, we have to try to develop results analysis that does not rely on simple binary
opposition. Framing comparative tasks based on existing, often binary categories, can lead to
results which have a distinctly stereotypical or "lowest common denominator"
feel. Application of these new technologies in humanistic research requires that we
understand not only how the tools work, but also that we bring critical evaluation to bear on
the questions we ask, the tasks we ask the tools to perform, and the results obtained.

The Black Drama Database

The Black Drama collection developed by Alexander Street Press has, at the time of this work,
over 1,200 plays by 181 primary authors containing 13.3 million words, written from the middle
of the 19th century to the present, including many previously unpublished works [BLDR 2005]. As
might be expected of a collection of texts from a particular literary genre, this collection
cannot in any way be considered a "random" or statistically representative sample of Black
writing in the 20th century. Rather, it reflects editorial decisions and includes key works from
a number of American and non-American artistic movements, such as the Harlem Renaissance, Black
Arts Movement, and Township Theatre, as well as the principal works by many critically acclaimed
authors whom the editors consider, not unreasonably, the most important playwrights. The
database contains 963 works by 128 male playwrights (10.8 million words) and 243 pieces by 53
female playwrights (2.5 million words). The most important authors in the collection include
Langston Hughes (49 plays), Ed Bullins (47), OyamO (43), and Willis Richardson (41). Plays by
Americans dominate the collection (831 titles), with the remaining 375 titles representing the
works of African and Caribbean authors. The database contains 317,000 speeches by 8,392 male
characters and 192,000 speeches by 4,162 female characters. There are 336,000 speeches by 7,067
black characters and 55,000 by 1,834 white characters with a smattering of speeches by other
racial groups. As would be expected, the predominance of American authors is reflected in the
nationalities of speakers in the plays. 272,000 speeches are by American characters and 71,000
by speakers from a variety of African nations.

Like other Alexander Street Press data sets, the Black Drama collection is remarkable for its
detailed encoding and amount of metadata associated with authors, titles, acts/scenes,
performances, and characters. Of particular interest for this study are the data available for
authors and characters which are stored as "stand-off mark-up" data tables. The
character table, for example, contains some 13,360 records with 30 fields including name(s),
race, age, gender, nationality, ethnicity, occupation, sexual orientation, performers, if a real
person, and type. Even more extensive information is available for authors and titles. The
character data are joined to each character speech, giving 562,000 objects that can be queried
by the full range of character attributes.

The ARTFL search system, PhiloLogic [PhiloLogic 2007], allows joining of object attribute
searches, forming a matrix of author/title/character searching. For example, one can search for
words in speeches by female, black, American characters depicted by male, non-American authors
in comedies first published during the first half of the 20th century. Given this architecture,
designed to support traditional digital humanities research tasks, user-initiated full-text word
searches on author and character attributes can help scholars examine specific questions about
language use and character depiction. Initial work, for example, on racial epithets in this
collection reveals striking differences in the use of such language between male and female
authors and characters as well as characters of different races. The plays feature extensive use
of racial epithets, especially the frequently appearing "n-word" and its
variants. The 5,116 (3.8/10000 words) occurrences of this slur appear in just under 1 in 100
dialogue shifts in the collection and in almost half of all the plays (515). Its extensive use
by a substantial majority (119/181) of playwrights in this collection suggests that it has had
an important role in the representation of Black experience in the past century.[2] An
examination of frequencies suggests that its use is marked by a variety of social factors, such
as gender of author and gender/race of character. Male playwrights use the
"n-word" twice as frequently as female authors (4.2 vs 2.1/10000 words).
Similarly, male characters overall are depicted using this slur almost twice as frequently as
female characters (9.5 vs 5.0/1000 speeches). However, factoring author gender into the equation
changes the rates somewhat. While male playwrights still represent the genders using it at
roughly a 2-to-1 male-to-female ratio (10.3 vs 5.4/1000 speeches), female authors depict female
characters using it at a rate more closely equal to that of males (4.8 male vs 3.5 female/1000
speeches). This leveling of comparative rate may be an artifact of the moderate preference of
female authors to represent female characters, as just over a third (34.7%) of speeches in plays
by male authors are female characters, while female playwrights allocate slightly more than half
(52.1%) of speeches to female characters. Similar gender distinctions are also apparent in
representation of character race. White characters account for 14% of speeches overall (10.5% in
plays by female authors). Male authors represent white characters using this racial slur at just
under twice the rate of black characters (15.1 vs 8.6/1000 speeches), while female authors
represent this distinction at a 5 to 4 ratio (5.4 vs 4.3/1000 speeches).

While illustrative, such "micro-studies" based on standard full-text searches
for specific words or patterns can do little more than hint at larger discursive and
representation issues, such as differences between male and female writing. We believe that the
new generation of machine learning tools and text data mining approaches have the potential to
reveal more general variations in language use because they look for patterns of differential
language use broken down by various combinations of author and character attributes.

Philomine: Machine Learning and Text Mining

Machine learning and text mining techniques are commonly used to detect patterns in large
numbers of documents. These tools are often used to classify documents based on a training
sample and apply the resulting model to unseen data. A spam filter, for example, is trained on
samples of junk mail and real mail, and then used to classify incoming mail based on the
differences it found in the training set. In our application, we already know the
classifications of interest, such as author gender and race, for the entire collection. Thus, we
apply supervised learning techniques[3] with three rather different objectives: to test whether
we can build a training model to classify texts accurately, to identify individual texts which
are incorrectly identified after training, and to isolate a set of features[4] which are
strongly associated with a particular class. To test the accuracy of a training model, we use
n-fold cross-validation, which trains on some of the sample and then attempts to predict the
remaining portion of the data set.[5] The machine learning algorithms employed for this study,
several implementations of Multinomial Naive Bayes and Support Vector Machine learners, build
models by assigning weights to features.[6] Features with high weights, positive or negative,
are the most influential in identifying a text as belonging to one class or another. Thus,
"viagra" would be assigned a high feature weight in a model underlying a spam
filter. In a classification task that results in a high success rate, misclassified documents
draw particular interest because these failures of the model often point to literary or
linguistic elements which distinguish the outliers from the mass of correctly classified
documents.

To support text mining experimentation, we have implemented a set of machine learning
extensions to PhiloLogic, our full text analysis system, called PhiloMine.[7]
PhiloMine allows the interactive construction of machine learning and text mining tasks, and
provides links back to documents and features which facilitate further examination. The
interactive environment allows users to select and modify the features used for any task in a
variety of ways. Feature selection and manipulation may be the most important aspects of text
mining, since the accuracy and validity of classification tasks arise from careful feature
selection. For example, basing a strong classification of American and non-American texts on
orthographic variations (color/colour) is certainly effective, but of little use for exploring
more interesting differences. In addition to PhiloMine, we also use stand-alone text mining
applications in cases where we require more sophisticated text sampling or other functions not
supported in the interactive environment.

Black Nationality and the Diaspora

The diverse Black Drama corpus is a useful collection for examining the stage throughout the
Anglophone black diaspora, allowing specific focus on the impact of colonialism and
independence, and for comparing African-American plays with works from other cultures. The
collection contains 394 plays by American and 303 plays by non-American playwrights written
during the period we are studying, 1950-2006. In this experiment, we tested the degree to which
we could distinguish between American and non-American plays. To this end, we generated an
excluded feature list of words with common spelling differences — such as
"color/colour,"
"center/centre," etc — that would have had an impact on results. We further
excluded words or names that might appear frequently in a small number of texts, limiting
classification to features present in fewer than 80% but more than 10% of the documents. The
resulting feature list numbered approximately 4200 surface forms of words.

For this preliminary experiment, we achieved accuracy rates ranging from 85% to 92% depending on
the algorithm selected. Specific classifiers yielded slightly different accuracy rates, partly
because they weight features and function differently. Using the parameters described above with
PhiloMine, the Multinomial Naive Bayesian (MNB) classifier generally had high rates of success
distinguishing American and non-American plays, 88.8 percent correct with 84.4 percent correct
on cross-validation. Other classifiers achieved similar performance rates for this task. The
Weka [Weka3] implementation of MNB attained 91.4% training and 84.7% cross-validated accuracy
and the Weka SMO[8] achieved 94.0% cross-validated accuracy. To ensure that results were not
artifacts of the classifiers themselves, we ran random falsification tests, deliberately
randomizing the document instances to confuse the classifier [Ruiz and López-de-Teruel 1998]. As
would be expected, random falsification generally gave accuracy rates around 50%, such as 49.2
percent using MNB. Our tests revealed that there are significant differences between American
and non-American plays, and various learning algorithms can identify them reliably.
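The random falsification check can be approximated by shuffling the class labels and re-running cross-validation: if the original signal is real, accuracy on shuffled labels should collapse to chance. The sketch below runs on synthetic count data, not the Black Drama corpus.

```python
# Random falsification sketch: compare cross-validated accuracy on the
# true labels against accuracy after the labels are shuffled. Synthetic
# data stand in for the document/word-count matrix.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic word-count matrix: class 1 uses feature 0 heavily, class 0 does not
n = 200
X = rng.poisson(lam=1.0, size=(n, 20))
y = np.array([0] * (n // 2) + [1] * (n // 2))
X[y == 1, 0] += 5  # inject a real class signal

true_acc = cross_val_score(MultinomialNB(), X, y, cv=5).mean()

y_shuffled = rng.permutation(y)  # falsification: break the label/text link
shuffled_acc = cross_val_score(MultinomialNB(), X, y_shuffled, cv=5).mean()
```

As in the experiments above, the shuffled run hovers near 50% accuracy while the true-label run stays high, confirming that the classifier's success is not an artifact of the algorithm itself.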

Splitting the time period in 1984, we found some indication that American and non-American
plays might be slightly less distinguishable over the past twenty years. Table One shows the
differences in accuracy rates for the earlier period (236/178 American/non-American) and the
later period (158/125).

             MNB         WekaSMO   WekaNB
             raw/xval    xval      raw/xval
1950–1984    93.7/87.6   95.7      96.9/90.8
1985–2006    87.6/81.0   88.0      96.1/82.3

Table 1. Percentage Accuracy by Period/Test

The somewhat lower accuracy for the classification of plays in the later period
suggests that these discourses might be merging to an extent. But one should not exclude the
possibility that smaller sample sizes may have had some impact on this task.

The efficacy of this classification task is matched, to some degree, by the less than
startling features most strongly associated with the two bodies of texts. Appendix One shows the
top 200 features most predictive of works by American and non-American playwrights. The American
plays are marked by their use of slang, references to some place names, and orthographical
renderings of speech. The state names suggest a Southern, rural backdrop to many plays, but the
terms "hallway" and "downtown" have a decidedly urban feel. The
top features of non-American authors had very few slang words and comparatively fewer words that
reflect spoken language. Many, in fact, belong to a traditional social sphere or reveal more
formal attitudes toward government, as noted by terms like "crown,"
"palace,"
"politicians" and "corruption." The features assigned the highest
probabilities in this classification task strike the casual observer as both expected and
reasonable.

While an extended discussion of the features isolated in this experiment is beyond the scope
of this paper, a couple of observations are warranted. First, the number of features needed
for a successful classification task is surprisingly small. We selected the top 30
features from the lists for American and non-American playwrights (a standard PhiloMine
function):

Based on this tiny feature set, we achieved
similar cross-validated classification performance: MNB 90%; Weka Bayes 85.5%; and Weka SMO
88.8%. While effective on a technical level, few critics would find this list to be a sufficient
way to characterize the differences between American and non-American Black drama. The second
observation is that different classifiers appear to give weights to remarkably different
features. For example, the Weka Naive Bayesian classifier generates a single list of features
that looks very little like those above. One sees no place names, no "speech"
words, and few terms that stand out on their own as necessarily African, traditional, urban, or
Southern:

While all three of these
algorithms produced strong results classifying authors as American and non-American, the
different features they rely on raise questions about their usefulness for literary scholars
studying text collections.

In the case of these three tests, the MNB classifier's feature sets are the easiest to grasp
immediately. They point to differences in speech patterns between Americans and non-Americans,
different social structures, and different environments. The feature set of the Weka Naive
Bayesian classifier makes very little intuitive sense. Why is it that the word
"thinks," either by itself or in conjunction with other terms, most effectively
distinguishes a play as American or non-American? To a lesser degree, the Weka SMO classifier's
feature set suffers from the same liability. These last classifiers do not give the easy access
into the plays that the MNB does. The user can look at the feature set of that classifier and,
almost immediately, begin thinking about issues like the migration of black Americans from the
rural South to Northern cities. The drawback to the immediacy of this feature set is that it
also has a "stereotypical" feel. Plays by non-American black authors, following a
simple-minded reading of the MNB result set, might be thought to take place in villages where
chiefs and elders buy wives with goats and cattle. In this sense, the features tend to obscure
any nuance in either corpus because the algorithm itself tends to be very selective as it
generates a model.

Beyond feature lists, an examination of incorrectly classified documents demonstrates the
utility of comparative classification. In most PhiloMine tasks, the system will identify
documents that are not correctly classified. As might be expected, some non-American subgroups
were classified better than others. Thus, British, Australian, and Canadian authors were more
frequently misclassified in this task than African or Caribbean playwrights. Inspecting the
outliers was instructive. For example, eight plays by Kia Corthron were consistently
misclassified as American. Upon further review, the problem arose not from the algorithm but
from the metadata we provided it. The editors incorrectly identified this American playwright as
a Canadian. Because of the lexical features it found, the classifier decided that her plays, in
fact, should not be grouped with those of the non-American authors, as the metadata would have
had her, but with the American authors. This misclassification was actually a correct
classification that alerted us to the error in the metadata. Other outliers also bear further
inspection, which may raise further critical questions. The plays of Montserratian Edgar White
were correctly classified except for one, When Night Turn Day, which
is set in the Bronx. Similarly, American Joseph Walker was correctly classified in 9 of 10
cases, except for The Lion is a Soul Brother, set in "rural/jungle
West Africa." OyamO's Three Red Falcons?, set in rural Nigeria, was
incorrectly classified, but so too was his 1995 play, Dancing on the
Brink, which was set in Palo Alto. Examination of outliers in a generally successful
classification task may provoke new questions about why a particular text was not classified
properly.

Comparative text mining on the nationality of Black playwrights in the second half of the 20th
century shows clear distinctions between American and non-American authors. The algorithms
consistently achieve high levels of performance and generate distinguishing feature lists that
are of potential interest in helping to characterize topics, themes, and styles that mark each
group. Finally, the outliers or incorrectly classified authors and plays may lead to further
critical scrutiny, allowing for variations within the general classification.

It may be objected that our classification task itself is a trivial case, attempting to
confirm a distinction that is all too obvious to be of significant literary interest. A task
like this, however, provides a good test case for verifying how well classifiers are able to
function. We were able to check the accuracy of our results easily through the bibliographic
data. In the case of the outlier, Corthron, a web search confirmed that she is, in fact,
American. Of course, features that reveal certain linguistic and dramatic differences between
the American and non-American plays might not be particularly surprising. Authors tend to place
characters in localities and use idiomatic language they themselves know. For example, an
American playwright is more likely to set a play in Mississippi and depict characters using a
Southern vernacular. However, the features are often able to bring out further, less obvious
stylistic and creative choices in operation across text corpora. It is beyond the scope of this
paper to address just what these could be. A scholar with a deeper knowledge of this body of
works could potentially examine the feature sets and find lexical patterns that point to larger
tendencies. In the end, the algorithms can only rely on and propose what they find to be
statistically significant. The scholar, as always, must decide the meaning of the results.

Mining Author and Character Gender

Classification tasks based on the gender of authors, characters, and a combination of the two
provide a more challenging test than classification by author nationality. Previous work
suggests that gender classification tasks are somewhat less accurate and that the feature sets
they generate are less obviously distinctive [Argamon et al. 2003], [Koppel et al. 2002]. The nature of these
documents, plays with male and female characters and with a significant amount of generic
language found in stage directions and other textual objects, increases the difficulty of
automatic classification. The issue of male authors writing female characters and vice versa
makes the task of classifying by gender more difficult, but also introduces the ability to test
the degree to which authors of one gender can effectively write characters of the other
gender.

The Black Drama collection contains 573 (82%) plays by male and 124 (18%) by female
playwrights written between 1950-2006. An initial classification using our default feature set
selection is deceptively accurate, with MNB returning 79.4% cross-validated accuracy and Weka
SMO 85.1% cross-validated. The Weka "confusion matrix" indicates that a
significant majority of male authors are correctly classified while performance on female
authors is less than 50%:

   a     b     ← classified as
 680    45     a = male
  85    64     b = female

Table 2.

The classifier performed only slightly better than if it had simply guessed
"male" for every document (82%). Therefore, to determine whether the system was
in fact finding a gender difference, we had to balance corpus sizes. PhiloMine supports a
standard corpus balancing option that randomly selects documents from the larger sample until it
has found a number equal to the smaller class. Since this is a random selection, effectively
comparing all of the female plays against a set of male documents, one needs to perform this
task numerous times to determine an average accuracy rate. Using the Weka NB and SMO functions
five times each, cross-validated results indicate a fair degree of success:

WekaSMO (cross-validated): 70.2%, 78.2%, 79.0%, 73.0%, 73.8%

WekaNB (cross-validated): 65.7%, 74.2%, 68.1%, 70.2%, 69.8%

Classifying author gender on randomly balanced test sets yielded average accuracies of roughly
70% for Naive Bayesian and 75% for Support Vector Machine. Similar performance, in the high 60s
to low 70s percent, was also generated by
the MNB function. Applying a random falsification test to one of these tasks we found an
accuracy rate of 49%. This suggests that, while it is not as strong as in the nationality
classification, a signal can be found that is not an artifact of the selected classifier or
other influences.
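The balancing procedure above, downsample the majority class at random, repeat several times, and average the cross-validated accuracy, might be sketched as follows. The synthetic documents stand in for the 573 male-authored and 124 female-authored plays; the weak injected signal is an assumption for illustration.

```python
# Sketch of random corpus balancing: repeatedly draw a majority-class
# subsample equal in size to the minority class, then average the
# cross-validated accuracy across runs. Synthetic data, not the corpus.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Imbalanced synthetic corpus: 573 majority vs 124 minority documents
X_maj = rng.poisson(1.0, size=(573, 30))
X_min = rng.poisson(1.0, size=(124, 30))
X_min[:, 0] += 2  # a weak class signal on one feature

accuracies = []
for _ in range(5):  # five runs, as in the experiment above
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    X = np.vstack([X_maj[idx], X_min])
    y = np.array([0] * len(X_min) + [1] * len(X_min))
    accuracies.append(cross_val_score(MultinomialNB(), X, y, cv=5).mean())

mean_acc = float(np.mean(accuracies))
```

Because each run draws a different majority-class subsample, no single accuracy figure is definitive; reporting the spread across runs, as the text does, is the appropriate summary.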

The classification tasks to this point have looked at entire plays, with no attempt to control
for expected skewing factors: stage directions, which would not be expected to carry as strong a
gendered writing style, and characterization, where authors depict characters of the other
gender. Controlling for such obvious skewing influences, particularly as a check following more
generic experiments, tends to provide more coherent results, but at the cost of creating
composite documents which do not reflect the structure and organization of the plays as
individual works.

For this experiment, we rebuilt the Black Drama collection under PhiloLogic/PhiloMine to
behave as a database of some 13,000 distinct objects (characters) containing 8.9 million words.
We eliminated stage directions, cast lists, and other such apparatus containing some 4.5 million
words, while combining individual speeches of each character into one object for faster
processing. This changed the unit of analysis from documents to composite character speeches
with associated metadata, including author attributes. For the period 1950-2006, there are 4,228
characters by male authors and 865 characters by female playwrights with more than 200 words.
For the same period, there are 3,226 male characters and 1,742 female characters.
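The restructuring step above, collapsing each character's individual speeches into one composite object carrying character and author metadata, might look roughly like this. The field names and sample speeches are invented; the actual PhiloLogic/ARTFL schema differs.

```python
# Sketch of building composite character documents: group every speech by
# (play, character), join the text, and attach gender metadata.
# Hypothetical fields and data, for illustration only.
from collections import defaultdict

# (play_id, character, character_gender, author_gender, speech_text)
speeches = [
    ("p1", "RUTH", "female", "male", "I told you before."),
    ("p1", "RUTH", "female", "male", "Don't start with me."),
    ("p1", "WALTER", "male", "male", "Listen to me now."),
    ("p2", "BERNICE", "female", "female", "That piano stays."),
]

composites = defaultdict(lambda: {"text": [], "meta": None})
for play, char, char_gender, auth_gender, text in speeches:
    key = (play, char)
    composites[key]["text"].append(text)
    composites[key]["meta"] = {"character_gender": char_gender,
                               "author_gender": auth_gender}

# One object per character, with all of that character's speeches joined
characters = {k: (" ".join(v["text"]), v["meta"])
              for k, v in composites.items()}
```

Each resulting object is a single queryable unit, which is what changes the unit of analysis from plays to composite character speeches.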

Classifying all the characters by author gender using the PhiloMine Bayesian function resulted
in 75.5% cross-validated accuracy. This general result was confirmed on 5 runs using the random
document balancing function (865 characters), with cross-validated accuracy rates of 72.8%,
72.1%, 71.0%, 72.2%, and 71.6%. Deliberately shuffling the character instances of randomly
balanced characters returned results approximating the expected 50% accuracy: 47.6%, 51.2%, and
50.8%. Similar results were obtained for classifications by character gender. Overall, for all
of the characters, MNB accurately classified 73.2% of the instances. Five runs of randomly
balanced characters (1,742) resulted in accuracy rates of 72.5%, 71.1%, 72.3%, 72.2%, and 71.4%.
Random falsification tests again approximate the expected 50% accuracy: 48.4%, 48.1%, and 50.4%.
Classification of author and character gender on composite character objects showed modest
improvement in accuracy. Further, we found that classifying on major characters (those speaking
more than 1,000 or 2,000 total words) again resulted in modest increases in accuracy.

This experiment suggests that, as we balance our authors and characters more rigorously,
essentially testing on a more and more abstract representation of the texts, our success rates
improve. We first extracted all speeches with character gender attributes from the corpus,
splitting them into tokenized word frequency vectors for all authors, all characters, male
authors, female authors, male characters, and female characters. For each of these, we used
SVM-Light[9] to build a model to identify authors and characters by gender. As indicated in
Table Three, the system correctly identified 88.2% of the authors' gender and 77.4% of the
speakers' gender.

Given:            Female    Female     Male      Male       Full      Full
                  Author    Speaker    Author    Speaker    Sample    Sample
Find gender of:   Speaker   Author     Speaker   Author     Speaker   Author
Accuracy          69.7%     83.1%      78.9%     88.6%      77.4%     88.2%
Majority Class    54.5%     74.5%      69.4%     84.7%      66.6%     81.3%

Table 3. Accuracy over 10-fold cross-validation

Performance varied when examining subsets of the corpus, from 88.6% for gender of
author in male characters to 69.7% for gender of speaker in female authors. The differences in
accuracy in male and female author/character might, however, result from the fact that male
authors tend to include fewer female characters, as we show in the Majority Class of Table Three,
which indicates the rate of male instances for each assessment. For female authors, male
characters constitute 54.5% of the speakers.

We then equalized a test sample for class by discarding instances in the majority classes
until we had a balanced set. As part of this process, we further corrected for number of words
in each character class by selecting character instances to balance the word frequencies
overall. As shown in Table Four, this test sample produced a balanced dataset by number of
instances and number of average words spoken by each character.
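One way to sketch this two-way balancing: equalize the number of instances per class while choosing which majority-class instances to keep so that average word counts also come out close. The greedy heuristic below is an illustrative assumption, not the procedure actually used, and the word counts are invented.

```python
# Sketch of balancing both instance counts and word counts: keep only as
# many majority-class instances as there are minority instances, picking
# those whose word counts best track the minority-class average.
# Hypothetical data and heuristic, for illustration only.
minority = [120, 300, 450, 800]                   # word counts, minority class
majority = [90, 150, 400, 500, 650, 900, 1500, 2000]

n = len(minority)
target_mean = sum(minority) / n

# Greedy choice: the n majority instances closest to the target mean
kept = sorted(majority, key=lambda w: abs(w - target_mean))[:n]
balanced_mean = sum(kept) / n
```

After discarding, both classes have the same number of instances and nearly the same average words per document, which is the property Table Four reports.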

Name:             Female    Female     Male      Male
                  Author    Speaker    Author    Speaker    Author    Speaker
Given:            Female    Female     Male      Male
                  Author    Speaker    Author    Speaker
Find gender of:   Speaker   Author     Speaker   Author     Speaker   Author
Accuracy          71.9%     82.8%      80.3%     83.7%      79.9%     86.4%
Average words/document:
  Male            770.1     877.8      853.4     645.2      859.4     750.8
  Female          777.9     877.9      853.0     645.0      859.4     750.8

Table 4. Accuracy over 10-fold cross-validation

The creation of a more controlled data set discriminated author and character gender
effectively, achieving cross-validated accuracies of between 72 and 86 percent. Author gender
was identified more accurately and consistently than speaker gender. Table Five demonstrates
that male authors/speakers are correctly identified somewhat more often than their female
counterparts in five of the six cases, the sole exception being characters in female-authored
plays, where identification rates are roughly the same for males and females.

Given:            Female Author   Female Speaker   Male Author   Male Speaker   Full Sample   Full Sample
Find gender of:   Speaker         Author           Speaker       Author         Speaker       Author
Male Accuracy     71.8%           85.7%            81.7%         86.6%          81.1%         88.9%
Female Accuracy   72.1%           79.9%            78.8%         80.8%          78.6%         83.9%

Table 5.

This suggests that male language in Black Drama is somewhat more marked than female
language, both at the level of authors and at the level of language represented by how authors
write characters.

Machine learning systems are clearly able to identify gender of playwrights and their
characters with impressive accuracy. We have found that on raw running texts, the systems can
reliably identify author gender at rates between the high 60s and mid-70s percent accuracy. And,
as the plays are processed in various ways to eliminate potential skewing factors — such as
unbalanced subsets, extraneous textual data (e.g. stage directions), and differences in raw word
counts — classification performance increases. This increase in performance comes, however, at
the cost of increasing the distance from the texts themselves.

Given the ability to classify author and character genders, we will now return to the texts
briefly to examine the features most characteristic of gendered writing in Black Drama. Appendix
Two shows the top 200 features as measured by Bayesian probability ratios, broken down by male
and female playwrights without respect to character gender. The features suggest a rather
traditional set of gender distinctions. Male authors tend to focus on legal/criminal issues
(officer, gang, pistol, jail, etc.); numerous obscenities and slurs (bullshit, nigger(s),
goddamn, fuck, shit); music (band, drum, leader, drums, spiritual, player, jazz); and money
(dollars, price, cents, cash, etc.). Female playwrights of this period tend to privilege issues
concerning family/home (child, stories, hug, mama, girls, birth); emotive states (smiling,
imagine, memories, memory, happiness, happy); descriptions (handsome, lovely, grace, cute,
courage, loving, ugly); and women (herself, girls, she, female, lady, women, her). The
representation of traditional gender roles is most notable in the characterizations by
non-American male authors. As shown in Appendix Three, male characters are given very clear
public roles, with the top eight words by probability ratios being "chief, order,
government, lead, power, position, country, land" while the female characters are
limited to the private realm ("husband, dress, shame, marry, doctor, married, please,
parents").
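One plausible reading of this feature ranking is sketched below as a ratio of smoothed relative frequencies between two corpora. The paper does not give its exact formula for the Bayesian probability ratios, so this implementation is an assumption, and the two word lists are invented for illustration.

```python
from collections import Counter

def probability_ratios(corpus_a, corpus_b, smoothing=1.0):
    """Rank words by the ratio of their smoothed relative frequencies in
    corpus_a versus corpus_b -- one plausible reading of the paper's
    'Bayesian probability ratios' (the exact formula is not given)."""
    counts_a, counts_b = Counter(corpus_a), Counter(corpus_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    ratios = {
        w: ((counts_a[w] + smoothing) / total_a) /
           ((counts_b[w] + smoothing) / total_b)
        for w in vocab
    }
    # Highest ratios first: words most characteristic of corpus_a
    return sorted(ratios, key=ratios.get, reverse=True)

male = "pistol jail gang dollars pistol cash".split()
female = "child mama memories happy child lovely".split()
top = probability_ratios(male, female)
print(top[0])   # -> pistol (most characteristic of the first corpus)
```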

Gender characterization among American authors writing between 1950-2006 (Appendices 4 and 5)
provides evidence that men and women depict characters slightly differently. The feature sets
are rather similar: both contain numbers (men generally do counting) and share terms as varied
as sir, american, power, bitch, country, and killed. But the male list is noticeably coarser,
with more profanity, the term "nigger(s)," and more references to law enforcement
and violence. With only a few exceptions, the female character feature lists have basically the
same domestic tenor. In contrast to male characters, female characters apparently use very
little profanity and seem to be much less involved in struggles with public authorities. Of
course, these lists only reveal generalities. The features are differential frequencies. There
might in fact be foul-mouthed female characters in the corpus who are active in the public
sphere. But those characters would probably be exceptions. The degree to which these lists
reveal true differences among black American male and female authors is a matter for discussion.
The important thing is that the mining algorithm gives fuel to the discussion and serves as a
starting point for closer textual study.

This same character gender classification test on non-American authors yields feature sets
suggesting even more disparate depictions of the sexes than among American authors. Appendices 6
and 7 show that, for authors of both sexes, male characters inhabit the public sphere, their
discourse deals with leadership, and they are more likely to use grammatically precise terms
like "which" and "whom." Female characters' language, again,
primarily centers on domestic concerns. Distinctions between male and female authors' use of
language in the depiction of their characters are few. One of the striking differences, however,
is that only the male characters written by female authors in this data set use scatological
language at a significant rate. And comparing the results from the American and non-American
tests highlights the different concerns for these characters who inhabit different cultures.

Classification of texts by gender of playwrights, characters, and the combination of the two
is a more difficult test than by nationality. Results range from accuracy in the high 60s
percent for plays as they are found to the mid 80s for carefully constructed samples extracted
from the texts. It is also clear that the features used to construct classifier models might be of
interest to literary researchers in that they identify themes and language use that characterize
gender distinctions. Of course, men talk more of wives than women and only women tend to call
other women "hussies," so it is hardly surprising that male and female
authors/characters speak of different things in somewhat different ways. The features suggest,
however, that we are finding "lowest common denominators" which distinguish male
from female, but which may also privilege particular stereotypes. The unhappy relationship of
Black American men with the criminal justice system or the importance of family matters to women
are both certainly themes raised in these plays. The experimental design itself, using
classifiers to detect patterns of word usage which most distinguish the genders, may bring to
the forefront literary and linguistic elements which play a relatively minor role in the texts
themselves.

Conclusion

We have found that, although algorithms can in fact detect differences in lexical usage to a
striking degree and output feature sets that characterize differences between corpora of data,
human scholars must still do the work of scrutinizing results and, more importantly, decide how
best to develop these tools for humanities research. Fundamentally, automatic classifiers deal
with general features and common traits. In contrast, in recent decades, literary criticism has
focused on the peripheries, often looking at the ways understudied works by authors from
underrepresented groups work within a larger cultural context. Literary critics have tried to
nuance general understanding about what the mainstream is and how it works. As we mentioned
above, a danger in framing comparative tasks based on binary oppositions is that doing so can
produce simplistic or stereotypical results. Furthermore, given the power of classifiers, we
might always be able to prove or detect some binary opposition between two groups of texts. And
so the task before us, if we want to develop tools to aid literary criticism, is to try in some
way to respond to the values driving scholarship currently and, as those values change, continue
to take them into account. We must also keep in mind that measures of success and procedure are
often different for computer scientists and literary critics. For example, using only 60 total
feature terms, 30 per corpus, we can classify the Black Drama texts as American or non-American
with approximately 90% accuracy. Distinguishing differences on such a small number of words is
an impressive technical feat, to be sure. But to a literary scholar, such a circumscribed way of
thinking about creative work may not be terribly fruitful. As we go forward, we will have to try
to bridge gaps such as these. Our success in this endeavor will, of course, depend upon close
collaboration between those building the data mining tools and those who will finally use
them.

Notes

[1]The experimental protocol which we have been developing for this purpose, as applied by,
e.g. [Argamon et al. 2003], addresses both goals using techniques from machine learning,
supplemented by more traditional computer-assisted text analysis.

[2]It is important to mention that this term occurs much more frequently than less offensive
terms like negro* (3351) or positive expressions such as afr.* americ.* (230) in this sample.

[3]Supervised machine learning is a process for building a classifier from labeled training
data. For example, one might present a learner with two sets of e-mails, identified by human
evaluators as either spam or not spam. The learner generates a statistical model which best
captures this distinction. Following training, the model generated by the learner is used to
classify unseen instances, such as incoming e-mail. Supervised learning is contrasted with
unsupervised learning, which is not based on preidentified class labeled training data. For
example, clustering algorithms use a variety of measures of similarity to assign instances to
groups.

[4]
"Features" in this context means the data being used to perform the machine
learning task. Each instance, typically a document or part of a document, may have an arbitrary
number of features which may include word, lemma or word sequence (n-grams) frequencies as well
as other elements of the data which can be computed, such as sentence length or part-of-speech
frequencies.

[5]Cross validation is a technique to evaluate the likelihood that the machine learning model
being generated by a learner is "over-fitted" to the training data, which would limit the
effectiveness of the classifier to properly handle unseen instances. Over-fitting will also
tend to weight relatively unrepresentative features too highly. Cross validation is performed
by subdividing the training data into random groups (often 10), training on some of these
groups and evaluating the predictions on the remainder.
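The fold construction described in this note can be sketched in pure Python; a real experiment would use a toolkit's implementation, and the simple round-robin assignment here is only one way to form random groups.

```python
def k_fold_indices(n_items, k=10):
    """Split item indices into k roughly equal held-out folds
    (round-robin assignment; shuffling beforehand gives random folds)."""
    folds = [[] for _ in range(k)]
    for i in range(n_items):
        folds[i % k].append(i)
    return folds

folds = k_fold_indices(25, k=10)
# Each fold serves once as the evaluation set; the rest train the model.
for held_out in folds:
    train = [i for i in range(25) if i not in held_out]
    assert len(train) + len(held_out) == 25
```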

[6]Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) are commonly used
techniques in the machine learning community. MNB is based on calculating the prior
probabilities of each feature in two or more classes. Thus, "viagra" in an MNB spam filter
would be assigned as very high probability of being in a spam e-mail. For an unseen instance,
the system calculates the probabilities for each feature being a member of a particular class,
adding up all of the probabilities to assign one or more classifications to the new instance.
SVMs are somewhat more complex, as they attempt to divide training data into maximally
separated groups by adjusting feature weights.
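The note's spam example can be sketched as a tiny multinomial Naive Bayes classifier. All data, function names, and the add-one smoothing choice here are hypothetical illustrations, not the systems used in the study.

```python
import math
from collections import Counter

def train_mnb(docs_by_class, smoothing=1.0):
    """Estimate smoothed log word probabilities per class from tokenized
    training documents (a toy multinomial Naive Bayes trainer)."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values()) + smoothing * len(vocab)
        model[label] = {w: math.log((counts[w] + smoothing) / total)
                        for w in vocab}
    return model

def classify(model, doc):
    # Sum the log probabilities of each word per class and pick the
    # most probable label (words outside the vocabulary are ignored).
    return max(model, key=lambda label:
               sum(model[label].get(w, 0.0) for w in doc))

model = train_mnb({
    "spam": [["viagra", "free", "offer"], ["free", "viagra"]],
    "ham":  [["meeting", "tomorrow"], ["project", "meeting", "notes"]],
})
print(classify(model, ["viagra", "offer"]))  # -> spam
```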

[7]Open source distribution and demonstrations
available at [PhiloMine 2007]. PhiloMine makes use of a number of important
open source packages and modules which are listed in the acknowledgments on the site.

[8]The Weka3 system refers to Sequential Minimal Optimization (SMO) which is one of a number
of heuristics to allow Support Vector Machines to perform reasonably quickly and effectively.

[9]We conducted these experiments using a stand-alone parallelized implementation of the
SVM-Light system[10] with PGPDT [Zanni et al. 2003],
[Zanni et al. 2006] on the University of
Chicago Teraport.

PhiloLogic 2007
PhiloLogic 2007: Open source distribution, documentation and
demonstrations available at http://philologic.uchicago.edu/

PhiloMine 2007
PhiloMine 2007: Open source distribution and demonstrations
available at http://philologic.uchicago.edu/philomine/. PhiloMine makes use of a number of important
open source packages and modules which are listed in the acknowledgments on the site.