Saturday, April 12, 2014

Finding a minimal set of English vocabulary with WordNet, Perl, and GLPK integer programming

Preface:
This was my semester project for the class "Principles of Optimization", which was one of my absolute favorite classes that I took in college. It uses WordNet, a fantastic database for looking at relationships between English words (it seems to be the basis of many online tools, such as the Visual Thesaurus that is constantly advertised on dictionary.com; it was fun to see how the graphs I made for this project exactly matched the content of Visual Thesaurus, a website that seems to be just begging for an open-source ripoff...). To access WordNet, I used the Perl library WordNet::QueryData, which seemed to work quite well. For the integer programming I had a much harder time: I used the library Math::GLPK, which has bugs that confused me for quite a while (I wish I remembered what they were, but I don't; all I remember is a general feeling of frustration with the library). If I were to redo this project, I'd take a different approach, maybe writing LP text files and solving them with the GLPK command-line interface, or maybe using Python to interface with an IP solver. I used Cytoscape to generate the network graphs.

For those just here for example code for WordNet::QueryData or Math::GLPK, you can find it in the Perl file that you can download here. All others, read on!

Finding a minimal set of English vocabulary

Introduction:

English is a language with a huge number of words. WordNet is a database of words, word senses, and other information about English lexicography and syntax. WordNet contains definitions for 147,306 words, with each word having one or more senses. Senses are grouped into ‘synsets’: a synset is a group of senses that share a definition. Because a word can have more than one sense, it can be a member of multiple synsets.

The set of all synsets comprises the lexical expressivity of the language. If all of the words contributing to a synset were to be removed from English, either a new word would have to be added, or the same meaning would have to be conveyed by a string of words, or by some non-lexical device (such as word order, word stress, or wild gesticulations). In this project, I develop a technique for using integer programming to find a minimal set of words from a vocabulary such that every synset is represented by at least one word. I term such a subset a “synsetically complete lexical subset” or “SCLeSub”. A language's SCLeSub contains a word for every idea that the whole language has a word for, but it has fewer total words; thus the SCLeSub has a higher ideas-per-word ratio than the parent vocabulary.

Goals:

To use integer programming to explore the properties of the English lexical space as modeled by the WordNet database, and to answer the questions:

1. How many words can
be eliminated from English without losing any senses (i.e. synsets)?

2. Does any part of
speech contain more redundant vocabulary than other parts of speech?

3. Is the SCLeSub of
English unique, or are there multiple alternative SCLeSubs?

4. Does translating
ordinary prose into words found only in the SCLeSub affect the
readability of the text?

Rationale:

The questions posed in
the Goals section are certainly interesting from a linguistic
perspective, and the answers will tell us something about the
structure of English vocabulary. They may also be useful from a
practical perspective. For example, a writer who restricts himself to only words in a SCLeSub will produce works with a greater number of
puns and multiple interpretations than a writer not following a
SCLeSub. Adherence to a SCLeSub is therefore an easy, even
mechanical, way to make an author's writing seem more nuanced.
Furthermore, starting from the output generated in this study,
software could easily be developed to partially automate the process
of translating existing English language texts into language adhering
to a SCLeSub.

The Integer Program:

To answer the questions listed under the Goals section, I wrote a Perl script to query the WordNet database, generate the IP matrix and constraints, call GLPK to solve the problem, and then format and print data about the solution. To query WordNet, I used the library “WordNet::QueryData” (http://search.cpan.org/dist/WordNet-QueryData/QueryData.pm); to call GLPK from Perl, I used the library “Math::GLPK” (http://qosip.tmit.bme.hu/~retvari/Math-GLPK.html). Math::GLPK is a little bit buggy and has some undocumented behavior, so I don't really recommend it; even so, I think it's the best option for calling GLPK from Perl.
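
As a taste of the WordNet side of the script, here is a minimal sketch (not the actual script linked above) that collects the word-synset edges for a single part of speech. The method names come from my reading of the WordNet::QueryData documentation, so double-check the details against the module's docs:

#!/usr/bin/perl
use strict;
use warnings;
use WordNet::QueryData;

# Open the WordNet database (QueryData finds the dict directory via $WNHOME,
# or you can pass the path to new() explicitly).
my $wn = WordNet::QueryData->new;

my $pos   = "n";    # n = noun, v = verb, a = adjective, r = adverb
my @words = $wn->listAllWords("noun");

# %edges holds one key per word-synset edge; %synsets remembers every synset seen.
my (%edges, %synsets);
for my $word (@words) {
    # querySense("word#pos") returns the word's senses, e.g. "duck#n#1", "duck#n#5", ...
    for my $sense ($wn->querySense("$word#$pos")) {
        # offset() identifies the sense's synset; prefix the part of speech so
        # noun and verb synsets with the same offset number stay distinct.
        my $synset = $pos . ":" . $wn->offset($sense);
        $edges{"$word\t$synset"} = 1;
        $synsets{$synset} = 1;
    }
}

printf "%d words, %d synsets, %d word-synset edges\n",
    scalar @words, scalar keys %synsets, scalar keys %edges;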

For the first two questions, the data can be represented by a bipartite graph with the two node types being words and synsets, and edges only connecting words and synsets. An edge is drawn between a word and a synset if that word can mean the idea represented by the synset (fig. 1). In terms of this representation of the data, a SCLeSub can be found by removing as many word nodes as possible such that each synset node is still attached to at least one word node (fig. 2). The problem can be represented as an IP of the following form:

There is one decision variable per word. The decision variables are Boolean: a value of 1 means that the word is kept in the network, and a value of 0 means that the word is thrown out of the network. The objective function is to minimize the sum of all of the decision variables. There is one constraint for each synset, which says that the sum of the decision variables representing the words that can mean the idea represented by that synset must be at least 1.

Or, more formally:

Program 1:
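
Spelled out in LaTeX from the description above, with N the number of words and J the number of synsets:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{n=1}^{N} x_n \\
\text{subject to} \quad & \sum_{n=1}^{N} r_{jn}\, x_n \ge 1, \qquad j = 1, \ldots, J \\
& x_n \in \{0, 1\}, \qquad n = 1, \ldots, N
\end{aligned}
\]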

where r_jn is 1 if word n can take the meaning of synset j, and 0 otherwise.
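
As mentioned in the preface, if I were redoing this I would skip Math::GLPK and just write Program 1 out as an LP text file for the glpsol command-line solver. A rough sketch of that route, with a made-up toy %members table standing in for the real word and synset data pulled from WordNet:

#!/usr/bin/perl
use strict;
use warnings;

# Toy data: synset id => words that can express it (made-up groupings, just so
# the sketch runs; the real tables come from the WordNet queries above).
my %members = (
    "n:toy1" => [ "duck", "waterfowl" ],
    "v:toy2" => [ "duck", "dodge" ],
);

# Assign a column number to every distinct word.
my (%word_id, $n);
for my $words (values %members) {
    $word_id{$_} //= ++$n for @$words;
}

open my $lp, ">", "sclesub.lp" or die "sclesub.lp: $!";

# Objective: minimize the number of words kept (Program 1).
print {$lp} "Minimize\n obj:\n";
print {$lp} " + x$_\n" for sort { $a <=> $b } values %word_id;

# One covering constraint per synset: at least one of its words must be kept.
print {$lp} "Subject To\n";
my $row = 0;
for my $synset (sort keys %members) {
    my @terms = map { "x$word_id{$_}" } @{ $members{$synset} };
    printf {$lp} " cover%d: %s >= 1\n", ++$row, join(" + ", @terms);
}

# Declare every decision variable binary.
print {$lp} "Binary\n";
print {$lp} " x$_\n" for sort { $a <=> $b } values %word_id;
print {$lp} "End\n";
close $lp;

# --lp: the file is in CPLEX LP format; -o: write the solution report here.
system("glpsol", "--lp", "sclesub.lp", "-o", "sclesub.sol") == 0
    or die "glpsol did not exit cleanly";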

Figure 1: A section of a bipartite graph of adverbs and adverbial synsets. Red nodes are words. Blue nodes are synsets.

Figure 2: The same graph as figure 1, but with colors indicating membership in a SCLeSub. Green nodes are words that are retained in the SCLeSub. Red nodes are words that are thrown out of the network. Blue nodes are synsets.

Results:

1. How many words can be eliminated from English without losing any senses (i.e. synsets)?

To answer this question, it was necessary to solve five integer programs: one to find a SCLeSub for the entire language, and then one for each part of speech: noun, adjective, verb, and adverb. Table 1-A shows a summary of the solutions to those optimizations. From the table we can observe a number of interesting phenomena. As we might expect, the number of synsets in the language is equal to the sum of the number of synsets in the individual parts of speech. This is logical because it seems impossible for a word sense to be, for example, both an object and an action at the same time. The sum of the number of words from each part of speech, however, is larger than the number of words in the complete language. This also is logical, because one can think of many words that can take one of several parts of speech. The word “duck”, for example, is counted both under “Nouns”, where it can mean “small wild or domesticated web-footed broad-billed swimming bird usually having a depressed body and short legs”, and under “Verbs”, where it can mean “to move the head or body quickly downwards or away”. Table 1 shows that about 46 percent of the words in English can be eliminated without eliminating any senses.

2. Does any part of speech contain more redundant vocabulary than other parts of speech?

What is perhaps more interesting is that the sum of the sizes of the SCLeSubs for the individual parts of speech is about 6.3% larger than the SCLeSub calculated from all words. This could suggest that by merging word nodes to connect the individual part-of-speech networks and form the “All words” network, we are allowing more nodes to be eliminated, due to the increased connectivity of the merged nodes. We could naïvely conclude, because merging the networks decreases the number of nodes while keeping the number of edges constant, that the number of words eliminated will depend on the edges-per-node ratio. Table 1-B shows us that the situation is somewhat more complicated. Among the different parts of speech there is no consistent correlation between the percent of words retained in the SCLeSub and the ratios of the various graph features. The only remaining explanation is that there must be differences in the topologies of the word-synset graphs among the various parts of speech.

3. Is the SCLeSub of
English unique, or are there multiple alternative SCLeSubs?

To arrive at an answer to this question I had to extend the original IP slightly, in a way analogous to the “flux variability analysis” technique used in metabolic modeling in biology. The goal here is to define the boundaries of the solution (SCLeSub) space. After solving the original optimization problem (Program 1), the original optimal objective value can be set as a constraint, and a series of new IPs can be solved where, for each one, the objective function is to minimize or maximize a single decision variable (Program 2). I deem this technique “SCLeSub variability analysis” or SVA. A convenient feature of SVA is that after solving the initial IP, the solution can be used as a feasible basis for the subsequent IPs; this means that it is quicker for the computer to solve the IPs in the SVA calculation than it is for the computer to solve the original problem. Nevertheless, the SVA problem is highly computationally intensive. On my home computer, a 2.8 GHz quad-core Intel i7, I started the execution of all five SVA problems (all words, nouns, adjectives, verbs, and adverbs) at the same time. The adverbs problem finished in about 2 hours. Interestingly, the adjectives problem finished second, after about 10 hours. The verbs problem, despite having a smaller constraint matrix and fewer decision variables than the adjectives problem, took about 18 hours. At 18 hours, the nouns problem had just finished 100 iterations, and the all-words problem had not even gotten that far, so I killed the processes and settled for what I had. The results of the SVA calculation (table 2) clearly show that the composition of a SCLeSub is not unique.

Program 2:
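
In the notation of Program 1, with z* the optimal objective value of the original problem (so the total number of retained words is held at its optimum), each SVA sub-problem is:

\[
\begin{aligned}
\text{minimize or maximize} \quad & x_k \\
\text{subject to} \quad & \sum_{n=1}^{N} x_n = z^{*} \\
& \sum_{n=1}^{N} r_{jn}\, x_n \ge 1, \qquad j = 1, \ldots, J \\
& x_n \in \{0, 1\}, \qquad n = 1, \ldots, N
\end{aligned}
\]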

Solve for each k = 1 to |x|. Maximize if x_k = 0 in the solution to the original IP; minimize if x_k = 1 in the original solution. r_jn is 1 if word n can take the meaning of synset j, and 0 otherwise. z* is the value of the objective function in the original problem.

A                      Nouns     Adjectives   Verbs     Adverbs   All words   Sum of parts of speech
Word-synset edges      146,312   30,002       25,047    5,580     206,941     206,941
Synsets                82,115    18,156       13,767    3,621     117,659     117,659
Words Total            117,798   21,479       11,529    4,481     147,306     155,287
Words in SCLeSub       62,364    12,700       6,294     2,833     79,230      84,191

B                              Nouns    Adjectives   Verbs    Adverbs   All words   Sum of parts of speech
Percent of words in SCLeSub    52.94    59.13        54.59    63.22     53.79       54.22
Edges per word                 1.24     1.40         2.17     1.25      1.40        1.33
Edges per synset               1.78     1.65         1.82     1.54      1.76        1.76
Edges per node                 0.73     0.76         0.99     0.69      0.78        0.76
Words per synset               1.43     1.18         0.84     1.24      1.25        1.32

Table 1: Comparison of the SCLeSub problem and its solution among the various parts of speech. (A) Entries in the “Sum of parts of speech” column are the sum of the other numbers in that row, except for the number in the “All words” column. (B) Entries are derived from the data in (A).

                                           Adjectives   Verbs     Adverbs
Total Words                                21,479       11,529    4,481
In every SCLeSub                Count      9,591        5,058     2,178
                                Percent    44.65        43.87     48.61
In at least 1 SCLeSub,          Count      7,491        2,883     1,581
but not in every one            Percent    34.88        25.01     35.28
Not in any SCLeSub              Count      4,397        3,588     722
                                Percent    20.47        31.12     16.11

Table 2:
Comparison of the SCLeSub space among adjectives, verbs, and adverbs.

Most of the words in
the SCLeSubs must be in every SCLeSub, but for every grammatical
category, there is still a substantial population of words that are
in some SCLeSubs, but not every one. Once again, if there are
consistent patterns in the bulk statistics, they are obscure
patterns, suggesting significant topological differences among the
underlying graphs of the various parts of speech.

4. Does translating ordinary prose into words found only in the SCLeSub affect the readability of the text?

In light of the results discussed above, the concept of “the SCLeSub” no longer seems valid. Thus, to answer this last question, one of several approaches could be taken. The vocabulary could be restricted to only words that appear in at least one SCLeSub, or an arbitrary SCLeSub could be chosen and the vocabulary restricted to words that are members of that particular SCLeSub. Here, I follow the second approach, for no other reason than that it will result in more word replacements than the other method. I used the optimal solution for Program 1 output by GLPK for the “All words” network as the basis for these translations (see Supplemental data 1). The following two sentences are taken from the New York Times; in each translated sentence, a few words have been replaced with words from the SCLeSub:

Original: A
tactical flight officer on board the helicopter fired multiple times
into the vehicle in an attempt to shoot the tires.

Translated: A tactical
flight officer on board the chopper fired multiple times into
the vehicle in an endeavor to shoot the tyres.

Original: Floridians
can cast their ballots into the ocean and hope they end up in a state
that's more capable of counting them.

Translated: Floridians can cast their ballots into the ocean and hope they wind up in a state that's more capable of counting them.

I also tried translating the Christmas song “Silent Night” and “Yellow Submarine” by the Beatles, but neither one seemed to have any words or phrases that were not already in the SCLeSub.

Making text conform to the SCLeSub does seem to make the text a little less natural-sounding, particularly in the first example, but it is still readable, and in many cases no changes are even needed. Many of the words in WordNet are actually phrases, so an automated translator would undoubtedly find more places to substitute words than I did just searching manually.

Conclusion:

While
the results I have arrived at in this project are interesting, they
also leave plenty of room for further research. For example, I
conclude that the word-synset graphs for each part of speech must
have distinct topological characteristics, but I do not attempt to
explain what those characteristics might be. Future research in this
area might start with comparing the frequencies of various network
motifs among the graphs.

A striking feature of the SCLeSub space for every part of speech is how broad it is. There are thousands of different possible SCLeSubs. If someone wanted to use a SCLeSub-like subset of the language for some practical purpose, it would make sense to find some criteria for picking a specific SCLeSub out of the SCLeSub space. The total word-space could be iteratively reduced by solving a series of IPs, with the optimal objective value for the previous IP becoming a constraint in the subsequent IP, and a lower-priority criterion becoming the new objective function. Alternatively, a single IP including multiple criteria in the objective function, each criterion weighted with a constant coefficient, could be solved. One possibly useful additional optimization criterion could be the rank of words in a list of word frequencies. After the size of the SCLeSubs is found (by Program 1), a smaller area of SCLeSub space, perhaps a unique x, could be found by Program 3. If the goal is to find a set of common words, perhaps words to teach foreign students of English, the objective function would be a minimization; if the goal is to find a set of obscure words, perhaps for a politician or comedian, then the objective function would be a maximization.

So far, particularly based on the exploration of goal 4, I'm not convinced that there could ever really be a useful practical application of the SCLeSub concept, but perhaps if translations were attempted into a SCLeSub optimized for word obscurity (thus excluding as many common words as possible), the changes would be more dramatic. Contrariwise, a SCLeSub may find use not as a list of words to be used, but as a list of words to be avoided. Perhaps the words in the “Not in any SCLeSub” category are more sophisticated and erudite-sounding than words in the other two categories; they are certainly less vague.

While working on this project, I learned about formulating potentially useful integer programs, I learned how to integrate GLPK calls into computer programs, and I made some really kick-ass graphs in Cytoscape (see appendix and supplementary PDFs). I enjoyed working on this project; I hope you enjoyed reading about it.

Program 3:
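
Sketched in the same notation, with F_i as defined below:

\[
\begin{aligned}
\text{minimize (or maximize)} \quad & \sum_{i=1}^{N} F_i\, x_i \\
\text{subject to} \quad & \sum_{i=1}^{N} x_i = z^{*} \\
& \sum_{i=1}^{N} r_{ji}\, x_i \ge 1, \qquad j = 1, \ldots, J \\
& x_i \in \{0, 1\}, \qquad i = 1, \ldots, N
\end{aligned}
\]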

Like Program 2, except for the objective function. F_i is the rank of word x_i in a word-frequency list, with the most often used word ranked 1 and the least often used word ranked N.

Appendix:

There's some supplemental data for this project, which you can download here.

Supplemental_Data_1_All_Words_Solution.txt: as alluded to in the text, this file contains the data describing one SCLeSub of the 'All words' network. If there is only one word on a line, then that word is a member of the SCLeSub. If there are multiple words, then the first word is not a member of the SCLeSub, but the subsequent words are in the SCLeSub and share a synset with the first word. To make a sentence conform to the SCLeSub, search for all the words and phrases in the sentence, and if one is not in the SCLeSub, choose the appropriate word to replace it with from among the words following it in the text file (a sketch of a script that automates part of this appears at the end of this appendix).

wordNetGLPKperl.pl: This is the Perl script I wrote to do all of the heavy lifting for this project. It's pretty sloppy, but it gets the job done.

adverb_map.pdf: This is the entire graph for adverbs, including the data from the SVA calculation. Light blue nodes are synsets. Red nodes are words that are not in any SCLeSub. Yellow nodes are words that are in some SCLeSubs, but not in every SCLeSub. Green nodes are words that are in every SCLeSub.

I generated this graph in Cytoscape using data output by my Perl script. If you zoom in to about 1500 percent you can actually read the names of the words. Also, you can search for words in the PDF and see the network context around them. This is the best-looking of the three, I think.

verb_map.pdf: Like above, but for verbs.

adjective_map.pdf: Like above, but for adjectives.

I
tried to make ones for nouns and all words, but they're huge and I
couldn't get them to turn out well.
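
As promised above, here is a rough sketch of a script that uses Supplemental_Data_1_All_Words_Solution.txt to flag the words of a sentence that fall outside the chosen SCLeSub. It assumes the words on each line are separated by whitespace, it only matches single words (not multi-word phrases), and picking the sense-appropriate replacement is still left to the human:

#!/usr/bin/perl
use strict;
use warnings;

# A line with several words means: the first word is outside the SCLeSub and
# the remaining words are in-SCLeSub words sharing a synset with it.
my %replacements;
open my $fh, "<", "Supplemental_Data_1_All_Words_Solution.txt" or die $!;
while (my $line = <$fh>) {
    my ($word, @in_sclesub) = split ' ', $line;
    next unless defined $word;
    $replacements{ lc $word } = \@in_sclesub if @in_sclesub;
}
close $fh;

# Flag each word of a sentence that is not in the SCLeSub and list candidates.
my $sentence = "A tactical flight officer on board the helicopter fired "
             . "multiple times into the vehicle in an attempt to shoot the tires.";
for my $token (split /[^A-Za-z]+/, $sentence) {
    next unless length $token;
    my $candidates = $replacements{ lc $token } or next;
    print "'$token' is not in the SCLeSub; possible replacements: @$candidates\n";
}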