NLP Analysis in Python using Modal Verbs

Modal verbs are auxiliary verbs that indicate semantic information about an action, e.g. likelihood (will, should), permission (could, may), or obligation (shall, must). One interesting concept to explore is whether the presence of these verbs varies across different types of text, and whether that means anything.

I extended the example to include an additional corpus of court cases, plus some extra auxiliary verbs. This corpus includes the contents of ~15,000 legal documents.

We first define a function to retrieve the genres of literature, and a second to retrieve the words in a genre. For the legal documents, I read from an n-gram index (word/phrase counts) I built previously.
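A rough sketch of these two helpers, assuming the legal index has already been loaded into a plain dict of word counts (the `legal_counts` stub below holds only two entries from the chart; the real index is loaded from disk):

```python
from nltk.corpus import brown

# Placeholder for the n-gram index built from the ~15,000 legal
# documents; only a tiny excerpt of the real counts is shown here.
legal_counts = {"may": 26968, "must": 15974}

def genres():
    """All genres to compare: the Brown corpus genres plus 'legal'."""
    return brown.categories() + ["legal"]

def words(genre):
    """Yield each token of a genre, lowercased."""
    if genre == "legal":
        # Expand the stored counts back into a token stream, so the
        # result can feed a FreqDist like any other genre.
        for word, count in legal_counts.items():
            for _ in range(count):
                yield word
    else:
        for word in brown.words(categories=genre):
            yield word.lower()
```

The helper names (`genres`, `words`, `legal_counts`) are mine, not the original post's.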

The tabulate method is provided by NLTK, and makes a nicely formatted chart (on a command line, the columns line up neatly).

                    can   could     may   might    must    will   would  should
legal             13059    7849   26968    1762   15974   20757   19931   13916
adventure            46     151       5      58      27      50     191      15
belles_lettres      246     213     207     113     170     236     392     102
editorial           121      56      74      39      53     233     180      88
fiction              37     166       8      44      55      52     287      35
government          117      38     153      13     102     244     120     112
hobbies             268      58     131      22      83     264      78      73
humor                16      30       8       8       9      13      56       7
learned             365     159     324     128     202     340     319     171
lore                170     141     165      49      96     175     186      76
mystery              42     141      13      57      30      20     186      29
news                 93      86      66      38      50     389     244      59
religion             82      59      78      12      54      71      68      45
reviews              45      40      45      26      19      58      47      18
romance              74     193      11      51      45      43     244      32
science_fiction      16      49       4      12       8      16      79       3

Looking at these numbers, it is clear that we need to add a concept of normalization. My added corpus has many more tokens than the Brown corpus, which makes it hard to compare across genres.

The frequency distribution class exists to count things, and I didn't see a good way to normalize the rows. I rewrote the tabulate function to do this: it finds the max for each row, divides each count by that, and multiplies by 100.
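A sketch of that rewrite, working from a plain dict of dicts (genre -> modal -> count) rather than subclassing the NLTK class:

```python
def normalize_row(row, modals):
    """Scale one genre's counts so its largest modal count becomes 100."""
    top = max(row.get(m, 0) for m in modals) or 1
    return [round(100 * row.get(m, 0) / top) for m in modals]

def tabulate_normalized(counts, modals):
    """Print a tabulate-style chart with each row normalized to its max."""
    print("".join(m.rjust(8) for m in modals))
    for genre, row in counts.items():
        cells = normalize_row(row, modals)
        print("".join(str(c).rjust(8) for c in cells) + "  " + genre)
```

With this, every genre's most frequent modal shows as 100, and the other cells become comparable percentages regardless of corpus size.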

It would be nice to see how similar these genres are. We can compute that by treating each genre's modal counts as a vector; the angle between two vectors approximates their "similarity". The nice thing about this measure is that it ignores all other words (some words exist in only one text, often reflecting how well the data was cleaned rather than the genre of literature).
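The standard way to measure that angle is cosine similarity; a minimal implementation, shown comparing two rows from the chart above:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means the same
    direction, 0.0 means orthogonal (no overlap at all)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Modal counts (can..should) for two Brown genres, taken from the chart.
fiction = [37, 166, 8, 44, 55, 52, 287, 35]
mystery = [42, 141, 13, 57, 30, 20, 186, 29]
print(cosine_similarity(fiction, mystery))
```

Because the cosine depends only on direction, the size difference between the legal corpus and the Brown genres drops out of this comparison automatically.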

Some genres appear similar to legal documents. It is possible, however, that some verbs are not independent; for instance, you might see "may" and "might" with equal similarity. One way to test this is to flip what we track: make a vector for each modal, rather than for each genre.

The following code computes the similarity between each modal and the mean, using the different genres as dimensions. Since each modal contributes to the mean, there is guaranteed to be some similarity, but note that some are closer than others. Note also that the counts have to be normalized, as in the last example, or the answer will be dominated by the 'legal' genre.
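A sketch of that computation (the per-genre max normalization is carried over from the earlier rewrite; how the original post scaled the reported numbers is my assumption):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def modal_similarities(counts, modals):
    """counts: genre -> modal -> count. Returns modal -> similarity to
    the mean modal vector, with genres as the dimensions."""
    genres = list(counts)
    # Normalize each genre's row first, so 'legal' does not dominate.
    scale = {g: (max(counts[g].values(), default=0) or 1) for g in genres}
    vec = {m: [counts[g].get(m, 0) / scale[g] for g in genres]
           for m in modals}
    mean = [sum(vec[m][i] for m in modals) / len(modals)
            for i in range(len(genres))]
    return {m: cosine(vec[m], mean) for m in modals}
```

A modal whose similarity to the mean is high behaves much like the others across genres, and so carries little distinguishing signal.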

What I’d infer from this is that the least helpful verb for distinguishing genres is “must,” and the most helpful is “may.”

can similarity to mean: 76.0
could similarity to mean: 67.6
may similarity to mean: 61.5
might similarity to mean: 70.0
must similarity to mean: 79.7
will similarity to mean: 67.7
would similarity to mean: 73.6
should similarity to mean: 74.2