Porter Stemming

September 8, 2009

Our solution follows the reference implementation in the three places where it differs from the original algorithm. We represent a word as a list of characters in reverse order; thus the word praxis is represented by the list (#\s #\i #\x #\a #\r #\p). Storing the word as a list of characters eliminates all the arithmetic present in the original program; storing the word in reverse order makes it easy to access the end of the word, where all the action is. Our first functions determine if a letter is a vowel or consonant:
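A minimal sketch of such predicates, assuming Porter's definition of a consonant: any letter other than a, e, i, o, u, and other than a y that is preceded by a consonant. The names consonant? and vowel? and their calling convention are assumptions made for illustration, not necessarily those of the solution. Each takes the reversed list positioned at the letter to be tested, so the letter that precedes it in the word is simply the next element of the list:

```scheme
;; cs is a word in reverse order, e.g. "toy" is (#\y #\o #\t).
;; Both predicates classify (car cs); the letter that precedes it
;; in the word is (cadr cs), if there is one.
(define (consonant? cs)
  (let ((c (car cs)))
    (cond ((memv c '(#\a #\e #\i #\o #\u)) #f)
          ((char=? c #\y)
           ;; y is a consonant at the start of a word or after a vowel
           (or (null? (cdr cs))
               (not (consonant? (cdr cs)))))
          (else #t))))

(define (vowel? cs)
  (not (consonant? cs)))
```

With these definitions the y of toy counts as a consonant (it follows a vowel), while the first y of syzygy counts as a vowel (it follows a consonant).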

We implement measure differently than Porter did. The measure increases by one each time a consonant follows a vowel, so we first reverse the word into the zs list, classifying each letter as a vowel or consonant as we go, and then scan through zs, keeping track of whether the previous letter was a vowel or consonant:
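The two-pass idea can be sketched as follows. This is an illustration under assumptions rather than the solution's own code: the consonant? and vowel? helpers, the loop names, and the choice to store booleans in zs are all hypothetical.

```scheme
;; Hypothetical helpers: classify the first letter of a reversed word.
(define (consonant? cs)
  (let ((c (car cs)))
    (cond ((memv c '(#\a #\e #\i #\o #\u)) #f)
          ((char=? c #\y)
           (or (null? (cdr cs)) (not (consonant? (cdr cs)))))
          (else #t))))

(define (vowel? cs) (not (consonant? cs)))

;; cs is a word in reverse order. The first loop walks cs from the end
;; of the word to its start, consing a vowel flag for each letter onto
;; zs, so zs ends up in forward (reading) order. The second loop counts
;; the places where a consonant immediately follows a vowel.
(define (measure cs)
  (let loop ((cs cs) (zs '()))
    (if (pair? cs)
        (loop (cdr cs) (cons (vowel? cs) zs))
        (let count ((zs zs) (prev-vowel? #f) (m 0))
          (cond ((null? zs) m)
                ((and prev-vowel? (not (car zs)))
                 (count (cdr zs) #f (+ m 1)))
                (else (count (cdr zs) (car zs) m)))))))
```

Under these assumptions, (measure (reverse (string->list "tree"))) evaluates to 0, the same call on "trouble" gives 1, and on "troubles" gives 2, matching the examples in Porter's paper.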

Here we see why ends? was written as it was. The => operator of a cond clause takes the result of a predicate, if it is non-#f, and passes it to the single-argument function in its consequent. Thus s is a higher-order function, similar to the setto function of the reference implementation: it takes a string str and returns a function that takes a list cs representing a word stem (the stem returned by ends?) and returns a new list with the characters of str prepended to the front of cs in reverse order:
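A minimal sketch of how these pieces might fit together. The argument order of ends?, the body of s, and the step1a function (modeled on step 1a of Porter's algorithm) are assumptions for illustration, not a copy of the solution:

```scheme
;; Hypothetical ends?: if the reversed word cs ends with the suffix str,
;; return the stem (cs with the suffix removed); otherwise return #f.
;; An empty stem is the empty list, which still counts as true in Scheme.
(define (ends? str cs)
  (let loop ((ss (reverse (string->list str))) (cs cs))
    (cond ((null? ss) cs)
          ((null? cs) #f)
          ((char=? (car ss) (car cs)) (loop (cdr ss) (cdr cs)))
          (else #f))))

;; s returns a function that puts the characters of str, reversed,
;; on the front of a reversed stem.
(define (s str)
  (lambda (cs)
    (append (reverse (string->list str)) cs)))

;; Each clause tests for a suffix; on success, => hands the stem that
;; ends? returned to the function built by s.
(define (step1a cs)
  (cond ((ends? "sses" cs) => (s "ss"))
        ((ends? "ies" cs) => (s "i"))
        ((ends? "ss" cs) => (s "ss"))
        ((ends? "s" cs) => (s ""))
        (else cs)))
```

For example, (list->string (reverse (step1a (reverse (string->list "caresses"))))) evaluates to "caress" with these definitions. Because ends? returns the stem rather than a bare #t, each clause states the suffix to remove and the suffix to add in a single line.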

We use string-downcase and the assert macro from the Standard Prelude. The function is reasonably fast; Porter claimed in 1980 that stemming 10,000 words took 8.1 seconds on an IBM 370 mainframe, but on a recent-vintage personal computer our function takes about a sixth of a second to process the 23,531 words of Porter’s test vocabulary, with no errors.

;; step2() maps double suffixes to single ones. So -ization ( = -ize plus
;; -ation) maps to -ize etc. Call this function only if the string before the
;; suffix has at least one non-initial consonant-sequence.

;; In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j]
;; inclusive. Typically i is zero and j is the offset to the last character of a string,
;; (p[j+1] == '\0'). The stemmer adjusts the characters p[i] … p[j] and returns the new
;; end-point of the string, k. Stemming never increases word length, so i <= k <= j. To
;; turn the stemmer into a module, declare 'stem' as extern, and delete the remainder of
;; this file.

;; This port has only been tested with PLT Scheme, though there is little PLT-specific code.
;; This code is offered in the hope it will be useful, but with no warranty of correctness,
;; suitability, usability, or anything else.

;; The algorithm as described in the paper could be exactly replicated
;; by adjusting the points of DEPARTURE, but this is barely necessary,
;; because (a) the points of DEPARTURE are definitely improvements, and
;; (b) no encoding of the Porter stemmer I have seen is anything like
;; as exact as this version, even with the points of DEPARTURE!

;; Release 1

;; Note that only lower case sequences are stemmed. Forcing to lower case
;; should be done before stem(…) is called.

;; This implementation is not a particularly fast one.
;; In particular, there is a common optimization to make the many ends-with
;; tests faster by switching on the last or penultimate letter, which I chose
;; not to use here for the sake of readability.
;; I suspect without proof that the many reversals could be ameliorated by just
;; reversing the string in the first place and pre-reversing the patterns, or
;; by maintaining a last position in the string vector instead of using a list of
;; characters.