Tuesday, July 15, 2008

Stemming, Part 6: Stemmer Predicates

In the last few postings we’ve been looking at functions and how they’re used
in Clojure. One of the fundamental kinds of functions is the predicate: a
function that tests something and returns true or false. By convention, the
names of these functions end in ?. Clojure has a number of these, and for the Porter
Stemmer, we’ll define a few.

Sets

Sets can act as predicates. As we saw when we were discussing stop
words, sets are also functions that test for membership.
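
As a quick illustration, here is a small sketch of a set used as a predicate. The name stop-word? and the words in it are made up for this example and are not the stop list from the earlier post.

```clojure
;; Hypothetical stop-word set, for illustration only.
(def stop-word? #{"a" "an" "the" "of"})

;; Calling a set as a function returns the matching element, or nil if the
;; element is absent. In a test, the element is truthy and nil is falsey.
(stop-word? "the")   ; => "the"
(stop-word? "stem")  ; => nil

;; Because a set is a function, it can be passed straight to remove or filter.
(remove stop-word? ["the" "stem" "of" "a" "word"])
;; => ("stem" "word")
```

One caveat: the returned value is the element itself, so a set that contains nil or false will not behave as a predicate for those elements.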

Built-Ins

Clojure defines a number of built-in predicates, and higher-order functions
are often useful for creating other predicates.

(zero? num) Returns whether its argument is zero.

(pos? num) Returns whether its argument is a positive number.

(neg? num) Returns whether its argument is a negative number.

(complement fn) Returns a new function that returns the opposite of the
predicate function passed into it. For example, (complement zero?) returns a
predicate that tests whether its argument is not zero.

(cond test expression ...) A structure that acts as a series of nested
if statements. Each test is followed by one expression. If the test
evaluates as true, the expression is evaluated and its value is returned by
the cond expression; later tests are not evaluated. A final catch-all test,
by convention :else, can be used for the case where no previous test
evaluated as true. If there is no catch-all and no test succeeds, cond
returns nil.
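
The following sketch pulls these built-ins together. The names not-zero? and sign are invented for this example.

```clojure
;; The numeric predicates.
(zero? 0)   ; => true
(pos? -3)   ; => false
(neg? -3)   ; => true

;; complement builds a new predicate that negates an existing one.
(def not-zero? (complement zero?))

(not-zero? 5)                    ; => true
(filter not-zero? [0 1 0 2 3])   ; => (1 2 3)

;; cond tries each test in order and returns the value of the expression
;; paired with the first test that succeeds.
(defn sign
  "Classifies a number as :negative, :zero, or :positive."
  [n]
  (cond (neg? n) :negative
        (zero? n) :zero
        :else :positive))

(sign -7)   ; => :negative
(sign 0)    ; => :zero
(sign 42)   ; => :positive

;; With no catch-all and no successful test, cond returns nil.
(cond (pos? -1) :positive)   ; => nil
```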

For example, in the last post, we defined count-item, which had two
nested if expressions:
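
As a rough sketch (the names count-item-if and count-item-cond and the exact bodies are assumptions, not necessarily the code from the earlier post), a function with two nested if expressions and its cond equivalent might look like this:

```clojure
;; Hypothetical version with nested ifs: counts how often item occurs in coll.
(defn count-item-if
  [coll item]
  (if (empty? coll)
    0
    (if (= (first coll) item)
      (inc (count-item-if (rest coll) item))
      (count-item-if (rest coll) item))))

;; The same logic with the nesting flattened into a single cond.
(defn count-item-cond
  [coll item]
  (cond (empty? coll) 0
        (= (first coll) item) (inc (count-item-cond (rest coll) item))
        :else (count-item-cond (rest coll) item)))

(count-item-if [1 2 1 3 1] 1)     ; => 3
(count-item-cond [1 2 1 3 1] 1)   ; => 3
```

Each test in the cond version lines up with one branch of the nested ifs, which makes the three cases easier to read at a glance.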

In the last post, we also defined member?. How would you define it using
cond?

Stemmer Predicates

With all that we’ve learned, we’re ready to define a number of predicates that
we can use later in the Porter Stemmer.

vowel-letter? is a set of the standard vowel letters. This will only be
used to define consonant?.

(def vowel-letter? #{\a \e \i \o \u})

consonant? returns true if the character at the stemmer's index is a
consonant. Given a second argument, it tests the character at that position
instead. The letter y counts as a consonant only at the start of the word or
when it follows a vowel.

(defn consonant?
  "Returns true if the ith character in a stemmer is a consonant.
  i defaults to :index."
  ([stemmer]
   (consonant? stemmer (get-index stemmer)))
  ([stemmer i]
   (let [c (nth (:word stemmer) i)]
     (cond (vowel-letter? c) false
           (= c \y) (if (zero? i)
                      true
                      (not (consonant? stemmer (dec i))))
           :else true))))

vowel? is the logical opposite of consonant?.

(def vowel? (complement consonant?))

vowel-in-stem? returns true if any of the characters before the index is
a vowel character.
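
A minimal sketch of how vowel-in-stem? could be built on top of consonant? follows. It rests on several assumptions that are not confirmed by this post: the stemmer is modeled as a plain map with :word and :index keys, get-index is a simplified stand-in that just reads :index, and the range checked runs from position 0 through the index inclusive.

```clojure
;; Assumption: a stemmer is a map such as {:word "tree" :index 2}.
(defn get-index
  "Hypothetical stand-in for the series' get-index."
  [stemmer]
  (:index stemmer))

(def vowel-letter? #{\a \e \i \o \u})

(defn consonant?
  ([stemmer]
   (consonant? stemmer (get-index stemmer)))
  ([stemmer i]
   (let [c (nth (:word stemmer) i)]
     (cond (vowel-letter? c) false
           (= c \y) (if (zero? i)
                      true
                      (not (consonant? stemmer (dec i))))
           :else true))))

(def vowel? (complement consonant?))

;; Sketch only: true if any position from 0 through the index holds a vowel.
(defn vowel-in-stem?
  [stemmer]
  (boolean (some #(vowel? stemmer %)
                 (range (inc (get-index stemmer))))))

(vowel-in-stem? {:word "tree" :index 1})   ; => false (only t and r are checked)
(vowel-in-stem? {:word "tree" :index 2})   ; => true  (e is a vowel)
(vowel-in-stem? {:word "try" :index 2})    ; => true  (y after a consonant is a vowel)
```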

cvc? returns true if the three characters ending at the index (or at another
given position) form a CVC sequence (consonant-vowel-consonant) and the final
consonant is not w, x, or y.

(defn cvc?
  "true if (i-2 i-1 i) has the form CVC and also if the second C is not
  w, x, or y. This is used when trying to restore an *e* at the end of a
  short word. E.g., cav(e), lov(e), hop(e), crim(e) but snow, box, tray"
  ([stemmer]
   (cvc? stemmer (get-index stemmer)))
  ([stemmer i]
   (and (>= i 2)
        (consonant? stemmer (- i 2))
        (vowel? stemmer (dec i))
        (consonant? stemmer i)
        (not (#{\w \x \y} (nth (:word stemmer) i))))))

Notice that we’ve established a pattern here: these all take one or two
arguments. With one argument, they test against the :index character in the
stemmer. With two arguments, they test against any character:
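
To make the one-argument and two-argument forms concrete, here is a self-contained sketch. The map-based stemmer and the simplified get-index are assumptions made for this example; the real stemmer structure from this series may differ.

```clojure
;; Assumption: a stemmer is a map with :word and :index keys.
(defn get-index
  "Hypothetical stand-in for the series' get-index."
  [stemmer]
  (:index stemmer))

(def vowel-letter? #{\a \e \i \o \u})

(defn consonant?
  ([stemmer]
   (consonant? stemmer (get-index stemmer)))
  ([stemmer i]
   (let [c (nth (:word stemmer) i)]
     (cond (vowel-letter? c) false
           (= c \y) (if (zero? i)
                      true
                      (not (consonant? stemmer (dec i))))
           :else true))))

(def vowel? (complement consonant?))

(defn cvc?
  ([stemmer]
   (cvc? stemmer (get-index stemmer)))
  ([stemmer i]
   (and (>= i 2)
        (consonant? stemmer (- i 2))
        (vowel? stemmer (dec i))
        (consonant? stemmer i)
        (not (#{\w \x \y} (nth (:word stemmer) i))))))

(def hop {:word "hop" :index 2})

;; One argument: test the character at :index (here, the p).
(consonant? hop)   ; => true
(cvc? hop)         ; => true

;; Two arguments: test the character at an explicit position.
(consonant? hop 1)   ; => false (o is a vowel)
(vowel? hop 1)       ; => true
(cvc? hop 1)         ; => false (fewer than three characters end at position 1)

;; The w/x/y exclusion from the docstring.
(cvc? {:word "box" :index 2})    ; => false
(cvc? {:word "snow" :index 3})   ; => false
```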