Monday, July 14, 2008

Stemming, Part 5: Functions and Recursion

So far in this Clojure tutorial/NLP tutorial, we’ve mainly been looking at
Clojure’s functions, but the last posting actually included a good chunk of
code for the Porter Stemmer. Today, we’ll review functions. Some of this we’ve
seen before, but some of it will be new.

Functions

As with any functional language, functions are the building blocks of Clojure.
I’m going to briefly summarize what we’ve learned about functions so far, and
then we’ll explore one important topic—recursion—in more detail.

Defining Functions

Generally, functions are defined using defn, followed by the name of the
function, a vector listing the parameters it takes, and the expressions in the
body of the function. You can also include a documentation string between the
function name and the parameter vector:

(The str function here just creates strings out of all its arguments and
concatenates them together.)

Higher-Order Functions

Higher-order functions are functions that take other functions as values.
These may use the other function in a calculation, or they may create a new
function from the existing one. For example, map calls a function on each
element in a sequence:

user=>(map greetings'(EricElsa))("Hello, Eric""Hello, Elsa")

complement, however, creates a new function that is equivalent to (not
(original-function *args...*)):

Sometimes, particularly when you’re using a higher-order function, you may
want to create a short function just for that one place, and giving it a name
would clutter up your program. Or you may want to define a function inside
another function, in a let expression, to using only within that function.

For either of these, use a function literal. A function literal looks like a
regular function definition, except instead of defn, use fn; the name is
optional; and generally it cannot have a documentation string. For example,
suppose that greeting, which we defined above, doesn’t exist, and we want to
greet a list of people. We could do this by creating a function literal and
passing it to map.

I’ve already mentioned that Clojure allows you to provide different versions
of a function for different argument lists. Do this by grouping each set of
parameter vector and expressions in its own list. The documentation string, if
there is one, comes before all of the groups:

user=>(defn count-parameters"This returns the number of parameters passed to the function."([]0)([a]1)([ab]2)([abc]3)([abcd]4)([abcde]5))#'user/count-parametersuser=>(count-parameters'p0'p1'p2)3user=>(count-parameters)0

This also works in function literals. This creates a shortened version of
count-parameters called cp and calls it twice within a vector. The two
calls are collected in a vector because let only returns one value, and we
want to see the value of both calls to cp.

user=>(let [cp(fn ([]0)([a]1)([ab]2)([abc]3))][(cp'p0'p1)(cp)])[20]

Recursion

Many problems can be broken into smaller versions of the same problem.

For example, you can test whether a list contains a particular item by looking
at the first element of the list. At each place in the list, ask yourself: is
the list empty? If it is, the item is not in the list. However, if the first
element is what you’re looking for, good; if it’s not, strip the first element
off the list and start over again.

This way of attacking problems—by calling the solution within itself—is called
recursion. Recursive problems are fundamental to functional programming,
so let’s look at it in more detail.

General Recursion

The main pitfall in creating a recursive problem is to make sure that it will
end eventually. To do this, you need to make sure that you have an end
condition. In the list-membership problem I described above, the end
conditions are the empty list and the first element being the item you are
looking for.

In this, the first if tests whether the sequence is empty (that is, if seq
cannot create a sequence out of it; if it is, it returns nil, which
evaluates as false. The second if tests whether the first element in the
sequence is what we’re looking for; if it is, this returns the sequence at
that point. This is a common idiom in Clojure. Since the sequence has at least
one item, it will evaluate as true, fulfilling its contract as a test, but it
returns more information than that. This function is useful for more than
just determining whether an item is in a sequence: it also finds the item and
returns it. Finally, if the first element is not the item being sought, this
*calls member? again with everything except the first item of the list*.
This is the recursive part of the function. Let’s test it.

user=>(member?'(1234)2)(234)user=>(member?'(1234)5)nil

Sometimes it’s helpful to map this out. In the diagram below, a call within
another call is indented under it. A function call returning is indicated by a
“=>” and a value at the same level of indentation as the original call. Here’s
the sequence of calls in the first example above.

In this case, we’re just returning the results of the membership test back
unchanged, but we could modify the results as they’re being returned. For
example, if we wanted to count how many times as item occurs in a sequence, we
could do this:

In many ways, this is very similar to member?. It first tests whether the
sequence is empty, and if it is, it returns zero. If the first element is the
item, this calls itself (it recurses) and increments the result by one, to
count the current item. Otherwise, it recurses, but it does not increment the
result. Let’s see this in action.

You see that the results gets incremented as the computer leaves each function
call where 2 is the first element in the input list.

Tail-Recursive Functions

While these two examples of recursion are superficially similar, on a deeper
level they are very different. In member?, once you’ve computed the result,
you can return it immediately back to the function that originally called
member?. If Clojure had some way to jump completely out of a function from
any level, you could do this. On the other hand, when count-item recurses,
it is not finished with its calculation yet. It has to wait for itself to
return and possibly add one to the result. (Sentences like that remind me why
recursion can be confusing. Don’t worry. Eventually your brain will get used
to being twisted into a pretzel.)

There’s a term to describe functions like member? that are finished with
their calculations when they recurse. It’s called tail-call recursion or
tail-recursive. This is a very good quality. The computer
can optimize these calls to make them very fast and very efficient.

But the Java Virtual Machine doesn’t recognize tail recursion on its own, so
Clojure needs a little help to make these optimizations. You signal a
tail-recursive function call by using the recur built-in instead of the
function name when you recurse. Thus, we could re-write member? to be
tail-recursive like this.

This should work exactly the same (and it does—try it), but for long lists, it
should be much more efficient.

In fact, it’s often worth putting in a little extra work to make
non-tail-recursive functions tail-recursive. There’s a straightforward
transformation you can use to make almost any function tail recursive. Just
add an extra parameter and use it to accumulate the results before you make
the recursive function call. The first time you call the function, you need to
pass the base value into the function for that parameter. For example,
count-item with tail recursion would look like this:

The difference here is the accum parameter. When item equals the first
element of the sequence, accum is incremented before the recursive
functional call is made. When the function starts again on the shorter list,
accum is incremented. Finally, when the end of the list is reached, accum
is returned. It contains the counts accumulated as count-item walked down
the sequence.

Whenever we call count-item, we have to include a superfluous zero that we
really don’t care about. That seems messy and error-prone. How can we get rid
of it?

Essentially, we want to hide this version of count-item and replace it
with a new version that handles the extra zero for us. Then, we call the new
version and forget about this one. There are several ways to actually do this:

Have the public function named count-item and create a private version
named something like count-item-;

Use let to define the private function inside count-item; or

Use loop.

loop? Yes, this is new. loop is a cross between a function call and let.
It looks a lot like let because it allows you to define variables. But it
also acts as a target for recur. How would count-item look with loop?

Notice that the first line of loop looks a lot like the first line of let.
Both declare and initialize a series of variables. Just to keep things clear,
I’ve renamed sequence to sq within the loop. Also, item isn’t included
in the list of variables that loop declares, since it doesn’t change.
Finally, at the end of the loop are two recur statements. When they are
evaluated, they cause the program to jump back to the loop statement, but
this time, instead of the original values used to initialize the loop
variables, the values in the recur call are used.

Stemmer Utility

With recursion, we can define another utility to use on the stemmer
structures. This function will take a predicate and a stemmer. For each step,
it will test the stemmer with the predicate, and if true, it will pop one
character from the word and recurse. If there are no letters left in the word
or if the predicate evaluates to false, it will return the stemmer the way it
is:

(defn pop-stemmer-on"This is an amalgam of a number of different functions: pop (it walks through the :word sequence using pop); drop-while (it drops items off while testing the sequence against drop-while); and maplist from Common Lisp (the predicate is tested against the entire current stemmer, not just the first element)."[predicatestemmer](if (and (seq (:wordstemmer))(predicatestemmer))(recur predicate(pop-wordstemmer))stemmer))

I’m ready for a break. Next time, We’ll look at some more of the functions
that Clojure provides, and we’ll define a few predicates of our own.