Friday, June 27, 2008

Tokenization, Part 6: Function Overloading

In the last posting, we added the ability to filter a list of input tokens
with a list of stop words, usually words that are so common that they aren’t
interesting for many types of analysis. But because we don’t want to use a
stop list every time, we didn’t add such filtering to the tokenize-str
function. Today, we’ll do that.

Function Overloading

Clojure allows what can be called function overloading. Basically, you can
have several different versions of the same function all assigned to the same
variable name. Each version has to have a different number of arguments. (The
number of arguments is sometimes referred to as the function’s arity, so
each different version of the function has to have a different arity.) Since
the overloaded functions have different argument lists, Clojure can tell which
version to use when you call that function.
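To see this mechanism in isolation, here is a minimal sketch of a multi-arity function. The name greet and its bodies are made up for illustration; they have nothing to do with the tokenizer:

```clojure
;; A minimal multi-arity function. Each argument vector gets its own
;; body, and Clojure picks the body by counting the arguments.
(defn greet
  ([] (greet "world"))                   ; arity 0 delegates to arity 1
  ([name] (str "Hello, " name "!")))     ; arity 1 does the real work

(greet)            ; => "Hello, world!"
(greet "Clojure")  ; => "Hello, Clojure!"
```

Note that the zero-argument version calls the one-argument version, the same pattern we will use for tokenize-str below.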

(Actually, calling this function overloading is misleading. That makes it
sound as if we’re assigning multiple function objects to the same variable.
The overloading really happens inside a function object, and a single
function object is assigned to a variable.)

For reference, here is the current version of tokenize-str and an example of
using it and filtering with a stop words list:
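Here is a sketch of that version. The definitions of token-regex and to-lower-case are assumptions standing in for the ones defined earlier in word.clj:

```clojure
;; Assumed helpers -- stand-ins for the earlier definitions in word.clj.
(def token-regex #"\w+")                    ; assumed token pattern
(defn to-lower-case [s] (.toLowerCase s))   ; assumed helper

;; The current, single-argument tokenize-str.
(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))

(tokenize-str "The cat in the hat.")
;; => ("the" "cat" "in" "the" "hat")

;; Filtering by hand with an illustrative stop-word set:
(filter (complement #{"the" "in"})
        (tokenize-str "The cat in the hat."))
;; => ("cat" "hat")
```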

Next, add a second argument vector and function body at the same parenthesis level as the one already there:

(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, overridden version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?)
           (map to-lower-case (re-seq token-regex input-string)))))

There are a couple of things that are new here. Let’s break them down.

First, I’ve included a couple of comments. When Clojure reads a semicolon (";"), it ignores everything from there to the end of the line. Comments are there for human readers, not for Clojure.

Second, I’ve included a new override for the function that takes two arguments
(arity 2, also written tokenize-str/2). The second argument is a function
that returns true if a token is a stop word. filter then calls the
complement of this function on what is essentially the body of the original
version of tokenize-str (tokenize-str/1).

(Hint: It can be difficult to read functional programming. Sometimes it’s easier to read from the inside out: start with the innermost form, (re-seq ...), then work outward to (map ...) and finally (filter ...).)
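To see what complement does on its own, here is a small sketch; the stop-word set below is an illustrative stand-in, not the real word/stop-words:

```clojure
;; complement takes a predicate and returns a new predicate that gives
;; the opposite answer. A Clojure set can act as a predicate: looking
;; up an element returns the element (truthy) or nil.
(def stop-word? #{"the" "in" "a"})        ; illustrative stop-word set
(def keep-word? (complement stop-word?))

(stop-word? "the")   ; => "the" (truthy -- it is in the set)
(keep-word? "the")   ; => false
(keep-word? "cat")   ; => true
```

This is why (filter (complement stop-word?) ...) keeps exactly the tokens that are not stop words.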

So how do we use this? Call (load-file "word.clj") to load the most recent
code, and let’s run the example from yesterday with this function.

user=> (word/tokenize-str "The cat in the hat.")
("the" "cat" "in" "the" "hat")
user=> (word/tokenize-str "The cat in the hat." word/stop-words)
("cat" "hat")

Recursive Functions

This works, but it’s a little messy. What if we want to change how we tokenize
at the most basic level? We’ll need to change our code in two places, and
that’s twice as many places to make an error.

It would be useful if we could call the first version of the function from
inside the second version, and we can. Just call it the way we would from any
other place in the code. It would look like this:

(defn tokenize-str
  ;; This is the first, original version of the function.
  ([input-string]
   (map to-lower-case (re-seq token-regex input-string)))
  ;; This is the second, overridden version of the function.
  ([input-string stop-word?]
   (filter (complement stop-word?)
           (tokenize-str input-string))))

Now, if we want to modify the way we tokenize, we only need to do it once, in
tokenize-str/1 (the original version of the function). tokenize-str/2 only
implements stop-word filtering.

Of course, a function calling itself can be dangerous: if we’re not careful, it will call itself forever. Whenever a function calls itself, we need to make sure there is at least one path through it that is eventually taken and that does not recurse. In this case, the first version acts as that path: when we call the second version, it calls the first version, which simply returns a list of tokens.
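To make the base-case idea concrete with a function that literally calls itself, here is a small sketch; count-down is an illustrative example, not part of word.clj:

```clojure
;; count-down calls itself with a smaller number each time, until the
;; base case stops the recursion.
(defn count-down [n]
  (if (zero? n)
    '()                               ; base case: stop recursing
    (cons n (count-down (dec n)))))   ; recursive case: shrink n

(count-down 3)  ; => (3 2 1)
```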

Recursion can be a little mind-bending. Don’t worry about that at this point.
We’ll see recursion again many times, and soon it will become an old friend.

A little strange, but an old friend nevertheless.

So far, we’ve only been tokenizing strings that we type in. Of course, in
practice we’ll want to read data in from a file. Next time, I’ll talk about
how to do that.

3 comments:

In the tokenize-str function you put a question mark at the end of the stop-word? parameter. Why a question mark? I took it out of my code and it still runs just fine so I'm assuming that's some sort of parameter convention. Thanks.