Wednesday, June 25, 2008

Tokenization, Part 4: Organization

Improved Tokenization

As I’ve already said, the way we’re tokenizing is lacking. One major problem
at this point is that "The" and "the" are returned as two separate tokens.
To take care of that, we need to convert all tokens to lower-case as they’re
processed.

That’s a two-step process: first, convert a token to lower case; second, apply
that conversion to each token as it’s read in.

How do we do this?

Java Interop

In Clojure, strings are Java strings. A Clojure string supports the same
methods as a Java string and, like a Java string, is immutable: once you’ve
created it, you can’t change it.

So now the question is how do we call Java methods?

In fact, it’s fairly simple: to call the length method on a string, use
.length (a dot and the method name) as the function to call, and pass the
string as the first argument to that function. Other arguments are passed in
after the method:
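The example that belongs here did not survive, so what follows is a minimal sketch of the pattern just described. The particular strings and methods are my own picks for illustration, not necessarily the ones from the original post.

```clojure
;; Dot + method name in function position, the string as the first argument.
(.length "hello")              ; => 5

;; Extra method arguments follow the string.
(.substring "hello world" 6)   ; => "world"
(.indexOf "hello" "l")         ; => 2
```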

This also works for calling static methods: just pass in the name of the class
instead of a class instance as the first argument. For example, the first line
below calls the static method Runtime.getRuntime():
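The original listing is missing here. The sketch below is written for current Clojure releases, which is an assumption about what you are running: there, a static method is called with the Class/method form rather than by passing the class as the first argument, so the shape differs from the description above. The call to freeMemory is only an illustration of using the result.

```clojure
;; Static call: class name, a slash, then the method name.
(Runtime/getRuntime)

;; An instance method called on the object the static call returned.
(.freeMemory (Runtime/getRuntime))
```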

(Runtime gets automatically imported from java.lang. Later I’ll show you
how to import other Java classes and how to create instances of them.)

Java strings have a toLowerCase method, which will do exactly what we want.
The only problem is that Java methods aren’t the same as Clojure functions, so
we’ll need to wrap the method call in a function. Open the word.clj file
you’ve been working on and add this before tokenize-str:

(defn to-lower-case [token-string]
  (.toLowerCase token-string))

That creates a function called to-lower-case, which just wraps a method call
to String.toLowerCase.
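A quick check of the wrapper might look like the sketch below. The sample strings are just illustrations. (Current Clojure releases also ship clojure.string/lower-case, which does the same job; the hand-written wrapper is kept here because it shows the interop.)

```clojure
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

(to-lower-case "The")      ; => "the"
(to-lower-case "TOKENS")   ; => "tokens"
```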

Higher-Order Functions

Now we need to solve the second part of the problem: applying to-lower-case
to every token as we create it.

Clojure, like all lisps, Python, Ruby, and many other computer languages,
treats functions as objects in their own right. That means you can take a
function, assign it to a variable (which is what defn does), and pass it
around as an argument. Plus, Clojure has a number of functions—called
higher-order functions—that take functions as arguments and do
interesting things with them, either creating new functions based on the
original function or applying that function to a set of data.
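To make that concrete, here is a small sketch. The names lower and lower-then-count are made up for this illustration, and comp is one of the built-in higher-order functions that builds a new function out of existing ones.

```clojure
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; A function is a value, so it can be bound to a second name...
(def lower to-lower-case)

;; ...or handed to a higher-order function that returns a new function.
;; comp applies to-lower-case first, then count.
(def lower-then-count (comp count to-lower-case))

(lower "The")              ; => "the"
(lower-then-count "The")   ; => 3
```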

map is one of those functions. It takes another function and a sequence, and
it applies the function to every item in the sequence. Finally, it returns a
new sequence containing the results of applying the function to each item in
the input sequence. For example, let’s apply to-lower-case to a sequence of
strings (make sure you call (load-file "word.clj") so that to-lower-case
is defined in the REPL):
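The listings for this step are missing, so here is a sketch of both halves. The first form maps to-lower-case over a quoted list. The second half is an assumption: tokenize-str and its regular expression come from earlier parts of this series and are not shown here, so token-regex and the body of tokenize-str below are guesses at a version that lower-cases each token as it is produced.

```clojure
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; map applies to-lower-case to each item of the quoted list.
(map to-lower-case '("This" "IS" "my" "NAME"))
;; => ("this" "is" "my" "name")

;; Assumed stand-in for the regex defined earlier in the series.
(def token-regex #"\w+")

;; Assumed new version of tokenize-str: split, then lower-case every token.
(defn tokenize-str [input-string]
  (map to-lower-case (re-seq token-regex input-string)))
```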

Call (load-file "word.clj") again and test the new version of
tokenize-str:

user=> (tokenize-str "This is a LIST OF TOKENS.")
("this" "is" "a" "list" "of" "tokens")

There we go: all lower-case tokens.

Sequence Literals

In the first map example above, I included something that we haven’t seen
before: sequence literals. Sequences, or lists, are a big part of any lisp,
including Clojure.

A list in Clojure is printed the way lists are in all lisps: a space-delimited
list of items surrounded by parentheses. It looks just like a function call,
which is awkward. It means that if you just type in a list, Clojure will try
to call it like a function:
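The REPL transcript is missing here. As a sketch of what happens, the form below evaluates an unquoted list of two strings and catches the resulting error so the message can be printed. The exact wording of the message varies between Clojure and Java versions, so it is not reproduced.

```clojure
;; Clojure treats "my" as the function and "name" as its argument.
(try
  ("my" "name")
  (catch ClassCastException e
    ;; The message says a java.lang.String cannot be cast to clojure.lang.IFn,
    ;; the interface that callable things implement.
    (println "ClassCastException:" (.getMessage e))))
```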

That’s a poor way of saying that you can’t treat a string like it’s a function.

To type the list and have Clojure recognize it as a list, you have to quote
it: put a single-quote character in front of the list:

user=> '("my" "name")
("my" "name")

Organizing Your Code

So far we haven’t had any problem with variable names clashing with each
other. But if we start using different libraries and files written by
different people, we could easily run across several that use the same
variable name for different things.

How do we get around that? We need some way to keep all those names separate.

Namespaces

Clojure uses namespaces to keep variable names from clashing. (These are
different from Java’s packages, if you’re familiar with those.) By default,
all of the built-in functions are in the clojure namespace. When you’re in
the REPL, everything is in the user namespace. Remember what the REPL prompt
looks like?

user=>

The user at the beginning of the prompt tells which namespace we’re
currently working in. Clojure indicates which namespace a variable is in by
printing the namespace and a forward slash before the variable name. For
example, the user/tokenize-str that is printed after loading word.clj
indicates that the last variable read in was tokenize-str in the user
namespace.

Here we want to define everything we’re working on in a namespace called
word. To do that, just add these lines to the top of word.clj:

(in-ns 'word)
(clojure/refer 'clojure)

The first line creates a new namespace, word, and uses that to contain all
the variables defined in the rest of the file.

Immediately after the first line, we can’t reference any of the functions that
Clojure provides. The second line fixes that: the (clojure/refer 'clojure) call
makes everything in the clojure namespace available in the word namespace.
Remember that clojure/refer references the refer variable in the clojure
namespace. So even though we can’t access Clojure’s built-ins directly at
this point, we can still reference them using their full (namespace plus name)
names.

(As an aside, notice that in the second line, the second clojure is quoted.
That’s because a symbol can either name a variable or be a symbol object in
its own right. Without the quote, Clojure thinks that the symbol is a variable;
with the quote, it reads the second clojure as a symbol, which is what refer
wants.)
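If you are following along on a current Clojure release, note that the built-ins now live in a namespace called clojure.core rather than clojure. Under that assumption, the same two steps might look like the sketch below; the to-lower-case definition is repeated only so the block stands on its own. (The ns macro, as in (ns word), performs both steps in one form.)

```clojure
;; Create the word namespace and switch to it.
(in-ns 'word)

;; Nothing is referred yet, so refer itself needs its fully qualified name.
(clojure.core/refer 'clojure.core)

;; defn and the other built-ins are now available unqualified.
(defn to-lower-case [token-string]
  (.toLowerCase token-string))

;; The fully qualified name works from any namespace, including this one.
(word/to-lower-case "The")   ; => "the"
```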

Now, quit Clojure, go back in, and re-load the file. (We want to start with a
clear slate; otherwise, the old function definitions will still be hanging
around to confuse us.)

user=> (load-file "word.clj")
#'word/tokenize-str

The first thing to notice is that the variable returned at the end has a new
namespace prefix, word. Let’s try to use it:

user=> (tokenize-str "This is a String.")
java.lang.Exception: Unable to resolve symbol: tokenize-str in this context

Oops. No tokenize-str is found. We have to tell Clojure to look in the
word namespace:

user=> (word/tokenize-str "This is a String.")
("this" "is" "a" "string")

Remember to check out the new code from this posting in the Google Code
project for word-clj.

Was a bit confused by how (.getRuntime Runtime) no longer works. As far as I can tell the new syntax for calling static functions is now (Runtime/.getRuntime), so you have to do (.freeMemory (Runtime/.getRuntime)). See here: http://clojure.org/java_interop

Rich Hickey has great ideas, but he may be pushing Clojure on us a bit too fast. That such major syntactical changes are being made indicates that Clojure should still be far from a 1.0 release.

Still, I'm glad I discovered it, and I do hope more such changes come, because while the concepts behind Clojure sound solid, the syntactic sugar needs to mature much further.

The other factor is memory usage, though. If the input is very long, you may not want to duplicate the entire string in memory, as lower-casing it would require. Instead, you may want to let Clojure's lazy seqs help you to process each token and allow it to be GCed without loading everything into memory.

As usual, there are trade-offs, but these seem pretty interesting in this case.