Getting Started

Congratulations. We are nearly finished. On Linux, the
finite-state applications you see in this
directory run in an xterm window. They cannot be started by double clicking
the icons.

In what follows, when we say to "enter" a command, that means to type
the command at the command-line prompt and press the Enter (or Return) key.

If you copied the for-linux folder into your
home directory, open a Terminal window and enter the command

cd ~/for-linux

to go into the directory. Then enter the command

ls -l

to make sure you have arrived at the right place. You should see
this GettingStarted file and the five applications: xfst, lexc,
lookup, tokenize, and twolc. The display should look something
like this:

The -rwx------ signature at
the beginning of a line indicates that you have full rights to the
file (read, write, and execute). Sometimes file permissions are not preserved
when files are copied. If you see something other than -rwx------ at
the beginning of the last five lines, enter the command

chmod 700 *

to fix the permissions. You (and only you) should be able to read,
write, move, or execute these files.
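
If you would rather check the permissions before changing them, the following is a minimal sketch, not part of the distribution. It is written in Bourne-shell syntax (sh or bash), so from a csh prompt you would save it in a file and run it with sh. The function name not_executable is our own invention; it prints the name of each of the five programs that is missing or lacks execute permission in the given directory.

```shell
# List which of the five programs in a directory are missing or
# not executable by you.
# Usage: not_executable DIR   (DIR defaults to the current directory)
# The function name is hypothetical; it is not part of the software.
not_executable() {
    dir=${1:-.}
    for prog in xfst lexc lookup tokenize twolc; do
        # -x is true only if the file exists and you may execute it.
        if [ ! -x "$dir/$prog" ]; then
            echo "$prog"
        fi
    done
}

# Check the current directory; no output means all five are executable.
not_executable .
```

If the sketch prints any names while you are in ~/for-linux, the chmod command above should fix them.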

If you now issue the command

./xfst

(note the period and the slash), the xfst
application will start and you will see the xfst[0]: prompt.
To try out a simple command, you can type

read regex a b c;

to make your first network. This is just a test; you can quit immediately
with the command exit. To make the programs accessible
from any xterm window, they should be copied to a directory that is on
your "path", a list of directories that Unix searches to find the executable
programs for you. Enter the command

echo $path

to see what directories you have on your path.
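
The $path variable belongs to csh and tcsh. Bourne-style shells such as sh and bash keep the same information in the colon-separated environment variable PATH. As a minimal sketch under that assumption, the function below (path_dirs is a name we made up) prints the search path one directory per line, in the order in which the directories are searched.

```shell
# Print each directory of a colon-separated search path on its own line.
# Usage: path_dirs [LIST]   (LIST defaults to the value of PATH)
# The function name is hypothetical.
path_dirs() {
    printf '%s\n' "${1:-$PATH}" | tr ':' '\n'
}

# Show the directories that this shell searches for programs.
path_dirs
```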

If you are an experienced Unix user, you already know what
to do: move the programs to some existing directory such as ~/bin
that is on your path, enter the command rehash,
and follow it by the command which xfst just to make
sure you did everything right. The installation is finished and you
can skip the rest of this section.

If you are a novice Linux user, now is the time to learn a
couple of tricks. First make sure that your current working directory
is ~/for-linux; if you copied the for-linux
folder to your home directory, then enter the command

cd ~/for-linux

If you don't have a ~/bin directory (~/ stands
for the path from the top of the file system into your home directory),
we recommend that you make one and move the programs there. To do that,
enter the commands

mkdir ~/bin
mv * ~/bin/

The mkdir command creates the folder bin
in your home directory. If you already have a ~/bin directory,
the mkdir command will tell you so, and do nothing.
The mv command moves all the files in your current ~/for-linux
directory into the ~/bin directory. To verify that all
went well, enter the command

ls -l ~/bin/.

You should see the five programs in their new location.

The next step is a little tricky. If you already had a ~/bin
directory, there is a chance that it is already on your path. If that is the
case, entering the command

echo $path

should show something ending in ~/bin or /home/myname/bin in its output (different Linux installations may store the user
home directories in different locations). If that is the case, enter
the commands

rehash
which xfst

to make sure everything is installed correctly.
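
Scanning a long path by eye is error-prone, so here is a minimal sketch of an automatic check. It assumes a Bourne-style shell (sh or bash) and its colon-separated PATH variable rather than the csh $path list; the function name on_path is hypothetical.

```shell
# Succeed if a directory is one of the entries of a colon-separated path.
# Usage: on_path DIR [LIST]   (LIST defaults to the value of PATH)
# The function name is hypothetical.
on_path() {
    list=${2:-$PATH}
    # Wrapping both sides in colons lets one pattern match the first,
    # middle, and last entries alike, and only whole entries.
    case ":$list:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if on_path "$HOME/bin"; then
    echo "$HOME/bin is on the path"
else
    echo "$HOME/bin is not on the path"
fi
```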

If the ~/bin directory does not show up
in your path, we need to put it there. When the xterm application starts,
it looks for a file called .cshrc (the dot is part of
the name) in your home directory. Check first if you have such a file
by doing ls ~/.cshrc. If the file exists, bring it up
in a text editor and add the line

set path = (~/bin $path)

and save the .cshrc file. If the file
does not yet exist, type the following three lines

cat > ~/.cshrc
set path = (~/bin $path)
^D

where ^D stands for control-D. You now have
a ~/.cshrc file that adds ~/bin to your
path in every newly launched xterm application.
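
The two cases above (the file exists, or it does not) can also be handled in one step. The following is a sketch only, in Bourne-shell syntax, so from a csh prompt you would save it in a file and run it with sh. The function name add_path_line is our own. It appends the set path line to a file unless that exact line is already present, and it never overwrites what the file already contains.

```shell
# Append the csh path setting to an rc file unless it is already there.
# Usage: add_path_line FILE   (for example: add_path_line ~/.cshrc)
# The function name is hypothetical.
add_path_line() {
    rcfile=$1
    line='set path = (~/bin $path)'
    # -F: treat the line as literal text; -x: match whole lines only;
    # -q: print nothing, just report success or failure.
    if [ -f "$rcfile" ] && grep -Fxq "$line" "$rcfile"; then
        echo "$rcfile already sets the path"
    else
        # >> creates the file if needed and keeps any existing lines.
        printf '%s\n' "$line" >> "$rcfile"
        echo "added the path line to $rcfile"
    fi
}
```

Calling add_path_line ~/.cshrc a second time changes nothing, so it is safe to run again if you are unsure whether the line was added.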

To verify that everything is OK, enter the command

source ~/.cshrc

to add the ~/bin directory to the path in the current
xterm, followed by the commands

rehash
which xfst

If the which command comes back with a path to the location
where xfst was installed,
all is well and you are done. You can now launch xfst with the command xfst
in any new xterm window. If xfst
works, the other four applications, lexc,
lookup, tokenize, and twolc, will also launch properly.
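
If you prefer to check all five programs rather than rely on xfst alone, here is a minimal sketch for a Bourne-style shell (sh or bash). It uses the standard command -v builtin in place of the csh which command; the function name report_programs is hypothetical.

```shell
# Report where the shell finds each of the five programs,
# or say that it cannot find one.
# The function name is hypothetical.
report_programs() {
    for prog in xfst lexc lookup tokenize twolc; do
        # command -v prints the location of a program found on the path
        # and fails if there is none.
        if location=$(command -v "$prog"); then
            echo "$prog: $location"
        else
            echo "$prog: not found on the search path"
        fi
    done
}

report_programs
```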

If you have followed the above instructions and, at some later date, wish
to uninstall the software, you can do it in any Terminal window on your machine
with the commands

cd ~/bin
rm lexc lookup tokenize twolc xfst

History

The Xerox finite-state software has a long history going back
to the 1980s. The basic finite-state calculus and the maintenance routines
such as determinization and minimization were originally implemented
by Ronald M. Kaplan in Xerox Interlisp (Medley) with help from Martin
Kay and John Maxwell. The system was then re-implemented and improved
in C by Lauri Karttunen and Todd Yampol around 1990 based on Karttunen's
1988 Common Lisp version, which included important contributions from Jan
Pedersen, Atty Mullins, and Doug Cutting. Around the same time, Ken Beesley
and Lauri Karttunen re-implemented the compiler for Kimmo Koskenniemi's
two-level rule formalism (twolc)
and Karttunen and Yampol wrote the lexicon compiler (lexc) that became the basic tool for creating
lexical transducers for a succession of Xerox enterprises: DDS, XSoft,
Inxight.

In 1993, Xerox established a European research center in Grenoble,
France, first called RXRC (Rank Xerox Research Centre) and later XRCE
(Xerox Research Centre Europe). The maintenance and development of the
C-version of the finite-state code moved from Palo Alto to Grenoble when
the Grenoble center was established. The enrichment of the calculus with
replace-rule expressions is the work of Karttunen and André
Kempe, similar to but more versatile and efficient than the compilation algorithm
in Kaplan and Kay's 1994 paper. The xfst
interface and the two runtime applications (tokenize,
lookup) were written at XRCE.
The primary XRCE contributors are Pasi Tapanainen, André Kempe,
Tamás Gaál, Hervé Poirier, Caroline Privault, and
Jean-Marc Coursimault.

In practical use at Xerox, the replace rules of xfst have superseded Koskenniemi's two-level
formalism, which was the dominant paradigm in the early 1990s. Most
Xerox developers now use lexc
to create lexicon-like finite-state transducers and xfst to write rules; the twolc language is falling out of use.

The twolc compiler
is included on the software CD but is not documented in the book Finite
State Morphology. If you are planning to use twolc or want to know about two-level rules,
please read the chapter entitled Two-Level Compiler in the doc
folder.

Known Issues

The software on this CD dates back to the summer of 2002.
It has been used extensively by many developers at XRCE, Parc, and
Inxight. As in any complex piece of software, there are undoubtedly
some errors and misfeatures in the code, but we are not aware of any
serious bugs. However, there are two limitations that the user should
be aware of:

Because of their Unix origins, the applications assume that
lines in text input files end with the Unix newline character "\n".
Input files that terminate lines with "\r" (Macintosh) or with "\r\n"
(Windows, DOS) cannot be processed with the CD versions of the software.
If you have a source file created on a Macintosh, you can replace the
end-of-line characters in Unix with the command

tr "\r" "\n" < inputfile > outputfile

The command

tr -d "\r" < inputfile > outputfile

converts a Windows/DOS document into Unix format.
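
To see the two conversions at work, here is a small self-contained sketch in Bourne-shell syntax. It builds one Windows-style and one Macintosh-style sample file in a temporary directory, converts each with the matching tr command, and compares the results. The file names and sample words are hypothetical.

```shell
# Sketch: build two tiny sample files, convert them, and compare.
# The file names are hypothetical; substitute your own.
workdir=$(mktemp -d)

# A Windows/DOS-style file: each line ends in "\r\n".
printf 'kissa\r\nkoira\r\n' > "$workdir/dos.txt"
# A classic Macintosh-style file: each line ends in "\r" alone.
printf 'kissa\rkoira\r' > "$workdir/mac.txt"

# Delete every carriage return, so "\r\n" becomes "\n".
tr -d '\r' < "$workdir/dos.txt" > "$workdir/dos-unix.txt"
# Replace every carriage return with a newline, so "\r" becomes "\n".
tr '\r' '\n' < "$workdir/mac.txt" > "$workdir/mac-unix.txt"

# Both results should now be the same Unix-format file.
if cmp -s "$workdir/dos-unix.txt" "$workdir/mac-unix.txt"; then
    status="identical"
else
    status="different"
fi
echo "$status"
```

Note that the two commands are not interchangeable: running the Macintosh command on a Windows file would turn each "\r\n" into two newlines and leave a blank line after every line.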

Only ISO-8859-1 (= Latin-1) character encoding is supported by
the CD versions of the software. 16-bit Unicode characters (UCS-2) are handled
internally but they cannot be entered directly as input. For example, the
Hebrew letter Alef can be represented as "\u05D0" in a regular expression
where "\u" indicates that the following four hexadecimal digits encode
a Unicode symbol, but the symbol will not be printed as the proper Hebrew
character even if the computer has a Hebrew font installed.
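
If a source file is stored in UTF-8 but uses only characters that also exist in Latin-1, one possible workaround is to convert it before use. The sketch below is an assumption on our part, not something shipped on the CD: it relies on the standard iconv utility being installed, and the file names are hypothetical.

```shell
# Sketch: convert a UTF-8 file to ISO-8859-1 (Latin-1) with iconv.
# Assumes the iconv utility is installed; file names are hypothetical.
workdir=$(mktemp -d)

# The word "café" in UTF-8: the é is the two bytes 0xC3 0xA9
# (octal 303 251 in the printf escape notation).
printf 'caf\303\251\n' > "$workdir/utf8.txt"

# iconv exits with an error if a character has no Latin-1 equivalent.
if iconv -f UTF-8 -t ISO-8859-1 "$workdir/utf8.txt" > "$workdir/latin1.txt"; then
    echo "converted"
else
    echo "conversion failed: the file contains non-Latin-1 characters"
fi

# In Latin-1 the é is the single byte 0xE9, so the output is one byte shorter.
wc -c < "$workdir/utf8.txt"
wc -c < "$workdir/latin1.txt"
```

Characters outside Latin-1, such as the Hebrew Alef, cannot be converted this way and still need the "\u" notation described above.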

In the near future, we will make available new versions of xfst, lexc,
lookup, and tokenize that are aware of different
end-of-line conventions and are able to process UTF-8 encoded Unicode
files. Please check out the book web site, http://www.fsmbook.com, for
updates.