This LINGUIST List issue is a review of a book published by one of our supporting publishers, commissioned by our book review editorial staff. We welcome discussion of this book review on the list, and particularly invite the author(s) or editor(s) of this book to join in. To start a discussion of this book, you can use the Discussion form on the LINGUIST List website. For the subject of the discussion, specify "Book Review" and the issue number of this review. If you are interested in reviewing a book for LINGUIST, look for the most recent posting with the subject "Reviews: AVAILABLE FOR REVIEW", and follow the instructions at the top of the message. You can also contact the book review staff directly.

Oliver Streiter, Department of Western Languages and Literature, National University of Kaohsiung, Taiwan

INTRODUCTION

Alchemist is a tool that allows users to read in raw text files and create a morphological analysis in XML format that can be used as a ''gold standard'' for evaluating the results of an unsupervised morphological analyzer. The user manually identifies morphemes and categorizes them as root or affix, together with, optionally, a degree of certainty of the analyst. In addition, morphemes can be assigned morphosyntactic features, such as part of speech, person, number, and gender. The tool is intended for researchers who want to perform a linguistic analysis and store their data in a standard format. Given its clear and attractive interface, the tool might be effectively used for in-class exercises on morphological analysis.

Alchemist can be freely downloaded from http://linguistica.uchicago.edu/, in a binary version for Windows and Mac OS X, or as source code e.g. for Linux/Unix environments. The type of license, however, under which the tool can be downloaded is not specified. The software is documented in a 19-page PDF file, accessible from the same site.

SYNOPSIS

The first step in using Alchemist is to create a new GOLD standard collection from an input text file (RTF or plain text). Then, the user has to specify the maximal number of words in the list, e.g. 200 or 500 words. Before the word list is created, the user can select or create some rules to 'scrub' the text. It is thus possible to remove unwanted numbers or HTML markup. New scrubbing rules can be added in the form of regular expressions. Once created, scrubbing rules can be saved and loaded in a later session.
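To illustrate what such a scrubbing rule amounts to, here is a minimal Python sketch. It is not taken from Alchemist: the rule list, the patterns and the function name `scrub` are assumptions made for this example, and Alchemist's own rule format may differ.

```python
import re

# Hypothetical scrubbing rules: (description, pattern, replacement).
# These are illustrative only and are not Alchemist's own rule set.
SCRUB_RULES = [
    ("strip HTML markup", re.compile(r"<[^>]+>"), " "),
    ("strip numbers", re.compile(r"\d+"), " "),
]


def scrub(text, rules=SCRUB_RULES):
    """Apply each rule in order, then normalize the remaining white space."""
    for _description, pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())
```

A call such as `scrub("<p>The 3 cats</p> slept 12 hours")` strips the tags and digits before any word list is built.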

Using the white space character as the predefined word delimiter, the tool creates a word list from the input text. This word list, called the Word Collection, can be sorted from left to right and from right to left to discover prefixes or suffixes respectively. To facilitate the discovery of morphemes, the words in the Word Collection can also be filtered using regular expressions. The documentation contains a number of interesting examples of possible filters.
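The idea behind the two sort orders and the regular-expression filter can be sketched in a few lines of Python. This is a simplified reconstruction, not Alchemist's code: all function names are hypothetical, and the sketch assumes that the word limit keeps the first distinct words encountered, which the documentation may define differently.

```python
import re


def word_collection(text, max_words=500):
    """Split on white space and keep the first max_words distinct words."""
    distinct = list(dict.fromkeys(text.split()))  # preserves first-seen order
    return distinct[:max_words]


def sort_left_to_right(words):
    """Ordinary alphabetical order: words sharing a prefix end up adjacent."""
    return sorted(words)


def sort_right_to_left(words):
    """Order by the reversed word: words sharing a suffix end up adjacent."""
    return sorted(words, key=lambda word: word[::-1])


def filter_words(words, pattern):
    """Keep only the words in which the regular expression matches."""
    regex = re.compile(pattern)
    return [word for word in words if regex.search(word)]
```

Sorting "walked talking walks talked walking" from right to left groups the words in -ed, then those in -ing, then "walks"; the filter `ing$` keeps only "talking" and "walking".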

Using the mouse pointer, the user can mark roots and affixes in the Word Collection. Roots and affixes are then highlighted in different colors in the Word Collection. In addition, the morphemes, their types ('root' or 'affix') and the words in which they occur show up automatically in a list of morphemes, called the Morpheme Explorer. The same morpheme derived from different words and allomorphs can be merged in this Morpheme Explorer in a fairly intuitive way. The morphemes in the Morpheme Explorer can also be used as a filter on the Word Collection. Thus, clicking on one or more affixes followed by the button 'Show Filtered' will cause all words containing these affixes to be listed in the Word Collection. Using this filter, the user can jump in a very easy and efficient way from an affix to a root, from the root to other affixes etc.

The Word Collection and its analysis can be stored in XML following the GOLD standard (General Ontology for Linguistic Description, http://www.linguistics-ontology.org/). In later sessions, the user can open this XML document and continue the analysis. Merging two analyses or adding one text to an existing analysis doesn't seem to be possible.

EVALUATION

The software documentation is well written and contains a detailed description of all functions of Alchemist. However, the documentation does not mention the license, nor does it cover the installation process. While installation on Mac OS X and Windows was as smooth as it can be, I abandoned the compilation of Alchemist under Linux after it stopped with a cryptic message and neither the software documentation, nor the contact person, nor a Web search provided any helpful information.

The documentation also lacks a discussion of wider contexts in which the tool can be used. The user's acquaintance with the GOLD standard, or at least the willingness to use it, is taken for granted. Explaining the GOLD standard and its usefulness in the introduction of the documentation would increase the relevance of the tool.

The web-page of Alchemist does not contain additional information. There doesn't seem to be any active user group, help desk, mailing list or any other kind of information structure through which users and developers might interact.

The design of the interface is excellent. It integrates a nice help function. The usage is as intuitive as it can be. Individual windows, however, cannot be resized. Additional space might be gained by putting the R, A, C buttons after the Help button.

When tested in different contexts, however, the tool no longer seems as mature as its interface and documentation suggest. The most serious problem is related to the encoding of the input text file. Unlike in a web browser, there is no way to specify the encoding of the input file. The tool uniformly assumes that files have been encoded in Latin-1 (ISO 8859-1). Alchemist thus produces broken graphical representations for all other encodings, e.g. for German, Spanish or French text encoded in Unicode. Characters using more than one byte are split into meaningless symbols. As a consequence, the tool is limited to the Latin script and, within the Latin script, to those writing systems which fall within the scope of ISO 8859-1.
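The effect can be reproduced outside the tool. The following Python sketch simulates a reader that treats UTF-8 bytes as Latin-1; it only models the symptom and makes no claim about how Alchemist reads files internally. The function name is hypothetical.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1.

    Every byte maps to exactly one Latin-1 character, so each multi-byte
    UTF-8 sequence falls apart into several unrelated symbols.
    """
    return text.encode("utf-8").decode("latin-1")
```

Plain ASCII text passes through unchanged, but the single character "ü", which UTF-8 stores as two bytes, comes out as the two characters "Ã" and "¼".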

Thus not only many East and Central Asian languages but also many richly accented African languages cannot be processed by Alchemist unless transliterated into a form which falls within the scope of ISO 8859-1. To make it clear, this excludes writing systems using the Arabic script, abugida scripts, the Chinese script and the Cyrillic scripts, in addition to about 100 other scripts. Also excluded are many languages that use the Latin script but are not covered by ISO 8859-1, e.g. Czech (ISO 8859-2) and Turkish (ISO 8859-9). This failure to support Unicode should be corrected in future versions if the tool is to have any relevance.
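Whether a given text falls within the scope of ISO 8859-1 can be checked mechanically. The sketch below relies on Python's built-in "latin-1" codec; the function name `fits_latin1` is an assumption made for this example.

```python
def fits_latin1(text):
    """Return True if every character of text can be encoded in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False
```

French "garçon" passes this check, whereas Czech "český" and Turkish "ağaç" do not, because "č" and "ğ" lie outside ISO 8859-1.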

The input and output functions reveal additional problems. Although the input text file can be an RTF file or a plain text file, the RTF file I created with OpenOffice was not processed correctly, and RTF tags showed up in the Word Collection. Thus using plain text input files seems to be the only feasible option. The XML output contains huge amounts of rubbish characters. Strictly speaking, this is fool's gold and not XML. An inexperienced user would discard the output and with it the entire tool.

A problematic procedure is the transformation of the input text into word lists. Although this transformation is relatively easy for English, there is no general procedure which can do this transformation for all writing systems of the world without consulting a linguistic database. The white space character, the hyphen and the apostrophe may or may not be part of a word, depending on the writing system. Thus even common languages like French or Italian are processed incorrectly in Alchemist, as two words joined by an apostrophe (') are not split. Languages that can have a white space character within a word, e.g. Vietnamese and Sesotho (Roux 2005, Streiter & Stuflesser 2005), and languages without a word separator require more advanced techniques.
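A minimal Python sketch shows why even the French case needs language-specific knowledge. The lists `ELIDED` and `EXCEPTIONS` are tiny, hypothetical stand-ins for the linguistic database mentioned above, and the sketch assumes the plain ASCII apostrophe; none of this reflects what Alchemist does.

```python
# Hypothetical, deliberately incomplete list of French elided forms.
ELIDED = {"l", "d", "j", "n", "m", "t", "s", "c", "qu"}
# Words that contain an apostrophe but must stay whole.
EXCEPTIONS = {"aujourd'hui"}


def split_elisions(token):
    """Split an elided form from the following word, e.g. l'homme -> l' + homme."""
    if "'" not in token or token.lower() in EXCEPTIONS:
        return [token]
    head, tail = token.split("'", 1)
    if head.lower() in ELIDED and tail:
        return [head + "'", tail]
    return [token]


def tokenize(text):
    """White-space tokenization followed by elision splitting."""
    words = []
    for token in text.split():
        words.extend(split_elisions(token))
    return words
```

For "l'homme d'aujourd'hui" this yields four words: "l'", "homme", "d'" and "aujourd'hui". The first apostrophe in "d'aujourd'hui" is a word boundary and the second is not, which a purely character-based rule cannot know.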

While transforming the Word Collection into a collection of morphemes, I encountered the following problems. In some cases I would like to have a link back to the text in which a word occurred. Ambiguous words, e.g. 'reports', can be understood only in context, and providing a KWIC (keyword in context) view of the word might reveal whether it is a verb or a noun. In addition, when I tried to undo an analysis and deleted the affix from the Morpheme Explorer, the affix also disappeared from the data in the Word Collection. Clicking on one character in the Word Collection and deleting the related root in the Morpheme Explorer splits the root into two roots. I do not know whether this is intended behavior. Overall, there is no way to undo an analysis or to go back to an earlier stage in the analysis.
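A KWIC view of the kind I have in mind is simple to compute once the original token sequence is kept alongside the word list. The following Python sketch is purely illustrative; the function name `kwic` and its `width` parameter are assumptions, not an Alchemist feature.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence.

    tokens is the running text as a list of words; width is the number of
    context words shown on each side. Matching ignores case.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Applied to "She reports the news and the reports are long" with a width of 2, the two occurrences of 'reports' appear with the contexts "She ... the news" and "and the ... are long", which is enough to tell the verb from the noun.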

Finally, there are some minor problems:

* When I used the Help function on Mac OS X, the tool crashed several times after a few (maybe inexact) mouse movements. Unsaved data were then lost.

* The filter on the Word Collection and the Sorting of the Word Collection do not interact in a meaningful way. When a filter is used, words are sorted from left to right. When words are sorted from right to left, no filter can be used. I can however think of no linguistic motivation why both techniques should not be used in combination.

* Sometimes the system shows an unexpected outcome, e.g. after the deletion of a word from the Word Collection, the system falls back on the last morpheme-based filter.

SUMMARY

Overall, Alchemist is a very promising tool which will certainly find its way onto the linguist's desktop. It is well designed, easy to use and produces output in an important standard. However, the tool is not as solid as one would wish it to be. The main problem is that it does not support Unicode. This, however, might be solved easily in future releases. Non-Unicode files could then be converted on the fly to Unicode using functionality similar to iconv. Overcoming the difficulties in the creation of word lists will require more linguistic intelligence, e.g. in the form of a linguistic database. Finally, it is to be hoped that the development team will succeed in building a community around the tool, so that new users can join discussion groups when seeking support. This will also provide the feedback necessary to overcome the remaining problems with buttons, windows and file formats. After all, alchemy was not that unsuccessful, except in the production of gold. The Alchemist, however, promises something better: it will help you to produce a gold standard.

Oliver Streiter teaches computational linguistics and corpus linguistics at the National University of Kaohsiung, Taiwan. His current research focuses on the compilation and annotation of linguistic resources to support low density languages.