This release is pre-ALPHA and is made available to allow
testing outside the current development environment. Future,
more stable releases with better documentation are currently
under development. Please report problems directly to
Alan W Black (awb@cs.cmu.edu).
This directory contains scripts and models for the expansion of
non-standard words to simple words. That is, this software is designed
to expand arbitrary tokens in text to simple words, expanding numbers,
abbreviations, Roman numerals, etc. This work is a product of the
CLSP Summer Workshop at Johns Hopkins University, 1999.
Authors: Alan W Black (awb@cs.cmu.edu)
Stan Chen (sfc@cs.cmu.edu)
Shankar Kumar (skumar@clsp.jhu.edu)
Mari Ostendorf (mo@rcs.ee.washington.edu)
Chris Richards (crichard@wso.williams.edu)
Richard Sproat (rws@research.att.com)
Please read
http://www.clsp.jhu.edu/ws99/projects/normal/
for more details on the project, and the final report there
for a description of the scientific aspects of this work.
---------------------------------------------------------------------
The distribution consists of a number of parts:
nsw-X.X.tar.gz
Basic expansion scripts and basic expansion models for four
domains. Includes scripts for building new models from data.
This part of the distribution is free software and may be used
for any purpose, commercial or otherwise.
nsw-data-xxx.tar.gz
XML marked-up data (raw, labels and marked-up XML) for various
databases. These fall under varying licences and are not all
freely re-distributable.
nsw-data-rfr.tar.gz
From rec.food.recipes, freely re-distributable
nsw-data-pc110.tar.gz
From the pc110 mailing list, freely re-distributable
The PC110 is an IBM palmtop PC; the list's text is technical and e-mail-like.
nsw-data-nantc.tar.gz
Data is a subset of the LDC's North American News Text Corpus
(Wall Street Journal, New York Times, LA Times and two Reuters
news sources). You must have access to the LDC's CD to use this.
Scripts are included to take the raw data from the CD and annotate
it with the provided labels.
nsw-data-classifieds.tar.gz
Classified real estate ads from various sources, as collected by
the LDC.
============
REQUIREMENTS
============
You must have gnumake (any version) and the festival speech synthesizer
installed. GNU make is available from any good FTP site, while Festival
(1.4.0 or later) is available from
http://www.cstr.ed.ac.uk/projects/festival.html
or
http://www.speech.cs.cmu.edu/festival/index.html
In this release Festival is simply used as a scripting language,
and technically only the CMU lexicon needs to be installed in addition
to the festival executable, though I would recommend installing a complete
US English voice. Festival is used because it contains all the sub-parts
necessary to run the basic expansion model (e.g. lexical access,
CART interpretation, tokenizing, regex support, ngrams and Viterbi
decoding), even though we are not doing synthesis here.
Later releases, which will support model building, will require more
aspects of Festival and the Edinburgh Tools (notably the Wagon CART
tree builder) and possibly other FSM libraries (FSMTOOLS and LEXTOOLS
from AT&T) for some aspects of building.
At present these scripts have only been tested under Unix systems
(Linux, FreeBSD and Solaris). There is nothing that explicitly stops
them from working under NT (everything in the expander is Festival
internal) but we have not looked at this at all.
============
INSTALLATION
============
At present the system only offers the ability to expand texts using
the pre-built domain models (nantc, classifieds, pc110 and rfr).
To install:
cd config
cat config-dist >config
cd ..
gnumake
The make process is very short; it merely builds a few scripts
based on the pathname of your festival binary.
The program festival should be in your path for this to work, or
you may explicitly set the variable FESTIVAL in config/config, as in
FESTIVAL := /usr/local/festival/bin/festival
Note that the default automatic setting of this variable (through "which festival")
may not work properly in multiple NFS environments.
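If the automatic lookup is in doubt, the path can be checked by hand
before editing config/config. The following is only a sketch for a
POSIX shell; the helper name find_festival is hypothetical and is not
part of this distribution, and the fallback path is just the example
path given above.

```shell
# Hypothetical helper (not part of the distribution): print the festival
# binary found on PATH, or a fallback path if none is found.
# The default fallback is only the example path from this README.
find_festival () {
    fallback=${1:-/usr/local/festival/bin/festival}
    if found=$(command -v festival 2>/dev/null) && [ -n "$found" ]; then
        printf '%s\n' "$found"
    else
        printf '%s\n' "$fallback"
    fi
}

# Print a line that could be pasted into config/config after checking it.
printf 'FESTIVAL := %s\n' "$(find_festival)"
```

The printed line should still be checked by eye, since a path that is
valid on one machine may not be valid on another that shares the same
installation.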
=====
USAGE
=====
Basic usage will expand an arbitrary text file (no XML markup is
required) into words.
bin/nsw_expand -domain classifieds examples/ads2.txt -output ads2.word
Various output formats are supported (more to follow). The default output
is simply words with the whitespace/newlines from the original; this
won't be useful in many cases.
bin/nsw_expand -domain classifieds examples/ads2.txt -format opl -output ads2.word
Format opl (one per line) outputs each found token, its NSW tag, a
binary flag telling you if this is the first token in a split or not
(tokens that weren't separated from previous tokens by white space
will have 0), and then the list of words that the token expands to.
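As an illustration of post-processing opl output, the sketch below lists
only the tokens whose expansion differs from the token itself. It
ASSUMES that each opl line is whitespace-separated in the order
described above (token, tag, flag, then the words); the separator is
not specified here, so check real output first. The helper name
opl_changed is hypothetical.

```shell
# Hypothetical helper (not part of the distribution).
# ASSUMPTION: each opl line looks like
#   TOKEN TAG FLAG WORD...
# separated by whitespace. Lines with fewer than three fields are skipped.
opl_changed () {
    awk 'NF >= 3 {
        # Join fields 4..NF back into the expansion string.
        words = ""
        for (i = 4; i <= NF; i++) words = words (i > 4 ? " " : "") $i
        # Report only tokens that were actually rewritten.
        if (words != $1) print $1 " -> " words
    }' "$@"
}

# Usage (reads the named files, or standard input if none are given):
#   opl_changed ads2.word
```

If the real format turns out to use a different separator, only the awk
field handling would need to change.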
Other formats will be added when we have a better idea of what is needed.
Multiple files may be expanded by listing multiple input files and
specifying an output filename containing a %s, e.g.
bin/nsw_expand -domain classifieds examples/*.txt -format opl -output out/%s.word
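If more control over output names is wanted than %s gives, an explicit
loop over the input files can be used instead. This is only a sketch:
the helper names out_name and expand_each and the NSW_EXPAND variable
are hypothetical, and the out/NAME.word naming is simply chosen to
mirror the example above.

```shell
# Hypothetical helper: map an input file to an output name of our choosing,
# e.g. examples/ads2.txt -> out/ads2.word
out_name () {
    printf 'out/%s.word\n' "$(basename "$1" .txt)"
}

# Hypothetical wrapper: expand each input file separately in opl format,
# stopping at the first failure.
# NSW_EXPAND may be overridden (e.g. NSW_EXPAND=echo for a dry run).
expand_each () {
    for f in "$@"; do
        ${NSW_EXPAND:-bin/nsw_expand} -domain classifieds "$f" \
            -format opl -output "$(out_name "$f")" || return 1
    done
}

# Usage (creating the out directory first is a precaution; whether
# nsw_expand creates it itself is not known here):
#   mkdir -p out
#   expand_each examples/*.txt
```

Running with NSW_EXPAND=echo prints the commands that would be run
without invoking the expander.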
The example database files are marked up in XML format (called NSWML).
The input mode may be specified on the command line:
bin/nsw_expand -domain classifieds -mode NSWML data/classifieds/xml/adsBG.aa.xml
======
FUTURE
======
The beginnings of model building are included in this release, but it is
neither fully tested nor documented yet. We also intend to provide
full scripts and instructions for building expansion models for
unlabelled data.
Documentation, databases, test suites, etc. are obviously currently
missing.
If you have specific requests, or are using this work in any way,
please let us know, as we want this to be as useful as possible.
Minor changes as well as large recommendations are welcome.