Czech Feature-based Tagger (and full morphology)

Download

Description

The Feature-based (exponential model) Tagger is a fast implementation
of the Czech tagger developed at UFAL and described elsewhere on
these pages (Czech Language Tagging page).
To get the best possible results, the tagger requires
preprocessing by a Czech morphological module with very high
coverage. This module covers a superset of the Czech "FM" morphology.
Both the morphological module and the tagger are supplied as binary
executables, together with all necessary precompiled Czech data.
Input must be in the ISO Latin 2 (iso-8859-2) encoding and follow the
usual csts.dtd definition; output is produced in the same form (ISO
Latin 2 encoding, csts.dtd). (As is the case with many of the tools
provided with PDT 1.0, both executables also accept, and then produce,
a "simplified SGML", which is not real, valid SGML, but simply
contains at least the tags for words, punctuation, and sentence
breaks, one item per line.)
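
Text that is stored in another encoding has to be converted to ISO Latin 2
before it is passed to the tools. The following is a minimal sketch of such
a conversion, not part of the distribution: the function names and the
assumption that the source text is UTF-8 are illustrative only. It does not
add or check any csts.dtd markup.

```python
from pathlib import Path

# Encoding required by the tagger and the morphological module (ISO Latin 2).
TAGGER_ENCODING = "iso-8859-2"


def to_latin2(text: str) -> bytes:
    """Encode text as ISO Latin 2, failing loudly on unsupported characters."""
    # errors="strict" raises UnicodeEncodeError instead of silently
    # replacing characters that ISO Latin 2 cannot represent.
    return text.encode(TAGGER_ENCODING, errors="strict")


def convert_file(src: Path, dst: Path, src_encoding: str = "utf-8") -> None:
    """Re-encode a file from src_encoding (assumed UTF-8 here) to ISO Latin 2."""
    text = src.read_text(encoding=src_encoding)
    dst.write_bytes(to_latin2(text))
```

Failing on unencodable characters is a deliberate choice in this sketch: a
character silently replaced by "?" would reach the tagger as a different
word form.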

The current on-line (client/server) version of the tagger can be found
here (the same page as the "FM" online morphology; use the appropriate
checkbox to invoke the tagger instead of the "FM" morphology).

Supported platforms

The tagger and the included morphological module are compiled for
Linux (2.2.x and above, such as Red Hat 6.2 and later) and Solaris
(SunOS 5.7 and later) on SPARC machines.

Installation

The names of the binary packages depend on the version (date) of
the distribution: they have the form CZyymmddx.tgz (for Linux) and
CZyymmddxs.tgz (for Solaris), where yymmdd is the year, month and
day of the distribution. Occasionally, a new distribution can be
found on UFAL's/CKL's website(s) in addition to
the one on the distribution CD (CZ010619x.tgz, CZ010619xs.tgz).
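
When several distributions are at hand, the naming scheme above can be
decoded mechanically. The following is a small sketch under stated
assumptions: the helper name is hypothetical, the "x" is treated as a
literal character (as in the two file names quoted above), and the
two-digit year is interpreted by Python's %y rule, which maps 00-68 to
2000-2068.

```python
import re
from datetime import date, datetime

# CZ + yymmdd + literal "x" + optional "s" (Solaris) + ".tgz"
_PACKAGE_RE = re.compile(r"^CZ(\d{6})x(s?)\.tgz$")


def parse_package_name(name: str) -> tuple[date, str]:
    """Return (distribution date, platform) for a CZyymmddx[s].tgz name."""
    match = _PACKAGE_RE.match(name)
    if match is None:
        raise ValueError(f"not a CZyymmddx[s].tgz package name: {name!r}")
    yymmdd, solaris_suffix = match.groups()
    # strptime also rejects impossible dates such as month 13.
    released = datetime.strptime(yymmdd, "%y%m%d").date()
    platform = "Solaris" if solaris_suffix else "Linux"
    return released, platform
```

For the packages on the distribution CD, parse_package_name("CZ010619x.tgz")
returns (date(2001, 6, 19), "Linux") and parse_package_name("CZ010619xs.tgz")
returns (date(2001, 6, 19), "Solaris").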

Unpack the CZyymmddx[s].tgz archive in the directory
where you want the tagger to live, e.g. (assuming you first downloaded
or copied it to your home directory):