KyTea

This is the home of the Kyoto Text Analysis Toolkit (KyTea, pronounced "cutie").
It is a general toolkit developed for analyzing text, with a focus on Japanese, Chinese and other languages requiring word or morpheme segmentation.

This software package contains the source code and a default model. The model uses the UTF-8 character encoding and estimates POS tags as well as pronunciations as they would be typed on a keyboard (which differs slightly from the actual phonetic pronunciations).
More details and a number of other models can be found on the KyTea Models page.

The code of KyTea is distributed under the Apache License, Version 2.0, and can be redistributed freely according to this license.
The models included with KyTea or distributed on the KyTea models page may be used for research or commercial purposes (except where noted otherwise), but may not be re-distributed without prior permission.

Install

KyTea has been tested on Linux, Mac OS X, and Windows (via Cygwin).
On Linux or Cygwin, download the source code and install it using the following commands.
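KyTea follows the standard autotools build convention; assuming the source archive has been downloaded and unpacked, a typical sequence looks like the following (the archive name and version are illustrative, not a specific release):

```shell
# Unpack the downloaded source archive (version number is illustrative)
tar -xzf kytea-X.Y.Z.tar.gz
cd kytea-X.Y.Z

# Configure, compile, and install (the install step may require sudo)
./configure
make
make install
```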

Let's say that this corpus is named train.full ("full" meaning that the file is fully annotated in the above format).
If we also have an unsegmented file named test.raw, we can train a model and analyze the unsegmented file using the following commands.
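Using the -full and -model options documented below, the two steps can be sketched as follows (the model filename train.mod is illustrative):

```shell
# Train a word segmentation/tagging model from the fully annotated corpus
train-kytea -full train.full -model train.mod

# Analyze the raw (unsegmented) text with the trained model
kytea -model train.mod < test.raw > test.full
```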

test.full will now contain the segmented text, with each word annotated with a POS tag and a pronunciation.

Usage

kytea

kytea performs word segmentation and tagging.

Analysis Options:
-model The model file to use when analyzing text
-nows Don't do word segmentation (raw input cannot be accepted)
-notags Don't do tagging (full input cannot be accepted)
-notag Skips a particular tag (-notag 1 will skip the first tag)
-nounk Don't estimate the pronunciation of unknown words
-wsconst Do not segment some character types (e.g. "D" to not segment digits)
-unkbeam The width of the beam to use in beam search for unknown words
(default 50, 0 for full search)
Format Options:
-in The formatting of the input (raw/full/part/conf/tok, default raw)
-out The formatting of the output (full/part/conf/tok/eda/tags, default full)
-tagmax The maximum number of tags to print for one word (default 3, 0 implies no limit)
-deftag A tag for words that cannot be given any tag (for example,
unknown words that contain a character not in the subword dictionary)
-unktag A tag to append to indicate words not in the dictionary
Format Options (for advanced users):
-wordbound The separator for words in full annotation (" ")
-tagbound The separator for tags in full/partial annotation ("/")
-elembound The separator for candidates in full/partial annotation ("&")
-unkbound Indicates unannotated boundaries in partial annotation (" ")
-skipbound Indicates skipped boundaries in partial annotation ("?")
-nobound Indicates non-existence of boundaries in partial annotation ("-")
-hasbound Indicates existence of boundaries in partial annotation ("|")
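As an illustration of the default separators above, a line of full annotation with a POS tag and a pronunciation tag for each word looks like the following (the particular Japanese words and tags are only an example):

```
これ/代名詞/これ は/助詞/は ペン/名詞/ぺん です/助動詞/です
```

Words are separated by the -wordbound separator (a space), and the tags of each word by the -tagbound separator ("/").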

train-kytea

train-kytea is a program to train models for KyTea.

Input/Output Options:
-encode The text encoding to be used (utf8/euc/sjis; default: utf8)
-full A fully annotated training corpus (can be used multiple times)
-tok A tokenized training corpus (can be used multiple times)
-part A partially annotated training corpus (can be used multiple times)
-conf A confidence annotated training corpus (can be used multiple times)
-feat A feature file generated by -featout
-dict A dictionary file (one 'word/pron' entry per line, multiple possible)
-subword A file of subword units. This will enable pronunciation estimation (PE) for unknown words.
-model The file to write the trained model to
-modtext Print a text model (instead of the default binary)
-featout Write the features used in training the model to this file
Model Training Options (basic)
-nows Don't train a word segmentation model
-notags Don't train a tagging model
-global Train the nth tag with a global model (good for POS, bad for PE)
-debug The debugging level during training (0=silent, 1=normal, 2=detailed)
Model Training Options (for advanced users):
-charw The character window to use for WS (3)
-charn The character n-gram length to use for WS (3)
-typew The character type window to use for WS (3)
-typen The character type n-gram length to use for WS (3)
-dictn Dictionary words greater than -dictn will be grouped together (4)
-unkn Language model n-gram order for unknown words (3)
-eps The epsilon stopping criterion for classifier training
-cost The cost hyperparameter for classifier training
-bias Whether to use a bias value in classifier training (true)
-solver The solver (1=SVM, 7=logistic regression, etc.; default 1,
see LIBLINEAR documentation for more details)
Format Options (for advanced users):
-wordbound The separator for words in full annotation (" ")
-tagbound The separator for tags in full/partial annotation ("/")
-elembound The separator for candidates in full/partial annotation ("&")
-unkbound Indicates unannotated boundaries in partial annotation (" ")
-skipbound Indicates skipped boundaries in partial annotation ("?")
-nobound Indicates non-existence of boundaries in partial annotation ("-")
-hasbound Indicates existence of boundaries in partial annotation ("|")