Abstract

This paper introduces two recent open-source software packages
developed for unsupervised natural language modeling.
The Morfessor program automatically segments words into
morpheme-like units without relying on any rule-based morphological analyzer.
The VariKN toolkit trains language models that consist of a
compact set of high-order n-grams, using state-of-the-art Kneser-Ney
smoothing. As an example, this paper shows how to construct
a language model for speech recognition in multiple languages
utilizing only a minimal amount of linguistic resources.
Morfessor and VariKN also have other applications in text understanding,
information retrieval and machine translation. Unsupervised
machine learning techniques are particularly well
suited for the development of systems for less-resourced languages,
because they do not depend on manually designed morphological
or syntactic analyzers, or on annotated data.