A TokenFilter that decomposes compound words found in many Germanic languages.

Package org.apache.lucene.analysis.compound Description

A filter that decomposes compound words found in many Germanic
languages into their word parts. This example shows what it does:

Input token stream

Rindfleischüberwachungsgesetz Drahtschere abba

Output token stream

(Rindfleischüberwachungsgesetz,0,29)

(Rind,0,4,posIncr=0)

(fleisch,4,11,posIncr=0)

(überwachung,11,22,posIncr=0)

(gesetz,23,29,posIncr=0)

(Drahtschere,30,41)

(Draht,30,35,posIncr=0)

(schere,35,41,posIncr=0)

(abba,42,46)
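The tuples above follow the (term, startOffset, endOffset, positionIncrement) convention: the original token keeps its position (increment 1, omitted in the listing), and each word part is emitted at the same position with increment 0. A minimal self-contained sketch of that output contract, assuming a greedy longest-match against a lowercase dictionary (the Token record and decompose helper are illustrative, not Lucene classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CompoundOutputSketch {
    // Illustrative token model: term text, offsets into the input, position increment.
    record Token(String term, int start, int end, int posIncr) {}

    // Emit the original token first (posIncr=1), then any dictionary word
    // parts at the same position (posIncr=0), preserving the original case.
    static List<Token> decompose(String term, int offset, Set<String> dict) {
        List<Token> out = new ArrayList<>();
        out.add(new Token(term, offset, offset + term.length(), 1));
        int pos = 0;
        while (pos < term.length()) {
            int matchEnd = -1;
            // Greedy: prefer the longest dictionary word starting at pos.
            for (int end = term.length(); end > pos; end--) {
                if (dict.contains(term.substring(pos, end).toLowerCase())) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd == -1) { pos++; continue; }
            out.add(new Token(term.substring(pos, matchEnd),
                              offset + pos, offset + matchEnd, 0));
            pos = matchEnd;
        }
        return out;
    }
}
```

With the dictionary {draht, schere}, decompose("Drahtschere", 30, ...) reproduces the Drahtschere tuples from the listing above, and a token with no dictionary matches (such as abba) passes through unchanged.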

The input token is always preserved and the filters do not alter the case of word parts. There are two variants of the
filter available:

HyphenationCompoundWordTokenFilter: uses a
hyphenation-grammar-based approach to find potential word parts of a
given word.

DictionaryCompoundWordTokenFilter: uses a
brute-force, dictionary-only approach to find the word parts of a given
word.

Compound word token filters

HyphenationCompoundWordTokenFilter

The HyphenationCompoundWordTokenFilter uses hyphenation grammars to find
potential subwords that are worth checking against the dictionary. The
quality of the output tokens is directly connected to the quality of the
grammar file you use; for languages like German, the available grammars are quite good.

DictionaryCompoundWordTokenFilter

The DictionaryCompoundWordTokenFilter uses a dictionary-only approach to
find subwords in a compound word. It is much slower than the
hyphenation-grammar variant, but because it is much simpler in design, it
is a good starting point for checking whether your dictionary is adequate.
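The brute-force approach can be sketched as follows: every substring whose length lies within a subword length range is tried against the dictionary. The method name and the min/max parameters below mirror the idea, not the exact Lucene API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DictionarySketch {
    // Try every substring whose length lies in [minSubword, maxSubword]
    // against the dictionary -- simple, but it performs far more lookups
    // than the hyphenation-guided variant.
    static List<String> subwords(String word, Set<String> dict,
                                 int minSubword, int maxSubword) {
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            int maxEnd = Math.min(word.length(), start + maxSubword);
            for (int end = start + minSubword; end <= maxEnd; end++) {
                String cand = word.substring(start, end);
                if (dict.contains(cand.toLowerCase())) parts.add(cand);
            }
        }
        return parts;
    }
}
```

For "Rindfleisch" with the dictionary {rind, fleisch} this tries every substring of the allowed lengths and keeps "Rind" and "fleisch"; the quadratic number of candidates is what makes this variant slow on long compounds.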

Dictionary

The output quality of both token filters is directly connected to the
quality of the dictionary you use. Dictionaries are of course language
dependent. You should always use a dictionary
that fits the text you want to index: if you index medical text, for
example, you should use a dictionary that contains medical terms.
A good starting point for general text are the dictionaries found on the
OpenOffice
dictionaries
Wiki.