One of the key challenges in learning Chinese is that Chinese text has no white space between words. You have to identify word boundaries visually, and when you don't know many words this is particularly difficult. This is what Chinese text looks like:

In this article I learned that the technical term for adding white space to Chinese text is "tokenization of raw text", and that, because Chinese requires "extensive token pre-processing", the more specific term is "segmentation". So, translated into technical lingo, what I was looking for was a Chinese Word Segmenter. The article provides ample information on the science behind the algorithms that let us write programs to segment Chinese text.
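To give a feel for what these algorithms do, here is a toy sketch (not the segmenter used below) of one of the simplest classic approaches, greedy "forward maximum matching": scan left to right and, at each position, take the longest word found in a dictionary. The dictionary and sentence here are illustrative assumptions, just to show the shape of the problem.

```python
# Tiny hypothetical dictionary of known words (illustrative only).
DICTIONARY = {"我", "喜欢", "学习", "中文"}

def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word that matches; fall back to a single
    character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

print(" ".join(segment("我喜欢学习中文", DICTIONARY)))
# prints: 我 喜欢 学习 中文
```

Real segmenters are far more sophisticated (statistical models trained on annotated corpora), but the output format is the same: the original text with spaces inserted at word boundaries.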

In this blog post I am only concerned with showing how I used the Chinese Word Segmenter to transform the text above into a version that is friendlier to Chinese language students.

I downloaded and unzipped the segmenter.

I put the text into a plain text file (.txt).

I executed the following command (as I understand it, ctb selects the Penn Chinese Treebank model, UTF-8 is the file's encoding, and the final 0 requests only the single best segmentation):

./segment.sh ctb file.txt UTF-8 0 > file.segmented.txt

The segmenter prints detailed output about the segmentation process; I will only mention the last line of that output here: