Chinese Language Stack Exchange is a question and answer site for students, teachers, and linguists wanting to discuss the finer points of the Chinese language.

Does anyone know of any open-source code for segmenting Pinyin syllables that have diacritic tone marks?

Example:

yìrúfǎnzhǎng -> yì rú fǎn zhǎng

C#, Python, or Lex is preferred but any language will do.

I've searched the internet but I'm coming up dry.

Edit: This is the algorithm I'm thinking of implementing myself:

Take the standard list of all 414 syllables, expand it to the 414 × 5 (tones) distinct syllable+tone combinations, and sort it by length, longest first. Starting at the beginning of the text, look for the longest Pinyin syllable that matches. Once a match is found, move to the end of that syllable and repeat. Of course, there is a problem with strings like xian, which can be read as either xian or xi'an: the longest match always picks xian because it is longer than xi. For short passages of text, and given the expected number of syllables, I think I can resolve this by backtracking.
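
The longest-match-with-backtracking idea above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a finished tool: `SYLLABLES` is a tiny hypothetical subset standing in for the full syllable table, the names `strip_tones` and `segment` are made up for this example, and instead of listing every syllable+tone combination it strips the tone marks and matches against toneless syllables, which is a simplification of the plan described. It also assumes each toned letter is a single precomposed character.

```python
import unicodedata

# Illustrative subset only; a real table would hold every valid toneless syllable.
SYLLABLES = {"yi", "ru", "fan", "zhang", "xi", "an", "xian"}
MAX_LEN = max(len(s) for s in SYLLABLES)
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}  # macron, acute, caron, grave


def strip_tones(text):
    """Drop the four tone diacritics but keep the diaeresis on ü."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def segment(text, count=None):
    """Split text into syllables, longest match first, backtracking on dead ends.

    count is the expected number of syllables (optional). Returns a list of
    syllables with their tone marks intact, or None if no split exists.
    """
    text = unicodedata.normalize("NFC", text)
    plain = strip_tones(text).lower()  # same length as text for precomposed input

    def solve(pos, remaining):
        if pos == len(text):
            return [] if remaining in (None, 0) else None
        if remaining == 0:
            return None  # text left over but no syllables allowed
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            if plain[pos:pos + length] in SYLLABLES:
                left = None if remaining is None else remaining - 1
                rest = solve(pos + length, left)
                if rest is not None:
                    return [text[pos:pos + length]] + rest
        return None  # forces the caller to try a shorter syllable

    return solve(0, count)


print(" ".join(segment("yìrúfǎnzhǎng")))  # yì rú fǎn zhǎng
print(segment("xian"))                    # ['xian']
print(segment("xian", count=2))           # ['xi', 'an']
```

Passing the expected syllable count is what lets the backtracking reject xian in favour of xi + an; without it, the longest match wins.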

I think you are on the wrong site. This site deals with the language and its usage, so anything to do with programming would be completely off-topic. You should try Stack Overflow. That being said, you might want to show your own research there, otherwise your question might get closed.
– deutschZuid, Aug 7 '13 at 22:45

I think it is more likely that someone in the Chinese StackExchange knows the answer to this question than anyone monitoring StackOverflow. StackOverflow doesn't even have tags for pinyin, mandarin, or chinese. This question is VERY specific to Mandarin. I've been studying Chinese for over a decade and I use programming to process Chinese all the time. It is very likely that there are others out there like me on the Chinese StackExchange.
– stuckintheshuck, Aug 7 '13 at 23:09

I can't find a downloadable piece of software that does this, but it is reasonably easy for a programmer to do (although it can't be done with 100% accuracy). If you want to attempt this yourself and then state where you are stuck, I can assist. If you are just going to wait for a solution to show up, your question will likely not get any answers. So if you are serious about this, have a go. I will check back in 24 hours to see if you want to keep going with this question.
– xiaohouzi79♦, Aug 8 '13 at 1:10

It is quite easy to code it yourself once you know the possible combinations of initials and finals in pinyin.
– 杨以轩, Aug 8 '13 at 2:57

To everyone: This question is quite borderline, and I was on the side of closing it because its main point is programming-related. Nevertheless, someone pointed out to me that the question is about Pinyin, which makes it a less clear-cut case. I won't close it, but if other moderators or the community close it, I'll agree with that decision as well.
– Alenanno♦, Aug 8 '13 at 11:11

Not sure if this works for cases such as splitting "changan" into "chang an", and "langan" into "lan gan" instead of "lang an"?
– xiaohouzi79♦, Aug 8 '13 at 4:45

@xiaohouzi79 Those are legitimately ambiguous cases (hence the apostrophes you see in "proper" spelling of Cháng'ān). The default (no apostrophe) case should be "changan" -> "chan" + "gan", even though that's almost certainly wrong semantically.
– Stumpy Joe Pete, Aug 8 '13 at 16:39
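
The default described in the last comment follows from the Pinyin apostrophe convention: a non-initial syllable that begins with a, o, or e is written with an apostrophe in front of it, so an unmarked boundary should never fall directly before one of those letters. A rough sketch of a splitter that enforces this is below. It is an illustration only: `SYLLABLES` is a small hypothetical subset of the real syllable table, `split_strict` is a made-up name, and the input is assumed to be toneless lowercase pinyin.

```python
# Illustrative subset only; a real table would hold every valid toneless syllable.
SYLLABLES = {"chan", "chang", "gan", "an", "lan", "lang", "xi", "xian"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def split_strict(word):
    """Split toneless pinyin; a/o/e may start a syllable only at the
    beginning of the word or right after an apostrophe."""

    def solve(pos, after_apostrophe):
        if pos == len(word):
            return []
        if word[pos] == "'":
            return solve(pos + 1, True)  # explicit boundary marker
        if pos > 0 and not after_apostrophe and word[pos] in "aoe":
            return None  # an unmarked boundary before a/o/e is not allowed
        for length in range(min(MAX_LEN, len(word) - pos), 0, -1):
            chunk = word[pos:pos + length]
            if chunk in SYLLABLES:
                rest = solve(pos + length, False)
                if rest is not None:
                    return [chunk] + rest
        return None  # dead end, so the caller tries a shorter syllable

    return solve(0, False)


print(split_strict("changan"))   # ['chan', 'gan']
print(split_strict("chang'an"))  # ['chang', 'an']
print(split_strict("langan"))    # ['lan', 'gan']
```

With this rule in place, "changan" and "langan" each have only one valid reading, and the chang + an reading is reachable only when the apostrophe is actually written.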