news and discussions mainly related to Chinese characters and romanization

OCR and Pinyin texts

[This entry is largely for my own reference. But feel free to read on, especially if you’re interested in OCR or if you somehow happen to have a lot of Pinyin texts lying around.]

What’s the best way to run optical character recognition (OCR) on texts written in Pinyin with tone marks? Adobe Acrobat 7.0 Standard, the most advanced such software I have on my computer, doesn’t have a “Pinyin” setting. I’d be surprised if any OCR software currently does.

Getting second tones, fourth tones, and umlauts to be read correctly shouldn’t be a big problem, given how the same marks are standard in the orthographies of many European languages. But first tones and third tones are a different matter. The best that can probably be hoped for at present is a more-or-less regular rendering of vowels with first- and third-tone marks as something else that can be fixed quickly through a search-and-replace procedure.
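The search-and-replace idea above might be sketched roughly as follows. This assumes, purely for illustration, that the OCR engine renders first-tone macrons as tildes (ā read as ã) and third-tone carons as breves (ǎ read as ă). The actual substitutions depend on the software and its language settings, so the table of swapped marks is hypothetical and would need adjusting after inspecting real output. The function name `fix_tones` is likewise made up for this sketch.

```python
import unicodedata

# Hypothetical mapping of misread combining marks to the intended Pinyin marks.
# Assumption: the OCR output uses tilde for first tone and breve for third tone.
MARK_FIXES = {
    0x0303: 0x0304,  # combining tilde -> combining macron (first tone)
    0x0306: 0x030C,  # combining breve -> combining caron (third tone)
}


def fix_tones(text: str) -> str:
    """Swap misrecognized diacritics for Pinyin tone marks.

    The text is decomposed so each diacritic is a separate combining
    character, the marks are swapped, and the result is recomposed.
    """
    decomposed = unicodedata.normalize("NFD", text)
    swapped = decomposed.translate(MARK_FIXES)
    return unicodedata.normalize("NFC", swapped)


if __name__ == "__main__":
    # Sample OCR output with breves where carons belong.
    print(fix_tones("n\u012d h\u0103o"))
```

Working on the combining marks rather than on whole letters keeps the table short and covers upper- and lowercase vowels alike. Second- and fourth-tone vowels and plain ü pass through unchanged. The catch is that every tilde and breve in the text gets replaced, so this only suits texts that are entirely in Pinyin.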

Looks like my other entry didn’t make it. I reported that OCR software like FineReader (ABBYY) can be made to work for Pinyin. It requires quite a bit of tweaking and training, but the result is excellent once you understand how to use the software. I’ll try to put together an online tutorial if I have the time…

Hi everybody! I’m a complete newbie, so first of all I wanna apologize if I’m posting in the wrong place; cut to the chase:

Issue 1: I have some PDF files containing only Pinyin, and tried to OCR them. I’ve had no difficulty at all with the basic Latin alphabet, except for the following set of characters { o ? ?? ? ? ? ? ? ? ? ? ? ? ? á ?? é í ó ú ? Á É Í Ó Ú ? ? ?? ? ? ? ? ? ? ? ? ? ? ? à ?? è ì ò ù ? À È Ì Ò Ù ? a ? e i o u ü A E I O U Ü }, which are not recognized by ABBYY, Acrobat, Tesseract, etc. I’ve tried training them, using a combination of different languages, and a million other things, like asking in dozens of forums, but no luck.

Issue 2: I also have some resources in true-PDF format with those damn embedded font subsets, and when I try copy-and-paste to rearrange the layout, the file becomes unmanageable because the text gets completely illegible. I’ve installed most of the fonts that I didn’t have on my PC, and also the PitStop plugin, but I cannot find a solution (for example, replacing every character in the file that uses a certain embedded font subset with a different font, while keeping the original character shape).

As you can see, issue 1 and issue 2 are related inasmuch as solving issue 1 would also put an end to issue 2.
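For reference, the tone-marked vowels that Pinyin needs can be listed programmatically, which may help when building a training set or checking OCR output. This is a minimal sketch that assumes the standard four tone marks (macron, acute, caron, grave) on a, e, i, o, u and ü in both cases. It is not a reconstruction of the exact list in the comment above, and the name `pinyin_toned_vowels` is made up for illustration.

```python
import unicodedata

BASE_VOWELS = "aeiou\u00fc"  # a e i o u plus u-with-diaeresis
TONE_MARKS = {
    1: "\u0304",  # combining macron
    2: "\u0301",  # combining acute accent
    3: "\u030c",  # combining caron
    4: "\u0300",  # combining grave accent
}


def pinyin_toned_vowels() -> dict[int, list[str]]:
    """Return the precomposed toned vowels, grouped by tone number."""
    table: dict[int, list[str]] = {}
    for tone, mark in TONE_MARKS.items():
        letters = []
        for vowel in BASE_VOWELS + BASE_VOWELS.upper():
            # NFC turns base letter + combining mark into one code point.
            letters.append(unicodedata.normalize("NFC", vowel + mark))
        table[tone] = letters
    return table


if __name__ == "__main__":
    for tone, letters in pinyin_toned_vowels().items():
        print(tone, " ".join(letters))
```

Each tone yields twelve letters (six lowercase, six uppercase), and the first- and third-tone rows are the ones the OCR programs discussed here tend to miss.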