Counting Syllables Accurately in Python on Google App Engine

I wanted to be able to count syllables accurately in Python and looked around for existing code that I could re-use. I found one or two routines written in PHP that looked promising, so I ported them to Python, but I was pretty disappointed with the accuracy.

I also found a Python routine in the contributed code for NLTK that was not bad but, again, struggled with some words. You see, I had naively thought this would be a simple exercise. I hadn’t realised that syllable counting in English is pretty difficult stuff, with so many exceptions that even the most elegant algorithm ends up convoluted and clumsy.

The approach that did work well looks up the pronunciation of the word in Carnegie Mellon University’s pronunciation dictionary (cmudict), which is part of the Python-based Natural Language Toolkit (NLTK). The lookup returns one or more pronunciations for the word. The clever bit is that the routine then counts the vowel sounds in each pronunciation, and each vowel sound corresponds to one syllable. The raw entry from the cmudict file for the word SYLLABLE is shown below.

SYLLABLE 1 S IH1 L AH0 B AH0 L

The vowel sounds are the phoneme codes that end in a number (IH1, AH0 and AH0 above). The number marks the stress on that vowel: 0 for no stress, 1 for primary stress and 2 for secondary stress. Counting the codes that end in a digit therefore gives the number of syllables, three in this case. For the words that the dictionary knows about (120,000+ I believe), this is a very accurate method for obtaining the syllable count.
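As a rough illustration of that counting step, here is a minimal sketch in Python 3. It assumes the raw line has the layout shown above (word, pronunciation number, then phonemes separated by spaces), and the function name is my own invention for the example rather than anything from NLTK.

```python
def count_syllables_in_entry(raw_entry):
    """Return (word, syllable_count) for one raw cmudict line.

    Assumes the layout 'WORD N PHONEME PHONEME ...', e.g.
    'SYLLABLE 1 S IH1 L AH0 B AH0 L'.
    """
    fields = raw_entry.split()
    word = fields[0]
    phonemes = fields[2:]  # fields[1] is the pronunciation number
    # Vowel phonemes end in a stress digit (0, 1 or 2); consonants do not.
    count = sum(1 for phoneme in phonemes if phoneme[-1].isdigit())
    return word, count


print(count_syllables_in_entry("SYLLABLE 1 S IH1 L AH0 B AH0 L"))
# ('SYLLABLE', 3)
```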

However, there is a problem. As my target environment is Google App Engine, that little line at the top of the code that says…

import nltk

…ruins your entire afternoon.

You see, NLTK and Google App Engine don’t work well together due to NLTK’s recursive imports. I spent some time trying to unwind the recursive imports on cmudict so that it would work on Google App Engine, but to no avail.

So then I thought laterally and decided to build my own structure from the cmudict file (the raw text 3.6MB file that NLTK loads and wraps an object around). My plan was as follows:

            print " -Word (%s) found twice. First count was %s, second was %s" % (LszWord, GdcSyllableCount[LszWord], LliSyllableList)

    except:
        print "An error was encountered processing the file."
        raise IOError

    try:
        #-----
        # Now write the dictionary away to a new pickle file.
        # The file is opened in binary mode because protocol -1 is a binary format.
        LhaOutputFile = open('cmusyllables.pickle', 'wb')
        if not AlgQuiet:
            print "Finished processing input file\n\nNow dumping pickle file\n"
        pickle.dump(GdcSyllableCount, LhaOutputFile, -1)
        LhaOutputFile.close()
        if not AlgQuiet:
            print "Pickle file cmusyllables.pickle has been created."
    except:
        print "An error was encountered writing the pickle file."
        raise IOError

def main():
    #-----
    # Open the CMU file and for each entry create a dict with the resulting
    # number of syllables
    CreatePickle()

if __name__ == '__main__':
    main()
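Since only the tail of that listing survives here, the following is a hedged sketch of how the whole job could look. It is written in Python 3 rather than the Python 2 of the original, and it is not the original routine: the function names, the `cmudict` source path and the choice to store a list of distinct counts per lower-cased word are all assumptions made for illustration. It also assumes the file layout of word, pronunciation number, then phonemes.

```python
import pickle


def build_syllable_counts(lines):
    """Map each lower-cased word to the list of distinct syllable counts
    found across its pronunciations."""
    counts = {}
    for line in lines:
        fields = line.split()
        # Skip blank, malformed or comment lines.
        if len(fields) < 3 or fields[0].startswith(";;;"):
            continue
        word = fields[0].lower()
        # Vowel phonemes carry a trailing stress digit; count those.
        syllables = sum(1 for p in fields[2:] if p[-1].isdigit())
        seen = counts.setdefault(word, [])
        if syllables not in seen:
            seen.append(syllables)
    return counts


def create_pickle(source_path="cmudict", pickle_path="cmusyllables.pickle"):
    """Read the raw dictionary file and write the counts to a pickle file.
    Both paths are placeholders for this sketch."""
    with open(source_path, encoding="latin-1") as source:
        counts = build_syllable_counts(source)
    with open(pickle_path, "wb") as output:
        pickle.dump(counts, output, pickle.HIGHEST_PROTOCOL)
    return counts


# In-memory demonstration, so the sketch runs without the real file.
sample = [
    "SYLLABLE 1 S IH1 L AH0 B AH0 L",
    "FIRE 1 F AY1 ER0",
    "FIRE 2 F AY1 R",
]
print(build_syllable_counts(sample))
# {'syllable': [3], 'fire': [2, 1]}
```

Keeping the parsing in its own function means the logic can be checked on a few lines of text before it is pointed at the full 3.6MB file.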

This results in a dictionary lookup that gives an accurate syllable count (or counts, because some words have multiple pronunciations and therefore multiple syllable counts) for the words it has in its dictionary.

Words not in the Dictionary

But what about words that the dictionary doesn’t know about? Well, the way I handled that was to build a fallback routine into the code. The best (most accurate) mechanical routine I found was PHP-based and is part of Russel McVeigh’s site.

I ported Russel’s code to Python and added a couple of other exceptions that I found. Most of the mechanical syllable calculation routines I found work on the following basic syllable rules:

Count the number of vowels in the word

Subtract one for any silent vowels such as the e at the end of a word

Subtract any additional vowels in vowel pairs/triplets (ee, ei, eau, etc.) i.e. each group of multiple vowels scores only one vowel

The number you have left is the basic syllable count. There then follows a series of adjustments: if certain patterns are recognised in the word, syllables are added or taken away to arrive at the final count. Even with all this adjustment it’s never completely accurate, but it is perhaps good enough for those words not in the cmudict.
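To make the basic rules concrete, here is a minimal sketch in Python 3. It is not Russel’s routine or my port of it: it applies only the three basic rules plus one example adjustment (a final "le" after a consonant keeps its syllable), and treating "y" as a vowel is an assumption of the sketch. The function name is hypothetical.

```python
import re


def estimate_syllables(word):
    """Rough mechanical syllable estimate for an English word."""
    word = word.lower()
    # Counting runs of vowels is the same as counting every vowel and then
    # subtracting the extra vowels in pairs and triplets (ee, ei, eau, ...).
    count = len(re.findall(r"[aeiouy]+", word))
    # Subtract one for a silent trailing 'e', except in a consonant + 'le'
    # ending such as 'table', where the 'le' is pronounced.
    if word.endswith("e") and count > 1 and not re.search(r"[^aeiouy]le$", word):
        count -= 1
    # Every word has at least one syllable.
    return max(count, 1)


for sample in ("syllable", "make", "beautiful", "table", "rhythm"):
    print(sample, estimate_syllables(sample))
```

The sketch gets "syllable", "make", "beautiful" and "table" right but returns 1 for "rhythm", which is exactly the kind of miss that the pattern-based adjustments try to patch.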

So the code I’ve developed is really simple. It looks up syllable counts in the cmudict and returns the results if found, and if not it has a guess at the syllable count instead. I’d really like to share the code with you, but something in my WordPress theme or the syntax highlighter that I use objects to something in the code. Perhaps, as I’m not a proper programmer, it doesn’t like my esoteric, bastardised Hungarian notation variable names?
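The lookup-then-guess flow can be sketched in a few lines of Python 3. This is only an illustration under assumptions: the dictionary here is a tiny hand-made stand-in for the one loaded from the pickle file (for example with `pickle.load` on a file opened in `'rb'` mode), the fallback is a deliberately crude placeholder, and all the names are hypothetical.

```python
import re


def crude_guess(word):
    """Placeholder fallback: one syllable per run of vowels, minimum one."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def count_syllables(word, known_counts, fallback=crude_guess):
    """Return a list of possible syllable counts for the word.

    Uses the dictionary when the word is known, otherwise a single
    estimate from the fallback routine.
    """
    key = word.lower()
    if key in known_counts:
        return known_counts[key]
    return [fallback(key)]


# Stand-in for the dictionary built from cmudict.
known = {"syllable": [3], "fire": [2, 1]}

print(count_syllables("Syllable", known))  # [3]
print(count_syllables("blog", known))      # [1]
```

Returning a list in both cases keeps the caller’s code the same whether the answer came from the dictionary or from the guess.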

So I can’t post it here at the moment but will try to get that fixed. If you’re interested contact me and I’ll happily share it.

Danny Goodall

Edit – It looks like I *might* have solved that problem by using a different syntax highlighter.

5 Comments

The RhymeBrain API also includes syllable counting, and uses the CMU dictionary. As a fallback, when the dictionary doesn’t contain the pronunciation of a word, it derives a CMU-style pronunciation using machine learning and then counts the number of syllables.