Tuesday, January 8, 2013

Python: Awful Jokes Bot

The bot has a giant nouns.txt file stored locally, from which it picks two random nouns. The two nouns are translated into ARPABET using a lookup table (a linear scan, so O(n) time; a binary search could cut that down to O(log n), but I'm lazy).
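For reference, the O(log n) lookup mentioned above could look roughly like the sketch below. It is not the bot's actual code: the tiny `TABLE`, the `lookup` name, and the stress-free transcriptions are all made up for illustration, and a real table would be loaded from a pronouncing dictionary and sorted by word.

```python
import bisect

# Hypothetical miniature lookup table, sorted alphabetically by word.
# Transcriptions are ARPABET with stress markers left out.
TABLE = [
    ("castanet", ["K", "AE", "S", "T", "AH", "N", "EH", "T"]),
    ("instrument", ["IH", "N", "S", "T", "R", "AH", "M", "AH", "N", "T"]),
    ("napkin", ["N", "AE", "P", "K", "IH", "N"]),
]
WORDS = [word for word, _ in TABLE]  # sorted keys for bisect


def lookup(word):
    """Binary-search the sorted table; return the phonemes or None."""
    i = bisect.bisect_left(WORDS, word)
    if i < len(WORDS) and WORDS[i] == word:
        return TABLE[i][1]
    return None
```

In practice a plain dict would make each lookup O(1) on average, at the cost of loading the whole table into memory first.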

The first and last few phonemes of the two words are compared (the suffix of one against the prefix of the other), and if there's a match, the two words are combined. E.g., 'napkin' and 'instrument' make 'napkininstrument.'
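The suffix/prefix comparison can be sketched as below. This is one plausible way to do it, not necessarily what the bot does: the function name `shared_overlap`, the `max_len` cap of three phonemes, and the stress-free phoneme lists are assumptions for the example.

```python
# Illustrative ARPABET transcriptions (stress markers omitted).
NAPKIN = ["N", "AE", "P", "K", "IH", "N"]
INSTRUMENT = ["IH", "N", "S", "T", "R", "AH", "M", "AH", "N", "T"]


def shared_overlap(first, second, max_len=3):
    """Return the longest run of phonemes that ends `first` and starts `second`.

    Tries the longest candidate first and returns an empty list if the
    words share nothing at the seam.
    """
    limit = min(max_len, len(first), len(second))
    for size in range(limit, 0, -1):
        if first[-size:] == second[:size]:
            return first[-size:]
    return []
```

For 'napkin' and 'instrument' this finds the two shared phonemes IH N, which is the overlap that has to be squeezed out of the combined word.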

In order to remove the shared phonemes, the program estimates how many consonants to remove by counting consonants in the phonemes. It's very sketchy and a syllabic approach would be much better.

You end up getting 'napkinstrument' if all goes well.
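Here is a guess at what the consonant-counting trick looks like in code. The names `count_consonants` and `merge` are hypothetical, and so is the exact rule (drop letters from the front of the second word until as many consonant letters have gone as there are consonant phonemes in the overlap). It shares the sketchiness described above: a vowel-only overlap removes nothing, and blends or silent letters will throw it off.

```python
# The 15 ARPABET vowel phonemes (stress markers omitted).
VOWEL_PHONEMES = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
VOWEL_LETTERS = set("aeiou")


def count_consonants(phonemes):
    """Count the phonemes that are not vowels."""
    return sum(1 for p in phonemes if p not in VOWEL_PHONEMES)


def merge(first_word, second_word, overlap):
    """Glue two words together, trimming the overlap off the second word.

    Letters are dropped from the start of `second_word` until the number
    of consonant letters dropped equals the consonant phonemes in `overlap`.
    """
    to_drop = count_consonants(overlap)
    cut = 0
    while cut < len(second_word) and to_drop > 0:
        if second_word[cut] not in VOWEL_LETTERS:
            to_drop -= 1
        cut += 1
    return first_word + second_word[cut:]


print(merge("napkin", "instrument", ["IH", "N"]))  # napkinstrument
```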

Now, to generate the joke (yes, it goes backwards), the program looks up the Wikipedia articles for the two words and tries to find the most frequent nouns on each page (which will hopefully be keywords). Oftentimes, the keyword turns out to be the word itself (one of the most common words on the "napkin" wiki page happens to be "napkin").

In this case, the program generated "napkin" and "castanet."
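The keyword step boils down to counting nouns in the article text. A minimal sketch follows, assuming the article is already downloaded as a plain string and that "is a noun" just means "appears in the noun list"; the function name `top_nouns` and the sample text are made up, and the real bot relies on someone else's Wikipedia code for the fetching part.

```python
import re
from collections import Counter


def top_nouns(text, nouns, k=5):
    """Return the k most frequent known nouns in `text` as (noun, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in nouns)
    return counts.most_common(k)


# Made-up stand-in for an article body and for the contents of nouns.txt.
sample = "The napkin is a cloth. A napkin sits by the plate. Fold the napkin on the plate."
known_nouns = {"napkin", "cloth", "plate"}
print(top_nouns(sample, known_nouns, k=2))  # [('napkin', 3), ('plate', 2)]
```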

Putting it together:

What do you get when you cross a castanet and a napkin? A napkinstrument!

The darker side:
The bot sucks. A lot. Sometimes it just happens to work very well, but there are a bunch of issues with it.

Skimming Wikipedia for keywords is not reliable. Oftentimes you get words that aren't even remotely related. Right now I use an arbitrary threshold: take the five most frequent nouns in the article and pick one at random. It might be interesting to instead take the nouns' frequencies into account and use the more frequent ones more often.
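The frequency-weighted idea is only a few lines with the standard library's `random.choices`. This is a sketch of the proposed improvement, not something the bot does today; `pick_keyword` and the sample counts are hypothetical.

```python
import random


def pick_keyword(noun_counts, rng=random):
    """Pick one noun at random, weighted by how often it appeared.

    `noun_counts` is a list of (noun, count) pairs with at least one
    positive count.
    """
    nouns = [noun for noun, _ in noun_counts]
    weights = [count for _, count in noun_counts]
    return rng.choices(nouns, weights=weights, k=1)[0]


# Made-up counts: "napkin" should come up about 3 times out of 6.
counts = [("napkin", 3), ("plate", 2), ("cloth", 1)]
print(pick_keyword(counts))
```

Passing a seeded `random.Random` instance as `rng` makes the choice repeatable, which helps when debugging why a particular awful joke came out.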

Word truncation based on consonants in phonemes is not a good idea, especially because of blends. This should be intuitive enough; everything about it smells like a hack. A better way would be to count syllables, or to somehow convert phonemes back into spelling without relying on a dictionary.

Certain words appear too often. The bot has a "may" fetish, for example, because "may" is both a noun and a modal verb, so when it skims Wiki articles, it tends to choose that word a lot.

That's all, folks. The bot currently uses someone else's code for the Wiki search; if I ever replace that I'll GitHub the entire thing and you can generate awful jokes right on your silly machines.