Auto-generating LShift blog posts

I’ve often found myself at a loss for blog post topics, so rather than write one myself I decided to let a computer do the heavy lifting!

Markov chains offer a neat trick for generating surrealist blog oeuvres. They work by figuring out the probability of one word appearing after another, given a suitable corpus of input material.

The meat of the algorithm is surprisingly simple. Given a sequence of tokens (words and punctuation characters), you build a mapping from each token to the frequencies of the tokens that appear immediately after it.

To give it a whirl, clone the GitHub repo and make sure you have Leiningen installed. Here’s a fairly noddy example:
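(The repo's own example is Clojure; as an illustrative stand-in, here's the core of the frequency map sketched in Python — `build_frequency_map` is a name invented for this sketch, not the repo's API.)

```python
from collections import Counter, defaultdict

def build_frequency_map(tokens):
    # Map each token to a Counter of the tokens seen immediately after it.
    freq = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        freq[current][following] += 1
    return freq

tokens = "the cat sat on the mat .".split()
freq = build_frequency_map(tokens)
# "the" has been followed once each by "cat" and "mat"
```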

Using the frequency map, you derive a new stream of tokens by starting from a particular token and following the trail, making a weighted random choice among the available next tokens at each step. The random-next-token function takes care of this. There are also some helpers for stitching the tokens back together and spitting out sentences. Here’s a sample run based on the last twenty blog posts:
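In Python terms, the weighted choice and the stitching might look like this (toy frequency map; `random_next_token` here is a sketch of the idea, not the Clojure function itself):

```python
import random
from collections import Counter

# A toy frequency map: token -> counts of the tokens that follow it.
freq = {
    "the": Counter({"cat": 2, "dog": 1}),
    "cat": Counter({"sat": 1}),
    "dog": Counter({"sat": 1}),
    "sat": Counter({".": 1}),
}

def random_next_token(freq, token):
    # Weighted random choice among the tokens seen after `token`.
    choices, weights = zip(*freq[token].items())
    return random.choices(choices, weights=weights)[0]

def generate_sentence(freq, start):
    tokens = [start]
    while tokens[-1] in freq:  # stop once we hit a token with no successors
        tokens.append(random_next_token(freq, tokens[-1]))
    # Stitch back together, keeping punctuation snug against the previous word.
    out = tokens[0]
    for t in tokens[1:]:
        out += t if t in ".,;!?" else " " + t
    return out
```

With the toy map above, `generate_sentence(freq, "the")` yields either "the cat sat." or "the dog sat.".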

The devil is always in the details, and Markov chains are sensitive to the peculiarities of the input text. As is so often the case in the software world, 99% of the code is munging data into shape and 1% is a nifty algorithm. LShift blog posts present a challenge because you’ll find identifiers, magic numbers and great heaving code stanzas slap-bang in the middle of a sentence. Those bits have to be stripped out or the algorithm will get lost down a blind alley of symbology.
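A crude version of that munging, sketched in Python with regexes (the patterns here are guesses at the sort of noise involved, not what the actual code does):

```python
import re

def strip_code(text):
    # Remove fenced code blocks, inline code spans, and bare function calls.
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"`[^`]*`", " ", text)                     # inline code spans
    text = re.sub(r"\b\w+\(\)", " ", text)                   # calls like foo()
    return text

def tokenize(text):
    # Split into word tokens and sentence punctuation.
    return re.findall(r"\w+|[.,;!?]", text)

sample = "Call `frobnicate()` like this: ```(defn f [x] x)``` and move on."
tokens = tokenize(strip_code(sample))
```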

You may need to run it a few times before it produces anything giggle-worthy. Automated silliness detection is left as an exercise for the reader…