Markov chain text generation in F#

Jul 4, 2015

A Markov chain models a series of random transitions from one state to another, with the next state depending only on the current state. This idea has plenty of applications; one of them happens to be generating weird text, and we'll do just that in this post.

Some of these techniques are also applied in my previous text classification posts. You may recognize that we're forming n-grams from the text using a sliding window (Seq.windowed), where n is pairSize. If our book were really big we might consider a streamed/lazily evaluated approach, but this is a blog post.
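The windowing step might look something like this (the names getWordPairs and pairSize come from the prose above; the naive split on spaces is my assumption, not necessarily the post's exact code):

```fsharp
// Slide a window of pairSize words across the text,
// producing overlapping n-grams as string arrays.
let getWordPairs pairSize (text: string) =
    text.Split ' '
    |> Seq.windowed pairSize

// "the quick brown fox" with pairSize 2 yields
// [|"the"; "quick"|], [|"quick"; "brown"|], [|"brown"; "fox"|]
```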

At this point you need a book. May I suggest one of the timeless classics available from Project Gutenberg? I'll use Franz Kafka's rather neurotic Metamorphosis, in which the protagonist Gregor Samsa wakes up one morning transformed into a monstrous insect.

Climbing Mt. Markov

For each distinct word, we can find the words directly after it throughout the book. We can construct a Map<string, string list>, i.e. each distinct word and a list of the words that can follow it.

This is probably the hardest part to grasp; maybe best to read it from bottom to top. We’re going to fold over the windowed sequence of words and construct our Markov Map as we go. That Map will be threaded through the fold operation as the accumulated value, and modified with each pass. In updateMap, if a particular word (or key) already exists in the Map then we replace the binding, otherwise we add a new binding.
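A minimal sketch of that fold, assuming the key is every word in the window except the last, joined with a space, and the value accumulates the words that can follow it (the exact shapes of updateMap and the key are my assumptions based on the prose):

```fsharp
// If the key already exists, replace its binding with the new word
// consed onto the existing list; otherwise add a fresh binding.
let updateMap (map: Map<string, string list>) key value =
    match Map.tryFind key map with
    | Some existing -> Map.add key (value :: existing) map
    | None -> Map.add key [value] map

// One step of the fold: split a window into (key, next word)
// and thread the accumulated map through.
let mapBuilder map (window: string[]) =
    let key = String.concat " " window.[.. window.Length - 2]
    updateMap map key window.[window.Length - 1]

// Fold over the windowed sequence, starting from an empty map.
let buildMarkovMap windows =
    windows |> Seq.fold mapBuilder Map.empty
```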

This has some similarities to our fold-based method of building the Markov map, in that it works with a current state and an accumulator. The markovChain recursion keeps going, randomly choosing the next element and accumulating it, until it reaches its stop condition.
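A sketch of that generator, assuming single-word states (i.e. pairSize 2) and a stop condition of hitting a sentence-ending word or a dead end; both assumptions are mine, not confirmed by the post:

```fsharp
let rnd = System.Random()

// Recursively walk the map: pick a random successor of the current
// state, accumulate it, and stop at a full stop or a missing key.
let rec markovChain (map: Map<string, string list>) state acc =
    let candidates = Map.find state map
    let next = List.item (rnd.Next(List.length candidates)) candidates
    if next.EndsWith "." || not (Map.containsKey next map) then
        List.rev (next :: acc)
    else
        markovChain map next (next :: acc)

// Kick off the chain from a starting word and join the result.
let generate map start =
    markovChain map start [start] |> String.concat " "
```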

Hot Kafka

How many sentences of this stuff do we want? The sky’s the limit. We can generate more Kafkaesque material in milliseconds than Franz himself could conjure up in a million years!

“Gregor found it easy to see him?”, and Gregor would have to re-arrange his life. Unlike him, she was looking out the window after she had to carry on with our lives and remember him with respect. It was only able to cover it and be patient, just to move about, crawling up and flat, they had no thought of closing the door and without letting the women when they went round the chest, pushing and pulling at it and rubbed it against the ground.

This may be even creepier than the source material!

Word pair sizing

Notice that if you increase the size of the word pairings, e.g. getWordPairs 4, the generated material starts to resemble the original text more closely; this is because each state of your Markov map is more specific. A word pair size of 4 or 5 will often regenerate sentences from the source text verbatim. A value of 2 will generate much more random-looking text, because each state consists of only one word. The results will also depend on the amount and variety of source text used to build your Markov model.
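To see why larger windows are more specific, here is a hypothetical illustration (not the post's code): with pairSize 4 the Markov state is the three-word prefix of each window, so states repeat far less often and transitions become nearly deterministic.

```fsharp
// With a window of 4, the state (key) is three words long.
"one fish two fish red fish".Split ' '
|> Seq.windowed 4
|> Seq.map (fun w -> String.concat " " w.[..2], w.[3])
|> Seq.toList
// yields ("one fish two", "fish"); ("fish two fish", "red"); ...
```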