Mon Dec 16 2013 13:10
Markov vs. Queneau: Sentence Assembly Smackdown
I mentioned earlier that when assembling strings of words, Markov chains do a better job than Queneau assembly. In this post I'd like to a) give the devil his due by showing what I mean, and b) qualify what I mean by "better job".

Markov wins when the structure is complex

I got the original idea for this post when generating the fake ads for @pony_strategies. My corpus is the titles of about 50,000 spammy-sounding ebooks, and this was the first time I did a head-to-head Markov/Queneau comparison. Here are ten of Markov's entries, using the Markov chain implementation I ended up adding to olipy:

At Gas Pump!

The Guy's Guide To The Atkins Diet

Home Internet Business In The World.

101 Ways to Sharpen Your Memory

SEO Relationship Building for Beginners

Gary Secrets - Project Management Made Easy!

Weight Success

How get HER - Even If It's Just Money, So Easy and Effective Treatment Options

Sams Yourself

Define, With, Defeat! How To Get Traffic To Your Health

The Markov entries can get a little wacky ("Define, With, Defeat!"), which is good. But about half could be real titles without seeming weird at all, which is also good.

By contrast, here are ten of Queneau's entries:

Adsense I Collection Profits: The bottom Guide Income!

Reliable Your Earning Estate Develop Home And to life Fly Using Don't Your Partnership to Death

Help the Your Causes, Successfully Business Vegetarian

Connect New New Cooking

1 Tips, Me Life Starting to Simple Ultimate On Wills How Years Online With Living

Fast Survival Baby (Health Loss) Really How other of Look Symptoms, Your Business Encouragement: drive Health to Get with Easy Guide

At their very best ("Suceeding For Inspiring Life", "How Practice Health Best w/ Beauty"), these read like the work of a non-native English speaker. But most of them are way out there. They make no sense at all, or they sound like a space alien wrote them to deal with space alien concerns. Sometimes this is what you want in your generated text! But usually not.

A Queneau assembler assumes that every string in its corpus follows an identical grammar, with different tokens filling the same slots. This isn't really true for spammy ebook titles, and it certainly isn't true for English sentences in general. A sentence is made up of words, sure, but there's nothing special about the fourth word of a sentence, the way there is about the fourth line of a limerick.
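To make that assumption concrete, here's a minimal sketch of a Queneau assembler (a toy of my own, not olipy's implementation): token i of the output is drawn from position i of a randomly chosen corpus entry that's long enough.

```python
import random

def queneau_assemble(corpus, rng=random):
    """Build a title by drawing token i from position i of a random
    corpus entry -- treating every title as if it shared one grammar."""
    tokenized = [title.split() for title in corpus]
    # Borrow a plausible length from a real title.
    length = len(rng.choice(tokenized))
    output = []
    for i in range(length):
        candidates = [tokens[i] for tokens in tokenized if len(tokens) > i]
        output.append(rng.choice(candidates))
    return " ".join(output)

titles = [
    "101 Ways to Sharpen Your Memory",
    "The Guy's Guide To The Atkins Diet",
    "SEO Relationship Building for Beginners",
]
print(queneau_assemble(titles))
```

Nothing here ties token i to token i-1, which is exactly why the output can read like an alien wrote it.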

A Markov chain assumes nothing about higher-level grammar. Instead, it assumes that surprises are rare, that the last few tokens are a good predictor of the next token. This is true for English sentences, and it's especially true for spammy ebook titles.

Markov chains don't need to bother with the overall structure of a sentence. They focus on the transitions between words, which can be modelled probabilistically. (And the good ones do treat the first and last tokens specially.)
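Here's the corresponding sketch for a minimal first-order Markov chain (again a toy, not the olipy implementation), with explicit start and stop states so the first and last tokens do get treated specially:

```python
import random
from collections import defaultdict

START, STOP = "<start>", "<stop>"

def train(corpus):
    """Record every observed word-to-word transition. Duplicates are
    kept, so common transitions stay proportionally more likely."""
    transitions = defaultdict(list)
    for title in corpus:
        tokens = [START] + title.split() + [STOP]
        for current, following in zip(tokens, tokens[1:]):
            transitions[current].append(following)
    return transitions

def generate(transitions, rng=random):
    """Walk the chain from START until we hit STOP."""
    word, output = START, []
    while True:
        word = rng.choice(transitions[word])
        if word == STOP:
            return " ".join(output)
        output.append(word)

chain = train([
    "101 Ways to Sharpen Your Memory",
    "101 Ways to Build Your Home Internet Business",
])
print(generate(chain))
```

With those two training titles, the chain can cross over at the shared words ("Ways", "to", "Your") and produce hybrids like "101 Ways to Sharpen Your Home Internet Business".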

Markov wins when the corpus is large, Queneau when the corpus is tiny

Consider what happens to the two algorithms as the corpus grows in size. Markov chains get more believable, because the second word in a title is almost always a word commonly associated with the first word in the title. Queneau assemblies get wackier, because the second word in a title can be anything that was the second word in any title.

I have a corpus of 50,000 spammy titles. What if I chose a random sample of ten titles, and used those ten titles to construct a new title via Queneau assembly? This would make it more likely that the title's structure would hint at the structure of one or two of the source titles.

This is what I did in Board Game Dadaist, one of my first Queneau experiments. I pick a small number of board games and generate everything from that limited subset, increasing the odds that the result will make some kind of twisted sense.

If you run a Markov chain on a very small corpus, you'll probably just reproduce one of your input strings. But Queneau assembly works fine on a tiny corpus. I ran Queneau assembly ten times on ten samples from the spammy ebook titles, and here are the results:

Beekeeping by Keep Grants

Lose to Audience Business to to Your Backlink Physicists Environment

HOT of Recruit Internet Because Financial the Memories

Senior Guide Way! Business Way!

Discover Can Power Successful Life How Steps

Metal Lazy, Advice

Insiders Came Warts Weapons Revealed

101 Secrets & THE Joint Health Than of Using Marketing! Using Using More Imagine

Top **How Own 101**

Multiple Spiritual Dynamite to Body - To Days

These are still really wacky, but they're better than when Queneau was choosing from 50,000 titles each time. For the @pony_strategies project, I still prefer the Markov chains.

Queneau wins when the outputs are short

Let's put spammy ebook titles to the side and move on to board game titles, a field where I think Queneau assembly is the clear winner. My corpus here is about 65,000 board game titles, gathered from BoardGameGeek. The key to what you're about to see is that the median length of a board game title is three words, versus nine words for a spammy ebook title.

Here are some of Markov's board game titles:

Pointe Hoc

Thieves the Pacific

Illuminati Set 3

Amazing Trivia Game

Mini Game

Meet Presidents

Regatta: Game that the Government Played

King the Rock

Round 3-D Stand Up Game

Cat Mice or Holes and Traps

A lot of these sound like real board games, but that's no longer a good thing. These are generic and boring. There are no surprises because the whole premise of Markov chains is that surprises are rare.

Here's Queneau:

The Gravitas

Risk: Tiles

SESSION Pigs

Yengo Edition Deadly Mat

Ubongo: Fulda-Spiel

Shantu Game Weltwunder Right

Black Polsce Stars: Nostrum

Peanut Basketball

The Tactics: Reh

Velvet Dos Centauri

Most of these are great! Board game names need to be catchy, so you want surprises. And short strings have highly ambiguous grammar anyway, so you don't get the "written by an alien" effect.

Conclusion

You know that I've been down on Markov chains for years, and you also know why: they rely on, and magnify, the predictability of their input. Markov chains turn creative prose into duckspeak. Whereas Queneau assembly simulates (or at least stimulates) creativity by manufacturing absurd juxtapositions.

The downside of Queneau is that if you can't model the underlying structure with code, the juxtapositions tend to be too absurd to use. And it's really difficult to model natural-language prose with code.

So here's my three-step meta-algorithm for deciding what to do with a corpus:

If the items in your corpus follow a simple structure, code up that structure and go with Queneau.

If the structure is too complex to be represented by a simple program (probably because it involves natural-language grammar), and you really need the output to be grammatical, go with Markov.

Otherwise, write up a crude approximation of the complex structure, and go with Queneau.

It seems to me that when Markov chains generate output that's too plausible to be interesting, that's the kind of problem you WANT to have. You can keep giving the algorithm additional constraints until it struggles to generate sensible results. Like, multiply the frequency of alliterative sequences, or don't let titles end until they've re-used the same word three times, or Markov from the beginning of a board game title to the middle of a film title to the end of a board game title.

I agree with your generally negative attitude toward Markov chains, though. Ben Popik came by the synod the other day, and now my friend who emailed you to ask if he could pirate Constellation Games wants to do an exquisite corpse project, but I find it hard to muster enthusiasm for what's basically a non-automated Markov chain.

I'm working on a big weird quasi-interactive-fiction project, the scope of which is constantly fluctuating but which is, I think, about 85% of the way towards its simplest completion state. I'd love to get your input on it if you have the time.

As a side note, I think this highlights why the http://kingjamesprogramming.tumblr.com/ output got so much attention and seems more interesting than most Markov chain projects. My suspicion is that having two different sets of input with somewhat disjoint vocabularies and very different styles lets you get "runs" of stuff from each, and that the transitions then happen on the (relatively) rare shared terms.

It might be interesting to work out some generators where that is a more explicit goal, so that you transition the generator from source to source at some (average) predefined rate: every N words, switch between Jane Eyre and Hitchhiker's Guide or something.
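That idea could be sketched like so (a toy first-order version; the two sample corpora and the switch_rate parameter are made up for illustration): each source gets its own chain, and the walk can only hop between chains at words both sources share.

```python
import random
from collections import defaultdict

START, STOP = "<start>", "<stop>"

def train(sentences):
    """Build a first-order transition table for one source."""
    transitions = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [STOP]
        for current, following in zip(tokens, tokens[1:]):
            transitions[current].append(following)
    return transitions

def generate_switching(sources, switch_rate=0.2, rng=random):
    """Walk one source's chain at a time; with probability switch_rate,
    try to hop to another source's chain -- only possible at words
    that source has also seen."""
    chains = [train(source) for source in sources]
    current = rng.randrange(len(chains))
    word, output = START, []
    while True:
        if rng.random() < switch_rate:
            others = [i for i, chain in enumerate(chains)
                      if i != current and word in chain]
            if others:
                current = rng.choice(others)
        word = rng.choice(chains[current][word])
        if word == STOP:
            return " ".join(output)
        output.append(word)

print(generate_switching([
    ["the whale swam in the deep"],
    ["the towel is a useful thing"],
]))
```

Raising switch_rate shortens the "runs" from each source; with disjoint vocabularies, hops still only happen at the rare shared words, which is the King James Programming effect.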