As I'm not working right now*,
I have a surfeit of time to hack on stupid things.
In particular, it seemed like a good idea to hack together a stupid data product.

Inspiration (if you can call it such) struck when the Allen Institute released
a dataset of elementary school science questions,
presumably so that people can train computers to answer them. I am not quite so
ambitious, so instead I decided to use them to train a computer to generate
new random bogus elementary school science questions.

Using markov chains to generate text this way isn't particularly exciting, so
I thought I'd also play around with the other parts of building a data product:
creating a backend service that serves up random questions, and creating a
single-page app that allows you to take these random quizzes.

If you know me, you're thinking "backend service, that sounds like a job for
Haskell" and also "single-page app, that sounds like a job for PureScript".
You're basically right, with a caveat we'll see later, but in this first blog
post let's just focus on the data piece.

We'll use a dead simple markov chain model. For each word in the dataset, we
make a list of all the words that we see following it. That is, if we had only two
questions: "What is love?" and "What the heck?" then our model would say that
'What' can be followed by either 'is' or 'the'. We generate a sentence by picking
a starting word and repeatedly choosing a random next word (based on the transitions
we learned from the data) until we finish a sentence.

This means our data goal is to generate these transitions. We'll represent them
as dicts where the keys are words, and the values are lists of next words:

```python
{'What': ['is', 'the'], 'is': ['love']}  # and so on
```

To start with, we can just download the CSV:
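A minimal sketch of that download step. The actual URL of the Allen Institute CSV isn't shown here, so both the URL and the local filename below are placeholders:

```python
import urllib.request

def download_questions(url, path="questions.csv"):
    """Fetch the questions CSV and save it locally."""
    urllib.request.urlretrieve(url, path)

# e.g. download_questions("http://example.com/science-questions.csv")  # placeholder URL
```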

We only care about the 'question' column, so let's just read that in.
(As always, use csv.reader, don't try to parse it by hand!)
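Concretely, that read step might look like this — a sketch that assumes the downloaded file is named `questions.csv` and has a header row with a 'question' column:

```python
import csv

def read_questions(path="questions.csv"):
    """Pull just the 'question' column out of the CSV."""
    with open(path) as f:
        return [row["question"] for row in csv.DictReader(f)]
```

`csv.DictReader` uses the header row to key each row by column name, so we never have to count commas ourselves.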

Now we're going to actually want two markov chain models. If you look at the
data, you'll see that questions and answers look quite different from one another.
So we'll generate one set of "question transitions" and another set of
"answer transitions".

To that end, we need a way to split a raw_question into a "question" piece and
several "answer" pieces. This looks like a job for regular expressions. Using the
example above, we might try something like

```python
re.split(r"\([A-D]\)", raw_question)
```

which looks for the letters A - D in parentheses, and splits the string
where it finds them. Looking at the file,
every question has either 3 or 4 answers, which means that the result of that split
should have 4 or 5 elements (including the question text).
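Here's that split applied to a made-up question in the dataset's format (the question itself is hypothetical, not from the data):

```python
import re

# hypothetical raw question with four (A)-(D) answers
raw_question = ("Which unit is used to measure temperature? "
                "(A) meters (B) liters (C) degrees (D) grams")
pieces = [x.strip() for x in re.split(r"\([A-D]\)", raw_question)]
# pieces[0] is the question text; pieces[1:] are the four answers
```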

We're almost ready; the other thing we'll do is use a couple of sentinels:

```python
START = "__START__"
STOP = "__STOP__"
```

We'll add a transition from START to the first word of every sentence, and a
transition from the last word of every sentence to STOP. That way, when we
generate sentences, we'll know where to start and when to stop. Our plan, then, is to:

1. collect all of the question "sentences" and answer "sentences" separately,
2. use the sentences to generate transition dicts, and
3. serialize the transitions, so we can use them in other programs.
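For completeness, here's roughly how the sentinels get used at generation time — a minimal sketch of the walk from START to STOP (the actual generation will live in the backend service):

```python
import random

START = "__START__"
STOP = "__STOP__"

def generate_sentence(transitions):
    """Walk the chain: start at a random successor of START, stop at STOP."""
    words = []
    word = random.choice(transitions[START])
    while word != STOP:
        words.append(word)
        word = random.choice(transitions[word])
    return " ".join(words)
```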

First, the collecting:

```python
questions = []
answers = []

for q in raw_questions:
    # 'splits' is our list of answer-delimiter regexes, e.g. [r"\([A-D]\)", ...]
    for split in splits:
        pieces = [x.strip() for x in re.split(split, q)]
        if len(pieces) in [4, 5]:
            questions.append(pieces[0])
            answers.extend(pieces[1:])
            break  # go on to the next raw_question
    else:
        # have you ever seen someone use a for .. else loop?
        print(q)  # me neither!
```

Now we have a list of questions and a list of answers, so let's turn
each into a dict of transitions. We'll use another regex to turn each question
(or answer) into words:

```python
re.findall(r"[^ ?\.,]+|\?|\.|\,", sentence)
```

That regex looks kind of cryptic, but it's just looking to match either

- a period,
- a comma,
- a question mark, or
- a "word" that contains none of the above or spaces.
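Running it on one of our example questions shows why we want this: punctuation comes out as its own token, so the chain can learn transitions into (and out of) question marks:

```python
import re

tokens = re.findall(r"[^ ?\.,]+|\?|\.|\,", "What is love?")
# → ['What', 'is', 'love', '?']
```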

We just need to remember to add the "sentinel" words, and we have our function:

```python
import re
from collections import defaultdict

def make_transitions(sentences):
    transitions = defaultdict(list)
    for sentence in sentences:
        # regex looks for "?", ".", "," or groups of characters that
        # aren't any of those, and aren't spaces
        words = [START] + re.findall(r"[^ ?\.,]+|\?|\.|\,", sentence) + [STOP]
        for prev_word, next_word in zip(words, words[1:]):
            transitions[prev_word].append(next_word)
    return transitions
```
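That leaves step 3, serializing the transitions. A `defaultdict(list)` of strings is plain JSON-serializable data, so one way to do it is a `json.dump` — a sketch, with a hypothetical output filename:

```python
import json

def save_transitions(transitions, path):
    """Write a transitions dict (word -> list of next words) out as JSON."""
    with open(path, "w") as f:
        json.dump(transitions, f)

# e.g. save_transitions(question_transitions, "question_transitions.json")  # hypothetical names
```

Any program that can read JSON — including, conveniently, a Haskell backend — can then load the transitions without knowing anything about how we built them.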