Thursday, December 4, 2014

tl;dr: It looks like names aren't well modeled as a Markov process, but you can install my R package that does model names as Markov processes and mess around with it.

I don't know how I wound up writing a "name" generator yesterday, but I did. And now it's an R package (just on GitHub for the moment), so you can play around with it too (https://jofrhwld.github.io/nameGenerator/).

I was messing around with some other research questions when I decided to see what would happen if I tried to model given first names as a Markov process. Here's a picture of a Markov chain that contains just the letters of my own name: Joe.

First, you start out in a start state. Then, you move, with some probability, to the o, j, or e character. Then, you move, with some probability, to one of the other states (one of the other letters or end), or stay in the same state. In this figure, I've highlighted the path that my own name actually takes, but there are an infinite number of possible paths through these states, including "names" such as "Eej", "Jeoeojo", "Jojoe", etc.

A Markov chain for all possible names would look a lot like this figure, but would have one state for every letter. Now, I keep saying that you move from one state to the next with "some probability," but with what probability? If you have a large collection of names, you can estimate these probabilities from the data. You just calculate, for each letter, the probability of any following letter. So for the letter "j", you count how many times a name went from "j" to any other letter. For boys' names in 2013, that looks like this table.

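| from | to | count |
| --- | --- | --- |
| j | a | 176452 |
| j | o | 84485 |
| j | e | 26118 |
| j | u | 25616 |
| j | i | 1920 |
| j | h | 425 |
| j | r | 121 |
| j | c | 118 |
| j | d | 98 |
| j | end | 55 |
| ... | ... | ... |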
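To make the counting step concrete, here is a minimal sketch in Python (not the code from my R package). It assumes each name comes paired with the number of babies who got it, and that transitions are weighted by that number; the function names and the toy data are made up for illustration and are not the babynames data.

```python
from collections import Counter, defaultdict

START, END = "start", "end"  # hypothetical labels for the two special states


def transition_counts(name_counts):
    """Count letter-to-letter transitions, weighted by how many babies got each name."""
    counts = defaultdict(Counter)
    for name, n in name_counts:
        states = [START] + list(name.lower()) + [END]
        # Walk over adjacent pairs of states: start->j, j->o, o->e, e->end, ...
        for prev, nxt in zip(states, states[1:]):
            counts[prev][nxt] += n
    return counts


def transition_probs(counts):
    """Turn each row of counts into a row of probabilities that sums to 1."""
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: n / total for nxt, n in row.items()}
    return probs


# Invented toy data: (name, number of babies). Not real counts.
toy_data = [("joe", 3), ("john", 5), ("jack", 2)]

counts = transition_counts(toy_data)
probs = transition_probs(counts)

print(counts["j"].most_common())  # [('o', 8), ('a', 2)]
print(probs["j"])
```

The rows of `counts` have the same shape as the table above: one "from" state, and a count for every "to" state that ever followed it.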

As it turns out, a whole bunch of name data is available in the babynames R package put together by Hadley Wickham. So, I wrote a few functions that estimate the transition probabilities from the data for a given year (from 1880 to 2013) and a given sex, and then either generate random "names" or just return the most probable path through the states. How often does this return a for-real name? Sometimes, but not usually. For example, the most probable path through character states for boys born in 1970 is D[anericha], with that "anericha" bit repeating forever. For boys born in 1940, it's just an infinite sequence of Llllllll...
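Here is a rough Python sketch of those two operations, again not the package's own code. It assumes "most probable path" means greedily taking the likeliest next state at every step, which is what makes the endless loops possible. The transition table `PROBS` is invented so that the loop is easy to see, and `max_len` is a cap I added so the functions always return.

```python
import random

# Invented transition probabilities, chosen so the greedy path gets stuck in a loop.
PROBS = {
    "start": {"d": 0.6, "l": 0.4},
    "d": {"a": 0.9, "end": 0.1},
    "a": {"n": 0.7, "end": 0.3},
    "n": {"a": 0.6, "end": 0.4},
    "l": {"a": 0.5, "l": 0.3, "end": 0.2},
}


def most_probable_path(probs, max_len=8):
    """Greedily follow the likeliest transition until 'end' or max_len letters."""
    state, letters = "start", []
    while len(letters) < max_len:
        state = max(probs[state], key=probs[state].get)
        if state == "end":
            break
        letters.append(state)
    return "".join(letters).capitalize()


def random_name(probs, rng, max_len=20):
    """Sample one path through the chain, choosing each step by its probability."""
    state, letters = "start", []
    while len(letters) < max_len:
        options = list(probs[state])
        weights = [probs[state][s] for s in options]
        state = rng.choices(options, weights=weights)[0]
        if state == "end":
            break
        letters.append(state)
    return "".join(letters).capitalize()


print(most_probable_path(PROBS))  # Dananana
rng = random.Random(2014)
print([random_name(PROBS, rng) for _ in range(5)])
```

In this toy table "end" is never the likeliest next state, so the greedy path cycles through "an" until the cap cuts it off, the same kind of loop as D[anericha].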

That points to a problem: the end state is just not a very likely state to follow any given letter, so when you generate random names from the Markov chain, they come out really, really long. I introduced an additional process that probabilistically kills the chain as it gets longer, based on the probability distribution of name lengths in the data, but that's just one more hack that goes to show that names aren't well modeled as a Markov chain.
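One way such a killing process could work, sketched in Python: after each letter, stop with probability P(a name is exactly k letters long, given that it is at least k letters long), estimated from the name lengths in the data. This is my guess at a reasonable reading of the description, not necessarily what the package does, and the names, the chain, and the function names below are all invented.

```python
import random
from collections import Counter


def stop_hazard(names):
    """For each length k, estimate P(length == k | length >= k) from the data."""
    length_counts = Counter(len(name) for name in names)
    remaining = sum(length_counts.values())
    hazard = {}
    for k in sorted(length_counts):
        hazard[k] = length_counts[k] / remaining
        remaining -= length_counts[k]
    return hazard


def random_name(probs, hazard, rng):
    """Sample from the chain, but also kill it according to the length hazard."""
    state, letters = "start", []
    while True:
        options = list(probs[state])
        weights = [probs[state][s] for s in options]
        state = rng.choices(options, weights=weights)[0]
        if state == "end":
            break
        letters.append(state)
        # Extra stopping rule: the longer the name, the likelier it dies here.
        if rng.random() < hazard.get(len(letters), 0.0):
            break
    return "".join(letters).capitalize()


# Invented toy inputs: a handful of names, and a chain that rarely reaches "end".
toy_names = ["jo", "joe", "ann", "john", "mary", "james"]
sticky_chain = {"start": {"l": 1.0}, "l": {"l": 0.9, "end": 0.1}}

hazard = stop_hazard(toy_names)
rng = random.Random(1940)
print(hazard)
print([random_name(sticky_chain, hazard, rng) for _ in range(5)])
```

Because the hazard at the longest observed length is 1, no generated name can be longer than the longest name in the data, no matter how reluctant the chain is to reach the end state.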

Here's a little sample of random "names" generated by the transition probabilities for girls in 2013:

Elicia

Annis

Ttlila

Halenava

Amysso

Menel

Seran

Pyllula

Paieval

Anicrl

And here's a random sample of "names" generated by the transition probabilities for girls in 1913:

Lbeana

Peved

Math

Bysenen

Viel

Lelinen

Jabbesinn

Mabes

Drana

Lystha

The feeling I get looking at these is that they don't seem particularly gendered, even though there are clear gendered name trends. They don't even seem like they're from different times from each other. A lot of them aren't even orthographically valid. I don't know how they'd perform on "name likeness" tasks, but I don't even know what the point of doing such a task would be, since the Markov process has already failed at being a good model of names.

Maybe there's a lesson to be learned from the Markov process' failure to model names well, but for me it wound up being a silly diversion.

Update: I've now updated the package to generate names based on bigram -> character transition probabilities. The generate_n_names2() function generates things that look more like names. It's kind of fun!
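For anyone curious what bigram -> character transitions look like, here is a small Python sketch of the idea. It is not the implementation of generate_n_names2(); the padding symbols, function names, and toy names are all my own assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

PAD, END = "^", "$"  # hypothetical markers for "before the name" and "end of name"


def bigram_counts(names):
    """Count which character follows each pair of preceding characters."""
    counts = defaultdict(Counter)
    for name in names:
        chars = PAD * 2 + name.lower() + END
        for i in range(len(chars) - 2):
            counts[chars[i:i + 2]][chars[i + 2]] += 1
    return counts


def random_name(counts, rng, max_len=20):
    """Generate a name where each letter depends on the two letters before it."""
    context, letters = PAD * 2, []
    while len(letters) < max_len:
        row = counts[context]
        nxt = rng.choices(list(row), weights=list(row.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        context = context[1] + nxt  # slide the two-character window forward
    return "".join(letters).capitalize()


# Invented toy data, not the babynames data.
toy_names = ["anna", "anne", "hannah"]
counts = bigram_counts(toy_names)

rng = random.Random(2014)
print(dict(counts["nn"]))  # {'a': 2, 'e': 1}
print([random_name(counts, rng) for _ in range(5)])
```

With two letters of context, the generator can only produce letter triples that actually occur in the data, which is a plausible reason the output looks more name-like than the single-letter version.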