Sunday, April 13, 2014

It's a really cool graph, but then, I tend to find analyses of baby names a bit frustrating because they almost always rely strictly on the written, or orthographic, forms of the names. It's not that the way people spell their children's names doesn't matter, but it's half of the puzzle. For example, I'm named after my grandfather. He was German (more specifically, a Donauschwob), so he spelled his name <Josef>, and pronounced the initial sound like <y>, which in the IPA is /j/. When naming me, my parents had a whole bunch of options. Would they pronounce my name like my grandfather did, or like most English speakers would? And how would they spell it? They wound up settling on the English pronunciation, and the German spelling. I've made a little diagram displaying a very partial set of options my parents had in choosing my name.

And of course, Sarah Jessica Parker played a woman named /sændi/ who spelled it <SanDeE☆> in Steve Martin's LA Story, so clearly the spelling of proper names is an important expressive dimension, but still just half the picture.

So, I decided to look a bit more at popular linguistic structures in baby names. Hadley Wickham has already compiled the top 1000 baby names in the US per year since 1880 (https://github.com/hadley/data-baby-names), and Kyle Gorman has a nice Python module that syllabifies CMU dictionary entries (https://github.com/kylebgorman/syllabify). So I put together some sloppy code to analyze it (https://github.com/JoFrhwld/names). The biggest weakness of my approach is the number of names which are not to be found in the CMU dictionary. 2525 out of the total 6782 names in the data (about 37%) aren't in CMU, so this post should be understood as being for entertainment purposes only.
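The coverage problem is easy to check for yourself. Here's a minimal sketch, not the code in my repository: it assumes CMU-style lines (a word, whitespace, then ARPABET phones), and the helper names `parse_cmu_lines` and `coverage`, along with the toy entries, are made up for illustration.

```python
def parse_cmu_lines(lines):
    """Map each upper-cased word to its list of ARPABET phones."""
    entries = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and ';;;' comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        # Keep the first pronunciation seen for a word.
        entries.setdefault(word.upper(), phones)
    return entries


def coverage(names, entries):
    """Return the sorted names missing from the dictionary and their share."""
    unique = set(names)
    missing = sorted(n for n in unique if n.upper() not in entries)
    share = len(missing) / len(unique) if unique else 0.0
    return missing, share


# Toy entries written in the CMU style (not checked against the real file).
toy_dict = parse_cmu_lines([
    ";;; toy entries for illustration",
    "AARON  EH1 R AH0 N",
    "JOSEPH  JH OW1 S AH0 F",
])

missing, share = coverage(["Aaron", "Joseph", "Josef", "SanDeE"], toy_dict)
print(missing, share)
```

On the real data the same kind of count gives the 2525 missing names out of 6782.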

One other thing that bugged me about the name final <n> plot is that it seemed kind of arbitrary to focus on the final letter of the name. I suspect that it's a real trend that people noticed eyeballing lists of names, but that it wasn't compared against other kinds of trends. I went ahead and labeled name initial and name final syllables, codas, onsets and rhymes as being special, but I'm not going to single them out.

Kicking things off, there's a graph of popular syllables between 1880 and 2008. To be included in the graph, a syllable had to be in the top 3 most popular in any given year. The y-axis is how many times more frequent the syllable is than if syllable selection were random. It's not frequency weighted. That is, this is just the distribution over names that have that syllable, not over babies.
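Here's one way to read that y-axis, as a rough sketch. It assumes "random" means every syllable type attested in a given year is equally likely, and it counts each name once no matter how many babies got it. The function name `popularity_ratios` and the toy syllabified names are hypothetical, and the code in my repository may compute the baseline differently.

```python
from collections import Counter


def popularity_ratios(syllabified_names):
    """Observed count of each syllable divided by the count expected if all
    attested syllable types were equally likely (an assumed baseline)."""
    counts = Counter(syl for name in syllabified_names for syl in name)
    if not counts:
        return {}
    # Under the uniform assumption every syllable type gets the same share.
    expected = sum(counts.values()) / len(counts)
    return {syl: n / expected for syl, n in counts.items()}


# Toy data for one year: each name is a tuple of syllables, counted once.
toy_year = [
    ("ɛ", "mə"),
    ("æ", "nə"),
    ("hæ", "nə"),
    ("dʒu", "li"),
]

ratios = popularity_ratios(toy_year)
for syl, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(syl, round(ratio, 3))
```

A ratio above 1 means the syllable shows up in more names than the uniform baseline predicts, and a ratio below 1 means fewer.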

It's a bit chaotic, I know. It's at times like this that I wish I'd learned a little JavaScript so I could make an interactive version with brushing. Here's another version where each syllable gets a facet. They're ordered by decreasing maximum ratio.

It looks like at the syllable level, name final /nə/ and /li/ for girls are both long time favorites, as well as more popular syllables than any boy's name final /n/ syllable. The most popular boy's name final /n/ syllable looks like it's always been /tən/, but maybe it's flagging a bit compared to the recent surges in /sən/ and /dən/. It also looks like popularity in syllables is pretty evenly split between name initial and name final syllables. For both boys and girls, some kind of initial between /e/ ~ /ɛ/ ~ /æ/ is pretty popular, but I can't be sure what's going on there, because the CMU dictionary has the same entry for both <Aaron> and <Erin>.

But maybe the reason boy's name final /n/ isn't shining through like you might expect is because of phonological reasons. A boy's name ending in a word final syllabic /n/ is necessarily going to pull the preceding consonant into the syllable with it. Looking at the plot above, it's not likely that the preceding consonant is totally random either, cause we've only got /t, s, d/ (all coronals) and vowels preceding the /n/. But for the hell of it, here's the same kind of plot as the ones above, but this time with syllable rhymes.

There's a lot less volatility in the rhymes data, probably because there are fewer different kinds of syllable rhymes. Complex rhymes don't seem to be that popular ever. We've mostly got vowels from open syllables, and syllabic consonants. At any rate, the popularity of name final /n/ for boys is pretty clear, taking over from /i/ (from names like Billy and Jonny). The boys' trend towards name final /n/ seems to be about on par with the trend for girls' names to end in /ə/.

I'd like to play around with this data a bit more if I get some time. It occurred to me that you could come up with a few different ways of generating popular names from different eras by randomly sampling popular syllables, or by estimating transition probabilities between syllables and going on a random walk that way.
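The random walk idea could look something like this. It's only a sketch under some assumptions: a first-order Markov chain over syllables, transition counts taken over names rather than babies, and made-up toy data. The names `transition_counts` and `random_walk`, and the start and end markers, are all hypothetical.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"


def transition_counts(syllabified_names):
    """Count how often each syllable (or START) is followed by another (or END)."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in syllabified_names:
        path = [START, *name, END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return counts


def random_walk(counts, rng, max_syllables=4):
    """Walk from START, picking each next syllable in proportion to its count.
    Stops at END, or once max_syllables have been generated."""
    name, state = [], START
    while len(name) < max_syllables:
        followers = counts[state]
        state = rng.choices(list(followers), weights=list(followers.values()))[0]
        if state == END:
            break
        name.append(state)
    return name


# Toy "era": a handful of syllabified names, each counted once.
toy_era = [("dʒeɪ", "sən"), ("meɪ", "sən"), ("dʒeɪ", "dən"), ("li", "əm")]
counts = transition_counts(toy_era)

rng = random.Random(0)  # seeded so runs are repeatable
for _ in range(5):
    print(".".join(random_walk(counts, rng)))
```

Fitting one set of counts per decade would give a different walk for each era. Weighting each transition by the number of babies with that name, instead of counting each name once, would be the obvious next tweak.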