So I decided to give it another whack. This time I started with the premise, 'what sounds most like a name?'

Names do!

I found a couple of files with over a thousand of the most common male and female first names on the US Census Bureau's web page and started playing. I wrote a Python script that used regular expressions to slice a batch of words into three lists:

List 1 = Zero or more vowels + One or more consonants at the start of the word.

List 2 = One or more vowels + One or more consonants inside the word (not at the start or the end). We can get 0 or more of these patterns depending on the word.

List 3 = One or more vowels + Zero or more consonants at the end of the word.

Side note: If you haven't dug into regular expressions yet, I highly recommend you check them out. I avoided them for years and now they're an essential part of my programmer toolbox. Another big plus is that they work in much the same way across multiple languages.

I also tracked the frequency of each pattern, sorting by most common first and discarding the rare ones. Finally, I dumped the output formatted as Python lists that I could paste right into the source of the next script.

#!/usr/bin/env python

#--------------------------
# analyze.py
#--------------------------

import re
import random
import operator

_FILENAME = 'data/female2.txt'

## Match 0 or more vowels + 1 or more consonants at the start of the word
_LEAD = re.compile(r'^[aeiouy]*[bcdfghjklmnpqurstvwxz]+')

## Match 1 or more vowels + 1 or more consonants inside a word (not start/end)
_INNER = re.compile(r'\B[aeiouy]+[bcdfghjklmnpqurstvwxz]+\B')

## Match 1 or more vowels + 0 or more consonants at the end of a word
_TRAIL = re.compile(r'[aeiouy]+[bcdfghjklmnpqurstvwxzy]?$')

def token_lists(names):

    lead, inner, tail = {}, {}, {}

    ## Populate dictionaries; key=pattern, value=frequency
    for name in names:
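The listing breaks off inside the loop. As a rough sketch only, here is one way the counting step described earlier could be written. The names `count_tokens` and `common`, the use of `collections.Counter`, the `min_count` cutoff, and the simplified consonant class (no 'u' or 'y') are all assumptions made for illustration, not the original script.

```python
import re
from collections import Counter

VOWELS = 'aeiouy'
CONSONANTS = 'bcdfghjklmnpqrstvwxz'  # assumption: 'u' and 'y' count as vowels only

# Same three shapes as described in the post, built from the classes above
LEAD = re.compile(r'^[%s]*[%s]+' % (VOWELS, CONSONANTS))
INNER = re.compile(r'\B[%s]+[%s]+\B' % (VOWELS, CONSONANTS))
TRAIL = re.compile(r'[%s]+[%s]*$' % (VOWELS, CONSONANTS))


def count_tokens(names):
    """Return three Counters (lead, inner, trail); key=pattern, value=frequency."""
    lead, inner, trail = Counter(), Counter(), Counter()
    for name in names:
        name = name.strip().lower()
        m = LEAD.search(name)
        if m:
            lead[m.group()] += 1
        # A word can contribute zero or more inner patterns
        for m in INNER.finditer(name):
            inner[m.group()] += 1
        m = TRAIL.search(name)
        if m:
            trail[m.group()] += 1
    return lead, inner, trail


def common(counter, min_count=1):
    """Most frequent patterns first, dropping any seen fewer than min_count times."""
    return [(tok, n) for tok, n in counter.most_common() if n >= min_count]


if __name__ == '__main__':
    sample = ['mary', 'maria', 'anna', 'linda']  # tiny stand-in for the census files
    for label, counter in zip(('lead', 'inner', 'trail'), count_tokens(sample)):
        # Printed as Python lists so they could be pasted into another script
        print(label, '=', common(counter))
```

Raising `min_count` is one simple way to experiment with how aggressively the rare patterns get culled.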

I've been experimenting with different sample sizes and varying amounts of culling infrequent patterns. Plus, it's hard to gauge success. You wouldn't want to name your children from those lists, but if I had to populate a fantasy town full of NPCs I'd be content with many of those. Two things I liked were the simplicity of the finished code and that the gender sound mostly survived the mulching process – except Stan the transvestite.
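For illustration, a generator built on such lists might look something like the sketch below. The generator script itself is not shown here, so everything in it is an assumption: the token lists and their frequencies are made up, and `pick`, `make_name`, and `max_inner` are hypothetical names.

```python
import random

# Hypothetical (pattern, frequency) lists standing in for real analysis output
LEADS = [('m', 40), ('j', 30), ('l', 25), ('ann', 10)]
INNERS = [('ar', 30), ('el', 20), ('ind', 8)]
TRAILS = [('a', 50), ('y', 20), ('ia', 12), ('en', 9)]


def pick(weighted, rng):
    """Choose one pattern, favouring the more frequent ones."""
    tokens = [tok for tok, _ in weighted]
    weights = [count for _, count in weighted]
    return rng.choices(tokens, weights=weights, k=1)[0]


def make_name(rng, max_inner=2):
    """One lead + zero or more inner patterns + one trail."""
    parts = [pick(LEADS, rng)]
    for _ in range(rng.randint(0, max_inner)):
        parts.append(pick(INNERS, rng))
    parts.append(pick(TRAILS, rng))
    return ''.join(parts).capitalize()


if __name__ == '__main__':
    rng = random.Random()
    for _ in range(5):
        print(make_name(rng))
```

Weighting by frequency keeps the common sounds common; picking uniformly instead would give the rare patterns more airtime.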

I've seen some pretty good random name generators that even let you generate a name based on different fantasy races. The neatest feature I saw in one was that it let you check whether a name was valid, also based on whether it sounded dwarven, human, etc. Probably a little strict in actual application, but I thought it was pretty cool. (Even though I'd probably never use it.)

I was going to mention having made demon and cthulhu-esque name generators but the post was getting kinda long already. I thought medieval weapons would make good dwarven names but it didn't work so hot.

Reminds me of my first snippet, which checked whether names could be pronounced, whether they contained inappropriate words, and whether they matched race-specific requirements. I never used it either though.

I made a change to my regex. Basically I gave up on the letter 'Q'. I had included 'U' in the consonants above because otherwise a 'Q' before it was left on its own. Unfortunately, that meant it was matching strings like 'cthulhu' as one leading pattern. I'm getting cleaner results now.
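To make the difference concrete, here is a small comparison of the leading pattern with and without 'u' in the consonant class. The first pattern is copied from the listing; the second is only a guess at the revised version, since the changed regex itself is not shown.

```python
import re

# Leading pattern as in the listing: 'u' sits in the consonant class
LEAD_WITH_U = re.compile(r'^[aeiouy]*[bcdfghjklmnpqurstvwxz]+')

# Assumed revision: identical, but with 'u' removed from the consonants
LEAD_NO_U = re.compile(r'^[aeiouy]*[bcdfghjklmnpqrstvwxz]+')

if __name__ == '__main__':
    for word in ('cthulhu', 'quentin'):
        with_u = LEAD_WITH_U.match(word).group()
        no_u = LEAD_NO_U.match(word).group()
        print(word, '->', with_u, 'vs', no_u)
    # cthulhu -> cthulhu vs cth
    # quentin -> qu vs q
```

The trade-off is visible in the second word: dropping 'u' fixes 'cthulhu' but leaves the 'q' stranded without its 'u'.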

Have you tried using n-grams to generate the names? It works by chopping up inputs into units (probably syllables in this case), and then determining not the frequency of individual units, but the probability of a given unit following another unit. This way, you end up with units more likely to actually seem to fit together.

I have code for this; it's roughly 350 lines of Lua (including whitespace and comments). I ran it on Shakespeare and the WSJ; the units in this case were words, not syllables. Here are some of the more interesting generated sentences:

- An he is old oblivion; I change favours; I am yours; and Warwick was free scope; I warrant. (Shakespeare, bigrams)

- As to run from the day is well he chang'd into the next village of the Emperor, nay, to give it be restrain'd, and enter THESEUS. (Shakespeare, bigrams)

- Regulators, as compiled by Dow Jones's board, had structural damage, including International Business Machines Corp. hardware that uses index arbitrage at Kidder. (WSJ trigrams)

- "According to available details," says a spokeswoman, the changes were prompted by a third of the stadium, damaged by natural disasters – Hurricane Hugo, " May 15) will recall that the company, which are filled at the end of the U.K. (WSJ trigrams)

- The English and attorney who have been performing since the quake won't make this century ago, citing its results anyway. (WSJ bigrams)

I could probably run the code almost as-is if I had an easy way of breaking words into chunks. Maybe I'll even get around to posting the code at some point.
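The bigram idea described in this comment can be sketched in a few lines of Python. This is not the Lua code mentioned above, just a minimal illustration: `train_bigrams`, `generate`, the `START`/`END` markers, and `max_units` are hypothetical names, and the units can be words or word chunks.

```python
import random
from collections import Counter, defaultdict

START, END = '<s>', '</s>'  # hypothetical markers for sequence boundaries


def train_bigrams(sequences):
    """Map each unit to a Counter of the units seen directly after it."""
    follows = defaultdict(Counter)
    for seq in sequences:
        units = [START] + list(seq) + [END]
        for prev, nxt in zip(units, units[1:]):
            follows[prev][nxt] += 1
    return follows


def generate(follows, rng, max_units=20):
    """Walk the table, picking each next unit in proportion to how often
    it followed the current one in the training data.
    Assumes the table was trained on at least one sequence."""
    out, current = [], START
    for _ in range(max_units):
        options = follows[current]
        current = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        if current == END:
            break
        out.append(current)
    return out


if __name__ == '__main__':
    # Units here are name chunks; whole words would work the same way
    training = [['m', 'ar', 'y'], ['m', 'ar', 'ia'], ['l', 'ind', 'a']]
    table = train_bigrams(training)
    rng = random.Random()
    print(''.join(generate(table, rng)))
```

Because every step is conditioned on the previous unit, a chunk only ever appears after something it actually followed in the input.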

Originally, I was going to try using Markov chains, but the first problem I ran into was how to actually break a word into syllables. Take 'pewter' and 'marker' for example. Most people would break them as (pew)(ter) and (mark)(er). No pattern to that. I didn't really want to manually specify or build a speech synthesis style dictionary (especially when dealing with fantasy names). That's where I took the easier route of (p)(ewt)(er) and (m)(ark)(er) and built from there.
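The (p)(ewt)(er) style of split can be approximated with a single pattern. This is a sketch rather than the script from the post: `split_chunks` is a hypothetical helper, it uses one alternation instead of three separate patterns, it treats 'y' as a vowel and leaves 'u' out of the consonants, and it assumes plain a-z input.

```python
import re

VOWELS = 'aeiouy'
CONSONANTS = 'bcdfghjklmnpqrstvwxz'

# Either a leading chunk (optional vowels + consonants at the very start),
# or a run of vowels followed by any consonants up to the next vowel.
CHUNK = re.compile(r'^[{v}]*[{c}]+|[{v}]+[{c}]*'.format(v=VOWELS, c=CONSONANTS))


def split_chunks(word):
    """Split a word into vowel/consonant chunks, in order."""
    return CHUNK.findall(word.lower())


if __name__ == '__main__':
    for word in ('pewter', 'marker', 'anna'):
        print(word, '->', split_chunks(word))
    # pewter -> ['p', 'ewt', 'er']
    # marker -> ['m', 'ark', 'er']
    # anna -> ['ann', 'a']
```

Chunk lists like these need no syllable dictionary, and they could serve as the units for a chain.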

I'd be astonished if there weren't libraries out there for breaking words into syllables that work "well enough" (most of the time), similar to parts-of-speech tagging (which is some 97% accurate). I agree that you don't really want to be in that business yourself, but even so I imagine you could get a decent approximation without too much work. (Incidentally, I would have grouped those as pew-ter and mar-ker, so there is a pattern of some sort.)
