It looks like the comment got severely truncated at the McLaws page. And even stranger, when you click on the username (Noticias externas? Is that you?) it goes to this posting third page, geeks.ms, which seems to be a Spanish blog. Now I’m stumped.

I think I get the trackback idea, but you seem to have stepped into a more advanced realm of blogging with which I’m unfamiliar. ;)

[Noticias is one of many sites that pretend to be Microsoft bloggers. -Raymond]

And that’s generally the best way to do it. No matter how clever your algorithm is, it will fail some of the time. Better to let the user correct the mistake (assuming it happens fairly rarely) than to force the user to accept incorrect values.

In college he went by Horny, Horndog, and sometimes Hornmeister. Funny thing is, Steve is actually a more suitable name to call him by, because he’s not a very memorable person. You’d think everyone would remember meeting a guy with the name Gregory Kenneth Van Horn or a guy everyone called Horny, Horndog, or Hornmeister but no. If you ask people if they remember him they always draw a blank – just as if you’d asked whether they remembered meeting a generic Steve.

To be fair, he was writing from a UK perspective ("Scottish like me"), where "Jr." isn’t an issue. I can’t think offhand of a (traditional) UK name where his function would fail; although it could probably do with a tweak for Irish O’Connell-type surnames.

My mother’s name is what I use to shoot down name parsing algorithms. Mary Ann Louise Smith — yes, her first name is "Mary Ann". I’ve yet to see any kind of name parsing algorithm that can even attempt to handle that along with the more common "Foo Baz van Bar".

Maurits has a reasonable suggestion with a sentinel value; the vast majority of names should be handled trivially.

Say what you will about Perl, but it’s great for text processing and there’s some crazy stuff in CPAN.

Lingua::EN::NameParse will parse a huge variety of names and name formats (though not *everything*, I’m sure). There’s even an option to handle titles like General or Mother Superior. It’s based on a recursive descent grammar for names.

For giggles, I ran the names through Lingua::EN::NameParse. None of this is meant to argue against anyone’s claims that name parsing is hard and impossible to get exactly right; I just wanted to test the module.

Hope the linebreaks come through right — if not, the gist is that Mary Ann does confuse the module, and it says there was a parsing error and "Smith" was not matched; Foo Baz van Bar doesn’t confuse it and is parsed correctly. "Cal Ripken, Jr." confuses it unless the module’s auto_clean option is turned on to strip the comma; in that case, it parses the suffix out right.

For the "Jr." "II." "III." etc, wouldn’t it be enough to "see" the dot at the end? ("van Eyck, Jan" has none)

Of course it would just increase the correctness ratio, not put it at 100% because I am sure there are cases where people would put a dot at the end for just another reason or no good reason at all.
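For what it's worth, the trailing-dot idea might be sketched roughly like this in Python. The function name `split_suffix` is made up for illustration, and the rule is only the heuristic described above, not a complete suffix detector:

```python
def split_suffix(name):
    """Treat the last word as a suffix if it ends with a dot.

    Hypothetical helper illustrating the "see the dot at the end" heuristic.
    Returns (name_without_suffix, suffix_or_None).
    """
    cleaned = name.strip()
    # Split off the final space-separated word.
    head, sep, last = cleaned.rpartition(" ")
    if sep and last.endswith("."):
        # Drop the comma that often precedes the suffix ("Ripken, Jr.").
        return head.rstrip(", "), last
    return cleaned, None


print(split_suffix("Cal Ripken, Jr."))  # suffix detected
print(split_suffix("van Eyck, Jan"))    # no trailing dot, left alone
print(split_suffix("John Q."))          # false positive: an initial, not a suffix
```

The last call shows the caveat already mentioned: a trailing initial gets mistaken for a suffix, so this only raises the hit rate.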

Find a good algorithm with good heuristics (i.e., make one and improve it by testing it at ‘medium scale’), and just allow people to easily fix the result when your algorithm is wrong (not necessarily at creation time, by the way, or you couldn’t batch-process lists of names; instead let users edit their automatically created profile).

I’m struggling to think of a situation where you’d want to generate initials from a list of thousands/millions of names. Maybe I’m not imaginative enough.

If it’s for usernames, document edits, or similar then you need the generated strings to be unique. There will be many collisions with a long list of names so you’ll have to add numbers (or something) to what you generate. In which case the result is ugly, strange and "not my initials" whether or not the initials are well chosen.
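To make the collision problem concrete, here is a minimal Python sketch. The function `unique_initials` and the "append a counter" scheme are assumptions for illustration; real systems may disambiguate differently:

```python
from collections import Counter


def unique_initials(names):
    """Generate initials and append a counter when they collide.

    Hypothetical scheme: the first holder keeps the plain initials,
    later ones get 2, 3, ... appended.
    """
    seen = Counter()
    result = []
    for name in names:
        # Naive initials: first letter of every whitespace-separated word.
        initials = "".join(part[0].upper() for part in name.split())
        seen[initials] += 1
        count = seen[initials]
        result.append(initials if count == 1 else f"{initials}{count}")
    return result


print(unique_initials(["John Smith", "Jane Stone", "Jack Sparrow", "Ann Lee"]))
```

With a long list, most people end up with something like "JS3", which is the "not my initials" result described above.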

One approach could be to have a dictionary of known first names (and another one of abbreviations if needed). From this you can reorder the names if needed and generate the initials you prefer (for "Firstname Lastname", something like "FL", "FLE", or "FLS"). Of course, you need to update the algorithm for local cultures… That’s quite a fun development.
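A rough Python sketch of the dictionary idea, assuming a tiny hand-made set of first names (a real one would be far larger and culture-specific). `KNOWN_FIRST_NAMES` and `split_first_name` are hypothetical names:

```python
# Hypothetical, deliberately tiny dictionary of known first names (lowercase).
KNOWN_FIRST_NAMES = {"mary", "mary ann", "john"}


def split_first_name(full_name, known=KNOWN_FIRST_NAMES):
    """Return (first_name, remaining_words) using longest-prefix lookup."""
    tokens = full_name.split()
    if not tokens:
        return "", []
    # Try the longest prefix first, always leaving at least one word over.
    for size in range(len(tokens) - 1, 0, -1):
        candidate = " ".join(tokens[:size])
        if candidate.lower() in known:
            return candidate, tokens[size:]
    # Nothing in the dictionary matched: fall back to the first word.
    return tokens[0], tokens[1:]


print(split_first_name("Mary Ann Louise Smith"))
print(split_first_name("Foo Baz van Bar"))
```

Trying the longest prefix first is what lets "Mary Ann" win over "Mary"; it also means the answer is only as good as the dictionary.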

"I’m struggling to think of a situation where you’d want to generate initials from a list of thousands/millions of names."

The typical case isn’t for initials, but taking a single name field and splitting it up into First/Middle/Last fields in a database.
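As a sketch of what that kind of split tends to look like, here is a naive Python version. `ParsedName`, `split_name`, and the `needs_review` flag are illustrative assumptions, not anyone's actual code:

```python
from typing import NamedTuple


class ParsedName(NamedTuple):
    first: str
    middle: str
    last: str
    needs_review: bool  # True when the split is only a guess


def split_name(field):
    """Naively split a single name field into first/middle/last."""
    tokens = field.split()
    if len(tokens) == 2:
        return ParsedName(tokens[0], "", tokens[1], False)
    if len(tokens) == 3:
        return ParsedName(tokens[0], tokens[1], tokens[2], False)
    # Anything else: best guess, flagged so a person can correct it.
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])
    return ParsedName(first, middle, last, True)


print(split_name("John Smith"))
print(split_name("Mary Ann Louise Smith"))
```

Note that three-word names such as "Jan van Eyck" or "Cal Ripken, Jr." sail through unflagged and wrong, which is exactly why the records still need to be editable afterwards.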

You might want the initials autogenerated for the same purpose, although I can’t really think of a good reason to do so.

"the gist is that Mary Ann does confuse the module"

Heh, I suspected it would. Of course, to be fair, she’s the only multiple-word first name that I’ve ever run across (either in person or in the dreaded single->multiple name case above; which I’ve done on millions of records before), so it’s not too surprising.

Don’t suppose this is the Randall most famously associated with Perl, is it? (If you ever do read this)

Slightly OT, but the algorithms that scan for names and addresses for junk mail sometimes are really off in a cute way. I once worked at the U.S. Fish and Wildlife Service. We got a letter from Ed McMahon that used the greeting:

Fwiw, my mother-in-law’s first name is Lee Ann. As well, down here in Texas we have the stereotype of women named Bobby Sue or something (and it’s not a false stereotype either!). So the two-word first name is far from a unique case, though it is fairly rare.