Genderyzer, by Jofish Kaye, is an experimental research tool for determining the gender mix of a list of names. Helpful feedback is welcome.

Genderyzer takes as input a text file with one name per line, in the order First Initials Last. It ignores everything except the first name and returns statistics on the gender mix of the names, as best it can. (Blank lines and lines starting with # are ignored.) The file should preferably be UTF-8, although Genderyzer will try to do the right thing with HTML-encoded characters. Only the most recently uploaded file is kept on the server; it is automatically overwritten whenever a new file is uploaded.
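As a rough illustration of the input rules above, here is a minimal sketch in modern Python 3 (the site's own script is older Python 2). The helper names `extract_first_name` and `is_initial` are hypothetical and chosen for this example only. The sketch skips blank lines and # lines, decodes HTML entities, and falls back to the second token when the line starts with an initial. Unlike the real script, which tallies initials-only names separately, it simply returns None for them.

```python
import html


def is_initial(token):
    """True for a bare letter ("f") or an abbreviated token ("f.", "j.r.")."""
    return len(token) == 1 or token[1] == "."


def extract_first_name(line):
    """Return the lowercased first name from one input line, or None to skip it.

    Hypothetical helper for illustration; not the site's actual code.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank lines and comment lines are ignored
    # Decode HTML entities such as &eacute; before splitting into tokens.
    tokens = html.unescape(line).lower().split()
    first = tokens[0]
    # "F. Scott Fitzgerald": a leading initial followed by at least two more
    # tokens, so try the second token as the given name.
    if is_initial(first) and len(tokens) > 2:
        first = tokens[1]
    if is_initial(first):
        return None  # only initials available ("E. E. Cummings", "I. Last")
    return first
```

For example, `extract_first_name("F. Scott Fitzgerald")` yields `"scott"`, while `extract_first_name("E. E. Cummings")` yields `None`.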

Gender information for names is taken from a variety of sources, including census records, government name guides, and baby name lists. Please let me know if you feel I have incorporated information from your lists and you'd like it removed.
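The header that Genderyzer writes at the top of its output file explains how these sources become a single rating per name. Each name carries a boy score and a girl score, and the one further from 1 wins. The following is a minimal sketch of that rule in Python 3. The function names `combined_rating` and `describe` are hypothetical. The real lookup lives in the `isMale` function of the `nameSex` module, which is not shown here and may differ in detail.

```python
def combined_rating(boyscore, girlscore):
    """Return whichever score lies further from 1, the ambiguous value."""
    if abs(boyscore - 1) > abs(girlscore - 1):
        return boyscore
    return girlscore


def describe(rating):
    """Map a rating to the category named in the output header."""
    if rating == 0:
        return "unknown"    # no data either way
    if rating > 1:
        return "male"
    if rating < 1:
        return "female"
    return "ambiguous"      # exactly 1: known for both, no popularity data


# The "jordan" example from the output header: boyscore 1.954, girlscore 0.914.
rating = combined_rating(1.954, 0.914)
print(rating, describe(rating))
```

For jordan, the boy score is 0.954 away from 1 and the girl score only 0.086 away, so the sketch returns the boy score and the name is reported as most likely male.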

All results are approximate. No guarantees. Please reference this site if you use this service. If you publish academic papers that use this work, please reference this paper: http://doi.acm.org/10.1145/1520340.1520364

15 Oct 09: I recently discovered a design flaw where 29 names were being incorrectly reported. See 29names.txt for details. My apologies; please do continue to send feedback if you notice errors.

"
def save_uploaded_file(form_field):
    # `form` is assumed to be the module-level cgi.FieldStorage() instance
    if not form.has_key(form_field): return
    #now process the uploaded file if there is one
    fileitem = form[form_field]
    if not fileitem.file: return
    fout = file("../tmp/gendertemp", 'wb')
    while 1:
        chunk = fileitem.file.read(100000)
        if not chunk: break
        fout.write (chunk)
    fout.close()
print HTML_HEADER
save_uploaded_file("file_1") #do all the cgi field stuff
print HTML_TEMPLATE_start
male, female, initials, ambiguous, unknown = [],[],[],[],{}
from nameSex import isMale
#f=codecs.open("../tmp/genderout.txt",'w','utf_8')
f=open("../tmp/genderout.php",'w')
f.write("""<?php Header('Cache-Control: no-cache');Header('Pragma: no-cache');Header('Content-Type: text/plain; charset=utf-8')?>
#processed by genderzyer 3.0 -- genderzye -at- jofish dot com
#short version:
#output format for each line is: rating name
#if rating is >1, then it's a man's name
#if rating is <1, it's a woman's name
#if rating is 1, it's ambiguous
#if rating is 0, it doesn't know the name.
#
#longer version:
#so for example, jordan is both a boys' and a girls' name.
#but according to the 2005 US census (ymmv for other demographics, of course)
#it is the 46th most popular boys name and the 86th most popular girls name.
#
#this is encoded internally as boyscore girlscore name, like this:
#1.954 0.914 jordan
#
#the algorithm in the code does this:
#if abs(boyscore-1) > abs(girlscore-1): return boyscore
# else: return girlscore
#
#so for jordan, 0.954 > 0.086 so most likely male and it would return 1.954
#
#0 means we have no data either way
#
#1 means ambiguous: we know it can be both male and female.
#this indicates that these names were found in a baby name book/site/list, but we have no
#concept of relative popularity of the two terms.
#
""")
for line in open("../tmp/gendertemp",'r'):
    if len(line)<1: continue
    #the output is written to a .php file, so refuse input containing a PHP open tag
    if "<?" in line:
        print "PHP found in file; no further processing will be done."
        break
    if line[0]=='#':
        f.write(line)
        continue
    try:
        first = line.split()[0].lower().strip()
    except:
        continue
    try:
        if len(first)==1 or first[1]=='.':
            if len(line.split())>2: #we have an f. scott fitzgerald problem, perhaps
                first = line.split()[1].lower()
                if len(first)==1 or first[1]=='.': #oh, it's e. e. cummings
                    initials+=[first]
                    continue
            else: #I. Last
                initials+=[first] #we could be smarter here.
                continue
    except:
        if len(first)==1:
            initials+=[first]
            continue
    if "&" in first:
        try:
            first = decode_htmlentities(first).encode("utf-8")
        except:
            #fall back to stripping any entity we could not decode
            mysub=re.compile('&[^;]*;')
            first=mysub.sub('',first)
    if len(first)<2: continue #check again
    rating = isMale(first)
    f.write(str(rating) + " "+ line)
    if rating == 0:
        if first in unknown: unknown[first]+=1
        else:unknown[first]=1
        continue
    if rating > 1: male+=[first]
    if rating < 1: female+=[first]
    if rating == 1: ambiguous+=[first]
f.close()
totallen = len(male)+len(female)+len(initials)+len(unknown)+0.0
if totallen == 0:
print "

No names were found in the file uploaded. Your file must be plain text, with one name per line.