Monday, June 26, 2006

US genealogy site Ancestry.com has now completed indexing and digitizing the entire US Census from 1790 to 1930 – featuring more than five billion names.

Ancestry.com's team spent 6.6 million hours of labour deciphering handwriting from 13 million original census documents and 21.9 billion keystrokes manually entering information into the database. [Link]

Five billion names from 13 million census pages over 15 censuses? That comes out to about 385 names per page, and would give the United States an average population in those years of 333 million.* This is odd, since our population isn't supposed to reach 300 million until this fall.

Also, they managed to type in 5 billion names with only 21.9 billion keystrokes, meaning that the average American's name was only 4 or 5 letters long. Notwithstanding "Cher," most American names ramble on for at least 6 or 7 characters.
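For anyone who wants to check the arithmetic, here is a quick sketch in Python. It uses only the figures quoted in the story; the variable names are my own, and the count of 15 censuses assumes one every ten years from 1790 through 1930.

```python
# Figures as quoted in the news story.
names = 5_000_000_000         # "more than five billion names"
pages = 13_000_000            # "13 million original census documents"
keystrokes = 21_900_000_000   # "21.9 billion keystrokes"

# Assumption: one census per decade, 1790 through 1930 inclusive.
censuses = len(range(1790, 1931, 10))

names_per_page = names / pages
avg_population = names / censuses
keystrokes_per_name = keystrokes / names

print(f"{censuses} censuses")
print(f"{names_per_page:.0f} names per page")
print(f"{avg_population / 1e6:.0f} million people per census, on average")
print(f"{keystrokes_per_name:.1f} keystrokes per name")
```

The keystroke figure is, if anything, generous: it treats every keystroke as part of a name and leaves none for ages, birthplaces, or anything else.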

I'll bet it was 5 billion pieces of information: surname, given name, gender, race, age, birthplace, parents' birthplaces, immigration data, languages, occupation, etc. If they indexed all of that, at roughly a dozen pieces per person, then that comes out to something like 1.2 billion pieces for a single census of a population of 100 million.

Between 1850 and 1880, they collected only half of that for a smaller population, and between 1790 and 1840 it was even less. So 5 billion as a total number of bits of info is certainly possible.

Randy: Ah, but the original press release said "more than five billion searchable names." Methinks the reporter who regurgitated this release mistook the total number of names in all of Ancestry's databases for the number of names in their U.S. Census collection.

Andy: Nice catch! I've switched the link to a version of the story written by someone who can subtract.