Ok – this post is a little strange and fun. I was thinking about word length and how it relates to designing software/schemas to support multiple-languages. How far do you have to go in your research to figure out the maximum string length to support? So I started digging about and found some interesting things about words. Here are some examples.

If you’re putting together a schema to support hospital patient records, you might have a field for disease name. In that case, you’d have to allow for pnuemonoultramicroscopicsilicovolcanoconiosis which has 45 letters (caused by breathing in siliceous volcanic dust). A field for surgical procedure would have to support hepaticocholangiocholecystenterostomies which has 37 letters (creating a connection between the gall bladder and the hepatic duct). What about a field for how a measurement was obtained – electroencephalographically with 27 letters (using an electroencephalograph).

A schema to support chemical names could really be unlimited given the nature of systematic names for chemicals. The longest one in the dictionary is an acid called tetramethyldiaminobenzhydrylphosphinous with 39 letters (and given a few minutes I could probably draw its chemical structure by following the systematic method I learned at school :-)). The longest published chemical name is a kind of tobacco mosaic virus – ACETYLACETYL-SERYL-TYROSYL-SERYL-ISO-LEUCYL-THREONYL-SERYL-PROLYL-SERYL-GLUTAMINYL-PHENYL-ALANYL-VALYL-PHENYL-ALANYL-LEUCYL-SERYL-SERYL-VALYL-TRYPTOPHYL-ALANYL-ASPARTYL-PROLYL-ISOLEUCYL-GLUTAMYL-LEUCYL-LEUCYL-ASPARAGINYL-VALYL-CYSTEINYL-THREONYL-SERYL-SERYL-LEUCYL-GLYCYL-ASPARAGINYL-GLUTAMINYL-PHENYL-ALANYL-GLUTAMINYL-THREONYL-GLUTAMINYL-GLUTAMINYL-ALANYL-ARGINYL-THREONYL-THREONYL-GLUTAMINYL-VALYL-GLUTAMINYL-GLUTAMINYL-PHENYL-ALANYL-SERYL-GLUTAMINYL-VALYL-TRYPTOPHYL-LYSYL-PROLYL-PHENYL-ALANYL-PROLYL-GLUTAMINYL-SERYL-THREONYL-VALYL-ARGINYL-PHENYL-ALANYL-PROLYL-GLYCYL-ASPARTYL-VALYL-TYROSYL-LYSYL-VALYL-TYROSYL-ARGINYL-TYROSYL-ASPARAGINYL-ALANYL-VALYL-LEUCYL-ASPARTYL-PROLYL-LEUCYL-ISOLEUCYL-THREONYL-ALANYL-LEUCYL-LEUCYL-GLYCYL-THREONYL-PHENYL-ALANYL-ASPARTYL-THREONYL-ARGINYL-ASPARAGINYL-ARGINYL-ISOLEUCYL-ISOLEUCYL-GLUTAMYL-VALYL-GLUTAMYL-ASPARAGINYL-GLUTAMINYL-GLUTAMINYL-SERYL-PROLYL-THREONYL-THREONYL-ALANYL-GLUTAMYL-THREONYL-LEUCYL-ASPARTYL-ALANYL-THREONYL-ARGINYL-ARGINYL-VALYL-ASPARTYL-ASPARTYL-ALANYL-THREONYL-VALYL-ALANYL-ISOLEUCYL-ARGINYL-SERYL-ALANYL-ASPARAGINYL-ISOLEUCYL-ASPARAGINYL-LEUCYL-VALYL-ASPARAGINYL-GLUTAMYL-LEUCYL-VALYL-ARGINYL-GLYCYL-THREONYL-GLYCYL-LEUCYL-TYROSYL-ASPARAGINYL-GLUTAMINYL-ASPARAGINYL-THREONYL-PHENYL-ALANYL-GLUTAMYL-SERYL-METHIONYL-SERYL-GLYCYL-LEUCYL-VALYL-TRYPTOPHYL-THREONYL-SERYL-ALANYL-PROLYL-ALANYL-SERINE – with 1185 letters.

Probably the one that’s going to catch most people out is place names. The bank Kimberly and I use won’t allow a town/city name of more than 30 characters. That’s fine for the USA, where the longest place name has 24 letters (Winchester-on-the-Severn in Maryland or Washington-on-the-Brazos in Texas). However, if the back-end database is coded to only support 30 characters, that wouldn’t work around the world:

In Wales, there are two longest names are Llanfairpwllgwyngyllgogerychwyrndrobwyllllantysiliogogogoch with 58 letters and Gorsafawddachaidraigodanheddogleddolonpenrhynareurdraethceredigion wth 66 letters.

In New Zealand, there’s a hill called Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu – 85 letters and that name used to be in general use.

Supposedly, this is Swedish for "preparatory work on the contribution to the discussion on the maintaining system of support of the material of the aviation survey simulator device within the north-east part of the coast artillery of the Baltic"

Also, the full chemical name for ‘Tintin’ (the largest of all known proteins, containing some near 27,000 amino acids). The full name has, apparently, got just over 189,000 letters. Here’s an extract –