The field of bioinformatics is a fairly young one, and because of that it’s very easy to be ignorant of its history. Crick and Watson (and those other people) determined the structure of DNA. Sanger worked out how to sequence proteins and nucleic acids. Some other people made all of these things faster and better and now we have huge sequence databases that mean we can get hold of an intractable quantity of data faster than we could ever plausibly need to, and what else is there to know?

Margaret Dayhoff graduated with a PhD in quantum chemistry from Columbia, where she’d performed computational analysis of various molecules to calculate their resonance energies[3]. The next few years involved plenty of worthwhile research that isn’t relevant to the story, so we’ll (entirely unfairly) skip forward to the early 60s and the problem of turning a set of sequence fragments into a single sequence. Dayhoff worked on a suite of applications called “Comprotein”. The original paper can be downloaded here, and it’s a charming look back at a rigorous analysis of a problem that anyone in the field would take for granted these days. Modern fragment assembly involves taking millions of DNA sequence reads and assembling them into an entire genome. In 1960, we were still at the point where it was only just getting impractical to do everything by hand.
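To give a feel for what "turning a set of sequence fragments into a single sequence" means, here's a toy sketch of greedy overlap merging in Python. To be clear, this is not Comprotein's method (I'm not reproducing the paper here), just a common textbook simplification. The function names and the example peptide fragments are made up for illustration.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=2):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Hypothetical helper for illustration: greedy merging can be fooled
    by repeats, so real assemblers are considerably more careful.
    """
    # Drop exact duplicates (keeping order) and fragments contained in others.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left overlaps; return the separate pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Three made-up overlapping peptide fragments.
print(greedy_assemble(["MKVLA", "VLAAGI", "AGIVG"]))
```

With three fragments you could do this by eye, which is rather the point: it's the step from a handful of fragments to hundreds that made doing it by hand impractical.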

This single piece of software was arguably the birth of modern bioinformatics, the creation of a computational method for taking sequence data and turning it into something more useful. But Dayhoff didn’t stop there. The 60s brought a growing realisation that small sequence differences between the same protein in related species could give insight into their evolutionary past. In 1965 Dayhoff released the first edition of the Atlas of Protein Sequence and Structure, containing all 65 protein sequences that had been determined by then. Around the same time she developed computational methods for analysing the evolutionary relationship of these sequences, helping produce the first computationally generated phylogenetic tree. Her single-letter representation of amino acids was born of necessity[4] but remains the standard for protein sequences. And the atlas of 65 protein sequences developed into the Protein Information Resource, a dial-up database that allowed researchers to download the sequences they were interested in. It’s now part of UniProt, the world’s largest protein database.
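For anyone who hasn't met the single-letter code, here's a small Python sketch that converts the three-letter form into it. The table covers the 20 standard amino acids; the function name and the hyphen-separated input format are my own assumptions for the example, not anything taken from the Atlas.

```python
# Standard one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_seq, sep="-"):
    """Convert e.g. 'Met-Lys-Val' to 'MKV' (hypothetical helper)."""
    return "".join(
        THREE_TO_ONE[residue.strip().capitalize()]
        for residue in three_letter_seq.split(sep)
    )


peptide = "Met-Lys-Val-Leu-Ala"
short = to_one_letter(peptide)
print(short, len(peptide), len(short))
```

Even ignoring the separators, that's a third of the characters, which matters a great deal when your storage medium has 80 columns.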

Her contributions to the field were immense. Every aspect of her work on bioinformatics is present in the modern day — larger, faster and more capable, but still very much tied to the techniques and concepts she pioneered. And so it still puzzles me that I only heard of her for the first time when I went back to write the introduction to my thesis. She’s remembered today in the form of the Margaret Oakley Dayhoff award for women showing high promise in biophysics, having died of a heart attack at only 57.

I don’t work on fruitflies any more, and to be honest I’m not terribly upset by that. But it’s still somewhat disconcerting that I spent almost 10 years working in a field so defined by one person that I knew so little about. So my contribution to Ada Lovelace Day is to highlight a pivotal woman in science who heavily influenced my life without me even knowing.

[1] You think it’s difficult bringing up a compiler on a new architecture? Try bringing up a fruitfly from scratch.
[2] Except for the cases where the low-level language itself is functionally significant, and the cases where the intermediate representation is functionally significant.
[3] Something that seems to have involved a lot of putting punch cards through a set of machines, getting new cards out, and repeating. I’m glad I live in the future.
[4] The three-letter representation took up too much space on punch cards.

I was certainly aware there was someone in early bioinformatics called Dayhoff (Dayhoff protein substitution matrices were a part of my life at one point) but I think I’d missed that Dayhoff was a woman, so thanks!

mjg59, thanks for writing this. I raised generations of fruit flies in genetics class and remember those card-sorting machines, which makes Dayhoff’s career particularly interesting to me.

Some background on those machines: EAM or ‘electrical accounting machines’ were early computers that worked by taking a stack of 80-column punch cards and processing them by sorting them into groups and recording totals. The first ones were called tabulating machines and were invented by Herman Hollerith in 1890. They continued to be sold until the 1970s and used for a while after that. I worked in one corner of a US government EAM room in 1979.

Some examples: IBM 407 and the older IBM 402. The IBM 407 cost $800/month in 1976. That’s $3,183/month in current dollars!

One of the earliest uses of computers for consumers was tracking utility payments. Each bill would be accompanied by a punch card that would be returned with the payment check so the card could be loaded into an EAM to process the payment. The cards were always labeled ‘Do not fold, spindle, or mutilate’. Bar codes are a lot easier.