You know how every flu season there’s a new flu vaccine, yet somehow for other diseases you only need to be vaccinated once? It’s because there’s no vaccine that can target all types of flu. Apparently, a patient who contracted “swine flu” during the pandemic created a novel antibody with the remarkable ability to confer immunity to all 16 subtypes of influenza A. A group of researchers sifted through the white blood cells of the patient and managed to isolate four B cells that contain the code to produce this antibody. These cells have been cloned and are now producing antibodies, facilitating further research into a broad-spectrum vaccine that could protect against all strains of the flu.

For some reason I find this really interesting. I think it’s because at a gut level it gives me hope that if a killer virus did arise that wipes out most of humanity, there’s some evidence that maybe a small group of people would survive it. Also, never getting the flu again? Yes, please! On the other hand, this vaccine will be a fun one to observe as it evolves, particularly around the IP and production rights that result from it. Who owns it, and who deserves credit for it? Does the patient who evolved the antibody deserve any credit? What will be the interplay between the researchers, the funding institutions, the health industry and the consumer market? Should/can the final result or process be patented so that ultimately, a corporation is granted a monopoly on the vaccine (maybe there’s already a ruling on this)? Should we administer the resulting vaccine to everyone, risking the forced evolution of a new “superstrain” of flu that could be even deadlier, or should we restrict it only to the old, weak, and young? While these questions have been asked and sometimes answered in other contexts, everyone can relate to suffering through the flu, so perhaps the public debate around such issues will be livelier.

The rise of antibiotic-resistant superbugs is a product of our love of antibiotics. In the absence of antibiotics, a bug that has few resistances will grow faster and more efficiently than one that has to put on bullet-proof armor every morning and lug around heavy artillery. In other words, the biological machinery required to produce antibiotic resistance comes at a fitness cost for the bug. In antibiotic-free conditions, non-resistant strains grow faster than resistant strains; and at as little as 20 minutes per generation, just a few days can yield hundreds of generations. This is why, thankfully, not every bug out there has a full suite of drug resistance — a chief enemy of the superbug is the common bug.
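To put rough numbers on that fitness cost, here’s a toy back-of-the-envelope in Python. The 10% growth penalty for the resistant strain (22 minutes per generation instead of 20) is an assumption for illustration, not a measured figure:

```python
# Toy model: exponential growth of a non-resistant vs. a resistant strain.
# The 20-minute doubling time and the 10% fitness cost are illustrative
# assumptions, not measured values for any real bug.

def population(initial, doubling_min, minutes):
    """Population size after `minutes`, doubling every `doubling_min` minutes."""
    return initial * 2 ** (minutes / doubling_min)

two_days = 2 * 24 * 60                    # minutes in two days
common = population(1, 20, two_days)      # non-resistant: 20 min/generation
resistant = population(1, 22, two_days)   # resistant pays a 10% fitness cost

print(f"{two_days / 20:.0f} generations of the common bug in two days")
print(f"common bug outnumbers resistant bug by about {common / resistant:.0f}x")
```

Even a modest fitness cost compounds ruthlessly over a hundred-plus generations, which is the whole point of the argument above.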

According to this evolutionary theory for the acquisition and loss of drug resistance genes, a hospital is an ideal breeding environment for superbugs: hospitals are aseptic (less competition from common bugs) and full of antibiotics (plenty of selective pressure to acquire resistance genes).

Thus it is curious to find superbugs in food. Farms are teeming with common bugs, creating a selective pressure to lose antibiotic resistance genes. While antibiotics are routinely put into farm animal feed, it’s probably not cost-effective to use broad-spectrum antibiotics on such a scale. Perhaps O104:H4 is just a spontaneous coincidence, a fluke — a bug had acquired a set of genes, got lucky and grew, and just as quickly got edged out by more competitive neighbors. This could explain why it’s been tough to find its origin.

Fortunately, the entire sequence of the O104:H4 bug is available for download on the internet. Our friends in China — BGI, located in Shenzhen — acquired a sample and in an unusual act released the sequence for public download. This is unusual because research organizations typically hold this kind of data close to the chest, partially for peer review to vet it before public release, and partially for competitive advantage in academic publications — proprietary access to data is a common method to reduce competition for high-profile publications, and thus ensure your academic reputation. Whatever their reasons are for sharing the data, I think it’s worth noting the contribution, because now everybody in the world can perform an analysis on the bug.

And that’s where the fun begins! Analyzing the sequence data requires a little know-how, but fortunately, my “perlfriend” is a noted bioinformaticist. The raw sequence data provided by BGI is a set of oversampled sub-sequences, which have to be assembled by matching up overlapping regions. Once you assemble the sequence, you get a set of contiguous reads, but there are still gaps. It’s a bit like trying to compose a large picture out of a number of small photos taken at random. With enough sampling you will eventually create a complete picture, but for various technical reasons there are still ambiguities and gaps.
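To make the photo-matching analogy concrete, here’s a toy greedy assembler: repeatedly merge the two fragments with the largest suffix/prefix overlap. Real assemblers are vastly more sophisticated, and the reads below are made up:

```python
# Toy greedy assembly: merge fragments by their largest overlap until no
# overlaps remain. Leftover unmergeable pieces correspond to the "gaps"
# between contigs mentioned above. Reads here are invented.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(frags):
    """Repeatedly merge the pair of fragments with the biggest overlap."""
    frags = list(frags)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:   # no overlaps left: the remaining pieces are the contigs
            break
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

# Three overlapping reads of the made-up sequence GATTACAGGTTACA
reads = ["GATTACAG", "ACAGGTTA", "GTTACA"]
print(greedy_assemble(reads))  # -> ['GATTACAGGTTACA']
```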

After assembly, the genome of O104:H4 is stitched from over a half million short DNA samples into 513 contiguous fragments of DNA (“contigs” in bio-speak), with a total length of 5.3 million base pairs (notably, Wikipedia cites E. coli as having only 4.6 million base pairs, so O104:H4 is probably at least 15% longer — and likewise takes more time to replicate than a non-drug-resistant strain). Here’s contig 34 of the assembly:

(Fun fact: the word “Gattaca” occurs 252 times in the genome of O104:H4)
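That fun fact takes only a few lines to reproduce; here’s a sketch with a made-up stand-in contig (the real count of course requires the actual assembled genome):

```python
# Count (possibly overlapping) occurrences of a motif in a DNA string,
# the way the "Gattaca" fun fact could be computed. The contig below is
# an invented stand-in for the real assembly data.

def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

contig = "TTGATTACAGGCGATTACATT"
print(count_motif(contig, "GATTACA"))  # -> 2
```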

Aside from making gratuitous pop culture references, the raw DNA isn’t very useful to us — it’s as if we were staring at binary machine code. In order to analyze the data, you need to “decompile” the methods contained within the DNA. Fortunately, protein sequences are highly conserved. Thus, a function that has been determined through biological experiment (for example, snipping out the DNA and observing what happens to the cell, or transfecting/transforming the DNA into a new cell and seeing what new abilities are acquired) can be correlated with a sequence of DNA, which can then be pattern-matched over the entire record to determine what functions (genes) are inside the overall genome.

The pieces needed to do this reverse-engineering are a protein database and a tool called “blastx”. Both are available free for download.

The list of known proteins can be downloaded from uniprot.org. Searching for “drug resistance” restricted to E. coli organisms yields a nice list of proteins that have been identified by scientists over the years to confer upon E. coli parts of drug-resistance machinery. Overall, our query to the uniprot database returned 1,378 proteins that are described to confer drug resistance to E. coli.

Have a look at Multidrug transporter emrE [uniprot.org]. Inside the link, you’ll find a description of the biological mechanism for its function (it pumps antibiotics out of the cell), its secondary structure (a notion of the shape of the protein) and its 110-residue amino acid sequence.

PBP2_ECOLI is linked to penicillin resistance, and functions as a mutant of a gene that determines the shape of the bacteria. Reading through the bio-speak, it seems that this resistant variant is adapted to operate in the presence of the antibiotic; bacteria with non-resistant forms of this gene are unable to form properly shaped cell walls and thus die. So, by browsing this database, we are getting a feel for the variety of countermeasures that bacteria have: sometimes they are active (pumping the antibiotic out of the cell) and sometimes they are passive (mutations that enable operation despite the presence of antibiotics).

(Incidentally, I find it amusing that the sequence for PBP2 is shorter than, for example, my PGP public key block)

Now, you need the actual decompiler itself. The program we used is called blast; specifically, a variant known as blastx. Blast stands for “basic local alignment search tool”. This analysis program computes all of the possible translations of the E. coli DNA into protein sequences (there are six overall: the 5′->3′ and 3′->5′ strands, each read in three possible codon framing positions), and then pattern-matches the resulting amino acid sequences against the provided database of known drug-resistance sequences. The result is a sorted list of each known drug resistance protein along with the region of the E. coli genome that best matches the protein.
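The six-frame enumeration step can be sketched in a few lines of Python; this shows just the framing (blastx additionally translates each codon into an amino acid before matching):

```python
# The six reading frames blastx considers: the sequence itself and its
# reverse complement, each read starting at offsets 0, 1, and 2.
# Incomplete trailing codons are dropped.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse, to get the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Yield the codon list for each of the six reading frames."""
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            yield [strand[i:i + 3]
                   for i in range(offset, len(strand) - 2, 3)]

for frame in six_frames("GATTACA"):
    print(frame)
```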

Here, you can see that the gene for PBP2_ECOLI has a 100% match inside the genome of O104:H4.

Now that we have this list, we can answer some interesting questions, such as “How many of the known drug resistance genes are inside O104:H4?” I find it fascinating that this question is answered with a shell script:

My perlfriend writes these so quickly and effortlessly it’s as if she’s typing IMs to friends — I half expect to see an “lol” at the end of the script. Anyways, the above script tells us that 1,138 genes are a 100% match against the database of 1,378 genes. If you loosen the criteria to a 99% match, allowing for one or two mutations per gene — possibly a result of sequencing errors or just evolution — the list expands to 1,224 out of 1,378.
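The gist of that counting can be sketched in Python, assuming blastx was run with tabular output (`-outfmt 6`), where the third column is the percent identity of each hit. The file contents below are invented for illustration:

```python
# Sketch of the match-counting described above. Input mimics blastx
# tabular output (query, subject, percent identity); the hits themselves
# are made up, not real O104:H4 results.

sample_hits = """\
contig_34\tPBP2_ECOLI\t100.00
contig_12\tEMRE_ECOLI\t99.10
contig_07\tBLR_ECOLI\t64.30
"""

def best_identity_per_gene(lines):
    """Map each database protein to its best percent-identity match."""
    best = {}
    for line in lines:
        _query, gene, pct = line.split("\t")[:3]
        best[gene] = max(best.get(gene, 0.0), float(pct))
    return best

best = best_identity_per_gene(sample_hits.splitlines())
print(sum(1 for p in best.values() if p >= 100.0))  # perfect matches
print(sum(1 for p in best.values() if p >= 99.0))   # allow 1-2 mutations
print(sum(1 for p in best.values() if p < 70.0))    # "most likely not" present
```

The same one filter, flipped around, gives the “which resistances is it missing?” question discussed next.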

The inverse question is which drug-resistance genes are most definitely not in O104:H4. Maybe by looking at the resistance genes missing from O104:H4, we can gather clues as to which treatments could be effective against the bug.

In order to rule out a drug-resistance gene, we (arbitrarily) set a criterion: any gene with less than a 70% best-case match is classified as “most likely not” a resistance that the bug has. This query reveals 116 known drug-resistance genes that match at less than 70% in O104:H4. Here is the list:

Again, you can plug any of these protein codes into the uniprot database and find out more about them. For example, BLR is the “Beta-lactam resistance protein”:

Has an effect on the susceptibility to a number of antibiotics involved in peptidoglycan biosynthesis. Acts with beta lactams, D-cycloserine and bacitracin. Has no effect on the susceptibility to tetracycline, chloramphenicol, gentamicin, fosfomycin, vancomycin or quinolones. Might enhance drug exit by being part of multisubunit efflux pump. Might also be involved in cell wall biosynthesis.

Unfortunately, a cursory inspection reveals that most of the functions that O104:H4 lacks are just small, poorly understood fragments of machines involved in drug resistance. Which is actually an interesting lesson in itself: there is a popular notion that knowing a DNA sequence is the same as knowing what diseases or traits an organism may have. Even though we know the sequence and general properties of many proteins, it’s much, much harder to link them to a specific disease or trait. At some point, someone has to get their hands dirty and do the “wet biology” that assigns a biological significance to a given protein family. Pop culture references to DNA analysis are glibly unaware of this missing link, which leads to over-inflated expectations for genetic analysis, particularly in its utility for diagnosing and curing human disease and applications in eugenics.

While the result of this just-for-the-fun-of-it exercise isn’t a cure for the superbug, the neat thing about living here in The Future is that just a few days after an outbreak of a deadly disease halfway across the world, the sequence of the pathogen is available for download — and with free, open tools anyone can perform a simple analysis. This is a nascent, but promising, technology ecosystem.

With the madness of CES over and the Chinese New Year holiday coming up, I finally found some time to catch up on some back issues of Science. I came across a beautiful diagram of the metabolic pathways of one of the smallest bacteria, Mycoplasma pneumoniae. It’s part of an article by Eva Yus et al (Science 326, 1263-1271 (2009)).

Looking at this metabolic pathway reminds me of when I was less than a decade old, staring at the schematic of an Apple II. Back then, I knew that this fascinatingly complex mass of lines was a map to this machine in front of me, but I didn’t know quite enough to do anything with the map. However, the key was that a map existed, so despite its imposing appearance it represented a hope for fully unraveling such complexities.

The analogy isn’t quite precise, but at a 10,000 foot level the complexity and detail of the two diagrams feels similar. The metabolic schematic is detailed enough for me to trace a path from glucose to ethanol, and the Apple II schematic is detailed enough for me to trace a path from the CPU to the speaker.

And just as a biologist wouldn’t make much of a box with 74LS74 attached to it, an electrical engineer wouldn’t make much of a box with ADH inside it (fwiw, a 74LS74 (datasheet) is a synchronous storage device with two storage elements, and ADH is alcohol dehydrogenase, an enzyme coded by gene MPN564 (sequence data) that can turn acetaldehyde into ethanol).

In the supplemental material, the authors of the paper included what reads like a BOM (bill of materials) for M. pneumoniae. Every enzyme (pentagonal boxes in the schematic) is listed in the BOM with its functional description, along with a reference that allows you to find its sequence source code. At the very end is a table of uncharacterized genes — those who do a bit of reverse engineering would be very familiar with such tables of “hmm I sort of know what it should do but I’m not sure yet” parts or function calls.

Now that’s a memorable factoid. Nature recently published a paper titled “A small-cell lung cancer genome with complex signatures of tobacco exposure” (Nature 463, 184-190 (14 January 2010), Pleasance et al), which, as its title implies, contains the summary of the sequence of a cancer genome derived from a lung cancer tumor. It’s an interesting read; I can’t claim to understand it all. At a high level, they identified 22,910 somatic substitutions, 65 insertions and deletions, 58 genomic rearrangements, and 334 copy number segment variations; as I understand it, these are uncorrectable errors, i.e. the ones that got past the cell’s natural error-correction mechanisms. That’s out of about 3 gigabases in the entire genome, or an accumulated error rate of roughly 1 in 130,000 base pairs.

I’m not an expert on cancer, but the way it was explained to me is that basically every cell has the capacity to become a cancer, but there are several dozen regulatory pathways that keep a cell in check. In a layman sort of way, every cell having the capacity to become a cancer makes sense because we come from an embryonic stem cell, and tumorigenic cancer cells are differentiated cells that have lost their programming due to mutations, thereby returning to being a (rogue) stem cell. So, a cancer happens when a cell accumulates enough non-fatal mutations such that all the regulation mechanisms are defeated. Of course, this is basically a game of Russian roulette; some cells simply gather fatal mutations and undergo apoptosis. In order to become a cancer cell, it has to survive a lot of random mutations, but then again there are plenty of cells in a lung to participate in the process.

Above: a map of the mutations found in the cancer cell. The 23 chromosomes are laid end to end around the edge of the circle. There’s a ton of data in the graph; for example, the light orange bars represent the heterozygous substitution density per 10 megabases. A higher resolution diagram along with a more detailed explanation can be found in the paper.

The tag line for this post is lifted from the discussion section of the paper, where they assume that lung cancer develops after about 50 pack-years of smoking, which roughly translates to the ultimate cancer cell acquiring on average one mutation every 15 cigarettes smoked. Even though this is an over-simplification of the situation, the tag line is memorable because it makes the impact of smoking seem much more immediate and concrete: it’s one thing to say on average, in fifty years, you will get cancer from smoking a pack a day; it’s another to say on average, when you finish that pack of cigarettes, you are one mutation closer to getting cancer.
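The arithmetic behind that translation is simple, using the conventional definitions (a pack is 20 cigarettes; a pack-year is one pack a day for a year) and the substitution count from the paper:

```python
# Back-of-the-envelope behind the "one mutation per 15 cigarettes" tag
# line. Pack and pack-year definitions are the conventional ones; the
# mutation count is the somatic substitution tally reported in the paper.

cigarettes = 50 * 365 * 20   # cigarettes smoked in 50 pack-years
mutations = 22910            # somatic substitutions in the tumor genome
print(f"about one mutation per {cigarettes / mutations:.1f} cigarettes")
```

The ratio comes out to roughly 16, which the paper (and the tag line) round to one mutation for every 15 cigarettes.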

It’s the year 2009, and I’m wondering: where is my flying car? After all, Hollywood reels from the 60’s and 70’s all predicted that flying cars are what I’d be using to get around town these days. Of course, automotive technology isn’t the only victim of Hollywood hype. The potential impact of personalized genomics has been greatly overstated in movies like GATTACA. This has led to the pervasive myth that your genome is like a crystal ball, and that somehow your fate is predestined by your genetic programming. Recently, my perlfriend co-authored a paper in Nature (“A Personalized Medicine Research Agenda”, Nature Vol 461, October 8 2009), comparing Navigenics’ and 23andMe’s “Direct to Consumer” (DTC) personal genomics offerings. She’s qualified to offer deep insight into personal genomics, since she designed the original Illumina bead chip used by leading companies to generate their DTC genetic data, and she is also the person who made sense of the first complete diploid human genome sequence (12). She’s sort of the biology equivalent of the reverse engineer who takes a raw binary and annotates meaning into the disassembled code. So, let the mythbusting begin.

Myth: having your genome read is like hex-dumping the ROM of your computer. Many people (I was one of them) have the impression that “reading your genome” means that at the end of the day someone has a record of all the base pairs of DNA in my genome. This is called a “full sequence”. In reality, full sequencing is still cost-prohibitive, and instead a technique called “genotyping” is used. Here, a selective diff is done between your genome and a “reference” human genome, or in other words, your genome is simply sampled in potentially interesting spots for single-point mutations called Single Nucleotide Polymorphisms (SNPs, pronounced “snips”). In the end, about 1 in 3000 base pairs are actually sampled in this process. Thus, the result of a personalized genomic screen is not your entire sequence, but a subset of potentially interesting mutations compared against a reference genome. This naturally leads to two questions: first, how do you choose the “interesting subset” of SNPs to sample? And second, how do we know the reference genome is an accurate comparison point? This sets us up to bust another two myths.
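The “selective diff” can be pictured in a few lines of Python. The rs-numbers, positions, and alleles below are invented for illustration; a real chip samples on the order of a million SNPs:

```python
# Genotyping as a selective diff: only a sampled subset of positions is
# compared against the reference genome. All identifiers and alleles here
# are made up for illustration.

reference = {"rs123": "A", "rs456": "G", "rs789": "C"}   # reference alleles
sample    = {"rs123": "A", "rs456": "A", "rs789": "C"}   # your chip readout

def snp_diff(ref, obs):
    """Report only the sampled positions where the genotype differs."""
    return {rs: (ref[rs], obs[rs]) for rs in ref if obs.get(rs) != ref[rs]}

print(snp_diff(reference, sample))  # -> {'rs456': ('G', 'A')}
```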

Myth: We know which mutations predict disease. Herein lies a subtle point. Many of the mutations are simply correlative with disease, but not proven to be predictive or causal with disease. The truth is that we really don’t understand why many genetic diseases happen. For poorly understood diseases (which is still most of them), all we can say is that people who have a particular disease tend to have a certain pattern of SNP mutations. It’s important not to confuse causality with correlation. Doing so might lead you to conclude, for example, that diet coke makes you fat, because diet coke is often consumed by people who are overweight.

Thus, there are two echelons of understanding that can come from a genotype: disease correlations, and disease causes. The majority of SNP mutation-based “predictions” are correlative, not causative. As a result, a genotype should not be considered a “crystal ball” for predicting your disease future; rather, it is closer to a “Rorschach blot” that we have to squint and stare at for a while before we can make a statement about what it means. The table below from the paper illustrates how varied disease predictions can be as a result of these disagreements on the interpretation of mutation meanings.

Myth: the “reference genome” is an accurate reference. The term “reference genome” alone should tip you off to a problem: it implies there is such a thing as “reference people”. Ultimately, just a handful of individuals were sequenced to create today’s reference genome, and most of them are of European ancestry. As time goes on and more full-sequence genetic data is collected, the reference genome will be merged and massaged to present a more accurate picture of the overall human race, but for now it’s important to remember that a genotype study is a diff against a source repository of questionable universal validity, partially because it’s questionable whether there is such a thing as a “reference human”, i.e. there are structural variations, and some SNPs have different frequencies in different populations (e.g. the base “A” could dominate in a European population, but at that same position, the base “G” could dominate in an African population). It’s also important to keep in mind that the “reference genome” has an aggregate error rate of about 1 error every 10,000 base pairs, although to be fair the process of discovering a disease variant usually cleans up any errors in the reference genome for the relevant sequence regions.

So now you can see that in fact “reading your genome” is less of looking into a crystal ball and more of staring at a Rorschach blot obscured by cheesecloth (i.e., the genome is simply sampled and not sequenced). And, even if we could remove the cheesecloth and sequence the genome such that we knew every base pair, it would still be … a Rorschach blot, but in high resolution. It will be decades until we have a full understanding of what all the sequences mean, and even then it’s unclear if they are truly predictive.

Here lies perhaps the most important message, and a point I cannot stress enough: in most situations, environment has as much to do with who you are, what you become, and what diseases you may develop as your genes do, and perhaps even more. If there is any upside to personal genomics, it won’t be due to crystal ball predictions. It will be the lifestyle changes it can encourage. If there’s one thing I’ve learned from dating a preeminent bioinformaticist, it’s that no matter your genetic makeup, most common diseases can be prevented with proper diet and exercise.