How we are assembling the human genome (idea)

The end result of the Human Genome Project is like a box of broken glass; shattered information worthless by itself.

As many of you know, the human genome is the collection of genetic information about the species Homo sapiens, or humans. In recent months, the public Human Genome Project and the private corporation Celera announced that they had both obtained a complete sequence of the human genome and that both had prepared working drafts of the sequence.

What they failed to note is that this is only the simple first leg of the problem; a huge amount of work remains before we even know the correct order of the sequences, let alone have a thorough understanding of how exactly the human genome works. Even more amazing is the fact that we have at most 60% of the sequence, yet it is marked as completed. The problem isn't solved; all that has happened is that we finally have the best partially complete data set that current science can produce.

In essence, even though the Human Genome Project is nearing completion, we have far, far more work left to do than has been done already. People might immediately expect huge medical advances as a result of the genome being completed, but there is a mountain of work left to do. As an employee facing these huge challenges, I want to address the difficulties of a genome project such as the human one; the same truth still applies for the genome projects of other species as well.

The rest of this document is an attempt to explain in layman's terms what has been done already and all of the things that need to be done. If you are confused by parts of this or wish to express your rage at me for glossing over or excluding a particular element, send me (tes) a /msg and I will try to clear this up by revising this writeup.

What Is Already Done?

To understand what has been done, one must first understand the techniques used to obtain this information. If you can, imagine that each person has a genome that is a painted glass rod a mile long. Each millimeter or so on this rod has a colored stripe, one of four different colors. (If this were actually in proportion, the stripes would be much smaller or the rod would be much longer.) Now, every single human being has an almost identical "genome rod"; to put it in perspective, out of this mile, roughly 5,100 of the 5,280 feet are exactly identical for every human being. Got this picture? Every human's genome is a nearly identical painted glass rod.

Now, the first thing done when sequencing a genome is that all of the pieces that are different for each human are cut out and removed. This leaves each person with a rod roughly 5,100 feet long, with a different colored stripe at each millimeter. Given this, only the genetic information of one person is needed, since all people have this identical sequence. This one person has a glass rod, cut in several places, but totaling 5,100 feet or so in length. Now, given a good sized blood sample, scientists can make millions of exact duplicates of this glass rod.

Even with all of these duplicated, identical rods, however, science does not have the ability to see the sequence of the stripes on the rod. It is far too long for scientific techniques to analyze; at most, we can analyze a few centimeters at a time on either end of the rod. So what do we do with this collection of identical glass rods to solve the problem? We smash the holy crap out of every one of them. In essence, we run them through a chemical blender, breaking them down into millions of small chunks called ESTs that we can analyze.

Once we've got these ESTs, we can then run them through a procedure (there are actually machines that do this automatically) that gives as a result the exact sequence of each little piece. Using statistical techniques, one can determine the number of pieces that actually need to be analyzed before the complete picture is known. Once this number was reached, it was announced to the world as the working draft of the human genome.
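To give a feel for the kind of statistics involved, here is a rough sketch in Python. It assumes the pieces land uniformly at random along the rod (a Poisson approximation often used for shotgun-style sequencing); that model, the function names, and the numbers are my illustration, not necessarily what either project actually used.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of the rod seen at least once.

    `coverage` is (number of pieces * piece length) / rod length.
    Assumes pieces fall uniformly at random (Poisson approximation).
    """
    return 1.0 - math.exp(-coverage)


def pieces_needed(rod_length: int, piece_length: int, target_fraction: float) -> int:
    """Number of random pieces needed to expect `target_fraction` of the rod covered."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")
    coverage = -math.log(1.0 - target_fraction)
    return math.ceil(coverage * rod_length / piece_length)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the curve.
    for c in (1, 2, 5, 8):
        print(f"{c}x coverage -> about {fraction_covered(c):.4f} of the rod seen")
```

The thing to notice is the diminishing return: each extra round of smashing and reading fills in less than the one before, which is why some gaps survive even after a great deal of sequencing.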

Like I said at the start, the end result of the Human Genome Project is like a box of broken glass; shattered information worthless by itself.

What Needs To Be Done?

Let's make a laundry list, shall we? Remember, each of these items is a problem by itself as big as or bigger than the original Human Genome Project.

1. Making the genome available to the researching public.
Even taking just raw genetic information, encoded by its exact content, the raw data we have now takes up many, many gigabytes of storage space. Once some of the other items here start to take effect, this number will jump easily into the realm of terabytes. For useful public research and investigation into this information, there needs to be a central warehouse of this huge chunk of data. To date, the GenBank database has kept up with this, but the raw amount of data being poured in is on the verge of overwhelming it.
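As a back-of-the-envelope illustration of the storage question, here is a small Python sketch. Since there are only four "stripe colors," each one fits in two bits. The packing scheme, the function names, and the round figure of three billion stripes are my assumptions for the example; real databases store far more than the bare sequence, which is where the gigabytes come from.

```python
# Two-bit codes for the four bases; this particular assignment is arbitrary.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def packed_size_bytes(num_bases: int, bits_per_base: int = 2) -> int:
    """Bytes needed to store `num_bases` at `bits_per_base`, rounded up."""
    return (num_bases * bits_per_base + 7) // 8


def pack_bases(sequence: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


if __name__ == "__main__":
    # Assumed round figure for one copy of the genome.
    print(packed_size_bytes(3_000_000_000), "bytes for one bare copy")
```

One bare copy is small; the trouble is that the raw data consists of many overlapping pieces, plus everything we learn about them afterwards.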

2. Filling in the gaps that the statistics missed.
Even though it is very likely that the complete sequence (at least, the part we can generate) is finished, given the process there are still likely to be gaps in the middle somewhere, simply because the chemistry of some pieces caused them to break too rarely or too often. Some detailed and very careful sequencing will have to be done to obtain sequence data for these missing segments.

3. Error checking.
Again, with this huge amount of data being produced relatively quickly, the likelihood of some error is high. As we speak, the entire procedure is being repeated and more ESTs are being created in order to catch these errors; when errant sequences are discovered using methods of string comparison (meaning computer programs that take one EST sequence and compare it to another), they are thrown out. This produces more data, compounding problem 1.
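For the curious, the simplest possible version of such a string comparison might look like the Python sketch below. It assumes the two sequences are already lined up and the same length, which real data rarely is; the function names and the mismatch threshold are hypothetical, and real tools use alignment methods that tolerate insertions and deletions.

```python
def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def keep_consistent(trusted: str, candidates: list[str], max_fraction: float = 0.2) -> list[str]:
    """Keep candidates that agree closely with a trusted copy; drop the rest."""
    return [
        c for c in candidates
        if len(c) == len(trusted) and mismatch_fraction(trusted, c) <= max_fraction
    ]


if __name__ == "__main__":
    trusted = "ACGTACGTAC"
    repeats = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
    print(keep_consistent(trusted, repeats))
```

Here the third repeat disagrees at most positions and is thrown out, while the one with a single differing stripe survives.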

4. Actual assembly of the order.
Another major problem is the sorting of the pieces into the correct order. Given the huge number of pieces and the lack of real knowledge where these pieces came from, it is a massive computational problem to solve. It will also produce a great deal of data, increasing the challenge of problem 1.
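To show why this is hard, here is a toy version of the sorting problem in Python: repeatedly glue together the two pieces whose ends overlap the most. This greedy approach is only a sketch under my own assumptions (error-free pieces, no repeated stretches, a hypothetical minimum overlap of three stripes), not the method used on the real data, and it compares every piece against every other, which is exactly what becomes painful with millions of pieces.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if under min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(pieces: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    pieces = list(dict.fromkeys(pieces))  # drop exact duplicates, keep order
    while len(pieces) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; remaining pieces stay separate
        merged = pieces[best_i] + pieces[best_j][best_n:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

With three tidy pieces this rebuilds the original stretch in two merges. Pieces that share no overlap are left separate, which is what a gap looks like from the computer's point of view.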

5. Figuring out where the genes are in this completed order.
Another huge computational problem is using statistical methods, developed by careful analysis of the genes that have been discovered in experiments, to pull out the pieces of this massive genome that correlate to genes. Much of the sequence is in fact useless; only certain small parts actually contain the genetic code for anything useful. Locating these pieces is a major problem and will produce a significant amount of data, increasing the challenges of problem 1.
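The statistical methods themselves are too involved to show here, but a much simpler stand-in gives the flavor: scan for stretches that begin with the start signal ATG and run, three letters at a time, to one of the stop signals. The Python sketch below does only that, on one strand; the function name and the minimum length are my assumptions, and real human genes are interrupted and far harder to spot than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_open_reading_frames(sequence: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, text) for each ATG-to-stop stretch in the three forward frames.

    `end` is exclusive and includes the stop codon. The reverse strand is not scanned.
    """
    seq = sequence.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    found.append((start, end, seq[start:end]))
                start = None
    return found


if __name__ == "__main__":
    print(find_open_reading_frames("CCATGGCGTACTGACC"))
```

Even this crude scan shows why the assembled order has to come first: a candidate gene only makes sense if the letters around it are in the right place.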

6. Figuring out what the genes do once they are found.
Once we've discovered the genes, we need to know their function. This is old-fashioned bench work, folks: once these genes are translated into proteins, what do they do? This leads to more questions relating to how they can be altered to improve function and/or prevent disease, but first we need to know their function. Again, more data for problem 1.

7. Developing techniques to attack the more varied parts of the genome.
As big as the other problems are, this is perhaps the biggest of all. Some pieces in the sequence vary greatly; how can we accurately obtain sequences for all of these variations? Once this is solved, then all of the problems above must be addressed for these sequences, too.

The End Result?

Basically, we've just begun. As you can see, there is a huge need for skilled and efficient bioinformaticists with the huge amount of data being produced; there is also a huge need for funding to analyze this mountain of data we already have, to store it, and to distribute it. These are all massive problems that need our attention in the very near future.