Mapping every protein-DNA interaction site in the genome

Researchers have taken a technique that was popular back in the 1980s and …

These days, for many of the species we study, it's easy to get a sense of what's present in the DNA, as there are ever-expanding collections of genomes available to browse online. But the DNA sequence alone doesn't necessarily tell scientists what they need to know. In living cells, a large assortment of proteins attaches to specific sites in the DNA in order to turn genes on or off, a process that ultimately determines the form and function of a cell; without the proteins, the DNA would be inert. Now, researchers have developed a technique that lets them identify every site in the genome that's interacting with a protein, all in a single experiment. The best part is that it's an extension of a method that was popular 20 years ago.

It is currently possible to study whether a protein is attached to specific stretches of DNA, but existing techniques only work with short sequences. As a result, you have to know in advance which pieces of DNA to look at, which means researchers only examine the parts of the genome they're already interested in. It's not a great way to discover something new.

A new paper, released over the weekend by Nature Methods, describes a method that's not only a great way to discover new sites where proteins are acting on DNA; it identifies every such site in the genome at once. To do so, it relies on a technique called DNase footprinting, which is used to identify protein-binding sites on short stretches of DNA. DNase is an enzyme that cuts up pieces of DNA at random sites. If a site is already occupied by another protein—a protein that's regulating a nearby gene, for example—then DNase won't be able to get to the DNA in order to cut it.

DNase footprinting works by taking a large population of identical DNA fragments and allowing proteins to stick to them; the fragments are then exposed to DNase at a low concentration so that each fragment is only cut once or twice. What you get is a series of fragments of DNA, ending where the DNase cut. All you have to do is identify where the DNase didn't cut, and you know where the proteins are. DNase footprinting was heavily used back when I was in college in the late 1980s, but has fallen out of favor in recent years as techniques that don't rely on enzymes like DNase were developed.
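The logic of a classic footprint can be sketched with a toy simulation (illustration only, not code from the paper): many copies of one fragment each receive a single cut at a random position, except where a hypothetical bound protein blocks the enzyme, so the cut tally drops to zero over the occupied interval.

```python
import random

# Toy simulation of classic DNase footprinting. A hypothetical protein
# occupies positions 40-59 of a 100 bp fragment; each fragment copy gets
# exactly one cut, at a random position the protein doesn't cover.
def simulate_footprint(n_fragments=10_000, length=100, bound=(40, 60), seed=0):
    rng = random.Random(seed)
    accessible = [p for p in range(length) if not (bound[0] <= p < bound[1])]
    cuts = [0] * length
    for _ in range(n_fragments):
        cuts[rng.choice(accessible)] += 1  # one cut per fragment copy
    return cuts

cuts = simulate_footprint()
# The protected interval collects no cuts -- that gap is the "footprint".
assert all(c == 0 for c in cuts[40:60])
assert all(c > 0 for c in cuts[:40] + cuts[60:])
```

With 10,000 fragments spread over 80 accessible positions, every unprotected base gets hit roughly 125 times, so the protein-bound gap stands out unmistakably.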

A team of researchers at the University of Washington developed a variant of DNase footprinting that works on an entire genome. Digesting the whole genome this way produces a random collection of DNA fragments from everywhere in it, which makes it difficult to identify where any one fragment came from. So, the researchers simply sequenced all of them using one of the recent-generation high-throughput genome sequencing machines. Using the yeast genome as a test case, the authors sequenced a total of over 23 million individual fragments and identified where each one ended, which is where DNase cut the DNA. Again, sites occupied by proteins were rarely cut, so when the number of cuts was plotted against the DNA sequence, the results looked like this:

A rough idea of the genome footprint data. DNA sites bound by proteins are cut by DNase less often.
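Genome-wide, the bookkeeping behind that plot amounts to tallying mapped fragment ends into a per-base cut-count track. Here's a minimal sketch, assuming each sequenced read has already been aligned to a (chromosome, cut position) pair; the function name and data shapes are hypothetical, not from the paper.

```python
# Turn a stream of mapped fragment-end coordinates into per-base cut counts.
def cut_count_track(mapped_ends, chrom_lengths):
    """mapped_ends: iterable of (chrom, pos); chrom_lengths: {chrom: length}."""
    tracks = {chrom: [0] * n for chrom, n in chrom_lengths.items()}
    for chrom, pos in mapped_ends:
        tracks[chrom][pos] += 1  # each fragment end marks one DNase cut
    return tracks

# Three fragment ends on a 10 bp toy "chromosome":
track = cut_count_track([("chrI", 5), ("chrI", 5), ("chrI", 7)], {"chrI": 10})
assert track["chrI"] == [0, 0, 0, 0, 0, 2, 0, 1, 0, 0]
```

At the scale of the paper this tally runs over 23 million reads, but the principle is the same: low-count stretches are the candidate protein footprints.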

The sites where proteins are interacting with the DNA stand out pretty obviously. Setting their acceptable error rate to five percent, the authors were able to identify 4,384 sites in the genome that were bound by proteins. Most of these sites had sequences that matched up well with the known preferences of DNA-binding proteins. The authors also validated their findings against previous studies that identified protein-binding sites by other methods (primarily ChIP), and found that their genome footprinting generally identified a superset of the same locations in the genome.
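As a rough illustration of how protected sites might be pulled out of a cut-count track, here's a naive caller that flags runs of bases whose counts fall well below the local average. The authors used a proper statistical test with a five percent error rate; this simple threshold rule is only a stand-in to show the shape of the computation, and its parameters are invented.

```python
# Naive footprint caller (illustrative, not the paper's method): report
# intervals of at least min_len bases whose cut counts are below a fixed
# fraction of the track-wide average.
def call_footprints(cuts, min_len=6, depletion=0.25):
    background = sum(cuts) / len(cuts)           # crude background rate
    threshold = background * depletion
    footprints, start = [], None
    for i, c in enumerate(cuts + [threshold + 1]):  # sentinel flushes last run
        if c <= threshold and start is None:
            start = i                            # run of protected bases begins
        elif c > threshold and start is not None:
            if i - start >= min_len:
                footprints.append((start, i))    # half-open interval
            start = None
    return footprints

# An 8 bp depleted stretch in otherwise uniform coverage:
assert call_footprints([10] * 20 + [0] * 8 + [10] * 20) == [(20, 28)]
```

A real caller would use a local rather than global background and assign each candidate a significance value, which is how the authors arrived at their 4,384 sites.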

Most impressive, perhaps, was when they took a careful look at individual binding sites for DNA binding proteins with known structures. These structures can identify the specific bases that are touched by the protein. The authors were able to see that these bases were more protected from DNase than their immediate neighbors. So the technique seems to allow some pretty impressive precision.

The yeast genome was used as a test case because it's compact, and there aren't as many protein-binding sites present. So, the big question is whether the technique will scale to big, messy genomes like our own. The authors suggest that it might. It turns out that areas where proteins bind in the human genome aren't as tightly packed as other areas of the genome, so they're easier for DNase to get at. So, they suggest that the DNA fragments will be biased towards areas of interest, which will mean we don't have to sequence as much DNA to get the big picture.

Until someone tries it, whether it will actually work is anyone's guess. But I'm kind of hoping it does, if only because it would be nice to see a technique from my past wind up being important again.

6 Reader Comments

A nice extension of the existing methods. Limited by starting with purified (i.e. de-proteinized) DNA. What is even more valuable is trying to map binding of your favorite protein in the context of the natural complement of other proteins. But this is hard using this protection scheme. You might want to detect the subtle differences in cleavage when your protein is present or absent. But that raises more issues owing to potential indirect effects (in cells).

I wonder if it would be possible to get around the need to use purified DNA as a source material by using a crosslinking scheme. Take whole genomes with the full complement of proteins and crosslink the proteins to the DNA. Now mechanically or otherwise (e.g. thermal denaturation followed by DNase) fragment the resulting covalent protein-DNA mixture and sequence. If the crosslinker and sequencing method are chosen correctly, the sites where proteins are bound to the DNA should cause read termination, giving results not unlike those from the method in the article. Of course you'd want to try several different crosslinkers, and they would need to be chosen not to join the DNA strands, and you'd probably get a very strong signal from the histone-DNA interaction that would need to be filtered out or avoided by some other clever trick. Further, while crosslinkers are much smaller than DNase, they are not zero-size (unless you use UV, which doesn't meet the requirements) and hence will not penetrate perfectly into the structured DNA/protein mega-complex. Still, it's an idea.

As a bonus you could use cleavable linkers that might let you do some clever 2D MS tricks to identify which proteins bind where (although it might be easier to just use more conventional methods once you've got the site data).

There seems to be some misunderstanding here: the DNase footprinting is performed on nuclei, not purified DNA. So the genomic DNA has its full complement of proteins bound. Where the DNA is protected by a protein, it will be protected from cleavage by the DNase. The technique will not reveal which proteins are bound, but it does show which parts of the DNA are bound by cellular proteins.

Unfortunately, there are still limitations. They had to get every cell stopped at a certain stage in the cell cycle, or else there would be far too much noise in the data. That also means you would have great difficulty finding sites where proteins rarely or intermittently bind DNA.

Otherwise, though, it looks very interesting. I was also wondering about chemical crosslinking instead of searching for uncut sites. The whole footprinting scheme still seems to be a rather indirect way to find the sites.

Very interesting approach. In vivo footprinting has always been something that has been highly desired but very difficult to accomplish. This new method might change that if it proves widely useful.

I imagine this will be incredibly hard to scale to mammalian cells, whose genomes are about two orders of magnitude larger than yeast's. The problem I see is that ChIP-seq sequences the binding sites directly, a positive approach (albeit with much lower resolution), while this footprinting uses a negative approach: sequencing all of the genome outside of the binding sites. To get sufficient data with mammals would require incredibly deep sequencing, and even with modern high-throughput sequencing it would be cost prohibitive. They required 23 million tag reads in yeast, so that would translate to 2,300 million tags to get the same level of coverage for a mammalian genome. That's a lot... and that's not even considering that mammalian genomes are more complex, have much more repetitive sequence, and might need to be sequenced to greater depth to get the same level of information...

The cell synchronization is another problem: it can be accomplished in mammalian cells fairly easily, but would that be representative of a true physiological state?