Thursday, June 07, 2007

Illumina-ting DNA-protein interactions

The new Science (sorry, you'll need a subscription beyond the abstracts) has a bunch of genomics papers, but the one closest to my heart is a paper from Stanford and Caltech using the Illumina (ex-Solexa) sequencing platform to perform human genome-wide mapping of the binding sites for a particular DNA-binding protein.

One particularly interesting angle on this paper is actually witnessing the beginning of the end of another technique, ChIP Chip. Virtually all of the work in this field relies on using antibodies against a DNA-binding protein which has been chemically cross-linked to nearby DNA in a reversible way. This process, chromatin immunoprecipitation or ChIP, was married with DNA chips containing potential regulatory regions to create ChIP on Chip, or ChIP Chip.

It is a powerful technique, but with a few limitations. First, you can only see binding to what you put on a chip, and it isn't practical to put more than a sampling of the genome on a chip. So, if you fail to put the right pieces down, you might miss some interesting stuff. This interacts in a bad way with a second consideration: how small to shear the DNA. A key step I left out in the ChIP description above is the mechanical shearing of the DNA into small fragments. Only those fragments bound to your protein of interest should be precipitated by the antibody. The smaller your sheared fragment size, the better your resolution -- but also the greater the risk that you will successfully precipitate DNA that doesn't hybridize to any of your probes.

A stepping stone away from ChIP Chip is to clone the fragments and sequence them, and several papers have done this (e.g. this one). The new paper ditches cloning entirely and simply sequences the precipitated DNA using the Illumina system.

With sequencing, your ability to map sites will now be determined by the ability to uniquely identify sequence fragments and, again, by the size distribution of your sheared DNA. Illumina has short read lengths, but the handicap imposed by this is often greatly overestimated. Computational analyses have shown that many short reads are still unique in the genome, and assemblers capable of dealing with whole-genome shotgun data from complex genomes using short reads are starting to show up. One paper I stumbled on while finding references for this post includes Pavel Pevzner as an author, and I always find myself much wiser after reading a Pevzner paper (his paper on the Eulerian path method is exquisitely written).
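To make the "short reads are often still unique" point concrete, here is a minimal Python sketch of how one might measure it. It is not the method from any of the analyses mentioned above; the function name `unique_kmer_fraction` is my own invention, it looks at one strand only (a real analysis would also consider the reverse complement), and it holds every k-mer in memory, so it is only suitable for toy sequences.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once.

    Assumption: single strand only, exact matches only.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and len(genome)")
    # Every window of length k, one per start position.
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


if __name__ == "__main__":
    toy_genome = "AAAAACGT"  # made-up sequence with a small repeat
    for k in (2, 5):
        print(k, unique_kmer_fraction(toy_genome, k))
```

Even in this tiny made-up sequence, lengthening the window from 2 to 5 raises the unique fraction, which is the same effect (at a vastly larger scale) that makes 25 nt reads more useful than one might guess.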

In this paper, read lengths of 25 nt were achieved, and about 1/2 of those reads were uniquely mappable to the genome, allowing for up to 2 mismatches vs. the reference sequence. Tossing 50% of your data is frustrating, but with 2-5 million reads in the experiment, you can tolerate some loss. These uniquely mapped sequences were then aligned to each other to identify sites marked by multiple reads. A 5X enrichment of a site vs. a control run was required to call a positive.
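The two filters described above (keep only reads that map to a single place with at most 2 mismatches, then require 5X enrichment over control) can be sketched roughly as follows. This is a toy illustration, not the paper's pipeline: the names `unique_hit` and `call_enriched` are hypothetical, the mapper is a brute-force scan of one strand, the pseudocount is my own assumption to avoid dividing by zero, and I assume the ChIP and control runs have comparable read depth.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def unique_hit(read: str, genome: str, max_mismatches: int = 2):
    """Return the start position if the read maps to exactly one place, else None.

    Brute-force scan of the forward strand only; fine for toy sequences.
    """
    n = len(read)
    hits = [
        i
        for i in range(len(genome) - n + 1)
        if hamming(read, genome[i:i + n]) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def call_enriched(chip: dict, control: dict,
                  min_fold: float = 5.0, pseudocount: float = 1.0) -> list:
    """Return regions whose ChIP read count is >= min_fold times control.

    `chip` and `control` map a region label to a read count. The pseudocount
    is an assumption of this sketch, not something taken from the paper.
    """
    called = []
    for region, chip_n in chip.items():
        control_n = control.get(region, 0)
        fold = (chip_n + pseudocount) / (control_n + pseudocount)
        if fold >= min_fold:
            called.append(region)
    return sorted(called)


if __name__ == "__main__":
    genome = "AAAACCCCGGGG"  # made-up sequence
    print(unique_hit("ACCCAG", genome, max_mismatches=1))
    print(unique_hit("ACCCAG", genome, max_mismatches=2))
    print(call_enriched({"siteA": 49, "siteB": 10}, {"siteA": 4, "siteB": 9}))
```

Note the trade-off the toy example exposes: loosening the mismatch allowance rescues reads with sequencing errors, but it also makes more reads ambiguous, so the same read can go from uniquely placed to discarded.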

One nice bit of this study is that they chose a very well studied DNA-binding protein. Many developers of new techniques rush for the glory of untrodden paths, but going after something unknown strongly constrains your ability to actually benchmark the new technique. Because the protein they went after (NRSF, also known as REST) is well characterized, they could also compare their results to relatively well-validated computational methods. For 94% of their sites, the called peak from their results was within 50 nt of the computationally defined site. They also achieved an impressive 87% sensitivity (ability to detect true sites) and 98% specificity (ability to exclude false sites) when benchmarked against well-characterized true positives and known non-binding DNA sites. A particularly interesting claim is that this survey is probably comprehensive and has located all of the NRSF/REST sites in the genome, at least in the cell line studied. This is attributable to the spectacular sequencing depth of the new platforms.
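For anyone who wants the benchmark numbers spelled out, here is a small Python sketch of the three quantities quoted above. The function names and all of the counts and coordinates in the example are hypothetical; they are chosen only to show the arithmetic and are not the paper's actual data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of known true sites that were detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of known non-binding sites that were correctly not called."""
    return true_neg / (true_neg + false_pos)


def fraction_within(peaks: list, motifs: list, max_dist: int = 50) -> float:
    """Fraction of peak positions lying within max_dist nt of any motif position.

    Assumes all positions are coordinates on the same chromosome.
    """
    if not peaks:
        raise ValueError("need at least one peak")
    close = sum(
        1 for p in peaks if any(abs(p - m) <= max_dist for m in motifs)
    )
    return close / len(peaks)


if __name__ == "__main__":
    # Hypothetical counts, picked only to illustrate the formulas.
    print(sensitivity(true_pos=87, false_neg=13))
    print(specificity(true_neg=98, false_pos=2))
    # Hypothetical coordinates: two of the three peaks sit near a motif.
    print(fraction_within([100, 500, 1000], [120, 560, 1000]))
```

The point of writing it out is that sensitivity and specificity are computed against different reference sets (known binders vs. known non-binders), so a method can score well on one and poorly on the other.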

Of course, this is one study with one target and one antibody in one cell line. Good antibodies for ChIP experiments are a challenge -- finding good antibodies in general remains a challenge. Other targeted DNA-binding proteins might not behave so well. On the other hand, improvements in next generation sequencing technologies will enable more data to be collected. With paired-end reads from the fragments, perhaps a significant amount of the discarded 50% of the data could be salvaged as uniquely mappable. Or, just go to even greater depths. Presumably some clever computational algorithms will be developed to tease out sites which are hiding in the repetitive portions of the genome.

It is easy to imagine that in the next few years this approach will be used to map virtually all of the binding sites for a few dozen transcription factors of great interest. Ideally, this will happen in parallel in both human and other model systems. For example, it should be fascinating to compare the binding site repertoire of Drosophila p53 vs. human p53. Another fascinating study would be to take some transcription factors suggested to play a role in development and scan them in multiple mammalian genomes, yielding a picture of how transcription factor binding has changed with different body plans. Perhaps such a study would reveal the key transcription factor changes which separate our development from that of the non-human primates. The future is bound to produce interesting results.
