Thursday, June 28, 2007

Psst! Hot Stock Tip! This company is going to be average!

The last two days have been active on the NASDAQ for the old stomping grounds. Prior to yesterday's trading day a stock analyst upgraded the stock, and MLNM gained about 6% on the day with trading volume significantly (but less than 2X) above average. Today, the company announced some positive results in front-line multiple myeloma treatment, and the stock again turned over 5M+ shares but just nudged up a bit.

What is more than a little funny about yesterday is what the analyst actually said: instead of 'underperforming' the market, he expected Millennium to "Mkt Perform" -- that's right, that it would be exactly middling, spectacularly average, impressively ordinary. Indeed, he put a target on the stock -- $10, or a bit less than what it was selling for that day. For that he was credited with sparking the spike.

What's even more striking is that the day before another investment house downgraded Millennium from 'Overweight' to 'Equal weight'. Each company picks its own jargon, but this is really agreeing -- they both predict Millennium to do as well as the market. Oy!

A far more likely cause of the spike was leakage of the impending good myeloma news. I've never looked systematically, but good news in biotech seems to be preceded by trading spikes as much as it is followed by them. Periodically someone is nailed for it (and not just domestic design goddesses), but there is probably a lot of leakage that can never be pursued.

I'm sure there are a lot of smart people earning money as stock analysts who carefully consider all the facts and give a well-reasoned opinion free of bias, but they ain't easy to find. For a while I listened to the webcasts of Millennium conference calls, but after a while I realized that (a) no new information came out and (b) some of the questions were too dumb to listen to. Analysts would frequently ask questions whose answer restated what had just been presented, or would ask loaded questions which were completely at odds with the prior presentation. How the senior management answered some of those with a straight face is a testament to their discipline; I would have been lucky to get by with a slight grimace. Some analysts were clearly chummy with company X, and others with company Y, and little could change their minds.

If you look at the whole thing scientifically, the answer is pretty clear: listening to stock analysts is a terrible way to invest. If you want average returns, invest in index funds. If you want to soundly beat the averages, start looking for leprechauns -- their pots of gold are far more plentiful than functional stock picking schemes. Buy a copy of 'A Random Walk Down Wall Street' and sleep easy at night. Yes, there are a few pickers who have done well, but they are so rare they are household names. Plus, there are other challenges: Warren Buffett has an impressive track record, but if he continues it until my retirement his financial longevity will not be the point of amazement.

Disclosure: somewhere in the bank lock box I have a few shares of Millennium left -- I think totaling to about the same as the blue book value on my 11-year-old car (though perhaps closer to the eBay value of my used iPod). The fact they are in a bank protected them from the grand post-layoff clean-out.

Wednesday, June 27, 2007

You say tomasil, I say bamatoe ...

There are some food combinations which recur frequently in the culinary arts. The pairing of basil and tomato is not only a dominant part of many Italian dishes, but is a great way to add zing to a BLT (or, if you're vegetarian or keep kosher, to have a B to go with the LT). Conventionally this is done by separately growing basil and tomato plants, harvesting the leaves and fruits respectively, and combining them in the kitchen.

An Israeli group has published a shortcut to the process at Nature Biotechnology's advance publication site. By transferring a single enzyme from lemon basil to tomato, the authors report significantly altering the aroma and flavor of the transgenic tomatoes.

If you aren't a gardener, you probably haven't run into lemon basil. There are a whole host of basil varieties with different aromas and flavors, with some strongly suggesting other spices such as cinnamon. Basil is a member of the mint family, many of which show interesting scents. Look down your spice rack: many of the spices which are not from the tropics are mints: oregano, thyme, marjoram, savory, sage, wild bergamot, etc. Many of these come in multiple scents: in addition to peppermint and spearmint, there is lemon mint. Thymes come in a variety of scents, including lemon. If you have an herb garden, gently check the stems of your plants -- if they are square, it is probably a member of the mint family.

Of particular interest are the pleiotropic effects of the transgene. The inserted gene, geraniol synthase under the control of a ripening-specific promoter, catalyzes the formation of geraniol, an aromatic alcohol originally extracted from geraniums. Geraniol itself apparently has a rose-like aroma, but a number of other compounds derivable from geraniol were also increased, such as various aldehydes and esters with other aromas such as lemon-like. This reflects the fact that tomatoes possess many enzymes capable of acting on geraniol. Conversely, the geraniol was synthesized from precursors that feed into the synthesis of the red pigment lycopene and a related compound, phytoene, and both of these compounds were markedly lower in the transgenic plants. The tomatoes appear to still be quite red, and well within the wide range of crimsonosity found in tomato varieties. This should come as no surprise to many gardeners: catalogs always warn that trying to grow spearmint or peppermint from seed is not guaranteed to get the right scent. Presumably there are many polymorphisms in monoterpene processing enzymes in the mint genome, and depending on which you assort together you get a different potion of fragrant compounds.

Volunteers sniff-tested and taste-tested the fruit (well, for 'taste' they got some squirted in the back of the nose -- 'retro-nasal'). Testers generally preferred the smell and 'taste' of the transgenics. Most marketed transgenic plants affect properties key to growers but not consumers; you can't really tell if you have transgenic corn flakes or soy milk without PCR or an immunoassay (or similar). But with this transgenic plant, the nose knows.

Of course, the next line in the alluded-to song is "Let's call the whole thing off". There are many who oppose this sort of tinkering with agricultural plants for a variety of reasons. Myself, I'd leap at a chance to try one. I love tomatoes, provided they are fresh from the garden, and having one more variety to try would be fun!

Note: you need a Nature Biotechnology subscription to access the article. However, Nature is pretty liberal about giving out complimentary subscriptions (I once accidentally acquired two), so keep your eye out for an offer.

Roche munches again

Roche is on quite a little acquisition spree in the diagnostics business: first went 454 with its first-to-market sequencing-by-synthesis technology, earlier this month it was DNA microarray manufacturer NimbleGen in another friendly action, and now Roche has launched a hostile bid for immunodiagnostics company Ventana.

Three companies, three technologies with proven or developing relevance to diagnostics. What else might be on the radar? One possibility would be protein microarrays, though there are few players in the functional array space (useful for scanning patient responses) -- but perhaps an antibody capture array company? Not yet a proven technology, but one to watch.

All of these buys have a strong personalized medicine / genomics-driven medicine angle. Ventana makes an assay for HER2 to complement Genentech/Roche's Herceptin (Roche owns a big chunk of Genentech & I think is the ex-US distributor); 454 and Nimblegen are solidly in the genomics arena. Roche already has Affy-based chips out for drug metabolizing enzyme polymorphisms.

Friday, June 22, 2007

secneuqes AND sdrawkcaB

One of my early graduate rotation projects (the period when you are scoping out an advisor -- and the advisor is scoping you out!) in the Church lab was to develop a set of scripts to take a bacterial DNA sequence, extract all of the possible Open Reading Frames longer than a threshold & BLAST those against the protein database. Things went great with a sizable sequence from Genbank, so I asked for a large sequence generated in-house. The results were curious: no particularly long ORFs, and none of them matched anything.
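
For flavor, here is a minimal sketch of that kind of ORF scan (in Python rather than whatever I cobbled together at the time; the start/stop handling is deliberately simple, and BLASTing the survivors is left out):

```python
def revcomp(seq):
    """Reverse complement, since a real scan must check both strands."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def long_orfs(seq, min_codons=100):
    """Return stop-to-stop open stretches of at least min_codons codons,
    scanning all six reading frames (uppercase sequence assumed; trailing
    open stretches at the sequence end are ignored for brevity)."""
    stops = {"TAA", "TAG", "TGA"}
    found = []
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            start = frame
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i+3] in stops:
                    if (i - start) // 3 >= min_codons:
                        found.append(strand[start:i])
                    start = i + 3
    return found
```

On a typical bacterial stretch a scan like this turns up plenty of candidates; on the in-house file it found essentially nothing.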

Puzzled, I reported this to George & it took him a moment to think of the answer: the sequence was backwards.

We write DNA sequences in a particular order for a reason, because that is the order (5'->3') in which Nature makes DNA. The underlying chemistry is such that this is one of the few inviolate rules of biology: thou shalt not polymerize nucleotides in a 3'->5' direction. The technology which has dominated in recent times, Sanger dideoxy sequencing, relies on DNA polymerization and so can also read sequence only in a 5'->3' direction. Most of the 'next generation' technologies which are coming available, such as 454 and Illumina/Solexa, also rely on polymerase extension and have an imposed direction.

But Sanger sequencing once had a serious rival: chemical sequencing. The Maxam-Gilbert approach relies on chemical cleavage of end-labeled DNA -- and depending on which end you label you can read either strand of a DNA fragment in either direction. George's genomic sequencing and multiplex sequencing also used chemical cleavage, and it turned out that the version of multiplex sequencing then being used probed the DNA in such a way that the reads came out 3'->5', and I had gotten the unreversed file.

I'm not the only person to fall into that trap. There was a burst of excitement over at the Harvard Mycoplasma sequencing project that a long true palindrome had been seen. In molecular biology the term palindrome is bent a bit to mean a sequence that reads the same forwards on one strand and back again on the other strand, but here was a sequence that actually read the same backwards and forwards on the same strand. Such a beastie hadn't been observed (I wonder if one has yet?), and would be a bit of a puzzle. A bit later: "Never mind". Someone had assembled a reversed and unreversed sequence, which were in reality the same thing.
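
To make the two senses of 'palindrome' concrete, here is a toy check (my own illustration, nothing from the Mycoplasma project):

```python
def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def is_true_palindrome(seq):
    # Reads the same backwards and forwards on the SAME strand -- the
    # 'beastie' that caused the brief excitement.
    return seq == seq[::-1]

def is_biological_palindrome(seq):
    # The bent molecular-biology usage: the sequence equals its reverse
    # complement, so it reads the same 5'->3' on either strand.
    return seq == revcomp(seq)

# 'GAATTC' (the EcoRI site) is a palindrome in the molecular biology sense
# but not literally; 'ACGTGCA' is the reverse case.
```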

Some such mistakes got farther, much farther. In sequencing the E.coli genome the U.Wisconsin team would compare their results back to all E.coli sequences in Genbank. They came across one that didn't at all fit, at least not until they tried the reverse sequence, which fit perfectly.

One member of the near-term crop of next generation technologies is a bit different on this score. The sequencing-by-ligation approach from the Church lab, being commercialized by ABI, works with double-stranded DNA, and so you can read either way from a known region. But this isn't exactly reading in either direction, since it is double-stranded DNA.

However, some of the distant concepts for DNA sequencing might really throw out the limitation, which has some interesting informatics implications. Many approaches such as nanopores or microscopic reading of DNA sequence do not use polymerases, except maybe to label the DNA. So these methods might be able to read single-stranded DNA in either direction -- and you might not even know which direction you are reading! For de-novo sequencing, this could make life interesting -- though if the read lengths are long enough, it will be much like my surprise in the Church lab -- if you don't find anything biological, try reading backwards.

Thursday, June 21, 2007

That's my boy!

I was out this evening introducing my legacy to the fine points of pea picking. He looked at the hanging pods and asked: "How can you tell what color they are?" While I was trying to figure out what he meant, the answer was supplied: "The color of the pea. Yellow is over green". Score 1 for PBS!

Wednesday, June 20, 2007

Extra! Extra! Mendel was Right!

I couldn't help but be amused by the headline in today's Boston Glob: "Breast cancer genes can come from father". Wow! That pesky Austrian monk was on to something with his crazy ideas! The paper upgraded itself back to the Globe with a decently written story describing a new JAMA article which looked for BRCA mutations in patients with very few female relatives. In a nutshell, the BRCA- phenotype (early predisposition to breast cancer) was hidden in these families due to family structure.

The consequences of this are certainly something to take very seriously: some doctors are not thinking carefully about the paternal side of a woman's family tree when scoping out a rationale for BRCA testing, and insurance companies apparently have been over-emphasizing the maternal side of the tree as well, and in some cases a woman may simply have no (or no known) close female relatives. Clearly the medical world has a Sherpa shortage.

Within a decade or so complete genome sequencing or comprehensive mutational scans will be pretty routine. That won't discount the need for taking a good family history, especially since our ability to interpret those scans may lag the technology for obtaining them.

Tuesday, June 19, 2007

Imaging gene expression

In one of my first posts I commented on the challenge of obtaining samples for microarray and other biomarker work. Getting samples for microarrays is at best difficult, painful to the patient and only a little dangerous to them; in many cases the samples are simply unobtainable. Getting a broad range of samples from multiple sites, or a time series, is very rarely going to be feasible.

With this backdrop, a recent paper in Nature Biotechnology is quite stunning. Indeed, it is a bit of a surprise that it didn't show up in the mother ship or Science: the paper is well written, audacious in design and shows very nice results.

Using actual liver cancer patients the paper correlates contrast-enhanced CT (aka CAT) imaging features to gene expression patterns detected by microarrays using samples from the same patients. While these patients had to go through biopsies, the approach holds out the hope of calibrating imaging assays for future use.

The imaging-microarray connections have many intriguing possibilities. Some of the linked microarray patterns have clear therapeutic associations, such as cell cycle genes and VEGF. Such an imaging approach might, with much further validation, enable appropriate selection of therapeutic agents -- such as Avastin to target VEGF.

The paper also notes the challenges that lie ahead. The choice of liver cancer was no accident: liver tumors tend to be large and well-vascularized, making them straightforward to image using CT. Some of the imaging features found are generic to tumors, but others have some degree of liver specificity. Expression program to image feature mappings may vary from tumor to tumor.

One potential side-effect of this study would be to increase biopharma interest in liver cancer. Liver cancer is a scourge outside of the Western world (perhaps driven by food-borne toxins) but is not among the deadliest cancers in the U.S. According to some 2002 figures from the American Cancer Society, liver cancer in the U.S. accounts for about 17K new cases and about 15K fatalities -- a horrible toll, but far less than the 160K annual lung cancer deaths. One big attraction for companies is potential payoff, but another is the potential for accelerated development decisions. Being able to subset patients by matching drug mechanism to biology inferred from imaging is potentially a powerful means to do that.

Thursday, June 14, 2007

Evolution's Spurs

As I've commented previously, it is pleasing when a new biological finding can be related to something both familiar and pleasing. In this case, Nature, with perfect timing, carries a paper about one of my favorite garden plants.

I like to garden but my attention to it is somewhat erratic. As a result, for the ornamentals I have a strong bias towards plants which are perennial or nearly so; in theory you plant them once and enjoy them for many years afterwards. There are a few catches, however. First, the hardiness guides in plant catalogs are only rough guides, and the local microenvironment determines whether a plant will actually thrive. As a result, I sometimes end up with very expensive annuals (perennials tend to sell for 2-10X the price of an annual). At a previous residence I couldn't get one wet soil-loving plant (Lobelia cardinalis) to overwinter until I put one directly under the downspout, though about 40 miles to the south I've seen it run rampant on the sides of cranberry bog irrigation ditches. Another species (Gaura lindheimeri) refuses to overwinter for me, but a gardener (and former MLNM employee) a few towns over has a magnificent specimen.

A good perennial garden also requires a certain attention to detail. In particular, many perennials have very restricted bloom times, since they must invest energy in surviving the winter and reappearing in the spring. Many annuals bloom all summer; annuals are grasshoppers, perennials ants. So to get color throughout the garden season, a mixture of plants is needed. Certain times of the mid-summer and early fall are awash in choices, but right after the bulbs fade in early summer can be challenging.

Columbines, species of the genus Aquilegia, are wonderful perennials. While the individual plants are short-lived, plants in a favorable environment will reseed vigorously. They are in bloom now, which is why the new paper's timing is so good. Many of the flowers are bicolor. The foliage is generally a neat mound, often with a bluish tint, making them attractive even when not in bloom. The seed heads are distinctive & visually interesting. Examples of two of mine, one established & one newly planted, show some of these features.

The signature feature of Aquilegia is the spurs. Depending on the species and variety, these can be nearly non-existent to quite large. I'll confess I had never pondered the biology behind the variation, but that is now remedied.

Figure 2 of the paper shows a remarkable correlation between who pollinates a columbine and the length of its spurs. Three major pollinators were explored: bumble bees, hummingbirds & hawk moths.

The key focus of the paper is distinguishing between two evolutionary models. Both Darwin & Wallace had proposed that such long spurs could evolve through a co-evolutionary race between pollinator and flower. Longer spurs mean the pollinator must approach the flower more closely to reach the nectar, increasing pollen transfer. Longer tongues on the pollinators will be favored, as they can reach down longer tubes. This model would tend to suggest gradual, consistent changes in spur length.

A competing hypothesis is that spur length changes abruptly when the pollinator shifts. After long periods of stasis, the introduction of a new pollinator drives a short-term co-evolutionary race.

The paper contains a lot of nice data, which I'm still chewing on, favoring the latter model. The authors used a large set of polymorphisms in the genome to generate a phylogeny. Regression analysis using this phylogeny showed that inferred pollinator shifts are correlated with large changes in spur length. Interestingly, their phylogeny suggests that there have been only two major shifts in pollinator strategy, each time going from a short-tongued to a long-tongued pollinator (bumble bee -> hummingbird and hummingbird -> hawk moth). One interesting supporting piece of evidence: in Eurasia there are no hummingbirds, and there are also no Aquilegia known to be pollinated by hawk moths.

By a nice coincidence I was looking something else up & discovered that the Joint Genome Institute has commenced the sequencing of an Aquilegia species. Aquilegia's family is the first branching of the dicots, and therefore represents an important window into family-level evolution of plants.


One additional consideration in garden planning is what creatures your choices will attract: some are desirable, others not. I won't plant major hosts for Japanese beetles unless I really like them (raspberries escape the edict). On the other hand, bumble bees and hummingbirds are on my desired list (though I've never been able to attract a hummingbird) whereas I'm neutral on hawk moths. So, perhaps I should go to the garden center with a caliper in hand, and only get a few of the super-long spurs just to wonder at them.

Wednesday, June 13, 2007

This Day in History

This week marks the 18th anniversary of my first attempt to sequence DNA. It did not occur to me then that my destiny would be to interpret DNA, not do the actual data acquisition.

My sophomore year in the Land of Blue Hens had been very good, and I had started working with a professor there on some undergraduate research. During the year I learned how to miniprep DNA, run restriction digests and then photograph them on a transilluminator.

I also had engaged in a grand literature search to identify all of the known DNA sequences for our organism, the unicellular alga Chlamydomonas reinhardtii. Back in those days one could not rely on Genbank for completeness -- despite there being about 10 or so C.reinhardtii sequences published, only 2 (the two subunits of RuBisCo, a key photosynthetic enzyme) were deposited. So I would borrow another professor's computer in her office (in those days, computers that moved were never sighted on campus, though luggables such as the original Compaq existed). This required a bit of coordination, as her office was not spacious enough to easily seat two, so I needed to get her to unlock the office and let me in when she didn't need it herself. Out of frustration with this arrangement was born an invention: I threw together a clone of the key functionality of the entry software: you could key in a sequence once & then switch to verification mode and key it in again, with the computer complaining audibly if there was a mismatch. At the time, it never occurred to me that this was a major milestone in my career.

A second application soon followed -- I had heard that Chlamydomonas had strongly biased codon usage (it also was very G+C rich), and built my own graphical codon bias indicator. I was a bit disappointed to learn that this was well trod ground in the literature; somewhere I had gotten the delusion that the whole idea was novel.
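
The counting at the heart of such an indicator is nothing fancy; here is a sketch (a reconstruction of the idea in Python, not the original code, and without the graphics):

```python
from collections import Counter

def codon_usage(cds):
    """Tally in-frame codons in a coding sequence and return each codon's
    fraction of the total -- a strongly biased, G+C-rich genome like
    Chlamydomonas gives a very lopsided table."""
    usable = len(cds) - len(cds) % 3          # trim any partial trailing codon
    counts = Counter(cds[i:i+3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```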

This was all preparation for the summer. I had been accepted into the Science & Engineering Scholars program, which would allow me to spend most of the summer at school on a small stipend, in dormitory space filled with other S&E scholars as well as those in a parallel program in the humanities. I would learn to sequence DNA!

I would not be the only one learning. My adviser was a biochemist attempting to refit as a molecular geneticist, and he would be learning alongside me. But we would not be alone. The professor whose computer I borrowed was experienced in the art, and we would be borrowing her equipment & lab space. We also had a baguette of a professor (French, crusty on the outside, soft on the inside) who had written papers in the field and had also been entrusted by nearby DuPont to vet their automated DNA sequencer (it was pronounced a failure).

On Monday the 12th I showed up eager for action. My advisor laid out the gameplan: we would run the reactions today, and run the sequencing gels on Tuesday. He neatly laid everything out and then got the piece de resistance out of the freezer -- the big blue egg of 35S-labeled ATP. Within the egg's confines was the actual vial holding the radioactive compound, and the egg was nestled in its own ice bucket. He walked me through the reactions, we ran them, and then I focused on cleaning up, scanning the bench for any loose materials.

That evening, I had one late-night activity planned. At the very end of the day I waited outside the local victualer, and when my watch marked midnight strode in & celebrated my newfound legal ability to order from the entire menu. Then it was back to my dorm: tomorrow was another workday.

I arrived the next day again eager for action. My advisor asked me if I had cleaned up diligently, and I nodded an affirmative. He then gently led me to an ice bucket and lifted the lid -- there, floating serenely in the melt water was the bright blue egg of 35S-ATP. I then got a good lesson in checking for radioactive contamination, but there was none. I felt guilty for wasting the 35S-ATP -- our lab ran on a shoestring -- but was determined to forge ahead.

The big duty that day was to pour and run the sequencing gel. Yes, in those days there was no capillary sequencing, but rather thin slab gels. Running length meant then, as it still does now, resolution -- so we would use the 1 meter long plates.

Now a key consideration of those plates is that they need to be scrupulously clean. Any speck of dirt or fingerprint would lead to a bubble in the poured gel, destroying at least one lane and probably distorting many of the rest. Using a simple detergent, I was to clean the plates & get them ready for pouring. With help from the experienced professor we would pour the gel & then run our samples. With luck, by end-of-day we could dry down the gel and put it on film for exposure overnight.

The instructions seemed straightforward, but it soon became clear that I had been left alone with a bit of a logistical challenge. I had a standard deep lab sink in which to wash those vitreous monsters. I decided that I could wipe them down with detergent on the lab bench and then balance them over the sink for the rinse cycle.

What ensued was straight from a Road Runner cartoon, with myself as Wile E. Coyote. In mid-rinse the balance was tipped, and the far end of the plate dropped into the sink. Upon hitting bottom, the glass shattered. This altered the balance again so that the near end tipped down, delivering the freshly fractured edges out of the sink and into one of my fingers.

After the initial shock wore off, I realized that I had been seriously gashed but it was no emergency. I had a bit of first aid training, and so I compressed the wound until the bleeding slowed and then hastily scribbled a note saying I would be going to the infirmary. I then walked the mile or so to the south end of campus & had the wound attended to. Luckily, it did not need stitches (I have a potent needle phobia) but rather just an adhesive closure.

It hadn't occurred to me in the heat of the moment what a scene I had left behind. The female professor came back to check on me and apparently had quite a start: undergraduate gone, shattered glass in sink, bloodstained paper towels in the trash & a note with blood drops on it mentioning a trip to the infirmary.

The rest of that summer would not be eventful. We never did succeed in getting our plasmid to work, though I did sequence the control stretch of M13 repeatedly.

My lab career at Delaware would include another trip to the infirmary (needle stick; luckily before I injected the mouse) and another loud accident (from a swinging bucket centrifuge; carefully, but incorrectly, balanced). My undergraduate advisor would ultimately suggest that my graduate work might better focus on the computational interest I had demonstrated, not the lab manipulations I struggled with. Many years later a different group would sequence our gene, acetolactate synthase. Now just about every gene of Chlamydomonas can be found in the online databases, as a 3rd release of the draft genome is available; I would never have guessed at the time that this would come true so soon.

Monday, June 11, 2007

DNA Under Pressure

Over at Eye on DNA an MD's op-ed column on genetic association studies is getting a good roasting. The naivete about the interplay of genes and the environment in the original article (which would seem to argue that environment completely trumps genes) reminded me of an idle speculation I've engaged in recently. Now, in this space I frequently engage in speculation, but this involves a celebrity of sorts, which I don't plan to include often. But I think this speculation is interesting enough to share/expose.

The Old Towne Team has been tearing up the American League East this year, much to the glee of the rabid Red Sox fan who lives down the hall from me. A key ingredient to their success is a very strong starting pitching rotation, and leading that rotation in several stats is Josh Beckett, particularly his American League leading record of 9 wins and 0 losses.

Now, as an aside, I have very little respect for baseball statistics. The rules for most seem arbitrary and the number of statistics endless. I have a general suspicion that baseball statisticians believe that some asylum holds a standard deviant, and that Bayes rule has something to do with billiards. But with Beckett on the mound, good things tend to happen.

Beckett was the prize acquisition last year, but at times he looked like a poor buy. The Sox gave up two prospects for him, one of whom hurled a no-hitter and the other of whom ended up as NL Rookie of the Year. Last year Beckett had mixed results, but this year he is on fire.

What hath this to do with genetics? Well, for a short time we lost his services due to an avulsion on one of his pitching fingers, which is the technical term for a deep tear in the skin. Beckett has a long history of severe blisters which knock him out of action periodically.

Genes, or the environment? It could be that he just grips the ball in such a way that anyone would lose their skin. But it could also be that he has polymorphisms in some connective tissue genes which make him just a bit more susceptible to this sort of injury. There are many connective tissue disorders known, with perhaps the best known in connection to sports being Marfan's syndrome. Marfan's leads to a tall and lanky physique, ideal for sports such as basketball and volleyball -- and it was Marfan's that killed Olympic star Flo Hyman. Beckett isn't covered with blisters (at least, no such news has reached the press), but what if it is only under the intense pressure of delivering a fastball that the skin gives way? If this were true, the phenotype would be most certainly due to the genotype -- but only in the context of a very specific environmental factor.

Of course, this is a miserable hypothesis to try to test. Perhaps you would scour amateur and professional baseball for pitchers with similar problems and do a case-control study with other pitchers who don't develop blisters. Or, you would need to collect DNA from his relatives, and also teach them all to pitch just like Josh Beckett in order to see if they too develop blisters and avulsions. It could be a first: a genetics study whose consent form includes permission to be entered into the Major League Baseball Draft!

Sunday, June 10, 2007

Temptation Beckons!

For the last decade plus my main programming language has been Perl. It has served me often and served me well, but during that time I submerged an important thought.

I really don't like Perl!

I started with Perl 4, and was slowly seduced. The huge key was the facile text manipulation, particularly the regular expressions. I had previously been using C++, and doing any text processing in it at that time was a bear; in Perl it was so trivial. Plus, Perl had no arbitrary variable size restrictions (and was garbage collected), versus the weird problems which most versions of sed & awk ran into with large text strings -- such as big DNA sequences. Perl 4 also had these wonderful hash tables, another feature rarely found in languages at that time, which could be so useful. It lacked any sort of complex data structures, but I was just using Perl as my specialized text processing language, so I just didn't do anything requiring fancy data structures. The text processing was what I really needed, and with a little bit of learning how to ping internet servers from Perl I was hooked.

I even made a little spare change, developing the original web interface to Flybase, the grand Drosophila database. Flybase's computer guru was a big fan of the Gopher+ system and skeptical of World Wide Web, and so Flybase had a strong Gopher+ interface and no Web access. I was starting to play around with hyperlinking the heck out of BLAST reports and database entries, and was frustrated that my browser couldn't be used to talk to Flybase -- early web browsers did Gopher but not Gopher+. At some point the lightbulb went off: what if I wrote a system that translated between the two protocols! My 'relay server' concept was born (which, of course, others had thought of first and named 'gateways'). A bunch of toying around & poking through documentation, and the thing worked! Well, mostly -- periodically something would change at Flybase to expose my incomplete understanding of Gopher+, and the system would break. But, I was the only customer & it wasn't bad to maintain. The curation center for Flybase was in the Biolabs at Harvard, and one day I saw one of my friends from there in the hall and the ego switch was thrown: "Hey, look what I've done!". I showed it off, just to show off. But it turned out, the Flybase advisory board was also frustrated at a lack of WWW access & wanted one NOW! My good luck! For a modest fee (but princely by graduate student terms) I would maintain the gateway as a public access point, until a permanent system was built. I enjoyed the extra spending money -- and enjoyed seeing the permanent system relieve me of maintenance duties.

Java came out & I took a look and there was lots to like. Strongly typed; I like that. Lots of Internet-friendly stuff; nice also. So I played around and then built a useful tool, one of the first graphical genome browsers delivered over the Internet (as an applet). That experience helped raise my awareness of Java's deficiencies, at least from my standpoint. The graphics model was horribly primitive. The security model meant I couldn't print the diagrams -- my users had to do screen capture to get hard copy. But quite importantly, they hadn't included any regular expression support! In an Internet-targeted language! Perl was still my go-to language (though never with a goto) for any text mangling -- and most of what I did was text mangling.

I got to Millennium and found a thriving Perl community. Perl 5 had come out while I was slogging through my thesis writing, so I hadn't learned it yet. The other choice of languages at MLNM at the time was Smalltalk, which I meant to learn but never quite did. I did learn Perl 5 -- and now you could do everything! All sorts of advanced programming concepts: object-orientation, references, complex data structures. Yippee!

Except, it was the dog's breakfast. Almost perfect backwards compatibility had been maintained, at the cost of importing lots of the weaknesses of Perl 4. The object-orientation was a particularly weird veneer, with lots of traps for the unwary. But it worked, and I had lots of people around me using the same language. We could (and did) share quite a bit of code, and there were dedicated folks willing to solve Perl conundrums.

Now, I'm in a different boat. I am the lone Perl programmer in an environment that again is split between two languages. Both of those languages are strongly typed & carefully designed, though not perfect (what is?). While to date I have mostly been a data analyst, and so my code could live in its own world, increasingly I wish to fold those analyses into better code. Much of the back end is in Python, and most of the code that delivers results to the users is in C#. Trying to learn both at once seems insane, but that is the road I am now headed down.

I've also been reminded what I don't like about Perl. One hates to complain about hard work that others have given away to the world, but many Perl gurus write utterly unreadable code! When I was trying (ultimately successfully) to get the Perl SOAP modules working, so that my Perl could speak a common language (Web services) with our C# and Python, I had some difficulty and was trying to figure out how to change one $*(&(*& character in the generated XML. Digital hieroglyphics!

I had also tried to hone my skills a bit by reading a book of 'Perl Hacks'. The stuff is very clever -- but then I realized it is too clever by half! Perl is a language whose culture revels in doing things any-which-way. Yes, I enjoy some of the silliness -- the Bleach module that executes code written all in whitespace (which Bleach can generate from normal code), the write-perl-in-Latin module. But the best 'hacks' were on par with this, turning the language into something entirely different. Some may like that, but it doesn't go with my grain.

I've written two useful pieces of Python so far, and I generally like the language. I'll probably do a longer Python-centered post; it has the quick-development flavor of Perl but with a much cleaner design. It's awkward though; I'm still diving into the books for just about everything & I'm sure any seasoned Python programmer would say 'you are writing Perl code in Python!'. I've written one toy application in C#. But, of course, the problem is that I have built a good code base in Perl (and of course my Perl is readable!), so if I need to solve a problem quickly I fall back on what I know. Breaking up will certainly be difficult, long & painful.

Saturday, June 09, 2007

When Imagination Trumps Science

I recently finished an interesting book that was a pure impulse item at the local library -- those scheming librarians put books on display all over to snag the likes of me! Imaginary Weapons is the saga of various Department of Defense funded efforts to develop a new class of weapons based on some exotic physics, efforts that are characterized by the steady flow of funding to a scientist of dubious quality to work on a phenomenon that is unrepeatable. My tax dollars at work!

The book is flawed in many ways, and some squishy details at the beginning set me on edge. There is also a lack of a good description of the exact topic being discussed (nuclear isomers of hafnium), and the author all too often uses 'hafnium' as a shorthand for 'hafnium isomer', even when she is discussing the ordinary, stable form of hafnium nearby. There is also an excess focus on the strange setup of the key experimenter, who uses salvaged dental X-ray equipment for the crucial test. This is probably not the right gear, but the question why is never explored.

The key figure running the 'experiments' (to use the word charitably) is constantly updating what the doubters should have found to reproduce his experiments. "I know signature X was in the paper, but I now know you should look for Y". Negative controls -- forget about it; they were flatly refused.

The truly sad part was the enablers at DARPA, the Defense Advanced Research Projects Agency. DARPA is supposed to fund longshot stuff, and so it could be argued this work was appropriate initially. But to keep sinking money into a clear incompetent, that is the travesty.

The author actually interviewed most of the participants in the fiasco on both sides, but she really missed the golden opportunity. When asked why this research kept being funded, despite criticism from anyone with standing in the physics community, the answer was always that the applications were so promising and it was DARPA's job to fund high-risk, high-reward science. The question that apparently went unasked is 'why this topic?' Why pour so much money into hafnium isomers, rather than zero point energy or antimatter or antigravity? Once you've decided to ignore the recognized experts in a field, where do you go from there? Of course, one can hope this works as 'push polling' to reconsider the meaning of science, but more than likely the next budget request would include funds for the research arm of the Jedi Knights.

Supporting important science that isn't initially respected is a challenge. Biology has plenty of examples of scientists who fought orthodoxy and ultimately were proven correct: Mendel (genetics), Rous (oncogenic viruses), Prusiner (prions), Langer (drug release systems), Folkman (angiogenic factors), Marshall (H.pylori & ulcers), Brown (microarrays) & Venter (whole genome shotgun) -- and that is just a tiny list. But it is also important to balance that against the stuff that was dodgy then and is still dodgy now, such as Moewus and Kammerer and a host of others. Even if what you claimed to do is eventually done, that doesn't mean you were right -- the claim of cloning a mouse in the 70's has nothing to do with the reality of cloning a mouse in our time. What separates the good fringe science from the crankery is an attention to the criticism, not ignorance of it. I've heard both Langer & Folkman speak, and they clearly kept addressing their critics' concerns in their papers. These pioneers also weren't just right; they had done their science well. Mendel found the right laws & his data was generally good; in contrast the uniparental mouse of the 70's is still a fraud despite mammalian cloning ultimately playing out.

Bad work in the guise of science, either outright fraud or self-deception (what Feynman termed 'cargo cult science'), will probably be with us forever. Great travesties have been perpetrated claiming to be scientific (e.g. the Tuskegee syphilis horror). This year's big investigation is bubble fusion; last year's was cloning & next year it will be something else. Reading about science gone wrong isn't much fun (well, the N-ray expose is fun to contemplate!), but it is necessary.

Thursday, June 07, 2007

Illumina-ting DNA-protein interactions

The new Science (sorry, you'll need a subscription beyond the abstracts) has a bunch of genomics papers, but the one closest to my heart is a paper from Stanford and Cal Tech using the Illumina (ex-Solexa) sequencing platform to perform human genome-wide mapping of the binding sites for a particular DNA-binding protein.

One particularly interesting angle on this paper is actually witnessing the beginning of the end of another technique, ChIP Chip. Virtually all of the work in this field relies on using antibodies against a DNA-binding protein which has been chemically cross-linked to nearby DNA in a reversible way. This process, chromatin immunoprecipitation or ChIP, was married with DNA chips containing potential regulatory regions to create ChIP on Chip, or ChIP Chip.

It is a powerful technique, but with a few limitations. First, you can only see binding to what you put on a chip, and it isn't practical to put more than a sampling of the genome on a chip. So, if you fail to put the right pieces down, you might miss some interesting stuff. This interacts in a bad way with a second consideration: how big to shear the DNA to. A key step I left out in the ChIP description above is the mechanical shearing of the DNA into small fragments. Only those fragments bound to your protein of interest should be precipitated by the antibody. The smaller your sheared fragment size, the better your resolution -- but also the greater risk that you will successfully precipitate DNA that doesn't bind to any of your probes.

A stepping stone away from ChIP Chip is to clone the fragments and sequence them, and several papers have done this (e.g. this one). The new paper ditches cloning entirely and simply sequences the precipitated DNA using the Illumina system.

With sequencing, your ability to map sites will now be determined by the ability to uniquely identify sequence fragments and again the size distribution of your shattered DNA. Illumina has short read lengths, but the handicap imposed by this is often greatly overestimated. Computational analyses have shown that many short reads are still unique in the genome, and assemblers capable of dealing with whole-genome shotgun of complex genomes with short reads are starting to show up. One paper I stumbled on while finding references for this post includes Pavel Pevzner as an author, and I always find myself much wiser after reading a Pevzner paper (his paper on the Eulerian path method is exquisitely written).
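
A crude way to convince yourself of the 'short reads can still be unique' point (a sketch only; a real analysis would also fold in the reverse complement and allow for mismatches):

```python
from collections import Counter

def unique_kmer_fraction(genome, k=25):
    """Fraction of distinct k-mers that occur exactly once in the sequence --
    a rough stand-in for the fraction of k-length reads that map uniquely."""
    counts = Counter(genome[i:i+k] for i in range(len(genome) - k + 1))
    return sum(1 for n in counts.values() if n == 1) / len(counts)
```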

In this paper, read lengths of 25 nt were achieved, and about half of those reads were uniquely mappable to the genome, allowing for up to 2 mismatches vs. the reference sequence. Tossing 50% of your data is frustrating, but with 2-5 million reads in the experiment, you can tolerate some loss. These uniquely mapped sequences were then aligned to each other to identify sites marked by multiple reads. 5X enrichment of a site vs. a control run was required to call a positive.
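
Stripped of all the real-world care, the site-calling logic is simple; here is a sketch of the general idea (not the authors' actual pipeline), with the window size and the 5X cutoff as the only knobs:

```python
def call_sites(sample, control, window=100, min_fold=5.0, pseudo=1.0):
    """Toy caller: given per-position counts of uniquely mapped read starts
    for the ChIP sample and the control run, report windows where the sample
    is enriched at least min_fold over the control (the pseudo-count avoids
    dividing by zero in empty control windows)."""
    sites = []
    for start in range(0, len(sample) - window + 1, window):
        s = sum(sample[start:start + window]) + pseudo
        c = sum(control[start:start + window]) + pseudo
        if s / c >= min_fold:
            sites.append((start, start + window, s / c))
    return sites
```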

One nice bit of this study is that they chose a very well studied DNA-binding protein. Many developers of new techniques rush for the glory of untrodden paths, but going after something unknown strongly constrains your ability to actually benchmark the new technique. Because the factor they went after (NRSF) is well characterized, they could also compare their results to relatively well-validated computational methods. For 94% of their sites, the called peak from their results was within 50nt of the computationally defined site. They also achieved an impressive 87% sensitivity (ability to detect true sites) and 98% specificity (ability to exclude false sites) when benchmarked against well-characterized true positives and known non-binding DNA sites. A particularly interesting claim is that this survey is probably comprehensive and has located all of the NRSF/REST sites in the genome, at least in the cell line studied. This is attributable to the spectacular sequencing depth of the new platforms.

Of course, this is one study with one target and one antibody in one cell line. Good antibodies for ChIP experiments are a challenge -- finding good antibodies in general remains a challenge. Other targeted DNA-binding proteins might not behave so well. On the other hand, improvements in next generation sequencing technologies will enable more data to be collected. With paired-end reads from the fragments, perhaps a significant amount of the discarded 50% of the data could be salvaged as uniquely mappable. Or, just go to even greater depths. Presumably some clever computational algorithms will be developed to tease out sites which are hiding in the repetitive portions of the genome.

It is easy to imagine that in the next few years this approach will be used to map virtually all of the binding sites for a few dozen transcription factors of great interest. Ideally, this will happen in parallel in both human and other model systems. For example, it should be fascinating to compare the binding site repertoire of Drosophila p53 vs. human p53. Another fascinating study would be to take some transcription factors suggested to play a role in development and scan them in multiple mammalian genomes, yielding a picture of how transcription factor binding has changed with different body plans. Perhaps such a study would reveal the key transcription factor changes which separate our development from those of the non-human primates. The future is bound to produce interesting results.

Wednesday, June 06, 2007

Purple & White

When I was out in the garden yesterday a smile was brought to my face by some purple blossoms immediately adjacent to some white ones. Those blossoms have so many personal resonances: a bi-annual race, gustatory delight, visual fun & a bit of history. And this year, I am excessively pleased with myself because thinking about those plants led me to a successful guess as to the climate & weather of a distant city I have never had the pleasure of visiting.

Gardening in New England has some distinct challenges, and this year opened up with Mother Nature's nastiest tricks. I actually got some of the seeds for those plants in on time, as soon as the ground thawed, only to watch two successive late spring snowfalls. So my rare early jump was completely defeated.

The need for the jump is clear. Plants which can be seeded early are cool weather crops, and most do very poorly in warm weather. Before you know it, the heat of summer is upon us and those cool weather crops fade in one way or another. Some truly die, but others 'bolt' by launching flower stalks that simultaneously degrade the flavor of the vegetable. We are already experiencing 90 degree (F) days, so the race is on.

The plant in question is visually fun because it sends thin curling tendrils to wrap around anything it encounters. As a kid I loved uncurling them gently and wrapping them around a support.

If you hadn't guessed the plant already, the history & weather bits are a giveaway, as the city whose climate I guessed -- one with much cooler summers than Boston -- is Brno, or Bruenn as Brother Gregor would have known it. Those beautiful flowers are on my pea plants, and it occurred to me that while there were probably many considerations in their choice as a model, being able to grow them frequently would be a plus -- and in Boston you can't do much with peas for most of the summer. The second race does begin in mid-to-late summer, if you try to seed a second crop. The other New England weather treachery is the early autumn frost, usually followed by a long burst of warm autumn weather to truly twist the pruning knife in your side -- if that killer frost hadn't arrived, another 5-6 weeks of fresh produce would have come in.

Which, of course, is the main reason I do it. I don't grow large quantities of vegetables, but it really is a magic moment when you nearly instantly transfer something you grew from the plant to your mouth and then savor all its sensuous delights. For peas it is sweetness & crunch.

That choice of peas was quite lucky, as pea genetics are relatively straightforward. Many plants have horribly complicated genetics, and indeed one of the luminaries of the day with whom Mendel corresponded suggested he repeat his experiments in hawkweed, which is one of those many genetic messes.

Of course, later workers would tease apart some of those messes to lead to interesting discoveries, and more are sure to come. But right now, I just want to discover some pods before my peas wilt in the summer heat.

Tuesday, June 05, 2007

Tagging Up With Protein Microarrays

Molecular Systems Biology, an open access journal, has an impressive new functional protein microarray paper. The authors identified a large number of targets for a yeast ubiquitin transferase (enzymes which transfer a protein tag, ubiquitin, onto other proteins), and the data has a good ring to it.

Some background: protein microarrays are a much more complicated subject than nucleic acid microarrays. One way to split them is by intent. Capture arrays have some sort of affinity capture reagent, most likely antibodies, on the chip surface. If properly designed, built & calibrated they represent a very highly multiplexed set of protein assays. Reverse-phase protein arrays spot fractionated, but unpurified, proteins from biological samples on an array.

In contrast, functional protein microarrays attempt to represent a proteome on a chip as individually addressable spots in order to study aspects of that proteome. A number of groups have worked on functional protein microarrays, but there are a limited number of commercial sources, with perhaps the most successful being Invitrogen, which offers human and yeast arrays. If you'd like a great beach book on the subject, a new volume covers a wide array of topics, with Chapter 22 ("Evaluating Precision and Recall in Functional Protein Arrays") definitely my favorite.

Functional protein arrays present a huge challenge. In the ideal case the proteins would be produced, folded correctly and deposited on the slide in such a way that an assay can be run on every protein in parallel. This is a tall order, with lots of complications. Proteins may not fold correctly during expression or may unfold in the neighborhood of the slide surface, the post-translational state of the protein may be variable and is unlikely to capture all possible states of the protein, and the protein may not have key partners which are important for its function.

Despite these, and many other concerns, protein microarray experiments have been published describing various feats. Protein-protein interaction experiments to discover novel interactions (such as this one) or create comprehensive binding profiles (such as this one) are probably the most prevalent use, but the arrays can also be used to discover DNA binding proteins, identify novel enzymes, assay phenotypic differences of mutants, develop novel infectious disease diagnostic strategies, and identify the targets of protein kinases. [links are a mix of open access & paid access; apologies]

A wide variety of ingenious methods have been used to produce functional protein microarrays. The Invitrogen arrays are spotted from purified expressed protein and expected to bind randomly, but some other approaches ensure that the majority of protein molecules bind in a defined way. Some approaches actually synthesize the proteins in situ, and one group even deposited proteins on spots using a mass spectrometer!

Protein microarrays have had their growing pains. The amount of active protein found in a spot can vary widely. One study of protein-protein interactions failed to recover most of the known interactors of the bait protein. Since the bait is primarily a phosphoprotein binding protein, one possible explanation is that the insect-expressed human proteins were not in their correct phosphorylation state. However, poor recall of known substrates was also observed in protein kinase substrate searches run in both human and yeast (see Chapter 22 of the Predki book). Even without worrying about post-translational modification, coverage is an issue. While essentially the complete Saccharomyces proteome is available, the most extensive commercial human chip has less than 1/5th of the proteome and there are not (last I checked) commercial arrays for any other species.

The new publication wins on a bunch of scores. First, it is one of the handful of publications using such arrays which is not from one of the labs pioneering them, suggesting that they might work routinely. This publication uses the Invitrogen yeast arrays. Second, they did recover a lot of known substrates for their ubiquitinating enzyme. Third, the signals look very strong by eye, which has been the case for protein-protein interaction assays but much less so for protein kinase substrate discovery. Fourth, they batted 1.000 with novel positives from the array in an independent in vitro ubiquitination assay and were able to verify that at least some of these are ubiquitinated by Rsp5 in vivo (by comparing ubiquitination in wt and Rsp5 mutant strains). Fifth, they performed a protein-protein interaction microarray assay with Rsp5 and the interaction results and ubiquitination results strongly overlapped.

Of course, I used to work at Ubiquitin Proteasome Pathway Inc (which is now touting a new drug with a new target in the pathway), and there I would have been digesting this paper until arrays danced in my dreams. Such assays offer an interesting possibility for greatly expanding our understanding of UPP players and functions -- many Ub transferases or Ub-removing proteases have no known substrates. While they have a lot of issues, functional protein microarrays are starting to make a difference in proteomics.

Monday, June 04, 2007

SOLiD-ifying the next generation of sequencing

ABI announced today that it has started delivering its SOLiD next generation sequencing instruments to early access customers and will take orders from other customers (anyone want to spot me $600K?). SOLiD uses the ligation sequencing scheme developed by George Church and colleagues.

Like most of the current crop of next generation sequencers (that is, those which might see action in the next couple of years), SOLiD utilizes the clonal amplification of DNA on beads.

One interesting twist of the SOLiD system is that every nucleotide is read twice. This should guarantee very high accuracy. Every DNA molecule on a given bead should have exactly the same sequence, but by having such redundancy one can reduce the amount of DNA on each bead -- meaning the beads can be very small.
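
One way to picture the read-everything-twice idea is a two-base encoding, in which each interrogation covers a pair of adjacent bases; the sketch below is my own illustration of the principle, not necessarily ABI's exact scheme:

```python
# Each of the 16 dinucleotides maps to one of 4 'colors'; because adjacent
# dinucleotides overlap by one base, every internal base is covered by two
# colors, and a single miscall shows up as two mutually inconsistent colors.
BASES = "ACGT"
COLOR = {a + b: BASES.index(a) ^ BASES.index(b) for a in BASES for b in BASES}

def encode(seq):
    """Color-space encoding: one color per overlapping pair of bases."""
    return [COLOR[seq[i:i + 2]] for i in range(len(seq) - 1)]

# Knowing the first base plus the colors is enough to decode the rest:
# color = index(a) XOR index(b), so index(b) = index(a) XOR color.
```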

Bio-IT World has a writeup on next generation sequencing that focuses on SOLiD (free!). They actually cover the wet side a surprising amount for an IT-focused mag, and even have photos of the development instrument. An interesting issue that the article brings up is that each SOLiD run is expected to generate one terabyte of image data. The SOLiD sequencer will come with a 10X dual core Linux cluster sporting 15 terabytes of storage. This is a major cost component of the instrument -- though it is worth noting that the IT side will be on the same spectacular performance/cost curve as the rest of the computer industry -- it's pointed out that 5 years ago such a cluster would have been one of the 500 most powerful supercomputers in the world; in a handful of years I'll probably be requisitioning a laptop with similar power.

That is still a lot of data per run; in contrast, the top-line 454 FLX generates only 13 gigabytes of images per run -- so there is still an opportunity to develop a 454 trace viewer that runs on a video iPod! A side-effect of this deluge of image data is that ABI expects users will not routinely archive their raw images, but will instead let ABI's software digest them into reads and save only the reads. That's an audacious plan: with the other sequencers, and with fluorescent sequencing before them, archiving was pretty standard -- at Millennium we had huge amounts of space devoted to raw traces, and NCBI and the EBI maintain enormous trace archives as well. The general reason for archiving traces is that better base-calling software may show up later and extract more from the same images. SOLiD customers will be faced with either forgoing that opportunity or paying through the nose for tape backup of every run.
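
Just to put a crude number on the storage problem: the ~1 terabyte per run and the 15 terabytes of bundled storage come from the article, but the run rate below is a placeholder of my own.

    # How quickly raw SOLiD images would pile up if archived. The run rate is a
    # placeholder of my own; the ~1 TB/run and 15 TB bundled storage figures
    # come from the article discussed above.
    terabytes_per_run = 1.0
    bundled_storage_tb = 15.0
    runs_per_year = 50            # hypothetical: roughly one run a week

    runs_until_full = bundled_storage_tb / terabytes_per_run
    print("bundled storage fills after ~%.0f runs" % runs_until_full)
    print("archiving a year of runs needs ~%.0f TB" % (runs_per_year * terabytes_per_run))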

Since many of the same labs are early-access customers for more than one of these instruments, one can hope that some head-to-head competitions will ensue, looking at cost, accuracy and real throughput. ABI claims SOLiD will generate over 1 gigabase per run, and Illumina/Solexa named their sequencer (the '1G') for similar output, whereas Roche/454 is quoted more in the 0.4-0.5 gigabase per run range. Further evolutionary advances of all the platforms are to be expected. For SOLiD, that will mean packing the beads tighter and minimizing non-productive beads (those carrying either zero or more than one DNA species). The Church paper introduced an interesting performance metric -- bits of sequence read per bit of image generated -- which was about 1/10000 in the paper, with a 1:1 ratio proposed as the goal.
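
For perspective on that metric, here is a crude calculation with my own round numbers (2 bits per called base, ~1 gigabase of sequence and ~1 terabyte of images per run); it lands within an order of magnitude of the paper's figure.

    # Crude estimate of bits of sequence per bit of image for a SOLiD-class run.
    # Assumes 2 bits per called base, 1 gigabase of sequence, and 1 TB of images
    # per run -- my own round numbers, not ABI's specifications.
    bases_per_run = 1e9                 # ~1 gigabase claimed per run
    sequence_bits = 2 * bases_per_run   # 2 bits encode one of 4 bases
    image_bytes = 1e12                  # ~1 terabyte of images per run
    image_bits = 8 * image_bytes

    ratio = sequence_bits / image_bits
    print("sequence bits per image bit: 1/%.0f" % (1 / ratio))   # roughly 1/4000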

In any case, the density of data achievable is spectacular -- one of my favorite figures of all time is Figure 3B in the Church paper, which uses the sequencing data to determine the helical pitch of DNA! The ABI press release mentions using SOLiD to identify nucleosome positioning motifs in C. elegans, and I recently saw an abstract which used 454 to hammer on HIV integration sites to work out their subtle biases. Ultra-deep, ultra-accurate sequencing will enable all sorts of novel biological assays. One can imagine simultaneously screening whole populations for SNPs or going very deep within a tumor genome for variants. Time to pull up a chair, grab a favorite beverage, and watch the fireworks!

Sunday, June 03, 2007

The Haplotype Challenge

In yesterday's speculation about chasing rare diseases with full human genome sequences, I completely ignored one major challenge: haplotyping.

To take the simplest case, imagine you are planning to sequence a human female's genome, one which you know is exactly like the reference human female genome in structure -- meaning she has exactly two copies of each region of the genome and a typical number of SNPs. Can you find all the SNPs?

In an ideal world, and in some technologists' dreams (more on that later this week), you would simply split open one cell, tease apart all the chromosomes, and read each one from end to end. 100% coverage, guaranteed.

Unfortunately, we are nowhere near that scenario. While chromosomes are millions of nucleotides long, our sequencing technologies read very short stretches. 454 currently claims reads of roughly 200 nucleotides (though the grapevine suggests this is rarely achieved by customers), and the other next-generation sequencing technologies are expected to have read lengths in the 20-40 nucleotide range. SNPs occur, on average, about once per kilobase, so there is a big discrepancy. The linkage of different SNPs to each other on the same chromosome is called a haplotype.
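
To put a rough number on that discrepancy, using only the one-heterozygous-SNP-per-kilobase figure above (nothing more precise):

    # Rough expectation for informative (heterozygous) SNPs per read, assuming a
    # heterozygous site roughly every kilobase -- round numbers, not measured rates.
    het_snps_per_base = 1.0 / 1000

    for read_length in (25, 35, 200, 800):    # short reads, 454-class, Sanger-class
        expected = read_length * het_snps_per_base
        print("read length %4d bp -> %.3f informative SNPs per read" % (read_length, expected))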

Haplotypes are important things to be able to identify & track. For example, if an individual has two different SNPs in the same gene, it can make a big difference whether they are on the same chromosome or on different chromosomes. Imagine, for example, that one SNP eliminates transcription of the gene while the other generates a non-functional protein. If they are on the same chromosome (in cis), having two null mutations is no different from having just one. On the other hand, if they are on different chromosomes (in trans), no functional copy is present. Other pairs of SNPs might have the potential to reinforce or counteract each other in cis but not in trans.

The second challenge is that we don't start at one end of a chromosome and read to the other. Instead, the genome is shattered randomly into a gazillion (technical term!) little pieces.

If you think about it, when we look through all the sequence data from a single human (or canine, or any other diploid) genome, we can sort the positions into two bins:

  1. Positions for which we definitely found two different versions

  2. Positions for which we always found the same nucleotide


Category #2 will contain mostly positions at which the genome of interest was identical on both copies (homozygous), but it could also contain cases where we simply never saw both copies. For example, if we saw a given region in only one read, we know we couldn't possibly have seen both copies. Category #1 comes with some severe limits: we can assign sequences to a haplotype only if they lie within a read length of an informative SNP (one which is heterozygous in the individual), and in practice the linked stretch will usually be much shorter, since the informative SNP will rarely do us the honor of sitting at the extreme beginning or end of a read.
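
Here is a toy version of that two-bin sort; the pileup data structure, positions and bases in it are invented purely for illustration.

    # Toy two-bin sort of genome positions from a diploid resequencing project.
    # 'pileup' maps a position to the list of bases observed in reads covering it;
    # the positions and bases here are invented for illustration.
    pileup = {
        101: ['A', 'A', 'G'],   # two versions seen: clearly heterozygous
        102: ['C', 'C', 'C'],   # same base every time: homozygous, or one copy unseen
        103: ['T'],             # single read: impossible to have seen both copies
    }

    heterozygous, uninformative = [], []
    for pos, bases in pileup.items():
        if len(set(bases)) > 1:
            heterozygous.append(pos)      # bin 1: definitely saw two versions
        else:
            uninformative.append(pos)     # bin 2: identical, or under-sampled

    print("bin 1 (two versions):", heterozygous)
    print("bin 2 (one version): ", uninformative)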

This immediately suggests one trick: count the number of reads covering each position. Based on a Poisson model of coverage, we can estimate how likely it is that every read we saw derived from the same copy.
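
A sketch of that estimate, under the simple assumptions that each read covering a position comes from either chromosome copy with probability 1/2, independently of the others, and that coverage is Poisson-distributed:

    # Chance that the reads covering a position all came from the same chromosome
    # copy, assuming each read independently samples either copy with p = 1/2.
    from math import exp, factorial

    def prob_missed_a_copy(n_reads):
        if n_reads == 0:
            return 1.0                  # no reads at all: certainly missed a copy
        return 0.5 ** (n_reads - 1)     # all subsequent reads match the first

    def poisson_pmf(k, mean_depth):
        return exp(-mean_depth) * mean_depth ** k / factorial(k)

    # At ~8x average coverage, what fraction of positions might hide one copy?
    mean_depth = 8.0
    at_risk = sum(poisson_pmf(k, mean_depth) * prob_missed_a_copy(k) for k in range(60))
    print("fraction of positions possibly showing only one copy: %.4f" % at_risk)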

Of course, Nature doesn't make life easy. Many regions of the genome are exact repeats of one form or another. A simple example: Huntington's disease is due to a repeating CAG triplet (codon); in extreme cases the total length of the repeat array can be well over a kilobase, again far beyond our read length. Furthermore, there are other trinucleotide repeats in the genome, as well as other large identical or nearly identical repeats. For example, we all carry multiple identical copies of our ribosomal RNA genes and a bunch of nearly identical copies of the gene for the short protein ubiquitin.

There is one more trick which can be used to sift the data further. Many of the next-generation technologies (as well as Sanger sequencing approaches) enable reading bits of sequence from the two ends of the same DNA fragment. So, if one of the two reads contains an informative SNP but the other doesn't, we still know the region covered by the second read lies on the same haplotype as that SNP allele. Therefore, we could see that region only twice yet be certain we have seen both copies. With fragments of the right size, you might even get lucky and find a different informative SNP in each end -- building up a two-SNP haplotype.
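
A minimal sketch of that bookkeeping, with invented read pairs and an invented informative-SNP list:

    # Toy paired-end haplotype linking. Each read pair is (mate1, mate2), where a
    # mate is a dict of position -> observed base. Data are invented for illustration.
    read_pairs = [
        ({1000: 'A'}, {5000: 'G'}),   # informative SNP at 1000 links the base at 5000
        ({1000: 'T'}, {5000: 'G'}),   # the other haplotype at 1000, same base at 5000
        ({1000: 'A'}, {8000: 'C'}),   # builds a 2-SNP haplotype if 8000 is informative
    ]

    informative_snps = {1000}          # heterozygous positions in this individual

    haplotype_links = {}               # (snp_pos, allele) -> {linked_pos: base}
    for mate1, mate2 in read_pairs:
        for snp_pos in informative_snps:
            for carrier, other in ((mate1, mate2), (mate2, mate1)):
                if snp_pos in carrier:
                    links = haplotype_links.setdefault((snp_pos, carrier[snp_pos]), {})
                    links.update(other)

    for (snp_pos, allele), linked in sorted(haplotype_links.items()):
        print("haplotype %d:%s carries %s" % (snp_pos, allele, linked))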

This is particularly relevant to what I suggested yesterday. Suppose, for example, that the relevant mutation is a Mendelian dominant. That means it will be heterozygous in the genome. In regions of the genome that are poorly sampled, we won't be sure if we can really rule them out -- perhaps the causative mutation is on the haplotype we never read in that spot.

Conversely, suppose the causative mutation was recessive. If we see a rare SNP in a region which we read only once, we can't know if it is heterozygous or homozygous.

Large rearrangements or structural polymorphisms have similar issues. We can attempt to identify deletions or duplications by looking for excesses or deficiencies in reading a region, but that will be knotted up with the original sampling distribution. The real smoking gun would be to find the breakpoints, the regions bordering the spot where the order changes. If you are unlucky and miss getting the breakpoint sequences, or can't identify them because they are in a repeat (which will be common, since repeats often seed breakpoints), things won't be easy.
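
One crude way to flag candidate deletions or duplications from coverage alone, with the sampling-noise caveat built in as a Poisson test (the window names, depths and the 1% threshold below are all invented):

    # Flag windows whose read depth is improbably low or high for a diploid region,
    # given Poisson-distributed coverage. Window depths here are made up.
    from math import exp, factorial

    def poisson_cdf(k, mean):
        return sum(exp(-mean) * mean ** i / factorial(i) for i in range(k + 1))

    mean_depth = 8.0
    window_depths = {'chr1:0-10kb': 7, 'chr1:10-20kb': 1, 'chr1:20-30kb': 19}

    for window, depth in sorted(window_depths.items()):
        low = poisson_cdf(depth, mean_depth)               # P(depth <= observed)
        high = 1.0 - poisson_cdf(depth - 1, mean_depth)    # P(depth >= observed)
        if low < 0.01:
            print(window, "looks deleted (depth %d)" % depth)
        elif high < 0.01:
            print(window, "looks duplicated (depth %d)" % depth)
        else:
            print(window, "consistent with two copies (depth %d)" % depth)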

Of course, you can try to make your own luck. This is a sampling problem, so just sample more. That is a fine strategy, but deeper sampling means more time & money, and with fixed sequencing capacity you must trade going really deep on one genome versus going shallow on many.

You could also try experimental workarounds. For example, running a SNP chip in parallel with the genome sequencing would enable you to ascertain SNPs that are present but missed by sequencing, and would also enable finding amplifications or deficiencies of regions of the genome (SNP chips cannot, though, directly ascertain haplotypes). Or, you can actually use various cellular tricks to tease apart the different chromosomes, and then subject the purified chromosomes to sequencing or SNP chips. This will let you read out haplotypes, but with a lot of additional work and expense.

I did, at the beginning, set the scenario with a female genome. This was, of course, very deliberate. For most males, most of the X and Y chromosomes are present as a single copy (one region on each, the so-called pseudoautosomal region, pairs with the other and so retains the haplotyping problem). So the problem goes away -- but only for a small fraction of the genome.

We will soon have multiple individual human genomes available for analysis: Craig Venter will apparently be publishing his genome soon, and James Watson recently received his. It will be interesting to see how the haplotyping issue is handled & plays out in these early complete genomes, and whether backup strategies such as SNP chips are employed.

Blog Carnivals & such

Blog carnivals are a wonderful thing: a bunch of bloggers with related topics take turns writing capsule summaries of other blog entries, forming a sort of meta-blog on a topic. I've stumbled across a number in biology, and there are three which I contribute to somewhat erratically (I really need to put the deadlines on my calendar!): Gene Genie, Mendel's Garden and Bio::Blogs. Each of these has a new edition out, with Eye on DNA hosting the latest Gene Genie, The Daily Transcript hosting Mendel's Garden and Pedro Beltrao editing Bio::Blogs.

A related concept is a shared feed, and I belong to The DNA Network. Indeed, that is now the main blog feed I check regularly -- a nice bit of one-stop shopping.

Saturday, June 02, 2007

Genome Sequencing for Unique Genetic Diseases

The following is based on a chance encounter with a stranger. I don't believe I am violating any ethical lines, but will entertain criticism in that department. I'm not a physician, and nothing in this should be viewed as more than extreme scientific speculation. If the Gene Sherpa or others feel I should be raked over the coals, then get the bonfire going!

I have a young son & so spend time on playgrounds and in similar settings. Last fall I had taken him & his cousin to a playground at a park; he had reached his quota of watching his cousin's youth soccer and needed a change of pace. Some older kids were there, resulting in gleeful experimentation with extreme G-forces on the merry-go-round.

One of the other parents there was keeping an eye on the events but also tending to a clearly very challenged girl in a wheelchair -- she had various medical gear on the wheelchair and at least one tube. In the course of routine conversation (no, I wasn't prying!) it came out that (a) the girl was 10 and (b) she had already in her young life had multiple organ transplants. Intrigued, I asked what her condition was called (okay, now I'm guilty of prying), and the answer was that as far as any specialist they had consulted knew, this girl was the only known case.

Given my background & interests, it was natural for me to start the mental wheels grinding on genetic speculation. Don't worry, I don't reserve this for strangers! Shortly after my son was born it came out in casual conversation that a relative on his mother's side was colorblind, so I went into hyperdrive grinding out the probability that my son would be too. Around the time of his first New Year, he was playing with a green ball amongst red ones, and lightning struck again! Green amongst red! After several trials, his red-green vision powers were pronounced good!

Now, such a multi-symptomatic syndrome could have many causes, but suppose it was genetic? Since there was only one known case, traditional genetic mapping would be impossible. But, what might whole-genome sequencing be able to do?

There are many genetic scenarios, but let's narrow it down to three.

First, it could be a simple Mendelian dominant, as is the case with CHARGE, a developmental disorder. For such devastating diseases to occur as dominants, they must generally arise from spontaneous mutations, since severely affected individuals rarely reproduce and pass the allele along.

Second, it could be a simple Mendelian recessive syndrome, either a very rare one or a novel phenotype of a known one. Depending on the type of mutation, damaged versions of a gene can lead to phenotypes which are not obviously related. For example, some alleles of decapentaplegic in Drosophila are known as Held Out because the wings always point straight away from the body; other alleles inspired the official name because the flies show defects in all fifteen imaginal discs, the precursors of the adult appendages.

Third, it could be something else. Interactions between genes, some nasty epigenetic problem, etc. Those will all be pretty much intractable by genome sequencing.

But how much progress might we make in the other two cases? First, suppose we got a complete genome sequence for the affected child. Comparing that sequence against the human reference sequence and against catalogs of SNPs (which should grow quite large once human whole-genome sequencing becomes common), one could attempt to identify all of the unusual variants in the child's genome. Depending on how well the child's ethnic background is represented in the databases, there might be very few or there might be very many. A sizable deletion or inversion might be regarded as a good candidate for a dominant. An unusual variant, particularly a non-synonymous coding SNP or a SNP in a known or suspected genetic control region, might be a candidate for a dominant -- and, if homozygous, a candidate for a recessive.
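
As a toy sketch of that first filtering pass (the variant records and fields below are invented; a real pipeline would worry far more about call quality and ethnicity-matched frequencies):

    # Toy filter: keep the affected child's variants that are absent from the SNP
    # catalog, and bin them by zygosity. All records below are invented.
    child_variants = [
        {'pos': 'chr7:117000', 'genotype': 'het', 'effect': 'nonsynonymous', 'in_catalog': False},
        {'pos': 'chr2:4500',   'genotype': 'hom', 'effect': 'nonsynonymous', 'in_catalog': False},
        {'pos': 'chr1:123',    'genotype': 'het', 'effect': 'intergenic',    'in_catalog': True},
    ]

    dominant_candidates = []    # rare heterozygous changes (spontaneous dominant?)
    recessive_candidates = []   # rare homozygous changes

    for v in child_variants:
        if v['in_catalog']:
            continue                        # seen before in healthy people; deprioritize
        if v['genotype'] == 'hom':
            recessive_candidates.append(v)
        else:
            dominant_candidates.append(v)

    print("dominant candidates: ", [v['pos'] for v in dominant_candidates])
    print("recessive candidates:", [v['pos'] for v in recessive_candidates])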

Now suppose we could get the mother's sequence as well. It should then be possible to really hammer on the Mendelian dominant hypothesis, since any rare variant also found in the unaffected mother can be ruled out. If you could get the father's DNA too, then one could really go to town. In particular, that would enable identifying any de novo mutations (those that arose in a parent's germline -- ova or sperm -- but are not carried in that parent's somatic, or body, cells). It should also allow identifying any funky transmission issues, such as uniparental disomy (the case in which both copies of a genetic region are inherited from the same parent). Finding uniparental disomy might be a foot in the door toward an imprinting hypothesis -- getting two copies of a gene from the same parent can be trouble if the gene is imprinted.
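
And a companion sketch for the trio, again with invented genotype calls: de novo candidates are alleles in the child seen in neither parent, while sites where the child's genotype cannot be explained by one allele from each parent hint at uniparental disomy (or a deletion, or plain genotyping error).

    # Toy trio analysis: flag de novo candidates and crude uniparental-disomy hints.
    # Genotypes are unordered allele pairs; all calls below are invented.
    trio = {                                   # position -> (child, mother, father)
        'chr3:100': (('A', 'G'), ('A', 'A'), ('A', 'A')),   # G absent in both parents
        'chr3:200': (('C', 'C'), ('C', 'T'), ('T', 'T')),   # no child allele from father
        'chr3:300': (('A', 'T'), ('A', 'A'), ('T', 'T')),   # ordinary biparental site
    }

    for pos, (child, mother, father) in sorted(trio.items()):
        de_novo = [a for a in child if a not in mother and a not in father]
        mendelian = any(a in mother for a in child) and any(a in father for a in child)
        if de_novo:
            print(pos, "de novo candidate allele(s):", de_novo)
        elif not mendelian:
            print(pos, "inconsistent with biparental inheritance (UPD? deletion? error?)")
        else:
            print(pos, "ordinary biparental transmission")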

How many candidate mutations & genes might we find with such a fishing expedition? That is the big question, and one which can really only be answered by trying it out. The precise number of rare alleles found is going to depend on the ethnic background of the parents and their relatedness. For example, if the parents are relatives (consanguineous), there is a higher chance of the child carrying two copies (homozygosing) of a rare variant (no, I'm not the type to pry that deep). And if a parent is from an ethnic background that isn't well represented in the databases, then many of that parent's perfectly ordinary SNPs will look rare simply because the databases haven't recorded them.

What sort of gene might we be looking for? Probably just about anything. Genes with known developmental roles might be good candidates, or perhaps predicted transcription factors (a la CHARGE), but it could be anything. Particularly difficult to make sense of would be rare SNPs distant from any known gene -- they might be noise, but they could also affect genetic control elements that act at a distance from their target gene, a well-known phenomenon.

For the family in question, what is the probability of getting useful medical information? Alas, probably very slim. One can hope for a House-like epiphany which leads to a treatment, but even if a good candidate for the causative gene can be found, that is unlikely. Some genetic diseases involving metabolic enzymes can be managed through diet (e.g. PKU), but many others cannot (e.g. Gaucher's). One might also hope for an enzyme-replacement therapy. However, it is quite likely that such a disease would not involve a metabolic enzyme, and might well involve a gene about which we really know very little.

So, such a hunt would be mostly for edification rather than treatment. Would it be worth it? At the current price of ~$1M per genome, it's hard to see. But at $10K or $1K per genome, it might well be. It probably wouldn't pay off in many cases, but if a program were set up to screen many families with rare genetic (or potentially genetic) disorders by genome sequencing, some successes would filter out. We'd certainly learn a lot of fine-scale information about human recombination and de novo mutation.