Wednesday, September 19, 2012

Does Illumina Have A Sequence Diversity Problem?

Roughly speaking, NGS sample preparation workflows can be split into two basic classes.  Complete molecule workflows attempt to capture the entire molecule, but are currently suitable only for microRNAs and other small fragments.  With luck, long read technologies will someday make these the standard.  Fragment workflows are the workhorse: they take input material (RNA, DNA) and convert it into a library of fragments representing (or taken directly from) the original material.

One important use of whole molecule workflows is amplicon sequencing, in which a PCR fragment has been carefully designed to fit within the bounds of the targeted sequencing platform.  In many cases, these amplicons have the sequencing adapters designed in ("fusion amplicons"), which allows PCR products (after removal of primer dimers) to be input directly into the template preparation step.  Amplicon sequencing is a popular workflow in diagnostics, metagenomics and many other fields.  Because of its relatively long sequencing reads, 454 has been a favorite platform for amplicon sequencing, but as other platforms approach its length (or even surpass it, in the case of PacBio), it is being supplanted.  Amplicon sequencing is also well-matched to the current generation of benchtop sequencers in terms of output; without specialized equipment such as RainDance or a serious liquid handling robot, few labs will be able to generate enough amplicon material to even think of running on a HiSeq.

One challenge Illumina faces is that their early cluster calling software relies on diversity; it needs variation in the sequences to generate variation in signal, which allows adjacent clusters to be disambiguated.  For some applications, this occurs naturally because a large set of amplicons is fed into the system.  However, for profiling metagenomes or viral swarms, it is common to use a single primer pair (perhaps with molecular barcodes), which can lead to a shortage of diversity.  Since this is a market which MiSeq would seem to be aimed at, I would expect that Illumina would be working hard to ensure that the community is aware of all the tricks of the trade to solve the problem.
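To make "diversity" a bit more concrete, here is a minimal Python sketch that scores the base composition at each sequencing cycle with Shannon entropy (0 bits when every read shows the same base, 2 bits when A, C, G and T are equally represented).  This is only my illustration of the idea, not how Illumina's software measures anything; the function name and the toy reads are made up for the example.

```python
from collections import Counter
from math import log2


def per_cycle_entropy(reads, n_cycles):
    """Return the Shannon entropy (in bits) of the bases seen at each cycle.

    Hypothetical helper for illustration: 0.0 means every read has the same
    base at that cycle, 2.0 means all four bases are equally common.
    """
    entropies = []
    for cycle in range(n_cycles):
        # Count the base each read contributes at this cycle (skip short reads).
        counts = Counter(read[cycle] for read in reads if len(read) > cycle)
        total = sum(counts.values())
        if total == 0:
            entropies.append(0.0)
            continue
        entropies.append(sum((n / total) * log2(total / n) for n in counts.values()))
    return entropies


# Toy data (assumed, not real reads): one amplicon repeated vs. a mixed library.
single_amplicon = ["ACGTTGCA"] * 4
mixed_library = ["ACGTTGCA", "CGTAACGT", "GTACCATG", "TACGGTAC"]

print(per_cycle_entropy(single_amplicon, 8))
print(per_cycle_entropy(mixed_library, 8))
```

A single-primer-pair library looks like the first case for as long as the reads are still inside the primer, which is exactly where the cluster calling needs the variation.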

It's thus startling to me that reports of trouble sequencing low diversity amplicon libraries on MiSeq are quite common on SeqAnswers, either in their own threads (e.g. "amplicon sequencing on MiSeq", "MiSeq cluster generation problems") or embedded in a thread on the recent 2x250 (and more reads!) MiSeq upgrade.

The generic answer to the problem seems to be to spike in a bunch of PhiX control DNA.  The catch is how much to spike in; nobody likes burning reads on the most over-sequenced genome ever.  An alternative approach is to put some diversity in the beginning of the primers, though I've only seen this proposed and not used.  A different strategy is to use variable-length barcodes on the primers so that the constant regions are in multiple registers.
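The multiple-register idea can be sketched in a few lines of Python.  The example below prepends spacers of different lengths to one constant amplicon and reports, for each cycle, the fraction of reads showing the most common base (1.0 means no diversity at all).  The spacer sequences, the amplicon and the function names are all hypothetical, and the sketch assumes every primer variant is equally represented on the flowcell.

```python
from collections import Counter


def staggered_reads(amplicon, spacers):
    """Simulate one read per primer variant: spacer followed by the amplicon."""
    return [spacer + amplicon for spacer in spacers]


def max_base_fraction(reads, n_cycles):
    """Fraction of reads carrying the most common base at each cycle."""
    fractions = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads if len(read) > cycle)
        total = sum(counts.values())
        fractions.append(max(counts.values()) / total if total else 0.0)
    return fractions


# Hypothetical constant region and spacers of length 0 to 3.
amplicon = "ACGTACGTACGT"
no_stagger = staggered_reads(amplicon, [""])
with_stagger = staggered_reads(amplicon, ["", "T", "GT", "CGT"])

print(max_base_fraction(no_stagger, 8))
print(max_base_fraction(with_stagger, 8))
```

Note that shifting the register only helps where the amplicon itself changes base from one position to the next; a homopolymer run in the constant region stays low diversity no matter how the primers are staggered.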

If Illumina doesn't solve this problem, and make sure the recipes are widely and freely available, then the problem may well be solved for them -- by their competitors.  Amplicon resequencing is GnuBio's first targeted market, and for many applications their claimed minor allele sensitivity (1%, I believe), low cost per run and simple workflow may take a lot of business.  ABI, of course, is also trying to get a piece of this action onto Ion boxes (either the PGM or the now arriving Proton).  PacBio would love to play in this space, but realistically, who is going to plunk down $800K for one?  It will probably only be folks who already have easy access to one.  PacBio does have an advantage in that the quality of the Circular Consensus reads will be roughly even, vs. a dip in the middle for Illumina or Ion (assuming paired-end running); no data has been released yet to assess the uniformity of quality for GnuBio along a read.

10 comments:

Nick Loman said...

This is definitely a problem. Right now we're trying to answer the basic question "What does low diversity mean?". I will post the results on my blog when we've figured it out.

Josh Quick and I posted a blog post on this subject which readers might find useful:
http://pathogenomics.bham.ac.uk/blog/2012/08/sequencing-low-diversity-libraries-on-illumina-miseq/


Mick Watson said...

Hasn't Nick already busted this one?

http://pathogenomics.bham.ac.uk/blog/2012/08/sequencing-low-diversity-libraries-on-illumina-miseq/

Anonymous said...

Why not spike it with something useful? Surely, there is other stuff waiting to be sequenced, and if you mix it with amplicons, you may not even have to barcode it.

Keith Robison said...

Apologies for failing to mention Nick's post -- even more embarrassing as I know it is referenced in one of the SEQAnswers threads I mentioned.

Anonymous said...

I've put in diversity in the primers for an amplicon run. I only put it in one of my primers, and the other end of the paired end read completely failed, so it seems like diversity on both ends is required. Otherwise, my sequencing worked quite well.

Anonymous said...

We successfully use multiple primers extended by one or a few nucleotides to overcome this issue. It is a minor inconvenience, but it has worked fine.

Anonymous said...

Hi,

I thought using a Nextera XT kit for the library preparation would solve the low diversity issue. Is this not the case?

Keith Robison said...

I haven't seen detailed discussion of how Nextera XT does with amplicons.

In any case, while that may sometimes be a useful route, fusion amplicons (in which the flowcell binding sequences are incorporated in the primer) are a very attractive system, as the informatics is simpler (no assembly) and the workflow is simpler and cheaper (clean up your PCR & you are ready to load). It's an important market, and Illumina needs to make sure nobody is saying "well, you can use MiSeq but I wouldn't recommend it".

Alex Ensminger said...

@ Anonymous - I was thinking of the same thing. Some of the phage guys at our institution would love to get their genomes sequenced as background in other people's lanes. The issue becomes library prep standardization and quality at that point, I guess.

Frédéric Raymond said...

We have used Nextera XT and it performed very well on amplicons. For Nextera XT, the best is to have longer amplicons to make sure you get a good tagmentation and more nucleotides to analyse for each sample.

Completing runs with other libraries would be a good idea, I think, if sample library prep is done only with PCR or with TruSeq.