Sunday, January 25, 2009

Are the old lessons being forgotten?

Okay, first I feel like I have to have a bit of preamble. This, and another post I'm doing the homework on, are pretty critical. Downright negative. I'm not turning into a curmudgeon or planning to turn this space into a rant-a-thon. It's just that both are topics I think are important & have pushed the right buttons.

Also, this isn't meant to be high-and-mighty-and-spotless-expert calling calumny on the great unwashed masses. If I look down at my metaphorical foot I find many tightly spaced patterns of scars, sometimes nearly concentric. We all make mistakes, and often we repeat those of the past. We think we've covered bases that have always been covered or deceive ourselves that safety mechanisms which were needed in the past are no longer necessary.

A bit ago at work I was doing some exploring of a standard vector backbone and became curious just how taxonomically widespread pieces of the backbone might be found naturally. So naturally, I pumped the sequence into the NCBI BLASTN server & pointed it at the RefSeq genomes. As expected, a bunch of bacterial plasmids popped up. What was unsettling, though, was a bunch of provisional genomic RefSeqs for eukaryotic chromosomes. Indeed, one project had apparently deposited every chromosome with a pUC-type vector sequence at one end. YIKES!

The other day I got curious again & tried searching the non-redundant DNA and protein databases but with the species filter set to eukaryote. Again, a bunch of hits -- and the shocking part was many were very recently deposited sequences -- even human ones. In some cases, the entire deposited sequence was vector-derived (e.g. the non-human "putative reverse transcriptases" ABK60177.1, CAD59768.1, CAD59767.1 & CAL37000.1).

For example, AK302803.1 is a 1352 nucleotide sequence deposited in 2008; from position 888 on it is clearly vector -- and the coding region is annotated as 1 to 1275! CAH85743 is a "Plasmodium" protein which is entirely vector-derived; again deposited in 2008. PIR (is anybody still curating this?) has a number of vector-derived proteins (e.g. the 231 amino acid "NZ-3 antigen" JC7702; S. pombe beta-lactamase (!) T51301). I was surprised to even find a SwissProt entry that looks like it has pUC-derived sequence:

>sp|Q63661.2|MUC4_RAT RecName: Full=Mucin-4; Short=MUC-4; AltName: Full=Pancreatic
adenocarcinoma mucin; AltName: Full=Testis mucin; AltName: Full=Ascites
sialoglycoprotein; Short=ASGP; AltName: Full=Sialomucin
complex; AltName: Full=Pre-sialomucin complex; Short=pSMC;
Contains: RecName: Full=Mucin-4 alpha chain; AltName:
Full=Ascites sialoglycoprotein 1; Short=ASGP-1; Contains: RecName:
Full=Mucin-4 beta chain; AltName: Full=Ascites sialoglycoprotein
2; Short=ASGP-2; Flags: Precursor
Length=2344

GENE ID: 303887 Muc4 | mucin 4, cell surface associated [Rattus norvegicus]
(Over 10 PubMed links)

Score = 46.6 bits (109), Expect = 0.006
Identities = 22/35 (62%), Positives = 25/35 (71%), Gaps = 3/35 (8%)
Frame = -3

pUC19  1427  CCLQTKKPPLPAVVCLPDQELPTLFPKVTGFSRAQ  1323
             CCLQTKKPPLPAVVCLPD   P+  P +   S+ Q
Sbjct  1051  CCLQTKKPPLPAVVCLPD---PSSVPSLMHSSKPQ  1082



Even the RefSeq mRNA section has some very provisional mammalian predicted cDNAs (from chimp) which appear to be polylinker-type sequences from vector (selected restriction sites are marked):


                              =XbaI=      =PstI=
                        =BamHI      =SalI=      =PaeI
pUC19            415  GGGGATCCTCTAGAGTCGACCTGCAGGCATG  444
XM_001160101.1    56  GGGGATCCTCTAGAGTCGACCTGCAGGCAT   85
XM_001146903.1   439    GGATCCTCTAGAGTCGACCTGCAGGCATG  467
XM_001141474.1  1503   GGGATCCTCTAGAGTCGACCTGCAGGCA    1530
XM_001141395.1   922   GGGATCCTCTAGAGTCGACCTGCAGGCA    949
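
For anyone who wants to check the marked sites for themselves, here is a rough Python sketch that locates the complete recognition sequences in a stretch of DNA. The function name `find_sites` and the `SITES` table are my own for illustration; the PaeI site is left out because it runs off the end of the stretch shown above.

```python
# Recognition sequences for the enzymes whose sites are complete in the excerpt.
SITES = {
    "BamHI": "GGATCC",
    "XbaI": "TCTAGA",
    "SalI": "GTCGAC",
    "PstI": "CTGCAG",
}

def find_sites(seq, sites=SITES):
    """Return a sorted list of (0-based start, enzyme) for every site in seq."""
    seq = seq.upper()
    found = []
    for enzyme, motif in sites.items():
        start = seq.find(motif)
        while start != -1:
            found.append((start, enzyme))
            start = seq.find(motif, start + 1)
    return sorted(found)

# The pUC19 stretch quoted in the alignment above.
polylinker = "GGGGATCCTCTAGAGTCGACCTGCAGGCATG"
for start, enzyme in find_sites(polylinker):
    print(start, enzyme)
```

Four cloning sites packed back to back like this are a strong hint of a polylinker rather than anything a chimp would encode.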


Contamination of various sorts has plagued genome projects from the get-go. Perhaps the most notorious was a large deposition of human ESTs which were donated to the public with great fanfare (as a counterpoint to private EST efforts), only to be found later to be rich in yeast sequences. The solution is to run filters -- search everything you do against vectors, E. coli and other common contaminants. In addition, especially in this day and age, if your "human" mRNA sequence doesn't match the genome, you've got some 'splaining to do.
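
To make the filtering idea concrete, here is a minimal Python sketch of a screen based on shared k-mers. It is a toy under stated assumptions, not anyone's production filter: the only "contaminant" it knows is the pUC19 polylinker stretch quoted in this post, the word size of 20 is arbitrary, and all of the function names are mine. A real pipeline would align against a full vector collection (NCBI's UniVec is one) plus E. coli and friends.

```python
# Toy reference set: just the pUC19 polylinker stretch quoted in this post.
POLYLINKER = "GGGGATCCTCTAGAGTCGACCTGCAGGCATG"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmers(seq, k):
    """All k-length substrings of seq, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k=20):
    """Index every k-mer of each reference on both strands."""
    index = set()
    for ref in references:
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index

def vector_hits(seq, index, k=20):
    """0-based start positions in seq whose k-mer appears in the index."""
    seq = seq.upper()
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in index]

def is_suspect(seq, index, k=20):
    """True if seq shares at least one k-mer with the reference set."""
    return bool(vector_hits(seq, index, k))

index = build_index([POLYLINKER])
clean = "ATGGCCAAGTTGACCAGTGCCGTTCCGGTGCTCACCGCGCGCGACGTC"  # made-up insert
contaminated = "ATGGCCAAGTTGACC" + POLYLINKER + "TTTT"
print(is_suspect(clean, index), is_suspect(contaminated, index))
```

Exact k-mer matching will miss vector sequence carrying sequencing errors, which is one reason an alignment-based search is the better tool; the sketch is only meant to show how little it takes to catch the blatant cases.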

What's the harm? Well, when it comes to databases I don't like mess. You always need to check your data, but it's always a nuisance when you actually have to clean it a bunch. Miss something, and some experiment is dirty or, worse, ruined. Plus, and this is a bit of the theme to my proto-post, some folks haven't yet figured this out & the results are truly ugly. Even worse, these are the obvious problems, since bacterial vectors in a eukaryotic sequence truly stick out. Now I'm wondering about all the pUC-like sequences I found in bacterial sources -- can I trust them either?

So, let's all make an it's-still-a-pretty-new-year resolution to recheck our sequencing pipelines. Deliberately throw pUC19 and the E. coli genome through them & see what comes out.
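
Here is one way that resolution might look as a positive control, sketched in Python. Everything in it is hypothetical: `toy_screen` stands in for whatever contamination check your own pipeline runs, and the controls are built from the pUC19 polylinker stretch quoted above rather than the full plasmid or the E. coli genome.

```python
PUC19_POLYLINKER = "GGGGATCCTCTAGAGTCGACCTGCAGGCATG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def positive_control(screen, controls):
    """Names of known-contaminated sequences that the screen failed to flag."""
    return [name for name, seq in controls.items() if not screen(seq)]

def toy_screen(seq):
    """Deliberately naive stand-in for a pipeline's filter: forward strand only."""
    return PUC19_POLYLINKER[:20] in seq.upper()

# Sequences we KNOW contain vector, so every one of them should be flagged.
controls = {
    "forward_strand": "AAAA" + PUC19_POLYLINKER + "TTTT",
    "reverse_strand": "AAAA" + reverse_complement(PUC19_POLYLINKER) + "TTTT",
}

missed = positive_control(toy_screen, controls)
print("controls that slipped through:", missed)
```

The naive screen lets the reverse-strand control through, which is exactly the kind of hole a deliberate spike-in is meant to expose before it shows up in a database.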

1 comment:

Steven Salzberg said...

Keith, this has been a problem for a long time and is only getting worse - many more people are generating sequence now, and they haven't yet learned the lessons that we learned at the big sequencing centers years ago.

For example, a short paper I was part of years ago was titled "Contamination in the draft of the human genome masquerades as lateral gene transfer." (Willerslev et al., 2002) The point was that "human" sequences in GenBank at the time were bacterial, not human, and that scientists looking for evidence of lateral gene transfer - an important research topic, then and now - could easily be misled by these contaminants.

For big projects, the folks at NCBI recently implemented a screening protocol to check for common vector sequences, but their protocol can't be expected to catch everything.