Menu

BLAST – C.gigas Larvae OA Illumina Data Against GenBank nt DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de-novo assembled by Steven: EmmaBS400.fa.

I think I will need to contact the Univ. of Oregon sequencing facility to see what their thoughts on this data is, because it’s not even remotely close to what we should be seeing, even with the bisulfite conversion…

6 comments

You’ve made contigs of the bisulfite data? I feel like between trying to first assemble bisulfite data and then trying to align bisulfite data – where you expect 25% mismatches anyway (no C’s) – there could be enough ‘error’ built in that alignments may not necessarily hit C. gigas first?

Yeah, Steven did a de-novo assembly with the BS-seq data because reads weren’t mapping to the C.gigas genome when using BS-Map. I took the de-novo-assembled BS-seq data (see this notebook entry) and BLASTed against the C.gigas genome and the results were not good. So, this particular BLAST against the GenBank nt DB was to potentially see if our sequencing data was actually “our” data by attempting to ID which species the data matched with the most. So, this wasn’t a “real” attempt at data analysis, it was mostly a long shot at seeing if our data was mostly comprised of another species, thus potentially being the wrong data set sent to use from the sequencing facility.

Thanks for the reply and the link to the notebook entry. Yeah, the evalues are not too good – as expected for bisulfite data I would think. Just to make sure, when the bisulfite alignment was done in bsmap it included mapping to the CTOT and CTOB strands? The reads won’t align to the OT and OB strands – which I think is the default for both bsmap and bismark. I was getting similar mapping before I figured this out

Sorry, just saw this. I know you already spoke to Steven about this, but I have no idea what you’re talking about with “CTOT” and “CTOB”! Steven’s been the one dealing with the bisulfite mapping. Thanks for the heads up, though! Does all this interaction with you on this project mean you’re going to come back to our lab?

I haven’t done a BLAST with just raw sequencing reads (i.e. not already de-novo assembled into contigs), so not sure how it would play out. However, BLAST has an option (can’t remember if it’s “on” by default or if you have to use a flag in the BLAST arguments when running the command to enable it) that detects short query reads and takes their length in to account when evaluating database matches. This, in theory, should help ensure that matches using short query sequences are accurate. However, it probably would be prudent to run the BLAST and evaluate the top 10 matches (alignments, e-vals, bit score, etc) to see if the results agree with your expectations.

This may be a silly question, but have you ever done this type of blast with an RNA-Seq dataset with reads of similar length? Yes, I would expect C. gigas would be the best hit, but is there something inherent in the database where ‘shortish’ reads would map to other species as a top hit?