I was hoping this question had been asked, but I haven't been able to
find it. I want to output the unmapped reads from bowtie as a FASTQ file
for subsequent mapping to other genomes (i.e. the "--un <filename>"
option). I know I can extract the unmapped reads by filtering on the
bitwise values in the SAM output and converting to FASTQ with the Picard
tool, but I'm using colorspace data and bowtie converts them to
letterspace. My understanding (coming mostly from forums and personal
discussions) was that the color-to-letter conversion was somehow lossy,
so mapping the colorspace data directly is always preferable.
So the question is: Is bowtie's '--un' option implemented in Galaxy, and
if so, how do I access it?
Thanks in advance!
Mayank Tandon

Hi Mayank,
The best option I know of is to do the following:
1 - obtain the sequence identifiers for the unmapped reads by filtering
    the SAM file, then cutting them out
2 - convert the original FASTQ file to FASTA - you should get two
    outputs, one for the sequences and one for the quality score values
3 - use the tool "Fetch Sequences -> Extract Genomic DNA". The query is
    the list from #1, the target is the "genome" from #2. Do this twice -
    once for seqs, once for quals. This means using the target datasets
    from #2 as Custom Reference genomes - help about how to do this is
    here: http://wiki.galaxyproject.org/Support#Custom_reference_genome
4 - combine the FASTA seq and qual files back to FASTQ
If you will be doing this again, then capture the process into a
workflow for future use, in a way creating your own "tool".
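
For anyone who wants to see what step 1 amounts to outside Galaxy, here
is a minimal Python sketch. It relies on two facts from the SAM format:
fields are tab-separated with the read name first and the FLAG second,
and bit 0x4 of the FLAG marks an unmapped read. The function name
unmapped_read_ids is made up for this illustration; it is not a Galaxy
or bowtie feature.

```python
from typing import Iterable, Set

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_read_ids(sam_lines: Iterable[str]) -> Set[str]:
    """Collect the names (QNAME) of reads whose FLAG has the 0x4 bit set.

    Hypothetical helper for illustration only.
    """
    ids = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            ids.add(qname)
    return ids


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0\tSO:unsorted",
        "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
        "read3\t16\tchr1\t200\t255\t4M\t*\t0\t0\tGGGG\tIIII",
    ]
    print(sorted(unmapped_read_ids(example)))  # prints ['read2']
```

To use it on a real file, pass an open file handle instead of the
example list, e.g. unmapped_read_ids(open("aligned.sam")), where
"aligned.sam" is a placeholder name.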
Hopefully this helps!
Jen
Galaxy
--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org

That's a neat trick, and I definitely wouldn't have thought of that
approach, so thanks for that!
After I finished writing this out, I realized it was super long. So here
are the questions I'm asking up front, so you can choose whether or not
to read the details. Thanks!
1. How do I output the quality scores when converting from FASTQ to
   FASTA?
2. Does the SAM-to-interval tool output only mapped reads by looking at
   the flag values?
3. Why am I getting the mentioned error and is there a way to resolve
   it?
Here are the details:
1. I don't see an option to output both the sequences and the quality
   scores. I found two FASTQ-to-FASTA converters (one under "Convert
   Formats" and the other in the FastX Toolkit) and both only output
   one FASTA file with the sequences. Am I missing something, or should
   I be using some other tool to output both the sequences and the
   quality scores?
2. The Extract Genomic Sequences tool seems to want an Interval file as
   input, not a list of IDs. Does that mean I should convert the
   filtered SAM output to Interval? Currently I'm using the
   SAM-to-interval conversion to extract the mapped reads and make the
   data more manageable in one step (pretty sure I picked that up from
   one of the tutorials...). I was assuming that by definition it could
   only output an interval if the read was mapped, and if so, I wouldn't
   be able to convert the unmapped reads to Interval anyway. Is that
   wrong?
3. I was setting up a workflow with Bowtie and I noticed that the
   Workflow Editor does show options to output unmapped reads. But when
   I try to output them, I get this error:
"Error due to input mapping of 'Compute quality statistics' in
'output_unmapped_reads_l'. A common cause of this is conditional outputs
that cannot be determined until runtime, please review your workflow."
Superficially, this seems silly. Obviously a "conditional output" will
not be determined until runtime because it's dependent on something
else. So why is that an error? I have tried outputting to a few
different tools, so it doesn't seem to be specific to the tool into
which the unmapped reads go (in this case, "Compute Quality
Statistics").
Any thoughts, insights, or even other approaches to the original problem
would be great. Currently, I'm thinking my best bet is to filter out the
unmapped reads locally with a Perl script and re-upload, but that felt
like overkill and time-consuming when I will inevitably want to tweak or
re-run things. Also, installing a local instance is currently not an
option for me (though it should be in a few months). In any case, I
appreciate your help a lot!
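
For reference, the local filtering step could look roughly like the
sketch below (written in Python rather than Perl). It takes a set of
read names and keeps only the matching records from the original FASTQ
file, so the reads are never converted out of colorspace. Assumptions:
every record is exactly four lines, and the names in the ID set match
the FASTQ header up to the first whitespace. The function names
read_fastq and keep_reads are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records.

    Assumes well-formed input with no wrapped sequence or quality lines.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line, ignored here
        qual = next(it).rstrip("\n")
        # Read name = header without the leading '@', up to first whitespace
        yield header[1:].split()[0], seq, qual


def keep_reads(fastq_lines: Iterable[str], wanted_ids: Set[str]) -> Iterator[str]:
    """Yield FASTQ records (as text) whose read name is in wanted_ids."""
    for name, seq, qual in read_fastq(fastq_lines):
        if name in wanted_ids:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    fastq = [
        "@read1\n", "T0120\n", "+\n", "IIII\n",
        "@read2\n", "T3310\n", "+\n", "IIII\n",
    ]
    for record in keep_reads(fastq, {"read2"}):
        print(record, end="")
```

The sequence and quality lines are copied through untouched, so this
works the same way for colorspace and letterspace reads.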
Thanks, again!
Mayank Tandon

Exactly. Jennifer's solution for outputting unmapped reads involves
splitting the FASTQ file into basically two FASTA files, one with
sequences and the other with the corresponding quality score string. So,
yes, they would be matched files.
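
To make the "matched files" idea concrete, here is a small Python sketch
of the split and the later recombination. It is only an illustration of
the idea, not what the Galaxy tools do internally. It assumes four-line
FASTQ records with no blank or empty lines inside a record, and the
function names split_fastq and join_to_fastq are made up.

```python
from typing import Iterable, List, Tuple


def split_fastq(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split FASTQ into two matched FASTA-style record lists.

    Returns (sequence records, quality-string records), in the same order
    and with the same names. Assumes four-line records, no blank lines.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    seq_fasta, qual_fasta = [], []
    for i in range(0, len(rows), 4):
        header, seq, _plus, qual = rows[i:i + 4]
        name = header[1:]  # drop the leading '@'
        seq_fasta.append(f">{name}\n{seq}\n")
        qual_fasta.append(f">{name}\n{qual}\n")
    return seq_fasta, qual_fasta


def join_to_fastq(seq_fasta: List[str], qual_fasta: List[str]) -> List[str]:
    """Recombine matched sequence and quality records into FASTQ records."""
    out = []
    for s, q in zip(seq_fasta, qual_fasta):
        s_name, s_body = s.rstrip("\n").split("\n", 1)
        q_name, q_body = q.rstrip("\n").split("\n", 1)
        if s_name != q_name:
            raise ValueError(f"records out of sync: {s_name} vs {q_name}")
        out.append(f"@{s_name[1:]}\n{s_body}\n+\n{q_body}\n")
    return out


if __name__ == "__main__":
    fastq = ["@read1\n", "T0120\n", "+\n", "IIII\n"]
    seqs, quals = split_fastq(fastq)
    print("".join(join_to_fastq(seqs, quals)), end="")
```

The name check in join_to_fastq is the part that matters: the two files
only recombine correctly if they list the same reads in the same order.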