Hi everyone,
I have the genome sequence and gene annotation file. Is there a tool on
Galaxy to extract the 5,000 bp upstream, 5,000 bp downstream and genome
sequences of the genes (including exons and introns) from the genome
sequence? Any suggestions are highly appreciated! Thanks!
Yan

Yan,
One way to do this is to create an interval file with the new
coordinates (+/- 5 kb) and then use the Fetch Sequences > Extract
Genomic DNA tool.
To create the new coordinates file, input your annotation file into the
Text Manipulation > Compute tool, using expressions like "c3 = c3-5000"
to get your new coordinates. You'll get 2 new columns in the final
output file. Then use the Text Manipulation > Cut tool to extract the
columns you need to create an interval file.
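As a rough illustration of the Compute-then-Cut idea, here is a minimal
Python sketch (not a Galaxy tool). It assumes a standard 9-column,
tab-separated GFF where start and end are columns 4 and 5; the function
name and the example row are made up, and the column numbers in your own
file may differ. It does not guard against starts that fall below 1.

```python
FLANK = 5000  # bases added on each side, as discussed in this thread


def extend_gff_rows(gff_text, flank=FLANK):
    """Return (seqid, new_start, new_end, strand) for each GFF feature row."""
    intervals = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF feature row
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # "Compute" step: derive the two new coordinate values.
        new_start = start - flank
        new_end = end + flank
        # "Cut" step: keep only the columns an interval file needs.
        intervals.append((seqid, new_start, new_end, strand))
    return intervals


# Hypothetical example row, for illustration only.
example = "chr1\tsrc\tgene\t10000\t12000\t.\t+\t.\tID=g1"
for rec in extend_gff_rows(example):
    print("\t".join(str(field) for field in rec))
```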
Hope this helps.
Cheers,
Graham
Dr. Graham Etherington
Bioinformatics Support Officer,
The Sainsbury Laboratory,
Norwich Research Park,
Norwich NR4 7UH.
UK
Tel: +44 (0)1603 450601
On 24/09/2012 09:02, "Björn Grüning"

Hi Yan,
Both of the other suggestions are good - I'll also give you another
choice to build coordinates before using the "Fetch Sequences -> Extract
Genomic DNA" tool to obtain the fasta sequence.
With your input in BED/Interval format (convert from GFF/GTF if
necessary, using the tool "Convert Formats -> GFF-to-BED"), or the
first 6 columns if it is a BED12 (use "Cut" as needed), run the
"Operate on Genomic Intervals -> Get flanks" tool with these settings:
"Region:" Whole feature
"Location of the flanking region/s:" Both
"Offset" 0
"Length of the flanking region(s):" 5000
Your question is similar to this one (the first part, but I thought you
might be interested in how to just get the flanks, too).
http://user.list.galaxyproject.org/Get-flanks-version-1-0-0-td4604849.html
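To make the "Get flanks" settings listed above concrete (Whole feature,
Both, Offset 0, Length 5000), here is a small Python sketch of the
arithmetic. It is not the Galaxy tool's code: the function name is made
up, coordinates are assumed to be 0-based half-open as in BED, strand is
ignored, and clamping the left flank at 0 is just one possible choice
(the tool itself may handle that case differently).

```python
def get_flanks(start, end, length=5000, offset=0):
    """Return (left_flank, right_flank) as (start, end) tuples.

    Coordinates are 0-based, half-open. The left flank is clamped at 0
    so that no negative coordinate is produced.
    """
    left = (max(0, start - offset - length), max(0, start - offset))
    right = (end + offset, end + offset + length)
    return left, right


# Hypothetical feature at 10000-12000, for illustration only.
left, right = get_flanks(10000, 12000)
print("left flank: ", left)
print("right flank:", right)
```

Note that this yields two separate regions per feature, not a single
upstream+gene+downstream interval.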
Good luck with your project!
Jen
Galaxy team
ps. To search prior questions, please see:
http://galaxy.psu.edu/search/mailinglists/
--
Jennifer Jackson
http://galaxyproject.org

Hi Jen,
Thanks very much for your help! It is very helpful. However, following
your suggestion, what I got is not what I want. Take one sequence for
example. The annotation for one scaffold is
C16582  GLEAN  mRNA  35  385  0.555898  -  .  ID=OYG_GLEAN_10000001;
C16582  GLEAN  CDS   35  385  .         -  0  Parent=OYG_GLEAN_10000001;
What I got for this scaffold is
>?_C16582_385_5385_-
GCAAACAAGC
>?_C16582_385_5385_-
GCAAACAAGC
I understand that it is trying to get the sequence of the gene
downstream from 385-5385, but the sequence is short, so I only get what
the scaffold has. I would like to have the upstream+gene+downstream
sequence at the same time, not only the upstream or downstream. How can
I do this using a Galaxy tool? Thanks!
Yan

Hello Yan,
Unfortunately, the 'Extract' tool can only return sequence that is
present in the target genome, so a flank cannot extend past the end of a
scaffold. This will be a problem with any of the methods. The 'Get
flanks' tool does avoid generating negative coordinates (which would
cause a problem with the 'Extract' tool). But it is not quite giving you
what you want either, assuming that a partially extended sequence, based
on the available data, would be acceptable.
Using the 'Compute' tool may be the best option for your case, now that
the data is clearer. "End" coordinates that extend past the edge of the
chromosome are not a problem, but the "Start" coordinate will need to be
set to 1 (if using GFF3 as interval directly) or 0 (if you converted to
BED - this doesn't appear to be the case). The expression below will
either subtract '5000' from a "Start" coordinate or change it to a "1",
depending on how close it is to the leading edge of the scaffold.
(Modify for BED to be 0-based as needed).
(c2 - 5000) if (c2 > 5000) else (1)
Then add 5000 to the "End" coordinate, 'Cut' columns, and extract as
Graham recommended.
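For reference, the same start-clamping logic can be written as a small
Python function. This is only a sketch for checking the arithmetic
outside Galaxy: the function and parameter names are made up, and the end
coordinate is left unclamped because the scaffold length is not known
here.

```python
def extend_interval(start, end, flank=5000, min_coord=1):
    """Extend an interval by `flank` on both sides, clamping the start.

    min_coord is 1 for 1-based data (GFF3) and 0 for 0-based data (BED).
    """
    # With min_coord == 1 this matches
    # "(c2 - 5000) if (c2 > 5000) else (1)".
    new_start = max(min_coord, start - flank)
    # The end is not clamped: it may run past the end of the scaffold.
    new_end = end + flank
    return new_start, new_end


# Illustrative values: one feature near the scaffold edge, one far from it.
print(extend_interval(35, 385))
print(extend_interval(10000, 12000))
print(extend_interval(35, 385, min_coord=0))  # BED-style, 0-based
```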
I am not going to address the GFF3 format except to say that if you have
gene rows in your data, use those if your target genome has spliced
transcripts. If the data is transcript-based, not gene-based, and is
split between rows (multi-exon), then the processing becomes more
complicated.
One potential solution is the 'Extract' tool. It does not only extract
fasta sequence; it can also be used to combine records for some GFF/GTF
datasets, so you could try this and output "Interval" data instead of
"Fasta". This creates a new GTF file with global coordinates (but the
sequence output will be spliced). Check that the result is correct, run
the 'Compute' tool to do the extensions, 'Cut' columns, and do a final
'Extract' run to obtain the extended, global sequence. All of this
would have to be tested with your data - much depends on the attributes
in your file.
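To illustrate what "global coordinates" means for a multi-exon
transcript, here is a minimal Python sketch that collapses rows sharing a
parent into one span (smallest start, largest end). It is not how the
'Extract' tool works internally. It assumes GFF3-style "Parent="
attributes (GTF files use transcript_id instead), and the function name
and example rows are made up.

```python
import re


def transcript_spans(rows):
    """Collapse feature rows into one (start, end) span per parent.

    rows: iterable of (seqid, start, end, strand, attributes) tuples.
    Returns a dict keyed by (seqid, parent_id, strand).
    """
    spans = {}
    for seqid, start, end, strand, attrs in rows:
        match = re.search(r"Parent=([^;]+)", attrs)
        if match is None:
            continue  # rows without a Parent attribute are ignored
        key = (seqid, match.group(1), strand)
        lo, hi = spans.get(key, (start, end))
        spans[key] = (min(lo, start), max(hi, end))
    return spans


# Hypothetical two-exon transcript, for illustration only.
exons = [
    ("chr1", 100, 300, "+", "Parent=t1;"),
    ("chr1", 600, 900, "+", "Parent=t1;"),
]
for key, span in transcript_spans(exons).items():
    print(key, span)
```

The resulting spans could then be extended by 5000 on each side and
passed on for sequence extraction.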
Hopefully one of these solutions will work out for you,
Jen
Galaxy team
--
Jennifer Jackson
http://galaxyproject.org