How can we use Galaxy to download 300 SRA files at once from the NIH database? We need to get a large set of genome data for processing. We have tried sratoolkit and a common file download manager for batch download of the SRA files, but Galaxy might be a better solution.

Great question. At the moment the best way to retrieve data from NCBI in Galaxy is by using the "NCBI SRA Tools" suite. These tools allow you to import individual data entries or import many data entries as a "collection" ready for batch analysis. I am putting a short guide below that will walk you through importing data from NCBI SRA. Please give it a try and let us know if this strategy works for your application.

Thanks for using Galaxy!

Cheers,

Mo Heydarian

Data retrieval with “NCBI SRA Tools” (fastq-dump)

This short guide walks you through downloading experimental metadata, organizing the metadata into short lists corresponding to conditions and replicates, and finally importing the data from NCBI SRA as collections that reflect the experimental design.

Downloading metadata

Direct your browser to https://www.ncbi.nlm.nih.gov/Traces/study/ and enter a GEO data set identifier in the search box (for example: GSE72018). Once the study appears, click the box to download the “RunInfo Table”.

Organizing metadata

The “RunInfo Table” provides the experimental condition and replicate structure of all the samples. Before importing the data, we need to split this file into individual files, each containing the sample IDs of the replicates in one condition. This can be achieved with a combination of the ‘group’, ‘compare two datasets’, ‘filter’, and ‘cut’ tools, ending up with single-column lists of sample IDs (SRRxxxxx), one list per condition.
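If you would rather prepare the lists outside of Galaxy and upload them, the same splitting can be sketched in a few lines of Python. This is only a rough sketch, not part of the Galaxy tools: it assumes the table is comma-separated, that the accession column is named "Run", and that your table has some column describing the condition (called "treatment" here, which is a made-up name, as are the placeholder accessions in the example). Check the header of your own RunInfo Table and adjust.

```python
import csv
import io
from collections import defaultdict

# Made-up example table with placeholder accessions; a real RunInfo Table
# has many more columns and its condition column may be named differently.
EXAMPLE_TABLE = """Run,treatment
SRR0000001,control
SRR0000002,control
SRR0000003,treated
"""


def split_runs_by_condition(table_text, condition_column,
                            run_column="Run", delimiter=","):
    """Group run accessions by the value found in condition_column."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    for row in reader:
        groups[row[condition_column]].append(row[run_column])
    return dict(groups)


def write_lists(groups, prefix="runs_"):
    """Write one single-column text file of accessions per condition."""
    paths = []
    for condition, runs in groups.items():
        # Keep file names simple: replace anything that is not alphanumeric.
        safe = "".join(c if c.isalnum() else "_" for c in condition)
        path = f"{prefix}{safe}.txt"
        with open(path, "w") as handle:
            handle.write("\n".join(runs) + "\n")
        paths.append(path)
    return paths
```

Calling split_runs_by_condition(EXAMPLE_TABLE, "treatment") groups the placeholder runs into a "control" list and a "treated" list, and write_lists() then writes one file per condition that can be uploaded to Galaxy.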

Importing data

We can now provide the files with SRR IDs to the NCBI SRA Tools (fastq-dump) to import the data from SRA into Galaxy. Because the replicates of each condition are organized in separate lists, the data will be imported as “collections” that can be loaded directly into a workflow or analysis pipeline.
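With a few hundred accessions it can be worth checking each list before handing it to the tool, since a stray header line or a duplicated ID is easy to miss. Below is a minimal sketch of such a check. It is my own helper, not something the NCBI SRA Tools provide, and it assumes run accessions look like SRR, ERR, or DRR followed by digits.

```python
import re

# Assumed pattern for run accessions: SRR / ERR / DRR followed by digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def check_accession_list(lines):
    """Split lines into (valid, rejected) run accessions.

    Blank lines are skipped. Anything that does not look like a run
    accession, and any repeat of an accession already seen, is rejected.
    """
    valid, rejected = [], []
    seen = set()
    for line in lines:
        accession = line.strip()
        if not accession:
            continue
        if RUN_ACCESSION.fullmatch(accession) and accession not in seen:
            seen.add(accession)
            valid.append(accession)
        else:
            rejected.append(accession)
    return valid, rejected
```

You would pass it the lines of one list file, for example check_accession_list(open("runs_control.txt")), where the file name is just an example, and look at the rejected entries before uploading.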