Using Galaxy

Galaxy NGS 101

Learning Hub

See our Learning Hub for coverage of Galaxy user interface concepts, data, and tools. Review "Shared Data → Published Pages" on the Main server usegalaxy.org for publication supplementals and tutorials. Also see the End-User Support Resource Short List section at the top of this wiki.

Screencasts

Screencast videos demonstrate step-by-step procedures for a range of topics. Packed with tips and methods usable across analysis workflows, plus presentations and tutorials for administrators, these are a great resource for both the scientific and technical Galaxy communities.

Biostar

We want to create a space where researchers using Galaxy can come together and share both scientific advice and practical tool help. Whether on http://usegalaxy.org, a CloudMan instance, or any other Galaxy, if you have something to say about using Galaxy, this is the place to do it!

Mailing Lists

Galaxy has one public mailing list for questions, one private mailing list for bug reports, and one announcement mailing list. Please do not post questions to the Galaxy Issue Board; these will only be redirected. Manage subscriptions and learn more about these lists at the Mailing Lists home page.

IRC Channel

Galaxy also has an IRC channel in which you can participate. You can connect to the chat directly via browser here. This IRC channel is an informal online gathering place for the Galaxy community to post questions and help each other out. If you are unfamiliar with IRC, it works much like any other casual chat program and is conducive to quick discussion. There is also plenty of online help about IRC.

Solutions

Tool help

Getting an account

Having your own account on the public Test and/or Main server means that you can save histories, work with more data, associate an OpenID, and get the most out of Galaxy's functionality. Note that the usage policy for the public Test and Main instances is one account per user, as stated in our Terms and Conditions. Also make sure your email address is valid so that you can confirm your new account (emails are case sensitive) and so that our administrators can contact you if needed (rare, but you'll want the email!). More details here.

Finding a tool

At the top of the left tool panel, type in a tool name or data type into the tool search box. Shorter keywords find more choices. Can't find the tool you want? Try looking in the Tool Shed. New tools are added all the time that can be used in local or cloud Galaxy instances.

Loading data

Data is loaded using the tools in the Get Data tool group. Some of these tools access specific data provider sites and load data back into your Galaxy history. To load your own local data or data from another source directly, use the tool Get Data → Upload File (also accessible from the top of the left tool panel, as seen in the graphics below). Want to practice import/export functions with small sample data? Import the Upload sample data history here.

No HTML content is allowed; the loading error generated may state this. Remove HTML fields from your dataset before loading it into Galaxy, or omit HTML fields from the query if importing from a data source (such as BioMart).

Data must be < 50G (uncompressed) to be successfully uploaded and added as a dataset to a history, from any source.

Is the problem the dataset format or the assigned datatype? Can this be corrected by editing the datatype or converting formats? See Learn/Managing Datasets for help or watch the screencast above for a how-to example.

Problems in the first step of working with your loaded data? It may not have uploaded completely. If you used an FTP client, its transfer log will indicate whether a load was successful, and most clients can restart interrupted loads. This makes FTP a great choice for slower connections, even when loading small files.

Downloading data

Download datasets by clicking on the disk icon inside the dataset. This works well for smaller datasets in all browsers.

Download entire histories by selecting "Export to File" from the History menu, and clicking on the link generated.

Transfer entire histories by selecting "Export to File" from the History menu and generating the link. Copy the link in the "from" Galaxy instance; then, in the "to" Galaxy instance, select "Import from File" from the History menu and paste the link into the form.

* The video Datasets 1 includes help about different datatypes and what to expect in the download icon (one file or two!).

Download tip: Big data

Utility option: from a shell/unix/terminal window on your computer use wget or curl.

The link can be obtained by right-clicking the floppy disk icon inside a history item and choosing "Copy Link Location" (for most datasets) or "Download Dataset/Download bam_index" (for BAM datasets there are two downloads). Once you have the <link>, type this (where "$" indicates the terminal prompt), wrapping the <link> in single quotes. Like many commands, these have many options; the examples shown are those commonly used with Galaxy.
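As a sketch of what those commands look like (the LINK value below is a placeholder, not a real dataset URL — substitute the link you copied; the commands are built as strings here rather than run, since they require a live server):

```shell
# Placeholder for the URL copied via "Copy Link Location" on the disk icon.
LINK='https://usegalaxy.org/datasets/<id>/display?to_ext=data'

# wget: -O names the local output file.
WGET_CMD="wget -O dataset.dat '$LINK'"

# curl: -L follows redirects, -o names the local output file.
CURL_CMD="curl -L -o dataset.dat '$LINK'"

echo "$WGET_CMD"
echo "$CURL_CMD"
```

The quoting matters: dataset links usually contain `?` and `&`, which the shell would otherwise interpret, so keep the link inside single quotes when you type the command.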

Example: Dataset metadata missing or incomplete

How to notice if this is a problem

The dataset will not download when using the disk icon

A tool errors when using a specific dataset that was used successfully before

A tool errors with a message that ends with: OSError: [Errno 2] No such file or directory. Note that not all failures of this type are due to metadata; some are simply cluster failures, and re-running the job may resolve the problem instead, but try the solution below first.

Solution

Reset the metadata on the dataset(s). This may be an uploaded dataset or one created by prior tools. It could be one of the input datasets to a failed job.

How to: Click on the Auto-detect button found near the bottom of the Edit Attributes form for the dataset. Reach this form using the dataset's icon.

If this does not resolve the problem

If resetting metadata does not fix the issue, then there may have been a transient cluster job failure. Re-run the job at least once.

Other problematic dataset solutions are listed here, but these are not based on the same underlying issue.

When a tool is executed, one or more new datasets are added to a history. The same is true when a workflow is executed. If using the public Main Galaxy instance, the most effective strategy when running jobs on the shared resource is to start jobs (or workflows), and then leave them alone to execute until completion.

When work is urgent during peak-usage times on the public Main Galaxy instance, a CloudMan instance is a quick-to-implement alternative. For large-scale and/or urgent ongoing work, a CloudMan, Local, or SlipStream Galaxy each have advantages as a longer-term solution. Read more ...

So, how does the processing of tool jobs on Main actually work?

The color of a dataset designates the current status of the underlying job.

Green

The job completed successfully.

The resulting data is ready to be used in visualizations, available as input to tools, can be downloaded, or utilized for any other downstream purpose.

Yellow

The job is executing. Allow this to complete!

If you are using the public Main Galaxy instance, this job is running on one of our clusters. Different types of tools send jobs to different clusters appropriate for the requirements of each tool. Some tools are more compute intensive than others, and significant resources are dedicated to job processing. Jobs have up to 72 hours to complete; if they run longer than this, they will fail with a "wall-time" error and turn red. Examining tool parameters is the first option: less sensitive parameters may produce an equally acceptable result while using fewer resources. If that is not appropriate or does not succeed, a CloudMan Galaxy or Local Galaxy with sufficient resources may be the solution.

Grey

The job is being evaluated to run (new dataset) or is queued. Allow this to complete.

If you are using the public Main Galaxy instance, this job is queued, waiting for an opening on the appropriate cluster. It is very important to allow queued jobs to remain queued, and not to delete/re-run them. Re-running not only moves the new job back to the end of the queue, effectively lengthening the wait time, but if done repeatedly, the volume of "executing deleted" jobs creates additional cleanup processes in the history as these are cleared away, using up resources and causing further delays.

Red

The job has failed.

There can be many reasons for this, see the next section, Error from tools for details.

Light blue

The job is paused.

This indicates either an input has a problem or that you have exceeded disk quota set by the administrator of the Galaxy instance you are working on.

If there is an input problem, correct it (often by re-running an upstream job) and click on the tool form option to "resume dependencies". In most cases you will not need to stop or restart downstream jobs; through this method, paused jobs will start as their input datasets become available.

If you need to make room, permanently delete unneeded data. If you are using the public Main Galaxy instance, disk quotas are defined here. You will not need to delete/re-run paused jobs while doing this (only purge other data; do not restart at this time), unless you are trimming your work to avoid exceeding quota again. Instead, restart using the History menu option "Resume Paused Jobs".

Grey, Yellow, Grey again ???

The job is waiting to run, due to an admin re-run or an automatic failover to a longer-running cluster (currently, Stampede).

First, see the descriptions for grey and yellow jobs above.

The job was first submitted to the default cluster, but did not finish within the "wall-time" quota. Instead of failing, the job was automatically submitted to the long-running cluster Stampede, which offers the job more execution time. The wait may be longer, since jobs run on this cluster by other users also execute for a longer time period.

Stopping (deleting) the job and then restarting places it back at the end of the first queue, where the cycle will begin again, extending wait time even further. Please do not do this. Allow the job to process.

If the job fails after running on Stampede, then it is too large to run on http://usegalaxy.org, also known as "Main". Choose another strategy to execute your job on a different Galaxy platform or consider modifying inputs/parameters to make the job less compute intensive.

Bright blue with moving arrow (deprecated)

The job may run immediately, or may turn grey if the server is busy, meaning that the guidelines for grey jobs apply: these grey datasets should never be deleted/re-run, for the same reasons explained above.

An upload job that seems to stay in the "bright blue with moving arrow" state for a very long time generally indicates that the file being loaded is too large for the method used (specifically, a browsed-file upload) and FTP should be used instead. This is the only active job that should be deleted under normal usage, as it will never complete (no file over 2G will ever load via file browser upload).

Error from tools

Dataset format problems are the #1 reason that tools fail. Most likely this problem was introduced during the initial data upload. Double check the dataset against Galaxy's datatypes or external specifications. In many cases, the format issues can be corrected using a creative combination of Galaxy's tools.

Note: not all formats are outlined in detail, as some are common types or derived from a particular source. Read the target tool's help, ask the tool authors, or even just google for the most current specification.

Is the problem the dataset format or the assigned datatype? Can this be corrected by editing the datatype or converting formats? Often a combination of tools can correct a formatting problem, if the rest of the file is intact (completely loaded).

Example (NGS: Mapping tools): On the tool form itself is brief help plus links to publications and the tool author's documentation and/or website. If you are having trouble with Bowtie, look on this tool's form for more information, including a link to this website: http://bowtie-bio.sourceforge.net/index.shtml.

Example (NGS: SAM Tools): SAMTools requires that all input files be to specification (Learn/Datatypes) and that the same exact reference genome is used for all steps. Double-checking format is the first check. Double-checking that the same exact version of the reference genome is used is the second. The last check is that the number of jobs and size of data on disk are under quota. Problems with this set of tools are rarely caused by other issues.

Tools for fixing/converting/modifying a dataset will often include the datatype name. Use the tool search to locate candidate tools, likely in tool groups Text Manipulation, Convert Formats, or NGS: QC and manipulation.

The most commonly used tools for investigating problems with upload, format and making corrections are:

Job failure reason: cancelled by admin or a cluster failure

The initial error message will be reported as below, and is found in the comments of a failed dataset (red dataset):

This job failed because it was cancelled by an administrator.
Please click the bug icon to report this problem if you need help.

Other reported errors indicate a cluster failure in the error report (click on the bug icon to review). These often do not need to be submitted, as the failure error message describes the problem and the correction path.

The error indicates that the job was likely given inputs and/or parameters that are malformed, do not meet the requirements for the tool's usage, or are very computationally intensive. See Troubleshooting tool errors. These are the exact same checks a submitted bug report is reviewed for.

It is also possible that there was a server or cluster error. A re-run is the first-pass solution for nearly all failed jobs. Exceptions may be if the error is clarified as exceeding memory or job execution time (see the next sections).

See the two sections below for details about how to determine and resolve the root cause of the error.

If, after reviewing and re-running, the cause of the error is still unclear:

Review the parameters for the tool and determine if any changes made away from the default setting (or possibly the detail settings) are compute-intensive. Make changes if they suit your research goals. See the underlying tool's documentation to better understand specific parameters. This is often linked in the Help section on a tool's execution form.

If the tool used was one that compares two datasets, change the order of the inputs and test with a re-run. Some tools consume less memory when the larger dataset is entered as the first input on the tool form.

In some cases, reporting the memory issue to our team through the "green bug" icon on a dataset is a good way to let us know about tools that run out of memory. We probably cannot solve your issue directly, but cumulative feedback helps us learn which tools would benefit from additional resource allocation.

If the job remains too large to run on the public Main Galaxy instance at http://usegalaxy.org, then moving to an instance where more memory can be allocated to jobs is the solution. A good choice is CloudMan as processing memory is easily scaled up. AWS in Education grants can help with costs. Another option is setting up a local Galaxy, if you have a computer/server with enough processing memory (16 GB minimum, but more is likely needed if the jobs are large, possibly up to 64 GB).

The full error message will be reported as below, and can be found by clicking on the bug icon for a failed job run (red dataset):

job info:
This job was terminated because it ran longer than the maximum allowed job run time.
Please click the bug icon to report this problem if you need help.

The error indicates that the job execution time exceeded the "wall-time" on the cluster node that ran the job. "Wall-time" is the maximum amount of time any job has to complete before it is terminated. When using the public Main Galaxy instance at http://usegalaxy.org, see the walltime available here.

Sometimes the execution time of a job can be shortened by adjusting the inputs, the parameters used, or the cluster used (try Stampede or Jetstream, if available on the tool form under the section Job Resource Parameters). This solution is similar to that for jobs that error because they exceed memory allocation.


Give the longer-running cluster a try; see the Main wiki's section about Stampede (Jetstream is also a choice).

If the job remains too large to run on the public Main Galaxy instance at http://usegalaxy.org, then moving to an instance where more resource can be allocated for jobs is the solution. A good choice is CloudMan. AWS in Education grants can help with costs.

Tool doesn't recognize dataset

This is usually a simple datatype assignment incompatibility between the dataset and the tool. The expected input datatype format is explained on the tool form itself under the parameter settings. Convert formats, or modify the datatype using the dataset's icon to reach the Edit Attributes form. Many metadata attributes can be edited on these forms, including the database. You may need to first create a Custom Build when using a Custom Reference Genome.

Dataset special cases

FASTQ Datatype QA

If the required input is a FASTQ datatype, and the data is a newly uploaded FASTQ file, run FastQC then FASTQ Groomer as first steps, then continue with your analysis. Watch the FASTQ Prep Illumina screencast for a walk-through.

If you are certain that the quality scores are already scaled to Sanger Phred+33 (the result of an Illumina 1.8+ pipeline), the datatype ".fastqsanger" can be directly assigned. Click the icon to reach the Edit Attributes form. In the center panel, click on the "Datatype" tab (3rd), enter the datatype ".fastqsanger", and save. Metadata will assign, then the dataset can be used.

If you are not sure what type of FASTQ data you have, see the help directly on the FASTQ Groomer tool for information about types.

For Illumina, first run FastQC on a sample of your data (how to read the full report). The output report will note the quality score type interpreted by the tool. If not ".fastqsanger", run FASTQ Groomer on the entire dataset. If ".fastqsanger", just assign the datatype.
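As an illustration of what FastQC is inferring, here is a rough heuristic sketch (on a made-up two-read file) that guesses the encoding from the ASCII range of the quality-line characters. FastQC's actual detection is more thorough, so treat this only as a quick sanity check:

```shell
# Toy FASTQ file, invented for illustration only.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\n!!#5\n' > toy.fastq

# Quality lines are every 4th line. Sanger Phred+33 qualities fall roughly in
# ASCII 33 ("!") through 74 ("J"); values well above that suggest an older
# Illumina Phred+64 scale. This is a heuristic, not a definitive test.
GUESS=$(awk 'NR % 4 == 0 {
    if (!built) {               # build a char -> ASCII-code lookup table once
        for (j = 33; j <= 126; j++) ord[sprintf("%c", j)] = j
        built = 1
    }
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (min == 0 || ord[c] < min) min = ord[c]
        if (ord[c] > max) max = ord[c]
    }
} END {
    print (max <= 74 ? "fastqsanger-like (Phred+33)" : "possibly Phred+64")
}' toy.fastq)

echo "$GUESS"
```

On real data you would run this over a sample of the file; when in doubt, let FASTQ Groomer do the conversion.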

For SOLiD, run NGS: Fastq manipulation → AB-SOLID DATA → Convert, to create a ".fastqcssanger" dataset. If you have uploaded a color space fastq sequence with quality scores already scaled to Sanger Phred+33 (".fastqcssanger"), first confirm by running FastQC on a sample of the data. Then, if you want to double-encode the color space into pseudo-nucleotide space (required by certain tools), see the instructions on the Fastq Manipulation tool form for the conversion.

If your data is FASTA, but you want to use tools that require FASTQ input, use the tool NGS: QC and manipulation → Combine FASTA and QUAL. This tool will create "placeholder" quality scores that fit your data. On the output, click the icon to reach the Edit Attributes form. In the center panel, click on the "Datatype" tab (3rd), enter the datatype ".fastqsanger", and save. Metadata will assign, then the dataset can be used.

Tabular/Interval/BED Datatype QA

If the required input is a Tabular datatype, other datatypes that are in a specialized tabular format, such as .bed, .interval, or .txt, can often be directly reassigned to tabular format. Click the icon to reach the Edit Attributes form. In the center panel, using the tabs to navigate, change the datatype (3rd tab) and save, then label the columns (1st tab) and save. Metadata will assign, then the dataset can be used.

If the required input is a BED or Interval datatype, the reverse (.tab → .bed, .tab → .interval) may be possible using a combination of Text Manipulation tools, to create a dataset that matches the BED or Interval datatype specifications.
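As one minimal sketch of such a conversion, assuming a three-column tabular file laid out as chromosome, 1-based start, and end (check your own file's layout first — this column order is an assumption): BED coordinates are 0-based and half-open, so the start shifts down by one.

```shell
# Toy three-column tabular input (chrom, 1-based start, end), invented here.
printf 'chr1\t100\t200\nchr2\t1\t50\n' > regions.tab

# Shift the start column down by one to produce 0-based BED coordinates.
awk 'BEGIN { OFS = "\t" } { print $1, $2 - 1, $3 }' regions.tab > regions.bed

FIRST=$(head -1 regions.bed)
echo "$FIRST"
```

Inside Galaxy, the Text Manipulation tool Compute an expression on every row can perform the same arithmetic, followed by assigning the .bed datatype.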

Reference genomes

Using the same exact reference genome for all steps in an analysis is often mandatory to obtain accurate results. To use the reference genomes available on usegalaxy.org (Main), get the genome from our rsync server.
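A sketch of what fetching from the rsync server can look like from a terminal; the server address and path layout below are assumptions based on how the Main rsync service has historically been documented, so confirm them on the current wiki before use (the commands are built as strings here, since they require network access):

```shell
# Historically documented address of the Galaxy Main data server (assumption —
# verify against the current wiki before running).
RSYNC_SERVER='rsync://datacache.g2.bx.psu.edu/indexes'

# List the available genome directories without downloading anything:
LIST_CMD="rsync -a --list-only $RSYNC_SERVER/"

# Fetch one genome's sequence files, e.g. hg19, into a local directory:
FETCH_CMD="rsync -avzP $RSYNC_SERVER/hg19/seq/ ./hg19_seq/"

echo "$LIST_CMD"
echo "$FETCH_CMD"
```

Fetching the genome this way, then using it as a Custom Reference Genome, guarantees your local steps use byte-identical sequence to the Main server's cached copy.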

Detecting Genome Mismatch Problems

How can I tell if I have a reference genome mismatch problem?

There isn't one single error that points to this problem. But if you are running a tool for the first time using a newly uploaded dataset, and an error occurs (or, more likely, simply unexpected results are produced), double-checking the reference genome would be a good choice.

Correcting Chromosome Identifier Conflicts

I suspect there is a problem with the identifiers but how can I check? Or better, how can I fix the problem?

A quick way to check for this issue is to compare the chromosome identifiers in the input datasets to each other and to the reference genome used (or intended to be used).

Even small differences in identifiers can cause tools to fail, produce warnings, or create incomplete results. This is the second most common cause of usage-related tool failures (input format problems are the first).

Using an Ensembl-based chromosome identifier file on Galaxy Main with a locally cached reference genome? Most built-in, native, reference genomes are sourced from UCSC and have UCSC-based identifier names. When using inputs with both versions of identifiers in the same analysis pipeline, there will almost certainly be errors or unexpected results. But, in many cases, inputs from the history can be adjusted to match the cached data, all within Galaxy. Read more about how...
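As one illustration of such an adjustment (shown outside Galaxy for comparison), the simplest Ensembl-to-UCSC cases prepend "chr" and rename MT; anything beyond the primary chromosomes (scaffolds, patches) needs a real mapping table, so treat this as a sketch of the idea only:

```shell
# Toy BED-style file with Ensembl-style identifiers in column 1, invented here.
printf '1\t100\t200\nMT\t5\t50\n' > ensembl.bed

# Rewrite column 1 to UCSC style: MT -> chrM, everything else gets "chr" prefixed.
# Only the simple primary-chromosome cases are handled.
awk 'BEGIN { OFS = FS = "\t" }
     { if ($1 == "MT") $1 = "chrM"; else $1 = "chr" $1; print }' \
    ensembl.bed > ucsc.bed

FIRST_ID=$(cut -f1 ucsc.bed | head -1)
LAST_ID=$(cut -f1 ucsc.bed | tail -1)
echo "$FIRST_ID $LAST_ID"
```

Within Galaxy, the Text Manipulation tools (e.g. Add column, Merge Columns, Cut) can be combined to perform the same rewrite on a history dataset.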

Why isn't my Ensembl GTF compatible with Cufflinks and how can I use Ensembl GTFs with Cufflinks?

First, determine if an Ensembl GTF is the best choice. If an iGenomes version is available, this has advantages due to the addition of specific attributes utilized by the RNA-seq Tuxedo pipeline. Check at the Cufflinks website here.

Download the .tar file locally, uncompress it, then upload only the .gtf file to Galaxy. Loading .tar archives is not supported and has unpredictable outcomes: sometimes the first file in the archive will load (but this is not the file you need), sometimes only a portion of the first file will load (without a warning), and other times an upload error will result. None of these cases should be reported as a bug report/tool error.

For certain genomes, the reference annotation GTF file is available on the public Main Galaxy instance, http://usegalaxy.org, under Shared Data -> Data Libraries -> iGenomes.

Avoiding Genome Mismatch Issues

When moving between instances, what can be done to mitigate the risk of using the wrong assembly?

When moving between a Galaxy CloudMan AMI and the public Main Galaxy instance, just make sure the database name is the same. If the assigned database name is the same, the content of the reference genome is the same.

When moving between a local Galaxy and the public Main Galaxy instance, there are a few choices:

Reference Genomes and GATK

Shared and Published data

Have you been asked to share a history? Or has someone shared a workflow with you but you're not sure where to find it? Or maybe you just want to find out more about how publishing your work in Galaxy can be used to support your next publication? Watch the how to Share and Publish screencast and read more here.

Reporting tool errors

If running a tool on the public Galaxy server (i.e., http://usegalaxy.org) is resulting in an error (the dataset is red), and you can't determine the root cause from the error message or input format checks:

Re-run the job to eliminate transitory cluster issues.

Report the problem using the dataset's icon. Do not submit an error for the first failure, but leave it undeleted in your history for reference.

IMPORTANT: Get the quickest resolution by leaving all of the input and output datasets in the analysis thread leading up to the error undeleted in your history until we have written you back. If necessary, use Options → Include Deleted Datasets and click a dataset's undelete link to recover error datasets before reporting the problem.

Include in the bug report what checks confirmed that data format was not an issue

Anything else you feel is relevant to the error

We do our best to respond to bug reports as soon as possible.

Please send all email as reply-all as we work to resolve the error. The galaxy-bugs address we correspond from is internal to the Galaxy team, which works together to resolve reported problems.

If you have resolved the issue, a reply to the bug report to let us know is appreciated.

Interpreting scientific results

A double check against the tool help and documentation is the first step. If the tool was developed by a 3rd party, they are likely the best experts for detailed questions. Tool forms have links to documentation/authors.

Tools on the Test server

Test tool errors reported as bug reports (#Error from tools) are considered low priority and may not receive a reply.

General feedback & discussion threads (instead of questions requiring a reply from the Galaxy team) are welcomed at the development mailing list.

Exceptions are possible. Sometimes community users help to test-drive new functionality. If you are interested in this type of testing for a particular tool, contact us on the development mailing list.

Tools on the Main server: RNA-seq

Example → RNA-seq analysis tools.

Read the Galaxy team's publication, then review the live supplement and try the tutorial.

Using the same reference genome for all steps is very important. Even small differences in chromosome/scaffold names can result in errors. Double check that the naming between the reference genome and any other inputs such as SAM/BAM and GTF datasets all use the same naming conventions. See our FAQ for more help if this is suspected to be the root cause of an error.

Confirming GTF/GFF3 data sources using gffread can be one way to discover where problems are.

Still stuck?

If the tool form help, publications and supplements/tutorials (including those from the community under Learn), the Transcriptome FAQ, or the tool authors' websites do not address the question or problem, then contacting the tool authors is often the next step for detailed algorithm questions: mailto:tophat.cufflinks@gmail.com.

If you ended up with a failed dataset (red), it is sometimes better to submit that instead as a tool error (bug report) unless the question is general.

We can help or guide you to help. Whenever sharing or submitting a history for feedback, please be sure to leave the datasets in the analysis thread undeleted so that we can offer the best advice.

Custom reference genome

Often the quickest way to get your analysis going is to load a custom genome for your own use. Simply upload the FASTA file using FTP and use it as the "reference genome from the history" (wording can vary slightly between tools, but most have this option). Learn more about how to set up and use a Custom Genome including how to create a Custom Build.

Videos

Best Practices

Use the same custom genome for all the steps in your analysis that require a reference genome. Don't switch or the data can become mismatched in your files, preventing downstream work.

To add a custom Genome Build so that it can be assigned as a "database" attribute, or to make it known/available to certain tools, create it under "User → Custom Builds". More details here....

TIP: To modify a dataset to have an unassigned reference genome, use the icon to reach "Edit Attributes". On the form, set the attribute Database/Build: to "unspecified (?)" and submit. Any prior assignments will be removed.

If your genome is available on usegalaxy.org (Main), but just not indexed for the tool you want to use, you can get the genome from our rsync server. This ensures that all of your work uses the same exact reference genome for every step in an analysis, a critical part of a successful experiment.

If you find downstream tool errors after using a Custom reference genome in an upstream tool on usegalaxy.org (Main), there is good cause to suspect a reference genome mismatch problem. This generally means that the Custom genome needs to be changed to use ours, or that you need to use the Custom genome for all downstream tools, too.

Quick genome access

If your genome is small (bacterial, etc.), using it as a Custom Reference Genome is the quickest way to get it into Galaxy and start using it with tools.

Obtain a FASTA version, load it using FTP, and use it from your history with tools.

Tools on the Main server: Extract DNA

Example → Fetch Sequences: Extract Genomic DNA

Start by loading the custom reference genome in FASTA format into your history as a dataset, using FTP if the dataset is over 2G in size.

Load or create an appropriate Interval, BED, or GFF coordinate dataset into the same history.

On the Extract Genomic DNA tool form, you will use the options:

"Source for Genomic Data:" as "History"

Next, for the new menu option "Using reference file", select the FASTA dataset of your target genome from your active history.

Tools on the Main server: GATK

To use other genomes, load in fasta format and prepare as a Custom genome/build.

Note that GATK requires a specific reference genome sort order. The general guideline is "chr1, chr2, chr3,.... chrX, chrY, chrM" (followed by other partial chromosomes sorted in alphabetical order). Use tools in the group "Text Manipulation", "Convert Formats", and "Sort and Filter" to perform any needed rearrangement.
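As a sketch of what such a rearrangement amounts to, here is a toy reordering done with awk on a made-up four-record FASTA. For a real genome, extend the order list to the full chr1..chr22, chrX, chrY, chrM set (and, within Galaxy, the tool groups named above accomplish the same thing):

```shell
# Toy FASTA, deliberately out of karyotypic order, invented for illustration.
printf '>chrM\nAAAA\n>chr1\nCCCC\n>chrX\nGGGG\n>chr2\nTTTT\n' > toy.fa

# Buffer each FASTA record by its sequence name, then emit records in the
# desired karyotypic order. RS=">" makes each record one awk "line".
awk 'BEGIN { RS = ">"; ORS = "" }
     NR > 1 {
         name = substr($0, 1, index($0, "\n") - 1)   # header line of the record
         rec[name] = ">" $0
     }
     END {
         n = split("chr1 chr2 chrX chrM", want, " ")
         for (i = 1; i <= n; i++) if (want[i] in rec) print rec[want[i]]
     }' toy.fa > sorted.fa

FIRST_HEADER=$(head -1 sorted.fa)
echo "$FIRST_HEADER"
```

Note this buffers the whole genome in memory, so for full-size genomes a dedicated tool is a better fit; the point is only to show what "karyotypic order" means concretely.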

It is best to use the same exact reference genome for all steps, or problems can occur downstream, often requiring the analysis to be started over (from mapping, when the genome was first used).

Want to use hg19? The genome is available as a GATK-sorted version under "Data Libraries -> GATK". Import the fasta file into your history, then proceed with using as a custom genome with tools.

Community Q & A

If you have a development topic to discuss, your data/tool situation has not come up before, and/or troubleshooting has failed (including at least one re-run, as explained in Error from tools above), post to a list or Galaxy Biostar.

Note: If your question is about a job failure on Main, start by reviewing the troubleshooting help for Tool Errors. If the data input checks and the job error message don't resolve the issue, please use the tool error submission form from the red error dataset instead of starting a public mailing list discussion thread (do not delete error datasets). Read more ...

What to include in a question

Where you are using Galaxy: Main, other public, local, or cloud instance

End-user questions from Test are generally not sent/supported (Test is for breaking)

If a local or cloud instance, the distribution or galaxy-central hg pull #

What troubleshooting steps (if a problem is being reported) you have tested out

If on Main, you may be asked for a shared history link. Use Options → Share or Publish, generate the link, and email it directly back off-list. Note the dataset #'s you have questions about.

IMPORTANT: Get the quickest answer for data questions by leaving all of the input and output datasets in the analysis thread in your shared history undeleted until we have written you back. If necessary, use Options → Show Deleted Datasets and click a dataset's undelete link to recover datasets.

Discussion threads are open to the entire community and the Galaxy team to answer

Always reply-all unless sharing a private link

Reporting a software bug

Bug or Error from tools? Sometimes it is hard to tell. If you are on the public Main instance and ran a tool that produced a red error dataset, you will probably want to start by reporting this as a Tool Error, but add comments about your suspicions of a bug if there is something odd about the job failure.