Hello I'm just getting started with the Galaxy 101 tutorial and I wanted to report what might be a few problems. I'm not sure if this is the place to make a report (or get help) since the tutorial is at https://github.com/nekrut/galaxy/wiki/Galaxy101-1. I am running with my own local copy of Galaxy that I recently downloaded.

Disclaimer: I'm a software developer / engineer with very little experience in biology.

The instructions for the 101 tutorial seem to be a bit outdated. They reference data sets that aren't available in the location mentioned (although a little poking around revealed them in a nearby archive). They also refer to menu items that have been moved/renamed in the interface.

It might also be good to provide an estimate of the processing time and resulting database size for the join operation. With the data I used (which was a bit of a guess ... see the previous point about the data sets), the join operation appeared to hang. I restarted it a few times, with the same result. Eventually I decided to let it run overnight, and it did finish, but it generated a large data set that alarmed my system administrator. If this is expected, it might be good to mention it in the tutorial. If this isn't expected, then I may be doing something wrong and welcome any help available.

If anyone can update the tutorial (or direct me to a more updated version) that would be very welcome!!

The current version is not very old, and I can explain what I think is going on and set some expectations.

For the specific long-running Join job with large output: My guess is that at the step where the data is extracted from UCSC, the entire genome was selected, whereas the tutorial instructs you to filter by a single chromosome. This is a common mistake for those new to the UCSC Table Browser, and the tutorial has specific instructions on this filtering, along with graphics. Maybe try again, making sure that this step is done with the filter?
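One way to confirm that the chromosome filter took effect is to download the history item and count its records per chromosome. The following is only a minimal sketch: the function name `count_chromosomes` and the sample rows are made up for illustration, and it assumes a BED-style file with the chromosome in the first column.

```python
from collections import Counter
import io

def count_chromosomes(lines):
    """Count BED records per chromosome (first column), skipping header and blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional comment/header lines a BED file may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        counts[line.split()[0]] += 1
    return counts

# Hypothetical sample standing in for a downloaded history item.
sample = io.StringIO(
    "chr22\t100\t200\texonA\n"
    "chr22\t300\t400\texonB\n"
    "chr1\t500\t600\texonC\n"
)
counts = count_chromosomes(sample)
print(dict(counts))
```

If anything other than chr22 shows up in the counts, the region filter was probably not applied when the data was sent to Galaxy. For a real file, pass an open file handle instead of the sample.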

For the database queries: The specific database versions noted in the tutorial change every night (RefSeq) to every few months (dbSNP). Using the latest version of each is just fine. The goal is to learn the steps and interface - the actual results won't be exact and that is expected.

For job run time estimates: How long the query will take depends on where the job is run (public server, local, cloud instance). The resources available and other concurrent tasks make a big difference, just as they do for command-line jobs, so estimating is not practical. However, I can give you a general time estimate to complete the whole thing (also noted in the tutorial): the entire tutorial might take anywhere from 45 minutes up to 2 hours for new users on a reasonably robust local or cloud Galaxy (at least 16 GB RAM) or on the public Main Galaxy server at http://usegalaxy.org (as long as these are the only queries running from the account).

General advice for running jobs in Galaxy: It is almost never a good idea to stop and restart jobs. Restarted jobs are added to the back of the queue, which always increases wait time. Deleted jobs take resources to clear, which can cause delays on the database side, especially if there are several. The best advice is to start jobs and allow them to complete, unless you know that the job had an entry error and needs to be redone (not simply an exact "re-run"). More: https://wiki.galaxyproject.org/Support#Dataset_status_and_how_jobs_execute

It seems like your local instance of Galaxy does not have the Join tool you need. To join features based on genomic position, you need to use the "Join the INTERVALS of two datasets side-by-side" tool. For reference, you can go to useGalaxy.org and see the version of Join you need (Operate on Genomic Intervals -> Join the INTERVALS of two datasets side-by-side).

You can add this tool to your local Galaxy by giving yourself admin privileges and installing the tool from the toolshed onto your local instance. Instructions on how to do so can be found here (https://wiki.galaxyproject.org/Admin/GetGalaxy) under the 'Become an Admin' header.

Also, when I check what I assume is the output folder (galaxy/database/files/000) I'm already seeing a dataset_19.dat file that's at 50 gigabytes and growing steadily. It looks well on its way toward another 700GB result. The dataset_17.dat file is 933KB and the dataset_18.dat file is 17MB.

The most current version of the tutorial instructs you to use UCSC Known Genes as the gene track, and to limit it to a single chromosome, "chr22". That part is very important. This is set under "regions" on the UCSC Table Browser form, as shown in the graphic.

Also - the two join tools are very different. One joins on a common key (value in a column). The other joins by looking for genomic footprint overlap. You want to use the second - these tools are in the Tool Shed. Go to http://usegalaxy.org/toolshed and search for "GOPS" to locate the repo for review: suite_gops_1_0. Now go in through the Admin functions of your Galaxy instance and install this for use.
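To make the difference between the two kinds of join concrete, here is a small sketch. These are toy functions written for illustration, not the code behind either Galaxy tool; the names `join_on_column` and `join_on_overlap` and the sample rows are hypothetical, and the overlap test assumes BED-style half-open coordinates.

```python
def join_on_column(left, right, col=0):
    """Key join: pair every left row with every right row that has the same value in `col`."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[col], []).append(row)
    return [l + r for l in left for r in by_key.get(l[col], [])]

def join_on_overlap(left, right):
    """Interval join: pair rows on the same chromosome whose [start, end) ranges overlap."""
    out = []
    for l in left:
        for r in right:
            if l[0] == r[0] and l[1] < r[2] and r[1] < l[2]:
                out.append(l + r)
    return out

# Hypothetical rows: (chromosome, start, end, name)
exons = [("chr22", 100, 200, "exonA"), ("chr22", 500, 600, "exonB")]
snps = [
    ("chr22", 150, 151, "snp1"),
    ("chr22", 550, 551, "snp2"),
    ("chr22", 900, 901, "snp3"),
]

key_rows = join_on_column(exons, snps)   # joins on column 1 (the chromosome name)
overlap_rows = join_on_overlap(exons, snps)
print(len(key_rows), len(overlap_rows))  # prints "6 2"
```

Joining on the chromosome column pairs every exon with every SNP on that chromosome, while the interval join keeps only the pairs that actually overlap. On real data that difference is what separates a small result from a very large one.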

Thanks for your response. I too have been wondering if we're looking at the same tutorial. For example, it doesn't say "UCSC Known Genes" in the version I've been using. I've reproduced what I see at: https://github.com/nekrut/galaxy/wiki/Galaxy101-1 starting with step 1. below:

1. Getting data from UCSC

1.0. Getting coding exons

First thing we will do is to obtain data from UCSC by clicking Get Data -> UCSC Main:

You will see UCSC Table Browser interface appearing in your browser window:

Make sure that your settings are exactly the same as shown on the screen (in particular, position should be set to "chr22", output format should be set to "BED - browser extensible data", and "Galaxy" should be checked within the Send output to option). Click get output and you will see the next screen:

here make sure Create one BED record per: is set to "Coding Exons" and click Send Query to Galaxy button. After this you will see your first History Item in Galaxy's right pane. It will go through gray (preparing) and yellow (running) states to become green:

1.1. Getting SNPs

Now is the time to obtain SNP data. This is done almost exactly the same way. First thing we will do is to again click on Get Data -> UCSC Main:

but now change group to "Variation":

so that the whole page looks like this:

click get output and you should see this:

where you need to make sure that Whole Gene is selected ("Whole Gene" here really means "Whole Feature") and click Send Query to Galaxy button. You will get your second item in the history:

Now we will rename the two history items to "Exons" and "SNPs" by clicking on the Pencil icon adjacent to each item. After changing the name scroll down and click Save. Also we will rename history to "Galaxy 101 (2015)" (or whatever you want) by clicking on Unnamed history so everything looks like this:

2. Finding Exons with the highest number of SNPs

2.0. Joining exons with SNPs

Let's remind ourselves that our objective was to find which exon contains the most SNPs. This first step in answering this question will be joining exons with SNPs (a fancy word for printing exons and SNPs that overlap side by side). This is done using Operate on Genomics Intervals -> Join tool:

make sure your Exons are first and SNPs are second and click Execute. You will get the third history item:

I see now. Some of the graphics have Gencode, some have Known Genes. Either is Ok to use (as are most tracks in the group). Just be sure to limit the region by chromosome and use the correct Join tool - see my prior post. Jen

Note that the "Join" page above is different from the one in the version of Galaxy that I have. Mine shows:

Join two Datasets side by side on a specified field (Galaxy Version 2.0.2)
Join
1. Exons
using column
Column: 1
with
2: SNPs
and column
Column: 1
Keep lines of first input that do not join with second input
No
Keep lines of first input that are incomplete
No
Fill empty columns
No
Execute
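As a rough illustration of how large the output of a join on column 1 of both datasets can get, the sketch below estimates the row count from the per-chromosome record counts. It assumes a plain key join that emits one row per matching pair; the function name `estimate_key_join_rows` and the counts are hypothetical and are not measurements from the tutorial data.

```python
from collections import Counter

def estimate_key_join_rows(left_keys, right_keys):
    """Rows produced by a key join: for each key, (left count) x (right count), summed."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sum(n * right[key] for key, n in left.items())

# Hypothetical counts: 1000 exons and 5000 SNPs on chr22, plus 10 exons on chr1.
left_keys = ["chr22"] * 1000 + ["chr1"] * 10
right_keys = ["chr22"] * 5000

rows = estimate_key_join_rows(left_keys, right_keys)
print(rows)  # prints 5000000
```

Because the row count is a product of the two per-chromosome counts, even modest inputs can produce millions of rows, which may be why the output file kept growing.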

For anyone following along, I found the newly added "Join" function under the "Get Data" menu (on the left side). With that tool, I was able to complete the Galaxy 101-1 tutorial.

With the best of intentions, let me observe that the tutorial (especially with a "101" name) could use some updating. Just adding a few of the comments from this topic might save new users a few hours of frustration.

Thanks for the help ... now on to Galaxy 101-2.

Galaxy 101-2 was uneventful. Everything worked as per the tutorial. Thanks.