(howto) Install all software packages required to follow the GATK Best Practices.

Objective

Install all software packages required to follow the GATK Best Practices.

Prerequisites

To follow these instructions, you will need to have a basic understanding of the meaning of the following words and command-line operations. If you are unfamiliar with any of the following, you should consult a more experienced colleague or your systems administrator if you have one. There are also many good online tutorials you can use to learn the necessary notions.

Basic Unix environment commands

Binary / Executable

Compiling a binary

Adding a binary to your path

Command-line shell, terminal or console

Software library

You will also need to have access to an ANSI-compliant C++ compiler and the tools needed for normal compilations (make, shell, the standard library, tar, gunzip). These tools are usually pre-installed on Linux/Unix systems. On MacOS X, you may need to install the Xcode tools. See https://developer.apple.com/xcode/ for relevant information and software downloads. The Xcode tools are free but an Apple ID may be required to download them.

Starting with version 3.6, the GATK requires Java Runtime Environment version 1.8 (Java 8). Previous versions down to 2.6 required JRE 1.7, and earlier versions required 1.6. All Linux/Unix and MacOS X systems should have a JRE pre-installed, but the version may vary. To test your Java version, run the following command in the shell:

java -version

This should return a message along the lines of "java version 1.8.0_25" as well as some details on the Runtime Environment (JRE) and Virtual Machine (VM). If you have a version that does not match the requirements stated above for the version of GATK you are running, the GATK may not run correctly or at all. The simplest solution is to install an additional JRE and specify which you want to use at the command-line. To find out how to do so, you should seek help from your systems administrator.
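
If you want to script this check, the sketch below shows one way to pull the major version number out of the reported string. It is only an illustration: the function names java_major and installed_java_version are hypothetical, and it assumes the version string appears in double quotes in the output, as in the message above.

```shell
# Extract the major Java version from a version string.
# Handles the legacy "1.x" scheme (1.8.0_25 -> 8) and the newer scheme (11.0.2 -> 11).
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;   # legacy scheme: drop the leading "1."
    esac
    # Keep everything before the first ".", "_" or "-"
    echo "${v%%[._-]*}"
}

# Print the version string reported by the JRE found on the PATH.
# Note that java -version writes to stderr, hence the redirection.
installed_java_version() {
    java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'
}
```

With a JRE installed, java_major "$(installed_java_version)" prints the number to compare against the requirement for your GATK version.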

Software packages

BWA

SAMtools

Picard

Genome Analysis Toolkit (GATK)

IGV

RStudio IDE and R libraries ggplot2 and gsalib

Note that the version numbers of the packages you download may differ from those shown in the instructions below. If so, please adapt the numbers accordingly in the commands.

1. BWA

This will produce a directory called bwa-0.7.12 containing the files necessary to compile the BWA binary. Move to this directory and compile using:

cd bwa-0.7.12
make

The compiled binary is called bwa. You should find it within the same folder (bwa-0.7.12 in this example). You may also find other compiled binaries; at time of writing, a second binary called bwamem-lite is also included. You can disregard this file for now. Finally, just add the BWA binary to your path to make it available on the command line. This completes the installation process.
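
If adding a directory to your path is new to you, here is a minimal sketch of the idea. The function name add_to_path and the location $HOME/my_tools/bwa-0.7.12 are assumptions made for illustration; use the directory where you actually compiled BWA.

```shell
# Prepend a directory to PATH unless it is already listed there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;     # otherwise put it at the front
    esac
    export PATH
}

# Example (hypothetical location):
#   add_to_path "$HOME/my_tools/bwa-0.7.12"
```

To make the change permanent, the call would typically go in your shell profile (for bash, usually ~/.bashrc or ~/.bash_profile, depending on how your environment is configured).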

Testing

Open a shell and run:

bwa

This should print out some version and author information as well as a list of commands. As the Usage line states, to use BWA you will always build your command lines like this:

bwa <command> [options]

This means you first make the call to the binary (bwa), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command.

2. SAMtools

This will produce a directory called samtools-0.1.2 containing the files necessary to compile the SAMtools binary. Move to this directory and compile using:

cd samtools-0.1.2
make

The compiled binary is called samtools. You should find it within the same folder (samtools-0.1.2 in this example). Finally, add the SAMtools binary to your path to make it available on the command line. This completes the installation process.

Testing

Open a shell and run:

samtools

This should print out some version information as well as a list of commands. As the Usage line states, to use SAMtools you will always build your command lines like this:

samtools <command> [options]

This means you first make the call to the binary (samtools), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command. This is the same convention as used by BWA.
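
To confirm in one go that the binaries you added to your path can actually be found, something along these lines may help. It is a rough sketch; check_tools is a hypothetical helper name, not part of BWA or SAMtools.

```shell
# Report which of the named programs can be found on the PATH.
# Returns 0 if all of them are found, 1 if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_tools bwa samtools java
```

A tool reported as missing usually means its directory has not been added to the path in the shell you are currently using.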

3. Picard

Read the overview of the Picard software on the Picard project homepage, then download the latest version (currently 2.4.1) of the package containing the pre-compiled program file (the picard-tools-2.x.y.zip file).

Installation

Unpack the zip file using:

unzip picard-tools-2.4.1.zip

This will produce a directory called picard-tools-2.4.1 containing the Picard jar files. Picard tools are distributed as a pre-compiled Java executable (jar file) so there is no need to compile them.

Note that it is not possible to add jar files to your path to make the tools available on the command line; you have to specify the full path to the jar file in your java command, which would look like this:

java -jar ~/my_tools/jars/picard.jar <Toolname> [options]

This syntax will be explained in a little more detail further below.

However, you can set up a shortcut called an "environment variable" in your shell profile configuration to make this easier. The idea is that you create a variable that tells your system where to find a given jar, like this:

export PICARD="$HOME/my_tools/jars/picard.jar"

So then when you want to run a Picard tool, you just need to call the jar by its shortcut, like this:

java -jar $PICARD <Toolname> [options]

The exact way to set this up depends on what shell you're using and how your environment is configured. We like this overview and tutorial which explains how it all works; but if you are new to the command line environment and you find this too much to deal with, we recommend asking for help from your institution's IT support group.
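
As an illustration of the shortcut idea, the sketch below wraps the call in a small shell function. It assumes a bash-like shell. The function name picard and the JAVA override variable are hypothetical conveniences rather than anything shipped with Picard, and the jar location is the example path used above.

```shell
# Location of the Picard jar (example path; adjust to where you unpacked it).
PICARD="${PICARD:-$HOME/my_tools/jars/picard.jar}"

# Run a Picard tool: picard <ToolName> [options]
# JAVA is an optional, hypothetical override for choosing a specific java binary.
picard() {
    if [ ! -r "$PICARD" ]; then
        echo "picard: cannot read jar file '$PICARD' (check the PICARD variable)" >&2
        return 1
    fi
    "${JAVA:-java}" -jar "$PICARD" "$@"
}
```

The readability check is there because a wrong path otherwise shows up only as Java's "Unable to access jarfile" error, which does not say which variable to fix.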

This completes the installation process.

Testing

Open a shell and run:

java -jar picard.jar -h

This should print out usage information along with a list of the available tools. At this point you will have noticed an important difference between BWA and Picard tools. To use BWA, we called on the BWA program and specified which of its internal tools we wanted to apply. To use Picard, we called on Java itself as the main program, then specified which jar file to use (picard.jar), and only then named the tool we wanted. This applies to all Picard tools; to use them you will always build your command lines like this:

java -jar picard.jar <ToolName> [options]

This means you first make the call to Java itself as the main program, then specify the picard.jar file, then specify which tool you want, and finally you pass whatever other arguments (input files, parameters etc.) are needed for the analysis.

Note that the command-line syntax of Picard tools has recently changed from java -jar <ToolName>.jar to java -jar picard.jar <ToolName>. We are using the newer syntax in this document, but some of our other documents may not have been updated yet. If you encounter any documents using the old syntax, let us know and we'll update them accordingly. If you are already using an older version of Picard, either adapt the commands or better, upgrade your version!

Next we will see that GATK tools are called in essentially the same way, although the way the options are specified is a little different. The reasons for how tools in a given software package are organized and invoked are largely due to the preferences of the software developers. They generally do not reflect strict technical requirements, although they can have an effect on speed and efficiency.

4. Genome Analysis Toolkit (GATK)

In order to access the downloads, you need to register for a free account on the GATK support forum. You will also need to read and accept the license agreement before downloading the GATK software package. Note that if you intend to use the GATK for commercial purposes, you will need to purchase a license. See the licensing page for an overview of the commercial licensing conditions.

Installation

Unpack the tar file using:

tar xjf GenomeAnalysisTK-3.3-0.tar.bz2

This will produce a directory called GenomeAnalysisTK-3.3-0 containing the GATK jar file, which is called GenomeAnalysisTK.jar, as well as a directory of example files called resources. GATK tools are distributed as a single pre-compiled Java executable so there is no need to compile them. Just like we discussed for Picard, it's not possible to add the GATK to your path, but you can set up a shortcut to the jar file using environment variables as described above.

This completes the installation process.

Testing

Open a shell and run:

java -jar GenomeAnalysisTK.jar -h

This should print out some version and usage information, as well as a list of the tools included in the GATK. As the Usage line states, to use GATK you will always build your command lines like this:

java -jar GenomeAnalysisTK.jar -T <ToolName> [arguments]

This means that just like for Picard, you first make the call to Java itself as the main program, then specify the GenomeAnalysisTK.jar file, then specify which tool you want, and finally you pass whatever other arguments (input files, parameters etc.) are needed for the analysis.
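
For a check that goes a little further than the help screen, you could run a simple tool on the example files in the resources directory. The sketch below is an assumption-laden illustration: the function name gatk_smoke_test and the JAVA override are hypothetical, and the file names exampleFASTA.fasta and exampleBAM.bam are assumed names for the bundled example data, so adjust them to whatever your download actually contains.

```shell
# Run CountReads on the example data shipped with GATK as a basic smoke test.
# Usage: gatk_smoke_test /path/to/GenomeAnalysisTK-3.3-0
gatk_smoke_test() {
    local dir="$1"
    # Assumed file names inside the resources directory; adjust to your download.
    "${JAVA:-java}" -jar "$dir/GenomeAnalysisTK.jar" \
        -T CountReads \
        -R "$dir/resources/exampleFASTA.fasta" \
        -I "$dir/resources/exampleBAM.bam"
}
```

If the installation is working, the run should finish without errors and report how many reads were counted.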

5. IGV

The Integrative Genomics Viewer (IGV) is a genome browser that allows you to view BAM, VCF and other genomic file information in context. It has a graphical user interface that is very easy to use, and can be downloaded for free (though registration is required) from this website. We encourage you to read through IGV's very helpful user guide, which includes many detailed tutorials that will help you use the program most effectively.

6. RStudio IDE and R libraries ggplot2 and gsalib

Download the latest version of RStudio IDE. The webpage should automatically detect what platform you are running on and recommend the version most suitable for your system.

Installation

Follow the installation instructions provided. Binaries are provided for all major platforms; typically they just need to be placed in your Applications (or Programs) directory. Open RStudio and type the following command in the console window:

install.packages("ggplot2")

This will download and install the ggplot2 library as well as any other library packages that ggplot2 depends on for its operation. Note that some users have reported having to install two additional packages themselves, called reshape and gplots, which you can do as follows:

install.packages("reshape")
install.packages("gplots")

Finally, do the same thing to install the gsalib library:

install.packages("gsalib")

This will download and install the gsalib library.
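
On a machine without a graphical session, the same installs can be scripted from the shell instead of typed into the RStudio console. This is a minimal sketch, assuming Rscript is on your path. The names r_install_expr and CRAN_MIRROR are hypothetical, and the mirror URL is just an illustrative default.

```shell
# Build an R expression that installs whichever of the named packages are missing.
# CRAN_MIRROR is an optional, hypothetical override for the download mirror.
r_install_expr() {
    local pkgs="" p
    for p in "$@"; do
        pkgs="${pkgs:+$pkgs, }\"$p\""    # build a quoted, comma-separated list
    done
    printf 'for (p in c(%s)) if (!requireNamespace(p, quietly = TRUE)) install.packages(p, repos = "%s")\n' \
        "$pkgs" "${CRAN_MIRROR:-https://cloud.r-project.org}"
}

# Example (requires Rscript on the PATH):
#   Rscript -e "$(r_install_expr ggplot2 reshape gplots gsalib)"
```

Because packages that are already present are skipped, the command is safe to re-run.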

Important note

If you are using a recent version of ggplot2 and a version of GATK older than 3.2, you may encounter an error when trying to generate the BQSR or VQSR recalibration plots. This is because until recently our scripts were still using an older version of certain ggplot2 functions. This has been fixed in GATK 3.2, so you should either upgrade your version of GATK (recommended) or downgrade your version of ggplot2. If you experience further issues generating the BQSR recalibration plots, please see this tutorial.

Comments

I'm having an issue getting picard tools configured to work in any directory. I've downloaded and unpacked the picard zip file and added the picard-tools-1.94 directory to my path, however when I run:

java -jar AddOrReplaceReadGroups.jar -h

I get the following error: Error: Unable to access jarfile AddOrReplaceReadGroups.jar

The command works if I am in the picard-tools-1.94 directory, making me think that something is wrong with my path variable but when I echo my path variable and copy the relevant path directly into a cd command I move to the correct directory (so there are no typos) and the command works (so I should be adding the correct directory). Any suggestions? Here is the value of my PATH variable:

Hi.
I am new to GATK. While installing everything went well except installation of ggplot2 and gsalib. I am getting error msg as :
Warning in install.packages :
unable to connect to 'cran.rstudio.com' on port 80.

Is it a problem of network only or something else I have to do.
Thanks in advance.

Hi Geraldine, this was not the case in my install, perhaps because ggplot2 had been installed on my system for a few years. It took me a while to figure out the problem and I would save others such inconvenience if the full list were included.

The problem is that we would then have to update the dependencies every time the developer of another library changes their package, and that's just too much burden on us. As it is now, it is your responsibility to keep your software up to date. If you have some software that has been installed for several years, one of the first things you should think of if you run into problems is to update everything.

@Geraldine_VdAuwera said:
What I do is set up an environment variable that points to the directory where the jars live, so I can do something like $picardDir/AddOrReplaceReadGroups.jar

Geraldine,
I am having the same problem as haseley above but I am not sure how to set up the environment variable using the information you provided above. I am new to Linux. Can you send me a command line to run? Do I have to run "$picardDir/AddOrReplaceReadGroups.jar" before running "java -jar AddOrReplaceReadGroups.jar -h"?

I recommend you look for an online tutorial that covers working with environment variables. There are many fine tutorials for Linux beginners, and I think this will be more useful to you in the long run than giving you a set of commands to run. Good luck!

I am sysadmin helping our scientists set up GATK environment. I am following the instructions in this post, and some of them cause doubts - at least on Linux (don't know much about Mac).

Step 4 (Picard installation) says to "add the Picard directory to your path to make the tools available on the command line". As somebody else already mentioned, this doesn't make sense - jar invocation doesn't use PATH to find the jar file. So, java -jar AddOrReplaceReadGroups.jar -h doesn't work unless AddOrReplaceReadGroups.jar is in the current directory. I assume GATK itself relies on the Picard archives; the typical way is to add jars to the CLASSPATH. Should I add all of them? That doesn't seem right. Could you please correct the instructions?

Step 5. The trivial invocation (with -h flag) works; however, any ToolName throws an error that the tool is not found. I don't know if it's related to Picard jars missing, or there is some other reason - but some verification that goes beyond just -h would be very helpful.

Step 6. Most of our Linux servers don't have GUI installed. Scientists use R Studio Server for all their R development and modelling needs. However, you recommend installing R Studio IDE, which is a client-based software. Do you want me to install it on the server? It won't work without X. Or you want the scientists to install it on their workstations (many of them have it already). Then I would need some instructions as to how to integrate such client installation with GATK. Or I misunderstood this whole step?

For Picard, we define an environment variable PICARD_HOME, and then invoke with java -jar $PICARD_HOME/AddOrReplaceReadGroups.jar. Actually, we do the same thing for GATK.

For GATK, I assume that you tried tool names that should exist (like PrintReads or UnifiedGenotyper)? If you build the jars yourself, it's possible to mess things up and not compile in the tools, but the downloadable jar for distribution (at least v2.8-1) doesn't have this problem.

You don't need an R IDE, I suspect that the recommendation was made just for ease of installation. Just make sure that Rscript is on the path and that the ggplot2 package (and all dependencies) is installed

Step 4: Apologies for the confusion; what it means is what @pdexheimer outlines: create an environment variable to use as shortcut to the directory where you store the jars. We typically have several versions on the same machine so it's easier to control what we're using that way rather than using classpaths. We'll try to clarify the doc.

Step 5: Not sure what you mean -- could you please post the command line you tried that didn't work, and what result or error message you got?

Step 6: The IDE is not required, it's just a recommendation for people who don't already work with R, as it can help make installing the libraries easier for them.

Step 4: I assume that GATK needs to somehow know where Picard jar files are located, right? So, whether I use directory name or environment variable - it doesn't let GATK know about it. Or there is no dependency, and the scientist is supposed to invoke Picard jars independently of GATK? Then I don't have any questions.

Step 5: As I said, I am sysadmin (although hanging around the scientists for a long time). And our lawyers asked scientists to stay out of the system while they are finalizing contract. In short, I don't have the tools "that should exist". That's exactly my question - can somebody suggest a command (beyond just help screen) that should work.

I didn't build jars myself; as Step 5 instructs, the jars are pre-built in the download... The goal is to package Amazon AMIs for the scientists, and I want to make sure that all the pieces work correctly before I start packaging.

Step 4: there is no direct dependency; GATK does not make calls to Picard, if that's what you mean. We just ask users to get Picard because there is some data preprocessing that needs to be done with Picard before the data can be input to GATK.

Step 5: Oh I see. Well you could run one of the simple analysis tools on the example data that is provided with the download (if I remember correctly, in the resources subdirectory). E.g. you would do:

YES! Thank you so much (on both steps)! Step 4: that certainly clarifies. and Step 5: The results are much more comforting than just help screen! I got "CountReads - CountReads counted 33 reads in the traversal" and "0 reads were filtered out during the traversal" which looks really great!
Thanks a lot

That's a fair point, but we've only tested the "master" package (in keeping with the usual Earth-logic software naming convention; not sure what the htslib devs are doing merging master into develop...), so proceed with "develop" at your own risk.

Hi Geraldine,

Thanks. With the newer version of htslib you can just recompile samtools with the new htslib library and all the functionality is available from samtools.

I'm getting some troubles with setting the path of the picard tools...
I have the same issue as Nathan and Mickaël: it works if I move in the "picard" dir, but not from somewhere else.

I read forum posts explaining what an environment variable is. As you suggested, I modified my .bashrc as follows:
-I created an environmental variable " picardDir": export picardDir="$HOME/picard.tools-1.110"
-I added it to my path: export PATH="$PATH:$picardDir"
By typing "env" I could verify that both environmental variable "picardDir" and "PATH" were modified correctly.

then I run:
java -jar $picardDir/AddOrRpeplaceReadGroups.jar -h (tested also with quotes: java -jar "$picardDir/AddOrRpeplaceReadGroups.jar" -h)
However, I still get the same error message:
Error: Unable to access jarfile /home/fabrice/picard-tools-1.110/AddOrRpeplaceReadGroups.jar
And if I try to type the absolute path in the command rather than my "picardDir" environment variable:
java -jar /home/fabrice/picard-tools-1.110/AddOrRpeplaceReadGroups.jar -h
I get the same error message.

I'm not sure -- htslib is under active development and is maintained by others (not us) so I can't really comment on compilation issues. You'd have to ask the developers of htslib for support.

However I can tell you that htslib is not really required in order to run GATK, so you can skip it unless you need the ability to revert a bam file to FastQ (which is what we use it for in the tutorial).

Ok thanks Geraldine for your answer,
I haven't fixed the issue so I am running GATK without htslib and that's fine for what I want to do.
However, I will let you know if ever I find a fix to this issue...

I just have a comment on the part 7. R packages. It seems that ggplot2 and gsalib are not enough. I installed another package called 'reshape', then it worked for generating the figures in the BQSR step.

The issue is that you are not specifying where the GenomeAnalysisTK.jar file exists. You must specify the path to the file before the file name. For example, my GATK is stored in my Applications folder, so my command would be:
java -jar /Applications/GenomeAnalysisTK.jar -h

I haven't had any problems installing the reshape, gplots and gsalib libraries but ...

tools is not present in CRAN?! But it might work in R-3.0.2. Looking at the GitHub code showed me that these are imported in BQSR.R. This also shows that this is only triggered in AnalyzeCovariates when -plots is specified (with -csv it runs correctly).

OOOPS missed it!

I cannot remove this post. library("tools") works. I failed to scroll down in the library() listing; the tools package is present in the default installation of R.

Hi guys.. Thanks so much for the great support. I am also a little new to Linux and I believed everything was working fine but now I am stuck and have been fighting with this for some time. I have installed java with what seems to be a working version and run:

java -jar GenomeAnalysisTK.jar --help

with the appropriate path to GenomeAnalysisTK.jar. I get what seems like an appropriately long readout with a lot of options (not sure if this is what everyone else sees) but I think I am having a problem because I don't see a list of tools.. When I run the analysis with the example data and CountReads I get the error

Maybe @Geraldine_VdAuwera could confirm. I got it to work, but only with OpenJDK 8; it would not run with OpenJDK 9, so I had to switch to OpenJDK 8 instead. I was attracted to 9 because I believed there was some baseline memory allocation behavior for Java running in containers on servers. Will this just not work on 9?