
Annotation - Olurida_v081 MAKER on Mox

by Sam White, November 27, 2018 (20 min read)

Remarkably, I managed to burn through our XSEDE computing resources and don’t have terribly much to show for it. Ooof! This is a major bummer, as it “only” takes ~8 hrs for a WQ-MAKER job to run there, as opposed to the months it took the last time I tried running it on Mox. Although we have used up our XSEDE allocation, all is not lost! The experience of setting up and running WQ-MAKER has taught me how it all works and how to run it on Mox so that it will (hopefully) take far, far less time than the last Mox attempt. With that said, here we go…

Firstly, I re-installed MAKER (v2.31.10) and configured it with OpenMPI support. OpenMPI is a message-passing library that lets MAKER split its work across the cores and nodes of a computer cluster. Now that we have two Mox nodes, I think this will help accelerate the process.

With that out of the way, here’s a very brief overview of the entire MAKER annotation process. Be aware that, despite its “brevity”, this is still a lengthy read:

Merge all the hundreds of thousands (seriously) of individual GFFs and FastAs into a single file of each file type. MAKER has built-in scripts to do this.

Generate ab initio gene predictions using SNAP. This is integrated into MAKER.

Run MAKER again, using the SNAP gene models.

Merge the new set of GFFs.

Run SNAP a second time.

Run MAKER a third time using the second set of SNAP gene models.

Merge the final set of GFFs.

Done???

So, that’s how it’s done! Easy!
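To make the SNAP steps in the overview a bit more concrete, here is a minimal Python sketch of the command sequence commonly used to train SNAP from a merged MAKER GFF. This follows the usual MAKER/SNAP tutorial workflow and is an assumption on my part, not a record of the exact commands used for this run. The function name `snap_training_commands` and the file names `round1.all.gff` and `olurida` are hypothetical. Nothing is executed; the function only returns the argument lists in order.

```python
from typing import List


def snap_training_commands(merged_gff: str, hmm_name: str) -> List[List[str]]:
    """Return the usual SNAP training commands, in order, as argv lists.

    Nothing is run here. Each list could be handed to subprocess.run().
    The last command prints the HMM to stdout, so its output needs to be
    captured or redirected to a file such as <hmm_name>.hmm.
    """
    return [
        # Convert MAKER gene models to ZFF (writes genome.ann and genome.dna).
        ["maker2zff", merged_gff],
        # Categorize the gene models, keeping 1000 bp of flanking sequence.
        ["fathom", "genome.ann", "genome.dna", "-categorize", "1000"],
        # Export the unique models, all oriented on the plus strand.
        ["fathom", "uni.ann", "uni.dna", "-export", "1000", "-plus"],
        # Estimate HMM parameters from the exported models.
        ["forge", "export.ann", "export.dna"],
        # Assemble the parameter files in the current directory into one HMM.
        ["hmm-assembler.pl", hmm_name, "."],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for argv in snap_training_commands("round1.all.gff", "olurida"):
        print(" ".join(argv))
```

The resulting HMM file is what the next MAKER round would be pointed at in its control file.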

With each round of MAKER, a “control” file needs to be generated and modified appropriately. Modifications consist of telling MAKER locations of files and whether or not to use certain types of files when producing a new model (e.g. RNAseq data, SNAP HMM file, etc.). Here are the three control files that were used to run MAKER. The links are simply text files, despite their extension, so they can be downloaded and viewed in any text editor, if desired, but I’ve pasted their contents below for easier review:
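Since editing the control file by hand for every round is error-prone, here is a small Python sketch of how the `key=value #comment` lines in a MAKER options file could be updated programmatically. It assumes the standard `maker_opts.ctl` layout. The helper name `set_ctl_options` and the HMM path in the example are hypothetical, while `snaphmm` and `est2genome` are real MAKER options.

```python
import re
from typing import Dict

# Matches "key=value #comment" lines. Section headers start with "#" and are skipped.
_CTL_LINE = re.compile(r"^(?P<key>\w+)=(?P<value>[^#]*?)(?P<comment>\s*#.*)?$")


def set_ctl_options(ctl_text: str, updates: Dict[str, str]) -> str:
    """Return ctl_text with the given options replaced, comments preserved."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        match = _CTL_LINE.match(line)
        if match and match.group("key") in remaining:
            key = match.group("key")
            comment = match.group("comment") or ""
            line = f"{key}={remaining.pop(key)}{comment}"
        out.append(line)
    if remaining:
        # Fail loudly rather than silently ignoring a misspelled option.
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "#-----Gene Prediction\n"
        "snaphmm= #SNAP HMM file\n"
        "est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no\n"
    )
    # Hypothetical second-round settings: use the trained SNAP HMM and
    # stop inferring genes directly from transcript alignments.
    print(set_ctl_options(sample, {"snaphmm": "snap/round1.hmm", "est2genome": "0"}))
```

Raising on unknown keys is a deliberate choice here: a typo in an option name would otherwise leave the control file unchanged and cost a full MAKER run to notice.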

The post above provides details on how to speed the process up (hint: use the GFFs from the previous round in subsequent MAKER rounds to avoid repeated BLAST-ing, which is one of the slowest parts of the process) and explains how to evaluate the results, as well as how to run BUSCO/Augustus.

The 3rd MAKER Run GFF should contain the most refined gene models. This GFF has individual genes, coding sequences (CDS), mRNAs, and proteins. However, it’s a good idea to load all three GFFs in a genome browser and see how they compare.
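Besides eyeballing the GFFs in a genome browser, a quick first comparison is to count how many features of each type each file contains. The sketch below is a generic GFF3 tally in Python, not something MAKER provides. The function name `count_feature_types` is hypothetical, and it assumes standard nine-column, tab-delimited GFF3 with an optional embedded `##FASTA` section.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff_lines: Iterable[str]) -> Counter:
    """Count GFF3 features by type (column 3), ignoring comments and FASTA."""
    counts: Counter = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip malformed lines instead of guessing
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up example. Real use would pass an open file handle instead.
    example = [
        "##gff-version 3",
        "contig1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
        "contig1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1",
        "contig1\tmaker\tCDS\t1\t900\t.\t+\t0\tID=g1-cds;Parent=g1-mRNA-1",
        "##FASTA",
        ">contig1",
        "ACGT",
    ]
    for feature_type, n in sorted(count_feature_types(example).items()):
        print(feature_type, n)
```

Running this on the GFF from each round gives gene, mRNA, and CDS totals that can be compared side by side before loading anything into a browser.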

A run through BUSCO/Augustus gene prediction should refine these models even further and seems to be the standard practice when annotating genomes.

Additionally, the protein FastA file needs to be run through BLASTp, as well as through InterProScan, to actually assign functions to the predicted genes.
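As a rough illustration of the functional assignment step, here is a Python sketch that keeps the single best BLASTp hit per protein. It assumes the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`), where column 11 is the e-value and column 12 is the bit score. The function name `best_hits` and the IDs in the example are made up.

```python
from typing import Dict, Iterable, Tuple


def best_hits(blast_lines: Iterable[str]) -> Dict[str, Tuple[str, float, float]]:
    """Map each query ID to (subject ID, e-value, bit score) of its top hit.

    "Top" here means the highest bit score. Ties keep the first hit seen.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard outfmt 6 row
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        "prot1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
        "prot1\tsubjB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
        "prot2\tsubjC\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60.5",
    ]
    for query, (subject, evalue, bitscore) in best_hits(rows).items():
        print(query, subject, evalue, bitscore)
```

Picking by bit score rather than by e-value avoids ties among the many hits that BLAST reports with an e-value of 0.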

Finally, MAKER can put all this together and create better sequence ID info in the FastA files and the GFFs (will create NCBI-standardized sequence IDs).