RESULTS

This run took about an hour longer than the previous run, but for some reason it ran with only 21 workers, instead of 22. This is probably the reason for the increased run time.

I’d like to post a snippet of the GFF file here, but the line lengths are WAY too long and will be virtually impossible to read in this notebook. The GFF consists of listing a “parent” contig and its corresponding info (start/stop/length). Then, there are “children” of this contig that show various regions that are matched within the various databases that were queried, i.e. repeatmasker annotations for identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!

Probably the easiest and most logical to start working with will be those scaffolds that are annotated with a “protein_match”, as these have a corresponding GenBank ID. Parsing these out and then doing a join with NCBI protein IDs will give us a basic annotaiton of “functional” portions of the genome.

Additionally, we should probably do some sort of comparison of this run with the previous run where I did not provide the transposable elements FastA file.

I’d like to post a snippet of the GFF file here, but the line lengths are WAY too long and will be virtually impossible to read in this notebook. The GFF consists of listing a “parent” contig and its corresponding info (start/stop/length). Then, there are “children” of this contig that show various regions that are matched within the various databases that were queried, i.e. repeatmasker annotations for identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!

Probably the easiest and most logical approach from here is to start working with scaffolds that are annotated with a “protein_match”, as these have a corresponding GenBank ID. Parsing these out and then doing a join with a database of NCBI protein IDs will give us a basic annotation of “functional” portions of the genome.

Additionally, we should probably do some sort of comparison of this run with the follow up run where I provided the transposable elements FastA file to see what impacts the exclusion/inclusion of that info had on annotation.

The estimated value of these awarded resources is $14,890.00. The allocation of these resources represents a considerable investment by the NSF in advanced computing infrastructure for U.S. The dollar value of your allocation is estimated from the NSF awards supporting the allocated resources.

If XSEDE Extended Collaborative Support (ECSS) assistance was recommended by the review panel, you will be contacted by the ECSS team within the next two weeks to begin discussing this collaboration.

For details about the decision and reviewer comments, please see below or go to the XSEDE User Portal (https://portal.xsede.org), login, click on the ALLOCATIONS tab, then click on Submit/Review Request. Once there you will see your recently awarded research request listed on the right under the section ‘Approved’. Please select the view action to see reviewer comments along with the notes from the review meeting and any additional comments from the Allocations administrator.

By default the PI and all co-PIs will be added to the resources awarded. If this is an award on a renewal request, current users will have their account end dates modified to reflect the new end date of this award. PIs, co-PIs, or Allocation Managers can add users to or remove users from resources on this project by logging into the portal (https://portal.xsede.org) and using the ‘Add/Remove User’ form.

Share the impact of XSEDE! In exchange for access to the XSEDE ecosystem, we ask that all users let us know what XSEDE has helped you achieve:

For all publications, please acknowledge use of XSEDE and allocated resources by citing the XSEDE paper (https://www.xsede.org/how-to-acknowledge-xsede) and also add your publications to your user profile.

Tell us about your achievements (http://www.xsede.org/group/xup/science-achievements).

Help us improve our reporting by keeping your XSEDE user profile up to date and completing the demographic fields (https://portal.xsede.org/group/xup/profile).

For question regarding this decision, please contact help@xsede.org.

Best regards,
XSEDE Resource Allocations Service

===========================

REVIEWER COMMENTS

===========================

Review #0 – Excellent

Assessment and Summary:
Maker is a well-known and fitting workflow for Jetstream. This should be a good use of resources.

Appropriateness of Methodology:

After each user on the allocation has logged in, the PI will need to open a ticket via help@xsede.org and request the quota be set per the allocation to 1TB. Please provide the XSEDE portal ID of each user.

Appropriateness of Computational Research Plan:

We request that any publications stemming from work done using Jetstream cite us – https://jetstream-cloud.org/research/citing-jetstream.php

Efficient Use of Resources:

We had a tremendous amount of help from Upendra Devisetty at CyVerse in getting the Xsede Startup Application written, as well as running WQ-MAKER on Xsede/Atmosphere/Jetstream (or, whatever it’s called!).

I set up the run using 21 “WORKER” machines. Twenty WORKERS were “m1.large” machines (i.e. CPU: 10, Mem: 30 GB, Disk: 60 GB). The remaining WORKER was set at “m1.xlarge” (i.e. CPU: 24, Mem: 60 GB, Disk: 60 GB) to use up the rest of our allocated memory.

TL;DR – It appears to be continuing where it left off!

I decided to spend some time to figure out what was actually happening, as it’s clear that the annotation process is going to need some additional time to run and may span an additional monthly maintenance shutdown.

This is great, because, otherwise, this will take an eternity to actually complete (particularly because we’d have to move the job to run on one of our lab’s computers – which pale in comparison to the specs of our Mox nodes).

However, it’s a bit shocking that this is taking this long, even on a Mox node!

I started annotating the Olympia oyster genome on 20180529. Since then, the job has been interrupted twice by monthly Mox maintenance (which happens on the 2nd Tuesday of each month). Additionally, when this happens, the SLURM output file is overwritten, making it difficult to assess whether or not Maker continues where it left off or if it’s starting over from scratch.

Anyway, here’s how I deduced that the program is continuing where it left off.

I figured out that it produces a generic feature format (GFF) file for each contig.

Decided to search for the first contig GFF and look at it’s last modified date. This would tell me if it was newly generated (i.e. on the date that the job was restarted after the maintenance shutdown) or if it was old. Additionally, if there were more than one of these files, then I’d also know that Maker was just starting at the beginning and writing data to a different location.

This shows:

Only one copy of Contig0.gff exists.

Last modified date is 20180530.

Check the slurm output file for info.

This reveals this important piece of info:

MAKER WARNING: The file 20180529_oly_annotation_01.maker.output/20180529_oly_annotation_01_datastore/AC/68/Contig215522//theVoid.Contig215522/0/Contig215522.0.all.rb.out
did not finish on the last run

All of these taken together lead me to confidently conclude that Maker is not restarting from the beginning and is, indeed, continuing where it left off. WHEW!

Just for kicks, I also ran a count of GFF files to see where this stands so far:

Wow! 622,010 GFFs!!!

Finally, for posterity, here’s the SLURM script I used to submit this job, back in May! I’ll have all of the corresponding genome files, proteome files, transcriptome files, etc. on one of our servers once the job completes.

This took an insanely long time to complete (nearly six weeks)!!! After some internet searching, I’ve found a pontential solution to this and have initiated another PB Jelly run to see if it will run faster. Regardless, it’ll be interesting to see how the results compare from two independent runs of PB Jelly.

BEST OF:

Interesting tidbit: The pbjelly_sjw_01 assembly is EXACTLY the same as the soap_bgi_01. Looking at the output messages from that PB Jelly assembly, one can see why. The messages indicate that no gaps were filled on the BGI scaffold reference! That means the PB Jelly output is just the BGI scaffold reference assembly!

Results:

OK! This seems to have worked (and it was quick, like less than an hour!), as it actually produced a FASTA file! Will run QUAST with this and some assemblies to compare assembly stats. Have added this assembly to our Olympia oyster genome assemblies table.