I have sequenced a bacterial genome with a GridIon from ONT. Basically what I want to check is whether or not trimming 50 bps at the beginning of the reads will improve alignment against the reference genome and ultimately the call of a consensus sequence.

For that what I have done so far is:

concatenated all the fastqs generated from the Gridion into a single (very large) fastq file: ref494_cat_all.fastq

$\begingroup$Please capture the output of the first command and show us the first few lines (via bcftools view if it's not text): bcftools mpileup -f ref_genbank/ref_genbank.fasta ref494_aligned.sorted.bam > output.tmp$\endgroup$
– John MarshallJul 9 '19 at 7:28

1

$\begingroup$I got Killed output to the terminal after a minute or two. Also this command generates an empty output.tmp file.$\endgroup$
– BCArgJul 9 '19 at 7:49

$\begingroup$As noted below, killed due to memory exhaustion. However: you deserve an error message. Mpileup typically doesn't use that much memory and I have been unable to reproduce this on (non-ONT) data here. If you are able to, it would be interesting to try current develop bcftools to see if this still happens and/or to run it under the debugger and see where it crashes.$\endgroup$
– John MarshallJul 9 '19 at 9:12

$\begingroup$The latter is quite easy: run gdb --args bcftools mpileup -f ref_genbank/ref_genbank.fasta -o /dev/null ref494_aligned.sorted.bam, then type run and wait for it to crash. Then type backtrace; the output from this will be instructive.$\endgroup$
– John MarshallJul 9 '19 at 9:17

$\begingroup$Just before the program got killed I got Program terminated with signal SIGKILL, Killed.The program no longer exists. The output of backtrace was No stack That you cannot reproduce this error on non-ONT data, you think it can be related to the size of my input files? My bam is 14GB (coverage is about 2.000, we ran for about 48h, which we won't do routinely)$\endgroup$
– BCArgJul 9 '19 at 9:58

2 Answers
2

The - is usually used to mean standard input when reading data. So the message would suggest that the second bcftools command fails to read from stdin, so it fails to read the output of the first bcftools command:

$\begingroup$I doubt there's any problem with the input files, as I can visualise the alignment on table or IGV, so I guess the bcftoools mpileup is not generating any output (please see my comment to @John Marshall). I guess I will report the problem on their github page.$\endgroup$
– BCArgJul 9 '19 at 7:53

1

$\begingroup$@BCArg yes, precisely. If you got Killed in the terminal, then the process is being killed by the operating system, most likely because it is taking up too much memory. Sounds like you need to run this on a more powerful machine.$\endgroup$
– terdonJul 9 '19 at 7:56

1

$\begingroup$yes mystery solved, I monitored the the performance of the command bcftools mpileup -f ref_genbank/ref_genbank.fasta ref494_aligned.sorted.bam > output.tmp with htop and the program got killed right after I reached the maximum of my RAM memory (16GB), so in fact running the command on the cluster appears to be the solution.$\endgroup$
– BCArgJul 9 '19 at 8:12

It's possible you're looking at instructions for a recent version of samtools / bcftools, whereas your computer has an older version. Can you try samtools mpileup -g -f ... to output the pileup as BCF format instead of bcftools mpileup -f ...?

Update: it's good to know that the samtools / bcftools pipelines are at least producing consistent results, which suggests that the error is not associated with the output format on the mpileup side of the pipeline, and also good to see that you've identified that this was a memory issue.

The most frequent cause of memory issues I've had with sorted BAM files has been trying to figure out variants in extremely high coverage data (e.g. over 1000X), which would make sense for a bacterial isolate sequenced on a GridION. I use my own script to carry out a reservoir sampling on mapped BAM format files, which reduces coverage down to a more manageable level. Here's an example command to normalise to 100X coverage where possible:

$\begingroup$The VCF/BCF-outputting mpileup command was transferred from samtools to bcftools relatively recently. So in fact it is @gringer's suggestion that is instructions for an earlier version of samtools / bcftools.$\endgroup$
– John MarshallJul 9 '19 at 7:40