GATK SNP calling

Can anyone advise which speeds things up more: splitting the BAM files or splitting the -L region (intervals)? Assume I have 500 BAM files. Will splitting them into 22 per-chromosome files speed things up further compared with splitting the -L intervals? Has anyone tried this before? Thank you.

Best Answer

I see. No, splitting the BAM files will not speed up processing. In fact, specifying intervals using -L does not really speed up processing either. What it does is allow you to skip regions you are not interested in. So processing takes less time overall, but per reference base it takes the same amount of time. Make sense?
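To make the per-base point concrete, here is a rough back-of-the-envelope sketch in Python. The rate and genome size are made-up illustrative numbers, not GATK benchmarks, and `estimated_hours` is a hypothetical helper: it only models the idea that run time scales with the number of reference bases traversed.

```python
def estimated_hours(bases, hours_per_megabase):
    """Toy model: run time is proportional to the reference bases processed."""
    return bases / 1_000_000 * hours_per_megabase

# Made-up numbers for illustration only (not measured GATK timings).
RATE = 0.01                   # hypothetical hours per megabase
GENOME_BASES = 3_000_000_000  # rough size of a human genome

# Whole genome versus a 50 Mb subset selected with -L:
# the subset is faster only because fewer bases are traversed.
whole = estimated_hours(GENOME_BASES, RATE)
subset = estimated_hours(50_000_000, RATE)

# Cutting the genome into 22 equal pieces and running them one after
# another adds up to the same total; the per-base cost is unchanged.
serial_pieces = sum(estimated_hours(GENOME_BASES / 22, RATE) for _ in range(22))

print(whole, subset, serial_pieces)
```

In this model the only way to shorten wall-clock time for a fixed set of bases is to run the pieces at the same time, which is a separate question from how the input is split.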

That said, those documents I linked to also describe scatter-gather (using Queue), which can parallelize operations on a different level. If you are using a cluster, that can speed things up a lot.
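As a rough illustration of the scatter idea (done by hand here, not with Queue itself), the Python sketch below builds one GATK command line per chromosome, each restricted with -L, so the jobs can be submitted to a cluster side by side. Several things are assumptions: the GATK 3 style invocation with UnifiedGenotyper as the caller, the file names `GenomeAnalysisTK.jar`, `reference.fasta` and `all_samples.list`, and contig names `1` to `22` (your reference may use `chr1` and so on). The per-chromosome VCFs would still need to be merged afterwards, which is the gather step.

```python
import shlex

# Hypothetical file names; replace with your own paths.
GATK_JAR = "GenomeAnalysisTK.jar"
REFERENCE = "reference.fasta"
BAM_LIST = "all_samples.list"  # assumed: text file with one BAM path per line


def scatter_commands(contigs, threads=4):
    """Build one command line per contig, each limited to that contig with -L."""
    commands = []
    for contig in contigs:
        args = [
            "java", "-jar", GATK_JAR,
            "-T", "UnifiedGenotyper",  # assumed caller; swap in the tool you use
            "-R", REFERENCE,
            "-I", BAM_LIST,
            "-L", contig,              # restrict this job to a single contig
            "-nt", str(threads),       # same multi-threading setting for every job
            "-o", f"calls.{contig}.vcf",
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Contig names must match the reference dictionary (assumed "1".."22" here).
autosomes = [str(n) for n in range(1, 23)]

if __name__ == "__main__":
    # Print the commands; submit each one as a separate cluster job.
    for command in scatter_commands(autosomes):
        print(command)
```

The speed-up here comes from the jobs running simultaneously on different nodes, not from the intervals themselves, and all jobs read the same unsplit BAM files.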

Thanks, Geraldine, for your answer. I understand that parallelism (multi-threading with -nt) does help, but I just want to confirm whether the speed is equivalent: with the same -nt settings, is splitting the -L region equivalent to splitting the BAM file? I really need this answer, on top of -nt multi-threading, to speed up processing of my very large sample set. I just don't want to spend time splitting the BAM files if it is equivalent, speed-wise, to splitting the -L interval regions. Thank you for your help and advice :)
