Jump to another community

Cross product between entities for Mutect2 normal-normal analysis

What I want to do
One of the standard evaluations of somatic variant callers is normal-normal calling, where you take a bunch of replicate normal (non-tumor) bams from the same individual eg NA12878 and run your caller over every pair of samples, assigning one sample as the "tumor" and one as the "normal."

If I were writing a wdl outside of Firecloud I could implement this with the cross product:

My understanding of why this is tricky
As I understand the data model, it's baked into Firecloud that you run a method over each sample, which means you don't know about the other samples. That is, you can't perform the cross because your input is a File egsample.bam, and not an Array[File] eg sample_set.bams.

Hacky solution 1
I suppose one could set up a bunch of pairs, basically by implementing the cross manually and then uploading the resulting data model, and run the analysis over a pair set. Besides being really ugly this is not very maintainable because the part of the workflow that forms all pairs out of the list of samples lives outside of Firecloud.

What I mean is that even though the natural data model of the problem is

sample

bam

sample1

sample1.bam

sample2

sample2.bam

sample3

sample3.bam

I would run on pairs:

pair

tumor_sample

normal_sample

pair1

sample1

sample2

pair2

sample2

sample1

pair3

sample1

sample3

pair4

sample3

sample1

pair5

sample2

sample3

pair5

sample3

sample2

Hacky solution 2
I could also imagine the following hack. The data model would be a single, dummy, sample, with the attribute "bam_paths" which is a file with a path to a different bam on each line. Then the method would take in this FoFN, use read_tsv to get the Array[File] of bams, and then scatter over the cross.

Similarly, there's another analysis I want to do that scatters Mutect2 linearly (not sure what the word is but I mean trivially, without a cross or anything) over samples, but then does a cross over the output, the pairwise overlap between callsets FWIW. I'm also not sure how to do that.

Answers

Hi @davidben, my instinct is that it should be possible to do through sample sets but I might be dreaming -- let me check with the FC team. If that's not possible we'll find you the least hacky workaround possible -- which might be your solution 2.

Not sure I understand your second request, do you mean run a WDL across each sample individually then collect the output and take that into another WDL as an array?

Make a sample set that includes all the samples you want in your cross, set the root entity type for your method to sample set, then specify this.samples.bam to your Array input. This will give your WDL the list of bams from your samples.

If I'm right about your second question, then you would do essentially the same thing; except you would first run your M2 method per sample (by setting its root entity to sample then running it on the sample set with expression this.samplesin the launch dialog), then you would run your cross analysis on the sample set in the same way as for your first problem.

Does that make sense? Let me know if you run into any trouble. At some point we want to write some kind of cookbook of recipes for this sort of thing since it's really not intuitive to figure out.