
Answers

In stage5, the pipeline looks at the total number of candidate variants across all samples so that we can eliminate samples that have higher than expected rates of variants. We use only the autosome for this analysis, because it is harder to do this right on the sex chromosomes. The error is because you have no variable autosomal sites (because this is only chrX).

If you look at the other chrs, you will see what the stage5 outputs normally look like. The main output is the file cnv_stage5/eval/SelectedSamples.list. The pipeline is set up so that you can override the set of discovery samples by manually modifying this file. So one possible workaround is to create this file, either including all of your samples or based on the results for the autosome, and then touch cnv_sentinel_files/stage_5.sent and cnv_sentinel_files/.stage_5.sent.done.
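
A minimal sketch of this workaround, run from inside the run directory. The toy sample names and the mkdir/printf setup are purely illustrative so the commands can be followed end to end; substitute your real sample list:

```shell
# Toy setup for illustration only -- in a real run these directories
# already exist and all_samples.list is your full sample list.
mkdir -p cnv_stage5/eval cnv_sentinel_files
printf 's1\ns2\ns3\n' > all_samples.list

# Option "include all of your samples": make every sample a discovery sample.
cp all_samples.list cnv_stage5/eval/SelectedSamples.list

# Mark stage 5 complete so the pipeline resumes at the next stage.
touch cnv_sentinel_files/stage_5.sent
touch cnv_sentinel_files/.stage_5.sent.done
```

The sentinel files tell the pipeline that stage 5 has already been done, so it will not recompute (and overwrite) the sample selection.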

@bhandsaker said:
I think this is because you are running each chromosome separately.


Thank you Bob, it works!
On the other hand, based on your suggestion, would it be better for us to run CNVDiscovery on the whole genome instead of running each chromosome separately?

You can run each chromosome separately, but the downside is that stage 5 filters out "noisy" samples and removes them from the set of discovery samples. These filtered samples are added back in for final genotyping.

If you run chromosome by chromosome, you may end up with different sets of discovery samples on different chromosomes (although all sites will be genotyped in all samples in the end). Depending on your application, this may not be a problem. If you are doing any kind of population genetic analysis, though, it will create ascertainment biases. It may also affect how you think about QC, for example when looking at the length spectrum or frequency spectrum of the final calls.

Also perhaps worth mentioning here that the output from the CNV pipeline (stage 12) is a set of unfiltered (or at least minimally filtered) calls. If you want a low false discovery rate, it is important to do additional filtering. Traditionally, we have done this by setting a length threshold (separate thresholds for deletions and duplications).

In our latest software release, we now emit a new INFO field GSCNQUAL that we have started using for filtering on an experimental basis. You can think of this as a continuous per-site quality score. We find that we can achieve higher sensitivity with the same FDR by filtering on GSCNQUAL (instead of length) and that the same metric works comparably for both deletions and duplications. We are still evaluating this and should have more info available soon. In the meantime, you can experiment with filtering by ranking the variants according to this metric. If you have run with an earlier version of GS, you can get GSCNQUAL by simply re-genotyping your stage 12 calls with the latest version of GS.
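
For example, a minimal way to experiment with a GSCNQUAL cutoff is a small awk filter over the VCF INFO field. The toy two-record VCF and the threshold of 12 below are purely illustrative; the threshold should be tuned against your own FDR estimates:

```shell
# Toy VCF standing in for the real stage 12 genotypes VCF.
printf '##fileformat=VCFv4.1\n' > calls.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> calls.vcf
printf '1\t1000\tCNV_1\t.\t<CNV>\t.\t.\tEND=6000;GSCNQUAL=25.3\n' >> calls.vcf
printf '1\t9000\tCNV_2\t.\t<CNV>\t.\t.\tEND=9500;GSCNQUAL=3.1\n' >> calls.vcf

# Keep header lines, plus records whose GSCNQUAL is at or above the cutoff.
awk -F'\t' '/^#/ { print; next }
    {
        q = -1
        n = split($8, info, ";")
        for (i = 1; i <= n; i++)
            if (info[i] ~ /^GSCNQUAL=/) { split(info[i], kv, "="); q = kv[2] + 0 }
        if (q >= 12) print
    }' calls.vcf > calls.filtered.vcf
```

For a real bgzipped genotypes.vcf.gz, pipe the output of zcat into the same awk. Sorting on the extracted q value instead of thresholding gives you the ranking mentioned above.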

I ran into exactly the same problem. Would you explain the details of how to create the cnv_stage5/eval/SelectedSamples.list file "either including all of your samples or based on the results for the autosome", so I can continue with the following stages? I used the 1000G phase 1 reference, and it looks like the -genomeMaskFile and -ploidyMapFile were found OK.

Assuming you are running each chromosome separately (which is what caused the problem you are referencing), you have 22 different SelectedSamples.list files, one for each autosomal chromosome. You could set your selected samples for X and Y to all of your samples, or to the union of the 22 autosomal files, or the intersection, or perhaps some fancier combination.
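
As a sketch of the union and intersection options, assuming one run directory per chromosome named run_chr1 ... run_chr22 and run_chrX (these directory names and toy sample lists are illustrative; only two autosomes are mocked here so the commands can be followed):

```shell
# Toy setup for illustration -- in a real run these files already exist.
mkdir -p run_chr1/cnv_stage5/eval run_chr2/cnv_stage5/eval run_chrX/cnv_stage5/eval
printf 's1\ns2\ns3\n' > run_chr1/cnv_stage5/eval/SelectedSamples.list
printf 's1\ns3\n'     > run_chr2/cnv_stage5/eval/SelectedSamples.list

# Union: samples selected on at least one autosome.
sort -u run_chr*/cnv_stage5/eval/SelectedSamples.list > union.list

# Intersection: samples selected on every autosome.
sort run_chr1/cnv_stage5/eval/SelectedSamples.list > keep.tmp
for d in run_chr2; do                    # in a real run: run_chr2 ... run_chr22
    sort "$d/cnv_stage5/eval/SelectedSamples.list" | comm -12 keep.tmp - > keep.next
    mv keep.next keep.tmp
done
mv keep.tmp run_chrX/cnv_stage5/eval/SelectedSamples.list
```

comm -12 prints only lines common to both (sorted) inputs, so the loop progressively intersects the per-chromosome selections.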

If you had run the whole genome together, the default behavior would have been to look at the variants-per-sample across the 22 autosomal chromosomes combined, then discard outliers more than 3 MADs above the median. You can use the files in the stage5/eval directories for each chromosome to do this calculation yourself if you want, which would reproduce exactly what the default pipeline would have done.
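
A sketch of doing that calculation by hand with standard shell tools, assuming the VariantsPerSample.report.dat files have the three-column format shown elsewhere in this thread (the toy counts.dat stands in for the concatenation of your 22 autosomal report files):

```shell
# Toy data; in a real run use:
#   cat run_chr*/cnv_stage5/eval/VariantsPerSample.report.dat > counts.dat
cat > counts.dat <<'EOF'
SAMPLE VARIANTS SINGLETONS
s1 10 1
s2 12 0
s3 11 2
s4 90 40
EOF

# Per-sample totals of the VARIANTS column (header lines skipped).
awk '$1 != "SAMPLE" { total[$1] += $2 } END { for (s in total) print s, total[s] }' \
    counts.dat > totals.txt

# Median of the per-sample totals.
median=$(cut -d' ' -f2 totals.txt | sort -n |
    awk '{ a[NR] = $1 } END { if (NR % 2) m = a[(NR + 1) / 2]; else m = (a[NR / 2] + a[NR / 2 + 1]) / 2; print m }')

# Median absolute deviation of the totals.
mad=$(awk -v m="$median" '{ d = $2 - m; if (d < 0) d = -d; print d }' totals.txt | sort -n |
    awk '{ a[NR] = $1 } END { if (NR % 2) m = a[(NR + 1) / 2]; else m = (a[NR / 2] + a[NR / 2 + 1]) / 2; print m }')

# Keep samples no more than 3 MADs above the median.
awk -v m="$median" -v mad="$mad" '$2 <= m + 3 * mad { print $1 }' totals.txt | sort > SelectedSamples.list
```

With the toy counts, s4 (90 variants) is discarded as an outlier while s1, s2, and s3 are kept.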

@bhandsaker I had this same error: "x and y lengths differ" during CNVDiscovery stage 5. I am only running the analysis on chr19. So I don't have any "normal" SelectedSamples.list files to reference. What do you suggest I do?

Thanks for getting back to me. Below is the content of the VariantsPerSample.report.dat file:

"SAMPLE VARIANTS SINGLETONS"

As an update, I attempted to edit the contents of the DiscoverySamples.list file to include the names of all of my samples, one per line, and entered the commands:
touch cnv_sentinel_files/stage_5.sent
and
touch cnv_sentinel_files/.stage_5.sent.done

You can post it if you want, but what you need to do is to figure out why there are no candidate intervals being found. Perhaps dig around in stage 2 and see if the genotyping for any of the initial windows is showing variability. If not, then you need to figure out why none of the windows appear to be variable. It's hard to be more specific.

I looked through the output files of stage 2; it definitely appears as though the program did not identify any variability. There are no data in SelectedVariants.list, and VariantsPerSample.report.dat contains no variants either. Other output files appear similarly data-free. For instance, here are the first several lines of some of the output files.

Hi Bob, I ran into this error in the GenomeSTRiP 2.0 CNVDiscoveryPipeline too, and it stopped at stage 5.

I created the DiscoverySamples.list file with the names of all of my samples, and touched cnv_sentinel_files/stage_5.sent and cnv_sentinel_files/.stage_5.sent.done, but when I run the pipeline these files get removed or emptied. How can I solve this?

I also created the SelectedSamples.list file, but I don't understand what to fill it with. Would you explain the details of how to create the cnv_stage5/eval/SelectedSamples.list file "either including all of your samples or based on the results for the autosome"?

Are you also seeing the same symptoms as mdistler in stage2? Those symptoms suggest that there was some problem where no read depth data is being found for any of the samples on any of the intervals. This is probably due to the arguments being incorrect in some way.

One common user error is that if you supply a "list file" for any of the arguments, the extension must be ".list", otherwise it will treat the file name as the argument. As an example, specifying -sample mysamples.txt will probably not do what you want, as it is requesting to process a single sample with the name "mysamples.txt". If you do -sample mysamples.list then it will read the contents of the file (but only if the file exists, which can also be a source of user error if you spell the file name wrong).

@bhandsaker Yes, I use -I bamfile.list, and the sample files exist, because I can see these samples in
cnv_stage2/seq_chromosomeID/seq_chromosomeID/seq_chromosomeID.genotypes.vcf.gz,
but as you said, the content of the VariantsPerSample.report.dat file in stage 2 is just "SAMPLE VARIANTS SINGLETONS", too. I ran 12 samples successfully with this code before; this time I am running 20 samples together, and I will process more samples in the future. How can I adjust the arguments appropriately?

Perhaps it is where you do this? -I ${SV_DIR}/installtest/bamfile.list
This looks like it might be a different list of bam files than the one you used in preprocessing.
Also, for the CNV pipeline, only the header information from the input files is used, so an alternative is to use -I ${runDir}/metadata/headers.bam, which might be slightly faster.

Right. These symptoms are all consistent with no data being found for any of your samples.
You are using hg19, so chromosomes would be specified as "1", "2", etc., not "chr1".

Trying to debug with one sample is a good idea. Did you use a .list file with one line, or the path to the bam file directly? Peeking at the content of the files in the stage2/seq_1/eval directory is also a good idea and might give a clue, but my expectation is that it will look like every sample has copy number zero / no reads.

For debugging, you can use (e.g.) -intervalList 20 to run just chr20. But don't do this in production, as stage 5 looks at the number of called sites genome-wide, and aneuploidies will cause problems if you do this one chromosome at a time. You can also use -lastStage 2 to run only through stage 2 (for debugging).

The essence of the problem is that somehow your input bam files are not matching the metadata. Maybe look first at the metadata, in particular the first few lines of metadata/sample_gender.report.txt. Also, what does a snippet from metadata/profiles_100Kb/rd.dat look like (this lists summarized depth per chromosome)? Then I would try to run the first sample listed in sample_gender.report.txt using -I direct/path/to/bam_file.bam (or cram if you are using cram). And if this isn't working, I would dump the bam header with samtools, paying special attention to the @RG headers, which define the sample information. Also make sure the sam records have the chromosome names you expect (e.g. "1" vs. "chr1"). Something is mismatched.
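
A sketch of those header checks, with a toy header written in place of a real samtools dump (the sample name NA12878 and the file names are illustrative):

```shell
# On a real file you would start with:
#   samtools view -H my_sample.bam > header.sam
# A toy header is written here instead so the checks can be followed.
printf '@HD\tVN:1.5\tSO:coordinate\n' > header.sam
printf '@SQ\tSN:1\tLN:249250621\n' >> header.sam
printf '@RG\tID:rg1\tSM:NA12878\tLB:lib1\n' >> header.sam

# Sample names the pipeline will see, from the SM tags of the @RG lines
# (compare these with metadata/sample_gender.report.txt).
awk -F'\t' '/^@RG/ { for (i = 1; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }' \
    header.sam | sort -u > header_samples.txt

# Chromosome naming convention (hg19/b37 metadata expects "1", not "chr1").
if grep '^@SQ' header.sam | grep -q 'SN:chr'; then
    echo "header uses chr-prefixed names"
else
    echo "header uses unprefixed names (1, 2, ...)"
fi
```

If the SM names or the sequence names here disagree with the metadata, that mismatch is the likely cause of the empty read depth data.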