Positive and negative sets in model training by VariantRecalibrator

Thanks for developing this tool set and sharing it with others with such good support. It really contains a lot of wonderful tools.

I am trying to understand VQSR in more detail. Suppose I supply the resources recommended by the Best Practices (dbSNP, HapMap and 1000G Omni data) with the default settings (which one is used for training, which one is the truth set, and so on). Does the positive set contain all variants recorded in the resources that have train=TRUE? And how does the tool select the negative set? Does it order the variants from high to low by QUAL value and pick the bottom 5% (if percentBad = 0.05)? Can the positive set and the negative set overlap? And is there any quality filtration of the data, e.g. when a data point is more than one standard deviation away from the average?

Best Answer

The positive training set consists of the variants that overlap the training resources. The negative training set is the bottom 5% of variants after all of the variants have been evaluated against the positive model. Does that help clarify this for you?

I'm not sure what you mean by quality filtration of the data -- can you please clarify that question?
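
To make the selection described above concrete, here is a minimal Python sketch. It is not the VariantRecalibrator implementation: it swaps in a single independent Gaussian per annotation in place of the tool's actual model, and the names `select_training_sets`, `in_training_resource`, `annotations` and `percent_bad` are hypothetical, chosen only for illustration. The point it shows is that the negative set is ranked by fit to the positive model, not by QUAL.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(rows):
    """Fit an independent Gaussian (mean, std dev) to each annotation column."""
    columns = list(zip(*rows))
    # Floor the std dev so a constant column cannot cause division by zero.
    return [(mean(c), max(pstdev(c), 1e-6)) for c in columns]

def log_likelihood(row, params):
    """Log-density of one variant's annotations under the fitted Gaussians."""
    return sum(
        -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)
        for x, (mu, sd) in zip(row, params)
    )

def select_training_sets(variants, percent_bad=0.05):
    # Positive set: variants that overlap a training resource (assumed flag).
    positive = [v for v in variants if v["in_training_resource"]]
    params = fit_gaussian([v["annotations"] for v in positive])
    # Score every variant against the positive model, worst fit first.
    ranked = sorted(variants, key=lambda v: log_likelihood(v["annotations"], params))
    # Negative set: the worst-scoring fraction, at least one variant.
    n_bad = max(1, int(len(variants) * percent_bad))
    return positive, ranked[:n_bad]

# Toy data (made up): 19 similar variants and one clear outlier.
variants = [
    {"id": f"v{i}",
     "annotations": (30.0 + i % 3, 1.0 + 0.1 * (i % 2)),
     "in_training_resource": i < 10}
    for i in range(19)
]
variants.append({"id": "outlier", "annotations": (2.0, 9.0),
                 "in_training_resource": False})

positive, negative = select_training_sets(variants, percent_bad=0.05)
print(len(positive), [v["id"] for v in negative])
```

In this sketch the negative set is drawn from all scored variants, so nothing prevents a poorly fitting training variant from landing in it. Whether the tool itself excludes such variants is not covered by the answer above.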
