A related question...the ability to identify and remove chimeric amplicons when lacking overlapping sequences has come up for a recent data set. Does anyone have a feel for how this may impact data analysis?

You don't need a full overlap for amplicon libraries. You just need enough of an overlap to merge the reads. Depending on your read quality, aiming for a 30-50bp overlap should be fine.
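To make the merging step concrete, here is a toy sketch of overlap-based pair merging. Everything here (function name, sequences, thresholds) is illustrative, and in practice you would use a dedicated merger such as PEAR, FLASH, or BBMerge:

```python
def merge_pair(r1, r2_rc, min_overlap=30, max_mismatch_frac=0.1):
    """Try to merge forward read r1 with the reverse-complemented read r2_rc.

    Scans candidate overlap lengths from longest to shortest and accepts
    the first one whose mismatch fraction is low enough. Returns the merged
    sequence, or None if no acceptable overlap is found.
    """
    for olen in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        tail, head = r1[-olen:], r2_rc[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches / olen <= max_mismatch_frac:
            return r1 + r2_rc[olen:]
    return None

# Toy pair sharing an 8 bp overlap (min_overlap lowered to match):
print(merge_pair("ACGTACGTAAGGCCTT", "AAGGCCTTGGTTAACC", min_overlap=8))
# -> ACGTACGTAAGGCCTTGGTTAACC
```

Real mergers also reconcile quality scores in the overlap, which is where the extra accuracy of double-covered bases comes from.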

Chimeric non-overlapping amplicons are hard to detect. If you cluster your pairs, and then find pairs in which the two reads map to different clusters, then you could assume that those pairs are chimeric. But the sensitivity and specificity depend completely on the quality of clustering.
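As a toy illustration of that idea (names, sequences, and the cluster-assignment stub are all hypothetical; real chimera detection uses tools like UCHIME or VSEARCH, and real cluster assignment is far more involved):

```python
def mismatches(a, b):
    """Count positional mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def nearest_cluster(read, clusters):
    """Stub assignment: pick the cluster whose representative matches best."""
    return min(clusters, key=lambda name: mismatches(read, clusters[name]))

def flag_chimeric_pairs(pairs, clusters):
    """Flag pairs whose two reads land in different clusters."""
    return [(r1, r2) for r1, r2 in pairs
            if nearest_cluster(r1, clusters) != nearest_cluster(r2, clusters)]

clusters = {"otu1": "AAAACCCC", "otu2": "GGGGTTTT"}
pairs = [("AAAACCCC", "GGGGTTTT"),   # reads match different clusters: suspect
         ("AAAACCCC", "AAAACCCG")]   # both near otu1: keep
print(flag_chimeric_pairs(pairs, clusters))
```

Note the caveat from the post applies directly: if the clustering itself is noisy, this flags non-chimeric pairs (and misses real chimeras whose halves happen to fall in the same cluster).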

If you're interested in a very small number of base differences, you absolutely need to fully overlap. If 5-10% differences are enough, you could maybe get away with not fully overlapping. But the cost difference between the two kits is only a couple hundred dollars, while your downstream computation time will be much, much greater with marginal sequences - that will likely cost thousands rather than hundreds.

ETA - my cost estimate for computational time is based on 16S. I've never dealt with any of the cancer panels, so I don't know how significantly poor-quality bases would impact your results.

I disagree. First off, I'm not sure what computation time you are talking about. How are you incurring thousands of dollars of compute costs from this kind of data?

Second, the data that paper used was low quality and not indicative of what I would expect from a properly-designed MiSeq 2x250 amplicon run using staggered primers, an appropriate amount of spike-in, etc.

Third, errors due to incorrect merges and errors in the reads themselves are conflated: since the former depend on the specific software used for overlapping, and are also a function of the overlap length, you can't really draw a conclusion about the error rates of overlapping reads using any methodology but the one described in the paper. Unfortunately, it's not described in the paper - rather, they sort of hint that it's described here, where I guess it occurs in the make.contigs command. I have not tested that, but would be very surprised if it was the best available tool for the purpose.

Fourth, 2x150 reads have a much lower error rate than 2x250 reads. If they overlap by 50bp, then the only non-overlapping portions are the first and last 100bp, which have peak error rates of around 0.2% for R1 and 0.5% for R2 (the averages are lower), including all reads with no quality filtering. Those figures are from HiSeq; MiSeq error rates are generally lower.
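The arithmetic behind that claim can be checked directly. This uses the peak error rates quoted above, so the results are upper bounds; the true expected errors are lower because the average rates sit below the peaks:

```python
# Expected errors in the single-covered portions of a merged 2x150 pair
# with a 50 bp overlap, using the peak per-base error rates quoted above.
read_len, overlap = 150, 50
single_covered = read_len - overlap   # 100 bp covered only by R1, 100 only by R2

r1_peak_err = 0.002                   # ~0.2% peak per-base error rate for R1
r2_peak_err = 0.005                   # ~0.5% peak per-base error rate for R2

exp_err_r1 = single_covered * r1_peak_err   # about 0.2 expected errors, worst case
exp_err_r2 = single_covered * r2_peak_err   # about 0.5 expected errors, worst case
print(exp_err_r1, exp_err_r2)
```

So even at the peak rates, the single-covered 200 bp contributes under one expected error per merged fragment.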

Longer reads and longer overlaps are better, of course. But 2x150 is viable as long as there is sufficient overlap to merge, and you can tolerate a fraction of a percent error rate in the non-double-sequenced portion.

If you are sequencing amplicons for 16S, you need to cluster the sequences into OTUs. The more sequencing errors you have, the more spurious OTUs you generate, which massively increases the memory required to cluster them (assuming you are doing de novo clustering). If you do get them clustered, you then either waste a lot of time trying to find meaning in the sequencing noise, or you throw out all of the rare OTUs - which means throwing out good data along with the bad, because you can't tell the difference between the good and bad rares. Ecologically this matters; for a cancer panel, maybe it doesn't.
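A toy simulation shows how per-base error inflates the number of unique sequences a de novo clusterer has to handle (the sequence, error rates, and read counts here are made up purely for illustration):

```python
import random

random.seed(0)
TRUE_SEQ = "ACGT" * 60                                   # one 240 bp "true" amplicon
OTHER = {"A": "CGT", "C": "AGT", "G": "ACT", "T": "ACG"}  # substitution choices

def noisy_read(seq, err):
    """Copy seq, substituting each base with probability err."""
    return "".join(random.choice(OTHER[b]) if random.random() < err else b
                   for b in seq)

def unique_count(err, n_reads=1000):
    """Distinct sequences among n_reads error-prone copies of TRUE_SEQ."""
    return len({noisy_read(TRUE_SEQ, err) for _ in range(n_reads)})

# A 10x higher error rate turns one true sequence into far more unique
# sequences, each a candidate spurious OTU for a naive clusterer.
print(unique_count(0.001), unique_count(0.01))
```

Every one of those near-duplicate uniques either has to be denoised away or ends up as a rare OTU, which is exactly the memory/time blowup and the good-vs-bad-rares dilemma described above.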