I have 1,377,239 paired reads, 250bp each, 82,486 of them have been soft clipped for more than 200bp (eg a read with CIGAR string like 150S30M70S), is this normal?

These data come from a small targeted sequencing project (~160Kb region). I checked my exome data set, none of the 50 millions reads show >200bp soft clipping. I kind of guess this could be due to contamination, but am not sure since I'm not the library preparation guy.

Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.

Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.

Thanks for your reply.

1. I'm checking the sequencing chemistry, we used phiX spike and V2 chemistry.

2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?

3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.

2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?

If the multiple samples are from the same region, that is not going to help with the low nucleotide diversity problem. You may want to consider deliberately under-loading the library and/or adding a phiX spike, if you happen to re-run. Are these amplicons or pull-down libraries?

Quote:

3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.

Can you blat the soft clipped reads against the reference to check that there is nothing strange going on (chimeras)?

I ran blat with default settings on the reads with more than 200 soft clipped reads. The first 18 columns of output are summarized by R. It seems the mapping rate is relatively low in these reads, any clue?

1.matches - Number of matching bases that aren't repeats.
2.misMatches - Number of bases that don't match.
3.repMatches - Number of matching bases that are part of repeats.
4.nCount - Number of 'N' bases.
5.qNumInsert - Number of inserts in query.
6.qBaseInsert - Number of bases inserted into query.
7.tNumInsert - Number of inserts in target.
8.tBaseInsert - Number of bases inserted into target.
9.strand - defined as + (forward) or - (reverse) for query strand. In mouse, a second '+' or '-' indecates genomic strand.
10.qName - Query sequence name.
11.qSize - Query sequence size.
12.qStart - Alignment start position in query.
13.qEnd - Alignment end position in query.
14.tName - Target sequence name.
15.tSize - Target sequence size.
16.tStart - Alignment start position in query.
17.tEnd - Alignment end position in query.
18.blockCount - Number of blocks in the alignment.

which has 57 matches followed by 44 soft-clipped bases. The 57 matches were mapped to chromosome 1, whereas the 44 other bases map perfectly to chromosome 2.

It's not a problem with adapters, but it seems BWA-MEM misses some easy alignments. The few other reads in the same region do not have that mapping characteristic. They map to that region from start to end.