I want to point out a feature of the data that we didn’t expect to see: genome coverage has very large inter-region variance. Check out this z-score-normalized distribution of read coverage per 1 Mbase genomic region:

Not beautiful. There is a ~600 SD outlier causing trouble. This is interesting simply by virtue of being surprising.
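For anyone who wants to poke at this kind of normalization themselves, here is a minimal Python sketch of z-scoring per-region read counts and locating the most extreme region. It is not the R code mentioned in this post; the function name `zscores` and the toy counts are made up for illustration, and I'm assuming the z-score is taken against the mean and population standard deviation of all regions.

```python
from statistics import mean, pstdev


def zscores(counts):
    """Return (count - mean) / sd for each per-region read count.

    Assumption: sd is the population standard deviation over all regions.
    """
    mu = mean(counts)
    sd = pstdev(counts)
    if sd == 0:
        raise ValueError("coverage is constant; z-scores are undefined")
    return [(c - mu) / sd for c in counts]


# Hypothetical toy data: read counts for six 1 Mbase regions, one of them extreme.
coverage = [98, 103, 101, 97, 100, 5000]

z = zscores(coverage)

# Index of the region whose coverage is furthest from the mean, in SD units.
worst = max(range(len(z)), key=lambda i: abs(z[i]))
print(f"most extreme region: index {worst}, z = {z[worst]:.2f}")
```

Note that a single huge value inflates both the mean and the standard deviation, which squashes every other region toward the same z-score. That is why one outlier can make the whole histogram unreadable.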

It’s also interesting for practical reasons: some assumptions built into our software design aren’t suited to this kind of very irregular, very high coverage. The good news is that we can clean this up without too much effort. The distribution of reads is presumed to be Poisson, or closer to normal at higher levels of coverage. Let’s see what happens if we trim off the top decile of data:

Better. What’s up with this blip at -10z? Oh, and yes, it does start to look normal if we bandpass and chop off the bottom decile as well:
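The trimming steps described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the R code from this post: the helper names `trim_deciles` and `zscores` and the toy counts are hypothetical, I'm assuming the trim is applied to raw per-region coverage before re-normalizing, and the decile cut points use the "inclusive" interpolation method from Python's `statistics.quantiles`, which may differ slightly from other quantile definitions.

```python
from statistics import mean, pstdev, quantiles


def zscores(values):
    """Z-score each value against the mean and population SD of the list."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def trim_deciles(values, drop_bottom=False, drop_top=True):
    """Drop values outside the chosen decile cut points.

    drop_top removes everything above the 90th percentile;
    drop_bottom removes everything below the 10th percentile.
    """
    cuts = quantiles(values, n=10, method="inclusive")  # nine decile cut points
    lo = cuts[0] if drop_bottom else float("-inf")
    hi = cuts[-1] if drop_top else float("inf")
    return [v for v in values if lo <= v <= hi]


# Hypothetical toy data: mostly well-behaved regions plus one very high
# and one very low coverage region.
coverage = list(range(90, 110)) + [5000, 1]

top_trimmed = trim_deciles(coverage, drop_top=True)
band = trim_deciles(coverage, drop_bottom=True, drop_top=True)

for label, data in [("raw", coverage), ("top trimmed", top_trimmed), ("band", band)]:
    z = zscores(data)
    print(f"{label:12s} n={len(data):3d}  max |z| = {max(abs(x) for x in z):.2f}")
```

Re-computing the z-scores after each trim matters here: the mean and SD are themselves dragged around by the extreme regions, so the scale of the histogram changes once they are gone.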

There’s some R code below showing how I went about this. I’m also attaching a data file of the overall z-score coverage for all ~3K 1 Mbase genomic regions.