More thinking/planning about the new uptake-sequencing data

Some housekeeping issues:

The sequence data: The PhD student has found that some segments of the genome have very low coverage in the input data - some positions have coverage of zero. This means that the calculated uptake ratios for these positions are either unreliable (low coverage) or missing (coverage = 0). He's going to plot segments of the genome with the low-coverage points in a different colour, so we can see how bad the problem is.

Part of the problem may be due to how the reads were originally mapped onto the donor genome. The mapping used a concatenated donor-recipient double genome to remove the contaminating recipient reads from the data. Because the donor and recipient sequences used were those of NCBI reference genomes rather than of the exact cultures used for the experiment, sequencing errors in the reference genomes may have caused donor sequences to misalign onto the recipient genome.

This can easily be checked by examining the full alignment of the input DNA. The input DNA should not contain any contaminating recipient sequences, so any input reads that align to the recipient genome are alignment errors. The ideal solution would be to realign the reads using better reference sequences, but we could instead just add this misaligned coverage into the donor-aligned input dataset we're analyzing.

Any remaining positions with near-zero coverage in the input dataset should probably be flagged and removed from the analyses.
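The flagging step could be sketched roughly as below. This is only an illustration, not the analysis code: it assumes the uptake ratio at a position is simply recovered coverage divided by input coverage, and the function name `uptake_ratios` and the cutoff `min_input_cov` are made up for the example (the post doesn't say what counts as "near-zero").

```python
import math

def uptake_ratios(uptake_cov, input_cov, min_input_cov=10):
    """Return (ratios, flagged) for per-position coverage lists.

    Assumption: ratio = recovered coverage / input coverage.
    min_input_cov is a hypothetical cutoff, not a value from the post.
    """
    ratios = []
    flagged = []
    for up, inp in zip(uptake_cov, input_cov):
        if inp == 0:
            # No input coverage: the ratio is undefined (missing).
            ratios.append(math.nan)
            flagged.append(True)
        else:
            ratios.append(up / inp)
            # Low input coverage: computable, but unreliable.
            flagged.append(inp < min_input_cov)
    return ratios, flagged

# Toy data: three positions with good, low, and zero input coverage.
ratios, flagged = uptake_ratios([50, 4, 7], [100, 2, 0])
print(ratios, flagged)
```

The flagged positions could then be plotted in a different colour or dropped before any downstream analysis.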

The USS-scoring matrices: A careful reader might have noticed in yesterday's post that the two scoring matrices are not the same length. The uptake-based matrix is 32 nt long, but the genome-based matrix is 37 nt long. They are also not exactly aligned to each other; position 1 of the uptake-based matrix is position 3 of the genome-based matrix. Rather than dealing with these discrepancies later (or forgetting to deal with them), we should create concordant matrices now to use for the scoring.

This requires deleting the first two positions and the last three positions of the genome-based matrix, which leaves 32 positions that line up with the uptake-based matrix. Since the remaining last few positions have no 'information' in either matrix, we might as well delete the last two positions from each, to give concordant matrices that are both 30 nt long.
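Here's a minimal sketch of the trimming, assuming each matrix is stored as a list with one entry per position (the real matrices may well be stored differently). The function name `trim_to_concordant` and its parameters are invented for the illustration.

```python
def trim_to_concordant(uptake_matrix, genome_matrix,
                       genome_offset=2, final_length=30):
    """Trim both matrices to the same aligned positions.

    genome_offset=2 reflects that position 1 of the uptake-based matrix
    corresponds to position 3 of the genome-based matrix.
    """
    trimmed_uptake = uptake_matrix[:final_length]
    trimmed_genome = genome_matrix[genome_offset:genome_offset + final_length]
    if len(trimmed_uptake) != final_length or len(trimmed_genome) != final_length:
        raise ValueError("a matrix is too short for the requested length")
    return trimmed_uptake, trimmed_genome

# Stand-in matrices: each 'row' is just its original 1-based position number.
uptake = list(range(1, 33))   # 32 positions
genome = list(range(1, 38))   # 37 positions
u, g = trim_to_concordant(uptake, genome)
print(u[0], u[-1], g[0], g[-1])
```

With the stand-in data, the trimmed uptake matrix keeps its original positions 1 to 30 and the trimmed genome matrix keeps its original positions 3 to 32.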

Forward-strand and reverse-strand USSs: Since the USS motif is not symmetric (not a palindrome in DNA language), we need to identify and specify the locations of the USSs in the two strands. The top panel below illustrates the problem. To keep the position references consistent, the two strands are initially scored in the same left-to-right direction, with the reverse-strand scoring done using a matrix with complementary bases in the reverse orientation. For both strands the left end of each USS initially specifies its position in the genome, but this is a bit misleading since it's not the centre or most important position of the USS. Worse, since the crucial 'core' of the USS motif isn't at its centre, the initial positions of the forward USSs are skewed differently than the reverse USSs.
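To make the two-strand scoring concrete, here's a toy sketch. It assumes a matrix is a list of per-position dictionaries mapping each base to a score, and that a window's score is the sum of the per-position scores; both are assumptions for illustration, as are all the function names. The reverse-strand matrix is built by reversing the position order and swapping each base for its complement, so both strands can be scanned left-to-right along the same sequence.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement_matrix(matrix):
    """Reverse the position order and complement the bases in each row."""
    return [{COMPLEMENT[base]: score for base, score in row.items()}
            for row in reversed(matrix)]

def score_window(matrix, seq, start):
    """Sum the per-position scores for the window beginning at `start`."""
    return sum(row[seq[start + i]] for i, row in enumerate(matrix))

def score_both_strands(matrix, seq):
    """Score every window left-to-right with both matrices.

    Both score lists are indexed by the LEFT end of the window.
    """
    rc_matrix = reverse_complement_matrix(matrix)
    n_windows = len(seq) - len(matrix) + 1
    forward = [score_window(matrix, seq, p) for p in range(n_windows)]
    reverse = [score_window(rc_matrix, seq, p) for p in range(n_windows)]
    return forward, reverse

def one_hot(base):
    """Toy matrix row that scores 1 for `base` and 0 otherwise."""
    return {b: (1 if b == base else 0) for b in "ACGT"}

# Toy 3-nt 'motif' AAG; its reverse complement is CTT.
toy_matrix = [one_hot("A"), one_hot("A"), one_hot("G")]
forward, reverse = score_both_strands(toy_matrix, "AAGCTT")
print(forward, reverse)
```

In the toy sequence the forward-strand match is reported at the left end of AAG (index 0) and the reverse-strand match at the left end of CTT (index 3), which is the left-end convention described above.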

The lower panels indicate the two possible solutions. Both are technically easy - we just create new USS positions by adding numbers to the original positions. In the solution shown in the middle panel, we'd add 13 (I think) to both the forward-strand and reverse-strand positions (sorry, the figure shows the trimmed 30 bp USS but the numbers haven't been corrected for the removal of two positions at the start). In the solution shown in the lower panel we'd add 7 to the forward-strand positions and 21 to the reverse-strand positions. (I'm not certain these are the correct numbers...)
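Either solution is just a strand-dependent shift, which could look something like the sketch below. The function `shift_positions` and the (left end, strand) tuples are hypothetical, and the offsets in the usage lines are the tentative numbers from the paragraph above, which still need checking.

```python
def shift_positions(hits, forward_offset, reverse_offset):
    """Shift USS positions by a strand-specific offset.

    hits: list of (left_end, strand) tuples, with strand '+' or '-'.
    """
    shifted = []
    for left_end, strand in hits:
        if strand == "+":
            offset = forward_offset
        elif strand == "-":
            offset = reverse_offset
        else:
            raise ValueError(f"unknown strand: {strand!r}")
        shifted.append((left_end + offset, strand))
    return shifted

hits = [(100, "+"), (200, "-")]   # made-up example positions

# Middle-panel option: same offset for both strands (tentative value).
print(shift_positions(hits, 13, 13))

# Lower-panel option: different offsets per strand (tentative values).
print(shift_positions(hits, 7, 21))
```

Keeping the offsets as parameters means the numbers can be corrected later without touching anything else.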