Analysis of uptake sequences by score

Here are the logos for the N. meningitidis and H. influenzae uptake sequences after sorting the occurrences by the scores that the Gibbs Motif Sampler assigned them. (I'm pretty sure that each score measures how well that occurrence's sequence matches the position weight matrix that Gibbs determined for this data set, but I don't know how the calculation is done.)
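To make "how well a sequence matches a position weight matrix" concrete, here is a minimal sketch of one common way to score a match: a log-odds sum against a background base composition. This is not the Gibbs sampler's own calculation, which isn't known here; the function names, the pseudocount, the uniform background and the toy sites are all assumptions made up for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

def log_odds_score(seq, pwm, background=None):
    """Sum over positions of log2(motif frequency / background frequency)."""
    # Assumption: uniform background unless the caller supplies one.
    background = background or {b: 0.25 for b in BASES}
    return sum(math.log2(pwm[i][b] / background[b]) for i, b in enumerate(seq))

# Toy aligned sites, invented for illustration only.
toy_sites = ["GCCGTCTGAA", "GCCGTCTGAA", "GCCGTCTGAT", "ACCGTCTGAA"]
toy_pwm = build_pwm(toy_sites)
consensus_score = log_odds_score("GCCGTCTGAA", toy_pwm)
variant_score = log_odds_score("GCCGTCTGAT", toy_pwm)
```

In this sketch the sequence that matches the most common base at every position gets the higher score, and passing a different `background` dictionary changes how much each base contributes.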

The top set shows the logos for 5381 N. meningitidis DUSs. The numbers are different from those in yesterday's post because I realized I had been analyzing a N. gonorrhoeae data set. The overall picture is the same for N. meningitidis and N. gonorrhoeae: low-scoring DUSs retain a strong consensus for most of the central positions but have only very weak consensuses for the other positions. The drop-off is quite steep. The shapes of the logos are about the same for all the occurrences with scores lower than about 0.95.

The H. influenzae dataset is even more skewed; almost 60% of the USSs have perfect scores, and about 8% have zero scores. But the consensus decays fairly evenly across the positions, and even the zero-score occurrences have the full motif. Like the N. meningitidis DUS, the shapes of the USS logos are about the same for all occurrences with scores below 0.95.

I think the question in my mind was whether there is an obvious place to draw a line between 'real uptake sequence' and 'degenerate sequence that doesn't deserve to be treated as an uptake sequence'. Unfortunately the analysis is complicated by the different sizes of the datasets: the N. meningitidis set has almost twice as many sites as the H. influenzae set.

OK, I've dug out another set of H. influenzae runs, done with a high 'expected' setting to maximize the number of sites found. This has 3466 USSs, with a lot more having zero scores than in the previous set. Now the first and last Gs in the core are seen to be weaker in USSs with low scores, though not in the larger set of USSs with zero scores. Overall the consensus still remains constant as the scores and consensus strengths decrease. Notably, the flanking AT-rich segments remain as important in poorly matched USSs as the core does.
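For anyone who wants to reproduce this kind of comparison numerically rather than by eye, here is a rough sketch of the procedure: group the occurrences into score bins, then compute the per-position information content (the total height of each logo column) within each bin. The bin edges, function names and toy data are assumptions for illustration, and the calculation omits the small-sample correction that logo software usually applies.

```python
import math
from collections import Counter

def information_content(sites):
    """Bits per position (2 minus Shannon entropy), no small-sample correction."""
    n = len(sites)
    info = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)
    return info

def bin_by_score(scored_sites, edges):
    """Group (sequence, score) pairs; edges are lower bounds in descending order."""
    bins = {edge: [] for edge in edges}
    for seq, score in scored_sites:
        for edge in edges:
            if score >= edge:
                bins[edge].append(seq)
                break
    return bins

# Toy (sequence, score) pairs, invented for illustration only.
toy_scored = [("GCCG", 1.0), ("GCCG", 1.0), ("GCCA", 0.97),
              ("ACCG", 0.50), ("TCCT", 0.0)]
# Hypothetical bins: perfect scores, 0.95 up to 1.0, and everything below 0.95.
toy_bins = bin_by_score(toy_scored, edges=(1.0, 0.95, 0.0))
profiles = {edge: information_content(seqs)
            for edge, seqs in toy_bins.items() if seqs}
```

Comparing the `profiles` lists across bins shows which positions lose information as the scores drop and which hold up.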

5 comments:

I think this is telling us that there is no clear score cut-off for defining a non-degenerate US. Does this also tell us which positions are the most important for uptake and how well do they agree with Lindsay's data?

I think I'll do another post on how to interpret this, to clarify my muddled thinking about how the motif perspective fits with this data.

In Lindsay's data, changing the two Gs doesn't strongly affect uptake, but the effect is still less than some other positions that show stronger conservation in the low-score logos. (There are errors in the base labels in this figure; I'll need to check her notebooks.)

What is the "background" A+T content that Gibbs is using when calculating the strength of your motifs? Some Gibbs servers default to 60% A+T because Gibbs is mostly used to analyze promoter DNA. If this is the case, the overrepresentation of G+C in your weak Neisseria motifs may be an artifact of G+C bases being scored higher than A+Ts, when in fact the weak DUS sites in the genome vary equally at all positions. In other words, Gibbs may preferentially favour weak sites that differ at the A+T positions but maintain the G+C positions.

Could it help to use a simpler way to score the DUS sequences? I would focus on the 12 bp of the Neisseria DUS and then make lists of sequences allowing zero, one, two ... mismatches (using, for example, fuzznuc). Then one could sort the sequences and see if there is a high number of a certain sequence that does not fully resemble the consensus. At least then one deals with real sequences and not with some strange scores. I guess Neisseria itself would understand this way better than the Gibbs scores.
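The mismatch-list idea in this comment could be sketched roughly as follows. This is plain Python rather than fuzznuc, it scans one strand only, and the 12-bp consensus string, the function names and the toy sequence are assumptions for illustration (the thread does not spell out the consensus).

```python
from collections import Counter, defaultdict

# Assumed 12-bp DUS consensus; not given in the post or comments.
CONSENSUS = "ATGCCGTCTGAA"

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def tally_near_matches(genome, consensus=CONSENSUS, max_mismatches=2):
    """Count every window within max_mismatches, grouped by mismatch count."""
    k = len(consensus)
    tallies = defaultdict(Counter)
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        d = mismatches(window, consensus)
        if d <= max_mismatches:
            tallies[d][window] += 1
    return tallies

# Toy sequence: one perfect site and two copies of a one-mismatch variant.
toy_genome = "TT" + "ATGCCGTCTGAA" + "CC" + "ATGCCGTCTGAT" + "GG" + "ATGCCGTCTGAT"
toy_tallies = tally_near_matches(toy_genome)
most_common_one_off = toy_tallies[1].most_common(1)
```

Sorting each mismatch class with `most_common()` would show whether any particular variant is unusually frequent.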

@Tim: I don't think the background base composition is the cause, because the changes are position-specific (e.g. some USS Gs get much weaker than others).

@Torsten: The reason I'm doing the Gibbs analysis is to get away from the erroneous repeat/mismatch view. The position weight matrix produced by the Gibbs analysis is much more consistent with how we think uptake sequences evolve. I'll try to clarify this in my next post.