Genomics and evolution of bacterial pathogens

Main menu

Point mutations (SNPs) in outbreak strain relative to Ec55989

This post is really a reply to PatrikD’s comments on my last post about the SNP-based phylogeny of E. coli, showing the close relationship between the outbreak EAEC O104:H4 strain and the finished genome sequence of Ec55989 (well, chromosome). I can’t post much in a comment reply so it requires a new post.

PatrikD said:

Since I’m looking for SNPs that agree between the three assemblies, I’m not too worried about remaining sequencing errors. Given how closely these strains were sampled, I’m not interested in any minor differences between them, but more in the difference between 55989 and the three outbreak strains.

Now, I think the best way to do that at this point would be to use the SNP calls from HiSeq reads of TY2482. I get 1781 high quality SNP calls compared to Ec55989, see table here (using bwa & samtools to map reads and call SNPs). Most of these are in prophage regions (annotated in this artemis-readable file and flagged in the SNP table in the ‘mobile’ column); excluding these there are 564 probable SNPs (SNP table, artemis-readable file).

This is their distribution (dark blue = phage regions and IS elements; red = SNP calls overlapping with these mobile regions; green = 564 SNP calls not in these regions; inner ring=GC content):

Of the 1230 SNPs in mobile regions (i.e. subject to horizontal transfer or homologous recombination with typically horizontally transferred genes), there are 713 non-synonymous mutations (i.e. changes the encoded amino acid) and 424 synonymous changes (i.e. silient mutations, not affecting the encoded amino acid sequence)…while it is dodgy to do dN/dS on this kind of intraspecies data, it would be ~0.5, consistent with purifying selection.

For the 564 SNPs in non-mobile regions of the genome (i.e. not subject to horizontal transfer or homologous recombination with typically transferred genes), 284 are non-synonymous and 145 synonymous (dN/dS~0.6). Again it’s tempting to call this purifying selection, as it looks like mutations affecting protein function are being selectively removed, relative to those silent mutations. Also, 136 SNPs (24%) were in intergenic regions. Since only 13% of the chromosome is non-coding, this suggests that mutations in non-coding regions are better tolerated than those in coding sequences, again consistent with purifying selection against mutations affecting protein function.

There are 4,762 genes annotated in Ec55989, and only 564 SNPs…so the majority of genes don’t have any point mutations (although I haven’t looked at insertion/deletions yet, only subsitution mutations where one DNA base is exhanged for another). But do any genes have a high number of changes?

ibrA, an immunoglobulin-binding regulator, has 20 SNPs, mainly in the C-terminal domain. It’s quite likely this is a recombination event rather than diversifying selection, but deserves a closer look. A neighbouring gene, EC55989_3315 (conserved hypothetical) also has 18 SNPs, so recombination is a likely culprit.

The Ag43 gene, EC55989_3357, has 17 SNP calls, however we know from the assemblies that the outbreak strain has acquired divergent copies of Ag43 genes [see posts on analysis of integrated sequence, and biofilm genes], so these SNP calls probably reflect the differences between two distinct copies of Ag43 genes rather than point mutations acquired in the EC55989_3357 gene itself.

There are 6 SNPs in EC55989_4669, an Iha adhesin receptor. This could be real, or could be a divergent copy acquired by the outbreak strain.

Hypothetical proteins EC55989_3283, EC55989_3284, EC55989_3293, EC55989_3355, EC55989_4845, EC55989_4890, EC55989_4891, EC55989_4896, EC55989_4897 each have 3-4 SNPs. These are probably phage remnants or similar to phage genes, but some followup analysis could confirm this. There are 7 SNPs in shiA, a fragment of a gene similar to a shigella SHI-2 pathogenicity island gene. EC55989_4671 is an IS element that slipped through my filter, and contains 5 SNP calls.

Hi Kat. I’m surprised you find that many more SNPs. In your earlier table, there’s only 457 differences between the reference 55989 and the “Escherichia_coli_TY2482_hiseq2” assembly, 389 of which also agree with the H112180280 and LB226692 assemblies.

Could you say a bit more about how you used the reads to find these SNPs, and why they would not show up in the assembly-based SNPs you had in that table?

The sequencing errors in the HiSeq reads are less pernicious than the systematic indels you get from homopolymer errors with 454 or Ion Torrent, but they’re also more likely to show up as SNPs. That’s why I used the BGI v3 assembly to look for SNPs instead – most of the sequencing errors in the reads should be averaged away in the assembly.

I also required the same base call in the HPA and Life Tech assembly, which may be a bit overkill. In theory it’s possible you could miss a SNP because you happen to get a sequencing error in the same place in one of the other two assemblies, but that should be a fairly rare occurrence.

Overall, I think I’d prefer to put my faith in the 389 SNPs I found based on the assembled sequence, rather than the 1781 you get from the reads themselves.

I realise there are some gaps in the info I’ve provided, so I’ll try to fill them in here. There’s so much going on at the moment I must confess I lose track of what I’ve already posted and what I’ve not. Anyway.

The original assembly-based SNP calls by Konrad were using MUMmer to map contigs to Ec55989 and call SNPs, then merge these into one table. If a SNP was not called in LB226692 relative to Ec55989, it would be assigned the Ec55989 allele, although there isn’t direct evidence in the assembly that this is the allele. The SNPs were also filtered based on the MUMmer output, which reports how close the SNP locus is to other SNPs/indels and to the end of contigs – we used a conservative cutoff here to remove SNPs within 100bp of another variant or a contig end. This is because errors tend to pile up at contig ends since they are almost always covered a lower read depth, because we know SNPs close to an indel in 454 or Ion Torrent data are likely to be caused by sequencing errors in homopolymers and because we are not interested in recombinant regions (which interfere with the phylogenetic signal in our tree) so we don’t actually want regions with loads of adjacent SNPs. This is not well suited for analysing point mutations, but meets the goal of the analysis which was to see where the outbreak genome fits within the E. coli phylogeny.

To analyse point mutations, it’s much more reliable to go back to the reads themselves and call SNPs direct from read mapping. Here I’ve used bwa to map BGI’s Hiseq reads for TY2482 to Ec55989, and samtools to call SNPs (they all have read depth>100 and consensus quality>85). For the LB226692 Ion Torrent reads, I’ve used TMAP to do the mapping and samtools to call SNPs from the resulting bam file in the same was as for Hiseq reads.

This table shows the SNPs called in TY2482 using read mapping, with the corresponding calls in other genomes by read mapping (LB226692) and MUMmer analysis of assemblies used to check the validity of this. The table shows 1795 SNP calls in total (sheet 1), with 1675 of these called from read analysis of both TY2482 and LB226692. Of the remainder, 62 are uncertain in LB226692 (using read data, full SNP calls for LB226692 in sheet 2) and for 45 SNPs, LB226692 has the Ec55989 allele according to read mapping (using the consensus call produced from the reads by samtools). For 12 SNPs, the allele in LB226692 called from reads is different from the alleles in both TY2482 and Ec55989 (ie it has its own, third, allele), which could be homoplasy but more likely errors in the LB226692 SNP calls near homopolymers. Now obviously the problem with analysing short reads directly is that there will be lots of reads that look the same but are derived from different copies of homologous sequence in the genome…so you will get false SNP calls in stretches of sequences which are present in multiple slightly divergent copies in the genome, i.e. our mobile elements like transposases and phage. So, looking in the non-mobile sequences only (565 SNPs), 528 of these are called as SNPs (from reads) with identical alleles in both TY2482 and LB226692, 11 have an unknown allele in LB226692, 1 has different alleles in the two novel genomes and 24 are called as TY2482-specific SNPs (with LB226692 allele matching Ec55989).

I don’t have reads available for the HPA genome, only scaffolds. But using MUMmer to align the scaffolds to the Ec55989 genome (sheet 3; unfiltered calls), of these 528 SNP calls in non-mobile regions (on which the reads for TY2482 and LB226692 agree) 447 have identical SNP calls in the HPA genome. The remaining 81 have no SNP called by MUMmer in the HPA genome…which doesn’t necessarily mean there is no SNP in the HPA genome, but that it was not called by MUMmer. You’d have to check using a fasta or blast search of the surrounding region in the HPA genome to address this properly.

So, to analyse the point mutations in the outbreak strain compared to Ec55989, at this point I would use the column 8 of the attached spreadsheet, which gives the outbreak allele for 447 SNPs that were called in (a) TY2482 HiSeq reads, (b) LB226692 Ion Torrent reads and (c) the HPA scaffold (see this table, non-mobile).