FluShuffle and FluResort: new algorithms to identify reassorted strains of the influenza virus by mass spectrometry.

School of Molecular Bioscience G-08, The University of Sydney, Sydney, NSW, 2006, Australia.

Abstract

BACKGROUND:

Influenza is one of the oldest and deadliest infectious diseases known to man. Reassorted strains of the virus pose the greatest risk to both human and animal health and have been associated with all pandemics of the past century, with the possible exception of the 1918 pandemic, resulting in tens of millions of deaths. We have developed and tested new computer algorithms, FluShuffle and FluResort, which enable reassorted viruses to be identified by the most rapid and direct means possible. Rapid identification of reassorted influenza viruses, and of other reassorted viruses, allows prevention strategies and treatments to be implemented more efficiently.

RESULTS:

The FluShuffle and FluResort algorithms were tested with both experimental and simulated mass spectra of whole virus digests. FluShuffle considers different combinations of viral protein identities that match the mass spectral data using a Gibbs sampling algorithm employing a mixed protein Markov chain Monte Carlo (MCMC) method. FluResort uses those identities to calculate the weighted distance of each across two or more different phylogenetic trees constructed from viral protein sequence alignments. Each weighted mean distance value is normalized by conversion to a Z-score to establish whether a strain is reassorted.
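The normalization step can be illustrated with a minimal sketch. It is not the FluResort implementation: the function names, the use of identity probabilities as weights, and the choice of a background sample of distances from which to derive the Z-score are all assumptions made for illustration only.

```python
import numpy as np


def weighted_mean_distance(distances, weights):
    """Probability-weighted mean of tree distances.

    distances: tree distances (substitutions per site) for each candidate
               protein identity (hypothetical input).
    weights:   probability assigned to each candidate identity, e.g. from a
               sampler such as FluShuffle (assumed here; need not sum to 1).
    """
    distances = np.asarray(distances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if distances.shape != weights.shape:
        raise ValueError("distances and weights must have the same shape")
    total = weights.sum()
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return float(np.sum(weights * distances) / total)


def z_score(value, background):
    """Convert a weighted mean distance to a Z-score.

    background: distances assumed to represent non-reassorted strains
                (an illustrative assumption, not taken from the paper).
    Uses the sample standard deviation (ddof=1).
    """
    background = np.asarray(background, dtype=float)
    spread = background.std(ddof=1)
    if spread == 0:
        raise ValueError("background distances have zero variance")
    return float((value - background.mean()) / spread)


# Hypothetical example: three candidate identities for one protein,
# with distances measured in the tree of a second protein.
candidate_distances = [0.02, 0.03, 0.40]
candidate_probabilities = [0.6, 0.3, 0.1]
mean_distance = weighted_mean_distance(candidate_distances, candidate_probabilities)
```

Under these assumptions, a large positive Z-score would indicate that the protein identities sit unusually far apart across the trees, which is the pattern expected for a reassorted strain. Any threshold applied to the Z-score would need to be chosen and validated separately.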

CONCLUSIONS:

The new FluShuffle and FluResort algorithms can correctly identify, from the high-resolution mass spectral data of whole virus proteolytic digests, the origins of influenza viral proteins and the number of reassortment events required to produce the strains. This has been demonstrated for constructed vaccine strains as well as common human seasonal strains of the virus. The algorithms significantly improve the capability of the proteotyping approach to identify reassorted viruses that pose the greatest pandemic risk.

Phylogenetic tree for the hemagglutinin protein (H1 subtype) with colouration of its predicted identity within the PanVax strain. Irrelevant clades have been collapsed for clarity. A scale bar is shown that represents distance as substitutions per site. The location of the expected strain origin (A/California/07/2009) is labelled and the sum of probabilities for its clade of close relatives is shown in brackets as a percentage. The location of the A/Puerto Rico/08/1934 strain is also labelled.

Phylogenetic tree for the nucleoprotein for influenza type A with colouration of its predicted identity within the PanVax strain. The location of the expected origin (A/Puerto Rico/08/1934) is labelled and the sum of probabilities for its clade of close relatives is shown in brackets as a percentage. The location of the A/California/07/2009 strain is also labelled.

Phylogenetic tree for the hemagglutinin protein (H1 subtype) with colouration of its predicted identity within the type A/Solomon Islands/03/2006 strain. The location of the expected identity is labelled and the sum of probabilities for its clade of close relatives is shown in brackets as a percentage. The clade of closely related sequences with the greatest sum of probabilities is marked in bold. The clade containing seasonal H1N1 strains is also shown with its sum of probabilities.