The field of metagenomics and whole community sequencing is a promising area to unravel the content of microbial communities and their relationship to disease and antimicrobial resistance in the human population. Bioinformatic tools are extremely important for making sense out of metagenomics data, by estimating the presence of pathogens and antimicrobial resistance determinants in complex samples. Combined with relevant explanatory data, metagenomics is a powerful tool for surveillance.
In this course, we teach about the potential of metagenomics for surveillance and give the learners an overview of the steps and considerations in a metagenomics study. After this course, the learners will know:
- the difference between the concepts of metagenomics and other microbial genomics
- the need to use controls in different steps of a metagenomics study
- the advantages of metagenomics for the surveillance of antimicrobial resistance
- how sampling design, sample size, sample material and sample handling influence the outcome of a metagenomics study
- sample processing for bacterial and viral metagenomics
- different sequencing platforms and their possibilities regarding metagenomics
- the steps involved in a general metagenomics study, including quality control, mapping to different databases, and read count analysis
- the principles behind various tools available for analysis of metagenomics data
- how to interpret read classification results
- the need for epidemiology in surveillance
- the concept of global and integrated surveillance
- the challenges for the use of metagenomics in surveillance
- the potential of metagenomics for surveillance
We look forward to welcoming you!

Interpretation of results and potential of metagenomics for surveillance

In this module you will learn about two different approaches to analyze and interpret sequence reads: classification and assembly. You will see examples of methods to visualize read counts and to analyze metagenomics data together with explanatory data. Finally, you will learn about the potential of metagenomics for the development of a future global and integrated surveillance, and the challenges you may encounter during that process.

Ana Sofia Ribeiro Duarte

Tine Hald

Sünje Johanna Pamp

Patrick Munk

Liese Van Gompel

Valeria Bortolaia

Pimlapas Leekitcharoenphon

Transcript

[MUSIC] >> So we start with statistical analysis to describe the distribution of your data, your community composition. Here you have an example of a study where they collected samples from different environments: soil, ocean, human feces, and chicken gut, among others. And here they describe the diversity of resistance genes and the abundance of resistance genes with a stacked bar plot, where the first color represents diversity in this particular sample, and the lighter gray represents abundance. So this is the relative abundance of resistance genes among all genes. Then you have a Venn diagram showing how many resistance genes occur in only one of the environments and how many occur in two or three of them. So it shows the overlap in terms of resistome between these three different environmental samples. And then finally you have the very well-known boxplots. In this case they are used to depict the distribution of four different things. If we look at the first one as an example, this describes the overall occurrence, the relative percentage, so it's how many of the annotated sequences among the total were mapped to genes that in this case confer resistance to tetracycline. And you can see that among all human samples there's a higher relative percentage of genes conferring resistance to tetracycline compared to the samples collected from soil, or the samples collected from oceans. So this is one example. This is another example from another study. And this is to show that in this study they have used a combination of many of the methods that I have mentioned here. You will see this in many, many studies, and it often makes sense because these methods complement each other. So what one method shows will complement what another method shows. 
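The quantities behind these plots (relative abundance, richness, and a diversity index) can be sketched in a few lines of Python. This is only an illustration: the read counts are made up, and the Shannon index is an assumed choice, since the lecture does not say which diversity index the studies used.

```python
import math

def relative_abundance(counts):
    """Convert raw read counts per gene into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: n / total for gene, n in counts.items()}

def richness(counts):
    """Number of distinct genes observed at least once."""
    return sum(1 for n in counts.values() if n > 0)

def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over genes with p > 0."""
    props = relative_abundance(counts)
    return -sum(p * math.log(p) for p in props.values() if p > 0)

# Hypothetical read counts per resistance gene (illustration only).
human_feces = {"tetW": 50, "tetQ": 30, "ermB": 20}
soil = {"tetW": 1, "vanA": 1, "blaTEM": 1, "sul1": 1}

for name, sample in [("human feces", human_feces), ("soil", soil)]:
    print(name, richness(sample), round(shannon_index(sample), 3))
```

In this toy data the soil sample has few reads but an even spread over four genes, so its diversity is higher even though its abundance is lower. That is why diversity and abundance are usually reported side by side.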
So if you want to show the full picture of your study, if you're interested in a full description of your data, then you will very likely have to combine different methods. And in this case, you will see that they gave the actual values for the richness and diversity indexes. Then they describe, with a bar plot, the relative abundance among the total of genes conferring resistance to different antimicrobial classes. Here you have again a Venn diagram showing the overlap between different types of samples. Here you have a heatmap which shows the abundance of different taxa for each sample. And here you have a stacked bar plot, which shows the taxonomic diversity for each sample. So you can mix and match them in the way you find most appropriate to convey your message. Still on how your data is distributed, you can use ordination analysis to do this. Ordination analysis, or gradient analysis, is a type of multivariate analysis, which means that for each sample you have several outcomes. For example, for one sample you may want to analyze the abundance of several genes, several resistance genes. These are two different types of ordination analysis. Here you have the canonical correspondence analysis. So here you have your data represented in two dimensions. And in parentheses, as you will often find, is the percentage of variation in your data that each of the dimensions explains. And here you have a very neat example of how this kind of analysis can really find meaningful clusters in your data. This is the bacterial species abundance of patients with inflammatory bowel disease and healthy individuals. And you can see that the healthy individuals cluster apart from individuals with inflammatory bowel disease, but closer to individuals with ulcerative colitis than to individuals with Crohn's disease. 
So there is clearly a difference in terms of the bacterial species abundance between those two groups of patients, even though the clinical manifestation might be similar. And finally, you can use network analysis. In network analysis, you investigate the structure of relationships within your data. In network analysis, nodes are the entities you have in your network. And then ties, edges, or links are the relationships and interactions between your nodes. So what you analyze in network analysis is how the different nodes interact, whether they usually occur together or whether they usually mutually exclude each other. In this example from this study, the results are summarized in terms of how different species are either co-present in a sample or mutually exclude each other. But ultimately, the very nice feature of network analysis is that you can graphically represent this network structure. And by doing so, you may again find clusters that point you towards possible explanatory variables for your data. So in this study, there were sequences from food products and their bacterial composition. And you can see some clear differences, for example, between the fresh beef product at zero days and the beef product at six days. You probably expect it, but now you can clearly see that the bacterial composition of the meat changes, and in particular how it changes and what species compose each of the two products. The same was done for different types of sourdough and for different cheese and milk products. So that was a quick snapshot about characterizing your community. Now we continue with finding determinants for what you find in your community. A possible way is to do a Spearman's rank correlation coefficient analysis. It's a bit different from the Pearson correlation coefficient because it analyzes the correlation between the ranks of the values of the variables. But what you see in the end is a correlation; in this case, it's between different genera. 
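To make the difference between Spearman and Pearson concrete, here is a minimal sketch in plain Python. It ranks the values (averaging ranks for ties) and then computes an ordinary Pearson correlation on the ranks. The abundance values are invented for illustration and do not come from the study shown in the lecture.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical abundances of two genera across five samples.
genus_a = [1, 2, 3, 4, 5]
genus_b = [1, 4, 9, 16, 100]
print(round(pearson(genus_a, genus_b), 3), round(spearman(genus_a, genus_b), 3))
```

Because the second genus always increases when the first does, the rank correlation is perfect, while the Pearson coefficient is pulled down by the one very large value.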
But you can also do it between different resistance genes, or between resistance genes and certain bacteria from the microbiome, for example. Then you can use a regression analysis, which might be, as I said before, a bit cumbersome, because, if you think of antimicrobial resistance, when you do metagenomics, you end up with a resistome, so many, many resistance genes characterized in your sample. And if you want to run a regression analysis with determinant variables, you might need to summarize your data to either a resistance class or total resistance, or run a separate regression analysis for each of the resistance genes you're interested in. But still, you can do something and find some interesting results. And then you can also do a meta-analysis, if you suspect that, for example, there's a certain driving factor that may have an influence on the variation of your data, but you cannot quite explain why. And this can be, for example, the country of origin of a sample, if you're running a multi-country study. In this case, it's a regression analysis, the lines you see here. And it characterizes how the total drug use, the total antimicrobial use, is associated with the total antimicrobial resistance found in terms of abundance of resistance genes in the sample. And this is for pig feces, this is for poultry feces. And the different colors represent different countries. And you can see some trend. But again there's a large spread, and there's also the fact that you have to summarize your data, both the use and the resistance, to a total. This is a forest plot from a meta-analysis, and what this shows is also an association, between the use of beta-lactam agents and the total beta-lactam resistance. And this is your global estimate, the summary estimate for the overall association between one and the other. 
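The summarizing step and the regression line described above can be sketched as follows. This is a simplified illustration, not the model from the study: the gene-to-class mapping and all numbers are hypothetical, and an ordinary least-squares line is assumed, with no country effect.

```python
def total_by_class(gene_abundance, gene_to_class):
    """Sum gene-level abundances into one total per antimicrobial class."""
    totals = {}
    for gene, value in gene_abundance.items():
        cls = gene_to_class[gene]
        totals[cls] = totals.get(cls, 0.0) + value
    return totals

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x has no variation, slope is undefined")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

# Hypothetical resistome of one sample, summarized to class totals.
sample = {"tetW": 3.0, "tetQ": 2.0, "ermB": 1.5}
classes = {"tetW": "tetracycline", "tetQ": "tetracycline", "ermB": "macrolide"}
print(total_by_class(sample, classes))

# Hypothetical totals per farm: antimicrobial use vs. resistance abundance.
total_use = [10, 20, 30, 40]
total_resistance = [25, 45, 65, 85]
slope, intercept = fit_line(total_use, total_resistance)
print(slope, intercept)
```

Summarizing to a total makes a single regression possible, but it hides which genes or classes drive the association. That is the trade-off mentioned in the lecture.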
And then, because you have a random effect for country, for each country you have the effect for that country represented by the dot in the middle, or the square in the middle. And then you have the confidence interval for that country-specific estimate, which depends on how uncertain you are about the data from that country. And then the overall picture is summarized in your summary estimate. However, you have a detailed insight into how exactly each country is contributing to that summary estimate. Another method you can use to find determinants is machine learning, in particular classification models. In classification models you try to find a rule, or the model tries to find a rule, that maps your variables, the predictor variables, which are called features, to a certain target. The features could be, for example, the abundance of different resistance genes in a certain bacterial isolate. And your target could be, for example, whether an isolate is susceptible or resistant to a certain antimicrobial agent. There are different types of algorithms you can choose to find this rule for you. If you want this kind of model that finds a classification for your sample, like resistant or susceptible, you use a classification algorithm, and in the end, given a certain resistome, a certain pattern of resistance genes, and an unknown outcome, it will predict the outcome of your sample. So for a sample, an isolate that has not been tested in the lab for susceptibility or resistance to an antimicrobial agent, the outcome can, in the end, be predicted using a good machine learning model. Here's an example where a data mining algorithm was used to find patterns, find text structures, artificial ones, in a DNA sequence, and with that predict Salmonella serotype. As I said in the beginning, this is a snapshot. There are many more methods available. 
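As a toy illustration of the feature-to-target idea, here is a nearest-centroid classifier in plain Python. It is deliberately one of the simplest possible classification algorithms and is not the method used in the studies mentioned. The gene abundances and labels are invented, and a real model would need far more data and proper validation.

```python
def fit_centroids(samples, labels):
    """Learn one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical features: abundance of three resistance genes per isolate,
# in the order [tetW, tetQ, ermB]. Target: phenotype for tetracycline.
train_features = [[5, 3, 0], [4, 4, 1], [0, 0, 2], [0, 1, 1]]
train_labels = ["resistant", "resistant", "susceptible", "susceptible"]

model = fit_centroids(train_features, train_labels)

# An isolate that was never tested in the lab: predict from its gene pattern.
untested_isolate = [3, 2, 0]
print(predict(model, untested_isolate))
```

The learned "rule" here is simply: assign the class whose average gene pattern looks most like the new isolate. More capable algorithms learn more flexible rules, but the mapping from features to target is the same idea.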
And in this publication, you find a good overview of different methods that can be used to visualize and analyze metagenomics data. A little bit about the challenges of interpreting metagenomics data. We have mentioned throughout the course that along the workflow of a metagenomics project, you may introduce biases in your data, from when you collect your sample until you analyze it with a bioinformatics algorithm. Also, when you analyze a sample and you want to find pathogens, pathogenic bacteria, they are usually in very low numbers compared to the dominant microbiota. And this may of course be a disadvantage for surveillance. If you're targeting genes like antimicrobial resistance genes, eventually you might want to identify where the genes are harbored, in which bacteria, and whether they really constitute a hazard to public health or not. And this leads us also from presence to function, so in the end we don't only want to identify that a certain pathogen or resistance gene is present in a sample. We also want to know whether it's being expressed or not. And this leads us to the need for further methods, not only mapping the reads, but also doing assembly as Patrick has explained, and also using other methods beyond metagenomics, like metatranscriptomics, metaproteomics, and metametabolomics. And here you find a quick overview of what each of these methods means. As an example, in food safety they've come up with a term called foodomics, which includes not only metagenomics, but also all of these that I have just mentioned. And here is an overview of how the number of publications has increased in recent years, predominantly in genomics, but the other omics are clearly following the trend. And here you find an overview of all the references that I've used during this lecture, which I hope you found useful. And that's all for today, so thank you very much for watching. [MUSIC]