The rapid advancement of sequencing techniques, coupled with new bioinformatics methodologies for large-scale data analysis, is providing exciting opportunities to understand microbial communities from a variety of environments to a degree that was previously unimaginable.

This book provides invaluable, up-to-date and detailed information on various aspects of bioinformatics data analysis with applications to microbiology. It describes a number of useful bioinformatics tools, makes links to some wet-lab techniques, explains different approaches to tackling a problem, addresses current challenges and limitations, gives examples of applications of bioinformatics methods to microbiology, and discusses future trends. The chapters include topics such as genome sequencing techniques, assembly, SNP analysis, annotation, comparative genomics, microbial community profiling, metagenomics, phylogenetic microarrays, barcoding and more. Each chapter is written by experts in the field and is peer-reviewed.

Bioinformatics and Data Analysis in Microbiology is an essential book for researchers, lecturers and students involved in microbiology, bioinformatics and genome analysis.

Table of contents

1. Understanding the Unseen Majority Around us: An Overview of Microbiological Technologies

Meesbah Jiwaji, Gwynneth F. Matcher and Rosemary A. Dorrington

Of all the living organisms on the planet, microorganisms are the most numerically abundant and diverse. Despite their ubiquity, researchers have only begun to understand the diversity profiles, metabolic functioning and potential economic value of these organisms. Classical investigation of microorganisms involves the culture and study of selected microbes in the laboratory setting. While this approach has yielded much information, it has two major drawbacks. Firstly, most microbes present in the environment are unculturable using currently available media and methodologies and, secondly, culture-based studies focus on one, often attenuated, isolate or species and/or a set of genes at a time. To overcome these problems, researchers have focussed on the development of new technologies that yield large, reliable and robust datasets in fields that include genomics, transcriptomics and proteomics. Importantly, the development of high-throughput sequencing technologies has dramatically advanced the analysis of microbial species diversity and their functioning within ecosystems. The large volumes of information-rich data require intelligent, and often repetitive, computational analysis, stressing the need for development of suitable bioinformatics analysis tools. This chapter provides an overview of microbes and why we need to understand them, as well as the methods applied to studying microbiota within ecosystems.

2. Prokaryotic Genome Sequencing and Assembly

Morag Graham, Gary Van Domselaar and Paul Stothard

Researchers can now readily obtain millions of sequence reads from the genomes of their favourite prokaryotic organisms thanks to the development of next-generation sequencing technologies. Through sequence assembly, it is possible to reconstruct large portions of a genome from the overlapping sequence reads. However, assembly is challenging because the sequence reads are generally quite short and genomes often contain internally repeated segments that may confound the complete reconstruction of a genome from its constituent reads. There are different approaches for addressing these challenges that involve, for example, more advanced assembly tools, reference genome sequences, and directed follow-up sequencing. Regardless of the strategy employed there are many steps and programs involved, and the final outputs need to be annotated and interpreted with the known shortcomings of the data and methodologies in mind.
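
The idea of reconstructing a genome from overlapping reads can be illustrated with a minimal sketch. It is a toy greedy overlap merge, not the method used by any particular assembler or by the chapter authors. The function names, the `min_overlap` parameter and the example reads are all hypothetical, and real tools must also handle sequencing errors, reverse complements and repeats.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Toy illustration only: assumes error-free reads from a single strand.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads drawn from the made-up sequence ATGGCGTACCTTA
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
print(greedy_assemble(reads))
```

When a genome contains internal repeats longer than the reads, several merges can have equally good overlaps, and a greedy choice like this one may join the wrong pair. That is one reason reference sequences and follow-up sequencing are used in practice.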

The transformation of DNA sequencing technologies has enabled more powerful and comprehensive genetic profiling of microbes. The sheer number of informative loci provided by genome-sequencing allows the investigation of structural variation and horizontal gene transfer as well as delivering novel insights into genetic origins, evolution and epidemiological history. Microbial genomes can be sequenced en masse at high coverage but have associated challenges of high mutation rates and low conservation of genome structure. Consequently, detecting changes in DNA sequences requires a nuanced approach specific to the organism, availability of similar genomes, and types of variation. Here, we outline the high power of genome-sequencing to detect a wide scope of polymorphism classes. Samples without related species on which to scaffold a genome sequence require specific assembly methods that can be enhanced by progressive procedures for improvement. Polymorphism identification depends on genome structure, and error rates in closely related specimens can be reduced by incorporating population-level information. The development of genome analysis platforms is hastening the optimisation of variant discovery and has direct applications for pathogen surveillance. Robust variant screening facilitates more sensitive scrutiny of population history, including the origin and emergence of infectious agents, and a deeper understanding of the selective processes that shape microbial phenotypes.

4. Prokaryotic Genome Annotation

Gary Van Domselaar, Morag Graham and Paul Stothard

Genome annotation is the process of identifying the important features contained within a genome sequence and attaching relevant biological information to those features. Typically one of the first steps to be applied after sequencing a new genome, annotation involves the coordinated application of a variety of software tools and analysis techniques. An understanding of the tools, databases, computational methods, and available pipelines used to generate genome annotations is necessary to assess their accuracy and their appropriateness for downstream applications. In this chapter we focus on the computational methods that have been developed for annotating bacterial and archaeal genomes. We then survey the popular pipelines that incorporate these methods to generate high quality annotated prokaryotic genomes.

Microbial pathogens are responsible for a significant proportion of mortality in humans. Although there is unlikely to be a single set of virulence genes common to all pathogens, comparative genomic analysis of pathogens and non-pathogens can shed light on which genes may be required for the pathogenic lifestyle and which genes are unique to a certain species or genus. Thousands of microbial genomes have been completely sequenced and annotated, providing the opportunity for multi-way comparisons. When comparing genomes of even closely related species, the genomes appear to be composed of a core set of genes common to a variety of organisms, a set of genes common to closely related organisms only, and finally a set of genes unique to a species or even to certain strains. In this chapter we review some of the methods for comparative analysis of microbial genomes and provide some results of genes unique to a species and genus using the Mycobacteria as an example.
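
The partition into core, shared and unique genes described above can be sketched with plain set operations. This is an illustrative assumption about how gene content might be represented, namely one set of gene-family labels per genome. The strain names, gene labels and the `partition_genes` helper are hypothetical and are not taken from the chapter.

```python
def partition_genes(
    genomes: dict[str, set[str]],
) -> tuple[set[str], dict[str, set[str]]]:
    """Return (core genes, genes unique to each genome).

    `genomes` maps a genome name to its set of gene-family labels and
    must contain at least one entry.
    """
    core = set.intersection(*genomes.values())
    unique = {}
    for name, genes in genomes.items():
        # Union of the gene content of every other genome
        others = set().union(
            *(g for other, g in genomes.items() if other != name)
        )
        unique[name] = genes - others
    return core, unique


# Hypothetical gene content for three made-up strains
genomes = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g5"},
}
core, unique = partition_genes(genomes)
print(sorted(core))
print({name: sorted(genes) for name, genes in unique.items()})
```

Genes that are neither core nor unique (here `g2`, present in two of the three strains) correspond to the set shared only among a subset of related organisms.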

6. Microbial Community Profiling: Current Approaches and Future Trends

Angel Valverde, Pieter De Maayer and Don A. Cowan

Microorganisms are vital to the function of all ecosystems. This is largely because they exist in enormous numbers and they have immense cumulative mass and activity. In this chapter we focus on one of the two main families of genomic methods that have been used to examine natural microbial populations and communities: fingerprinting technologies. Firstly, we introduce several fingerprinting techniques and discuss their strengths and limitations. Secondly, we describe the construction of phylogenetic trees and several multivariate and statistical tools used in interpreting the observed diversity patterns in microbial communities. Finally, we discuss some of the long-standing unresolved questions and future perspectives in the field of microbial ecology.

Metagenomics aims to estimate the organismal composition and metabolic potential encoded in genetic material obtained from microbial communities. The ultimate goal is to correlate genetic information with environment/host specific meta-data to discover genetic biomarkers of disease, health, and environmental change/adaptation. The power of investigating whole microbial communities and the direct application of sequencing without a need for prior cultivation, in combination with increasingly efficient sequencing technologies, have made such studies commonplace. This chapter provides an overview of metagenomic research emphasizing two commonly used experimental approaches: (1) marker gene sequencing (including the 16S rRNA gene) and (2) whole genome shotgun (WGS) sequencing. We exemplify these approaches by focusing on two studies we have worked on extensively: the National Institutes of Health (NIH) funded Human Microbiome Project (HMP) and a Baltic Sea study. In particular, we discuss experimental design aspects, preprocessing of sequence data, sequence assembly, constructing gene catalogs, and estimating microbial community composition and metabolic potential. Wherever appropriate, we describe normalization methods to avoid systematic biases, and describe a selection of suitable statistical methodology for exploratory multivariate and differential abundance analysis. We conclude with a section on cloud computing to facilitate on-demand metagenomic analysis, including a review of effective bioinformatics software and future trends.

8. Human Microbiome Analysis via the 16S rRNA Gene

Jonathan McCafferty and Anthony Fodor

The human-associated microbiota has been linked to an ever-expanding set of diseases including obesity, cancer and inflammatory bowel disease. While the decreasing cost of sequencing is making whole-genome metagenomic shotgun sequencing more feasible, 16S rRNA-based sequencing studies remain the most commonly utilized method to characterize a microbial community. In this review, we consider different methods to characterize a mixed microbial community by examination of the 16S rRNA gene. We discuss older, low-resolution methods such as Terminal Restriction Fragment Length Polymorphism (T-RFLP) and Automated Ribosomal Intergenic Spacer Analysis (ARISA), which yield low-cost "snapshots" of the microbial community that can be generated rapidly. We next consider current high-throughput sequencing technology from 454 Life Sciences and Illumina. These techniques generate large amounts of data, and careful consideration must be given to how low-quality sequences and PCR chimeras are removed from downstream consideration. We examine algorithms for clustering sequences into Operational Taxonomic Units (OTUs) and for assigning taxonomy. Finally, we consider methods for assigning statistical significance to differences between microbial communities.
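
To make the idea of clustering sequences into OTUs concrete, here is a minimal sketch of greedy, seed-based clustering. It is a simplification under stated assumptions, not the algorithm of any specific tool reviewed in the chapter: sequences are assumed to be pre-aligned to equal length, identity is a simple fraction of matching positions, and the 0.97 default threshold is a commonly used convention rather than a value given in the text. The function names are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otus(sequences: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Assign each sequence to the first OTU whose seed it matches.

    The first member of each OTU acts as its seed. A sequence that matches
    no existing seed at >= threshold starts a new OTU.
    """
    otus: list[list[str]] = []
    for seq in sequences:
        for otu in otus:
            if identity(seq, otu[0]) >= threshold:
                otu.append(seq)
                break
        else:
            otus.append([seq])
    return otus


# Hypothetical 10-base fragments, clustered at a 90% threshold for illustration
example = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
print(greedy_otus(example, threshold=0.9))
```

Because each sequence is compared only with OTU seeds, the result depends on input order, which is one reason read ordering and the removal of low-quality sequences and chimeras before clustering matter.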

9. Phylogenetic Microarrays

Oleg Paliy, Vijay Shankar and Marketa Sagova-Mareckova

Environmental microbial communities are known to be highly diverse, often comprising hundreds or thousands of different species. The great complexity of these populations, together with the fastidious nature of many of the microorganisms, makes culture-based techniques both inefficient and challenging for studying these communities. The analysis of such communities is best accomplished by the use of high-throughput molecular methods such as phylogenetic microarrays and next generation sequencing. Phylogenetic microarrays have recently become a popular tool for the compositional analysis of complex microbial communities, owing to their ability to provide simultaneous quantitative measurements of many community members. This chapter describes the currently available phylogenetic microarrays used in the interrogation of complex microbial communities, the technology used to construct the arrays, as well as several key features that distinguish them from other approaches. We also discuss optimization strategies for the development and usage of phylogenetic microarrays as well as data analysis techniques and available options.

10. Genetic Barcoding of Bacteria and its Microbiology and Biotechnology Applications

A wide variety of genetic data about organisms of interest has become available with the advent of next generation sequencing (NGS). For many potential new users, processing the huge amount of genetic data produced by NGS and using this information to resolve practical questions remains a challenge. Genetic barcoding of microorganisms is the first obvious area where NGS has met the requirements of applied microbiology. In general, barcoding in microbiology is a comparative genomic approach to differentiate between species or strains that are hard to distinguish by traditional methods. In this chapter, we introduce the conceptual background of bacterial barcoding and present several basic bioinformatics tools and approaches to provide solutions to NGS data handling. When working with a putative industrial strain or potentially hazardous pathogen, the following questions arise: (i) is this strain unique and, if so, what makes it unique genetically or practically speaking; (ii) how can it be detected in the environment; (iii) are there any genetic markers for its extraordinary activity? The possibility of barcoding whole bacterial communities is considered, and both the benefits and limitations of traditional 16S rRNA-based barcoding and multi-locus sequence typing are discussed.