Bioinformatics Seminar

The bioinformatics network aims to promote bioinformatics research at Kiel University and its partner institutes by supplying a framework for an inter-disciplinary scientific exchange and an inter-faculty education program.

Monthly seminars provide a meeting point for scientists and students and are open to everyone who is working on or interested in bioinformatics. Topics at the monthly seminars cover all areas of bioinformatics, including theory, method development, and data analysis in the fields of genomics, transcriptomics, metagenomics, population genomics, systems biology, and biostatistics. Seminars are announced here and also by email to the bioinf mailing list.

Next seminar:

9. 11. 2018 (Friday) 15:00

Johannes Zimmermann

Institute of Experimental Medicine, CAU Kiel

Fishing with metabolic networks - crafting, catching, curating

Metabolic networks are repositories of knowledge about the metabolic processes that occur in an organism. They are successfully used to examine various phenomena at an integrative pathway level rather than at the gene level only. In my talk, I want to give an introduction to metabolic network theory, focusing on the construction, analysis and curation of such networks. I will discuss my own contributions and, finally, the application of metabolic networks to community modeling, highlighted by the metaorganism paradigm.

Machine learning methods, and in particular random forests, are promising approaches for classification and regression based on omics data sets. I will first describe in layman's terms how the random forest algorithm works and how a prediction model can be built. However, these complex models are not easy to interpret, and one strategy for better understanding is variable selection, i.e. the identification of variables that are important for prediction. In the second part of my talk I will then present our novel method called surrogate minimal depth (SMD). It is based on the structure of the decision trees in the forest and additionally takes into account relationships between variables. In simulation studies we showed that correlation patterns can be reconstructed and that SMD is more powerful than existing variable selection methods. Thus, SMD is a promising approach to get more insight into the complex interplay of predictor variables and outcomes in a high dimensional data setting.
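As a rough illustration of the idea of using tree structure for variable selection, the following sketch computes the plain minimal depth of each variable, i.e. the depth of its shallowest split, averaged over a toy forest. It is not the SMD method itself: the surrogate relationships between variables are not modelled, the tuple-based tree encoding and the function names are hypothetical, and averaging only over the trees in which a variable appears is a simplifying assumption.

```python
from collections import deque

# Hypothetical toy encoding: a tree is either None (a leaf) or a tuple
# (variable, left_subtree, right_subtree).

def minimal_depths(tree):
    """Return {variable: depth of its shallowest split}; the root has depth 0."""
    depths = {}
    queue = deque([(tree, 0)])
    while queue:
        node, depth = queue.popleft()
        if node is None:
            continue
        variable, left, right = node
        # Breadth-first order guarantees the first visit is the shallowest one.
        depths.setdefault(variable, depth)
        queue.append((left, depth + 1))
        queue.append((right, depth + 1))
    return depths

def mean_minimal_depth(forest):
    """Average each variable's minimal depth over the trees in which it occurs."""
    totals, counts = {}, {}
    for tree in forest:
        for variable, depth in minimal_depths(tree).items():
            totals[variable] = totals.get(variable, 0) + depth
            counts[variable] = counts.get(variable, 0) + 1
    return {v: totals[v] / counts[v] for v in totals}

forest = [
    ("x1", ("x2", None, None), ("x3", None, None)),
    ("x1", ("x3", ("x2", None, None), None), None),
    ("x2", ("x1", None, None), None),
]
result = mean_minimal_depth(forest)
```

In this toy forest, a variable that tends to split close to the root (here `x1`) receives a small mean minimal depth, which is the sense in which shallow variables are treated as important.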

7. 6. 2019 (Friday) 15:00
N.N.

Archive:

1. 10. 2018 (Monday) 15:00 Dan Graur, University of Houston

Something Old, Something New, Something Borrowed, Something Blue: Applying the Concept of Mutational Load to Genomic Sequences to Determine an Upper Limit on the Functional Fraction of the Human Genome

For the human population to maintain a constant size from generation to generation, an increase in fecundity must compensate for the reduction in the mean fitness of the population caused by deleterious mutations. The required increase depends on the deleterious-mutation rate and the number of sites in the genome that are functional. These dependencies and the fact that there exists a maximum tolerable replacement level fertility (e.g., humans cannot have 100 children) allow us to estimate an upper limit for the fraction of the human genome that can be functional. By estimating the fraction of deleterious mutation out of all mutations in known functional regions, we conclude that the fraction of the human genome that can be functional cannot exceed 25%, and is almost certainly much lower.

The German Network for Bioinformatics Infrastructure (de.NBI) is a national initiative funded by the Federal Ministry of Education and Research (BMBF). The mission of the de.NBI initiative is (i) to provide high-quality bioinformatics services to users in basic and applied life sciences research from academia, industry and biomedicine; (ii) to offer bioinformatics training to users in Germany and Europe through a wide range of workshops and courses; and (iii) to foster the cooperation of the German bioinformatics community with international network structures. The infrastructure network was launched by the BMBF in March 2015 and, after two national calls, now includes 40 service projects operated by 30 project partners that are organized in eight service centers. Scientists from Kiel University and from the Fritz Lipmann Institute Jena joined the de.NBI network as associated partners in 2017. The staff of de.NBI maintains and further develops almost 100 bioinformatics services for the human, plant and microbial research fields and provides comprehensive training courses to support users with different expertise levels in bioinformatics. The network is currently expanding its activities to the European level, as the de.NBI consortium was assigned by the BMBF to establish and run the German node of ELIXIR, the European life-sciences Infrastructure for biological Information. Like de.NBI on the national level, ELIXIR-DE is coordinated from Bielefeld University and includes over twenty partner institutes across Germany.

My PhD focused on developing computational and statistical methods in the field of molecular evolution. In my talk I will give a brief overview of my work, which dealt with various aspects of sequence analysis. I will then present in greater detail one of the projects, TraitRateProp. TraitRateProp is a probabilistic method that allows testing whether the rate of sequence evolution is associated with changes in a binary phenotypic character trait. The method further allows the detection of specific sequence sites whose evolutionary rate is most noticeably affected following the character transition, suggesting a shift in functional/structural constraints. TraitRateProp was first evaluated in simulations and then applied to study the evolutionary process of plastid plant genomes upon a transition to a heterotrophic lifestyle.
Finally, I will present my current work on developing and applying computational tools for the analysis of eukaryotic metagenomics data. Metagenomics is revolutionizing the study of microbes and their fundamental roles in biological, geological, and chemical processes on earth. Despite the important roles eukaryotes play in most environments, they have received little research attention, due to their lower abundance in samples and to the complexity of their gene and genome architectures. To date, we generally cannot reliably predict eukaryotic genes in metagenomics sequences. However, being able to analyze eukaryotic metagenomics data is of great importance to numerous scientific fields, including biotechnology and medicine, ecology and evolution. In my study, I work on developing computational tools for the high-throughput discovery of eukaryotic gene sequences in metagenomics data and for their functional annotation.

Transcriptomic alterations during ageing reflect the shift from cancer to degenerative diseases in the elderly

Disease epidemiology during ageing shows a transition from cancer to degenerative chronic disorders as dominant contributors to mortality in the old. Nevertheless, it has remained unclear to what extent molecular signatures of ageing reflect this phenomenon. Here we report on the identification of a conserved transcriptomic signature of ageing based on gene expression data from four vertebrate species across four tissues. We find that ageing-associated transcriptomic changes follow trajectories similar to the transcriptional alterations observed in degenerative ageing diseases but are in opposite direction to the transcriptomic alterations observed in cancer. We confirm the existence of a similar antagonism on the genomic level, where a majority of shared risk alleles that increase the risk of cancer decrease the risk of chronic degenerative disorders and vice versa. These results reveal a fundamental trade-off between cancer and degenerative ageing diseases that sheds light on the pronounced shift in their epidemiology during ageing.

Eukaryotic organisms are associated and have co-evolved with a complex bacterial community. Together, host and bacteria form a synergistic relationship. Disturbance of the homeostasis between the host and its associated partners may contribute to disease development. While research in recent decades has focused on bacteria-host interactions, viruses have been disregarded, although they represent the most abundant entity in the world, outnumbering bacterial cells, and are one of the key regulators of bacterial communities, killing 20-40% of all bacterial cells each day. In this seminar I’ll introduce you to the viral world living in association with diverse organisms of different habitats, ranging from marine algae, sponges and freshwater polyps to fecal samples of mice. On the basis of these examples I’ll emphasize different methodological problems, including viral isolation, library preparation and sequence data analysis, that challenge viral research and have to be taken into account when working with viruses.

Genome sequences vary locally in their complexity due to duplication events in their evolutionary past. As a result, there is long-standing interest in elucidating the relationship between sequence complexity and the function of the encoded genes. However, measuring local sequence complexity is problematic as most metrics have no bounds that coincide with a known minimum for completely ordered sequences, and a maximum for random sequences. An exception to this rule is our complexity measure CM, which is bounded by 0 and an expectation of 1. This measure is robust to variation in GC-content, and can be computed efficiently. We have implemented CM in our program macle for MAtch CompLExity. Macle takes as input a genome sequence in FASTA format for indexing. In the case of the complete human genome, indexing takes 3.5 h using 128 GB RAM. Given the resulting index, macle computes CM in sliding windows of arbitrary width across the entire genome in roughly 19 s using 25 GB RAM.
To investigate the relationship between sequence complexity and gene function, we determined which genes were enriched in regions of a given complexity. We found that high complexity regions were strongly enriched for regulatory genes active in development. In contrast, low complexity regions were enriched for genes involved in immunity. We end by speculating on the role of the few unannotated regions of high complexity found.

Structural variation (SV) is of key importance for the evolution of genomes across the tree of life. This talk presents a tour of methodological developments for SV calling, genotyping, and haplotyping. First, I will explain methods (Clever, Mate-Clever) we developed and applied in the frame of the Genome of the Netherlands (GoNL) project, which sequenced 250 Dutch families, and highlight some of the results of this study. Second, I will venture into the world of bacterial genomics and show how lessons learned from detecting human structural variation can be applied to design a tool (Daisy) to detect recent horizontal gene transfer. Third, I will discuss the impact of technological developments for detecting SVs, using the data produced by the Human Genome Structural Variation Consortium (HGSVC) as an example. The HGSVC sequenced nine human genomes each on seven different platforms (Illumina paired ends, Tru-seq synthetic long reads, jumping libraries, 10X Genomics, PacBio, BioNano optical maps, Strand-seq). In the frame of this project, we particularly explored the abilities of these technologies to resolve haplotypes by employing our WhatsHap method, which I will briefly explain. As a result, the HGSVC has produced a map of haplotype-specific structural variation that highlights SVs as substantially more prevalent in humans than was previously appreciated.

The mammalian gut microbiota is essential in shaping many of its host's functional attributes. Relationships between gut bacterial communities and their mammalian hosts have been shown in recent years to play an important role in the well-being and proper function of their hosts. A classic example of these relationships is found in the bovine digestive tract in a compartment termed the rumen. The rumen microbiota is necessary for the proper physiological development of the rumen and for the animal’s ability to digest and convert plant mass into basic food products, making it highly significant to humans. In my lecture I will discuss our recent findings regarding this ecosystem's development, and interaction with the host.

The redox-sensitive proteome (RSP) consists of protein thiols that undergo redox reactions, playing an important role in coordinating cellular processes. Here, we applied a large-scale phylogenomics approach to map the evolutionary origins of the eukaryotic RSP. Based on a current-day snapshot of the diatom Phaeodactylum tricornutum, we inferred ancestral sequence states and traced the evolution of the RSP stepwise back to the origin of eukaryotes. Our results show that the majority of P. tricornutum redox-sensitive cysteines (76%) is specific to eukaryotes, yet these are encoded in genes that are mostly of prokaryotic origin (57%). Furthermore, we find a threefold enrichment in redox-sensitive cysteines in genes that were gained by endosymbiotic gene transfer during the primary plastid acquisition. The secondary endosymbiosis event coincides with frequent introduction of reactive cysteines into existing proteins. While the plastid acquisition imposed an increase in the production of reactive oxygen species, our results suggest that it was accompanied by a significant expansion of the RSP, providing redox regulatory networks with the ability to cope with fluctuating environmental conditions.

The Graphical Fragment Assembly formats 1 and 2 (GFA1 and GFA2) are recently defined formats for representing sequence graphs, such as assembly graphs (de Bruijn and string graphs), sequence variation graphs and gene splicing graphs. The formats are adopted by several software tools, including sequence assemblers, read mappers, variant analysis tools and interactive visualization tools.
We present a scripting language library for handling GFA files in Python (GFApy). The library allows the user to conveniently parse, edit and write GFA files. Complex operations, such as the separation of the implicit instances of repeats and the merging of linear paths are also supported. Furthermore, the library is easily extensible: we show an example on how to define custom record types for metagenomic analysis.
GFApy is the first library which allows for convenient handling of GFA files using Python and the first publicly available implementation in any language fully supporting the GFA2 specification.
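To give a feel for the kind of data such a library handles, here is a minimal sketch that reads segment (S) and link (L) records from GFA1 text using only the Python standard library. It deliberately does not use or imitate the GFApy API; the function name `parse_gfa1` and the example graph are made up for illustration, and optional tags, validation and all other record types are ignored.

```python
def parse_gfa1(text):
    """Collect segments and links from GFA1 text; other record types are skipped."""
    segments = {}  # segment name -> sequence
    links = []     # (from_name, from_orient, to_name, to_orient, overlap)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")  # GFA records are tab-separated
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            from_name, from_orient, to_name, to_orient, overlap = fields[1:6]
            links.append((from_name, from_orient, to_name, to_orient, overlap))
    return segments, links

# Made-up two-segment graph: s1 and s2 overlap by two bases.
example = "\n".join([
    "H\tVN:Z:1.0",
    "S\ts1\tACGT",
    "S\ts2\tGTTA",
    "L\ts1\t+\ts2\t+\t2M",
])
segments, links = parse_gfa1(example)
```

A dedicated library adds what this sketch leaves out, such as editing and writing records, checking references between lines, and GFA2 support.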

Ancestor-descendent relations play a cardinal role in evolutionary theory. Those relations are determined by rooting phylogenetic trees. Existing rooting methods are hampered by evolutionary rate heterogeneity or the unavailability of auxiliary phylogenetic information. We present a novel rooting approach, the minimal ancestor deviation (MAD) method, which embraces heterotachy by utilizing all pairwise topological and metric information in unrooted trees. We demonstrate the method in comparison to existing rooting methods by the analysis of phylogenies from eukaryotes and prokaryotes. MAD correctly recovers the known root of eukaryotes and uncovers evidence for cyanobacteria origins in the ocean. MAD is more robust and consistent than existing methods, provides measures of the root inference quality, and is applicable to any tree with branch lengths.

10.3.2017 (Friday) 15:00 Malte Rühlemann, IKMB Kiel

Genome-wide association studies of the human gut microbiota

The human gut is the habitat of billions of microorganisms belonging to a multitude of different taxonomic groups with a huge functional repertoire. The gene content of the gut bacteria exceeds that of the human host by more than a hundredfold and plays an important role in the digestion of food, modulation of immune functions, and colonisation of pathogens. While changes in the intestinal microbiota have been linked to a variety of different diseases, the question of which factors shape and influence the variation seen in a “normal” and “healthy” microbiota is still largely unanswered.

Using uni- and multivariate statistical frameworks adapted to a genome-wide association study setting in two cohorts comprising a total of ~1,800 individuals from Northern Germany, we investigated the influence of host-genetic variation on core members of the gut microbiota, as well as on the overall beta-diversity of the community. Results show an overlap with previously known candidate genes for host-microbe interactions from functional studies, loci shared with association studies of inflammatory disorders, and new candidate genes that shed light on the mechanisms by which the host genome influences the bugs in our guts.

These days, sequencing projects often produce huge data sets. Especially for single-cell projects it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to datasets with large amounts of redundant data. Metagenomic data sets often show a high coverage for abundant species and a low one for rare species.

For a de novo assembly, the assembler has to reconstruct the genetic information from these data sets alone, a puzzle with sometimes billions of pieces. This is a demanding task, particularly with regard to the amount of RAM needed. Common assemblers like metaSPAdes or ALLPATHS-LG regularly need more than the 250 GB of memory a common server has these days.

But is all the data necessary to solve the problem? The basic idea of our work is to filter out redundant reads in order to reduce the memory and time requirements of the assembly process. The decision whether to keep or discard a certain read is based on a probabilistic counting scheme for the k-mers (substrings of reads of length k) seen so far and on the phred score. While this method has been shown to be very effective on single-cell and transcriptomic data sets, we are currently working on adapting it to metagenomic data sets.
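The general flavour of such redundancy filtering can be sketched as follows. This is only a simplified stand-in, not the method presented in the talk: it uses exact k-mer counts from `collections.Counter` instead of a probabilistic counting scheme, ignores phred scores entirely, and the decision rule (keep a read while the median count of its k-mers is below a threshold) as well as the names `filter_reads` and `max_coverage` are assumptions made for illustration.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """All substrings of length k, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def filter_reads(reads, k=4, max_coverage=2):
    """Keep a read only while its k-mers have been seen rarely so far."""
    counts = Counter()  # exact counts; a probabilistic sketch would save memory
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Median count of the read's k-mers approximates its current coverage.
        if median(counts[km] for km in read_kmers) < max_coverage:
            kept.append(read)
            counts.update(read_kmers)
    return kept

reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTTTGGGG"]
kept = filter_reads(reads)
```

With these toy reads, the third copy of the repeated read is discarded because its k-mers are already well covered, while the novel read is kept.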

13.1.2017 (Friday) 15:00 Beate Slaby, GEOMAR Kiel

Marine sponges are ancient metazoans that are populated by distinct and highly diverse microbial communities. In order to obtain deeper insights into the functional gene repertoire of the Mediterranean sponge Aplysina aerophoba, we combined Illumina short-read and PacBio long-read sequencing followed by un-targeted metagenomic binning. We identified a total of 37 high-quality bins representing 11 bacterial phyla and 2 candidate phyla. Statistical comparison of symbiont genomes with selected reference genomes revealed a significant enrichment of genes related to bacterial defense (restriction-modification systems, toxin-antitoxin systems) as well as genes involved in host colonization and extracellular matrix utilization in sponge symbionts. A within-symbionts genome comparison revealed a nutritional specialization of at least two symbiont guilds, where one appears to metabolize carnitine and the other sulfated polysaccharides, both of which are abundant molecules in the sponge extracellular matrix. A third guild of symbionts may be viewed as nutritional generalists that perform largely the same metabolic pathways but lack such extraordinary numbers of the relevant genes. This study characterizes the genomic repertoire of sponge symbionts at an unprecedented resolution and it provides greater insights into the molecular mechanisms underlying microbial-sponge symbiosis.

Costs and necessity of the Black Queen: the impact of metabolic trade-offs on the evolution of microbial community structure and dynamics

Microbial cells often exchange costly metabolites with neighbouring cells within their communities, creating a vast network of interdependencies in which co-occurring organisms perform complementary metabolic functions. The Black Queen hypothesis aims to explain the evolution of such dependencies through the loss of metabolic functions by a sub-group of cells while the function is retained by coexisting cells that share the function’s essential product. Testing this hypothesis requires knowledge of (i) the fitness consequences of metabolic gene loss as well as (ii) the costs that are associated with the biosynthesis of exchanged metabolites. Both quantities, however, usually remain elusive.

Here we addressed this issue using data mining approaches and constraint-based modelling of bacterial metabolism. The computational estimates and predictions were complemented with laboratory experiments with Escherichia coli and Acinetobacter baylyi. The results suggest that loss of conditionally essential biosynthetic functions is highly prevalent in natural bacterial populations. This rampant loss of anabolic functions can be explained by selective advantages of biosynthetic gene loss in the presence of the focal metabolites. In addition, epistatic interactions frequently affected fitness after losing multiple genes. We also identified a carbon source-dependent trade-off between the production costs of different classes of amino acids. Such biochemical trade-offs are known to play a crucial role in the ecology and evolution of microorganisms because coexisting lineages can mutually save metabolic costs by specialising in the production of different essential metabolites. Taken together, our observations demonstrate potential molecular causes underlying the evolution of metabolic interdependency and complementarity within microbial communities.

The concept of gene-environment interaction is relevant both in the etiology of complex diseases and in personalized treatment. Statistical aspects in the identification or utilization of such interactions will be highlighted, in particular relating to study design and statistical analysis for disease gene identification or pharmacogenetic clinical trials.

Interference between paralogues at the protein level affects the dynamics after gene duplication

A common feature of proteins is their assembly into homomeric structures to act as functional units. Usually, the subunits are derived from a single genetic locus. When such a gene is duplicated, the gene products are suggested initially to cross-interact when co-expressed thus resulting in the phenomenon of paralogue interference. In this talk, I will present a case study of protein evolution in which paralogue interference after duplication might have facilitated neofunctionalization of one duplicate. I will also explore further possible ways of how paralogue interference can shape the fate of a duplicated gene and present further illustrative examples. One important outcome is a prolonged time window in which both copies remain under selection increasing the chance to accumulate mutations and to develop new properties. Thereby, paralogue interference can mediate the co-evolution of duplicates.

One of the most intriguing puzzles in biology is the degree to which evolution is repeatable. The repeatability of evolution or parallel evolution has been studied in a variety of model systems, but has rarely been investigated with clinically relevant viruses. To investigate parallel evolution of HIV-1, we passaged two replicate HIV-1 populations for almost one year in each of two human T-cell lines. For each of the four replicate lines, we determined the genetic composition of the viral population at nine time points by sequencing the entire genome. Mutations that were carried by the majority of the virus population showed an extreme degree of parallel evolution. In one of our evolutionary lines, all 19 majority mutations also occur in another line but appear in a different order. This repeatable pattern of HIV-1 evolution is indicative of a predictable process, which is maximally inconsistent with evolutionary neutrality.

15.4.2016 (Friday) 15:00 Marc Hoeppner, IKMB

Workflow systems in bioinformatics

Within just a few years, the steadily decreasing cost of next-generation sequencing has turned biology into one of the most data-intensive research disciplines in the world. While this age of "big data" promises exciting new insights, it also threatens to outpace our ability to make sense of the flood of information and handle it efficiently. Here, one particular challenge is the use of high-performance compute infrastructures and the detailed record keeping (data provenance) necessary for good scientific practice. In this presentation, I will discuss the challenges of big data and how dedicated workflow systems can help accelerate bioinformatics, including some hands-on examples to show that the adoption of such purpose-built solutions does not need to be complicated.

12.2.2016 (Friday) 15:00 Dirk Fleischer, Kiel Marine Science

11.12.2015 (Friday) 15:00 Steffen Möller, University of Rostock

eQTL: intertwining disease decomposition and drug repositioning

Expression QTL (eQTL) further annotate disease-associated genetic loci with co-observed changes in the transcriptome. If drugs are selected to compensate for the disturbance caused at single loci, one may derive a recipe for a drug cocktail for a genotyped patient with a multifactorial disease. This presentation reviews resources available today and emerging algorithms, exemplified on murine data for experimental autoimmune encephalomyelitis, a mouse model for neuroinflammation.

In this presentation, I'll present three related topics: (1) Our PubFlow approach to automate publication workflows for scientific data. The PubFlow workflow management system employs established technology. We integrate institutional repository systems and world data centers (in marine science). PubFlow collects provenance data automatically via our monitoring framework Kieker. In our evaluation in marine science, we collaborate with the GEOMAR Helmholtz Centre for Ocean Research Kiel. (2) Data processing in genomics: I'll briefly sketch bioinformatics tools such as Bioconductor and Galaxy, and indicate how these tools may be combined with advanced data-analysis systems for Internet-scale data processing such as MapReduce/Hadoop, including our own tools ExplorViz and TeeTime. (3) For good scientific practice, it is important that research results may be properly checked by reviewers and possibly repeated and extended by other researchers. I'll discuss publishing code, in addition to data.

Short bio:

Prof. Dr. Wilhelm (Willi) Hasselbring is professor of Software Engineering at Kiel University. In the competence cluster Software Systems Engineering (KoSSE), he coordinates technology transfer projects with (local) industry. In the excellence cluster Future Ocean, he is principal investigator and co-coordinator of the research area Ocean Observations.

9.10.2015 (Friday) 15:00 Anne Kupczok, IFAM CAU

We analyze a microbial symbiont community inhabiting Bathymodiolus mussels. The pattern of genetic variation among symbiotic populations is used to distinguish among modes of symbiont transmission. For this purpose, a high-resolution metagenomics approach is applied to a data set of multiple mussels. By cross-assembly and binning into bacterial species, we find one highly abundant and one less abundant symbiont. Single-nucleotide polymorphisms (SNPs) are analyzed to quantify the genetic variation and population structure of the abundant species. We find that host-specific SNPs are rather rare but population structure is present among the samples. We hypothesize that the observed pattern is caused either by geographic isolation or by selection during symbiont uptake into the host and symbiont maintenance over time.

9.7.2015 (Thursday) 15:00 David Ellinghaus, IKMB Kiel

A systematic cross-disease study of five chronic inflammatory diseases

11.6.2015 (Thursday) 16:00 Corrina Breusing, GEOMAR

Population connectivity and dispersal of vent mussels from the Mid-Atlantic Ridge

30.4.2015 (Thursday) 15:00 Elie Jami, IKMB Kiel

Characterization of the bovine rumen microbiome from birth to adulthood and its potential effect on host physiology

Gene transfers identified at major evolutionary transitions among archaea specifically implicate gene acquisitions for metabolic functions from bacteria as key innovations in the origin of higher archaeal taxa.

2014, March 18th, 15:00: Prof. David Bryant, University of Otago, New Zealand

Phylogenetic analysis of species radiations using SNPs and AFLPs. (Bio/Bioinformatics/Genetics)

Technological wonders such as next generation sequencing mean that we can now, in principle, obtain SNP (single nucleotide polymorphism) data from multiple individuals in multiple species. This promises enormous benefits for population genetic and phylogenetic analysis, particularly of closely related or poorly resolved species. My interest is in how to analyse these data effectively and responsibly. We have developed an algorithm which estimates species trees, divergence times, and population sizes from independent (binary) markers such as well-spaced SNPs. The method is based on coalescent theory (like the BEAST software), though it uses mathematical trickery to avoid having to consider all the possible gene trees. As a 'full likelihood' method, it should be more accurate than alternative FST-based approaches. I'll talk about our experiences applying this method to AFLP data from alpine plants, and some recent discoveries about the usefulness (or uselessness) of SNP data for estimating population sizes.

2014, March 14th, 15:00: Dr. Till Bayer. GEOMAR

Using bioinformatics to study the lateral component of language evolution

Ever since August Schleicher (1821-1868) first proposed the idea that language history is best visualized “bei dem Bilde eines sich verästelnden Baumes” (“by the image of a branching tree”), this view has been controversially discussed by linguists, leading to various opposing theories, ranging from wave-like evolutionary scenarios to early network proposals. The reluctance of many scholars to accept the tree as the natural metaphor for language evolution was due to conflicting signals in linguistic data: many resemblances would simply not point to a unique tree. In the last two decades, historical linguistics has been experiencing a “quantitative revolution” and many automatic approaches from evolutionary biology have been applied to linguistic data. Given the important role that language contact and lexical borrowing play during language history, it is surprising that the majority of the new automatic approaches in historical linguistics assumes a strict “eukaryotic framework” for language evolution and only focuses on the reconstruction of language trees. I will argue that a “prokaryotic framework” for language evolution, based on biological network approaches that help to distinguish vertical from lateral processes during genome evolution, offers a fruitful alternative to current linguistic “dendrophilia” and provides more comprehensive insights into the complexities of language evolution.

2014, January 10th, 15:00: Prof. Dr. Bernhard Haubold, MPI Plön

Alignment-Free Tools for Genome Comparison

Whole genome sequencing has become routine. However, comparing whole genomes by alignment remains challenging. I therefore present three fast computer programs for comparing unaligned genomes. All three are based on calculating the lengths of exact matches between pairs of genomes. This quantity can be looked up efficiently by indexing sequences. I explain how we combine genome indexing with mathematical modeling to construct programs for estimating pairwise substitution rates, finding closest local homologues, and detecting recombination.
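The quantity at the heart of these programs, the length of the exact match starting at each position, can be illustrated with a naive sketch. It assumes one particular reading of "exact match length" (the longest prefix of each query suffix that occurs anywhere in the subject), uses plain substring search rather than an index, and the function name `match_lengths` is made up; the programs presented in the talk rely on sequence indexing to obtain such values efficiently.

```python
def match_lengths(query, subject):
    """For each query position, the length of the longest exact match in subject."""
    lengths = []
    for i in range(len(query)):
        length = 0
        # Extend the match one character at a time while it still occurs in subject.
        while i + length < len(query) and query[i:i + length + 1] in subject:
            length += 1
        lengths.append(length)
    return lengths

lengths = match_lengths("ACGTT", "TACGA")
```

This naive version rescans the subject for every extension, so it is only suitable for short toy sequences.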