History of Previous Talks in Autumn 2017

Abstract: Bayesian inference has many attractive features, but a major challenge is its potentially very high computational cost. While sampling from the prior distribution is often straightforward, the most expensive part is typically conditioning on the data. In many problems, a single data set may not be informative enough to enable reliable inference for a given quantity of interest. This can be difficult to assess in advance and may require a considerable amount of computation to discover, resulting in a weakly informative posterior distribution “gone to waste”. On the other hand, borrowing strength across multiple related data sets using a hierarchical model may be computationally infeasible for very costly models.

As an alternative approach to traditional hierarchical models, we develop in this work a framework which reuses and combines posterior distributions computed on individual data sets to achieve post-hoc borrowing of strength, without the need to re-do expensive computations on the data. As a by-product, we also obtain a notion of meta-analysis for posterior distributions. By adopting the view that posterior distributions are beliefs which reflect the uncertainty about the value of some quantity, we formulate our approach as Bayesian inference with uncertain observations. We further show that this formulation is closely related to belief propagation. Finally, we illustrate the framework with post-hoc analyses of likelihood-free Bayesian inferences.

Correlation-Compressed Direct Coupling Analysis

Date: November 27, 2017

Abstract: Direct Coupling Analysis (DCA) is a powerful tool to find pair-wise dependencies in large biological data sets. It amounts to inferring coefficients in a probabilistic model in an exponential family, and then using the largest such inferred coefficients as predictors for the dependencies of interest. The main computational bottleneck is the inference. As described recently by Jukka Corander in this seminar series, DCA has been done on bacterial whole-genome data, at the price of significant compute time and investment in code optimization.

We have investigated whether DCA can be sped up by first filtering the data on correlations, an approach we call Correlation-Compressed Direct Coupling Analysis (CC-DCA). The computational bottleneck then moves from DCA to the more standard task of finding a subset of the most strongly correlated vectors in large data sets. I will describe results obtained so far, and outline what it would take to do CC-DCA on whole-genome data in human and other higher organisms.

This is joint work with Chen-Yi Gao and Hai-Jun Zhou, available as arXiv:1710.04819.
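
As a rough illustration of the filtering idea in the abstract above, the following Python sketch ranks pairs of data columns by absolute Pearson correlation and keeps only the columns that appear in the strongest pairs. It is not the procedure from the cited preprint: the function names `pearson` and `top_correlated_columns`, the brute-force pairwise scan, and the toy data are all assumptions made for this example.

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def top_correlated_columns(columns, n_pairs):
    """Return the sorted indices of columns that occur in the n_pairs most correlated pairs."""
    pairs = sorted(
        itertools.combinations(range(len(columns)), 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
        reverse=True,
    )
    keep = set()
    for i, j in pairs[:n_pairs]:
        keep.update((i, j))
    return sorted(keep)

# Toy data (hypothetical): four binary columns, the first two identical.
columns = [
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1],
]
selected = top_correlated_columns(columns, n_pairs=1)
```

A more expensive coupling inference would then be run only on the selected columns. The exhaustive scan over all pairs is quadratic in the number of columns, so at whole-genome scale a faster search for strongly correlated vectors would be needed.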

Towards Intelligent Exergames

Date: November 20, 2017

Abstract: Exergames – video games that require physical activity – hold promise of solving the hard societal problem of motivating people to move. At the same time, artificial intelligence and machine learning are transforming how video games are designed, produced, and tested. Work combining both computational intelligence and exergames is sparse, however. In my talk, I delineate the challenges, opportunities, and my group’s research towards intelligent exergames, building on our previous research on both exergame design (e.g., Augmented Climbing Wall, Kick Ass Kung-Fu) and intelligent control of embodied simulated agents. Video and examples: http://perttu.info

Learning of Ultra High-Dimensional Potts Models for Bacterial Population Genomics

Date: November 13, 2017

Abstract: The potential for genome-wide modeling of epistasis has recently surfaced given the possibility of sequencing densely sampled populations and the emerging families of statistical interaction models. Direct coupling analysis (DCA) has earlier been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not been scalable to enable model fitting simultaneously to 10000-100000 polymorphisms, representing the amount of core genomic variation observed in analyses of many bacterial species. Here we introduce a novel inference method (SuperDCA) which employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 100000 polymorphisms. Using two large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make additional significant biological findings about this major human pathogen. We also show that our method can uncover signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA thus holds considerable potential in building understanding about numerous organisms at a systems biological level.

Affiliation: Professor of Statistics, University of Helsinki and University of Oslo

Place of Seminar: University of Helsinki

Efficient and accurate approximate Bayesian computation

Date: November 6, 2017

Abstract: Approximate Bayesian computation (ABC) is a method for calculating a posterior distribution when the likelihood is intractable, but simulating the model is feasible. It has numerous important applications, for example in computational biology, material physics, and user interface design. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the cost, Bayesian optimisation (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimisation enables deciding intelligently where to simulate the model next, but standard BO approaches are designed for optimisation and not for ABC. Here we address this gap in the existing methods. We model the uncertainty in the ABC posterior density which is due to a limited number of simulations available, and define a loss function that measures this uncertainty. We then propose to select the next model simulation to minimise the expected loss. Experiments show the proposed method is often more accurate than the existing alternatives.
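
For readers unfamiliar with ABC, the sketch below shows the basic rejection variant in Python, which is the costly baseline that the abstract aims to improve on, not the method proposed in the talk. The Gaussian simulator, the uniform prior on [-5, 5], the sample mean as summary statistic, and the tolerance `eps` are all assumptions chosen to keep the example small.

```python
import random
import statistics

def simulate(theta, n, rng):
    """Hypothetical simulator: n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def abc_rejection(observed, n_sims, eps, rng):
    """Keep prior draws whose simulated summary lies within eps of the observed summary."""
    obs_summary = statistics.fmean(observed)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(-5.0, 5.0)  # draw from the assumed uniform prior
        simulated = simulate(theta, len(observed), rng)
        if abs(statistics.fmean(simulated) - obs_summary) <= eps:
            accepted.append(theta)
    return accepted

rng = random.Random(0)
observed = simulate(2.0, 50, rng)  # synthetic "observed" data, true theta = 2
samples = abc_rejection(observed, n_sims=5000, eps=0.1, rng=rng)
```

Most of the 5000 simulations are discarded here because their summaries fall outside the tolerance. That waste is what motivates choosing simulation locations more carefully when each simulator run is expensive.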

Abstract: A Markov equivalence class contains all the Directed Acyclic Graphs (DAGs) encoding the same conditional independencies, and is represented by a Completed Partially Directed DAG (CPDAG), also named Essential Graph (EG). We approach the problem of model selection among noncausal sparse Gaussian DAGs by directly scoring EGs, using an objective Bayes method. Specifically, we construct objective priors for model selection based on the Fractional Bayes Factor, leading to a closed form expression for the marginal likelihood of an EG. Next we propose an MCMC strategy to explore the space of EGs, possibly accounting for sparsity constraints, and illustrate the performance of our method on simulation studies, as well as on a real dataset. Our method is fully Bayesian and thus provides a coherent quantification of inferential uncertainty, requires minimal prior specification, and proves to be competitive in learning the structure of the data-generating EG when compared to alternative state-of-the-art algorithms.

Computational Challenges in Analyzing And Moderating Online Social Discussions

Date: October 23, 2017

Abstract: Online social media are a major venue of public discourse today, hosting the opinions of hundreds of millions of individuals. Social media are often credited for providing a technological means to break information barriers and promote diversity and democracy. In practice, however, the opposite effect is often observed: users tend to favor content that agrees with their existing world-view, get less exposure to conflicting viewpoints, and eventually create “echo chambers” and increased polarization. Arguably, without any kind of moderation, current social-media platforms gravitate towards a state in which net-citizens are constantly reinforcing their existing opinions. In this talk we present an ongoing line of work on analyzing and moderating online social discussions. We first consider the questions of detecting controversy using network structure and content, tracking the evolution of polarized discussions, and understanding their properties over time. We then address the problem of designing algorithms to break filter bubbles and reduce polarization. We discuss a number of different strategies such as user and content recommendation, as well as viral approaches.

Probabilistic preference learning with the Mallows rank model

Date: October 16, 2017

Abstract: Ranking and comparing items is crucial for collecting information about preferences in many areas, from marketing to politics. The Mallows rank model is among the most successful approaches to analyze rank data, but its computational complexity has limited its use to a form based on Kendall distance. Here, new computationally tractable methods for Bayesian inference in Mallows models are developed that work with any right-invariant distance. The method performs inference on the consensus ranking of the items, also when based on partial rankings, such as top-k items or pairwise comparisons. When assessors are many or heterogeneous, a mixture model is proposed for clustering them in homogeneous subgroups, with cluster-specific consensus rankings. Approximate stochastic algorithms are introduced that allow a fully probabilistic analysis, leading to coherent quantification of uncertainties. The method can be used, for example, for making probabilistic predictions on the class membership of assessors based on their ranking of just some items, and for predicting missing individual preferences, as needed in recommendation systems.
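
To make the model in the abstract above concrete, the following Python sketch computes the Kendall distance between two rankings and the resulting Mallows probabilities by enumerating every permutation. The parameterization exp(-alpha / n * d(r, rho)) and the names `kendall_distance` and `mallows_pmf` are assumptions for this example. Brute-force enumeration is only feasible for a handful of items, which is the computational difficulty the talk addresses with approximate stochastic algorithms.

```python
import itertools
import math

def kendall_distance(r1, r2):
    """Number of item pairs that the two rank vectors order differently."""
    n = len(r1)
    return sum(
        1
        for i, j in itertools.combinations(range(n), 2)
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )

def mallows_pmf(consensus, alpha):
    """Mallows probabilities over all rankings, normalised by full enumeration.

    Assumed form: P(r) proportional to exp(-alpha / n * d(r, consensus)),
    with d the Kendall distance and n the number of items.
    """
    n = len(consensus)
    rankings = list(itertools.permutations(range(1, n + 1)))
    weights = [
        math.exp(-alpha / n * kendall_distance(r, consensus)) for r in rankings
    ]
    normaliser = sum(weights)
    return {r: w / normaliser for r, w in zip(rankings, weights)}

consensus = (1, 2, 3)  # entry i is the rank given to item i
pmf = mallows_pmf(consensus, alpha=2.0)
most_likely = max(pmf, key=pmf.get)
```

In this sketch the consensus ranking is always the mode, and larger values of `alpha` concentrate more probability on rankings close to it.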

Does my algorithm work?

Date: October 9, 2017

Abstract: It is easy to propose a new algorithm for solving a Machine Learning problem. It is much harder to convince other people that the proposed algorithm actually works. The “gold standard” of tight theoretical guarantees is often out of reach. So what do we do? Typically, an algorithm is validated on a couple of test problems and its output is compared with that of algorithms that are known to work. This is not a great strategy.

In this talk, I will outline a general strategy for assessing whether an algorithm for approximate Bayesian computing works on a given problem. This method does not require evaluation of the true posterior and also indicates ways in which the computed posterior systematically deviates from the true posterior.

Machine learning for Materials Research

Date: October 2, 2017

Abstract: In materials research, we have learnt to predict the evolution of microstructure starting with the atomic level processes. We know about defects, both point and extended, and we know that these can be crucial for the final structural (and related mechanical and electrical) properties. Often the simple macroscopic differential equations that are used for this purpose fail to predict simple changes in materials. Many questions remain unanswered. Why does a ductile material suddenly become brittle? Why does a strong concrete bridge suddenly crack and eventually collapse after serving for tens of years? Why do the walls of high-quality steel in fission reactors suddenly crack? Or why does a clean, smooth surface roughen under applied electric fields? All these questions can be answered if one peeks into the atoms’ behavior, imagining them jumping inside the material. But how do the atoms “choose” where to jump amongst the numerous possibilities in complex metals? Tedious parameterization can help to deal with the problem, but machine learning can provide a better and more elegant solution.

In my presentation, I will explain the problem at hand and show a few examples of former and current applications of neural networks (NN) for calculating the barriers for atomic jumps, with an analysis of how well the applied NN worked.

Computational creativity and machine learning

Date: September 25, 2017

Abstract: Computational creativity has been defined as the art, science, philosophy and engineering of computational systems which, by taking on particular responsibilities, exhibit creative behaviours. In this talk I first try to elaborate on what creative responsibilities could be and why they are interesting. I then outline ways in which machine learning can be used to take on some of these responsibilities, helping computational systems become more creative.

Abstract: Efficient index structures for fast approximate nearest neighbor queries are required in many applications such as recommendation systems. In high-dimensional spaces, many conventional methods suffer from excessive usage of memory and slow response times. We propose a method where multiple random projection trees are combined. We demonstrate by extensive experiments on a wide variety of data sets that the method is faster than existing partitioning-tree or hashing-based approaches, making it the fastest available technique at high accuracy levels.
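
The general idea of combining several random projection trees can be sketched in Python as follows. This is a simplified illustration, not the speakers' implementation: the median split on a Gaussian random direction, the pooling of leaf candidates across trees, and the names `build_tree`, `leaf_candidates` and `query_forest` are all assumptions.

```python
import random

def build_tree(points, indices, depth, rng):
    """Recursively split the given point indices at the median of a random projection."""
    if depth == 0 or len(indices) <= 1:
        return {"leaf": indices}
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    proj = {i: sum(d * x for d, x in zip(direction, points[i])) for i in indices}
    ordered = sorted(indices, key=proj.get)
    mid = len(ordered) // 2
    return {
        "direction": direction,
        "split": proj[ordered[mid]],
        "left": build_tree(points, ordered[:mid], depth - 1, rng),
        "right": build_tree(points, ordered[mid:], depth - 1, rng),
    }

def leaf_candidates(tree, query):
    """Route the query down one tree and return the indices stored in its leaf."""
    while "leaf" not in tree:
        value = sum(d * x for d, x in zip(tree["direction"], query))
        tree = tree["left"] if value < tree["split"] else tree["right"]
    return tree["leaf"]

def query_forest(points, trees, query, k):
    """Pool the leaf candidates from every tree, then rank them by exact distance."""
    candidates = set()
    for tree in trees:
        candidates.update(leaf_candidates(tree, query))

    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(points[i], query))

    return sorted(candidates, key=squared_distance)[:k]

rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200)]
trees = [build_tree(points, list(range(len(points))), 4, rng) for _ in range(5)]
neighbours = query_forest(points, trees, points[17], k=3)
```

Each tree alone can miss true neighbours that fall on the other side of a split. Pooling candidates from several independently built trees raises the chance of finding them, while exact distances are computed only for a small candidate set.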

Abstract: A challenge in analyzing data from tumor samples is that the biopsies contain a mixture of various cells, including cancer cells, immune cells and stromal cells. This hinders the discovery of clinically relevant information and can lead to systematically biased results. A few recent analysis techniques control for such factors, but only accommodate specific types of data, or require controls which cannot be obtained from each patient. I will present our developments on statistical methods for controlling for the latent and varying fraction of tumor cells in next-generation methylation and RNA sequencing data, which aim to enable unbiased and more accurate comparison of patient-derived samples.