Measuring the difference between data mining results is an important open problem in exploratory data mining.
We discuss an information theoretic approach for measuring how much information is shared between results, and give a proof of concept for binary data.

Stijl mines descriptions of ordered binary data. We model data hierarchically with noisy tiles - rectangles with significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle, and give an algorithm for mining optimal sub-tiles in just O(nmmin(n,m)) time.

We consider mining informative serial episodes — subsequences allowing for gaps — from event sequence data. We formalize the problem by the Minimum Description Length principle, and give algorithms for selecting good pattern sets from candidate collections as well as for parameter free mining of such models directly from data.

Measuring the difference between data mining results is an important open problem in exploratory data mining.
We discuss an information theoretic approach for measuring how much information is shared between results, and give a proof of concept for binary data.

mtv is a well-founded approach for summarizing data with itemsets; using a probabilistic maximum entropy model, we iteratively find that itemset that provides us the most new information, and update our model accordingly. We can either mine top-k patterns, or identify the best summarisation by MDL or BIC.

Boolean Matrix Factorization has many desirable properties, such as
high interpretability and natural sparsity. However, no method for selecting the correct model order has been available. We propose to use the Minimum Description Length principle, and show that besides solving the problem, this well-founded approach has numerous
benefits, e.g., it is automatic, does not require a likelihood function, and, as experiments show, is highly accurate.

A good first impression of a dataset is paramount to how we proceed our analysis. We discuss mining high-quality high-level descriptive summaries for binary and categorical data. Our approach builds summaries by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering.

We aim at finding itemsets that characterise the data well. To this end, we construct decision trees by which we can pack the data succinctly, and from which we can subsequently identify the most important itemsets. The Pack algorithm can either filter a candidate collection, as well as mine its models directly from data.

The Krimp algorithm mines sets of itemsets by the MDL principle, defining the best set of patterns as the set that compresses the data best. The resulting code tables are orders of magnitude smaller than the number of (closed) frequent itemsets. They are highly characteristic for the data, and obtain high accuracy on many data mining tasks.

Suppose we are given a large graph in which, by some external process, a handful of nodes are marked. What can we say about these nodes? Are they close together in the graph? or, if segregated, how many groups do they form? We approach this problem by trying to find simple connection pathways between sets of marked nodes — using MDL to identify the optimal result. We propose the efficient dot2dot algorithm for approximating this goal.

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? With NetSleuth, we answer this question affirmatively for the Susceptible-Infected virus propagation model.

We consider non-parametric causal inference. That is, given two variables of which we know that they are be correlated, Ergo can efficiently and reliably infer their causal direction – without having to assume a distribution.

Correlation analysis is one of the key elements of data mining, but many methods are limited to either linear or pairwise correlations.
We propose mac, a non-parametric, multivariate and non-linear correlation measure based on maximal correlation analysis.

Discretization is the transformation of continuous data into discrete bins. The key question is how to discretize data such that multivariate associations and patterns are preserved. That is exactly what ipd does.