The talk is
concerned with translating recent ideas from computer science on probabilistic
data-compression techniques into a statistical framework that can be ‘safely’
applied to speed up linear regression analyses for very large sample sizes in
biomedicine.

Our motivation is
to facilitate the use of multivariate regression and model exploration in tall
data sets, so that, for example, genetic association analyses carried out on hundreds
of thousands of subjects can investigate multivariate effects for a set of
explanatory features, rather than being restricted to one-feature-at-a-time
associations for reasons of computational feasibility.

Among the many approaches to dealing with tall data,
probabilistic data compression techniques based on random linear mappings,
known as sketching and developed in the computer science community, are
particularly suitable for linear regression problems. In the first part of the
talk, we will present a hierarchical representation of sketching, which allows
us to derive distributional properties of different sketching
algorithms. In particular, we will discuss how the signal-to-noise ratio in the
original data set is important for the choice of sketching algorithm. In the
second part of the talk, we will further refine some of the approximation
guarantees and consider iterative sketches. The talk will be illustrated on a
genetic analysis of the link between a blood cell trait and the HLA region
involving a sample of 130,000 people.
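
As a rough illustration of the random linear mapping idea, the following minimal sketch compresses a tall regression problem with a Gaussian random matrix and then solves ordinary least squares on the compressed data. It is not the method presented in the talk: the Gaussian sketch is only one of several possible sketching algorithms, and the function name `gaussian_sketch_ols`, the simulated data, and the sizes `n`, `p`, and `k` are all assumptions chosen for illustration.

```python
import numpy as np


def gaussian_sketch_ols(X, y, k, rng):
    """Least squares on a Gaussian sketch of (X, y).

    Hypothetical helper for illustration: S is a k x n matrix with
    i.i.d. N(0, 1/k) entries, so the n rows are compressed to k rows.
    """
    n = X.shape[0]
    S = rng.standard_normal((k, n)) / np.sqrt(k)
    X_s = S @ X  # sketched design matrix, shape (k, p)
    y_s = S @ y  # sketched response, shape (k,)
    beta, *_ = np.linalg.lstsq(X_s, y_s, rcond=None)
    return beta


# Simulated tall data set (assumed sizes, not from the talk).
rng = np.random.default_rng(0)
n, p, k = 5000, 5, 400
X = rng.standard_normal((n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + rng.standard_normal(n)

# Full-data least squares versus the sketched estimate.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = gaussian_sketch_ols(X, y, k, rng)

print("full data:", np.round(beta_full, 3))
print("sketched: ", np.round(beta_sketch, 3))
```

The sketched estimate is noisier than the full-data estimate because only `k` compressed rows enter the fit, and how much this matters depends on the noise level in the original data. A dense Gaussian sketch is also costly to form for very large `n`, so it serves here only to show the principle.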