Big Data: a new era for Statistics

Richard J. Samworth

Abstract

Richard Samworth (1996) is a Professor of Statistics in the University's Statistical Laboratory, and has been a Fellow of St John's since 2003. In 2012 he was awarded a five-year Engineering and Physical Sciences Research Council Early Career Fellowship, a grant worth £1.2 million, to study 'New challenges in high-dimensional statistical inference'.

Big Data is all the rage in the media these days. Few people seem to be able to define exactly what they mean by it, but there is nevertheless consensus that in fields as diverse as genetics, medical imaging, astronomy, social networks and commerce, to name but a few, modern technology allows the collection and storage of data on scales unimaginable only a decade ago. Such a data deluge creates a huge range of challenges and opportunities for statisticians, and the subject is currently undergoing an exciting period of rapid growth and development. Hal Varian, Chief Economist at Google, famously said in 2009, 'I keep saying the sexy job in the next ten years will be statisticians.' This might raise eyebrows among those more familiar with Mark Twain's 'lies, damned lies and statistics', but there's no doubt that recent high-profile success stories, such as Nate Silver's predictions of the 2012 US presidential election, have given statisticians a welcome image makeover.

Let me begin by describing a simple, traditional statistical problem, in order to contrast it with today's situation. In the 1920s, an experiment was carried out to understand the relationship between a car's initial speed, x, and its stopping distance, y. The stopping distances of fifty cars, having a range of different initial speeds, were measured and are plotted in Figure 1(a). Our understanding of the physics of car braking suggests that y ought to depend on x in a quadratic way, though from the figure we see that we can't expect an exact fit to the data. We therefore model the relationship as y = ax + bx² + ε, where ε represents the statistical error. Our aim is to estimate the unknown parameters a and b, which reflect the strength of the dependence of y on each of x and x².

Figure 1: Panel (a) gives the stopping distances of 50 cars having a range of different initial speeds. Panel (b) also shows the fitted curve, as well as a 95 per cent prediction interval for the stopping distance of a car having an initial speed of 21mph.

Here we don't need to include a constant term in the quadratic, because a car with zero initial speed doesn't take long to stop. We would like to choose our estimates of a and b in such a way that the curve ax + bx² reflects the trend seen in the data. For any such curve, we can imagine drawing vertical lines from each data point to the curve, and a standard way to estimate a and b is to choose them to minimise the sum of the squares of the lengths of these lines. For a statistician, this is a straightforward problem to solve, yielding estimates â = 1.24 and b̂ = 0.09 of a and b respectively, and the fitted curve displayed in Figure 1(b). From this, we can predict that a car with initial speed 21mph would take 65.8ft to stop. In fact, we can also quantify the uncertainty in this prediction: with 95 per cent probability, a car with this initial speed would take between 34.9ft and 96.6ft to stop. (Incidentally, a modern car would typically take around 43ft to stop at that initial speed.)

Of course, one can easily imagine that the stopping distance of a car would depend on many factors that weren't recorded in this experiment: the weather and tyre conditions, the state of the road, the make of car, and so on. Such additional information should allow us to refine our model, and make more accurate predictions, with less uncertainty. My point is that, by contrast with the 1920s, in today's world we can often record a whole raft of variables whose values may influence the response of interest.
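For readers who would like to experiment, here is a minimal sketch in Python of the least-squares fit just described. The data below are synthetic, generated only to mimic the pattern in Figure 1 (the original 1920s measurements are not reproduced in this article), and the 95 per cent interval uses a simple normal approximation rather than the exact formula behind the figures quoted above.

```python
# Least-squares fit of the no-intercept quadratic model y = a*x + b*x^2.
# The data are synthetic stand-ins so that the snippet runs on its own.
import numpy as np

rng = np.random.default_rng(0)
speed = rng.uniform(4, 25, size=50)                  # initial speeds (mph)
dist = 1.24 * speed + 0.09 * speed**2 + rng.normal(0, 15, size=50)  # stopping distances (ft)

# Design matrix with columns x and x^2 (no constant term).
X = np.column_stack([speed, speed**2])
coef, *_ = np.linalg.lstsq(X, dist, rcond=None)
a_hat, b_hat = coef
print(f"a_hat = {a_hat:.2f}, b_hat = {b_hat:.3f}")

# Point prediction and a rough 95 per cent prediction interval at 21 mph.
n, p = X.shape
resid = dist - X @ coef
sigma2 = resid @ resid / (n - p)                     # residual variance estimate
x0 = np.array([21.0, 21.0**2])
y0 = x0 @ coef
se_pred = np.sqrt(sigma2 * (1 + x0 @ np.linalg.solve(X.T @ X, x0)))
print(f"predicted stopping distance at 21 mph: {y0:.1f} ft "
      f"(~95% interval: {y0 - 1.96*se_pred:.1f} to {y0 + 1.96*se_pred:.1f} ft)")
```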

Figure 2: The left panel shows a photograph of a typical microarray. The right panel shows the complexity of the output from a typical microarray experiment.

In genetics, for instance, microarrays (see Figure 2) are used in laboratories to measure simultaneously the expression levels of many thousands of genes in order to study the effects of a treatment or disease. An initial statistical model analogous to our car model, then, would require at least one variable for each gene. Interestingly, for microarray data, there may still be only around fifty replications of the experiment, though many other modern applications have vast numbers of replications too. Suddenly, estimating the unknown parameters in the model is not so easy. The method of least squares we used with our car data, for example, can't be used when we have more variables than replications. What saves us here is a belief in what is often called sparsity: most of the genes should be irrelevant for the particular treatment or disease under study. The statistical challenge, then, is that of variable selection: which variables do I need in my model, and which can I safely discard?

Many methods for finding these few important needles among the huge haystack of variables have been proposed over the last two decades. One could simply look for variables that are highly correlated with the response, or use the exotically-named Lasso (Tibshirani, 1996), which can be regarded as a modification of the least squares estimate. In Shah and Samworth (2013), we gave a very general method for improving the performance of any existing variable selection method: instead of applying it once to the whole dataset, we showed that there are advantages to applying it to several random subsamples of the data, each of half the original sample size, eventually choosing the variables that are chosen on a high proportion of the subsamples. We were able to prove bounds that allow the practitioner to choose the threshold for this proportion to control the trade-off between false negatives and false positives.
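The subsampling idea can be sketched in a few lines of Python. The snippet below is only an illustration of the general recipe, not the precise procedure of Shah and Samworth (2013), which uses complementary pairs of subsamples and supplies theoretical guidance for the threshold: here scikit-learn's Lasso serves as the base variable selection method on simulated sparse data, and the penalty, number of subsamples and threshold are arbitrary illustrative choices.

```python
# Illustrative sketch: apply a base variable selector (here, the Lasso) to many
# random half-size subsamples and keep the variables that are selected in a
# high proportion of them. Data, penalty and threshold are illustrative only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, n_relevant = 50, 1000, 5                 # many more variables than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:n_relevant] = 3.0                        # only the first five variables matter
y = X @ beta + rng.normal(size=n)

B = 100                                        # number of half-size subsamples
selected_counts = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)
    fit = Lasso(alpha=0.3, max_iter=10000).fit(X[idx], y[idx])
    selected_counts += (fit.coef_ != 0)

proportions = selected_counts / B
threshold = 0.6                                # selection-proportion threshold
print("variables kept:", np.flatnonzero(proportions >= threshold))
```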

Another problem I've worked on recently is classification. Imagine that a doctor wants to diagnose diabetes. On a sample of diabetics, she makes measurements that she thinks are relevant for determining whether or not someone has the disease. She also makes the same measurements on a sample of non-diabetics. So when a new patient arrives for diagnosis, she again takes the same measurements. On what basis should she classify (diagnose) the new individual as coming from the diabetic or non-diabetic population? From a statistical point of view, the problem is the same as that encountered by banks that have to decide whether or not to give someone a loan, or an email filter that has to decide whether a message is genuine or spam.

One can imagine that an experienced doctor might have a notion of distance between any two individuals' sets of measurements. So one very simple method of classification would be to assign the new patient to the group of his nearest neighbour (i.e. the person closest according to the doctor's distance) among all n people, say, in our clinical trial. But intuitively, we might feel there was too much chance about whether or not the nearest neighbour happened to be diabetic, so a slightly more sophisticated procedure would look at the patient's k nearest neighbours, and would assign him to the population contributing at least half of those k nearest neighbours. In Hall, Park and Samworth (2008), we derived the optimal choice of k, in the sense of minimising the probability of misclassifying the new individual. For those interested, it should be chosen proportional to n^{4/(d+4)}, where d is the number of measurements made on each individual. An obvious drawback of the k-nearest neighbour classifier is that it gives equal importance to the group of the nearest neighbour as it does to that of the kth nearest neighbour. This observation prompts us to consider weighted nearest neighbours, with weights that decay as one moves further from the individual to be classified. In Samworth (2012), I derived the optimal weighting scheme, as well as a formula for the improvement attainable over the unweighted k-nearest neighbour classifier. It's between a 5 per cent and 10 per cent improvement when d ≤ 15, which might not seem like much, until it's you that requires the diagnosis!

Five years ago, I set up the Statistics Clinic, where once a fortnight any member of the University can come and receive advice on their statistical problems from one of a team of helpers (mainly my PhD students and post-docs). The sheer range of subjects covered and the diversity of problems they present to us provide convincing evidence that statistics is finally being recognised for its importance in making rational decisions in an uncertain world. The twenty-first century is undoubtedly the information age: even Mark Twain would agree!

Professor Richard J. Samworth
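As an illustrative endnote to the classification discussion above, the sketch below runs a k-nearest neighbour classifier in Python on simulated data, taking k proportional to n^{4/(d+4)} with an arbitrary proportionality constant of one, together with scikit-learn's built-in distance weighting. Note that the labels are synthetic and that distance weighting is not the optimal weighting scheme derived in Samworth (2012); it simply shows the weighted idea in action.

```python
# Illustrative k-nearest neighbour classification on simulated data, with the
# number of neighbours k scaled like n^(4/(d+4)). The two-class data and the
# proportionality constant are illustrative choices only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
n, d = 400, 5                                   # training sample size and dimension
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)   # synthetic two-class labels

k = max(1, round(n ** (4 / (d + 4))))           # k proportional to n^{4/(d+4)}
print("using k =", k)

unweighted = KNeighborsClassifier(n_neighbors=k).fit(X, y)
weighted = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X, y)

new_patient = rng.normal(size=(1, d))           # the new individual to classify
print("unweighted vote:", unweighted.predict(new_patient)[0])
print("distance-weighted vote:", weighted.predict(new_patient)[0])
```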
