Combining multiple models

1 Combining multiple models
- Basic idea of meta learning schemes: build different experts and let them vote
- Advantage: often improves predictive performance
- Disadvantage: produces output that is very hard to analyze
- Schemes we will discuss: bagging, boosting, stacking, and error-correcting output codes
- The first three can be applied to both classification and numeric prediction problems
10/25/

2 Bagging
- Employs the simplest way of combining predictions: voting/averaging
- Each model receives equal weight
- Idealized version of bagging:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers' predictions
- This improves performance in almost all cases if the learning scheme is unstable (e.g., decision trees)

3 Bias-variance decomposition
- Theoretical tool for analyzing how much the specific training set affects the performance of a classifier
- Assume we have an infinite number of classifiers built from different training sets of size n
- The bias of a learning scheme is the expected error of the combined classifier on new data
- The variance of a learning scheme is the expected error due to the particular training set used
- Total expected error: bias + variance

4 More on bagging
- Bagging reduces variance by voting/averaging, thus reducing the overall expected error
  - In the case of classification there are pathological situations where the overall error might increase
  - Usually, the more classifiers the better
- Problem: we only have one dataset!
- Solution: generate new datasets of size n by sampling with replacement from the original dataset
- Can help a lot if the data is noisy
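The resampling step can be illustrated with a short Python sketch. The function names `bootstrap_sample` and `expected_unique_fraction` are hypothetical, chosen only for this example. It also shows a known property of sampling with replacement: a sample of size n contains, on average, a fraction 1 - (1 - 1/n)^n of the distinct original instances, which is roughly 63% for large n.

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) instances uniformly at random, with replacement."""
    return rng.choices(dataset, k=len(dataset))

def expected_unique_fraction(n):
    """Expected share of distinct original instances in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, used only for reproducibility
    data = list(range(1000))
    sample = bootstrap_sample(data, rng)
    # Observed versus expected fraction of distinct instances in the sample.
    print(len(set(sample)) / len(data), expected_unique_fraction(len(data)))
```

Because each resampled dataset leaves out a different part of the original data, the models built from them differ, which is what voting/averaging then exploits.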

5 Bagging classifiers
model generation
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.
classification
  For each of the t models:
    Predict class of instance using model.
  Return class that has been predicted most often.
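The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: `train_stump` is a hypothetical base learner (a threshold rule on a single numeric attribute with labels 0/1), and the names `bagging_fit` and `bagging_predict` are assumptions made for this example.

```python
import random
from collections import Counter

def train_stump(sample):
    """Hypothetical base learner: best threshold rule on (x, label) pairs, labels 0/1."""
    best = None  # (errors, threshold, label_left, label_right)
    for thr in sorted({x for x, _ in sample}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def bagging_fit(data, t, learner, rng):
    """Model generation: build t models, each from a bootstrap sample of size n."""
    n = len(data)
    models = []
    for _ in range(t):
        sample = rng.choices(data, k=n)  # sample n instances with replacement
        models.append(learner(sample))   # apply the learning algorithm, store the model
    return models

def bagging_predict(models, x):
    """Classification: return the class predicted most often (equal-weight vote)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

A possible use: `models = bagging_fit(data, 11, train_stump, random.Random(0))` followed by `bagging_predict(models, x)` for a new value `x`. An odd number of models avoids tied votes in the two-class case.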

6 Boosting
- Also uses voting/averaging, but models are weighted according to their performance
- Iterative procedure: new models are influenced by the performance of previously built ones
  - New model is encouraged to become an expert for instances classified incorrectly by earlier models
  - Intuitive justification: models should be experts that complement each other
- There are several variants of this algorithm

7 AdaBoost.M1
model generation
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply learning algorithm to weighted dataset and store resulting model.
    Compute error e of model on weighted dataset and store error.
    If e equal to zero, or e greater than or equal to 0.5:
      Terminate model generation.
    For each instance in dataset:
      If instance classified correctly by model:
        Multiply weight of instance by e / (1 - e).
    Normalize weight of all instances.
classification
  Assign weight of zero to all classes.
  For each of the t (or fewer) models:
    Add -log(e / (1 - e)) to weight of class predicted by model.
  Return class with highest weight.
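A minimal Python sketch of the AdaBoost.M1 steps listed above is given here. The weak learner `weighted_stump` is hypothetical (a threshold rule on one numeric attribute, labels 0/1), as are the function names. One point is an assumption of this sketch rather than something the slide specifies: the model that triggers termination (e = 0 or e >= 0.5) is discarded, which avoids an infinite or non-positive vote weight.

```python
import math
from collections import defaultdict

def weighted_stump(data, weights):
    """Hypothetical weak learner: threshold rule with the lowest weighted error."""
    best = None  # (weighted_error, threshold, label_left, label_right)
    for thr in sorted({x for x, _ in data}):
        for left, right in ((1, 0), (0, 1)):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (left if x <= thr else right) != y)
            if best is None or err < best[0]:
                best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def adaboost_m1_fit(data, t, learner):
    """Model generation: returns a list of (model, weighted_error) pairs."""
    n = len(data)
    weights = [1.0 / n] * n  # equal weight for each training instance
    ensemble = []
    for _ in range(t):
        model = learner(data, weights)
        e = sum(w for (x, y), w in zip(data, weights) if model(x) != y)
        if e == 0 or e >= 0.5:
            break  # assumption of this sketch: drop the terminating model
        ensemble.append((model, e))
        factor = e / (1 - e)
        # Down-weight correctly classified instances, then renormalize.
        weights = [w * factor if model(x) == y else w
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_m1_predict(ensemble, x):
    """Classification: weighted vote with weight -log(e / (1 - e)) per model."""
    class_weight = defaultdict(float)
    for model, e in ensemble:
        class_weight[model(x)] += -math.log(e / (1 - e))
    return max(class_weight, key=class_weight.get)
```

On the ten-instance dataset with x = 0..9 and labels 1,1,1,0,0,0,1,1,1,0, no single threshold rule is correct everywhere, but three boosted rules combined by the weighted vote classify all ten training instances correctly.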

8 More on boosting
- Can be applied without weighted learning by resampling the data with probability determined by the weights
  - Disadvantage: not all instances are used
  - Advantage: resampling can be repeated if error exceeds 0.5
- Stems from computational learning theory
- Theoretical result: training error decreases exponentially
- Also: works if base classifiers are not too complex and their error doesn't become too large too quickly
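The resampling variant can be sketched in a few lines of Python. The function name `resample_by_weight` is hypothetical. The sketch uses the standard-library call `random.Random.choices`, which draws with replacement in proportion to the given weights.

```python
import random

def resample_by_weight(data, weights, rng):
    """Draw len(data) instances with replacement, with probability proportional to weight."""
    return rng.choices(data, weights=weights, k=len(data))

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, used only for reproducibility
    data = ["a", "b", "c", "d"]
    weights = [0.7, 0.1, 0.1, 0.1]  # "a" is a heavily weighted (hard) instance
    print(resample_by_weight(data, weights, rng))
```

Heavily weighted instances tend to appear several times in the resampled dataset, while lightly weighted ones may be left out entirely, which is the disadvantage noted above.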

9 A bit more on boosting
- Puzzling fact: generalization error can decrease long after training error has reached zero
- Seems to contradict Occam's razor!
- However, the problem disappears if margin (confidence) is considered instead of error
  - Margin: difference between the estimated probability for the true class and that of the most likely other class (between -1 and 1)
- Boosting works with weak learners: the only condition is that error doesn't exceed 0.5
- LogitBoost: more sophisticated boosting scheme
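The margin definition can be made concrete with a small Python sketch. The function name `margin` and the dictionary representation of class probabilities are assumptions made for this example.

```python
def margin(class_probs, true_class):
    """Probability of the true class minus that of the most likely other class."""
    p_true = class_probs[true_class]
    p_other = max(p for c, p in class_probs.items() if c != true_class)
    return p_true - p_other

if __name__ == "__main__":
    probs = {"a": 0.7, "b": 0.2, "c": 0.1}
    print(margin(probs, "a"))  # positive: correct and fairly confident
    print(margin(probs, "b"))  # negative: the true class is not the top prediction
```

A margin near 1 means a confident correct prediction, and a negative margin means a misclassification. Further boosting iterations can keep increasing the margins of training instances even when every one of them is already classified correctly.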

10 Stacking
- Hard to analyze theoretically: black magic
- Uses a meta learner instead of voting to combine predictions of base learners
  - Predictions of base learners (level-0 models) are used as input for the meta learner (level-1 model)
- Base learners are usually different learning schemes
- Predictions on training data can't be used to generate data for the level-1 model!
  - Cross-validation-like scheme is employed
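One way to realize the cross-validation-like scheme is sketched below in Python. All names are hypothetical, and the two level-0 learners (a majority-class predictor and a 1-nearest-neighbour rule on a single numeric attribute) are stand-ins chosen only to keep the example self-contained. Each instance's level-1 attributes come from models that never saw that instance during training.

```python
from collections import Counter

def majority_learner(train):
    """Hypothetical level-0 learner: always predicts the most common label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour_learner(train):
    """Hypothetical level-0 learner: 1-nearest-neighbour on one numeric attribute."""
    return lambda x: min(train, key=lambda inst: abs(inst[0] - x))[1]

def level1_training_data(data, learners, k):
    """Build level-1 instances from held-out level-0 predictions (k folds)."""
    folds = [data[i::k] for i in range(k)]
    level1 = []
    for i, held_out in enumerate(folds):
        # Train every level-0 learner on all folds except the held-out one.
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        models = [learn(train) for learn in learners]
        # Level-0 predictions on the held-out fold become level-1 attributes.
        for x, y in held_out:
            level1.append(([model(x) for model in models], y))
    return level1
```

The meta learner is then trained on the returned `(predictions, label)` pairs. In this sketch the level-0 models would afterwards be rebuilt on the full training set for use at prediction time, which is a common choice rather than something stated on the slide.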

11 More on stacking
- If base learners can output probabilities, it's better to use those as input to the meta learner
- Which algorithm to use to generate the meta learner?
  - In principle, any learning scheme can be applied
  - David Wolpert: relatively global, smooth model
    - Base learners do most of the work
    - Reduces risk of overfitting
- Stacking can also be applied to numeric prediction (and density estimation)
