Transcription

2 Goals of the lecture
The Ladder of Roots and Powers
Changing the shape of distributions
Transforming for linearity

3 Why transform data?
1. In some instances it can help us better examine a distribution.
2. Many statistical models are based on the mean and thus require that the mean is an appropriate measure of central tendency (i.e., the distribution is approximately normal).
3. Linear least-squares regression assumes that the relationship between two variables is linear. Often we can straighten a nonlinear relationship by transforming one or both of the variables.
Often transformations will fix problem distributions so that we can use least-squares regression.
When transformations fail to remedy these problems, another option is to use nonparametric regression, which makes fewer assumptions about the data.

4 Power transformations for quantitative variables
Although there are infinitely many functions f(x) that could be used to transform a distribution, in practice only a relatively small number are regularly used.
For quantitative variables one can usually rely on the family of powers and roots: $X \to X^p$
When p is negative, the transformation is an inverse power, for example $X^{-1} = 1/X$
When p is a fraction, the transformation represents a root, for example $X^{1/2} = \sqrt{X}$
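
As a rough illustration of the family of powers and roots described above, the sketch below applies X to the power p for a few values of p. The function name `power_transform` and the sample values are made up for this example; the log rung of the ladder is deliberately left out here.

```python
def power_transform(x, p):
    """Return x**p for a positive x and a nonzero power p.

    Hypothetical helper for illustration only: p = 2 squares,
    p = 0.5 takes a square root, p = -1 takes the inverse.
    """
    if x <= 0:
        raise ValueError("power transformations assume positive x")
    if p == 0:
        raise ValueError("p = 0 is handled by the log transformation")
    return x ** p

values = [1.0, 4.0, 9.0, 100.0]  # made-up positive data

# One row per rung of the ladder: p > 1 stretches the large values,
# p < 1 pulls them in.
ladder = {p: [power_transform(x, p) for x in values] for p in (2, 0.5, -1)}

for p, transformed in ladder.items():
    print(p, transformed)
```

Comparing the rows shows the effect of moving down the ladder: the gap between 9 and 100 is large on the raw scale and much smaller after the square root.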

5 Log transformations
A power transformation of $X^0$ should not be used because it changes all values to 1 (in other words, it makes the variable a constant).
Instead we can think of $X^0$ as shorthand for the log transformation $\log_e X$, where $e \approx 2.718$ is the base of the natural logarithms.
In practice most people prefer $\log_{10} X$ because it is easier to interpret: increasing $\log_{10} X$ by 1 is the same as multiplying X by 10.
In terms of result, it matters little which base is used, because changing the base is equivalent to multiplying the transformed variable (log X) by a constant.
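
The two claims about log bases can be checked numerically. This is a small sketch using only Python's standard `math` module; the sample values are arbitrary.

```python
import math

xs = [10.0, 100.0, 1000.0]  # arbitrary positive values

log10_x = [math.log10(x) for x in xs]
ln_x = [math.log(x) for x in xs]

# Changing base rescales the logged values by one constant:
# ln(x) / log10(x) equals ln(10) for every x (other than x = 1).
ratios = [ln / lg for ln, lg in zip(ln_x, log10_x)]

# Adding 1 on the log10 scale multiplies x by 10 on the raw scale.
x = 50.0
x_after_step = 10 ** (math.log10(x) + 1)

print(ratios)
print(x_after_step)
```

Because the two log scales differ only by a constant factor, the shape of the transformed distribution is the same in either base.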

6 Cautions: Power Transformations (1)
Descending the ladder of powers and roots compresses the large values of X and spreads out the small values.
As p moves away from 1 in either direction, the transformation becomes more powerful.
Power transformations are sensible ONLY when all the X values are POSITIVE. If they are not, this can be solved by adding a start value.
Some transformations are undefined for part of the range (the log for 0 and negative numbers, the square root for negative numbers).
Other power transformations will not be monotone, thus changing the order of the data.
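
The monotonicity caution and the start-value remedy can be seen on a tiny made-up data set. In this sketch the start is chosen so that the smallest value becomes 1; that choice is an assumption for illustration, not a rule from the lecture.

```python
def is_order_preserved(xs, transformed):
    """True if ranking by x and ranking by the transformed value agree."""
    order_x = sorted(range(len(xs)), key=lambda i: xs[i])
    order_t = sorted(range(len(xs)), key=lambda i: transformed[i])
    return order_x == order_t

xs = [-3.0, -1.0, 0.5, 2.0]  # hypothetical data containing negative values

# Squaring directly is not monotone here: -3 becomes the largest value.
squared = [x ** 2 for x in xs]

# Adding a start value first makes every value positive,
# and squaring is monotone on positive numbers.
start = 1.0 - min(xs)  # assumed start: shifts the minimum to 1
shifted_squared = [(x + start) ** 2 for x in xs]

print(is_order_preserved(xs, squared))          # False
print(is_order_preserved(xs, shifted_squared))  # True
```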

7 Cautions: Power Transformations (2)
Power transformations are only effective if the ratio of the largest data value to the smallest data value is large.
If the ratio is very close to 1, the transformation will have little effect.
General rule: if the ratio is less than 5, a negative start value should be considered.
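
A quick numerical sketch of the ratio rule, using invented values from 101 to 110. The helper names and the start value of -100 are assumptions for illustration. `gap_ratio` compares the spacing of the two smallest logged values with the spacing of the two largest: a value near 1 means the log has barely changed the shape.

```python
import math

def max_min_ratio(xs):
    """Ratio of the largest to the smallest data value."""
    return max(xs) / min(xs)

def gap_ratio(xs):
    """Spacing of the two smallest log10 values over the two largest."""
    logs = [math.log10(x) for x in sorted(xs)]
    return (logs[1] - logs[0]) / (logs[-1] - logs[-2])

raw = [float(x) for x in range(101, 111)]  # hypothetical data, 101..110
shifted = [x - 100.0 for x in raw]         # negative start of -100 gives 1..10

print(max_min_ratio(raw), gap_ratio(raw))          # ratio near 1: log is almost linear
print(max_min_ratio(shifted), gap_ratio(shifted))  # ratio of 10: log now bends the scale
```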

8 Transforming Skewed Distributions
The example below shows how a log 10 transformation can fix a positive skew.
The density estimate for average income for occupations from the Canadian Prestige data is shown on top; the bottom shows the density estimate of the transformed income.
[Figure: two density estimates (vertical axis: probability density function); top panel: income, bottom panel: log.inc]
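
The effect on skewness can be sketched with a simple moment-based skewness coefficient. The incomes below are invented and are NOT the Canadian Prestige data; they are chosen so that each value is double the previous one, which makes the logged values evenly spaced.

```python
import math

def skewness(xs):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes, each double the previous one.
incomes = [1500.0, 3000.0, 6000.0, 12000.0, 24000.0]
log_incomes = [math.log10(x) for x in incomes]

print(skewness(incomes))      # clearly positive: long right tail
print(skewness(log_incomes))  # essentially zero: symmetric on the log scale
```

Real income data will not become perfectly symmetric, but the direction of the change is the same: the log pulls in the long right tail.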

9 Transforming Nonlinearity: When is it possible?
An important use of transformations is to straighten the relationship between two variables.
This is possible only when the nonlinear relationship is simple and monotone.
Simple implies that the direction of curvature does not change: there is one bend.
Monotone implies that the slope of the curve is always positive or always negative.
In the figure, (a) can be transformed; (b) and (c) cannot.

10 The Bulging Rule for transformations
Tukey and Mosteller's rule provides a starting point for possible transformations to correct nonlinearity.
Normally we should try to transform explanatory variables rather than the response variable Y, since a transformation of Y will affect the relationship of Y with all Xs, not just the one with the nonlinear relationship.
If, however, the response variable is highly skewed, it makes sense to transform it instead.

11 Transforming relationships: Income and Infant mortality (1)
Leinhardt's data from the car library.
Robust local regression in the plot shows serious nonlinearity.
The bulging rule suggests that both Y and X can be transformed down the ladder of powers.
I tried taking the log of income only, but significant nonlinearity still remained.
In the end, I took the log 10 of both income and infant mortality.
[Figure: plot of infant against income with a robust local regression line]

12 Income and Infant mortality (2)
A linear model fits well here.
Since both variables are transformed by the log 10, the coefficients are easy to interpret: an increase in income by 1% is associated, on average, with a 0.51% decrease in infant mortality.
[Figure: plot of log.infant against log.income]
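
The percentage interpretation follows from the log-log form: if log Y = a + b log X, then multiplying X by a factor k multiplies Y by k to the power b, whatever the intercept or the log base. The sketch below takes the slope of -0.51 from the slide; the function name is made up for this example.

```python
def percent_change_in_y(slope, percent_change_in_x):
    """Percent change in Y implied by a log-log slope for a given percent change in X."""
    factor = (1 + percent_change_in_x / 100) ** slope
    return (factor - 1) * 100

slope = -0.51  # log-log slope quoted on the slide

one_percent = percent_change_in_y(slope, 1.0)   # close to -0.51
doubling = percent_change_in_y(slope, 100.0)    # roughly -30

print(round(one_percent, 3))
print(round(doubling, 1))
```

The "1% in X gives b% in Y" reading is an approximation that works well for small changes; for a large change such as a doubling of income, the exact factor should be used.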

13 Transforming Proportions
Power transformations will not work for proportions (including percentages and rates) if the data values approach the boundaries of 0 and 1.
Instead, we can use the logit or probit transformations for skewed proportion distributions. If their scales are equated, these two are practically indistinguishable.
The logit transformation (a) removes the boundaries of the scale, (b) spreads out the tails of the distribution, and (c) makes the distribution symmetric about 0. It takes the following form: $\mathrm{logit}(P) = \log_e \frac{P}{1 - P}$

14 Logit Transformation of a Proportion
Notice that the transformation is nearly linear for proportions between .20 and .80.
Values close to 0 and 1 are spread out at an increasing rate, however.
Finally, the transformed variable is now centered at 0 rather than .5.
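
These properties of the logit can be checked directly. The sketch assumes the usual definition, the natural log of P / (1 - P), and uses only the standard `math` module.

```python
import math

def logit(p):
    """Log-odds of a proportion p strictly between 0 and 1."""
    if not 0 < p < 1:
        raise ValueError("logit is defined only for 0 < p < 1")
    return math.log(p / (1 - p))

# Centered at 0: a proportion of .5 maps to 0.
center = logit(0.5)

# Symmetric about 0: .20 and .80 map to values of equal size and opposite sign.
low, high = logit(0.20), logit(0.80)

# Tails are spread out: the same .01 step covers far more logit units near 1.
step_middle = logit(0.51) - logit(0.50)
step_tail = logit(0.99) - logit(0.98)

print(center, low, high)
print(step_middle, step_tail)
```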

15 Next Topics: The Basics of Least Squares Regression
Least-squares fit
Properties of the least-squares estimator
Statistical inference
Regression in matrix form
The Vector Representation of the Regression Model
