================================================================
CHAPTER 4: PRACTICAL ASPECTS OF LINEAR MODELS -- COMPUTATIONS, INTERPRETATIONS
* R: Formula language for modeling, with excursions
- Create some toy 'data':
N |t|)"]
Ways to explore a model object:
class(mym)
is.list(mym)
names(mym) # Components of the list.
sapply(mym, class)
Similar for summary:
class(mym.smry)
is.list(mym.smry)
names(mym.smry) # Components of the list.
sapply(mym.smry, class)
What about formulas?
is.list(myf)
is.vector(myf)
is.character(myf)
is.expression(myf)
as.character(myf) # coercion
length(myf)
myf[1]
myf[2]
myf[3]
# Important: formulas are not character strings !!!!!!!!
# They have an internal representation that involves 3 parts,
# but all we need to know is that formulas feed into
# modeling (and some plotting) functions.
- "Operator overloading": new meanings for +,-,1,*,/,^ in model formulas
'+' means 'include'
'*' means 'include with interactions'
':' means 'include these specific interactions'
'^' means 'include hierarchical ANOVA terms with interactions of all orders'
'.' means 'include all predictors' (all variables other than the response)
'-' means 'exclude following predictor'
'1' means 'intercept'
- To prevent overloading: I(...), e.g., + I((x1+x2)^2)
- Linear models with categorical predictors:
summary(lm(y ~ x3, data=myd))
. Observation: The complete collection of dummies of a
categorical variable adds up to the intercept vector,
hence we''d have complete collinearity.
. Standard remedy: Drop one dummy.
It becomes the 'reference category'.
. Which category gets dropped by R?
A: The first one.
. Which category is the first one?
A: The first one listed in 'levels()':
levels(myd[,3])
Oh, this column is not a factor!
class(myd[,3])
But 'lm()'coerces 'myd[,3]' into a 'factor',
and it sorts the levels alphabetically:
levels(factor(myd[,3]))
==> 'c' gets dropped.
. How can we control what the first factor is?
A: We make the variable a factor and set
the levels in the order we want.
myd[,3] Dropped 'm', which is the first category.
. WARNING!!!
Do NOT set levels as follows when myd[,3] already is a factor:
levels(myd[,3]) observational data, where x1 cannot be manipulated
and where y is noisy, hence there are only averages,
but even these averages are not "known", only estimated...
- Interpretations of regression estimates:
. Slope interpretation: the issue of...
----------------------
| "ceteris paribus" |
----------------------
(English speakers pronounce this Latin expression
with "k" for "c" and stresses on first syllables.)
. "bj = est. ave. difference in y for pairs of objects with a unit difference in xj
when all other predictors are at the same levels."
. "bj = est. ave. difference in y for pairs of cases that ONLY
differ by a unit in xj."
. "Unit change in xj" and "holding all other predictors fixed"
are formulations often used but misleading because of
their implication of active experimentation.
- Interpretation of other regression estimates:
. Interpretation of b0?
"est. ave. y at x=0"
. Interpretation of stderr(b1)? !!!!!!!!!!!!!!!!!!!!!!!!!!!
("est. stdev of b1 varying from dataset to dataset")
(caveat: conditioning on x, same X, random y)
. Interpretation of t1?
("t1 = b1/stderr(b1) = dist from 0 in multiples of stderrs")
. Interpretation of p-value(b1)?
("prob of observing a t1 more extreme than the observed value,
dataset to dataset, under the assumption beta1=0")
(Warning! This is not a parameter but a random variable. Why?)
. Interpretation of s=RMSE? (R''s "Resid. stderr" is a bad term)
("est. sdev of errors/noise/epsilons")
. Interpretation of R Square?
("fraction of variation in y accounted for by the model")
(avoid: "explained by")
- More background on slope interpretations:
. Example: Car models, x1 = Weight, x2 = HP, y = MPG
The slope of x1 estimates the average difference
in MPG for pairs of car models that differ by one
Weight unit but have the same HP.
. Fine point: For different pairs, it doesn''t have to be the same
HP. It''s just within a pair that HP has to be the same.
. The previous point is due to linearity of the model:
You can think of the slopes as partial derivatives.
For linear functions f(x) = b0+b1*x1+...+bp*xp
the partial derivative wrt xj is bj, and it is
the same everywhere.
For nonlinear models the situation would be more difficult
because partial derivatives would not be constant.
. The association reflected by a slope is as follows:
delta xj |--> est. ave. delta y = bj * delta xj
at fixed levels of all other xk.
. If xj is a dummy variable, "delta xj = 1" is a two-group mean comparison.
bj = ave est difference between group 1 and the group 0
when all other predictors are the same.
. Use differences (delta xj) other than a unit difference
in the following cases:
+ The unit difference is too small to matter.
Example: xj = weight of a car in lb
Better use delta xj = 100 or 1000 lb.
+ The unit difference is too large and implies extrapolation.
Example: xj = weight of small diamonds in carats
If the range of the weights is 1/10 to 1/3 carat,
then a 1 carat difference has not been observed
and implies extrapolation.
Use delta xj = 1/10 carat instead.
. Intercepts are often extrapolations as well.
Examples:
+ cars with zero HP and zero weight
+ diamonds with zero weight
- Linear models and nonlinear associations:
. Many types of nonlinearities can be modeled with
variable transformations such as logarithms and roots
and with interaction terms.
We consider:
+ Additive nonlinearity
+ Interactions
They are modeled by forming new predictors.
. Additive nonlinearity:
xj |--> mu(...,xj,...) = ... + bj*f(xj) + ...
+ Frequent cases:
o f(xj) = log(xj)
o f(xj) = sqrt(xj)
o f(xj) ~ xj + xj^2
==> Only for predictors whose values are >0 and >=0, resp.
+ You will have to interpret what the meaning of bj is:
In both cases you should not interpret additive differences in xj
but multiplicative differences instead.
E.g., if two cases differ in xj by 10\%, i.e., by a factor 1.1,
the estimated average difference in y is
o bj*(log(xj*1.1) - log(xj)) = bj*log(1.1)
~ bj*0.1
o bj*(sqrt(xj*1.1) - sqrt(xj)) = bj*sqrt(xj)*(sqrt(1.1)-sqrt(1))
~ bj*sqrt(xj)*0.05
^^^^^^^^^^^
the term in the fitted equation
for the case with the lower xj value.
+ Note that the ceteris paribus clause holds like in linear models:
We compare pairs of cases with 10\% difference in xj but same
values on all other xk.
Again these 'same values' within a pair can differ between pairs.
Why? Because even in additive models the partial derivative wrt xj
does not depend on the other variables:
(delta/delta xj) (f1(x1)+...+fj(xj)+...+fp(xp)) = (d/d xj) fj(xj)
. Interaction nonlinearities:
+ The simplest form of interaction between two predictors is multiplicative.
I.e., one introduces new predictors which are the products of original predictors.
+ Consider a 'first order interaction', i.e., an interaction in two predictors xj and xk:
E[y(x)] = mu(...,xj,...,xk,...) = ... + bj*xj + bk*xk + bjk*xj*xk + ...
+ Note that we now have p+2 predictors: x0,x1,...xj,...,xj,...,xp,xj*xk
(p+2)nd predictor ^^^^^
+ IMPORTANT IMPLICATION OF INTERACTIONS: 'ceteris paribus' is no longer possible
for the interacting predictors:
---------------------------------------------------
| It is NOT possible to ask that xj*xk be the same |
| when there is a difference in xj. |
---------------------------------------------------
Q: How are going to interpret the slopes bj, bk and bjk?
+ If xj and xk are both quantitative, then the estimated interaction bjk
is interpreted as a correction to the slopes bj and bk:
o bj*xj + bk*xk + bjk*xj*xk = (bj+bjk*xk)*xj + bk*xk
==> The slope of xj depends on xk through bj(xk) := (bj+bjk*xk)
o bj*xj + bk*xk + bjk*xj*xk = bj*xj + (bk+bjk*xj)*xk
==> The slope of xk depends on xj through bk(xj) := (bk+bjk*xj)
+ If xj is quantitative and xk is a dummy (i.e., 0,1 valued),
we have the following two interpretations:
o bj(xk) = (bj+bjk*xk) = bj when xk=0
bj(xk) = (bj+bjk*xk) = bj+bjk when xk=1
Thus the interaction term allows for different xj slopes
in the two xk groups:
bjk = slope difference for xj between the xk groups
==> We can test whether the two groups require different slopes
by testing the null hypothesis H0: betajk = 0.
o bk(xj) = (bk+bjk*xj) is the estimated mean difference
between group xk=1 and group xk=0.
This estimated mean difference is a function of xk.
==> We can test whether the mean difference between the xk groups
depends on xj by testing the same null hypothesis H0: betajk = 0.
+ Both xj and xk are dummies:
bj = estimated average difference between groups xj=1 and xj=0
in the group xk=0.
bk = estimated average difference between groups xk=1 and xk=0
in the group xj=0.
bjk = estimated correction when comparing groups (xj=0;xk=0) and (xj=1;xk=1)
Additively the est.ave.diff. should be bj+bk,
but if the two groups 'interact' (have some sort of 'synergy'),
this needs to be correct to bj+bk+bjk.
Example: xj ~ drug A administered (1) or not (0)
xk ~ drug B administered (1) or not (0)
==> bjk ~ synergy between drugs A and B
bjk > 0: positive synergy (they strengthen each other''s effects)
bjk < 0: negative synergy (they attenuate each other''s effects)
* 'MBA' STATISTICS:
- The ranges of random variables and observations are often
of one of the following types:
. If a variable (x or y)
+ takes on only positive values, and
+ changes/differences are expressed in percentages/multiples
as opposed to numeric amounts,
==> "RATIO SCALE" variable
. If a variable
+ can take on positive and negative values and
+ changes/differences are expressed in numeric amounts,
==> "INTERVAL SCALE"
. If a variable
+ is confined to [0,1] (or [0,100] if expressed as \%)
==> "PROBABILITY/PROPORTION SCALE"
. If a variable
+ is confined to non-negative integers,
==> "COUNTING SCALE"
. If a variable
+ is confined to {0,1},
==> "Binary Variable" for 2-group comparison
- Examples of each? (Anything not covered?)
...
ratio: salaries, other compensations such as bonuses, investments, ...
interval: financial log-returns, coordinate axes for physical space
probability scale: relative frequencies as prob estimates, election returns
counting scales: number of ordered items
- Notes on ratio scales:
. Changes/gains/losses/differences are expressed as
+ factor changes: 'doubling', 'tripling', 'halving',...
"production is up by a factor 1.3"
+ percent changes: 'salary is up 5%', 'sales declined 20%', ...
"production is up by 30%"
Percentages are equivalent ways to express multiplicative differences.
. Questions/practice:
Translate back and forth between factor changes and percent changes:
+ 'an increase of 100%' means what factor change? ...
+ 'tripling income' means what percent change? ...
+ 'the internet grew 140%' means what factor change? ...
+ 'stocks lost 3%' means what factor change? ...
+ Interpret the following as investment returns over 6 periods:
(1) +10%, +10%, +10%, -10%, -10%, -10%
(2) +10%, -10%, +10%, -10%, +10%, -10%
Question:
o Who is better off?
o Do you end up at 100\% or less or more?
- Logarithms:
. Their principal role is to map
RATIO SCALES to INTERVAL SCALES
Map multiplicative factor changes/differences
to additive amount changes/differences.
. Linear Approximation of Logarithms: the "small change/difference approximation"
log(1+z) ~ z for small z (convention: small means |z| In general, if "ave. est." is interpreted as "median est.",
one can arbitrarily transform with monotone functions.
(For the normal and any symmetric distribution,
we have 'mean=median',
but not for skewed distributions.)
#================================================================
#
#
# - MODELS OF EXPONENTIAL GROWTH/DECAY/APPRECIATION/DEPRECIATION:
# . Example from the crazy dotcom years: growth of number of web servers
URL |t|)
## (Intercept) 13.408146 0.021395 626.7 <2e-16 ***
## year 0.900178 0.008834 101.9 <2e-16 ***
## ----
# Translation of an additive model on the log-scale
# to a multiplicative model on the original scale:
# log(nserv) ~ 13.4 + 0.9 * year + r
# nserv ~ exp( 13.4 ) * exp( 0.9 * year ) * exp( r )
# ^^^^^^^^^^^ exp( 0.9 )^year * "
# 660000 * 2.46^year * "
plot(webs.red, pch=16, xlab="yr", ylab="nserv")
lines(x=webs.red[,1], y=660000*2.46^webs.red[,1])
# Not perfect but adequate for now because we are more
# interested in interpretation.
#
# . Interpretation: We start with 660,000 servers in Jan 1997.
# From there on we have a yearly increase
# by a factor of 2.46, or 146\%.
#
# . Multiplicative errors: Make sense when error
# is proportional to size of y(x) (or E[y(x)] = mu(x) really).
# (The size of mice varies differently from the size of elephants.)
# [Another way to accommodate size-dependent error would be
# with direct modeling of the heteroscedasticity:
# y = b0 + b1*x + eps, where eps ~ N(0,x^2*sigma^2) (e.g.)
# But this does not limit y to positive values,
# as do multiplicative errors.]
#
# . Using linear approximation for log(y)-models:
# log(yhat) = b0 + b1*x
# difference in x by dx
# => difference in log(yhat) by b1*dx
# => "difference" in yhat by b1*dx*100% if |b1*dx|<0.1
#
# Web server data:
#
# b1 = .9 = (est. ave.) log-increase in webservers per year
# ==> dx = 1yr is much too large for linear approximation.
#
# Try dx = 1 month: b1*dx = b1/12 = 0.075
# ==> Small enough for linear approximation
# b1/12*100 = 7.5% = (est. ave.) percentage increase per month
# [Compounded this should amount to a 146% yearly increase,
# but here we see the approximation error:
# This monthly rate compounds to
1.075^12
## [1] 2.381780
# instead of 2.46. The more exact rate would be
2.46^(1/12)
## [1] 1.077899
# or closer to a 7.8% increase per month
# Still, linear approximation for log-differences |z| take log(x).
plot(log(wine[,1]), wine[,2],
pch=16, xlab="log Display ft", ylab="Wine Sales")
# Looks nearly ok. Ok enought to try a linear model:
wine.lm |t|)
# (Intercept) 83.560 14.413 5.797 6.24e-07 ***
# log(display.feet) 138.621 9.834 14.096 < 2e-16 ***
# ----
#
# Fitted Equation: (est. ave.) sales ~ 140 * log(display.feet) + 84
#
plot(wine, pch=16, xlab="log Display ft", ylab="Wine Sales")
lines(x=wine[,1], y=84+139*log(wine[,1]))
#
# Magic of logs again:
# . 10% difference in display.feet
# => (est. ave.) difference in sales is about 140 * 0.1 = 14 bottles
# . Doubling display feet: log(2) ~ 0.7
# => (est. ave.) difference in sales is about 140 * 0.7 ~ 100 bottles
#
# General: fit y ~ b0 + b1*log(x)
# 100% difference in x => ~b1*.7 difference in yhat. (est. ave.)
# 10% difference in x => ~b1/10 difference in yhat. (est. ave.)
# 1% difference in x => ~b1/100 difference in yhat. ( " )
#
#
#================================================================
#
#
# 3) MODELS OF CONSTANT ELASTICITY:
#
# Example: Fedex has(had) a delivery service called "CourierPak".
# At different prices, different numbers of CourierPak deliveries
# were requested by customers.
URL 0).
# The minimum volume is zero.
# 2) More importantly, by economic theory it is not a unit increase
# in Price that is associated with a constant decrease in Volume.
# It is a PERCENTAGE increase in Price that causes a PERCENTAGE decrease
# in volume.
#
courier.log.lm |t|)
# (Intercept) 8.04930 0.20145 39.957 < 2e-16 ***
# log(Price) -0.32732 0.07961 -4.112 0.000397 ***
# ----
# The raw Price->Volume fit and the log(Price)->Log(Volume) perform
# about the same, but the world prefers the latter. Interpretation:
#
# log(Volume) ~ 8 - 1/3*log(Price) + r
#
# . Constant elasticity as a power law:
# Volume = exp(8 - 1/3*log(Price + r)
# = exp(8) * Price^{-1/3} * exp(r)
# = 2981 * Price^{-1/3} * exp(r)
# In general: log(y) = b0 + b1*log(x) + r
# y = exp(b0) * x^b1 * exp(r)
# exp(b0) = est.median baseline at x=1
#
# . Small change approximation:
# Price difference of 1% ~ Volume difference of ... ?
# Price difference of 1%
# => log(Price) difference of about 0.01
# => log(Volume) difference of about (-1/3)*0.01 (|.|<0.1)
# => Volume difference of about -1/3 %
#
# In general: 1% difference in x ~ difference in yhat of b1%.
# This is called "constant price elasticity".
#
# (Assumes |b1|<0.1. If this is not the case, choose a smaller dx:
# dx=dPrice=0.001 => dy=dVolume=b1*0.1% .)
#
#
#================================================================ SKIP THE FOLLOWING
* FIXED COSTS/VARIABLE COSTS:
- Example: A manufacturer of 'blocks' gets orders of certain
quantities and production costs. It is clear
that total production cost is mostly driven by
quantity. But the manufacturer would like to know
what other factors make some orders more and others
less expensive.
- Distinguish:
. FC: fixed cost (often setup costs)
. VC: variable cost (cost per additional unit)
- TC - Total Cost model: Denote quantity (number of units ordered) by 'Q'.
TC = beta0 + beta1*Q + beta2*x2 + ... + eps
^^^^^ ^^^^^
?? ?? (?? are for you to fill in: FC, VC)
lm(Tot.Cost ~ Quantity + x2 + ...)
- Unrealistic error structure:
. Do you think 'eps' is homoskedastic,
same variation for small and large orders?
. (The other lack of realism, eps unbounded below,
will not be resolved in the following proceedings.)
- Predictor issue:
. Some predictors affect setup costs, hence contribute
to TC at the per-order level, as the predictor x2 above.
. HOWEVER, many predictors affect cost on a per unit basis.
+ amount of material per block,
+ difficulty of material,
+ extra features ordered for the blocks,
...
==> Contribution to total cost should be beta3*Q*x3
TC = beta0 + beta1*Q + beta2*x2 + beta3*Q*x3 + ... + eps
^^^^^ ^^^^^
FC-pred VC-pred
- New concept: AC = Average Cost := TC/Q
. Average cost model:
AC = beta0/Q + beta1 + beta2*x2/Q + beta3*x3 + ... + eps
. If the AC model is assumed homoskedastic,
then the TC model is heteroskedastic with
larger sigma for larger orders.
. In AC cost models, the fixed cost is prorated
from quantity Q to a single unit.
. AC and TC models have both FC and VC, but
---------------------------------------------------------------------
| + in the TC model FC is the intercept, VC is slope for Q, |
| + in the AC model FC is the slope for 1/Q, VC is the intercept. |
---------------------------------------------------------------------
==> Both models have interpretations of beta0 and beta1
as fixed and variable costs, in reverse order.
But they differ in their error structure, and in
how other factors affect the Total Cost of the order.
- A data example: Cost is given as Ave.Cost.
blocks |t|)
## (Intercept) 4146.470 865.000 4.794 3.22e-06 ***
## Units 22.149 1.658 13.356 < 2e-16 ***
## Multiple R-squared: 0.4752, Adjusted R-squared: 0.4725
# Model for Ave.Cost:
summary(lm(Ave.Cost ~ I(1/Units), blocks[sel,]))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.082 2.273 13.673 < 2e-16 ***
## I(1/Units) 1502.211 416.821 3.604 0.000397 ***
## Multiple R-squared: 0.06185, Adjusted R-squared: 0.05709
# The estimates of fixed and variable costs
# differ wildly between the two approaches.
# Which one would you believe?
# Typically Tot.Cost models have much higher R^2
# than Ave.Cost models, but this is irrelevant:
# The high R^2 stems from the trivial proportionality with Units,
# which the Ave.Cost model factors out.
# Often, Ave.Cost models are more plausible and
# yield comparable predictions when multiplied up to Tot.Cost.
#
#================================================================
* A PROTOCOL FOR REPORTING REGRESSION FINDINGS:
- Target audience: non-statisticians (your manager...)
. Principles: avoid technical terms, report relevant subject matter,
keep stats and its qualifications at low volume,
except when bearing bad news ... (which happens often in practice)
- Example:
URL "For reference, consider hypothetical car models roughly
in the middle of the pack, with weight 3,500 lb and 200 HP."
- Fit a model: (offline, not of interest to your audience)
my.cars.lm "Such reference car models are predicted to range in mileage
from as low as 20 MPG to as high as 32 MPG.
Thus there is considerable variation in highway mileage
even among models with the same weight and
same horsepower."
- Continue with general comments on the quality of the data and results:
summary(my.cars.lm)
==> "This is compatible with the fact that weight and horsepower
account for about two thirds (64%) of the variation in MPG.
(Weight and horsepower are highly statistically significant
as predictors of MPG, but, as is often the case, this does
not translate to an equal degree of practical significance.)"
- Translate the slopes as meaningfully as possible:
"If we compare models that differ by 500 lb in weight
but have the same horsepower, the estimated average difference
in mileage is about -2 MPG, quantifying the obvious fact that
heavier cars have lower mileage."
"If, on the other hand, we compare models that differ by 10 HP
but have the same weight, the estimated average difference
in mileage is about -0.3 MPG, again quantifying the obvious fact
that stronger cars have lower mileage."
- Maybe mention parenthetically the uncertainty in the slopes:
"(The 95% uncertainty about these numbers can be described
by a range -2+-1/3 MPG for the 500 lb weight difference
and -.3+-.06 MPG for the 10 HP difference.)"
- Complications:
+ Predictors can be ratio scale, hence you may have used log(x),
hence you need to consider a percent difference in x.
+ Any other kind of transformation also causes problems.
E.g., physics says to model "I(1/MPG.Highway)", which
would require additional non-trivial translations of
statement about differences, possibly replacing means
with medians, and translating endpoints of ranges.
+ Nonlinear effects are often intuitively expressed by
comparing two specific x-configurations:
"Compare, for example, cars with weights 3,500 and 4,000 pounds,
and 200 HP."
Especially when there are interaction terms you might have
to choose specific values for several predictor variables
and make comparisons across these predictor values.
================================================================