Bookclub: ISLR


TS Contributor

Hi,
I just pledged to work my way through this book (Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani). Anyone willing to join? Keeping the discipline would be more fun if we worked together.

Member

I'm in! I already did the first 4 chapters of this book during an intensive two-week course, then I got sick and kind of scrambled to keep up, mostly sticking with lecture notes and slides. So I would be up to re-read the first chapters and finish the rest.

It's a pretty good book, by the way. Just in the sweet spot between accessible and rigorous, for me at least.

Human

Hi,
I just pledged to work my way through this book (Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani). Anyone willing to join? Keeping the discipline would be more fun if we worked together.

Sounds nice, but I don't think I can commit fully to that. But it would be interesting to listen to, and maybe participate in discussions.

Aha, so ISLR means "Introduction to Statistical Learning with Applications in R". I guess Hastie and Tibshirani made the right decision to skip 'estimation' and call it 'learning', thereby including the machine-learning people.

Not a robit

TS Contributor

So,
my first musing/question about the book: in the chapter on Classification we discuss linear and quadratic discriminant analysis. It is also said that for more than two categories LDA is preferred over logistic regression. However, no significance calculation is given for LDA, and there is none in the R output either. Also, I do not recall anyone in this forum ever recommending LDA instead of a logistic regression, let alone recommending QDA. Is this because we can have no p-values? How would one calculate the sample size?

TS Contributor

I just played with Exercise 10 from Chapter 4 – predicting if the stock exchange would go up or down using weekly data. I got an interesting surprise – the exercise proposed that I split the training and the test data according to the year: everything before 2009 was training and the rest, up to 2010, was test data. It also proposed to only use Lag2 as a predictor, out of the 5 available lags and the trade volume. I was not sure about splitting according to time – my suspicion was that if there was a trend or any other time-related pattern, it might not be captured in the test set in the same way as in the training set, leading to biased performance.

So, I ran the logistic regression in 3 cases: with all the available data, Lag2 came out as the significant predictor. However, if I ran the logistic regression on the train data alone, then Lag1 was significant and Lag2 was not. I guess this means that probably neither of them is significant; all we see is some fluke in the data. I then decided to take a completely random selection of 800 points as the train data – and sure enough there were no significant predictors there.
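For anyone following along, the three fits can be sketched like this – a minimal sketch assuming the Weekly data from the ISLR package, with the column names used in the book:

```r
# Sketch of the three fits described above (Weekly data, ISLR package)
library(ISLR)

# (1) all available data
summary(glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
            data = Weekly, family = binomial))

# (2) the book's time-based split: train = years before 2009
train <- Weekly$Year < 2009
summary(glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
            data = Weekly, subset = train, family = binomial))

# (3) a completely random subset of 800 rows instead of a time split
set.seed(1)
idx <- sample(nrow(Weekly), 800)
summary(glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
            data = Weekly, subset = idx, family = binomial))
```

The seed is arbitrary; which coefficients look significant in case (3) will of course vary with the random draw.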

Now, apart from the true objective of the exercise, this raises interesting questions about our use of regression and model selection. I would have accepted either Lag1 or Lag2 as a legitimate predictor in any analysis, and I guess anyone would have accepted them as well. Given the recent discussions on the value of the p-value as a tool, this is quite sobering. Maybe, one could extend the p-value testing to require that train and test samples should be used as well? I am thinking of something like either build the model by using a training set and validate it using a test set – or work backwards, find the model using all the data but then to require that there should be some indication of the effect if we used a smaller random subset of the original data?
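The "effect should survive a smaller random subset" idea could be checked mechanically – a rough sketch, again assuming the ISLR Weekly data, refitting on many random subsets and counting how often Lag2 comes out significant:

```r
# Refit on 200 random subsets of 800 rows and record Lag2's p-value each time
library(ISLR)
set.seed(1)
pvals <- replicate(200, {
  idx <- sample(nrow(Weekly), 800)
  fit <- glm(Direction ~ Lag2, data = Weekly[idx, ], family = binomial)
  summary(fit)$coefficients["Lag2", "Pr(>|z|)"]
})
mean(pvals < 0.05)  # fraction of subsamples where Lag2 looks "significant"
```

If that fraction is not much better than what noise would give, the full-sample p-value deserves some suspicion.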

BTW, of the possible choices for a classification algorithm, logistic regression with a threshold of 0.5 behaved very poorly, and LDA was only marginally better. Both algorithms essentially bet on an upward movement – the logistic regression only predicted 7 downward movements out of a total of 289. Because there were more upward movements than downward, this got them a true positive rate of around 51-52%. Surprisingly, QDA got a whopping 58.5%, with KNN at k=1 being as bad as logistic regression, but k=5 slightly better than logistic regression and QDA. I actually never had QDA on my radar; I guess this will change now.
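The four-way comparison above can be reproduced roughly like this – a sketch assuming the book's 1990-2008 / 2009-2010 split and Lag2 as the only predictor:

```r
# LDA, QDA and KNN on the Weekly data, scored by test-set hit rate
library(ISLR)
library(MASS)   # lda(), qda()
library(class)  # knn()

train <- Weekly$Year < 2009
test  <- Weekly[!train, ]

lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)

lda.pred <- predict(lda.fit, test)$class
qda.pred <- predict(qda.fit, test)$class

# knn() wants matrices; Lag2 alone gives a one-column matrix
knn.pred <- knn(train = as.matrix(Weekly$Lag2[train]),
                test  = as.matrix(test$Lag2),
                cl    = Weekly$Direction[train], k = 5)

mean(lda.pred == test$Direction)  # fraction correct on the test years
mean(qda.pred == test$Direction)
mean(knn.pred == test$Direction)
table(lda.pred, test$Direction)   # confusion matrix shows the "always up" bet
```

The confusion matrices make the "betting on upward movement" behaviour visible at a glance.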

Not a robit

rogojel, slow down. My schedule doesn't open up until late next week - then I will be all over this with you.

Yes, these authors highly champion the use of cross-validation if the sample is marginally large, training/testing/validation sets for larger data, or leave-one-out for small samples. My problem is that I usually don't get large datasets.

Did you do any model scoring? That is something that I have not really done.

Not too familiar with lags. Were they calling the older dataset the lag? I always think of it as the run-in data, so say the prior 3 days or something like that in panel data. Am I right in thinking this?

TS Contributor

The advantages of a frequent traveller – since I have a Surface, I can read the book and use R on the plane. It definitely accelerates things.

@hlsmith: the data is a time series, and the lags are simply the data shifted by one, two, ... five periods.
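A tiny made-up example of what that shifting looks like:

```r
# Each lag column is the same series shifted back by that many periods
x    <- c(10, 12, 11, 13, 15)
lag1 <- c(NA, head(x, -1))      # value one period earlier
lag2 <- c(NA, NA, head(x, -2))  # value two periods earlier
cbind(x, lag1, lag2)
```

So in the Weekly exercise, Lag2 for a given week is the return from two weeks before.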

I only rated the models based on the true positive rate – and sort of based on aesthetics as well, in that I regard a well-balanced confusion matrix as more satisfying than one that only gets one category right.

TS Contributor

Last exercise for classification: the Boston data-set from the MASS library. The goal is to predict which districts will have an above median crime rate based on all sorts of descriptive data. Since I just learned about LDA, this is what I tried first.

Vanilla attempt – on a training set of 80% randomly selected data: 87% TPR, but not nice at all; the model basically just decided to always pick TRUE. It did not get the FALSEs at all, but I had mostly TRUEs in the test dataset, so...

Trying the QDA method and voila: 93% TPR and a well balanced confusion matrix.

My problem with LDA is that it does not give a p-value or any clue as to which variables are important in the model and which aren't – so I just decided to prune the model based on the group means reported in the output, on the basis of "large difference stays, small difference goes".

The new pruned LDA model improved to about 90% – the QDA just stayed where it was. So, in conclusion, there might be a way to improve an LDA model, but generally QDA will perform a bit better, for the price of more variance I guess.
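The basic setup above, before any pruning, can be sketched like this – assuming the Boston data from MASS, with the binary response defined as crime rate above the median and an 80/20 random split:

```r
# LDA vs QDA on Boston: predict whether crim is above the median
library(MASS)
Boston$crim01 <- factor(Boston$crim > median(Boston$crim))

set.seed(1)
train <- sample(nrow(Boston), 0.8 * nrow(Boston))
test  <- Boston[-train, ]

# all predictors except crim itself (it defines the response)
lda.fit <- lda(crim01 ~ . - crim, data = Boston, subset = train)
qda.fit <- qda(crim01 ~ . - crim, data = Boston, subset = train)

table(predict(lda.fit, test)$class, test$crim01)  # confusion matrices
table(predict(qda.fit, test)$class, test$crim01)
```

The exact rates depend on the random split, but the balance (or imbalance) of the confusion matrices is easy to inspect this way.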

So, trying the other standard methods: logistic regression performed about as well as LDA, but pruning based on p-values reduced model performance a lot.

KNN was almost as good as the QDA method (almost). Interestingly, increasing k from 1 to 5 did not improve the model at all. I really expected that it would, but apparently k=1 was already capturing all the structure in the data.

Not a robit

So you are still on the classification chapter. I may read it over the next couple of days. I had been working on learning interrupted time series, which seems very straightforward as long as there aren't any time-varying confounders or observation-level confounders.

New Member

I am thinking of something like either build the model by using a training set and validate it using a test set – or work backwards, find the model using all the data but then to require that there should be some indication of the effect if we used a smaller random subset of the original data?

TS Contributor

So,
a bit late, due to the year-end hassle, but still keeping at it – I am now working on chapter 6: Regression, and especially model selection, ridge and lasso.

I just finished exercise 8, where I had to generate a random X and a Y that was a polynomial function of X of degree 3, plus noise of course. Then generate the powers of X up to 10 and try to find a regression model correctly describing the X-Y relationship.

My first surprise was that the regsubsets function from the leaps package did a pretty good job identifying the model with 3 variables. I tried three selection criteria: Cp, adjusted R-squared and BIC. If I went for the optimum, then only BIC picked the right model, but if I went for the "knee" in the graphical representation, then all three were obviously identifying the model with 3 parameters as the best one.
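A sketch of that setup – the true coefficients below are arbitrary placeholders, and the candidate pool is the powers of x up to 10:

```r
# Cubic signal plus noise, then best-subset selection with leaps::regsubsets
library(leaps)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x - 1 * x^2 + 0.5 * x^3 + rnorm(100)  # degree-3 truth, made-up coefficients
dat <- data.frame(y, poly(x, 10, raw = TRUE))      # columns X1 ... X10

fit <- regsubsets(y ~ ., data = dat, nvmax = 10)
s <- summary(fit)
which.min(s$bic)    # model size picked by BIC
which.min(s$cp)     # model size picked by Cp
which.max(s$adjr2)  # model size picked by adjusted R^2 (maximized, not minimized)
plot(s$bic, type = "b")  # the "knee" is usually visible here
```

Note that adjusted R-squared is maximized while Cp and BIC are minimized, which is easy to trip over when comparing the three criteria.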

Using the lasso, the "best" model found by cross-validation also identified 3 parameters, but only if I picked lambda.1se and not lambda.min – which was my intuitive choice anyway.
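The lasso fit via glmnet, sketched with the same kind of simulated data (arbitrary coefficients again); cv.glmnet reports both lambda.min and lambda.1se, so the two choices are easy to compare:

```r
# Lasso on the powers of x, with cross-validated lambda via cv.glmnet
library(glmnet)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x - 1 * x^2 + 0.5 * x^3 + rnorm(100)  # made-up true model
X <- sapply(1:10, function(p) x^p)                 # predictors: x, x^2, ..., x^10

cv <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 -> lasso (alpha = 0 would be ridge)
coef(cv, s = "lambda.min")        # tends to keep a few extra terms
coef(cv, s = "lambda.1se")        # the sparser, more conservative choice
```

lambda.1se is the largest lambda whose CV error is within one standard error of the minimum, which is why it yields the sparser model.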

As I knew the parameter values, I could also compare the lm model's guess to that of the lasso – and interestingly the lm model was somewhat better. Also, comparing the MSE on a new set of similarly generated data, lm performed better.

So, I repeated the exercise adding a lot more noise. In this case the performance of the lasso, MSE-wise, was closer to that of the lm, but the simple lm model was still better.

Not a robit

I am jealous; I am still a little too busy and lazy to commit. The LASSO seems to outperform when it is more of a p > n scenario, I believe, and variables may be correlated. And as you know, the CV helps more with overfitting and out-of-sample application.

TS Contributor

I got slowed down by work but still have the ambition to continue – so, the last exercise for chapter 6: predicting the crime rate in Boston, using the Boston dataset from the MASS library.

The task is to generate all the models that were developed in the chapter. I generated a random sample of 100 datapoints for testing and left 406 in the training set.

The first thing I learned is that in the presence of some outliers, the test-set performance of the models can be hugely variable. For the exact same model, depending on the test set, I could get an MSE of 100 or 10. The effect depended on whether some outliers got into the test set or not – of course, an outlier in the test set meant that it had no influence on the model but generated a large residual.

So, comparing the methods – again the simple regression (with interactions) performed on average better than either the lasso or ridge regression. PCR was somewhere between the regression and the lasso, while PLS got very close to the simple regression. Given how much more difficult it would be to explain a PLS model compared to the regression, the simple regression still seems to be the winner – but the number of variables was really not high enough to show the advantages of the more sophisticated methods.
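The comparison could be set up along these lines – a sketch, assuming Boston$crim as the response, glmnet for the lasso/ridge and the pls package for PCR/PLS; the number of components (8) is an arbitrary placeholder, and in practice it would be read off the CV plots:

```r
# Compare lm, lasso, PCR and PLS on Boston by test-set MSE (100 held-out rows)
library(MASS)
library(glmnet)
library(pls)

set.seed(1)
test.idx <- sample(nrow(Boston), 100)
train <- Boston[-test.idx, ]
test  <- Boston[test.idx, ]
mse <- function(pred) mean((pred - test$crim)^2)

# simple linear regression
lm.fit <- lm(crim ~ ., data = train)
mse(predict(lm.fit, test))

# lasso (alpha = 1; alpha = 0 would give ridge)
X  <- model.matrix(crim ~ ., train)[, -1]
Xt <- model.matrix(crim ~ ., test)[, -1]
cv <- cv.glmnet(X, train$crim, alpha = 1)
mse(predict(cv, Xt, s = "lambda.1se"))

# principal components regression and partial least squares
pcr.fit <- pcr(crim ~ ., data = train, scale = TRUE, validation = "CV")
mse(predict(pcr.fit, test, ncomp = 8))
pls.fit <- plsr(crim ~ ., data = train, scale = TRUE, validation = "CV")
mse(predict(pls.fit, test, ncomp = 8))
```

Rerunning this with a few different seeds would also show the outlier-driven MSE swings mentioned above.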

Another point – it does make sense to include nonlinearities and interactions in the models. This would be easy with a simple regression; for all the others I just added product columns to the dataset (I could try squares as well). The tendency did not change as far as relative model performance was concerned, but the MSEs went down for all the models.

Also, the outliers complicate the modelling a lot – so exploratory analysis would be a must for any modelling. This does not seem to be a great discovery, but one tends to forget it in the heat of a project.