I have a dataset of text messages from which I'm trying to filter out the spam. It contains roughly 4,600 messages, each described by 57 features, along with their classification as spam or not. I have four 'versions' of the data: the regular data, and three others to which I have applied various types of preprocessing.

I'm supposed to fit each of these to a ridge regression model, and I must use cross-validation to choose the ridge regularization parameter. I have a loose understanding of what cross-validation is, but I'm confused about how to apply it to this situation, particularly how to split my data. Could I get some guidelines/pointers for this?

1 Answer

When performing cross-validation, you use part of the data (say nine tenths of the observations) to train the model and the remaining tenth to compute a goodness-of-fit statistic like R² or whatever you choose. The distinctive idea in cross-validation is that ALL data are used both to train and to test.

Assume, for instance, you have N = 1000 observations. You would set aside observations 1 to 100 for testing and train on all the others, then set aside observations 101 to 200 and train on all the others, and so on. Thus, you would fit your model ten times and average the resulting values of R² or whichever statistic you chose.
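The splitting scheme above can be sketched in a few lines. This is a hypothetical helper (not part of the original answer): it partitions N observation indices into k contiguous folds, holding each fold out for testing once while the rest serve as training data.

```python
def kfold_indices(n, k):
    """Yield (test_indices, train_indices) for each of k folds."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder when n is not divisible by k
        end = (i + 1) * fold_size if i < k - 1 else n
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield test, train

# With N = 1000 and k = 10, the first fold holds out observations 0-99
# and trains on the remaining 900:
folds = list(kfold_indices(1000, 10))
```

In practice you would shuffle the observation indices before splitting, since any ordering in the dataset (e.g. all spam messages grouped together) would otherwise bias the folds.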

It is common (and usually sufficient) to split the data into 5 or 10 folds and perform (5-fold, 10-fold) cross-validation as indicated. In linear regression you may, at essentially no extra cost, perform the most extreme variety of cross-validation, leave-one-out, in which you set aside a single observation each time.

If you are using R, I would suggest looking at the function lm.ridge in the package MASS.