You have 20 datapoints, each of which has 1,000,000 attributes. Each observation also has an associated $y$ value, and you are interested in whether a linear combination of a few attributes can be used to predict $y$. That is, you are looking for a model of the form $y \approx \sum_{i \in S} c_i x_i$, where $S$ is a small set of attribute indices.

Let's make the dataset and compute the $y$ values with a "hidden" model that we are trying to recover:

In [15]:

import numpy as np

def hidden_model(x):
    # y is a linear combination of columns 5 and 10...
    result = x[:, 5] + x[:, 10]
    # ... with a little noise
    result += np.random.normal(0, .005, result.shape)
    return result

def make_x(nobs):
    return np.random.uniform(0, 3, (nobs, 10**6))

x = make_x(20)
y = hidden_model(x)
print(x.shape)

We know we are already in trouble -- we've selected 2 columns that correlate with $y$ purely by chance, and neither of them is column 5 or 10 (the only 2 columns that actually have anything to do with $y$). We can look at the correlations between these selected columns and $y$, and confirm they are pretty good (again, just a coincidence):
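The selection step itself isn't preserved in this rendering. A minimal sketch of the idea, continuing from the cell above and assuming scikit-learn's LinearRegression, with the `xt` and `clf` names reused in the cells below:

from sklearn.linear_model import LinearRegression

# Pearson correlation of every column with y, in one vectorized pass
xc = x - x.mean(axis=0)
yc = y - y.mean()
corr = (xc * yc[:, None]).sum(axis=0) / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())

# keep the two columns that (by chance) correlate best with y
best = np.argsort(np.abs(corr))[-2:]
print(best, corr[best])

xt = x[:, best]                       # reduced 20 x 2 design matrix
clf = LinearRegression().fit(xt, y)
print(clf.score(xt, y))               # in-sample R^2 will typically look very good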

We're worried about overfitting, and remember that cross-validation is supposed to detect this. Let's look at the average $R^2$ score when performing 5-fold cross-validation. It's not as good as the in-sample fit, but still not bad...

In [27]:

cross_val_score(clf, xt, y, cv=5).mean()

Out[27]:

0.61616795686754722

And even if we make some plots of the predicted and actual data at each cross-validation iteration,
the model seems to predict the "independent" data pretty well...
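Those plots aren't preserved in this rendering either. A sketch of one way to produce them, assuming matplotlib and the modern `sklearn.model_selection` API (names and layout here are illustrative):

import matplotlib.pyplot as plt
from sklearn.model_selection import KFold

fig, axes = plt.subplots(1, 5, figsize=(15, 3), sharey=True)
for ax, (train, test) in zip(axes, KFold(n_splits=5).split(xt)):
    clf.fit(xt[train], y[train])
    ax.plot(y[test], clf.predict(xt[test]), 'o')                          # predicted vs. actual
    ax.plot([y.min(), y.max()], [y.min(), y.max()], '--', color='gray')   # perfect-prediction line
    ax.set_xlabel('actual y')
axes[0].set_ylabel('predicted y')
plt.show()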

If we instead repeat the column-selection step inside each cross-validation fold (choosing the columns using only that fold's training data), cross-validation properly detects the overfitting, reporting a low average $R^2$ score and producing a plot that looks like noise. Of course, it doesn't help us actually discover the fact that columns 5 and 10 determine $y$ (this task is probably hopeless without more data) -- it just lets us know when our fitting approach isn't generalizing to new data.
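The corrected procedure isn't shown in this rendering; a minimal sketch of selecting the columns inside each fold (an assumed reconstruction, reusing the correlation logic from above):

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

scores = []
for train, test in KFold(n_splits=5).split(x):
    # pick the two best-correlated columns using *only* the training fold
    xc = x[train] - x[train].mean(axis=0)
    yc = y[train] - y[train].mean()
    corr = (xc * yc[:, None]).sum(axis=0) / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    best = np.argsort(np.abs(corr))[-2:]

    clf = LinearRegression().fit(x[train][:, best], y[train])
    scores.append(clf.score(x[test][:, best], y[test]))

print(np.mean(scores))   # typically near zero or negative: the "signal" doesn't generalize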
