0:11 Hello again, and welcome back to Advanced Data Mining with Weka. We’re going to look at the time series forecasting package now to do roughly what we did in the last lesson without the time series forecasting package. I’ve got the airline data loaded here. The time series package has given me this additional Forecast tab. I’m going to go straight to that, and without further ado I’m just going to click Start and see what happens. Well, the time series package transforms the data into a large number of attributes. Unfortunately, you don’t get to see the attributes in the Preprocess panel. We still just have those two attributes there. You don’t see the generated attributes there.

1:00 training data: passenger_numbers; we’ve got month, quarter, date-remapped. The date-remapped is like what we did for the date in the last lesson. We did it manually, which changed it from milliseconds since January 1, 1970 into something more sensible. This actually does a better job, because it takes proper account of which years are leap years and which years aren’t leap years. Then we’ve got these lagged variables. The passenger_numbers lagged by – we just had 12 before – but now we’ve got the lags by 1, 2, 3, right up to 12 for 12 months, I guess.
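
To make the idea of lagged variables concrete, here is a minimal sketch in plain Python. It is not Weka's implementation; the function name `make_lagged_rows` and the toy values are assumptions for illustration only. Each row holds the current value followed by its lags, with `None` standing in for an unknown value wherever a lag reaches before the start of the series.

```python
def make_lagged_rows(series, max_lag):
    """Return one row per time step: [value, lag_1, ..., lag_max_lag].

    Lags that reach before the start of the series are None (unknown).
    """
    rows = []
    for t, value in enumerate(series):
        lags = [series[t - k] if t - k >= 0 else None
                for k in range(1, max_lag + 1)]
        rows.append([value] + lags)
    return rows


# Toy values for illustration, not the airline data.
toy_series = [10, 20, 30, 40, 50]
rows = make_lagged_rows(toy_series, max_lag=2)
for row in rows:
    print(row)
```

With `max_lag=12` on monthly data, the first 12 rows would each contain at least one unknown lag.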

1:31 We’ve got the square of the date-remapped and the cube of the date-remapped, in case you need those, and a bunch of other things, the date-remapped times these lagged variables. That’s a lot of variables. Underneath here is the generated model, which is very complicated. Let’s see how well it does. Actually, it doesn’t show here how well it does. To see that, we have to turn on Perform Evaluation. Let me click that here. Run it again, and we get a root-mean-squared

2:02 error of 10.6 on the training set, which looks good: last time we got 16.0. That was the best figure we got. But remember, this is the error on the training set. That’s always very misleading. Let’s make a simpler model. There are a lot of attributes here. We can’t edit the generated attributes, like I said, but we can apply a filter. So I’m going to go to Advanced Configuration, and for my base learner, I’m going to choose the FilteredClassifier. And in the FilteredClassifier, I’m going to specify linear regression just like we had before, and for the filter, I’m going to choose the Remove attribute filter. Here it is.

2:53 I’m going to configure that to remove attributes number 1, 4, and 16, which I happen to know are the correct ones. I’m sorry. I’m going to leave attributes 1, 4, and 16, and I’m going to set invertSelection to True. So these are the three attributes that I leave. Well, let’s just see what happens. Go back and look at my attributes, and here are the generated attributes that we saw before. Now here are the filtered attributes. We’ve got passenger_numbers, we’ve got date-remapped, and we’ve got this lag by 12. This is what we did in the last lesson, remember? Let’s see how we get on here. We got a root-mean-squared error of 27.8.

3:35 Actually, we got that on the last lesson, but we got even better results by deleting the first 12 instances. Remember the first 12 instances have got lagged values with unknown values, and linear regression does bad things with unknown values, at least as far as time series are concerned. So I want to delete the first 12 instances.
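
As a rough illustration of what deleting those leading instances amounts to, here is a small plain-Python sketch. It is an assumption-laden stand-in, not what Weka does internally: the helper `drop_leading_unknown` and the toy rows are hypothetical, and unknown values are represented as `None`.

```python
def drop_leading_unknown(rows):
    """Skip rows from the start until the first row with no None values."""
    start = 0
    while start < len(rows) and any(v is None for v in rows[start]):
        start += 1
    return rows[start:]


# Hypothetical rows of [value, lag_2]; the first two lag values are unknown.
rows = [[10, None], [20, None], [30, 10], [40, 20]]
kept = drop_leading_unknown(rows)
print(kept)  # [[30, 10], [40, 20]]
```

Only rows at the start are dropped; with a single lag of 12 on monthly data, that would be exactly the first 12 instances.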

3:58 Now, I could do that by applying two filters, one removing attributes and one removing instances, and I could use the MultiFilter. But actually on the time series forecasting panel, there’s an easy way of doing that, which you really need to learn, because you’re going to be doing it a lot. In Advanced Configuration, we’re going to look at Lag creation and the More options. We’re going to say remove leading instances with unknown lag values. Let me run that, and now I get a root-mean-squared error of 15.8, and a model which is exactly the same

4:33 as the model we got on the last lesson: 1.07 times lag_passenger_numbers plus 12.7. That’s what we got before. Now, let’s just return to this full model that we had. We won’t use the filtered classifier; we’ll just use linear regression. Here it is. Now, we get a root-mean-squared error of 8.7. It looks fantastic. But the model looks extremely complicated. We looked at it before. Here it is again. Look at the complexity of this model. So it’s probably overfitted. What we’d like to do is to evaluate this on data held out from training. We can do that with the Evaluation panel.

5:20 I’m going to evaluate on – we can either have a fraction here or a number of instances – I’m going to evaluate on 24 instances, that is two years’ worth of instances and run that. I get an error on the test data of 59. That’s huge. The error on the training data is only 6.4. So let’s just have a look at this on the slide. With the full model, all the attributes, we’ve got this enormous gap between the training error and the test error. And with this simple model, with just two attributes there, there’s a little gap, but not very big. So we could try reducing the attributes in other ways. We could actually use the AttributeSelectedClassifier.
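
The mechanics of holding out the last 24 instances and comparing training and test error can be sketched in plain Python. Everything here is an assumption for illustration: the series is synthetic (not the airline data), the model is a one-variable least-squares fit on a lag of 12, and the helper names `fit_line`, `rmse`, and `evaluate` are made up. It shows only how the split and the two error figures are produced, not Weka's evaluation code.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def evaluate(pairs, slope, intercept):
    """RMSE of the fitted line on (lagged value, actual value) pairs."""
    actual = [y for _, y in pairs]
    predicted = [slope * x + intercept for x, _ in pairs]
    return rmse(actual, predicted)


# Synthetic monthly series with a trend and a seasonal swing (made-up data).
series = [(100 + 2 * t) * (1 + 0.2 * math.sin(2 * math.pi * t / 12))
          for t in range(72)]

LAG, HOLDOUT = 12, 24
# Start at t = LAG so every instance has a known lagged value.
pairs = [(series[t - LAG], series[t]) for t in range(LAG, len(series))]
# Hold out the final 24 instances, preserving time order.
train, test = pairs[:-HOLDOUT], pairs[-HOLDOUT:]

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
train_rmse = evaluate(train, slope, intercept)
test_rmse = evaluate(test, slope, intercept)
print(f"train RMSE: {train_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

The split is taken from the end of the series rather than at random, because the point is to test on data that comes after everything the model was trained on.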

6:09 I won’t do that for you, but to do that I’d have to choose the metalearner AttributeSelectedClassifier and specify linear regression as the base learner and then specify some attribute selection method. If I left that at all the defaults, I would in fact get four attributes selected. And I’d get a training and test error of 11 and 19. Still some indication of overfitting. The gap between these two figures really indicates overfitting. Now, we reduced the model to two attributes using a filter, the Remove filter. But actually there is a simpler way of doing that, which you need to learn, in the Forecast panel.

6:51 If you go to Lag creation, it’s going to create lags from 1 to 12 – we saw those – but if you use custom lag lengths, we can increase the minimum to 12, and now it’s only going to create a lag length of 12. I can remove the powers of time. Remember we had the time squared and the time cubed. We can remove the product of time and lagged variables. And if I go to periodic attributes here and click Customize, then I can include whichever of these attributes I want it to generate. Now, I’m not going to include any of those. So that will get us the simplest attribute set. I’ll just run that, and let’s look

7:37 now at the attributes being used, just three of them: passenger_numbers, date-remapped, and this lag by 12. Down here, of course, we’ve got the same result as we got before. We’ve got the same model and the same training and test errors. If we plot these things, this is the training data. Now remember we’re ignoring the first 12 instances at the beginning because we have unknown values for the lagged variable, and we’re reserving 24 instances at the end for testing. So if we look now at the full model, we get this red line, and you can see that the predictions over the test data are starting to vary from those data points.

8:14 If you look at the simple model, the one with just two attributes, then we get a more accurate line. Here they are, in fact, both together, and you can see the blue one from the simple model is more accurate than the red one for the full model. We’re using one-step-ahead predictions to evaluate the error here, which means that errors can propagate. If you look at the solid red line toward the end, the first of those big dips is an error, and then the second sort of ‘double dip’ is an error that’s propagated from the first error.

8:46 Once it starts making an error in this kind of evaluation, when we’re evaluating the one step ahead each time, the errors are going to propagate. So it’s a pretty bad thing: once you start making errors, they get worse and worse. OK. That’s it. Weka’s time series forecasting package makes it easy to experiment with lagged variables and other kinds of things like that. It automatically generates many attributes, perhaps too many attributes, so it’s a good idea to always try simpler models. You can use the Remove filter, as we did at first, or you can choose which attributes you want using the Lag creation and Periodic attributes tabs under Advanced Configuration.
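
Here is a small plain-Python sketch of how such an error can propagate. It rests on an assumption: that each forecast is fed back in as the lag input for a later forecast. The function `recursive_forecast`, the toy history, the lag of 3, and the injected error are all hypothetical, chosen so the arithmetic is easy to follow; this is not Weka's evaluation procedure.

```python
def recursive_forecast(history, steps, lag, slope, intercept,
                       error_at=None, error=0.0):
    """Forecast `steps` values with y[t] = slope * y[t - lag] + intercept.

    Each prediction is appended to the series, so it later serves as a
    lag input. `error_at` injects a one-off error at that forecast step.
    """
    values = list(history)
    for step in range(steps):
        prediction = slope * values[-lag] + intercept
        if step == error_at:
            prediction += error  # hypothetical one-off forecasting error
        values.append(prediction)
    return values[len(history):]


history = [100.0, 120.0, 90.0]  # toy values, one short "season" of length 3
clean = recursive_forecast(history, steps=6, lag=3, slope=1.0, intercept=10.0)
faulty = recursive_forecast(history, steps=6, lag=3, slope=1.0, intercept=10.0,
                            error_at=1, error=-20.0)

gaps = [f - c for f, c in zip(faulty, clean)]
print(clean)   # [110.0, 130.0, 100.0, 120.0, 140.0, 110.0]
print(faulty)  # [110.0, 110.0, 100.0, 120.0, 120.0, 110.0]
print(gaps)    # [0.0, -20.0, 0.0, 0.0, -20.0, 0.0]
```

The single dip injected at the second forecast reappears one lag length later, which mirrors the ‘double dip’ in the plot. With a slope above 1, the repeated error would grow each time round.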

9:30 As always in data mining, you need to be wary of evaluation based on the training data, and you can hold data out using the Evaluation tab. Finally, we’re evaluating time series using repeated one-step-ahead predictions, which means that errors propagate.

Using the time series forecasting package

Dealing manually with time series is a pain, as we learned in the last lesson. Weka’s time series forecasting package automatically produces lagged variables, plus many others – perhaps too many! It transforms the data by adding a large number of attributes, which, unfortunately, invites overfitting. This is indicated by a large discrepancy between error on the training set and error on independent test data. You can configure Weka to reduce the number of added attributes.