This is a long post but it is not conceptually difficult. Please bear with me.

I am trying to model the seasonality of production volume of an agricultural commodity. I do not care about the structure of the model (as long as it produces sensible output) nor about explaining seasonality; my goal is to adjust the series for seasonality before proceeding to forecasting.

I know the seasonality is due to supply (weather is a key determinant) as well as demand (higher demand around Christmas and Easter, etc.).

My data is weekly, and I am following Prof. Hyndman's approach of modelling the time series as an ARMA process with Fourier terms to account for seasonality (see here). I also include Christmas and Easter dummies alongside the Fourier terms.
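For concreteness, the Fourier regressors in question can be sketched as follows. This is a minimal Python illustration (the function name `fourier_terms` is mine, not from any package), using the period of $365.25/7 \approx 52.18$ weeks that Hyndman recommends for weekly data:

```python
import math

def fourier_terms(n, K, m=365.25 / 7):
    """Return K pairs of Fourier regressors for a series of length n.

    m is the seasonal period; 365.25/7 (about 52.18 weeks per year)
    is the usual choice for weekly data.
    """
    rows = []
    for t in range(1, n + 1):
        row = []
        for k in range(1, K + 1):
            row.append(math.sin(2 * math.pi * k * t / m))
            row.append(math.cos(2 * math.pi * k * t / m))
        rows.append(row)
    return rows

# Two years of weekly data with 3 Fourier pairs -> a 104 x 6 regressor matrix.
X = fourier_terms(n=104, K=3)
```

These columns would then enter the ARMA model as exogenous regressors, alongside the holiday dummies.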

You might think I do not need a Christmas dummy, since Christmas is always the 25th of December and thus the Fourier terms would account for it, unlike Easter, which jumps back and forth between March and April from year to year. However, Christmas sometimes occurs in week 51 and sometimes in week 52 (at least in my calendar), and Fourier terms will not account for that; hence a Christmas dummy is actually relevant.

Actually, I would like to include up to four dummies for Christmas (from two weeks before Christmas to one week after), since the seasonal effect is not concentrated in the Christmas week alone but is spread out a little in time. The same holds for Easter.

Thus I end up with $4+4=8$ dummies, up to 25 pairs of Fourier terms and an unknown ARMA order, which could perhaps be as high as (5,5). This gives me an awfully large pool of models to choose from, so I have trouble replicating the approach used in the source above because of limited computational resources. Recall that the original source had no dummies, so its model pool was $2^{4+4}=256$ times smaller than mine, and it was possible to try out a reasonably large proportion of all possible models and then choose one of them by AIC.
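To put a number on the pool size, here is the arithmetic, counting only unrestricted ARMA(p, q) orders with $0 \le p, q \le 5$ (allowing individual coefficients to be restricted to zero would inflate the count further):

```python
# Size of the joint model pool: every subset of the 8 holiday dummies,
# 0..25 pairs of Fourier terms, and every unrestricted ARMA(p, q) order
# with p and q between 0 and 5.
dummies = 2 ** (4 + 4)   # 256 dummy subsets
fourier = 25 + 1         # 26 choices for the number of Fourier pairs
arma = 6 * 6             # 36 unrestricted ARMA orders
pool = dummies * fourier * arma
print(pool)  # 239616 candidate models
```

Fitting each of these by ML, 100 times over for the rolling windows, is where the computational burden comes from.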

Even worse, I would like to repeat my exercise multiple times (100 times) using a rolling window on the data I have. Thus the computational requirements for my exercise become exorbitant. I am looking for a solution, and hope someone could help me out.

Here are a few ideas I have:

Drop the ARMA terms altogether.

This will make the estimates of coefficients on Fourier terms and dummies biased and inconsistent, thus the seasonal component (the Fourier terms times their coefficients plus dummies times their coefficients) might be rubbish. However, if ARMA patterns are pretty weak, maybe the bias and inconsistency of the coefficient estimates will not be too bad?

Estimate only some of the models from the pool of all possible models, e.g. select the ARMA order with the auto.arima() function from the forecast package in R: auto.arima() does not consider all possible ARMA orders, only unrestricted ones (e.g. an AR(2) model with the AR(1) coefficient restricted to zero is not considered), and takes some other shortcuts. I could also force all dummies to be present without questioning whether each of them is really relevant.

This indeed reduces the model pool considerably (by a factor of $2^{4+4}=256$ due to the dummies alone), but it comes at the cost of likely including irrelevant variables. I am not sure how bad that could be.
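The kind of stepwise shortcut auto.arima() takes can be sketched as follows. This is my own toy Python illustration, not the actual auto.arima() algorithm: a caller-supplied `aic` function stands in for fitting an ARMA(p, q) model and returning its AIC, and the search only ever visits neighbours of the current best order:

```python
def stepwise_orders(aic, p_max=5, q_max=5):
    """Greedy stepwise search over unrestricted ARMA(p, q) orders.

    `aic` maps (p, q) to that model's AIC. Only neighbouring orders of
    the current best model are fitted, so far fewer than
    (p_max + 1) * (q_max + 1) models are ever evaluated.
    """
    best = (2, 2)                     # an arbitrary starting order
    best_aic = aic(*best)
    visited = {best}
    improved = True
    while improved:
        improved = False
        p, q = best
        for cand in [(p + 1, q), (p - 1, q), (p, q + 1), (p, q - 1)]:
            cp, cq = cand
            if not (0 <= cp <= p_max and 0 <= cq <= q_max) or cand in visited:
                continue
            visited.add(cand)
            a = aic(*cand)
            if a < best_aic:
                best, best_aic, improved = cand, a, True
    return best, len(visited)

# Toy AIC surface with its minimum at (1, 3); the search evaluates only
# a handful of the 36 possible orders.
toy = lambda p, q: (p - 1) ** 2 + (q - 3) ** 2
best_order, n_fitted = stepwise_orders(toy)
```

On a well-behaved AIC surface this finds the minimum after fitting perhaps a quarter of the orders, which is exactly the kind of saving auto.arima() aims for; on a rugged surface it can of course get stuck in a local minimum.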

Use a different estimation method: conditional sum of squares (CSS) instead of maximum likelihood (ML) when possible, or something else?

Unfortunately, CSS in place of ML does not seem to reduce the computational time enough in practice (I tested a few examples).

Do the estimation in two (or more) steps: select the ARMA order independently of the number of Fourier terms and inclusion/exclusion of dummies, then select the number of Fourier terms, then select the number of dummies, or something similar.

This would reduce the model pool to the following: the number of all possible ARMA models (e.g. $2^{5+5}$) plus 26 (0 pairs of Fourier terms, 1 pair of Fourier terms, ..., 25 pairs of Fourier terms) plus $2^{4+4}$ (all combinations of dummies). This particular division into stages certainly does not make sense, I admit. But perhaps there exists a smart way to divide model selection into steps? I guess we need mutual orthogonality of regressors to be able to consider their inclusion/exclusion from the model one by one, or something similar. I need some insight here...
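The attraction of staging, whatever the division into steps, is that it replaces a product of choices with a sum. Using the counts from the paragraph above (all $2^{5+5}$ subset-restricted ARMA specifications, 26 Fourier choices, $2^{4+4}$ dummy combinations):

```python
# Joint selection: every combination of ARMA specification, number of
# Fourier pairs and dummy subset must be considered together.
joint = 2 ** (5 + 5) * 26 * 2 ** (4 + 4)

# Staged selection: each aspect is settled on its own, holding the
# others fixed, so the counts add instead of multiply.
staged = 2 ** (5 + 5) + 26 + 2 ** (4 + 4)

print(joint, staged)  # 6815744 versus 1306
```

The gap is four orders of magnitude, which is why finding a defensible division into stages would matter so much here.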

Do something along the lines of Cochrane-Orcutt correction: (1) estimate a model without any ARMA terms (thus a simple OLS regression of production volume on Fourier terms and dummies); (2) examine the ARMA patterns in residuals and determine the relevant ARMA order; (3) estimate the corresponding ARMA model with Fourier terms and dummies as exogenous regressors. (I know Cochrane-Orcutt does not consider ARMA(p,q) but just AR(1), but I thought maybe I could generalize it? Not sure, perhaps I am making a mistake here?)

This would reduce the model pool to be estimated by a factor of $2^{5+5}$ if we would otherwise consider ARMA(5,5) and all sub-models nested in it, which is very nice. However, Cochrane-Orcutt correction is not as good as getting the model order correct in the first place. (I lack a theoretical argument there, but I suspect you do not always end up with the correct ARMA order if you use Cochrane-Orcutt or something similar.)
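Step (2) of this Cochrane-Orcutt-style scheme amounts to inspecting the autocorrelation (and partial autocorrelation) of the OLS residuals to suggest an ARMA order. A minimal pure-Python sketch of the sample ACF (the function name `sample_acf` is mine; in R one would simply call acf()):

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a series x."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n   # lag-0 autocovariance
    acf = []
    for k in range(1, max_lag + 1):
        ck = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf

# The OLS residuals from step (1) would replace this toy alternating
# series, whose lag-1 autocorrelation is strongly negative.
r = sample_acf([1, -1] * 50, max_lag=3)
```

A slowly decaying ACF with a cut-off in the PACF would point towards an AR term, and vice versa, in the usual Box-Jenkins fashion; the worry raised above, that this need not recover the order one would have chosen by fitting the full model, still stands.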

I would appreciate any comments on the ideas above and, even more, your own ideas on how to save computational time.

1 Answer

Your approach is very arbitrary in my opinion, as you are assuming certain model structures, such as fixed effects describing weekly patterns, which may have changed over time. Furthermore, specifying high-order ARIMA models is probably flawed due to over-specification, which is akin to kitchen-sink modelling (throw everything but the kitchen sink into the model and pray). You are tacitly ignoring level shifts, local time trends and one-time unusual values. You are tacitly ignoring possible changes in the error variance. We have found that a useful combination of weekly dummies, in concert with a minimally sufficient ARIMA structure and empirical identification of level shifts/time trends, can be effective. Finally, the holiday indicators you are considering can be carefully added to a working model to test the hypothesis of any lead/lag structure around the holidays; moving holidays like Easter would certainly be an example. With leap years, I would excise the 53rd-week effect by deleting, say, the first week in November and distributing its Y value evenly among the other 52 weeks.

You should not, IMHO, try to adjust the data for seasonality, but rather incorporate the seasonal structure into a composite model, for both interpretative and forecasting reasons. Hope this helps.

AFTER RECEIPT OF DATA (313 WEEKS): USING A STEPWISE FORWARD AND STEPDOWN IMPLEMENTATION

The data plot clearly shows a major intercept change at or around period 270 and another intercept change at 65. The analysis time, important to the OP, was 21 seconds on a two-year-old Dell laptop. The final model required generalized least squares, as the error variance changed dramatically at week 42 of 2011. The equation in algebraic form is presented here and here, and partially presented again here, with all coefficients being statistically significant at alpha = .05. The actual and cleansed series highlight the unusual activity. The Actual/Fit and Forecast plot is here, with the list of forecasts here and graphically here (looks like the flag). If you have any particular detailed questions, please contact me offline, as I don't know how to set up a chat room.

Thank you for such an explicit answer! Let me make a few remarks. (1) Many of your suggestions address the richness of the model. I agree that they make sense and are relevant in general; I skipped them in my example only to keep it simple. However, enriching the model by addressing your points would not solve the problem of computational burden; it would rather increase it.
– Richard Hardy, Nov 1 '14 at 12:22

(2) The leap years are already taken care of by Prof. Hyndman's original approach, so there is no need to give them special treatment a second time. Meanwhile, deleting certain weeks would distort any time structures in the data, which is unwelcome. (3) AIC selection among multiple candidate lead/lag structures around the holidays (since I do not have a single most plausible candidate to begin with) is preferable to testing hypotheses on each candidate structure one by one; it mirrors the problem of variable inclusion/exclusion.
– Richard Hardy, Nov 1 '14 at 12:23

(4) Modelling seasonality together with all the other effects is of course preferable in theory (due to proper model specification) but problematic in practice due to limited data and limited computational resources. As we saw, even the choice among purely seasonal models is computationally demanding; what, then, could one say about enriching the model with other extra features? If I had a model implied by theory (which I do not), I would proceed as you suggest. When choosing from a large pool of candidate models, the joint modelling of seasonality and all the other features becomes a curse.
– Richard Hardy, Nov 1 '14 at 12:30

I unfortunately agree about the computational time, but I disagree with using any one statistic (like the AIC or its brother, the error variance) to assess model importance, due to the explicit ignoring of deterministic effects remaining in the residuals and the consistent over-parameterization without tests of statistical significance for the model coefficients.
– IrishStat, Nov 1 '14 at 12:34

I do not mean that AIC is a panacea, but it will do the job of weighing (A) the extra variance induced by a deterministic component remaining in the residuals against (B) the extra variance due to imprecisely estimated extra parameters intended to explain those deterministic components. As for statistical significance of model coefficients, the use of statistical tests for variable selection has been criticized (e.g. here – quite elucidating).
– Richard Hardy, Nov 1 '14 at 13:14