Doing magic and analyzing seasonal time series with GAM (Generalized Additive Model) in R

Written on 2017-01-24

As I wrote in the previous post, I will continue in describing regression methods, which are suitable for double seasonal (or multi-seasonal) time series. In the previous post about Multiple Linear Regression, I showed how to use “simple” OLS regression method to model double seasonal time series of electricity consumption and use it for accurate forecasting. Interactions between two seasonal variables were successfully used to achieve this goal. The issue of forecasting time series from smart meters was discussed in my first post.

In this post (tutorial), I will fully introduce a “magical” Generalized Additive Model (GAM) to model time series of electricity consumption. You may wonder, why words “magic” or “magical” are mentioned? You will get your answer later in this post

Now I have to warn you that this blog post is very long. GAM is a very complex method and I wanted to go little bit deeper in order to show you this interesting regression method. I think it will be worth it and in the end of the post, I hope you will understand the theory behind R functions. In my opinion, R blogging (or any data mining or machine learning blogging) without theory interpretation of used methods has zero informative value for a reader.

So, let’s look under the roof of the GAM regression method. I think you need some motivation to it – here it is!

Theory behind Generalized Additive Model (GAM)

What do these three words (or letters) in the name of this method mean and where does it come from? I am sure that you know something about Linear Model (maybe because you had read my previous post about MLR). It is a simple regression method which models the response (dependent) variable by independent variable(s). It’s solved by the OLS method. So now we know, what the M in the name means. When we want to linearly model a response variable which is not from normal Gaussian distribution, for example, it can be binary (logistic regression) or discrete (Poisson) variable, we can use a generalization of the linear model - Generalized Linear Model (GLM). It’s solved by the iteratively reweighted least squares (IRLS) method. The G in the name is also clear. The A letter Additive Model, means that the response variable depends linearly on unknown smooth functions. In the other words, the goal is to model the response variable by independent variables, which are in the form of some smooth functions. Voilà, GAM is created.

Smooth functions are also called splines. Smoothing splines are real functions that are piecewise-defined by polynomial functions (basis functions). The places, where the polynomial pieces connect are called knots. In GAMs, penalized regression splines are used in order to regularize the smoothness of a spline.

Therefore, the model can be written in a linear way like this:

where \( \mathbf{X} \) is a model matrix and \( \beta \) is a vector of regression coefficients.

Then, the objective function to be minimized is:

where \( \lambda \) is a smoothing parameter and the integral of squares of second derivatives can be written as:

where \( \mathbf{S} \) is matrix of known coefficients.

All this mathematical madness then implies that regression coefficients can be obtained (estimated) by the equation:

\( \hat{\beta} \) is called penalized least squares estimator in this case.
The method of obtaining the estimate of the \( \beta \) is called Penalized Iteratively Re-weighted Least Squares (P-IRLS).

There are some more important theoretical questions to answer. For example, what kind of smoothing splines exists? Or, how to choose an optimal smoothing parameter \( \lambda \)? How basis dimensions (or a number of knots) are set?

There are several smoothing bases \( b \) (splines) which are suitable for regression:

thin plate regression splines

cubic regression spline

cyclic cubic regression spline

P-splines

An example of the cubic basis function for dimension \( q = 3 \):

You can read more about them in references, which I will write up. In this post, for daily seasonality cubic regression spline, and for weekly seasonality P-splines will be used. Both types of splines are knot-based, so choosing a right number of knots will be important.

Next, an important procedure is to choose (estimate) an optimal smoothing parameter \( \lambda \) and the number of basis dimensions (i.e. degrees of freedom). This can be done in GAM by Generalized Cross Validation score (GCV). It minimizes an equation:

where \( \mathbf{A} \) is the projection matrix (i.e. influence or hat). It’s obvious that when lambda is near 1 then spline will be over-smoothed, in opposite side when lambda is near zero than spline isn’t penalized, so the method behaves like a classical OLS. With a number of basis dimensions (estimated degrees of freedom), it is opposite. Higher dimension implies that fit will be less smoothed (overfit), on the other side lower dimensions implies more smoothed behavior of fitted values.

Interactions

We are almost at the end of the explanation of GAMs theory. One more thing. As I showed in the previous blog post, interactions are a very important part of the regression model for double seasonal time series. With GAMs there are four (!) main possibilities, how to include them to the model. First is the most basic, like in MLR, the multiplication of two independent variables: \( x_1\times x_2 \). Second one is possibility to use smoothed function to one variable: \( f_1(x_1)\times x_2 \). Third one comes to use same smoothed function for both variables: \( f_1(x_1)\times f_1(x_2) \mbox{ (often denoted } f(x_1, x_2)) \). Fourth one is the most complex, with GAM it is possible to use tensor product interactions. So it is possible to use different smoothing bases for variables and penalize it in two (when we do interactions of two independent variables) different ways: \( f_1(x_1)\otimes f_2(x_2) \). More nicely, tensor product interactions can be written as:

This allows for an overall anisotropic (different in each direction) penalty, so the overall shape of a tensor product smooth is invariant to a rescaling of its independent variables. This is a huge advantage in the comparison to usage of one smoothing function. Simply said, we have theoretically supported that it’s allowed to use different metrics of variables in the interactions term.

Resources (references)

Phew…that’s almost all of the difficult theory behind GAM method. I will add some more things in analytical part of this post. Honestly, it was difficult for me to find this information on the web. So if you want to read more about GAM, here is the list of useful links which were used in my post:

Doing “magic” with GAMs for modeling time series

I have prepared a file with four aggregated time series of electricity consumption for an analysis. It can be found on my GitHub repo, the name of the file is DT_4_ind. The file was created easily by the package feather (CRAN link). Data manipulations will be done then by data.table package.

GAM methods are implemented in R in the awesome package mgcv by Simon Wood (link to mgcv).

For visualizations packages ggplot2, grid and animation will be used. One useful function from package car will be used too.

Read the mentioned smart meter data by read_feather to one data.table.

DT<-as.data.table(read_feather("DT_4_ind"))

Prepare DT to work with a GAM regression model. Transform the characters of weekdays to integers and use function recode from package car to recode weekdays as there are coming in the week: 1. Monday, …, 7. Sunday.

There is possible to see two main seasonalities in plotted time series: daily and weekly. We have 48 measurements during the day and 7 days during the week so that will be our independent variables to model response variable - electricity load. Let’s construct it:

N<-nrow(data_r)# number of observations in the train setwindow<-N/period# number of days in the train setmatrix_gam<-data.table(Load=data_r[,value],Daily=rep(1:period,window),Weekly=data_r[,week_num])

Here we are! Train our first GAM with function gam. Independent variables are modeled by smoothing function s, for daily seasonality cubic regression spline is used, for weekly seasonality, P-splines is used, a number of knots are logically set to the number of unique values. Let’s do it.

Package mgcv have many advantages and nice features. First is its visualization capabilities. Let’s try it:

layout(matrix(1:2,nrow=1))plot(gam_1,shade=TRUE)

That looks nice, right? We can see here the influence of variables to electricity load. In the left plot, the peak of the load is around 3 p.m. during the day. In the right plot, we can see that during weekends the consumption logically decreases.

What to look at here? EDF: estimated degrees of freedom - can be interpreted like how much given variable is smoothed (higher EDF value implies more complex splines). P-values: statistical significance of given variable to response variable, tested by F-test (lower is better). \( R^2 \) - adjusted R-squared (higher is better). GCV: GCV score (mentioned above). In the summary, we can see that the R-sq. (adj) value is a little bit low…

That is, of course, a horrible result. We need to include interactions of two independent variables to the model.

One more thing before that… What is hiding behind the stored gam_1 object and used optimizer?

gam_1$optimizer

## [1] "magic"

Here is that “magic”!

The first type of interactions will be the third one mentioned in the section about interactions, so one smoothing function to both variables is used. In the term s(Daily, Weekly) simple smoothing terms s(Daily) and s(Weekly) are also automatically included. Let’s do it.

Just a little differences in comparison to the gam_2 model, looks like with te fit is more smoothed. Are we missing something? Of course! Function te has a default for a number of knots \( 5^d \), where d is a number of dimensions (variables), which is in our case too small. Now, I will set a number of knots to the maximal possible value k = c(period, 7), what means that upper boundary for EDF (Estimated Degrees of Freedom) will be 48*7 - 1 = 335.

EDF = 335, here we are. We can see that R-squared is lower than with the model gam_4, it is due to that 335 is too high and we overfitted the model. It is prove to that GCV procedure (estimation of lambda and EDF) is working properly. You can read more about the optimal setting (choosing) of k argument at ?choose.k and of course in the book from Simon Wood.

With the mgcv package we have two more opportunities, methods, how to include tensor product interactions term - with functions ti and t2. ti produces a tensor product interaction, appropriate when the main effects (and any lower interactions) are also present, while te produces a full tensor product smooth. t2 is an alternative function to te and uses different penalization method. You can read more about it in a documentation of the package mgcv: use ?ti and ?t2 to see more.

I have also printed the GCV score value for the last three models, which is also a good criterion to choose optimal model among a set of fitted models. We can see that with t2 term and corresponding model gam_6 the GCV value is the lowest. It may be indicating that gam_6 is our best model so far.

Other model selection criterion, which is widely used in statistics, is AIC (Akaike Information Criterion). Criterion has equation \( AIC = 2k - 2\ln (\hat{L}) \), where k is the number of parameters to be estimated and \(\hat{L}\) is the maximized value of the likelihood function of the model. So lower values are better for AIC. Let’s look at them for our three models:

Seems OK. We can see that fitted values for models gam_4 and gam_6 are very similar. It’s very difficult to choose which model is better. It’s possible to use more visualization and model diagnostic capabilities of package mgcv to compare these two models.

The first one is function gam.check, which makes four plots: QQ-plot of residuals, linear predictor vs. residuals, the histogram of residuals and the plot of fitted values vs. response. Let’s make them for models gam_4 and gam_6.

The function gam.check makes also output to the console of more useful information. We can see again that models are very similar, just in histograms some differences can be seen, but it’s also insignificant.

We didn’t use so far the default plot function to gam object for models with tensor product interactions. Let’s use it and explore what it brings us.

It’s a nice contour line plot. On axes are our independent variables, contour lines and corresponding numbers on them represents effects of ind. variables to the response variable. Now it is possible to see some little differences. The model gam_6 with t2 have more “wavy” contours. So it implies that it adapts more to response variable and smoothing factor is lower. In the next parts of the post, the model gam_6 will be used for the analysis.

Another great feature of the package mgcv is the plotting function vis.gam. It makes 3D view (or 2D) of surface of fitted values according to independent variables. Let’s look at it.

Again we can see that the highest value of electricity load is on Monday at 3:00 pm, it is very similar till Thursday, then the load is decreasing (weekends).

Another interesting contribution to the time series regression model can be an inclusion of autoregressive model for serially correlated errors. Let’s look at it.

Autoregressive models and GAMS

GAM or MLR have assumptions in a model that errors (residuals) are identically and independently distributed (i.i.d.). In the case of the time series regression, it is very strong assumption, which is here, logically, not fulfilled. Present time series values are highly correlated with past values, so errors of the model will be correlated too. This phenomenon is called an autocorrelation. This implies that estimated regression coefficients and residuals of a model might be negatively biased, which also implies that previously computed p-values of statistical tests or confidence intervals are wrong.

How can we handle this situation? By inclusion of autoregressive model (AR) for errors in our model. So we have the model with the term for errors like this:

where the second equation is a classical AR(1) process and \( \phi \) is an unknown autoregressive coefficient to be estimated. Errors can be also nested within a week, which is in our case more appropriate, because of the double seasonal character of our time series. You can add also higher orders of AR process and also MA (moving average) model. You can read more about estimation and modeling of this kind of models in an excellent book by Box, Jenkins, and Reinsel: Time Series Analysis.

It’s possible to add correlation term for errors with function gamm, which stands for GAM mixture models. It calls the lme function from package nlme. Now, train basic model with function gamm and an another model with AR(1) process nested within a week - just add this argument to gamm function: correlation = corARMA(form = ~ 1|Weekly, p = 1).

Value 0.71 is pretty high, which indicates a strong dependency on previous values of errors (lags).

Now let’s look another diagnostic for these two models. Plot of values of a partial autocorrelation function applied to normalized residuals. It gives the partial correlation of residuals with its own lagged values, controlling for the values of the residuals at all shorter lags.

layout(matrix(1:2,ncol=2))pacf(resid(gam_6_ar0$lme,type="normalized"),lag.max=48,main="pACF of gam n.6")pacf(resid(gam_6_ar1$lme,type="normalized"),lag.max=48,main="pACF of gam n.6 with AR(1)")

We can see a significant difference between these two plots and corresponding values of pACF. Optimal values of pACF should be under dashed blue lines, which isn’t completely this scenario on the right plot (model with the AR(1)). It can be better…

I will do little hack now. For optimal choose of AR(p) and MA(q) orders auto.arima function, from the library forecast, will be used on residuals from the model gam_6_ar0. It automatically chooses optimal orders of ARMA (in our case) based on AIC criterion. As we can use just ARMA models in gamm, so nonstationarity isn’t allowed, set an argument stationary = TRUE.

The new model gam_6_arma is slightly better in the meaning of AIC. P-value is again very small, which means that gam_6_arma is significantly better than gam_6_ar1.

Now do a comparison of these two models on theirs pACF values.

layout(matrix(1:2,ncol=2))pacf(resid(gam_6_ar1$lme,type="normalized"),lag.max=48,main="pACF of gam n.6 with AR(1)")pacf(resid(gam_6_arma$lme,type="normalized"),lag.max=48,main="pACF of gam n.6 with ARMA(1,2)")

It’s better now. Can we be happy? Up to now, everything seems fine.

Let’s make one more visualization of residuals of two models to confirm our little uncertainty about the correctness of the model with ARMA(1,2). Compare it with baseline model gam_6_ar0.

Here we are. The model with an ARMA(1,2) process has residuals at low fitted values somehow correlated, which is not good (heteroscedasticity). What now? I have to be honest, I can’t explain why this interesting thing happened. So, I would be very happy when someone guides me to the right direction and writes some notes to the discussion, it would really help. One more note about the inclusion of ARMA models to GAM or MLR. It didn’t help in the meaning of forecast accuracy, I made a lot of experiments, but MAPE was higher than with models without ARMAs. The big disadvantage is also higher time complexity than with the simple model without ARMA.

Animations and conclusions

I will end this post with a visualization analysis of the model gam_6. I will try to do some animated dashboards of main characteristics of the model thru time. It should be very interesting how electricity load changes and how the model can adapt to these changes. In the first two animations you can see three plots on the one figure:

Animations were created by functions from package grid to layout ggplots to one figure, and function saveGIF from package animation to create the final GIF. Here there are:

We can see the behavior of GAMs on two times series from two different types of industries: commercial property and light industrial. Notice (look), that fitted values and forecasts are smoother than where were with Multiple linear regression from the previous post. It’s good information to know, it implies that our model isn’t overfitted. On the other hand, I also compared forecast performance of GAM and MLR, but GAM was slightly worse than MLR, which can be a little bit surprising for somebody. Seems that usage of penalized least squares (GAM) for forecasting time series doesn’t bring improvement against classical OLS (MLR). But there are other advantages, which I will point out in the end.

Another animation, which I created, is a 3D visualization of fitted values like a surface. As I showed you before, it can be done very easily by function vis.gam.

We can compare it with the first dashboard because again it’s time series from commercial properties. EDF is also printed to the title of GIF for reasons to see how it changes depending on the behavior of electricity consumption.

With this, I would like to end the main part of this tutorial and conclude it with some remarks.

In the beginning of the post, motivation and the theory behind GAM method was introduced. Next, the analytical (“magical”) part continued with explaining various features of the package mgcv. Different types of interactions were showed and analyzed - best chose was tensor product interactions (t2). Next, consideration about the correctness of an autoregressive model (or ARMA) for errors was discussed - ended with an open question. In the end, I tried to make a big view of a behavior of GAMs by animations of its performance on double seasonal time series.

Everything that was said implies these advantages and disadvantages of using GAM for your problem:

Advantages:

Many possibilities how to model your independent variables (many types of smooth functions).

Define interactions by tensor product, which means usage of different basis functions for interacted variables.

Test statistical significance of a non-linear relationship of the independent variable to dependent.

Disadvantages:

Complex method, difficult to learn.

It’s sometimes difficult to interpret results.

Several parameters to tune, which have to be set carefully.

In the future post, I want to focus on some machine learning method like Support Vector Regression or Regression Trees with alongside of grid search for tuning parameters.

The script (3post-GAM.R) for a creation of the whole tutorial is located on my GitHub repository. I would like to encourage you to add any feedback to the discussion below. Thank you for reading this long post.