Yesterday, the Munich datageeks Data Day took place. It was a totally fun event – great to see how much is going on, data-science-wise, in and around Munich, and how many people are interested in the topic! (By the way, I think that more than half the talks were about deep learning!)

Whatever the title, it was really about showing a systematic comparison of forecasting using ARIMA and LSTM, on synthetic as well as real datasets. I find it amazing how little is needed to get a very decent result with LSTM – how little data, how little hyperparameter tuning, how few training epochs.

Of course, it gets most interesting when we look at datasets where ARIMA has problems, as with multiple seasonality. I have such an example in the talk (in fact, it’s the main climax ;-)), but it’s definitely also an interesting direction for further experiments.

More and more often, and in more and more different areas, deep learning is making its appearance in the world around us.
Many small and medium businesses, however, will probably still think – Deep Learning, that’s for Google, Facebook & co., for the guys with big data and even bigger computing power (barely resisting the temptation to write “yuge power” here).

Partly this may be true. Certainly when it comes to running through immense permutations of hyperparameter settings. The question however is if we can’t obtain good results in more usual dimensions, too – in areas where traditional methods of data science / machine learning prevail. Prevail, as of today, that is.

One such area is time series prediction, with ARIMA & co. top on the leader board. Can deep learning be a serious competitor here? In what cases? Why? Exploring this is like starting out on an unknown road, fascinated by the magical things that may await us 😉
In any case, I’ve started walking down the road (not running!), in a rather take-your-time-and-explore-the-surroundings way. That means there’s much still to come, and it’s really just a beginning.

Here, anyway, is the travel report – the presentation slides, I mean: best viewed on RPubs, as RMarkdown on github, or downloadable as pdf).
Enjoy!

R for SQListas, part 2

Welcome to part 2 of my “R for SQListas” series. Last time, it was all about how to get started with R if you’re a SQL girl (or guy)- and that basically meant an introduction to Hadley Wickham’s dplyr and the tidyverse. The logic being: Don’t fear, it’s not that different from what you’re used to.
This (and upcoming) times it will be about the other side of the coin – if R was “basically just like SQL”, why not stick with SQL in the first place?
So now, it’s about things you cannot do with SQL, things R excels at – those things you’re learning R for :-). Remember in the last post, I said I was interested in future developments of weather/climate, and we explored the Kaggle Earth dataset (as well as another one, daily data measured at weather station Munich airport)? In this post, we’ll finally try to find out what’s going to happen to future winters. We’ll go beyond adding trend lines to measurements, and do some real time series analysis!

Inspecting the time series

First, we create a time series object from the dataframe and plot it – time series objects have their own plot() methods:
start_time <- as.Date("1992-01-01")
ts_1950 <- ts(df_munich$avg_temp, start = c(1950,1), end=c(2013,8), frequency = 12)

Plotting the time series with the default plotting method

Time series decomposition

Now, let’s decompose the time series into its components: trend, seasonal effect, and remainder. We clearly expect there to be seasonality – the influence of the month we’re in should be clearly visible – but as stated before we’re mostly interested in the trend.

From these decompositions, it does not seem like there’s a significant trend. Let’s see if we can corroborate the visual impression by some statistical data. Let’s forecast the weather!
We will use two prominent approaches in time series modeling/forecasting: exponential smoothing and ARIMA.

Exponential smoothing

With exponential smoothing, the value at each point in time is basically seen as a weighted average, where more distant points weigh less and nearer points weigh more. In the simplest realization, a value at time t(n+1) is modeled as weighted average of the value at time t and the incoming observation at time t(n+1):

More complex models exist that factor in trends and seasonal effects.
For our case of a model with both trend and seasonal effects, the Holt-Winters exponential smoothing method generates point forecasts. Equivalently (conceptually that is, not implementation-wise; see http://robjhyndman.com/hyndsight/estimation2/), we can use the State Space Model (http://www.exponentialsmoothing.net/) that additionally generates prediction intervals.

The State Space Model is implemented in R by the ets() function in the forecast package. When we call ets() without any parameters, the method will determine a suitable model using maximum likelihood estimation. Let’s see the model chosen by ets():

The model chosen does not contain a smoothing parameter for the trend (beta) – in fact, it is an A,N,A model, which is the acronym for Additive errors, No trend, Additive seasonal effects.
Let’s inspect the decomposition corresponding to this model – there is no trend line here:

plot(fit)

Time series decomposition using ets()

Now, let’s forecast the next 36 months!

plot(forecast(fit, h=36))

Forecast from ets()

The forecasts look rather „shrunken to the mean“, and the prediction intervals – dark grey indicates the 95%, light grey the 80% prediction interval – seem rather narrow. Indeed, narrow prediction intervals are often a problem with time series, because there are many sources of error that aren’t factored in the model (see http://robjhyndman.com/hyndsight/narrow-pi/).

ARIMA

The second approach to forecasting we will use is ARIMA. With ARIMA, there are three important concepts:

AR(p): If a process is autoregressive, future values are obtained as linear combinations of past values, i.e.

where e is white noise. A process that uses values from up to p time points back is called an AR(p) process.

I(d):This refers to the number of times differencing (= subtraction of consecutive values) has to be applied to obtain a stationary series.
If a time series yt is stationary, then for all s, the distribution of (y(t),…, y(t+s)) does not depend on t. Stationarity can be determined visually, inspecting the autocorrelation and partial autocorrelation functions, as well as using statistical tests.

MA(q): An MA(q) process combines past forecast errors from time points up to p times back to forecast the current value:

While we can feed R’s Arima() function with our assumptions about the parameters (from data exploration or prior knowledge), there’s also auto.arima() which will determine the parameters for us.
Before calling auto.arima() though, let’s inspect the autocorrelation properties of our series. We clearly expect there to be autocorrelation – for one, adjacent months are similar in temperature, and second, we have similar temperatures every 12 months.

tsdisplay(ts_1950)

Autocorrelation check

So we clearly see that temperatures for adjacent months are positively correlated, while months in „opposite“ seasons are negatively correlated. The partial autocorrelations (where correlations at lower lags are controlled for) are quite strong, too. Does this mean the series is non-stationary?
Not really. A seasonal series of temperatures can be seen as a cyclostationary process (https://en.wikipedia.org/wiki/Cyclostationary_process) , where mean and variance are constant for seasonally corresponding measurements. We can check for stationarity using a statistical test, too:

The model chosen by auto.arima is ARIMA(1,0,5)(0,0,2)[12] where the first triple of parameters refers to the non-seasonal part of ARIMA, the second to the seasonal one, and the subscript designates seasonality (12 in our case). So in both parts, no differences are applied (d=0). The non-seasonal part has an autoregressive and a moving average component, the seasonal one is moving average only.

Now, let’s get forecasting! Well, not so fast. ARIMA forecasts are based on the assumption that the residuals (errors) are uncorrelated and normally distributed. Let’s check this:

res <- fit$residuals
acf(res)

While normality of the errors is not a problem here, the ACF does not look good:

ACF of auto.arima() residuals

Clearly, the errors are not uncorrelated over time. We can improve auto.arima performance (at the cost of prolonged runtime) by allowing for a higher maximum number of parameters (max.order, which by default equals 5) and setting stepwise=FALSE:

Comparing this forecast with that from exponential smoothing, we see that exponential smoothing seems to deal better with smaller training samples, as well as with longer extrapolation periods.
Now we could go on and use still more sophisticated methods, or use hybrid models that combine different forecast methods, analogously to random forests, or even use deep learning – but I’ll leave that for another time 🙂 Thanks for reading!

R for SQListas, what’s that about?

This is the 2-part blog version of a talk I’ve given at DOAG Conference this week. I’ve also uploaded the slides (no ppt; just pretty R presentation 😉 ) to the articles section, but if you’d like a little text I’m encouraging you to read on. That is, if you’re in the target group for this post/talk.
For this post, let me assume you’re a SQL girl (or guy). With SQL you’re comfortable (an expert, probably), you know how to get and manipulate your data, no nesting of subselects has you scared ;-). And now there’s this R language people are talking about, and it can do so many things they say, so you’d like to make use of it too – so now does this mean you have to start from scratch and learn – not only a new language, but a whole new paradigm? Turns out … ok. So that’s the context for this post.

Let’s talk about the weather

So in this post, I’d like to show you how nice R is to use if you come from SQL. But this isn’t going to be a syntax-only post. We’ll be looking at real datasets and trying to answer a real question.
Personally I’m very interested in how the weather’s going to develop in the future, especially in the nearer future, and especially regarding the area where I live (I know. It’s egocentric.). Specifically, what worries me are warm winters, and I’ll be clutching to any straw that tells me it’s not going to get warmer still 😉
So I’ve downloaded / prepared two datasets, both climate / weather-related. The first is the average global temperatures dataset from the Berkeley Earth Surface Temperature Study, nicely packaged by Kaggle (a website for data science competitions; https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data). This contains measurements from 1743 on, up till 2013. The monthly averages have been obtained using sophisticated scientific procedures available on the Berkeley Earth website (http://berkeleyearth.org/).
The second is daily weather data for Munich, obtained from http://www.wunderground.com. This dataset was retrieved manually, and the period was chosen so as to not contain too many missing values. The measurements range from 1997 to 2015, and have been aggregated by taking a monthly average.
Let’s start our journey through R land, reading in and looking at the beginning of the first dataset:
library(tidyverse)
library(lubridate)
df <- read_csv('data/GlobalLandTemperaturesByCity.csv')
head(df)
df <- read_csv('data/GlobalLandTemperaturesByCity.csv')
head(df)

Now we’d like to explore the dataset. With SQL, this is easy: We use WHERE to filter rows, SELECT to select columns, GROUP BY to aggregate by one or more variables…And of course, we often need to JOIN tables, and sometimes, perform set operations. Then there’s all kinds of analytic functions, such as LAG() and LEAD(). How do we do all this in R?

Entering the tidyverse

Luckily for the SQLista, writing elegant, functional, and often rather SQL-like code in R is easy. All we need to do is … enter the tidyverse. Actually, we’ve already entered it – doing library(tidyverse) – and used it to read in our csv file (read_csv)!
The tidyverse is a set of packages, developed by Hadley Wickham, Chief Scientist at Rstudio, designed to make working with R easier and more consistent (and more fun). We load data from files using readr, clean up datasets that are not in third normal form using tidyr, manipulate data with dplyr, and plot them with ggplot2.
For our task of data exploration, it is dplyr we need. Before we even begin, let’s rename the columns so they have shorter names:
df <- rename(df, avg_temp = AverageTemperature, avg_temp_95p = AverageTemperatureUncertainty, city = City, country = Country, lat = Latitude, long = Longitude)
head(df)

distinct() (SELECT DISTINCT)

Good. Now that we have this new dataset containing temperature measurements, really the first thing we want to know is: What locations (countries, cities) do we have measurements for?
To find out, just do distinct():
distinct(df, country)

Do you think this is starting to get difficult to read? What if we add FILTER and GROUP BY operations to this query? Fortunately, with dplyr it is possible to avoid paren hell as well as stepwise assignment using the pipe operator, %>%.

Meet: %>% – the pipe

This looks a lot like the fluent API design popular in some object oriented languages, or the bind operator, >>=, in Haskell.
It also looks a lot more like SQL. However, keep in mind that while SQL is declarative, the order of operations matters when you use the pipe (as the name says, the output of one operation is piped to another). You cannot, for example, write this (trying to emulate SQL‘s SELECT – WHERE – ORDER BY ): df %>% select(dt, avg_temp) %>% filter(city == ‘Munich’) %>% arrange(avg_temp). This can’t work because after a new dataframe has been returned from the select, the column city is not longer available.

arrange() (GROUP BY)

Now that we’ve introduced the pipe, on to group by. This is achieved in dplyr using group_by() (for grouping, obviously) and summarise() for aggregation.
Let’s find the countries we have most – and least, respectively – records for:
# most records
df %>% group_by(country) %>% summarise(count=n()) %>% arrange(count %>% desc())

In this way, aggregation queries can be written that are powerful and very readable at the same time. So at this point, we know how to do basic selects with filtering and grouping. How about joins?

JOINs

Dplyr provides inner_join(), left_join(), right_join() and full_join() operations, as well as semi_join() and anti_join(). From the SQL viewpoint, these work exactly as expected.
To demonstrate a join, we’ll now load the second dataset, containing daily weather data for Munich, and aggregate it by month:
daily_1997_2015 % summarise(mean_temp = mean(mean_temp))
monthly_1997_2015

As we see, average temperatures obtained for the same month differ a lot from each other. Evidently, the methods of averaging used (by us and by Berkeley Earth) were very different. We will have to use every dataset separately for exploration and inference.

Set operations

Having looked at joins, on to set operations. The set operations known from SQL can be performed using dplyr’s intersect(), union(), and setdiff() methods. For example, let’s combine the Munich weather data from before 2016 and from 2016 in one data frame:
daily_2016 % arrange(day)

Window (AKA analytic) functions

Joins, set operations, that’s pretty cool to have but that’s not all. Additionally, a large number of analytic functions are available in dplyr. We have the familiar-from-SQL ranking functions (e.g., dense_rank(), row_number(), ntile(), and cume_dist()):
# 5% hottest days
filter(daily_2016, cume_dist(desc(mean_temp)) % select(day, mean_temp)

We also have lots of aggregation functions that, if already provided in base R, come with enhancements in dplyr. Such as, choosing the column that dictates accumulation order. New in dplyr is e.g., cummean(), the cumulative mean:
daily_2016 %>% mutate(cum_mean_temp = cummean(mean_temp)) %>% select(day, mean_temp, cum_mean_temp)

OK. Wrapping up so far, dplyr should make it easy to do data manipulation if you’re used to SQL. So why not just use SQL, what can we do in R that we couldn’t do before?

Visualization

Well, one thing R excels at is visualization. First and foremost, there is ggplot2, Hadley Wickham‘s famous plotting package, the realization of a “grammar of graphics”. ggplot2 predates the tidyverse, but became part of it once it came to life. We can use ggplot2 to plot the average monthly temperatures from Berkeley Earth for selected cities and time ranges, like this:
cities = c("Munich", "Bern", "Oslo")
df_cities % filter(city %in% cities, year(dt) > 1949, !is.na(avg_temp))
(p_1950 <- ggplot(df_cities, aes(dt, avg_temp, color = city)) + geom_point() + xlab("") + ylab("avg monthly temp") + theme_solarized())

Average monthly temperatures for three selected cities

While this plot is two-dimensional (with axes time and temperature), a third “dimension” is added via the color aesthetic (aes (…, color = city)).

It seems like overall, Bern is warmest, Oslo is coldest, and Munich is in the middle somewhere.
We can add smoothing lines to see this more clearly (by default, confidence intervals would also be displayed, but I’m suppressing them here so as to show the three lines more clearly):
(p_1992 <- p_1992 + geom_smooth(se = FALSE))

Adding smoothing lines to more clearly see the trend

Good. Now that we have these lines, can we rely on them to obtain a trend for the temperature? Because that is, ultimately, what we want to find out about.
From here on, we’re zooming in on Munich. Let’s display that trend line for Munich again, this time with the 95% confidence interval added:
p_munich_1992 <- p_munich_1950 + (scale_x_date(limits=limits))
p_munich_1992 + stat_smooth()

Both fits behave quite differently, especially as regards the shape of the confidence interval near the end (and beginning) of the time range. If we want to form an opinion regarding a possible trend, we will have to do more than just look at the graphs – time to do some time series analysis!
Given this post has become quite long already, we’ll continue in the next – so how about next winter? Stay tuned 🙂