Data

Data exists in two data sets, training and test. The data are time series, each of 1000 time points, plus some summary statistics of these. The training set consists of 63530 samples, the test set of an additional 77769. In practice this means 490 MB and 600 MB of memory. Not surprisingly, just prepping the data and dropping it into randomForest gives an out-of-memory error. The question then is how to conserve memory.
It seemed that my Windows 7 setup used more memory than my Suse 13.2 setup. The latter is quite fresh, while Win 7 has been in use for two years now, so there may be some useless crap that wants to be resident there. I did find that Chrome keeps part of itself resident, so I switched that off (advanced setting). Other things which Win 7 has and Suse 13.2 lacks are Google Drive (it cannot be that hard to make it, but Google is dragging its heels) and a virus scanner, but there may be more.
This helped a bit, but this data gave me good reason to play with dplyr's approach of storing data in a database.

library(dplyr)
library(randomForest)
library(moments)

load('LATEST_0.2-TRAIN_SAMPLES-NEW_32_1000.saved')
my_db <- src_sqlite(path = tempfile(), create = TRUE)
#train_samples$class <- factor(train_samples$class)
train_samples$rowno <- 1:nrow(train_samples)
train <- copy_to(my_db, train_samples, indexes = list('rowno'))
rm(train_samples)
So, the code above reads the data and stores it in an SQLite database. At one point I made class a factor, but since SQLite does not have factors, this property is lost once the data is retrieved from the database. I also added a rowno variable. Some database engines have a function for row numbers; SQLite does not, and I need it to select records.
The key lesson I learned relates to the rowno variable. Once the data is in the database, it only knows what is in the database and only understands its own flavour of SQL. Dplyr does a good job of making it behave as much like a data.frame as possible, but in the end one needs a basic understanding of SQL to get it to work.
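To make the row-number selection pattern concrete, here is a minimal sketch on a toy data frame instead of the competition data. It uses src_sqlite() as above; note that in current dplyr versions this functionality has moved to the dbplyr package, where copy_to() on a DBI connection is the recommended route.

```r
library(dplyr)

# toy stand-in for the competition data, with an explicit rowno column
toy <- data.frame(rowno = 1:100, y = sin(1:100))

# on-disk SQLite database, as in the post
my_db <- src_sqlite(path = tempfile(), create = TRUE)
tab <- copy_to(my_db, toy, indexes = list('rowno'))

# the filter is translated to SQL and runs inside the database;
# collect() pulls the (small) result back as an ordinary data.frame
small <- filter(tab, rowno < 10) %>% collect()

# inspect the SQL that dplyr generates for the lazy query
show_query(filter(tab, rowno < 10))
```

Until collect() is called nothing is transferred, which is exactly what keeps the memory footprint down.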

Plot

The data has the property that part of it, the last n time points, is the 'true' samples. In this part the samples have an increase or decrease of 0.5%. The question is then what happens in the next part: a further increase, or a decrease. The plot below shows the true samples of the first nine records. What is not obvious from the figure is that the first two records have the same data, except for a different true part.

First model

As a first step I tried a model with a limited set of variables and only 10000 records. For this the x data has been compressed in two ways: from a time-series perspective, using the ACF, and from a trend perspective, where 10 points are used to capture the general shape of the curves, the latter via local regression (loess). Both approaches are applied to the true data and to all data. In addition, the summary variables provided with the data are used.
The result: an OOB error rate of 44%. Actually, this was a lucky run; I have had runs with error rates over 50%. I did not bother to check predictive capability.

mysel <- filter(train, rowno < 10000) %>%
  select(., -chart_length, -rowno) %>%
  collect()
yy <- factor(mysel$class)
vars <- as.matrix(select(mysel, var.1:var.1000))
leftp <- select(mysel, true_length:high_frq_true_samples)
rm(mysel)
myacf <- function(datain) {
  a1 <- acf(datain$y, plot = FALSE, lag.max = 15)
  a1$acf[c(2, 6, 11, 16), 1, 1]
}
myint <- function(datain) {
  ll <- loess(y ~ x, data = datain)
  predict(ll, data.frame(x = seq(0, 1, length.out = 10)))
}
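The two compression functions use only base R (stats), so they can be checked on a simulated series. The sketch below applies them to a random walk of 1000 points, mirroring the structure of one record; the column names x and y follow the functions above.

```r
set.seed(42)
# simulated stand-in for one record: 1000 time points on [0, 1]
datain <- data.frame(x = seq(0, 1, length.out = 1000),
                     y = cumsum(rnorm(1000)))

# ACF-based compression: autocorrelations at lags 1, 5, 10 and 15
myacf <- function(datain) {
  a1 <- acf(datain$y, plot = FALSE, lag.max = 15)
  a1$acf[c(2, 6, 11, 16), 1, 1]
}

# trend-based compression: loess fit evaluated at 10 equally spaced points
myint <- function(datain) {
  ll <- loess(y ~ x, data = datain)
  predict(ll, data.frame(x = seq(0, 1, length.out = 10)))
}

acf_vars   <- myacf(datain)   # 4 features per record
trend_vars <- myint(datain)   # 10 features per record
```

So each 1000-point series collapses into 14 numbers, which is what makes fitting a random forest on tens of thousands of records feasible at all.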

OOB estimate of error rate: 44.21%
Confusion matrix:
     0    1 class.error
0 2291 2496   0.5214122
1 1925 3287   0.3693400
The plot shows the variable importance. Besides the variables provided, the ACF seems important. Variables based on all time points seemed to work better than variables based on only the true part of the time series.

Sunday, December 21, 2014

Based on The DO loop, since I wanted a fractal Christmas tree and there is no point in inventing what has been made already. Besides, this is not the first time this year that I used R to do what has been done in SAS.

Sunday, December 14, 2014

When I downloaded the KNMI meteorological data, the intention was to do something which takes more than just the computer's memory. While it is clearly not big data, at the very least 100 years of daily data is not small either. So I took along a load of extra variables to see what trouble I would run into. I did not run out of memory, but I did make some figures.

Data

Data are acquired from KNMI. They have various sets of data; this page has a selection form which leads to the data used today. The data comes with a header explaining the details, unfortunately in Dutch.

Plots

Just about everybody knows days are shorter in winter. What I never realized is that even within that shorter day, a smaller fraction of the daylight is actual sunshine. The short days are often so clouded that we get no sun, while in summer the sun shines during a bigger part of the daylight period.

In real hours of sunshine this results in the following plot. December clearly has the darkest hours.

What we do get, not surprising since Dutch weather is fairly similar to English weather, is rain. Not continuous rain; most of the time it is dry, but still, autumn and winter do have days where it does not seem to stop. Autumn has the bad reputation for rain, but this plot makes winter look particularly bad.

All this rain gives a load of humidity. This humidity, in its turn, gives rise to weather we name 'waterkoud' (water-cold): it is above zero Celsius but still quite cold outside. The humid air has a high heat capacity, hence one cools down quickly. Temperatures below zero can make for much nicer weather, but they can hamper traffic quite a lot. Most of the time it just doesn't freeze.

Saturday, December 6, 2014

In exercise 61.1 the problem is that the model mixes badly. In the SAS manual the bad mixing is demonstrated, after which a modified distribution is used to fix the model.
In this post the same problem is tackled in R, using MCMCpack, RJags, RStan and LaplacesDemon. MCMCpack has quite some mixing problems; RStan seems to do best.

Data

To quote the SAS manual "This example shows how to analyze count data for calls to a technical support help line in the weeks immediately following a product release. (...) You can model the number of daily calls as a Poisson random variable, with the average number of calls modeled as a nonlinear function of the number of weeks that have elapsed since the product’s release. (...) During the first several weeks after a new product is released, the number of questions that technical support receives concerning the product increases in a sigmoidal fashion. The expression for the mean value in the classic Poisson regression involves the log link. There is some theoretical justification for this link, but with MCMC methodologies, you are not constrained to exploring only models that are computationally convenient. The number of calls to technical support tapers off after the initial release, so in this example you can use a logistic-type function to model the mean number of calls received weekly for the time period immediately following the initial release."

observed <- scan(text=
'1 0 1 2 2 2 2 1 3 1 3 3
4 5 4 8 5 5 5 9 6 17 6 9
7 24 7 16 8 23 8 27',
    what=list(integer(),integer()), sep=' ')
names(observed) <- c('weeks','calls')
observed <- as.data.frame(observed)
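The logistic-type mean the manual describes can be written down directly. Below is a sketch of that mean function; the values alpha = -4 and beta = 1 match the starting values used in the MCMCpack run below, while gamma = 30 is only an illustrative asymptote.

```r
# mean number of weekly calls, logistic in weeks since release:
# mu(week) = gamma / (1 + exp(-(alpha + beta * week)))
call_mean <- function(week, alpha, beta, gamma) {
  gamma / (1 + exp(-(alpha + beta * week)))
}

# illustrative parameter values; gamma is the saturation level
mu <- call_mean(1:8, alpha = -4, beta = 1, gamma = 30)
```

The mean rises in a sigmoidal fashion with week and is bounded above by gamma, which is what replaces the usual log link here.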

Analysis

MCMCpack

The MCMCpack solution is derived from the LaplacesDemon solution below. It is placed first because it shows some of the problems with the mixing. As a change from LaplacesDemon, gamma can go negative here, for which I return a -Inf log-posterior.

library(MCMCpack)
MCMCfun <- function(parm) {
  names(parm) <- c('alpha','beta','gamma')
  if (parm['gamma'] < 0) return(-Inf)
  mu <- parm['gamma'] *
    LaplacesDemon::invlogit(parm['alpha'] + parm['beta'] * observed$weeks)
  LL <- sum(dpois(observed$calls, mu, log = TRUE))
  LP <- LL +
    dgamma(parm['gamma'], shape = 3.4, scale = 12, log = TRUE) +
    dnorm(parm['alpha'], -5, sd = .25, log = TRUE) +
    dnorm(parm['beta'], 0.75, .5, log = TRUE)
  return(LP)
}
mcmcout <- MCMCmetrop1R(MCMCfun, c(alpha = -4, beta = 1, gamma = 2))
The figures show bad mixing. The parameters, especially beta and gamma, get stuck, and there is quite some autocorrelation.

plot(mcmcout)
acf(mcmcout)

The cause is a nasty correlation between beta and gamma.

pairs(as.data.frame(mcmcout))
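Why correlated parameters make a random-walk Metropolis sampler (as used by MCMCmetrop1R) sticky can be shown in isolation. The toy sampler below is not the model above; it targets a strongly correlated bivariate normal with an isotropic proposal, and the lag-1 autocorrelation of the resulting chain is very high.

```r
set.seed(1)
rho <- 0.99  # strong correlation, mimicking beta vs gamma

# log density (up to a constant) of a bivariate normal with
# unit variances and correlation rho
logtarget <- function(p) {
  -(p[1]^2 - 2 * rho * p[1] * p[2] + p[2]^2) / (2 * (1 - rho^2))
}

n <- 5000
chain <- matrix(0, n, 2)
cur <- c(0, 0)
for (i in 1:n) {
  prop <- cur + rnorm(2, sd = 0.5)          # isotropic random-walk proposal
  if (log(runif(1)) < logtarget(prop) - logtarget(cur)) cur <- prop
  chain[i, ] <- cur
}

# lag-1 autocorrelation of the first coordinate: close to 1
ac1 <- acf(chain[, 1], plot = FALSE)$acf[2]
```

The isotropic proposal mostly steps off the narrow ridge of the target and gets rejected, so the chain crawls along the ridge. Reparameterizing to decorrelate the parameters, which is essentially what the SAS manual's fix does, removes the ridge.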

LaplacesDemon

I chose an adaptive algorithm for LaplacesDemon. For the specs, the numbers are derived from the standard deviations of a previous run. The step size stays reasonably constant through the latter part of the run. The samples look much better than MCMCpack's, although the mixing is not ideal either.
At a later stage I tried the same analysis with a reflective slice sampler; however, that was quite a bit more difficult and the end result was no better than this.

Jags

I do not think using one chain is advisable, especially since JAGS makes running multiple chains so easy. But in the spirit of this analysis I am using one. Coda plots are used since they are a bit more compact.

library(R2jags)
datain <- list(
  calls = observed$calls,
  weeks = observed$weeks,
  n = nrow(observed))
parameters <- c('alpha','beta','gamma')
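The post does not show the JAGS model file itself. A sketch consistent with the log-posterior used in the MCMCpack function above would look as follows; note that JAGS parameterizes dnorm with a precision (1/sd^2) rather than a standard deviation, and dgamma with a rate (1/scale), so the numbers below are converted from the priors above.

```
model {
  for (i in 1:n) {
    calls[i] ~ dpois(mu[i])
    mu[i] <- gamma / (1 + exp(-(alpha + beta * weeks[i])))
  }
  alpha ~ dnorm(-5, 16)         # sd 0.25  -> precision 1/0.25^2 = 16
  beta  ~ dnorm(0.75, 4)        # sd 0.5   -> precision 1/0.5^2  = 4
  gamma ~ dgamma(3.4, 0.0833)   # shape 3.4, rate 1/scale = 1/12
}
```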
