Sunday, August 31, 2014

I got some comments on the priors in last week's post, in which better priors were proposed. Hence this week's post looks at them. After some minor adaptations, since I have nine beta parameters and beta needs to be fairly high, I can conclude that these priors provide significant improvements. Especially the model with p(alpha,beta) proportional to (alpha + beta)^(-5/2) worked extremely well.

Priors

There were two proposals. A common step was 'parameterizing in terms of mean theta = alpha / (alpha + beta) and total count kappa = alpha + beta'. In my own data there are nine regions, hence nine sets of theta, kappa, alpha and beta. As hyperprior over the nine regions' incidence I chose a normal distribution for theta. The difference between the proposals is in kappa. I made no hyperpriors over the nine values of kappa.
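In R, the mapping back to the beta parameters can be sketched in a few lines (the function name is mine, not from the original proposals):

```r
# Convert the mean/count parameterization back to beta parameters:
# alpha = theta * kappa, beta = (1 - theta) * kappa
theta_kappa_to_ab <- function(theta, kappa) {
  list(alpha = theta * kappa, beta = (1 - theta) * kappa)
}

ab <- theta_kappa_to_ab(theta = 0.2, kappa = 50)
# the mean of Beta(alpha, beta) recovers theta
ab$alpha / (ab$alpha + ab$beta)  # 0.2
```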

Prior: Uniform on log scale

'Andrew (...) suggests taking the priors on alpha and beta to be uniform on the log scale.' In code this means log kappa is uniformly distributed.
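As a quick illustration (my own sketch, using the uniform(0,18) range that appears later in this post), kappa then covers many orders of magnitude:

```r
# Sampling kappa when log(kappa) is uniform on (0, 18):
# kappa spans everything from about 1 to exp(18), roughly 6.6e7
set.seed(42)
kappa_log <- runif(1000, 0, 18)
kappa <- exp(kappa_log)
range(kappa)
```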

Prior: p(alpha,beta) proportional to (alpha + beta)^(-5/2)

In Bayesian Data Analysis 3 it is proposed to make p(alpha,beta) proportional to (alpha + beta)^(-5/2). This is handled via a Pareto distribution on kappa.
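Why a Pareto? A Pareto(ymin, shape) density is proportional to y^-(shape + 1), so shape 3/2 gives exactly the kappa^(-5/2) behaviour. A quick proportionality check (the density is written out by hand here, not taken from a package):

```r
# Pareto(ymin, shape) density: shape * ymin^shape * y^-(shape + 1)
dpareto <- function(y, ymin, shape) shape * ymin^shape * y^(-(shape + 1))

# With shape = 1.5 the density ratio matches y^(-5/2) exactly
r_density <- dpareto(20, ymin = 0.1, shape = 1.5) / dpareto(10, ymin = 0.1, shape = 1.5)
r_power   <- 20^(-5/2) / 10^(-5/2)
all.equal(r_density, r_power)  # TRUE
```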

Sampling kappa priors

Using a small program I had a look at what this means in terms of samples. The uniform log scale seemed reasonable, even for the range of my beta parameters. On the other hand, the p(alpha,beta) proportional to (alpha + beta)^(-5/2) proposal kept kappa way too low for the data at hand. This seems informative for a different range of kappa, and hence beta, rather than uninformative. Hence I made a variation with a much higher lower bound.

modelk <- '
parameters {
  real kappa_log1;
  real kappa2;
  real<lower=10^6> kappa3;
}
transformed parameters {
  real kappa1;
  kappa1 <- exp(kappa_log1);
}
model {
  kappa_log1 ~ uniform(0,18);
  kappa2 ~ pareto(0.1,1.5);
  kappa3 ~ pareto(10^6,1.5);
}'
parametersk <- c('kappa1','kappa2','kappa3')
initsk <- function() {
  list(kappa_log1 = runif(1,0,18),
       kappa2 = runif(1,.2,10),
       kappa3 = runif(1,2e6,1e7))
}

Sunday, August 24, 2014

During the last year I have been running some estimations in both JAGS and Stan. In that period I have seen one example where JAGS could not get me decent samples (in the sense of low Rhat and a high number of effective samples), but that was data which I could not blog about. When, two weeks ago, I had a problem where part of my model did not converge well in JAGS, I wondered how Stan would fare. Hence this post. It appears that Stan did not really do much better. What did appear is that results in this kind of difficult problem can vary depending on the inits and the random samples used in the chain. This probably means more samples help, but that is not the topic of this post.

Programs

In effect I expect most readers of this blog to know about both Stan and JAGS, but a few lines about them seem not amiss. Stan and JAGS can be used for the same kind of problems, but they are quite different. JAGS is a variation on BUGS, similar to WinBUGS and OpenBUGS, where a model just states relations between variables. Stan, on the other hand, is a program where a model has clearly defined parts and the order of statements matters. Stan is compiled, which takes some time by itself. Both Stan and BUGS can be run by themselves, but I find it most convenient to run them from R. R is then used for pre-processing data, setting up the model and finally summarizing the samples. Because JAGS and Stan are so different, they need completely different numbers of MCMC samples. Stan is supposed to be more efficient, hence needing fewer samples to obtain a posterior result of similar quality.
From a model development point of view, JAGS (rjags, R2jags) is slightly better integrated in R than Stan (Rstan), mostly because JAGS models pretend to be R models, which means my editor will lend a hand, while Rstan has its model just in a text vector. In addition, JAGS has no compilation time. The plus of Stan, though, is its highly organized model code.

Models

The model describes the number of shootings per state, hierarchically under regions. This means there is a binomial probability of interest per state, with the state probabilities beta distributed within regions. The beta has uninformative priors. After some tweaking the models should be equivalent, which means the JAGS model is slightly different from previous posts. The number of samples chosen is 250000 with 100000 burn-in for JAGS, and 4000 with half burn-in for Stan. I have chosen ten chains. Usually I would use four, but since I suspected some chains to misbehave, I opted for a larger number. The inits were either around 1000, which means that a number of parameters have to shift quite a bit to get beta near 1 in 100000, or close to that distribution, which means the parameters mostly have to converge the regions and states to the correct values. In terms of model behavior I only look at the priors and hyperpriors. Especially a and b from the beta distribution (state level) are difficult to estimate, while their ratio and the state-level estimates are quite easy.
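The structure just described might be sketched in Stan roughly as follows. This is my own reconstruction, not the model from the post; the data names (count, pop, region) and the omission of explicit priors on a and b are assumptions:

```stan
data {
  int<lower=1> nstate;
  int<lower=1> nregion;
  int<lower=1,upper=nregion> region[nstate]; // region of each state
  int<lower=0> count[nstate];                // shootings per state
  int<lower=0> pop[nstate];                  // population per state
}
parameters {
  real<lower=0> a[nregion];            // beta parameters per region;
  real<lower=0> b[nregion];            // these are the hard ones to estimate
  real<lower=0,upper=1> p[nstate];     // state-level probability
}
model {
  for (i in 1:nstate) {
    p[i] ~ beta(a[region[i]], b[region[i]]);
    count[i] ~ binomial(pop[i], p[i]);
  }
}
```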

Results

What I expected to write here is that Stan coped a bit better, especially when the inits are a bit off, which is what happened in the first version of this post. But then I did an additional calculation and Stan got worse results too. So, part of the conclusion is that the results are very dependent on the inits and the random sampling in the MCMC chain.

Speed

In terms of speed, Stan has the property that different chains have markedly different speeds. One chain can take 90 seconds while the next takes 250 seconds. In JAGS the progress of individual chains is not displayed, so no information there.
In general speeds were about the same, 1000 to 1800 seconds. If that seems long, this was on a Celeron processor with one core used. MCMC chains are embarrassingly parallel, so gains can be made easily.
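Since chains are independent, running them in parallel is straightforward. A minimal sketch with the base parallel package (the chain function here is a stand-in toy, not the actual model):

```r
library(parallel)

# Stand-in for one MCMC chain: returns 'draws' from a toy sampler
run_chain <- function(seed) {
  set.seed(seed)
  rnorm(1000)
}

# mclapply forks one process per chain on unix-alikes;
# on Windows it only works with mc.cores = 1
cores <- if (.Platform$OS.type == "unix") 2 else 1
chains <- mclapply(1:4, run_chain, mc.cores = cores)
length(chains)  # 4
```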

Gelman diagnostic

This is the diagnostic calculated in coda, so the diagnostics are comparable. There are eight series of values in the figure. Each pair is for one model, where coda actually gives both the point estimate and the 95% upper limit of the estimate. 'Smart' refers to the better inits, '1000' to inits around 1000. The x axis refers to the parameters estimated. There is something odd at variables 19, 20 and 28, 29. Just prior to posting I discovered that some variables, especially aa and bb, are sorted differently in Stan than in JAGS. In hindsight I should have put my parameters in alphabetical order.
The plot shows that Stan actually does a bit worse than JAGS, especially with the inits which should have made correct results easier.

Effective number of samples

Again the plot is made using calculations in coda, so the numbers are comparable. In number of effective samples it seems Stan is doing a bit better for the more difficult parameters. For the easy parameters JAGS is a bit better; here the large number of samples for JAGS pays off.

Conclusion

Neither JAGS nor Stan came out clearly on top, which is not what I expected. Nevertheless, it still seems that JAGS is my tool for simple models, while Stan is the choice for more complex models. Conversion from JAGS to Stan was not difficult.

Sunday, August 17, 2014

I was reading the Julia documentation the other day. They do speed comparisons to other languages. Obviously R does not come out very well. The R code for quicksort is here, and I noticed it was not vectorized at all. So I wondered if it could be improved. A quick check on Wikipedia showed that the algorithm displayed there is the one used by the Julia team. By abandoning this structure and using a just-in-time compiler some extra speed can be achieved. An additional bonus is that the algorithm actually got simpler. It should be noted (Wikipedia) that 'Quicksort can be implemented with an in-place partitioning algorithm, so the entire sort can be done with only O(log n) additional space'. Such an implementation was used by the Julia team; however, profiling seems to show that this limited additional space was not achieved by the R implementation.

Algorithm

To quote wikipedia again:

Pick an element, called a pivot, from the array.

Reorder the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.

Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.
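The variations below are compared to wfqs1(), which is not shown in this excerpt; a minimal R version following the three steps above might look like this (the name wfqs1 and the handling of ties are my assumptions):

```r
wfqs1 <- function(x) {
  if (length(x) < 2) return(x)          # nothing left to sort
  pivot <- x[sample(length(x), 1)]      # step 1: pick a pivot
  left  <- x[x < pivot]                 # step 2: partition around the pivot
  right <- x[x > pivot]
  mid   <- x[x == pivot]                # ties end up with the pivot
  c(wfqs1(left), mid, wfqs1(right))     # step 3: recurse on both sides
}

wfqs1(c(3, 1, 4, 1, 5, 9, 2, 6))  # 1 1 2 3 4 5 6 9
```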

Variations in implementation

After coding the first version, I tried to make faster variations. The first three versions are intended to have fewer comparisons than wfqs1(). The last one is intended to build a bit more vectorization into qsort().

wfqs2 <- function(x) {
  if (length(x) < 2) return(x)
  ipivot <- sample(length(x), 1)
  pivot <- x[ipivot]
  c(wfqs2(x[x < pivot]), pivot, wfqs2(x[-ipivot][x[-ipivot] >= pivot]))
}

wfqs3 <- function(x) {
  if (length(x) < 2) return(x)
  ipivot <- sample(length(x), 1)
  pivot <- x[ipivot]
  split <- x < pivot
  split2 <- !split
  split2[ipivot] <- FALSE
  c(wfqs3(x[split]), pivot, wfqs3(x[split2]))
}

wfqs4 <- function(x) {
  if (length(x) < 2) return(x)
  split <- x < x[1]
  c(wfqs4(x[split]), x[1], wfqs4(x[!split][-1]))
}

wfqsx <- function(a) {
  qsort_kernel <- function(lo, hi) {
    if (lo >= hi) return()
    if (hi - lo == 1) {
      # two elements left: swap if out of order
      if (a[lo] > a[hi]) {
        t <- a[lo]
        a[lo] <<- a[hi]
        a[hi] <<- t
      }
      return()
    }
    # partition around a[lo]: find elements misplaced on each side and swap them
    goUp <- a[(lo + 1):hi] > a[lo]
    up <- which(goUp)
    do <- which(!goUp)
    pivottarget <- lo + length(do)
    up <- up[up <= length(do)] + lo
    do <- do[do > length(do)] + lo
    t <- a[do]
    a[do] <<- a[up]
    a[up] <<- t
    # move the pivot into its final position
    t <- a[pivottarget]
    a[pivottarget] <<- a[lo]
    a[lo] <<- t
    qsort_kernel(lo, pivottarget - 1)
    qsort_kernel(pivottarget + 1, hi)
  }
  qsort_kernel(1, length(a))
  return(a)
}

Analysis

Speed without JIT

The first figure shows that sorting time is roughly proportional to the number of items to be sorted. The exception is base::sort(), where up to around 100 items there is no size effect. Even though not very pronounced on this scale, qsort() is slowest and wfqs4() is fastest after base::sort().

The second figure shows speed relative to base::sort(). This shows that qsort() takes approximately six times as much time as wfqs4().

Profiling

Profiling seems to show that time is spent simply because it is spent: most of it is not spent in any function which Rprof() registers. Regarding memory usage, if mem.total is anything to go by, qsort() uses quite a bit more than wfqs4().

##### qsort #############################################
$by.self
               self.time self.pct total.time total.pct mem.total
"qsort_kernel"      6.72    86.15       7.80    100.00    1671.4
"<GC>"              0.68     8.72       0.68      8.72     112.2
"+"                 0.16     2.05       0.16      2.05      31.5
"<"                 0.12     1.54       0.12      1.54      21.5
"-"                 0.06     0.77       0.06      0.77       5.4
"floor"             0.04     0.51       0.04      0.51      10.2
"<="                0.02     0.26       0.02      0.26       5.3

$by.total
               total.time total.pct mem.total self.time self.pct
"qsort_kernel"       7.80    100.00    1671.4      6.72    86.15
"<Anonymous>"        7.80    100.00    1671.4      0.00     0.00
"eval"               7.80    100.00    1671.4      0.00     0.00
"qsort"              7.80    100.00    1671.4      0.00     0.00
"<GC>"               0.68      8.72     112.2      0.68     8.72
"+"                  0.16      2.05      31.5      0.16     2.05
"<"                  0.12      1.54      21.5      0.12     1.54
"-"                  0.06      0.77       5.4      0.06     0.77
"floor"              0.04      0.51      10.2      0.04     0.51
"<="                 0.02      0.26       5.3      0.02     0.26

$sample.interval
[1] 0.02

$sampling.time
[1] 7.8

##### wfqs4 ###########################################
$by.self
            self.time self.pct total.time total.pct mem.total
"wfqs4"          0.98    75.38       1.30    100.00     258.0
"<"              0.10     7.69       0.12      9.23      26.2
"<GC>"           0.08     6.15       0.08      6.15      11.8
"c"              0.06     4.62       0.08      6.15      12.0
"!"              0.04     3.08       0.04      3.08       9.0
"-"              0.02     1.54       0.02      1.54       4.6
".External"      0.02     1.54       0.02      1.54       0.0

$by.total
              total.time total.pct mem.total self.time self.pct
"wfqs4"             1.30    100.00     258.0      0.98    75.38
"<Anonymous>"       1.30    100.00     258.0      0.00     0.00
"eval"              1.30    100.00     258.0      0.00     0.00
"<"                 0.12      9.23      26.2      0.10     7.69
"<GC>"              0.08      6.15      11.8      0.08     6.15
"c"                 0.08      6.15      12.0      0.06     4.62
"!"                 0.04      3.08       9.0      0.04     3.08
"-"                 0.02      1.54       4.6      0.02     1.54
".External"         0.02      1.54       0.0      0.02     1.54
"runif"             0.02      1.54       0.0      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.3

JIT compiling

The JIT is a great equalizer. Everything profits, with the exception of base::sort(). It is kind of obvious that base::sort() does not profit from the JIT; the part that does the work is already machine code. Among the R implementations, the various versions are much closer to each other: small refinements are apparently swamped by the JIT. Nevertheless, wfqs4() is still fastest after base::sort(). It takes half the time of qsort().
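Concretely, the JIT can be switched on for a single function with compiler::cmpfun(), or globally with compiler::enableJIT(). A sketch using the wfqs4() definition from above (note that the recursive calls inside still resolve to whatever wfqs4 names at call time):

```r
library(compiler)

wfqs4 <- function(x) {
  if (length(x) < 2) return(x)
  split <- x < x[1]
  c(wfqs4(x[split]), x[1], wfqs4(x[!split][-1]))
}

# Byte-compile the function; results are identical, only speed changes.
# enableJIT(3) would instead compile all functions on first use.
wfqs4c <- cmpfun(wfqs4)

identical(wfqs4c(c(3, 1, 2)), c(1, 2, 3))  # TRUE
```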

Discussion

It cannot be stated enough: when programming in R, vectorize. Vectors make compact code. Vectors are efficient; a vectorized calculation which does more actual computations can beat a complex loop which avoids computations. The speed effect is reduced when the JIT is used. Starting the JIT should therefore be the first step when an R algorithm is too slow. The second step is starting the profiler and looking for (vectorized) optimizations. When these two fail, it is time to roll out the big guns: (inline) C code, a faster computer, or switching to a faster language altogether.
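As a small illustration of the point (my own toy example, not from the benchmarks above): counting elements below a threshold with a per-element loop versus one vectorized comparison:

```r
x <- runif(1e5)

# Loop version: interpreter overhead on every single element
count_loop <- function(x, lim) {
  n <- 0
  for (xi in x) if (xi < lim) n <- n + 1
  n
}

# Vectorized version: one comparison over the whole vector, then sum
count_vec <- function(x, lim) sum(x < lim)

count_loop(x, 0.5) == count_vec(x, 0.5)  # TRUE
```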

Wiekvoet

Wiekvoet is about R, JAGS, Stan, and any data I have an interest in. Topics include sensometrics, statistics, chemometrics and biostatistics. For comments or suggestions please email me at wiekvoet at xs4all dot nl.