GNU R vs Julia: is it only a matter of devectorization?

Recently I have read a post on benefits of code devectorization in Julia. The examples given there inspired me to perform my own devectorization exercise. I decided to use bootstrapping as a test ground. The results are quite interesting (and not so bad for GNU R).

The task is very simple (and typical):

generate 10000 elements sample from uniform distribution

1000 times perform bootstrap sample of the vector and calculate its standard deviation

So not counting longer Julia start up time – execution time is comparable.

Now we can try devectorizing Julia code:

function run()

ssize = 10000

nboot = 1000

x = rand(ssize)

y = Array(Float64,nboot)

bx = Array(Float64, ssize)

for i in 1:nboot

for j in 1:ssize

bx[j] = x[(rand(Uint32) % ssize) + 1]

end

y[i] = std(bx)

end

std(y)

end

for i in 1:4

print(“$(@elapsed run()) “)

end

# result: 0.072204955 0.082288181 0.072243955 0.073718137

We have over 4-times speedup. So devectorization works perfectly and Julia lives up to the promise of performance gain.

However, this is not a whole story. Remember that GNU R performs extra work behinds the scenes that standard Julia ignores – it handles missing values (NA) out of the box.

Actually we can tell Julia to take missing values into account via DataFrames package. Let us test vectorized and devectorized code.

Vectorized Julia with NAs:

using Distributions

using DataFrames

function run()

ssize = 10000

nboot = 1000

x = DataArray(rand(ssize))

y = DataArray(Float64,nboot)

for i in 1:nboot

y[i] = std(sample(x, ssize))

end

std(y)

end

for i in 1:4

print(“$(@elapsed run()) “)

end

# result: 1.320962821 1.305368346 1.250524339 1.272740111

So w are over 4 times slower than GNU R now. How about devectorized Julia with NAs? Let us try:

using DataFrames

function run()

ssize = 10000

nboot = 1000

x = DataArray(rand(ssize))

y = DataArray(Array(Float64,nboot))

bx = DataArray(Array(Float64, ssize))

for i in 1:nboot

for j in 1:ssize

bx[j] = x[(rand(Uint32) % ssize) + 1]

end

y[i] = std(bx)

end

std(y)

end

for i in 1:4

print(“$(@elapsed run()) “)

end

# result: 1.027682685 0.994636538 1.017903246 1.021942776

There is a speed up, but it is minor and we still are 3 times slower.

Probably Julia code could be tuned, but we can draw some preliminary conclusions (at least for current versions of GNU R and Julia).

If you can use default Julia data types in your analysis Julia should easily beat GNU R in performance, especially after devectorization. However if you require handling of missing values in your code and have to use DataFrames package in Julia you can expect GNU R to be quite well optimized.

PS.

While trying to get a grasp of Julia I have collected some notes on its syntax and features from R-programmers perspective. You can have a peek its preliminary version here.

Related

To leave a comment for the author, please follow the link and comment on his blog: R snippets.