Saturday, January 25, 2014

RenderScript in Android - the benchmark program

The previous post of this series tried to present argument, why it
is a good idea to consider implementing part of your code in
RenderScript, instead of Java or native code. This part presents a
benchmark that we will use to evaluate the technology. I
intentionally chose the benchmark not to be an image manipulation
algorithm or a convolution filter. I wanted a use case that
processes large amount of data. Sound processing yields itself as a
quite evident choice.

The algorithm I am going to present is quite naive and would require
a lot of refinement for commercial deployment. But my goal was
RenderScript benchmarking so the algorithm's tolerance to all kinds
of environmental disturbances was not relevant. We will do voice
fingerprinting that takes a voice sample array, compares this sample
to a reference sample and returns true or false depending on whether
the voice sample is sufficiently similar to the reference sample.
The algorithm should return true if the voice sample contains the
same word from the same speaker, false otherwise.

I implemented Dynamic Time-Warping (DTW) for the purpose. DTW
compares two data series assuming that the timing between the two
series may be different. For example the same speaker may say the
same word with slightly higher pitch. Therefore DTW calculates a so
called warping path, a pair of corresponding timestamps from series1
and series2. E.g. a part of a warping path may be (1,1),
(1,2),(2,3),(3,3) meaning that a[1] (data at index #1 from series1)
corresponds to b[1], a[1] corresponds to b[2], a[2] corresponds to
b[3] and a[3] corresponds to b[3]. Knowing the warping path and a
cost function, the distance between the two series can be
calculated. DTW finds the warping path with the lowest cumulative
cost (i.e. the sum of cost between the corresponding data items
along the warping path). A detailed introduction to the algorithm
can be found here.

This example program implements classic DTW calculating every
possible pairing between the data items of the two series. This
requires an NxM matrix where N is the size of series1 and M is the size
of series2. DTW(i,n) (element at position (i,n) in this matrix) is the
accumulated cost value (sum of the cost function values) of the optimal
warping path that leads through the position at i,n. (i.e. with assuming
that index i in series1 and index n in series2 correspond to each
other). The function calculating DTW(i,n) is thus:

DTW(i,n)=min(DTW(i-1,n),DTW(i-1,n-1),DTW(i,n-1))+c(i,n)

where c(i,n) is the cost between a[i] and b[n]. The example program uses

c(i,n)=abs(a[i]-b[n]))

Cells at the edge of the matrix are initialized like

DTW(i,0)=sum(c(i,0))

and

DTW(0,i)=sum(c(0,i))

Classic DTW can require a large amount of space. Take for example a
short voice sample. With a sampling rate of 44100Hz/5=8820Hz, even a
short word can generate easily a vector of 5000-8000 elements. If the
reference vector is of a similar size, the DTW matrix will consist of 25
million to 64 million elements, all of them integer (meaning 4 bytes
each). A data item of that size may easily exhaust the fixed maximum
process space allocated to Android applications. Therefore we do a
simple optimization. The output of our algorithm is the optimal distance
which will be found at DTW(M,N), the right-bottom element in the
matrix. In order to calculate elements in row i, we need only elements
in row i-1 plus the already calculated elements in row i. We can drop
row i-2 and calculate DTW(i,0) on the fly, when we get there. This will
reduce the memory requirement to 2*M data items instead of N*M.

The kernel function is non-linear, it is not so evident to break it down to sub-ranges (although FastDTW does precisely that).

Calculation of matrix cells depend on an intermediate result of the calculation and not only on the input values.

The example program is available here. The application exercises the DTW algorithm on speech samples.
Launch the application and record a reference sample. E.g. record a
word like "four". Then record a same or different word from the same
or different speaker and observe the distance that DTW calculates.
If the environment is not too noisy, you will find that same word
from the same speaker yields a (normalized) distance below 1000-1200
while other words from the same speaker or the same word from a
different speaker yields higher distance. The calculation takes a
bit of time because the DTW algorithm is executed twice, once in
Java and once in RenderScript. The application displays the execution
time for both implementations.

For now, you can observe that RenderScript execution times are 4-5 times
smaller than Java execution times even (Update:visit this post for an updated number. After optimization, RenderScript is still 2.3 faster). though the implementation does
not exploit any parallelization features. I will go deeper into this
observation in the next post.

4 comments:

Unfortunately, like I said extraordinary claims need extraordinary evidence and you didn't deliver with obviously unoptimized Java code. With a few minor tweaks without even getting really fancy the core Java method performing the calculation can be reduced to ~27.5% (3-4 times) the execution time of your original code.

I'm not sure if you intentionally tried to make the Java code slow, but you raised some serious doubts by commenting out the System.arraycopy and putting in a for loop to copy data between two arrays mind you once again inside of an existing loop! Also, the completely unnecessary "signalCost" method which just is a Math.abs() call is called 3 times (twice in a nested for loop). You should know that calling methods in tight loops is extremely bad form for performance.

Here is the modified method:

http://pastebin.com/mfmHKkCb

Since you mention the Renderscript code having a 4-5 times improvement over your poorly written Java example I posit that the ~8-10% improvement is simply not worth the loss of cross-platform compatibility and headache free maintenance. The code is shorter, clearer, doesn't require developers to bother with Renderscript and multiple files. If one wanted to improve upon it then evaluating the NDK / C++ for a native cross-platform solution is certainly worthy. There are ways of writing ones C++ code inline to Java code and this method would be a perfect example for such a solution.

Of course the parallel solution is the real avenue to explore. And multithreaded Java code is not going to be that far behind C++ and once again given the data set to be calculated one would implement a multithreaded parallel CPU version of the algorithm and where available OpenCL to greatly speed it up choosing which one depending on the data size to process. Do this once and you have an algorithm that runs cross-platform everywhere.

I am curious if you care to revisit your assumptions as otherwise it seems confirmation bias is in play at least for the moment.

With regards to the NDK: there's a long discussion between the GCC and Clang community, who generates faster code. The reality is that the two compilers are head-to-head when it comes to the efficiency of the generated code. Clang however offers much faster compilation time. NDK implementation is about as fast as RenderScript but it is much faster to write a RenderScript function. The reasons: Clang/LLVM is better integrated into the SDK toolchain (one reason is that Clang compiles so quickly) and the SDK automatically generates all the glue code while in case of NDK you have to fiddle with all the complexities of the JNI.

Indeed, the optimizations provided for the Java side are basic. Marking the method static will help slightly, but it must also be noted that the current execution environment of running the method once without any warm up is a bad micro test essentially. Things approach or best 70% (66-72% improvement) with a warm up before grabbing a result of the Java test. I haven't tested if loop unrolling and removing the Math. methods will help, but it should be noted that doing so would likely slow down a one time / single execution of the calculation method.

As far as JNI is concerned one actually doesn't need to fiddle with the glue code. Using a build tool like jnigen which is a standalone effort part of libgdx one can easily build and integrate inline to the Java code native methods and automatically generate all platform resources and not just Android. This solution trumps any serial Renderscript option since it's cross-platform.

I posit that a parallel CPU Java version will likely be twice as fast as the current Renderscript version. A parallel OpenCL version upwards of 5-10 times faster than Renderscript.

hello sir, my name is candra. i'm about to study mobile programming. i have to make speech to text apps that implemented dynamic time warping algorithm. could you please some give me advice about what i have to do first sir?if do , here's my email sir candra07dinata06@gmail.comthank you sir

About the blog

This blog is a personal diary about my adventures with the Google Android platform. I write it in the hope that others may find my experiences useful but please, beware. The blog is created as I gain experience about the platform myself so errors, omissions, etc. may be found in the entries.