We are looking for collegues

Speed up Android computations using RenderScript

For one of our customers, we were asked to write an application, which would be used by poultry farmers in African countries to compute an optimal feeding recipe.
The farmers would enter the ingredients available on the market and their price. From this, the app would take the nutritional information of each ingredient, and find a mix which would fulfill all nutritional requirements for the feed, and would be the cheapest option as well.

This is a linear programming optimization problem, where the constraints are defined by the nutritional requirements of the feed. These give a solution space of all possible mixes, and from this space of solutions we try to select one with the cheapest price.

Linear programming problems can be defined like this:
You have different products, and a limited set of resources.
You want to divide your resources in a way that they are used in the best possible way.

For example, a factory produces different products from a few raw materials, from which there is a limited stock. How many should the factory produce of each product from the available materials, to achieve the highest profit possible?

Although there do exist polynomial algorithms to solve these kind of problems, the most commonly used one is the simplex method, which is not polynomial though, but usually solves problems as efficiently as other polynomial ones.
For my initial solution, I used an older version of the Apache Commons Math library, which has a class called SimplexSolver for solving linear programming problems.
Its API is easy to use, and it does its job well. However, I found it to be quite slow. On older devices (note that the users of the app are in Africa), the first calculation takes 2–3 seconds.
Of course I can show a loading dialog until the computations finish in the background, but it would be even better if it would take less time.
By upgrading the library to the latest version, this time got reduced to around 100–200ms, which is many times better, and is not annoying anymore.

However, I asked myself: can it still be faster?
The algorithm in the library was optimized multiple times, so the observed speed is mostly due to it being written in Java.
I was curious how much faster it would be if I would rewrite it in a more low-level language, like C or C++.

On Android, for this I have two options:

Use the Native Development Kit (NDK): I would write my code in C++, compile it for each CPU architecture into a shared object, and then include those in my app.

Or use RenderScript: RenderScript uses a special C99-derived syntax, and is platform-independent. You don’t need the NDK, the script is compiled when building your application, and can use the GPU and CPU as well. However, the API is quite limited, and the code cannot be reused for say, iOS.

For my case, RenderScript fits more, since it is a lot easier to get started with. I was also curious how much performance gain I would get with it (if I would), especially on older phones.
With the RenderScript support library, I can run scripts even on API level 8, which covers our use-case.

Adding RenderScript support to my app

You don’t have to add any extra lines to your dependencies, it just works (assuming your build tools and support repository are up-to-date).

Next, I created two files in the directory src/main/rs/: simplex.rs and simplex.rsh.
The former will have the implementation, the latter will contain the exported variables and constants.

Communicating between Java and RenderScript code

The algorithm will accept the input constraints and objective function, and will output the optimal solution vector if one has been found.
However, it is not that easy to pass values between the two sides.
In theRenderScript code, you can not do any dynamic allocations, so the object for your return value has to be pre-allocated on the Java side.
But there are some nice features in RenderScript, which try to make our development easier: if you create global variables in the C code,
accessors to it will be generated in Java.

This even works with structs, in this case an entire class gets generated for it.So if I define the following struct in my simplex.rsh file:

I build the app, I will get two new classes: ScriptC_simplex and ScriptField_Tableau.

After creating an instance of the former, I can call functions defined in the RenderScript code, supply allocations, and get/set global parameters.
The latter is the Java counterpart of the struct I just defined in C. I can create an instance of it, and then give it to my script, where the values will be automagically set:

You may notice, that all setters have 3 parameters.

The first is the index, this is useful when the pointer points to an array, but in my case it is a struct, so it is 0 for all cases.

The second is the value to set.

The third and final is a boolean called copyNow. If I set this to true, the value will be immediately copied to the C struct. If false, you need to call copyAll() later, to synchronize all you uncopied changes at once.

Finally, I set the tableau I just created with the generated bind_tableau(…) method.

I can also define constants in the simplex.rsh file, which will generate a getter only:

There are multiple ways to start the script. You can create a special root() function, which could accept special input and output parameters.
But you can also export C functions, which then can be called from the Java side viathe generated method.
I did it this way, so I created a solve() function in the simplex.rs implementation file, and declared it in the header as well:

void solve();

Each of the declared methods will be available on the ScriptC_simplex class with an invoke_ prefix, so in my case, I can start the algorithm by calling:

script.invoke_solve();

Finally, I need some way to send the result back to the Java counterpart, when the algorithm has finished.
For this, I use “allocations”, which can be created from different built-in types like bitmaps, floats, ints, or even custom ones.
We need two return values: the solution vector (a float array), and the size of this vector (an integer).
In C, these allocations are rs_allocation types, and can be operated on with special methods of the RenderScript API:

rs_allocation solution_vector;
rs_allocation result_size;

But the memory allocation itself has to done in Java. I allocate the memory, and then bind the allocated memory to the specific rs_allocation type.
I create a custom type for both in Java and then create allocations from these types, which I then bind to the fields.
I start the algorithm (note: methods beginning with invoke_ run synchronously), and then read the results:

Now that we can communicate with the script, it is time to write the implementation.

Writing code in RenderScript

Writing code in RenderScript is like writing C code without any includes.
You can only use functions which are declared in the RenderScript API reference, which is only mathematical operations and functions to communicate with the Java side via messages and allocations.
RenderScript script files usually start with these lines:

The first line is the version of the runtime API used, currently ‘1’ is the only option here.
The second line is the package name where the Java code will be generated to when the script is compiled (so the package of ScriptC_simplex, ScriptField_Tableau, etc.).
The third line is the floating point precision. I did not experience a change in performance or precision when changing this, so I leave it on the default, full precision.
Finally, I include my header as I would do it in C.

Implementing the algorithm in RenderScript is not much different than I it would be writing it C.
I can use for loops, pointers, local variables, and local functions.
All functions that I use for internal purposes have to be marked static, otherwise RenderScript will think it is an exported one, and will try to generate code for it.

A major drawback is that I can not use printf(…), which would be helpful for debugging purposes.
Instead, the only method you can use, is rsDebug(), which writes to logcat.
It has 2 parameters: the message string (or char*, to be precise), and a value (which is mandatory for some reason).
For example, this code:

int answer = 2 * 7;
rsDebug("2 times 7 is: ", answer);

will write this message in logcat:

D/RenderScript: 2 times 7 is: 14 0xe

Another drawback is that I cannot use format strings, so the only way to write out values for debugging purposes is by using the last parameter of rsDebug().
I also noticed that if I log too many lines at once, some may not appear in logcat, and it also slowed the program down a bit.
So I created a LOG macro, which will only log if I set the DEBUG macro to true, otherwise it does nothing:

In the previous section, I mentioned that you need to use the special functions of the RenderScript API to operate on rs_allocation types.
To set the values of these, you need to use the methods starting with rsSetElementAt_.

Let’s say we have found the solution, and we want to write it to the allocations, to make it available in Java.
Here’s an example how I set first each element of the solution vector, and then finally the vector size:

The rsSetElementAt_… methods follow these conventions:
After the underscore, the method name contains the type of the allocation (which should be the same you used on the Java side when creating it).
The order of parameters also differs from the usual:

The first parameter is the rs_allocation field you want to operate on.

The second parameter is the value to set.

The third parameter is the index.

When the solution has been written, I return from the solve method.

Performance of the RenderScript code

When I completed the implementation of the simplex algorithm in RenderScript, I wasn’t sure at all if it would be faster.
There’s a lot of things going on when communicating between the Android and RenderScript framework, and it is not as low-level as C,
since the code still runs on top of the RenderScript Runtime Layer.

But the performance I observed was not expected at all.
On a recent Android device, these were the average execution times (I also did the test with the old version of the Apache Commons Math library to see how much it improved):

The results surprised me. The 10x speed increase was consistent on all devices, from older feature phones to octacore tablets, where it would finish usually in 2ms.

Conclusion

By reimplementing the core algorithm the app uses, I reduced waiting times from 80ms to 8ms.

Although in my case, it didn’t matter much, how much time the application needs to calculate the optimal solution.

But perhaps you have a case where porting code to RenderScript could make your app experience smoother?