Posted by Félix
on 2014-04-18Comments Off on Crashing competing media players on Android

In this article, I will show how easy it is to cripple or even crash a competing multimedia-oriented application on Android with only a few lines of code.

Summary (aka TLDR)

On Android, in order to do HW decoding, an application must communicate with a process that has access to the HW codecs. However this process is easy to crash if you send it a selected MKV file. It is possible to create an Android service that checks if a competing media player is currently running and then crash the media process, thus stopping HW decoding for this player. As a result, the media player app can also crash and your phone can even reboot.

Mediaserver

Presentation

I have been contributing to the open source projects VLC and VLC for Android in the last few months. I worked on the implementation of an efficient hardware acceleration module for our Android application using the MediaCodec and the OpenMAX APIs. Since your average Android application does not have sufficient permission to access /dev/* entries on your system, you must use a proxy called the mediaserver in order to communicate with HW codecs. The mediaserver works with an IPC mechanism, for instance if you want to perform HW accelerated decoding of an H264 file, you ask the mediaserver to create a decoder and you call methods on this decoder using IPC.
Internally, the mediaserver is using the OpenMAX IL (OMXIL) API to manipulate HW codecs. For HW decoding, userland applications can either use IOMX (OpenMAX over IPC) or the more recent and standardized API called MediaCodec (API level 16).

Strengths and weaknesses

Using HW accelerated video decoding offers many advantages: you use less battery and you can potentially get significantly better performance. My Nexus 5 phone can only decode 4K videos with HW acceleration. Using the MediaCodec API you can even remove memory copies introduced by the IPC mechanism of mediaserver: you can directly render the frames to an Android surface. Unfortunately, there are also disadvantages: HW codecs are usually less resilient and are more likely to fail if the file is slightly ill-formed. Also, I don't know any HW codec on Android able to handle the Hi10P H264 profile.
The mediaserver on Android also has its own weaknesses: since there is only one process running, if it crashes, all connected clients will suddenly receive an error with almost no chance of recovery. Unfortunately, the mediaserver source code has many calls to CHECK() which is basically assert() but not stripped from release builds. This call is sometimes used to check values that were just read from a file, see for instance the MatroskaExtractor file. When testing HW acceleration for VLC, we noticed random crashes during video playback, sometimes after 20 minutes of flawless HW decoding.

Mediacrasher

The mediaserver is both easy to crash and essential for all media players applications in order to use HW decoding. Therefore, I realized it would be easy to create a small application that detects if a competing media player is started and then crashes the mediaserver to stop HW decoding. Depending on the app, and on luck too, the media player can crash or decide to fallback to SW. If the video is too large for SW decoding, playback will stutter.
After a bit of testing on the impressive collection of buggy video samples we have at VideoLAN, I found one that deterministically crashes (on my Nexus 5 with 4.4.2, your mileage may vary) the mediaserver when using the MediaScanner API. The sample can be found here. Ironically, it seems to be a test file for the MKV container format:

The evil plan for crashing competing apps can now be achieved with only a few simple steps:
- Starts an Android service (code running in background even if the calling app is not in foreground).
- Frequently monitors which apps are running and looks for known media players like VLC or the Android stock player.
- Uses MediaScannerConnection on the CONT-4G.mkv file, the mediaserver crashes and the media player will immediately receives an error if it was using HW codecs.

I've developed a small test application, and it works fine. Of course, VLC does not use this wicked technique, since it's an open source project you can check yourself in the source code of VLC and VLC for Android.
This looks extreme, and I hope no application will ever use this technique on purpose, but I wanted to show the stability issues we currently encounter with mediaserver. Actually, some services are already regularly crashing the mediaserver on my device, for instance while thumbnailing or extracting metadata from the files on your device; but they don't do that on purpose, I think...

Code

The code is short, first is the main Activity, it simply starts our service:

Results

I tested my program on some media players available on Android and here are the results:
- VLC: short freeze and then shows a dialog informing the user that HW acceleration failed and asking him if he wants to restart with SW decoding.
- DicePlayer: freezes the player, force quit is needed.
- MX Player: silent fallback to SW decoding, but the transition is visible since the video goes back in time a few seconds.
- Archos Video Player: reboots your device.
- Play Films: reboots your device.
- Youtube: stops playback, the loading animation never stops and the video display turns black.

In this article we will present a simple code finding an optimal solution to the graph coloring problem using Integer Linear Programming (ILP). We used the GNU Linear Programming Kit (glpk) to solve the ILP problem.

Background

As a project assignment for school we recently had to implement an optimized MPI program given a undirected graph where the edges represent the communications that should take place between MPI processes. For instance this program could be used for a distributed algorithm working on a mesh where each MPI process work on a share of the whole mesh, the edges represent the exchange of boundaries conditions for each iteration.
For a given process, the order of communications with all the neighbours is highly correlated with the performance of the whole program. If all processes start by sending data to the same process, the network bandwidth to this process will be a bottleneck. Therefore it is important to find a good order for communications. Ideally, at each step one process should be involved in only one communication.

By coloring the edges of the graph (all adjacent edges must have different colors), we can find a good scheduling order for MPI communications. Edge coloring can be solved directly, but we have chosen to use the line graph (also called edge-to-vertex dual) instead of the original graph. Therefore we just have to color the vertices of the line graph instead of implementing an edge coloring algorithm.

Goal

Initially we implemented a greedy coloring algorithm using the Welsh-Powell heuristic. This algorithm is fast and generally yields fairly good results, but I was interested in getting the optimal solution. Remembering the course I had on linear programming and the research papers I read on register allocation, I decided to use integer linear programming to solve this problem. The proposed implementation is by no means optimized, my goal was to implement a simple but optimal solution using a linear programming library. The final code is approximately 100 lines of code and I think it can be an interesting example for developers that want to start using GLPK.

Input

We need an input format to describe our graph. We used a very simple file format: each line is a vertex, the first number on each line is the identifier of the vertex and the remaining numbers are the neighbours of this vertex. For instance, Wikipedia provides the following example for graph coloring:
Labelling the vertices from 0 to 5, we obtain the following file:

Before creating variables, we need an upper bound on the number of colors needed for our graph. Every graph can be colored with one more color than the maximum vertex degree, this will be our upper bound:

To set up the constraints we must build the sparse matrix of coefficients by creating triplets .
These triplets are scattered in 3 different vectors. We must insert one element at the beginning because GLPK starts at index 1 after loading the matrix:

And now the last set of constraints, this is a bit longer because we iterate on all edges. The graph is undirected but edges are duplicated in our file format, so we must ensure we do not add constraints twice:

Thoughts

To end this article, here are some important points:
- If the initial bounds for is tight, solving will be faster. For the lower bound you can use 2 instead of 1 if your graph is not edgeless. To find a better upper bound you can use a greedy coloring algorithm before using ILP to find the optimal solution.
- If you want to solve the problem faster, use another formulation using column generation.
- If you use C++ you might want to implement your own wrapper above GLPK in order to manipulate variables and constraints easily.

Update

After reading my article, my teacher Anne-Laurence Putz kindly gave me another formulation which is simpler and generally more efficient.
We delete variable and use the following objective function instead:

We assign a weight for each color, thus the solver will minimize the number of colors used when minimizing the objective function. We don't need the second constraint anymore therefore less rows are needed.

Vectorization

Vectorization is the task of converting a scalar code to a code using SSE instructions. The benefit is that a single instruction will process several elements at the same time (SIMD: Single Instruction, Multiple Data).

Auto-vectorization

Sometimes the compiler manages to transform the scalar code you've written into a vectorized code using SIMD instructions, this process is called auto-vectorization. For example you can read the documentation of GCC to check the status of its tree vectorizer: http://gcc.gnu.org/projects/tree-ssa/vectorization.html, several examples of vectorizable loops are given.

Manual Vectorization

There are a lot of tips to help auto-vectorization: for example rewriting the code to avoid some constructs, using builtins to help the compiler, etc.
But sometimes, despite all your efforts, the compiler cannot vectorize your code. If you really need a performance speedup then you will have to do vectorization by hand using the _mm_* functions. These functions are really low-level so this process can be quite hard and writing an efficient SSE code is difficult.

Vectorization of conditional code

If you are new to SSE, you might be wondering how conditional code should be vectorized. We will show in this article that this is in fact really easy. You only need to know boolean algebra.

Scalar code

We will attempt to use SSE instructions to speedup the following scalar code:

This example is simple: we have a (long) array of floats and we want to compute the square root of each value from these array. Given that a negative number has an imaginary square root, we will only compute the square root of positive values from the array.

Auto-vectorization attempt

Before using SSE instructions, let's see if GCC manages to vectorize our code, we compile the code using the following flags: -O3 -msse3 -ffast-math.
We also use the flag -ftree-vectorizer-verbose=10 to get feedback from the vectorizer on each analyzed loop. Here is the output for our function:

Auto-vectorization failed because of the if statement, we will have to vectorize the code by hand.
Note that if we remove the if (we compute the square root on all elements), then auto-vectorization is successful:

Branchless conditional using a mask

Load and store

Based on what I've shown in my SSE tutorial, we already know the skeleton of our algorithm: first we need to load 4 float values from memory to a SSE register, then we compute our square root and when we are done we store the result in memory. As a reminder, we will be using the _mm_load_ps function to load from memory to a SSE register and the _mm_store_ps to load from a register to memory (we assume memory accesses are 16-bytes aligned). Our skeleton is therefore as follows:

If the comparison is true for one pair of values from the operands, all corresponding bits of the result will be set, otherwise the bits are cleared.

Using the mask

The result of the comparison instruction can be used as a logical mask because all the 32 bits corresponding to a float value are set. We can use this mask to conditionally select some floats from a SSE register. In order to vectorize our code, we will use the cmpge instruction and compare 4 values with 0:

(Note: as we are comparing with zero, we could have just used the sign bit here).

The trick is now simple, we perform the square root operation as if there were no negative values:

__m128 sqrt= _mm_sqrt_ps(vec);

__m128 sqrt = _mm_sqrt_ps(vec);

If there were negative values, our sqrt variable is now filled with some NaN values. We simply extract the non-NaN values using a binary AND with our previous mask:

_mm_and_ps(mask, sqrt)

_mm_and_ps(mask, sqrt)

But, if there were negative values, some values in the resulting register are just 0. We need to copy the values from the original array. We just negate the mask and perform a binary AND with the register containing the original values (variable vec)

_mm_andnot_ps(mask, vec)

_mm_andnot_ps(mask, vec)

Now we just need to combine the two registers using a binary OR and that's it!
Here is the final code for this function:

Conclusion

Thanks to the comparison instructions available with SSE, we can vectorize a scalar code that used a if statement.
Our scalar code was very simple, but if it was at the end of a complicated computation, it would be tempting to perform the square root without SSE on the result vector. We have shown here that this computation can also be done using SSE, thus maximizing parallelism.

Posted by Félix
on 2012-08-13Comments Off on GDB - Debugging stripped binaries

A few days ago I had a discussion with a colleague on how to debug a stripped binary on linux with GDB.
Yesterday I also read an article from an ex-colleague at EPITA on debugging with the dmesg command.
I therefore decided to write my own article, here I will demonstrate how to use GDB with a stripped binary.

Symbols and debugging

With the previous information, GDB knows the bounds of all functions so we can ask the assembly code for any of them. For example thanks to nm we know there is a fun function in the code, we can ask its assembly code:

Discarding Symbols

Why stripping you may ask ? Well, the resulting binary is smaller which mean it uses less memory and therefore it probably executes faster. When applying this strategy system-wide, the responsiveness of the system will probably be better. You can check this by yourself: use nm on /bin/*: you won't find any symbols.

The problem

Okay, there are no more symbols now, what does it change when using GDB ?

As GDB does not know the bounds of the functions, it does not know which address range should be disassembled.

Once again, we will need to use a command working at a lower level.
We must use the examine (x) command on the address pointed by the Program Counter register, we ask a dump of the 14 next assembly instructions:

Libc initialization

By looking at the code, you might be asking yourself: "Where the hell are we??"
The C runtime has to do some initialization before calling our own main function, this is handled by the initialization routine __libc_start_main (check its prototype here).

Before calling this routine, arguments are pushed on the stack in reverse order (following the cdecl calling convention). The first argument of __libc_start_main is a pointer to our main function, so we now have the memory address corresponding to our code: 0x8048440. This is what we found with nm earlier!
Let's add a breakpoint on this address, continue and disassemble the code:

This looks like our main function, the value 21 (0x15) is placed on the stack and a function (the address corresponds to fun) is called.
Afterwards, the eax register is cleared because our main function returns 0.

Additional commands

To step to the next assembly instruction you can use the stepi command.
You can use print and set directly on registers:

Recently I have written articles on image processing and on SSE, so here is an article on image processing AND SSE!

SSE for Image Processing

SSE instructions are particularly adapted for image processing algorithms. For instance, using saturated arithmetic instructions, we can conveniently and efficiently add two grayscale images without worrying about overflow. We can easily implement saturated addition, subtraction and blending (weighted addition) using SSE operators. We can also use floating point SSE operations to speed-up algorithms like Hough Transform accumulation or image rotation.

In this article we will demonstrate how to implement a fast dilation algorithm using only a few SSE instructions. First we will present the dilation operator, afterwards we will present its implementation using SSE, and finally we will discuss about the performance of our implementation.

Dilation

Dilation is one of the two elementary operators of mathematical morphology. This operation expands the shape of the objects by assigning to each pixel the maximum value of all pixels from a structuring element. A structuring element is simply defined as a configuration of pixels (i.e. a shape) on which an origin is defined, called the center. The structuring element can be of any shape but generally we use a simple shape such as a square, a circle, a diamond, etc.

On a discrete grayscale image, the dilation of an image is computed by visiting all pixels; we assign to each pixel the maximum grayscale value from pixels in the structuring element. Dilation, along with its dual operation, erosion, forms the basis of mathematical morphology algorithms such as thinning, skeletonization, hit-or-miss, etc. A fast implementation of dilation and erosion would therefore benefit to all these operators.

Scalar Implementation

The scalar (without SSE) implementation on a grayscale image is done by using an array of offsets in the image as the neighborhood. For our implementation we have used 4-connexity as illustrated below:

The red pixel is the center, and the blue pixels forms the structuring element.

In order to avoid losing time when dealing with borders, each image has extra pixels to the left and to the right of each row. The number of additional rows depends on the width of the structuring element. There is also a full additional row before and after the image data. Thus, a dilation can be computed without having to handle borders in the main loop.

SSE Implementation

Using the SSE function _mm_max_epu8 it is possible to compute the maximum on 16 pairs of values of type unsigned char in one instruction. The principle remains the same, but instead of fetching a single value for each offset value, we fetch 16 adjacent pixels. We can then compute the dilation equivalent to 16 structuring elements using SSE instructions.

We need to perform 5 SSE loads for our structuring element, as illustrated in the following image (but with 4 pixels instead of 16). Each rectangle corresponds to a SSE load. By computing the minimum of each groups of pixel and storing the result in the destination image we have perform the dilation with 4 consecutive structuring elements.

Regarding the implementation, it is really verbose (as always with SSE!), but still short, we only need to change the inner loop.

We have used an aligned SSE load for the center and the pixels above and below the center. As mentioned in previous articles, using aligned load improves performances.

Performances

We have measured the performances of the SSE implementation against the scalar implementation on a Intel Core2 Duo P7350 (2 GHz). Two grayscale images of Lena were used.

2048 x 2048

4096 x 4096

Scalar

50 ms

196 ms

SSE

7.5 ms

31 ms

Using SSE, we managed to obtain a speedup factor of 6 compared to the scalar implementation!

Efficiency

Our algorithm runs pretty fast but it could be even faster! If we perform the dilation in-place, the code runs twice as fast as the previous implementation. If it's still not fast enough for your needs, you can also implement the algorithm directly in assembly using advanced techniques such as instruction pairing, prefetching... A good step-by-step tutorial on the subject can be found here for 4x4 matrix-vector multiplication.