Introduction

There are two types of applications for Android. The first is the Dalvik application, a Java*-based application that runs on any architecture without modification. The second is the NDK application, which has part of its code written in C/C++ or assembly and must be recompiled for each target CPU architecture.

This discussion focuses on NDK applications. For most of these, we only need to modify the APP_ABI parameter in Application.mk and recompile the NDK part, after which the application runs on the corresponding device. However, some NDK applications cannot simply be recompiled because they contain platform-specific code, such as assembly language code or single instruction, multiple data (SIMD) code. This article explains how to handle these issues and introduces some of the major considerations developers should know about when porting an app from a non-Intel® Architecture platform to an Intel® Architecture (x86) platform. We also discuss Endian conversion between x86 and non-x86 platforms.
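For a pure C/C++ NDK module, adding x86 to the ABI list is usually a one-line change (a sketch; project layout assumed):

```makefile
# Application.mk -- build the native code for both ARM and x86 ABIs
APP_ABI := armeabi-v7a x86
```

Setting APP_ABI := all builds every ABI the installed NDK supports.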

Single Instruction Multiple Data (SIMD)

SIMD is a class of parallel computers in Flynn's taxonomy: multiple processing elements perform the same operation on multiple data points simultaneously, exploiting data-level parallelism. SIMD is particularly applicable to common tasks such as adjusting the contrast of a digital image or the volume of digital audio, and most modern CPU designs include SIMD instructions to improve multimedia performance. On the x86 mobile platform, SIMD is provided by Intel® Streaming SIMD Extensions (Intel® SSE, SSE2, SSE3, etc.). On the ARM* platform, SIMD is called NEON* technology. If you need any background on NEON, refer to the manufacturer's documentation.

Intel® Streaming SIMD Extensions (Intel® SSE)

First, what is Intel SSE? Basically, it is a set of 128-bit CPU registers. Each register can be packed with four 32-bit scalars, allowing an operation to be performed on all four elements simultaneously; in regular assembly, the same work may take four or more instructions. In the following diagram, two vectors (Intel SSE registers) are packed with scalars, multiplied with MULPS, and the result is stored. That's four multiplications reduced to a single operation. The benefits of using Intel SSE are too significant to ignore.

Figure 1: Two vectors (Intel® SSE registers) packed with scalars
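The same four-at-once multiply can be written in C with SSE intrinsics (an illustrative sketch, not from the original article); the compiler emits a single MULPS for the packed multiply:

```c
#include <xmmintrin.h>  /* Intel SSE intrinsics */

/* Multiply two arrays of four floats with a single MULPS operation. */
void mul4(const float *a, const float *b, float *r)
{
    __m128 va = _mm_loadu_ps(a);           /* pack four scalars into an SSE register */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_mul_ps(va, vb));  /* MULPS: four multiplies in one operation */
}
```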

Now with the basic idea of Intel SSE in mind, let's take a look at some of the more common instructions.

Data Movement Instructions

MOVUPS

Move 128 bits of data to an SIMD register from memory or another SIMD register. Unaligned.

MOVAPS

Move 128 bits of data to an SIMD register from memory or another SIMD register. Aligned: the memory address must be 16-byte aligned.

MOVMSKPS

Move the sign bits of each of the four packed scalars to an x86 integer register.

MOVSS

Move 32 bits to an SIMD register from memory or SIMD register.
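The aligned/unaligned distinction maps directly onto the load/store intrinsics (a sketch): MOVAPS requires a 16-byte-aligned address, MOVUPS accepts any address.

```c
#include <xmmintrin.h>

/* Copy four floats from a possibly unaligned source to a
   16-byte-aligned destination. */
void copy4(const float *src, float *dst_aligned)
{
    __m128 v = _mm_loadu_ps(src);  /* MOVUPS: source may be unaligned   */
    _mm_store_ps(dst_aligned, v);  /* MOVAPS: address must be 16-aligned */
}
```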

Arithmetic Instructions

NOTE: The scalar version performs the operation on the first element only. The parallel version performs the operation on all of the elements in the register.

Parallel / Scalar

ADDPS, ADDSS - Adds operands

SUBPS, SUBSS - Subtracts operands

MULPS, MULSS - Multiplies operands

DIVPS, DIVSS - Divides operands

SQRTPS, SQRTSS - Square root of operand

MAXPS, MAXSS - Maximum of operands

MINPS, MINSS - Minimum of operands

RCPPS, RCPSS - Reciprocal of operand

RSQRTPS, RSQRTSS - Reciprocal of square root of operand
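The parallel/scalar distinction is easy to see with intrinsics (a sketch, not from the original article): _mm_add_ps maps to ADDPS and _mm_add_ss to ADDSS.

```c
#include <xmmintrin.h>

/* ADDPS adds all four lanes; ADDSS adds only lane 0 and passes the
   upper three lanes of the first operand through unchanged. */
void add_demo(float *parallel, float *scalar)
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
    _mm_storeu_ps(parallel, _mm_add_ps(a, b)); /* {11, 22, 33, 44} */
    _mm_storeu_ps(scalar,   _mm_add_ss(a, b)); /* {11,  2,  3,  4} */
}
```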

Comparison Instructions

Parallel / Scalar

CMPPS, CMPSS - Compares operands and returns all 1s or 0s
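For example (an illustrative sketch), CMPPS produces a per-element mask of all 1s or all 0s, which MOVMSKPS can collapse into an ordinary integer:

```c
#include <xmmintrin.h>

/* Return a 4-bit mask: bit i is set where a[i] > b[i]. */
int gt_mask(const float *a, const float *b)
{
    /* CMPPS: each element becomes all 1s (true) or all 0s (false) */
    __m128 m = _mm_cmpgt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    /* MOVMSKPS: collect the four sign bits into an integer register */
    return _mm_movemask_ps(m);
}
```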

Logical Instructions

ANDPS

Bitwise AND of operands

ANDNPS

Bitwise AND NOT of operands

ORPS

Bitwise OR of operands

XORPS

Bitwise XOR of operands
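A classic use of the logical operations (a sketch, not from the original article): clearing the sign bit of every element with one ANDPS gives a four-wide absolute value.

```c
#include <xmmintrin.h>
#include <emmintrin.h>  /* SSE2, for the integer mask constant */

/* Absolute value of four floats at once: AND away the sign bits. */
void abs4(const float *in, float *out)
{
    __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    _mm_storeu_ps(out, _mm_and_ps(_mm_loadu_ps(in), mask)); /* ANDPS */
}
```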

Shuffle Instructions

SHUFPS

Shuffle numbers from one operand to another or itself

UNPCKHPS

Unpack high order numbers to an SIMD register

UNPCKLPS

Unpack low order numbers to an SIMD register
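As a sketch (not from the original article), SHUFPS with an immediate selector can reverse a vector:

```c
#include <xmmintrin.h>

/* Reverse the order of four floats with one SHUFPS. */
void reverse4(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    /* _MM_SHUFFLE(3,2,1,0) is the identity; (0,1,2,3) reverses the lanes */
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}
```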

Other instructions not covered here include data conversion between x86 and MMX registers, cache control instructions, and state management instructions.

How to Convert NEON to Intel SSE

Intel SSE and NEON instructions do not translate 1:1. Although they are based on the same design principle, the implementations differ, so you have to translate each instruction one by one. As an example, consider the same function written twice, once with NEON and once with Intel SSE instructions.
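A minimal sketch of such a pair (not the article's original listings): the same four-float addition written once per instruction set, guarded by the preprocessor so it compiles on either target.

```c
#ifdef __ARM_NEON
#include <arm_neon.h>   /* ARM NEON intrinsics */
#else
#include <xmmintrin.h>  /* Intel SSE intrinsics */
#endif

/* Add two arrays of four floats; the instruction set differs,
   the logic is identical. */
void add4(const float *a, const float *b, float *r)
{
#ifdef __ARM_NEON
    vst1q_f32(r, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));            /* NEON VADD.F32 */
#else
    _mm_storeu_ps(r, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b))); /* Intel SSE ADDPS */
#endif
}
```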

For more information about converting NEON to Intel SSE, refer to the blog listed in the Reference section[1]. It provides a header file that can be used to automatically map NEON instructions to Intel SSE instructions.

Enable Assembly Language

Complex Instruction Set Computer (CISC) processors, like Intel's, have a rich instruction set capable of performing complex actions with a single instruction. Reduced Instruction Set Computer (RISC) processors, like ARM, aim instead for simpler, more generalized instructions and efficiency; they have a comparatively large number of general-purpose registers, and their data-processing instructions generally use three registers: one destination and two operands. If you need any background on ARM instructions/architecture, refer to the manufacturer's documentation.

Intel processors (386 and later) have eight 32-bit general-purpose registers, as shown in the following diagram. The register names are mostly historical. For example, EAX used to be called the accumulator because it was used by a number of arithmetic operations, and ECX was known as the counter because it was used to hold a loop index. While most of the registers have lost their special purposes in the modern instruction set, by convention two are reserved for special purposes: the stack pointer (ESP) and the base pointer (EBP).

For the EAX, EBX, ECX, and EDX registers, subsections can be used. For example, the least significant two bytes of EAX can be treated as a 16-bit register called AX; the least significant byte of AX is an 8-bit register called AL, and the most significant byte of AX is an 8-bit register called AH. These names all refer to the same physical register, so when a 2-byte quantity is placed into DX, the update affects DH, DL, and EDX. These sub-registers are mainly hold-overs from older, 16-bit versions of the instruction set, but they are sometimes convenient when dealing with data smaller than 32 bits (e.g., one-byte ASCII characters).
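The aliasing can be demonstrated with GCC-style inline assembly (a sketch; GCC or Clang on an x86 target assumed):

```c
#include <stdint.h>

/* Build 0x1234 in AX by writing AL and AH separately,
   then read the whole of EAX back. */
uint32_t subregister_demo(void)
{
    uint32_t eax_value;
    __asm__ volatile (
        "movl $0, %%eax\n\t"
        "movb $0x34, %%al\n\t"  /* least significant byte of AX */
        "movb $0x12, %%ah\n\t"  /* most significant byte of AX  */
        "movl %%eax, %0"
        : "=r"(eax_value)
        :
        : "eax");
    return eax_value;           /* 0x00001234 */
}
```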

Because of the differences between ARM and x86 assembly language, ARM ASM code cannot be used directly on x86 platforms. However, there are two ways for ARM ASM code to be used when porting an ARM Android application to x86:

1. Implement the same function with x86 assembly language.

2. Replace the ARM ASM code with equivalent C code.

In many open source projects, C code was rewritten in ASM to improve performance. Performance is much less of an issue in this case because modern processors are far more capable. Unlike the hand-written ASM, the C code implementing the same functions is usually retained in the source tree, so we can simply compile the C code for the x86 platform.

As an example, an ISV developed a game that used the Vorbis audio compression format, an open source library that contains some segments of ARM ASM code, so the ISV could not build it directly for x86 with the NDK. Instead of rewriting this section of code in x86 ASM, the ISV fell back to the C implementation, and the game ran smoothly on x86. The issue was resolved by disabling the macro _ARM_ASSEM_ and enabling the macro _LOW_ACCURACY_.

Endian Conversion in ARM and x86

In cross-platform projects, we often face an age-old problem: Endian conversion. If a file is generated on a little-Endian machine, the integer 255 may be stored like the following:

ff 00 00 00

But when it's read into memory, the value will be different on different platforms, and this causes a porting problem.
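One portable fix (a sketch; the article's original function is not shown in this excerpt) is to read integers byte by byte and assemble them in a defined order:

```c
#include <stdint.h>
#include <stdio.h>

/* Read a 32-bit little-Endian integer one byte at a time.
   The shifts define the byte order, so the host's Endianness
   never matters. */
uint32_t read_le32(FILE *f)
{
    uint8_t b[4];
    if (fread(b, 1, 4, f) != 4)
        return 0;  /* simplistic error handling for the sketch */
    return (uint32_t)b[0]         |
           ((uint32_t)b[1] << 8)  |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}
```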

A function that reads the file byte by byte has the advantage of working on both big- and little-Endian platforms. But it is inconsistent with the common way of reading a whole structure at once:

fread(&header, sizeof(struct MyFileHeader), 1, file);

If MyFileHeader contains a lot of integers, it will result in many separate read calls. That is not only cumbersome to code, but also slow due to the increased I/O operations. So I propose another method: leave the code unchanged and use several macros to post-process the data.
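The macros are not shown in this excerpt; a hedged sketch of what they might look like follows (names hypothetical). ENDIAN_CONVERSION would normally be defined by the build system only on platforms whose byte order differs from the file format's; it is defined unconditionally here just so the swap path is visible.

```c
#include <stdint.h>

#define ENDIAN_CONVERSION  /* for demonstration; normally set by the build */

#ifdef ENDIAN_CONVERSION
/* Reverse the byte order of a 32-bit value. */
#define SWAP32(x) ((uint32_t)(                  \
    (((uint32_t)(x) & 0x000000ffu) << 24) |     \
    (((uint32_t)(x) & 0x0000ff00u) <<  8) |     \
    (((uint32_t)(x) & 0x00ff0000u) >>  8) |     \
    (((uint32_t)(x) & 0xff000000u) >> 24)))
#else
#define SWAP32(x) (x)  /* compiles away: zero cost when no conversion is needed */
#endif

/* After a single fread of the whole struct, post-process each field:
       header.version = SWAP32(header.version);                        */
```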

This approach has the advantages of not wasting any CPU cycles if ENDIAN_CONVERSION is not defined, and the code can be preserved in its natural form of reading a whole structure at a time.

Conclusion

ARM and x86 have different architectures and instruction sets, so there are differences in their low-level functionality. I hope the information in this article helps you address these differences when developing Android NDK apps for multiple platforms.

About the Author

Peng, Tao (tao.peng@intel.com) is a software apps engineer in the Intel Software and Services Group. He currently focuses on gaming and multimedia applications enabling and performance optimization, in particular on Android mobile platforms.