A: Modern CPUs generally expect fundamental types such as ints, longs and floats to be stored in memory at addresses that are multiples of their size.

CPUs are optimized for accessing memory aligned in this way.

Some CPUs:

allow unaligned access but at a performance penalty;

trap unaligned accesses to the operating system where they can either be ignored, simulated or reported as errors;

use unaligned addresses as a means of doing special operations during the load or store.

When a C compiler processes a structure declaration, it can:

add extra bytes between the fields to ensure that all fields requiring alignment are properly aligned;

ensure that instances of the structure as a whole are properly aligned. malloc() always returns pointers that are suitably aligned for the strictest fundamental machine type.

The C and C++ specifications leave the existence and nature of these padding bytes to the implementation. Each CPU/OS/compiler combination is therefore free to use whatever alignment and padding rules best suit its purposes, and programmers are not supposed to assume that any specific padding and alignment rules will be followed. The language defines no controls over padding (C11 and C++11 later added alignas, which can only request stricter alignment), although many compilers such as gcc have non-standard extensions that permit this.

Summary

Structure Alignment

Structure alignment may be defined as the choice of rules which determine when and where padding is inserted together with the optimizations which the compiler is able to effect in generated code.

Q: Why is this an issue for ARM systems?

A:

The early ARM processors had limited abilities to access memory that was not aligned on a word (four byte) boundary.

Current ARM processors (as opposed to the StrongARM) have less support for accessing halfword (short int) values.

The first compilers were designed for embedded system applications.

The compiler writers chose to allow the declaration of types shorter than a word (chars and shorts) but aligned all structures to a word boundary to increase performance when accessing these items.

These rules are acceptable under the C/C++ language specifications, but they differ from the rules used by virtually all other 32- and 64-bit microprocessors. Linux and its applications had never before been ported to a platform with these alignment rules, so the code contains latent defects where programmers have incorrectly assumed particular alignment rules. These defects surface when applications are ported to the ARM platform.

The Linux kernel itself contains these types of assumptions.

These latent defects can consequently lead to:

decreased performance;

corrupted data;

program crashes.

The exact effect depends on how the compiler and OS are configured as well as the nature of the defective code.

These defects may be fixed by:

changing the compiler's alignment rules to match those of other Linux platforms;

using an alignment trap to fix incorrectly aligned memory references;

finding and fixing all latent defects on a case-by-case basis.

The three alternatives are, to some extent, mutually exclusive. All of them have advantages and disadvantages and have been applied in the past so there is some experience with each although the correct solution depends on your goals (see below).

Q: How is this related to the alignment trap?

A: On the StrongARM processor, the OS can establish a trap to handle unaligned memory references. This is important because unaligned memory references are a frequent consequence of alignment defects, although they are not the only consequence.

Thus, some, but not all, alignment defects can be fixed within an alignment trap.

Furthermore, not every unaligned access indicates a defect. In particular, compilers for processors without halfword access will use unaligned accesses to efficiently load and store these values. If the alignment trap fixes these memory references, the program will produce incorrect results.

On the ARM and StrongARM, if you load a word from a non-aligned address and the alignment trap is not taken, you get the aligned word containing that address, rotated so that the byte at the requested address is in the least significant byte.

Consider:

Address: 0  1  2  3  4  5  6  7
Value:   10 21 66 23 ab 5e 9c 1d

Using *(unsigned long *)2 would give:

on x86: 0x5eab2366
on ARM: 0x21102366

An alignment trap can distinguish between kernel code and application code and do different things for each.

The basic choices for the alignment trap are:

It can be turned off. The unaligned access will then behave like unaligned accesses on other members of the ARM family without performance penalty.

It can "fixup" the access to simulate a processor that allows unaligned access.

It can fixup the access and generate a kernel message.

It can terminate the application or declare a kernel panic.

There is a significant performance penalty for fixing up unaligned memory references.

Q: Which compilers are affected?

A: The ARM port of GCC can be configured to either:

align all structures to a word boundary -- even those containing just chars and shorts (this is the way ARMLinux is distributed);

align structures based on the alignment constraints of the most strict structure member (this is the same alignment as is used on the x86);

follow other rules.

Switching between the first two options is a one-line change in the gcc or egcs source. With additional effort, the choice could be exposed as a compile-time parameter selecting the alignment rules to be used. Some other architectures already have such a flag, so these could serve as a model.

The compiler supplied with the ARM SDT defaults to align all structures on a word boundary. It has a "packed structure option" (-zas1) that changes alignment to match the x86 rules. In future, this option will be the default since:

word-alignment causes too much user trouble, and the performance/codesize improvement has never been proven (typically the affected structures are small, and generally not copied around a lot).

Q: What are the advantages of word structure alignment?

A: The advantages are as follows:

structure copies for structures containing only chars and shorts are much faster;

faster code can be generated for halfword access on pre-ARMv4 processors;

binary compatibility across all ARM processors;

binary compatibility with the original compilers;

compatibility with ARM Directives.

The overall performance impact on StrongARM processors is hotly debated.

Here are some typical responses:

most of the system would run faster

--unattributed

pretty much ANYTHING that you care about memory bandwidth and performance issues on will or could seriously be impacted by this.

--unattributed

Although in theory it produces faster code, in practice most code and thus the system will run a lot slower.

--unattributed

The performance impact on other processors is less debated, but there is not complete consensus there either.

The only way to resolve this debate is to measure the relative performance.

Q: What are the disadvantages of word alignment?

A: The disadvantages of word alignment are that:

it exposes latent defects. If the compiler aligned consistently with other Linux platforms, these defects would remain latent and the effort involved in porting Linux to the StrongARM would be reduced;

uncorrected defects can silently corrupt data;

uncorrected defects cause unreliability. Alignment defects that only show up under heavy load, under certain patterns of use, at particular optimization levels, or in certain configurations are particularly difficult to find and fix;

the corrected programs are "fragile". Subsequent code changes can create or expose new alignment defects;

it effectively decreases performance.

There is hot debate on both the number of Linux packages that have latent alignment defects and how difficult these defects will be to find and fix. Estimates of the magnitude of the problem include:

The only programs that I found that were violating this when I did the original port were very few and far between. I think it was in the order of 1 in 200. However, as of lately, maybe because of the commercialisation of the Internet, this figure appears to be increasing.

--unattributed

Generally, the defects I've found stick out like a sore thumb.

--unattributed

These problems are so severe that I'd be very surprised if any major Linux application runs reliably or can be made to run reliably without superhuman effort.

--unattributed

Unless other measures are taken, this debate will not be resolved until ARM distributions that align all structures are complete and widely deployed or the attempt is abandoned. Distributions that elect to not align all structures avoid the problem and thus never find out its magnitude in detail.

The alignment trap for application code can be used to produce an estimate of the problem magnitude earlier than this. Application code will execute unaligned memory references in the following circumstances:

it was compiled for an ARM processor and is using "legal unaligned load word instructions" to reference halfword and/or byte data;

the application has code that deliberately does unaligned memory references. This indicates that the application is not portable to a variety of platforms.

When the alignment trap is set to generate a count of traps from application code and code compiled for the StrongARM is run, then every trap signals the existence of a defect that needs to be fixed. If the problem magnitude is large, many messages/counts will be recorded. If the problems are rare or have already been fixed, the trap will be silent.

GCC generates unaligned load/store multiple instructions now and then too.

This picture has changed as more and more packages are updated to newer versions and compiled with newer compiler versions to the point that the number of traps has declined to about 1,000 per CPU minute even with X windows use.

Setting the alignment trap to produce messages or counts is obviously useful for debugging as well. However, it produces only an estimate of the magnitude because there are potential latent defects that will cause applications to fail without ever doing an unaligned memory reference.

The argument that aligned structures are effectively slower is based on three positions:

the fixes to alignment defects often result in slower code;

the alignment trap would be called less frequently if the compiler didn't align all structures;

code compiled for ARM processors will execute slower than code compiled specifically for the StrongARM.

Q: What is the magnitude of the porting problem?

A: At this point, several years of fixing alignment defects in Linux packages have reduced the problems in the most common packages.

Packages known to have had alignment defects are:

Linux kernel;

binutils;

cpio;

RPM;

Orbit (part of Gnome);

X Windows.

This list is very incomplete.

Q: Why can't we just change the compiler?

A: The problem with changing the compiler is one of compatibility and transition. A completely new distribution for the ARM or StrongARM could use whatever alignment rules meet its goals. However, there would be problems running binaries from other distributions. For commercial applications, this would split the ARM market in two and they would need to decide which distribution(s) to support.

Those familiar with UNIX history know the potential costs of these splits.

Since StrongARM binaries cannot be run on the ARM processors, this is the natural dividing point for this split. To some extent, this split has already occurred since many packages are being ported specifically for the StrongARM.

Changing alignment in a StrongARM distribution will therefore affect its ability to run ARM binaries.

From this perspective, the worst case is having two binary standards on the StrongARM processor for the same OS.

The upgrade from aligned to unaligned or vice versa is particularly tricky because of interdependencies between programs and shared libraries. When the upgrade is in progress, the system is really some kind of "mixed distribution". Also, local programs compiled before the upgrade need to be recompiled to ensure compatibility.

Q: What about mixed distributions?

A: It is possible to create header files for libraries and system calls that are independent of which alignment rules are used by the compiler and thus ensure binary compatibility between distributions even if different compilers are used. All of the distributions would need to standardize on these modified headers for this to work.

If these changes were in place, different applications could be compiled with different rules within the same distribution as the needs of the application itself dictate. Some people are going to be experimenting with alternatively configured compilers and will need to make at least a start on these changes in order to do this experimentation. Later in this FAQ is a list of the system header files that would be affected.

Q: What are some examples of code with problems?

A: All of the following examples are defective in ways that happen to work on most Linux platforms but fail under the ARMLinux distribution. The behaviour described is that of the ARMLinux distribution.

Example A

Suppose I'm doing something to a truecolour image in C++ (brightening it, for instance) and I have a pointer to the image in memory.

The Pixel structure will be padded with an extra byte at the end and will be aligned to a word boundary. Each ptr++ steps the pointer by four bytes instead of the intended three, so the image will be corrupted. If image happens to be aligned on a word boundary (a matter of chance), no unaligned memory references will be made.

If the loop is altered so that ptr is incremented by three bytes instead of four, the image may still be corrupted, depending on what brighten does and on the optimization level.

Example B

Each Unicode character consumes four bytes instead of the two it occupies on other platforms. In this case the only direct impact is benign (extra memory consumption).

Reading, writing, or copying Unicode strings based on this definition would, however, lead to problems.

Q: How do I find alignment problems in code from other platforms?

A: This section is fairly specific to ARMLinux application porting. Fixing all alignment problems, including those that may cause problems in future or on other platforms, is beyond the scope of this FAQ.

The gcc compiler for the ARMLinux distribution aligns all structures containing ints, longs, floats or pointers in the same way as gcc on x86 and other 32-bit platforms. The differences that may expose latent alignment defects all concern structures consisting entirely of chars and shorts, signed or unsigned. On ARMLinux, these are aligned to a word (4-byte) boundary. On other platforms they are aligned to a character boundary (i.e. effectively unaligned) if they contain only chars, and to a halfword boundary if they contain shorts, or shorts and chars.

In practice, structures of this nature are relatively rare, so this is a good place to start looking.

The uses of these structures that may cause problems are:

one of these structures is contained within another structure. This will generate additional interior padding in the containing structure unless the contained structure just happens to be already aligned. This interior padding will cause problems if the structure is read or written as a whole or aliased to another data structure;

an array of one of these structures or a containing structure is defined. These will be larger on ARMLinux unless the structure just happens to be a multiple of 4 bytes long. This is really just another type of internal padding and the same caveats apply;

a pointer to something else (usually char* or void*) is cast to a pointer to one of these structures, a containing structure or an array of one of these. The compiler expects that all pointers to one of these structures are aligned on a word boundary. If not, the generated code can incorrectly load field values and silently corrupt memory contents on stores. The exact behaviour depends on how the fields are accessed and optimizations made by the compiler;

sizeof() is taken on the structure, an instance of the structure, a containing structure or an array of one of these. The length returned will be different on ARMLinux unless the structure just happens to be a multiple of 4 bytes long. This isn't a problem in itself, but the program may use this value inappropriately;

sizeof() is assumed. Look for #defines and comments that mention the size of the structure, a containing structure or the array as these indicate potential problems;

in a mixed environment (using different compilers or a compiler switch), problems occur if a structure or pointer to a structure is passed between a caller and a called routine that have different alignment rules or different interior padding. In this case, the mere existence of structures that would be differently aligned or padded in a public header file is a defect.

Q: How do I fix alignment problems?

A: This really depends on your goals.

If you are concerned with the long term portability of the code, you will find and remove all expectations about padding and alignment from it. How to do this is beyond the scope of this FAQ.

If you want to port a package to ARMLinux with minimal code changes or you suspect alignment problems and want a quick test, using the gcc extension __attribute__((packed)) will help in many cases.

If you want to arrange the header files of a library for binary compatibility between different alignment settings on StrongARM compilers, use __attribute__((packed)), explicitly insert padding bytes, and/or force alignment with unions or zero length arrays.

Q: What about C++?

A: A C++ class is an extension of a struct and many of the same comments apply. In addition, inheritance and template classes introduce new ways of combining structures that can produce interior padding that differs between ARMLinux and x86 systems. Name mangling may or may not be affected. This makes the problems more difficult to identify from the source code. C++ programs in particular need to be free of all expectations about interior padding and alignment.