Simplest bare metal program for ARM

Bare metal programs run without an operating system beneath them. Coding on bare metal is a useful way to understand in depth how a hardware architecture works and what happens in the lowest levels of an operating system. I wanted to create a simple example of a bare metal program for ARM using free, open source tools. RealView Development Suite is the state of the art among ARM compilers, but it is expensive for hobbyists; CodeSourcery is a company that provides a free version of the GNU gcc toolchain for ARM cores.

In particular, the EABI toolchain must be downloaded from their download page; I fetched the IA32 GNU/Linux installer. The graphical installer places the tools in a sub-folder of the user’s home directory; this is fine if only one person uses the toolchain on that computer, otherwise it makes more sense to install it system-wide. The path to the toolchain binaries must be added to the PATH environment variable; the installer usually does this for you, but if it doesn’t, the standard installation path is “~/CodeSourcery/Sourcery_G++_Lite/bin”.

I created a C file called test.c containing the simplest C code I wanted to compile:

int c_entry() {
    return 0;
}

The classic printf("Hello world!\n"); example is more complex, because when coding on bare metal the standard input/output must be defined: it could be a physical serial port, for example. I called the function c_entry instead of main because in this example some things that are usually assumed to be true when the program reaches main are not implemented: for example, global variables with initializers in C code might not actually be initialized.

To compile this code into an object file (test.o) run the following command, very similar to compiling code with gcc:

$ arm-none-eabi-gcc -c -mcpu=arm926ej-s -g test.c -o test.o

The -mcpu flag indicates the processor for which the code is compiled. I wanted to target the ARM926EJ-S processor in this example for these reasons:

In order to create a bare metal program we must understand what the processor does when it is switched on. An ARM9 core begins executing code at a fixed address, which can be 0 (usually allocated to RAM) or 0xFFFF0000 (usually allocated to read-only memory). We must put some special code at that address: the interrupt vector table. It is a series of 32-bit instructions that are executed when something special happens: for example when the ARM core is reset, when the core fetches an unknown instruction that doesn’t belong to the ARM instruction set, or when a peripheral generates an interrupt (for instance, the serial port received a byte). The instructions in the interrupt vector table usually make the processor jump to the code that handles the event. The jump can be done with a branch instruction (B in ARM assembly) when the destination address is near.
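To make the layout of the vector table concrete, here is a small C sketch meant to be compiled on the development PC, not on the target. It lists the eight entries of the classic ARM exception vector table in order and computes the address of each entry for a given table base (0 or 0xFFFF0000). The names vector_name and vector_address are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries in the classic ARM exception vector table. */
#define ARM_VECTOR_COUNT 8

/* Entry names in table order; each entry holds one 32-bit instruction. */
static const char *const vector_names[ARM_VECTOR_COUNT] = {
    "Reset", "Undefined instruction", "Software interrupt (SWI)",
    "Prefetch abort", "Data abort", "Reserved", "IRQ", "FIQ"
};

/* Name of the exception served by entry `index` (hypothetical helper). */
static const char *vector_name(unsigned index)
{
    assert(index < ARM_VECTOR_COUNT);
    return vector_names[index];
}

/* Address of entry `index` for a table located at `base`
 * (0x00000000 for low vectors, 0xFFFF0000 for high vectors). */
static uint32_t vector_address(uint32_t base, unsigned index)
{
    assert(index < ARM_VECTOR_COUNT);
    return base + 4u * index;   /* entries are 4 bytes apart */
}
```

For example, vector_address(0, 6) gives the address the core jumps to on an IRQ when the table sits at address 0.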

I created an assembly file called startup.s containing the following code:

Line 2 exports the name _Reset to the linker in order to set the program entry point.

Lines 3 to 11 are the interrupt vector table, which contains a series of branches. The notation “B .” means that the code branches to itself and stays there forever, like an endless for(;;);

Line 14 initializes the stack pointer, which is necessary when calling C functions. The top of the stack (stack_top) will be defined during linking.

Line 15 calls the c_entry function, and saves the return address in the link register (lr).

To compile this code into an object file (startup.o) run the following command:

$ arm-none-eabi-as -mcpu=arm926ej-s -g startup.s -o startup.o

Now we have test.o and startup.o, which must be linked together to become a program. The linking process also defines the address where the program is going to be executed and declares the placement of its sections. To give this information to the linker, a linker script is used. I wrote this linker script, called test.ld, following a simple example in the linker manual:

The script tells the linker to place the INTERRUPT_VECTOR section at address 0, followed by the code (.text), the initialized data (.data) and the zero-initialized and uninitialized data (.bss). Lines 11 and 12 tell the linker to move 4 kB past the end of the useful sections and then place the stack_top symbol there. Since the stack grows downwards, the stack pointer must not leave its own 4 kB zone, otherwise it will corrupt the sections below it. Line 1 of the script also tells the linker that the entry point is at _Reset. To link the program, execute the following command:
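The arithmetic behind the stack placement can be sketched in C and tried on the development PC. The 4 kB size comes from the text above; rounding the end of the sections up to 8 bytes before reserving the stack is an assumption of this sketch, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SIZE 0x1000u   /* 4 kB of stack, as in the text */
#define STACK_ALIGN 8u       /* assumption: align the stack zone to 8 bytes */

/* Round addr up to a multiple of align (align must be a power of two). */
static uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Value of stack_top for sections that end at sections_end. */
static uint32_t stack_top_for(uint32_t sections_end)
{
    return align_up(sections_end, STACK_ALIGN) + STACK_SIZE;
}

/* Returns 1 while sp is still inside the reserved stack zone, 0 once it
 * has grown down into the sections below (or points above stack_top). */
static int sp_in_stack_zone(uint32_t sp, uint32_t sections_end)
{
    uint32_t bottom = align_up(sections_end, STACK_ALIGN);
    return sp >= bottom && sp <= bottom + STACK_SIZE;
}
```

With sections ending at 0x104, stack_top_for returns 0x1108, and a stack pointer that drops below 0x108 has overflowed into .bss.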

$ arm-none-eabi-ld -T test.ld test.o startup.o -o test.elf

This will generate an ELF binary for ARM that can be executed with a simulator, or it can be loaded inside a real ARM core on a hardware board; for simplicity we can use the Codesourcery version of the gdb debugger:

The target sim command tells the debugger to use its internal ARM simulator,

the load command fills the simulator memory with the binary code,

the debugger places a breakpoint at the beginning of the c_entry function,

the program is executed and stops at the breakpoint,

the program counter (pc register) of the ARM core is set to 0 to emulate a software reset,

the execution flow can be examined step-by-step in the debugger.

An easier way to debug is using the ddd graphical front-end with the following command:

$ ddd --debugger arm-none-eabi-gdb test.elf

This program is a starting point for developing more elaborate solutions. The next step I want to take is using QEMU as the development target: with it I can interact with some peripherals, even if emulated, and create bare metal embedded programs that are more useful in the “real world”, using only free open source software.

Hello Balau,
I’m trying to compile your code for a cortex-m3 environment, changing to -mcpu=cortex-m3 in the AS and GCC parameters, but for some reason it messes up the code (I used “xp /_iw addr” in qemu to watch it). It seems to compile in thumb2 mode all the time, but the startup needs to be in pure ARM, so it throws a segmentation fault when executed.
Do you know why these tools use thumb as default?
How can I fix it; just remove the parameter -mcpu on the startup?
Will I need some extra code to jump to thumb2 mode? (I need it because it’s useful as it saves flash memory)

Watch out: Cortex-M3 is very different from the ARM926 that I target in my example. Cortex-M3 supports ONLY Thumb2, even in startup code, and the vector table at the beginning of the code is also different. Check also this post for more info on Cortex-M3 compilation: Using CodeSourcery bare metal toolchain for Cortex-M3

You are right Balau, Cortex-M3 is quite different; I really like it more than older ARMv7. Just pure C code to make it run! I’ve made a simple test app and it was very easy to make it run. But I’ve found something strange: as far as I know, Cortex-M saves r0-r3, r12, PC, LR and PSR on an exception request. This automatically gives room to run exception routines. But, for some reason, my test app adds push-pop on every exception routine. A real example:

hi, i tried your example, and it is absolutely marvelous!
i found only one problem with my system, i use beagleboard, but nevertheless, everything compiled properly (i had a small problem with linking, but quick googling the error fixed it)
but, when i tried gdb, my version of arm-none-linux-gnueabi didn’t support sim option for the target argument, i got response:
“Undefined target command: “sim”. Try “help target”. ”
and i tried it, but there was no sim option listed. is this a problem with my CodeSourcery? should i have installed arm-none-eabi instead?
Thanks!

arm-none-linux-gnueabi does not support “target sim”, you need arm-none-eabi for that.
Alternatively you can use qemu-system-arm as a GDB server and then attach with arm-none-linux-gnueabi-gdb, it should work.
Anyway I suggest using arm-none-eabi for bare-metal programs; the Linux toolchain that you are using is intended to compile Linux user space programs.
If in the future you need to simulate Linux user space programs you can try “qemu-arm” instead.

The GDB program of the bare-metal toolchain, “arm-none-eabi-gdb”, supports the ARM simulator with “target sim”, but the GDB of other toolchains, such as “arm-none-linux-gnueabi-gdb”, does not support it. In that case you can use an emulator such as QEMU and attach to it with “target remote ...”

Thanks a lot for your tutorials.
I’m a student and I have a project to do on the ARM architecture, and your work is very useful.
So I have to design a very simple kernel.
Booting from RAM works fine.
But now I want to boot from Flash, so I’ve made some modifications to the ld script:

then put the .text and .rodata in Flash with “>flash”, the .data in the RAM with “>ram AT>flash” and the .bss in the RAM with “>ram”.

The initial code should take care of copying the .data section from Flash to RAM and zeroing the .bss section.
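As a minimal sketch of that initialization step, the following C function copies the initialized data from its load address to its run address and then clears .bss. It is written so that it can be tried on a PC with plain arrays; on the target, the five pointers would come from symbols that the linker script has to export, and neither those symbols nor the function name init_data_and_bss appear in the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy .data from its load address (Flash) to its run address (RAM),
 * then zero .bss. Must run before any C code that relies on globals. */
static void init_data_and_bss(const uint8_t *data_load,
                              uint8_t *data_start, uint8_t *data_end,
                              uint8_t *bss_start, uint8_t *bss_end)
{
    assert(data_start <= data_end);
    assert(bss_start <= bss_end);

    /* Byte-wise copy of the initial values of initialized globals. */
    size_t data_size = (size_t)(data_end - data_start);
    for (size_t i = 0; i < data_size; i++) {
        data_start[i] = data_load[i];
    }

    /* C expects zero-initialized and uninitialized globals to read as 0. */
    for (uint8_t *p = bss_start; p < bss_end; p++) {
        *p = 0;
    }
}
```

A real startup routine would typically do this in assembly or in very early C code, before calling the main entry point.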

Many things could go wrong in the boot procedure and in uploading code; I suggest using a debugger or, if you can’t, some code that turns on LEDs so you can tell how far the program has run.

I agree that -zmuldefs is not a good solution but merely a workaround. Maybe the problem is in the linker script. I’m quite sure somehow the startup.o file is linked twice.
Also, you can try to launch the linker with the “--verbose” and “-Map test.map” options to discover more information.

Hi,
I have gone through this post and tried to debug the bare metal program using the “arm-none-eabi” toolchain, but it fails to connect to the target simulator, saying undefined command “sim”. The “help target” list does not show a target sim option. Can you please tell me what I am missing?

Maybe the new versions of the arm-none-eabi toolchain dropped the support for the simulator.
I suggest using QEMU instead, if you want to execute your code.
See here for a bare metal example on QEMU and how to connect with the debugger: Hello world for bare metal ARM using QEMU

Thanks Balau. Finally the target sim command worked for me. But I want to know why the target sim command only works with the arm-none-eabi toolchain; why doesn’t it work with other toolchains? Also, I have written a simple C program for hello world, without .s and .ld files, and I tried to compile it with arm-none-eabi-gcc -g hello.c -o hello, but it shows the following error:

“target sim” must be implemented by the developers that take care of the toolchain. I don’t know the reason why they did not implement the simulator.

About the C program compilation, the “undefined reference” errors are there because bare-metal programs need to implement low-level functions, otherwise for example the program doesn’t know where to send the characters of the “printf” string.
See this post about it:

I don’t know why the “_start” symbol is not found. Usually the toolchain automatically links the “crt” object files, which contain the “C Run-Time” necessary to execute C code correctly.

Hi,
Thanks for the detailed post, helped me getting started with bare metal programming the raspberry pi. But at this moment I’m having problems with IRQ handler, do you have an example for it? Or could you let me know in short how it is to be done?
Thank you,
Brijen

The toolchains have a triplet consisting of Architecture-OperatingSystem-BinaryInterface.
Architecture is ARM for both.
OperatingSystem is the one where the compiled program will run. In case of Linux, it means that the program will be executed by Linux, and the program libraries will talk with a Linux kernel through system calls. In case of None, the program will be run on the “bare metal”, talking directly with the hardware (usually with newlib as C standard library and runtime).
I think the eabi and gnueabi strings in the toolchain names mean the same thing, because EABI is the specification that both toolchains implement; gnueabi is the GNU open source implementation.

Thanks! I tried your examples with arm-linux-gnueabi-gcc and it worked even though it’s a baremetal program. Is it because we’re not making syscalls? If I was to write my own kernel and OS does it matter if the toolchain is “Linux” instead of “none”?

Yes the example works because we are supplying the startup code and we are not making system calls.
If you want to write your own kernel and OS different from Linux, then I don’t think you can use either of the two toolchains directly.
It depends on what kind of OS you are writing, but I think for a simple OS you could supply the startup and low-level calls, and use the bare metal toolchain.
Otherwise you could create your own toolchain such as “arm-YOUR_OS-gnueabi-gcc” but I suppose it’s harder.

How do I measure the performance of an SoC with and without L2 cache at the U-Boot level? I enabled the L2 cache controller at the U-Boot level, and I want to see the execution speed of an application at that level with the L2 cache enabled. Please let me know if it is possible. Sorry if this is not a relevant question for you.

I think that it depends on what you have of this SoC. For example you might need one of these:
– Development board
– Target device (for example smartphone)
– RTL simulation
– Cycle-accurate emulator
For example the CodeSourcery ARM simulator or QEMU are not suitable for this, because they don’t model cache with its timings.
Then you need a way to measure the start and end of U-Boot execution, for example raising and lowering a GPIO and measuring the rising/falling edge distance with oscilloscope or an external microcontroller. Or using internal timers if the SoC has any.
Obviously you need to remove dead wait time such as “press any key to stop autoboot” countdown otherwise you won’t get useful results.
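If the SoC has a free-running timer, the measurement itself boils down to reading the counter before and after the code of interest and subtracting. The sketch below assumes a 32-bit up-counter and a known timer clock frequency; both are assumptions, since the real register layout and clock depend on the SoC, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 32-bit up-counter.
 * Unsigned subtraction gives the right answer even if the counter
 * wrapped around once between the two reads. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert ticks to microseconds for a timer clocked at timer_hz. */
static uint64_t ticks_to_us(uint32_t ticks, uint32_t timer_hz)
{
    assert(timer_hz != 0);
    return ((uint64_t)ticks * 1000000u) / timer_hz;
}
```

Running the same code once with the L2 cache enabled and once with it disabled, and comparing the two tick counts, gives the relative speed-up.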

great work! Thanks a lot, it makes it easier to understand the basics of what is going on after reset.
However, I need to put out RAM bytes on UART just _before_ RAM gets initialized by the startup script. Therefore I need to initialize UART in ASM before. Do you know how to do this for omap4460/ARM A9?

You are most likely compiling for the Thumb instruction set. Be aware that most (maybe all) ARM cores that support both the ARM and Thumb instruction sets start in ARM mode at reset, so you can’t compile startup.s in Thumb mode. The cores that support only Thumb, which are the Cortex-M family, have a completely different startup compared to the one I showed here.

Hi
The above method is partially working for me. But the program does not end. GCC gdb for a program in host pc usually exits the program. But the arm-none-eabi-gdb does not end and does not print any values. In a simple sort program it executes the instruction but I do not see the output and it keeps on running

Yes it is expected to not exit. It should loop in the B . instruction in startup.s.
This example is not made for full C programs, as I hint in the article.
In order to use printf and C library in general, take a look at this, it might be what you are searching for: https://balau82.wordpress.com/2010/11/04/qemu-arm-semihosting/

Thanks for quick reply.
Yes I have started using QEMU. It is working great for me. However I am using qemu-arm with arm-none-linux-gnueabi-gcc rather than qemu-system-arm with arm-none-eabi-gcc.
I hope there is no major difference between these two. I am trying to emulate a system with the arm v2a instruction set

@Bala
My blog post is about building a bare metal program. You seem to want to execute a user-space Linux program, and that’s different, especially at the beginning of execution, which is why it doesn’t work.
I think that if you follow my example to do what you want, then you are following the wrong example. qemu-arm is a program that emulates a user-space ARM Linux program, while qemu-system-arm emulates a system without an operating system.
There’s a relevant difference between arm-none-eabi-gcc and arm-none-linux-gnueabi-gcc: the first builds bare metal programs, which execute without an operating system; the second builds Linux user-space programs, which need to run on top of a Linux kernel.
I hope the distinction is clear.

@Alessandro: those are two different toolchains. You probably downloaded and installed the arm-none-linux-gnueabi toolchain, that produces Linux user-space programs, instead of the arm-none-eabi toolchain that produces bare metal programs. You need to download and install the right toolchain.

Can the gdb ‘target sim’ target provide instruction counts or, even better, CPU cycle counts? If it cannot do so explicitly, is there a way we can use tracing or scripting to coerce it to? Or can we perhaps somehow instrument the bare metal environment with enough gprof scaffolding?

Here’s what I’m trying to do: I think the bare metal development environment is extremely useful for developing and optimizing small, self-contained algorithms. It is often not obvious which implementation of a particular algorithm is most efficient. For cases in which you can control the instruction cache or the type of program memory, a simulator’s instruction counter or cycle counter could help point to the most optimal solution.

I don’t know if the simulator that runs in gdb is able to count instructions; I wouldn’t bet on it. I see that Z8000 has a register that can be read to count instructions.
QEMU is able to output a trace of the assembly instructions that are executed, for example with -D exec.log -d in_asm -singlestep options. It could be used to count the instructions between two points of the program by post-processing the log.
Be aware that the accuracy of these numbers is questionable: on real hardware you would have different timings with respect to memory access and caches, for example.
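The post-processing can be as simple as counting the log lines that look like a disassembled instruction. The sketch below assumes that each such line starts with a hexadecimal address followed by a colon (for example “0x00010000:  e59fd004  ldr sp, [pc, #4]”); that format is an assumption and may differ between QEMU versions. Whether every executed instruction shows up in the log also depends on the logging options and on QEMU’s translation cache, so the result should be treated as approximate. The function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the line starts with "0x<hex digits>:", 0 otherwise. */
static int is_instruction_line(const char *line)
{
    if (line[0] != '0' || line[1] != 'x') {
        return 0;
    }
    const char *p = line + 2;
    if (!isxdigit((unsigned char)*p)) {
        return 0;
    }
    while (isxdigit((unsigned char)*p)) {
        p++;
    }
    return *p == ':';
}

/* Counts instruction-like lines in a NUL-terminated log text. */
static size_t count_instruction_lines(const char *log)
{
    assert(log != NULL);
    size_t count = 0;
    const char *line = log;
    while (*line != '\0') {
        if (is_instruction_line(line)) {
            count++;
        }
        /* Skip to the start of the next line. */
        while (*line != '\0' && *line != '\n') {
            line++;
        }
        if (*line == '\n') {
            line++;
        }
    }
    return count;
}
```

To count between two points of the program, the same loop could start and stop counting when it sees the two addresses of interest.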

Thanks for the QEMU suggestion; it’s a far superior approach for the algorithm optimization problem I’m trying to solve. Your example in the ‘Hello world for bare metal ARM using QEMU’ blog in conjunction with the instruction logging options produces an execution tracelog. As you note, it’s straightforward to parse this log to extract instruction counts or execution histograms. BTW, some users reported issues with the tools. I’m using Ubuntu 14.04 and the stock arm-none-eabi-* tools and qemu obtained via apt-get work. Neither the qemu-kvm-extras nor the CodeSourcery toolchain were needed.

You also correctly warn that instruction counts may not accurately predict real hardware performance. Some instructions like PUSH take numerous CPU cycles. Potentially one could parse the tracelog and annotate it with CPU cycle counts. But even then we face cache dependencies, memory speeds, etc. However, given that we have a simulation environment that is under our control, we should be able to constrain things well enough to ascertain relative performance between alternate algorithm implementations.

Hello.
The tutorial is very helpful.
But now i want to use assembly instructions directly instead of C code for compilation. i am using an arm7tdmi cpu.
Is there a way to directly compile arm assembly code in gcc?

Also, i need to access some ram locations for storing some data, how do i approach it? I tried setting memory regions in gdb, but it says, ‘can’t access memory location’. What do i do with this?

GCC can accept assembly files as well. You write your assembly file instead of a C file, you pass it to GCC as a command line parameter instead of a C file, and GCC tries to assemble a .o object file from it (it calls GNU assembler by itself). Something like:

arm-none-eabi-gcc -mcpu=arm7tdmi -c yourcode.s -o yourcode.o

About accessing RAM location, I’m not sure what you mean, because I guess you want to do something like a C variable, but I don’t understand why you mention gdb which is a debugger. You probably want to write something like this in assembly:

.bss
var: .space 4

And then you use “var” as the address of a 4-byte integer or something like that.

Thank you very much for your help. It worked like a charm.
Just one more question. How do you check the space allocated for RAM in gdb? I want to see the memory map of the microcontroller. The command ‘info mem’ does not show the memory map of the controller.

I don’t think you can. It seems that GDB needs some sort of way to “retrieve it from target” but I don’t know what’s the mechanism that is used, and I don’t think it’s implemented in QEMU or in the actual targets.

Ok. The locations can be then just determined by trial and error. Is there a telnet interface for gdb? I want to see the output of the gdb interactively through an automated script. Currently I am using the command file with gdb commands, and checking the log file for the output.
I tried using the remote debugger, but it didn’t work.

I don’t understand: gdb default interface is an interactive terminal. If you launch gdb without giving it a script it will present you with a (gdb) prompt. So it seems to me that the default behavior is already what you want.

I’ve been trying to make the basic blinky work on Linux using the GNU gcc toolchain for ARM.
The problem I’m currently observing is that the .hex file I got from Keil uVision is around 2 KB while the one from gcc is around 67 KB, plus it doesn’t blink the LED.

The option to pass a linker script to gcc is -T, not -D. Correct lscript.sh and it could work. Also, a .hex file is usually a text file, but you created a binary file. Whether that is a problem depends on how you use the file; I don’t know.
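To illustrate why a .hex file is text, here is a minimal C sketch that formats one record in the Intel HEX format, which is what the .hex extension usually refers to. Each record is a line of ASCII characters: a colon, the byte count, a 16-bit address, a record type, the data bytes and a checksum. The function name ihex_record is hypothetical, and the sketch assumes that this is the format the flashing tool expects.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Writes one Intel HEX record as a NUL-terminated string into out.
 * type 0x00 is a data record, 0x01 the end-of-file record.
 * Returns the number of characters written, or -1 if out is too small. */
static int ihex_record(char *out, size_t out_size, uint8_t type,
                       uint16_t addr, const uint8_t *data, uint8_t len)
{
    assert(out != NULL);

    /* ':' + count(2) + address(4) + type(2) + data(2*len) + checksum(2) + NUL */
    size_t needed = 12u + 2u * (size_t)len;
    if (out_size < needed) {
        return -1;
    }

    /* The checksum covers every byte of the record after the colon. */
    uint8_t sum = (uint8_t)(len + (addr >> 8) + (addr & 0xFF) + type);
    int pos = sprintf(out, ":%02X%04X%02X",
                      (unsigned)len, (unsigned)addr, (unsigned)type);
    for (uint8_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
        pos += sprintf(out + pos, "%02X", (unsigned)data[i]);
    }

    /* Two's complement of the sum, so that all bytes add up to zero. */
    pos += sprintf(out + pos, "%02X", (unsigned)(uint8_t)(0x100u - sum));
    return pos;
}
```

Because every data byte becomes two characters plus per-record overhead, a .hex file is more than twice as large as the raw binary it describes, so file sizes of .hex and .bin outputs cannot be compared directly.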
