eCrash: Debugging without Core Dumps

How to use backtrace and a custom library to debug your embedded applications.

Embedded Linux does a good job of bridging the gap between embedded
programming and high-level UNIX programming. It has a full TCP stack,
good debugging tools and great library support. This makes for a
feature-rich development environment. There is one downside, however.
How can you debug problems that occur out in the field?

On full-featured operating systems, it's easy to use a core dump to
debug a problem that occurs in the field.

On non-embedded UNIX systems, when a program encounters an exception, it
outputs all of its current state to a file on the filesystem. This file
is usually called core. This core file contains all the memory
the program was using at the time the failure occurred. This allows
for post-mortem investigation to diagnose the exception.

Typically, on embedded Linux systems, there is little or no
persistent storage. On all of the systems on which I have worked,
there is more RAM than persistent storage, so a full core dump
simply will not fit. This article describes some alternatives to core
dumps that will allow you to perform post-mortem debugging.

Programs can fail for reasons other than exceptions. Programs can
deadlock, or they can have run-away threads that use up all system
resources (memory, CPU or other fixed resources). It also would be
beneficial to generate some kind of persistent crash file under these
situations.

Requirements

First, we need to decide what information we want to
save. Because of memory constraints, saving all of the process's memory
is not an option. If it were, you simply could use core dumps! But,
there is other very useful information we can save. At the top of
the list is the backtrace of the failed thread.

A backtrace is a list of the functions that were called to get to the
current position in the program. Even without the contents of the
program's memory, a backtrace can shed light on what was happening at
the time of failure.

Many embedded systems also have logs: lists of errors, warnings and
metrics to let you know what happened. Having a post-mortem dump
of the last few logs before failure is an invaluable asset in finding
the root cause of a failure.
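One way to keep the last few log messages available for a crash dump is a small fixed-size ring buffer that never allocates. The following is only a sketch: the names log_remember() and log_dump(), and the sizes LOG_SLOTS and LOG_LINE_MAX, are hypothetical and are not part of eCrash. It is also not thread-safe as written; a real system would need to protect the buffer.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define LOG_SLOTS    8   /* hypothetical: how many recent messages to keep */
#define LOG_LINE_MAX 80  /* hypothetical: maximum stored length of one message */

static char log_ring[LOG_SLOTS][LOG_LINE_MAX];
static unsigned long log_count;  /* total number of messages ever recorded */

/* Record one message, overwriting the oldest slot once the ring is full.
 * Messages longer than LOG_LINE_MAX - 1 characters are truncated. */
void log_remember(const char *msg)
{
    char *slot = log_ring[log_count % LOG_SLOTS];

    snprintf(slot, LOG_LINE_MAX, "%s", msg);
    log_count++;
}

/* Write the retained messages to fd, oldest first, one per line.
 * Returns the number of messages written. */
int log_dump(int fd)
{
    unsigned long first = log_count > LOG_SLOTS ? log_count - LOG_SLOTS : 0;
    int written = 0;

    for (unsigned long i = first; i < log_count; i++) {
        const char *line = log_ring[i % LOG_SLOTS];

        if (write(fd, line, strlen(line)) < 0 || write(fd, "\n", 1) < 0)
            break;
        written++;
    }
    return written;
}
```

The normal logging path would call log_remember() for every message, and the crash path would call log_dump() with whatever descriptor the dump is being written to.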

In complex, multithreaded systems, you usually have many mutexes.
It could be useful, in the case of a deadlock, to show the state of all
the process's mutexes and semaphores.

Showing memory usage statistics also could help diagnose the problem.

Once we have determined the information we want to save, we still
need to come up with where to save it. This will vary greatly from
system to system. If your system has no persistent storage at all,
perhaps you can output the crash information to a serial terminal or
display it on an LCD readout. (We have serious space constraints there!)
If your system has CompactFlash, you can save it to a filesystem.
Or, if it has raw Flash (an MTD device), you can either save it to
a JFFS2 filesystem, or maybe to a raw sector or two.

If the crash was not too severe, perhaps the crash information could be
uploaded to a TFTP server or sent to a remote syslog facility.

Now that we have a firm grasp on what we want to save, and locations to
which we can save it, let's talk about how we are going to do it!

The Backtrace

In general, getting a backtrace is not as simple as it sounds. Accessing
system registers (like the stack pointer) varies from architecture to
architecture. Thankfully, the FSF comes to our rescue with the GNU C
Library, glibc (see the on-line Resources). Glibc declares three
functions in <execinfo.h> that will aid us in retrieving backtraces:
backtrace(), backtrace_symbols() and backtrace_symbols_fd().

The backtrace() function populates an array of pointers with a backtrace
of the current thread. This, in general, is enough information for
debugging, but it is not very pretty.
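As a minimal sketch of that first step, assuming glibc and its <execinfo.h> header, the raw addresses can be captured and printed like this. The function name print_raw_backtrace() and the MAX_FRAMES limit are hypothetical, chosen only for illustration.

```c
#include <execinfo.h>
#include <stdio.h>

#define MAX_FRAMES 32  /* hypothetical cap on how many frames we record */

/* Print the return addresses on the calling thread's stack to stderr,
 * innermost frame first. Returns the number of frames captured. */
int print_raw_backtrace(void)
{
    void *frames[MAX_FRAMES];
    int depth = backtrace(frames, MAX_FRAMES);

    for (int i = 0; i < depth; i++)
        fprintf(stderr, "#%d %p\n", i, frames[i]);
    return depth;
}
```

The output is only a list of addresses. They can be mapped back to functions and source lines afterward, for example with the binutils addr2line tool run against an unstripped copy of the binary.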

The backtrace_symbols() function takes the information populated by
backtrace() and returns symbolic names (function names). The only
problem with backtrace_symbols() is that it is not async-signal safe,
because it calls malloc(). malloc() takes internal locks, so if the
signal interrupts a thread that already holds one of them, calling
malloc() from the signal handler can deadlock.

The backtrace_symbols_fd() function attempts to solve the signal issues
associated with malloc() by writing the symbolic information directly to
a file descriptor instead of returning allocated strings.

Working inside of a Signal Handler

Many functions inside of libc are not safe to call from a signal
handler: some I/O operations, memory allocation and so on. So, we are
very limited in what we should do inside of a handler. In our case, we
can cheat a little. Because our program already is crashing, a deadlock
is not that big of a concern. The code in my examples makes use of
several not-allowed functions, such as fwrite(), printf() and sprintf().
But, we can work to avoid some of the functions that are prone to
deadlock, such as malloc() and backtrace_symbols().
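For cases where even the cheat is unwelcome, addresses can be formatted by hand and emitted with write(), which avoids the stdio functions entirely. The sketch below is one way to do it; format_hex() and write_address() are hypothetical helper names, not functions from eCrash or libc.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Format value as "0x" followed by lowercase hex digits, NUL-terminated.
 * buf must hold at least 2 + 2 * sizeof(uintptr_t) + 1 bytes.
 * Returns the length written, not counting the terminating NUL. */
size_t format_hex(uintptr_t value, char *buf)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof(uintptr_t)];
    size_t n = 0;
    size_t len = 0;

    /* Collect digits from least significant to most significant. */
    do {
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value != 0);

    buf[len++] = '0';
    buf[len++] = 'x';
    while (n > 0)
        buf[len++] = tmp[--n];  /* reverse into the output */
    buf[len] = '\0';
    return len;
}

/* Write one address followed by a newline using only write(). */
void write_address(int fd, const void *addr)
{
    char line[2 + 2 * sizeof(uintptr_t) + 2];
    size_t len = format_hex((uintptr_t)addr, line);
    ssize_t ignored;

    line[len++] = '\n';  /* replace the NUL with a newline */
    ignored = write(fd, line, len);
    (void)ignored;
}
```

A handler could loop over the array filled in by backtrace() and call write_address() for each entry instead of using printf().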

In my opinion, the biggest loss we have is the loss of
backtrace_symbols. But, here is where things get easier. You always
can implement your own symbol table and look up the functions from the
pointers themselves.
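A home-grown lookup can be as simple as a table of function start addresses, sorted in ascending order, plus a binary search for the last entry at or below the address in question. The sketch below assumes such a table already exists; struct sym_entry and sym_lookup() are hypothetical names, and how the table is generated (for instance, from the output of nm at build time) is left open.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a hypothetical, build-time generated symbol table. */
struct sym_entry {
    uintptr_t   start;  /* address where the function begins */
    const char *name;   /* function name */
};

/* Return the name of the entry with the greatest start address that is
 * less than or equal to addr, or NULL if addr lies before every entry.
 * The table must be sorted by ascending start address. */
const char *sym_lookup(const struct sym_entry *table, size_t count,
                       const void *addr)
{
    uintptr_t target = (uintptr_t)addr;
    size_t lo = 0;
    size_t hi = count;  /* search the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].start <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo is now the index of the first entry that starts above target. */
    return lo == 0 ? NULL : table[lo - 1].name;
}
```

Because the table records only start addresses, an address past the end of the last function still resolves to that function. Storing each function's size as well would allow such addresses to be rejected.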

In my examples, I sometimes use backtrace_symbols(). I have not
seen a deadlock yet, but it is possible.


This library is very interesting, but it seems to print only addresses from main.c.
My program is statically linked with a library that contains a thread that calls assert().
The program creates the thread, and I register it in eCrash. When I launch the program, it crashes and prints the stack trace of the offending thread. I analyzed the printed addresses with addr2line, but it returns only addresses that are in my main.c. The program and library are compiled with the -g3 and -ggdb flags.
I'd appreciate any help.
