Bourne Shell Scripting/Debugging and signal handling

In the previous sections we've told you all about the Bourne Shell and how to write scripts using the shell's language. We've been careful to include all the details we could think of so you will be able to write the best scripts you can. But no matter how carefully you've paid attention and no matter how carefully you write your scripts, the time will come to pass when something you've written simply will not work — no matter how sure you are it should. So how do you proceed from here?

In this module we cover the tools the Bourne Shell provides to deal with the unexpected. Unexpected behavior of your script (for which you must debug the script) and unexpected behavior around your script (caused by signals being delivered to your script by the operating system).

Contents

So here you are, in the middle of the night having just finished a long and complicated shell script, just poured your heart and soul into it for three days straight while living on nothing but coffee, cola and pizza... and it just won't work. Somewhere in there is a bug that is just eluding you. Something is going wrong, some unexpected behavior has popped up or something else is driving you crazy. So how are you going to debug this script? Sure, you can pump it full of 'echo' commands and debug that way, but isn't there an easier way?

Generally speaking the most insightful way to debug any program is to follow the execution of the program along statement by statement to see what the program is doing exactly. The most advanced form of this (offered by modern IDEs) allows you to trace into a program by stopping the execution at a certain point and examining its internal state. The Bourne Shell is, unfortunately, not that advanced. But it does offer the next best thing: command tracing. The shell can print each and every command as it is being executed.

The tracing functionality (there are two of them) is activated using either the 'set' command or by passing parameters directly to the shell executable when it is called. In either case you can use the -x parameter, the -v parameter or both.

-v

Turns on verbose mode; each command is printed by the shell as it is read.

-x

This turns on command tracing; every command is printed by the shell as it is executed.

There is another parameter that you can use to debug your script, the -n parameter. This causes the shell to read the commands but not execute them. You can use this parameter to do a syntax checking run of your script.

As you saw in the previous section, we used the shell parameters by passing them in as command-line parameters to the shell executable. But couldn't we have put the parameters inside the script itself? After all, there is an interpreter hint in there... And surely enough, we can do exactly that. Let's modify the script a little and try it.

So there's no problem there. But there is a little gotcha. Let's try running the script again:

Running the script again

Code:

$ sh divider.sh

Output:

0

expr: division by zero

Where did the debugging go?

So what happened to the debugging that time? Well, you have remember that the interpreter hint is used when you try to execute the script as an executable in its own right. But in the last example, we weren't doing that. In the last example we called the shell ourselves and passed it the script as a parameter. So the shell executed without any debugging activated. It would have worked if we'd done a "sh -xv divider.sh" though.

What about sourcing the script (i.e. using the dot notation)?

Running the script again

Code:

$ . divider.sh

Output:

0

expr: division by zero

No debugging there either...

This time the script was executed by the same shell process that is running the interactive shell for us. And the same principle applies: no debugging there either. Because the interactive shell was not started with debugging flags. But we can fix that as well; this is where the 'set' command comes in:

And now we have debugging active in the interactive shell and we get a full trace of the script. In fact, we even get a trace of the interactive shell calling the script! But now what happens if we start a new shell process with debugging on in the interactive shell? Does it carry over?

Running the script again

Code:

$ sh divider.sh

Output:

sh divider.sh

+ sh divider.sh
0
expr: division by zero

Not quite...

Well, we certainly get a trace of the script being called, but no trace of the script itself. The moral of the story is: when debugging, make sure you know which shell you're activating the trace in.

By the way, to turn tracing in the interactive shell off again you can either do a 'set +xv' or simply a 'set -'.

When writing or debugging a shell script it is sometimes useful to exit out (to stop the execution of the script) at certain points. You use the 'exit' built-in command to do this. The command looks simply like this:

exit [n]

* Where n (optional) is the exit status of the script.

If you leave off the optional exit status, the exit status of the script will be the exit status of the last command that executed before the call to 'exit'.

For example:

Exiting from a script

#!/bin/sh -xecho hello
exit 1

If you run this script and then test the output status, you will see (using the "$?" built-in variable):

Checking the exit status

Code:

echo$?

Output:

1

There's one thing to look out for when using 'exit': 'exit' actually terminates the executing process. So if you're executing an executable script with an interpreter hint or have called the shell explicitly and passed in your script as an argument that is fine. But if you've sourced the script (used the dot-notation), then your script is being run by the process running your interactive shell. So you may inadvertently terminate your shell session and log yourself out by using the 'exit' command!

There's a variation on 'exit' meant specifically for blocks of code that are not processes of their own. This is the 'return' command, which is very similar to the 'exit' command:

return [n]

* Where n (optional) is the exit status of the block.

Return has exactly the same semantics as 'exit', but is primarily intended for use in shell functions (it makes a function return without terminating the script). Here's an example:

The function returned with a testable exit status of 2. The overall exit status of the script is zero though, since the last command executed by the script ('echo Goodbye!!') exited without errors.

You can also use a 'return' statement to exit a shell script that you executed by sourcing it (the script will be run by the process that runs the interactive shell, so that's not a subprocess). But it's usually not a good idea, since this will limit your script to being sourced: if you try to run it any other way, the fact that you used a 'return' statement alone will cause an error.

A syntax, command error or call to 'exit' is not the only thing that can stop your script from executing. The process that runs your script might also suddenly receive a signal from the operating system. Signals are a simple form of event notification: think of a signal as a little light suddenly popping on in your room, to let you know that somebody outside the room wants your attention. Only it's not just one light. The Unix system usually allows for lots of different signals so it's more like having a wall full of little lamps, each of which could suddenly start blinking.

On a single-process operating system like MS-DOS, life was simple. The environment was single-process, meaning your code (once running) had complete machine control. Any signal arriving was always a hardware interrupt (e.g. the computer signalling that the floppy disk was ready to read) and you could safely ignore all those signals if you didn't need external hardware; either it was some device event you weren't interested in, or something was really wrong — in which case the computer was crashing anyway and there was nothing you could do.

On a Unix system, life is not so easy. On Unix, signals can come from all over the place (including other programs). And you never have complete control of the system either. A signal may be a hardware interrupt, or another program signalling, or the user who got fed up with waiting, logged in to a second shell session and is now ordering your process to die. On the bright side, life is still not too complicated. Most Unix systems (and certainly the Bourne Shell) come with default handling for most signals. Usually you can still safely ignore signals and let the shell or the OS deal with them. In fact, if the signal in question is number 9 (loosely translated: KILL!! KILL!! DIE!! DIE, RIGHT NOW!!), you probably should ignore it and let the OS kill your process.

But sometimes you just have to do your own signal handling. That might be because you've been working with files and want to do some cleanup before your process dies. Or because the signal is part of your multi-process program design (e.g. listening for signal 16, which is "user-defined signal 1"). Which is why the Bourne Shell gives us the 'trap' command.

The trap command is actually quite simple (especially if you've ever done event-driven programming of any kind). Essentially the trap command says "if one of the following signals is received by this process, do this". It looks like this:

trap [command string] signal0 [signal1] ...

*Where command string is a string containing the commands to execute if a signal is trapped

and signaln is a signal to be trapped.

For example, to trap user-defined signal 1 (commonly referred to as SIGUSR1) and print "Hello World" whenever it comes along, you would do this:

Trapping SIGUSR1

$ trap"echo Hello World" 16

Most Unix systems also allow you to use symbolic names (we'll get back to these a little later). So you can probably also do this:

Trapping SIGUSR1 (little easier)

$ trap"echo Hello World" SIGUSR1

And if you can do that, you can usually also do this:

Trapping SIGUSR1 (even easier)

$ trap"echo Hello World" USR1

The command string passed to 'trap' is a string that contains a command list. It's not treated as a command list though; it's just a string and it isn't interpreted until a signal is caught. The command string can be any of the following:

A string

A string containing a command list. Any and all commands are allowed and you can use multiple commands separated by semicolons as well (i.e. a command list).

''

The empty string. Actually this is the same as the previous case, since this is the empty command string. This causes the shell to execute nothing when a signal is trapped — in other words, to ignore the signal.

Nothing, the null string. This resets the signal handling to the default signal action (which is usually "kill process").

Following the command list you can list as many signals as you want to be associated with that command list. The traps that you set up in this manner are valid for every command that follows the 'trap' command.

Right about now it's probably a good idea to look at an example to clear things up a bit. You can use 'trap' anywhere (as usual) including the interactive shell. But most of the time you will want to introduce traps into a script rather than into your interactive shell process. Let's create a simple script that uses the 'trap' command:

This script in and of itself is an endless loop, which prints "Running..." and then sleeps for five seconds. But we've added a 'trap' command (before the loop, otherwise the trap would never be executed and it wouldn't affect the loop) that prints "Hello World" whenever the process receives the SIGUSR1 signal. So let's start the process by running the script:

To spring the trap, we must send the running process a signal. To do that, log into a new shell session and use a process tool (like 'ps') to find the correct process id (PID):

Finding the process ID

Code:

$ ps -ef | grep signal

Output:

bzt 10865 7067 0 15:08 pts/0 00:00:00 /bin/sh ./trap_signal.sh

bzt 10808 10415 0 15:12 pts/1 00:00:00 fgrep signal

Our PID is 10865

Now, to send a signal to that process, we use the 'kill' command which is built into the Bourne Shell:

kill [-signal] ID [ID] ...

*Where -signal is the signal to send (optional; default is 15, or SIGTERM)

and ID are the PIDs of the processes to send the signal to (at least one of them)

As the name suggests, 'kill' was actually intended to kill processes (this fits with the default signal being SIGTERM and the default signal handler being terminate). But in fact what it does is nothing more than send a signal to a process. So for example, we can send a SIGUSR1 to our process like this:

You might notice that there's a short pause before "Hello World!" appears; it won't happen until the running 'sleep' command is done. But after that, there it is. But you might be a little surprised: the signal didn't kill the process. That's because 'trap' completely replaces the signal handler with the commands you set up. And an 'echo Hello World' alone won't kill a process... The lesson here is a simple one: if you want your signal trap to terminate your process, make sure you include an 'exit' command.

Between having multiple commands in your command list and potentially trapping lots of signals, you might be worried that a 'trap' statement can become messy. Fortunately, you can also use shell functions as commands in a 'trap'. The following example illustrates that and the difference between an exiting event handler and a non-exiting event handler:

Here's the official definition of a signal from the POSIX-1003 2001 edition standard:

A mechanism by which a process or thread may be notified of, or affected by, an event occurring in the system.Examples of such events include hardware exceptions and specific actions by processes.The term signal is also used to refer to the event itself.

In other words, a signal is some sort of short message that is sent from one process (possible a system process) to another. But what does that mean exactly? What does a signal look like? The definition given above is kind of vague...

If you have any feel for what happens in computing when you give a vague definition, you already know the answer to the questions above: every Unix flavor that was developed came up with its own definition of "signal". They pretty much all settled on a message that consists of an integer (because that's simple), but not exactly the same list everywhere. Then there was some standardization and Unix systems organized themselves into the System V and BSD flavors and at last everybody agreed on the following definition:

The system signals are the signals listed in /usr/include/sys/signal.h .

God, that's helpful...

Since then a lot has happened, including the definition of the POSIX-1003 standard. This standard, which standardizes most of the Unix interfaces (including the shell in part 1 (1003.1)) finally came up with a standard list of symbolic signal names and default handlers. So usually, nowadays, you can make use of that list and expect your script to work on most systems. Just be aware that it's not completely fool-proof...

POSIX-1003 defines the signals listed in the table below. The values given are the typical numeric values, but they aren't mandatory and you shouldn't rely on them (but then again, you use symbolic values in order not to use actual values).

POSIX system signals

Signal

Default action

Description

Typical value(s)

SIGABRT

Abort with core dump

Abort process and generate a core dump

6

SIGALRM

Terminate

Alarm clock.

14

SIGBUS

Abort with core dump

Access to an undefined portion of a memory object.

7, 10

SIGCHLD

Ignore

Child process terminated, stopped

20, 17, 18

SIGCONT

Continue process (if stopped)

Continue executing, if stopped.

19,18,25

SIGFPE

Abort with core dump

Erroneous arithmetic operation.

8

SIGHUP

Terminate

Hangup.

1

SIGILL

Abort with core dump

Illegal instruction.

4

SIGINT

Terminate

Terminal interrupt signal.

2

SIGKILL

Terminate

Kill (cannot be caught or ignored).

9

SIGPIPE

Terminate

Write on a pipe with no one to read it (i.e. broken pipe).

13

SIGQUIT

Terminate

Terminal quit signal.

3

SIGSEGV

Abort with core dump

Invalid memory reference.

11

SIGSTOP

Stop process

Stop executing (cannot be caught or ignored).

17,19,23

SIGTERM

Terminate

Termination signal.

15

SIGTSTP

Stop process

Terminal stop signal.

18,20,24

SIGTTIN

Stop process

Background process attempting read.

21,21,26

SIGTTOU

Stop process

Background process attempting write.

22,22,27

SIGUSR1

Terminate

User-defined signal 1.

30,10,16

SIGUSR2

Terminate

User-defined signal 2.

31,12,17

SIGPOLL

Terminate

Pollable event.

-

SIGPROF

Terminate

Profiling timer expired.

27,27,29

SIGSYS

Abort with core dump

Bad system call.

12

SIGTRAP

Abort with core dump

Trace/breakpoint trap

5

SIGURG

Ignore

High bandwidth data is available at a socket.

16,23,21

SIGVTALRM

Terminate

Virtual timer expired.

26,28

SIGXCPU

Abort with core dump

CPU time limit exceeded.

24,30

SIGXFSZ

Abort with core dump

File size limit exceeded.

25,31

Earlier on we talked about job control and suspending and resuming jobs. Job suspension and resuming is actually completely based on sending signals to processes, so you can in fact control job stopping and starting completely using 'kill' and the signal list. To suspend a process, send it the SIGSTOP signal. To resume, send it the SIGCONT signal.

If you go online and read about 'trap', you might come across another kind of "signal" which is called ERR. It's used with 'trap' the same way regular signals are, but it isn't really a signal at all. It's used to trap command errors (i.e. non-zero exit statuses), like this:

Error trapping

Code:

$ trap'echo HELLO WORLD' ERR
$ expr 1 / 0

Output:

expr: division by zero

HELLO WORLD

The non-zero exit status was trapped as though it was a signal.

So why didn we cover this "signal" earlier, when we were discussing 'trap'? Well, we saved it until the discussion on system and non-system signals for a reason: ERR isn't standard at all. It was added by the Korn Shell to make life easier, but not adopted by the POSIX standard and it certainly isn't part of the original Bourne Shell. So if you use it, remember that your script may not be portable anymore.