How-Tos

Quick Start

The information presented in this post doesn’t really lend itself to having a “Quick Start” section, but if you’re in a hurry we have a How-To section along with Video and Audio included with this post that may be a good quick reference for you. There are some really great general references in the Resources section that may help you as well.

Video

General Debugging

BASHDB Overview

Audio

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands and scripts easier, but if you’re not already familiar with the concepts presented here, typing things yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Overview

This post is the first in a series on shell script debugging, error handling, and security. Although I’ll be presenting some methodologies and techniques that apply to all shell languages (and most programming languages), this series will focus very heavily on BASH. Users of other shells like CSH will need to do some homework to see what information transfers and what does not.

One of the difficulties with debugging a shell script is that BASH typically doesn’t give you very much information to go on. You might get error output showing a line number, but that’s just the line where the shell became aware of the error, not necessarily the line where the error actually occurred. Add in a vague error message such as the one in Listing 1, and it gets difficult to tell what’s going on inside your script.

This post is written with the intent of giving you knowledge that will help when you see an error like the one in Listing 1 while trying to run a script. This type of error is just one of many errors that the shell may give you, and is more easily dealt with when you have a good understanding of scripting syntax and the debugging tools at your disposal.

Along with talking about debugging tools/techniques, I’m going to introduce a handy script debugger called BASHDB. BASHDB allows you to step through a script in much the same way as a program debugger like GNU’s GDB does with C code.

By the end of this post you should be armed with enough knowledge to handle the majority of debugging needs that you have. There’s a lot of information here, but taking the time to learn it will help make you more effective in your work with Linux.

Command Line Script Debugging

BASH has several command line options for debugging your shell scripts, and some of these are shown in Listing 2. These options will be applied to your entire script though, so it’s an all-or-nothing trade-off. Later in this post I’ll talk about more selective methods of debugging.

Listing 2

-n Checks for syntax errors without executing the script (noexec).
-u Causes an error to be thrown whenever you try to access a variable that has
not been set (nounset).
-v Sends all lines to standard error (stderr) as they are read, even comments.
-x Turns on execution tracing (xtrace) which displays each command as it is
executed.

All of the options in Listing 2 can be used just like options with other programs (bash -x scriptname), or with the built-in set command as shown later. With the -x option, the number of + characters before each of the lines of output denotes the subshell level. The more + characters there are, the further down into nested subshells you are. If there are no + characters at the start of the line, then the line is the normal output from the execution of the script. You can use the -x and -v options together for verbose execution tracing, but the amount of output can become a little overwhelming. Using the -n and -v options together provides a verbose syntax check without executing the script.
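As a quick illustration of the + notation, here’s a tiny script of my own (not one of the numbered listings) along with the kind of trace it produces:

#!/bin/bash -
# File: trace_example.sh (a hypothetical name)
outer="parent"
inner=$(echo "child") # Command substitution runs in a subshell
echo "$outer $inner"

Running bash -x ./trace_example.sh prints trace lines like + outer=parent for the parent shell and ++ echo child for the command inside the subshell, while the script’s own output line (parent child) carries no + prefix at all.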

If you decide to use the -x and -v options together, it can be helpful to use redirection in conjunction with a pager like less, or the tee command to help you handle the information. The shell sends debugging output to stderr and the normal output to stdout, so you’ll need to redirect both of them if you want the full picture of what’s going on. To do this and use the less pager to handle the information, you would use a command line like bash -xv scriptname 2>&1 | less. Instead of seeing the debugging output scroll by in the shell, you’ll be placed into the less pager where you’ll have access to functions like scrolling and search. While using the pager in this way, it’s possible that you may get an error like Broken pipe if you exit the pager before the script is done executing. This error has to do with the script trying to write output to something (less) that’s no longer there, and in this case can be ignored.

If you would prefer to redirect the debugging output to a file for later review and/or processing, you can use tee: bash -xv scriptname 2>&1 | tee scriptname.dbg. You will see the debugging output scroll by on the screen, but if you check the current working directory you will also find the scriptname.dbg file which holds the redirected output. This is what the tee command does for you. It allows you to send the output to a file while still displaying it on the screen. If the script will take a while to run you can alter the redirection operator slightly, put the script in the background, and then use tail -f scriptname.dbg to follow the updates to the file. You can see this in action in Listing 3, where I’ve created a script that runs in an infinite loop (the code is incorrect on purpose) generating output every 20 seconds. I start the script in the background, redirecting the output to the infinite_loop.dbg file only (not to the screen too). I then start the tail -f command to follow the file for a few iterations, and then hit Ctrl-C to interrupt the tail command. Once you understand how to redirect the debugging output in this way, it’s fairly easy to figure out how to split the debugging and regular output into separate files.
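The commands behind Listing 3 look roughly like this (infinite_loop.sh is the intentionally broken script):

bash -xv ./infinite_loop.sh > infinite_loop.dbg 2>&1 &
tail -f infinite_loop.dbg # Hit Ctrl-C to stop following the file
kill %1 # Stop the background script when you're done with it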

Internal Script Debugging

This section is called “Internal Script Debugging” because it focuses on changes that you make to the script itself to add debugging functionality. The easiest change to make in order to enable debugging is to change the shebang line of the script (the first line) to include the shell’s normal command line switches. So, instead of a shebang line like #!/bin/bash - you would have #!/bin/bash -xv. There are also both external and built-in commands for the BASH shell that make it easier for you to debug your code, the first of which is set.

The set command allows you to set shell options while your script is running. The options of the most interest for our purposes are the ones from Listing 2. For example, you can enclose sections of your script between the set -x and set +x command lines. By doing this you enable debugging for only the section of code within those lines, giving you control over what specific section of the script is debugged. Listing 4 shows a very simple script using this technique, and Listing 5 shows the script in action.
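A minimal sketch of the technique (Listing 6 later in this post shows this same script with a DEBUG function added):

#!/bin/bash -
# File: set_example.sh (a hypothetical name)
echo "Output #1"
set -x #Debugging on
echo "Output #2"
set +x #Debugging off
echo "Output #3"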

As you can see, the debugging output looks like you started the script with the bash -x command line. The difference is that you get to control what is traced and what is not, instead of having the execution of the whole script traced. Notice that the command to disable execution tracing (set +x) is included in the execution trace. This makes sense because execution tracing is not actually turned off until after the set +x line is done executing.

Output statements (echo/print/printf) are useful for getting information from your script at specific points. You can use output statements to track the progression of logic throughout your script by doing things like evaluating variable values and shell expansions, and finding infinite loops. Another advantage of using output statements is that you can control the format. When using command line debugging switches you have little or no control over the format, but with echo, print, and printf, you have the opportunity to customize the output to display in a way that makes sense to you.
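Here’s a short sketch of the idea; the loop and variable names are my own rather than from one of the numbered listings:

#!/bin/bash -
count=0
for file in *.txt
do
    # Custom-formatted debugging output, sent to stderr so it can
    # be redirected separately from the script's real output
    printf 'DEBUG: count=%d, file=%s\n' "$count" "$file" >&2
    count=$((count+1))
done
echo "Processed $count files"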

You can utilize a DEBUG function to provide a flexible and clean way to turn debugging output on and off in your script. Listing 6 shows the script in Listing 4 with the addition of the DEBUG function, and Listing 7 shows one way to switch the debugging on and off from the command line using a variable.

Listing 6

#!/bin/bash -
# File: func_example.sh
# This function can be used to selectively enable/disable debugging.
# Use with the set command to debug sections of the script.
function DEBUG()
{
    # Check to see if the enable debugging variable is set
    if [ -n "${DEBUG_ENABLE+x}" ]
    then
        # Run whatever command/option/argument combo that was
        # passed to our DEBUG function.
        $@
    fi
}

echo "Output #1"
DEBUG set -x #Debugging on
echo "Output #2"
DEBUG set +x #Debugging off
echo "Output #3"

The DEBUG function treats the rest of the line after it as an argument. If the DEBUG_ENABLE variable is set, the DEBUG function will execute its argument (the rest of the line) as a command via the $@ operator. So, any line that has DEBUG in front of it can be turned on or off by simply setting/unsetting one variable from the command line or inside your script. This method gives you a lot of flexibility in how you set up debugging in your script, and allows you to easily hide that functionality from your end users if needed.
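With the script saved as func_example.sh, switching the debugging on and off from the command line looks like this:

$ ./func_example.sh # DEBUG_ENABLE unset, so no tracing
$ DEBUG_ENABLE=1 ./func_example.sh # Tracing is on between the DEBUG lines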

Instead of requiring a user to set an environment variable on the command line to enable debugging, you can add command line options to your script. For instance, you could have the user run your script with a -d option (./scriptname -d) in order to enable debugging. The mechanism that you use could be as simple as having the -d option set the DEBUG_ENABLE variable inside of the script. An example of this, with the addition of multiple debugging levels, can be seen in the Scripting section.
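A bare-bones sketch of that mechanism (the full version with multiple debugging levels is in the Scripting section):

#!/bin/bash -
# Parse the command line, looking for the -d debugging option
while getopts "d" opt
do
    case $opt in
        d) DEBUG_ENABLE=1 ;; # The same variable our DEBUG function checks
        *) echo "Usage: $0 [-d]" >&2
           exit 1 ;;
    esac
done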

Another technique that you can use to track down problems in your script is to write data to temporary files instead of using pipes. Temp files are many times slower than pipes though, so I would use them sparingly and in most cases only for temporary debugging. There is a Linux Journal article by Dave Taylor (April 2010) referenced in the Resources section that talks about using temporary files in the article’s script. In a nutshell, you replace the pipe operator (|) with a redirection to file (> $temp), where $temp is a variable holding the name of your temporary file. You read the temporary file back into the script with another redirection operator (< $temp). This allows you to examine the temporary file for errors in the script’s pipeline. Listing 8 shows a very simplified example of this.

Listing 8

#!/bin/bash -
# Set the path and filename for the temp file
temp="./example.tmp"
# Dump a list of numbers into the temp file
printf "1\n2\n3\n4\n5\n" > $temp
# Process the numbers in the temp file via a loop
while read input_val
do
    # We won't do any real work, just output the values
    echo $input_val
done < $temp # Feeds the temp file into the loop
# Clean up our temp file
rm $temp

The last debugging technique that I'm going to touch on here is writing to the system log. You can use the logger command to write debugging output to /var/log/messages, or another file if you use the -f option. I consider this technique to be primarily for production scripts that have already been released to your users, and you don't want to abuse this mechanism. Flooding your system log with script debugging messages would be counterproductive for you and/or your system administrator. It's best to only log mission critical messages like warnings or errors in this way.

To use the logger command to help track script debugging information, you would just add a line like logger "${BASH_SOURCE[0]} - My script failed somewhere before line $LINENO." to your script. The line that this adds in the system log looks like the output line in Listing 9. There are a couple of variables that I've thrown in here to make my entry in the system log more descriptive. One is BASH_SOURCE, which is an array that in this case holds the name and path of the script that logged the message. The other is LINENO, which holds the current line number that you are on in your script. There are several other useful environment variables built into the newer versions of BASH (>= 3.0). Some of these other variables (all arrays) include BASH_LINENO, BASH_ARGC, BASH_ARGV, BASH_COMMAND, BASH_EXECUTION_STRING, and BASH_SUBSHELL. See the BASH man page for details.
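Putting that together, here is a pair of example lines (the log file path is just an illustration, and the file must be writable by the script's user):

# Write to the system log (typically /var/log/messages via syslog)
logger "${BASH_SOURCE[0]} - My script failed somewhere before line $LINENO."
# Or keep the script's messages in a separate file with the -f option
logger -f /tmp/myscript.log "${BASH_SOURCE[0]} - My script failed somewhere before line $LINENO."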

Introducing BASHDB

As I mentioned before, BASHDB is a debugger that does for BASH scripts what GNU's GDB does for C/C++ programs. BASHDB can do a lot, and it has four main features to help you eliminate errors from your scripts. First, it can start a script with options, arguments, and anything else that might affect its operation. Second, it allows you to set conditions on which a script will stop. Third, it gives you the ability to examine what's going on at the point in a script where it's stopped. Fourth, BASHDB allows you to manipulate things like variable values before telling the script to move on.

You can type bashdb scriptname to start BASHDB and set it to debug the script scriptname. Listing 10 shows a couple of useful options for the bashdb program.

Listing 10

-X Traces the entire script from beginning to end without putting bashdb in
interactive mode. Notice that it's capital X, not lowercase.
-c Tests/traces a single string command. For example, "bashdb -c ls *" will
allow you to step through the command string "ls *" inside the debugger.

In order to show where you're at, BASHDB displays the full path and current line number of the running script above the prompt. In interactive mode, the prompt BASHDB gives you looks something like bashdb<1>, where 1 is the number of commands that have been executed. If the command number is wrapped in parentheses, as in bashdb<(1)>, you are nested within a subshell. The more parentheses there are, the deeper into subshells you are nested. Listing 11 gives a decent command reference that you can use when debugging scripts at the BASHDB interactive mode prompt.

Listing 11

- Lists the current line and up to 10 lines that came before it.
backtrace Abbreviated "T". Shows the trace of calls including things like
functions and sourced files that have brought the script to where it
is now. You can follow "backtrace" with a number, and only that number
of calls will be shown.
break Abbreviated "b". Sets a persistent breakpoint at the current line unless
followed by a number, in which case a breakpoint is set at the line
specified by the number. See the "continue" command for a shortcut to
specifying the line number.
continue Abbreviated "c". Resumes execution of the script and moves to the next
stopping point or breakpoint. If followed by a number, "continue" works
in a similar way as issuing the "break" command followed by the number
and then the continue command. The difference is that "continue" sets a
one time breakpoint whereas "break" sets a persistent one.
edit Opens the text editor specified by the EDITOR environment variable to
allow you to make and save changes to the current script. Typing "edit"
by itself will start editing on the current line. If "edit" is followed
by a number, editing will start on the line specified by that number.
Once you're done editing you have to type "restart" or "R" to reload
and restart the script with your changes.
help Abbreviated "h". Lists all of the commands that are available when
running in interactive mode. When you follow "help" or "h" with a
command name, you are shown information on that command.
list Abbreviated "l". Lists the current line and up to 10 lines that come
after it. If followed by a number, "list" will start at the specified
line and print the next 10 lines. If followed by a function name, "list"
starts at the beginning of the function and prints up to 10 lines.
next Abbreviated "n". Moves execution of the script to the next instruction,
skipping over functions and sourced files. If followed by a number,
"next" will move that number of instructions before stopping.
print Abbreviated "p". When followed by a variable name, prints the value of
a specified variable. Example: print $VARIABLE
quit Exits from BASHDB.
set Allows you to change the way BASH interacts with you while running
BASHDB. You can follow "set" with an argument and then the words "on"
or "off" to enable/disable a feature. Example: "set linetrace on".
step Abbreviated "s". Moves execution of the script to the next instruction.
"step" will move down into functions and sourced files. See the "next"
command if you need behavior that skips these. If followed by a number,
"step" will move that number of instructions before stopping.
x Similar to the "print" command, but more powerful. Can print variable
and function definitions, and can be used to explore the effects of a
change to the current value of a variable. Example: "x n-1" subtracts 1
from the variable "n" and displays the result.

Normally when you hit the Enter/Return key without entering a command, BASHDB executes the next command. This behavior is overridden though when you have just run the step command. Once you've run step, pressing the Enter/Return key will re-execute step. The rest of the operation of BASHDB is fairly straightforward, and I'll run through an example session in the How-To section.

If you're a person who prefers to use a graphical interface, have a look at GNU DDD. DDD is a graphical front end for several debuggers including BASHDB, and includes some interesting features like the ability to display data structures as graphs.

How-To

If you've been reading this post straight through, you can see that there are a lot of script debugging tools at your disposal. In this section, I'm going to go through a simple example using a few of the different methods so that you can see some practical applications. Listing 12 shows a script that has several bugs intentionally added so that we can use it as our example.

Listing 12

#!/bin/bash -
# buggy_script.sh is designed to help us learn about
# shell script debugging
#
if [-z $1 ] # Space left out after first test bracket
then
    echo "TEST"
#fi #The closing fi is left out
# Use of uninitialized variable
echo "The value is: $VALUE1"
# Infinite loop caused by not incrementing num
num=0
while [ $num -le 10 ]
do
    sleep 2
    echo "Testing"
done

When I try to run the script for the first time I get the same error that we got in Listing 1. The first thing that I'm going to do is use the -x and -u options of BASH to run the script with extra debugging output (bash -xu ./buggy_script.sh). When I rerun the script this way, I see that I don't really gain anything because BASH detects the unexpected end of file bug before it even tries to execute the script. The line number isn't any help either since it just points me to the very last line of the script, and that's not very likely to be where the error occurred. I'll run into the same problems if I try to run the script with BASHDB as well.

I remember that the rule of thumb with unexpected end of file errors is that they usually mean that I've forgotten to close something out. It could be an if statement without a fi at the end, a case statement that's missing an esac or ;;, or any number of other constructs that require closure. When I start looking through the script I notice that my if statement is missing a fi, so I add (uncomment) that. This particular bug teaches us an important lesson - that there will always be some errors that will require us to do some digging on our own. We may be able to use our debugging techniques to get us close to the error, but in the end we have to know the language well enough to be able to spot syntax errors. Once I add the fi statement, I'm ready to rerun the script. The second time the script runs, I get an unbound variable error.

You can see in the error that a command line argument ($1) is unbound. This tells me that I forgot to add an argument after ./buggy_script.sh. I end up with the command line bash -xu ./buggy_script.sh testarg1 which gives me the next two errors shown in Listing 14.

Execution tracing shows me that the last command executed is '[-z' testarg1 ']'. The first error tells me that for some reason the start of the test statement ([-z) is being treated as a command. I think about it for a second and remember that there has to be a space between test brackets and what they enclose. The statement [-z $1 ] should read [ -z $1 ]. Since I try to focus on one error at a time, I fix the test statement and rerun the script. The first error from Listing 14 goes away, but the second error remains. You can see that it's another unbound variable error, but this time it's referencing a variable that I created and not a command line argument. The problem is that I use the variable VALUE1 in an echo statement before I've even set a value for it. In this case that would just leave a blank at the end of the echo statement, but in some cases it can cause more serious problems. This is what using the -u option of BASH does for you. It warns you that a variable doesn't have a value before you try to use it. To correct this error, I add a statement right above the echo line that sets a value for the variable (VALUE1="1").

After fixing the above errors and rerunning the script, everything seems to work fine. The only problem is that even though I set the while loop up to quit after the variable num gets to 10, the loop doesn't exit. It seems that I have an infinite loop problem. This loop is simple enough that you can probably just glance at it and see the problem, but for the sake of the example we're going to take the long way around. I add an echo statement (echo "num Value: $num") to show me the value of the num variable right above the sleep 2 line. When I run the script again without the BASH -x option (to cut out some clutter), I get the output shown in Listing 15.

You can see that the output from the echo statement I added is always the same (num Value: 0). This tells me that the value of num is never incremented and so it will never reach the limit of 10 that I set for the while loop. The fix is to use arithmetic expansion to increment the num variable by 1 each time around the while loop: num=$((num+1)). When I run the script now, num increments like it should and the script exits when it's supposed to. With this bug fixed, it looks like we've eliminated all of the errors from our script. The finalized script with the num evaluation echo statement removed can be seen in Listing 16.
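Pulling all of the fixes together, the finalized script of Listing 16 should look essentially like this:

#!/bin/bash -
# buggy_script.sh with all of the bugs corrected
if [ -z $1 ] # Space added after the first test bracket
then
    echo "TEST"
fi # Closing fi added back in
# Initialize the variable before using it
VALUE1="1"
echo "The value is: $VALUE1"
num=0
while [ $num -le 10 ]
do
    sleep 2
    echo "Testing"
    num=$((num+1)) # Increment num so the loop can exit
done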

Now I'll walk you through correcting the same buggy script using BASHDB. As I said above, the unexpected end of file error is best solved by applying your understanding of shell scripting syntax. Because of this, I'm going to start debugging the script right after we notice and fix the unclosed if statement. To start the debugging process, I use the line bashdb ./buggy_script.sh to launch BASHDB and have it start to step through the script. If you compiled BASHDB from source and haven't installed it, you'll need to adjust the paths in the command line accordingly.

BASHDB starts the script and then stops at line 7, the if statement. I then use the step command to move to the next instruction and get the output in Listing 17.

Listing 17

$ bashdb ./buggy_script.sh
bash Shell Debugger, release 4.0-0.4
Copyright 2002, 2003, 2004, 2006, 2007, 2008, 2009 Rocky Bernstein
This is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:7):
7: if [-z $1 ] # Space left out after first test bracket
bashdb<0> step
./buggy_script.sh: line 7: [-z: command not found
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:13):
13: echo "The value is: $VALUE1"

Notice that until I run the step command, BASHDB doesn't give me an error for line 7. That's because it has stopped on the line 7 instruction, but hasn't executed it yet. When I step through that instruction and on to the next one, I get the same error as the BASH shell gives us ([-z: command not found). As before, we realize that we've left a space out between the test bracket and the statement. To fix this, I type the edit command to open the script in the text editor specified by the EDITOR environment variable. In my case this is vim. I have to type visual to go to normal mode, and then I'm able to edit and save my changes to the script like I would in any vi/vim session. With the space added, I save the file and exit vim which puts me back at the BASHDB prompt. I type the R character and hit the Enter/Return key to restart the script, which also loads my changes. I end up right back at line 7 again.

This time when I use the step command, BASHDB moves past the if statement and stops right before executing line 13 (the next instruction). Everything looks good, so I use the step command again by simply hitting the Enter/Return key. The output in Listing 18 is what I see.

We see that the echo statement ends up not having any text after the colon, which is not what we want. What I'll do is issue an R (restart) command and then step back to line 13 so that I can check the value of the variable. Once I'm back at the echo statement on line 13, I use the command print $VALUE1 to inspect the value of that variable. A snippet of the output from the print command is in Listing 19.

There's a blank line between the bashdb<1> print $VALUE1 and bashdb<2> lines. This tells me that there is definitely not a value (or there's a blank string) set for the VALUE1 variable. To correct this I go back into edit mode, and add the variable declaration VALUE1="1" just above our echo statement. I follow the same edit, save, exit, restart (with the R character) routine as before, and then step down through the echo statement again.

This time the output from the echo statement is The value is: 1 which is what we would expect. With that error fixed, we continue to step down through the script until we realize that we're stuck in our infinite while loop. We can use the print statement here as well, and with the line print $num we see that the num variable is not being incremented. Once again, we enter edit mode to fix the problem. We add the statement num=$((num+1)) at the bottom of our while loop, save, exit, and restart. We now see that the num variable is incrementing properly and that the loop will exit. We can type the continue command to let the loop finish without any more intervention.

After the script has run successfully, you'll see the message Debugged program terminated normally. Use q to quit or R to restart. If you haven't been adding comments as you go, it would be a good idea at this point to re-enter edit mode and add those comments to any changes that you made. Make sure to run your script through one more time though to make sure that you didn't break anything during the process of commenting.

That's a pretty simple BASHDB session, but my hope is that it will give you a good start. BASHDB is a great tool to add to your shell script development toolbox.

Tips and Tricks

If you're like many of us, you may have trouble with quoting in your scripts from time to time. If you need a hint on how quoted sections are being interpreted by the shell, you can replace the command that's acting on the quoted section with the echo command. This will give you output showing how your quotes are being interpreted. This can also be a handy trick to use when you need insight into other issues like shell expansion too.
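For example (a made-up command line, not from one of the listings):

# Prefix the command with echo to see exactly what the shell
# hands it after quoting and expansion
echo rm -f "$HOME/tmp/*.bak" # Quotes keep the * from expanding
echo rm -f $HOME/tmp/*.bak # Unquoted, the * expands to file names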

If you don't indent temporary (debugging) code, it will be easier to find in order to remove it before releasing your script to users. If you don't already make a habit of indenting your scripts in the first place, I would recommend that you start. It greatly increases the readability, and thus maintainability, of your scripts.

You can set the PS4 environment variable to include more information with the shell's debugging output. You can add things like line numbers, filenames, and more. For example, you would use the line export PS4='$LINENO ' to add line numbers to your script's debugging output. The creator of the bashdb script debugger sets the PS4 variable to (${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]} - [${SHLVL},${BASH_SUBSHELL}, $?] which gives you very detailed information about where you're at in your script. You can make this change to the variable permanent by adding an export declaration to one of your bash configuration files.

Make sure to use unique names for your shell scripts. You can run into problems if you name your shell script the same as a system or built-in command (i.e. test). I like to make my shell script names distinctive, and for added protection I almost always add a .sh extension onto the end of the filename.

Scripting

These scripts are somewhat simplified and in most cases could be done other ways too, but they will work to illustrate the concepts. If you use these scripts, make sure you adapt them to your situation. Never run a script or command without understanding what it will do to your system.

Our first script example is going to have two separate parts to it. The first is a script in which we've enclosed our debugging functionality from above. This is a case where it's helpful to create modular code so that other scripts can add debugging functionality simply by sourcing one file. That way you're not duplicating code needlessly for commonly used functionality. The second script implements the debugging script, and uses a command line option (-d) to enable debugging. The script also uses multiple debugging levels to allow the user to control how verbose the output is by passing an argument to the -d option.

Listing 20

#!/bin/bash -
# File: debug_module.sh
# Holds common script debugging functionality

# Set the PS4 variable to add line #s to our debug output
PS4='Line $LINENO : '

# Handles enabling/disabling of debugging in the script,
# and also takes the user specified debug level into account.
# 0 = No debugging
# 1 = Debug executed statements only
# 2 = Debug all lines and executed statements
function DEBUG()
{
    # We need to see what level (0-2) of debugging is set
    if [ "$1" = "0" ] #User disabled debugging
    then
        echo "Debugging Off"
        set +xv
        # Set the variable that tracks the debugging state
        _DEBUG=0
    elif [ "$1" = "1" ] #User wants minimal debugging
    then
        echo "Minimal Debugging"
        set -x
        # Set the variable that tracks the debugging state
        _DEBUG=1
    elif [ "$1" = "2" ] #User wants maximum debugging
    then
        echo "Maximum Debugging"
        set -xv
        # Set the variable that tracks the debugging state
        _DEBUG=2
    else #Run/suppress a command line depending on debug level
        # If debugging is turned on, output the line
        # that this function was passed as a parameter
        if [ $_DEBUG -gt 0 ]
        then
            $@
        fi
    fi
}

This script has two main purposes. One is to set the PS4 variable so that line numbers are added to the debugging output to make it easier to trace errors. The other is to provide a function that takes an argument of either a number (0-2), or a command line and then decides what to do with it. If the argument is a number from 0 to 2, the function sets a debugging level accordingly. Level 0 turns off all debugging (set +xv), level 1 turns on execution tracing only (set -x), and level 2 turns on execution tracing and line echoing (set -xv). Anything else that is passed to the function is treated as a command line that is either run or suppressed depending on what the debugging level is.

As always, there are many ways to improve this script. One would be to add more debugging levels to it. I created three (0-2), which accommodated only the -x and -v options. You could add another level for the -u option, or create your own custom levels. Listing 21 shows an implementation of our simple modular debugging script.

The first statement that you see in the Listing 21 script is a source statement reading the modular debugging script (debug_module.sh). This treats the debugging script as if it was part of the script we're currently running. The next major section that you see is the while loop that parses the command line options and arguments. The main option to be concerned with is "d", since it's the one that enables or disables debugging output. The getopts command requires the -d option to have an argument on the command line via the getopts "d:h" statement. The user passes a 0, 1, or 2 to the option and that in turn sets the debugging level via the _DEBUG variable and the DEBUG function. The DEBUG function is called 4 more times throughout the rest of the script. Three of those times it is used as a switch to run or suppress a line of the script, and once it is used to reset the debugging level to 0 (debugging off).

The last three lines of the script are a little different. I put them in there to show how you could implement your own custom debugging functionality. In the first of those lines, the _DEBUG variable is set to 2 (maximum debugging output). The next two lines are used to select how much debugging output you see. When you set _DEBUG to 1, the line "First debugging level" is output. If you set _DEBUG to 2 as in the script, the conditions for both the "First debugging level" (> 0) and the "Second debugging level" (> 1) statements are met, so both lines are output. Listing 22 shows the output that you get from running this script, and if you look at the bottom you'll see that the lines "First debugging level" and "Second debugging level" are output.
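Based on that description, a minimal version of the Listing 21 script might look like this (the file name and echoed text are my own):

#!/bin/bash -
# File: debug_example.sh (a hypothetical name)
# Pull in the modular debugging functionality from Listing 20
source ./debug_module.sh
# Guard in case the user never passes -d (the module only sets
# _DEBUG inside the DEBUG function)
_DEBUG=${_DEBUG:-0}
# Parse the command line; getopts "d:h" makes -d require an argument
while getopts "d:h" opt
do
    case $opt in
        d) DEBUG $OPTARG ;; # Sets the debugging level (0-2)
        h) echo "Usage: $0 [-d 0|1|2] [-h]"
           exit 0 ;;
    esac
done
echo "Output #1"
DEBUG echo "Debug statement #1" # Run or suppressed by the debug level
echo "Output #2"
DEBUG echo "Debug statement #2"
DEBUG echo "Debug statement #3"
DEBUG 0 # Reset the debugging level to 0 (debugging off)
# Custom debugging levels built directly on the _DEBUG variable
_DEBUG=2
[ $_DEBUG -gt 0 ] && echo "First debugging level"
[ $_DEBUG -gt 1 ] && echo "Second debugging level"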

This next script is somewhat like an automated unit test. It's a wrapper script that automatically runs another script with varying combinations of options and arguments so that you can easily look for errors. It takes some time up front to create this script, but it allows you to quickly test whether any changes you make to a script might cause problems for the end user. It could take a lot of time to step through and test all of the option/argument combinations manually on a complex script, and with that extra work (if we're honest) this test might get left out altogether. That's where the automation of the script in Listing 23 comes in.

Listing 23

#!/bin/bash -
# File: unit_test.sh
# A wrapper script that automatically runs another script with
# a varying combination of predefined options and arguments,
# to help find any errors.

# Variables to make the script a little more readable.
_TESTSCRIPT=$1 #The script that the user wants to test
_OPTSFILE=$2 #The file holding the predefined options
_ARGSFILE=$3 #The file holding the predefined arguments

# Read the options and arguments from their files into arrays.
_OPTSARRAY=($(cat $_OPTSFILE))
_ARGSARRAY=($(cat $_ARGSFILE))

# The string that holds the option/argument combos to try.
_TRIALSTRING=""

# Step through all of the arguments one at a time.
for _ARG in ${_ARGSARRAY[*]}
do
    # The string of multiple command line options that we'll
    # build as we step through the available options.
    _OPTSTRING=""
    # Step through all of the options one at a time.
    for _OPT in ${_OPTSARRAY[*]}
    do
        # Append the new option onto the multi-option string.
        _OPTSTRING="${_OPTSTRING}$_OPT "
        # Accumulate the command lines that will be tacked onto
        # the command as we're testing it.
        _TRIALSTRING="${_TRIALSTRING}${_OPT} $_ARG\n" #Single option
        _TRIALSTRING="${_TRIALSTRING}${_OPTSTRING}$_ARG\n" #Multi-option
    done
done

# Change the Internal Field Separator to avoid newline/space troubles
# with the command list array assignment.
IFS=":"

# Sort the lines and make sure we only have unique entries. This could
# be taken care of by more clever coding above, but I'm going to let
# the shell do some extra work for me instead. An array is used to hold
# the command lines.
_CLIST=($(echo -e $_TRIALSTRING | sort | uniq | sed '/^$/d' | tr "\n" ":"))

# Step through each of the command lines that were built.
for _CMD in ${_CLIST[*]}
do
    # We can pipe the full concatenated command string into bash to run it.
    echo $_TESTSCRIPT $_CMD | bash
done

There are two files that I created to go along with this test script. The first is sample_opts, which holds a single line of possible options separated by spaces (-d -v -q). These options stand for debugging mode, verbose mode, and quiet mode respectively. The second is sample_args, which contains two possible arguments separated by a space (/etc/passwd /etc/shadow). I'll run our unit_test.sh script by passing it the name of the script to test, the sample_opts argument, and the sample_args argument. For this example, it really doesn't matter what the test script (./test_script.sh) is designed to do. We just provide the options and arguments that we want to test, and that's all the unit_test.sh script needs to know. Listing 24 shows what happens when I run the test.

Notice that the output from the unit test script shows that the -v and -q options cause a conflict. I have hard coded that error in the test script for clarity, but in everyday use you would have to look for things like real errors or output that doesn't match what is expected. The error about the -v and -q options makes sense in this case because you wouldn't want to run verbose (chatty) mode and quiet (non-chatty) mode at the same time. They are mutually exclusive options that should not be used together. This unit test script not only finds errors that you might miss with manual inspection, it also allows you to easily recheck your script whenever you make a change, and it ensures that your script is checked the same way every time.

There are a lot of improvements that can be made to this unit test script. For starters, the script doesn't check every possible combination of options. It's limited by the order that the options are in the sample_opts file. The script never reorders those options. Another improvement would be to have the script automatically check for common errors like illegal option, file not found, etc. As it stands now though, you can pipe the output of the script to grep in order to look for a specific error yourself.
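For example, to hunt for one of those errors yourself (the exact message text will depend on the script being tested):

./unit_test.sh ./test_script.sh sample_opts sample_args 2>&1 | grep -i "illegal option"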

Troubleshooting

The version of BASHDB that came with my chosen Linux distribution had a bug causing an error when a BASHDB function tried to return the value of -1. The problem went away though once I downloaded and compiled the latest version straight from the BASHDB website.

If a script you're debugging causes BASHDB to hang, you can try the CTRL+C key combination. This should exit from the script you're debugging and return you to the BASHDB prompt.

Conclusion

There are quite a few tools and methods at your disposal when debugging scripts. From BASH command line options, to a full debugger like BASHDB, to your own custom debugging and test scripts, there's a lot of room for creativity in ridding your scripts of errors. Better and more thorough debugging of your scripts from the outset will help lessen problems down the line, reducing downtime and user frustration. In the future, I'll talk about handling runtime errors and security as the next steps in ensuring the quality and reliability of your shell scripts. Look for another post in this series soon.

Video

Audio

Quick Start

If you just want enough information to fix your problem quickly, you can read the How-To section of this post and skip the rest. I would highly recommend reading everything though, as a good understanding of the concepts and commands outlined here will serve you well in the future. We also have Video and Audio included with this post that may be a good quick reference for you. Don’t forget that the man and info pages of your Linux/Unix system can be an invaluable resource as well when you’re trying to solve problems.

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands easier, but if you’re not already familiar with the concepts presented here, typing the commands yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Where listings on command options are made available, anything with square brackets around it (“[” and “]”) is an argument to the option, and a pipe (“|”) means that you can choose one of two alternatives ([4|6] means choose 4 or 6).

Overview

This post is geared more toward system administrators than software developers, but anyone can make good use of the information that you’re going to see here. The Resources section holds links to take your study further, even into the developer realm. I’m going to start off by giving you a brief background on shared libraries and some of the rules that apply to their use. Listing 1 shows an example of an error you might see after installing PostgreSQL via a bin installer file. In this post, I’m going to step through some commands and techniques to help you deal with this type of shared library problem. I’ll also work through resolving the error in Listing 1 as an example, and give you some tips and tricks as well as items to help you if you get stuck.

Background

Shared libraries are one of the many strong design features of Linux, but they can lead to headaches for inexperienced users, and even for experienced users in certain situations. Shared libraries allow software developers to keep the size of their applications’ storage and memory footprints down by using common code from a single location. The glibc library is a good example of this. There are two standardized locations for shared libraries on a Linux system, and these are the /lib and /usr/lib directories. On some distributions /usr/local/lib is included, but check the documentation for your specific distribution to be sure. These are not the only locations that you can use for libraries though, and I’ll talk about how to use other library directories later. According to the Filesystem Hierarchy Standard (FHS), /lib is for shared libraries and kernel modules that are required for startup and for running the binaries in the root filesystem (/bin and /sbin), and /usr/lib holds most of the internal libraries that are not meant to be executed directly by users or shell scripts. The /usr/local/lib directory is not defined in the latest version of the FHS, but if it exists on a distribution it normally holds libraries that aren’t a part of the standard distribution, including libraries that the system administrator has compiled/installed after the initial setup. There are some other directories, like /lib/security which holds PAM modules, but for our discussion we’ll focus on /lib and /usr/lib.

The counterpart to the dynamically linked (shared) library is the statically linked library. Whereas dynamically linked libraries are loaded and used as they are needed by the applications, statically linked libraries are either built into, or closely associated with, a program at the time it is compiled. A couple of the situations where static libraries are used are when you’re trying to work around an odd/outdated library dependency, or when you’re building a self-contained rescue system. Static linking typically makes the resulting application faster and more portable, but increases the size (and thus the memory and storage footprint) of the binary. A static library’s footprint is also multiplied if more than one program uses it. For instance, one program using a library that is 10 MB in size consumes just 10 MB of memory (1 program x 10 MB), but if you run 10 programs with the same library compiled into them, you end up with 100 MB of memory consumed (10 programs x 10 MB). Also, when programs are statically linked, they can’t take advantage of updates made to the libraries that they depend on. They are locked into whatever version of the library they were compiled with. Programs that depend on dynamically linked libraries refer to a specific file on the Linux file system, and so when that file is updated, the program can automatically take advantage of the new features and fixes the next time it loads.

Shared libraries typically have the extension .so, which stands for Shared Object. Library file names are followed by a version numbering scheme which can include major and minor version numbers. A system of symbolic links is used to point the majority of programs to the latest and greatest library version, while still allowing a minority of programs to use older libraries. Listing 2 shows output that I modified to illustrate this point.

You can see in the output that there are two versions of libreadline installed side-by-side (5.2 and 6.0). The version numbers are in the form major.minor, so 5 and 6 are major version numbers, with 2 and 0 being minor version numbers. You can usually mix and match libraries with the same major version number and differing minor numbers, but it can be a bad idea to use libraries with different major numbers in place of one another. Major version number changes usually represent significant changes to the interface of the library, which are incompatible with previous versions. Minor version numbers are only changed when an update such as a bug fix is added without significantly changing how the library interacts with the outside world. Another thing that you’ll notice in Listing 2 is that there are links created from libreadline.so.5 to libreadline.so.5.2 and from libreadline.so.6 to libreadline.so.6.0. This is so that programs that depend on the 5 or 6 series of the libraries don’t have to figure out where the newest version of the library is. If an application works with major version 6 of the library, it doesn’t care if it grabs 6.0, 6.5, or 6.9 as long as it’s compatible, so it just looks at the base name of the library and takes whatever that’s linked to. There are also a couple of other situations that you’re likely to encounter with this linking scheme. The first is that you may see a link file name containing no version numbers (libreadline.so) that points to the actual library file (libreadline.so.6.0). Also, even though I said that libraries with different major version numbers are risky to mix, there are situations where you will see an earlier major version number (libreadline.so.5) linked to a newer version number of the library (libreadline.so.6.0). This should only happen when your distribution maintainers or system administrators have made sure that nothing will break by doing this. Listing 3 shows an example of the first situation.
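To make the linking scheme concrete, here’s roughly how links like the ones in Listing 2 are created by hand (your package manager and ldconfig normally maintain these for you; run as root):

cd /usr/lib
# Point the major-version name at the newest minor version
ln -s libreadline.so.6.0 libreadline.so.6
# Point the bare library name at the major-version link
ln -s libreadline.so.6 libreadline.so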

All things considered, the shared library methodology and numbering scheme do a good job of ensuring that your software can maintain a smaller footprint, make use of the latest and greatest library versions, and still have backwards compatibility with older libraries when needed. With this said, the shared library model isn’t perfect. There are some disadvantages to using shared libraries, but those disadvantages are typically considered to be outweighed by the benefits. One of the disadvantages is that shared libraries can slow the load time of a program. This is only a problem the first time that the library is loaded though. After that, the library is in memory and other applications that are launched won’t have to reload it. One of the most potentially dangerous drawbacks of shared libraries is that they can create a central point of failure for your system. If there is a library that a large set of your programs rely on and it gets corrupted, deleted, overwritten, etc., all of those programs are probably going to break. If any of the programs that were just taken down are needed to boot your Linux system, you’ll be dead in the water and in need of a rescue CD.

While I would argue that dependency chains are not really a “problem”, they can make extra work for a system administrator. A dependency chain happens when one library depends on another library, that library depends on another, and so on. When dealing with a dependency chain, you may have satisfied all of the first level dependencies, but your program still won’t run. You have to check each library in turn for its own dependencies, and then follow that chain all the way through, filling in the missing dependencies as you go.

One final problem with shared libraries that I’ll mention again is version compatibility issues. You can end up with a situation where two different applications require different versions of the same library – versions that aren’t compatible with each other. That is the reason for the version numbering system that I talked about above, and robust package management systems have helped ease shared library problems from the user’s perspective, but the problems still exist in certain situations. Any time that you compile and/or install an application or library yourself on your Linux system, you have to keep an eye out for problems, since you don’t have the benefit of a package manager ensuring library compatibility.

Introducing ld-linux.so

ld-linux.so (or ld.so for older a.out binaries) is itself a library, and it is responsible for managing the loading of shared libraries in Linux. For the purposes of this post we’ll be working with ld-linux.so; if you need or want to learn more about the older style a.out loading/linking, have a look at the Resources section. The ld-linux.so library reads the /etc/ld.so.cache file, which is a non-human-readable file that is updated when you run the ldconfig command. When loading a program’s shared libraries, ld-linux.so determines where to look for them by checking the value of the LD_LIBRARY_PATH environment variable first, then the contents of the /etc/ld.so.cache file, and finally the default path of /lib followed by the /usr/lib directory.

The LD_LIBRARY_PATH environment variable is a colon separated list that preempts all of the other library paths in the loader’s search order. This means that you can use it to temporarily alter library paths when you’re trying to test a new library before rolling it out to the entire system, or to work around problems. This variable is typically not set by default on Linux distributions, and should not be used as a permanent fix. Use it with care, and give preference to the other library search path configuration methods. A handy thing about the LD_LIBRARY_PATH variable is that since it’s an environment variable, you can set it on the same line as a command and the new value will only affect that command, and not the parent environment. So, you would issue a command line like LD_LIBRARY_PATH="/home/user/lib" ./program to run program and force it to use the experimental shared libraries in /home/user/lib in preference to any others on the system. The shell that you run program in never sees the change to LD_LIBRARY_PATH. Of course you can also use the export command to set this variable, but be careful, because doing that will affect every program launched from that shell from then on. One final thing about the LD_LIBRARY_PATH variable is that you don’t have to run ldconfig after changing it. The changes take effect immediately, unlike changes to /lib, /usr/lib, and /etc/ld.so.conf. I’ll explain more about ldconfig later.

You can use the ld-linux.so library by itself to list which libraries a program depends on. Its behavior is very much like that of the ldd command that we’ll talk about next, because ldd is actually a wrapper script that adds more sophisticated behavior to ld-linux.so. In most cases ldd should be your preferred command for listing required shared libraries. In order to use ld-linux.so.2 to get a listing of the libraries that the ls command depends on, you would type /lib/ld-linux.so.2 --list /bin/ls, swapping the 2 out for whatever major version of the library your system is running. I’ve shown some of the command line options for ld-linux.so in Listing 4.

Listing 4

--list Lists all library dependencies for the executable
--verify Verifies that the program is dynamically linked and that
the ld-linux.so linker can handle it
--library-path [PATH] Overrides the LD_LIBRARY_PATH environment variable and
uses PATH instead

You can start a program directly with ld-linux.so by using the command line form /lib/ld-linux.so.2 --library-path FULL_LIBRARY_PATH FULL_EXECUTABLE_PATH, where you replace 2 with whatever version of the library you are using. An example would be /lib/ld-linux.so.2 --library-path /home/user/lib /home/bin/program, which would run program using /home/user/lib as the location to look for required libraries. This should be used for testing purposes only though, and not as a permanent fix on a production system.

Introducing ldd

The name of the ldd command comes from its function, which is to “List Dynamic Dependencies”. As mentioned in the previous section, by default the ldd command gives you the same output as issuing the command line /lib/ld-linux.so.2 --list FULL_EXECUTABLE_PATH. Each library entry in the output includes a hexadecimal number which is the load address of the library, and can change from run to run. Chances are that system administrators will never even need to know what this value is, but I’ve mentioned it here because some people may be curious. Listing 5 shows a few of the options for ldd that I use the most.

Keep in mind that you have to give ldd the full path to the binary/executable for it to work. The only way to work around giving ldd the full path is to use cd to change into the directory where the binary is. Otherwise you get an error like ldd: ./ls: No such file or directory. The only time that you would need to run ldd with root privileges would be if the binary has restrictive permissions placed on it.

As I mentioned in the Background section, you need to be aware of dependency chains when using shared libraries. Just because you’ve run the ldd command on an executable and satisfied all of its top level dependencies doesn’t mean that there aren’t more dependencies lurking underneath. If your program still won’t run, you should check each of the top level libraries to see if any of them have their own library dependencies that are unmet. You continue that process, running ldd on each library in each layer, until you’ve satisfied all of the dependencies.
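A rough sketch of automating one level of that check (the program path here is hypothetical):

# Run ldd on each library that the program resolves, looking for
# dependencies that are still missing one level down
for lib in $(ldd /usr/bin/someprogram | awk '$3 ~ /^\// {print $3}')
do
    echo "=== $lib ==="
    ldd "$lib" | grep "not found"
done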

Introducing ldconfig

Any time that you make changes to the installed libraries on your system, you’ll want to run the ldconfig command with root privileges to update your library cache. ldconfig rebuilds the /etc/ld.so.cache file of currently installed libraries based on what it first finds in the directories listed in the /etc/ld.so.conf file, and then in the /lib and /usr/lib directories. The /etc/ld.so.cache file is formatted in binary by ldconfig and so it’s not designed to be human readable, and should not be edited by hand. Formatting the ld.so.cache file in this way makes it more efficient for the system to retrieve the information. The ld.so.conf file may include a directive that reads include /etc/ld.so.conf.d/*.conf that tells ldconfig to check the ld.so.conf.d directory for additional configuration files. This allows the easy addition of configuration files to load third-party shared libraries such as those for MySQL. On some distributions, this include directive may be the only line you find in the ld.so.conf file.

You often need to run ldconfig manually because a Linux system cannot always know when you have made changes to the currently installed libraries. Many package management systems run ldconfig as part of the installation process, but if you compile and/or install a library without using the package management system, the system software may not know that there is a new library present. The same applies when you remove a shared library.
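For example, after compiling a library and installing it into a non-standard directory (the path below is just an illustration), the steps would look something like this when run as root:

# Make the new library directory known to the loader
echo "/usr/local/pgsql/lib" > /etc/ld.so.conf.d/pgsql.conf
# Rebuild /etc/ld.so.cache so that ld-linux.so can find the new library
ldconfig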

Listing 6 holds several options for the ldconfig command. This is by no means an exhaustive list, so be sure to check the man page for more information.

Listing 6

-C [file] Specifies an alternate cache file other than ld.so.cache
-f [file] Specifies an alternate configuration file other than
ld.so.conf
-n Rebuilds the cache using only directories specified on the
command line, skipping the standard directories and ld.so.conf
-N Only updates the symbolic links to libraries, skipping the
cache rebuilding step
-p --print-cache Lists the shared library cache, but needs to be piped to the
less command because of the amount of output
-v --verbose Gives output information about version numbers, links
created, and directories scanned
-X Opposite of -N, it rebuilds the library cache and skips
updating the links to the libraries
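
One quick check that builds on the -p option: after rebuilding the cache, you can confirm that a given library actually made it in. A sketch, with output that will vary from system to system:

$ ldconfig -p | grep libm.so
        libm.so.6 (libc6) => /lib/libm.so.6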

ldconfig is not the only method used to rebuild the library cache. Gentoo handles this task in a slightly different way, which I’ll talk about next.

Introducing env-update

Gentoo takes a slightly different path to updating the cache of installed libraries which includes the use of the env-update script. env-update reads library path configuration files from the /etc/env.d directory in much the same way that ldconfig reads files from /etc/ld.so.conf.d via the ld.so.conf include directive. env-update then creates a set of files within /etc, including ld.so.conf. After this, env-update runs ldconfig so that it reloads the cache of libraries into the /etc/ld.so.cache file.
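
On a typical Gentoo setup, the sequence after installing a library outside of the package manager would look something like the following, run from a root shell. The source command just pulls the freshly written environment settings into your current shell:

# env-update
# source /etc/profile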

How-To

Hopefully by the time you reach this section you have, or are beginning to get, a pretty good understanding of the commands used when dealing with shared libraries. Now I’m going to take you through a sample scenario of a PostgreSQL installation running on CentOS 5.4 to demonstrate how you would use these commands.

I have downloaded a bin installer to use on my CentOS installation instead of the PostgreSQL Yum repository because I wanted to install a specific older version of Postgres outside of the package management system. In most cases you’ll want to use a repository with your package management system though, as you’ll get a more integrated installation that can be kept up to date more easily. That’s assuming that your Linux distribution offers a repository mechanism for installing and updating packages, which not all distributions do.

After installing Postgres via the bin file, I take a look around and see that the majority of the PostgreSQL files are in the /opt/PostgreSQL directory. I decide to experiment with the binaries under the pgAdmin3 directory, and so I use the cd command to move to /opt/PostgreSQL/8.4/pgAdmin3/bin. Once I’m there, I try to run the psql command and get the output in Listing 7 (same as Listing 1).

Some of you may realize that I could probably have avoided the library error in Listing 7 by running the psql command from the /opt/PostgreSQL/8.4/bin directory. While this is true, for the sake of this example I’m going to forge ahead and figure out why it won’t run under the pgAdmin3 directory.

The main thing that I take away from the output in Listing 7 is that there is a shared library named libpq.so.5 that cannot be found by ld-linux.so. To dig just a little bit deeper, I use the ldd command and get the output in Listing 8.

Notice that the error given in Listing 7 only gives you the first shared library that’s missing. As you can see in Listing 8, this doesn’t mean that other libraries won’t be missing as well.

My next step is to see if the missing libraries are already installed somewhere on my system using the find command. If they’re not, I’ll have to use the package management system or the Internet to figure out which package(s) I need to install to get them. Listing 9 shows the output from the find command.
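
As a minimal sketch, the kind of find command I mean searches the whole file system for the library named in Listing 7 and discards the inevitable permission errors:

$ find / -name "libpq.so.5" 2>/dev/null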

After looking in both of the directories shown in the output, I notice that all of my other missing libraries are housed within them. If you were just temporarily testing some new features of the psql command, you could use the export command to set the LD_LIBRARY_PATH environment variable as I have in Listing 10.

You can see that once I’ve set the LD_LIBRARY_PATH variable, all I have to do is enter my PostgreSQL password and I’m greeted with the psql command line interface. I’ve used the /opt/PostgreSQL/8.4/lib/ library directory instead of the one beneath the pgAdmin3 directory as a matter of preference. In this case both directories include the same required libraries. For a permanent solution, we can add the path via the ld.so.conf file.
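
As a sketch, the temporary fix amounts to something like the following two commands; the psql invocation is illustrative and your options may differ:

$ export LD_LIBRARY_PATH=/opt/PostgreSQL/8.4/lib/
$ ./psql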

I could just add /opt/PostgreSQL/8.4/lib/ directly to the ld.so.conf file on its own line, but since the ld.so.conf file on my installation has the include ld.so.conf.d/*.conf directive, I’m going to add a separate conf file instead. In Listing 11 you can see that I’ve echoed the PostgreSQL library path into a file called postgres-i386.conf under the /etc/ld.so.conf.d directory. After checking to make sure the file has the directory in it, I run the ldconfig command to update the library cache.
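
The commands involved look roughly like this, run from a root shell since the redirection needs root privileges:

# echo "/opt/PostgreSQL/8.4/lib/" > /etc/ld.so.conf.d/postgres-i386.conf
# cat /etc/ld.so.conf.d/postgres-i386.conf
/opt/PostgreSQL/8.4/lib/
# ldconfig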

Make sure that you unset the LD_LIBRARY_PATH variable though, so that you can verify that it was your ld.so.conf configuration change that fixed the problem and not the environment variable. Issuing the command line unset LD_LIBRARY_PATH will accomplish this for you.

There are many scenarios beyond the one in this example, but it gives you the concepts used to work through the majority of shared library problems that you’re likely to come up against as a system administrator. If you’re interested in delving more deeply though, there are several links in the Resources section that should help you.

Tips and Tricks

I have read that running ldd on an untrusted program can open your system up to a malicious attack. This can happen because an executable’s embedded ELF information can be crafted to specify its own loader, which ldd may end up executing, effectively running the untrusted code. The man pages on the Ubuntu and Red Hat systems that I checked don’t mention anything about this security concern, but you’ll find a very good article by Peteris Krumins in the Resources section of this post. I would suggest at least skimming Peteris’ post so that you’re aware of the security implications of running ldd on unverified code.

Although it’s a little bit beyond the scope of this post, you can compile a program from source and manually control which libraries it links to. This is yet another way to work around library compatibility issues. You use the GNU C Compiler/GNU Compiler Collection (gcc) along with its -L and -l options to accomplish this. Have a look at item 13 (the YoLinux tutorial) in the Resources section for an example, and the gcc man page for details on the options.

Have a look at the readelf and nm commands if you want a more in-depth look at the internals of the binaries and libraries that you’re working with. readelf shows you some extra information on your ELF files by reading and parsing their internal information, and nm lists the symbols (functions, etc) within an object file.

You can temporarily preempt your current set of libraries and their functions with the LD_PRELOAD environment variable and/or the /etc/ld.so.preload file. Once these are set, the dynamic library loader will use the preload libraries/functions in preference to the ones that you have cached using ldconfig. This can help you work around shared library problems in a few instances.
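
A sketch of preloading a library for a single command only, where both the library and program names are hypothetical:

$ LD_PRELOAD=/home/user/lib/liboverride.so ./program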

If you run into a program that has its required library path(s) hard coded into it, you can create symbolic links from each one of the missing libraries to the location that’s expected by the executable. This technique can also help you work around incompatibilities in the naming conventions between what your system software expects, and what libraries are actually named. I talk about using symbolic links in this way a little more in the Troubleshooting section.
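
A sketch of that symbolic link technique, using hypothetical library names; running ldconfig afterward lets the cache pick up the new link:

# ln -s /usr/local/lib/libfoo.so.1.2.3 /usr/lib/libfoo.so.1
# ldconfig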

Scripting

These scripts are somewhat simplified and in most cases could be done other ways too, but they will work to illustrate the concepts. If you use these scripts, make sure you adapt them to your situation. Never run a script or command without understanding what it will do to your system.

The first script shown in Listing 12 can be used to search directory trees for binaries with missing libraries. It makes use of the ldd and find commands to do the bulk of the work, looping through their output. Since I have heavily commented the scripts in Listing 12 and Listing 13, I won’t explain the details of how they work in this text.

Listing 12

#!/bin/bash -
# These variables are designed to be changed if your Linux distro's ldd output
# varies from Red Hat or Ubuntu for some reason
iself="not a dynamic executable" # Used to see if executable is not dynamic
notfound="not.*found" # Used to see if ldd doesn't find a library
# Step through all of the executable files in the user specified directory
for exe in $(find "$1" -type f -perm /111)
do
    # Check to see if ldd can get any information from this executable. It won't
    # if the executable is something like a script or a non-ELF executable.
    if [ -z "$(ldd "$exe" | grep -i "$iself")" ]
    then
        # Step through each of the lines of output from the ldd command,
        # substituting : for a delimiter instead of a space
        for line in $(ldd "$exe" | tr " " ":")
        do
            # If ldd gives us output with our "not found" variable string in it,
            # we'll need to warn the user that there is a shared library issue
            if [ -n "$(echo "$line" | grep -i "$notfound")" ]
            then
                # Grab the first field, ignoring the words "not" or "found".
                # If we don't do this, we'll end up grabbing a field with a
                # word and not the library name.
                library="$(echo "$line" | cut -d ":" -f 1)"
                printf "Executable %s is missing shared object %s\n" "$exe" "$library"
            fi
        done
    fi
done

When run on the /opt/PostgreSQL directory mentioned above, it finds all of the programs that exhibit our missing library problem. As it stands now, this script will only check the first layer of library dependencies. One way to improve it would be to make the script follow the dependency chain of every library to the end, making sure that there is not a library farther down the chain that is missing. Better yet, you could add a “max-depth” option so that the user could specify how deeply into the dependency chain they wanted the script to check before moving on. A max-depth setting of “0” would allow the user to specify that they wanted the script to follow the dependency chain to the very end.

In Listing 13, I have created a wrapper script that could be used when developing new software, or as a last-ditch effort to work around a really tough shared library problem. It sets LD_LIBRARY_PATH for the one command being launched rather than exporting it, so we’re not setting LD_LIBRARY_PATH in the overall environment, which could cause problems for other programs if there are library naming conflicts.

Listing 13

#!/bin/bash -
# Set up the variables to hold the PostgreSQL lib and bin paths. These paths may
# vary on your system, so change them accordingly.
LIB_PATH=/opt/PostgreSQL/8.4/lib # Postgres library path
BIN_FILE=/opt/PostgreSQL/8.4/pgAdmin3/bin/psql # The binary to run
# Start the specified program with the library path set for it alone, and have
# it replace this process. This does not change LD_LIBRARY_PATH in the parent shell.
exec env LD_LIBRARY_PATH="$LIB_PATH" "$BIN_FILE"

I’ve broken the library and binary paths out into variables to make it easier for you to adapt this script for use on your system. This script could easily serve as a template for other wrapper scripts as well, any time that you need to alter the environment before launching a program. Remember though that this wrapper script should not be used as a permanent solution to your shared library problems unless you have no other choice.

Troubleshooting

In some cases, a program may have been hard coded to look for a specific library on your system in a certain path, thus ignoring your local library settings. In order to fix this problem, you can research what version/path of the library the program is looking for and then create a symbolic link between the expected library location and a compatible library. In some cases you can recompile the program with options set to change how/where it looks for libraries. If the programmer was really kind, they may have included a command line option to set the library location, but this would be the exception rather than the rule when library locations are hard coded.

The ldd command will not work with older style a.out binaries, and will probably give output mentioning “DLL jump” if it encounters one. It’s a good idea not to trust what ldd tells you when you’re running it on these types of binaries because the output is unpredictable and inaccurate. Newer ELF binaries have support for ldd built into them via the compiler, which is why they work.

Just because the dynamic linker finds a library doesn’t mean that the library isn’t missing “symbols” (things like functions/subroutines). If this happens, you may be able to match the ldd command output to libraries that are installed, but your program will still have unpredictable behavior (like not starting or crashing) when it tries to access the symbol(s) that are missing. In this case the ldd command’s -d and -r options can give you more information on the missing symbols, and you’ll need to dig deeper into the software developer’s documentation to see if there are compatibility issues with the specific version of the library that you’re running. Remember that you can always use the LD_LIBRARY_PATH variable to temporarily test different versions of the library to see if they fix your problem.
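
As a sketch of that kind of check, you can run ldd -r against the suspect library or binary (the name here is hypothetical); any unresolved symbols are reported along with the file that needs them:

$ ldd -r /usr/lib/libexample.so.1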

There may be some rare cases where ldconfig is not able to determine a library type (libc4, libc5, or libc6) from its embedded information. If this happens, you can specify the type manually in the /etc/ld.so.conf file with a directive like dirname=TYPE, where TYPE can be libc4, libc5, or libc6. According to the man page for ldconfig, you can also specify this information directly on the command line to keep the change on a temporary basis.
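
As a sketch, such a directive would be a single line in /etc/ld.so.conf along these lines, with a hypothetical directory:

/opt/legacy/lib=libc5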

If you have stubborn library problems that you just can’t seem to get a handle on, you might try setting the LD_DEBUG environment variable. Try typing export LD_DEBUG="help" first and then run a command (like ls) so that you can see what options are available. I normally use “all”, but you can be more selective in your choices. The next time that you run a program, you’ll see output that is like a stack trace for the library loading process. You can follow this output through to see exactly where your library problem is occurring. Issue unset LD_DEBUG to disable this debugging output again.
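
For example, to watch just the library search process for a single command without exporting the variable (libs is one of the keywords listed by the help output):

$ LD_DEBUG=libs ls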

Conclusion

I hope that this post has armed you with the knowledge that you need to solve any shared library problems that you might come up against. Work through them step by step: determine which libraries are needed, find out whether they’re already installed, install any that are missing, and make sure that your Linux distribution can find them. Follow that process and you should have no problem fixing most of your dynamic library issues. If you have any questions, or have any information that should be added to this post, leave a comment or drop me an email. I welcome your feedback.

Video

Audio

Quick Start

If you just want enough information to fix your problem quickly, you can read the How-To section of this post and skip the rest. I would highly recommend reading everything though, as a good understanding of the concepts and commands outlined here will serve you well in the future. We also have Video and Audio included with this post that may be a good quick reference for you. Don’t forget that the man and info pages of your Linux/Unix system can be an invaluable resource as well when you’re trying to solve problems.

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands easier, but if you’re not already familiar with the concepts presented here, typing the commands yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Overview

When you try to access an object on a Linux file system that is in use, you may get an error telling you that the device or resource you want is busy. When this happens, you may see a message like the one in Listing 1.

Listing 1

$ sudo umount /media/4278-62C2/
umount: /media/4278-62C2: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))

Notice that there are two commands suggested at the end of the output – lsof and fuser – which are the two commands that this post focuses on.

Introducing lsof

lsof is used to LiSt Open Files, hence the command’s name. It’s a handy tool normally used to list the open files on a system along with the associated processes or users, and can also be used to gather information on your system’s network connections. When run without options, lsof lists all open files along with all of the active processes that have them open. To get a full and accurate view of what files are open by what processes, make sure that you run the lsof command with root privileges.

To use lsof on a specific file, you have to specify the full path to the file. Remember that everything in Linux is a file, so you can use lsof on anything from directories to devices. This makes lsof a very powerful tool once you’ve learned it.
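
As a minimal example (the log file path is just an illustration and varies by distribution), this shows which processes currently have a given file open:

$ sudo lsof /var/log/syslog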

There are many options for lsof, and I have listed summaries for the ones that I find most useful in Listing 2. Anything with square brackets around it (“[” and “]”) is an argument to the option, and a pipe (“|”) means that you can choose one of two alternatives ([4|6] means choose 4 or 6).

Listing 2

+d [directory] Scans the specified directory and all directories/files in its
top level to see if any are open.
+D [directory] Scans the specified directory and all directories/files in it
recursively to see if any are open.
-F [characters] Allows you to specify a list of characters used to split
the output up into fields to make it easier to process.
Type lsof -F ? for a list of characters.
-i [address] Shows the current user's network connections and the processes
associated with them. Connection types can be specified via an
argument: [4|6][protocol][@hostname|hostaddr][:service|port]
-N Enables the scanning/listing of files on NFS mounts.
-r [seconds] Causes lsof to repeat its scan every so many seconds,
indefinitely.
+r [seconds] A variation of the -r option that will exit on the first iteration
when no open files are listed. It uses seconds as a delay value.
-t Strips all data out of the output except the PIDs. This is good
for scripts and piping data around.
-u [user|UID] Allows you to show the open files for the user or user ID that
you specify.
-w Causes warning messages to be suppressed. Make sure that the
warnings are harmless before you suppress them.
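
To give one example of the -i option in use, the following sketch lists the processes with TCP activity on port 22, such as a listening SSH daemon:

$ sudo lsof -i TCP:22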

If you are extra security conscious, have a look at the SECURITY section of the lsof man page. There are three main issues that the developers of lsof consider potential security caveats. Many distributions have addressed at least some of these concerns already, but it doesn’t hurt to understand them yourself.

Introducing fuser

By default fuser just gives you the PIDs of processes that have a file open on a system. The PIDs are accompanied by a single character that represents the type of access that the process is performing on that file (f=open file, m=memory mapped file or shared library, c=current directory, etc). If you want output that’s somewhat similar to the lsof command, you can add the -v option for verbose output. According to the man page, this formats the output in a “ps-like” style. To get a full and accurate view of what files are open by all processes, make sure that you run fuser with root privileges. Listing 3 holds some of the fuser options that I find most useful.

Listing 3

-i Used with the -k option, it prompts the user before killing each process.
-k Attempts to kill all processes that are accessing the specified file.
-m Shows the users and processes accessing any file within a mounted file
system.
-s Silent mode where no output is shown. This is useful if you only want to
check the exit code of fuser in a script to see if it was successful.
-u Appends the user name associated with each process to each PID in the
output.
-v Gives a "ps-like" output format that is somewhat similar to the default
lsof output.
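
As a quick example of the verbose output, running fuser against your current directory should at least show your own shell with the c (current directory) access type:

$ fuser -vu .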

fuser is supposed to be a little lighter weight than lsof when it comes to using your system resources. To get an idea of what “a little” meant, I ran some very quick tests on both of the commands. I found that fuser consistently took only 30% – 50% of the time that it took lsof to run the same scan, but used about the same amount of RAM (within 5%). My tests were quick and dirty using the ps and time commands, so your mileage may vary. In any event very few users, if any, will notice a performance difference between the two commands because they use such a small amount of system resources.

How-To

Hopefully by the time you reach this section you have, or are beginning to get, a pretty good understanding of both the lsof and fuser commands. Either one of them can be used to solve device and/or resource busy errors in Linux. Let’s take a look at a few scenarios.

Say that I have mounted a CD to /media/cdrom0, used it for a while copying files from it, and now want to unmount it. The problem is that Linux won’t let me unmount the CD. I get the familiar error in Listing 4, but you can see that I then use lsof and fuser to track down what’s going on.

Both commands tell me what PID is accessing the file system mounted on /media/cdrom0 (2238). Each of the two commands also tells me that the process is using a directory within the /media/cdrom0 file system as its current working directory. This is shown as the cwd specifier in the lsof output, and the letter c in the output of fuser (appended to the PID). Finally, each of the commands tells me that a process I (jwright) started is using the directory, and lsof goes one step further in telling me the exact directory the process (listed as bash in the COMMAND column) is using as its current working directory.

Armed with this information, I start searching around and find that I have a virtual terminal open in which I used the cd command to descend into the /media/cdrom0/boot directory. I have to change to a directory outside of the mounted file system or exit that virtual terminal for the umount command to succeed. This example uses a simple oversight on my part to illustrate the point, but many times the process holding the file open is going to be outside of your direct control. In that case you have to decide whether or not to contact the user who owns the process and/or kill the process to release the file. Be careful when killing processes without contacting your users though, as it can cause the user who is accessing the file/directory some major problems.

Another scenario is something that has happened to me when running Arch Linux. At seemingly random intervals, MPlayer (run from the command line) would refuse to output sound and started complaining that the resource /dev/dsp was busy and that it couldn’t open /dev/snd/pcmC0D0p. Listing 5 shows an excerpt from the error MPlayer was giving me, and Listing 6 is the output that I got from running the lsof command on /dev/snd/pcmC0D0p.

After doing some research, I found that the exe process was associated with the version of the Google Chrome browser that I was running and with its use of the Flash player. I closed Firefox and Chrome and then tested MPlayer again, but still didn’t have any sound. I then ran the same lsof command again and noticed that the exe process was still there, apparently hung. I killed the exe process and was then able to get sound out of MPlayer immediately.

Through this investigation I found that the problem was not truly random, but occurred whenever Chrome came in contact with a Flash movie with sound. The silent MPlayer problem only seemed random because I was not accessing Flash movies with sound at consistent intervals. Now I’m not meaning to pick on Arch Linux here, because the problem seems to have been present in other distributions as well. Also, I have been unable to reproduce this problem on newer versions of Google Chrome running on Arch Linux, telling me that the issue has probably been resolved.

Listing 7 shows a basic example of how you might use the lsof command to track what services/processes are using the libwrap (TCP Wrappers) library. Keep in mind that the | head -4 text at the end of the command line just selects the first 4 lines of output.
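
The command line I’m describing looks something like this:

$ lsof | grep libwrap | head -4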

If you wanted to get a full system-wide view of the processes using libwrap, you would run the command with sudo, or after issuing the su command (I recommend using sudo instead, though).

Carrying this example further, we could add the -i option to display the network connection information as well (Listing 8). The TCP argument to the option tells lsof that we only want to look at TCP connections, excluding other connections like UDP. This is a good way to study the services that are currently being protected by the TCP Wrappers mechanism. Please note that this command may take some time to complete.

By using the -t option, you receive output from lsof that can then be passed to another command like kill. Listing 9 shows that I have opened a file with two instances of tail -f so that tail will keep the file open and update me on any data that is appended to it. Listing 10 shows a quick way to terminate both of the tail processes in one shot using the -t option and back-ticks.

If you haven’t seen back-ticks (`) used before in the shell, it probably looks a little strange to you. The back-ticks in this instance tell the shell to execute the command between them, and then replace the back-ticked command with the output. So, for Listing 10 the section of the line within the back-ticks would be replaced by the list of PIDs that are accessing /tmp/testfile.txt. These PIDs are passed to the kill command which sends SIGTERM to each instance of tail, causing them to exit.
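
Putting that together, the command line amounts to something like the following. Note that the newer $( ) form of command substitution would work here just as well:

$ kill `lsof -t /tmp/testfile.txt`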

An alternative to this would be what you see in Listing 11, where the -k and -i options of the fuser command are used to interactively kill both instances of tail.

Tips and Tricks

Don’t use the -k option with fuser without checking to see which processes it will kill first. The easiest way to do this is by using the -ki option combination so that fuser will prompt you before killing the processes (see Listing 11). You can specify a signal other than SIGKILL to send to a process with the -SIGNAL argument to the -k option.

As mentioned above, the -r option of lsof causes it to repeat its scan every so many seconds, or indefinitely. This can be very useful when you are writing a script that may need to call lsof repeatedly because it avoids the wasted overhead of starting the command from scratch each time.

lsof functionality is supposed to be fairly standard across the Linux and Unix landscape, so using lsof in your scripts can be an advantage when you’re shooting for portability.

When you are using fuser to check who or what is using a mounted file system, add the -m option to your command line. By doing this, you tell fuser to list the users/processes that have files open in the entire file system, not just the directory you specify. This will prevent you from being confused when fuser doesn’t give you any information even though you know the mounted file system is in use. So, you would issue a command that’s something like

sudo fuser -mu /media/cdrom

to save you that trouble. You still don’t know which subdirectory or file is being held open, but this is easily solved by using the +D option with lsof to search the mounted file system recursively.

sudo lsof +D /media/cdrom/

Scripting

These scripts are somewhat simplified and in most cases could be done other ways too, but they will work to illustrate the concepts. If you use these scripts, make sure you adapt them to your situation. Never run a script or command without understanding what it will do to your system.

For the first scripting example, let’s say that it’s 5:00 and you need to leave for the day, but you also have to delete a shared configuration file that’s still being used by several people. Presumably the configuration file will be automatically recreated when someone needs it next. Listing 12 shows one way of taking care of the file deletion while still leaving on time, and it uses lsof. For the sake of the example, this assumes that every system with access to the shared configuration file releases it when its users finish and log out for the night. Make sure to run this script with root privileges or it might not see everyone that’s using the file before deleting it, causing a mess.

Listing 12

#!/bin/bash -
# Check every 30 seconds to see if everyone is done with the file
lsof +r 30 /tmp/testfile.txt > /dev/null 2>&1
# We've made it past the lsof line, so we must be ok to delete the file
rm /tmp/testfile.txt

You end up with a very quick and simple script that doesn’t require a continuous while loop, or a cron job to finish its task.

Another example would be using fuser to make a decision in a script. The script could check to see if a preferred resource is in use and move on to the next one if it is. Listing 13 shows an example of such a script.

Listing 13

#!/bin/bash -
# Make sure to run this script with root privileges or it
# may not work.
# Set up a counter to track which console we are checking
COUNTER=0
# Loop until we find an unused virtual console or run out of consoles
while true
do
    # Check to see if any user/process is using the virtual console
    fuser -s /dev/tty$COUNTER
    # Check to see if we've found an unused virtual console
    if [ $? -ne 0 ]
    then
        echo "The first unused virtual console is" /dev/tty$COUNTER
        break
    fi
    # Get ready to check the next virtual console
    COUNTER=$((COUNTER+1))
    # Try to get a listing of the virtual console we are checking
    ls /dev/tty$COUNTER > /dev/null 2>&1
    # Check to see if we've run out of virtual consoles to check.
    # The ls command won't return anything if the file doesn't exist.
    if [ $? -ne 0 ]
    then
        echo "No unused virtual console was found."
        break
    fi
done

This script loops through all of the virtual console device files (/dev/tty*) and looks for one that fuser says is unused. Notice that I’m checking the exit code of both fuser and ls via the built-in variable $?, which holds the exit status of the last command that was run.

That’s just a small sampling of what you can do with lsof and fuser within scripts. There are any number of ways to improve and expand upon the scripts that I’ve given in Listing 12 and Listing 13. Having an in-depth knowledge of the commands will open up a lot of possibilities for your scripts and even for your general use of the shell.

Troubleshooting

Every time that I try to run the lsof command on my Ubuntu 9.10 machine with administrative privileges, I get a warning.

This warning occurs when lsof tries to access the Gnome Virtual File System (gvfs), which is (among other things) a foundational part of Gnome’s Nautilus file manager. lsof is warning you that it doesn’t have the ability to look inside of the virtual file system, and so its output may not contain every relevant file. This warning should be harmless, and can be suppressed with the -w option.

If lsof stalls for a long time, you might need to use some of the “Precautionary Options” listed in the Apple Quickstart Guide in the Resources section. The lsof man page also has a group of sections, starting at BLOCKS AND TIMEOUTS, that may help you.

Conclusion

There’s a whole host of possibilities for the lsof and fuser commands beyond what I’ve mentioned here, but hopefully I’ve given you a good start. As with so many other things, the time you put into mastering your Linux system will pay you back again and again. If you have any information to add to what I’ve said here, feel free to drop a line in the comments section or send us an email.