16.5. File and Archiving Commands

Archiving

tar

The standard UNIX archiving utility.
[1]
Originally a
Tape ARchiving program, it has
developed into a general purpose package that can handle
all manner of archiving with all types of destination
devices, ranging from tape drives to regular files to even
stdout (see Example 3-4). GNU
tar has been patched to accept
various compression filters, for example: tar
czvf archive_name.tar.gz *, which recursively
archives and gzips
all files in a directory tree except dotfiles in the current
working directory ($PWD).
[2]

Some useful tar options:

-c create (a new
archive)

-x extract (files from
existing archive)

--delete delete (files
from existing archive)

This option will not work on magnetic tape
devices.

-r append (files to
existing archive)

-A append
(tar files to
existing archive)

-t list (contents of
existing archive)

-u update archive

-d compare archive with
specified filesystem

--after-date only process
files with a date stamp after
specified date

It may be difficult to recover data from a
corrupted gzipped tar
archive. When archiving important files, make multiple
backups.
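The most common tar options can be tried out safely in a scratch directory. What follows is a minimal sketch, not a backup procedure; the directory and file names are hypothetical, and it assumes GNU tar (for the -z gzip filter and -C).

```shell
#!/bin/bash
# Sketch: create, list, and extract a gzipped tar archive.
# All names below (project, readme.txt, ...) are hypothetical.

workdir=$(mktemp -d) || exit 1
mkdir -p "$workdir/project/docs"
echo "hello" > "$workdir/project/readme.txt"
echo "notes" > "$workdir/project/docs/notes.txt"

# -c create, -z filter through gzip, -f name of the archive file.
# -C changes directory first, so the stored paths are relative.
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# -t lists the contents without extracting anything.
listing=$(tar -tzf "$workdir/project.tar.gz")
echo "$listing"

# -x extracts, here into a separate directory.
mkdir "$workdir/restore"
tar -xzf "$workdir/project.tar.gz" -C "$workdir/restore"
restored=$(cat "$workdir/restore/project/docs/notes.txt")
echo "restored: $restored"

rm -rf "$workdir"
```

Using -C rather than an absolute path keeps the archive relocatable, since it unpacks relative to wherever it is extracted.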

shar

Shell archiving utility.
The text and/or binary files in a shell archive are
concatenated without compression, and the resultant
archive is essentially a shell script, complete with
#!/bin/sh header, containing all the
necessary unarchiving commands, as well as the files
themselves. Unprintable binary characters in the target
file(s) are converted to printable ASCII characters in the
output shar file. Shar
archives still show up in Usenet newsgroups,
but otherwise shar has been replaced
by tar/gzip.
The unshar command unpacks
shar archives.

The
mailshar command is a Bash script that
uses shar to concatenate multiple files
into a single one for e-mailing.
This script supports compression and uuencoding.

ar

Creation and manipulation utility for archives, mainly
used for binary object file libraries.

rpm

The Red Hat Package Manager, or
rpm utility provides a wrapper for
source or binary archives. It includes commands for
installing and checking the integrity of packages, among
other things.

A simple rpm -i package_name.rpm
usually suffices to install a package, though there are many
more options available.

rpm -qf identifies which package a
file originates from.

bash$ rpm -qf /bin/ls
coreutils-5.2.1-31

rpm -qa gives a
complete list of all installed rpm packages
on a given system. An rpm -qa package_name
lists only the package(s) corresponding to
package_name.

cpio

This specialized archiving copy command
(copy
input and output)
is rarely seen any more, having been supplanted by
tar/gzip. It still
has its uses, such as moving a directory tree. With an
appropriate block size (for copying) specified, it
can be appreciably faster than tar.

pax

The pax portable archive
exchange toolkit facilitates periodic
file backups and is designed to be cross-compatible
between various flavors of UNIX. It was designed
to replace tar and cpio.

pax -wf daily_backup.pax ~/linux-server/files
# Creates a tar archive of all files in the target directory.
# Note that the options to pax must be in the correct order --
#+ pax -fw has an entirely different effect.
pax -f daily_backup.pax
# Lists the files in the archive.
pax -rf daily_backup.pax ~/bsd-server/files
# Restores the backed-up files from the Linux machine
#+ onto a BSD one.

Note that pax handles many of
the standard archiving and compression commands.

Compression

gzip

The standard GNU/UNIX compression utility, replacing
the inferior and proprietary
compress. The corresponding decompression
command is gunzip, which is the equivalent of
gzip -d.

The -c option sends the output of
gzip to stdout. This
is useful when piping to other
commands.

The zcat filter decompresses a
gzipped file to
stdout, as possible input to a pipe or
redirection. This is, in effect, a cat
command that works on compressed files (including files
processed with the older compress
utility). The zcat command is equivalent to
gzip -dc.

On some commercial UNIX systems, zcat
is a synonym for uncompress -c,
and will not work on gzipped
files.
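As a rough illustration of the -c and -dc options in a pipe, consider the sketch below. The file name words.txt is hypothetical, and gzip -dc is used in place of zcat, since the two are equivalent on GNU systems.

```shell
#!/bin/bash
# Sketch: compress to stdout, then decompress into a pipe.

workdir=$(mktemp -d) || exit 1
printf 'alpha\nbeta\ngamma\n' > "$workdir/words.txt"

# gzip -c writes the compressed stream to stdout,
#+ leaving the original file untouched.
gzip -c "$workdir/words.txt" > "$workdir/words.txt.gz"

# gzip -dc decompresses to stdout, suitable as input to a pipe.
line_count=$(gzip -dc "$workdir/words.txt.gz" | wc -l)
second_word=$(gzip -dc "$workdir/words.txt.gz" | sed -n '2p')

echo "lines: $line_count"
echo "second word: $second_word"

rm -rf "$workdir"
```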

compress, uncompress

This is an older, proprietary compression
utility found in commercial UNIX distributions. The
more efficient gzip has largely
replaced it. Linux distributions generally include a
compress workalike for compatibility,
although gunzip can decompress files
treated with compress.

The znew command transforms
compressed files into
gzipped ones.

sq

Yet another compression (squeeze)
utility, a filter that works only on sorted
ASCII word lists. It
uses the standard invocation syntax for a filter,
sq < input-file > output-file.
Fast, but not nearly as efficient as gzip. The corresponding
uncompression filter is unsq, invoked
like sq.

The output of sq may be
piped to gzip for further
compression.

zip, unzip

Cross-platform file archiving and compression utility
compatible with DOS pkzip.exe.
"Zipped" archives seem to be a more
common medium of file exchange on the Internet than
"tarballs."

lzma

Highly efficient Lempel-Ziv-Markov compression.
The syntax of lzma is similar to
that of gzip. The 7-zip Website
has more information.

xz, unxz, xzcat

A new high-efficiency compression tool, backward compatible
with lzma, and with an invocation
syntax similar to gzip. For
more information, see the Wikipedia
entry.

File Information

file

A utility for identifying file types. The command
file file-name will return a
file specification for file-name,
such as ascii text or
data. It references
the magic numbers
found in /usr/share/magic,
/etc/magic, or
/usr/lib/magic, depending on the
Linux/UNIX distribution.

The -f option causes
file to run in batch mode, to read from
a designated file a list of filenames to analyze. The
-z option, when used on a compressed
target file, forces an attempt to analyze the uncompressed
file type.

whereis

Similar to which,
whereis command gives the
full path to "command," but also to its
manpage.

bash$ whereis rm

rm: /bin/rm /usr/share/man/man1/rm.1.bz2

whatis

whatis command looks up
"command" in the
whatis database. This is useful
for identifying system commands and important configuration
files. Consider it a simplified man
command.

bash$ whatis whatis

whatis (1) - search the whatis database for complete words

Example 16-33. Exploring /usr/X11R6/bin

#!/bin/bash
# What are all those mysterious binaries in /usr/X11R6/bin?
DIRECTORY="/usr/X11R6/bin"
# Try also "/bin", "/usr/bin", "/usr/local/bin", etc.
for file in "$DIRECTORY"/*   # Quoted, in case of spaces in the path.
do
  whatis "$(basename "$file")"   # Echoes info about the binary.
done
exit 0
# Note: For this to work, you must create a "whatis" database
#+ with /usr/sbin/makewhatis.
# You may wish to redirect output of this script, like so:
# ./what.sh >>whatis.db
# or view it a page at a time on stdout,
# ./what.sh | less

strings

Use the strings command to find
printable strings in a binary or data file. It will list
sequences of printable characters found in the target
file. This might be handy for a quick 'n dirty examination
of a core dump or for looking at an unknown graphic image
file (strings image-file | more might
show something like JFIF,
which would identify the file as a jpeg
graphic). In a script, you would probably
parse the output of strings
with grep or sed. See Example 11-8
and Example 11-10.

diff, patch

diff: flexible file comparison
utility. It compares the target files line-by-line
sequentially. In some applications, such as comparing
word dictionaries, it may be helpful to filter the
files through sort
and uniq before piping them
to diff. diff file-1
file-2 outputs the lines in the files that
differ, with angle brackets (< and >) showing which file
each particular line belongs to.

The --side-by-side option to
diff outputs each compared file, line by
line, in separate columns, with non-matching lines marked. The
-c and -u options likewise
make the output of the command easier to interpret.

Various fancy frontends are available for
diff, such as sdiff,
wdiff, xdiff, and
mgdiff.

The diff
command returns an exit status of 0
if the compared files are identical,
1 if they differ, and
2 if an error occurs (for example, when a file
is missing or unreadable). This permits use of
diff in a test construct within a shell
script (see below).
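One way to branch on the exit status of diff is sketched here. The function compare_files is a hypothetical helper, not a standard command, and the sketch assumes a diff that supports -q and reports trouble with a status greater than 1, as GNU diff does.

```shell
#!/bin/bash
# Sketch: using the exit status of diff in a test construct.

workdir=$(mktemp -d) || exit 1
printf 'one\ntwo\n'   > "$workdir/a.txt"
printf 'one\ntwo\n'   > "$workdir/b.txt"
printf 'one\nthree\n' > "$workdir/c.txt"

# compare_files is a hypothetical helper.
# -q reports only whether the files differ; the output is discarded.
compare_files () {
  diff -q "$1" "$2" > /dev/null 2>&1
  case $? in
    0) echo "identical" ;;
    1) echo "different" ;;
    *) echo "error" ;;      # For example, a missing file.
  esac
}

same=$(compare_files "$workdir/a.txt" "$workdir/b.txt")
differ=$(compare_files "$workdir/a.txt" "$workdir/c.txt")
missing=$(compare_files "$workdir/a.txt" "$workdir/no-such-file")

echo "a vs b: $same"
echo "a vs c: $differ"
echo "a vs nothing: $missing"

rm -rf "$workdir"
```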

A common use for diff is generating
difference files to be used with patch.
The -e option outputs files suitable
for ed or ex
scripts.

patch: flexible versioning
utility. Given a difference file generated by
diff, patch can
upgrade a previous version of a package to a newer version.
It is much more convenient to distribute a relatively
small "diff" file than the entire body of a
newly revised package. Kernel "patches" have
become the preferred method of distributing the frequent
releases of the Linux kernel.

patch -p1 <patch-file
# Takes all the changes listed in 'patch-file'
# and applies them to the files referenced therein.
# This upgrades to a newer version of the package.

The diff command can also
recursively compare directories (for the filenames
present).

bash$ diff -r ~/notes1 ~/notes2
Only in /home/bozo/notes1: file02
Only in /home/bozo/notes1: file03
Only in /home/bozo/notes2: file04

Use zdiff to compare
gzipped files.

Use diffstat to create
a histogram (point-distribution graph) of output from
diff.

diff3, merge

An extended version of diff that compares
three files at a time. This command returns an exit value
of 0 upon successful execution, but unfortunately this gives
no information about the results of the comparison.

bash$ diff3 file-1 file-2 file-3
====
1:1c
This is line 1 of "file-1".
2:1c
This is line 1 of "file-2".
3:1c
This is line 1 of "file-3"

The merge
(3-way file merge) command is an interesting adjunct to
diff3. Its syntax is
merge Mergefile file1 file2.
The result is to output to Mergefile
the changes that lead from file1
to file2. Consider this command
a stripped-down version of patch.

sdiff

Compare and/or edit two files in order to merge
them into an output file. Because of its interactive nature,
this command would find little use in a script.

cmp

The cmp command is a simpler version of
diff, above. Whereas diff
reports the differences between two files,
cmp merely shows at what point they
differ.

Like diff, cmp
returns an exit status of 0 if the compared files are
identical, and 1 if they differ. This permits use in a test
construct within a shell script.
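A minimal sketch of cmp in a test construct follows. The file names are hypothetical. The -s option silences cmp so that only its exit status matters.

```shell
#!/bin/bash
# Sketch: cmp in a test construct, and its report of the first difference.

workdir=$(mktemp -d) || exit 1
printf 'abcdef\n' > "$workdir/original"
cp "$workdir/original" "$workdir/copy"
printf 'abcXef\n' > "$workdir/changed"

# -s (silent) suppresses all output; only the exit status is used.
if cmp -s "$workdir/original" "$workdir/copy"
then
  copy_status="same"
else
  copy_status="differs"
fi

# Without -s, cmp reports where the first difference occurs.
# The "|| true" keeps a nonzero status from aborting a "set -e" script.
first_diff=$(cmp "$workdir/original" "$workdir/changed" || true)

echo "copy: $copy_status"
echo "$first_diff"

rm -rf "$workdir"
```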

comm

Versatile file comparison utility. The files must be
sorted for this to be useful.

comm -options first-file second-file

comm file-1 file-2 outputs three columns:

column 1 = lines unique to file-1

column 2 = lines unique to file-2

column 3 = lines common to both.

The options allow suppressing output of one or more columns.

-1 suppresses column
1

-2 suppresses column
2

-3 suppresses column
3

-12 suppresses both columns
1 and 2, etc.

This command is useful for comparing
"dictionaries" or word
lists -- sorted text files with one word per
line.
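The column-suppression options can be combined to pick out exactly one column, as in this sketch. The word lists are hypothetical and already sorted; LC_ALL=C is set so that comm and the sort order agree.

```shell
#!/bin/bash
# Sketch: comparing two sorted word lists with comm.

workdir=$(mktemp -d) || exit 1
printf 'apple\nbanana\ncherry\n' > "$workdir/list1"
printf 'banana\ncherry\ndate\n'  > "$workdir/list2"

# -23 suppresses columns 2 and 3, leaving lines unique to list1.
only_in_1=$(LC_ALL=C comm -23 "$workdir/list1" "$workdir/list2")

# -13 leaves lines unique to list2.
only_in_2=$(LC_ALL=C comm -13 "$workdir/list1" "$workdir/list2")

# -12 leaves only the lines common to both files.
common=$(LC_ALL=C comm -12 "$workdir/list1" "$workdir/list2")

echo "only in list1: $only_in_1"
echo "only in list2: $only_in_2"
echo "common:"
echo "$common"

rm -rf "$workdir"
```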

Utilities

basename

Strips the path information from a file name, printing
only the file name. The construction basename
$0 lets the script know its name, that is, the name it
was invoked by. This can be used for "usage" messages if,
for example, a script is called with missing arguments:

echo "Usage: `basename $0` arg1 arg2 ... argn"

dirname

Strips the basename from
a filename, printing only the path information.

basename and dirname
can operate on any arbitrary string. The argument
does not need to refer to an existing file, or
even be a filename for that matter (see Example A-7).
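Since both commands treat their argument as a plain string, they can be tried on a path that does not exist. The path in this short sketch is hypothetical.

```shell
#!/bin/bash
# Sketch: splitting a path string with basename and dirname.

path="/home/bozo/daily-journal.txt"   # Hypothetical; need not exist.

name=$(basename "$path")        # File name only.
dir=$(dirname "$path")          # Path only.
stem=$(basename "$path" .txt)   # A second argument strips a suffix.

echo "name = $name"
echo "dir  = $dir"
echo "stem = $stem"

# Any string containing slashes will do.
last=$(basename "alpha/beta/gamma")
echo "last = $last"
```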

sum, cksum, md5sum, sha1sum

These are utilities for
generating checksums. A
checksum is a number
[3]
mathematically calculated from the contents of a file,
for the purpose of checking its integrity. A script might
refer to a list of checksums for security purposes, such
as ensuring that the contents of key system files have not
been altered or corrupted. For security applications, use
the md5sum (message
digest 5
checksum) command, or better yet, the
newer sha1sum (Secure Hash Algorithm).
[4]

Security consultants have demonstrated that even
sha1sum can be compromised. Fortunately,
newer Linux distros include longer bit-length
sha224sum,
sha256sum,
sha384sum, and
sha512sum commands.
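A script that guards a file against alteration might record a checksum once and verify it later. The following is only a sketch, assuming sha256sum with its -c (check) option is available; the file names and the check_integrity helper are hypothetical.

```shell
#!/bin/bash
# Sketch: recording and later verifying a SHA-256 checksum.

workdir=$(mktemp -d) || exit 1
echo "important settings" > "$workdir/config"

# Record the checksum of the file in its known-good state.
( cd "$workdir" && sha256sum config > checksums.sha256 )

# check_integrity is a hypothetical helper.
# sha256sum -c re-computes each checksum listed in the file
#+ and returns nonzero if any of them fails to match.
check_integrity () {
  if ( cd "$workdir" && sha256sum -c checksums.sha256 > /dev/null 2>&1 )
  then
    echo "intact"
  else
    echo "altered"
  fi
}

before=$(check_integrity)
echo "tampered" >> "$workdir/config"   # Simulate an unwanted change.
after=$(check_integrity)

echo "before change: $before"
echo "after change:  $after"

rm -rf "$workdir"
```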

uuencode

This utility encodes binary files (images, sound files,
compressed files, etc.) into ASCII characters, making
them suitable for transmission in the body of an
e-mail message or in a newsgroup posting. This is
especially useful where MIME (multimedia) encoding
is not available.

uudecode

This reverses the encoding, decoding
uuencoded files back into the
original binaries.

Example 16-39. Uudecoding encoded files

#!/bin/bash
# Uudecodes all uuencoded files in current working directory.
lines=35        # Allow 35 lines for the header (very generous).
for File in *   # Test all the files in $PWD.
do
  # Quoting "$File" guards against filenames containing spaces.
  search1=$(head -n "$lines" "$File" | grep begin | wc -w)
  search2=$(tail -n "$lines" "$File" | grep end | wc -w)
  # Uuencoded files have a "begin" near the beginning,
  #+ and an "end" near the end.
  if [ "$search1" -gt 0 ]
  then
    if [ "$search2" -gt 0 ]
    then
      echo "uudecoding - $File -"
      uudecode "$File"
    fi
  fi
done
# Note that running this script upon itself fools it
#+ into thinking it is a uuencoded file,
#+ because it contains both "begin" and "end".
# Exercise:
# --------
# Modify this script to check each file for a newsgroup header,
#+ and skip to next if not found.
exit 0

The fold -s command
may be useful (possibly in a pipe) to process long uudecoded
text messages downloaded from Usenet newsgroups.
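To see what fold -s does to a long line, consider this small sketch. The sample text and the width of 8 columns are arbitrary choices for illustration.

```shell
#!/bin/bash
# Sketch: wrapping a long line at word boundaries with fold -s.

long_line="aaa bbb ccc ddd"

# -w sets the maximum line width; -s breaks at the last blank
#+ within that width, rather than in the middle of a word.
folded=$(printf '%s\n' "$long_line" | fold -s -w 8)
line_count=$(printf '%s\n' "$folded" | wc -l)
longest=$(printf '%s\n' "$folded" |
          awk '{ if (length($0) > max) max = length($0) } END { print max }')

echo "$folded"
echo "lines: $line_count, longest: $longest"
```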

mimencode, mmencode

The mimencode and
mmencode commands process
multimedia-encoded e-mail attachments. Although
mail user agents (such as
pine or kmail)
normally handle this automatically, these particular
utilities permit manipulating such attachments manually from
the command-line or in batch
processing mode by means of a shell script.

crypt

At one time, this was the standard UNIX file encryption
utility.
[5]
Politically-motivated government regulations
prohibiting the export of encryption software resulted
in the disappearance of crypt
from much of the UNIX world, and it is still
missing from most Linux distributions. Fortunately,
programmers have come up with a number of decent
alternatives to it, among them the author's very own cruft
(see Example A-4).

openssl

This is an Open Source implementation of
Secure Sockets Layer encryption.

Of course, openssl has many other uses,
such as obtaining signed certificates
for Web sites. See the info
page.

shred

Securely erase a file by overwriting it multiple times with
random bit patterns before deleting it. This command has
the same effect as Example 16-61, but does it
in a more thorough and elegant manner.

This is one of the GNU
fileutils.

Advanced forensic technology may still be able to
recover the contents of a file, even after application of
shred.

Miscellaneous

mktemp

Create a temporary file [6]
with a "unique" filename. When invoked
from the command-line without additional arguments,
it creates a zero-length file in the /tmp directory.

bash$ mktemp
/tmp/tmp.zzsvql3154

PREFIX=filename
tempfile=`mktemp $PREFIX.XXXXXX`
# ^^^^^^ Need at least 6 placeholders
#+ in the filename template.
# If no filename template supplied,
#+ "tmp.XXXXXXXXXX" is the default.
echo "tempfile name = $tempfile"
# tempfile name = filename.QA2ZpY
# or something similar...
# Creates a file of that name in the current working directory
#+ with 600 file permissions.
# A "umask 177" is therefore unnecessary,
#+ but it's good programming practice nevertheless.
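In a longer script, it is common to pair mktemp with a trap, so that temporary files are removed however the script exits. This is one possible sketch, assuming a mktemp that supports the -d option for directories; the variable names are hypothetical.

```shell
#!/bin/bash
# Sketch: temporary file and directory with automatic cleanup.

tempfile=$(mktemp) || exit 1      # Regular file, created empty.
tempdir=$(mktemp -d) || exit 1    # -d creates a directory instead.

# Remove both when the script exits, whether normally or on error.
trap 'rm -rf "$tempfile" "$tempdir"' EXIT

# The first 10 characters of "ls -l" output show type and permissions.
perms=$(ls -ld "$tempfile" | cut -c1-10)

echo "scratch data" > "$tempfile"
cp "$tempfile" "$tempdir/copy"

echo "tempfile = $tempfile ($perms)"
echo "tempdir  = $tempdir"
```

Because the trap runs on exit, no explicit cleanup is needed at the end of the script.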

make

Utility for building and compiling binary packages.
This can also be used for any set of operations triggered
by incremental changes in source files.

The make command checks a
Makefile, a list of file dependencies and
operations to be carried out.

The make utility is, in effect,
a powerful scripting language similar in many ways to
Bash, but with the capability of
recognizing dependencies. For in-depth
coverage of this useful tool set, see the GNU software
documentation site.

install

Special purpose file copying command, similar to
cp, but capable of
setting permissions and attributes of the copied
files. This command seems tailor-made for installing
software packages, and as such it shows up frequently in
Makefiles (in the make
install : section). It could likewise prove
useful in installation scripts.
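An installation script might use install roughly as follows. This is a sketch under assumptions: the script name mytool.sh and the prefix directory are hypothetical, and everything happens inside a scratch directory rather than a real system path.

```shell
#!/bin/bash
# Sketch: copying a file into place and setting its mode in one step.

workdir=$(mktemp -d) || exit 1
printf '#!/bin/sh\necho "hello from mytool"\n' > "$workdir/mytool.sh"

# -d creates the destination directory, including missing parents.
install -d "$workdir/prefix/bin"

# -m sets the permissions of the installed copy.
install -m 755 "$workdir/mytool.sh" "$workdir/prefix/bin/mytool"

mode=$(ls -l "$workdir/prefix/bin/mytool" | cut -c1-10)
output=$(sh "$workdir/prefix/bin/mytool")

echo "mode:   $mode"
echo "output: $output"

rm -rf "$workdir"
```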

dos2unix

This utility, written by Benjamin Lin and collaborators,
converts DOS-formatted text files (lines terminated by
CR-LF) to UNIX format (lines terminated by LF only),
and vice-versa.

ptx

The ptx [targetfile] command
outputs a permuted index (cross-reference list) of the
targetfile. This may be further filtered and formatted in a
pipe, if necessary.

more, less

Pagers that display a text file or stream to
stdout, one screenful at a time.
These may be used to filter the output of
stdout . . . or of a script.

An interesting application of more
is to "test drive" a command sequence,
to forestall potentially unpleasant consequences.