What is tar?

The TAR (Tape ARchiver) utility was originally released in 1979 with the
seventh edition of the Unix operating system. Despite its age, tar is still
used everywhere. Now granted, tar is not the same program it was when it was
released, but the function it performs is the same.

Initially developed to write data to sequential I/O devices for tape
backup purposes, tar is now commonly used to collect many files into
one larger file for distribution or archiving, while preserving file
system information such as user and group permissions, dates, and
directory structures. (Wikipedia, June 2013)

As per the Unix philosophy, tar performs one function and does it very
well. It takes a list of files and spits out a single stream of data.
What I want to cover is some of the cool things you can do with tar.

Command Syntax

tar [operation-flag] [option-flags] <list of files>

The operation flag tells tar what it should be doing with the data given
to it. Note the following operations (plus a couple of option flags you
will see constantly):

x: extract the files from the data stream

c: create a data stream given a list of file/directory names

z: compress the data stream with gzip

t: print out the file names as they are found in the data stream

C: (upper-case) requires a directory path as an argument; tar changes to that directory before adding or extracting files
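As a quick illustration of the t operation (the names demo/ and demo.tar.gz are just placeholders):

```shell
# Create a small compressed archive, then list its contents with
# the t operation (z for gzip, f for the archive file name).
mkdir -p demo
echo "hello" > demo/a.txt
tar czf demo.tar.gz demo/
tar tzf demo.tar.gz   # prints demo/ and demo/a.txt
```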

For example, a typical tar command might look like the following when you
want to store some files in a tar archive.

> tar czf myArchive.tar.gz somefiles/

c tells tar you are creating an archive.

z tells tar that you want to apply compression to the archive, so
it passes the data stream through gzip as it writes the file to disk.

f tells tar you want the archive to be stored in a file
named "myArchive.tar.gz". Since the file name follows it, f must
come last in the flag group.

Everything else is a file or directory name that you would like to
include in the archive. If you don't specify f, tar will write the
data stream to stdout, which allows you to do the following command:

> tar c somefiles/ > myArchive.tar

I personally find this easier to read, since the file you wish to save
the archive to appears after the list of files you wish to archive.

Similarly, extracting an archive is:

> tar xf myArchive.tar
(or)> cat myArchive.tar | tar x

The "Data Stream"

The data stream is actually just a file format that tar uses to describe the
data. For each file in the stream, there is a block of bytes containing
information about the file, such as its name, size, user attributes and so on.
What makes it a stream is the fact that the file can be processed without
knowing where the end is.

What I mean by that is that you can feed the data stream into tar for
extraction, and it will immediately begin extracting files before it even
knows where the end of the archive is! This unique trait is a big part of
what makes it so useful, and a big part of what makes it so different
from other archive formats like zip.

The tar file format also does not have an index. You cannot just ask it
for a file and receive it immediately. If you request a file from tar, it
starts reading through the archive linearly and keeps going until it
finds a file with the name you were looking for.
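For example, asking tar for a single member by name forces exactly that linear scan (the file names here are made up):

```shell
# Build an archive with two files, then extract only one of them.
mkdir -p docs
echo "first" > docs/one.txt
echo "second" > docs/two.txt
tar cf docs.tar docs/
rm -r docs
# tar reads the archive from the start until it finds the matching name.
tar xf docs.tar docs/two.txt
cat docs/two.txt   # prints "second"
```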

A word on data stream formats

Tar has gone through several different file formats in order to overcome
limitations of the original implementation. The newest POSIX format (pax)
places no limits on file size or filename/path length, and is supported
by modern versions of GNU tar. You can specify the tar format with:

> tar c --format=posix ...
(or)> tar c --posix ...

Another modern format is "gnu", which includes some neat features, like being
able to specify the length of a tape and split the archive across tapes in an
interactive way. These extensions are not part of POSIX, however, and the
format has some limitations (albeit unlikely to be a problem) on file UIDs.

I have had to deal with these format limits myself when using FreeBSD's tar
(which is also the standard tar on Apple's OSX), where I ran into the 8GiB
file size limit.

This linear nature is a reminder of tar's roots: writing to tape
drives had to be linear, since tapes had no ability to randomly access
data. A sequential data stream allows you to do some cool things:

The tar-pipe

Should you need to transfer a large number of small files between
computers, you may find using tar faster than tools that focus on copying
files. With scp, for example, there is a lot of overhead for every file
transferred.

> tar c somefiles/ | ssh myuser@myserver "tar x"

When you pipe data into an ssh command, it will provide it as stdin for
the specified command on the remote host. In this case, we are piping to
"tar x" on the remote host, which will extract the data stream into the
host's home directory.

If you want to extract the data to somewhere other than the home
directory, you can use the -C option flag, like so:

> tar c somefiles/ | ssh myuser@myserver "tar x -C /tmp/somedir"
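The same pipe works entirely locally, which is an easy way to see what is going on without an ssh connection (the directory names src and dest are arbitrary):

```shell
# Copy a tree from src/ to dest/ by piping one tar into another.
mkdir -p src dest
echo "data" > src/file.txt
# The first tar writes the stream to stdout; the second extracts
# it into dest/ thanks to the -C flag.
tar c -C src . | tar x -C dest
cat dest/file.txt   # prints "data"
```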

Adding a progress bar to tar

Although you can view the filenames as they are being processed by tar
with the -v flag, that doesn't give you a definitive view of how much
progress tar has made, and tar has no option to show one. Instead,
we can use pv (pipe viewer) to monitor the data stream as it enters or
leaves tar.

The easiest example of this is using pv to pipe the archive into a tar
extract command:

> pv somearchive.tar | tar x

In this case, pv knows the file size, which allows it to give you an
estimate of the time remaining. Creating an archive with a time estimate
is a little more difficult; check the next section for details. You can
still use pv to create an archive without knowing the total archive
size:

> tar c somefiles/ | pv > somearchive.tar

Tar-pipe with pv bash function

If you want to know how long the tar command is going to take, you can get
a pretty good idea by finding the total size of the files first, then
telling pv the amount of data you expect to pass through the pipe.

For example, you can use the du command to get the total file size
beforehand:

> du -bsc content/ draft/ output/

The -s and -c flags tell du to summarize each argument and print a grand
total at the end, and -b (GNU du) makes it report the size in bytes. Your
tar command would then be:

> tar c content/ draft/ output/ | pv -s 3812 > myArchive.tar

Now, putting together what we know, we can turn this into a nifty tar-pipe
shell function which includes a progress bar. If you just want to create an
archive file, redirect the output with > filename.tar.

I wrote the previous version without knowing about du's -c option, but I've
since tested it much more thoroughly and know that it works. The for loop
runs du on every file argument and accumulates the sum in the size
variable. The tar command at the end pipes its output to stdout.
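Based on that description, the function might look something like this sketch (the name tarpipe and the reliance on GNU du's -sb flag are my choices, and pv must be installed):

```shell
# Sketch of a tar-pipe function with a progress bar.
# Assumes GNU du (-sb reports bytes) and pv are available.
tarpipe() {
    local size=0 f
    for f in "$@"; do
        # Accumulate the size of every file argument, in bytes.
        size=$((size + $(du -sb "$f" | cut -f1)))
    done
    # Tell pv how many bytes to expect; the archive itself goes to
    # stdout, so redirect it to a file or pipe it somewhere else.
    tar c "$@" | pv -s "$size"
}
```

Used like: tarpipe somefiles/ > myArchive.tar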

Applying compression

The tar file format is not actually compressed; it just turns many files
into a single stream of data. The advantage of this is that you can pipe
that stream into any number of compression utilities.

For example, to compress an archive with gzip:

> tar c somefiles/ | gzip > myArchive.tar.gz

Tar also has flags to specify compression, rather than piping the data
through an external command. But piping through an external
compression utility lets you use something like pigz, a multi-threaded
implementation of gzip that is much faster on computers with more than one
core.

Finally, adding a progress bar will still work, as long as you remember
to give pv the decompressed/compressed data stream as appropriate.

Splitting up the Archive

Say you want to take a very large archive and split it over some media of a
fixed size; simply take the data stream and pipe it through the split
command.

For example:

> tar c somefiles/ | split -d - myprefix-somefiles.tar-

Check out the man page for split for more information. Typically I'm only
concerned with:

-b: size of each piece being split off; use a single-letter suffix such
as "K, M, G, T"

You can use the cd command to move into another directory for the output
files.
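Putting -b to use, and then reassembling with plain cat (the sizes and names here are arbitrary):

```shell
# Create some throwaway data to archive (64 KiB of zeros).
mkdir -p bigfiles
dd if=/dev/zero of=bigfiles/blob bs=1024 count=64 2>/dev/null

# Split the stream into 16 KiB pieces with numeric suffixes
# (pieces.tar-00, pieces.tar-01, ...).
tar c bigfiles/ | split -b 16K -d - pieces.tar-

# Reassembling is just concatenation, in order:
cat pieces.tar-* | tar t
```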

Of course, splitting things up should make you nervous: you're now depending
on every one of these files remaining intact if you want to be able to
recreate the original archive. You can always use something like
Parchive, which will generate parity files capable of
repairing damaged data.