Writing Man Pages in HTML

HTML is a cool way to look at the Linux man pages. Here's how to do it.

Technical Overview

This section will cover a few of the implementation details
of vh-man2html. It's very brief and is really intended to point out
that CGI scripting is something that anyone with a little
programming knowledge can do with success.

Without getting into a tutorial on CGI scripting, a CGI
script is a program executed by the remote HTTP daemon (i.e., web
server). A web browser can cause a remote web server to run a CGI
program when you follow an HTTP link that matches its name. For
example, pointing a web browser at:

http://www.caldera.com/cgi-bin/man2html

executes cgi-bin/man2html on Caldera's web server. The CGI
programs that a web server is prepared to run are usually
restricted to those found in cgi-bin directories on the server.

The CGI script can return output to the remote caller by
writing a document to its standard output. The start of the output
document must contain a small text header describing its contents.
In the case of man2html the content returned is an HTML page.
Listing 1 shows the HTML output
from the man page to HTML converter; the header line is:

content-type: text/html

and the rest of the document is normal HTML, which consists
of text marked up with HTML tags. With some web browsers, you can
use options like Netscape's “View Document Source” to inspect
this HTML source.
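
As a rough illustration of that output format, here is a minimal sketch in Python (not the language of man2html itself, which is C). The function name build_response and the page text are invented for the example. The one fixed rule is that the header is followed by a blank line before the HTML begins.

```python
import sys


def build_response(title, body_html):
    """Return a complete CGI response: header, blank line, HTML document."""
    header = "Content-type: text/html\n\n"  # the blank line ends the header
    page = (
        "<HTML><HEAD><TITLE>%s</TITLE></HEAD>\n"
        "<BODY>\n%s\n</BODY></HTML>\n" % (title, body_html)
    )
    return header + page


if __name__ == "__main__":
    # A web server captures this standard output and sends it to the browser.
    sys.stdout.write(build_response("Hello", "<H1>Hello from a CGI script</H1>"))
```

Dropped into a server's cgi-bin directory and made executable, a script along these lines should be enough to return a page.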

A script may create an HTML page that contains further
references to other CGI scripts. In Listing 1 the following
reference returns the reader to the main vh-man2html contents
page:

<A HREF="http:/cgi-bin/man2html">Return to Main
Contents</A>

A CGI script receives input that may have been embedded in
the original reference or that may have been added as a result
of user input. For example, in Listing 1, the “SEE ALSO” section
directs the cgi-bin/man2html program to return the HTML for a
specific manual page:

<A HREF="http:/cgi-bin/man2html?man1/from.1l">
from</A>

In this case the HTTP reference is supplied with a single parameter
“man1/from.1l”--the name of a man page. The start of the
parameter list is delimited by a “?”. If there were more than one
argument, they would be separated by “+” signs (and there are
conventions for how to pass special characters such as “+” and
“?” as parameters). The CGI program won't see any of the
delimiting characters; it just receives the parameters as arguments
in its normal argument list (or optionally via standard input).
This means the CGI script doesn't have to concern itself with how
its input got delivered over the network; it simply receives it in
the form of command-line arguments, standard input, plus a variety
of environment variables.
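
To make the convention concrete, the sketch below mimics in Python what the web server does on the script's behalf: it splits the text after the “?” on “+” signs and decodes any escaped special characters. The function name query_to_args is hypothetical. A real CGI script never needs to do this itself; it just reads its argument list.

```python
from urllib.parse import unquote


def query_to_args(query):
    """Split an ISINDEX-style query string into decoded arguments."""
    if not query or "=" in query:
        # A query with an unencoded "=" is form data, not an argument list.
        return []
    # "+" separates arguments; %XX escapes carry special characters.
    return [unquote(word) for word in query.split("+")]


if __name__ == "__main__":
    print(query_to_args("man1/from.1l"))
    print(query_to_args("message+1"))
```

An argument that itself contains a “+” has to be escaped by the browser as %2B, which is why the decoding step comes after the split.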

In addition to clicking on references, the user can also
enter data into input fields. The simplest way for a CGI program to
introduce an input field onto a form is to include the tag
<ISINDEX> in the HTML it generates. This results in a single
input field, such as in Figure 1. If the user enters anything in
the input field and presses return, the server
will re-run the CGI program, passing it the input via the parameter
passing conventions we've just discussed. You can also create HTML
forms, but I'm not going to discuss them here.

By generating the kinds of HTML references presented above,
CGI programs can perform complex interactions with the remote user.
The beauty of all this is that, to get started, the only skill you
need is the ability to write fairly simple code in a language of
your choosing. You need to know how to process command-line
arguments and write to standard output. The rest of the knowledge
you need can be had for free from Web documents or from any one
of a number of books on HTML and CGI. CGI is a client-server system that
actually works. Heavy-duty CGI programming languages such as Python
and Perl have tools and libraries to assist you with the
task.

I should also mention the issue of security. If your HTTP
daemon is accessible by potentially hostile users, your CGI scripts
could provide an avenue for them to attack you. Hostile users might
try to supply malicious parameters to your CGI scripts. For
example, by using special shell characters such as back quotes and
semicolons, they might be able to get the script to execute
arbitrary commands. The only way to prevent this is to carefully
examine all input parameters for anything suspicious. For example,
vh-man2html can be passed the full file name of a man page;
however, it doesn't just accept and return any file name it is
passed—it accepts only those filenames present within the man
hierarchy. The program also makes sure the file name does not
contain relative references such as “..” (the parent directory),
and removes any suspect characters such as back quotes that might
be used to embed commands in the parameter list. In languages like
C, where memory bounds checking is lacking, the length of the input
arguments should be constrained to fit within the space allocated
for them. Otherwise the caller may be able to write beyond the
allocated space into other data and change the behavior of the
program to his/her advantage (e.g., change a command the program
executes from gzip to rm). To help check that long input parameters
wouldn't threaten vh-man2html's integrity, I borrowed some time on
an SGI box and built vh-man2html with Parasoft's Insight bounds
checker. Insight pre-processes a C or C++ program adding array
bounds checking, memory leak detection and many other checks. One
of the reasons I'm mentioning Insight is that Parasoft's Web site,
http://www.parasoft.com/, lists Linux as a supported
platform.
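
The checks described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the C code in vh-man2html: the man roots, the length limit and the list of suspect characters are all hypothetical values, and the sketch only checks that a path lies under a man root, not that the file actually exists.

```python
import posixpath

# Hypothetical values for illustration; the real program's limits may differ.
MAN_ROOTS = ("/usr/man", "/usr/local/man")
MAX_PARAM_LEN = 256
SUSPECT_CHARS = "`;|&$<>\\\"'*?\n\r"


def safe_man_path(param, roots=MAN_ROOTS):
    """Return a cleaned path inside one of the roots, or None if rejected."""
    if len(param) > MAX_PARAM_LEN:
        return None  # refuse over-long input outright
    # Drop characters that a shell could interpret.
    cleaned = "".join(ch for ch in param if ch not in SUSPECT_CHARS)
    # Refuse any reference to a parent directory.
    if ".." in cleaned.split("/"):
        return None
    normalized = posixpath.normpath(cleaned)
    for root in roots:
        if normalized.startswith(root + "/"):
            return normalized
    return None
```

The design choice worth noting is that the function names what it will accept and rejects everything else, rather than trying to list every possible attack.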

vh-man2html includes four CGI programs. They all generate
interdependent HTTP references to each other.

Man page to HTML translation is handled by the man2html C
program. The Unix man pages are marked up with man or BSD mandoc tags,
which are nroff/troff macros. The bulk of the program is a series
of large case statements and table lookups that attempt to cope
with all the possible macros.

Listing 2 shows a typical
nroff/troff marked-up manual page that uses the man macro
package. The macros use a full stop, i.e., a period, as a lead-in to
a one- or two-character macro name. troff/nroff uses two-character
macro names (apparently they fit nicely into the 16-bit word size
of old Unix platforms such as the PDP-11; at least that's what I
was told), a trick that man2html.c still uses. Some of the
macros can be translated directly to appropriate HTML tags; for
example, lines beginning with “.SH” (section headings) are translated
directly to HTML <H1> headings.

Many troff tags limit their arguments and effects to just one
line and have no corresponding end tag, whereas many of the
equivalent HTML constructs require an end tag. For example,
the text following a troff “.SH” section heading tag needs to be
enclosed in a pair of HTML heading level 1 tags, e.g.,
“<H1>text</H1>”. Other troff tags with a larger
scope, such as many kinds of lists, have both begin and end tags,
which makes translation to HTML very easy.
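
A one-line tag of this kind can be handled with a simple table lookup. The sketch below is a Python illustration, not an excerpt from man2html.c. The “.SH” to H1 mapping comes from the text above; the “.SS” to H2 entry and the function name translate_line are assumptions added for the example.

```python
import html

# ".SH" -> H1 is from the article; ".SS" -> H2 is an assumed extra entry.
ONE_LINE_TAGS = {".SH": "H1", ".SS": "H2"}


def translate_line(line):
    """Wrap the text of a one-line troff tag in an HTML begin/end pair."""
    parts = line.split(None, 1)
    if parts and parts[0] in ONE_LINE_TAGS:
        tag = ONE_LINE_TAGS[parts[0]]
        text = parts[1].strip() if len(parts) > 1 else ""
        # The troff tag ends at the end of the line, so close the HTML tag here.
        return "<%s>%s</%s>" % (tag, html.escape(text), tag)
    return html.escape(line)  # ordinary text: just escape <, > and &
```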

One tricky issue is dealing with multiple troff tags on one
line; for example, tags that imply bracketing of following text or
font changes. In order to correctly place bracketing, the
translator can work recursively within a line. For example, the BSD
mandoc sequence for a command option called -b
with an argument called bcc-addr is expressed
in troff as:

.Op Fl b Ar bcc-addr

which indicates the reader should see:

[ -b bcc-addr ]

where b is in bold and
bcc-addr is in italics. The corresponding HTML
is:

[ -<B>b</B> <I>bcc-addr</I> ]

By using recursion on hitting the Op tag, we can get the square
brackets on the beginning and end of the entire line.
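
The recursive idea can be sketched as follows. This is a simplified Python illustration covering only the three tags in the example (Op, Fl and Ar), with each of Fl and Ar assumed to take exactly one word; the function names are hypothetical and HTML escaping is left out for brevity.

```python
def _then(rest_html):
    """Prefix already-translated trailing text with a space, if there is any."""
    return " " + rest_html if rest_html else ""


def translate_words(words):
    """Translate a list of mandoc words, recursing on the rest of the line."""
    if not words:
        return ""
    head, rest = words[0], words[1:]
    if head == "Op":
        # Translate everything that follows, then bracket the result.
        return "[ " + translate_words(rest) + " ]"
    if head == "Fl" and rest:
        return "-<B>" + rest[0] + "</B>" + _then(translate_words(rest[1:]))
    if head == "Ar" and rest:
        return "<I>" + rest[0] + "</I>" + _then(translate_words(rest[1:]))
    return head + _then(translate_words(rest))


def translate_mandoc_line(line):
    """Strip the leading period and translate the remaining words."""
    return translate_words(line.lstrip(".").split())


if __name__ == "__main__":
    print(translate_mandoc_line(".Op Fl b Ar bcc-addr"))
```

Because the Op branch waits for the recursive call to return before adding the closing bracket, the bracket lands after everything else on the line.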

There are some troff tags whose effect is terminated by tags
of equal and higher rank; in these cases, the translator must
remember its context and generate any necessary terminating HTML.
Nested lists are also possible. In these situations man2html has to
maintain a stack of outstanding nestings that have to be completed
when a new equal or higher element is encountered.
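
One way to picture that stack is the Python sketch below. The numeric ranks (1 is the highest) and the HTML tag names in the usage are assumptions chosen for illustration; man2html's own bookkeeping is in C and may be organized differently.

```python
def open_element(stack, out, rank, tag):
    """Open tag at the given rank, first closing whatever it terminates."""
    # Anything of equal or lower rank (an equal or larger number) is finished.
    while stack and stack[-1][0] >= rank:
        _, finished = stack.pop()
        out.append("</%s>" % finished)
    out.append("<%s>" % tag)
    stack.append((rank, tag))


def close_all(stack, out):
    """At end of input, close every element that is still open."""
    while stack:
        _, finished = stack.pop()
        out.append("</%s>" % finished)


if __name__ == "__main__":
    stack, out = [], []
    open_element(stack, out, 1, "DL")  # outer list
    open_element(stack, out, 2, "UL")  # nested list
    open_element(stack, out, 1, "DL")  # new outer list closes both
    close_all(stack, out)
    print("".join(out))
```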

I admire Richard's dedication in methodically building up
translations of all of the tags. Adding in the BSD mandoc tags
proved to be a painful experience, and in the end, the only way to
get it right was to convert every BSD mandoc page I could find and
pipe the output to weblint (an excellent HTML checker). For
example, in tcsh/csh:

The program also has to navigate the man directory
hierarchies and generate lists of references to pages that might be
relevant (e.g. a page with the same name might be present in
multiple man hierarchies). The list of man hierarchies to be
consulted is read from /etc/man.config, which is the standard
configuration file for the man-1.4 package that ships with Red Hat
and Caldera. This configuration file is also consulted for details
on how to process man pages that have been compressed with gzip or
other compression programs.
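
Reading such a file is straightforward. The Python sketch below assumes a format in which “MANPATH” lines name a hierarchy and lines beginning with a file extension such as “.gz” name the command used to decompress it; that format and the sample paths are assumptions made for the example, so check them against the man.config on your own system.

```python
def parse_man_config(text):
    """Return (list of man hierarchies, dict of extension -> decompressor)."""
    manpaths = []
    decompressors = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue
        key, value = fields[0], fields[1].strip()
        if key == "MANPATH":
            manpaths.append(value)
        elif key.startswith("."):
            decompressors[key] = value
    return manpaths, decompressors


if __name__ == "__main__":
    sample = "MANPATH /usr/man\nMANPATH /usr/local/man\n.gz /bin/gunzip -c\n"
    print(parse_man_config(sample))
```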

man2html could have easily been written in Python or Perl,
but you can't beat C for speed. man2html is fast enough on my 486
that I didn't think caching its output was worthwhile; each page is
just regenerated on demand. However, if I were going to provide man
pages from a server for a large number of high-frequency users, I
would probably pre-generate all the man pages as a static document
set.

Two awk scripts, manwhatis and mansec, generate name-title
and name-only indexes for man sections and cache them in
/var/man2html. manwhatis locates and translates whatis files into
the desired section index, which it caches in /var/man2html. It
rebuilds the cache if any whatis file has been updated since the
cached version was generated. The script divides the whatis file
alphabetically and constructs an alphabetic index to the HTML
document, so that the user can quickly jump to the section of
the alphabet they're interested in.
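
The alphabetic index amounts to grouping entries by first letter and emitting one named anchor per group. The sketch below shows the idea in Python rather than awk; the function name, the “other” bucket for names that don't start with a letter and the exact HTML layout are all assumptions for illustration.

```python
import html


def build_alpha_index(entries):
    """entries: list of (name, title) pairs. Returns an HTML fragment."""
    groups = {}
    for name, title in sorted(entries, key=lambda e: e[0].lower()):
        first = name[:1].upper()
        letter = first if "A" <= first <= "Z" else "other"
        groups.setdefault(letter, []).append((name, title))
    # Letters in order, with the catch-all bucket last.
    letters = sorted(groups, key=lambda l: (l == "other", l))
    # First line: the quick index of links to each letter's section.
    lines = [" ".join('<A HREF="#%s">%s</A>' % (l, l) for l in letters)]
    for l in letters:
        lines.append('<H2><A NAME="%s">%s</A></H2>' % (l, l))
        lines.append("<DL>")
        for name, title in groups[l]:
            lines.append("<DT>%s<DD>%s" % (html.escape(name), html.escape(title)))
        lines.append("</DL>")
    return "\n".join(lines)
```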

mansec traverses the man hierarchy to build up a list of
names; it rebuilds its cache if any of the directories in the
hierarchies have been updated. mansec has to use the sort command
to get the names it finds into alphabetical order. It also builds
an alphabetic quick index just like manwhatis.
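
The rebuild test used by both scripts comes down to comparing modification times. Here is a minimal Python sketch of that comparison; the function name is hypothetical, and treating a missing cache file as stale is an assumption about the desired behavior.

```python
import os


def cache_is_stale(cache_path, source_paths):
    """True if the cache is missing or older than any existing source."""
    try:
        cache_mtime = os.path.getmtime(cache_path)
    except OSError:
        return True  # no cache yet, so it must be built
    return any(
        os.path.getmtime(path) > cache_mtime
        for path in source_paths
        if os.path.exists(path)
    )
```

For manwhatis the sources would be the whatis files; for mansec they would be the directories in the man hierarchies, since a directory's modification time changes when a page is added or removed.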

Both manwhatis and mansec accept an argument that indicates
which section to index. They have to check the argument for
anything potentially malicious and return a document containing an
error message if they find anything they weren't expecting.
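
A check of this kind might look like the Python sketch below. It is not the awk code from manwhatis or mansec: the set of valid section names and the wording of the error page are assumptions made for the example.

```python
# Hypothetical list of section names the scripts are willing to index.
VALID_SECTIONS = {"1", "2", "3", "4", "5", "6", "7", "8", "9", "n", "l"}


def check_section(arg):
    """Return the section name if it is one we expect, else None."""
    return arg if arg in VALID_SECTIONS else None


def error_page(message):
    """Return a complete CGI response carrying a fixed error message."""
    return (
        "Content-type: text/html\n\n"
        "<HTML><HEAD><TITLE>Error</TITLE></HEAD>\n"
        "<BODY><H1>Error</H1>\n%s\n</BODY></HTML>\n" % message
    )


def index_request(arg):
    """Return (section, None) on success or (None, error document)."""
    section = check_section(arg)
    if section is None:
        # The bad input is deliberately not echoed back into the page.
        return None, error_page("Unrecognized manual section.")
    return section, None
```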

The mansearch script is an awk script front end to the
Glimpse search utility. It accepts user input, which it passes on to
Glimpse, so I had to be careful to include code to check the input
for safety before invoking Glimpse. This basically means excluding
any shell special characters or making sure they can't do anything
by quoting them appropriately. For example, in awk we can silently
ignore any characters that we aren't willing to accept:

# Remove any char not in A-Za-z0-9 or space.
# gsub() edits string in place and returns the
# number of substitutions, so don't assign its result.
gsub(/[^A-Za-z0-9 ]/, "", string);

I chose awk over Python and Perl mainly because it is small,
widely available and adequate for the task. Note that I'm using the
post-1985 "new awk". For larger, more complex CGI scripts I'd probably use
Python (if I had to start again without Richard's work, I think
man2html would be a Python script).

In order to make vh-man2html usable remotely, I changed man2html and my
scripts to generate HTTP references that were relative to the current server.
For example, I used:

<A HREF="http:/cgi-bin/man2html">Return to Main
Contents</A>

rather than

<A HREF="http://localhost/cgi-bin/man2html">Return to Main
Contents</A>

which works fine except for “redirects”. A redirect is a small
document output by a CGI script. This is an example redirect:

Location: http://sputnik3/cgi-bin\
/man2html/usr/man/man1/message.1

A redirect has no context, so the host has to be specified.
man2html generates redirects when a user enters an approximate name
such as “message 1”. The redirect corrects
this to a full reference such as the one above. The server name is
obtained from one of the many environment variables that an HTTP
server normally sets before invoking a CGI script.
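
Building such a redirect can be sketched as follows, in Python for illustration. SERVER_NAME, SERVER_PORT and SCRIPT_NAME are standard CGI environment variables; the function name and the fallback values are assumptions, and the article does not say which variable man2html actually reads.

```python
import os


def build_redirect(path, environ=os.environ):
    """Return a redirect document pointing at path on the current server."""
    host = environ.get("SERVER_NAME", "localhost")
    port = environ.get("SERVER_PORT", "80")
    script = environ.get("SCRIPT_NAME", "/cgi-bin/man2html")
    # Only mention the port when it is not the default for HTTP.
    netloc = host if port == "80" else "%s:%s" % (host, port)
    # As with any CGI header, a blank line ends the document.
    return "Location: http://%s%s%s\n\n" % (netloc, script, path)
```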
