Perl comes with a set of database management (DBM) library files.
Here's the list of packages that come with the latest version
of Perl (5.002):

GDBM_File.pm: GNU's database interface files

NDBM_File.pm: Berkeley UNIX compatibility

ODBM_File.pm: Standard Perl package

SDBM_File.pm: Standard Perl package

AnyDBM_File.pm: Virtual classes for any of the above database interfaces

The AnyDBM_File package encapsulates
the rest of the packages. If you use the AnyDBM_File
package, you'll automatically load one of the standard packages from
the previous list. To use the AnyDBM_File
package, insert the following statement at the start of your Perl
script:

use AnyDBM_File;

There is nothing preventing you from explicitly naming a particular
DBM package to override the defaults used by the AnyDBM_File
package. To use the GDBM_File package
explicitly, insert this statement instead:

use GDBM_File;

A DBM, when used with these packages, is a mapping of an associative
array to a file on disk. To map the associative array to disk,
you use dbmopen() to create
or open an existing file, make all the modifications to the associative
array, and then close the file with a call to dbmclose().
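
The open, modify, and close cycle can be sketched as follows. This is
a minimal illustration, not one of the chapter's listings: the DBM
name demo and the %colors hash are made up for the example, and the
files are written to a scratch directory that is removed on exit.

```perl
use strict;
use warnings;
use AnyDBM_File;                  # picks one of the installed DBM packages
use File::Temp qw(tempdir);

# Work in a scratch directory so the sketch leaves nothing behind.
my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/demo";           # hypothetical DBM name

my %colors;
dbmopen(%colors, $base, 0666) or die "Cannot open $base: $!";
$colors{sky}   = 'blue';          # stored in the DBM file, not just in memory
$colors{grass} = 'green';
dbmclose(%colors);                # detach the hash from the file

# Reopen the same DBM and read a value back.
my %again;
dbmopen(%again, $base, 0666) or die "Cannot reopen $base: $!";
my $sky = $again{sky};
print "sky is $sky\n";
dbmclose(%again);
```

Which files appear on disk depends on the DBM package that
AnyDBM_File selects on your system.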

I covered how to bind variables to associative arrays and how
to use the tie() function
in Chapter 6, "Binding Variables to
Objects." In the future, most dbmopen()
function calls will be replaced by, or at least internally work
as, tie() function calls.
Basically, a DBM file will be tied to an associative array. Subsequent
updates to the associative array will be reflected in the file
on disk. However, you will come across legacy code that does use
the dbmopen() and dbmclose()
functions, so it's important to see how these work together. Also,
you may find it easier to use these DBM functions to program for
quick prototyping than to have to code the callback functions
required for the tie() function.
After you've read this chapter, you should be able to decide whether
to use DBM functions or write the functions for database access
using the tie() function.
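
For comparison, here is a minimal sketch of the tie() form. It assumes
the SDBM_File class that ships with Perl, so no callback functions
have to be written by hand; the DBM name tied and the sample entry are
made up for the example.

```perl
use strict;
use warnings;
use Fcntl;                        # supplies O_RDWR, O_CREAT, and O_RDONLY
use SDBM_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/tied";           # hypothetical DBM name

my %h;
tie(%h, 'SDBM_File', $base, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $base: $!";
$h{IBM} = 'International Business Machines';
untie %h;                         # the counterpart of dbmclose()

# Tie the same files again, read-only this time, and fetch the value.
tie(%h, 'SDBM_File', $base, O_RDONLY, 0666)
    or die "Cannot re-tie $base: $!";
my $name = $h{IBM};
print "IBM => $name\n";
untie %h;
```

Writing your own class is only necessary when none of the supplied
DBM classes stores the data the way you need.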

Here's an example to illustrate how the DBM files work. First,
I'll create a small database of some stock symbols and the names
of the companies. The file containing this data sample is called
sym.txt and is partially
shown here:

This file contains several lines of records where each record
has two items of data: the first column is the stock ticker symbol
followed by the company name. Note that a company name can contain
one or more space characters.

Now I'll construct a mini-database of all company names indexed
by their stock symbol. To use the DBM utilities for this endeavor,
I would use the script shown in Listing 18.1.

In Listing 18.1, the AnyDBM_File
package is used at line 10 to get the best available package for
the system. You can override the type by replacing the AnyDBM_File
package with a DBM package of your choice. In line 11, the input
file is opened.

Line 16 creates a DBM called ticker
and maps it onto an associative array called %stocks.
Note that you are actually creating two files called ticker.dir
and ticker.pag.
The file permissions given for this open
are 0666 so that it's possible for all other users to use this
DBM.

In lines 17 through 23, the symbol names are read in, and the
values are assigned to the newly created hash, called stocks.
Lines 18 and 19 simply take the text input from the sym.txt
file and separate it into a symbol and a company name. At line
21 a value is assigned to the associative array using the symbol
as the index into the array. Now that this array has been mapped
to the disk, the data can be kept in the array even after the
script that created this file is long gone.

Note also that the join() function
is used in line 21 to piece together the items of names of the
companies into one string. Had the join()
function not been used, $stocks{$symbol}
would contain the number of items of the @names
array and not the contents.
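
The difference is easy to see in a short sketch. The input line below
is made up, and the count key exists only to show what a plain
assignment of the array would store.

```perl
use strict;
use warnings;

my $line = "AMAT Applied Materials Inc";   # made-up sample record
my ($symbol, @names) = split ' ', $line;

my %stocks;
$stocks{count}   = @names;             # array in scalar context: item count
$stocks{$symbol} = join(' ', @names);  # the full company name

print "$stocks{count}\n";      # prints 3
print "$stocks{$symbol}\n";    # prints Applied Materials Inc
```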

In line 31 the dbmclose()
function is used to commit the changes to disk. If you do not
use dbmclose(), your DBM
file modifications may not be saved to disk. The dbmclose()
function disconnects (or un-ties) the hash from the file
on disk so that any changes made to the hash after the dbmclose()
function are made as if to an uninitialized associative array.
Exactly when the contents of the hash are flushed to disk depends
on the underlying system. As far as the program is concerned,
the hash is no longer connected to the underlying database
after the call.

Two files are created with the use of the AnyDBM_File
package. These are the two files that were created by the script
in Listing 18.1 when I ran it:

Of course, because you do not have access to the original files
that I used in this example, the sizes shown here will be different
for the data file that you use. Just note that the size of both
the files is the same.

If the AnyDBM_File package
had been substituted with the GDBM_File
package, we would only get one file:

-rw-r--r-- 1 khusain users 6608 Mar 26 08:06 scripts/ticker

Again, the size of the ticker
file will be different depending on the input data file that you
use. However, you should note that the size of the ticker file
is the same as that of the two files created by the AnyDBM_File
package.

The database has now been created on disk. Whether it consists
of one file or two is not important in the application that will
access the database because all the internals are hidden with
the DBM function calls.

In order for us to look at the data in the newly created DBM file,
another script needs to be created. The viewing script is shown
in Listing 18.2.

To delete an item from the %stocks
associative array, use the delete
operator. For example, the code in Listing 18.3 can be modified
to become a deletion command by adding the delete
operator to line 20. The code in line 20 would look like this:

Lines 19 through 24 contain a while
loop that uses the each command.
The each command is more
efficient than using a for
loop command because both the key and the indexed value are retrieved
with the each command. In
the for loop on the keys
of an associative array, two separate steps must be taken to retrieve
the same information: first to get the key for the loop iteration
with a call to the keys(hash) function, and second to get
the value based on the retrieved key.
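
The two loop styles, together with the delete operator, can be
sketched as follows. An ordinary hash with made-up entries stands in
for the DBM-mapped %stocks hash; the operations are written the same
way in both cases.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the DBM-mapped hash.
my %stocks = (
    AMAT => 'Applied Materials',
    NAL  => 'National Aluminum',
    IBM  => 'International Business Machines',
);

# each() hands back the key and its value in one call.
my @hits;
while (my ($symbol, $name) = each %stocks) {
    push @hits, $symbol if $name =~ /Alum/;
}

# The keys() form needs a second lookup to fetch each value.
for my $symbol (sort keys %stocks) {
    my $name = $stocks{$symbol};
    print "$symbol $name\n";
}

# delete removes a single entry from the hash.
delete $stocks{IBM};
my $left = keys %stocks;
print "hits: @hits; entries left: $left\n";
```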

Running the program to look for NAL
in the company names produces the following output:

The program found all company names with the string NAL
in their name. As you can see, within Perl you have the power
to create a new database, add or delete items from the database,
and list the contents of the database. Using the hash as the map,
you can perform many other operations on the hash and reflect
the results back to disk.

The DBM files are still a flat database; therefore, in order to
create relationships between data items, you still have to do
some shuffling around of indexes, filenames, and files. Fairly
sophisticated data algorithms are not out of reach, however, if
you are careful. For example, the sample $symbol
index can also be used as a filename containing historical data.
Let's say the %stocks DBM
hash is used as a storage area for the stocks in a portfolio.
It's easy to get historical data from the Internet (for example,
via CompuServe) for a given stock symbol. Listing 18.5 collects
and displays some statistics for the symbols in the sample portfolio.

So that you don't have to type everything via STDIN,
the script is designed to use command-line options with the Getopt::Long
module. See Chapter 16, "Command-Line
Interface with Perl," for more information on how to use
the Getopt::Long module.
In this example, the -s option
is used to specify the stock symbol and the -v
option is used to get a complete listing of all the readings.
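
A minimal sketch of parsing these two options with Getopt::Long
follows. It is not the chapter's listing: the variable names are
assumptions, and the command line is simulated by assigning to @ARGV
so that the example runs on its own.

```perl
use strict;
use warnings;
use Getopt::Long;

# Simulate a command line such as: script -s AMAT -v
local @ARGV = ('-s', 'AMAT', '-v');

my $symbol  = '';
my $verbose = 0;
GetOptions(
    's=s' => \$symbol,    # -s takes a string: the stock symbol
    'v'   => \$verbose,   # -v is a simple on/off flag
) or die "Usage: $0 -s SYMBOL [-v]\n";

print "symbol=$symbol verbose=$verbose\n";
```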

You print the stocks you are tracking by checking to see if the
name exists in the %stocks
hash. This is done in line 23 with the following statement:

if ($stocks{$search}) {

Once the file is found to exist with the -e
flag in line 24, it is opened and read from in lines 27 to 35.
The summary is printed in lines 36 through 40. In case the symbol
is not listed in the database, an error message is printed in
the else clause in line 43.
The format for the output via the write
statement in line 33 is defined in lines 50 to 53.

In this example, the historical data for a stock is stored in
a file called AMAT. Therefore,
to invoke the script, use this command:

Because it's possible to use the DBM files from within modules,
let's see if the Invest.pm
module (covered in Chapter 5, "Object-Oriented
Programming in Perl") can be updated to include saving and
restoring portfolio information on disk. You will be appending
some functions to the original Invest.pm.
All the new functions will go at the end of the file unless specified
otherwise.

Some changes need to be made to the Invest.pm
module to get it to work with the DBM files. The first change
is to include the following statement before the first executable
line in the Invest.pm file:

require AnyDBM_File;

The @portfolio array is changed
to the %portfolio hash because a
hash can be mapped directly to disk via a DBM file. The portfolio
contains references to hashes and not the content of the
item in the referenced hash. Therefore, a new scheme has to be
incorporated in order to flatten the values and then store them
to disk. After the values are stored to disk, the reverse operation
has to be applied to read them back. Because this sample portfolio
is not a large database, the important values can be stored using
a colon-delimited string. If this example were a very large database,
the items could be stored with the pack()
function and thus stored as compact binary values.

A new function has to be created to save the contents of the portfolio
to disk. Add the following function to the end of the Invest.pm
file on or about line 116:

The %dummy hash is used to
map to the disk file. Each item in the portfolio
hash is parsed for storage in the %dummy
hash. The value of $key starts at 0 and counts up: 0, 1, 2,
and so on. One string is
saved to disk per hash item. Here's the format of the string:

$key:$type:$symbol:$shares

$type can be Stock
or Fund. $symbol
is the stock or fund symbol and $shares
is the number of shares of the stock. Keep in mind that this is
only an example; the real data stored would probably have to include
date of purchase, purchase price, and so on. In this event, the
fields could be appended to the string with colons separating
each field. If you want to add purchase price, your record would
look like this:

$key:$type:$symbol:$shares:$purchased
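
Building and splitting such a record can be sketched as follows. The
helper names encode_item and decode_item are hypothetical and are not
functions from Invest.pm; the sketch also assumes that no field
contains a colon.

```perl
use strict;
use warnings;

# Hypothetical helpers; the names are not taken from Invest.pm.
sub encode_item {
    my ($key, $type, $symbol, $shares) = @_;
    return join(':', $key, $type, $symbol, $shares);
}

sub decode_item {
    my ($record) = @_;
    my ($key, $type, $symbol, $shares) = split /:/, $record;
    return { key => $key, type => $type, symbol => $symbol, shares => $shares };
}

my $record = encode_item(0, 'Stock', 'AMAT', 100);
my $item   = decode_item($record);
print "$record\n";                              # prints 0:Stock:AMAT:100
print "$item->{symbol} x $item->{shares}\n";    # prints AMAT x 100
```

A plain string like this is what gets stored as the value in the
DBM-mapped hash, because a DBM file holds only flat strings and not
references.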

To restore the file back from disk, the Invest.pm
file will have to read back the same portfolio file and restore
all the values in the portfolio array. The function will also
have to recognize the difference between the types of items it
has to re-create. The restoration is done by this function which
is added to the end of the Invest.pm
file at about line 132:

To create the sample portfolio, use the shell script shown in
Listing 18.6. You might want to edit the code shown in Listing
18.6 to print the reported information in a different format than
the one shown here. In this sample script, two stocks are added
and then the contents of the portfolio are printed.

To view the contents of the portfolio, you'll have to write another
simple script. This script is shown in Listing 18.7. The file
to recover data from is called myStocks.
You should be able to see this file in your directory.

There are occasions when you'll want to have more than one file
open for DBM access. You can use a unique stock ticker symbol
as the index into several hashes, each of which is then mapped
to its own DBM file. For example, the following database code
could be used to track stock price information in one DBM file
and earnings information in another DBM file. Both DBM files will
be indexed via the stock symbol name. Results of analyzing data
from both DBM files could be printed using code similar to the
following snippet:
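
As a minimal sketch of the idea, assuming two hypothetical DBM files
named prices and earnings and made-up figures, two hashes can be open
at the same time and combined by symbol:

```perl
use strict;
use warnings;
use AnyDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Two hashes, each mapped to its own DBM file (names are hypothetical).
my (%prices, %earnings);
dbmopen(%prices,   "$dir/prices",   0666) or die "prices: $!";
dbmopen(%earnings, "$dir/earnings", 0666) or die "earnings: $!";

# Made-up figures, keyed by the same stock symbol in both files.
$prices{AMAT}   = 40;
$earnings{AMAT} = 2;

my %ratio;
for my $symbol (keys %prices) {
    next unless $earnings{$symbol};           # skip missing or zero earnings
    $ratio{$symbol} = $prices{$symbol} / $earnings{$symbol};
    print "$symbol price/earnings: $ratio{$symbol}\n";
}

dbmclose(%prices);
dbmclose(%earnings);
```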

So far I have only covered the standard DBM utilities that come
with the Perl distribution. For most casual users, these DBM files
will be sufficient for their database needs. Unfortunately, when
things get complicated, as in the case of relational databases,
you might want to reconsider your options with other database
solutions. The price tag for DBM utilities is attractive because
they're free. However, you just might want to pay for
a commercial Relational Database Management System (RDBMS).

There is also an inherent danger in using DBM utilities that
I must warn you about. If you make a mistake in working with your
mapped hash and somehow write it to disk with a dbmclose(),
guess what? You just wiped out your entire database. This type
of faux pas is not hard to commit, especially if you are modifying
data. An obliterated DBM database is generally recoverable only
from backup.

Commercial databases have a "safer" feel because they
provide you with a safety net by keeping alternate backups. You
are still your own worst enemy, but it's a little bit harder to
destroy all data. In any event, always back up your data.

If you do decide to use a database management system (DBMS) other
than the DBM utilities, all is not lost. You can use RDB (a freeware
relational database) or other Perl front-ends to popular databases.
All of the front-end packages allow your Perl programs to talk to
different databases via the protocol used by a DBI package.

Listing 18.8 presents the final version of the Invest.pm
file after all the modifications discussed up to now in this chapter
have been made to it.

The database interface (DBI) package for Perl is the implementation
of the DBI Application Program Interface (API) specification written
by Tim Bunce (Tim.Bunce@ig.co.uk).
The DBI package API is designed specifically for use with Perl.
The set of functions and variables in the DBI package provide
a consistent interface to the application using it. The strong
point of the DBI package API, in addition to its broad set of
available functions, is that it completely isolates the using
application from the internal implementation of the underlying
database.

The DBI specification exists at various sites in the CPAN archives,
but the latest version (v0.6) is not up to date. The best source
of information is to look in the source files for a DBI package
itself. The entire specification is good for getting an idea of
how everything is intended to work together. However, the interface
has changed considerably since the specification was released.
Check out the file dbispec.v06
in compressed form at ftp.demon.co.uk
in the /pub/perl/db directory.

The DBI specification started out as DBperl back in 1992 as a
team effort from several Perl enthusiasts. Here are the initial
contributors to the specification for each type of database:

infoperl (Informix): Kurt Andersen (kurt@hpsdid.sdd.hp.com)

interperl (Interbase): Buzz Moschetti (buzz@fsrg.bear.com)

oraperl (Oracle): Kevin Stock (kstock@encore.com)

sybperl (Sybase): Michael Peppler (mpeppler@itf.ch)

sqlperl/ingperl (Ingres): Ted Mellon (dbi-users@fugue.com) and Tim Bunce

The original specification was edited by Kurt Andersen from the
discussions on the mailing list. In 1993, Tim Bunce took over
the editing and maintenance of the specification and in 1994 started
the development of the DBI and DBD::Oracle
modules. The DBI specification and modules are copyrighted by
Tim Bunce but are freely available to all with the same terms
as Perl. (Tim is the technical director of the software systems
house, Paul Ingram Group in Surrey, England. Tim can be reached
at Tim.Bunce@ig.co.uk, but
DBI related mail should be sent to the dbi-users@fugue.com
mailing list.)

The DBI is not tied to any one specific database because it
serves as an intermediary between a program and one or more DBD::*
driver modules. DBD:: modules
are drivers written to support a specific database back-end. The
DBI module manages all
installed DBD:: drivers in
your system. You can load and use more than one DBD::
module at the same time.

DBD:: modules are written
in such a way that they may be copied and customized to suit your
specific needs. For example, the DBD::Oracle
module served as the starting point for Alligator Descartes, another
well-known pioneer in developing database interfaces for Perl,
to develop DBD:: modules
for other databases. He has written two copyrighted documents
on how to develop your own driver from DBD::
modules. These documents are located on the Web site www.hermetica.com
in the technologia/DBI directory.

Some of the DBI and DBD:: packages available
on the Internet are listed here; you can get the latest versions
of these files from the Internet CPAN sites:

DBD-Oracle-0.29.tar.gz for Oracle users

DBD-Informix-0.20pl0.tar.gz for Informix database users

DBD-QBase-0.03.tar.gz for Quickbase users

DBD-mSQL-0.60pl9.tar.gz for mSQL-based databases

DBI-0.67.tar.gz for the DBI interface

The interface packages are simply front-ends to the database engine
that you must have installed on your machine. For example, in
order to use the Oracle DBI package, you'll need the Oracle database
engine installed on your system. The installation instructions
are located in the README
files in the packages themselves. You'll need to have Perl 5.002
installed on your system to get some of the packages to work,
especially the DBI module.

The DBI interface is very different from the old, database-specific
interfaces provided by oraperl, ingperl, interperl, and so on.
To simplify the transition to Perl 5, some DBD::
drivers, such as DBD::Oracle,
come with an extra module that emulates the old-style interface.
The DBI interface has never been fully defined because it has
been constantly evolving. This evolution will take a large step
forward with the adoption of the standard ODBC interface as the
core of the DBI. Because this redefinition of the DBI interface
standard is bound to change the DBI interface, Tim Bunce recommends
using stable emulation interfaces, such as oraperl, instead.

The RDB database utilities for Perl deserve an honorable mention.
The RDB package is complete, simple to use, and very easy to set
up. The author of this package is Walt Hobbs; he can be reached
at hobbs@rand.org.

The source and documentation are found in the file RDB-2.5k.tar.Z
in the CPAN modules directories. There is a file named Info.RDB
in the package that provides a short overview of how the RDB package
organizes its data and what commands are available for you to
use. The Info.RDB file also
lists the origins of the RDB package, describes how to use it,
and provides a sample data file.

The operators in the RDB package are Perl scripts that use standard
I/O for UNIX to operate on ASCII files. The RDB package stores
data as rows and columns in text
files. Each row contains items separated by tabs, and each row
is terminated by a newline character. (The field separator is
a tab, and the record separator, therefore, is the newline character.)
The first two rows of the text file define each column's name
and format.

Because the data is stored in such a format, it's easy to access
the data using programs other than those supplied with the RDB
package. You can use Perl or awk scripts to get what you want
if the RDB programs do not give you what you need. The RDB operators
are only Perl scripts that you can use as a basis for writing
your own extensions.

The operators in the package include listing by row, listing by
column, merging tables, and printing reports. All operators in
the RDB package read from standard input and write to standard
output.

Consider the following sample data file. There are four columns
in the data file. Each column has a heading and type of data associated
with it. The comments with #
in the front of the line are ignored. The first uncommented row
contains the name of fields per column. The row immediately after
that stores the type of data. 4N
means that LINE and WORD
are four-digit wide numbers. (S
specifies a string and M
stands for month.) A digit by itself is a string; therefore, NAME
fields are eight characters wide.
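
Because the layout is plain text, a table in this format can be read
with a few lines of Perl. The following is a minimal sketch that does
not use any of the RDB operators; the table contents and the column
names NAME and COUNT are made up for the example.

```perl
use strict;
use warnings;

# A made-up table in the layout described above: comment lines, a row of
# column names, a row of column definitions, then tab-separated data.
my $table = join "\n",
    "# sample table",
    "NAME\tCOUNT",
    "8\t4N",
    "alpha\t12",
    "beta\t7",
    "";

my (@names, @types, @rows);
for my $line (split /\n/, $table) {
    next if $line =~ /^#/;                 # comments are ignored
    my @fields = split /\t/, $line;
    if    (!@names) { @names = @fields }   # first row: column names
    elsif (!@types) { @types = @fields }   # second row: column definitions
    else {
        my %row;
        @row{@names} = @fields;            # hash slice: name => value
        push @rows, \%row;
    }
}

my $total = 0;
$total += $_->{COUNT} for @rows;
print "rows: ", scalar(@rows), ", total COUNT: $total\n";
```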

You can join two tables together on a per-column basis to get
a merged table. The command to do this is jointbl.
The -c option does the merge
for you on a per-column basis. Consider the two files p1.dat
and p2.dat with a common
column of NAME in each file.
The merged output is shown with this command:

jointbl -c < p1.dat NAME p2.dat

The use of the jointbl command
is shown in the following input/output example:

Other more powerful features of this RDB package are listed in
the RDB.ps PostScript file.
It prints out to a 72-page manual with examples and details on
all of the commands available for you. If you do not feel like
killing a tree, you can use ghostview
to view the file.

Perl supplies a flat database package in the DBM utilities. The
modules allow Perl scripts to map hashes to disk files for storage.
For most users, storing data with DBM utilities is sufficient.
Some DBD:: and DBI::
modules are available as front-ends to commercial databases such
as Oracle, Sybase, Informix, and Quickbase. You need the specific
database and database engine installed on your machine in order
for the DBD:: code to work.
The RDB package provides a text file-based relational database
management system. Utilities in the RDB package are a set of Perl
programs that operate on rows and columns of data files.

This chapter offers a brief introduction to the HyperText Markup
Language (HTML) and the Common Gateway Interface (CGI). The information
in this chapter provides the basis for the rest of the chapters
about Web pages in this book, especially for the topic of writing
CGI scripts in Perl. This chapter assumes that you have a cursory
knowledge of what the World Wide Web (WWW) is about and how to
use a browser.

I also assume you're somewhat familiar with HTML code. Going into
more detail about HTML programming would cause us to move too
far away from the scope of the book: Perl programming. Therefore,
I stick to the very basic HTML elements for text formatting and
listing items rather than covering a lot of HTML programming issues.

Reading this one chapter won't make you a Webmaster, but you'll
learn enough to create Web pages you can subsequently use in conjunction
with Perl scripts. With these basics, you'll be able to easily
incorporate other HTML page-layout elements in your documents.

If you are not familiar with HTML or would like more information,
don't worry. There are several documents on the Internet that
describe how to write HTML pages. For up-to-date documentation
on HTML, conduct a search on the keywords HTMLPrimer and
html-primer in the Internet search areas.

For more information via printed text books, you might want to
consult these titles:

HTML is the de facto standard language for writing Web pages on
the Internet. HTML documents are written as text files and are
meant to be interpreted by a Web browser. A Web browser displays
the data in HTML pages by reading in the tags around the data.
Web browsers reside on client machines, and Web server daemons
run on Web servers. The protocol used by Web servers and clients
to talk to each other is called the HyperText Transfer Protocol
(HTTP).

An HTML page contains uniform resource locators (URLs) in addition
to the tags. A URL tells the browser where to get certain data.
URLs can point to other Web documents, FTP sites, Gopher sites,
and even executable programs on the server side. The Common Gateway
Interface (CGI) is the standard used to run programs for a client
on the server.

A Web server gets a request for action from the browser when the
user selects the URL. The request is processed by the server by
running a program. The program is often referred to as a CGI script
because a lot of programs for handling CGI requests are Perl scripts.
The results of the CGI script are sent back to the browser making
the request. The browser displays the results back to the user.
Results can be in plain text, binary data, or HTML documents.

The browser reading the output from the CGI script has to know
the type of data it is receiving. The type of information is
sent back as a Multipurpose Internet Mail Extensions (MIME) header.
For example, to send back plain text, you use "Content-type:
text/plain\n\n" at the start of the document.
To send back HTML data, you use "Content-type:
text/html\n\n".

Note

Using two newlines after the content type is very important. The HTTP
protocol requires a blank line between the headers and the body, which
is why "\n\n" is appended to the Content-type string. In most cases,
"\n\n" works as intended. Sometimes it does not, and the data sent
back to the browser is not shown, because the server or browser
expects line endings of "\r\n" instead of "\n". To allow for
inconsistencies in the way operating systems handle
carriage-return/line-feed pairs, you should use the string "\r\n\r\n".
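
A minimal sketch of a CGI script that follows this advice is shown
here. The page content is made up; the point is the header line, the
blank line that follows it, and only then the document itself.

```perl
use strict;
use warnings;

# Build the response: one header line, a blank line, then the body.
my $eol      = "\r\n";
my $response = "Content-type: text/html" . $eol . $eol
             . "<HTML><HEAD><TITLE>Hello</TITLE></HEAD>"
             . "<BODY><P>Hello from a CGI script.</BODY></HTML>\n";

print $response;
```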

An HTML document uses markup tags to specify special areas of
the text. The format of an HTML document is as follows:

<HTML>
<HEAD>
<TITLE>Title of the page</TITLE>
</HEAD>
<BODY>
The body of the document.
</BODY>
</HTML>

All text for the HTML document is placed between the <HTML>
and </HTML> tags. Inside them there
can be only two pairs of elements: one pair of <HEAD>
and </HEAD> tags, and one pair of <BODY>
and </BODY> tags to
store the text matter for the HTML document. The <HEAD> and
</HEAD> tags enclose header information about the
document, such as its title. The <TITLE>
and </TITLE> tags hold
the string for text in the title bar for your browser and are
the only required element within the <HEAD>
and </HEAD> tags.

Both the <HEAD> and
<TITLE> tags are optional.
However, for compatibility with some browsers, you should include
them. The <BODY> and
</BODY> tags are required
in all cases. Most HTML tags are paired. So if you have <HEAD>,
then you should have </HEAD>.
There are exceptions to this rule. For example, the paragraph
tag <P> and the line
break <BR> tag are
used by themselves and do not require any accompanying </P>
or </BR> tags. (The
</P> tag is sometimes
used to terminate a paragraph, but the </BR>
tag does not exist.)

Tags are not case sensitive, and any formatting in between the
tags is almost always ignored. Therefore, the tag <html>
is the same as <HtMl>
and <HTML>.

It's the presence of <HTML>,
<HEAD>, and <BODY>
tags in the page that distinguishes an HTML page from a simple
text page. Figure 20.1 presents a sample text page which does
not have any formatting on it whatsoever being loaded into an
HTML browser.

Figure 20.1: An unformatted document.

All the text shown in Figure 20.1 is aligned in the way that the
original text document was set up. In some cases, the text would
have been clumped in one long paragraph. Here is the text for
the document shown in Figure 20.1:

[Image] Risks with open contracts
One cent move equals $1120 in your profits.
Therefore a 4 cent move can
either make you a handsome profit or break your bank. A flood
in sugar
growing area may cause prices to drop
sharply. If you are holding a
long contract, this drop in price will have to be covered at the
end of
the trading day or your position will
be liquidated.
[Image] Sugar
Options cost a fixed amount of money.
However, the money spent on an
option should be treated like insurance. No matter where the price
goes
your loss will be limited to the price
of the option. Of course, with
limiting risk you are also limiting profits.

To make the text more presentable, you can add some HTML tags
to the document, as shown in Listing 20.1. First, we'll delimit
the paragraphs with a <P>
tag and then add some headings to it. HTML provides six levels
of headings, numbered H1
through H6. H1
is the top-level heading in a document's hierarchy and H6
is the bottom. Generally, you use H2
headers inside H1 headers,
H3 headers inside H2
headers, and so on. Do not skip heading levels unless you have
a compelling reason to switch heading styles. Use the tags <H1>Text
for heading</H1> for defining a heading.

A sample HTML page is shown in Listing 20.1. See the output in
Figure 20.2.

Your browser will show a different font for each header style.
In this listing, we used two heading levels, <H1>
and <H2>, lines 5 and
8. The required <TITLE>
is also shown and terminated with the </TITLE>
keyword in line 1. The <P>
tag is used to start a new paragraph in lines 4, 5, and 7, for
example. If the <P>
tags were not used, the text would follow in one big clump unless
you were beginning a new header or a horizontal line with the
<HR> tag. See line
15, where the <HR> tag
is used to draw a horizontal line.

Look at lines 12 through 14. Using <BR>
tags, I have created the list of three items using IMG tags.

Note how images are stored and displayed within the document with
the <IMG> tag. Basically,
the IMG tag specifies where an image will be displayed in the
HTML document. The SRC attribute of the IMG tag specifies the
URL for the contents of the image. The URL can be a complete (absolute)
pathname or a relative pathname. An absolute path would describe
the entire URL: http://www.ikra.com/mydirectory/red-ball-small.gif,
whereas the relative pathname will assume the default directory
for images on the server and pick the image called red-ball-small.gif
from that directory. By using image files, you can enhance the
appearance of items in a list and provide your own bullets. Consider
this reference in line 12:

<IMG SRC="red-ball-small.gif">
<B>Sugar Contracts</B><BR>

The file red-ball-small.gif
is shown next to the text Sugar Contracts
using the relative URL scheme.

There are several more attributes to an IMG tag allowing you to
align text around the image, size the image, provide a mapping
of URLs to go to if the user clicks on a section of the image,
and so on. Be sure to check the references provided in the "HTML,
CGI, and MIME" section for more information on how to use
IMG tags.

HTML supports the following types of list formats with which you
can show information:

Unnumbered lists (items in <UL>
and </UL> tags) where
the items in a list are not numbered.

Numbered lists (items in <OL>
and </OL> tags) where
each item in a list is numbered.

Definition lists (<DT> and <DD> items
inside <DL> and </DL> tags) where
each item in the list is a term followed by a descriptive paragraph
that applies to that term.

Items in a list are separated by <LI>
tags. The </LI> tag
is not needed to stop one item and start another. In Listing 20.1,
lines 17 to 22 provide an unnumbered list with the use of <UL>
and </UL> tags. To make
this a numbered list, simply change the <UL>
and </UL> pair to an
<OL> and </OL>
tag, respectively, and all the items in the list will be numbered.

A definition list has two items: a <DT>
tag with a term and the definition of the term with
<DD>. The <LI>
and <DD> items can
contain multiple paragraphs with the use of the <P>
tag. Lines 27 through 38 in Listing 20.1 show how to create a
simple definitions list.

You can even nest lists within lists. Just remember to end each
list completely within the list item of the parent list. See the
sample HTML code in Listing 20.2.

The top-level numbered list starts at line 5 and ends at line
29. The items of the numbered list begin in lines 6, 13,
19, and 24. The nested lists are unnumbered lists enclosed
in <UL> and </UL>
tags in lines 8 to 12, 14 to 18, 20 to 23, and 25 to 28. Each
<UL> and </UL>
pair is closed within its own item of the top-level
numbered list.
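Because this book is about Perl, here is a minimal sketch of how a script might generate such a nested list. The subroutine name nested_list and the sample item names are made up for illustration; they do not come from Listing 20.2.

```perl
use strict;
use warnings;

# Render a numbered list whose items each carry a nested unnumbered list.
# $outline is an array reference of [ $title, [ @subitems ] ] pairs.
sub nested_list {
    my ($outline) = @_;
    my $html = "<OL>\n";
    foreach my $entry (@$outline) {
        my ($title, $subitems) = @$entry;
        $html .= "<LI>$title\n";
        # The nested list opens and closes inside its parent item.
        $html .= "  <UL>\n";
        foreach my $subitem (@$subitems) {
            $html .= "  <LI>$subitem\n";
        }
        $html .= "  </UL>\n";
    }
    $html .= "</OL>\n";
    return $html;
}

# Hypothetical sample data.
print nested_list([
    [ 'Commodities', [ 'Sugar', 'Coffee' ] ],
    [ 'Currencies',  [ 'Yen', 'Franc' ] ],
]);
```

Note that each nested list is closed before the next top-level item begins, which is the rule described above.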

Basic HTML tags such as <HEAD>,
<BODY>, and <TITLE>
are supported by virtually all browsers. However, you should
keep in mind that not every tag is supported by every Web browser.
If a browser does not support a tag, it simply ignores the tag.

You can emphasize text in the document so that it stands out when
displayed by a browser. For example, to underline text, use <U>
and </U> to enclose
the text. For bold text, enclose the text in <B>
and </B> tags. For
italics, use <I> and
</I>. See the sample
HTML code in Listing 20.3.

Line 6 defines a centered line of text with the <Center>
and </Center> tags.
Note how the <H1> and
</H1> tags are enclosed
along with the text being centered. Line 7 draws a horizontal
line. (Remember that HR stands for Horizontal Rule.) The <b>
and </b> tags are used
in lines 10 and 11 to embolden some text. Since HTML is case insensitive,
we could have used the tags <B>
and </B> with no loss
of functionality. Use of the <EM>
and <STRONG> tags is
shown in lines 14 and 15, respectively.

Note that the effect of <B>
lasts until the </B>
is seen, so the text being set to a style can be typed in across
several lines. Lines 19 through 26 center several lines of text in this way.

In HTML documents you should use <STRONG>
and <EM> tags instead
of <B> and <I>
tags, respectively. There are occasions where using even seemingly
innocent tags such as <B>
for bolding text or <I>
for italicizing text may not work. For example, if a browser is
using a font that does not have an italics character set, the
<I>text</I> will
be printed with its codes; that is, it will appear as <I>text</I>
and not as italicized text. To
take care of some of these issues, should they arise, HTML authors
prefer to use the <EM></EM>
emphasis tags in place of the <I></I>
tags and the <STRONG></STRONG>
tags in place of the <B></B>.

An HTML document defines only the display style of its text through
markup tags. How the information is actually displayed is left to
the Web browser. The client's browser is responsible for rendering
the text and may use a completely different font than the one you
used to create the page. For one thing, users can choose whatever
font they desire for their browser. Even if users don't explicitly
choose a font, the browser may default to a font that is different
from the one you use. For example, one user may use Microsoft
Internet Explorer with a 10-point font, and another user may use
Netscape Navigator with a 12-point font. Each user will see
paragraphs of different lengths and a different text layout. For
best results, try viewing your HTML page with several different
browsers.

Simple text formatting with carriage returns does not work in HTML
documents, because the client browser wraps text on its own. You
can force paragraph and line breaks
with the <P>
or <BR> tags. The <BR>
tag creates a line break but, unlike the <P> tag, does not start
a new paragraph. A browser may lay out a paragraph started by
<P> with its own spacing and text alignment, and
different types of browsers may do so differently. A <BR>
tag simply starts a new line without changing the current paragraph
format. Without these tags, every run of white space (including
spaces, tabs, and blank lines) is collapsed into a single
space. Listings 20.4 and 20.5 provide samples of two HTML files
that produce the same output.

Listing 20.5. The second version of the HTML file shown in
Listing 20.4.

<HTML><HEAD><TITLE>A simple HTML file</TITLE>
</HEAD>
<BODY>
<H1>This is Header Level 1</H1> <H2>This is Header Level 2</H2>
<H3>This is Header Level 3</H3> <H4>This is Header Level 4</H4>
<H5>This is Header Level 5</H5> <H6>This is Header Level 6</H6>
This is line 1 This is line 2<P> This is line 3<P>
This is a broken line with an &lt;BR&gt;<BR> sign. <P> End of File Here.
</BODY></HTML>
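To see why two differently laid out files can produce the same output, here is a rough Perl approximation of how a browser treats white space outside preformatted text. This is only a sketch: real browsers do considerably more, and the subroutine name collapse_whitespace and the sample strings are invented for this illustration.

```perl
use strict;
use warnings;

# Approximate how a browser treats white space outside <PRE>:
# any run of spaces, tabs, and newlines becomes a single space.
sub collapse_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;   # trim the ends for easier comparison
    return $text;
}

# The same text typed on one line and spread over several lines.
my $one_line   = "This is line 1 This is line 2<P> This is line 3<P>";
my $many_lines = "This is line 1\n\nThis is   line 2<P>\n\tThis is line 3<P>\n";

print "same rendering\n"
    if collapse_whitespace($one_line) eq collapse_whitespace($many_lines);
```

Only the <P> and <BR> tags survive the collapsing, which is why they are the only reliable way to force breaks.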

Figure 20.3 shows the output from both listings. Note how heading-level
1 and heading-level 2 are shown in the same font style in this
figure. The example shown here uses Netscape as the browser. The
results on your browser might be different because each browser
displays HTML in the way it chooses. It's a bit like the contents
of a box shifting during transport but the weight of the contents
has not changed.

Figure 20.3: Using different heading levels.

For the sake of readability, it's best to place headings and paragraphs
on separate blank lines along with <P>
tags. You can also specify formatting options for paragraph alignment
in the <P> tag, as
illustrated in Listing 20.5. The format specified for <P>
will continue until the next <P>
tag. To terminate a particular format you can use the </P>
tag.

Having the browser format text for you does keep you from a lot
of the hassle of tracking line breaks, paragraph formatting, and
so on. However, when displaying text that is already laid out
(such as source code), the default formatting can wreak havoc
on your source listings. For this reason, HTML documents have
the preformatted (<PRE>)
tag. Using the <PRE>
tag turns off the HTML formatting at the browser. Listing 20.8
contains an HTML file that illustrates using this tag. The output
is shown in Figure 20.5.

By now you have seen that the greater than symbol and the less
than symbol are used to mark up documents. What if you wanted to
include them in text that was not in preformatted tags? HTML
provides escape sequences for these special symbols. To include
one of the symbols in a file, you have to enter its code in the
HTML file. Four of the most common special
characters in HTML and their codes are

&lt; as the escape
sequence for <.

&gt; as the escape
sequence for >.

&amp; as the escape
sequence for the ampersand (&).

&quot; as the escape
sequence for the double quote (").

So, to show this string in an HTML page, <HELLO>,
you would use the text statement &lt;HELLO&gt;
in your code. Note that the semicolon is required in each code.
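When a Perl script generates HTML, it is convenient to apply these four codes automatically. The following is a minimal sketch; the names %ESCAPE and escape_html are made up for this example and are not part of any standard module.

```perl
use strict;
use warnings;

# The four special characters and their escape sequences.
my %ESCAPE = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# Replace each special character in a single pass, so that the
# ampersands introduced by the codes are not escaped a second time.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"])/$ESCAPE{$1}/g;
    return $text;
}

print escape_html('<HELLO>'), "\n";   # prints &lt;HELLO&gt;
```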

URL stands for uniform resource locator. Basically, it's
an extension of a file name that includes the network address of
a "thing" and where that "thing" may
be found. The "thing" your URL points to may be
a file, directory, database, image, newsgroup, archive, and so
on, on a machine residing anywhere on the Internet. It's
up to the browser to show that data to you.

A URL consists of three parts, plus an optional port number:

service://sitename[:port]/path/filename

The service part tells you
which service you are trying to access. The sitename
is the Internet address of the machine where the service is
being sought. The port
number is optional; if you leave it out, the default port for the
service is used (port 80 for HTTP). The path/filename
is the location of the file relative to the root directory for
the server.

The services your browser can offer will vary. Here are the most
common types of service you can use with a Web browser:

Http for perusing an
HTML document

Gopher for starting a
gopher session

Ftp for starting an ftp
session

Telnet for starting a
telnet session

File for getting a local
file

News for a Usenet newsgroup

Here are some examples of URLs:

ftp://pop.ikra.com/pub/perl5/VRML.pm This
URL specifies a file to get via ftp
from the /pub/perl5 directory.

http://www.ikra.com/iistv.html This
URL specifies an HTML document on the node www.ikra.com.

news://ikra.com/newsgroups/comp.dcom.telecom This
URL specifies a newsgroup to get on ikra.com.
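The parts of a URL can be pulled apart with a regular expression. The sketch below handles only the simple service://sitename[:port]/path/filename form shown in this section; the subroutine name split_url and the hash keys are assumptions made for the example, and a production script would normally rely on a tested URL-handling module instead.

```perl
use strict;
use warnings;

# Split a URL of the form service://sitename[:port]/path/filename.
# Returns an empty list if the string does not fit that pattern.
sub split_url {
    my ($url) = @_;
    if ($url =~ m{^([A-Za-z][A-Za-z0-9+.-]*)://([^/:]+)(?::(\d+))?(/.*)?$}) {
        return (
            service  => lc($1),
            sitename => $2,
            port     => $3,                      # undef when no port is given
            path     => defined($4) ? $4 : '/',
        );
    }
    return;
}

my %parts = split_url('ftp://pop.ikra.com/pub/perl5/VRML.pm');
print "$parts{service} $parts{sitename} $parts{path}\n";
```

Leaving the port undefined when it is absent lets the caller decide which default applies to the service in question.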

To specify URL links to other documents, use the anchor tags,
<A> and </A>.
Anchor tags are used to provide links to other documents as well
as provide a location in a document that another HTML document
can refer to. Please check the references in this chapter if you
are not familiar with using HTML anchors. Here's the format for
using anchors for creating links to other documents:

<A HREF="URL">Text describing
where the URL points to</A>

The HREF token specifies
the URL to use. The text between the first ending >
and the start of </A>
is shown by the browser in a different color. The space between
<A and HREF
is required.

Here is a sample hypertext reference in an HTML document. For
clarity, I refer to the document that specifies the hyperlink
as the first document, and the location to which the URL points
as the second document. Let's look at the URL:

<A HREF="http://pop.ikra.com/iistv.html">IISTV</A>

The browser showing this link will underline the word IISTV
and show it in a different color than regular text.
Clicking on the word IISTV
will cause the browser to get the file iistv.html
from pop.ikra.com.

You normally use the absolute pathname of the file in a URL. Pathnames
always use the standard UNIX syntax (/),
not the DOS syntax (\), for
delimiting directory names in a path. Relative pathnames are also
possible if the anchor being specified happens to reside in the
same directory tree on the same machine as the original document.

Use relative links if you are likely to move directories around.
Moving the top-level file's location will move the entire tree
with it. On the other hand, a full pathname makes an explicit
link to the file regardless of where other files are moved to.

Anchors can also be used to move to a particular part within a
document. Suppose you want to set a link from one document
(index.html) to a section
in another document (text.html).
First, in the text.html
file, define a named anchor for the topic
like this:

<A NAME = "Daumesnil">Daumesnil
Station</A><P>
The station to go to the Zoo from....

In the index.html file, you
create the link to the named anchor in the text.html
file. A link to a named anchor uses the name of the
file, followed by the hash mark (#)
and the name of the anchor:

Get off on <A HREF = "text.html#Daumesnil">the
Daumesnil stop</A>
and go two stations west to Port Doree.

Now when the user clicks on the words the
Daumesnil stop, the browser loads text.html
and goes to the place where the named anchor for Daumesnil is located.
To go to a location within the same file, skip the filename portion
and use only the hash mark with the anchor name.
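If your pages are generated by a Perl script, both kinds of anchor can be produced by small helper subroutines. This is only a sketch; named_anchor and link_to are hypothetical names chosen for the example.

```perl
use strict;
use warnings;

# Mark a spot in a document so that links can point at it.
sub named_anchor {
    my ($name, $text) = @_;
    return qq{<A NAME="$name">$text</A>};
}

# Link to a file, to a named anchor in a file, or (with an empty
# file name) to a named anchor in the current document.
sub link_to {
    my ($file, $name, $text) = @_;
    my $href = defined($name) ? "$file#$name" : $file;
    return qq{<A HREF="$href">$text</A>};
}

print named_anchor('Daumesnil', 'Daumesnil Station'), "\n";
print link_to('text.html', 'Daumesnil', 'the Daumesnil stop'), "\n";
print link_to('', 'Daumesnil', 'back to Daumesnil'), "\n";
```

The last call shows the same-file case: with no filename, the link consists of the hash mark and the anchor name only.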

A request from a client browser at the Web server is handled by
the httpd daemon. If the
request is to run a CGI program, the server will run the program
for you and return the results of the program back to the client
browser. Input to the CGI program being run is sent either via
the command line or through environment variables. The CGI program
can be in any language, but it has to be able to process input
in this way. The output from the program is generally to standard
output. Therefore, a CGI program can be a compiled executable,
or a Perl, shell, or awk script, and so on. Naturally, because
this book is about Perl, the CGI scripts I discuss here will be
in Perl.

CGI scripts are similar to the scripts you would write to handle
the command-line interface but with one glaring exception: You
cannot send command-line options to CGI scripts. CGI uses the
command line for other purposes, so it's impossible to send arguments
to the script via the command line.

CGI uses environment variables to send your CGI program its parameters.
Environment variables are useful in CGI scripts in that any child
processes created by a CGI script also inherit the values of the
parent's environment variables. Some of the main environment variables
used by CGI scripts are listed in Table 20.1. The asterisk (*)
indicates that the variable might not be supported on all servers,
so use it with caution.

Table 20.1. CGI environment variables.

Variable            Description

AUTH_TYPE           The authorization type
CONTENT_LENGTH      The size, as a decimal number of octets (8-bit bytes), of any attached entity
CONTENT_TYPE        The MIME type of an attached entity
DOCUMENT_ROOT       The root directory for your server's documents
DOCUMENT_URL        The URL for your client's document
DOCUMENT_NAME*      The name of your client's document
DATE_LOCAL          The date and time, local to the server
DATE_GMT            The date and time in GMT
GATEWAY_INTERFACE   The server's CGI specification version
HTTP_(string)       The client's header data
PATH_INFO           The path to be interpreted by the CGI script
PATH_TRANSLATED     The virtual to physical mapping of the path
QUERY_STRING        The URL-encoded search string
REMOTE_ADDR         The IP address of the client
REMOTE_HOST         The fully qualified domain name of the client
REMOTE_IDENT        The identity data of the client
REMOTE_USER         The user ID sent by the client
REQUEST_METHOD      The request method sent by the client
SCRIPT_NAME         The URL path identifying the CGI script
SERVER_NAME         The server name (the host name of a DNS entry)
SERVER_PORT         The port at which the request was received
SERVER_ROOT*        The root directory for CGI binaries at the server
SERVER_PROTOCOL     The name and version number of the request protocol
SERVER_SOFTWARE     The name and version number of the server software

Let's briefly cover some of these environment variables and how
your CGI script uses them. Keep in mind that only a few of these
variables are guaranteed to be set at any one invocation. Not
all of these variables are even set by all servers, so check your
documentation (usually a README
file of sorts) if you do not have the NCSA server. This book deals
with the NCSA server, which sets most of these variables.

To make things easier, you probably will not even use most of
the environment variables all the time. The most often used ones
tend to be QUERY_STRING,
REQUEST_METHOD, and PATH_INFO.
However, I cover some of the others just so you are aware of them
and what they do.

The CONTENT_LENGTH variable is set to the number of bytes of data
attached to the request. If there is no attached data, this is set to NULL.
In Perl, this is interpreted as zero. For example, for a string
of "x=10&y=10",
the variable is set as CONTENT_LENGTH=9.

The CONTENT_TYPE variable
is set to the type of MIME entity (data) being sent to the CGI
script. If there is no data, this variable is set to NULL,
as well. The MIME types are defined in HTTP 1.0 (and later versions).
With MIME types you can represent different types of data: video,
audio, images, or just plain text. Each type of data has its own
format: for example, GIF files are in the GIF format, text files
are in ASCII, and so on. For a GIF file, the CONTENT_TYPE
variable is set as follows:

CONTENT_TYPE = image/gif

A client may specify a list of the types of acceptable media in
its request header. This allows the server some flexibility in
how it returns data. Most types of data are registered with the
Internet Assigned Numbers Authority (IANA). The most common registered
types and subtypes are listed in Table 20.2. Subtypes and types
are not case sensitive, so GIF is equivalent to gif.

Table 20.2. MIME registered content types used by HTTP.

Type          Subtypes                              Description

text          plain                                 ASCII text information
image         jpeg, gif                             Supported formats
audio         basic                                 Audio information (includes .au format)
video         mpeg                                  Video data
application   octet-stream                          Application-specific data
message       rfc822, partial, external-body        An encapsulated message
multipart     mixed, alternative, digest, parallel  Multiple parts of individual data types

The types shown in Table 20.2 are by no means complete. Servers
may support more types and subtypes. The application type describes
transmitted application-specific data. This is generally binary
data that is not interpreted by the server. Here is the usual
setting for such data:

CONTENT_TYPE=application/octet-stream

The text type describes textual data. The primary and default
subtype is plain. Here is
the general setting for this variable:

CONTENT_TYPE=text/plain

With the introduction of "rich text" as specified by
RFC 1521 it may be necessary to specify the character set being
used. RFC 1521 is the Internet standard for defining internet
MIME formats. For more information on RFC 1521, check out the
HTML page on http://www.cis.ohio-state.edu/htbin/rfc/rfc1521.html.

Therefore, the setting for sending a plain text file via CGI would
be the following:

CONTENT_TYPE=text/plain; charset=us-ascii
A blank line here is mandatory per the standard

Note that each content type must be followed by a mandatory blank
line as required by the HTTP/CGI specifications. The image Content-Type
describes image data. The two well-known image
types are JPEG and GIF. The format for a GIF would be

CONTENT_TYPE=image/gif
A blank line here is mandatory per the standard

The video Content-Type describes
video data in the MPEG format. For audio, the Content-Type
would be in the basic format. Extended (nonstandard) subtypes
have an x- prefix in front of the
subtype name.
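A CGI script that needs to act on the content type has to separate the type, the subtype, and any parameters such as charset. The following sketch shows one way to do that; parse_content_type is a name invented for this example, and the code assumes a simple, well-formed value without quoted parameters.

```perl
use strict;
use warnings;

# Split a CONTENT_TYPE value into type, subtype, and parameters.
# Types and subtypes are not case sensitive, so both are lowercased.
sub parse_content_type {
    my ($value) = @_;
    my ($media, @params) = split /\s*;\s*/, $value;
    my ($type, $subtype) = split m{/}, lc($media), 2;
    my %param;
    foreach my $pair (@params) {
        my ($name, $val) = split /=/, $pair, 2;
        $param{ lc($name) } = $val;
    }
    return ($type, $subtype, \%param);
}

my ($type, $subtype, $param) = parse_content_type('text/plain; charset=us-ascii');
print "$type/$subtype charset=$param->{charset}\n";
```

In a running CGI script the value would come from $ENV{CONTENT_TYPE} rather than from a literal string.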

The PATH_INFO variable represents additional path information. It describes
a resource to be returned by the CGI application once it has completed
its execution. The value of the PATH_INFO
variable can be a URL, a string provided to the CGI script, or
some other client information. It is up to the handling CGI script
as to how to use this information (if the information is needed).
If this information is a valid URL, the server may interpret the
URL before passing it to the CGI application.

The QUERY_STRING variable
is defined as anything that follows the first question mark (?)
in a URL. This information can be added either by an ISINDEX
document or an HTML form (with the GET
action). This string is encoded in the standard URL format:
spaces are changed to +, individual
assignments are separated by ampersands, and special characters
are encoded with %XX hexadecimal encoding.
You will need to decode the value of QUERY_STRING
before you can use it.

Assignments to values in an HTML FORM
are done in the form x=value.
Spaces are converted into plus (+)
signs. So an assignment of the form x=hello
world, will be converted to x=hello+world.

Individual assignments are separated by ampersands (&).
The equal (=) sign, the plus
(+) sign, and the ampersand
(&) may only be included
in a value in encoded form (by using the %XX
hexadecimal encoding, for example). Therefore, these assignments, x=hello
world, a=2,
and y=I am here,
are encoded as this:

x=hello+world&a=2&y=I+am+here
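The encoding rules just described can be sketched in Perl as follows. The subroutine names url_encode and build_query are made up for this example, and the choice of which characters to leave unencoded (letters, digits, and a few punctuation marks) is an assumption; browsers differ slightly in the set they use.

```perl
use strict;
use warnings;

# URL-encode one name or value: keep letters, digits, and a few safe
# characters, write everything else except spaces as %XX, and then
# turn the spaces into plus signs.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^-A-Za-z0-9_.~ ])/sprintf('%%%02X', ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Join name/value pairs into a single query string.
sub build_query {
    my @pairs = @_;          # flat list: name, value, name, value, ...
    my @out;
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        push @out, url_encode($name) . '=' . url_encode($value);
    }
    return join '&', @out;
}

print build_query(x => 'hello world', a => 2, y => 'I am here'), "\n";
# prints x=hello+world&a=2&y=I+am+here
```

Because the plus sign, the equal sign, and the ampersand inside a value are written as %2B, %3D, and %26, they cannot be confused with the separators.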

The server is responsible for packing the data and setting the
environment variables before running the CGI script. Information
being passed to the CGI script can be passed in one of two ways:
as part of STDIN (the Perl
handle for standard input)
or by setting the value in QUERY_STRING.
The results from the execution are sent back from the CGI script
to STDOUT (the default output
file handle for Perl).

When data is passed in via STDIN,
the CONTENT_LENGTH and CONTENT_TYPE
variables are set to reflect the length and type of data, respectively.
This kind of input is the result of the POST
method at the client. For the GET
method of requesting data, the variable QUERY_STRING
is set to the values.
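A script therefore has to look at REQUEST_METHOD to decide where its input is. Here is a minimal sketch of that decision. The subroutine name raw_input is hypothetical, and the environment and input handle are passed in as arguments (rather than read directly from %ENV and STDIN) only so that the subroutine is easy to try out by hand.

```perl
use strict;
use warnings;

# Return the raw, still-encoded input for a request.
# $env is a reference to a hash of environment variables (normally \%ENV)
# and $in is the handle to read POST data from (normally \*STDIN).
sub raw_input {
    my ($env, $in) = @_;
    my $method = uc($env->{REQUEST_METHOD} || 'GET');
    if ($method eq 'POST') {
        # POST: read exactly CONTENT_LENGTH bytes from the input handle.
        my $length = $env->{CONTENT_LENGTH} || 0;
        my $data = '';
        read($in, $data, $length) if $length > 0;
        return $data;
    }
    # GET: the data is in QUERY_STRING.
    my $query = $env->{QUERY_STRING};
    return defined($query) ? $query : '';
}

my $input = raw_input(\%ENV, \*STDIN);
print "Content-type: text/plain\n\n";
print "Raw input: $input\n";
```

The sketch assumes that a single read call returns the whole body, which is reasonable for small forms but may need a loop for large uploads.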

Incoming data takes the form of a single encoded input string. Assignments
to values are done in the form x=value.
Spaces are converted into plus (+)
signs. Individual assignments are separated by ampersands (&).
The equal (=) sign, the plus
(+) sign, and the ampersand
(&) may only be included
in a value in encoded form (by using the %XX
hexadecimal encoding, for example). Therefore, two assignments, x=1
and "y=42 and 32",
are encoded as this:

x=1&y=42+and+32
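Going the other way, a CGI script has to undo this encoding before it can use the values. The sketch below shows the idea; url_decode and parse_query are names chosen for this example, and the code keeps only the last value when a name appears more than once.

```perl
use strict;
use warnings;

# Decode one name or value: plus signs become spaces first,
# then each %XX sequence becomes the character it stands for.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Split an encoded input string into a hash of names and values.
sub parse_query {
    my ($query) = @_;
    my %field;
    foreach my $pair (split /&/, $query) {
        next unless length $pair;            # skip empty pieces
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        $field{ url_decode($name) } = url_decode($value);
    }
    return %field;
}

my %field = parse_query('x=1&y=42+and+32');
print "x is $field{x}, y is $field{y}\n";   # prints x is 1, y is 42 and 32
```

The order of the two steps in url_decode matters: converting plus signs first means that an encoded %2B correctly ends up as a literal plus sign.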

There is a method to send data to a CGI script with the command
line, provided you have the support in the server and your data
traffic is not large. Command-line interfaces tend to be restricted
by the number of bytes available for the interface. The method
of passing data via the environment variables is considered safer
and is the preferred way of passing data.

A CGI script must always return something back to the client.
This is a requirement; otherwise, the client at the other end
might hang for a long time. The standard output from the CGI program
is the output that the script writes to STDOUT.
Output is returned in one of two ways: a nonparsed header output
or a parsed header output. In the nonparsed header output, the
script sends back a complete HTTP message itself. In the parsed
header output, the script sends a short header (giving the content
type of the message body, or pointing to another URL) followed by
the body of the message, and the server completes the rest of the
HTTP message. The latter method is the preferred way of sending
data back from a server to a browser.

CGI programs can return a number of document types back to the
client. The CGI standard specifies what format of data to expect
at the server via the MIME header. The most common MIME types
are text/html
for HTML, and text/plain
for straight ASCII text.

For example, to send back HTML to the client, your output should
look like the example shown here:
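A minimal sketch of a script that produces such output is shown next. The subroutine name html_response and the page contents are invented for the example; the essential parts are the Content-type line and the blank line that follows it.

```perl
use strict;
use warnings;

# Build a complete parsed-header response: the content type,
# the mandatory blank line, and then the HTML document itself.
sub html_response {
    my ($title, $body) = @_;
    return "Content-type: text/html\n\n"
         . "<HTML><HEAD><TITLE>$title</TITLE></HEAD>\n"
         . "<BODY>\n$body\n</BODY></HTML>\n";
}

$| = 1;   # send output to the client without buffering
print html_response('Hello', '<H1>Hello from a CGI script</H1>');
```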

The CGI script in Listing 20.8 shows how to display information
about the environment variables in use. The best way to learn
is by example, and this sample script will show you how to write
a very basic CGI script in Perl.

Line 3 forces the data written to STDOUT
(the default) to be sent back to the client immediately. Make
it a habit to do this for all your CGI scripts. Flushing the
output is necessary when processing large amounts of data, because
the buffering of the output may otherwise delay the transfer
of text to the client.

Line 4 returns the MIME type of document back to the client. Note
the two carriage returns at the end of the print
statement. The client expects an extra blank line.

Line 5 simply prints an output header for the text output. Line
6 checks if the input is coming from the standard input (STDIN)
or if it should be gathered from the command line. The command-line
parsing is shown only for use in this example. In Chapter 22,
"Using HTML FORMs,"
and Chapter 23, "HTML with Perl Modules,"
I develop a Perl subroutine to handle this input via the HTML
FORMS. Keep in mind that
the command-line interface is restrictive when large amounts of
data are being passed through. For the moment, this example shows
how the input is set in $ARGC
and the @ARGV array.

Lines 7 and 8 extract the input stream from <STDIN>,
whereas lines 11 through 14 extract incoming arguments from the
command-line interface. The rest of the script echoes the values
of the environment variables set at the time the CGI script
is executed. Let's look at the variables in the output shown in
Figure 20.6. Note that the output is in plain text.
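The heart of such a test script can be sketched in a few lines. This is not Listing 20.8 itself, only an illustration of the same idea; the subroutine name env_report is an assumption made for the example.

```perl
use strict;
use warnings;

# Format environment variables as NAME = value lines, sorted by name.
# $env is a reference to a hash of variables (normally \%ENV).
sub env_report {
    my ($env) = @_;
    my $report = '';
    foreach my $name (sort keys %$env) {
        $report .= "$name = $env->{$name}\n";
    }
    return $report;
}

$| = 1;                                   # flush output immediately
print "Content-type: text/plain\n\n";     # header plus the blank line
print "CGI environment\n\n";
print env_report(\%ENV);
```

Run from a Web server, the report includes the variables from Table 20.1 that the server chose to set; run from a shell, it shows only your login environment.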

Figure 20.6: Environment variables from the test script.

The output in Figure 20.6 does not look very pretty, does it? We
can send back an HTML document, too. Look at Listing 20.9, where
we send back an HTML document. In this listing, the content type
is set to "text/HTML"
because we are sending back an HTML document. The rest of the
script is basically echoing an HTML document back to the client.

The names of the two FRAMEs
in Listing 20.10 are set as TOPICS
and INDEX. The area that
these frames take up on the screen is specified in the FRAMESET
tags. The ROWS attribute
of the FRAMESET specifies
the percentage of screen space that is to be used when rendering
each page. For example, the line

<FRAMESET ROWS="22%,78%">

sets aside the top 22%
of the screen for the first frame to be laid out and 78%
for the other frame. Had we used COLS
instead of ROWS in the specification,
like this:

<FRAMESET COLS="22%,78%">

we would be specifying 22%
of the screen width for the first (leftmost) frame and 78%
for the second frame. More than two frames can be specified as
long as the percentages assigned to all the frames add
up to 100. For example, to set aside three columnar spaces on
a screen, you could use the following line:

<FRAMESET COLS="22%,22%,*">

The first two frames would be given a column whose width is 22%
each of the total space available. The asterisk in the third column's
space indicates to the browser to use the rest of the available
space for the third column.

Also, like lists, frames can be nested within other frames by
nesting a <FRAMESET></FRAMESET>
tag pair in other FRAMESET
tags. For more information on using FRAMES,
check out the reference books listed earlier in this chapter.

In Listing 20.10, the TOPICS
frame occupies the 22% top
area, and the INDEX frame
occupies the bottom 78%.
The URLs in the INDEX frame
use the name TOPICS of the
window and the TARGET option
to display text in the TOPICS
window frame. See Listing 20.11.

The TARGET attribute sets
the frame that will be used to lay out the contents of what the
HREF attribute points to.
When you click on any of the links defined in the top row of the
HTML page, the contents of the URL are laid out in the bottom
frame. Therefore, the top serves as a table of contents, and the
bottom serves as the viewing area for the HTML page.

This chapter has been a whirlwind introduction to writing HTML
pages and interfacing with CGI scripts. A book devoted to these
topics would probably give you more detailed information. However,
you now have enough information to start writing your own Web
pages. In Chapters 22 and 23
I cover how to use HTML tables and forms with Perl CGI scripts.

CGI scripts are the workhorse behind the HTML pages. Input is
sent from a client via a server to a CGI script and replies are
expected back even if commands are not successful. Replies can
take the form of text, data, images, HTML pages, and so on. Output
from the CGI script is sent to its standard output and is passed
back to the client via the server. Arguments to CGI scripts are
passed via environment variables because this is an easy way to
pass values between child and parent processes. CGI scripts do
not have to be written in Perl, but it's easier to manipulate
strings in Perl than it is in C or other languages.