A beginner's guide to CGI scripting

This web page assumes that the reader is fluent in Hypertext Markup Language (HTML),
with a passing knowledge of JavaScript,
and some acquaintance with Perl. In particular, familiarity with HTML forms
is assumed. If you don't understand Perl at all, find Robert's
Perl tutorial using e.g. Google, slog through it a few times,
get hold of Perl for your system and write a few simple programs. Then
read on..

First, some terminology

As with all fairly technical subjects, there are lots of words and abbreviations. Here
we will examine just a few of the important ones..

HTTP is the Hypertext Transfer Protocol. We'll learn quite a lot
about how the Web works as we go through the following page. HTTP
is about moving messages around the Internet.

A "user agent" is simply software that is used to browse the Web.
Another name for such browser software is "the client";

The web server that provides web pages is simply "the server",
although you will come across the confusing abbreviation HTTPD (the
d is for daemon - regard the daemon as a sort of agent, not somebody
with horns and a tail!);

CR and LF. These are abbreviations for 'carriage return' (hexadecimal
0D) and 'line feed' (hex 0A). Unfortunately, different systems use
either 0A on its own or the two combined to indicate the end of one
line and the beginning of the next. You can create the characters in
Perl as follows:
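A minimal sketch of one way to do it, using hexadecimal escapes so that the values match those given above. The variable names are just for illustration.

```perl
use strict;
use warnings;

# Hexadecimal escapes give the exact character values, whatever the platform.
my $CR   = "\x0D";      # carriage return
my $LF   = "\x0A";      # line feed
my $CRLF = $CR . $LF;   # the two combined

printf "CR = %d, LF = %d, CRLF is %d characters long\n",
    ord($CR), ord($LF), length($CRLF);
```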

RFC. This stands for "Request for Comments", and is always followed
by a number, which defines one of a multitude of Internet standards
(and related documents).
Unfortunately, there is a proliferation of RFCs related to HTTP, so
we'll refer to them again and again!

URI. A Uniform Resource Identifier - either a name (URN) or location
(URL). A URI is simply a formatted string which identifies a resource
on a network. There are mildly complex rules that define the format
of a URI.

A proxy is an 'agent' that forwards a message towards a server.
The proxy usually reformats the request.

Gateway. The central idea behind CGI scripting is
that you can hook your web browser into a gateway to,
for example, a database.
The gateway will translate the message into something that the
underlying software can understand.
The flipside of this advantage is that you're
'letting the whole world run a program on your system'!

Running CGI scripts on a computer will always potentially
compromise the security of that computer!

A. How a browser normally works!

When we learnt HTML, we didn't realise that we were floating on several
layers of abstraction. For example, consider the following:

Okay, we're connected to the Internet via our Internet Service
Provider. We start up our browser (say, Netscape, or Opera), and
want to fetch a web-page. What is the request that the browser sends
to fetch a web page at say, http://www.anaesthetist.com/index.htm ? Well, here
it is..
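The request is plain text. The sketch below (not a capture of real traffic) uses Perl to assemble a request of the kind discussed in the next few paragraphs. The From and User-Agent values are made up for illustration.

```perl
use strict;
use warnings;

my $CRLF = "\x0D\x0A";    # each header line ends with CR+LF

# Hypothetical header values; only the overall shape matters here.
my @lines = (
    'GET /index.htm HTTP/1.0',
    'From: someone@example.com',
    'User-Agent: ExampleBrowser/1.0',
    'Accept: text/html',
    'Accept: image/gif',
);

# Join the lines, then finish with an empty line to end the header.
my $request = join($CRLF, @lines) . $CRLF . $CRLF;

print $request;
```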

Not what you're used to, is it? The above message is all behind-the-scenes
stuff. First, note that after all the blurb is a blank line. Yes, I
promise you - it's there. This is very important. How important, we'll find
out later!

Next, see how this is an example of a GET request.
As with all such messages, the format is very precisely structured
- it sticks rigidly to rules originally defined in something called RFC 822.
The request is quite specific. It's asking for HTTP (hypertext transfer protocol)
version 1.0, and will only Accept certain types of data in reply.
It's pretty easy to determine what types are being referred to - HTML pages, and GIF
images. These data types are called MIME types
(MIME stands for Multipurpose Internet Mail Extensions - we discuss it
in a little more detail far below).

At this point .. pause .. take a deep
breath. Although all of the above looks like garbage, look once more
at the message, read each line carefully, and see how a little order starts
to appear out of the chaos!

{As an aside, note the From: line, which was a polite little
line found in early requests, now totally obsolete due to the enthusiastic
activities of spammers}.

Here's what might come back from the server that provides the
web page..
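As a sketch of the shape of such a reply (the date, server name and page are invented), the Perl below holds a sample response in a string and splits it at the blank line into header and body.

```perl
use strict;
use warnings;

my $CRLF = "\x0D\x0A";
my $body = "<HTML><BODY>Hello</BODY></HTML>";

# Hypothetical header values, for illustration only.
my $response = join($CRLF,
    'HTTP/1.0 200 OK',
    'Date: Mon, 01 Jan 2001 12:00:00 GMT',
    'Server: ExampleServer/1.0',
    'MIME-version: 1.0',
    'Content-type: text/html',
    'Content-length: ' . length($body),
    '',            # the blank line that ends the header
    $body);

# Everything before the first blank line is header; the rest is body.
my ($header, $content) = split /$CRLF$CRLF/, $response, 2;
my ($status_line) = split /$CRLF/, $header;

print "Status line: $status_line\n";
print "Body:        $content\n";
```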

Note again the header section followed by a blank line.
On the first line, "HTTP/version" is followed by a code (200) that indicates "everything
went fine". The corresponding code for "not found" is the dreaded 404.
A lot of blurb (date, server, MIME version etc) follows.
There certainly is a lot of stuff here, isn't there? The MIME type
(text/html) and length, and so on. After our mandatory little blank
line, the body of the message - an html page.

When you actually get around to writing a CGI script, you'll find
that the above can lazily be abbreviated to just..

Content-type: text/html

<HTML> .. the web page goes here.. </HTML>

The trick is that your CGI script writes such a response, which
is then parsed by the server on which your script lives. The
final product - the plethora of information that you saw above -
is then sent to the user agent.

Soon we will consider a slight variation on the above, where
instead of requesting a web page, the browser asks the server
to run a CGI script! But before we do so, a brief comment
on the very first line of the message..

B. What is HTTP/1.0 ?

HTTP/1.0 is the basis of the World Wide Web.
There is no 'formal' definition or 'standard'
(although HTTP/1.0 accounts for about three quarters of
Internet traffic). The closest we can come is something called
RFC 1945, which was written by Tim Berners-Lee and his colleagues in
May 1996, long, long after he started up the World Wide Web. He
described HTTP as:

".. an application-level
protocol with the lightness and speed necessary for distributed,
collaborative, hypermedia information systems. It is a generic,
stateless, object-oriented protocol which can be used for many tasks,
such as name servers and distributed object management systems,
through extension of its request methods (commands). A feature of
HTTP is the typing of data representation, allowing systems to be
built independently of the data being transferred."

Basically, HTTP is a simple protocol for transferring information.
HTTP makes it easy to transfer not only HTML documents, but a vast
variety of other data. Not only can we retrieve documents, we can also
for example search for information, and talk to a variety of programs
on computers across the 'Net. HTTP is very similar to the
format used for e-mail (RFC 822) and MIME.

The beauty of HTTP/1.0 is that it makes it easy for us to 'talk to'
other Internet-based protocols - there's a host of these, including
SMTP, NNTP, FTP, Gopher, and WAIS. HTTP/1.0 is an excellent negotiator
between these protocols.

Recently a new standard called HTTP/1.1 has been defined (by the IETF, with the
involvement of the World Wide Web Consortium, W3C), which will probably eventually replace 1.0.
Most servers still use 1.0. We discuss HTTP/1.1 below,
as well as taking a brief peek at some features of HTTP/1.0.
(Don't look at these now).

C. What is a CGI script?

A CGI script is really a program. The program lives on a computer
connected to the Internet. The "script" (actually a fully-fledged program)
is usually written in the elegant and satisfying language Perl,
although a host of other languages may be used (TCL, BASIC [Yuk!], C, Ada, .. and so on).
CGI scripts need not necessarily be interpreted scripts; in fact, they
can be compiled programs (for example, C is normally compiled).

Consider someone (let's call him Fred) who
is browsing web pages on the internet. Fred comes across a web page
which contains a link that refers to a CGI script. Fred clicks on the
link. What happens next?

When Fred clicks on the link, certain parameters are passed by his browser
across the Internet, and eventually the CGI script that lives on the server
starts running. What happens next depends very much on the nature
of the script, but commonly:

The script reads the parameters that were passed to it;

The script writes a new web page back to Fred, so that
he can view it.

Let's explore how this happens. First, a VERY SIMPLISTIC example of what the
browser request might look like:

Note the similarity to our previous, conventional web-page example,
including (you guessed it) the blank line at the end. The difference
is that, instead of requesting a web page, the GET asks for a script
called "myscript.pl" found in the directory "/cgi-local/". What
will the response look like? Well, that depends on the script, but
commonly, the response will be almost identical to our previous
response..

The only difference is that the script dynamically writes the
response, rather than fetching a static page stored somewhere on the
server. You can imagine the immense power (and potential for cock-ups)
inherent in such a set-up! There are several important aspects to CGI
scripting that you have to be aware of. We will look at:

HTML components;

Data that come into the script;

How the script responds.

C1. HTML components

There are several ways that a CGI script might be invoked. One is
simply including a reference to the script in an HTML anchor.
Another is a form. We previously learnt about HTML forms in our JavaScript tutorial.
Important components are:

the <FORM> tag itself, which tells the browser what method
is being used to transmit information (GET or POST), and an action
attribute, which says which CGI script to use to actually process the
data coming from the form!

the <INPUT> tags which say what types of input you're dealing with;

<SELECT> which defines a list to select from (and includes
the <OPTION> tag);

<TEXTAREA> for big chunks of text.

When we discussed forms in our JavaScript tutorial, we referred to
two methods that might be used, GET and POST. We also mentioned that
POST is much more general and powerful than GET. Let's look at a POST
that might result from a form..

Now look more carefully at the body of the POST. It's made up of pairs with the
general format:

name=value

Successive pairs are joined by an ampersand (&), thus: name1=value1&name2=value2.
Within each pair the value follows an equals sign. Neither the = nor the & that
act as separators are encoded, but any other occurrence of either within the actual
name or value will be encoded (as %3D and %26 respectively).
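To make the encoding concrete, here is a small sketch of how name=value pairs could be encoded and joined into such a body. The helper names url_encode and encode_pairs are invented for this example, and the set of characters left unencoded is a cautious assumption rather than a quotation from any standard.

```perl
use strict;
use warnings;

# Encode one name or value: anything outside a small safe set becomes
# %XX (two hex digits), and spaces become plus signs.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~ -])/sprintf("%%%02X", ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Join name=value pairs with ampersands, as in a form submission.
sub encode_pairs {
    my @pairs = @_;    # list of [name, value] array references
    return join '&',
        map { url_encode($_->[0]) . '=' . url_encode($_->[1]) } @pairs;
}

print encode_pairs(['surname', 'Van der Merwe'], ['sum', '1+1=2 & so on']), "\n";
```

Only the separators between pairs stay as plain & and =; any that occur inside a name or value come out as %26 and %3D.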

The cute thing about POST is that a separate data stream is opened,
and the data (here name=value pairs) are put onto that separate stream.
The stream becomes the standard input of the CGI script!
(For future reference, we'll here note that when
you use the POST method, then the environment variables CONTENT_LENGTH
and CONTENT_TYPE are both set up appropriately).

Always use POST rather than GET, unless you have no choice.

With this under our belt, let's look briefly at the same information,
passed as a GET..

Note the restrictive format - all the name=value parameters are
plonked after a question mark that follows the path and name of
the Perl script "posthandler.pl". Despite this limitation, the
GET method is useful where we simply wish to submit a stock query
to a script, for we can embed the query in a link, thus:

.. but be careful. There's often a (poorly-defined) limitation on the
amount of data you can pass with a GET, so don't be surprised if long
data are ruthlessly truncated!

C2. Data coming in to the script

CGI stands for Common Gateway Interface. The Interface is the
glue that binds the "client" (in our example, Fred) and the actual
script that does the work. Information comes in to the Interface, and
is then relayed to the script. There is a particularly exacting format
that is used to pass this information. What happens is that the
script can read in parameters passed to it by the Interface. The script
"sees" these parameters as things called environment variables.
The names of environment variables are always written in UPPER CASE.
There are many environment variables, but perhaps the two most important
ones are:

QUERY_STRING

PATH_INFO

The QUERY_STRING environment variable

We already know that if we create a form and then call up a CGI script
using POST (the better way) then it's easy for the script to see the
incoming data - it simply reads the standard input. Things
are more convoluted if we use GET. GET can be invoked..

As part of the HTML reference to the CGI script, for example:

<a href="http://www.foo.com/bar/foobar.pl?thisIsSomeInfo">

From an HTML form.
The method used with the form must be GET. (This explains
the "?rubbish" that you often see in the location bar of your browser
when you use a search engine).

From an ISINDEX document (fine-print stuff that we've
deferred until later. You'll probably never use it).

Note that the string that the script sees in QUERY_STRING is mutilated during the
process of transfer. In particular:

space characters are converted to + signs;

Certain special characters are encoded as a percentage sign
followed by a 2-digit hexadecimal code, as we learnt above.

The PATH_INFO environment variable

This allows extra information to be transmitted from a web-page to
a CGI script.
The information is put in after the path to the script, for example:
"http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere";

You can combine such path information with a query-string..
"http://www.anaesthetist.com/cgi-local/Fred.pl/PathINfogoeshere?Whatmeworry";

Other environment variables

There are several other environment variables that are routinely
set for all requests:

SERVER_SOFTWARE - the name and version of the information
server software that runs the gateway. The format is "name/version",
for example "NCSA/1.0" or "Apache/1.3.3";

SERVER_NAME - the name of the server (one of IP address, hostname,
or DNS alias), for example "stupid.chiron";

GATEWAY_INTERFACE - the version of CGI that is running, in the
format "CGI/revision", for example "CGI/1.1";

There are other environment variables whose presence depends on
the demand being made of the gateway:

REQUEST_METHOD - we are particularly interested in the HTTP methods
GET and POST, but there are others (for example, HEAD, which is identical
to GET except that the server never returns a body in its response!);

SERVER_PROTOCOL - the name and revision of the "information protocol"
used by the request ("protocol/revision"), for example "HTTP/1.0";

SERVER_PORT - the port number to which the request was sent - for
HTTP, the port is usually 80.

PATH_TRANSLATED - this is another incarnation of PATH_INFO, but
has been translated by the server from a virtual address to a physical
one, where appropriate; In our above example of PATH_INFO, the
translated path might be:
"/usr/bin/cgi-local/PathINfogoeshere";

SCRIPT_NAME - the (virtual) path of the script itself! This is used for
URLs that reference themselves. For example "/cgi-local/foobar.cgi";

REMOTE_HOST - who made the request? (If this is null, then look at
REMOTE_ADDR, below); Note that on some servers, DNS lookup can be turned
on or off. This variable will only be populated if the server allows
DNS lookup to be turned on. Do NOT rely on this to definitively identify
someone!

REMOTE_ADDR - the IP address of the requestor (of the remote host);

AUTH_TYPE - only of relevance where the server can authenticate
the user. Protocol-specific;

REMOTE_USER - authenticated user name, in the case where AUTH_TYPE
is relevant;

REMOTE_IDENT - set to the remote user name only in the specific instance
where "RFC 931 identification" is supported! This is pretty darn useless
for actually identifying someone, as it can easily be faked by those with
nefarious intent.

A further complication is environment variables that begin
with HTTP_:

HTTP_ACCEPT
- the content is determined by the "Accept: .." lines
in the header, if they exist. The value is just a list of mime types,
separated by commas (One can even say "*/*");

HTTP_USER_AGENT - the browser used by the client ("software/version library/version",
or anything the client wishes to spew forth - Internet Explorer will
sometimes identify itself as Mozilla, ie. Netscape?!);

HTTP_REFERER only has a value if the script was invoked from within
an HTML document, in which case the value is the URL of the document.
Note the spelling [or lack thereof].
Some servers will also (or even only) accept HTTP_REFERRER.

HTTP_FROM is obsolete (and usually now left blank) but formerly
contained the user's email - spam has put paid to this!
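Here is a minimal sketch of a little script that lists every environment variable the server hands over. It assumes Perl lives at /usr/bin/perl, and for brevity it does not escape HTML special characters in the values, which a real script should do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The header: a Content-type line followed by a blank line.
print "Content-type: text/html\n\n";
print "<H1>Environment variables</H1>\n";

# %ENV holds every environment variable; list them in name order.
foreach my $name (sort keys %ENV) {
    print "$name = $ENV{$name}<BR>\n";
}
```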

The above little Perl script will actually provide a list of environment
variables as an HTML page (admittedly, without the usual head and body
tags etc).
The first line simply says where to find Perl on the system.
See how Perl automagically keeps the environment variables in
the %ENV associative array, and we go through each using the foreach instruction. The print statements preceding
the foreach loop are explored in detail below. Note the usual Perl
method of accessing each associative variable - because it's
a variable we say $ENV and not %ENV, but we put curly brackets {braces}
afterwards thus: $ENV{whatever}.

Note the two newlines (\n\n) after the Content-type statement -
these provide our magical blank line without which the header simply
won't work. Also see how we simply print to standard output!
(stdout - if you were to run the
script on your machine without any web in the way, you would see the response on your console).
Incredibly straightforward. As usual, there are wrinkles.
There's a fancy way in Perl (isn't there always?) of quoting large
sections of text. Here it is..
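A short sketch of the idea follows. The page content and the $title variable are invented for the example, and the quotes around the delimiter are needed because it contains spaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = "A here-document";

print "Content-type: text/html\n\n";

# Everything up to the line END OF HTML is printed as it stands,
# with variables such as $title filled in along the way.
print <<"END OF HTML";
<HTML>
<HEAD><TITLE>$title</TITLE></HEAD>
<BODY>
<P>No need to quote each line separately.</P>
</BODY>
</HTML>
END OF HTML
```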

In the above, we've used the text string "END OF HTML" to delineate
(surprise, surprise) the end of the html text we wish to print. Cute, is
it not? This cute Perl trick is called "here-document quoting". But NOTE
that the line END OF HTML must be alone and on its own -
any character on the line apart from "END OF HTML"
(even a space) will screw things up horribly. Use with respect!

Other Content-types

There's no reason why your CGI script has to return a Content-type
of text/html. For example, you can generate images on the fly (if you
know how) of type image/gif. Something along the lines of..
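As a rough sketch, assuming a GIF already exists on disk (the path /tmp/counter.gif and the helper name gif_response are made up here), a script could send it as below. Generating the image data on the fly is a separate problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a complete CGI response around some raw GIF data.
sub gif_response {
    my ($gif_data) = @_;
    return "Content-type: image/gif\n\n" . $gif_data;
}

my $file = '/tmp/counter.gif';    # hypothetical location of the image

if (open(my $fh, '<', $file)) {
    binmode $fh;                  # images are binary data
    local $/;                     # read the whole file in one go
    my $gif_data = <$fh>;
    close $fh;
    binmode STDOUT;
    print gif_response($gif_data);
} else {
    print "Status: 404 Not Found\nContent-type: text/plain\n\nNo image available.\n";
}
```

A page would then point an image tag at the script, along the lines of <IMG SRC="/cgi-local/counter.pl"> (again, a hypothetical name).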

Well, presumably you'd want to put in a width and height if you knew
them, now, wouldn't you? This is the sort of stuff that generates those
ugly little web-page counters that you so self-righteously despise!

Other options apart from Content-type

Apart from Content-type header lines, there are several others that you can put in,
but there are two that can stand on their own, replacing Content-type: -
Location and Status.

Location is quite sneaky: it redirects to another URL if a URL is specified,
or fetches a document as if the client had requested it if a path is specified.
You can even submit a "?" directive after the relevant file name.

Status sends something called
a status-line to the client. A status-line is an HTTP/1.0
message that combines a three-digit code (nnn) and a string that
explains the reason for the code.
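A small sketch of both headers follows. The helper names, the example.com address and the wording of the message are assumptions for illustration; a script would print just one of these responses.

```perl
use strict;
use warnings;

# Redirect the client to another URL.
sub redirect_response {
    my ($url) = @_;
    return "Location: $url\n\n";
}

# Report a status code and reason, followed by a short plain-text message.
sub status_response {
    my ($code, $reason, $message) = @_;
    return "Status: $code $reason\n"
         . "Content-type: text/plain\n\n"
         . "$message\n";
}

print status_response(404, 'Not Found', 'Sorry, nothing here.');
# or: print redirect_response('http://www.example.com/elsewhere.htm');
```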

Pulling data out of a GET statement

You know how to get the data - they are stored in the QUERY_STRING
environment variable. So we simply say:
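What follows is a sketch rather than a definitive listing. The helper name parse_pairs is an invention; the hash %NameValues and the FIX UP comment are the ones discussed in the paragraphs below.

```perl
use strict;
use warnings;

# Split a query string into a hash of names and values.
sub parse_pairs {
    my ($query) = @_;
    my %NameValues;
    # Pairs are separated by ampersands; name and value by an equals sign.
    foreach my $pair (split /&/, $query) {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        # FIX UP $name and $value here (decode + and %xx)
        $NameValues{$name} = $value;
    }
    return %NameValues;
}

my $query = defined $ENV{'QUERY_STRING'} ? $ENV{'QUERY_STRING'} : '';
my %NameValues = parse_pairs($query);
```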

Okay, we skipped over a bit in the above foreach loop,
because the names and values need a little massaging! We need to
substitute values like %xx (ie hexadecimal encoding) with the
relevant character. So in place of the FIX UP comment line in
the above, we might put:

$name =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;

.. and a similar line for $value. Before decoding the %xx sequences,
turn any + signs back into spaces, using tr/+/ / on both $name and $value.

Ooops, nearly forgot! Some more lines to add before we simply
plonk the $value into $NameValues{$name}. What about the case where the same
$name is submitted several times? Well, okay, you can simply overwrite
the first few occurrences (as the above code will do), or you can
check for the previous values and, say, concatenate all the values.
Here's our check (we'll leave you to decide what to do next)..

if ( defined $NameValues{$name} ) {
    # do what you want to here with the multiple values..
    # for example, separate them by colons, or whatever.
} else {
    $NameValues{$name} = $value;
}

Reading the body of a POST statement

The general way that you read the body of a POST statement is simply
by querying the standard input, stdin. This is the same as reading
the keyboard (console) when your Perl program is interacting with
you. In other words you say something along the lines of:
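A minimal sketch, assuming the server has set CONTENT_LENGTH as described earlier. The helper name read_body is made up; $inputString is the variable referred to below.

```perl
use strict;
use warnings;

# Read up to $length bytes from a filehandle (normally STDIN).
sub read_body {
    my ($fh, $length) = @_;
    my $inputString = '';
    read($fh, $inputString, $length) if $length > 0;
    return $inputString;
}

my $length      = $ENV{'CONTENT_LENGTH'} || 0;
my $inputString = read_body(\*STDIN, $length);
```

A more careful script would check the value returned by read, since it may deliver fewer bytes than were asked for.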

.. which will read the whole body of the POST from the standard input.
Note that this body is composed of multiple lines (separated by
CR and/or LF characters).
See how we get the length of the data from the CONTENT_LENGTH environment
variable. There is NO obligation for CGI to put some sort
of "End Of File" character on the data, so CONTENT_LENGTH is rather
important.

After we've read in our $inputString, we massage it into shape (and
name/value pairs) as we did for the GET statement above.

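We use the more complex Perl read statement rather than
something like $inputString = <STDIN>; because <STDIN> reads only as far
as the next end-of-line. The body of a POST need not end with one, so the
script may sit waiting (or stop short, if the body contains line breaks).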

D. Actually installing a Perl script

Put the correct path to Perl as the first line of your script.
Something like:

#!/usr/bin/perl

(Don't leave out the #, or put a space between the # and the !).
"which perl" will generally tell you on a UNIX system where perl
resides.

Put the script in the correct directory. Something like /cgi-bin
or /cgi-local,
but again, contact admin, grovel, grovel..

Note that some simple things may screw you around. For example,
if you create a file in MS-DOS edit, and upload it to the web as a binary,
Perl won't run it on UNIX systems because the
default "end of line" character for MS-DOS/Windows is CR+LF, while on
UNIX Perls, it's just LF. There are several solutions - the easiest is
just to make sure that when you FTP upload the file to UNIX, your
transfer is in ASCII mode - then the translation will automatically
occur.
Another solution is to find one of the translation programs on the Web,
for example "fixcrlf.exe".

Make sure that the file has the correct suffix - on
many systems, you have to rename the file from ".pl" to ".cgi" in
order to get it to work!

Make sure that the directory and/or script itself has the
correct permissions. There are three numbers that you have to get
right - this magic triad is "755". The correct UNIX command is..

chmod 755 filename

Where filename is the name of the directory or file
whose permission you're changing. You'll find that friendly software
like WS-FTP will show the permissions in a slightly different way.

Useful UNIX commands apart from chmod are mv, used
to move or rename a file, mkdir for creating a directory,
cp to copy a file, pwd to show the current directory,
and ~ as a replacement for the long cumbersome path to
your home directory!

E. .. and running it

Okay. Create a web-page with a reference to the script, or simply
type in the URL thus:

http://www.anaesthetist.com/cgi-local/foobar.pl

And see whether you get back the dynamic web-page you expected
(or a confusing error)! If all hell broke loose (ie. you got an
internal server error, code 500) then look carefully on your server for
the file error.log that will tell you what went wrong (yep, you
guessed it, try the grovel, grovel routine). Chances are, you didn't set 755 permissions
for both the directory and script file, but you may have the wrong #!pathtoperl,
or some other error in your script. (Check the script from the command
line if you can, although this is often unhelpful).

If the script file wasn't even found, then you may have forgotten
that UNIX is cAse SENsitIvE, or have the wrong name, or the wrong suffix
(e.g. ".pl" instead of ".cgi", or vice versa). Here's an example
to play around with:

A simple 'Hello world' example

Click on this test link. The URL
is "http://www.anaesthetist.com/cgi-local/hi.pl"

Writing to files from a Perl script

Hmm. It's general practice for scripts to be run as 'nobody'. This means
that if your CGI is to write to a file on the server, the directory
needs to be "world-writable" (a particularly bad idea) or owned by 'nobody'.
[You may wish to research this further].

Accessing Databases

This depends on your database.
There's a lot on the 'Net about the common databases - including MySQL and PostgreSQL.
For example, check out resourceindex.com.
It's also good to know some ODBC.

F. Security Holes

If you write Perl scripts that others can execute, you ARE opening
security holes! Sounds pretty brutal, but largely true. Beware
the following (at the very least):

Use of the Perl eval command.

Special characters that have meaning to a Bourne shell.

popen and system.

Server-side includes. These must be TURNED OFF,
or you will die a horrible death, heh.
Anyway, you should trim them out of your data - see our note on SSIs.

Data that bite! Certain characters such as `backticks` have
special meaning in Perl, and submitting data with |pipes| to something
like an (eugh) Access database will result in Visual Basic being invoked
on the stuff between the pipes. Hackers can use this to accomplish
evil. Other changes may simply sow confusion - if you don't watch out
for it, something like a surname containing a single quote (O'Brien, say)
dropped into a database query will cause a headache. You first have to
duplicate the single quote in the surname!
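The quote-doubling step can be sketched as follows. The helper name quote_for_sql and the table and column names are invented; if you talk to the database through the DBI module, its placeholders or its quote method are a safer bet than rolling your own.

```perl
use strict;
use warnings;

# Double every single quote so the value can sit inside an SQL string literal.
sub quote_for_sql {
    my ($text) = @_;
    $text =~ s/'/''/g;
    return "'" . $text . "'";
}

my $surname = "O'Brien";    # hypothetical input from a form
print "INSERT INTO people (surname) VALUES (", quote_for_sql($surname), ")\n";
```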

If you don't understand the above, you probably shouldn't be
allowing others to use your scripts by publishing them on the 'Net!
Even if you do, you're probably still going to get burned from time
to time.

G. A miscellany

Special characters

The following are all special characters in HTTP/1.0:

< > ( ) @ , ; :
\ " / [ ]
? = { }

.. as well as the space character, and horizontal tab. Also don't forget
the special roles of LF ± CR!

Calling CGI from JavaScript

This is actually fairly straightforward. In your JavaScript simply say something along
the lines of..

Using CGI.pm

There's a readily available Perl library module called CGI.pm.
If it's on your system, you're lucky, and can just say

use CGI;

at the start of your Perl script. If it's not, try (grovel, grovel)
or installing it in a local directory and then saying something along
the lines of..

use lib '/path/of/yourlocaldir';
use CGI;

Using CGI.pm you can handle parsing of CGI queries and form generation
with just a few simple calls. Check it out in Lincoln Stein's vast CGI.pm documentation.
John Callender has written an excellent introduction to this Perl module. This
includes how to send an email from Perl, and is incidentally a rather good
introduction to Perl! (Or get the source of that other
excellent form to email script, FormMail).

MIME stands for "Multipurpose Internet Mail Extensions". MIME allows
for transfer and recognition of a host of different data types.
There are hundreds of types. MIME types were originally defined in RFC 1341.
For some improvements, see the now obsolete RFC 1521 and 1522, and the more recent
five-part RFC 2045 to 2049. Mailcap files (which handle media
types) are in RFC 1524. All MIME types should be registered with IANA.
Current types include:

A note on chmod

For a proper overview, "man chmod". Some modes that may be useful
in a typical CGI context are:
* CGI programs, 0755
* data files to be readable by CGI, 0644
* directories for data used by CGI, 0755
* data files to be writable by CGI, 0666 (data has absolutely no security)
* directories for data used by CGI with write access, 0777 (no security)
* CGI programs to run setuid, 4755
* data files for setuid CGI programs, 0600 or 0644
* directories for data used by setuid CGI programs, 0700 or 0755
* For a typical backend server process, 4750

What you're doing is setting bit flags in the permissions of
files and directories. Scary stuff! The 'behind-the-scenes' information
is that 755 (for example) is an octal number. The "7" gives the
permissions for the owner of the file, and the subsequent two
5's the permissions for "group" and "other" respectively. Each
octal digit is made up of three bits, the first referring to read
permissions, the second to write, and the last to execute.
So "5" allows read and execute, but not write, and so on.
(read = 4, write = 2, execute = 1, 4+2+1 = 7).

Okay, the leftmost digit (usually assumed to be a zero) is a little
different, in that its first bit selects the "set user ID", the
second the "set group ID", and the last the "save text image" attribute.
So the "4" in "4755" means "set user ID", for example.

Note that if you're being anally retentive, it's probably best
to specify the leading zero, as this ensures that the number is seen
as octal on almost any system!
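To see the bits at work, here is a small sketch that turns an octal mode into the rwx form shown by a directory listing. The helper name mode_to_string is made up, and the special leftmost digit (set user ID and friends) is ignored.

```perl
use strict;
use warnings;

# Turn an octal mode such as 0755 into the familiar rwxr-xr-x form.
sub mode_to_string {
    my ($mode) = @_;
    my $text = '';
    foreach my $shift (6, 3, 0) {          # owner, group, other
        my $bits = ($mode >> $shift) & 7;  # one octal digit
        $text .= ($bits & 4) ? 'r' : '-';
        $text .= ($bits & 2) ? 'w' : '-';
        $text .= ($bits & 1) ? 'x' : '-';
    }
    return $text;
}

# The leading zero makes Perl read each number as octal.
printf "%04o = %s\n", $_, mode_to_string($_) for (0755, 0644, 0600);
```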

Server-side includes

Note that we particularly don't want some sneaky hacker to
insert "server-side includes" into our data. So we must at some stage
go through all data strings and rip out all potential server-side
includes, thus:

# non-greedy match, so only the comments themselves are removed
$datum =~ s/<!--.*?-->//gs;

The general format of a server side include is:

<!--#command tag1="value1" tag2="value2" -->

for example..

<!--#exec cgi="/cgi-bin/hits.pl"-->

In other words, a 'standard' HTML comment containing a directive
to include the output from an executable in the web-page. Not only
do SSIs provide a security hazard, but they will also slow down an
overworked server even further, because documents that are
provided have to be parsed by the server to see whether it should
insert an 'include'. Note that setting up SSI is also quite a business
as you have to decide which directories are safe to use, and tell
the server what file type is to be parsed and turned into an HTML
document. Internally the server uses the MIME-type
text/x-server-parsed-html to identify SSI documents (they often have
the suffix .shtml).

Commands include exec, config, include, echo, fsize, and flastmod. With exec the cmd tag executes the
string provided (using /bin/sh), and cgi runs a script and
inserts its output, whatever it is!

Trickery

There are yet more wrinkles..

There is a way of preventing the server from parsing the output
of your script (ie. you can talk directly to the client). The trick is to
name the script beginning with "nph-", which stands for "non-parsed
headers". The catch - your script must return a valid HTTP/1.0
response to the client, in all its gory detail.
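A bare-bones sketch of such a script, which might be saved under a name like nph-hello.pl (a hypothetical name). A real one would probably add further header lines such as Date and Server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $CRLF = "\x0D\x0A";
my $body = "<HTML><BODY>Straight from the script.</BODY></HTML>\n";

# With nph- the script must supply the status line itself.
my $response = "HTTP/1.0 200 OK" . $CRLF
             . "Content-type: text/html" . $CRLF
             . "Content-length: " . length($body) . $CRLF
             . $CRLF
             . $body;

print $response;
```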

The sneaky thing is that QUERY_STRING values are also passed to the script
as command-line arguments (argv[1], argv[2] and so on), provided that the data
were not submitted from a form. How is argv[1] distinguished from argv[2] and so on? Easy!
Spaces (now translated to plus signs) are the delimiter. So, if your web
page anchor is:

<a href="http://www.foo.com/bar/foobar.pl?alpha beta gamma?">

then the script will see argv[1] as the string "alpha", argv[2] will be "beta", and so
on.
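In Perl the arguments arrive in the @ARGV array, which starts counting at the first argument, so $ARGV[0] holds what C would call argv[1]. A small sketch (the helper name describe_args is invented) that reports whatever arrived this way:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Label each argument using the C-style numbering from the text above.
sub describe_args {
    my @args = @_;
    my @lines;
    for my $i (0 .. $#args) {
        push @lines, 'argv[' . ($i + 1) . '] = ' . $args[$i];
    }
    return @lines;
}

print "Content-type: text/plain\n\n";
print "$_\n" for describe_args(@ARGV);
```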

One trick - if you want a CGI program that simply echoes back
the content of a POST form (so that you can test form submission) try post-query, which is available as
ACTION="http://hoohoo.ncsa.uiuc.edu/cgi-bin/post-query".
There's a similar script called simply query that does the same
for forms that use the GET method (Available as
ACTION="http://hoohoo.ncsa.uiuc.edu/cgi-bin/query").

Error Scripts

These may be used to handle the case where a script 'crashes and burns'.
They have extra environment variables, including:

REDIRECT_REQUEST (the request as it was sent)

REDIRECT_URL (URL that caused a problem)

REDIRECT_STATUS (status number and message that would have been sent)

A topic beyond the scope of this document at present!

A summary of RFC 822

This document has application to all sorts of Internet messages, including e-mail
and HTTP, as well as requests for CGI scripts;

Long lines can be "folded" by (a) inserting a CRLF combination,
and (b) immediately after this, inserting at least one 'linear white-space'
character (LWSP, a tab or space);
You 'unfold' by removing all CRLFs followed immediately by LWSP;

An unfolded 'field' in a header is composed of:

field name made up of ASCII characters 33 to 126 decimal, excluding
colon;

A colon (:)

a field body (any ASCII characters, apart from CR or LF)

Some field bodies are further structured; others are just
ASCII text;

Structured field bodies may be structured to contain one or more of:

individual special characters e.g. @ . ,

"quoted strings"

domain-literals - [text within square brackets]

(comments) - text enclosed in parentheses

atoms (basically, a word)

Display of structured field data should NOT allow LWSP
between words separated by @ or .

Characters can be quoted by preceding them with a backslash \
but ONLY within a quoted string, domain-literal, or comment.

The document distinguishes between "dtext" and "ctext": the former is
any character apart from [square brackets], backslash and CR; the latter is
any character apart from (parentheses), backslash and CR. A domain-literal
is made up of dtext within square brackets, a comment of ctext within
parentheses.

Special characters are ( ) < > @ , ; : \ " . [ ]

With structured text, contiguous words in a phrase are assumed
to be separated by just ONE space;

< angle brackets > are used to indicate the presence of
a single machine-usable reference (e.g. a mailbox)

Field names are generally case-independent;

Backspace characters are regarded as 'overstriking' preceding
text (but not past the start of a line)!

The body must occur after the header, but individual header lines
need NOT be in any particular order!

The canonical format for messages will now be shown. In the
following note that items in square brackets are optional, that
(parentheses) here group their contents into a single item, that "a / b" means
a or b, that *something means any number of copies of
something including none at all, and that 1*something means one or
more copies of something. Here goes (in BNF)..

Of arcane interest is that 'source-routing' (along a pre-specified
path) is possible, but strongly discouraged;

For any domain, "Postmaster@domain" is required to
be valid;

Appendix A3 gives examples of complete headers (email-ish!)
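
The folding and field rules summarised above can be sketched in a few lines (Python is used here for brevity). The function names unfold and parse_fields are invented for this example, and trimming white space around the field body is a convenience of the sketch, not something RFC 822 demands.

```python
import re

def unfold(header_text):
    """Remove every CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", header_text)

def parse_fields(header_text):
    """Return a list of (field-name, field-body) pairs from a header block."""
    fields = []
    for line in unfold(header_text).split("\r\n"):
        if not line:
            continue                      # skip the empty line that ends the header
        name, _, body = line.partition(":")   # split at the first colon only
        fields.append((name.strip(), body.strip()))
    return fields
```

Because field names are generally case-independent, a caller comparing names would normally lower-case them first.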

Miscellaneous Notes on HTTP 1.0

The following are snippets from RFC 1945. The extended BNF format used
is similar to that of RFC 822, and other features are similar to MIME (particularly the obsolete RFC 1521).
Appendix C of RFC 1945 lists how HTTP/1.0 differs from MIME.

HTTP 1.1

The RFC standard for HTTP/1.1 is RFC 2616 from June 1999. We're up to ~170
pages in this standard, which differs quite a lot from 1.0.
Krishnamurthy, Mogul and Kristol have reviewed the differences
between v 1.0 and 1.1. In summary, these are:

Because of 'back-compatibility' issues, the standard is complex
and far from uniform. As usual, an HTTP implementation will ignore headers it doesn't understand!

A Via header is supported so one can see who along the chain
between client and server is using 1.0 and who is using 1.1.

An OPTIONS method is introduced that (at least in theory) allows
a client to ask the server what it can do, without actually requesting
a 'resource'.

The Upgrade header is for the future - to allow a switch
to an alternate means of communication.

In 1.0, caching was simple - the Expires header
told a server which cached a 'common' response (one that was often requested)
how long it could keep using its copy before asking
for a fresh one. There was another frill: a cache could ask the originator
for a copy of a page (or whatever) using If-Modified-Since. The originating
server would then respond with a newer copy (code 200), or code 304 which
meant 'not modified, OK to use the original'. (The problem here was one
of clock synchronisation). There was even the ability
to disable all caching with the header Pragma: no-cache.

1.1 cranks the above up a few notches - a cache entry is either
fresh or stale. Stale entries should (normally) be re-validated when
requested. A new idea is the ETag header, which is shorthand for
'entity tag'. Rather than insisting on comparison of timestamps, the
new standard makes sure that if two entity tags are the same, then the
associated responses must be identical. There are also extra conditional
requests..

If-None-Match - several entity tags may be presented!

If-Unmodified-Since

If-Match.

There's also a Cache-Control header, with lots of cache control
directives including relative expiration times (max-age directive),
private, no-store and no-transform.
Even more cute is the Vary header which forces the cache
to examine not only the requested URL (a la 1.0) but also selected request
header fields to make sure that sending a cached response is appropriate!
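
The entity-tag idea is easier to see in code. Below is a rough sketch (in Python) of how a server might answer a request carrying If-None-Match. The names not_modified and respond are made up, and splitting the header on commas is a simplification that would trip over an entity tag containing a comma.

```python
def not_modified(if_none_match, current_etag):
    """Return True when a 304 (Not Modified) reply would be appropriate."""
    if if_none_match is None:
        return False
    # Several entity tags may be presented, separated by commas.
    candidates = [tag.strip() for tag in if_none_match.split(",")]
    return "*" in candidates or current_etag in candidates

def respond(request_headers, current_etag, body):
    """Return (status, headers, body) for a GET on a resource with a known ETag."""
    if not_modified(request_headers.get("If-None-Match"), current_etag):
        return 304, {"ETag": current_etag}, b""   # the cache may reuse its copy
    return 200, {"ETag": current_etag}, body
```

Note that no clocks are compared anywhere, which is exactly the attraction over If-Modified-Since.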

A new (and beautiful) attribute of 1.1 is that you can request part of
a resource. Great for speed and bandwidth, and you
can also download a large file in chunks! A good idea. The relevant
response code is 206 (Partial Content). One can even send multiple
ranges in one message, and there's a MIME type for this.. multipart/byteranges.
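
Here is a minimal sketch (in Python) of how a server might honour a request for part of a resource. It handles only a single byte range; the function name apply_range is invented, and falling back to a full 200 response for anything it does not understand is a choice made for this example.

```python
def apply_range(data, range_header):
    """Return (status, headers, body) for a single 'bytes=first-last' range."""
    total = len(data)
    unit, _, spec = range_header.partition("=")
    first, _, last = spec.partition("-")
    if unit.strip() != "bytes" or "," in spec:
        return 200, {}, data              # unsupported form: send everything
    if first == "":                       # suffix form: the final N bytes
        start, end = max(total - int(last), 0), total - 1
    else:
        start = int(first)
        end = min(int(last), total - 1) if last else total - 1
    if start > end:                       # nothing of the resource lies in the range
        return 416, {"Content-Range": f"bytes */{total}"}, b""
    headers = {"Content-Range": f"bytes {start}-{end}/{total}"}
    return 206, headers, data[start:end + 1]
```

Byte positions count from zero and the last position is inclusive, so "bytes=2-5" asks for four bytes.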

It's now possible to first send a header, and only then, if the server
is happy to accept the body of the request (which may be biiiig), transmit
the body of the message. The unhappy server sends an error status (401, say,
or 417 Expectation Failed) to tell the client
to go to hell; the happy server says 100 (Continue). The trick is that
the client asks for this new 100 code with its own special request header, Expect: 100-continue. But
watch for bugs in this clever trick!
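
Seen from the server's side, the decision might look something like the sketch below (Python). The function name first_reply and the size limit are assumptions made for illustration, and 413 (Request Entity Too Large) is just one plausible way for the server to say no.

```python
def first_reply(request_headers, max_body_bytes):
    """Choose the status line to send after reading only the request headers."""
    expect = request_headers.get("Expect", "").strip().lower()
    length = int(request_headers.get("Content-Length", "0"))
    if length > max_body_bytes:
        return "HTTP/1.1 413 Request Entity Too Large"   # refuse before the body is sent
    if expect == "100-continue":
        return "HTTP/1.1 100 Continue"                   # client may now send the body
    return None  # client did not ask; just go ahead and read the body
```

The client, for its part, sends "Expect: 100-continue" with its headers and holds the body back until one of these replies arrives.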

Compression is provided for (either 'transfer-codings', which
are hop-by-hop, i.e. happen along the line of communication, or 'content-codings',
which happen at one end, with decompression at the other end). Note
that 1.0 already has a Content-Encoding header but that 1.1 adds
Transfer-Encoding for hop-by-hop coding. The Accept-Encoding
header is spruced up (enough?), and the TE header allows the
client to say which transfer encoding is kosher.

A layer deeper than HTTP is the transport protocol used to actually
get the messages across - this is usually TCP. Unfortunately,
setting up a new TCP connection for each request is clumsy and costly.
Even worse is that each image in a web-page is retrieved by
a distinct HTTP request (with its own little TCP overhead)!
1.1 introduces two solutions to this clumsiness - persistent
connections and pipelining. Implementation is complex -
the Connection header allows a message that is about to be
forwarded to have hop-by-hop headers clipped out of it! Persistent
connections are now a default (!) but can be turned off at will
using a Connection: close header. Pipelining allows multiple
requests to be sent across the same TCP connection, without waiting
for an answer to the initial request!
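
The "persistent unless told otherwise" rule can be sketched as follows (Python). The function name keep_open is invented, and the keep-alive token for 1.0 peers is a widespread extension rather than part of the 1.0 standard, so treat that branch as an assumption.

```python
def keep_open(http_version, headers):
    """Return True if the connection should stay open after this message."""
    tokens = [t.strip().lower() for t in headers.get("Connection", "").split(",")]
    if "close" in tokens:
        return False               # either side may turn persistence off
    if http_version == "HTTP/1.1":
        return True                # persistent by default in 1.1
    return "keep-alive" in tokens  # 1.0 peers must ask explicitly (an extension)
```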

Dynamic responses to requests often don't know the length of
the response, so cannot send a Content-Length header. In 1.1
we now have chunked transfer-coding (set up by specifying
Transfer-Encoding: chunked), so manageable chunks of data
can be sent. A great idea!
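
On the wire, each chunk is its size in hexadecimal, a CRLF, the data, and another CRLF; a chunk of size zero marks the end. A minimal sketch in Python (the function name encode_chunked is made up, and trailers and chunk extensions are left out):

```python
def encode_chunked(pieces):
    """Yield the bytes of a chunked message body for an iterable of byte strings."""
    for piece in pieces:
        if piece:  # a zero-length chunk would end the body early, so skip empties
            yield b"%x\r\n" % len(piece) + piece + b"\r\n"
    yield b"0\r\n\r\n"  # last chunk, then the empty line that ends the body
```

A script generating output on the fly can hand each piece over as soon as it is ready, without ever knowing the total length.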

A Trailer header is convenient when used with chunking -
deferred headers (that depend, for example, on the whole body) can
be listed in this header and then sent after the chunks! There are
several technicalities.

In 1.1 the Content-Length header refers only to the
actual message length (the "entity length") and cannot be used to
talk about temporary length changes brought about by, for example,
compression between hops in the chain!

To prevent problems that occur when several hosts are bound to
the same IP address (there is a multiplicity of virtually hosted
dot coms around, for example), a Host header provides more
detail about the actual 'vanity' name of the dot com or whatever
whose page is being requested at a given IP. (A port may also be
specified). {This partially addresses a problem that will probably
only be sorted out when IPv6 comes of age}.
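
How a server might use the Host header to tell its virtual hosts apart can be sketched like this (Python). The function name pick_site, the host names and the directory paths are all hypothetical, and IPv6 address literals are not handled.

```python
def pick_site(host_header, sites, default_site):
    """Choose a document root from the Host header's name, ignoring any port."""
    name = host_header.split(":", 1)[0]          # drop an optional ":port"
    return sites.get(name.strip().lower(), default_site)

# Hypothetical table of virtual hosts sharing one IP address.
SITES = {"www.foo.com": "/srv/foo", "www.bar.com": "/srv/bar"}
```

For example, pick_site("www.bar.com", SITES, "/srv/default") selects the second site, while an unknown name falls through to the default.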

There are now many more status codes than the sixteen found with 1.0:
24 new ones, including 409 (Conflict) and 410 (Gone)! There's also
a Warning header. All the codes are listed starting on
page 39 of RFC 2616.

Several security issues are addressed in 1.1. "Basic authentication"
is joined by "digest authentication", in which a client first computes an (MD5) checksum of
several values including a one-time value provided by the server ("nonce").
Nonces can also become stale! Proxy authentication is also addressed.
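
The checksum scheme described above is known as digest authentication (RFC 2617). A minimal sketch of the client's calculation, in Python, follows; it covers only the simplest form, without the optional "qop" extras, and the function names are invented for this example.

```python
import hashlib

def md5_hex(text):
    """Return the MD5 checksum of a string as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def digest_response(username, realm, password, method, uri, nonce):
    """Compute the 'response' value for the simplest (no qop) digest scheme."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")   # who is asking
    ha2 = md5_hex(f"{method}:{uri}")                  # what is being asked for
    return md5_hex(f"{ha1}:{nonce}:{ha2}")            # tied to the server's nonce
```

The password itself never crosses the wire, and because the nonce changes, an eavesdropper cannot simply replay an old response once that nonce has gone stale.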

The Content-MD5 header is refined.

"Content negotiation" is carefully specified, but still confusing
and complex.

Different Servers

You should research your server - a good internet search term is simply
"HTTPD" with the type of server, for example Apache (by far the commonest,
over 50% of all HTTPDs). Note that the CERN/W3C and NCSA HTTPDs are no
longer being developed, so if you use them and there's a bug, you may
be stuck!