Introduction

After a brief detour into the world of style sheets, I have returned to the planet of CGI. In the last
CGI article, we looked at, among other things, how the browser encodes data so that it can be sent across
the internet to your server. We also looked briefly at the simplest of Perl scripts in order to get
familiar with the language. In this article, I will detail the steps required to extract the information
that is sent to your CGI program. The article is fairly Perl-intensive, but hopefully the examples will
make it easy enough to follow.

GET vs POST revisited

Just to recap, GET and POST are the two methods by which a browser can send form data to the server. GET
encodes the data and then appends it to the end of the URL. Typically, these URLs look like this:

http://www.server.com/cgi-bin/script.pl?item1=Jason&item2=Nugent

Here, each name/value pair of the form is joined together with an equal sign, and the pairs are
separated from one another with ampersands. Also, remember that any special characters present in this URL
which might not normally be allowed (like spaces, slashes, and tildes) are hexadecimally encoded.
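
As a reminder of what that encoding looks like, here is a rough sketch in Perl. The url_encode() helper is
my own invention for illustration (it is not something the browser or Perl provides), and it assumes plain
ASCII input. Browsers differ slightly on which characters they leave alone, so treat the character list as
an approximation.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one form value roughly the way a browser would.
sub url_encode {
    my ($text) = @_;
    # Replace anything that is not a letter, digit, '-', '_', '.' or space
    # with a percent sign followed by its two-digit hexadecimal character code.
    $text =~ s/([^A-Za-z0-9\-_. ])/sprintf("%%%02X", ord($1))/eg;
    # Spaces in form data are sent as plus signs.
    $text =~ tr/ /+/;
    return $text;
}

print url_encode('~jason/my page'), "\n";   # prints %7Ejason%2Fmy+page
```

The tilde (character code 7E) and the slash (2F) are replaced by their hexadecimal codes, and the space
becomes a plus sign.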

With POST, any form name/value pairs are sent to the server AFTER the rest of the request headers have
been sent. The content of your form does not get appended to the URL like it does with the GET method, but
rather it is sent to the script via standard input (STDIN). Knowing which method was used is the first key
step in acquiring the data sent from your form to your CGI program.

Perl Environmental Variables

A number (quite a few, actually) of environmental variables are set by the web server when a CGI program
is run, and Perl gives you easy access to them. The values that these variables get set to depend greatly
on the environment in which the CGI program is called, and you can use these values to greatly enhance
your Perl script. First, however, let's take a look at the data structures used in Perl.

The three that are most commonly used are scalars, arrays of scalars, and associative arrays of scalars,
or hashes.

Scalars

Scalars can store a single piece of information and take on the form

$variable_name

where variable_name is the name of the variable. Variable names are case sensitive, so $my_variable and
$MY_variable are two different variables. It is usually good practice to keep all your variable names
lowercase. Perl's own special variables, such as %ENV and @ARGV, are mostly written in uppercase, so if
you consistently use lowercase you should have no problem with name collisions.

Arrays

The second type of variable is an array. Arrays are very similar to arrays in JavaScript, in the sense
that you can store many different pieces of information in them and reference a specific one using a
numeric index. Unlike scalars, array variable names begin with an "@" symbol. The following line creates
an array which holds two pieces of information:

@array = ('This is some text', 'This is some more text');

Arrays begin counting from zero, so to get access to the first cell of this array, the following
notation would be used:

print $array[0]; # prints 'This is some text'

Notice that a cell in an array is referenced using a $, not a @. This is because each cell of an array
is in fact a regular scalar variable. Writing @array[0] instead asks for a one-element slice of the array,
which is almost never what you want.

Hashes (or Associative Arrays)

The last type of data structure is an associative array, or hash. I like to use the word 'hash' because
it is easier to type. Seriously. Hashes are a lot like arrays of scalars in the sense that they also
contain scalar variables, but they are not indexed with numbers. Hashes are referenced using key strings.
A hash is referenced not with a $ or a @, but with a %. So, the following is a valid hash name:

%hash_name;

Note that scalars, arrays of scalars, and hashes all exist in separate namespaces, which means that you
can use the same variable name for all three. The following is valid:

$variable;
@variable;
%variable;

Although, for obvious reasons, it is not recommended. It is better to keep things separate and avoid
the confusion.

Back to hashes. To reference a specific cell in a hash, you need to know the key which indexes it. For
example, you might have a hash named %Jason, which has two keys, 'firstname' and 'lastname':
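
A minimal sketch of such a hash follows. The values stored under each key are my assumption, chosen only
to illustrate the syntax.

```perl
use strict;
use warnings;

# Create the hash. The values here are assumed for illustration.
my %Jason = (
    'firstname' => 'Jason',
    'lastname'  => 'Nugent',
);

# Each cell is looked up with its key string inside curly braces.
print $Jason{'firstname'}, "\n";   # prints Jason
print $Jason{'lastname'}, "\n";    # prints Nugent
```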

Note that, like scalar arrays, individual cells of the hash are referenced using a $, since each cell
is, in fact, a scalar variable.

The Environmental Variable Hash

The whole point of that introduction to variables was so everyone can understand the environmental
variable hash. This hash is called %ENV and contains elements that are set (usually) when a CGI program is
invoked from a browser. One of the most important ones when dealing with information submitted from a form
is the QUERY_STRING environmental variable. This variable contains the information submitted when a GET
request method is used, and is the part of the URL which appears after the question mark. So, a request
for:

http://www.server.com/cgi-bin/script.pl?item1=Jason&item2=Nugent

will set QUERY_STRING to

item1=Jason&item2=Nugent

To reference the QUERY_STRING environmental variable in Perl, use:

$ENV{'QUERY_STRING'};

Generally, the first thing that you would do is to copy the contents of QUERY_STRING to another variable
which has an easier name to work with.

$form_info = $ENV{'QUERY_STRING'};

and then you can work with $form_info instead.

GET or POST. Which was it?

Since GET and POST submit their information to the CGI program in two different ways, you have to
figure out which one was used before you can attempt to do anything with it. If it was GET, then you can
get your information from QUERY_STRING. If it was POST, QUERY_STRING will normally be empty because a
different mechanism is used to submit information. To determine whether it was GET or POST, we need
to look at another environmental variable - REQUEST_METHOD. This variable contains either GET or POST,
depending on what was used to submit the information.

$method = $ENV{'REQUEST_METHOD'};

What about POST? How do I get information then?

If the request method is set to POST, QUERY_STRING will not carry your form data. Instead, your form
input will come in via STDIN and must be read from there. Fortunately, Perl makes this easy.

To read the required amount of text from STDIN we make use of Perl's read() function. In order to know
how much information to read, we have to use one more environmental variable, called CONTENT_LENGTH. This
variable contains the number of bytes of information sent by the browser, after the request headers
have been sent. The following code will examine the environmental variables and then read in information
from STDIN until it has all of what was sent.
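
Here is a minimal sketch of that step. The sample form data and the trick of reopening STDIN on an
in-memory string are my assumptions, there only so that the example runs outside a web server. Under a
real server, REQUEST_METHOD, CONTENT_LENGTH, and STDIN are already set up for you and only the read()
call is needed.

```perl
use strict;
use warnings;

# Assumed sample request, so the sketch can run from the command line.
my $sample_body = 'item1=Jason&item2=Nugent';
$ENV{'REQUEST_METHOD'} = 'POST';
$ENV{'CONTENT_LENGTH'} = length($sample_body);
close(STDIN);
open(STDIN, '<', \$sample_body) or die "Cannot reopen STDIN: $!";

# The part that matters: read CONTENT_LENGTH bytes from STDIN into $data.
my $data = '';
if ($ENV{'REQUEST_METHOD'} eq 'POST') {
    read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
}

print "$data\n";   # prints item1=Jason&item2=Nugent
```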

The read() function will read the required amount of information from STDIN, and store it in the
variable $data. To be more particular, read() reads information from a "filehandle", which, in this case,
is STDIN. Typically, filehandles are used to manipulate files that have been opened, and are also used to
redirect output around your Perl program. In this example, STDIN is a filehandle which is used as a pipe
to information coming in to your script from the server.

How do I Write a Script that does Both?

This is a great question, and most certainly a valid one. If your script can handle both request
methods, you are free to change the type used in your form without having to modify your script. To do
this requires one more Perl technique - a control structure. In particular, we are going to look at the
if statement.

The if statement in Perl is very similar to the one used in JavaScript. Generally, it is made
up of three sections:

if ( some expression is true ) {
    then perform code here;
}

So, for example, we might have something like this:

$x = 4;
if ($x < 5) { # this will evaluate to true, since $x is 4
    print "x is less than 5";
}

In our case, we must use an if statement to determine whether or not to use GET or POST as our
request method. So, we have
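
A sketch of what that might look like is below. The get_form_info() helper name and the sample values
are my assumptions, not part of any standard library, and I have used elsif (an "otherwise, if" branch)
to keep the two cases apart.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw, still-encoded form data for either method.
sub get_form_info {
    my $method    = defined $ENV{'REQUEST_METHOD'} ? $ENV{'REQUEST_METHOD'} : '';
    my $form_info = '';
    if ($method eq 'GET') {
        # GET: the data is whatever followed the question mark in the URL
        $form_info = $ENV{'QUERY_STRING'} if defined $ENV{'QUERY_STRING'};
    }
    elsif ($method eq 'POST') {
        # POST: the data arrives on STDIN, CONTENT_LENGTH bytes of it
        read(STDIN, $form_info, $ENV{'CONTENT_LENGTH'} || 0);
    }
    return $form_info;
}

# Assumed sample values so the sketch runs outside a web server
$ENV{'REQUEST_METHOD'} = 'GET';
$ENV{'QUERY_STRING'}   = 'item1=Jason&item2=Nugent';
print get_form_info(), "\n";   # prints item1=Jason&item2=Nugent
```

Whichever branch runs, the result is one string of name/value pairs in the same encoded format, so the
rest of the script does not need to care which method the form used.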

Alright, now we have the information that was submitted from the form. As you may recall, we now have to
decode it. First, let's take a look at a sample QUERY_STRING.

item1=Jason&item2=Nugent&item3=Jason+Nugent

This string has three items, and each one has a value associated with it. The first logical step would
be to separate them into separate name/value pairs. Perl can accomplish this step quite easily using the
split command. Consider the following code:

@name_value_pairs = split (/&/, $ENV{'QUERY_STRING'});

This line will take our query string variable and split it on every occurrence of an ampersand, and store
each fragment in the array @name_value_pairs. What we will have after performing this on the above string
is:

item1=Jason
item2=Nugent
item3=Jason+Nugent

each stored separately in @name_value_pairs. Life becomes a bit more complicated at this point. How do
we separate each name/value pair, and still keep track of them all? The answer is, of course, by using a
hash. If you create a hash and use the names as the key strings, you can easily reference each
item submitted from the form. So, we have to loop through the array and work on each pair, one at a time.
For this, we use the foreach loop.

In the foreach loop, $pair represents each consecutive value in the @name_value_pairs array and changes
to reflect the new value each time the loop is cycled through.

The =~ operator binds the operation on its right to the variable on its left. It basically says
"perform this match, substitution, or translation on this variable, and put any changes back into it". In
this case, every + sign found in the string is replaced with a space and the new value is put back into
the $value variable.

Consider this code:

%form_results; # this is the hash which will store the form results
foreach $pair (@name_value_pairs) {
    # split each pair on the equal sign, and store each
    # fragment in $key and $value
    ($key, $value) = split (/=/, $pair);

    # a quick translation (tr is not a regex) to convert pluses
    # back to spaces in your $value. Jason+Nugent
    # becomes Jason Nugent, which is good.
    $value =~ tr/+/ /;

    # the next line will take a bit of explaining.
    # See outside the code for it
    $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;

    # The next line adds a new "cell" to the %form_results hash, indexed
    # by $key, with a value of $value. As an example, consider a
    # name/value pair called colour=blue. The line below would create a
    # new entry in the hash indexed by the string "colour", with a value
    # of "blue".
    $form_results{$key} = $value;
}

That foreach loop cycles through each name/value pair, splits it apart on the equal sign,
converts pluses to spaces, and then that last line DECODES any hexadecimally encoded characters remaining
in the string. It's a fun little regular expression, so let's take a look:

s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg

First off, the s/ / / indicates that you are performing a substitution on a string. The regular
expression between the first two slashes contains the pattern that you want to match, and the section
between the last two slashes contains the code that you want to replace the matched text with.

The /eg modifiers at the end of the substitution operator influence how the operation is performed. In
this case, the "e" represents the fact that we are replacing the matched text with Perl code that needs to
be "evaluated" before the substitution can take place. The /g makes the operation "global". More on
these two below.

Let's look at the match regex first:

%([\dA-Fa-f][\dA-Fa-f])

This regular expression begins its match with a percent sign, which marks the start of a hexadecimally
encoded character. It is then followed by:

([\dA-Fa-f][\dA-Fa-f])

so let's look at that. The interior of this section contains [\dA-Fa-f][\dA-Fa-f], or two identical
[\dA-Fa-f] classes. A class will match a SINGLE instance of one of the characters which appears inside it.
In this case, [\dA-Fa-f] will match a single digit (referenced by the \d, and equivalent to 0-9), OR a
single capital letter from A to F (A-F) OR a single lowercase letter from a to f (a-f). Each one of the
[\dA-Fa-f] code sections can match one of these, so this regular expression will match things like

%7E or %AA or %Af or %b6 etc.

Note that this section of the regular expression is stored in parentheses (). This is important later
on, since if a match is found, the section of the regex in parentheses is stored in a variable called $1,
if it was the first parenthesized section. If a second set of parentheses was used, its regex match would
be stored in $2, and so on.

So, let's look at the replacement part of the substitution.

pack ("C", hex ($1))

Note that this is NOT a regular expression. The replacement part of the substitution operator cannot
contain a regular expression. It can, however, contain executable Perl code, which is why there is an "e"
after the operator. Note the nested hex() function. It takes the matched text (something like 7E, which
is stored in $1), interprets it as a hexadecimal number, and returns its numeric value (126 for 7E). This
number is then passed as an argument to the pack() function, which also takes a "C" argument in this case.
The C tells pack to return an "unsigned char value", which will be the original unencoded text character
entered by the user in the form. The final "g" modifier on the substitution operator means perform the
operation globally, so all the encoded characters get changed back. If this was left off, the
substitution would stop after the first match. Neat, eh?
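
To see the pieces working on their own, here is a small sketch. The sample string is an assumed value
that I made up for the demonstration.

```perl
use strict;
use warnings;

# hex() turns the two matched characters into a number ...
print hex('7E'), "\n";               # prints 126
# ... and pack("C", ...) turns that number into the character itself.
print pack("C", hex('7E')), "\n";    # prints ~

# The same two steps applied to a whole (assumed) encoded value
my $value = '%7Ejason%2Fmy+page';
$value =~ tr/+/ /;                                             # pluses back to spaces
$value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;   # %XX back to characters
print "$value\n";                    # prints ~jason/my page
```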

So now, after all this, we have a hash called %form_results which contains all the information submitted
to the CGI program. You can now do whatever you want with these values - print them back to the user,
store them in a file on the server, or put them in a cookie. For now, we are just going to print them back
out to the browser.

Remember, the first line printed out from our CGI script must tell the browser what type of content is to
follow. By setting the MIME type to "text/html", the browser knows what to do with the information it is
receiving from the server.

A final word of explanation here - since the foreach loop cycles through a regular array, it
can't handle a hash. It can, however, handle a list, which is why the keys() function is used. keys()
returns a list of all the keys of a hash, which then gets passed to the sort() function. sort() puts them
in some semblance of alphabetical order, and then the foreach loop easily handles them.
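
The printing step described above can be sketched roughly as follows. The results_page() helper, the
sample data, and the HTML layout are my assumptions for illustration. In the real script, %form_results
would be the hash built from the submitted form.

```perl
use strict;
use warnings;

# Assumed sample data standing in for the hash built from the form
my %form_results = (
    'item1' => 'Jason',
    'item2' => 'Nugent',
    'item3' => 'Jason Nugent',
);

# Hypothetical helper: build the whole response as one string.
sub results_page {
    my (%results) = @_;
    # The Content-type header comes first, followed by a blank line
    my $page = "Content-type: text/html\n\n";
    $page .= "<html><body>\n";
    # keys() gives a list of the hash keys, sort() puts them in order
    foreach my $key (sort keys %results) {
        # A real script should escape HTML characters in $key and the value first
        $page .= "<p>$key = $results{$key}</p>\n";
    }
    $page .= "</body></html>\n";
    return $page;
}

print results_page(%form_results);
```

Building the page in a string and printing it once is just a convenience here. Printing each line as you
go works equally well, as long as the Content-type line and the blank line after it come first.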

Working Example

The Next Article

I think the next article will be a combination of two things - Server Side Includes (since everyone
seems to want to use them), and also a much needed discussion on CGI security and how you can do your part.
I figure that a bit of security talk will help people think more clearly about the dangers of writing a
poor CGI script. Till then!