Introduction

The genius of PHP is its seamless integration of form variables into your programs. It makes web programming smooth and simple, from web form to PHP code to HTML output.

There's no built-in mechanism in HTTP to allow you to save information from one page so you can access it in other pages. That's because HTTP is a stateless protocol. Recipe 9.2, Recipe 9.4, Recipe 9.5, and Recipe 9.6 all show ways to work around the fundamental problem of figuring out which user is making which requests to your web server.

Processing data from the user is the other main topic of this chapter. You should never trust the data coming from the browser, so it's imperative to always validate all fields, even hidden form elements. Validation takes many forms, from ensuring the data match certain criteria, as discussed in Recipe 9.3, to escaping HTML entities to allow the safe display of user-entered data, as covered in Recipe 9.9. Furthermore, Recipe 9.8 tells how to protect the security of your web server, and Recipe 9.7 covers how to process files uploaded by a user.

Whenever PHP processes a page, it checks for GET and POST form variables, uploaded files, applicable cookies, and web server and environment variables. These are then directly accessible in the following arrays: $_GET, $_POST, $_FILES, $_COOKIE, $_SERVER, and $_ENV. They hold, respectively, all variables set by GET requests, POST requests, uploaded files, cookies, the web server, and the environment. There's also $_REQUEST, which is one giant array that contains the values from the other six arrays.

When placing elements inside of $_REQUEST, if two arrays both have a key with the same name, PHP falls back upon the variables_order configuration directive. By default, variables_order is EGPCS (or GPCS, if you're using the php.ini-recommended configuration file). So, PHP first adds environment variables to $_REQUEST and then adds GET, POST, cookie, and web server variables to the array, in this order. For instance, since C comes after P in the default order, a cookie named username overwrites a POST variable named username.
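The overwrite rule can be illustrated with a small simulation. This is only a sketch: the helper name sketch_build_request( ) and the sample values are hypothetical, and PHP builds $_REQUEST internally rather than through any such function.

```php
<?php
// Hypothetical helper: rebuilds a $_REQUEST-like array from separate source
// arrays, applying them in the order given by a variables_order string.
function sketch_build_request(string $order, array $sources): array
{
    $request = [];
    foreach (str_split($order) as $letter) {
        if (isset($sources[$letter])) {
            // A later source replaces any earlier value stored under the same key.
            $request = array_replace($request, $sources[$letter]);
        }
    }
    return $request;
}

// Sample data (assumed): the same key arrives by POST and by cookie.
$sources = [
    'P' => ['username' => 'from_post'],
    'C' => ['username' => 'from_cookie'],
];

$merged = sketch_build_request('EGPCS', $sources);
echo $merged['username'], "\n"; // the cookie value, because C follows P
```

Reversing the two letters in the order string makes the POST value win instead.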

If you don't have access to PHP's configuration files, you can use ini_get( ) to check a setting:

print ini_get('variables_order');
EGPCS

You may need to do this because your ISP doesn't let you view configuration settings or because your script may run on someone else's server. You can also use phpinfo( ) to view settings. However, if you can't rely on the value of variables_order, you should directly access $_GET and $_POST instead of using $_REQUEST.

The arrays containing external variables, such as $_REQUEST, are superglobals. As such, they don't need to be declared as global inside of a function or class. It also means you probably shouldn't assign anything to these variables, or you'll overwrite the data stored in them.

Prior to PHP 4.1, these superglobal variables didn't exist. Instead there were regular arrays named $HTTP_COOKIE_VARS, $HTTP_ENV_VARS, $HTTP_GET_VARS, $HTTP_POST_VARS, $HTTP_POST_FILES, and $HTTP_SERVER_VARS. These arrays are still available for legacy reasons, but the newer arrays are easier to work with. These older arrays are populated only if the track_vars configuration directive is on, but, as of PHP 4.0.3, this feature is always enabled.

Finally, if the register_globals configuration directive is on, all these variables are also available as variables in the global namespace. So, $_GET['password'] is also just $password. While convenient, this introduces major security problems because malicious users can easily set variables from the outside and overwrite trusted internal variables. Starting with PHP 4.2, register_globals defaults to off.

With this knowledge, here is a basic script to put things together. The form asks the user to enter his first name, then replies with a welcome message. The HTML for the form looks like this:
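The original listing is not reproduced in this text. A minimal sketch consistent with the surrounding description (a text input named first_name, submitted with the post method to hello.php) might look like the following; the wording of the prompt and the button label are assumptions.

```php
<?php
// Sketch of the form page. Only the field name, the method, and the target
// script come from the text; the rest of the markup is assumed.
function sketch_hello_form(): string
{
    return <<<HTML
<form action="hello.php" method="post">
What is your first name?
<input type="text" name="first_name">
<input type="submit" value="Say Hello">
</form>
HTML;
}

echo sketch_hello_form();
```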

The name of the text input element inside the form is first_name. Also, the method of the form is post. This means that when the form is submitted, $_POST['first_name'] will hold whatever string the user typed in. (It could also be empty, of course, if he didn't type anything.)

For simplicity, however, let's assume the value in the variable is valid. (The term "valid" is open for definition, depending on certain criteria, such as not being empty, not being an attempt to break into the system, etc.) This allows us to omit the error checking stage, which is important but gets in the way of this simple example. So, here is a simple hello.php script to process the form:

echo 'Hello ' . $_POST['first_name'] . '!';

If the user's first name is Joe, PHP prints out:

Hello Joe!

Processing Form Input

Problem

You want to use the same HTML page to emit a form and then process the data entered into it. In other words, you're trying to avoid a proliferation of pages that each handle different steps in a transaction.

Solution

Use a hidden field in the form to tell your program that it's supposed to be processing the form. In this case, the hidden field is named stage and has a value of process:
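The Solution's listing is missing from this text. One way the idea might be sketched, assuming a single text field named first_name and a hypothetical helper called sketch_render_page( ), is:

```php
<?php
// Decide which step to run from the submitted data (normally $_POST).
function sketch_render_page(array $post, string $action): string
{
    if (isset($post['stage']) && $post['stage'] === 'process') {
        // Second pass: the hidden field is present, so handle the data.
        $name = htmlspecialchars($post['first_name'] ?? '', ENT_QUOTES, 'UTF-8');
        return "Hello $name!";
    }

    // First pass: no stage yet, so emit the form with the hidden trigger.
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return '<form action="' . $action . '" method="post">'
         . '<input type="text" name="first_name">'
         . '<input type="hidden" name="stage" value="process">'
         . '<input type="submit" value="Send">'
         . '</form>';
}

echo sketch_render_page($_POST, $_SERVER['PHP_SELF'] ?? '');
```

Passing the request data in as an argument, rather than reading $_POST inside the function, is a design choice made here only so the two branches are easy to exercise separately.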

Discussion

During the early days of the Web, when people created forms, they made two pages: a static HTML page with the form and a script that processed the form and returned a dynamically generated response to the user. This was a little unwieldy, because form.html led to form.cgi and if you changed one page, you needed to also remember to edit the other, or your script might break.

Forms are easier to maintain when all parts live in the same file and context dictates which sections to display. Use a hidden form field named stage to track your position in the flow of the form process; it acts as a trigger for the steps that return the proper HTML to the user. Sometimes, however, it's not possible to design your code to do this; for example, when your form is processed by a script on someone else's server.

When writing the HTML for your form, however, don't hardcode the path to your page directly into the action. This makes it impossible to rename or relocate your page without also editing it. Instead, PHP supplies a helpful variable:

$_SERVER['PHP_SELF']

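This variable holds the path of the currently executing script, relative to the document root. So, set the value of the action attribute to that value, and your form always resubmits to itself, even if you've moved the file to a new place on the server. Escape the value with htmlspecialchars( ) before printing it, because the path is taken from the request and can contain characters supplied by the user.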

See Also

Validating Form Input

Problem

You want to ensure data entered from a form passes certain criteria.

Solution

Create a function that takes a string to validate and returns true if the string passes a check and false if it doesn't. Inside the function, use regular expressions and comparisons to check the data. For example, Example 9-1 shows the pc_validate_zipcode( ) function, which validates a U.S. Zip Code.
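The listing for Example 9-1 is not reproduced in this text. A minimal sketch of such a function, assuming it applies the pattern given in the Discussion, might look like this:

```php
<?php
// Returns true for five digits, optionally followed by a four-digit extension
// that may be separated by a hyphen or a space.
function pc_validate_zipcode(string $zipcode): bool
{
    return preg_match('/^[0-9]{5}([- ]?[0-9]{4})?$/', $zipcode) === 1;
}

foreach (['02138', '02138-1234', '0213'] as $candidate) {
    echo $candidate, ': ', pc_validate_zipcode($candidate) ? 'ok' : 'rejected', "\n";
}
```

Note that in PCRE the $ anchor also matches just before a trailing newline; add the D modifier to the pattern if that case must be rejected.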

Discussion

Deciding what constitutes valid and invalid data is almost more of a philosophical task than a straightforward matter of following a series of fixed steps. In many cases, what may be perfectly fine in one situation won't be correct in another.

The easiest check is making sure the field isn't blank. The empty( ) function best handles this problem.

Next come relatively easy checks, such as the case of a U.S. Zip Code. Usually, a regular expression or two can solve these problems. For example:

/^[0-9]{5}([- ]?[0-9]{4})?$/

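matches any correctly formatted U.S. Zip Code: five digits, optionally followed by a four-digit extension. It checks the format only, not whether the code is actually in use.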

Sometimes, however, coming up with the correct regular expression is difficult. If you want to verify that someone has entered only two names, such as "Alfred Aho," you can check against:

/^[A-Za-z]+ +[A-Za-z]+$/

However, Tim O'Reilly can't pass this test. An alternative is /^\S+\s+\S+$/; but then Donald E. Knuth is rejected. So think carefully about the entire range of valid input before writing your regular expression.
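The behavior of the two patterns against the three names mentioned can be checked with a short script. The variable names below are chosen for illustration only.

```php
<?php
// The two candidate patterns from the text.
$letters_only = '/^[A-Za-z]+ +[A-Za-z]+$/';
$two_tokens   = '/^\S+\s+\S+$/';

$names = ['Alfred Aho', "Tim O'Reilly", 'Donald E. Knuth'];

$results = [];
foreach ($names as $name) {
    $results[$name] = [
        'letters_only' => preg_match($letters_only, $name) === 1,
        'two_tokens'   => preg_match($two_tokens, $name) === 1,
    ];
}

foreach ($results as $name => $passes) {
    printf(
        "%-16s letters_only=%s two_tokens=%s\n",
        $name,
        $passes['letters_only'] ? 'pass' : 'fail',
        $passes['two_tokens'] ? 'pass' : 'fail'
    );
}
```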

In some instances, even with regular expressions, it becomes difficult to check if the field is legal. One particularly popular and tricky task is validating an email address, as discussed in Recipe 13.7. Another is how to make sure a user has correctly entered the name of her U.S. state. You can check against a listing of names, but what if she enters her postal service abbreviation? Will MA instead of Massachusetts work? What about Mass.?

One way to avoid this issue is to present the user with a dropdown list of pregenerated choices. Using a select element, users are forced by the form's design to select a state in the format that always works, which can reduce errors. This, however, presents another series of difficulties. What if the user lives some place that isn't one of the choices? What if the range of choices is so large this isn't a feasible solution?

There are a number of ways to solve these types of problems. First, you can provide an "other" option in the list, so that a non-U.S. user can successfully complete the form. (Otherwise, she'll probably just pick a place at random, so she can continue using your site.) Next, you can divide the registration process into a two-part sequence. For a long list of options, a user begins by picking the letter of the alphabet his choice begins with; then, a new page provides him with a list containing only the choices beginning with that letter.

Finally, there are even trickier problems. What do you do when you want to make sure the user has correctly entered information, but you don't want to tell her you did so? A situation where this is important is a sweepstakes; in a sweepstakes, there's often a special code box on the entry form in which a user enters a string — AD78DQ — from an email or flier she's received. You want to make sure there are no typos, or your program won't count her as a valid entrant. You also don't want to allow her to just guess codes, because then she could try out those codes and crack the system.

The solution is to have two input boxes. A user enters her code twice; if the two fields match, you accept the data as legal and then (silently) validate the data. If the fields don't match, you reject the entry and have the user fix it. This procedure eliminates typos and doesn't reveal how the code validation algorithm works; it can also prevent misspelled email addresses.

Finally, PHP performs server-side validation. Server-side validation requires that a request be made to the server, and a page returned in response; as a result, it can be slow. It's also possible to do client-side validation using JavaScript. While client-side validation is faster, it exposes your code to the user and may not work if the client doesn't support JavaScript or has disabled it. Therefore, you should always duplicate all client-side validation code on the server.

See Also

Recipe 13.7 for a regular expression for validating email addresses; Chapter 7, "Validation on the Server and Client," of Web Database Applications with PHP and MySQL (Hugh Williams and David Lane, O'Reilly).

Working with Multipage Forms

Problem

You want to use a form that displays more than one page and preserve data from one page to the next.

Solution

Use session tracking:

session_start();
$_SESSION['username'] = $_GET['username'];

You can also include variables from a form's earlier pages as hidden input fields in its later pages:
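The listing that belongs here is missing. A sketch of one way to emit such fields, using a hypothetical helper named sketch_hidden_fields( ) that escapes each value before printing it, is:

```php
<?php
// Turns an array of name => value pairs into hidden input elements.
function sketch_hidden_fields(array $values): string
{
    $html = '';
    foreach ($values as $name => $value) {
        $html .= sprintf(
            '<input type="hidden" name="%s" value="%s">' . "\n",
            htmlspecialchars((string) $name, ENT_QUOTES, 'UTF-8'),
            htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8')
        );
    }
    return $html;
}

// Carry the username from the previous page into the next form.
echo sketch_hidden_fields(['username' => $_GET['username'] ?? '']);
```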

Discussion

Whenever possible, use session tracking. It's more secure because users can't modify session variables. To begin a session, call session_start( ); this creates a new session or resumes an existing one. Note that this step is unnecessary if you've enabled session.auto_start in your php.ini file. Variables assigned to $_SESSION are automatically propagated. In the Solution example, the form's username variable is preserved by assigning $_GET['username'] to $_SESSION['username'].

To access this value on a subsequent request, call session_start( ) and then check $_SESSION['username']:
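The original listing is not present in this text. A sketch of the idea follows; the helper name and the messages are assumptions, and the call to session_start( ) is guarded only so the file can also be loaded from the command line, where there is no browser session.

```php
<?php
// Builds a message from whatever the session currently holds.
function sketch_session_greeting(array $session): string
{
    if (isset($session['username'])) {
        $name = htmlspecialchars((string) $session['username'], ENT_QUOTES, 'UTF-8');
        return "Welcome back, $name!";
    }
    return 'No username has been stored yet.';
}

if (PHP_SAPI !== 'cli') {
    // Resume the session first; only then is $_SESSION populated.
    session_start();
    echo sketch_session_greeting($_SESSION);
}
```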

In this case, if you don't call session_start( ), $_SESSION isn't set.

Be sure to secure the server and location where your session files are located (the filesystem, database, etc.); otherwise your system will be vulnerable to identity spoofing.

If session tracking isn't enabled for your PHP installation, you can use hidden form variables as a replacement. However, passing data using hidden form elements isn't secure because anyone can edit these fields and fake a request; with a little work, you can increase the security to a reliable level.

The most basic way to use hidden fields is to include them inside your form.

When this form is resubmitted, $_GET['username'] holds its previous value unless someone has modified it.

A more complex but secure solution is to convert your variables to a string using serialize( ), compute a secret hash of the data, and place both pieces of information in the form. Then, on the next request, validate the data and unserialize it. If it fails the validation test, you'll know someone has tried to modify the information.

The pc_encode( ) encoding function shown in Example 9-2 takes the data to encode in the form of an array.

The pc_decode( ) function recreates the hash of the secret word and compares it to the hash value from the form. If they're equal, $data is valid, so it's unserialized. If it flunks the test, the function writes a message to the error log and returns false.
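Examples 9-2 and 9-3 are not reproduced in this text. The pair of functions below is a sketch of the same idea, not the original listings: it swaps in a keyed hash (hash_hmac( ) with SHA-256, compared using hash_equals( )) for whatever hashing the original used, and it assumes PHP 7 or later. The function names and the SKETCH_SECRET constant are hypothetical.

```php
<?php
// Assumption: the secret lives in server-side configuration, never in the form.
const SKETCH_SECRET = 'replace-with-a-long-random-secret';

// Returns the encoded data and the hash that protects it.
function sketch_encode(array $data): array
{
    $encoded = base64_encode(serialize($data));
    $hash = hash_hmac('sha256', $encoded, SKETCH_SECRET);
    return [$encoded, $hash];
}

// Returns the original array, or false if the data or hash was altered.
function sketch_decode(string $encoded, string $hash)
{
    $expected = hash_hmac('sha256', $encoded, SKETCH_SECRET);
    if (!hash_equals($expected, $hash)) {
        error_log('Form data failed the hash check');
        return false;
    }
    // Only scalars and arrays are expected, so refuse to create objects.
    return unserialize(base64_decode($encoded), ['allowed_classes' => false]);
}

list($data, $hash) = sketch_encode(['username' => 'joe']);
var_dump(sketch_decode($data, $hash) === ['username' => 'joe']);
```

Both returned strings contain only characters that are safe inside an HTML attribute value, so they can be placed directly in hidden fields.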

At the top of the script, we pass pc_decode( ) the variables from the form for decoding. Once the information is loaded into $data, form processing can proceed by checking in $_GET for new variables and in $data for old ones. Once that's complete, update $data to hold the new values and then encode it, calculating a new hash in the process. Finally, print out the new form and include $data and $hash as hidden variables.

Redisplaying Forms with Preserved Information and Error Messages

Problem

When there's a problem with data entered in a form, you want to print out error messages alongside the problem fields, instead of a generic error message at the top of the form. You also want to preserve the values the user typed into the form the first time.

Solution

Use an array, $errors, and store your messages in the array indexed by the name of the field.

Discussion

If your users encounter errors when filling out a long form, you can increase the overall usability of your form if you highlight exactly where the errors need to be fixed.

Consolidating all errors in a single array has many advantages. First, you can easily check if your validation process has located any items that need correction; just use count($errors). This method is easier than trying to keep track of this fact in a separate variable, especially if the flow is complex or spread out over multiple functions. Example 9-4 shows the pc_validate_form( ) validation function, which uses an $errors array.
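Example 9-4 is missing from this text. The following is a sketch of a validation function built around an $errors array; the field names (name, rank, serial), the rules, and the messages are all assumptions made for illustration.

```php
<?php
// Returns an array of error messages keyed by field name; empty means valid.
function sketch_validate_form(array $input): array
{
    $errors = [];

    // Assumed rule: these fields must not be blank.
    foreach (['name', 'rank'] as $field) {
        if (trim((string) ($input[$field] ?? '')) === '') {
            $errors[$field] = "Please enter a $field.";
        }
    }

    // Assumed rule: the serial number consists of digits only.
    if (!preg_match('/^[0-9]+$/D', (string) ($input['serial'] ?? ''))) {
        $errors['serial'] = 'The serial number must contain only digits.';
    }

    return $errors;
}

$errors = sketch_validate_form($_POST);
if (count($errors) > 0) {
    echo count($errors), " field(s) need correction\n";
}
```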

This is clean code because all errors are stored in one variable. You can easily pass around the variable if you don't want it to live in the global scope.

Using the variable name as the key preserves the links between the field that caused the error and the actual error message itself. These links also make it easy to loop through items when displaying errors.

You can automate the repetitive task of printing the form; the pc_print_form() function in Example 9-5 shows how.

The complex part of pc_print_form( ) comes from the $fields array. The key is the variable name; the value is the pretty display name. By defining them at the top of the function, you can create a loop and use foreach to iterate through the values; otherwise, you need three separate lines of identical code. This integrates with the variable name as a key in $errors, because you can find the error message inside the loop just by checking $errors[$field].
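Example 9-5 is not included in this text. A sketch of the loop described above, assuming three text fields named name, rank, and serial and a hypothetical function name, could look like this:

```php
<?php
// Prints each field with its preserved value and, if present, its error.
function sketch_print_form(array $input, array $errors): string
{
    // Key is the variable name; value is the display name.
    $fields = ['name' => 'Name', 'rank' => 'Rank', 'serial' => 'Serial'];

    $html = '<form method="post">' . "\n";
    foreach ($fields as $field => $label) {
        if (isset($errors[$field])) {
            // The same key links the field to its error message.
            $html .= '<b>' . htmlspecialchars($errors[$field], ENT_QUOTES, 'UTF-8') . "</b><br>\n";
        }
        $value = htmlspecialchars((string) ($input[$field] ?? ''), ENT_QUOTES, 'UTF-8');
        $html .= "$label: <input type=\"text\" name=\"$field\" value=\"$value\"><br>\n";
    }
    return $html . '<input type="submit" value="Send">' . "\n</form>\n";
}

echo sketch_print_form(['name' => 'Joe'], ['rank' => 'Please enter a rank.']);
```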

If you want to extend this example beyond input fields of type text, modify $fields to include more meta-information about your form fields:
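The listing that followed is missing. One possible shape for such an array, with an assumed type key and an assumed options key for select elements, is sketched here together with a hypothetical function that renders a single field from it.

```php
<?php
// Each field now carries its label plus extra meta-information (assumed keys).
$fields = [
    'name' => ['label' => 'Name', 'type' => 'text'],
    'rank' => ['label' => 'Rank', 'type' => 'select', 'options' => ['Private', 'Sergeant']],
];

// Renders one form element according to its meta-information.
function sketch_render_field(string $name, array $meta): string
{
    if ($meta['type'] === 'select') {
        $html = "<select name=\"$name\">";
        foreach ($meta['options'] as $option) {
            $option = htmlspecialchars($option, ENT_QUOTES, 'UTF-8');
            $html .= "<option>$option</option>";
        }
        return $html . '</select>';
    }
    return "<input type=\"{$meta['type']}\" name=\"$name\">";
}

foreach ($fields as $name => $meta) {
    echo $meta['label'], ': ', sketch_render_field($name, $meta), "\n";
}
```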

See Also

Guarding Against Multiple Submission of the Same Form

Problem

You want to prevent people from submitting the same form multiple times.

Solution

Generate a unique identifier and store the token as a hidden field in the form. Before processing the form, check to see if that token has already been submitted. If it hasn't, you can proceed; if it has, you should generate an error.
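The Solution's listing is not present in this text. The sketch below shows one way to generate and check a token; it uses random_bytes( ), which assumes PHP 7 or later and is not necessarily how the original generated its identifier. The function names and the in-memory $seen array, which stands in for real storage, are hypothetical.

```php
<?php
// Produces a 32-character hexadecimal token for one instance of the form.
function sketch_new_token(): string
{
    return bin2hex(random_bytes(16));
}

// Returns true the first time a token is seen and false on a repeat.
function sketch_accept_once(string $token, array &$seen): bool
{
    if (isset($seen[$token])) {
        return false;
    }
    $seen[$token] = true;
    return true;
}

$token = sketch_new_token();
echo '<input type="hidden" name="token" value="', $token, '">', "\n";

$seen = [];
var_dump(sketch_accept_once($token, $seen)); // first submission
var_dump(sketch_accept_once($token, $seen)); // resubmission
```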

Discussion

For a variety of reasons, users often resubmit a form. Usually it's a slip-of-the-mouse: double-clicking the Submit button. They may hit their web browser's Back button to edit or recheck information, but then they re-hit Submit instead of Forward. It can be intentional: they're trying to stuff the ballot box for an online survey or sweepstakes. Our Solution prevents the nonmalicious attack and can slow down the malicious user. It won't, however, eliminate all fraudulent use: more complicated work is required for that.

The Solution does prevent your database from being cluttered with too many copies of the same record. By generating a token that's placed in the form, you can uniquely identify that specific instance of the form, even when cookies are disabled. When you then save the form's data, you store the token alongside it. That allows you to easily check if you've already seen this form and which database record it belongs to.

Start by adding an extra column to your database table — unique_id — to hold the identifier. When you insert data for a record, add the ID also. For example:
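The example that followed is missing from this text. As a sketch under stated assumptions, the code below uses PDO with an in-memory SQLite database (which requires the pdo_sqlite extension) and a hypothetical table named entries; the original may well have used a different database layer.

```php
<?php
// Saves a record unless its token is already stored; returns false on a repeat.
function sketch_save_entry(PDO $db, string $token, string $name): bool
{
    $check = $db->prepare('SELECT COUNT(*) FROM entries WHERE unique_id = ?');
    $check->execute([$token]);
    if ((int) $check->fetchColumn() > 0) {
        return false;
    }

    $insert = $db->prepare('INSERT INTO entries (unique_id, name) VALUES (?, ?)');
    $insert->execute([$token, $name]);
    return true;
}

// Hypothetical schema: unique_id is the extra column holding the token.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE entries (unique_id TEXT UNIQUE, name TEXT)');

var_dump(sketch_save_entry($db, 'token-from-form', 'Joe')); // new record
var_dump(sketch_save_entry($db, 'token-from-form', 'Joe')); // resubmission
```

The UNIQUE constraint is a second line of defense: if two identical requests arrive at nearly the same moment, the database rejects the second insert even if both passed the check.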

By associating the exact row in the database with the form, you can more easily handle a resubmission. There's no correct answer here; it depends on your situation. In some cases, you'll want to ignore the second posting altogether. In others, you'll want to check if the record has changed, and, if so, present the user with a dialog box asking if they want to update the record with the new information or keep the old data. Finally, to reflect the second form submission, you could update the record silently, and the user never learns of a problem.

All these possibilities should be considered given the specifics of the interaction. Our opinion is there's no reason to allow the deficits of HTTP to dictate the user experience. So, while the third choice, silently updating the record, isn't what normally happens, in many ways this is the most natural option. Applications we've developed with this method are more user friendly; the other two methods confuse or frustrate most users.

It's tempting to avoid generating a random token and instead use a number one greater than the number of records already in the database. The token and the primary key will thus be the same, and you don't need to use an extra column. There are (at least) two problems with this method. First, it creates a race condition. What happens when a second person starts the form before the first person has completed it? The second form will then have the same token as the first, and conflicts will occur. This can be worked around by creating a new blank record in the database when the form is requested, so the second person will get a number one higher than the first. However, this can lead to empty rows in the database if users opt not to complete the form.

The other reason not to do this is that it makes it trivial to edit another record in the database by manually adjusting the ID to a different number. Depending on your security settings, a fake GET or POST submission allows the data to be altered without difficulty. A long random token, however, can't be guessed merely by moving to a different integer.

Solution

Discussion

Starting in PHP 4.1, all uploaded files appear in the $_FILES superglobal array. For each file, there are four pieces of information:

name

The original name of the file on the user's machine, as reported by the browser

type

The MIME type of the file

size

The size of the file in bytes

tmp_name

The location in which the file is temporarily stored on the server.

If you're using an earlier version of PHP, you need to use $HTTP_POST_FILES instead.

After you've selected a file from that array, use is_uploaded_file( ) to confirm that the file you're about to process is a legitimate file resulting from a user upload, then process it as you would other files on the system. Always do this. If you blindly trust the filename supplied by the user, someone can alter the request and add names such as /etc/passwd to the list for processing.
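The check might be wrapped up as follows. This is a sketch: the helper name is hypothetical, and the field name event is borrowed from the move_uploaded_file( ) example in this recipe.

```php
<?php
// Returns the temporary path of a genuine upload, or false otherwise.
function sketch_uploaded_path(array $files, string $field)
{
    if (!isset($files[$field]['tmp_name'])) {
        return false;
    }
    $tmp = $files[$field]['tmp_name'];
    // is_uploaded_file() is true only for files PHP itself received via HTTP POST.
    if (!is_string($tmp) || !is_uploaded_file($tmp)) {
        return false;
    }
    return $tmp;
}

$path = sketch_uploaded_path($_FILES, 'event');
if ($path === false) {
    echo "No legitimate upload found\n";
} else {
    echo 'Uploaded ', filesize($path), " bytes\n";
}
```

A forged entry that points at an existing system file is rejected, because that file did not arrive through PHP's upload mechanism.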

You can also move the file to a permanent location; use move_uploaded_file( ) to safely transfer the file:

// move the file: move_uploaded_file() also does a check of the file's
// legitimacy, so there's no need to also call is_uploaded_file()
move_uploaded_file($_FILES['event']['tmp_name'], '/path/to/file.txt');

Note that the value stored in tmp_name is the complete path to the file, not just the base name. Use basename( ) to chop off the leading directories if needed.

Be sure to check that PHP has permission to read and write to both the directory in which temporary files are saved (see the upload_tmp_dir configuration directive to check where this is) and the location to which you're trying to copy the file. PHP often runs as user nobody or apache, instead of your personal username. Because of this, if you're running under safe_mode, copying a file to a new location will probably not allow you to access it again.

Processing files can often be a subtle task because not all browsers submit the same information. It's important to do it correctly, however, or you open yourself up to a possible security hole. You are, after all, allowing strangers to upload any file they choose to your machine; malicious people may see this as an opportunity to crack into or crash the computer.

As a result, PHP has a number of features that allow you to place restrictions on uploaded files, including the ability to turn off file uploads altogether. So, if you're experiencing difficulty processing uploaded files, check that your file isn't being rejected because it seems to pose a security risk.

To check, first make sure file_uploads is set to On inside your configuration file. Next, make sure your file isn't larger than upload_max_filesize; this defaults to 2 MB, which stops someone from trying to crash the machine by filling up the hard drive with a giant file. Additionally, there's a post_max_size directive, which controls the maximum size of all the POST data allowed in a single request; its initial setting is 8 MB.
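These settings can be read at runtime with ini_get( ). Size values are usually written in shorthand such as 2M, so the sketch below includes a hypothetical helper, sketch_ini_bytes( ), that converts the K, M, and G suffixes to bytes; it is an illustration, not a complete parser for every form the setting can take.

```php
<?php
// Converts shorthand such as "2M" or "512K" to a number of bytes.
function sketch_ini_bytes(string $value): int
{
    $value = trim($value);
    $number = (int) $value; // the cast keeps only the leading digits
    switch (strtoupper(substr($value, -1))) {
        case 'G':
            $number *= 1024; // intentional fall-through
        case 'M':
            $number *= 1024; // intentional fall-through
        case 'K':
            $number *= 1024;
    }
    return $number;
}

echo 'file_uploads: ', ini_get('file_uploads'), "\n";
echo 'upload_max_filesize: ', sketch_ini_bytes((string) ini_get('upload_max_filesize')), " bytes\n";
echo 'post_max_size: ', sketch_ini_bytes((string) ini_get('post_max_size')), " bytes\n";
```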

From the perspective of browser differences and user error, if you can't get $_FILES to populate with information, make sure you add enctype="multipart/form-data" to the form's opening tag; PHP needs this to trigger processing. If you can't do so, you need to manually parse $HTTP_RAW_POST_DATA. (See RFCs 1521 and 1522 for the MIME specification at http://www.faqs.org/rfcs/rfc1521.html and http://www.faqs.org/rfcs/rfc1522.html.)

Also, if no file is selected for uploading, versions of PHP prior to 4.1 set tmp_name to none; newer versions set it to the empty string. PHP 4.2.1 allows files of length 0. To be sure a file was uploaded and isn't empty (although blank files may be what you want, depending on the circumstances), you need to make sure tmp_name is set and size is greater than 0. Last, not all browsers necessarily send the same MIME type for a file; what they send depends on their knowledge of different file types.

See Also

Securing PHP's Form Processing

Problem

You want to securely process form input variables and not allow someone to maliciously alter variables in your code.

Solution

Disable the register_globals configuration directive and access variables only from the $_REQUEST array. To be even more secure, use $_GET, $_POST, and $_COOKIE to make sure you know exactly where your variables are coming from.

To do this, make sure this line appears in your php.ini file:

register_globals = Off

As of PHP 4.2, this is the default configuration.

Discussion

When register_globals is set on, external variables, including those from forms and cookies, are imported directly into the global namespace. This is a great convenience, but it can also open up some security holes if you're not very diligent about checking your variables and where they're defined. Why? Because there may be a variable you use internally that isn't supposed to be accessible from the outside but has its value rewritten without your knowledge.

Here is a simple example. You have a page in which a user enters a username and password. If they are validated, you return her user identification number and use that numerical identifier to look up and print out her personal information:
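The original listing is missing here. The sketch below reconstructs the vulnerable pattern only in spirit: because register_globals was removed in PHP 5.4, it simulates the directive with extract( ), and the lookup function, credentials, and messages are all hypothetical stand-ins.

```php
<?php
// Stand-in for a real credential lookup: returns a user ID or null.
function sketch_check_login(string $user, string $pass): ?int
{
    return ($user === 'alice' && $pass === 'secret') ? 7 : null;
}

function sketch_vulnerable_page(array $request): string
{
    // Simulates register_globals: every request key becomes a local variable.
    extract($request);

    $found = sketch_check_login($username ?? '', $password ?? '');
    if ($found !== null) {
        $id = $found;
    }

    // Bug: $id may have been planted by the request rather than by the lookup.
    if (!empty($id)) {
        return "Showing personal information for user $id";
    }
    return 'Login failed';
}

// A bad login that also passes id=1 still reaches the second step.
echo sketch_vulnerable_page(['username' => 'x', 'password' => 'y', 'id' => '1']), "\n";
```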

Normally, $id is set only by your program and is a result of a verified database lookup. However, if someone alters the GET string, and passes in a value for $id, with register_globals enabled, even after a bad username and password lookup, your script still executes the second database query and returns results. Without register_globals, $id remains unset because only $_REQUEST['id'] (and $_GET['id']) are set.

Of course, there are other ways to solve this problem, even when using register_globals. You can restructure your code not to allow such a loophole.
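The restructured listing is not present in this text. A sketch of the idea, with a hypothetical lookup function standing in for the database call, is to make the second step depend directly on the return value of the first:

```php
<?php
// Stand-in for a verified database lookup: returns a user ID or null.
function sketch_lookup_user_id(string $user, string $pass): ?int
{
    return ($user === 'alice' && $pass === 'secret') ? 7 : null;
}

function sketch_safe_page(array $request): string
{
    // $id is assigned unconditionally, so a value from the request cannot survive.
    $id = sketch_lookup_user_id(
        (string) ($request['username'] ?? ''),
        (string) ($request['password'] ?? '')
    );

    if ($id === null) {
        return 'Login failed';
    }
    return "Showing personal information for user $id";
}

// The planted id is ignored because the lookup result always overwrites it.
echo sketch_safe_page(['username' => 'x', 'password' => 'y', 'id' => '1']), "\n";
```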

Now you use $id only when it's been explicitly set from a database call. Sometimes, however, it is difficult to do this because of how your program is laid out. Another solution is to manually unset( ) or initialize all variables at the top of your script:

unset($id);

This removes the bad $id value before it gets a chance to affect your code. However, because PHP doesn't require variable initialization, it's possible to forget to do this in one place; a bug can then slip in without a warning from PHP.

Solution

Discussion

PHP has a pair of functions to escape characters in HTML. The most basic is htmlspecialchars( ), which escapes four characters: <, >, ", and &. Depending on optional parameters, it can also translate ' instead of or in addition to ". For more complex encoding, use htmlentities( ); it expands on htmlspecialchars( ) to encode any character that has an HTML entity.
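A short illustration of the difference follows. The sample strings are made up, and the flags and the UTF-8 character set are passed explicitly because the defaults have changed between PHP versions.

```php
<?php
$raw = '<a href="x">Tom & Jerry\'s</a>';

// ENT_COMPAT converts double quotes but leaves single quotes alone.
$compat = htmlspecialchars($raw, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES converts both kinds of quotes.
$quotes = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// htmlentities() also converts characters such as e-acute to named entities.
$entities = htmlentities("caf\u{e9}", ENT_QUOTES, 'UTF-8');

echo $compat, "\n", $quotes, "\n", $entities, "\n";
```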

Both functions allow you to pass in a character encoding table that defines what characters map to what entities. To retrieve either table used by the previous functions, use get_html_translation_table( ) and pass in HTML_ENTITIES or HTML_SPECIALCHARS. This returns an array that maps characters to entities; you can use it as the basis for your own table.

Handling Remote Variables with Periods in Their Names

Problem

You want to process a variable with a period in its name, but when a form is submitted, you can't find the variable.

Solution

Replace the period in the variable's name with an underscore. For example, if you have a form input element named foo.bar, you access it inside PHP as the variable $_REQUEST['foo_bar'].

Discussion

Because PHP uses the period as a string concatenation operator, a form variable called animal.height is automatically converted to animal_height, which avoids creating an ambiguity for the parser. While $_REQUEST['animal.height'] lacks these ambiguities, for legacy and consistency reasons, this happens regardless of your register_globals settings.

You usually deal with automatic variable name conversion when you process an image used to submit a form. For instance: you have a street map showing the location of your stores, and you want people to click on one for additional information. Here's an example:

<input type="image" name="locations" src="locations.gif">

When a user clicks on the image, the x and y coordinates are submitted as locations.x and locations.y. So, in PHP, to find where a user clicked, you need to check $_REQUEST['locations_x'] and $_REQUEST['locations_y'].
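The conversion can be observed without a browser, because parse_str( ) applies the same renaming to a query string. The coordinates below are made-up sample values.

```php
<?php
// What a browser might send after a click on the image input.
$query = 'locations.x=120&locations.y=45';

// parse_str() converts the periods in the names to underscores.
parse_str($query, $fields);

echo 'x = ', $fields['locations_x'], ', y = ', $fields['locations_y'], "\n";
```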

It's possible, through a series of manipulations, to create a variable inside PHP with a period:
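The listing that followed is missing. One manipulation that is known to work is the curly-brace variable syntax, sketched here inside a hypothetical function; the original example may have taken a different route.

```php
<?php
// Creates and reads a variable whose name contains a period.
function sketch_dotted_variable(): string
{
    // A plain $animal.height would be parsed as $animal followed by concatenation.
    ${'animal.height'} = 'tall';
    return ${'animal.height'};
}

echo sketch_dotted_variable(), "\n";
```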

Discussion

By placing [ ] after the variable name, you tell PHP to treat it as an array instead of a scalar. When it sees another value assigned to that variable, PHP auto-expands the size of the array and places the new value at the end. If the first three boxes in the Solution were checked, it's as if you'd written this code at the top of the script:
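The Solution's form and the code that followed are not reproduced in this text, so the sketch below uses a hypothetical checkbox group named boroughs[ ] with made-up values. It shows the equivalent assignments and confirms, with parse_str( ), that a query string using the [ ] suffix produces the same array.

```php
<?php
// Equivalent hand-written code: each checked box appends one element.
$boroughs = [];
$boroughs[] = 'bronx';
$boroughs[] = 'brooklyn';
$boroughs[] = 'manhattan';

// What the submitted data looks like, and how PHP expands it.
parse_str('boroughs[]=bronx&boroughs[]=brooklyn&boroughs[]=manhattan', $parsed);

var_dump($parsed['boroughs'] === $boroughs);
```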

Placing a [ ] after a variable's name can cause problems in JavaScript when you try to address your elements. Instead of addressing the element by its name, use the numerical ID. You can also place the element name inside single quotes. Another way is to assign the element an ID, perhaps the name without the [ ], and use that ID instead. Given:

Discussion

In the Solution, we set the value for each date as its Unix timestamp representation because we find this easier to handle inside our programs. Of course, you can use any format you find most useful and appropriate.

Don't be tempted to eliminate the calls to mktime( ); dates and times aren't as consistent as you'd hope. Depending on what you're doing, you might not get the results you want. For example:

This script should print out the month, day, and year for a seven-day period starting October 24, 2002. However, it doesn't work as expected.

Why are there two "Sun, October 27, 2002"s? The answer: daylight saving time. It's not true that the number of seconds in a day stays constant; in fact, it's almost guaranteed to change. Worst of all, if you're not near either of the change-over dates, you're liable to miss this bug during testing.
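The effect can be reproduced with the sketch below. The original script is missing from this text, so the function names are hypothetical, and the America/New_York time zone is an assumption chosen because U.S. daylight saving time ended on October 27, 2002.

```php
<?php
// Assumed time zone; the duplicate day appears only where DST ends in this week.
date_default_timezone_set('America/New_York');

// Naive approach: add a fixed 86,400 seconds per day.
function sketch_days_by_seconds(int $days): array
{
    $start = mktime(0, 0, 0, 10, 24, 2002);
    $labels = [];
    for ($i = 0; $i < $days; $i++) {
        $labels[] = date('D, F j, Y', $start + 86400 * $i);
    }
    return $labels;
}

// Safer approach: let mktime() work out each calendar day.
function sketch_days_by_mktime(int $days): array
{
    $labels = [];
    for ($i = 0; $i < $days; $i++) {
        $labels[] = date('D, F j, Y', mktime(0, 0, 0, 10, 24 + $i, 2002));
    }
    return $labels;
}

print_r(sketch_days_by_seconds(7));
print_r(sketch_days_by_mktime(7));
```

October 27 lasts 25 hours in this time zone, so the fifth timestamp in the naive version lands at 11 P.M. on the same Sunday instead of midnight on Monday.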