Transform URLs with mod_rewrite

Much of the magic and convenience of the World Wide Web can be directly attributed to one simple but powerful construct: the Uniform Resource Locator, or URL. The shorthand of the Web, URLs provide a succinct and consistent way to refer to all kinds of online resources.

In the early days of the Internet, URLs were uncomplicated. The bulk of the Web’s content was stored in static pages on single servers, and URLs largely mimicked the hierarchy of the web site’s file system. Nowadays, URLs — and the resources they point to — are much more complex, as most content is stored in databases and is formatted and served on-the-fly. A single URL may disguise a cluster of load-balanced web servers or encode several bits of information in its query string.

While complex URLs work fine in the browser, sharing and bookmarking (and indexing) those URLs can be quite problematic. Email applications often scramble long URLs, and URLs filled with ? and & characters are hard for users to read and remember. Moreover, sites are rarely static. Keeping “legacy” URLs valid remains a challenge for an expanding site.

One solution to all of these URL problems is mod_rewrite, the “Swiss Army knife of URL manipulations,” written by Ralf S. Engelschall. mod_rewrite manipulates URLs quickly and easily, providing solutions to many URL construction problems.

For instance, mod_rewrite can let you simplify your URLs, or, if you ever reorganize your site, can easily transform older links to point to new locations. You can also use mod_rewrite to optimize your page names for search engine crawlers; to protect your images and other media files from unauthorized linking; to hide the actual location of any files on your site; and to redirect users based on their location, their originating network, and even the time of the day.

What makes mod_rewrite so powerful is its rule-based matching engine. In fact, once you discover the power of mod_rewrite, it’ll likely become one of your favorite tools.

The Red Pen is Mightier Than the Sword

mod_rewrite is typically included with Apache (both 1.3.x and 2.0.x), so if Apache is already installed on your system, chances are you just have to flip a switch to turn mod_rewrite on.

If Apache isn’t installed on your workstation, go to http://www.apache.org and download the source. At the time this article was written, the latest version of Apache 2 was 2.0.49, and that version was used for this article. Uncompress the archive, then configure, build, and install the package: run ./configure --enable-so --enable-rewrite --enable-mods-shared=all in the source directory, followed by make and make install.

That series of commands installs Apache in /usr/local/apache2, the default location. If you want to specify a different location, simply append --prefix=/path/to/apache to ./configure.
--enable-so enables Dynamic Shared Object (DSO) support, meaning that modules are loaded as directed by the httpd.conf file. --enable-rewrite turns on mod_rewrite, and --enable-mods-shared=all enables all of the standard modules as shared modules.

Next, go to your Apache installation directory, and open conf/httpd.conf. Make sure to point DocumentRoot to the root of your web site. For instance, for the examples used in this article, DocumentRoot was set to /www. Web pages were located in /www/pages.

While editing httpd.conf, make sure the following line is uncommented, so that mod_rewrite is enabled and loaded.

LoadModule rewrite_module modules/mod_rewrite.so

After installing Apache, you can start the Apache web server with /usr/local/apache2/bin/apachectl start.

Configuring mod_rewrite

mod_rewrite is controlled by a series of directives that transform, or rewrite, URLs. There are two ways to specify directives: you can put them directly and centrally in your httpd.conf file, or you can place them in the per-directory .htaccess file. (If you put directives in .htaccess, make sure to change your Apache configuration file to allow overrides in that directory.)

For example, this entry in httpd.conf sets options for /www.

<Directory "/www">
    Options Indexes Includes MultiViews
    AllowOverride All
    Order allow,deny
    Allow from all
</Directory>

The option AllowOverride All allows a .htaccess file in /www to override these central settings. In this example, that’s where the following rewrite directives will go.
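A minimal sketch of such a .htaccess file, laid out to match the line-by-line description that follows, might look like this. The exact directives are assumptions inferred from that description (the file names pageone.html and pagetwo.html and the flags come from the text), and the explanatory comments sit at the end so the line numbers still line up.

```apache
RewriteEngine on
Options +FollowSymLinks
RewriteBase /
RewriteRule ^pageone\.html$ pagetwo.html [R=301,L]
# Assumed reconstruction: lines one to three enable the engine, allow
# symbolic links, and set the base URL; line four redirects pageone.html.
```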

The first line enables mod_rewrite. The second line tells Apache to follow symbolic links, while the third line explicitly sets the base URL for rewrites.

Line four is the actual rewrite rule. The RewriteRule should look oddly familiar: it uses regular expressions to match URLs. This one looks for any requests matching pageone.html exactly, because the ^ and $ match the beginning and end of the string, respectively. If the rule finds a match, the URL is rewritten to pagetwo.html.

The [R=301,L] at the end of the rule is a list of flags. Flags are optional, but if you use them, enclose them in square brackets ([ ]) and separate them with commas. The R flag, also written as redirect, forces an external redirect; the =301 sets the HTTP status code sent with the redirect (301, Moved Permanently) and is optional. If you omit it, Apache sends a 302 (Moved Temporarily). The L flag, also written as last, tells Apache to stop processing any more rules. As you’ll see, the L flag plays an important role as mod_rewrite rules become complex.

Taking mod_rewrite a step further, RewriteCond adds conditional statements, allowing for greater control and customization. A RewriteCond applies to the RewriteRule that follows it. The format for RewriteCond is:

RewriteCond Something_to_test Condition

A great use of RewriteCond is to test for the type of web browser a visitor is using. Several HTTP request variables can be accessed from within mod_rewrite, including HTTP_USER_AGENT, which is written as %{HTTP_USER_AGENT} inside a condition. Using RewriteCond and HTTP_USER_AGENT, you can offer optimized, alternative versions of your site for different types of browsers. Listing One shows one simple implementation.
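A minimal sketch of what such a listing could look like in the .htaccess file, laid out so that its line numbers match the walkthrough that follows. The user-agent patterns (Gecko and MSIE) and the file name index_ie.html are assumptions made for illustration; only index.html, index_mozilla.html, and index_text.html come from the text. Comments are kept at the end so the numbering is not disturbed.

```apache
RewriteEngine on
Options +FollowSymLinks
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} Gecko
RewriteRule ^index\.html$ index_mozilla.html [L]
RewriteCond %{HTTP_USER_AGENT} MSIE
RewriteRule ^index\.html$ index_ie.html [L]
RewriteRule ^index\.html$ index_text.html [L]
# Assumptions: "Gecko" stands in for Mozilla variants and "MSIE" for the
# Internet Explorer family; index_ie.html is a hypothetical file name.
# A bare ^Mozilla pattern would also match Internet Explorer, whose
# user-agent string begins with "Mozilla/4.0 (compatible; MSIE ...".
```

Each RewriteCond guards only the RewriteRule directly below it, which is why the final rule, having no condition, acts as the fallback.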

The browser check begins at line four. If the browser is a Mozilla variant, line five is executed, which rewrites the URL index.html to index_mozilla.html. Again, the L flag terminates processing.

If the browser isn’t Mozilla, then processing continues at line six, which tests if the browser is in the Internet Explorer family. If so, line seven rewrites the URL. If not, line eight unconditionally rewrites the URL to the browser-independent index_text.html. Line eight is a catch-all rule.

You can precede a RewriteRule with more than one RewriteCond. Whether you use one or more RewriteCond, the following rewriting rule is used only if its pattern matches the URL and all of the conditions are true.
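As a rough illustration of stacked conditions, the sketch below applies a rule only when both tests succeed. The host name www.example.com and the Gecko pattern are hypothetical values chosen for the example.

```apache
# Both conditions must hold before the rule below is applied.
RewriteCond %{HTTP_USER_AGENT} Gecko
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
# Rewrites index.html only for Gecko browsers visiting www.example.com.
RewriteRule ^index\.html$ index_mozilla.html [L]
```

Conditions are joined with an implicit AND. Adding the [OR] flag to a condition joins it to the next condition with OR instead.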

As mentioned above, several HTTP request variables can be used in RewriteCond. You can also refer to HTTP headers, system environment variables, and many internal Apache settings. See the documentation for your version of Apache for a complete list.

By the way, you can turn on extra logging to help you debug your rewrite directives. Simply edit httpd.conf and set the LogLevel directive to debug or info, depending on your needs; LogLevel controls how much detail is written to the error log. Uncommenting the CustomLog directive may also help. Browse through Apache’s logs/access_log and logs/error_log files to find problems.
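On Apache 1.3.x and 2.0.x, mod_rewrite can also write its own trace through the RewriteLog and RewriteLogLevel directives. A minimal sketch for httpd.conf follows; the log file name and the level of 3 are assumptions chosen for illustration, and these directives belong in the server configuration rather than in .htaccess.

```apache
# Send mod_rewrite's trace to its own file (relative to ServerRoot).
RewriteLog logs/rewrite_log
# 0 disables rewrite logging; higher values are more verbose.
# 3 is an arbitrary middle setting picked for this example.
RewriteLogLevel 3
```

High log levels slow the server down noticeably, so it is worth setting the level back to 0 once the rules behave as expected.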

Rewrite Rules Rule!

One of the truly annoying tasks of a webmaster is quelling unauthorized links to hosted media and image files. However, mod_rewrite makes the task easy. With it, you can prevent users from stealing and using images from your site. Listing Two shows some sample rules.
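A minimal sketch of such rules, arranged so that the line numbers agree with the explanation that follows. The list of image extensions and the use of the R flag are assumptions; the referrer host and the unauthorized.html target come from the text. Comments are placed at the end to leave the numbering intact.

```apache
RewriteEngine on
Options +FollowSymLinks
RewriteBase /
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.linuxmagazine\.com [NC]
RewriteRule \.(gif|jpe?g|png)$ http://www.linuxmagazine.com/images/unauthorized.html [R,L]
# Assumptions: the extension list (gif, jpg, jpeg, png) and the [R,L]
# flags are illustrative. The leading ! negates each condition pattern.
```

Because the target ends in .html, it does not match the image pattern, so the redirect cannot loop back onto itself.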

(Each directive must occupy a single line in your file, even if lines five and six appear wrapped here.) Listing Two uses the HTTP_REFERER variable (spelled with a single R, as in the HTTP header) to find out who referred the visitor to the site and what to subsequently serve this visitor.

Taking the listing example line by line: Line four tests that HTTP_REFERER isn’t empty. Line five tests that HTTP_REFERER does not begin with http://www.linuxmagazine.com. The [NC] flag stands for “no case” and makes the match case-insensitive. Line six is the rewrite rule, which matches any URL that ends in an image extension. So, if the URL matches the pattern in line six, and if lines four and five are true, the incoming URL is rewritten to point to an unauthorized access message in http://www.linuxmagazine.com/images/unauthorized.html.

If the image URL is referred to by a page within www.linuxmagazine.com, the condition in line five won’t be satisfied and the image file will be served to the visitor.

Of course, there are many more tricks and cool things you can do with mod_rewrite. There are many other usage examples within the Apache web site and in Engelschall’s Apache Rewrite Handbook and Guide, located at http://www.engelschall.com/pw/apache/rewriteguide.

Harry Tsoi is a systems engineer and an applications integration specialist. His company, IPsoft, Inc. (www.ip-soft.net), provides managed hosting and management services. You can reach Harry at harry@ip-soft.net.
