Using mod_rewrite to control access

This document supplements the mod_rewrite
reference documentation. It describes how you can use mod_rewrite
to control access to various resources, and other related techniques.
This includes many examples of common uses of mod_rewrite,
including detailed descriptions of how each works.

Note that many of these examples won't work unchanged in your
particular server configuration, so it's important that you understand
them, rather than merely cutting and pasting the examples into your
configuration.


When not to use mod_rewrite

This document supplements the mod_rewrite
reference documentation. It describes perhaps one of the most
important concepts about mod_rewrite - namely, when to avoid using it.

mod_rewrite should be considered a last resort, when other
alternatives are found wanting. Using it when there are simpler
alternatives leads to configurations which are confusing, fragile, and
hard to maintain. Understanding what other alternatives are available
is a very important step towards mod_rewrite mastery.

Note that many of these examples won't work unchanged in your
particular server configuration, so it's important that you understand
them, rather than merely cutting and pasting the examples into your
configuration.


Redirecting and Remapping with mod_rewrite

This document supplements the mod_rewrite
reference documentation. It describes how you can use mod_rewrite
to redirect and remap requests. This includes many examples of common
uses of mod_rewrite, including detailed descriptions of how each
works.

Note that many of these examples won't work unchanged in your
particular server configuration, so it's important that you understand
them, rather than merely cutting and pasting the examples into your
configuration.



We want to create a homogeneous and consistent URL layout across all
WWW servers on an Intranet web cluster, i.e., all URLs (by definition
server-local and thus server-dependent!) become server-independent!
What we want is to give the WWW namespace a single consistent layout:
no URL should refer to any particular target server. The cluster
itself should connect users automatically to a physical target host as
needed, invisibly.

Solution:


First, the knowledge of the target servers comes from (distributed)
external maps which contain information on where our users, groups,
and entities reside. They have the form:

user1  server_of_user1
user2  server_of_user2
:      :

We put them into files map.xxx-to-host. Second, we need to instruct
all servers to redirect URLs of these forms to the corresponding
physical host, since a URL path need not be valid on every server. The
following ruleset does this for us with the help of the map files
(assuming that server0 is a default server which will be used if a
user has no entry in the map):
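
A sketch of such a ruleset; the map file paths are illustrative, and
the fallback to server0 uses the RewriteMap default-value syntax:

```apache
RewriteEngine on

# external maps resolving user/group/entity names to physical hosts
RewriteMap  user-to-host    txt:/path/to/map.user-to-host
RewriteMap  group-to-host   txt:/path/to/map.group-to-host
RewriteMap  entity-to-host  txt:/path/to/map.entity-to-host

# redirect to the physical server; fall back to server0 when
# the key has no entry in the map
RewriteRule ^/u/([^/]+)/?(.*)  http://${user-to-host:$1|server0}/u/$1/$2
RewriteRule ^/g/([^/]+)/?(.*)  http://${group-to-host:$1|server0}/g/$1/$2
RewriteRule ^/e/([^/]+)/?(.*)  http://${entity-to-host:$1|server0}/e/$1/$2
```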

Some sites with thousands of users use a structured homedir layout,
i.e. each homedir is in a subdirectory which begins (for instance)
with the first character of the username. So, /~foo/anypath is
/home/f/foo/.www/anypath while /~bar/anypath is
/home/b/bar/.www/anypath.

Solution:


We use the following ruleset to expand the tilde URLs into the above
layout.
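
A minimal sketch of that ruleset, assuming the /home layout described
above: the inner group captures the first character of the username,
the outer group the whole name.

```apache
RewriteEngine on

# /~foo/anypath  ->  /home/f/foo/.www/anypath
# $2 = first character of the username, $1 = full username
RewriteRule  ^/~(([a-z])[a-z0-9]+)(.*)  /home/$2/$1/.www$3
```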

This really is a hardcore example: a killer application which heavily
uses per-directory RewriteRules to get a smooth look and feel on the
Web while its data structure is never touched or adjusted. Background:
net.sw is my archive of freely available Unix software packages, which
I started to collect in 1992. It is both my hobby and job to do this,
because while I'm studying computer science I have also worked for
many years as a system and network administrator in my spare time.
Every week I need some sort of software, so I created a deep hierarchy
of directories where I stored the packages.

In July 1996 I decided to make this archive public to the world via a
nice Web interface. "Nice" means that I wanted to offer an interface
where you can browse directly through the archive hierarchy. And
"nice" means that I didn't want to change anything inside this
hierarchy - not even by putting some CGI scripts at the top of it.
Why? Because the above structure should later be accessible via FTP as
well, and I didn't want any Web or CGI stuff mixed in there.

Solution:


The solution has two parts: The first is a set of CGI scripts which
create all the pages at all directory levels on-the-fly. I put them
under /e/netsw/.www/.

The DATA/ subdirectory holds the above directory structure, i.e. the
real net.sw stuff, and gets automatically updated via rdist from time
to time. The second part of the problem remains: how to link these two
structures together into one smooth-looking URL tree? We want to hide
the DATA/ directory from the user while running the appropriate CGI
scripts for the various URLs. Here is the solution: first I put the
following into the per-directory configuration file in the
DocumentRoot of the server to rewrite the public URL path /net.sw/ to
the internal path /e/netsw:
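
A sketch of that per-directory snippet (per-directory context,
relative to the DocumentRoot):

```apache
RewriteEngine on
RewriteBase   /

#  add the trailing slash when it is missing
RewriteRule   ^net\.sw$       net.sw/        [R]
#  map the public prefix onto the internal tree
RewriteRule   ^net\.sw/(.*)$  e/netsw/$1
```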

The first rule is for requests which miss the trailing slash! The
second rule does the real thing. And then comes the killer
configuration which stays in the per-directory config file
/e/netsw/.www/.wwwacl:
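
A condensed sketch of what such a .wwwacl could contain; the script
names (netsw-home.cgi, netsw-lsdir.cgi) are illustrative stand-ins for
the on-the-fly page generators described above:

```apache
Options       ExecCGI FollowSymLinks Includes MultiViews

RewriteEngine on

#  we are reached via the /net.sw/ prefix
RewriteBase   /net.sw/

#  the archive root goes to the home-page script
RewriteRule   ^$                   netsw-home.cgi     [L]
RewriteRule   ^index\.html$        netsw-home.cgi     [L]

#  leave requests for the CGI scripts themselves alone
RewriteRule   ^netsw-[^/]+\.cgi.*  -                  [L]

#  everything else is handed to the directory-listing script,
#  which resolves the path under DATA/
RewriteRule   ^(.*)$               netsw-lsdir.cgi/$1
```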

A typical FAQ about URL rewriting is how to redirect failing requests
on webserver A to webserver B. Usually this is done via ErrorDocument
CGI scripts in Perl, but there is also a mod_rewrite solution. But
note that this performs more poorly than using an ErrorDocument CGI
script!

Solution:


The first solution has the best performance but less flexibility, and
is less safe:
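
A sketch: rewrite every request whose target does not exist as a local
file over to server B (the hostname is a placeholder). It is less safe
because the filesystem check only makes sense for URLs that map
directly onto files:

```apache
RewriteEngine  on
# if the requested file does not exist on webserver A...
RewriteCond    %{REQUEST_FILENAME}  !-f
# ...send the request to webserver B
RewriteRule    ^/(.*)$  http://webserverB.example.com/$1  [R]
```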

This uses the URL look-ahead feature of mod_rewrite. The result is
that this will work for all types of URLs and is safe. But it does
have a performance impact on the web server, because for every request
there is one more internal subrequest. So, if your web server runs on
a powerful CPU, use this one. If it is a slow machine, use the first
approach, or better, an ErrorDocument CGI script.
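
The look-ahead variant can be sketched with the -U test, which runs an
internal subrequest to check whether the URL would be served
successfully by the local server (hostname again a placeholder):

```apache
RewriteEngine  on
# -U performs an internal subrequest to test URL validity
RewriteCond    %{REQUEST_URI}  !-U
RewriteRule    ^/(.*)$  http://webserverB.example.com/$1  [R]
```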

Do you know the great CPAN (Comprehensive Perl Archive Network) at
http://www.perl.com/CPAN? CPAN automatically redirects browsers to one
of many FTP servers around the world (generally one near the
requesting client); each server carries a full CPAN mirror. This is
effectively an FTP access multiplexing service. CPAN runs via CGI
scripts, but how could a similar approach be implemented via
mod_rewrite?

Solution:


First we notice that as of version 3.0.0, mod_rewrite can also use the
"ftp:" scheme on redirects. And second, the location approximation can
be done by a RewriteMap over the top-level domain of the client. With
a tricky chained ruleset we can use this top-level domain as a key to
our multiplexing map.
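
A sketch of such a chained ruleset. The map file path, the /CxAN/ URL
prefix, and the fallback server are illustrative; the map would hold
entries like "de ftp://ftp.cxan.de/CxAN/":

```apache
RewriteEngine on
RewriteMap    multiplex                txt:/path/to/map.mirrors

# first prepend the client's hostname to the path, then chain [C]
RewriteRule   ^/CxAN/(.*)              %{REMOTE_HOST}::$1                 [C]
# extract the top-level domain, look up the nearest mirror,
# and fall back to a default FTP server if the TLD is unknown
RewriteRule   ^.+\.([a-zA-Z]+)::(.*)$  ${multiplex:$1|ftp.default.dom}$2  [R,L]
```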

At least for important top-level pages it is sometimes necessary to
provide browser-dependent content, i.e., one has to provide one
version for current browsers, a different version for Lynx and
text-mode browsers, and another for other browsers.

Solution:


We cannot use content negotiation because the browsers do not provide
their type in that form. Instead we have to act on the HTTP header
"User-Agent". The following config does the following: If the HTTP
header "User-Agent" begins with "Mozilla/3", the page foo.html is
rewritten to foo.NS.html and the rewriting stops. If the browser is
"Lynx" or "Mozilla" of version 1 or 2, the URL becomes foo.20.html.
All other browsers receive page foo.32.html. This is done with the
following ruleset:
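
A sketch of that ruleset (the foo.*.html filenames follow the
description above):

```apache
# Mozilla/3 gets the Netscape variant; [L] stops further rewriting
RewriteCond %{HTTP_USER_AGENT}  ^Mozilla/3.*
RewriteRule ^foo\.html$         foo.NS.html   [L]

# Lynx, or Mozilla version 1 or 2, get the reduced variant
RewriteCond %{HTTP_USER_AGENT}  ^Lynx/.*  [OR]
RewriteCond %{HTTP_USER_AGENT}  ^Mozilla/[12].*
RewriteRule ^foo\.html$         foo.20.html   [L]

# everyone else gets the default variant
RewriteRule ^foo\.html$         foo.32.html   [L]
```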

Assume there are nice web pages on remote hosts we want to bring into
our namespace. For FTP servers we would use the mirror program which
actually maintains an explicit up-to-date copy of the remote data on
the local machine. For a web server we could use the program webcopy
which runs via HTTP. But both techniques have a major drawback: the
local copy is always only as up-to-date as the last time we ran the
program. It would be much better if the mirror was not a static one we
have to establish explicitly. Instead we want a dynamic mirror with
data which gets updated automatically as needed on the remote host(s).

Solution:


To provide this feature we map the remote web page or even the
complete remote web area to our namespace by the use of the Proxy
Throughput feature (flag [P]):
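
For example (the remote hostnames and paths are illustrative;
mod_proxy must be available for [P] to work):

```apache
RewriteEngine  on
# mirror a single remote area into our namespace via the proxy module
RewriteRule    ^/hotsheet/(.*)$   http://www.tstimpreso.com/hotsheet/$1     [P]
# or map one local page onto a remote one
RewriteRule    ^/usa-news\.html$  http://www.quux-corp.dom/news/index.html  [P]
```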

This is a tricky way of virtually running a corporate (external)
Internet web server (www.quux-corp.dom), while actually keeping and
maintaining its data on an (internal) Intranet web server
(www2.quux-corp.dom) which is protected by a firewall. The trick is
that the external web server retrieves the requested data on-the-fly
from the internal one.

Solution:


First, we must make sure that our firewall still protects the internal
web server and only the external web server is allowed to retrieve
data from it. On a packet-filtering firewall, for instance, we could
configure a firewall ruleset like the following:
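
In pseudo-syntax (adapt this to your actual packet filter): allow only
the external server to reach the internal one on port 80, and deny
everyone else.

```
ALLOW  Host www.quux-corp.dom  Port >1024  -->  Host www2.quux-corp.dom  Port 80
DENY   Host *                  Port *      -->  Host www2.quux-corp.dom  Port 80
```

On the external server itself, the requested pages can then be fetched
on-the-fly through the proxy throughput feature, e.g. a rule of the
form "RewriteRule ^/~user/(.*)$ http://www2.quux-corp.dom/~user/$1 [P]".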

Suppose we want to load balance the traffic to www.example.com over
www[0-5].example.com (a total of 6 servers). How can this be done?

Solution:


There are many possible solutions for this problem. We will first
discuss a common DNS-based method, and then one based on mod_rewrite:

DNS Round-Robin

The simplest method for load-balancing is to use DNS round-robin.
Here you just configure www[0-5].example.com as usual in your DNS
with A (address) records, e.g.,

www0   IN  A  1.2.3.1
www1   IN  A  1.2.3.2
www2   IN  A  1.2.3.3
www3   IN  A  1.2.3.4
www4   IN  A  1.2.3.5
www5   IN  A  1.2.3.6


Then you additionally add the following entries:

www    IN  A  1.2.3.1
www    IN  A  1.2.3.2
www    IN  A  1.2.3.3
www    IN  A  1.2.3.4
www    IN  A  1.2.3.5
www    IN  A  1.2.3.6


Now when www.example.com gets resolved, BIND gives out www0-www5 - but
in a permuted (rotated) order every time. This way the clients are
spread over the various servers. But notice that this is not a perfect
load balancing scheme, because DNS resolutions are cached by clients
and other nameservers, so once a client has resolved www.example.com
to a particular wwwN.example.com, all its subsequent requests will
continue to go to the same IP (and thus a single server), rather than
being distributed across the other available servers. But the overall
result is okay because the requests are collectively spread over the
various web servers.


DNS Load-Balancing

A sophisticated DNS-based method for load-balancing is to use the
program lbnamed, which can be found at
http://www.stanford.edu/~riepel/lbnamed/. It is a Perl 5 program
which, in conjunction with auxiliary tools, provides real
load-balancing via DNS.


Proxy Throughput Round-Robin

In this variant we use mod_rewrite and its proxy throughput feature.
First we dedicate www0.example.com to be actually www.example.com by
using a single

www    IN  CNAME  www0.example.com.

entry in the DNS. Then we convert www0.example.com to a proxy-only
server, i.e., we configure this machine so all arriving URLs are
simply passed through its internal proxy to one of the 5 other servers
(www1-www5). To accomplish this we first establish a ruleset which
contacts a load balancing script lb.pl for all URLs.
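
A sketch of that ruleset; the script path is illustrative. An external
RewriteMap program like lb.pl must read one lookup key per line from
stdin and print one answer per line to stdout, with output buffering
disabled:

```apache
RewriteEngine on
# lb.pl receives the URL path on stdin and prints the target URL,
# cycling through http://www1.example.com/ ... http://www5.example.com/
RewriteMap    lb      prg:/path/to/lb.pl
RewriteRule   ^/(.+)$ ${lb:$1}           [P,L]
```

The lb.pl program itself can be a tiny unbuffered round-robin loop
that, for each input path, prints "http://wwwN.example.com/path" while
rotating N from 1 to 5.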

A final note: why is this useful? It would seem that www0.example.com
is still overloaded. The answer is yes, it is overloaded, but with
plain proxy throughput requests only! All SSI, CGI, ePerl, etc.
processing is handled on the other machines. For a complicated site,
this may work well. The biggest risk here is that www0 is now a single
point of failure: if it crashes, the other servers are inaccessible.

+ Dedicated Load Balancers

There are more sophisticated solutions, as well. Cisco, F5, and
several other companies sell hardware load balancers (typically used
in pairs for redundancy), which offer sophisticated load balancing and
auto-failover features. There are software packages which offer
similar features on commodity hardware, as well. If you have enough
money or need, check these out. The lb-l mailing list is a good place
to research.

On the net there are many nifty CGI programs. But their usage is
usually boring, so a lot of webmasters don't use them. Even Apache's
Action handler feature for MIME-types is only appropriate when the CGI
programs don't need special URLs (actually PATH_INFO and QUERY_STRING)
as their input. First, let us configure a new file type with extension
.scgi (for secure CGI) which will be processed by the popular cgiwrap
program. The problem here is that, for instance, if we use a
Homogeneous URL Layout (see above), a file inside the user homedirs
might have a URL like /u/user/foo/bar.scgi, but cgiwrap needs URLs in
the form /~user/foo/bar.scgi/. The following rule solves the problem:
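
A sketch of that rule; the /internal/cgi/user/cgiwrap location is an
assumption for illustration:

```apache
# rewrite homedir-layout .scgi URLs into the ~user form cgiwrap
# expects; [NS] skips subrequests, T= forces the CGI handler
RewriteRule ^/[uge]/([^/]+)/\.www/(.+)\.scgi(.*) \
            /internal/cgi/user/cgiwrap/~$1/$2.scgi$3  [NS,T=application/x-http-cgi]
```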

Or assume we have some more nifty programs: wwwlog (which displays the
access.log for a URL subtree) and wwwidx (which runs Glimpse on a URL
subtree). We have to provide the URL area to these programs so they
know which area they are really working with. But usually this is
complicated, because they may still be requested by the alternate URL
form, i.e., typically we would run the wwwidx program from within
/u/user/foo/ via a hyperlink to

/internal/cgi/user/wwwidx?i=/u/user/foo/

which is ugly, because we have to hard-code both the location of the
area and the location of the CGI inside the hyperlink. When we have to
reorganize, we spend a lot of time changing the various hyperlinks.

Solution:


The solution here is to provide a special new URL format which
automatically leads to the proper CGI invocation. We configure the
following:
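
A sketch, assuming the wwwidx and wwwlog programs live under
/internal/cgi/user/:

```apache
#  appending /* to an area URL runs the search (wwwidx) program on it
RewriteRule  ^/([uge])/([^/]+)(/?.*)/\*   /internal/cgi/user/wwwidx?i=/$1/$2$3/
#  appending :log shows the access.log excerpt for that subtree
RewriteRule  ^/([uge])/([^/]+)(/?.*):log  /internal/cgi/user/wwwlog?f=/$1/$2$3
```

Now the hyperlinks only need to carry the area itself, not the CGI
location, so reorganizing costs nothing.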

Here comes a really esoteric feature: Dynamically generated but
statically served pages, i.e., pages should be delivered as pure
static pages (read from the filesystem and just passed through), but
they have to be generated dynamically by the web server if missing.
This way you can have CGI-generated pages which are statically served
unless an admin (or a cron job) removes the static contents. Then the
content gets refreshed.
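
This can be sketched with the -s (non-zero size) file test:

```apache
# serve page.html normally; but when it is missing or empty,
# run page.cgi instead, which regenerates page.html as a side effect
RewriteCond %{REQUEST_FILENAME}  !-s
RewriteRule ^page\.html$         page.cgi  [T=application/x-httpd-cgi,L]
```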

Here a request for page.html leads to an internal run of a
corresponding page.cgi if page.html is missing or has zero size. The
trick here is that page.cgi is a CGI script which (in addition to its
STDOUT) writes its output to the file page.html. Once it has
completed, the server sends out page.html. When the webmaster wants to
force a refresh of the contents, he just removes page.html (typically
from cron).

Wouldn't it be nice, while creating a complex web page, if the web
browser would automatically refresh the page every time we save a new
version from within our editor? Impossible?

Solution:


No! We just combine the MIME multipart feature, the web server NPH
feature, and the URL manipulation power of mod_rewrite. First, we
establish a new URL feature: adding just :refresh to any URL causes
the 'page' to be refreshed every time it is updated on the filesystem.
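
The URL-manipulation part could look like this; the nph-refresh script
(which would do the actual multipart/NPH streaming) is assumed to
exist under /internal/cgi/apache/:

```apache
# strip the :refresh suffix and hand the original path to the
# NPH refresh script
RewriteRule  ^(/[uge]/[^/]+/?.*):refresh  /internal/cgi/apache/nph-refresh?f=$1
```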

##
##  hosts.deny
##
##  ATTENTION! This is a map, not a list, even when we treat it as
##             such. mod_rewrite parses it for key/value pairs, so at
##             least a dummy value "-" must be present for each entry.
##

193.102.180.41 -
bsdti1.sdm.de  -
192.76.162.40  -
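
A ruleset that consults such a map could look like this (the file path
is illustrative): the request is forbidden with [F] whenever the
client's hostname or address has an entry in the map.

```apache
RewriteEngine on
RewriteMap   hosts-deny  txt:/path/to/hosts.deny
RewriteCond  ${hosts-deny:%{REMOTE_HOST}|NOT-FOUND}  !=NOT-FOUND  [OR]
RewriteCond  ${hosts-deny:%{REMOTE_ADDR}|NOT-FOUND}  !=NOT-FOUND
RewriteRule  ^/.*  -  [F]
```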

How can we forbid a certain host, or even a user of a special host,
from using the Apache proxy?

Solution:


We first have to make sure mod_rewrite is below(!) mod_proxy in the
Configuration file when compiling the Apache web server. This way it
gets called before mod_proxy. Then we configure the following for a
host-dependent deny...
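
...which can be sketched like this (the badhost and badguy names are
placeholders); the negated pattern forbids proxy requests to anything
outside our own domain:

```apache
# deny the proxy to one particular host
RewriteCond %{REMOTE_HOST}                  ^badhost\.mydomain\.com$
RewriteRule !^http://[^/.]+\.mydomain\.com.*  -  [F]

# ...or to one particular user (via ident) of a particular host
RewriteCond %{REMOTE_IDENT}@%{REMOTE_HOST}  ^badguy@badhost\.mydomain\.com$
RewriteRule !^http://[^/.]+\.mydomain\.com.*  -  [F]
```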

Sometimes very special authentication is needed, for instance
authentication which checks for a set of explicitly configured users.
Only these should receive access, and without explicit prompting
(which would occur when using Basic Auth via mod_auth_basic).

Solution:


We use a list of rewrite conditions to exclude all except our friends:
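
For example (friend/client/area names are placeholders; note this
relies on ident lookups, which are easily spoofed and should not be
considered strong authentication):

```apache
RewriteEngine on
# anyone who is not one of these ident@host pairs is forbidden
RewriteCond %{REMOTE_IDENT}@%{REMOTE_HOST}  !^friend1@client1\.quux-corp\.com$
RewriteCond %{REMOTE_IDENT}@%{REMOTE_HOST}  !^friend2@client2\.quux-corp\.com$
RewriteRule ^/~quux/only-for-friends/       -                              [F]
```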