Apache Web Server Forum

I'm involved in planning the migration of a site with a very large number of URLs.

The existing draft rewrite rules are a complex mix of individual redirects for key pages, directory changes, and pattern-matching for large numbers of dynamic pages.

There are >150 rules in place.

Happily, all old URLs are redirecting to the correct new URLs. Unhappily, they are doing so by way of serving multiple 301 redirects. For example, one to change a directory from old to new, another to replace a query string parameter, another to append a trailing slash to non-file URLs.

(Of course, this is all on dev at present, not live!)

I'm concerned that getting to the correct URL via multiple 301s will cause unnecessary demand on the server, and I know for a fact that Google will not like the taste of it very much.

The complexity and range of existing URLs means that the most logical way of dealing with them is by applying a series of simple rules, rather than writing a complicated rule for each URL format.

I'd like to be able to apply all of my rules and only return a single 301 redirect when done. Will mod_rewrite allow me to do this?

Multiple chained redirects should be avoided if you want to pass the "link juice."

There are two ways to do this.

The first approach is to write each rule so that it accepts all permutations of input URL and redirects to the canonical URL all in one go. This is useful for a limited set of input URL permutations with a limited number of canonical URLs.
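As a rough sketch of this first approach (an illustration only, reusing the www.example.com placeholder and assuming just two kinds of fix-up), each rule below sends the client straight to the fully canonical URL and stops with [L], so no request should receive more than one redirect:

```apache
# Assumes "RewriteEngine On" is already set; written for .htaccess

# Missing trailing slash (any hostname): fix slash and hostname together
RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1/ [R=301,L]

# Non-canonical hostname only (the missing-slash case was handled above)
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

With many kinds of fix-up, the number of combinations each rule has to anticipate grows quickly, which is why this suits only a limited set of URL permutations.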

The second approach, likely more suited to your situation, is to specify a redirect in each rule, but to hold off invoking it until all fix-ups on the output URL are done. Then a final rule checks to see if any fix-ups were done, and if so, invokes the actual external redirect. To do this, you can set an environment variable in each fix-up rule to indicate that the external redirect should be invoked. For example:

# Redirect to add trailing slash if no trailing slash and no period in final URL-path-part of requested URL
RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1/ [R=301,E=doRed:Yes]
#
# Redirect to canonical hostname if non-canonical hostname requested
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,E=doRed:Yes]
#
# Invoke an external redirect if any of the above fix-ups were applied
RewriteCond %{ENV:doRed} ^Yes$
RewriteRule ^(.*)$ - [R=301,L]

An important guideline that you should follow to avoid unexpected results and problems: Put all external redirects first, in order from most-specific patterns and conditions (fewest URLs affected) to least-specific patterns (most URLs affected), followed by all internal rewrites, again in order from most- to least-specific. Where patterns and conditions are different and mutually exclusive, order won't matter.

This is illustrated by the fact that in the code above, only URLs which don't end in a slash and don't contain a period in the final URL-path-part (indicating that a filetype is not present) are redirected to add a slash. (Side note: Doing this based on the absence of a filetype is much more efficient than checking the disk for "file exists" -- possibly thousands of times more efficient.)
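For comparison, here is a sketch of the two ways to decide whether a slash should be added. The second form is a common filesystem-based alternative and is my own illustration, not part of the code posted above. It is shown only to make the extra disk check visible.

```apache
# Pattern-only test: decided from the URL string alone, no disk access
RewriteRule ^(([^/]+/)*[^/.]*[^/])$ http://www.example.com/$1/ [R=301,L]

# Filesystem test (an alternative; use one form or the other, not both):
# "!-f" makes the server check the disk for a file on every matching request
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ http://www.example.com/$1/ [R=301,L]
```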

This rule is then followed by the domain canonicalization rule, which will redirect *all* URLs if any non-canonical hostname is requested.

If any internal rewrites are present, they must follow all external redirects. Otherwise, a redirect will expose the internally-rewritten filepath as a URL -- almost always an unwanted result.
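A minimal sketch of that ordering follows. The paths old-page.html, old-dir, new-dir and the index.php script with its "item" parameter are hypothetical and stand in for whatever your site uses.

```apache
# External redirects first, from most specific to least specific
RewriteRule ^old-page\.html$ http://www.example.com/new-page/ [R=301,L]
RewriteRule ^old-dir/(.*)$ http://www.example.com/new-dir/$1 [R=301,L]
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# Internal rewrites last
RewriteRule ^new-dir/([^/]+)/$ /index.php?item=$1 [L]
```

If the internal rewrite were moved above the hostname rule, a request to a non-canonical hostname could be redirected to the index.php URL instead of the friendly one.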

The code posted here is for use in .htaccess, as the majority of our readers are on shared name-based servers with no access to their server config files. For use at the server config level, add a leading slash to the regular-expression patterns in the RewriteRules (only) -- e.g. "^(.*)$" becomes "^/(.*)$"
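To make the difference concrete, here is the hostname canonicalization rule in both forms. This is only a sketch; [L] is used here in place of the environment variable flag so that each form stands alone.

```apache
# .htaccess (per-directory) form: pattern has no leading slash
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# Server config form (httpd.conf or <VirtualHost>): leading slash added
# to the RewriteRule pattern only; the RewriteCond is unchanged
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]
```

Use whichever form matches where the code lives, not both.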

A bug exists in all versions of Apache mod_rewrite which can cause errors when multiple sequential rewrites are done. The result is that part of the URL gets re-injected into the substitution path, and you see "duplication" of parts of the URL-path. It's generally only a problem with internal rewrites, though. If this does occur, be sure to post back here; there is a solution, but it's ugly, inefficient, and unnecessarily complicated unless needed.

Jim, this is absolutely brilliant, thanks very much for taking the time to set things out so clearly. I thought that there must be something like this available in mod_rewrite, but I hadn't been able to find it elsewhere despite a lot of searching.

I have had a good go at implementing this. I noticed what I thought were a couple of typos in your code, and I've done my best to correct them. In the spirit of the forum I thought I would share my experiences with everyone. As you'll see, I think I've very nearly got it!

I wanted to use your examples, so I tried adding this code to my .htaccess file.

I replaced www.example.com with the domain of my test site. And at first, I removed the environment variable as I wanted to be sure that this code was correct. When I tried it, anything without a trailing slash or a period at the end of the URL hit a redirect loop as it would effectively redirect to itself.

However, I managed to fix this by adding a trailing slash to the redirect, so that the rule read as follows:

Again, this caused a problem - a 500 error across the site! This time the error was a little easier to spot - the second line should have begun with RewriteRule rather than RewriteCond. Again I also removed the environment variable for testing purposes, so that my code was:

It did return a single 301 redirect! However, it was to (the equivalent of)

http://www.example.com/http://www.example.com/folder/

which of course then went on to return a 403 forbidden (with a note that there had been a 404 also).

I feel as though I'm very close now to getting to implement the solution - can you give me a tip to get over this last hurdle? I'm sure it's something to do with looping through the rules and the way that I have ineptly amended your original code.

Yes, it's that d@#n Apache bug again. So we do it the hard way:

# Set up initial environment variables
RewriteRule ^(.*)$ - [E=myHost:%{HTTP_HOST},E=myURLpath:$1]
#
# Add trailing slash if no trailing slash and no period in final URL-path-part of requested URL
RewriteCond %{ENV:myURLpath} ^(([^/]+/)*[^/.]*[^/])$
RewriteRule ^.*$ - [E=myURLpath:%1/,E=doRed:Yes]
#
# Set canonical hostname if non-canonical hostname requested
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^.*$ - [E=myHost:www.example.com,E=doRed:Yes]
#
# Invoke an external redirect if any of the above fix-ups were applied
RewriteCond %{ENV:doRed} ^Yes$
RewriteRule ^.*$ http://%{ENV:myHost}/%{ENV:myURLpath} [R=301,L]

Here we use environment variables to avoid any rewriting of the URL until we are ready to do the actual redirect. Note that the "-" in each rule means "leave the current requested URL alone." In this way, we avoid the path re-injection problem.
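Other fix-ups can be added in the same style. As an untested sketch, a directory rename (one of the cases mentioned at the top of the thread) might look like this. The names old-dir and new-dir are hypothetical, and the rule would sit after the rule that sets up the initial environment variables and before the final redirect rule.

```apache
# Rename directory "old-dir" to "new-dir" in the saved URL-path only;
# the requested URL itself is left alone ("-")
RewriteCond %{ENV:myURLpath} ^old-dir/(.*)$
RewriteRule ^.*$ - [E=myURLpath:new-dir/%1,E=doRed:Yes]
```

Because each fix-up reads and writes only the myURLpath variable, later rules see the already-corrected path and the final rule still issues a single 301.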

This approach is much simpler, and preferable to doing everything in environment variables as I demonstrated above. However, you may find that if additional internal rewrites are done (as in the first rule shown here), the Apache bug will get triggered, and you'll find parts of the URL-path duplicated in the output. In that case, you'll need to use the clunky environment variable method I posted above. (I have confirmed through testing that this bug exists on Apache 1.3.x and Apache 2.0 through 2.2.)

That's not the path-info issue, which could also be seen as a feature: without it, certain very special constructs that rely on the generation of request_filename won't work. That behaviour becomes configurable in 2.2.12 (to be released).

The path was rewritten to http://www.example.com/folder/ by the first rule. The pattern of the second rule then matches against http://www.example.com/folder/ + path_info, takes the full match, and prefixes it with http://www.example.com/ again. The result is http://www.example.com/http://www.example.com/folder/ if no path_info was left by the dir_walk.
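On the separate path-info point: assuming the 2.2.12 option referred to is the DPI (discardpath) rewrite flag, which is my reading and not something stated in this thread, a rule can ask mod_rewrite to drop any leftover path_info before later rules see the URL. A minimal, untested sketch with hypothetical paths:

```apache
# Requires Apache 2.2.12 or later
# First internal rewrite: discard any PATH_INFO so it is not re-appended
RewriteRule ^old-dir/(.*)$ new-dir/$1 [DPI]

# A later rule then matches "new-dir/..." without a re-injected path
RewriteRule ^new-dir/([^/]+)/$ /index.php?item=$1 [L]
```

This would not cure the doubled hostname described above, which comes from the second rule matching the already-rewritten URL.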

Yes, it just looked like the nasty mod_rewrite bug [webmasterworld.com] at first glance. As far as I know (it's been over a year since I tested), this bug exists in all Apache versions, and manifests when multiple sequential internal rewrites are done (i.e. no [L] flag).