Since my boss firmly believes that the best websites are redesigned yearly (and all attempts to dissuade him merely steel his resolve), I need to maintain a sizable array of redirected pages. I have basically used this article as my resource:

However, when I check old pages through Google Webmaster Tools, I am still getting about 60 crawl errors. Those URLs are in my array, but when I check, yes indeed, they still throw a 404 error.

I have configured everything exactly as this article outlines. URLs above and below some of these lines work just fine.

Does anyone have any idea what might be going wrong?

ServerStorm
—
2012-05-08T15:26:03Z —
#2

Hi TMacFarlane,

Are you getting 404 errors for your old URLs or your new URLs?

If the answer is your old URLs, then they should get the 404 error; the trick is capturing the 404ed URL, slicing out the important parts, and redirecting to the closest match on the new site. The old URLs should always generate a 404 error; you are counting on this to trigger your custom 404.php page, which will handle the redirection.

So do you have this flow:

url links to non-existent page is clicked.

Apache/IIS serves the custom 404.php page

At the top of the 404.php page you include the redirector code: if it can successfully find a match, it redirects and exits the script before the 'not found' code is reached; otherwise the 'Not found' page content loads.
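A minimal sketch of that flow (file names follow the article's convention as I understand it; adjust to your setup):

```php
<?php
// 404.php -- served by Apache/IIS for any missing URL, e.g. via
// "ErrorDocument 404 /404.php" in .htaccess, or the IIS equivalent.

// redirect.php tries to match the requested URL against the map of
// old => new pages; if it finds one, it 301-redirects and exits,
// so the "not found" content below is never reached.
require __DIR__ . '/redirect.php';
?>
<html>
<body>
<h1>Sorry, that page has moved or no longer exists.</h1>
<p><a href="/">Back to the home page</a></p>
</body>
</html>
```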

Hope this helps,

Steve

TMacFarlane
—
2012-05-08T16:03:53Z —
#3

I am getting the 404 errors from the OLD URLs! While the 404 may be happening in the background, the redirect should kick in so that the user/bot does not experience it (they should actually end up experiencing a 301 Moved Permanently). The 404.php page is set up just as explained (I can provide the code, but I don't think that is necessary, since some URLs work and others do not). We are essentially on the same page about this.

I am afraid the part of your message, "if it successfully can create a match it redirects," may be the crux of the issue here, and I cannot easily pick, out of hundreds of lines, a good example of why certain ones do not match. Be that as it may, I have combed through the array for any misplaced single quotes or commas (thankfully, Dreamweaver highlights when I leave one of these out). But here might be a good example:

Somehow, our TESTING directory ended up being crawled. So I added the following line at the end of the array in redirect.php:

'TESTING' => '/'

This should mean that if I type "www.mysite.com/TESTING" into my address bar, I should naturally go to www.mysite.com/, right? Nope. Why?
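For what it's worth, the lookup I expect to happen is essentially this (a simplified sketch, not my full redirect.php; one thing I notice writing it out is that the script lowercases the request, which may matter since the key 'TESTING' is uppercase):

```php
<?php
// Simplified sketch of the lookup I expect to happen.
$redir = array(
    'TESTING' => '/',
);

// REQUEST_URI arrives as "/TESTING": leading slash, original case.
// If the script lowercases it, an uppercase key can never match.
$old_url = strtolower(trim($_SERVER['REQUEST_URI'], '/')); // "testing"

foreach ($redir as $old => $new) {
    // Normalizing the key the same way avoids the case/slash mismatch.
    if (strtolower(trim($old, '/')) === $old_url) {
        header('Location: ' . $new, true, 301);
        exit;
    }
}
```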

ServerStorm
—
2012-05-08T22:19:44Z —
#4

Hi,

I threw this together.

The URL would normally come from strtolower($_SERVER['REQUEST_URI']);, but I was using a posted form to test out different URLs:

This worked for me... it pulls relative keywords out of the posted URL and then redirects. Again, instead of using the if ($_POST) {} code, you could simply test strtolower($_SERVER['REQUEST_URI']); instead and it would work.
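Roughly, the test harness looked like this (a sketch; the form field name is my own choice):

```php
<?php
// Quick test harness: paste an old URL into the form instead of
// triggering a real 404, then run it through the same matching code.
if ($_POST) {
    $old_url = strtolower(htmlentities($_POST['requested_url']));
} else {
    // Production path: Apache/IIS preserves the original request here.
    $old_url = strtolower($_SERVER['REQUEST_URI']);
}
?>
<form method="post" action="">
    <input type="text" name="requested_url" size="60">
    <input type="submit" value="Test redirect">
</form>
```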

Please feel free to ask any questions. Hope it helps.

Steve

TMacFarlane
—
2012-05-09T19:17:51Z —
#5

Currently, redirect.php assigns strtolower($_SERVER['REQUEST_URI']); to the variable $old_url. It works, just not on every item in the array $redir.

I am sorry, but I fail to follow. How exactly do I test strtolower($_SERVER['REQUEST_URI']);?

This code gets the URL that the user tried but Apache/IIS could not find; Apache/IIS has loaded the 404.php page and passed the original URL request via the $_SERVER['REQUEST_URI'] parameter. If you have done as the article suggested and created a redirect.php, then you would start with $old_url = strtolower($_SERVER['REQUEST_URI']); on the first line.

I then wrote some code that you might like to use as it will automatically assign the best match rather than you having to do it via an associative array. All it needs is a list of valid links on your newest site.

I have modified the code to be less rigid; it no longer needs an exact match. It will do an exact match or a regular-expression search on the text that makes up the path of the URL. So try doing this with the code I provide:

Create the 404.php and have your webserver set non-found urls to this page

Ensure that redirect.php is the first include (the first thing) in the 404 file, and ensure that you set $old_url to strtolower(htmlentities($_POST['requested_url']));

Then create your correct URLs (those that you want old links to redirect to) in the $new_urls array. List them with their full path, including the 'http://' or 'https://', the domain, the directory path, and the file name with extension, like: http://www.mysite.com/contact_us.php

Then make sure all the code I have written is in redirect.php; in the code below I include steps 2 and 3, so you can simply copy this in its entirety into redirect.php.
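In outline, the approach is something like this (a simplified sketch; I've flattened my class into inline code here, and the URLs are placeholders):

```php
<?php
// redirect.php -- include this at the very top of 404.php.
$old_url = strtolower($_SERVER['REQUEST_URI']);

// Step 3: valid pages on the new site, listed with full paths.
$new_urls = array(
    'http://www.mysite.com/index.php',
    'http://www.mysite.com/contact_us.php',
    'http://www.mysite.com/Awards/award-best_distance.html',
);

// Pick the new URL whose path best matches pieces of the old request.
$best = null;
$best_score = 0;
$old_pieces = array_filter(explode('/', $old_url));
foreach ($new_urls as $candidate) {
    $score = 0;
    foreach ($old_pieces as $piece) {
        if (strpos(strtolower($candidate), $piece) !== false) {
            $score++; // this old path piece appears in the candidate
        }
    }
    if ($score > $best_score) {
        $best_score = $score;
        $best = $candidate;
    }
}

if ($best !== null) {
    header('Location: ' . $best, true, 301);
    exit; // never reach the "not found" content in 404.php
}
// Otherwise fall through and let 404.php render normally.
```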

I configured my test Apache server with a custom 404.php page, used a redirector.php file included at the top of it, ran the same code as I show above (only with my domain's URLs), and then tested a bunch of changed domain names; it successfully matched most of the time. When it didn't, it gave me the custom 404.php error page that lets the visitor click a link to the main site. It worked nicely.

Hope this helps.

Steve

ServerStorm
—
2012-05-10T11:19:24Z —
#7

BTW, if you don't want a TEST (or any other) directory indexed by robots, you can add a single include to each file in it.
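For example, a tiny shared snippet along these lines (a sketch; the file name is arbitrary):

```php
<?php
// noindex.php -- include at the top of every page under /TEST/,
// before any output is sent.
header('X-Robots-Tag: noindex, nofollow');
// A <meta name="robots" content="noindex"> tag in <head> works too.
```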

You could also consider a robots.txt file. It is not meant as a security measure; it is merely a 'do not enter' sign, but search robots will respect it. So you could create a robots.txt file at the root of your site and then add:

User-Agent: *
Disallow: /TEST/

Steve

TMacFarlane
—
2012-05-10T14:41:53Z —
#8

Oh!

I think I understand now. What you are saying is to replace the code in my redirect.php file with your code (and just enter the names of my changed pages in the array, and replace the form language ($_POST) with the server call). Then the server will do a search-and-match based on the keywords in the URL. I will give it a try; it might take more than a week.

I do use a robots.txt, and I usually disallow all kinds of things, but this case slipped through the cracks. Don't tell my boss!

Thanks so much for the explicit directions.

TMacFarlane
—
2012-05-10T14:59:37Z —
#9

I guess the part that bothers me the most about this is the vagueness of the pattern matching. It seems like a search engine, where you end up with a bunch of porn sites while looking for something innocuous. For instance:

I've been tagging with comments the items in the array that haven't been redirecting, and some of them deal with awards pages. Previously, our awards pages were prefixed with "awd-", and later they were prefixed with "award-". Now they have been moved to their own folder, called "Awards" (to feed a breadcrumb), and the old URLs in my array that were prefixed "award-" are being redirected with no problem. I worry that this type of misdirection will only be exacerbated if I introduce a keyword-matching system into the game.

As I write this, I begin to wonder if some kind of pattern-matching misdirection isn't to blame for my present redirect.php file breaking. But I do want something exact, if only to give me the feeling that I am micromanaging this better. What do you say?

TMacFarlane
—
2012-05-10T15:41:46Z —
#10

OK--

I've actually had the spare time to try this. Every old URL goes to the first item in my $new_urls array (/About/index.php). That seems the same as just writing a redirect to my homepage into my .htaccess file.

ServerStorm
—
2012-05-10T18:31:16Z —
#11

Hi,

The way that the parsing of the url works in the code I gave you is this:

The path '/foo/bar/file1.php' is then split into parts: 'foo', 'bar', 'file1.php'

The end file then has the extension removed like 'file1.php' becomes 'file1'

These path attributes are stored in an array.

The array is looped through to see if there are exact matches in the path. For each exact match, a hitcount variable is increased by 1. If it is not an exact match, it looks for words inside the path piece; so if the path piece in the old site was 'signin' and in the new site it is 'sign_in', it will generate a hitcount for the new URL that contains 'sign_in'.
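Condensed, the scoring step looks something like this (a sketch of the idea, not the method verbatim):

```php
<?php
// Score one candidate new URL's path pieces against the old path's
// pieces, e.g. '/foo/bar/file1.php' => 'foo', 'bar', 'file1'.
function scorePieces(array $old_pieces, array $new_pieces)
{
    $hitcount = 0;
    foreach ($old_pieces as $old) {
        $old = preg_replace('/\.[a-z0-9]+$/i', '', $old); // drop extension
        foreach ($new_pieces as $new) {
            $new = preg_replace('/\.[a-z0-9]+$/i', '', $new);
            if ($old === $new) {
                $hitcount++; // exact piece match
            } elseif (strpos(str_replace('_', '', $new), $old) !== false) {
                $hitcount++; // e.g. 'signin' found inside 'sign_in'
            }
        }
    }
    return $hitcount; // the highest-scoring new URL wins
}
```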

For examples like the one you gave, where pages used to be prefixed with "awd-" and later "award-", this exact or word-within-word regex will not work. You could build special conditions for edge cases. In the code below I modified the setClosestMatch() method to test for 'awd-', and I successfully had the code redirect http://www.mysite.com/awd-bestd.htm to http://www.mysite.com/award-best_distance.html. Although edge cases mean hard-coding, this approach lets you update page names in the future so they will not need edge cases and will redirect with exact or word-in-word(s) matching. Here is the modified setClosestMatch() method (see the changes in the bottom inner foreach loop):
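For illustration, the extra test inside that inner loop is along these lines (the prefixes are hard-coded and the weighting is my own choice):

```php
<?php
// Special-case a known prefix rename ('awd-' became 'award-') that
// neither an exact match nor a word-within-word test would catch.
if (strpos($old_piece, 'awd-') === 0 && strpos($new_piece, 'award-') === 0) {
    $old_rest = substr($old_piece, 4);                        // after 'awd-'
    $new_rest = str_replace('_', '', substr($new_piece, 6));  // after 'award-'
    // 'bestd' vs 'bestdistance': compare the leading characters.
    if (strpos($new_rest, substr($old_rest, 0, 4)) === 0) {
        $hitcount += 2; // weight edge cases above ordinary fuzzy hits
    }
}
```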

You are right that if none of your pages match it will just redirect to the main page so it would be cleaner/faster to just use a .htaccess redirect.

Regards,

Steve

TMacFarlane
—
2012-05-11T14:23:16Z —
#12

Thanks very much for this solution, Steve, but I am afraid it is not what I am looking for. My redirect.php file is not really broken. I have managed to winnow my crawl errors from 349 down to 19. Of those 19, four are just fine if they hit a 404. That leaves just fifteen that are aggravating me. I just wonder why my array won't catch them; what is it about those dogs that won't hunt? Furthermore, I can't help wondering how many outliers there would be in your method.

I've always shied away from regular expression pattern-matching, which is probably what makes me more of a front-end guy. However, if I find another use for search functions, I will be able to refer to this code. I am also extremely grateful to you for modeling a better method of scripting (private vs. public functions, etc.)

Plus, you gotta know, it is extremely difficult for me to let go of a practice I have been diligently managing for the past year.

In the end, I was just hoping that someone could point out why those select few entries were not redirecting as appointed. I haven't been able to isolate any syntax errors, or any site-level priorities, or anything like that.

Thanks very much for your careful and diligent support. I sincerely hope you don't view it as casting pearls before swine, or that it was in vain.

ServerStorm
—
2012-05-11T17:39:31Z —
#13

Hi TMacFarlane,

Hey no problem, I am glad that in the future you may be able to use this code, but please know that I enjoyed doing this as it let me play with some things I hadn't done before so I did get something valuable out of it.

Maybe you can post the 15 or so links that don't get redirected, and perhaps we can help you with a regex pattern (or patterns) to address them?