Friday, February 19, 2010

It's been a long time since I last posted, and unfortunately I've been unable to churn out a post every week. The month of February has been filled with travel, so I haven't had much time to write.

My report on FOSDEM is up on the YDN blog, so I haven't been completely dormant. I also did some stuff at our internal hack day last week. This post is about one of my hacks.

The idea is quite simple. People land on 404 pages all the time. These are pages that have either gone missing or were never there to begin with; 404 is the HTTP status code for a resource that was not found. Most 404 pages are quite bland, simply stating that the requested resource was not found, and that's it. Back when I worked at NCST, I changed the default 404 page to use a local site search based on the requested URL. I used the Namazu search engine since I was working on it at the time.

This time I decided to do something different. Instead of searching the local site for a missing resource, why not engage the user in trying to find missing kids?

I started by trying to find an API for missingkids.com and ended up finding missingkidsmap.com. This service takes the data from Missing Kids and puts it on a Google map. The cool thing about the service was that it could return data as XML.

Looking through the source code, I found the data URL:

http://www.missingkidsmap.com/read.php?state=CA

The state code is a two-letter code for states in the US and Canada. To get all kids, just pass in ZZ as the state code.

Now I could keep hitting this URL for every 404, but I didn't want to kill their servers, so I decided to pass the URL through YQL and let them cache the data. Of course, now that I was passing it through YQL, I could also do some data transformation and get it out as JSON instead of XML. I ended up with this YQL statement:

SELECT * FROM xml
WHERE url='http://www.missingkidsmap.com/read.php?state=ZZ'
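For illustration, here is a rough Python sketch of how such a statement could be turned into a request URL. The endpoint (query.yahooapis.com/v1/public/yql) and the function name build_yql_url are assumptions made for this example, not something taken from the original code; the YQL console is the authoritative source for the real URL.

```python
from urllib.parse import urlencode

# Assumption: the public YQL REST endpoint. Verify it in the YQL console.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(state="ZZ", fmt="json"):
    """Return a YQL request URL for the missingkidsmap.com feed of one state.

    build_yql_url is a hypothetical helper name used only in this sketch.
    """
    # The data URL from the post, with the two-letter state code filled in.
    data_url = "http://www.missingkidsmap.com/read.php?" + urlencode({"state": state})
    # The YQL statement from the post, wrapped around that data URL.
    query = "SELECT * FROM xml WHERE url='%s'" % data_url
    # urlencode percent-escapes the statement so it survives as one parameter.
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


if __name__ == "__main__":
    print(build_yql_url("CA"))
```

Asking for format=json is what moves the XML-to-JSON conversion over to YQL, so the 404 script only has to decode JSON.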

Pass that statement through the YQL console to get the URL you should use. The JSON I got back looked like this:

http_get is a function I wrote that wraps around curl_multi to fetch a URL and cache it locally. print_404 is the function that prints out the HTML for the 404 page using the $child data object. The object's structure is the same as each of the location elements in the JSON above. The important parts of print_404 are:

The last thing to do is to tell Apache to use this script as your 404 handler. To do that, put the page (I call it 404.php) into your document root, and put this into your Apache config (or in a .htaccess file):
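A minimal version, assuming the script is saved as 404.php directly in the document root as described above, uses Apache's standard ErrorDocument directive:

```apache
# Serve /404.php (a path relative to the document root) for every 404 response.
ErrorDocument 404 /404.php
```

A local path keeps the original 404 status on the response. Pointing ErrorDocument at a full http:// URL instead makes Apache issue a redirect, and the client never sees the 404.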

@chanux: yeah, you could do a geo lookup on the IP. There are many services that can tell you the country. Rasmus has a good API at http://geoip.pidgets.com/ that can be used for this. I'll update my page later today to use it.

This is a really cool idea. I would like to have a drop-in, embeddable <script> widget type thing to add this to my 404 page. I just don't have time to spare to port this code to Django/Python, whereas a simple script will work on any platform.

Anyone care to spend a little time on this? Possibly cleaning up the design to be more aesthetically pleasing and possibly multiple themes/color schemes?

Again, really cool, I like it. Does some good and is cute at the same time.

@eliot: I added the geoip lookup, but it was too slow so I've dropped it for now (commented out, so you can still see it in the source).

@Nick: I can see how a script tag would be useful for some people. I personally prefer not to go that path for two reasons. First, it makes the page inaccessible to people with JavaScript turned off. This may not be an issue for others, but it is for me. Second, since this is mainly for 404 pages, anyone putting it onto a 404 page would have to know how to edit their 404 page, in which case they can do a lot more than just a script node.

That said, I think it makes a lot of sense to build a reusable widget/badge in JavaScript that people can just stick onto their WordPress/Blogger/Movable Type blogs. Feel free to work off the code on GitHub.

@mark: I don't require funding. I just need volunteers to help build all these plugins. You volunteering? ;)

All the missing kids gonna steal all ma traffic! I don't think so! There's no ROI in finding missing kids! Unless those missing kids are like "Super thx 4 finding me, now go buy something on that site that had my picture on it." Affiliate marketing! Now that's what I'M talking about.

@eliot: I haven't looked into how you do the geolookup, but I'm assuming you use the full IP address. I was wondering if it would be less stressful on the service if you just used the first three bytes of the IP and zeroed out the fourth. This would effectively reduce the total number of potential IPs from 4 billion to 16 million, which should also limit the number of times you need to call the API.
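The suggestion above can be sketched in a few lines of Python. This is only an illustration: coarsen_ipv4 and cached_country are hypothetical names, lookup stands in for whatever geo-IP call is actually used, and the sketch assumes that addresses sharing their first three bytes resolve to the same country, which is usually but not always true.

```python
import ipaddress


def coarsen_ipv4(ip):
    """Zero the last byte of an IPv4 address so it can serve as a cache key."""
    # strict=False lets ip_network accept an address with host bits set
    # and masks them off, leaving the /24 network address.
    network = ipaddress.ip_network(ip + "/24", strict=False)
    return str(network.network_address)


# In-memory cache for the sketch; a real 404 handler would persist this.
lookup_cache = {}


def cached_country(ip, lookup):
    """Call lookup (a hypothetical geo-IP function) at most once per /24 block."""
    key = coarsen_ipv4(ip)
    if key not in lookup_cache:
        lookup_cache[key] = lookup(key)
    return lookup_cache[key]
```

With this, 198.51.100.1 and 198.51.100.200 share one cache entry. The key space shrinks from 2^32 (about 4.3 billion) addresses to 2^24 (about 16.8 million) blocks.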