URL Scraping

2007-11-16

I was looking for a way to create packaged versions of Curly Logo. For example, it would be nice if you could create a version of Curly Logo that already has your favourite SQUARE and FATPEN procedures defined. One way would be to take the served pages (some XHTML and JavaScript), edit them, and host them yourself. How tedious.

Then I thought that Curly Logo could examine the URL, see whether it contained a suitably encoded Logo program, and execute the program if so. This is useful because «http://www.amberfrog.com/logo/», «http://www.amberfrog.com/logo/?foo», and «http://www.amberfrog.com/logo/#foo» are different URLs that happen to load the same resource (the part after the «#» is never even sent to the server). Because it’s exactly the same resource my server can serve cached copies, and in fact your browser can use the ETag header to check that its copy is still current and not bother downloading the same object again. Wins all round.

So what Curly Logo does is «decodeURI» everything past the ‘#’ in its URL (available in «window.location») and execute it as Logo, as if it had been typed in. I call this technique URL scraping.
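
The scraping step described above can be sketched roughly as follows. This is a minimal illustration, not Curly Logo's actual source: the function name «programFromUrl» and the «runLogo» call in the comment are hypothetical names invented for the example, and returning null for a missing or malformed program is an assumption about how one might handle those cases.

```javascript
// Hypothetical helper (not from Curly Logo's source): given a full URL
// string, return the decoded text after the first '#', or null if there
// is nothing usable there.
function programFromUrl(url) {
  var i = url.indexOf('#');
  if (i < 0) {
    return null;              // no fragment at all
  }
  var encoded = url.slice(i + 1);
  if (encoded === '') {
    return null;              // a bare '#' with nothing after it
  }
  try {
    return decodeURI(encoded);
  } catch (e) {
    // decodeURI throws URIError on malformed %-escapes; treat the
    // fragment as absent rather than failing the whole page.
    return null;
  }
}

// In a browser this might be wired up along these lines, where runLogo
// stands in for whatever executes a line of typed Logo (assumed name):
//
//   var program = programFromUrl(String(window.location));
//   if (program !== null) { runLogo(program); }
```

With this sketch, a URL such as «http://www.amberfrog.com/logo/#fd%2050%20rt%2090» would yield the text «fd 50 rt 90». Browsers also expose the fragment directly as «window.location.hash», which would save the search for the ‘#’.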

Aside: Firefox makes the turtle bug-eyed when I use a «#» in the URL. I have no idea why. Surely it’s a bug?

The advantage of URL scraping is that nothing changes on the server, so everyone benefits from caching, and in some cases it also opens up the possibility of serving static HTML (as I do). There’s no reason that the JavaScript doing the scraping has to use the «#» part of the URL; it was just convenient for me. You could have a server-side rewrite rule that maps /foo/red/ and /foo/blue/ to the same object (thereby still gaining the benefits of caching) and have the JavaScript sense the last directory component of the URL. It could pick a CSS theme using that value, for example, so that different URLs give your users different themes on the same site while the server transmits the same object in either case. The possibilities are endless.
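
As a rough sketch of the theme idea, assuming the server already rewrites /foo/red/ and /foo/blue/ to the same page: the theme table, the class names, and both function names below are made up for illustration, and falling back to a default theme for unknown paths is just one possible choice.

```javascript
// Hypothetical helper: last non-empty component of a URL path,
// e.g. "red" for "/foo/red/". Returns null for "/" or "".
function lastPathComponent(pathname) {
  var parts = pathname.split('/').filter(function (p) {
    return p !== '';
  });
  return parts.length > 0 ? parts[parts.length - 1] : null;
}

// Assumed mapping from path component to a CSS class name.
var THEMES = { red: 'theme-red', blue: 'theme-blue' };

// Pick a theme class for a path, falling back to a default when the
// last component is not a known theme.
function themeClassFor(pathname) {
  var key = lastPathComponent(pathname);
  if (key !== null && Object.prototype.hasOwnProperty.call(THEMES, key)) {
    return THEMES[key];
  }
  return 'theme-default';
}

// In a browser, the page script might then do something like:
//
//   document.documentElement.className =
//       themeClassFor(window.location.pathname);
```

The lookup goes through an explicit table rather than using the path component directly as a class name, so that an arbitrary URL cannot inject an arbitrary class into the page.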

A cache? I suppose in the most general sense it is, but really I did it so that I could insert a variable object into an otherwise static page without needing any server-side programming at all.