phpblog@developerworks

One of the concepts that new Web developers find hardest to fully grasp is just how dangerous it is to trust user input. Just in the last week, there have been a dozen or so reports of vulnerabilities found in Web applications, and almost all of them revolve around unchecked user input. Because of PHP's dominance in the Web application development world, many of the vulnerable applications were written in PHP, which hurts PHP's security track record even though the language is not at fault (the same applications, written in any other language, would have suffered from the same vulnerabilities).

The challenge of validating user input is not a simple one. The key to meeting this challenge is attention to detail combined with knowledge.

At the end of the day, nothing other than the developer herself can ensure that an unsanitized piece of data never finds its way into a filename, or even a database query, which is why paying close attention to what goes where is important.

But that's actually not enough. Few people fully understand just how little of the Web environment can be trusted. Nowadays, most developers know that you cannot rely on GET or POST variables to have the values you expect (even if they come from hidden form fields), but how many of them know that you cannot trust any $_SERVER variable that begins with HTTP_ (e.g., $_SERVER["HTTP_REFERER"])? These can be fully (and easily) spoofed by remote users, and must not be trusted. The same goes for cookies: they may not be easily visible to or editable by the average user, but because they're saved on the client side, a 'malicious hacker', or even a down-to-earth script kiddie, can easily set them to their heart's content. And how about $SERVER_NAME ($_SERVER["SERVER_NAME"]), which actually depends on the Host: header sent by the remote user, and can therefore be spoofed under certain circumstances?
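As a rough illustration of treating these values as untrusted, here is a minimal sketch written for current PHP versions. The helper names safe_page_name() and trusted_host(), the 64-character limit, and the idea of keeping a list of known host names are all assumptions made for this example; they are not taken from any particular application.

```php
<?php
// Hypothetical helper: accept a user-supplied page name only if it matches
// a strict allow-list pattern, so it can never contain "/" or "..".
// Returns null when the input is not acceptable.
function safe_page_name(string $input): ?string
{
    return preg_match('/\A[a-z0-9_-]{1,64}\z/', $input) === 1 ? $input : null;
}

// Hypothetical helper: compare the client-supplied Host header against the
// names this site is known to serve, instead of trusting it outright.
// Falls back to $default when the header names anything else.
function trusted_host(string $hostHeader, array $allowedHosts, string $default): string
{
    // Lower-case the header and drop an optional ":port" suffix.
    $host = preg_replace('/:\d+\z/', '', strtolower($hostHeader));
    return in_array($host, $allowedHosts, true) ? $host : $default;
}
```

A script might call safe_page_name($_GET['page'] ?? '') and refuse the request when it gets null back, rather than trying to clean up a bad value. The same pattern applies to cookies and HTTP_* values: decide what is acceptable, and reject everything else.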

At the suggestion of a friend, I added yet another twist to my watermarking script: rotating the watermark a random amount each time.

While cool in theory, it turns out that the practice is a little more difficult. For one thing, PHP's ImageRotate() function creates a new image resource sized large enough to hold the entire rotated image. Even though I'm using a circular watermark, the image it's in is rectangular. So if you rotate that rectangle into a lozenge and put a bounding box around it, you generally get a box wider and taller than the original one.
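The growth in size is plain geometry and can be estimated ahead of time. The sketch below uses a hypothetical helper, rotated_bounds(), which is not part of GD; it computes the smallest upright box around a rotated rectangle, and the canvas GD actually returns may differ from it by a pixel.

```php
<?php
// Hypothetical helper: size of the smallest upright rectangle enclosing a
// $width x $height rectangle rotated by $degrees.
// Returns [boxWidth, boxHeight] in whole pixels.
function rotated_bounds(int $width, int $height, float $degrees): array
{
    $theta = deg2rad($degrees);
    $cos = abs(cos($theta));
    $sin = abs(sin($theta));

    // Round away floating-point noise (cos(90 degrees) is not exactly 0)
    // before rounding up to whole pixels.
    $boxWidth  = (int) ceil(round($width * $cos + $height * $sin, 6));
    $boxHeight = (int) ceil(round($width * $sin + $height * $cos, 6));

    return [$boxWidth, $boxHeight];
}
```

For a 100 x 50 image, a 45-degree turn needs a box of about 107 x 107 pixels, so the rotated watermark has to be positioned using the new dimensions rather than the original ones.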

For another thing, getting ImageRotate() to do the Right Thing with the transparency is turning out to be a hassle. I'm working with one of the developers of the PHP image library on figuring that one out; it's unclear whether the issue is my boneheadedness or an actual bug in PHP.

Different programming languages excel at different things. I employ a number of different programming languages, and make my choice based on the task at hand. My presentations tend to be powered by Perl. My weblog is powered by Python. But my private applications tend to be written in PHP.

The developerWorks PHP Blog, however, will often touch on topics that may be new to enterprise (typically Java) programmers.

My first post on this site will cover an application I wrote in about an hour to cover a specific need. It breaks a number of "rules" that guide the development of scalable enterprise applications - in particular it does not separate presentation from content. Consequently, this application does not contain any reusable components that will ultimately find their way into a Customer Relationship Management system. I'm entirely OK with that.

Still, the application is centrally managed, requires zero deployment, is accessible everywhere and portable across a wide range of client operating systems and browsers. All good traits to have.

The application is a vocabulary test. Each week my daughter gets a list of words and their definitions to study. At the end of the week, there is a test where she needs to match the words with the definitions. At first, I helped her study using flash cards, but this seemed like an obvious candidate for automation.

The application itself consists of a single source file and a single data file. The source contains code in two languages, JavaScript and PHP (four if you count CSS and HTML).
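The post doesn't show the code or the data format, so the following is only a guess at the general shape of such a program. It assumes a data file with one tab-separated "word, definition" pair per line, and the function names load_vocabulary() and build_quiz() are made up for this sketch.

```php
<?php
// Assumed data format: one "word<TAB>definition" pair per line.
// Returns an array mapping each word to its definition.
function load_vocabulary(string $text): array
{
    $words = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $parts = explode("\t", $line, 2);
        // Skip blank or malformed lines.
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $words[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $words;
}

// Build a matching test: words in file order, definitions in random order.
function build_quiz(array $words): array
{
    $definitions = array_values($words);
    shuffle($definitions);
    return ['words' => array_keys($words), 'definitions' => $definitions];
}
```

The page itself would then loop over the two lists and print the HTML form inline, which is exactly the mixing of presentation and content mentioned earlier.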

After using this application for a few weeks, there was a single requirement change: when taking the test, as my daughter filled in her answers, she would strike out the options that she had already used. This, too, was easily accomplished with a little JavaScript.

Nothing in this application couldn't have been done with JSP. However, to do so would have required additional effort, effort that does not result in a more functional end result.

And it wouldn't have been as much fun.

Not everybody may have the diverse set of skills required to pull together such an application in one sitting. However, a much larger set of people can copy such an application and successfully make meaningful changes to it.

In my experience, that's how people tend to learn languages such as PHP. Some refer to this as Progressive Disclosure.

A few months ago, I talked to one of the leading CMS industry analysts, and he mentioned how surprised he was to find the dominance of PHP in this market. The available CMSs range from open-source freeware to proprietary to supported open source. Some familiar names (in no particular order) are Drupal, Mambo, eZ and, of course, the popular PHP-Nuke. If I went through the whole list of great PHP CMSs I know, this post would start getting boring. As I am very often asked which CMS I recommend, and with all the excellent packages out there it's hard to pick one, I decided to post a link to The CMS Matrix, a nice site that provides some initial matrix comparisons between the various CMSs. Of course, at the end of the day you'll have to install some of them and actually try them to see if they suit your needs.

One of the most popular articles I've ever written has been about Preventing Image 'Theft'. I wrote it several years ago, but people are still reading it (evidently) and contacting me about it.

I've recently had call to use this sort of thing myself, and what I've got now is rather more advanced than described in the original article. For instance, now I transparently intercept images 'going offsite' and replace them with a correctly-sized blank box containing text about the copyright. And for images I want to be basically previewable but not really usable (if people want a usable version they need to contact me) I watermark 'em.
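The post doesn't say how an image 'going offsite' is detected. One common approach, assumed here, is to look at the Referer header of the image request; since that header is supplied by the client, this is a deterrent rather than real protection. The function name is_offsite_request() and the host list are hypothetical.

```php
<?php
// Hypothetical helper: decide whether an image request appears to come from
// a page on someone else's site, judging by the Referer header.
function is_offsite_request(?string $referer, array $ownHosts): bool
{
    if ($referer === null || $referer === '') {
        // No Referer at all: assume a direct or privacy-filtered request.
        return false;
    }
    $host = parse_url($referer, PHP_URL_HOST);
    if (!is_string($host)) {
        // Referer present but unparseable: treat it as offsite.
        return true;
    }
    return !in_array(strtolower($host), $ownHosts, true);
}
```

A handler could call this with $_SERVER['HTTP_REFERER'] ?? null and, when it returns true, send a placeholder with the same dimensions as the real image instead of the image itself.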

Watermarking a digital image means adding information to the existing pixels. It can involve adding invisible information for identification purposes, such as the Digimarc technique, or it can be done to visibly degrade the quality of the image, perhaps with a message. I've used both, but it's the latter mechanism I needed most recently.

Watermarking for degradation is an interesting challenge. There's nothing you can do, short of actually destroying data, to keep a really determined perp from getting past your defences, but you can make it pretty difficult.

For instance, the degradation watermarking I set up recently uses a watermark with built-in noise, so there isn't a single same-colour region that can be undone. A perp would have to figure out what the watermark pixels are, pixel by pixel, in order to create a mask to remove it. And since I'm using it on dense JPEG images, that's a little difficult.

In addition, the watermark is repeated across the entire image, and not at regular intervals. Each one is jigged a bit at random, so no two previews of the same image should be identical. (Well, modulo repeats of the random sequence.) This keeps a perp from figuring that the watermark is repeated at fixed intervals, and using that to help remove it. Of course, a perp who accesses multiple previews of the same image can probably figure out the pixel settings eventually by comparing them. But I suspect that would be a major chore, too.
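A minimal sketch of the jittered placement might look like the following. It is not the actual script: the function name jittered_positions(), the grid-plus-offset scheme, and the $jitter parameter are assumptions chosen to illustrate the idea.

```php
<?php
// Hypothetical helper: lay the watermark out on a grid, then nudge each copy
// by a random offset of up to $jitter pixels in each direction.
// Returns a list of [x, y] top-left positions for the watermark copies.
function jittered_positions(int $imgW, int $imgH, int $markW, int $markH, int $jitter): array
{
    if ($markW <= 0 || $markH <= 0 || $jitter < 0) {
        throw new InvalidArgumentException('Watermark size must be positive and jitter non-negative.');
    }

    $positions = [];
    for ($y = 0; $y < $imgH; $y += $markH) {
        for ($x = 0; $x < $imgW; $x += $markW) {
            $positions[] = [
                $x + mt_rand(-$jitter, $jitter),
                $y + mt_rand(-$jitter, $jitter),
            ];
        }
    }
    return $positions;
}
```

Each position would then be handed to a GD copy call such as imagecopy() to stamp the watermark onto the preview. Because the offsets come from mt_rand(), two requests for the same image get slightly different layouts.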

Both the replacement-with-notice and the watermarking are done in realtime as part of Apache's response to a Web request. The replacement is very low impact indeed, and doesn't cause performance to deteriorate noticeably, but the watermarking involves actual image manipulation, of megapixel images, and so can slow things down. So you can use the former on almost any server, but the latter really needs a machine with a lot of oomph to keep visitors happy.

You can, of course, get rid of the performance impact by watermarking the images ahead of time, and sending the results normally. That decreases the random factor, though, and possibly makes the watermark more easily removed.

I use PHP extensively on almost all of my Web pages. (Just about the only ones that don't use it are on servers that don't have it installed. Mine, of course, all have PHP installed.) In a lot of cases I store the real content in a database, and PHP pulls it out and formats it; sometimes the content is built directly from scanning files and directories.

This has advantages and disadvantages, of course. On the pro side, the content is visible immediately. On the con side, the content is visible immediately. :-) Having every page be interpreted has definite performance implications, and not just on the origin server. Unless care is taken, the dynamic nature of the page can result in it rarely or never being cached, which hits you in the system and the network. If you pay for your bandwidth by the byte, that can be a huge deal.
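One way of "taking care" is to give a dynamic page a Last-Modified date and to honour conditional requests. The sketch below is only an illustration; the helper names http_date() and not_modified() are hypothetical, and it assumes the script can work out when the underlying content last changed (for instance from a database timestamp).

```php
<?php
// Format a Unix timestamp as an HTTP date, e.g. for a Last-Modified header.
function http_date(int $timestamp): string
{
    return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}

// Hypothetical helper: true when the client's If-Modified-Since value is at
// least as recent as the content's last change, so a 304 can be sent.
function not_modified(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null) {
        return false;
    }
    $since = strtotime($ifModifiedSince);
    return $since !== false && $since >= $lastModified;
}

// Typical use at the top of a page, before any output
// ($lastChange is the assumed timestamp of the content's last change):
//
//   header('Last-Modified: ' . http_date($lastChange));
//   if (not_modified($_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null, $lastChange)) {
//       http_response_code(304);
//       exit;
//   }
```

With this in place, browsers and proxies that already hold a copy get a short 304 response instead of the full page, which saves both the page generation and the bandwidth.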

Like Sam, I use a mix of languages. For Web pages, I use PHP almost exclusively; for standalone apps I use C, PHP, bash, or Perl, as seems appropriate. In general I use Perl, but if I'm frobbing a database, chances are I'll use PHP, unless there's something in CPAN that argues for Perl.

When I first got involved with blogging, in December 2002, I decided to write my own software from the ground up. In PHP. Of course, I'm a bit-twiddler, and did it that way so I'd understand what this 'blogging' thing was all about and how it worked. (Almost as soon as I brought it online, Sam challenged me to take the next step. :-) The result can be seen at my blog.

I hope to go into this subject in more detail in the future, and I definitely intend to say some things about how I'm using PHP to automatically guard my Web servers, but this is just an introductory note after all.