I’m trying out Markdown on my blog, and as an experiment, I have enumerated the steps needed to upload a picture to an image site I used to run. I have listed the details here because the process is full of corner cases and non-intuitive setups found in shared hosting environments.

Upload Steps

This document describes the procedure for uploading a file to the server.

Validate User Session

If the session key and id are present, use them for the lookup

Look up the session by key and id

If the session is not found, or is not valid, look up the session by IP address

If looking up by IP address fails, create a new session

If creating a new session fails, explode.

Extract and Clean Tags

Take each tag, and transliterate from UTF-8 to ASCII

Take each tag, and trim leading and trailing whitespace

Take each tag, and lowercase it

Remove all empty tags

Remove all duplicate tags

Check that there are no more than the maximum number of tags, to avoid abuse.

Check that there are at least the minimum number of tags.

Extract and validate the Category Id

Check that the IP hasn’t exceeded the max uploads per unit time (not the session id!)

Get a local copy of the Uploaded picture data

If the picture came from a file upload

Make sure the upload didn’t have an error

Get the file name (as provided by the client) and the file path (location in /tmp)

If a file URL was provided

Record the referrer, and other metadata about where the picture came from.

Attempt to download the file. On success get the file path

If the file download had a Content-Disposition file name, use it.

If the download did not provide a file name, or didn’t come from HTTP, use the URL basename.

Check if the file size is too small.

Check that the file name doesn’t have php in it.

Check that the file is of a valid type (GIF, PNG, or JPEG)

Calculate the file hash, and lookup to see if the image already exists

Check (using the hash) whether the file has been previously deleted.

Insert the Image post data into the database, and get a post ID

Rename the temp Image file to the post ID and file extension

Update the bump ordering using the post ID

Record that the IP address has uploaded a picture, for use by the per-IP upload limit check (step #4)

Store the referrer and other metadata recorded in step #5, if present

Insert all the cleaned tags from step #2

Add “Goats” (the currency of the site) to the session based on how many tags were provided

Record in the session that it was last used at this time (for the garbage collection policy).

Create a thumbnail image of the picture

Invalidate the Index HTML cache

If requested, redirect the user back to the index.
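The file checks in the steps above (minimum size, no php in the name, GIF/PNG/JPEG only, and a content hash for duplicate and deleted-file lookups) can be sketched roughly as follows. This is Python rather than the PHP the site was written in. The names `validate_upload`, `detect_image_type`, and `MIN_FILE_SIZE`, the use of leading "magic" bytes to detect the type, and the choice of SHA-256 are all assumptions for illustration, not what the original server did.

```python
import hashlib

MIN_FILE_SIZE = 64  # bytes; hypothetical threshold, the real limit is not given

# Leading "magic" bytes for the three accepted formats.
MAGIC_BYTES = {
    "gif": (b"GIF87a", b"GIF89a"),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpg": (b"\xff\xd8\xff",),
}


def detect_image_type(data):
    """Return 'gif', 'png', or 'jpg' based on the leading bytes, else None."""
    for ext, signatures in MAGIC_BYTES.items():
        if data.startswith(signatures):
            return ext
    return None


def validate_upload(file_name, data):
    """Run the size, name, and type checks; return (extension, hex digest)."""
    if len(data) < MIN_FILE_SIZE:
        raise ValueError("file is too small")
    if "php" in file_name.lower():
        raise ValueError("file name contains 'php'")
    ext = detect_image_type(data)
    if ext is None:
        raise ValueError("file is not a GIF, PNG, or JPEG")
    # The digest is the key for the "already exists" and "previously deleted" lookups.
    return ext, hashlib.sha256(data).hexdigest()
```

Checking the content instead of trusting the client-supplied extension means a renamed script is rejected even when its name looks harmless.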

Some points that you might have noticed:

Only ASCII text is supported instead of UTF-8. This is because PHP (at the time I created the site) had very poor UTF-8 support, and the default collation of my MySQL 4 database was (unbeknownst to me) Swedish. Additionally, the original audience of my site was a little rougher than typical users and so would try to abuse text input. In order to minimize admin overhead, I decided to keep the site ASCII only.
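For a concrete picture of the ASCII-only tag cleanup, here is a minimal sketch in Python (not the PHP the site used). The function name `clean_tags` and the `MIN_TAGS` / `MAX_TAGS` values are hypothetical. The transliteration is an approximation: it decomposes accented characters and drops whatever is still not ASCII, which is probably cruder than a real transliteration table.

```python
import unicodedata

MIN_TAGS = 1   # hypothetical; the real limits are not given
MAX_TAGS = 10  # hypothetical


def clean_tags(raw_tags):
    """Transliterate, trim, lowercase, drop empties and duplicates, check limits."""
    cleaned = []
    seen = set()
    for tag in raw_tags:
        # Approximate UTF-8 -> ASCII: split accents off letters, then drop non-ASCII.
        tag = unicodedata.normalize("NFKD", tag).encode("ascii", "ignore").decode("ascii")
        tag = tag.strip().lower()
        if not tag or tag in seen:
            continue  # skip empty and duplicate tags
        seen.add(tag)
        cleaned.append(tag)
    if len(cleaned) > MAX_TAGS:
        raise ValueError("too many tags")
    if len(cleaned) < MIN_TAGS:
        raise ValueError("not enough tags")
    return cleaned


print(clean_tags(["  Café ", "cafe", "", "DOG"]))  # → ['cafe', 'dog']
```

Counting the tags only after the cleanup means that padding a request with blanks or repeats does not help reach the minimum.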

Uploaded pictures have to be copied to a directory local to the server software. One day I found that file uploads were failing because I had hit the maximum number of files allowed by my quota. Uploads land in /tmp/, which was on a different partition from the server software. This means that from PHP’s perspective, the file was uploaded “successfully”, despite being useless.
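The lesson is that a "successful" upload proves nothing until the file has actually landed in the local directory. A rough Python sketch of that check follows (the site itself was PHP). The function name `copy_to_local` is hypothetical, and comparing file sizes is just one simple way to confirm the copy, chosen here as an assumption.

```python
import os
import shutil


def copy_to_local(tmp_path, local_dir):
    """Copy an uploaded temp file into local_dir and verify the copy is complete."""
    expected_size = os.path.getsize(tmp_path)
    local_path = os.path.join(local_dir, os.path.basename(tmp_path))
    try:
        shutil.copyfile(tmp_path, local_path)
    except OSError as exc:
        # Quota, disk-full, and permission errors surface here, not at upload time.
        if os.path.exists(local_path):
            os.remove(local_path)
        raise RuntimeError(f"could not copy upload into {local_dir}") from exc
    if os.path.getsize(local_path) != expected_size:
        os.remove(local_path)
        raise RuntimeError("local copy is incomplete")
    return local_path
```

Failing loudly at this point keeps the later steps (hashing, the database insert, the rename) from running against a file that is not really there.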

Session ID (basically an anonymous, semi-persistent login cookie) is not used for quota enforcement, because a small number of users could get a new cookie pretty easily. (Of course, they could also get a new IP address pretty easily, but it’s much simpler to handle such users on a case-by-case basis, rather than trying to make the software handle it. The goal is to limit moderator overhead, not to be perfect.)
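A per-IP limit of this kind can be sketched as a sliding window of upload timestamps. The Python below is illustrative only: the class name `UploadLimiter`, the default limits, and the in-memory storage are all assumptions. The original site presumably kept this record in its database, and the real limits are not stated.

```python
import time
from collections import defaultdict, deque

MAX_UPLOADS = 5        # hypothetical limit
WINDOW_SECONDS = 3600  # hypothetical "unit time"


class UploadLimiter:
    """Track upload times per IP address and enforce a max per time window."""

    def __init__(self, max_uploads=MAX_UPLOADS, window=WINDOW_SECONDS):
        self.max_uploads = max_uploads
        self.window = window
        self._uploads = defaultdict(deque)  # ip -> timestamps, oldest first

    def _prune(self, ip, now):
        """Forget uploads that have aged out of the window."""
        times = self._uploads[ip]
        while times and now - times[0] >= self.window:
            times.popleft()
        return times

    def allowed(self, ip, now=None):
        """The check made before accepting an upload."""
        now = time.time() if now is None else now
        return len(self._prune(ip, now)) < self.max_uploads

    def record(self, ip, now=None):
        """Called after a successful upload, to feed later checks."""
        now = time.time() if now is None else now
        self._prune(ip, now).append(now)
```

The key is the IP address, not the session id, so clearing cookies does not reset the count.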

And, some points that you may not have noticed:

Naming files after their database primary key (the post id) is a chicken-and-egg problem. If there are any problems between allocating the ID (in MySQL, this is a side effect of a successful insert) and any post-insert steps, the server could be left in an inconsistent state. Either a row is inserted into the DB without a corresponding picture, or the picture is in the serving directory without a row. The latter is much safer, and if I had not been a high-schooler at the time, I would have picked a transactional database and sidestepped these problems. Alas, I was not experienced and wrote the server assuming everything would work.
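With a transactional database, the insert and the rename can be treated as one unit of work. The sketch below uses Python's built-in `sqlite3` as a stand-in for such a database. The `posts` table, its `ext` column, and the function name `store_post` are hypothetical. It also assumes the temp file already sits on the same filesystem as the serving directory, so that the move is a plain rename.

```python
import os
import sqlite3


def store_post(conn, tmp_path, serve_dir, ext):
    """Insert a post row and move the image into place as one unit of work."""
    final_path = None
    try:
        # The INSERT opens a transaction; the row is not visible until commit.
        cur = conn.execute("INSERT INTO posts (ext) VALUES (?)", (ext,))
        post_id = cur.lastrowid
        final_path = os.path.join(serve_dir, f"{post_id}.{ext}")
        os.replace(tmp_path, final_path)
        conn.commit()
        return post_id
    except Exception:
        conn.rollback()  # discard the uncommitted row
        if final_path and os.path.exists(final_path):
            os.replace(final_path, tmp_path)  # undo the rename as well
        raise
```

Because the commit comes last, a crash partway through leaves at worst a picture without a row, which is the safer of the two inconsistent states.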