Users of
Presskit’n have been asking for traffic statistics on their press releases so I decided I would get them the most recent data possible. At first I was parsing the access log once a minute and when I was testing that I decided it wasn’t updating fast enough. I’ve gotten used to everything being instant on the internet and I didn’t want to wait a minute to see how many more views there were. In this post I show how I got it to update on page load using Apache, python, Django, and memcached.

Apache Access Logs

Apache is installed with rotatelogs. This program can be used to rotate the logs after they get too large. However I wanted a few more features. Cronolog will update a symlink everytime it creates a new log file so that you can always have the most recent stats.

CustomLog and ErrorLog directives in apache will let you pipe output to a command. So I put the full path to cronolog and then specified the parameters to cronolog. –symlink will point the named symlink to the most recent log created with cronolog. After the options, the path to the log location is specified and date formats can be used. I decided to break mine up by day.

Piping Apache Log info to a Python Script

Apache can have multiple log locations and log multiple times. So I wrote my own logging script in python that would insert into memcached. Here is the extra line I added to apache:

The page views are being added to memcached (with
cache.incr() which is new in django 1.1) for quick retrieval and the logs will still be created by cronolog so no data will be lost when the cache expires. Those logs are used in the next part.

Parsing the Logs

The hit counts will expire from the cache after 24 hours so I
parse the logs once a day and put that information into my database. For this I wrote a
django management command (I didn’t do a management command before because I wasn’t sure how it would handle the parent and child processes). This command is called by ./manage.py parse_log

The hits_today method requires that you define
get_absolute_url which is useful in other places as well. @property is a decorator that makes it possible to access the data with object.hits and leave off the parenthesis.

The hits method uses the content type framework again to look up the hits in the database.

Just the Basics

There is a lot more that can be done with this. This barely touches the raw data available in the logs. A few ways I’ve already started improving this is to not include known bots as hits, check the referrer to see where traffic is coming from, and save the keywords used in search engines.

I use webfaction to host a lot of my django projects. It has an easy setup that will get you developing quickly and a great community of talented programmers. There is also a quick setup for rails, wordpress, and a lot more.