Background

I'm running a Squid server as a reverse proxy caching in front of the web services, to cache defined content at the HTTP layer.

(I'm also caching in the application layer using Ehcache and Spring - but adding the Squid cache in front of this lets me cache more content that just the JSON, and lets me take advantage of Squids proxy functions to manage back end services easily. It also takes more load off the Tomcat server - which is also running scheduled tasks to update the data in the system.)

This lets me run a pretty mixed environment, which suits my current purposes for delivering speedy content, removes load on my main app server and also doesn't tie me to closely to any particular technical implementation in the back end. There is a complexity overhead, but part of the reason for doing this is to play with the technologies - so thats all part of the fun!

At the moment, all this is running on a single EC2 Ubuntu server instance.

All HTTP traffic comes into the Squid server, running on Port 80. This caches certain html, image, json, php, etc content (as defined in /etc/squid3/squid.conf) - and if it can't find the request in the cache (or has been told not to cache it), will hand off to one of the upstream servers to service the request - Tomcat (on port 8080 - serving my Java web services, and admin App) or Apache (on port 9898 - serving my web sites, PHP/HTML).

So my basic setup is similar to this diagram:

The Problem - How to Analyse Web Service Traffic?

What I want to do is be able to track visitors across all web sites and mobile apps. I can track website visitors fine using Google Analytics. However, the mobile apps access the web services directly - so a client based tracker like Analytics cannot track these requests..

As all requests are routed through Squid - all access data sits in the squid 'access.log'. This is the only place that knows about all the traffic (as this will log all access, whether delivered from the cache, or from the backend servers)

Although Squid provides the cachemgr tool - this is more geared to monitoring cache access, than actual detail of what is delivered, where and to who (which is the kind of info I'm more interested in)

My Solution

Looking at the various log analysis tools available - I decided to give AWStats a try - it's free, seems well documented and commonly used

My goal was to get this AWStats set up on my Ubuntu box to read the Squid logs and provide some nice HTML output - updated on a regular basis so I could monitor usage through the day

So my plan was to end up with something like this:

Prerequisites

Before I started - I already had these set up and running on my Ubuntu (version 12) box

Squid Server (running on port 80)

Apache (running on port 9898)

Step 1 : Install AWStats

The first step was just to install AWStats on Ubuntu using these apt commands (note: the last 2 are only needed if you want to track stats by country):

Step 2 : Configure Squid Logging

The next step is to change the format of the Squid logs. By default, Squids access.log is principally designed to only log the kind of info useful for logging caching activity (what was requested, when, was it served from the cache..)

Step 3 : Configure AWStats

We need to tell AWStats which log file we want to use, and what the format is.

There are various ways you can configure your domain in AWStats. For simplicity here - I'm just assuming we have one domain and just use the 'out-the-box' awstats.conf configuration (I believe it is common practice to have different conf files for each domain - but we'll just use the default for now).

Find the following lines in '/etc/awstats/awstats.conf' - and change them to these settings (see more details of setting the LogFormat here):

Step 5 : Configuring Apache/Squid To View Reports

Now we have AWStats installed, and Squid logging the correct format, the next step is to setup Apache and Squid to view the reports.

We want to view the reports on the URL:
http://yourdomainname/statistics/awstats.pl

Because all our traffic goes through Squid - we just need to add some directives to '/etc/squid3/squid.conf' to redirect any urls going via '/statistics' directly to the Apache server. We also map '/awstats' urls to the apache server too - to catch the css and js files AWStats references.

This will create a set of pages in '/var/www/awstats-report/' with the format 'awstats.awstats.conf.130524.html' (note: you will have to create the /var/www/awstats-report/ directory with the correct permissions.
You will have to add a new set of rules to '/etc/squid3/squid.conf' to:

You can then access the static pages on the url: 'yourdomainname/awstats-report/awstats.awstats.conf.130523.html'

Conclusion

Take all the above with a pinch of salt - I'm sure there are better ways to do some of it, it was learning experience for me - and at the end, delivered what I was after.

I think Squid is a great product, especially for a developer like myself who wants to play with different tech and swap it in/and out of my environment. It's proxying abilities allow you to swap implementatations and server locations with relative ease.

And in it's main role - as a reverse proxy web acceleration server - it's been performing excellently, speeding up web service and image serving with never a grumble. Now I have AWStats up and running - I can drill down and get a lot more information about what's going on under the covers!