Getting a Handle on Traffic

When running a Web site of any size, it helps to learn about the visitors you’re attracting. The traditional solution for monitoring Web traffic is a log file analysis tool such as analog (http://www.analog.cx). analog is very fast, but what if you’d like real-time or near-real-time statistics? You could run analog from cron every five minutes, but what if you also want to issue ad-hoc queries against your logs to answer very specific questions like, “What’s the average number of pages that each Internet Explorer user views?”

Things begin to get difficult when you try to customize most Web log reporting tools. And that’s a shame. There are a lot of interesting questions you might want to ask about your Web traffic: “Where are users coming from? How do users find my site? Do users usually enter through the home page, or does a search engine send them to a more specific page? Which browsers are being used? What are my most popular ‘404-ing’ (missing) URLs? Is another site using my images?”

Unfortunately, most tools can’t answer all those kinds of questions out of the box. To answer those questions, you typically need to extend an existing tool, write your own tool, or spend some quality time with awk, grep, and wc.

Or, you can turn to LAMP. By combining Linux, Apache (and mod_log_sql), MySQL, and PHP, you can build a customized logging and reporting system without a lot of effort. This month, let’s look at how to set up logging. Next month, we’ll build a simple PHP-based interface to browse your Web site’s visitor statistics.

mod_log_sql

If you read Randal Schwartz’s September 2001 Perl Matters column, “Cleaning Out a Logging Database” (available online at http://www.linux-mag.com/2001-09/perl_01.html), you’re familiar with the idea of logging all hits against your Web site using mod_perl and MySQL. The benefits of logging traffic in MySQL are clear: there’s no shortage of reporting tools, and since MySQL is lightning fast, you’ll be able to get your answers much faster than grepping through access_log files.

However, that solution only works if you’re running mod_perl. If you’re serving static content or PHP-based content, you’re out of luck, right? No, not at all.

Chris Powell’s mod_log_sql (http://www.grubbybaby.com/mod_log_sql) is an Apache module that logs every hit into a MySQL database. Better yet, the module doesn’t take the place of traditional file-based logging — you can continue to generate access_log files while also using mod_log_sql.

Setting it up

Like any other Apache module, you can compile mod_log_sql into Apache or build it as a DSO (shared object). Instructions for doing so are included with the mod_log_sql software or can be found on the Web at http://www.grubbybaby.com/mod_log_sql/INSTALL, so I won’t repeat them here. Once you’ve built the module (or recompiled Apache), it’s time to set up the log database and create a separate MySQL user just for Apache.

First we’ll create a database named apache, along with a MySQL user (also named apache) that has access to it. Connect to MySQL as root and run these two commands:

CREATE DATABASE apache;
GRANT ALL PRIVILEGES ON apache.* TO apache IDENTIFIED BY 'SomePassword';

Next there are several items you’ll need to add to your Apache configuration file (httpd.conf) so that mod_log_sql will know what to do. Listing One shows the lines necessary for a site that hosts one or more domains.

The first entry, MySQLLoginInfo, simply sets the host, username, and password that should be used for logging. MySQLSocketFile should point to the location of the MySQL socket on your system. It is typically in /tmp or /var/run. MySQLDatabase specifies the database to use.

The next three parameters are much more interesting. MySQLCreateTables tells mod_log_sql to automatically create tables if they don’t already exist. Combined with MySQLMassVirtualHosting (which logs each virtual domain to a separate table), there’s no additional setup involved when you launch a new domain on your server. The logging is automatic. Finally, MySQLPreserveFile lists the name of the file (always in /tmp) that the module will write to if MySQL is down or otherwise unavailable. MySQLPreserveFile logs full INSERT queries, so when MySQL comes back up, you can simply pipe them into the mysql command-line tool and not lose any data.
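Pulling those six directives together, a configuration along the lines of Listing One might look like the sketch below. The socket path, password, and preserve file name are placeholders for your own values, and the exact argument syntax of each directive may vary between mod_log_sql versions, so check the directives documentation before copying this verbatim:

```apache
# Hedged sketch of a mod_log_sql configuration for httpd.conf.
# Argument values here are illustrative; verify directive syntax
# against http://www.grubbybaby.com/mod_log_sql/directives.html
MySQLLoginInfo localhost apache SomePassword
MySQLSocketFile /var/run/mysqld/mysqld.sock
MySQLDatabase apache

# Create tables automatically, one per virtual host
MySQLCreateTables On
MySQLMassVirtualHosting On

# Fallback file (written in /tmp) used when MySQL is unreachable
MySQLPreserveFile sql-preserve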

There are several other configuration directives that you can use. Some allow you to filter out hits based on various criteria (user agent, URI, etc.) as well as customize the data that is logged. Check http://www.grubbybaby.com/mod_log_sql/directives.html for full details.

Once you’ve added the necessary configuration directives, restart Apache and visit your Web site. If you find that there’s at least one new table in the apache database, you’re now logging to MySQL in real time. If not, check your server’s error_log for hints about what went wrong.

Under the Hood

With everything configured and running, it’s a good time to look at how the log data is being stored.

Let’s connect to MySQL and get a list of all the tables in the apache database (there will be one for each domain). The command show tables produces something like Listing Two.

Notice that each table name begins with access_ (reminiscent of access_log files) and that the dots in the domain names have been converted to underscores (dots have a special meaning to MySQL).
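To make the naming convention concrete, here’s a sketch of what that might look like for a couple of hypothetical virtual hosts (the domain names are illustrative, not from the article’s listings):

```sql
-- Hypothetical example: a server hosting www.example.com and
-- news.example.com would show tables named along these lines
SHOW TABLES;
-- access_news_example_com
-- access_www_example_com
```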

Now let’s have a look at how the data is stored in one of those tables. A describe command generates the table description shown in Listing Three. Notice that all of the 22 columns are NULL by default, and none of them are indexed. Both of these facts may become important when you begin to write queries against the log tables.

The depth of the analysis you can perform is limited only by your ability to write the SQL necessary to get at the data. Combine SQL and your favorite programming language and you can generate some impressive reports. The data’s there waiting for you to explore.
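As a taste of what’s possible, here are two illustrative queries. The table name is hypothetical, and the column names (request_uri, status, agent, time_stamp) are assumptions based on mod_log_sql’s defaults, so run a describe on your own table to confirm them first:

```sql
-- Most popular missing (404) URLs
SELECT request_uri, COUNT(*) AS hits
  FROM access_www_example_com
 WHERE status = 404
 GROUP BY request_uri
 ORDER BY hits DESC
 LIMIT 10;

-- Hits per browser over the last 24 hours
SELECT agent, COUNT(*) AS hits
  FROM access_www_example_com
 WHERE time_stamp > UNIX_TIMESTAMP() - 86400
 GROUP BY agent
 ORDER BY hits DESC
 LIMIT 10;
```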

Speeding Things Up

After you’ve had a chance to accumulate some data and write some queries, you should think about how to speed up common queries. Because none of the columns are indexed by default, you should add indexes in those places that make the most sense.

If you need a refresher on using indexes in MySQL, see “MySQL Performance Tuning” in the July 2001 issue (available online at http://www.linux-mag.com/2001-06/mysql_01.html). Whatever you do, don’t index fields until you are sure you need to do so. Having too many indexes will slow things down, limiting the number of clients you can serve.
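For instance, if your common queries filter on the time stamp and the status code, those two columns are natural candidates. The table and column names below are illustrative; adjust them to match your own tables and query patterns:

```sql
-- Index the columns that appear in the WHERE clauses of
-- your most frequent queries (names are illustrative)
ALTER TABLE access_www_example_com
  ADD INDEX (time_stamp),
  ADD INDEX (status);
```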

Coming Up Next…

We’ve seen how to configure mod_log_sql to record all Web requests in a MySQL database. With the logs effectively captured in MySQL, we can generate lots of interesting statistics.

Next month, we’ll write some PHP code to build a Web-based application to interactively browse the log data.

Jeremy Zawodny uses Open Source tools in Yahoo! Finance by day and is writing a MySQL book for O’Reilly & Associates by night. Reach him at: Jeremy@Zawodny.com.