FAQ: Are the Statistics Accurate?

I've heard that statistics like visitors, "sessions," and "paths through the site" can't be computed accurately. Is that true? Are the statistics reported by Sawmill an accurate description of the actual traffic on my site?

Short Answer

Sawmill accurately reports the data as it appears in the log file. However, many factors skew the data in the log file. The statistics are still useful, and the skew can be minimized through server configuration.

Long Answer

Sawmill (and all other log analysis tools) reports statistics based on the
contents of the log files. With many types of servers, the log files
accurately describe the traffic on the server (i.e. each file or page viewed
by a visitor is shown in the log data), but web log files are trickier,
due to the effects of caches, proxies, and dynamic IP addresses.

Caches are locations outside of the web server where previously-viewed
pages or files are stored, to be accessed quickly in the future.
Most web browsers have caches, so if you view a page and then return to it
later, your browser will display the page without contacting the web server;
you'll see the page, but the server will not log your access.
Other types of caches save data for entire organizations or networks.
These caches make it difficult to track traffic, because many views of
pages are not logged and cannot be reported by log analysis tools.

Caches interfere with all statistics, so unless you've defeated the cache
in some way (see below), your web server statistics will not represent
the actual viewings of the site.
The logs are, however, the best information available in this case,
and the statistics are far from useless.
Caching means that none of the numbers you see are accurate
representations of the number of pages actually viewed, bytes
transferred, and so on. However, you can be reasonably sure that if your
traffic doubles, your web statistics will double too. Put another way, web
log analysis is a very good way of determining the relative
performance of your web site, both against other web sites and against itself
over time. This is usually the most important thing anyway: since
nobody can really measure true "hits," when you compare your hits
to someone else's hits, both are affected by the same caching issues, so in
general you can compare them meaningfully.

If you really need completely accurate statistics, there are ways of
defeating caches. There are HTTP headers you can send which tell caches
not to store your pages; these usually work, but are ignored by some
caches. A more reliable solution is to add a random tag to every page, so
instead of loading /index.html, visitors load /index.html?XASFKHAFIAJHDFS.
That prevents the page from being cached anywhere down the line,
which gives you completely accurate page counts (and paths through
the site). For instance, if someone goes back to a page earlier in
their path, it will have a different tag the second time, so it will be
reloaded from the server and relogged, and your path statistics will be
accurate. However, by disabling caching, you're also defeating the
point of caching, which is performance optimization, so your web site
will be slower if you do this. Many choose to do it anyway, at least
for brief intervals, in order to get "true" statistics.
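The random-tag technique can be sketched as a small helper that appends a unique query string to each URL before it is linked or requested. This is an illustrative sketch, not Sawmill functionality; the function name and tag length are made up:

```python
import secrets
from urllib.parse import urlsplit, urlunsplit

def cache_busted(url):
    """Append a random query tag so caches treat every request as unique.

    Illustrative helper: any source of randomness works; 8 bytes (16 hex
    characters) is merely enough to make collisions vanishingly unlikely.
    """
    tag = secrets.token_hex(8)
    scheme, netloc, path, query, fragment = urlsplit(url)
    query = f"{query}&{tag}" if query else tag
    return urlunsplit((scheme, netloc, path, query, fragment))

# Each call yields a different URL, so no cache can serve a stored copy.
print(cache_busted("/index.html"))  # e.g. /index.html?a3f9c2e1b4d07788
```

Because the tag changes on every request, even a revisit to an earlier page in the visitor's path produces a fresh, loggable hit.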

The other half of the problem is dynamic IP addresses and proxies.
These affect the "visitor" counts in those cases where visitors
are computed from unique hosts. Normally, Sawmill
assumes that each unique originating hostname or IP address is a unique visitor,
but this is not generally true. A single visitor can show up as multiple
IP addresses if they are routed through several proxy servers, or
if they disconnect and dial back in and are assigned a new IP address.
Multiple visitors can also show up as a single IP address if they
all use the same proxy server. Because of these factors, the visitor
numbers (and the session numbers, which depend on them) are not
particularly accurate unless visitor cookies are used (see below).
Again, however, it's a reasonable number to use as the best
available approximation of the visitor count,
and these numbers tend to rise as your traffic rises,
so they can be used as effective comparative numbers.
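Host-based visitor counting amounts to counting distinct client hosts in the log. A minimal sketch, using made-up Apache-style log lines, shows both the computation and why the result is only approximate:

```python
# Made-up Apache-style access log lines; the first field is the client host.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /a.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET / HTTP/1.1" 200 512',
]

# Host-based counting: each distinct host is treated as one visitor.
visitors = {line.split()[0] for line in log_lines}
print(len(visitors))  # 2
```

The count of 2 could be wrong in either direction: 10.0.0.2 might be a proxy hiding dozens of real users, or 10.0.0.1 and 10.0.0.2 might be the same user whose dial-up connection was reassigned a new address.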

As with caching, the unique-hosts issue can be solved through web
server configuration. Many people use visitor cookies (a browser cookie
assigned to each unique visitor, and unique to them forever) to track
visitors and sessions accurately. Sawmill can be configured to
use these visitor cookies as the visitor ID, by extracting the cookie
using a log filter and putting it in the "visitor id" field.
This isn't as foolproof as the cache-defeating method
above, because some people have cookies disabled, but most have them
enabled, so visitor cookies usually provide a very good approximation
of the true visitor count. If you get really tricky, you can configure
Sawmill and/or your server to use the cookie when it's available, and
the IP address when it's not (or even the true originating IP address,
if the proxy passes it). Better yet, you can use the concatenation of the
IP address and the user-agent field to get even closer to a unique visitor
ID even in cases where cookies are not available.
So you can get pretty close to accurate visitor information if you really
want to.
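Sawmill's log filters are configured through its own interface, but the fallback logic described above can be sketched in plain Python; the function and field names here are illustrative, not Sawmill's:

```python
def visitor_id(ip, user_agent, cookie=None):
    """Derive a visitor ID: prefer a persistent visitor cookie; otherwise
    fall back to the concatenation of IP address and user-agent.

    Illustrative logic only; Sawmill's actual log filters use its own
    configuration syntax and field names.
    """
    if cookie:
        return cookie
    return f"{ip}|{user_agent}"

# With a cookie, the ID survives IP reassignment and proxy hops.
print(visitor_id("10.0.0.1", "Mozilla/5.0", cookie="VID=abc123"))
# Without one, IP + user-agent still separates users sharing a proxy IP,
# as long as their browsers differ.
print(visitor_id("10.0.0.2", "Mozilla/4.0"))
```

The IP-plus-user-agent fallback is strictly a heuristic: two users behind the same proxy running identical browsers still collapse into one visitor, which is why the cookie, when available, is preferred.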

To summarize, with a default setup (caching allowed, no visitor
cookies), Sawmill will report hits and page views based on the log data,
which will not precisely represent the actual traffic to the site; neither
will any other log analysis tool. Sawmill goes further
into the speculative realm than some tools by reporting visitors, sessions, and paths
through the site. With some effort, your server can be configured to
make these numbers fairly accurate. Even if you don't, however, you
can still use them as valuable comparative statistics, to compare the
growth of your site over time, or to compare one of your sites to
another.