Inf@Vis!

The digital magazine of InfoVis.net

Logfile Analysis

by Juan C. Dürsteler

[message nº 174]

Knowing how users utilise our website is instrumental to understanding whether we are providing the appropriate service, whether our products are easily found and, ultimately, to what extent the site accomplishes the goals we pursue. One of the most common ways to gain this knowledge is logfile analysis, a very useful yet not trouble-free method.

The Internet is plagued with examples of carefully designed websites where, nonetheless, users get lost and don't find what they are looking for, despite its existence, or, even worse, look for something that should exist in the website but doesn't.

On the other hand, many web managers don't know what their users are doing within the website: whether they find what they look for, or whether there are concepts worth adding to the site because users search for them. Without knowing the impact of our marketing campaigns on an e-commerce site, we will hardly be able to make it progress in the right direction.

In issues 65, 66 and 67 we already commented on some of the solutions for analysing our website logfiles and monitoring their traffic, along with the importance of visualisation for quickly understanding the results. At that time (November 2001) there were very few graphic applications showing the results of logfile analysis. That still holds true nowadays.

The structure of a logfile is extremely simple. Each time someone downloads an element of the website, like for example an HTML page or just an image, the server writes a line in the logfile. That line can adopt one of several formats, but usually it looks like this:

Field                          Example
-----                          -------
IP                             127.0.0.1
Identity check                 -
UserID                         frank
Date and time                  [10/Oct/2005:13:55:36 -0700]
Method, resource and protocol  "GET /apache_pb.gif HTTP/1.0"
Status                         200
Size                           2326
Referrer                       "http://www.example.com/start.html"
Agent (Browser, O.S., etc.)    "Mozilla/4.08 [en] (Win98; I ;Nav)"

We won't go into the details of what every element means, nor into the different format variants. Those interested can consult the format of the popular Apache server or the W3C specifications.
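To give a flavour of how elementary these lines are, here is a minimal sketch of parsing the example line above with a regular expression. The pattern is an assumption built from that single example (Apache's "combined" style), not a complete parser for every format variant:

```python
import re

# Illustrative pattern for one combined-style log line; field names are ours.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

line = ('127.0.0.1 - frank [10/Oct/2005:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" '
        '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

match = LOG_PATTERN.match(line)
record = match.groupdict()
print(record['ip'])        # 127.0.0.1
print(record['status'])    # 200
print(record['referrer'])  # http://www.example.com/start.html
```

Each matched line becomes a dictionary of fields, which is all that the aggregate statistics discussed next require.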

The important thing is that, elementary as each line is, the statistical study of the aggregation of the many requests that users' browsers issue to the server lets us derive a lot of information from these apparently simple lines. Among other things, at least apparently, you can derive the number of pages served per day, per month or per whatever unit of time you choose, the sites that link to our web or redirect traffic towards us (the referrers), the words most commonly searched for in our website, and a long list of other findings.
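The aggregations just mentioned amount to simple counting. A sketch, assuming the log lines have already been parsed into dictionaries with 'time' and 'referrer' fields (the field names and sample values are illustrative):

```python
from collections import Counter

# Hypothetical parsed records; timestamps follow the Apache-style example above.
records = [
    {'time': '10/Oct/2005:13:55:36', 'referrer': 'http://www.example.com/start.html'},
    {'time': '10/Oct/2005:14:01:12', 'referrer': 'http://www.example.com/start.html'},
    {'time': '11/Oct/2005:09:30:01', 'referrer': 'http://search.example.org/?q=infovis'},
]

# Pages served per day: keep only the date part before the first colon.
pages_per_day = Counter(r['time'].split(':', 1)[0] for r in records)

# Sites that send us traffic, ranked by number of requests.
top_referrers = Counter(r['referrer'] for r in records)

print(pages_per_day['10/Oct/2005'])      # 2
print(top_referrers.most_common(1))
```

The same pattern extends to status codes, requested resources or user agents: one `Counter` per question you want the logfile to answer.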

If we review the specifications of many server logfile analysers, we'll see that knowing the number of unique visits and visitors, or how long they have stayed on our website or even on a particular page, appears to be extremely easy for those products.

Nothing could be further from the truth. Most of this information is of reduced reliability, due mainly to two reasons among many others (see the discussion about what you can and cannot know when analysing a logfile):

HTTP is a stateless protocol. Each and every request results in a new, independent connection that opens and closes and thus can't be reliably linked to another one made from the same IP address. This is even more so when IPs are assigned dynamically, i.e. different users can appear under the same IP at different times.

Many "visits" are created with a "sessionizer" that gathers all the requests coming from the same IP address and considers them part of the same session if the time lapse between any two consecutive requests is less than a certain threshold.

It's impossible to ensure that all of them belong to the same visit by the same person. It's equally impossible to know how long a particular user has viewed a web page, or which real sequence of pages, the path within the web, our user has followed.
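The sessionizing heuristic described above can be sketched in a few lines. The 30-minute timeout is a common but arbitrary choice, and the whole function is an illustration of the heuristic, not any particular product's algorithm:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # illustrative gap threshold between requests

def sessionize(requests):
    """Group (ip, timestamp) pairs, assumed sorted by time, into sessions."""
    sessions = {}   # ip -> list of sessions, each a list of timestamps
    last_seen = {}  # ip -> timestamp of that IP's previous request
    for ip, ts in requests:
        if ip in last_seen and ts - last_seen[ip] <= TIMEOUT:
            sessions[ip][-1].append(ts)               # gap small: same session
        else:
            sessions.setdefault(ip, []).append([ts])  # gap large: new session
        last_seen[ip] = ts
    return sessions

reqs = [
    ('127.0.0.1', datetime(2005, 10, 10, 13, 55)),
    ('127.0.0.1', datetime(2005, 10, 10, 14, 5)),   # 10 min later: same session
    ('127.0.0.1', datetime(2005, 10, 10, 16, 0)),   # ~2 h later: new session
]
print(len(sessionize(reqs)['127.0.0.1']))  # 2
```

Note how the heuristic embodies exactly the limitation discussed above: it groups requests by IP and time gap, with no way of telling whether two sessions, or even two requests in one session, come from the same person.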

Many pages are served from intermediate server caches, leaving our server completely unaware that another server has delivered the cached pages and that, consequently, a user has seen them without "physically" reaching ours. The use of caches on the Internet is not only convenient but indispensable to avoid collapse under the ever-increasing traffic, but it limits our knowledge about the usage of the web.

So, it's impossible to know how many pages have really been seen.

Summarising: from logfile analysis alone, you can't really know the number of visitors, the number of visits or the identity of the users. Nor can you reliably establish the paths they have followed within the website. As a corollary, you can't know how long they have been reading a web page either.

Nevertheless, this doesn't mean that the information that can be derived from logfile analysis, although incomplete, isn't valuable.

Firstly, if we don't have a website where our users are obliged to open a session via login and password, logfile information is most probably the only source of information we have.

Although this information may be incomplete, you can reach a great many conclusions just by studying a logfile. For example:

- Which concepts that aren't present on the website are being sought after by our users.

- Which concepts that do exist aren't being found.

- Which zones of our website are most probably the most active ones.

- Regular, repetitive patterns in the paths that sessionizers find tend to correspond to true behaviour patterns.

In the end, you shouldn't buy the idea that logfile analysers can find visits and "unique" visitors, the paths they follow or even the operating system they use, as some manufacturers appear to want us to believe.

Nevertheless, once we are aware of what a logfile really says and what its limitations are, we have powerful tools to understand our website and the use our users make of it, opening a door to decisions that can improve its profitability.