A common technique for getting data summaries out of log files is to use a log file analyser. A small number of freely available applications do this type of analysis, one of which is Analog. It hasn’t been in active development for a few years, but its output is similar to that of many comparable applications. Let’s take a look at what it can tell us based on a four-week sample of log files from our media.podcasts.ox.ac.uk fileserver. As Analog breaks its analysis down into sections, I’ll use some of those sections to present its findings along with a little bit of feedback.

One thing to note here is that Analog’s reports are aimed at webservers mostly serving webpages. Since these pages tend to consist of an HTML file plus associated images and other media, it focuses on pages and content types. As our media server is really only serving a limited range of file types, and none of those are HTML pages, this makes categories like “Successful requests for pages” a little redundant. That said, it’s not clear at this stage what the 6,579 pages it has counted refer to.

We don’t presently have any data on an “average” file size, so we can’t estimate the data transferred per day ourselves, but the figure in the summary here does appear, at a casual glance, to be rather high. Certainly a regular website administrator, or anyone being charged for bandwidth, would be very concerned by it, although it isn’t too surprising given that media files range from 10 megabytes to 1,000 megabytes in size. However, it also appears to be very wrong, as this level of data transfer equates to 24.5 Gbit/s, which is far in excess of our network capacity.
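As a rough sanity check on that rate, the arithmetic can be sketched in a few lines of Python. Analog’s total isn’t reproduced here, so this works backwards from the 24.5 Gbit/s figure. The 28-day period is my assumption for the roughly four weeks of sample data, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_bitrate_gbits(total_bytes: float, days: float) -> float:
    """Average transfer rate in gigabits per second over the whole period."""
    total_bits = total_bytes * 8
    return total_bits / (days * SECONDS_PER_DAY) / 1e9


def bytes_for_bitrate(gbits_per_sec: float, days: float) -> float:
    """Total bytes that a sustained rate would move in the given number of days."""
    return gbits_per_sec * 1e9 / 8 * days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumption: 28 days stands in for the "roughly four weeks" of logs.
    implied = bytes_for_bitrate(24.5, days=28)
    print(f"{implied / 1e15:.2f} PB")  # total implied by a sustained 24.5 Gbit/s
```

Under those assumptions, sustaining that rate for the whole period would mean moving several petabytes, which gives a feel for why the reported total looks implausible.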

The Yearly, Quarterly, Monthly, Weekly and Daily reports can be skipped here because they just show total requests for the given period in the sample data. As this was roughly four weeks taken around the start of September 2010 (i.e. just before the start of the LfI Project) the Yearly, Quarterly and Monthly show largely one line, like so:

Fig 1: Analog's Yearly summary

Again, we see the focus on pages here, and the bar chart is based on pages, not requests. As our file types are not recognised as pages, this doesn’t really help us[1].

Daily Summary

As the report states: “This report lists the total activity for each day of the week, summed over all the weeks in the report.” This is slightly more interesting in terms of averages, but again, the representation is focussed on pages, not podcasts.

Fig 2: Analog's Daily Summary report

Comparing the requests to the pages here, we can see that a chart of requests would show Friday as the busiest day for requests, rather than Sunday for “pages”. I’m not sure that much weight can be given to any commentary about the daily pattern of requests, because not all requests are equal and it’s not clear from this summary whether downloads really are mostly focused on the end of the week.

Hour of the Week Summary

This might normally interest people who are looking for a time based pattern in downloads. Again the charts show pages rather than requests, but plotting the request numbers gives us the following picture:

Fig 3: Analog's Hour of the Week Summary plotted using requests

The time data is, I feel, a little deceptive without context. We can see the largest surge in requests around the early hours of Friday morning (times are GMT), which would suggest that this demand is either automated or coming from timezones ahead of us (e.g. the Far East). As a general rule, most websites see peaks in page traffic in the evenings, usually as North American users wake up and go online. However, this data has also had a bit of geoip analysis done on it, and we know from that analysis that more than 60% of the IP addresses making requests were based in China. We can therefore suggest that this Friday morning surge is afternoon/evening in China, and perhaps typical of the trend that web usage happens later in the day.

As an overall summary, I’m not convinced that it offers any useful insight, and I think it needs to be analysed in the context of geography to determine any form of usage pattern. What this might eventually tell us is that our podcasts should be released on particular days in order to attract the most attention when users come looking for new material.

The Hourly Summary is in effect repeating the above data, but for a 24 hour view (so merging the above days). The trend there is that most requests are coming in between 5am and 5pm with a peak around 8am.
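To look at these hourly patterns in the context of geography, the hour buckets can be recomputed with a timezone offset applied. The sketch below assumes Apache-style timestamps and uses a fixed UTC+8 offset to stand in for China. The function name and the sample timestamps are invented for illustration and are not taken from our logs.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumed Apache-style timestamp, e.g. "03/Sep/2010:08:15:00 +0000".
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
UTC_PLUS_8 = timezone(timedelta(hours=8))  # fixed offset, no daylight saving


def requests_per_hour(timestamps, tz=timezone.utc):
    """Count requests per (weekday, hour) after shifting into the given timezone."""
    counts = Counter()
    for raw in timestamps:
        local = datetime.strptime(raw, LOG_TIME_FORMAT).astimezone(tz)
        counts[(DAYS[local.weekday()], local.hour)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "03/Sep/2010:08:15:00 +0000",
        "03/Sep/2010:08:40:00 +0000",
        "03/Sep/2010:17:05:00 +0000",
    ]
    print(requests_per_hour(sample))              # bucketed in GMT
    print(requests_per_hour(sample, UTC_PLUS_8))  # same requests, shifted to UTC+8
```

With this, a request logged at 08:00 GMT on a Friday lands in the 16:00 Friday bucket for UTC+8, so the 8am GMT peak would correspond to mid-afternoon for a visitor in China.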

The Quarter Hour Report, Quarter Hour Summary, Five Minute Report and Five Minute Summary are more of the above, but in smaller increments, so I’m not going to fill space with them here.

Domain Report

This is where most basic log file analysis would attempt to derive useful data or trends regarding the whereabouts of the visitors. This is based on the Reverse DNS lookup (as discussed in Sifting Signal From Noise). The chart, as you’ve seen, looks like this:

Fig 4: Analog's Domain Name Analysis of a sample of data from media.podcasts

As discussed, this says that 61.51% of all the requests came from IP addresses without an associated domain name. The next largest blocks (.net and .com) are traditionally global domain extensions, even if perhaps most of them are based in North America. Where the average viewer gets excited is when they see the country-based Top Level Domain extensions (TLDs) such as .cn, .jp, .uk, etc. and try to make assumptions based on the proportion of users from that country. Since I’ve already discussed this, I’ll move on, but will touch on this domain analysis point again in In My Domain.
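The bucketing behind a report like this can be sketched roughly as follows. This is not Analog’s own algorithm, just a minimal illustration: a host that is still a bare IP address counts as unresolved, and anything else is grouped by its last label. The function names and the example hostnames are hypothetical.

```python
from collections import Counter
import ipaddress


def domain_bucket(host: str) -> str:
    """Return the top-level domain of a resolved hostname, or '[unresolved]' for a bare IP."""
    try:
        ipaddress.ip_address(host)
        return "[unresolved]"
    except ValueError:
        # Not an IP address, so treat it as a hostname and keep the last label.
        return "." + host.rstrip(".").rsplit(".", 1)[-1].lower()


def domain_report(hosts):
    """Percentage of requests per bucket, largest first."""
    counts = Counter(domain_bucket(h) for h in hosts)
    total = sum(counts.values())
    return {bucket: round(100 * n / total, 2) for bucket, n in counts.most_common()}


if __name__ == "__main__":
    # Hypothetical hosts, one entry per request.
    print(domain_report(["203.0.113.7", "203.0.113.8", "a.example.net", "b.example.cn"]))
```

Note that the unresolved bucket says nothing about where those visitors are, which is why a geoip lookup on the raw addresses gives a very different picture from the TLD breakdown.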

Organisation Report

Analog makes an attempt to do a breakdown of the most common “organisations” that computers belong to by (roughly) looking for the most occurrences of partial DNS entries (i.e. slightly more than the TLDs used in the Domain Report) and large classes of IP address (i.e. large groups likely registered to the same organisation). This data is decidedly more suspect and unreliable (see the domains file for details from Analog on how this is attempted) but is similar to the approach I’m taking in the post In My Domain.

Fig 5: Analog's Organisation Report

It’s not clear who or what 163data.com.cn is, but a quick bit of research suggests it is a Chinese Internet Service Provider. As one of the more common sources of requests, this would tally with the large proportion of China-based IP addresses in our data sample. Another one of interest is the akamaitechnologies.com section, as we are aware that Apple uses Akamai’s Content Delivery Network for their iTunes U portal. Again, these are high-level summaries, not particularly clear about their derivation, and ultimately a little too weak to base any worthwhile opinions on. It may be that a better tool and a clearer set of organisational definitions are needed for this to be helpful.

Host Report

This looks at the frequency of IP addresses in the data sample. Whilst Analog states that “The Host Report lists all computers which downloaded files from you” (their emphasis, not ours), we already know this is wrong, as detailed in What’s In An IP Address. Given the large number of hosts (approximately 24,000 distinct IP addresses were found in this data sample), appearing as a segment in this report could mean one or more of the following:

A computer made a *lot* of requests for content (which means getting the same podcast many times over since the number of requests here is greater than the 2000+ podcasts offered by Oxford)

The IP address belongs to an ISP and they are sharing it between many computers

The requests may not have been full downloads, perhaps only partial content or other *light touch* type requests (e.g. a monitoring service)

Looking at Analog’s output we see:

Fig 6: Analog's Host Report

Looking at this, the few hosts it can plot can be investigated fairly easily. The first two are IP addresses belonging to Vodafone Germany, and are therefore likely to be ISP addresses masking many users and computers. The same goes for 163data.com.cn, as mentioned in the Organisation Report section. XO.Net is also a large internet traffic provider, so is also likely to represent many computers and users.

Of the Top 50 hosts identified, many are unresolved IP addresses that further investigation will likely reveal to be ISPs. A number of them belong to Akamai Technologies, as mentioned in the Organisation Report. 60,000 requests or so came from a network at Cranfield University, likely from their student campus machines (DNS refers to “resnet”, a term commonly used to describe NAT services for residential provision). 60,000 requests came from a machine in Oxford which we know is used to test availability of files, so these should probably be excluded from any final count of downloads. See the post on In My Domain for more details on this sort of breakdown.
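A per-host tally of this kind, with known monitoring machines left out, can be sketched as below. It assumes the client address is the first whitespace-separated field of each log line, as in the Common Log Format. The function name and all addresses shown are placeholders, not the real hosts from our sample.

```python
from collections import Counter


def top_hosts(log_lines, exclude=(), limit=50):
    """Count requests per client address (assumed to be the first field of each line)."""
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        host = fields[0]
        if host not in exclude:
            counts[host] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical log lines; 203.0.113.99 plays the role of a monitoring machine.
    lines = [
        '192.0.2.10 - - [03/Sep/2010:08:15:00 +0000] "GET /a/talk.mp4 HTTP/1.1" 200 1000',
        '192.0.2.10 - - [03/Sep/2010:08:16:00 +0000] "GET /a/talk.mp4 HTTP/1.1" 206 500',
        '198.51.100.5 - - [03/Sep/2010:08:17:00 +0000] "GET /b/talk.mp3 HTTP/1.1" 200 900',
        '203.0.113.99 - - [03/Sep/2010:08:18:00 +0000] "GET /a/talk.mp4 HTTP/1.1" 200 1000',
    ]
    print(top_hosts(lines, exclude={"203.0.113.99"}))
```

A count like this still cannot distinguish one busy computer from an ISP address shared by many, so it only narrows down which hosts are worth investigating by hand.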

The Host Redirection Report contains 6 request entries. Since we’re not presently employing redirection on our fileserver, it is slightly surprising that there are any at all, but they are easily dismissed as a tiny error in the overall set of 20+ million requests being analysed.

Host Failure Report

This report lists the requests that resulted in errors. Again, the specific errors encountered aren’t explored in this summary.

Fig 7: Analog's Host Failure Report

Whilst the large red block is an address located in China, little more can be determined about its nature, and without looking into the log files themselves, we can’t see what it was requesting and what error it was receiving. The same is true of the rest of the nearly 22,000 “error” requests reported in the log files. Even without knowing the nature of these errors, they can be discounted from a basic count of requests on the grounds that they failed, and we are typically only interested in requests that succeeded in sending a podcast to a visitor. However, knowing the nature of these failures would likely help alert the Podcasting Service to any common errors (these errors equate to roughly 0.1% of all requests). Knowing why two IP addresses were responsible for nearly half of the errors recorded could also improve the service quality.

Browser Report

This is normally a popular statistic to help inform webmasters about the capabilities of visitors’ web browsers in relation to the amount of traffic, i.e. helpful when wanting to introduce a new feature that may require a recent version of the software to work. Let’s see what it has to say for us…

Fig 8: Analog's Browser Report

This looks very wrong. Nagios-plugin refers to an automatic software tool typically used to monitor service status. Indeed, if we don’t notice the small print that says this refers to “requests for pages” then we might start to think that our log files are being filled with data that could be discarded, which contradicts what the other reports have been telling us. Looking at the data table Analog provides for the “Top 40 browsers by the number of requests for pages, sorted by the number of requests for pages” we can see that the data is a little more varied.

As we can see from the above table, the Browser string contains a wealth of information, which is largely being ignored by Analog. Missing from this is any browser string that identifies Apple’s iTunes Application, something we believe should be very popular and feature highly in our data – but doesn’t appear in the above list. Again, it may be possible to correct this and derive some useful information if we can reconfigure Analog’s definition of a page.

Browser Summary

This is largely the same as the above report, but with an extra level of abstraction, which would normally help in determining browser models in general rather than some of the specifics revealed above. Whilst the graphic is near identical to the previous chart, the data table is a little clearer.

[Listing browsers with at least 1 request for a page, sorted by the number of requests for pages.]

| no. | reqs | pages | browser |
|---|---|---|---|
| 1 | 6315 | 6315 | check_http |
|  | 6315 | 6315 | check_http/1 |
| 2 | 16002261 | 22 | MSIE |
|  | 15852366 | 8 | MSIE/6 |
|  | 7 | 7 | MSIE/4 |
|  | 25222 | 5 | MSIE/7 |
|  | 58743 | 1 | MSIE/8 |
|  | 145 | 1 | MSIE/9 |
| 3 | 18 | 18 | Sosospider+(+http: |
|  | 18 | 18 | Sosospider+(+http://help |
| 4 | 6158 | 11 | Netscape (compatible) |
| 5 | 89137 | 9 | Firefox |
|  | 59 | 4 | Firefox/1 |
|  | 85690 | 4 | Firefox/3 |
|  | 1288 | 1 | Firefox/2 |
| 6 | 29 | 4 | Java |
|  | 29 | 4 | Java/1 |
| 7 | 105 | 4 | lwp-request |
|  | 105 | 4 | lwp-request/5 |
| 8 | 11 | 3 | Wget |
|  | 11 | 3 | Wget/1 |
| 9 | 4 | 2 | Nokia6820 |
|  | 4 | 2 | Nokia6820/2 |
| 10 | 1618 | 2 | Mozilla |
|  | 1002 | 2 | Mozilla/1 |
| 11 | 42 | 1 | MLBot (www.metadatalabs.com |
|  | 42 | 1 | MLBot (www.metadatalabs.com/mlbot) |
| 12 | 2 | 1 | libwww-perl |
|  | 2 | 1 | libwww-perl/5 |
| 13 | 1 | 1 | Microsoft Windows Network Diagnostics |
| 14 | 68441 | 1 | Safari |
|  | 41095 | 1 | Safari/533 |
| 15 | 1 | 1 | Made by ZmEu @ WhiteHat Team – www.whitehat.ro |
| 16 | 1 | 1 | Kugoo*JFI* |
| 17 | 1 | 1 | Snapbot |
|  | 1 | 1 | Snapbot/1 |
| 18 | 5770 | 1 | Opera |
|  | 5 | 1 | Opera/8 |
|  | 4080151 | 0 | [not listed: 162 browsers] |

This summary shows a wider range of browser types, but still appears to miss the expected identification for the iTunes Application. Further investigation is required here to make sense of this data.
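One way to investigate is to classify the raw browser strings ourselves, with an explicit rule for iTunes. The sketch below is a minimal illustration only: the rule list, the labels and the sample strings are my own assumptions (including the assumption that iTunes identifies itself with a string starting “iTunes/”), and they are not Analog’s browser definitions.

```python
import re
from collections import Counter

# Ordered (label, pattern) pairs; the first match wins. These patterns are
# illustrative assumptions, not Analog's own rules.
AGENT_RULES = [
    ("iTunes", re.compile(r"\biTunes/")),
    ("Monitoring", re.compile(r"check_http|nagios", re.I)),
    ("Robot", re.compile(r"bot\b|spider|crawler", re.I)),
    ("MSIE", re.compile(r"\bMSIE \d")),
    ("Firefox", re.compile(r"\bFirefox/")),
    ("Safari", re.compile(r"\bSafari/")),
]


def agent_family(user_agent: str) -> str:
    """Return the label of the first rule that matches, or 'Other'."""
    for label, pattern in AGENT_RULES:
        if pattern.search(user_agent):
            return label
    return "Other"


def family_counts(user_agents):
    """Tally requests per browser family, one user-agent string per request."""
    return Counter(agent_family(ua) for ua in user_agents)


if __name__ == "__main__":
    # Hypothetical user-agent strings for illustration.
    print(family_counts([
        "iTunes/9.2.1 (Macintosh; Intel Mac OS X 10.6.4) AppleWebKit/533.16",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "check_http/v1.4.14 (nagios-plugins 1.4.14)",
    ]))
```

Putting the iTunes rule first matters, because a string that also mentions a rendering engine could otherwise be absorbed into a generic browser family and disappear from the summary.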

Operating System Report

This report would often be slightly helpful for webmasters in determining likely additional capabilities of a visitor’s computer – e.g. for additional file types that might be understood by the computer. In terms of podcasting analysis we have some interest in this, but not as much as a regular webmaster might. Podcasts by their design are intended to be downloaded and played on a wide range of devices. Whilst many of these devices may be mobile media players (e.g. iPods) that need to have their content downloaded onto a parent computer before being copied to the device, an increasing number are able to access the content directly from the network (something we recognise in providing our podcasts via our Mobile Oxford platform). The graphic that accompanies this section is practically a single red circle that says “OS Unknown”, and you can see why when you look at the data table summary shown below.

[Listing operating systems, sorted by the number of requests for pages.]

| no. | reqs | pages | OS |
|---|---|---|---|
| 1 | 554840 | 6352 | OS unknown |
| 2 | 17886716 | 27 | Windows |
|  | 4303531 | 12 | Windows XP |
|  | 1348608 | 7 | Unknown Windows |
|  | 6 | 4 | Windows 95 |
|  | 65814 | 3 | Windows 98 |
|  | 12167657 | 1 | Windows 2000 |
|  | 3 | 0 | Windows CE |
|  | 6 | 0 | Windows ME |
|  | 3 | 0 | Windows NT |
|  | 1088 | 0 | Windows Server 2003 |
| 3 | 1414 | 10 | Known robots |
| 4 | 70891 | 7 | Unix |
|  | 70891 | 7 | Linux |
| 5 | 1745897 | 2 | Macintosh |
| 6 | 308 | 0 | Symbian OS |

Due to the focus on pages, this table is poorly organised for our needs. However, the mix of platform information is interesting, and should be compared to the platform data provided by the Apple iTunes U datasheets to see if there is a consensus. As it stands, I am not confident this data is proportionally valid, and I suspect that Analog’s interpretation of the browser strings may require some extra work, or that the analysis may need to be redone using another tool.

As we have stated previously, the high proportion of Partial Content requests needs to be investigated further and understood. The various 4xx codes seem to tally with the error reports we have discussed previously, so this provides some hint as to what may be going wrong with some requests, but not enough detail to actually fix any problems.
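To dig into the status codes directly, they can be tallied from the raw log lines. This is a minimal sketch assuming Common Log Format style lines, where the status code and byte count follow the quoted request. The regular expression, function names and sample lines are illustrative assumptions rather than anything Analog provides.

```python
import re
from collections import Counter

# Assumed layout: ..."GET /path HTTP/1.1" 206 65536 ...
# Captures the three-digit status that follows the quoted request.
STATUS_RE = re.compile(r'"\s(\d{3})\s(\d+|-)')


def status_counts(log_lines):
    """Count requests per HTTP status code."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def share(counts, predicate):
    """Percentage of all counted requests whose status satisfies the predicate."""
    total = sum(counts.values())
    matched = sum(n for code, n in counts.items() if predicate(code))
    return 100 * matched / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical log lines for illustration.
    lines = [
        '192.0.2.1 - - [03/Sep/2010:08:15:00 +0000] "GET /a/talk.mp4 HTTP/1.1" 206 65536 "-" "iTunes/9.2"',
        '192.0.2.1 - - [03/Sep/2010:08:15:05 +0000] "GET /a/talk.mp4 HTTP/1.1" 206 65536 "-" "iTunes/9.2"',
        '192.0.2.2 - - [03/Sep/2010:08:16:00 +0000] "GET /a/talk.mp4 HTTP/1.1" 200 1048576 "-" "-"',
        '192.0.2.3 - - [03/Sep/2010:08:17:00 +0000] "GET /favicon.ico HTTP/1.1" 404 - "-" "-"',
    ]
    counts = status_counts(lines)
    print(share(counts, lambda code: code == 206))        # Partial Content share
    print(share(counts, lambda code: 400 <= code < 500))  # client error share
```

The same tally could be extended to group 206 responses by client and file, which is where any attempt to turn partial requests into a count of whole downloads would need to start.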

File Size Report

This report is summarising the size of the requests being serviced. Analog doesn’t know how big the actual files being downloaded are, but it is reporting on the amount of data being transferred, and interpreting that as the actual file size – even when we know that most of the requests in this data sample are only Partial Content.

Fig 10: Analog's File Size Report

| size | reqs | %bytes |
|---|---|---|
| 0 | 123689 |  |
| 1B-10B | 3489189 |  |
| 11B-100B | 842 |  |
| 101B-1kB | 14274 |  |
| 1kB-10kB | 98050 |  |
| 10kB-100kB | 1334385 |  |
| 100kB-1MB | 535346 |  |
| 1MB-10MB | 1990397 | 0.12% |
| 10MB-100MB | 784067 | 0.72% |
| 100MB-1GB | 11890653 | 99.16% |

As we can see from the number of requests, this covers the 20 million or so entries in the log file data. As previously stated, we don’t have a metric for average file size, but the scale used here is one suited to the relatively tiny sizes typical of webpages, rather than the 10+ MB sizes of multimedia files. It is interesting to note how many requests are for amounts that are too small to be valid downloads (the 3.5 million requests for 10 bytes of content or less), and this could be significant in determining the number of requests that can be discarded. However, if these requests relate to light “touches” on the fileserver from a Content Delivery Network, then they may need to be counted as downloads, as the visitor will have received the file from the CDN itself, which was simply checking that its version is still valid against the original hosted here.
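The request counts in the table can be totalled to check these proportions. The figures below are copied from Analog’s File Size Report above; the band labels are shortened and the helper names are my own.

```python
# Request counts per size band, copied from Analog's File Size Report.
SIZE_BANDS = [
    ("0", 123689),
    ("1B-10B", 3489189),
    ("11B-100B", 842),
    ("101B-1kB", 14274),
    ("1kB-10kB", 98050),
    ("10kB-100kB", 1334385),
    ("100kB-1MB", 535346),
    ("1MB-10MB", 1990397),
    ("10MB-100MB", 784067),
    ("100MB-1GB", 11890653),
]


def total_requests(bands):
    """Sum of requests across all size bands."""
    return sum(n for _, n in bands)


def share_of(bands, labels):
    """Percentage of all requests that fall in the named bands."""
    wanted = sum(n for label, n in bands if label in labels)
    return 100 * wanted / total_requests(bands)


if __name__ == "__main__":
    print(total_requests(SIZE_BANDS))                       # 20260892
    print(f"{share_of(SIZE_BANDS, {'0', '1B-10B'}):.1f}%")  # 17.8%
```

So the bands add up to about 20.26 million requests, and getting on for a fifth of them transferred 10 bytes or fewer, which is a large enough share that the decision to count or discard them will noticeably change any download total.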

File Type Report

Fig 11: Analog's File Types Report

| reqs | %bytes | extension |
|---|---|---|
| 19325730 | 99.98% | .mp4 [MPEG4 video files] |
| 935162 | 0.02% | [not listed: 9 extensions] |

This result suggests that there was a configuration error in the analysis, given the spread of podcasting content file types (.mp3, .epub, .mp4, .pdf, etc). Whilst Album Art image files (.png and .jpg) are also hosted on the same filestore, this shouldn’t be too significant in the overall figures. I will discount this data pending a further review of the file type configuration setup.
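As a cross-check that doesn’t depend on Analog’s file type configuration, the extensions can be tallied straight from the requested URLs. This is a minimal sketch using only the Python standard library; the function names and sample URLs are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def extension_of(request_url: str) -> str:
    """Lower-cased file extension of the URL path, ignoring any query string."""
    path = urlsplit(request_url).path
    return PurePosixPath(path).suffix.lower() or "[none]"


def extension_counts(request_urls):
    """Tally requests per file extension."""
    return Counter(extension_of(url) for url in request_urls)


if __name__ == "__main__":
    # Hypothetical request URLs for illustration.
    print(extension_counts([
        "/dept/series/talk.mp4?CAMEFROM=itunesu",
        "/dept/series/talk.mp3",
        "/dept/series/cover.PNG",
        "/favicon.ico",
    ]))
```

Stripping the query string first is the important step here, since a decorated URL would otherwise look like it has a different extension from the undecorated one.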

Directory Report

Analog's Directory Report

This report could be interesting, but I’m concerned that it might be misleading, as the result seems a little too overwhelming. We know from insider knowledge that the fileservers being analysed here use a directory structure that relates to the university’s departmental structure. Given that this chart plots the proportion of requests for a given URL path (looking at the first level of the file request), it seems unreasonable that so many requests are for this one department, even taking into account the ownership of the most popular feeds in this time period according to the Apple iTunes U data. Perhaps this is on some level a reflection of the trend illustrated by the Weekly Download chart discussed in Fishing With A Broken Net. I think this data needs further verification before taking this chart at face value, though the Request Report discussed below does re-emphasise why this may be a perfectly valid picture for this particular data sample.
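One way to verify the chart would be to recompute the first-level directory split independently. The sketch below groups request URLs by their first path segment; the function names and the department names in the sample are placeholders, not our real directory names.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_directory(request_url: str) -> str:
    """First path segment of the request, e.g. '/physics/' for '/physics/series/talk.mp4'."""
    path = urlsplit(request_url).path
    segments = [s for s in path.split("/") if s]
    # A lone segment without a trailing slash is a file in the root, not a directory.
    if len(segments) > 1 or (segments and path.endswith("/")):
        return f"/{segments[0]}/"
    return "[root directory]"


def directory_shares(request_urls):
    """Percentage of requests per first-level directory, largest first."""
    counts = Counter(top_level_directory(url) for url in request_urls)
    total = sum(counts.values())
    return [(name, round(100 * n / total, 2)) for name, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical request URLs for illustration.
    print(directory_shares([
        "/physics/series/talk.mp4?CAMEFROM=itunesu",
        "/physics/series/talk2.mp4",
        "/history/series/lecture.mp3",
        "/favicon.ico",
    ]))
```

If an independent count like this agrees with Analog’s chart, the dominance of one department is more likely to be a real feature of the sample than an artefact of the tool.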

The Redirection Report is an expansion on the Host Redirection Report, though it handily notes that:

This report lists the files that caused requests to be redirected to another file. (Usually directories with the final slash missing, or CGI scripts that forced redirections.)

As before, I’m going to ignore the 6 requests as being inconsequential errors.

The Failure Report is a helpful expansion on the requests that caused the errors previously mentioned. Here we can see that many are for files that were likely broken links or missing files (and, in the main, quickly remedied), though there is also a large number (4,000+) of requests for “favicon.ico”. This comes from modern web browsers, which sometimes assume they are downloading a webpage and go looking for a common filename and type related to that action. Favicons don’t tend to exist for podcasts.

Request Report

The last report in the Analog outputs is perhaps one of the most detailed (in terms of the data table). The data summary is too long and complex to include in its entirety, so I’ll summarise a few interesting points about it after showing the graphic that accompanies the data.

Analog's Request Report

The data table then lists all files with at least 20 requests in order of the number of requests, but it also provides a breakdown of how it has summarised the requests – specifically that it groups the data by the file URL and ignores the Request String data for counting purposes, but it does reveal the breakdown by the most significant complete URL requests. For example:

The chart shows the figures highlighted in bold, but under those is a breakdown of the most common URLs used. Here we can see some tracking data added to the query string, e.g. “CAMEFROM=itunesu”. We will explore how these URL decorations have been deployed to help track the sources of requests in another post. We can note that whilst Analog can see this data, it doesn’t offer a facility to summarise by it; that is something another tool will have to do.
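Summarising by that query string data is straightforward to sketch with the Python standard library. In the example below, “CAMEFROM” and “itunesu” come from the URLs discussed above; the function name, the file paths and the “rss” value are hypothetical, added only to show how different sources would separate out.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def campaign_counts(request_urls, parameter="CAMEFROM"):
    """Tally requests per (file path, value of one query-string parameter)."""
    counts = Counter()
    for url in request_urls:
        parts = urlsplit(url)
        # Requests without the parameter are grouped under "[none]".
        values = parse_qs(parts.query).get(parameter, ["[none]"])
        counts[(parts.path, values[0])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical request URLs for illustration.
    urls = [
        "/dept/series/talk.mp4?CAMEFROM=itunesu",
        "/dept/series/talk.mp4?CAMEFROM=itunesu",
        "/dept/series/talk.mp4?CAMEFROM=rss",
        "/dept/series/talk.mp4",
    ]
    for (path, source), n in campaign_counts(urls).most_common():
        print(f"{n:>3}  {source:<8}  {path}")
```

Keeping the file path and the parameter value together in the key means the totals can be rolled up either way, per file or per source, without re-reading the logs.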

Hopefully you can now see the basic uses and pitfalls of common log analysis tools.

Carl

Footnotes:

[1] It may be worth trying to see if Analog can be reconfigured to recognise different file extensions as pages. If so, then we may be able to rerun this analysis and get clearer graphical data.

Hi Ken,
No, we largely dropped Analog when we realised that it wasn’t going to get us the range of data we wanted. For example: linking request URLs to our RSS catalogue of items (thus file names become titles of episodes, files can be grouped by feeds and associations, etc); geoip data, which wasn’t part of the mix; being able to track specific query string data as part of a “campaign” report; and being able to filter requests and results on any of the available data, allowing us to look at subsets of the information in detail (ideally through a point and click interface, as most of the people analysing this for trends are not command-line-scripting-technical).

Whilst some of this could have been scripted and grafted onto Analog, the effort required wasn’t judged worthwhile, so we crafted our own log file parser to do the above and provide the answers we needed for the project. It hit a few snags, chiefly that the quantity of data being processed does not sit well in a regular SQL database (though the querying functionality that gave us was the main reason for using it), so there’s a working prototype, but it needs some extra work done to deal with the speed issues.

We’d be interested to hear if you have any further success with getting Analog to parse your own data though – the file types recognition is an element of the parser configuration and fairly easy to do.