In this notebook I am going to show a simple example of analysing an Apache access log with pandas. It's my first experience with pandas, and I am sure there are better and more efficient ways to do some of the things shown here, so comments, suggestions and corrections of my broken English are very welcome. You can send me an email, or create a PR for this notebook on GitHub.

We will need the apachelog module for parsing the log. We also have to know the format of the log, which is set in the Apache config. In my case I have no access to the Apache config, but the hosting provider describes the format on their help page. Below is the format itself and a short description of every element:

%V - the server name according to the UseCanonicalName setting
%h - remote host (i.e. the client IP)
%l - identity of the user determined by identd (not usually used, since it is unreliable)
%u - user name determined by HTTP authentication
%t - time the server finished processing the request
%r - request line from the client ("GET / HTTP/1.0")
%>s - status code sent from the server to the client (200, 404 etc.)
%b - size of the response to the client (in bytes)
\"%{Referer}i\" - the Referer header: the page that linked to this URL
\"%{User-agent}i\" - the User-agent header: the browser identification string
%T - the time taken to serve the request
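The parsing cell itself did not survive in this export. As a rough, testable stand-in, here is a regex that mirrors the field order of the format above (the log line is made up for illustration; it is not apachelog's own API):

```python
import re

# a made-up log line following the format described above
line = ('www.example.org 77.50.248.20 - - [16/Mar/2013:09:02:59 +0400] '
        '"GET /books/Bondarenko.pdf HTTP/1.0" 200 22402000 '
        '"-" "Mozilla/5.0" 2')

# one named group per element of the format string
pattern = re.compile(
    r'(?P<vhost>\S+) (?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
    r'(?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)" (?P<rtime>\S+)')

m = pattern.match(line)
print(m.group('ip'), m.group('status'), m.group('request'))
```

In the notebook itself this job is done by apachelog's parser fed with the hoster's format string.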

Looks like somebody downloaded something big from the website around 9 a.m.

But actually the first thing that you probably want to know is how many visits (hits, really) your site gets, and how they are distributed in time. We resample the series of the variable b into 5-minute intervals and count the number of requests during every time span. In this case it doesn't matter which variable we use; these numbers just indicate how many times information from the website was requested.

In [23]:

df_s = df['b'].resample('5t', how='count')
df_s.plot()

Out[23]:

<matplotlib.axes.AxesSubplot at 0xc14588c>

We can not only count the number of requests per time span, but also calculate the sum of the traffic for every span:
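The cell that builds df_b did not survive in this export. A minimal sketch on synthetic data (the real notebook resamples the parsed log itself; older pandas spelled this `resample('5t', how=['count', 'sum'])`):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the parsed log: response sizes (Mb) at 30 s intervals
idx = pd.date_range('2013-03-16 08:00', periods=100, freq='30s')
df = pd.DataFrame({'b': np.random.rand(100)}, index=idx)

# number of requests and total traffic per 5-minute span
df_b = df['b'].resample('5min').agg(['count', 'sum'])
# df_b.plot(subplots=True) would chart hits and traffic together
```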

On the plot you can see that the number of server requests does not always coincide with the amount of traffic, and the correlation is actually not very high:

In [25]:

df_b.corr()

Out[25]:

          count       sum
count  1.000000  0.512629
sum    0.512629  1.000000

We can have a closer look at the curious morning peak:

In [26]:

df_b['2013-03-16 6:00':'2013-03-16 10:00']['sum'].plot()

Out[26]:

<matplotlib.axes.AxesSubplot at 0xc3f5dac>

It seems that this traffic spike was caused by a single request. Let's find out what this request looks like. Choose all requests with a response size larger than 20 Mb (remember that we converted bytes to Mb):

In [27]:

df[df['b']>20]

Out[27]:

                     Status          b            IP                             Request
Time
2013-03-16 09:02:59     200  21.365701  77.50.248.20  GET /books/Bondarenko.pdf HTTP/1.0

It was a .pdf file of a book (look at the Request field), and this explains the peak in outgoing traffic at 2013-03-16 09:02:59.

Clearly 20 Mb is a large request (for our website, at least :)). But what is the typical size of the server's response? The histogram of response sizes (those smaller than 20 Mb) looks like this:

In [28]:

cc = df[df['b'] < 20]
cc.b.hist(bins=10)

Out[28]:

<matplotlib.axes.AxesSubplot at 0xc52374c>

So most of the files are smaller than 0.5 Mb. In fact, they are even smaller:

In [29]:

cc = df[df['b'] < 0.3]
cc.b.hist(bins=100)

Out[29]:

<matplotlib.axes.AxesSubplot at 0xc5760ec>

Very small responses can be all kinds of stuff, but larger ones are probably some relatively big files. Let's have a look at these little spikes between 0.2 and 0.25 Mb:

In [30]:

cc = df[(df['b'] > 0.2) & (df['b'] < 0.25)]
cc.b.hist(bins=100)

Out[30]:

<matplotlib.axes.AxesSubplot at 0xc57cacc>

Now we can locate the size of the file more precisely and find out its name:
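The lookup cell is missing from this export; a sketch of the idea on hypothetical data (the 0.22-0.23 Mb band and the file name below are made up for illustration):

```python
import pandas as pd

# hypothetical slice of the log with a repeated download in the 0.2-0.25 Mb band
df = pd.DataFrame({
    'b': [0.05, 0.224, 0.224, 0.41, 0.224],
    'Request': ['GET /index.html HTTP/1.0',
                'GET /files/map.png HTTP/1.0',
                'GET /files/map.png HTTP/1.0',
                'GET /other.bin HTTP/1.0',
                'GET /files/map.png HTTP/1.0'],
})

# narrow the band around the histogram spike, then see which URL dominates
band = df[(df['b'] > 0.22) & (df['b'] < 0.23)]
print(band['Request'].value_counts())
```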

I would like to know how the different server responses are distributed in time. Let's first try a not very elegant approach: we create several variables, each containing a time series of only one particular Status value. Then a DataFrame is created out of these time series.
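The cells for this step are missing from the export; the "not very elegant" version could look roughly like this (a sketch on synthetic data):

```python
import pandas as pd

# synthetic stand-in: one status code per request, indexed by time
idx = pd.date_range('2013-03-16 08:00', periods=60, freq='1min')
df = pd.DataFrame({'Status': [200] * 50 + [404] * 10}, index=idx)

# one counted series per status value, then a DataFrame out of them
per_status = {
    code: df[df['Status'] == code]['Status'].resample('10min').count()
    for code in df['Status'].unique()
}
df_status = pd.DataFrame(per_status).fillna(0)
print(df_status)
```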

I would also like to know from which countries people (or robots) come to the website. It turns out that precise geolocation information is not free, but for our purposes the free databases are fine. We will use pygeoip to get information from the GeoIP database, which has to be downloaded from here.

The first 3 are kind of expected, but Tyumen, a Russian city located in the middle of Eurasia thousands of kilometers away from the nearest ocean coast, is a bit of a surprise (the website is about Oceanography). This is something to investigate later on.

We have lats and lons for the data, so it is natural to put them on the map:

Spammers are a huge problem, and it would be nice to filter them out. My website has very low user activity, so if somebody tries to POST something more than a couple of times, it's already suspicious. First we have to find out which request method each request uses. For this we have to analyse the Request column, which contains the request strings. The request method is always located at the beginning of the request string, so it's trivial to extract:

In [70]:

def gmet(x):
    x = x.split()
    return x[0]

We apply this little function to the Request column and create a new Method column:
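That cell did not survive the export; the step itself is just an apply (the sample frame here is synthetic):

```python
import pandas as pd

def gmet(x):
    # the request method is the first whitespace-separated token
    return x.split()[0]

# synthetic stand-in for the log's Request column
df = pd.DataFrame({'Request': ['GET /index.php HTTP/1.0',
                               'POST /administrator/index.php HTTP/1.0']})
df['Method'] = df['Request'].apply(gmet)
```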

Making some Wordpress cron requests... This IP looks familiar... Oh, it's actually the IP of my website - this is my Wordpress doing something. I have to investigate later whether I can disable this thing (it's not working anyway, getting a 403 error) :)

Number two:

In [78]:

df[df.IP=='160.75.185.7'][0:10]

Out[78]:

                     Status         b            IP                                 Request Method
Time
2013-03-16 17:27:12     200  0.002224  160.75.185.7   GET /administrator/index.php HTTP/1.0    GET
2013-03-16 17:27:12     200  0.002224  160.75.185.7   GET /administrator/index.php HTTP/1.0    GET
2013-03-16 17:27:12     200  0.002224  160.75.185.7   GET /administrator/index.php HTTP/1.0    GET
2013-03-16 17:27:12     200  0.002224  160.75.185.7   GET /administrator/index.php HTTP/1.0    GET
2013-03-16 17:27:12     200  0.002224  160.75.185.7   GET /administrator/index.php HTTP/1.0    GET
2013-03-16 17:27:12     200  0.002224  160.75.185.7  POST /administrator/index.php HTTP/1.0   POST
2013-03-16 17:27:12     200  0.002224  160.75.185.7  POST /administrator/index.php HTTP/1.0   POST
2013-03-16 17:27:12     200  0.002224  160.75.185.7   GET /administrator/index.php HTTP/1.0    GET
2013-03-16 17:27:12     200  0.002224  160.75.185.7  POST /administrator/index.php HTTP/1.0   POST
2013-03-16 17:27:12     200  0.002224  160.75.185.7  POST /administrator/index.php HTTP/1.0   POST

This is suspicious: it looks like somebody is trying to get into the website's admin area - a candidate for blocking.

Number three:

In [79]:

df[df.IP=='213.248.47.229'][0:10]

Out[79]:

                     Status         b              IP                                             Request Method
Time
2013-03-16 11:00:34     200  0.012539  213.248.47.229  GET /forum/ucp.php?mode=register&sid=8203c4843...     GET
2013-03-16 11:00:34     200  0.026559  213.248.47.229                       GET /forum/index.php HTTP/1.0     GET
2013-03-16 11:00:34     200  0.012011  213.248.47.229  GET /forum/ucp.php?mode=register&sid=c78851b8f...     GET
2013-03-16 11:00:39     200  0.016414  213.248.47.229  POST /forum/ucp.php?mode=register&sid=12986e06...    POST
2013-03-16 11:00:39     200  0.016414  213.248.47.229          POST /forum/ucp.php?mode=register HTTP/1.0    POST
2013-03-16 11:00:44     200  0.015024  213.248.47.229          POST /forum/ucp.php?mode=register HTTP/1.0    POST
2013-03-16 11:00:45     200  0.015028  213.248.47.229          POST /forum/ucp.php?mode=register HTTP/1.0    POST
2013-03-16 11:00:49     200  0.015024  213.248.47.229          POST /forum/ucp.php?mode=register HTTP/1.0    POST
2013-03-16 11:00:50     200  0.015028  213.248.47.229          POST /forum/ucp.php?mode=register HTTP/1.0    POST
2013-03-16 11:00:55     200  0.015024  213.248.47.229          POST /forum/ucp.php?mode=register HTTP/1.0    POST

This one tries to register on the forum every 5 seconds or so - clearly a forum spammer. Block.