Datamining Apache Logs with PostgreSQL

The hardest thing I had to learn, when I started working with SQL, was to
speak English well. Data processing using a procedural or object-oriented
programming language involves breaking down the entire operation into multiple
steps. These functions, or methods, massage the data into a form that makes it
easier to understand the logic of the process. Results from one function can
pass into other functions that eventually produce the desired result. In other
words, the quality of the final answer depends on the quality of every step:
GIGO (garbage in, garbage out).

However, performing data manipulation using SQL in a similar fashion is
simply not possible. Composing SQL statements is very much an act of oratory.
The process is similar to that of explaining your actions to another person.

This article plumbs the depths of your mastery of the spoken word by showing
how you can compose the necessary instructions to datamine an Apache system
log file that you've uploaded into a PostgreSQL database. I'll use only
statements defined in the SQL-92 and SQL-99 standards, and I won't use any
user-defined functions.

For the purposes of this article, I assume that you have a working
PostgreSQL database engine. Refer either to your distribution's documentation
or the PostgreSQL home page to learn how
to set up your system.

Working with psql, the PostgreSQL Console Client

This article demonstrates two kinds of commands: SQL statements and psql
meta commands. SQL commands always end with a semicolon. psql meta commands
always begin with a backslash. SQL commands are case insensitive, while psql
meta commands are case sensitive.
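In a session, the difference looks like this (a minimal illustration; \d with no arguments lists the relations in the current database):

```sql
-- An SQL command: case insensitive, terminated by a semicolon.
SELECT current_date;

-- A psql meta command: case sensitive, introduced by a backslash.
\d
```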

The Apache Log File

Properly mined, system log files provide an important source of information
to the sysadmin, the webmaster, and the nontechnical members of the team, such as those in sales
and senior management. The challenge is in translating the data into a form
that reveals significant patterns.

Here's an example of a record that syslog captures from Apache (this is one
line):
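A representative entry (the client address, request, and byte count here are hypothetical) looks like this:

```
10.1.2.3 - - [17/Dec/2004:16:30:41 -0500] "GET /mydoc.pdf HTTP/1.1" 200 34903
```

Note that the record begins with the client's IP address and that the 20-character date stamp follows the opening square bracket, which is what the parsing queries below rely on.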

Syslog data can be quite detailed, so finding what you want is a question of
setting up the configuration instructions correctly. Refer to the syslog and
Apache documentation for more information.

Datamining the Log File, the Theory

The first step is uploading the Apache log file into a temporary session
table on a PostgreSQL database. This table will disappear as soon as the
psql client has shut down its connection to the database. Each
line of data in the log file is one record in the table.

The second step is processing the table by parsing critical data into
another table with its own columns. In this case, the table consists of three
columns: the IP address, the date stamp, and a copy of the individual record
from the first table.

Log in to the template1 database using the console client
psql as the superuser, postgres, and create a
database called apachelog:
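A minimal session might look like this (adjust the user name and authentication to match your installation):

```
$ psql -U postgres template1
template1=# CREATE DATABASE apachelog;
template1=# \c apachelog
```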

Note: There is always a database named template1 in a
PostgreSQL cluster.

Create a temporary session table called tmp_apache, consisting
of one column named record with the text
datatype:

CREATE TEMPORARY TABLE tmp_apache(record TEXT);

Note: Remember, this table will disappear as soon as you log out of your
session because it's a temporary table.

Copy into the table tmp_apache the data from the log file
directly:

\copy tmp_apache from 'my_apachelog.txt'

You can review the first record of the log with the SQL command:

SELECT record FROM tmp_apache LIMIT 1;

Parsing the Log File

Create a table called http_log. Give it three columns:
log_date, ip_addr, and record:

CREATE TABLE http_log(log_date DATE,ip_addr CIDR,record TEXT);

Suppose you want to explain the above statement to another person. You can
restate it as:

CREATE a TABLE called http_log with
the columns log_date of data type DATE,
ip_addr of data type CIDR and record of
data type TEXT.

The next step is to extract the IP address and timestamp for each
record:

INSERT INTO http_log(log_date,ip_addr,record)
SELECT CAST(substr(record,strpos(record,'[')+1,20) AS date),
CAST(substr(record,0,strpos(record,' ')) AS cidr),
record
FROM tmp_apache;

There are two parts to this statement. The first part is the inner
SELECT query, and the second portion is the outer
INSERT statement.

The SELECT statement returns three columns of information that originate
from the single column in tmp_apache. The query locates both the
IP address and the date stamp and parses them out of the record. Because the
functions used operate only on text strings, the query must cast the results
appropriately before inserting them into columns of other data types.

There are two PostgreSQL functions used in this SELECT query:
substr() and strpos().

The function strpos() returns the integer position of the first occurrence
of a substring within a text string. Here, it identifies the date stamp
by searching the record for the first instance of a square bracket
([). Refer to the HTTP log; this occurs at the beginning of the
date.

The second function, substr(), returns a substring from a
longer text string. It takes three parameters: the text string to search, the
starting position of the substring, and the number of characters to return.
The date stamp is always 20 characters long in the example log file.

The same parsing process repeats for the IP address. The IP address is the
first piece of data at the beginning of the record. The functions therefore
identify the location of the leftmost white space in the record.
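You can watch the two functions cooperate by running them against a literal string (the sample line here is hypothetical):

```sql
-- strpos() locates the opening bracket and the first space;
-- substr() then lifts out the 20-character date stamp and the IP address.
SELECT substr(line, strpos(line, '[') + 1, 20) AS date_stamp,
       substr(line, 0, strpos(line, ' '))      AS ip_text
FROM (SELECT CAST('10.1.2.3 - - [17/Dec/2004:16:30:41 -0500]' AS text)
      AS line) AS sample;
```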

The INSERT INTO statement takes the results from the
SELECT query and adds them directly into the three columns of the
http_log table.

The following psql commands output the results into a text file
named results.txt, then launch vim on it by
using the bang (!) command:

\o results.txt
SELECT * FROM http_log;
\o
\!vim results.txt

Answering Questions

It's now time to answer questions. There's a saying: "The perfect question
supplies its own answer." So too is it true that a well-composed query
statement explains the kind of information it should return.

Remember, you can output all query results to a text file with the
psql \o meta command. The meta command
\e enters the editor mode, thus permitting you to compose
your SQL statements in a formatted manner. The default editor is
vi, but you can change this by setting an editor environment
variable, which can be PSQL_EDITOR, EDITOR, or
VISUAL. For example, this command changes the default editor
to Midnight Commander's internal editor (assuming you're using the bash shell):

$ export EDITOR="mc -e"

List the IP addresses that have visited the site

SELECT ip_addr FROM http_log;

SELECT all records in the column ip_addr FROM
the table http_log.

List the distinct IP addresses (removing duplicates) that have visited the
site

SELECT DISTINCT ON(ip_addr) ip_addr FROM http_log;

Note: DISTINCT ON is a nonstandard clause found only in
PostgreSQL.

SELECT unique and DISTINCT records ON
the column ip_addr and return the records from the column
ip_addr FROM the table http_log.

Count the total number of hits recorded in the http_log
table

SELECT count(*) FROM http_log;

SELECT and count all records FROM the
table http_log.

Count all IP addresses (including repetitive hits from the same IP address)
that have visited the site for each day

SELECT log_date,count(ip_addr) AS "Visits"
FROM http_log
GROUP BY log_date
ORDER BY log_date DESC;

SELECT to return data from the column log_date and
count the number of ip_addr for each date. Label the
IP address counted column's title AS Visits FROM the table
http_log. GROUP the results by the
log_date and sort them in ORDER BY the
log_date in DESCending order.

GROUP BY is a clause used when one or more of the
returned columns holds a value that is the product of a function. This
function processes a range, or group, of rows. For example, this last query
has a column whose total involves counting the number of IP addresses,
count(ip_addr), for each logged date,
log_date. Such a function, known as an aggregate
function, can be any kind of function that operates on a group of values.
Examples include obtaining the average, maximum, or minimum value of a given
group of numbers.
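Any other aggregate works the same way. For example, this (illustrative) query applies max() to the daily counts to report the busiest day's hit total:

```sql
-- Find the largest single-day hit count in http_log.
SELECT max(visits) AS "Busiest day"
FROM (SELECT log_date, count(ip_addr) AS visits
      FROM http_log
      GROUP BY log_date) AS daily;
```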

Count the unique IP addresses (distinct hits) that have visited the site
for each day

SELECT log_date,count(ip_addr) AS "Visits"
FROM (
SELECT DISTINCT ON(log_date,ip_addr)
log_date,ip_addr FROM http_log
) AS http_counter_tmp
GROUP BY log_date
ORDER BY log_date DESC;

SELECT all data from the columns log_date and
count the number of ip_addr for each date. Label the
column for the counted numbers AS Visits. This data comes
FROM another query that will SELECT DISTINCT values
ON the columns log_date and ip_addr,
respectively, for the data returned from the columns log_date and
ip_addr FROM the table http_log. Save this
subselect AS a temporary table called
http_counter_tmp. GROUP the results BY
log_date and ORDER them BY log_date in a
DESCending manner.

Did you notice how the last query used two SELECTs? A nested
SELECT statement is called a subselect, and it is part of the
SQL standard. One of the advantages of a subselect lies in the DBA's ability
to group and express complex, multiple operations as a single statement. Among
other things, this makes it easier to understand the logic behind an operation
as you read the instructions with minimal extraneous documentation.

Count the daily number of downloads of mydoc.pdf

SELECT log_date, count(ip_addr) AS "mydoc.pdf downloads"
FROM http_log
WHERE record ~ 'mydoc.pdf'
GROUP BY log_date
ORDER BY log_date DESC;

SELECT all data from the columns log_date and
count the number of ip_addr for each date. Label the
counted column's title AS mydoc.pdf downloads. The columns come
FROM the table http_log. Return data from all rows
WHERE the record column contains the text string
mydoc.pdf. GROUP the results BY log_date
and ORDER them BY log_date in
DESCending order, from the newest to the oldest dates.

Count any exploit attempts that use very long strings
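One way to write such a query (a sketch matching the restatement that follows) is:

```sql
-- Unusually long records are often a sign of overflow or injection attempts.
SELECT char_length(record) AS "Record Length",
       ip_addr,
       log_date
FROM http_log
ORDER BY char_length(record) DESC, log_date DESC;
```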

Find exploit attempts by SELECTing the character
length of the record, labeled AS the
Record Length. Include also the ip_addr and the
log_date FROM the table http_log. ORDER
the results BY the length of the record
string and the log_date both in a DESCending
manner.

Note: You should be able to scroll through the results using the
psql console client if your binaries have readline library support.
Otherwise, use the meta command \o to output the results into a
text file that you can review with the appropriate editor.

Generating the Report

Reports should be brief and yet furnish complex information in a clear
format. They are, however, not necessarily as easy to make as they are to
read.

This report presents three figures for each day:

The number of visits to the web site.

The number of downloads of a PDF document, "SRAA whitepaper," on the
web site.

The number of downloads of a second PDF document, "Informix whitepaper."

What makes this example interesting is that it assumes that only one of the
three tables, T1, has inputs for every day. Otherwise the queries
could have been simpler. The other two, T2 and T3,
will have infrequent inputs: some days there will be downloads, and other
days there won't. The challenge is to put all the data for all three tables on
the same row for the same day. The DISTINCT clause used in tables
T2 and T3 prevents counting the same person more than
once if they download a PDF document multiple times on the same day.

The report uses two queries, lest the SQL statement grow too long to
understand easily. The first statement saves the results in the table
webhits_tmp. The second query statement generates the actual
report.

SELECT DISTINCT ON(t1.log_date,visits,"SRAA pdf","Informix pdf")
t1.log_date,
t1.visits,
CASE WHEN t1.log_date=t2.log_date THEN t2."SRAA Whitepaper" END AS "SRAA pdf",
CASE WHEN t1.log_date=t3.log_date THEN t3."Informix Whitepaper"
END AS "Informix pdf"
INTO TEMP TABLE webhits_tmp
FROM t1,t2,t3;
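The queries defining t1, t2, and t3 are not listed above; one plausible sketch of that FROM clause, assuming the hypothetical file names sraa.pdf and informix.pdf, is:

```sql
-- t1: total hits per day (has a row for every logged day).
-- t2, t3: distinct downloaders per day of each whitepaper.
FROM (SELECT log_date, count(ip_addr) AS visits
      FROM http_log
      GROUP BY log_date) AS t1,
     (SELECT log_date, count(*) AS "SRAA Whitepaper"
      FROM (SELECT DISTINCT ON (log_date, ip_addr) log_date, ip_addr
            FROM http_log
            WHERE record ~ 'sraa.pdf') AS d2
      GROUP BY log_date) AS t2,
     (SELECT log_date, count(*) AS "Informix Whitepaper"
      FROM (SELECT DISTINCT ON (log_date, ip_addr) log_date, ip_addr
            FROM http_log
            WHERE record ~ 'informix.pdf') AS d3
      GROUP BY log_date) AS t3
```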

Here's one way to express the above query:

SELECT only records that have unique and DISTINCT
values ON the columns t1.log_date, visits, SRAA pdf,
and Informix pdf, respectively. Return data from the columns
t1.log_date, t1.visits, "SRAA pdf" and "Informix
pdf". In the CASE WHEN t1.log_date is equal to
t2.log_date, THEN return t2."SRAA Whitepaper" and
END the expression, labeling it AS "SRAA pdf". In the
CASE WHEN t1.log_date is equal to
t3.log_date, THEN return t3."Informix Whitepaper" and
END the expression, labeling it AS "Informix pdf".
Place the results of the statement INTO the TEMPorary
TABLE webhits_tmp, taking the columns FROM the tables
t1, t2, and t3, respectively.

Here's the second statement:

SELECT DISTINCT ON(log_date,visits) *
FROM webhits_tmp
ORDER BY log_date DESC;

The explanation is:

SELECT only records that have unique and DISTINCT
values ON the columns log_date and
visits. Return all the columns FROM the table
webhits_tmp and ORDER the output BY
log_date in a DESCending manner.

Here's what the report might look like:

 log_date   | visits | SRAA pdf | Informix pdf
------------+--------+----------+--------------
 2004-12-17 |    320 |       10 |
 2004-12-16 |    540 |       30 |           10
 2004-12-15 |    510 |      500 |            2
            |    600 |       15 |

Caveats

Keep in mind that the solutions presented in this article are not
necessarily the best way to achieve your goals. Excellent SQL statements not
only return accurate data, but they also do it in the fastest possible manner.

You can often phrase a SQL statement any number of ways and it will still
return the same result. However, some statements will run faster than others,
because the database engine performs the logic of solving your query
differently. The composition of a fast query requires an intimate knowledge of
your database and the type of hardware you run.

Conclusion

Composing good SQL statements involves not only a logical approach to the
problem but also the capacity to express the solution in a concise manner that
actually makes sense when you open your mouth.

Robert Bernier
is the PostgreSQL business intelligence analyst for SRA America, a
subsidiary of Software Research America (SRA).