Using Regular Expressions and XML Classes to Parse Your Log Files

Most systems these days can generate log files to store data about the activity of the system. What about when you
are asked to transform all of that data into usable information? I will show you how to use regular expressions
and .NET's XML classes to turn your log files into a dataset to allow you to search, sort, and report on your data.

What We'll Cover

Using the Regex class and regular expression capture groups

Using the XmlTextWriter class

Using the DataSet class

What You'll Need

A basic understanding and working knowledge of the
System.Text.RegularExpressions namespace.

The Game Plan

We need to transform this blob of text into something a little more structured.
You might be thinking, "Hmm, why not import the file into Access using the tab-delimited
wizard?" That solution would be totally OK if we had just one file or a few files. The solution here requires a little more automation. Plus, had
the log files been written in a different format, for example, several lines
per log entry, we'd be in trouble. What we need here is a structured data format;
enter XML.

We can see several benefits from transforming these files into XML. With XML,
we can:

Import the data into any number of programs including Excel or Access.

Directly load a dataset object from this XML, and perform searches on that
dataset, in memory.

Create reports using XSLT.

Pretty much do anything we want with this data, since it is pure
XML.

But how do we perform this magical act? By using the .NET Framework to:

Parse the log file text.

Write XML files from the parsed text.

Parsing the Log File

If you've worked with regular expressions before, you know that using them
is one of the fastest ways of parsing text. In the .NET Framework, the main
class to be used in this area is the System.Text.RegularExpressions.Regex
class. One of the most powerful features of this class is the ability to specify,
within the search pattern, Groups that will easily allow parsing and retrieval
of parts of the text.

For example, given the date 17/08/1975 and the regular expression
(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2})),
I can write code to retrieve any part of the text in the date by
name, like so:
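Here is a minimal sketch of that idea. The class name DateParser and the method name GetPart are made up for illustration; only Regex.Match and the Groups indexer come from the .NET Framework.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class DateParser
{
    // The pattern from the text, with named groups for day, month, and year.
    private const string Pattern =
        @"(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2}))";

    // Returns the text captured by the named group,
    // or null if the input does not match the pattern.
    public static string GetPart(string input, string groupName)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups[groupName].Value : null;
    }
}
```

Calling DateParser.GetPart("17/08/1975", "month") returns the string 08, and asking for "day" or "year" works the same way.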

Note: If you don't understand the code above, you should refer
to the two articles mentioned at the beginning of this
article.

Transforming to XML

Once we have a bunch of data, like the date example before, we
can easily transform it to XML.
The XML representation of the date we have from the previous
example could be represented like this:

<Date>
<day>17</day>
<month>08</month>
<year>1975</year>
</Date>

Outputting this kind of XML used to be a pretty easy, but pretty
error-prone, task. Sure, you could just slap each string into a memory buffer
with XML tags, but the number of errors you could introduce makes this approach pretty untenable.
The XmlTextWriter class in the System.Xml namespace
spares us a lot of details here, and very conveniently abstracts away all of the
"boilerplate" code you need to write, allowing us to concentrate on the content
we wish to write in our XML document.

To show just how easy it is to use this class, here's a class that takes in
the Match object from the last example, writes an XML document
with this data into a memory buffer, and returns this XML output:
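A sketch along those lines might look like the following. The names DateXmlWriter and ToXml are assumptions made for this example, and it expects a Match produced by the date pattern shown earlier, with groups named day, month, and year.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical class name, used for illustration only.
public static class DateXmlWriter
{
    // Writes the day, month, and year groups of the match
    // as a small XML document and returns it as a string.
    public static string ToXml(Match m)
    {
        StringBuilder sb = new StringBuilder();
        using (StringWriter sw = new StringWriter(sb))
        using (XmlTextWriter writer = new XmlTextWriter(sw))
        {
            writer.Formatting = Formatting.Indented;
            writer.WriteStartDocument();          // the <?xml ...?> declaration
            writer.WriteStartElement("Date");     // opens <Date>
            writer.WriteElementString("day", m.Groups["day"].Value);
            writer.WriteElementString("month", m.Groups["month"].Value);
            writer.WriteElementString("year", m.Groups["year"].Value);
            writer.WriteEndDocument();            // closes every open element
        }
        return sb.ToString();
    }
}
```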

As you can see, it's a very easy job to write XML using this
class. I first create an in-memory StringBuilder
that will house the created XML. I then hand it off to the constructor of a
StringWriter, which is used to construct our
XmlTextWriter object.
I could just as easily have passed in any
System.IO.TextWriter-derived object; thus, I have
the flexibility of writing to pretty much anything I want.
I then call the WriteStartDocument method, which
creates the <?xml version="1.0" ...?> declaration at the beginning of the
XML text.
(I don't have to call it, though. I can just start writing out
elements right away.) I then open a new element tag that will contain sub-elements,
using the WriteStartElement method.
Then I proceed to write the actual values as sub-nodes in the
open element using the WriteElementString method,
passing in the name of the node and the value inside of it.
To finish, I call the WriteEndDocument
method, which closes all open elements in the XML.
Had I wanted to close just the current Date
element, I would have called the WriteEndElement method
and continued writing more elements.

Possible Encoding Problem

Now, if you're trying out the code to produce this XML, you might find a little
surprise in the generated XML file. In the XML file, the first line might read
<?xml version="1.0" encoding="utf-16"?> instead
of <?xml version="1.0" encoding="utf-8"?>, and as a consequence,
you might have some problems reading in the XML file later on. In order to control
the encoding with which the XmlTextWriter writes the XML file,
you'll need to specify the encoding in the XmlTextWriter's constructor. This
also means that it is simpler to just pass in a file name to the constructor rather
than use an in-memory buffer, which will then be written to a file anyway. Here's
the code to initialize the writer with a file name and the proper encoding:
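A minimal sketch of that initialization follows. The XmlTextWriter(string, Encoding) constructor is part of System.Xml; the wrapper class Utf8XmlFile and its WriteDate method are hypothetical names used only to make the example complete.

```csharp
using System.Text;
using System.Xml;

// Hypothetical wrapper, named for illustration only.
public static class Utf8XmlFile
{
    // Writes a Date element straight to a file, encoded as UTF-8.
    public static void WriteDate(string fileName, string day, string month, string year)
    {
        // Passing the encoding here controls the encoding attribute
        // that ends up in the XML declaration.
        using (XmlTextWriter writer = new XmlTextWriter(fileName, Encoding.UTF8))
        {
            writer.Formatting = Formatting.Indented;
            writer.WriteStartDocument();
            writer.WriteStartElement("Date");
            writer.WriteElementString("day", day);
            writer.WriteElementString("month", month);
            writer.WriteElementString("year", year);
            writer.WriteEndDocument();
        }
    }
}
```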

This should solve our problem, and since we are writing to a file, we can get
rid of the code that writes the StringBuilder into the file.

A More Generic Approach

Actually, we can make the writing function much more generic, by automatically
going through all of the groups of a given match and writing their names and values
as XML. The following bit of code shows how to do this:
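One way to sketch this is shown below. GroupXmlWriter, WriteGroups, and ToXml are illustrative names, not part of the framework. The sketch assumes every capturing group in the pattern is named, since an unnamed group would come back with a numeric name such as 1, which is not a valid XML element name.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical class name, used for illustration only.
public static class GroupXmlWriter
{
    // Writes one element per capturing group, named after the group.
    public static void WriteGroups(XmlWriter writer, Regex reg, Match m)
    {
        foreach (int number in reg.GetGroupNumbers())
        {
            if (number == 0) continue; // group 0 is the entire match
            writer.WriteElementString(
                reg.GroupNameFromNumber(number), m.Groups[number].Value);
        }
    }

    // Convenience wrapper: matches the input and returns the XML,
    // or null when the input does not match.
    public static string ToXml(Regex reg, string input, string rootName)
    {
        Match m = reg.Match(input);
        if (!m.Success) return null;

        StringBuilder sb = new StringBuilder();
        using (XmlTextWriter writer = new XmlTextWriter(new StringWriter(sb)))
        {
            writer.WriteStartElement(rootName);
            WriteGroups(writer, reg, m);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```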

In order to achieve this, we need to have an instance of the
Regex class to play with.
We have to use this same instance to receive the
Match object. Then we can use the Regex
instance to retrieve the name of a group, based on its
number:

reg.GroupNameFromNumber(i)

Don't ask me why the group name is not a property of the Group
class.
This means that for this functionality to work, we can't use the
static Match() method of the Regex class,
which makes things a bit more cumbersome. That's why, for
the remainder of the code samples, I'll use the earlier version
of the code, although it's less generic.
You can then implement this method, if you wish, in your
programs.

Putting It All Together

OK. We know how to parse, and we know how to output to XML.
Let's try to wrap this up using a class that takes in a single
log file and transforms it into an XML file. This class should
receive the name of the log file to read, parse it line by line,
and generate a [logFileName].xml file:
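The sketch below shows one possible shape for such a class. It makes several assumptions that you should adjust to your own logs: each line is taken to be tab-delimited with five fields (date, time, record, action, user), the class and method names are invented for this example, and the XML is written directly to the output file with UTF-8 encoding rather than to an in-memory buffer first.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical class name, used for illustration only.
public static class LogToXml
{
    // Assumed line layout: date, time, record, action, user, separated by tabs.
    private static readonly Regex LineRegex = new Regex(
        @"^(?<date>[^\t]+)\t(?<time>[^\t]+)\t(?<record>[^\t]+)\t(?<action>[^\t]+)\t(?<user>[^\t]+)$");

    // Reads the log file line by line and writes [logFileName].xml.
    // Returns the name of the generated XML file.
    public static string Transform(string logFileName)
    {
        string xmlFileName = logFileName + ".xml";
        using (StreamReader reader = new StreamReader(logFileName))
        using (XmlTextWriter writer = new XmlTextWriter(xmlFileName, Encoding.UTF8))
        {
            writer.Formatting = Formatting.Indented;
            writer.WriteStartDocument();
            writer.WriteStartElement("Entries");

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Match m = LineRegex.Match(line);
                if (m.Success) WriteEntry(writer, m); // skip lines that don't match
            }

            writer.WriteEndDocument(); // closes Entries
        }
        return xmlFileName;
    }

    // Writes a single Entry element from the named groups of the match.
    private static void WriteEntry(XmlWriter writer, Match m)
    {
        writer.WriteStartElement("Entry");
        writer.WriteElementString("date", m.Groups["date"].Value);
        writer.WriteElementString("time", m.Groups["time"].Value);
        writer.WriteElementString("record", m.Groups["record"].Value);
        writer.WriteElementString("action", m.Groups["action"].Value);
        writer.WriteElementString("user", m.Groups["user"].Value);
        writer.WriteEndElement();
    }
}
```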

It creates an in-memory XmlTextWriter object and
initializes it to the proper settings. It then creates an open
Entries element inside of it, into which all of the
child Entry elements (for each line) will be
written.

It then goes through each line in the log file, and uses the
Regex.Match method on that line, using a pattern
that matches each sub-group we identified at the beginning of
this article.

If the match was a success, it passes both the
XmlTextWriter instance and the Match object
to a separate method, which writes the group names and values
into the XML writer instance.

After going through all of the lines in the log file, it closes
all of the elements in the XML file and writes it all to a file
named the same as the original log file, with the addition of
".xml" at the end.

Letting the User Search for Data

Now that we have our data stored as structured XML, we can
use it to let the user easily search through it. To do this,
we'll use a very easy technique already given to us inside the
.NET Framework. We'll use a DataSet object to load
our XML data, then we'll Select data from the
dataset using a filter that can be specified by the
user. We can then display the resulting DataRows to
the user.

The DataSet class has a ReadXml method, which allows
us to pass it a file name and have it automatically load the data into a table
structure inside the dataset. For our purposes, we can send in the file name
without any additional parameters. What will be generated inside of the dataset's
memory will be a table that contains a collection of DataRows,
each one holding a set of columns that corresponds to the set of
properties we created in the log file -- Date, Time, Hour, Action,
and so on. Once we have this table in place, we can use the DataTable's
Select method to retrieve any number of DataRow objects
that match the filter we provide. Here's the code to do this:

private void LoadXMLFile()
{
    // Load the XML file into the dataset
    m_ds.ReadXml(txtFileName.Text);

    // Show all log entry rows at first load
    // by passing in a 'true' filter.
    // This is just like specifying
    // SELECT * FROM ENTRIES WHERE true
    RefreshResults("true");
}

private void RefreshResults(string filter)
{
    try
    {
        // Clear the result list view
        lvResults.Items.Clear();

        // Get the first datatable inside the dataset;
        // we know this one contains the data we need
        DataTable table = m_ds.Tables[0];

        // Get the datarows that match the user's filter;
        // the filter can be any valid SQL filter
        DataRow[] rows = table.Select(filter);
        foreach (DataRow row in rows)
        {
            // Add an item to the list view
            ListViewItem item =
                new ListViewItem(row["date"].ToString());
            item.SubItems.Add(row["time"].ToString());
            item.SubItems.Add(row["record"].ToString());
            item.SubItems.Add(row["action"].ToString());
            item.SubItems.Add(row["user"].ToString());
            lvResults.Items.Add(item);
        }
    }
    catch (Exception e)
    {
        // The user might pass invalid filter expressions,
        // in which case we get an exception notifying
        // the filter parsing error in question
        MessageBox.Show(e.Message);
    }
}

Using this straightforward code, we can let the user load any
XML file, and filter its contents based on a SQL-like
filter expression.
Basically, if you have written SQL code, you can use a
WHERE clause to select the specified rows.
We receive an array of DataRows, and since we know
beforehand the names of the columns for each DataRow (same as the
XML elements in our log file), we can just display the values for
each column.

We could have just as easily looped through all of the available
columns and displayed each one's value to the user, without even
knowing what kind of data was inside of our DataRow. We could
dynamically add columns to our ListView corresponding to the name
of each DataColumn in the DataRow, and
voila -- you have yourself a more generic searching mechanism for
practically any simple XML file.
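To make the generic idea concrete, here is a rough sketch without any user interface code. The class name LogSearch and the method Search are hypothetical, and for simplicity the XML is passed in as a string rather than read from a file. It assumes the XML has the simple shape used in this article: a root element containing repeated entry elements whose children hold plain text.

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;

// Hypothetical class name, used for illustration only.
public static class LogSearch
{
    // Loads the XML into a DataSet, applies the filter, and returns
    // one "name=value, name=value" line per matching row.
    public static List<string> Search(string xml, string filter)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml));
        DataTable table = ds.Tables[0];

        List<string> lines = new List<string>();
        foreach (DataRow row in table.Select(filter))
        {
            // Walk the columns without knowing their names in advance.
            List<string> parts = new List<string>();
            foreach (DataColumn col in table.Columns)
            {
                parts.Add(col.ColumnName + "=" + row[col]);
            }
            lines.Add(string.Join(", ", parts.ToArray()));
        }
        return lines;
    }
}
```

A call such as LogSearch.Search(xml, "action = 'login'") would return only the rows whose action column equals login, whatever other columns the file happens to contain.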

Summary

We've learned the following:

Parsing log files is easy.

Writing XML files is easy.

Searching XML files is easy.

Generating XML log files and searching them should be
pretty darn easy!

eXpresso, a tool for
testing regular expressions (This article explains
how to work with it.)

Roy Osherove
has spent the past 5+ years developing data-driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topics (most of which can be found on his weblog), and loves discovering new things every day.