Importing a log file with rxImport()

Tuesday's post on a new Kaggle contest mentioned that Revolution Analytics offers a free trial for using Revolution R Enterprise in the Amazon cloud. One reason this might be of interest to contestants is the rxImport() function, which reads delimited text data, fixed-format text data and, with an appropriate ODBC driver, data stored in a database. (rxImport() also reads SAS and SPSS files directly, but I'm guessing that this feature is not likely to be of interest to contestants.) As it turns out, rxImport() is also useful for dealing with semistructured text data such as log files. For example, here are the first three lines of an internet log file, courtesy of gVim.

Not perfect, but clearly useful! An added bonus is that rxImport() assigns variable names to the columns ("V1", "V2", etc.), which can be used in the import process. The following code imports the file and does a bit of cleaning along the way, removing some columns and renaming others.
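As a rough base R analogue of this import-and-clean step, here is a minimal sketch that uses read.table() rather than rxImport(). The log lines, the choice of columns to drop and the new column names are all made up for illustration. read.table() happens to use the same default "V1", "V2", ... naming when no header is present.

```r
# Made-up, space-delimited log lines (not the file from the post).
logLines <- c(
  "10.0.0.1 - - 200 512",
  "10.0.0.2 - - 404 128"
)

# With header = FALSE, read.table() names the columns V1, V2, ...
raw <- read.table(text = logLines, header = FALSE, stringsAsFactors = FALSE)

# Drop the placeholder columns and give the rest descriptive names
# (both the dropped columns and the new names are hypothetical).
clean <- raw[, !(names(raw) %in% c("V2", "V3"))]
names(clean) <- c("ip", "status", "bytes")

print(clean)
```

The same idea carries over to the import above: the default names give you a handle on each column before you know what it means.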

Now we are getting somewhere. We can use the rxDataStep() function to remove the columns V4 and V5, which are no longer needed, and further process the data. The following code uses a transform function in the data step to break apart the V6 column into some meaningful fields.
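To make the idea of a transform function concrete, here is a minimal base R sketch. It assumes, purely for illustration, that V6 holds an HTTP request string such as "GET /index.html HTTP/1.1"; the function name splitRequest() and the new field names are hypothetical. The function takes a named list of equal-length vectors and returns a list of the same shape, which is the structure a chunk has.

```r
# Hypothetical transform function: 'data' is a named list of
# equal-length vectors, one element per column of the chunk.
splitRequest <- function(data) {
  # Assumption: V6 looks like "METHOD URL PROTOCOL", separated by spaces.
  parts <- strsplit(as.character(data$V6), " ", fixed = TRUE)
  data$method   <- vapply(parts, function(p) p[1], character(1))
  data$url      <- vapply(parts, function(p) p[2], character(1))
  data$protocol <- vapply(parts, function(p) p[3], character(1))
  data$V6 <- NULL  # the combined column is no longer needed
  data
}

# Try it on a small made-up chunk.
chunk <- list(
  V1 = c("10.0.0.1", "10.0.0.2"),
  V6 = c("GET /index.html HTTP/1.1", "POST /login HTTP/1.0")
)
result <- splitRequest(chunk)
str(result)
```

Because the function only touches the list it is given, it can be tested on a small hand-built chunk like this before being passed as transformFunc.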

The transform function, transformFunc(), in the last block of code may look a bit mysterious. The key to understanding what it does is to realize that rxDataStep() reads a big file a chunk at a time. Each chunk holds the data in a list, and processing must take this structure into account. If the structure of the list is not clear, it is easy enough to print things out and take a look. The following code reads in 5 lines of the file, 4 lines to a chunk, and prints out the contents of each chunk.

# Look at what is going on in the chunks
rxImport(inData = file, outFile = "test",
         transformFunc = function(data) {
           print(data)
           # Internal variables can tell you about the chunk
           print(paste("chunk starts with row", .rxStartRow, "of file"))
           print(paste("chunk number = ", .rxChunkNum))
           print(paste("number of rows read = ", .rxNumRows))
           data
         },
         rowsPerRead = 4,   # reads 4 rows into a chunk if available
         numRows = 5,       # only read 5 rows from the file
         overwrite = TRUE)  # overwrite the file if it exists

The code also points out some internal variables that may be useful in writing transform functions to process each chunk.

.rxStartRow contains the row of the file that begins the chunk
.rxChunkNum contains the number of the chunk
.rxNumRows contains the number of rows in a chunk
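For readers without Revolution R Enterprise at hand, the bookkeeping behind these three variables can be imitated in base R. The sketch below is not how rxImport() is implemented; it simply reads a made-up temporary file with readLines(), 4 rows per chunk and 5 rows in total, and records a start row, chunk number and row count for each chunk. The variable names startRow, chunkNum and chunkLog are hypothetical stand-ins.

```r
# Self-contained input: a temporary file with 10 made-up lines.
path <- tempfile(fileext = ".log")
writeLines(paste("entry", 1:10), path)

rowsPerRead <- 4L  # rows per chunk, as in the call above
numRows     <- 5L  # stop after this many rows in total

con <- file(path, open = "r")
startRow <- 1L     # plays the role of .rxStartRow
chunkNum <- 0L     # plays the role of .rxChunkNum
chunkLog <- list()

while (startRow <= numRows) {
  # Read a full chunk, or fewer rows if the row limit is near.
  chunk <- readLines(con, n = min(rowsPerRead, numRows - startRow + 1L))
  if (length(chunk) == 0L) break  # reached end of file early
  chunkNum <- chunkNum + 1L
  chunkLog[[chunkNum]] <- c(startRow = startRow,
                            chunkNum = chunkNum,
                            numRows  = length(chunk))  # like .rxNumRows
  startRow <- startRow + length(chunk)
}
close(con)

# One row per chunk: where it started, its number, and its size.
print(do.call(rbind, chunkLog))
```

With these settings the loop produces two chunks, the first with 4 rows and the second with the single remaining row, which is the pattern to look for in the printed output of the rxImport() call.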

Download Output to have a look at the chunks printed out by the last block of code. In a future post, I'll look into squeezing some information out of this file.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.