Coding, Kettle, Pentaho, Big Data and more

Good old file handling

In a heavily webbed, automated, interconnected world where most data lives in relational databases, we can sometimes forget that there are still many situations where you simply want to FTP a file from one place to another.

That process in itself holds many dangers, as I pointed out to someone on the forum today. Let me recap that post here on the blog…

Suppose your files are coming in using FTP to a local directory.

A file is being written, let’s call it FILE_20070328.txt.
Now, in advance you don’t know the size of that file. Let’s say it’s 10MB and takes 30 seconds to FTP.
In your transformation you detect this file and start working. Chances are very high that you’ll be reading an incomplete file. (See also this technical tip on variables and file handling)

There are two ways to solve this problem:

You write to FILE_20070328.txt.tmp and rename it when the FTP is done. A rename is atomic on the filesystem and is therefore safe to use.
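A minimal sketch of that first option, simulated locally (the paths are examples, and a real FTP client would do the rename with RNFR/RNTO):

```shell
#!/bin/sh
# The upload lands under a temporary name that consumers ignore;
# mv within one filesystem is an atomic rename, so a polling reader
# either sees no file at all or the complete file, never half of it.
set -e
mkdir -p /tmp/incoming
# Simulate the slow transfer writing to the temporary name:
printf 'row1\nrow2\n' > /tmp/incoming/FILE_20070328.txt.tmp
# When the transfer is done, one atomic rename publishes the file:
mv /tmp/incoming/FILE_20070328.txt.tmp /tmp/incoming/FILE_20070328.txt
```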

You write all files you need to transfer and then FTP a small file, called a trigger file or sentinel file.

The first option creates a problem on a different level, though. Suppose you have a data warehouse to update and you expect 10 files from a remote system (CUSTOMER_20070328.txt, SALES_20070328.txt, etc.)

In that case the solution has to include counting the number of available files, evaluating complex wildcards, etc., just to make sure you get all the files. Once you also need to handle half-complete FTP attempts, partial files, and so on, it becomes messy pretty fast.
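For what it’s worth, a “count the available files” check along those lines might look like this (the directory, the naming pattern and the expected count are all made up):

```shell
#!/bin/sh
# Sketch: only start the load once all expected files for the day
# are there. Note that this still doesn't guard against partially
# transferred files - hence the mess described above.
DIR=/tmp/ftp_in
EXPECTED=2
mkdir -p "$DIR"
touch "$DIR/CUSTOMER_20070328.txt" "$DIR/SALES_20070328.txt"
COUNT=$(ls "$DIR"/*_20070328.txt 2>/dev/null | wc -l)
if [ "$COUNT" -eq "$EXPECTED" ]; then
  echo "all files present, start the load"
fi
```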

The trigger/sentinel option is by far the easier and more elegant one to use, which is why we included it in Pentaho Data Integration.
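The sentinel convention boils down to this (the file names are examples): the sender transfers every data file first and one small trigger file last, so the mere existence of the trigger means the whole batch is complete.

```shell
#!/bin/sh
# Sender side: data files first, trigger file last.
set -e
DIR=/tmp/ftp_sentinel
mkdir -p "$DIR"
printf 'customer data\n' > "$DIR/CUSTOMER_20070328.txt"
printf 'sales data\n'    > "$DIR/SALES_20070328.txt"
touch "$DIR/BATCH_20070328.trg"   # written last, after all data files
# Consumer side: wait for the trigger, not for N data files.
if [ -f "$DIR/BATCH_20070328.trg" ]; then
  echo "batch complete, safe to process"
fi
```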

Obviously, things become more complex if you have multiple systems writing files to your input directory all the time. In that case, try to keep things as simple as possible. Here is some extra advice for those poor souls who need to deal with such situations:

create different directories per source system (if you can)

consider generating text files where the header-row is the same as the footer-row. Then evaluate completeness of files by comparing the first and last row of the file (head/tail -1 file.txt).

use locks to indicate that files are being processed (touch /tmp/locks/kettle_running.lock, rm /tmp/locks/kettle_running.lock)

move files that are being processed to a different directory

archive files that were successfully processed to a different directory

don’t delete files until your disks run full; give yourself a break: that way your warehouse can be reloaded if you make a mistake.
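Two of those tips sketched in shell (the batch tag, the file path and the lock location are examples):

```shell
#!/bin/sh
# Sketch of the header-equals-footer completeness check plus a lock
# file around processing.
set -e
FILE=/tmp/CUSTOMER_20070328.txt
printf 'BATCH_20070328\nrow1\nrow2\nBATCH_20070328\n' > "$FILE"

# The writer repeats the header row as the footer, so a complete
# file has an identical first and last line:
if [ "$(head -1 "$FILE")" = "$(tail -1 "$FILE")" ]; then
  echo "file is complete"
fi

# Take a lock while processing so overlapping runs can back off:
mkdir -p /tmp/locks
touch /tmp/locks/kettle_running.lock
# ... move the file to a work directory, process it, archive it ...
rm /tmp/locks/kettle_running.lock
```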

If you have other tips, just shoot! Also feel free to add feature requests if you have any ideas for a job entry to help out with your file handling problems.
Until next time,

One comment

When I do these kinds of jobs I usually write a cron entry to process things. For example, on one of my wife’s sites I have set up a place for her to upload photos that will get renamed, moved and resized in various ways. She can just dump the images in and they’ll get taken care of.

My shell script starts off with “find -mmin +1” and that takes care of it. If the file hasn’t been written in a minute, it’s safe to work on it.

Locking is a great tip — flock works well in most scripting languages. People tend to forget that a cron job that runs every minute might take longer than a minute to complete, so you could easily have two scripts trying to process the same file.
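To make that overlap problem concrete, here is one portable way to guard such a cron job; mkdir is atomic, so only one invocation wins (flock(1) does the same job where it’s available, and the lock path is an example):

```shell
#!/bin/sh
# Sketch: guard a cron job against overlapping runs. mkdir is atomic,
# so only one invocation can create the lock directory; a second run
# started by cron a minute later just exits instead of competing.
LOCK=/tmp/photo_job.lock.d
if mkdir "$LOCK" 2>/dev/null; then
  echo "got the lock, processing"
  # ... find /uploads -mmin +1 -type f ... rename, move, resize ...
  rmdir "$LOCK"
else
  echo "previous run still busy, exiting"
fi
```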