Sunday, February 14, 2010

Last year I had a minor project that was importing a diverse collection of data into SQL Server. The typical file had thousands of records, on occasion millions of records that were batch-sequence sensitive. A common C# pattern is to read one record and insert it into SQL Server. This pattern is not robust: if the insert program dies unexpectedly, a restart may insert duplicate records (you can start bolting on safeguards, but that is not a KISS solution). SQL utilities such as BCP tend to require a lot of fiddling to get the import right (and tend not to be very tolerant of data changes).

The problem was further complicated because there was a variety of uploading machines (on isolated domains) that were difficult to update with new code (security-controlled machines). In general, there were folders where data was dropped, monitored by Windows Services that did the upload.

The pattern that I ended up using was simple:

1. Convert each file to XML in the upload program.

2. Insert the complete XML into an upload table via a stored procedure.

3. The stored procedure checks for duplicates.

4. Delete the file if the insert was successful.
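
The server side of these steps can be sketched roughly as below. This is a minimal illustration, not the original schema: the table, column, and procedure names are my own, and the duplicate check here uses the file name plus a hash of the contents.

```sql
-- Hypothetical upload table: one row per uploaded file.
CREATE TABLE UploadQueue (
    UploadId   int IDENTITY PRIMARY KEY,
    FileName   nvarchar(260) NOT NULL,
    FileHash   varbinary(32) NOT NULL,  -- hash of the raw file contents
    FileDate   datetime      NOT NULL,  -- used to preserve sequence
    UploadedAt datetime      NOT NULL DEFAULT GETDATE(),
    Contents   xml           NOT NULL,  -- the whole file as XML
    Processed  bit           NOT NULL DEFAULT 0
);
GO
-- Insert a file unless the same name/hash pair is already present.
CREATE PROCEDURE InsertUpload
    @FileName nvarchar(260),
    @FileHash varbinary(32),
    @FileDate datetime,
    @Contents xml
AS
BEGIN
    SET NOCOUNT ON;
    IF NOT EXISTS (SELECT 1 FROM UploadQueue
                   WHERE FileName = @FileName AND FileHash = @FileHash)
        INSERT INTO UploadQueue (FileName, FileHash, FileDate, Contents)
        VALUES (@FileName, @FileHash, @FileDate, @Contents);
END
```

Because the duplicate check lives in the stored procedure, re-running the upload program after a crash is harmless: a file that was already inserted is simply skipped.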

Processing of the data may now be done from this table, in upload sequence (or periodically using the file-date sequence if needed). Once uploaded, each batch of records can be examined and processed into temp tables that are applied in a single SQL transaction, minimizing blocking on the system.
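
A sketch of that processing step in T-SQL, assuming a hypothetical upload table named UploadQueue with an xml Contents column whose rows carry a seq attribute and f0/f1 field elements (all names here are illustrative):

```sql
DECLARE @UploadId int, @Contents xml;

-- Pick the oldest unprocessed batch (file-date order preserves sequence).
SELECT TOP 1 @UploadId = UploadId, @Contents = Contents
FROM UploadQueue
WHERE Processed = 0
ORDER BY FileDate;

-- Shred the XML into a temp table outside the transaction.
SELECT r.value('@seq', 'int')                      AS Seq,
       r.value('(f0/text())[1]', 'nvarchar(100)')  AS Field0,
       r.value('(f1/text())[1]', 'nvarchar(100)')  AS Field1
INTO   #Batch
FROM   @Contents.nodes('/file/row') AS t(r);

-- Apply the whole batch in one short transaction.
BEGIN TRANSACTION;
    INSERT INTO TargetTable (Field0, Field1)
    SELECT Field0, Field1 FROM #Batch ORDER BY Seq;
    UPDATE UploadQueue SET Processed = 1 WHERE UploadId = @UploadId;
COMMIT TRANSACTION;
```

The expensive work (shredding the XML) happens before the transaction starts, so locks on the target table are held only for the final insert.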

The code below shows how a tab-delimited file is handled.
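
A minimal sketch in that spirit (the original program is not reproduced here; the element names and the InsertUpload stored procedure with its parameters are my own illustration):

```csharp
// Convert a tab-delimited file to XML and hand the whole document to
// SQL Server via a stored procedure; delete the file only on success.
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Security.Cryptography;
using System.Xml.Linq;

static class Uploader
{
    // One <row> per line, one <fN> element per tab-separated field.
    public static XDocument ToXml(string path)
    {
        var file = new XElement("file",
            new XAttribute("name", Path.GetFileName(path)));
        int seq = 0;
        foreach (string line in File.ReadAllLines(path))
        {
            string[] fields = line.Split('\t');
            var row = new XElement("row", new XAttribute("seq", seq++));
            for (int i = 0; i < fields.Length; i++)
                row.Add(new XElement("f" + i, fields[i]));
            file.Add(row);
        }
        return new XDocument(file);
    }

    public static void Upload(string path, string connectionString)
    {
        byte[] hash;
        using (var sha = SHA256.Create())
            hash = sha.ComputeHash(File.ReadAllBytes(path));

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("InsertUpload", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@FileName", Path.GetFileName(path));
            cmd.Parameters.AddWithValue("@FileHash", hash);
            cmd.Parameters.AddWithValue("@FileDate", File.GetLastWriteTime(path));
            cmd.Parameters.Add("@Contents", SqlDbType.Xml).Value =
                ToXml(path).ToString();
            conn.Open();
            cmd.ExecuteNonQuery();
        }
        File.Delete(path);  // only reached if the insert succeeded
    }
}
```

Note that the conversion is deliberately dumb: every field becomes a generic fN element, so the upload program never needs to know what the data means.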

If a new type of data is dropped in the upload folder, it will still be inserted into the upload table (there is no need to update the uploading programs). The handling of the data is done in SQL Server: new data is simply collected until you are ready to process it.

If the data format changes unexpectedly, breaking the processing code, you can simply stop processing files of that type until the issue is fixed (when a file cannot be processed, move on to the next type), thus maintaining sequence.

All processing decisions are made in SQL Server despite the diverse data sources.

If batches have serialization information, it can be used both to monitor the data and to order it appropriately.