Bulk loading data with IDataReader and SqlBulkCopy

Introduction

Often large amounts of data need to be quickly loaded into a database. A common approach is to fill a DataTable and use the SqlBulkCopy class to load the data. The problem with this approach is that the data set must be materialized in memory. This is inefficient because the data must be copied and transformed several times and the cache utilization is poor. As well it does not scale because memory is usually a limited resource. But it is far faster than one row at a time inserts.

An alternative is to implement an IDataReader and use the SqlBulkCopy class to load the data but the implementation requirements are poorly documented. I have developed the BulkDataReader class that makes the implementation straightforward. With the class, it is possible to stream the data from one source into a database. Since the data has fewer transformations and is streamed from buffer to buffer (and thus utilizes caches better) the performance and resource utilization is much better than other approaches. In practice, the limiting factor with this approach is how fast the data can be read from the data source. In early releases of the .Net Framework, the DataTable approach was much slower than the IDataReader approach but DataTables are now much more efficient and the approaches have comparable performance when memory is not a constraint.

Bulk loading data is much faster that loading data with single inserts because the repeated overhead of transferring the data, parsing the insert statement, running the statement and issuing a transaction record is avoided. Instead, a more efficient path is used into the storage engine to stream the data. The setup cost of this path is however much higher than a single insert statement. The break-even point is typically around 10 to 100 rows. Once the data exceeds this size, bulk loading is almost always more efficient.

The choice of clustering key, index fill factor and compression has an impact on the time to load large data sets. In general, integer based surrogate keys and sequence objects are the best choices for clustering key. Integer based keys are small and new rows are inserted at the end of the clustered index. This means that SQL Server’s lazy writer can write the page once when it is full and more rows fit in a page. By contrast, GUID based and natural keys are inserted non-sequentially and are larger so they require more page writes for the same amount of data. Also, they can cause page splits, which reduces query performance. To minimize the effect of page splits in non-sequential keys, index fill factors should be set to allow some space for growth in index pages (80% is usually a reasonable value). Compression adds CPU overhead to reading and writing data but it often improves overall performance by increasing the number of rows per page and thus reducing the amount of I/O required.

Often a heap is used to stage data before inserting it into the target table. This consumes more disk bandwidth but has the advantage that the data loading can be buffered and merges are possible. Typically, a stored procedure runs continuously merging the top n rows into the target table and deleting them from the staging table. Heaps work well for this scenario since enqueue and dequeue operations have low cost.

Table partitioning when combined with Windows Server 2012 Storage Spaces can provide a large increase in physical storage bandwidth. But that is too large a topic for this posting. In general, good performance

BulkDataReader

The download has all the code and test code for the BulkDataReader class. To implement an IDataReader using the BulkDataReader, do the following:

·Override the SchemaName and TableName properties with meaningful names. These need not match the target schema and table names and are useful for debugging.

·Override the AddSchemaTableRows method to declare the schema of the source information. In general, this should match the target schema but some remapping is possible.

·Override the GetValue method to return the value of a column for the current row.

Care needs to be taken in the choice of bulk loading options. The choices can noticeably impact overall performance and throughput. The choices in the test code are for a case where multiple data sources are bulk loading data and readers are using snapshot isolation.

Results

To test the performance implications of various combinations of compression, clustering key types and loading techniques I used the test code in the download to load tables like this:

The tables approximate event logs. Event logs frequently have high insert volumes. The table also has typical constraints and indexes to correct for the overhead associated with them. MSTest was used for timing the runs and to minimize variability. Since the variability was low, only 3 test runs per test case was necessary. The download can be easily modified for other table structures.

Table 1 shows the time to load 1 000 000 rows into an empty table. The SQL Server 2012 database was running in a virtual machine on a workstation so much higher performance can be expected for realistic hardware. For example, a low end database server was able to load 115 692 SharePoint ULS events per second (1 000 000 rows in 8.6 seconds) using the BulkDataReader. Table 2 shows the time to load 1 000 000 rows of data into a table with 1 000 000 rows. The initial table had been rebuilt to simulate normal database maintenance. The natural key based table has one less index since it does not have a surrogate key thus its better performance than the GUID based clustering key.

Table 1: Time, in seconds, to load 1 000 000 rows of data into an empty table for compressed and uncompressed tables, various clustering key types and various loading techniques. The times are a mean of 3 runs.

Loader

Uncompressed

Compressed

Int

GUID

Natural

Heap

Int

GUID

Natural

Heap

Insert

2 322

4 217

3 592

1 764

2 152

7 817

2 952

1 715

DataTable

147

239

205

69

137

329

166

77

BulkDataReader

142

238

207

61

127

313

164

68

Table 2: Time, in seconds, to load 1 000 000 rows of data into a table with 1 000 000 rows for compressed and uncompressed tables, various clustering key types and various loading techniques. The times are a mean of 3 runs.

Loader

Uncompressed

Compressed

Int

GUID

Natural

Heap

Int

GUID

Natural

Heap

Insert

2 834

6 043

4 589

1 591

2 699

7 886

3 948

1 554

DataTable

164

286

287

68

151

336

198

91

BulkDataReader

160

258

269

95

140

301

189

52

As expected, bulk loading data is more efficient than single inserts, integer based surrogate keys are more efficient than alternatives and heaps are an efficient way to stage data.