Handling BULK Data insert from CSV to SQL Server

From this article, developers will get some useful guidelines on bulk data insertion in SQL Server.

Introduction

I am writing this article based on my experience of inserting a huge amount of data (around 5 million rows, 400
MB) from a CSV file into a SQL Server database.

Background

There are several ways to insert bulk data from a CSV file into a database; our goal was to perform the insertion as quickly as possible and to execute it from
a C# application.
To reach that goal, we experimented with some of the common, well-known techniques for handling bulk data insertion. These are the techniques we experimented with:

SQL BULK INSERT query

BCP or SqlBulkCopy library to insert bulk data using C# or VB

SQL Server Integration Services (SSIS)

The normal SqlCommand class in C# or VB


1. SQL BULK Insert

Using the BULK INSERT statement, we can insert bulk data into the database directly from
a CSV file. The simplest version of the BULK INSERT query looks like this:
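
Here is a minimal sketch, executed through a SqlCommand from C# since that is how we ran everything. The connection string, file path, and table name (dbo.CsvData) are placeholders; the file path is resolved on the SQL Server machine, not the client, and FIRSTROW = 2 simply skips the header row.

using System.Data.SqlClient;

// A minimal sketch: the BULK INSERT statement executed from C#.
// Connection string, file path and table name are placeholders; the CSV path
// must be readable by the SQL Server instance, not just by the client machine.
const string bulkInsertSql =
    @"BULK INSERT dbo.CsvData
      FROM 'C:\Data\input.csv'
      WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)";

using (var connection = new SqlConnection("Server=.;Database=TestDb;Integrated Security=true"))
using (var command = new SqlCommand(bulkInsertSql, connection))
{
    command.CommandTimeout = 0;   // disable the command timeout for a very large file
    connection.Open();
    command.ExecuteNonQuery();    // SQL Server reads the file and inserts the rows itself
}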

2. SqlBulkCopy library

The SqlBulkCopy class copies bulk data from an in-memory data source to
a database table. SqlBulkCopy accepts the following parameter types for copying
data into the database: System.Data.DataRow[], System.Data.DataTable,
and System.Data.IDataReader.

Now the challenge is to convert the large CSV file into one of these data types:
DataRow[], DataTable, or IDataReader. Several open-source libraries are available
to perform such a conversion. For our experiments, we used CSVReader (http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader),
which exposes the data from a CSV file through the System.Data.IDataReader interface.

The following code performs a SqlBulkCopy perfectly from a CSV to a database table.
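
This is only a sketch, assuming the CSVReader library linked above (namespace LumenWorks.Framework.IO.Csv) is referenced; the connection string, file path, and destination table name are placeholders.

using System.Data.SqlClient;
using System.IO;
using LumenWorks.Framework.IO.Csv;   // the "Fast CSV Reader" linked above

// Stream rows from the CSV file straight into the destination table.
// The second constructor argument (true) means the first row contains column headers.
using (var csv = new CsvReader(new StreamReader(@"C:\Data\input.csv"), true))
using (var connection = new SqlConnection("Server=.;Database=TestDb;Integrated Security=true"))
{
    connection.Open();

    using (var copy = new SqlBulkCopy(connection))
    {
        copy.DestinationTableName = "dbo.CsvData";
        copy.BatchSize = 10000;       // commit in batches instead of one huge transaction
        copy.WriteToServer(csv);      // CsvReader implements IDataReader, so it can be passed directly
    }
}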

It is better to set BulkCopyTimeout for very large CSV files: copy.BulkCopyTimeout = XXX;

3. SQL Server Integration Services

SQL Server includes a powerful data integration and transformation application called SQL Server Integration Services (SSIS). One of the main functions of SSIS is to move
data from external data sources in almost any format into SQL Server. Using Business Intelligence Development Studio (BIDS), we can easily import data from
a CSV file into a database.
It is also very simple to schedule a package file as an automatic recurring job.

Here are the basic steps to create an SSIS package that imports data from
a CSV file into SQL Server.

Open SQL Server Business Intelligence Studio.

Create a new “Integration Services Project”.

In the “Control Flow” tab, drag a “Data Flow Task” from the toolbox.

Go to the “Data Flow” tab.

In the “Data Flow” page, drag “Flat File Source” and “ADO.NET Destination” from
the toolbox and set them up.

Running the created SSIS package will copy the data from the CSV file into the SQL
Server database.
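
The package can also be kicked off from code. This is only a sketch, assuming the SSIS runtime assembly (Microsoft.SqlServer.ManagedDTS) is referenced and the package was saved to a placeholder path.

using System;
using Microsoft.SqlServer.Dts.Runtime;

// Load the .dtsx package from disk and execute it in-process.
// The package path is a placeholder; error handling is kept minimal for brevity.
var app = new Application();
Package package = app.LoadPackage(@"C:\Packages\ImportCsv.dtsx", null);
DTSExecResult result = package.Execute();

if (result != DTSExecResult.Success)
{
    // Inspect the package errors to see what went wrong.
    foreach (DtsError error in package.Errors)
        Console.WriteLine(error.Description);
}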

4. Insert data using the conventional SqlCommand class

Data can be inserted into the database from a CSV file using the conventional
SqlCommand class. But this is a very slow process. Compared to the other three
ways discussed above, this process is at least 10 times slower. It is strongly recommended not to loop through the CSV file row by row, executing a
SqlCommand for every row, to insert a bulk amount of data from the CSV file into
the SQL Server database.
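
For completeness, here is a sketch of what that slow row-by-row approach looks like, shown only to illustrate what to avoid; the connection string, file layout, and table/column names are placeholders, and header handling is omitted.

using System.Data.SqlClient;
using System.IO;

using (var connection = new SqlConnection("Server=.;Database=TestDb;Integrated Security=true"))
{
    connection.Open();

    // One INSERT per CSV line: every row is a separate round trip to the server,
    // which is why this is roughly an order of magnitude slower than the other options.
    foreach (string line in File.ReadLines(@"C:\Data\input.csv"))
    {
        string[] fields = line.Split(',');

        using (var command = new SqlCommand(
            "INSERT INTO dbo.CsvData (Col1, Col2) VALUES (@c1, @c2)", connection))
        {
            command.Parameters.AddWithValue("@c1", fields[0]);
            command.Parameters.AddWithValue("@c2", fields[1]);
            command.ExecuteNonQuery();
        }
    }
}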

Comparative analysis

In our study, we found that BULK INSERT and SqlBulkCopy performed best. For a CSV file of around
400 MB, it took an average of three minutes to insert the data.
SQL Server Integration Services (SSIS) took around double the time of BULK INSERT and
SqlBulkCopy. Normal data insertion took a very long time.

Based on our results, we can say that the BULK INSERT statement or the
SqlBulkCopy class performs best for huge data insertions. We can also use SSIS for
bulk data insertion, but using the normal SqlCommand class to insert bulk data is not a reasonable solution.

Caution: The results can vary from system to system.

Expert Opinions (References to make it better)

According to Adrian Hills: Bulk loading through ADO.NET can be extremely efficient if you use the
SqlBulkCopy class. If your source data is in XML,
you could try loading that into a DataSet - either by iterating through the XML document manually, or via the
DataSet.ReadXml method.
You can then highly optimise the bulk loading when using the SqlBulkCopy class by:

Loading into a heap table (no indexes - create the indexes after the data has been loaded in)

Specifying the TableLock option for the bulk load, which will take a bulk update lock on the table (see the sketch after this list)
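
A sketch of how those two suggestions might look together, assuming the source data has already been loaded into a DataSet via ReadXml; the file path, connection string, and destination table name are placeholders.

using System.Data;
using System.Data.SqlClient;

// Load the XML source data into a DataSet, as suggested above, then bulk copy the
// resulting DataTable. The destination is assumed to be a heap table (indexes are
// created after the load), and TableLock requests a bulk update lock during the copy.
var ds = new DataSet();
ds.ReadXml(@"C:\Data\input.xml");

using (var copy = new SqlBulkCopy(
    "Server=.;Database=TestDb;Integrated Security=true",
    SqlBulkCopyOptions.TableLock))
{
    copy.DestinationTableName = "dbo.CsvData";
    copy.WriteToServer(ds.Tables[0]);
}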

Comments and Discussions

I am using option #2. Everything was working fine and the data loaded, but when I made some modifications to the file via code I get an error that says "An item with the same key has already been added." Has anybody else run into this problem? I have tried loading with and without a primary key. Any ideas would be greatly appreciated.

//For some reason the SqlBulkCopy.WriteToServer method will call this GetValue method when reading
//the value and it doesn't seem to know how to convert the string value of a Guid to a uniqueidentifier.
//However, it does actually know how to convert a System.Guid to a UniqueIdentifier, so we can return that
//if the value parses to a Guid.
if (IsGuid(this[i]))
    return Guid.Parse(this[i]);
else
    return this[i];

Because, with "around" 5 million rows in "around" 3 minutes, that's almost 28000 rows per second inserted, which I'm having a hard time believing. Please give the structure of your table and the actual time measurements, and a description of the system (because if you're doing this on a Cray, nobody cares!)

Then, if the data structure is trivial, I would be interested to see this test done with something more realistic. For example, inserting into a table with a clustered index, on the primary key, which is an identity column. You know, something like real life.

I would be interested to know what kind of file structures, indexing schemes, and column types or row sizes can turn things around and make SQL Statements faster than BCP, or something like that. Is there ever a case when we might, say, need to insert one row at a time? Maybe to control locking behavior on a heavily used table?

Notice I have an empty row after the second row. I'm doing a bulk insert with FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'. It works perfectly fine when bulk inserting after removing the empty records. But when there is an empty record at the end of the file, it throws the error given below and no records are inserted into the database. I cannot ask the user to remove the empty line at EOF.

Msg 4832, Level 16, State 1, Line 1
Bulk load: An unexpected end of file was encountered in the data file.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".