Introduction

Some third-party applications produce reports as CSV files where each line may be in a different format. These applications build each record on-the-fly, formatting each value in the record separately; the
format of each value is determined by the value itself.

For example, if there is data in the field, the application exports the data inside quotation marks, but if there is no data, the field is left blank without quotation marks. Moreover, some applications do not use a 'data
type' when generating a report: if the data in a field is numeric, the application might omit the quotation marks, while non-numeric data in the same field on a different record is exported with quotation marks. We
can imagine a single CSV file with a specific column exported in 6 different formats.

In order to use bulk insert directly we have to make sure that all the data is consistent with one format.

The problem

We need to use BULK INSERT to import data from a CSV file into a SQL Server database using pure T-SQL. A BULK INSERT operation can use only one format file, and all of our data must remain consistent with it. The first step
in using BULK INSERT is to find a set of "bulk insert" rules (End Of Line, End Of Column, Collation, and so on) that fit all of the data. As mentioned above, sometimes no such set exists.

If you were told in a forum that this can't be done using pure T-SQL, then remember what I always say: "never say never".

* In this article we are going to talk only about a pure T-SQL solution, as there are several other solutions (such as using SSIS, CLR, or a third-party app) that can do it in different ways, sometimes in a better way.

Our Case Study

Our application exports the data as a CSV Comma Delimited file without a consistent format. In this example we will deal with a very common situation that fits these rules:

1. Our application uses column types (to keep this article simple we will focus on a string column). A numeric column will never use quotation marks, and a string column will use quotation marks
on and off, according to the following rules.

2. If there is data in the field, it is exported inside quotation marks (no matter whether the data is numeric or not, since the column is a string type).

3. If there is no data, the field is left blank, without quotation marks.

The original sample data that we use looks like this:

ID | Phone Number | First Name | Last Name
---+--------------+------------+----------
1  | 9999999      | ronen      | Ariely
2  | 8888888      | xxx1, xxx2 | yyy
2  | 8888888      | xxx1, xxx2 |
3  |              |            | yyy
4  | 7777777      |            |
   | 2222222      | zzz        | kkk
5  | 1111111      | 5000.5     | 5

According to the application export rules above, our CSV file looks like this:

1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"

* We can see in the last line that our application uses column types, so even when the value is numeric it is enclosed in quotation marks. But we have to remember that there are more complex situations, such as applications
that do not use column types. Then the last line could look like: [5,5000.5,5]. It can be even more complex if the culture formats numbers like 5,000.5; then our CSV line might look like this: [5,5,000.5,5]

The solution:

* Remember that this is only a workaround for our specific case. For each set of data a slightly different solution might fit. The idea of how to get to the solution is what is important here.

STEP 1: Identify the import file format

In this step we will run several tests with different format files. Our aim is to identify any potential problems and to find the best format file, one that fits as many columns as possible from the start.

Finding the problematic columns and the consistent column format

First of all, you have to find a record format that fits most of the data, as well as the columns that might be inconsistent with this format. To do that, we are going to run several tests, and then we will implement
the conclusions in the next step. We will start with a simple BULK INSERT and continue with some more complex formats. Using the error messages and the results, we will identify the potential problems.

Let's try this in practice

Open Notepad and copy our CSV data into the file.

1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"

Save the file as "C:\ArielyBulkInsertTesting\Test01.csv"

* Make sure that you save the file in ANSI encoding (you can use a different encoding, such as UNICODE, but for this example we shall use ANSI).

Open Notepad and copy our XML format data into the file.

* Using a format file can help with more complex formats. I highly recommend always using a format file.
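The format-file listing itself is not reproduced here, so the following is only a sketch of what Test01.xml could look like for our four-column file (the column names, field lengths, and data types are my assumptions). Each field simply ends at the next comma, and the last field ends at the end of the line; save it as C:\ArielyBulkInsertTesting\Test01.xml:

```xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <!-- every field ends at the next comma; the last one at the end of the line -->
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","    MAX_LENGTH="100"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=","    MAX_LENGTH="100"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR=","    MAX_LENGTH="100"/>
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="100"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="ID"          xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="FirstName"   xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="4" NAME="LastName"    xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>
```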

In our case we can see that the first and second columns have no problem, but the problems start in the third column and continue to the fourth column. First of all, we have some quotation marks left in the results. Moreover, in several records the third
column was split and part of the data moved into the fourth column. Since our format file says that the third column ends at the comma, every time we have a comma as part of the string data the value will be split. That makes sense.

When we have string data we surround the content with quotation marks. If our data had a consistent format, then all string data would be enclosed in quotation marks, even empty values.

Let's demonstrate a well-formatted CSV. Save this data as "C:\ArielyBulkInsertTesting\Test02.csv"

1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2","yyy"
2,8888888,"xxx1,xxx2",""
3,,"","yyy"
4,7777777,"",""
,2222222,"zzz","kkk"
5,1111111,"5000.5","5"

In that case the solution was very simple. We could use this format file:
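The format file itself is not reproduced here; a sketch of how Test02.xml could look (column names and field lengths are my assumptions) follows. The trick is to consume the quotation marks as part of the terminators: the first field ends at a bare comma, the middle fields end at quote-comma-quote, and the last field ends at a quote followed by the end of the line:

```xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <!-- the quotation marks are swallowed as part of the terminators -->
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","     MAX_LENGTH="100"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=',"'    MAX_LENGTH="100"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='","'   MAX_LENGTH="100"/>
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='"\r\n' MAX_LENGTH="100"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="ID"          xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="FirstName"   xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="4" NAME="LastName"    xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>
```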

Now execute the BULK INSERT and the data should be placed in the table correctly. If our data were formatted this way (with a consistent format), then we would not need this article :-)
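The definition of the target temporary table is not shown above; a minimal #test table that matches the format file might be (column names and sizes are my assumptions):

```sql
-- a simple staging table: everything as text, parsing comes later
CREATE TABLE #test (
    ID          VARCHAR(100) NULL,
    PhoneNumber VARCHAR(100) NULL,
    FirstName   VARCHAR(100) NULL,
    LastName    VARCHAR(100) NULL
);
GO
```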

BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test02.csv'
WITH (
    FORMATFILE = 'C:\ArielyBulkInsertTesting\Test02.xml'
    , MAXERRORS = 1
    --, KEEPNULLS
    --, DATAFILETYPE = 'native'
    --, CODEPAGE = 'RAW'
    --, ROWTERMINATOR = '\r\n'
    --, FIRSTROW = 3
);
GO

Let's continue to work on our "real" data! Clear the data

TRUNCATE TABLE #test

GO

In some cases we might build a format file that produces error messages. We already know that the data will not fit all records; this test will give us more information via the error message. Try to use this format file (C:\ArielyBulkInsertTesting\Test03.xml):
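The Test03.xml listing is not reproduced here. As an illustration (this reconstruction is my assumption), a format file that expects every string value to be quoted, using the same terminators as Test02.xml, will fail on the records in the real file Test01.csv that do not carry quotation marks; raising MAXERRORS lets the load continue past the bad rows:

```sql
-- run the quote-aware format against the inconsistent "real" file;
-- conforming rows load, non-conforming rows are reported as errors
BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    FORMATFILE = 'C:\ArielyBulkInsertTesting\Test03.xml'
    , MAXERRORS = 10
);
GO
```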

Moreover, we can see that the rest of the records in our data were inserted (using SQL Server 2012). In some cases, when the data has only a small number of inconsistent records, this can be the best way, as most of the data is inserted. Now
we can just check which records do not exist in the table and fix them. The error message includes the row number of the first problematic row.

In conclusion, the best format file we found succeeds in inserting the first and second columns without any problem. We recognized that the problems start in the third column.

STEP 2: Insert the data into a temporary table

This is the main step, as now we can use bulk insert to import the data into SQL Server. Since we found that our data does not have a consistent format, we are going to use a temporary table to import the data.

* We don’t have to use a temporary table, as we can just use OPENROWSET to get the data and do the parsing on-the-fly. I will show this in step 3.

The basic idea is to bring all the data before the problematic point (in our case, the first and second columns) into separate columns, as they should appear in the final table. The rest of the data, from the problematic
point to the end of the problematic section (or to the end of the line if there is no other way), goes into one column. So in our case the third and fourth columns will be imported as one column.

Let's do it. We will use this format file (save it as C:\ArielyBulkInsertTesting\Test04.xml), which is similar to the "Test01.xml" file, but without the third column:
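The Test04.xml listing is not reproduced here; following the description above (column names and lengths are my assumptions), it keeps the first two comma-terminated fields and sends everything else, up to the end of the line, into one last column:

```xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","    MAX_LENGTH="100"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=","    MAX_LENGTH="100"/>
    <!-- everything after the second comma lands in one field -->
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="500"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="ID"          xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="RestOfData"  xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>
```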

The goal of this article is to show an optimal use of pure T-SQL to import data which is not well formatted into the database, in a structure that is appropriate and effective
for parsing (step 3). I will not elaborate on step 3, parsing the data; there can be hundreds of different ways to do this for each case.

I'll show some sample solutions in step 3. Those solutions are not necessarily optimal parsing solutions, but they demonstrate the use of different string functions for parsing our data. Usually, when parsing data in
SQL Server, it is best to use CLR functions.

STEP 3: Parsing the data into the final table

Now that we have imported the data, all we need to do is parse the last column. These queries can do the job in our case:
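The parsing queries were not reproduced here, so the following is only one possible sketch, assuming the staging table #test has the columns ID, PhoneNumber, and RestOfData from the previous step, and that RestOfData always contains a separating comma. If the tail starts with a quotation mark, the first value ends at the quote-comma pair; otherwise it ends at the first comma; surrounding quotation marks are then stripped:

```sql
SELECT
    t.ID,
    t.PhoneNumber,
    -- strip surrounding quotation marks, if any, from each part
    CASE WHEN p.FirstRaw LIKE '"%"' THEN SUBSTRING(p.FirstRaw, 2, LEN(p.FirstRaw) - 2)
         ELSE p.FirstRaw END AS FirstName,
    CASE WHEN p.LastRaw  LIKE '"%"' THEN SUBSTRING(p.LastRaw, 2, LEN(p.LastRaw) - 2)
         ELSE p.LastRaw END AS LastName
FROM #test t
CROSS APPLY (
    -- position of the comma that separates the two values
    SELECT Sep = CASE WHEN t.RestOfData LIKE '"%'
                      THEN CHARINDEX('",', t.RestOfData) + 1 -- comma after the closing quote
                      ELSE CHARINDEX(',', t.RestOfData)
                 END
) s
CROSS APPLY (
    SELECT FirstRaw = LEFT(t.RestOfData, s.Sep - 1),
           LastRaw  = SUBSTRING(t.RestOfData, s.Sep + 1, LEN(t.RestOfData))
) p;
```

For example, the tail `"xxx1,xxx2",yyy` splits at the comma after the closing quote, giving FirstName `xxx1,xxx2` and LastName `yyy`, while `zzz,kkk` splits at the first comma.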

After we found a way to parse the data, we can use a simple SELECT INTO query to move the data from the temporary table to the final table.

Usually, if this is not a one-time operation, I prefer to use one query that does it all, without declaring a temporary table. I still need these steps to find my BULK INSERT query and format (steps 1+2) and to find the parsing function
(step 3). Then I convert my queries into an OPENROWSET import query like this (in our case study):
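The final OPENROWSET query is not reproduced here; a sketch combining the import and the parsing in one statement might look like the following (the target table FinalTable is a hypothetical name, and the file names and column names are the ones assumed earlier):

```sql
-- import and parse in one statement, no temporary table needed
INSERT INTO FinalTable (ID, PhoneNumber, FirstName, LastName)
SELECT
    t.ID,
    t.PhoneNumber,
    CASE WHEN p.FirstRaw LIKE '"%"' THEN SUBSTRING(p.FirstRaw, 2, LEN(p.FirstRaw) - 2)
         ELSE p.FirstRaw END,
    CASE WHEN p.LastRaw  LIKE '"%"' THEN SUBSTRING(p.LastRaw, 2, LEN(p.LastRaw) - 2)
         ELSE p.LastRaw END
FROM OPENROWSET(
         BULK 'C:\ArielyBulkInsertTesting\Test01.csv',
         FORMATFILE = 'C:\ArielyBulkInsertTesting\Test04.xml'
     ) AS t
CROSS APPLY (
    SELECT Sep = CASE WHEN t.RestOfData LIKE '"%'
                      THEN CHARINDEX('",', t.RestOfData) + 1
                      ELSE CHARINDEX(',', t.RestOfData)
                 END
) s
CROSS APPLY (
    SELECT FirstRaw = LEFT(t.RestOfData, s.Sep - 1),
           LastRaw  = SUBSTRING(t.RestOfData, s.Sep + 1, LEN(t.RestOfData))
) p;
```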

Summary

The basic idea is to bring all the data in the problematic columns (or until the end of the line if there is no other way) into one column. We can use a temporary table to store the data. Then we can parse the temporary column
any way that suits us. We can use T-SQL functions or CLR functions such as SPLIT. We can clean some characters using REPLACE, we can find characters using CHARINDEX, and so on. This all depends on your specific data; it has nothing to do with BULK INSERT
anymore :-)

We must separate the operation into two parts:

1. Insert the data using BULK INSERT into the database (a temporary table, or using OPENROWSET) in such a way that we will be able to use it for step two

2. Parse and split the text in the last column into the final columns

* If you can export the file in a consistent format that fits BULK INSERT, then you should do it! This is only a workaround solution.

* If you can build a well formatted import file in advance, from the original import file, using a small application which will format a new file, then do it! This is a much better solution as most languages do a better job
of parsing text than SQL Server (T-SQL).

* If you can manage the order of the columns during the export, then try to make sure that you move all the problematic columns to the end. This will help us use BULK INSERT in a more optimal way, as we will need
to parse fewer columns in step 3.

* Why not import all the data into one column in a temp table instead of STEP 1 & STEP 2?

This is always an option, but probably not a good one. In our case study we use a very simple table structure with 4 columns and only 7 records, but in real life we might get a table with 20 columns or more and several million
records. If we have 2 columns (out of 20) with potential problems, and we can order the columns so that those columns come last, then we can import most of the data (18 columns) into the final data structure, and we will need to import only the last
two columns into one column for parsing. It is much better to separate the data into as many columns as we can and minimize the use of parsing. Parsing is a CPU-intensive operation, and parsing the data after importing will probably take longer.

When you have to use complex parsing, it is much better to use CLR solutions. As I mentioned at the start, this is a pure T-SQL solution.

* There are no other references for this solution that I know of, and most forum questions that I found were closed without an answer, or by pointing the questioner to a different solution (such as SSIS or a third-party
application), or by saying that it cannot be done using BULK INSERT and pure T-SQL.

Some Forum questions:

* This article answers several forum questions (more than 15, which I found using Google, and I checked only the first several pages of search results) that remained unanswered for far too long.
I did not find any solutions or answers except my own based on this logic. This is a very easy solution, but we have to think outside the box to get it :-)