Import Large CSVs into SQL Server Using PowerShell

A buddy of mine was running into issues importing a 1 GB CSV of longitudes and latitudes available from geonames.org. The CSV contains over 9 million rows, and no matter what he did, DTS/SSIS would not import the data because of conversion issues. That’s actually been happening to me a lot lately too, so I find myself relying on PowerShell for quick imports.

I wrote this easily reusable PowerShell script which allows for the import of an unlimited number of rows. For the data I’m working with, this script takes about 31 seconds to insert 1 million rows on an indexed table!

Based on the information provided in geonames’ readme, I created the SQL table using the most precise datatypes I could. After the import, however, I found that while the readme said to use varchar across the board, there were two columns that required nvarchar.

geonames-allcountries.sql


CREATE TABLE allCountries (
    GeoNameId int PRIMARY KEY,
    Name nvarchar(200),
    AsciiName varchar(200),
    AlternateNames nvarchar(max),
    Latitude float,
    Longitude float,
    FeatureClass char(1),
    FeatureCode varchar(10),
    CountryCode char(2),
    Cc2 varchar(60),
    Admin1Code varchar(20),
    Admin2Code varchar(80),
    Admin3Code varchar(20),
    Admin4Code varchar(20),
    Population bigint,
    Elevation varchar(255),
    Dem int,
    Timezone varchar(40),
    ModificationDate date
)

For the PowerShell portion, I used SqlBulkCopy and [System.IO.File]::OpenText because they’re the fastest for this type of operation, and a DataTable because it’s the easiest to work with, in my opinion. I explored using an IDataReader and streaming, but it got too convoluted.

Initially, I loaded all 9 million rows into memory, but that used up all my RAM and caused a ton of paging. To get around this, I performed the bulk insert after every 100,000 rows, then emptied the DataTable.

A huge thanks goes out to Stevie Nova, whose code showed me how to optimize DataTable row inserts by assigning $row.ItemArray instead of populating each column manually ($row.Item("GeoNameID") = $data[0]). This technique cut my execution time to about a quarter of what it was!
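To show how the pieces described above fit together, here’s a minimal sketch of the setup and main loop. The connection string, file path, and delimiter are assumptions — adjust them for your environment.

```powershell
# Assumed connection string, file path, and delimiter -- change as needed
$connectionstring = "Data Source=localhost;Integrated Security=true;Initial Catalog=geonames"
$csvfile = "C:\temp\allCountries.txt"
$csvdelimiter = "`t"
$batchsize = 100000
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()

# SqlBulkCopy with a table lock is the fastest option for a big one-off load
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = "allCountries"
$bulkcopy.BatchSize = $batchsize

# allCountries.txt has 19 tab-separated fields and no header row
$datatable = New-Object System.Data.DataTable
for ($c = 0; $c -lt 19; $c++) { [void]$datatable.Columns.Add() }

$i = 0
$reader = [System.IO.File]::OpenText($csvfile)
while (($line = $reader.ReadLine()) -ne $null) {
    # ItemArray populates the whole row at once -- far faster than
    # assigning each column by name
    $row = $datatable.NewRow()
    $row.ItemArray = $line.Split($csvdelimiter)
    $datatable.Rows.Add($row)
    $i++

    # Flush every $batchsize rows, then empty the DataTable to keep RAM flat
    if ($i % $batchsize -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
    }
}
```

The closing section of the script (shown further down) handles the final partial batch and cleanup.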

So the basic idea behind this script came from said buddy, who split his CSV into multiple physical files. I thought it would be better to split it within PowerShell itself, and figured something like this would do the trick:
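My first attempt checked the DataTable’s row count directly. A sketch of what that looked like (variable names are assumptions):

```powershell
# First attempt: flush the batch whenever the DataTable hits 100,000 rows
if ($datatable.Rows.Count -eq 100000) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}
```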

That is pretty, but I wanted to make the code as optimized as possible, and figured repeated requests for $dt.Rows.Count would cause a performance hit. Sure enough, checking $dt.Rows.Count took 11 seconds longer per 1 million rows than modding $i, so I went with the counter.

After much tweaking, I got it down to about 30-40 seconds per 1 million rows. Talk about exciting — here’s what I saw after running the import into a non-indexed table on a database using the Bulk-logged recovery model:

Script complete. 9,000,000 rows have been inserted into the database.
Total Elapsed Time: 00:04:45.9362223

        Write-Output "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
        $datatable.Clear()
    }
}

# Close the CSV file
$reader.Close()

# Add in all the remaining rows since the last clear
if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}

# Sometimes the Garbage Collector takes too long.
[System.GC]::Collect()

Write-Output "Script complete. $i rows have been inserted into the database."
Write-Output "Total Elapsed Time: $($elapsed.Elapsed.ToString())"

In my last trial run, the script imported more than 1.7 million rows per minute, while a semi-optimized SSIS insert managed only 300,000 rows per minute!

If you’re interested in going deeper into SqlBulkCopy performance, check out this PDF report from SQLBI.com.

Oh, one last thing: the way this script is written requires that the SQL table column order and CSV column order match. If you need to rearrange your columns, you can do so by explicitly specifying them when populating your DataTable, though doing so takes up to 400% longer.
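The slower, explicit-column approach might look like this (the column names shown are assumptions based on the table above):

```powershell
# Slower alternative: populate each column by name, so the CSV order
# no longer has to match the table's column order
$data = $line.Split($csvdelimiter)
$row = $datatable.NewRow()
$row.Item("GeoNameId") = $data[0]
$row.Item("Name") = $data[1]
$row.Item("AsciiName") = $data[2]
# ...and so on for the remaining columns
$datatable.Rows.Add($row)
```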

Alternatively, as Joel Reinford suggested below, you can create a view with the same order as your CSV, and use that instead.

Chrissy is a PowerShell MVP who has worked in IT for nearly 20 years, and currently serves as a Sr. Database Engineer in Belgium. Always an avid scripter, she attended the Monad session at Microsoft’s Professional Developers Conference in Los Angeles back in 2005 and has worked and played with PowerShell ever since. Chrissy is currently pursuing an MS in Systems Engineering at Regis University and helps maintain
RealCajunRecipes.com in her spare time. She holds a number of certifications, including those relating to SQL Server, SuSE Linux, SharePoint and network security. She recently became co-lead of the SQL PASS PowerShell Virtual Chapter. You can follow her on Twitter at @cl.

Ms. LeMaire,
I just wanted to thank you for this script. I had a person come to me and ask if I would be able to transfer data from a 16 GB CSV file into a SQL database. Your script worked like a charm for me. Below is the total that came at the end, and I thought you might want to know.

Script complete. 199924037 rows have been inserted into the database.
Total Elapsed Time: 04:01:44.2566768

I used the first method (no explicit columns). The first attempt failed because I tried to add a primary autoincrement key column in front of the data (first column). The script tried to use the first column as part of the data. I deleted the table and recreated the table and placed the primary autoincrement key column at the end of the table and everything went smoothly.

This is a great script. I have an alternate suggestion for the column mapping. Create a view with the columns in the same order as the columns in the text file. If the table has an identity column you can leave that column out of the view. Do the BCP against the view instead of the table.

$table = "allcountriesVIEW"

This method works but I haven’t done any timing against it to see if there is any performance hit.
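A sketch of what that view might look like (column names follow the table above; an identity column, if present, would simply be left out):

```sql
-- Hypothetical view whose column order matches the CSV
CREATE VIEW allcountriesVIEW
AS
SELECT GeoNameId, Name, AsciiName, AlternateNames, Latitude, Longitude,
       FeatureClass, FeatureCode, CountryCode, Cc2, Admin1Code, Admin2Code,
       Admin3Code, Admin4Code, Population, Elevation, Dem, Timezone,
       ModificationDate
FROM dbo.allCountries
```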

I have a column that uses the date type in the SQL Server table, and when I try to use the script, it shows an error that it can’t convert the CSV value (a string) to the date type of the destination column. I hope that you can help me.