It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times and Todd McDermid recently posted about using them for staging data. It occurred to me, that even though I’d always heard they were faster than other options, I’d never actually tested it to see exactly how much of a difference it would make. So, below I’ve posted some admitted unscientific performance testing between raw files and flat (or text) files.

I tested two variations of flat files, delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as a fixed width with row delimiters – each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter. The same data was used for each test, the following columns being defined:

Name

Data Type

Precision

Scale

Length

TestInt32

DT_I4

0

0

0

TestString

DT_STR

0

0

50

TestBool

DT_BOOL

0

0

0

TestCurrency

DT_CY

0

0

0

TestDBTimestamp

DT_DBTIMESTAMP

0

0

0

TestWString

DT_WSTR

0

0

50

TestNumeric

DT_NUMERIC

18

6

0

One thing to note is that when importing from flat files, everything was imported as strings, to avoid any data conversion issues. This is one of the strengths of raw files – no data conversion necessary. But for this test, I was primarily looking at speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.

I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.

When writing files, there’s no big surprises between the options. raw files are faster on 10 million rows by 9.8 seconds. The difference on smaller numbers of rows is pretty insignificant. Here’s a chart showing the times (the raw data is at the end of the post):

Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge in speed. However, reads on ragged right files are significantly slower – well over twice as slow when compared to raw files.

File sizes were also as expected, with delimited files having a slight edge over raw files, likely because the string values I used were not all 50 characters in length.

In summary, it’s clear that raw files have an advantage in speed. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, using raw files are best for performance, but if you are dealing with row counts of less than 1 million rows, it’s not a huge difference unless you are really concerned with performance. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.

I’ve worked with Mariner for almost 12 years. It’s been a very good journey, with many great experiences. I’ve worked with a lot of great people, and delivered some really interesting BI solutions to clients in a number of industries. One aspect of my job that I always particularly enjoyed was helping developers be more productive when creating BI solutions, and reducing the repetitive (read: “boring”) aspects of developing solutions on the Microsoft stack.

Recently, a new opportunity to focus more heavily on that came along. As a result, after a long and enjoyable career with Mariner doing business intelligence consulting, I am taking a new position with Varigence, a company that is producing tools that will make implementing BI solutions faster and easier, as well as introduce new capabilities and better integration into the Microsoft BI stack.

I’m really looking forward to the new role and the new experiences it will offer. I will continue to be heavily involved in Microsoft BI, so I plan to maintain this blog and continue speaking and writing on it as often as often as possible.

We had good turnout at the Greenville, SC SSIG on Tuesday. If you attended, I hope you enjoyed the presentation. After the meeting, I promised several attendees that I would make the samples developed during the demo available, and here they are. The zip includes both the SSAS project files, and a backup the sample database that the cube was built on. Both are done using the 2008 version of SQL Server.