Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. It's 100% free, no registration required.

I'm wondering if in SSIS there is a way to compare two csv files(with the same structure) with no reference key to make a record unique.

To give you a better picture from what I mean; In initial load we load data from a csv file into a table, in the subsequent loads, we will get another csv file (same format) that can have different records. We should compare what we have in the table vs. what exist in the new version of the csv file and only load the changed sub set. If there is a new records, it should be inserted, if update, it should be updated, if record deleted in the source, it should be marked as inactive in the table.

Now my questions are:

Is there any way to compare two csv files in SSIS, while there is no key to make the
records unique?

How can we compare two tables in SQL Server, while there is no key to make
records unique?

The volume of data in the csv file is quite high, more than 20 millions records!

How do you update without a key? I mean, how do you find which row is to be updated with values from which row?
–
dezsoAug 8 '12 at 6:33

1

If there is no unique key to these rows, by definition each file would represent the current set of data. It would always be inserted and existing data would be marked inactive. I know that's not what you want but that's how your question reads. So, how are you expected to determine a "match" between those file sets? If you don't know, then you need to go back to your business users and get guidance lest you implement the wrong solution.
–
billinkcAug 8 '12 at 14:25

2 Answers
2

If you have your first CSV loaded into a table, you can just as easily load the other one into a staging table (presumably with the same structure as the 'real' one). Then you can get the new rows by

SELECT * FROM staging_table
EXCEPT
SELECT * FROM real_table
;

Rows missing from the new CSV can be get reversing the two sides around EXCEPT. However, given the lack of a key on the staging table (and hopefully not on the real one - it's not clear from your question), deleting rows based on this comparison can be painful, especially with so many rows.

Add an index on the "real_table" before you run this. It'll save a lot of time.
–
BobAug 8 '12 at 13:55

This is one way of doing it. Thanks for your help. In the meantime I figured out another way which is using HashByte function to get the hash value based on the a string consists of columns value and then use this value to see whether the record has changed between loads.
–
SkyAug 8 '12 at 23:22

I also wrote an article about comparing two data sets using Excel, text editor or database engine. For your volume of data it is clear that, from the three options, only the database engine would be suitable.
Unless you use some SSIS feature (I don't have any knowledge about SSIS), then you have to compare the two data sets after they are loaded in the database.
I see two options:
a) you use the hash value representing that row, and compare between real table and temporary table, example:

The delete:
a) replace "select *" with "delete";
b) without having a reference column for the different rows, you need to write with Transact-SQL a procedure using a "cursor for update", and then delete where "current of" cursor.