Re: String Manipulation to Find Difference in Different Columns in a Wide Table

I have used the concatenate method to find duplicates and analyze differences too, along with UNIX tools in a shell environment. I often delimit the columns with '~', as it seems to be the least common ASCII character that has a visible glyph.
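A minimal sketch of the concatenate-and-delimit trick, using SQLite as a stand-in engine; the table `t` and columns `k`, `c1`, `c2` are hypothetical names, and COALESCE maps NULL to a sentinel before concatenation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (k INTEGER, c1 TEXT, c2 TEXT);
INSERT INTO t VALUES (1, 'a', 'b'), (2, 'a', 'b'), (3, 'x', 'y');
""")
# '~' delimits the payload columns; COALESCE maps NULL to a sentinel
# that does not occur in the data, so NULL never collides with a value.
rows = conn.execute("""
SELECT COALESCE(c1,'~NULL~') || '~' || COALESCE(c2,'~NULL~') AS sig,
       COUNT(*) AS n
FROM t
GROUP BY sig
HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [('a~b', 2)]  -- rows 1 and 2 share the same payload
```

The same signature column can be piped through `sort | uniq -d` when working with the UNIX tools mentioned above.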

Another way is to change your SELECT into a view or derived table, or, faster, a selectable stored procedure, whose output columns are the key columns plus, for every payload column, the payload column name, the column value, and an is-column-null Y/N flag. Then you just select where count(*) > 1. Using UNIONs in a view or derived table means many passes through the table, but a stored procedure can emit many output rows per input row in one pass, one per payload column. It does make for a relatively big sort.
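A sketch of the unpivot shape described above, again in SQLite with hypothetical names (`t`, `k`, `c1`, `c2`): each payload column becomes one (key, column name, value, is-null flag) row via UNION ALL, and the count(*) > 1 filter then surfaces duplicated values per column. Note this is the slow many-pass variant; a stored procedure would produce the same rows in one scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (k INTEGER, c1 TEXT, c2 TEXT);
INSERT INTO t VALUES (1, 'a', NULL), (2, 'a', 'y'), (3, 'b', 'y');
""")
# Unpivot: one output row per payload column per input row,
# carrying (key, column name, value, is-null Y/N).
dupes = conn.execute("""
WITH unpivoted(k, col, val, is_null) AS (
  SELECT k, 'c1', c1, CASE WHEN c1 IS NULL THEN 'Y' ELSE 'N' END FROM t
  UNION ALL
  SELECT k, 'c2', c2, CASE WHEN c2 IS NULL THEN 'Y' ELSE 'N' END FROM t
)
SELECT col, val, COUNT(*) AS n
FROM unpivoted
WHERE is_null = 'N'
GROUP BY col, val
HAVING COUNT(*) > 1
ORDER BY col, val
""").fetchall()
print(dupes)  # [('c1', 'a', 2), ('c2', 'y', 2)]
```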

Of course, you need to know what values are not in use so you can map NULLs to them without ambiguity. If there are none, you need to map NULLs to strings too long to occur in the data. For instance, an INT4 renders as at most 11 characters ("-2147483648"), so either a string of an out-of-range number like "9876543210" or any string longer than 11 characters is safe.
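A quick sketch of the arithmetic behind that sentinel choice (the function name and sentinel value are illustrative):

```python
# Width of a 32-bit integer rendered as a string.
INT4_MIN, INT4_MAX = -2**31, 2**31 - 1
assert len(str(INT4_MIN)) == 11   # "-2147483648" is the longest INT4 string

def int4_to_key(x):
    # Map NULL (None here) to a string no real INT4 can produce:
    # a 10-digit number that is out of the INT4 range.
    if x is None:
        return "9876543210"       # > 2147483647, so unambiguous
    return str(x)

# The sentinel can never collide with a rendered INT4 value.
assert int(int4_to_key(None)) > INT4_MAX
```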

BTW, forcing a leading sign and leading zeros can help align the radix point in the string presentation.
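For example, rendering every INT4 at a fixed width of 11 with a forced sign (a sketch; the helper name is made up):

```python
def fixed_int4(x: int) -> str:
    # '+' forces a leading sign; width 11 covers sign + 10 digits,
    # zero-padded, so every INT4 renders at the same width and the
    # digit positions line up column for column.
    return f"{x:+011d}"

print(fixed_int4(42))           # +0000000042
print(fixed_int4(-2147483648))  # -2147483648
```

This aligns the radix for visual or positional comparison; note it does not by itself make a plain string sort match numeric order, since '+' and '-' sort by their ASCII codes.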

However, pulling the data out of the engine, sorting all the derived rows, and converting everything to strings is sometimes too cumbersome. Using temporary tables may be slower, less robust, or impossible. A query, batch, or stored procedure that can do the comparison in one pass is the way to go. If there is an index whose leading columns are a high-cardinality part of the key that drives your payload value comparison, you want to work along that index. For CPU/VM locality of reference, work columns from left to right, except do variable-width columns later, as they are often stored at the right end of the row, and, when LOBs are stored elsewhere, TEXT and BLOB last. Some research on row structure can help. Generating your code without calls, as one long batch like an unrolled loop, lets the optimizer clean it up much more, avoiding much processing. For instance, the optimizer can remove NULL checks on NOT NULL columns, since the answer is implicit. This means your code generator can be simpler.
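A toy sketch of such a predicate generator, under assumed column metadata (the table aliases, column names, and nullability flags are all hypothetical). It emits the full null-safe comparison only for nullable columns and the bare inequality for NOT NULL ones, and lists fixed-width NOT NULL columns first:

```python
# (name, nullable) pairs, ordered fixed-width NOT NULL columns first.
COLUMNS = [("id", False), ("qty", False), ("name", True)]

def diff_predicate(a: str = "a", b: str = "b") -> str:
    """Generate a one-pass row-difference predicate over COLUMNS."""
    terms = []
    for col, nullable in COLUMNS:
        if nullable:
            # Full null-safe inequality for nullable columns.
            terms.append(
                f"({a}.{col} <> {b}.{col} "
                f"OR ({a}.{col} IS NULL AND {b}.{col} IS NOT NULL) "
                f"OR ({a}.{col} IS NOT NULL AND {b}.{col} IS NULL))"
            )
        else:
            # NOT NULL: the bare comparison suffices, no NULL checks.
            terms.append(f"{a}.{col} <> {b}.{col}")
    return "\n   OR ".join(terms)

pred = diff_predicate()
print(pred)
```

Emitting the whole batch this way, rather than calling a generic routine per column, is the "unrolled loop" idea: the generator, not the runtime, decides which NULL checks exist at all.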

If you are just pulling rows on the first difference, then you may profit from a different order. If there is a clustered index, a unique index, or a highest-cardinality index, you want to work along that key, regardless of what key drives your payload value comparison. High-cardinality columns are more likely to differ, if you can find reliable cardinality statistics. Fixed-width columns are faster to compare than VARCHAR. If some columns are NOT NULL, you want to test them earlier, as they are easier: just "a.x != b.x", not "a.x != b.x OR (a.x IS NULL AND b.x IS NOT NULL) OR (a.x IS NOT NULL AND b.x IS NULL)". Beware the tempting shorthand "NOT (a.x = b.x OR (a.x IS NULL AND b.x IS NULL))": under three-valued logic, "a.x = b.x" evaluates to NULL when exactly one side is NULL, so that form silently drops such rows. Use the explicit form, or "a.x IS DISTINCT FROM b.x" where the dialect supports it.
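The three-valued-logic trap above can be demonstrated directly; here SQLite serves as the stand-in engine, with a hypothetical table of (x, y) pairs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pairs (x INTEGER, y INTEGER);
INSERT INTO pairs VALUES (1, 1), (1, 2), (NULL, NULL), (1, NULL);
""")

# Full null-safe form: counts (1,2) and (1,NULL) as different.
safe = conn.execute("""
SELECT COUNT(*) FROM pairs
WHERE x != y OR (x IS NULL AND y IS NOT NULL)
             OR (x IS NOT NULL AND y IS NULL)
""").fetchone()[0]

# Shorthand form: "x = y" is NULL when exactly one side is NULL,
# so NOT(NULL OR FALSE) is NULL and the (1,NULL) row is dropped.
naive = conn.execute("""
SELECT COUNT(*) FROM pairs
WHERE NOT (x = y OR (x IS NULL AND y IS NULL))
""").fetchone()[0]

print(safe, naive)  # 2 1
```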