Several days ago I have spent a few hours of my life figuring out how to do dynamic pivot in Amazon Redshift. To tell you the truth, I have expected much more from this DBMS SQL language.
Redshift is based on Postgre SQL 8.0.2 ( which was released in 2005 !!!! )
Anything you would want for this, not too difficult task, does not ...

In collaboration with a current client over the past year or so, I've been working on the issue of Test Driven Development for ETL work, especially in the area of automated testing for data. He graciously agreed to allow me to rewrite some of the code we originally developed in Ruby as a set of PowerShell demo scripts and share it through ...

As I’ve said before these days I’m spending time working on Hadoop rather than SQL Server and one of my colleagues today alerted me to a rather cool feature Of Hive (in a nutshell Hive is the technology that makes enables us to write SQL statements against Hadoop). Check this out:
Pretty cool, no? From a single tablescan I can summarise and ...

Those who are familiar with Ralph Kimball’s theories on data warehousing modelling may be familiar with his assertion that a type 2 dimension member record should have a StartDate and an EndDate that defines the effective period of that record. He outlines this approach in his paper Slowly Changing Dimensions, Types 2 & 3 from the October 2008 ...

When your query must always return in a certain amount of time, trusting the optimizer might not be the best choice. Suppose your select must return in one second no matter what, and it usually returns in 50 ms, but sometimes, once in 100K calls, it takes full five seconds to complete, and clearly such query does not meet the requirement to ...

When two columns are correlated, it may fool the optimizer and cause it to choose a wrong plan. Here is a simple script that demonstrates it.The following script creates a table and populates some sample ...

You can use SUM to calculate the sum of several numbers, but
you cannot directly use set-based logic to calculate a product. Yet there is a
very simple trick – you can use EXP(SUM(LOG(…))) and get a product of several
numbers without a loop. The trick was originally posted on newsgroups by Tom Cooper.
For example, suppose that you have to ...

The following pattern is quite common in database
programming:
IF EXISTS(some query) BEGIN
DO SOMETHING;
END
When such code runs in high concurrency situations, it may
not work as expected. I will provide a repro when such logic fails 40% of the
time. The following script provides a test table and attempts to ...

Today I come into a nice bug (misfeature?) of SSIS 2008 and I'd like to share it with you. Let's start from the beginning:
One of the keys to make something speedy is to parallelize it. In TableDifference I read from two inputs, compare them and produce several outputs for new, deleted, updated rows and so on. It is the classical ...

If your data has a small percentage of duplicates, then IGNORE_DUP_KEY
may speed up your inserts. For larger amounts of duplicates, IGNORE_DUP_KEY may
slow them down significantly. I set up two tables, stripping down all the
irrelevant details, as follows:
CREATE TABLE t1(n INT NOT NULL PRIMARY KEY)
GO
CREATE TABLE ...