Monday, July 15, 2013

Impact of SELECT * versus SELECTing specific columns

A couple of weeks ago, I attended a DB2 for z/OS technical conference and one of the sessions focused on tuning SQL statements and other topics related to DB2 performance. For the records the presentation name was “DB2 10 for z/OS: selected performance topics” and the presenter was Cristian Molaro (Independent DB2 Consultant, IBM Gold Consultant & IBM Champion).

Anyway, I learned that there is a notable performance impact when doing a simple and dirty SELECT * compared to selecting only the specific columns you need. There was no benchmarks however, so I thought it would be a good test to perform, create some simple tests with various tables and measure the impact of this SELECT * bad coding practice. First thing to mention, the performance degradation is not visible if we are dealing with small tables (small number of rows). My tests with a 1000 rows table did not show any difference. But when I performed tests on a more realist size of table (1,000,000 rows), I did notice meaningful differences.

The below graphs represents the results of a performance test (CPU time, and Elapsed time) for a 1,000,000 rows table, residing in a DB2 v10 New Function Mode (NFM) subsystem. This represents a fairly simple table, with 7 columns only (integer, dates, and timestamps), and compared a SELECT * with SELECT COL1, COL2.

The results of this test speak for themselves, there is a 70% overhead of CPU time consumed, and a 17% overhead of Elapsed time spent when using the SELECT * statement, i.e. when retrieving all columns, some of which are probably not needed by “the application”. Other tests performed show similar results, granting that obviously the overhead of SELECT * varies depending on the number of columns specified in the SELECT COL1, COL2, … and the size of the table (number of rows).

Digging deeper, I used a product a few years ago that examines embedded SQL statement in Cobol programs: CA Plan Analyzer (it also works for other programming languages). While writing this blog article, I went back to this tool, to see if it would detect a SELECT * statement. It did. Not to enter into too much details about this tool, there is a rule (sort of trigger) called Expert Rule 0064 that will be triggered if an embedded SELECT * SQL statement is discovered in the application / plan / package analyzed. In addition to the performance impact mentioned detailed above, CA Plan Analyzer also notifies the user with the following recommendation (which makes a lot of sense, and IMHO another good reason why application developers should not use SELECT * type of SQL statements in their application) :

This should be avoided because of problems that can be encountered

when adding and removing columns from the underlying table(s). Your

application program host variables will not correspond to the

added/removed columns.

In a nutshell, if you are a DB2 application developer and you care about your applications performance : do not use SELECT * statements !

If you are a DB2 administrator, you may want to share this post with your application developers, and / or look for products that can detect the use of embedded SELECT * statements running in your DB2 environment.