Post navigation

Most of the time we need to know the number of records affected by insert or delete query in current session. Popular DBMS like SQL Server or MySQL offers built in mechanism to achieve this. For example, SQL Server has @@ROWCOUNT global variable and MySQL gets ROW_COUNT function that helps getting number of records affected in last query executed in current session. Redshift does not offer such built-in function to have this data. But by querying the system tables, we can get this information easily.

Getting number of records affected in last query executed in current session using system table

To get his number we need to use the following system tables of Redshift:

In order to get number of records affected in last query executed in current session for delete operation we need another system table named STL_DELETE. Alright, now we are familiar about some required system tables. It is time to build the query:

Testing the query

To test the query we will create an empty table like STV_TBL_PERM system table of redshift using the below command:

CREATE TABLE tmp_redshift_tableinfo (LIKE STV_TBL_PERM);

Open a new query editor and execute the scripts together like the image below.

Figure: Redshift Record affect by last query

The above scripts insert 100 rows from STV_TBL_PERM system table to the newly created tmp_redshift_tableinfo table. Afterwards, our query is getting number of records affected in last query executed in current session. Also, we can get this information for delete query by simply using STL_DELETE table in the above query instead of STL_INSERT table.

Amazon Redshift suggests to define primary key or foreign key constraint wherever applicable. But they are information only. Redshift does not enforce these constraints. In other words, a column with primary key accepts duplicate values as well as a foreign key column also allows such a value that does not exists in the referenced table. So why does Redshift consider defining primary or foreign key as their best practice list? Because query optimizer uses them to choose the most suitable execution plan. But you need to be very careful while defining these constraint in your table. Let us see why with some real-life example.

Need to think before defining primary key:

As spoke earlier, it is fine to have a primary key in a Redshift table. But keep in mind it is your responsibility to make sure the column’s values are unique. Otherwise, you are likely to get wrong query result. An example gives you more insight.

You might have noticed, the emp_id is declared as primary key and in the insert scripts, ‘Bar’ is inserted twice having emp_id 2. In Redshift this is not a problem as Primary Key constraint is not enforced. Assume we have another table called “department” with following rows:
Figure-1: Department table.

Now, let’s execute the following query to see, employee name and department of employee Id = 2. The query looks like below:

We know there are duplicate values for employee id 2. So “DISTINCT” keyword is used to eliminate duplicate values from query result. Here is the output of the query:
Figure-2: Wrong query result.

See, it produces wrong result! The “DISTINCT” keyword is not honored at all! The reason behind that is Redshift assumes all values of a primary key column are inserted by ensuring the uniqueness. So database engine does not care to make them distinct during query execution. Let’s remove the primary key constraint from the table:

ALTER TABLE employee
DROP CONSTRAINT employee_pkey;

And execute the select statement once again. We are getting the expected result now!
Figure-3: Correct Result

So, you got an idea of what you need to think before defining primary key. And of course, always remember Redhsift developer guide’s following comments.

Who wants to delete rows after inserting them into tables? But to comply with business need, we delete unnecessary records from tables. After deleting large number of rows from a table, usually we like to see disk space consumed by these rows are released automatically. Popular RDBMS like SQL Server usually marks the rows affected by an immediate Delete command as Ghost records rather physically deleting them from disk at once. Afterwards, using an asynchronous internal process, database engine removes these Ghost records physically from disk. But Amazon-Redshift database does not offer such facility.You need to do it by yourself. In this post you will come to know what you must need to do after deleting a large number of rows from a Redshift Table.

The issue you may face after deleting a large number of rows from a Redshift Table.

Database developers sometimes query on the system catalog tables to know total row count of a table that contains huge records for faster response. In terms of Redshift this approach would be dangerous.Because after a delete operation, Redshift removes records from the table but does not update the catalog table metadata. Let us see an example:

I have populated a table named “lineorder” with AWS sample records. The table contains around 16 million of rows. Now let us use the following catalog table query to know the total records of the table.

After, executing the above query we get the following result:
Figure 01: Table record count using catalog query.

Now let us verify the record count by issuing the query below:

SELECTCOUNT(*)
FROM lineorder;

Figure 02: Getting row count by querying on the table itself.

Alright, we are getting the identical result by executing both queries.

In, “lineorder” table there are total 248900 rows against order date = ‘19920712’. We will delete all these rows from “lineorder” table. For this the following delete statement is executed:

DELETEFROM lineorder
WHERE lo_orderdate = '19920712'

And here is the result.
Figure 03: 248900 rows are deleted.

At this moment total records of the table will be 600037902-248900=599789002.
Let us verify this with the below SQL Statement once again:

SELECTCOUNT(*)
FROM lineorder;

Here is the result:
Figure 04: Record count after deleting records.

Alright, we got our result now that after triggering the delete query we have 599789002 rows left in the table.Now, we are going to execute the catalog query to get the record count of the table. Let’s execute the following query once again:

Figure 05: Catalog tables are not refreshed after deleting large number of rows.

Oh!! The records count we are getting by executing catalog query is 600037902!!! which was the initial records of the table (before deleting records). But total rows should be 599789002 as we have already deleted 248900 rows. That means catalog table’s metadata is not refreshed after the records are deleted from the table.

This could be huge problem if you use catalog query in your application to get total row count of a table. Sometimes it is a common practice for database developers that they query the system tables (catalog tables) in order to get record count of relatively bigger tables instead of triggering SELECT COUNT(*) statement. Because querying on system table (catalog table) is much faster than querying on the actual table. But particularly in terms of Redshift this approach would be dangerous as specified earlier in this post!!

So, what you must need to do after deleting a large number of rows from a Redshift Table. In order to solve this issue, after deleting a large number of rows from a table, it is obvious to execute VACUUM command. I executed the command like below:

VACUUM lineorder;

While VACUUM gets finished, I executed the catalog query again to get total record count of the table. The image below depicts the result:

Figure 06: Accurate record count after running VACUUM command.

Now we are getting accurate result by executing catalog query as well.

Finally, VACUUM command also reclaim storage space and resort rows according to the Sort Key defined in the table. It also a best practice to ANALYZE redshift table after deleting large number of rows to keep the table statistic up to date. Updated statistics ensures faster query execution.

Hope this information will help you in your real life Redshift development.

Note: VACUUM is a slower and resource intensive operation. Best practice is to execute it in an off peak time window. So if you really need immediate record count after a delete operation, better to use SELECT COUNT (*) instead of using catalog query.

Amazon Redshift does not support table variable. For a quick testing, we sometimes do not want to store data in a real table rather we usually declare a table variable and complete our testing. This approach also saves our time from cleaning up the unused test tables and records. Redshift supports Common Table Expression and we can use it as temp table (still have limitation). Using WITH block (CTE) multiple logical tables can be generated combining arbitrary select statements for each logical expression. For more information about this visit WITH Clause in Redshift Developer Guide. In my recent project, while doing a quick test by generating multiple logical tables with some records, I noticed Interesting aggregate query result using WITH clause in Redshift. Let me show you in details.

Generating tables to see Interesting aggregate query result using WITH clause in Redshift:

I have written a script like below:

WITH t1 AS
(
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1
UNION ALL
SELECT 30 AS col1
),
t2 AS
(
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1
UNION ALL
SELECT 30 AS col1
)

I have generated table expression “t1” and “t2” just to do an aggregate query to take the SUM of all “col1” values. So my final query is as follows:

WITH t1 AS
(
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1
UNION ALL
SELECT 30 AS col1
),
t2 AS
(
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1
UNION ALL
SELECT 30 AS col1
)
SELECTSUM(sumofcol1) AS Total
FROM(
SELECT SUM(col1) AS sumofcol1
FROM t1
UNION ALL
SELECTSUM(col1)
FROM t2
) t

After executing the above query surprisingly I got result “10” but the result is supposed to be 120. See the below image:

Figure 01: Incorrect query result.

To be confirm about the result, I have executed the query in SQL Server environment and got the expected output as 120. Have a look:

Figure 02: Correct Query result in SQL Server Management Studio.

So really an interesting aggregate query result using WITH clause in Redshift! I don’t know why it happens, I will be grateful if anyone let me know the reason of such interesting query behavior.

How to solve it.

In order to resolve the issue, I have tried executing the query by removing the “SUM” from the last inner most sub query and it gave the correct result! Have a look on the corrected one below:

WITH t1 AS
(
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1
UNION ALL
SELECT 30 AS col1
),
t2 AS
(
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1
UNION ALL
SELECT 30 AS col1
)
SELECT SUM(sumofcol1) AS Total
FROM(
SELECT col1 AS sumofcol1
FROM t1
UNION ALL
SELECT col1
FROM t2
) t

For curiosity, I wanted to see how the query behaves when temp tables (# tables) are used instead of WITH Clause in Redshift. For this I have re-written the queries as follows:

As a part of maintenance and releasing new features, “Redshift create drop table if exists” option is now available in the database after Oct 16, 2014. Previously, it was bit cumbersome to check the existence of a table prior to create or drop it.

Previous approach of Redshift create drop table if exists

Querying in the Redshift catalog or information schema table, was the only way to know if any table exists or not. For example, in past, in order to check the existence of a table the following extra query need to be issued in Redshift:

SELECTTOP 1 tablename
FROM pg_table_def t
WHERE t.tablename = 'test';

Based on the result of the above query, new tables are created or existing tables are dropped as per developer requirement.

Current approach of Redshift create drop table if exists

Amazon has realized this essential user need and addressed this issue in their most latest update. Table existence can be checked now along with CREATE TABLE or DROP table statements using new IF EXISTS keyword. The following statement creates a table if it does not exist.

CREATETABLEIF NOT EXISTS Test (ID INT);

Similarly, IF EXISTS can be used along with DROP, VIEW, SCHEMA and USER DDL statement as well. Have a look on the query below:

DROP TABLE IF EXISTS test;

The query above will delete the table named “test” if this is present in the database.I hope this information will be helpful to you.

In my previous post, I have demonstrated how Identity column behaves with different INSERT statements. In today’s post we will come to know about Identity column SEED-STEP behavior with COPY command. In other words, how identity values are generated when a file is loaded from S3 to Redshift a table that has an auto increment identity column.

As we already know that an identity column does not guarantee the values in the column will be consecutive in all cases. It also stated in earlier post that most often we need consecutive or strict sequential ID numbers. In that scenario, we need to implement some additional actions in order to maintain sequence among the ID values to comply the business need. In today’s post we will see in which cases identity column SEED-STEP with COPY command does not generate consecutive identity values. We will also see some techniques to maintain the orders or the Identity values later in this post.

Two text files and a manifest file in the S3 bucket to populate the above table using COPY command.

Figure 01: Test data and manifest files in S3.

The “order_2001.txt” file contains sales records of 2001 year and order_2002 has sales data of year 2002. In order to facilitate multiple data files loading in parallel using a single COPY command, I have created a manifest file named “order.manifest”.

All testing are done in a single dw2.large node cluster environment. Our environment is ready, now let us jump to me main test.

Identity column SEED, STEP behavior with COPY command while uploading a single file.

In order to see the identity value generated in the above table, I have executed the below COPY command to load order_2002.txt file into an empty custorders table.

Let’s see the identity values generated by the COPY command by selecting top 100 records.

SELECTTOP 100 *
FROM custorders
ORDER BY
linenumber;

The above select statement produced the below results:

Figure 02: Identity values generated with COPY command.

From the above image we found the following discrepancies in Identity column SEED,-STEP behavior with COPY command.

As we loaded the file into the empty custorders table, as per table identity definition declaration “linenumber INT IDENTITY(1,1)” the first value of linenumber column should be 1 then 2 then 3 and so on. But the first value is inserted surprisingly 4312.

Furthermore, according to the identity STEP, the next row’s value should be increased by one step. So the value should be 4312+1=4313. But values are increased by 2 steps. Hence, in first row identity column “linenumber” value is 4312, in second row it is 4314 and third one is 4316 etc.

Why identity starts from 4312 instead of 1?

In order to find out reason behind these issues of “Identity column SEED-STEP behavior with COPY command” I have had a look on the redshift developer guide but did not find any clue. After a while I found myself that there are total 2156 records in the source text file (order_2002.txt). The first generated identity value of the table is twice of the total records count, which is 2156*2=4312. So why does this happen? To find out the answer, I have tried COPY command using different parameters. Finally, I got another clue when I triggered COPY command using “COMPUPDATE OFF” option. The COPY command is now as follows:

Now it produces much better result right? In previous attempt, the identity value started from an arbitrary 4312, but this time, at least it starts from 1. But, still, the identity values are incremented by 2 steps which is supposed to be 1. We will discuss about this issue later, but now let me tell you the reason why “using ‘COMPUPDATE OFF’ parameters with COPY command” generates initial identity value = 4312 instead of 1 during the first attempt (see Figure02) (4312 is twice of the total records count).

“COMPUPDATE” parameter in COPY command automatically decides the optimal compression encodings for each column of the target table based on ON|OFF settings. If COPY command is executed without “COMPUPDATE” option, COPY applies automatic compression on an empty target table. For this reason COPY command completes in two steps (passes). During first pass, identity values are incremented internally. In the second pass (step) it further increments the value while committing records to the target table. This could be reason why we did not get the initial identity value as per our expectation.

Why identity value increments by 2 instead of 1?

If you have a look on both “Figure 02” and “Figure 03”, you will find identity values are incremented by 2 instead of 1. I have single node in Redshift cluster. The node has two slices. As per Redsihft developer guide data are loaded in parallel in slices when multiple files are loaded using a single COPY command. COPY command loads data in sequential mode when a single file is pointed in COPY command. During data load, COPY command increments identity in such a manner so that the identity values of each slice are not overlapped with values of another slice. For this purpose COPY command would generate identity value in odd, even basis like below:

For slice one: 1, 3, 5…..n
For slice tow: 2, 4, 6…..n

Now let us do another test to validate this fact. As stated earlier in the “Environment” section (see Figure 01) that we have three files. The order_2011.txt contains sales data of year 2001, order_2002.txt file has sales data of year 2002 and a manifest file to upload both files using a single COPY command. The manifest file looks like below:

Splendid!! Identity column SEED-STEP behavior with COPY command is now as per our expectation. It generates initial value with 1 and the subsequent values are incremented by 1 as well. There is no missing, no gaps amongst identity values. The per slice identity allocation also clear to us. Executing the following query will make more sense regarding per slice records and identity value distributions.

SELECT slice,
num_values AS total_rows_per_slice,
minvalue AS min_identity_val,
maxvalue AS max_identity_val
FROM svv_diskusage
WHERE NAME = 'custorders'
AND col = 0
ORDER BY
slice,col;

From the above image, we get a clear idea about how redshift determines slice wise identity values. It seems COPY command loads the both files in parallel and allocates identity values as follows:

Loads “orders_2002.txt” files in slice one and allocates identity: 1, 3, and 5…..up to 4309
Loads orders_2001.txt in slice two and allocates identity: 2, 4, 6…..up to 4310

Finally, from the aforementioned test, we came to know loading files in parallel using COPY command with “COMPUPDATE OFF” option may generate expected identity values. It also comply with Identity column SEED-STEP behavior with COPY command. But in different scenario (multiple node cluster) Identity column SEED-STEP behavior with COPY command might be different. In that case, ROW_NUMBER window function can be used to generate sequential values. Another option could be to ask file provider (or client) to include line number in the file if this is a strict business need. Then the file can be loaded using COPY command with “EXPLICIT_IDS” parameter into an identity enabled redshift table.

During table design auto increment column is one of the best choices among database developers. Most of the cases an auto increment identity column is considered as the primary key of the table. It is also widely used as a surrogate key of dimension tables in a typical data warehouse system. Identity column SEED-STEP are used to generate the sequential values in the table. An identity column takes the value of current seed incremented by the step when a row is inserted into a table.

Identity column SEED, STEP can be used with CREATE TABLE statement in Amazon Redshift. A point to be noted that an identity column does not guarantee the values in the column will be consecutive in all cases. In real life, sometimes we need consecutive or strict sequential ID numbers. In that scenario, we need to adopt some additional measures in order to maintain sequence among the ID values to comply the business need. In today’s post we will see in which cases of INSERT statement identity column SEED-STEP does not generate consecutive identity values. We will also see some techniques to maintain the orders of the Identity values.

These scripts are executed one part at a time, PART-1 first then PART-2 and finally PART-3. The result produces uninterrupted sequential identity values (line numbers). See below image:
Figure 02: Identity column generates uninterrupted unique values.

Now, let us execute all insert scripts at a time (PART-1 + 2 +3). Prior to do such testing I have dropped and recreated the “custorders” table. Executing all three SQL insert statement parts at a time produces the following result:

Possible reason of this issue could be the parallelism. Several months ago, I have combined records of more than 10 big tables using UNION ALL in order to make a PIVOT table in SQL SERVER. During query execution, I have noticed all of my CPU cores (1 socket, 4 cores, and 8 logical processors) were busy. To know more about that I conducted a Google search and found that optimizer might execute queries in parallel when multiple queries (branches) combined to a single SQL statement. We know Redshift is designed to facilitate huge parallel processing. Although it is not documented anywhere, but I presume, Redshift attempts to insert each SQL statement of UNION ALL in parallel into each node slice. In order to ensure that there is no single point of conflict in assigning identity values to rows during parallel insertion, Redshift might maintain gaps while allocating identity value for each rows. This is only my thoughts based on previous experience. Please inform me if anyone has more concrete idea about this behavior.

How to solve the issue:

There is not straight forward way that can resolve this identity missing issue. A new table can be created to copy all data from the existing table with CREATE TABLE AS statement using ROW_NUMBER window function like below:

Please be noted the above approach may not suitable because CREATE TABLE AS does not inherits IDENTITY property from parent table. In that case create a VIEW over the table using the same ROW_NUMBER window function would be the perfect choice. See below:

Finally, in this post we have had a look on Identity column SEED-STEP behavior in different insert statement and some possible ways to resolve this issue. In one of my subsequent posts I will try to demonstrate the Identity column SEED-STEP behavior when data files are uploaded using Redshift COPY command. Till then thanks lot to read this post with patient. Hope this post is informative to you.

In one of my earlier posts, I have discussed about different approaches to create tables in Amazon Redshift database. Among these approaches, CREATE TABLE AS (CATS) and CREATE TABLE LIKE are two widely used create table command. Each command has its own significance. In this post, the differences, usage scenario and similarities of both commands will be discussed.

Both CREATE TABLE AS (CATS) and CREATE TABLE LIKE command can not create table independently. In other words, CREATE TABLE AS, CREATE TABLE LIKE command can create a table by copying column settings and records (CATS only) from and existing table. Both commands can be used in following scenario.

In Redshift, there is no way to include sort key, distribution key and some others table properties on an existing table. The only way is to create a new table with required sort key, distribution key and copy data into the that table. Using both CREATE TABLE AS and CREATE TABLE LIKE commands, a table can be created with these table properties. For an example:

The following command creates a new table with Sort Key, Distribution Key and inserts three rows into the table.

Now to serve the business we will need to include “category” along with existing sort key product_name and also want to change the distribution key as product_id. Hence the statement portion will be as follows:

DISTKEY (product_id),
SORTKEY(product_name,category)

As Redshift does not offer any ALTER TABLE statement to modify the existing table, the only way to achieve this goal either by using CREATE TABLE AS or LIKE statement. The below CREATE TABLE AS statement creates a new table named product_new_cats.

Now to the following command is used to get the records of the new “product_new_cats” table.

SELECT * FROM product_new_cats

The result is as follows:Figure 01: All records in product_new_cats

Now, we become sure, CATS statements copied all records from product table into the product_new_cats table. But what about sort key, distribution key and other settings? Let’s execute the following two commands:

From the above tow images, we found CREATE TABLE AS successfully created new sort and distribution keys. This command also inherits these settings from parent table. But one thing needs to point out here, CREATE TABLE AS command does not inherits “NOT NULL” setting from the parent table. Here, all columns of product_new_cats table are created as NULL(see Figure 03). But all columns of parent “product” table were declared as “NOT NULL” (Figure 02).

Alright, so far we have an idea about how “CREATE TABLE AS” command behaves. Now we will notice what happens when we create table using “CREATE TABLE LIKE” statement. Let’s execute the following scripts:

The above statements creates a table named “product_new_like” using CREATE TABLE LIKE statement and later command select all records from the newly created table. CREATE TABLE LIKE does not copy data from source table. So the SELECT * command will not return any rows. But it inherits columns settings. Let’s execute the SQL statement below and have a look the result:

By comparing output of “Figure 02” and “Figure 04” we see CREATE TABLE LIKE statement also inherits sort key, distribution key. But the main point to to note here that, CREATE TABLE LIKE command additionally inherits “NOT NULL” settings from the source table that CREATE TABLE AS does not.

Primary key, identity and default constraint behavior when table is created

using CREATE TABLE AS and CREATE TABLE LIKE:

In order to check whether CREATE TABLE AS and CREATE TABLE LIKE statement inherits primary key, default constraint and identity settings from source table or not.the following scripts can be executed.

To check Primary key:

SELECT DISTINCT
NAME AS tablename,
pc.conname AS constraint_name
FROM stv_tbl_perm tp
LEFT OUTER JOIN pg_constraint pc
ON pc.conrelid = tp.id
WHERE name IN ('product', 'product_new_cats', 'product_new_like')

Figure 05: CATS and LIKE does not inherits primary key.

From the above image, we can see both CREATE TABLE AS, CREATE TABLE LIKE do not inherit primary key constraint from source table.

To check Default and Identity:

SELECT DISTINCT p.name AS tablename,
attname AS columnname,
a.atthasdef t
FROM pg_attribute a
INNER JOIN stv_tbl_perm p
ON p.id = attrelid
WHERE name IN ('product','product_new_cats','product_new_like')
AND a.atthasdef = 1

Figure 06: CATS and LIKE does not inherits default constraint and identity.

The above query is used to select default constraint and identity column from all three tables (product, product_new_cats,product_new_like). But we found only the source table , product is returned here. CREATE TABLE AS, CREATE TABLE LIKE does not inherit default value as well as identity settings. CREATE TABLE LIKE has an option to copy “DEFAULT” expression from the source table by using “INCLUDING DEFAULTS”.

Amazon Redshift database is getting popular day by day specially for its super fast performance. During database development or administration everyone needs to create table either permanently or temporarily. Developers usually create tables using DDL like “CREATE TABLE” statement. But sometimes they need to create table quickly, without issuing the long, restricted DDL statements for different purposes. In this post I will try to show some techniques to create new tables in Redshift database.

Using DDL:

DDL is the most common approach to create tables in Redshift database. A typical DDL command is as follows:

Using SELECT INTO:

SELECT INTO is another way to create new tables. SELECT INTO facilitates fast bulk data copy to new tables from an existing table in the most easiest way. This approach also keeps the column names and data type same as the source table. Let us see an example:

The above statement creates a new table named “adultpatients” by selecting records from patients table considering the where clause.

Using CREATE TABLE LIKE:

When you want to create a table by copying columns names, data type, default constraint, NOT NULL attribute from an existing table, CREATE TABLE LIKE is the your best choice. The following statement creates a “product_backup” table by copying settings from the exiting “product” table.

CREATETABLEproduct_backup (LIKEproduct)

Please be noted, CREATE TABLE LIKE does not copy data to the newly created table as SELECT INTO and CREATE TABLE AS command does.

CREATE TABLE AS (CATS):

Alike SELECT INTO, CREATE TABLE AS (CATS) statements also create a new table along with data. In order to create a table using CATS command, you need to provided a query. The newly created table is populated with data as per the query defined in the CREATE TABLE AS (CATS) statement. The statement below can can create a table using CATS syntax.

The above query creates “product_backup” table based on the query provided inside the “AS” clause.

So, these are the commonly used approaches to create tables in Redshift based on different scenario. Hope this information will be helpful. In my future posts, I have a plan to describe about the in depth functionality, similarities and unique behaviors of each CREATE TABLE approach.

Redshift COPY command is the recommended and faster way to load data files from S3 to Redshift table. COPY has several parameters for different purposes. NOLOAD is one of them. When NOLOAD parameter is used in the COPY command, Redshift checks data file’s validity without inserting any records to the target table. About NOLOAD options, redshift developer guide says as follows:

“Checks the validity of the data file without actually loading the data. Use the NOLOAD option to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD option is much faster than loading the data since it only parses the files.”

As per above comments, if we need to check data file validity, NOLOAD option in COPY command offers faster performance. I have done some practical testing regarding this and in this post I am going to share my findings with you.

The table has 3 columns all are INTEGER type. In the error test files “salesamount” column contains error records like below:

Figure 01: Error records in text file

You can see, in the text file “salesamount” column contains dollar sign ($) end of the amount. As the column data type is Integer (not Currency), the table will generates error while inserting these values (1000$, 1500$).

Testing approaches to check data file validity:

Now you have an idea about the files and Redshift table. In order to check the validity of the data file and to know the impact of NOLOAD parameter, I have executed the COPY command considering the following scenario:

1. Using Good Data file:

Triggered COPY command to take the time without NOLOAD options. The command gets succeeded like below:Figure 02: COPY Command using good data file (NOLOAD option not included)

Uploaded the data file using COPY command to take the time with NOLOAD option. This command also successfully uploaded 100 million records into the target table.Figure 03: COPY Command using good data file with NOLOAD option.

2. Using Error Data File which has error records at the middle of the file:

Tried to upload the error data file using COPY command without NOLOAD option. The COPY command generates error when attempted to load at line # 49939851 (error contains middle of the file) and rollback the data loading process. The following image shows that:Figure 04:Error data in middle – without NOLOAD.

Attempted to upload the data file using COPY command with NOLOAD option. This attempt also failed because the error records contains at middle of the file (line # 49939851). So no records are loaded in the table because only “COPY ANALYE” command is executed. Have a look on the below image:Figure 05:Error data in middle – with NOLOAD (COPY ANALYZE is executed)

3. Using Error Data File which has error records at the end of the file:

Tried to upload the error data file using COPY command without NOLOAD option. This time COPY command generates error at the end of the file (line number = 99996861). No records are loaded. The error message image is as follows:Figure 06: Error data at the end – without NOLOAD

Attempted to upload the error data file (error records at the end of file) using COPY command with NOLOAD option. COPY command gets failed as usual at line 99996861 (near the maximum 100000000). The error image is like below:Figure 07:Error data at the end – with NOLOAD (COPY ANALYZE is executed)

At this stage, COPY command execution is finished in different scenario to get an idea about the way to check the validity of the data file using NOLOAD option. In the above images, you might have been noticed the COPY “Run Time”. For example in Figure 06, you find “Run Time”= 58.84 seconds (58 seconds, 84 milliseconds). This “Run Time” is the COPY command. This is should not be considered as exact duration.

I have stated in my earlier posts (Redshift COPY command- How to get actual data loading time) that when a COPY command is triggered, actual data loading process does not start immediately. Redshift performs some analysis on the target redshift table. In order to get the actual data loading duration we need to execute a SQL script. The SQL script is present in that post. I have executed that SQL statement using the “Query Id” that is generated after COPY command execution for each scenario. For example in “Figure 05” we got a Query Id=222494 when tried to upload the error (error records at end of the source file) data file using COPY command without NOLOAD option.
So the SQL statement using Query Id=222494 is as follows:

SELECTSUM("analyze_compression") AS "analyze_compression",
SUM("analyze_redshift_table") AS "analyze_redshift_table",
SUM("actual_copy_command") AS "actual_copy_command",
SUM("fetch_sample_data_of_target_table") AS "fetch_sample_data_of_target_table",
SUM("transaction_commit") AS "transaction_commit"
FROM(
SELECTCASEWHEN t.text LIKE 'analyze compression phase%'
THENdate_diff('sec',starttime,endtime )
END AS "analyze_compression",
CASEWHEN t.text LIKE 'Analyze%' or lower(t.text) LIKE 'copy analyze%'
THENdate_diff('sec',starttime,endtime )
END AS "analyze_redshift_table",
CASEWHENlower(t.text) LIKE 'copy%from%'
THENdate_diff('sec',starttime,endtime )
END AS "actual_copy_command",
CASEWHEN t.text LIKE 'padb_fetch_sample%'
THENdate_diff('sec',starttime,endtime )
END AS "fetch_sample_data_of_target_table",
CASEWHEN t.text LIKE 'COMMIT%'
THENdate_diff('sec',starttime,endtime )
END AS "transaction_commit"
FROM svl_statementtext t
WHERE t.xid=(SELECT xid
FROM stl_query
WHERE query= 222494)
) AS t

After executing the SQL statement for all Query IDs of all scenario, I prepared the following summary of the duration matrix.

Figure 07: Duration matrix of different COPY attempt.

From the above information my observations are as follows:For good file (no error records in file):

For Good file “COPY Command using NOLOAD” is slower than “COPY without NOLOAD” parameter.

If source data file contains no error records, NOLOAD parameter performs slower than not to use the NOLOAD option in COPY command. For example, in the above table, if you have a look on the total duration of “Good File (No error records in file)”, you will find:

2. For Good File, “COPY using NOLOAD” consumes more time for table analysis.

When “NOLOAD” option is used system spends much time (48 seconds) for table analysis (For example: COPY ANALYZE testsales). It also takes similar time (48 seconds) to execute the actual COPY command. Using NOLOAD in COPY command does not insert any records to the target table.

Error records at the middle or at the end of the file:

In both cases, COPY command using NOLOAD performs faster. I took 24 seconds when NOLOAD is used and it took 31 seconds when NOLOAD is not used for the data file that has error records at the middle of the file. Similarly, for the file that contains error records at the end, 45 seconds are required with NOLOAD and 60 seconds without using NOLOAD parameter in the COPY command.

When source data file contains error records and NOLOAD option is used to check the validity of the data file, Redshift only executes “COPY ANALYZE” command. As a result only table analysis is performed, no actual COPY command is executed. So “Actual Copy Command Duration” is showed “NULL” in the above duration matrix table.

Finally, from the above testing steps, the summarized findings are as follows:

For data file that contains no error records, COPY command with NOLOAD option took more time than executing only COPY command without NOLOAD.

For good data file, if NOLOAD option is used, both “COPY ANALYZE” and “COPY … FROM” command gets executed. Though “COPY … FROM” does not import any data but it consumes time.

Performance measurement to check the validity of data file using “NOLOAD”, depends on where the error records exists in the data file. If error records exists beginning of the file, data file validity checking is finished earlier. In worst case scenario, if error records resides near end of the file, data file validity checking duration is much higher.

Special note:

Please be noted this test results, observations and findings related to “NOLOAD” option are very specific in a particular scenario. Test outcome may heavily vary for different data files, number of columns of the target table and data type of the table columns. So, if you want to check data file validity using “NOLOAD” option, please test it using your domain specific source data file in your own environment.