Jul 6, 2016

Sometimes there is a need to populate | insert a large volume of data from application code(c#, Java, vb.net) to a table in the database (SQL Server). If data volume is less then performance is not a big concern. But if the data volume is big we must have to consider the most efficient approach. Let us discuss different approaches and their advantages and disadvantages:

Looping over insert statement:

One of the easiest ways to write an insert statement inside a loop. And it will populate records one by one in the database. For example:

foreach (DataRow row in table.Rows) {

using (SqlConnection connection = new SqlConnection(connectionString))

using (SqlConnection connection = new SqlConnection(connectionString))

{

SqlCommand cmd = new SqlCommand(Query);

cmd.CommandType = CommandType.Text;

cmd.Connection = connection;

connection.Open();

cmd.ExecuteNonQuery();

}

In the above approach database is called only one time but inside database rows are being inserted one by one. So it is rows based approach and it is not efficient in case of a large volume of data.

Table value constructor:

In SQL server table value constructor is set based approach. And it is very efficient. Instead of the full insert statement, we can save the query string in table value constructor format. For example:

string Query = @"INSERT INTO Users(Name, Phone, Age)

VALUES ('Name_1', 'Phone_1', 'Age_1')

VALUES ('Name_2', 'Phone_2', 'Age_2')

.....

.....

VALUES ('Name_n', 'Phone_n', 'Age_n')"

Although this approach is very efficient but limited to 1000 rows. If there are more than 1000 rows above approach is not going to work.

Passing Data table:

One of the best approaches is passing the data table as a parameter to stored procedure. For example:

Create a table type in SQL server:

CREATETYPE UserType ASTABLE(

Name VARCHAR(50)NOTNULL,

Phone VARCHAR(10),

Age INT

)

Create a stored procedure which accepts UserType as a parameter:

CREATEPROCEDURE AddUsers(

@UserTable UserType READONLY

)

AS

BEGIN

INSERTINTO Users

SELECT*FROM @UserTable

END

Using C# code, execute stored procedure and pass data table as a parameter:

using (SqlConnection conn = new SqlConnection(conn))

using (SqlCommand cmd = new SqlCommand("dbo.AddUsers", conn))

{

var dt = new DataTable();

cmd.CommandType = CommandType.StoredProcedure;

cmd.Parameters.Add(new SqlParameter("@UserType", dt));

conn.Open();

cmd.ExecuteNonQuery();

}

Using bulk insert:

SQL server support bulk inserts that loading the data from a file. For this first, we have to download or copy data file at the specified location using programming language. Then using BULK INSERT command data can be imported into a table. For example:

BULKINSERT EXACT_HELP.dbo.Users

FROM'C:\Data\Users.txt'

WITH

(

FIELDTERMINATOR=' |',

ROWTERMINATOR=' |\n'

);

We can also do by using C# SqlBulkCopy.

Pass data in the XML format in SQL server:

In your programming language, first of all, covert whole data table in XML format. Pass that XML to a stored procedure to populate the table. For example:

These are common questions when we need to get data using more than one table. There is not a straightforward answer to these questions. Here we will discuss many aspects of this.

1. Use joins when we need to get data from both the tables in a SELECT statement.

2. Use subquery when we need to get data from only one table and another table is used only to check existence. For example:

Let's assume we have two tables: tblEmp and tblEmpDetail with the following schema:

CREATETABLE tblEmp(

ntEmpID BIGINTPRIMARYKEYIDENTITY,

vcName VARCHAR(50),

ntAge INT

)

CREATETABLE tblEmpDetail(

ntEmpDetailID BIGINTPRIMARYKEYIDENTITY,

ntEmpID BIGINTREFERENCES tblEmp,

moSalary MONEY,

dtDOJ DATETIME

)

--Inserting some recodes into it

INSERT tblEmp

SELECT

LEFT([Text],20),

severity

FROMSys.messages

INSERT tblEmpDetail

SELECT

ntEmpID,

2000 *(ntEmpID % 10),

GETDATE()-(ntEmpID % 30)

FROM tblEmp

Inner join: By using inner join we can select fields of both the tables. For example:

SELECT

E.ntEmpID,

E.vcName,

E.vcName,

ED.moSalary,

ED.dtDOJ

FROM tblEmp E INNERJOIN tblEmpDetail ED

ON E.ntEmpID = ED.ntEmpID

Subquery: we can get the data from the only table. For example:

SELECT

E.ntEmpID,

E.vcName,

E.vcName

FROM tblEmp E

WHERE ntEmpID IN(SELECT ntEmpID

FROM tblEmpDetail)

Here we cannot include fields of table tblEmpDetail in a select list. So if we need to get the data from fields of both tables join is the only solution. It is not possible by the subquery.

When we need to get the data from only one table by checking the existence in another table then it is possible by both joins as well as a subquery. In this case which one will perform better inner join or subquery?

Here we have written the same query in four different ways:

1.

--Using inner join

SELECT E.*

FROM tblEmp E INNERJOIN tblEmpDetail ED

ON E.ntEmpID = ED.ntEmpID

2.

--Using IN clause

SELECT*

FROM tblEmp

WHERE ntEmpID IN(SELECT ntEmpID

FROM tblEmpDetail

3.

--Using EXISTS clause

SELECT*

FROM tblEmp E

WHEREEXISTS(SELECT*

FROM tblEmpDetail

WHERE ntEmpID = E.ntEmpID)

4.

--Using correlated subquery

SELECT*

FROM tblEmp E

WHERE ntEmpID =(SELECT ntEmpID

FROM tblEmpDetail

WHERE ntEmpID = E.ntEmpID)

Let's compare the actual execution plan:

Query 1, 2 and 3 have same execution cost while query 4 is costliest. Also, query 2 and 3 have an exactly same execution plan. Let me explain it one by one.

Why query 4 is costliest compare to others?

If we will observe its execution plan we will find it has index spool operator which cost is 95% and has an assert operator. It is due to equality arithmetic operator that is:

ntEmpID =(SELECT ntEmpID

FROM tblEmpDetail

WHERE ntEmpID = E.ntEmpID)

SQL server doesn't know how many rows will be returned by this subquery. If it is more than one then it will throw an error: Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <=, >, >= or when the subquery is used as an expression.

So, to validate this, SQL server has followed following extra steps:

1. Index spool: It creates a copy of a table in tempdb which satisfy WHERE ntEmpID = E.ntEmpID and create an index on it.

2. Stream Aggregate: It counts the total records in tblEmpDetail for each ntEmpID something like this [Expr1006] = Count (*)

3. Assert: It checks if record count is one then it is fine otherwise throw error message that is CASE WHEN [Expr1006]>(1) THEN (0) ELSE NULL END

Note: Even top one clause will not help us. For example:

ntEmpID =(SELECTTOP(1) ntEmpID

FROM tblEmpDetail

WHERE ntEmpID = E.ntEmpID)

Conclusion: Never write subquery with equal to an operator for this scenario.

Solution: Create unique index on the field tblEmpDetail.ntEmpID. For example:

CREATEUNIQUENONCLUSTEREDINDEX NCI_ntEmpID

ON tblEmpDetail(ntEmpID)

Now if you will check the execution plan, you will find all four queries have equal execution cost (25%).

Suppose there is not any unique index on the field tblEmpDetail.ntEmpID so I am dropping it:

DROPINDEX NCI_ntEmpID ON tblEmpDetail

Now compare the execution plan of query 1, 2 and 3:

Query 1: It uses Hash Match (Inner Join)

Query 2 and 3: It uses Hash Match (Right Semi Join)

Operator Inner Join can get data from both tables while operator Right Semi Join can get the data from an only right table.

Conclusion: Inner join has more flexibility than a subquery. It can select the data from both tables as well as only from one table with same query cost just like subquery. For example:

--Inner Join 1

SELECT*

FROM tblEmp E INNERJOIN tblEmpDetail ED

ON E.ntEmpID = ED.ntEmpID

--Inner Join 2

SELECT E.*

FROM tblEmp E INNERJOIN tblEmpDetail ED

ON E.ntEmpID = ED.ntEmpID

Actual execution plan:

Definitely Inner join 1 will more costly than inner Join 2 since it is selecting more numbers of columns.

Is there any other difference between inner join and subquery except inner join get data from both tables while not by subquery?

We are going to check an inserting observation. Comparing the execution plan of the following query:

--Using inner join

SELECTTOP(10) E.*

FROM tblEmp E INNERJOIN tblEmpDetail ED

ON E.ntEmpID = ED.ntEmpID

--Using IN clause

SELECTTOP(10)*

FROM tblEmp

WHERE ntEmpID IN(SELECT ntEmpID

FROM tblEmpDetail)

Actual execution plan:

So, in this case, an inner join is performing better than a subquery. Now we are inserting few duplicates ntEmpID into the tblEmpDetail:

INSERT tblEmpDetail VALUES

(1,5000,GETDATE()),

(2,7000,GETDATE())

Comparing the execution plan of the same query:

--Using inner join

SELECTTOP(10) E.*

FROM tblEmp E INNERJOIN tblEmpDetail ED

ON E.ntEmpID = ED.ntEmpID

--Using IN clause

SELECTTOP(10)*

FROM tblEmp

WHERE ntEmpID IN(SELECT ntEmpID

FROM tblEmpDetail)

Now in this case subquery is performing better than inner join.

Why there are differences? Sometimes an inner join performs better and sometime subquery performs better. To understand this we are going to create two tables with fewer records and checking the output:

CREATETABLE tblStu(

ntStuID BIGINT,

vcName VARCHAR(50),

ntAge INT

)

CREATETABLE tblStuDetail(

ntStuDetailID BIGINT,

ntStuID BIGINT,

moSalary MONEY

)

Now we are inserting three records in both the tables:

INSERT tblStu VALUES

(1,'Scott', 18),

(2,'Greg', 21),

(3,'Davis', 22)

INSERT tblStuDetail VALUES

(1, 1, 5000),

(2, 2, 8000),

(3, 3, 6000)

Now, both the table has one to one relationship for ntStuID. That is one student has one detail and one details belong to one student.

Note: But SQL server doesn't know both tables has one to one relationship. To tell SQL server, we have to create a unique constraint in ntStuID of both tables. We can do it by making it unique key, Primary key or by creating a unique index.

Let's check output of following queries:

--Inner join

SELECT S.*FROM tblStu S INNERJOIN tblStuDetail SD

ON S.ntStuID = SD.ntStuID

--Subquery

SELECT*FROM tblStu S

WHERE S.ntStuID IN(SELECT SD.ntStuID

FROM tblStuDetail SD WHERE S.ntStuID = SD.ntStuID)

Inner Join:

ntStuID

vcName

ntAge

1

Scott

18

2

Greg

21

3

Davis

22

Sub query:

ntStuID

vcName

ntAge

1

Scott

18

2

Greg

21

3

Davis

22

We will get an exactly same output of both the query.

Now we are making the one to many or many to one relation between the tables. So we are going to insert few duplicate records into the table tblStuDetail:

INSERT tblStuDetail VALUES(4,2,5000),(5,2,8000)

Note: Again SQL server doesn't know both tables has one to many relationships or not. To tell SQL server, we have to create a unique constraint in ntStuID in only one table. We can do it by making it unique key, Primary key or by creating a unique index.

Let's check output of same queries:

--Inner join

SELECT S.*FROM tblStu S INNERJOIN tblStuDetail SD

ON S.ntStuID = SD.ntStuID

--Subquery

SELECT*FROM tblStu S

WHERE S.ntStuID IN(SELECT SD.ntStuID

FROM tblStuDetail SD WHERE S.ntStuID = SD.ntStuID)

Inner Join:

1

Scott

18

2

Greg

21

3

Davis

22

2

Greg

21

2

Greg

21

1

Scott

18

Sub query:

ntStuID

vcName

ntAge

1

Scott

18

2

Greg

21

3

Davis

22

Now inner join is returning more records than a subquery. Both queries have different output!! How can we compare the performance if both queries return different output? They are two different queries.

Last, now are going to make it many to many relationships between them. For this we are also going to insert few duplicate records into the table tblStu:

INSERT tblStu VALUES

(1,'Alan',19),

(2,'Sam',21)

Note: If two tables have not any primary key, unique key or unique index on column SD.ntStuID, SQL server assume both tables have many to many relations.

Let's again check output of same queries:

--Inner join

SELECT S.*FROM tblStu S INNERJOIN tblStuDetail SD

ON S.ntStuID = SD.ntStuID

--Subquery

SELECT*FROM tblStu S

WHERE S.ntStuID IN(SELECT SD.ntStuID

FROM tblStuDetail SD WHERE S.ntStuID = SD.ntStuID)

Inner Join:

ntStuID

vcName

ntAge

1

Scott

18

2

Greg

21

2

Greg

21

2

Greg

21

3

Davis

22

1

Alan

19

2

Sam

21

2

Sam

21

2

Sam

21

Sub query:

ntStuID

vcName

ntAge

1

Scott

18

2

Greg

21

3

Davis

22

1

Alan

19

2

Sam

21

Again, inner join returning more records than a subquery. Also subquery returning duplicate recodes. Both queries have different output. In this case, we cannot compare the performance between subquery and inner join since both queries have different output.

Question: If two tables have one to one relation for a key column and we want to get data from only one table in this case inner join will perform better or subquery?

Answer: As I know the performance of both the query will exactly same. Till now, I didn't find any differences if you please share with us.

Best practice: Inner join vs. subquery

1. If we need the data from both the tables we must have to choose inner join. If we need data from only one table then we can choose either subquery or inner join.

2. If two tables have one to many, many to one or many to many, subquery and inner join may have different output. So choose according to your application requirement.

3. If two tables have one to one relationship then you can choose either subquery or inner join since query optimizer will follow same execution plan for both of them. (Thanks to SQL server query optimizer)

4. Following subqueries are equivalent from query performance point of view:

a.

SELECT*FROM tblStu S

WHERE S.ntStuID IN(SELECT SD.ntStuID

FROM tblStuDetail SD WHERE S.ntStuID = SD.ntStuID)

b.

SELECT*FROM tblStu S

WHERE S.ntStuID IN(SELECT SD.ntStuID

FROM tblStuDetail SD)

c.

SELECT*FROM tblStu S

WHEREEXISTS(SELECT*

FROM tblStuDetail SD WHERE S.ntStuID = SD.ntStuID)

An SQL server query optimizer is smart enough that it will choose same execution for all.

5. Don't use subquery with equality operator unless there is not one to one relation between two tables. For example:

SELECT*FROM tblStu S

WHERE S.ntStuID =(SELECT SD.ntStuID

FROM tblStuDetail SD

WHERE S.ntStuID = SD.ntStuID)

Otherwise, it will decrease the performance or cause of an error.

6. If there is an aggregate function in subquery then there is no direct equivalent query using an inner join. For example:

SELECT*FROM tblStu S

WHERE S.ntStuID =(SELECTMAX(SD.ntStuID)

FROM tblStuDetail SD)

So you must have to use a subquery.

7. If a query is either inner join or subquery, internally is handled by any of these operators:

a. Nested loop join

b. Merge join

c. Hash join

So, technically there is no difference between inner join and subquery.

8. Consider on two inner join syntax:

--New way

SELECT S.*FROM tblStu S

INNERJOIN tblStuDetail SD

ON S.ntStuID = SD.ntStuID

--Old way

SELECT S.*

FROM tblStu S ,tblStuDetail SD

WHERE S.ntStuID = SD.ntStuID

There are not any differences except syntax. I will prefer using ON clause to write join condition instead of WHERE clause. If we will miss the ON clause:

SELECT S.*

FROM tblStu S

INNERJOIN tblStuDetail SD

We will get an error message: Incorrect syntax near 'SD'.

And we can correct it. What will happen if we miss the WHERE clause:

SELECT S.*

FROM tblStu S ,tblStuDetail SD

We get wrong output without any warning. The worst thing is if one or both table has too many records this query will be too costly and locked the both table for a long period of time. So I suggest don't write join query in the old way.