Why does session B wait on the U lock on JobMaster's PK index held by A, while B is holding the X lock on the clustered index? I mean, since B already has the X lock, which is the lock needed to update the clustered index, it seems B should have no other locking needs?

What is the sequence of lock requests/releases in the three sessions? My understanding is, for session B: U locks on JobMaster's PK index and clustered index, then the X lock on the clustered index. Correct me if I am wrong.

Since we don't modify the non-clustered indexes in any of the queries, why do they need a U lock on the index? And why is C waiting for A, when C already has the U lock?

Answers

Load test? That may explain the four-digit spids in the deadlock trace. They did raise my eyebrows a bit.

You need to rework your retry logic. I suspect that you are using something like TransactionScope, because as I noted, all transactions had started before the current batch. Whether you actually need TransactionScope I don't know, but you obviously need it if the client runs several database calls in the same transaction.

Retry is not as trivial as it sounds. The code where you perform the retry must ascertain that there is no transaction in progress on entry. This can be implemented in two ways:

1) The method checks for active transactions on entry (see the sketch after this list).
2) By convention, the layer that performs the retry is also the outermost layer that is permitted to start transactions.
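
As a minimal T-SQL sketch of the check in 1) (the error message is mine), the retry code could refuse to run when a transaction is already open:

-- Guard at the top of the retry wrapper: refuse to retry mid-transaction.
IF @@trancount > 0
    RAISERROR('A transaction is already in progress; retry is not safe here.', 16, 1);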

If each data method then has retry on timeout and deadlock, you will have problems. If DataMethod2 retries after a deadlock, the transaction has been rolled back, so the steps before DataMethod2 are lost and only the steps of DataMethod2 and DataMethod3 will be performed. If DataMethod2 retries
a query timeout without rolling back, and the stored procedure it called started a transaction, two bad things can happen:

1) The final commit will only decrease trancount to 1, so the transaction will not actually be committed.
2) The stored procedure will start from the beginning. Say that UpdateRelationalJobUploadStatusAndMetadata gets stuck on the second update. This means that on the retry, the first UPDATE will be performed again. In this actual case, that is not a big deal; all that happens is that DataUploadTime gets a different value. But assume that the line for PacketCount had read:

[PacketCount] = [PacketCount] + @PacketCount

Then the retry would apply the increment a second time, and PacketCount would silently end up too high.

I must say that I don't really see the point of having retries on query timeouts. Do you know the rationale behind this decision? As I said, my recommendation is to have an infinite timeout. In some applications, like web apps, you don't want queries to get stuck forever, but in that case you probably want to return a message to the user, "system busy, please try later", and let the user perform the retry.

There are basically two reasons for timeouts:
1) Blocking.
2) The batch simply takes a long time to run.

In the second case a retry is completely pointless: if the batch takes 35 seconds to run against a 30-second timeout, it will never be able to complete, no matter how many times you retry. In the first case, a retry is possibly meaningful as a user-controlled retry. But a direct retry?

You point out that there is a ROLLBACK in SP[1] in case of error. But the CATCH handler is not fired on an attention signal. SQL Server rolls back the current statement, but when XACT_ABORT is OFF, the full transaction is not rolled back.

When XACT_ABORT is ON (which I recommend), the transaction is rolled back. Observe that this applies to the outermost transaction, the one started by the client. The BEGIN TRANSACTION in the procedure only serves to increase @@trancount; you don't get a rollback to that point. Thus, SQL Server pulls the rug out from under the client, and the client must be aware of this. But this is the same scenario as being chosen as a deadlock victim, where you also get rolled back.
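
As a sketch of what I recommend (the procedure is invented for illustration; the table and columns are taken from this thread, so verify them against your schema):

CREATE PROCEDURE dbo.SampleUpdateProc @JobId int AS
BEGIN
    SET XACT_ABORT, NOCOUNT ON;
    BEGIN TRY
        BEGIN TRANSACTION;
        -- Real work goes here; a placeholder UPDATE for illustration:
        UPDATE dbo.JobMaster SET DataUploadTime = getdate() WHERE JobId = @JobId;
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@trancount > 0 ROLLBACK TRANSACTION;
        THROW;   -- re-raise the error; THROW requires SQL 2012 or later
    END CATCH;
END;

Note that the CATCH handler still does not fire on an attention signal; it is SET XACT_ABORT ON that makes SQL Server roll back the open transaction when the client cancels.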

You say: "If retry without rollback causes nested transactions, we could expect SP[2] to hit 'trancount > 0'."

The only check on trancount > 0 that I can see is in SP[1], but it comes after BEGIN TRANSACTION, so that condition will always be true.
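
To illustrate (safe to run in a scratch window):

BEGIN TRANSACTION;
IF @@trancount > 0 PRINT 'Always printed: @@trancount is at least 1 after BEGIN TRANSACTION.';
ROLLBACK TRANSACTION;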

As for the deadlocks: yes, they may be due to the timeouts and retries. But right now the deadlocks as such are not your problem; they are only symptoms of the underlying problem, which is the architecture of your data layer.

I would recommend that you close the load tests as failed and send the developers back to the drawing board. In summary, you need to implement the following:

1) In case of a query timeout, roll back the transaction: through TransactionScope if it was started that way, else by submitting "IF @@trancount > 0 ROLLBACK TRANSACTION".
2) Make sure that retries after deadlocks always restart from the point where the transaction started, else retries are not safe (see the sketch after this list).
3) Stop retrying on query timeout, and consider setting the command timeout to infinity (i.e., 0).
4) Consider using SET XACT_ABORT ON in your stored procedures.
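
To make point 2 concrete, here is a sketch of a deadlock retry that always restarts from where the transaction begins. The procedure name is from this thread, but the parameter list and the loop itself are my assumptions; in your case the retry presumably lives in the client code:

DECLARE @JobId int = 4711;   -- hypothetical job id
DECLARE @attempts int = 0;
retry:
BEGIN TRY
    BEGIN TRANSACTION;       -- a retry always re-enters here, never mid-transaction
    EXEC dbo.UpdateRelationalJobUploadStatusAndMetadata @JobId = @JobId;   -- parameters assumed
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@trancount > 0 ROLLBACK TRANSACTION;
    SET @attempts += 1;
    IF ERROR_NUMBER() = 1205 AND @attempts < 3 GOTO retry;   -- 1205 = chosen as deadlock victim
    THROW;                   -- anything else, or too many attempts: propagate the error
END CATCH;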

Run DBCC TRACEON(1222, -1) to ask SQL Server to dump deadlock information into the error log. The blog series below will be helpful. It has three parts; I am providing one part, and you can search the net for the next two under the same name.
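
Concretely (sp_readerrorlog is undocumented but widely used; adjust the search string as needed):

DBCC TRACEON (1222, -1);                  -- write deadlock graphs to the error log, server-wide
EXEC sp_readerrorlog 0, 1, 'deadlock';    -- search the current error log for deadlock output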

1. Have you made sure you created the correct indexes? I guess not; use sys.dm_db_index_usage_stats to check whether a particular index is utilized or not. Your query uses a WHERE clause on JobId, and as per your post you have not created an index on JobId, so you are surely going to face issues. A proper index is the first step in removing deadlocks.
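
As a sketch (the table name dbo.JobMaster and the column JobId are taken from this thread; verify them against your schema), you can check index usage and, if needed, create the missing index:

-- Which indexes on JobMaster are actually used?
SELECT i.name, s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON  s.object_id = i.object_id
       AND s.index_id = i.index_id
       AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.JobMaster');

-- Hypothetical index to support the WHERE clause on JobId:
CREATE NONCLUSTERED INDEX IX_JobMaster_JobId ON dbo.JobMaster (JobId);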

2. There is no particular sequence. For an UPDATE, an update (U) lock is taken while the row is being located, and that update lock converts into an exclusive (X) lock when the row is modified. Since the index is missing, searching takes a long time: a table scan happens over the records in the table, so the update lock is not released quickly.
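
You can watch this from a second session while the UPDATE runs; a sketch using sys.dm_tran_locks:

-- Shows the U and X locks held or requested in the current database.
SELECT request_session_id, resource_type, request_mode, request_status, resource_description
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();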


Indeed, things were a little different from your post. You are only looking at the current statements, and you are also assuming that they are working with the same JobId.

However, the first and the last process in the list have a user-defined transaction in progress. The first process started its transaction just 20 ms before it called the procedure UpdateRelationalJobUploadStatusAndMetadata. But the last process started its transaction 30 seconds earlier, so it is likely that it performed some work prior to the current batch.

I am not going to analyse the deadlock as such any further, because I see things in the trace that are more interesting. The first and second processes have a trancount of 2, which suggests transaction nesting, but I am not sure. However, the last process has a trancount of 3, which definitely means that it has issued BEGIN TRANSACTION multiple times.
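
You can see for yourself how nested BEGIN TRANSACTION drives trancount up, and that an inner COMMIT only decrements the counter (safe to run in a scratch window):

BEGIN TRANSACTION;
BEGIN TRANSACTION;
BEGIN TRANSACTION;
SELECT @@trancount AS trancount;   -- returns 3, like the last process in the trace
COMMIT TRANSACTION;                -- only decrements the counter to 2; nothing is committed yet
ROLLBACK TRANSACTION;              -- rolls back everything and sets @@trancount to 0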

An attention is a signal from the client API which instructs SQL Server to stop executing the current batch. If the setting XACT_ABORT is OFF (the default), the transaction is not rolled back.

There are two typical situations for an attention signal:
1) The red button in SSMS.
2) A query timeout (the message "Timeout expired").

It is telling that the attention signal came 30 seconds and 30 milliseconds after the transaction started. The default query timeout in most client APIs is 30 seconds.

Unhandled query timeouts can cause a big mess in an application if the connection jogs along with an open transaction: locks pile up, and when the connection is eventually disconnected, a lot of work is rolled back and lost. And it is quite clear that the application is not handling the query timeouts correctly. I suspect that the application retries without rolling back, and that is why trancount is 3.

I note that the other two processes also have fairly recent attention signals.

So rather than staring at this deadlock, you need to fix the application. You need to make sure that when you get an exception while running a query, you issue the command "IF @@trancount > 0 ROLLBACK TRANSACTION". I would also recommend that you put "SET XACT_ABORT ON" inside your stored procedures.

Then there is another interesting question: why do you get these query timeouts in the first place? My recommendation is that unless you have a reason to do otherwise, you should set the query timeout to 0 (which you can do on the connection object in most APIs) to wait forever. But that depends a little on what these batches are doing.

Timeout: According to the info from our test side, they were running a stress test at that moment, so we got a lot of timeouts on the server side. This coincides with your thinking on the last attention signal.

Unhandled timeout: we do have a retry library which, if it catches a timeout, will just redo the execution.

Trans count: Please refer to the whole SP[1] and SP[2] down below. In the first, we have a ROLLBACK in the transaction; we did not do this in the second SP.

Below are several questions I am still not quite clear about:

Transcount is 2 or 3 for the three processes. If retry without rollback causes nested transactions, we could expect SP[2] to hit "trancount > 0"; however, this does not explain the behaviour of SP[1]. So it seems that on a timeout, even if we roll back the tran in the SP, there will still be nested transactions, since the rollback only applies on error? It is also quite interesting to me what the behaviour on timeout is on the server side, particularly for this scenario: when one query has not completed and returned to the client within the timeout period, and the client just retries the query, how does the server treat the second query? This seems related to "Unhandled query timeouts can cause a big mess in an application if the connection jogs along with an open transaction: locks pile up, and when the connection is eventually disconnected, a lot of work is rolled back and lost." Can you elaborate on this?

Back to the deadlock issue: is this also related to the handling of timeouts and the nested transactions?

thanks.

SP[1]

-- Create stored procedure dbo.[UpdateRelationalJobUploadStatusAndMetadata] to get the job's metadata
