New concurrency and scheduling diagnostics have been added to SQL Server

This article has been archived. It is offered "as is" and will no longer be updated.

BUG #: 102179 (SQLBUG_70) BUG #: 356317 (SQLBUG_80)

SUMMARY

To remain highly dynamic, SQL Server contains several internal processes to ensure stability. One example is the lock monitor that you can use to identify and resolve deadlock situations.

SQL Server 7.0 Service Pack 4 (SP4) and SQL Server 2000 Service Pack 3 (SP3) have added enhanced process health monitoring. Health monitoring enhancements have taken place in the following areas:

Blocking

Network problems

Input/Output (IO)

Memory

CPU

When SQL Server detects health problems, a series of new error messages, such as the following, are logged in the SQL Server error log. For details about these error message conditions, see the "More Information" section of this article.

Warning Note that a health related problem is often the result of a condition that was experienced previously. You must study the SQL Server error log and the system event logs carefully to determine the actual root cause.

For example, a 17883 error message may indicate a scheduler problem. However, the error log may show a previous exception that incorrectly left the SQL Server process in a poor state, or the application might have caused a severe blocking condition.

Note Microsoft tries to keep all content up-to-date with the latest 17883 conditions. However, the 17883 error message is a health detection message that can be triggered for many reasons. Microsoft has not only corrected known issues with the SQL Server software product but has also encountered the 17883 error in a variety of situations that are unrelated to the SQL Server software. For example, the error has occurred with external application CPU consumption and hardware failures. You must determine the root cause of the 17883 error message if you want to avoid an unwanted reoccurrence of the error.

MORE INFORMATION

This section outlines the health enhancements and associated error messages that can be logged to the SQL Server error log.

UMS

To better understand some of the additional health diagnostics, you must first understand how SQL Server uses a User Mode Scheduling (UMS) Ums.dll helper file.

Both SQL Server 7.0 and Microsoft SQL Server 2000 use logical schedulers. These schedulers help to make sure that SQL Server maximizes operating system resource usage in relation to key database action paths. The UMS layer makes sure that SQL Server correctly uses Win32 events to strictly control thread and fiber (or both) scheduling visibility to the operating system. By strictly controlling the threads or fibers that can run, SQL Server can maximize CPU usage as it relates to database primitives such as locking.

For example, logical scheduling permits lock waiters to sleep (WaitForSingleObject on a Win32 event) until the lock owner releases the lock and signals (SetEvent) them to wake up.

Extended lock detection

The lock monitor has been extended to detect a (worker thread) resource level blocking scenario. If a SPID that owns a lock is currently queued to the scheduler, because all the assigned worker threads have been created and all the assigned worker threads are in an un-resolvable wait state, the following error message is written to the SQL Server error log:

Client 1 becomes IDLE, shows up as sleeping, and awaiting a command with an open transaction in the sysprocesses system table.

Clients 2 through 255: Approximately 254 more clients log on to SQL Server and issue a SELECT from the authors table. These clients will all become blocked on the original update.

Client 1 tries to commit the transaction but it becomes queued because all the worker threads are tied up by clients 2 through 255.

Blocking

This error message commonly indicates an extended blocking situation. Each time the lock monitor runs (approximately every 5 seconds), messages can be added to the SQL Server error log.

Note A message is logged for each SPID/ECID that is experiencing the resource problem. Therefore, several messages can be logged during the same lock monitor iteration.

SQL Server does not automatically resolve this situation. However, it will indicate the problem as an error message (1223 or 1229) accordingly. When this problem occurs, you can resolve it in a number of ways.

Lock or query timeout

If the queries use a lock or a query timeout, the situation will commonly resolve itself as the timeouts occur. However, the situation warrants investigation because it indicates an application induced reduction in concurrency.

Transact-SQL KILL

If the administrator is able to query the sysprocesses system table, they can use the Transact-SQL KILL command to terminate the BLOCKING SPID and to terminate the appropriate BLOCKED SPIDS to free worker threads and return the system to a normal state.

Support assistance

If you cannot obtain the sysprocesses system table information, then get a process dump of the (Sqlservr.exe) process and contact Microsoft SQL Server support for additional investigation.

Parallel queries

In rare circumstances, this error message can occur because of a poor parallel query plan selection. If the parallel query elects to use a significant number of available SQL Server workers to complete the query, it can exhaust the SQL Server worker pool. The sysprocesses system table contains an ECID column to indicate the number of workers that are being used on behalf of the individual SPID. If the ECID value is high in relationship to the physical CPUs on the computer, it is generally an indication of a poorly tuned query. Review the query plan and the max degree of parallelism (MAXDOP) query option setting to correctly tune the query in question.

Schedulers matter

The number of logical schedulers matters. When SQL Server starts, the max worker thread setting is divided equally among the logical schedulers. As the number of CPUs available to the SQL Server increases the worker queue is divided more. An application that exhibits undesirable transaction scope activity can manifest resource shortage scenarios faster when more CPUs are involved. For this type of scenario, the applications transaction scope is immediately corrected.

The following table shows the worker pool assignments based on the number of CPUs if the sp_configure stored procedure setting for max worker threads is equal to 255.

CPUs blocking

Chain length

1

255

2

128

4

64

8

32

16

16

We recommend that you keep the max worker thread setting at the default of 255. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

UMS health

A new internal routine has been added to check the health of the logical scheduler every 60 seconds. If the scheduler is deemed as stalled, or it has stopped responding, the appropriate error message is logged to the SQL Server error log. An error is logged every 60 seconds until the issue is resolved.

As mentioned earlier, these messages are often an indication of a previous event. Carefully consult the SQL Server error log and application event logs to help you determine the root cause of the problem.

Note A snapshot is taken every 60 seconds. Therefore, it can be 120 seconds before the condition is first detected.

Trace flags

You can disable these two checks if you start SQL Server with the -T1260 startup parameter.

SQL Server 2000 MiniDump file

Starting with SQL Server 2000 SP3, the ability to capture a MiniDump process has been implemented. Starting with build 8.00.765, a MiniDump file is generated when SQL Server first detects a stalled scheduler.

To prevent continued generation of the MiniDump files for these error messages (17883 and17884), the default behavior is to produce a single MiniDump file for the life of the SQL Server process. To enable a MiniDump file for every occurrence of the messages, turn on trace flag -T1262.

The MiniDump file is generated in the LOG folder and is SQLDmpr###.mdmp. This MiniDump file can be evaluated by Microsoft support to help determine the root cause of the problem.

Error 17881 and Error 17883

These messages indicate a single UMS scheduler has experience a yield problem. The health monitoring has detected what appears to be a scheduler with a worker thread that is not allowing other workers to progress, and the scheduler is being flagged as non-responsive. A scheduler that has stopped responding is generally a bug with the SQL Server product or an external component (XProc, COM object, and so on).

The following are examples of known 17833 conditions. Make sure that you search the Microsoft Knowledge Base for related articles. If your system requires an updated patch, apply it accordingly.

If you cannot determine the root cause immediately, consult the error log for problems and engage in extended support efforts.

When a scheduler is not properly responding, it can reduce overall concurrency for SQL Server. SQL Server can also appear to be stalled or it might stop responding.

Error 17882 and 17884

These messages indicate that all the UMS schedulers have experienced yield problems. This indicates a SQL Server system wide problem and SQL Server will appear to have stopped responding. As with the 17881 and 17883 messages, consult the error log and Microsoft the Knowledge Base for more information. If necessary, engage in extended support efforts.