SCSM: "A monitoring host is unresponsive or has crashed" (Error 4000)

by Nathan Lasnoski

I was recently working on a Service Manager 2012 deployment where the Health Service was regularly crashing approximately 3 to 5 minutes after it was started. As a result, most workflows fired for a minute or two and then stopped when the service crashed. This also prevented connectors from completing, notifications from firing, and practically every other work process from functioning properly. The reason is that the Health Service could really be called the "workflow service": it is responsible for almost every backend process in Service Manager.
The error in the event log was "A monitoring host is unresponsive or has crashed", with an event ID of 4000. In reviewing the error I found references to SCOM and an SNMP management pack. I proceeded to remove the management pack, disable many workflows, and back out changes. Finally, we rolled back to an earlier state, since we noticed this had been happening for some time. After the roll-back the processes started working properly, but about 8 hours later we saw the Health Service crashing again.
I then proceeded to look further at the workflows using the SQL queries that Travis blogged about, and which were noted on other Microsoft blogs. Here are some references:
http://blogs.technet.com/b/servicemanager/archive/2013/01/14/troubleshooting-workflow-performance-and-delays.aspx
http://blogs.technet.com/b/mihai/archive/2012/07/13/service-manager-slow-perfomance.aspx
http://gallery.technet.microsoft.com/Workflow-Performance-680438ae
In particular, we ran the following SQL command:
"
DECLARE @MaxState INT, @MaxStateDate Datetime, @Delta INT, @Language nvarchar(3)
SET @Delta = 0
SET @Language = 'ENU'
SET @MaxState = (
    SELECT MAX(EntityTransactionLogId)
    FROM EntityChangeLog WITH(NOLOCK)
)
SET @MaxStateDate = (
    SELECT TimeAdded
    FROM EntityTransactionLog
    WHERE EntityTransactionLogId = @MaxState
)
SELECT
    LT.LTValue AS 'Display Name',
    S.State AS 'Current Workflow Watermark',
    @MaxState AS 'Current Transaction Log Watermark',
    DATEDIFF(mi, (SELECT TimeAdded
                  FROM EntityTransactionLog WITH(NOLOCK)
                  WHERE EntityTransactionLogId = S.State), @MaxStateDate) AS 'Minutes Behind',
    S.EventCount,
    S.LastNonZeroEventCount,
    R.RuleName AS 'MP Rule Name',
    MT.TypeName AS 'Source Class Name',
    S.LastModified AS 'Rule Last Modified',
    S.IsPeriodicQueryEvent AS 'Is Periodic Query Subscription', --Note: 1 means it is a periodic query subscription
    R.RuleEnabled AS 'Rule Enabled', -- Note: 4 means the rule is enabled
    R.RuleID
FROM CmdbInstanceSubscriptionState AS S WITH(NOLOCK)
LEFT OUTER JOIN Rules AS R
    ON S.RuleId = R.RuleId
LEFT OUTER JOIN ManagedType AS MT
    ON S.TypeId = MT.ManagedTypeId
LEFT OUTER JOIN LocalizedText AS LT
    ON R.RuleId = LT.MPElementId
WHERE
    S.State <= @MaxState - @Delta
    AND R.RuleEnabled <> 0
    AND LT.LTStringType = 1
    AND LT.LanguageCode = @Language
    AND S.IsPeriodicQueryEvent = 0
    /*Note: Uncomment this line and use this optional criteria if you want to
    look at a specific workflow that you know the display name of*/
    --AND LT.LTValue LIKE '%Test%'
ORDER BY S.State Asc
"
This showed that one workflow in particular was over 10,000 minutes behind! We then disabled that workflow, after which the Health Service stopped crashing.
If you are running into this error with the Health Service, this query can be super helpful for finding the workflow that is falling behind.
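To make the "Minutes Behind" column a little more concrete, here is a small Python sketch of the same idea: each workflow has a watermark (the last transaction log ID it processed), and its lag is the time between that transaction and the newest one in the log. This is only a toy model, not anything Service Manager ships. The dictionaries, workflow names, timestamps, and the `minutes_behind` function are all made up for illustration, and it uses elapsed whole minutes, which may differ slightly from how SQL's DATEDIFF counts minute boundaries.

```python
from datetime import datetime

# Hypothetical sample data: transaction log ID -> time the transaction was added.
transaction_log = {
    100: datetime(2013, 1, 1, 8, 0),
    250: datetime(2013, 1, 8, 7, 30),
    900: datetime(2013, 1, 8, 8, 0),
}

# Hypothetical workflows: display name -> watermark (last transaction ID processed).
workflow_watermarks = {
    "Incident notification": 900,
    "Stuck connector workflow": 100,
    "Change request update": 250,
}


def minutes_behind(watermarks, log):
    """Return {workflow name: minutes behind the newest transaction}, worst first."""
    max_id = max(log)        # plays the role of @MaxState in the query
    max_time = log[max_id]   # plays the role of @MaxStateDate
    lag = {}
    for name, state in watermarks.items():
        elapsed = max_time - log[state]
        lag[name] = int(elapsed.total_seconds() // 60)
    # Sort so the workflow that is furthest behind comes first.
    return dict(sorted(lag.items(), key=lambda item: item[1], reverse=True))


lag = minutes_behind(workflow_watermarks, transaction_log)
for name, minutes in lag.items():
    print(f"{name}: {minutes} minutes behind")
```

In this made-up data set the "Stuck connector workflow" sits at the top with a lag of a full week, which is the kind of outlier the real query surfaced for us.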
Have a great day!
Nathan Lasnoski