Posts tagged Symantec

We kept getting messages that the cluster node was offline because the Quorum was unavailable. This made little sense, as both nodes in this cluster were online and the Quorum disk was available. We could ping across the heartbeat network, and everything looked fine except for these errors.

After a little research we determined that a new version of Symantec Endpoint Security had been pushed to these servers. Even with the new version installed, we could establish communication across all networks between the 2 nodes, so we were a little stumped. Eventually we ran across a policy that was being enforced from the Symantec central management server/policy/whatever it’s called!

As it turns out, Symantec Endpoint Security blocks all IPv6 traffic by default. If you’re like me, you didn’t even realize that a Windows 2008 cluster would use IPv6 for the heartbeat communication. After we disabled the rules that were blocking IPv6 traffic, everything returned to normal.

So, the moral of all this is nothing new: NEVER trust anything new getting pushed to your servers.

Ran into this a while back, and we finally found a root cause, so I thought I’d put it out here in hopes that it saves at least 1 person the amount of head bashing I had with it.

Environment

Windows 2003 Enterprise R2 SP2 with 32 GB RAM

SQL Server 2005 Standard Edition SP3, 64-bit, active/passive cluster

We started seeing this glorious message in the SQL Server Error log.

A significant part of sql server process memory has been paged out. This may result in performance degradation. Duration XX seconds. Working set (KB) XXX, committed (KB) XXX, memory utilization 0%

The message varied slightly but the essence was always the same.
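If you want to track how often this warning shows up and how bad it is, the numbers can be pulled out of the log line with a small script. This is only a rough sketch: the pattern is my assumption based on the message quoted above (it tolerates an optional colon after each label, since the wording varies), and `parse_paged_out` is a made-up helper name, not part of any SQL Server tooling.

```python
import re

# Assumed pattern, built from the message quoted above. The real wording
# varies slightly, so matching is kept loose and case-insensitive.
PAGED_OUT = re.compile(
    r"process memory has been paged out.*?"
    r"Duration:?\s*(?P<duration>\d+)\s*seconds.*?"
    r"Working set \(KB\):?\s*(?P<working_set_kb>\d+).*?"
    r"committed \(KB\):?\s*(?P<committed_kb>\d+).*?"
    r"memory utilization:?\s*(?P<utilization_pct>\d+)%",
    re.IGNORECASE | re.DOTALL,
)


def parse_paged_out(line):
    """Return the numeric fields of a paged-out warning as ints, or None."""
    match = PAGED_OUT.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}
```

Feeding each error log line through `parse_paged_out` and keeping the non-None results gives a quick history of working set versus committed memory at the time of each warning.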

This error message is all too common on systems where SQL memory is misconfigured or where something is unduly pressuring SQL for memory. In this case a quick verification of the settings showed that everything was in order. The first 2 times this happened it was the middle of the night during backups (in the SLA window), so no one really noticed a performance degradation. We didn’t think much of it at the time but, in hindsight, we should have.

The Failure

Monday morning, 8 AM: a developer makes a bad update to the database. No problem, I say, LiteSpeed can roll back the transaction. So I start to copy the full DB backup plus transaction logs (~25 GB) off the server, which is how we process LiteSpeed recoveries through the log reader. About 3 minutes later the server became totally unresponsive, and the error about paging the SQL process memory was logged. At the time I didn’t put 2 and 2 together, as this particular server runs a varied workload of about 1500 batches/sec and has anywhere from 1200 to 2500 connections open at a time, so it could have been anything! After some further digging I figured out that the file copies were causing the SQL memory to get paged out. Until then I had never heard of a file copy causing an issue in SQL Server!

The experts weigh in

While looking at the issue, 2 perfmon counters stuck out: Memory\Cache Bytes and Memory\Available MBytes. While file copies were happening, the cache bytes counter would increase very quickly while the available MBytes counter would drop. Once available MBytes dropped to 0, SQL Server started to page memory out. After a bit of paging, the errors were logged that SQL had its memory paged out, and SQL became unresponsive. Since this was a high priority system, I did what any good SQL Server DBA would do: I contacted a few people in my network who may have seen this before. Interestingly enough I got the exact same response from every one of them: “use lock pages in memory”, and don’t use Windows Explorer to do huge file copies, as this is a known “problem”.
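The pattern in those two counters is simple enough to check for in code once you have samples logged from perfmon. The sketch below is just an illustration of that check, not something we actually ran: `first_exhaustion` is a hypothetical helper, the samples are assumed to be (cache bytes, available MBytes) pairs you collected some other way, and the sample values are invented.

```python
def first_exhaustion(samples, floor_mb=0.0):
    """Find the first sample where the file cache grew since the previous
    sample and available memory is at or below floor_mb.

    samples: list of (cache_bytes, avail_mbytes) tuples in time order.
    Returns the index of that sample, or None if the pattern never occurs.
    """
    for i in range(1, len(samples)):
        prev_cache, _ = samples[i - 1]
        cache, avail = samples[i]
        if cache > prev_cache and avail <= floor_mb:
            return i
    return None


# Invented values for illustration only: cache climbs while available
# memory falls to zero during a large file copy.
copy_samples = [(1e9, 8000), (5e9, 4000), (9e9, 0), (9e9, 0)]
print(first_exhaustion(copy_samples))
```

Raising `floor_mb` to a few hundred MB would turn this into an early warning instead of an after-the-fact marker.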

Even though I trusted my sources of info, I had a hard time believing that file copies of sizes all the way down to 1 GB would cause this sort of havoc without this being something Bing+Google would know about. (File size mattered: some sizes worked fine, some would cause the problem.)

Workarounds not welcome

I had a valid workaround with lock pages in memory and not using Explorer for file copies, but I don’t normally like workarounds such as this on systems as important as this one is to us. After a few server rebuilds we finally figured out that we could reproduce this issue on any Windows 2003 R2 Enterprise Edition 64-bit server, and this would be the clue we needed to make a breakthrough. After rebuilding the systems from scratch and loading no drivers except SAN, we noticed that we couldn’t cause the error! So, after painstakingly adding each and every piece of our standard server build, we realized that Symantec AV (10.1.9.9) was the cause. Yes, another file system filter driver was misbehaving.

Looking back through the change communication, we ID’d where a new version of AV was pushed out; we just didn’t hit the error soon enough after the installation to put 2 and 2 together. Since disabling AV wasn’t an option, we started trying to find the setting that specifically caused the problem, and happened across a change that allows AV to run without SQL getting paged out. With the network scanning options unchecked, the Windows cache no longer increases during a (network) file copy. Problem solved!

Some time in the future vendors are going to figure out how to write good file system filter drivers, or they are going to stop trying to use them! After fighting this issue for a few weeks (or was it months?), I can only hope this happens sooner rather than later.