Recently I came across a scenario where the requirement was an Active/Passive Windows NLB cluster. If the active node experiences an issue, all application-related services must be stopped on that node and started on the passive node. To achieve this failover we need a health check. Windows NLB handles network-level failures, such as the active node losing network connectivity, being shut down, or crashing. In all these cases Windows NLB starts sending connections to the passive node. However, what if an application-related service crashes, hangs, or stops servicing connections? Does Windows NLB offer any option to initiate failover based on an application-level health check?

Windows NLB is a great feature for network load balancing, but it has not evolved much since the Windows 2000 days and it does not address the above questions directly. However, Microsoft does provide some sample script templates for monitoring application-level health, which can be enhanced as required.

I also came across another blog by DAVID TOSOFF, where he provides a brilliant script that can be configured for application-level health checks and customized to work with any application behind NLB. What's more, it can also be configured to run as a service.

However, all I needed was a script that checks whether certain services are running on the active node. If any of those services is not running, it should stop all services on the active node, make that node passive, start the services on the passive node, and make it active. After collecting a few pieces of code from different blogs and forums, I came up with a PowerShell script to accomplish this task.
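The decision logic described above can be sketched in a few lines. This is only an in-memory illustration in Python, not the PowerShell script itself: the `Node` class, the function names, and the service names are all hypothetical, and a real script would query and control Windows services and NLB node state instead of flipping flags in a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for an NLB node and its application services."""
    name: str
    services: dict          # service name -> True if running
    active: bool = False


def all_running(node, required):
    """True only if every required service is running on the node."""
    return all(node.services.get(svc, False) for svc in required)


def failover_if_unhealthy(active, passive, required):
    """Fail over when any required service is down on the active node.

    Returns True if a failover was performed, False if the node was healthy.
    """
    if all_running(active, required):
        return False
    # Stop everything on the failing node and make it passive.
    for svc in required:
        active.services[svc] = False
    active.active = False
    # Start everything on the other node and make it active.
    for svc in required:
        passive.services[svc] = True
    passive.active = True
    return True


required = ["AppSvc", "AppQueue"]   # hypothetical service names
node1 = Node("node1", {"AppSvc": True, "AppQueue": False}, active=True)
node2 = Node("node2", {"AppSvc": False, "AppQueue": False})
print(failover_if_unhealthy(node1, node2, required))
```

Stopping all services on the failing node, not just the broken one, keeps the two nodes from both partially serving the application at the same time.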

In addition to the service health check and initiating failover, this script also provides basic health check and recovery for NLB cluster.

Note: I am no expert when it comes to scripting and this is not a perfect script, but it works. There is a lot of room for improvement, and if you have any suggestions please help me make it better :)

The next thing is to run this script as a service to monitor the health of NLB. There are many ways to do it. Schtasks is a great utility to install your custom scripts as a service, or you can use Instsrv.exe and Srvany.exe, which are part of the Windows Server 2003 Resource Kit Tools.

However, I found Task Scheduler a better fit for my scenario. There is a nice post on the TechNet blog that explains the details. However, if you want the quick version, below is the only command you need.

While going through patch reports I noticed that two Windows 2008 R2 SP2 servers had missed two patch cycles. It soon turned out that system state backups were not running on these servers. No backup, so no patching.

So I started with system state backup.

A simple command, wbadmin start systemstatebackup -backupTarget:C:, gave the following error:

The Windows Backup engine could not be contacted. Retry the operation. The RPC server is unavailable.

With several other applications crashing on ntdll.dll, KB 2545627 looked like a perfect fit for our server, and KB 2512352, being the latest, was selected.

After the update, the issue of other apps failing on ntdll.dll was fixed, but it made no difference to the primary issue of the failing backup. We saw the same event 1000, this time with higher DLL version numbers.

The log of files for which the backup failed is at ‘C:\Windows\Logs\WindowsServerBackup\SystemStateBackup_Error date time.log’

I found the following event, but it did not help much:

Event ID: 519
Description: The backup operation that started at “Time” has failed to back up volume(s) . Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

Interestingly, running the system state backup from the GUI via the backup module revealed a more detailed error:

Event ID: 517
Description: The backup operation that started at “Time” has failed with following error code ‘2155347997’. Please review the event details for a solution, and then rerun the backup operation once the issue is resolved.

KB http://support.microsoft.com/kb/2182466, ““2155347997 (0x8078001D)” error code when you perform a system state backup operation in Windows 7 or in Windows Server 2008 R2”, refers to the exact same issue. However, we already had a higher version of wbengine installed, so this article was no longer applicable to us.
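The event log shows the error code in decimal, while the KB title also gives it in hex, so it helps to convert between the two when searching. A small Python sketch of that conversion (the helper name `to_hex_code` is mine, purely for illustration):

```python
def to_hex_code(code: int) -> str:
    """Format a 32-bit error code as hex, accepting signed or unsigned input."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(to_hex_code(2155347997))   # 0x8078001D
print(to_hex_code(-2139619299))  # same code when logged as a signed 32-bit value
```

Some tools log the same code as a negative signed integer, which is why the helper masks the value to 32 bits before formatting.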

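Invalid service ImagePath values in the registry are another known cause of system state backup failures. Checking every image path for a correct value is a pain in itself, and it is further complicated by multiple valid syntaxes. Thanks to Tom Acker for providing a nice and easy way to find invalid image paths with the GetInvalidImagePath script.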

Running this script revealed multiple image paths containing spaces that needed to be enclosed in quotes, and a few more keys with an incorrectly added forward slash “/” in the image path.
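To make those two problems concrete, here is a rough heuristic in Python. It is not the GetInvalidImagePath script and does not cover every valid ImagePath syntax. It assumes that an unquoted executable path ends at the first .exe, .sys, or .dll, and the function name and sample paths are made up for illustration.

```python
import re


def image_path_problems(image_path: str) -> list:
    """Flag unquoted paths with spaces and forward slashes in the path part."""
    problems = []
    path = image_path.strip()
    if path.startswith('"'):
        # Quoted form: the executable is everything up to the closing quote.
        end = path.find('"', 1)
        if end == -1:
            problems.append("unbalanced quote")
            exe = path[1:]
        else:
            exe = path[1:end]
    else:
        # Unquoted form: assume the executable ends at the first known extension.
        match = re.search(r"\.(exe|sys|dll)\b", path, re.IGNORECASE)
        exe = path[:match.end()] if match else path
        if " " in exe:
            problems.append("path with spaces is not quoted")
    # Arguments may legitimately contain "/", so only the path part is checked.
    if "/" in exe:
        problems.append("forward slash in path")
    return problems


samples = [
    r"C:\Program Files\App\svc.exe -run",
    r'"C:\Program Files\App\svc.exe" /service',
    r"C:\Windows\system32\svchost.exe -k netsvcs",
    r"C:/Windows/system32/app.exe",
]
for sample in samples:
    print(sample, "->", image_path_problems(sample))
```

Only the executable part is inspected because switches such as /service after the path are valid and would otherwise produce false positives.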

Once the image paths were cleaned up, system state backup worked like a charm 🙂

With a valid backup available, these servers are now ready to receive the long-awaited patches missed in the previous and current cycles.

Recently I came across an issue with a Windows 2008 R2 server with high memory utilization. The server was hosting a custom monitoring tool, and it was not servicing runtime reporting requests due to the performance hit.

The server had 8 CPU cores and 12 GB of RAM. CPU utilization was in check; however, RAM utilization was consistently above 95%. The server owner told us that they are usually forced to reboot the box to get memory utilization under control, and that after 2-3 days of uptime memory generally spikes again to 95-100% and never goes down. A first look at Task Manager showed memory utilization above 95%; however, the total memory consumed by processes on the Processes tab was approximately 1.5 GB, which is less than 13%.
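The size of the gap is easy to work out from those numbers. A quick back-of-the-envelope calculation in Python, using only the figures quoted above (the variable names are mine):

```python
total_gb = 12.0        # installed RAM
used_fraction = 0.95   # utilization reported by Task Manager
process_gb = 1.5       # sum of process memory on the Processes tab

used_gb = total_gb * used_fraction        # memory in use overall
process_share = process_gb / total_gb     # share explained by processes
unaccounted_gb = used_gb - process_gb     # memory no process accounts for

print(f"In use: {used_gb:.1f} GB")
print(f"Processes: {process_share:.1%} of RAM")
print(f"Unaccounted: {unaccounted_gb:.1f} GB")
```

That is roughly 9.9 GB in use that no process on the Processes tab explains, which lines up with the almost 10 GB that RAMMap attributes to AWE further on.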

So we started with the case of the missing RAM. The installed products list mentioned SQL 2008 R2 SP2. Of course, the usual suspect was SQL. Databases largely follow the Linux memory policy, “Free memory is wasted memory”. Like any other database product, SQL tends to occupy free memory as required.

For performance issues there are multiple tools available, but for advanced memory utilization analysis RAMMap from Sysinternals is the best choice. RAMMap revealed that almost 10 GB was occupied by AWE?

Address Windowing Extensions, or AWE, is a set of Windows memory management functions that lets a standard 32-bit application use more physical memory than fits in its limited virtual address space. Using AWE for SQL was a great option on a 32-bit OS with a large amount of RAM. But we were running Windows x64, where AWE should not have been needed.

The SQL memory utilization settings showed a different picture altogether.

Although SQL was set to the default of allowing all available memory to be used, “Use AWE to allocate memory” was unchecked. SQL was still our primary suspect, so to isolate it we took downtime for the application and stopped the SQL service. Indeed SQL was the culprit: AWE utilization cleared immediately and total memory utilization on the server dropped below 20%. Remember the saying, things are not always what they look like!

We changed the maximum server memory setting for SQL to 8 GB and started the services. This time SQL had its maximum of 8 GB, and our monitoring app had sufficient breathing space for all the data collection and reporting 🙂

However, I wondered why I had never seen this issue on my test servers. It turns out that AWE cannot be used by just any account. It is controlled by the “Lock pages in memory” GPO setting.

If you configure a user account to run the SQL services, that account does not have the Lock pages in memory right by default, and SQL won't be able to use AWE. In our case the SQL service was running under the Local System account, which by default has the right to use AWE.