If you are responsible for keeping your SharePoint deployment healthy, I assume that “traditional” system monitoring – whether via SCOM, Performance Monitor, or other tools – is at the top of your list. But if your first reaction to constantly high CPU, exhausted memory, or full disks is to ask for more hardware, then your actions are “too traditional”. Adding more hardware will surely make your system healthier – but it comes with a price tag that might not be necessary.

In this first blog about SharePoint Sanity Checks, I show you ways to figure out which sites, pages, views, and custom or 3rd-party Web Parts (from AvePoint, K2, Nintex, Metalogix, …) in your SharePoint environment are wasteful with resources, so that you can fix the root cause and not just fight the symptom.

Step #1: Server System Health Check

The first question must always be: How healthy are the Windows Servers that run your SharePoint Sites?

Not only must you look at Windows OS metrics such as CPU, memory, disk, and network utilization; you also need to monitor the individual SharePoint AppPool worker processes (w3wp.exe) to figure out whether individual sites are overloading the server. The following is a screenshot that shows this information on a single dashboard.
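If you don't have such a dashboard handy, even a quick script can surface the per-AppPool picture. The following Python sketch parses the CSV output of the Windows `tasklist` command to list the memory footprint of each w3wp.exe worker process; the sample data below (and the PIDs and sizes in it) are illustrative assumptions, not real measurements:

```python
import csv
import io

def parse_tasklist_csv(csv_text):
    """Parse `tasklist /fi "imagename eq w3wp.exe" /fo csv` output
    into a list of (pid, mem_kb) tuples.

    Expected columns: Image Name, PID, Session Name, Session#, Mem Usage.
    Memory values look like "1,234,567 K"."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        pid = int(row[1])
        mem_kb = int(row[4].replace(",", "").replace("K", "").strip())
        rows.append((pid, mem_kb))
    return rows

# Illustrative sample of what tasklist emits on a SharePoint box:
sample = (
    '"Image Name","PID","Session Name","Session#","Mem Usage"\n'
    '"w3wp.exe","4120","Services","0","812,448 K"\n'
    '"w3wp.exe","5236","Services","0","1,530,112 K"\n'
)
for pid, mem_kb in parse_tasklist_csv(sample):
    print(f"w3wp PID {pid}: {mem_kb // 1024} MB")
```

In a real check you would feed this the output of `subprocess.run(["tasklist", ...])` and map each PID back to its AppPool (e.g. via `appcmd list wps`) to see which site owns the memory.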

Let me give you some recommendations on what to look out for and what to do in each case.

Bad AppPools

If some of your SharePoint AppPools consume too many resources on a machine, you may want to consider deploying them to a different server. You don’t want to cram too many heavily utilized SharePoint sites onto a single server and suffer from the cross-impact of these sites.

Storage Problems

If you see high disk utilization, it is important to check what is causing it. I typically look closer at:

IIS: Is the web server busy serving too much static content? If so, make sure you have configured resource caching; that reduces repeated requests for static content from users that use SharePoint often. Also check the IIS log settings and the modules loaded by IIS, and make sure you only log what you really need.
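As a sketch of that caching setup – the elements follow the standard IIS `system.webServer` schema, but verify the attributes against your IIS version before deploying – a web.config fragment enabling client-side caching of static content plus compression could look like this:

```xml
<configuration>
  <system.webServer>
    <!-- Let browsers cache static content (images, CSS, JS) for 7 days -->
    <staticContent>
      <clientCache cacheControlMode="UseMaxAge" cacheControlMaxAge="7.00:00:00" />
    </staticContent>
    <!-- Compress both static and dynamic responses -->
    <urlCompression doStaticCompression="true" doDynamicCompression="true" />
  </system.webServer>
</configuration>
```

The 7-day max-age is an illustrative value; pick a lifetime that matches how often your static assets actually change.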

SQL Server: Is SQL Server running on the same machine as SharePoint, and maybe even hosting other databases? Talk with your DBA about checking the SQL Server configuration, and discuss a better deployment scenario, such as putting the SharePoint content database on its own SQL Server.

SharePoint: Check the generated log files. I often see people increasing log levels for various reasons and then forgetting to turn them back to the defaults, resulting in large amounts of data that nobody looks at anyway.
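A quick way to keep an eye on this is to measure how much disk space your log directories actually consume. The sketch below is plain Python; the ULS path shown in the usage comment is the SharePoint 2013 default and is an assumption – adjust it for your farm and version:

```python
import os

def total_log_size_mb(log_dir, extension=".log"):
    """Recursively sum the size of all files with the given
    extension under log_dir and return the total in megabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if name.lower().endswith(extension):
                total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 * 1024)

# Example (SharePoint 2013 default ULS location -- adjust for your farm):
# print(total_log_size_mb(r"C:\Program Files\Common Files"
#                         r"\microsoft shared\Web Server Extensions\15\LOGS"))
```

Run it on a schedule and alert when the total keeps growing: that is usually the sign that someone bumped a log level and forgot about it.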

CPU Utilization

The first thing I look at is whether the CPU is being consumed by one of the SharePoint AppPools or by other services running on the same machine.

SharePoint: This correlates with what I wrote under Bad AppPools. If the reason is too much load on an AppPool, consider deploying it on a different machine. Before you do, please follow my additional recommendations later in this blog to verify whether configuration or coding issues might be to blame – those can be fixed without new hardware.

SQL Server: Do you have SharePoint sites or individual pages that cause extremely high utilization on the SQL Server? If so, follow my recommendations on how to identify bad pages or Web Parts that access the database excessively. In general, you should ask the DBA to do a performance sanity check.

Other Processes: Do you have other services running on that box that spike CPU – some batch or reporting jobs that could be deployed on a different server?

Network Utilization

It comes down to the same suspects as above:

IIS: Analyze how “heavy” your SharePoint pages are and follow the general best practices of Web Performance Optimization to make your sites “slimmer”. Make sure content compression is turned on and content caching is properly configured.

SharePoint: Besides talking to the database, what other services does your SharePoint instance interact with? Do you have Web Parts communicating with an external service? If so, make sure these remote service calls are optimized, e.g., cache already-fetched data and only query the data you really need.
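To illustrate the remote-call caching idea, here is a minimal time-to-live cache sketch in Python. The decorator name, the 60-second TTL, and the placeholder fetch function are my own illustrative choices, not SharePoint APIs; the same pattern applies in whatever language your Web Parts are written in:

```python
import time

def ttl_cache(ttl_seconds):
    """Cache a function's results for ttl_seconds so that repeated
    renders of the same Web Part don't re-fetch identical remote data."""
    def wrap(fn):
        store = {}  # args -> (value, timestamp)
        def inner(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[1] < ttl_seconds:
                return hit[0]          # still fresh: skip the remote call
            value = fn(*args)
            store[args] = (value, now)
            return value
        return inner
    return wrap

@ttl_cache(60)
def fetch_stock_quote(symbol):
    # Placeholder for a real remote service call
    return {"symbol": symbol, "price": 42.0}
```

With a 60-second TTL, a page rendered by hundreds of users per minute makes one remote call instead of hundreds – choose the TTL based on how stale the data may safely become.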

SQL Server: Analyze which SharePoint sites and services request data, and which other applications do as well. Optimize data access, or consider redeploying SQL Server to improve data transfer between the application and the database server.

Step #2: IIS Health Check

I already covered some IIS metrics in Step 1, but I want you to take a closer look at IIS-specific metrics such as current load, available vs. used worker threads, and bandwidth requirements.

These are the metrics I always check to validate how healthy the IIS deployment is:

Average Page Response Size: If you have bloated websites, your IIS is serving too much data. That not only clogs the network; it also makes end users wait longer for pages to load. Keep an eye on the average page size, and especially after deploying an update, make sure pages don’t get too big. I suggest performing regular Web Performance Sanity Checks on your top pages.
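One way to watch average page size without a monitoring product is to mine the IIS W3C logs directly. This Python sketch reads the `#Fields:` header to locate the `cs-uri-stem` and `sc-bytes` columns – which fields are logged varies with your IIS logging configuration, so the sample lines below are illustrative assumptions:

```python
from collections import defaultdict

def average_aspx_response_size(log_lines):
    """Average response size (sc-bytes) per .aspx page from IIS W3C logs.
    Reads the #Fields: header line to locate the relevant columns."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # uri -> [total bytes, hits]
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]   # column names follow the prefix
            continue
        if not line or line.startswith("#"):
            continue                    # skip other comment lines
        row = dict(zip(fields, line.split()))
        uri = row.get("cs-uri-stem", "")
        if uri.lower().endswith(".aspx"):
            totals[uri][0] += int(row.get("sc-bytes", 0))
            totals[uri][1] += 1
    return {uri: total // hits for uri, (total, hits) in totals.items()}

sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status sc-bytes",
    "2012-05-01 10:00:01 GET /Pages/Home.aspx 200 420000",
    "2012-05-01 10:00:05 GET /Pages/Home.aspx 200 380000",
    "2012-05-01 10:00:09 GET /style.css 200 12000",
]
print(average_aspx_response_size(sample))  # {'/Pages/Home.aspx': 400000}
```

Running this against yesterday's log and today's log after a deployment gives you a quick before/after comparison of page weight. Note that `sc-bytes` logging must be enabled in IIS for this to work.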

Thread Utilization: Have you sized your IIS correctly in terms of worker threads? Are all the busy threads really busy, or just waiting on slow SharePoint requests? Check out the Top Web Server and Top App Server Metrics sections of my recent Load Testing Best Practices blog.

Bandwidth Requirement: Is your outbound network pipe already a bottleneck? If so, do not blindly upgrade your infrastructure; first check whether you can optimize your page sizes as explained earlier.

Step #3: Component Health Check

What I covered in the first two steps actually falls into “traditional” system monitoring, with some additional insight into metrics that go beyond normal resource utilization monitoring. If resources are maxed out, I always want to find out which components are actually using them. Why? Because we should first try to optimize these components before we give them more resources. I look at the following dashboard for a quick sanity check:

A good SharePoint health metric is the response time of SharePoint pages. If I see spikes, I know we are jeopardizing user adoption of SharePoint, and I treat that with high priority. I look at the following metrics and data points to figure out what causes these spikes, which most often correlate directly with higher resource consumption such as memory, CPU, disk, and network:

Memory Usage and Garbage Collection Impact: High memory usage alone is not necessarily a problem. The problem arises when more memory is requested and the Garbage Collector needs to kick in and clear out a lot of old objects. That’s why I always keep an eye on overall memory usage patterns and the amount of time spent in Garbage Collection (GC): GC impacts response time and consumes a lot of CPU.

Which Pages are Slow? Figuring out why individual pages are slow is often easier than figuring out why the system is slower on average. I don’t waste time, though, on a single slow page that is used by only one user. Instead, I focus on pages that are slower than expected and also used by many users. Optimizing those delivers improvements to a larger audience.
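That prioritization is easy to express as a weighted ranking: multiply each page's average response time by its request count and sort by the product, i.e. the total user-visible wait time. A minimal Python sketch – the 3-second threshold and the sample numbers are illustrative assumptions:

```python
def prioritize_slow_pages(page_stats, slow_threshold_ms=3000):
    """Rank pages by total user-visible wait (avg response time x hits),
    considering only pages slower than the threshold."""
    slow = [(uri, avg_ms * hits)
            for uri, (avg_ms, hits) in page_stats.items()
            if avg_ms > slow_threshold_ms]
    return sorted(slow, key=lambda item: item[1], reverse=True)

stats = {
    "/Pages/Home.aspx":   (4000, 500),  # moderately slow, heavily used
    "/Pages/Report.aspx": (9000, 3),    # very slow, rarely used
    "/Pages/Fast.aspx":   (800, 1000),  # fast -> ignored
}
for uri, weight in prioritize_slow_pages(stats):
    print(uri, weight)
```

Note how the heavily used home page outranks the much slower but rarely used report page – exactly the "larger audience" effect described above.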

Problematic Web Parts? SharePoint is built on Web Parts, whether they come from Microsoft, well-known 3rd-party providers (AvePoint, K2, Nintex, Metalogix, …), or your own development team. Knowing which Web Parts are used, and how slow they are, allows you to focus even better. Too many times I have seen “Web Parts Gone Wild” caused by bad configuration or bad implementation. Check out my Top 5 SharePoint Performance Mistakes and you will understand why that is a big problem.

Slow Web Parts and pages can be the result of bad deployments, wrong configuration, or simply bad coding. That is what I am going to focus on in my next blog post!

Next Steps: Fix the Problem; Don’t Just Buy More Hardware

I am interested to hear what you think about these metrics – please share the ones you use. In the next blog I will cover how to go deeper into SharePoint to identify the root cause of an unhealthy or slow system. Our first action should never be to throw more hardware at the problem, but rather to understand the issue and optimize the situation.

Ten years ago, the main goal for managers in network operations was to ensure the network was simply up and running. Today, a well-performing network does not guarantee that critical business applications are being successfully delivered to users.

The reality for IT and operations is dealing with requests for faster, always-available applications; as a consequence, organizations are converging network and application performance management to optimize delivery of business-critical applications and services.

Many IT organizations have defaulted to component-level monitoring of the network, server, and database tiers, which leads to incomplete views of application performance. Network-centric approaches to monitoring have traditionally lacked insight into overall application performance and what really matters: the end user’s experience.

And the reverse has been true too. Traditional application performance monitoring solutions have been unable to monitor and diagnose network performance issues. When Application Performance Management (APM) and Network Performance Management (NPM) reports are unified and delivered in context to one another, mean-time-to-resolve (MTTR) is reduced, performance management costs are slashed and deployment is simplified.

Using a single device and a unified reporting interface for best-in-class application and network performance management, Compuware APM enables IT and operations experts to tackle the increasing challenges of complex data centers. Our single solution manages both application and network performance and includes:

A single point of instrumentation: reduces costs and delivers network diagnostics in the context of application performance;

Private cloud and virtualization ready: delivers the industry’s most complete coverage for cloud and virtualized infrastructure, introducing support for Cisco VNTag and extended support for Citrix XenApp, VMware vSphere, and IBM AIX Micro-Partitioning;

This release also introduces new pricing and entry-level offerings that allow IT teams to easily start with a single application, then seamlessly and cost-effectively scale to enterprise-wide deployments.

To read more about our APM innovations and details on all the new enhancements included in the Compuware APM Spring 2012 Platform Release, click here.

See what the experts are saying about the APM market. Read Gartner’s comprehensive 2011 “Magic Quadrant for Application Performance Monitoring (APM)” report. It evaluates 29 vendors on completeness of vision and ability to execute.

When I saw this report, I was expecting it to be downloadable.

So here is my downloadable PDF version, which might be a little more manageable.

Just running through the options for quick access to core synthetic monitoring ‘outside-in’ data for a client today – essentially, looking for a quick way to alert core staff to up-to-the-minute performance for key website transactions running in Gomez.

The operational dashboard pulling data from the Active Network XF backbone tests is nicely displayed from a quick tab in the portal. Now looking at ways to display this on the obligatory LCD or plasma screen in the ops centre.

Understanding how websites perform for users around the world is now as simple as typing in a URL, thanks to the new, free Instant Test Site from Gomez, Inc., a leading provider of web application experience management services. Testing website performance from outside a business’ firewall — from its customers’ perspective — is a more accurate way of identifying issues like slow page loads and outages so that businesses can ensure their web applications keep running, their revenues keep flowing, and their customers have quality web experiences.

The new Gomez service lets anyone instantly test the response time of their website or web application from up to ten international testing nodes without having to download or install any software or create any scripts. Simply visit www.gomez.com/testyoursite, enter the URL to be tested, select a node and within seconds the site returns a comprehensive performance report revealing the load speed of each object on the page. This granular detail helps businesses establish a performance baseline and prioritize troubleshooting by rapidly identifying the root cause of issues such as missing images, erroneous third party content, or ISP bottlenecks. The service can be used to conduct multiple tests of the same URL to compare results over time or from different locations around the world.

In addition to running free, on-demand performance tests, the site provides a video tutorial featuring tips for interpreting results, remedying issues and improving overall web performance. It is also a springboard to a micro site of educational materials about web performance measurement best practices and case studies.

“In today’s economy, it’s important to protect every dollar of online revenue, and a customer’s first impressions of your website can be the difference between a sale or a fail,” said Eric Schurr, SVP of marketing at Gomez. “Taking the Gomez Instant Test is a free and easy way to understand if technical issues are impeding the performance of your website and impacting your customers’ experiences.”