Getting The Most Out Of SQL CoPilot: See The Big Picture

This is the first in a series of planned articles showing you how to get the most out of SQL CoPilot.

I'll assume you have already read the Getting Started page and have the Quick Start report open in Management Studio. Where you go from here depends on what you want to achieve.

The first thing I suggest you do when you run SQL CoPilot against a server is to get a list of immediate issues to resolve.

Under the Home tab there is a tab called Big Picture. Click on this and after a few seconds a display will appear that is split into multiple blocks.

If everything is blue/green you have a healthy server and you can ignore the rest of this article. If not, it's time for some action. Here's what each square represents and what to do if it is red or orange.

Ignoring the few boxes that are purely informational (version, memory, etc.), I'll focus on the boxes that change color to orange or red to alert you to a potential problem.

Mirrored Databases and AlwaysOn Availability Groups

If any mirrored databases are not synchronized, or any availability groups are not healthy, this box is highlighted in red.
Click it to display the list of databases, with the affected ones highlighted.
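If you want to check the same state for yourself outside the report, the mirroring catalog view and (on SQL 2012 and later) the availability group DMVs hold it - a rough sketch:

```sql
-- Mirroring state for each mirrored database
SELECT DB_NAME(database_id) AS database_name, mirroring_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;

-- Availability group health (SQL 2012 and later)
SELECT ag.name, ags.synchronization_health_desc
FROM sys.dm_hadr_availability_group_states ags
JOIN sys.availability_groups ag ON ag.group_id = ags.group_id;
```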

Databases Offline

Well I guess you'll already know if some of your databases are not online!

But often when looking at a new instance for the first time, you come across a few that have been deliberately taken offline previously - maybe they were migrated to a new instance, or the application was decommissioned.

Take the opportunity to detach these and move the mdf, ndf and ldf files to separate storage for archiving.
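As a sketch of the idea, assuming an offline database called ArchivedDB that is no longer needed (note where the files live first, because sys.master_files stops listing them once the database is detached):

```sql
-- Record the file locations before detaching
SELECT name, physical_name
FROM sys.master_files
WHERE database_id = DB_ID(N'ArchivedDB');

-- Detach the database; the mdf/ndf/ldf files remain on disk for archiving
EXEC sp_detach_db @dbname = N'ArchivedDB';
```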

Overdue Backups

So how do you decide if a backup is overdue? That depends on the schedule you have set for your full, differential and log backups.

SQL CoPilot has to make some assumptions here. It assumes that you perform a full or differential backup at least once per day, and a full backup at least once per week. If the recovery model is FULL or BULK LOGGED, there must also be a regular log backup. If none has been performed for more than 4 hours, this is highlighted.
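You can cross-check these assumptions against the backup history in msdb. A rough query along these lines shows the most recent backup of each type for every database:

```sql
-- Backup type codes: D = full, I = differential, L = log
SELECT d.name,
       MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full,
       MAX(CASE WHEN b.type = 'I' THEN b.backup_finish_date END) AS last_diff,
       MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b ON b.database_name = d.name
WHERE d.name <> 'tempdb'
GROUP BY d.name
ORDER BY d.name;
```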

For some shops this may be too strict and result in overdue backup warnings when a more relaxed schedule has been deliberately put in place. If so, I seriously suggest you re-examine your backup policy - it is arguably the most important of all DBA responsibilities.

It is also common for very large databases to have a weekly differential backup, and less regular full backups, and again these will show as overdue in the Big Picture report.
One possibility for a future release is to make these thresholds flexible, perhaps using a config database on each instance. Contact me if you think this would be a useful addition.

CPU Busy

High CPU usage is expected at times and nothing to worry about, but if you see consistent high CPU then it is time to look at the cause and do something about it.

This could involve adding or redesigning indexes, rewriting code, splitting workload across servers (for example by offloading reporting to a readable secondary), increasing the number of CPUs, or buying a new server.

CPU percentage is not the only measure, though. Have a look at the number of runnable tasks as well. If this is regularly non-zero, you may be CPU bound even if the CPU percentage is low.
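One way to check runnable tasks yourself is the schedulers DMV - a sketch:

```sql
-- Runnable tasks per CPU scheduler; consistently non-zero values suggest CPU pressure
SELECT scheduler_id, current_tasks_count, runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';
```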

Low Page Life Expectancy

This has long been a traditional pointer to low memory. In the early days of 32-bit servers and just a couple of GB of RAM, Microsoft recommended a minimum of 300 seconds as a figure to strive for.

But now almost everyone is using 64-bit servers, so page life expectancy should ideally be much higher. That said, you should only really worry if it drops down to very low values.

I see instances with 64GB RAM and page life expectancy below 500 that perform quite well, but it is still a useful indicator and improving code and indexing should have a measurable benefit.

SQL CoPilot uses a percentage of available memory when highlighting this metric, so you may see a high value on a server with a lot of RAM displayed in red or orange. This is just to bring it to your attention, and should not be treated as a problem unless performance is affected.
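If you want the raw number this is based on, page life expectancy is exposed as a performance counter:

```sql
-- Page life expectancy in seconds (one row per NUMA node on newer versions,
-- plus the overall Buffer Manager value)
SELECT [object_name], cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy';
```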

High Log Flush Wait Time

Writing to the transaction log should be quicker than a quick thing at the Quick Olympics. Fact.

It is fundamental to the efficient operation of SQL Server. Even in-memory OLTP in SQL 2014 still writes to the transaction log when data modifications are made. There is no escaping it.

As well as write latency (see Log Writes below), another measure of log write performance is the log flush wait time. This is the time it takes to flush the log cache to disk once a transaction is committed.

SQL CoPilot looks at the last log flush wait time for each database, and the highest value is displayed here. The performance counter for log flush wait time is not massively accurate, and seems to go up in large steps of around 15ms - 16ms at a time. In the Big Picture report, anything 32ms or higher is highlighted.
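The underlying counter can be queried directly, one row per database:

```sql
-- Log flush wait time in ms, from the Databases performance counter object
SELECT instance_name AS database_name, cntr_value AS log_flush_wait_ms
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Log Flush Wait Time';
```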

Possible causes of high log flush wait time are:

Slow I/O on the transaction log disk. Look at the virtual file stats (in the Database reports or the Activity | 10 Second reports), or connect to the server and check the performance counters.

The log file is auto growing, and perhaps has been configured to grow by a large amount

High number of virtual log files (check the Database reports)

Multiple log files for several databases on the same LUN.

Database mirroring or Always On

A log backup is running

Database mirroring, Always On and log backups all read from the log while other transactions are being written, which naturally slows down the write operations, and log files from several databases sharing a LUN compete for the same disks. Transaction log writes are sequential, and anything that forces the write head to move first slows them down.

Another way to check if the log flush is a bottleneck on your system is to check the WRITELOG wait time. See Performance | SQLOS | Waits and Latches, and Activity | 10 Seconds | Waits and Latches.
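The cumulative figure since the service last started can also be pulled straight from the wait stats DMV:

```sql
-- Cumulative WRITELOG waits since the SQL Server service last started
SELECT wait_type, waiting_tasks_count, wait_time_ms,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';
```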

Suspect Pages

A table in msdb called suspect_pages records database pages where a read has failed because of an error 823 or 824. This part of the Big Picture report shows a summary of this table.

Not all rows are current, so an entry in suspect_pages does not necessarily mean a database has become corrupt. The event_type column indicates if the suspect page is current or if it has been restored, repaired or deallocated.

SQL CoPilot only reports pages with an event_type of 1, 2 or 3, and with a last_update_date within the last 30 days.
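The equivalent query, using the same filters, is roughly:

```sql
-- event_type 1 = 823/824 error, 2 = bad checksum, 3 = torn page
SELECT DB_NAME(database_id) AS database_name, file_id, page_id,
       event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
WHERE event_type IN (1, 2, 3)
  AND last_update_date > DATEADD(DAY, -30, GETDATE());
```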

Your first action should be to perform a DBCC CHECKDB, and take the appropriate action based on the results. If no corruption is found, you can safely update or delete the suspect_pages rows.

If CHECKDB does find issues, your first response should be to restore from backup. Only use the REPAIR options on CHECKDB if a backup is not available.
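A basic consistency check, assuming a database named YourDatabase, looks like this:

```sql
-- Report all errors, suppress informational messages
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```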

Longest Running Process

Depending on the system, a long running process may indicate an issue, such as blocking, resource contention, or a user who has submitted a long and inefficient query.

This is highlighted for processes that have been running for more than 60 seconds, and excludes Profiler traces and SQL Agent jobs.

It does not currently exclude sp_server_diagnostics in SQL 2012. This will be added in the next release.
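A rough equivalent using the requests DMV (session_id > 50 is an old rule of thumb for excluding system sessions, not an exact filter):

```sql
-- User requests that have been running for more than 60 seconds
SELECT r.session_id, r.start_time,
       DATEDIFF(SECOND, r.start_time, GETDATE()) AS run_seconds, t.text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id > 50
  AND DATEDIFF(SECOND, r.start_time, GETDATE()) > 60;
```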

Read/Write Latency in Data and Log Files

Virtual file stats are a good indicator of the I/O throughput and performance for each data and log file.

The Big Picture report displays the average latency since the SQL service was last started for data and log files, and for reads and writes.

It is possible to see large values when in fact the daytime latency is quite low. If this is the case, have a look at jobs and business processes running overnight as they are likely to be the cause.

To confirm if latency is currently OK, look at the Activity | 10 Seconds | Virtual File Stats report.
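For a quick manual check of the average latency since the service started, something like this does the job:

```sql
-- Average read/write latency per file since the service last started
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.type_desc, mf.physical_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads,  0) AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id;
```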

High tempdb Usage

tempdb is very important to the efficient running of your SQL Servers, and if it fills up and cannot grow you will start to see a lot of errors.

This metric shows the percentage of tempdb in use.

Since tempdb should already be sized appropriately and should not need to grow, this is highlighted in orange when more than 50% is in use, and in red when more than 90% is in use.

You can look at the type of data in tempdb and the sessions responsible in the Performance | tempdb reports.
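The percentage in use can be calculated from the file space DMV (the allocated_extent_page_count column requires SQL 2012 or later) - run this in tempdb:

```sql
USE tempdb;

-- Percentage of tempdb pages currently allocated
SELECT 100.0 * SUM(allocated_extent_page_count) / SUM(total_page_count)
       AS percent_in_use
FROM sys.dm_db_file_space_usage;
```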

Longest Blocked Process

Blocking is perfectly normal in SQL Server, and in fact it is necessary to ensure data integrity (depending on the isolation level), so short term blocks are nothing to worry about.

However, longer blocking times should be investigated.

This is highlighted for processes blocked for more than 60 seconds. View the Activity | Current | Sessions report to see all blocked requests and the sessions blocking them.
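The same information is available directly from the requests DMV - a sketch:

```sql
-- Currently blocked requests, who is blocking them, and the running SQL
SELECT r.session_id, r.blocking_session_id, r.wait_type,
       r.wait_time / 1000 AS wait_seconds, t.text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
```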

Runnable Tasks

As mentioned above in the discussion about high CPU, if the number of runnable tasks is consistently non-zero, it may suggest that CPU is a bottleneck.

SQL Agent Not Running

I suppose there may be situations where SQL Agent is not needed, in which case you can ignore this.

But for most instances the agent is used to schedule backups, index rebuilds, database consistency checks, etc., as well as various business processes.

If you don't have any third party monitoring application you should definitely have alerts set up and enabled for high severity errors, and these require the agent to be running.
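As a sketch, an alert for one severity level looks like this (the alert name is just an example - you would repeat this for severities 17 to 25, then attach an operator to each alert with sp_add_notification):

```sql
-- Alert on severity 19 errors; the name is arbitrary
EXEC msdb.dbo.sp_add_alert
     @name = N'Severity 19 - Fatal Error in Resource',
     @severity = 19,
     @include_event_description_in = 1;  -- 1 = include the description in email
```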

Failed, Canceled, Disabled and Unscheduled Jobs

Not much to say here. This section highlights jobs that failed, and brings others to your attention that may no longer be required and could perhaps be deleted.

Long Running Jobs

A long running job is one where the last run (or current run time if still running) is greater than 120% of the average from the last 10 successful executions.

This is a purely arbitrary calculation, but is better than none at all.

You may find that some jobs with very erratic run times trigger this to be highlighted.
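If you want to see the averages behind this check, run_duration in the job history is stored as an integer in HHMMSS form, so a rough conversion to seconds looks like this (this simplified version averages all successful runs rather than just the last 10):

```sql
SELECT j.name,
       AVG(h.run_duration / 10000 * 3600
         + h.run_duration / 100 % 100 * 60
         + h.run_duration % 100) AS avg_seconds
FROM msdb.dbo.sysjobs j
JOIN msdb.dbo.sysjobhistory h ON h.job_id = j.job_id
WHERE h.step_id = 0      -- step 0 is the job outcome row
  AND h.run_status = 1   -- 1 = succeeded
GROUP BY j.name;
```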

Default Trace Not Running

Unless you have a very high throughput instance in which every inch of performance is needed, there is no reason to disable the default trace.

It is an extremely useful record of recent activity, and there is a collection of reports in SQL CoPilot to show you its contents.

Admittedly the small size limit may mean that only a few minutes of data are retained on very busy servers, but you should not disable it without a very good reason.

To enable the default trace, run the following:

EXEC sp_configure 'default trace enabled', 1;
RECONFIGURE;

Deadlocks

In versions of SQL Server up to 2005, the only way to track deadlocks was by enabling a trace flag, so that deadlock details were recorded in the error log.

From 2008 onwards, this information is also recorded in the system_health extended event session. This is the source of the deadlock data in SQL CoPilot.

It means that a count of deadlocks is not displayed in the Big Picture report for SQL Server 2005, but you still have the option of enabling trace flag 1222 and checking the error log.
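To pull the same deadlock reports yourself from the system_health session's ring buffer target (the ring buffer only holds recent events, so history is limited), a commonly used query is roughly:

```sql
SELECT xed.value('@timestamp', 'datetime') AS event_time,
       xed.query('.') AS deadlock_graph
FROM (
    SELECT CAST(st.target_data AS xml) AS target_data
    FROM sys.dm_xe_session_targets st
    JOIN sys.dm_xe_sessions s ON s.address = st.event_session_address
    WHERE s.name = 'system_health'
      AND st.target_name = 'ring_buffer'
) AS tab
CROSS APPLY tab.target_data.nodes
    ('RingBufferTarget/event[@name="xml_deadlock_report"]') AS x(xed);
```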

If you have deadlocks, click on the box to see a list of when they occurred and to see the running SQL and objects involved.

Failed Logins

A large number of failed logins can be an indicator of several issues, ranging from a user or application using the wrong username or password, or specifying a non-existent database in the connection string, to Kerberos issues, Windows AD problems, or even unauthorised attempts at access.

Your first action should be to look at the error code, state and message. There are plenty of resources on the web to guide you on their meanings and how to resolve them.
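To see the recent messages themselves, you can search the current error log (the first parameter is the log number, 0 being the current one; the second selects the SQL Server log rather than the Agent log):

```sql
-- Search the current SQL Server error log for failed login messages
EXEC sys.xp_readerrorlog 0, 1, N'Login failed';
```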