A basic approach to SQL Server administration

Menu

Tag Archives: performance

What is Page Life Expectancy (PLE), what makes it drop, and how can I manage memory better? Abusing disks slows many database servers, and it’s often something you can fix with tuning and not spending extra money on better disks or more memory. It takes a very long post to get through all of that, but if you stick with me through this then you’ll be looking at your servers from new angles with an effort that will be noticed by the Sys Admins, SAN Admin, the users, and your boss.

Before we being, there are some ground rules we need to get out of the way defining PLE and understanding there are external memory factors. There’s no shame in skipping that and going straight to the focus of this post. Personally, I find the next two sections a little dry, but I’m also my harshest critic!

Define PLE

Before we get too deep into it, lets make sure we’re on the same page on a couple things.

Page Life Expectancy is the number of seconds the average page of data has been the data cache (aka buffer cache). The data cache is part of the buffer pool (the similar name is why I use data cache instead of buffer cache, although they’re the same thing), which is the memory you allocated to SQL Server. SQL Server uses this area to give quicker access to data instead of taking the long, slow trip to persisted storage (disk).

Keep in mind that SQL reads the data pages from the cache, always from cache. If the data you need isn’t there then SQL Server does a physical read to put it there. After that’s done it will do a logical read to use the page that’s now in memory. If you want to dive into the detail you can do it here, specifically with the reading pages and writing pages links on that page.

That physical read is going to the slowest part of your server (even expensive flash is the slowest part, it’s just not as slow) to read the page(s) from persisted media, and it also uses the CPU in the process. This makes PLE critical because, even if you ignore the extra load you just placed on the I/O system and CPU, you’re waiting for an additional, slower action to take place.

You can find your page life expectancy in sys.dm_os_performance_counters using my post on OS Perf Counters. That post will also help you realize how much load you take off of your disks by raising PLE, then you’re really going to start to understand the burning drive picture you find there.

The problem is that many people see the 300 value for Page Life Expectancy you can still find documented is wrong, very wrong. If this is simplified to be how long the average data will stay in memory and you have a server with 64 GB of memory with 56 GB allocated to SQL Server, that means you’re reading about 56 GB of data from disk every 300 seconds. If you look at Page 36 of Troubleshooting SQL Server – A Guide for the Accidental DBA by Jonathan Kehayias and Ted Krueger you’ll see an actual scalable version of this; PLE should be 300 for every 4 GB of RAM on your server. That means for 64 GB of memory you should be looking at closer to 4,800 as what you should view as a critical point.

I’ll go so far as to say that you should be cautious of anything that doesn’t have a formula to it because our servers keep getting faster, our databases keep getting bigger, and static values…don’t. Jonathan Kehayias and Ted Kruger giving a formula is great, because the old value became outdated the day SQL Server went 64-bit, or even before that with the AWE setting.

Jonathan Kehayias also put this in his blog post Finding What Queries in the Plan Cache Use a Specific Index, and that had a great series of comments going back and forth between him and Brent Ozar. Brent is discussing not putting so much faith in PLE and watching wait stats instead as different speed I/O systems can greatly sway the impact of low PLE or even make it so a high PLE just isn’t high enough. They’re both right, and if you can understand the intent of each one then you’ll be in a much more comfortable place when tuning your servers.

If you force me to take a side on that post, Jonathan wins out over Brent here. This is because Jonathan is talking about having a higher PLE, which in turn reduces load on the disks and typically reduces the amount of work that needs to be done to execute a query. Brent is basically implying that money may have already be thrown at the issue, which negates the current symptoms on what could be a less scalable solution. However, you can’t discount Brent’s logic here as the things that are limiting your server’s performance at this time lie in the Wait Stats, not this counter.

Quick Rant: We’re Not Alone

So, you want higher PLE to show that you’re using your resources better, and the only way PLE goes up is by leaving data in the data cache once it’s there. The problem is, there’s a lot going on that wants to work the other way. Even if you have 64 GB of memory dedicated to SQL Server to host a single 64 GB database it won’t all fit in memory, that’s because other things want to play as well.

First, your memory isn’t completely dedicated to the buffer pool. Memory pressure from the OS or VMs changing size can drop the size of the buffer pool has unless you’re using “lock pages in memory”, which is not recommended in most situations especially on VMs. If you’re not on a completely dedicated box, which means a physical box with no apps, no GUI, no RDP sessions, no SQL Server, no network card driver, etc., then this can happen. Wait a minute…you can’t uninstall SQL Server to make sure SQL Server’s buffer pool remains untouched, yet SQL Server uses memory outside of this area for any purpose that isn’t specifically assigned to the buffer pool and that memory usage can get out of control. If there’s memory pressure then some pages are forced out of data cache to drop the size of the buffer pool, leaving you open to more contention and a lower PLE.

Second, your buffer pool isn’t dedicated to your data cache as it also houses your proc cache, which are execution plans stored to be run again. These plans take a lot of CPU to create and reusing them saves you a lot of resources, so you can’t complain that you’re sharing space here. SQL Server determines the amount of the buffer pool to assign to each, and it’s possible for either side to have unnecessary data in it. I’m going to focus on the data cache here, so I’ll defer the abuse of proc cache conversation to Kimberly Tripp. There’s more to it than just her post, but this is a great place to start understanding what’s there and not in use.

Focus, Focus, Focus

Ok, so we know what PLE is and that external factors can affect it, but that’s not really why you’re here, is it? Are you here because you have your data cache and you want to manage it to keep your PLE going up…maybe a little. My guess is that you’re here because you heard that PLE is critical to your server’s performance and you want to increase performance by raising this counter the right way. You do that in two ways.

First, make sure you have as much room as possible. If your server can handle more memory it’s often cheaper to buy that much memory than it is to intervene on the database side. Tuning and purging scales well, will help a lot more than just memory management, and is the best thing you can do given the time budgeted to it, but it just isn’t as quick and easy as a memory upgrade in many cases. Even if you buy memory now and tune later, it will continue to help. 64 GB of memory, as Brent Ozar pointed out here, is only a couple hundred dollars here. FYI, I absolutely hate throwing money at issues, and I still listed this first.

Second, keep as little in memory as possible to make efficient use of the space you have. Yeah, it’s that easy. Well, at least it’s that easy to say, but doing it takes a little more work. There are a list of ways you can trim down on the space you need.

Drop Unused Indexes

Merge Duplicate Indexes

Use Your Indexes – SARGability

Watch for Big Queries

Look in Your Proc Cache for Opportunities

Know What’s in Your Buffer Cache

Index Maintenance – Defrag

Index Maintenance – Statistics

Purge Your Data

Other

Through all of this, please remember that your initial thought may be to raise a counter, but is that your real goal? I’m betting you want your server to run faster, and PLE is merely one counter that helps put a value on performance. Servers with high PLE can run horrible, they just don’t need to keep running back to disk. Servers with low PLE can run great, they just need to go back to disk to get what they need.

A little bit ago I mentioned buying more memory, this will raise your PLE without making any queries really run more efficiently. Sure, they’re more likely to have their data in cache which is great, but they’re doing just as many logical reads and using just as much CPU to do their joins. The rest of this post isn’t like that, the rest is making queries more efficient in a way that just happens to make your PLE go up.

Drop Unused Indexes

This part’s tricky. SQL 2005-2008 R2 told you how much indexes were used since the last time SQL Server was started in the DMV dm_db_index_usage_stats. It’s still there in SQL 2012 and beyond, but these statistics get reset when you rebuild an index now. That means that best case scenario on a server that’s patched monthly, you have a month’s worth of data to go off of and you can’t even rely on that being the whole picture anymore. I get into more about how to keep track of this over time in my post Indexes – Unused and Duplicates.

This has a minor implication on PLE because as the pages of your unused indexes are updated SQL Server has to read part of the B+tree into memory to find the page, then it has to read the page to be updated into memory, it updates it in memory and marks it as a dirty page, then eventually writes it back to disk. The key here being that pieces are read into memory, AKA the buffer pool, the part of SQL Server you’re trying to clean up.

Merge Duplicated Indexes

This one’s not as tricky, but the performance issues are much worse. The problem is that you have two indexes, we’ll call them ix_A and ix_B, that are very similar. They have the same first two key fields, and stray a little after that. The differences mean that ix_A will be better for one query and ix_B will be better for another, and that means SQL Server will be dragging them both into cache. Sure, if you take the columns that are in ix_B that aren’t in ix_A and include them in ix_A then drop ix_B then the queries that used to use ix_B will probably need to do a little more work, but you’ll end up with a more efficient use of the cache. Your typical testing will show you the worst case scenario here, because you’ll run it each way multiple times and see what it’s like when the indexes are completely in cache. However, the real world is more friendly than that (that’s a first), and the query that used to use ix_B may actually run faster because now it’s using an index that’s more likely to be in cache, cutting down on physical reads. Going to the same blog post, Indexes – Unused and Duplicates, you can see details on how to help relieve this problem.

There’s a bit of an art to this at first, but you’ll get it down to a science with practice.

Use Your Indexes – SARGability

SARGability (Search ARGument capable) is it’s own topic, and it deserves at least one post of its own. You need to understand how indexes work, as stated in my post Indexing Fundamentals, and make sure that your queries can take advantage of indexes. If you look in a phone book for people with the last name like ‘%ood’ it’s no good because you need the first letter of the last name to take advantage of that index. If you look up people with the first name ‘Steve’ then that’s useless because the first key field was last name. Doing functions, implicit conversions, and many other techniques can cause you to need to scan an entire index.

In terms of PLE, that means you read the entire index into memory instead of just the pieces you need. That can add up to a lot of data being kicked out of cache to make room for a lot of physical reads from disk.

This isn’t an easy topic, and I’m not sure of any resources that say everything that needs to be said on the topic. While I have plans to write a post on it and will update this one when I do, it’s still an unwritten post. Until that time comes, Google “SQL Server SARGable” and see what pops up.

If you know how to read execution plans, there are two sections on a seek, scan, or lookup that you need to know here. Predicate and Seek Predicate. Predicate is what it had to scan for, while a seek predicate is what it could find efficiently using the fact that the key fields are sorted.

Watch for Big Queries

It doesn’t matter if you use Extended Events or Profiler Traces, you need to know when large queries are being run on your servers. If you want to watch this at a statement level then you’ll want to make the jump to Extended Events, but I’ve never seen it hurt to watch SQL Batch Completed and RPC Completed filtered only by duration over 10 seconds.

The point of this when it comes to PLE is that the queries that aren’t SARGable or don’t have an index to take advantage of will be doing a lot of scans. The bigger the scan the longer the query will take to run, and if you care about PLE then you’ll know what queries those are. Sure, some are legitimately doing enough work where they’ll always take that long to run, but most of the queries that take a while can be tuned. Be it changes to the query itself (should be your first instinct) or indexing changes, there’s often a way to have a query require less data in cache.

This is an extremely useful technique to use in correlation with watching for drops in PLE. If it drops then check to see what was running at the time. This shows you what queries were running, which would be most of your issues, but it could be outside of the query itself and actually be something like a large auto update statistics task being kicked off that you won’t see here.

Know What’s in Your Buffer Cache

Very few people do this, yet it’s so obvious once you start. You want to raise PLE, right? PLE is a measurement of how long stuff stays in your buffer cache, and drops because something else needed to be put in there. So, what’s in there that’s taking up all your space and how’d it get there? It took me years to come up with that question, and I feel like a bit of a fool for that.

It was actually one of my biggest tuning revelations since I read Grant Fritchey’s Execution Plans book. And you always know it’s a great revelation when you find yourself yelling at yourself for not realizing it earlier…years earlier.

So, how do you do this? Query your cache to find the indexes that take up the most space there, pick one that is an excessive amount of space or doesn’t look like it belongs there, query the proc cache to find out where that index is used, and tune that query either through code changes or indexing changes. The scripts to do this and a more detailed description of the process can be found in my post Cleaning up the Buffer Pool to Increase PLE

Look In Your Proc Cache for Opportunities

Lets start by saying this isn’t perfect. These numbers get reset throughout the day and some queries never make it in here at all. That’s saying this isn’t an all-inclusive, one-stop shop. It is in no way saying that you can’t make amazing improvements on the queries you find here.

There are two way you can use the cache. Most people know that you can get your most expensive queries such as on the MSDN sys.dm_exec_query_stats page, and you can use your imagination to sort this by any of the counters available here. That will give you an idea of what needs tuning, if you needed somewhere to start. I love taking this a step further and monitoring that exact information so I have a historical view instead of just what happens to be in cache now. You can read all about that in my post Query Stats.

The other way is parsing through the XML in the plans to find specific items. I heard that grunt when I mentioned XML, and you’re not alone. Not many of us are good at XML; Jonathan Kehayias is an extreme exception here. He gives you his queries so you can run with it, do what he does, and alter it to go even further. This is exactly what I did in my post Cleaning up the Buffer Pool to Increase PLE that I just mentioned in the last section.

Index Maintenance – Defrag

Many people only thing about defragmenting their indexes to help get contiguous reads on their disk, but that’s only half the story. In fact, the better you do everything in this post the less you’ll have to worry about your disks.

However, you also have to keep in mind how full each page of your indexes is. If you have an index page that’s 100% full then you have a full 8kb of data there. Add another row and you now have 8.1kb of data that’s split between two 8kb pages, so you’re wasting almost 50%. 50% isn’t even your worst case scenario because pages are never automatically removed or merged if records are deleted unless you delete every row stored in that page. Paul Randal’s post Performance Issues From Wasted Buffer Pool Memory takes a deeper look at this problem, and Ola Hallengren’s scripts can help you with a solution.

Index Maintenance – Statistics

It may seem obvious to do your index defrag job off hours. While I’ve seen servers without the job, I’ve never seen one scheduled for mid-day. That’s a good thing, because a lot of data needs to be pulled into cache to play around with indexes like that.

However, do you have auto update statistics turned on? Do NOT turn it off because of this, but understand that it updates the statistics by reading about 1% of the index into cache. If that’s a 100 GB index then you’re reading 1 GB of data which is a descent portion of your cache. To help avoid this, you should be updating your stats off hours, and Ola Hallengren’s scripts can help with this, too.

You need to update your stats because as more and more updates are made to a table the less accurate your stats are, making your execution plans less accurate. Once you update about 20% of the rows of a table since the last time your statistics were updated then they’ll automatically be updated again if you leave the default option turned on. Again, don’t go turning it off because of this post alone. Instead, schedule your stats to be updated off-hours and only the tables that have 20% of their rows updated throughout the day will get their stats updated automatically, and that will stop most auto-updates on the larger tables that would cause issues.

Purge Your Data

We talked a lot about what to do with the data you have, but do you need all of your data? Step away from SQL Server once and go talk to an accountant. Ask them where they have paper work from last month and they’ll point to a filing cabinet. Now, knowing they’re legally obligated to keep a lot of stuff for 7 years, ask them where a 5-year-old document is. It’s not right in front of them, but they know where to look for it. It’s not taking up valuable space in their office (read: not on the main production server), often not in the same building they’re in (read: not online), and it will take a bit to get to it. They no longer have an active business use for it, and they can get to it in the rare case that it’s needed.

Ask them where something is that’s 8 years old, and they’ll find a friendly way to tell you that they bought an OCD pyromaniac a pair of scissors and a book of matches. We’re more eloquent with deleting things in batches, but their way sounds like more fun.

If they kept everything in one room would you call that room a cluttered, unmanageable mess? If they kept historical records forever, would you think they were being obsessive and wasteful with purchasing storage?

So, can you pull up a detailed sales report for February 29th, 2000 from Prod for me?

Other…Am I Giving Up?

There are too many things to list in one blog post. You could write a book on this subject. I listed what I feel will help you the most, but I also wanted to take time to let you know that my list isn’t some magical, all-inclusive, everything you can do to make your servers run better. It’s a start. Hopefully a good one you found to be productive, but a mere start no matter how you look at it.

Speaking of “you could write a book on the subject”, well, they did. A lot of them did. Some even made them free PDFs. Not bootleg bit-torrent copies, but actual real, legal, free PDFs made possible by Red Gate in their book selection on SQL Server Central

DBA Pay Rate: If this isn’t on the list, try using a chart of the above counters

Rewritten Post

Some readers may have realized that I rewrote this post, and you can find the original version here. I did this because I didn’t feel the old post lived up to my expectations, and I’d love to hear feedback on what you think of the differences.

If a query is taking longer to run than normal, there’s a good chance it’s being blocked by something else. This is especially true when you’re doing something rediculously simple and SQL Server just sits there thinking. Symptoms of blocking problems include a trace of SQL:BatchCompleted and RPC:Completed with durations over 10 seconds comes back with results using less than a second of CPU time and very few reads and writes. Also, if you’re watching your wait stats, then you’ll start to see more waits that start with LCK_ than normal. Not the mention the overly obvious blocking found in your Running Processes

This query will show you the blocking currently occurring on your server.

However, if you have time to watch your servers all the time and keep running that query then you’re overstaffed in a world that leans towards being understaffed. For way too many reasons to list here, throw the results of this into a table every minute. If someone says a query ran long and you see the server was waiting on locks, look here. If you want to be proactive and look for ways you can improve server performance, look here.

Indexes aren’t free, and many databases end up with unused indexes. Every time you make any update to a table you will be updating the clustered index (I assume no heaps, because I hate heaps), and every index that has uses one of the columns that were updated. Inserts and Deletes affect every column and will affect every index, with exceptions for non-typical indexes. These updates cost a lot, to the point that I have several processes that disable certain nonclustered indexes, do all of the work, then rebuild the indexes afterwards.

The problem is that you can’t tell very easily if an index is being used or not. You can look at sys.dm_db_index_usage_stats, but that information is very limited in that gets wiped out on index rebuilds, SQL service restarts (server reboot included), and anytime you ask it to. That’s not mentioning that it also ignores obscure usage of the index. However, the obscure usage is more of something to keep in mind than it is something to base all of your decisions on.

From my point of view, there are two answers to this problem. There isn’t a right and wrong answer, it’s more of a step 1 and step 2.

The long and hard one is to capture the values in sys.dm_db_index_usage_stats on a normal basis (daily, before index maintenance) and give yourself long term statistics that you can determine if an index’s usage justifies its cost. Some indexes are updated 1,000,000 times for two uses. You obviously don’t need to keep this index around all the time, but you’re still paranoid about what those uses were. Were they ad-hoc queries run by a developer, or were they year-end reports run by the CEO? You can get some idea if you look at your history and check what day it was run. I would want to make a rollback script, drop the index, and make a searchable document of indexes I dropped in case anyone complained about year-end reports not working.

That was the long answer, which is right and is not replaced by the short answer. The script below is the short answer. If all of the key fields in an index match, or at least the first three in a larger index, then there’s a good chance that one of the indexes isn’t required or they could be consolidated.

For instance, if you have two indexes that have the same two key fields in the same order but the second index also includes two columns, then you’ll probably see usage on both indexes (see the long-hard solution above). If one index isn’t in use in this situation, your job is easy. If both are used, then look into dropping the index that doesn’t include any columns. Why? Because a slightly larger index rarely increases the overhead too much, and the queries that used the first index aren’t usually impacted too much by using a slightly larger index that has the same foundation.

This area gets shakey on how you should handle it, and is really one of the few areas in SQL Server I consider more of an art. For instance, the script below will show you indexes that have 5 key columns where the first three match. You look and find that both indexes are being used, but users are complaining that updates are a bit slow, too. I’d lean towards condensing by taking the the last two key columns of one index and including them on the other, but which index gets dropped and how do you test it?

The long answer takes a bit more work to figure out. You need to store the information long-term, keep a snapshot of what it looked like most recently, and do all the calculations to make sure you have the right numbers. It’s really no different than what I do for other DMVs, so you’ll see this process duplicated a lot if you follow my work. The tables for storing this information:

And now the script that captures the information and does all of your calculations. You may have noticed that the tables themselves do NOT store the index or table names which was intentional. Name changes happen, tables and indexes get dropped and created, all of that is extra work for little benefit. Is the benefit worth it? I thought it wasn’t. I’m not expecting everyone to agree with me here because there are arguments going the other way. My advice is STEAL MY WORK!!! Don’t copy it, you need to steal it. Make it your own, learn it well enough that you can improve it and build upon it. At that point you’ll truly understand every bit of it and hopefully start up your own blog. I hear EvenSimplerSQLServer.com is available, but that’s subject to change. If you don’t want to go that far, then write your ideas in the comment box below. I change these posts from time to time, and this script was actually added months after the original post was made. Just don’t ask me why it wasn’t here in the first place, my answer is just….DOH!!!! Even saying I was only 6 days into blogging at the time I wrote the original doesn’t make that acceptable.

Post navigation

Author’s notes

I hope what I'm doing here helps. Unfortunately I can't tell the difference between someone looking intrigued or quickly closing their browser from here. Feel free to contact me or comment on anything you see here. While it's always great to hear the good, I absolutely love hearing the bad because that's how I improve.

Not all of my posts are must-reads, but I feel some of them are. Here are some I'd like to recommend, but I'm a little biased.

More than anything, I also recommend that everyone give back like this. It's amazing how much teaching makes you learn better and faster, so it's a selfish endeavor as well. If you want to get started going public with your knowledge then let me know. I'm not the best source in this area, but I can point you to some brilliant people.

Thanks,
Steve

Consulting

I'm available for consulting outside of normal business hours in ET (New York -5:00 GMT). I have a focus on SQL Server performance, including T-SQL, indexing, PLE, and training your staff on what to do and avoid doing. In addition, I look over the general setup of your servers and let you know of anything that may be an issue.