All posts by John Tamburo

There are a gigantic number of tools out there that are free and extremely helpful. Since I’ve been away for a while with other issues (prepping a presentation for SQL PASS – hope it’s accepted; a new boss at work; two kids graduating high school; and more), I thought that I would break the ice by discussing four essential DBA tools and lauding their creators.

The free tools listed in this article help my team and me manage well over 100 databases running in 70+ instances, totaling 110+ TB in storage. However, they would help anyone, no matter the size of the shop.

#4: Pinal Dave’s missing and unused index scripts.

The missing index script looks at the missing-index DMVs in SQL Server 2008+ and generates not just a list, but the CREATE INDEX statements themselves. You can look at each index in script form and then decide whether to combine candidates into covering indexes, and whether the index is right for the shop at all.
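A simplified sketch of the same idea, built on the missing-index DMVs (this is my own illustration in the spirit of Pinal’s script, not his actual code; the generated index name is a placeholder):

```sql
-- Sketch: generate CREATE INDEX candidates from the missing-index DMVs.
-- Run in the context of the database you are analyzing.
SELECT TOP 25
    migs.avg_user_impact,
    migs.user_seeks,
    'CREATE INDEX [IX_' + OBJECT_NAME(mid.object_id) + '_missing] ON '
        + mid.statement
        + ' (' + ISNULL(mid.equality_columns, '')
        + CASE WHEN mid.equality_columns IS NOT NULL
                AND mid.inequality_columns IS NOT NULL THEN ', ' ELSE '' END
        + ISNULL(mid.inequality_columns, '') + ')'
        + ISNULL(' INCLUDE (' + mid.included_columns + ')', '') AS create_stmt
FROM sys.dm_db_missing_index_groups mig
JOIN sys.dm_db_missing_index_group_stats migs
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details mid
    ON mid.index_handle = mig.index_handle
ORDER BY migs.avg_user_impact * migs.user_seeks DESC;
```

Each row is a candidate, not a command to run blindly; that is exactly where the human analysis described below comes in.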

Where I work, we use a modified version of this script monthly on all production databases. Our version scripts the indexes with all of our standard options, and even generates a proposed name according to our standards. We of course analyze each candidate, and more often than not we end up combining several recommendations into covering indexes, or modifying one existing index to better cover, and of course analyzing if performance (select AND insert/update) needs dictate adding the index at all.

Pinal’s unused index script reads the index-usage DMV and determines which indexes are getting no activity. It’s easy to configure thresholds; in our shop, we look for absolutely zero index use in selects, prioritized by the number of updates to the affected index.

A good DBA will save these stats and accumulate them over time (the DMVs only collect data since the last SQL Server restart) before deciding which indexes to drop. Pinal’s script generates a nice DROP INDEX statement for each candidate.
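The core technique can be sketched as follows (again my own illustration, not Pinal’s code): find nonclustered indexes with zero reads, order them by write cost, and script the drop.

```sql
-- Sketch: unused-index candidates for the current database.
-- Remember: usage stats reset at every SQL Server restart.
SELECT 'DROP INDEX ' + QUOTENAME(i.name) + ' ON '
       + QUOTENAME(OBJECT_SCHEMA_NAME(i.object_id)) + '.'
       + QUOTENAME(OBJECT_NAME(i.object_id)) AS drop_stmt,
       ius.user_updates          -- write cost of maintaining the unused index
FROM sys.indexes i
JOIN sys.dm_db_index_usage_stats ius
    ON ius.object_id = i.object_id
   AND ius.index_id = i.index_id
   AND ius.database_id = DB_ID()
WHERE ius.user_seeks = 0 AND ius.user_scans = 0 AND ius.user_lookups = 0
  AND i.type_desc = 'NONCLUSTERED'
  AND i.is_primary_key = 0 AND i.is_unique = 0
ORDER BY ius.user_updates DESC;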

I salute Pinal Dave for keeping up these wonderful scripts.

#3: Michelle Ufford’s dba_indexdefrag script

This script creates a stored procedure that can be scheduled to reorganize or rebuild indexes. It is loaded with options, including one to report what it would do without actually rebuilding or reorganizing anything. It schedules easily into a job, can generate verbose results, and is vital in our OLTP environment, where data changes constantly fragment the indexes. The script has been frequently updated (it’s on version 4.1 as of this writing), and it works flawlessly.
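Under the hood, tools like this decide between REORGANIZE and REBUILD based on measured fragmentation. A minimal sketch of that decision, using the standard physical-stats DMV (the 5%/30% thresholds are the commonly cited rules of thumb, not Michelle’s exact logic):

```sql
-- Sketch: suggest a maintenance action per index based on fragmentation.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count,
       CASE WHEN ips.avg_fragmentation_in_percent >= 30 THEN 'REBUILD'
            WHEN ips.avg_fragmentation_in_percent >= 5  THEN 'REORGANIZE'
            ELSE 'OK' END AS suggested_action
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.page_count > 1000;   -- small indexes are rarely worth touching
```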

Michelle has created an invaluable utility for any production SQL Server DBA.

#2: sp_Blitz by Brent Ozar

This script installs a stored procedure into master that runs a very comprehensive health check of any SQL Server instance. You can get a monster output. It catches things that most DBAs never check. Examples include: logins with sysadmin access, configuration parameters that are not set to their defaults, users with the db_owner role in a database, even databases running at a compatibility level older than that of the detected SQL Server.

sp_blitz catches databases without a recent integrity check. Many messages are just reminders for things that are unusual (max file size settings are an example). The tool gives an astoundingly thorough analysis of a given SQL Server instance; I recommend running it semi-monthly and reading the report carefully. Even if you are aware of setting X, it is good to be reminded and to see if anything is slipping under your radar.

You can even have sp_Blitz output a count of messages, set that count up as a custom counter in many monitoring tools, and get an alert if it goes up. This tool is amazing; Brent Ozar could have charged real money for it. He does not. Kudos!
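Invocation is simple; a sketch (parameter names are from the version of sp_Blitz current as of this writing and may differ in yours):

```sql
-- Full findings table, including per-database object checks:
EXEC dbo.sp_Blitz @CheckUserDatabaseObjects = 1;

-- Single count of findings, suitable for feeding a monitoring counter:
EXEC dbo.sp_Blitz @OutputType = 'COUNT';
```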

#1: sp_whoisactive by Adam Machanic

Here is the tool that saves my butt. In a heavily-used OLTP environment, we can have poorly-formed queries that burn up tempdb, or we can have blocking.

Using this tool, which installs a stored procedure in master, one can see just what is running right now. Instead of an sp_who2 list of 400 SPIDs through which you have to sort, you see the queries actually running at the moment the procedure executes. You can see the query text itself, the resources consumed (reads, tempdb, CPU, waits), the login and hostname for the running SPID, and much more.

sp_whoisactive is so fast that you can execute it repeatedly in Management Studio and watch the query traffic passing through. As I said, this has helped me isolate more than one “query of death” and assist the developers in remediating the offending parts of the query.
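A typical invocation from my console sessions (these parameters exist in Adam’s procedure; check the documentation for your version):

```sql
EXEC dbo.sp_WhoIsActive
     @get_plans = 1,           -- include the estimated plan for each session
     @get_outer_command = 1,   -- show the full batch, not just the inner statement
     @find_block_leaders = 1;  -- roll blocked-session counts up to the lead blocker
```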

sp_whoisactive is my big winner among these.

I respectfully suggest that you go and look at each of these helpful tools, and I would wager that you will end up with most, if not all, of them on your servers.

Recently I received pushback from a developer about the use of a GUID primary key. The proposed database schema, intended to be used with Entity Framework, contained numerous tables primary-keyed on a column named ID and typed as nvarchar(50). The tables were clustered on this column, which was intended to hold a GUID populated by the .NET call System.Guid.NewGuid().

Cue Blood-Curdling Screaming Now.

I can hear the blood-curdling screams from every DBA worth his or her salt. I asked for the column to be redefined as bigint IDENTITY, citing the well-known adverse effects of clustering on a GUID, made worse in this case because the GUID would be stored as a 36-character Unicode string, taking a minimum of 74 bytes per row (72 bytes of data plus two bytes of variable-length overhead).

The developer objected because of the extra round trips to SQL Server; this application requires extreme throughput and low latency. With an IDENTITY, the developer would have to retrieve the generated identity value before dependent rows in child tables could be populated. With a program-generated GUID, no database access is required until insert time, and the inserts can be done in bulk through EF. The developer therefore offered to cluster on an IDENTITY column but keep the primary key as a GUID.

I still objected, because in experiments the nonclustered index reached 99.5% fragmentation after only about 1,000 rows, and that was with a fill factor of 50% defined for the index. The developer argued that fragmentation of the nonclustered index was meaningless. Either way, the index was huge, whether the GUID was stored as nvarchar(50) or as a uniqueidentifier (16 bytes).

Should I Care?

Now the question is whether I should care about a highly fragmented nonclustered index. Remember that this is a foreign key, and this GUID must (for performance reasons) be indexed for join purposes on both parent and child table. But this is a SEEK operation, is it not? Numerous articles argue that SEEK operations on nonclustered indexes, even when highly fragmented (which is the norm when the key is a GUID) are not expensive.

However, focusing on individual seek performance misses an important point here. The query below explains why I care:
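The original query did not survive the trip into this post, so here is a sketch of the kind of query at issue. The table and column names (ORDER_MAST, ORDER_LINE, ITEM, CUSTOMER, and their key columns) are assumptions reconstructed from the discussion that follows:

```sql
-- Sketch: a routine order-reporting query against the proposed schema.
SELECT om.ORDER_GUID, c.CUSTOMER_NAME, ol.LINE_QTY, i.ITEM_DESC
FROM ORDER_MAST om
JOIN CUSTOMER   c  ON c.CUSTOMER_GUID = om.CUSTOMER_GUID
JOIN ORDER_LINE ol ON ol.ORDER_GUID   = om.ORDER_GUID
JOIN ITEM       i  ON i.ITEM_GUID     = ol.ITEM_GUID
WHERE om.ORDER_DATE >= '20150101'
  AND om.ORDER_DATE <  '20150201';
```

Every one of those joins seeks on a GUID key; that is where the page reads discussed below come from.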

Order lines may be OK, because there will be an index on ORDER_GUID in the ORDER_LINE table. However, every line-item seek will not be so lucky. The items will be wildly randomized in the ITEM table, and each ORDER_LINE row will need to join to ITEM. A smaller but virtually identical issue arises as ORDER_MAST joins to CUSTOMER.

Assume that the ORDER_MAST table is clustered on an identity and the ORDER_DATE column is a datetime and has a nonclustered index.

Assume also that there are 10,000 orders with 10 lines each. That means a likelihood of reading 99,500 separate pages of data into memory in order to perform the ORDER_LINE-to-ITEM join. Furthermore, without a sequence-number column in the ORDER_LINE table, one would likely never be able to order the lines into the sequence in which they were inserted.

Defragmenting the GUID index would reduce those 99,500 page reads to about 50,000 page reads for a non-clustered index with a 50% fill factor. If you have an 8 byte bigint clustered index that is also the PK, you would likely get many rows per page (depending on row size) and in the end increase speed by more than one order of magnitude.

Hence a sequential surrogate PK has its uses. It would dramatically speed up this query. And we haven’t yet discussed insert performance as the index pages split over and over again, or the fact that defragmenting a nonclustered index based on a GUID can actually degrade insert performance by increasing the number of page splits. If anyone runs a utility such as Michelle Ufford’s dba_indexdefrag against such an index, one is likely to slow the inserts down.

Do I care? Heck yes! This is a hill to die on for any SQL Server DBA.

Conclusion.

In this case, the developer agreed to use an internally generated, ever-increasing unique number based on time, machine ID, and an increasing integer. The generated number fits into a BIGINT (8 bytes) or a DECIMAL(19,0) (9 bytes), so the PK costs eight or nine bytes per row instead of 74. This agreement solves all of the problems in a manner substantially identical to an IDENTITY or a SEQUENCE, while meeting the developer’s performance concerns, which were rightly expressed because this application requires maximum throughput.
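The developer’s generator lives in .NET, but the server-side SEQUENCE alternative mentioned above looks like this (SQL Server 2012+; object names are illustrative). A SEQUENCE even lets clients reserve ranges of keys in advance, which addresses the round-trip objection:

```sql
-- Sketch: an ever-increasing BIGINT surrogate key via a SEQUENCE.
CREATE SEQUENCE dbo.OrderKeySeq
    AS bigint START WITH 1 INCREMENT BY 1 CACHE 1000;

CREATE TABLE dbo.ORDER_MAST
(
    ORDER_ID bigint NOT NULL
        CONSTRAINT DF_ORDER_MAST_ID DEFAULT (NEXT VALUE FOR dbo.OrderKeySeq)
        CONSTRAINT PK_ORDER_MAST PRIMARY KEY CLUSTERED,
    ORDER_DATE datetime NOT NULL
);
```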

For years, we’ve seen the metric of Page Life Expectancy (PLE) touted as a health barometer for SQL Server. Moreover, we have seen PLE touted as a magic number: “300 seconds (5 minutes) means that you have a generally healthy SQL Server.” About 5 years ago, Jonathan Kehayias and Paul Randal both spoke strongly against this notion, and Jonathan argued that a good number is better determined by this formula:

(DataCacheSizeInGB/4GB*300)

Looks nifty, but this was again spoken of in 2011, when the wide availability of SSD was more of a pipe dream than the reality it is now. Even more so, the availability of all-memory SAN such as the Violin Memory line-up gives a whole new meaning, or perhaps a lack thereof, to the PLE metric itself.

Is PLE relevant anymore? A case study.

Here is a server; we will call it A. A runs SQL Server 2008, fully patched (with an application requirement preventing me from upgrading it to 2014), has 97GB allotted to SQL Server, and is attached to a SAN with three tiers of storage.

This server presents with a consistent 25-27 minute PLE, about 1,500-1,620 seconds. The buffer cache hovers at a very reliable 87-88GB. The main OLTP database hovers near 3TB and uses storage in the SSD and SATA groups, keeping only the BLOBs in SATA. Also, tempdb is completely on SSD.

Sounds healthy under the old “300 second” rule. However, under the Kehayias formula above, we come up with a preferred PLE of:

(87 / 4) * 300 = 6,525

Wow. According to this, the PLE should be 4 1/3 times longer! This server must be on the brink of setting fire to the SAN!
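For reference, both the current PLE and a Kehayias-style target can be read off the performance-counter DMV in one query; a sketch (on a multi-NUMA-node box you would also want the per-node Buffer Node counters):

```sql
-- Sketch: current PLE vs. a (data cache GB / 4) * 300 target.
SELECT ple.cntr_value AS ple_seconds,
       pages.cntr_value * 8 / 1024 / 1024 AS data_cache_gb,   -- 8KB pages -> GB
       (pages.cntr_value * 8 / 1024 / 1024) / 4 * 300 AS target_ple_seconds
FROM sys.dm_os_performance_counters ple
JOIN sys.dm_os_performance_counters pages
    ON pages.counter_name = 'Database pages'
   AND pages.object_name LIKE '%Buffer Manager%'
WHERE ple.counter_name = 'Page life expectancy'
  AND ple.object_name LIKE '%Buffer Manager%';
```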

PLE is a measurement of how long a page sits unreferenced in memory, on average. The idea is to keep enough pages in memory so that SQL Server finds the data page there and does not need to perform physical I/O to retrieve data.

But how important is that in an era where SSD is becoming commonplace, and some SANs actually manage a block’s location based on the amount of traffic it sees (think Compellent as a pioneering example)? SSD is fast. Does a buffer miss still choke up the SAN bus and fabric with waits on mechanical disks, as one would have seen in many SANs five years ago and more? I would venture to say no.

One common PLE interpretation says that my server is healthy. Even big queries run quickly as far as the users are concerned; the empirical data seem to agree. Another more rigorous analysis suggests that PLE is outdated and proposes a newer metric. However, hardware advances seem to be giving that metric a challenge in the present day.

Here are some other metrics from that same server:

Roughly 327 logical reads per physical read.

180 compiles per second

50-60 log flushes per second

Checkpoints at 120 pages per second

Read ahead at 120 pages per second

My options with this server are to:

Do nothing and be happy

Spend a few grand to increase RAM to 256G and thereby increase the SQL Server buffer cache to ~200G

Go nuts and reprovision this machine with a $200,000 monster server with terabytes of RAM.

I think that PLE as a metric tells me little in this case unless I apply Jonathan’s formula to it. I also think that Jonathan’s metric tells me to add some RAM to the server, and we will see an incremental boost in throughput. I don’t think it will be massive; I also don’t think it will be negligible. The $6000 we will spend will be well worth it.

Conclusion

PLE needs to be evaluated with more than a simple “300 second” threshold. Jonathan Kehayias’ formula is a good starting point, but even more rigor is required. The metric needs to be re-examined, and perhaps expressed as a ratio or something else that accounts for the size of RAM versus databases.

Right now, a threshold in a monitoring tool isn’t good, unless you tweak the alert for each server you monitor. Something to keep in mind as you juggle the servers, keep the plates spinning, and don the fire suit as you pass by the SAN.

The importance of a good monitoring tool cannot be over-emphasized. I’ve posted on this issue before. Now, I’d like to get into a tad more detail about this issue. If you have responsibility for more than 2-3 servers, you need to have the alerts and reports. However, you have other reasons to employ a good tool.

Note: This article features Dell Spotlight on SQL Server. I like the tool a great deal, but there are other good tools out there.

Benefit: Good visual understanding of Emergent situations.

Have a peek at the screen shot below. Since it comes from an actual production environment, certain information has been obfuscated.

In this particular situation, the tool informs me that the server has a process that has been blocked for a long enough period to trip the red alert level.

Clicking on the Blocked process icon gives us the details of the error, and clicking further gives us an analysis window:

This shows that the lead blocker is blocking one process, and the blocker is running but “AWAITING COMMAND.” In this case, investigation reveals a massive update in process by the application. Further checking was not necessary in this case because the blocking cleared up about a minute later.

The above exemplifies a situation that a good DBA could have diagnosed by using sp_who2 and DBCC INPUTBUFFER on the lead blocker and the blocked task(s). However, that would take far more time than the 4-5 seconds needed to examine it in the tool; moreover, the DBA is able to examine the issue proactively rather than reactively (that is, before the angry user calls up with a hung session).
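For anyone without a tool, that manual route looks like this (the SPID is an example value; you substitute whatever the BlkBy column shows):

```sql
-- Manual blocking diagnosis, the old-fashioned way:
EXEC sp_who2;          -- scan the BlkBy column to find the lead blocker's SPID

DBCC INPUTBUFFER(64);  -- 64 = the lead blocker's SPID from sp_who2 (example)
```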

Let’s take a look at a slightly more critical scenario:

The above, my friends, is a classic “code Brown” situation. The server showed up on the heat map (see later on) as a giant red blob taking up over a quarter of the real estate on the screen. Why? If you haven’t zoomed in, you probably couldn’t see this:

That’s right: 27.6 seconds average I/O wait! You can guess that the SAN is being murdered by something truly massive that is whipping tempdb. As an aside, the SAN in question placed 14 tempdb files on SSD drives; the bottleneck was the SAN fabric itself. Ever seen the classic drawing of Brent Ozar standing in front of a porta-potty labeled “TempDB”? He illustrates an important point, and with good reason.

In this case, there was a table-valued function querying a 100+ million row table twice, using a UNION (not UNION ALL). This function created a 100GB heap in tempdb and then proceeded to thrash that heap to eliminate the duplicates the UNION requires. I was one very upset DBA. More than once, this culprit blocked several production jobs, including the peer-to-peer replication.
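The UNION vs. UNION ALL distinction is worth spelling out, because it is exactly what generated the tempdb heap. A sketch with a hypothetical table (dbo.BigTable and its columns are illustrative):

```sql
-- UNION must deduplicate: the combined set gets sorted or hashed,
-- which on a 100M-row table spills a massive worktable into tempdb.
SELECT col_a, col_b FROM dbo.BigTable WHERE region = 'EAST'
UNION
SELECT col_a, col_b FROM dbo.BigTable WHERE region = 'WEST';

-- If the branches cannot produce overlapping rows, UNION ALL simply
-- concatenates the results and skips the tempdb thrash entirely:
SELECT col_a, col_b FROM dbo.BigTable WHERE region = 'EAST'
UNION ALL
SELECT col_a, col_b FROM dbo.BigTable WHERE region = 'WEST';
```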

We were able to diagnose this issue with minutes of work, and again, we were able to proactively raise it to the developers and ensure that fixing the issue was a high priority.

Benefit: Enterprise overview

This is vitally important. When you have your monitoring tool set up properly, you should be able to look at the screen and see what’s up with the enterprise. Something like this:

The above screen shot reflects nirvana: all pretty green blocks. This tool monitors VMware clusters, Windows servers, SQL instances, and replication publishers; just over 40 hosts and just over 135 connections. The screen shot was zoomed out to hide the name and type of each block, again for reasons of confidentiality.

We have a full-sized window with the tool open on a 42-inch monitor in the room where the DBAs sit, and this permits us to instantly check the state of the enterprise. Even better, we can do it from our iPhones too.

Therefore, we have good ability to take a look whenever we desire and evaluate the state of the enterprise.

Benefit: Alerting

Any good tool won’t just sit there and wait for you to see a problem; it will send you an email (or even page your phone) when a situation requiring attention emerges.

We have our alerts configured to send us email for some high alerts. Others that absolutely require attention are sent to our phones as pages (texts). This is easy because we have configured our email server with DBA Notify (email) and DBA Page (phones) email addresses. A good alerting tool will let you send to one or more emails, and will even let you execute some program(s) upon an alert. For example, if replication to the reporting servers is backlogged, run a program to pause the report queues so that we don’t publish reports with stale data.

Alerting is the one thing in any decent monitoring tool that will save the DBA’s backside. For this reason, you’ll find out that many tools come with hundreds of possible alerts, with many already set up to “turn red” on a hair trigger. What the DBA must do is to look at each alert and set it up conservatively. Better to receive 100 alerts and find out that you can dial 75 of them back, rather than to set up a rose-colored (or should I say green-colored) view of the world that misses the one thing that takes the shop down.

Caveat: Snoozing alerts

In my shop, disk space on the SAN is at a premium in our most critical environments. We’ve come close to filling the SAN, so I embarked upon size-reduction projects. In one environment, I’ve freed up nearly a terabyte of space inside the databases, but the data files still come close to filling the drives. Therefore, I get alerts on disk-space usage from the Windows Server level of monitoring, and I snooze them for a month at a time (knowing that the data files can’t grow and have plenty of free space inside them).

It’s better to snooze a pile of alerts, and get pinged monthly or even less often, than it is to turn the alert off for that machine or (yipes!) the enterprise.

Working the Tool to Work the Servers.

In our shop, while on duty we work the monitoring tool to know when to work the server itself. If I get chronic yellow alerts for processor queue length on a VM hosting a SQL server, and I’m snoozing that alert more than once or twice a month (daily would be bad), I know to bring the issue up with the network admin to see if I can get that VM some more horsepower.

This is an optimal situation when you are part of a small DBA staff shepherding a large number of servers. You want to be in a position to know where to apply your force, and the monitoring tool is what military types call a “force multiplier.”

When you are off duty, the tool pages you if something is wrong, and if you are on duty, you can see things emerge and address them before they have business impact.

Conclusion: It’s worth it!

I cannot imagine any shop where DBAs have to manage more than 3-4 SQL Servers being effective without a top-flight monitoring tool. I have identified the tool that I use, but that doesn’t mean it is the only tool that can do the job. Look for yourself. Get comfortable with the tools using demos.

Then make the business case to your management. Reduced downtime. Better utilization of employee time. Force multiplication. Ensure that your management knows that this is not a “toy” for your DBA staff to “look cool” with, but rather an integral and necessary part of the toolset you need to do your job. I’m privileged to work for a company that recognizes and supports my team, and the end result has been a DBA team that does not engender any worry from management.

Every production database should be subject to an integrity check with DBCC CHECKDB, at least every week. It’s not just a best practice; failure to do so should be considered dereliction of DBA duty. The longer you run with an integrity error in your database, the higher the likelihood of a catastrophic, “call Paul Randal for help” situation.

DBCC CHECKDB in a HA tight-disk situation

However, what happens when you have a high-uptime VLDB with little to no disk to spare? You might feel that you are in a catch-22: if you use TABLOCK to avoid the internal snapshot, you need downtime; but if you don’t use TABLOCK, the snapshot has no disk to use (on the drive holding the file where your object’s pages are stored), and your DBCC crashes with error 17053, root cause OS error 112 (not enough disk space).

There is a way around this: take a full backup of the database and restore it to another server, where it can be accessed without blocking other applications. Once the restore is complete, run DBCC CHECKDB with TABLOCK. BACKUP DATABASE copies the database pages containing data into the backup, and RESTORE DATABASE copies them back, so checking the restored copy gives you a reliable picture of the integrity of the production database. Moreover, this strategy gives you a reliable test of your backup strategy.
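A sketch of the offload strategy (database names, logical file names, and paths are all illustrative; check your own with RESTORE FILELISTONLY first):

```sql
-- On the production server:
BACKUP DATABASE BigProd
    TO DISK = N'\\backupshare\BigProd_full.bak'
    WITH CHECKSUM, INIT;

-- On the secondary server:
RESTORE DATABASE BigProd_Check
    FROM DISK = N'\\backupshare\BigProd_full.bak'
    WITH MOVE 'BigProd_Data' TO N'E:\Data\BigProd_Check.mdf',
         MOVE 'BigProd_Log'  TO N'F:\Log\BigProd_Check.ldf';

-- No applications touch this copy, so TABLOCK is safe:
DBCC CHECKDB (BigProd_Check) WITH TABLOCK, NO_INFOMSGS, ALL_ERRORMSGS;
```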

DBCC CHECKDB as a Scheduled Job

What if you can run DBCC CHECKDB without TABLOCK and not have problems, or if you can get the database into single-user mode for a clean run with TABLOCK? Then you need to make a scheduled job out of it, for each and every database on your server. SQL Server Maintenance plans will do this for you.

You should schedule DBCC CHECKDB weekly. I recommend that you schedule this job to notify the DBAs whether or not the job succeeds. It is good to put the DBCC output in your face every single week, to give you assurance that it has been completed.
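If you prefer a T-SQL job step to a maintenance plan, a minimal sketch that walks every online database looks like this:

```sql
-- Sketch: weekly CHECKDB over all online databases (an Agent job step).
DECLARE @db sysname, @sql nvarchar(max);
DECLARE dbs CURSOR FOR
    SELECT name FROM sys.databases
    WHERE state_desc = 'ONLINE' AND name <> 'tempdb';
OPEN dbs;
FETCH NEXT FROM dbs INTO @db;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@db)
             + N') WITH NO_INFOMSGS, ALL_ERRORMSGS;';
    EXEC (@sql);   -- errors surface in the job history and notification
    FETCH NEXT FROM dbs INTO @db;
END
CLOSE dbs;
DEALLOCATE dbs;
```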

What if DBCC CHECKDB finds a problem?

If you have a consistency error then, regardless of any other issues, get with management and bring the system down as soon as possible! Get that database into single-user mode and start to fix the issue. The exact steps depend on the errors found, and those repair attempts are best left to a more detailed post.

The important thing to remember is that time is of the essence. Every second you are up with an integrity error is another second where the opportunity exists for that error to become serious and perhaps irreparable.

Once the error(s) are corrected and a DBCC CHECKDB with TABLOCK passes in single-user mode, take a full backup at once. Assume that all previous backups are corrupt.

Your managers may be grumpy that a production system had to come down for hours to repair a data inconsistency, but that is better than failure to report a potential problem that escalates to disaster. If your managers refuse to bring the system down after you warn them of the error and its potential risks, then the responsibility lies with them and not you.

Whatever you do, running and verifying regular integrity checks on all production databases should be considered a core responsibility of your job.

If you are creating a database, and it is going to be large, you may want to consider having more file groups than just PRIMARY. File groups enable the DBA to segregate objects into separate files. For example, in our shop, our largest database has the PRIMARY file group, which holds tables and clustered indexes. Non-clustered indexes go into the INDEX file group, and WORM tables that contain BLOB data that isn’t frequently read go into our ARCHIVE file group.

File groups make it extremely easy to segregate a database into files that can be assigned to drives or LUNs that contain a certain speed of disk. For example, one can have PRIMARY and INDEX on SSD, while the ARCHIVE group is on SATA. The DBA can gain more granular performance control by using file groups and LUNs together to give or take away speed. The DBA can control cost by using a file group to route infrequently-used data to cheaper (and usually slower) storage.
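A sketch of how this looks in DDL, following the shop layout described above (database, file, and index names are illustrative):

```sql
-- Sketch: add a filegroup on fast storage and route objects to it.
ALTER DATABASE BigProd ADD FILEGROUP [INDEX];

ALTER DATABASE BigProd ADD FILE
    (NAME = N'BigProd_Index1',
     FILENAME = N'S:\SSD\BigProd_Index1.ndf',   -- SSD-backed LUN
     SIZE = 100GB, FILEGROWTH = 10GB)
    TO FILEGROUP [INDEX];

-- Nonclustered indexes land on the INDEX filegroup via the ON clause:
CREATE NONCLUSTERED INDEX IX_ORDER_LINE_ITEM
    ON dbo.ORDER_LINE (ITEM_ID)
    ON [INDEX];
```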

The Difference Between File Groups and Files

File Groups are assigned to objects (indexes, tables). Each File group represents a span of data pages. These pages are not, however, constrained to be on one file. The DBA can assign many files to a single file group.

Assume that you have just the PRIMARY file group and a great deal of update traffic. If you have one file in that group, all pages must go to that one file. Now imagine a second file, sized the same as the first. SQL Server will distribute the pages in the file group fairly evenly between the files. If each drive / LUN / mount point is physically separate, you now have two paths to read or update data, usually resulting in a performance boost. Obviously, putting both files on the same drive / mount point / LUN buys you no performance.

Using multiple files in multiple file groups multiplies the potential performance gain, provided that you’re not piling the files on the same drive or mount point or LUN.

However, you can also manage space. As the drive holding the SQL data grows closer to full, you may need to cap one file and open another to handle new data pages. This happened to me as the data in our new app grew from 250GB to nearly 2TB, at a rate nearly four times what was predicted before the application went live. Certain data files were set to grow no further as the mount point holding them approached 100% full; a new mount point was added and a new file opened. With mount points, this can be done with minimal or even zero downtime, depending on configuration (results may vary with failover clustering, which may require bringing SQL Server down and back up after adding the drive or mount point as a resource).
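The cap-and-extend maneuver itself is two ALTER DATABASE statements; a sketch with illustrative names and sizes:

```sql
-- Sketch: freeze growth on the file whose mount point is nearly full...
ALTER DATABASE BigProd
    MODIFY FILE (NAME = N'BigProd_Data2', MAXSIZE = 500GB);

-- ...and open a new file on the newly added mount point.
ALTER DATABASE BigProd
    ADD FILE (NAME = N'BigProd_Data3',
              FILENAME = N'M:\Mount2\BigProd_Data3.ndf',
              SIZE = 100GB, FILEGROWTH = 10GB)
    TO FILEGROUP [PRIMARY];
```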

To conclude, managing file groups and the files within the group is a powerful tool in the DBA’s performance arsenal. Learn to shoot straight with it.

There is English. There is also geek-speak. Then there is DBA-Speak. Then, for the truly advanced among us, there is SQL Server DBA-speak.

However, if we are to sound like professional DBAs and not newbies, we must pronounce our terms of art correctly. Failure to do so can result in: (a) Losing the chance to speak at SQL Pass Summit; (b) Being uninvited from the cool vendor parties at the Summit; (c) Generally being ostracized by the SQL DBA community; and (d) Death.

So, here are some terms that are often mispronounced, along with the canonical ways to pronounce them:

The Data Type CHAR: Is it pronounced CHARR, like the delicious blackened coating on a delicious piece of steak served at a vendor party at SQL Pass Summit? Or is it pronounced CARE, like the first symbol of “character?” The latter pronunciation is a capital offense.

The Data Type VARCHAR: This is more complex, because we have multiple pronunciations:

V-ARE CHARR: the Rhyme-like pronunciation that shows that you are a mighty DBA and not some poseur.

V-AIR CHARR: Welcome to poseur world.

V-ARE CARE: Capital Offense

V-AIR CARE: Double Capital offense. They execute you, wake you up, and do it again. Sort of like the end of Braveheart.

SARGABLE: Short for “search argument able.” This refers to the good situation where a WHERE, HAVING or ON clause doesn’t do silly things like wrap the searching column in a function to force a scan. It lets SQL use indexes. Is the word pronounced SARGE-able, sounding like a short name for a Marine Drill Instructor? Or is it S-ARGH-able, with the hard G? Finally, could it be S-AAAAARGH-able, as if the DBA is a pirate?

In this case, the first or the third are equally cool. However, a DBA may only use the third version at SQL Pass Summit, and only when presenting, and only with a parrot perched on his or her shoulder. Use the second version only if your résumé (and you must type the word with the accented “e” in both places) is up to date.

Standardization is important. This standard will be updated as new terms of art arise. Feel free to contribute your own goodies.

Sounds ominous! You are waiting 7.5 seconds every second to get a needed latch?! Why is this happening to my servers? They seem alright, because queries are moving along nicely and the users are happy. However, this bad statistic must mean that there is a massive latent problem that this new default alert in the monitoring tool caught.

Research time!

What is a latch wait?

Well, first, we should define a latch. A latch is a lightweight lock that protects SQL Server’s in-memory structures, such as buffer pool pages and internal data structures. Latches are vitally necessary to keep in-memory data pages consistent.

Latches are serialized. One latch completes work before the next latch is granted. If a process needs a latch and one cannot be obtained, it is suspended until the latch becomes available.

Now, with that description, the latch wait seems even more ominous! Waiting 7.5 milliseconds for each latch of memory? Yipes! Anything more than a microsecond’s wait is an eternity – correct?

Well, let’s use a DMV – sys.dm_os_latch_stats to see what’s going on:
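The query I ran is straightforward (remember that, like the index-usage DMVs, these numbers accumulate only since the last SQL Server restart):

```sql
-- Latch waits since the last restart, heaviest first.
SELECT latch_class,
       waiting_requests_count,
       wait_time_ms,
       max_wait_time_ms
FROM sys.dm_os_latch_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;
```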

I looked at the total wait time and saw that ACCESS_METHODS_DATASET_PARENT is the predominant latch wait on this server. Paul Randal wrote a wonderful post explaining many of these latch waits. This particular wait is created by parallelism. As Paul warns in his post, let’s not go bananas and set MAXDOP to 1.

Instead, the first thing to check is whether hyperthreading was accidentally enabled on this server. In this case, the answer appears to be no, but the sysadmins need to check this (you can’t do it from SQL but there is a VB Script out there that can detect it — if your sysadmins will let you run that script on a production server).

While the sysadmins investigate, I note that the second most prevalent class is BUFFER. This is the classic wait for an in-memory page or structure. It seems to be one I can live with, especially since ACCESS_METHODS_DATASET_PARENT is responsible for over 90% of the wait time and also has the highest max wait times.

What My problem was

In my case, parallelism is the cause; hyperthreading is not the problem. SSIS also runs packages on this server to do ETL work. We have four sockets with two cores each. To resolve the issue, I configured max degree of parallelism to 4, leaving four cores free to handle the SSIS packages and other SQL traffic.
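The configuration change itself, for reference:

```sql
-- max degree of parallelism is an advanced option, so expose it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;
```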

We are seeing an incremental improvement in query performance on this server. What was good is now a little better. The alert hasn’t shown up since we changed the configuration.

Does the new alert help?

Short answer: yes. However, this alert needs to be tuned server-by-server in Spotlight, and may need to be ignored for certain test servers.

I will take this opportunity to strongly encourage every DBA who manages more than five SQL Servers to review all of the available monitoring products out there. I like Spotlight. You might like SQL Sentry or Idera or RedGate. Find the one that works best with your style of DB management and implement it. Be ready to tune the thing for 30-120 days.

We know that we can improve auditability and simplify database security administration by doing the following:

Eliminate SQL Server authentication in favor of integrated security;

Create AD Groups, and set those groups up as logins and database users, with specific permissions granted and/or denied as the business requires; and

Assign AD users to those groups.
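In T-SQL, the group-as-login step looks roughly like this (the domain, group, and database names are examples; on SQL 2008, use sp_addrolemember in place of ALTER ROLE):

```sql
-- Create a server login for the AD group; no password, Windows authenticates.
CREATE LOGIN [DOMAIN\Accounting_RO] FROM WINDOWS;
GO
USE Accounting;
GO
-- Map the group login to a database user and grant read-only access.
CREATE USER [DOMAIN\Accounting_RO] FOR LOGIN [DOMAIN\Accounting_RO];
ALTER ROLE db_datareader ADD MEMBER [DOMAIN\Accounting_RO];
```

Every member of the AD group now reads that database with no further DBA work.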

But why the extra level of AD Groups? Why not just add the AD users as database logins with specific permissions at the more granular user level?

We could certainly do that, but the administrative tasks that come from this approach grow geometrically.

AD Groups reduce SQL Login maintenance.

Sheer Overhead

Let’s assume that you have 100 AD users who can access 4 databases on 2 SQL servers each. That is the size of a very large workgroup or a small enterprise. Without a grouping approach maintained at the AD level, you have to create 200 logins, 400 database users, and assign roles or (worse) individual permissions to each of these users. That is a boatload of maintenance work.

On the other hand, grouping means that a user’s DB access is provisioned with his or her AD profile, and the DBA has no involvement beyond the one-time provisioning of access for the AD group.

DBA resources are freed to work on more pressing issues.

Orphan Prevention

Assume for the sake of argument that a person is terminated and his or her AD account is deleted. That does nothing to delete the SQL Server logins or database users. You now have an orphaned login, and if you drop the login, you have orphaned database users.

While it is true that you can write or download T-SQL scripts that check AD for missing logins, identify orphans, and sweep the databases to drop users and then logins, this is maintenance work that requires attention, and usually automation in the form of SQL jobs that run regularly.
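The orphan-hunting part of such a script is a straightforward SID comparison; a minimal sketch (run in each database, and note that a real script would also exclude loginless users created deliberately):

```sql
-- Database users whose SID no longer matches any server login (orphans).
SELECT dp.name AS orphaned_user,
       dp.type_desc
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
       ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type IN ('S', 'U', 'G')   -- SQL, Windows, and Windows-group users
  AND dp.principal_id > 4;         -- skip dbo, guest, INFORMATION_SCHEMA, sys
```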

On the other hand, if the users are organized into AD Groups, deleting one user’s AD account requires no SQL Server maintenance at all.

Auditability Redux

Assume that Jane Doe is promoted from payroll assistant to sales manager. Jane no longer needs access to the payroll tables, meaning that her roles must be individually maintained by the DBA, separately from and in addition to any maintenance of her profile in Active Directory. Worse, if the roles don’t cover everything and there are individual GRANT statements on tables and other objects that need to be revoked, Jane’s simple change results in significant script work for the DBA.

On the other hand, if you have an AD Group DOMAIN\PayrollWorker with all permissions set there, then the network admin who provisions Jane’s profile need only drop her from the group, and her permissions change instantly.

That kind of control over access is the kind of stuff that gives auditors cause for celebration.

Conclusion

I’ve previously discussed the security benefits of eliminating SQL Server authentication. Combined with Active Directory group- and role-based access control, the security benefit also comes with a savings in DBA time.

…then you need to have an audit-friendly method of securing your SQL assets. In general, a few rules make this easier on DBA and on management.

1. Get rid of SQL Server authentication!

The sooner the better. If SQL Server authentication is off, it is really hard for a third party to log in as sa. And even if you have disabled sa in favor of another sysadmin-powered login with a name like “janitor_readonly,” if that login is hacked then your data is in the open.

On the other hand, in the modern era, there is little to no need for SQL Server authentication at all. With SQL 2008 and newer, the DBA actually has to opt in to using SQL Server authentication. Your database servers should exclusively use integrated security.

Why? First of all, every login is mapped to the AD user that was running the program that connected. Server applications have their own logins, and client connections carry the end user’s login.
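You can confirm which mode a server is in with a one-line query:

```sql
-- 1 = Windows authentication only; 0 = mixed mode (SQL logins allowed).
SELECT SERVERPROPERTY('IsIntegratedSecurityOnly') AS windows_auth_only;
```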

2. Groups and Roles

The first thing you might think is something like, “I have 10,000 people logging into application X! I’ll be spending every waking minute maintaining SQL logins and permissions to the databases they need.”

Nope. Create Active Directory groups. Assign them logins and database users with the minimum permissions that are needed. Examples:

DOMAIN\Accounting_RO – Read only to Accounting

DOMAIN\Accounting_RW – Read/Write to Accounting

DOMAIN\Accounting_Payroll – Read/Write Payroll Data

DOMAIN\SQL Admins – DBA with sysadmin privileges

Usually each database has a limited set of privilege levels. In my shop, we have a limited set of groups, and we use T-SQL to generate DSADD commands to add the group accounts to Active Directory, then pass these on to the sysadmins to be run (no DBA should be a Domain Admin – would you trust any DBA with that much power? 😀 ).
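The generation step is plain string concatenation. A sketch of the idea (our real script reads the group names from a table, and the OU/DC values here are placeholders for your domain):

```sql
-- Build DSADD commands for the sysadmins; OU and DC values are illustrative.
SELECT 'dsadd group "CN=' + GroupName
     + ',OU=SQLGroups,DC=example,DC=com" -secgrp yes -scope g'
FROM (VALUES ('Accounting_RO'),
             ('Accounting_RW'),
             ('Accounting_Payroll')) AS g(GroupName);
```

The sysadmins run the resulting commands, and the groups exist in AD with no DBA ever touching Active Directory directly.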

Each domain user is added to one of the domain groups (or to several; the permissions accumulate unless an explicit DENY is issued).

Use database roles to add a further level of indirection: GRANT permissions to the database role, then add the AD Group’s database user as a member of that role.
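Concretely, the role layer looks something like this (the table and role names are examples; on SQL 2008, use sp_addrolemember instead of ALTER ROLE):

```sql
USE Accounting;
GO
-- The role owns the object permissions; the AD group's user is just a member.
CREATE ROLE PayrollWriter;
GRANT SELECT, INSERT, UPDATE ON dbo.PayrollDetail TO PayrollWriter;
ALTER ROLE PayrollWriter ADD MEMBER [DOMAIN\Accounting_Payroll];
```

When the schema changes, you re-grant to the role once; neither the group nor its members need touching.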

Then, when the desktop/sysadmins provision a user, they add him or her to the groups using ADUC (Active Directory Users and Computers). The database users and permissions are defined once at the outset and need only minor review as DDL changes are made.

User DOMAIN\CFO would be added to DOMAIN\Accounting_RW and DOMAIN\Accounting_Payroll. Then he or she would have the needed permissions at the outset.

3. Auditing Made Easy

If you enable auditing of successful logins to the SQL Server log, you will see the actual AD user that logged in, along with the machine from which the connection was made. This information is also shown in sp_who2 and elsewhere in good monitoring tools.

Auditing tools now have access as well. Now, if someone went in and ran a query like this:
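For example, a query that pulls sensitive data (the table and column names are hypothetical):

```sql
SELECT EmployeeID, LastName, Salary
FROM Payroll.dbo.EmployeeSalaries;
```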

…any good auditing tool would capture the user ID and the workstation name / IP of the culprit. If someone ran the same query with the sa login, you would have far less information about the culprit than would be advisable.

In short, it’s high time that SQL Server Authentication join the relics of Information Technology in the museum, right next to the IBM System/370-158.