Saturday, September 24, 2016

This is the first in a series of three SQL-Server best practice tuning guides to help you achieve optimal performance, speed, and efficiency from your Very Large SQL-Server Databases. Although these techniques are written for Microsoft SQL Server, they will generally also apply to other relational database management systems (RDBMS) like MySQL, Oracle, DB2, Postgres and others.

The series begins with a focus on optimizing physical data-storage requirements (i.e., space reduction), since it is perhaps the most important overall tuning consideration for large databases.

Tuning Very Large SQL-Server Databases — Part 1

SQL-Server Performance Optimization and SQL-Server Speed Optimization should always be a primary consideration during
the database design phase, regardless of the size of the relational database — with the exception perhaps of
"tiny" or "trivial" databases where future growth is guaranteed not to be an issue. And,
very large databases (in excess of a Gigabyte of data, or millions of rows of data in any
particular table) definitely require the SQL-Server database designer / administrator to consider
the ramifications of proper (or improper) database design. Perhaps this seems obvious, but given my many years of
experience and encounters with large corporate databases, what should be obvious is anything but obvious to many
developers, designers, and administrators who implement these massive databases.

Most of this discussion applies to all modern Microsoft SQL-Server versions (SQL 2008, 2012, 2014, 2016) as well as versions as early as SQL-Server 7.0, 2000, 2005. The newest versions of SQL-Server, especially the full-featured Enterprise editions, may also offer you additional features of relevance like built-in data-compression. But, even fancy features like data-compression are no substitute for proper design.

Reduce Physical Storage Requirements

The first thing I always make sure of when designing databases, and when tuning large SQL-Server databases for
performance, is that physical storage space is not being wasted on data-types that are overly precise or
excessively large for their intended function, and that data is properly normalized. If you don't understand the
concepts of normalization, please study up on them before reading further (or continue pretending you can do your
job as a SQL-Server developer or DBA).

Keep in mind that this discussion is all to be considered in the context of very large databases,
where there will be millions of rows of data in one or more tables. In fact, you may be working with tens of
millions or hundreds of millions of rows of data. As such, every byte of storage saved on a per-row basis will potentially result in Megabytes saved (in reduced physical storage), or perhaps even gigabytes saved in the overall table and
database size.

I will now present some very simple rules for reducing physical storage requirements and some
(storage space) savings-opportunity examples to consider:

Do not use NVarchar or NChar (vs. Varchar and Char
respectively) unless you need Unicode character support. Using NVarchar (or NChar) where a standard
Varchar/Char will suffice doubles the storage requirements for a column, since Unicode
characters map to double-byte pairs (to accommodate all those foreign-language characters and extended symbols,
etc.).

Savings Opportunity: I have encountered NVarchar use on columns holding an alphanumeric "TransactionID"
on millions of rows of data, where that alphanumeric value most certainly contains only the 10 digits
and the standard 26 letters of the English alphabet. Instant space savings!
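As a sketch of that fix (table and column names here are assumptions, not from any real schema), the conversion is a single column alteration:

```sql
-- Hypothetical example: an alphanumeric TransactionID that only ever holds
-- digits and unaccented English letters has no need for Unicode storage.

-- Before (2 bytes per character):
--   TransactionID NVARCHAR(20) NOT NULL
-- After (1 byte per character):
ALTER TABLE dbo.Transactions
    ALTER COLUMN TransactionID VARCHAR(20) NOT NULL;
```

On millions of rows, a fully-used 20-character column drops from 40 bytes to 20 bytes per row for the data alone.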

Use SmallDateTime instead of DateTime, unless you really need the precision. SmallDateTime
fields are accurate to the nearest minute, whereas full DateTime types are accurate to 3.33-millisecond
intervals. And, that extra precision comes with the price tag of double the storage:
DateTime requires 8 bytes versus 4 bytes for SmallDateTime.

Savings Opportunities: if you see fields like "AccountingPeriodBeginDate", it should be obvious that
an accounting period is going to start on a particular day boundary, and SmallDateTime will
more than suffice. I also regularly see fields like "LastChangeDate" on tables in large databases
where, upon inquiring, nobody can provide any reason why any accuracy beyond a particular
minute is needed, but yet the fields are defined as full DateTime fields.
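A sketch of that change (the table and column names are hypothetical):

```sql
-- Hypothetical audit column where minute-level accuracy is plenty:
ALTER TABLE dbo.GeneralLedger
    ALTER COLUMN LastChangeDate SMALLDATETIME NULL;  -- 4 bytes/row instead of 8
```

One caution: SmallDateTime only supports dates from January 1, 1900 through June 6, 2079, so verify your data (and your application's lifetime) fits that range before converting.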

Use the smallest Integer data-types that will safely and comfortably handle all
possible values in an integer-data column. SQL-Server provides a few datatypes for this, including
bigint, int, smallint, and tinyint. Each has its own value limitations and storage-space requirements. TinyInt
takes just one byte, but can only store values ranging from 0-255 (i.e., 2^8 values), whereas SmallInt is two
bytes in length and can store 2^16 values, Int is 4 bytes and holds 2^32 values, and BigInt is 8 bytes and
holds 2^64 values (see the pattern?).

Savings Opportunity: look for columns that have an obviously small number of values, but whose
column definitions are grossly oversized. E.g., if you see a field like (accounting) "Quarter" defined
as an Int, TinyInt can easily hold the four possible values (Quarter 1, 2, 3, or 4) and saves 3 bytes/row. Or,
perhaps you'll see UserID defined as an Int where there's no way in the next 100 years the application will have
over 10,000 or 20,000 users; this sure looks like a good situation to downsize to SmallInt and save half the
space!
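Those two examples could be sketched like this (all table and column names are assumptions for illustration):

```sql
-- Always check existing data before narrowing a column, e.g.:
-- SELECT MAX(UserID) FROM dbo.LedgerDetail;

ALTER TABLE dbo.LedgerDetail
    ALTER COLUMN AccountingQuarter TINYINT NOT NULL;  -- holds 1-4; saves 3 bytes/row vs. Int

ALTER TABLE dbo.LedgerDetail
    ALTER COLUMN UserID SMALLINT NOT NULL;            -- up to 32,767; saves 2 bytes/row vs. Int
```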

As with the Integer types, the money data-types come in large and small sizes, Money and
SmallMoney respectively; the first takes 8 bytes to store, and the latter takes only 4 bytes.
SmallMoney values are limited to roughly +/-214,000.0000, compared to hundreds of trillions for Money.

Savings Opportunity: if you have per-unit Costs, Charges, or the like stored in your database
columns where the price is never going to even begin to approach that $214,000 value, use SmallMoney and cut
your storage in half. E.g., if you are storing the price of items sold in your convenience-store locations, or
even the price-per-gallon of gasoline you sell, this limitation should suffice for the foreseeable future.
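For example (again with hypothetical names):

```sql
-- Hypothetical per-unit price column that will never approach +/- $214,000:
ALTER TABLE dbo.StoreItems
    ALTER COLUMN UnitPrice SMALLMONEY NOT NULL;  -- 4 bytes/row instead of 8
```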

Decimal and Numeric data-types provide a bit more complexity when analyzing the proper "best"
size for the fields. See the SQL-Server Books Online to size appropriately, since the maximum total number of
decimal digits that can be stored, both to the left and to the right of the decimal point, will determine the
storage-bytes required. You want to size your columns in a way that they will always be large enough to hold the
values you will place in them, without over-sizing them.
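Per the Books Online, Decimal/Numeric storage is driven by the total precision (digits): precision 1-9 takes 5 bytes, 10-19 takes 9 bytes, 20-28 takes 13 bytes, and 29-38 takes 17 bytes. A sketch, with a hypothetical percentage column:

```sql
-- A percentage never needs 19 digits; DECIMAL(5,2) covers 0.00-999.99
-- and stores in 5 bytes, versus 9 bytes for a DECIMAL(19,4):
ALTER TABLE dbo.Allocations
    ALTER COLUMN AllocationPct DECIMAL(5,2) NOT NULL;
```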

Variable-length fields where fixed-length should be used. Varchar fields
require 2 extra bytes (a length prefix) on top of the characters actually stored.
Varchars are great when there really is heavy variation in the actual length of the
data being stored. But, they make no sense if you'll always be filling all bytes in the field (or even all
bytes minus two).

Savings Opportunity: perhaps one of the most ridiculous things you will encounter is a one or two character
field that is defined as Varchar(1) or (2). This is one sure way to make a one-character field take 3 bytes of
storage per row. I have seen this repeatedly with things like "Flag" fields or "Code" fields
people design into table structures. I personally think the RDBMS should not even allow a user to define
variable-width columns of just one or two characters.
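The fix is trivial (hypothetical names again):

```sql
-- A one-character status flag belongs in CHAR(1) (1 byte), not VARCHAR(1)
-- (3 bytes: 1 data byte plus the 2-byte length prefix):
ALTER TABLE dbo.Orders
    ALTER COLUMN StatusFlag CHAR(1) NOT NULL;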

Note: SQL-Server has one oddity to watch out for to ensure that Char fields are truly treated
as fixed-length fields. As stated in the Books Online: "If SET ANSI_PADDING is OFF when either
CREATE TABLE or ALTER TABLE is executed, a char column that is defined as NULL is handled as Varchar."
Just so you are aware (I find this wacky to say the least; if I define something as Char, I really mean
Char and not Varchar).
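To guard against that oddity, set the option explicitly before creating the table (the table here is a made-up example):

```sql
-- Ensure ANSI_PADDING is ON so a nullable Char column stays truly fixed-length:
SET ANSI_PADDING ON;
GO

CREATE TABLE dbo.Example
(
    StatusFlag CHAR(1) NULL  -- remains CHAR(1), not silently handled as Varchar
);
```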

Look for obvious cases where proper data-normalization is being ignored.
Quite often, you'll encounter massive data-repetition in tables with millions or tens/hundreds of millions of
rows. The key to speed is to make that repetitive data take up as little space as possible (without employing
some data-compression algorithm that is going to kill performance on save/retrieves - in case you were thinking
about such a thing).

Savings Opportunity: start by looking for plainly blatant denormalization. I have seen so many
massive tables where something like "LastUpdateUser" is a 10- or 20-character (if not Unicode-character)
field in databases where those last-update "Users" are a finite group of individuals that can be
counted on both hands. Talk about a waste! If you encounter this type of situation, it's time
to create an "UpdateUsers" table (or whatever) with an Integer primary key (in this example, a TinyInt)
and a field called "UserName" or the like, and replace the "LastUpdateUser" on the
massive-rows table with a "LastUpdateUserID" TinyInt column (a foreign key to the new
UpdateUsers table). This will save a ton of space! And, there is nearly zero cost in joining to such a small
table later if you really need to get the name associated with the ID.
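The refactor just described could be sketched as follows; every object and column name here is an assumption, and it presumes no more than 255 distinct users (so TinyInt keys suffice):

```sql
-- 1. Create the lookup table:
CREATE TABLE dbo.UpdateUsers
(
    UpdateUserID TINYINT     NOT NULL PRIMARY KEY,
    UserName     VARCHAR(30) NOT NULL
);

-- 2. Populate it from the distinct values already in the massive table:
INSERT INTO dbo.UpdateUsers (UpdateUserID, UserName)
SELECT CAST(ROW_NUMBER() OVER (ORDER BY u.LastUpdateUser) AS TINYINT), u.LastUpdateUser
FROM (SELECT DISTINCT LastUpdateUser FROM dbo.MassiveTable) AS u;

-- 3. Add the small FK column and back-fill it:
ALTER TABLE dbo.MassiveTable ADD LastUpdateUserID TINYINT NULL;

UPDATE m
SET m.LastUpdateUserID = u.UpdateUserID
FROM dbo.MassiveTable AS m
JOIN dbo.UpdateUsers  AS u ON u.UserName = m.LastUpdateUser;

-- 4. Drop the wide column and enforce the relationship:
ALTER TABLE dbo.MassiveTable DROP COLUMN LastUpdateUser;
ALTER TABLE dbo.MassiveTable
    ADD CONSTRAINT FK_MassiveTable_UpdateUsers
    FOREIGN KEY (LastUpdateUserID) REFERENCES dbo.UpdateUsers (UpdateUserID);
```

A 20-character Unicode column (40 bytes) replaced by a 1-byte key saves 39 bytes on every one of those millions of rows.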

Another Savings Opportunity: there is another technique I have employed to tune very large
tables (in excess of 50 million rows/table) that may get a bit tougher to initially grasp. But,
this technique can be a huge space and processing-time saver. It involves a bit of
data-analysis, and thinking "out of the box". Basically, I will look for data-patterns that repeat,
and whose unique (pattern) values are finite though not instantly apparent — especially when the patterns
span multiple columns on the same table. I've routinely discovered that the unique values from a combination of a
few fields may number only in the thousands, even though the total table rows exceed 50 or 100 million. When I
find these, and establish the reason for the pattern (and ensure my logic will accommodate
future conditions too), I will pull the unique field-combinations into a "lookup"
table and create a small FK (SmallInt or Int quite often) to point to those combinations (and replace the
column(s) whose data I have moved to a lookup table). Why? I'll get into this in the next section about why to
do any of this space-tuning, but the bottom line is speed and efficiency. If you have a particular situation you
need tuning help with, contact me.
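The data-analysis step can start with a simple query; the column names below are placeholders for whatever combination you suspect repeats:

```sql
-- How many unique combinations exist across the suspected columns?
-- If this returns thousands against 100 million rows, a lookup table
-- with a SmallInt or Int foreign key is a strong candidate.
SELECT COUNT(*) AS UniqueCombos
FROM (SELECT DISTINCT RegionCode, RateClass, TermsCode
      FROM dbo.MassiveTable) AS combos;
```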

If you begin with the above space-savings guidelines, chances are you'll find a way to make some significant
storage-space reductions in your SQL-Server databases. And, you will have made the first steps towards performing
some serious speed-optimizations in your database as well.

One note of caution when making any database changes: be sure you understand the ramifications
beyond the scope of the database. Many of the changes I've outlined will be more or less transparent to other
applications, depending on what (languages) the applications are developed in, and how the applications access the
data (note: I always recommend that applications do not have direct access to the data-tables, but
rather access data through stored procedures — for many reasons, one of which is to allow simplified
maintenance to the underlying database tables without impacting other applications).

If you don't know what side-effects your changes could cause for Graphical User Interfaces (GUI applications),
reporting applications, and so on, then be sure to involve someone that does.
E.g., When changing Int fields to SmallInt, a Borland (aka CodeGear) Delphi based program most likely
won't care, but C# data-bindings may have issues on an ASP.NET page or such. So, be careful.
From what I have seen, both Delphi and C# seem fine with SmallDateTime in place of full DateTime fields in many cases,
but it can also depend on how customized an application is.

Computing / Estimating Storage Requirements

To realize how much potential for storage-space savings there is, you will want to be able to quickly rough-out some
approximations based on the columns in table(s), the number of bytes used per-column, and the number of rows per
table.

In theory, you should be able to quickly estimate a table's (physical storage) size just by knowing the number of
rows, and multiplying that by the total of all columns' storage requirements. This can be put into a simple formula
something like this (I didn't put ALL data types in here, but you should get the idea), where you just multiply the
number of rows in a table by the sum of the per-column byte counts discussed above:

TinyInt: 1 byte; SmallInt: 2 bytes; Int: 4 bytes; BigInt: 8 bytes
SmallDateTime: 4 bytes; DateTime: 8 bytes
SmallMoney: 4 bytes; Money: 8 bytes
Char(n): n bytes; NChar(n): 2 x n bytes
Varchar: characters stored + 2 bytes; NVarchar: (2 x characters stored) + 2 bytes

This calculation does not take into consideration Keys or Indexes, but it will give you the approximate data-space
requirements. And, it's a good way to quickly estimate reductions in storage space too. For the
data-types I didn't cover, see the Microsoft SQL-Server Books Online, which describe how much storage each
data-type requires; see the "Data Types (Transact-SQL)" section.
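Putting the formula to work might look like this; sp_spaceused is a standard system procedure, while the table name and 22-byte row layout are hypothetical:

```sql
-- Actual current size, for comparison:
EXEC sp_spaceused N'dbo.MassiveTable';

-- Rough manual estimate: rows x bytes-per-row. E.g., a hypothetical
-- 50-million-row table of Int (4) + SmallDateTime (4) + SmallMoney (4)
-- + Char(10) (10) = 22 bytes per row:
SELECT 50000000 * 22 / 1048576.0 AS ApproxDataSizeMB;  -- about 1,049 MB
```

Run the same arithmetic with the downsized column types and the difference is your estimated savings.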
