SQL Server, Columnstore, Data Platform & Community

Varchar(x) where x <= 2 does not make sense... or does it?

While preparing my “Inheriting a database for Developers” session for SQLSaturday#162 in Cambridge, I have had a very interesting finding on the matter of the Varchar datatype usage. There is a more or less common knowledge of the SQL Server internals, that the Varchar datatype internally adds 2 additional bytes to the content of the column. Those 2 bytes are used to control the length of the text content internally, so SQL Server knows and understands where the text actually ends. From this point of view, even when the content of the column is empty, then it takes 2 bytes of space anyway.

With this knowledge, it was for me extremely easy and logical to believe that creating any columns with Varchar(2) or even Varchar(1) datatypes does not make any sense at all.
I know that I did believe in it until a couple of days ago, until I actually started playing with it. :)

This is a first blog in the series of exploring a little bit deeper this datatype usage.
I have found that the following situations are very interesting to investigate:
– Tables with Varchar(x) NULL;
– Tables with Varchar(x) NOT NULL, having different amounts of data;
– Tables with Varchar(x) NULL, using Sparse columns (SQL Server 2008+);
– Tables with Varchar(x) using different Data Compression methods (SQL Server 2008+);

All of the findings will be compared between CHAR(x) and VARCHAR(x) datatypes on SQL Server Box and Azure SQL Database editions.

Some basic information:
First of all, lets consider tha we should bare in mind that both datatypes Char and Varchar while being similar in purpose of holding a some short text content, are very different in their internal implementation nature. The Char datatype size is always well defined and is placed on the datapage at the location and at the sequence order defined by the table creation, while Varchar, like other variable length datatypes are being moved from their position to the end of the table definition, right after the last fixed size column datatype. This gives a very interesting twist to the final results of this experiment. :)

1. We shall create 2 similar tables using different datatypes (Char and Varchar) compare the respective occupied space. Both tables will have 7 columns, clustered primary key and the differences between them shall be marked in bold.
2. We shall insert 1.000.000 rows into each of the tables, so that we could have some information to play with. The final results will naturally be much bigger on bigger tables or in more extreme internal layouts, with much wider rows.

I won’t spent more time keeping you waiting, and lets dive into SQL Server 2012 and experiment the following code:

These are results of what those queries, with Varchar clearly occupying much less space. The difference is around 4MB which represents in this case around 10% of the total space, which is pretty amazing to say at least. What has just happened here? Why on planet Earth do Varchar is occupying much less space, when it was clearly stated, that it has 2 extra bytes, so it should be at least equal to the table with Char datatypes.

I was lucky enough to remember a great article on the matters of NULLable records when using variable datatypes, written by Kimberly Tripp. I don’t want to spoil all the fun of reading that article, but I shall just refer that the Varchar Nullable columns shall not occupy a lot of space when having a NULL value inserted. This actually means that the records of the second table (VarcharTestTable) are actually smaller. We can confirm this by consulting the internal information of the both tables:

Consulting the results brings some very interesting information: the minimum and the maximum row length of the CharTestTable is 48 bytes, while the rows at the VarcharTestTable occupy just 44 bytes. Also, at the last index level the number of pages (8K) differs between 6212 for CharTestTable and 5862 for VarcharTestTable.

This actually means that in the cases where almost every row of our table is NULL, the Varchar(1) or Varchar(2) datatype actually can be quite useful in the terms of space. In the second part of this post, we shall take this matters a little bit deeper and from more different angles. :)

Update for tests on Windows Azure SQL Database on 27th of August of 2012:

I have successfully tested almost the same script on Windows Azure with a couple of exceptions, which shall be revealed in the next post. I just reveal that after inserting the required registers I have rebuild both clustered indexes with this commands: