How many times have you followed that rule?
How many times have you answered that question in a job interview?

And now ask yourself: Were you right then? Have you ever tested that approach?

My answers to these two questions are: "No" and "Yes".

It so happened that two years ago I was servicing a client with a very busy database, which suffered from high index fragmentation.
I tried to use the approach from MSDN to address fragmentation with Rebuild or Reorganize.
Very quickly I noticed that "Reorganize" ran too long in their environment, while Index Rebuild worked much faster even for tables with fragmentation as low as 10%-15%.

At that time I had neither the time nor the desire to research that case deeply, but the experience seriously shook my belief in MSDN's re-indexing rule.

Recently I hit the same issue again and finally decided to do some research.

First, I created a test table with a clustered index and a fill factor of 70% in my test database:
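The original setup script is not shown here; a minimal sketch of such a table might look like the following. The table name, index name, and filler-column size are my assumptions (the size is chosen so that roughly 100 rows fit per page, matching the 700,000 rows on 7K pages described below), not the original script:

```sql
-- Hypothetical reconstruction of the test table.
CREATE TABLE tbl_Test_Index_REBUILD (
    ID INT      NOT NULL,
    A  CHAR(60) NOT NULL    -- filler column to control rows-per-page
);

-- Non-unique clustered index with the 70% fill factor from the article:
CREATE CLUSTERED INDEX IX_Test_Index_REBUILD
    ON tbl_Test_Index_REBUILD (ID)
    WITH (FILLFACTOR = 70);
```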

You can see that the table has 709,000 rows on 10,300 pages and only 5.835% fragmentation.

Let's look deeper at what I just did:
First, I inserted 700,000 records into the table. That created 7K pages, filled them to 100% of their capacity, and it looked like this:

The next step was an Index Rebuild to enforce the 70% fill factor; the number of pages increased to 10K, each filled to 70% of its capacity. It looked like this:

In the next step I inserted 30 records into the first 300 pages. That action caused page splits on those 300 pages, which produced 600 fragments. Now, if you divide 600 by the total number of pages, 10,300, you almost get the average fragmentation I got: 5.835% (the actual number is a little different because there are more variables in the calculation). As a result, the pages looked like this:

(the bluish ones are the 600 fragmented pages, each filled only to 50%)

The last query just produced the fragmentation report.
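A fragmentation report of this kind is typically produced with `sys.dm_db_index_physical_stats`; a sketch of such a query (object names are assumptions carried over from the setup):

```sql
SELECT index_type_desc,
       page_count,
       record_count,
       avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.tbl_Test_Index_REBUILD'),
         NULL, NULL, 'DETAILED')
WHERE index_level = 0;   -- leaf level of the clustered index
```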

Now we are ready to start our defragmentation procedure.
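For reference, the two commands under comparison look like this (index name assumed; each variant was run against a freshly prepared copy of the table):

```sql
-- Variant 1: reorganize (compacts leaf pages in place, in small transactions)
ALTER INDEX IX_Test_Index_REBUILD ON dbo.tbl_Test_Index_REBUILD
    REORGANIZE;
GO

-- Variant 2: rebuild (creates a brand new copy of the index)
ALTER INDEX IX_Test_Index_REBUILD ON dbo.tbl_Test_Index_REBUILD
    REBUILD WITH (FILLFACTOR = 70, MAXDOP = 1);
GO
```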
Before altering the index I started SQL Server Profiler to capture the results:

Amazingly, Index Rebuild used 8 times fewer reads than Reorganize, 3 times fewer writes, and finished 4 times faster!

Did you expect that? I did not.

I did not believe my eyes, so I started hunting for fragmentation values at which Reorganize would outperform Index Rebuild.

Case #2

I started this case by filling 10K pages to 70% of capacity. Then I topped all pages up to almost 100% of their capacity, so they looked like this:

Then I added a couple of records to the first 300 pages to split them and produce fragmentation like this:

As a result I got the following:

960,600 records in 10,300 pages with the same fragmentation of 5.835%.

I used the following script to produce that picture:

INSERT INTO tbl_Test_Index_REBUILD (ID, A)
SELECT Fragment + 1, 'A'
FROM (SELECT TOP 26 0 AS MsgId FROM sys.messages) AS Msg,
     (SELECT TOP 10000 (ROW_NUMBER() OVER (ORDER BY message_id) - 1) * 70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO
INSERT INTO tbl_Test_Index_REBUILD (ID, A)
SELECT Fragment + 1, 'A'
FROM (SELECT TOP 2 0 AS MsgId FROM sys.messages) AS Msg,
     (SELECT TOP 300 (ROW_NUMBER() OVER (ORDER BY message_id) - 1) * 70 AS Fragment FROM sys.messages) AS Fragments
OPTION (MAXDOP 1);
GO

Please note: the fragmentation is the same as in the first case, but the average used space on each page is completely different. That is what makes the real difference, not the fragmentation.
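Both numbers come from the same DMV, so page fullness can be checked alongside fragmentation; a sketch (names assumed; 'DETAILED' or 'SAMPLED' mode is required for the page-space column):

```sql
SELECT avg_fragmentation_in_percent,    -- same 5.835% in Case #1 and Case #2
       avg_page_space_used_in_percent   -- very different between the two cases
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.tbl_Test_Index_REBUILD'),
         NULL, NULL, 'DETAILED')
WHERE index_level = 0;
```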

When I ran Reorganize and Rebuild again, I got the following results in Profiler:

Index Rebuild used 6.5 times fewer reads, ~30% fewer writes, and also finished about 30% faster.
Not as impressive as in the first case, but "Rebuild" still outperformed "Reorganize".

Case #3

In this case I simply increased fragmentation by splitting 1,000 pages:

As a result: 730K records on 11K pages with 18.127% fragmentation.

After Reorganize and Rebuild I got this:

I'm not sure the timing was captured correctly, but you can still see the significant difference in the number of reads. "Rebuild" wins again.

Case #4 (extreme)

In this case I increased the capacity of all pages up to 98%, as in Case #2, and then split 1,700 pages:

As a result I got 963,400 records in 11.7K pages with 29% fragmentation (almost the 30% limit from MSDN's guideline).

And here is what I got:

Rebuild is faster and uses fewer reads and writes. No surprise.

Do I have to test any more?
Why not?

Case #5

In this case I wanted to see how "Rebuild" and "Reorganize" impact other transactions on a busy server.

For this test I inserted only 70K records into my table and split only 30 pages (10 times less than in the first case).
Before running the ALTER INDEX command I started the following script in another window to emulate heavy activity:

SET NOCOUNT ON;

WHILE 1 > 0
BEGIN
    ;WITH a AS (
        SELECT TOP 1 * FROM tbl_Test_Index_REBUILD ORDER BY NEWID()
    )
    UPDATE a SET ID = ABS(CAST(NEWID() AS VARBINARY) % 1000000)
    OPTION (MAXDOP 1);
END

Surprisingly, when I ran "INDEX REORGANIZE", I almost immediately got an error:

Msg 1205, Level 13, State 52, Line 68
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Here is the Deadlock Graph captured in Profiler:

I tried some other scenarios and discovered that even though "Rebuild" causes blocking, it does not produce deadlocks. "Reorganize" produced deadlocks only with a very small number of pages (about 1K).
When I tried bigger tables I couldn't reproduce the deadlock scenario, but I do not think it is impossible; I just have not tried hard enough.

In conclusion:
"INDEX REORGANIZE" is generally slower than "INDEX REBUILD". It uses many more I/O operations and can produce deadlocks.

Do you still want to use "REORGANIZE" option in your code?

That is a pretty tough question.

If you look at the internal difference in how "REBUILD" and "REORGANIZE" work, you'll see the following:
- "INDEX REBUILD" creates a brand new copy of the index in the database. Say your table is 2 TB in size. That means you need another 2 TB in your database to finish "INDEX REBUILD" successfully. In addition, you need 2 TB in your log file, which means 4 TB of extra space. Do you have it?
- "INDEX REORGANIZE" works much slower, but in small transactions. You can run it periodically during your lowest DB-activity window and simply kill the process automatically after an hour or two. It is not a problem if the process does not finish completely: partial results are also beneficial. If the process becomes a deadlock victim, that is also not a problem, other than getting a deadlock alert in the middle of the night.

As a result of my research I will always be using the "INDEX REBUILD" option from now on, unless I am in a situation where "REBUILD" does not work. Then "REORGANIZE" will take its place.
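That decision could be sketched as follows. This is a simplified illustration of the conclusion above, with assumed object names; a real maintenance job would loop over all indexes:

```sql
-- Prefer REBUILD; fall back to REORGANIZE only when REBUILD fails
-- (for example, when there is not enough free space in the data or log file).
BEGIN TRY
    ALTER INDEX IX_Test_Index_REBUILD ON dbo.tbl_Test_Index_REBUILD
        REBUILD;
END TRY
BEGIN CATCH
    ALTER INDEX IX_Test_Index_REBUILD ON dbo.tbl_Test_Index_REBUILD
        REORGANIZE;
END CATCH
```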

Now we have the following aggregation request:
report all sales aggregated by year, with each year's percentage of the whole data set.

Not a big deal. Here is how it would look in the classic scenario:

;WITH AnnualSales AS (
    SELECT YEAR(Period_Dt) AS YearPeriod,
           SUM(Sales_Amount) AS AnnualSale
    FROM tbl_No_Croup_By
    GROUP BY YEAR(Period_Dt)
), TotalSales AS (
    SELECT SUM(AnnualSale) AS TotalSale FROM AnnualSales
)
SELECT a.YearPeriod,
       a.AnnualSale,
       t.TotalSale,
       CAST(ROUND(CAST(a.AnnualSale * 100. / t.TotalSale AS SMALLMONEY), 2) AS VARCHAR) + '%' AS Year_Total
FROM AnnualSales AS a
INNER JOIN TotalSales AS t ON 1 = 1
ORDER BY a.YearPeriod;

Here is an approximate result:

And now we will do the magic:

SELECT DISTINCT YEAR(Period_Dt) AS YearPeriod,
       SUM(Sales_Amount) OVER (PARTITION BY YEAR(Period_Dt)) AS AnnualSale,
       SUM(Sales_Amount) OVER (PARTITION BY 1) AS TotalSale,
       CAST(ROUND(CAST(SUM(Sales_Amount) OVER (PARTITION BY YEAR(Period_Dt)) * 100. / SUM(Sales_Amount) OVER (PARTITION BY 1) AS SMALLMONEY), 2) AS VARCHAR) + '%' AS Year_Total
FROM tbl_No_Croup_By
ORDER BY YEAR(Period_Dt);
GO

The result is exactly the same, but the code does not contain a single "GROUP BY" and is not as obvious.

Is it good?
Turn on statistics in your session by running the following two commands:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
Then run both scripts together.
If you look at the Messages tab after execution, you will see the following picture:

Wow!
"GROUP BY" uses 5 times less CPU power and almost 1,000 times less I/O!

In conclusion, you CAN use the "OVER" clause to do aggregation in the three following cases:
1. When the data set is extremely small and fits in just one 8 KB page;
2. When you want to hide your logic from any future developer, or even yourself, to make debugging and troubleshooting a nightmare;
3. When you really want to kill your SQL Server and its underlying disk system.

Wednesday, January 27, 2016

This is a very easy and well-known topic, but I still have to cover it, because there are still a lot of people who do not really know how counting works.

Just recently I met a person who argued with me about the way to count records in a table.
He was irritated by my using "COUNT(*)" and suggested using "COUNT(1)" instead.

I tried to explain that the result would be the same either way, at the same cost.
Unfortunately, there was no time for an explanation, so I want to lay it out in this post for everybody.

First, we can run the following script against the AdventureWorks database (which you can download here):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*) FROM Person.Person;
SELECT COUNT(1) FROM Person.Person;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

If you run that script multiple times and look in the "Messages" tab, you'll notice that both statements execute equally fast and require exactly the same amount of I/O to count the records.

The notion that "COUNT(*)" is bad starts from the fact that in most cases "SELECT *" is bad, and you should avoid the asterisk when you can.
But then, what is the difference between "COUNT(*)" and "COUNT(1)"?

"*" means we "want to count all records in the table".
"1" means we "want to count all records in the table, with each record substituted by the number 1".

You might be surprised, but all of the following statements produce equal results and spend the same amount of resources:
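For instance, any constant inside COUNT() behaves identically; a sketch against AdventureWorks (the row count of 19,972 matches the figures shown later in this post):

```sql
SELECT COUNT(*)   FROM Person.Person;   -- 19972
SELECT COUNT(1)   FROM Person.Person;   -- 19972
SELECT COUNT(100) FROM Person.Person;   -- 19972
SELECT COUNT('Any constant, even a crazy long sentence, works the same way')
FROM Person.Person;                     -- 19972
```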

If, before execution, you press "Ctrl-M" in SQL Server Management Studio to include the actual execution plan, you can see the plans for all of these queries:
All the query plans are the same, and each costs exactly the same 20%.

Look at the most expensive part of all these queries: the Index Scan, which consumes 86% of the query cost.
It produces 19,972 output rows, and all of these rows have the same size of 9 bytes, even in the case of a crazy long sentence:

So, what difference does it make what we put inside the COUNT() function?

The difference starts when we count by a specific field. For example, by first or middle name:

SET STATISTICS IO ON;
SELECT COUNT(FirstName) FROM Person.Person;
SELECT COUNT(MiddleName) FROM Person.Person;
SET STATISTICS IO OFF;

The result is a little surprising:

Why is counting on the same table different?
Because MiddleName has NULL values, which are ignored by the COUNT() function.

If we rewrite the second query like this:

SELECT COUNT(IsNull(MiddleName,'')) FROM Person.Person;

We will get the full count of 19,972 records.

That means counting by column name should be used when you want to count values ONLY in that specific field and do not care about the overall number of records.

You can also use the "DISTINCT" keyword to count only unique values:

SELECT COUNT(DISTINCT FirstName) FROM Person.Person;

That query returns a count of only 1,018 distinct first names.

"DISTINCT" is a pretty expensive operation. Use it only when absolutely necessary.

At this point we have established that "COUNT(*)" and "COUNT(1)" are equivalent and produce the same result. But do we have a faster way to count rows?
Of course we do!
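The faster way reads the row count from metadata instead of scanning the index; a sketch of the comparison (the second statement touches only a handful of metadata pages):

```sql
SET STATISTICS IO ON;

-- Exact count: scans the whole index
SELECT COUNT(*) FROM Person.Person;

-- Approximate count from partition metadata: only a few reads
SELECT SUM(p.row_count) AS ApproxRowCount
FROM sys.dm_db_partition_stats AS p
WHERE p.object_id = OBJECT_ID('Person.Person')
  AND p.index_id IN (0, 1);   -- heap (0) or clustered index (1)

SET STATISTICS IO OFF;
```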

You can see that the second query takes only 12 read operations to get the row count.
For a small table the difference is not significant, but for tables with billions of rows you may prefer that method over anything else.

WARNING!
Use "sys.dm_db_partition_stats" only when you do not need the exact record count.
MSDN says it provides "the approximate number of rows in the partition".
That means for small tables you can safely use "COUNT(*)", and for large ones you can get an approximation very quickly.