When migrating a SQL Server workload we need to think about a few things

A baseline of what resources our workload needs, at normal and peak operation, often we may just capture performance counters for CPU, Disk, maybe some SQL Server counters such as Batch Requests per second and ideally a business counter such as OPM (Orders Per Minute).

The cloud presents some challenges. If we are migrating to the Infrastructure as a Service,its really easy to verify we have the correct resources by simply running tools like SQLIO or running the workload and checking the performance counters for throughput and latency. If we are migrating a database to “Windows Azure SQL Database” or SQL Azure as it is still commonly called, we are a bit more restricted. WASD doesn’t allow us to use perfmon or SQLIO, so the best benchmark we can use is something measured from the application like OPM.

So what does SQLIO tell me about running SQL in the cloud ?

As a test we created a four core VM in the West Europe data centre and then run a series of SQLIO tests to measure the “saturation” point for disk IO (Thanks Fintan and Stephen!). The resulting chart is below.

This shows that the disk we tested scaled up to about 30MB/Second at just over 30ms latency. The highlighted section on the chart shows the “saturation point” where increasing the outstanding IO gives no more throughout. This is essentially the limit.

30MB/Sec equates to almost exactly 1,000 IOPS. Quite a nice round number! so much so that I think its safe to say that this is probably a limit imposed by throttling rather than the actual limitation of the hardware.

One key challenge for SQL infrastructure running in the cloud is the log drive performance. Traditionally we like to keep the latency at say 3ms, but assumabley due the replication and the fact the VHD is mounted on top of a blob store we have very high write latency. At low yields, say only a few MB/Sec, it maintains 10-20ms average, but then quickly increases to 100ms plus average!

I ran a sample ecommerce OLTP supplied by dell on a four core VM running in IaaS. We can see that the number one wait statistics was indeed WRITELOG, so a primary bottleneck is the disk subsystem. You can see the delays from the DMV “exec requests”.

So what does this mean again ?

This gives us a few facts we can use when doing capacity planning for workloads moving to the cloud:

Azure IaaS (4 core) offering currently has a max ceiling of 1,000 IOPS for a disk. How much you get in terms of MB/Sec will depend on the block size: 30MB/Sec with an 8K block size and 200MB/Sec with 64k block size. You won’t get beyond 200MB/Sec or 1,000 IOPS due to throttling, but maybe this will change in the future.

While the max IOPS is 1,000 I was just in a session with the SQLCAT team at SqlBits and they recommend to consider 500 IOPS per disk for capacity planning..

Different VM Sized allow you to add more disks, and therefore get more IOPS. It seems that you probably want to consider using multiple disks as a matter of course in Azure IaaS as the charge is not based on the number of disks

Write latency will start to affect service quality moreso than it will with on premise disk subsystems which are not replicated. You need to make sure that this is either not important or design applications to work more asynchronously, so this does not translate into poor user quality with log writes (as our sample OLTP suffered from)

Thoughts for some more testing

I note that there are two options for disk configuration in Azure Iaas: host caching and whether geo-replication is enabled. I would love to test the effect of both of these on IO throughput and latency. This could for example affect placement of TempDB. We ran the tests with geo-replication disabled.

I’m just going to see if we can get more throughput or better latency by utilising more than one logical disks. Will hopefully post the results ;-)

If you want to repeat these tests you can sign up for Azure IaaS here.

As a small company, this can be a key question, as tendering can be a gamble, and when up against the big 3 or 4, you are tendering against a “brand impression” that is created by the bigger company as a whole, not necessarily the local team they may be using, which can drastically reduce your chances of being successful.

We used to respond to a lot of tenders (At significant cost) and get no-where, before we started asking ourselves the following question before responding:

“Would I take out a personal loan of €xxx,xxx to respond to this tender, or would that be a waste of money?”.

The things that qualify when it is worthwhile to go forward can be:

Inside connections in company – will tell you what is influencing any decision, or if there is already a preferred company – there nearly always is.

Connections to people who have tendered to company before and won. (they can give you the how/why)

Ability of company to support the solution you would propose – ie. Their in house skills.

Track record of company type/size that previous contracts have been awarded to.

This does not mean that you can never tender – when the big boys come in way too expensive, you may be approached and requested to tender. When this happens you have to find out why you are being asked:

Is it because other tenders were too expensive? (Most promising scenario – get in, discuss, and build up a trust relationship – smaller companies do not have the same overheads as larger ones, so lots of opportunity here)

Is it because they didn’t get 3 quotes, and all public bodies need to have 3 quotes before they can award (if this is the case, run away)

Is it because of the track record of those that tendered?

Is it due to past project problems with companies that did tender?(important to identify what went wrong in the past – the company or individual might be un-workable with, or they may just be tired of being ripped off, or something else)

Is it because no-one tendered (be cautious here – they still need 3 quotes)

Where the tender was awarded and the project failed – they now need to turn it into a success, or become answerable for the failure. This is never a proper tender, as they already had the 3, and a massive budget will already have been wasted. (Discover why the project failed, or don’t proceed)

Note: With any tender, too little information is nearly always provided at the RFP stage, and scope/costs will creep if you turn a blind eye to even one ambiguous line in the RFP.

If giving a fixed price quote (which big companies don’t do), make sure you pin down every aspect of what you are proposing and clearly flag what is OUT of scope, ensuring there is no ambiguity. Scope creep is not an option.

Where information is not provided, quote for a full analysis/design stage, and say that a fixed price quote for delivery will be given at the end of design stage. There is no point winning a tender process which results in a contract that would LOSE you money.

Find out what SDLC processes are in place, and itemise the SDLC documents you will be responsible for producing. This is a very expensive process, as they go through many versions till approved for signoff.

Never forget the testing overhead in an enterprise project. Between assembly, unit, peer-to-peer, integration, system, performance, stress, soak testing etc, you are talking between 30% and 50% of the overall costs.

Find out how many people/teams you will need to interact with within the company to carry out the project – eg. Architecture, Hardware, networking, security, integration, support, Requisitions (for software and hardware), Test teams, Project Management, Business areas etc.

The number of people involved push the costs higher – you will put a LOT of time into conference calls / meetings / politics / getting agreement between teams with interesting histories of working together etc.

Support – Don’t forget this – even if its not asked for in tender – you don’t have to cost it – just say you offer it – it sets expectations, and is something customers often don’t consider when putting a tender out, as they may intend to fully support in-house, but the reality is that even in-house support will have questions from time to time - Never consider your end-product as static – the reality is that it’d be dead of it couldn’t accommodate a world of changing data.

And always remember, there is no point going in to something if you have an unhappy customer at the end of it – ensure you understand clearly what will make the customer happy. A project is for a period of time – the customer relationship is for life.

Recently, I noticed some very interesting side-effects from a mirror that was in a suspended state.

When I went to rebuild the mirror, I had to play catch up, with DB Backups, Differential backups and TLog Backups…nothing new here.

I copied the backup across (it took nearly 6 hours). The Differential was just about half the size of the database, and differentials were taken every 2 hours.

I then broke the mirror just before the differential backup was due to be taken.

And – the next differential was TINY – a fraction of the size of the previous differential – it copied across straight away, and I had the mirror up and running in minutes.

So, I looked at the Differential backup history, and could see that differential backups when mirrored (but suspended) appear to be significantly larger than differential backups when there is no mirror.

So I am going to state what was probably obvious to everyone (except me!): When your mirror becomes suspended, the differential backup appears to hold on to everything that is required for the mirror to catch up, which can result in unmanageably large files. So, breaking the mirror early can result in a speedier recovery to full mirroring.

Also, worth noting was that the full backup when the mirror was in the suspended state was 50% larger than full backup with no mirror (mirror suspended for a week before I was asked to look at it).

So, a suspended mirror can cause issues other than the obvious loss of High Availability – it can have a relatively significant impact on backup size, storage and of course network bandwidth as a result.

This weeks questions are on waits and extended events. OK admission – I suck at extended events. Never really got on the gravy train with SQL 2008 and still clinging onto the hope that RML utilities will be upgraded to support SQL 2012. I’d even do the upgrade myself if the source code was on codeplex, such is my aversion to extended events. Waits on the other hand are the bread and butter of performance tuning – to be a SQL MCM I would expect you not only need to be able to articulate main wait types, but at least share some horror stories of situations where you’ve seen them. Mere mention of CXPACKET should be enough for a one hour debate.

Waits (not just wait stats)

1. Can you define “signal wait time” ?

2. In the DMV ‘sys.dm_os_wait_stats’ does the column “wait_time” include or exclude signal wait time?

3. What does a high “signal wait time” usually mean ?

4. What does a high wait time but low signal wait time mean ?

5. What is the difference between WRITELOG and LOGBUFFER wait types ?

6. When running a query what could cause an IO_COMPLETION wait type ?

7. When running a query what could cause an ASYNC_IO_COMPLETION wait type ?

8. What can cause PAGELATCH_XX ?

9. If you have a RESOURCE_SEMAPHORE wait for a large query how could you resolve this?

10. How could you resolve THREADPOOL waits on a server with a high number of concurrent users?

11. How would you determine why a log file was continually growing ?

12. What can sys.dm_io_virtual_file_stats show you that you cannot see from perfmon counters ?

Extended Events (Assume SQL 2008R2)

What is a predicate WRT extended events ?

Is a predicate at the event or session scope ?

Explain the difference between sys.dm_xe_packages and sys.dm_xe_objects ?

Name some targets available in SQL 2008R2 (six available) ?

How big is an asynchronous buffer by default ?

What happens if an event is bigger than this default size?

Explain the difference between the three options for dealing with a full extended events buffer using EVENT_RETENTION_MODE

If the services is restarted will an xevents session continue ?

Why would you want to use extended events to track page splits instead of just using the performance counter ?

What is the purpose of this xevents node “//RingBufferTarget/event/action/value”

I’m at the Belfast user Group Thurs the 11th April to demo and discuss two of my favourite topics (below). If you are interested in SQL Server or want to know a bit more about Windows Azure hype v feature then please do come alone.

SQL 2012 Always On Deep Dive (Beginner/Intermediate)

Always On Availability Groups offers a huge leap forward in terms of high availability. This sessions is a demo based introduction to the high availability changes in SQL 2012 emphasising key features and benefits of Availability Groups.

During the demonstration we will show how to seamlessly upgrade a database from SQL 2005+ to Denali with ZERO down time and then use the Always on features to simplify reporting, availability and disaster recovery.

This session will be useful for anyone who is working in a high availability environment or currently using replication to provide reporting and looking for a neater solution.

Windows Azure for SQL Folk (Beginner)

So what is this cloud stuff and how does it affect database professionals?

In this session we aim to cover the fundamental concepts that you should know on the Cloud and Windows Azure. In this demo based hour we will step through building Azure applications, deploying and managing them and what storage choices you have when developing applications in the cloud.

This sessions is Ideally suited to anyone who wants to understand more behind how Azure works, what it offers and what is probably just hype. Note that we are looking at the whole of Azure rather than just SQL Azure.