How to Move Thousands of Database Files with Minimal Downtime

Introduction

Running out of disk space is an inconvenient symptom of a business doing well.

Database file migration is rarely as easy as we want it to be – the downtimes alone can lead many people to put it off, even if this ultimately exacerbates the problem.

This is a common scenario for sysadmins and database administrators, and how it is handled can impact the entire company. I worked through this issue with a client recently and want to share how you can best complete this task while minimizing the downtime required.

The Scenario

SQL Server Availability Groups with a couple of databases are capable of handling the high uptime requirements for the application and database. However, those databases grow when business is going well. When you realize that you are running out of space for your data files and your disk/volume cannot be extended (for any number of reasons), then ultimately you need to add a new drive with a new drive letter. Unfortunately, this isn’t always quite as straightforward as it first appears.

The Technical Approach

To move database files in a SQL Server Availability Group, we typically run `ALTER DATABASE [Blah] MODIFY FILE`, move the physical files and we’re done. However, an Availability Group uses a “shared-nothing” architecture, so the data files on the primary and the secondary replica each have separate metadata.
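For the standalone case, the metadata change is a single statement. A minimal sketch using dbatools' `Invoke-DbaQuery` (the instance, database, and logical file names here are hypothetical):

```powershell
# Hypothetical names: SQL01 (instance), SalesDb (database), SalesDb_Data (logical file name).
# MODIFY FILE only changes the metadata; the new path takes effect the next
# time the database is brought online, after the physical file has been moved.
Invoke-DbaQuery -SqlInstance 'SQL01' -Query @"
ALTER DATABASE [SalesDb]
MODIFY FILE (NAME = SalesDb_Data, FILENAME = 'F:\Data\SalesDb.mdf');
"@
```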

The SQL Server Availability Group architecture is so named because it is meant to improve the uptime/availability of a group of databases. If we want to make these changes with the least impact to service, we need to handle this infrastructure update as we would other HA solutions in SQL Server: we change the secondary first, followed by the primary. This improves safety and allows us to back out any failed change while minimally impacting overall system availability.

But wait – the secondary replica isn’t writable. This means we cannot make those changes to the metadata on the secondary while the database(s) remains in the Availability Group. The simplest solution in this case? Remove the secondary from the Availability Group, make the changes to the primary’s metadata and reseed the secondary with a fresh copy of the primary (including the new storage paths).

Seems pretty clear cut, right?

The Implemented Solution

The databases in question were both numerous and large enough that reseeding would have been more painful than finding a way to update the metadata on the secondary replica directly. So, what to do?

It is possible to pause data movement for an Availability Group replica, which then allows us to take the databases in question offline and make the necessary metadata changes before “un-pausing” the data movement again. This is possible to do via SSMS and T-SQL, but is slow and not repeatable in a clean and concise way. My preferred tool in this case is PowerShell and the fantastic dbatools PowerShell module.
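dbatools does cover the pause/resume step directly. A minimal sketch (the replica, availability group, and database names are hypothetical):

```powershell
# Hypothetical names: SQL02 (secondary replica), MyAg (availability group), SalesDb (database).
# Pause data movement for one database on the secondary replica...
Suspend-DbaAgDbDataMovement -SqlInstance 'SQL02' -AvailabilityGroup 'MyAg' -Database 'SalesDb' -Confirm:$false

# ...and resume it once the file move is complete, so the replica catches up.
Resume-DbaAgDbDataMovement -SqlInstance 'SQL02' -AvailabilityGroup 'MyAg' -Database 'SalesDb' -Confirm:$false
```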

dbatools provides hundreds of commands to make the day-to-day tasks of a DBA easier. The commands allow automation, which improves efficiency and reduces the human-error aspect of DBA processes. But I needed to know whether it had the commands required to execute this task.

I ended up using a mixture of dbatools commands and regular old PowerShell to build the process I needed; you can find the code I used here on GitHub Gist.

This code allows you to enter a few parameters, and then the databases that belong to a chosen availability group will be moved from their old storage location to their new location on the secondary replica. However, there are a few caveats and interesting aspects of this approach to consider before running the code:

1. It works great (on my machine)! This code has been used to move ~20k data and log files amounting to around 20TB of data in one environment. It fulfilled my needs but wasn’t written to cover all eventualities; treat it as alpha code and tweak it to fit your needs.

2. I’m not great at PowerShell, but I know enough to be dangerous. This code breaks a few coding standard conventions in PowerShell and if it hurts your eyes, then so be it. I respectfully refer you to point one.

3. I was working in an environment that allowed me to use the “readable secondary” feature. This was needed for lines 42–44 of the code (to get the physical locations of the database files on the secondary). That may not be available to you, as it depends on the licensing of your machines. You could skip the check and avoid needing the readable secondary, thereby avoiding having to license the secondary replica for read access.

4. Some of the commands needed are not available in dbatools (or at least, I didn’t understand them to be). I built T-SQL to do those parts, but there could be a better way.

5. The same goes for lines 28-32. Robocopy does a fantastic job of syncing folders, including ACL permissions.

6. Line 23: The databases and the files that belonged to them shared the same names, so my `$db` list was created using those files. This also ensured I only processed databases that had their files in the old location.

7. Lines 78-104: These move the files from the old folder to the new one. My requirement was to move data and log files into the same target folder – this might not be right for your usage. I also ended up running this in multiple PowerShell sessions in parallel to speed up the move. As I don’t have a programming background, I had to get creative!

8. This process was repeated over multiple weekends and took FOREVER (think 500+ hours without scripting). I made liberal use of Write-Host with output coloring so that we could see progress when we looked in on the RDP session every few hours.
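Putting the steps above together, the overall flow on the secondary looks roughly like this. This is a condensed sketch, not the original Gist: all instance, AG, and path names are hypothetical, and it assumes (as the article's approach implies) that `MODIFY FILE` is permitted against the suspended secondary:

```powershell
# Hypothetical names throughout: SQL02 (secondary), MyAg (availability group),
# E:\Data (old volume), F:\Data (new volume).
$secondary = 'SQL02'
$ag        = 'MyAg'
$oldPath   = 'E:\Data'
$newPath   = 'F:\Data'

# 1. Find files still on the old volume (this query needs a readable secondary).
$files = Invoke-DbaQuery -SqlInstance $secondary -Query @"
SELECT DB_NAME(database_id) AS DbName, name AS LogicalName, physical_name AS PhysicalName
FROM sys.master_files
WHERE physical_name LIKE '$oldPath\%';
"@

# 2. Pause data movement for each affected database.
$files.DbName | Select-Object -Unique | ForEach-Object {
    Suspend-DbaAgDbDataMovement -SqlInstance $secondary -AvailabilityGroup $ag -Database $_ -Confirm:$false
}

# 3. Repoint the metadata; MODIFY FILE only takes effect when the database
#    next starts, so nothing moves yet.
foreach ($f in $files) {
    $target = Join-Path $newPath (Split-Path $f.PhysicalName -Leaf)
    Invoke-DbaQuery -SqlInstance $secondary -Query "ALTER DATABASE [$($f.DbName)] MODIFY FILE (NAME = [$($f.LogicalName)], FILENAME = '$target');"
}

# 4. Stop the instance, move the physical files (preserving ACLs), restart.
Stop-DbaService -ComputerName $secondary -Type Engine
robocopy $oldPath $newPath *.mdf *.ndf *.ldf /MOV /COPYALL
Start-DbaService -ComputerName $secondary -Type Engine

# 5. Resume data movement so the replica catches back up.
$files.DbName | Select-Object -Unique | ForEach-Object {
    Resume-DbaAgDbDataMovement -SqlInstance $secondary -AvailabilityGroup $ag -Database $_ -Confirm:$false
}
```

In practice you would run the copy step in parallel sessions, as described above, rather than a single serial robocopy.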

The Result

The database files were moved on the secondary replicas only, meaning the primaries were available for application access during the entire move. Once a secondary was finished, we failed over the AG and repeated the process on the old primary, failing back again once the second move was completed.

The “downtime” of the databases was a few seconds – essentially the time it took for the AG role to pass from one replica to the other. This was completely seamless to the applications, which accessed the Availability Group through the Listener and relied on the auto-reconnect functionality of the .NET SqlClient.

Beyond a hiccup here or there where network connections blipped or the wrong volume was deleted after moving a few TB of data, the entire process was relatively painless, if boring. But in IT Operations, boring is actually a good thing!

At this stage, I owe a big thanks to dbatools (again!), which minimizes the hassle of working with so many facets of SQL Server, especially at scale.

I would also repeat my regular recommendation to learn PowerShell and automate as many processes as you can. It helps you avoid human error and lets the computer do the boring, repetitive work it was designed to do.

Have questions about this process or how I use PowerShell to solve other data management issues? Get in touch!


The reason for offlining the entire instance was that a database on the secondary side of an AG cannot be taken offline – SQL Server says “NO!”. So the only way to achieve the same thing is to take the entire instance offline.

Not beautiful, I agree, but necessary for the AG to remain in a paused state rather than having to be rebuilt.
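Restarting the instance can also be scripted with dbatools; a sketch with a hypothetical server and instance name:

```powershell
# Hypothetical names: SQL02 hosting the default instance MSSQLSERVER.
# Stopping the whole Engine service stands in for OFFLINE, which SQL Server
# refuses for a database on the secondary side of an AG.
Stop-DbaService -ComputerName 'SQL02' -InstanceName 'MSSQLSERVER' -Type Engine
# ...move the physical files while the instance is down...
Start-DbaService -ComputerName 'SQL02' -InstanceName 'MSSQLSERVER' -Type Engine
```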