Implementing a standby server

FreeNAS Experienced

This is a hypothetical scenario at this stage. A FreeNAS server may be the wrong appliance to use for this, but I'd like to understand why and what the limitations are.

Consider a small business that employs predominantly Windows PCs. The business doesn't have a lot of cash to throw at hardware or software, but still wants to maximise business continuity by minimising system downtime. A couple of FreeNAS servers are considered as file servers; one active, the other on cold standby (i.e. no automatic switching). Replication has been set up between the two servers, but the business understands there is the potential to lose any data created since the last replication event. To keep the cost down, local authentication rather than directory services is employed. The servers are hardened in the sense that they meet FreeNAS basic hardware requirements and ZFS RAIDZ2 has been implemented. Note, though, that the servers may not be identical in their hardware or pool configuration, e.g. the second server may employ fewer, but larger, disks for its pool.

In preparation for swinging the second server into action in the event of a catastrophic failure of the active server, what are some of the things to consider within the FreeNAS OS to minimise any downtime? Below are several questions I've been pondering and searching through forum posts for answers to.

It appears possible to replicate the pool directly, except for the system dataset, which needs to be treated separately. From the perspective of the second server, how important is the system dataset of the first server?

I notice that user and group account information that exists on the first server isn't 'replicated' (by design I understand) across to the second server. At the time of a switch, I doubt it is just a matter of unplugging the boot drive from the first server and plugging it into the second to 'transfer' account information across? What steps should be taken to ensure that account information is 'synced' between the two servers under normal operating conditions?

As in the previous point, SMB shares aren't 'replicated' as such to the second server, but the underlying datasets are. How should shares be treated?

I understand permissions are transferred during replication. However, under normal conditions, files on the second server need to be read-only to prevent users from accidentally changing data on the wrong server. At the time of a switch, the permissions have to change to that of the active server. Is it possible to change the state of permissions in this way?

It's unlikely that jails, VMs and plugins would be employed for this business, but if any of these were, are there challenges in replicating and activating these?

Assuming it is possible to implement a standby FreeNAS file server, is there anything else I might need to consider?

FreeNAS Experienced

Not a dev or anything, but here are my thoughts:
1. System dataset is mostly for logs and stats. You don't need to replicate that (and, presumably, you can't actually set it up for replication). It's mostly there so you don't have the mentioned logs and reporting databases wearing out your boot drive (assuming it's a flash drive) and not taking RAM (if you don't set the system dataset, it all goes to tmpfs). In other words: Don't replicate it (because you can't normally), and don't worry. *DO* however keep a relatively recent copy of the config database.
2. Ah... There's many misunderstandings here, I'll address them in order:
- You would be correct in assuming that user accounts between systems wouldn't automatically transfer. That's one of the benefits of using a directory service (FreeNAS even provides such if you want).
- ZFS pools are identified internally by GUID, so even if you were to name your pools the same, the boot drive from Server1 will not attempt to import the pool on Server2 just because you swapped the drive, and even if you were to upload Server1's config to Server2, it's not guaranteed to work. You would probably have better luck hand-mirroring everything (which is saying a lot).
- Account information is best synced with a directory service. See Here for information on how to set up the built-in domain controller service on the primary server, and then you can connect the secondary server to the domain, which should "sync" the account information
3. You can still export those shares. Just export them as "Read Only" (Advanced Mode area).
4. Sure. Just untick the Read Only box and save. Be warned that this would probably break replication if/when the primary server comes back online and tries replicating again...
5. I have not tried this, but if you're using iocage jails I believe almost all configuration is stored in the dataset and very little is in the configuration database. Those would start right up. VMs are a bit trickier; you'll get their disks just fine, but the VM configuration is not replicated, so you'd need to take note of things like NIC addresses and the order of devices (yes, that actually matters in some cases). So long as the backup server is capable and has enough RAM for it, there shouldn't be any issues booting a replicated VM's drives. Beware, again, that doing so could break replication.
6. Beware of "simply" renaming your backup server to the same name as the original. Windows sometimes gets savvy that something is wrong and may refuse to connect to any shares if it notices the server looks different.

I honestly think you should not treat a backup server as a potential replacement for the primary, but instead as a backup. I would focus on ensuring your primary server is stable and available, while keeping a Disaster Recovery plan to restore the primary server in the event of a problem. So long as you maintain your machine, FreeNAS will chug along for years (case in point, I recently upgraded my 9.3 home server with only minor issues, a system that's nearly (if not more than) a decade old and only did so to take advantage of newer features). Keep your backups tested, and your drives in rotation, and barring Random Acts Of God, you should be fine.

FreeNAS Guru

Depending on the number of local users, it may be simpler to add each user to both boxes at the time of creation, ensuring the UID is identical on both boxes.
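One way to confirm that the local accounts line up is to compare passwd-style listings taken from both boxes. The following is only a sketch in Python, assuming each listing uses the usual colon-separated layout with the name, UID and GID in the first, third and fourth fields. The account names and IDs are invented for illustration.

```python
def parse_passwd(text):
    """Map each account name to its (uid, gid) from passwd-style text."""
    accounts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # not a passwd-style record
        accounts[fields[0]] = (int(fields[2]), int(fields[3]))
    return accounts


def account_mismatches(primary_text, backup_text):
    """List accounts that are missing or numbered differently on the backup."""
    primary = parse_passwd(primary_text)
    backup = parse_passwd(backup_text)
    problems = []
    for name, ids in sorted(primary.items()):
        if name not in backup:
            problems.append(f"{name}: missing on backup")
        elif backup[name] != ids:
            problems.append(f"{name}: primary uid/gid {ids}, backup {backup[name]}")
    return problems


# Hypothetical sample data, not taken from a real system.
PRIMARY = """\
alice:*:1001:1001:Alice:/nonexistent:/usr/sbin/nologin
bob:*:1002:1002:Bob:/nonexistent:/usr/sbin/nologin
carol:*:1003:1003:Carol:/nonexistent:/usr/sbin/nologin
"""
BACKUP = """\
alice:*:1001:1001:Alice:/nonexistent:/usr/sbin/nologin
bob:*:1005:1002:Bob:/nonexistent:/usr/sbin/nologin
"""

if __name__ == "__main__":
    for problem in account_mismatches(PRIMARY, BACKUP):
        print(problem)
```

Because replicated files carry numeric owner IDs rather than names, a mismatch like bob's above would leave his files owned by the wrong (or no) account on the backup.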

Regarding the SMB share point, something I do is create one share point and then use permissions to move users in the right direction. This means you only have one share point to create, and again I would create this early on. If this wouldn't work for you then simply create the share points on the backup box in advance. Unless you have hundreds of shares it's not a big deal.

Simply send the command zfs set readonly=off on the parent dataset and all permissions will work as they did on the primary.
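Since readonly is an inherited ZFS property, setting it on the parent only reaches child datasets that do not carry their own local value. A rough Python sketch for spotting stragglers follows, assuming the tab-separated output of `zfs get -r -H -t filesystem,volume -o name,value readonly <pool>` has been captured as text. The pool and dataset names are hypothetical.

```python
def still_readonly(zfs_get_output):
    """Return the dataset names whose readonly value is still 'on'."""
    datasets = []
    for line in zfs_get_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # blank or unexpected line
        name, value = parts[0], parts[1].strip()
        if value == "on":
            datasets.append(name)
    return datasets


# Hypothetical output captured after running `zfs set readonly=off tank`.
SAMPLE = "tank\toff\ntank/shares\toff\ntank/shares/accounts\ton\n"

if __name__ == "__main__":
    for name in still_readonly(SAMPLE):
        print(f"still read-only: {name}")
```

Any dataset reported here would need readonly changed on that dataset itself before users could write to it.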

Other things to consider: when disaster strikes, you need to make a quick decision about how bad the problem with your primary server is. If it's an issue you may be able to fix within a few hours or a day, then you may be best sending users to the backup in 'read-only' mode. However, if your pool is totally lost then do the above. Just remember when/if you fail over you no longer have a backup.

PS: I run exactly this setup on all my systems (I always build two) and in practice it works well; however, I've never had to fail over in a real-life situation. I used to use local user accounts, but over the last couple of years I've used AD, which is a godsend for users, groups and permissions in general. You may also want to explore using Microsoft's DFS for obscuring namespaces, making fail-over much more transparent to users.

PPS: If you ever actually fail over, I would disable SSH on your server, just in case your old primary comes back to life and then replicates, wiping out all your users' changes.

FreeNAS Experienced

@Tsaukpaetra @Johnny Fartpants I'm just working through your responses with a fine-tooth comb, but at this stage, I just wanted to thank you both so much for the valuable insights you've provided. They are a treasure trove of useful information! Thank you!!

- Account information is best synced with a directory service. See Here for information on how to set up the built-in domain controller service on the primary server, and then you can connect the secondary server to the domain, which should "sync" the account information

I can appreciate how this would be the preferred approach where the user base changes dynamically or grows to the point where it becomes an onerous exercise to manually sync accounts. Thank you for the link reference. Four options are suggested for directory services: Active Directory, LDAP, NIS and Kerberos. Clients are provided for the latter three. I understand that these require a separate directory server. With the AD option, am I correct in saying that where an AD network does not exist, FreeNAS can be configured to be its own primary(?) DC? If so, could the second server then be configured as the backup DC?
Any Windows Home clients on the network would have to be upgraded to Windows Professional to participate in an AD network. If, for whatever obscure reason, this were a barrier, which of the other three services might be the next preferred choice? Each appears to have their plusses and minuses. For instance, NIS appears to align well with the Unix-like heritage of FreeNAS, but it carries with it some security implications. In addition, any of the three choices require a directory server to be installed. I'm assuming, if a package is available, it may be possible to install the required directory server in a jail?

It's probably true to say that in an environment which doesn't yet have a directory service, AD is preferred over the other three directory service options discussed above in terms of its popularity and support base. The key to leveraging it though is to ensure that Windows Professional is deployed as the minimum Windows version for the client PCs. If a Windows Home client is on the physical network, it is unable to participate in AD services. Could it still participate in local authentication with a FreeNAS server set up as a DC or is this facility mutually exclusive to AD?

FreeNAS Guru

I have been doing something similar on my system, for just such an event, while putting my new server into service.
The lack of real-time operation is both an advantage and a disadvantage of replication. To get as close as possible, you need to create a recursive snapshot (automatically or manually) and perform replication as soon as the snapshot is created.
If you start the initial replication to a remote server, I would create the new pool with the same name as the one you are currently using. Because the remote server is a separate machine on the network, it doesn't matter that its pool has the same name as the source.
I usually start the first recursive replication with the -R and -p options to retain permissions on all the datasets. Then I perform incremental replications.
I should mention that the remote pool will be set to read-only; this prevents users from deleting files if they access it.
The remote server will have a different IP address, and even iocage jails will be replicated. Because you will not set your remote FreeNAS with the same IP address as the source, and hopefully you don't have them set to DHCP (though you can), the jails will not be mounted, and if they were, it may not matter at the time.
You can still perform CLI replication of the boot drive, I think, but the remote doesn't need to have the iocage location of the remote volume, so iocage shouldn't be mounted.
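The snapshot-then-replicate sequence described in this post can be sketched as a small Python helper that only builds the command lines and never runs them. It assumes the same pool name on both sides, SSH access to the remote box, and uses `zfs send -R` on its own, since the zfs man page notes that -p is implicit when -R is given. The host name `backup.example.lan`, the pool name `tank` and the snapshot names are all hypothetical.

```python
import shlex


def _join(parts):
    """Quote each argument and join them into one shell command string."""
    return " ".join(shlex.quote(part) for part in parts)


def replication_commands(pool, new_snap, previous_snap=None,
                         remote="backup.example.lan", remote_pool=None):
    """Build (but do not run) the commands for one replication pass."""
    remote_pool = remote_pool or pool  # same pool name on both sides
    target = f"{pool}@{new_snap}"
    send = ["zfs", "send", "-R"]
    if previous_snap is not None:
        # Incremental pass: send only the changes since the previous snapshot.
        send += ["-i", f"{pool}@{previous_snap}"]
    send.append(target)
    receive = ["ssh", remote, "zfs", "receive", "-F", remote_pool]
    return [
        _join(["zfs", "snapshot", "-r", target]),
        f"{_join(send)} | {_join(receive)}",
    ]


if __name__ == "__main__":
    # First pass sends everything; later passes are incremental.
    for command in replication_commands("tank", "repl-1"):
        print(command)
    for command in replication_commands("tank", "repl-2", previous_snap="repl-1"):
        print(command)
```

Printing the commands rather than executing them keeps the sketch safe to try, and the output can be reviewed before anything is pasted into a shell.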

If the local system were to crash, you could just activate the replicated boot volume on the remote, or, if you have a mirror of the original local boot drive, use that, as it will have all the users, iocage, certificates and IP of the local one. So booting the remote system from the local copy or the original boot drive will allow you to access the remote dataset as if it were the local system. You just make it local at that point.
Remember that you will have to set the pool as readonly=off to be able to mount the datasets and make everything writeable.

If the crash were predictable, you would replicate all the datasets and prevent any further writes to the old local system. As predicting a crash is very difficult, if not nearly impossible, replication may not be the best option. It could be just enough, but then how do you want your snapshot interval to be implemented and replicated?

Another approach would be to use replication at first for user access and dataset configuration, but then you may want to use mirroring with a dedicated redundant server; I think iXsystems covers this scenario with TrueNAS.
In a nutshell, you may want to look at server and load balancing as one of the possible alternatives.

FreeNAS Experienced

I honestly think you should not treat a backup server as a potential replacement for the primary, but instead as a backup. I would focus on ensuring your primary server is stable and available, while keeping a Disaster Recovery plan to restore the primary server in the event of a problem. So long as you maintain your machine, FreeNAS will chug along for years

Other things to consider: when disaster strikes, you need to make a quick decision about how bad the problem with your primary server is. If it's an issue you may be able to fix within a few hours or a day, then you may be best sending users to the backup in 'read-only' mode. However, if your pool is totally lost then do the above. Just remember when/if you fail over you no longer have a backup.

Restoring normal services is not as trivial as I thought it might be. Thinking about it, where there's a hardware failure such as PSU failure on the primary, but the pool is otherwise healthy, rather than switching, the preferred course of action might be to shut down both servers, pull the disks and boot device out of the first server and plug them into the second server to continue operating. Once the hardware issue on the first server is resolved, it can be returned to service as the active server at a time when the user base is least impacted, or, if the hardware is similar between the servers, by plugging in the boot device and disks from the second server, it can assume the role of the second server.

Where the pool on the first server has failed, I like the idea of sending users to the second server in read-only mode, while investigations on the first server are underway.

In the event of catastrophic pool failure, would it be reasonable to pull the disks from the second server, plug them in the first server and import the pool to continue operating? A new pool would still have to be created for the second server and replication services reinstated, but what appeals to me about this approach is that it avoids complications arising such as the situation below.

FreeNAS Experienced

The lack of real-time operation is both an advantage and a disadvantage of replication. To get as close as possible, you need to create a recursive snapshot (automatically or manually) and perform replication as soon as the snapshot is created.

Thanks @Apollo. I think for a small business working with FreeNAS servers and within tight budgetary constraints, near real-time operation is the best that can be expected. Yes, I agree the replication window should not be limited, and the snapshot frequency has to be carefully tuned so that, in the event that the second pool has to be engaged due to catastrophic failure of the primary pool, data loss is kept to a minimum.

FreeNAS Guru

In the event of catastrophic pool failure, would it be reasonable to pull the disks from the second server, plug them in the first server and import the pool to continue operating? A new pool would still have to be created for the second server and replication services reinstated, but what appeals to me about this approach is that it avoids complications arising such as the situation below.

readonly can mean that datasets which were never mounted before won't be accessible. Setting readonly=off and rebooting would be the only choice.

In the event of catastrophic pool failure, would it be reasonable to pull the disks from the second server, plug them in the first server and import the pool to continue operating? A new pool would still have to be created for the second server and replication services reinstated, but what appeals to me about this approach is that it avoids complications arising such as the situation below.

I would advise against it.
In the rush of the moment you might forget to plug in a drive, or have one drive misconnected, and your pool will either not be mountable or become degraded.
If the system suffered physical damage, then you are going to put your backup at risk.
Best to use the backup server as your current server.

If this is a reasonable approach, after moving disks from the second server to the first, would the pool still be in read-only mode requiring the command below to be executed?

FreeNAS Guru

Thanks @Apollo. I think for a small business working with FreeNAS servers and within tight budgetary constraints, near real-time operation is the best that can be expected. Yes, I agree the replication window should not be limited, and the snapshot frequency has to be carefully tuned so that, in the event that the second pool has to be engaged due to catastrophic failure of the primary pool, data loss is kept to a minimum.

You can set a cron job to create snapshots every so often, but you want snapshots to have a fairly short lifetime. The problem is that if you replicate, all the snapshots are going to be stored on the backup and will never be retired once their lifetime has elapsed, unless you use "Delete stale snapshots". This might add latency on the backup side, causing delays in the replication process.
It can take a while for replication to take place if you have a large number of snapshots and datasets. Something to be aware of.
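To illustrate what retiring snapshots by lifetime amounts to, here is a minimal Python sketch that picks out the snapshots older than a chosen lifetime. It is not how FreeNAS implements "Delete stale snapshots"; the snapshot names, creation times and one-week lifetime are made up, and in practice the creation times would have to come from the pool itself.

```python
from datetime import datetime, timedelta


def stale_snapshots(snapshots, lifetime, now):
    """Return names of snapshots created more than `lifetime` before `now`.

    `snapshots` maps snapshot name -> creation time; oldest names come first.
    """
    cutoff = now - lifetime
    expired = [(created, name) for name, created in snapshots.items()
               if created < cutoff]
    return [name for _created, name in sorted(expired)]


# Hypothetical snapshots on the backup pool.
SNAPSHOTS = {
    "tank@auto-3": datetime(2019, 1, 14, 12, 0),
    "tank@auto-1": datetime(2019, 1, 1, 12, 0),
    "tank@auto-2": datetime(2019, 1, 8, 12, 0),
}

if __name__ == "__main__":
    now = datetime(2019, 1, 16, 12, 0)
    for name in stale_snapshots(SNAPSHOTS, timedelta(weeks=1), now):
        print(f"candidate for removal: {name}")
```

The function only reports candidates; whether and how they are destroyed on the backup is left to the replication settings.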

FreeNAS Guru

Another note:
If you take my approach of having a remote system with a replicated version of the original pool (same pool name, though), but running it with its own FreeNAS boot disk and a different config from your original system, both systems can coexist on the same network and the remote system doesn't have to mount the iocage environment.
To cover the boot drive, make a mirror of your local system's boot drive and, once mirroring has finished, put the mirror in a safe place to be used in case of an emergency. You will also need to make backup copies of your local system config. In the event of a crash, assuming the pool is damaged, you can turn your backup system off, remove the remote drive and plug in the mirror that was saved earlier. Boot, and the remote system will behave as if it were the local system. Update the config from the latest copy, reboot, and you should have everything working, not forgetting the readonly option.

Essentially yes, activating the Active Directory service makes the FreeNAS server the "source of truth" for accounts, using whatever accounts you've added to it. You would not need to go to the Directory Services tab to connect Active Directory, because that's the server that's actually providing it (connecting would mean asking it to sync its own accounts with itself, which would probably cause problems; I haven't actually tried that myself).

This one I'm not so sure of. At present I'm not sure if FreeNAS can be configured to be a sibling Domain Controller. Unfortunately it doesn't seem like anyone has had need to do that enough for it to be added as a configurable option in the GUI.

Only if you need to have the Windows accounts be auto-authenticated and logging on via the FreeNAS server. Normal Windows Home can still access using username/password just as it would without it. The main thing you'd be using AD for is consistent accounts between servers.

Sure! Naturally you'd be doing it in the command line (technically it's possible to install GUI in a jail, but with caveats and probably not something you'd want to try without experience). Alternatively, if you have a turnkey solution that uses virtual machines, FreeNAS should also be able to boot and run it as well.

If a Windows Home client is on the physical network, it is unable to participate in AD services. Could it still participate in local authentication with a FreeNAS server set up as a DC or is this facility mutually exclusive to AD?

Right, I'll reiterate that this just isn't the case. You don't need to join a computer to a domain in order to use file sharing services. The pro version is for management, policies, and all sorts of other advanced things that you probably don't need (yet), and isn't a concern. Home computers can connect just fine, it simply won't be automatically authenticated.

Here's what that looks like, if you tick "Remember my credentials" then Windows will automatically try to reuse the username and password next time:

Note that you may need to prefix the domain name as I did, or the name of the server proper, depending on your network setup.

FreeNAS Experienced

After careful consideration of the advice I've been provided, and after weighing up the options, it appears that it is possible to use a pair of FreeNAS servers for a small business in a way that maximises system uptime and minimises the loss of data. For the sake of simplicity, in the discussion below, I've assumed identical hardware and pool configurations for both the active server and the backup server. To keep things simple, I've also excluded jails, plugins and VMs from the discussion.

Excluding environmental factors (theft, fire, extended loss of power, acts of God, etc), there are just two worst case scenarios that take out the active server:

Failure of the hardware supporting the pool; or,

Catastrophic pool failure.

Failure of hardware supporting the pool

If it's evident there is a hardware failure, e.g. a PSU failure, the quickest course of action to bring services back online will be to move the boot device and disks from the failed active server into the backup server and then bring up the backup server as the new active server. Care is required to ensure that the pool is not inadvertently destroyed through mishandling of the disks and boot devices. The key steps to follow are:

Shut down the backup server.

Remove and store the boot device and disks from the backup server.

If it is not already shut down, shut down the failed active server.

Remove the boot device and disks from the failed active server and install them in the backup server.

Boot the backup server. It now becomes the active server. At this point, users are able to access data on the server.

Following hardware repairs on the failed active server:

Install the boot device and disks from the backup server into the repaired server.

Boot the server. What was the failed active server is now the backup server.

Confirm that data replication is occurring.

Visible impact
The system is unavailable during the initial intervention. All users are affected. However, downtime should typically be no more than about 15 mins. There is no loss of server data.

Catastrophic pool failure

The first step is determining the severity of the pool failure. During the investigation, users can be directed to the backup server where they can access their data in read-only mode. Virtual namespace services such as DFS would make this step more transparent. If the failed pool can be repaired with minimal or no data loss, once the pool is restored, users are then directed back to the active server.

If it's evident there is a catastrophic failure of the active server pool, it will then be necessary to switch over to use the backup server pool. The key steps to follow are:

Shut down both servers.

Remove and store the backup server boot device. Replace it with the boot device from the failed active server.

Boot the backup server.

Import the pool.

Change pool state so that it can be written to (zfs set readonly=off). What was the backup server is now the active server. At this point, users are able to access data on the server.

Set up periodic snapshots.

Attention now turns to the failed active server.

Install the stored backup server boot device in what was the failed active server.

Boot the server. What was the active server is now the backup server.

Recreate the pool.

Set up replication with the active server.

Visible impact
All users are affected. In an environment without DFS, users are directed to the backup server where their data is available, but in read-only mode and only up till the last replication event. In the event that the failed pool is recovered, users are then redirected back to the active server where offline changes they may have made to data during the intervening period may need to be merged back.

If it is ascertained that there is a catastrophic failure of the active server pool, the system then becomes unavailable while a switch to the backup server pool is underway. Downtime, while this occurs, should be no more than about 15-30 mins. Any data created since the last replication event will be lost.

Other considerations

Catastrophic pool failure is trickier to deal with than failure of the hardware supporting the pool. To minimise data loss and disruption to services involving catastrophic pool failure, several important tasks need to be undertaken/reviewed during normal system operation.

If directory services are not employed, users and groups need to be manually created on the backup server with matching UIDs and GIDs.

Share points on the backup server need to be created in advance.

Check that the data on the backup server is read-only (zfs set readonly=on).

To minimise data loss, carefully review the frequency of periodic snapshots and ensure replication is occurring during periods of user activity.
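When reviewing the snapshot frequency, it can help to put a number on the exposure. A back-of-the-envelope sketch in Python, assuming replication starts as soon as each snapshot is taken and finishes before the next one is due: a file written just after one snapshot is only captured by the next, and that snapshot only protects it once it has fully arrived on the backup, so the worst case is one snapshot interval plus one replication run. The intervals and durations below are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval, replication_duration):
    """Longest stretch of work that can be lost if the primary pool fails.

    Assumes replication is triggered by each snapshot and completes before
    the next snapshot is taken.
    """
    return snapshot_interval + replication_duration


if __name__ == "__main__":
    replication = timedelta(minutes=5)  # hypothetical time for one incremental
    for minutes in (15, 60, 240):
        interval = timedelta(minutes=minutes)
        loss = worst_case_data_loss(interval, replication)
        print(f"snapshot every {interval}: up to {loss} of work at risk")
```

If the figure is higher than the business can tolerate, the snapshot interval is the first thing to shorten, subject to the replication run still finishing in time.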

What was a surprising finding in working through this small business scenario is that when there is a failure of the active server, whether it is the hardware or the pool, the boot device of the failed active server is switched across to the backup server. Switching the software state of the backup server's boot device from backup to active is not the preferred default action. Doing so, it appears, complicates returning to the status quo.

Follow up

This review has spawned two other posts around AD and namespace services like DFS. When considering a standby server, the discussion has revealed that these complementary services provide some tangible benefits. What I'm curious to know is 'Can these services be provided without involving a Microsoft Server backend?'.