How to build a predictable server replication scheme

One day your CFO calls you at 9:45 AM and says “London is waiting on the financial report I just updated in the Notes database. When will that get replicated over to the server in London so they can work on it?” You think to yourself, “I don’t know. First it has to go to the hub and then to the London server, and the schedule for each is every 60 minutes from 12:01 AM to 11:59 PM. That’s two replication events, so it could happen anytime between right now and two hours from now.” So you say to her, “I’m not sure, but I can force that over right now.” You drop what you were doing and go force the replication. Has this ever happened to you?

Well, with proper planning, your answer could have been as simple as “Sure, the hub replicates with your server every hour on the hour and then with the London server 10 minutes past the hour. Your data will replicate to the hub at 10:00, and at 10:10 it will replicate to London.” She hangs up happy and you go back to your work.

In my years supporting a broad variety of environments ranging from 5 to 90,000 users, I have figured out a number of tips like this one that can make a big difference in the operation of your Notes environment. When I shared them with my friends at IBM services, they adopted these as best practices when working with their customers. I would like to share those tips with everyone.

When it comes to monitoring replication events, it’s easier to troubleshoot replication issues and to predict when replication will occur if the replication times are listed explicitly in the connection document instead of using a time range and interval. The reason is that the replication interval timer starts when the previous replication event finishes. So if the interval is set to 60 minutes and the first replication starts at 8:00 AM and takes 5 minutes to complete, the next replication will occur at 9:05 AM, not 9:00 AM, the one after that at 10:10 AM, and so on, causing a drift in the replication times. You can avoid this drift by explicitly listing the replication times. To define replication at explicit times, the connection times should use the format “8:00 AM; 9:00 AM; 10:00 AM” and so on, with the repeat interval set to “0”. Also set a replication time limit that is less than the spacing between the scheduled times so replication events do not overlap. Explicitly defined times are particularly useful when data must travel through multiple servers to get from one end user to another and a fast, predictable delivery time is needed. It is also more efficient than setting the replication interval extremely short to compensate for the seemingly unpredictable nature of the time range/interval method.
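To see why the drift happens, here is a small Python sketch (illustrative only, nothing Domino-specific; the 60-minute interval and 5-minute replication duration are the figures from the example above):

```python
from datetime import datetime, timedelta

def interval_schedule(start, interval_min, duration_min, events):
    """Model an interval-based schedule: each new interval is measured
    from the END of the previous event, which is what causes the drift."""
    times, t = [], start
    for _ in range(events):
        times.append(t)
        # the next event starts one full interval after this one finishes
        t += timedelta(minutes=duration_min + interval_min)
    return times

for t in interval_schedule(datetime(2024, 1, 1, 8, 0),
                           interval_min=60, duration_min=5, events=3):
    print(t.strftime("%I:%M %p"))
# 08:00 AM, 09:05 AM, 10:10 AM -- the drift described above
```

With an explicit time list and an interval of “0”, every event instead anchors to the clock, so the 9:00 AM replication happens at 9:00 AM no matter how long the 8:00 AM one took.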

Now let’s say you have a hub server that is running 4 replicator tasks and you need it to replicate hourly with 24 different servers. You can schedule it to replicate with 4 servers at a time, 10 minutes apart, so the schedule might look like this:

On the hour: servers 1 through 4
10 past the hour: servers 5 through 8
20 past the hour: servers 9 through 12
30 past the hour: servers 13 through 16
40 past the hour: servers 17 through 20
50 past the hour: servers 21 through 24
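Typing out an explicit hourly time list for every connection document in a staggered schedule like this is tedious. A short helper script can generate the “8:00 AM; 9:00 AM; …” strings for you; this is a Python sketch of my own, not a Domino API:

```python
def connection_times(minute_offset, start_hour=0, end_hour=23):
    """Build an explicit connection-document time list: one event per hour
    at a fixed minute offset, in the "H:MM AM; H:MM PM" format."""
    times = []
    for h in range(start_hour, end_hour + 1):
        suffix = "AM" if h < 12 else "PM"
        hour12 = h % 12 or 12  # 0 -> 12 AM, 12 -> 12 PM
        times.append(f"{hour12}:{minute_offset:02d} {suffix}")
    return "; ".join(times)

# The connection for the second group of spokes (10 past the hour):
print(connection_times(10, 8, 11))
# 8:10 AM; 9:10 AM; 10:10 AM; 11:10 AM
```

Paste the output into the connection document’s replication times field, set the repeat interval to “0”, and each group of spokes lands predictably in its own 10-minute slot.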

While I’m talking about replication times, I should mention that it’s wise not to have any replication going on while the nightly maintenance tasks, like Design and Compact, are running. This is usually between 1:00 AM and 4:00 AM. (Check your program documents and the “ServerTasksAtn=” parameters in the notes.ini.) If you have servers in different time zones, be sure replication does not occur between them during the maintenance window of either server. For example, if you have a server in London (time zone GMT) and another server in New York (time zone GMT -5) and they both have a maintenance window from 1:00 AM to 4:00 AM, then from the London server’s perspective, no replication should occur between 1:00 AM – 4:00 AM or 6:00 AM – 9:00 AM GMT. From the New York server’s perspective, no replication should occur between 8:00 PM – 11:00 PM or 1:00 AM – 4:00 AM EST. If you have more than two or three time zones to deal with, create a table that lists the maintenance window of each server in its local time and translates it into the time zone of each server that replicates with it.
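That translation table can also be generated programmatically. Here is a Python sketch using the standard zoneinfo module, with the London/New York example from above; the server names and the 1:00 AM – 4:00 AM window are the illustrative figures from the text, not anything Domino provides:

```python
from datetime import datetime, date
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")

def maintenance_window_utc(tz, day, start_hour=1, end_hour=4):
    """A server's local maintenance window (default 1:00-4:00 AM) as UTC times."""
    start = datetime(day.year, day.month, day.day, start_hour, tzinfo=tz)
    end = datetime(day.year, day.month, day.day, end_hour, tzinfo=tz)
    return start.astimezone(UTC), end.astimezone(UTC)

servers = {
    "London": ZoneInfo("Europe/London"),
    "New York": ZoneInfo("America/New_York"),
}

day = date(2024, 1, 15)  # a winter date: London on GMT, New York on EST
for viewer, vtz in servers.items():
    for owner, otz in servers.items():
        s, e = maintenance_window_utc(otz, day)
        # Show every server's window translated into the viewer's local time.
        print(f"{viewer} view: avoid {s.astimezone(vtz):%I:%M %p} to "
              f"{e.astimezone(vtz):%I:%M %p} ({owner}'s maintenance window)")
```

From New York’s perspective this prints London’s window as 8:00 PM to 11:00 PM EST, matching the blackout times worked out above; extend the servers dictionary and the loop produces the full table for any number of time zones.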

This may seem like a lot of effort compared to using a replication interval, but this little bit of planning makes a huge impact on managing large domains. It will also help preserve your sanity. (I have constructed a replication scheme that involved 200+ servers replicating through 4 hub servers spanning 20 time zones, and these techniques were absolutely essential to make it manageable and predictable.)

Regarding mail routing using connection documents, even though there is an option for “Pull-Push” on the router type setting, the actual behavior of pull is to tell the other server to push. As a result, if there are any firewalls between the servers, the firewall rules must allow both servers to initiate a conversation on port 1352 for NRPC mail routing. This also applies to mail routing via Notes Named Network (NNN). This is critical because if two servers are in the same NNN, but the firewall prevents either server from initiating the conversation, then mail will fail. This is true even if there is a valid path between the two that involves a third server. If they are in the same NNN, they MUST be able to connect directly to each other.

A few other tips:
Create a view in the Domino Directory to display connections by both source and destination servers.
In a hub/spoke configuration, let the hub servers initiate the replication so you only have to check one log to see the results.
Create a connection that replicates from spoke to hub once per day as a safety net.
Always use DNS names rather than IP addresses unless that is not an option (like servers isolated via firewalls).
Do not use a hosts or lmhosts file to define addresses.
Use Notes Named Networks for mail routing whenever possible.
Beware of putting servers in the same NNN that do not have a direct path between them.
Disable replication on templates and set their replication priority to Low, then set your connections to replicate only Medium- and High-priority databases. Now the templates won’t replicate, and they won’t report errors either.