We have a large environment with many Active Directory DNS master servers
from which our NSD slaves a lot of zones. So the NSD receives several
notifies per second, attempts several transfers every few seconds, and in
general has a lot of throughput to handle.
Sometimes IXFR fails and NSD falls back to AXFR, which works fine.
However, the post-transfer reload also seems to fail sometimes; e.g. in the log
we have the following (anonymized myhostname/myzone):
Aug 10 02:52:18 myhostname nsd[17775]: xfrd: zone myzone.tld: needs to be loaded. reload lost? try again
We have no idea WHY the reload is lost at all. Maybe some limit is reached on
the machine?
That would be fine for now, but unfortunately xfrd afterwards enters an
"endless loop" of the following messages:
Aug 10 02:52:30 myhostname nsd[17775]: xfrd: zone myzone.tld committed "received update to serial 9297248 at 2016-08-10T02:52:30 from 10.32.208.10"
Aug 10 02:52:30 myhostname nsd[17775]: xfrd: zone myzone.tld: soa serial 9297248 update failed, restarting transfer (notified zone)
Aug 10 02:53:03 myhostname nsd[17775]: xfrd: zone myzone.tld committed "received update to serial 9297249 at 2016-08-10T02:53:03 from 10.32.208.10"
Aug 10 02:53:03 myhostname nsd[17775]: xfrd: zone myzone.tld: soa serial 9297249 update failed, restarting transfer (notified zone)
[...]
...until the zone expires :(
A manual nsd-control reload myzone.tld does NOT fix the looping xfrd.
A manual nsd-control transfer myzone.tld does NOT fix the looping xfrd.
A manual nsd-control force_transfer myzone.tld DOES fix it, until the next
lost reload and loop (so this is our current hotfix/workaround: an hourly
cron job which calls force_transfer for all zones).
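The hourly workaround can be sketched as follows. This is a minimal sketch,
not our exact cron job: it assumes nsd-control is in PATH and that
`nsd-control zonestatus` with no argument prints a `zone: <name>` line per
configured zone, as in the zonestatus output shown later in this report.

```shell
#!/bin/sh
# Hourly workaround: force a full AXFR for every configured zone so that
# a zone stuck in the "update failed, restarting transfer" loop recovers.
# Assumes nsd-control is in PATH; exit quietly on hosts without NSD.
command -v nsd-control >/dev/null 2>&1 || exit 0

# "nsd-control zonestatus" (no argument) prints one "zone: <name>" block
# per configured zone; extract the names and force-transfer each zone.
nsd-control zonestatus | awk '/^zone:/ { print $2 }' |
while read -r zone; do
    nsd-control force_transfer "$zone" >/dev/null
done
```

Run from cron once per hour; it trades periodic full AXFRs for not letting
zones expire.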
So, it all looks like there is some race between the master process
triggering the reload after transfers and the xfrd process.
We run NSD 4.1.7 on OpenBSD 5.9 (from base) and our config is simple and
looks like this:
server:
    server-count: 2
    hide-version: yes
    verbosity: 1
    database: "" # disable database

remote-control:
    control-enable: yes

zone:
    name: myzone.tld
    zonefile: slave/%s
    allow-notify: 1.1.1.1 NOKEY
    request-xfr: 1.1.1.1 NOKEY
    allow-notify: 1.1.1.2 NOKEY
    request-xfr: 1.1.1.2 NOKEY
    allow-notify: 1.1.1.3 NOKEY
    request-xfr: 1.1.1.3 NOKEY
    allow-notify: 1.1.1.4 NOKEY
    request-xfr: 1.1.1.4 NOKEY

[... lots of zones ...]
Increasing xfrd-reload-timeout to 10 does NOT fix it; xfrd still enters the
loop mentioned above.
If we postpone the initial zone writes with a high zonefiles-write: 86700
value, we do not hit the loop above until the first write, as
zone->soa_disk_acquired seems to be 0 and the conditions in
xfrd_check_failed_updates() are not met.
So this seems to be related to the xfrd->reload_cmd_last_sent timestamp.
Sidenote: these timestamps seem to be gathered via time(). Note that time()
values are subject to changes/adjustments/stepping back by e.g. ntpd, so it
would really be better to use clock_gettime(...MONOTONIC) if these timestamps
need to be reliably increasing and race-free.
Sidenote: The same bug seemed to be reported already back in 2008
on the mailing-lists:
https://www.nlnetlabs.nl/pipermail/nsd-users/2008-July/000799.html
https://open.nlnetlabs.nl/pipermail/nsd-users/2008-September/000820.html

Hi umaxx,
The error is something to look at.
The base cause is that the master servers do not properly send zone transfers: the IXFRs do not actually match (old vs. new). And since, as you say, force_transfer fixes it, the SOA serial number is also not reliably kept track of. So I would recommend trying to get the master servers fixed (if you can).
Perhaps try a config where you disable IXFR for the zone? This may work around the issue with the master's notion of changes.
Change the config lines to this:
request-xfr: AXFR 1.1.1.1 NOKEY
Best regards, Wouter

Hi,
You mentioned the older reports; we fixed those by checking whether the IXFR has problems and, if so, reverting to an AXFR. But this fix alone seems insufficient for you, and additionally there looks to be another issue with the timestamps.
Best regards, Wouter

(In reply to Wouter Wijngaards from comment #1)
> The error is something to look at.
>
> The base cause is that the master servers do not properly send zone
> transfers. The IXFRs do not actually match (old vs new). And, from what
> you say force-transfer fixes it, the soa serial number is also not reliably
> kept track of.
Yes, immediately before the "reload lost" message there is the following in
the logs:
Aug 10 12:38:28 myhostname nsd[20243]: diff: domain cetqe-patl7.myzone.tld. does not exist
Aug 10 12:38:28 myhostname nsd[20243]: Failed to apply IXFR cleanly (deletes nonexistent RRs, adds existing RRs). Zone myzone.tld. contents is different from master, starting AXFR. Transfer received update to serial 9298441 at 2016-08-10T12:38:19 from 1.1.1.2
Aug 10 12:38:28 myhostname nsd[20243]: zone myzone.tld. received update to serial 9298441 at 2016-08-10T12:38:19 from 1.1.1.2 of 300 bytes in 0.000107 seconds
...and then the:
Aug 10 12:38:28 myhostname nsd[14088]: xfrd: zone magna.global: needs to be loaded. reload lost? try again
...followed by the "update failed, restarting transfer" loop.
> So I would recommend trying to get the master servers fixed
> (if you can).
Not really possible. Also, I do not think they are "broken".
These are "standard" Active Directory Domain Controllers (a lot of them),
which act master-master, sending a lot of notifies.
Also, BIND can handle them fine (we are in the middle of the process
of migrating from BIND to NSD/Unbound).
Sidenote: myzone.tld is rather large, 1.5MB with more than 45k lines on
disk. Maybe this is related? Some buffer filled up? A broken network is
also possible, sometimes losing some packets somewhere.
> Perhaps a config where you disable IXFR for the zone? This may workaround
> the issue with the master's notion of changes.
> change the config lines to this:
> request-xfr: AXFR 1.1.1.1 NOKEY
That would work, but it would transfer the whole 1.5MB every few seconds :(
The IXFR mostly works. It just reliably fails after running several thousand
IXFRs for about an hour.

(In reply to Wouter Wijngaards from comment #2)
> You mentioned the older reports, we fixed that by checking if the IXFR has
> problems, and if so, reverting to an AXFR. But this fix alone seems
> insufficient for you;
Yes and no :)
I really think NSD/xfrd should recover from a broken IXFR correctly instead
of going into a logging loop.
The main question for me is: why does the "reload lost" happen at all, and
why can it not recover correctly afterwards (commit and serve, or just do a
clean AXFR) instead of ending up in the loop?
Within the mentioned "update failed, restarting transfer" loop I see the
following interesting behaviour:
$ nsd-control zonestatus myzone.tld
zone: myzone.tld
state: ok
served-serial: "9298439 since 2016-08-10T12:38:28"
commit-serial: "9299506 since 2016-08-10T14:37:00"
So, it has serial 9299506 correctly transferred, but then fails to serve it?
In the logs I have:
Aug 10 14:37:00 myhostname nsd[14088]: xfrd: zone myzone.tld committed "received update to serial 9299506 at 2016-08-10T14:37:00 from 1.1.1.4"
Aug 10 14:37:04 myhostname nsd[886]: notify for myzone.tld. from 1.1.1.4 serial 9299507
Aug 10 14:37:04 myhostname nsd[14088]: xfrd: zone myzone.tld committed "received update to serial 9299507 at 2016-08-10T14:37:04 from 1.1.1.4"
Aug 10 14:37:04 myhostname nsd[14088]: xfrd: zone myzone.tld: soa serial 9299507 update failed, restarting transfer (notified zone)
Then I see:
$ nsd-control zonestatus myzone.tld
zone: myzone.tld
state: refreshing
served-serial: "9298439 since 2016-08-10T12:38:28"
commit-serial: "9298439 since 2016-08-10T12:38:28"
... so back from 929950x down to 9298439 while refreshing/restarting? :(
and then the loop starts over:
$ nsd-control zonestatus myzone.tld
zone: myzone.tld
state: ok
served-serial: "9298439 since 2016-08-10T12:38:28"
commit-serial: "9299509 since 2016-08-10T14:37:31"
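The stuck state above can also be detected mechanically: the commit-serial
keeps advancing while the served-serial never moves. A minimal sketch of
such a check, assuming the exact zonestatus output format shown above
(quoted serial followed by "since <timestamp>"); note that a single snapshot
where the serials differ is normal during an in-flight transfer, so a real
monitor should require the mismatch to persist across several checks before
kicking the zone:

```shell
#!/bin/sh
# Detect the loop shown above: xfrd keeps committing new serials while
# the served serial never moves; kick the zone with force_transfer.
zone="${1:-myzone.tld}"   # example zone name
status=$(nsd-control zonestatus "$zone" 2>/dev/null) || exit 0

# served-serial / commit-serial lines look like:
#   served-serial: "9298439 since 2016-08-10T12:38:28"
# so take the quoted field and keep its first word (the serial).
served=$(printf '%s\n' "$status" | awk -F'"' '/served-serial/ { split($2, a, " "); print a[1] }')
commit=$(printf '%s\n' "$status" | awk -F'"' '/commit-serial/ { split($2, a, " "); print a[1] }')

if [ -n "$served" ] && [ -n "$commit" ] && [ "$served" != "$commit" ]; then
    echo "zone $zone: serving $served but committed $commit, forcing AXFR" >&2
    nsd-control force_transfer "$zone"
fi
```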
> and additionally there is another issue with the
> timestamps it looks like.
Yes, it seems to be.

I forgot to mention: since we can reproduce this easily, I'm happy
to do any further testing.
Let me know how we can track this down further.
Adding some debug printf's somewhere should be no problem!
Thanks!

Some more information: the underlying OpenBSD runs with the default
configuration, so no network or socket buffer tweaks are applied.
From login.conf, the default daemon class is used.
There is a lot of throughput on the system.
So we likely hit some OS limit here, resulting in the initially
failing IXFRs. But I have no idea which one, or how to track this down further.

Just to be clear: this does not happen after *each* failing IXFR.
Mostly IXFR just works.
Sometimes IXFR fails, but the fallback to AXFR succeeds and everything is fine.
Rarely (after about 1h in my case), IXFR fails, the fallback to AXFR
happens, and then the failing loop follows.

Hi umaxx,
Is the failing AXFR always a very small one, like the 300-byte one you logged? The working ones would then be 45kb or so?
(45kb is not a problem, by the way; NSD is also used for zones in the GBs of size.)
Best regards, Wouter

(In reply to Wouter Wijngaards from comment #8)
> Is this failing AXFR always one that is very small, like the 300 byte one
> you logged? The working ones would then be 45kb or so?
Nope. A quick grep reveals that it fails at various sizes; I found entries in the last hours between 2xx bytes and 8k.

I have sent you a patch; it should break the loop. (It does not handle the timing issue you noted, or the failed transfer itself.) When the loop starts and matches what the patch looks for, it will start a force_transfer for you. That should hopefully get it out of the mess by itself.
Best regards, Wouter