Thanks for the bug report.
This probably belongs on the mailing list more than here, for reasons that I hope will become clear shortly.
So, I have good news and bad news, and a possible way forward.
The first thing to note is that the particular error you are seeing is a red herring - not an actual error, but a debugging message that really shouldn't be at level 0. It isn't a failure case, just something unusual (even though we do it Samba to Samba).
If you turn the logs up enough, there may be a real error message printed at a higher log level. Without data it is hard to say, but I would speculate that you are hitting timeouts at either the network or LDB layer.
The second thing is that, by numerous reports, you are at the outer edge of Samba's current practical scale. We greatly admire the large installations that run Samba at the very edge of its capabilities, but we also know from reports that things like this can and do come up for them.
The good news is that you are not alone: as an example, in my work on very similar customer Samba bugs for my employer Catalyst, we have isolated performance issues in both our DRS client and server, as well as some inappropriate timeouts that could simply be extended.
The bad news is that while some small fixes are in git master, there is still much, much more to do. I'm confident that it is practical to make Samba scale to the size you need, with the application of some appropriate development resource.
Finally, at your scale, particularly if this is in production, I would strongly suggest engaging a Samba commercial support provider to assist in isolating what exactly is happening on your network, and to propose the work required to fix it.
Thanks,
Andrew Bartlett

Without further feedback, I'm going to say this is fixed in Samba 4.5rc1, as we addressed a large number of performance issues that will hurt domains of this size. Those could certainly lock up a DC for long enough to cause this kind of behaviour.