This error has nothing to do with your XML issue; it relates to your gateway not bootstrapping properly on first start. You may need to turn the gateway service off for 20s and then bring it up again (it's worth checking your policy connection string too).

So it appears the issue is with the nonces! There's nothing special about the gateway configuration; we're connecting to the dashboard over HTTPS, so I don't know if that makes a difference.

Any help you can provide would be great, as this is a bit of a showstopper for our evaluation (we're trying to get policies to apply based on JWT tokens, but the debug logs show that no policies are being loaded from the dashboard!)

So you can see the dashboard is expecting nonce YzA5MWQ2YTMtMTllZS00MWNmLTRjMGYtMWMyYmRjN2NhMWIyNWE2NDdjYjc0N2M4NDhhYTRkNzVhNzg1ZDVhZWY3Njk= but YzA5MWQ2YTMtMTllZS00MWNmLTRjMGYtMWMyYmRjN2NhMWIyZjc5MDBkNDAxNmY5NDc3MzY3NWRlZGI2MGJhY2QxYjA= is sent instead.

The solution here, I think, although it has other potential knock-on issues, is to lock the mutex for the entirety of the HTTP call. This would ensure that the nonce is read and updated in one transaction, if it is indeed rotated after every HTTP exchange.
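A minimal sketch of that idea in Go; the type, field, and header names here are my own illustration, not Tyk's actual code:

```go
package gateway

import (
	"io"
	"net/http"
	"sync"
)

// dashboardClient is a hypothetical stand-in for the gateway's
// dashboard connector.
type dashboardClient struct {
	mu    sync.Mutex
	nonce string
}

// call holds the mutex for the entire HTTP round trip, so the nonce
// is read, sent, and replaced as a single transaction.
func (c *dashboardClient) call(url string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock() // no other goroutine can touch the nonce mid-flight

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("x-tyk-nonce", c.nonce) // header name is an assumption

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Adopt the rotated nonce before releasing the lock.
	c.nonce = resp.Header.Get("x-tyk-nonce")
	return io.ReadAll(resp.Body)
}
```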

EDIT: I think the actual threading issue is probably between the API Loading and the Heartbeat, which then has a knock-on effect on the policy, rather than the policy loading thread being at fault. Looking at the code, though, I can't see how the Heartbeat and API Loading could interact around the mutex to produce the log output above.

It could indeed be an issue with the mutex, but a mutex works across threads/goroutines: each time a read or a write of the nonce occurs it is locked, which also blocks any other threads trying to access the resource. Mind you, because the other side rotates the nonce on the heartbeat, there's a chance it changes dashboard-side even though it is correct gateway-side.
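For reference, the per-access locking being described looks roughly like this (a sketch; the variable and function names are mine):

```go
package gateway

import "sync"

var (
	nonceMu sync.Mutex
	nonce   string
)

// Each individual read of the nonce is locked...
func getNonce() string {
	nonceMu.Lock()
	defer nonceMu.Unlock()
	return nonce
}

// ...and each individual write is locked, but nothing holds the lock
// across the HTTP round trip that sits between a read and a write.
func setNonce(n string) {
	nonceMu.Lock()
	defer nonceMu.Unlock()
	nonce = n
}
```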

In fact, the simplest thing to do would actually be to move the heartbeat to after the initial load has occurred, since the heartbeat causes the most nonce changes. It's a tricky one.
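A rough sketch of that startup ordering, with hypothetical stand-ins for the real routines:

```go
package main

import (
	"log"
	"time"
)

// Hypothetical stand-in for the gateway's initial API/policy load.
func loadAPIsAndPolicies() error {
	log.Println("initial load complete")
	return nil
}

// Hypothetical heartbeat loop; each beat rotates the nonce dashboard-side.
func heartbeatLoop() {
	for range time.Tick(2 * time.Second) {
		log.Println("heartbeat")
	}
}

func main() {
	// Finish the initial load first, so it isn't racing the
	// heartbeat for the nonce, then start beating.
	if err := loadAPIsAndPolicies(); err != nil {
		log.Fatal(err)
	}
	go heartbeatLoop()
	select {} // keep the process alive
}
```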

Moving the heartbeat to after the initial load was my initial thought too, but I don't know the subtleties around how long API loading takes on production servers vs. the timeout of the node registration.

Regarding the mutex: I concur that it blocks across threads; however, there is a window during the HTTP connection where the mutex is unlocked. This can lead to concurrency issues where two threads issue an HTTP request at exactly the same moment: there's a race condition, and whichever request completes first will succeed (as the nonce will/should only be valid once).
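With per-access locking, that window looks something like this (an illustrative timeline, not real log output):

```
Goroutine A (API load)             Goroutine B (heartbeat)
----------------------             -----------------------
n := getNonce()   // locked read
                                   n := getNonce()   // same value!
issue HTTP request with n
                                   issue HTTP request with n
request accepted; dashboard
rotates the nonce
                                   request arrives carrying the stale
                                   nonce and is rejected
```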

Agreed that there may be something dashboard-side, although there's no code for me to look through on that. I'll try running the dashboard in debug mode (if that mode exists?) and see if I can spot anything.

We've just pushed a patch to our develop branch which makes the mutexes work across the lifetime of the transaction. This should prevent other processes from hijacking the nonce and changing it, and since it's only the heartbeat and the loaders involved, there are no knock-ons.

Just to confirm: I'm now running the nightly and I can see the policies being loaded correctly! API Loading also feels much faster, though I have nothing to quantify that with. I suppose something could have been blocking!

That's a new feature: the pub/sub payload that causes the cluster reload is now cryptographically signed with an asymmetric key pair. You can disable this check in tyk.conf by setting allow_insecure_configs to true.
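For example, in tyk.conf (only the relevant key is shown here):

```json
{
  "allow_insecure_configs": true
}
```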

This is because we've extended the command set that can be sent to an instance (such as reconfiguring the instance remotely and hot-reloading the process without traffic loss...), but we needed to secure those payloads.
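For anyone curious, here's a minimal, self-contained sketch of what signing and verifying a payload with an asymmetric (RSA) key pair looks like in Go; this is illustrative only, not Tyk's actual implementation:

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
)

func main() {
	// In the real system the private key stays with the sender and the
	// receiving instance only holds the public half; here we generate both.
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}

	payload := []byte(`{"command":"reload"}`) // hypothetical pub/sub payload
	digest := sha256.Sum256(payload)

	// Sign the payload digest with the private key...
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		panic(err)
	}

	// ...and verify with the public key before acting on the command.
	if err := rsa.VerifyPKCS1v15(&key.PublicKey, crypto.SHA256, digest[:], sig); err != nil {
		fmt.Println("rejecting payload: bad signature")
		return
	}
	fmt.Println("signature OK, applying command")
}
```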