@emmanuel-florent here's the code. I've tested this many times with 4x LoPy4's before posting; pressing the reset button at the same time on them all. Almost always, one or two devices will not connect and do not recover.

Gateway logs show one join request from the failed device(s). I watched the logs for two minutes each time, waiting for any sign of life (there was none).

@harald The gateway (more specifically, the TTN console data log) shows a single join event. No additional join events appear in the TTN console once the issue has been triggered, which suggests the LoPy4 is the problem.

@emmanuel-florent I've omitted the region settings to show just the minimal code. We're heavy users of LoPy hardware; everything upstream of the join event is fine.

@robert-hh I agree there is a collision, though it's not attempting every 15 seconds. A failed connection never recovers, even if other failed devices are turned off (leaving 1 failed device "on" to attempt its own recovery).

It does appear more-and-more that there's an issue in the stack for this scenario. With REPL connected, sending a CTRL+C during a failure condition

Left as is, this bug could be viewed as a DDoS layer for Pycom LoPy4. Devices that attempt to connect on power-up could find themselves in this situation and never "connect" using any of the recommended examples in Pycom Docs. I doubt many people have experienced the issue as they are not holding educational workshops that invoke more-than-normal LoRaWAN events over a short time.

The firmware does repeat the join request.
The uplink join frequency (in MHz) depends on the regional specification, as does the retry interval (in seconds).
The configuration depends on your setup, the region in particular, so which regional specification are you targeting?
For example, for US_915 sub-band 1 you would OTAA join on channel 64, and that would be:

@graham It looks to me as if two devices try to join at the same time over and over again, and the join messages then get lost due to collision. Indirect evidence for that would be every device joining successfully when started by itself. The reset you used as a workaround would then force the devices out of their malicious sync.
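To make the collision hypothesis above concrete (it is a hypothesis, not a confirmed diagnosis), here is a minimal simulation sketch in plain Python. Everything in it is an assumption for illustration: the fixed retry interval, the time on air, the rule that two join requests are lost whenever they overlap in time, and the names `simulate`, `AIRTIME_S` and `RETRY_S`. It ignores multiple channels, spreading factors and capture effects, and says nothing about what the firmware actually does.

```python
import random

AIRTIME_S = 1.5   # assumed time on air of one join request (illustrative)
RETRY_S = 15.0    # assumed fixed retry interval (illustrative)

def simulate(n_devices, max_jitter_s, attempts=20, seed=1):
    """Return how many devices get at least one collision-free join attempt.

    All devices are reset at t=0. Before every attempt each device waits an
    extra random delay in [0, max_jitter_s].
    """
    rng = random.Random(seed)
    next_tx = [0.0] * n_devices   # next transmit time per device
    joined = set()
    for _ in range(attempts):
        active = [i for i in range(n_devices) if i not in joined]
        if not active:
            break
        # Apply the random delay, then check for overlapping transmissions.
        for i in active:
            next_tx[i] += rng.uniform(0.0, max_jitter_s)
        for i in active:
            clash = any(abs(next_tx[i] - next_tx[j]) < AIRTIME_S
                        for j in active if j != i)
            if not clash:
                joined.add(i)
        # Devices that did not join retry after the fixed interval.
        for i in active:
            next_tx[i] += RETRY_S
    return len(joined)

print(simulate(4, 0.0))    # no jitter: devices stay in lockstep
print(simulate(4, 10.0))   # with jitter: devices drift apart
```

In this model, devices with no jitter stay in lockstep and none of them ever joins, whereas a random delay lets them drift apart. Note that this model cannot explain a device that still fails to join after all the others have been switched off.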

It happens on different gateways, so I'm not leaning in that direction yet. I can very reliably trigger the fault by powering up 5x LoPy4's all at the same time. At least one or two of the last-to-be-powered devices will hang.

The only semi-reliable method we have for educational workshops (where 10+ devices are being powered on/off at the same time) is this:

    print("Device EUI: " + binascii.hexlify(lora.mac()).upper().decode('utf-8'))
    while not lora.has_joined():
        # if no connection in a few seconds, then reboot
        if utime.time() > 15:
            print("possible timeout")
            machine.reset()
        pass
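A slightly more general form of that loop, as a sketch only: it measures the timeout from the moment waiting starts rather than from an absolute clock value, and it takes the device-specific calls as parameters so the logic can be tried off-device. `wait_for_join` and its parameter names are hypothetical, not part of the Pycom API.

```python
import time

def wait_for_join(has_joined, on_timeout, timeout_s=15,
                  now=time.time, sleep=time.sleep, poll_s=0.5):
    """Poll has_joined() until it returns True.

    If timeout_s seconds pass first, call on_timeout() and return False.
    Returns True once has_joined() reports success.
    """
    deadline = now() + timeout_s
    while not has_joined():
        if now() >= deadline:
            on_timeout()   # e.g. a reset function on the device
            return False
        sleep(poll_s)      # avoid spinning in a tight loop
    return True
```

On a LoPy4 this would presumably be called as `wait_for_join(lora.has_joined, machine.reset)` after the join has been started. That wiring is an assumption and has not been tested against the failure described here.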

It looks and feels like a dirty workaround, though, for something that is breaking somewhere at the OS level.

Anyone with 5+ LoPy4's to test could replicate the issue easily and reliably. This failure state triggers all the time, and it doesn't seem to depend on the gateway model.

If this happened to a device in the field, without our workaround, it would remain in a hung state indefinitely. That, erm, concerns me a little.