Start neutron-dhcp-agent with no networks being hosted on it.
Agent reporting is fine: I have manually stepped through this with pdb and triggered the agent report hundreds of times, every 1-2 seconds, and neutron-server always responds in ~1 second.

Now schedule a network onto the agent.
The agent sync now times out.

I can see the reply queue in rabbit and it starts to fill up with unacked messages, and the agent starts to produce the stack trace above consistently.

Removing the network and restarting the agent gets the agent reporting normally again.

Now, if I do the same thing but without using the RabbitMQ SSL port and SSL settings, everything works flawlessly.

We also see this behaviour with nova-compute. Something happens, then all messages get stuck unacked and timeouts appear in the log.

I suspect this has more to do with the python-amqp version, but I'm not certain.
We've tried RabbitMQ versions 3.6.5 and 3.6.10 with SSL enabled, and we've also tried putting an F5 load balancer in front to offload SSL, but to no avail.

That will issue an INFO log message each time an RPC request or reply is acked. If you can run this with your reproducer and post the logs from both the RPC server side and the RPC client side, I'd very much appreciate it.
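For context, a debug patch of that kind can be a thin wrapper around the ack call. This is a hypothetical sketch of the sort of logging described, not the actual patch; `logged_ack` and its arguments are made up for illustration:

```python
import logging

LOG = logging.getLogger("oslo.messaging.debug")

def logged_ack(message, direction):
    """Ack a message and log it at INFO so missing acks stand out.

    Sketch only: `message` is assumed to expose ack() and
    delivery_tag, as kombu messages do.
    """
    message.ack()
    LOG.info("acked %s message, delivery_tag=%s",
             direction, message.delivery_tag)
```

Grepping both sides' logs for these lines then shows whether acks stop at the moment the reply queue starts accumulating unacked messages.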

The commit you identified modified how acknowledgement was handled, and since you're seeing unacked messages on the reply queues, that commit seems suspect. The debug patch I sent you logged how messages were being acked. I expected the log would show that the acks were not getting done, but according to your output oslo.messaging is acking messages:

Both of these issues occur around the time your RPC calls are timing out. This could result in reply messages getting "stuck" in queues, since reply queues are randomly named and regenerated when the transport is re-established.
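To see why replies can be orphaned: the reply queue name carries a random suffix, so a re-established transport declares a fresh queue and anything routed to the old name is never consumed. A sketch of the idea (illustrative naming only; oslo.messaging's exact queue-name format may differ):

```python
import uuid

def new_reply_queue_name():
    # Each transport (re)establishment declares a reply queue with a
    # fresh random suffix (illustrative format, not oslo.messaging's
    # exact one), so replies already routed to the previous queue
    # are stranded there unacked.
    return "reply_" + uuid.uuid4().hex
```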

As a test I set the heartbeat timeout to 15 seconds and used iptables to start dropping RPC traffic between the RPC clients/servers and RabbitMQ. This naturally resulted in RPC failures and unconsumed messages on queues:

Hi Sam, I was getting timeout errors on the nova-compute side of my cloud. I applied your oslo.messaging patch and pinned amqp==1.4.9 and kombu==3.0.37, and it worked well; I no longer get any timeout errors on the nova-compute side. On the nova-conductor side I am using different library versions and it is also working well: amqp 2.2.1, kombu 4.1.0, and oslo.messaging 5.30.2.

As for those error messages: AMQPError is the generic base class for exceptions raised by the amqp library. Unfortunately there's not enough detail in the exception to understand what is actually failing. However, the place where that exception is logged is a connection error handler that causes oslo.messaging to reconnect, which is the best oslo.messaging can do at that point.

What results do you get when moving to the latest release of Pike (5.30.6) using amqp==2.2.1 and kombu==4.1.0?

We upgraded nova to Pike last week and have experienced similar issues, but they seem slightly different, and downgrading amqp and kombu hasn't helped in this case. We had to roll back nova to Ocata and have spent the last few days trying to figure out what the problem is. It's still unknown at this stage.

It looks like this could also be an issue with the Queens versions. We upgraded ceilometer from Ocata to Queens and are seeing similar issues with messages not being acked. We have downgraded back to Ocata for now.

We have the same issues with nova-compute: some messages get stuck in the queue when using SSL.
We've identified that this problem happens periodically, always in the same queue and with large messages (> 100k characters).
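The message size is relevant because TLS caps each record at 16 KiB of plaintext, so a body this large is guaranteed to arrive across several socket reads. Roughly:

```python
TLS_MAX_RECORD = 16 * 1024  # max plaintext bytes per TLS record

def records_needed(payload_bytes):
    # A 100k-character body spans at least 7 TLS records, so a
    # single socket read can never return the whole frame and the
    # reader must loop over short reads.
    return -(-payload_bytes // TLS_MAX_RECORD)  # ceiling division
```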

After a thorough analysis we ended up at the py-amqp socket read function [1],
and more specifically at the SSL class implementation [2].
With a "big" buffer the read function did not work correctly. Debugging this code [3]:
```
frame_type, channel, size = unpack('>BHI', frame_header)
```
we saw data as:
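A misaligned frame header like that is what you get when a short SSL read goes unnoticed and `unpack` runs from the wrong offset. A minimal sketch of a read loop that tolerates short reads (hypothetical helper, not py-amqp's actual implementation):

```python
def read_exactly(sock, n):
    """Read exactly n bytes from sock, looping over short reads.

    An SSL socket returns at most one decrypted TLS record
    (~16 KiB) per recv(), so a single recv(n) for a large frame
    comes back short; parsing the 7-byte frame header from the
    wrong offset then yields garbage frame_type/channel/size
    values. Sketch only, not py-amqp's code.
    """
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed with %d bytes outstanding"
                           % (n - len(buf)))
        buf += chunk
    return bytes(buf)
```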