Kit, liv and bed are espressobin wireless access points with Atheros QCA9888 chips, each of which has its wan port connected to the upstream switch infrastructure. None of the APs (kit, liv or bed) has dnsmasq or a firewall enabled; the wifi interface is simply bridged with the uplink ethernet port on each espressobin.

Problem:
Wireless clients connecting to any of the access points obtain a dhcp lease almost immediately if they have not connected to any of the APs recently. If I roam, that is, connect to the liv wifi after having been connected to the kit wifi network, DHCP fails. I can see the DHCPDISCOVER travel from the espressobin to the router, and the router sending a DHCPOFFER, which travels via unicast through the tp-link and the netgear to the appropriate port on the netgear switch, but then it disappears; it never appears on the ap uplink ethernet interface. I am positive that the unicast DHCPOFFER packet is being sent to the appropriate switch port; I have verified it by running wireshark on many switch ports simultaneously. Running tcpdump on each espressobin, I do not see these DHCPOFFER packets. However, if I wait some time, say 5 minutes, the offer eventually appears and the wifi client gets an address. The packet also magically begins to appear in tcpdump on the espressobin which is acting as the ap for the wifi client at that particular time. In short, the problem is: associate with one ap, fine initially; move to another ap, DHCPOFFERs dropped by the new ap for 5 minutes, then inexplicably able to obtain an ip and internet access.

I have also noticed that some arp packets are dropped by the espressobin during this process in the same fashion.

Solutions tried:
Turning on proxy arp on the ap ethernet interface -> I realize now this will have no effect because there is only a single subnet.
Completely working around the problem: turning each ap into a nat firewall and creating a private subnet where the ethernet port performs ip masquerading. This works fine, but is not at all what I want.

Does anyone have any idea what could be causing this strange behavior?

It most likely has to do with the mac address table in the MS510TX switch.
5 minutes is a typical arp timeout, so I suspect that upon roaming to another AP, traffic for the client's mac is still forwarded to the port of the old AP.
Have you done any configuration on the switch, taking into consideration that it is managed?
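
To make the stale-entry hypothesis concrete, here is a minimal sketch of a learning switch's mac table with entry ageing. The class name, the port labels and the 300 second ageing time are assumptions for illustration (300 s is the usual 802.1D default and also the Linux bridge default); this is not a model of the MS510TX firmware.

```python
class LearningSwitch:
    """Toy model of a switch mac address table with entry ageing."""

    def __init__(self, ageing_seconds=300):
        self.ageing_seconds = ageing_seconds
        self.table = {}  # mac -> (port, time the mac was last seen as a source)

    def frame_in(self, src_mac, port, now):
        # A switch learns (or re-learns) a mac from the source of any frame.
        self.table[src_mac] = (port, now)

    def lookup(self, dst_mac, now):
        # Return the port for a unicast destination, or None (meaning the
        # frame is flooded) if the entry is missing or has aged out.
        entry = self.table.get(dst_mac)
        if entry is None:
            return None
        port, seen = entry
        if now - seen > self.ageing_seconds:
            del self.table[dst_mac]
            return None
        return port


sw = LearningSwitch()
client = "aa:bb:cc:dd:ee:ff"               # hypothetical client mac
sw.frame_in(client, "port-kit", now=0)     # client talks through kit
print(sw.lookup(client, now=60))           # port-kit (stale if the client has roamed)
print(sw.lookup(client, now=301))          # None: entry aged out, frame is flooded
sw.frame_in(client, "port-liv", now=302)   # any frame from the client via liv
print(sw.lookup(client, now=303))          # port-liv
```

Note that in this model any frame from the client that reaches the switch through the new AP, including a broadcast DHCPDISCOVER, refreshes the entry immediately, so a stale switch entry alone would only explain the symptom if the switch never sees such a frame.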

It could be that the arp mapping in the MS510TX is somehow forwarding it to the wrong place, but the evidence indicates otherwise.

I have experimented with completely bypassing the MS510TX and the problem still occurs. That is, I plug the APs directly into the SG105E tp-link switch and remove the connection between the SG105E and the MS510TX, and the problem still occurs.

I have used wireshark to monitor the MS510TX traffic and I can see that at least the DHCPOFFERs are leaving the appropriate port for the AP that the wifi clients are connected to at the time when they send the DHCPDISCOVER request. If the MS510TX had a different port associated with the client mac address, wouldn't it send the DHCPOFFER to the port of the old ap? It doesn't.

Would it be possible that the espressobins themselves do not think that the mac address of the wifi client currently connected to them is actually in the wifi interface's list of mac addresses? I have monitored the arp tables on the espressobin aps and they appear to be updating appropriately, e.g. when the client roams to a new ap, that ap's arp table entry updates so that it places the client's mac address on the wifi interface, rather than treating it as a neighboring mac address accessible through its wired connection.

Though the arp tables on the aps seem to be updating appropriately, would turning on hairpin mode, that is, allowing the aps to forward arp requests back out of the same interface that they arrived on, permit me to detect that the aps thought that the wifi client was still accessible through the wired interface instead of the wireless interface?

Does that mean that you see the DHCPOFFER leaving the correct port of the MS510TX towards the espressobin, but you don't see it reaching the espressobin?

Yes, that's right. I seem to see them leave the MS510TX on the appropriate port towards the espressobin ap (say liv), and then on the liv ap, where you would expect to see the DHCPOFFER, there is nothing, except after 5 minutes when they inexplicably begin to appear.

Nothing in the logs. I think hostapd is already in debug mode; I don't see anything in the logs really, except for the client successfully associating. To get more information from hostapd or ath10k I would have to download the whole buildroot and recompile with debugging enabled, which I am a bit reluctant to do. I wouldn't be surprised if it were something wrong with either ath10k or hostapd, because the problem only occurs with wireless clients. Roaming with a wired client works fine, for example by unplugging a laptop connected to kit and walking over to liv; it is only the wireless clients that have this issue.

This happens because when you unplug the cable and the port goes down, all the arp information concerning this port is erased. But in wifi no port goes down, so there can be stale entries. However this is not your case, since it seems that the espressobin is discarding the packet.
The debug I was referring to is the option conloglevel in config/system

Can you alter your topology (remove the dual path to the switch)?
Can you try a DHCP release just prior to "roaming" (for testing purposes)?
Can you try sending a gratuitous arp from the client when the entry is stale?
Can you try a DHCP relay on the AP (routed, or just pointing to the dhcp server address)?
Can you put mac changer ifup scripts on the clients?

May/has no effect??? I turned it up anyway to 8 and installed syslog-ng

On kit I see the following message in the logs, which I had actually seen earlier with the previous logging scheme but did not really recognize a correlation.
Jan 13 09:55:14 kit kernel: [ 910.125116] br-wan: port 3(wan) received tcn bpdu
Jan 13 09:55:14 kit kernel: [ 910.130115] br-wan: topology change detected, propagating

br-wan is the bridge which includes the uplink to the MS510TX as well as the wifi interface on it.

The message appears on kit, and once it appears I am able to obtain a dhcp address on kit if I am roaming to kit, or on liv if I happen to be roaming from kit to liv. The strange thing is that I am only able to see the message on kit, even though I believe both of them are configured essentially the same way.
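
To correlate these bridge messages with the moment DHCP starts working, the log could be scanned with something like the following sketch. The regular expression is derived only from the two log lines quoted above; other kernel versions may word the messages differently, and the function name is hypothetical.

```python
import re

# Matches the two br-wan messages quoted above (format assumed from those lines).
LINE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<bridge>[\w.-]+): "
    r"(?P<event>port \d+\((?P<port>[^)]+)\) received tcn bpdu"
    r"|topology change detected, propagating)"
)


def parse_bridge_events(lines):
    """Return (uptime_seconds, bridge, kind, port) for each matching line."""
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        # Only the "received tcn bpdu" message carries a port name.
        kind = "tcn_received" if m.group("port") else "topology_change"
        events.append(
            (float(m.group("uptime")), m.group("bridge"), kind, m.group("port"))
        )
    return events


sample = [
    "Jan 13 09:55:14 kit kernel: [ 910.125116] br-wan: port 3(wan) received tcn bpdu",
    "Jan 13 09:55:14 kit kernel: [ 910.130115] br-wan: topology change detected, propagating",
]
for event in parse_bridge_events(sample):
    print(event)
```

Comparing the uptime values with the timestamps of the first DHCPOFFER seen in tcpdump on the same AP would show whether the two really coincide.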

Thanks for the help so far, trendy. What are the tcn bpdu messages? How do I configure them to update more frequently?

Do you have double links or loops among the switches? The topology you have presented earlier doesn't need any STP.

A tcn bpdu is created upon topology change to notify the root bridge.
So my guess is that the roaming is creating some sort of loop among the espressobins and the switch, which causes the block of the port until it autorecovers from the error disabled state.
Try to disable STP for a start to verify this is the culprit.
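
As a side note on what a topology change does in classic 802.1D STP: while a topology change is flagged, bridges age out learned mac entries after the forward delay instead of the normal ageing time. The sketch below only illustrates that rule. The 300 s and 15 s values are the standard defaults, the function name is hypothetical, the end of the topology change period is ignored, and whether the devices in this network behave this way is an assumption.

```python
def stale_entry_expiry(last_seen, tc_start=None, ageing_time=300, forward_delay=15):
    """Time at which a mac entry last refreshed at `last_seen` is dropped.

    tc_start is the time a topology change is flagged, or None if it never is.
    """
    normal = last_seen + ageing_time
    if tc_start is None:
        return normal
    # Once the topology change is active the shorter ageing applies, but an
    # entry cannot expire before the topology change has actually started.
    shortened = max(tc_start, last_seen + forward_delay)
    return min(normal, shortened)


print(stale_entry_expiry(0))               # 300: normal ageing
print(stale_entry_expiry(0, tc_start=40))  # 40: dropped as soon as the change is flagged
print(stale_entry_expiry(0, tc_start=5))   # 15: still has to reach the forward delay
```

If this mechanism were involved, a stale entry would disappear shortly after a tcn bpdu is processed, which is worth keeping in mind when comparing results with STP enabled and disabled.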

Yes, the network is more complicated than the diagram. First, there are three vlans: a management vlan 1, a lan vlan 10, and an extra vlan on the sg105e to connect to a cable modem on that switch. To complicate things further, there is an apple airport.

Also, some machines have more than one connection to the network; maybe that is what you mean by a double link.

By double link I meant two switches with two cables connecting them, or backup links. That, and other situations, could create a bridge loop, which can render the network useless. We use stp to prevent that.
If there are no loops, backup links, double links or anything like that, then enabling STP won't offer anything and will introduce some delay when a port comes up. There is a transition time to verify that the new link that came up won't create a loop.
The vlans don't matter so much here; just verify that the physical topology doesn't have any loops.
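
To put a rough number on that transition time: in classic 802.1D STP a port that comes up spends one forward delay in the listening state and one in the learning state before it forwards traffic. The sketch below assumes the standard default forward delay of 15 seconds; the value configured on these devices may differ, and the function name is hypothetical.

```python
def port_state_timeline(forward_delay=15):
    """Return (state, start_second, end_second) after an STP port comes up."""
    return [
        ("listening", 0, forward_delay),                  # no learning, no forwarding
        ("learning", forward_delay, 2 * forward_delay),   # learns macs, still no forwarding
        ("forwarding", 2 * forward_delay, None),          # normal operation
    ]


for state, start, end in port_state_timeline():
    print(state, start, end)
# With the default forward delay, forwarding begins 30 seconds after link up.
```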