Anatomy of a Docker Networking Heisenbug

Earlier last week I was asked to look into the intermittent failure of one of the smoke tests for our new gossip-based DNS implementation. The test is quite straightforward – it starts and then kills containers in a loop across a number of hosts, asserting that DNS records and tombstones are propagated in a timely fashion. The distribution of failures was odd, however – some people could reproduce it reliably, while others could not at all; an initial analysis by Tom and Matthias had revealed that for some reason the TCP connection between the weave routers was flapping, resulting in delayed delivery of gossip. Why?

Of course I was completely unable to reproduce this behaviour on my own workstation – more on that later. Consequently I borrowed Tom’s GCE instances, configured the test to use only two weave peers with a single connection and set about the problem with tcpdump and conntrack:
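Roughly, the diagnostic setup was along these lines – a sketch rather than the exact invocation, with the TCP port (6783, weave's usual router port) and the interface name being assumptions about the environment:

```
# Capture the router-to-router TCP traffic on the host's external interface
# (numeric addresses, unix timestamps):
sudo tcpdump -i eth0 -nn -tt 'tcp port 6783'

# In parallel, stream conntrack events for the same flows:
sudo conntrack -E -p tcp --dport 6783
```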

For some reason, netfilter is destroying the flow mapping prematurely! This has two consequences: firstly, further outbound router traffic to $HOST1 no longer has its source address rewritten (see packet at 1438082127.587805) – it will be ignored by $HOST1 (even if it makes it that far – the intervening network may filter the unroutable RFC1918 source address). Secondly, further inbound traffic from $HOST1 is rejected with a TCP RST (see packets at 1438082140.931649, 1438082140.931698 and 1438082140.931828). This explains the connection flap – but why is the kernel prematurely removing an in-use, established flow mapping?
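The flow mapping in question is the one created by Docker's masquerade rule for container traffic leaving the host; both the rule and the live conntrack entry can be inspected directly (same port assumption as above):

```
# Docker's source-NAT rule for traffic leaving the bridge network:
sudo iptables -t nat -S POSTROUTING | grep MASQUERADE

# The corresponding conntrack entry for the router connection, while it exists:
sudo conntrack -L -p tcp --dport 6783
```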

The conntrack events we’ve been examining are delivered to userspace via a netlink socket; this gives us a place to start looking in the kernel source. Poking around in the UAPI files (v3.16.0) we find the following in uapi/linux/netfilter/nf_conntrack_common.h:

Inspection of nf_conntrack_netlink.c reveals that both uses of IPCT_DESTROY occur via nf_ct_delete, which is itself referred to from only two places: the flow timeout handler (death_by_timeout) and the netlink code that processes manual delete requests from userspace. We’re not manually deleting the flow, and since the default timeout for the established state is five days it’s unlikely to be the timeout either. How to make further progress?
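The five-day figure is the conntrack default for established TCP flows and is easy to confirm on a running host – 432000 seconds:

```
# Default timeout, in seconds, for established TCP conntrack entries:
cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established
```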

Clearly nf_ct_delete is being called somehow, and as we've accounted for all the static references in the code it must be via a function pointer. At this point we can turn to the kernel's dynamic function tracing capability to see who is calling nf_ct_delete:
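The recipe is standard ftrace: filter on the function of interest, enable the function tracer with stack traces, and read the pipe while the test runs (paths assume debugfs is mounted at /sys/kernel/debug):

```
cd /sys/kernel/debug/tracing
echo nf_ct_delete > set_ftrace_filter   # only trace this function
echo function > current_tracer          # enable the function tracer
echo 1 > options/func_stack_trace       # record a stack trace for each hit
echo 1 > tracing_on
cat trace_pipe                          # watch for callers while the test runs
```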

Sneaky – it dereferences the pointer inside the timer struct. Anyway, we have another lead – who calls nf_ct_kill_acct, and why? Another quick grep leads us to nf_nat_ipv4_fn in net/ipv4/netfilter/iptable_nat.c:
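The grep itself is unremarkable – something like this, run from the root of a local kernel tree (an assumption, as before):

```
# Find callers of nf_ct_kill_acct in the networking code:
grep -rn 'nf_ct_kill_acct' net/ include/
```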

Revelation – if a packet matching an extant masquerade flow attempts to egress through an interface other than the one that established the flow initially, the flow is destroyed and the packet dropped. Unfortunately we’re back to square one – why has the kernel suddenly decided to route a packet out of a different interface to the one it used milliseconds before? And what is that interface?
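A quick sanity check is to ask the routing table which interface it would choose for this destination – although a steady-state answer like this cannot catch a transient misdirection, which is why more invasive logging is needed below ($HOST1 being the peer address placeholder used throughout):

```
# Which interface and gateway would be chosen for traffic to the peer right now?
ip route get $HOST1
```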

Whilst extremely powerful, dynamic function tracing has some limitations – it’s not possible to inspect the function arguments, which we’d ideally like to do here to determine where the kernel has decided to send our packet. Perhaps we can get the information in another way?
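One such way is the iptables LOG target, whose kernel log output records the input and output devices (and, for bridged traffic, the physical ports) of each matching packet. A sketch of the kind of rule that yields this – the table, chain and match are assumptions rather than the exact rule used:

```
# Log router traffic as it leaves the host, recording OUT= (and PHYSOUT= for
# bridged packets) in the kernel log:
sudo iptables -t mangle -I POSTROUTING -p tcp --dport 6783 \
     -j LOG --log-prefix 'weave-out: '
sudo dmesg -w | grep 'weave-out:'
```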

For some reason, a TCP packet from the router on $HOST2 to $HOST1 is egressing the docker bridge via one of the other application containers' veth interfaces, instead of eth0 on $HOST2 as expected. That certainly constitutes a change of interface, but what on Earth is going on?

One thing I had observed was that the connection flap did not occur if the routers were left connected without the test running, even though they periodically exchange heartbeat traffic; furthermore, the timing of the failure was apparently random, indicating some kind of racy interaction between the weave router traffic and the starting and stopping of our test containers. Could it be that for a brief window during container startup, an erroneous route was in force that affected our traffic?
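That hypothesis is at least easy to watch for – routing table churn during container start and stop shows up immediately in ip monitor, a check that could be run alongside the test:

```
# Print every routing table change as it happens while the test churns containers:
ip monitor route
```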

Something about this hypothesis didn't seem quite right. The dynamic function trace of nf_ct_delete invocations revealed that it was being invoked from the bridge forwarding code – layer 2. Perhaps the problem resided there – let's examine another tcpdump, this time of the traffic on the eth0 interface in the weave router container namespace, with link-level headers displayed:
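Capturing inside the container's network namespace can be done by entering it with nsenter; the container name and the rest of the invocation below are assumptions about how one might reproduce this, with -e asking tcpdump to print link-level headers:

```
# Find the weave router container's PID and join its network namespace:
PID=$(docker inspect -f '{{.State.Pid}}' weave)
sudo nsenter -t "$PID" -n \
     tcpdump -i eth0 -nn -tt -e 'tcp port 6783'
```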

Interesting – the destination MAC is fluctuating. The weave router container has been configured with the docker bridge IP as its default gateway – when it wants to send a packet off-host, it inserts the gateway's MAC (from its ARP cache) into the destination hardware address of the Ethernet frame. Initially that MAC is 3a:a4:1b:bf:23:96 – we can see we're sending packets to $HOST1 and getting responses. The packet at 1438082114.248335 does not get a response – incidentally, its timestamp matches that of the nf_ct_delete call, the conntrack DESTROY message and the peculiar iptables logging. We then see a number of further packets to $HOST1 that do not receive a response, and then for the last two packets (which are some time later) the destination MAC changes to 56:70:1e:fc:91:9e. For some reason the MAC of docker0 is changing, and before ARP in the weave router netns catches up, a packet is sent to the old docker0 MAC – which now happens to correspond to one of the application container veth interfaces. What gives?
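The stale mapping is visible from inside the router's namespace: the neighbour table entry for the default gateway is what supplies that destination MAC, and it only changes once ARP notices the bridge's new address. Using the same nsenter assumption as above:

```
# The gateway's MAC as the router currently believes it to be:
sudo nsenter -t "$PID" -n ip neigh show
```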

It turns out that the standard behaviour of the bridge device on Linux is to assume the lowest MAC of any of its slave interfaces unless it has had one set explicitly – here is the relevant code from net/bridge/br_stp_if.c:

Unless addr_assign_type (visible from /sys/class/net/docker0/addr_assign_type) is NET_ADDR_SET, we loop over all ports to find the numerically lowest MAC and use that. Consequently, whenever our test containers were started, there was a chance that the veth pair created to connect them to the docker0 bridge would have a lower MAC than extant slaves, and this would cause the bridge MAC to change.
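This is straightforward to observe from userspace on an affected machine – 3 is the numeric value of NET_ADDR_SET (a sketch assuming the standard docker0 bridge):

```
# How was the bridge MAC assigned? 3 == NET_ADDR_SET means it was set explicitly.
cat /sys/class/net/docker0/addr_assign_type

# Compare the bridge's current MAC with those of its slave veth interfaces:
cat /sys/class/net/docker0/address
ip link show master docker0 | grep link/ether
```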

Docker 1.6.2 and 1.7.1 use an ioctl to set the MAC address of the bridge, whereas 1.7.0 uses netlink directly, leading to the problems outlined above. This is why I couldn't reproduce the problem – I had rebuilt my test VMs recently and picked up 1.7.1. If you're using 1.7.0 you may want to upgrade – issue filed here.

Footnote – you may have noticed that the debugfs traces and iptables logging show monotonic time since boot, not epoch time like tcpdump and conntrack. I created this utility to compute an approximate offset which could be added to obtain epoch time – the following value applies to the logs in this article:
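The underlying arithmetic is simple to sketch (this is not that utility, just the idea): the boot epoch is approximately the current epoch time minus the seconds-since-boot figure from /proc/uptime, and adding that offset to a monotonic timestamp gives approximate epoch time:

```
# Approximate epoch time of boot = current epoch time - seconds since boot;
# add this offset to a monotonic timestamp to obtain approximate epoch time.
echo "$(date +%s.%N) - $(cut -d' ' -f1 /proc/uptime)" | bc
```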

About the author

Adam has accumulated over thirty years of hands-on experience in writing computer software, eighteen of which have been spent on commercial projects in both the private and public sectors. In the past he has worked as a Systems Programmer for one of the UK's earliest ISPs (Demon Internet), developed network intrusion detection systems under contract to the Defence Evaluation and Research Agency, and served as a Technical Architect for startups in the fields of global multicarrier MPLS VPN provisioning (Nexagent) and smart home automation (PassivSystems).
Due to his work with DERA he has a particular interest in cryptography and security, and has recently focused on developer tools and infrastructure automation with positions at Mind Candy and now Weaveworks.