NetworkStorageDeadlock

When using network attached storage, Linux and other OSes are vulnerable to a fundamental deadlock problem. The premise of the problem is that one may need memory in order to free memory, and freeing memory is done mostly when there is very little free memory left.

The problem can happen with normally written files, MAP_SHARED mmapped files and swap, but is most likely to happen with the latter two. It can hit with NFS, iSCSI, AoE or any other network storage protocol, since they all rely on the network layer.

The bug can get triggered as follows:

the system is low on free memory

as a result, kswapd starts trying to free up memory by evicting pages

if the pages are dirty, they have to be written back to storage

writing pages back to storage over the network requires allocating memory

memory for packet headers (if the NIC can assemble the packet itself)

memory for entire packets

kswapd can get this memory, because it has higher priority and the system sets aside something extra for the pageout code

now the system is really low on memory, it may even have no free memory left at all

the NAS appliance receives the write request from the computer

the NAS appliance sends back an ACK packet acknowledging that the data was received

the NAS appliance sends back a packet notifying the OS that the data was written to disk

however, at this point the OS may not have any memory left to receive these packets from the NAS

the OS never knows whether the I/O has completed, since it cannot receive any more network packets

even if it can still receive packets, memory could be filled up with packets from other connections

the computer deadlocks
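The sequence above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: memory is a counter of free pages, the "pageout reserve" is a second counter that only the writeback path may use, and the names `Memory` and `simulate_writeback` are hypothetical and do not correspond to any kernel interface.

```python
class Memory:
    """Toy page allocator: free pages plus a small reserve for the pageout code."""

    def __init__(self, free_pages, pageout_reserve):
        self.free_pages = free_pages
        self.pageout_reserve = pageout_reserve

    def alloc(self, pages, may_use_reserve=False):
        """Return True if the allocation succeeds, False otherwise."""
        if self.free_pages >= pages:
            self.free_pages -= pages
            return True
        available = self.free_pages + self.pageout_reserve
        if may_use_reserve and available >= pages:
            # Use up what is left of normal memory, then dip into the reserve.
            from_reserve = pages - self.free_pages
            self.free_pages = 0
            self.pageout_reserve -= from_reserve
            return True
        return False


def simulate_writeback(mem, packet_pages=2):
    """Walk through one network writeback and report how it ends."""
    # kswapd builds the outgoing write request and may use the pageout reserve.
    if not mem.alloc(packet_pages, may_use_reserve=True):
        return "cannot send"
    # The reply from the NAS needs an ordinary receive buffer, which has
    # no access to the reserve in this model.
    if not mem.alloc(1):
        # The completion is never seen, so the dirty page is never freed.
        return "deadlock"
    return "completed"


if __name__ == "__main__":
    print(simulate_writeback(Memory(free_pages=1, pageout_reserve=1)))
    print(simulate_writeback(Memory(free_pages=10, pageout_reserve=1)))
```

In the first call the send succeeds only by consuming the reserve, which leaves nothing for the reply. In the second call there is enough ordinary memory for both steps.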

Note that locally attached disks do not have this deadlock because Linux has reserves of buffer heads and other data structures needed to start disk I/O. Using these reserves the system can pull itself away from deadlock when normal memory allocations would fail.

Proposal for a solution

This solution is built around two concepts:

IP networks are lossy anyway, so we can throw away non-critical packets;

we can use a reserved memory pool to avoid such deadlocks, provided the memory pool is only used for the right network traffic.

We can identify what network traffic should and should not be able to use these mempools by setting a special flag on the memory-critical network sockets, e.g. SOCK_MEMALLOC.

At packet send time, if the normal memory allocation fails and the current socket is flagged SOCK_MEMALLOC, we can allocate the network buffers from the memory pool reserved for these situations. The network buffer needs to be flagged so that, when it is freed, it goes back into this pool.
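A minimal user-space sketch of this send-side rule follows. It assumes buffers are counted rather than sized, and the names `Allocator`, `Buffer`, `alloc_send_buffer` and `free_buffer` are hypothetical and chosen for illustration only. The `sock_memalloc` argument stands in for the proposed SOCK_MEMALLOC flag on the socket.

```python
class Buffer:
    """A network buffer that remembers which pool it came from."""

    def __init__(self, from_reserve):
        self.from_reserve = from_reserve


class Allocator:
    """Toy allocator with ordinary memory and a reserved pool (hypothetical)."""

    def __init__(self, normal_free, reserve_free):
        self.normal_free = normal_free
        self.reserve_free = reserve_free

    def alloc_send_buffer(self, sock_memalloc):
        """Try normal memory first; fall back to the reserve for flagged sockets."""
        if self.normal_free > 0:
            self.normal_free -= 1
            return Buffer(from_reserve=False)
        if sock_memalloc and self.reserve_free > 0:
            self.reserve_free -= 1
            return Buffer(from_reserve=True)
        return None  # allocation failed; the caller cannot send right now

    def free_buffer(self, buf):
        """Return the buffer to the pool it was taken from."""
        if buf.from_reserve:
            self.reserve_free += 1
        else:
            self.normal_free += 1
```

The flag on the buffer is what keeps the reserve from slowly leaking into ordinary memory: a buffer taken from the reserve always goes back to it.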

Packet receive time is harder, since at the time a packet is received we do not yet know which socket it is for. Once memory runs out we will have to do an allocation from the reserved memory pool for any incoming packet. However, networking is lossy. This means that when we (later) find out that the packet was not for one of the SOCK_MEMALLOC sockets, we can just drop it and pretend we never received it. The sending host will retry it, so everything will be fine.
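The receive-side idea can be sketched in the same toy style. The assumptions here are loud ones: sockets are identified by destination port only, buffers are counted rather than sized, and `Packet`, `Receiver` and `memalloc_ports` are hypothetical names, not part of any real network stack.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dest_port: int
    payload: bytes


class Receiver:
    """Toy receive path: allocate first, find the socket later (hypothetical)."""

    def __init__(self, normal_free, reserve_free, memalloc_ports):
        self.normal_free = normal_free
        self.reserve_free = reserve_free
        self.memalloc_ports = set(memalloc_ports)  # ports of flagged sockets
        self.delivered = []
        self.dropped = 0

    def receive(self, packet):
        """Return True if the packet was delivered, False if it was dropped."""
        # Step 1: a buffer is needed before the destination socket is known.
        if self.normal_free > 0:
            self.normal_free -= 1
            from_reserve = False
        elif self.reserve_free > 0:
            self.reserve_free -= 1
            from_reserve = True
        else:
            self.dropped += 1
            return False
        # Step 2: after demultiplexing, reserve memory may only be kept
        # by packets that belong to a flagged socket.
        if from_reserve and packet.dest_port not in self.memalloc_ports:
            self.reserve_free += 1  # hand the buffer straight back
            self.dropped += 1       # the sender is expected to retransmit
            return False
        self.delivered.append(packet)
        return True
```

For example, with no ordinary memory left and one reserve buffer, a packet for an unflagged port is dropped and the reserve buffer is immediately available again for the storage acknowledgement that follows.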

Dropping packets for non-SOCK_MEMALLOC sockets may need some modifications to certain parts of the network stack, but if it makes it possible to run Linux hosts stably from just iSCSI or NFS root, that is well worth the hassle IMHO...

Daniel Phillips has a patch available that implements a lot of what's needed.

Potential problems with this solution

Know a solution or workaround to any of these problems? Please let riel(at)surriel(dot)com know or edit this page directly.

fragments

most OSes send fragments back-to-front

you need to have all the fragments of a packet before you know whether or not you can discard them

this could be quite a bit of memory

possible solution: if we just allocated the last buffer from the mempool, received a fragment and the packet is not yet complete, we drop all fragments of this packet

workaround: use smaller packets to/from your NAS box
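The "possible solution" for fragments can be sketched as follows. This is a toy model under stated assumptions: every fragment occupies exactly one reserve buffer, fragments carry a zero-based index and the total fragment count, and `FragmentQueue` and `receive_fragment` are hypothetical names. Real IP reassembly works on byte offsets and is considerably more involved.

```python
class FragmentQueue:
    """Toy reassembly queue backed only by a reserve pool (hypothetical)."""

    def __init__(self, reserve_buffers):
        self.reserve_free = reserve_buffers
        self.pending = {}  # packet id -> {fragment index: data}

    def receive_fragment(self, packet_id, index, total, data):
        """Return the reassembled payload once complete, otherwise None."""
        if self.reserve_free == 0:
            return None  # no buffer at all: the fragment is lost
        self.reserve_free -= 1
        frags = self.pending.setdefault(packet_id, {})
        if index in frags:
            self.reserve_free += 1  # duplicate fragment: release its buffer
            return None
        frags[index] = data
        if len(frags) == total:
            # Complete: copy the payload out and release all buffers.
            del self.pending[packet_id]
            self.reserve_free += total
            return b"".join(frags[i] for i in range(total))
        if self.reserve_free == 0:
            # We just used the last buffer and the packet is still
            # incomplete, so drop every fragment of this packet.
            del self.pending[packet_id]
            self.reserve_free += len(frags)
        return None
```

The order of the two checks matters: completion is tested before exhaustion, so a packet whose final fragment happens to take the last buffer is still delivered.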

layered/multiplexed protocols

the protocol could be multiplexed over one TCP/IP connection

may be problematic if there is so much traffic that the swap/VM IO can be drowned in other traffic

iSCSI can have this problem

not a problem? other IO can complete on the same socket during our swap IO, but we already have the memory allocated on which we do that other IO

problem? what if there is a protocol that needs us to allocate memory to process other incoming data, say block invalidations?

needs mempool for such protocol handling?

unfixable?

encrypted traffic

you may need out-of-band traffic (eg. key exchange) before you can receive the ACK

possibly even renegotiation in userspace

these events do not happen very often, so maybe they can be ignored initially?

DHCP

same problems as encrypted traffic

Network I/O protocol enhancements

The life of operating systems could potentially be made easier with some protocol enhancements:

the client can tell the server "I am out of memory, send me ACKs only" to avoid having to process megabytes of in-progress read I/O while out of memory