Wednesday, February 19, 2014

On allowing shorter timeouts on Mellanox cards, and other tips and tricks

Modifying the minimum timeout detection on Mellanox cards:

If you want to leverage the connection timeout detection of Mellanox cards to design a fault-tolerant system, you quickly realize that the tools at your disposal for detecting a crash work at a time resolution an order of magnitude higher than the latency you are aiming for. This has a significant effect on the cluster management, fault detection and fault recovery systems you can design. Luckily, there is a workaround.

First, the issue:

Mellanox ConnectX-2 NICs enforce a lower limit on timeouts (specifically, the IBV_QP_TIMEOUT option). On these cards the minimum timeout value is 500 ms. Combined with the default setting of 7 retries, this means that after a timeout (e.g., a crashed server) the transmit buffer is held by the NIC for about 4 seconds before it is returned with an error.
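To make the arithmetic concrete, here is a rough sketch in Python. It relies on the verbs encoding of the local ACK timeout (a non-zero value n means 4.096 µs × 2^n, and 0 means wait forever). Two things are assumptions on my side and not something stated above: that the 500 ms floor corresponds to an exponent of 17, and that the NIC waits one full timeout for the initial attempt and for each retry. Real hardware may wait longer per attempt.

```python
# Sketch of the timeout arithmetic, not a measurement.
# Verbs encoding of the local ACK timeout: n != 0 means 4.096 us * 2**n,
# n == 0 means "wait forever".
BASE_US = 4.096


def ack_timeout_seconds(n):
    """Nominal ACK timeout in seconds for the encoded value n."""
    if n == 0:
        return float("inf")
    return BASE_US * (2 ** n) / 1e6


def hold_time_seconds(n, retry_cnt):
    """Nominal time a send buffer stays with the NIC before an error.

    Assumption: one initial attempt plus retry_cnt retries, each one
    waiting a full timeout.
    """
    return (retry_cnt + 1) * ack_timeout_seconds(n)


if __name__ == "__main__":
    # 17 is an assumed exponent for the ~500 ms floor; 14 is an arbitrary
    # smaller value for comparison.
    for n in (17, 14):
        print(n, ack_timeout_seconds(n), hold_time_seconds(n, 7))
```

With an exponent of 17 and 7 retries this gives roughly 0.54 s per attempt and a little over 4 s in total, which is in line with the figure above.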

The consequence:

The nodes that maintained a connection with the faulty server can run out of transmit buffers, which either leads to errors or leaves the whole cluster hanging for a couple of seconds. Not really nice.

The solution:

To fix the problem, you need to modify the firmware in the NICs as follows:

Get the appropriate version of the firmware from Mellanox to start with.

This file needs to be combined with an appropriate .ini file. First, fetch the existing .ini file from the NIC:

    flint -d /dev/mst/mtXXXXXX_pci_cr0 dc > MT_XXXXXX.ini

Check /dev/mst to verify the device name to use there. In this case the .ini file is named after the board_id printed by ibv_devinfo.

Edit the .ini file to add a new qp_minimal_timeout_val parameter with a value of zero. It goes in the HCA section.
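If you would rather script the edit than do it by hand, a small Python helper along these lines could work. This is only a sketch: the function name add_hca_param is made up for this example, and it assumes the dumped file uses plain "[HCA]" style section headers with "key = value" lines. Check the result by eye before using it to build a firmware image.

```python
def add_hca_param(ini_text, key="qp_minimal_timeout_val", value="0",
                  section="HCA"):
    """Return ini_text with 'key = value' placed right after [section].

    Hypothetical helper: assumes '[SECTION]' headers and 'key = value'
    lines. An older setting of the same key in that section is dropped,
    so running it twice does not duplicate the line.
    """
    out = []
    in_section = False
    found = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Section header: track whether we are inside the target one.
            in_section = stripped[1:-1].strip() == section
            out.append(line)
            if in_section:
                found = True
                out.append(f"{key} = {value}")
            continue
        if in_section and stripped.split("=", 1)[0].strip() == key:
            continue  # skip the previous value of the same key
        out.append(line)
    if not found:
        raise ValueError(f"no [{section}] section found")
    return "\n".join(out) + "\n"
```

To use it, read the MT_XXXXXX.ini file dumped earlier into a string, pass it through add_hca_param, and write the result back out. The new line ends up directly under the HCA section header as "qp_minimal_timeout_val = 0".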
