SFQM aims to enable fault detection within DPDK. The first feature to meet
this goal is the DPDK Keep Alive sample application, which is part of DPDK 2.2.

DPDK Keep Alive (KA) is a sample application that acts as a heartbeat/watchdog
for DPDK packet processing cores in order to detect application thread failure.
The application supports the detection of ‘failed’ DPDK cores and notification
to HA/SA middleware. The purpose is to detect packet processing core failures
(e.g. an infinite loop) and to ensure that the failure of a core does not result
in a fault that is undetectable by a management entity.

Fig. 3 DPDK Keep Alive Sample Application

Essentially the app demonstrates how to detect ‘silent outages’ on DPDK packet
processing cores. The application can be decomposed into two specific parts:
detection and notification.

The detection period is programmable/configurable but defaults to 5ms if no
timeout is specified.
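
The detection side can be illustrated with a minimal, standalone sketch. It
does not use the DPDK keepalive library; every name below (ka_monitor, ka_init,
ka_check and so on) is hypothetical, and the two-check grace period before a
core is declared dead is an assumption of this sketch rather than documented
behaviour. Only the 5 ms default is taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define KA_MAX_CORES 8          /* hypothetical limit for this sketch */
#define KA_DEFAULT_PERIOD_MS 5  /* default check period, as stated above */

/* Per-core heartbeat state (names are illustrative only). */
enum ka_state { KA_UNUSED, KA_ALIVE, KA_MISSING, KA_DEAD };

struct ka_monitor {
    uint32_t period_ms;                 /* how often ka_check() should run */
    enum ka_state state[KA_MAX_CORES];  /* one slot per monitored core */
};

/* Initialise the monitor; a period of 0 means "no timeout specified". */
static void ka_init(struct ka_monitor *m, uint32_t period_ms)
{
    m->period_ms = period_ms ? period_ms : KA_DEFAULT_PERIOD_MS;
    for (int i = 0; i < KA_MAX_CORES; i++)
        m->state[i] = KA_UNUSED;
}

/* Start monitoring a core. */
static void ka_register(struct ka_monitor *m, unsigned core)
{
    assert(core < KA_MAX_CORES);
    m->state[core] = KA_ALIVE;
}

/* Called by a packet processing core to signal that it is still running. */
static void ka_mark_alive(struct ka_monitor *m, unsigned core)
{
    assert(core < KA_MAX_CORES);
    m->state[core] = KA_ALIVE;
}

/*
 * Run once per period by the monitor. A core that has not marked itself
 * alive since the previous check moves ALIVE -> MISSING -> DEAD.
 * Returns the number of cores newly declared dead in this pass.
 */
static int ka_check(struct ka_monitor *m)
{
    int newly_dead = 0;

    for (int i = 0; i < KA_MAX_CORES; i++) {
        switch (m->state[i]) {
        case KA_ALIVE:
            m->state[i] = KA_MISSING;
            break;
        case KA_MISSING:
            m->state[i] = KA_DEAD;
            newly_dead++;
            break;
        default:
            break;  /* unused or already dead: nothing to do */
        }
    }
    return newly_dead;
}
```

In this sketch a monitor thread would call ka_check() once every period_ms
milliseconds, so a core that stays silent for two consecutive checks is
reported as dead.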

Notification support is enabled by a hook function, which can provide
‘call back support’ for a fault management application with a compliant
heartbeat mechanism.
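
A hook of this kind is typically a function pointer plus an opaque data
pointer. The following is a minimal sketch under that assumption; the names
ka_notifier, ka_set_hook, ka_report_failure and fault_log_hook are hypothetical
and are not part of DPDK or of any fault management product.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type: invoked with user data and the failed core id. */
typedef void (*ka_failure_cb)(void *data, int core_id);

struct ka_notifier {
    ka_failure_cb cb;  /* hook supplied by the fault management side */
    void *cb_data;     /* opaque pointer handed back to the hook */
};

/* Install (or clear, with cb == NULL) the notification hook. */
static void ka_set_hook(struct ka_notifier *n, ka_failure_cb cb, void *data)
{
    n->cb = cb;
    n->cb_data = data;
}

/* Called by the detection side when a core has been declared dead. */
static void ka_report_failure(const struct ka_notifier *n, int core_id)
{
    if (n->cb != NULL)
        n->cb(n->cb_data, core_id);
}

/* Example hook: records failed cores so a management agent can read them. */
#define FAULT_LOG_MAX 16

struct fault_log {
    int failed_cores[FAULT_LOG_MAX];
    size_t count;
};

static void fault_log_hook(void *data, int core_id)
{
    struct fault_log *log_entry = data;

    assert(log_entry != NULL);
    if (log_entry->count < FAULT_LOG_MAX)
        log_entry->failed_cores[log_entry->count++] = core_id;
}
```

Keeping the hook generic means the detection code does not need to know
whether the receiver writes to a log, raises an alarm, or forwards the event
to middleware.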

The rest of the initialization and run-time path follows the same paths as the
L2 forwarding application. The only additions to the main processing loop are
the mark alive functionality and the example random failures.
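
The shape of such a processing loop can be sketched as follows. This is a
standalone illustration, not the sample application's code: the worker
structure, forward_burst() and the iteration-based lifetime used to inject a
failure are all hypothetical stand-ins, and the lifetime is passed in
explicitly (rather than chosen at random) so that the behaviour is repeatable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct worker {
    uint64_t heartbeat;   /* incremented every time the core marks itself alive */
    uint64_t iterations;  /* main loop iterations completed so far */
    uint64_t lifetime;    /* 0 = never fail; else stop after this many iterations */
};

/* Heartbeat: tell the monitor this core is still making progress. */
static void mark_alive(struct worker *w)
{
    w->heartbeat++;
}

/* Placeholder for the packet forwarding work (receive, forward, transmit). */
static void forward_burst(struct worker *w)
{
    (void)w;
}

/*
 * Main processing loop. Returns true if the worker stopped because of the
 * injected failure, false if it ran for max_iterations without failing.
 */
static bool worker_loop(struct worker *w, uint64_t max_iterations)
{
    while (w->iterations < max_iterations) {
        forward_burst(w);
        mark_alive(w);
        w->iterations++;

        /* Example failure: the core silently stops sending heartbeats. */
        if (w->lifetime != 0 && w->iterations >= w->lifetime)
            return true;
    }
    return false;
}
```

Once the loop exits early, the heartbeat counter stops advancing, which is
the condition a monitor would have to notice.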

Keep Alive Monitor Agent Core Monitoring Options
The application can run on either a host or a guest. As such there are a number
of options for monitoring the Keep Alive Monitor Agent Core through a Local
Agent on the compute node:

Application Location    DPDK KA    LOCAL AGENT
--------------------    -------    -----------
HOST                    X          HOST/GUEST
GUEST                   X          HOST/GUEST

For the first implementation of a Local Agent, SFQM will enable:

Application Location    DPDK KA    LOCAL AGENT
--------------------    -------    -----------
HOST                    X          HOST

This will be achieved by extending the dpdkstat plugin for collectd with KA
functionality and integrating the extended plugin with Monasca for
high-performing, resilient, and scalable fault detection.