
Ceph Async Messenger

This post discusses the high-level architecture of the Ceph network layer, focusing on the Async Messenger code flow. Ceph currently has three messenger implementations – Simple, Async, and XIO. Async Messenger is by far the most efficient of the three. It can handle different transport types – POSIX, RDMA, and DPDK. It uses a limited thread pool for connections (sized based on the number of replicas or EC chunks) together with an event-polling system to achieve high concurrency, and it supports epoll, kqueue, and select depending on what the system's environment makes available.

Async Messenger comprises three main components:

Network Stack – initializes the transport stack based on the POSIX, RDMA, or DPDK option

Processor – handles socket-layer communication (bind, listen, accept)

AsyncConnection – maintains the connection state machine

To understand the code flow, let's look at the client-server model, i.e. from the perspective of how a connection is established between a RADOS client and an OSD. Starting with the OSD, three types of messengers are created – public, cluster, and heartbeat.

Network Stack

On creation of an AsyncMessenger instance, it initializes the network stack. Based on the messenger type parameter, a POSIX (or RDMA/DPDK) transport stack is created. Heap memory is allocated here for Processors (more on these below) based on the number of workers.

The number of worker threads created depends on the ms_async_op_threads config value. The EventCenter class, which is an abstraction over event drivers such as epoll, kqueue, and select, is initialized for all the workers. Each worker thread then creates its own epoll (or kqueue/select) instance and blocks waiting to process events inside that thread.

Once this is all set up, the OSD needs to bind a port and listen on the socket for incoming connections.

Processor & AsyncConnection

Socket processing operations such as bind, listen, and accept are handled by the Processor class, which is instantiated in AsyncMessenger's constructor. The OSD calls into the Processor's bind method; after the address and port are bound, it starts listening on the socket and waits for incoming connections.

After bind and listen, the OSD is started in the osd->init() call (in ceph_osd.cc). This calls into the Processor's start method, which adds the listening file descriptor to epoll with a callback to Processor::accept. Now, every time there's a new connection, EventEpoll is ready to process it.

Network connections between clients and servers are maintained via the AsyncConnection state machine. On accepting a connection, the Processor calls into AsyncConnection::accept, which assigns an initial state of START_ACCEPTING. The AsyncConnection read_handler, dispatched as an external event, then begins processing the connection.

This completes the connection establishment between a RADOS client and an OSD server. We did not delve into finer details such as policy parameters, heartbeat clusters, and the EventCenter code – those are worth looking into and piecing together with all of this for a broader understanding. Perhaps in another post!