Followup to “The Common Communications Interface (CCI)”

A few people have remarked to me that they didn’t quite “get” the pair of CCI guest blog entries from Scott Atchley of Oak Ridge (entry 1, entry 2). So let me try to put Scott’s words in concrete terms…

CCI is an API that unifies low-level “native” network APIs. Specifically: many network vendors are doing essentially the same things in their low-level “native” network APIs. In the HPC world, MPI hides all of these different low-level APIs. But there are real-world non-HPC apps out there that need extreme network performance, and therefore end up writing their own unification layers over verbs, Portals, MX, etc. Ick!

So why don’t we unify all these low-level native network APIs?

NOTE: This is quite similar to how we unified the high-level network APIs into MPI.

Two other key facts are important to realize here:

1. A CCI open source reference implementation that supports multiple different network types is available for download. The software was architected around a plugin interface; supporting another network is simply a matter of writing another plugin (quite similar to the OpenFabrics libibverbs library).

2. CCI doesn’t have to be yet-another-software-layer (like uDAPL). Specifically: CCI was very carefully designed such that it can be implemented in firmware (i.e., the software API is a thin shim over native CCI concepts in firmware). At least one vendor already has firmware implementing CCI; it’s not just a(nother) software abstraction layer.

If CCI uses plugins just like libibverbs, why did we bother? I.e., why didn’t we just use libibverbs?

Don’t forget that one of the tenets of CCI is simplicity (the others are portability, performance, scalability, and robustness). In short: CCI is far simpler than the libibverbs API.
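To give a feel for that simplicity, here is a rough sketch of the CCI usage pattern: one endpoint, one event queue, asynchronous sends. The function names follow the published CCI specification, but treat the exact signatures as illustrative rather than authoritative (the spec on the Oak Ridge CCI site is the reference), and note that connection setup is elided:

```c
#include <cci.h>   /* CCI reference implementation header */

void sketch(cci_connection_t *conn)   /* connection setup elided */
{
    uint32_t caps;
    cci_endpoint_t *endpoint;
    cci_event_t *event;

    cci_init(CCI_ABI_VERSION, 0, &caps);            /* one-time library setup */
    cci_create_endpoint(NULL, 0, &endpoint, NULL);  /* one queue for all traffic */

    /* Everything is asynchronous: post a small send... */
    cci_send(conn, "hello", 5, /* context */ NULL, /* flags */ 0);

    /* ...then reap completions (sends, receives, connection events)
       from the endpoint's single event queue. */
    while (cci_get_event(endpoint, &event) == CCI_SUCCESS)
        cci_return_event(event);
}
```

Contrast this with verbs, where the same exchange requires setting up devices, protection domains, queue pairs, completion queues, and memory registration before the first byte moves.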

The CCI implementation is currently open to a limited number of early beta users. While the group is working on making the software ready for 1.0, read the full CCI API specification on the Oak Ridge CCI site; post comments below if you’re interested in more detail.

I'm not sure that a single abstraction layer makes sense for portable (cross O/S) code, especially if your application does things outside of networking (say file I/O). For example, the I/O model for highest performance is going to be different between Linux and Windows.
I also have doubts about whether an API that requires the underlying library to have threads (in order to signal/deliver events) is a good approach, as it requires surfacing a bunch of 'knobs' for the application to control the threading policy (number of threads, affinity, priority, etc).
Point being that simplicity, portability, and performance often conflict with one another.
-Fab

Fab, currently one major API works across all O/Ses and that is Sockets. No one would argue that Sockets exposes the capabilities of today's networking hardware (no zero-copy, no OS bypass). If an application will only ever use Sockets and only run on standard Ethernet NICs, then there is little to gain by using CCI. If the application could run on more capable hardware (whether over Ethernet or another fabric), then CCI might make sense.
Verbs works on most O/Ses and provides access to modern networking features, yet no one would argue that it is a simple API. We believe there is a middle ground.
Forgive me for not knowing the optimal I/O model for Windows, perhaps you could give me an example. CCI is inherently an asynchronous API (similar to MPI). You initiate a send (small message or RMA) and poll for completion. If the app prefers to block via a native O/S method (e.g. select(), poll(), epoll(), kqueue(), WSA*(), etc.), CCI can provide a native OS handle to the application. If CCI does not allow for a high performance implementation in Windows, we would be very interested in what changes CCI would need in order to provide one.
CCI gives the choice to the app. If the app does not want additional threads, it must poll for completions and, if the underlying hardware does not provide progress on its own, poll to ensure progress as well. If the app does not want to burn the cycles to poll, it can block, which requires someone to check for completions and signal the blocker. Whether that someone is a progress thread or the kernel is an implementation detail.
There are trade-offs for simplicity, portability, and performance and we hope that CCI provides the ability for the app to choose the best combination given its needs.

Hi Scott,
> currently one major API works across all O/Ses and that is Sockets
Just to be clear, this is not just Sockets, but synchronous Sockets (blocking or non-blocking, but not aio/overlapped). There is a lot to be gained by moving away from synchronous I/O, though the learning curve for async I/O can be steep. Throw in the concept of scatter/gather and it gets even steeper.
High performance asynchronous I/O applications in Windows can benefit from using I/O completion ports, rather than per-I/O event objects and WaitForMultipleObjects (which has a limit on how many objects can be waited on concurrently). I/O completion ports allow aggregating completions from multiple files or sockets, and the application can poll the completion port or block on it waiting for any I/O completion to be added. Multiple threads can poll events from an I/O completion port. MSMPI today uses I/O completion ports internally, so that we can block for completions if we exceed our polling limit. MSMPI supports blocking for completions for all of our communication channels: SHM, NetworkDirect, and Sockets.
It would be great to allow CCI endpoints to be associated with a user's I/O completion port, allowing users to get completions for CCI events side by side with their file completions, all through one function (GetQueuedCompletionStatus). There are design issues that you'll need to work out, though (who provides the OVERLAPPED structure that identifies the I/O operation and is returned by GetQueuedCompletionStatus, for example).
Cheers,
-Fab
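For readers unfamiliar with I/O completion ports, the pattern Fab describes looks roughly like this. It is a Windows-only fragment, not a working program; the association of a CCI endpoint handle with the port is hypothetical, which is exactly the design issue he raises:

```c
#include <windows.h>

/* One port aggregates completions from every associated handle. */
HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

/* Each file/socket (and, ideally, a CCI endpoint's OS handle) is
   associated with the port along with a per-handle key:
   CreateIoCompletionPort(some_handle, iocp, (ULONG_PTR)key, 0);       */

DWORD bytes;
ULONG_PTR key;
OVERLAPPED *ov;   /* who supplies this for CCI events is the open question */

/* One blocking call returns the next completion, whatever its source. */
if (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
    /* dispatch on key/ov: file I/O, socket I/O, or a CCI event */
}
```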

I'm one of the developers of the Charm++ runtime system for HPC applications, among which NAMD is the most widely used. For the last 15 years, we've maintained our own native machine layers (Elan, Myrinet, SHMEM, InfiniBand, Blue Gene DCMF/PAMI, LAPI, uGNI) because of a huge impedance mismatch between our execution model and what MPI provides. We've been hoping that various past proposals and projects would get some traction (e.g. GASNet), but none seems to have taken off. Could we get access to the CCI beta to do a port, see how it performs, and possibly contribute our expertise to this shared foundation?
