RFC 7609

IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

4. SMC-R Memory-Sharing Architecture
4.1. RMB Element Allocation Considerations
Each TCP connection using SMC-R must be allocated an RMBE by each
SMC-R peer. This allocation is performed by each endpoint
independently to allow each endpoint to select an RMBE that best
matches the characteristics of its TCP socket endpoint. The RMBE
associated with a TCP socket endpoint must have a receive buffer that
is at least as large as the TCP receive buffer size in effect for
that connection. The receive buffer size is either specified
explicitly by the application using setsockopt() or set implicitly
from the system-configured default value. This allows the SMC-R peer
to RDMA-write an entire receive buffer's worth of data on a given
data flow.
Given that each RMB must have fixed-length RMBEs, this implies that
an SMC-R endpoint may need to maintain multiple RMBs of various sizes
for SMC-R connections on a given SMC-R link and can then select an
RMBE that most closely fits a connection.
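As a non-normative sketch of the allocation rule above, an endpoint that
maintains RMBs of a few fixed element sizes might select the smallest RMBE
that still covers the connection's effective TCP receive buffer. The size
list and function name here are invented for illustration; the protocol
does not prescribe them.

```python
# Illustrative only: candidate RMBE sizes an implementation might maintain.
RMBE_SIZES = [16 * 1024, 32 * 1024, 64 * 1024, 128 * 1024]

def select_rmbe_size(tcp_receive_buffer: int) -> int:
    """Return the smallest RMBE size at least as large as the connection's
    TCP receive buffer, per the allocation rule in Section 4.1."""
    for size in RMBE_SIZES:
        if size >= tcp_receive_buffer:
            return size
    raise ValueError("no RMB with a large enough element size")
```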
4.2. RMB and RMBE Format
An RMB is a virtual memory buffer whose backing real memory is
pinned. The RMB is subdivided into a whole number of equal-sized RMB
Elements (RMBEs). Each RMBE begins with a 4-byte eye catcher for
diagnostic and service purposes, followed by the receive data buffer.
The contents of this diagnostic eye catcher are implementation
dependent and should be used by the local SMC-R peer to check for
overlay errors by verifying an intact eye catcher with every RMBE
access.
The RMBE is a wrapping receive buffer for receiving RDMA writes from
the peer. Cursors, as described below, are exchanged between peers
to manage and track RDMA writes and local data reads from the RMBE
for a TCP connection.
4.3. RMBE Control Information
RMBE control information consists of consumer cursors, producer
cursors, wrap counts, CDC message sequence numbers, control flags
such as urgent data and "writer blocked" indicators, and TCP
connection information such as termination flags. This information
is exchanged between SMC-R peers using CDC messages, which are passed
using RoCE SendMsg. A TCP/IP stack implementing SMC-R must receive
and store this information in its internal data structures, as it is
used to manage the RMBE and its data buffer.

The format and contents of the CDC message are described in detail in
Appendix A.4 ("Connection Data Control (CDC) Message Format"). The
following is a high-level description of what this control
information contains.
o Connection state flags such as sending done, connection closed,
failover data validation, and abnormal close.
o A sequence number that is managed by the sender. This sequence
number starts at 1, is increased each send, and wraps to 0. This
sequence number tracks the CDC message sent and is not related to
the number of bytes sent. It is used for failover data
validation.
o Producer cursor: a wrapping offset into the receiver's RMBE data
area. Set by the peer that is writing into the RMBE, it points to
where the writing peer will write the next byte of data into an
RMBE. This cursor is accompanied by a wrap sequence number to
help the RMBE owner (the receiver) identify full window size
wrapping writes. Note that this cursor must account for (i.e.,
skip over) the RMBE eye catcher that is in the beginning of the
data area.
o Consumer cursor: a wrapping offset into the receiver's RMBE data
area. Set by the owner of the RMBE (the peer that is reading from
it), this cursor points to the offset of the next byte of data to
be consumed by the peer in its own RMBE. The sender cannot write
beyond this cursor into the receiver's RMBE without causing data
loss. Like the producer cursor, this is accompanied by a wrap
count to help the writer identify full window size wrapping reads.
Note that this cursor must account for (i.e., skip over) the RMBE
eye catcher that is in the beginning of the data area.
o Data flags such as urgent data, writer blocked indicator, and
cursor update requests.
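The control information listed above can be summarized as a simple record.
The following Python dataclass is a non-normative model of the fields a CDC
message carries; field names are illustrative, and the actual wire format is
defined in Appendix A.4.

```python
from dataclasses import dataclass

@dataclass
class CdcMessage:
    """Illustrative model of CDC control information (not the wire format)."""
    seq_num: int            # starts at 1, incremented per CDC message, wraps to 0
    producer_cursor: int    # next write offset into the peer's RMBE data area
    producer_wrap: int      # wrap sequence number for the producer cursor
    consumer_cursor: int    # next read offset in the sender's own RMBE
    consumer_wrap: int      # wrap sequence number for the consumer cursor
    writer_blocked: bool = False            # sender blocked, window closed
    urgent_pending: bool = False            # urgent data pending (UrgP)
    urgent_present: bool = False            # urgent data present (UrgA)
    cursor_update_requested: bool = False   # ask peer for prompt cursor updates
```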
4.4. Use of RMBEs
4.4.1. Initializing and Accessing RMBEs
The RMBE eye catcher is initialized by the RMB owner prior to
assigning it to a specific TCP connection and communicating its RMB
index to the SMC-R partner. After an RMBE index is communicated to
the SMC-R partner, the RMBE can only be referenced in "read-only
mode" by the owner, and all updates to it are performed by the remote
SMC-R partner via RDMA write operations.

Initialization of an RMBE must include the following:
o Zeroing out the entire RMBE receive buffer, which helps minimize
data integrity issues (e.g., data from a previous connection
somehow being presented to the current connection).
o Setting the beginning RMBE eye catcher. This eye catcher plays an
important role in helping detect accidental overlays of the RMBE.
The RMB owner should always validate these eye catchers before
each new reference to the RMBE. If the eye catchers are found to
be corrupted, the local host must reset the TCP connection
associated with this RMBE and log the appropriate diagnostic
information.
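The initialization and validation steps above can be sketched as follows.
This is a non-normative illustration; the eye catcher value shown is
invented, since its contents are implementation dependent.

```python
# Illustrative 4-byte eye catcher; the real contents are implementation
# dependent (Section 4.2).
EYE_CATCHER = b"\xf0\xf1\xf2\xf3"

def init_rmbe(size: int) -> bytearray:
    """Zero the entire RMBE receive buffer (guarding against stale data from
    a previous connection) and set the diagnostic eye catcher."""
    rmbe = bytearray(size)      # zero-filled by construction
    rmbe[0:4] = EYE_CATCHER
    return rmbe

def validate_rmbe(rmbe: bytearray) -> None:
    """Check for overlay errors before each reference to the RMBE; a
    corrupted eye catcher forces a TCP connection reset."""
    if bytes(rmbe[0:4]) != EYE_CATCHER:
        raise RuntimeError("RMBE eye catcher corrupted: reset TCP connection")
```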
4.4.2. RMB Element Reuse and Conflict Resolution
RMB elements can be reused once their associated TCP and SMC-R
connections are terminated. Under normal and abnormal SMC-R
connection termination processing, both SMC-R peers must explicitly
acknowledge that they are done using an RMBE before that element can
be freed and reassigned to another SMC-R connection instance. For
more details on SMC-R connection termination, refer to Section 4.8.
However, there are some error scenarios where this two-way explicit
acknowledgment may not be completed. In these scenarios, an RMBE
owner may choose to reassign this RMBE to a new SMC-R connection
instance on this SMC-R link group. When this occurs, the partner
SMC-R peer must detect this condition during SMC-R Rendezvous
processing when presented with an RMBE that it believes is already in
use for a different SMC-R connection. In this case, the SMC-R peer
must abort the existing SMC-R connection associated with this RMBE.
The abort processing resets the TCP connection (if it is still
active), but it must not attempt to perform any RDMA writes to this
RMBE and must also ignore any data sitting in the local RMBE
associated with the existing connection. It then proceeds to free up
the local RMBE and notify the local application that the connection
is being abnormally reset.
The remote SMC-R peer then proceeds to normal processing for this new
SMC-R connection.

4.5. SMC-R Protocol Considerations
The following sections describe considerations for the SMC-R protocol
as compared to TCP.
4.5.1. SMC-R Protocol Optimized Window Size Updates
An SMC-R receiver host sends its consumer cursor information to the
sender to convey the progress that the receiving application has made
in consuming the sent data. The difference between the writer's
producer cursor and the associated receiver's consumer cursor
indicates the window size available for the sender to write into.
This is somewhat similar to TCP window update processing and
therefore has some similar considerations, such as silly window
syndrome avoidance. TCP has an optimization that minimizes the
overhead of very small, unproductive window size updates that arise
when suboptimal socket applications consume very small amounts of
data on every receive() invocation. For SMC-R, the receiver only
updates its consumer cursor via a dedicated CDC message under the
following conditions:
o The current window size (from a sender's perspective) is less than
half of the receive buffer space, and the consumer cursor update
will result in a minimum increase in the window size of 10% of the
receive buffer space. Some examples:
a. Receive buffer size: 64K, current window size (from a sender's
perspective): 50K. No need to update the consumer cursor.
Plenty of space is available for the sender.
b. Receive buffer size: 64K, current window size (from a sender's
perspective): 30K, current window size from a receiver's
perspective: 31K. No need to update the consumer cursor; even
though the sender's window size is < 1/2 of the 64K, the window
update would only increase that by 1K, which is < 1/10th of the
64K buffer size.
c. Receive buffer size: 64K, current window size (from a sender's
perspective): 30K, current window size from a receiver's
perspective: 64K. The receiver updates the consumer cursor
(sender's window size is < 1/2 of the 64K; the window update
would increase that by > 6.4K).

o The receiver must always include a consumer cursor update whenever
it sends a CDC message to the partner for another flow (i.e., send
flow in the opposite direction). This allows the window size
update to be delivered with no additional overhead. This is
somewhat similar to TCP DelayAck processing and quite effective
for request/response data patterns.
o If a peer has set the B-bit in a CDC message, then any consumption
of data by the receiver causes a CDC message to be sent, updating
the consumer cursor until a CDC message with that bit cleared is
received from the peer.
o The optimized window size updates are overridden when the sender
sets the Consumer Cursor Update Requested flag in a CDC message to
the receiver. When this indicator is on, the consumer must send a
consumer cursor update immediately when data is consumed by the
local application or if the cursor has not been updated for a
while (i.e., local copy of the consumer cursor does not match the
last consumer cursor value sent to the partner). This allows the
sender to perform optional diagnostics for detecting a stalled
receiver application (data has been sent but not consumed). It is
recommended that the Consumer Cursor Update Requested flag only be
sent for diagnostic procedures, as it may result in non-optimal
data path performance.
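The first condition above (window under half the buffer, and the update
worth at least 10% of the buffer) can be sketched as a small predicate.
The function and parameter names are illustrative, not part of the
protocol; the assertions mirror examples a, b, and c.

```python
def should_send_cursor_update(buffer_size: int,
                              sender_window: int,
                              receiver_window: int) -> bool:
    """Optimized window update rule: send a consumer cursor update only when
    the sender's view of the window is below half the receive buffer AND the
    update would grow it by at least 10% of the buffer."""
    if sender_window >= buffer_size / 2:
        return False                        # plenty of space (example a)
    gain = receiver_window - sender_window
    return gain >= buffer_size / 10         # examples b (no) and c (yes)
```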
4.5.2. Small Data Sends
The SMC-R protocol makes no special provisions for handling small
data segments sent across a stream socket: data is always sent if
sufficient window space is available. In particular, unlike TCP,
SMC-R defines no equivalent of the Nagle algorithm for coalescing
small data segments.
An implementation of SMC-R can be configured to optimize its sending
processing by coalescing outbound data for a given SMC-R connection
so that it can reduce the number of RDMA write operations it
performs, in a fashion similar to Nagle's algorithm. However, any
such coalescing would require a timer on the sending host that would
ensure that data was eventually sent. Also, the sending host would
have to opt out of this processing if Nagle's algorithm had been
disabled (programmatically or via system configuration).
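A coalescing implementation of the kind described above might look like the
following sketch. Everything here (class name, flush threshold, the timer
hook) is hypothetical: SMC-R mandates none of it, and the timer that
eventually flushes held data is represented only as an explicit method.

```python
class SendCoalescer:
    """Non-normative sketch of Nagle-like coalescing of outbound SMC-R data."""

    def __init__(self, flush_bytes: int = 1460, nagle_enabled: bool = True):
        self.buf = bytearray()
        self.flush_bytes = flush_bytes      # illustrative threshold
        self.nagle_enabled = nagle_enabled  # opt out if Nagle was disabled

    def send(self, data: bytes) -> bytes:
        """Return bytes to RDMA-write now, or b'' if held for coalescing."""
        if not self.nagle_enabled:
            return bytes(data)              # coalescing disabled: send as is
        self.buf += data
        if len(self.buf) >= self.flush_bytes:
            out, self.buf = bytes(self.buf), bytearray()
            return out
        return b""                          # a timer must eventually flush this

    def timer_flush(self) -> bytes:
        """Timer expiry: release whatever is buffered so data is never stuck."""
        out, self.buf = bytes(self.buf), bytearray()
        return out
```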

4.5.3. TCP Keepalive Processing
TCP keepalive processing allows applications to direct the local
TCP/IP host to periodically "test" the viability of an idle TCP
connection. Since SMC-R connections have a TCP representation along
with an SMC-R representation, there are unique keepalive processing
considerations:
o SMC-R-layer keepalive processing: If keepalive is enabled for an
SMC-R connection, the local host maintains a keepalive timer that
reflects how long an SMC-R connection has been idle. The local
host also maintains a timestamp of last activity for each SMC-R
link (for any SMC-R connection on that link). When it is
determined that an SMC-R connection has been idle longer than the
keepalive interval, the host checks to see whether or not the
SMC-R link has been idle for a duration longer than the keepalive
timeout. If both conditions are met, the local host then performs
a TEST LINK LLC command to test the viability of the SMC-R link
over the RoCE fabric (RC-QPs). If a TEST LINK LLC command
response is received within a reasonable amount of time, then the
link is considered viable, and all connections using this link are
considered viable as well. If, however, a response is not
received in a reasonable amount of time or there's a failure in
sending the TEST LINK LLC command, then this is considered a
failure in the SMC-R link, and failover processing to an alternate
SMC-R link must be triggered. If no alternate SMC-R link exists
in the SMC-R link group, then all of the SMC-R connections on this
link are abnormally terminated by resetting the TCP connections
represented by these SMC-R connections. Given that multiple SMC-R
connections can share the same SMC-R link, implementing an SMC-R
link-level probe using the TEST LINK LLC command will help reduce
the amount of unproductive keepalive traffic for SMC-R
connections; as long as some SMC-R connections on a given SMC-R
link are active (i.e., have had I/O activity within the keepalive
interval), then there is no need to perform additional link
viability testing.

o TCP-layer keepalive processing: Traditional TCP "keepalive"
packets are not as relevant for SMC-R connections, given that the
TCP path is not used for these connections once the SMC-R
Rendezvous processing is completed. All SMC-R connections by
default have associated TCP connections that are idle. Are TCP
keepalive probes still needed for these connections? There are
two main scenarios to consider:
1. TCP keepalives that are used to determine whether or not the
peer TCP endpoint is still active. This is not needed for
SMC-R connections, as the SMC-R-level keepalives mentioned
above will determine whether or not the remote endpoint
connections are still active.
2. TCP keepalives that are used to ensure that TCP connections
traversing an intermediate proxy maintain an active state. For
example, stateful firewalls typically maintain state
representing every valid TCP connection that traverses the
firewall. These types of firewalls are known to expire idle
connections by removing their state in the firewall to conserve
memory. TCP keepalives are often used in this scenario to
prevent firewalls from timing out otherwise idle connections.
When using SMC-R, both endpoints must reside in the same
Layer 2 network (i.e., the same subnet). As a result,
firewalls cannot be injected in the path between two SMC-R
endpoints. However, other intermediate proxies, such as
TCP/IP-layer load balancers, may be injected in the path of two
SMC-R endpoints. These types of load balancers also maintain
connection state so that they can forward TCP connection
traffic to the appropriate cluster endpoint. When using SMC-R,
these TCP connections will appear to be completely idle, making
them susceptible to potential timeouts at the load-balancing
proxy. As a result, for this scenario, TCP keepalives may
still be relevant.
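The SMC-R-layer keepalive decision described above (probe the link with a
TEST LINK LLC command only when both the connection and the whole link have
been idle too long) can be sketched as follows. The function name, the
string results, and the test_link callable standing in for the TEST LINK
LLC exchange are all illustrative.

```python
def keepalive_check(conn_last_active: float,
                    link_last_active: float,
                    keepalive_interval: float,
                    now: float,
                    test_link) -> str:
    """Sketch of SMC-R link-level keepalive. test_link is a callable that
    returns True if a TEST LINK LLC response arrived in reasonable time."""
    if now - conn_last_active <= keepalive_interval:
        return "connection-active"   # recent I/O on this connection
    if now - link_last_active <= keepalive_interval:
        return "link-active"         # another connection proved the link viable
    if test_link():
        return "link-viable"         # TEST LINK LLC answered in time
    return "failover"                # switch SMC-R links or reset connections
```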
The following are the TCP-level keepalive processing requirements for
SMC-R-enabled hosts:
o SMC-R peers should allow TCP keepalives to flow on the TCP path of
SMC-R connections based on existing TCP keepalive configuration
and programming options. However, it is strongly recommended that
platforms provide the ability to specify very granular keepalive
timers (for example, single-digit-second timers) and that they
consider providing a configuration option that limits the minimum
keepalive timer used for TCP-layer keepalives on SMC-R
connections. This is important to minimize the number of TCP
keepalive packets transmitted in the network for SMC-R
connections.

o SMC-R peers must always respond to inbound TCP-layer keepalives
(by sending ACKs for these packets) even if the connection is
using SMC-R. Typically, once a TCP connection has completed the
SMC-R Rendezvous processing and is using SMC-R for data flows, no
new inbound TCP segments are expected on that TCP connection,
other than TCP termination segments (FIN, RST, etc.). TCP
keepalives are the one exception that must be supported. Also,
since TCP keepalive probes do not carry any application-layer
data, this has no adverse impact on the application's inbound data
stream.
4.6. TCP Connection Failover between SMC-R Links
In the event of a link failure, a peer may change which SMC-R link
within a link group it uses for sending its writes. Because each
peer chooses the link it writes over for a specific TCP connection,
this switch is performed independently by each peer.
4.6.1. Validating Data Integrity
Even though RoCE is a reliable transport, there is a small subset of
failure modes that could cause unrecoverable loss of data. When an
RNIC acknowledges receipt of an RDMA write to its peer, that creates
a write completion event to the sending peer, which allows the sender
to release any buffers it is holding for that write. In normal
operation and in most failures, this operation is reliable.
However, there are failure modes possible in which a receiving RNIC
has acknowledged an RDMA write but then was not able to place the
received data into its host memory -- for example, a sudden,
disorderly failure of the interface between the RNIC and the host.
While rare, these types of events must be guarded against to ensure
data integrity. The process for switching SMC-R links during
failover, as described in this section, guards against this
possibility and is mandatory.
Each peer must track the current state of the CDC sequence numbers
for a TCP connection. The sender must keep track of the sequence
number of the CDC message that described the last write acknowledged
by the peer RNIC, or Sequence Sent (SS). In other words, SS
describes the last write that the sender believes its peer has
successfully received. The receiver must keep track of the sequence
number of the CDC message that described the last write that it has
successfully received (i.e., the data has been successfully placed
into an RMBE), or Sequence Received (SR).

When an RNIC fails and the sender changes SMC-R links, the sender
must first send a CDC message with the F-bit (failover validation
indicator; see Appendix A.4) set over the new SMC-R link. This is
the failover data validation message. The sequence number in this
CDC message is equal to SS. The CDC message key, the length, and the
SMC-R alert token are the only other fields in this CDC message that
are significant. No reply is expected from this validation message,
and once the sender has sent it, the sender may resume sending on the
new SMC-R link as described in Section 4.6.2.
Upon receipt of the failover validation message, the receiver must
verify that its SR value for the TCP connection is equal to or
greater than the sequence number in the failover validation message.
If so, no further action is required, and the TCP connection resumes
on the new SMC-R link. If SR is less than the sequence number value
in the validation message, data has been lost, and the receiver must
immediately reset the TCP connection.
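The receiver-side check described above reduces to a single comparison of
SR against the sequence number carried in the failover validation message.
The sketch below ignores sequence number wrapping for clarity; the function
name is illustrative.

```python
def validate_failover(sr: int, msg_seq: int) -> bool:
    """Receiver check of the failover data validation message (F-bit set).
    msg_seq carries the sender's SS: the last CDC sequence number the sender
    believes was delivered. Data is intact only if SR has caught up to it.
    Returns False when data was lost and the TCP connection must be reset."""
    return sr >= msg_seq
```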
4.6.2. Resuming the TCP Connection on a New SMC-R Link
When a connection is moved to a new SMC-R link and the failover
validation message has been sent, the sender can immediately resume
normal transmission. In order to preserve the application message
stream, the sender must replay any RDMA writes (and their associated
CDC messages) that were in progress or failed when the previous SMC-R
link failed, before sending new data on the new SMC-R link. The
sender has two options for accomplishing this:
o Preserve the sequence numbers "as is": Retry all failed and
pending operations as they were originally done, including
reposting all associated RDMA write operations and their
associated CDC messages without making any changes. Then resume
sending new data using new sequence numbers.
o Combine pending messages and possibly add new data: Combine failed
and pending messages into a single new write with a new sequence
number. This allows the sender to combine pending messages into
fewer operations. As a further optimization, this write can also
include new data, as long as all failed and pending data are also
included. If this approach is taken, the sequence number must be
increased beyond the last failed or pending sequence number.
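The second option above (combine failed and pending writes, possibly with
new data, under one new sequence number) can be sketched as follows. The
function name and the (seq, bytes) representation are illustrative.

```python
def plan_replay(pending, new_data: bytes = b""):
    """Sketch of the 'combine' resend option: merge all failed/pending writes
    (a list of (seq, payload) pairs, in order) into one operation, optionally
    appending new data. The combined message's sequence number must exceed
    every replayed sequence number."""
    payload = b"".join(data for _, data in pending) + new_data
    next_seq = max(seq for seq, _ in pending) + 1
    return next_seq, payload
```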

4.7. RMB Data Flows
The following sections describe the RDMA wire flows for the SMC-R
protocol after a TCP connection has switched into SMC-R mode (i.e.,
SMC-R Rendezvous processing is complete and a pair of RMB elements
has been assigned and communicated by the SMC-R peers). The ladder
diagrams below include the following:
o RMBE control information kept by each peer. Only a subset of the
information is depicted, specifically only the fields that reflect
the stream of data written by Host A and read by Host B.
o Time line 0-x, which shows the wire flows in a time-relative
fashion.
o Note that RMBE control information is only shown in a time
interval if its value changed (otherwise, assume that the value is
unchanged from the previously depicted value).
o The local copy of the producer cursors and consumer cursors that
is maintained by each host is not depicted in these figures. Note
that the cursor values in the diagram reflect the necessity of
skipping over the eye catcher in the RMBE data area. They start
and wrap at 4, not 0.
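The cursor arithmetic implied above (advance through the data area, wrap
back to offset 4 rather than 0 to skip the eye catcher) can be sketched as
follows; the function name is illustrative.

```python
EYE_CATCHER_LEN = 4  # cursors start and wrap at 4, not 0

def advance_cursor(cursor: int, nbytes: int, buffer_size: int):
    """Advance a producer/consumer cursor through the RMBE data area.
    Returns (new_cursor, wrapped), where wrapped indicates that the
    associated wrap sequence number should be incremented."""
    data_len = buffer_size - EYE_CATCHER_LEN
    offset = cursor - EYE_CATCHER_LEN          # position within the data area
    new_offset = (offset + nbytes) % data_len  # wrap within the data area
    return new_offset + EYE_CATCHER_LEN, offset + nbytes >= data_len
```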
4.7.1. Scenario 1: Send Flow, Window Size Unconstrained
       SMC Host A                                     SMC Host B
      RMBE A Info                                    RMBE B Info
   (Consumer Cursors)                             (Producer Cursors)
   Cursor  Wrap Seq#  Time                  Time  Cursor  Wrap Seq#  Flags
   4       0          0                     0     4       0          0
   0       0          1  ---------------->  1     0       0          0
                            RDMA-WR Data
                              (4:1003)
   4       0          2  ................>  2     1004    0          0
                            CDC Message
Figure 16: Scenario 1: Send Flow, Window Size Unconstrained
Scenario assumptions:
o Kernel implementation.
o New SMC-R connection; no data has been sent on the connection.

o Host A: Application issues send for 1000 bytes to Host B.
o Host B: RMBE receive buffer size is 10,000; application has issued
a recv for 10,000 bytes.
Flow description:
1. The application issues a send() for 1000 bytes; the SMC-R layer
copies data into a kernel send buffer. It then schedules an RDMA
write operation to move the data into the peer's RMBE receive
buffer, at relative position 4-1003 (to skip the 4-byte
eye catcher in the RMBE data area). Note that no immediate data
or alert (i.e., interrupt) is provided to Host B for this RDMA
operation.
2. Host A sends a CDC message to update the producer cursor to
byte 1004. This CDC message will deliver an interrupt to Host B.
At this point, the SMC-R layer can return control back to the
application. Host B, once notified of the completion of the
previous RDMA operation, locates the RMBE associated with the RMBE
alert token that was included in the message and proceeds to
perform normal receive-side processing, waking up the suspended
application read thread, copying the data into the application's
receive buffer, etc. It will use the producer cursor as an
indicator of how much data is available to be delivered to the
local application. After this processing is complete, the SMC-R
layer will also update its local consumer cursor to match the
producer cursor (i.e., indicating that all data has been
consumed). Note that a message to the peer updating the consumer
cursor is not needed at this time, as the window size is
unconstrained (> 1/2 of the receive buffer size). The window size
is calculated by taking the difference between the producer cursor
and the consumer cursor in the RMBEs (10,000 - 1004 = 8996).
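The window size calculation in step 2 can be written out, including the
role of the wrap sequence numbers (equal cursors mean an empty data area
when the wrap counts match, and a completely full one when the producer
has wrapped once more). The function is a non-normative sketch.

```python
EYE_CATCHER_LEN = 4  # the data area excludes the 4-byte eye catcher

def window_size(buffer_size: int,
                producer_cursor: int, consumer_cursor: int,
                producer_wrap: int, consumer_wrap: int) -> int:
    """Space (in bytes) the sender may still write into the peer's RMBE."""
    data_len = buffer_size - EYE_CATCHER_LEN
    used = producer_cursor - consumer_cursor
    if used < 0 or (used == 0 and producer_wrap != consumer_wrap):
        used += data_len                    # producer has wrapped past consumer
    return data_len - used
```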

Scenario assumptions:
o New SMC-R connection; no data has been sent on this connection.
o Host A: Application issues send for 3000 bytes to Host B and then
another send for 4000 bytes.
o Host B: RMBE receive buffer size is 10,000. Application has
already issued a recv for 10,000 bytes.
Flow description:
1. The application issues a send() for 3000 bytes; the SMC-R layer
copies data into a kernel send buffer. It then schedules an RDMA
write operation to move the data into the peer's RMBE receive
buffer, at relative position 4-3003. Note that no immediate data
or alert (i.e., interrupt) is provided to Host B for this RDMA
operation.
2. Host A sends a CDC message to update its producer cursor to
byte 3004. This CDC message will deliver an interrupt to Host B.
At this point, the SMC-R layer can return control back to the
application.
3. Host B, once notified of the receipt of the previous CDC message,
locates the RMBE associated with the RMBE alert token and proceeds
to perform normal receive-side processing, waking up the suspended
application read thread, copying the data into the application's
receive buffer, etc. After this processing is complete, the SMC-R
layer will also update its local consumer cursor to match the
producer cursor (i.e., indicating that all data has been
consumed). It will not, however, update the partner with this
information, as the window size is not constrained
(10,000 - 3000 = 7000 bytes of available space). The application
on Host B also issues a new recv() for 10,000 bytes.
4. On Host A, the application issues a send() for 4000 bytes. The
SMC-R layer copies the data into a kernel buffer and schedules an
async RDMA write into the peer's RMBE receive buffer at relative
position 3004-7003. Note that no alert is provided to Host B for
this flow.
5. Host A sends a CDC message to update the producer cursor to
byte 7004. This CDC message will deliver an interrupt to Host B.
At this point, the SMC-R layer can return control back to the
application.

Scenario assumptions:
o Kernel implementation.
o Existing SMC-R connection, Host B's receive window size is fully
open (peer consumer cursor = peer producer cursor).
o Host A: Application issues send for 20,000 bytes to Host B.
o Host B: RMBE receive buffer size is 10,000; application has issued
a recv for 10,000 bytes.
Flow description:
1. The application issues a send() for 20,000 bytes; the SMC-R layer
copies data into a kernel send buffer (assumes that send buffer
space of 20,000 is available for this connection). It then
schedules an RDMA write operation to move the data into the peer's
RMBE receive buffer, at relative position 1004-9999. Note that no
immediate data or alert (i.e., interrupt) is provided to Host B
for this RDMA operation.
2. Host A then schedules an RDMA write operation to fill the
remaining 1000 bytes of available space in the peer's RMBE receive
buffer, at relative position 4-1003. Note that no immediate data
or alert (i.e., interrupt) is provided to Host B for this RDMA
operation. Also note that an implementation of SMC-R may optimize
this processing by combining steps 1 and 2 into a single
RDMA write operation (with two different data sources).
3. Host A sends a CDC message to update the producer cursor to
byte 1004. Since the entire receive buffer space is filled, the
producer writer blocked flag (the "Wrt Blk" indicator (flag) in
Figure 19) is set and the producer cursor wrap sequence number
(the producer "Wrap Seq#" in Figure 19) is incremented. This CDC
message will deliver an interrupt to Host B. At this point, the
SMC-R layer can return control back to the application.
4. Host B, once notified of the receipt of the previous CDC message,
locates the RMBE associated with the RMBE alert token and proceeds
to perform normal receive-side processing, waking up the suspended
application read thread, copying the data into the application's
receive buffer, etc. In this scenario, Host B notices that the
producer cursor has not been advanced (same value as the consumer
cursor); however, it notices that the producer cursor wrap
sequence number is different from its local value (1), indicating
that a full window of new data is available. All of the data in
the receive buffer can be processed, with the first segment
(1004-9999) followed by the second segment (4-1003). Because the
producer writer blocked indicator was set, Host B schedules a CDC
message to update its latest information to the peer: consumer
cursor (1004), consumer cursor wrap sequence number (the current
value of 2 is used).
5. Host A, upon receipt of the CDC message, locates the TCP
connection associated with the alert token and, upon examining the
control information provided, notices that Host B has consumed all
of the data (based on the consumer cursor and the consumer cursor
wrap sequence number) and initiates the next RDMA write to fill
the receive buffer at offset 1004-9999.
6. Host A then moves the next 1000 bytes into the beginning of the
receive buffer (4-1003) by scheduling an RDMA write operation.
Note that at this point there are still 8 bytes remaining to be
written.
7. Host A then sends a CDC message to set the producer writer blocked
indicator and to increment the producer cursor wrap sequence
number (3).
8. Host B, upon notification, completes the same processing as step 4
above, including sending a CDC message to update the peer to
indicate that all data has been consumed. At this point, Host A
can write the final 8 bytes to Host B's RMBE into
positions 1004-1011 (not shown).

2. Host A sends a CDC message to update its producer cursor to
byte 1500 and to turn on the producer Urgent Data Pending (UrgP)
and Urgent Data Present (UrgA) flags. This CDC message will
deliver an interrupt to Host B. At this point, the SMC-R layer
can return control back to the application.
3. Host B, once notified of the receipt of the previous CDC message,
locates the RMBE associated with the RMBE alert token, notices
that the Urgent Data Pending flag is on, and proceeds with out-of-
band socket API notification -- for example, satisfying any
outstanding select() or poll() requests on the socket by
indicating that urgent data is pending (i.e., by setting the
exception bit on). The urgent data present indicator allows
Host B to also determine the position of the urgent data (the
producer cursor points 1 byte beyond the last byte of urgent
data). Host B can then perform normal receive-side processing
(including specific urgent data processing), copying the data into
the application's receive buffer, etc. Host B then sends a CDC
message to update the partner's RMBE control area with its latest
consumer cursor (1500). Note that this CDC message must occur,
regardless of the current local window size that is available.
The partner host (Host A) cannot initiate any additional RDMA
writes until it receives acknowledgment that the urgent data has
been processed (or at least processed/remembered at the SMC-R
layer).
4. Upon receipt of the message, Host A wakes up, sees that the peer
consumed all data up to and including the last byte of urgent
data, and now resumes sending any pending data. In this case, the
application had previously issued a send for 1000 bytes of normal
data, which would have been copied in the send buffer, and control
would have been returned to the application. Host A now initiates
an RDMA write to move that data to the peer's receive buffer at
position 1500-2499.
5. Host A then sends a CDC message to update its producer cursor
value (2500) and to turn off the Urgent Data Pending and Urgent
Data Present flags. Host B wakes up, processes the new data
(resumes application, copies data into the application receive
buffer), and then proceeds to update the local current consumer
cursor (2500). Given that the window size is unconstrained, there
is no need for a consumer cursor update in the peer's RMBE.

Flow description:
1. The application issues a send() for 500 bytes of urgent data; the
SMC-R layer copies data into a kernel send buffer (if available).
Since the writer is blocked (window size closed), it cannot send
the data immediately. It then sends a CDC message to notify the
peer of the Urgent Data Pending (UrgP) indicator (the writer
blocked indicator remains on as well). This serves as a signal to
Host B that urgent data is pending in the stream. Control is also
returned to the application at this point.
2. Host B, once notified of the receipt of the previous CDC message,
locates the RMBE associated with the RMBE alert token, notices
that the Urgent Data Pending flag is on, and proceeds with out-of-
band socket API notification -- for example, satisfying any
outstanding select() or poll() requests on the socket by
indicating that urgent data is pending (i.e., by setting the
exception bit on). At this point, it is expected that the
application will enter urgent data mode processing, expeditiously
processing all normal data (by issuing recv API calls) so that it
can get to the urgent data byte. Whether the application has this
urgent mode processing or not, at some point, the application will
consume some or all of the pending data in the receive buffer.
When this occurs, Host B will also send a CDC message to update
its consumer cursor and consumer cursor wrap sequence number to
the peer. In the example above, a full window's worth of data was
consumed.
3. Host A, once awakened by the message, will notice that the window
size is now open on this connection (based on the consumer cursor
and the consumer cursor wrap sequence number, which now matches
the producer cursor wrap sequence number) and resume sending of
the urgent data segment by scheduling an RDMA write into relative
position 1000-1499.
4. Host A then sends a CDC message to advance its producer cursor
(1500) and to also notify Host B of the Urgent Data Present (UrgA)
indicator (and turn off the writer blocked indicator). This
signals to Host B that the urgent data is now in the local receive
buffer and that the producer cursor points to the last byte of
urgent data.
5. Host B wakes up, processes the urgent data, and, once the urgent
data is consumed, sends a CDC message to update its consumer
cursor (1500).

6. Host A wakes up, sees that Host B has consumed the sequence number
associated with the urgent data, and then initiates the next RDMA
write operation to move the 1000 bytes associated with the next
send() of normal data into the peer's receive buffer at
position 1500-2499. Note that the send API would have likely
completed earlier in the process by copying the 1000 bytes into a
send buffer and returning back to the application, even though we
could not send any new data until the urgent data was processed
and acknowledged by Host B.
7. Host A sends a CDC message to advance its producer cursor to 2500
and to reset the Urgent Data Pending and Urgent Data Present
flags. Host B wakes up and processes the inbound data.
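The sender-side portion of steps 1-4 above can be condensed into a small
sketch. The flag and field names (urgp, urga, writer_blocked) are
illustrative stand-ins for the CDC indicators described in the text, not
wire-format definitions.

```python
# Illustrative walk-through of the blocked-writer urgent-data flow above.

class CdcMsg:
    """Stand-in for the CDC control message fields used in this flow."""
    def __init__(self, producer_cursor, urgp=False, urga=False,
                 writer_blocked=False):
        self.producer_cursor = producer_cursor
        self.urgp = urgp                  # Urgent Data Pending indicator
        self.urga = urga                  # Urgent Data Present indicator
        self.writer_blocked = writer_blocked

def blocked_urgent_send(window_open):
    """Return the CDC messages Host A emits for a 500-byte urgent send
    issued while the window is closed (steps 1 and 3-4 above)."""
    msgs = []
    # Step 1: window closed -- signal that urgent data is pending; the
    # writer blocked indicator remains on.
    msgs.append(CdcMsg(producer_cursor=1000, urgp=True, writer_blocked=True))
    if window_open:
        # Steps 3-4: window reopened -- urgent data RDMA-written into
        # relative position 1000-1499; the producer cursor now marks the
        # last byte of urgent data, and writer blocked is turned off.
        msgs.append(CdcMsg(producer_cursor=1500, urga=True,
                           writer_blocked=False))
    return msgs
```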
4.8. Connection Termination
Just as SMC-R connections are established using a combination of TCP
connection establishment flows and SMC-R protocol flows, the
termination of SMC-R connections also uses a similar combination of
SMC-R protocol termination flows and normal TCP connection
termination flows. The following sections describe the SMC-R
protocol normal and abnormal connection termination flows.
4.8.1. Normal SMC-R Connection Termination Flows
Normal SMC-R connection flows are triggered via the normal stream
socket API semantics, namely by the application issuing a close() or
shutdown() API. Most applications, after consuming all incoming data
and after sending any outbound data, will then issue a close() API to
indicate that they are done both sending and receiving data. Some
applications, typically a small percentage, make use of the
shutdown() API that allows them to indicate that the application is
done sending data, receiving data, or both sending and receiving
data. The main use of this API is scenarios where a TCP application
wants to alert its partner endpoint that it is done sending data but
is still receiving data on its socket (shutdown for write). Issuing
shutdown() for both sending and receiving data is really no different
than issuing a close() and can therefore be treated in a similar
fashion. Shutdown for read is typically not a very useful operation
and in normal circumstances does not trigger any network flows to
notify the partner TCP endpoint of this operation.
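The mapping between these socket API trigger points and the CDC indicators
used by the SMC-R termination flows can be summarized in a short sketch.
The indicator strings follow the text; the function itself is illustrative.

```python
# Illustrative mapping of stream-socket termination APIs to the SMC-R
# CDC indicators described in this section.

PEER_CONNECTION_CLOSED = "PeerConnectionClosed"
PEER_DONE_WRITING = "PeerDoneWriting"

def termination_indicator(api, how=None):
    """Return the CDC indicator a close()/shutdown() call triggers,
    or None when no network flow is triggered (shutdown for read)."""
    if api == "close":
        return PEER_CONNECTION_CLOSED
    if api == "shutdown":
        if how == "rdwr":     # shutdown for both is treated like close()
            return PEER_CONNECTION_CLOSED
        if how == "wr":       # done sending, still receiving
            return PEER_DONE_WRITING
        if how == "rd":       # shutdown for read triggers no flow
            return None
    raise ValueError("unknown API call")
```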
These same trigger points will be used by the SMC-R layer to initiate
SMC-R connection termination flows. The main design point for SMC-R
normal connection flows is to use the SMC-R protocol to first shut
down the SMC-R connection and free up any SMC-R RDMA resources, and
then allow the normal TCP connection termination protocol (i.e., FIN
processing) to drive cleanup of the TCP connection. This design

2. An SMC-R connection progresses to the Active state once the SMC-R
Rendezvous processing has successfully completed, RMB element
indices have been exchanged, and SMC-R links have been activated.
In this state, the TCP connection is fully established, rendezvous
processing has been completed, and SMC-R peers can begin the
exchange of data via RDMA.
3. Active close processing (on the SMC-R peer that is initiating the
connection termination).
A. When an application on one of the SMC-R connection peers issues
a close(), a shutdown() for write, or a shutdown() for both
read and write, the SMC-R layer on that host will initiate
SMC-R connection termination processing. First, if a close()
or shutdown(both) is issued, it will check whether any data in
the local RMB element has not yet been read by the
application. If unread data is detected, the SMC-R connection
must be abnormally reset; for more details on this, refer to
Section 4.8.2 ("Abnormal SMC-R Connection Termination Flows").
If no unread data is pending, it then checks to see whether or
not any outstanding data is waiting to be written to the peer,
or if any outstanding RDMA writes for this SMC-R connection
have not yet completed. If either of these two scenarios is
true, an indicator that this connection is in a pending close
state is saved in internal data structures representing this
SMC-R connection, and control is returned to the application.
If all data to be written to the partner has completed, this
peer will send a CDC message to notify the peer of either the
PeerConnectionClosed indicator (close or shutdown for both was
issued) or the PeerDoneWriting indicator. This will provide an
interrupt to inform the partner SMC-R peer that the connection
is terminating. At this point, the local side of the SMC-R
connection transitions into the PeerCloseWait1 state, and control
can be returned to the application. If this process could not
be completed synchronously (the pending close condition
mentioned above), it is completed when all RDMA writes for data
and control cursors have been completed.
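Step 3A's decision sequence can be sketched as follows. The connection is
modeled as a plain dictionary, and all names are illustrative rather than
taken from a real implementation.

```python
# Illustrative sketch of the active-close decision in step 3A above.

def active_close(conn, api_how):
    """Return the action taken when the local application closes or shuts
    down the socket on the active close side."""
    # close() or shutdown(both) with unread data forces an abnormal reset
    # (see Section 4.8.2).
    if api_how in ("close", "shutdown_rdwr") and conn["unread_data"]:
        return "abnormal_reset"
    # Outstanding data or incomplete RDMA writes: remember a pending
    # close and return control to the application; the notification is
    # completed later, when all writes have finished.
    if conn["pending_writes"] or conn["incomplete_rdma_writes"]:
        conn["pending_close"] = True
        return "pending_close"
    # All data written: notify the peer and enter PeerCloseWait1.
    indicator = ("PeerConnectionClosed"
                 if api_how in ("close", "shutdown_rdwr")
                 else "PeerDoneWriting")
    conn["state"] = "PeerCloseWait1"
    return indicator
```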
B. At some point, the SMC-R peer application (passive close) will
consume all incoming data, realize that the partner is done
sending data on this connection, and proceed to initiate its
own close of the connection once it has completed sending all
data from its end. The partner application can initiate this
connection termination processing via close() or shutdown()
APIs. If the application does so by issuing a shutdown() for
write, then the partner SMC-R layer will send a CDC message to
notify the peer (the active close side) of the PeerDoneWriting
indicator. When the "active close" SMC-R peer wakes up as a

result of the previous CDC message, it will notice that the
PeerDoneWriting indicator is now on and transition to the
PeerCloseWait2 state. This state indicates that the peer is
done sending data and may still be reading data. At this
point, the "active close" peer will also need to ensure that
any outstanding recv() calls for this socket are woken up and
remember that no more data is forthcoming on this connection
(in case the local connection was shutdown() for write only).
C. This flow is a common transition from 3A or 3B above. When the
SMC-R peer (passive close) consumes all data and updates all
necessary cursors to the peer, and the application closes its
socket (close or shutdown for both), it will send a CDC message
to the peer (the active close side) with the
PeerConnectionClosed indicator set. At this point, the
connection can transition back to the Closed state if the local
application has already closed (or issued shutdown for both)
the socket. Once in the Closed state, the RMBE can now be
safely reused for a new SMC-R connection. When the
PeerConnectionClosed indicator is turned on, the SMC-R peer is
indicating that it is done updating the partner's RMBE.
D. Conditional state: If the local application has not yet issued
a close() or shutdown(both), we need to wait until the
application does so. Once it does, the local host will send a
CDC message to notify the peer of the PeerConnectionClosed
indicator and then transition to the Closed state.
4. Passive close processing (on the SMC-R peer that receives an
indication that the partner is closing the connection).
A. Upon receipt of a CDC message, the SMC-R layer will detect that
the PeerConnectionClosed indicator or PeerDoneWriting indicator
is on. If any outstanding recv() calls are pending, they are
completed with an indicator that the partner has closed the
connection (zero-length data presented to the application). If
there is any pending data to be written and
PeerConnectionClosed is on, then an SMC-R connection reset must
be performed. The connection then enters the AppCloseWait1
state on the passive close side waiting for the local
application to initiate its own close processing.
B. If the local application issues a shutdown() for writing, then
the SMC-R layer will send a CDC message to notify the partner
of the PeerDoneWriting indicator and then transition the local
side of the SMC-R connection to the AppCloseWait2 state.

C. When the application issues a close() or shutdown() for both,
the local SMC-R peer will send a message informing the peer of
the PeerConnectionClosed indicator and transition to the Closed
state if the remote peer has also sent the local peer the
PeerConnectionClosed indicator. If the peer has not sent the
PeerConnectionClosed indicator, we transition into the
PeerFinCloseWait state.
D. The local SMC-R connection stays in this state until the peer
sends the PeerConnectionClosed indicator in a CDC message.
When the indicator is sent, we transition to the Closed state
and are then free to reuse this RMBE.
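The passive-close transitions in steps 4A-4D above can be condensed into a
transition table. The state names follow the text; the event names and the
table encoding are illustrative.

```python
# Illustrative transition table for the passive close side (steps 4A-4D).
# State names follow the text; event names are invented for this sketch.

PASSIVE_CLOSE = {
    # 4A: a CDC message with PeerDoneWriting or PeerConnectionClosed
    # moves the connection to AppCloseWait1.
    ("Active", "rcv_PeerDoneWriting"):            "AppCloseWait1",
    ("Active", "rcv_PeerConnectionClosed"):       "AppCloseWait1",
    # 4B: local shutdown for write sends PeerDoneWriting.
    ("AppCloseWait1", "app_shutdown_wr"):         "AppCloseWait2",
    # 4C: local close sends PeerConnectionClosed; whether the peer has
    # already sent its own indicator decides the next state.
    ("AppCloseWait1", "app_close_peer_closed"):   "Closed",
    ("AppCloseWait1", "app_close_peer_open"):     "PeerFinCloseWait",
    ("AppCloseWait2", "app_close_peer_closed"):   "Closed",
    ("AppCloseWait2", "app_close_peer_open"):     "PeerFinCloseWait",
    # 4D: the peer's PeerConnectionClosed frees the RMBE for reuse.
    ("PeerFinCloseWait", "rcv_PeerConnectionClosed"): "Closed",
}

def next_state(state, event):
    return PASSIVE_CLOSE[(state, event)]
```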
Note that each SMC-R peer needs to provide some logic that will
prevent being stranded in a termination state indefinitely. For
example, if an Active Close SMC-R peer is in a PeerCloseWait (1 or 2)
state waiting for the remote SMC-R peer to update its connection
termination status, it needs to provide a timer that will prevent it
from waiting in that state indefinitely should the remote SMC-R peer
not respond to this termination request. This could occur in error
scenarios -- for example, if the remote SMC-R peer suffered a failure
prior to being able to respond to the termination request or the
remote application is not responding to this connection termination
request by closing its own socket. This latter scenario is similar
to the TCP FINWAIT2 state, which has been known to sometimes cause
issues when remote TCP/IP hosts lose track of established connections
and neglect to close them. Even though the TCP standards do not
mandate a timeout from the TCP FINWAIT2 state, most TCP/IP
implementations assign a timeout for this state. A similar timeout
will be required for SMC-R connections. When this timeout occurs,
the local SMC-R peer performs TCP reset processing for this
connection. However, no additional RDMA writes to the partner RMBE
can occur at this point (we have already indicated that we are done
updating the peer's RMBE). After the TCP connection is reset, the
RMBE can be returned to the free pool for reallocation. See
Section 4.4.2 for more details.
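The FINWAIT2-like guard described above amounts to bounding how long a peer
waits for the partner's PeerConnectionClosed indicator. A minimal sketch,
with a poll count standing in for a real timer:

```python
# Illustrative guard against being stranded in a termination wait state.

def wait_for_peer_closed(peer_closed, max_polls=3):
    """Poll for the peer's PeerConnectionClosed indicator, giving up
    after a bounded number of polls (a stand-in for a real timeout)."""
    for _ in range(max_polls):
        if peer_closed():
            return "Closed"        # peer answered; the RMBE is reusable
    # FINWAIT2-like timeout: reset the TCP connection. No further RDMA
    # writes to the partner RMBE are permitted; the RMBE is then returned
    # to the free pool for reallocation.
    return "tcp_reset"
```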
Also note that it is possible to have two SMC-R endpoints initiate an
Active close concurrently. In that scenario, the flows above still
apply; however, both endpoints follow the active close path (path 3).

4.8.2. Abnormal SMC-R Connection Termination Flows
Abnormal SMC-R connection termination can occur for a variety of
reasons, including the following:
o The TCP connection associated with an SMC-R connection is reset.
In TCP, either endpoint can send a RST segment to abort an
existing TCP connection when error conditions are detected for the
connection or the application overtly requests that the connection
be reset.
o Normal SMC-R connection termination processing has unexpectedly
stalled for a given connection. When the stall is detected
(connection termination timeout condition), an abnormal SMC-R
connection termination flow is initiated.
In these scenarios, it is very important that resources associated
with the affected SMC-R connections are properly cleaned up to ensure
that there are no orphaned resources and that resources can reliably
be reused for new SMC-R connections. Given that SMC-R relies heavily
on the RDMA write processing, special care needs to be taken to
ensure that an RMBE is no longer being used by an SMC-R peer before
logically reassigning that RMBE to a new SMC-R connection.
When an SMC-R peer initiates a TCP connection reset, it also
initiates an SMC-R abnormal connection flow at the same time. The
SMC-R peers explicitly signal their intent to abnormally terminate an
SMC-R connection and await explicit acknowledgment that the peer has
received this notification and has also completed abnormal connection
termination on its end. Note that TCP connection reset processing
can occur in parallel to these flows.

+-----------------+
|-------------->| CLOSED |<-------------|
| | | |
| +-----------------+ |
| |
| |
| |
| +-----------------------+ |
| | Any state | |
|1B | (before setting | 2B|
| | PeerConnectionClosed | |
| | indicator in | |
| | peer's RMBE) | |
| +-----------------------+ |
| 1A | | 2A |
| Active Abort | | Passive Abort |
| V V |
| +--------------+ +--------------+ |
|-------|PeerAbortWait | | Process Abort|------|
| | | |
+--------------+ +--------------+
Figure 23: SMC-R Abnormal Connection Termination State Diagram

Figure 23 above shows the SMC-R abnormal connection termination state
diagram:
1. Active abort designates the SMC-R peer that is initiating the TCP
RST processing. At the time that the TCP RST is sent, the active
abort side must also do the following:
A. Send the PeerConnAbort indicator to the partner in a CDC
message, and then transition to the PeerAbortWait state.
During this state, it will monitor this SMC-R connection
waiting for the peer to send its corresponding PeerConnAbort
indicator but will ignore any other activity in this connection
(i.e., new incoming data). It will also generate an
appropriate error to any socket API calls issued against this
socket (e.g., ECONNABORTED, ECONNRESET).
B. Once the peer sends the PeerConnAbort indicator to the local
host, the local host can transition this SMC-R connection to
the Closed state and reuse this RMBE. Note that the SMC-R peer
that goes into the active abort state must provide some
protection against staying in that state indefinitely should
the remote SMC-R peer not respond by sending its own
PeerConnAbort indicator to the local host. While this should
be a rare scenario, it could occur if the remote SMC-R peer

(passive abort) suffered a failure right after the local SMC-R
peer (active abort) sent the PeerConnAbort indicator. To
protect against these types of failures, a timer can be set
after entering the PeerAbortWait state, and if that timer pops
before the peer has sent its local PeerConnAbort indicator (to
the active abort side), this RMBE can be returned to the free
pool for possible reallocation. See Section 4.4.2 for more
details.
2. Passive abort designates the SMC-R peer that is the recipient of
an SMC-R abort from the peer, signaled by the PeerConnAbort
indicator sent by the peer in a CDC message. Upon receiving
this request, the local peer must do the following:
A. Using the appropriate error codes, indicate to the socket
application that this connection has been aborted, and then
purge all in-flight data for this connection that is waiting to
be read or waiting to be sent.
B. Send a CDC message to notify the peer of the PeerConnAbort
indicator and, once that is completed, transition this RMBE to
the Closed state.
If an SMC-R peer receives a TCP RST for a given SMC-R connection, it
also initiates SMC-R abnormal connection termination processing if it
has not already been notified (via the PeerConnAbort indicator) that
the partner is severing the connection. It is possible to have two
SMC-R endpoints concurrently be in an active abort role for a given
connection. In that scenario, the flows above still apply but both
endpoints take the active abort path (path 1).
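The abort handshake can be sketched end to end: neither RMBE may be reused
until both PeerConnAbort indicators have been exchanged. All names in this
sketch are illustrative.

```python
# Illustrative end-to-end abort handshake (Figure 23, paths 1 and 2).
# Each side is modeled as a plain dictionary.

def abort_handshake(active, passive):
    """Drive one active/passive abort exchange; both sides end Closed."""
    # 1A: active side sends PeerConnAbort and enters PeerAbortWait,
    # ignoring any further activity on the connection.
    active["sent_abort"] = True
    active["state"] = "PeerAbortWait"
    # 2A/2B: passive side surfaces an error to the application, purges
    # in-flight data, answers with its own PeerConnAbort, and closes.
    passive["purged"] = True
    passive["sent_abort"] = True
    passive["state"] = "Closed"
    # 1B: active side sees the peer's PeerConnAbort and may now close
    # and reuse its RMBE.
    if passive["sent_abort"]:
        active["state"] = "Closed"
    return active["state"], passive["state"]
```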
4.8.3. Other SMC-R Connection Termination Conditions
The following are additional conditions that have implications for
SMC-R connection termination:
o An SMC-R peer being gracefully shut down. If an SMC-R peer
supports a graceful shutdown operation, it should attempt to
terminate all SMC-R connections as part of shutdown processing.
This could be accomplished via LLC DELETE LINK requests on all
active SMC-R links.
o Abnormal termination of an SMC-R peer. In this case, there may
be no opportunity for the host to perform any SMC-R cleanup
processing, so it is up to the remote peer to
detect a RoCE communications failure with the failing host. This

could trigger SMC-R link switchover, but that would also generate
RoCE errors, causing the remote host to eventually terminate all
existing SMC-R connections to this peer.
o Loss of RoCE connectivity between two SMC-R peers. If two peers
are no longer reachable across any links in their SMC-R link
group, then both peers perform a TCP reset for the connections,
generate an error to the local applications, and free up all QP
resources associated with the link group.
5. Security Considerations
5.1. VLAN Considerations
The concepts and access control of virtual LANs (VLANs) must be
extended to also cover the RoCE network traffic flowing across the
Ethernet.
The RoCE VLAN configuration and access permissions must mirror the IP
VLAN configuration and access permissions over the Converged Enhanced
Ethernet fabric. This means that hosts, routers, and switches that
have access to specific VLANs on the IP fabric must also have the
same VLAN access across the RoCE fabric. In other words, the SMC-R
connectivity will follow the same virtual network access permissions
as normal TCP/IP traffic.
5.2. Firewall Considerations
As mentioned above, the RoCE fabric inherits the same VLAN
topology/access as the IP fabric. RoCE is a Layer 2 protocol that
requires both endpoints to reside in the same Layer 2 network (i.e.,
VLAN). RoCE traffic cannot traverse multiple VLANs, as there is no
support for routing RoCE traffic beyond a single VLAN. As a result,
SMC-R communications will also be confined to peers that are members
of the same VLAN. IP-based firewalls are typically inserted between
VLANs (or physical LANs) and rely on normal IP routing to insert
themselves in the data path. Since RoCE (and by extension SMC-R) is
not routable beyond the local VLAN, there is no ability to insert a
firewall in the network path of two SMC-R peers.
5.3. Host-Based IP Filters
Because SMC-R maintains the TCP three-way handshake for connection
setup before switching to RoCE out of band, existing IP filters that
control connection setup flows remain effective in an SMC-R
environment. IP filters that operate on traffic flowing in an active
TCP connection are not supported, because the connection data does
not flow over IP.

5.4. Intrusion Detection Services
Similar to IP filters, intrusion detection services that operate on
TCP connection setups are compatible with SMC-R with no changes
required. However, once the TCP connection has switched to RoCE out
of band, packets are not available for examination.
5.5. IP Security (IPsec)
IP security is not compatible with SMC-R, because there are no IP
packets on which to operate. TCP connections that require IP
security must opt out of SMC-R.
5.6. TLS/SSL
Transport Layer Security/Secure Socket Layer (TLS/SSL) is preserved
in an SMC-R environment. The TLS/SSL layer resides above the SMC-R
layer, and outgoing connection data is encrypted before being passed
down to the SMC-R layer for RDMA write. Similarly, incoming
connection data goes through the SMC-R layer encrypted and is
decrypted by the TLS/SSL layer as it is today.
The TLS/SSL handshake messages flow over the TCP connection after the
connection has switched to SMC-R, and so they are exchanged using
RDMA writes by the SMC-R layer, transparently to the TLS/SSL layer.
6. IANA Considerations
The scarcity of TCP option codes available for assignment is
understood, and this architecture uses experimental TCP options
following the conventions of [RFC6994] ("Shared Use of Experimental
TCP Options").
TCP ExID 0xE2D4C3D9 has been registered with IANA as a TCP Experiment
Identifier. See Section 3.1.
If this protocol achieves wide acceptance, a discrete option code may
be requested by subsequent versions of this protocol.