TRex Advanced Stateful support

1. Audience

This document assumes basic knowledge of TRex, and assumes that TRex is installed and configured.
For information, see the manual, especially the material up to the Basic Usage section.
Consider this document an extension to the manual; we might integrate the two in the future.

2. Advanced Stateful support

2.1. Feature overview

TRex supports Stateless (STL) and Stateful (STF) modes.

This document describes the new Advanced Stateful mode (ASTF), which supports the TCP layer.

The following UDP/TCP related use-cases will be addressed by ASTF mode.

Ability to work when the DUT terminates the TCP stack (e.g. compress/uncompress, see figure 1). In this case there is a different TCP session on each side, but the L7 data is almost the same.

Figure 1. DUT is TCP proxy

Ability to work in either client mode or server mode, so that the TRex client side can be installed in one physical location on the network and the TRex server side in another. Figure 2 shows such an example.

Figure 2. C/S mode

Performance and scale

High bandwidth - ~200 Gb/sec with many realistic flows (not one elephant flow)

High connection rate - order of MCPS

Scale to millions of active established flows

Simulate latency/jitter/drop in high rate

Emulate L7 applications, e.g. HTTP/HTTPS/Citrix - there is no need to implement the exact application.

Simulate L7 application on top of TLS using OpenSSL

BSD-based TCP implementation

Ability to change fields in the L7 stream application - for example, change HTTP User-Agent field

2.1.2. Can we address the above requirements using existing DPDK TCP stacks?

Can we leverage one of the existing DPDK TCP stacks for our needs? The short answer is no.
We chose to take the original BSD4.4 code base with the FreeBSD bug-fix patches and improve its scalability to address our needs.
More on the reasons why in the following sections, but in short: the TCP DPDK stacks above are optimized for real client/server applications/APIs, while in most of our traffic generation use cases most of the traffic is known ahead of time, which allows us to do much better.
Let's take a look at the main properties of the TRex TCP module and understand the main challenges we tried to solve.

2.1.3. The main properties of scalable TCP for traffic generation

Interact with DPDK API for batching of packets

Multi-instance, lock free. Each thread gets its own TCP context with local counters, configuration, flow table etc. (RSS)

Async, event driven - no OS API/threads needed (a conceptual sketch of these callbacks follows this list)

Start write buffer

Continue write

End Write

Read buffer /timeout

OnConnect/OnReset/OnClose

Accurate with respect to the TCP RFCs - at least derived from BSD to be compatible - no need to reinvent the wheel

Enhanced TCP statistics - as a traffic generator we need to gather as many statistics as we can, for example per-template TCP statistics.

Ability to save descriptors for better simulation of latency/jitter/drop
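
The async, event-driven interface listed above can be pictured as a small set of callbacks implemented by the emulated application layer. The following is only a conceptual Python sketch; the names are illustrative and are not the actual TRex C++ API.

# Conceptual sketch of the event-driven application interface described above.
# Names are illustrative, not the real TRex internals.
class AppEmulationCallbacks:
    def on_connect(self, flow):
        """Called when the TCP session is established."""
    def on_reset(self, flow):
        """Called when the peer resets the connection."""
    def on_close(self, flow):
        """Called when the session is closed."""
    def on_rx(self, flow, data):
        """Read buffer / timeout: new data (or a timeout) for this flow."""
    def get_tx_chunk(self, flow, max_bytes):
        """Start/continue/end write: return the next chunk to transmit (poll model)."""
        return b""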

The following figure shows the block diagram of the new TRex TCP design.

Figure 3. Stack

And now let's proceed to our challenges. Let me just repeat the objective of TRex: it is not to reach a high rate with one flow, it is to simulate a realistic network with many clients using small flows. Let's see whether we can solve the scale of millions of flows.

2.1.4. Tx Scale to millions of flows

Figure 4. TCP Tx side

Most TCP stacks have an API that allows the user to provide a buffer for write (push), and the TCP module saves it until the packets are acknowledged by the remote side. Figure 4 shows what the Tx queue of one TCP flow looks like on the Tx side. This could create a scale issue in the worst case. Let's assume we need 1M active flows with a 64K TX buffer (with a reasonable buffer, assuming the RTT is small). The worst case buffer in this case could be
1M x 64K x mbuf-factor (let's assume 2) = 128GB. The mbuf resource is expensive and needs to be allocated ahead of time.
The solution we chose for this problem (which works from a traffic generator's point of view) is to change the API to a poll API, meaning TCP requests the buffers from the application layer only when packets need to be sent (lazy). Because most of the traffic is constant in our case, we can save a lot of memory and have unlimited scale (both in flows and in tx window).
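
A quick back-of-the-envelope check of the numbers above (the mbuf factor of 2 is the assumption stated in the text):

# Worst-case Tx buffer memory when TCP keeps the user buffers until they are acked
flows       = 1_000_000          # 1M active flows
tx_window   = 64 * 1024          # 64KB TX buffer per flow
mbuf_factor = 2                  # assumed mbuf overhead factor from the text
total_gib   = flows * tx_window * mbuf_factor / 2**30
print(total_gib)                 # 128.0 -> ~128GB that must be allocated ahead of time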

This optimization won’t work with TLS since constant sessions will have new data

2.1.5. Rx Scale to millions of flows

Figure 5. Example of multiple streams

The same problem exists on the Rx side with reassembly: in the worst case a lot of memory is needed for the reassembly queue. To fix this we can add a filter API for the application layer. Let's assume the application layer can request only a partial portion of the data, since the rest is less important, for example data at offset 61K-64K and only in case of retransmission (simulation). In this case we can give the application layer only the filtered data that really matters to it, and still allow the TCP layer to work in the same way from a seq/ack perspective.

This optimization won’t work with TLS since constant sessions will have new data

2.1.6. Simulation of latency/jitter/drop in high scale

Figure 6. TCP Rx side

There is a requirement to simulate latency/jitter/drop at the network layer. Simulating drop at a high rate is not a problem, but simulating latency/jitter at a high rate is a challenge because a high number of packets needs to be queued. See figure 6 on the left.
A better solution is to queue a pointer to the TCP flow and the TCP descriptor (with TSO information) and only when needed (i.e. when it has already left the tx queue) build the packet again (lazy). The memory footprint in this case can be reduced dramatically.

2.1.7. Emulation of L7 application

To emulate an L7 application on top of the TCP layer we can define a set of simple operations.
The user can build an application emulation layer from the Python API or with a utility that we will provide that analyzes a pcap file and converts it to TCP operations.
Another thing we can learn from a pcap is the TCP parameters, like MSS/window size/Nagle/TCP options etc.
Let's give a simple example of an L7 emulation of an HTTP client and an HTTP server.
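
As a minimal sketch, using only the ASTFProgram send/recv commands that appear later in this document; the request/response strings are illustrative and the import path depends on the TRex version.

from trex_astf_lib.api import *   # assumed import path; may differ between TRex versions

# Illustrative L7 payloads; any strings with matching lengths will do.
http_req      = 'GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n'
http_response = 'HTTP/1.1 200 OK\r\nContent-Length: 32\r\n\r\n' + '*' * 32

# Client "story": send the request, then wait for the whole response.
prog_c = ASTFProgram()
prog_c.send(http_req)
prog_c.recv(len(http_response))

# Server "story": the mirror image of the client program.
prog_s = ASTFProgram()
prog_s.recv(len(http_req))
prog_s.send(http_response)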

This way both the client and the server don't need to know the exact application protocol; they just need to have the same story/program. A real HTTP server parses the HTTP request, learns the Content-Length field, waits for the rest of the data and finally retrieves the information from disk. With our L7 emulation there is no need for that. Even in cases where the data length changes (for example a NAT/LB that changes the data length) we can give some flexibility within the program on the value range of the length.
In the case of UDP it is a message-based protocol, with operations like send_msg/wait_for_msg etc. (a sketch follows below).
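
For UDP, a hedged sketch of the same idea using message-style commands; stream=False and the send_msg/recv_msg names are assumptions to verify against your TRex version.

# UDP "story": one request message, one response message.
udp_req = 'request'
udp_res = 'response'

prog_c = ASTFProgram(stream=False)   # stream=False is assumed to select UDP
prog_c.send_msg(udp_req)
prog_c.recv_msg(1)                   # wait for one message from the server

prog_s = ASTFProgram(stream=False)
prog_s.recv_msg(1)
prog_s.send_msg(udp_res)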

2.1.8. Stateful (STF) vs Advanced Stateful (ASTF)

Same Flexible tuple generator

Same Clustering mode

Same VLAN support

NAT - no need for complex learn mode. ASTF supports NAT64 out of the box.

Flow order. ASTF has inherent ordering verification using the TCP layer. It also verifies the IP/TCP/UDP checksum out of the box.

Latency measurement is supported in both.

In ASTF mode you can't control the IPG, and it is less predictable (the number of concurrent flows is less deterministic).

2.2. ASTF package folders

Location                                           | Description
/astf                                              | astf native (py) profiles
/automation/trex_control_plane/astf/examples       | automation examples
/automation/trex_control_plane/astf/trex_astf_lib  | astf lib compiler (converts py to JSON)
/automation/trex_control_plane/stf                 | stf automation (used by astf mode)
/automation/trex_control_plane/astf/examples       | stf automation example

2.3. Getting started Tutorials

The tutorials in this section demonstrate basic TRex ASTF use cases. Examples include common and moderately advanced TRex concepts.
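
The client and server pseudo code referred to below is not reproduced in this section; the following is a blocking, single-flow sketch of the server-side logic with hypothetical helper names, given only to make the discussion concrete.

# Pseudo code sketch of the server side (hypothetical helpers, blocking, one flow).
def server_side(profile):
    while True:
        pkt = wait_for_packet()                    # no server sockets are opened ahead of time
        template = profile.lookup(pkt.dst_port)    # match the SYN against the template criteria
        if template is None:
            continue                               # would be counted as err_no_template
        flow = open_socket_for(pkt)                # allocate the socket only now
        flow.recv(len(template.request))           # the program is the mirror of the client side
        flow.send(template.response)
        flow.close()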

As you can see from the pseudo code, there is no need to open all the servers ahead of time; we open and allocate a socket only when a packet matches the criteria of the server side.

The server program is the mirror image of the client side.

The above is just pseudo code, created to explain how TRex works logically. It was simpler to show pseudo code that runs in one thread in a blocking fashion, but in practice everything runs event driven, and many flows are multiplexed at high performance and scale.
The L7 program can be written using the Python API (it is compiled into event-driven micro-code by the TRex server).

2.3.3. Tutorial: Profile with two templates

Goal

Simple browsing: an HTTP and an HTTPS flow. In this example, each template has a different destination port (80/443).

Traffic profile

The profile includes an HTTP and an HTTPS template. Each second there will be 2 HTTPS flows and 1 HTTP flow.

The server side chooses the template based on the destination port. Because each template has a unique destination port (80/443) there is nothing to do. The next example shows what to do in case both templates have the same destination port.
On the client side, the scheduler will schedule 2 HTTPS flows and 1 HTTP flow each second, based on the CPS. A sketch of such a profile is shown below.
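
A hedged sketch of such a profile using the ASTF Python profile API; the class names (ASTFProfile, ASTFCapInfo, ASTFIPGen, ASTFIPGenDist, ASTFIPGenGlobal) and the pcap file names are assumptions about the standard TRex package and may differ in your version.

from trex_astf_lib.api import *   # assumed import path

class TwoTemplatesProfile:
    def get_profile(self):
        # client/server IP pools (ranges are illustrative)
        ip_gen_c = ASTFIPGenDist(ip_range=["16.0.0.1", "16.0.0.255"], distribution="seq")
        ip_gen_s = ASTFIPGenDist(ip_range=["48.0.0.1", "48.0.255.255"], distribution="seq")
        ip_gen   = ASTFIPGen(glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
                             dist_client=ip_gen_c,
                             dist_server=ip_gen_s)

        # two templates: 1 HTTP flow/sec (port 80 pcap) and 2 HTTPS flows/sec (port 443 pcap)
        return ASTFProfile(default_ip_gen=ip_gen,
                           cap_list=[ASTFCapInfo(file="avl/delay_10_http_browsing_0.pcap", cps=1),
                                     ASTFCapInfo(file="avl/delay_10_https_0.pcap", cps=2)])

def register():
    return TwoTemplatesProfile()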

2.3.4. Tutorial: Profile with two templates same ports

Goal

Create a profile with two HTTP templates. In this example, both templates have the same destination port (80).

In the real world the same server can handle many types of transactions on the same port, based on the request. In this TRex version we have this limitation, as it is only an emulation. Next we will add a better engine that can associate the template based on the server IP:port socket or on L7 data.

2.3.7. Tutorial: Change tcp.mss using tunables mechanism

A profile tunable is a mechanism to tune the behavior of an ASTF traffic profile.
TCP layer has a set of tunables. IPv6 and IPv4 have another set of tunables.

There are two types of tunables:

Global tunable: set per client/server side; affects all the templates on that side.

Per-template tunable: affects only the associated template (per client/server side). Has higher priority than the global tunable.

By default, the TRex server has a default value for every tunable; only when you set a specific tunable does the server override that value.
An example of a tunable is tcp.mss. You can change tcp.mss as sketched below:
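
A hedged sketch of a global tcp.mss tunable per side; ASTFGlobalInfo and the default_c_glob_info/default_s_glob_info profile arguments are assumptions about the ASTF Python API and should be checked against your TRex version.

# Global (per side) tunables: override tcp.mss for the client and server sides.
c_glob_info = ASTFGlobalInfo()
c_glob_info.tcp.mss = 1400       # client side

s_glob_info = ASTFGlobalInfo()
s_glob_info.tcp.mss = 1400       # server side

# ip_gen and the cap_list are defined as in the earlier profile sketch.
profile = ASTFProfile(default_ip_gen=ip_gen,
                      default_c_glob_info=c_glob_info,
                      default_s_glob_info=s_glob_info,
                      cap_list=[ASTFCapInfo(file="avl/delay_10_http_browsing_0.pcap", cps=1)])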

If there are no errors, the err object won't be present. In case of error counters, the err section will include the counter and its description. The all section includes both the good and the error counter values.

2.3.9. Tutorial: Simple simulator

Goal

Use the TRex ASTF simple simulator.

The TRex package includes a simulator tool, astf-sim. The simulator runs as a Python script that calls an executable. The platform requirements for the simulator tool are the same as for TRex. Root privileges are not required for simulation.

This tutorial demonstrates the most basic use case of the TRex simulator: one client flow, one server flow, and only one template (the first one).
The objective of this simulator is to verify the TCP layer and the application layer.
With the simulator it is possible to simulate many abnormal cases, for example:

Drop of specific packets.

Change of packet information (e.g. wrong sequence numbers)

Man in the middle RST and redirect

Keepalive timers.

Set the round trip time

Convert the profile to JSON format

We have not exposed all the capabilities of the simulator tool, but you can debug the emulation layer using it and explore the pcap output files.

2.3.13. Tutorial: L7 emulation - fin/ack/fin/ack

By default, when the L7 emulation program ends, the socket is closed implicitly.

This example forces the server side to wait for a close from the peer (client) and only then send its FIN.

fin-ack example

# client commands
prog_c = ASTFProgram()
prog_c.send(http_req)
prog_c.recv(len(http_response))  # implicit close

# server commands
prog_s = ASTFProgram()
prog_s.recv(len(http_req))
prog_s.send(http_response)
prog_s.wait_for_peer_close()  # wait for the client to close the socket, then issue a close

2.3.17. Tutorial: L7 emulation - Elephant flows

Let's say we would like to send only 50 flows with a very big size (4GB).
Loading a 4GB buffer would be a challenge, as TRex memory is limited.
What we can do is loop inside the client side to send a 1MB buffer 4096 times and then finish with termination, as sketched below.
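
A hedged sketch of the client-side loop; the set_var/set_label/jmp_nz loop commands are assumptions about the ASTFProgram API and should be checked against your TRex version.

# Send a ~4GB "elephant" payload as 4096 iterations of a 1MB chunk,
# so only 1MB of buffer memory has to be loaded into TRex.
chunk = 'x' * (1024 * 1024)          # 1MB buffer, loaded once

prog_c = ASTFProgram()
prog_c.set_var("n", 4096)            # loop counter: 4096 x 1MB = 4GB
prog_c.set_label("a:")
prog_c.send(chunk)
prog_c.jmp_nz("n", "a:")             # decrement "n" and jump while it is non-zero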

By default the send() command waits for the ACK of the last byte. Especially in the case of a big BDP (where a large window is required), it is possible to work in non-blocking mode and in this way keep the pipeline full.

2.4. Performance

2.5. Client/Server only mode

With ASTF mode, it is possible to work in either client mode or server mode, so the TRex client side can be installed in one physical location on the network and the TRex server side in another.

We are in the process of moving to an interactive model, so the following way of configuring the C/S modes (as a batch) is changing. This is a temporary solution; it is going to be more flexible in the future. The roadmap is to provide an RPC command to configure each port as client or server.

The current way to control the C/S mode is the following CLI switch. It only allows disabling the client side ports for transmission; there is no way to change the mode of each port.

Table 1. batch CLI options

CLI                  | Description
--astf-server-only   | Only server side ports (1,3..) are enabled with the ASTF service. Traffic won't be transmitted on the client ports.

The responder information is ignored in ASTF mode, as the server side learns the VLAN/MAC from the DUT.
Another point is that TRex ports behave like trunk ports with all VLANs allowed. This means that when working with a switch, the switch could flood the wrong packet (a SYN with the wrong VLAN) to the TRex server side port, and TRex would answer this packet by mistake (since all VLANs are allowed, e.g. a client side packet with a client side VLAN would be answered on the server side). To overcome this, either use the `allowed vlan` command on the switch or use a dummy ARP resolution to make the switch learn the VLAN on each port.

2.7. Multi core support

Distribution to multiple cores is done using RSS hardware assist. Using this feature it is possible to support up to 200Gb/sec of TCP/UDP traffic with one server.
It works only with NICs that support RSS (almost all physical NICs) and only for TCP/UDP traffic without tunnels (e.g. GRE).
For more complex tunnel traffic a software distributor is needed, which will be slower.
It is possible to enable this assist on some virtual NICs (e.g. vmxnet3) if there is demand for it.

Table 6. NICs that support RSS

Chipset                 | support | vlan skip | qinq skip
Intel 82599             |    +    |     -     |     -
Intel 710               |    +    |     +     |     -
Mellanox ConnectX-5/4   |    +    |     +     |     -
Napatech                |    -    |     -     |     -

IPv6 and IPv4 are supported. For IPv6 to work, the server side IPv6 template (e.g. info.ipv6.dst_msb ="ff03::") must have zero in bits 32-40 (LSB is zero).

When using RSS, the number of sockets (available source ports per client IP) is divided by the number of cores (c). For example, with c=3 each client IP has 64K/3 source ports, about 21K ports. In this case it is advised to add more clients to the pool (in case you are close to the limit).
In other words, the socket-util reported in the console should be multiplied by c.
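
The arithmetic, spelled out (numbers from the paragraph above):

# Source ports available per client IP when the port space is split across cores
ports_per_client_ip = 64 * 1024      # ~64K source ports
cores = 3                            # c = 3
print(ports_per_client_ip // cores)  # 21845 -> ~21K ports per core per client IP
# hence the socket-util shown in the console should be multiplied by c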

conn. closed (includes drops) - this counter could be higher than tcps_connects for the client side, as a flow could be dropped before establishment

Counter               | Error | Description
tcps_segstimed        |       | segs where we tried to get rtt
tcps_rttupdated       |       | times we succeeded
tcps_delack           |       | delayed acks sent
tcps_sndtotal         |       | total packets sent (TSO)
tcps_sndpack          |       | data packets sent (TSO)
tcps_sndbyte          |       | data bytes sent by application
tcps_sndbyte_ok       |       | data bytes sent by the tcp layer; could be more than tcps_sndbyte (asked by application)
tcps_sndctrl          |       | control (SYN,FIN,RST) packets sent
tcps_sndacks          |       | ack-only packets sent
tcps_rcvtotal         |       | total packets received (LRO)
tcps_rcvpack          |       | packets received in sequence (LRO)
tcps_rcvbyte          |       | bytes received in sequence
tcps_rcvackpack       |       | rcvd ack packets (LRO) [2]
tcps_rcvackbyte       |       | tx bytes acked by rcvd acks (should be the same as tcps_sndbyte)
tcps_rcvackbyte_of    |       | tx bytes acked by rcvd acks - overflow ack
tcps_preddat          |       | times hdr predict ok for data pkts
tcps_drops            |   *   | connections dropped
tcps_conndrops        |   *   | embryonic connections dropped
tcps_timeoutdrop      |   *   | conn. dropped in rxmt timeout
tcps_rexmttimeo       |   *   | retransmit timeouts
tcps_persisttimeo     |   *   | persist timeouts
tcps_keeptimeo        |   *   | keepalive timeouts
tcps_keepprobe        |   *   | keepalive probes sent
tcps_keepdrops        |   *   | connections dropped in keepalive
tcps_sndrexmitpack    |   *   | data packets retransmitted
tcps_sndrexmitbyte    |   *   | data bytes retransmitted
tcps_sndprobe         |       | window probes sent
tcps_sndurg           |       | packets sent with URG only
tcps_sndwinup         |       | window update-only packets sent
tcps_rcvbadoff        |   *   | packets received with bad offset
tcps_rcvshort         |   *   | packets received too short
tcps_rcvduppack       |   *   | duplicate-only packets received
tcps_rcvdupbyte       |   *   | duplicate-only bytes received
tcps_rcvpartduppack   |   *   | packets with some duplicate data
tcps_rcvpartdupbyte   |   *   | dup. bytes in part-dup. packets
tcps_rcvoopackdrop    |   *   | OOO packet drop due to queue len
tcps_rcvoobytesdrop   |   *   | OOO bytes drop due to queue len
tcps_rcvoopack        |   *   | out-of-order packets received
tcps_rcvoobyte        |   *   | out-of-order bytes received
tcps_rcvpackafterwin  |   *   | packets with data after window
tcps_rcvbyteafterwin  |   *   | bytes rcvd after window
tcps_rcvafterclose    |   *   | packets rcvd after close
tcps_rcvwinprobe      |       | rcvd window probe packets
tcps_rcvdupack        |   *   | rcvd duplicate acks
tcps_rcvacktoomuch    |   *   | rcvd acks for unsent data
tcps_rcvwinupd        |       | rcvd window update packets
tcps_pawsdrop         |   *   | segments dropped due to PAWS
tcps_predack          |   *   | times hdr predict ok for acks
tcps_persistdrop      |   *   | timeout in persist state
tcps_badsyn           |   *   | bogus SYN, e.g. premature ACK
tcps_reasalloc        |   *   | allocate tcp reassembly ctx
tcps_reasfree         |   *   | free tcp reassembly ctx
tcps_nombuf           |   *   | no mbuf for tcp - drop the packets

Table 9. UDP counters

Counter               | Error | Description
udps_accepts          |   *   | connections accepted
udps_connects         |   *   | connections established
udps_closed           |   *   | conn. closed (including drops)
udps_sndbyte          |   *   | data bytes transmitted
udps_sndpkt           |   *   | data packets transmitted
udps_rcvbyte          |   *   | data bytes received
udps_rcvpkt           |   *   | data packets received
udps_keepdrops        |   *   | keepalive drop
udps_nombuf           |   *   | no mbuf
udps_pkt_toobig       |   *   | packets transmitted too big

Table 10. Flow table counters

Counter               | Error | Description
err_cwf               |   *   | client packet that does not match a flow; cannot happen in loopback. Could happen if the DUT generated a packet after TRex closed the flow
err_no_syn            |   *   | first packet of a server side flow without a SYN
err_len_err           |   *   | packet with an L3 length error
err_no_tcp            |   *   | non-TCP packet - dropped
err_no_template       |   *   | server can't match the L7 template (no matching destination port or IP range)
err_no_memory         |   *   | no heap memory for allocating flows
err_dct               |   *   | duplicate flows due to aging issues and long delays in the network
err_l3_cs             |   *   | ipv4 checksum error
err_l4_cs             |   *   | tcp/udp checksum error (in case the NIC supports it)
err_redirect_rx       |   *   | redirect to rx error
redirect_rx_ok        |       | redirect to rx OK
err_rx_throttled      |       | rx thread was throttled due to too many packets in the NIC rx queue
err_c_nf_throttled    |       | number of client side flows that were not opened due to flow-table overflow (to enlarge the limit, see dp_max_flows in the trex_cfg file)
err_s_nf_throttled    |       | number of server side flows that were not opened due to flow-table overflow (to enlarge the limit, see dp_max_flows in the trex_cfg file)
err_s_nf_throttled    |       | number of "too many flows" events from the maintenance thread; it is not the number of flows that weren't opened
err_c_tuple_err       |       | number of flows that were not opened on the client side because there were not enough clients in the pool. When this counter rises, TRex performance is affected by the lookup for free ports; to solve this, add more clients to the pool

1. See TSO: with NICs that support TSO we count TSO packets, so this number could be significantly smaller than the real number of packets.

2. See LRO: with NICs that support LRO we count LRO packets, so this number could be significantly smaller than the real number of packets.

Important information

It is hard to compare the number of TCP tx (client) TSO packets to rx (server) LRO packets, as they might differ. A better approach is to compare the number of bytes.

tcps_sndbyte == tcps_rcvackbyte only if the flows were terminated correctly (in other words, what was put in the Tx queue was transmitted and acked)

Total Tx L7 bytes are tcps_sndbyte_ok+tcps_sndrexmitbyte+tcps_sndprobe

The console/JSON output does not show/send zero-valued counters.

Pseudo code for TCP counters

if ((c->tcps_drops == 0) && (s->tcps_drops == 0)) {
    /* flow wasn't initiated due to drop of SYN too many times */
    /* client side */
    assert(c->tcps_sndbyte    == UPLOAD_BYTES);
    assert(c->tcps_rcvbyte    == DOWNLOAD_BYTES);
    assert(c->tcps_rcvackbyte == UPLOAD_BYTES);
    /* server side */
    assert(s->tcps_rcvackbyte == DOWNLOAD_BYTES);
    assert(s->tcps_sndbyte    == DOWNLOAD_BYTES);
    assert(s->tcps_rcvbyte    == UPLOAD_BYTES);
}

Some rules for counters

/* for the client side */
1. tcps_connects <= tcps_closed   /* drops before the socket is connected are counted in a different
                                     counter; they are counted in tcps_closed but not in tcps_connects */
2. tcps_connattempt == tcps_closed

/* for the server side */
1. tcps_accepts == tcps_connects == tcps_closed

2.10.1. TSO/LRO NIC support

See the manual for details.

2.11. FAQ

2.11.1. Why should I use TRex in this mode?

ASTF mode can help to address the following requirements:

Test a realistic scenario on top of TCP when the DUT acts as a TCP proxy

Test realistic scenarios at high scale (flows/bandwidth)

Flexibility to change the TCP/IP flow options

Flexibility to emulate an L7 application using the Python API (e.g. create many types of HTTP with different User-Agent fields)

Measure latency in high resolution (usec)

2.11.2. Why do I need to reload the TRex server with different flags to change to ASTF mode? In other words, why can't STL and ASTF work together?

In theory, we could have supported that, but it required much more effort because the NIC memory configuration is fundamentally different. For example, in ASTF mode, we need to configure all the Rx queues to receive the packets and to configure the RSS to split the packets to different interface queues. While in Stateful we filter most of the packets and count them in hardware.

2.11.3. Is your core TCP implementation based on prior work?

Yes, it is based on the BSD4.4-Lite version of TCP with bug fixes from FreeBSD, plus our changes for scaling to a high number of concurrent flows and for performance.
The reasons for not developing the TCP core logic from scratch can be found here: Why do we use the Linux kernel's TCP stack?

2.11.4. What TCP RFCs are supported?

RFC 793

RFC 1122

RFC 1323

RFC 6928

Not implemented:

RFC 2018

2.11.5. Could you have a more recent TCP implementation?

Yes, BSD4.4-Lite is from 1995 and does not have RFC 2018. We started with it as a POC, and we plan to merge the latest FreeBSD TCP core with our work.

2.11.6. Can I reduce the number of active flows in ASTF mode? There are too many of them.

The short answer is no. The number of active (concurrent) flows is derived from the RTT and the responses of the tested network/DUT. You can increase the number of active flows by adding a delay command.

2.11.7. Will NAT64 work in ASTF mode?

Yes. See IPv6 in the manual. The server side will handle IPv4 sockets.

client side         NAT64         server side
IPv6         ->                   IPv4
IPv6         <-                   IPv4

Example:

client side (IPv6):       xx::16.0.0.1 -> yy::48.0.0.1
DUT converts it to IPv4:  16.0.0.1 -> 48.0.0.1
DUT converts it to IPv6:  16.0.0.1 <- 48.0.0.1
                          xx::16.0.0.1 <- yy::48.0.0.1

The client works in IPv6, the server works in IPv4.

2.11.8. Is TSO/LRO NIC hardware optimization supported?

Yes. LRO improves performance. GRO is not supported yet.

2.11.9. Can I get the ASTF counters per port/template?

Currently the TCP/application layer counters are per client/server side.
We plan to add per-port/template counters in the interactive mode with the RPC API.

2.12. Appendix

2.12.1. Blocking vs non-blocking

Let's simulate a very long HTTP download session with astf-sim to understand the difference between blocking and non-blocking:

rtt = 10 msec

shaper rate is 10 Mbps (simulates the egress shaper of the DUT)

write in chunks of 40KB

max-window = 24K

BDP (10 msec * 10 Mbps = 12.5KB), but in the blocking case we wait the RTT time idle, reducing the maximum throughput.
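
The BDP number, spelled out (a worked check of the figures above):

# Bandwidth-delay product for the simulated link
rate_bps  = 10e6         # 10 Mbps shaper
rtt_s     = 0.010        # 10 msec round trip time
bdp_bytes = rate_bps * rtt_s / 8
print(bdp_bytes)         # 12500.0 -> 12.5KB, well below the 24K max-window
# With blocking writes the sender idles for ~RTT after each 40KB chunk,
# so the shaper is not kept busy and the achieved throughput stays below 10 Mbps.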