Introduction
The third article in Fixstars Solutions’ high-performance networking series focuses on achieving data reconstruction at line rate using a CPU to manage packet transfers from the NIC to the GPU. Crucially, we used FSS lightning-kit, a high-throughput data transfer framework. The framework aims to provide a uniform interface for receiving large volumes of data over a network, and it gives developers the flexibility to use either a GPU or a CPU to manage host-to-device memory transfers. In this article, we demonstrate high-performance data reconstruction from TCP packets using a 100GbE network interface card (NIC), achieving a peak throughput of 89 Gbps. To achieve this, lightning-kit utilizes the Data Plane Development Kit (DPDK).
While there are other DPDK-based TCP stack applications (F-Stack, mTCP, TLDK, Seastar, and Accelerated Network Stack (ANS)), lightning-kit is unique in that it is designed as a framework that adapts to application-specific requirements rather than aiming to be a drop-in replacement for the existing network stack. Consequently, it allows the user to omit certain TCP features, such as requesting packet retransmission, given the minimal likelihood of packet loss in this context. Although RDMA (InfiniBand, RoCE) can also be used for peer-to-peer data transfer, certain enterprise scenarios necessitate the TCP protocol due to considerations such as development and operating costs, device restrictions, and other factors.
Methods
This section describes our methods for achieving high-throughput data transfer. The essential steps are receiving TCP packets, sending TCP ACKs in response, extracting the TCP payloads, and storing them in a data buffer. To maximize throughput, it is crucial that the CPU detects packet arrival at the NIC and sends an ACK as quickly as possible, because the TCP sender must wait for the expected sequence number in an ACK packet before sending more data. This performance requirement suggests running the extraction and storage of data on a separate CPU core. Consequently, the current implementation consists of two CPU threads: one handles TCP packets and the other builds data from the payloads. The rest of this section describes the technologies used for each step.
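To make the division of work concrete, the following is a minimal sketch of how such a two-thread pipeline could be wired together with DPDK primitives. It is an illustration rather than the actual lightning-kit implementation: the port and queue IDs, burst size, and ring size are placeholder assumptions, and TCP parsing, ACK transmission, and the payload copy itself are elided.

```cpp
// Minimal sketch (not the actual lightning-kit code) of the two-thread split:
// one thread busy-polls the NIC and sends ACKs, the other copies payloads
// into the destination buffer.
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

static constexpr uint16_t PORT_ID  = 0;   // placeholder port/queue IDs
static constexpr uint16_t QUEUE_ID = 0;
static constexpr uint16_t BURST    = 32;

// Single-producer / single-consumer ring that hands received mbufs from the
// packet-handling thread to the data-reconstruction thread. Created after
// rte_eal_init(), e.g. rte_ring_create("payloads", 4096, rte_socket_id(),
//                                      RING_F_SP_ENQ | RING_F_SC_DEQ);
static rte_ring* payload_ring;

// Runs on the packet-handling core: poll the RX queue, ACK, hand off mbufs.
int rx_and_ack_loop(void*) {
    rte_mbuf* pkts[BURST];
    for (;;) {
        uint16_t n = rte_eth_rx_burst(PORT_ID, QUEUE_ID, pkts, BURST);
        if (n == 0) continue;  // busy-poll: no interrupts, no sleeping
        // ...extract sequence numbers and transmit ACKs here (omitted)...
        // A real implementation would also handle a full ring.
        rte_ring_enqueue_burst(payload_ring,
                               reinterpret_cast<void**>(pkts), n, nullptr);
    }
}

// Runs on the data-reconstruction core: copy payloads, then free the mbufs.
int reconstruct_loop(void*) {
    rte_mbuf* pkts[BURST];
    for (;;) {
        unsigned n = rte_ring_dequeue_burst(payload_ring,
                                            reinterpret_cast<void**>(pkts),
                                            BURST, nullptr);
        for (unsigned i = 0; i < n; ++i) {
            // ...copy the TCP payload into the data buffer (see below)...
            rte_pktmbuf_free(pkts[i]);
        }
    }
}
```

A single-producer/single-consumer ring is a natural choice here because exactly one thread sits on each side, avoiding atomic contention on the handoff.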
Handling TCP packets is realized with the DPDK library, because DPDK supports zero-copy packet transfer from the NIC to user space; see the official DPDK documentation for more information. The DPDK features we used are listed below.
- Polling mode for the RX queue.
- Allocating packet memory on the CPU socket (NUMA node) that the NIC is attached to.
- Configuring thread affinity so that the CPU threads run on the same socket as the NIC.
We also enabled the following tuning options for the NVIDIA NIC we used; a configuration sketch combining these settings follows the list.
- LRO (Large Receive Offload)
- IP checksum offload
- TCP checksum offload
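The snippet below sketches how these settings map onto DPDK's port-setup calls. It is illustrative only: the mempool and descriptor sizes and the single RX/TX queue layout are assumptions on our part, and a real configuration would also query rte_eth_dev_info_get() for the offload and LRO capabilities the device actually supports and check every return code.

```cpp
// Sketch of NUMA-aware port setup with the offloads listed above enabled.
// Error handling is omitted; sizes are placeholders (with ~8000-byte frames,
// a larger mbuf data room or RX scatter would be needed in practice).
#include <rte_ethdev.h>
#include <rte_mbuf.h>

void setup_port(uint16_t port_id) {
    // Allocate the mbuf pool on the NUMA node the NIC is attached to.
    int socket = rte_eth_dev_socket_id(port_id);
    rte_mempool* pool = rte_pktmbuf_pool_create(
        "rx_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);

    rte_eth_conf conf{};
    // RX side: LRO plus IP/TCP checksum validation in hardware.
    conf.rxmode.offloads = RTE_ETH_RX_OFFLOAD_TCP_LRO |
                           RTE_ETH_RX_OFFLOAD_IPV4_CKSUM |
                           RTE_ETH_RX_OFFLOAD_TCP_CKSUM;
    // TX side: let the NIC compute checksums for the outgoing ACKs.
    conf.txmode.offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |
                           RTE_ETH_TX_OFFLOAD_TCP_CKSUM;

    rte_eth_dev_configure(port_id, 1 /*rx queues*/, 1 /*tx queues*/, &conf);
    rte_eth_rx_queue_setup(port_id, 0, 1024, socket, nullptr, pool);
    rte_eth_tx_queue_setup(port_id, 0, 1024, socket, nullptr);
    rte_eth_dev_start(port_id);
}
```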
After receiving a packet, its sequence number is extracted and an ACK packet is constructed. To accelerate this procedure, rte_pktmbuf_clone is used. Our experiment uses point-to-point data transfer, so the sender’s MAC address, IP address, and TCP port do not change. This allows the Ethernet frame and TCP/IP packet headers to be reused, and rte_pktmbuf_clone realizes this buffer reuse without copying.
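For reference, the sketch below shows only the per-packet arithmetic behind this step, i.e. reading the sequence number of a received segment and computing the ACK number that covers its payload. It assumes untagged Ethernet/IPv4/TCP frames and does not reproduce the rte_pktmbuf_clone-based header reuse itself.

```cpp
// Sketch: read the TCP sequence number from a received mbuf and compute the
// ACK number acknowledging its payload. Assumes Ethernet/IPv4/TCP with no
// VLAN tags; lightning-kit's template reuse via rte_pktmbuf_clone is not
// reproduced here.
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_tcp.h>

uint32_t next_expected_seq(const rte_mbuf* m) {
    auto* ip = rte_pktmbuf_mtod_offset(m, const rte_ipv4_hdr*,
                                       sizeof(rte_ether_hdr));
    uint16_t ip_hdr_len = (ip->version_ihl & 0x0f) * 4;   // IHL in 32-bit words
    auto* tcp = reinterpret_cast<const rte_tcp_hdr*>(
        reinterpret_cast<const uint8_t*>(ip) + ip_hdr_len);
    uint16_t tcp_hdr_len = (tcp->data_off >> 4) * 4;       // data offset field

    uint32_t seq = rte_be_to_cpu_32(tcp->sent_seq);
    uint16_t payload_len =
        rte_be_to_cpu_16(ip->total_length) - ip_hdr_len - tcp_hdr_len;

    // The ACK we send back acknowledges everything up to the end of this
    // (possibly LRO-coalesced) segment.
    return seq + payload_len;
}
```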
Moving on to the data reconstruction step, CPU affinity is configured to ensure stable performance. The main bottleneck in this step is the memory copy from the payload to the data buffer. From our experiments, std::memcpy is insufficient for our needs. Since our target involves copying sequentially located data, streaming stores are the most efficient approach. Streaming (non-temporal) stores avoid pulling the destination into the CPU cache, thereby saving time and reducing memory traffic compared to standard stores, which first load the target memory into the cache.
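A minimal sketch of such a non-temporal copy follows, assuming an AVX2-capable CPU, a 32-byte-aligned destination buffer, and a length that is a multiple of 32 bytes; a production version would handle the unaligned head and tail with ordinary stores.

```cpp
// Non-temporal (streaming-store) copy sketch. Compile with AVX2 enabled
// (e.g. -mavx2). Assumes dst is 32-byte aligned and len % 32 == 0.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void stream_copy(void* dst, const void* src, std::size_t len) {
    auto*       d = static_cast<std::uint8_t*>(dst);
    const auto* s = static_cast<const std::uint8_t*>(src);
    for (std::size_t i = 0; i < len; i += 32) {
        // Unaligned load from the packet payload, then a streaming store
        // into the data buffer that bypasses the cache hierarchy.
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i));
        _mm256_stream_si256(reinterpret_cast<__m256i*>(d + i), v);
    }
    // Streaming stores are weakly ordered; fence before handing the buffer
    // to another thread.
    _mm_sfence();
}
```

Because the reconstructed buffer is written once and not read again by this thread, bypassing the cache avoids both the read-for-ownership traffic and the eviction of useful data.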
Experimental Results
System Topology
In these experiments, both the sender and receiver machines use NVIDIA 100GbE NICs, and the data is transmitted over TCP, though lightning-kit also supports UDP. All packet reception and transmission is implemented with the DPDK library. Two threads run continuously: one polls the NIC’s RX queue and sends ACK packets back to the sender as packets arrive, while the other reconstructs the transmitted data from the received Ethernet frame payloads. The diagram below illustrates the scenario.

Testing Environment
- Receiver PC
  - CPU: AMD EPYC 7313 16-Core Processor
  - Number of sockets: 2
  - NIC: ConnectX-6 Dx
  - DPDK: 22.11.2401.1.0
  - PCIe: Gen 4, x16
  - Memory: 8 × Micron Technology 3200 MT/s modules per socket
  - Linux kernel: Linux kelpie20 5.15.96-rt61 #1 SMP PREEMPT_RT
  - OS: Ubuntu 20.04.6
- Sender PC (capable of packet throughput up to 100 Gbps)
  - CPU: Intel(R) Xeon(R) E5-2620 v3 @ 2.40GHz
  - NIC: ConnectX-6 Dx
  - PCIe: Gen 3, x16
Methodology
As mentioned above, TCP was used as the protocol in this experiment. To control the throughput, the chunk size was adjusted on the sender machine. Chunk size refers to the amount of data sent before the sender verifies that the sequence number of the received ACK matches the expected value. Once these numbers match, the sender can transmit the next chunk. In lossless packet environments, larger chunks achieve higher throughput because the sender pauses to wait for ACKs less frequently. In all experiments, 128 GB of data was sent.
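The sender used in these experiments is a custom application whose internals are not covered here; purely as an illustration of the pacing idea, the sketch below implements "send a chunk, then wait until it is fully acknowledged" over an ordinary Linux TCP socket, using the SIOCOUTQ ioctl, which reports how many written bytes have not yet been acknowledged by the peer.

```cpp
// Illustration of chunk-based pacing over a plain Linux TCP socket.
// The experiment's actual sender is a different, DPDK-capable application;
// this only demonstrates the "send a chunk, wait for its ACK" idea.
#include <linux/sockios.h>  // SIOCOUTQ
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>

void send_chunked(int sock, const uint8_t* data, size_t total, size_t chunk) {
    for (size_t off = 0; off < total; off += chunk) {
        size_t len = std::min(chunk, total - off);
        // Write one chunk (a real sender would check the return value and
        // loop on partial writes).
        send(sock, data + off, len, 0);

        // Busy-wait until the kernel reports that no sent-but-unacknowledged
        // bytes remain, i.e. the receiver has ACKed the whole chunk.
        int outstanding = 0;
        do {
            ioctl(sock, SIOCOUTQ, &outstanding);
        } while (outstanding > 0);
    }
}
```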
Although the throughput is the same as with the default Linux kernel, a kernel with the RT patch (PREEMPT_RT) is used because many customers operate in environments where minimizing network jitter is crucial. The jitter performance of lightning-kit will be discussed in a forthcoming article.
In this experiment, the NIC is attached to socket 1, so the DPDK EAL parameter --socket-mem is set to “0,4096”. Additionally, the DPDK lcore is set to 18, and the data reconstruction thread’s affinity is set to core 19. In all test runs, the CPU frequency was fixed at 3000 MHz.
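For clarity, the sketch below shows how these parameters could be passed to DPDK and how the reconstruction thread could be pinned to core 19. The program skeleton and thread body are placeholders, not the lightning-kit entry point, and error handling is omitted.

```cpp
// Sketch of the EAL parameters and core pinning described above.
#include <pthread.h>
#include <sched.h>
#include <rte_eal.h>
#include <thread>

int main() {
    // Reserve 4096 MB of hugepage memory on socket 1 only (none on socket 0)
    // and run the DPDK packet-handling thread on lcore 18.
    char arg0[] = "lightning-kit-demo";  // placeholder program name
    char arg1[] = "--socket-mem";
    char arg2[] = "0,4096";
    char arg3[] = "-l";
    char arg4[] = "18";
    char* eal_args[] = {arg0, arg1, arg2, arg3, arg4};
    rte_eal_init(5, eal_args);

    // Pin the data-reconstruction thread to core 19, on the same socket as
    // the NIC. The lambda body stands in for the reconstruction loop.
    std::thread rebuild([] { /* data reconstruction loop */ });
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(19, &cpus);
    pthread_setaffinity_np(rebuild.native_handle(), sizeof(cpus), &cpus);
    rebuild.join();
    return 0;
}
```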
Results
| Chunk size (MB) | Throughput (Gbps), w/o reconstruction | Average RTT (usec), w/o reconstruction | Throughput (Gbps), with reconstruction | Average RTT (usec), with reconstruction |
|---|---|---|---|---|
| 1 | 70.26 | 102.54 | 68.32 | 105.72 |
| 2 | 80.09 | 188.10 | 77.77 | 194.04 |
| 4 | 85.05 | 360.48 | 82.68 | 371.10 |
| 8 | 89.09 | 697.14 | 86.07 | 722.01 |
| 16 | 90.84 | 1376.15 | 87.72 | 1425.33 |
| 32 | 91.77 | 2733.60 | 88.69 | 2828.36 |
| 64 | 92.23 | 5448.02 | 89.07 | 5641.19 |
| 128 | 92.53 | 10871.39 | 89.17 | 11282.01 |
The table above shows the throughput and RTT with and without data reconstruction. Without data reconstruction, the maximum throughput is about 92 Gbps, which is the limit of our sender side. With data reconstruction, the throughput is about 3 Gbps lower. This difference likely comes from memory read latency during data reconstruction. One packet is approximately 8000 bytes, so the extra time per packet is about 24 ns, computed as (RTT difference) / (number of packets per chunk) = (RTT with reconstruction − RTT without reconstruction) / (chunk size / 8000 B). In future updates, we plan to extend lightning-kit to hide this memory read latency.
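As a concrete check of that estimate, plugging in the 8 MB row of the table (and treating the chunk size as 8 MiB and every packet as roughly 8000 bytes, both assumptions on our part):

$$
\frac{\Delta\mathrm{RTT}}{\text{packets per chunk}}
= \frac{(722.01 - 697.14)\ \mu\mathrm{s}}{8 \times 2^{20}\,\mathrm{B} \,/\, 8000\,\mathrm{B}}
\approx \frac{24.87\ \mu\mathrm{s}}{1049}
\approx 24\ \mathrm{ns}.
$$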
CPU vs GPU
Our previous article showed the performance of GPU packet processing. Here we compare the throughput of the two approaches and discuss some pros and cons.
| Chunk size (MB) | Throughput (Gbps), GPU reconstruction | Average RTT (usec), GPU reconstruction | Throughput (Gbps), CPU reconstruction | Average RTT (usec), CPU reconstruction |
|---|---|---|---|---|
| 1 | 58.22 | 127.27 | 68.32 | 105.72 |
| 2 | 70.14 | 216.18 | 77.77 | 194.04 |
| 4 | 79.42 | 388.32 | 82.68 | 371.10 |
| 8 | 84.90 | 731.93 | 86.07 | 722.01 |
| 16 | 88.62 | 1410.75 | 87.72 | 1425.33 |
| 32 | 90.82 | 2762.26 | 88.69 | 2828.36 |
| 64 | 91.75 | 5477.66 | 89.07 | 5641.19 |
| 128 | 92.19 | 10912.11 | 89.17 | 11282.01 |
The table above shows the current best performance of CPU and GPU data reconstruction; the CPU results without data reconstruction in the previous table serve as a reference for the maximum attainable CPU performance. As the reader can see, when the chunk size is small, the CPU is about 10 Gbps faster than the GPU. When the chunk size exceeds 4 MB, the performance gap narrows. Hence, from a throughput point of view, systems that require small chunk sizes are better served by CPU data reconstruction, while GPUs become a viable option if the chunk size can be increased. Furthermore, if large amounts of the received data will ultimately be processed on a GPU, opting for GPU packet processing makes it possible to select lower-class CPU cores and memory with minimal performance impact.
Conclusion
This article demonstrates that lightning-kit technology can approach the nominal performance limit of a 100GbE NIC for both GPU and CPU packet processing. As Fixstars continues to develop lightning-kit, we anticipate that the framework will achieve wide adoption in light of these encouraging results. Please reach out to Fixstars for bespoke, high-speed software applications.