
High-Performance Networking Series 1: Data Reconstruction on GPUs and its Current Limitations

Masaru Ito | May 13, 2024 | Networking

Introduction

In this first post of a series on high-performance networking, we discuss the possibilities for data reconstruction by having a GPU interact directly with a machine’s NIC. In particular, we use lightning-kit, a GPU-accelerated, high-speed packet processing library currently in development at Fixstars Solutions. With benchmarking and optimization underway, the performance trends are promising. This analysis presents throughput and latency measurements as received Ethernet packets were streamed from a 100 Gbps NIC into an NVIDIA GPU. Two frameworks were used to manage GPU-NIC communication: NVIDIA’s new DOCA GPUNetIO framework and the older DPDK GPU Direct. With packet processing enabled, a stable throughput of 55 Gbps was achieved with either technology. Without processing, a peak throughput of 75 Gbps was reached in the test environment, showing that the packet-processing step itself degrades performance. The developers at Fixstars are continually improving lightning-kit with the goal of bringing packet processing speeds close to the nominal NIC throughput.

System Topology

DPDK GPU Direct
With DPDK GPU Direct, lightning-kit uses a CPU thread to poll the NIC’s completion queue and notify the GPU of packet reception. A GPU kernel polls for these CPU notifications and reconstructs the received data. When TCP is used as the protocol, another GPU kernel generates ACK packets for the CPU to send back through the network interface. DPDK GPU Direct is the more mature of the two technologies for coordinating NICs and GPUs and can be regarded as a baseline for comparison.
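The CPU-poll/GPU-consume hand-off described above can be illustrated with a small simulation. This is only a conceptual sketch in Python: the queues, thread names, and sentinel convention are all stand-ins for the real NIC completion queue and the persistent CUDA kernel, not lightning-kit’s actual API.

```python
import queue
import threading

# Stand-ins for the NIC completion queue and the CPU -> GPU
# notification channel used by the DPDK GPU Direct flow.
completion_queue = queue.Queue()
notifications = queue.Queue()
reconstructed = bytearray()

def cpu_poller(n_packets):
    # CPU role: poll the completion queue and notify the "GPU"
    # of each received packet. None is a hypothetical end marker.
    for _ in range(n_packets):
        notifications.put(completion_queue.get())
    notifications.put(None)

def gpu_kernel():
    # GPU role: the real kernel spins on device-visible flags;
    # here we simply block on the notification queue and
    # concatenate each payload into the reconstruction buffer.
    while True:
        pkt = notifications.get()
        if pkt is None:
            break
        reconstructed.extend(pkt)

# Simulate three received packets, then run both roles.
for payload in (b"abc", b"def", b"ghi"):
    completion_queue.put(payload)

t_cpu = threading.Thread(target=cpu_poller, args=(3,))
t_gpu = threading.Thread(target=gpu_kernel)
t_cpu.start(); t_gpu.start()
t_cpu.join(); t_gpu.join()

print(bytes(reconstructed))  # b'abcdefghi'
```

In the real system the "notification" is a flag in GPU-visible memory that the kernel busy-waits on, which is what makes a dedicated CPU polling thread necessary in the DPDK approach.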

DOCA GPUNetIO
DOCA GPUNetIO is a rapidly developing technology by NVIDIA for using GPUs and NICs together for high-speed networking. DOCA operates somewhat differently from DPDK in that it runs solely on the GPU, with the GPU initiating packet RX and TX directly. DPDK, on the other hand, needs to poll the NIC from a CPU. This means DPDK consumes CPU threads while DOCA does not. Consequently, a less powerful CPU can be chosen for a DOCA-based system than for a DPDK-based one.

When using DOCA, lightning-kit uses the CPU only to launch GPU kernels, which, once started, receive and reconstruct data independently of the CPU. At least two kernels are deployed: one polls the NIC for Ethernet packets, and the other reconstructs the data. When the protocol is TCP, a third kernel sends ACK packets.
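The third kernel’s job, acknowledging received TCP segments, reduces to a small piece of arithmetic defined by the TCP specification: the ACK number is the next byte the receiver expects. A minimal sketch (the function name and signature are illustrative, not lightning-kit’s):

```python
def ack_number(seq: int, payload_len: int,
               syn: bool = False, fin: bool = False) -> int:
    # TCP acknowledges the next expected byte: the segment's sequence
    # number plus its payload length, plus one each for SYN and FIN,
    # which consume one sequence number apiece. Sequence space is
    # modulo 2**32, so the counter wraps around.
    return (seq + payload_len + int(syn) + int(fin)) % 2**32

print(ack_number(1000, 1460))  # 2460: ACK for a full-MSS segment at seq 1000
```

Because this is pure per-segment arithmetic with no shared state, it maps naturally onto a dedicated GPU kernel running alongside the reconstruction kernel.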

Testing Environment

  • Receiver PC (VMware virtual machine)
    CPU : Intel(R) Xeon(R) Silver 4214R @ 2.40GHz
    GPU : NVIDIA RTX A6000
    NIC : ConnectX-6 Dx
    CUDA Toolkit : 12.4
    NVIDIA driver : 550.67
    DPDK : 22.11.2401.1.0
    DOCA : 2.6.0058
    PCIe : Gen 2, x32 width
    NIC firmware : 22.39.2048
  • Sending PC (host environment, capable of packet throughputs up to 100 Gbps)
    CPU : Intel(R) Xeon(R) E5-2620 v3 @ 2.40GHz
    NIC : ConnectX-7
    PCIe : Gen 3, x16 width

Experimental Methodology

TCP was implemented for server-client communication, with the client sending 128 GB to the server in every test. When processing packets on the GPU, the TCP payloads were simply concatenated, a fundamental operation when handling large volumes of data such as images and video. As mentioned above, performance was measured both with and without packet processing in order to quantify the overhead that processing introduces and to gauge the project’s status.
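The "simply concatenated" step still has to put segments back in sequence order, since packets may arrive out of order or be retransmitted. A minimal in-memory sketch of that reassembly, in Python rather than the actual CUDA kernel, and ignoring 32-bit sequence wraparound for clarity:

```python
def reassemble(segments):
    # segments: iterable of (tcp_sequence_number, payload_bytes),
    # possibly out of order or overlapping (retransmissions).
    # Map each absolute byte offset to its value; the first copy
    # of any overlapping byte wins, then emit bytes in order.
    buf = {}
    for seq, payload in segments:
        for i, b in enumerate(payload):
            buf.setdefault(seq + i, b)
    return bytes(buf[k] for k in sorted(buf))

# Out-of-order arrival plus an overlapping retransmission:
print(reassemble([(103, b"def"), (100, b"abc"), (102, b"cde")]))
```

On the GPU the same idea is expressed as a scatter: each segment’s payload is copied to `base_address + (seq - initial_seq)`, which is what makes the reconstruction embarrassingly parallel.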

In line with other works [1,2], throughput and latency were measured at the server as a function of chunk size. The client checks the sequence number of the ACK only after sending a full chunk of data, comparing it against the expected value; once the numbers match, the client sends the next chunk. Because the client waits for ACKs less frequently with larger chunks, a higher throughput can be achieved.
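The chunked-ACK loop can be sketched as follows. Everything here is illustrative (the `EchoServer` stand-in, function names, and the cumulative-ACK model are assumptions), but it shows why larger chunks mean fewer synchronization points:

```python
def send_with_chunked_acks(data: bytes, chunk_size: int, server) -> int:
    # Send the data one chunk at a time; after each chunk, block until
    # the server's cumulative ACK equals the bytes sent so far.
    # Returns the number of ACK waits (synchronization points).
    sent = 0
    waits = 0
    while sent < len(data):
        chunk = data[sent:sent + chunk_size]
        server.receive(chunk)
        sent += len(chunk)
        waits += 1
        assert server.ack() == sent  # compare against the expected ACK
    return waits

class EchoServer:
    # Trivial stand-in: acknowledges every byte it receives.
    def __init__(self):
        self._acked = 0
    def receive(self, chunk):
        self._acked += len(chunk)
    def ack(self):
        return self._acked

print(send_with_chunked_acks(b"x" * 100, 25, EchoServer()))  # 4 waits
print(send_with_chunked_acks(b"x" * 100, 50, EchoServer()))  # 2 waits
```

Doubling the chunk size halves the number of round-trip stalls, which is the mechanism behind the throughput-vs-chunk-size curves in the results below.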

Results

Detailed results are listed below, first when processing packets:

Although the performance is similar, DOCA GPUNetIO is slightly faster in terms of both latency and throughput. With DOCA, it was observed that when the chunk size exceeds 4 MB, sufficient GPU cooling is necessary to prevent the GPU from lowering its clock frequency due to high temperatures; in the test environment, the GPU temperature must be kept under 59 ℃.

Next are the results without packet processing:

Without processing, the performance differs significantly between DPDK and DOCA. Counterintuitively, DPDK actually performs more slowly without packet processing than with it; the reason for this behavior is currently under investigation. Both frameworks show performance that improves with chunk size until a plateau is reached.
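As a sanity check on these figures, the wall-clock time to move the 128 GB test payload at each measured rate can be computed directly (assuming decimal units, 1 GB = 10⁹ bytes and 1 Gbps = 10⁹ bits/s):

```python
def transfer_time_s(bytes_total: float, rate_gbps: float) -> float:
    # Seconds to move bytes_total at rate_gbps, decimal units.
    return bytes_total * 8 / (rate_gbps * 1e9)

payload = 128e9  # the 128 GB test transfer
for rate in (55, 75, 100):
    print(f"{rate} Gbps -> {transfer_time_s(payload, rate):.1f} s")
# 55 Gbps -> 18.6 s, 75 Gbps -> 13.7 s, 100 Gbps -> 10.2 s
```

The gap between 18.6 s at the processed rate and 10.2 s at line rate is the headroom that closing the packet-processing overhead would recover.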

Conclusions

This article has shown the current performance of GPU packet processing and its limitations. At present, DOCA is the better option, yet the result is still only about half the potential performance of a 100 Gbps NIC. Fixstars Solutions will continue to investigate and post the latest status of this endeavor. Readers with high-performance data transfer needs should consider lightning-kit as an open-source solution for their networking needs!

References

[1] Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, and Sotiris Ioannidis. 2014. GASPP: A GPU-Accelerated Stateful Packet Processing Framework. In 2014 USENIX Annual Technical Conference (USENIX ATC 14). USENIX Association, Philadelphia, PA, 321-332.
[2] Zeke Wang, Hongjing Huang, Jie Zhang, Fei Wu, and Gustavo Alonso. 2022. FpgaNIC: An FPGA-based Versatile 100 Gb SmartNIC for GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association.

Author

Masaru Ito