High-Performance Networking Series 2: Efficient Host to Device Copies for High-Throughput TCP Data Reconstruction
Introduction
Recent work at Fixstars Solutions focuses on developing high-throughput data transfer software. This second article in our high-performance networking series demonstrates our process of tuning the direct transfer of packets from a 100 GbE NIC to an Nvidia GPU, where the data is reconstructed. We begin by describing the theory behind the performance of the host-to-device transfer, which provides hints for selecting the proper GPU for a given application. Our experiments show that an impressive 92 Gbps was achieved using Fixstars' open-source lightning-kit data transfer framework, in line with theoretical predictions.
Theory
Earlier experiments revealed that the performance bottleneck in the lightning-kit data flow was the overhead incurred when calling cudaMemcpyAsync during data reconstruction. The use of cudaMemcpyAsync is essential when building data from Ethernet payloads because it allows memory copies in one CUDA stream to overlap with CUDA kernel execution in another. However, measurements showed that launching cudaMemcpyAsync costs about 2000 clock cycles (you can measure the clock frequency of your GPU here), and when all threads in a warp call it, this cost is multiplied by warpSize. Based on this, the frame-building kernel performance can be modeled by the equation

$$\mathrm{cycles}_{\mathrm{iteration}} = \mathrm{cycles}_{\mathrm{memcpy}} \times \mathrm{packets} + \mathrm{offset}$$

where $\mathrm{cycles}_{\mathrm{iteration}}$ refers to the number of clock cycles in one iteration of the frame-building loop, $\mathrm{cycles}_{\mathrm{memcpy}}$ refers to the number of clock cycles consumed to initiate one call to cudaMemcpyAsync, $\mathrm{offset}$ is the cost of the rest of the loop iteration excluding cudaMemcpyAsync, and $\mathrm{packets}$ is the number of packets processed in one iteration.
Theoretically, the maximum number of bits that can be processed in one loop iteration is equal to the size of the DMA buffer where Ethernet payloads are written by the NIC. Calling this variable $\mathrm{bits}$, we have the constraint

$$\mathrm{packets} \times \mathrm{payload} \le \mathrm{bits}$$

where $\mathrm{payload}$ refers to the number of bits in the payload of one packet, which is determined by the MTU of the network interface. In this experiment, each packet carried about 8000 bytes, or 64 kbit, of payload.
Assuming the constraint is saturated, the theoretical throughput $\mathrm{bps}$ can be calculated (neglecting the time between loop iterations) as

$$\mathrm{bps} = \frac{\mathrm{packets} \times \mathrm{payload} \times f}{\mathrm{cycles}_{\mathrm{memcpy}} \times \mathrm{packets} + \mathrm{offset}}$$

where $f$ is the GPU clock frequency. Since $\mathrm{cycles}_{\mathrm{memcpy}}$ is strictly positive, increasing $\mathrm{packets}$ yields diminishing returns, and with everything else fixed the throughput saturates at $\mathrm{payload} \times f / \mathrm{cycles}_{\mathrm{memcpy}}$ (a reflection of Amdahl's Law). The above equation can be rearranged to

$$\mathrm{packets} = \frac{\mathrm{offset}}{\dfrac{\mathrm{payload} \times f}{\mathrm{bps}} - \mathrm{cycles}_{\mathrm{memcpy}}}$$

which gives the number of packets that must be processed per iteration to reach a target throughput.
From our research, the cost to launch cudaMemcpyAsync is around 1800 clock cycles, and the offset of lightning-kit's frame-processing loop is roughly 100k–500k cycles. The clock frequency is determined by the particular GPU in use. The following graph shows the theoretical relationship between the required number of packets in one loop iteration and the target throughput in Gbps. Curves for typical GPU clock frequencies are shown.
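As a quick sanity check on where these curves flatten out (this calculation is ours, using the roughly 1800-cycle launch cost and 64 kbit payload quoted above), the saturation throughput at a clock frequency $f$ is

$$\mathrm{bps}_{\max} = \lim_{\mathrm{packets} \to \infty} \mathrm{bps} = \frac{\mathrm{payload} \times f}{\mathrm{cycles}_{\mathrm{memcpy}}} \approx \frac{64{,}000 \times 1.41 \times 10^{9}}{1800} \approx 50\ \mathrm{Gbps}$$

at 1410 MHz, and likewise roughly 36 Gbps at 1005 MHz and 69 Gbps at 1950 MHz, consistent with the theoretical throughputs listed in the results below.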
This graph clearly shows the saturation of the throughput as the number of packets processed in a loop iteration approaches infinity (that is, as the DMA buffer size grows). The frequencies chosen match those of popular GPUs: the A100 runs at a nominal 1410 MHz, the RTX A6000 (Ampere generation) at 1950 MHz, and the GeForce RTX 4090 at 2520 MHz (note that GeForce products don't support RDMA; the 4090 is included only as a reference). When using cudaMemcpyAsync, it is clear that currently available GPUs do not reach the nominal 100 Gbps throughput of the NIC. To improve this, the cost to launch cudaMemcpyAsync must be reduced. In this use case, most packets have the same payload size, and the distance from one packet to the next in memory is constant because of DOCA's RXQ cyclic buffer, which means cudaMemcpy2DAsync can be utilized, copying more than one payload per function call. The following graphs show the theoretical results when 2 and 32 payloads are copied by one call to cudaMemcpy2DAsync, so that the effective $\mathrm{cycles}_{\mathrm{memcpy}}$ is divided by 2 or 32 compared to the previous graph.
The first graph above demonstrates that GPUs with clock speeds greater than 1410 MHz can keep up with a 100 Gbps NIC. The second graph shows that GPUs can process data at 100 Gbps even when the frequency is 210 MHz, the minimum frequency of an RTX A6000!
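The curves above can be reproduced by evaluating the model directly. The following is a minimal, standalone sketch (not part of lightning-kit) that prints the predicted throughput for a few packet counts and clock frequencies; the 300k-cycle offset is an assumed midpoint of the 100k–500k range quoted above.

```cpp
#include <cstdio>
#include <initializer_list>

// Theoretical throughput (Gbps) of the frame-building loop:
//   bps = packets * payload * f / (cycles_memcpy * packets + offset)
double model_gbps(double packets, double payload_bits, double freq_hz,
                  double cycles_memcpy, double offset_cycles) {
    return packets * payload_bits * freq_hz /
           (cycles_memcpy * packets + offset_cycles) / 1e9;
}

int main() {
    const double payload_bits  = 64000.0;   // ~8000-byte payload per packet
    const double launch_cycles = 1800.0;    // cost of one cudaMemcpyAsync launch
    const double offset_cycles = 300000.0;  // assumed midpoint of the 100k-500k range
    const double freqs_mhz[]   = {1005.0, 1410.0, 1950.0};
    const int    merges[]      = {1, 2, 32};  // payloads per cudaMemcpy2DAsync call

    for (int m : merges) {
        printf("== %d payload(s) per copy ==\n", m);
        for (double f_mhz : freqs_mhz) {
            // sweep the number of packets handled per loop iteration
            for (double packets : {64.0, 512.0, 4096.0}) {
                printf("  f = %6.0f MHz, packets = %5.0f  ->  %6.1f Gbps\n",
                       f_mhz, packets,
                       model_gbps(packets, payload_bits, f_mhz * 1e6,
                                  launch_cycles / m, offset_cycles));
            }
        }
    }
    return 0;
}
```

Sweeping the packet count toward the DMA buffer capacity reproduces the same saturation behavior shown in the graphs.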
Experimental Results
System Topology
In these experiments, the sender and receiver machines both use Mellanox 100 GbE NICs, and the data is sent over TCP, though Fixstars' lightning-kit also supports UDP. All of the API calls that handle receiving and sending Ethernet packets are made from the GPU using the DOCA GPUNetIO library. On the GPU, three CUDA kernels run continuously: one polls the NIC's RX queue status to notice when packets are ready, the second manages sending ACK packets back to the sender, and the third reconstructs the transmitted data from the received Ethernet payloads. A diagram of this scenario is shown below, along with a structural sketch of the three kernels.
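The sketch below gives a rough structural picture of this arrangement. It is not lightning-kit's actual code: the DOCA GPUNetIO device calls are elided into comments, and the kernel names, block sizes, and exit-flag mechanism are illustrative assumptions only.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Persistent kernel 1: poll the RX queue until packets arrive.
// (The real implementation uses DOCA GPUNetIO device calls here.)
__global__ void rx_poll_kernel(volatile int* exit_flag) {
    while (!*exit_flag) {
        // ... check RX queue descriptors, publish info about ready packets ...
    }
}

// Persistent kernel 2: build and send ACK packets back to the sender.
__global__ void ack_kernel(volatile int* exit_flag) {
    while (!*exit_flag) {
        // ... craft TCP ACKs for received sequence numbers and enqueue them ...
    }
}

// Persistent kernel 3: the frame-building loop modeled in the theory section,
// copying Ethernet payloads from the cyclic DMA buffer into the output buffer.
__global__ void rebuild_kernel(volatile int* exit_flag) {
    while (!*exit_flag) {
        // ... for each ready packet, copy its payload to the right offset ...
    }
}

int main() {
    // Host-visible exit flag so the persistent kernels can be told to stop.
    int* exit_flag = nullptr;
    cudaHostAlloc(&exit_flag, sizeof(int), cudaHostAllocMapped);
    *exit_flag = 0;

    // One stream per kernel so the three loops run concurrently.
    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);
    rx_poll_kernel<<<1, 32, 0, s1>>>(exit_flag);
    ack_kernel<<<1, 32, 0, s2>>>(exit_flag);
    rebuild_kernel<<<1, 256, 0, s3>>>(exit_flag);

    // ... wait for the transfer to finish, then signal the kernels to exit ...
    *exit_flag = 1;
    cudaDeviceSynchronize();
    cudaFreeHost(exit_flag);
    printf("done\n");
    return 0;
}
```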
Testing Environment
- Receiver PC
  - CPU: AMD EPYC 7313 16-Core Processor
  - GPU: NVIDIA RTX A6000
  - NIC: ConnectX-6 Dx
  - CUDA Toolkit: 12.4
  - Nvidia driver: 550.54.15
  - DPDK: 22.11.2401.1.0
  - DOCA: 2.6.0058
  - PCIe: gen 4, x16 width
- Sending PC (capable of packet throughputs up to 100 Gbps)
  - CPU: Intel(R) Xeon(R) E5-2620 v3 @ 2.40GHz
  - NIC: ConnectX-6 Dx
  - PCIe: gen 3, x16 width
Methodology
As mentioned above, TCP was used as the protocol in this experiment, so the throughput was controlled by changing the chunk size on the sender machine. Chunk size refers to the amount of data sent before the sender compares the sequence number of the received ACK with the expected value; once these numbers match, the client sends the next chunk (a sketch of this flow control is shown below). Because the client pauses to wait for ACKs less frequently, larger chunks achieve higher throughput in low-packet-loss environments. We also changed the GPU clock frequency artificially using nvidia-smi. Because it is difficult to fix the number of packets in one loop iteration, the best performance achieved was recorded and compared to the simulated results above. 128 GB of data was sent in every experiment.
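For illustration, the following sketch shows the sender-side flow control described above. The send_bytes and wait_for_ack helpers are hypothetical stubs that only simulate sequence numbers; they are not lightning-kit's sender API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the sender's packet I/O. The real sender talks to
// the NIC; these stubs only simulate sequence numbers so the sketch runs.
static uint32_t g_seq = 0;
static uint32_t send_bytes(const uint8_t*, size_t len) { g_seq += (uint32_t)len; return g_seq; }
static uint32_t wait_for_ack() { return g_seq; }  // pretend everything sent was ACKed

// Send `total` bytes in chunks: the sender compares the received ACK sequence
// number with the expected one only once per chunk, so larger chunks mean
// fewer pauses and higher throughput when packet loss is low.
static void send_in_chunks(const uint8_t* data, size_t total, size_t chunk_size) {
    size_t sent = 0;
    while (sent < total) {
        size_t len = (total - sent < chunk_size) ? (total - sent) : chunk_size;
        uint32_t expected_seq = send_bytes(data + sent, len);
        while (wait_for_ack() != expected_seq) {
            // drain ACKs for earlier packets of this chunk
        }
        sent += len;
    }
}

int main() {
    static uint8_t buf[1 << 20];                  // 1 MB of dummy data
    send_in_chunks(buf, sizeof(buf), 64 * 1024);  // 64 KB chunks
    printf("done, last sequence number = %u\n", g_seq);
    return 0;
}
```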
Results
When using cudaMemcpyAsync, the following data was collected:

| GPU | A6000 | A6000 | A6000 |
|---|---|---|---|
| Clock Frequency (MHz) | 1005 | 1410 | 1950 |
| Chunk Size (MB) | 2 | 2 | 2 |
| Theoretical Throughput (Gbps) | 35 | 49 | 69 |
| Measured Throughput (Gbps) | 29.88 | 42.87 | 51.66 |

Maximum performance for each frequency, cudaMemcpyAsync version.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 28.07 | 38.23 | 48.2 |
| 2 | 29.88 | 42.87 | 51.66 |

Measured throughput (Gbps) by chunk size and clock frequency, cudaMemcpyAsync version. Packets on the cyclic buffer were overwritten before their payloads were copied into a data buffer when the chunk size was more than 2 MB, so larger chunks are not listed.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 269.54 | 197.68 | 155.28 |
| 2 | 518.25 | 358.69 | 296.6 |

Measured RTT per chunk (µs) by chunk size and clock frequency, cudaMemcpyAsync version.

When using cudaMemcpy2DAsync, the data was as follows:
| GPU | A6000 | A6000 | A6000 |
|---|---|---|---|
| Clock Frequency (MHz) | 1005 | 1410 | 1950 |
| Chunk Size (MB) | 128 | 64 | 128 |
| Theoretical Throughput (Gbps) | 71 | 99 | 138 (over NIC spec) |
| Measured Throughput (Gbps) | 56.98 | 77.51 | 90.97 |

Maximum performance for each frequency, with 2 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 43.84 | 49.66 | 55.37 |
| 2 | 48.36 | 60.29 | 66.26 |
| 4 | 51.36 | 65.24 | 73.97 |
| 8 | 52.79 | 68.38 | 80.42 |
| 16 | 56.52 | 72.33 | 83.71 |
| 32 | 56.80 | 77.05 | 86.59 |
| 64 | 56.67 | 77.51 | 89.27 |
| 128 | 56.98 | 76.21 | 90.97 |

Measured throughput (Gbps) by chunk size and clock frequency, with 2 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 171.28 | 150.46 | 134.20 |
| 2 | 317.22 | 252.99 | 229.58 |
| 4 | 604.69 | 474.31 | 417.41 |
| 8 | 1180.63 | 911.14 | 773.20 |
| 16 | 2215.67 | 1729.18 | 1492.52 |
| 32 | 4420.86 | 3256.96 | 2897.68 |
| 64 | 8874.23 | 6485.43 | 5629.78 |
| 128 | 17658.84 | 13201.35 | 11058.74 |

Measured RTT per chunk (µs) by chunk size and clock frequency, with 2 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.
| GPU | A6000 | A6000 | A6000 | A6000 |
|---|---|---|---|---|
| Clock Frequency (MHz) | 210 | 1005 | 1410 | 1950 |
| Chunk Size (MB) | 128 | 128 | 128 | 128 |
| Theoretical Throughput (Gbps) | 233 | >1000 | >1000 | >1000 |
| Measured Throughput (Gbps) | 76.19 | 92.12 | 92.18 | 92.19 |

Maximum performance for each frequency, with 32 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 210 MHz | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|---|
| 1 | 34.95 | 53.10 | 55.94 | 58.22 |
| 2 | 45.55 | 65.74 | 69.29 | 70.14 |
| 4 | 54.87 | 77.60 | 78.84 | 79.42 |
| 8 | 63.63 | 83.86 | 84.51 | 84.90 |
| 16 | 68.49 | 87.46 | 87.41 | 88.62 |
| 32 | 74.22 | 90.33 | 90.70 | 90.82 |
| 64 | 74.76 | 91.59 | 91.67 | 91.75 |
| 128 | 76.19 | 92.12 | 92.18 | 92.19 |

Measured throughput (Gbps) by chunk size and clock frequency, with 32 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 210 MHz | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|---|
| 1 | 217.08 | 140.26 | 131.02 | 127.27 |
| 2 | 337.26 | 230.02 | 219.33 | 216.18 |
| 4 | 565.33 | 397.69 | 391.19 | 388.32 |
| 8 | 979.70 | 741.06 | 735.48 | 731.93 |
| 16 | 1828.08 | 1429.64 | 1428.94 | 1410.75 |
| 32 | 3382.15 | 2775.78 | 2765.52 | 2762.26 |
| 64 | 6723.19 | 5487.01 | 5482.41 | 5477.66 |
| 128 | 13204.04 | 10920.58 | 10913.63 | 10912.11 |

Measured RTT per chunk (µs) by chunk size and clock frequency, with 32 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.
In the tables showing maximum performance for each frequency, one can see good agreement between theoretical and measured performance, with the measured throughput somewhat below the theoretical limit because our derivation assumed the DMA buffer was fully utilized. The overall maximum performance (92 Gbps) did not come closer to 100 Gbps because of limitations on the sender side: the throughput when receiving packets and sending ACKs without any processing was the same, proving that the bottleneck is no longer the memcpy operation once cudaMemcpy2DAsync is used. In the tables with variable chunk sizes, the throughput increases with larger chunk sizes. Although the RTT also grows with chunk size, the throughput calculated as (chunk size) / (RTT) actually increases, so the program can utilize more of the NIC bandwidth. RTT and latency will be investigated in more detail in a future post.
Selecting The Proper GPU
This section provides some pointers to help select a GPU with the proper specs for a given application. Modern GPUs have clock frequencies above 1000 MHz, so they all pass the minimum bar to process packets from a 100 Gbps NIC. An important caveat, however, is that cudaMemcpy2DAsync can be used only when each packet has the same size and the packets are evenly spaced in memory (a sketch of such a call is shown after the table below). If your workload must handle payloads of varying length, the current method cannot be applied, and the clock frequency then strongly impacts throughput, with higher-frequency GPUs achieving a higher peak. The table below lists the frequency and the peak theoretical throughput when using cudaMemcpy2DAsync for reference. When the maximum throughput is greater than 100 Gbps, the full performance of 100GbE can be expected to be utilized.
Peak theoretical throughput (Gbps) by clock frequency and the number of payloads copied by one cudaMemcpy2DAsync call:

| Frequency (MHz) | 1 payload | 2 payloads | 32 payloads |
|---|---|---|---|
| 1000 | 34 | 69 | 1111 |
| 1400 | 48 | 97 | 1555 |
| 1800 | 62 | 125 | 2000 |
| 2200 | 76 | 152 | 2444 |
| 2600 | 90 | 180 | 2888 |
| 3000 | 104 | 208 | 3333 |
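To make the even-spacing requirement concrete, the host-side sketch below gathers a batch of equally sized, equally spaced payloads from a receive buffer into a contiguous buffer with a single cudaMemcpy2DAsync call. The buffer names and the 8192-byte stride are made-up examples; lightning-kit issues the equivalent copy from inside its CUDA kernels.

```cpp
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t payload_bytes = 8000;  // one packet's payload (~64 kbit)
    const size_t stride_bytes  = 8192;  // assumed fixed spacing in the cyclic buffer
    const size_t num_packets   = 32;    // payloads merged into one 2D copy

    uint8_t *rx_ring = nullptr, *frame = nullptr;
    cudaMalloc(&rx_ring, stride_bytes * num_packets);   // stand-in for the DMA ring
    cudaMalloc(&frame,   payload_bytes * num_packets);  // reconstructed data buffer

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // One 2D copy replaces num_packets separate cudaMemcpyAsync launches:
    // each "row" is one payload (width = payload_bytes, height = num_packets),
    // source rows start every stride_bytes, destination rows are tightly packed.
    cudaError_t err = cudaMemcpy2DAsync(
        frame,   payload_bytes,      // dst, dst pitch (packed)
        rx_ring, stride_bytes,       // src, src pitch (packet spacing)
        payload_bytes, num_packets,  // width in bytes, number of rows
        cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    printf("cudaMemcpy2DAsync: %s\n", cudaGetErrorString(err));

    cudaFree(rx_ring);
    cudaFree(frame);
    cudaStreamDestroy(stream);
    return 0;
}
```

Merging 32 fixed-size payloads this way is exactly what divides the effective per-packet launch cost by 32 in the theory section above.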
DMA to GPU device memory uses RDMA, but some GPUs don't support it, most notably the GeForce series. Another consideration is the BAR1 memory size. BAR1 memory is exposed to the NIC and sets an upper limit on the size of the DMA area, so a larger size is preferred; for these experiments, anything over 4 GB was enough. The BAR1 size depends on the GPU hardware and its running mode. In the case of the A6000, the mode was changed with displaymodeselector to disable graphics mode. You can check the size of BAR1 with nvidia-smi -q:
$ nvidia-smi -q
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conclusion
This article showed that Fixstars Solutions' lightning-kit technology can make use of nearly the full performance of a 100 Gbps NIC, and provided some pointers for selecting a GPU for your system. Fixstars will continue to investigate and post the latest status of this endeavor. To readers with high-performance data transfer requirements: consider lightning-kit as an open-source solution for your networking needs!
Author
Masaru Ito