High-Performance Networking Series 2: Efficient Host to Device Copies for High-Throughput TCP Data Reconstruction
Introduction
Recent work at Fixstars Solutions focuses on developing high-throughput data transfer software. This second article in our high-performance networking series demonstrates our process of tuning the direct transfer of packets from a 100 GbE NIC to an Nvidia GPU, where the data is reconstructed. We begin by describing the theory behind the performance of the host-to-device transfer, which provides hints for selecting the proper GPU for a given application. Our experiments show that an impressive 92 Gbps was achieved using Fixstars' open-source lightning-kit data transfer framework, in line with theoretical predictions.
Theory
Earlier experiments revealed that the performance bottleneck in the lightning-kit data flow was the overhead incurred when calling cudaMemcpyAsync during data reconstruction. The use of cudaMemcpyAsync is essential when building data from Ethernet payloads because it allows memory copies in one CUDA stream to overlap with CUDA kernel execution in another. However, measurements showed that launching cudaMemcpyAsync costs about 2000 clock cycles (you can measure the clock frequency of your GPU here), and when all threads in a warp call it, this cost is multiplied by warpSize. Based on this, the frame-building kernel performance can be modeled by the equation

$$\mathrm{cycles}_{\mathrm{iteration}} = \mathrm{cycles}_{\mathrm{memcpy}} \times \mathrm{packets} + \mathrm{offset}$$

where $\mathrm{cycles}_{\mathrm{iteration}}$ refers to the number of clock cycles in one iteration of the frame-building loop, $\mathrm{cycles}_{\mathrm{memcpy}}$ refers to the number of clock cycles consumed to initiate one call to cudaMemcpyAsync, $\mathrm{offset}$ is the cost of the rest of the loop iteration excluding cudaMemcpyAsync, and $\mathrm{packets}$ is the number of packets processed in one iteration.
Theoretically, the maximum number of bits that can be processed in one loop iteration is equal to the size of the DMA buffer where Ethernet payloads are written by the NIC. Calling this variable $\mathrm{bits}$, we have the constraint

$$\mathrm{packets} \times \mathrm{payload} \le \mathrm{bits}$$

where $\mathrm{payload}$ refers to the number of bits in the payload of one packet, which is determined by the MTU of the network interface. In this experiment, each packet carried about 8000 bytes, or 64 kbit, of payload.
Assuming the constraint is saturated, the theoretical throughput $\mathrm{bps}$ can be calculated (neglecting the time between loop iterations) as

$$\mathrm{bps} = \frac{\mathrm{packets} \times \mathrm{payload} \times f}{\mathrm{cycles}_{\mathrm{memcpy}} \times \mathrm{packets} + \mathrm{offset}}$$

where $f$ is the GPU clock frequency. Since $\mathrm{cycles}_{\mathrm{memcpy}}$ is strictly positive, increasing $\mathrm{packets}$ yields diminishing returns, and with everything else fixed the throughput saturates at $\mathrm{payload} \times f / \mathrm{cycles}_{\mathrm{memcpy}}$ (a reflection of Amdahl's Law). The above equation can be rearranged to

$$\mathrm{packets} = \frac{\mathrm{offset}}{\dfrac{\mathrm{payload} \times f}{\mathrm{bps}} - \mathrm{cycles}_{\mathrm{memcpy}}}$$

which gives the number of packets that must be processed per iteration to reach a target throughput.
From our research, the cost to launch cudaMemcpyAsync is around 1800 clock cycles, and the offset of lightning-kit's frame-processing loop is roughly 100k–500k cycles. The clock frequency is determined by the particular GPU in use. The following graph shows the theoretical relationship between the required number of packets in one loop iteration and the target throughput in Gbps. Curves for typical GPU clock frequencies are shown.
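As a quick sanity check on where these curves flatten out (this calculation is ours, using the roughly 1800-cycle launch cost and 64 kbit payload quoted above), the saturation throughput at a clock frequency $f$ is

$$\mathrm{bps}_{\max} = \lim_{\mathrm{packets} \to \infty} \mathrm{bps} = \frac{\mathrm{payload} \times f}{\mathrm{cycles}_{\mathrm{memcpy}}} \approx \frac{64{,}000 \times 1.41 \times 10^{9}}{1800} \approx 50\ \mathrm{Gbps}$$

at 1410 MHz, and likewise roughly 36 Gbps at 1005 MHz and 69 Gbps at 1950 MHz, consistent with the theoretical throughputs listed in the results below.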
This graph clearly shows the saturation of the throughput as the number of packets processed in a loop iteration approaches infinity (that is, as the DMA buffer size grows). The frequencies chosen match those of popular GPUs: the A100 runs at a nominal 1410 MHz, the RTX A6000 (Ampere generation) at 1950 MHz, and the GeForce RTX 4090 at 2520 MHz (note that GeForce products don't support RDMA; the 4090 is included only as a reference). When using cudaMemcpyAsync, it is clear that currently available GPUs do not reach the nominal 100 Gbps throughput of the NIC. To improve this, the cost to launch cudaMemcpyAsync must be reduced. In this use case, most packets have the same payload size, and the distance from one packet to the next in memory is constant because of DOCA's RXQ cyclic buffer, which means cudaMemcpy2DAsync can be utilized, copying more than one payload per function call. The following graphs show the theoretical results when 2 and 32 payloads are copied by one call to cudaMemcpy2DAsync, so that the effective $\mathrm{cycles}_{\mathrm{memcpy}}$ is divided by 2 or 32 compared to the previous graph.
The first graph above demonstrates that GPUs with clock speeds greater than 1410 MHz can keep up with a 100 Gbps NIC. The second graph shows that GPUs can process data at 100 Gbps even when the frequency is 210 MHz, the minimum frequency of an RTX A6000!
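The curves above can be reproduced by evaluating the model directly. The following is a minimal, standalone sketch (not part of lightning-kit) that prints the predicted throughput for a few packet counts and clock frequencies; the 300k-cycle offset is an assumed midpoint of the 100k–500k range quoted above.

```cpp
#include <cstdio>
#include <initializer_list>

// Theoretical throughput (Gbps) of the frame-building loop:
//   bps = packets * payload * f / (cycles_memcpy * packets + offset)
double model_gbps(double packets, double payload_bits, double freq_hz,
                  double cycles_memcpy, double offset_cycles) {
    return packets * payload_bits * freq_hz /
           (cycles_memcpy * packets + offset_cycles) / 1e9;
}

int main() {
    const double payload_bits  = 64000.0;   // ~8000-byte payload per packet
    const double launch_cycles = 1800.0;    // cost of one cudaMemcpyAsync launch
    const double offset_cycles = 300000.0;  // assumed midpoint of the 100k-500k range
    const double freqs_mhz[]   = {1005.0, 1410.0, 1950.0};
    const int    merges[]      = {1, 2, 32};  // payloads per cudaMemcpy2DAsync call

    for (int m : merges) {
        printf("== %d payload(s) per copy ==\n", m);
        for (double f_mhz : freqs_mhz) {
            // sweep the number of packets handled per loop iteration
            for (double packets : {64.0, 512.0, 4096.0}) {
                printf("  f = %6.0f MHz, packets = %5.0f  ->  %6.1f Gbps\n",
                       f_mhz, packets,
                       model_gbps(packets, payload_bits, f_mhz * 1e6,
                                  launch_cycles / m, offset_cycles));
            }
        }
    }
    return 0;
}
```

Sweeping the packet count toward the DMA buffer capacity reproduces the same saturation behavior shown in the graphs.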
Experimental Results
System Topology
In these experiments, the sender and receiver machines both use Mellanox 100 GbE NICs, and the data is sent over TCP, though Fixstars' lightning-kit also supports UDP. All of the API calls that handle receiving and sending Ethernet packets are made from the GPU using the DOCA GPUNetIO library. On the GPU, three CUDA kernels run continuously: one polls the NIC's RX queue status to notice when packets are ready, the second manages sending ACK packets back to the sender, and the third reconstructs the transmitted data from the received Ethernet payloads. A diagram of this scenario is shown below, along with a structural sketch of the three kernels.
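The sketch below gives a rough structural picture of this arrangement. It is not lightning-kit's actual code: the DOCA GPUNetIO device calls are elided into comments, and the kernel names, block sizes, and exit-flag mechanism are illustrative assumptions only.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Persistent kernel 1: poll the RX queue until packets arrive.
// (The real implementation uses DOCA GPUNetIO device calls here.)
__global__ void rx_poll_kernel(volatile int* exit_flag) {
    while (!*exit_flag) {
        // ... check RX queue descriptors, publish info about ready packets ...
    }
}

// Persistent kernel 2: build and send ACK packets back to the sender.
__global__ void ack_kernel(volatile int* exit_flag) {
    while (!*exit_flag) {
        // ... craft TCP ACKs for received sequence numbers and enqueue them ...
    }
}

// Persistent kernel 3: the frame-building loop modeled in the theory section,
// copying Ethernet payloads from the cyclic DMA buffer into the output buffer.
__global__ void rebuild_kernel(volatile int* exit_flag) {
    while (!*exit_flag) {
        // ... for each ready packet, copy its payload to the right offset ...
    }
}

int main() {
    // Host-visible exit flag so the persistent kernels can be told to stop.
    int* exit_flag = nullptr;
    cudaHostAlloc(&exit_flag, sizeof(int), cudaHostAllocMapped);
    *exit_flag = 0;

    // One stream per kernel so the three loops run concurrently.
    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);
    rx_poll_kernel<<<1, 32, 0, s1>>>(exit_flag);
    ack_kernel<<<1, 32, 0, s2>>>(exit_flag);
    rebuild_kernel<<<1, 256, 0, s3>>>(exit_flag);

    // ... wait for the transfer to finish, then signal the kernels to exit ...
    *exit_flag = 1;
    cudaDeviceSynchronize();
    cudaFreeHost(exit_flag);
    printf("done\n");
    return 0;
}
```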
Testing Environment
- Receiver PC
  - CPU: AMD EPYC 7313 16-Core Processor
  - GPU: NVIDIA RTX A6000
  - NIC: ConnectX-6 Dx
  - CUDA Toolkit: 12.4
  - Nvidia driver: 550.54.15
  - DPDK: 22.11.2401.1.0
  - DOCA: 2.6.0058
  - PCIe: gen 4, x16 width
- Sending PC (capable of packet throughputs up to 100 Gbps)
  - CPU: Intel(R) Xeon(R) E5-2620 v3 @ 2.40GHz
  - NIC: ConnectX-6 Dx
  - PCIe: gen 3, x16 width
Methodology
As mentioned above, TCP was used as the protocol in this experiment, so the throughput was controlled by changing the chunk size on the sender machine. Chunk size refers to the amount of data sent before the sender compares the sequence number of the received ACK with the expected value; once these numbers match, the client sends the next chunk (a sketch of this flow control is shown below). Because the client pauses to wait for ACKs less frequently, larger chunks achieve higher throughput in low-packet-loss environments. We also changed the GPU clock frequency artificially using nvidia-smi. Because it is difficult to fix the number of packets in one loop iteration, the best performance achieved was recorded and compared to the simulated results above. 128 GB of data was sent in every experiment.
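For illustration, the following sketch shows the sender-side flow control described above. The send_bytes and wait_for_ack helpers are hypothetical stubs that only simulate sequence numbers; they are not lightning-kit's sender API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the sender's packet I/O. The real sender talks to
// the NIC; these stubs only simulate sequence numbers so the sketch runs.
static uint32_t g_seq = 0;
static uint32_t send_bytes(const uint8_t*, size_t len) { g_seq += (uint32_t)len; return g_seq; }
static uint32_t wait_for_ack() { return g_seq; }  // pretend everything sent was ACKed

// Send `total` bytes in chunks: the sender compares the received ACK sequence
// number with the expected one only once per chunk, so larger chunks mean
// fewer pauses and higher throughput when packet loss is low.
static void send_in_chunks(const uint8_t* data, size_t total, size_t chunk_size) {
    size_t sent = 0;
    while (sent < total) {
        size_t len = (total - sent < chunk_size) ? (total - sent) : chunk_size;
        uint32_t expected_seq = send_bytes(data + sent, len);
        while (wait_for_ack() != expected_seq) {
            // drain ACKs for earlier packets of this chunk
        }
        sent += len;
    }
}

int main() {
    static uint8_t buf[1 << 20];                  // 1 MB of dummy data
    send_in_chunks(buf, sizeof(buf), 64 * 1024);  // 64 KB chunks
    printf("done, last sequence number = %u\n", g_seq);
    return 0;
}
```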
Results
When using cudaMemcpyAsync, the following data was collected:

| GPU | A6000 | A6000 | A6000 |
|---|---|---|---|
| Clock Frequency (MHz) | 1005 | 1410 | 1950 |
| Chunk Size (MB) | 2 | 2 | 2 |
| Theoretical Throughput (Gbps) | 35 | 49 | 69 |
| Measured Throughput (Gbps) | 29.88 | 42.87 | 51.66 |

Maximum performance for each frequency, cudaMemcpyAsync version.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 28.07 | 38.23 | 48.2 |
| 2 | 29.88 | 42.87 | 51.66 |

Measured throughput (Gbps) by chunk size and clock frequency, cudaMemcpyAsync version. Packets on the cyclic buffer were overwritten before their payloads were copied into a data buffer when the chunk size was more than 2 MB, so larger chunks are not listed.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 269.54 | 197.68 | 155.28 |
| 2 | 518.25 | 358.69 | 296.6 |

Measured RTT per chunk (µs) by chunk size and clock frequency, cudaMemcpyAsync version.

When using cudaMemcpy2DAsync, the data was as follows:
| GPU | A6000 | A6000 | A6000 |
|---|---|---|---|
| Clock Frequency (MHz) | 1005 | 1410 | 1950 |
| Chunk Size (MB) | 128 | 64 | 128 |
| Theoretical Throughput (Gbps) | 71 | 99 | 138 (over NIC spec) |
| Measured Throughput (Gbps) | 56.98 | 77.51 | 90.97 |

Maximum performance for each frequency, with 2 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 43.84 | 49.66 | 55.37 |
| 2 | 48.36 | 60.29 | 66.26 |
| 4 | 51.36 | 65.24 | 73.97 |
| 8 | 52.79 | 68.38 | 80.42 |
| 16 | 56.52 | 72.33 | 83.71 |
| 32 | 56.80 | 77.05 | 86.59 |
| 64 | 56.67 | 77.51 | 89.27 |
| 128 | 56.98 | 76.21 | 90.97 |

Measured throughput (Gbps) by chunk size and clock frequency, with 2 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|
| 1 | 171.28 | 150.46 | 134.20 |
| 2 | 317.22 | 252.99 | 229.58 |
| 4 | 604.69 | 474.31 | 417.41 |
| 8 | 1180.63 | 911.14 | 773.20 |
| 16 | 2215.67 | 1729.18 | 1492.52 |
| 32 | 4420.86 | 3256.96 | 2897.68 |
| 64 | 8874.23 | 6485.43 | 5629.78 |
| 128 | 17658.84 | 13201.35 | 11058.74 |

Measured RTT per chunk (µs) by chunk size and clock frequency, with 2 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.
| GPU | A6000 | A6000 | A6000 | A6000 |
|---|---|---|---|---|
| Clock Frequency (MHz) | 210 | 1005 | 1410 | 1950 |
| Chunk Size (MB) | 128 | 128 | 128 | 128 |
| Theoretical Throughput (Gbps) | 233 | >1000 | >1000 | >1000 |
| Measured Throughput (Gbps) | 76.19 | 92.12 | 92.18 | 92.19 |

Maximum performance for each frequency, with 32 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 210 MHz | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|---|
| 1 | 34.95 | 53.10 | 55.94 | 58.22 |
| 2 | 45.55 | 65.74 | 69.29 | 70.14 |
| 4 | 54.87 | 77.60 | 78.84 | 79.42 |
| 8 | 63.63 | 83.86 | 84.51 | 84.90 |
| 16 | 68.49 | 87.46 | 87.41 | 88.62 |
| 32 | 74.22 | 90.33 | 90.70 | 90.82 |
| 64 | 74.76 | 91.59 | 91.67 | 91.75 |
| 128 | 76.19 | 92.12 | 92.18 | 92.19 |

Measured throughput (Gbps) by chunk size and clock frequency, with 32 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.

| Chunk Size (MB) | 210 MHz | 1005 MHz | 1410 MHz | 1950 MHz |
|---|---|---|---|---|
| 1 | 217.08 | 140.26 | 131.02 | 127.27 |
| 2 | 337.26 | 230.02 | 219.33 | 216.18 |
| 4 | 565.33 | 397.69 | 391.19 | 388.32 |
| 8 | 979.70 | 741.06 | 735.48 | 731.93 |
| 16 | 1828.08 | 1429.64 | 1428.94 | 1410.75 |
| 32 | 3382.15 | 2775.78 | 2765.52 | 2762.26 |
| 64 | 6723.19 | 5487.01 | 5482.41 | 5477.66 |
| 128 | 13204.04 | 10920.58 | 10913.63 | 10912.11 |

Measured RTT per chunk (µs) by chunk size and clock frequency, with 32 cudaMemcpyAsync calls merged into one cudaMemcpy2DAsync.
In the tables showing maximum performance for each frequency, one can see good agreement between theoretical and measured performance, with the measured throughput somewhat below the theoretical limit because our derivation assumed the DMA buffer was fully utilized. The overall maximum performance (92 Gbps) did not come closer to 100 Gbps because of limitations on the sender side: the throughput when receiving packets and sending ACKs without any processing was the same, proving that the bottleneck is no longer the memcpy operation once cudaMemcpy2DAsync is used. In the tables with variable chunk sizes, the throughput increases with larger chunk sizes. Although the RTT also grows with chunk size, the throughput calculated as (chunk size) / (RTT) actually increases, so the program can utilize more of the NIC bandwidth. RTT and latency will be investigated in more detail in a future post.
Selecting The Proper GPU
This section provides some pointers to help select a GPU with the proper specs for a given application. Modern GPUs have clock frequencies above 1000 MHz, so they all pass the minimum bar to process packets from a 100 Gbps NIC. An important caveat, however, is that cudaMemcpy2DAsync can be used only when each packet has the same size and the packets are evenly spaced in memory (a sketch of such a call is shown after the table below). If your workload must handle payloads of varying length, the current method cannot be applied, and the clock frequency then strongly impacts throughput, with higher-frequency GPUs achieving a higher peak. The table below lists the frequency and the peak theoretical throughput when using cudaMemcpy2DAsync for reference. When the maximum throughput is greater than 100 Gbps, the full performance of 100GbE can be expected to be utilized.
Peak theoretical throughput (Gbps) by clock frequency and the number of payloads copied by one cudaMemcpy2DAsync call:

| Frequency (MHz) | 1 payload | 2 payloads | 32 payloads |
|---|---|---|---|
| 1000 | 34 | 69 | 1111 |
| 1400 | 48 | 97 | 1555 |
| 1800 | 62 | 125 | 2000 |
| 2200 | 76 | 152 | 2444 |
| 2600 | 90 | 180 | 2888 |
| 3000 | 104 | 208 | 3333 |
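To make the even-spacing requirement concrete, the host-side sketch below gathers a batch of equally sized, equally spaced payloads from a receive buffer into a contiguous buffer with a single cudaMemcpy2DAsync call. The buffer names and the 8192-byte stride are made-up examples; lightning-kit issues the equivalent copy from inside its CUDA kernels.

```cpp
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t payload_bytes = 8000;  // one packet's payload (~64 kbit)
    const size_t stride_bytes  = 8192;  // assumed fixed spacing in the cyclic buffer
    const size_t num_packets   = 32;    // payloads merged into one 2D copy

    uint8_t *rx_ring = nullptr, *frame = nullptr;
    cudaMalloc(&rx_ring, stride_bytes * num_packets);   // stand-in for the DMA ring
    cudaMalloc(&frame,   payload_bytes * num_packets);  // reconstructed data buffer

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // One 2D copy replaces num_packets separate cudaMemcpyAsync launches:
    // each "row" is one payload (width = payload_bytes, height = num_packets),
    // source rows start every stride_bytes, destination rows are tightly packed.
    cudaError_t err = cudaMemcpy2DAsync(
        frame,   payload_bytes,      // dst, dst pitch (packed)
        rx_ring, stride_bytes,       // src, src pitch (packet spacing)
        payload_bytes, num_packets,  // width in bytes, number of rows
        cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    printf("cudaMemcpy2DAsync: %s\n", cudaGetErrorString(err));

    cudaFree(rx_ring);
    cudaFree(frame);
    cudaStreamDestroy(stream);
    return 0;
}
```

Merging 32 fixed-size payloads this way is exactly what divides the effective per-packet launch cost by 32 in the theory section above.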
DMA to GPU device memory uses RDMA, but some GPUs don't support it, most notably the GeForce series. Another consideration is the BAR1 memory size. BAR1 memory is exposed to the NIC and sets an upper limit on the size of the DMA area, so a larger size is preferred; for these experiments, anything over 4 GB was enough. The BAR1 size depends on the GPU hardware and its running mode. In the case of the A6000, the mode was changed with displaymodeselector to disable graphics mode. You can check the size of BAR1 with nvidia-smi -q:
$ nvidia-smi -q
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Conclusion
This article showed that Fixstars Solutions' lightning-kit technology can make use of nearly the full performance of a 100 Gbps NIC, and provided some pointers for selecting a GPU for your system. Fixstars will continue to investigate and post the latest status of this endeavor. To readers with high-performance data transfer requirements: consider lightning-kit as an open-source solution for your networking needs!
Author
Masaru Ito