I know that loopback traffic still goes through the kernel network stack up to the IP layer, which includes syscall overhead and some memory-copy overhead. DPDK and RDMA use different techniques to avoid these.
So let's say I have two machines connected by DPDK/RDMA and I run a network latency test between them: will that be faster than loopback on a single machine? As a quick test, I pinged localhost on a CPU E5-2630 v4 @ 2.20GHz, which averaged 0.010 ms.

I came up with this question while testing my Ceph cluster using vstart.sh: I want to minimize network latency so that I can carefully analyze how OSD-related code affects latency.

Liang Mury
  • Hi @LiangMury, can you please confirm whether your question is `Does DPDK/RDMA between 2 machines give lower latency than a localhost ping` or `How do I minimize latency for packets`? – Vipin Varghese Feb 23 '21 at 02:01
  • Hi @VipinVarghese, my question is `Does DPDK/RDMA between 2 machines give lower latency than a localhost ping`. I am just curious whether a localhost ping is faster than any network connection between remote machines, no matter what technology it uses. – Liang Mury Feb 23 '21 at 02:55
  • Are there any updates? My custom ICMP reply only gets around 0.045 ms – Vipin Varghese Mar 09 '21 at 07:00
  • @LiangMury you may want to take a look at https://blogs.oracle.com/linux/the-power-of-xdp, which shows PING can respond with `time=0.03003 ms`. On the flip side, it never tells you the rate at which the PINGs are sent. With DPDK/XDP at 14 Mpps sent as ICMP requests, I do not think you will attain `time=0.025 ms` or less – Vipin Varghese Mar 19 '21 at 15:27

1 Answer


Based on the conversation in the comments, the real question is: does DPDK/RDMA between 2 machines give lower latency than a localhost ping?

[Answer] Yes, you can achieve the same, but with some caveats:

  1. DPDK rte_eth_tx_burst only enqueues the packet descriptors for DMA over PCIe; it does not by itself mean the packet has gone out on the wire.
  2. DPDK rte_eth_tx_buffer_flush explicitly flushes any previously buffered packets to the hardware.
  3. It is costly to turn the received ICMP request (the RX buffer) into a reply by modifying it byte by byte; instead, use rte_pktmbuf_alloc to grab an mbuf and set its refcnt to a high value such as 250 so it can be reused.
  4. Prepare the new buffer with the right Ethernet, IP and ICMP payload data (see the sketch below).
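
For reference, here is a minimal sketch of points 1–4, assuming DPDK 21.11-era headers and that the EAL, the port, the TX queue, the mempool and the `rte_eth_dev_tx_buffer` have already been initialized elsewhere; `build_icmp_reply`, `send_reply_now`, `port_id`, `txq`, `mbuf_pool` and `tx_buffer` are illustrative names, and MAC/IP addressing plus ICMP checksum handling are left out:

```c
/*
 * Minimal sketch only: EAL/port/queue/mempool/tx_buffer initialization is
 * assumed to be done elsewhere; port_id, txq, mbuf_pool and tx_buffer are
 * placeholder names.
 */
#include <string.h>
#include <netinet/in.h>

#include <rte_byteorder.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_icmp.h>
#include <rte_ip.h>
#include <rte_mbuf.h>

/* Point 3: allocate a fresh mbuf once and keep it alive with a high refcnt,
 * instead of rewriting the received request byte by byte. */
static struct rte_mbuf *
build_icmp_reply(struct rte_mempool *mbuf_pool)
{
	struct rte_mbuf *m = rte_pktmbuf_alloc(mbuf_pool);
	if (m == NULL)
		return NULL;

	/* With refcnt at 250, the PMD's free after TX completion only
	 * decrements the counter, so the same pre-built reply can be
	 * re-queued many times without re-allocating. */
	rte_mbuf_refcnt_set(m, 250);

	struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
	struct rte_ipv4_hdr *ip   = (struct rte_ipv4_hdr *)(eth + 1);
	struct rte_icmp_hdr *icmp = (struct rte_icmp_hdr *)(ip + 1);
	uint16_t len = sizeof(*eth) + sizeof(*ip) + sizeof(*icmp);

	/* Point 4: prepare the Ethernet, IP and ICMP headers up front.
	 * Source/destination MAC and IP addresses for the test hosts are
	 * omitted here; fill them in before computing the IP checksum. */
	memset(eth, 0, len);
	eth->ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4);

	ip->version_ihl   = 0x45;  /* IPv4, 5 x 32-bit header words */
	ip->total_length  = rte_cpu_to_be_16(sizeof(*ip) + sizeof(*icmp));
	ip->time_to_live  = 64;
	ip->next_proto_id = IPPROTO_ICMP;
	/* ip->hdr_checksum = rte_ipv4_cksum(ip); -- once addresses are set */

	icmp->icmp_type = RTE_IP_ICMP_ECHO_REPLY;
	icmp->icmp_code = 0;
	/* icmp_ident, icmp_seq_nb and icmp_cksum must be copied or
	 * recomputed per request in a real responder. */

	m->pkt_len  = len;
	m->data_len = len;
	return m;
}

/* Points 1 and 2: queue the reply and flush it to the NIC immediately,
 * rather than waiting for the software TX buffer to fill. */
static void
send_reply_now(uint16_t port_id, uint16_t txq,
	       struct rte_eth_dev_tx_buffer *tx_buffer, struct rte_mbuf *m)
{
	rte_eth_tx_buffer(port_id, txq, tx_buffer, m);
	rte_eth_tx_buffer_flush(port_id, txq, tx_buffer);
}
```

The idea is simply to pay the header-construction cost once and keep the per-request transmit path down to a queue-and-flush.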

Hence, with the right NIC (one that supports low-latency transmit), the DPDK API rte_eth_tx_buffer_flush, and a pre-allocated mbuf with its refcnt raised, you can achieve 0.010 ms on average.

Note: For a better baseline, use a packet generator or packet blaster to send ICMP requests to the target machine with both the kernel and the DPDK solution, and compare real performance under load at line rates such as 1%, 5%, 10%, 25%, 50%, 75% and 100%.