Why another RDMA wrapper
TL;DR¶
If you're writing synchronous RDMA code in Rust and you're tired of:
- Hand-rolling FFI over `ibverbs-sys` / `rdma-core-sys`
- Fighting lifetimes over simple CQ / QP / MR ownership, and
- Rebuilding `rdma-core` just to link a tiny binary
Then sideway gives you:
- Rust-flavored wrappers over modern ibverbs (`ibv_wr_*`, `ibv_start_poll`, Extended CQ / QP)
- A `dlopen`-based static library so that you don't vendor rdma-core
- A basic but usable wrapper for `librdmacm`
Check out our code at https://github.com/RDMA-Rust/sideway
This post is about how we built it, what trade-offs we made, and what we learned on the way.
What do we get?¶
- A Rust-flavored wrapper for rdma-core (it can saturate a 400 Gbps RNIC)
- Two bug fixes submitted to the rdma-core upstream
- A "dummy" companion library (rdma-core-mummy) for C/C++ programmers using `libibverbs` / `librdmacm`
- A lot of Rust + RDMA programming experience
What's wrong with current wrappers?¶
Most current libibverbs and librdmacm Rust wrappers focus heavily on fancy language features (100% safe, async-first APIs, etc.) and less on being production-ready. They are great for exploration, and provided a lot of insight for our own implementation, but they become awkward when you try to ship a serious RDMA application.
When you actually start coding, you often end up:
- Reaching for `ibverbs-sys` (or another bindgen-generated crate)
- Manually handling error-prone C macros / enum definitions
- Juggling C types and Rust types at the same time
On top of that, most crates assume you build and link the entire rdma-core project into your binary, or require rdma-core to be installed on your build machine. That pulls in unnecessary dependencies and build complexity.
We wanted something that:
- Keeps the control path ergonomic in Rust
- Keeps the data path as close as possible to raw `ibverbs` to achieve high performance
- Doesn't require vendoring rdma-core into every build
How does sideway compare?¶
We are not claiming to replace other crates; different projects make different trade-offs. But for high-performance, synchronous RDMA, there are some concrete issues you run into.
| Crate | Requires building rdma-core | Wrapping librdmacm | Performance issues |
|---|---|---|---|
| ibverbs | Yes | No | 1. Polls the CQ with a fixed batch size 2. No batched WR posting 3. Cannot specify send flags for moderation or fencing |
| rrddmma | Optional | No | Leaves options to the user |
| async-rdma | No (but a specific version is required) | Yes | 1. No batched WR posting 2. Cannot specify send flags for moderation or fencing 3. Creates a CQ for every QP |
| rdma | No (but a specific version is required) | No | Leaves options to the user |
| rdma-rs | Yes | Yes | 1. No batched WR posting 2. Cannot specify send flags for moderation or fencing |
| Sideway | No | Yes | Focused on modern verbs APIs (ibv_wr_*, CQ / QP Ex), zero-cost abstractions on the data path, and layered design |
Performance¶
To make sure our wrapper doesn't leave performance on the table, we vibe coded a perftest-style benchmarking tool called stride. It's not meant to be a full replacement, just enough to validate that "Rust + sideway" can keep up with the classic C tooling.
All experiments run on a Nebius GPU-H100-SXM instance with 8 × ConnectX-7 InfiniBand (IB) 400 Gbps RDMA NICs.
| Key | Value |
|---|---|
| Environment | GPU-H100-SXM EU-NORTH1 |
| OS | Ubuntu 24.04.3 LTS (Noble Numbat) x86_64 |
| CPU | 2 × Intel(R) Xeon(R) Platinum 8468 (128) @ 2.10 GHz |
| Kernel | 6.11.0-1016-nvidia |
| Memory | 1.536 TiB |
| RNIC | ConnectX-7 InfiniBand 400 Gbps (firmware 28.39.3004) |
| OFED | DOCA 2.9.2 |
| GPU | NVIDIA H100 SXM5 80GB |
| NVIDIA GPU Driver | 570.195.03 |
| CUDA | 12.8 |
| Stride | 87b3780 |
| Perftest | 20e05b9 |
We slightly modified stride's code to hardcode the LID in the connection setup phase, to make it work with IB. Apart from that, both tools use the same verbs settings.
What we measure¶
We look at three basic scenarios, always with a single RC QP and message sizes from 2B to 8MiB:
- Single QP WRITE bandwidth (host / GPU memory)
  - Host memory: `S-1QP-BW` (stride) vs `P-1QP-BW` (perftest)
  - GPU memory (GDR): `S-1QP-GDR-BW` vs `P-1QP-GDR-BW`
- Single QP "maxed out" WRITE bandwidth (GPU memory)
  - GPU memory (GDR), Tx depth 4096, post list 64: `S-1QP-GDR-MAX-BW` vs `P-1QP-GDR-MAX-BW`
  - To better see small-message behavior, we also add messages per second (MPS): `S-1QP-GDR-MAX-MPS` vs `P-1QP-GDR-MAX-MPS`
- Single QP WRITE latency (host / GPU memory)
  - Host memory: `S-1QP-LAT` (stride) vs `P-1QP-LAT` (perftest)
  - GPU memory (GDR): `S-1QP-GDR-LAT` (perftest does not support WRITE latency with GPU buffers)
  - We also add P99 for jitter: `S-1QP-LAT-P99` vs `P-1QP-LAT-P99` vs `S-1QP-GDR-LAT-P99`
Results at a glance¶
- Bandwidth (host memory / GDR): for a single QP, sideway/stride and perftest track each other closely across 2 B–8 MiB. At large messages, both saturate the 400 Gb/s link: stride reaches ~380 Gb/s, and perftest is within a few percent of that.
| Message Size (B) | S-1QP-BW (Gb/s) | P-1QP-BW (Gb/s) | S-1QP-GDR-BW (Gb/s) | P-1QP-GDR-BW (Gb/s) |
|---|---|---|---|---|
| 2 | 0.0676 | 0.068664 | 0.077 | 0.075424 |
| 4 | 0.1352 | 0.16 | 0.1546 | 0.15 |
| 8 | 0.278 | 0.31 | 0.3098 | 0.31 |
| 16 | 0.6217 | 0.63 | 0.6201 | 0.62 |
| 32 | 1.2465 | 1.26 | 1.2379 | 1.24 |
| 64 | 2.4941 | 2.51 | 2.4714 | 2.49 |
| 128 | 4.3461 | 5.04 | 4.9527 | 4.97 |
| 256 | 8.6742 | 10.12 | 9.9171 | 9.93 |
| 512 | 17.3712 | 20.25 | 19.7997 | 19.77 |
| 1024 | 34.7707 | 40.47 | 39.6772 | 39.8 |
| 2048 | 69.6075 | 80.88 | 68.9667 | 79.51 |
| 4096 | 139.2933 | 164.27 | 140.3711 | 160.21 |
| 8192 | 278.4098 | 305.2 | 291.2193 | 320.74 |
| 16384 | 360.9202 | 343.31 | 371.0165 | 371.11 |
| 32768 | 372.1774 | 358.08 | 381.8385 | 382.28 |
| 65536 | 377.9962 | 365.91 | 387.1709 | 387.1 |
| 131072 | 381.4992 | 370.98 | 390.377 | 390.55 |
| 262144 | 382.0199 | 372.04 | 391.1473 | 391.67 |
| 524288 | 381.505 | 372.35 | 391.1888 | 391.9 |
| 1048576 | 381.2342 | 372.09 | 391.1909 | 392.25 |
| 2097152 | 381.307 | 372.12 | 391.1919 | 392.39 |
| 4194304 | 380.6771 | 371.92 | 391.1924 | 392.44 |
| 8388608 | 379.7436 | 371.82 | 391.1926 | 392.44 |
# Perftest server
taskset -c 66 ib_write_bw -d mlx5_0 -n 100000 -a -Q 1
# Perftest client
taskset -c 68 ib_write_bw -d mlx5_1 -n 100000 -a 127.0.0.1 -Q 1
# Perftest server (GDR)
taskset -c 66 ib_write_bw -d mlx5_0 -n 100000 -a -Q 1 --use_cuda 4
# Perftest client (GDR)
taskset -c 68 ib_write_bw -d mlx5_1 -n 100000 -a 127.0.0.1 -Q 1 --use_cuda 5
# Stride server
taskset -c 66 stride-perf write bw -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024))
# Stride client
taskset -c 68 stride-perf write bw -d mlx5_1 127.0.0.1 -n 1000000 -a --max-msg-size $((8*1024*1024))
# Stride server (GDR)
taskset -c 66 stride-perf write bw -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024)) --use-cuda 4
# Stride client (GDR)
taskset -c 68 stride-perf write bw -d mlx5_1 127.0.0.1 -n 1000000 -a --max-msg-size $((8*1024*1024)) --use-cuda 5
- Bandwidth (GDR, maxed out): with a larger `tx_depth` and `post_list`, a single QP hits ~392.5 Gb/s, essentially line rate for a 400 Gb/s port. For small messages, throughput reaches ~20 Mpps.
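As a sanity check, the two metrics are consistent with each other: BW [Gb/s] ≈ MPS [Mpps] × message size [B] × 8 / 1000. For example, at 2048 B, 14.4764 Mpps × 2048 × 8 / 1000 ≈ 237.2 Gb/s, which matches the bandwidth column below.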
| Message Size (B) | S-1QP-GDR-MAX-BW (Gb/s) | S-1QP-GDR-MAX-MPS (Mpps) | P-1QP-GDR-MAX-BW (Gb/s) | P-1QP-GDR-MAX-MPS (Mpps) |
|---|---|---|---|---|
| 2 | 0.3153 | 19.7076 | 0.32 | 19.855156 |
| 4 | 0.632 | 19.7505 | 0.64 | 19.890087 |
| 8 | 1.2636 | 19.7432 | 1.28 | 19.92576 |
| 16 | 2.5247 | 19.7239 | 2.54 | 19.851894 |
| 32 | 5.0515 | 19.7323 | 5.07 | 19.800955 |
| 64 | 10.0584 | 19.6453 | 10.11 | 19.740872 |
| 128 | 20.068 | 19.5976 | 20.13 | 19.659211 |
| 256 | 39.8753 | 19.4703 | 39.73 | 19.400613 |
| 512 | 77.5093 | 18.9232 | 78.32 | 19.121788 |
| 1024 | 147.5344 | 18.0096 | 148.25 | 18.096757 |
| 2048 | 237.1809 | 14.4764 | 240.4 | 14.672663 |
| 4096 | 321.151 | 9.8007 | 324.23 | 9.894732 |
| 8192 | 358.1028 | 5.4642 | 361.24 | 5.512146 |
| 16384 | 391.0239 | 2.9833 | 391.81 | 2.989263 |
| 32768 | 391.1139 | 1.492 | 392.17 | 1.495991 |
| 65536 | 391.1547 | 0.7461 | 392.51 | 0.74865 |
| 131072 | 391.1747 | 0.3731 | 392.63 | 0.374441 |
| 262144 | 391.1837 | 0.1865 | 392.6 | 0.187209 |
| 524288 | 391.1888 | 0.0933 | 392.5 | 0.093579 |
| 1048576 | 391.1903 | 0.0466 | 392.46 | 0.046785 |
| 2097152 | 391.1912 | 0.0233 | 392.43 | 0.023391 |
| 4194304 | 391.1923 | 0.0117 | 392.44 | 0.011696 |
| 8388608 | 391.1925 | 0.0058 | 392.44 | 0.005848 |
# Perftest server
taskset -c 66 ib_write_bw -d mlx5_0 -n 102400 -a -Q 1 --report_gbits --use_cuda 4 -l 64 -t 4096
# Perftest client
taskset -c 68 ib_write_bw -d mlx5_1 -n 102400 -a 127.0.0.1 -Q 1 --report_gbits --use_cuda 5 -l 64 -t 4096
# Stride server
taskset -c 66 stride-perf write bw -d mlx5_0 -n 102400 -a --max-msg-size $((8*1024*1024)) --use-cuda 4 -l 64 -t 4096
# Stride client
taskset -c 68 stride-perf write bw -d mlx5_1 -n 102400 -a --max-msg-size $((8*1024*1024)) 127.0.0.1 --use-cuda 5 -l 64 -t 4096
- Latency: for small messages in host memory, perftest's numbers look better at first glance, but that's because the two tools measure slightly different things:
- perftest’s WRITE latency is effectively half of a round-trip: client WRITE → server DMA starts → server DMA ends (single-trip data + DMA time).
- stride’s WRITE latency measures data transfer + ACK (client WRITE → server NIC ACK → client CQE).
GPU WRITE latency with stride is slightly higher but still scales cleanly with message size.
| Message Size (B) | P-1QP-LAT (µs) | P-1QP-LAT-P99 (µs) | S-1QP-LAT (µs) | S-1QP-LAT-P99 (µs) | S-1QP-GDR-LAT (µs) | S-1QP-GDR-LAT-P99 (µs) |
|---|---|---|---|---|---|---|
| 2 | 3.38 | 3.47 | 5.286 | 5.487 | 5.412 | 5.487 |
| 4 | 3.38 | 3.46 | 5.322 | 5.495 | 5.411 | 5.479 |
| 8 | 3.38 | 3.47 | 5.334 | 5.495 | 5.42 | 5.483 |
| 16 | 3.39 | 3.47 | 5.336 | 5.495 | 5.418 | 5.487 |
| 32 | 3.4 | 3.51 | 5.329 | 5.495 | 5.415 | 5.483 |
| 64 | 3.4 | 3.48 | 5.351 | 5.495 | 5.432 | 5.499 |
| 128 | 3.46 | 3.57 | 5.371 | 5.543 | 5.45 | 5.511 |
| 256 | 3.51 | 3.62 | 5.421 | 5.575 | 5.473 | 5.539 |
| 512 | 3.57 | 3.71 | 5.48 | 5.619 | 5.486 | 5.555 |
| 1024 | 3.61 | 3.78 | 5.53 | 5.683 | 5.523 | 5.595 |
| 2048 | 3.75 | 4.19 | 5.634 | 5.743 | 5.583 | 5.647 |
| 4096 | 3.93 | 4.49 | 5.785 | 5.895 | 5.692 | 5.767 |
| 8192 | 4.13 | 4.75 | 5.947 | 6.163 | 5.887 | 6.027 |
| 16384 | 4.53 | 5.23 | 6.357 | 6.499 | 6.242 | 6.315 |
| 32768 | 5.07 | 5.61 | 6.968 | 7.087 | 6.724 | 6.803 |
| 65536 | 6.11 | 6.84 | 7.91 | 8.063 | 7.66 | 7.751 |
| 131072 | 9.53 | 9.9 | 10.644 | 10.807 | 10.12 | 10.207 |
| 262144 | 14.66 | 15.98 | 14.999 | 15.239 | 13.86 | 13.967 |
| 524288 | 20.12 | 21.64 | 20.692 | 21.599 | 19.25 | 19.359 |
| 1048576 | 30.93 | 33.13 | 31.571 | 33.567 | 29.983 | 30.079 |
| 2097152 | 53.5 | 56.62 | 53.569 | 57.503 | 51.442 | 51.551 |
| 4194304 | 99 | 103.14 | 97.649 | 104.575 | 94.327 | 94.463 |
| 8388608 | 189.92 | 194.66 | 186.243 | 196.351 | 180.107 | 180.223 |
# Perftest server
taskset -c 66 ib_write_lat -d mlx5_0 -n 100000 -a -I 0
# Perftest client
taskset -c 68 ib_write_lat -d mlx5_1 -n 100000 -a 127.0.0.1 -I 0
# Stride server
taskset -c 66 stride-perf write lat -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024))
# Stride client
taskset -c 68 stride-perf write lat -d mlx5_1 -n 100000 -a --max-msg-size $((8*1024*1024)) 127.0.0.1
# Stride server (GDR)
taskset -c 66 stride-perf write lat -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024)) --use-cuda 4
# Stride client (GDR)
taskset -c 68 stride-perf write lat -d mlx5_1 -n 100000 -a --max-msg-size $((8*1024*1024)) 127.0.0.1 --use-cuda 5
Design: layered abstraction¶
We treat RDMA abstraction as a layered problem:
rdma-mummy-sys -- the raw rdma-core API wrapper¶
The first layer is a bindgen-generated and manually corrected library on top of rdma-core (actually rdma-core-mummy; more on that in the dlopen section). It is:
- The base for all upper layers
- Available for users who want direct access to the C APIs with minimal sugar
Sideway -- the Rust-flavored wrapper over rdma-core¶
The second layer is sideway, a Rust-flavored rdma-core wrapper. It:
- Re-encapsulates C macros and anonymous enums as Rust enums
- Uses the builder pattern to make APIs more ergonomic
- Encapsulates an iterator-style CQ polling API
The iterator style is faster than `ibv_poll_cq` in some cases, and exposes more information: for example, it lets you directly obtain hardware timestamps from CQEs.
We deliberately do not make everything safe:
- Wrappers for `ibv_reg_mr` and `ibv_post_send` / `ibv_wr_*` remain `unsafe`
- Trying to fully encode their safety in this layer would either:
  - Hurt performance
  - Over-constrain upper layers that need to do unusual things (e.g., GDR with GPU memory; see the sketch below)
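To make the GDR point concrete, here is a minimal C sketch of the pattern a fully safe registration wrapper struggles to express: registering device memory that never came from the host allocator. The helper name register_gpu_buffer is ours, and it assumes GPUDirect RDMA (nvidia-peermem) is available.
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stddef.h>
// Register a cudaMalloc'd buffer so the NIC can DMA to/from GPU memory directly.
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t size, void **gpu_buf)
{
    if (cudaMalloc(gpu_buf, size) != cudaSuccess)
        return NULL;   // device allocation failed
    // The pointer is device memory, not a normal host allocation -- exactly the kind
    // of "unusual" buffer that a fully safe ibv_reg_mr wrapper tends to reject.
    return ibv_reg_mr(pd, *gpu_buf, size,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
}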
Instead, the idea is:
- Sideway gives you a clear, zero-cost mapping to modern verbs
- Upper layers (like `trespass` or your own library) can build safe memory pools or connection abstractions on top
For a better user experience, we also provide a unified interface over CQ / CQ Ex and QP / QP Ex.
If you've written RDMA before, you probably know ibv_post_send and ibv_poll_cq, but you may not have used ibv_wr_* and ibv_start_poll. These newer APIs are more efficient and expose features like hardware timestamps from work completions. Sideway lets you start with the "classic" style and move to the extended APIs with minimal changes.
Let's compare the traditional ibv_post_send, the newer ibv_wr_*, and our sideway.
struct ibv_send_wr wr;
struct ibv_sge sge = {
.addr = local_addr,
.length = 4096,
.lkey = lkey,
};
wr.wr_id = 233;
wr.next = NULL;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_addr;
wr.wr.rdma.rkey = rkey;
// Before actually posting, you could construct more WRs and use `next` to chain them,
// which can improve performance
struct ibv_send_wr *bad_wr;   // on failure, points at the WR that could not be posted
int ret = ibv_post_send(qp, &wr, &bad_wr);
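For reference, here is a minimal sketch of the same WRITE using the raw ibv_wr_* calls. This is our illustration, not code from the sideway repo; it assumes the QP was created through ibv_create_qp_ex with IBV_QP_EX_WITH_RDMA_WRITE in send_ops_flags, and reuses the variables from the snippet above.
struct ibv_qp_ex *qpx = ibv_qp_to_qp_ex(qp);
ibv_wr_start(qpx);                            // start building a batch of WRs
qpx->wr_id = 233;
qpx->wr_flags = IBV_SEND_SIGNALED;
ibv_wr_rdma_write(qpx, rkey, remote_addr);    // opcode + remote buffer
ibv_wr_set_sge(qpx, lkey, local_addr, 4096);  // local buffer
// More ibv_wr_* calls here would add further WRs to the same batch
int ret = ibv_wr_complete(qpx);               // publish the whole batch to the NIC
And the same operation through sideway: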
let mut guard = qp.start_post_send();
let write_handle = guard
.construct_wr(233, WorkRequestFlags::Signaled)
.setup_write(rmr.rkey(), rmr.get_ptr() as _);
write_handle.setup_sge(lmr.lkey(), lmr.get_ptr() as _, 4096);
// Before actually posting, you could construct more WRs and then post them all at once,
// which can improve performance
guard.post()?;
How about ibv_poll_cq vs ibv_start_poll vs sideway?
Although ibv_poll_cq looks cleaner, it doesn't support hardware timestamps and costs more cycles,
because it has to copy each CQE from the CQ into a work completion.
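For reference, a minimal sketch of the extended polling loop in C, assuming cq_ex is a struct ibv_cq_ex * created via ibv_create_cq_ex with IBV_WC_EX_WITH_COMPLETION_TIMESTAMP set in wc_flags:
struct ibv_poll_cq_attr attr = {0};
int ret = ibv_start_poll(cq_ex, &attr);       // 0 on a completion, ENOENT if the CQ is empty
if (ret == 0) {
    do {
        // Read fields in place -- no copy into a struct ibv_wc
        uint64_t wr_id = cq_ex->wr_id;
        enum ibv_wc_status status = cq_ex->status;
        uint64_t hw_ts = ibv_wc_read_completion_ts(cq_ex);  // hardware timestamp
        /* ... handle the completion ... */
        ret = ibv_next_poll(cq_ex);           // advance to the next CQE
    } while (ret == 0);
    ibv_end_poll(cq_ex);                      // finish the polling session
}
Sideway's iterator-style polling wraps this start / next / end sequence behind a Rust iterator, so you get the same behavior without writing the loop by hand.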
We also invest in better error reporting on the RDMA control path. Instead of a mysterious -EINVAL from the kernel on modify_qp, we:
- Tell you what attribute is invalid
- Surface specific errno-based errors (like `ETIMEDOUT`) as typed Rust errors, making it easier to debug and monitor control-plane behavior
Example on invalid attributes
diff --git a/examples/rc_pingpong_split.rs b/examples/rc_pingpong_split.rs
index c0d53ed..2d9c61c 100644
--- a/examples/rc_pingpong_split.rs
+++ b/examples/rc_pingpong_split.rs
@@ -250,8 +250,7 @@ impl PingPongContext {
.setup_path_mtu(mtu)
.setup_dest_qp_num(remote_context.qp_number)
.setup_rq_psn(psn)
- .setup_max_dest_read_atomic(0)
- .setup_min_rnr_timer(0);
+ .setup_port(ib_port);
// setup address vector
let mut ah_attr = AddressHandleAttribute::new();
$ cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 -n 1 &
$ cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 -n 1 127.0.0.1
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
Running `target/debug/examples/rc_pingpong_split -d rxe_0 -g 1 -n 1 127.0.0.1`
local address: QPN 0x007a, PSN 0xcf45d8, GID 0000:0000:0000:0000:0000:ffff:ac1e:0818
remote address: QPN 0x0079, PSN 0x3f2d77, GID 0000:0000:0000:0000:0000:ffff:ac1e:0818
thread 'main' panicked at examples/rc_pingpong_split.rs:266:31:
Failed to modify QP to RTR: ModifyQueuePairError(InvalidAttributeMask { cur_state: Init, next_state: ReadyToReceive, invalid: QueuePairAttributeMask[Port], needed: QueuePairAttributeMask[MinResponderNotReadyTimer, MaxDestinationReadAtomic], source: Os { code: 22, kind: InvalidInput, message: "Invalid argument" } })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
The kernel only tells you EINVAL. Sideway tells you that you need to set MinResponderNotReadyTimer and MaxDestinationReadAtomic, and that you shouldn't set Port in this transition. It's much easier to fix the code when you have that context.
Example on invalid GID / Addr
$ cargo run --example rc_pingpong_split -- -d mlx_0 -g 0 -n 1 &
$ cargo run --example rc_pingpong_split -- -d rxe_0 -g 0 -n 1 127.0.0.1
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
Running `target/debug/examples/rc_pingpong_split -d rxe_0 -g 0 -n 1 127.0.0.1`
local address: QPN 0x007d, PSN 0xcd1e89, GID fe80:0000:0000:0000:5054:00ff:fe36:7656
remote address: QPN 0x014b, PSN 0xa0dc7d, GID fe80:0000:0000:0000:fc6b:02ff:fea7:d4bf
thread 'main' panicked at examples/rc_pingpong_split.rs:267:31:
Failed to modify QP to RTR: ModifyQueuePairError(ResolveRouteTimedout { sgid_index: 0, gid: Gid { raw: [254, 128, 0, 0, 0, 0, 0, 0, 252, 107, 2, 255, 254, 167, 212, 191] }, source: Os { code: 110, kind: TimedOut, message: "Connection timed out" } })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
The result gives us a hint with sgid_index and gid; let's look them up with show_gids:
$ cargo run --example show_gids
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
Running `target/debug/examples/show_gids`
Dev | Port | Index | GID | IPv4 | Ver | Netdev
--------+------+-------+-----------------------------------------+-------------+--------+---------
mlx5_0 | 1 | 0 | fe80:0000:0000:0000:fc6b:02ff:fea7:d4bf | | RoCEv1 | enp8s0
mlx5_0 | 1 | 1 | fe80:0000:0000:0000:fc6b:02ff:fea7:d4bf | | RoCEv2 | enp8s0
rxe_0 | 1 | 0 | fe80:0000:0000:0000:5054:00ff:fe36:7656 | | RoCEv2 | enp6s0
Looks like we're using a link-local address for connection setup! That's why it didn't get through. We should set up a global-scope IPv6 address or an IPv4 address with correct routes to make it work.
Without a hint about the GID index and GID, it takes experience and time to figure out where that ETIMEDOUT came from.
Trespass -- the communication library (WIP)¶
The third layer is trespass, a communication library that offers a socket-like interface over RDMA. You can think of it as:
- In the same space as UCX / libfabric
- But intentionally much simpler and more focused

Its goals:
- Hide RDMA details such as:
  - Which operations consume recv WQEs
  - How to do flow control
  - How to exchange MR information and other connection information
  - How to do batching for best performance
- Let users reuse existing socket programming experience (C / C++ / Rust) to get started quickly.
RPC library?¶
The top layer, an RPC framework using RDMA as transport, is intentionally out of scope for us for now. Different applications have very different requirements:
- Long-running storage systems
- RPC-heavy microservices
- ML inference services

Our goal is to provide solid building blocks (rdma-mummy-sys, sideway, trespass) that such frameworks can use.
Where is async?¶
You may wonder: where is async? This is Rust, after all.
Our view:
- RDMA itself is asynchronous
- Its control path is not usually performance critical
- Directly integrating verbs with today’s general-purpose async runtimes (especially Tokio) is not straightforward
We think a thread-per-core style runtime is more promising for high-performance RDMA, and sideway is designed so that such a runtime can be built on top:
- Run a dedicated RDMA worker thread per core
- Communicate with the rest of the system via channels
- Keep verbs usage straightforward and predictable.
We may explore this more in the future, but sideway itself is synchronous by design.
The magic of dlopen¶
For historical reasons, libibverbs and some of the providers use a rather annoying private symbol mechanism: https://github.com/linux-rdma/rdma-core/blob/master/Documentation/versioning.md
The idea is to ensure that distributed binaries link against the exact rdma-core version they were built with. While this is understandable, it makes it hard to distribute a single binary across different environments, even when you don’t actually use the private functions.
To solve this, we provide a static library called rdma-core-mummy which:
- Uses `dlopen` to load `libibverbs` and `librdmacm` at runtime (these two are what you usually need in DCN)
- So your application only links against our static library instead of all of rdma-core
You might ask: why not just use libloading in Rust?
Because we also want other C/C++ developers to benefit from this approach.
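Conceptually, each exported symbol in rdma-core-mummy is a tiny trampoline along these lines. This is a sketch of the approach, not the actual generated code:
#include <dlfcn.h>
struct ibv_device;   // opaque here; we only pass pointers through
typedef struct ibv_device **(*get_device_list_fn)(int *);
// Exported under the normal verbs name, so existing code links against it unchanged.
struct ibv_device **ibv_get_device_list(int *num_devices)
{
    static get_device_list_fn real_fn;
    if (!real_fn) {
        void *handle = dlopen("libibverbs.so.1", RTLD_NOW | RTLD_GLOBAL);
        if (!handle)
            return NULL;   // rdma-core isn't installed on this host
        real_fn = (get_device_list_fn)dlsym(handle, "ibv_get_device_list");
        if (!real_fn)
            return NULL;
    }
    return real_fn(num_devices);   // forward to the real libibverbs
}
The application links only against these trampolines; the real rdma-core is resolved at runtime on whatever machine the binary lands on.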
Lifetime or not, that's a question¶
RDMA resources depend on each other
RDMA resources dependency (from Nugine)
At first glance, lifetimes seem like the natural Rust solution:
- Express resource relationships in the type system
- Catch misuse at compile time
We tried that (versions before v0.4.0).
When we encapsulated CQ, QP and MR in a WorkerContext (a common C/C++ pattern), lifetime annotations quickly turned into a pain:
- The types became complex and fragile
- We had to use patterns like `ouroboros` to silence the borrow checker
- Things that are obviously safe conceptually still made the compiler shout at us

The moment where "safety" stops helping and starts blocking reasonable designs was the moment we decided we were going too far.
So we chose a simpler model: Arc for most resources.
- Users don't have to think about lifetimes
- Resources are still dropped in a correct order
- The overhead of refcounting doesn't show up on the data path
You can see this in stride:
Before Arc: https://github.com/RDMA-Rust/stride/commit/c5a6a0c83973bd187f1f4ac57a07942d8d278e23
After Arc: https://github.com/RDMA-Rust/stride/commit/fd30aeb4b5b5fa27ee89d1877670058021d670e3
Much more readable, and closer to how RDMA is actually used in real systems.
Rust's unit test framework & patches for upstream¶
To keep sideway healthy, we rely heavily on Rust’s built-in test framework:
- Unit tests that can run with or without an RDMA NIC present,
- CI that exercises SoftRoCE where available

During this process, we found two bugs in rdma-core and upstreamed fixes.
SoftRoCE (RXE) CQ bug¶
When we ported ibv_rc_pingpong to sideway, it ran fine on Mellanox NICs but behaved strangely on SoftRoCE: we kept polling CQEs with wr_id 0.
At first, we thought there might be something wrong with our implementation, so we used bpftrace to see what was happening in the kernel.
bpftrace script for debugging
#include <rdma/ib_verbs.h>
struct rxe_pool_elem {
void *pool;
void *obj;
struct kref ref_cnt;
struct list_head list;
struct completion complete;
u32 index;
};
enum queue_type {
QUEUE_TYPE_TO_CLIENT,
QUEUE_TYPE_FROM_CLIENT,
QUEUE_TYPE_FROM_ULP,
QUEUE_TYPE_TO_ULP,
};
struct rxe_queue {
void *rxe;
void *buf;
void *ip;
size_t buf_size;
size_t elem_size;
unsigned int log2_elem_size;
u32 index_mask;
enum queue_type type;
/* private copy of index for shared queues between
* driver and clients. Driver reads and writes
* this copy and then replicates to rxe_queue_buf
* for read access by clients.
*/
u32 index;
};
struct rxe_sq {
int max_wr;
int max_sge;
int max_inline;
spinlock_t sq_lock; /* guard queue */
struct rxe_queue *queue;
};
struct rxe_rq {
int max_wr;
int max_sge;
spinlock_t producer_lock; /* guard queue producer */
spinlock_t consumer_lock; /* guard queue consumer */
struct rxe_queue *queue;
};
struct rxe_qp {
struct ib_qp ibqp;
struct rxe_pool_elem elem;
struct ib_qp_attr attr;
unsigned int valid;
unsigned int mtu;
bool is_user;
void *pd;
void *srq;
void *scq;
void *rcq;
enum ib_sig_type sq_sig_type;
struct rxe_sq sq;
struct rxe_rq rq;
};
kprobe:rxe_completer,kprobe:rxe_requester {
$qp = (struct rxe_qp *)arg0;
printf("%s: valid: %u, state: %u\n", func, $qp->valid, $qp->attr.qp_state);
printf("%s: max_wr: %u, max_sge: %u, index: %u\n", func, $qp->sq.max_wr, $qp->sq.max_sge, $qp->sq.queue->index);
}
kretprobe:rxe_completer,kretprobe:rxe_requester {
printf("%s: return: %d\n", func, retval);
}
kretprobe:rxe_xmit_packet {
printf("%s: return %d\n", func, retval);
}
With the commands below:
cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 -n 1 &
cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 127.0.0.1 -n 1
Bpftrace gave us this
$ sudo bpftrace ./rxe_qp_tracer.bt
Attaching 5 probes...
rxe_requester: valid: 0, state: 3
rxe_requester: max_wr: 1, max_sge: 8, index: 1
rxe_requester: return: -11
rxe_completer: valid: 0, state: 3
rxe_completer: max_wr: 1, max_sge: 8, index: 1
rxe_completer: return: -11
rxe_requester: valid: 1, state: 3
rxe_requester: max_wr: 1, max_sge: 8, index: 0
rxe_xmit_packet: return 0
rxe_requester: return: 0
...
rxe_completer: valid: 0, state: 6
rxe_completer: max_wr: 1, max_sge: 8, index: 1
rxe_completer: return: -11
^C
This means the rxe_completer only ran a finite number of rounds, and the kernel reported the correct number of completions. So we turned to the user-space driver; after careful debugging and code review, we finally found the cause and fixed it:
rxe: fix completion queue consumer index overrun #1512
Then we added it to the CI: E2E rc_pingpong test
Device detection issue¶
Another bug appeared after a reboot: one of our tests started to fail. The only difference? SoftRoCE wasn't set up yet. We realized that the machine still had Mellanox VFs, so the test should have passed. But rdma-core's logic around devices and rdmacm didn't behave as expected.
That turned into: librdmacm: prevent NULL pointer access during device initialization
Corresponding unit test: rdmacm::communication_manager::tests::test_cm_id_reference_count
Examples¶
If you want to see what real code looks like, check out the examples:
https://github.com/RDMA-Rust/sideway/tree/main/examples
There are small programs for things like the RC pingpong flow (rc_pingpong_split) and GID inspection (show_gids).
We dogfood sideway through these examples and through stride.
Production status & Roadmap¶
- Status: v0.Y.Z; the API may still change, but the overall design is fairly stable
- We are trying to use sideway to build more tools (for example, stride), adjusting the APIs as we discover better patterns
- We've tested our APIs on Mellanox CX-5, CX-6, BF3 mini, and SoftRoCE; other RNICs haven't been tested yet
- Near-term:
- Provide UD, UC support
- Provide CQ Event support
- Provide Atomic FAA / CAS support
- Provide thread domain support
- Plugin system for working with direct verbs
- More tests, docs, and bug fixes
- Long-term:
- Complete the trespass communication library
- Explore a "thread-per-core" async runtime on top of sideway
Let's make Rust networking fast with RDMA!