2025¶

November 16, 2025
in RDMA, Rust
19 min read

Why another RDMA wrapper

TL; DR

If you're writing synchronous RDMA code in Rust and you're tired of:

Hand-rolling FFI over ibverbs-sys / rdma-core-sys
Fighting lifetimes over simple CQ / QP / MR ownership, and
Rebuilding rdma-core just to link a tiny binary

Then sideway gives you:

Rust-flavored wrappers over modern ibverbs (ibv_wr_*, ibv_start_poll, Extended CQ / QP)
A dlopen based static library so that you don't vendor rdma-core
A basic but usable wrapper for librdmacm

Checkout our code at https://github.com/RDMA-Rust/sideway

This post is about how we built it, what trade-offs we made, and what we learned on the way.

What do we get?

A Rust flavored wrapper for rdma-core (can saturate 400 Gbps RNIC)
Two bug fixes submitted to the rdma-core upstream
A "dummy" companion library for C/C++ programmer with libibverbs / librdmacm (rdma-core-mummy)
A lot of Rust + RDMA programming experience

What's wrong with current wrappers?

Most current libibverbs and librdmacm Rust wrappers focus heavily on fancy language features (100% safe, async-first APIs, etc.), and less on being production-ready. They are great for exploration, and provide a lot of insights for our implementation, but awkward when you try to ship a serious RDMA application.

When you actually start coding, you often end up:

Reaching for ibverbs-sys (or another bindgen-generated crate)
Manually handling error-prone C macros / enum definitions
Juggling C types and Rust types at the same time

On top of that, most crates assume you build and link the entire rdma-core project into your binary, or required you've install rdma-core on your compile machine. That pulls in unnecessary dependencies and build complexity.

We wanted something that:

Keeps the control path ergonomic in Rust
Keeps the data path as close as possible to raw ibverbs to achieve high performance
Doesn't require vendoring rdma-core into every build

How does sideway compare?

We are not claiming to replace other crates, different projects have different trade-offs. But for high performance, synchronous RDMA, there are some concrete issues you run into.

Crate	Requires building rdma-core	Wrapping librdmacm	Performance issues
ibverbs	Yes	No	1. Poll CQ at a constant batching size 2. No batching WRs posting operation 3. Unable to specify send flags for moderation or fencing
rrddmma	Optional	No	Leave options to user
async-rdma	No (But specific version required)	Yes	1. No batching WRs posting operation 2. Unable to specify send flags for moderation for fencing 3. Create a CQ for every QP
rdma	No (But specific version required)	No	Leave options to user
rdma-rs	Yes	Yes	1. No batching WRs posting operation 2. Unable to specify send flags for moderation for fencing
Sideway	No	Yes	Focused on modern verbs APIs (`ibv_wr_*`, CQ / QP Ex), zero-cost abstractions on the data path, and layered design

Performance

To make sure our wrapper doesn't leave performance on the table, we vibe coded a perftest-style benchmarking tool called stride. It's not meant to be a full replacement, just enough for validate that "Rust + sideway" can keep up with the classic C tooling.

All experiments run on a Nebius GPU-H100-SXM instance with 8 * ConnectX-7 InfiniBand (IB) 400Gbps RDMA NIC

Key	Value
Environment	GPU-H100-SXM EU-NORTH1
OS	Ubuntu 24.04.3 LTS (Noble Numbat) x86_64
CPU	2 × Intel(R) Xeon(R) Platinum 8468 (128) @ 2.10 GHz
Kernel	6.11.0-1016-nvidia
Memory	1.536 TiB
RNIC	ConnectX-7 InfiniBand 400 Gbps (firmware 28.39.3004)
OFED	DOCA 2.9.2
GPU	NVIDIA H100 SXM5 80GB
NVIDIA GPU Driver	570.195.03
CUDA	12.8
Stride	87b3780
Perftest	20e05b9

We slighty modified stride's code to hardcode the LID in connection setup phase, to make it work with IB. Apart from that, both tools use the same verbs settings.

What we measure

We look at three basic scenarios, always with a single RC QP and message sizes from 2B to 8MiB:

Single QP WRITE bandwidth (host / GPU memory)
- Host memory: S-1QP-BW (stride) vs P-1QP-BW (perftest)
- GPU memory (GDR): S-1QP-GDR-BW vs P-1QP-GDR-BW
Single QP "maxed out" WRITE bandwidth (GPU memory)
- GPU memory (GDR), Tx depth 4096, post list 64: S-1QP-GDR-MAX-BW vs P-1QP-GDR-MAX-BW
- To better see small-message behavior, we also add message per second (MPS): S-1QP-GDR-MAX-MPS vs P-1QP-GDR-MAX-MPS
Single QP WRITE latency (host / GPU memory)
- Host memory: S-1QP-LAT (stride) vs P-1QP-LAT (perftest)
- GPU memory (GDR): S-1QP-GDR-LAT (perftest does not support WRITE latency with GPU buffers)
- We also add P99 for jitter: S-1QP-LAT-P99 vs P-1QP-LAT-P99 vs S-1QP-GDR-LAT-P99

Results at a glance

Bandwidth (host memory / GDR) For a single QP, sideway/stride and perftest track each other closely across 2B–8MiB. At large messages, both saturate the 400 Gb/s link: stride reaches ~380 Gb/s, and perftest is within a few percent of that.

Line TableRaw DataCommand

bandwidth

Message Size	S-1QP-BW	P-1QP-BW	S-1QP-GDR-BW	P-1QP-GDR-BW
2	0.0676	0.068664	0.077	0.075424
4	0.1352	0.16	0.1546	0.15
8	0.278	0.31	0.3098	0.31
16	0.6217	0.63	0.6201	0.62
32	1.2465	1.26	1.2379	1.24
64	2.4941	2.51	2.4714	2.49
128	4.3461	5.04	4.9527	4.97
256	8.6742	10.12	9.9171	9.93
512	17.3712	20.25	19.7997	19.77
1024	34.7707	40.47	39.6772	39.8
2048	69.6075	80.88	68.9667	79.51
4096	139.2933	164.27	140.3711	160.21
8192	278.4098	305.2	291.2193	320.74
16384	360.9202	343.31	371.0165	371.11
32768	372.1774	358.08	381.8385	382.28
65536	377.9962	365.91	387.1709	387.1
131072	381.4992	370.98	390.377	390.55
262144	382.0199	372.04	391.1473	391.67
524288	381.505	372.35	391.1888	391.9
1048576	381.2342	372.09	391.1909	392.25
2097152	381.307	372.12	391.1919	392.39
4194304	380.6771	371.92	391.1924	392.44
8388608	379.7436	371.82	391.1926	392.44

# Perftest server
taskset -c 66 ib_write_bw -d mlx5_0 -n 100000 -a -Q 1
# Perftest client
taskset -c 68 ib_write_bw -d mlx5_1 -n 100000 -a 127.0.0.1 -Q 1
# Perftest server (GDR)
taskset -c 66 ib_write_bw -d mlx5_0 -n 100000 -a -Q 1 --use_cuda 4
# Perftest client (GDR)
taskset -c 68 ib_write_bw -d mlx5_1 -n 100000 -a 127.0.0.1 -Q 1 --use_cuda 5

# Stride server
taskset -c 66 stride-perf write bw -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024))
# Stride client
taskset -c 68 stride-perf write bw -d mlx5_1 127.0.0.1 -n 1000000 -a --max-msg-size $((8*1024*1024))
# Stride server (GDR)
taskset -c 66 stride-perf write bw -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024)) --use-cuda 4
# Stride client (GDR)
taskset -c 68 stride-perf write bw -d mlx5_1 127.0.0.1 -n 1000000 -a --max-msg-size $((8*1024*1024)) --use-cuda 5

Bandwidth (GDR, maxed out) With a larger tx_depth and post_list, a single QP hits ~392.5 Gbit/s — essentially line-rate for a 400 Gbit/s port. For small messages, throughput reaches ~20 Mpps.

Line TableRaw DataCommand

bandwidth

Message Size	S-1QP-GDR-MAX-BW	S-1QP-GDR-MAX-MPS	P-1QP-GDR-MAX-BW	P-1QP-GDR-MAX-MPS
2	0.3153	19.7076	0.32	19.855156
4	0.632	19.7505	0.64	19.890087
8	1.2636	19.7432	1.28	19.92576
16	2.5247	19.7239	2.54	19.851894
32	5.0515	19.7323	5.07	19.800955
64	10.0584	19.6453	10.11	19.740872
128	20.068	19.5976	20.13	19.659211
256	39.8753	19.4703	39.73	19.400613
512	77.5093	18.9232	78.32	19.121788
1024	147.5344	18.0096	148.25	18.096757
2048	237.1809	14.4764	240.4	14.672663
4096	321.151	9.8007	324.23	9.894732
8192	358.1028	5.4642	361.24	5.512146
16384	391.0239	2.9833	391.81	2.989263
32768	391.1139	1.492	392.17	1.495991
65536	391.1547	0.7461	392.51	0.74865
131072	391.1747	0.3731	392.63	0.374441
262144	391.1837	0.1865	392.6	0.187209
524288	391.1888	0.0933	392.5	0.093579
1048576	391.1903	0.0466	392.46	0.046785
2097152	391.1912	0.0233	392.43	0.023391
4194304	391.1923	0.0117	392.44	0.011696
8388608	391.1925	0.0058	392.44	0.005848

# Perftest server
taskset -c 66 ib_write_bw -d mlx5_0 -n 102400 -a -Q 1 --report_gbits --use_cuda 4 -l 64 -t 4096
# Perftest client
taskset -c 68 ib_write_bw -d mlx5_1 -n 102400 -a 127.0.0.1 -Q 1 --report_gbits --use_cuda 5 -l 64 -t 4096

# Stride server
taskset -c 66 stride-perf write bw -d mlx5_0 -n 102400 -a --max-msg-size $((8*1024*1024)) --use-cuda 4 -l 64 -t 4096
# Stride client
taskset -c 68 stride-perf write bw -d mlx5_1 -n 102400 -a --max-msg-size $((8*1024*1024)) 127.0.0.1 --use-cuda 5 -l 64 -t 4096

Latency For small messages in host memory, perftest’s numbers look better at first glance — but that’s because we measure slightly different things:
- perftest’s WRITE latency is effectively half of a round-trip: client WRITE → server DMA starts → server DMA ends (single-trip data + DMA time).
- stride’s WRITE latency measures data transfer + ACK (client WRITE → server NIC ACK → client CQE).

GPU WRITE latency with stride is slightly higher but still scales cleanly with message size.

Line TableRaw DataCommand

latency

Message Size	P-1QP-LAT	P-1QP-LAT-P99	S-1QP-LAT	S-1QP-LAT-P99	S-1QP-GDR-LAT	S-1QP-GDR-LAT-P99
2	3.38	3.47	5.286	5.487	5.412	5.487
4	3.38	3.46	5.322	5.495	5.411	5.479
8	3.38	3.47	5.334	5.495	5.42	5.483
16	3.39	3.47	5.336	5.495	5.418	5.487
32	3.4	3.51	5.329	5.495	5.415	5.483
64	3.4	3.48	5.351	5.495	5.432	5.499
128	3.46	3.57	5.371	5.543	5.45	5.511
256	3.51	3.62	5.421	5.575	5.473	5.539
512	3.57	3.71	5.48	5.619	5.486	5.555
1024	3.61	3.78	5.53	5.683	5.523	5.595
2048	3.75	4.19	5.634	5.743	5.583	5.647
4096	3.93	4.49	5.785	5.895	5.692	5.767
8192	4.13	4.75	5.947	6.163	5.887	6.027
16384	4.53	5.23	6.357	6.499	6.242	6.315
32768	5.07	5.61	6.968	7.087	6.724	6.803
65536	6.11	6.84	7.91	8.063	7.66	7.751
131072	9.53	9.9	10.644	10.807	10.12	10.207
262144	14.66	15.98	14.999	15.239	13.86	13.967
524288	20.12	21.64	20.692	21.599	19.25	19.359
1048576	30.93	33.13	31.571	33.567	29.983	30.079
2097152	53.5	56.62	53.569	57.503	51.442	51.551
4194304	99	103.14	97.649	104.575	94.327	94.463
8388608	189.92	194.66	186.243	196.351	180.107	180.223

# Perftest server
taskset -c 66 ib_write_lat -d mlx5_0 -n 100000 -a -I 0
# Perftest client
taskset -c 68 ib_write_lat -d mlx5_1 -n 100000 -a 127.0.0.1 -I 0

# Stride server
taskset -c 66 stride-perf write lat -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024))
# Stride client
taskset -c 68 stride-perf write lat -d mlx5_1 -n 100000 -a --max-msg-size $((8*1024*1024)) 127.0.0.1

# Stride server (GDR)
taskset -c 66 stride-perf write lat -d mlx5_0 -n 100000 -a --max-msg-size $((8*1024*1024)) --use-cuda 4
# Stride client (GDR)
taskset -c 68 stride-perf write lat -d mlx5_1 -n 100000 -a --max-msg-size $((8*1024*1024)) 127.0.0.1 --use-cuda 5

Design: layered abstraction

We treat RDMA abstraction as a layered problem:

`rdma-mummy-sys` -- the raw rdma-core API wrapper

The first layer is a bindgen-generated + manually corrected library on top of rdma-core (actually rdma-core-mummy, more on that in the dlopen section), This is:

The base for all upper layers
Available for users who want direct access to the C APIs with minimal sugar

`Sideway` -- the Rust flavor wrapper over rdma-core

The second layer is sideway, a Rust-flavored rdma-core wrapper. It:

Re-encapsulates C macros and anonymous enums as Rust enums
Uses the builder pattern to make APIs more ergonomic
Encapsulates an iterator-style CQ polling API The iterator style is faster than ibv_poll_cq in some projects, and exposes more information: for example, it lets you directly obtain hardware timestamps from CQEs.

We deliberately do not make everything safe:

Wrappers for ibv_reg_mr and ibv_post_send / ibv_wr_* remain unsafe
Trying to fully encode their safety in this layer would either:
- Hurt performance
- Over-constrain upper layers that need to do unusual things (e.g., GDR with GPU memory)

Instead, the idea is:

Sideway gives you a clear, zero-cost mapping to modern verbs
Upper layers (like trespass or your own library) can build safe memory pools or connection abstractions on top

For user experience, we also provide a unified interface over CQ / CQ Ex and QP / QP Ex. If you've written RDMA before, you must know ibv_post_send and ibv_poll_cq, but you may not have used ibv_wr_* and ibv_start_poll, these newer APIs are more efficient and expose features like hardware timestamps from work completions. Sideway lets you start with the "classic" style and move to the extended APIs with minimal changes.

Let's compare the traditional ibv_post_send, and ibv_wr_*, and our sideway

ibv_post_sendibv_wr_*

struct ibv_send_wr wr;

struct ibv_sge sge = {
    .addr = local_addr,
    .length = 4096,
    .lkey = lkey,
};

wr.wr_id = 233;
wr.next = NULL;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_addr;
wr.wr.rdma.rkey = rkey;

// Before actual post, you could construct more WRs use `next` to chain them,
// which could increase performance
int ret = ibv_post_send(qp, &wr, NULL);

ibv_wr_start(qp);

qp->wr_id = 233;
qp->wr_flags = IBV_SEND_SIGNALED;
ibv_wr_rdma_write(qp, rkey, remote_addr);

ibv_wr_set_sge(qp, lkey, local_addr, 4096);

// Before actual post, you could construct more WRs and then post them at once,
// which could increase performance
ibv_wr_complete(qp);

sideway

let mut guard = qp.start_post_send();

let write_handle = guard
    .construct_wr(233, WorkRequestFlags::Signaled)
    .setup_write(rmr.rkey(), rmr.get_ptr() as _);

write_handle.setup_sge(lmr.lkey(), lmr.get_ptr() as _, 4096);

// Before actual post, you could construct more WRs and then post them at once,
// which could increase performance
guard.post()?;

How about ibv_poll_cq vs ibv_start_poll vs sideway?

Although ibv_poll_cq looks cleaner, it doesn't support hardware timestamps and costs more cycles, because it has to copy from CQ to work completions.

ibv_poll_cqibv_start_poll

struct ibv_wc wc;

while (ibv_poll_cq(cq, 1, &wc) > 0) {
    printf("wr_id: %lu, status: %u, opcode: %u\n", wc.wr_id, wc.status, wc.opcode);
}

int ret = ibv_start_poll(cq, &attr);
// assume the first ret is not ENOENT
while (ret != ENOENT) {
    printf("wr_id: %lu, status: %u, opcode: %u\n", cq->wr_id, cq->status, ibv_wc_read_opcode(cq));
    ret = ibv_next_poll(cq);
}
ibv_end_poll(cq);

sideway

match cq.start_poll() {
    Ok(mut poller) => {
        while let Some(wc) = poller.next() {
            println!("wr_id: {}, status: {}, opcode: {}", wc.wr_id(), wc.status(), wc.opcode())
        }
    }
    Err(_) => {
        continue;
    }
}

We also invest in better error reporting on the RDMA control path. Instead of a mysterious -EINVAL from the kernel on modify_qp, we:

Tell you what attribute is invalid
Surface specific error-based errors (like ETIMEDOUT) as typed Rust errors, making it easier to debug and monitor control-plane behavior

Example on invalid attributes

diff --git a/examples/rc_pingpong_split.rs b/examples/rc_pingpong_split.rs
index c0d53ed..2d9c61c 100644
--- a/examples/rc_pingpong_split.rs
+++ b/examples/rc_pingpong_split.rs
@@ -250,8 +250,7 @@ impl PingPongContext {
            .setup_path_mtu(mtu)
            .setup_dest_qp_num(remote_context.qp_number)
            .setup_rq_psn(psn)
-            .setup_max_dest_read_atomic(0)
-            .setup_min_rnr_timer(0);
+            .setup_port(ib_port);

        // setup address vector
        let mut ah_attr = AddressHandleAttribute::new();

$ cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 -n 1 &

$ cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 -n 1 127.0.0.1
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
    Running `target/debug/examples/rc_pingpong_split -d rxe_0 -g 1 -n 1 127.0.0.1`
local address: QPN 0x007a, PSN 0xcf45d8, GID 0000:0000:0000:0000:0000:ffff:ac1e:0818
remote address: QPN 0x0079, PSN 0x3f2d77, GID 0000:0000:0000:0000:0000:ffff:ac1e:0818

thread 'main' panicked at examples/rc_pingpong_split.rs:266:31:
Failed to modify QP to RTR: ModifyQueuePairError(InvalidAttributeMask { cur_state: Init, next_state: ReadyToReceive, invalid: QueuePairAttributeMask[Port], needed: QueuePairAttributeMask[MinResponderNotReadyTimer, MaxDestinationReadAtomic], source: Os { code: 22, kind: InvalidInput, message: "Invalid argument" } })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The kernel only tells you EINVAL. sideway tells you that you need to set MinResponderNotReadyTimer and MaxDestinationReadAtomic, and you shouldn't set Port in this transition. It's much easier to fix the code when you have that context.

Example on invalid GID / Addr

$ cargo run --example rc_pingpong_split -- -d mlx_0 -g 0 -n 1 &

$ cargo run --example rc_pingpong_split -- -d rxe_0 -g 0 -n 1 127.0.0.1
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
Running `target/debug/examples/rc_pingpong_split -d rxe_0 -g 0 -n 1 127.0.0.1`
 local address: QPN 0x007d, PSN 0xcd1e89, GID fe80:0000:0000:0000:5054:00ff:fe36:7656
remote address: QPN 0x014b, PSN 0xa0dc7d, GID fe80:0000:0000:0000:fc6b:02ff:fea7:d4bf

thread 'main' panicked at examples/rc_pingpong_split.rs:267:31:
Failed to modify QP to RTR: ModifyQueuePairError(ResolveRouteTimedout { sgid_index: 0, gid: Gid { raw: [254, 128, 0, 0, 0, 0, 0, 0, 252, 107, 2, 255, 254, 167, 212, 191] }, source: Os { code: 110, kind: TimedOut, message: "Connection timed out" } })
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The result gives us a hint with sgid_index and gid, let's look them up with show_gids

$ cargo run --example show_gids
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.09s
 Running `target/debug/examples/show_gids`
  Dev   | Port | Index |                   GID                   |    IPv4     |  Ver   | Netdev
--------+------+-------+-----------------------------------------+-------------+--------+---------
mlx5_0 |  1   |   0   | fe80:0000:0000:0000:fc6b:02ff:fea7:d4bf |             | RoCEv1 | enp8s0
mlx5_0 |  1   |   1   | fe80:0000:0000:0000:fc6b:02ff:fea7:d4bf |             | RoCEv2 | enp8s0
rxe_0  |  1   |   0   | fe80:0000:0000:0000:5054:00ff:fe36:7656 |             | RoCEv2 | enp6s0

Looks like we're using a link local address for connection setup! That's why it didn't make it. We should set up a global scope IPv6 or IPv4 address with correct routes to make it work. Without a hint of GID index and GID, it takes experience and time to figure out where that ETIMEDOUT came from.

`Trepass` -- the communication library (WIP)

The third layer is trespass, a communication library that offers a socket-like interface over RDMA, you can think of it as

In the same space as UCX / libfabric
But intentionally much simpler and focused Its goals:
Hide RDMA details such as
Which operations consume recv WQEs
How to do flow control
How to exchange MR information and other connection information
How to do batching for best performance
Let users reuse existing socket programming experience (C / C++ / Rust) to get started quickly.

RPC library?

The top layer, an RPC framework using RDMA as transport, is intentionally out of scope for us for now. Different applications have very different requirements:

Long running storage systems
RPC-heavy microservices
ML inference services Our goal is to provide solid building blocks (rdma-mummy-sys, sideway, trespass) that such frameworks can use.

Where is async?

You may wonder: where is async? This is Rust, after all.

Our view:

RDMA itself is asynchronous
Its control path is not usually performance critical
Directly integrating verbs with today’s general-purpose async runtimes (especially Tokio) is not straightforward

We think a thread-per-core style runtime is more promising for high-performance RDMA, and sideway is designed so that such a runtime can be built on top

Run a dedicated RDMA worker thread per core
Communicate with the rest of the system via channels
Keep verbs usage straightforward and predictable.

We may explore this more in the future, but sideway itself is synchronous by design.

The magic of `dlopen`

For some historical reasons, libibverbs and some of the providers uses an annoying private symbol mechanism https://github.com/linux-rdma/rdma-core/blob/master/Documentation/versioning.md

The idea is to ensure that distributed binaries link against the exact rdma-core version they were built with. While this is understandable, it makes it hard to distribute a single binary across different environments, even when you don’t actually use the private functions.

To solve this, we provide a static library called rdma-core-mummy which:

Uses dlopen to load libibverbs and librdmacm at runtime (these two are what you usually need in DCN)
So your application only links against our static library instead of all of rdma-core.

You might ask: why not just use libloading in Rust?

Because we also want other C/C++ developers to benefit from this approach.

Lifetime or not, that's a question

RDMA resources depend on each other

At first glance, lifetimes seem like the natural Rust solution:

Express resource relationships in the type system
Catch misuse at compile time

We tried that (versions before v0.4.0).

When we encapsulated CQ, QP and MR in a WorkerContext (a common C/C++ pattern), lifetime annotations quickly turned into a pain:

The types became complex and fragile
We had to use patterns like ouroboros to silence the borrow checker
Things that are obviously safe conceptually still made the compiler shout at us The moment where "safety" stops helping and starts blocking reasonable designs is the moment we decided we were going too far.

So we chose a simpler model: Arc for most resources.

Users don't have to think about lifetimes
Resources are still dropped in a correct order
The overhead of refcounting doesn't show up on the data path You can see this in stride

Before Arc: https://github.com/RDMA-Rust/stride/commit/c5a6a0c83973bd187f1f4ac57a07942d8d278e23

After Arc: https://github.com/RDMA-Rust/stride/commit/fd30aeb4b5b5fa27ee89d1877670058021d670e3

Before `Arc`

pub struct WorkerContext<'a> {
    pub queue_pairs: Vec<GenericQueuePair<'a>>,
    pub send_completion_queue: Rc<RefCell<ExtendedCompletionQueue<'a>>>,
    pub recv_completion_queue: Option<Rc<RefCell<ExtendedCompletionQueue<'a>>>>,
    ...
}

After `Arc`

pub struct WorkerContext {
    pub queue_pairs: Vec<GenericQueuePair>,
    pub send_completion_queue: GenericCompletionQueue,
    pub recv_completion_queue: Option<GenericCompletionQueue>,
    ...
}

Much more readable, and closer to how RDMA is actually used in real systems.

Rust's unittest framework & patches for upstream

To keep sideway healthy, we rely heavily on Rust’s built-in test framework:

Unit tests that can run with or without an RDMA NIC present,
CI that exercises SoftRoCE where available. During this process, we found two bugs in rdma-core and upstreamed fixes

SoftRoCE (RXE) CQ bug

When we try to port ibv_rc_pingpong to sideway, it ran fine on Mellanox NICs but behaved strangely on SoftRoCE: we kept polling CQEs with wr_id 0.

At first, we thought maybe there is something wrong with our implementation, so we tried to use bpftrace to catch what's happening in the kernel

bpftrace script for debugging

rxe_qp_tracer.bt

#include <rdma/ib_verbs.h>

struct rxe_pool_elem {
        void            *pool;
        void                    *obj;
        struct kref             ref_cnt;
        struct list_head        list;
        struct completion       complete;
        u32                     index;
};

enum queue_type {
        QUEUE_TYPE_TO_CLIENT,
        QUEUE_TYPE_FROM_CLIENT,
        QUEUE_TYPE_FROM_ULP,
        QUEUE_TYPE_TO_ULP,
};

struct rxe_queue {
        void            *rxe;
        void    *buf;
        void    *ip;
        size_t                  buf_size;
        size_t                  elem_size;
        unsigned int            log2_elem_size;
        u32                     index_mask;
        enum queue_type         type;
        /* private copy of index for shared queues between
        * driver and clients. Driver reads and writes
        * this copy and then replicates to rxe_queue_buf
        * for read access by clients.
        */
        u32                     index;
};

struct rxe_sq {
        int                     max_wr;
        int                     max_sge;
        int                     max_inline;
        spinlock_t              sq_lock; /* guard queue */
        struct rxe_queue        *queue;
};

struct rxe_rq {
        int                     max_wr;
        int                     max_sge;
        spinlock_t              producer_lock; /* guard queue producer */
        spinlock_t              consumer_lock; /* guard queue consumer */
        struct rxe_queue        *queue;
};

struct rxe_qp {
        struct ib_qp ibqp;
        struct rxe_pool_elem elem;
        struct ib_qp_attr attr;
        unsigned int valid;
        unsigned int            mtu;
        bool                    is_user;
        void            *pd;
        void            *srq;
        void      *scq;
        void            *rcq;
        enum ib_sig_type        sq_sig_type;
        struct rxe_sq           sq;
        struct rxe_rq           rq;
};

kprobe:rxe_completer,kprobe:rxe_requester {
  $qp = (struct rxe_qp *)arg0;
        printf("%s: valid: %u, state: %u\n", func, $qp->valid, $qp->attr.qp_state);
        printf("%s: max_wr: %u, max_sge: %u, index: %u\n", func, $qp->sq.max_wr, $qp->sq.max_sge, $qp->sq.queue->index);
}

kretprobe:rxe_completer,kretprobe:rxe_requester {
        printf("%s: return: %d\n", func, retval);
}

kretprobe:rxe_xmit_packet {
        printf("%s: return %d\n", func, retval);
}

With command below

cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 -n 1 &

cargo run --example rc_pingpong_split -- -d rxe_0 -g 1 127.0.0.1 -n 1

Bpftrace gave us this

$ sudo bpftrace ./rxe_qp_tracer.bt
Attaching 5 probes...
rxe_requester: valid: 0, state: 3
rxe_requester: max_wr: 1, max_sge: 8, index: 1
rxe_requester: return: -11
rxe_completer: valid: 0, state: 3
rxe_completer: max_wr: 1, max_sge: 8, index: 1
rxe_completer: return: -11
rxe_requester: valid: 1, state: 3
rxe_requester: max_wr: 1, max_sge: 8, index: 0
rxe_xmit_packet: return 0
rxe_requester: return: 0
...
rxe_completer: valid: 0, state: 6
rxe_completer: max_wr: 1, max_sge: 8, index: 1
rxe_completer: return: -11
^C

Which means the rxe_completer only ran finite rounds, and the kernel reported correct number of completions. So we turned to user space driver, after carefully debugging and code review, we finally found the reason and fixed it:

rxe: fix completion queue consumer index overrun #1512

Then we add it to the CI: E2E rc_pingpong test

Device detection issue

Another bug appeared after a reboot: one of our tests started to fail. The only difference? SoftRoCE wasn't set up yet. We realized that the machine still had Mellanox VFs, so the test should have passed. But the rdma-core's logic around devices and rdmacm didn't behave as expected.

That turned into: librdmacm: prevent NULL pointer access during device initialization

Corresponding unit test: rdmacm::communication_manager::tests::test_cm_id_reference_count

Examples

If you want to see what real code looks like, check out the examples:

https://github.com/RDMA-Rust/sideway/tree/main/examples

There are small programs for

We dogfood sideway through these examples and through stride.

Production status & Roadmap

Status: v0.Y.Z, the API may still change, but the overall design is fairly stable
We are trying to use sideway to build more tools (for example, stride), and adjusting the APIs as we discover better patterns
We've tested our APIs on Mellanox CX-5, CX-6, BF3 mini and SoftRoCE, other RNICs haven't be tested yet.
Near-term:
- Provide UD, UC support
- Provide CQ Event support
- Provide Atomic FAA / CAS support
- Provide thread domain support
- Plugin system for working with direct verbs
- More tests, docs, and bug fixes
Long-term:
- Complete the trespass communication library
- Explore a "thread-per-core" async runtime on top of sideway

Let's make Rust network fast with RDMA!