Skip to content

Benchmarks

Velocity-FL's pitch is "the uv of Federated Learning" — a Rust core under a Python API. That's a performance claim, and it should be measured, not marketed. This page documents the measurement methodology and shows the current numbers.

What we measure

Every FL round is: clients produce weights (user-side, out of our hands) → the server aggregates them. The only thing the library controls is the aggregation step, so that's what we time. Client-weight construction is outside the timed region.

Two layers of measurement serve different purposes:

Layer Harness File Answers
Rust aggregate, raw divan vfl-core/benches/aggregate.rs How fast is the inner loop?
Through the Python API pytest-benchmark tests/bench/test_round_speed.py How fast is the user-visible path (_rust.Orchestrator.run_round) vs the pure-Python fallback?

The second layer is the one the tagline is defended on — real users call through PyO3, not Rust directly.

How to reproduce

uv sync
make bench

make bench ensures a release-profile velocity._core is installed in the venv before running the pytest harness (cargo bench already selects its own bench profile). The normal make build stays debug for fast edit/test rebuilds; you never need to remember --release by hand.

Or invoke the harnesses directly:

uv run maturin develop --release --uv
cargo bench --bench aggregate
uv run pytest tests/bench/ --benchmark-only --benchmark-columns=mean,stddev,rounds --benchmark-sort=mean

Shape tiers

Three tiers, chosen to span realistic FL model sizes:

tier layers total params rough analogue
tiny 4 ~970 smoke test / FedAvg sanity check
medium 10 ~1.0M ResNet-18-scale
large 16 ~10.0M ResNet-50-scale

All tiers use 10 clients. Weights are deterministic f32 in [-0.1, 0.1].

Results

Snapshot: 2026-04-23 (idle box — no background games, no browser, load average 0.27 at start). Hardware: AMD Ryzen 5 3600X (6C/12T, Zen 2), 9.7 GB RAM, WSL2 on Windows. rustc 1.95.0, Python 3.12.3, uv 0.11.5, PyO3 0.23 + numpy 0.23, release build (maturin develop --release) with lto = "thin" + codegen-units = 1.

Rust aggregation — raw (divan, mean)

No PyO3 boundary, no Python — the theoretical best case. Aggregation kernel only, measured against pre-built Vec<f32> client updates.

strategy tiny medium large
FedAvg 4.2 µs 4.8 ms 70.3 ms
FedProx 3.7 µs 4.5 ms 73.3 ms
FedMedian 76 µs 87.1 ms 903 ms
TrimmedMean(k=1) 93 µs 98.6 ms 988 ms
Krum(f=1) 34 µs 73.9 ms 740 ms
MultiKrum(f=1) 38 µs 78.4 ms 773 ms
Bulyan(f=1) 101 µs 147 ms 1.53 s
ArKrum 42 µs 56 ms 720 ms

FedAvg on a 10M-parameter model with 10 clients aggregates in ~70 ms. FedAvg accumulates in f64 and downcasts to f32 at the end to bound rounding error as client counts scale. FedMedian uses select_nth_unstable_by (O(C) expected) on a scratch buffer hoisted out of the coordinate loop. Krum and Multi-Krum share a pairwise-distance matrix (O(n² · d) per round), dominate FedMedian at the small tier because of the matrix setup overhead, and converge on FedMedian's cost at the medium and large tiers where the distance sum dominates.

Through the Python API (pytest-benchmark, mean)

This is the number users actually feel. _rust.Orchestrator.run_round with pre-built client updates, crossing the PyO3 boundary once per round, compared against a pure-Python FedAvg on the same inputs.

tier Rust FedAvg Rust FedProx Rust FedMedian Rust TrimmedMean(k=1) Rust Krum(f=1) Rust MultiKrum(f=1) Rust Bulyan(f=1) Rust ArKrum Python FedAvg Rust FedAvg speedup vs Python FedAvg
tiny (~1K) 4.9 µs 4.5 µs 78 µs 91 µs 38 µs 42 µs 99 µs 42 µs 421 µs 87×
medium (~1M) 4.0 ms 4.0 ms 94.2 ms 101 ms 71.4 ms 74.3 ms 143 ms 56 ms 545 ms 135×
large (~10M) 42.2 ms 42.2 ms 890 ms 1.01 s 721 ms 739 ms 1.47 s 720 ms 5.82 s 138×

Pure-Python FedAvg at the large tier costs ~5.8 s per round on this snapshot. The full tests/bench/ suite takes ~9 min on this box at 30 tests across 7 strategies × 3 tiers plus readout probes. This is the first clean-box snapshot since the 2026-04-22 numpy migration: compared to the prior 2026-04-20 snapshot (WSL2 actively loaded), Rust FedAvg at large dropped 53 → 42.2 ms and Python FedAvg drifted 5.12 → 5.82 s, so the speedup ratios widened from 95–117× to 87–138× across tiers — consistent with "the kernel is stable, the Python denominator is noisy." FedProx now matches FedAvg exactly in this run (both 42.2 ms at large; the 81 ms FedProx outlier in the prior snapshot was load noise — the kernels dispatch identically, see Findings below). Treat the speedup column as directional until CodSpeed lands; the consistent finding is "Rust FedAvg dominates at every tier."

Orchestrator.run_round takes a zero-copy fast path when no attacks are registered: the PyO3 wrapper passes &ClientUpdate slices straight into the aggregation kernel, so no f32 weight data is cloned between Python and aggregation. When attacks are registered, the owned path kicks in automatically (attacks can mutate client updates).

TrimmedMean tracks FedMedian at every tier (101 ms vs 94.2 ms at medium; 1.01 s vs 890 ms at large). Two select_nth_unstable_by calls per coordinate cost slightly more than FedMedian's single call at k=1, even with the median-of-evens averaging skipped — the partition is the hot loop, the post-processing isn't. The two-call structure becomes worthwhile as k grows (the second call operates on a smaller window), but at k=1 it's a wash. The matched NumPy oracle in tests/strategy_reference.py confirms parity.

Krum/Multi-Krum land above Python FedAvg at the medium and large tiers (721 ms / 739 ms vs 5.82 s at large — Krum is faster than Python FedAvg here, but ~17× slower than Rust FedAvg). That is algorithmically honest: Krum is O(n²·d), FedAvg is O(n·d). The f=1 Krum kernel builds a 10×10 pairwise-distance matrix across all d parameters per call — at large (d ≈ 10 M), that's ~500 M f32 adds before the top-k selection. Robustness buys a ~17× cost factor over non-robust FedAvg at this scale; the matched Python oracle in tests/strategy_reference.py confirms the kernel is correct, it is not a perf bug.

Bulyan composes the two slowest robust kernels and lands at their sum minus a subset discount: Phase 1 runs Multi-Krum over all n clients (739 ms at large), Phase 2 runs a coordinate-wise trimmed mean over the m = n - 2f = 8 survivors (cheaper than the n=10 TrimmedMean row because the per-coord partition shrinks from 10 to 8). Observed 1.47 s at large vs MultiKrum + TrimmedMean ≈ 1.75 s on the same snapshot — the subset discount is real. Bulyan pays a 35× cost factor over Rust FedAvg but gives the strongest distance-based Byzantine guarantee in the suite (breakdown at n - 4f). The oracle in tests/strategy_reference.py composes multi_krum_reference with trimmed_mean_reference end-to-end and Hypothesis tests pin parity.

ArKrum lands near Krum/Multi-Krum at every tier (42 µs / 56 ms / 720 ms vs Krum's 38 µs / 71.4 ms / 721 ms at tiny/medium/large) — the extra per-client filter + change-point step adds O(n) work on top of Krum's O(n²·d) distance loop, and at the realistic n=10 client count that's negligible. The parameter-free benefit (caller doesn't supply f) doesn't cost a perf tier — paying only for the small filter + gap-detection passes. Compared to Bulyan (which composes Multi-Krum with trimmed-mean for the strongest distance-based bound), ArKrum at large is ~2× faster (720 ms vs 1.47 s) — the trade is "no extra trimmed-mean pass over survivors" for "data-driven f̂ instead of a hard breakdown bound." See Strategies → ArKrum for the algorithm and known limitations.

Realistic round cost (run_round + readout)

The run_round numbers above measure aggregation in isolation. A real FL server always reads global weights back after aggregating, to distribute them to clients next round — that call goes through Orchestrator.global_weights(). Before the numpy buffer-protocol migration, that getter returned dict[str, list[float]] with one PyFloat allocation per parameter, and the cost was hidden outside the run_round-only table above.

Before numpy migration (loaded system, 2026-04-22, nephew running Roblox; directional — re-measure on idle before quoting externally):

tier global_weights() only full round (run_round + readout) implied run_round alone getter share
tiny (~1K params) 11.3 µs 15.9 µs ~4.6 µs 71%
medium (~1M) 35.3 ms 39.3 ms ~4.0 ms 90%
large (~10M) 425 ms 459 ms ~34 ms 93%

After numpy migration (same box, same hour, same Roblox — kept so the pre/post delta is apples-to-apples):

tier global_weights() only full round (run_round + readout) getter speedup round speedup
tiny 1.88 µs 6.0 µs 2.6×
medium 129 µs 5.6 ms 273×
large 6.6 ms 56.3 ms 64×

At large, the getter dropped from 425 ms to 6.6 ms and the realistic round from 459 ms to 56.3 ms. .global_weights() is now ~12% of the full round (6.6 / 56.3) instead of 93% — the Rust aggregation kernel is once again the bottleneck, which is what the perf story actually claims.

Clean idle re-measure (2026-04-23) — same numpy path, no background load. Absolutes improve modestly; the migration story itself is unchanged.

tier global_weights() only full round (run_round + readout) Δ getter vs loaded Δ round vs loaded
tiny 1.16 µs 6.0 µs −38% ~0
medium 92.1 µs 5.1 ms −29% −9%
large 5.7 ms 47.0 ms −14% −16%

Realistic-round speedup vs Python FedAvg at large: 5.82 s / 47.0 ms = 124×. Matches the run_round-alone table (138×) because marshaling overhead is essentially gone. The table above is now an honest apples-to-apples read, not a sliver of the user-facing cost.

Findings worth calling out

The Python return path has been closed. Orchestrator.global_weights(), ClientUpdate.weights, free aggregate, and apply_gaussian_noise now return dict[str, numpy.ndarray[float32]] — the underlying Vec<f32> buffer is shared with numpy via the buffer protocol, zero-copy. One ndarray wrapper per layer instead of one PyFloat per parameter; O(layers) instead of O(params). See the "Realistic round cost" subsection above for the measured delta. Input sides (__init__, set_global_weights) stay on HashMap<String, Vec<f32>> — the no-attack input path was already zero-copy.

FedProx is not server-side distinct from FedAvg. In vfl-core/src/strategy.rs, FedProx dispatches to the same aggregation kernel as FedAvg — the proximal term is a client-side regularizer applied during local training, not during server aggregation. The near-identical times are correct, not a measurement artifact. Pick FedProx for the convergence behaviour, not the speed.

FedMedian and TrimmedMean are the slowest aggregators — ~21× (FedMedian) and ~24× (TrimmedMean) Rust FedAvg at large through Python, with TrimmedMean(k=1) within ~14% of FedMedian. Coordinate-wise select_nth_unstable_by is branchy and doesn't vectorise well. Further gains would need SIMD quickselect or a histogram-based median — not worth it until a coordinate-wise robust aggregator sits on a hot path.

Krum and Multi-Krum sit below FedMedian/TrimmedMean through Python at the medium / large tiers (71 ms vs 94 ms at medium; 721 ms vs 890 ms at large). Distance-matrix work is mostly f32 adds and benefits from contiguous access; median's nth-element selection is inherently branchy. Both Byzantine-robust paths cost ~17× Rust FedAvg at large, which is what the O(n²·d) factor predicts.

WSL2 on a shared desktop CPU is noisy. Standard deviations on the large tier sit in the 2–11% range in this idle snapshot (vs 1–17% on loaded snapshots). Directional claims like "Rust FedAvg beats Python FedAvg at every tier" are safe; single-digit-percent regressions will be invisible on this hardware. Point estimates like the speedup column move between snapshots from system load (4–10× on 2026-04-20 loaded vs 87–138× today idle) — the kernel hasn't changed across those snapshots, the Python denominator has. The CodSpeed macro runners are the answer for continuous measurement — tracked as a follow-up, not yet integrated.