Benchmarks
VelocityFL's pitch is "the uv of Federated Learning" — a Rust core under a Python API. That's a performance claim, and it should be measured, not marketed. This page documents the measurement methodology and shows the current numbers.
What we measure
Every FL round is: clients produce weights (user-side, out of our hands) → the server aggregates them. The only thing the library controls is the aggregation step, so that's what we time. Client-weight construction is outside the timed region.
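As a minimal sketch of what "outside the timed region" means, the snippet below builds the client updates first and starts the clock only around the aggregation call. It uses `time.perf_counter` as a stand-in for the real harnesses (divan and pytest-benchmark); `time_aggregation` and `toy_fedavg` are hypothetical names, not part of VelocityFL.

```python
import time


def time_aggregation(aggregate, client_updates, repeats=5):
    """Return the mean wall-clock seconds of aggregate(client_updates)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # timed region starts here
        aggregate(client_updates)
        samples.append(time.perf_counter() - start)  # timed region ends
    return sum(samples) / len(samples)


def toy_fedavg(client_updates):
    """Unweighted coordinate-wise mean, used only as something to time."""
    n = len(client_updates)
    return [sum(column) / n for column in zip(*client_updates)]


# Client weights are constructed before timing begins, so their cost
# never shows up in the measurement.
updates = [[0.01 * (c + i) for i in range(1000)] for c in range(10)]
mean_seconds = time_aggregation(toy_fedavg, updates)
```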
Two layers of measurement serve different purposes:
| Layer | Harness | File | Answers |
|---|---|---|---|
| Rust aggregate, raw | divan | vfl-core/benches/aggregate.rs | How fast is the inner loop? |
| Through the Python API | pytest-benchmark | tests/bench/test_round_speed.py | How fast is the user-visible path (_rust.Orchestrator.run_round) vs the pure-Python fallback? |
The second layer is the one the tagline is defended on — real users call through PyO3, not Rust directly.
How to reproduce
Or invoke the harnesses directly:
cargo bench --bench aggregate
uv run pytest tests/bench/ --benchmark-only --benchmark-columns=mean,stddev,rounds --benchmark-sort=mean
Shape tiers
Three tiers, chosen to span realistic FL model sizes:
| tier | layers | total params | rough analogue |
|---|---|---|---|
| tiny | 4 | ~970 | smoke test / FedAvg sanity check |
| medium | 10 | ~1.0M | ResNet-18-scale |
| large | 16 | ~10.0M | ResNet-50-scale |
All tiers use 10 clients. Weights are deterministic f32 in [-0.1, 0.1].
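One way to get deterministic f32 weights in that range is a seeded PRNG writing into an `array("f")`. This is a sketch under assumptions: the generator the benchmark suite actually uses is not described here, and `make_client_weights`, the seed, and the flat (non-layered) layout are all hypothetical.

```python
import random
from array import array


def make_client_weights(num_clients, num_params, seed=0):
    """Build one flat f32 weight vector per client, reproducibly."""
    rng = random.Random(seed)  # fixed seed makes every run identical
    return [
        # array("f") stores each value as a 32-bit float
        array("f", (rng.uniform(-0.1, 0.1) for _ in range(num_params)))
        for _ in range(num_clients)
    ]


# Roughly the tiny tier: 10 clients, ~970 parameters each.
clients = make_client_weights(num_clients=10, num_params=970)
```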
Results
Snapshot: 2026-04-20. Hardware: AMD Ryzen 5 3600X (6C/12T, Zen 2), 9.7 GB RAM, WSL2 on Windows. rustc 1.95.0, Python 3.12.3, uv 0.11.5, PyO3 0.21, release build (maturin develop --release) with lto = "thin" + codegen-units = 1.
Rust aggregation, raw (divan, mean)
No PyO3 boundary, no Python — the theoretical best case. Aggregation
kernel only, measured against pre-built Vec<f32> client updates.
| strategy | tiny | medium | large |
|---|---|---|---|
| FedAvg | 5.0 µs | 7.6 ms | 121 ms |
| FedProx | 4.1 µs | 7.7 ms | 124 ms |
| FedMedian | 83 µs | 96.1 ms | 1.01 s |
| TrimmedMean(k=1) | 90 µs | 105 ms | 1.06 s |
| Krum(f=1) | 36 µs | 92.7 ms | 990 ms |
| MultiKrum(f=1) | 43 µs | 97.6 ms | 1.01 s |
FedAvg on a 10M-parameter model with 10 clients aggregates in ~121 ms.
FedAvg accumulates in f64 and downcasts to f32 at the end to bound
rounding error as client counts scale. FedMedian uses
select_nth_unstable_by (O(C) expected per coordinate, where C is the
client count) on a scratch buffer hoisted out of the coordinate loop.
Krum and Multi-Krum share a pairwise-distance matrix (O(n² · d) per
round, for n clients and d parameters). At the tiny tier they run about
2× faster than FedMedian (36 µs and 43 µs vs 83 µs); at the medium and
large tiers, where the distance sum dominates, they converge on
FedMedian's cost.
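The f64-accumulate, f32-downcast pattern can be sketched in pure Python, where `float` is already 64-bit and `array("f")` gives 32-bit storage. This is an illustration, not the Rust kernel: `fedavg_f64` is a hypothetical name, and weighting clients by sample count is an assumption (the text above does not say how VelocityFL weights clients).

```python
from array import array


def fedavg_f64(client_updates, num_samples):
    """Weighted mean with 64-bit accumulators and one final downcast."""
    total = float(sum(num_samples))
    dim = len(client_updates[0])
    acc = [0.0] * dim  # Python floats are f64, so sums stay in f64
    for update, n in zip(client_updates, num_samples):
        weight = n / total
        for i, value in enumerate(update):
            acc[i] += weight * value
    return array("f", acc)  # single rounding step down to f32


result = fedavg_f64(
    [array("f", [1.0, 2.0]), array("f", [3.0, 4.0])],
    num_samples=[1, 3],
)
```

Rounding happens once per coordinate at the end, instead of once per client per coordinate, which is why the error does not grow with the number of clients.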
Through the Python API (pytest-benchmark, mean)
This is the number users actually feel. _rust.Orchestrator.run_round
with pre-built client updates, crossing the PyO3 boundary once per round,
compared against a pure-Python FedAvg on the same inputs.
| tier | Rust FedAvg | Rust FedProx | Rust FedMedian | Rust TrimmedMean(k=1) | Rust Krum(f=1) | Rust MultiKrum(f=1) | Python FedAvg | Rust FedAvg speedup vs Python FedAvg |
|---|---|---|---|---|---|---|---|---|
| tiny (~1K) | 4.4 µs | 4.7 µs | 84 µs | 88 µs | 39 µs | 41 µs | 417 µs | 95× |
| medium (~1M) | 4.4 ms | 4.4 ms | 100 ms | 101 ms | 58 ms | 59 ms | 512 ms | 117× |
| large (~10M) | 53 ms | 81 ms | 956 ms | 1.04 s | 729 ms | 818 ms | 5.12 s | 97× |
Pure-Python FedAvg at the large tier costs ~5.1 s per round on this
snapshot. The full tests/bench/ suite takes ~7 min on this box at
21 tests × 6 strategies. The speedup range widened from 4–10× on the
prior 2026-04-20 snapshot to ~95–117× today, under a less-loaded WSL
state. The underlying Rust kernel is unchanged. Python FedAvg-large
moved from 8.48 s → 5.12 s (~1.7×), while Rust FedAvg-large moved from
817 ms → 53 ms (~15×), so most of the shift comes from the Rust side
and is attributed to system load, not to any algorithm change. Treat
the speedup column as directional until CodSpeed lands; the consistent
finding is "Rust FedAvg dominates at every tier."
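The arithmetic behind the speedup column and the snapshot-to-snapshot shift can be checked directly from the rounded figures quoted on this page. The variable names are illustrative. Ratios computed from rounded table entries can differ by about 1× from the published column, which was presumably derived from unrounded timings.

```python
# (Python FedAvg seconds, Rust FedAvg seconds), copied from the table above.
current = {
    "tiny": (417e-6, 4.4e-6),
    "medium": (512e-3, 4.4e-3),
    "large": (5.12, 53e-3),
}
speedup = {tier: py / rust for tier, (py, rust) in current.items()}

# Large-tier figures quoted for the prior snapshot.
prior_py, prior_rust = 8.48, 817e-3
prior_speedup = prior_py / prior_rust          # about 10x
python_shift = prior_py / current["large"][0]  # about 1.7x
rust_shift = prior_rust / current["large"][1]  # about 15x

for tier, ratio in speedup.items():
    print(f"{tier}: {ratio:.1f}x")
```

The Rust-side shift is roughly nine times larger than the Python-side shift, which is why the speedup ratio moved by about an order of magnitude between snapshots.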
Orchestrator.run_round takes a zero-copy fast path when no attacks are
registered: the PyO3 wrapper passes &ClientUpdate slices straight into
the aggregation kernel, so no f32 weight data is cloned between Python
and aggregation. When attacks are registered, the owned path kicks in
automatically (attacks can mutate client updates).
TrimmedMean tracks FedMedian at every tier (101 ms vs 100 ms at
medium; 1.04 s vs 956 ms at large). Two select_nth_unstable_by calls
per coordinate cost slightly more than FedMedian's single call at k=1,
even with the median-of-evens averaging skipped — the partition is the
hot loop, the post-processing isn't. The two-call structure becomes
worthwhile as k grows (the second call operates on a smaller window),
but at k=1 it's a wash. The matched NumPy oracle in
tests/strategy_reference.py confirms parity.
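For reference, the two coordinate-wise aggregators can be written in a few lines of pure Python. This sketch sorts each coordinate's values rather than using a quickselect partition like select_nth_unstable_by, so it shows the semantics only, not the performance profile; the function names are hypothetical and this is not the oracle in tests/strategy_reference.py.

```python
from statistics import median


def coordinate_median(client_updates):
    """Median of each coordinate across clients."""
    return [median(column) for column in zip(*client_updates)]


def coordinate_trimmed_mean(client_updates, k=1):
    """Drop the k smallest and k largest values per coordinate, then average."""
    if len(client_updates) <= 2 * k:
        raise ValueError("need more than 2*k clients")
    out = []
    for column in zip(*client_updates):
        kept = sorted(column)[k:len(column) - k]
        out.append(sum(kept) / len(kept))
    return out


# Five clients, one coordinate, one obvious outlier (100.0).
updates = [[1.0], [2.0], [6.0], [7.0], [100.0]]
med = coordinate_median(updates)
trimmed = coordinate_trimmed_mean(updates, k=1)
```

Both aggregators discard the outlier; they differ only in how many of the remaining values contribute to the result.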
Krum/Multi-Krum land between Rust FedAvg and Python FedAvg at the
medium and large tiers (729 ms / 818 ms at large: faster than Python
FedAvg's 5.12 s, but ~14× slower than Rust FedAvg's 53 ms). That is
algorithmically honest: Krum is O(n²·d), FedAvg is O(n·d). The f=1 Krum
kernel builds a 10×10 pairwise-distance matrix across all d parameters
per call; at large (d ≈ 10 M), that's ~500 M f32 adds before the top-k
selection. Robustness buys a ~14× cost factor over non-robust FedAvg at
this scale. The matched Python oracle in tests/strategy_reference.py
confirms the kernel is correct; the cost is algorithmic, not a perf bug.
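A compact sketch of textbook Krum shows where the O(n²·d) term comes from. It assumes each unordered client pair is computed once and that a client's score is the sum of squared distances to its n − f − 2 nearest neighbours; whether the Rust kernel is organised the same way is not confirmed here, and all names are hypothetical.

```python
def pairwise_sq_distances(updates):
    """Symmetric matrix of squared L2 distances; one pass per unordered pair."""
    n = len(updates)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(updates[i], updates[j]))
            dist[i][j] = dist[j][i] = d
    return dist


def krum(updates, f=1):
    """Return the update whose n - f - 2 nearest neighbours are closest."""
    n = len(updates)
    closest = n - f - 2
    if closest < 1:
        raise ValueError("Krum needs n >= f + 3 clients")
    dist = pairwise_sq_distances(updates)
    scores = []
    for i in range(n):
        neighbours = sorted(dist[i][j] for j in range(n) if j != i)
        scores.append(sum(neighbours[:closest]))
    best = min(range(n), key=scores.__getitem__)
    return updates[best]


def distance_terms(n, d):
    """Per-coordinate terms summed across all unordered pairs."""
    return n * (n - 1) // 2 * d


chosen = krum([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [10.0, 10.0]])
large_tier_terms = distance_terms(n=10, d=10_000_000)
```

With 10 clients there are 45 unordered pairs, so the large tier sums about 450 million per-coordinate terms, in line with the ~500 M figure above.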
Findings worth calling out
The Python return path is the next obvious lever. Orchestrator.run_round
still marshals HashMap<String, Vec<f32>> → Python dict[str,
list[float]] on return, allocating one PyFloat per parameter. The
input side is already zero-copy on the no-attack path (the PyO3 wrapper
passes &ClientUpdate slices to the kernel), so the return side is what
remains: a numpy.ndarray / buffer-protocol return would share the
underlying f32 buffer with zero copies. Today's Rust-FedAvg-large is
53 ms through the Python surface; the ~121 ms raw divan figure comes
from the prior snapshot, and no same-day raw divan number was measured,
so the two are only directionally comparable. The return-path
marshalling cost is what the buffer-protocol PR is projected to close.
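The difference between a per-element copy and a buffer-protocol view can be seen with the standard library alone. Here `array("f")` stands in for a Rust-owned f32 buffer; this illustrates the general mechanism and is not the VelocityFL return type or the planned PR.

```python
from array import array

weights = array("f", [0.5, -0.25, 0.125])  # contiguous f32 storage

as_list = weights.tolist()   # copies: one Python float object per element
view = memoryview(weights)   # buffer protocol: shares the same memory

weights[0] = 1.0             # mutate the underlying buffer

copy_saw_change = as_list[0] == 1.0
view_saw_change = view[0] == 1.0
bytes_per_item = view.itemsize
```

The list keeps its stale value because it owns separate objects, while the view reflects the write because no data was duplicated. A `numpy.ndarray` built over such a buffer would behave like the view.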
FedProx is not server-side distinct from FedAvg. In
vfl-core/src/strategy.rs, FedProx dispatches to the same aggregation
kernel as FedAvg — the proximal term is a client-side regularizer
applied during local training, not during server aggregation. The
near-identical times are correct, not a measurement artifact. Pick
FedProx for the convergence behaviour, not the speed.
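To make the client-side nature of the proximal term concrete, here is a sketch of one local gradient step under the standard FedProx objective, which adds (mu/2)·‖w − w_global‖² to the local loss. The function name, learning rate, and mu value are hypothetical; client training code is user-side and not part of the library's aggregation path.

```python
def fedprox_local_step(weights, global_weights, grads, lr=0.1, mu=0.01):
    """One SGD step on local loss + (mu / 2) * ||w - w_global||^2."""
    return [
        # The proximal gradient mu * (w - wg) pulls w back toward the
        # global model; with mu = 0 this is plain SGD.
        w - lr * (g + mu * (w - wg))
        for w, wg, g in zip(weights, global_weights, grads)
    ]


pulled = fedprox_local_step([1.0], [0.0], [0.0], lr=0.5, mu=1.0)
plain = fedprox_local_step([1.0], [0.0], [0.0], lr=0.5, mu=0.0)
```

The server never sees mu: it receives the resulting weights and averages them exactly as FedAvg would, which is why the two strategies time the same.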
FedMedian and TrimmedMean are the slowest aggregators, at ~18× and ~20×
Rust FedAvg at large through Python (956 ms and 1.04 s vs 53 ms), with
TrimmedMean(k=1) within ~10% of FedMedian. Coordinate-wise
select_nth_unstable_by is branchy and doesn't vectorise well. Further
gains would need SIMD quickselect or a histogram-based median, which is
not worth it until a coordinate-wise robust aggregator sits on a hot
path.
Krum and Multi-Krum run faster than FedMedian/TrimmedMean through
Python at the medium / large tiers (58 ms vs 100 ms at medium; 729 ms
vs 956 ms at large). Distance-matrix work is mostly f32 adds and
benefits from contiguous access; median's nth-element selection is
inherently branchy. Krum and Multi-Krum cost ~14–15× Rust FedAvg at
large (729 ms and 818 ms vs 53 ms), which is in line with what the
O(n²·d) factor predicts.
WSL2 on a shared desktop CPU is noisy. Standard deviations on the
large tier sit in the 1–17% range in this snapshot. Directional claims
like "Rust FedAvg beats Python FedAvg at every tier" are safe;
single-digit-percent regressions will be invisible on this hardware.
Point estimates like the speedup column move by roughly an order of
magnitude between snapshots purely from system load (4–10× on the prior
2026-04-20 snapshot vs 95–117× today). The CodSpeed macro runners are
the answer for continuous measurement; that work is tracked as a
follow-up and is not yet integrated.
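A rough way to reason about what is detectable on noisy hardware is to compare a change against the relative standard deviation of the samples. The rule of thumb below is deliberately crude (no confidence intervals), and the sample values are made up for illustration, not taken from the benchmark output.

```python
from statistics import mean, stdev


def relative_stddev(samples):
    """Sample standard deviation as a fraction of the mean."""
    return stdev(samples) / mean(samples)


def is_detectable(baseline_mean, new_mean, rel_std):
    """Crude check: the relative shift must exceed the relative noise."""
    return abs(new_mean - baseline_mean) / baseline_mean > rel_std


# Hypothetical timings in milliseconds with 10% relative noise.
noise = relative_stddev([100.0, 110.0, 90.0])

small_regression = is_detectable(100.0, 105.0, rel_std=0.17)
large_regression = is_detectable(100.0, 150.0, rel_std=0.17)
```

At 17% noise a 5% regression is lost in the spread, while a 50% regression stands out, matching the point that only directional claims are safe on this box.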