Convergence¶

Benchmarks measure how fast the aggregation kernel runs. This page measures whether the framework actually does federated learning — i.e., whether a real model trained across real non-IID clients via the VelocityFL orchestrator converges to a useful classifier.

Performance numbers (docs/benchmarks.md) and convergence numbers (this page) are kept separate on purpose. A fast aggregator that doesn't converge is worthless; a slow one that does is at least correct. The benchmarks defend the speed claim; this page defends the FL works at all claim.

What we measure¶

A run "converges" if all of the following hold across rounds:

Test loss decreases monotonically (with slack — FedAvg on non-IID data is not a contraction, single-round wiggles are allowed).
Test accuracy improves from random-guess baseline to a useful level for the task.
RoundSummary.global_loss matches the loss the caller actually measured on the test set — the Rust core stores what the caller passes in via reported_loss=, never a proxy.

The Rust orchestrator does not see models or data, so it cannot invent a loss. RoundSummary.global_loss is NaN unless the caller passes a real value to run_round(..., reported_loss=...).

Reproducing¶

The hermetic convergence proof runs on every CI build:

uv run pytest tests/test_convergence.py -v

The MNIST and CIFAR-10 demos require the [hf,torch] extras and download their datasets on first run via Hugging Face datasets:

uv pip install 'velocity-fl[hf,torch]'
uv run maturin develop --release
uv run python examples/mnist_fedavg.py
uv run python examples/cifar10_fedavg_dirichlet.py

Results¶

Hermetic — Gaussian-blobs, 4 clients, non-IID, 8 rounds¶

Synthetic 2D Gaussian-blobs (4 classes, well-separated). Sharded non-IID across 4 clients (each client sees ~1 class). Tiny MLP (2→16→4). Runs in ~14s on CPU, no network. Lives at tests/test_convergence.py.

The test asserts:

loss[final] < loss[initial] * 0.5
No round regresses by more than 25% over the previous round.
Final test accuracy ≥ 0.85.
Final accuracy strictly greater than first-round accuracy.

This is the regression guard — if a future change to aggregation, weight serialisation, or the PyO3 boundary breaks real FL, this test fails before the merge.

MNIST FedAvg — 5 clients, non-IID, 10 rounds¶

Snapshot: 2026-04-20, AMD Ryzen 5 3600X (WSL2), single CPU thread per round of local training. MNIST sharded so each client only sees 2 of the 10 digit classes (McMahan et al. 2017 setup). Small MLP (784→128→64→10, ~109K params), batch size 64, lr=0.01, 1 local epoch per round. Data loaded via load_federated("ylecun/mnist", partition="shard", shards_per_client=2) with seed=0.

round	pre-loss	post-loss	post-acc	sec/round
init	2.3072	—	0.110	—
1	2.3072	1.8419	0.432	1.43
2	1.8419	1.2093	0.663	1.40
3	1.2093	0.8882	0.714	1.35
4	0.8882	0.6797	0.798	1.39
5	0.6797	0.5962	0.811	1.30
6	0.5962	0.5103	0.849	1.37
7	0.5103	0.4844	0.845	1.25
8	0.4844	0.4037	0.882	1.33
9	0.4037	0.4099	0.866	1.27
10	0.4099	0.3510	0.897	1.28

11.0% → 89.7% test accuracy in 10 rounds. Loss falls from 2.31 to 0.35. The single-round dip at round 9 is expected on non-IID data — FedAvg is not a contraction when client objectives disagree.

The loss/accuracy trajectory is bit-identical to the previous torchvision-based snapshot — same bytes (HF ylecun/mnist and torchvision pull the same LeCun source), same seeds, same model — but sec/round dropped ~8× because load_federated materialises tensors once up-front rather than re-decoding PIL every batch. The cost moved from hot-path to load-path.

Per-round wall time is still dominated by client-side local training in PyTorch, not by aggregation. The Rust orchestrator's contribution at this model size is ~10 ms (see docs/benchmarks.md medium tier).

CIFAR-10 Dirichlet — 10 clients, α=0.1, 10 rounds¶

Snapshot: 2026-04-20, AMD Ryzen 5 3600X (WSL2). CIFAR-10 partitioned with Dirichlet(α=0.1) across 10 clients — heavy label skew, per-client sample counts range from 604 to 10,626. Small CNN (conv32→conv64→ fc128→10, ~550K params), batch size 64, lr=0.01, 2 local epochs per round. Data loaded via load_federated("cifar10", partition="dirichlet", alpha=0.1, min_partition_size=50) with seed=0.

round	pre-loss	post-loss	post-acc	sec/round
init	2.3081	—	0.107	—
1	2.3081	2.0880	0.262	45.74
2	2.0880	1.7357	0.393	46.21
3	1.7357	1.4509	0.487	45.84
4	1.4509	1.3803	0.522	45.70
5	1.3803	1.2154	0.560	45.28
6	1.2154	1.1537	0.594	45.16
7	1.1537	1.0767	0.624	45.46
8	1.0767	1.0756	0.620	45.27
9	1.0756	1.0454	0.630	46.94
10	1.0454	1.0501	0.631	47.00

10.7% → 63.1% test accuracy in 10 rounds under heavy label skew. Loss falls from 2.31 to 1.05. The nightly example enforces a 0.55 floor with ~8 percentage points of slack. A centralised training run (all data on one client, same model/schedule) would reach higher accuracy; the gap is the cost of heavy non-IID. That cost is exactly what robustness research measures against.

What this proves and what it doesn't¶

Proves: the end-to-end pipeline (PyTorch local training → state-dict flatten → PyO3 boundary → Rust FedAvg → state-dict unflatten → eval) is correct. A real model genuinely trains on both a forgiving task (MNIST) and a harder heavily non-IID task (CIFAR-10 Dirichlet α=0.1).

Does not prove:

Convergence at scale (50–100+ clients). The hermetic test uses 4, MNIST uses 5, CIFAR-10 uses 10. Non-IID FedAvg behaviour shifts with client count.
Convergence on language modelling or multi-modal tasks — the current column standardisation only targets (image, label) shapes.
Robustness with attacks active. Existing attack simulations (model poisoning, Sybil, Gaussian noise, label flipping) live in vfl-core/src/security.rs and have unit tests but no end-to-end-with-real-training convergence assertion yet.

What's still mocked¶

The aggregation path, round loop, datasets, and partitioners are now real end-to-end. The following surfaces still contain stand-ins:

Attacks (vfl-core/src/security.rs) — simulate_* functions do mutate weights, but only as illustrative perturbations rather than implementations of published attack literature (e.g. real inner-product manipulation, real projected-gradient attacks). The unit tests assert "weights changed," not "attack achieves its documented effect."
Storage — ExperimentConfig.storage is a URI string with no resolver behind it.
Strategies — FedAvg, FedProx (server-side identical to FedAvg by construction; the proximal term is client-side), FedMedian, Krum, and MultiKrum are real. Other published strategies (Trimmed Mean, Bulyan, FedYogi, FedAdam) are not implemented.
Orchestrator dataset field — on the Rust side, ExperimentConfig.dataset remains a record-keeping string. The real loading entry point is velocity.datasets.load_federated, which is where HF resolution actually happens; the Rust field is preserved only so it appears verbatim in history_json().

These are tracked as roadmap items: replace each with a real, plug-and-play implementation, then add a convergence test that exercises it from this page.

Follow-ups¶

Convergence-under-attack: assert FedMedian survives a poisoned client where FedAvg is corrupted, measured rather than asserted by hand
Dirichlet-α sweep: chart accuracy vs heterogeneity at fixed client count (α ∈ {0.01, 0.1, 0.5, 1.0, ∞ ≈ IID})
Crowd-scale convergence (50+ clients) — paired with the crowd-scale benchmark tier so speed and convergence numbers come from the same setup