BALLAST: When a Bandit Teaches Your Database How Long to Wait

Imagine you’re a team leader. You send a message and wait for a response. How long do you wait before assuming your colleague has “disappeared”? Too short — and you panic for no reason. Too long — and the whole project stalls. BALLAST is a system that teaches databases to answer this question automatically, using machine learning techniques.

The Problem: Raft’s Achilles Heel

Raft is a consensus protocol — the way distributed databases (like etcd, Consul, CockroachDB) agree on who’s the “leader” and which data is current. It works like this:

One node is the leader — it accepts writes
Other nodes are followers — they replicate data
If a follower doesn’t hear from the leader for a certain time (election timeout), it starts a new election

The problem? That “certain time” is typically a random value from a range, e.g., 150-300ms. And that’s where trouble begins.
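The randomized timeout described above can be sketched in a few lines. This is a minimal illustration, not code from any particular Raft implementation: the `XorShift64` generator and the function names are hypothetical helpers chosen so the example needs no external crates, and the simple modulo draw is slightly biased, which does not matter for illustration.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// (Hypothetical helper; a real system would use a proper RNG.)
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so fall back to a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Draw an election timeout in milliseconds from the inclusive range [min_ms, max_ms].
pub fn random_election_timeout(rng: &mut XorShift64, min_ms: u64, max_ms: u64) -> u64 {
    assert!(min_ms <= max_ms);
    let span = max_ms - min_ms + 1;
    min_ms + rng.next_u64() % span
}

/// A follower starts an election once it has heard nothing for `timeout_ms`.
pub fn should_start_election(ms_since_last_heartbeat: u64, timeout_ms: u64) -> bool {
    ms_since_last_heartbeat >= timeout_ms
}
```

Calling `random_election_timeout(&mut rng, 150, 300)` on each node gives every follower a different deadline, which is what normally lets one candidate get ahead of the others.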

Why Random Timeouts Fail

In a stable local network (LAN), random timeouts work great. But in reality:

WAN networks have variable latency (sometimes 10ms, sometimes 500ms)
Packets get lost — especially under load
Delays are correlated — when things go bad, they go bad for everyone
Nodes differ — one on a fast server, another on an overloaded machine

The result? Followers time out while the leader is still healthy, and split votes follow: multiple nodes start elections at the same time and none wins a majority. The system can be unavailable for seconds, sometimes minutes.
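One way to build intuition for why latency matters: if the two earliest election timers fire closer together than the time a vote request needs to travel, the second node becomes a candidate before it hears from the first. The sketch below encodes that rule of thumb. It is a deliberate simplification and an assumption of this example, not the model used in the paper, and the function names are hypothetical.

```rust
/// Gap in milliseconds between the two earliest election timers, if there are at least two.
pub fn earliest_gap_ms(timeouts_ms: &[u64]) -> Option<u64> {
    let mut sorted = timeouts_ms.to_vec();
    sorted.sort_unstable();
    match sorted.as_slice() {
        [first, second, ..] => Some(second - first),
        _ => None,
    }
}

/// Simplified risk check (an assumption of this sketch, not the paper's model):
/// competing candidates are likely when the runner-up timer fires before the
/// first candidate's vote request can arrive.
pub fn split_vote_risk(timeouts_ms: &[u64], one_way_delay_ms: u64) -> bool {
    matches!(earliest_gap_ms(timeouts_ms), Some(gap) if gap < one_way_delay_ms)
}
```

With timers at 160, 175 and 290 ms the gap is 15 ms: harmless with a 10 ms one-way delay, risky once the delay grows to 200 ms.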

The Solution: Let the Machine Choose Timeouts

BALLAST (Bandit-Assisted Learning for Latency-Aware Stable Timeouts) replaces static heuristics with contextual bandits — algorithms that learn to pick the best option based on context.

Contextual Bandits in a Nutshell

Imagine a slot machine with multiple levers (arms). Each gives a different reward. You don’t know which is best — you need to explore. But you also want to earn — so you need to exploit what you already know.

A contextual bandit goes further: before each choice, it sees “context” — e.g., current network conditions. And it learns that in this specific context, this particular lever works best.

How BALLAST Uses Bandits

Arms: Discrete timeout values to choose from (e.g., 100ms, 150ms, 200ms, 300ms)
Context: Current network conditions — latency, packet loss, recent election history
Reward: Speed of return to stability, minimizing unavailability time
Algorithm: LinUCB variant — linear bandit with upper confidence bound

$$ a_t = \arg\max_a \left( \theta_a^\top x_t + \alpha \sqrt{x_t^\top A_a^{-1} x_t} \right) $$

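where $x_t$ is the context vector, $\theta_a$ are the learned weights for arm $a$, $A_a$ is the arm’s design matrix (in standard LinUCB, the identity plus the sum of $x x^\top$ over the contexts in which arm $a$ was played), and $\alpha$ scales the second term, the “exploration bonus,” which shrinks as an arm is tried more often in similar contexts.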
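To make the formula concrete, here is a minimal sketch of textbook disjoint LinUCB using only the standard library. It is not the BALLAST implementation (the article only says a "LinUCB variant" is used). The type and method names are hypothetical, $A_a$ is assumed to start as the identity, and the inverse is maintained incrementally with the Sherman-Morrison identity instead of being recomputed.

```rust
/// Per-arm state for disjoint LinUCB: the inverse design matrix A_a^{-1} and vector b_a.
struct Arm {
    a_inv: Vec<Vec<f64>>, // starts as the identity, i.e. A_a = I (assumed ridge prior)
    b: Vec<f64>,
}

pub struct LinUcb {
    arms: Vec<Arm>,
    alpha: f64,
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn mat_vec(m: &[Vec<f64>], v: &[f64]) -> Vec<f64> {
    m.iter().map(|row| dot(row, v)).collect()
}

fn identity(d: usize) -> Vec<Vec<f64>> {
    (0..d)
        .map(|i| (0..d).map(|j| if i == j { 1.0 } else { 0.0 }).collect())
        .collect()
}

impl LinUcb {
    pub fn new(n_arms: usize, dim: usize, alpha: f64) -> Self {
        let arms = (0..n_arms)
            .map(|_| Arm { a_inv: identity(dim), b: vec![0.0; dim] })
            .collect();
        LinUcb { arms, alpha }
    }

    /// Upper confidence bound theta_a^T x + alpha * sqrt(x^T A_a^{-1} x) for one arm.
    pub fn score(&self, arm: usize, x: &[f64]) -> f64 {
        let arm = &self.arms[arm];
        let theta = mat_vec(&arm.a_inv, &arm.b);
        let a_inv_x = mat_vec(&arm.a_inv, x);
        dot(&theta, x) + self.alpha * dot(x, &a_inv_x).max(0.0).sqrt()
    }

    /// Pick the arm with the highest score (the first arm wins ties).
    pub fn select(&self, x: &[f64]) -> usize {
        let mut best = 0;
        for a in 1..self.arms.len() {
            if self.score(a, x) > self.score(best, x) {
                best = a;
            }
        }
        best
    }

    /// Online update after observing `reward` for `arm` in context `x`.
    /// A_a += x x^T is applied to the inverse via the Sherman-Morrison identity.
    pub fn update(&mut self, arm: usize, x: &[f64], reward: f64) {
        let arm = &mut self.arms[arm];
        let u = mat_vec(&arm.a_inv, x); // A^{-1} x (A^{-1} is symmetric)
        let denom = 1.0 + dot(x, &u);
        for i in 0..u.len() {
            for j in 0..u.len() {
                arm.a_inv[i][j] -= u[i] * u[j] / denom;
            }
        }
        for (bi, xi) in arm.b.iter_mut().zip(x) {
            *bi += reward * xi;
        }
    }
}
```

In a timeout setting, each arm index would map to one candidate timeout, `select` would be called when a timer is armed, and `update` would be called once the outcome (for example, negative unavailability) is known. Playing an arm shrinks its exploration bonus, so untried arms are eventually sampled.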

Safe Exploration

Here’s the catch: exploring in a production system can be dangerous. What if the bandit picks a very short timeout during a momentary network overload? A cascade of unnecessary elections.

BALLAST introduces risk-capping — limiting risk during unstable periods:

When the system is unstable, the bandit acts conservatively
Exploration happens mainly during “calm” moments
Upper and lower bounds on selected timeouts
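The three rules above could be combined roughly as follows. The article does not give the exact mechanism, so everything here is an assumption made for illustration: the `RiskCap` struct, its field names, the use of recent failed elections as the instability signal, and the choice to switch exploration off entirely while unstable.

```rust
/// Hypothetical risk-capping policy; field names and thresholds are assumptions of this sketch.
pub struct RiskCap {
    pub min_ms: u64,               // hard lower bound on any selected timeout
    pub max_ms: u64,               // hard upper bound on any selected timeout
    pub unstable_floor_ms: u64,    // shortest timeout allowed while unstable
    pub max_failed_elections: u32, // more recent failures than this counts as unstable
}

impl RiskCap {
    pub fn is_unstable(&self, recent_failed_elections: u32) -> bool {
        recent_failed_elections > self.max_failed_elections
    }

    /// Exploration weight to use in the bandit: full `alpha` when calm, none when unstable.
    pub fn exploration_weight(&self, alpha: f64, recent_failed_elections: u32) -> f64 {
        if self.is_unstable(recent_failed_elections) { 0.0 } else { alpha }
    }

    /// Clamp the bandit's proposed timeout into the currently allowed range.
    pub fn cap(&self, proposed_ms: u64, recent_failed_elections: u32) -> u64 {
        let floor = if self.is_unstable(recent_failed_elections) {
            self.unstable_floor_ms.max(self.min_ms)
        } else {
            self.min_ms
        };
        proposed_ms.clamp(floor.min(self.max_ms), self.max_ms)
    }
}
```

With bounds of 100 to 1000 ms and an unstable floor of 300 ms, a proposed 150 ms timeout passes through in calm periods but is raised to 300 ms after repeated failed elections.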

Results: Numbers Speak for Themselves

BALLAST was tested in discrete-event simulation with realistic scenarios:

Test Scenarios

Scenario	Description
Long-tail delay	Heavy-tailed latency — occasionally very long
Packet loss	Random packet drops
Correlated bursts	Correlated problems — when bad, bad everywhere
Node heterogeneity	Nodes with different performance
Partition recovery	Recovery after network partition

Key Findings

Dramatic improvement in challenging WAN conditions — significantly shorter unavailability time
No regression in easy cases — on stable LAN, works as well as standard approach
Real-time adaptation — system learns on the fly, no manual tuning required

Who Is This For?

For Distributed Systems Engineers

If you build or maintain Raft-based systems (etcd, Consul, TiKV, CockroachDB), BALLAST shows that:

Default timeouts are just a starting point
ML can significantly improve availability under tough conditions
It’s worth monitoring split vote frequency

For ML Researchers

An interesting contextual bandit use case:

Non-stationary environment (network conditions change)
Safety requirement (safe exploration)
Discrete action space with continuous context

Technical Details

For the advanced reader — key elements:

Arm space is discretized — instead of choosing a continuous timeout value, the bandit selects from a fixed set (e.g., 8-16 options). This simplifies learning and ensures stability.
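A fixed arm set of this kind is easy to generate. The sketch below assumes evenly spaced values between two bounds. The spacing and the bounds are choices made for this example, since the article only states that the set is fixed and has roughly 8-16 entries.

```rust
/// Build `n` evenly spaced timeout arms (in ms) covering [min_ms, max_ms] inclusive.
/// Even spacing is an assumption of this sketch; a log-spaced grid would work too.
pub fn timeout_arms(min_ms: u64, max_ms: u64, n: usize) -> Vec<u64> {
    assert!(n >= 2 && min_ms < max_ms);
    let step = (max_ms - min_ms) as f64 / (n - 1) as f64;
    (0..n).map(|i| min_ms + (i as f64 * step).round() as u64).collect()
}
```

For example, `timeout_arms(100, 800, 8)` yields eight arms from 100 ms to 800 ms in 100 ms steps; the bandit then only ever chooses an index into this vector.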

Context includes:

Average RTT from last N measurements
Latency variance
Number of recent failed elections
Time since last stable leader term
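The four signals listed above might be packed into a context vector like this. The `Observations` struct, its field names, and the leading constant bias feature are assumptions of this sketch rather than details from the paper. The variance is the population variance over the stored samples.

```rust
/// Observations a node might keep locally; the struct and field names are hypothetical.
pub struct Observations<'a> {
    pub rtt_samples_ms: &'a [f64], // last N round-trip times
    pub recent_failed_elections: u32,
    pub ms_since_stable_leader: f64,
}

/// Turn raw observations into a context vector:
/// [bias, mean RTT, RTT variance, failed elections, time since stable leader].
pub fn context_vector(obs: &Observations<'_>) -> Vec<f64> {
    // Guard against division by zero when no samples have been collected yet.
    let n = obs.rtt_samples_ms.len().max(1) as f64;
    let mean = obs.rtt_samples_ms.iter().sum::<f64>() / n;
    let variance = obs
        .rtt_samples_ms
        .iter()
        .map(|r| (r - mean).powi(2))
        .sum::<f64>()
        / n;
    vec![
        1.0,
        mean,
        variance,
        obs.recent_failed_elections as f64,
        obs.ms_since_stable_leader,
    ]
}
```

In practice the features would likely need rescaling to comparable ranges before being fed to a linear model, since raw variance and elapsed time can be orders of magnitude larger than the other entries.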

Model updates happen online, after each observation. No offline training or historical data collection required.

Conclusion

BALLAST demonstrates an elegant marriage of two worlds: machine learning theory (contextual bandits) and distributed systems engineering (Raft protocol). Instead of manually tuning timeouts and hoping they work under all conditions, the system learns optimal values itself.

This is an important direction: more and more “magic constants” in distributed systems can be replaced with adaptive ML algorithms.

Custom Implementation: Rust Simulator

To better understand how BALLAST works, I implemented my own discrete-event simulator in Rust. The project compares two timeout selection strategies:

Random: Classic randomized timeouts (150-300ms) as in standard Raft
BALLAST: LinUCB contextual bandit adapting timeouts to conditions
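For readers new to discrete-event simulation, the core idea is a priority queue of timestamped events: simulated time jumps straight to the next event instead of ticking in real time. The sketch below shows that skeleton only. The event names and the `EventQueue` type are hypothetical and are not taken from the project's `simulation.rs`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Event kinds for a toy simulation; names are hypothetical, not taken from the project.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Event {
    HeartbeatArrives { node: usize },
    ElectionTimerFires { node: usize },
}

/// Minimal discrete-event queue: events are processed in timestamp order.
#[derive(Default)]
pub struct EventQueue {
    // BinaryHeap is a max-heap, so entries are wrapped in Reverse to pop the earliest first.
    heap: BinaryHeap<Reverse<(u64, u64, Event)>>, // (time_ms, insertion order, event)
    next_seq: u64,
}

impl EventQueue {
    pub fn schedule(&mut self, at_ms: u64, event: Event) {
        self.heap.push(Reverse((at_ms, self.next_seq, event)));
        self.next_seq += 1;
    }

    /// Pop the earliest event; ties are broken by insertion order.
    pub fn pop(&mut self) -> Option<(u64, Event)> {
        self.heap.pop().map(|Reverse((t, _, e))| (t, e))
    }
}
```

A simulation loop would repeatedly call `pop`, advance its clock to the returned timestamp, handle the event, and `schedule` any follow-up events such as the next heartbeat or a reset election timer.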

Simulation Results

Scenario	Random (ms)	BALLAST (ms)	Improvement
WAN Bursty	19068	1763	+90.8%
Heterogeneous	993	304	+69.4%
Stable LAN	214	184	+14.0%
Stable WAN	388	343	+11.6%
WAN Lossy	466	419	+10.1%
TOTAL	21537	3448	+84.0%

BALLAST reduces total cluster unavailability by 84% (from 21537 ms down to 3448 ms)!
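The Improvement column is the relative reduction in unavailability compared with the Random baseline. A one-line helper (the name `improvement_pct` is chosen here for illustration) reproduces the figures in the table:

```rust
/// Relative reduction in unavailability, in percent, when moving from `baseline_ms` to `new_ms`.
pub fn improvement_pct(baseline_ms: f64, new_ms: f64) -> f64 {
    (baseline_ms - new_ms) / baseline_ms * 100.0
}
```

For the WAN Bursty row, `improvement_pct(19068.0, 1763.0)` is about 90.8, and for the TOTAL row, `improvement_pct(21537.0, 3448.0)` is about 84.0.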

What the Simulation Shows

Most interesting observations:

WAN Bursty (+90.8%): In scenarios with correlated latency bursts, BALLAST learns to use long timeouts (300-1000ms), while Random keeps “panicking” with short timeouts.
Heterogeneous (+69.4%): When nodes have different performance, BALLAST adapts to slower nodes.
Stable conditions (~12%): Even in stable conditions there’s slight improvement — BALLAST quickly finds the optimal timeout.

Project Structure

src/
├── bandit.rs      # LinUCB implementation
├── network.rs     # Network simulation (6 scenarios)
├── raft.rs        # Simplified Raft nodes
├── strategy.rs    # Strategies (Random, BALLAST)
├── simulation.rs  # Simulation engine
└── main.rs        # CLI and comparison

Running

git clone https://github.com/mysma-9403/BALLAST-Simulator
cd BALLAST-Simulator
cargo run --release

Source code: https://github.com/mysma-9403/BALLAST-Simulator

📎 Links

Based on the publication 📄 2512.21165
Simulator implementation: https://github.com/mysma-9403/BALLAST-Simulator
