Imagine having to choose the best ad variant, but each time you only learn how many users clicked on the one you showed. This is the essence of bandit learning: it balances exploration (trying out new options) with exploitation (using the current best) to discover the winner as quickly as possible. In a world where every experiment has a cost—from ad budgets to a patient’s time in experimental therapy—bandit algorithms can significantly accelerate optimal decision-making. Yet, despite their practical power, these solutions are surprisingly hard to analyze theoretically!
This article is based on the publication arXiv:2506.14746 (PDF).
🎯 A Seemingly Simple Problem…
Suppose you have a reward function $ f(x) $ that assigns a value to each action $ x $ — for example, which ad to show, which therapy to administer, or which page layout to serve. Your goal is not to learn the entire function $ f $, but to find the best action:
$$ x^* = \arg\max_{x \in \mathcal{X}} f(x) $$
If you know that $ f $ belongs to a known function class $ \mathcal{F} $, this should be easy—right?
The authors of the paper show that this intuition is misleading.
🧠 What Is Bandit Learning?
Bandit learning is exploration-based learning:
- You can query $ f(x) $, but only for a single chosen $ x $ at a time.
- You do not have full access to the function.
- You aim to find $ x^* $ by exploring different actions; a minimal sketch of this query-only loop follows right after this list.
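To make the interaction model concrete, here is a minimal Rust sketch of that query-only protocol. The `RewardOracle` trait, the noiseless `f64` reward, and the naive budgeted sweep are my own simplifying assumptions for illustration, not the paper's formal model:

// Hypothetical interface: the learner may only query f at one chosen action at a time.
trait RewardOracle {
    fn query(&self, x: usize) -> f64; // "pull arm x" and observe only f(x)
}

// Naive exploration within a fixed query budget: try actions and remember the best seen.
// Assumes `actions` is non-empty; real bandit algorithms try to beat this blind sweep.
fn explore<O: RewardOracle>(oracle: &O, actions: &[usize], budget: usize) -> usize {
    let mut best = (actions[0], f64::NEG_INFINITY);
    for &x in actions.iter().take(budget) {
        let reward = oracle.query(x);
        if reward > best.1 {
            best = (x, reward);
        }
    }
    best.0 // best action observed so far, not necessarily x*
}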
Examples:
- Choosing the best ad variant (A/B/n testing),
- Personalizing a user interface,
- Recommending products.
💥 What Does This Paper Contribute?
The authors reveal two fundamentally new insights:
❗ 1. No Dimensional Measure of Difficulty
In supervised learning, we have measures like the VC dimension that tell us whether a function class is learnable.
In bandit learning, no analogous measure exists, not even for finite function classes!
❗ 2. “Easy” Function Classes Can Be Unusable
The authors construct a class of functions $ \mathcal{F} $ for which:
- Information-theoretically easy: only 2 queries are needed to find the optimal $ x^* $.
- Computationally hard: no polynomial-time algorithm can do it, unless $ RP = NP $.
Moreover, even though empirical risk minimization (ERM) can efficiently handle this class, any bandit-based approach fails due to the computational complexity of selecting queries.
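To see the contrast concretely, here is a generic sketch of why ERM over a finite class is computationally cheap: given already-collected samples $ (x, f(x)) $, it is just a scan over the candidates. This is my own illustrative code for a generic finite class, not the paper's hard construction, and it deliberately says nothing about how to choose the next query, which is exactly where the hardness lies:

// Generic ERM over a finite candidate class (illustrative assumption, not the paper's class).
fn erm(
    candidates: &[Box<dyn Fn(&[bool]) -> f64>], // the known class F
    samples: &[(Vec<bool>, f64)],               // observed (x, f(x)) pairs
) -> usize {
    let mut best = (0, f64::INFINITY);
    for (i, h) in candidates.iter().enumerate() {
        // Empirical squared error of candidate h on the collected samples.
        let err: f64 = samples
            .iter()
            .map(|(x, y)| (h(x.as_slice()) - *y).powi(2))
            .sum();
        if err < best.1 {
            best = (i, err);
        }
    }
    best.0 // index of the empirical risk minimizer in F
}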
🔍 Intuition
Imagine a page configured by 100 binary switches, so each action is a vector $ x \in \{0, 1\}^{100} $, and you want the configuration that yields the highest reward. You know the reward function is a conjunction such as $ f(x) = x_3 \land x_5 $, but you don’t know which indices are ‘active.’ You can only test a few configurations.
Bandit learning requires not just knowing the class structure but also clever query selection, which can be computationally infeasible.
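Here is a minimal sketch of that intuition, under my own assumptions (a noiseless reward that is a conjunction of exactly two hidden coordinates; all names are invented for illustration). Even with full knowledge of the class, a naive query plan probes up to $ n(n-1)/2 $ candidate pairs; smarter plans exist for this toy class, but the paper shows that for carefully constructed classes no efficient query-selection strategy exists unless $ RP = NP $:

// Hidden reward: a conjunction of two unknown coordinates i and j (illustrative assumption).
struct HiddenConjunction {
    i: usize,
    j: usize,
}

impl HiddenConjunction {
    // "Pull the arm" x and observe only f(x).
    fn query(&self, x: &[bool]) -> bool {
        x[self.i] && x[self.j]
    }
}

// Naive identification: test every candidate pair by setting exactly those two bits.
// For n = 100 this needs up to 4,950 queries, even though the class itself is tiny.
fn identify_naive(oracle: &HiddenConjunction, n: usize) -> Option<(usize, usize)> {
    for a in 0..n {
        for b in (a + 1)..n {
            let mut x = vec![false; n];
            x[a] = true;
            x[b] = true;
            if oracle.query(&x) {
                return Some((a, b)); // reward appears only when {a, b} is the active pair
            }
        }
    }
    None
}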
🧪 Practical Rust Example: ε-Greedy A/B/n
Below is a concise snippet showing the core ε-Greedy logic:
use rand::{random, Rng}; // needed for random::<f64>() and gen_range below

// ε: 10% chance to explore randomly
let epsilon = 0.1;
let trials = 10_000;
// ... setup of `variants` (the arms, each tracking its observed CTR) elided; see the repo below
for _ in 0..trials {
// 1. DECISION: EXPLORE OR EXPLOIT?
let choose_random = random::<f64>() < epsilon;
let idx = if choose_random {
// 2. EXPLORE: pick a random variant
rand::thread_rng().gen_range(0..variants.len())
} else {
// 3. EXPLOIT: pick the variant with the highest observed CTR
variants
.iter()
.enumerate()
.max_by(|a, b| a.1.ctr().partial_cmp(&b.1.ctr()).unwrap())
.map(|(i, _)| i)
.unwrap()
};
// 4. SIMULATE: "pull the lever" and update stats
let _ = variants[idx].simulate_click();
}
Full code is available in the repo: AiDaily-1
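The snippet above relies on a `Variant` type that lives in that repo. For readers who want to run the loop standalone, here is a stand-in I sketched with an assumed Bernoulli click model; the field names and behavior are my assumptions, not necessarily the repo's actual code:

use rand::random;

// Minimal stand-in for the repo's Variant type (names and click model assumed).
struct Variant {
    true_ctr: f64, // hidden click probability, used only to simulate feedback
    shows: u64,
    clicks: u64,
}

impl Variant {
    // Observed click-through rate so far.
    fn ctr(&self) -> f64 {
        if self.shows == 0 { 0.0 } else { self.clicks as f64 / self.shows as f64 }
    }

    // Show this variant once and observe whether it was clicked.
    fn simulate_click(&mut self) -> bool {
        self.shows += 1;
        let clicked = random::<f64>() < self.true_ctr;
        if clicked {
            self.clicks += 1;
        }
        clicked
    }
}

With this stand-in, `variants` would be a mutable `Vec<Variant>`, e.g. one entry per ad variant with `shows` and `clicks` starting at zero.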
Although simple, this algorithm can fall into an exploration trap: it never guarantees finding $ x^* $ within a limited number of trials, especially when the structure of $ f $ is hidden in hard-to-find patterns.
What’s more, even if the algorithm has correctly determined after 5,000 trials that variant B is better, over the next 5,000 trials it will still devote 10% of them (around 500 trials) to random exploration, including knowingly showing the worse variant A; a decayed-ε variant that softens this is sketched below.
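A common mitigation, sketched here as my own addition rather than something from the paper or the repo, is to anneal ε so that exploration shrinks once enough evidence has accumulated:

// Annealed ε-greedy: exploration probability decays with the trial index t.
// The 1 / (1 + t * decay) schedule is a common heuristic, chosen here as an assumption.
fn epsilon_at(t: u64, epsilon0: f64, decay: f64) -> f64 {
    epsilon0 / (1.0 + t as f64 * decay)
}

// Inside the loop above, the fixed epsilon would be replaced with something like:
// let choose_random = random::<f64>() < epsilon_at(t, 0.1, 0.001);

With epsilon0 = 0.1 and decay = 0.001, exploration falls from 10% at the start to roughly 1.7% after 5,000 trials, while the exploit branch stays unchanged. This softens the wasted-trials problem but still gives no worst-case guarantee of finding $ x^* $ in a limited number of trials.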
✅ Conclusions
- Knowing the function class $ \mathcal{F} $ alone is not enough.
- We cannot describe bandit difficulty in the same way as classification.
- We need a **new theory of bandit learnability**.
📎 Links
- Based on the publication: 📄 arXiv:2506.14746 (PDF)