Bandit Learning


Imagine having to choose the best ad variant, but each time you only learn how many users clicked on the one you showed. This is the essence of bandit learning: it balances exploration (trying out new options) with exploitation (using the current best) to discover the winner as quickly as possible. In a world where every experiment has a cost—from ad budgets to a patient’s time in experimental therapy—bandit algorithms can significantly accelerate optimal decision-making. Yet, despite their practical power, these solutions are surprisingly hard to analyze theoretically!

This article is based on the publication arXiv:2506.14746 (PDF).

🎯 A Seemingly Simple Problem…

Suppose you have a reward function $ f(x) $ that assigns a value to each action $ x $, for example an ad impression, therapy outcome, or website conversion. Your goal is not to learn the entire function $ f $, but to find the best action:

$$ x^* = \arg\max_{x \in \mathcal{X}} f(x) $$

If you know that $ f $ belongs to a known function class $ \mathcal{F} $, this should be easy—right?

The authors of the paper show that this intuition is misleading.


🧠 What Is Bandit Learning?

Bandit learning is exploration-based learning (a minimal Rust sketch of the interaction model follows this list):

  • You can query $ f(x) $, but only for a single chosen $ x $ at a time.
  • You do not have full access to the function.
  • You aim to find $ x^* $ by exploring different actions.
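The interaction model can be sketched in a few lines of Rust. The trait and toy environment below use illustrative names of my own; they are not an API from the paper or any library:

```rust
// A minimal sketch of the bandit interaction model (hypothetical, illustrative names).
trait BanditEnvironment {
    type Action;
    /// Reveal f(x) for a single chosen action x; the learner never sees the whole function.
    fn query(&self, x: &Self::Action) -> f64;
}

// A toy environment with three actions and fixed rewards (unknown to the learner).
struct ThreeArms;

impl BanditEnvironment for ThreeArms {
    type Action = usize;
    fn query(&self, x: &usize) -> f64 {
        [0.2, 0.5, 0.1][*x]
    }
}

fn main() {
    let env = ThreeArms;
    // A very naive learner: query every action once and keep the best one seen.
    let best = (0..3usize)
        .max_by(|a, b| env.query(a).partial_cmp(&env.query(b)).unwrap())
        .unwrap();
    println!("best action found: {}", best);
}
```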

Examples:

  • Choosing the best ad variant (A/B/n testing),
  • Personalizing a user interface,
  • Recommending products.

💥 What Does This Paper Contribute?

The authors reveal two fundamentally new insights:

❗ 1. No Dimensional Measure of Difficulty

In supervised learning, we have concepts like the VC dimension, which tells us whether a function class is learnable.

In bandit learning, no analogous measure exists, not even for finite function classes!


❗ 2. “Easy” Function Classes Can Be Unusable

The authors construct a class of functions $ \mathcal{F} $ that is:

  • Information-theoretically easy: only 2 queries are needed to find the optimal $ x^* $.
  • Computationally hard: no polynomial-time algorithm can do it, unless $ RP = NP $.

Moreover, even though empirical risk minimization (ERM) can efficiently handle this class, any bandit-based approach fails due to the computational complexity of selecting queries.

🔍 Intuition

Imagine button variants described by 100 binary features, and you want the one that yields the highest reward. You know the reward function is a conjunction of two features, for example $ f(x) = x_3 \land x_5 $, but you don’t know which indices are ‘active.’ You can only test a few options.

Bandit learning requires not just knowing the class structure but also clever query selection, which can be computationally infeasible.
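To make this concrete, here is a tiny self-contained sketch (my own illustration, not code from the paper): the reward is a conjunction of two hidden coordinates of a 100-bit action, and a learner that can only query $ f $ must uncover which coordinates matter through the queries it chooses.

```rust
// Toy illustration of the intuition above: the reward is a conjunction of two
// hidden coordinates of a 100-bit action (indices 3 and 5 are an arbitrary choice here).
fn hidden_reward(x: &[bool; 100]) -> f64 {
    if x[3] && x[5] { 1.0 } else { 0.0 }
}

fn main() {
    // Querying the all-ones action shows that some combination of features pays off...
    let all_on = [true; 100];
    println!("reward(all ones) = {}", hidden_reward(&all_on));

    // ...but pinning down *which* coordinates are active takes further, carefully
    // chosen queries; blindly trying actions out of 2^100 possibilities is hopeless.
    let mut probe = [true; 100];
    probe[3] = false; // switch one feature off and check whether the reward drops
    println!("reward(feature 3 off) = {}", hidden_reward(&probe));
}
```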

🧪 Practical Rust Example: ε-Greedy A/B/n

Below is a concise snippet showing the core ε-Greedy logic:


```rust
// ε: 10% chance to explore randomly
let epsilon = 0.1;
let trials = 10_000;

// ... (setup of `variants` elided; each variant exposes `ctr()` and `simulate_click()`)

for _ in 0..trials {
    // 1. DECISION: EXPLORE OR EXPLOIT?
    // Requires the `rand` crate, with `use rand::Rng;` in scope for `gen_range`.
    let choose_random = rand::random::<f64>() < epsilon;

    let idx = if choose_random {
        // 2. EXPLORE: pick a random variant
        rand::thread_rng().gen_range(0..variants.len())
    } else {
        // 3. EXPLOIT: pick the variant with the highest observed CTR
        variants
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.ctr().partial_cmp(&b.1.ctr()).unwrap())
            .map(|(i, _)| i)
            .unwrap()
    };

    // 4. SIMULATE: "pull the lever" and update stats
    let _ = variants[idx].simulate_click();
}
```

Full code is available in the repo: AiDaily-1

Although simple, this algorithm can fall into the exploration trap: it never guarantees finding $ x^* $ within a limited number of trials, especially when the definition of $ f $ hides in hard-to-find patterns.

What’s more, even if the algorithm has correctly determined after 5,000 trials that variant B is better, over the next 5,000 trials it will still devote 10% of them (i.e., 500 trials) to random exploration, including deliberately choosing the worse variant A.
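A quick back-of-the-envelope check of that cost (my own sketch, assuming two variants and a fixed ε of 0.1):

```rust
fn main() {
    let epsilon = 0.1;
    let remaining_trials = 5_000.0_f64;
    let n_variants = 2.0_f64;

    // Expected trials spent on random exploration after the winner is already known...
    let explored = epsilon * remaining_trials; // 500
    // ...of which, on average, this many land on the known-worse variant A.
    let wasted_on_a = explored / n_variants; // 250
    println!("random exploration: {explored}, expected pulls of A: {wasted_on_a}");
}
```

With uniform random exploration over two variants, roughly half of those 500 exploratory pulls go to the variant already known to be worse.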


✅ Conclusions

  • Knowing the function class $ \mathcal{F} $ alone is not enough.
  • We cannot describe bandit difficulty the same way we describe classification difficulty.
  • We need a **new theory of bandit learnability**.