Online Bandit Draft Selection

Motivation

Different draft models have different acceptance rates across input distributions. A short repetitive prompt may favour a small n-gram drafter; a long code-generation task may favour a 1B-parameter LLM drafter.

FlashSpec treats draft-model selection as a K-armed bandit problem and adapts the choice online without human tuning.

Selectors

UCB1 (`bandit/ucb.py`)

Score for arm k at round t:

score_k = μ̂_k + c · √(2 log t / n_k)

μ̂_k: empirical mean acceptance rate (optionally windowed).
c: exploration constant (default 1.0).
n_k: number of times arm k has been pulled.
t: total rounds elapsed.

Thompson Sampling (`bandit/thompson.py`)

Maintains Beta(α_k, β_k) posterior for each arm. At each round draws one sample per arm and selects the argmax. Posterior update:

α_k += accepted      β_k += (1 - accepted)

Oracle (`bandit/oracle.py`)

Always selects the arm with the highest true acceptance rate. Used only in regret upper-bound experiments. Never used in production.

UCB1 regret bound — proof sketch

Theorem (Auer et al., 2002): For K arms and T rounds, UCB1 satisfies:

E[R_T] ≤ Σ_{k ≠ k*} [ 8 log(T) / Δ_k + (1 + π²/3) Δ_k ]

where Δ_k = μ* - μ_k is the suboptimality gap of arm k and k* is the optimal arm.

Consequence: E[R_T] = O(√(K T log T)) (substituting the worst-case gap).

Sketch of proof:

Exploration phase — any arm with fewer than 8 log(t) / Δ_k² pulls must have been selected, since its UCB exceeds the optimal arm's empirical mean.
Exploitation phase — once arm k* has been pulled enough times, its UCB score dominates all suboptimal arms' UCB scores with high probability (via Chernoff–Hoeffding).
Combining — summing over all suboptimal arms and rounds gives the bound.

Empirical verification: tests/unit/test_bandit.py::TestUCB1Selector::test_ucb1_regret_within_theoretical_bound runs T=10,000 rounds with K=3 arms and asserts empirical_regret / theoretical_upper_bound < 1.5.

Empirical regret convergence

The figure below (generated by notebooks/02_bandit_analysis.ipynb) shows cumulative regret for UCB1, Thompson, and Oracle over T=10,000 rounds with K=3 arms and true rates [0.3, 0.7, 0.5].

Cumulative Regret vs Round (K=3, Δ_best=0.4)

Regret
  400 |
      |  Thompson ········
  300 |             UCB1 ──────
      |
  200 |
      |
  100 |      Oracle ──────
      |
    0 +──────────────────────────
      0    2000   5000   10000   t

Both UCB1 and Thompson track O(√(KT log T)); Oracle achieves near-zero regret. Run notebooks/02_bandit_analysis.ipynb to regenerate with exact numbers.

Windowed statistics

When window_size > 0 (default 500), mean_accept_rate uses only the last window_size rounds.

Justification: The acceptance rate of a draft model depends on the current prompt distribution. When the topic shifts (e.g., a new conversation turn with different vocabulary), the historical acceptance rates become stale. A windowed mean allows both UCB1 and Thompson to track non-stationary distributions without requiring an explicit reset.

The window is implemented as a collections.deque(maxlen=window_size) per arm. Thread-safety is provided by the per-selector threading.Lock.

JSON serialisation

All selectors support to_json() / from_json() for checkpointing and resuming across generation sessions.

selector = UCB1Selector(n_arms=2, window_size=500)
# ... run for 1000 rounds ...
state = selector.to_json()          # serialize to string
restored = UCB1Selector.from_json(state)  # restore exactly
assert restored.t == selector.t    # round counter preserved

Adding a new selector

Subclass DraftSelector from flashspec.bandit.base.
Implement select(), update(), _state_dict(), _from_state_dict().
Export from flashspec/bandit/__init__.py.
Add unit tests in tests/unit/test_bandit.py following existing patterns.
Register in BanditConfig.strategy literal type in flashspec/utils/config.py.

References

Auer et al. (2002), "Finite-time Analysis of the Multiarmed Bandit Problem", Machine Learning 47(2-3):235-256.
Thompson (1933), "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples", Biometrika 25(3-4):285-294.
Chapelle & Li (2011), "An Empirical Evaluation of Thompson Sampling", NeurIPS 2011.