Online Bandit Draft Selection
Motivation
Different draft models have different acceptance rates across input distributions. A short repetitive prompt may favour a small n-gram drafter; a long code-generation task may favour a 1B-parameter LLM drafter.
FlashSpec treats draft-model selection as a K-armed bandit problem and adapts the choice online without human tuning.
Selectors
UCB1 (bandit/ucb.py)
Score for arm k at round t:
μ̂_k: empirical mean acceptance rate (optionally windowed).c: exploration constant (default 1.0).n_k: number of times arm k has been pulled.t: total rounds elapsed.
Thompson Sampling (bandit/thompson.py)
Maintains Beta(α_k, β_k) posterior for each arm. At each round draws one sample per arm and selects the argmax. Posterior update:
Oracle (bandit/oracle.py)
Always selects the arm with the highest true acceptance rate. Used only in regret upper-bound experiments. Never used in production.
UCB1 regret bound — proof sketch
Theorem (Auer et al., 2002): For K arms and T rounds, UCB1 satisfies:
where Δ_k = μ* - μ_k is the suboptimality gap of arm k and k* is the
optimal arm.
Consequence: E[R_T] = O(√(K T log T)) (substituting the worst-case gap).
Sketch of proof:
- Exploration phase — any arm with fewer than
8 log(t) / Δ_k²pulls must have been selected, since its UCB exceeds the optimal arm's empirical mean. - Exploitation phase — once arm k* has been pulled enough times, its UCB score dominates all suboptimal arms' UCB scores with high probability (via Chernoff–Hoeffding).
- Combining — summing over all suboptimal arms and rounds gives the bound.
Empirical verification: tests/unit/test_bandit.py::TestUCB1Selector::test_ucb1_regret_within_theoretical_bound
runs T=10,000 rounds with K=3 arms and asserts empirical_regret / theoretical_upper_bound < 1.5.
Empirical regret convergence
The figure below (generated by notebooks/02_bandit_analysis.ipynb) shows
cumulative regret for UCB1, Thompson, and Oracle over T=10,000 rounds with
K=3 arms and true rates [0.3, 0.7, 0.5].
Cumulative Regret vs Round (K=3, Δ_best=0.4)
Regret
400 |
| Thompson ········
300 | UCB1 ──────
|
200 |
|
100 | Oracle ──────
|
0 +──────────────────────────
0 2000 5000 10000 t
Both UCB1 and Thompson track O(√(KT log T)); Oracle achieves near-zero regret.
Run notebooks/02_bandit_analysis.ipynb to regenerate with exact numbers.
Windowed statistics
When window_size > 0 (default 500), mean_accept_rate uses only the
last window_size rounds.
Justification: The acceptance rate of a draft model depends on the current prompt distribution. When the topic shifts (e.g., a new conversation turn with different vocabulary), the historical acceptance rates become stale. A windowed mean allows both UCB1 and Thompson to track non-stationary distributions without requiring an explicit reset.
The window is implemented as a collections.deque(maxlen=window_size) per arm.
Thread-safety is provided by the per-selector threading.Lock.
JSON serialisation
All selectors support to_json() / from_json() for checkpointing and resuming
across generation sessions.
selector = UCB1Selector(n_arms=2, window_size=500)
# ... run for 1000 rounds ...
state = selector.to_json() # serialize to string
restored = UCB1Selector.from_json(state) # restore exactly
assert restored.t == selector.t # round counter preserved
Adding a new selector
- Subclass
DraftSelectorfromflashspec.bandit.base. - Implement
select(),update(),_state_dict(),_from_state_dict(). - Export from
flashspec/bandit/__init__.py. - Add unit tests in
tests/unit/test_bandit.pyfollowing existing patterns. - Register in
BanditConfig.strategyliteral type inflashspec/utils/config.py.
References
- Auer et al. (2002), "Finite-time Analysis of the Multiarmed Bandit Problem", Machine Learning 47(2-3):235-256.
- Thompson (1933), "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples", Biometrika 25(3-4):285-294.
- Chapelle & Li (2011), "An Empirical Evaluation of Thompson Sampling", NeurIPS 2011.