FlashSpec Benchmarks

Methodology

All benchmarks follow the immutable result schema defined in AGENTS.md §14. Once a result JSON is committed, the config that produced it is frozen.

Hardware

Field	Value
GPU	NVIDIA H100 SXM5
CUDA	12.4
Driver	550.x
Precision	bfloat16
Batch size	1

Warmup

50 warmup steps are discarded before measurement begins.

Datasets

MT-Bench: conversational multi-turn benchmark.
HumanEval: Python code generation.
Alpaca: instruction-following (subset 500).

Baselines

Method	Description
`vanilla_ar`	Standard autoregressive generation
`medusa`	Cai et al. (2024), multi-head draft heads
`eagle`	Li et al. (2024), feature-level drafting
`flashspec_ucb`	FlashSpec with UCB1 bandit
`flashspec_thompson`	FlashSpec with Thompson sampling

Running benchmarks

# Fast smoke-test (no GPU or weights required):
make bench-quick

# Full benchmark (requires GPU + model weights):
make bench

Results are written to benchmarks/results/ as JSON files.

Performance targets (Llama-3-8B-Instruct, γ=4)

Metric	Target
Speedup vs vanilla AR	≥ 2.0×
Regression threshold	< 10% vs prior commit
Acceptance rate α	≥ 0.65

Adding a new benchmark

Create a YAML config in benchmarks/configs/.
Do not modify existing configs after first commit.
Add the config name to the benchmark workflow in .github/workflows/benchmark.yml.