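Regression Verification and Test Paths

This section makes "which tests to run for which change" concrete, so that changes to the scheduler, cache, or parser no longer rely on an improvised, experience-based checklist.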

Test directory layout

The runtime tests this section relies on live under test/srt/:

test/
└── srt/                    # runtime tests
    ├── test_serving_chat.py        # OpenAI-compatible chat endpoint regression
    ├── test_serving_completions.py # text completion endpoint regression
    ├── test_radix_cache.py         # RadixCache unit tests
    ├── test_grammar_backend.py     # grammar constraint tests
    ├── test_chunked_prefill.py     # chunked prefill correctness
    ├── test_speculative_decoding.py # speculative decoding regression
    ├── test_torch_compile.py       # torch.compile integration
    └── test_bench_latency.py       # latency benchmark

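The tests in test/srt/ fall into three categories:

Interface regression (test_serving_*.py): send real HTTP requests and verify the response format and content;
Unit tests (test_radix_cache.py and similar): call internal classes directly and verify local logic;
Benchmarks (test_bench_latency.py and similar): measure performance metrics and make no assertions.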

Choosing tests by change type

Change: protocol layer (`entrypoints/openai/protocol.py` or `serving_*.py`)

Minimum verification:

# Verify the interface format and basic correctness
python -m pytest test/srt/test_serving_chat.py -x -v
python -m pytest test/srt/test_serving_completions.py -x -v

Additional verification (if the change affects response_format or structured output):

python -m pytest test/srt/test_grammar_backend.py -x -v

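No need to run: test_radix_cache.py or test_speculative_decoding.py. Protocol-layer changes do not touch the logic in either of those layers.

Change: scheduler admission logic (`managers/scheduler.py`)

This kind of change has the widest impact and the longest verification path.

Step 1: interface correctness (first confirm there is no obvious regression):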

python -m pytest test/srt/test_serving_chat.py -x -v -k "not bench"

Step 2: the chunked prefill state machine (admission changes are the most likely to break state carried across scheduling rounds):

python -m pytest test/srt/test_chunked_prefill.py -x -v

Step 3: stress scenario (run under high concurrency for a while and watch whether token_usage grows abnormally):

python -m sglang.bench_serving --backend sglang --dataset-name random \
    --num-prompts 200 --request-rate 20 --random-input-len 512 \
    --random-output-len 256 --port 30000
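# While the benchmark runs, watch sglang:token_usage in /metrics; it should normally fluctuate between 0.7 and 0.9
# If it climbs steadily to 1.0 and triggers an OOM, KV slot allocation is probably leaking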
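
Watching the gauge by eye is easy to get wrong during a long run, so the check above can be roughly automated. The following is a minimal sketch, not part of SGLang: `read_gauge`, `fetch_metrics`, and `looks_like_leak` are hypothetical helper names, and the `ceiling` and `min_run` thresholds are assumptions to tune. It only assumes that /metrics serves the standard Prometheus text format.

```python
import re
import urllib.request

# Matches one sample line of the Prometheus text format, for example:
#   sglang:token_usage{model_name="m"} 0.83
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def read_gauge(metrics_text, name):
    """Return the first sample value for metric `name`, or None if absent."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP / TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            return float(match.group(3))
    return None


def fetch_metrics(base_url="http://localhost:30000"):
    """Download the raw /metrics text (port 30000 matches the command above)."""
    with urllib.request.urlopen(base_url + "/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def looks_like_leak(samples, ceiling=0.99, min_run=5):
    """True if the last `min_run` samples all sit at or above `ceiling`.

    Both thresholds are illustrative assumptions, not SGLang defaults.
    """
    tail = samples[-min_run:]
    return len(tail) == min_run and all(v >= ceiling for v in tail)
```

A polling loop would call `read_gauge(fetch_metrics(), "sglang:token_usage")` every few seconds, append the result to a list, and stop the run once `looks_like_leak` returns True.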

Change: RadixCache (`mem_cache/radix_cache.py`)

Step 1: unit tests (the fastest way to localize problems in match/insert/evict):

python -m pytest test/srt/test_radix_cache.py -x -v

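Step 2: end-to-end effect of prefix reuse (confirm the cache hit logic is still correct):

# Manual check: send two requests that share a prefix; the second request's TTFT should be clearly lower than the first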
import openai
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="none")

long_prefix = "A very long shared prefix: " + "x " * 200

resp1 = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": long_prefix + "Summarize."}]
)
# The second request should get a prefix hit, so its TTFT should be clearly lower
resp2 = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": long_prefix + "Explain."}]
)
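
The snippet above sends both requests but does not measure anything. One way to get a TTFT number is to stream the response and time the first chunk that carries text. This is a sketch under assumptions: `time_to_first_token` is a hypothetical helper, and it expects chunks shaped like the OpenAI streaming chunks (`chunk.choices[0].delta.content`).

```python
import time


def time_to_first_token(open_stream):
    """Seconds from issuing the request to the first chunk with non-empty text.

    `open_stream` is a zero-argument callable that starts the streaming
    request and returns an iterable of chunks. Passing a callable lets the
    timer start before the request is sent.
    """
    start = time.perf_counter()
    for chunk in open_stream():
        choices = getattr(chunk, "choices", None) or []
        if choices and getattr(choices[0].delta, "content", None):
            return time.perf_counter() - start
    return None  # the stream ended without producing any text


# Usage with the client defined above (requires a running server):
# ttft = time_to_first_token(lambda: client.chat.completions.create(
#     model="default",
#     messages=[{"role": "user", "content": long_prefix + "Summarize."}],
#     stream=True,
# ))
```

Calling it once per prompt and comparing the two values gives the first-versus-second comparison the manual check describes; a single pair of measurements is noisy, so repeating the pair a few times is advisable.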

Step 3: use /metrics to confirm that sglang:cache_hit_ratio rises significantly after sending requests with a repeated prefix (it should exceed 0.8).

Change: grammar constraint (structured generation)

python -m pytest test/srt/test_grammar_backend.py -x -v

# If the change touches the JSON schema compilation cache
python -m pytest test/srt/test_grammar_backend.py -x -v -k "json_schema"

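When verifying manually, focus on three kinds of scenario:

An empty schema (only requires a JSON object): {"type": "object"}
A complex nested schema (three or more levels);
A schema containing an enum (finite-vocabulary constraints are the most likely to trigger bitmask errors).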
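
The three scenarios above can be written down as concrete fixtures. This is an illustrative sketch: the field names (`user`, `address`, `city`, `color`) and the helpers `object_depth` and `response_format_for` are made up for the example, and the payload shape follows the OpenAI-style `response_format` with `type: "json_schema"`, which is assumed here to be what the server under test accepts.

```python
import json

# Scenario 1: empty schema, only requires a JSON object.
EMPTY_OBJECT = {"type": "object"}

# Scenario 2: three levels of nested objects.
NESTED = {
    "type": "object",
    "properties": {
        "user": {
            "type": "object",
            "properties": {
                "address": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                }
            },
            "required": ["address"],
        }
    },
    "required": ["user"],
}

# Scenario 3: enum-constrained string field.
WITH_ENUM = {
    "type": "object",
    "properties": {"color": {"type": "string", "enum": ["red", "green", "blue"]}},
    "required": ["color"],
}


def object_depth(schema):
    """Count nested object levels; a top-level object counts as 1."""
    if schema.get("type") != "object":
        return 0
    children = schema.get("properties", {}).values()
    return 1 + max((object_depth(child) for child in children), default=0)


def response_format_for(name, schema):
    """Build an OpenAI-style response_format payload for a JSON schema."""
    return {"type": "json_schema", "json_schema": {"name": name, "schema": schema}}


if __name__ == "__main__":
    print(json.dumps(response_format_for("nested", NESTED), indent=2))
```

`object_depth` is only a guard that the nested fixture really reaches three levels; the payload from `response_format_for` would be passed as `response_format=` in a chat completion request, and the returned text checked with `json.loads`.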

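Change: speculative decoding (eagle/medusa code under `model_executor/`)

Core check: confirm that output under TP=1 matches output without speculative decoding (rejection sampling is supposed to preserve the output distribution):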

python -m pytest test/srt/test_speculative_decoding.py -x -v

Additional check for TP > 1 (the combination of speculative decoding and TP is the most likely to produce inconsistent output):

python -m pytest test/srt/test_speculative_decoding.py -x -v -k "tp"

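Change: performance optimization (semantics unchanged, speed only)

A performance optimization must not be validated by benchmarks alone; run the correctness tests first:

# First confirm there is no semantic regression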
python -m pytest test/srt/test_serving_chat.py -x -v -k "not bench"

# Then run the benchmark and compare before and after the optimization
python -m sglang.bench_latency --model-path <model> --batch-size 8 --input-len 512 --output-len 256

If the benchmark numbers improve but the interface tests fail, the optimization broke correctness and must not be merged.
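
The "correctness first, benchmark second" rule can be enforced by a small wrapper instead of by discipline. A minimal sketch, assuming both steps are ordinary commands whose exit status signals success; `run_gated` is a hypothetical helper, and the benchmark command is left to the caller because the model path is not specified here.

```python
import subprocess
import sys

# The correctness command from above, run with the current interpreter.
CORRECTNESS_CMD = [
    sys.executable, "-m", "pytest",
    "test/srt/test_serving_chat.py", "-x", "-v", "-k", "not bench",
]


def run_gated(correctness_cmd, benchmark_cmd, run=subprocess.run):
    """Run the benchmark only if the correctness command exits with 0.

    `run` is injectable so the gate can be exercised without starting
    real processes.
    """
    if run(correctness_cmd).returncode != 0:
        return "correctness-failed"  # benchmark is never started
    if run(benchmark_cmd).returncode != 0:
        return "benchmark-failed"
    return "ok"
```

Because the benchmark is skipped entirely when correctness fails, there is no improved latency number to be tempted by.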

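How the three test categories relate

Interface regression (test_serving_*.py)
    ↑ look here first when something fails
Functional units (test_radix_cache.py, test_grammar_backend.py, and similar)
    ↑ go down to this layer to localize a specific problem
Performance benchmarks (test_bench_latency.py)
    ↑ look here only after correctness has passed

The value of this order is that it proves "the semantics are intact" before proving "performance has not regressed". In the reverse order, benchmark numbers can look normal while correctness is broken, which creates a false sense of safety.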

Quick reference: what to run for this change

Change location	Must run	Run as needed
`entrypoints/openai/`	test_serving_chat.py	test_grammar_backend.py
`managers/scheduler.py`	test_serving_chat.py, test_chunked_prefill.py	bench_serving (stress)
`mem_cache/radix_cache.py`	test_radix_cache.py	manual cache_hit_ratio check
grammar / structured generation	test_grammar_backend.py	test_serving_chat.py
speculative decoding	test_speculative_decoding.py	TP > 1 specific tests
pure performance optimization	test_serving_chat.py	bench_latency

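Summary

The key question for a test path is not "did I run tests" but "do the tests I ran cover the layers this change actually affects".

Protocol-layer change: run interface regression; the cache unit tests are not needed;
Scheduler change: run interface regression plus chunked prefill, then do stress verification;
Cache change: run the unit tests first, then verify the end-to-end effect through metrics;
Any change: correctness first, then benchmarks, never the other way around.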