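Enabling Qwen3.6's preserve_thinking in vLLM: A Two-Node A/B Validation
TL;DR
- Qwen3.6's preserve_thinking is a Jinja variable defined in the model's chat template, not a vLLM CLI flag.
- The correct way to enable it: --default-chat-template-kwargs '{"preserve_thinking": true}'.
- On a two-node DGX Spark cluster, one node ran with it on and the other with it off as an A/B comparison. For the same input, prompt tokens were 55 vs 51 (the 4-token difference is the preserved historical <think> block) and completion tokens were 355 vs 382 (with it off, the model "thinks" everything through again).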
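Background
A Reddit post mentioned that Qwen3.6 "ships preserve_thinking flag" alongside a KV cache fix (link to the original post). The cluster happens to run Qwen3.6-35B-A3B-FP8 (reasoning parser qwen3), so I wanted to turn this switch on for both DGX Spark nodes.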
Where the parameter lives: the chat template, not the vLLM CLI
"Flag" is a misleading word here. Qwen3.6's chat_template.jinja contains:
{%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
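In other words:
- preserve_thinking is a Jinja variable in the template. When it is undefined, which is the default, the current template behaves as if it were False.
- When disabled, the chat template keeps a historical <think> block only for assistant messages that come after the last user query (loop.index0 > ns.last_query_index); reasoning from earlier turns is stripped.
- When enabled, the reasoning chains of all historical turns stay in the context.
Neither --reasoning-parser nor --reasoning-config in vLLM controls this: they only handle parsing the output, whereas preserve_thinking affects how the input prompt is built. The correct way to enable it is to inject it into the template as a kwarg: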
vllm serve ... \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"preserve_thinking": true}'
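To make the template condition quoted above concrete, here is a minimal pure-Python sketch of it. It is a simplified stand-in, not the real chat template: the helper names, the shortened messages, and the exact layout of the <think> wrapper are assumptions for illustration.

```python
def last_query_index(messages):
    """Index of the last user message (simplified; the real template may
    also skip tool-response messages when locating the last query)."""
    last = -1
    for i, message in enumerate(messages):
        if message["role"] == "user":
            last = i
    return last


def render_assistant(message, index, last_query, preserve_thinking=None):
    """Apply the quoted condition to one assistant message."""
    # Jinja's "is true" test only matches the boolean True, so None
    # (undefined) and other truthy values such as 1 do not enable it.
    keep = (preserve_thinking is True) or (index > last_query)
    reasoning = message.get("reasoning_content", "")
    if keep and reasoning:
        # The exact <think> wrapper layout is an assumption for illustration.
        return "<think>\n" + reasoning + "\n</think>\n\n" + message["content"]
    return message["content"]


# Hypothetical, shortened conversation used only for this sketch.
messages = [
    {"role": "user", "content": "What is 17 * 23?"},
    {"role": "assistant", "reasoning_content": "17*23 = 391.", "content": "391"},
    {"role": "user", "content": "Now add 9 to that."},
]
last = last_query_index(messages)
off = render_assistant(messages[1], 1, last)
on = render_assistant(messages[1], 1, last, preserve_thinking=True)
print(repr(off))
print(repr(on))
```

Because the assistant message sits before the last user query, its reasoning survives only when preserve_thinking is the boolean true, which is why the JSON passed to --default-chat-template-kwargs uses true rather than a string.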
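A/B validation: using the two nodes as each other's control
The cluster already runs one vLLM instance on each of two DGX Spark nodes (with a Bifrost gateway in front for routing), which makes it a natural fit for a controlled comparison:
- Server 1 (treatment): preserve_thinking=1
- Server 2 (control): preserve_thinking="" (disabled)
With Ansible, --limit redeploys only server 2, and changing a single extra var is enough; the Makefile and playbook stay untouched: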
uv run ansible-playbook -i inventory.ini playbooks/vllm-model-deploy.yml \
--limit 100.67.164.92 \
-e "vllm_preserve_thinking=" \
...
The test messages form a three-message conversation designed to depend visibly on the earlier reasoning:
{
"messages": [
{"role": "user", "content": "What is 17 * 23?"},
{"role": "assistant",
"reasoning_content": "17*23 = 17*(20+3) = 340 + 51 = 391.",
"content": "391"},
{"role": "user",
"content": "Now add 9 to that, and briefly explain which prior step you are building on."}
],
"max_tokens": 512,
"temperature": 0.0,
"seed": 42
}
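Each message is deliberately designed to amplify the observable effect of preserve_thinking:
- Message 1 (user), "What is 17 * 23?": gives the model an arithmetic problem that is simple but has a non-trivial reasoning path. 17×23 was chosen over 7×8 because it needs a step-by-step expansion (17×(20+3)), which naturally leaves recognizable intermediate steps in the reasoning. The historical <think> block therefore has real content yet is short enough for the token difference to be visible at a glance.
- Message 2 (assistant), with reasoning_content + content: this is a manually injected historical reasoning chain that simulates "the model has just answered the first turn". The reasoning has to go explicitly into the reasoning_content field for the chat template to render it as a <think>...</think> block; content: "391" is the final answer. What preserve_thinking controls here is whether this <think> segment is sent back to the model with the next request.
- Message 3 (user), "Now add 9 to that, and briefly explain which prior step you are building on.": this sentence is the key to the whole test and hides two triggers:
  - "add 9 to that": "that" is a pronoun pointing at 391, the result of the previous assistant turn. With preserve_thinking off, the model sees only a bare content: "391" with no record of how 391 was derived, and has to reconstruct the reasoning; with it on, the historical <think> is in the context and can be reused directly.
  - "briefly explain which prior step you are building on": forces the model to refer back to the historical reasoning. If the template stripped the historical <think>, the model has to re-derive "17×23=391" in its new reasoning before it can "explain" anything. In my judgment, these extra tokens are one major source of group B's completion being 27 tokens longer than group A's.
Fixing sampling with temperature=0.0 and seed=42 reduces the influence of randomness, so differences between the two nodes are easier to attribute to preserve_thinking itself.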
The same payload was sent to port :30000 on both nodes at the same time:
| Metric | A: preserve_thinking=1 | B: preserve_thinking=0 |
|---|---|---|
| prompt_tokens | 55 | 51 |
| completion_tokens | 355 | 382 |
| Final answer | ✅ 400 | ✅ 400 |
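There are two differences:
- The prompt is 4 tokens longer with the option on, which is exactly the <think>...</think> segment preserved from the previous turn. With the option off, the chat template strips it.
- With the option off, the completion is 27 tokens longer. The model cannot see its earlier reasoning and has only the bare 391 in the content field to infer where the number came from, so the reasoning section runs through a similar verification pass again.
The problem is simple enough that both sides answer correctly. Judging from this one simple example, if the historical reasoning chain were longer and the problem more complex, A might save even more completion tokens, and B might be more prone to forgetting or recomputation.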
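The measurement above could be reproduced with a small script along the following lines. This is a sketch under assumptions: it uses the OpenAI-compatible /v1/chat/completions route, and fetch_usage, usage_delta, the base URLs, and the served model name passed in by the caller are all hypothetical names that do not come from the original setup.

```python
import json
import urllib.request

# Same test payload as in the article; the "model" field is added per call.
PAYLOAD = {
    "messages": [
        {"role": "user", "content": "What is 17 * 23?"},
        {"role": "assistant",
         "reasoning_content": "17*23 = 17*(20+3) = 340 + 51 = 391.",
         "content": "391"},
        {"role": "user",
         "content": "Now add 9 to that, and briefly explain which prior "
                    "step you are building on."},
    ],
    "max_tokens": 512,
    "temperature": 0.0,
    "seed": 42,
}


def fetch_usage(base_url, model, payload=PAYLOAD, timeout=120):
    """POST the payload to one server and return its 'usage' dict."""
    body = dict(payload, model=model)
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["usage"]


def usage_delta(usage_a, usage_b):
    """Token-count difference A minus B for the two reported metrics."""
    keys = ("prompt_tokens", "completion_tokens")
    return {key: usage_a[key] - usage_b[key] for key in keys}
```

Calling fetch_usage once per node on port 30000 and passing both results to usage_delta gives the comparison in one step; with the numbers from the table, the delta is +4 prompt tokens and -27 completion tokens for A relative to B.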
Comparing the actual outputs
Group A (preserve_thinking=1), reasoning:
Here's a thinking process:
1. **Analyze User Input:**
- User asks: "Now add 9 to that, and briefly explain which prior step you are building on."
- "That" refers to the result of the previous calculation: 17 * 23 = 391.
- So, I need to calculate 391 + 9.
- I also need to briefly explain which prior step I'm building on.
2. **Perform Calculation:**
- 391 + 9 = 400.
3. **Identify Prior Step:**
- The prior step was calculating the product of 17 and 23, which equals 391.
- I'm building directly on that result.
4. **Formulate Response:**
- State the new result: 400.
- Briefly explain the connection: This builds directly on the previous multiplication step (17 × 23 = 391), to which I added 9.
- Keep it concise as requested.
5. **Self-Correction/Verification:**
- Calculation: 391 + 9 = 400. Correct.
- Explanation: Clearly references the prior multiplication step. Matches user request.
- Tone/Format: Direct and brief.
All good. Proceed.
Output matches the response.✅
Group A, content:
400. This builds directly on the previous step where I calculated 17 × 23 = 391, to which I then added 9.
Group B (preserve_thinking=0), reasoning:
Here's a thinking process:
1. **Analyze User Input:**
- User asks: "Now add 9 to that, and briefly explain which prior step you are building on."
- "That" refers to the previous result: 391 (from 17 * 23)
- Need to calculate: 391 + 9
- Need to briefly explain which prior step I'm building on.
2. **Calculate:**
- 391 + 9 = 400
3. **Identify Prior Step:**
- The prior step was calculating 17 * 23 = 391
- I'm building directly on that multiplication result.
4. **Formulate Response:**
- State the new result: 400
- Briefly explain the connection: I'm adding 9 to the result of the previous multiplication (17 × 23 = 391).
- Keep it concise as requested.
5. **Draft Response:**
400. I'm building directly on the previous step, where I calculated 17 × 23 = 391, and simply added 9 to that result.
6. **Self-Correction/Verification:**
- Math check: 391 + 9 = 400 ✓
- Explanation matches request: "briefly explain which prior step you are building on" ✓
- Tone/length: Concise ✓
- Ready.✅
Group B, content:
400. I'm building directly on the previous step, where I calculated 17 × 23 = 391, and simply added 9 to that result.
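Reading the two outputs side by side reveals a few details:
- Group A's reasoning follows a 5-step structure (Analyze → Perform → Identify → Formulate → Self-Correction). Group B, without the previous turn's reasoning chain to fall back on, adds a step 5, "Draft Response": it drafts the answer first and then verifies it. This is most likely the direct source of the +27 completion tokens.
- In "Identify Prior Step", group A writes "calculating the product of 17 and 23" (restating the type of operation), while group B writes "calculating 17 * 23 = 391" (putting the full expression back). Because B cannot see the historical <think>, it has to restate the intermediate result explicitly in its new reasoning.
- The final content is equivalent in meaning on both sides, with slight differences in wording ("to which I then added 9" vs "and simply added 9 to that result"); this is wording variation at the same level of quality.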
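The structural difference between the two reasoning traces can also be checked mechanically. The sketch below pulls out the numbered step titles with a regular expression; the two strings contain only the heading lines copied from the outputs above, and step_titles is a hypothetical helper name.

```python
import re

# Matches numbered headings such as "1. **Analyze User Input:**".
STEP_RE = re.compile(r"^\d+\.\s+\*\*(.+?):\*\*", re.MULTILINE)

# Heading lines only, copied from the two reasoning outputs above.
A_HEADINGS = """\
1. **Analyze User Input:**
2. **Perform Calculation:**
3. **Identify Prior Step:**
4. **Formulate Response:**
5. **Self-Correction/Verification:**
"""

B_HEADINGS = """\
1. **Analyze User Input:**
2. **Calculate:**
3. **Identify Prior Step:**
4. **Formulate Response:**
5. **Draft Response:**
6. **Self-Correction/Verification:**
"""


def step_titles(reasoning_text):
    """Return the titles of the numbered steps in a reasoning trace."""
    return STEP_RE.findall(reasoning_text)


a_steps = step_titles(A_HEADINGS)
b_steps = step_titles(B_HEADINGS)
print(len(a_steps), len(b_steps))
print([title for title in b_steps if title not in a_steps])
```

Note that a plain title comparison also flags "Calculate", since group A names the same step "Perform Calculation"; only "Draft Response" is a genuinely additional step.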
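When to enable it
- Better suited to enabling: multi-turn iterative scenarios (coding, math chains, debugging conversations), where historical reasoning is often a reusable asset.
- Worth considering disabling: stateless single-turn Q&A, or cases where the chain of thought should stay completely invisible to the user.
- Note: preserved historical <think> blocks consume prompt tokens, so in long conversations they need to be weighed against the max_model_len budget.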
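If both kinds of workload share one deployment, a per-request override may be an option. vLLM's OpenAI-compatible chat endpoint accepts a chat_template_kwargs field in the request body; whether it takes precedence over --default-chat-template-kwargs on a given version is an assumption worth verifying before relying on it. The helper name below is hypothetical.

```python
def with_preserve_thinking(payload, enabled):
    """Return a copy of a chat payload with a per-request
    chat_template_kwargs override; the input dict is left untouched."""
    out = dict(payload)
    # Copy any existing kwargs so the caller's nested dict is not mutated.
    kwargs = dict(out.get("chat_template_kwargs", {}))
    kwargs["preserve_thinking"] = bool(enabled)
    out["chat_template_kwargs"] = kwargs
    return out


base = {"messages": [{"role": "user", "content": "What is 17 * 23?"}]}
stateless = with_preserve_thinking(base, False)
iterative = with_preserve_thinking(base, True)
print(stateless["chat_template_kwargs"])
print(iterative["chat_template_kwargs"])
```

This keeps the server-side default for most traffic while letting a stateless client opt out explicitly.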
In the end, both nodes in the cluster were switched back to preserve_thinking=1 and are served uniformly through the Bifrost gateway.