Enabling Qwen3.6's preserve_thinking in vLLM: A Two-Node A/B Verification

TL;DR

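Qwen3.6's preserve_thinking is a Jinja variable defined in the model's chat template, not a vLLM CLI flag.
The correct way to enable it: --default-chat-template-kwargs '{"preserve_thinking": true}'.
On a two-node DGX Spark cluster I ran an A/B comparison with one node on and one node off. For the same input, prompt tokens were 55 vs 51 (the 4-token difference is the preserved historical <think> block) and completion tokens were 355 vs 382 (with it off, the model "thinks" everything through again).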

Background

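A Reddit post mentioned that Qwen3.6 "ships preserve_thinking flag" alongside a KV cache fix (link to the original post). The cluster happens to run Qwen3.6-35B-A3B-FP8 (reasoning parser qwen3), so I wanted to turn this switch on across both DGX Sparks.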

Where the parameter lives: the chat template, not the vLLM CLI

"Flag" is a misleading word here. Qwen3.6's chat_template.jinja contains:

{%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}

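In other words:

preserve_thinking is a Jinja variable in the template; when it is undefined (the default), the current template behaves as if it were False
When it is not enabled, the template keeps reasoning only for assistant messages that come after the last user query (loop.index0 > ns.last_query_index) and strips the <think> blocks of all earlier turns
When it is enabled, the reasoning chains of all historical turns stay in the context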
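The keep-or-strip rule can be modelled in a few lines of Python. This is a simplified sketch, not the real template: the function name render_turns is made up for illustration, last_query_index is approximated as the index of the last user message, and the exact whitespace around the <think> block is an assumption.

```python
def render_turns(messages, preserve_thinking=False):
    """Simplified model of the template condition (not the actual Jinja template)."""
    # Stand-in for ns.last_query_index: index of the last user message.
    last_query_index = max(
        i for i, m in enumerate(messages) if m["role"] == "user"
    )
    rendered = []
    for i, m in enumerate(messages):
        if m["role"] != "assistant":
            rendered.append(m["content"])
            continue
        reasoning = m.get("reasoning_content", "")
        # Mirrors: preserve_thinking is true OR loop.index0 > ns.last_query_index
        keep = preserve_thinking is True or i > last_query_index
        if keep and reasoning:
            rendered.append(f"<think>\n{reasoning}\n</think>\n\n{m['content']}")
        else:
            rendered.append(m["content"])
    return rendered


history = [
    {"role": "user", "content": "What is 17 * 23?"},
    {"role": "assistant",
     "reasoning_content": "17*23 = 17*(20+3) = 340 + 51 = 391.",
     "content": "391"},
    {"role": "user", "content": "Now add 9 to that."},
]

off = render_turns(history)
on = render_turns(history, preserve_thinking=True)
print(repr(off[1]))
print(repr(on[1]))
```

With the variable off, the assistant message sits before the last user query, so only its content survives; with it on, the <think> block is rendered back into the prompt.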

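vLLM's own --reasoning-parser and --reasoning-config do not control this: they only handle parsing the output, whereas preserve_thinking affects how the input prompt is built. The correct way to enable it is to inject it into the template as a kwarg: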

vllm serve ... \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"preserve_thinking": true}'
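If the command line is generated by a script, the JSON value is easy to get wrong (Python's True versus JSON's true, or shell quoting). A minimal sketch of building it safely, assuming a placeholder model path and a hypothetical helper name build_serve_command; only the two flags shown above are used:

```python
import json
import shlex

MODEL = "<model-path>"  # placeholder, replace with the real model path


def build_serve_command(model, template_kwargs):
    """Return a shell-safe `vllm serve` command line as a single string."""
    # json.dumps turns Python True into JSON true, which is what the flag expects.
    kwargs_json = json.dumps(template_kwargs)
    argv = [
        "vllm", "serve", model,
        "--reasoning-parser", "qwen3",
        "--default-chat-template-kwargs", kwargs_json,
    ]
    # shlex.join quotes every argument that needs quoting.
    return shlex.join(argv)


cmd = build_serve_command(MODEL, {"preserve_thinking": True})
print(cmd)
```

Passing a dict through json.dumps and shlex.join avoids hand-written quoting, and shlex.split on the result gives back the original argument list.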

A/B verification: two nodes as each other's control

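The cluster already runs one vLLM instance on each of two DGX Sparks (with a Bifrost gateway in front for routing), which makes it a natural fit for a controlled comparison:

Server 1 (experiment group): preserve_thinking=1
Server 2 (control group): preserve_thinking="" (off)

With Ansible, --limit redeploys only server 2. Changing a single extra var is enough; the Makefile and playbook stay untouched: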

uv run ansible-playbook -i inventory.ini playbooks/vllm-model-deploy.yml \
  --limit 100.67.164.92 \
  -e "vllm_preserve_thinking=" \
  ...

The test messages form a three-message conversation that clearly depends on earlier reasoning:

{
  "messages": [
    {"role": "user", "content": "What is 17 * 23?"},
    {"role": "assistant",
     "reasoning_content": "17*23 = 17*(20+3) = 340 + 51 = 391.",
     "content": "391"},
    {"role": "user",
     "content": "Now add 9 to that, and briefly explain which prior step you are building on."}
  ],
  "max_tokens": 512,
  "temperature": 0.0,
  "seed": 42
}

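Each message is deliberately designed to amplify the observable effect of preserve_thinking:

Message 1 (user), "What is 17 * 23?": gives the model a simple arithmetic problem with a non-trivial reasoning path. I chose 17×23 rather than 7×8 because it has to be expanded step by step (17×(20+3)), which naturally leaves recognizable intermediate steps in the reasoning. The historical <think> block then has real content yet stays short enough for the token difference to be visible at a glance.
Message 2 (assistant), with reasoning_content + content: a manually injected reasoning chain that simulates "the model has just answered the first turn". It has to go explicitly into the reasoning_content field for the chat template to render it as a <think>...</think> block; content: "391" is the final answer. What preserve_thinking decides here is whether this <think> segment is sent back to the model with the next request.
Message 3 (user), "Now add 9 to that, and briefly explain which prior step you are building on.": the key to the whole test, with two triggers built in:
- "add 9 to that": "that" is a pronoun pointing at 391, the result in message 2. With preserve_thinking off, the model sees only a bare content: "391" with no trace of how it was derived and has to reconstruct the reasoning; with it on, the historical <think> is in the context and can be reused directly.
- "briefly explain which prior step you are building on": forces the model to reference its earlier reasoning. If the template has stripped the historical <think>, the model has to re-derive "17×23=391" in its new reasoning before it can "explain" anything. I believe these extra tokens are a major reason why B's completion is 27 tokens longer than A's.

On top of that, temperature=0.0 and seed=42 pin down sampling to rule out randomness, so differences between the two nodes are easier to attribute to preserve_thinking itself.

The same payload was sent to :30000 on both nodes at the same time: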

Metric	A: preserve_thinking=1	B: preserve_thinking=0
`prompt_tokens`	55	51
`completion_tokens`	355	382
Final answer	✅ 400	✅ 400
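The comparison can be scripted roughly as follows. This is a sketch under assumptions: the hostnames in ENDPOINTS are placeholders (only port 30000 comes from the setup above), fetch_usage and usage_delta are illustrative helper names, and /v1/chat/completions is the OpenAI-compatible route that vLLM serves. The network call is defined but not executed here; the delta is computed from the numbers recorded in the table.

```python
import json
import urllib.request

PAYLOAD = {
    "messages": [
        {"role": "user", "content": "What is 17 * 23?"},
        {"role": "assistant",
         "reasoning_content": "17*23 = 17*(20+3) = 340 + 51 = 391.",
         "content": "391"},
        {"role": "user",
         "content": "Now add 9 to that, and briefly explain which prior step you are building on."},
    ],
    "max_tokens": 512,
    "temperature": 0.0,
    "seed": 42,
}

# Placeholder hostnames, not the real cluster addresses.
ENDPOINTS = {
    "A": "http://server-1.example:30000/v1/chat/completions",
    "B": "http://server-2.example:30000/v1/chat/completions",
}


def fetch_usage(url, payload, timeout=120):
    """POST the payload and return the `usage` object from the response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["usage"]


def usage_delta(usage_a, usage_b):
    """Token counts of A minus B for the two fields compared in the table."""
    keys = ("prompt_tokens", "completion_tokens")
    return {key: usage_a[key] - usage_b[key] for key in keys}


# Values recorded in the table above.
recorded_a = {"prompt_tokens": 55, "completion_tokens": 355}
recorded_b = {"prompt_tokens": 51, "completion_tokens": 382}
delta = usage_delta(recorded_a, recorded_b)
print(delta)
```

For a live run, call fetch_usage(ENDPOINTS["A"], PAYLOAD) and fetch_usage(ENDPOINTS["B"], PAYLOAD) and pass the two results to usage_delta.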

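There are two differences:

The prompt is 4 tokens longer: exactly the preserved <think>...</think> fragment from the previous turn. With the setting off, the chat template strips it.
With the setting off, the completion is actually 27 tokens longer: the model cannot see its earlier reasoning and has only the bare 391 in the content field to infer where it came from, so the reasoning goes through a similar verification pass again.

This problem is simple enough that both sides get the right answer. Judging from this one simple example, if the historical reasoning chain were longer and the problem more complex, A might save even more completion tokens, and B might be more prone to forgetting or recomputing.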

Actual output comparison

Group A (preserve_thinking=1), reasoning:

Here's a thinking process:

1.  **Analyze User Input:**
   - User asks: "Now add 9 to that, and briefly explain which prior step you are building on."
   - "That" refers to the result of the previous calculation: 17 * 23 = 391.
   - So, I need to calculate 391 + 9.
   - I also need to briefly explain which prior step I'm building on.

2.  **Perform Calculation:**
   - 391 + 9 = 400.

3.  **Identify Prior Step:**
   - The prior step was calculating the product of 17 and 23, which equals 391.
   - I'm building directly on that result.

4.  **Formulate Response:**
   - State the new result: 400.
   - Briefly explain the connection: This builds directly on the previous multiplication step (17 × 23 = 391), to which I added 9.
   - Keep it concise as requested.

5.  **Self-Correction/Verification:**
   - Calculation: 391 + 9 = 400. Correct.
   - Explanation: Clearly references the prior multiplication step. Matches user request.
   - Tone/Format: Direct and brief.

   All good. Proceed.
   Output matches the response.✅

Group A, content:

400. This builds directly on the previous step where I calculated 17 × 23 = 391, to which I then added 9.

Group B (preserve_thinking=0), reasoning:

Here's a thinking process:

1.  **Analyze User Input:**
   - User asks: "Now add 9 to that, and briefly explain which prior step you are building on."
   - "That" refers to the previous result: 391 (from 17 * 23)
   - Need to calculate: 391 + 9
   - Need to briefly explain which prior step I'm building on.

2.  **Calculate:**
   - 391 + 9 = 400

3.  **Identify Prior Step:**
   - The prior step was calculating 17 * 23 = 391
   - I'm building directly on that multiplication result.

4.  **Formulate Response:**
   - State the new result: 400
   - Briefly explain the connection: I'm adding 9 to the result of the previous multiplication (17 × 23 = 391).
   - Keep it concise as requested.

5.  **Draft Response:**
   400. I'm building directly on the previous step, where I calculated 17 × 23 = 391, and simply added 9 to that result.

6.  **Self-Correction/Verification:**
   - Math check: 391 + 9 = 400 ✓
   - Explanation matches request: "briefly explain which prior step you are building on" ✓
   - Tone/length: Concise ✓
   - Ready.✅

Group B, content:

400. I'm building directly on the previous step, where I calculated 17 × 23 = 391, and simply added 9 to that result.

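Reading the two outputs side by side reveals a few details:

Group A's reasoning follows a standard 5-step structure (Analyze → Perform → Identify → Formulate → Self-Correction). Group B, lacking the previous turn's reasoning chain to fall back on, adds a "Draft Response" step as step 5: it drafts the answer first and then verifies it. This extra step likely accounts for most of the +27 completion tokens.
In "Identify Prior Step", Group A writes "calculating the product of 17 and 23" (restating the type of operation), while Group B writes "calculating 17 * 23 = 391" (putting the full equation back). Because B cannot see the historical <think>, it has to restate the intermediate result explicitly in its new reasoning.
The final content is equivalent in meaning on both sides, with slight wording differences ("to which I then added 9" vs "and simply added 9 to that result"), which is wording variation at the same quality level.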

When to enable it

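Better to enable: multi-turn iterative scenarios (coding, math chains, debugging conversations), where historical reasoning is often a reusable asset
Consider disabling: stateless single-turn Q&A, or cases where the chain of thought should stay completely invisible to users
Note: preserved historical <think> blocks consume prompt tokens, so in long conversations they need to be budgeted together with max_model_len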
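A back-of-the-envelope sketch of that budget effect. All numbers here are hypothetical (50 user tokens, 150 answer tokens and 800 reasoning tokens per turn, a 32768-token limit), and the model ignores template overhead and the room needed for the next completion:

```python
def prompt_tokens_at_turn(turn, user, answer, reasoning, preserve_thinking):
    """Approximate prompt size when the model starts replying in `turn` (1-based)."""
    # Every past turn contributes its user message and answer; its reasoning
    # is only carried along when preserve_thinking is on.
    per_past_turn = user + answer + (reasoning if preserve_thinking else 0)
    return (turn - 1) * per_past_turn + user


def last_turn_that_fits(budget, user, answer, reasoning, preserve_thinking):
    """Largest turn number whose prompt still fits within `budget` tokens."""
    turn = 1
    while prompt_tokens_at_turn(turn + 1, user, answer, reasoning,
                                preserve_thinking) <= budget:
        turn += 1
    return turn


BUDGET = 32768  # hypothetical max_model_len
with_thinking = last_turn_that_fits(BUDGET, 50, 150, 800, preserve_thinking=True)
without_thinking = last_turn_that_fits(BUDGET, 50, 150, 800, preserve_thinking=False)
print(with_thinking, without_thinking)
```

The gap between the two results grows with the length of each turn's reasoning, which is why the setting matters most in long, reasoning-heavy conversations.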

In the end, both nodes in the cluster were switched back to preserve_thinking=1 and are exposed uniformly through the Bifrost gateway.