TL;DR

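  • Qwen3.6's preserve_thinking is a Jinja variable in the model's chat template, not a vLLM CLI flag.
  • The correct way to enable it: --default-chat-template-kwargs '{"preserve_thinking": true}'
  • A/B comparison on a two-node DGX Spark cluster, with the option on for one node and off for the other: for the same input, prompt tokens were 55 vs 51 (the 4-token difference is the preserved historical <think> block) and completion tokens were 355 vs 382 (with the option off, the model "thinks" it through again).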

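Background

A Reddit post mentioned that Qwen3.6, together with a KV cache fix, "ships preserve_thinking flag" (link to the original post). The cluster happened to be running Qwen3.6-35B-A3B-FP8 (reasoning parser qwen3), so I wanted to turn this switch on across both DGX Spark machines.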

Where the parameter lives: the chat template, not the vLLM CLI

Calling it a "flag" is misleading. Qwen3.6's chat_template.jinja contains:

{%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}

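In other words:

  • preserve_thinking is a Jinja variable in the template; when it is undefined, the current template behaves as if it were False
  • When it is off, the condition keeps a <think> block only for messages after ns.last_query_index (judging by the name, the last user query), so the reasoning from earlier, completed turns is stripped
  • When it is on, the reasoning chains from all historical turns stay in the context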
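The condition quoted above can be restated in plain Python to make the two modes easier to compare. This is a minimal sketch, not the real template: the names keeps_reasoning and strip_history are made up for illustration, and it assumes last_query_index is simply the index of the last user message (the actual template may compute it differently, for example around tool responses).

```python
from typing import Optional


def keeps_reasoning(index: int, last_query_index: int,
                    preserve_thinking: Optional[bool] = None) -> bool:
    """Restate the quoted Jinja condition for one assistant message.

    preserve_thinking=None stands for "variable not defined" in the template.
    """
    # Jinja2's built-in `true` test checks identity with True,
    # so a merely truthy value such as 1 would not pass it.
    return preserve_thinking is True or index > last_query_index


def strip_history(messages, preserve_thinking=None):
    """Return copies of the messages, dropping reasoning_content wherever
    the condition is false.

    Assumption: last_query_index is the index of the last user message.
    """
    last_query_index = max(
        i for i, m in enumerate(messages) if m["role"] == "user"
    )
    out = []
    for i, message in enumerate(messages):
        copy = dict(message)  # leave the caller's messages untouched
        if copy["role"] == "assistant" and not keeps_reasoning(
            i, last_query_index, preserve_thinking
        ):
            copy.pop("reasoning_content", None)
        out.append(copy)
    return out


messages = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "reasoning_content": "r1", "content": "a1"},
    {"role": "user", "content": "q2"},
]
off = strip_history(messages)
on = strip_history(messages, preserve_thinking=True)
print("reasoning_content" in off[1], "reasoning_content" in on[1])  # False True
```

With the variable undefined, the assistant message at index 1 sits before the last user query at index 2, so its reasoning is dropped; with preserve_thinking set to true it is kept.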

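Neither vLLM's --reasoning-parser nor --reasoning-config controls this: they only deal with parsing the output, whereas preserve_thinking affects how the input prompt is built. The correct way to enable it is to inject it into the template as a kwarg: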

vllm serve ... \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"preserve_thinking": true}'
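vLLM's OpenAI-compatible chat endpoint also accepts a chat_template_kwargs field in the request body, so the same variable can in principle be set per request instead of as a server default. The sketch below only builds such a request body. The helper name with_template_kwargs and the model name are placeholders, and it is an assumption (not verified here) that a per-request value takes precedence over the server default for this template.

```python
import json


def with_template_kwargs(payload: dict, **template_kwargs) -> dict:
    """Return a copy of a chat-completions request body with
    chat_template_kwargs set, merged over any kwargs already present."""
    body = dict(payload)
    merged = dict(body.get("chat_template_kwargs", {}))
    merged.update(template_kwargs)
    body["chat_template_kwargs"] = merged
    return body


base = {
    "model": "my-served-model",  # hypothetical served model name
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
}
request_on = with_template_kwargs(base, preserve_thinking=True)
request_off = with_template_kwargs(base, preserve_thinking=False)
print(json.dumps(request_on["chat_template_kwargs"]))  # {"preserve_thinking": true}
```

This would allow an on/off comparison against a single server, at the cost of every client having to send the field itself.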

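A/B verification: the two machines as each other's control

The cluster already runs one vLLM instance on each of the two DGX Spark machines (with a Bifrost gateway in front for routing), so it lends itself to a controlled comparison:

  • Server 1 (experiment): preserve_thinking=1
  • Server 2 (control): preserve_thinking="" (off)

With Ansible, --limit redeploys only server 2, and changing one extra var is enough; the Makefile and the playbook stay untouched: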

uv run ansible-playbook -i inventory.ini playbooks/vllm-model-deploy.yml \
  --limit 100.67.164.92 \
  -e "vllm_preserve_thinking=" \
  ...

The test messages form a three-message conversation that clearly depends on the earlier reasoning:

{
  "messages": [
    {"role": "user", "content": "What is 17 * 23?"},
    {"role": "assistant",
     "reasoning_content": "17*23 = 17*(20+3) = 340 + 51 = 391.",
     "content": "391"},
    {"role": "user",
     "content": "Now add 9 to that, and briefly explain which prior step you are building on."}
  ],
  "max_tokens": 512,
  "temperature": 0.0,
  "seed": 42
}

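Each message is deliberately designed to amplify the observable effect of preserve_thinking:

  1. Message 1 (user), "What is 17 * 23?": gives the model a simple arithmetic problem with a non-trivial reasoning path. 17×23 was chosen over 7×8 because it has to be expanded step by step (17×(20+3)), which naturally leaves recognizable intermediate steps in the reasoning. The historical <think> block therefore has real content while staying short enough that the token difference is visible at a glance.
  2. Message 2 (assistant), with reasoning_content + content: a manually injected historical reasoning chain that simulates "the model has just answered the first turn". It has to be placed explicitly in the reasoning_content field for the chat template to render it as a <think>...</think> block; content: "391" is the final answer. What preserve_thinking decides here is whether this <think> section is sent back to the model with the next request.
  3. Message 3 (user), "Now add 9 to that, and briefly explain which prior step you are building on.": the key to the whole test, with two triggers built in:
    • "add 9 to that": "that" is a pronoun pointing at the result in message 2, 391. With preserve_thinking off, the model sees only the bare content: "391" and does not know how 391 was derived, so it has to reconstruct the reasoning; with it on, the historical <think> is in the context and can be reused directly.
    • "briefly explain which prior step you are building on": forces the model to refer back to the earlier reasoning. If the template has stripped the historical <think>, the model has to re-derive "17×23=391" in its new reasoning before it can "explain" anything. My judgment is that these extra tokens are a major reason why group B's completion is 27 tokens longer than group A's.

On top of that, temperature=0.0 and seed=42 pin down sampling to rule out randomness, so differences between the two machines are easier to attribute to preserve_thinking itself.

The same payload was sent to :30000 on both machines at the same time: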

Metric             A: preserve_thinking=1   B: preserve_thinking=0
prompt_tokens      55                       51
completion_tokens  355                      382
Final answer       ✅ 400                   ✅ 400

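There are two differences:

  1. The prompt is 4 tokens longer in A: exactly the preserved <think>...</think> fragment from the previous turn. With the option off, the chat template strips it.
  2. With the option off, the completion is 27 tokens longer: the model cannot see its earlier reasoning and has only the bare 391 in the content field to infer where it came from, so the reasoning runs through a similar verification pass again.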
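The two deltas can be recomputed from the usage field of each response. The following is a minimal sketch with hypothetical helper names (fetch_usage, usage_delta). It assumes both servers expose the standard /v1/chat/completions route of an OpenAI-compatible server; fetch_usage is defined but not called here, and the numbers fed to usage_delta are the ones reported in the table above.

```python
import json
import urllib.request


def fetch_usage(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to an OpenAI-compatible server and return its usage dict."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["usage"]


def usage_delta(a: dict, b: dict) -> dict:
    """Per-field difference a - b for the two token counters."""
    keys = ("prompt_tokens", "completion_tokens")
    return {key: a[key] - b[key] for key in keys}


# Numbers reported in the table above.
usage_a = {"prompt_tokens": 55, "completion_tokens": 355}
usage_b = {"prompt_tokens": 51, "completion_tokens": 382}
print(usage_delta(usage_a, usage_b))
# {'prompt_tokens': 4, 'completion_tokens': -27}
```

In a live run, usage_a and usage_b would come from calling fetch_usage once per server with the identical payload.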

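The problem is simple enough that both sides answer correctly. Extrapolating from this one simple example: with longer historical reasoning chains and harder problems, A may save more completion tokens, and B may become more prone to forgetting or recomputing.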

Actual output comparison

Group A (preserve_thinking=1), reasoning:

Here's a thinking process:

1.  **Analyze User Input:**
   - User asks: "Now add 9 to that, and briefly explain which prior step you are building on."
   - "That" refers to the result of the previous calculation: 17 * 23 = 391.
   - So, I need to calculate 391 + 9.
   - I also need to briefly explain which prior step I'm building on.

2.  **Perform Calculation:**
   - 391 + 9 = 400.

3.  **Identify Prior Step:**
   - The prior step was calculating the product of 17 and 23, which equals 391.
   - I'm building directly on that result.

4.  **Formulate Response:**
   - State the new result: 400.
   - Briefly explain the connection: This builds directly on the previous multiplication step (17 × 23 = 391), to which I added 9.
   - Keep it concise as requested.

5.  **Self-Correction/Verification:**
   - Calculation: 391 + 9 = 400. Correct.
   - Explanation: Clearly references the prior multiplication step. Matches user request.
   - Tone/Format: Direct and brief.

   All good. Proceed.
   Output matches the response.✅

Group A, content:

400. This builds directly on the previous step where I calculated 17 × 23 = 391, to which I then added 9.

Group B (preserve_thinking=0), reasoning:

Here's a thinking process:

1.  **Analyze User Input:**
   - User asks: "Now add 9 to that, and briefly explain which prior step you are building on."
   - "That" refers to the previous result: 391 (from 17 * 23)
   - Need to calculate: 391 + 9
   - Need to briefly explain which prior step I'm building on.

2.  **Calculate:**
   - 391 + 9 = 400

3.  **Identify Prior Step:**
   - The prior step was calculating 17 * 23 = 391
   - I'm building directly on that multiplication result.

4.  **Formulate Response:**
   - State the new result: 400
   - Briefly explain the connection: I'm adding 9 to the result of the previous multiplication (17 × 23 = 391).
   - Keep it concise as requested.

5.  **Draft Response:**
   400. I'm building directly on the previous step, where I calculated 17 × 23 = 391, and simply added 9 to that result.

6.  **Self-Correction/Verification:**
   - Math check: 391 + 9 = 400 ✓
   - Explanation matches request: "briefly explain which prior step you are building on" ✓
   - Tone/length: Concise ✓
   - Ready.✅

Group B, content:

400. I'm building directly on the previous step, where I calculated 17 × 23 = 391, and simply added 9 to that result.

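Reading the two outputs side by side reveals a few details:

  • Group A's reasoning has a standard 5-step structure (Analyze → Perform → Identify → Formulate → Self-Correction). Group B, without the previous turn's reasoning chain to fall back on, has an extra step 5, "Draft Response": it drafts the answer first and then verifies it, which is the most visible source of the +27 completion tokens.
  • In "Identify Prior Step", group A writes "calculating the product of 17 and 23" (restating the type of operation), while group B writes "calculating 17 * 23 = 391" (putting the equation back in). Because B cannot see the historical <think>, it has to restate the intermediate result explicitly in its new reasoning.
  • The final content is equivalent in meaning on both sides, with slight differences in wording ("to which I then added 9" vs "and simply added 9 to that result"): phrasing variation at the same quality level.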

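When to enable it

  • Better suited to enabling: multi-turn iterative scenarios (writing code, math chains, debugging conversations), where historical reasoning is often a reusable asset
  • Consider disabling: stateless single-turn Q&A, or cases where the chain of thought should be completely invisible to the user
  • Note: preserved historical <think> blocks consume prompt tokens; in long conversations, account for them together with the max_model_len budget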
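To get a feel for the budget question, a back-of-the-envelope estimate is enough. The sketch below is an illustration only: the function names and all token counts are hypothetical, and it ignores user messages and template overhead, assuming only that prompt tokens plus newly generated tokens must fit within max_model_len.

```python
from typing import Sequence


def history_prompt_tokens(answer_tokens: Sequence[int],
                          reasoning_tokens: Sequence[int],
                          preserve_thinking: bool) -> int:
    """Rough token count that past assistant turns add to the next prompt.

    Ignores user messages and chat-template overhead.
    """
    total = sum(answer_tokens)
    if preserve_thinking:
        # Every historical reasoning chain stays in the context.
        total += sum(reasoning_tokens)
    return total


def fits(max_model_len: int, prompt_tokens: int, max_new_tokens: int) -> bool:
    """Check that the prompt plus the generation budget stays within the context."""
    return prompt_tokens + max_new_tokens <= max_model_len


# Hypothetical conversation: 10 past turns, 200 answer tokens and
# 1500 reasoning tokens per turn.
answers = [200] * 10
reasoning = [1500] * 10
off = history_prompt_tokens(answers, reasoning, preserve_thinking=False)
on = history_prompt_tokens(answers, reasoning, preserve_thinking=True)
print(off, on)  # 2000 17000
print(fits(16384, off, 512), fits(16384, on, 512))  # True False
```

Under these made-up numbers, the same conversation fits comfortably with the option off and overflows a 16384-token context with it on, which is the trade-off the note above refers to.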

In the end, both machines in the cluster were switched back to preserve_thinking=1 and are exposed uniformly through the Bifrost gateway.