Performance Tuning Notes: omlx + Gemma4-26B on Apple M5

2026-04-19

#AI #LLM #MLX #omlx #Apple Silicon #M5 #Gemma #Benchmark

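Based on a local test run on an M5, this post documents the bandwidth bottleneck of MoE models and how an in-memory hot cache sped up long-context inference to roughly 6.4x.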

[Read more]

Running Qwen3.6-35B-A3B on Two DGX Sparks: Throughput of Direct vLLM vs. Through a Gateway

2026-04-17

#AI #LLM #NVIDIA #DGX Spark #vLLM #Benchmark #Gateway #Qwen

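Measured vLLM throughput for Qwen3.6-35B-A3B-FP8 on two DGX Sparks: about 50 tok/s for a single stream on one machine, and up to about 485 tok/s aggregate across both machines through a FastAPI Gateway at concurrency N=16.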

[Read more]

2026-04-14

#AI #LLM #NVIDIA #DGX Spark #vLLM #Bifrost #Load Balancing #Benchmark

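A complete walkthrough, with benchmark results, of migrating two DGX Sparks from an unstable cross-node vLLM TP=2 deployment to independent single-node instances behind a Bifrost load-balancing gateway.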

[Read more]

2026-04-14

#AI #LLM #NVIDIA #DGX Spark #vLLM #Expert Parallel #Benchmark

A comparison of vLLM EP2 and TP2 running NVIDIA Nemotron 3 Super 120B A12B NVFP4 on two DGX Sparks, with an analysis of the possible causes of EP2's instability.

[Read more]

2026-03-25

#spring-boot #http2 #h2c #java25 #performance #benchmark #k6 #microservice

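Using Spring Boot 3.5, the JDK HttpClient, and h2c, I ran a full load test: first observing that h2c is not inherently faster, then designing a zero-error positive case that shows why it outperforms HTTP/1.1 in specific scenarios.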

[Read more]

2026-03-14

#ai #mlx #ollama #apple-silicon #benchmark

A systematic comparison of MLX and Ollama inference performance on an M2 MacBook Pro (32GB), with measured data for 9B and 35B models.

[Read more]