From Cilium Gateway to CoreDNS: Troubleshooting a Cross-Layer Cascading Failure in K8s

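Documents a bug in Spring AI 2.0.0-M2 where OllamaChatOptions.disableThinking() causes Ollama to return HTTP 400, analyzes the root cause, compares the tradeoffs of two workarounds, and settles on a ClientHttpRequestInterceptor as the least invasive temporary fix.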


Debugging a Hung make coverage Run: Cascading Problems with Reactive Redis and Lettuce SharedLock

2026-02-28

#spring boot #webflux #redis #lettuce #reactive #troubleshooting

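This post is a postmortem of an integration test that hung during the make coverage stage: first connection pool timeouts, then Lettuce SharedLock spinning. It focuses on the investigation path, the wrong hypotheses, the eventual root cause, and a reusable fix strategy.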


How a Performance Optimization Caused Cascading Redis Timeouts in Spring WebFlux

2026-02-27

#spring boot #spring webflux #redis #reactive #troubleshooting

A seemingly harmless removal of publishOn(Schedulers.boundedElastic()) led to cascading Redis timeouts in production. This post explains how Spring’s @Cacheable blocks the Netty event loop when used with RedisCacheManager, and why BlockHound failed to catch it.


Troubleshooting PostgreSQL Metrics Not Being Scraped in the Prometheus Stack

2025-10-26

#homelab #k8s #prometheus stack #metrics #serviceMonitor #troubleshooting

I deployed PostgreSQL with Helm in my homelab k8s cluster, but the Prometheus stack was not scraping any PostgreSQL metrics. How do you troubleshoot this?


Java 21 Virtual Threads: Where Is the Lock?

2024-07-01

#netflix #java21 #springboot3 #troubleshooting

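This is a failure case Netflix ran into while using Java 21 virtual threads. The investigation is fascinating, and services that use this feature could easily hit the same problem.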


Posts for: #troubleshooting