fix(k8s): rebalance topology spread rollouts

2026-06-13 12:56:29 +08:00
parent 17e017f5a3
commit 60f653a0c1
5 changed files with 32 additions and 3 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -33900,3 +33900,21 @@ production browser smoke:
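 **Boundaries**:
 - This round is a GitOps manifest candidate only: no manual Pod deletion, no direct `kubectl patch` against live, no Docker restart, no Nginx reload, and no firewall changes.
 - Until this is complete, only service / cold-start green may be claimed; do not claim the API workload is balanced.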
+
+## 2026-06-13 — API topology spread rollout strategy follow-up
+
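+**New evidence**:
+- The first-round hard spread is live at ArgoCD revision `17e017f5`; the `awoooi-api` deployment now has `minDomains=2` and `whenUnsatisfiable=DoNotSchedule`.
+- After the rollout completed, the API was still `2/2` on 121. Cause: during the rolling update the old Pods on 120 were still terminating, so when the two new Pods were scheduled onto 121 the scheduler still counted the old Pods toward skew; once the old Pods were gone, the final state was skewed.
+
+**Follow-up fix**:
+- API / Web rolling strategy changed to `maxSurge: 0`, `maxUnavailable: 1`.
+- API / Web templates gain the annotation `awoooi.dev/topology-rebalance-generation=2026-06-13T13:05:00+08:00` to force one more rollout under the new strategy.
+- Worker keeps its single-replica strategy and retains only the hard spread constraint for future multi-replica use.
+
+**Acceptance criteria**:
+- ArgoCD syncs to the new revision.
+- API / Web rollouts succeed.
+- API / Web each have one Pod on 120 and one on 121.
+- Public API / Web smoke tests pass.
+- Cold-start scorecard remains `WARN=0 BLOCKED=0`.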
--- a/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
+++ b/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
@@ -180,6 +180,8 @@ sudo kubectl top pods -A --sort-by=cpu

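 The corrected strategy is `maxSkew: 1`, `minDomains: 2`, `topologyKey: kubernetes.io/hostname`, `whenUnsatisfiable: DoNotSchedule`. When both 120 and 121 are Ready, API / Web with replicas=2 must span both nodes; the single-replica Worker can still run, and must also be spread if it is scaled out later.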

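+2026-06-13 13:00 follow-up assessment: hardening the spread alone is not enough. During the API rollout the old Pods on 120 were still terminating and both Pods of the new ReplicaSet were scheduled onto 121; because the old Pods still existed at that moment, the scheduler saw no skew violation. Once the old Pods were gone, the live API ended up concentrated on 121. API / Web must therefore also use `maxSurge: 0` and `maxUnavailable: 1`, so that a rollout frees one slot before scheduling a new Pod and the overlap of old and new ReplicaSets cannot produce a skewed final state.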
+
 Verification:

 - `kubectl kustomize k8s/awoooi-prod` renders three `topologySpreadConstraints`.
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -57,6 +57,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
 | 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
 | 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
 | 2026-06-13 API placement hardening | IN_PROGRESS | 12:43 live refresh showed cold-start `PASS=83 WARN=0 BLOCKED=0`, but API replicas `2/2` were on 120 even though topology spread existed. Root cause: `whenUnsatisfiable=ScheduleAnyway` is a soft preference. GitOps candidate changes API/Web/Worker to `minDomains=2` + `DoNotSchedule`; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun. |
+| 2026-06-13 API rollout strategy hardening | IN_PROGRESS | First hard-spread rollout reached ArgoCD revision `17e017f5`; `DoNotSchedule` was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps candidate sets API/Web `maxSurge=0`, `maxUnavailable=1`, and adds a topology rebalance annotation to force a clean rollout. |

 ---

--- a/k8s/awoooi-prod/05-deployment-web.yaml
+++ b/k8s/awoooi-prod/05-deployment-web.yaml
@@ -24,14 +24,17 @@ spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
-      maxSurge: 1
-      maxUnavailable: 0
+      # 2026-06-13 Codex: no surge keeps topology spread honest during rollouts.
+      maxSurge: 0
+      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: awoooi-web
        system: awoooi
        environment: prod
+      annotations:
+        awoooi.dev/topology-rebalance-generation: "2026-06-13T13:05:00+08:00"
    spec:
      # 2026-06-13 Codex: when both 120 and 121 are Ready, force cross-node spread so replicas=2 cannot legally land on a single node.
      topologySpreadConstraints:
--- a/k8s/awoooi-prod/06-deployment-api.yaml
+++ b/k8s/awoooi-prod/06-deployment-api.yaml
@@ -24,7 +24,10 @@ spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
-      maxSurge: 1
+      # 2026-06-13 Codex: no surge keeps topology spread honest during rollouts.
+      # With surge=1, two new pods can both schedule to the opposite node while
+      # old pods are terminating, then become imbalanced after old pods exit.
+      maxSurge: 0
      # 2026-05-24 Codex: allow one unavailable replica so rollout can replace
      # a bad old ReplicaSet instead of deadlocking at 1/2 when probes regress.
      maxUnavailable: 1
@@ -34,6 +37,8 @@ spec:
        app: awoooi-api
        system: awoooi
        environment: prod
+      annotations:
+        awoooi.dev/topology-rebalance-generation: "2026-06-13T13:05:00+08:00"
    spec:
      # 2026-06-13 Codex: when both 120 and 121 are Ready, force cross-node spread so replicas=2 cannot legally land on a single node.
      topologySpreadConstraints:
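
The skew reasoning behind the `maxSurge: 0` change can be sketched as a small Python check. This is an illustrative model only, not the scheduler's implementation: it assumes both nodes are eligible domains (so `minDomains: 2` is already satisfied), and the function name `feasible_nodes` and the pod-count snapshots are hypothetical. It applies the documented rule that a `DoNotSchedule` constraint allows a node only if placing the pod there keeps (pods in that domain) minus (global minimum) within `maxSkew`.

```python
from typing import Dict, List


def feasible_nodes(matching_pods: Dict[str, int], max_skew: int = 1) -> List[str]:
    """Return nodes where one more matching pod keeps skew within max_skew.

    Skew for a candidate node is (pods already there + 1) minus the smallest
    per-node count across all eligible nodes.
    """
    global_min = min(matching_pods.values())
    return sorted(
        node
        for node, count in matching_pods.items()
        if count + 1 - global_min <= max_skew
    )


# Hypothetical snapshot with surge: an old pod is still counted on 120 and
# one new pod already sits on 121, so the second new pod may land on either.
with_surge = {"120": 1, "121": 1}

# Hypothetical snapshot without surge: the slot on 120 has been freed first
# and one new pod sits on 121, so the second new pod can only go to 120.
without_surge = {"120": 0, "121": 1}

print(feasible_nodes(with_surge))     # ['120', '121']
print(feasible_nodes(without_surge))  # ['120']
```

In the first snapshot the constraint is satisfied on both nodes, so the final placement depends on other scheduler scoring and can end up on 121 twice once the old pod exits. In the second, the constraint alone forces the split.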