diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index cc9d2758..c575f0d9 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -33900,3 +33900,21 @@ production browser smoke:
 **Boundaries**:
 - This round is a GitOps manifest candidate: no manual Pod deletion, no direct `kubectl patch` against live, no Docker restart, no Nginx reload, and no firewall changes.
 - Until this is complete we can only claim service / cold-start green; we must not claim the API workload is balanced.
+
+## 2026-06-13 - API topology spread rollout strategy follow-up
+
+**New evidence**:
+- The first-round hard spread is live at ArgoCD revision `17e017f5`; the `awoooi-api` deployment now has `minDomains=2` and `whenUnsatisfiable=DoNotSchedule`.
+- After the rollout completed, the API was still `2/2` on 121. Cause: during the rolling update the old Pod on 120 was still terminating, so when the two new Pods were scheduled onto 121 the scheduler still counted the old Pod toward the skew; once the old Pod disappeared, the final state was skewed.
+
+**Follow-up fix**:
+- API / Web rolling strategy changed to `maxSurge: 0`, `maxUnavailable: 1`.
+- API / Web template gains `awoooi.dev/topology-rebalance-generation=2026-06-13T13:05:00+08:00` to force one more rollout under the new strategy.
+- Worker keeps its single-replica strategy and retains only the hard spread constraint for future multi-replica use.
+
+**Acceptance criteria**:
+- ArgoCD syncs to the new revision.
+- API / Web rollout succeeds.
+- API / Web each have one pod on 120 and one on 121.
+- Public API / Web smoke passes.
+- Cold-start scorecard remains `WARN=0 BLOCKED=0`.
diff --git a/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md b/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
index a8bed1f9..b5013421 100644
--- a/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
+++ b/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
@@ -180,6 +180,8 @@ sudo kubectl top pods -A --sort-by=cpu
 
 The corrected strategy is `maxSkew: 1`, `minDomains: 2`, `topologyKey: kubernetes.io/hostname`, `whenUnsatisfiable: DoNotSchedule`. When 120 and 121 are both Ready, API / Web with replicas=2 must span both nodes; Worker can still run as a single replica and must also spread if it is scaled out later.
 
+2026-06-13 13:00 follow-up reading: making the spread hard is not enough on its own. During the API rollout the old Pod was still terminating on 120 and both Pods of the new ReplicaSet were scheduled onto 121; because the old Pod still existed, the scheduler did not see a skew violation at that moment. Once the old Pod disappeared, the live API ended up concentrated on 121. API / Web must therefore also use `maxSurge: 0`, `maxUnavailable: 1`, so that the rollout frees a slot before scheduling each new Pod and the interleaving of old and new ReplicaSets cannot produce a skewed final state.
+
 Verification:
 - `kubectl kustomize k8s/awoooi-prod` renders three `topologySpreadConstraints`.
 
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 0f21694a..4f376aa2 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -57,6 +57,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
 | 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
 | 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
 | 2026-06-13 API placement hardening | IN_PROGRESS | 12:43 live refresh showed cold-start `PASS=83 WARN=0 BLOCKED=0`, but API replicas `2/2` were on 120 even though topology spread existed. Root cause: `whenUnsatisfiable=ScheduleAnyway` is a soft preference. GitOps candidate changes API/Web/Worker to `minDomains=2` + `DoNotSchedule`; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun. |
+| 2026-06-13 API rollout strategy hardening | IN_PROGRESS | First hard-spread rollout reached ArgoCD revision `17e017f5`; `DoNotSchedule` was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps candidate sets API/Web `maxSurge=0`, `maxUnavailable=1`, and adds a topology rebalance annotation to force a clean rollout. |
 
 ---
 
diff --git a/k8s/awoooi-prod/05-deployment-web.yaml b/k8s/awoooi-prod/05-deployment-web.yaml
index ddf7df70..c60c3f8f 100644
--- a/k8s/awoooi-prod/05-deployment-web.yaml
+++ b/k8s/awoooi-prod/05-deployment-web.yaml
@@ -24,14 +24,17 @@ spec:
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxSurge: 1
-      maxUnavailable: 0
+      # 2026-06-13 Codex: no surge keeps topology spread honest during rollouts.
+      maxSurge: 0
+      maxUnavailable: 1
   template:
     metadata:
       labels:
         app: awoooi-web
         system: awoooi
         environment: prod
+      annotations:
+        awoooi.dev/topology-rebalance-generation: "2026-06-13T13:05:00+08:00"
     spec:
       # 2026-06-13 Codex: when 120 / 121 are both Ready, force spreading across nodes so that replicas=2 cannot legitimately land on a single node.
       topologySpreadConstraints:
diff --git a/k8s/awoooi-prod/06-deployment-api.yaml b/k8s/awoooi-prod/06-deployment-api.yaml
index a3114b67..2949ade0 100644
--- a/k8s/awoooi-prod/06-deployment-api.yaml
+++ b/k8s/awoooi-prod/06-deployment-api.yaml
@@ -24,7 +24,10 @@ spec:
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxSurge: 1
+      # 2026-06-13 Codex: no surge keeps topology spread honest during rollouts.
+      # With surge=1, two new pods can both schedule to the opposite node while
+      # old pods are terminating, then become imbalanced after old pods exit.
+      maxSurge: 0
       # 2026-05-24 Codex: allow one unavailable replica so rollout can replace
       # a bad old ReplicaSet instead of deadlocking at 1/2 when probes regress.
       maxUnavailable: 1
@@ -34,6 +37,8 @@ spec:
         app: awoooi-api
         system: awoooi
         environment: prod
+      annotations:
+        awoooi.dev/topology-rebalance-generation: "2026-06-13T13:05:00+08:00"
     spec:
       # 2026-06-13 Codex: when 120 / 121 are both Ready, force spreading across nodes so that replicas=2 cannot legitimately land on a single node.
       topologySpreadConstraints:
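
Reviewer note: the skew reasoning in this change can be checked with a toy model. The sketch below is not the kube-scheduler, only a minimal simulation under stated assumptions: two nodes named `120` and `121`, `maxSkew: 1` with `DoNotSchedule`, a node being eligible when its matching-pod count plus one, minus the lowest count across nodes, stays within `maxSkew`, and pods that have already left the old ReplicaSet no longer being counted. The helper names (`allowed_nodes`, `place`, `rollout_with_surge`, `rollout_without_surge`) and the `prefer` tie-break are hypothetical; `prefer` models the worst-case node choice whenever both nodes are eligible.

```python
NODES = ("120", "121")
MAX_SKEW = 1


def allowed_nodes(counted):
    """Nodes where one more pod keeps (count + 1 - global minimum) <= MAX_SKEW."""
    counts = {node: counted.count(node) for node in NODES}
    floor = min(counts.values())
    return [node for node in NODES if counts[node] + 1 - floor <= MAX_SKEW]


def place(counted, prefer):
    """Pick the preferred node if it is eligible, otherwise the first eligible one."""
    eligible = allowed_nodes(counted)
    return prefer if prefer in eligible else eligible[0]


def rollout_with_surge(old, prefer):
    """Assumed maxSurge=1 / maxUnavailable=1 ordering for replicas=2."""
    old, new = list(old), []
    new.append(place(old + new, prefer))  # surge pod; both old pods still count
    old.pop()                             # one old pod leaves
    new.append(place(old + new, prefer))  # the remaining old pod still counts
    return sorted(new)                    # the last old pod exits afterwards


def rollout_without_surge(old, prefer):
    """Assumed maxSurge=0 / maxUnavailable=1 ordering: free a slot, then schedule."""
    old, new = list(old), []
    while old:
        old.pop()
        new.append(place(old + new, prefer))
    return sorted(new)


# First rollout in the logbook: old pods on 120, both new pods end up on 121.
print(rollout_with_surge(["120", "120"], prefer="121"))     # ['121', '121']
# Second candidate: starting from 121/121, no surge ends with one pod per node.
print(rollout_without_surge(["121", "121"], prefer="121"))  # ['120', '121']
```

In this model the surge rollout is only unbalanced when the tie goes the wrong way, which matches the observed outcome without proving it was the only possible one. The no-surge rollout has no tie to lose, because each new pod is placed while at most one other pod is counted.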