diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index e14afce8..3d731c2b 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -33861,3 +33861,19 @@ production browser smoke:
 1. Continue advancing the S4.9 owner response real-reply data package; required fields are owner role / team, decision, decision reason, affected scope, redacted evidence refs, and followup owner. Keep `0 / false` until acceptance.
 2. Keep inventorying high-value configuration controls, prioritizing Nginx, K8s manifests, ArgoCD apps, Gitea workflows, registry / Harbor, Sentry / SigNoz / Alertmanager, the public gateway, AI provider routes, database migrations, and the secrets injection flow.
 3. Any host maintenance, Kali update, or Nginx / Docker / firewall / active scan work still requires a dedicated maintenance window and manual approval; the governance page or an AwoooP approval must not substitute for it.
+
+## 2026-06-13 - API / Web topology spread hardening candidate
+
+**Background**:
+- The 12:43 cold-start is still `PASS=83 WARN=0 BLOCKED=0`; the service surface is available.
+- Live K3s shows both `awoooi-api` replicas on 120; `awoooi-web` is still spread across 120 / 121.
+- The live deployment already has `topologySpreadConstraints`, but `whenUnsatisfiable=ScheduleAnyway` is only a soft preference, so the scheduler may legitimately place both replicas on the same node.
+
+**Changes**:
+- `k8s/awoooi-prod/06-deployment-api.yaml`, `05-deployment-web.yaml`, and `08-deployment-worker.yaml` now use `minDomains: 2` + `whenUnsatisfiable: DoNotSchedule`.
+- Goal: when both 120 and 121 are Ready, API / Web replicas=2 must land on different nodes; the single-replica Worker can still run, and must also spread if it scales to multiple replicas later.
+- `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` and the reboot workplan are updated in step, so that soft spread is no longer mistaken for a workload balanced guarantee.
+
+**Boundaries**:
+- This round is a GitOps manifest candidate: no manual Pod deletion, no direct `kubectl patch` against live, no Docker restart, no Nginx reload, no firewall change.
+- Until it completes, only service / cold-start green may be claimed; API workload balanced must not be claimed.
diff --git a/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md b/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
index 401b4903..a8bed1f9 100644
--- a/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
+++ b/docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md
@@ -176,17 +176,19 @@ sudo kubectl top pods -A --sort-by=cpu
 - `k8s/awoooi-prod/06-deployment-api.yaml`
 - `k8s/awoooi-prod/08-deployment-worker.yaml`
 
-The strategy is `maxSkew: 1`, `topologyKey: kubernetes.io/hostname`, `whenUnsatisfiable: ScheduleAnyway`. This explicitly prefers spreading across 120 / 121 but does not stall a rollout during single-node maintenance or when the other node is briefly unavailable.
+The 2026-06-13 12:43 live refresh found that `ScheduleAnyway` can still let both `awoooi-api` replicas legitimately land on 120; cold-start is still green, but API workload balanced cannot be claimed.
+
+The revised strategy is `maxSkew: 1`, `minDomains: 2`, `topologyKey: kubernetes.io/hostname`, `whenUnsatisfiable: DoNotSchedule`. When both 120 and 121 are Ready, API / Web replicas=2 must land on different nodes; the single-replica Worker can still run, and must also spread when scaled out later.
 
 Verification:
 
 - `kubectl kustomize k8s/awoooi-prod` renders three `topologySpreadConstraints`.
 - `git diff --check` passes.
 - ArgoCD `awoooi-prod` has synced to revision `acaae99986aee2e1f5630984981ccb0f2b676bb8`, operation `Succeeded`.
-- The live API / Web / Worker deployments all have `topologySpreadConstraints`.
-- `awoooi-api`: one replica each on 120 / 121.
-- `awoooi-web`: one replica each on 120 / 121.
-- `awoooi-worker`: single replica on 121, which satisfies the single-replica criterion.
+- The live API / Web / Worker deployments must all have `topologySpreadConstraints`.
+- `awoooi-api` replicas=2: one replica each on 120 / 121; otherwise the state is `SERVICE_OK_PLACEMENT_DRIFT`.
+- `awoooi-web` replicas=2: one replica each on 120 / 121.
+- `awoooi-worker` replicas=1: may run on either node; if replicas > 1, they must spread across 120 / 121.
 - Post-rollout cold-start: 2026-06-13 00:34 final scorecard `PASS=83 WARN=0 BLOCKED=0`. At 00:33 it was briefly blocked because 110's `known_hosts` entries for 120 / 188 were stale; host key trust was refreshed after taking a backup.
 
 Note: during the rolling update the new API Pods first concentrated on 121 because the old Pod on 120 was still Terminating; a single stateless API Pod was then rescheduled under control, ending with one replica each on 120 / 121. No namespace-wide delete, Docker restart, Nginx reload, or firewall change was performed.
@@ -195,9 +197,9 @@ sudo kubectl top pods -A --sort-by=cpu
 
 | Workload | Target |
 |----------|------|
-| `awoooi-api` replicas=2 | One each on 120 / 121 |
-| `awoooi-web` replicas=2 | One each on 120 / 121 |
-| `awoooi-worker` replicas=1 | May stay on either node, but needs restart-recovery evidence |
+| `awoooi-api` replicas=2 | One each on 120 / 121, `whenUnsatisfiable=DoNotSchedule` |
+| `awoooi-web` replicas=2 | One each on 120 / 121, `whenUnsatisfiable=DoNotSchedule` |
+| `awoooi-worker` replicas=1 | May stay on either node; `whenUnsatisfiable=DoNotSchedule` when multi-replica |
 | ArgoCD / Velero / kube-system single replica | No rush to move; keep stable first |
 | DaemonSet | One each on 120 / 121; already satisfied |
 
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 6e75eefa..0f21694a 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -56,6 +56,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
 | 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
 | 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
 | 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
+| 2026-06-13 API placement hardening | IN_PROGRESS | 12:43 live refresh showed cold-start `PASS=83 WARN=0 BLOCKED=0`, but API replicas `2/2` were on 120 even though topology spread existed. Root cause: `whenUnsatisfiable=ScheduleAnyway` is a soft preference. GitOps candidate changes API/Web/Worker to `minDomains=2` + `DoNotSchedule`; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun. |
 
 ---
 
diff --git a/k8s/awoooi-prod/05-deployment-web.yaml b/k8s/awoooi-prod/05-deployment-web.yaml
index 80cc9ed7..ddf7df70 100644
--- a/k8s/awoooi-prod/05-deployment-web.yaml
+++ b/k8s/awoooi-prod/05-deployment-web.yaml
@@ -33,11 +33,12 @@ spec:
         system: awoooi
         environment: prod
     spec:
-      # 2026-06-13 Codex: Explicitly prefer spreading across 120 / 121 without stalling rollouts during single-node maintenance.
+      # 2026-06-13 Codex: When both 120 and 121 are Ready, force cross-node spread so replicas=2 cannot legitimately land on a single node.
       topologySpreadConstraints:
         - maxSkew: 1
+          minDomains: 2
           topologyKey: kubernetes.io/hostname
-          whenUnsatisfiable: ScheduleAnyway
+          whenUnsatisfiable: DoNotSchedule
           labelSelector:
             matchLabels:
               app: awoooi-web
diff --git a/k8s/awoooi-prod/06-deployment-api.yaml b/k8s/awoooi-prod/06-deployment-api.yaml
index 9f8a12eb..a3114b67 100644
--- a/k8s/awoooi-prod/06-deployment-api.yaml
+++ b/k8s/awoooi-prod/06-deployment-api.yaml
@@ -35,11 +35,12 @@ spec:
         system: awoooi
         environment: prod
     spec:
-      # 2026-06-13 Codex: Explicitly prefer spreading across 120 / 121 without stalling rollouts during single-node maintenance.
+      # 2026-06-13 Codex: When both 120 and 121 are Ready, force cross-node spread so replicas=2 cannot legitimately land on a single node.
       topologySpreadConstraints:
         - maxSkew: 1
+          minDomains: 2
           topologyKey: kubernetes.io/hostname
-          whenUnsatisfiable: ScheduleAnyway
+          whenUnsatisfiable: DoNotSchedule
           labelSelector:
             matchLabels:
               app: awoooi-api
diff --git a/k8s/awoooi-prod/08-deployment-worker.yaml b/k8s/awoooi-prod/08-deployment-worker.yaml
index ea117d33..44091905 100644
--- a/k8s/awoooi-prod/08-deployment-worker.yaml
+++ b/k8s/awoooi-prod/08-deployment-worker.yaml
@@ -41,11 +41,12 @@ spec:
         environment: prod
         component: signal-processor
     spec:
-      # 2026-06-13 Codex: Worker currently has min=1; prefer cross-node spread when HPA scales to multiple replicas.
+      # 2026-06-13 Codex: Worker currently has min=1; it must spread across 120 / 121 when scaled to multiple replicas.
       topologySpreadConstraints:
         - maxSkew: 1
+          minDomains: 2
           topologyKey: kubernetes.io/hostname
-          whenUnsatisfiable: ScheduleAnyway
+          whenUnsatisfiable: DoNotSchedule
           labelSelector:
             matchLabels:
               app: awoooi-worker
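
Reviewer note: the placement rule these manifests encode can be sanity-checked offline. The following is a minimal Python sketch, not part of this change set, of how `maxSkew` and `minDomains` are evaluated for `topologyKey: kubernetes.io/hostname`. The function names, the `BALANCED` label, and the use of `mon` / `mon1` as sample node names are assumptions for illustration; only `SERVICE_OK_PLACEMENT_DRIFT` comes from the runbook text. It works on plain lists of node names (for example, copied by hand from `kubectl get pods -o wide`) and never contacts the cluster.

```python
def observed_skew(pod_nodes, eligible_nodes, min_domains=2):
    """Return the spread skew for one workload.

    pod_nodes:      node name for each running replica (hypothetical input).
    eligible_nodes: Ready nodes that count as topology domains.
    """
    # Every eligible node is a domain, even if it holds zero replicas.
    counts = {node: 0 for node in eligible_nodes}
    for node in pod_nodes:
        counts[node] = counts.get(node, 0) + 1
    if not counts:
        return 0
    # With fewer eligible domains than minDomains, the global minimum
    # is treated as 0 instead of the smallest per-domain count.
    if len(counts) < min_domains:
        global_min = 0
    else:
        global_min = min(counts.values())
    return max(counts.values()) - global_min


def classify_placement(pod_nodes, eligible_nodes, max_skew=1, min_domains=2):
    """Label a placement as balanced or drifted (labels are illustrative)."""
    skew = observed_skew(pod_nodes, eligible_nodes, min_domains)
    return "BALANCED" if skew <= max_skew else "SERVICE_OK_PLACEMENT_DRIFT"


if __name__ == "__main__":
    nodes = ["mon", "mon1"]
    # Both API replicas on one node, as seen in the 12:43 refresh.
    print(classify_placement(["mon", "mon"], nodes))
    # One replica per node.
    print(classify_placement(["mon", "mon1"], nodes))
    # Single-replica Worker.
    print(classify_placement(["mon"], nodes))
```

Under this model a single replica always satisfies `maxSkew: 1`, which matches the statement that the single-replica Worker can still run. It also suggests that with only one eligible node a second replica would exceed the skew, so under `DoNotSchedule` it would likely stay Pending until the other node is Ready again; that trade-off is worth confirming against the rollout readback before this row is closed.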