docs(ops): record topology spread verification [skip ci]
@@ -33918,3 +33918,16 @@ production browser smoke:
- API / Web pods: one each on 120 and 121.
- Public API / Web smoke passed.
- Cold-start scorecard remains `WARN=0 BLOCKED=0`.

**Formal verification**:
- Gitea main: `60f653a0 fix(k8s): rebalance topology spread rollouts`.
- ArgoCD: `sync=Synced`, revision `60f653a0c1c0938c5d8e06a06df354da38e4470c`; health is still `Degraded`, carried over from the existing `km-vectorize` governance debt.
- `awoooi-api`: `ready=2/2`, `maxSurge=0 / maxUnavailable=1`, `minDomains=2 / DoNotSchedule`, pods on `mon` and `mon1` respectively.
- `awoooi-web`: `ready=2/2`, `maxSurge=0 / maxUnavailable=1`, `minDomains=2 / DoNotSchedule`, pods on `mon` and `mon1` respectively.
- Public smoke: `https://awoooi.wooo.work/api/v1/health=healthy`, `/zh-TW/governance?tab=automation-inventory=200`.
- Full cold-start: 12:59 read-only rerun `PASS=83 WARN=0 BLOCKED=0`.

**Latest verdict**:
- Service / cold-start: `GREEN`.
- API / Web workload balancing: `LIVE_VERIFIED`.
- DR scorecard: still cannot be declared complete; `5` credential escrow evidence items are still missing.
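
As a rough illustration of how the scorecard line above could be checked mechanically, the sketch below parses `PASS=83 WARN=0 BLOCKED=0` and derives a verdict. The function names, and the rule that `GREEN` requires zero `WARN` and zero `BLOCKED`, are assumptions made for this example; the actual cold-start tooling is not shown in these notes.

```python
import re


def parse_scorecard(line):
    """Extract KEY=number pairs such as PASS=83 from a scorecard line."""
    return {key: int(value) for key, value in re.findall(r"\b([A-Z]+)=(\d+)\b", line)}


def verdict(counts):
    """Assumed rule: GREEN only when nothing is BLOCKED or WARN and something passed."""
    if counts.get("BLOCKED", 0) > 0:
        return "BLOCKED"
    if counts.get("WARN", 0) > 0:
        return "WARN"
    return "GREEN" if counts.get("PASS", 0) > 0 else "UNKNOWN"


# Scorecard text taken from the 12:59 read-only rerun recorded above.
counts = parse_scorecard("12:59 read-only rerun PASS=83 WARN=0 BLOCKED=0")
print(counts, verdict(counts))
```

A line with no recognizable counters yields `UNKNOWN` rather than `GREEN`, so missing evidence is never read as a pass.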
@@ -182,6 +182,8 @@ sudo kubectl top pods -A --sort-by=cpu
2026-06-13 13:00 additional analysis: hardening the spread alone is not enough. During the API rollout, the old Pods on 120 were still terminating while both Pods of the new ReplicaSet were scheduled onto 121; because the old Pods still existed, the scheduler did not consider the skew violated at that moment. Once the old Pods were gone, the live API ended up concentrated on 121. API / Web must therefore also use `maxSurge: 0` and `maxUnavailable: 1`, so that the rollout frees a slot before scheduling a new Pod and overlapping old and new ReplicaSets cannot leave the final placement skewed.

2026-06-13 12:59 live verification: ArgoCD revision `60f653a0` synced; API / Web are both `ready=2/2`, strategy `maxSurge=0 / maxUnavailable=1`, spread `minDomains=2 / DoNotSchedule`. `awoooi-api` pods are on `mon` / `mon1`, and `awoooi-web` pods are on `mon` / `mon1`. Full cold-start rerun: `PASS=83 WARN=0 BLOCKED=0`.

Verification:

- `kubectl kustomize k8s/awoooi-prod` renders three `topologySpreadConstraints`.
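
The 13:00 note above can be illustrated with a small simulation. This is a simplified sketch, not the real scheduler: it assumes `maxSkew=1` (the notes do not state the value), treats old Pods as still counted toward skew while they terminate, as the note describes, and uses hypothetical helper names. Node names `120` and `121` follow the notes.

```python
MAX_SKEW = 1  # assumption: the notes do not state the configured maxSkew


def feasible_nodes(nodes, pods, max_skew=MAX_SKEW):
    """Nodes where adding one pod keeps the skew within max_skew (DoNotSchedule)."""
    counts = {n: 0 for n in nodes}
    for node in pods:
        counts[node] += 1
    lowest = min(counts.values())
    return [n for n in nodes if counts[n] + 1 - lowest <= max_skew]


def rollout(nodes, old_pods, surge):
    """Return the nodes chosen for the new ReplicaSet, one new pod per old pod.

    surge=True models the observed rollout: old pods stay counted while every
    new pod is scheduled. surge=False models maxSurge=0 / maxUnavailable=1:
    one old pod is gone before each replacement is scheduled.
    """
    old, new = list(old_pods), []
    for _ in old_pods:
        if not surge:
            old.pop()  # free a slot first
        # Old and new pods share the selector, so both count toward skew.
        choices = feasible_nodes(nodes, old + new)
        if not choices:
            raise RuntimeError("no feasible node: the pod would stay Pending")
        new.append(choices[0])  # simplification: take the first feasible node
    return sorted(new)


nodes = ["120", "121"]
print(rollout(nodes, ["120", "120"], surge=True))
print(rollout(nodes, ["120", "120"], surge=False))
```

Under these assumptions the surge-style rollout ends with both new Pods on `121`, while the no-surge rollout ends with one Pod on each node, which matches the behaviour recorded in the notes.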
@@ -214,7 +216,7 @@ sudo kubectl top pods -A --sort-by=cpu
| A. Role determination | DONE | 100% | 120/121 confirmed as K3s control-plane AA; workload not yet balanced |
| B. Service inventory | DONE | 100% | Main services, ports, and container types on 110 / 188 inventoried |
| C. Migration tiering | DONE | 100% | Split into: can move first, can be planned, must not be moved directly |
| D. Actual workload balancing | LIVE_VERIFIED | 100% | API / Web / Worker topology spread is in Gitea main / ArgoCD live; API/Web each have one pod on 120 and one on 121 |
| D. Actual workload balancing | LIVE_VERIFIED | 100% | API / Web hard topology spread and no-surge rollout are in Gitea main / ArgoCD live; API/Web each have one pod on 120 and one on 121 |
| E. First batch of stateless service migration | NOT_STARTED | 0% | Requires a separate change window and rollback plan |
| F. Stateful migration | BLOCKED_BY_DESIGN | 0% | Requires storage / backup / restore / HA design |
@@ -57,7 +57,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
| 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 API placement hardening | IN_PROGRESS | 12:43 live refresh showed cold-start `PASS=83 WARN=0 BLOCKED=0`, but API replicas `2/2` were on 120 even though topology spread existed. Root cause: `whenUnsatisfiable=ScheduleAnyway` is a soft preference. GitOps candidate changes API/Web/Worker to `minDomains=2` + `DoNotSchedule`; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun. |
| 2026-06-13 API rollout strategy hardening | IN_PROGRESS | First hard-spread rollout reached ArgoCD revision `17e017f5`; `DoNotSchedule` was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps candidate sets API/Web `maxSurge=0`, `maxUnavailable=1`, and adds a topology rebalance annotation to force a clean rollout. |
| 2026-06-13 API rollout strategy hardening | LIVE_VERIFIED | First hard-spread rollout reached ArgoCD revision `17e017f5`; `DoNotSchedule` was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps rollout reached ArgoCD revision `60f653a0`, API/Web use `maxSurge=0`, `maxUnavailable=1`, `minDomains=2`, `DoNotSchedule`, and both deployments are split `mon` / `mon1`. Public API / governance route smoke passed and 12:59 cold-start returned `PASS=83 WARN=0 BLOCKED=0`. |
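
The placement readbacks in the rows above ("split `mon` / `mon1`" versus both replicas on 120) could be checked with something like the following. It is a minimal sketch: the helper name is hypothetical, and the pod-to-node lists are typed in by hand from the rows above, whereas in practice they would presumably come from a readback such as `kubectl get pods -o wide`.

```python
def spans_min_domains(pod_nodes, min_domains=2):
    """True when the pods sit on at least min_domains distinct nodes."""
    return len(set(pod_nodes)) >= min_domains


# Node placements as recorded in the rows above (entered manually for illustration).
readback = {
    "awoooi-api 12:43 (before hardening)": ["120", "120"],
    "awoooi-api 12:59": ["mon", "mon1"],
    "awoooi-web 12:59": ["mon", "mon1"],
}

for workload, pod_nodes in readback.items():
    status = "spread" if spans_min_domains(pod_nodes) else "concentrated"
    print(f"{workload}: {status}")
```

The default of `2` mirrors the `minDomains=2` setting in the rows above; the check looks only at the final placement and says nothing about how the rollout reached it.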
---