From 17b62da59a467c0f38a478c8a09b9fab9eaf4f27 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Sun, 24 May 2026 14:28:43 +0800
Subject: [PATCH] docs(awooop): record t150 rollout evidence [skip ci]

---
 docs/LOGBOOK.md | 59 +++++++++++++++++++
 ...-04-15-MASTER-ai-autonomous-flywheel-v2.md | 7 +++
 2 files changed, 66 insertions(+)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index aae55a51..7bedd47d 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,62 @@
+## 2026-05-24|T150 CD rollout-risk resource evidence
+
+**Trigger**:
+
+- T149 cleared the noise from stale CronJob failed history, and the ArgoCD application returned to Healthy.
+- However, the CD rollout-risk summary still carried only the top-level `argocd_health_not_healthy health=Degraded`; if the application becomes Degraded again, Telegram / AwoooP still cannot see which ArgoCD resource node is unhealthy.
+- The original notification title `AWOOOI 部署風險已恢復` ("AWOOOI deployment risk recovered") was also misleading; what it actually means is "the deployment completed, but there is risk evidence to review".
+
+**Fix**:
+
+- `.gitea/workflows/cd.yaml`
+  - Added `collect_argocd_resource_evidence()`: when ArgoCD is synced but health is not Healthy, it reads the first 5 nodes with `sync != Synced` or `health != Healthy` from the Application's `.status.resources`.
+  - The rollout-risk summary previously wrote only:
+    - `argocd_health_not_healthy health=Degraded revision=...`
+  - It can now write:
+    - `argocd_health_not_healthy health=Degraded revision=... resources=CronJob/km-vectorize ns=awoooi-prod sync=Synced health=Degraded msg=...`
+  - If the top-level health is Degraded but no unhealthy node is visible in the Application status, it writes `resources=none_visible` so the result does not become a black box again.
+  - The notification title was changed to `AWOOOI 部署完成但仍有風險證據` ("AWOOOI deployment completed but risk evidence remains").
+
+**Verification**:
+
+```text
+Local:
+ruby YAML parse .gitea/workflows/cd.yaml -> pass
+git diff --check -> pass
+extracted ArgoCD wait bash block -> bash -n pass
+
+notify dry-run:
+alertname=CI_rollout_risk_pending
+summary=AWOOOI 部署完成但仍有風險證據
+description contains resources=CronJob/km-vectorize ... health=Degraded ...
+
+Production read-only probe:
+current ArgoCD app sync=Synced health=Healthy revision=a282eb8c...
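+same go-template against live Application returned empty resource evidence, as expected while healthy
+
+Gitea:
+b98f93a6 fix(ci): include argocd resource evidence in rollout risk
+#2916 code-review -> success
+```
+
+**Assessment**:
+
+- T150 did not change the deployment verdict, nor did it turn Degraded into success; it only lets the next rollout-risk carry concrete ArgoCD resource evidence.
+- A workflow-only change does not trigger a runtime image deploy; it was pushed to Gitea main and not pushed to GitHub.
+- As a possible next step, the CI/CD Telegram templates could be split into three wordings, "auto-recovered / completed but risk needs review / failed and aborted", so that an operator can tell at a glance whether manual action is needed.
+
+**Current overall progress**:
+
+- AwoooP alert observability chain: 99.998%.
+- Frontend AI automation management UI sync: 99.9997%.
+- Homepage evidence card availability: 99.5%.
+- AI route status response protection: 96%.
+- AI provider runtime health: 60% (GCP-A still down; GCP-B/111 available).
+- Pipeline / deployment stage observability: 95.5%.
+- Deploy rollout-risk observability: 99.1%.
+- Execution evidence / Ansible attribution: 55%.
+- Full AI automation management productization: 99.982%.
+
 ## 2026-05-24|T149 ArgoCD degraded rollout-risk cleanup
 
 **Trigger**:
diff --git a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
index 49a43586..45c879bb 100644
--- a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
+++ b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
@@ -2665,6 +2665,13 @@ After Phase 6 is complete
 - Assessment: T135 consolidated runner ownership from two runners competing for jobs down to a single controlling host runner; the next step should not re-enable the Docker-wrapped runner, but instead work on runner pool / repo label isolation, layering of `apt-get` / `chown -R` in the API image, Web build cache/offload, and Playwright apt source-list hygiene.
 - Progress update: AwoooP alert observability chain about 99.998%; incident-level source correlation visibility about 98.8%; source correlation apply status chain verifiability about 99.72%; source correlation freshness / rolling gate about 98.2%; frontend AI automation management UI sync about 99.999%; dashboard snapshot / SSE console noise reduction about 99.2%; CI/CD runner hygiene about 99.2%; runner ownership consolidation about 96%; build host pressure governance about 82%; full AI automation management productization about 99.960%.
 
+**T150 CD rollout-risk resource evidence (2026-05-24, Taipei)**:
+- Trigger: T149 cleared the noise from stale CronJob failed history, and the ArgoCD application returned to Healthy; however, the CD rollout-risk summary still carried only the top-level `argocd_health_not_healthy health=Degraded`, so on the next Degraded state Telegram / AwoooP still could not see which ArgoCD resource node is unhealthy. The original notification title `AWOOOI 部署風險已恢復` ("AWOOOI deployment risk recovered") was also misleading; the intended meaning is "deployment completed but risk evidence remains".
+- Fix: `.gitea/workflows/cd.yaml` adds `collect_argocd_resource_evidence()`, which, when ArgoCD is synced but health is not Healthy, reads the first 5 nodes with `sync != Synced` or `health != Healthy` from the Application's `.status.resources` and writes them as `resources=Kind/name ns=... sync=... health=... msg=...`; if the top-level health is Degraded but no node is visible, it writes `resources=none_visible`. The rollout-risk notification title was changed to `AWOOOI 部署完成但仍有風險證據` ("AWOOOI deployment completed but risk evidence remains").
+- Verification: workflow YAML parse / `git diff --check` pass; the extracted ArgoCD wait bash block passes `bash -n`; a `notify-awoooi-cicd.sh` dry-run confirmed that the summary / description of the `CI_rollout_risk_pending` payload can carry resource detail. A production read-only probe showed the current `awoooi-prod sync=Synced health=Healthy revision=a282eb8c...`, and the same go-template against the live Application returned an empty list, consistent with a healthy state. `b98f93a6 fix(ci): include argocd resource evidence in rollout risk` has been pushed to Gitea main; #2916 code-review success. This is a workflow-only change: it does not trigger a runtime image deploy and was not pushed to GitHub.
+- Assessment: T150 did not relax the deployment verdict, nor did it treat Degraded as success; it upgrades the next rollout-risk from black-box top-level health to traceable ArgoCD resource evidence. As a next step, the CI/CD Telegram wording could be split into three categories: "auto-recovered / completed but risk needs review / failed and aborted".
+- Progress update: AwoooP alert observability chain about 99.998%; frontend AI automation management UI sync about 99.9997%; homepage evidence card availability about 99.5%; AI route status response protection about 96%; AI provider runtime health about 60% (GCP-A down; GCP-B/111 available); pipeline / deployment stage observability about 95.5%; deploy rollout-risk observability about 99.1%; execution evidence / Ansible attribution about 55%; full AI automation management productization about 99.982%.
+
 **T149 ArgoCD degraded rollout-risk cleanup (2026-05-24, Taipei)**:
 - Trigger: every deploy in T143-T148 succeeded, yet CD kept sending `AWOOOI_ROLLOUT_RISK=1` because the ArgoCD top-level `Application.health=Degraded`. Live checks showed every deployment healthy; the stale Failed Jobs came from `drift-scanner` and `km-vectorize`. `drift-scanner` has since succeeded on consecutive runs, but the `km-vectorize` CronJob status still recorded the previous failed run at 2026-05-23 19:00, so the ArgoCD resource tree showed `CronJob has not completed its last execution successfully`.
 - Fix: `failedJobsHistoryLimit` for `drift-scanner` / `km-vectorize` was set to 0 so that stale Failed Job objects from the governance CronJobs no longer drag down ArgoCD app health for long periods. After deployment, the CronJob controller removed the stale Failed Jobs; `kubectl delete cronjob km-vectorize --cascade=orphan` was then used so that ArgoCD self-heal recreates the CronJob from Git, resetting the stale status without deleting existing Jobs / Pods / business data.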