docs(awooop): record t150 rollout evidence [skip ci]
@@ -1,3 +1,62 @@
## 2026-05-24|T150 CD rollout-risk resource evidence
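
**Trigger**:

- T149 cleared the noise from old failed CronJob history, and the ArgoCD application is back to Healthy.
- However, the CD rollout-risk summary still carries only the top-level `argocd_health_not_healthy health=Degraded`; if the application becomes Degraded again, Telegram / AwoooP still cannot tell which ArgoCD resource node is unhealthy.
- The original notification title `AWOOOI 部署風險已恢復` ("AWOOOI deploy risk recovered") is also misleading; what it actually means is "the deploy completed, but there is risk evidence to review".
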
**Fix**:

- `.gitea/workflows/cd.yaml`
- Added `collect_argocd_resource_evidence()`: when ArgoCD is synced but health is not Healthy, it reads the first 5 nodes with `sync != Synced` or `health != Healthy` from the Application's `.status.resources`.
- The rollout-risk summary used to write only:
  - `argocd_health_not_healthy health=Degraded revision=...`
- It can now write:
  - `argocd_health_not_healthy health=Degraded revision=... resources=CronJob/km-vectorize ns=awoooi-prod sync=Synced health=Degraded msg=...`
- If the top level is Degraded but no unhealthy node is visible in the Application status, it writes `resources=none_visible`, so the result does not turn into a black box again.
- The notification title is now `AWOOOI 部署完成但仍有風險證據` ("AWOOOI deploy completed, but risk evidence remains").

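The node selection and summary format described above could be sketched roughly as follows. This is a Python illustration only: the real `collect_argocd_resource_evidence()` lives in the bash block of `.gitea/workflows/cd.yaml` and uses a go-template, which is not reproduced here. The function name `collect_resource_evidence`, the `"; "` separator between multiple nodes, the `-` and `?` placeholders, the treatment of nodes without a health field, and the `Deployment/api` sample node are all assumptions. The field names (`status` for sync, `health.status`, `health.message`) follow the ArgoCD Application resource status as commonly exposed and should be checked against the live object.

```python
from typing import Any

MAX_RESOURCES = 5  # the entry above says the first 5 matching nodes are read


def is_flagged(res: dict[str, Any]) -> bool:
    """True when a node has sync != Synced or health != Healthy."""
    sync = res.get("status")
    health = (res.get("health") or {}).get("status")
    # Assumption: a node that lacks a sync or health field is not flagged.
    return (sync is not None and sync != "Synced") or (
        health is not None and health != "Healthy"
    )


def collect_resource_evidence(app: dict[str, Any], limit: int = MAX_RESOURCES) -> str:
    """Build the `resources=...` fragment from Application .status.resources.

    Meant to be called only when the top-level health is not Healthy.
    """
    resources = (app.get("status") or {}).get("resources") or []
    flagged = [r for r in resources if is_flagged(r)][:limit]
    if not flagged:
        # Top-level health is bad, but no unhealthy node is visible.
        return "resources=none_visible"
    parts = []
    for r in flagged:
        health = r.get("health") or {}
        parts.append(
            f"{r.get('kind', '?')}/{r.get('name', '?')}"
            f" ns={r.get('namespace', '-')}"
            f" sync={r.get('status', 'Unknown')}"
            f" health={health.get('status', 'Unknown')}"
            f" msg={health.get('message', '-')}"
        )
    # Assumption: multiple nodes are joined with "; "; the real separator is not documented.
    return "resources=" + "; ".join(parts)


# Hypothetical sample: one healthy node and the CronJob named in the entry above.
sample_app = {
    "status": {
        "resources": [
            {"kind": "Deployment", "name": "api", "namespace": "awoooi-prod",
             "status": "Synced", "health": {"status": "Healthy"}},
            {"kind": "CronJob", "name": "km-vectorize", "namespace": "awoooi-prod",
             "status": "Synced",
             "health": {"status": "Degraded",
                        "message": "CronJob has not completed its last execution successfully"}},
        ]
    }
}
print(collect_resource_evidence(sample_app))
```

With the sample above, only the `km-vectorize` node is reported; an Application with no flagged nodes yields `resources=none_visible`.
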
**Verification**:

```text
Local:
ruby YAML parse .gitea/workflows/cd.yaml -> pass
git diff --check -> pass
extracted ArgoCD wait bash block -> bash -n pass

notify dry-run:
alertname=CI_rollout_risk_pending
summary=AWOOOI 部署完成但仍有風險證據
description contains resources=CronJob/km-vectorize ... health=Degraded ...

Production read-only probe:
current ArgoCD app sync=Synced health=Healthy revision=a282eb8c...
same go-template against live Application returned empty resource evidence, as expected while healthy

Gitea:
b98f93a6 fix(ci): include argocd resource evidence in rollout risk
#2916 code-review -> success
```

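**Assessment**:

- T150 does not change the deploy verdict and does not turn Degraded into success; it only lets the next rollout-risk report carry concrete ArgoCD resource evidence.
- A workflow-only change does not trigger a runtime image deploy; it has been pushed to Gitea main and not to GitHub.
- As a possible follow-up, the CI/CD Telegram templates could also be split into three wordings, "auto-recovered / completed but risk needs review / failed and aborted", so the operator can tell at a glance whether manual action is needed.
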
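The proposed three-way split of the notification wording could look roughly like the sketch below. It is not part of the T150 change: the enum `Wording`, the function `choose_wording`, and the three boolean inputs are hypothetical names chosen for illustration, and the rule that a clean deploy needs none of the three messages is an assumption.

```python
from enum import Enum
from typing import Optional


class Wording(Enum):
    AUTO_RECOVERED = "auto-recovered"
    COMPLETED_WITH_RISK = "completed, but risk evidence needs review"
    FAILED_ABORTED = "failed and aborted"


def choose_wording(
    deploy_succeeded: bool, risk_seen: bool, risk_cleared: bool
) -> Optional[Wording]:
    """Pick one of the three wordings; None means no risk message is needed."""
    if not deploy_succeeded:
        return Wording.FAILED_ABORTED
    if risk_seen and not risk_cleared:
        return Wording.COMPLETED_WITH_RISK
    if risk_seen and risk_cleared:
        return Wording.AUTO_RECOVERED
    # Assumption: a clean deploy with no risk signal uses none of the three wordings.
    return None


def needs_operator(wording: Optional[Wording]) -> bool:
    """Assumption: only the auto-recovered case needs no manual follow-up."""
    return wording in (Wording.COMPLETED_WITH_RISK, Wording.FAILED_ABORTED)
```

Keeping the decision in one small function would let the template layer map each `Wording` value to its own title, which is the "tell at a glance" goal stated above.
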
**Overall progress so far**:

- AwoooP alert observability chain: 99.998%.
- Frontend AI automation management UI sync: 99.9997%.
- Homepage evidence card availability: 99.5%.
- AI route status response protection: 96%.
- AI provider runtime health: 60% (GCP-A is still down; GCP-B/111 available).
- Pipeline / deploy-stage observability: 95.5%.
- Deploy rollout-risk observability: 99.1%.
- Execution evidence / Ansible attribution: 55%.
- Full AI automation management productization: 99.982%.

## 2026-05-24|T149 ArgoCD degraded rollout-risk cleanup

**Trigger**:

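@@ -2665,6 +2665,13 @@ After Phase 6 is complete
- Assessment: T135 has converged runner ownership from two runners competing for jobs to the host runner as the single owner. The next step should not re-enable the Docker-wrapped runner; it should instead cover runner pool / repo label isolation, layering of the API image's `apt-get` / `chown -R` steps, Web build cache/offload, and Playwright apt source-list hygiene.
- Progress update: AwoooP alert observability chain about 99.998%; incident-level source correlation visibility about 98.8%; source correlation apply state-chain verifiability about 99.72%; source correlation freshness / rolling gate about 98.2%; frontend AI automation management UI sync about 99.999%; Dashboard snapshot / SSE console noise reduction about 99.2%; CI/CD runner hygiene about 99.2%; runner ownership convergence about 96%; build host pressure governance about 82%; full AI automation management productization about 99.960%.
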
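**T150 CD rollout-risk resource evidence (2026-05-24, Taipei)**:

- Trigger: T149 cleared the noise from old failed CronJob history and the ArgoCD application is back to Healthy, but the CD rollout-risk summary still carries only the top-level `argocd_health_not_healthy health=Degraded`. The next time the application is Degraded, Telegram / AwoooP still cannot tell which ArgoCD resource node is unhealthy. The original notification title `AWOOOI 部署風險已恢復` ("AWOOOI deploy risk recovered") is also misleading; the intended meaning is "deploy completed, but risk evidence remains".
- Fix: `.gitea/workflows/cd.yaml` adds `collect_argocd_resource_evidence()`. When ArgoCD is synced but health is not Healthy, it reads the first 5 nodes with `sync != Synced` or `health != Healthy` from the Application's `.status.resources` and writes `resources=Kind/name ns=... sync=... health=... msg=...`; if the top level is Degraded but no node is visible, it writes `resources=none_visible`. The rollout-risk notification title is now `AWOOOI 部署完成但仍有風險證據` ("AWOOOI deploy completed, but risk evidence remains").
- Verification: workflow YAML parse and `git diff --check` pass; the extracted ArgoCD wait bash block passes `bash -n`; a `notify-awoooi-cicd.sh` dry-run confirms that the summary / description of the `CI_rollout_risk_pending` payload can carry resource detail. A production read-only probe shows the current `awoooi-prod sync=Synced health=Healthy revision=a282eb8c...`, and the same go-template against the live Application returns an empty list, as expected while healthy. `b98f93a6 fix(ci): include argocd resource evidence in rollout risk` has been pushed to Gitea main; #2916 code-review succeeded. This is a workflow-only change, so it does not trigger a runtime image deploy and was not pushed to GitHub.
- Assessment: T150 does not relax the deploy verdict and does not treat Degraded as success; it upgrades the next rollout-risk report from a black-box top-level health value to traceable ArgoCD resource evidence. As a follow-up, the CI/CD Telegram wording could be split into three categories: "auto-recovered / completed but risk needs review / failed and aborted".
- Progress update: AwoooP alert observability chain about 99.998%; frontend AI automation management UI sync about 99.9997%; homepage evidence card availability about 99.5%; AI route status response protection about 96%; AI provider runtime health about 60% (GCP-A down; GCP-B/111 available); pipeline / deploy-stage observability about 95.5%; deploy rollout-risk observability about 99.1%; execution evidence / Ansible attribution about 55%; full AI automation management productization about 99.982%.
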
**T149 ArgoCD degraded rollout-risk cleanup (2026-05-24, Taipei)**:

- Trigger: Every deploy round from T143 to T148 succeeded, but CD kept sending `AWOOOI_ROLLOUT_RISK=1` because the ArgoCD top-level `Application.health=Degraded`. A live check confirmed that all deployments were healthy and that the old Failed Jobs came from `drift-scanner` and `km-vectorize`. `drift-scanner` has since succeeded on consecutive runs, while the `km-vectorize` CronJob status still recorded the failed run from 2026-05-23 19:00, so the ArgoCD resource tree showed `CronJob has not completed its last execution successfully`.
- Fix: `failedJobsHistoryLimit` is set to 0 for `drift-scanner` / `km-vectorize`, so that old Failed Job objects from governance CronJobs do not drag down ArgoCD app health for long periods. After the deploy, the CronJob controller removed the old Failed Jobs; `kubectl delete cronjob km-vectorize --cascade=orphan` was then used so that ArgoCD self-heal recreates the CronJob from Git and resets the stale status, without deleting existing Jobs / Pods / business data.