docs(awooop): record t150 rollout evidence [skip ci]

Your Name
2026-05-24 14:28:43 +08:00
parent b98f93a62f
commit 17b62da59a
2 changed files with 66 additions and 0 deletions

@@ -1,3 +1,62 @@
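## 2026-05-24 T150 CD rollout-risk resource evidence
**Trigger**
- T149 cleared the noise from old CronJob failed history, and the ArgoCD application returned to Healthy.
- However, the CD rollout-risk summary still carries only the top-level `argocd_health_not_healthy health=Degraded`; if the application becomes Degraded again, Telegram / AwoooP still cannot show which ArgoCD resource node is unhealthy.
- The original notification title `AWOOOI 部署風險已恢復` ("AWOOOI deployment risk recovered") is also misleading; the actual meaning is "deployment completed, but there is risk evidence to review".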
**Fix**
- `.gitea/workflows/cd.yaml`
  - Added `collect_argocd_resource_evidence()`. When ArgoCD is synced but health is not Healthy, it reads the first 5 nodes with `sync != Synced` or `health != Healthy` from the Application's `.status.resources`.
  - The rollout-risk summary previously contained only:
    - `argocd_health_not_healthy health=Degraded revision=...`
  - It can now contain:
    - `argocd_health_not_healthy health=Degraded revision=... resources=CronJob/km-vectorize ns=awoooi-prod sync=Synced health=Degraded msg=...`
  - If the top level is Degraded but no unhealthy node is visible in the Application status, it writes `resources=none_visible` so the result does not become a black box again.
- Notification title changed to `AWOOOI 部署完成但仍有風險證據` ("AWOOOI deployment completed, but risk evidence remains").
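
The selection rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the actual bash / go-template code in `cd.yaml`: the function name `collect_resource_evidence`, the ` | ` separator between multiple nodes, the `ns=-` placeholder, and the handling of nodes without a sync or health field are all hypothetical. The input is assumed to be the Application object as parsed JSON, where each entry in `.status.resources` carries `kind`, `name`, `namespace`, `status` (sync) and an optional `health` block.

```python
from typing import Any

MAX_RESOURCES = 5  # the changelog says the first 5 matching nodes are read


def collect_resource_evidence(app: dict[str, Any], limit: int = MAX_RESOURCES) -> str:
    """Build a `resources=...` fragment for a rollout-risk summary (sketch)."""
    resources = (app.get("status") or {}).get("resources") or []
    parts: list[str] = []
    for res in resources:
        sync = res.get("status") or ""
        health = res.get("health") or {}
        health_status = health.get("status") or ""
        # Assumption: a node with no sync or health value is not treated as evidence.
        out_of_sync = bool(sync) and sync != "Synced"
        unhealthy = bool(health_status) and health_status != "Healthy"
        if not (out_of_sync or unhealthy):
            continue
        parts.append(
            f"{res.get('kind', '')}/{res.get('name', '')} "
            f"ns={res.get('namespace') or '-'} "  # '-' placeholder is an assumption
            f"sync={sync} health={health_status} "
            f"msg={health.get('message') or ''}"
        )
        if len(parts) >= limit:
            break
    if not parts:
        # Top level is unhealthy but no node explains it.
        return "resources=none_visible"
    # Assumption: multiple nodes are joined with ' | '.
    return "resources=" + " | ".join(parts)
```

Such a helper would be called only after the top-level health is already known to be non-Healthy, which is why an empty result maps to `resources=none_visible` rather than to an empty string.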
**Verification**
```text
Local:
  ruby YAML parse .gitea/workflows/cd.yaml -> pass
  git diff --check -> pass
  extracted ArgoCD wait bash block -> bash -n pass
  notify dry-run:
    alertname=CI_rollout_risk_pending
    summary=AWOOOI 部署完成但仍有風險證據
    description contains resources=CronJob/km-vectorize ... health=Degraded ...
Production read-only probe:
  current ArgoCD app sync=Synced health=Healthy revision=a282eb8c...
  same go-template against live Application returned empty resource evidence, as expected while healthy
Gitea:
  b98f93a6 fix(ci): include argocd resource evidence in rollout risk
  #2916 code-review -> success
```
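**Assessment**
- T150 does not change the deployment verdict and does not turn Degraded into success; it only lets the next rollout-risk report carry concrete ArgoCD resource evidence.
- Workflow-only changes do not trigger a runtime image deploy. Pushed to Gitea main; not pushed to GitHub.
- As a possible follow-up, the CI/CD Telegram template could also be split into three message types, "auto-recovered / completed but risk needs review / failed and aborted", so operators can tell at a glance whether manual action is needed.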
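**Current overall progress**
- AwoooP alert observability chain: 99.998%.
- Frontend AI automation management UI sync: 99.9997%.
- Homepage evidence card availability: 99.5%.
- AI route status response protection: 96%.
- AI provider runtime health: 60% (GCP-A still down; GCP-B/111 available).
- Pipeline / deployment stage observability: 95.5%.
- Deploy rollout-risk observability: 99.1%.
- Execution evidence / Ansible attribution: 55%.
- Full AI automation management productization: 99.982%.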
## 2026-05-24 T149 ArgoCD degraded rollout-risk cleanup
**Trigger**

@@ -2665,6 +2665,13 @@ After Phase 6 completes
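- Assessment: T135 converged runner ownership from two runners competing for jobs to a single host runner in control. The next step should not re-enable the Docker-wrapped runner; instead, work on runner pool / repo label isolation, layering of the API image `apt-get` / `chown -R` steps, Web build cache/offload, and Playwright apt source-list hygiene.
- Progress update: AwoooP alert observability chain about 99.998%; incident-level source correlation visibility about 98.8%; source correlation apply status chain verifiability about 99.72%; source correlation freshness / rolling gate about 98.2%; frontend AI automation management UI sync about 99.999%; Dashboard snapshot / SSE console noise reduction about 99.2%; CI/CD runner hygiene about 99.2%; runner ownership convergence about 96%; build host pressure governance about 82%; full AI automation management productization about 99.960%.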
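**T150 CD rollout-risk resource evidence (2026-05-24, Taipei)**
- Trigger: T149 cleared the noise from old CronJob failed history and the ArgoCD application returned to Healthy, but the CD rollout-risk summary still carries only the top-level `argocd_health_not_healthy health=Degraded`, so on the next Degraded state Telegram / AwoooP still cannot show which ArgoCD resource node is unhealthy. The original notification title `AWOOOI 部署風險已恢復` ("AWOOOI deployment risk recovered") is also misleading; the actual meaning should be "deployment completed, but risk evidence remains".
- Fix: `.gitea/workflows/cd.yaml` adds `collect_argocd_resource_evidence()`. When ArgoCD is synced but health is not Healthy, it reads the first 5 nodes with `sync != Synced` or `health != Healthy` from the Application's `.status.resources` and writes `resources=Kind/name ns=... sync=... health=... msg=...`; if the top level is Degraded but no such node is visible, it writes `resources=none_visible`. The rollout-risk notification title is changed to `AWOOOI 部署完成但仍有風險證據` ("AWOOOI deployment completed, but risk evidence remains").
- Verification: workflow YAML parse / `git diff --check` pass; the extracted ArgoCD wait bash block passes `bash -n`; a `notify-awoooi-cicd.sh` dry-run confirms that the summary / description of the `CI_rollout_risk_pending` payload can carry resource detail. A production read-only probe shows the current state `awoooi-prod sync=Synced health=Healthy revision=a282eb8c...`, and the same go-template against the live Application returns an empty list, consistent with the healthy state. `b98f93a6 fix(ci): include argocd resource evidence in rollout risk` has been pushed to Gitea main; #2916 code-review success. This is a workflow-only change and does not trigger a runtime image deploy; not pushed to GitHub.
- Assessment: T150 does not weaken the deployment verdict and does not treat Degraded as success; it upgrades the next rollout-risk report from a black-box top-level health value to traceable ArgoCD resource evidence. A follow-up could split the CI/CD Telegram wording into three categories: "auto-recovered / completed but risk needs review / failed and aborted".
- Progress update: AwoooP alert observability chain about 99.998%; frontend AI automation management UI sync about 99.9997%; homepage evidence card availability about 99.5%; AI route status response protection about 96%; AI provider runtime health about 60% (GCP-A down; GCP-B/111 available); pipeline / deployment stage observability about 95.5%; deploy rollout-risk observability about 99.1%; execution evidence / Ansible attribution about 55%; full AI automation management productization about 99.982%.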
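**T149 ArgoCD degraded rollout-risk cleanup (2026-05-24, Taipei)**
- Trigger: every deploy round from T143 to T148 succeeded, but CD kept sending `AWOOOI_ROLLOUT_RISK=1` because the ArgoCD top-level `Application.health=Degraded`. Live inspection showed all deployments healthy; the old Failed Jobs came from `drift-scanner` and `km-vectorize`. `drift-scanner` has since succeeded consecutively, while the `km-vectorize` CronJob status still recorded the previous failed run at 2026-05-23 19:00, so the ArgoCD resource tree showed `CronJob has not completed its last execution successfully`.
- Fix: set `failedJobsHistoryLimit` to 0 for `drift-scanner` / `km-vectorize`, so that old Failed Job objects from governance CronJobs do not drag down ArgoCD app health for long periods. After the deploy, the CronJob controller cleaned up the old Failed Jobs; `kubectl delete cronjob km-vectorize --cascade=orphan` was then used so that ArgoCD self-heal recreates the CronJob from Git and resets the stale status, without deleting existing Jobs / Pods / business data.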