docs(logbook): record incident fallback deployment [skip ci]
@@ -1,3 +1,32 @@
## 2026-06-04|Incident degraded-diagnosis misclassification and ACTION REQUIRED convergence
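
**Background**: Telegram showed ACTION REQUIRED notifications for `node-exporter-188` / `node-exporter-110` / `gitea`. The AI diagnosis degraded because of `step_timeout` and mislabelled the incidents as `KubePodOOM(20%)`. Live verification confirmed that this was not a Kubernetes Pod OOM; it was Docker/Gitea memory-pressure alerts combined with stale approval/incident state that had not converged.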

**Live verification and convergence:**
- Alertmanager active alerts: none.
- `INC-20260602-25D945`: the actual alert is `DockerContainerMemoryLimitPressure`, target=`momo-pro-system`.
- `INC-20260603-4EC935` / `INC-20260603-3DAAA5`: the actual alert is `DockerContainerMemoryLimitPressure`, target=`gitea`.
- `INC-20260602-5734BE`: the actual alert is `GiteaMemoryPressure`.
- Direct host checks on 110/188: the relevant containers are running, `OOMKilled=false`.
- Signed off three `NO_ACTION` approvals: `25D945`, `3DAAA5`, `5734BE`.
- Resolved `INC-20260603-4EC935` (approval already `EXECUTION_SUCCESS`, but the incident was stuck in `investigating`).
- Readback: all four show `status=resolved`, latest approval `EXECUTION_SUCCESS`, reconciliation `consistent`, and none of the four remain in pending approvals.
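
The readback conditions above can be expressed as a small check. This is a minimal sketch only: the `IncidentReadback` record, its field names, and the `unconverged` helper are assumptions made for illustration and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class IncidentReadback:
    """Hypothetical readback record; field names are illustrative only."""
    incident_id: str
    status: str
    latest_approval: str
    reconciliation: str


def unconverged(readbacks, pending_ids):
    """Return the incident IDs that fail any of the four readback conditions."""
    failed = []
    for r in readbacks:
        ok = (
            r.status == "resolved"
            and r.latest_approval == "EXECUTION_SUCCESS"
            and r.reconciliation == "consistent"
            and r.incident_id not in pending_ids
        )
        if not ok:
            failed.append(r.incident_id)
    return failed


# The four incidents from this entry, in the state the readback reported.
readbacks = [
    IncidentReadback(i, "resolved", "EXECUTION_SUCCESS", "consistent")
    for i in (
        "INC-20260602-25D945",
        "INC-20260603-4EC935",
        "INC-20260603-3DAAA5",
        "INC-20260602-5734BE",
    )
]
print(unconverged(readbacks, pending_ids=set()))  # prints []
```

An incident that is still `investigating`, or whose ID still appears in the pending set, would be returned by `unconverged` instead of passing silently.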

**Code fixes:**
- `apps/api/src/agents/diagnostician_agent.py`: the degraded diagnosis now preserves the original `alertname` first and no longer infers `KubePodOOM` just because the evidence mentions memory / 記憶體; the message text explicitly states the original alert, target, host, and degraded category.
- `apps/api/src/repositories/approval_repository.py`: the repository version of `get_pending()` now converges expired PENDING approvals, matching the legacy service behaviour.
- Added `apps/api/tests/test_diagnostician_degraded_fallback.py` and `apps/api/tests/test_approval_repository_pending_expiry.py`.
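
The two fixes can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code in `diagnostician_agent.py` or `approval_repository.py`: the function signatures, the `Approval` dataclass, the `MemoryPressureUnclassified` label, and the `EXPIRED` status are all hypothetical, and the real `get_pending()` presumably works against the database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


def degraded_category(alertname, evidence):
    """Pick a category for a degraded (timed-out) diagnosis.

    Assumed behaviour: the original alertname wins whenever it is present;
    evidence keywords are a last resort and never imply KubePodOOM.
    """
    if alertname:
        return alertname
    text = evidence.lower()
    if "memory" in text or "記憶體" in text:
        return "MemoryPressureUnclassified"  # hypothetical generic label
    return "Unknown"


def degraded_summary(alertname, target, host, evidence):
    """Build wording that names the original alert, target, host and category."""
    category = degraded_category(alertname, evidence)
    return (
        f"Degraded diagnosis (step_timeout): alert={alertname or 'n/a'}, "
        f"target={target}, host={host}, category={category}"
    )


@dataclass
class Approval:
    """Hypothetical approval row; only the fields needed for the sketch."""
    approval_id: str
    status: str
    expires_at: datetime


def get_pending(approvals, now):
    """Return live PENDING approvals, marking expired ones EXPIRED first."""
    pending = []
    for a in approvals:
        if a.status != "PENDING":
            continue
        if a.expires_at <= now:
            a.status = "EXPIRED"  # hypothetical terminal state
            continue
        pending.append(a)
    return pending


now = datetime(2026, 6, 4, tzinfo=timezone.utc)
approvals = [
    Approval("stale", "PENDING", now - timedelta(minutes=1)),
    Approval("live", "PENDING", now + timedelta(minutes=5)),
]
print(degraded_category("DockerContainerMemoryLimitPressure", "memory near limit"))
print([a.approval_id for a in get_pending(approvals, now)])
```

Converging expired rows inside the read path means a stale approval cannot keep resurfacing as ACTION REQUIRED, which is the symptom this entry started from.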

**Verification and deployment:**
- Local clean worktree: `python -m py_compile ...` passed.
- `DATABASE_URL=sqlite+aiosqlite:///:memory: PYTHONPATH=apps/api pytest apps/api/tests/test_diagnostician_degraded_fallback.py apps/api/tests/test_approval_repository_pending_expiry.py -q`: 6 passed.
- `DATABASE_URL=sqlite+aiosqlite:///:memory: PYTHONPATH=apps/api pytest apps/api/tests/test_agent_step_timeouts.py -q`: 23 passed; the pre-existing warnings remain.
- Release commit: `0bb4773b fix(aiops): preserve alert identity in degraded diagnosis`, pushed to Gitea `main`.
- The subsequent main commit `46a7fc3f feat(web): compact homepage delivery matrix` includes `0bb4773b`; Gitea CD run `3639`: tests / build-and-deploy / post-deploy-checks all success.
- Deploy marker: `e55f877c chore(cd): deploy 46a7fc3 [skip ci]`.
- Production image: `awoooi-api` / `awoooi-worker` / `awoooi-web` switched to `46a7fc3f...`; API 2/2, Web 2/2, Worker 1/1 available.
- Live smoke: `/api/v1/health` is healthy; the in-Pod degraded-classification smoke returns `DockerContainerMemoryLimitPressure`, and `KubePodOOM` no longer appears.

## 2026-06-04|Code Review → Codex read-only handoff bridge
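
**Background**: The commander (統帥) previously asked whether coding work that follows a Code Review could be linked to Codex. This round adds a visual handoff bridge at `/zh-TW/code-review` so that post-review fixes can be organized into Codex work candidates. It still does not code automatically, release automatically, start a runtime, touch the Kali host, or switch the GitHub primary.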