docs(ops): record runaway event packet production readback [skip ci]

Your Name
2026-06-18 14:55:47 +08:00
parent 2d278568cb
commit 10425f7f2c
4 changed files with 53 additions and 0 deletions


@@ -72,6 +72,24 @@
- Monitoring / alert / PlayBook / Telegram event packet / live scrape: `100%`.
- Runtime remediation / Telegram live send / Bot API call / host write: `0 / false`. This entry sent no Telegram message, read no secret, killed no process, restarted no service, and changed no firewall or K8s configuration.
### 2026-06-18 14:51 Taipei: Host runaway event packet production deployment / production readback
**Deployment anchor:** `f358a0f6 fix(api): route runaway host alerts to ai event packets` was deployed by Gitea CD `#3150`; the deploy marker is `2d278568 chore(cd): deploy f358a0f [skip ci]`. CD evidence shows the tests job at `3221 passed, 23 skipped`, B5 integration at `5 passed`, the API and Web images both using `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`, the post-deploy alert-chain smoke passing `9/9`, and monitoring coverage at `14/14` jobs up.
**Runtime readback:**
- ArgoCD `awoooi-prod`: `Synced / Healthy`, revision `2d278568cb60e1bd79af4275249df374f220039f`.
- Kubernetes deployments: `awoooi-api 2/2`, `awoooi-web 2/2`, `awoooi-worker 1/1`; the image revision is `f358a0f6c3e614e407dedb6eee89bf10b2bc8173` for all three.
- Production health: `https://awoooi.wooo.work/api/v1/health` returns `healthy / prod / mock_mode=false`.
- Production routes: `/zh-TW/iwooos`, `/zh-TW/governance?tab=automation-inventory`, and `/zh-TW/awooop/tenants` all return `200`; a sampled scan for `codex_delegation`, `source_thread_id`, `批准!`, `192.168.0.110`, and `192.168.0.120` found `0` hits.
- Prometheus: `awoooi_host_runaway_process_monitor_up{host="110"}=1`; both orphan browser group counts are `0`; `awoooi_host_gitea_actions_active_container_count{host="110"}=2`; `awoooi_host_runaway_process_remediation_authorized{host="110"}=0`; `HostRunawayProcessMonitorMissing` and `HostOrphanBrowserSmokeHighCpu` are not firing.
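The Prometheus part of the readback can be checked mechanically. The following is a minimal sketch, not the project's actual verifier: the helper names `parse_exposition` and `readback_mismatches` are hypothetical, and it assumes the samples are available as plain `series value` text lines without timestamps. Only the two values that act as invariants in this entry (`monitor_up=1`, `remediation_authorized=0`) are enforced; the active container count is a point-in-time observation and is deliberately not asserted.

```python
# Invariants taken from the readback above; the helper functions are hypothetical.
EXPECTED = {
    'awoooi_host_runaway_process_monitor_up{host="110"}': 1.0,
    'awoooi_host_runaway_process_remediation_authorized{host="110"}': 0.0,
}


def parse_exposition(text: str) -> dict:
    """Parse simple 'series value' lines, skipping comments and blank lines.

    Assumption: no trailing timestamps and no spaces inside label values.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def readback_mismatches(samples: dict, expected: dict = EXPECTED) -> list:
    """Return one human-readable message per invariant that does not hold."""
    problems = []
    for series, want in expected.items():
        got = samples.get(series)
        if got != want:
            problems.append(f"{series}: expected {want}, got {got}")
    return problems
```

An empty list from `readback_mismatches` means the monitor is up and remediation is still unauthorized; a missing series is reported as a mismatch rather than silently passing.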
**Completion sync:**
- 110 CPU runaway observe -> classify -> alert -> AI event packet -> PlayBook / KM contract: `100%`, deployed to production and read back.
- Live observability: `100%`; the textfile exporter is installed on host 110 and Prometheus is scraping it.
- Runtime remediation: `0% / false`. This is a deliberately retained safety gate: without owner approval, a maintenance window, an evidence ref, a dry-run, and a post-check, no automatic kill, Docker restart, systemd restart, Nginx reload, firewall change, or K8s action is allowed.
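The safety gate in the last item can be expressed as a single predicate over its five preconditions. This is an illustrative sketch only: the `RemediationGate` dataclass, its field names, and both functions are assumptions made for the example, not names taken from the codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RemediationGate:
    """Hypothetical record of the five preconditions named in this entry."""
    owner_approval: bool = False
    maintenance_window: bool = False
    evidence_ref: Optional[str] = None
    dry_run_done: bool = False
    post_check_planned: bool = False


def missing_preconditions(gate: RemediationGate) -> List[str]:
    """List every precondition that is not yet satisfied."""
    missing = []
    if not gate.owner_approval:
        missing.append("owner approval")
    if not gate.maintenance_window:
        missing.append("maintenance window")
    if not gate.evidence_ref:
        missing.append("evidence ref")
    if not gate.dry_run_done:
        missing.append("dry-run")
    if not gate.post_check_planned:
        missing.append("post-check")
    return missing


def remediation_authorized(gate: RemediationGate) -> bool:
    """Fail closed: authorized only when nothing is missing."""
    return not missing_preconditions(gate)
```

The default-constructed gate is closed, which matches the `0% / false` state recorded here: authorization has to be built up explicitly, never assumed.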
**Next step:** If `HostOrphanBrowserSmokeHighCpu` actually fires in the future, handling must proceed in this order: AI triage packet -> remediation dry-run evidence -> owner approval / maintenance window / evidence ref -> gated `SIGTERM` -> post-check -> KM / PlayBook / Verifier writeback. Legitimate CI load is instead routed to capacity / scheduling / runner queue triage, not process remediation.
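The required handling order can be enforced as a strict prefix check, so that no stage runs before the ones ahead of it. The sketch below is hypothetical: the stage identifiers and the `next_stage` helper are names chosen for illustration, mirroring the sequence stated above.

```python
from typing import List, Optional

# Stage order as stated in this entry; the identifiers themselves are illustrative.
STAGES = (
    "ai_triage_packet",
    "remediation_dry_run_evidence",
    "owner_approval_maintenance_window_evidence_ref",
    "gated_sigterm",
    "post_check",
    "km_playbook_verifier_writeback",
)


def next_stage(completed: List[str]) -> Optional[str]:
    """Return the next required stage, or None when the sequence is finished.

    Raises ValueError if `completed` skips or reorders a stage.
    """
    if tuple(completed) != STAGES[: len(completed)]:
        raise ValueError(f"stages out of order: {completed}")
    if len(completed) == len(STAGES):
        return None
    return STAGES[len(completed)]
```

For example, `next_stage(["ai_triage_packet"])` yields the dry-run stage, while a history that jumps straight to `gated_sigterm` is rejected.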
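## 2026-06-18: P2-406B Receipt Readback Owner Review completed locally
**Background:** P2-004 already converged dependency / supply-chain drift into a read-only monitoring readback. The commander requires that every advance keep the goal and direction in view, so this entry ties the daily / weekly / monthly reports, the Telegram receipt owner review, the P2-004 drift monitor, and the P2-403J report truth into a single owner review surface, letting the governance page show the AI Agent division of labor, mutual review, and the runtime boundaries that remain closed.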


@@ -44,6 +44,18 @@ Allowed declaration: runaway Chrome/smoke recurrence guard is live and scraped.
Forbidden declaration: AI runtime remediation is enabled. Remediation remains gated and must not execute without owner approval, maintenance window, evidence ref, dry-run, and post-check.
```
2026-06-18 14:51 production event-packet readback:
```text
Host runaway alert-to-event packet is deployed in production.
Deploy marker: 2d278568 chore(cd): deploy f358a0f [skip ci].
Runtime image: awoooi-api / awoooi-web / awoooi-worker use f358a0f6c3e614e407dedb6eee89bf10b2bc8173.
ArgoCD readback: awoooi-prod Synced / Healthy.
Alert mapping: HostOrphanBrowserSmokeHighCpu -> orphan_browser_smoke_runaway_process; HostCiRunnerLoadSaturation -> ci_runner_load_saturation.
Allowed declaration: monitoring, alert rules, live scrape, AI event packet routing, PlayBook / KM contract, and production deployment are complete for this evidence set.
Forbidden declaration: Telegram send, Bot API call, Gateway queue write, process kill, Docker/systemd restart, Nginx reload, firewall/K8s action, or runtime remediation is authorized.
```
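As a rough illustration of the alert mapping recorded above, the routing can be sketched as a lookup that always emits a read-only packet. The alert names and event types come from the readback text; the `build_event_packet` function and the packet field names are assumptions for this example and do not describe the real `telegram_gateway.py` implementation.

```python
# Alert -> event type pairs as stated in the readback block above.
ALERT_TO_EVENT_TYPE = {
    "HostOrphanBrowserSmokeHighCpu": "orphan_browser_smoke_runaway_process",
    "HostCiRunnerLoadSaturation": "ci_runner_load_saturation",
}


def build_event_packet(alertname: str) -> dict:
    """Build a read-only event packet for a known alert (hypothetical shape)."""
    event_type = ALERT_TO_EVENT_TYPE.get(alertname)
    if event_type is None:
        raise KeyError(f"no dedicated lane for alert: {alertname}")
    return {
        "alertname": alertname,
        "event_type": event_type,
        # Routing never grants write access; remediation stays gated.
        "runtime_write_gate": 0,
        "remediation_authorized": False,
    }
```

Unknown alerts raise instead of falling through to a default lane, which keeps an unmapped alert from being mistaken for one of the two dedicated lanes.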
| Item | 2026-06-18 13:43 Asia/Taipei live result | Verdict |
|------|-------------------------------------------|------|
| Overall recovery readiness | `99%` | `SERVICE_GREEN_DR_ESCROW_BLOCKED` |


@@ -5101,3 +5101,17 @@ Trigger commit `f5cd37b7` 與 deploy marker `0ba92357` 已把 governance UI 的
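- The targeted test `apps/api/tests/test_telegram_message_templates.py` gained two regression cases and reports `59 passed`; `telegram_gateway.py` passes py_compile.
**Verdict:** This is a read-only, message-template closed loop for alert -> AI event packet. It does not authorize Telegram live send, Bot API call, Gateway queue write, host write, or process kill. All cards remain fixed at `runtime_write_gate=0`, and actual remediation must still go through dry-run, owner approval, maintenance window, evidence ref, post-check, and KM / PlayBook writeback.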
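### 2026-06-18 14:51 (Taipei) - §8 / Host CPU AIOps - event packet production deployment readback
**Trigger:** The dedicated lanes for `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` are complete in source and regression tests, so the production truth must be recorded to avoid mistaking locally completed message templates for production availability.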
**Progress:**
- `f358a0f6 fix(api): route runaway host alerts to ai event packets` was deployed to production by Gitea CD `#3150`; deploy marker `2d278568 chore(cd): deploy f358a0f [skip ci]`.
- The CD tests job shows `3221 passed, 23 skipped`; B5 integration `5 passed`; post-deploy alert-chain smoke passed `9/9`; monitoring coverage is `14/14` jobs up.
- ArgoCD `awoooi-prod` reads back `Synced / Healthy`, revision `2d278568cb60e1bd79af4275249df374f220039f`.
- `awoooi-api`, `awoooi-web`, and `awoooi-worker` have all rolled out to image `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`.
- Production health and the `/zh-TW/iwooos`, `/zh-TW/governance?tab=automation-inventory`, and `/zh-TW/awooop/tenants` route smoke checks pass; the sampled forbidden-string scan found `0` hits.
- Prometheus still reads the 110 runaway exporter: `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, `remediation_authorized=0`; the missing / orphan alerts are not firing.
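The sampled forbidden-string scan mentioned in the list above amounts to counting occurrences of a fixed term list in each fetched page body. The sketch below is an assumption about how such a check could look: the term list is the one recorded in the runtime readback for this deployment, while `forbidden_hits` is a hypothetical helper and fetching the pages is left out.

```python
# Terms as recorded in the runtime readback for this deployment.
FORBIDDEN_TERMS = (
    "codex_delegation",
    "source_thread_id",
    "批准!",
    "192.168.0.110",
    "192.168.0.120",
)


def forbidden_hits(body: str, terms=FORBIDDEN_TERMS) -> dict:
    """Return {term: occurrence count} for every forbidden term found in body."""
    return {term: body.count(term) for term in terms if term in body}
```

A clean page yields an empty dict, which corresponds to the `0` hits recorded here; any non-empty result names exactly which internal string leaked.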
**Verdict:** The 110 CPU runaway chain of observe -> classify -> alert -> AI event packet -> PlayBook / KM contract is now available in production. This is still not a runtime remediation authorization: Telegram live send, Bot API call, Gateway queue write, host write, process kill, Docker / systemd restart, Nginx reload, and firewall / K8s action all remain at `0 / false`. If an alert fires in the future, the AI may only first produce a triage packet, dry-run evidence, the owner / maintenance / evidence gate, and a post-check plan; only then may the gated remediation helper execute a minimal `SIGTERM`.


@@ -236,6 +236,15 @@ Evidence: `apps/api/src/services/telegram_gateway.py`、`apps/api/tests/test_tel
Blocked: No for alert-to-event packet; yes for Telegram live send / runtime remediation by design.
Next: After code review / CD, perform the production readback; if an alert actually fires in the future, confirm that the Telegram card and the AwoooP Run truth-chain both present the same lane.
2026-06-18 14:51 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet production readback
Before: `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` already had source + tests, but production deployment and runtime revision readback were not yet complete.
After: `f358a0f6` was deployed by Gitea CD `#3150`; deploy marker `2d278568`. ArgoCD `awoooi-prod` is `Synced / Healthy`; the API / Web / Worker images are all `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`; production health and the IwoooS / Governance / AwoooP tenants routes all return `200`, with `0` hits in the sensitive-string sample.
Evidence: Gitea CD `#3150` tests `3221 passed, 23 skipped`, B5 integration `5 passed`, post-deploy alert-chain smoke `9/9`, monitoring coverage `14/14` jobs up; Prometheus still reads 110 `monitor_up=1`, orphan browser group count `0`, CI active containers `2`, `remediation_authorized=0`; the missing / orphan alerts are not firing.
Blocked: No for production alert-to-event packet deployment; yes for runtime remediation by design.
Next: Future firing alert must produce AI triage packet and dry-run evidence first. `HostCiRunnerLoadSaturation` remains capacity / runner scheduling triage, not process kill. Runtime remediation remains `0` until owner approval, maintenance window, evidence ref, gated SIGTERM, post-check, and KM / PlayBook / Verifier writeback exist.
Completion: host runaway monitoring / alert / PlayBook / Telegram event packet / production deploy readback 100%; runtime auto-remediation remains safely gated at 0.
2026-06-18 13:43 Asia/Taipei
Phase: P1/P2/P3 live readback
Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning.