docs(ops): record runaway event packet production readback [skip ci]
@@ -72,6 +72,24 @@
- Monitoring / alert / PlayBook / Telegram event packet / live scrape: `100%`.
- Runtime remediation / live Telegram send / Bot API call / host write: still `0 / false`; this entry sent no Telegram message, read no secrets, killed no processes, restarted no services, and changed no firewall or K8s settings.

### 2026-06-18 14:51 Taipei | Host runaway event packet production deployment / production readback

**Deploy anchor**: `f358a0f6 fix(api): route runaway host alerts to ai event packets` was deployed by Gitea CD `#3150`; the deploy marker is `2d278568 chore(cd): deploy f358a0f [skip ci]`. CD evidence shows the tests job at `3221 passed, 23 skipped`, B5 integration at `5 passed`, the API / Web images both using `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`, post-deploy alert-chain smoke passing `9/9`, and monitoring coverage at `14/14` jobs up.

**Runtime readback**:
- ArgoCD `awoooi-prod`: `Synced / Healthy`, revision `2d278568cb60e1bd79af4275249df374f220039f`.
- Kubernetes deployments: `awoooi-api 2/2`, `awoooi-web 2/2`, `awoooi-worker 1/1`; all image revisions are `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`.
- Production health: `https://awoooi.wooo.work/api/v1/health` returns `healthy / prod / mock_mode=false`.
- Production routes: `/zh-TW/iwooos`, `/zh-TW/governance?tab=automation-inventory`, and `/zh-TW/awooop/tenants` all return `200`; a sampled scan for `codex_delegation`, `source_thread_id`, `批准!`, `192.168.0.110`, and `192.168.0.120` found `0` hits.
- Prometheus: `awoooi_host_runaway_process_monitor_up{host="110"}=1`, both orphan browser group counts are `0`, `awoooi_host_gitea_actions_active_container_count{host="110"}=2`, `awoooi_host_runaway_process_remediation_authorized{host="110"}=0`; `HostRunawayProcessMonitorMissing` and `HostOrphanBrowserSmokeHighCpu` are not firing.

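The Prometheus part of this readback can be expressed as a small invariant check. The sketch below is illustrative only: the series names are copied from the bullet above, while the `samples` dict layout and the `readback_mismatches` helper are assumptions and not the project's actual checker. Only the two values that act as invariants (exporter up, remediation not authorized) are asserted; the active container count is a snapshot and is left out.

```python
# Invariants taken from the readback entry; how the samples are fetched
# (HTTP API, promtool, etc.) is deliberately out of scope here.
REQUIRED = {
    'awoooi_host_runaway_process_monitor_up{host="110"}': 1,
    'awoooi_host_runaway_process_remediation_authorized{host="110"}': 0,
}


def readback_mismatches(samples: dict[str, float]) -> list[str]:
    """Return one message per required series that is missing or off-value."""
    problems = []
    for series, expected in REQUIRED.items():
        if series not in samples:
            problems.append(f"{series}: missing")
        elif samples[series] != expected:
            problems.append(f"{series}: got {samples[series]}, expected {expected}")
    return problems
```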
**Completion sync**:
- 110 CPU runaway observe -> classify -> alert -> AI event packet -> PlayBook / KM contract: `100%`, deployed to production and read back.
- Live observability: `100%`; the textfile exporter is installed on 110 and Prometheus is scraping it.
- Runtime remediation: still `0% / false`. This is a deliberately retained safety gate: without owner approval, a maintenance window, an evidence ref, a dry-run, and a post-check, no automatic kill, Docker restart, systemd restart, Nginx reload, firewall change, or K8s action is allowed.

**Next step**: If `HostOrphanBrowserSmokeHighCpu` actually fires in the future, the handling order must be AI triage packet -> remediation dry-run evidence -> owner approval / maintenance window / evidence ref -> gated `SIGTERM` -> post-check -> KM / PlayBook / Verifier writeback. Legitimate CI load is instead routed to capacity / scheduling / runner queue triage, not process remediation.

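As a rough illustration of the ordering above, the gate in front of `SIGTERM` can be modeled as a set of prerequisites that must all be present. This is a minimal sketch under stated assumptions: `GateState`, its field names, and both helpers are hypothetical and do not describe the real remediation helper. Nothing here sends a signal; it only reports whether the gate would open.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GateState:
    # Field order mirrors the handling order in the text (hypothetical names).
    ai_triage_packet: bool = False
    dry_run_evidence: bool = False
    owner_approval: bool = False
    maintenance_window: bool = False
    evidence_ref: bool = False


def missing_prerequisites(state: GateState) -> list[str]:
    """List, in handling order, every prerequisite that is not yet satisfied."""
    return [f.name for f in fields(state) if not getattr(state, f.name)]


def sigterm_allowed(state: GateState) -> bool:
    """The gate opens only when no prerequisite is missing."""
    return not missing_prerequisites(state)
```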
## 2026-06-18 | P2-406B Receipt Readback Owner Review completed locally

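**Background**: P2-004 has already narrowed dependency / supply-chain drift down to a read-only monitoring readback. The commander requires that every advance keep the goal and direction in view, so this entry ties the daily / weekly / monthly reports, the Telegram receipt owner review, the P2-004 drift monitor, and the P2-403J report truth into a single owner review surface, letting the governance page show directly the AI Agent division of labor, cross-review, and the runtime boundaries that remain closed.
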
@@ -44,6 +44,18 @@ Allowed declaration: runaway Chrome/smoke recurrence guard is live and scraped.
Forbidden declaration: AI runtime remediation is enabled. Remediation remains gated and must not execute without owner approval, maintenance window, evidence ref, dry-run, and post-check.
```
2026-06-18 14:51 production event-packet readback:
```text
Host runaway alert-to-event packet is deployed in production.
Deploy marker: 2d278568 chore(cd): deploy f358a0f [skip ci].
Runtime image: awoooi-api / awoooi-web / awoooi-worker use f358a0f6c3e614e407dedb6eee89bf10b2bc8173.
ArgoCD readback: awoooi-prod Synced / Healthy.
Alert mapping: HostOrphanBrowserSmokeHighCpu -> orphan_browser_smoke_runaway_process; HostCiRunnerLoadSaturation -> ci_runner_load_saturation.
Allowed declaration: monitoring, alert rules, live scrape, AI event packet routing, PlayBook / KM contract, and production deployment are complete for this evidence set.
Forbidden declaration: Telegram send, Bot API call, Gateway queue write, process kill, Docker/systemd restart, Nginx reload, firewall/K8s action, or runtime remediation is authorized.
```
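
The alert mapping in the readback block above amounts to a lookup from alert name to lane. The following is a hedged sketch of that routing: the two alert names, the two lane names, and the `runtime_write_gate=0` value come from this commit, but `build_event_packet` and the packet field names are hypothetical and are not the code in `telegram_gateway.py`.

```python
# Alert name -> lane, as stated in the readback block.
ALERT_LANES = {
    "HostOrphanBrowserSmokeHighCpu": "orphan_browser_smoke_runaway_process",
    "HostCiRunnerLoadSaturation": "ci_runner_load_saturation",
}


def build_event_packet(alert_name: str) -> dict:
    """Build a read-only event packet; unknown alerts are rejected, not guessed."""
    lane = ALERT_LANES.get(alert_name)
    if lane is None:
        raise KeyError(f"no dedicated lane for alert {alert_name!r}")
    # The write gate stays closed: routing an alert never authorizes remediation.
    return {"alert": alert_name, "lane": lane, "runtime_write_gate": 0}
```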
| Item | 2026-06-18 13:43 Asia/Taipei live result | Verdict |
|------|-------------------------------------------|------|
| Overall recovery readiness | `99%` | `SERVICE_GREEN_DR_ESCROW_BLOCKED` |

@@ -5101,3 +5101,17 @@ Trigger commit `f5cd37b7` 與 deploy marker `0ba92357` 已把 governance UI 的
- Targeted test `apps/api/tests/test_telegram_message_templates.py` gained two new regression cases, `59 passed`; `telegram_gateway.py` passes py_compile.

**Verdict:** This closes the read-only alert -> AI event packet and message-template loop; it does not authorize a live Telegram send, Bot API call, Gateway queue write, host write, or process kill. All cards remain fixed at `runtime_write_gate=0`, and real remediation must still go through dry-run, owner approval, maintenance window, evidence ref, post-check, and KM / PlayBook writeback.

### 2026-06-18 14:51 (Taipei) - §8 / Host CPU AIOps - event packet production deployment readback

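**Trigger**: The dedicated lanes for `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` are complete in source and regression tests; production truth must be added so that local message-template completion is not mistaken for production availability.
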
**Progress:**
- `f358a0f6 fix(api): route runaway host alerts to ai event packets` was deployed to production by Gitea CD `#3150`, with deploy marker `2d278568 chore(cd): deploy f358a0f [skip ci]`.
- The CD tests job shows `3221 passed, 23 skipped` and B5 integration `5 passed`; post-deploy alert-chain smoke passed `9/9`, and monitoring coverage is `14/14` jobs up.
- ArgoCD `awoooi-prod` reads back `Synced / Healthy`, revision `2d278568cb60e1bd79af4275249df374f220039f`.
- `awoooi-api`, `awoooi-web`, and `awoooi-worker` have all rolled out to image `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`.
- Production health and the `/zh-TW/iwooos`, `/zh-TW/governance?tab=automation-inventory`, and `/zh-TW/awooop/tenants` route smoke checks pass; the sampled forbidden-string scan found `0` hits.
- Prometheus still reads the 110 runaway exporter: `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, `remediation_authorized=0`; the missing / orphan alerts are not firing.

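The sampled forbidden-string scan in the route smoke check can be pictured as a plain substring count over each response body. This is a sketch under assumptions: the string list is the one named in the production readback entry of this commit, while `forbidden_hits` is a hypothetical helper and fetching the page body is left to the caller.

```python
# Strings the readback entry reports scanning for on production routes.
FORBIDDEN = (
    "codex_delegation",
    "source_thread_id",
    "批准!",
    "192.168.0.110",
    "192.168.0.120",
)


def forbidden_hits(body: str, needles: tuple[str, ...] = FORBIDDEN) -> dict[str, int]:
    """Map each forbidden string found in the body to its occurrence count."""
    return {needle: body.count(needle) for needle in needles if needle in body}
```

An empty dict corresponds to the `0` hits recorded in the readback.
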
**Verdict:** The 110 CPU runaway chain observe -> classify -> alert -> AI event packet -> PlayBook / KM contract is now available in production. This is still not a runtime remediation authorization: live Telegram send, Bot API call, Gateway queue write, host write, process kill, Docker / systemd restart, Nginx reload, and firewall / K8s action all remain `0 / false`. If an alert fires in the future, AI may only first produce a triage packet, dry-run evidence, the owner / maintenance / evidence gate, and a post-check plan; only then does the gated remediation helper execute a minimal `SIGTERM`.

@@ -236,6 +236,15 @@ Evidence: `apps/api/src/services/telegram_gateway.py`、`apps/api/tests/test_tel
Blocked: No for alert-to-event packet; yes for Telegram live send / runtime remediation by design.
Next: After code review / CD, perform the production readback; if the alert actually fires in the future, confirm that the Telegram card and the AwoooP Run truth-chain both present the same lane.

2026-06-18 14:51 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet production readback
Before: `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` already had source + tests, but production deployment and runtime revision readback were not yet complete.
After: `f358a0f6` was deployed by Gitea CD `#3150`, deploy marker `2d278568`. ArgoCD `awoooi-prod` is `Synced / Healthy`, and the API / Web / Worker images are all `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`; production health and the IwoooS / Governance / AwoooP tenants routes all return `200`, and the sampled sensitive-string scan found `0` hits.
Evidence: Gitea CD `#3150` tests `3221 passed, 23 skipped`, B5 integration `5 passed`, post-deploy alert-chain smoke `9/9`, monitoring coverage `14/14` jobs up; Prometheus still reads 110 `monitor_up=1`, orphan browser group count `0`, CI active containers `2`, `remediation_authorized=0`; the missing / orphan alerts are not firing.
Blocked: No for production alert-to-event packet deployment; yes for runtime remediation by design.
Next: Future firing alert must produce AI triage packet and dry-run evidence first. `HostCiRunnerLoadSaturation` remains capacity / runner scheduling triage, not process kill. Runtime remediation remains `0` until owner approval, maintenance window, evidence ref, gated SIGTERM, post-check, and KM / PlayBook / Verifier writeback exist.
Completion: host runaway monitoring / alert / PlayBook / Telegram event packet / production deploy readback 100%; runtime auto-remediation remains safely gated at 0.

2026-06-18 13:43 Asia/Taipei
Phase: P1/P2/P3 live readback
Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning.
