diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 5e1246b2..60aa16ac 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -72,6 +72,24 @@
 - Monitoring / alert / PlayBook / Telegram event packet / live scrape: `100%`.
 - Runtime remediation / Telegram live send / Bot API call / host write: still `0 / false`. This segment sent no Telegram message, read no secret, killed no process, restarted no service, and changed no firewall/K8s setting.
 
+### 2026-06-18 14:51 Taipei | Host runaway event packet production deployment / production readback
+
+**Deploy anchor**: `f358a0f6 fix(api): route runaway host alerts to ai event packets` was deployed by Gitea CD `#3150`; the deploy marker is `2d278568 chore(cd): deploy f358a0f [skip ci]`. CD evidence shows tests job `3221 passed, 23 skipped`, B5 integration `5 passed`, API / Web images both using `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`, post-deploy alert-chain smoke `9/9` passed, and monitoring coverage `14/14` jobs up.
+
+**Runtime readback**:
+- ArgoCD `awoooi-prod`: `Synced / Healthy`, revision `2d278568cb60e1bd79af4275249df374f220039f`.
+- Kubernetes deployments: `awoooi-api 2/2`, `awoooi-web 2/2`, `awoooi-worker 1/1`; the image revision is `f358a0f6c3e614e407dedb6eee89bf10b2bc8173` for all three.
+- Production health: `https://awoooi.wooo.work/api/v1/health` returns `healthy / prod / mock_mode=false`.
+- Production routes: `/zh-TW/iwooos`, `/zh-TW/governance?tab=automation-inventory`, and `/zh-TW/awooop/tenants` all return `200`; a sampled scan for `codex_delegation`, `source_thread_id`, `批准!`, `192.168.0.110`, and `192.168.0.120` found `0` hits.
+- Prometheus: `awoooi_host_runaway_process_monitor_up{host="110"}=1`, both orphan browser group counts are `0`, `awoooi_host_gitea_actions_active_container_count{host="110"}=2`, `awoooi_host_runaway_process_remediation_authorized{host="110"}=0`; `HostRunawayProcessMonitorMissing` and `HostOrphanBrowserSmokeHighCpu` are not firing.
+
+**Completion sync**:
+- 110 CPU runaway observe -> classify -> alert -> AI event packet -> PlayBook / KM contract: `100%`, deployed to production and read back.
+- Live observability: `100%`; the 110 textfile exporter is installed and Prometheus is scraping it.
+- Runtime remediation: still `0% / false`. This is a deliberately retained safety gate: without owner approval, maintenance window, evidence ref, dry-run, and post-check, no automatic kill, Docker restart, systemd restart, Nginx reload, firewall change, or K8s action is allowed.
+
+**Next step**: If `HostOrphanBrowserSmokeHighCpu` actually fires in the future, the handling order must be AI triage packet -> remediation dry-run evidence -> owner approval / maintenance window / evidence ref -> gated `SIGTERM` -> post-check -> KM / PlayBook / Verifier writeback. Legitimate CI load instead goes through capacity / scheduling / runner queue triage, not process remediation.
+
 ## 2026-06-18 | P2-406B Receipt Readback Owner Review completed locally
 
 **Background**: P2-004 already converged dependency / supply-chain drift into a read-only monitoring readback. The commander requires that no advance lose sight of the goal and direction, so this segment links the daily / weekly / monthly reports, Telegram receipt owner review, P2-004 drift monitor, and P2-403J report truth into a single owner review surface, letting the governance page directly show the AI Agent division of labor, mutual review, and the runtime boundaries that remain closed.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 5a17e692..47e24cf1 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -44,6 +44,18 @@ Allowed declaration: runaway Chrome/smoke recurrence guard is live and scraped.
 Forbidden declaration: AI runtime remediation is enabled. Remediation remains gated and must not execute without owner approval, maintenance window, evidence ref, dry-run, and post-check.
 ```
 
+2026-06-18 14:51 production event-packet readback:
+
+```text
+Host runaway alert-to-event packet is deployed in production.
+Deploy marker: 2d278568 chore(cd): deploy f358a0f [skip ci].
+Runtime image: awoooi-api / awoooi-web / awoooi-worker use f358a0f6c3e614e407dedb6eee89bf10b2bc8173.
+ArgoCD readback: awoooi-prod Synced / Healthy.
+Alert mapping: HostOrphanBrowserSmokeHighCpu -> orphan_browser_smoke_runaway_process; HostCiRunnerLoadSaturation -> ci_runner_load_saturation.
+Allowed declaration: monitoring, alert rules, live scrape, AI event packet routing, PlayBook / KM contract, and production deployment are complete for this evidence set.
+Forbidden declaration: Telegram send, Bot API call, Gateway queue write, process kill, Docker/systemd restart, Nginx reload, firewall/K8s action, or runtime remediation is authorized.
+```
+
 | Item | 2026-06-18 13:43 Asia/Taipei live result | Verdict |
 |------|-------------------------------------------|------|
 | Overall recovery readiness | `99%` | `SERVICE_GREEN_DR_ESCROW_BLOCKED` |
diff --git a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
index 61d47eea..0f97285b 100644
--- a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
+++ b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
@@ -5101,3 +5101,17 @@ Trigger commit `f5cd37b7` 與 deploy marker `0ba92357` 已把 governance UI 的
 - Targeted test `apps/api/tests/test_telegram_message_templates.py` gained two regression cases, `59 passed`; `telegram_gateway.py` py_compile passes.
 
 **Verdict:** This closes the read-only and message-template loop for alert -> AI event packet; it does not authorize Telegram live send, Bot API call, Gateway queue write, host write, or process kill. All cards remain fixed at `runtime_write_gate=0`, and real remediation must still go through dry-run, owner approval, maintenance window, evidence ref, post-check, and KM / PlayBook writeback.
+
+### 2026-06-18 14:51 (Taipei) - §8 / Host CPU AIOps - event packet production deployment readback
+
+**Trigger**: The dedicated lanes for `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` are complete in source and regression tests. Production truth must now be recorded so that a locally completed message template is not mistaken for the production site being usable.
+
+**Progress:**
+- `f358a0f6 fix(api): route runaway host alerts to ai event packets` was deployed to production by Gitea CD `#3150`, deploy marker `2d278568 chore(cd): deploy f358a0f [skip ci]`.
+- The CD tests job shows `3221 passed, 23 skipped`, B5 integration `5 passed`; post-deploy alert-chain smoke `9/9` passed, monitoring coverage `14/14` jobs up.
+- ArgoCD `awoooi-prod` reads back `Synced / Healthy`, revision `2d278568cb60e1bd79af4275249df374f220039f`.
+- `awoooi-api`, `awoooi-web`, and `awoooi-worker` have all rolled out to image `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`.
+- Production health and the `/zh-TW/iwooos`, `/zh-TW/governance?tab=automation-inventory`, and `/zh-TW/awooop/tenants` route smoke checks passed; the sampled forbidden-string scan found `0` hits.
+- Prometheus still reads the 110 runaway exporter: `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, `remediation_authorized=0`; the missing / orphan alerts are not firing.
+
+**Verdict:** The 110 CPU runaway observe -> classify -> alert -> AI event packet -> PlayBook / KM contract chain is now available in production. This is still not a runtime remediation authorization: Telegram live send, Bot API call, Gateway queue write, host write, process kill, Docker / systemd restart, Nginx reload, and firewall / K8s action all remain `0 / false`. If an alert fires in the future, AI may only first produce the triage packet, dry-run evidence, owner / maintenance / evidence gate, and post-check plan; only then does the gated remediation helper execute a minimal `SIGTERM`.
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index fa7b6e74..aec54f80 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -236,6 +236,15 @@ Evidence: `apps/api/src/services/telegram_gateway.py`、`apps/api/tests/test_tel
 Blocked: No for alert-to-event packet; yes for Telegram live send / runtime remediation by design.
 Next: After code-review / CD, perform a production readback; if an alert actually fires in the future, confirm that the Telegram card and the AwoooP Run truth-chain both present the same lane.
 
+2026-06-18 14:51 Asia/Taipei
+Phase: P3 AI Ops alert-to-event packet production readback
+Before: `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` had source + test, but production deployment and runtime revision readback were not yet complete.
+After: `f358a0f6` was deployed by Gitea CD `#3150`, deploy marker `2d278568`. ArgoCD `awoooi-prod` is `Synced / Healthy`, and the API / Web / Worker images are all `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`; production health and the IwoooS / Governance / AwoooP tenants routes all return `200`, and the sampled sensitive-string scan found `0` hits.
+Evidence: Gitea CD `#3150` tests `3221 passed, 23 skipped`, B5 integration `5 passed`, post-deploy alert-chain smoke `9/9`, monitoring coverage `14/14` jobs up; Prometheus still reads 110 `monitor_up=1`, orphan browser group count `0`, CI active containers `2`, `remediation_authorized=0`; the missing / orphan alerts are not firing.
+Blocked: No for production alert-to-event packet deployment; yes for runtime remediation by design.
+Next: A future firing alert must produce an AI triage packet and dry-run evidence first. `HostCiRunnerLoadSaturation` remains capacity / runner scheduling triage, not process kill. Runtime remediation remains `0` until owner approval, maintenance window, evidence ref, gated SIGTERM, post-check, and KM / PlayBook / Verifier writeback exist.
+Completion: host runaway monitoring / alert / PlayBook / Telegram event packet / production deploy readback 100%; runtime auto-remediation remains safely gated at 0.
+
 2026-06-18 13:43 Asia/Taipei
 Phase: P1/P2/P3 live readback
 Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning.