diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 04735d5f..63c78c03 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -25,7 +25,7 @@
 - Runtime auto-remediation: `0%`; this is a safety gate, not a missing component. A real `SIGTERM` requires owner approval, maintenance window, evidence ref, rule, and post-check.
 - 110 CPU orphan Chrome recurrence guard: `READY`; the next recurrence will be routed by `HostOrphanBrowserSmokeHighCpu` to the PlayBook instead of leaving only the generic `HostHighCpuLoad`.
 
-**Boundary**: This entry did not write to the host over SSH, kill any live process, restart Docker/systemd/Nginx, change firewall/K8s, read secrets, or open the runtime gate. If a live exporter deployment or alert reload is needed, it must go through the existing Ansible / deploy-alerts maintenance flow with owner readback.
+**Boundary**: The source / repo convergence in this entry did not kill any live process, restart Docker/systemd/Nginx, change firewall/K8s, read secrets, or open the runtime gate. The live exporter was completed at 14:29 via a minimal Ansible-equivalent textfile exporter install; no process remediation has been executed.
 
 ### 2026-06-18 14:28 Taipei | deploy-alerts post-deploy contract fix
 
@@ -35,9 +35,30 @@
 
 **Completion sync**:
 - 110 runaway process AIOps source / PlayBook / tests: `100%`.
-- Alert deployment contract recovery: `IN_PROGRESS`, waiting for this fix commit to trigger a `deploy-alerts` rerun.
+- Alert deployment contract recovery: `DONE`; the parallel fix `93ac6030` restored the canonical alert rule, `deploy-alerts #3147` succeeded, and the live Prometheus rules now include `SourceProviderIngestionStale` and the host runaway process alerts.
 - Runtime remediation: still `0%`; this fix did not enable any process kill, Docker restart, systemd restart, Nginx reload, firewall change, or secret read.
 
+### 2026-06-18 14:31 Taipei | 110 read-only exporter live install / Prometheus scrape complete
+
+**Live install**: Because `ansible-playbook` is not available on the local machine, a minimal equivalent install was done following the `infra/ansible/roles/host-textfile-exporters` source-of-truth: `host-runaway-process-exporter.py` and `host-runaway-process-remediation.py` were copied to `/home/wooo/scripts/` on 110 and set to `0755`, a cron job running every 2 minutes was added, and `/home/wooo/node_exporter_textfiles/host_runaway_process.prom` was refreshed immediately. This entry only wrote the read-only exporter/helper and cron; it did not restart Docker/systemd/Nginx, change firewall/K8s, kill any process, or read secrets.
+
+**110 textfile readback**:
+- `awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1`.
+- `awoooi_host_runaway_browser_orphan_group_count{host="110",rule="headless_browser_smoke"} 0`.
+- `awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke"} 0`.
+- `awoooi_host_gitea_actions_active_container_count{host="110"} 2`.
+- `awoooi_host_load5_per_core{host="110"} 0.806667`.
+- `awoooi_host_swap_used_ratio{host="110"} 0.999960`.
+- `awoooi_host_runaway_process_remediation_authorized{host="110"} 0`.
+
+**Prometheus readback**: The second scrape read `awoooi_host_runaway_process_monitor_up{host="110"}=1`, both orphan browser group counts at `0`, CI active container count `2`, and `remediation_authorized=0`; neither `HostRunawayProcessMonitorMissing` nor `HostOrphanBrowserSmokeHighCpu` is firing. The live rules also include `HostRunawayProcessMonitorMissing`, `HostRunawayProcessMonitorStale`, `HostOrphanBrowserSmokeHighCpu`, `HostCiRunnerLoadSaturation`, `HostRunawayProcessRemediationUnexpectedlyAuthorized`, and `SourceProviderIngestionStale`.
+
+**Completion sync**:
+- 110 runaway process live observability: `0% -> 100%`.
+- 110 orphan Chrome recurrence guard: `READY_AND_SCRAPED`.
+- Runtime auto-remediation: still `0%`, by safety design; if AI is to perform remediation in the future, it must first produce a triage packet, dry-run evidence, owner approval, maintenance window, evidence ref, post-check, and KM write-back. The exporter must never kill on its own.
+- Current reading of 110 high CPU: orphan headless browsers are back to zero; the remaining load should be attributed to active CI or other ordinary workloads and must no longer be misread as the earlier stockPlatform orphan Chrome incident.
+
 ## 2026-06-18 | P2-406B Receipt Readback Owner Review completed locally
 
 **Background**: P2-004 already converged dependency / supply-chain drift into a read-only monitoring readback. The commander requires that every step forward keep the goal and direction in view, so this entry ties the daily / weekly / monthly reports, Telegram receipt owner review, P2-004 drift monitor, and P2-403J report truth into a single owner review surface, so that the governance page can directly show the AI Agent division of work, mutual review, and the runtime boundaries that remain closed.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index c90fd8f9..5a17e692 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -33,6 +33,17 @@ Allowed declaration: full cold-start service readiness is green for this evidenc
 Forbidden declaration: DR complete or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.
 ```
 
+2026-06-18 14:31 live runaway-process readback supersedes repo-only AIOps wording:
+
+```text
+110 host runaway process exporter is live-installed and scraped.
+Textfile source: /home/wooo/node_exporter_textfiles/host_runaway_process.prom.
+Prometheus readback: monitor_up=1, orphan_browser_groups=0 for headless_browser_smoke and stockplatform_headless_smoke, active Gitea Actions containers=2, load5_per_core around 0.79-0.81, swap_used_ratio around 1.0, remediation_authorized=0.
+Alerts: HostRunawayProcessMonitorMissing is not firing; HostOrphanBrowserSmokeHighCpu is not firing.
+Allowed declaration: runaway Chrome/smoke recurrence guard is live and scraped.
+Forbidden declaration: AI runtime remediation is enabled. Remediation remains gated and must not execute without owner approval, maintenance window, evidence ref, dry-run, and post-check.
+```
+
 | Item | 2026-06-18 13:43 Asia/Taipei live result | Verdict |
 |------|-------------------------------------------|------|
 | Overall recovery readiness | `99%` | `SERVICE_GREEN_DR_ESCROW_BLOCKED` |
@@ -41,7 +52,7 @@ Forbidden declaration: DR complete or runtime/security acceptance. Credential es
 | P2 service / data truth | `100%` | `VERIFIED_FULL_STACK_GREEN_FOR_SERVICE` |
 | P3 docs / automation contracts | `100%` | `DONE_WITH_STALE_JOB_CLASSIFICATION` |
 | 110 host runtime | `fwupd-refresh.timer` intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; `systemctl --failed` returns `0 loaded units listed`; rollback is `sudo systemctl enable --now fwupd-refresh.timer` | `GREEN_WITH_FWUPD_TIMER_DISABLED` |
-| 110 host runaway process guard | `host-runaway-process-exporter.py` / `host-runaway-process-remediation.py` are included in the 110 Ansible textfile exporter source-of-truth; alerts can distinguish orphan browser smoke from legitimate Gitea Actions CI load; the remediator defaults to dry-run, and `SIGTERM` requires owner approval, maintenance window, evidence ref | `AIOPS_MONITOR_READY_RUNTIME_GATE_0` |
+| 110 host runaway process guard | 14:31-14:32 live scrape confirms `monitor_up=1`, orphan browser groups `0`, active Gitea Actions containers `2`, `load5_per_core≈0.79-0.81`, `swap_used_ratio≈1.0`, and `remediation_authorized=0`; exporter/helper also remain in Ansible textfile exporter source-of-truth. | `LIVE_SCRAPED_RUNTIME_GATE_0` |
 | 120 reachability | ping OK, SSH OK, boot around `2026-06-14 02:23`, K3s active, node `mon Ready` | `GREEN` |
 | 121 reachability | ping OK, SSH OK, failed units `0` | `GREEN` |
 | 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
@@ -796,6 +807,8 @@ grep -E 'awoooi_host_runaway_|awoooi_host_gitea_actions_|awoooi_host_load5_per_c
 /home/wooo/node_exporter_textfiles/host_runaway_process.prom
 ```
 
+Prometheus must also be able to read the same textfile; the 2026-06-18 14:31-14:32 live scrape confirmed `awoooi_host_runaway_process_monitor_up{host="110"}=1`, orphan group count `0`, active CI container count `2`, `remediation_authorized=0`, and neither the missing nor the orphan alert firing.
+
 Interpretation:
 
 | Metric combination | Verdict | Action |
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 90a63065..bb65172d 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
 | P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-18 13:43 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=84 WARN=0 BLOCKED=0`. |
-| P3 docs / automation contracts | DONE_WITH_RUNAWAY_PROCESS_AIOPS | 100% | Workplan, SOP v1.26, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, and 2026-06-18 live readback are updated. Repo-side readiness audit now also checks runaway process exporter / remediation helper / alert group; live cold-start remains `PASS=84 WARN=0 BLOCKED=0` from the latest service readiness readback. |
+| P3 docs / automation contracts | DONE_WITH_RUNAWAY_PROCESS_AIOPS_LIVE_SCRAPED | 100% | Workplan, SOP v1.26, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, and 2026-06-18 live readback are updated. 14:31-14:32 Prometheus scrape confirms 110 `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, load5/core around `0.79-0.81`, swap ratio around `1.0`, `remediation_authorized=0`, and missing/orphan alerts not firing. Repo-side readiness audit also checks runaway process exporter / remediation helper / alert group; live cold-start remains `PASS=84 WARN=0 BLOCKED=0` from the latest service readiness readback. |
 
 Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-18 13:43, services are green with `WARN=0` and `BLOCKED=0`; the retained stale `km-vectorize` failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
 
@@ -218,6 +218,14 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy
 Phase: P3 AI Ops runaway process automation
 Before: A saturated 110 CPU could only be diagnosed manually with `ps/top`; the generic `HostHighCpuLoad` could not distinguish cross-project orphan Chrome smoke from legitimate Gitea Actions CI load.
 After: Added the read-only `host-runaway-process-exporter.py`, the gated `host-runaway-process-remediation.py`, Prometheus `host_runaway_process_alerts`, the Ansible textfile exporter source-of-truth, SOP v1.26, and `HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`. The exporter exposes orphan browser, active CI, load/core, swap ratio, and `remediation_authorized=0`; the remediator defaults to dry-run, and `SIGTERM` must carry owner approval, maintenance window, and evidence ref.
+
+2026-06-18 14:31 Asia/Taipei
+Phase: P3 AI Ops runaway process live observability
+Before: The repo-side exporter / alert / PlayBook were complete, but Prometheus for 110 had not yet read live `awoooi_host_runaway_process_*` metrics.
+After: 110 now has the read-only exporter/helper and cron installed and the textfile refreshed immediately; the second Prometheus scrape read `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, load5/core around `0.79-0.81`, swap ratio around `1.0`, and `remediation_authorized=0`; `HostRunawayProcessMonitorMissing` and `HostOrphanBrowserSmokeHighCpu` are not firing.
+Evidence: `/home/wooo/node_exporter_textfiles/host_runaway_process.prom`, Prometheus query `awoooi_host_runaway_process_monitor_up{host="110"}`, `ALERTS{alertname="HostRunawayProcessMonitorMissing",host="110",alertstate="firing"}`.
+Blocked: No for live observability; yes for runtime remediation by design until owner approval / maintenance window / evidence ref / dry-run / post-check exist.
+Next: Keep cron scrape under normal monitoring; if orphan count becomes >0, create AI triage packet and remediation dry-run before any gated `SIGTERM`.
 Completion: monitoring / alert / PlayBook / KM contract 100%; runtime auto-remediation remains gated at 0 until a real owner-approved apply is executed.
 
 2026-06-18 13:43 Asia/Taipei