diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 5049e0a0..910b940b 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -44865,3 +44865,37 @@ production browser smoke:
 **Current verdict**:
 - 188 host hygiene preflight automation: `0% -> 100%`.
 - 188 runtime repair: still `0%`.
+
+## 2026-06-26 — 07:39 post-start quick-check live refresh / SOP v1.66
+
+**Time and sources**:
+- 2026-06-26 07:38-07:39 Asia/Taipei.
+- Sources: `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, delegated cold-start wrapper, MOMO dedicated preflight, StockPlatform freshness API, 110 backup-status, public route curl, 110 CPU / process readback.
+
+**Read-only evidence**:
+- Host reachability: ping and SSH port checks for `110 / 120 / 121 / 188` all OK.
+- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, result `GREEN`.
+- Wrapper: `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
+- K3s: `mon` / `mon1` both `Ready control-plane`; AWOOOI API/Web/Worker pods Running, restart `0`; `km-vectorize` latest retained official Job Completed.
+- MOMO: health `V10.701`; daily snapshot `109061` rows, `2025-07-01..2026-06-24`; current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`; latest import job `57 completed`, daily freshness `1|2026-06-24`.
+- StockPlatform: freshness `status=ok`, latest trading date `2026-06-25`, blockers empty; price / chips / margin / AI recommendations all current for `2026-06-25`.
+- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`.
+- Public routes: AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps all returned expected 2xx/3xx statuses.
+- 110 CPU: load around `5.19 / 4.66 / 4.91`; `vmstat` samples show CPU idle mostly `80%+`, no immediate swap thrash; top processes are normal platform services such as Gitea, ClickHouse, Docker, Kafka, StockPlatform, AWOOOI API, and Sentry. No orphan Chrome recurrence was visible in this wrapper output.
+
+**Command types run**:
+- Read-only: ping / SSH port check / `kubectl get` / cold-start wrapper / MOMO preflight / StockPlatform freshness API / backup-status / public route curl / `vmstat` / `ps`.
+- Writes: repo docs-only.
+- Not done: no runtime write operations on host / Docker / systemd / Nginx / firewall / K8s / DB / Wazuh; did not run `pg_resetwal`, certbot renew, service restart, `reset-failed`, manual data import, or secret read.
+
+**Current verdict**:
+- Hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness: `GREEN`.
+- Overall status remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
+
+**Still blocked**:
+- DR credential escrow evidence missing `5`: `restic_repository_password`, `offsite_provider_credentials`, `break_glass_admin_credentials`, `dns_registrar_recovery`, `oauth_ai_provider_recovery`.
+- 188 host hygiene: host PostgreSQL `14/main` checkpoint/WAL failure, certbot renew / challenge failure, and startup unit failure; these need a maintenance window and a rollback owner.
+- Wazuh: manager registry accepted remains `0`; route 200 / Dashboard visibility cannot be used as host registry recovery.
+
+**Must not claim**:
+- Do not claim `DR_COMPLETE`, credential escrow complete, 188 host fully green, Wazuh registry recovered, runtime/security acceptance enabled, or that all host hygiene issues are resolved.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index a02bfadc..0d99c87b 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold-Start and Host Reboot SOP

-> Version: v1.65
+> Version: v1.66
 > Last updated: 2026-06-26 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -12,6 +12,8 @@

 If you only need a quick post-reboot call on whether recovery can be declared, first run the one-page summary check `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist owns the fixed verdict within T+10 minutes of every start.

+2026-06-26 07:39 live quick-check refresh: `scripts/reboot-recovery/post-start-quick-check.sh --no-color` ran to completion; ping / SSH for all four hosts were OK; delegated cold-start was `PASS=89 WARN=0 BLOCKED=0`; the wrapper summary was `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. MOMO health `V10.701`, daily snapshot `109061` rows / `2025-07-01..2026-06-24`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, latest import job `57 completed`. StockPlatform freshness `status=ok`, latest trading date `2026-06-25`; price / chips / margin / AI recommendations are all at `2026-06-25`. Backup-status at 07:39 shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. Every route in the extended public route list returned the expected 2xx/3xx. 110 CPU attribution shows load around `5.19 / 4.66 / 4.91` with CPU idle at `80%+` in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not orphan Chrome. Allowed declaration for this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green; forbidden declaration: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.
+
 2026-06-26 07:19 follow-up: `gitea/main` already contains the previous round's SOP docs commit `1fd5e2a8`; ArgoCD `awoooi-prod` reads back `Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, with API `2/2`, Web `2/2`, Worker `1/1`, pods `restart=0`. A rerun of the full cold-start is still `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`. Direct public route readback: AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse
`200`. Precise classification of the 188 blockers: `pg_lsclusters` shows host PostgreSQL `14/main` down, and `systemctl status postgresql@14-main` shows `invalid primary checkpoint record` and `PANIC: could not locate a valid checkpoint record`; `certbot.service` shows the `sentry.wooo.work` renew is rate-limited, and `snap.certbot.renew.service` shows challenge failed; `awoooi-startup.service` previously attempted to run `pg_resetwal` as root and failed. This round does not run `pg_resetwal`, does not `reset-failed`, and does not restart any service; 188 must be handled in a separate maintenance window with a rollback owner and a restore/source-of-truth plan, as described in `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`, and a read-only preflight can be obtained first by running `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`. 110 load has dropped to around `4.83 / 4.82 / 5.52`, and top CPU is the active AWOOOI Web `turbo build` / Docker buildx; swap is still full but memory available is about `41Gi`, so swap is not cleared manually this round. The overall declaration remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.

 2026-06-26 07:02 all-host live refresh: ping and SSH port checks for `110 / 120 / 121 / 188 / 112 / 111 / 168` were all OK. 110 has `systemctl=running` and failed units `0`, but load is `5.83 / 7.26 / 5.77`, top CPU is AWOOOI Web `next build`, and swap is still `7.8Gi / 7.8Gi`; this is CI/build pressure, not an orphan Chrome or Docker incident. 120 / 121 have `systemctl=running` and K3s active; nodes `mon` / `mon1` are both Ready. ArgoCD `awoooi-prod` was briefly `OutOfSync / Progressing` at 06:57 because the rollout for deploy marker `52f61da4` was replacing API/Web/Worker; after 07:00 it stabilized at `Synced / Healthy`, with API `2/2`, Web `2/2`, Worker `1/1`, and API/Web still spread across `mon` / `mon1`. Rerun of the live cold-start: `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`. StockPlatform `/api/v1/system/freshness` briefly returned `502` about 35 seconds after the container restarted; all subsequent consecutive reads returned `200` with `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; this kind of rollout warmup counts as a blocker only when the failures are consecutive. MOMO health is `V10.699`; cold-start direct evidence still shows current-month parity `15383 / 15383` through `2026-06-24`, daily freshness `1|2026-06-24`. Backup status at 06:58: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 product containers are healthy, but host `systemctl=degraded` is still a real host hygiene
blocker: `awoooi-startup.service`, `postgresql@14-main.service`, `certbot.service`, and `snap.certbot.renew.service` are failed. 112 Wazuh manager/indexer/dashboard are active and ports `1514 / 1515 / 55000` are listening, but the production Wazuh route still reports `disabled_waiting_iwooos_wazuh_owner_gate`, `configured=false`, manager registry accepted `0`, runtime gate `0`. 111 / 168 are reachable, but the AWOOOI dev workspaces on both are ahead 17 with different HEADs (`111=56c83257`, `168=59485d51`); Mac Mini `/System/Volumes/Data` has only about `3.2Gi` left. The current service recovery declaration stays `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; host hygiene / DR escrow / Wazuh registry / workstation capacity are explicitly listed as blockers outside service green.
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 2c6d6c6f..6b9fc04c 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,9 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. 
|
 | P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
-| P3 docs / automation contracts | DONE_WITH_ROUTE_RETRY_V165 | 100% | Workplan, SOP v1.65, one-page post-start quick check v1.6, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway 
browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
+| P3 docs / automation contracts | DONE_WITH_ROUTE_RETRY_V166 | 100% | Workplan, SOP v1.66, one-page post-start quick check v1.6, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month 
detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
+
+2026-06-26 07:39 live quick-check refresh supersedes the 07:19 row for current operator status. `scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Delegated cold-start returned `PASS=89 WARN=0 BLOCKED=0`; four reboot-scope hosts ping/SSH were OK; AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps routes returned expected 2xx/3xx. MOMO `V10.701` has job `57 completed`, daily freshness `1|2026-06-24`, and current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform freshness is `ok` through `2026-06-25` with price / chips / margin / AI recommendations current. Backup core remains green: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`; DR still has `escrow_missing=5`. 110 load around `5.19 / 4.66 / 4.91` is attributable to normal platform processes, not orphan Chrome. 188 host hygiene remains blocked by failed host PostgreSQL / certbot / startup units and must use the dedicated maintenance runbook and read-only checklist.

 2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. 
Latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`, ArgoCD `Synced / Healthy`, API/Web/Worker image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.690`; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned `200`; 110 load is around `14.51 / 12.34 / 11.42`, with Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain. Wazuh route readback is now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but manager registry accepted remains `0`, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
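
The verdict rule stated above (the wrapper reports `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` rather than `DEGRADED` when service warnings are `0` and only boundary / evidence warnings remain) can be sketched as below. This is a minimal illustration inferred from the logged summary lines, not the logic inside `post-start-quick-check.sh`: the function names `parse_counts` and `classify`, the `BLOCKED` verdict label, and the check that the warning split sums to `WARN` are all assumptions.

```python
import re


def parse_counts(line: str) -> dict[str, int]:
    """Extract KEY=<int> pairs (e.g. PASS=38) from a summary line."""
    return {key: int(value) for key, value in re.findall(r"\b([A-Z_]+)=(\d+)\b", line)}


def classify(summary: str, warning_split: str = "") -> str:
    """Map a summary line plus its warning split to a verdict label.

    Rules are inferred from the log entries, not read from the script:
    any BLOCKED count fails, any SERVICE warning degrades, and warnings
    limited to BOUNDARY / EVIDENCE keep the stack green but DR-blocked.
    """
    counts = parse_counts(summary)
    split = parse_counts(warning_split)
    warn = counts.get("WARN", 0)

    # Assumed sanity check: the split must account for every warning.
    if sum(split.values()) != warn:
        raise ValueError("warning split does not add up to WARN")

    if counts.get("BLOCKED", 0) > 0:
        return "BLOCKED"  # hypothetical label; the source does not name this case
    if split.get("SERVICE", 0) > 0:
        return "DEGRADED"
    if warn > 0:
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return "GREEN"


if __name__ == "__main__":
    # Summary lines copied from the 07:39 readback.
    print(classify("POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0",
                   "SERVICE=0 BOUNDARY=1 EVIDENCE=2"))
    print(classify("PASS=89 WARN=0 BLOCKED=0"))
```

Checking `BLOCKED` before the warning split keeps a hard blocker from being masked by a clean service count; whether the real wrapper orders its checks this way is not stated in the source.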