docs(ops): record all-host reboot readback

ogt
2026-06-26 06:47:23 +08:00
parent e0fbedfda8
commit 186e3945e8
3 changed files with 38 additions and 3 deletions


@@ -39,6 +39,39 @@
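**Boundary**: This section only completes the governance view, redacted visualization, and production-site readback; no SSH, no Ansible apply, no kubectl write, no Telegram messages, no secret reads, no provider switching, no package upgrades, no shadow-canary writes, and no internal conversation content shown in the frontend.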
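## 2026-06-26: All-host read-only refresh; services green, DR / Wazuh / dual-workstation sync still tracked separately
**Background**: The user asked for another check of the status of all hosts. Per the host reboot / cold-start / backup SOP, only read-only live checks were run; no Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write operations were performed, and no plaintext secrets were read.
**Read-only evidence**: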
- Reachability: `192.168.0.110 / 120 / 121 / 188 / 112 / 111 / 168` ping and SSH port all OK.
- 110: boot `2026-06-23 14:49`; `systemctl is-system-running=running`; failed units `0`; Docker active; Gitea / Harbor / Prometheus / Alertmanager / Sentry green on the local side. Load is about `4-6`; live top is dominated by AWOOOI CI pytest, `unattended-upgrade`, Gitea, ClickHouse, and Docker / Kafka / PostgreSQL background load; no orphan Chrome seen. Swap is still full, but `vmstat` shows no live thrashing.
- 188: boot `2026-06-23 14:49`; product containers healthy, including MOMO, 2026FIFA, VibeWork, ClawBot, MinIO, and exporters; PostgreSQL / Redis / MOMO health / SigNoz available. `systemctl` is `degraded`; failed units are `awoooi-startup.service`, `certbot.service`, `postgresql@14-main.service`, `snap.certbot.renew.service`. This currently causes no public service blocker, but it needs a separate host hygiene / certbot cleanup item.
- 120 / 121: boot `2026-06-23 14:49-14:50`; `systemctl is-system-running=running`; failed units `0`; K3s nodes `mon` / `mon1` are `Ready control-plane`; API/Web replicas split across both nodes; Worker running; CronJobs: latest `km-vectorize` official run completed.
- ArgoCD: `awoooi-prod sync=Synced health=Healthy revision=b2945ab9f716d9d685434ae0e67b9318414b27fe`.
- `km-vectorize`: schedule `0 3 * * *`; `lastSchedule=2026-06-25T19:00:00Z`; `lastSuccess=2026-06-25T19:00:14Z`, i.e. the official scheduled run at 2026-06-26 03:00 Taipei time succeeded.
- Public routes: AWOOOI, AWOOOI API, VibeWork, AwoooGo, MOMO, Stock, Bitan, Gitea, Harbor, Registry `/v2/`, Sentry, SigNoz, and Langfuse all return the expected 2xx/3xx/401.
- AWOOOI API health: `healthy / prod / mock_mode=false`; PostgreSQL, Redis, OpenClaw, SigNoz, and Ollama GCP A/B/local are all up.
- MOMO: `https://mo.wooo.work/health` is `healthy`, version `V10.690`; the previous wrapper's DB evidence still holds: latest import job `57 completed`, daily freshness `1|2026-06-24`, monthly parity `15383|15383`.
- StockPlatform: `/api/v1/system/freshness` is `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all at `2026-06-25`.
- Backup: `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite / rclone fresh, `last_backup_all=2026-06-26 02:31:02`; credential escrow is still missing 5 items.
- 112: Kali host reachable; `wazuh-manager`, `wazuh-indexer`, `wazuh-dashboard` active; 1514 / 1515 / 55000 listening; host `systemctl` degraded by `networking.service` failed. This evidence only shows that Wazuh services / ports are up; it does not show that the agent registry is accepted.
- 111: MacBook Pro reachable; about 135Gi available; `~/codex-workspaces/awoooi-dev` is readable, but branch `codex/awoooi-current-main-dev-base-20260624` is ahead 17 at HEAD `56c83257`.
- 168: readable via `ogt@192.168.0.168`; hostname `WOOOMacMiniM4.local`, which is another address of the Mac Mini; the same `awoooi-dev` branch is ahead 17 but at HEAD `59485d51`, which differs from 111. This is workstation sync drift, not a reboot service blocker.
- Local Mac Mini: the root/Data volume has about `3.4Gi` available, which is low; it needs to go on the workstation capacity cleanup list but does not affect this round's public service green.
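The `km-vectorize` item above equates a UTC `lastSchedule` with 03:00 Taipei time. A minimal sketch of that conversion follows; the helper name `to_taipei` is hypothetical and not part of the repository, and the fixed UTC+8 offset assumes Taipei observes no daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Taipei is UTC+8 with no daylight saving time, so a fixed offset is enough here.
TAIPEI = timezone(timedelta(hours=8), name="Asia/Taipei")


def to_taipei(utc_stamp: str) -> datetime:
    """Convert a Kubernetes-style UTC timestamp (e.g. 2026-06-25T19:00:00Z) to Taipei time."""
    parsed = datetime.strptime(utc_stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).astimezone(TAIPEI)


last_schedule = to_taipei("2026-06-25T19:00:00Z")
last_success = to_taipei("2026-06-25T19:00:14Z")

print(last_schedule.isoformat())                        # 2026-06-26T03:00:00+08:00
print((last_success - last_schedule).total_seconds())   # 14.0
```

The 14-second gap between `lastSchedule` and `lastSuccess` is consistent with the run having been triggered by the `0 3 * * *` schedule rather than started by hand at another time.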
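**Result**:
- Service / host / public route / K3s / ArgoCD / MOMO data / Stock data / backup core: `GREEN`.
- Current recovery verdict: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- Work completion: service recovery after host reboot is `100%`; DR completion is still blocked by escrow; Wazuh registry / workstation sync are tracked separately and must not be folded into the service green status.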
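**Still blocked / must not be claimed**:
- `DR_COMPLETE` is still blocked: `escrow_missing=5`.
- Wazuh manager registry accepted is still `0`; a route `200`, an openable Dashboard, or active transport / services cannot be used to claim that Wazuh management of all hosts is complete.
- 188 failed units need a separate host hygiene / certbot / startup unit readback; green public routes must not be used to mask them.
- 111 / 168 `awoooi-dev` HEADs differ, so the two Codex workspaces cannot be claimed to be fully in sync.
- Mac Mini free space is about `3.4Gi`, so workstation capacity cannot be claimed to be risk-free.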
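The separation above (service green, DR blocked by escrow, Wazuh and workstation items tracked on their own) can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Readback` fields, the function names, the finding labels, and the plain `BLOCKED` verdict are hypothetical, and whether the real wrapper emits `DR_COMPLETE` when `escrow_missing=0` is assumed, not confirmed by this log.

```python
from dataclasses import dataclass, field


@dataclass
class Readback:
    service_blocked: int           # BLOCKED count from service / route checks
    core_blockers: int             # backup-status core_blockers
    escrow_missing: int            # missing credential escrow items
    wazuh_registry_accepted: int   # agents accepted by the Wazuh manager registry
    workstation_heads: dict[str, str] = field(default_factory=dict)


def recovery_verdict(r: Readback) -> str:
    """Service verdict only; Wazuh and workstation findings never change it."""
    if r.service_blocked > 0 or r.core_blockers > 0:
        return "BLOCKED"  # hypothetical label for a red service gate
    if r.escrow_missing > 0:
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return "DR_COMPLETE"  # assumed verdict once escrow evidence is complete


def separate_findings(r: Readback) -> list[str]:
    """Items reported alongside the verdict, not folded into it."""
    findings = []
    if r.wazuh_registry_accepted == 0:
        findings.append("wazuh_registry_not_accepted")
    if len(set(r.workstation_heads.values())) > 1:
        findings.append("workstation_head_drift")
    return findings


current = Readback(0, 0, 5, 0, {"111": "56c83257", "168": "59485d51"})
print(recovery_verdict(current))   # FULL_STACK_GREEN_DR_ESCROW_BLOCKED
print(separate_findings(current))  # ['wazuh_registry_not_accepted', 'workstation_head_drift']
```

Keeping the findings in a separate list mirrors the rule in this entry: a green service verdict must not hide them, and they must not turn the service verdict red.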
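## 2026-06-26: Host reboot SOP next-day readback and route retry gate
**Background**: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` was reached at 2026-06-25 21:14. The next day at 06:26, the live read-only check was rerun to confirm that the services stayed green and to address the SOP gap where the wrapper was oversensitive to a single route `000`.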


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP
> Version: v1.61
> Version: v1.62
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -12,6 +12,8 @@
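If the only goal after a reboot is to quickly judge whether recovery can be claimed, first run the one-page overall check: `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, with `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.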
2026-06-26 06:40-06:44 all-host read-only refresh: `110 / 120 / 121 / 188 / 112 / 111 / 168` ping and SSH port all OK. Core reboot scope stays green: 110 `systemctl=running`, failed units `0`, Docker / Gitea / Harbor / Prometheus / Alertmanager available; 120 / 121 `systemctl=running`, failed units `0`, K3s nodes `mon` / `mon1` Ready; 188 product containers and PostgreSQL / Redis / MOMO / SigNoz available. ArgoCD `awoooi-prod` has converged from the earlier degraded state to `Synced / Healthy`, revision `b2945ab9f716d9d685434ae0e67b9318414b27fe`; the `km-vectorize` official 03:00 Taipei time run succeeded, `lastSuccess=2026-06-25T19:00:14Z`. Public routes for AWOOOI / VibeWork / AwoooGo / MOMO / Stock / Bitan / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse return expected statuses; AWOOOI API health is `healthy / prod / mock_mode=false`; MOMO health is `V10.690`; StockPlatform freshness is `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status remains core green with `escrow_missing=5`. Boundaries: 188 host still has failed units `awoooi-startup.service`, `certbot.service`, `postgresql@14-main.service`, `snap.certbot.renew.service` that require host hygiene cleanup; 112 Wazuh services / ports are active but Wazuh manager registry accepted remains `0`; 111 / 168 Codex workspaces are reachable but have different local HEADs on the same ahead branch; Mac Mini free space is about `3.4Gi`. Current service verdict remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DR_COMPLETE` or `Wazuh recovered`.
2026-06-26 06:26-06:28 next-day read-only refresh: four hosts ping/SSH OK; cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.690` with latest import job `57 completed`; StockPlatform `/api/v1/system/freshness` still `status=ok` / `latest_trading_date=2026-06-25` / blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. On its first pass, the 06:26 full wrapper saw a one-time `000` on `https://awoooi.wooo.work/zh-TW/iwooos` and `https://vibework.wooo.work/`, but an independent curl immediately returned `200` and the route-only wrapper also returned `PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN`. v1.61 therefore changes the public route gate to allow at most 3 retries: only consecutive failures count as `BLOCKED`, and a route that recovers after a retry is recorded as an evidence warning. The 06:28 core wrapper with routes skipped returns `POST_START_QUICK_CHECK PASS=15 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. No Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write operations were performed this time.
2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. After waiting for official schedules instead of manual ingestion, `intelligence-sync` 21:00 finished `status=0`, `core.margin_short_daily` reached `2026-06-25` / 1976 rows, and `ai-recommendation-pipeline` 21:10 finished `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25` with `draft_count=120`, `candidate_count=120`, and `rag_documents=1000`. StockPlatform `/api/v1/system/freshness` now returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`, with price / chips / margin / AI recommendations all on `2026-06-25`. The 21:14 full wrapper returns cold-start `PASS=89 WARN=0 BLOCKED=0` and overall `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. The only remaining recovery red gate is DR credential escrow evidence `escrow_missing=5`; Wazuh manager registry accepted remains `0` as a security evidence blocker, not a reboot service blocker.
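The route retry gate described in the 06:26-06:28 entry above can be sketched as follows. This is a hedged illustration in Python, not the logic of `post-start-quick-check.sh`: the names `is_expected` and `gate_route` are hypothetical, the probe is injected so that no network access is needed, and "at most 3 retries" is read here as three attempts in total, which is an assumption. Status `0` stands in for the `000` that curl reports when no HTTP response is received.

```python
from typing import Callable


def is_expected(status: int) -> bool:
    """Expected route statuses per this log: any 2xx/3xx, or 401."""
    return 200 <= status < 400 or status == 401


def gate_route(probe: Callable[[], int], max_attempts: int = 3) -> str:
    """Classify one public route as PASS, WARN, or BLOCKED.

    PASS:    the first attempt returns an expected status.
    WARN:    a later attempt recovers (recorded as an evidence warning).
    BLOCKED: every attempt fails consecutively.
    """
    for attempt in range(1, max_attempts + 1):
        if is_expected(probe()):
            return "PASS" if attempt == 1 else "WARN"
    return "BLOCKED"


# Simulated probes: each call to the lambda yields the next recorded status.
flaky = iter([0, 200])       # one-time 000, then recovery
down = iter([0, 0, 0])       # consecutive failures
print(gate_route(lambda: next(flaky)))  # WARN
print(gate_route(lambda: next(down)))   # BLOCKED
```

Under this reading, the one-time `000` on IwoooS / VibeWork seen at 06:26 would be downgraded to a warning instead of blocking the whole verdict.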


@@ -11,8 +11,8 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 06:26-06:28 read-only refresh confirms the 2026-06-25 21:14 green baseline still holds. Four hosts ping/SSH OK; cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO health `V10.690`, latest import job `57 completed`, daily freshness `1|2026-06-24`; StockPlatform freshness `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The 06:26 full wrapper saw one-time route `000` on IwoooS / VibeWork, but independent curl and route-only wrapper immediately returned `200` / `PASS=31 WARN=0 BLOCKED=0`; v1.61/v1.6 now retries public routes before blocking. 06:28 core wrapper with routes skipped returned `PASS=15 WARN=2 BLOCKED=0`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0` and is a security evidence blocker, not a reboot service blocker. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 06:40-06:44 all-host read-only refresh confirms the 2026-06-25 21:14 green baseline still holds and adds 112/111/168 reachability. `110 / 120 / 121 / 188 / 112 / 111 / 168` ping and SSH port are OK. Core scope services are green: cold-start `PASS=89 WARN=0 BLOCKED=0`; public routes return expected statuses; AWOOOI API health is `healthy / prod / mock_mode=false`; MOMO health `V10.690`, latest import job `57 completed`, daily freshness `1|2026-06-24`; StockPlatform freshness `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space are workstation blockers, not reboot service blockers. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 06:44 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `b2945ab9f716d9d685434ae0e67b9318414b27fe`, and `km-vectorize` official 03:00 Taipei time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 orphan Chrome CPU pressure is cleared, and StockPlatform cron-source drift is repaired. 2026-06-25 21:13 StockPlatform `/api/v1/system/freshness` returned `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_ROUTE_RETRY_V161 | 100% | Workplan, SOP v1.61, one-page post-start quick check v1.6, route retry gate, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform 21:00 / 21:10 natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |