docs(ops): record all-host reboot readback
Some checks failed
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
@@ -39,6 +39,39 @@
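
**Boundary**: This section only completed the governance view, redacted visualization, and production-site readback. It did not SSH, run Ansible apply, perform kubectl writes, send Telegram messages, read secrets, switch providers, upgrade packages, execute shadow-canary writes, or expose internal conversation content in the frontend.
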
## 2026-06-26 | All-host read-only refresh: services green; DR / Wazuh / dual-workstation sync still tracked separately

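**Background**: The user asked to re-confirm the status of all hosts. Following the host reboot / cold-start / backup SOP, only a read-only live check was performed; no Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write operations were executed, and no plaintext secrets were read.
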
**Read-only evidence**:
- Reachability: `192.168.0.110 / 120 / 121 / 188 / 112 / 111 / 168` ping and SSH port all OK.
- 110: boot `2026-06-23 14:49`, `systemctl is-system-running=running`, failed units `0`; Docker active; Gitea / Harbor / Prometheus / Alertmanager / Sentry are green on the local host. Load is about `4-6`; at the time of the check, top was dominated by AWOOOI CI pytest, `unattended-upgrade`, Gitea, ClickHouse, and Docker / Kafka / PostgreSQL background load; no orphan Chrome seen. Swap is still full, but `vmstat` shows no live thrashing.
- 188: boot `2026-06-23 14:49`; product containers healthy, including MOMO, 2026FIFA, VibeWork, ClawBot, MinIO, and exporters; PostgreSQL / Redis / MOMO health / SigNoz available. `systemctl` is still `degraded`, with failed units `awoooi-startup.service`, `certbot.service`, `postgresql@14-main.service`, `snap.certbot.renew.service`; this is not currently a public service blocker, but it needs a separate host hygiene / certbot cleanup item.
- 120 / 121: boot `2026-06-23 14:49-14:50`, `systemctl is-system-running=running`, failed units `0`; K3s nodes `mon` / `mon1` are both `Ready control-plane`, API/Web replicas are split across both nodes, the Worker is running, and the latest official `km-vectorize` CronJob run completed.
- ArgoCD: `awoooi-prod sync=Synced health=Healthy revision=b2945ab9f716d9d685434ae0e67b9318414b27fe`.
- `km-vectorize`: schedule `0 3 * * *`, `lastSchedule=2026-06-25T19:00:00Z`, `lastSuccess=2026-06-25T19:00:14Z`, i.e. the official 03:00 Taipei-time run on 2026-06-26 succeeded.
- Public routes: AWOOOI, AWOOOI API, VibeWork, AwoooGo, MOMO, Stock, Bitan, Gitea, Harbor, Registry `/v2/`, Sentry, SigNoz, and Langfuse all return the expected 2xx/3xx/401.
- AWOOOI API health: `healthy / prod / mock_mode=false`; PostgreSQL, Redis, OpenClaw, SigNoz, and Ollama GCP A/B/local are all up.
- MOMO: `https://mo.wooo.work/health` returns `healthy`, version `V10.690`; the previous wrapper's DB evidence still holds: latest import job `57 completed`, daily freshness `1|2026-06-24`, monthly parity `15383|15383`.
- StockPlatform: `/api/v1/system/freshness` returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`.
- Backup: `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite / rclone fresh, `last_backup_all=2026-06-26 02:31:02`; credential escrow is still missing 5 items.
- 112: Kali host reachable; `wazuh-manager`, `wazuh-indexer`, `wazuh-dashboard` active, with 1514 / 1515 / 55000 listening; host `systemctl` is degraded because `networking.service` failed. This evidence only shows that Wazuh services / ports are up, not that the agent registry is accepted.
- 111: MacBook Pro reachable; about 135Gi available; `~/codex-workspaces/awoooi-dev` is readable, but branch `codex/awoooi-current-main-dev-base-20260624` is ahead 17, HEAD `56c83257`.
- 168: readable via `ogt@192.168.0.168`, hostname `WOOOMacMiniM4.local`, which is another address of the Mac Mini; the same `awoooi-dev` branch is ahead 17, but HEAD is `59485d51`, which differs from 111. This is workstation sync drift, not a reboot service blocker.
- Local Mac Mini: root/Data volume has about `3.4Gi` available, which is low; it needs to go on the workstation capacity cleanup list, but it does not affect this round's public service green.

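The `km-vectorize` bullet above relies on a UTC-to-Taipei conversion: `lastSchedule=2026-06-25T19:00:00Z` is read as the 03:00 run on 2026-06-26. A minimal sketch of that arithmetic follows, assuming a fixed UTC+8 offset for Asia/Taipei (no daylight saving) and the Kubernetes-style timestamp format quoted in the evidence; the helper name `to_taipei` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Taipei is a fixed UTC+8 offset, so no tz database is needed.
TAIPEI = timezone(timedelta(hours=8), name="Asia/Taipei")


def to_taipei(utc_stamp: str) -> datetime:
    """Parse a UTC timestamp such as 2026-06-25T19:00:00Z and convert it to Taipei time."""
    parsed = datetime.strptime(utc_stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).astimezone(TAIPEI)


last_schedule = to_taipei("2026-06-25T19:00:00Z")
last_success = to_taipei("2026-06-25T19:00:14Z")

print(last_schedule.isoformat())     # 2026-06-26T03:00:00+08:00
print(last_success - last_schedule)  # 0:00:14
```

The converted schedule time matches the `0 3 * * *` cron expression when the schedule is interpreted in Taipei time, and the success timestamp lands 14 seconds later.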
**Result**:
- Service / host / public route / K3s / ArgoCD / MOMO data / Stock data / backup core: `GREEN`.
- Current recovery verdict: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- Completion: service recovery after host reboot is `100%`; DR completion is still blocked by escrow; Wazuh registry / workstation sync are tracked separately and must not be folded into the service green.

**Still blocked / must not be claimed**:
- `DR_COMPLETE` is still blocked: `escrow_missing=5`.
- Wazuh manager registry accepted is still `0`; a route `200`, a reachable Dashboard, or an active transport or service does not justify claiming that all hosts are enrolled in Wazuh.
- 188 failed units need a separate host hygiene / certbot / startup unit readback; green public routes must not be used to mask them.
- 111 / 168 `awoooi-dev` HEADs differ, so the two Codex workspaces cannot be claimed to be fully in sync.
- Mac Mini free space is about `3.4Gi`, so workstation capacity cannot be claimed to be risk-free.

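The list above can be read as a guard that maps evidence to claims that must not be made. The sketch below illustrates that idea only; it is not the project's tooling. The `Evidence` class, every claim label except `DR_COMPLETE`, and the `MIN_FREE_GIB` threshold are assumptions introduced for illustration, and the input values are the ones recorded in this readback.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    escrow_missing: int
    wazuh_registry_accepted: int
    failed_units_188: list[str]
    workspace_heads: dict[str, str]  # host -> local HEAD of the awoooi-dev branch
    mac_mini_free_gib: float


# Hypothetical threshold; the logbook only says that about 3.4Gi is low.
MIN_FREE_GIB = 10.0


def blocked_claims(ev: Evidence) -> dict[str, str]:
    """Return each claim that must not be made, with the evidence that blocks it."""
    blocked: dict[str, str] = {}
    if ev.escrow_missing > 0:
        blocked["DR_COMPLETE"] = f"escrow_missing={ev.escrow_missing}"
    if ev.wazuh_registry_accepted == 0:
        blocked["WAZUH_ALL_HOSTS_ENROLLED"] = "manager registry accepted=0"
    if ev.failed_units_188:
        blocked["HOST_188_CLEAN"] = f"{len(ev.failed_units_188)} failed units"
    if len(set(ev.workspace_heads.values())) > 1:
        blocked["WORKSPACES_IN_SYNC"] = f"HEAD drift: {ev.workspace_heads}"
    if ev.mac_mini_free_gib < MIN_FREE_GIB:
        blocked["WORKSTATION_CAPACITY_OK"] = f"free={ev.mac_mini_free_gib}Gi"
    return blocked


readback = Evidence(
    escrow_missing=5,
    wazuh_registry_accepted=0,
    failed_units_188=[
        "awoooi-startup.service",
        "certbot.service",
        "postgresql@14-main.service",
        "snap.certbot.renew.service",
    ],
    workspace_heads={"111": "56c83257", "168": "59485d51"},
    mac_mini_free_gib=3.4,
)

for claim, reason in blocked_claims(readback).items():
    print(f"{claim}: blocked ({reason})")
```

Keeping these checks separate from the service-green gate mirrors the rule stated above: none of them is a reboot service blocker, but each one blocks its own claim.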
## 2026-06-26 | Host reboot SOP next-day readback and route retry gate

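**Background**: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` was reached at 2026-06-25 21:14. At 06:26 the next day the live read-only check was rerun to confirm that the service green still held, and to address the SOP gap where the wrapper was overly sensitive to a single route `000`.
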
@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP

> Version: v1.61
> Version: v1.62
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -12,6 +12,8 @@

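If you only need a quick decision after a reboot on whether recovery can be claimed, first run the one-page overall check, `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.
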
2026-06-26 06:40-06:44 all-host read-only refresh: `110 / 120 / 121 / 188 / 112 / 111 / 168` ping and SSH port all OK. The core reboot scope stays green: 110 `systemctl=running`, failed units `0`, Docker / Gitea / Harbor / Prometheus / Alertmanager available; 120 / 121 `systemctl=running`, failed units `0`, K3s nodes `mon` / `mon1` Ready; 188 product containers and PostgreSQL / Redis / MOMO / SigNoz available. ArgoCD `awoooi-prod` has converged from the earlier degraded state to `Synced / Healthy`, revision `b2945ab9f716d9d685434ae0e67b9318414b27fe`; the official `km-vectorize` 03:00 Taipei-time run succeeded, `lastSuccess=2026-06-25T19:00:14Z`. Public routes for AWOOOI / VibeWork / AwoooGo / MOMO / Stock / Bitan / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse return expected statuses; AWOOOI API health is `healthy / prod / mock_mode=false`; MOMO health is `V10.690`; StockPlatform freshness is `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status remains core green with `escrow_missing=5`. Boundaries: 188 host still has failed units `awoooi-startup.service`, `certbot.service`, `postgresql@14-main.service`, `snap.certbot.renew.service` that require host hygiene cleanup; 112 Wazuh services / ports are active but Wazuh manager registry accepted remains `0`; 111 / 168 Codex workspaces are reachable but have different local HEADs on the same ahead branch; Mac Mini free space is about `3.4Gi`. Current service verdict remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DR_COMPLETE` or `Wazuh recovered`.

2026-06-26 06:26-06:28 next-day read-only refresh: four hosts ping/SSH OK, cold-start `PASS=89 WARN=0 BLOCKED=0`, MOMO `V10.690` with latest import job `57 completed`, StockPlatform `/api/v1/system/freshness` still `status=ok` / `latest_trading_date=2026-06-25` / blockers `[]`, backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The first round of the 06:26 full wrapper saw a one-time `000` on `https://awoooi.wooo.work/zh-TW/iwooos` and `https://vibework.wooo.work/`, but an independent curl immediately returned `200`, and the route-only wrapper also returned `PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN`; v1.61 therefore changes the public route gate to retry up to 3 times, counting a route as `BLOCKED` only on consecutive failures and recording a recovery after retry as an evidence warning. The 06:28 core wrapper with routes skipped returns `POST_START_QUICK_CHECK PASS=15 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. No Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write operations were performed this time.

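The route retry gate described above can be sketched as follows. This is an illustration of the stated rule, not the wrapper's actual code: the function names are hypothetical, "retry up to 3 times" is interpreted here as at most 3 attempts in total, curl's `000` is modelled as status `0`, and the HTTP call is injected as a callable so the sketch runs without network access. A real gate would probably also wait between attempts.

```python
from typing import Callable


def is_expected(status: int) -> bool:
    """Expected statuses per the readback: any 2xx/3xx, or 401 for auth-gated routes."""
    return 200 <= status < 400 or status == 401


def gate_route(fetch_status: Callable[[str], int], url: str, max_attempts: int = 3) -> str:
    """Return PASS on first-try success, WARN if a retry recovered, BLOCKED otherwise."""
    for attempt in range(1, max_attempts + 1):
        if is_expected(fetch_status(url)):
            return "PASS" if attempt == 1 else "WARN"
    return "BLOCKED"


# A route that fails once with 000 and then recovers, as in the 06:26 run.
responses = iter([0, 200])
print(gate_route(lambda url: next(responses), "https://vibework.wooo.work/"))  # WARN

# A route that never answers stays blocked after all attempts.
print(gate_route(lambda url: 0, "https://vibework.wooo.work/"))  # BLOCKED
```

Returning `WARN` instead of `PASS` for a recovered route keeps the transient failure visible as evidence while no longer turning it into a blocker.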
2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. After waiting for official schedules instead of manual ingestion, `intelligence-sync` 21:00 finished `status=0`, `core.margin_short_daily` reached `2026-06-25` / 1976 rows, and `ai-recommendation-pipeline` 21:10 finished `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25` with `draft_count=120`, `candidate_count=120`, and `rag_documents=1000`. StockPlatform `/api/v1/system/freshness` now returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`, with price / chips / margin / AI recommendations all on `2026-06-25`. The 21:14 full wrapper returns cold-start `PASS=89 WARN=0 BLOCKED=0` and overall `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. The only remaining recovery red gate is DR credential escrow evidence `escrow_missing=5`; Wazuh manager registry accepted remains `0` as a security evidence blocker, not a reboot service blocker.
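The wrapper summaries quoted in these entries (`PASS=… WARN=… BLOCKED=…` plus `RESULT=…`) are easy to check mechanically. The sketch below assumes the counters and the result appear on one line in the order shown; the real output layout of `post-start-quick-check.sh` is not reproduced in this document, so treat the line format and the `parse_summary` helper as assumptions.

```python
import re

# Assumed layout: "... PASS=<n> WARN=<n> BLOCKED=<n> [RESULT=<label>]"
SUMMARY = re.compile(r"PASS=(\d+)\s+WARN=(\d+)\s+BLOCKED=(\d+)(?:\s+RESULT=(\S+))?")


def parse_summary(line: str) -> dict:
    """Extract the counters and the optional RESULT label from a summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"no PASS/WARN/BLOCKED summary found in: {line!r}")
    passed, warned, blocked, result = match.groups()
    return {
        "pass": int(passed),
        "warn": int(warned),
        "blocked": int(blocked),
        "result": result,
    }


line = (
    "POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0 "
    "RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
)
summary = parse_summary(line)
print(summary["blocked"] == 0)  # True
print(summary["result"])        # FULL_STACK_GREEN_DR_ESCROW_BLOCKED
```

A zero `BLOCKED` count alone does not mean DR is complete; the `RESULT` label still has to be read, as the 21:14 entry shows.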
@@ -11,8 +11,8 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 06:26-06:28 read-only refresh confirms the 2026-06-25 21:14 green baseline still holds. Four hosts ping/SSH OK; cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO health `V10.690`, latest import job `57 completed`, daily freshness `1|2026-06-24`; StockPlatform freshness `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The 06:26 full wrapper saw one-time route `000` on IwoooS / VibeWork, but independent curl and route-only wrapper immediately returned `200` / `PASS=31 WARN=0 BLOCKED=0`; v1.61/v1.6 now retries public routes before blocking. 06:28 core wrapper with routes skipped returned `PASS=15 WARN=2 BLOCKED=0`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0` and is a security evidence blocker, not a reboot service blocker. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 06:40-06:44 all-host read-only refresh confirms the 2026-06-25 21:14 green baseline still holds and adds 112/111/168 reachability. `110 / 120 / 121 / 188 / 112 / 111 / 168` ping and SSH port are OK. Core scope services are green: cold-start `PASS=89 WARN=0 BLOCKED=0`; public routes return expected statuses; AWOOOI API health is `healthy / prod / mock_mode=false`; MOMO health `V10.690`, latest import job `57 completed`, daily freshness `1|2026-06-24`; StockPlatform freshness `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space are workstation blockers, not reboot service blockers. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 06:44 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `b2945ab9f716d9d685434ae0e67b9318414b27fe`, and the official `km-vectorize` 03:00 Taipei-time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 orphan Chrome CPU pressure is cleared, and StockPlatform cron-source drift is repaired. 2026-06-25 21:13 StockPlatform `/api/v1/system/freshness` returned `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_ROUTE_RETRY_V161 | 100% | Workplan, SOP v1.61, one-page post-start quick check v1.6, route retry gate, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform 21:00 / 21:10 natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
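The P1 row in the table above turns a handful of `key=value` flags into a status. A minimal sketch of that evaluation follows, assuming the flags can be scraped as integer `key=value` tokens from the backup readback text. Only `BLOCKED_DR_ESCROW` comes from the table; the helper names, the `BACKUP_CORE_NOT_GREEN` label, and the use of `DONE` for the zero-escrow case are assumptions for illustration.

```python
import re


def parse_flags(text: str) -> dict[str, int]:
    """Collect integer key=value tokens such as escrow_missing=5 into a dict."""
    return {key: int(value) for key, value in re.findall(r"\b([a-z_]+)=(\d+)\b", text)}


def p1_status(flags: dict[str, int]) -> str:
    """Fail closed: a missing flag is treated the same as a bad value."""
    core_green = (
        flags.get("core_blockers") == 0
        and flags.get("offsite_fresh") == 1
        and flags.get("rclone_gdrive_fresh") == 1
    )
    if not core_green:
        return "BACKUP_CORE_NOT_GREEN"  # hypothetical label
    if flags.get("escrow_missing", 1) > 0:
        return "BLOCKED_DR_ESCROW"
    return "DONE"  # assumption: reuses the table's label for completed areas


readback = (
    "core_blockers=0, integrity_stale=0, offsite_fresh=1, "
    "rclone_gdrive_fresh=1, escrow_missing=5"
)
flags = parse_flags(readback)
print(flags["escrow_missing"])  # 5
print(p1_status(flags))         # BLOCKED_DR_ESCROW
```

Per-host counters such as `failed=0` appear once per host in the readback, so a flat dict like this one would overwrite them; they would need host-scoped parsing in a fuller version.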