
ogt 3274607af8 fix(ops): expose momo source absence after reboot [skip ci]

2026-06-27 11:56:34 +08:00



AWOOOI Full-Stack Cold Start and Host Reboot SOP

Version: v1.78 Last updated: 2026-06-27 Asia/Taipei Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

0. Latest Live Baseline and Release Verdict

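This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If its date is not today, rerun the live check first, then update this section and docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md.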

To decide quickly after a reboot whether recovery can be declared, first run the machine-readable summary: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color. The script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs and a replayable summary.txt under /tmp/awoooi-post-reboot-readiness-*.

Since v1.75, every later step in the same acceptance round must consume the same $ARTIFACT_DIR/summary.txt, for example scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file "$ARTIFACT_DIR/summary.txt" --no-color and scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file "$ARTIFACT_DIR/summary.txt" --no-color. Do not rerun live probes repeatedly within one evidence chain and then mix conclusions from different points in time.

Since v1.76, if a delegated cold-start reports a single BLOCKED AWOOOI API not reachable at the moment of a K3s rollout / CD replacement, but the wrapper's own retry of the public https://awoooi.wooo.work/api/v1/health route has already returned 2xx, that blocker is recorded only as a route/API warmup evidence warning. If the public API still fails, another non-route blocker exists, or the retry does not recover, the result stays hard blocked.

The declaration guard converts the summary into allowed / forbidden declarations so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows SERVICE_GREEN=1 but NEXT_REQUIRED_GATES is still non-empty, the dispatch checklist turns each unfinished blocker into an owner / evidence / forbidden-action checklist. When a machine-readable intake is needed, run scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --dispatch-file <dispatch.txt> --output /tmp/awoooi-post-reboot-owner-packets.json to produce the awoooi_post_reboot_next_gate_owner_packets_v1 JSON, then immediately run scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json. Dispatch, packet, and guard all pin DISPATCH_AUTHORIZED=0, REQUEST_SENT_COUNT=0, OWNER_RESPONSE_ACCEPTED=0, HOST_WRITE_AUTHORIZED=0, SECRET_VALUE_COLLECTION_ALLOWED=0, and RUNTIME_GATE=0. Until the guard passes, do not send owner requests, write escrow markers, enter a maintenance window, or declare DR / Wazuh registry complete.

Since v1.74, every owner response JSON must also pass scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file <owner-packets.json> --response-file <file>. An empty template, a placeholder, a secret payload, a runtime action request, a credential marker write, a Wazuh active response / re-enroll / restart, a Kali active scan, or missing Dashboard API / manager registry evidence must all fail closed. Passing preflight only means the response may proceed to independent reviewer acceptance; it is not runtime authorization.

When manual expansion is needed, run scripts/reboot-recovery/post-start-quick-check.sh --no-color and use docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.
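The allowed / forbidden split described above can be illustrated with a minimal Python sketch. It assumes summary.txt holds one KEY=VALUE pair per line, and the mapping from summary keys to declaration names is a guess made for illustration; it is not the logic of post-reboot-declaration-guard.py.

```python
def parse_summary(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and malformed lines."""
    summary = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        summary[key.strip()] = value.strip()
    return summary


def _as_int(summary, key):
    """Return the integer value of key, or None when missing or non-numeric."""
    try:
        return int(summary[key])
    except (KeyError, ValueError):
        return None


def declarations(summary):
    """Split declaration names into (allowed, forbidden).

    Hypothetical mapping: a missing or unparsable key never allows a claim.
    """
    checks = [
        ("SERVICE_GREEN", _as_int(summary, "SERVICE_GREEN") == 1),
        ("HOST_188_FULLY_GREEN", _as_int(summary, "HOST_188_HYGIENE_BLOCKED") == 0),
        ("DR_COMPLETE", _as_int(summary, "ESCROW_MISSING_COUNT") == 0),
        ("WAZUH_REGISTRY_RECOVERED",
         (_as_int(summary, "WAZUH_MANAGER_REGISTRY_ACCEPTED") or 0) > 0),
        ("RUNTIME_ACTION_AUTHORIZED", _as_int(summary, "RUNTIME_ACTION_AUTHORIZED") == 1),
    ]
    allowed = [name for name, ok in checks if ok]
    forbidden = [name for name, ok in checks if not ok]
    return allowed, forbidden
```

Treating every missing key as forbidden keeps the sketch fail-closed, which matches the SOP's intent that a service-green summary must never be stretched into a DR or Wazuh claim.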

v1.76 owner gate replay rule: once a round's summary has been produced, the owner packet and owner response preflight steps must prefer --summary-file "$ARTIFACT_DIR/summary.txt", for example scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --output /tmp/awoooi-post-reboot-owner-packets.json and scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --response-file <file>. Omit --summary-file only when deliberately collecting fresh live evidence; otherwise preflight must not rerun the summary itself and cause state drift within the same round.
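One way to keep a round pinned to a single summary is to build every follow-up command from the same artifact directory. The sketch below only assembles argument lists from the script paths and flags quoted in this section; the helper name replay_commands and the response file argument are illustrative assumptions, and nothing is executed.

```python
import shlex

SCRIPTS = "scripts/reboot-recovery"
PACKET_FILE = "/tmp/awoooi-post-reboot-owner-packets.json"


def replay_commands(artifact_dir, response_file):
    """Build the follow-up commands so they all replay one summary.txt."""
    summary = f"{artifact_dir}/summary.txt"
    return [
        [f"{SCRIPTS}/post-reboot-declaration-guard.py",
         "--summary-file", summary, "--no-color"],
        [f"{SCRIPTS}/post-reboot-next-gate-dispatch.sh",
         "--summary-file", summary, "--no-color"],
        [f"{SCRIPTS}/post-reboot-next-gate-owner-packets.py",
         "--no-color", "--summary-file", summary, "--output", PACKET_FILE],
        [f"{SCRIPTS}/post-reboot-owner-response-preflight.py",
         "--no-color", "--summary-file", summary, "--response-file", response_file],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for argv in replay_commands("/tmp/awoooi-post-reboot-readiness-20260627-115046",
                                "owner-response.json"):
        print(shlex.join(argv))
```

Printing the commands rather than executing them keeps the sketch read-only; an operator can review the list and confirm that every step points at the same summary before running anything.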

2026-06-27 11:51 latest live revalidation: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color artifact /tmp/awoooi-post-reboot-readiness-20260627-115046/summary.txt returned POST_START_RESULT=BLOCKED, POST_START_PASS=37, POST_START_WARN=3, POST_START_BLOCKED=2, SERVICE_GREEN=0, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. This round again repaired the 188 momo_pg_daily crontab configured drift; backup-status returns core_blockers=0 and configured_missing_188=0. K3s / ArgoCD live readback shows 120 / 121 both Ready, awoooi-prod Synced / Healthy, and the api/web/worker pods all Running. The current hard blocker is MOMO business data freshness: the latest daily_sales_snapshot is still 2026-06-24, DRIVE_INTAKE_COUNT=0, the Drive archive and global latest 即時業績_當日 are both 2026-06-25T04:21:47Z, and the latest import job 57 completed cleanly with sync_success=true. It is therefore acceptable to declare that hosts, K3s, public routes, backup core, and Stock freshness have recovered; full-stack green must not be declared until the MOMO source file is supplied and the official import pipeline updates the DB. DR complete remains forbidden because ESCROW_MISSING_COUNT=5, and Wazuh all-host enrollment remains forbidden because manager registry accepted is 0.

2026-06-27 00:58 latest live summary: this round first fixed two real SOP blockers. First, scripts/ops/recovery-scorecard-contract-check.py now treats PyYAML as optional, so a standard Python environment can also validate the recovery recording-rule contract and the DR/offsite checklist is no longer interrupted by ModuleNotFoundError: yaml. Second, the 188 ollama crontab was backed up to /home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt, and the AWOOOI momo PostgreSQL daily backup was converged from the app-side /home/ollama/momo-pro/scripts/pg_backup.sh back to the host-owned /home/ollama/bin/momo-pg-backup.sh; after refreshing the 188 textfile exporter, awoooi_backup_job_configured{host="188",job="momo_pg_daily"} reads 1. At 00:58, scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color artifact /tmp/awoooi-post-reboot-readiness-20260627-005728/summary.txt returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=3, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. In the same round, backup-status returned core_blockers=0 and configured_missing_188=0; the Prometheus live contract returned awoooi_recovery_core_ready=1 and awoooi_recovery_dr_offsite_ready=0, meaning hosts / K3s / public routes / product data / backup core have recovered, DR is still blocked only because credential escrow lacks 5 non-secret evidence markers, and Wazuh all-host registry accepted is still 0.

2026-06-27 00:02 latest live summary: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=4, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. The production route smoke in the same round returned: IwoooS 200, Wazuh read-only routes 200, VibeWork 200, AwoooGo 200, MOMO health 200, Stock 200. AWOOOI API health is healthy / prod / mock_mode=false, and the PostgreSQL / Redis / OpenClaw / SigNoz / GCP Ollama providers are up. The local Ollama endpoint is still cooldown / degraded and is covered by provider fallback, so it is not a website or API service blocker. The latest deploy marker is e506b9d5 chore(cd): deploy fe74d86 [skip ci]; this round's 89b9e67a is an SOP / scripts / docs source update, not a runtime bundle deploy marker. The 23:56 redacted readbacks of 112 Wazuh and 120 K3s still serve as adjacent evidence for this round: 120 ArgoCD is Synced / Healthy and Pods are all Running or Completed; the Wazuh manager registry is not completely empty, but WAZUH_MANAGER_REGISTRY_ACCEPTED=0 stands, so all-host enrollment cannot be declared recovered.

2026-06-26 23:56 live summary retained for comparison: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=3, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. Read-only follow-up on 120 in the same window: ArgoCD awoooi-prod is Synced / Healthy and awoooi-prod Pods are all Running or Completed; the historical failed Job km-vectorize-29689620 has been superseded by later successful Jobs on 2026-06-23, 2026-06-24, and 2026-06-25, so it is not a current service blocker. Read-only follow-up on 112 in the same window: systemd running; Wazuh manager / indexer / dashboard active; manager API root returns 401; Dashboard unauthenticated check endpoints return 401; the redacted manager registry readback shows local manager 1, managed agents 5, active managed 5, disconnected 0, never connected 0. This evidence proves the registry is no longer "completely empty", but Wazuh all-host enrollment still cannot be declared recovered, because the SOP expected scope is still 6, Dashboard API connection / version has not yet been accepted via login or owner evidence, and owner response accepted is still 0.

2026-06-26 18:46 live recovery truth supersedes the 12:13 reading of today's product data: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=PRODUCT_DATA_PENDING_EOD_WINDOW, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=0, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=core_margin_short_daily_missing,ai_recommendations_stale, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, WAZUH_MANAGER_REGISTRY_ACCEPTED=0. The long live cold-start check in the same round returned PASS=87 WARN=0 BLOCKED=0 and Result: GREEN, meaning the service layer of the 110 / 120 / 121 / 188 hosts, K3s, public routes, AWOOOI API, MOMO, backup core, exporters, cron, and Alertmanager has recovered. However, StockPlatform's official margin-short data for today has not yet been published and AI recommendations still depend on it, so it must not be declared that all product data is current. At 18:43 an authorized SIGTERM cleared two stockplatform-review-bulk-ux orphan Chrome process groups on 110 that had been running for more than 6 hours, REMAINING=0. Between 18:44 and 18:46 the local AWOOOI next build on the 168 Mac Mini was stopped and temp/build/cache plus the Antigravity backup browser recordings were cleaned, bringing /System/Volumes/Data from about 1.0Gi / 100% back to about 8.7Gi / 96%. The networking.service failure on 112 Kali has been traced to a bad shebang #\!/bin/bash in /etc/network/if-up.d/wg-nat, which causes Exec format error; Wazuh manager / indexer / dashboard remain active. Fixing the hook requires sudo elevation on 112, and no password was used or stored.

2026-06-26 12:13 latest live summary supersedes the 08:59 gate set: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=4, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_SERVICE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, HOST_188_RESULT=HOST_188_HYGIENE_GREEN., WAZUH_ROUTE_CODE=200, WAZUH_TRANSPORT_COUNT=6, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning, WAZUH_DASHBOARD_INDEX_OK=3, RUNTIME_ACTION_AUTHORIZED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. 188 host hygiene has been removed from the blockers; the only declarations still forbidden are DR credential escrow and Wazuh manager registry. The ACME HTTP-01 route and certbot timer hygiene are repaired, but do not declare that the certificate has been formally renewed until the snap certbot timer / ACME window readback is available.

2026-06-26 13:01 owner response preflight baseline: added scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color and docs/templates/post-reboot-next-gate-owner-response.json. With no response file, the output must be POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_file expected_gates=2 received=0 accepted=0 runtime_gate=0; when the template is used unmodified, the output must be POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_content expected_gates=2 received=0 accepted=0 runtime_gate=0. This gate only accepts redacted owner evidence for credential_escrow_evidence and wazuh_manager_registry_export. It sends no request, writes no escrow marker, reads no secret, performs no Wazuh / host / Kali runtime action, and does not turn a general approval message into owner accepted.

2026-06-26 17:45 single-summary replay baseline: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color now automatically writes /tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt, and the later declaration guard, next-gate dispatch, owner packet, contract guard, and owner response preflight steps of the same round all replay from this summary. The 17:45 summary returned SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. post-start-quick-check.sh also gained a route warmup classification: if every BLOCKED item in the delegated cold-start is a public route and the wrapper's own route retries have all recovered, the cold-start blocker is downgraded to an evidence warning and the round's service recovery is no longer misjudged as blocked; a non-route blocker, or a route that still fails after retry, remains hard blocked.
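The route warmup classification can be sketched as a small pure function. The (kind, name) blocker tuples, the set of recovered routes, and the three result labels are assumptions chosen for illustration; the real post-start-quick-check.sh may represent these differently.

```python
def classify_cold_start(blockers, recovered_routes):
    """Classify delegated cold-start blockers.

    blockers: list of (kind, name) tuples, where kind is "route" for a public
        route blocker and any other string for a non-route blocker.
    recovered_routes: set of route names that returned an expected status on
        the wrapper's own retry.
    """
    if not blockers:
        return "green"
    for kind, name in blockers:
        # Any non-route blocker, or a route that did not recover, stays hard.
        if kind != "route" or name not in recovered_routes:
            return "hard_blocked"
    return "evidence_warning"
```

For example, a single ("route", "awoooi-api") blocker with "awoooi-api" in the recovered set is downgraded to an evidence warning, while adding one ("backup", "momo_pg_daily") blocker makes the whole round hard blocked again.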

2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence: at that time HOST_188_HYGIENE_BLOCKED=1 and NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export. This entry is kept only to compare the state before and after the 188 repair; the current gate set must use the 12:13 baseline.

2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence: at that time the output was fixed at three P0 checklists. Since 12:13 the dispatch output is generated dynamically from the live summary; the current expected NEXT_GATE_COUNT=2, leaving only credential escrow and Wazuh registry.

2026-06-26 08:29 owner-packet JSON baseline: scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color converts the dispatch output into schema_version=awoooi_post_reboot_next_gate_owner_packets_v1, containing three owner_packets, next_gate_count=3, p0_gate_count=3, request_sent_count=0, owner_response_received_count=0, owner_response_accepted_count=0, runtime_action_authorized_count=0. This JSON is an AI / operator / owner review intake; it is neither an external request nor a maintenance window approval.

2026-06-26 08:40 owner-packet contract guard baseline retained as historical pre-repair evidence: the old version locked three P0 gates. Since 12:13 the contract guard validates dynamically against source.next_required_gates, and the current expected success line is POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0; it returns to gates=3 only if 188 hygiene regresses in the future.
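As a rough illustration of what such a contract check involves, the sketch below validates an owner-packet document against the invariants named in these baselines. It assumes the counters are top-level JSON keys with the names quoted above; the actual layout of the packet file and the real guard's checks may differ.

```python
EXPECTED_SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"

# Counters that must stay at zero: the packet is review intake, not an approval.
ZERO_FIELDS = (
    "request_sent_count",
    "owner_response_accepted_count",
    "runtime_action_authorized_count",
)


def contract_errors(doc):
    """Return a list of contract violations; an empty list means the check passes."""
    errors = []
    if doc.get("schema_version") != EXPECTED_SCHEMA:
        errors.append("unexpected schema_version")
    if doc.get("next_gate_count") != len(doc.get("owner_packets", [])):
        errors.append("next_gate_count does not match owner_packets")
    for field in ZERO_FIELDS:
        if doc.get(field) != 0:
            errors.append(f"{field} must be 0")
    return errors


def success_line(doc):
    """Format the success line in the shape quoted in this SOP."""
    return (
        "POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK "
        f"gates={doc['next_gate_count']} request_sent=0 accepted=0 runtime_gate=0"
    )
```

Because the gate count is read from the document rather than hard-coded, the same check covers both the current two-gate set and a future three-gate regression.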

2026-06-26 08:47 Wazuh registry detail baseline: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color now reports the details of Wazuh repo-side coverage / runtime gate as fixed key/value pairs: WAZUH_COVERAGE_SCOPE=6, WAZUH_DIRECT_ACTIVE=2, WAZUH_NO_TRANSPORT=1, WAZUH_SSH_BLOCKED=3, WAZUH_ROUTE_CODE=200, WAZUH_TRANSPORT_COUNT=6, WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning, WAZUH_DASHBOARD_INDEX_OK=3, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, WAZUH_RUNTIME_GATE=0. The wazuh_manager_registry_export gate of scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color places these states in CURRENT_EVIDENCE. Hard rule for interpretation: route 200, transport 6, and Dashboard index pattern 3 are not manager registry accepted; all-host enrollment and the Dashboard API repair still require owner evidence / registry export / acceptance record.

2026-06-26 08:59 declaration guard baseline retained as historical pre-repair evidence: at that time HOST_188_FULLY_GREEN was still forbidden. Since 12:13 the guard dynamically allows 188 host hygiene green based on HOST_188_HYGIENE_BLOCKED=0, but still rejects DR_COMPLETE, WAZUH_REGISTRY_RECOVERED, and RUNTIME_ACTION_AUTHORIZED.

2026-06-26 07:39 live quick-check refresh: scripts/reboot-recovery/post-start-quick-check.sh --no-color ran to completion. Ping / SSH to all four hosts was OK, the delegated cold-start returned PASS=89 WARN=0 BLOCKED=0, and the wrapper summary was POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0, warning split SERVICE=0 BOUNDARY=1 EVIDENCE=2, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. MOMO health is V10.701, the daily snapshot has 109061 rows / 2025-07-01..2026-06-24, current-month parity is 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24, and the latest import job 57 is completed. StockPlatform freshness is status=ok with latest trading date 2026-06-25; price / chips / margin / AI recommendations are all on 2026-06-25. Backup-status at 07:39 shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite/rclone fresh, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. Every entry in the extended public routes list returned the expected 2xx/3xx. 110 CPU attribution shows load of about 5.19 / 4.66 / 4.91 with CPU idle at 80%+ in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not from orphan Chrome. Allowed declarations for this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green. Forbidden declarations: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.
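The current-month parity string lends itself to a mechanical check. The sketch below assumes the six pipe-separated fields are snapshot row count, realtime row count, snapshot min date, snapshot max date, realtime min date, and realtime max date; that field order is inferred from the values and is not confirmed by this SOP.

```python
def monthly_parity(value):
    """Return True when both tables agree on row count and date range.

    Assumed field order (hypothetical):
    snapshot_rows|realtime_rows|snapshot_min|snapshot_max|realtime_min|realtime_max
    """
    parts = value.strip().split("|")
    if len(parts) != 6:
        return False  # Malformed evidence never counts as parity.
    snap_rows, rt_rows, snap_min, snap_max, rt_min, rt_max = parts
    return snap_rows == rt_rows and (snap_min, snap_max) == (rt_min, rt_max)
```

With the value recorded above, both row counts are 15383 and both ranges run from 2026-06-01 to 2026-06-24, so the check passes; a truncated or differently shaped string is rejected rather than guessed at.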

2026-06-26 07:19 follow-up: gitea/main now contains the previous round's SOP document commit 1fd5e2a8; ArgoCD awoooi-prod reads back Synced / Healthy at revision 1fd5e2a8b0f18d24eed16aa2a44286bcbf230603, with API 2/2, Web 2/2, Worker 1/1, and pods restart=0. The rerun of the full cold-start is still PASS=87 WARN=0 BLOCKED=0, result GREEN. Direct public route readback: AWOOOI API 200, AWOOOI Web 307, VibeWork 200, AwoooGo 200, MOMO health 200, Stock freshness 200, Bitan 200, Gitea 200, Harbor 200, Registry /v2/ expected 401, Sentry expected 302, SigNoz 200, Langfuse 200. Precise classification of the 188 blockers: pg_lsclusters shows host PostgreSQL 14/main down; systemctl status postgresql@14-main shows invalid primary checkpoint record and PANIC: could not locate a valid checkpoint record; certbot.service shows the sentry.wooo.work renewal rate-limited; snap.certbot.renew.service shows challenge failed; awoooi-startup.service previously tried to run pg_resetwal as root and failed. This round does not run pg_resetwal, does not reset-failed, and does not restart any service. 188 must be handled in a separate maintenance window with a rollback owner and a restore/source-of-truth plan; see docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md, and run scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color first for a read-only preflight. 110 load has dropped to about 4.83 / 4.82 / 5.52, and the top CPU consumer is the active AWOOOI Web turbo build / Docker buildx; swap is still full but memory available is about 41Gi, so swap is not cleared manually this round. The overall declaration remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED.

2026-06-26 07:02 all-host live refresh: ping and SSH port are OK for 110 / 120 / 121 / 188 / 112 / 111 / 168. 110 reports systemctl=running with 0 failed units, but load is 5.83 / 7.26 / 5.77, the top CPU consumer is the AWOOOI Web next build, and swap is still 7.8Gi / 7.8Gi; this is CI/build pressure, not an orphan Chrome or Docker incident. 120 / 121 report systemctl=running with K3s active, and nodes mon / mon1 are both Ready. ArgoCD awoooi-prod was briefly OutOfSync / Progressing at 06:57 because the rollout of deploy marker 52f61da4 was replacing API/Web/Worker; after 07:00 it stabilized at Synced / Healthy with API 2/2, Web 2/2, Worker 1/1, and API/Web still spread across mon / mon1. The rerun of the live cold-start returned PASS=87 WARN=0 BLOCKED=0, result GREEN. StockPlatform /api/v1/system/freshness briefly returned 502 about 35 seconds after the container restarted; subsequent consecutive reads all returned 200 with status=ok, latest_trading_date=2026-06-25, blockers []. A rollout warmup of this kind counts as a blocker only when the failures are consecutive. MOMO health is V10.699; cold-start direct evidence still shows current-month parity 15383 / 15383 through 2026-06-24 and daily freshness 1|2026-06-24. Backup status at 06:58: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite/rclone fresh, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. 188 product containers are healthy, but host systemctl=degraded is still a real host hygiene blocker: awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service are failed. On 112, Wazuh manager/indexer/dashboard are active and ports 1514 / 1515 / 55000 are listening, but the production Wazuh route still reports disabled_waiting_iwooos_wazuh_owner_gate, configured=false, manager registry accepted 0, runtime gate 0. 111 / 168 are reachable, but the AWOOOI dev workspaces on both are ahead 17 with different HEADs (111=56c83257, 168=59485d51); Mac Mini /System/Volumes/Data has only about 3.2Gi left. The service recovery declaration remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED, with host hygiene / DR escrow / Wazuh registry / workstation capacity explicitly listed as blockers outside service green.

2026-06-26 06:50-06:55 188 host hygiene read-only triage：188 product services remain green, but host systemctl is still degraded and must not be smoothed into full host green. Failed units are awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service. Evidence shows the host PostgreSQL cluster 14/main is down in pg_lsclusters, while product DB / exporters still respond through containerized services; therefore pg_isready or pg_up=1 cannot substitute for host cluster health. The 188 startup service detected could not locate a valid checkpoint record on 2026-06-23 and attempted pg_resetwal as root, which failed; v1.63 treats PostgreSQL checkpoint/WAL errors as break-glass only and the repo-side startup script now fails closed instead of running pg_resetwal. Certbot renew for sentry.wooo.work is also failing and hit ACME rate-limit / challenge failure, but the public cert is still valid until 2026-07-09 16:03:40 UTC. Current declaration: SERVICE_GREEN_HOST_HYGIENE_BLOCKED for 188, while overall service recovery remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED.

2026-06-26 06:40-06:44 all-host read-only refresh: ping and SSH port are OK for 110 / 120 / 121 / 188 / 112 / 111 / 168. Core reboot scope stays green: 110 reports systemctl=running with 0 failed units, and Docker / Gitea / Harbor / Prometheus / Alertmanager are available; 120 / 121 report systemctl=running with 0 failed units, and K3s nodes mon / mon1 are Ready; on 188 the product containers and PostgreSQL / Redis / MOMO / SigNoz are available. ArgoCD awoooi-prod has converged from the earlier degraded state to Synced / Healthy at revision b2945ab9f716d9d685434ae0e67b9318414b27fe; the official km-vectorize run at 03:00 Taipei time succeeded, lastSuccess=2026-06-25T19:00:14Z. Public routes for AWOOOI / VibeWork / AwoooGo / MOMO / Stock / Bitan / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse return expected statuses; AWOOOI API health is healthy / prod / mock_mode=false; MOMO health is V10.690; StockPlatform freshness is status=ok, latest_trading_date=2026-06-25, blockers []; backup-status remains core green with escrow_missing=5. Boundaries: 188 host still has failed units awoooi-startup.service, certbot.service, postgresql@14-main.service, snap.certbot.renew.service that require host hygiene cleanup; 112 Wazuh services / ports are active but Wazuh manager registry accepted remains 0; 111 / 168 Codex workspaces are reachable but have different local HEADs on the same ahead branch; Mac Mini free space is about 3.4Gi. Current service verdict remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED, not DR_COMPLETE or Wazuh recovered.

2026-06-26 06:26-06:28 next-day read-only refresh: ping/SSH to all four hosts OK, cold-start PASS=89 WARN=0 BLOCKED=0, MOMO V10.690 with latest import job 57 completed, StockPlatform /api/v1/system/freshness still status=ok / latest_trading_date=2026-06-25 / blockers [], and backup-status 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite_fresh=1, rclone_gdrive_fresh=1, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. On its first pass at 06:26, the full wrapper saw a single 000 at https://awoooi.wooo.work/zh-TW/iwooos and https://vibework.wooo.work/, but an independent curl immediately returned 200 and the route-only wrapper also returned PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN. v1.61 therefore changes the public route gate to retry up to 3 times: only consecutive failures count as BLOCKED, and a route that recovers on retry is recorded as an evidence warning. The 06:28 core wrapper with routes skipped returns POST_START_QUICK_CHECK PASS=15 WARN=2 BLOCKED=0, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. No Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write operation was performed in this run.
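The v1.61 retry rule can be expressed as a short function. This is a sketch under stated assumptions: fetch_status is a hypothetical callable that returns an HTTP status code as an integer (0 standing in for curl's 000), and the result labels are illustrative rather than the wrapper's actual output.

```python
def probe_route(fetch_status, url, attempts=3, expected=range(200, 400)):
    """Probe a public route with a bounded number of tries.

    Returns "PASS" when the first read is an expected status,
    "EVIDENCE_WARNING" when a later retry recovers, and "BLOCKED"
    only when every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if fetch_status(url) in expected:
            return "PASS" if attempt == 1 else "EVIDENCE_WARNING"
    return "BLOCKED"
```

Routes whose healthy status is outside 2xx/3xx, such as a registry endpoint that is expected to answer 401, can pass their own set through the expected parameter. Injecting fetch_status keeps the function testable without network access.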

2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. After waiting for official schedules instead of manual ingestion, intelligence-sync 21:00 finished status=0, core.margin_short_daily reached 2026-06-25 / 1976 rows, and ai-recommendation-pipeline 21:10 finished STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25 with draft_count=120, candidate_count=120, and rag_documents=1000. StockPlatform /api/v1/system/freshness now returns status=ok, latest_trading_date=2026-06-25, blockers [], with price / chips / margin / AI recommendations all on 2026-06-25. The 21:14 full wrapper returns cold-start PASS=89 WARN=0 BLOCKED=0 and overall POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. The only remaining recovery red gate is DR credential escrow evidence escrow_missing=5; Wazuh manager registry accepted remains 0 as a security evidence blocker, not a reboot service blocker.
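A freshness verdict of this kind can be checked mechanically. The sketch below assumes the endpoint returns a JSON object whose status, latest_trading_date, and blockers fields match the names quoted in this entry; the exact response shape is not shown in this SOP, so treat the parsing as illustrative.

```python
import json


def freshness_verdict(body, expected_date):
    """Return "OK" only when status is ok, the date matches, and blockers is empty."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "BLOCKED"  # Unparsable evidence is never treated as fresh.
    if not isinstance(payload, dict):
        return "BLOCKED"
    fresh = (
        payload.get("status") == "ok"
        and payload.get("latest_trading_date") == expected_date
        and payload.get("blockers") == []
    )
    return "OK" if fresh else "BLOCKED"
```

Requiring all three conditions mirrors the rule that a 200 route alone does not prove product data is current.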

2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two stockplatform-review-bulk-ux Chrome process groups 2756503 and 2829627 with root Chrome process PPID=1, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted SIGTERM at 20:24. Post-check showed no remaining PGID entries; vmstat showed CPU idle around 85-90%, si/so=0, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start PASS=89 WARN=0 BLOCKED=0, but overall POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1, RESULT=BLOCKED, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.

2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea main and live /home/wooo/stockplatform-v2 are now at fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints; six missing production cron entrypoint scripts are restored, run-intelligence-sync.sh contains the Docker-backed psql shim, and live contract check confirms every scripts/ops/*.sh referenced by install-production-cron.sh exists. The only live write performed for StockPlatform recovery was a fast-forward git pull --ff-only origin main on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: source-remediation-queue 19:56 and 20:00 succeeded, market-index-ingestion 20:00 succeeded, price-ingestion 20:02 succeeded, margin-short-ingestion 20:05 succeeded, chips-ingestion 20:06 succeeded, and ai-recommendation-pipeline 20:10 ran but correctly produced the internal blocker core_margin_short_daily_incomplete,official_margin_short_daily_official_pending. StockPlatform /api/v1/system/freshness therefore still returns status=blocked because the 2026-06-25 official margin-short source is pending and ai.recommendations must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.

2026-06-25 19:35 product-version / data-freshness refresh supersedes the 19:06 data-complete wording. Host boot, K3s, AWOOOI runtime, MOMO service/data, backup/offsite, Bitan cleanliness, and expanded public routes are available, but the stricter post-start wrapper now checks StockPlatform /api/v1/system/freshness and correctly returns RESULT=BLOCKED when product data is not current. The 19:35 lightweight wrapper run used --skip-cold-start --skip-backup --skip-cpu after the 19:24 full host/cold-start/backup readback and returned PASS=31 WARN=1 BLOCKED=1, with the single blocker StockPlatform freshness is blocked: core_margin_short_daily_missing,ai_recommendations_stale. stock.wooo.work, /healthz, and /api/healthz all return 200; public routes now covered by the wrapper include AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Do not declare "all products and data are latest" until StockPlatform freshness is ok; keep DR blocked until escrow_missing=0.

2026-06-25 19:06 post-CD live read-only refresh supersedes the 18:53 wrapper wording. Consecutive main pushes caused older deploy markers to be replaced, so the latest production truth is deploy marker d8ca8224 chore(cd): deploy 9dbe044 [skip ci]. Read-only ArgoCD shows awoooi-prod Synced / Healthy at revision d8ca822422021d0fda8da8fa4c354c0c4db7ff22; API/Web/Worker live image tag 9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be; API 2/2, Web 2/2, Worker 1/1. The 19:05 post-start quick check returns RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, delegated cold-start remains PASS=89 WARN=0 BLOCKED=0, and 19:05-19:06 route stability checks confirm AWOOOI API, IwoooS, AwoooGo, Stock, VibeWork, Bitan, and MOMO health all return 200 for five consecutive external reads. Earlier AwoooGo / Stock 502 reads were post-deploy upstream warmup transients, not persistent service failures. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on V10.690; latest import job 57 completed cleanly; MOMO_DAILY_FRESHNESS 1|2026-06-24; current-month daily snapshot and realtime tables match through 2026-06-24. post-start-quick-check.sh parses cold-start PASS / WARN / BLOCKED summary before classifying exit codes, so WARN-only rollout/stale evidence is no longer inflated into a service blocker. The wrapper returns RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED when service blockers are zero but escrow_missing=5 remains. Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh production routes are now 200 disabled_waiting_iwooos_wazuh_owner_gate, but configured=false, manager query accepted 0, manager registry accepted 0, and runtime gate 0; treat Wazuh as a security registry evidence blocker, not a reboot service blocker.

Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: 2026-06-25 19:05 wrapper delegated cold-start PASS=89 WARN=0 BLOCKED=0, Result=GREEN.
Post-start quick check: 2026-06-25 21:14 PASS=38 WARN=2 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. Cold-start layer remains GREEN and StockPlatform freshness is now OK; DR remains blocked by credential escrow evidence.
Repo-side cold-start v1.42+ live read-only run: MOMO source absence / stale data blocker is cleared by import job 57 and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Live 110 script sync is not claimed until a separate approved deployment/sync happens.
110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
Service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, StockPlatform data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared, and 110 orphan StockPlatform Chrome smoke groups cleared by targeted approved SIGTERM. StockPlatform production cron source drift is repaired and verified by natural cron runs; product-data completeness is now green for the 2026-06-25 evidence set.
Runtime release state: API/Web/Worker live image tag is `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, and 19:06 K3s readback shows API/Web/Worker pods Running; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 19:05 route smoke returned 200 for AWOOOI API, IwoooS, MOMO health, and Stock; cold-start route gate also returned expected statuses for AWOOOI web, MOMO, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, Bitan, and AIOps. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan then returned 200 for five consecutive external route reads from 19:05:38 to 19:06:24. 19:35 expanded route readback returned expected 2xx/3xx for AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
MOMO release state: mo.wooo.work health is healthy on version V10.690. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 18:23 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain recent container-replace / scheduler fail-closed / notification evidence notes, but no service blocker remains.
MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly match through 2026-06-24: `daily_sales_snapshot=109061|2025-07-01|2026-06-24`, `MOMO_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Latest import job is `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
StockPlatform data state: `/api/v1/system/freshness` returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`. Current OK sources include `core.price_daily` 2026-06-25 / 1976 rows, `core.chips_daily` 2026-06-25 / 1976 rows, `core.margin_short_daily` 2026-06-25 / 1976 rows, `core.market_index_daily.tw` 2026-06-25 / 2 rows, `core.market_index_daily.global` 2026-06-25 / 2001 rows, and `ai.recommendations` 2026-06-25 / 2868 rows. The 21:10 natural AI pipeline produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`; no manual ingestion or DB write was performed.
Product version readback: StockPlatform live source `/home/wooo/stockplatform-v2` matches Gitea `wooo/stockplatform-v2.git` main `fb91aa4c6272469d1d26e0820169629eac17d28a`; VibeWork live image `192.168.0.110:5000/vibework/web:76a4ee15026af278a3660ad4b4547e9308b107be` matches Gitea `wooo/vibework.git` main `76a4ee15026af278a3660ad4b4547e9308b107be`; AwoooGo live source `/home/wooo/awooogo` matches Gitea `wooo/AwoooGo` main `6897972e9820cbb96c508fa9a80c66246c973307`; MOMO runtime uses `registry.wooo.work/wooo/momo-pro-system:stable` image id `df931906e158` created `2026-06-25T13:28:59+08:00`, while Gitea `wooo/momo-pro-system.git` main is `25120cbf21ba51affc94d0220ec87e607f59a833`; 188 runtime directory is a compose/image deployment path, not a git worktree, so add image revision label evidence before declaring code-image parity.
Google Drive / source-file state: 14:16 cold-start reports `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`. Dedicated preflight confirms host token metadata matches scheduler UID and restrictive mode; container token artifact exists with mode `600`. Token content was not read. Future Drive auth/API failure must still be treated as failed import evidence rather than no-file success.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 20:24 readback found a second recurrence with orphan process groups `2756503` and `2829627`, root Chrome `PPID=1`, elapsed about 5h, no active parent smoke, GPU process CPU around 96%, and renderer CPU around 22%; approved targeted `SIGTERM` cleared both PGIDs. 21:14 CPU attribution shows current load is dominated by an active AWOOOI Web `next build` process group and its worker processes, not orphan Chrome. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
Backup / monitoring state: 19:05 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
Route transient handling: post-deploy `502` on Stock or AwoooGo is a blocker only if it persists after upstream container health is ready and 3-5 consecutive external route reads still fail. For AwoooGo, live upstream is on 110 `192.168.0.110:32190`; do not test only `127.0.0.1` on 110 because the listener may bind the host address. For K3s workload balancing, wait for terminating pods to disappear before judging API/Web placement; final required state for two-replica API/Web is split across `mon` and `mon1`.
Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
Deploy storm / CD replacement state: if several main commits land during recovery, older CD runs may be canceled by newer commits. Do not treat the canceled run as a service failure. Wait for the final deploy marker, verify live image tags, ArgoCD health, public routes, DB freshness, backup status, and post-start quick check before declaring latest production recovered.
Wazuh / SOC boundary state: production Wazuh read-only route presence is not equivalent to Wazuh registry recovery. `/api/iwooos/wazuh` and `/api/v1/iwooos/wazuh` returning `200 disabled_waiting_iwooos_wazuh_owner_gate` only proves the route boundary is deployed; manager registry accepted, owner evidence accepted, active response, host write, agent re-enroll, restart, secret patch, Kali active scan, and runtime gate remain `0 / false`.
Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom.
Allowed declaration: host boot, core service readiness, K3s, public route availability, AWOOOI API health, MOMO service health/data freshness, Bitan public-content cleanliness, and backup/offsite readiness are green for the latest read-only evidence set.
Forbidden declaration: all product data latest, StockPlatform data freshness green, DR complete, credential escrow complete, Wazuh host registry accepted, 110 live monitor synced, or runtime/security acceptance. Credential escrow evidence is still missing and StockPlatform freshness is blocked; neither may be smoothed into green.
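The route transient rule above (a post-deploy `502` counts as a blocker only when it persists after the upstream is ready) can be sketched as a small predicate. This is a minimal illustration, not the logic of the actual cold-start scripts: the function name, the `min_reads` default of 3 (the lower bound of the "3-5 consecutive reads" rule), and the definition of "failed read" as any status outside 2xx/3xx are assumptions.

```python
from typing import Sequence


def route_failure_is_blocker(
    upstream_ready: bool,
    statuses: Sequence[int],
    min_reads: int = 3,  # assumption: lower bound of the "3-5 consecutive reads" rule
) -> bool:
    """Return True only when a post-deploy route failure has persisted."""
    if not upstream_ready:
        # Upstream container health is not ready yet: keep observing.
        return False
    if len(statuses) < min_reads:
        # Not enough consecutive external reads to judge.
        return False
    recent = statuses[-min_reads:]
    # Assumption: a read "fails" when it is not an expected 2xx/3xx status.
    return all(not (200 <= code < 400) for code in recent)
```

A single `200` inside the window resets the judgment, which matches the intent of treating one-off `502` responses during container replacement as warmup evidence rather than a service failure.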

2026-06-24 22:17 Codex workstation continuity readback:

MacBook Pro 192.168.0.111 can authenticate to Gitea over SSH with its own public key named MacBook Pro Codex 20260624.
MOMO Pro Mac Mini workspace is /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty=0.
MOMO Pro MacBook workspace is /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty=0.
MOMO import-boundary regression: pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q => 10 passed.
MOMO production release: Gitea main and cd.yaml #904 are at 84035906aba0e5e190d031a13cfd9b47a8cd1f73; 188 live source marker proves production deploy.
Codex Start Here / workstation dashboard / scorecard safe artifacts were copied to MacBook Pro; latest artifact dashboard readback is refreshed after the docs closeout commit. Raw Codex App DB, auth, sessions, raw conversations, .env, runtime volumes, raw .git directories, passwords, tokens, and Mac Mini private keys were not copied.
AwoooGo MacBook dev workspace remains ready at /Users/ooo/codex-workspaces/awooogo-dev, branch dev, upstream gitea/dev, commit 8471b376d97c1436d4612ece17f51ba0950f114d, dirty=0.
Safe handoff artifacts still match by local / remote SHA-256 readback after Start Here / workstation dashboard / scorecard refresh. Exact hash values are intentionally not hard-coded in this runbook because they change whenever handoff artifacts are refreshed. Raw Codex App DB, auth, sessions, raw conversations, .env, runtime volumes, raw .git directories, passwords, tokens, and Mac Mini private keys were not copied.
This improves workstation continuity after host reboot / operator relocation, and the MOMO import-boundary fix is now production-deployed; it does not change service cold-start status: full-stack green remains blocked by MOMO data freshness and DR remains blocked by credential escrow evidence.

2026-06-18 12:17 live readback supersedes older service-availability wording:

Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=83 WARN=1 BLOCKED=0, Result=DEGRADED.
Service state: SERVICE_AVAILABLE_DEGRADED; 110/120/121/188 reachable, K3s mon/mon1 Ready, NODE_FS_ERROR_EVENTS=0, public routes/TLS green, 110/188 backup health fresh.
Rollout state after transient 12:14 startup window: awoooi-api 2/2, awoooi-web 2/2, worker 1/1, canary 1/1, public API health 200 healthy.
Only live warning: retained stale K8s Job km-vectorize-29689620 from 2026-06-14 03:00. Later official km-vectorize Jobs 29692500 / 29693940 / 29695380 are Complete.
Allowed declaration: services are available with one stale failed Job warning.
Forbidden declaration: full cold-start green, DR complete, or runtime/security acceptance.

2026-06-18 13:43 live readback supersedes the stale-Job warning wording:

Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=84 WARN=0 BLOCKED=0, Result=GREEN.
Service state: FULL_STACK_GREEN_FOR_SERVICE; 110/120/121/188 reachable, K3s mon/mon1 Ready, NODE_FS_ERROR_EVENTS=0, public routes/TLS green, 110/188 backup health fresh.
K8s Job classification: FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0. The retained km-vectorize failure stays as evidence but no longer blocks service readiness after later official successful Jobs.
Allowed declaration: full cold-start service readiness is green for this evidence set.
Forbidden declaration: DR complete or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.

2026-06-18 14:31 live runaway-process readback supersedes repo-only AIOps wording:

110 host runaway process exporter is live-installed and scraped.
Textfile source: /home/wooo/node_exporter_textfiles/host_runaway_process.prom.
Prometheus readback: monitor_up=1, orphan_browser_groups=0 for headless_browser_smoke and stockplatform_headless_smoke, active Gitea Actions containers=2, load5_per_core around 0.79-0.81, swap_used_ratio around 1.0, remediation_authorized=0.
Alerts: HostRunawayProcessMonitorMissing is not firing; HostOrphanBrowserSmokeHighCpu is not firing.
Allowed declaration: runaway Chrome/smoke recurrence guard is live and scraped.
Forbidden declaration: AI runtime remediation is enabled. Remediation remains gated and must not execute without owner approval, maintenance window, evidence ref, dry-run, and post-check.

2026-06-18 14:51 production event-packet readback:

Host runaway alert-to-event packet is deployed in production.
Deploy marker: 2d278568 chore(cd): deploy f358a0f [skip ci].
Runtime image: awoooi-api / awoooi-web / awoooi-worker use f358a0f6c3e614e407dedb6eee89bf10b2bc8173.
ArgoCD readback: awoooi-prod Synced / Healthy.
Alert mapping: HostOrphanBrowserSmokeHighCpu -> orphan_browser_smoke_runaway_process; HostCiRunnerLoadSaturation -> ci_runner_load_saturation.
Allowed declaration: monitoring, alert rules, live scrape, AI event packet routing, PlayBook / KM contract, and production deployment are complete for this evidence set.
Forbidden declaration: Telegram send, Bot API call, Gateway queue write, process kill, Docker/systemd restart, Nginx reload, firewall/K8s action, or runtime remediation is authorized.

2026-06-18 16:08 P3-009 Host Runaway AIOps product readback:

Host runaway AIOps closed-loop read model is deployed in production.
Deploy marker: 42c08ece chore(cd): deploy 27143fb [skip ci].
API endpoint: /api/v1/agents/agent-host-runaway-aiops-loop-readiness.
Production readback: schema_version=host_runaway_aiops_loop_readiness_v1, current_task_id=P3-009, next_task_id=P3-010, completion=100, loop_stage_count=6, alert_lane_count=2, asset_writeback_contract_count=5.
Host 110 live readback in the model: orphan browser groups=0, active CI containers=2, remediation_authorized=0, runtime/write counters=0.
Governance route: /zh-TW/governance?tab=automation-inventory shows P3-009 on desktop 1440x1100 and mobile 390x844 with missing text=0, console/page errors=0, horizontal overflow=false.
Allowed declaration: monitoring, alert rules, AI event packet, PlayBook / KM contract, Verifier/writeback contract, gated remediation dry-run boundary, and product-visible readback are complete for this evidence set.
Forbidden declaration: AI runtime remediation is enabled. Process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain forbidden without owner approval, maintenance window, evidence ref, dry-run, and post-check.

Item	2026-06-24 11:35 Asia/Taipei live result	Verdict
Overall recovery readiness	`98%`	`SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED`
P0 host / K3s recovery	`100%`	`DONE`
P1 backup / alert / escrow	`96%`	`BLOCKED_DR_ESCROW`
P2 service / data truth	`96%`	`BLOCKED_MOMO_DATA_FRESHNESS`
P3 docs / automation contracts	`100%`	`DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY`
110 host runtime	`fwupd-refresh.timer` intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; `systemctl --failed` returns `0 loaded units listed`; rollback is `sudo systemctl enable --now fwupd-refresh.timer`	`GREEN_WITH_FWUPD_TIMER_DISABLED`
110 host runaway process guard	14:31-14:32 live scrape confirms `monitor_up=1`, orphan browser groups `0`, active Gitea Actions containers `2`, `load5_per_core≈0.79-0.81`, `swap_used_ratio≈1.0`, and `remediation_authorized=0`; exporter/helper also remain in Ansible textfile exporter source-of-truth.	`LIVE_SCRAPED_RUNTIME_GATE_0`
120 reachability	ping OK, SSH OK, boot around `2026-06-14 02:23`, K3s active, node `mon Ready`	`GREEN`
121 reachability	ping OK, SSH OK, failed units `0`	`GREEN`
188 host runtime	production services green, but host `systemctl` degraded by `awoooi-startup.service`, `postgresql@14-main.service`, `certbot.service`, and `snap.certbot.renew.service`; host PostgreSQL cluster `14/main` is down while product DB containers/exporters are healthy; certbot renewal for shared `sentry.wooo.work` certificate is failing but public cert is still valid until 2026-07-09 UTC	`SERVICE_GREEN_HOST_HYGIENE_BLOCKED`
K3s node state	`mon Ready control-plane`, `mon1 Ready control-plane`; bad pods `0`; `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`	`GREEN_WITH_RETAINED_EVIDENCE`
110 -> 120 / 188 SSH trust	00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; final repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` moves deploy trust to `/home/wooo/.ssh/deploy_known_hosts`; 01:28 global `known_hosts` still contains 120 / 188 and was not clobbered by deploy marker `e4a349bc`	`GREEN_WITH_GUARDRAIL`
Backup status	11:20 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; escrow readback still shows `ESCROW_MISSING_COUNT=5`	`GREEN_WITH_DR_ESCROW_WARNING`
Offsite sync / verify	01:28 textfile: `awoooi_backup_offsite_remote_verify_ok=1`, `full_verify_fresh=1`, all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log is 2026-06-12 07:20	`GREEN`
Backup / cold-start alerts	01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules	`GREEN_WITH_EXPECTED_REDLIGHTS`
Cold-start scorecard	11:35 read-only scorecard: `PASS=86 WARN=0 BLOCKED=1`. Public routes / TLS, momo DB parity, backup exporters, 120/121 K3s, MinIO / Velero, and AWOOOI API/Web all pass; the only blocker is MOMO data freshness.	`BLOCKED_MOMO_DATA_FRESHNESS`
momo DB parity	`10936	10936
momo scheduler	container healthy; Drive listing from container works; pending folder `當日業績匯入` count is `0` for `即時業績_當日`; no current `Permission denied` evidence in the latest readback	`GREEN_WITH_SOURCE_ABSENT`
ArgoCD app health	11:35 readback: `awoooi-prod` sync `Synced`, health `Healthy`, source revision `7db7800e399caed5487a705c81ec993dec76c70f`; API/Web/Worker ready.	`GREEN`
Workload balancing	Live API/Web/Worker/CronJob image is `e999c16b3435f197b78fe2adfeec1c4faa6c4675`; API/Web pods remain split across `mon` / `mon1`, Worker single replica remains healthy on `mon`	`GREEN`
Credential escrow	5 non-secret evidence markers missing	`BLOCKED`

Release rule:

Do not declare full cold-start green unless the latest scorecard has `WARN=0` and `BLOCKED=0`.
Do not declare aggregate backup green unless latest `backup-status` has `core_blockers=0`.
Do not declare DR scorecard complete while credential escrow markers are missing.
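The three release rules above are necessary conditions, so they are easiest to express as "which declarations are forbidden for this evidence set". The sketch below assumes the four counters have already been parsed from the scorecard, `backup-status`, and the escrow readback; the `Evidence` class and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Evidence:
    warn: int            # WARN count from the latest scorecard
    blocked: int         # BLOCKED count from the latest scorecard
    core_blockers: int   # core_blockers from the latest backup-status
    escrow_missing: int  # ESCROW_MISSING_COUNT from the escrow readback


def forbidden_declarations(e: Evidence) -> List[str]:
    """List the declarations the release rule forbids for this evidence."""
    forbidden = []
    if e.warn != 0 or e.blocked != 0:
        forbidden.append("full cold-start green")
    if e.core_blockers != 0:
        forbidden.append("aggregate backup green")
    if e.escrow_missing != 0:
        forbidden.append("DR scorecard complete")
    return forbidden
```

An empty list does not by itself authorize a declaration; it only means none of these three rules is violated.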

2026-06-14 18:15 live rule:

110 / 120 / 121 / 188 core service recovery remains available, but the latest 18:15 scorecard is DEGRADED because `WARN=1`.
GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
NO-GO for "full cold-start green" until `km-vectorize` failed Job is cleared by an official successful run.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
Current allowed wording: "core service and backup are available; 110 failed units are cleared after intentionally disabling `fwupd-refresh.timer`; the recovery readback after the high-value config Owner Packet front-end sync shows no service regression; cold-start is degraded only by the `km-vectorize` official Job failure; DR complete still blocked by credential escrow; `km-vectorize` failed Job is retained but failed Pod/log are currently absent, so the next official 03:00 run remains the evidence gate."

2026-06-18 12:17 live rule:

GO for controlled service availability: PASS=83 WARN=1 BLOCKED=0, public routes/TLS green, API health 200 healthy, API/Web/Worker/Canary ready after rollout convergence.
GO for repo-side reboot readiness mechanism: readiness audit PASS=185 WARN=1 BLOCKED=0; only skipped live gate warning before the live check was run.
NO-GO for "full cold-start green" until the retained stale failed Job evidence is either cleared by normal K8s history policy or explicitly accepted by an owner-provided readback package.
NO-GO for "DR complete" while credential escrow evidence markers remain missing.
Do not delete the failed Job manually during routine SOP verification. Keep it as evidence unless an approved maintenance window explicitly authorizes cleanup.
Current allowed wording: "SOP / Plan B / automation contracts are complete; live services are available with one retained stale km-vectorize failed Job warning; hard blockers are zero; DR remains blocked by credential escrow evidence."

2026-06-18 13:43 live rule:

GO for full cold-start service readiness for this evidence set: PASS=84 WARN=0 BLOCKED=0.
GO for controlled runner/CD release under the normal security gates; this is not a bypass for owner response, runtime writer, Telegram, Gateway, K8s, Docker, Nginx, firewall, or secret operations.
GO for retaining stale failed Job evidence: FAILED_JOBS=1 and STALE_FAILED_JOBS=1 are allowed when ACTIVE_FAILED_JOBS=0 and later official successful Jobs exist.
NO-GO for DR complete while credential escrow evidence markers remain missing: ESCROW_MISSING_COUNT=5.
NO-GO for deleting retained failed Jobs during routine verification. Cleanup requires an explicit maintenance window and owner acceptance.
Current allowed wording: "full-stack service recovery is green for the current evidence set; stale km-vectorize failure is retained as historical evidence, not an active blocker; DR complete remains blocked by credential escrow evidence."
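The stale-versus-active failed Job distinction used in the 13:43 rule can be sketched as follows. The assumption here is that a failed Job counts as stale once a later official successful Job of the same CronJob exists, and that a larger numeric suffix (such as `29689620`) means a later scheduled run; `JobRun` and `classify_failed_jobs` are hypothetical names, not part of the cold-start scripts.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JobRun:
    cronjob: str
    run_id: int            # assumption: larger suffix means a later scheduled run
    succeeded: bool
    official: bool = True  # False for a manual run (an UnexpectedJob)


def classify_failed_jobs(runs: Iterable[JobRun]) -> Dict[str, int]:
    """Count failed Jobs and split them into stale and active."""
    runs = list(runs)
    failed = [r for r in runs if not r.succeeded]
    stale = [
        f for f in failed
        if any(
            r.succeeded and r.official
            and r.cronjob == f.cronjob and r.run_id > f.run_id
            for r in runs
        )
    ]
    return {
        "FAILED_JOBS": len(failed),
        "STALE_FAILED_JOBS": len(stale),
        "ACTIVE_FAILED_JOBS": len(failed) - len(stale),
    }
```

Only `ACTIVE_FAILED_JOBS` would block service readiness under this reading; stale failures stay visible as retained evidence.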

After any future 120 recovery, rerun this exact chain from 110:

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
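Because the chain above is order-dependent, a wrapper should stop at the first failing step instead of running later steps against incomplete backups. The sketch below assumes each script exits non-zero on failure, which this runbook does not state; `run_chain` and the injectable `run` parameter are hypothetical and exist so the flow can be rehearsed without touching the hosts.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

CHAIN: List[List[str]] = [
    ["/backup/scripts/backup-configs.sh"],
    ["/backup/scripts/backup-all.sh"],
    ["/backup/scripts/sync-offsite-backups.sh", "--mode", "sync"],
    ["/backup/scripts/verify-offsite-full-sync.sh", "--write-textfile", "--no-color"],
    ["/home/wooo/scripts/full-stack-cold-start-check.sh", "--monitor-read-only",
     "--no-color", "--watch", "--interval", "1", "--max-attempts", "1"],
]


def _run(cmd: Sequence[str]) -> int:
    # Assumption: a non-zero exit code means the step failed.
    return subprocess.run(list(cmd), check=False).returncode


def run_chain(
    chain: Sequence[Sequence[str]] = CHAIN,
    run: Callable[[Sequence[str]], int] = _run,
) -> Tuple[List[Sequence[str]], Optional[Sequence[str]]]:
    """Run steps in order; return (completed steps, first failed step or None)."""
    completed: List[Sequence[str]] = []
    for cmd in chain:
        if run(cmd) != 0:
            return completed, cmd
        completed.append(cmd)
    return completed, None
```

Passing a fake `run` function gives a dry rehearsal of the stop-on-first-failure behavior without executing any script.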

0.1 When To Use This

Use this SOP when any of these happen:

110/120/121/188 reboot unexpectedly.
All services are abnormal after a power/network event.
K3s is stuck activating.
Host load remains high during startup and service health is mixed.
Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.

The rule is simple: recover the dependency chain, not the loudest symptom.

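0.2 Startup Readiness Layers

A single signal is never enough to declare a reboot complete. Every host, and the platform as a whole, must be judged against the five layers below:

Layer	Meaning	Minimum evidence	Does not imply
`HOST_POWERED`	The host or VM appears powered on	console / hypervisor shows running, or LAN ARP entries start to appear	The OS has finished booting
`HOST_BOOTED`	The OS has reached an interactive state	ping OK, SSH port open, `who -b` shows this boot's time	systemd / Docker / K3s are healthy
`HOST_READY`	Base host services can carry the next layer	`systemctl is-system-running` is not degraded; failed units are explainable; cron / docker / DB / K3s are normal for the host's role	Public routes or business data are healthy
`SERVICE_READY`	Services hosted on the machine are usable	service health, port, container health, and DB / Redis / K3s / Harbor / Alertmanager checks pass	Backups, schedules, alerts, data consistency, and data freshness have been verified
`FULL_STACK_GREEN`	Reboot recovery may be declared complete	cold-start scorecard `WARN=0`, `BLOCKED=0`; backup / offsite / DB / alerts / schedules / data freshness all green	Can never be declared while 120 is unreachable or MOMO business data is stale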
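The layers are cumulative: a later layer only counts when every earlier layer has evidence. A minimal sketch of that rule, assuming the per-layer results have already been reduced to booleans (the function name and input shape are hypothetical):

```python
from typing import Mapping, Optional

LAYERS = [
    "HOST_POWERED",
    "HOST_BOOTED",
    "HOST_READY",
    "SERVICE_READY",
    "FULL_STACK_GREEN",
]


def highest_layer(passed: Mapping[str, bool]) -> Optional[str]:
    """Return the highest layer reached without skipping an earlier one."""
    reached = None
    for layer in LAYERS:
        if not passed.get(layer, False):
            break  # a gap invalidates every later layer
        reached = layer
    return reached
```

With this shape, a green public route on a host whose boot state cannot be confirmed still reports only the last layer that was actually proven.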

Convergence verdict for the 2026-06-12 110/120 incident:

110 HOST_READY = yes
120 HOST_READY = yes
Core public services SERVICE_READY = yes
FULL_STACK_GREEN = yes, because cold-start scorecard is PASS=83 WARN=0 BLOCKED=0
DR_COMPLETE = no, because credential escrow evidence is incomplete

Verdict for the 2026-06-24 MOMO data staleness:

110 / 120 / 121 / 188 HOST_READY = yes
Core public services SERVICE_READY = yes
MOMO_RELEASE_CURRENT = yes, because mo.wooo.work health is V10.653 and Gitea main / CD #904 deployed commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73
MOMO_DB_PARITY = yes
MOMO_DATA_FRESH = no, because latest daily_sales_snapshot date is 2026-06-17 and stale age is 7 days as of 2026-06-24 22:40
MOMO_SOURCE_AVAILABLE = no, because Drive intake 當日業績匯入 has no newer 即時業績_當日 source, scheduler stats show repeated file_count=0 runs, and Mac Mini / MacBook candidate files only contain old or header-only data
FULL_STACK_GREEN = no, because live cold-start scorecard is PASS=86 WARN=0 BLOCKED=1 and repo-side v1.42 dry-run is PASS=88 WARN=0 BLOCKED=1 with blocker "188 momo source file absent while daily sales data stale"
DR_COMPLETE = no, because credential escrow evidence is incomplete

MOMO source absence recovery gate:

GO: declare MOMO service recovered when health is healthy, containers are healthy, scheduler runs, DB parity matches, and release version matches Gitea/CD.
NO-GO: declare MOMO data current while Drive intake has no newer 即時業績_當日 source file and latest DB bounds stop at 2026-06-17.
NO-GO: re-import stale local samples, product catalog exports, header-only sheets, or already imported archive files to fake freshness.
NO-GO: truncate, whole-DB restore, manual Drive movement, or manual import without explicit maintenance approval.
UNBLOCK: new legitimate PChome daily-sales source appears in 當日業績匯入 or an owner-approved safe import path; import job succeeds with sync_success=true; source file moves only after success; daily_sales_snapshot and realtime_sales_monthly bounds match; MOMO_DAILY_FRESHNESS <= 2.
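The UNBLOCK conditions above must all hold at once, so a checker that returns the list of unmet conditions is more useful than a single boolean. This is a sketch under assumptions: the `MomoEvidence` fields are hypothetical names for values an operator would read back, and the bounds are compared as plain `(min_date, max_date)` string tuples.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MomoEvidence:
    source_or_approved_path: bool        # new legitimate source, or owner-approved safe import path
    sync_success: bool                   # import job finished with sync_success=true
    moved_only_after_success: bool       # source file moved only after a successful import
    snapshot_bounds: Tuple[str, str]     # daily_sales_snapshot (min_date, max_date)
    monthly_bounds: Tuple[str, str]      # realtime_sales_monthly (min_date, max_date)
    daily_freshness: int                 # MOMO_DAILY_FRESHNESS


def momo_unblock_failures(e: MomoEvidence) -> List[str]:
    """Return unmet UNBLOCK conditions; an empty list means the gate can open."""
    failures = []
    if not e.source_or_approved_path:
        failures.append("no new source or approved import path")
    if not e.sync_success:
        failures.append("import job did not report sync_success=true")
    if not e.moved_only_after_success:
        failures.append("source file moved before success")
    if e.snapshot_bounds != e.monthly_bounds:
        failures.append("snapshot and monthly bounds differ")
    if e.daily_freshness > 2:
        failures.append("MOMO_DAILY_FRESHNESS above 2")
    return failures
```

Service health is deliberately absent from the inputs: the GO line covers service recovery, while this gate covers only data freshness.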

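All reports must use this vocabulary, so that "service plane available" is never misreported as "overall DR complete".

0.3 Codex Workstation Handoff Verdict

If Codex development must continue from the Mac Mini / MacBook Pro after a reboot, confirm the Codex safe handoff artifacts separately. Do not conflate service recovery with syncing raw Codex conversations.

2026-06-24 22:17 Asia/Taipei readback: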

MacBook Pro 192.168.0.111 SSH = OK
Safe artifacts synced = Start Here and workstation dashboard readback matched; current SHA-256 values are tracked in the workstation dashboard artifact and local sha256sum readback
Start Here readback = registry_ready 3, registry_blocked 8, latest_dev_on_gitea 3, production_on_gitea 8, raw_history_sync False
Workstation dashboard readback = artifact_sync_synced 2, artifact_sync_blocked 0, MOMO current main baseline ready 2
MOMO Pro Mac Mini workspace = /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty 0
MOMO Pro MacBook workspace = /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty 0
AwoooGo MacBook workspace = ready on dev commit 8471b376d97c1436d4612ece17f51ba0950f114d, dirty 0

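Allowed declarations:

The Mac Mini / MacBook Pro have synced the Codex start-of-work entry point and the governance snapshot.
MOMO Pro work can start on the Mac Mini / MacBook Pro from the Gitea current-main Codex baseline; before implementing, still branch a new codex/<task> from codex/momo-current-main-dev-base-20260624.
The MOMO import-boundary fix has been deployed to production via main / CD #904; the next real import file is still needed to verify that the failure boundary prevents the file move.

Forbidden declarations:

Raw Codex / ChatGPT chat history has been synced.
Every product can be developed in sync on both machines.
Treating MOMO Pro program version V10.653 as proof that MOMO business data is up to date.
2026FIFA / Agent Bounty owner preflight has passed.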

1. Golden Startup Order

0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed

Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.

1.1 Dependency Graph

flowchart TD
  network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
  network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
  data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
  obs110 --> k3s
  k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
  workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
  workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
  public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
  schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
  highload --> ai["AI auto-remediation: limited execution"]
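The same graph can be written as a dependency map, which makes two questions mechanical: in what order phases may start, and which phases are held back when one node is down. The sketch below uses the node names from the flowchart and Python's standard `graphlib`; the helper names are hypothetical and the real source of truth remains the YAML baseline.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# node -> the nodes it depends on (edges copied from the flowchart above)
DEPENDS_ON: Dict[str, Set[str]] = {
    "data188": {"network"},
    "obs110": {"network"},
    "k3s": {"data188", "obs110"},
    "workload": {"k3s"},
    "alertchain": {"workload"},
    "public": {"workload"},
    "schedules": {"public"},
    "highload": {"schedules"},
    "ai": {"highload"},
}


def startup_order() -> List[str]:
    """One valid start order in which every dependency comes first."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def held_back_by(down: Set[str]) -> Set[str]:
    """Nodes that must not be released because something upstream is down."""
    blocked = set(down)
    changed = True
    while changed:
        changed = False
        for node, deps in DEPENDS_ON.items():
            if node not in blocked and deps & blocked:
                blocked.add(node)
                changed = True
    return blocked - set(down)
```

For example, `held_back_by({"obs110"})` returns everything from `k3s` onward, which is the "recover the dependency chain, not the loudest symptom" rule in executable form.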

This is also captured in the machine-readable baseline:

ops/reboot-recovery/full-stack-cold-start-baseline.yml

The YAML baseline is the source of truth for:

hosts, roles, and SSH users
phase ordering
service startup dependencies
endpoint success codes
schedule freshness thresholds
stateful-service protection boundaries
AI automation release gates

1.2 Phase Gate Logic

Each phase has the same decision rule:

Result	Meaning	Action
`BLOCKED`	A dependency required by later phases is down.	Stop phase release and fix the first blocked gate.
`WARN`	Core dependency passed, but confidence is incomplete.	Continue diagnosis, but do not release runner/CD/AI full execution.
`GREEN`	All checks in scope passed.	Release the next phase only.
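The decision rule above reduces to a severity fold per phase plus a walk that releases phases only while each one is green. The sketch below is a simplified illustration: the check results are assumed to be plain strings, requiring at least one `PASS` for `GREEN` is an assumption, and treating `WARN` as "do not release the next phase" is the conservative reading of the table.

```python
from typing import List, Optional, Sequence, Tuple


def phase_result(results: Sequence[str]) -> str:
    """Fold the check results of one phase into BLOCKED / WARN / GREEN."""
    if "BLOCKED" in results:
        return "BLOCKED"
    if "WARN" in results:
        return "WARN"
    if "PASS" not in results:
        # Assumption: a phase with no passing check is not proven green.
        return "WARN"
    return "GREEN"


def release_phases(
    phases: Sequence[Tuple[str, Sequence[str]]],
) -> Tuple[List[str], Optional[str]]:
    """Return (released phases, first phase that stopped the release)."""
    released: List[str] = []
    for name, results in phases:
        if phase_result(results) != "GREEN":
            return released, name
        released.append(name)
    return released, None
```

The second return value names the first gate to fix, which is the only phase worth working on until it turns green.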

The cold-start flow is intentionally conservative:

P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution

The final release condition is not "containers are running". It is:

PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN

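1.3 Reboot GO / NO-GO Decision Tree

Before each maintenance, use this table to decide whether a reboot may proceed and which level may be declared afterwards.

Scenario	GO / NO-GO	Allowed scope	Declaration ceiling
03:00 offsite sync is running	`NO-GO`	Read-only observation; wait for the sync to finish, then run the verifier	Do not declare maintenance complete
120 is unreachable, but only 110 is rebooted	`CONDITIONAL GO`	May declare only 110 / public service recovery; must not run the 120 backup fix	`SERVICE_READY`, never `FULL_STACK_GREEN`
188 data layer is unhealthy	`NO-GO`	Fix PostgreSQL / Redis / Docker / SignOz / momo DB first	Do not release K3s / runner / AI
110 Harbor / Registry is unhealthy	`NO-GO for K3s deploy`	Fix the registry first; K3s image pulls may fail	Do not release CD / deploy
120 / 121 both Ready, offsite verifier green	`GO`	The full cold-start release chain may run	Requires scorecard `WARN=0 BLOCKED=0`
credential escrow marker missing	`GO for service reboot`, `NO-GO for DR complete`	Services may be recovered; the DR scorecard must not be declared complete	`SERVICE_READY` or `BLOCKED_DR_ESCROW`
Alertmanager required rules not visible	`NO-GO for unattended window`	Fix alert rules / drift guard first	Do not release AI auto-remediation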
GO only authorizes execution within the stated scope; it does not mean completion. Completion must always go back to the §15 Done Criteria.

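1.4 Plan B: Degraded Operation and Return Path

Plan B is not a second reboot procedure that bypasses preflight, nor is it authorization to modify hosts ad hoc during an incident. Plan B defines in advance, for the case where Plan A cannot reach FULL_STACK_GREEN within the maintenance window, the minimum acceptable service target, the stop line, the degradation levels, the per-host paths, and the conditions for returning to Plan A.

The goal of Plan A is:

B4_FULL_STACK_GREEN: cold-start scorecard WARN=0 / BLOCKED=0, with backup, offsite, DB, alert, scheduler, K3s, public route, and business data freshness all green.

The goal of Plan B is:

Protect core services and data integrity first, do not widen the blast radius, do not misreport partial availability as full-stack green, and leave the next blocker as a trackable work item.

The machine-readable contract for Plan B lives in the plan_b block of ops/reboot-recovery/full-stack-cold-start-baseline.yml; scripts/reboot-recovery/reboot-recovery-readiness-audit.sh must check that both the SOP and the baseline retain B0-B5, the T+120 stop line, and the three closing states. If any of these fields is missing, the readiness audit must return BLOCKED.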

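Plan B Red Lines

Red line	Concrete requirement
No fake green	Do not use route 200, pod up, container up, UI visible, CD success, or a single smoke pass to declare full recovery.
Do not silence correct red lights	Red lights for 120 / backup / credential escrow / alert / scheduler must stay on when they reflect a real gap.
No unauthorized writes	Without a maintenance window and human approval: do not restart the Docker daemon, reload Nginx, change firewall / iptables, run `kubectl patch` against live, read secrets, or perform destructive recovery.
Do not release high-risk automation	CD runner, AI auto-remediation, heavy crawler, batch import, and repair bot stay frozen until the dependency chain is green.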

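Plan B Trigger Conditions

Trigger	Immediate action	Declaration ceiling
03:00 offsite sync, 02:00 backup, or full verifier still running	Postpone the reboot; wait read-only for completion	`B0_ABORTED_BEFORE_REBOOT`
Any P0 host still unreachable by ping / SSH 15 minutes after reboot	Stop releasing the next layer and start the matching host path	`B1_HOST_RECOVERY_ONLY`
Any core data plane on 188 (PostgreSQL / Redis / momo / SignOz) unhealthy	Freeze K3s deploy, runner, and AI auto-remediation	`B1_HOST_RECOVERY_ONLY`
110 Harbor / Gitea / Alertmanager / Prometheus unhealthy	Freeze CD / deploy / image-pull related flows	`B2_CORE_SERVICE_READY` or below
120 or 121 alone unhealthy, but the other control-plane can carry the load	Enter single-node K3s service mode and keep the HA red light	`B2_CORE_SERVICE_READY`
Public route available, but any of DB / backup / alert / schedule not green	Mark `ROUTE_GREEN_ONLY`; do not declare service green	`B2_CORE_SERVICE_READY`
cold-start `WARN>0`, `BLOCKED=0`	May declare services available but still degraded	`B3_SERVICE_AVAILABLE_DEGRADED`
credential escrow missing	Service recovery may complete; DR complete must not be declared	`B4_FULL_STACK_GREEN` or below; `B5_DR_COMPLETE` forbidden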

Plan B Host Paths

Failure domain	Degradation path	Conditions to return to Plan A
110 failure	Keep 120 / 121 K3s and 188 data; freeze CD, runner, Harbor image push, and Alertmanager outbound; first confirm whether Gitea / Harbor / Prometheus / Alertmanager are only affected at the host-service layer.	110 `HOST_READY`; Harbor / Gitea / Prometheus / Alertmanager healthy; backup-status shows no 110 core blocker; cold-start 110 checks green.
120 failure	121 carries the K3s control-plane; keep the `120_DEGRADED` red light; do not declare K3s AA; do not run the 120 backup fix; use console / fsck recovery if necessary.	120 ping / SSH OK, root filesystem rw, `k3s active`, node `mon Ready`, and the backup-configs / backup-all / offsite / cold-start chain all pass.
121 failure	120 carries the K3s control-plane; keep the `121_DEGRADED` red light; do not declare workload balanced; avoid non-essential rollouts.	121 ping / SSH OK, `k3s active`, node `mon1 Ready`, and API/Web placement back to max skew <= 1.
188 failure	Protect the data plane first: PostgreSQL, Redis, momo DB, SignOz, Ollama / AI provider; freeze batch / crawler / AI flows that write data or generate heavy load.	188 `HOST_READY`; PostgreSQL / Redis / momo parity / SignOz / AI provider route healthy; backup/status shows no 188 core blocker.
K3s degraded	Keep existing healthy Pods; inspect nodes / pods / events / VIP / NodePort first; avoid blindly restarting k3s or deleting Pods.	`mon` / `mon1` Ready; API/Web/Worker rollout healthy; public API/Web / alert webhook / scorecard pass.
Public gateway degraded	Protect internal API / VIP / data; do not reload Nginx or change DNS/TLS/certbot/firewall unless there is an owner-approved maintenance window.	Nginx config owner evidence, route smoke, TLS / ACME, rollback owner, and post-check plan all pass.

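Plan B service levels

Every status report during maintenance must use one of the following levels; phrases such as "almost done" or "should be fine" are forbidden:

Level	Meaning	Minimum evidence
`B0_ABORTED_BEFORE_REBOOT`	Preflight found a NO-GO; the reboot is cancelled or postponed	No runtime write operation was performed; the NO-GO blocker is recorded.
`B1_HOST_RECOVERY_ONLY`	Only host-layer recovery is complete	Ping / SSH / boot time / basic systemd state of the target host can be determined; services are not yet fully verified.
`B2_CORE_SERVICE_READY`	Core services are available, but the full dependency chain has not passed	Public route, API, DB, or the main K3s surface is available; backup / alert / scheduler / scorecard are not yet all green.
`B3_SERVICE_AVAILABLE_DEGRADED`	Core services are available; cold-start has no hard block but still has WARN	cold-start `BLOCKED=0`; every WARN is listed explicitly and not silenced.
`B4_FULL_STACK_GREEN`	Recovery from this reboot is complete	cold-start `PASS>0 WARN=0 BLOCKED=0`; backup / offsite / DB / alert / scheduler / data freshness all green.
`B5_DR_COMPLETE`	DR is complete	`B4` plus credential escrow missing `0`; restore / escrow / offsite evidence complete.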
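
The counter thresholds for the top levels in the table above can be checked mechanically. The following is a minimal sketch, assuming the cold-start summary has already been reduced to four integers; the function name `plan_b_level`, its argument order, and the `BELOW_B3` label are hypothetical and are not part of the repo scripts. The counters are necessary but not sufficient: B4 and B5 also need the backup, restore, and escrow evidence listed in the table, which this sketch does not inspect.

```shell
# Hypothetical helper: map cold-start counters to the highest Plan B level
# they can support. Args: PASS WARN BLOCKED ESCROW_MISSING (all integers).
plan_b_level() {
  pass=$1; warn=$2; blocked=$3; escrow_missing=$4
  if [ "$blocked" -gt 0 ] || [ "$pass" -le 0 ]; then
    # A hard block (or no passing gate) caps the report at B2 or lower.
    # Which of B0-B2 applies needs human evidence, so it is not guessed here.
    echo "BELOW_B3"
  elif [ "$warn" -gt 0 ]; then
    echo "B3_SERVICE_AVAILABLE_DEGRADED"
  elif [ "$escrow_missing" -gt 0 ]; then
    echo "B4_FULL_STACK_GREEN"
  else
    echo "B5_DR_COMPLETE"
  fi
}
```

For example, `plan_b_level 12 0 0 3` prints `B4_FULL_STACK_GREEN`, because missing escrow evidence is the only thing separating it from B5.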

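Plan B execution timeline

T+0      Freeze CD / runner / AI auto-remediation / heavy batch; preserve console, journal, backup, and scorecard evidence.
T+5      Determine HOST_POWERED / HOST_BOOTED / HOST_READY; if any P0 host is unreachable, enter host Plan B.
T+15     If 188 data or 110 registry / observability is unhealthy, stop releasing K3s, runners, and AI.
T+30     If the public route is available but DB / backup / alert / scheduler has not passed, report B2 only; full green must not be declared.
T+60     The cold-start scorecard must be run; if it is still WARN / BLOCKED, record the Plan B level and the next blocker.
T+120    If B4 is still not reached, open an incident / follow-up; do not extend the window to perform unauthorized runtime writes.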

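Plan B close-out conditions

Plan B may close out only in one of the following three states:

Close-out state	Condition	Next step
`RETURNED_TO_PLAN_A`	Blockers are cleared and the full Plan A chain has been verified	Update the reboot ledger; record actual elapsed time and any deviation from the SOP.
`SERVICE_AVAILABLE_DEGRADED`	Services are available but the scorecard is still WARN, or a DR / escrow / governance gate is incomplete	Keep the red light; open the next owner / evidence / maintenance task.
`OPEN_INCIDENT_REQUIRED`	Any of P0 host, data, K3s, gateway, backup, or alert is still hard blocked	Stop the maintenance window, preserve evidence, and escalate to incident handling.

The professional standard for Plan B is not "guaranteed green every time". It is that after every reboot you can quickly tell which layer you have reached, what must not be declared, what the next blocker is, and whether it is safe to return to Plan A.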

2. Automation Freeze

Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.

Item	Cold-start policy	Reason
Gitea/GitHub runners	Last	Build jobs can saturate 110 CPU/RAM.
momo-scheduler / crawlers	Last	Chrome and batch work can saturate 188.
Sentry/Snuba consumers	Controlled	Kafka backlog and ClickHouse merge can create temporary high load.
Alertmanager outbound notification	Gate	Avoid alert storms before API webhook and Telegram are verified.
AI auto-repair	Observe-only	Metrics, Redis, KM, and playbooks may be incomplete.
Stateful DB restart	Human approval	PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets.

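2.1 Freeze checklist

After entering the maintenance window, first move every source that can amplify an incident to observe-only, or defer its release. If this step is skipped, later load and alerts become hard to interpret.

Order	Target	Read-only check	Allowed action	Forbidden action
1	runner / CD	`systemctl list-units "actions.runner.*"`, Gitea Actions running jobs	Pause new jobs; wait for jobs that can finish to finish	Restarting the Docker daemon to interrupt jobs
2	AI auto-remediation	Prometheus / Alertmanager / cold-start monitor status	Switch to observe-only; keep alerts	Automatic restart of stateful services
3	momo scheduler / crawler	container health, recent logs, DB parity	Defer heavy import; keep existing data	Forcing an import while the DB is not green
4	Sentry / Snuba	ClickHouse / Kafka health, consumer restart loop	Control the consumer release order	Generic compose down/up restart of the whole stack
5	K3s workload	node readiness, pods, events	cordon/drain according to node state	Declaring drain success while 120 is unreachable

When several working sessions handle an incident at the same time, the first priority is to avoid interrupting one another: while anyone is converging Docker / Nginx / firewall / K3s write operations, every other session stays read-only until there is an explicit handover.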

2.2 CD / SSH Trust Guardrail

The 2026-06-13 cold-start false red light showed that if the CD workflow uses ssh-keyscan ... > /home/wooo/.ssh/known_hosts, it overwrites the user-level global SSH trust on 110, so strict SSH checks from 110 to 120 / 188 fail. That misclassifies hosts that have actually recovered as blocked.

Fixed rules:

Item	Correct practice	Forbidden
Deploy-only host key	Write to `/home/wooo/.ssh/deploy_known_hosts`	Writing to or overwriting `/home/wooo/.ssh/known_hosts`
Deploy SSH options	`-o UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`	Sharing the operator / cold-start `known_hosts`
Cold-start SSH trust	Keep the verified fingerprints for 120 / 188; back up before any repair	Running `ssh-keygen -R` or rebuilding the whole file without fingerprint cross-verification
Verification	After CD, check `known_hosts` mtime, the 120 / 188 entries, and strict SSH	Looking only at the CD success badge

2026-06-13 fix anchors:

Source fix: Gitea main contains 80e6ec1a fix(ci): avoid clobbering runner known hosts.
Deploy marker: after e4a349bc chore(cd): deploy 414413a [skip ci], the /home/wooo/.ssh/known_hosts mtime still stood at 2026-06-13 01:20:02 +0800; CD did not overwrite it.
Deploy isolated file: /home/wooo/.ssh/deploy_known_hosts mtime 2026-06-13 01:24:05 +0800.
Global strict entries: 120 ED25519 line 4 and 188 ED25519 line 5 are still present; strict SSH to wooo@192.168.0.120 and ollama@192.168.0.188 must pass.
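
The guardrail above comes down to two habits: deploy steps pass an isolated known-hosts file, and the operator file's mtime is compared before and after CD. A minimal sketch follows; the helper names `deploy_ssh_opts` and `known_hosts_untouched` are hypothetical, and `stat -c %Y` assumes GNU coreutils as found on the Linux hosts.

```shell
# Deploy-only trust file from the rules above; never the operator known_hosts.
DEPLOY_KNOWN_HOSTS=/home/wooo/.ssh/deploy_known_hosts

# Hypothetical helper: print the SSH options a deploy step should pass.
deploy_ssh_opts() {
  printf '%s\n' "-o UserKnownHostsFile=$DEPLOY_KNOWN_HOSTS -o StrictHostKeyChecking=yes"
}

# Hypothetical helper: succeed only if FILE's mtime still equals the value
# recorded before CD ran. Args: FILE MTIME_BEFORE (Unix seconds).
known_hosts_untouched() {
  file=$1; before=$2
  [ "$(stat -c %Y "$file")" = "$before" ]
}
```

A deploy step would record `before=$(stat -c %Y /home/wooo/.ssh/known_hosts)` first, connect with `ssh $(deploy_ssh_opts) wooo@192.168.0.120 true` (unquoted on purpose so the options split into words), and finish with `known_hosts_untouched /home/wooo/.ssh/known_hosts "$before"`.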

3. P0 Evidence And Network

Run from any machine on the same LAN:

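# Reachability sweep: ICMP first, then the ARP table, then TCP/22.
# ping -W is seconds on Linux (iputils) but milliseconds on macOS.
for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done

arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
# nc -G is the macOS connect-timeout flag; on Linux netcat builds that lack it, use `nc -w 3 -z`.
for h in 110 120 121 188; do
  nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done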

Then capture reboot evidence:

ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'

If any host has ARP incomplete or SSH port down, stop here and fix physical/network first.

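3.1 Host-booted determination criteria

After each host reboots, first run the "four-stage boot determination". Proceed to service recovery only when every result matches what the host's role expects.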

for h in 110 120 121 188; do
  ip="192.168.0.$h"
  echo "=== $ip ==="
  ping -c 2 -W 2 "$ip" >/dev/null && echo "HOST_POWERED_OR_LAN_OK=1" || echo "HOST_POWERED_OR_LAN_OK=0"
  arp -an | grep "$ip" || true
  nc -G 3 -z "$ip" 22 && echo "SSH_PORT_OPEN=1" || echo "SSH_PORT_OPEN=0"
done

Once SSH is available:

ssh wooo@192.168.0.110 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true; free -h; swapon --show'
ssh wooo@192.168.0.121 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true'
ssh ollama@192.168.0.188 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true; free -h'

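If 120 cannot be reached over SSH, its state can only be HOST_POWERED_UNKNOWN or HOST_BOOTED_UNKNOWN. Go through console / VM / network checks; do not use a stale Kubernetes node object as a substitute for the host's actual state.

Verdict	Required conditions	Next step
`HOST_BOOTED`	ping or ARP responds, SSH port open, `who -b` shows the boot time of this reboot	Check role services
`HOST_READY`	`systemctl is-system-running` is `running`, or every degraded unit has been explained and does not affect this host's role	Proceed to service-layer verification
`HOST_DEGRADED`	failed units exist and affect this host's role, or swap is full, root is read-only, or there is a boot storage error	Fix the host first; do not release the next layer
`HOST_UNREACHABLE`	ping/SSH/ARP fail	Stop assuming remote repair is possible; switch to console/VM/network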
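
The verdict table can be expressed as a small decision function. This is a sketch under stated assumptions: the helper name `classify_host` and its three arguments are hypothetical, `ROLE_IMPACT` is an operator judgement (1 means failed units affect the host's role), and swap, read-only root, and boot storage errors are not inspected here.

```shell
# Hypothetical helper. Args: SSH_OK(0/1) SYSTEM_STATE ROLE_IMPACT(0/1)
# SYSTEM_STATE is the text printed by `systemctl is-system-running`.
classify_host() {
  ssh_ok=$1; state=$2; role_impact=$3
  if [ "$ssh_ok" -ne 1 ]; then
    # No SSH: remote repair cannot be assumed.
    echo "HOST_UNREACHABLE"
  elif [ "$state" = "running" ]; then
    echo "HOST_READY"
  elif [ "$state" = "degraded" ] && [ "$role_impact" -eq 0 ]; then
    # Degraded units exist but each was explained as irrelevant to the role.
    echo "HOST_READY"
  elif [ "$state" = "degraded" ]; then
    echo "HOST_DEGRADED"
  else
    # starting / initializing / other states: booted, not yet ready.
    echo "HOST_BOOTED"
  fi
}
```

After SSH works, a call could look like `classify_host 1 "$(ssh wooo@192.168.0.110 'systemctl is-system-running')" 0`.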

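Lesson from the 2026-06-12 110 incident: if a failed unit points to a legacy path that no longer exists, first confirm whether it still belongs to an active service. Disabling a stale timer can clear host degraded, but the source-of-truth must be cleaned up afterward to match; repeated reset-failed must not be used to mask the problem.

Lesson from the 2026-06-26 188 incident: the PostgreSQL host cluster, the Docker product DB, and the exporter must be judged separately. pg_isready, pg_up=1, or a public route 200 only proves that some PostgreSQL endpoint is available; it does not prove that postgresql@14-main has recovered. If the journal shows could not locate a valid checkpoint record, pg_resetwal must not be run automatically by a startup script or by AI; the case must go through the DB owner / backup restore / maintenance window / rollback owner / post-check gate.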

4. P0 188 Data Layer

188 is the first real service dependency because multiple product data planes, exporters, and AI / observability services depend on PostgreSQL-compatible endpoints. Do not assume the host cluster postgresql@14-main, Docker product databases, and exporter target are the same endpoint; prove the authoritative endpoint before repair.

4.1 Startup order

containerd
docker
postgresql@14-main
k3s_datastore.kine maintenance
redis-server on 6380
ollama or current AI proxy dependencies
nginx
Docker networks
MinIO / OpenClaw / SignOz
momo / litellm / batch services after load is stable

4.2 Read-only check

ssh ollama@192.168.0.188 '
hostname; date; uptime; free -h
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
pg_lsclusters 2>/dev/null || true
ss -ltnp "sport = :5432" 2>/dev/null || ss -ltn "sport = :5432" || true
pg_isready -h localhost -p 5432 || true
redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'

4.3 PostgreSQL WAL checkpoint damage

Signature:

PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...

If the affected cluster is the authoritative runtime datastore, this failure can block:

188:5432
K3s startup on 120/121
AWOOOI API DB access
Alertmanager webhook if API cannot start

2026-06-26 counterexample: host cluster 14/main can be down while product DB containers and exporters still serve traffic. Therefore pg_isready is not enough and failed postgresql@14-main is not automatically a product outage. First map the listening process / container, current app DB configuration, and backup freshness.

Break-glass example only after DB owner approval, backup evidence, maintenance window, rollback owner, and post-check plan:

sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"

Do not run pg_resetwal, DROP, reinitialize the cluster, delete /var/lib/postgresql, or restore an old backup from AI/startup automation. These are break-glass actions only.

5. P0/P1 110 Registry And Observability

On 110, recover Harbor, Gitea, and monitoring early, and start the runners last.

5.1 Startup order

docker
Remove Exited (128) / Exited (137) orphan containers
Harbor harbor-log
Harbor full stack
Gitea
Prometheus / Alertmanager / Grafana / exporters
Langfuse
SignOz
Sentry DB layer
Sentry web/worker/consumer layer
Gitea host runner and actions runners

5.2 Checks

ssh wooo@192.168.0.110 '
hostname; date; uptime; free -h
systemctl is-active docker || true
curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'

Harbor healthy means /v2/ returns 200 or 401. Do not treat 401 as failure.
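
The 200-or-401 rule is easy to get wrong in ad hoc scripts that treat any non-200 as a failure. A minimal sketch of the check follows; the helper name `harbor_code_healthy` is hypothetical.

```shell
# Hypothetical helper: succeed for the two status codes that prove the
# registry is answering (200 = anonymous access, 401 = credentials required).
harbor_code_healthy() {
  case "$1" in
    200|401) return 0 ;;
    *) return 1 ;;
  esac
}
```

It is meant to be fed the code captured by the check above, for example `harbor_code_healthy "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:5000/v2/)" && echo HARBOR_OK`. curl reports `000` when it cannot connect, which the helper treats as unhealthy.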

5.3 Runner gate

Runners may start only after all of the following are true:

188 PostgreSQL ready
110 Harbor ready
110 Gitea ready
120/121 K3s nodes ready
AWOOOI API health passes
110 load/core is below 1.0 for at least 15 minutes
runner systemd guardrails are active: CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0
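
The load/core condition in the list above is a ratio, not a raw load average. A minimal sketch of the comparison follows; the helper name `load_per_core_below` is hypothetical, and using the 15-minute load average as a stand-in for "below 1.0 for at least 15 minutes" is an approximation, not something the source prescribes.

```shell
# Hypothetical helper: succeed when LOAD / CORES is strictly below LIMIT.
# Args: LOAD CORES [LIMIT]; LIMIT defaults to 1.0 as in the gate above.
load_per_core_below() {
  awk -v load="$1" -v cores="$2" -v limit="${3:-1.0}" \
    'BEGIN { exit !(cores > 0 && load / cores < limit) }'
}
```

On 110 this could be called as `load_per_core_below "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)"`, where field 3 of `/proc/loadavg` is the 15-minute average.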

Check:

ssh wooo@192.168.0.110 '
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
  echo "=== $u ==="
  systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
done
'

If WatchdogUSec is not 0, apply the guardrail script manually with sudo:

sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply

6. P1 120/121 K3s

K3s must wait for 188 PostgreSQL and 110 Harbor.

6.1 Startup order

120 k3s.service
121 k3s.service, k3s-agent.service, or its live role
CNI / kube-proxy
Nodes Ready
Core pods
awoooi-prod pods
keepalived VIP 192.168.0.125
NodePorts 32334 and 32335

6.2 Checks

ssh wooo@192.168.0.120 '
hostname; uptime
pg_isready -h 192.168.0.188 -p 5432 || true
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
kubectl get nodes -o wide 2>/dev/null || true
kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

ssh wooo@192.168.0.121 '
hostname; uptime
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

If K3s is activating while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.

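6.3 120 / 121 AA / AS and load-balancing determination

The 2026-06-12 15:19 live check confirmed that 120 and 121 are both K3s control-plane nodes, each with k3s active and k3s-agent inactive. They therefore form an active-active (AA) K3s control plane, not a traditional primary/standby (AS) pair.

However, control-plane AA does not mean business-workload AA. Right after 120 recovered from the root filesystem fsck, most ArgoCD / AWOOOI / Velero / kube-system workloads were still concentrated on 121, and 120 mostly ran only DaemonSet Pods. After every 120 / 121 reboot or recovery, run an additional Pod placement check: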

ssh wooo@192.168.0.120 '
sudo kubectl get nodes -o wide
sudo kubectl get pods -A -o wide
sudo kubectl top nodes 2>/dev/null || true
sudo kubectl top pods -A --sort-by=cpu 2>/dev/null | head -30 || true
'

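Determination rules:

Verdict	Condition	May declare
`K3S_CONTROL_PLANE_AA`	120 / 121 are both `Ready control-plane`	Control plane available on both nodes
`WORKLOAD_IMBALANCED`	Major deployments / pods all land on a single node	Service AA must not be declared; scheduling governance is required
`WORKLOAD_BALANCED`	Core API / Web with replicas >= 2 are spread across 120 / 121	The workload layer may be declared distributed
`STATEFUL_AA`	storage replication, backup / restore drill, and failover drill all pass	Only then may data-layer AA be declared

The formal baseline document for load balancing and migration assessment is docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md. During recovery, finish the P0 backup chain and the cold-start scorecard first, and only then do topology spread or service relocation.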
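
Whether a deployment counts as spread across both nodes can be reduced to a skew number: the difference between the busiest and the emptiest expected node. A minimal sketch follows; the helper name `placement_skew` is hypothetical, and it expects one node name per line on stdin plus the expected node names as arguments, so that a node with zero pods still counts.

```shell
# Hypothetical helper: read one node name per line on stdin and print
# (max pods on a node) - (min pods on a node) across the expected nodes.
# Args: the expected node names, e.g. mon mon1.
placement_skew() {
  awk -v nodes="$*" '
    BEGIN { n = split(nodes, want, " "); for (i = 1; i <= n; i++) count[want[i]] = 0 }
    NF && ($1 in count) { count[$1]++ }
    END {
      min = -1; max = 0
      for (k in count) {
        if (min < 0 || count[k] < min) min = count[k]
        if (count[k] > max) max = count[k]
      }
      print max - min
    }'
}
```

Usage could look like `sudo kubectl get pods -n awoooi-prod -l app=awoooi-api -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | placement_skew mon mon1`; the label selector `app=awoooi-api` is an assumption and should be replaced with the real one. A result of 0 or 1 matches the "max skew <= 1" condition used elsewhere in this SOP.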

7. P2 AWOOOI Workloads

Run after K3s nodes are Ready:

ssh wooo@192.168.0.120 '
kubectl get deploy -n awoooi-prod
kubectl get pods -n awoooi-prod -o wide
kubectl get svc -n awoooi-prod
kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'

curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/

If pods are ImagePullBackOff, go back to 110 Harbor.

If API health fails because DB/Redis is down, go back to 188.

8. P2 Alert Chain

Current main path:

Prometheus/Alertmanager on 110
  -> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
  -> AWOOOI API
  -> TelegramGateway
  -> Telegram

Alertmanager health alone is not enough. Run E2E:

curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'

Expected: API returns success and Telegram receives the test alert.
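
The test payload above carries a fixed startsAt date. A minimal sketch of generating the same payload with the current UTC time follows, which makes each test alert easier to tell apart in Telegram; the helper name `cold_start_alert_payload` is hypothetical, and whether the API uses startsAt at all is not stated in this SOP.

```shell
# Hypothetical helper: print the cold-start test payload with startsAt set
# to the current UTC time instead of a hard-coded date.
cold_start_alert_payload() {
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  cat <<EOF
{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"$now","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}
EOF
}
```

It can be piped straight into the POST: `cold_start_alert_payload | curl -s -o /dev/null -w '%{http_code}\n' -X POST -H 'Content-Type: application/json' --data-binary @- http://192.168.0.125:32334/api/v1/webhooks/alertmanager`. The HTTP code alone is not the pass condition; Telegram must still receive the alert.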

9. P2 Schedules And Delayed Work

Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.

Host / Layer	Required check	Success baseline
188 cron	`systemctl is-active cron` and `crontab -l`	cron active; backup, restart exporter, stats exporter entries present
188 backup-from-110	`backup_110_last_success_timestamp` in textfile/Prometheus	last success age `< 25h`
188 momo-scheduler	`docker inspect momo-scheduler` and `docker logs --since 6h momo-scheduler`	container `running healthy`; `全部排程任務已註冊`; Google Drive auth works; dashboard URLs use container-reachable hostnames
188 momo import	manual `run_auto_import_task()` after parser changes	selected sheet is `即時業績明細`; imported date range has matching rows in `daily_sales_snapshot` and `realtime_sales_monthly`
110 cron	`systemctl is-active cron`	cron active; Docker/systemd textfile exporters fresh
110 startup units	`systemctl --failed`	zero failed units; stale `momo-startup-complete` and `wooo-staggered-startup` disabled
120 K8s CronJobs	`kubectl get cronjobs -n awoooi-prod`	unsuspended; no failed Jobs remain after current validation
121 DR drill	`crontab -l`	DR drill cron present unless explicitly paused
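
The `< 25h` baseline in the table above is an age comparison on a Unix timestamp. A minimal sketch follows; the helper name `timestamp_fresh` is hypothetical, and it assumes the metric value has already been extracted as an integer number of seconds.

```shell
# Hypothetical helper: succeed when LAST_SUCCESS (Unix seconds) is less than
# MAX_HOURS old relative to NOW. Args: LAST_SUCCESS NOW [MAX_HOURS].
timestamp_fresh() {
  last=$1; now=$2; max_hours=${3:-25}
  # A zero or negative timestamp means "never succeeded", not "fresh".
  [ "$last" -gt 0 ] && [ $(( now - last )) -lt $(( max_hours * 3600 )) ]
}
```

Typical use is `timestamp_fresh "$last_success" "$(date +%s)" || echo "BACKUP_STALE"`, where `last_success` holds the value of `backup_110_last_success_timestamp`.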

Useful checks:

ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'

If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.

10. P2/P3 Stateful Service Guardrails

Tier	Examples	Automation
BLOCK	PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB	No automatic destructive action. Human approval only.
CRITICAL_HITL	Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse	Human-in-the-loop restart/repair.
STANDARD_HITL	API/Web/worker, OpenClaw, litellm	Restart only with evidence and blast-radius check.
AUTO	Stateless exporters, blackbox, nginx exporter	Auto restart allowed after verification.

Never use generic docker restart $(docker ps -q) during cold start.

10.1 Dirty-Reboot Storage Corruption

Treat these log signatures as storage corruption, not ordinary service flakiness:

Bad message
Structure needs cleaning
Unknown codec
PANIC: could not locate a valid checkpoint record
Kafka Malformed line in checkpoint files
ClickHouse broken and needs manual correction

Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns Bad message or Structure needs cleaning, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.

10.2 ClickHouse Clean-Clone Recovery Pattern

Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.

1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.

Do not replace this with rm -rf store/... unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:

/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS

10.3 Kafka Checkpoint Recovery Pattern

If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:

log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint

Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.
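
The "preserve and move only checkpoint files" step can be sketched as follows. The helper name `preserve_kafka_checkpoints` is hypothetical, Kafka is assumed to be stopped before it runs, and the Kafka log directory path depends on the deployment and is not given in this SOP. This remains a human-approved action, not something for startup automation.

```shell
# Hypothetical helper: move only the three checkpoint files named above from
# a Kafka log directory into a preserve directory. Topic directories and
# segment files are never touched. Args: KAFKA_LOG_DIR PRESERVE_DIR
preserve_kafka_checkpoints() {
  log_dir=$1; keep_dir=$2
  mkdir -p "$keep_dir" || return 1
  for f in log-start-offset-checkpoint recovery-point-offset-checkpoint replication-offset-checkpoint; do
    if [ -f "$log_dir/$f" ]; then
      mv "$log_dir/$f" "$keep_dir/$f" || return 1
      echo "PRESERVED $f"
    fi
  done
}
```

Moving rather than deleting keeps the original files available as incident evidence and for rollback.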

11. P3 High-Load Services

Only release these after P0/P1/P2 gates are green:

Host	Service	Release condition
188	momo-scheduler / crawler	load/core < 1.0 for 15 minutes and DB healthy
188	SignOz ClickHouse	healthy and merge backlog trending down
188	litellm	`/health/liveliness` good and provider route verified
110	Sentry Snuba consumers	ClickHouse healthy and Kafka backlog decreasing
110	Sentry uptime-checker	Sentry web/DB healthy
110	runners	all previous gates green, `host_runaway_process.prom` fresh, orphan browser group count `0`, and load/core < 1.0 for 15 minutes unless the remaining load is explicitly attributed to active CI

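11.1 110 runaway browser / CI load triage

The 2026-06-18 110 CPU saturation event showed that the generic HostHighCpuLoad alert can only say the host is busy; it cannot tell the operator whether a process should be killed. 110 must now use the dedicated host runaway process metrics as the first level of triage: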

grep -E 'awoooi_host_runaway_|awoooi_host_gitea_actions_|awoooi_host_load5_per_core|awoooi_host_swap_used_ratio' \
  /home/wooo/node_exporter_textfiles/host_runaway_process.prom

Prometheus must also be able to read the same textfile. The 2026-06-18 14:31-14:32 live scrape confirmed awoooi_host_runaway_process_monitor_up{host="110"}=1, orphan group count 0, active CI container count 2, remediation_authorized=0, and neither the missing alert nor the orphan alert was firing.

Interpretation:

Metric combination	Verdict	Action
`awoooi_host_runaway_browser_orphan_group_count > 0` and CPU `>= 100`	orphan headless browser / smoke process group	Run `host-runaway-process-remediation.py` as a dry-run; a gated `SIGTERM` is allowed only after human confirmation
orphan count `0` and `awoooi_host_gitea_actions_active_container_count > 0`	legitimate CI build/test load	Watch the Gitea Actions queue / workflow timeout; do not kill processes
`awoooi_host_runaway_process_monitor_up` missing or stale	monitoring blind spot	Fix cron / textfile collector / Ansible role; do not declare AI Ops observable
`awoooi_host_runaway_process_remediation_authorized > 0`	the monitor has been wrongly changed into an executor	Roll back immediately; runtime remediation must go only through the gated helper
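
The four rows above can be read as a decision function over the metric values. The sketch below is illustrative only: the helper name `runaway_triage`, the verdict labels, and the order in which the rows are evaluated are assumptions, and all inputs are expected as integers.

```shell
# Hypothetical helper.
# Args: MONITOR_UP ORPHAN_GROUPS CPU_PERCENT ACTIVE_CI_CONTAINERS AUTHORIZED
runaway_triage() {
  monitor_up=$1; orphans=$2; cpu=$3; active_ci=$4; authorized=$5
  if [ "$authorized" -gt 0 ]; then
    # The monitor must never authorize remediation by itself.
    echo "ROLLBACK_MONITOR"
  elif [ "$monitor_up" -ne 1 ]; then
    echo "MONITORING_BLIND_SPOT"
  elif [ "$orphans" -gt 0 ] && [ "$cpu" -ge 100 ]; then
    # Dry-run only; a gated SIGTERM still needs human confirmation.
    echo "ORPHAN_BROWSER_DRY_RUN"
  elif [ "$orphans" -eq 0 ] && [ "$active_ci" -gt 0 ]; then
    echo "LEGITIMATE_CI_LOAD"
  else
    echo "NO_MATCH_OBSERVE"
  fi
}
```

With the 2026-06-18 live scrape values (monitor up 1, orphan groups 0, active CI containers 2, authorized 0), the function prints `LEGITIMATE_CI_LOAD` regardless of the CPU figure, which matches the "do not kill processes" row.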

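Formal PlayBook:

docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md

This PlayBook does not replace the Docker / Sentry / Harbor / K3s / backup SOPs. It only classifies orphan browser smoke processes versus CI load, so that high CPU does not lead to a mistaken Docker restart or to killing a legitimate build.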

12. Baseline And AI Auto-Remediation Gate

12.1 Stable Runtime Baseline

These are release gates after the first cold-start recovery pass:

Area	Baseline
188 host	PostgreSQL accepting, Redis PONG, momo `/health` 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers
110 host	Harbor `/v2/` 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop
K3s	120/121 nodes Ready, VIP `192.168.0.125` present, AWOOOI API 2xx/3xx, Web 2xx/3xx
Public routes	`https://awoooi.wooo.work/api/v1/health` 2xx/3xx, `https://mo.wooo.work/health` 2xx/3xx
Guardrails	Docker/systemd/storage/backup/runaway-process textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0`
Schedules	cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success `< 25h`
Backlog	ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks

If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.

12.2 AI Auto-Remediation Gate

AI auto-repair can move from observe-only to limited execution only after:

Prometheus rules are loaded.
docker/systemd textfile exporter files are fresh.
runaway process textfile exporter is fresh and remediation_authorized=0.
blackbox probes have stable results.
cron/CronJob schedule checks are green.
AWOOOI API /api/v1/health passes.
Alertmanager E2E webhook passes.
Redis/KM/playbook health is available.
No active restart storm.
Host load/core remains below 1.0 for 15 minutes.

Until then:

diagnose only
notify only
require human approval for remediation
no DB/ClickHouse/Harbor/Sentry destructive action
no generic restart action against stateful services
no process kill unless host-runaway-process-remediation.py has dry-run evidence plus owner approval, maintenance window, and evidence ref

13. One-Command Readiness Script

13.1 Single Pass

Run this when you want one read-only snapshot:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh

The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:

P0-NETWORK
P0-188-DATA
P0-110-REGISTRY-OBSERVABILITY
P1-K3S
P2-WORKLOAD-ALERTCHAIN
P2-PUBLIC-ROUTES
P2-SCHEDULES
runner guardrail state inside P0-110-REGISTRY-OBSERVABILITY

If it prints BLOCKED, fix the first blocked gate before moving forward.

13.2 Professional Watch Mode

Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test

This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is GREEN or the last attempt remains degraded/blocked.

Use --send-alert-test for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without --send-alert-test, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.

13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

bash scripts/reboot-recovery/install-cold-start-monitor-110.sh

This command is not read-only. It copies scripts to 110, rewrites the marked wooo crontab block, and immediately refreshes the textfile metric. Run it only inside an approved maintenance window or explicit owner-approved live-sync change.

This installs two scripts under /home/wooo/scripts/, adds a marked user-cron block, and writes:

/home/wooo/node_exporter_textfiles/cold_start_recovery.prom
/home/wooo/reboot-recovery/cold-start-last.log

The cron path uses --monitor-read-only, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

awoooi_cold_start_monitor_up
awoooi_cold_start_pass_gates
awoooi_cold_start_warn_gates
awoooi_cold_start_blocked_gates
awoooi_cold_start_last_run_timestamp
awoooi_cold_start_last_green_timestamp
awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}

Prometheus rules in ops/monitoring/alerts-unified.yml alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
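
For a quick local look without going through Prometheus, the unlabelled metrics can be read directly from the textfile. A minimal sketch follows; the helper name `prom_value` is hypothetical, and it assumes the file uses the standard text exposition layout of one `name value` pair per line.

```shell
# Hypothetical helper: print the value of an unlabelled metric from a
# node-exporter textfile; fail if the metric is absent. Args: METRIC FILE
prom_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }
    END { exit !found }' "$2"
}
```

For example, `prom_value awoooi_cold_start_blocked_gates /home/wooo/node_exporter_textfiles/cold_start_recovery.prom` prints the current blocked-gate count. Labelled series such as `awoooi_cold_start_last_result{...}` are not matched by this exact-name comparison.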

13.3.1 110 cold-start monitor live-sync gate

Use this gate whenever the repo-side cold-start script changes. This prevents a false-green where repo evidence is newer than the live 110 monitor.

Current read-only evidence, 2026-06-24 23:15 Asia/Taipei:

Repo script hash: f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05
110 live script hash: 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8
verify result: BLOCKED full-stack-cold-start-check.sh hash mismatch

Read-only verification:

bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh

Approved apply path, only after maintenance-window / owner approval:

bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1

Completion criteria:

verify-cold-start-monitor-deploy.sh reports hash parity for full-stack-cold-start-check.sh and cold-start-textfile-exporter.sh.
The live 110 cold-start output includes the expected current fields, including MOMO_SOURCE_EMPTY_EVIDENCE_LINES, MOMO_IMPORT_CONFIG, and MOMO_LATEST_IMPORT_JOB while MOMO data freshness remains blocked by source absence.
The textfile monitor refreshes without creating alert spam.
LOGBOOK records local hash, remote hash, command type, approval reference, and final cold-start result.
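
The hash-parity criterion can also be spot-checked by hand when recording the LOGBOOK entry. The sketch below is not a replacement for verify-cold-start-monitor-deploy.sh; the helper names `file_sha256` and `hash_parity` are hypothetical.

```shell
# Hypothetical helper: print the SHA-256 of a file, using sha256sum where
# available and falling back to `shasum -a 256` (for example on macOS).
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Hypothetical helper: succeed only when both hashes are non-empty and equal.
hash_parity() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
```

A read-only comparison could be `hash_parity "$(file_sha256 scripts/reboot-recovery/full-stack-cold-start-check.sh)" "$(ssh wooo@192.168.0.110 'sha256sum /home/wooo/scripts/full-stack-cold-start-check.sh' | awk '{print $1}')"`. Requiring a non-empty hash prevents a failed SSH call on both sides from looking like parity.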

NO-GO:

Do not run the installer as part of routine read-only triage.
Do not describe the repo-side v1.42 as deployed on 110 while the hash mismatch remains.
Do not patch 110 manually with ad hoc scp; use the existing installer or Ansible source-of-truth path under an approved change.

13.4 Script-To-SOP Coverage Map

Script gate	SOP coverage	Blocks
`P0-NETWORK`	host reachability, ARP, SSH	every later phase
`P0-188-DATA`	PostgreSQL, Redis, momo, SignOz	K3s, AWOOOI API, momo public site
`P0-110-REGISTRY-OBSERVABILITY`	Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas	image pulls, CD, alert rules, runners
`P1-K3S`	120/121 K3s, VIP, node readiness, pod health	workload and webhook health
`P2-WORKLOAD-ALERTCHAIN`	AWOOOI API/Web, Alertmanager webhook	AI auto-remediation and alert confidence
`P2-PUBLIC-ROUTES`	external AWOOOI and momo URLs	external release
`P2-SCHEDULES`	cron, CronJobs, backups, textfile exporters, DR drill	final done criteria

13.5 Next-Reboot Operator Contract

Run the watch command above.
If it stops at BLOCKED, repair the first blocked gate and rerun watch mode.
If it stops at WARN, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
Release high-load services only after GREEN and load/core stays below 1.0 for 15 minutes.
Record the final output summary and any manual repair in docs/LOGBOOK.md.

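13.6 2026-05-29 addendum: 188 public gateway and backup alerts

The 188 public gateway for aiops.wooo.work must no longer point at the single node 192.168.0.120:31234/31235. When 120 is unreachable, that makes the public route return 502 directly. The formal baseline must go through the K3s VIP: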

location /api/ {
    proxy_pass http://192.168.0.125:32334/api/;
}

location /api/v1/ws {
    proxy_pass http://192.168.0.125:32334/api/v1/ws;
}

location / {
    proxy_pass http://192.168.0.125:32335;
}

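The change source must be infra/ansible/roles/nginx/templates/188-all-sites.conf.j2, converged with infra/ansible/playbooks/nginx-sync.yml. Editing only the live file on 188 without writing the change back to the Ansible baseline is forbidden.

Backup alerting has two layers, and both are required:

ops/monitoring/alerts-unified.yml is the repo canonical.
On 110, the live /home/wooo/monitoring/alerts.yml and /home/wooo/monitoring/alerts-unified.canonical.yml must match; otherwise prometheus-rule-drift-guard may pull the rules back to an old version.

Required checks after reboot: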

curl -s http://127.0.0.1:9090/api/v1/rules \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'

cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom

If 120 has not recovered yet, BackupConfigCapturePartial{target="120-k3s-host-configs"} and cold-start blocked are correct signals and must not be silenced. After 120 recovers, rerun:

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color

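13.7 2026-05-29 addendum: momo PostgreSQL index and data sync

mo.wooo.work cannot be judged by /health or a homepage 200 alone. After a reboot or fsck, a damaged PostgreSQL index can let the import flow appear to complete while daily_sales_snapshot is not synced to realtime_sales_monthly. Symptoms in this incident:

daily_sales_snapshot had 17,353 rows for 2026-05-01 through 2026-05-28.
realtime_sales_monthly had 0 rows for the same date range.
The momo-scheduler log showed the PostgreSQL internal error posting list tuple ... cannot be split.

Standard handling order:

# 188 / momo-db: rebuild the index only; do not delete data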
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL

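Only after the index is rebuilt may an idempotent backfill sync be run for the missing dates. The formal procedure must first confirm the realtime_sales_monthly row count for that date range; if it is not 0, save the query result first and confirm whether the sync should be rerun for the same range. Do not truncate the whole table, and do not restore the whole database. After the backfill, verify at least: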

SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

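Both tables must match in row count and in the lower and upper date bounds for the same date range. When done, clear the momo application cache: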

docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'

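14. Host power-on, shutdown, and reboot SOP

This section is the standard procedure for every power operation involving 110 / 120 / 121 / 188. 112 is Kali; only read-only evidence is kept for it, and it is not part of this recovery round or of routine reboot release.

14.1 Common red lines

Type	Forbidden	Correct handling
120 offline	Do not silence `ColdStartHost120Unreachable`, `ColdStartRecoveryBlocked`, or the 120 config backup alert	Keep the red light until the full chain is rerun after console/VM recovery
Filesystem	Do not run online `fsck` on a mounted root filesystem	Repair offline only, from console/rescue/initramfs
Backup	Do not use a single successful backup item to declare aggregate backup green	Judge by `backup-all`, the offsite verifier, and the cold-start scorecard together
Credential	Do not write passwords, tokens, or private keys into the repo, LOGBOOK, or chat	Write only non-secret evidence markers / vault references
Stateful data	Do not truncate, DROP, restore a whole database, or bulk-delete volumes	Preserve evidence first; prefer `REINDEX TABLE CONCURRENTLY` / clean-clone / idempotent resync
Automation	Do not release runner/CD/AI full execution while P0/P1 is not green	observe-only; release runner/CD last

14.2 Pre-shutdown SOP

The goals are to preserve evidence, stop high-load sources, and let stateful services land cleanly.

Declare the maintenance window and create a reboot record draft in docs/LOGBOOK.md.
Run the preflight snapshot: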

/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
/backup/scripts/backup-status.sh --no-notify
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color

Save host reboot evidence:

for h in 110 120 121; do
  ssh wooo@192.168.0.$h 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20; systemctl --failed --no-pager' || true
done
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20; systemctl --failed --no-pager' || true

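Pause high-load work and automated release:

Order	Target	Operating principle
1	Gitea / actions runners	Stop new jobs; do not hard-kill mid-build, let jobs that can finish do so or cancel them manually
2	AI auto-remediation	Switch to observe-only; automatic restart of stateful services is forbidden
3	momo crawler / scheduler / heavy batch	Pause work that launches Chrome, runs batch imports, or performs heavy DB writes
4	Sentry/Snuba/ClickHouse heavy consumers	Confirm there is no restart storm; use a controlled stop if needed
5	K3s workload	Prefer drain / cordon on reachable nodes; do not pretend drain completed when 120 is already unreachable

Full-site shutdown order: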

1. runner/CD and high-load batch
2. AI auto-remediation execution
3. AWOOOI workload layer
4. 121 K3s agent side
5. 120 K3s server side
6. 110 registry / observability, after evidence and backup status are captured
7. 188 data layer last
8. network / UPS / hypervisor last, if applicable

188 must be the last host shut down, because PostgreSQL / Redis / momo DB / K3s datastore are shared dependencies of every other layer.

14.3 Startup SOP

Startup always follows the dependency chain; do not chase the noisiest alert.

1. Physical network: switch, NIC, ARP, SSH
2. 188 data layer: PostgreSQL, Redis, Docker, momo DB, SigNoz dependencies
3. 110 registry / observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry
4. 120 K3s server / VIP path
5. 121 K3s agent / failover path
6. AWOOOI API/Web workload
7. Public routes and Alertmanager E2E
8. Backups, cron, CronJobs, textfile exporters
9. momo scheduler / crawlers and high-load consumers
10. runners/CD
11. AI auto-remediation limited execution
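
The dependency chain above can be read as a simple gate: a layer is released only when every earlier layer has live evidence. The following is a minimal Python sketch of that rule, not the logic of `full-stack-cold-start-check.sh`; the layer labels and both function names are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical labels for the eleven startup steps listed above, in order.
BOOT_CHAIN: List[str] = [
    "physical-network",
    "188-data-layer",
    "110-registry-observability",
    "120-k3s-server",
    "121-k3s-agent",
    "awoooi-workload",
    "public-routes-alertmanager-e2e",
    "backups-cron-exporters",
    "momo-scheduler-high-load",
    "runners-cd",
    "ai-auto-remediation",
]


def first_unverified_layer(evidence: Dict[str, bool]) -> Optional[str]:
    """Return the earliest layer without live evidence, or None if all have it."""
    for layer in BOOT_CHAIN:
        if not evidence.get(layer, False):
            return layer
    return None


def may_release(layer: str, evidence: Dict[str, bool]) -> bool:
    """A layer may be released only when every earlier layer has live evidence."""
    index = BOOT_CHAIN.index(layer)
    return all(evidence.get(previous, False) for previous in BOOT_CHAIN[:index])


evidence = {"physical-network": True, "188-data-layer": True}
print(first_unverified_layer(evidence))
print(may_release("runners-cd", evidence))
```

With only the network and 188 verified, the next layer to work on is 110, and runners/CD stay held regardless of how loud their alerts are.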

After startup, every layer needs live evidence. Minimum acceptance commands:

for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
  nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done

ssh ollama@192.168.0.188 'systemctl is-active docker postgresql@14-main redis-server nginx || true; pg_isready -h localhost -p 5432 || true; docker ps --format "{{.Names}}\t{{.Status}}" | head -80'
ssh wooo@192.168.0.110 'systemctl is-active docker cron || true; curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true; curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true'
ssh wooo@192.168.0.121 'sudo kubectl get nodes -o wide; sudo kubectl get pods -A | grep -v -E "Running|Completed" || true'

/backup/scripts/backup-status.sh --no-notify
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1

14.4 Single-host reboot SOP

Host	Pre-reboot conditions	Post-reboot checks	Done criteria
110	Not inside a `backup-all` / rclone / verify window; runner jobs stopped or manually cancelled; 188 healthy	Docker, Harbor, Gitea, Prometheus, Alertmanager, Sentry, cron, textfile exporters, `/backup/scripts/backup-status.sh --no-notify`	110 services green; backup status shows no new stale / failed entries; runner/CD released last
120	Must be console-first maintenance; cordon/drain first if reachable; if unreachable, do not claim the drain succeeded	power/VM/NIC/boot/initramfs/fsck state, SSH, `kubectl get nodes`, `SchedulingDisabled` cleared	120 ping/SSH OK; `mon Ready`; backup configs/all/offsite/verify/cold-start chain rerun
121	120 / 188 healthy; cordon/drain first when reachable	`k3s-agent` or live role, VIP state, `kubectl get nodes`, pod placement	`mon1 Ready`; VIP / NodePort path healthy; no new failed pods in the workload
188	110 backup status saved; momo heavy import stopped or postponed; no DB restore / migration in progress	PostgreSQL, Redis, Docker, momo DB parity, SigNoz/ClickHouse, cron, backup freshness	DB accepting connections; momo parity green; 188 backup jobs fresh; high-load services released last

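14.4.1 110 post-reboot recovery command card

110 is the registry / observability / backup center. After a reboot, check the host and core ports first; do not restart the Docker daemon as a first move.

Order	Check	Success baseline	On failure
1	`systemctl is-system-running` / failed units / Swap	`running`; failed units `0` or explainable; Swap not continuously growing	First distinguish a stale unit, an active service, and a storage/network problem
2	Docker daemon	`systemctl is-active docker=active`	If Docker is `activating`, read the journal first; do not restart/kill repeatedly
3	Harbor / registry	local `/v2/` returns `200/401`; public registry returns `401` when not logged in	Apply the minimal fix to the failing upstream only; avoid a daemon restart
4	Gitea / runners	Gitea 200/302; runners released last	Runner jobs must not compete for resources while P0/P1 is not green
5	Prometheus / Alertmanager	`/-/ready` and `/-/healthy` OK; required alerts visible	If alerts are missing, fix the rules/drift guard first, then discuss automation
6	Sentry / Langfuse / Stock / public tools	public 2xx/3xx; containers not in a restart loop	Fix only the clearly failing service; do not rebuild the whole compose stack
7	backup / offsite	`backup-status --no-notify`, offsite verifier	Keep the Configs red light while 120 is unreachable

Minimum 110 post-reboot commands: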

ssh wooo@192.168.0.110 '
date; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true
free -h; swapon --show
systemctl is-active docker cron || true
curl -s -o /dev/null -w "harbor_v2=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
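
The `harbor_v2=` and `gitea=` lines printed by the curl probes above can be checked mechanically against the command card's success baselines. This is a minimal sketch under the assumption that the probe output is captured as text; `ACCEPTED_CODES`, `parse_probe_line`, and `probe_ok` are hypothetical names, not part of the repo.

```python
from typing import Tuple

# Accepted codes follow the command card: Harbor local /v2/ 200/401, Gitea 200/302.
ACCEPTED_CODES = {
    "harbor_v2": {200, 401},
    "gitea": {200, 302},
}


def parse_probe_line(line: str) -> Tuple[str, int]:
    """Split a 'name=code' line as printed by curl -w into (name, code)."""
    name, _, code = line.strip().partition("=")
    return name, int(code)


def probe_ok(line: str) -> bool:
    """True when the probe's HTTP code is in the accepted set for that probe."""
    name, code = parse_probe_line(line)
    return code in ACCEPTED_CODES.get(name, set())


for sample in ("harbor_v2=401", "harbor_v2=000", "gitea=302"):
    print(sample, probe_ok(sample))
```

curl prints `000` for `%{http_code}` when no HTTP response was received, so an unreachable registry parses to code 0 and fails the check instead of raising.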

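2026-06-12 addendum: when stockplatform-shared-ui-monitor.timer points at a nonexistent legacy path, the stale timer may be disabled to clear the host failed unit; the official source of truth must still be cleaned up afterwards, and reset-failed must not be treated as a fix.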

14.4.2 188 post-reboot recovery command card

188 is the data and AI/Web dependency host. Until it recovers, do not release K3s, the AWOOOI API, momo heavy import, or AI auto-remediation.

Order	Check	Success baseline
1	PostgreSQL	`pg_isready` reports accepting connections; no checkpoint / WAL panic
2	Redis	`PONG`
3	Docker / containerd	active; momo-db / signoz / openclaw / litellm not in a restart loop
4	momo DB parity	`daily_sales_snapshot` and `realtime_sales_monthly` agree on the current-month row count and on the minimum / maximum dates
4a	momo Google Drive token writeback	`/home/ollama/momo-pro/config/google_token.json` owner matches the Docker userns scheduler UID and the mode is no wider than `600`; the token content must not be read or printed
4b	momo business data freshness	A `daily_sales_snapshot` latest date lagging `0-2` days is acceptable; a lag of `3` or more days is `BLOCKED`, and full-stack green must not be claimed even if the homepage / health / DB parity are all normal
5	SigNoz / monitoring bridge	HTTP 200; ClickHouse not in a repair storm
6	momo scheduler	container healthy, recent activity pattern > 0; heavy import released only after the DB is green
7	backup freshness	188 backup textfile / 110 backup-from-188 freshness OK

After a 188 reboot, "homepage 200" must not stand in for DB parity, and DB parity must not stand in for data freshness. If posting list tuple ... cannot be split appears, run only REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly; and do not truncate or restore the whole database.
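
The freshness rule in row 4b (a lag of 0-2 days is acceptable, 3 or more days is `BLOCKED`) is easy to get off by one, so a small worked sketch may help. It assumes the latest `daily_sales_snapshot` date has already been read from the database; `freshness_verdict` is a hypothetical helper, and rejecting a future-dated snapshot is an assumption the SOP does not state.

```python
from datetime import date


def freshness_verdict(latest_snapshot: date, today: date) -> str:
    """Apply the 4b rule: lag of 0-2 days is OK, 3 or more days is BLOCKED."""
    lag_days = (today - latest_snapshot).days
    if lag_days < 0:
        # Assumption: a snapshot dated in the future is a data error, not "fresh".
        raise ValueError("latest snapshot date is in the future")
    return "OK" if lag_days <= 2 else "BLOCKED"


today = date(2026, 6, 27)
print(freshness_verdict(date(2026, 6, 25), today))  # lag of 2 days
print(freshness_verdict(date(2026, 6, 24), today))  # lag of 3 days
```

The verdict is independent of homepage, health, and parity results, which matches the rule that none of them may substitute for freshness.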

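2026-06-25 addendum: if the momo-scheduler logs show Google Drive 認證失敗 (Google Drive authentication failed) / could not locate runnable browser / Permission denied: 'config/google_token.json', start with a metadata-only reading and do not read the token content. The latest 10:35 readback shows that both the host path /home/ollama/momo-pro/config/google_token.json and the container-side config/google_token.json are missing, while the scheduler host UID is still 100000; the 2026-06-24 "change owner/mode only" fix conclusion therefore no longer applies. The minimum safe flow to clear the WARN is: obtain an owner-provided non-secret evidence ref, confirm the maintenance window and the rollback owner, recreate or restore the token artifact without pasting the token, check only the stat owner:group:mode and the scheduler auth readback, then rerun cold-start. Until that is done, MOMO health 200 and DB parity cannot stand in for token/writeback evidence.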
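
A metadata-only check of the token artifact can be sketched as follows. It calls `os.stat` and never opens the file, so the token content is not read. `token_metadata` is a hypothetical helper, and the demo runs against a throwaway temp file rather than the real token path.

```python
import os
import stat
import tempfile


def token_metadata(path: str) -> dict:
    """Return owner / mode metadata for path without reading its content."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return {"present": False}
    mode = stat.S_IMODE(info.st_mode)
    return {
        "present": True,
        "uid": info.st_uid,
        "gid": info.st_gid,
        "mode": oct(mode),
        # "No wider than 600": no permission bits outside owner read/write.
        "mode_within_600": (mode & ~0o600) == 0,
    }


with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "demo_token.json")  # stand-in file only
    open(demo_path, "w").close()
    os.chmod(demo_path, 0o600)
    tight = token_metadata(demo_path)
    os.chmod(demo_path, 0o644)
    loose = token_metadata(demo_path)
    missing = token_metadata(os.path.join(workdir, "absent.json"))

print(tight["mode"], tight["mode_within_600"])
print(loose["mode"], loose["mode_within_600"])
print(missing)
```

Comparing the returned `uid` with the expected scheduler UID (100000 in the readback above) is left to the caller, since that value depends on the live Docker userns mapping.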

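14.4.3 120 recovery command card

120 is currently a console-first blocker. While it is unreachable, remote work is limited to evidence collection and must not pretend to be a repair.

State	Verdict	Correct action
ping / SSH / ARP all fail	host / VM / network layer unknown	Go to the hypervisor / console and confirm power, NIC, and boot screen
initramfs / fsck prompt	filesystem repair gate	Handle offline per `120-fsck-maintenance-checklist.sh`
SSH restored but K3s NotReady	K3s / runtime layer	Check `journalctl -u k3s`, containerd, and 188 PostgreSQL first, then uncordon
node Ready but SchedulingDisabled	scheduling state not cleared	After confirming health, run `kubectl uncordon mon`, then check the workload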
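
The four rows above form a small decision table. A minimal sketch of it as code, assuming the observations are collected by hand or by read-only probes; `Host120Observation` and `next_action` are hypothetical names, and the "unclassified" branch covers partial-failure states the table does not list.

```python
from dataclasses import dataclass


@dataclass
class Host120Observation:
    ping_ok: bool
    ssh_ok: bool
    arp_ok: bool
    console_shows_fsck_prompt: bool = False
    node_ready: bool = False
    scheduling_disabled: bool = False


def next_action(obs: Host120Observation) -> str:
    """Map an observation of 120 to the action named in the command card."""
    if obs.console_shows_fsck_prompt:
        return "offline repair per 120-fsck-maintenance-checklist.sh"
    if not (obs.ping_ok or obs.ssh_ok or obs.arp_ok):
        return "confirm power, NIC, boot screen at hypervisor / console"
    if not obs.ssh_ok:
        # Assumption: states outside the table stay in evidence collection.
        return "unclassified: keep collecting read-only evidence"
    if not obs.node_ready:
        return "check journalctl -u k3s, containerd, 188 PostgreSQL before uncordon"
    if obs.scheduling_disabled:
        return "confirm health, then kubectl uncordon mon"
    return "rerun backup / offsite / cold-start chain"


print(next_action(Host120Observation(ping_ok=False, ssh_ok=False, arp_ok=False)))
print(next_action(Host120Observation(True, True, True, node_ready=True,
                                     scheduling_disabled=True)))
```

The function only names the next step; it performs no remote action, in line with the rule that remote work on an unreachable 120 is evidence collection only.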

After 120 recovers, checking kubectl get nodes alone is not enough. The following rerun is mandatory:

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1

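14.4.4 121 post-reboot recovery command card

121 is the K3s failover / secondary control-plane path. The core rule after its reboot is: do not let mon1 Ready mask an unreachable mon.

Check	Success baseline	Note
SSH / systemd	host ready, failed units explainable	121 green does not mean 120 green
K3s role	`kubectl get nodes -o wide` readable	If only `mon1 Ready` remains, the cluster is still degraded
VIP / NodePort	VIP / public routes reachable	Must confirm the route goes through `192.168.0.125:32334/32335`
Cron / DR drill	cron present, DR drill not stopped by mistake	schedule green is part of the cold-start done criteria

If after a 121 reboot mon1 is Ready but mon is NotReady,SchedulingDisabled, the conclusion is "121 recovered, cluster still degraded"; a healthy 121 must not be reported as K3s all green.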
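
The "mon1 Ready must not mask mon" rule can be checked against the text of `kubectl get nodes`. The sketch below assumes the default column layout (NAME, STATUS, ...); the sample text is illustrative rather than captured output, and `parse_node_status` / `node_verdict` are hypothetical names.

```python
from typing import Dict, List

EXPECTED_NODES = ("mon", "mon1")


def parse_node_status(kubectl_output: str) -> Dict[str, List[str]]:
    """Map node name to its STATUS flags, e.g. ['NotReady', 'SchedulingDisabled']."""
    statuses: Dict[str, List[str]] = {}
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            statuses[fields[0]] = fields[1].split(",")
    return statuses


def node_verdict(kubectl_output: str) -> str:
    """'nodes-ready' only if every expected node is exactly Ready, else 'degraded'."""
    statuses = parse_node_status(kubectl_output)
    for node in EXPECTED_NODES:
        if statuses.get(node) != ["Ready"]:
            return "degraded"
    return "nodes-ready"


# Illustrative sample only; AGE and VERSION values are placeholders.
SAMPLE = """\
NAME   STATUS                        ROLES           AGE   VERSION
mon    NotReady,SchedulingDisabled   control-plane   1d    v0.0.0
mon1   Ready                         control-plane   1d    v0.0.0
"""
print(node_verdict(SAMPLE))
```

A node missing from the output is treated the same as a NotReady node, so a cluster that only lists `mon1` still reads as degraded.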

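14.5 Record format for every reboot

Every startup, shutdown, and reboot must append a record to docs/LOGBOOK.md, and any necessary state must be synced back to this SOP or the workplan.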

## YYYY-MM-DD | Host reboot / shutdown / startup record

Scope:
- Hosts:
- Operation: shutdown / startup / reboot / recovery
- SOP version used:
- Operator:
- Maintenance window:

Pre-check:
- Cold-start scorecard:
- Backup status:
- Offsite verifier:
- Public routes:
- momo DB parity:
- Alertmanager rules / E2E:
- Credential escrow:

Execution:
- Start time:
- End time:
- Commands / console actions:
- Services paused:
- Services released:

Result:
- 110:
- 120:
- 121:
- 188:
- Cold-start scorecard after:
- Backup status after:
- Offsite verifier after:
- DB parity after:
- Alerts after:

Difference versus previous reboot:
- Faster:
- Slower:
- New blocker:
- Repeated blocker:
- False positive / detector tuning:
- SOP change required: yes/no

SOP update:
- Previous version:
- New version:
- Change reason:
- Files updated:

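14.6 SOP version comparison and revision rules

After every reboot, compare against the previous record; writing "recovered" alone is not enough.

Comparison item	How to measure
Time to SSH	From power-on to SSH OK on each host
Time to K3s Ready	From 120/121 boot to nodes Ready
Time to public routes	From K3s Ready to public 2xx/3xx
Time to backup green	From 110 ready to backup status / offsite verifier green
Persistent blockers	Anything appearing in two or more consecutive reboots becomes an SOP hard gate
False positives	For example the momo scheduler detector WARN; state the direct evidence and the tuning direction clearly
Procedure drift	When live cron, the Ansible template, or a script path disagrees with the SOP, fix the canonical source first, then the SOP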
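
The persistent-blocker row can be applied mechanically once each reboot record lists its blockers. A minimal sketch, assuming the LOGBOOK records have been reduced to one set of blocker labels per reboot in chronological order; `persistent_blockers` and the one-off label are hypothetical.

```python
from typing import List, Set


def persistent_blockers(history: List[Set[str]]) -> Set[str]:
    """Return blockers present in any two adjacent (consecutive) records."""
    repeated: Set[str] = set()
    for previous, current in zip(history, history[1:]):
        repeated |= previous & current
    return repeated


history = [
    {"120 ping / SSH / K3s read-only check"},
    {"120 ping / SSH / K3s read-only check", "hypothetical one-off blocker"},
]
print(persistent_blockers(history))
```

A blocker that shows up once and disappears is not promoted, while the repeated 120 reachability blocker would be listed as a hard-gate candidate.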

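Revision rules:

Only the live baseline or percentages change: no version bump; update the date and evidence only.
Operation order is added, removed, or changed: bump the minor version, for example v1.4 -> v1.5.
Destructive operations, data repair strategy, or human approval boundaries are involved: escalate to a major-ready review and obtain human approval first.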
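
The three revision rules can be sketched as a small function. The change-kind labels and both function names are hypothetical; the only behavior taken from the SOP is "no bump", "minor bump", and "human approval first".

```python
import re


def bump_minor(version: str) -> str:
    """Increment the minor part of a 'vMAJOR.MINOR' version string."""
    match = re.fullmatch(r"v(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unexpected version format: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return f"v{major}.{minor + 1}"


def revision_action(change_kind: str, current: str) -> str:
    """Return the resulting version, or the review step that must happen first."""
    if change_kind == "baseline-only":
        return current  # update date and evidence, keep the version
    if change_kind == "procedure-order":
        return bump_minor(current)
    if change_kind == "destructive-or-approval-boundary":
        return "major-ready review: human approval required first"
    raise ValueError(f"unknown change kind: {change_kind!r}")


print(bump_minor("v1.4"))
print(bump_minor("v1.9"))
```

The minor part is treated as an integer, so v1.9 is followed by v1.10, which is how the version numbers in this SOP's history progress.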

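14.7 2026-06-06 reboot record comparison anchor

No new reboot was performed on 2026-06-06; this was a live recovery check. It still serves as the comparison baseline for the next reboot:

Item	2026-06-06 baseline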
Overall	`65% BLOCKED`
Cold-start	`PASS=71 WARN=3 BLOCKED=3`
Remaining hard blocker	120 ping / SSH / K3s read-only check
Backup aggregate	`failed=1`, Configs only, due 120 config capture
Backup freshness	110 and 188 fresh, no stale jobs
Offsite	13 repos latest-only green
Escrow	5 markers missing
momo scheduler	direct healthy; 15:03 scorecard no longer emits scheduler WARN
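
Scorecard lines such as `PASS=71 WARN=3 BLOCKED=3` recur in every anchor that follows, so comparing them is easier with a parser. This sketch is not the cold-start script's own logic: the `DEGRADED` label for "warnings, no blockers" matches results recorded elsewhere in this SOP, while the `GREEN` label for a clean run and both function names are assumptions.

```python
import re
from typing import Dict


def parse_scorecard(line: str) -> Dict[str, int]:
    """Extract PASS / WARN / BLOCKED counts from a scorecard summary line."""
    counts = {key: int(value)
              for key, value in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line)}
    missing = {"PASS", "WARN", "BLOCKED"} - counts.keys()
    if missing:
        raise ValueError(f"scorecard line lacks: {sorted(missing)}")
    return counts


def scorecard_verdict(counts: Dict[str, int]) -> str:
    """Any BLOCKED wins; otherwise any WARN degrades; otherwise clean."""
    if counts["BLOCKED"] > 0:
        return "BLOCKED"
    if counts["WARN"] > 0:
        return "DEGRADED"
    return "GREEN"  # assumed label for a run with no WARN and no BLOCKED


for line in ("PASS=71 WARN=3 BLOCKED=3", "PASS=81 WARN=2 BLOCKED=0"):
    print(line, "->", scorecard_verdict(parse_scorecard(line)))
```

Percentages such as `65%` are not derived here, because the SOP does not state how they are computed.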

14.8 2026-06-12 post-reboot comparison anchor

After the unplanned reboot of 110 on 2026-06-12, the new comparison anchors for SOP v1.5 are:

Item	2026-06-12 post-reboot baseline
110 host	`systemd running`, failed units `0`, Swap `0B/7.8GiB`
110 service recovery	Harbor / Gitea / Prometheus / Alertmanager / Sentry / Stock / public tools reachable
Cold-start	`PASS=72 WARN=2 BLOCKED=3`
Remaining hard blocker	120 ping / SSH / K3s read-only check
WARN	120-driven backup aggregate/config component and 120 K3s schedule check
Backup freshness	110 `13/13 fresh failed=1`, 188 `2/2 fresh failed=0`, stale none
Offsite	13 repos latest-only green, `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`
Alerts	Prometheus and Alertmanager expose all five required backup/cold-start/escrow alerts
momo scheduler	scorecard reads `SCHEDULER_RECENT_ACTIVITY 1070` after detector fix
SOP change	v1.5 adds startup judgment layers, GO/NO-GO tree, host recovery cards, and timeline checks

14.9 2026-06-13 post-CD recovery comparison anchor

2026-06-13 was not a host reboot; it is a comparison anchor for verifying that "120/121 workload balancing + CD known_hosts guardrail" can withstand the next normal deployment.

Item	2026-06-14 03:10 baseline
Gitea / ArgoCD	Gitea main `8868c025`, deploy marker `7b034b58`, ArgoCD revision `8868c025`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `26b67d11f7b7de4f9c9d95c01bb1dacf4000e887`
K3s placement	API/Web verified split across `mon` / `mon1` after the latest deploy marker; Worker single replica healthy
Cold-start	`PASS=81 WARN=2 BLOCKED=0`
Public routes	Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS
Backup	`backup-status`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22`
Offsite	textfile `remote_verify_ok=1`, `full_verify_fresh=1`, 13 repos each `snapshot_count=1`
SSH trust	Global `known_hosts` retained 120 / 188 entries after CD; deploy-specific trust moved to `deploy_known_hosts`
Remaining non-service debt	`km-vectorize-29689620` official Job failed with `BackoffLimitExceeded`; failed Pod/log was deleted before inspection; credential escrow missing count remains `5`; 110 has `fwupd` failed units
SOP change	v1.10 changes the first-screen declaration from full green back to degraded, records official `km-vectorize` failure evidence, and verifies live `restartPolicy: Never` / `FallbackToLogsOnError` evidence retention for the next official run

14.10 2026-06-14 110 failed-unit cleanup comparison anchor

The 2026-06-14 08:24 change was not a host reboot; it removed the non-core fwupd failed-unit noise on 110 from the cold-start verdict. This anchor exists so that a firmware metadata refresh failure is not later misread as an AWOOOI runtime failure, while a rollback path is retained.

Item	2026-06-14 08:24 baseline
110 failed units	`systemctl --failed` returns `0 loaded units listed`
fwupd policy	`fwupd-refresh.timer` is `disabled / inactive`, because a non-core firmware metadata refresh failure should not block AWOOOI service recovery
Rollback	To restore the firmware metadata refresh timer, run `sudo systemctl enable --now fwupd-refresh.timer` and then rerun cold-start
Cold-start	`PASS=82 WARN=1 BLOCKED=0`
Remaining WARN	Only the K8s failed Job `km-vectorize-29689620` remains; wait for the next official 03:00 schedule to succeed, or retain the failed Pod/log
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
Credential escrow	5 non-secret evidence markers are still missing; placeholders or secrets must not be used to clear the red light
SOP change	v1.11 changes the 110 failed-unit gate from `GREEN_WITH_FWUPD_WARNING` to `GREEN_WITH_FWUPD_TIMER_DISABLED` and fixes the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.11 2026-06-14 post-CD recovery readback

The 2026-06-14 08:40 change was not a host reboot; it confirmed that the latest CD deploy marker did not regress the reboot recovery state. This anchor is for checking whether the cold-start SOP still holds after a governance / frontend / API CD.

Item	2026-06-14 08:40 post-CD baseline
Gitea / ArgoCD	Gitea main `18b867c3`, ArgoCD revision `18b867c3`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `e0a6d339669fc635357d36ea94215df25e652fa9`
CronJob readback	`km-vectorize` has `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, `terminationMessagePolicy: FallbackToLogsOnError`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon`
Cold-start	`PASS=82 WARN=1 BLOCKED=0`
Public routes	Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.12 records the post-CD no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.12 2026-06-14 post-P2-135 deploy recovery readback

The 2026-06-14 09:27 change was not a host reboot; it confirmed that the reboot recovery baseline did not regress after the P2-135 deploy and formal verification. This anchor also records how to judge the brief 502 during the stockplatform-v2 rollout warmup: recheck the route / TLS directly and rerun the full cold-start; escalate to a persistent public route blocker only if the rerun still fails.

Item	2026-06-14 09:27 post-P2-135 baseline
Gitea / ArgoCD	Gitea main `5bad267e`, ArgoCD revision `5bad267e`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `280e0fbef0d5dccb10f1efe2cc18cf423544254e`
CronJob readback	`km-vectorize` has `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, `terminationMessagePolicy: FallbackToLogsOnError`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon1`
First cold-start	09:26 first run saw `stock.wooo.work` `502` while stockplatform-v2 containers were less than one minute old; direct route and TLS recheck returned `200`
Final cold-start	09:27 rerun returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.13 records the P2-135 post-deploy no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`
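
The warmup rule described in this anchor (a first-run 502 is escalated only if the full rerun still fails) can be written as a two-input decision. A minimal sketch with hypothetical names; it takes the route's HTTP code from the first run and from the rerun, and treats 2xx/3xx as healthy, as the scorecard rows do.

```python
def route_healthy(http_code: int) -> bool:
    """Public routes count as healthy on any 2xx or 3xx response."""
    return 200 <= http_code < 400


def classify_route_failure(first_code: int, rerun_code: int) -> str:
    """Escalate only when the rerun of the full cold-start still fails."""
    if route_healthy(first_code):
        return "ok"
    if route_healthy(rerun_code):
        return "transient rollout warmup"
    return "persistent public route blocker"


print(classify_route_failure(502, 200))
print(classify_route_failure(502, 502))
```

The direct route / TLS recheck between the two runs is still worth recording as evidence, but under this rule it does not change the verdict by itself.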

14.13 2026-06-14 recovery readback after the P2-136 / AI Agent activity production deploy

The 2026-06-14 09:56 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-136 / AI Agent activity production deploy. This anchor specifically records that the deploy marker, ArgoCD revision, live image, and cold-start scorecard must be read together, so that full-stack green is not misreported from gitea/main or a successful CD alone.

Item	2026-06-14 09:56 post-P2-136 baseline
Gitea / ArgoCD	Latest docs head before this recovery commit `a0fe7741`; runtime deploy marker `60a0415c chore(cd): deploy a3de0ff [skip ci]`, ArgoCD revision `60a0415c`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `a3de0ffb8275b6838604b6dff87cd978b8e91122`
CronJob readback	`km-vectorize` has `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, `terminationMessagePolicy: FallbackToLogsOnError`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; failed Job `km-vectorize-29689620` remains retained
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon1`
Cold-start	09:56 returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.14 records the P2-136 / AI Agent activity post-deploy no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.14 2026-06-14 recovery readback after the P2-137 / CI smoke timeout fix

The 2026-06-14 10:40 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-137 production deploy and the BusyBox timeout smoke fix. This anchor records only the recovery / cold-start readback and does not repeat the P2-137 formal verification content.

Item	2026-06-14 10:40 post-P2-137 baseline
Gitea / ArgoCD	Latest docs head before this recovery commit `50d4f2ba`; runtime deploy marker `d023f5d7 chore(cd): deploy f737f27 [skip ci]`, ArgoCD revision `50d4f2ba`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `f737f278dc14372ff1fb15b124b1370c20e1bb99`
CronJob readback	`km-vectorize` has `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, `terminationMessagePolicy: FallbackToLogsOnError`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; failed Job `km-vectorize-29689620` remains retained
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon`
Cold-start	10:40 returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.15 records the post-P2-137 / CI smoke timeout fix no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.15 2026-06-14 recovery readback after the P2-143 owner response preflight

The 2026-06-14 15:00 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-143 owner response preflight and rejection boundary were deployed to production. This anchor records only the recovery / cold-start readback, does not repeat the P2-142 / P2-143 formal verification content, and does not treat the owner response preflight as runtime authorization.

Item	2026-06-14 15:00 post-P2-143 baseline
Gitea / ArgoCD	Latest docs baseline `b09eb1c6 docs(ai): 校準 P2-143 正式驗證紀錄`; runtime deploy marker `667d6329 chore(cd): deploy 755b0a8 [skip ci]`; ArgoCD revision `4abf0c0f750254d3c7137eae049abdfd99630f5f`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `755b0a8d3038df2c52dee280067863d92db1eda5`
CronJob readback	`km-vectorize` schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `suspend=false`, `failedJobsHistoryLimit=3`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; failed Job `km-vectorize-29689620` is still retained, but no failed Pod / log is currently readable
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon`
Cold-start	15:00 returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	The final scorecard verified that awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, and aiops all respond over TLS
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
P2-143 API boundary	The production endpoint returns current `P2-143`, next `P2-144`, completion `100`, and reviewer / Gateway queue, Telegram, Bot API, result capture, learning, PlayBook trust, production write, secret read, and destructive operation all remain `0 / false`
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.16 records the post-P2-143 owner response preflight no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.16 2026-06-14 recovery readback after the P2-144 owner response readback

The 2026-06-14 15:58 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-144 owner response readback state and the subsequent deploy marker 180a6543 were deployed to production. This anchor records only the recovery / cold-start readback, does not repeat the P2-144 formal verification content, and does not treat the owner response readback as runtime authorization, formal receipt, or owner acceptance.

Item	2026-06-14 15:58 post-P2-144 baseline
Gitea / ArgoCD	`gitea/main` has advanced to `180a6543 chore(cd): deploy fef94df [skip ci]`; ArgoCD source revision `180a6543eaf26dd6b345d45114316926056a965a`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `fef94df877c5438f9f34ddbcace8ad8112a141ef`
CronJob readback	`km-vectorize` schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `suspend=false`, `failedJobsHistoryLimit=3`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; failed Job `km-vectorize-29689620` is still retained, but no failed Pod / log is currently readable
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon1`
Cold-start	15:58 returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	The final scorecard verified that awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, and aiops all respond over TLS
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
P2-144 API boundary	The production endpoint returns current `P2-144`, next `P2-145`, completion `100`, and owner response received / accepted / rejected, reviewer / Gateway queue, Telegram, Bot API, result capture, learning, PlayBook trust, production write, secret read, and destructive operation all remain `0 / false`
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.17 records the post-P2-144 owner response readback no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.17 2026-06-14 recovery readback after the P2-145 owner response acceptance gate

The 2026-06-14 16:29 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-145 owner response acceptance gate was deployed to production. This anchor records only the recovery / cold-start readback, does not repeat the P2-145 formal verification content, and does not treat the acceptance gate as owner response received / accepted, runtime authorization, or a formal write.

Item	2026-06-14 16:29 post-P2-145 baseline
Gitea / ArgoCD	Latest docs baseline `06fe0a8f docs(logbook): 記錄 P2-145 正式驗證 [skip ci]`; runtime deploy marker `36fbfc6b chore(cd): deploy 386dbd0 [skip ci]`; ArgoCD source revision `06fe0a8f14167824fea512f942d2569431bbcbc8`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `386dbd078ef63401d9736048463f4ef5326442d9`
CronJob readback	`km-vectorize` schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `suspend=false`, `failedJobsHistoryLimit=3`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; failed Job `km-vectorize-29689620` is still `Failed 0/1`
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon`
Cold-start	16:29 returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	The final scorecard verified that awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, and aiops all respond over TLS
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
P2-145 API boundary	The production endpoint returns current `P2-145`, next `P2-146`, completion `100`, and owner response received / accepted / rejected, reviewer / Gateway queue, Telegram, Bot API, result capture, learning, PlayBook trust, production write, secret read, and destructive operation all remain `0 / false`
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.18 records the post-P2-145 owner response acceptance gate no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.18 2026-06-14 recovery readback after the IwoooS P0 configuration control priority order

The 2026-06-14 17:04 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the IwoooS P0 configuration control priority order was deployed to production. This anchor records only the recovery / cold-start readback, does not repeat the P0 configuration control formal verification content, and does not treat dashboard visibility in the frontend as an Nginx reload, DNS / TLS probe, certbot renew, workflow / secret change, public route change, or runtime gate.

Item	2026-06-14 17:04 post-IwoooS-P0-config baseline
Gitea / ArgoCD	Latest docs baseline `af62ec1f docs(iwooos): 記錄 P0 配置控管正式驗證 [skip ci]`; runtime deploy marker `ed651a98 chore(cd): deploy e992af8 [skip ci]`; ArgoCD source revision `af62ec1fe72b3e84e179d80e788e5a5902bdaf27`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `e992af89955f8aae40a383b2f2e2f645445a690d`
CronJob readback	`km-vectorize` schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `suspend=false`, `failedJobsHistoryLimit=3`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; failed Job `km-vectorize-29689620` is still `Failed 0/1`
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon1`
Cold-start	17:04 returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	The final scorecard verified that awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, and aiops all respond over TLS; an additional readback of the IwoooS route `/zh-TW/iwooos` returned `200`
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
IwoooS boundary	The P0 configuration control priority order is visible, but live evidence received, runtime gate, Nginx live config, DNS / TLS probe, certbot renew, workflow / secret change, public route change, and production write must still not be inferred as authorized from this readback
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.19 records the post-IwoooS P0 configuration control priority order no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.20 2026-06-15 km-vectorize official success readback

The 2026-06-15 03:11 change was not a host reboot; it confirmed that the official 03:00 km-vectorize schedule succeeded and closed the ArgoCD fully healthy gate. This anchor records only the recovery / cold-start readback: no Job was manually deleted or created, nothing was applied to live with kubectl patch, no service was restarted, and no backup / restore / escrow owner acceptance ledger is treated as authorization for a backup run, restore run, credential escrow marker write, host write, or production write.

Item	2026-06-15 03:11 km-vectorize official success baseline
ArgoCD	`awoooi-prod` sync `Synced`, health `Healthy`, revision `d388e5b477333fd5e661527a729406a4e8215320`
CronJob readback	`km-vectorize` schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `suspend=false`, `lastScheduleTime=2026-06-14T19:00:00Z`, `lastSuccessfulTime=2026-06-14T19:00:55Z`
Job / Pod / log	Job `km-vectorize-29691060` `Complete`, Pod `km-vectorize-29691060-78xpz` `Completed` restart `0`, log `embed-all: 200 {"total":31,"success":31,"failed":0}`
Cold-start	03:11 returned `PASS=81 WARN=2 BLOCKED=0`, result `DEGRADED`
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, last aggregate `2026-06-15 02:40:13`
Escrow	`ESCROW_MISSING_COUNT=5`; missing `restic_repository_password`, `offsite_provider_credentials`, `break_glass_admin_credentials`, `dns_registrar_recovery`, `oauth_ai_provider_recovery`
Remaining warnings	188 momo scheduler registration/activity unconfirmed; K8s still retains the old failed Job evidence
SOP change	v1.21 closes the `km-vectorize` official success gate, but the declaration ceiling remains `SERVICE_AVAILABLE_ARGOCD_HEALTHY_DR_ESCROW_BLOCKED`; `full-stack green` and `DR complete` must not be claimed
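
The CronJob readback above mixes UTC timestamps with an `Asia/Taipei` schedule, which is easy to misread. The sketch below converts the recorded values and compares them; treating "`lastSuccessfulTime` is not earlier than `lastScheduleTime`" as "the latest scheduled run succeeded" is a heuristic assumed here, not a Kubernetes guarantee, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Asia/Taipei is UTC+8 with no daylight saving, so a fixed offset is sufficient.
TAIPEI = timezone(timedelta(hours=8))


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes RFC 3339 UTC timestamp such as 2026-06-14T19:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def to_taipei(value: str) -> str:
    """Render a Kubernetes UTC timestamp as local Taipei wall-clock time."""
    return parse_k8s_time(value).astimezone(TAIPEI).strftime("%Y-%m-%d %H:%M")


def latest_schedule_succeeded(last_schedule: str, last_success: str) -> bool:
    """Heuristic: the newest success is not older than the newest schedule time."""
    return parse_k8s_time(last_success) >= parse_k8s_time(last_schedule)


print(to_taipei("2026-06-14T19:00:00Z"))
print(latest_schedule_succeeded("2026-06-14T19:00:00Z", "2026-06-14T19:00:55Z"))
print(latest_schedule_succeeded("2026-06-13T19:00:00Z", "2026-06-04T11:00:37Z"))
```

The first conversion shows why a `lastScheduleTime` dated 06-14 corresponds to the 06-15 03:00 official run; the last comparison reproduces the earlier anchors, where the newest success predated the newest schedule.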

14.19 2026-06-14 recovery readback after the high-value configuration Owner Packet frontend sync

The 2026-06-14 18:15 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the high-value configuration Owner Packet frontend sync was deployed to production. This anchor records only the recovery / cold-start readback, does not repeat the Owner Packet frontend formal verification, posture projection, or intake preflight content, and does not treat a draft being visible in the frontend as request sent, owner response received / accepted, runtime gate, Nginx reload, DNS / TLS probe, certbot renew, workflow / secret change, host write, active scan, or production write.

Item	2026-06-14 18:15 post-owner-packet-frontend baseline
Gitea / ArgoCD	Latest repo docs baseline `0a4766dd docs(security): 新增高價值配置 owner request 草稿包 [skip ci]`; runtime deploy marker `16c6b983 chore(cd): deploy e999c16 [skip ci]`; feature commit `e999c16b fix(iwooos): 同步高價值配置 owner packet 前台`; ArgoCD source revision `0a4766ddc94b0690824ce3deba5c6b9a69764f94`, sync `Synced`, health `Degraded`
K3s image readback	API/Web/Worker/CronJob image tag `e999c16b3435f197b78fe2adfeec1c4faa6c4675`
CronJob readback	`km-vectorize` schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `suspend=false`, `failedJobsHistoryLimit=3`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; failed Job `km-vectorize-29689620` is still `Failed 0/1`
K3s placement	API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon`
Cold-start	18:15 returned `PASS=82 WARN=1 BLOCKED=0`
Public routes	The final scorecard verified that awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, and aiops all respond over TLS; additional readbacks of the IwoooS route `/zh-TW/iwooos` and the AwoooP route `/zh-TW/awooop` both returned `200`
Backup	110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`
110 host	`systemctl --failed` returns `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`
Owner Packet boundary	The Owner Packet frontend figures are visible, but request sent, owner response received / accepted / rejected, reviewer queue write, live evidence, runtime gate, Nginx live config, DNS / TLS probe, certbot renew, workflow / secret change, host write, active scan, and production write must still not be inferred as authorized from this readback
Remaining gate	`km-vectorize-29689620` official Job still failed; credential escrow missing count still `5`
SOP change	v1.20 records the post-Owner Packet frontend sync no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED`

14.21 2026-06-18 Plan B degraded-operation path

The 2026-06-18 change was neither a host reboot nor a new live recovery readback; it wrote the Plan B requested by the commander into an executable SOP. This anchor is for checking at the next reboot whether §1.4 was followed: decide Plan A / Plan B first, then the degradation level, the stop lines, and the conditions for returning to Plan A.

Item	2026-06-18 Plan B baseline
SOP version	`v1.22`
Plan B trigger	backup/offsite/verifier running, P0 host unreachable for 15 minutes, 188 data unhealthy, 110 registry / observability unhealthy, a single K3s node degraded, route-only green, cold-start WARN, credential escrow missing
Service levels	`B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, `B5_DR_COMPLETE`
Host fallback paths	110 / 120 / 121 / 188 / K3s / Public gateway each have a degraded path and conditions for returning to Plan A
Timeline	`T+0` freeze, `T+5` host boot, `T+15` data / registry stop-line, `T+30` route-only guard, `T+60` cold-start scorecard, `T+120` incident / follow-up
Closeout states	`RETURNED_TO_PLAN_A`, `SERVICE_AVAILABLE_DEGRADED`, `OPEN_INCIDENT_REQUIRED`
SOP change	v1.22 adds Plan B; Plan B must not be treated as runtime write authorization, and documenting Plan B is no basis for claiming a new service green, full-stack green, or DR complete
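
The six service levels read naturally as an ordered scale, which makes the "declaration ceiling" idea checkable: a claim is allowed only up to the level the evidence supports. This is a sketch under the assumption that B0 through B5 are strictly ordered by their numeric prefix; `ServiceLevel` and `may_declare` are hypothetical names, not the declaration guard script.

```python
from enum import IntEnum


class ServiceLevel(IntEnum):
    # Names are the service levels listed above; the ordering is an assumption.
    B0_ABORTED_BEFORE_REBOOT = 0
    B1_HOST_RECOVERY_ONLY = 1
    B2_CORE_SERVICE_READY = 2
    B3_SERVICE_AVAILABLE_DEGRADED = 3
    B4_FULL_STACK_GREEN = 4
    B5_DR_COMPLETE = 5


def may_declare(claimed: ServiceLevel, evidenced: ServiceLevel) -> bool:
    """A declaration is allowed only at or below the evidenced level."""
    return claimed <= evidenced


evidenced = ServiceLevel.B3_SERVICE_AVAILABLE_DEGRADED
print(may_declare(ServiceLevel.B2_CORE_SERVICE_READY, evidenced))
print(may_declare(ServiceLevel.B4_FULL_STACK_GREEN, evidenced))
```

With evidence for a degraded-but-available service, a core-service-ready claim passes and a full-stack-green claim is rejected.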

14.23 2026-06-18 repo-side readiness audit blocker closure

The second 2026-06-18 change was neither a live recovery nor a host reboot; it converged the repo-side hard blockers from the previous readiness audit into verifiable contracts. This anchor means that the reboot SOP / baseline / scripts / Ansible source-of-truth / Gitea workflow contract can pass the readiness audit inside the repo; it does not mean the live hosts were re-verified that day.

Item	2026-06-18 repo-side readiness baseline
SOP version	`v1.23`
Cold-start gate	`full-stack-cold-start-check.sh` adds `NODE_FS_ERROR_EVENTS`; when 120 / K3s node events show filesystem / fsck / read-only / I/O evidence, K3s must not be declared safe
Backup contract	`backup-awoooi.sh` drops the service-level direct offsite sync; offsite publishing goes only through the central `sync-offsite-backups.sh` / verifier gate
Ansible 110 source-of-truth	`110-devops.yml` now includes the cold-start monitor, runner guardrails, host textfile exporters, backup scripts, daily backup heartbeat, offsite evidence report, and offsite full-sync verifier
Ansible 188 source-of-truth	`188-ai-web.yml` now includes textfile exporters and pins the momo PostgreSQL backup entrypoint to the host-owned `/home/ollama/bin/momo-pg-backup.sh`
Nginx source-of-truth	`nginx-sync.yml` now includes the `188-internal-tools-https.conf.j2` route sync
CI / workflow contract	`.gitea/workflows/ansible-lint.yml` switches to self-hosted validation; its triggers cover Ansible, ops baseline, monitoring rules, backup scripts, reboot scripts, docs, and the workflow itself
Validation toolchain	`bootstrap-ansible-validation-env.sh` prefers Python 3.11 / 3.10 when building the pinned validation venv; `ansible-validate.sh` fixes the repo roles path and guards syntax / loader readiness with the minimum lint profile
Repo-side readiness audit	`PASS=185 WARN=1 BLOCKED=0`, result `READY WITH WARNINGS`; the only warning is that `--live` was not run
Declaration limit	`REPO_SIDE_REBOOT_READINESS_READY_WITH_LIVE_CHECK_REQUIRED` may be claimed; `FULL_STACK_GREEN`, `DR_COMPLETE`, and live service recovery complete must not be claimed

14.24 2026-06-18 live cold-start readback after repo-side closure

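The 2026-06-18 12:13-12:17 readback is a same-day live verification after the repo-side readiness closure. It is neither a host reboot nor a runtime repair; its purpose is to keep "mechanism completed" separate from "current live state" and so avoid a false green.

Item	2026-06-18 12:17 live baseline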
SOP version	`v1.24`
Cold-start read-only result	`PASS=83 WARN=1 BLOCKED=0`，result `DEGRADED`
Host reachability	110 / 120 / 121 / 188 ping OK and SSH port OK
K3s	`mon` / `mon1` Ready control-plane；VIP `192.168.0.125` present on 120；`NODE_FS_ERROR_EVENTS 0`
110 / 188 service checks	110 Harbor / Gitea / Prometheus / Alertmanager / Sentry reachable；188 PostgreSQL / Redis / momo / SigNoz reachable
Backup health	110 backup health `total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0`；188 backup health `total=2 stale=0`
Public route / TLS	awoooi API/Web、mo、momo health、Gitea、Harbor、registry、Sentry、SigNoz、stock、Langfuse、Bitan、aiops all 2xx/3xx with TLS verified
AWOOOI rollout convergence	After transient 12:14 startup window, final readback shows API `2/2`, Web `2/2`, Worker `1/1`, Canary `1/1`, API health `200 healthy`
Remaining warning	retained stale Job `km-vectorize-29689620` from 2026-06-14 03:00; later official Jobs `km-vectorize-29692500`, `29693940`, `29695380` are `Complete`
Declaration limit	May declare `SERVICE_AVAILABLE_DEGRADED`; must not declare `FULL_STACK_GREEN` because `WARN=1`; must not declare `DR_COMPLETE` because credential escrow evidence still requires real non-secret owner evidence

14.25 2026-06-18 stale failed Job classification and service-green readback

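The 2026-06-18 13:43 change neither deleted a K8s Job nor created one manually. It corrected the cold-start decision logic: a retained historical failed Job is evidence, and only a failed Job with no later official successful Job is an active blocker. Evidence retention and service readiness therefore no longer conflict.

Item	2026-06-18 13:43 stale Job classification baseline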
SOP version	`v1.25`
Script change	`full-stack-cold-start-check.sh` emits `FAILED_JOBS`, `STALE_FAILED_JOBS`, and `ACTIVE_FAILED_JOBS`
Active blocker rule	`ACTIVE_FAILED_JOBS > 0` causes warning; `STALE_FAILED_JOBS > 0` is retained evidence and does not warn by itself
Readiness audit contract	`reboot-recovery-readiness-audit.sh` requires both stale and active failed Job counters
Repo-side validation	`bash -n` passed; readiness audit returned `PASS=187 WARN=1 BLOCKED=0` with only the expected non-live warning
110 live script sync	`/home/wooo/scripts/full-stack-cold-start-check.sh` hash `b48af9c603aa5a1a4f9434d6cc510398bbecc2e46400a21410e735d5f7d177c4`; previous version backed up to `/home/wooo/scripts/full-stack-cold-start-check.sh.before-stale-active-job-classification.20260618-135516`
Live cold-start readback	`PASS=84 WARN=0 BLOCKED=0`, result `GREEN`
K8s Job evidence	`FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`, `BAD_PODS=0`
Backup / DR evidence	110 backup health `13/13 fresh failed=0`; 188 backup health `2/2 fresh failed=0`; escrow readback still `ESCROW_MISSING_COUNT=5`
Declaration limit	May declare `FULL_STACK_GREEN_FOR_SERVICE`; must not declare `DR_COMPLETE`, `credential escrow complete`, or any runtime/security acceptance
SOP change	v1.25 defines retained failed Job evidence vs active failed Job blocker; future reboot comparison must record all three counters
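
The retained-evidence vs. active-blocker rule can be sketched as below. This is a minimal illustration, not the logic inside `full-stack-cold-start-check.sh`, which is not shown in this SOP. The function name `classify_failed_jobs`, the stdin format, and the assumption that Jobs of one CronJob share a `<owner>-<numeric-suffix>` name are all hypothetical.

```shell
# Hypothetical stdin format, one Job per line:
#   <job-name> <Complete|Failed> <start-epoch>
# Assumption: Jobs created by the same CronJob are named <owner>-<digits>.
classify_failed_jobs() {
  awk '
    {
      owner = $1
      sub(/-[0-9]+$/, "", owner)
      # Remember the newest successful run per owner.
      if ($2 == "Complete" && $3 + 0 > newest_ok[owner] + 0) newest_ok[owner] = $3
      if ($2 == "Failed") { n++; f_owner[n] = owner; f_time[n] = $3 }
    }
    END {
      stale = 0; active = 0
      for (i = 1; i <= n; i++) {
        # Stale: a later successful Job of the same owner exists.
        if ((f_owner[i] in newest_ok) && newest_ok[f_owner[i]] + 0 > f_time[i] + 0) stale++
        else active++
      }
      printf "FAILED_JOBS=%d STALE_FAILED_JOBS=%d ACTIVE_FAILED_JOBS=%d\n", n, stale, active
    }'
}
```

Under these assumptions only `ACTIVE_FAILED_JOBS > 0` should raise a warning; a stale count is recorded for comparison at the next reboot.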

14.26 2026-06-24 heartbeat noise / MOMO detector / rollout false-negative closure

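The 2026-06-24 change was not a host reboot. It closed two false signals in the reboot SOP: the healthy Telegram heartbeat no longer floods the channel every 30 minutes, and the MOMO scheduler / current-month parity detector no longer raises a false WARN because of an old log pattern or the wrong DB exec user. This anchor also records a CD rollout false negative: the worker startup probe timed out on the first attempt and the container restarted once, K8s eventually reported ready, but Gitea CD #3289 was marked Failure because of the rollout status timeout.

Item	2026-06-24 live baseline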
SOP version	`v1.27`
Heartbeat code	`a84a5a0b fix(api): suppress healthy Telegram heartbeat noise`
Deploy marker	`4a7b5329 chore(cd): deploy a84a5a0 [skip ci]`
Production image readback	API/Web/Worker image tag `a84a5a0bc4a672ac6feb95a85ac590aa2dd4bb71`
Production rollout	API `2/2`、Web `2/2`、Worker `1/1` Ready
CD result caveat	Gitea CD `#3289` shows Failure because worker rollout status timed out before old replica convergence; K8s deploy marker and production readiness are green
Healthy heartbeat rule	When `status=healthy` with no warnings, only the suppression marker / log is updated and no Telegram message is sent; warnings and recovery can still be sent
Live temporary suppression	Redis keys `heartbeat:silent_last_sent` and `heartbeat:healthy_suppressed_last_seen` set with 24h TTL during deployment; no token or secret printed
110 live script sync	`/home/wooo/scripts/full-stack-cold-start-check.sh` hash `47e67d0c018f741acfba17a93cb1d668779bd08745902099a10ee61e73ea55b6`; previous version backed up to `/home/wooo/scripts/full-stack-cold-start-check.sh.before-momo-detector-20260624-020759`
MOMO scheduler evidence	`SCHEDULER_CONTAINER_RUNNING true`、`SCHEDULER_CONTAINER_HEALTH healthy`、`SCHEDULER_RECENT_ACTIVITY 1303`
MOMO DB parity evidence	`MOMO_MONTHLY_SYNC 10936
K3s node evidence	`NODE_FS_ERROR_EVENTS 0`、`NODE_READONLY_FILESYSTEM_TRUE 0`、`NODE_DISK_PRESSURE_TRUE 0`、VIP `192.168.0.125` present
Live cold-start readback	`PASS=85 WARN=0 BLOCKED=0`, result `GREEN`
Declaration limit	May declare the current service recovery scorecard green; must not declare `DR_COMPLETE`, since credential escrow evidence missing remains `5`
SOP change	v1.27 requires heartbeat success-message suppression, MOMO detector parity using app-provided DB env, and rollout false-negative classification before retrying CD

For Worker / CronJob / queue services whose startup time may exceed the startup probe, a single failed rollout status --timeout=60s is not enough to declare production down. Check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health together. If the pod ends up ready while CD is red, this is CI timeout / probe tuning work, not a service restart incident; follow up by adjusting the startup probe or the post-deploy timeout.
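
A minimal sketch of that classification, assuming the evidence has already been collected by hand or by other tooling. The function name, argument order, and result labels are hypothetical; the restart count is left for the operator to record, not modelled here.

```shell
# classify_rollout <cd_result> <ready> <desired> <image_matches_marker> <health_code>
#   cd_result              "Success" or "Failure" as shown by the CD run
#   ready / desired        replica counts from the deployment readback
#   image_matches_marker   "yes" if the running image tag equals the deploy marker
#   health_code            HTTP status of the service / public health probe
classify_rollout() {
  local cd_result="$1" ready="$2" desired="$3" image_ok="$4" health="$5"
  local runtime_ok=no
  if [ "$ready" -ge 1 ] && [ "$ready" -eq "$desired" ] \
     && [ "$image_ok" = yes ] && [ "$health" = 200 ]; then
    runtime_ok=yes
  fi
  if [ "$runtime_ok" = yes ] && [ "$cd_result" = Success ]; then
    echo ROLLOUT_GREEN
  elif [ "$runtime_ok" = yes ]; then
    # Runtime converged but CD is red: probe / timeout tuning, not an outage.
    echo CD_FALSE_NEGATIVE_TUNE_PROBE_OR_TIMEOUT
  else
    echo ROLLOUT_NOT_CONVERGED_INVESTIGATE
  fi
}
```

For the CD #3289 case described above, `classify_rollout Failure 1 1 yes 200` falls into the false-negative branch.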

2026-06-24 02:44 addendum: the 02:08 PASS=85 WARN=0 BLOCKED=0 result in this section has been superseded by the MOMO data freshness gate in §14.28; do not cite it again to declare full-stack green.

14.27 2026-06-24 188 node-exporter / backup health alert closure

The second 2026-06-24 change restored the 188 node-exporter textfile scrape. Both backup-status and cold-start could read a fresh 188 backup_health.prom over SSH, but with the Prometheus node-exporter-188 scrape down, BackupHealthMonitorMissing188 fired correctly. In this situation the alert must not be silenced; the exporter must be restored.

Item	2026-06-24 188 exporter baseline
SOP version	`v1.28`
Root cause	188 `9100` connection refused；`node_exporter` / `prometheus-node-exporter` unit absent/inactive；Prometheus could not scrape `backup_health.prom`
False start	Mounting `/home/ollama/node_exporter_textfiles` via `/host/home/ollama/...` failed because `/home/ollama` is `750` and textfile collector saw `permission denied`
Live restore	Docker container `node-exporter` uses `quay.io/prometheus/node-exporter:v1.8.2`, `restart=unless-stopped`, `-p 9100:9100`, rootfs mount `/host`, direct textfile bind `/home/ollama/node_exporter_textfiles:/textfile:ro`
Repo helper	`scripts/ops/188-node-exporter-restore.sh`
Local metrics	`awoooi_backup_health_monitor_up{host="188"} 1`; `node_textfile_scrape_error 0`
Prometheus readback	`up{job="node-exporter-188"} 1`; `awoooi_backup_health_monitor_up{host="188"} 1`; `absent(awoooi_backup_health_monitor_up{host="188"})` empty
Alert readback	`ALERTS{alertname="BackupHealthMonitorMissing188"}` empty
Declaration limit	May declare 188 backup health scrape restored; must not treat this as credential escrow complete or backup retention policy complete

If BackupHealthMonitorMissing188 is active after a future reboot while SSH / backup-status shows a fresh backup_health.prom, check this first:

curl -fsS http://192.168.0.188:9100/metrics | grep -E 'awoooi_backup_health_monitor_up|node_textfile_scrape_error'

If port 9100 refuses connections or the textfile collector reports an error, restore the exporter with the repo helper first:

ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh

After the restore, check Prometheus / Alertmanager again; do not silence the alert directly.
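
The triage order above can be sketched as a small filter over the metrics text. This is an illustration only: the function name and result labels are hypothetical, and it assumes the metric lines look exactly like the local metrics readback recorded in this section.

```shell
# Reads node-exporter metrics text on stdin, for example the output of the
# curl command above without the grep. Empty input stands for "9100 refused".
triage_188_backup_metrics() {
  local metrics up err
  metrics=$(cat)
  up=$(printf '%s\n' "$metrics" | awk '$1 == "awoooi_backup_health_monitor_up{host=\"188\"}" { print $2; exit }')
  err=$(printf '%s\n' "$metrics" | awk '$1 ~ /^node_textfile_scrape_error/ { print $2; exit }')
  if [ -z "$metrics" ]; then
    echo EXPORTER_UNREACHABLE_RUN_RESTORE_HELPER
  elif [ "$err" != 0 ]; then
    # Missing or non-zero scrape error metric: the collector cannot be trusted.
    echo TEXTFILE_COLLECTOR_ERROR_RUN_RESTORE_HELPER
  elif [ "$up" != 1 ]; then
    echo BACKUP_HEALTH_METRIC_MISSING_CHECK_TEXTFILE
  else
    echo EXPORTER_OK_CHECK_PROMETHEUS_SCRAPE
  fi
}
```

None of the branches silences an alert; each only names the next readback or helper to run.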

14.28 2026-06-25 MOMO Google Drive token and data freshness blocker

The third 2026-06-24 change made "MOMO service is alive but data is stale" a cold-start hard gate. At 2026-06-25 11:44 the MOMO service, public route, DB parity, scheduler activity, and backup/offsite were all shown to be available, but the Google Drive token artifact metadata was missing and the data had stopped at 2026-06-17, so cold-start was correctly BLOCKED. As of 2026-06-25 14:16, legitimate import job 57 has cleared that data freshness blocker: MOMO service health is V10.674, daily_sales_snapshot and realtime_sales_monthly both reach 2026-06-24, MOMO_DAILY_FRESHNESS is 1|2026-06-24, and the dedicated preflight returned PASS=18 WARN=3 BLOCKED=0. This still does not mean DR complete, and it does not permit reading or storing the Google Drive token content.

Item	2026-06-25 MOMO freshness / token baseline
SOP version	`v1.51`
Token current state	`MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`; dedicated preflight also saw host token metadata aligned to scheduler UID and container-side token artifact mode `600`; token content was not read
Token recovery boundary	Owner-gated maintenance only; do not read, paste, or store the token value / hash / partial; do not write chat passwords or workarounds into the repo
Drive auth behavior	2026-06-25 10:04 fail-closed evidence remains historical proof that auth failure does not become a fake success. 14:16 readback shows the later legitimate import succeeded and the blocker is cleared.
Drive pending folder	`當日業績匯入`，pattern `即時業績_當日`; latest successful source recorded by job `57`
Latest valid import	Job `57 completed`，`即時業績_當日.xlsx`，`2026-06-25T13:16:47.359958..2026-06-25T13:18:02.964985`，`15383/15383/0`
DB parity	`daily_sales_snapshot=109061
Data freshness	`MOMO_DAILY_FRESHNESS 1
Live cold-start readback	`PASS=89 WARN=0 BLOCKED=0`, result `GREEN`; dedicated MOMO preflight `PASS=18 WARN=3 BLOCKED=0`
110 live script sync	`/home/wooo/scripts/full-stack-cold-start-check.sh` hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`
Alert behavior	Drive auth failure must send failure notification; heartbeat success remains suppressed; stale data alert should clear only with fresh DB evidence like job `57` / freshness `1`
Declaration limit	May declare hosts/routes/K3s/backups/MOMO service/MOMO data freshness recovered for this evidence set; must not declare DR complete, credential escrow complete, Wazuh host registry accepted, or runtime/security acceptance

Minimal MOMO post-reboot readback:

scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh

ssh ollama@192.168.0.188 '
stat -c "%u:%g:%a %n" /home/ollama/momo-pro/config/google_token.json 2>/dev/null || echo "google_token.json missing"
docker top momo-scheduler -eo pid,user,uid,gid,args | head -n 3
docker logs --since 2h momo-scheduler 2>&1 | grep -E "AutoImport|Google Drive|Permission denied|could not locate runnable browser|沒有找到|發現檔案|匯入失敗通知" | tail -120
'

ssh ollama@192.168.0.188 'db_user=$(docker exec momo-pro-system printenv POSTGRES_USER); db_name=$(docker exec momo-pro-system printenv POSTGRES_DB); db_pass=$(docker exec momo-pro-system printenv POSTGRES_PASSWORD); docker exec -i -e PGPASSWORD="$db_pass" momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -At' <<'SQL'
SELECT 'daily_sales_snapshot|' || count(*) || '|' || min(snapshot_date)::date || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
SELECT 'realtime_sales_monthly|' || count(*) || '|' || min("日期")::date || '|' || max("日期")::date FROM realtime_sales_monthly;
SELECT 'daily_freshness|' || (CURRENT_DATE - max(snapshot_date)::date) || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
SQL

Preferred path is the scripted preflight. It is read-only and returns 0 for clean, 1 for WARN-only, and 2 for BLOCKED. 2026-06-25 14:16 live run returned PASS=18 WARN=3 BLOCKED=0: https://mo.wooo.work/health and local health both returned 200, health version was V10.674, app / scheduler / Telegram bot were healthy, scheduler restart count was 0, token metadata aligned to scheduler UID without reading token content, current-month DB parity matched, latest daily import job 57 was clean, and DB_DAILY_FRESHNESS 1|2026-06-24 cleared the MOMO hard blocker. The remaining WARNs are stability / future-evidence notes, not blockers.

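If the Drive token artifact is missing or the Drive pending folder has no new source file, do not truncate manually, do not re-import old archive files to manufacture "latest" data, and do not treat DB parity as data freshness. The next evidence that clears the blocker must be:

The owner provides a non-secret evidence ref confirming that the Google Drive token artifact or a legitimate source file can be restored.
The maintenance window, rollback owner, and post-check owner are explicitly recorded.
The token artifact is verified by metadata only: owner aligned to the scheduler UID, mode no wider than 600, and no token content printed.
A new 即時業績_當日 source file is visible, or the scheduler can successfully list the pending sources.
The import job succeeds with sync_success=true, and the Drive file is moved only after success.
The date bounds of daily_sales_snapshot and realtime_sales_monthly agree, and MOMO_DAILY_FRESHNESS <= 2.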
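
Two of the checks above are purely mechanical and can be sketched without touching token content: the metadata check on a `uid:gid:mode` string (as printed by the stat readback in this section) and the freshness threshold on a `days|date` value. The helper names `token_stat_ok` and `freshness_ok` are hypothetical, not part of the preflight script.

```shell
# token_stat_ok "<uid>:<gid>:<octal-mode>" <scheduler-uid>
# Succeeds only if the owner matches the scheduler UID and no permission
# bit outside owner read/write (0600) is set. Never reads the token itself.
token_stat_ok() {
  local uid rest mode
  uid=${1%%:*}
  rest=${1#*:}
  mode=${rest#*:}
  [ "$uid" = "$2" ] || return 1
  case $mode in ''|*[!0-7]*) return 1 ;; esac
  [ $(( 0$mode & ~0600 & 07777 )) -eq 0 ]
}

# freshness_ok "<days>|<latest-date>" [max-days, default 2]
freshness_ok() {
  local days max
  days=${1%%|*}
  max=${2:-2}
  case $days in ''|*[!0-9]*) return 1 ;; esac
  [ "$days" -le "$max" ]
}
```

With the values recorded in this section, `token_stat_ok "100000:100000:600" 100000` and `freshness_ok "1|2026-06-24"` both succeed, while a freshness value of 7 days fails the gate.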

14.29 2026-06-24 188 MinIO / Velero, DB exporter, and 110 disk pressure recovery

The fourth 2026-06-24 change restored the actual backup and monitoring chain instead of silencing alerts. VeleroBackupNotRun, PostgreSQLDown, RedisDown, and 110 disk pressure were all valid red signals; the repair order must be source-of-truth / service / exporter / Prometheus / Alertmanager / cold-start scorecard.

Item	2026-06-24 06:35 recovery baseline
SOP version	`v1.30`
188 DB exporter root cause	Under the Docker user namespace the exporter compose cannot use `network_mode: host`; the Redis live port is `6380`
188 DB exporter source-of-truth	`ops/monitoring/docker-compose.exporters.yaml` changed to bridge port mapping; the PostgreSQL DSN is injected only from the host `.env.exporters`, and the repo holds no password
188 DB exporter helper	`scripts/ops/188-db-exporters-restore.sh`；live path `/home/ollama/bin/188-db-exporters-restore.sh`
188 DB exporter readback	local metrics `pg_up=1`、`redis_up=1`；Prometheus `up{job="postgres-exporter"}=1`、`pg_up=1`、`up{job="redis-exporter"}=1`、`redis_up=1`
110 disk pressure	`/` from `92%` used to `73%` used after Docker image / build cache cleanup only; no Docker volume prune
MinIO / Velero root cause	188 MinIO endpoint `192.168.0.188:9000` was down; Velero BSL S3 list failed; MinIO data path had userns write denial
MinIO restore	live `/home/ollama/minio/docker-compose.override.yml` adds `userns_mode: host` for the `minio` service; MinIO health endpoint is OK
Velero restore	120 `BackupStorageLocation/default` phase is `Available`; one-off backup `reboot-recovery-202606240456` is `Completed`
Backup-health textfile	110 exporter refresh reports `awoooi_velero_monitor_up=1`, `awoooi_velero_latest_completed_backup_fresh=1`, restore-test cron present, failed jobs `0`
Alert readback	`VeleroBackupNotRun`、`PostgreSQLDown`、`RedisDown`、110 disk-pressure alerts resolved
Live cold-start readback	`PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED`; only blocker remains MOMO data freshness
Declaration limit	May declare backup / exporter / MinIO / Velero chain recovered; must not declare full-stack green, MOMO data current, DR complete, or runtime/security acceptance

188 PostgreSQL / Redis exporter post-reboot recovery:

ssh ollama@192.168.0.188 'bash /home/ollama/bin/188-db-exporters-restore.sh'
curl -fsS http://192.168.0.188:9187/metrics | grep '^pg_up '
curl -fsS http://192.168.0.188:9121/metrics | grep '^redis_up '

188 MinIO / 120 Velero recovery from 110:

ssh wooo@192.168.0.110 '/home/wooo/scripts/188-minio-velero-restore.sh'

If a one-off reboot-recovery backup must be created and the 110 backup-health textfile refreshed during a maintenance window, set both variables explicitly:

ssh wooo@192.168.0.110 'CREATE_VELERO_BACKUP=true REFRESH_BACKUP_HEALTH=true /home/wooo/scripts/188-minio-velero-restore.sh'

The live scripts can be synced from the local repo helpers:

scp -q scripts/ops/188-db-exporters-restore.sh ollama@192.168.0.188:/home/ollama/bin/188-db-exporters-restore.sh
scp -q scripts/ops/188-minio-velero-restore.sh wooo@192.168.0.110:/home/wooo/scripts/188-minio-velero-restore.sh

110 disk pressure cleanup rule:

Allowed in incident recovery: Docker image / build cache cleanup after checking `docker system df`.
Forbidden without explicit owner approval: `docker volume prune`, deleting database / registry / MinIO / ClickHouse / Sentry / PostgreSQL volumes, or removing unknown bind-mounted state.
Done gate: filesystem use below 85%, no active disk-pressure alerts, and no service regression in cold-start scorecard.
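
The done gate can be sketched as two small helpers. Both names are hypothetical. The first assumes POSIX `df -P` output on stdin; the second takes the used percentage, the number of active disk-pressure alerts, and a yes/no flag for service regression, all of which the operator supplies from their own readbacks.

```shell
# Prints the used percentage (without "%") from `df -P <mount>` output on stdin.
df_used_percent() {
  awk 'NR > 1 { gsub(/%/, "", $5); used = $5 } END { print used }'
}

# disk_done_gate <used-percent> <active-disk-alerts> <service-regression yes|no>
disk_done_gate() {
  local used="$1" alerts="$2" regression="$3"
  if [ "$used" -lt 85 ] && [ "$alerts" -eq 0 ] && [ "$regression" = no ]; then
    echo DISK_PRESSURE_DONE
  else
    echo DISK_PRESSURE_OPEN
  fi
}
```

A possible use on 110 would be `disk_done_gate "$(df -P / | df_used_percent)" 0 no`, with the alert count and regression flag taken from Alertmanager and the cold-start scorecard.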

14.30 2026-06-24 notification noise closure after reboot recovery

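The fifth 2026-06-24 change brought "service recovered, but an old monitoring path or healthy heartbeat keeps flooding Telegram" into the reboot SOP. This is not silencing: failures, warnings, data freshness, and backup / exporter / escrow red signals must still alert. The goal is to stop the same known failure from being pushed every 5 or 30 minutes and to keep healthy heartbeats from filling the war room channel.

Item	2026-06-24 notification baseline
SOP version	`v1.31`
AWOOOI healthy heartbeat	Production `a84a5a0b`: when healthy with no warnings, only Redis/log is updated and no Telegram message is sent; a change in warnings is sent, and a return from warning to healthy sends exactly one recovery message
MOMO false-noise root cause	The old 110 `/home/wooo/scripts/docker_health_monitor.sh` probed `http://192.168.0.188/health`, kept getting `HTTP 502` during the reboot, and produced a MOMO Pro alert every 5 minutes
MOMO monitor source-of-truth	Adds `scripts/ops/momo-pro-health-monitor.sh`; the primary truth is `https://mo.wooo.work/health`, while 188 local `127.0.0.1:5003/health` and container state serve only as secondary evidence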
MOMO live readback	`/home/wooo/scripts/docker_health_monitor.sh` hash `d7a6bc75549efa10176c42e6f9082c90b9856dbcbb335aba4a4fa4abb754eaef`; manual run returned `OK: public health 200; no alert`
AWOOOI ops notify wrapper	`/home/wooo/awoooi-ops/notify-awoooi-ops.sh` hash `12bf9ae124c56bb7f31be15ebeb501671b0686d695492bc3fa1d9abb5b683b67`; repo MOMO monitor uses this wrapper instead of adding a new Telegram Bot API direct send
Docker monitor fallback	`scripts/ops/docker-health-monitor.sh` keeps `ACTION_COOLDOWN_SECONDS=300` for repair cadence but adds `NOTIFY_COOLDOWN_SECONDS=1800` for direct Telegram fallback when AWOOOI API cannot receive the event
Docker monitor live readback	`/home/wooo/awoooi-ops/docker-health-monitor.sh` hash `41d64f29048868c8e4c089132d299c8ee0e2b50ab2c513158d6d45cf92ea38e3` and exposes `TELEGRAM_COOLDOWN` lines for repeated fallback suppression
Bitan public-content check	Live `/home/wooo/apps/bitan-pharmacy-release/scripts/run-public-content-cleanliness-check.sh` now writes `public-content-cleanliness.notify.state`, suppresses same failure fingerprint for `21600s`, and sends one recovery notice after a failed state becomes pass
Bitan live readback	Script hash `294ec7f75448c86688b8afc408c785efe4cf173d468ad0d82228ba638d3de2dc`; manual no-notify run returned PASS for DB, public APIs, products/news pages, and content contract
Declaration limit	May declare that repeated healthy / same-failure notification noise is controlled for these paths; must not declare that all product alerts have migrated to the unified notification gateway or that any real failure alert is disabled

Post-reboot notification gate:

ssh wooo@192.168.0.110 '/home/wooo/scripts/docker_health_monitor.sh'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/logs/docker_health.log'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/awoooi-ops/monitor.log'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/apps/bitan-pharmacy-release/logs/public-content-cleanliness-check.cron.log'

Done gate:

MOMO monitor: public health 200 -> no Telegram.
AWOOOI heartbeat: healthy + no warnings -> suppressed; warning/recovery still send.
Generic docker-health monitor: API 200/202 path is primary; direct Telegram fallback is fingerprint-cooled.
Bitan public content: pass -> no failure Telegram; repeated same failure -> cooled; recovery after prior failure -> one notice.
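
The "fingerprint-cooled" behaviour in the done gate can be sketched as below. This is not the code of the live monitors; the function name, the one-line state file format, and the result labels are hypothetical. It covers only repeated-failure suppression, not the one-time recovery notice.

```shell
# should_notify <state-file> <failure-fingerprint> <now-epoch> <cooldown-seconds>
# Prints SEND or SUPPRESS; records fingerprint and time whenever it prints SEND.
should_notify() {
  local state="$1" fp="$2" now="$3" cooldown="$4"
  local last_fp="" last_ts=0
  if [ -s "$state" ]; then
    read -r last_fp last_ts <"$state"
  fi
  if [ "$fp" = "$last_fp" ] && [ $(( now - ${last_ts:-0} )) -lt "$cooldown" ]; then
    echo SUPPRESS
  else
    # New fingerprint, or the cooldown has expired: send and remember it.
    printf '%s %s\n' "$fp" "$now" >"$state"
    echo SEND
  fi
}
```

With a cooldown of 1800 seconds, the same failure is sent once and then suppressed until 30 minutes have passed, while a different fingerprint is sent immediately.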

14.31 2026-06-24 MOMO source-file absence decision gate

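The 2026-06-24 11:35 recovery decision splits MOMO into two questions: service availability and data freshness. Service availability has recovered; data freshness is still blocked. The purpose of this gate is to stop an operator from mistaking "no new source file" for "recovery complete" when the public site returns 200, containers are healthy, and DB parity is normal.

Item	11:35 source-file absence baseline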
SOP version	`v1.32`
MOMO public health	`https://mo.wooo.work/health` returns healthy; version `V10.639`
DB rows	`daily_sales_snapshot=104614`，`realtime_sales_monthly=786621`
DB bounds	daily `2025-07-01..2026-06-17`；monthly `2024-01-01..2026-06-17`
Current-month parity	`10936
Latest successful import	`daily_sales` job `56`，created `2026-06-18 11:41`，source `即時業績_當日.xlsx`，`sync_success=true`
Pending source folder	`當日業績匯入` count `0` for pattern `即時業績_當日`
Archive latest	`2026-06-18T01:30:39Z`，already imported by job `56`
Scheduler Drive readback	container-side Drive listing works and currently returns count `0`; no current `Permission denied` evidence in latest readback
Stale alert posture	`data_stale_alert` has 24h dedupe; this is a true warning, not heartbeat spam
Blocking metric	`MOMO_DAILY_FRESHNESS 7
Repo-side v1.42 scorecard evidence	`MOMO_SOURCE_EMPTY_EVIDENCE_LINES 21`、`MOMO_IMPORT_CONFIG 當日業績匯入

2026-06-24 23:04 repo-side cold-start v1.42 dry-run returns PASS=88 WARN=0 BLOCKED=1 and classifies the only blocker as:

188 momo source file absent while daily sales data stale

This is repo-side source-of-truth enhancement only. 2026-06-24 23:15 read-only deploy parity check proves the live 110 script is still older: repo hash f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05, live hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8. Do not claim the live 110 deployed script has this v1.42 behavior until /home/wooo/scripts/full-stack-cold-start-check.sh is synced under an approved change and its hash/readback is recorded through §13.3.1.

GO / NO-GO:

GO: declare MOMO web/API/container/database service available.
GO: declare current-month table parity good.
NO-GO: declare MOMO business data current.
NO-GO: declare FULL_STACK_GREEN while MOMO_DAILY_FRESHNESS > 2.
NO-GO: re-import old archived files to fake freshness.
NO-GO: import product exports or manually constructed spreadsheets as daily sales source.
NO-GO: truncate tables, restore whole DB, or move Drive files when sync_success is false.

The only evidence that qualifies to clear the blocker:

1. New legitimate 即時業績_當日 source file appears in the expected Drive intake path, or owner supplies a verifiable source-evidence reference.
2. Import job completes with success=true and sync_success=true.
3. Drive file movement / archive evidence shows the source was handled once.
4. daily_sales_snapshot and realtime_sales_monthly counts and date bounds match for the imported range.
5. MOMO_DAILY_FRESHNESS <= 2.
6. backup / offsite / cold-start scorecard rerun after import remains green except known DR escrow blocker.

If the source file is absent, the correct report is:

MOMO service is recovered, data pipeline is waiting for upstream source file.
No safe import candidate exists.
Full-stack remains blocked by data freshness, not by service outage.

14.32 2026-06-24 188 nginx-exporter / CD monitoring coverage gate

The sixth 2026-06-24 change brought CD post-deploy monitoring coverage failures into the reboot SOP. The runtime deploy of 2ec7f6f4 had already written back 622bc372 and production API health was healthy, but the post-deploy checks of CD #3294 left a Failure because the nginx-exporter target was down. The root cause was that the 188 nginx-exporter container was not running; it was not a failure of the Nginx public gateway, the API/Web rollout, or the MOMO service.

Item	20:10 monitoring coverage baseline
SOP version	`v1.34`
Affected CD run	The historical result of Gitea CD `#3294` remains Failure; deploy marker `622bc372` has been written
Root cause	Prometheus job `nginx-exporter` down; target `192.168.0.188:9113` connection refused
Non-root cause	Nginx `stub_status` is normal; no Nginx reload, no API/Web/MOMO restart, and no firewall change is needed
Live restore source	`/home/ollama/nginx-exporter.yml`
Repo helper	`scripts/ops/188-nginx-exporter-restore.sh`
Check mode	`--check` only reads stub_status, compose config, container state, and metrics
Apply mode	`--apply` runs `docker compose -f /home/ollama/nginx-exporter.yml up -d` after stub_status and compose config pass
Exporter metrics	`nginx_up 1`、`nginx_connections_active`、`nginx_http_requests_total`
Monitoring coverage	`Jobs 總數=14`、`全部 UP=14`、`真實問題=0`、`預期覆蓋率=100.0%`
Declaration limit	May declare exporter / monitoring coverage recovered; must not relabel the historical CD run as success, and must not declare full-stack green / DR complete

Post-reboot / post-CD 188 nginx-exporter check:

bash scripts/ops/188-nginx-exporter-restore.sh --check
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0

If --check fails only at the metrics stage while stub_status and compose config both pass, and the maintenance window allows restoring a stateless exporter:

bash scripts/ops/188-nginx-exporter-restore.sh --apply
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0

Do not handle this symptom in any of the following ways:

NO-GO: reload Nginx before stub_status / exporter metrics prove Nginx config is the issue.
NO-GO: restart product containers because monitoring coverage alone is red.
NO-GO: silence monitoring coverage or mark CD green without target recovery evidence.
NO-GO: prune Docker volumes or delete exporter state not owned by this SOP.
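
The check-then-apply rule for this exporter can be sketched as a decision helper. It assumes the operator has already read the four facts from the `--check` output and the maintenance plan; the function name and result labels are hypothetical and do not appear in `188-nginx-exporter-restore.sh`.

```shell
# nginx_exporter_next_step <stub_status_ok> <compose_config_ok> <metrics_ok> <window_allows_apply>
# Every argument is "yes" or "no".
nginx_exporter_next_step() {
  local stub="$1" compose="$2" metrics="$3" window="$4"
  if [ "$metrics" = yes ]; then
    echo NO_ACTION_EXPORTER_HEALTHY
  elif [ "$stub" != yes ] || [ "$compose" != yes ]; then
    # A failing precondition means --apply is not the right tool yet.
    echo STOP_INVESTIGATE_BEFORE_APPLY
  elif [ "$window" = yes ]; then
    echo RUN_RESTORE_HELPER_APPLY
  else
    echo WAIT_FOR_MAINTENANCE_WINDOW
  fi
}
```

No branch reloads Nginx, restarts product containers, or silences monitoring coverage, which matches the NO-GO list above.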

14.33 2026-06-24 MOMO V10.646 / source-file absence / dual-workstation baseline

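The seventh 2026-06-24 change split MOMO "program version is current" and "business data is stale" into two independent gates, and pinned the MOMO Codex workspaces on the Mac Mini and MacBook Pro to the latest Gitea main baseline. This prevents two post-reboot misjudgments: declaring the data updated because /health shows the latest version, or assuming the service is still on an old version because the data is stale.

Item	20:42 MOMO / workstation baseline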
SOP version	`v1.35`
MOMO public health	`https://mo.wooo.work/health` returns healthy, version `V10.646`
Gitea main truth	`wooo/ewoooc` `main=7cfca9375445ea03d6f5d10512d0276a20914d25`, `SYSTEM_VERSION = "V10.646"`
Mac Mini workspace	`/Users/ogt/codex-workspaces/momo-pro-dev`, branch `codex/momo-current-main-dev-base-20260624`, commit `7cfca9375445ea03d6f5d10512d0276a20914d25`, dirty `0`
MacBook workspace	`/Users/ooo/codex-workspaces/momo-pro-dev`, branch `codex/momo-current-main-dev-base-20260624`, commit `7cfca9375445ea03d6f5d10512d0276a20914d25`, dirty `0`
Remote baseline branch	`wooo/ewoooc` `codex/momo-current-main-dev-base-20260624` points to `7cfca9375445ea03d6f5d10512d0276a20914d25`
DB parity	current-month `daily_sales_snapshot` and `realtime_sales_monthly` match at `10936` rows, range `2026-06-01..2026-06-17`
Data freshness	`MOMO_DAILY_FRESHNESS 7
Source candidates inspected	Mac Mini current daily file contains only `2025-07-01..2025-07-02`; iCloud full-month file contains only `2025-06-01..2025-06-30`; MacBook candidates are header-only or the same `2025-07-01..2025-07-02` file
Declaration limit	May declare MOMO release current and Codex dual-workstation baseline ready; must not declare MOMO data current or full-stack green

The MOMO post-reboot decision must answer four questions together:

MOMO_RELEASE_CURRENT = yes/no
MOMO_DB_PARITY = yes/no
MOMO_DATA_FRESH = yes/no
MOMO_SOURCE_AVAILABLE = yes/no
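
A minimal sketch of how the four answers combine into one report, assuming each answer has already been established from its own evidence. The function name and the result labels are hypothetical; only the four input questions come from this SOP.

```shell
# momo_declaration <release_current> <db_parity> <data_fresh> <source_available>
# Every argument is "yes" or "no".
momo_declaration() {
  local release="$1" parity="$2" fresh="$3" src="$4"
  if [ "$fresh" != yes ]; then
    # Freshness is judged first: a current release or good parity cannot replace it.
    if [ "$src" = yes ]; then
      echo DATA_STALE_LEGITIMATE_SOURCE_PENDING_IMPORT
    else
      echo DATA_STALE_WAITING_FOR_UPSTREAM_SOURCE
    fi
  elif [ "$parity" != yes ]; then
    echo DATA_FRESH_BUT_PARITY_MISMATCH
  elif [ "$release" != yes ]; then
    echo DATA_FRESH_RELEASE_NOT_CURRENT
  else
    echo MOMO_GATES_GREEN
  fi
}
```

The 20:42 baseline in this section (release current, parity good, data stale, no legitimate source) maps to `momo_declaration yes yes no no`, the waiting-for-upstream case.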

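The only safe path to clear the MOMO data freshness blocker:

1. A new legitimate 即時業績_當日 source file appears in the expected Drive intake, or the owner provides a verifiable source-evidence reference.
2. The import job succeeds; it must not be marked completed if the realtime_sales_monthly sync fails.
3. Source file movement / archive evidence proves the file was processed only once.
4. daily_sales_snapshot and realtime_sales_monthly agree on row count / date bounds.
5. MOMO_DAILY_FRESHNESS <= 2.

None of the following clears the blocker:

NO-GO: re-importing an old archive, an old iCloud monthly file, a header-only file, or a test file.
NO-GO: treating V10.646 health as proof that the data date has reached today.
NO-GO: treating current-month parity as data freshness.
NO-GO: truncating or restoring the whole database to manufacture freshness.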

14.34 2026-06-24 MOMO import sync failure boundary hardening

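The eighth 2026-06-24 change (21:57) brought the "partial success" risk of MOMO auto-import into the reboot SOP. The formal release readback was added at 2026-06-24 22:17: the same fix has been fast-forwarded to MOMO main, Gitea Actions cd.yaml #904 succeeded, and the 188 live source marker is confirmed. A successful daily_sales_snapshot write does not mean the whole import succeeded; when the realtime_sales_monthly sync fails, the job must fail, the source file must be kept, and the Google Drive file must not be moved to the archive.

Item	22:17 MOMO import-boundary production baseline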
SOP version	`v1.40`
Production health	`https://mo.wooo.work/health` healthy, version `V10.653`
Live DB read-only	`daily_sales_snapshot=104614 rows, 2025/07/01..2026/06/17`; `realtime_sales_monthly=786621 rows, 2024/01/01..2026/06/17`
Scheduler read-only	Over the last 12 hours `當日業績匯入` / `即時業績_當日` both showed `0` Excel files; the scheduler sends no success notification
Latest successful import	job `56 completed`, `10936` rows, `2026-06-18 11:41..11:42`
Code / deploy	MOMO `main` and `codex/momo-current-main-dev-base-20260624` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73`; Gitea Actions `cd.yaml #904` Success
Live source marker	188 `/home/ollama/momo-pro/services/import_service.py` contains `_table_columns`, `業績分析儀表板同步失敗`, and `保留來源檔案等待重試，不移動 Google Drive 檔案`
Regression	`pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q` => `10 passed`
Production deploy state	Production patched for code boundary; data freshness still blocked until a legitimate newer source file imports successfully

MOMO import success criteria:

GO: process_daily_sales_import returns True only if daily_sales_snapshot write and realtime_sales_monthly sync / verification both pass.
GO: auto_import_from_drive may move the Drive source file only after process_daily_sales_import returns True.
NO-GO: mark import_jobs.status=completed when sync_success=false.
NO-GO: move or archive the Drive source file when realtime_sales_monthly sync failed.
NO-GO: send a generic success notification for file_count > 0 before verify_import_data_sync passes.

If MOMO data freshness is blocked after a reboot, first separate three layers and do not mix them:

1. Service availability: /health, container, DB connection.
2. Source availability: Drive pending folder has a legitimate new 即時業績_當日 source file.
3. Data correctness: import job completed with sync_success=true, and daily_sales_snapshot / realtime_sales_monthly match the imported date range.

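14.35 2026-06-25 MOMO preflight and 110 CPU orphan Chrome triage

The ninth change (2026-06-25 11:01) turned two common misjudgments into a rerunnable SOP:

MOMO service health green does not mean the data is fresh.
High load on 110 does not justify restarting Docker or cancelling CI.

MOMO dedicated preflight: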

scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh

This script only performs read-only SSH / Docker metadata / logs / DB queries; it does not read token content, import, move Drive files, or restart anything. 14:16 live result:

MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=18 WARN=3 BLOCKED=0 HOST=ollama@192.168.0.188 FRESHNESS_MAX_DAYS=2
MOMO_PUBLIC_HEALTH_CODE 200
MOMO_HEALTH_CODE 200
MOMO_HEALTH_VERSION V10.674
MOMO_APP_HEALTH healthy
SCHEDULER_RUNNING true
SCHEDULER_HEALTH healthy
SCHEDULER_RESTART_COUNT 0
TELEGRAM_BOT_HEALTH healthy
MOMO_CONTAINER_REPLACE_EVENTS_45M 11
TOKEN_STAT 100000:100000:600
CONTAINER_TOKEN_STAT 0:0:600
LOCAL_EXACT_DAILY_SOURCE_COUNT 0
LOCAL_EXACT_DAILY_SOURCE_LATEST none
DB_DAILY 109061|2025-07-01|2026-06-24
DB_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24
DB_DAILY_FRESHNESS 1|2026-06-24
DB_LATEST_DAILY_IMPORT_JOB 57|completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0

110 CPU triage:

Evidence	Decision
`ps` shows `stockplatform-review-bulk-ux` Chrome groups with root process PPID `1`, no parent node smoke, and sustained high CPU	Treat as orphan browser smoke. Run dry-run if available, then only with owner approval use targeted `SIGTERM` by process group.
Active Gitea Actions container is consuming CPU, e.g. `GITEA-ACTIONS-TASK-*`, `next build`, `uv pip install`, `docker-buildx`	Treat as legitimate CI/CD load. Do not kill unless there is explicit release owner approval to cancel the run.
`vmstat` shows high iowait or active swap in/out	Treat as storage / memory pressure, not browser runaway. Do not kill random processes; capture disk / memory evidence first.
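
The first two rows of the table can be sketched as a classifier over a process's parent PID and command line. This is an illustration only: the function name and result labels are hypothetical, sustained CPU and iowait / swap evidence still have to be confirmed separately, and the sketch never sends a signal.

```shell
# classify_cpu_consumer <ppid> "<command line>"
classify_cpu_consumer() {
  local ppid="$1" args="$2"
  case $args in
    *GITEA-ACTIONS-TASK-*|*"next build"*|*"uv pip install"*|*docker-buildx*)
      echo LEGITIMATE_CI_LOAD_DO_NOT_KILL ;;
    *stockplatform-review-bulk-ux*)
      if [ "$ppid" = 1 ]; then
        # Reparented to PID 1: candidate orphan, still needs owner approval.
        echo ORPHAN_BROWSER_SMOKE_OWNER_APPROVAL_REQUIRED
      else
        echo BROWSER_SMOKE_WITH_PARENT_OBSERVE
      fi ;;
    *)
      echo UNCLASSIFIED_CAPTURE_EVIDENCE_FIRST ;;
  esac
}
```

Checking the CI patterns first is a deliberate choice in this sketch: a command line that matches both is treated as legitimate load, which is the safer default.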

2026-06-25 10:58 user-approved action:

Targeted command type: process SIGTERM only.
Targeted process groups: 438005, 471295, 640155, 670628.
Scope: orphan `stockplatform-review-bulk-ux` Chrome groups on 110.
Post-check: `OLD_GROUPS_REMAINING` empty.
Not performed: Docker restart, systemd restart, Nginx reload, firewall/iptables change, K8s action, CI cancellation, Wazuh/SOC change, secret read.
Remaining load: active Gitea Actions / CI build work; observe queue and timeout instead of killing.

14.22 重啟後時間軸驗證

每次重啟後照時間軸推進，不要等到最後才一次判定。

Checkpoint	Goal	Required evidence	May declare
`T+0`	power / VM / console startup has begun	console / hypervisor / UPS / operator note	maintenance started
`T+5m`	LAN / SSH restored	ping, ARP, SSH port, `who -b`	`HOST_BOOTED`
`T+15m`	host base services restored	`systemctl is-system-running`, failed units, Docker / PostgreSQL / Redis / K3s role checks	`HOST_READY`
`T+30m`	core services restored	188 DB, 110 Harbor/Gitea/Prom/AM, K3s nodes, AWOOOI API/Web, public routes	`SERVICE_READY` for scoped hosts
`T+45m`	schedules and data consistency	backup status, offsite verifier, momo DB parity, CronJobs, alert visibility	service recovery confidence
`T+60m`	release high-load work and automation	cold-start scorecard, load/core, runner guardrails, AI observe-only gate	release runner/CD only if gates allow

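If any checkpoint stalls, record which gate it stalled at and do not skip ahead to the next layer. If two consecutive reboots stall at the same gate, write it back to §16 Known Drift or the workplan.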
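
The "do not skip a gate" rule and the two-reboot write-back rule can be sketched as below. This is a minimal illustration under assumptions: the checkpoint names and claim labels come from the timeline table, while the function names and the `passed` / `stuck_history` data shapes are hypothetical.

```python
# Ordered checkpoints and the claim each one unlocks, taken from the timeline table.
GATES = [
    ("T+0", "maintenance started"),
    ("T+5m", "HOST_BOOTED"),
    ("T+15m", "HOST_READY"),
    ("T+30m", "SERVICE_READY"),
    ("T+45m", "service recovery confidence"),
    ("T+60m", "release runner/CD only if gates allow"),
]


def furthest_claim(passed):
    """Walk the gates in order and stop at the first one that has not passed.

    Returns (highest claim allowed, checkpoint where progress stalled or None).
    A later gate that happens to be green is ignored while an earlier one is not.
    """
    claim = None
    for checkpoint, label in GATES:
        if not passed.get(checkpoint, False):
            return claim, checkpoint
        claim = label
    return claim, None


def needs_drift_writeback(stuck_history):
    """True when the two most recent reboots stalled at the same gate."""
    if len(stuck_history) < 2:
        return False
    last, previous = stuck_history[-1], stuck_history[-2]
    return last is not None and last == previous
```

For example, `furthest_claim({"T+0": True, "T+5m": True, "T+15m": False, "T+30m": True})` yields `HOST_BOOTED` as the claim and `T+15m` as the stalled gate, even though the `T+30m` evidence looks green.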

15. Done Criteria

All must be true:

Four hosts reachable by SSH.
188 PostgreSQL and Redis healthy.
110 Harbor, Gitea, Prometheus, Alertmanager healthy.
120/121 K3s nodes Ready.
VIP 192.168.0.125 present.
AWOOOI API and Web reachable through NodePort/VIP.
Alertmanager E2E webhook succeeds.
cron/CronJob schedules are active, unsuspended, and verified.
MOMO release version matches Gitea source-of-truth for the intended deployment branch.
momo daily_sales_snapshot and realtime_sales_monthly row counts match within the latest import date range.
momo business data freshness is within the declared SLO, and the latest import source evidence is legitimate; DB parity alone is not enough.
Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
High-load batch services are capped or delayed.
Runners are guarded and released last.
AI auto-remediation is not in full execution mode until all gates are green.
110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
110 runaway process textfile monitor is fresh, and Prometheus has HostOrphanBrowserSmokeHighCpu plus CI load classification rules loaded.
110 global /home/wooo/.ssh/known_hosts still contains verified 120 / 188 entries after any CD run; deploy jobs use /home/wooo/.ssh/deploy_known_hosts only.

15.1 Declarable states

Declarable statement	Required conditions
`110 host recovered`	110 is `HOST_READY`, failed units are `0` or all explained, and core ports and cron / backup status have been checked
`public core services recovered`	public routes/TLS return 2xx/3xx, and AWOOOI API health plus Harbor/Gitea/Stock/Sentry/SignOz/Langfuse/Bitan smoke checks are OK
`backup/offsite current`	`backup-status --no-notify` reports nothing stale, the offsite verifier reports `VERIFY_OK=1`, and every failed component has a named owner
`service recovery with known blocker`	cold-start `BLOCKED` contains only known blockers, e.g. 120; alerts remain visible
`full-stack green`	everything in §15 holds, and cold-start reports `WARN=0 BLOCKED=0`
`DR complete`	full-stack green and the credential escrow missing count is 0
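
The declaration table can be read as a fail-closed lookup: a statement is allowed only when every one of its conditions is positively recorded. The sketch below illustrates that reading; the `facts` keys and the function name are hypothetical and do not reflect the schema used by `post-reboot-declaration-guard.py`.

```python
def allowed_declarations(facts):
    """Return the statements the recorded facts support; missing facts fail closed."""
    allowed = []

    if (facts.get("host_110_ready")
            and facts.get("unexplained_failed_units", 1) == 0
            and facts.get("ports_cron_backup_checked")):
        allowed.append("110 host recovered")

    if facts.get("public_routes_ok") and facts.get("smoke_ok"):
        allowed.append("public core services recovered")

    if (facts.get("backup_stale_count", 1) == 0
            and facts.get("offsite_verify_ok") == 1
            and facts.get("failed_components_without_owner", 1) == 0):
        allowed.append("backup/offsite current")

    # Only applies when something is blocked and every blocker is already known.
    blockers = set(facts.get("blockers", []))
    known = set(facts.get("known_blockers", []))
    if blockers and blockers <= known and facts.get("alerts_visible"):
        allowed.append("service recovery with known blocker")

    full_green = bool(facts.get("done_criteria_all_true")
                      and facts.get("warn", 1) == 0
                      and facts.get("blocked", 1) == 0)
    if full_green:
        allowed.append("full-stack green")
        # DR complete is strictly stronger than full-stack green.
        if facts.get("escrow_missing_count", 1) == 0:
            allowed.append("DR complete")
    return allowed
```

Defaulting every missing count to `1` means an unmeasured value can never produce a green declaration, which matches the intent of not reporting service green as DR complete.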

16. Known Drift To Fix After Recovery

Clean these items up after the incident; do not make sweeping changes in the middle of P0 recovery.

SERVICE-ENDPOINTS.md still has old Prometheus/Alertmanager locations.
Audit older docs for direct node webhook targets; current main path should be VIP 192.168.0.125:32334.
OpenClaw 8088 vs 8089 must be live-confirmed and normalized.
188 compose paths drift between /home/ollama/* and Ansible /opt/*.
110 runner docs still mention Docker runner in places; live startup prefers host gitea-act-runner-host.service.
scripts/setup-runner-watchdog.sh conflicts with the 2026-05-05 runner watchdog disablement guardrail.
grist.wooo.work / registry.wooo.work public HTTP/HTTPS currently route to aiops.wooo.work; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.
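stockplatform-shared-ui-monitor.timer / service source-of-truth still needs to be cleaned up or rebuilt; on 2026-06-12 only the stale timer was disabled to clear the host degraded state.
111 local Ollama fallback is currently unreachable; production providers are served by GCP-A / GCP-B, but restoring 111 should be tracked separately as AI provider resilience work.
Content added in SOP v1.5 was written in Traditional Chinese, while older sections still contain English passages; later runbook hygiene should translate them in batches, and large-scale reformatting must not be mixed into an incident P0.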
