AWOOOI Full-Stack Cold-Start and Host Reboot SOP
Version: v1.78 Last updated: 2026-06-27 Asia/Taipei Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
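0. Latest Live Baseline and Release Verdict
This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date below is not today, rerun the live check first, then update this section and docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md.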
If you only need a quick post-reboot verdict on whether recovery can be declared, run the machine-readable summary first: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color. The script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs plus a replayable summary.txt under /tmp/awoooi-post-reboot-readiness-*. From v1.75, later steps in the same acceptance round must consume the same $ARTIFACT_DIR/summary.txt, for example scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file "$ARTIFACT_DIR/summary.txt" --no-color and scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file "$ARTIFACT_DIR/summary.txt" --no-color; do not rerun live probes repeatedly within one evidence chain and then mix conclusions from different points in time.
From v1.76, if the delegated cold-start reports a single BLOCKED AWOOOI API not reachable at the moment of a K3s rollout / CD replacement, but the wrapper's own retry of the public https://awoooi.wooo.work/api/v1/health route has already returned 2xx, that blocker is listed only as a route/API warmup evidence warning. If the public API still fails, any other non-route blocker exists, or the route has not recovered after retry, the result remains hard blocked.
The declaration guard converts the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows SERVICE_GREEN=1 but NEXT_REQUIRED_GATES is still non-empty, the dispatch checklist turns each unfinished blocker into an owner / evidence / forbidden-action checklist. When machine-readable intake is needed, run scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --dispatch-file <dispatch.txt> --output /tmp/awoooi-post-reboot-owner-packets.json to produce the awoooi_post_reboot_next_gate_owner_packets_v1 JSON, then immediately run scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json. Dispatch, packet, and guard all pin DISPATCH_AUTHORIZED=0, REQUEST_SENT_COUNT=0, OWNER_RESPONSE_ACCEPTED=0, HOST_WRITE_AUTHORIZED=0, SECRET_VALUE_COLLECTION_ALLOWED=0, and RUNTIME_GATE=0; if the guard does not pass, do not send an owner request, write an escrow marker, enter a maintenance window, or declare DR / Wazuh registry complete.
From v1.74, any owner response JSON must also pass scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file <owner-packets.json> --response-file <file>: an empty template, placeholder, secret payload, runtime action request, credential marker write, Wazuh active response / re-enroll / restart, Kali active scan, or missing Dashboard API / manager registry evidence must all fail closed. Passing preflight only means the response may enter independent reviewer acceptance; it is not runtime authorization. When a manual expansion is needed, run scripts/reboot-recovery/post-start-quick-check.sh --no-color and use docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.
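The declaration rule described above can be sketched as follows. This is a minimal illustration, not the actual post-reboot-declaration-guard.py: it assumes summary.txt holds one KEY=VALUE pair per line, and the function names `parse_summary` and `forbidden_declarations` are hypothetical. The key names and declaration labels are the ones quoted in this section.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; skip blank lines and lines without '='."""
    result: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def forbidden_declarations(summary: dict[str, str]) -> list[str]:
    """Return declarations that must not be made; missing keys fail closed."""
    forbidden: list[str] = []
    if summary.get("SERVICE_GREEN") != "1":
        forbidden.append("SERVICE_GREEN")
    if summary.get("ESCROW_MISSING_COUNT") != "0":
        forbidden.append("DR_COMPLETE")
    if summary.get("HOST_188_HYGIENE_BLOCKED") != "0":
        forbidden.append("HOST_188_FULLY_GREEN")
    try:
        accepted = int(summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED", "0"))
    except ValueError:
        accepted = 0  # unreadable value is treated as not accepted
    if accepted <= 0:
        forbidden.append("WAZUH_REGISTRY_RECOVERED")
    if summary.get("RUNTIME_ACTION_AUTHORIZED") != "1":
        forbidden.append("RUNTIME_ACTION_AUTHORIZED")
    return forbidden
```

The sketch fails closed on purpose: an absent or malformed key forbids the related declaration instead of silently allowing it, which mirrors the no-false-green intent of the guard.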
v1.76 owner gate replay rule: once a round's summary has been produced, the owner packet and owner response preflight must prefer --summary-file "$ARTIFACT_DIR/summary.txt", for example scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --output /tmp/awoooi-post-reboot-owner-packets.json and scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --response-file <file>. Omitting --summary-file is allowed only when deliberately collecting fresh live evidence; otherwise the preflight must not rerun the summary itself and cause state drift within the same round.
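One way to keep a round pinned to a single summary is to build every follow-up command from the same artifact directory. The sketch below only assembles argument lists and does not execute anything; the helper name `replay_commands` is hypothetical, while the script paths and flags are the ones quoted in this section.

```python
SCRIPTS = "scripts/reboot-recovery"
PACKET_OUTPUT = "/tmp/awoooi-post-reboot-owner-packets.json"


def replay_commands(artifact_dir: str, response_file: str) -> list[list[str]]:
    """Build the follow-up commands so they all replay one summary.txt."""
    summary = f"{artifact_dir.rstrip('/')}/summary.txt"
    return [
        [f"{SCRIPTS}/post-reboot-declaration-guard.py",
         "--summary-file", summary, "--no-color"],
        [f"{SCRIPTS}/post-reboot-next-gate-dispatch.sh",
         "--summary-file", summary, "--no-color"],
        [f"{SCRIPTS}/post-reboot-next-gate-owner-packets.py",
         "--no-color", "--summary-file", summary, "--output", PACKET_OUTPUT],
        [f"{SCRIPTS}/post-reboot-owner-response-preflight.py",
         "--no-color", "--summary-file", summary,
         "--response-file", response_file],
    ]
```

Because the summary path is computed once, no step in the round can accidentally trigger a fresh live probe by omitting --summary-file.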
2026-06-27 11:51 latest live revalidation: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color artifact /tmp/awoooi-post-reboot-readiness-20260627-115046/summary.txt returned POST_START_RESULT=BLOCKED, POST_START_PASS=37, POST_START_WARN=3, POST_START_BLOCKED=2, SERVICE_GREEN=0, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. This round again repaired the configured drift of the 188 momo_pg_daily crontab; backup-status returns core_blockers=0 and configured_missing_188=0. K3s / ArgoCD live readback shows 120 / 121 both Ready, awoooi-prod Synced / Healthy, and the api/web/worker pods all Running. The current hard blocker is MOMO business data freshness: the latest daily_sales_snapshot is still 2026-06-24, DRIVE_INTAKE_COUNT=0, the Drive archive / global latest 即時業績_當日 are both 2026-06-25T04:21:47Z, and the latest import job 57 completed cleanly with sync_success=true. It is therefore permissible to declare hosts, K3s, public routes, backup core, and Stock freshness recovered; full-stack green must not be declared until the MOMO source file is supplied and the official import pipeline updates the DB. DR complete remains forbidden because ESCROW_MISSING_COUNT=5, and Wazuh all-host enrollment remains forbidden because manager registry accepted is 0.
2026-06-27 00:58 live summary: this round first repaired two real SOP blockers. First, scripts/ops/recovery-scorecard-contract-check.py now treats PyYAML as optional, so a standard Python environment can also validate the recovery recording-rule contract without ModuleNotFoundError: yaml interrupting the DR/offsite checklist. Second, the 188 ollama crontab was backed up to /home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt, and the AWOOOI momo PostgreSQL daily backup was converged from the app-side /home/ollama/momo-pro/scripts/pg_backup.sh back to the host-owned /home/ollama/bin/momo-pg-backup.sh; after refreshing the 188 textfile exporter, awoooi_backup_job_configured{host="188",job="momo_pg_daily"} reads 1. At 00:58, scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color artifact /tmp/awoooi-post-reboot-readiness-20260627-005728/summary.txt returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=3, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. In the same round backup-status returned core_blockers=0 and configured_missing_188=0; the Prometheus live contract returned awoooi_recovery_core_ready=1 and awoooi_recovery_dr_offsite_ready=0, meaning hosts / K3s / public routes / product data / backup core are recovered, DR is still blocked only by the 5 missing non-secret evidence markers for credential escrow, and Wazuh all-host registry accepted is still 0.
2026-06-27 00:02 live summary: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=4, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. In the same round the production route smoke returned: IwoooS 200, Wazuh read-only routes 200, VibeWork 200, AwoooGo 200, MOMO health 200, Stock 200. AWOOOI API health is healthy / prod / mock_mode=false; PostgreSQL / Redis / OpenClaw / SigNoz / GCP Ollama provider are up; the local Ollama endpoint is still cooldown / degraded and is covered by provider fallback, so it is not a website or API service blocker. The latest deploy marker is e506b9d5 chore(cd): deploy fe74d86 [skip ci]; this round's 89b9e67a is an SOP / scripts / docs source update, not a runtime bundle deploy marker. The 23:56 redacted readbacks of 112 Wazuh and 120 K3s still serve as adjacent evidence for this round: 120 ArgoCD is Synced / Healthy and Pods are all Running or Completed; the Wazuh manager registry is not entirely empty, but WAZUH_MANAGER_REGISTRY_ACCEPTED=0 stands, so all-host enrollment recovery cannot be declared.
2026-06-26 23:56 live summary retained for comparison: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=3, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. Read-only follow-up on 120 in the same window: ArgoCD awoooi-prod is Synced / Healthy, and awoooi-prod Pods are all Running or Completed; the historical failed Job km-vectorize-29689620 has been superseded by later successful Jobs on 2026-06-23, 2026-06-24, and 2026-06-25 and is not a current service blocker. Read-only follow-up on 112 in the same window: systemd running, Wazuh manager / indexer / dashboard active, manager API root returns 401, Dashboard unauthenticated check endpoints return 401, and the redacted manager registry readback shows local manager 1, managed agents 5, active managed 5, disconnected 0, never connected 0. This evidence proves the registry is no longer "entirely empty", but Wazuh all-host enrollment recovery still cannot be declared, because the SOP expected scope is still 6, the Dashboard API connection / version has not yet been accepted via login or owner evidence, and owner response accepted is still 0.
2026-06-26 18:46 latest live recovery truth supersedes the 12:13 reading of today's product data: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=PRODUCT_DATA_PENDING_EOD_WINDOW, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=0, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=core_margin_short_daily_missing,ai_recommendations_stale, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, WAZUH_MANAGER_REGISTRY_ACCEPTED=0. The live cold-start long check in the same round returned PASS=87 WARN=0 BLOCKED=0 and Result: GREEN, meaning the 110 / 120 / 121 / 188 hosts, K3s, public routes, AWOOOI API, MOMO, backup core, exporters, cron, and Alertmanager are recovered at the service layer. However, StockPlatform's official margin-short data for today has not been published, and AI recommendations still depend on it, so do not declare all product data current. At 18:43 an authorized SIGTERM cleared two stockplatform-review-bulk-ux orphan Chrome process groups on 110 that had been running for more than 6 hours, REMAINING=0. At 18:44-18:46 the local AWOOOI next build on the 168 Mac Mini was stopped and temp/build/cache plus Antigravity backup browser recordings were cleaned, bringing /System/Volumes/Data from about 1.0Gi / 100% back to about 8.7Gi / 96%. The 112 Kali networking.service failure has been traced to a wrong shebang #\!/bin/bash in /etc/network/if-up.d/wg-nat, which causes Exec format error; Wazuh manager / indexer / dashboard remain active. Repairing the hook requires sudo elevation on 112; no password was used or stored.
2026-06-26 12:13 latest live summary supersedes the 08:59 gate set: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=4, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_SERVICE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, HOST_188_RESULT=HOST_188_HYGIENE_GREEN, WAZUH_ROUTE_CODE=200, WAZUH_TRANSPORT_COUNT=6, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning, WAZUH_DASHBOARD_INDEX_OK=3, RUNTIME_ACTION_AUTHORIZED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. 188 host hygiene has been removed from the blockers; the only items that still cannot be declared complete are DR credential escrow and the Wazuh manager registry. The ACME HTTP-01 route and certbot timer hygiene are repaired, but do not declare the certificate formally renewed until the snap certbot timer / ACME window readback is available.
2026-06-26 13:01 owner response preflight baseline: added scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color and docs/templates/post-reboot-next-gate-owner-response.json. With no response file, it must output POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_file expected_gates=2 received=0 accepted=0 runtime_gate=0; when the template is used directly, it must output POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_content expected_gates=2 received=0 accepted=0 runtime_gate=0. This gate only validates redacted owner evidence for credential_escrow_evidence and wazuh_manager_registry_export; it does not send requests, write escrow markers, read secrets, or perform Wazuh / host / Kali runtime actions, nor does it convert a general approval message into owner accepted.
2026-06-26 17:45 single-summary replay baseline: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color now automatically writes /tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt, and the declaration guard, next-gate dispatch, owner packet, contract guard, and owner response preflight later in the same round all replay from this summary. The 17:45 summary returned SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. post-start-quick-check.sh also gained route warmup classification: if every BLOCKED from the delegated cold-start is a public route and the wrapper's own route retries have all recovered, that cold-start blocker is downgraded to an evidence warning and no longer misjudges the whole round's service recovery as blocked; a non-route blocker, or a route still failing after retry, remains hard blocked.
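The route warmup classification can be expressed as a small decision function. This is a sketch under stated assumptions, not the logic inside post-start-quick-check.sh: the `Blocker` type, its `is_public_route` flag, and the returned labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blocker:
    message: str
    is_public_route: bool  # hypothetical flag: blocker came from a public route probe


def classify_cold_start(blockers: list[Blocker], route_retry_recovered: bool) -> str:
    """Downgrade only when every blocker is a public route and retries recovered."""
    if not blockers:
        return "pass"
    if all(b.is_public_route for b in blockers) and route_retry_recovered:
        return "evidence_warning"
    return "hard_blocked"
```

A single non-route blocker keeps the whole round hard blocked even if every route recovered, which is the asymmetric behavior the baseline describes.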
2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence: at that time HOST_188_HYGIENE_BLOCKED=1 and NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export. This entry is used only to compare the state before and after the 188 repair; the current gate set must use the 12:13 baseline.
2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence: at that time the output was fixed at three P0 checklists. From 12:13, dispatch output is generated dynamically from the live summary; the current expected NEXT_GATE_COUNT=2, leaving only credential escrow and Wazuh registry.
2026-06-26 08:29 owner-packet JSON baseline: scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color converts the dispatch output into schema_version=awoooi_post_reboot_next_gate_owner_packets_v1, containing three owner_packets, next_gate_count=3, p0_gate_count=3, request_sent_count=0, owner_response_received_count=0, owner_response_accepted_count=0, runtime_action_authorized_count=0. This JSON is an AI / operator / owner review intake; it is neither an external request nor a maintenance-window approval.
2026-06-26 08:40 owner-packet contract guard baseline retained as historical pre-repair evidence: the old version pinned three P0 gates. From 12:13, the contract guard validates dynamically against source.next_required_gates, and the current expected success line is POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0; it returns to gates=3 only if 188 hygiene regresses in the future.
2026-06-26 08:47 Wazuh registry detail baseline: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color now includes the details of the Wazuh repo-side coverage / runtime gate as fixed key/value pairs: WAZUH_COVERAGE_SCOPE=6, WAZUH_DIRECT_ACTIVE=2, WAZUH_NO_TRANSPORT=1, WAZUH_SSH_BLOCKED=3, WAZUH_ROUTE_CODE=200, WAZUH_TRANSPORT_COUNT=6, WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning, WAZUH_DASHBOARD_INDEX_OK=3, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, WAZUH_RUNTIME_GATE=0. The wazuh_manager_registry_export gate of scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color places these states into CURRENT_EVIDENCE. Iron rule for interpretation: route 200, transport 6, and Dashboard index pattern 3 are not manager registry accepted; all-host enrollment and the Dashboard API repair still require owner evidence / registry export / acceptance record.
2026-06-26 08:59 declaration guard baseline retained as historical pre-repair evidence: at that time HOST_188_FULLY_GREEN was still forbidden. From 12:13, the guard dynamically allows 188 host hygiene green based on HOST_188_HYGIENE_BLOCKED=0, but still rejects DR_COMPLETE, WAZUH_REGISTRY_RECOVERED, and RUNTIME_ACTION_AUTHORIZED.
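A dynamic contract check of this kind could look like the sketch below. It is illustrative only and not the real post-reboot-owner-packet-contract-guard.py: the field names and the success line are quoted from this section, but the JSON layout (a `source` object holding a `next_required_gates` list, and `owner_packets` as a list) and the function name `check_packet` are assumptions.

```python
SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"
ZERO_COUNTERS = (
    "request_sent_count",
    "owner_response_received_count",
    "owner_response_accepted_count",
    "runtime_action_authorized_count",
)


def check_packet(packet: dict) -> str:
    """Validate an owner-packet document; raise ValueError on any violation."""
    if packet.get("schema_version") != SCHEMA:
        raise ValueError("unexpected schema_version")
    # Assumed layout: the expected gates are listed under source.next_required_gates.
    expected = packet.get("source", {}).get("next_required_gates", [])
    packets = packet.get("owner_packets", [])
    if len(packets) != len(expected) or packet.get("next_gate_count") != len(expected):
        raise ValueError("gate count does not match source.next_required_gates")
    for name in ZERO_COUNTERS:
        if packet.get(name) != 0:
            raise ValueError(f"{name} must be 0")
    return (
        f"POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates={len(expected)} "
        "request_sent=0 accepted=0 runtime_gate=0"
    )
```

Deriving the gate count from the packet's own source list is what lets the same check accept gates=2 today and gates=3 after a 188 hygiene regression, without a hard-coded constant.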
2026-06-26 07:39 live quick-check refresh: scripts/reboot-recovery/post-start-quick-check.sh --no-color ran to completion; ping / SSH for all four hosts OK, delegated cold-start PASS=89 WARN=0 BLOCKED=0, and the wrapper summary is POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0, warning split SERVICE=0 BOUNDARY=1 EVIDENCE=2, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. MOMO health is V10.701, daily snapshot 109061 rows / 2025-07-01..2026-06-24, current-month parity 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24, latest import job 57 completed. StockPlatform freshness is status=ok, latest trading date 2026-06-25, and price / chips / margin / AI recommendations are all on 2026-06-25. Backup-status at 07:39 shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite/rclone fresh, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. The extended public routes list all returned the expected 2xx/3xx. 110 CPU attribution shows load about 5.19 / 4.66 / 4.91 with CPU idle at 80%+ in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not orphan Chrome. Allowed declarations for this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green. Forbidden declarations: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.
2026-06-26 07:19 follow-up: gitea/main already contains the previous round's SOP document commit 1fd5e2a8; ArgoCD awoooi-prod reads back Synced / Healthy, revision 1fd5e2a8b0f18d24eed16aa2a44286bcbf230603, API 2/2, Web 2/2, Worker 1/1, pods restart=0. A rerun of the full cold-start is still PASS=87 WARN=0 BLOCKED=0, result GREEN. Direct public route readback: AWOOOI API 200, AWOOOI Web 307, VibeWork 200, AwoooGo 200, MOMO health 200, Stock freshness 200, Bitan 200, Gitea 200, Harbor 200, Registry /v2/ expected 401, Sentry expected 302, SigNoz 200, Langfuse 200. Precise classification of the 188 blockers: pg_lsclusters shows host PostgreSQL 14/main down; systemctl status postgresql@14-main shows invalid primary checkpoint record and PANIC: could not locate a valid checkpoint record; certbot.service shows the sentry.wooo.work renew is rate-limited; snap.certbot.renew.service shows challenge failed; awoooi-startup.service once attempted to run pg_resetwal as root and failed. This round does not run pg_resetwal, reset-failed, or restart any service; 188 must be handled in a separate maintenance window with a rollback owner and a restore/source-of-truth plan, see docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md, and scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color can be run first for a read-only preflight. 110 load has dropped to about 4.83 / 4.82 / 5.52, and the top CPU consumer is an active AWOOOI Web turbo build / Docker buildx; swap is still full but memory available is about 41Gi, so swap is not cleared manually this round. The overall declaration remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED.
2026-06-26 07:02 all-host live refresh: ping and SSH port OK for all of 110 / 120 / 121 / 188 / 112 / 111 / 168. 110 systemctl=running with 0 failed units, but load is 5.83 / 7.26 / 5.77, the top CPU consumer is the AWOOOI Web next build, and swap is still 7.8Gi / 7.8Gi; this is CI/build pressure, not an orphan Chrome or Docker incident. 120 / 121 systemctl=running, K3s active, and nodes mon / mon1 are both Ready. ArgoCD awoooi-prod was briefly OutOfSync / Progressing at 06:57 because the deploy marker 52f61da4 rollout was replacing API/Web/Worker; after 07:00 it stabilized at Synced / Healthy, API 2/2, Web 2/2, Worker 1/1, with API/Web still spread across mon / mon1. Live cold-start rerun: PASS=87 WARN=0 BLOCKED=0, result GREEN. StockPlatform /api/v1/system/freshness briefly returned 502 about 35 seconds after the container restarted, and subsequent consecutive reads all returned 200 with status=ok, latest_trading_date=2026-06-25, blockers []; this kind of rollout warmup counts as a blocker only when it fails consecutively. MOMO health is V10.699; cold-start direct evidence still shows current-month parity 15383 / 15383 through 2026-06-24 and daily freshness 1|2026-06-24. Backup status at 06:58: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite/rclone fresh, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. 188 product containers are healthy, but host systemctl=degraded is still a real host hygiene blocker: awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service are failed. 112 Wazuh manager/indexer/dashboard are active and ports 1514 / 1515 / 55000 are listening, but the production Wazuh route still reports disabled_waiting_iwooos_wazuh_owner_gate, configured=false, manager registry accepted 0, runtime gate 0. 111 / 168 are reachable, but the AWOOOI dev workspaces on both are ahead 17 with different HEADs (111=56c83257, 168=59485d51); the Mac Mini /System/Volumes/Data has only about 3.2Gi left. The current service recovery declaration remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED, with host hygiene / DR escrow / Wazuh registry / workstation capacity explicitly listed as blockers outside service green.
2026-06-26 06:50-06:55 188 host hygiene read-only triage:188 product services remain green, but host systemctl is still degraded and must not be smoothed into full host green. Failed units are awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service. Evidence shows the host PostgreSQL cluster 14/main is down in pg_lsclusters, while product DB / exporters still respond through containerized services; therefore pg_isready or pg_up=1 cannot substitute for host cluster health. The 188 startup service detected could not locate a valid checkpoint record on 2026-06-23 and attempted pg_resetwal as root, which failed; v1.63 treats PostgreSQL checkpoint/WAL errors as break-glass only and the repo-side startup script now fails closed instead of running pg_resetwal. Certbot renew for sentry.wooo.work is also failing and hit ACME rate-limit / challenge failure, but the public cert is still valid until 2026-07-09 16:03:40 UTC. Current declaration: SERVICE_GREEN_HOST_HYGIENE_BLOCKED for 188, while overall service recovery remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED.
2026-06-26 06:40-06:44 all-host read-only refresh: ping and SSH port OK for all of 110 / 120 / 121 / 188 / 112 / 111 / 168. The core reboot scope stays green: 110 systemctl=running with 0 failed units, and Docker / Gitea / Harbor / Prometheus / Alertmanager are available; 120 / 121 systemctl=running with 0 failed units, and K3s nodes mon / mon1 are Ready; 188 product containers and PostgreSQL / Redis / MOMO / SigNoz are available. ArgoCD awoooi-prod has converged from the earlier degraded state to Synced / Healthy, revision b2945ab9f716d9d685434ae0e67b9318414b27fe; the official km-vectorize 03:00 Taipei time run succeeded, lastSuccess=2026-06-25T19:00:14Z. Public routes for AWOOOI / VibeWork / AwoooGo / MOMO / Stock / Bitan / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse return expected statuses; AWOOOI API health is healthy / prod / mock_mode=false; MOMO health is V10.690; StockPlatform freshness is status=ok, latest_trading_date=2026-06-25, blockers []; backup-status remains core green with escrow_missing=5. Boundaries: 188 host still has failed units awoooi-startup.service, certbot.service, postgresql@14-main.service, snap.certbot.renew.service that require host hygiene cleanup; 112 Wazuh services / ports are active but Wazuh manager registry accepted remains 0; 111 / 168 Codex workspaces are reachable but have different local HEADs on the same ahead branch; Mac Mini free space is about 3.4Gi. Current service verdict remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED, not DR_COMPLETE or Wazuh recovered.
2026-06-26 06:26-06:28 next-day read-only refresh: ping/SSH OK for the four hosts, cold-start PASS=89 WARN=0 BLOCKED=0, MOMO V10.690 with latest import job 57 completed, StockPlatform /api/v1/system/freshness still status=ok / latest_trading_date=2026-06-25 / blockers [], backup-status 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite_fresh=1, rclone_gdrive_fresh=1, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. The first pass of the 06:26 full wrapper saw a single 000 at https://awoooi.wooo.work/zh-TW/iwooos and https://vibework.wooo.work/, but an independent curl immediately returned 200 and the route-only wrapper also returned PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN; v1.61 therefore changed the public route gate to retry up to 3 times, so that only consecutive failures count as BLOCKED and a recovery after retry is listed as an evidence warning. The 06:28 core wrapper with routes skipped returns POST_START_QUICK_CHECK PASS=15 WARN=2 BLOCKED=0, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. No Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write operation was performed this time.
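The v1.61 retry rule for public routes reduces to a short loop. The sketch below is an illustration, not the wrapper's implementation: `probe_route` and its labels are hypothetical, and the status fetcher is injected as a callable so the logic can be exercised without network access (no sleep or backoff is modeled).

```python
from typing import Callable


def probe_route(
    fetch_status: Callable[[], int],
    expected: set[int],
    max_attempts: int = 3,
) -> str:
    """Return 'pass', 'evidence_warning' (recovered on retry), or 'blocked'."""
    for attempt in range(1, max_attempts + 1):
        if fetch_status() in expected:
            # First-try success is clean; a later success is kept as evidence.
            return "pass" if attempt == 1 else "evidence_warning"
    return "blocked"  # every attempt failed consecutively
```

A single 000 followed by 200, as seen at 06:26, would classify as `evidence_warning` here, so the transient stays visible in the record without blocking the round.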
2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. After waiting for official schedules instead of manual ingestion, intelligence-sync 21:00 finished status=0, core.margin_short_daily reached 2026-06-25 / 1976 rows, and ai-recommendation-pipeline 21:10 finished STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25 with draft_count=120, candidate_count=120, and rag_documents=1000. StockPlatform /api/v1/system/freshness now returns status=ok, latest_trading_date=2026-06-25, blockers [], with price / chips / margin / AI recommendations all on 2026-06-25. The 21:14 full wrapper returns cold-start PASS=89 WARN=0 BLOCKED=0 and overall POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. The only remaining recovery red gate is DR credential escrow evidence escrow_missing=5; Wazuh manager registry accepted remains 0 as a security evidence blocker, not a reboot service blocker.
2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two stockplatform-review-bulk-ux Chrome process groups 2756503 and 2829627 with root Chrome process PPID=1, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted SIGTERM at 20:24. Post-check showed no remaining PGID entries; vmstat showed CPU idle around 85-90%, si/so=0, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start PASS=89 WARN=0 BLOCKED=0, but overall POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1, RESULT=BLOCKED, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.
2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea main and live /home/wooo/stockplatform-v2 are now at fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints; six missing production cron entrypoint scripts are restored, run-intelligence-sync.sh contains the Docker-backed psql shim, and live contract check confirms every scripts/ops/*.sh referenced by install-production-cron.sh exists. The only live write performed for StockPlatform recovery was a fast-forward git pull --ff-only origin main on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: source-remediation-queue 19:56 and 20:00 succeeded, market-index-ingestion 20:00 succeeded, price-ingestion 20:02 succeeded, margin-short-ingestion 20:05 succeeded, chips-ingestion 20:06 succeeded, and ai-recommendation-pipeline 20:10 ran but correctly produced the internal blocker core_margin_short_daily_incomplete,official_margin_short_daily_official_pending. StockPlatform /api/v1/system/freshness therefore still returns status=blocked because the 2026-06-25 official margin-short source is pending and ai.recommendations must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
2026-06-25 19:35 product-version / data-freshness refresh supersedes the 19:06 data-complete wording. Host boot, K3s, AWOOOI runtime, MOMO service/data, backup/offsite, Bitan cleanliness, and expanded public routes are available, but the stricter post-start wrapper now checks StockPlatform /api/v1/system/freshness and correctly returns RESULT=BLOCKED when product data is not current. The 19:35 lightweight wrapper run used --skip-cold-start --skip-backup --skip-cpu after the 19:24 full host/cold-start/backup readback and returned PASS=31 WARN=1 BLOCKED=1, with the single blocker StockPlatform freshness is blocked: core_margin_short_daily_missing,ai_recommendations_stale. stock.wooo.work, /healthz, and /api/healthz all return 200; public routes now covered by the wrapper include AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Do not declare "all products and data are latest" until StockPlatform freshness is ok; keep DR blocked until escrow_missing=0.
2026-06-25 19:06 post-CD live read-only refresh supersedes the 18:53 wrapper wording. Consecutive main pushes caused older deploy markers to be replaced, so the latest production truth is deploy marker d8ca8224 chore(cd): deploy 9dbe044 [skip ci]. Read-only ArgoCD shows awoooi-prod Synced / Healthy at revision d8ca822422021d0fda8da8fa4c354c0c4db7ff22; API/Web/Worker live image tag 9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be; API 2/2, Web 2/2, Worker 1/1. The 19:05 post-start quick check returns RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, delegated cold-start remains PASS=89 WARN=0 BLOCKED=0, and 19:05-19:06 route stability checks confirm AWOOOI API, IwoooS, AwoooGo, Stock, VibeWork, Bitan, and MOMO health all return 200 for five consecutive external reads. Earlier AwoooGo / Stock 502 reads were post-deploy upstream warmup transients, not persistent service failures. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on V10.690; latest import job 57 completed cleanly; MOMO_DAILY_FRESHNESS 1|2026-06-24; current-month daily snapshot and realtime tables match through 2026-06-24. post-start-quick-check.sh parses cold-start PASS / WARN / BLOCKED summary before classifying exit codes, so WARN-only rollout/stale evidence is no longer inflated into a service blocker. The wrapper returns RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED when service blockers are zero but escrow_missing=5 remains. Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh production routes are now 200 disabled_waiting_iwooos_wazuh_owner_gate, but configured=false, manager query accepted 0, manager registry accepted 0, and runtime gate 0; treat Wazuh as a security registry evidence blocker, not a reboot service blocker.
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: 2026-06-25 19:05 wrapper delegated cold-start PASS=89 WARN=0 BLOCKED=0, Result=GREEN.
Post-start quick check: 2026-06-25 21:14 PASS=38 WARN=2 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. Cold-start layer remains GREEN and StockPlatform freshness is now OK; DR remains blocked by credential escrow evidence.
Repo-side cold-start v1.42+ live read-only run: MOMO source absence / stale data blocker is cleared by import job 57 and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Live 110 script sync is not claimed until a separate approved deployment/sync happens.
110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
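A hash parity gate like the one above can be sketched in a few lines. This is not the content of `verify-cold-start-monitor-deploy.sh`; it assumes the hashes are SHA-256 digests (the 64-hex-character values quoted above are consistent with that, but the source does not name the algorithm), and the function names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the given bytes (assumed algorithm: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def parity_verdict(repo_script: bytes, live_script: bytes) -> str:
    """BLOCKED unless the repo copy and the live copy hash identically."""
    return "OK" if sha256_hex(repo_script) == sha256_hex(live_script) else "BLOCKED"
```

In practice the two byte strings would come from reading the repo file and a read-only copy of the live file; any difference, even whitespace, keeps the gate blocked.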
Service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, StockPlatform data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared, and 110 orphan StockPlatform Chrome smoke groups cleared by targeted approved SIGTERM. StockPlatform production cron source drift is repaired and verified by natural cron runs; product-data completeness is now green for the 2026-06-25 evidence set.
Runtime release state: API/Web/Worker live image tag is `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, and 19:06 K3s readback shows API/Web/Worker pods Running; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 19:05 route smoke returned 200 for AWOOOI API, IwoooS, MOMO health, and Stock; cold-start route gate also returned expected statuses for AWOOOI web, MOMO, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, Bitan, and AIOps. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan then returned 200 for five consecutive external route reads from 19:05:38 to 19:06:24. 19:35 expanded route readback returned expected 2xx/3xx for AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
MOMO release state: mo.wooo.work health is healthy on version V10.690. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 18:23 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain recent container-replace / scheduler fail-closed / notification evidence notes, but no service blocker remains.
MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly match through 2026-06-24: `daily_sales_snapshot=109061|2025-07-01|2026-06-24`, `MOMO_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Latest import job is `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
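The `MOMO_MONTHLY_SYNC` line can be checked mechanically. The sketch below assumes the six pipe-separated fields are snapshot row count, realtime row count, then the date bounds of each table; that field order is an assumption inferred from the value shown above, and `monthly_sync_in_parity` is a hypothetical helper.

```python
def monthly_sync_in_parity(payload: str) -> bool:
    """True when row counts and both date bounds agree between the two tables."""
    parts = payload.split("|")
    if len(parts) != 6:
        return False  # malformed evidence is never treated as parity
    snap_rows, rt_rows, snap_first, snap_last, rt_first, rt_last = parts
    return snap_rows == rt_rows and snap_first == rt_first and snap_last == rt_last
```

For the value recorded above, `monthly_sync_in_parity("15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24")` evaluates to `True`.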
StockPlatform data state: `/api/v1/system/freshness` returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`. Current OK sources include `core.price_daily` 2026-06-25 / 1976 rows, `core.chips_daily` 2026-06-25 / 1976 rows, `core.margin_short_daily` 2026-06-25 / 1976 rows, `core.market_index_daily.tw` 2026-06-25 / 2 rows, `core.market_index_daily.global` 2026-06-25 / 2001 rows, and `ai.recommendations` 2026-06-25 / 2868 rows. The 21:10 natural AI pipeline produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`; no manual ingestion or DB write was performed.
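The freshness verdict can be reduced to a strict check on the response body. This sketch assumes the endpoint returns a JSON object with `status` and `blockers` keys, as the readbacks above suggest; the function name is hypothetical and the HTTP call itself is left out.

```python
def freshness_is_green(payload: dict) -> bool:
    """Green only when status is 'ok' and the blockers list is present and empty."""
    return payload.get("status") == "ok" and payload.get("blockers") == []
```

A missing `blockers` key evaluates as not green, so a truncated or unexpected response cannot be mistaken for fresh data.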
Product version readback: StockPlatform live source `/home/wooo/stockplatform-v2` matches Gitea `wooo/stockplatform-v2.git` main `fb91aa4c6272469d1d26e0820169629eac17d28a`; VibeWork live image `192.168.0.110:5000/vibework/web:76a4ee15026af278a3660ad4b4547e9308b107be` matches Gitea `wooo/vibework.git` main `76a4ee15026af278a3660ad4b4547e9308b107be`; AwoooGo live source `/home/wooo/awooogo` matches Gitea `wooo/AwoooGo` main `6897972e9820cbb96c508fa9a80c66246c973307`; MOMO runtime uses `registry.wooo.work/wooo/momo-pro-system:stable` image id `df931906e158` created `2026-06-25T13:28:59+08:00`, while Gitea `wooo/momo-pro-system.git` main is `25120cbf21ba51affc94d0220ec87e607f59a833`; 188 runtime directory is a compose/image deployment path, not a git worktree, so add image revision label evidence before declaring code-image parity.
Google Drive / source-file state: 14:16 cold-start reports `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`. Dedicated preflight confirms host token metadata matches scheduler UID and restrictive mode; container token artifact exists with mode `600`. Token content was not read. Future Drive auth/API failure must still be treated as failed import evidence rather than no-file success.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 20:24 readback found a second recurrence with orphan process groups `2756503` and `2829627`, root Chrome `PPID=1`, elapsed about 5h, no active parent smoke, GPU process CPU around 96%, and renderer CPU around 22%; approved targeted `SIGTERM` cleared both PGIDs. 21:14 CPU attribution shows current load is dominated by an active AWOOOI Web `next build` process group and its worker processes, not orphan Chrome. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
Backup / monitoring state: 19:05 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
Route transient handling: post-deploy `502` on Stock or AwoooGo is a blocker only if it persists after upstream container health is ready and 3-5 consecutive external route reads still fail. For AwoooGo, live upstream is on 110 `192.168.0.110:32190`; do not test only `127.0.0.1` on 110 because the listener may bind the host address. For K3s workload balancing, wait for terminating pods to disappear before judging API/Web placement; final required state for two-replica API/Web is split across `mon` and `mon1`.
Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
Deploy storm / CD replacement state: if several main commits land during recovery, older CD runs may be canceled by newer commits. Do not treat the canceled run as a service failure. Wait for the final deploy marker, verify live image tags, ArgoCD health, public routes, DB freshness, backup status, and post-start quick check before declaring latest production recovered.
Wazuh / SOC boundary state: production Wazuh read-only route presence is not equivalent to Wazuh registry recovery. `/api/iwooos/wazuh` and `/api/v1/iwooos/wazuh` returning `200 disabled_waiting_iwooos_wazuh_owner_gate` only proves the route boundary is deployed; manager registry accepted, owner evidence accepted, active response, host write, agent re-enroll, restart, secret patch, Kali active scan, and runtime gate remain `0 / false`.
Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom.
Allowed declaration: host boot, core service readiness, K3s, public route availability, AWOOOI API health, MOMO service health/data freshness, Bitan public-content cleanliness, and backup/offsite readiness are green for the latest read-only evidence set.
Forbidden declaration: all product data latest, StockPlatform data freshness green, DR complete, credential escrow complete, Wazuh host registry accepted, 110 live monitor synced, or runtime/security acceptance. Credential escrow evidence is still missing and StockPlatform freshness is blocked; neither may be smoothed into green.
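The route transient rule above (a post-deploy `502` is a blocker only once the upstream is healthy and 3-5 consecutive external reads still fail) can be sketched as a small classifier. This is a minimal illustration, not the logic of any existing script: the function name, its return labels, the default of 3 reads, and treating only 2xx as success are all assumptions.

```python
from typing import Sequence


def classify_route_502(upstream_healthy: bool,
                       external_statuses: Sequence[int],
                       required_failures: int = 3) -> str:
    """Hypothetical helper: classify a post-deploy 502 on a public route."""
    if not upstream_healthy:
        # Upstream container is still starting; a 502 here is warmup noise.
        return "WAIT_FOR_UPSTREAM"
    recent = list(external_statuses)[-required_failures:]
    if len(recent) < required_failures:
        # Not enough external reads yet to judge persistence.
        return "KEEP_READING"
    if all(not (200 <= status < 300) for status in recent):
        # Assumption: only 2xx counts as a successful external read.
        return "BLOCKER"
    return "TRANSIENT"
```

Passing `required_failures=5` gives the stricter end of the 3-5 range.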
2026-06-24 22:17 Codex workstation continuity readback:
MacBook Pro 192.168.0.111 can authenticate to Gitea over SSH with its own public key named MacBook Pro Codex 20260624.
MOMO Pro Mac Mini workspace is /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty=0.
MOMO Pro MacBook workspace is /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty=0.
MOMO import-boundary regression: pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q => 10 passed.
MOMO production release: Gitea main and cd.yaml #904 are at 84035906aba0e5e190d031a13cfd9b47a8cd1f73; 188 live source marker proves production deploy.
Codex Start Here / workstation dashboard / scorecard safe artifacts were copied to MacBook Pro; latest artifact dashboard readback is refreshed after the docs closeout commit. Raw Codex App DB, auth, sessions, raw conversations, .env, runtime volumes, raw .git directories, passwords, tokens, and Mac Mini private keys were not copied.
AwoooGo MacBook dev workspace remains ready at /Users/ooo/codex-workspaces/awooogo-dev, branch dev, upstream gitea/dev, commit 8471b376d97c1436d4612ece17f51ba0950f114d, dirty=0.
Safe handoff artifacts still match by local / remote SHA-256 readback after Start Here / workstation dashboard / scorecard refresh. Exact hash values are intentionally not hard-coded in this runbook because they change whenever handoff artifacts are refreshed. Raw Codex App DB, auth, sessions, raw conversations, .env, runtime volumes, raw .git directories, passwords, tokens, and Mac Mini private keys were not copied.
This improves workstation continuity after host reboot / operator relocation, and the MOMO import-boundary fix is now production-deployed; it does not change service cold-start status: full-stack green remains blocked by MOMO data freshness and DR remains blocked by credential escrow evidence.
2026-06-18 12:17 live readback supersedes older service-availability wording:
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=83 WARN=1 BLOCKED=0, Result=DEGRADED.
Service state: SERVICE_AVAILABLE_DEGRADED; 110/120/121/188 reachable, K3s mon/mon1 Ready, NODE_FS_ERROR_EVENTS=0, public routes/TLS green, 110/188 backup health fresh.
Rollout state after transient 12:14 startup window: awoooi-api 2/2, awoooi-web 2/2, worker 1/1, canary 1/1, public API health 200 healthy.
Only live warning: retained stale K8s Job km-vectorize-29689620 from 2026-06-14 03:00. Later official km-vectorize Jobs 29692500 / 29693940 / 29695380 are Complete.
Allowed declaration: services are available with one stale failed Job warning.
Forbidden declaration: full cold-start green, DR complete, or runtime/security acceptance.
2026-06-18 13:43 live readback supersedes the stale-Job warning wording:
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=84 WARN=0 BLOCKED=0, Result=GREEN.
Service state: FULL_STACK_GREEN_FOR_SERVICE; 110/120/121/188 reachable, K3s mon/mon1 Ready, NODE_FS_ERROR_EVENTS=0, public routes/TLS green, 110/188 backup health fresh.
K8s Job classification: FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0. The retained km-vectorize failure stays as evidence but no longer blocks service readiness after later official successful Jobs.
Allowed declaration: full cold-start service readiness is green for this evidence set.
Forbidden declaration: DR complete or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.
2026-06-18 14:31 live runaway-process readback supersedes repo-only AIOps wording:
110 host runaway process exporter is live-installed and scraped.
Textfile source: /home/wooo/node_exporter_textfiles/host_runaway_process.prom.
Prometheus readback: monitor_up=1, orphan_browser_groups=0 for headless_browser_smoke and stockplatform_headless_smoke, active Gitea Actions containers=2, load5_per_core around 0.79-0.81, swap_used_ratio around 1.0, remediation_authorized=0.
Alerts: HostRunawayProcessMonitorMissing is not firing; HostOrphanBrowserSmokeHighCpu is not firing.
Allowed declaration: runaway Chrome/smoke recurrence guard is live and scraped.
Forbidden declaration: AI runtime remediation is enabled. Remediation remains gated and must not execute without owner approval, maintenance window, evidence ref, dry-run, and post-check.
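The remediation gate above names five preconditions: owner approval, maintenance window, evidence ref, dry-run, and post-check. A minimal sketch of a fail-closed check follows. The class, field, and function names are hypothetical; only the five preconditions and the `remediation_authorized=0` default come from this SOP.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RemediationRequest:
    # Hypothetical field names for the five preconditions.
    owner_approved: bool
    maintenance_window_open: bool
    evidence_ref: str  # empty string means no evidence reference was supplied
    dry_run_passed: bool
    post_check_planned: bool


def missing_preconditions(req: RemediationRequest) -> List[str]:
    """Return the preconditions that are not yet satisfied."""
    checks = [
        ("owner approval", req.owner_approved),
        ("maintenance window", req.maintenance_window_open),
        ("evidence ref", bool(req.evidence_ref.strip())),
        ("dry-run", req.dry_run_passed),
        ("post-check", req.post_check_planned),
    ]
    return [name for name, ok in checks if not ok]


def remediation_authorized(req: RemediationRequest) -> int:
    """Fail closed: 1 only when every precondition holds, otherwise 0."""
    return 0 if missing_preconditions(req) else 1
```

Returning the list of missing preconditions, rather than a bare boolean, keeps the reason for a refusal visible in the evidence trail.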
2026-06-18 14:51 production event-packet readback:
Host runaway alert-to-event packet is deployed in production.
Deploy marker: 2d278568 chore(cd): deploy f358a0f [skip ci].
Runtime image: awoooi-api / awoooi-web / awoooi-worker use f358a0f6c3e614e407dedb6eee89bf10b2bc8173.
ArgoCD readback: awoooi-prod Synced / Healthy.
Alert mapping: HostOrphanBrowserSmokeHighCpu -> orphan_browser_smoke_runaway_process; HostCiRunnerLoadSaturation -> ci_runner_load_saturation.
Allowed declaration: monitoring, alert rules, live scrape, AI event packet routing, PlayBook / KM contract, and production deployment are complete for this evidence set.
Forbidden declaration: Telegram send, Bot API call, Gateway queue write, process kill, Docker/systemd restart, Nginx reload, firewall/K8s action, or runtime remediation is authorized.
2026-06-18 16:08 P3-009 Host Runaway AIOps product readback:
Host runaway AIOps closed-loop read model is deployed in production.
Deploy marker: 42c08ece chore(cd): deploy 27143fb [skip ci].
API endpoint: /api/v1/agents/agent-host-runaway-aiops-loop-readiness.
Production readback: schema_version=host_runaway_aiops_loop_readiness_v1, current_task_id=P3-009, next_task_id=P3-010, completion=100, loop_stage_count=6, alert_lane_count=2, asset_writeback_contract_count=5.
Host 110 live readback in the model: orphan browser groups=0, active CI containers=2, remediation_authorized=0, runtime/write counters=0.
Governance route: /zh-TW/governance?tab=automation-inventory shows P3-009 on desktop 1440x1100 and mobile 390x844 with missing text=0, console/page errors=0, horizontal overflow=false.
Allowed declaration: monitoring, alert rules, AI event packet, PlayBook / KM contract, Verifier/writeback contract, gated remediation dry-run boundary, and product-visible readback are complete for this evidence set.
Forbidden declaration: AI runtime remediation is enabled. Process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain forbidden without owner approval, maintenance window, evidence ref, dry-run, and post-check.
| Item | 2026-06-24 11:35 Asia/Taipei live result | Verdict |
|---|---|---|
| Overall recovery readiness | 98% | `SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED` |
| P0 host / K3s recovery | 100% | `DONE` |
| P1 backup / alert / escrow | 96% | `BLOCKED_DR_ESCROW` |
| P2 service / data truth | 96% | `BLOCKED_MOMO_DATA_FRESHNESS` |
| P3 docs / automation contracts | 100% | `DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY` |
| 110 host runtime | fwupd-refresh.timer intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; `systemctl --failed` returns 0 loaded units listed; rollback is `sudo systemctl enable --now fwupd-refresh.timer` | `GREEN_WITH_FWUPD_TIMER_DISABLED` |
| 110 host runaway process guard | 14:31-14:32 live scrape confirms monitor_up=1, orphan browser groups 0, active Gitea Actions containers 2, load5_per_core≈0.79-0.81, swap_used_ratio≈1.0, and remediation_authorized=0; exporter/helper also remain in Ansible textfile exporter source-of-truth. | `LIVE_SCRAPED_RUNTIME_GATE_0` |
| 120 reachability | ping OK, SSH OK, boot around 2026-06-14 02:23, K3s active, node mon Ready | `GREEN` |
| 121 reachability | ping OK, SSH OK, failed units 0 | `GREEN` |
| 188 host runtime | production services green, but host systemctl degraded by awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service; host PostgreSQL cluster 14/main is down while product DB containers/exporters are healthy; certbot renewal for shared sentry.wooo.work certificate is failing but public cert is still valid until 2026-07-09 UTC | `SERVICE_GREEN_HOST_HYGIENE_BLOCKED` |
| K3s node state | mon Ready control-plane, mon1 Ready control-plane; bad pods 0; FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0 | `GREEN_WITH_RETAINED_EVIDENCE` |
| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale known_hosts; backup /home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416; final repair backup /home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949; CD fix 80e6ec1a moves deploy trust to /home/wooo/.ssh/deploy_known_hosts; 01:28 global known_hosts still contains 120 / 188 and was not clobbered by deploy marker e4a349bc | `GREEN_WITH_GUARDRAIL` |
| Backup status | 11:20 status: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1; escrow readback still shows ESCROW_MISSING_COUNT=5 | `GREEN_WITH_DR_ESCROW_WARNING` |
| Offsite sync / verify | 01:28 textfile: awoooi_backup_offsite_remote_verify_ok=1, full_verify_fresh=1, all 13 repos have snapshot_count=1 and snapshot_latest_only=1; latest scheduled verifier log is 2026-06-12 07:20 | `GREEN` |
| Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | `GREEN_WITH_EXPECTED_REDLIGHTS` |
| Cold-start scorecard | 11:35 read-only scorecard: PASS=86 WARN=0 BLOCKED=1. Public routes / TLS, momo DB parity, backup exporters, 120/121 K3s, MinIO / Velero, and AWOOOI API/Web all pass; the only blocker is MOMO data freshness. | `BLOCKED_MOMO_DATA_FRESHNESS` |
| momo DB parity | `10936 \| 10936` | |
| momo scheduler | container healthy; Drive listing from container works; pending folder 當日業績匯入 count is 0 for 即時業績_當日; no current Permission denied evidence in the latest readback | `GREEN_WITH_SOURCE_ABSENT` |
| ArgoCD app health | 11:35 readback: awoooi-prod sync Synced, health Healthy, source revision 7db7800e399caed5487a705c81ec993dec76c70f; API/Web/Worker ready. | `GREEN` |
| Workload balancing | Live API/Web/Worker/CronJob image is e999c16b3435f197b78fe2adfeec1c4faa6c4675; API/Web pods remain split across mon / mon1, Worker single replica remains healthy on mon | `GREEN` |
| Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |
Release rule:
Do not declare full cold-start green unless the latest scorecard has `WARN=0` and `BLOCKED=0`.
Do not declare aggregate backup green unless latest `backup-status` has `core_blockers=0`.
Do not declare DR scorecard complete while credential escrow markers are missing.
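The three release rules are independent gates, so one evidence set can clear some declarations while others stay forbidden. A minimal sketch, assuming the four counters have already been read from the latest scorecard, `backup-status`, and escrow readback; the function name and the returned labels are illustrative only.

```python
from typing import List


def forbidden_declarations(warn: int, blocked: int,
                           core_blockers: int,
                           escrow_missing: int) -> List[str]:
    """Hypothetical helper: list declarations the release rules still forbid."""
    forbidden = []
    if warn != 0 or blocked != 0:
        # Rule 1: full cold-start green needs WARN=0 and BLOCKED=0.
        forbidden.append("full cold-start green")
    if core_blockers != 0:
        # Rule 2: aggregate backup green needs core_blockers=0.
        forbidden.append("aggregate backup green")
    if escrow_missing != 0:
        # Rule 3: DR scorecard complete needs every escrow marker present.
        forbidden.append("DR scorecard complete")
    return forbidden
```

With the 2026-06-18 13:43 values (`WARN=0`, `BLOCKED=0`, `core_blockers=0`, `ESCROW_MISSING_COUNT=5`), only the DR declaration remains forbidden.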
2026-06-14 18:15 live rule:
110 / 120 / 121 / 188 core service recovery remains available, but the latest 18:15 scorecard is DEGRADED because `WARN=1`.
GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
NO-GO for "full cold-start green" until `km-vectorize` failed Job is cleared by an official successful run.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
Current allowed wording: "core service and backup are available; 110 failed units are cleared after intentionally disabling `fwupd-refresh.timer`; the recovery readback after the high-value config Owner Packet front-end sync shows no service regression; cold-start is degraded only by the `km-vectorize` official Job failure; DR complete still blocked by credential escrow; `km-vectorize` failed Job is retained but failed Pod/log are currently absent, so the next official 03:00 run remains the evidence gate."
2026-06-18 12:17 live rule:
GO for controlled service availability: PASS=83 WARN=1 BLOCKED=0, public routes/TLS green, API health 200 healthy, API/Web/Worker/Canary ready after rollout convergence.
GO for repo-side reboot readiness mechanism: readiness audit PASS=185 WARN=1 BLOCKED=0; only skipped live gate warning before the live check was run.
NO-GO for "full cold-start green" until the retained stale failed Job evidence is either cleared by normal K8s history policy or explicitly accepted by an owner-provided readback package.
NO-GO for "DR complete" while credential escrow evidence markers remain missing.
Do not delete the failed Job manually during routine SOP verification. Keep it as evidence unless an approved maintenance window explicitly authorizes cleanup.
Current allowed wording: "SOP / Plan B / automation contracts are complete; live services are available with one retained stale km-vectorize failed Job warning; hard blockers are zero; DR remains blocked by credential escrow evidence."
2026-06-18 13:43 live rule:
GO for full cold-start service readiness for this evidence set: PASS=84 WARN=0 BLOCKED=0.
GO for controlled runner/CD release under the normal security gates; this is not a bypass for owner response, runtime writer, Telegram, Gateway, K8s, Docker, Nginx, firewall, or secret operations.
GO for retaining stale failed Job evidence: FAILED_JOBS=1 and STALE_FAILED_JOBS=1 are allowed when ACTIVE_FAILED_JOBS=0 and later official successful Jobs exist.
NO-GO for DR complete while credential escrow evidence markers remain missing: ESCROW_MISSING_COUNT=5.
NO-GO for deleting retained failed Jobs during routine verification. Cleanup requires an explicit maintenance window and owner acceptance.
Current allowed wording: "full-stack service recovery is green for the current evidence set; stale km-vectorize failure is retained as historical evidence, not an active blocker; DR complete remains blocked by credential escrow evidence."
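The stale-versus-active split used in these rules can be sketched as follows. The assumption here is that a failed Job counts as stale once a later official successful Job of the same CronJob exists, and that "later" can be read from the numeric Job-name suffix (Kubernetes derives it from the scheduled time). The `JobRun` type and the function name are hypothetical, not part of the cold-start scripts.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JobRun:
    cronjob: str
    schedule_id: int       # numeric suffix of the Job name, e.g. 29689620
    succeeded: bool
    official: bool = True  # False for a manually created Job


def classify_failed_jobs(jobs: Iterable[JobRun]) -> Dict[str, int]:
    """Count failed Jobs and split them into stale and active."""
    jobs = list(jobs)
    failed = [j for j in jobs if not j.succeeded]
    stale = 0
    for f in failed:
        # Stale only if a later official run of the same CronJob succeeded.
        if any(j.cronjob == f.cronjob and j.succeeded and j.official
               and j.schedule_id > f.schedule_id for j in jobs):
            stale += 1
    return {
        "FAILED_JOBS": len(failed),
        "STALE_FAILED_JOBS": stale,
        "ACTIVE_FAILED_JOBS": len(failed) - stale,
    }
```

Feeding it the retained `km-vectorize-29689620` failure plus the later Complete Jobs 29692500 / 29693940 / 29695380 yields `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`. A manual successful run does not turn the failure stale.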
After any future 120 recovery, rerun this exact chain from 110:
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
0.1 When To Use This
Use this SOP when any of these happen:
- 110/120/121/188 reboot unexpectedly.
- All services are abnormal after a power/network event.
- K3s is stuck `activating`.
- Host load remains high during startup and service health is mixed.
- Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.
The rule is simple: recover the dependency chain, not the loudest symptom.
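0.2 Startup Readiness Layers
After a reboot, no single signal is enough to declare completion. Each host, and the platform as a whole, must be judged across five layers:
| Layer | Meaning | Minimum evidence | Does not imply |
|---|---|---|---|
| `HOST_POWERED` | Host or VM appears to be powered on | console / hypervisor shows running, or LAN ARP entries start to appear | OS has finished booting |
| `HOST_BOOTED` | OS has reached an interactive state | ping OK, SSH port open, `who -b` shows the boot time of this boot | systemd / Docker / K3s are healthy |
| `HOST_READY` | Host base services can carry the next layer | `systemctl is-system-running` is not degraded; failed units are explainable; cron / docker / DB / K3s are normal for the host's role | public routes or business data are healthy |
| `SERVICE_READY` | Services hosted on the machine are usable | service health, port, container health, and DB / Redis / K3s / Harbor / Alertmanager checks pass | backups, schedules, alerts, data consistency, and data freshness are verified |
| `FULL_STACK_GREEN` | Reboot recovery may be declared complete | cold-start scorecard WARN=0 and BLOCKED=0, with backup / offsite / DB / alerts / schedules / data freshness all green | Can never be declared while 120 is unreachable or MOMO business data is stale |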
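The layers are cumulative: a higher layer can be declared only when every layer below it also has evidence. A minimal sketch of that ordering follows. The enum mirrors the layer names in this section; the function name and the shape of the `evidence` mapping are assumptions for illustration.

```python
from enum import IntEnum
from typing import Dict, Optional


class Readiness(IntEnum):
    HOST_POWERED = 1
    HOST_BOOTED = 2
    HOST_READY = 3
    SERVICE_READY = 4
    FULL_STACK_GREEN = 5


def highest_declarable(evidence: Dict[Readiness, bool]) -> Optional[Readiness]:
    """Return the highest layer whose evidence, and all lower layers', is present."""
    reached = None
    for level in Readiness:  # iterates in definition order, lowest first
        if not evidence.get(level, False):
            break  # a gap here caps the declaration, whatever is above it
        reached = level
    return reached
```

A green public route with no `HOST_READY` evidence therefore still caps the declaration at `HOST_BOOTED`.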
The closing verdict for the 2026-06-12 110/120 incident was:
110 HOST_READY = yes
120 HOST_READY = yes
Core public services SERVICE_READY = yes
FULL_STACK_GREEN = yes, because cold-start scorecard is PASS=83 WARN=0 BLOCKED=0
DR_COMPLETE = no, because credential escrow evidence is incomplete
The verdict for the 2026-06-24 MOMO data stall was:
110 / 120 / 121 / 188 HOST_READY = yes
Core public services SERVICE_READY = yes
MOMO_RELEASE_CURRENT = yes, because mo.wooo.work health is V10.653 and Gitea main / CD #904 deployed commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73
MOMO_DB_PARITY = yes
MOMO_DATA_FRESH = no, because latest daily_sales_snapshot date is 2026-06-17 and stale age is 7 days as of 2026-06-24 22:40
MOMO_SOURCE_AVAILABLE = no, because Drive intake 當日業績匯入 has no newer 即時業績_當日 source, scheduler stats show repeated file_count=0 runs, and Mac Mini / MacBook candidate files only contain old or header-only data
FULL_STACK_GREEN = no, because live cold-start scorecard is PASS=86 WARN=0 BLOCKED=1 and repo-side v1.42 dry-run is PASS=88 WARN=0 BLOCKED=1 with blocker "188 momo source file absent while daily sales data stale"
DR_COMPLETE = no, because credential escrow evidence is incomplete
MOMO source absence recovery gate:
GO: declare MOMO service recovered when health is healthy, containers are healthy, scheduler runs, DB parity matches, and release version matches Gitea/CD.
NO-GO: declare MOMO data current while Drive intake has no newer 即時業績_當日 source file and latest DB bounds stop at 2026-06-17.
NO-GO: re-import stale local samples, product catalog exports, header-only sheets, or already imported archive files to fake freshness.
NO-GO: truncate, whole-DB restore, manual Drive movement, or manual import without explicit maintenance approval.
UNBLOCK: new legitimate PChome daily-sales source appears in 當日業績匯入 or an owner-approved safe import path; import job succeeds with sync_success=true; source file moves only after success; daily_sales_snapshot and realtime_sales_monthly bounds match; MOMO_DAILY_FRESHNESS <= 2.
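The freshness part of this gate is date arithmetic: a latest `daily_sales_snapshot` date of 2026-06-17 read on 2026-06-24 is 7 days stale, well above the `MOMO_DAILY_FRESHNESS <= 2` threshold. A minimal sketch follows, assuming `MOMO_DAILY_FRESHNESS` is the age in whole days of the latest snapshot date; the function names are hypothetical.

```python
from datetime import date


def stale_age_days(latest_snapshot: date, as_of: date) -> int:
    """Whole days between the latest snapshot date and the readback date."""
    return (as_of - latest_snapshot).days


def momo_data_fresh(latest_snapshot: date, as_of: date,
                    max_age_days: int = 2) -> bool:
    """Fresh only when the age is within the threshold and not in the future."""
    age = stale_age_days(latest_snapshot, as_of)
    return 0 <= age <= max_age_days
```

Freshness alone does not unblock the gate: the import job must also succeed with `sync_success=true` and the `daily_sales_snapshot` and `realtime_sales_monthly` bounds must match.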
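All reports must use this vocabulary, so that "service-side available" is never misreported as "overall DR complete".
0.3 Codex Workstation Handoff Verdict
If Codex development must continue from the Mac Mini / MacBook Pro after a reboot, separately confirm the Codex safe handoff artifacts; do not conflate service recovery with synchronization of raw Codex conversations.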
2026-06-24 22:17 Asia/Taipei readback:
MacBook Pro 192.168.0.111 SSH = OK
Safe artifacts synced = Start Here and workstation dashboard readback matched; current SHA-256 values are tracked in the workstation dashboard artifact and local sha256sum readback
Start Here readback = registry_ready 3, registry_blocked 8, latest_dev_on_gitea 3, production_on_gitea 8, raw_history_sync False
Workstation dashboard readback = artifact_sync_synced 2, artifact_sync_blocked 0, MOMO current main baseline ready 2
MOMO Pro Mac Mini workspace = /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty 0
MOMO Pro MacBook workspace = /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty 0
AwoooGo MacBook workspace = ready on dev commit 8471b376d97c1436d4612ece17f51ba0950f114d, dirty 0
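Allowed declarations:
Mac Mini / MacBook Pro have synced the Codex start entry point and governance snapshot.
MOMO Pro work can start on Mac Mini / MacBook Pro from the Gitea current-main Codex baseline; before implementation, still cut a new codex/<task> branch from codex/momo-current-main-dev-base-20260624.
The MOMO import-boundary fix has been deployed to production via main / CD #904; the next real import file is still needed to verify that the failure boundary prevents the file move.
Forbidden declarations:
Raw Codex / ChatGPT chat history has been synced.
All products can be developed in sync on both machines.
Treating MOMO Pro program version V10.653 as proof that MOMO business data is up to date.
2026FIFA / Agent Bounty owner preflight has passed.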
1. Golden Startup Order
0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed
Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.
1.1 Dependency Graph
flowchart TD
network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
obs110 --> k3s
k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
highload --> ai["AI auto-remediation: limited execution"]
This is also captured in the machine-readable baseline:
ops/reboot-recovery/full-stack-cold-start-baseline.yml
The YAML baseline is the source of truth for:
- hosts, roles, and SSH users
- phase ordering
- service startup dependencies
- endpoint success codes
- schedule freshness thresholds
- stateful-service protection boundaries
- AI automation release gates
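The dependency graph in 1.1 can be turned into startup waves with a topological sort. The sketch below hard-codes the flowchart edges in a Python dict for illustration; it does not read `full-stack-cold-start-baseline.yml`, whose schema is not shown here, and the function name is hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Node -> set of nodes it depends on, transcribed from the flowchart in 1.1.
DEPENDS_ON: Dict[str, Set[str]] = {
    "data188": {"network"},
    "obs110": {"network"},
    "k3s": {"data188", "obs110"},
    "workload": {"k3s"},
    "alertchain": {"workload"},
    "public": {"workload"},
    "schedules": {"public"},
    "highload": {"schedules"},
    "ai": {"highload"},
}


def startup_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into waves; every node's dependencies sit in earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

The waves only show what the graph permits in parallel (for example `data188` and `obs110`); the Golden Startup Order still serializes 188 before 110.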
1.2 Phase Gate Logic
Each phase has the same decision rule:
| Result | Meaning | Action |
|---|---|---|
| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| `WARN` | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| `GREEN` | All checks in scope passed. | Release the next phase only. |
The cold-start flow is intentionally conservative:
P0 network green
-> P0 188 data green
-> P0 110 registry/observability green
-> P1 K3s green
-> P2 workload + alert chain green
-> P2 public routes green
-> P2 schedules green
-> P3 high-load services and runners/CD
-> AI auto-remediation limited execution
The final release condition is not "containers are running". It is:
PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
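The gate logic can be sketched as two small functions: one maps check counts to a phase result, the other walks the phase order and stops at the first gate that is not `GREEN`. Function names are hypothetical, and treating a run with zero passing checks as `BLOCKED` is an assumption (fail closed when there is no evidence).

```python
from typing import Dict, Optional

# Phase order as listed in the cold-start flow above.
PHASES = [
    "P0 network",
    "P0 188 data",
    "P0 110 registry/observability",
    "P1 K3s",
    "P2 workload + alert chain",
    "P2 public routes",
    "P2 schedules",
    "P3 high-load services and runners/CD",
    "AI auto-remediation limited execution",
]


def phase_result(passed: int, warn: int, blocked: int) -> str:
    """Map check counts to BLOCKED / WARN / GREEN."""
    if blocked > 0:
        return "BLOCKED"
    if warn > 0:
        return "WARN"
    if passed > 0:
        return "GREEN"
    return "BLOCKED"  # assumption: no passing evidence means no release


def first_unreleased_gate(results: Dict[str, str]) -> Optional[str]:
    """Return the first phase that is not GREEN, or None when all are."""
    for phase in PHASES:
        if results.get(phase, "BLOCKED") != "GREEN":
            return phase
    return None
```

A phase missing from `results` counts as not released, so a skipped check cannot open a later phase.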
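1.3 Reboot GO / NO-GO Decision Tree
Before each maintenance, use this table to decide whether a reboot is allowed and which level may be declared afterwards.
| Scenario | GO / NO-GO | Allowed scope | Declaration ceiling |
|---|---|---|---|
| 03:00 offsite sync is running | NO-GO | Read-only observation; wait for the sync to finish, then run the verifier | Do not declare maintenance complete |
| 120 is unreachable, but only 110 is rebooted | CONDITIONAL GO | May declare only 110 / public service recovery; must not run the 120 backup fix | SERVICE_READY, never FULL_STACK_GREEN |
| 188 data layer is unhealthy | NO-GO | Fix PostgreSQL / Redis / Docker / SignOz / momo DB first | Do not release K3s / runner / AI |
| 110 Harbor / Registry is unhealthy | NO-GO for K3s deploy | Fix the registry first; K3s image pulls may fail | Do not release CD / deploy |
| 120 / 121 both Ready, offsite verifier green | GO | The full cold-start release chain may run | Requires scorecard WARN=0 BLOCKED=0 |
| credential escrow marker missing | GO for service reboot, NO-GO for DR complete | Services may be recovered; DR scorecard complete must not be declared | SERVICE_READY or BLOCKED_DR_ESCROW |
| Alertmanager required rules not visible | NO-GO for unattended window | Fix alert rules / drift guard first | Do not release AI auto-remediation |
GO only means the specified scope is allowed to run; it does not mean done. Completion must always go back to §15 Done Criteria.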
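1.4 Plan B: Degraded Operation and Return Path
Plan B is not a separate reboot procedure that can bypass preflight, nor is it authorization to modify hosts ad hoc during an incident. For the case where Plan A cannot reach FULL_STACK_GREEN within the maintenance window, Plan B predefines the minimum acceptable service target, the stop line, the degradation levels, the per-host paths, and the conditions for returning to Plan A.
The goal of Plan A is:
B4_FULL_STACK_GREEN: cold-start scorecard WARN=0 / BLOCKED=0, with backup, offsite, DB, alert, scheduler, K3s, public route, and business data freshness all green.
The goal of Plan B is:
Protect core services and data integrity first, do not widen the blast radius, do not misreport partial availability as full-stack green, and leave the next blocker as a trackable ticket.
The machine-readable contract for Plan B is fixed in the plan_b block of ops/reboot-recovery/full-stack-cold-start-baseline.yml; scripts/reboot-recovery/reboot-recovery-readiness-audit.sh must check that both the SOP and the baseline retain B0-B5, the T+120 stop line, and the three closing states. If any of these fields is missing, the readiness audit must return BLOCKED.
Plan B red lines
| Red line | Concrete requirement |
|---|---|
| No false green | Do not use route 200, pod up, container up, UI visible, CD success, or a single smoke pass to declare full recovery. |
| Do not silence correct red lights | If a red light on 120 / backup / credential escrow / alert / scheduler reflects a real gap, it must be kept. |
| No unauthorized write operations | Without a maintenance window and human approval: do not restart the Docker daemon, reload Nginx, change firewall / iptables, run kubectl patch against live, read secrets, or perform destructive recovery. |
| Do not release high-risk automation | CD runner, AI auto-remediation, heavy crawler, batch import, and repair bot stay frozen until the dependency chain is green. |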
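Plan B trigger conditions
| Trigger condition | Immediate action | Declaration ceiling |
|---|---|---|
| 03:00 offsite sync, 02:00 backup, or full verifier is still running | Postpone the reboot; wait read-only for completion | `B0_ABORTED_BEFORE_REBOOT` |
| Any P0 host is still unreachable by ping / SSH 15 minutes after reboot | Stop releasing the next layer and start the matching host path | `B1_HOST_RECOVERY_ONLY` |
| Any core data plane on 188 (PostgreSQL / Redis / momo / SignOz) is unhealthy | Freeze K3s deploy, runner, and AI auto-remediation | `B1_HOST_RECOVERY_ONLY` |
| 110 Harbor / Gitea / Alertmanager / Prometheus is unhealthy | Freeze CD / deploy / image-pull related flows | `B2_CORE_SERVICE_READY` or below |
| 120 or 121 alone is unhealthy, but the other control-plane node can carry the load | Enter single-node K3s service mode and keep the HA red light | `B2_CORE_SERVICE_READY` |
| Public route is available, but any of DB / backup / alert / schedule is not green | Mark ROUTE_GREEN_ONLY; do not declare service green | `B2_CORE_SERVICE_READY` |
| cold-start WARN>0, BLOCKED=0 | May declare services available but still degraded | `B3_SERVICE_AVAILABLE_DEGRADED` |
| credential escrow missing | Service recovery may be completed; DR complete must not be declared | `B4_FULL_STACK_GREEN` or below; `B5_DR_COMPLETE` is forbidden |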
Plan B host paths
| Failure domain | Degraded path | Condition to return to Plan A |
|---|---|---|
| 110 failure | Keep 120 / 121 K3s and 188 data; freeze CD, runner, Harbor image push, and Alertmanager outbound; first confirm whether Gitea / Harbor / Prometheus / Alertmanager are affected only at the host service layer. | 110 HOST_READY, Harbor / Gitea / Prometheus / Alertmanager healthy, backup-status shows no 110 core blocker, cold-start 110 checks green. |
| 120 failure | 121 carries the K3s control-plane; keep the 120_DEGRADED red light; do not declare K3s AA; do not run the 120 backup fix; use console / fsck recovery if necessary. | 120 ping / SSH OK, root filesystem rw, k3s active, node mon Ready, and the backup-configs / backup-all / offsite / cold-start chain all pass. |
| 121 failure | 120 carries the K3s control-plane; keep the 121_DEGRADED red light; do not declare workload balanced; avoid non-essential rollouts. | 121 ping / SSH OK, k3s active, node mon1 Ready, API/Web placement back to max skew <= 1. |
| 188 failure | Protect the data plane first: PostgreSQL, Redis, momo DB, SignOz, Ollama / AI provider; freeze any batch / crawler / AI flow that writes data or generates heavy load. | 188 HOST_READY, PostgreSQL / Redis / momo parity / SignOz / AI provider route healthy, and backup-status shows no 188 core blocker. |
| K3s degraded | Keep existing healthy Pods; check nodes / pods / events / VIP / NodePort first; avoid blindly restarting k3s or deleting Pods. | mon / mon1 Ready, API/Web/Worker rollout healthy, public API/Web / alert webhook / scorecard pass. |
| Public gateway degraded | Protect internal API / VIP / data; do not reload Nginx or change DNS/TLS/certbot/firewall unless there is an owner-approved maintenance window. | Nginx config owner evidence, route smoke, TLS / ACME, rollback owner, and post-check plan all pass. |
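Plan B service levels
During maintenance, every report must use one of the following levels; phrases such as "almost done" or "should be fine" are forbidden:
| Level | Meaning | Minimum evidence |
|---|---|---|
| `B0_ABORTED_BEFORE_REBOOT` | Preflight found a NO-GO; the reboot was cancelled or postponed | No runtime write operation performed; NO-GO blocker recorded. |
| `B1_HOST_RECOVERY_ONLY` | Only host-layer recovery is complete | Target host ping / SSH / boot time / basic systemd state can be determined; services not yet fully verified. |
| `B2_CORE_SERVICE_READY` | Core services are available, but the full dependency chain has not passed | public route, API, DB, or the main K3s surface is available; backup / alert / scheduler / scorecard are not yet all green. |
| `B3_SERVICE_AVAILABLE_DEGRADED` | Core services are available; cold-start has no hard block but still has WARN | cold-start BLOCKED=0; every WARN is explicitly listed and not silenced. |
| `B4_FULL_STACK_GREEN` | This reboot recovery is complete | cold-start PASS>0 WARN=0 BLOCKED=0, with backup / offsite / DB / alert / scheduler / data freshness all green. |
| `B5_DR_COMPLETE` | DR is complete | B4 plus credential escrow missing 0, with restore / escrow / offsite evidence complete. |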
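As a rough illustration, the level table can be read as a ladder evaluated from the bottom up. The sketch below is a simplification under stated assumptions: it treats the cold-start scorecard as already covering backup / offsite / DB / alert / scheduler / data freshness, and it models the B5 evidence requirement only through the escrow missing count. The function and parameter names are hypothetical.

```python
from enum import IntEnum


class PlanB(IntEnum):
    B0_ABORTED_BEFORE_REBOOT = 0
    B1_HOST_RECOVERY_ONLY = 1
    B2_CORE_SERVICE_READY = 2
    B3_SERVICE_AVAILABLE_DEGRADED = 3
    B4_FULL_STACK_GREEN = 4
    B5_DR_COMPLETE = 5


def plan_b_level(preflight_go: bool, core_service_ready: bool,
                 passed: int, warn: int, blocked: int,
                 escrow_missing: int) -> PlanB:
    """Return the highest level that may be reported for this evidence set."""
    if not preflight_go:
        return PlanB.B0_ABORTED_BEFORE_REBOOT
    if not core_service_ready:
        return PlanB.B1_HOST_RECOVERY_ONLY
    if blocked > 0 or passed == 0:
        # Services are up, but the scorecard has a hard block or no evidence.
        return PlanB.B2_CORE_SERVICE_READY
    if warn > 0:
        return PlanB.B3_SERVICE_AVAILABLE_DEGRADED
    if escrow_missing > 0:
        return PlanB.B4_FULL_STACK_GREEN
    return PlanB.B5_DR_COMPLETE
```

Against the readbacks recorded in section 0: `PASS=83 WARN=1 BLOCKED=0` maps to B3, `PASS=84 WARN=0 BLOCKED=0` with `ESCROW_MISSING_COUNT=5` maps to B4, and `PASS=86 WARN=0 BLOCKED=1` maps to B2.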
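Plan B execution timeline
T+0 Freeze CD / runner / AI auto-remediation / heavy batch; preserve console, journal, backup, and scorecard evidence.
T+5 Determine HOST_POWERED / HOST_BOOTED / HOST_READY; if any P0 host is unreachable, enter the host Plan B path.
T+15 If 188 data or 110 registry / observability is unhealthy, stop releasing K3s, runner, and AI.
T+30 If public routes are available but DB / backup / alert / scheduler have not passed, report B2 only; full green is not allowed.
T+60 The cold-start scorecard must be run; if it is still WARN / BLOCKED, record the Plan B level and the next blocker.
T+120 If B4 is still not reached, open an incident / follow-up; do not extend the window to perform unauthorized runtime writes.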
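Plan B Closing Conditions
Plan B may close only in one of the following three states:
| Closing state | Condition | Next step |
|---|---|---|
| RETURNED_TO_PLAN_A | Blockers are cleared and the full Plan A chain verification is complete | Update the reboot ledger; record actual duration and deviations from the SOP. |
| SERVICE_AVAILABLE_DEGRADED | Service is available but the scorecard is still WARN, or a DR / escrow / governance gate is incomplete | Keep the red light; open the next owner / evidence / maintenance task. |
| OPEN_INCIDENT_REQUIRED | Any of P0 host, data, K3s, gateway, backup, or alert is still hard blocked | Stop the maintenance window, preserve evidence, and escalate to incident handling. |
The professional standard for Plan B is not "guaranteed green every time". It is that after every reboot you can quickly tell which layer you have reached, what must not be claimed, who owns the next blocker, and whether it is safe to return to Plan A.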
2. Automation Freeze
Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.
| Item | Cold-start policy | Reason |
|---|---|---|
| Gitea/GitHub runners | Last | Build jobs can saturate 110 CPU/RAM. |
| momo-scheduler / crawlers | Last | Chrome and batch work can saturate 188. |
| Sentry/Snuba consumers | Controlled | Kafka backlog and ClickHouse merge can create temporary high load. |
| Alertmanager outbound notification | Gate | Avoid alert storms before API webhook and Telegram are verified. |
| AI auto-repair | Observe-only | Metrics, Redis, KM, and playbooks may be incomplete. |
| Stateful DB restart | Human approval | PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets. |
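2.1 Freeze Checklist
After entering the maintenance window, first move every source that can amplify the incident to observe-only or delay its release. If this step is skipped, later load and alerts become hard to interpret.
| Order | Target | Read-only check | Allowed actions | Forbidden actions |
|---|---|---|---|---|
| 1 | runner / CD | systemctl list-units "actions.runner.*", Gitea Actions running jobs | Pause new jobs; wait for finishable jobs to end | Restarting the Docker daemon to interrupt jobs |
| 2 | AI auto-remediation | Prometheus / Alertmanager / cold-start monitor status | Switch to observe-only; keep alerts | Automatic restart of stateful services |
| 3 | momo scheduler / crawler | container health, recent logs, DB parity | Delay heavy imports; keep existing data | Forcing an import while the DB is not green |
| 4 | Sentry / Snuba | ClickHouse / Kafka health, consumer restart loop | Control the consumer release order | Generic compose down/up of the whole stack |
| 5 | K3s workload | node readiness, pods, events | cordon/drain according to node state | Claiming a successful drain while 120 is unreachable |
When several work sessions handle the incident at the same time, the first priority is not to interrupt each other: while anyone is converging Docker / Nginx / firewall / K3s write operations, every other session stays read-only until there is an explicit handover.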
2.2 CD / SSH Trust Guardrail
The cold-start false red light on 2026-06-13 showed that when a CD workflow runs ssh-keyscan ... > /home/wooo/.ssh/known_hosts, it overwrites the user-level global SSH trust on 110, so strict SSH checks from 110 to 120 / 188 fail. Hosts that have actually recovered are then misjudged as blocked.
Fixed rules:
| Item | Correct practice | Forbidden |
|---|---|---|
| Deploy-only host key | Write to /home/wooo/.ssh/deploy_known_hosts | Writing to or overwriting /home/wooo/.ssh/known_hosts |
| Deploy SSH options | -o UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts | Sharing the operator / cold-start known_hosts |
| Cold-start SSH trust | Keep the verified fingerprints for 120 / 188; back up before any repair | Running ssh-keygen -R or rebuilding the whole file without fingerprint cross-verification |
| Verification | After CD, check the known_hosts mtime, the 120 / 188 entries, and strict SSH | Looking only at the CD success badge |
2026-06-13 fix anchors:
- Source fix: Gitea main contains 80e6ec1a fix(ci): avoid clobbering runner known hosts.
- Deploy marker: after e4a349bc chore(cd): deploy 414413a [skip ci], the /home/wooo/.ssh/known_hosts mtime still reads 2026-06-13 01:20:02 +0800, so CD did not overwrite it.
- Deploy isolated file: /home/wooo/.ssh/deploy_known_hosts mtime 2026-06-13 01:24:05 +0800.
- Global strict entries: 120 ED25519 on line 4 and 188 ED25519 on line 5 are still present; strict SSH to wooo@192.168.0.120 and ollama@192.168.0.188 must pass.
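The verification row above (mtime unchanged, 120 / 188 entries present) can be sketched as a pure function. This is an illustrative sketch only: the function name is hypothetical, the caller is assumed to read the file content and capture the mtime before and after CD, and hashed entries (HashKnownHosts) would not be matched by this plain-text comparison.

```python
def known_hosts_problems(content, required_hosts, mtime_before, mtime_after):
    """Return a list of problems; an empty list means the trust file looks intact."""
    problems = []
    if mtime_after != mtime_before:
        problems.append("known_hosts mtime changed during CD")
    hosts_seen = set()
    for line in content.splitlines():
        fields = line.split()
        # Skip blanks, comments, and marker lines such as @cert-authority.
        if len(fields) < 3 or fields[0].startswith(("#", "@")):
            continue
        # The first field may list several comma-separated host names.
        hosts_seen.update(fields[0].split(","))
    for host in required_hosts:
        if host not in hosts_seen:
            problems.append(f"missing entry for {host}")
    return problems
```

A passing result here does not replace the strict SSH test itself; it only catches the overwrite pattern early.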
3. P0 Evidence And Network
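Run from any machine on the same LAN. The nc -G 3 connect timeout used below is the macOS/BSD flag; on a Linux operator machine use nc -w 3 instead: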
for h in 110 120 121 188; do
ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done
arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
for h in 110 120 121 188; do
nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done
Then capture reboot evidence:
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
If any host has ARP incomplete or SSH port down, stop here and fix physical/network first.
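Because nc timeout flags differ between platforms, the SSH port probe can also be expressed with the Python standard library. This is a minimal sketch of the same check, not a replacement for the documented commands; the function name is hypothetical.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Example loop over the four hosts (not executed on import):
# for h in (110, 120, 121, 188):
#     ip = f"192.168.0.{h}"
#     print(("SSH_OK " if tcp_port_open(ip, 22) else "SSH_FAIL ") + ip)
```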
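3.1 Host Boot Determination Criteria
After every reboot, run the four-stage boot determination on each host first. Proceed to service recovery only when every stage matches the expectations for that host's role.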
for h in 110 120 121 188; do
ip="192.168.0.$h"
echo "=== $ip ==="
ping -c 2 -W 2 "$ip" >/dev/null && echo "HOST_POWERED_OR_LAN_OK=1" || echo "HOST_POWERED_OR_LAN_OK=0"
arp -an | grep "$ip" || true
nc -G 3 -z "$ip" 22 && echo "SSH_PORT_OPEN=1" || echo "SSH_PORT_OPEN=0"
done
Once SSH is available:
ssh wooo@192.168.0.110 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true; free -h; swapon --show'
ssh wooo@192.168.0.121 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true'
ssh ollama@192.168.0.188 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true; free -h'
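If 120 cannot be reached over SSH, its state can only be HOST_POWERED_UNKNOWN or HOST_BOOTED_UNKNOWN. Go through console / VM / network checks; do not substitute a stale Kubernetes node object for the actual host state.
| Verdict | Required conditions | Next step |
|---|---|---|
| HOST_BOOTED | ping or ARP responds, SSH port open, who -b shows the boot time of this reboot | Check role services |
| HOST_READY | systemctl is-system-running is running, or every degraded unit has been explained and does not affect this host's role | Proceed to service-layer verification |
| HOST_DEGRADED | Failed units exist and affect this host's role, or swap is full, root is read-only, or boot storage errors appear | Fix the host first; do not release the next layer |
| HOST_UNREACHABLE | ping/SSH/ARP fail | Drop the remote-repair assumption; switch to console/VM/network |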
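The verdict table can be read as a decision order. The sketch below is one hedged way to encode it; the function name and boolean inputs are hypothetical, and the operator is assumed to have already judged whether failed units affect the host's role.

```python
def classify_host(lan_reachable, ssh_open, system_state,
                  failed_units_affect_role=False, storage_or_swap_fault=False):
    """Return one of the four host verdicts from the table (illustrative only).

    system_state is the output of `systemctl is-system-running`,
    for example "running" or "degraded".
    """
    if not lan_reachable or not ssh_open:
        return "HOST_UNREACHABLE"
    if failed_units_affect_role or storage_or_swap_fault:
        return "HOST_DEGRADED"
    # "degraded" is acceptable only when every failed unit is explained
    # and does not affect this host's role.
    if system_state in ("running", "degraded"):
        return "HOST_READY"
    # Booted and reachable, but systemd has not settled yet.
    return "HOST_BOOTED"
```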
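Lesson from the 2026-06-12 incident on 110: when a failed unit points at a legacy path that no longer exists, first confirm whether it is still an active service. Disabling a stale timer can clear the host degraded state, but the source-of-truth must be cleaned up as a follow-up; repeated reset-failed must not be used to hide the problem.
Lesson from the 2026-06-26 incident on 188: the PostgreSQL host cluster, the Docker product DB, and the exporter must be judged separately. pg_isready, pg_up=1, or a public route returning 200 only proves that some PostgreSQL endpoint is available; it does not prove that postgresql@14-main has recovered. If the journal shows could not locate a valid checkpoint record, neither startup scripts nor AI may run pg_resetwal automatically; the case must go through the DB owner / backup restore / maintenance window / rollback owner / post-check gate.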
4. P0 188 Data Layer
188 is the first real service dependency because multiple product data planes, exporters, and AI / observability services depend on PostgreSQL-compatible endpoints. Do not assume the host cluster postgresql@14-main, Docker product databases, and exporter target are the same endpoint; prove the authoritative endpoint before repair.
4.1 Startup order
- containerd
- docker
- postgresql@14-main
- k3s_datastore.kine maintenance
- redis-server on 6380
- ollama or current AI proxy dependencies
- nginx
- Docker networks
- MinIO / OpenClaw / SignOz
- momo / litellm / batch services after load is stable
4.2 Read-only check
ssh ollama@192.168.0.188 '
hostname; date; uptime; free -h
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
pg_lsclusters 2>/dev/null || true
ss -ltnp "sport = :5432" 2>/dev/null || ss -ltn "sport = :5432" || true
pg_isready -h localhost -p 5432 || true
redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'
4.3 PostgreSQL WAL checkpoint damage
Signature:
PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...
If the affected cluster is the authoritative runtime datastore, this damage may block:
- 188:5432
- K3s startup on 120/121
- AWOOOI API DB access
- Alertmanager webhook if API cannot start
2026-06-26 counterexample: host cluster 14/main can be down while product DB containers and exporters still serve traffic. Therefore pg_isready is not enough and failed postgresql@14-main is not automatically a product outage. First map the listening process / container, current app DB configuration, and backup freshness.
Break-glass example only after DB owner approval, backup evidence, maintenance window, rollback owner, and post-check plan:
sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
Do not run pg_resetwal, DROP, reinitialize the cluster, delete /var/lib/postgresql, or restore an old backup from AI/startup automation. These are break-glass actions only.
5. P0/P1 110 Registry And Observability
110 must recover Harbor/Gitea/Monitoring early, but runners last.
5.1 Startup order
- docker
- Remove Exited (128) / Exited (137) orphan containers
- Harbor harbor-log
- Harbor full stack
- Gitea
- Prometheus / Alertmanager / Grafana / exporters
- Langfuse
- SignOz
- Sentry DB layer
- Sentry web/worker/consumer layer
- Gitea host runner and actions runners
5.2 Checks
ssh wooo@192.168.0.110 '
hostname; date; uptime; free -h
systemctl is-active docker || true
curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
Harbor healthy means /v2/ returns 200 or 401. Do not treat 401 as failure.
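Since a non-200 code can still mean healthy, the accepted codes are easier to apply as a lookup. This is a small sketch; the Harbor codes come from the rule above, the Gitea and Sentry codes from the baseline in section 12.1, and the names are hypothetical.

```python
# Accepted HTTP status codes per service during cold start.
ACCEPTED_STATUS = {
    "harbor": {200, 401},       # /v2/ answers 401 when auth is required
    "gitea": {200, 302},
    "sentry": {200, 302, 400},
}

def endpoint_healthy(service, status_code):
    """Return True when the status code counts as healthy for this service."""
    return status_code in ACCEPTED_STATUS[service]
```

For example, `endpoint_healthy("harbor", 401)` is true, while a curl timeout that reports `000` is not.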
5.3 Runner gate
Runner may start only after all are true:
- 188 PostgreSQL ready
- 110 Harbor ready
- 110 Gitea ready
- 120/121 K3s nodes ready
- AWOOOI API health passes
- 110 load/core is below 1.0 for at least 15 minutes
- runner systemd guardrails are active: CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0
Check:
ssh wooo@192.168.0.110 '
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
echo "=== $u ==="
systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
done
'
If WatchdogUSec is not 0, apply the guardrail script manually with sudo:
sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
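The output of the systemctl show loop above can be checked mechanically. The sketch below is an assumption-heavy illustration: the helper names are hypothetical, and the expected rendered values (2s for CPUQuota=200%, 2147483648 for MemoryMax=2G) reflect how systemd normally prints these properties, so confirm them against a real unit before relying on the comparison.

```python
# Assumed rendering of the guardrails in `systemctl show` output.
EXPECTED_GUARDRAILS = {
    "CPUQuotaPerSecUSec": "2s",     # CPUQuota=200%
    "MemoryMax": "2147483648",      # MemoryMax=2G in bytes
    "WatchdogUSec": "0",
}

def guardrail_violations(show_output):
    """Return {property: actual_value} for every guardrail that does not match."""
    props = {}
    for line in show_output.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return {
        key: props.get(key)
        for key, expected in EXPECTED_GUARDRAILS.items()
        if props.get(key) != expected
    }
```

An empty dictionary means the unit matches the guardrails; any entry names the property to inspect before running the apply script.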
6. P1 120/121 K3s
K3s must wait for 188 PostgreSQL and 110 Harbor.
6.1 Startup order
- 120 k3s.service
- 121 k3s.service, k3s-agent.service, or its live role
- CNI / kube-proxy
- Nodes Ready
- Core pods
- awoooi-prod pods
- keepalived VIP 192.168.0.125
- NodePorts 32334 and 32335
6.2 Checks
ssh wooo@192.168.0.120 '
hostname; uptime
pg_isready -h 192.168.0.188 -p 5432 || true
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
kubectl get nodes -o wide 2>/dev/null || true
kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'
ssh wooo@192.168.0.121 '
hostname; uptime
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'
If K3s is activating while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.
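6.3 120 / 121 AA / AS and Load-Balancing Determination
The 2026-06-12 15:19 live check confirmed that 120 and 121 are both K3s control-plane nodes, each with k3s active and k3s-agent inactive. They are therefore a control-plane active-active (AA) pair, not a traditional primary/standby active-standby (AS) pair.
Control-plane AA does not mean business-workload AA, however. After 120 recovered from a root filesystem fsck, most ArgoCD / AWOOOI / Velero / kube-system workloads remained concentrated on 121, while 120 mainly ran DaemonSet Pods. After every 120 / 121 reboot or recovery, also run a Pod placement check: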
ssh wooo@192.168.0.120 '
sudo kubectl get nodes -o wide
sudo kubectl get pods -A -o wide
sudo kubectl top nodes 2>/dev/null || true
sudo kubectl top pods -A --sort-by=cpu 2>/dev/null | head -30 || true
'
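Determination rules:
| Verdict | Condition | What may be claimed |
|---|---|---|
| K3S_CONTROL_PLANE_AA | 120 / 121 are both Ready control-plane nodes | Both control-plane nodes are available |
| WORKLOAD_IMBALANCED | The main deployments / pods all land on a single node | Service AA must not be claimed; scheduling governance is required |
| WORKLOAD_BALANCED | Core API / Web with replicas >= 2 are spread across 120 / 121 | The serving layer may be claimed as distributed |
| STATEFUL_AA | Storage replication, backup / restore drill, and failover drill all pass | Only then may data-layer AA be claimed |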
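The balanced versus imbalanced verdict can be computed from the NODE column of kubectl get pods -o wide. The sketch below is illustrative: the function name is hypothetical, the skew threshold of 1 follows the "max skew <= 1" recovery criterion used for 121, and the node name mon for 120 is an assumption (the document only names mon1 for 121).

```python
from collections import Counter

def placement_verdict(pod_nodes, nodes=("mon", "mon1")):
    """Classify one deployment's pod placement across the two nodes.

    pod_nodes: one node name per running replica.
    Returns (verdict, skew), where skew is the max minus min replicas per node.
    """
    counts = Counter(pod_nodes)
    per_node = [counts.get(node, 0) for node in nodes]
    skew = max(per_node) - min(per_node)
    balanced = len(pod_nodes) >= 2 and skew <= 1
    return ("WORKLOAD_BALANCED" if balanced else "WORKLOAD_IMBALANCED", skew)
```

For instance, three replicas all on mon1 give a skew of 3 and an imbalanced verdict, matching the situation observed after the 120 fsck recovery.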
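The formal baseline document for load-balancing and migration assessment is docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md. During recovery, complete the P0 backup chain and the cold-start scorecard first, and only then work on topology spread or service migration.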
7. P2 AWOOOI Workloads
Run after K3s nodes are Ready:
ssh wooo@192.168.0.120 '
kubectl get deploy -n awoooi-prod
kubectl get pods -n awoooi-prod -o wide
kubectl get svc -n awoooi-prod
kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'
curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/
If pods are ImagePullBackOff, go back to 110 Harbor.
If API health fails because DB/Redis is down, go back to 188.
8. P2 Alert Chain
Current main path:
Prometheus/Alertmanager on 110
-> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
-> AWOOOI API
-> TelegramGateway
-> Telegram
Alertmanager health alone is not enough. Run E2E:
curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
-H 'Content-Type: application/json' \
-d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'
Expected: API returns success and Telegram receives the test alert.
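The curl example carries a fixed startsAt value, which may look stale in the alert history. A hedged alternative is to build the same payload with the current UTC time; the function name below is hypothetical and the field set simply mirrors the JSON body shown above.

```python
import json
from datetime import datetime, timezone

def build_cold_start_test_alert(now=None):
    """Return the Alertmanager-style test payload with startsAt set to `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    alert = {
        "status": "firing",
        "labels": {"alertname": "ColdStartE2ETest", "severity": "info"},
        "annotations": {"summary": "Cold start E2E test, ignore"},
        "startsAt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "",
    }
    return {
        "receiver": "cold-start-test",
        "status": "firing",
        "alerts": [alert],
        "groupLabels": {},
        "commonLabels": {},
        "commonAnnotations": {},
        "externalURL": "",
        "version": "4",
        "groupKey": "cold-start-test",
    }

# To send it, pipe the JSON into the documented curl command, for example:
#   python3 build_alert.py | curl -s -X POST <webhook-url> -H 'Content-Type: application/json' -d @-
# where build_alert.py prints json.dumps(build_cold_start_test_alert()).
```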
9. P2 Schedules And Delayed Work
Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.
| Host / Layer | Required check | Success baseline |
|---|---|---|
| 188 cron | systemctl is-active cron and crontab -l | cron active; backup, restart exporter, stats exporter entries present |
| 188 backup-from-110 | backup_110_last_success_timestamp in textfile/Prometheus | last success age < 25h |
| 188 momo-scheduler | docker inspect momo-scheduler and docker logs --since 6h momo-scheduler | container running healthy; all scheduled tasks registered; Google Drive auth works; dashboard URLs use container-reachable hostnames |
| 188 momo import | manual run_auto_import_task() after parser changes | selected sheet is 即時業績明細; imported date range has matching rows in daily_sales_snapshot and realtime_sales_monthly |
| 110 cron | systemctl is-active cron | cron active; Docker/systemd textfile exporters fresh |
| 110 startup units | systemctl --failed | zero failed units; stale momo-startup-complete and wooo-staggered-startup disabled |
| 120 K8s CronJobs | kubectl get cronjobs -n awoooi-prod | unsuspended; no failed Jobs remain after current validation |
| 121 DR drill | crontab -l | DR drill cron present unless explicitly paused |
Useful checks:
ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'
If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.
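The "last success age < 25h" baseline can be evaluated directly from a node-exporter textfile. This is a minimal sketch with hypothetical helper names; it assumes the metric line has the plain `name value` form without a trailing timestamp.

```python
import time

def backup_age_hours(textfile_content,
                     metric="backup_110_last_success_timestamp", now=None):
    """Return the age in hours of the last successful backup, or None if absent."""
    now = time.time() if now is None else now
    for line in textfile_content.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP / TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        # Strip any label set so "metric{...}" still matches the bare name.
        if name.split("{", 1)[0] == metric:
            return (now - float(value)) / 3600.0
    return None

def backup_fresh(age_hours, max_age_hours=25.0):
    """A missing metric counts as not fresh, never as green."""
    return age_hours is not None and age_hours < max_age_hours
```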
10. P2/P3 Stateful Service Guardrails
| Tier | Examples | Automation |
|---|---|---|
| BLOCK | PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB | No automatic destructive action. Human approval only. |
| CRITICAL_HITL | Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse | Human-in-the-loop restart/repair. |
| STANDARD_HITL | API/Web/worker, OpenClaw, litellm | Restart only with evidence and blast-radius check. |
| AUTO | Stateless exporters, blackbox, nginx exporter | Auto restart allowed after verification. |
Never use generic docker restart $(docker ps -q) during cold start.
10.1 Dirty-Reboot Storage Corruption
Treat these log signatures as storage corruption, not ordinary service flakiness:
- Bad message
- Structure needs cleaning
- Unknown codec
- PANIC: could not locate a valid checkpoint record
- Kafka Malformed line in checkpoint files
- ClickHouse broken and needs manual correction
Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns Bad message or Structure needs cleaning, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.
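Evidence collection for these signatures can stay strictly read-only. The sketch below scans saved log text for the strings listed above and reports line numbers; the names are hypothetical, and a substring hit is only a triage hint that still needs a human to confirm the affected component.

```python
STORAGE_CORRUPTION_SIGNATURES = (
    "Bad message",
    "Structure needs cleaning",
    "Unknown codec",
    "could not locate a valid checkpoint record",
    "Malformed line",
    "broken and needs manual correction",
)

def find_storage_corruption(log_text):
    """Return [(line_number, signature)] for every signature found in the log text."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for signature in STORAGE_CORRUPTION_SIGNATURES:
            if signature in line:
                hits.append((number, signature))
    return hits
```

Any hit should route the service to the BLOCK or CRITICAL_HITL handling in the table above rather than to an automatic restart.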
10.2 ClickHouse Clean-Clone Recovery Pattern
Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.
1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.
Do not replace this with rm -rf store/... unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:
/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS
10.3 Kafka Checkpoint Recovery Pattern
If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:
log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint
Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.
11. P3 High-Load Services
Only release these after P0/P1/P2 gates are green:
| Host | Service | Release condition |
|---|---|---|
| 188 | momo-scheduler / crawler | load/core < 1.0 for 15 minutes and DB healthy |
| 188 | SignOz ClickHouse | healthy and merge backlog trending down |
| 188 | litellm | /health/liveliness good and provider route verified |
| 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
| 110 | Sentry uptime-checker | Sentry web/DB healthy |
| 110 | runners | all previous gates green, host_runaway_process.prom fresh, orphan browser group count 0, and load/core < 1.0 for 15 minutes unless the remaining load is explicitly attributed to active CI |
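11.1 110 Runaway Browser / CI Load Triage
The 2026-06-18 CPU saturation incident on 110 showed that a generic HostHighCpuLoad alert only says the host is busy; it cannot tell the operator whether to kill a process. 110 must now use the dedicated host runaway process metrics as the first triage layer: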
grep -E 'awoooi_host_runaway_|awoooi_host_gitea_actions_|awoooi_host_load5_per_core|awoooi_host_swap_used_ratio' \
/home/wooo/node_exporter_textfiles/host_runaway_process.prom
Prometheus must also be able to read the same textfile. The 2026-06-18 14:31-14:32 live scrape confirmed awoooi_host_runaway_process_monitor_up{host="110"}=1, orphan group count 0, active CI container count 2, remediation_authorized=0, and neither the missing nor the orphan alert was firing.
Interpretation:
| Metric combination | Verdict | Action |
|---|---|---|
| awoooi_host_runaway_browser_orphan_group_count > 0 and CPU >= 100 | Orphan headless browser / smoke process group | Run host-runaway-process-remediation.py as a dry-run; gated SIGTERM only after manual confirmation |
| orphan count 0 and awoooi_host_gitea_actions_active_container_count > 0 | Legitimate CI build/test load | Watch the Gitea Actions queue / workflow timeout; do not kill processes |
| awoooi_host_runaway_process_monitor_up missing or stale | Monitoring blind spot | Fix cron / textfile collector / Ansible role; do not claim AI Ops observability |
| awoooi_host_runaway_process_remediation_authorized > 0 | The monitor was mistakenly turned into an executor | Roll back immediately; runtime remediation must go only through the gated helper |
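The triage table can be sketched as a function over the four metric values. This is an illustration only: the function name and return labels are hypothetical, and the evaluation order (authorization first, then blind spot) is an assumption, since the table itself does not rank its rows.

```python
def triage_110_load(monitor_up, orphan_groups, cpu_percent,
                    active_ci_containers, remediation_authorized):
    """Return a triage label for 110 high-load events.

    monitor_up is None when the metric is missing or stale.
    """
    if remediation_authorized > 0:
        return "ROLLBACK_MONITOR_TURNED_EXECUTOR"
    if monitor_up is None or monitor_up != 1:
        return "MONITORING_BLIND_SPOT"
    if orphan_groups > 0 and cpu_percent >= 100:
        return "ORPHAN_BROWSER_DRY_RUN_ONLY"
    if orphan_groups == 0 and active_ci_containers > 0:
        return "LEGITIMATE_CI_LOAD_OBSERVE"
    # No row of the table matches; leave the decision to a human.
    return "NO_MATCH_MANUAL_REVIEW"
```

With the 2026-06-18 scrape values (monitor up, 0 orphan groups, 2 active CI containers, authorization 0) this yields the observe-only CI verdict.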
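Formal PlayBook:
docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md
This PlayBook does not replace the Docker / Sentry / Harbor / K3s / backup SOPs. It only classifies orphan browser smoke processes versus CI load, so that high CPU does not lead to a mistaken Docker restart or to killing a legitimate build.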
12. Baseline And AI Auto-Remediation Gate
12.1 Stable Runtime Baseline
These are release gates after the first cold-start recovery pass:
| Area | Baseline |
|---|---|
| 188 host | PostgreSQL accepting, Redis PONG, momo /health 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers |
| 110 host | Harbor /v2/ 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
| K3s | 120/121 nodes Ready, VIP 192.168.0.125 present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
| Public routes | https://awoooi.wooo.work/api/v1/health 2xx/3xx, https://mo.wooo.work/health 2xx/3xx |
| Guardrails | Docker/systemd/storage/backup/runaway-process textfile exporters fresh, runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0 |
| Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success < 25h |
| Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |
If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.
12.2 AI Auto-Remediation Gate
AI auto-repair can move from observe-only to limited execution only after:
- Prometheus rules are loaded.
- docker/systemd textfile exporter files are fresh.
- runaway process textfile exporter is fresh and remediation_authorized=0.
- blackbox probes have stable results.
- cron/CronJob schedule checks are green.
- AWOOOI API /api/v1/health passes.
- Alertmanager E2E webhook passes.
- Redis/KM/playbook health is available.
- No active restart storm.
- Host load/core remains below 1.0 for 15 minutes.
Until then:
- diagnose only
- notify only
- require human approval for remediation
- no DB/ClickHouse/Harbor/Sentry destructive action
- no generic restart action against stateful services
- no process kill unless host-runaway-process-remediation.py has dry-run evidence plus owner approval, maintenance window, and evidence ref
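The "load/core below 1.0 for 15 minutes" condition recurs across several gates and is easy to misjudge from a single uptime reading. The sketch below shows one way to evaluate it over timestamped samples; the function name and the sample format are hypothetical.

```python
def load_per_core_stable(samples, threshold=1.0, window_seconds=900):
    """Return True when load/core stayed below threshold for the whole window.

    samples: list of (unix_timestamp, load_per_core), sorted by timestamp.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window_seconds
    # Require history reaching back to the start of the window; a short
    # series cannot prove a sustained 15-minute baseline.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value < threshold for value in recent)
```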
13. One-Command Readiness Script
13.1 Single Pass
Run this when you want one read-only snapshot:
bash scripts/reboot-recovery/full-stack-cold-start-check.sh
The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:
- P0-NETWORK
- P0-188-DATA
- P0-110-REGISTRY-OBSERVABILITY
- P1-K3S
- P2-WORKLOAD-ALERTCHAIN
- P2-PUBLIC-ROUTES
- P2-SCHEDULES
- runner guardrail state inside P0-110-REGISTRY-OBSERVABILITY
If it prints BLOCKED, fix the first blocked gate before moving forward.
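"Fix the first blocked gate" depends on the gate order, not on which failure looks loudest. The sketch below makes that rule explicit; it assumes the script output has already been parsed into a gate-to-status mapping, and that parsing step plus the function name are hypothetical.

```python
# Gate order as reported by the script.
GATE_ORDER = [
    "P0-NETWORK",
    "P0-188-DATA",
    "P0-110-REGISTRY-OBSERVABILITY",
    "P1-K3S",
    "P2-WORKLOAD-ALERTCHAIN",
    "P2-PUBLIC-ROUTES",
    "P2-SCHEDULES",
]

def first_blocked_gate(results):
    """Return the earliest gate whose status is BLOCKED, or None if none is."""
    for gate in GATE_ORDER:
        if results.get(gate) == "BLOCKED":
            return gate
    return None
```

If both P1-K3S and P0-188-DATA are blocked, the function returns P0-188-DATA, which matches the rule that K3s cannot recover before its datastore.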
13.2 Professional Watch Mode
Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
--watch \
--interval 60 \
--max-attempts 30 \
--send-alert-test
This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is GREEN or the last attempt remains degraded/blocked.
Use --send-alert-test for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without --send-alert-test, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
13.3 Persistent Read-Only Monitor
After recovery, host 110 should run the same gate as a node-exporter textfile monitor:
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
This command is not read-only. It copies scripts to 110, rewrites the marked wooo crontab block, and immediately refreshes the textfile metric. Run it only inside an approved maintenance window or explicit owner-approved live-sync change.
This installs two scripts under /home/wooo/scripts/, adds a marked user-cron block, and writes:
- /home/wooo/node_exporter_textfiles/cold_start_recovery.prom
- /home/wooo/reboot-recovery/cold-start-last.log
The cron path uses --monitor-read-only, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:
- awoooi_cold_start_monitor_up
- awoooi_cold_start_pass_gates
- awoooi_cold_start_warn_gates
- awoooi_cold_start_blocked_gates
- awoooi_cold_start_last_run_timestamp
- awoooi_cold_start_last_green_timestamp
- awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}
Prometheus rules in ops/monitoring/alerts-unified.yml alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
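For a quick local reading of the textfile without Prometheus, the same conditions can be approximated as follows. This is a hedged sketch, not the alert rules themselves: the helper names are hypothetical, the parser assumes plain `name value` lines, and the staleness threshold is left out because this section does not state it.

```python
def parse_textfile_metrics(text):
    """Parse `name value` lines from a node-exporter textfile into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

def cold_start_alert_reasons(metrics, now, max_green_age_seconds=6 * 3600):
    """Return the reasons the cold-start monitor would be considered unhealthy."""
    reasons = []
    if metrics.get("awoooi_cold_start_monitor_up") != 1:
        reasons.append("monitor missing or down")
    if metrics.get("awoooi_cold_start_blocked_gates", 0) > 0:
        reasons.append("blocked")
    if metrics.get("awoooi_cold_start_warn_gates", 0) > 0:
        reasons.append("degraded")
    last_green = metrics.get("awoooi_cold_start_last_green_timestamp")
    if last_green is None or now - last_green > max_green_age_seconds:
        reasons.append("not green for more than 6 hours")
    return reasons
```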
13.3.1 110 cold-start monitor live-sync gate
Use this gate whenever the repo-side cold-start script changes. This prevents a false-green where repo evidence is newer than the live 110 monitor.
Current read-only evidence, 2026-06-24 23:15 Asia/Taipei:
Repo script hash: f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05
110 live script hash: 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8
verify result: BLOCKED full-stack-cold-start-check.sh hash mismatch
Read-only verification:
bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh
Approved apply path, only after maintenance-window / owner approval:
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
Completion criteria:
- verify-cold-start-monitor-deploy.sh reports hash parity for full-stack-cold-start-check.sh and cold-start-textfile-exporter.sh.
- The live 110 cold-start output includes the expected current fields, including MOMO_SOURCE_EMPTY_EVIDENCE_LINES, MOMO_IMPORT_CONFIG, and MOMO_LATEST_IMPORT_JOB while MOMO data freshness remains blocked by source absence.
- The textfile monitor refreshes without creating alert spam.
- LOGBOOK records local hash, remote hash, command type, approval reference, and final cold-start result.
NO-GO:
- Do not run the installer as part of routine read-only triage.
- Do not call repo-side v1.42 deployed on 110 while the hash mismatch remains.
- Do not patch 110 manually with ad hoc scp; use the existing installer or Ansible source-of-truth path under an approved change.
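The hash parity idea behind this gate can be illustrated with the standard library. This sketch does not reproduce verify-cold-start-monitor-deploy.sh, whose internals are not shown here; the function names are hypothetical, and only the SHA-256 comparison itself is demonstrated.

```python
import hashlib

def sha256_hex(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()

def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large scripts or logs do not load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hash_parity(local_hash, remote_hash):
    """Return PARITY only when both digests are present and identical."""
    if local_hash and remote_hash and local_hash == remote_hash:
        return "PARITY"
    return "BLOCKED hash mismatch"
```

Applied to the two digests recorded in the 2026-06-24 evidence above, `hash_parity` returns the blocked verdict, which is the state that forbids calling the repo version deployed on 110.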
13.4 Script-To-SOP Coverage Map
| Script gate | SOP coverage | Blocks |
|---|---|---|
| P0-NETWORK | host reachability, ARP, SSH | every later phase |
| P0-188-DATA | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| P0-110-REGISTRY-OBSERVABILITY | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| P1-K3S | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| P2-WORKLOAD-ALERTCHAIN | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| P2-PUBLIC-ROUTES | external AWOOOI and momo URLs | external release |
| P2-SCHEDULES | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |
13.5 Next-Reboot Operator Contract
- Run the watch command above.
- If it stops at BLOCKED, repair the first blocked gate and rerun watch mode.
- If it stops at WARN, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
- Release high-load services only after GREEN and load/core stays below 1.0 for 15 minutes.
- Record the final output summary and any manual repair in docs/LOGBOOK.md.
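13.6 2026-05-29 Addendum: 188 Public Gateway and Backup Alerts
The 188 public gateway for aiops.wooo.work must no longer point at the single node 192.168.0.120:31234/31235. When 120 is unreachable, that makes the public route return 502 directly. The formal baseline must go through the K3s VIP: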
location /api/ {
proxy_pass http://192.168.0.125:32334/api/;
}
location /api/v1/ws {
proxy_pass http://192.168.0.125:32334/api/v1/ws;
}
location / {
proxy_pass http://192.168.0.125:32335;
}
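The change source must be infra/ansible/roles/nginx/templates/188-all-sites.conf.j2, converged with infra/ansible/playbooks/nginx-sync.yml. Editing only the live file on 188 without writing the change back to the Ansible baseline is forbidden.
Backup alerting has two layers, and both are required:
- ops/monitoring/alerts-unified.yml is the repo canonical.
- The 110 live /home/wooo/monitoring/alerts.yml and /home/wooo/monitoring/alerts-unified.canonical.yml must be identical; otherwise prometheus-rule-drift-guard may pull the rules back to an old version.
Mandatory checks after a reboot: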
curl -s http://127.0.0.1:9090/api/v1/rules \
| python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'
cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom
If 120 has not recovered yet, BackupConfigCapturePartial{target="120-k3s-host-configs"} and cold-start blocked are correct signals and must not be silenced. After 120 recovers, rerun:
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
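13.7 2026-05-29 Addendum: momo PostgreSQL Index and Data Sync
mo.wooo.work cannot be judged by /health or a 200 on the home page alone. After a reboot or fsck, a PostgreSQL index fault can let the import flow appear to finish while daily_sales_snapshot is never synced into realtime_sales_monthly. Symptoms in this incident:
- daily_sales_snapshot already held 17,353 rows for 2026-05-01 through 2026-05-28.
- realtime_sales_monthly held 0 rows for the same date range.
- The momo-scheduler log showed the PostgreSQL internal error posting list tuple ... cannot be split.
Standard handling order:
# 188 / momo-db: rebuild the index only, do not delete data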
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL
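Only after the index is rebuilt may an idempotent resync be run for the missing dates. The formal procedure must first confirm the row count of realtime_sales_monthly for that date range; if it is not 0, save the query result and confirm whether the same range should be resynced. Do not truncate the whole table and do not restore the whole database. After the resync, verify at least: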
SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
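The row counts and the lower and upper date bounds of the two tables must match for the same date range. When done, clear the momo application cache: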
docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'
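14. Host Power-On, Shutdown, and Reboot SOP
This section is the standard procedure for every power operation involving 110 / 120 / 121 / 188. 112 is Kali; only read-only evidence is kept for it, and it is not part of this recovery round or of routine reboot release.
14.1 Common Red Lines
| Type | Forbidden | Correct handling |
|---|---|---|
| 120 offline | Do not silence ColdStartHost120Unreachable, ColdStartRecoveryBlocked, or the 120 config backup alert | Keep the red light until console/VM recovery is done and the full chain has been rerun |
| Filesystem | Do not run an online fsck on a mounted root filesystem | Repair offline only, from console/rescue/initramfs |
| Backup | Do not claim aggregate backup green from a single successful backup item | Judge jointly by backup-all, the offsite verifier, and the cold-start scorecard |
| Credential | Do not write passwords, tokens, or private keys into the repo, LOGBOOK, or chat | Write only non-secret evidence markers / vault references |
| Stateful data | Do not truncate, DROP, restore the whole database, or bulk-delete volumes | Preserve evidence first; prefer REINDEX TABLE CONCURRENTLY / clean-clone / idempotent resync |
| Automation | Do not release runner/CD/AI full execution while P0/P1 is not green | observe-only; release runner/CD last |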
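14.2 Pre-Shutdown SOP
The goal is to preserve evidence, stop high-load sources, and let stateful services land cleanly.
- Announce the maintenance window and create a reboot record draft in docs/LOGBOOK.md.
- Run the preflight snapshot: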
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
/backup/scripts/backup-status.sh --no-notify
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
- Save host reboot evidence:
for h in 110 120 121; do
ssh wooo@192.168.0.$h 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20; systemctl --failed --no-pager' || true
done
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20; systemctl --failed --no-pager' || true
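- Pause high-load work and automation release:
| Order | Target | Operating principle |
|---|---|---|
| 1 | Gitea / actions runners | Stop new jobs; do not hard-stop mid-build, let finishable jobs end or cancel them manually |
| 2 | AI auto-remediation | Switch to observe-only; no automatic restart of stateful services |
| 3 | momo crawler / scheduler / heavy batch | Pause work that launches Chrome, batch imports, or heavy DB writes |
| 4 | Sentry/Snuba/ClickHouse heavy consumers | Confirm there is no restart storm; controlled stop when necessary |
| 5 | K3s workload | Prefer drain / cordon on reachable nodes; do not pretend a drain completed while 120 is already unreachable |
- Full-fleet shutdown order: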
1. runner/CD and high-load batch
2. AI auto-remediation execution
3. AWOOOI workload layer
4. 121 K3s agent side
5. 120 K3s server side
6. 110 registry / observability, after evidence and backup status are captured
7. 188 data layer last
8. network / UPS / hypervisor last, if applicable
188 must be the last host shut down, because PostgreSQL / Redis / momo DB / K3s datastore are shared dependencies of the other layers.
14.3 Startup SOP
The startup order always follows the dependency chain; do not chase the noisiest alert.
1. Physical network: switch, NIC, ARP, SSH
2. 188 data layer: PostgreSQL, Redis, Docker, momo DB, SignOz dependencies
3. 110 registry / observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry
4. 120 K3s server / VIP path
5. 121 K3s agent / failover path
6. AWOOOI API/Web workload
7. Public routes and Alertmanager E2E
8. Backups, cron, CronJobs, textfile exporters
9. momo scheduler / crawlers and high-load consumers
10. runners/CD
11. AI auto-remediation limited execution
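The dependency chain above can be enforced mechanically. The following is a minimal sketch, not part of the existing scripts: `first_blocked_layer` and the `layer=status` argument format are hypothetical names introduced for illustration. It prints the first layer that is not green so that later layers stay held.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): walk layers in dependency order
# and report the first one that is not green.
first_blocked_layer() {
  local entry
  for entry in "$@"; do
    # Each argument has the form "layer=status", e.g. "188-data=green".
    if [ "${entry#*=}" != "green" ]; then
      printf '%s\n' "${entry%%=*}"   # layer name before the first "="
      return 1
    fi
  done
  printf 'none\n'
}

# Example: 110 is red, so 120 and everything after it must not be released.
first_blocked_layer network=green 188-data=green 110-registry=red 120-k3s=green || true
```

Passing layers in the same order as the numbered list keeps the "do not chase the noisiest alert" rule explicit: a red layer late in the chain is never reported before an earlier red layer.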
After startup, every layer needs live evidence. Minimal acceptance commands:
for h in 110 120 121 188; do
ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done
ssh ollama@192.168.0.188 'systemctl is-active docker postgresql@14-main redis-server nginx || true; pg_isready -h localhost -p 5432 || true; docker ps --format "{{.Names}}\t{{.Status}}" | head -80'
ssh wooo@192.168.0.110 'systemctl is-active docker cron || true; curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true; curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true'
ssh wooo@192.168.0.121 'sudo kubectl get nodes -o wide; sudo kubectl get pods -A | grep -v -E "Running|Completed" || true'
/backup/scripts/backup-status.sh --no-notify
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
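The `nc -G 3` connect-timeout flag used above is, as far as I know, specific to the macOS/BSD build of `nc`. If the probe has to run from a Linux host, a sketch along these lines may serve as a substitute. It assumes bash with `/dev/tcp` support and the coreutils `timeout` command; `probe_tcp` is a hypothetical name, not an existing script.

```bash
#!/usr/bin/env bash
# Hypothetical portable TCP probe: uses bash's /dev/tcp instead of nc -G.
# Assumes coreutils `timeout` is available on the probing host.
probe_tcp() {
  local host=$1 port=$2 secs=${3:-3}
  # Open a TCP connection in a child shell; the socket closes when it exits.
  if timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "TCP_OK $host:$port"
  else
    echo "TCP_FAIL $host:$port"
  fi
}

# Usage, mirroring the SSH check in the loop above:
# for h in 110 120 121 188; do probe_tcp "192.168.0.$h" 22; done
```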
14.4 Single-host reboot SOP
| Host | Pre-reboot conditions | Must-check after reboot | Done criteria |
|---|---|---|---|
| 110 | Not inside a backup-all / rclone / verify window; runner jobs stopped or manually cancelled; 188 healthy | Docker, Harbor, Gitea, Prometheus, Alertmanager, Sentry, cron, textfile exporters, /backup/scripts/backup-status.sh --no-notify | 110 services green; backup status shows no new stale / failed; runner/CD released last |
| 120 | Must be console-first maintenance; if reachable, cordon/drain first; if unreachable, do not claim drain success | power/VM/NIC/boot/initramfs/fsck state, SSH, kubectl get nodes, SchedulingDisabled cleared | 120 ping/SSH OK; mon Ready; backup configs/all/offsite/verify/cold-start chain rerun |
| 121 | 120 / 188 healthy; cordon/drain first when reachable | k3s-agent or live role, VIP state, kubectl get nodes, pod placement | mon1 Ready; VIP / NodePort path OK; no new failed pods in workloads |
| 188 | 110 backup status saved; momo heavy import stopped or deferred; no DB restore / migration in progress | PostgreSQL, Redis, Docker, momo DB parity, SignOz/ClickHouse, cron, backup freshness | DB accepting; momo parity green; 188 backup jobs fresh; high-load services released last |
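14.4.1 110 post-reboot recovery command card
110 is the registry / observability / backup center. After a reboot, check the host and core ports first; do not restart the Docker daemon as the first move.
| Order | Check | Success baseline | On failure |
|---|---|---|---|
| 1 | systemctl is-system-running / failed units / Swap | running; failed units 0 or explainable; Swap not steadily growing | First distinguish stale unit, active service, and storage/network problems |
| 2 | Docker daemon | systemctl is-active docker=active | If Docker is activating, read the journal first; do not restart/kill repeatedly |
| 3 | Harbor / registry | local /v2/ returns 200/401; public registry returns 401 when not logged in | Apply the minimal fix to the failed upstream only; avoid a daemon restart |
| 4 | Gitea / runners | Gitea 200/302; runners released last | Runner jobs must not compete for resources while P0/P1 is not green |
| 5 | Prometheus / Alertmanager | /-/ready and /-/healthy OK; required alerts visible | If alerts are missing, fix rules / drift guard first, then discuss automation |
| 6 | Sentry / Langfuse / Stock / public tools | public 2xx/3xx; containers not in a restart loop | Fix only services with a clear fault; do not rebuild the whole compose stack |
| 7 | backup / offsite | backup-status --no-notify, offsite verifier | When 120 is unreachable, keep the Configs red light |
Minimal 110 post-reboot commands: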
ssh wooo@192.168.0.110 '
date; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true
free -h; swapon --show
systemctl is-active docker cron || true
curl -s -o /dev/null -w "harbor_v2=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
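The `harbor_v2=` and `gitea=` lines printed by the commands above can be checked against the success baselines in the table (Harbor local /v2/ 200/401, Gitea 200/302). The sketch below is illustrative only; `http_code_ok` and `check_codes` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: compare "name=code" lines with the table's baselines.
http_code_ok() {
  case "$1:$2" in
    harbor_v2:200|harbor_v2:401) return 0 ;;  # unauthenticated 401 is expected
    gitea:200|gitea:302)         return 0 ;;
    *)                           return 1 ;;
  esac
}

# Reads lines such as "harbor_v2=401" from stdin and labels each one.
check_codes() {
  local line rc=0
  while IFS= read -r line; do
    if http_code_ok "${line%%=*}" "${line#*=}"; then
      echo "OK $line"
    else
      echo "FAIL $line"
      rc=1
    fi
  done
  return "$rc"
}

# Usage: pipe the relevant lines of the SSH output into check_codes, e.g.
# printf 'harbor_v2=401\ngitea=302\n' | check_codes
```

A failed curl prints `000` with the `-w "%{http_code}"` format, which this sketch reports as FAIL rather than silently skipping.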
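2026-06-12 note: when stockplatform-shared-ui-monitor.timer points at a non-existent legacy path, the stale timer may be disabled to clear the host failed unit. The official source of truth must still be cleaned up afterwards; reset-failed must not be treated as a fix.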
14.4.2 188 post-reboot recovery command card
188 is the data and AI/Web dependency host. Until it recovers, do not release K3s, the AWOOOI API, momo heavy import, or AI auto-remediation.
| Order | Check | Success baseline |
|---|---|---|
| 1 | PostgreSQL | pg_isready accepting; no checkpoint / WAL panic |
| 2 | Redis | PONG |
| 3 | Docker / containerd | active; momo-db / signoz / openclaw / litellm not in a restart loop |
| 4 | momo DB parity | daily_sales_snapshot and realtime_sales_monthly agree on current-month row count and date lower/upper bounds |
| 4a | momo Google Drive token writeback | /home/ollama/momo-pro/config/google_token.json owner matches the Docker userns scheduler UID, mode no wider than 600; never read or print the token contents |
| 4b | momo business data freshness | daily_sales_snapshot latest date lagging 0-2 days is acceptable; lagging 3 days or more is BLOCKED, and full-stack green must not be declared even if the homepage / health / DB parity are all normal |
| 5 | SignOz / monitoring bridge | HTTP 200; ClickHouse not in a repair storm |
| 6 | momo scheduler | container healthy, recent activity pattern > 0; release heavy import only after the DB is green |
| 7 | backup freshness | 188 backup textfile / 110 backup-from-188 freshness OK |
After a 188 reboot, "homepage 200" cannot substitute for DB parity, and DB parity cannot substitute for data freshness. If posting list tuple ... cannot be split appears, use only REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly; and do not truncate or restore the whole database.
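The freshness rule in row 4b (0-2 days acceptable, 3 or more BLOCKED) is simple enough to encode directly. This is a minimal sketch under stated assumptions: `lag_days` and `classify_snapshot_lag` are hypothetical names, GNU `date -d` is assumed, and treating a negative lag as UNKNOWN is my own choice, not something the SOP specifies.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the daily_sales_snapshot freshness gate.

# Whole days between two YYYY-MM-DD dates (requires GNU date).
lag_days() {
  local latest=$1 today=$2
  echo $(( ( $(date -u -d "$today" +%s) - $(date -u -d "$latest" +%s) ) / 86400 ))
}

# Map a lag in days to the verdict used by row 4b.
classify_snapshot_lag() {
  local lag=$1
  if [ "$lag" -ge 0 ] && [ "$lag" -le 2 ]; then
    echo OK
  elif [ "$lag" -ge 3 ]; then
    echo BLOCKED
  else
    echo UNKNOWN   # assumption: a snapshot dated in the future needs a human look
  fi
}

# Usage: the latest snapshot date would come from a read-only DB query.
classify_snapshot_lag "$(lag_days 2026-06-22 2026-06-25)"
```

Keeping the classifier separate from the date arithmetic makes it easy to feed in a lag computed by the database instead.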
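2026-06-25 note: if momo-scheduler logs show Google Drive 認證失敗 (authentication failed) / could not locate runnable browser / Permission denied: 'config/google_token.json', do a metadata-only assessment first and never read the token contents. The latest 10:35 readback shows that both the host path /home/ollama/momo-pro/config/google_token.json and the container-side config/google_token.json are missing, and the scheduler host UID is still 100000; the 2026-06-24 fix conclusion of "change owner/mode only" therefore no longer applies. The minimal safe flow to clear the WARN is: obtain an owner-provided non-secret evidence ref, confirm the maintenance window and rollback owner, recreate or restore the token artifact without pasting the token, check only stat owner:group:mode and the scheduler auth readback, then rerun cold-start. Until this is complete, MOMO health 200 and DB parity cannot substitute for token/writeback evidence.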
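A metadata-only check can be sketched as follows. It assumes GNU `stat`; `token_metadata` and `mode_within_600` are hypothetical helpers, and neither one opens the file, so no token content is read or printed.

```bash
#!/usr/bin/env bash
# Hypothetical metadata-only helpers: they call stat and never read the file.

# Prints "uid:gid:mode" for an existing path, or "missing".
token_metadata() {
  local path=$1
  if [ ! -e "$path" ]; then
    echo missing
    return 1
  fi
  stat -c '%u:%g:%a' "$path"   # GNU stat format: owner uid, group gid, octal mode
}

# Succeeds when the octal mode grants nothing beyond 600 (e.g. 600 or 400).
mode_within_600() {
  local mode=$1
  (( (8#$mode & ~8#600) == 0 ))
}

# Usage on 188; a "missing" result matches the 10:35 readback above:
# token_metadata /home/ollama/momo-pro/config/google_token.json
```

The owner UID printed by `token_metadata` would then be compared with the scheduler host UID (100000 in the readback above) by the operator.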
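14.4.3 120 recovery command card
120 is currently a console-first blocker. While it is unreachable, remote work is limited to evidence collection; do not pretend to repair it.
| State | Assessment | Correct action |
|---|---|---|
| ping / SSH / ARP all fail | host / VM / network layer unknown | Go to the hypervisor / console and confirm power, NIC, and boot screen |
| initramfs / fsck prompt | filesystem repair gate | Handle offline per 120-fsck-maintenance-checklist.sh |
| SSH restored but K3s NotReady | K3s / runtime layer | Check journalctl -u k3s, containerd, and 188 PostgreSQL first, then lift the cordon |
| node Ready but SchedulingDisabled | scheduling state not lifted | After confirming health, run kubectl uncordon mon, then check workloads |
After 120 recovers, kubectl get nodes alone is not enough. The following must be rerun: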
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
14.4.4 121 post-reboot recovery command card
121 is the K3s failover / secondary control-plane path. The core rule after it reboots is: do not let mon1 Ready mask an unreachable mon.
| Check | Success baseline | Note |
|---|---|---|
| SSH / systemd | host ready; failed units explainable | 121 green does not mean 120 green |
| K3s role | kubectl get nodes -o wide readable | If only mon1 is Ready, the cluster is still degraded |
| VIP / NodePort | VIP / public routes reachable | Must confirm the route goes through 192.168.0.125:32334/32335 |
| Cron / DR drill | cron present; DR drill not stopped by mistake | schedule green is part of the cold-start done criteria |
If, after 121 reboots, mon1 is Ready but mon is NotReady,SchedulingDisabled, the conclusion is "121 recovered, cluster still degraded"; do not misreport a healthy 121 as K3s all green.
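The "121 recovered, cluster still degraded" verdict can be derived from node status rather than from eyeballing the output. The following sketch assumes the input is `kubectl get nodes --no-headers` output (NAME in column 1, STATUS in column 2); `cluster_verdict` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get nodes --no-headers` from stdin and
# flag every node whose STATUS is not exactly "Ready".
cluster_verdict() {
  awk '
    $2 != "Ready" { bad = bad " " $1 "=" $2 }
    END {
      if (NR == 0)        print "UNKNOWN no nodes read"
      else if (bad != "") print "DEGRADED" bad
      else                print "ALL_READY"
    }'
}

# Usage on 121:
# sudo kubectl get nodes --no-headers | cluster_verdict
```

Because the comparison is an exact match, `Ready,SchedulingDisabled` is also reported as degraded, which lines up with the 120 card's rule that a cordoned node still needs an explicit uncordon.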
14.5 Record format for every reboot
Every startup, shutdown, and reboot must be appended to docs/LOGBOOK.md, with the necessary status synced to this SOP or the workplan.
## YYYY-MM-DD | Host reboot / shutdown / startup record
Scope:
- Hosts:
- Operation: shutdown / startup / reboot / recovery
- SOP version used:
- Operator:
- Maintenance window:
Pre-check:
- Cold-start scorecard:
- Backup status:
- Offsite verifier:
- Public routes:
- momo DB parity:
- Alertmanager rules / E2E:
- Credential escrow:
Execution:
- Start time:
- End time:
- Commands / console actions:
- Services paused:
- Services released:
Result:
- 110:
- 120:
- 121:
- 188:
- Cold-start scorecard after:
- Backup status after:
- Offsite verifier after:
- DB parity after:
- Alerts after:
Difference versus previous reboot:
- Faster:
- Slower:
- New blocker:
- Repeated blocker:
- False positive / detector tuning:
- SOP change required: yes/no
SOP update:
- Previous version:
- New version:
- Change reason:
- Files updated:
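14.6 SOP version comparison and revision rules
After every reboot, compare against the previous record; do not just write "recovered".
| Comparison item | How to judge |
|---|---|
| Time to SSH | From power-on to SSH OK on each host |
| Time to K3s Ready | From 120/121 boot to nodes Ready |
| Time to public routes | From K3s Ready to public 2xx/3xx |
| Time to backup green | From 110 ready to backup status / offsite verifier green |
| Persistent blockers | Appearing in two or more consecutive reboots makes it an SOP hard gate |
| False positives | For example the momo scheduler detector WARN; state the direct evidence and tuning direction clearly |
| Procedure drift | When live cron, Ansible template, or script path disagrees with the SOP, fix the canonical source first, then the SOP |
Revision rules:
- Updating only the live baseline or percentages: no version bump; update the date and evidence only.
- Adding, removing, or changing the order of operations: bump the minor version, for example v1.4 -> v1.5.
- Anything involving destructive operations, data repair strategy, or human approval boundaries: raise a major-ready review and obtain human approval first.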
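The minor-version bump rule (for example v1.4 -> v1.5) can be sketched as a one-line helper. `bump_minor` is a hypothetical name and assumes the version string has the `vMAJOR.MINOR` shape used by this SOP.

```bash
#!/usr/bin/env bash
# Hypothetical helper: bump the minor part of a "vMAJOR.MINOR" version string.
bump_minor() {
  local v=${1#v}                               # strip the leading "v"
  echo "v${v%%.*}.$(( ${v#*.} + 1 ))"          # keep major, increment minor
}

bump_minor v1.4    # prints v1.5
```

Baseline-only updates would skip this helper entirely, and changes that touch destructive operations still go through human review rather than an automatic bump.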
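14.7 2026-06-06 reboot record comparison anchor
No new reboot was performed on 2026-06-06; this round was a live recovery check. It still serves as the comparison baseline for the next reboot:
| Item | 2026-06-06 baseline |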
|---|---|
| Overall | 65% BLOCKED |
| Cold-start | PASS=71 WARN=3 BLOCKED=3 |
| Remaining hard blocker | 120 ping / SSH / K3s read-only check |
| Backup aggregate | failed=1, Configs only, due 120 config capture |
| Backup freshness | 110 and 188 fresh, no stale jobs |
| Offsite | 13 repos latest-only green |
| Escrow | 5 markers missing |
| momo scheduler | direct healthy; 15:03 scorecard no longer emits scheduler WARN |
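Scorecard lines such as `PASS=71 WARN=3 BLOCKED=3` can be turned into a verdict for side-by-side comparison between anchors. The sketch below is an assumption about how the counts map to a result: BLOCKED and DEGRADED follow the results recorded elsewhere in this SOP, while the GREEN label for an all-clear scorecard and the function name `scorecard_verdict` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive a verdict from a "PASS=n WARN=n BLOCKED=n" line.
scorecard_verdict() {
  local line=$1
  local re='PASS=([0-9]+) WARN=([0-9]+) BLOCKED=([0-9]+)'
  if ! [[ $line =~ $re ]]; then
    echo UNPARSEABLE
    return 2
  fi
  local warn=${BASH_REMATCH[2]} blocked=${BASH_REMATCH[3]}
  if (( blocked > 0 )); then
    echo BLOCKED
  elif (( warn > 0 )); then
    echo DEGRADED
  else
    echo GREEN      # assumed label; the SOP's own declaration ceilings still apply
  fi
}

scorecard_verdict 'PASS=71 WARN=3 BLOCKED=3'   # prints BLOCKED
```

A verdict derived this way covers only the cold-start scorecard; escrow markers and backup gates are judged separately.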
14.8 2026-06-12 post-reboot comparison anchor
After the unplanned reboot of 110 on 2026-06-12, the new comparison anchor for this SOP (v1.5) is:
| Item | 2026-06-12 post-reboot baseline |
|---|---|
| 110 host | systemd running, failed units 0, Swap 0B/7.8GiB |
| 110 service recovery | Harbor / Gitea / Prometheus / Alertmanager / Sentry / Stock / public tools reachable |
| Cold-start | PASS=72 WARN=2 BLOCKED=3 |
| Remaining hard blocker | 120 ping / SSH / K3s read-only check |
| WARN | 120-driven backup aggregate/config component and 120 K3s schedule check |
| Backup freshness | 110 13/13 fresh failed=1, 188 2/2 fresh failed=0, stale none |
| Offsite | 13 repos latest-only green, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1 |
| Alerts | Prometheus and Alertmanager expose all five required backup/cold-start/escrow alerts |
| momo scheduler | scorecard reads SCHEDULER_RECENT_ACTIVITY 1070 after detector fix |
| SOP change | v1.5 adds startup judgment layers, GO/NO-GO tree, host recovery cards, and timeline checks |
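14.9 2026-06-13 post-CD recovery comparison anchor
2026-06-13 was not a host reboot. It is a comparison anchor for verifying whether "120/121 workload balancing + CD known_hosts guardrail" can withstand the next normal deployment.
| Item | 2026-06-14 03:10 baseline |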
|---|---|
| Gitea / ArgoCD | Gitea main 8868c025,deploy marker 7b034b58,ArgoCD revision 8868c025,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 26b67d11f7b7de4f9c9d95c01bb1dacf4000e887 |
| K3s placement | API/Web verified split across mon / mon1 after the latest deploy marker;Worker single replica healthy |
| Cold-start | PASS=81 WARN=2 BLOCKED=0 |
| Public routes | Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | backup-status: 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5,last aggregate 2026-06-14 02:40:22 |
| Offsite | textfile remote_verify_ok=1、full_verify_fresh=1,13 repos each snapshot_count=1 |
| SSH trust | Global known_hosts retained 120 / 188 entries after CD; deploy-specific trust moved to deploy_known_hosts |
| Remaining non-service debt | km-vectorize-29689620 official Job failed with BackoffLimitExceeded; failed Pod/log was deleted before inspection; credential escrow missing count remains 5; 110 has fwupd failed units |
| SOP change | v1.10 changes the first-screen declaration from full green back to degraded, records official km-vectorize failure evidence, and verifies live restartPolicy: Never / FallbackToLogsOnError evidence retention for the next official run |
14.10 2026-06-14 110 failed-unit cleanup comparison anchor
The 2026-06-14 08:24 change was not a host reboot. It removed the non-core fwupd failed-unit noise on 110 from the cold-start verdict. The purpose of this anchor is to keep a future firmware metadata refresh failure from being misjudged as an AWOOOI runtime failure, while preserving a rollback.
| Item | 2026-06-14 08:24 baseline |
|---|---|
| 110 failed units | systemctl --failed returns 0 loaded units listed |
| fwupd policy | fwupd-refresh.timer is disabled / inactive, because a non-core firmware metadata refresh failure should not block AWOOOI service recovery |
| Rollback | To restore the firmware metadata refresh timer, run sudo systemctl enable --now fwupd-refresh.timer and then rerun cold-start |
| Cold-start | PASS=82 WARN=1 BLOCKED=0 |
| Remaining WARN | Only the K8s failed Job km-vectorize-29689620 remains; wait for the next official 03:00 schedule to succeed, or retain the failed Pod/log |
| Backup | 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, escrow_missing=5 |
| Credential escrow | 5 non-secret evidence markers are still missing; do not clear the red light with a placeholder or a secret |
| SOP change | v1.11 changes the 110 failed-unit gate from GREEN_WITH_FWUPD_WARNING to GREEN_WITH_FWUPD_TIMER_DISABLED and fixes the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.11 2026-06-14 post-CD recovery readback
2026-06-14 08:40 的變更不是主機重啟,而是確認 latest CD deploy marker 沒有讓重啟恢復狀態倒退。這個錨點用來比較「治理 / 前端 / API CD 後,cold-start SOP 是否仍成立」。
| 項目 | 2026-06-14 08:40 post-CD baseline |
|---|---|
| Gitea / ArgoCD | Gitea main 18b867c3,ArgoCD revision 18b867c3,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag e0a6d339669fc635357d36ea94215df25e652fa9 |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi、restartPolicy: Never、terminationMessagePolicy: FallbackToLogsOnError、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon |
| Cold-start | PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.12 records the post-CD no-regression readback and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.12 2026-06-14 P2-135 deploy 後 recovery readback
2026-06-14 09:27 的變更不是主機重啟,而是確認 P2-135 deploy 與正式驗證後,reboot recovery baseline 沒有倒退。這個錨點也記錄 stockplatform-v2 rollout warmup 期間短暫 502 的判定方式:直接重查 route / TLS,並重跑完整 cold-start;只有重跑仍失敗才升級成 persistent public route blocker。
| 項目 | 2026-06-14 09:27 post-P2-135 baseline |
|---|---|
| Gitea / ArgoCD | Gitea main 5bad267e,ArgoCD revision 5bad267e,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 280e0fbef0d5dccb10f1efe2cc18cf423544254e |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi、restartPolicy: Never、terminationMessagePolicy: FallbackToLogsOnError、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon1 |
| First cold-start | 09:26 first run saw stock.wooo.work 502 while stockplatform-v2 containers were less than one minute old; direct route and TLS recheck returned 200 |
| Final cold-start | 09:27 rerun returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.13 records the P2-135 post-deploy no-regression readback and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
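The 14.12 rule for a short 502 during rollout warmup (recheck the route, and escalate only if the recheck still fails) can be sketched as a small classifier. `is_ok_code` and `classify_route` are hypothetical names, and the verdict labels are illustrative; the SOP additionally requires a full cold-start rerun before the final scorecard is accepted.

```bash
#!/usr/bin/env bash
# Hypothetical classifier for a public route checked twice.
is_ok_code() {
  case "$1" in
    2??|3??) return 0 ;;   # public 2xx/3xx counts as reachable
    *)       return 1 ;;
  esac
}

# $1 = HTTP code from the first run, $2 = HTTP code from the direct recheck.
classify_route() {
  local first=$1 recheck=$2
  if is_ok_code "$first"; then
    echo OK
  elif is_ok_code "$recheck"; then
    echo WARMUP_TRANSIENT      # record as evidence, then rerun cold-start
  else
    echo PERSISTENT_BLOCKER    # still failing after recheck: escalate
  fi
}

classify_route 502 200   # prints WARMUP_TRANSIENT
```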
14.13 2026-06-14 P2-136 / AI Agent 活動正式部署後 recovery readback
2026-06-14 09:56 的變更不是主機重啟,而是確認 P2-136 / AI Agent 活動正式部署後,reboot recovery baseline 仍沒有倒退。這個錨點特別記錄 deploy marker、ArgoCD revision、live image 與 cold-start scorecard 必須一起看,避免只看 gitea/main 或 CD 成功就誤報 full-stack green。
| 項目 | 2026-06-14 09:56 post-P2-136 baseline |
|---|---|
| Gitea / ArgoCD | 本 recovery commit 前最新文件 head a0fe7741;runtime deploy marker 60a0415c chore(cd): deploy a3de0ff [skip ci],ArgoCD revision 60a0415c,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag a3de0ffb8275b6838604b6dff87cd978b8e91122 |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi、restartPolicy: Never、terminationMessagePolicy: FallbackToLogsOnError、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z;failed Job km-vectorize-29689620 remains retained |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon1 |
| Cold-start | 09:56 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.14 records the P2-136 / AI Agent 活動正式部署後 no-regression readback and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.14 2026-06-14 P2-137 / CI smoke timeout 修正後 recovery readback
2026-06-14 10:40 的變更不是主機重啟,而是確認 P2-137 正式部署與 BusyBox timeout smoke 修正後,reboot recovery baseline 仍沒有倒退。這個錨點只記錄 recovery / cold-start readback,不重複 P2-137 正式驗證內容。
| 項目 | 2026-06-14 10:40 post-P2-137 baseline |
|---|---|
| Gitea / ArgoCD | 本 recovery commit 前最新文件 head 50d4f2ba;runtime deploy marker d023f5d7 chore(cd): deploy f737f27 [skip ci],ArgoCD revision 50d4f2ba,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag f737f278dc14372ff1fb15b124b1370c20e1bb99 |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi、restartPolicy: Never、terminationMessagePolicy: FallbackToLogsOnError、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z;failed Job km-vectorize-29689620 remains retained |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon |
| Cold-start | 10:40 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.15 記錄 P2-137 / CI smoke timeout 修正後 no-regression readback,並維持宣告上限為 SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.15 2026-06-14 P2-143 owner response 預檢後 recovery readback
2026-06-14 15:00 的變更不是主機重啟,而是確認 P2-143 owner response 預檢與拒收邊界正式部署後,reboot recovery baseline 仍沒有倒退。這個錨點只記錄 recovery / cold-start readback,不重複 P2-142 / P2-143 正式驗證內容,也不把 owner response preflight 視為 runtime 授權。
| 項目 | 2026-06-14 15:00 post-P2-143 baseline |
|---|---|
| Gitea / ArgoCD | 最新文件基準 b09eb1c6 docs(ai): 校準 P2-143 正式驗證紀錄;runtime deploy marker 667d6329 chore(cd): deploy 755b0a8 [skip ci];ArgoCD revision 4abf0c0f750254d3c7137eae049abdfd99630f5f,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 755b0a8d3038df2c52dee280067863d92db1eda5 |
| CronJob readback | km-vectorize schedule 0 3 * * *、timeZone=Asia/Taipei、suspend=false、failedJobsHistoryLimit=3、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z;failed Job km-vectorize-29689620 仍保留,但目前沒有可讀的 failed Pod / log |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon |
| Cold-start | 15:00 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | 最終 scorecard 已驗證 awoooi API/Web、momo、gitea、harbor、registry、sentry、signoz、stock、langfuse、bitan、aiops 皆可透過 TLS 回應 |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| P2-143 API boundary | Production endpoint 回 current P2-143、next P2-144、completion 100,且 reviewer / Gateway queue、Telegram、Bot API、result capture、learning、PlayBook trust、production write、secret read、destructive operation 全部維持 0 / false |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.16 記錄 P2-143 owner response 預檢後 no-regression readback,並維持宣告上限為 SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.16 2026-06-14 P2-144 owner response 回讀後 recovery readback
2026-06-14 15:58 的變更不是主機重啟,而是確認 P2-144 owner response 回讀狀態與後續 deploy marker 180a6543 正式部署後,reboot recovery baseline 仍沒有倒退。這個錨點只記錄 recovery / cold-start readback,不重複 P2-144 正式驗證內容,也不把 owner response readback 視為 runtime 授權、正式收件或 owner acceptance。
| 項目 | 2026-06-14 15:58 post-P2-144 baseline |
|---|---|
| Gitea / ArgoCD | gitea/main 已前進至 180a6543 chore(cd): deploy fef94df [skip ci];ArgoCD source revision 180a6543eaf26dd6b345d45114316926056a965a,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag fef94df877c5438f9f34ddbcace8ad8112a141ef |
| CronJob readback | km-vectorize schedule 0 3 * * *、timeZone=Asia/Taipei、suspend=false、failedJobsHistoryLimit=3、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z;failed Job km-vectorize-29689620 仍保留,但目前沒有可讀的 failed Pod / log |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon1 |
| Cold-start | 15:58 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | 最終 scorecard 已驗證 awoooi API/Web、momo、gitea、harbor、registry、sentry、signoz、stock、langfuse、bitan、aiops 皆可透過 TLS 回應 |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| P2-144 API boundary | Production endpoint 回 current P2-144、next P2-145、completion 100,且 owner response received / accepted / rejected、reviewer / Gateway queue、Telegram、Bot API、result capture、learning、PlayBook trust、production write、secret read、destructive operation 全部維持 0 / false |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.17 記錄 P2-144 owner response 回讀後 no-regression readback,並維持宣告上限為 SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.17 2026-06-14 P2-145 owner response 驗收門檻後 recovery readback
2026-06-14 16:29 的變更不是主機重啟,而是確認 P2-145 owner response 驗收門檻正式部署後,reboot recovery baseline 仍沒有倒退。這個錨點只記錄 recovery / cold-start readback,不重複 P2-145 正式驗證內容,也不把 acceptance gate 視為 owner response received / accepted、runtime 授權或正式寫入。
| 項目 | 2026-06-14 16:29 post-P2-145 baseline |
|---|---|
| Gitea / ArgoCD | 最新文件基準 06fe0a8f docs(logbook): 記錄 P2-145 正式驗證 [skip ci];runtime deploy marker 36fbfc6b chore(cd): deploy 386dbd0 [skip ci];ArgoCD source revision 06fe0a8f14167824fea512f942d2569431bbcbc8,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 386dbd078ef63401d9736048463f4ef5326442d9 |
| CronJob readback | km-vectorize schedule 0 3 * * *、timeZone=Asia/Taipei、suspend=false、failedJobsHistoryLimit=3、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z;failed Job km-vectorize-29689620 仍為 Failed 0/1 |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon |
| Cold-start | 16:29 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | 最終 scorecard 已驗證 awoooi API/Web、momo、gitea、harbor、registry、sentry、signoz、stock、langfuse、bitan、aiops 皆可透過 TLS 回應 |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| P2-145 API boundary | Production endpoint 回 current P2-145、next P2-146、completion 100,且 owner response received / accepted / rejected、reviewer / Gateway queue、Telegram、Bot API、result capture、learning、PlayBook trust、production write、secret read、destructive operation 全部維持 0 / false |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.18 記錄 P2-145 owner response 驗收門檻後 no-regression readback,並維持宣告上限為 SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.18 2026-06-14 IwoooS P0 配置控管優先序後 recovery readback
2026-06-14 17:04 的變更不是主機重啟,而是確認 IwoooS P0 配置控管優先序正式部署後,reboot recovery baseline 仍沒有倒退。這個錨點只記錄 recovery / cold-start readback,不重複 P0 配置控管正式驗證內容,也不把前台看板可見視為 Nginx reload、DNS / TLS probe、certbot renew、workflow / secret 修改、public route change 或 runtime gate。
| 項目 | 2026-06-14 17:04 post-IwoooS-P0-config baseline |
|---|---|
| Gitea / ArgoCD | 最新文件基準 af62ec1f docs(iwooos): 記錄 P0 配置控管正式驗證 [skip ci];runtime deploy marker ed651a98 chore(cd): deploy e992af8 [skip ci];ArgoCD source revision af62ec1fe72b3e84e179d80e788e5a5902bdaf27,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag e992af89955f8aae40a383b2f2e2f645445a690d |
| CronJob readback | km-vectorize schedule 0 3 * * *、timeZone=Asia/Taipei、suspend=false、failedJobsHistoryLimit=3、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z;failed Job km-vectorize-29689620 仍為 Failed 0/1 |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon1 |
| Cold-start | 17:04 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | 最終 scorecard 已驗證 awoooi API/Web、momo、gitea、harbor、registry、sentry、signoz、stock、langfuse、bitan、aiops 皆可透過 TLS 回應;IwoooS route /zh-TW/iwooos 額外 readback 回 200 |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| IwoooS boundary | P0 配置控管優先序已可見,但 live evidence received、runtime gate、Nginx live config、DNS / TLS probe、certbot renew、workflow / secret 修改、public route change、production write 仍不得從本 readback 推定為已授權 |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.19 記錄 IwoooS P0 配置控管優先序後 no-regression readback,並維持宣告上限為 SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.20 2026-06-15 km-vectorize official success readback
2026-06-15 03:11 的變更不是主機重啟,而是確認 km-vectorize 官方 03:00 排程成功,並把 ArgoCD fully healthy gate 關閉。這個錨點只記錄 recovery / cold-start readback,不手動刪 Job、不手動建立 Job、不 kubectl patch live、不重啟服務,也不把任何 backup / restore / escrow owner acceptance ledger 視為 backup run、restore run、credential escrow marker write、host write 或 production write 授權。
| 項目 | 2026-06-15 03:11 km-vectorize official success baseline |
|---|---|
| ArgoCD | awoooi-prod sync Synced,health Healthy,revision d388e5b477333fd5e661527a729406a4e8215320 |
| CronJob readback | km-vectorize schedule 0 3 * * *、timeZone=Asia/Taipei、suspend=false、lastScheduleTime=2026-06-14T19:00:00Z、lastSuccessfulTime=2026-06-14T19:00:55Z |
| Job / Pod / log | Job km-vectorize-29691060 Complete,Pod km-vectorize-29691060-78xpz Completed restart 0,log embed-all: 200 {"total":31,"success":31,"failed":0} |
| Cold-start | 03:11 returned PASS=81 WARN=2 BLOCKED=0,result DEGRADED |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,last aggregate 2026-06-15 02:40:13 |
| Escrow | ESCROW_MISSING_COUNT=5,缺 restic_repository_password、offsite_provider_credentials、break_glass_admin_credentials、dns_registrar_recovery、oauth_ai_provider_recovery |
| Remaining warnings | 188 momo scheduler registration/activity 未確認;K8s 仍保留舊 failed Job evidence |
| SOP change | v1.21 關閉 km-vectorize official success gate,但宣告上限仍是 SERVICE_AVAILABLE_ARGOCD_HEALTHY_DR_ESCROW_BLOCKED;不可宣稱 full-stack green 或 DR complete |
14.19 2026-06-14 高價值配置 Owner Packet 前台同步後 recovery readback
2026-06-14 18:15 的變更不是主機重啟,而是確認高價值配置 Owner Packet 前台同步正式部署後,reboot recovery baseline 仍沒有倒退。這個錨點只記錄 recovery / cold-start readback,不重複 Owner Packet 前台正式驗證、posture projection 或 intake preflight 內容,也不把前台草案可見視為 request sent、owner response received / accepted、runtime gate、Nginx reload、DNS / TLS probe、certbot renew、workflow / secret 修改、host write、active scan 或 production write。
| 項目 | 2026-06-14 18:15 post-owner-packet-frontend baseline |
|---|---|
| Gitea / ArgoCD | 最新 repo 文件基準 0a4766dd docs(security): 新增高價值配置 owner request 草稿包 [skip ci];runtime deploy marker 16c6b983 chore(cd): deploy e999c16 [skip ci];feature commit e999c16b fix(iwooos): 同步高價值配置 owner packet 前台;ArgoCD source revision 0a4766ddc94b0690824ce3deba5c6b9a69764f94,sync Synced,health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag e999c16b3435f197b78fe2adfeec1c4faa6c4675 |
| CronJob readback | km-vectorize schedule 0 3 * * *、timeZone=Asia/Taipei、suspend=false、failedJobsHistoryLimit=3、lastScheduleTime=2026-06-13T19:00:00Z、lastSuccessfulTime=2026-06-04T11:00:37Z;failed Job km-vectorize-29689620 仍為 Failed 0/1 |
| K3s placement | API pods split mon / mon1,Web pods split mon / mon1,Worker single replica on mon |
| Cold-start | 18:15 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | 最終 scorecard 已驗證 awoooi API/Web、momo、gitea、harbor、registry、sentry、signoz、stock、langfuse、bitan、aiops 皆可透過 TLS 回應;IwoooS route /zh-TW/iwooos 與 AwoooP route /zh-TW/awooop 額外 readback 皆回 200 |
| Backup | 110 13/13 fresh failed=0,188 2/2 fresh failed=0,core_blockers=0,escrow_missing=5 |
| 110 host | systemctl --failed 回 0 loaded units listed;fwupd-refresh.timer 維持 disabled / inactive |
| Owner Packet boundary | Owner Packet 前台數字已可見,但 request sent、owner response received / accepted / rejected、reviewer queue write、live evidence、runtime gate、Nginx live config、DNS / TLS probe、certbot renew、workflow / secret 修改、host write、active scan、production write 仍不得從本 readback 推定為已授權 |
| Remaining gate | km-vectorize-29689620 official Job 仍 failed;Credential escrow missing count 仍 5 |
| SOP change | v1.20 記錄高價值配置 Owner Packet 前台同步後 no-regression readback,並維持宣告上限為 SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
14.21 2026-06-18 Plan B degraded-operation path
The 2026-06-18 change was neither a host reboot nor a new live recovery readback. It wrote the Plan B requested by the commander into an executable SOP. This anchor is used at the next reboot to check whether §1.4 was followed: decide Plan A / Plan B first, then the degradation level, the stop lines, and the conditions for returning to Plan A.
| Item | 2026-06-18 Plan B baseline |
|---|---|
| SOP version | v1.22 |
| Plan B trigger | backup/offsite/verifier running, P0 host unreachable for 15 minutes, 188 data unhealthy, 110 registry / observability unhealthy, single-node K3s degraded, route-only green, cold-start WARN, credential escrow missing |
| Service levels | B0_ABORTED_BEFORE_REBOOT, B1_HOST_RECOVERY_ONLY, B2_CORE_SERVICE_READY, B3_SERVICE_AVAILABLE_DEGRADED, B4_FULL_STACK_GREEN, B5_DR_COMPLETE |
| Host fallback paths | 110 / 120 / 121 / 188 / K3s / Public gateway each have a degraded path and conditions for returning to Plan A |
| Timeline | T+0 freeze, T+5 host boot, T+15 data / registry stop-line, T+30 route-only guard, T+60 cold-start scorecard, T+120 incident / follow-up |
| Closeout states | RETURNED_TO_PLAN_A, SERVICE_AVAILABLE_DEGRADED, OPEN_INCIDENT_REQUIRED |
| SOP change | v1.22 adds Plan B; Plan B must not be treated as runtime write authorization, and documenting Plan B is not grounds for declaring a new service green, full-stack green, or DR complete |
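The Plan B timeline can be expressed as a lookup from elapsed minutes to the most recent checkpoint reached. This is only a sketch: `planb_checkpoint` is a hypothetical name, the checkpoint labels are copied from the Timeline row, and the minute units are assumed from the "T+5 host boot" wording.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map minutes since T+0 to the latest Plan B checkpoint.
planb_checkpoint() {
  local m=$1
  if   (( m >= 120 )); then echo "T+120 incident / follow-up"
  elif (( m >= 60 ));  then echo "T+60 cold-start scorecard"
  elif (( m >= 30 ));  then echo "T+30 route-only guard"
  elif (( m >= 15 ));  then echo "T+15 data / registry stop-line"
  elif (( m >= 5 ));   then echo "T+5 host boot"
  else                      echo "T+0 freeze"
  fi
}

planb_checkpoint 20   # prints T+15 data / registry stop-line
```

The helper only reports which checkpoint applies; whether the stop line has actually been met still depends on the live evidence for that layer.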
14.23 2026-06-18 repo-side readiness audit blocker closure
The second 2026-06-18 change is neither a live recovery nor a host reboot; it converges the repo-side hard blockers from the previous readiness audit into verifiable contracts. This anchor means "the reboot SOP / baseline / scripts / Ansible source-of-truth / Gitea workflow contract pass the readiness audit inside the repo"; it does not mean the live hosts were re-verified that day.
| Item | 2026-06-18 repo-side readiness baseline |
|---|---|
| SOP version | v1.23 |
| Cold-start gate | full-stack-cold-start-check.sh adds NODE_FS_ERROR_EVENTS; when 120 / K3s node events show filesystem / fsck / read-only / I/O evidence, K3s must not be declared safe |
| Backup contract | backup-awoooi.sh drops the service-level direct offsite sync; offsite publishing goes only through the central sync-offsite-backups.sh / verifier gate |
| Ansible 110 source-of-truth | 110-devops.yml now includes the cold-start monitor, runner guardrails, host textfile exporters, backup scripts, daily backup heartbeat, offsite evidence report, and offsite full-sync verifier |
| Ansible 188 source-of-truth | 188-ai-web.yml now includes textfile exporters and pins the momo PostgreSQL backup entrypoint to the host-owned /home/ollama/bin/momo-pg-backup.sh |
| Nginx source-of-truth | nginx-sync.yml now includes the 188-internal-tools-https.conf.j2 route sync |
| CI / workflow contract | .gitea/workflows/ansible-lint.yml switched to self-hosted validation; triggers cover Ansible, ops baseline, monitoring rules, backup scripts, reboot scripts, docs, and the workflow itself |
| Validation toolchain | bootstrap-ansible-validation-env.sh prefers Python 3.11 / 3.10 when building the pinned validation venv; ansible-validate.sh pins the repo roles path and uses the minimum lint profile to guard syntax / loader readiness |
| Repo-side readiness audit | PASS=185 WARN=1 BLOCKED=0, result READY WITH WARNINGS; the only warning is that --live was not run |
| Declaration limit | May declare REPO_SIDE_REBOOT_READINESS_READY_WITH_LIVE_CHECK_REQUIRED; must not declare FULL_STACK_GREEN, DR_COMPLETE, or live service recovery complete |
14.24 2026-06-18 live cold-start readback after repo-side closure
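The 2026-06-18 12:13-12:17 readback is a same-day live verification after the repo-side readiness closure. It is neither a host reboot nor a runtime repair; its purpose is to keep "the mechanism is complete" separate from "the current live state" and so avoid false-green.
| Item | 2026-06-18 12:17 live baseline |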
|---|---|
| SOP version | v1.24 |
| Cold-start read-only result | PASS=83 WARN=1 BLOCKED=0,result DEGRADED |
| Host reachability | 110 / 120 / 121 / 188 ping OK and SSH port OK |
| K3s | mon / mon1 Ready control-plane;VIP 192.168.0.125 present on 120;NODE_FS_ERROR_EVENTS 0 |
| 110 / 188 service checks | 110 Harbor / Gitea / Prometheus / Alertmanager / Sentry reachable;188 PostgreSQL / Redis / momo / SigNoz reachable |
| Backup health | 110 backup health total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0;188 backup health total=2 stale=0 |
| Public route / TLS | awoooi API/Web、mo、momo health、Gitea、Harbor、registry、Sentry、SigNoz、stock、Langfuse、Bitan、aiops all 2xx/3xx with TLS verified |
| AWOOOI rollout convergence | After transient 12:14 startup window, final readback shows API 2/2, Web 2/2, Worker 1/1, Canary 1/1, API health 200 healthy |
| Remaining warning | retained stale Job km-vectorize-29689620 from 2026-06-14 03:00; later official Jobs km-vectorize-29692500, 29693940, 29695380 are Complete |
| Declaration limit | May declare SERVICE_AVAILABLE_DEGRADED; must not declare FULL_STACK_GREEN because WARN=1; must not declare DR_COMPLETE because credential escrow evidence still requires real non-secret owner evidence |
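The relation between the scorecard counters, the result label, and the declaration ceiling can be sketched as follows. This is a minimal illustration and not the logic of full-stack-cold-start-check.sh: the result labels follow the readbacks recorded in this SOP, while the `Scorecard` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    passed: int
    warn: int
    blocked: int


def scorecard_result(card: Scorecard) -> str:
    """Map counters to the result labels used in the readback tables."""
    if card.blocked > 0:
        return "BLOCKED"
    if card.warn > 0:
        return "DEGRADED"
    return "GREEN"


def forbidden_declarations(card: Scorecard, escrow_missing: int) -> list[str]:
    """List declarations that these counters rule out (illustrative only)."""
    forbidden = []
    if card.warn > 0 or card.blocked > 0:
        forbidden.append("FULL_STACK_GREEN")
    if escrow_missing > 0:
        forbidden.append("DR_COMPLETE")
    return forbidden
```

An empty list does not authorize any declaration by itself; it only means these two inputs raise no objection.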
14.25 2026-06-18 stale failed Job classification and service-green readback
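The 2026-06-18 13:43 change neither deletes a K8s Job nor creates one manually; it corrects the cold-start decision logic. A retained historical failed Job is evidence; only a failed Job with no later official successful Job is an active blocker. Evidence retention and service readiness therefore no longer conflict.
| Item | 2026-06-18 13:43 stale Job classification baseline |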
|---|---|
| SOP version | v1.25 |
| Script change | full-stack-cold-start-check.sh emits FAILED_JOBS, STALE_FAILED_JOBS, and ACTIVE_FAILED_JOBS |
| Active blocker rule | ACTIVE_FAILED_JOBS > 0 causes warning; STALE_FAILED_JOBS > 0 is retained evidence and does not warn by itself |
| Readiness audit contract | reboot-recovery-readiness-audit.sh requires both stale and active failed Job counters |
| Repo-side validation | bash -n passed; readiness audit returned PASS=187 WARN=1 BLOCKED=0 with only the expected non-live warning |
| 110 live script sync | /home/wooo/scripts/full-stack-cold-start-check.sh hash b48af9c603aa5a1a4f9434d6cc510398bbecc2e46400a21410e735d5f7d177c4; previous version backed up to /home/wooo/scripts/full-stack-cold-start-check.sh.before-stale-active-job-classification.20260618-135516 |
| Live cold-start readback | PASS=84 WARN=0 BLOCKED=0, result GREEN |
| K8s Job evidence | FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0, BAD_PODS=0 |
| Backup / DR evidence | 110 backup health 13/13 fresh failed=0; 188 backup health 2/2 fresh failed=0; escrow readback still ESCROW_MISSING_COUNT=5 |
| Declaration limit | May declare FULL_STACK_GREEN_FOR_SERVICE; must not declare DR_COMPLETE, credential escrow complete, or any runtime/security acceptance |
| SOP change | v1.25 defines retained failed Job evidence vs active failed Job blocker; future reboot comparison must record all three counters |
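The stale-versus-active rule above can be sketched as a small classifier. This is an illustration under stated assumptions, not the code in full-stack-cold-start-check.sh: `JobRecord`, its fields, and the "same CronJob, later start, Complete" test for a superseding Job are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    name: str
    cronjob: str      # hypothetical field: name of the owning CronJob
    started_at: str   # ISO 8601 in one timezone, so string order is time order
    status: str       # "Complete" or "Failed"


def classify_failed_jobs(jobs: list[JobRecord]) -> dict[str, int]:
    """Count failed Jobs and split them into stale evidence vs active blockers."""
    failed = [job for job in jobs if job.status == "Failed"]
    stale = 0
    for bad in failed:
        # Stale: the same CronJob produced a later Job that completed.
        superseded = any(
            other.status == "Complete"
            and other.cronjob == bad.cronjob
            and other.started_at > bad.started_at
            for other in jobs
        )
        if superseded:
            stale += 1
    return {
        "FAILED_JOBS": len(failed),
        "STALE_FAILED_JOBS": stale,
        "ACTIVE_FAILED_JOBS": len(failed) - stale,
    }


def failed_job_warning(counters: dict[str, int]) -> bool:
    """Only active failed Jobs warn; stale ones are retained evidence."""
    return counters["ACTIVE_FAILED_JOBS"] > 0
```

Keeping all three counters in the output matches the requirement that future reboot comparisons record them together.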
14.26 2026-06-24 heartbeat noise / MOMO detector / rollout false-negative closure
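The 2026-06-24 change is not a host reboot; it closes two false signals in the reboot SOP. The normal Telegram heartbeat no longer floods the channel every 30 minutes, and the MOMO scheduler / current-month parity detector no longer raises a false WARN because of an old log pattern or the wrong DB exec user. This anchor also records a CD rollout false-negative: the worker startup probe timed out on the first attempt and the container restarted once; K8s ended up ready, but Gitea CD #3289 was marked Failure because rollout status timed out.
| Item | 2026-06-24 live baseline |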
|---|---|
| SOP version | v1.27 |
| Heartbeat code | a84a5a0b fix(api): suppress healthy Telegram heartbeat noise |
| Deploy marker | 4a7b5329 chore(cd): deploy a84a5a0 [skip ci] |
| Production image readback | API/Web/Worker image tag a84a5a0bc4a672ac6feb95a85ac590aa2dd4bb71 |
| Production rollout | API 2/2、Web 2/2、Worker 1/1 Ready |
| CD result caveat | Gitea CD #3289 shows Failure because worker rollout status timed out before old replica convergence; K8s deploy marker and production readiness are green |
| Healthy heartbeat rule | When status=healthy and there are no warnings, only the suppression marker / log is updated and no Telegram message is sent; warnings and recovery can still be sent |
| Live temporary suppression | Redis keys heartbeat:silent_last_sent and heartbeat:healthy_suppressed_last_seen set with 24h TTL during deployment; no token or secret printed |
| 110 live script sync | /home/wooo/scripts/full-stack-cold-start-check.sh hash 47e67d0c018f741acfba17a93cb1d668779bd08745902099a10ee61e73ea55b6; previous version backed up to /home/wooo/scripts/full-stack-cold-start-check.sh.before-momo-detector-20260624-020759 |
| MOMO scheduler evidence | SCHEDULER_CONTAINER_RUNNING true、SCHEDULER_CONTAINER_HEALTH healthy、SCHEDULER_RECENT_ACTIVITY 1303 |
| MOMO DB parity evidence | `MOMO_MONTHLY_SYNC 10936 |
| K3s node evidence | NODE_FS_ERROR_EVENTS 0、NODE_READONLY_FILESYSTEM_TRUE 0、NODE_DISK_PRESSURE_TRUE 0、VIP 192.168.0.125 present |
| Live cold-start readback | PASS=85 WARN=0 BLOCKED=0, result GREEN |
| Declaration limit | May declare current service recovery scorecard green; must not declare DR_COMPLETE, because credential escrow evidence missing remains 5 |
| SOP change | v1.27 requires heartbeat success-message suppression, MOMO detector parity using app-provided DB env, and rollout false-negative classification before retrying CD |
For Worker / CronJob / queue services whose startup can exceed the startup probe, a single failed rollout status --timeout=60s is not enough to judge production down. Check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health together. If the pod ends up ready but CD is red, this is CI timeout / probe tuning work, not a service restart incident; the follow-up is to adjust the startup probe or the post-deploy timeout.
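The rollout false-negative classification can be sketched as below. This is a hedged example, not part of the CD pipeline: `RolloutEvidence`, its field names, and the three return labels are hypothetical, and each boolean is assumed to come from a separate read-only check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutEvidence:
    cd_run_succeeded: bool
    deploy_marker_written: bool
    image_tag_matches: bool
    pods_ready: bool
    service_health_ok: bool
    public_route_ok: bool
    container_restarts: int  # recorded as evidence; not decisive on its own


def classify_rollout(ev: RolloutEvidence) -> str:
    """Separate a red CD run from an actual production problem."""
    runtime_ok = all(
        (
            ev.deploy_marker_written,
            ev.image_tag_matches,
            ev.pods_ready,
            ev.service_health_ok,
            ev.public_route_ok,
        )
    )
    if runtime_ok and ev.cd_run_succeeded:
        return "ROLLOUT_GREEN"
    if runtime_ok:
        # CD is red but production converged: probe / timeout tuning work.
        return "CD_FALSE_NEGATIVE"
    return "RUNTIME_NOT_CONVERGED"
```

A `CD_FALSE_NEGATIVE` result argues against retrying CD blindly; the historical run stays recorded as Failure.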
2026-06-24 02:44 addendum: the 02:08 PASS=85 WARN=0 BLOCKED=0 result in this section has been superseded by the MOMO data freshness gate in §14.28; it must no longer be cited to declare full-stack green.
14.27 2026-06-24 188 node-exporter / backup health alert closure
The second 2026-06-24 change restores the 188 node-exporter textfile scrape. Both backup-status and cold-start could read a fresh 188 backup_health.prom over SSH, but with the Prometheus node-exporter-188 scrape down, BackupHealthMonitorMissing188 correctly fired. In this situation the alert must not be silenced; the exporter must be restored.
| Item | 2026-06-24 188 exporter baseline |
|---|---|
| SOP version | v1.28 |
| Root cause | 188 9100 connection refused;node_exporter / prometheus-node-exporter unit absent/inactive;Prometheus could not scrape backup_health.prom |
| False start | Mounting /home/ollama/node_exporter_textfiles via /host/home/ollama/... failed because /home/ollama is 750 and textfile collector saw permission denied |
| Live restore | Docker container node-exporter uses quay.io/prometheus/node-exporter:v1.8.2, restart=unless-stopped, -p 9100:9100, rootfs mount /host, direct textfile bind /home/ollama/node_exporter_textfiles:/textfile:ro |
| Repo helper | scripts/ops/188-node-exporter-restore.sh |
| Local metrics | awoooi_backup_health_monitor_up{host="188"} 1; node_textfile_scrape_error 0 |
| Prometheus readback | up{job="node-exporter-188"} 1; awoooi_backup_health_monitor_up{host="188"} 1; absent(awoooi_backup_health_monitor_up{host="188"}) empty |
| Alert readback | ALERTS{alertname="BackupHealthMonitorMissing188"} empty |
| Declaration limit | May declare 188 backup health scrape restored; must not treat this as credential escrow complete or backup retention policy complete |
If BackupHealthMonitorMissing188 is active after a future reboot while SSH / backup-status shows backup_health.prom fresh, check this first:
curl -fsS http://192.168.0.188:9100/metrics | grep -E 'awoooi_backup_health_monitor_up|node_textfile_scrape_error'
If 9100 refuses connections or the textfile collector reports an error, restore the exporter with the repo helper first:
ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh
After the restore, re-check Prometheus / Alertmanager; do not silence the alert directly.
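The two metrics checked by the curl command above can also be evaluated programmatically. The sketch below parses Prometheus text-format output that has already been fetched; it makes no network call. The function names and diagnosis labels are hypothetical, and the parser assumes label values contain no closing brace.

```python
import re

# A sample line is `name{labels} value [timestamp]` or `name value [timestamp]`.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def read_metric(exposition: str, name: str) -> list[float]:
    """Return the value of every sample whose metric name equals `name`."""
    values = []
    for raw in exposition.splitlines():
        match = _SAMPLE.match(raw.strip())
        if match and match.group(1) == name:
            values.append(float(match.group(3)))
    return values


def exporter_diagnosis(exposition: str) -> str:
    """Classify the 188 textfile scrape state from /metrics text."""
    scrape_errors = read_metric(exposition, "node_textfile_scrape_error")
    monitor_up = read_metric(exposition, "awoooi_backup_health_monitor_up")
    if any(value != 0 for value in scrape_errors):
        return "TEXTFILE_SCRAPE_ERROR"
    if not monitor_up:
        return "BACKUP_HEALTH_METRIC_MISSING"
    if all(value == 1 for value in monitor_up):
        return "OK"
    return "BACKUP_HEALTH_MONITOR_DOWN"
```

A connection refused on port 9100 never reaches this parser; that case is already visible from the curl exit status.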
14.28 2026-06-25 MOMO Google Drive token and data-freshness blocker
The third 2026-06-24 change brings "MOMO service is alive but data is not fresh" into the cold-start hard gate. At 2026-06-25 11:44 the MOMO service, public route, DB parity, scheduler activity, and backup/offsite were all proven usable, but the Google Drive token artifact metadata was missing and data had stopped at 2026-06-17, so cold-start was correctly BLOCKED. As of the latest state at 2026-06-25 14:16, the legitimate import job 57 has cleared that data-freshness blocker: MOMO service health is V10.674, daily_sales_snapshot and realtime_sales_monthly both reach 2026-06-24, MOMO_DAILY_FRESHNESS is 1|2026-06-24, and the dedicated preflight returned PASS=18 WARN=3 BLOCKED=0. This still does not mean DR complete, nor does it permit reading or storing the Google Drive token content.
| Item | 2026-06-25 MOMO freshness / token baseline |
|---|---|
| SOP version | v1.51 |
| Token current state | MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000; dedicated preflight also saw host token metadata aligned to scheduler UID and container-side token artifact mode 600; token content was not read |
| Token recovery boundary | Owner-gated maintenance only; never read, paste, or store the token value / hash / partial; never write chat passwords or workarounds into the repo |
| Drive auth behavior | 2026-06-25 10:04 fail-closed evidence remains historical proof that auth failure does not become a fake success. 14:16 readback shows the later legitimate import succeeded and the blocker is cleared. |
| Drive pending folder | 當日業績匯入, pattern 即時業績_當日; latest successful source recorded by job 57 |
| Latest valid import | Job 57 completed, 即時業績_當日.xlsx, 2026-06-25T13:16:47.359958..2026-06-25T13:18:02.964985, 15383/15383/0 |
| DB parity | `daily_sales_snapshot=109061 |
| Data freshness | `MOMO_DAILY_FRESHNESS 1 |
| Live cold-start readback | PASS=89 WARN=0 BLOCKED=0, result GREEN; dedicated MOMO preflight PASS=18 WARN=3 BLOCKED=0 |
| 110 live script sync | /home/wooo/scripts/full-stack-cold-start-check.sh hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8 |
| Alert behavior | Drive auth failure must send failure notification; heartbeat success remains suppressed; stale data alert should clear only with fresh DB evidence like job 57 / freshness 1 |
| Declaration limit | May declare hosts/routes/K3s/backups/MOMO service/MOMO data freshness recovered for this evidence set; must not declare DR complete, credential escrow complete, Wazuh host registry accepted, or runtime/security acceptance |
MOMO post-reboot minimal readback:
scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
ssh ollama@192.168.0.188 '
stat -c "%u:%g:%a %n" /home/ollama/momo-pro/config/google_token.json 2>/dev/null || echo "google_token.json missing"
docker top momo-scheduler -eo pid,user,uid,gid,args | head -n 3
docker logs --since 2h momo-scheduler 2>&1 | grep -E "AutoImport|Google Drive|Permission denied|could not locate runnable browser|沒有找到|發現檔案|匯入失敗通知" | tail -120
'
ssh ollama@192.168.0.188 'db_user=$(docker exec momo-pro-system printenv POSTGRES_USER); db_name=$(docker exec momo-pro-system printenv POSTGRES_DB); db_pass=$(docker exec momo-pro-system printenv POSTGRES_PASSWORD); docker exec -i -e PGPASSWORD="$db_pass" momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -At' <<'SQL'
SELECT 'daily_sales_snapshot|' || count(*) || '|' || min(snapshot_date)::date || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
SELECT 'realtime_sales_monthly|' || count(*) || '|' || min("日期")::date || '|' || max("日期")::date FROM realtime_sales_monthly;
SELECT 'daily_freshness|' || (CURRENT_DATE - max(snapshot_date)::date) || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
SQL
Preferred path is the scripted preflight. It is read-only and returns 0 for clean, 1 for WARN-only, and 2 for BLOCKED. 2026-06-25 14:16 live run returned PASS=18 WARN=3 BLOCKED=0: https://mo.wooo.work/health and local health both returned 200, health version was V10.674, app / scheduler / Telegram bot were healthy, scheduler restart count was 0, token metadata aligned to scheduler UID without reading token content, current-month DB parity matched, latest daily import job 57 was clean, and DB_DAILY_FRESHNESS 1|2026-06-24 cleared the MOMO hard blocker. The remaining WARNs are stability / future-evidence notes, not blockers.
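A caller can consume the exit-code contract described above (0 clean, 1 WARN-only, 2 BLOCKED) roughly as follows. This is a minimal sketch: `interpret_exit`, `run_gate`, and the label strings are hypothetical, and any other exit code is treated as unexpected rather than guessed at.

```python
import subprocess
import sys

EXIT_MEANING = {0: "CLEAN", 1: "WARN_ONLY", 2: "BLOCKED"}


def interpret_exit(returncode: int) -> str:
    """Translate a preflight exit code into a gate label."""
    return EXIT_MEANING.get(returncode, "UNEXPECTED")


def run_gate(argv: list[str]) -> str:
    """Run a read-only gate command and classify its exit code."""
    completed = subprocess.run(argv, check=False)
    return interpret_exit(completed.returncode)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: pass the preflight script path as the first argument.
    print(run_gate(sys.argv[1:]))
```

Treating `UNEXPECTED` like `BLOCKED` is the conservative choice, since a crashed preflight proves nothing about freshness.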
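If the Drive token artifact is missing or the Drive pending folder has no new source file, do not truncate manually, do not re-import an old archive file to manufacture "latest" data, and do not treat DB parity as data freshness. The next evidence that clears the blocker must be:
- The owner provides a non-secret evidence ref confirming that the Google Drive token artifact or a legitimate source file can be restored.
- The maintenance window, rollback owner, and post-check owner are explicitly recorded.
- The token artifact is verified by metadata only: owner aligned to the scheduler UID, mode no wider than 600, token content never printed.
- A new 即時業績_當日 source file is visible, or the scheduler can successfully list pending import sources.
- The import job succeeds with sync_success=true, and the Drive file is moved only after success.
- daily_sales_snapshot and realtime_sales_monthly date bounds match, and MOMO_DAILY_FRESHNESS <= 2.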
14.29 2026-06-24 188 MinIO / Velero, DB exporter, and 110 disk pressure recovery
The fourth 2026-06-24 change restores the real backup and monitoring chain instead of silencing alerts. VeleroBackupNotRun, PostgreSQLDown, RedisDown, and 110 disk pressure are all valid red lights; the repair order must be source-of-truth / service / exporter / Prometheus / Alertmanager / cold-start scorecard.
| Item | 2026-06-24 06:35 recovery baseline |
|---|---|
| SOP version | v1.30 |
| 188 DB exporter root cause | Under Docker user namespaces the exporter compose cannot use network_mode: host; the Redis live port is 6380 |
| 188 DB exporter source-of-truth | ops/monitoring/docker-compose.exporters.yaml switched to bridge port mapping; the PostgreSQL DSN is injected only from the host .env.exporters, and no password is stored in the repo |
| 188 DB exporter helper | scripts/ops/188-db-exporters-restore.sh; live path /home/ollama/bin/188-db-exporters-restore.sh |
| 188 DB exporter readback | local metrics pg_up=1, redis_up=1; Prometheus up{job="postgres-exporter"}=1, pg_up=1, up{job="redis-exporter"}=1, redis_up=1 |
| 110 disk pressure | / from 92% used to 73% used after Docker image / build cache cleanup only; no Docker volume prune |
| MinIO / Velero root cause | 188 MinIO endpoint 192.168.0.188:9000 was down; Velero BSL S3 list failed; MinIO data path had userns write denial |
| MinIO restore | live /home/ollama/minio/docker-compose.override.yml adds userns_mode: host for the minio service; MinIO health endpoint is OK |
| Velero restore | 120 BackupStorageLocation/default phase is Available; one-off backup reboot-recovery-202606240456 is Completed |
| Backup-health textfile | 110 exporter refresh reports awoooi_velero_monitor_up=1, awoooi_velero_latest_completed_backup_fresh=1, restore-test cron present, failed jobs 0 |
| Alert readback | VeleroBackupNotRun, PostgreSQLDown, RedisDown, and 110 disk-pressure alerts resolved |
| Live cold-start readback | PASS=86 WARN=0 BLOCKED=1, result BLOCKED; only blocker remains MOMO data freshness |
| Declaration limit | May declare backup / exporter / MinIO / Velero chain recovered; must not declare full-stack green, MOMO data current, DR complete, or runtime/security acceptance |
188 PostgreSQL / Redis exporter post-reboot recovery:
ssh ollama@192.168.0.188 'bash /home/ollama/bin/188-db-exporters-restore.sh'
curl -fsS http://192.168.0.188:9187/metrics | grep '^pg_up '
curl -fsS http://192.168.0.188:9121/metrics | grep '^redis_up '
188 MinIO / 120 Velero recovery from 110:
ssh wooo@192.168.0.110 '/home/wooo/scripts/188-minio-velero-restore.sh'
If a one-off reboot-recovery backup must be created during a maintenance window and the 110 backup-health textfile refreshed, set this explicitly:
ssh wooo@192.168.0.110 'CREATE_VELERO_BACKUP=true REFRESH_BACKUP_HEALTH=true /home/wooo/scripts/188-minio-velero-restore.sh'
The live scripts can be synced from the local repo helpers:
scp -q scripts/ops/188-db-exporters-restore.sh ollama@192.168.0.188:/home/ollama/bin/188-db-exporters-restore.sh
scp -q scripts/ops/188-minio-velero-restore.sh wooo@192.168.0.110:/home/wooo/scripts/188-minio-velero-restore.sh
110 disk pressure cleanup rule:
Allowed in incident recovery: Docker image / build cache cleanup after checking `docker system df`.
Forbidden without explicit owner approval: `docker volume prune`, deleting database / registry / MinIO / ClickHouse / Sentry / PostgreSQL volumes, or removing unknown bind-mounted state.
Done gate: filesystem use below 85%, no active disk-pressure alerts, and no service regression in cold-start scorecard.
14.30 2026-06-24 notification noise closure after reboot recovery
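The fifth 2026-06-24 change brings "service has recovered, but an old monitoring path or success heartbeat keeps flooding Telegram" into the reboot SOP. This is not silencing: failures, warnings, data freshness, and backup / exporter / escrow red lights must still alert. The goal is to stop the same known failure from being pushed every 5 or 30 minutes, and to stop normal success heartbeats from filling the war room.
| Item | 2026-06-24 notification baseline |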
|---|---|
| SOP version | v1.31 |
| AWOOOI healthy heartbeat | Production a84a5a0b: when healthy with no warnings, only Redis/log is updated and no Telegram message is sent; a change in warnings is sent, and a return from warning to healthy sends exactly one recovery |
| MOMO false-noise root cause | The old /home/wooo/scripts/docker_health_monitor.sh on 110 probed http://192.168.0.188/health and got HTTP 502 continuously during the reboot, producing a MOMO Pro alert every 5 minutes |
| MOMO monitor source-of-truth | Added scripts/ops/momo-pro-health-monitor.sh; primary truth is https://mo.wooo.work/health, while 188 local 127.0.0.1:5003/health and container state are secondary evidence only |
| MOMO live readback | /home/wooo/scripts/docker_health_monitor.sh hash d7a6bc75549efa10176c42e6f9082c90b9856dbcbb335aba4a4fa4abb754eaef; manual run returned OK: public health 200; no alert |
| AWOOOI ops notify wrapper | /home/wooo/awoooi-ops/notify-awoooi-ops.sh hash 12bf9ae124c56bb7f31be15ebeb501671b0686d695492bc3fa1d9abb5b683b67; repo MOMO monitor uses this wrapper instead of adding a new Telegram Bot API direct send |
| Docker monitor fallback | scripts/ops/docker-health-monitor.sh keeps ACTION_COOLDOWN_SECONDS=300 for repair cadence but adds NOTIFY_COOLDOWN_SECONDS=1800 for direct Telegram fallback when AWOOOI API cannot receive the event |
| Docker monitor live readback | /home/wooo/awoooi-ops/docker-health-monitor.sh hash 41d64f29048868c8e4c089132d299c8ee0e2b50ab2c513158d6d45cf92ea38e3 and exposes TELEGRAM_COOLDOWN lines for repeated fallback suppression |
| Bitan public-content check | Live /home/wooo/apps/bitan-pharmacy-release/scripts/run-public-content-cleanliness-check.sh now writes public-content-cleanliness.notify.state, suppresses same failure fingerprint for 21600s, and sends one recovery notice after a failed state becomes pass |
| Bitan live readback | Script hash 294ec7f75448c86688b8afc408c785efe4cf173d468ad0d82228ba638d3de2dc; manual no-notify run returned PASS for DB, public APIs, products/news pages, and content contract |
| Declaration limit | May declare repeated healthy / same-failure notification noise is controlled for these paths; must not declare all product alerts migrated to the unified notification gateway or any real failure alert disabled |
Post-reboot notification gate:
ssh wooo@192.168.0.110 '/home/wooo/scripts/docker_health_monitor.sh'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/logs/docker_health.log'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/awoooi-ops/monitor.log'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/apps/bitan-pharmacy-release/logs/public-content-cleanliness-check.cron.log'
Done gate:
MOMO monitor: public health 200 -> no Telegram.
AWOOOI heartbeat: healthy + no warnings -> suppressed; warning/recovery still send.
Generic docker-health monitor: API 200/202 path is primary; direct Telegram fallback is fingerprint-cooled.
Bitan public content: pass -> no failure Telegram; repeated same failure -> cooled; recovery after prior failure -> one notice.
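The same-failure cooldown and one-time recovery behaviour in the done gate can be sketched as a small state machine. This illustrates the rule only and is not the code of any monitor script named above: `NotifyState`, `decide`, and the action labels are hypothetical; the 21600-second default mirrors the Bitan suppression window recorded in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotifyState:
    last_fingerprint: Optional[str] = None  # None means last known state was pass
    last_sent_at: float = 0.0


def decide(
    state: NotifyState,
    fingerprint: Optional[str],
    now: float,
    cooldown: float = 21600.0,
) -> str:
    """Return the notification action; `fingerprint` is None when the check passes."""
    if fingerprint is None:
        if state.last_fingerprint is not None:
            state.last_fingerprint = None
            return "SEND_RECOVERY"  # exactly one notice after a prior failure
        return "SILENT"             # healthy and already healthy: no message
    same_failure = fingerprint == state.last_fingerprint
    if same_failure and now - state.last_sent_at < cooldown:
        return "SUPPRESS"           # same known failure inside the cooldown
    state.last_fingerprint = fingerprint
    state.last_sent_at = now
    return "SEND_FAILURE"           # new failure, or cooldown has elapsed
```

A changed fingerprint always sends, so suppression never hides a different failure behind an older one.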
14.31 2026-06-24 MOMO source-file absence decision gate
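The 2026-06-24 11:35 recovery decision splits MOMO into two things: service availability and data freshness. Service availability has recovered; data freshness is still blocked. This gate exists to stop an operator from mistaking "no new source file" for "recovery complete" when the external site returns 200, the container is healthy, and DB parity is normal.
| Item | 11:35 source-file absence baseline |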
|---|---|
| SOP version | v1.32 |
| MOMO public health | https://mo.wooo.work/health returns healthy; version V10.639 |
| DB rows | daily_sales_snapshot=104614,realtime_sales_monthly=786621 |
| DB bounds | daily 2025-07-01..2026-06-17;monthly 2024-01-01..2026-06-17 |
| Current-month parity | `10936 |
| Latest successful import | daily_sales job 56,created 2026-06-18 11:41,source 即時業績_當日.xlsx,sync_success=true |
| Pending source folder | 當日業績匯入 count 0 for pattern 即時業績_當日 |
| Archive latest | 2026-06-18T01:30:39Z,already imported by job 56 |
| Scheduler Drive readback | container-side Drive listing works and currently returns count 0; no current Permission denied evidence in latest readback |
| Stale alert posture | data_stale_alert has 24h dedupe; this is a true warning, not heartbeat spam |
| Blocking metric | `MOMO_DAILY_FRESHNESS 7 |
| Repo-side v1.42 scorecard evidence | MOMO_SOURCE_EMPTY_EVIDENCE_LINES 21、`MOMO_IMPORT_CONFIG 當日業績匯入 |
2026-06-24 23:04 repo-side cold-start v1.42 dry-run returns PASS=88 WARN=0 BLOCKED=1 and classifies the only blocker as:
188 momo source file absent while daily sales data stale
This is repo-side source-of-truth enhancement only. 2026-06-24 23:15 read-only deploy parity check proves the live 110 script is still older: repo hash f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05, live hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8. Do not claim the live 110 deployed script has this v1.42 behavior until /home/wooo/scripts/full-stack-cold-start-check.sh is synced under an approved change and its hash/readback is recorded through §13.3.1.
GO / NO-GO:
GO: declare MOMO web/API/container/database service available.
GO: declare current-month table parity good.
NO-GO: declare MOMO business data current.
NO-GO: declare FULL_STACK_GREEN while MOMO_DAILY_FRESHNESS > 2.
NO-GO: re-import old archived files to fake freshness.
NO-GO: import product exports or manually constructed spreadsheets as daily sales source.
NO-GO: truncate tables, restore whole DB, or move Drive files when sync_success is false.
The only qualifying evidence for clearing the blocker:
1. New legitimate 即時業績_當日 source file appears in the expected Drive intake path, or owner supplies a verifiable source-evidence reference.
2. Import job completes with success=true and sync_success=true.
3. Drive file movement / archive evidence shows the source was handled once.
4. daily_sales_snapshot and realtime_sales_monthly counts and date bounds match for the imported range.
5. MOMO_DAILY_FRESHNESS <= 2.
6. backup / offsite / cold-start scorecard rerun after import remains green except known DR escrow blocker.
If the source file is absent, the correct report is:
MOMO service is recovered, data pipeline is waiting for upstream source file.
No safe import candidate exists.
Full-stack remains blocked by data freshness, not by service outage.
14.32 2026-06-24 188 nginx-exporter / CD monitoring coverage gate
The sixth 2026-06-24 change brings CD post-deploy monitoring coverage failure into the reboot SOP. The runtime deploy of 2ec7f6f4 wrote back 622bc372 and production API health is healthy, but the post-deploy checks of CD #3294 left a Failure because the nginx-exporter target was down. The root cause is that the 188 nginx-exporter container was not running, not a fault in the Nginx public gateway, the API/Web rollout, or the MOMO service.
| Item | 20:10 monitoring coverage baseline |
|---|---|
| SOP version | v1.34 |
| Affected CD run | Gitea CD #3294 historical result remains Failure; deploy marker 622bc372 has been written |
| Root cause | Prometheus job nginx-exporter down, target 192.168.0.188:9113 connection refused |
| Non-root cause | Nginx stub_status is normal; no Nginx reload, no API/Web/MOMO restart, and no firewall change is needed |
| Live restore source | /home/ollama/nginx-exporter.yml |
| Repo helper | scripts/ops/188-nginx-exporter-restore.sh |
| Check mode | --check only reads stub_status, compose config, container state, and metrics |
| Apply mode | --apply runs docker compose -f /home/ollama/nginx-exporter.yml up -d after stub_status and compose config pass |
| Exporter metrics | nginx_up 1、nginx_connections_active、nginx_http_requests_total |
| Monitoring coverage | Jobs 總數=14、全部 UP=14、真實問題=0、預期覆蓋率=100.0% |
| Declaration limit | May declare exporter / monitoring coverage recovered; must not relabel the historical CD run as success, and must not declare full-stack green / DR complete |
Post-reboot / post-CD 188 nginx-exporter check:
bash scripts/ops/188-nginx-exporter-restore.sh --check
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0
If --check fails only at the metrics stage while stub_status and compose config both pass, and the maintenance window allows restoring a stateless exporter:
bash scripts/ops/188-nginx-exporter-restore.sh --apply
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0
Do not handle this symptom in any of the following ways:
NO-GO: reload Nginx before stub_status / exporter metrics prove Nginx config is the issue.
NO-GO: restart product containers because monitoring coverage alone is red.
NO-GO: silence monitoring coverage or mark CD green without target recovery evidence.
NO-GO: prune Docker volumes or delete exporter state not owned by this SOP.
14.33 2026-06-24 MOMO V10.646 / source-file absence / dual-workstation baseline
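The seventh 2026-06-24 change splits MOMO's "program version is current" and "business data is not fresh" into two independent gates, and pins the MOMO Codex workspaces on the Mac Mini / MacBook Pro to the latest Gitea main baseline. This prevents two misjudgments after a reboot: declaring data updated on seeing the latest /health version, or assuming the service is still on an old version on seeing stale data.
| Item | 20:42 MOMO / workstation baseline |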
|---|---|
| SOP version | v1.35 |
| MOMO public health | https://mo.wooo.work/health returns healthy, version V10.646 |
| Gitea main truth | wooo/ewoooc main=7cfca9375445ea03d6f5d10512d0276a20914d25, SYSTEM_VERSION = "V10.646" |
| Mac Mini workspace | /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 7cfca9375445ea03d6f5d10512d0276a20914d25, dirty 0 |
| MacBook workspace | /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 7cfca9375445ea03d6f5d10512d0276a20914d25, dirty 0 |
| Remote baseline branch | wooo/ewoooc codex/momo-current-main-dev-base-20260624 points to 7cfca9375445ea03d6f5d10512d0276a20914d25 |
| DB parity | current-month daily_sales_snapshot and realtime_sales_monthly match at 10936 rows, range 2026-06-01..2026-06-17 |
| Data freshness | `MOMO_DAILY_FRESHNESS 7 |
| Source candidates inspected | Mac Mini current daily file contains only 2025-07-01..2025-07-02; iCloud full-month file contains only 2025-06-01..2025-06-30; MacBook candidates are header-only or the same 2025-07-01..2025-07-02 file |
| Declaration limit | May declare MOMO release current and Codex dual-workstation baseline ready; must not declare MOMO data current or full-stack green |
The MOMO post-reboot decision must answer four questions together:
MOMO_RELEASE_CURRENT = yes/no
MOMO_DB_PARITY = yes/no
MOMO_DATA_FRESH = yes/no
MOMO_SOURCE_AVAILABLE = yes/no
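The only safe path for clearing the MOMO data freshness blocker:
1. A new legitimate 即時業績_當日 source file appears in the expected Drive intake, or the owner supplies a verifiable source-evidence reference.
2. The import job succeeds; it must not be marked completed when the realtime_sales_monthly sync fails.
3. Source file movement / archive evidence proves the file was processed exactly once.
4. daily_sales_snapshot and realtime_sales_monthly row counts / date bounds match.
5. MOMO_DAILY_FRESHNESS <= 2.
None of the following counts as clearing the blocker:
NO-GO: re-importing an old archive, an old-month iCloud file, a header-only file, or a test file.
NO-GO: treating V10.646 health as proof that data dates have reached today.
NO-GO: treating current-month parity as data freshness.
NO-GO: truncating or restoring the whole database to manufacture freshness.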
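The four yes/no questions and the resulting declaration limits can be sketched as below. This is an illustration only: `MomoAnswers`, `is_data_fresh`, `momo_declarations`, and the `IMPORT_WITHOUT_LEGITIMATE_SOURCE` label are hypothetical names; the 2-day threshold comes from the MOMO_DAILY_FRESHNESS <= 2 rule in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MomoAnswers:
    release_current: bool
    db_parity: bool
    data_fresh: bool
    source_available: bool


def is_data_fresh(freshness_days: int, max_days: int = 2) -> bool:
    """Freshness is a day count, independent of service health or parity."""
    return freshness_days <= max_days


def momo_declarations(a: MomoAnswers) -> tuple[list[str], list[str]]:
    """Return (allowed, forbidden) declarations for one evidence set."""
    allowed: list[str] = []
    forbidden: list[str] = []
    for label, ok in (
        ("MOMO_RELEASE_CURRENT", a.release_current),
        ("MOMO_DB_PARITY", a.db_parity),
        ("MOMO_DATA_FRESH", a.data_fresh),
    ):
        (allowed if ok else forbidden).append(label)
    if not a.data_fresh:
        forbidden.append("FULL_STACK_GREEN")
        if not a.source_available:
            # No safe import candidate exists; wait for the upstream file.
            forbidden.append("IMPORT_WITHOUT_LEGITIMATE_SOURCE")
    return allowed, forbidden
```

Each answer is evaluated on its own, so a current release or matching parity can never upgrade a stale-data answer.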
14.34 2026-06-24 MOMO import sync failure boundary hardening
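The eighth change, at 2026-06-24 21:57, brings the "partial success" risk of MOMO auto-import into the reboot SOP. The formal release readback was added at 2026-06-24 22:17: the same fix was fast-forwarded to MOMO main, Gitea Actions cd.yaml #904 succeeded, and the 188 live source marker was confirmed. A successful daily_sales_snapshot write does not mean the whole import succeeded; when the realtime_sales_monthly sync fails, the job must fail, the source file must be kept, and the Google Drive file must not be moved to archive.
| Item | 22:17 MOMO import-boundary production baseline |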
|---|---|
| SOP version | v1.40 |
| Production health | https://mo.wooo.work/health healthy, version V10.653 |
| Live DB read-only | daily_sales_snapshot=104614 rows, 2025/07/01..2026/06/17; realtime_sales_monthly=786621 rows, 2024/01/01..2026/06/17 |
| Scheduler read-only | In the last 12 hours, 當日業績匯入 / 即時業績_當日 both had 0 Excel files, and the scheduler sends no success notification |
| Latest successful import | job 56 completed, 10936 rows, 2026-06-18 11:41..11:42 |
| Code / deploy | MOMO main and codex/momo-current-main-dev-base-20260624 commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73; Gitea Actions cd.yaml #904 Success |
| Live source marker | 188 /home/ollama/momo-pro/services/import_service.py contains _table_columns, 業績分析儀表板同步失敗, and 保留來源檔案等待重試,不移動 Google Drive 檔案 |
| Regression | pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q => 10 passed |
| Production deploy state | Production patched for code boundary; data freshness still blocked until a legitimate newer source file imports successfully |
MOMO import success criteria:
GO: process_daily_sales_import returns True only if daily_sales_snapshot write and realtime_sales_monthly sync / verification both pass.
GO: auto_import_from_drive may move the Drive source file only after process_daily_sales_import returns True.
NO-GO: mark import_jobs.status=completed when sync_success=false.
NO-GO: move or archive the Drive source file when realtime_sales_monthly sync failed.
NO-GO: send a generic success notification for file_count > 0 before verify_import_data_sync passes.
If MOMO data freshness is blocked after a reboot, separate three layers first and do not mix them:
1. Service availability: /health, container, DB connection.
2. Source availability: Drive pending folder has a legitimate new 即時業績_當日 source file.
3. Data correctness: import job completed with sync_success=true, and daily_sales_snapshot / realtime_sales_monthly match the imported date range.
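The import boundary in the GO / NO-GO rules above can be sketched with injected callables. This is not the code in import_service.py: `import_daily_sales` and its four parameters are hypothetical stand-ins, used only to show that the source file moves and the job completes when, and only when, write, sync, and verification all pass.

```python
from typing import Callable


def import_daily_sales(
    write_snapshot: Callable[[], bool],
    sync_monthly: Callable[[], bool],
    verify_sync: Callable[[], bool],
    move_source: Callable[[], None],
) -> dict:
    """Run one import attempt and report its job status."""
    # Short-circuit: later steps are skipped as soon as one step fails.
    sync_success = write_snapshot() and sync_monthly() and verify_sync()
    if sync_success:
        move_source()  # archive the Drive file only after full success
    return {
        "status": "completed" if sync_success else "failed",
        "sync_success": sync_success,
        "source_moved": sync_success,
    }
```

On failure the source file stays in place, so a later retry sees the same input.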
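14.35 2026-06-25 MOMO preflight and 110 CPU orphan Chrome triage
The ninth change, at 2026-06-25 11:01, turns two common misjudgments into a re-runnable SOP:
- MOMO service health green does not mean data is fresh.
- High load on 110 does not mean Docker may be restarted or CI cancelled.
MOMO dedicated preflight: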
scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
This script performs only read-only SSH / Docker metadata / logs / DB queries; it does not read token content, import, move Drive files, or restart anything. 14:16 live result:
MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=18 WARN=3 BLOCKED=0 HOST=ollama@192.168.0.188 FRESHNESS_MAX_DAYS=2
MOMO_PUBLIC_HEALTH_CODE 200
MOMO_HEALTH_CODE 200
MOMO_HEALTH_VERSION V10.674
MOMO_APP_HEALTH healthy
SCHEDULER_RUNNING true
SCHEDULER_HEALTH healthy
SCHEDULER_RESTART_COUNT 0
TELEGRAM_BOT_HEALTH healthy
MOMO_CONTAINER_REPLACE_EVENTS_45M 11
TOKEN_STAT 100000:100000:600
CONTAINER_TOKEN_STAT 0:0:600
LOCAL_EXACT_DAILY_SOURCE_COUNT 0
LOCAL_EXACT_DAILY_SOURCE_LATEST none
DB_DAILY 109061|2025-07-01|2026-06-24
DB_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24
DB_DAILY_FRESHNESS 1|2026-06-24
DB_LATEST_DAILY_IMPORT_JOB 57|completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0
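The key/value lines printed by the preflight can be read back in a follow-up step. The sketch below assumes the output shape shown above (one `KEY value` pair per line, pipe-separated fields inside the value); the function names are hypothetical and the script's real output contract may differ.

```python
def parse_preflight(output: str) -> dict[str, str]:
    """Split each `KEY value` line at the first space."""
    fields = {}
    for line in output.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            fields[key] = value
    return fields


def summary_counts(fields: dict[str, str]) -> dict[str, str]:
    """Read the NAME=value pairs on the summary line."""
    summary = fields["MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT"]
    return dict(pair.split("=", 1) for pair in summary.split())


def daily_freshness_ok(fields: dict[str, str]) -> bool:
    """Compare the DB_DAILY_FRESHNESS day count with FRESHNESS_MAX_DAYS."""
    days, _, _latest_date = fields["DB_DAILY_FRESHNESS"].partition("|")
    max_days = int(summary_counts(fields)["FRESHNESS_MAX_DAYS"])
    return int(days) <= max_days
```

Reading the threshold from the same output keeps the check tied to the evidence set that produced it.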
110 CPU triage:
| Evidence | Decision |
|---|---|
| ps shows stockplatform-review-bulk-ux Chrome groups with root process PPID 1, no parent node smoke, and sustained high CPU | Treat as orphan browser smoke. Run dry-run if available, then only with owner approval use targeted SIGTERM by process group. |
| Active Gitea Actions container is consuming CPU, e.g. GITEA-ACTIONS-TASK-*, next build, uv pip install, docker-buildx | Treat as legitimate CI/CD load. Do not kill unless there is explicit release owner approval to cancel the run. |
| vmstat shows high iowait or active swap in/out | Treat as storage / memory pressure, not browser runaway. Do not kill random processes; capture disk / memory evidence first. |
2026-06-25 10:58 user-approved action:
Targeted command type: process SIGTERM only.
Targeted process groups: 438005, 471295, 640155, 670628.
Scope: orphan `stockplatform-review-bulk-ux` Chrome groups on 110.
Post-check: `OLD_GROUPS_REMAINING` empty.
Not performed: Docker restart, systemd restart, Nginx reload, firewall/iptables change, K8s action, CI cancellation, Wazuh/SOC change, secret read.
Remaining load: active Gitea Actions / CI build work; observe queue and timeout instead of killing.
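A "dry-run first, SIGTERM by process group only after approval" action of the kind recorded above could look like the following. This is a hedged sketch, not the command that was actually run on 110. `terminate_groups` and its `approved` / `dry_run` flags are hypothetical, and the function defaults to doing nothing destructive. It uses only the standard `os.killpg` and `signal.SIGTERM`.

```python
import os
import signal


def terminate_groups(pgids, approved=False, dry_run=True):
    """Send SIGTERM to each process group only when approved and not dry_run.

    Returns a list of log lines so the result can be attached as evidence.
    """
    log = []
    for pgid in pgids:
        if dry_run or not approved:
            log.append(f"DRY_RUN would send SIGTERM to process group {pgid}")
            continue
        try:
            os.killpg(pgid, signal.SIGTERM)
            log.append(f"SENT SIGTERM to process group {pgid}")
        except ProcessLookupError:
            log.append(f"GONE process group {pgid} no longer exists")
        except PermissionError:
            log.append(f"DENIED no permission to signal process group {pgid}")
    return log
```

Calling `terminate_groups([438005, 471295, 640155, 670628])` with the defaults only lists what would be signalled. A real run requires both `approved=True` and `dry_run=False`, so a single missed flag cannot send a signal.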
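14.22 Post-reboot timeline verification
After every reboot, advance along the timeline checkpoint by checkpoint; do not wait until the end to make a single judgment.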
| Time | Goal | Required evidence | May declare |
|---|---|---|---|
| T+0 | Power / VM / console has started | console / hypervisor / UPS / operator note | maintenance started |
| T+5m | LAN / SSH restored | ping, ARP, SSH port, `who -b` | HOST_BOOTED |
| T+15m | Host base services restored | `systemctl is-system-running`, failed units, Docker / PostgreSQL / Redis / K3s role checks | HOST_READY |
| T+30m | Core services restored | 188 DB, 110 Harbor/Gitea/Prom/AM, K3s nodes, AWOOOI API/Web, public routes | SERVICE_READY for scoped hosts |
| T+45m | Schedules and data consistency | backup status, offsite verifier, momo DB parity, CronJobs, alert visibility | service recovery confidence |
| T+60m | Release high-load work and automation | cold-start scorecard, load/core, runner guardrails, AI observe-only gate | release runner/CD only if gates allow |
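If any checkpoint stalls, record which gate it is stuck on and do not skip ahead to the next layer. If two consecutive reboots stall at the same gate, write it back to §16 Known Drift or the workplan.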
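The "advance in order, stop at the first stuck gate" rule can be sketched as follows. This is an illustration under assumptions: `advance` and `needs_drift_writeback` are hypothetical helpers, and the `passed` mapping is assumed to be filled in by an operator after collecting the required evidence for each checkpoint.

```python
from typing import Dict, List, Optional, Tuple

# Checkpoints and the strongest statement each one allows, in timeline order.
TIMELINE = [
    ("T+0", "maintenance started"),
    ("T+5m", "HOST_BOOTED"),
    ("T+15m", "HOST_READY"),
    ("T+30m", "SERVICE_READY for scoped hosts"),
    ("T+45m", "service recovery confidence"),
    ("T+60m", "release runner/CD only if gates allow"),
]


def advance(passed: Dict[str, bool]) -> Tuple[Optional[str], Optional[str]]:
    """Return (strongest allowed declaration, stuck checkpoint or None).

    Later checkpoints are ignored once an earlier one has not passed.
    """
    claim = None
    for checkpoint, declaration in TIMELINE:
        if not passed.get(checkpoint, False):
            return claim, checkpoint
        claim = declaration
    return claim, None


def needs_drift_writeback(stuck_history: List[Optional[str]]) -> bool:
    """True when the two most recent reboots stalled at the same gate."""
    if len(stuck_history) < 2 or stuck_history[-1] is None:
        return False
    return stuck_history[-1] == stuck_history[-2]
```

For example, if T+15m has not passed, the result is `("HOST_BOOTED", "T+15m")` even when the T+30m evidence happens to look fine, because a later gate cannot be claimed over a stuck earlier one.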
15. Done Criteria
All must be true:
- Four hosts reachable by SSH.
- 188 PostgreSQL and Redis healthy.
- 110 Harbor, Gitea, Prometheus, Alertmanager healthy.
- 120/121 K3s nodes Ready.
- VIP `192.168.0.125` present.
- AWOOOI API and Web reachable through NodePort/VIP.
- Alertmanager E2E webhook succeeds.
- cron/CronJob schedules are active, unsuspended, and verified.
- MOMO release version matches Gitea source-of-truth for the intended deployment branch.
- momo `daily_sales_snapshot` and `realtime_sales_monthly` row counts match within the latest import date range.
- momo business data freshness is within the declared SLO, and the latest import source evidence is legitimate; DB parity alone is not enough.
- Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
- High-load batch services are capped or delayed.
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
- 110 runaway process textfile monitor is fresh, and Prometheus has `HostOrphanBrowserSmokeHighCpu` plus CI load classification rules loaded.
- 110 global `/home/wooo/.ssh/known_hosts` still contains verified 120 / 188 entries after any CD run; deploy jobs use `/home/wooo/.ssh/deploy_known_hosts` only.
15.1 Declarable states
| Declaration text | Required conditions |
|---|---|
| `110 host recovered` | 110 HOST_READY; failed units are 0 or all explainable; core ports and cron / backup status have been checked |
| `public core services recovered` | public routes/TLS return 2xx/3xx; AWOOOI API health and Harbor/Gitea/Stock/Sentry/SignOz/Langfuse/Bitan smoke are OK |
| `backup/offsite current` | `backup-status --no-notify` shows nothing stale, offsite verifier reports VERIFY_OK=1, and every failed component has a clear owner |
| `service recovery with known blocker` | cold-start BLOCKED contains only known blockers, e.g. 120; alerts remain visible |
| `full-stack green` | all of §15 holds, and cold-start WARN=0 BLOCKED=0 |
| `DR complete` | full-stack green and credential escrow missing count is 0 |
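The last three rows form a ladder: a known blocker caps the claim, and `DR complete` additionally requires zero missing escrow entries. The sketch below illustrates that ordering only. `allowed_declarations` and its parameters are hypothetical, and the inputs are assumed to come from the cold-start summary and the credential escrow check.

```python
def allowed_declarations(done_criteria_ok, warn_count, blockers,
                         known_blockers, alerts_visible, escrow_missing):
    """Return the declarations from the last three table rows that may be used."""
    allowed = []
    unknown = [b for b in blockers if b not in known_blockers]
    # Only known blockers remain and alerts are still visible.
    if blockers and not unknown and alerts_visible:
        allowed.append("service recovery with known blocker")
    # All done criteria hold and the cold-start result is WARN=0 BLOCKED=0.
    full_green = done_criteria_ok and warn_count == 0 and not blockers
    if full_green:
        allowed.append("full-stack green")
        # DR complete is strictly stronger than full-stack green.
        if escrow_missing == 0:
            allowed.append("DR complete")
    return allowed
```

With no blockers, no warnings, and three missing escrow entries, the result is only `["full-stack green"]`; a green service state by itself never yields `DR complete`.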
16. Known Drift To Fix After Recovery
These items must be cleaned up after the incident; do not make sweeping changes in the middle of P0 recovery.
- `SERVICE-ENDPOINTS.md` still has old Prometheus/Alertmanager locations.
- Audit older docs for direct node webhook targets; current main path should be VIP `192.168.0.125:32334`.
- OpenClaw `8088` vs `8089` must be live-confirmed and normalized.
- 188 compose paths drift between `/home/ollama/*` and Ansible `/opt/*`.
- 110 runner docs still mention Docker runner in places; live startup prefers host `gitea-act-runner-host.service`.
- `scripts/setup-runner-watchdog.sh` conflicts with the 2026-05-05 runner watchdog disablement guardrail.
- `grist.wooo.work` / `registry.wooo.work` public HTTP/HTTPS currently route to `aiops.wooo.work`; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.
- `stockplatform-shared-ui-monitor.timer` / service source-of-truth still needs to be cleaned up or rebuilt; on 2026-06-12 only the stale timer was disabled to clear the host degraded state.
- The 111 local Ollama fallback is currently unreachable; the production provider is served by GCP-A / GCP-B, but 111 recovery should be tracked separately as AI provider resilience work.
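- The content added in SOP v1.5 has been reinforced in Traditional Chinese, while older sections still contain English passages. Later runbook hygiene should translate them in batches; do not mix large-scale reformatting into incident P0 work.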