awoooi/docs/runbooks/FULL-STACK-COLD-START-SOP.md


AWOOOI Full-Stack Cold Start and Host Reboot SOP

Version: v1.78
Last updated: 2026-06-27 Asia/Taipei
Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.


0. Latest Live Baseline and Release Verdict

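This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date here is not today, re-run the live check first, then update this section and docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md.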

If you only need a quick post-reboot verdict on whether recovery can be declared, run the machine-readable summary first: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color. This script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs and a replayable summary.txt in /tmp/awoooi-post-reboot-readiness-*.

Since v1.75, every later step in the same acceptance round must consume the same $ARTIFACT_DIR/summary.txt, for example scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file "$ARTIFACT_DIR/summary.txt" --no-color and scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file "$ARTIFACT_DIR/summary.txt" --no-color. Do not re-run live probes repeatedly within one evidence chain and then mix conclusions taken at different points in time.

Since v1.76, if the delegated cold-start shows a single BLOCKED AWOOOI API not reachable at the moment of a K3s rollout / CD replacement, but the wrapper's own retry of the public https://awoooi.wooo.work/api/v1/health route has returned 2xx, that blocker is listed only as a route/API warmup evidence warning. If the public API still fails, if any other non-route blocker exists, or if the retry does not recover, the result stays hard blocked.

The declaration guard converts the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows SERVICE_GREEN=1 while NEXT_REQUIRED_GATES is still non-empty, the dispatch checklist then turns the unfinished blockers into an owner / evidence / forbidden-action checklist. When a machine-readable intake is needed, run scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --dispatch-file <dispatch.txt> --output /tmp/awoooi-post-reboot-owner-packets.json to produce the awoooi_post_reboot_next_gate_owner_packets_v1 JSON, and immediately run scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json. Dispatch, packet, and guard all pin DISPATCH_AUTHORIZED=0, REQUEST_SENT_COUNT=0, OWNER_RESPONSE_ACCEPTED=0, HOST_WRITE_AUTHORIZED=0, SECRET_VALUE_COLLECTION_ALLOWED=0, and RUNTIME_GATE=0. Until the guard passes, do not send an owner request, write an escrow marker, enter a maintenance window, or declare DR / Wazuh registry complete.

Since v1.74, every owner response JSON must also pass scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file <owner-packets.json> --response-file <file>. An empty template, a placeholder, a secret payload, a runtime action request, a credential marker write, a Wazuh active response / re-enroll / restart, a Kali active scan, or missing Dashboard API / manager registry evidence must all fail closed. Passing preflight only means the response may proceed to independent reviewer acceptance; it is not runtime authorization.

When manual expansion is needed, run scripts/reboot-recovery/post-start-quick-check.sh --no-color and use docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.
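The single-summary rule can be pictured as a small pure function over one summary.txt. The following is a minimal sketch only, not the actual post-reboot-declaration-guard.py: the function names, the rule table, and the assumption that WAZUH_MANAGER_REGISTRY_ACCEPTED and RUNTIME_ACTION_AUTHORIZED are 0/1 flags are all hypothetical. The key names and declaration names are the ones quoted in this section.

```python
from typing import Dict, Set, Tuple


def parse_summary(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines from a summary.txt-style artifact; ignore the rest."""
    fields: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def classify_declarations(fields: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Return (allowed, forbidden). A missing or unexpected value fails closed."""
    # Hypothetical mapping: declaration -> (summary key, value that allows it).
    rules = {
        "SERVICE_GREEN": ("SERVICE_GREEN", "1"),
        "HOST_188_HYGIENE_GREEN": ("HOST_188_HYGIENE_BLOCKED", "0"),
        "DR_COMPLETE": ("ESCROW_MISSING_COUNT", "0"),
        "WAZUH_REGISTRY_RECOVERED": ("WAZUH_MANAGER_REGISTRY_ACCEPTED", "1"),
        "RUNTIME_ACTION_AUTHORIZED": ("RUNTIME_ACTION_AUTHORIZED", "1"),
    }
    allowed: Set[str] = set()
    forbidden: Set[str] = set()
    for declaration, (key, required) in rules.items():
        target = allowed if fields.get(key) == required else forbidden
        target.add(declaration)
    return allowed, forbidden
```

Because both functions read only the text they are given, every downstream step that receives the same summary.txt reaches the same verdict, which is the point of replaying one artifact instead of re-probing live state.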

v1.76 owner gate replay rule: once a round's summary has been produced, the owner packet and owner response preflight must preferentially use --summary-file "$ARTIFACT_DIR/summary.txt", for example scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --output /tmp/awoooi-post-reboot-owner-packets.json and scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --response-file <file>. Omit --summary-file only when deliberately taking fresh live evidence; otherwise preflight must not re-run the summary itself and cause state drift within the same round.

2026-06-27 11:51 latest live revalidation: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color (artifact /tmp/awoooi-post-reboot-readiness-20260627-115046/summary.txt) returned POST_START_RESULT=BLOCKED, POST_START_PASS=37, POST_START_WARN=3, POST_START_BLOCKED=2, SERVICE_GREEN=0, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. This round again repaired the 188 momo_pg_daily crontab configured drift; backup-status shows core_blockers=0 and configured_missing_188=0; K3s / ArgoCD live readback shows 120 / 121 both Ready, awoooi-prod Synced / Healthy, and api/web/worker pods all Running. The current hard blocker is MOMO business data freshness: the latest daily_sales_snapshot is still 2026-06-24, DRIVE_INTAKE_COUNT=0, the Drive archive / global latest 即時業績_當日 are both 2026-06-25T04:21:47Z, and the latest import job 57 completed cleanly with sync_success=true. Therefore hosts, K3s, public routes, backup core, and Stock freshness can be declared recovered; full-stack green cannot be declared until the MOMO source file is supplied and the DB is updated through the official import pipeline. DR complete remains forbidden because ESCROW_MISSING_COUNT=5; Wazuh all-host enrollment remains forbidden because manager registry accepted is 0.

2026-06-27 00:58 latest live summary: this round first repaired two actual SOP blockers. First, scripts/ops/recovery-scorecard-contract-check.py now treats PyYAML as optional, so a standard Python environment can also validate the recovery recording-rule contract without ModuleNotFoundError: yaml interrupting the DR/offsite checklist. Second, the 188 ollama crontab was backed up to /home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt, and the AWOOOI momo PostgreSQL daily backup was converged from the app-side /home/ollama/momo-pro/scripts/pg_backup.sh back to the host-owned /home/ollama/bin/momo-pg-backup.sh; after refreshing the 188 textfile exporter, awoooi_backup_job_configured{host="188",job="momo_pg_daily"} is 1. At 00:58, scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color (artifact /tmp/awoooi-post-reboot-readiness-20260627-005728/summary.txt) returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=3, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. In the same round, backup-status shows core_blockers=0 and configured_missing_188=0, and the Prometheus live contract returns awoooi_recovery_core_ready=1 and awoooi_recovery_dr_offsite_ready=0. This means hosts / K3s / public routes / product data / backup core have recovered, DR is still blocked only because credential escrow lacks 5 non-secret evidence markers, and Wazuh all-host registry accepted is still 0.

2026-06-27 00:02 latest live summary: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=4, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. The same round's production route smoke returned: IwoooS 200, Wazuh read-only routes 200, VibeWork 200, AwoooGo 200, MOMO health 200, Stock 200; AWOOOI API health healthy / prod / mock_mode=false; PostgreSQL / Redis / OpenClaw / SigNoz / GCP Ollama provider up. The local Ollama endpoint is still cooldown / degraded and is covered by provider fallback; it is not a website or API service blocker. The latest deploy marker is e506b9d5 chore(cd): deploy fe74d86 [skip ci]; this round's 89b9e67a is a SOP / scripts / docs source update, not a runtime bundle deploy marker. The 23:56 redacted readbacks of 112 Wazuh and 120 K3s still serve as adjacent evidence for this round: 120 ArgoCD is Synced / Healthy and Pods are all Running or Completed; the Wazuh manager registry is not entirely empty, but WAZUH_MANAGER_REGISTRY_ACCEPTED=0 holds, so all-host enrollment recovery cannot be declared.

2026-06-26 23:56 live summary retained for comparison: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=3, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0. Read-only follow-up on 120 in the same window: ArgoCD awoooi-prod is Synced / Healthy; awoooi-prod Pods are all Running or Completed; the historical km-vectorize-29689620 failed Job has been superseded by later successful Jobs on 2026-06-23, 2026-06-24, and 2026-06-25 and is not a current service blocker. Read-only follow-up on 112 in the same window: systemd running; Wazuh manager / indexer / dashboard active; manager API root returns 401; Dashboard unauthenticated check endpoints return 401; the redacted manager registry readback shows local manager 1, managed agents 5, active managed 5, disconnected 0, never connected 0. This evidence proves the registry is no longer "entirely empty", but Wazuh all-host enrollment recovery still cannot be declared, because the SOP expected scope is still 6, the Dashboard API connection / version has not been accepted through login or owner evidence, and owner response accepted is still 0.

2026-06-26 18:46 latest live recovery truth supersedes the 12:13 reading of today's product data: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=PRODUCT_DATA_PENDING_EOD_WINDOW, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=0, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=core_margin_short_daily_missing,ai_recommendations_stale, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, WAZUH_MANAGER_REGISTRY_ACCEPTED=0. The same round's long live cold-start check returned PASS=87 WARN=0 BLOCKED=0, Result: GREEN, meaning the service layer of the 110 / 120 / 121 / 188 hosts, K3s, public routes, AWOOOI API, MOMO, backup core, exporters, cron, and Alertmanager has recovered. However, today's official StockPlatform margin-short data has not been published and AI recommendations still depend on it, so it cannot be declared that all product data is current. At 18:43, an authorized SIGTERM cleared two stockplatform-review-bulk-ux orphan Chrome process groups on 110 that were more than 6 hours old, REMAINING=0. From 18:44 to 18:46, the local AWOOOI next build on the 168 Mac Mini was stopped, and temp/build/cache plus Antigravity backup browser recordings were cleaned up, bringing /System/Volumes/Data from about 1.0Gi / 100% back to about 8.7Gi / 96%. The networking.service failure on 112 Kali was traced to a bad shebang #\!/bin/bash in /etc/network/if-up.d/wg-nat, which causes Exec format error; Wazuh manager / indexer / dashboard remain active. Fixing that hook requires sudo elevation on 112; no password was used or stored.

2026-06-26 12:13 latest live summary supersedes the 08:59 gate set: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color returned POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_PASS=38, POST_START_WARN=4, POST_START_BLOCKED=0, SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_SERVICE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, HOST_188_RESULT=HOST_188_HYGIENE_GREEN, WAZUH_ROUTE_CODE=200, WAZUH_TRANSPORT_COUNT=6, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning, WAZUH_DASHBOARD_INDEX_OK=3, RUNTIME_ACTION_AUTHORIZED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. 188 host hygiene has been removed from the blocker list; the only items that still cannot be declared complete are DR credential escrow and the Wazuh manager registry. The ACME HTTP-01 route and certbot timer hygiene have been repaired, but do not declare the certificate formally renewed; wait for the snap certbot timer / ACME window readback.

2026-06-26 13:01 owner response preflight baseline: added scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color and docs/templates/post-reboot-next-gate-owner-response.json. With no response file, it must output POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_file expected_gates=2 received=0 accepted=0 runtime_gate=0; when the template is used directly, it must output POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_content expected_gates=2 received=0 accepted=0 runtime_gate=0. This gate only accepts redacted owner evidence for credential_escrow_evidence and wazuh_manager_registry_export. It does not send requests, write escrow markers, read secrets, or perform Wazuh / host / Kali runtime actions, and it does not convert a general approval message into owner accepted.

2026-06-26 17:45 single-summary replay baseline: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color now automatically writes /tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt, and the same round's subsequent declaration guard, next-gate dispatch, owner packet, contract guard, and owner response preflight all replay from this summary. The 17:45 summary returned SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. post-start-quick-check.sh also gained route warmup classification: if every BLOCKED in the delegated cold-start is a public route and the wrapper's own route retries have all recovered, that cold-start blocker is downgraded to an evidence warning, and the round's service recovery is no longer misjudged as blocked. A non-route blocker, or a failure that persists after retry, stays hard blocked.

2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence: at that time HOST_188_HYGIENE_BLOCKED=1 and NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export. This entry is only for comparing 188 before and after the repair; the current gate set must use the 12:13 baseline.

2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence: at that time the output was fixed at three P0 checklists. Since 12:13, dispatch output is generated dynamically from the live summary; the current expected NEXT_GATE_COUNT=2, leaving only credential escrow and Wazuh registry.

2026-06-26 08:29 owner-packet JSON baseline: scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color converts the dispatch output into schema_version=awoooi_post_reboot_next_gate_owner_packets_v1, containing three owner_packets with next_gate_count=3, p0_gate_count=3, request_sent_count=0, owner_response_received_count=0, owner_response_accepted_count=0, runtime_action_authorized_count=0. This JSON is an AI / operator / owner review intake; it is not an external request and not a maintenance-window approval.

2026-06-26 08:40 owner-packet contract guard baseline retained as historical pre-repair evidence: the old version locked three P0 gates. Since 12:13, the contract guard validates dynamically against source.next_required_gates, and the current expected success line is POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0. It returns to gates=3 only if 188 hygiene regresses in the future.
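To illustrate what a dynamic contract check of this kind might look like, here is a hedged sketch. It is not the real post-reboot-owner-packet-contract-guard.py. The JSON field names are the ones quoted in the baselines above; the function name, the assumption that source.next_required_gates is a list nested under a source object, and the error wording are hypothetical.

```python
from typing import Any, Dict, List

SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"

# Counters that must stay at zero for the packet to remain a review intake only.
ZERO_FIELDS = (
    "request_sent_count",
    "owner_response_accepted_count",
    "runtime_action_authorized_count",
)


def check_owner_packets(packet: Dict[str, Any]) -> str:
    """Return the success line, or raise ValueError listing every violation."""
    errors: List[str] = []
    if packet.get("schema_version") != SCHEMA:
        errors.append("unexpected schema_version")
    for field in ZERO_FIELDS:
        if packet.get(field) != 0:
            errors.append(f"{field} must be 0")
    # Assumed layout: the gate list lives under packet["source"].
    gates = packet.get("source", {}).get("next_required_gates", [])
    if packet.get("next_gate_count") != len(gates):
        errors.append("next_gate_count does not match source.next_required_gates")
    if len(packet.get("owner_packets", [])) != len(gates):
        errors.append("owner_packets length does not match gate list")
    if errors:
        raise ValueError("; ".join(errors))
    return (
        "POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK "
        f"gates={len(gates)} request_sent=0 accepted=0 runtime_gate=0"
    )
```

Deriving the expected gate count from the packet's own source list, rather than hard-coding 3, is what lets the same check pass at gates=2 today and at gates=3 after a regression.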

2026-06-26 08:47 Wazuh registry detail baseline: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color now includes the details of the Wazuh repo-side coverage / runtime gate as fixed key/value pairs: WAZUH_COVERAGE_SCOPE=6, WAZUH_DIRECT_ACTIVE=2, WAZUH_NO_TRANSPORT=1, WAZUH_SSH_BLOCKED=3, WAZUH_ROUTE_CODE=200, WAZUH_TRANSPORT_COUNT=6, WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning, WAZUH_DASHBOARD_INDEX_OK=3, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, WAZUH_RUNTIME_GATE=0. The wazuh_manager_registry_export gate of scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color places these states into CURRENT_EVIDENCE. Iron rule for interpretation: route 200, transport 6, and Dashboard index pattern 3 are none of them manager registry accepted; all-host enrollment and the Dashboard API repair still require owner evidence / registry export / acceptance record.

2026-06-26 08:59 declaration guard baseline retained as historical pre-repair evidence: at that time HOST_188_FULLY_GREEN was still forbidden. Since 12:13, the guard dynamically allows 188 host hygiene green based on HOST_188_HYGIENE_BLOCKED=0, but still rejects DR_COMPLETE, WAZUH_REGISTRY_RECOVERED, and RUNTIME_ACTION_AUTHORIZED.

2026-06-26 07:39 live quick-check refresh: scripts/reboot-recovery/post-start-quick-check.sh --no-color ran to completion; ping / SSH are OK on all four hosts; delegated cold-start is PASS=89 WARN=0 BLOCKED=0; the wrapper summary is POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0, warning split SERVICE=0 BOUNDARY=1 EVIDENCE=2, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. MOMO health is V10.701; daily snapshot 109061 rows / 2025-07-01..2026-06-24; current-month parity 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24; latest import job 57 completed. StockPlatform freshness status=ok, latest trading date 2026-06-25; price / chips / margin / AI recommendations are all on 2026-06-25. Backup-status at 07:39 shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite/rclone fresh, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. The extended public routes list all returned expected 2xx/3xx. 110 CPU attribution shows load of about 5.19 / 4.66 / 4.91 with CPU idle above 80% in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not orphan Chrome. Allowed declaration for this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green. Forbidden declaration: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.

2026-06-26 07:19 follow-up: gitea/main already contains the previous round's SOP documentation commit 1fd5e2a8; ArgoCD awoooi-prod reads back Synced / Healthy at revision 1fd5e2a8b0f18d24eed16aa2a44286bcbf230603; API 2/2, Web 2/2, Worker 1/1; pods restart=0. Re-running the full cold-start still gives PASS=87 WARN=0 BLOCKED=0, result GREEN. Direct public route readback: AWOOOI API 200, AWOOOI Web 307, VibeWork 200, AwoooGo 200, MOMO health 200, Stock freshness 200, Bitan 200, Gitea 200, Harbor 200, Registry /v2/ expected 401, Sentry expected 302, SigNoz 200, Langfuse 200. Precise classification of the 188 blocker: pg_lsclusters shows host PostgreSQL 14/main down; systemctl status postgresql@14-main shows invalid primary checkpoint record and PANIC: could not locate a valid checkpoint record; certbot.service shows the sentry.wooo.work renew rate-limited; snap.certbot.renew.service shows challenge failed; awoooi-startup.service once attempted to run pg_resetwal as root and failed. This round does not run pg_resetwal, does not reset-failed, and does not restart any service. 188 must be handled in a separate maintenance window with a rollback owner and a restore/source-of-truth plan; see docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md. You may first run scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color to get a read-only preflight. 110 load has dropped to about 4.83 / 4.82 / 5.52; top CPU is an active AWOOOI Web turbo build / Docker buildx; swap is still full but memory available is about 41Gi, so swap is not cleared manually this round. The overall declaration remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED.

2026-06-26 07:02 all-host live refresh: 110 / 120 / 121 / 188 / 112 / 111 / 168 ping and SSH port are all OK. 110 systemctl=running with 0 failed units, but load is 5.83 / 7.26 / 5.77 and top CPU is the AWOOOI Web next build; swap is still 7.8Gi / 7.8Gi. This is CI/build pressure, not orphan Chrome or a Docker incident. 120 / 121 systemctl=running, K3s active, nodes mon / mon1 both Ready. ArgoCD awoooi-prod was briefly OutOfSync / Progressing at 06:57 because the deploy marker 52f61da4 rollout was replacing API/Web/Worker; after 07:00 it stabilized at Synced / Healthy with API 2/2, Web 2/2, Worker 1/1, and API/Web still span mon / mon1. Re-run live cold-start: PASS=87 WARN=0 BLOCKED=0, result GREEN. StockPlatform /api/v1/system/freshness briefly returned 502 about 35 seconds after the container restarted; subsequent consecutive reads all returned 200 with status=ok, latest_trading_date=2026-06-25, blockers []. This kind of rollout warmup counts as a blocker only on consecutive failures. MOMO health is V10.699; cold-start direct evidence still shows current-month parity 15383 / 15383 through 2026-06-24 and daily freshness 1|2026-06-24. Backup status at 06:58: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite/rclone fresh, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. 188 product containers are healthy, but host systemctl=degraded is still a real host hygiene blocker: awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service are failed. 112 Wazuh manager/indexer/dashboard are active and ports 1514 / 1515 / 55000 are listening, but the production Wazuh route still reports disabled_waiting_iwooos_wazuh_owner_gate, configured=false, manager registry accepted 0, runtime gate 0. 111 / 168 are reachable, but the AWOOOI dev workspaces on both are ahead 17 with different HEADs (111=56c83257, 168=59485d51), and the Mac Mini /System/Volumes/Data has only about 3.2Gi left. The current service recovery declaration remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED, with host hygiene / DR escrow / Wazuh registry / workstation capacity explicitly listed as blockers outside service green.

2026-06-26 06:50-06:55 188 host hygiene read-only triage: 188 product services remain green, but host systemctl is still degraded and must not be smoothed into full host green. Failed units are awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service. Evidence shows the host PostgreSQL cluster 14/main is down in pg_lsclusters, while product DB / exporters still respond through containerized services; therefore pg_isready or pg_up=1 cannot substitute for host cluster health. The 188 startup service detected could not locate a valid checkpoint record on 2026-06-23 and attempted pg_resetwal as root, which failed; v1.63 treats PostgreSQL checkpoint/WAL errors as break-glass only and the repo-side startup script now fails closed instead of running pg_resetwal. Certbot renew for sentry.wooo.work is also failing and hit ACME rate-limit / challenge failure, but the public cert is still valid until 2026-07-09 16:03:40 UTC. Current declaration: SERVICE_GREEN_HOST_HYGIENE_BLOCKED for 188, while overall service recovery remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED.

2026-06-26 06:40-06:44 all-host read-only refresh: 110 / 120 / 121 / 188 / 112 / 111 / 168 ping and SSH port are all OK. The core reboot scope remains green: 110 systemctl=running, failed units 0, Docker / Gitea / Harbor / Prometheus / Alertmanager available; 120 / 121 systemctl=running, failed units 0, K3s nodes mon / mon1 Ready; 188 product containers and PostgreSQL / Redis / MOMO / SigNoz available. ArgoCD awoooi-prod has converged from the earlier degraded state to Synced / Healthy at revision b2945ab9f716d9d685434ae0e67b9318414b27fe; the km-vectorize official 03:00 Taipei-time run succeeded, lastSuccess=2026-06-25T19:00:14Z. Public routes for AWOOOI / VibeWork / AwoooGo / MOMO / Stock / Bitan / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse return expected statuses; AWOOOI API health is healthy / prod / mock_mode=false; MOMO health is V10.690; StockPlatform freshness is status=ok, latest_trading_date=2026-06-25, blockers []; backup-status remains core green with escrow_missing=5. Boundaries: 188 host still has failed units awoooi-startup.service, certbot.service, postgresql@14-main.service, snap.certbot.renew.service that require host hygiene cleanup; 112 Wazuh services / ports are active but Wazuh manager registry accepted remains 0; 111 / 168 Codex workspaces are reachable but have different local HEADs on the same ahead branch; Mac Mini free space is about 3.4Gi. Current service verdict remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED, not DR_COMPLETE or Wazuh recovered.

2026-06-26 06:26-06:28 next-day read-only refresh: four-host ping/SSH OK; cold-start PASS=89 WARN=0 BLOCKED=0; MOMO V10.690 with latest import job 57 completed; StockPlatform /api/v1/system/freshness still status=ok / latest_trading_date=2026-06-25 / blockers []; backup-status 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite_fresh=1, rclone_gdrive_fresh=1, last_backup_all=2026-06-26 02:31:02, escrow_missing=5. In the first pass of the 06:26 full wrapper, https://awoooi.wooo.work/zh-TW/iwooos and https://vibework.wooo.work/ each showed a single 000, but an independent curl immediately returned 200 and the route-only wrapper also returned PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN. v1.61 therefore changes the public route gate to retry up to 3 times: only consecutive failures count as BLOCKED, and recovery after a retry is listed as an evidence warning. At 06:28 the core wrapper with routes skipped returned POST_START_QUICK_CHECK PASS=15 WARN=2 BLOCKED=0, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. No Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write operations were performed in this round.
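The v1.61 retry rule can be sketched as a three-way classifier. This is an illustrative sketch under stated assumptions, not the wrapper's actual code: the function name, the injectable probe callable, and the label EVIDENCE_WARNING are hypothetical, and a real probe would normally pause between attempts.

```python
from typing import Callable, Set


def classify_route(
    probe: Callable[[], int],
    expected: Set[int],
    max_attempts: int = 3,
) -> str:
    """Classify one public route.

    probe() returns an HTTP status code (0 standing in for curl's 000).
    First-try success is PASS, success on a later attempt is an evidence
    warning, and only max_attempts consecutive failures are BLOCKED.
    """
    for attempt in range(1, max_attempts + 1):
        if probe() in expected:
            return "PASS" if attempt == 1 else "EVIDENCE_WARNING"
    return "BLOCKED"
```

Passing the probe in as a callable keeps the classification logic testable without network access; for example, a route that answers 000 once and then 200 classifies as EVIDENCE_WARNING rather than BLOCKED.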

2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. After waiting for official schedules instead of manual ingestion, intelligence-sync 21:00 finished status=0, core.margin_short_daily reached 2026-06-25 / 1976 rows, and ai-recommendation-pipeline 21:10 finished STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25 with draft_count=120, candidate_count=120, and rag_documents=1000. StockPlatform /api/v1/system/freshness now returns status=ok, latest_trading_date=2026-06-25, blockers [], with price / chips / margin / AI recommendations all on 2026-06-25. The 21:14 full wrapper returns cold-start PASS=89 WARN=0 BLOCKED=0 and overall POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. The only remaining recovery red gate is DR credential escrow evidence escrow_missing=5; Wazuh manager registry accepted remains 0 as a security evidence blocker, not a reboot service blocker.

2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two stockplatform-review-bulk-ux Chrome process groups 2756503 and 2829627 with root Chrome process PPID=1, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted SIGTERM at 20:24. Post-check showed no remaining PGID entries; vmstat showed CPU idle around 85-90%, si/so=0, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start PASS=89 WARN=0 BLOCKED=0, but overall POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1, RESULT=BLOCKED, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.
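The criteria used in this cleanup (root process with PPID=1, long elapsed time, no active parent smoke, sustained CPU) can be written down as a read-only triage function. This is a hedged sketch: the dataclass, the function name, and in particular the numeric thresholds are hypothetical and are not stated in this SOP. The function never signals a process; it only says whether a group should be observed or escalated for dry-run and approval.

```python
from dataclasses import dataclass


@dataclass
class ProcessGroup:
    pgid: int
    root_ppid: int             # PPID of the group's root process
    elapsed_hours: float       # how long the root process has been running
    cpu_percent: float         # sustained CPU of the busiest process in the group
    parent_smoke_active: bool  # True if a Playwright / CI parent is still running


def triage(
    group: ProcessGroup,
    min_elapsed_hours: float = 4.0,  # hypothetical threshold
    min_cpu_percent: float = 20.0,   # hypothetical threshold
) -> str:
    """Return OBSERVE or DRY_RUN_THEN_REQUEST_APPROVAL; never acts on the process."""
    # Active children of a running smoke/CI job are left alone.
    if group.parent_smoke_active or group.root_ppid != 1:
        return "OBSERVE"
    if group.elapsed_hours >= min_elapsed_hours and group.cpu_percent >= min_cpu_percent:
        return "DRY_RUN_THEN_REQUEST_APPROVAL"
    return "OBSERVE"
```

Keeping the decision separate from the signal means the targeted SIGTERM still happens only after a human approves the listed PGIDs.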

2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea main and live /home/wooo/stockplatform-v2 are now at fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints; six missing production cron entrypoint scripts are restored, run-intelligence-sync.sh contains the Docker-backed psql shim, and live contract check confirms every scripts/ops/*.sh referenced by install-production-cron.sh exists. The only live write performed for StockPlatform recovery was a fast-forward git pull --ff-only origin main on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: source-remediation-queue 19:56 and 20:00 succeeded, market-index-ingestion 20:00 succeeded, price-ingestion 20:02 succeeded, margin-short-ingestion 20:05 succeeded, chips-ingestion 20:06 succeeded, and ai-recommendation-pipeline 20:10 ran but correctly produced the internal blocker core_margin_short_daily_incomplete,official_margin_short_daily_official_pending. StockPlatform /api/v1/system/freshness therefore still returns status=blocked because the 2026-06-25 official margin-short source is pending and ai.recommendations must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.

2026-06-25 19:35 product-version / data-freshness refresh supersedes the 19:06 data-complete wording. Host boot, K3s, AWOOOI runtime, MOMO service/data, backup/offsite, Bitan cleanliness, and expanded public routes are available, but the stricter post-start wrapper now checks StockPlatform /api/v1/system/freshness and correctly returns RESULT=BLOCKED when product data is not current. The 19:35 lightweight wrapper run used --skip-cold-start --skip-backup --skip-cpu after the 19:24 full host/cold-start/backup readback and returned PASS=31 WARN=1 BLOCKED=1, with the single blocker StockPlatform freshness is blocked: core_margin_short_daily_missing,ai_recommendations_stale. stock.wooo.work, /healthz, and /api/healthz all return 200; public routes now covered by the wrapper include AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Do not declare "all products and data are latest" until StockPlatform freshness is ok; keep DR blocked until escrow_missing=0.

2026-06-25 19:06 post-CD live read-only refresh supersedes the 18:53 wrapper wording. Consecutive main pushes caused older deploy markers to be replaced, so the latest production truth is deploy marker d8ca8224 chore(cd): deploy 9dbe044 [skip ci]. Read-only ArgoCD shows awoooi-prod Synced / Healthy at revision d8ca822422021d0fda8da8fa4c354c0c4db7ff22; API/Web/Worker live image tag 9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be; API 2/2, Web 2/2, Worker 1/1. The 19:05 post-start quick check returns RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, delegated cold-start remains PASS=89 WARN=0 BLOCKED=0, and 19:05-19:06 route stability checks confirm AWOOOI API, IwoooS, AwoooGo, Stock, VibeWork, Bitan, and MOMO health all return 200 for five consecutive external reads. Earlier AwoooGo / Stock 502 reads were post-deploy upstream warmup transients, not persistent service failures. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on V10.690; latest import job 57 completed cleanly; MOMO_DAILY_FRESHNESS 1|2026-06-24; current-month daily snapshot and realtime tables match through 2026-06-24. post-start-quick-check.sh parses cold-start PASS / WARN / BLOCKED summary before classifying exit codes, so WARN-only rollout/stale evidence is no longer inflated into a service blocker. The wrapper returns RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED when service blockers are zero but escrow_missing=5 remains. Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh production routes are now 200 disabled_waiting_iwooos_wazuh_owner_gate, but configured=false, manager query accepted 0, manager registry accepted 0, and runtime gate 0; treat Wazuh as a security registry evidence blocker, not a reboot service blocker.

Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: 2026-06-25 19:05 wrapper delegated cold-start PASS=89 WARN=0 BLOCKED=0, Result=GREEN.
Post-start quick check: 2026-06-25 21:14 PASS=38 WARN=2 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. Cold-start layer remains GREEN and StockPlatform freshness is now OK; DR remains blocked by credential escrow evidence.
Repo-side cold-start v1.42+ live read-only run: MOMO source absence / stale data blocker is cleared by import job 57 and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Live 110 script sync is not claimed until a separate approved deployment/sync happens.
110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
Service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, StockPlatform data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared, and 110 orphan StockPlatform Chrome smoke groups cleared by targeted approved SIGTERM. StockPlatform production cron source drift is repaired and verified by natural cron runs; product-data completeness is now green for the 2026-06-25 evidence set.
Runtime release state: API/Web/Worker live image tag is `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, and 19:06 K3s readback shows API/Web/Worker pods Running; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 19:05 route smoke returned 200 for AWOOOI API, IwoooS, MOMO health, and Stock; cold-start route gate also returned expected statuses for AWOOOI web, MOMO, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, Bitan, and AIOps. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan then returned 200 for five consecutive external route reads from 19:05:38 to 19:06:24. 19:35 expanded route readback returned expected 2xx/3xx for AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
MOMO release state: mo.wooo.work health is healthy on version V10.690. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 18:23 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain recent container-replace / scheduler fail-closed / notification evidence notes, but no service blocker remains.
MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly match through 2026-06-24: `daily_sales_snapshot=109061|2025-07-01|2026-06-24`, `MOMO_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Latest import job is `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
StockPlatform data state: `/api/v1/system/freshness` returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`. Current OK sources include `core.price_daily` 2026-06-25 / 1976 rows, `core.chips_daily` 2026-06-25 / 1976 rows, `core.margin_short_daily` 2026-06-25 / 1976 rows, `core.market_index_daily.tw` 2026-06-25 / 2 rows, `core.market_index_daily.global` 2026-06-25 / 2001 rows, and `ai.recommendations` 2026-06-25 / 2868 rows. The 21:10 natural AI pipeline produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`; no manual ingestion or DB write was performed.
Product version readback: StockPlatform live source `/home/wooo/stockplatform-v2` matches Gitea `wooo/stockplatform-v2.git` main `fb91aa4c6272469d1d26e0820169629eac17d28a`; VibeWork live image `192.168.0.110:5000/vibework/web:76a4ee15026af278a3660ad4b4547e9308b107be` matches Gitea `wooo/vibework.git` main `76a4ee15026af278a3660ad4b4547e9308b107be`; AwoooGo live source `/home/wooo/awooogo` matches Gitea `wooo/AwoooGo` main `6897972e9820cbb96c508fa9a80c66246c973307`; MOMO runtime uses `registry.wooo.work/wooo/momo-pro-system:stable` image id `df931906e158` created `2026-06-25T13:28:59+08:00`, while Gitea `wooo/momo-pro-system.git` main is `25120cbf21ba51affc94d0220ec87e607f59a833`; 188 runtime directory is a compose/image deployment path, not a git worktree, so add image revision label evidence before declaring code-image parity.
Google Drive / source-file state: 14:16 cold-start reports `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`. Dedicated preflight confirms host token metadata matches scheduler UID and restrictive mode; container token artifact exists with mode `600`. Token content was not read. Future Drive auth/API failure must still be treated as failed import evidence rather than no-file success.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 20:24 readback found a second recurrence with orphan process groups `2756503` and `2829627`, root Chrome `PPID=1`, elapsed about 5h, no active parent smoke, GPU process CPU around 96%, and renderer CPU around 22%; approved targeted `SIGTERM` cleared both PGIDs. 21:14 CPU attribution shows current load is dominated by an active AWOOOI Web `next build` process group and its worker processes, not orphan Chrome. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
Backup / monitoring state: 19:05 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
Route transient handling: post-deploy `502` on Stock or AwoooGo is a blocker only if it persists after upstream container health is ready and 3-5 consecutive external route reads still fail. For AwoooGo, live upstream is on 110 `192.168.0.110:32190`; do not test only `127.0.0.1` on 110 because the listener may bind the host address. For K3s workload balancing, wait for terminating pods to disappear before judging API/Web placement; final required state for two-replica API/Web is split across `mon` and `mon1`.
Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
Deploy storm / CD replacement state: if several main commits land during recovery, older CD runs may be canceled by newer commits. Do not treat the canceled run as a service failure. Wait for the final deploy marker, verify live image tags, ArgoCD health, public routes, DB freshness, backup status, and post-start quick check before declaring latest production recovered.
Wazuh / SOC boundary state: production Wazuh read-only route presence is not equivalent to Wazuh registry recovery. `/api/iwooos/wazuh` and `/api/v1/iwooos/wazuh` returning `200 disabled_waiting_iwooos_wazuh_owner_gate` only proves the route boundary is deployed; manager registry accepted, owner evidence accepted, active response, host write, agent re-enroll, restart, secret patch, Kali active scan, and runtime gate remain `0 / false`.
Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom.
Allowed declaration: host boot, core service readiness, K3s, public route availability, AWOOOI API health, MOMO service health/data freshness, Bitan public-content cleanliness, and backup/offsite readiness are green for the latest read-only evidence set.
Forbidden declaration: all product data latest, StockPlatform data freshness green, DR complete, credential escrow complete, Wazuh host registry accepted, 110 live monitor synced, or runtime/security acceptance. Credential escrow evidence is still missing and StockPlatform freshness is blocked; neither may be smoothed into green.
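
The route transient handling rule in this baseline can be sketched as a small check. This is a minimal sketch, not the wrapper's real logic: `route_is_blocker`, the `read_status` callable, and the "2xx means recovered" test are assumptions made for illustration; the real success codes live in the YAML baseline.

```python
from typing import Callable


def route_is_blocker(
    upstream_healthy: bool,
    read_status: Callable[[], int],
    attempts: int = 5,
) -> bool:
    """Return True only when a post-deploy route failure should count as a blocker.

    Hypothetical helper: a failure is a blocker only if the upstream container
    is already healthy AND every one of `attempts` consecutive external reads
    still fails (the SOP asks for 3-5 reads).
    """
    if not upstream_healthy:
        # Upstream is not ready yet, so the 502 is still a warmup symptom.
        return False
    for _ in range(attempts):
        status = read_status()
        if 200 <= status < 300:  # assumption: any 2xx counts as recovered
            return False
    return True
```

`read_status` should perform the external route read (not a `127.0.0.1` probe on 110), so the listener-binding caveat above still applies.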

2026-06-24 22:17 Codex workstation continuity readback:

MacBook Pro 192.168.0.111 can authenticate to Gitea over SSH with its own public key named MacBook Pro Codex 20260624.
MOMO Pro Mac Mini workspace is /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty=0.
MOMO Pro MacBook workspace is /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty=0.
MOMO import-boundary regression: pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q => 10 passed.
MOMO production release: Gitea main and cd.yaml #904 are at 84035906aba0e5e190d031a13cfd9b47a8cd1f73; 188 live source marker proves production deploy.
Codex Start Here / workstation dashboard / scorecard safe artifacts were copied to MacBook Pro; latest artifact dashboard readback is refreshed after the docs closeout commit. Raw Codex App DB, auth, sessions, raw conversations, .env, runtime volumes, raw .git directories, passwords, tokens, and Mac Mini private keys were not copied.
AwoooGo MacBook dev workspace remains ready at /Users/ooo/codex-workspaces/awooogo-dev, branch dev, upstream gitea/dev, commit 8471b376d97c1436d4612ece17f51ba0950f114d, dirty=0.
Safe handoff artifacts still match by local / remote SHA-256 readback after Start Here / workstation dashboard / scorecard refresh. Exact hash values are intentionally not hard-coded in this runbook because they change whenever handoff artifacts are refreshed.
This improves workstation continuity after host reboot / operator relocation, and the MOMO import-boundary fix is now production-deployed; it does not change service cold-start status: full-stack green remains blocked by MOMO data freshness and DR remains blocked by credential escrow evidence.

2026-06-18 12:17 live readback supersedes older service-availability wording:

Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=83 WARN=1 BLOCKED=0, Result=DEGRADED.
Service state: SERVICE_AVAILABLE_DEGRADED; 110/120/121/188 reachable, K3s mon/mon1 Ready, NODE_FS_ERROR_EVENTS=0, public routes/TLS green, 110/188 backup health fresh.
Rollout state after transient 12:14 startup window: awoooi-api 2/2, awoooi-web 2/2, worker 1/1, canary 1/1, public API health 200 healthy.
Only live warning: retained stale K8s Job km-vectorize-29689620 from 2026-06-14 03:00. Later official km-vectorize Jobs 29692500 / 29693940 / 29695380 are Complete.
Allowed declaration: services are available with one stale failed Job warning.
Forbidden declaration: full cold-start green, DR complete, or runtime/security acceptance.

2026-06-18 13:43 live readback supersedes the stale-Job warning wording:

Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=84 WARN=0 BLOCKED=0, Result=GREEN.
Service state: FULL_STACK_GREEN_FOR_SERVICE; 110/120/121/188 reachable, K3s mon/mon1 Ready, NODE_FS_ERROR_EVENTS=0, public routes/TLS green, 110/188 backup health fresh.
K8s Job classification: FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0. The retained km-vectorize failure stays as evidence but no longer blocks service readiness after later official successful Jobs.
Allowed declaration: full cold-start service readiness is green for this evidence set.
Forbidden declaration: DR complete or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.

2026-06-18 14:31 live runaway-process readback supersedes repo-only AIOps wording:

110 host runaway process exporter is live-installed and scraped.
Textfile source: /home/wooo/node_exporter_textfiles/host_runaway_process.prom.
Prometheus readback: monitor_up=1, orphan_browser_groups=0 for headless_browser_smoke and stockplatform_headless_smoke, active Gitea Actions containers=2, load5_per_core around 0.79-0.81, swap_used_ratio around 1.0, remediation_authorized=0.
Alerts: HostRunawayProcessMonitorMissing is not firing; HostOrphanBrowserSmokeHighCpu is not firing.
Allowed declaration: runaway Chrome/smoke recurrence guard is live and scraped.
Forbidden declaration: AI runtime remediation is enabled. Remediation remains gated and must not execute without owner approval, maintenance window, evidence ref, dry-run, and post-check.
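
The five remediation preconditions named above (owner approval, maintenance window, evidence ref, dry-run, post-check) can be expressed as a fail-closed check. This is a minimal sketch under stated assumptions: `RemediationRequest` and its field names are hypothetical and are not part of the exporter or the AIOps product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RemediationRequest:
    """Hypothetical record of the evidence attached to a remediation request."""
    owner_approval: bool
    maintenance_window: bool
    evidence_ref: str
    dry_run_done: bool
    post_check_planned: bool


def missing_gates(req: RemediationRequest) -> List[str]:
    """List every precondition that is not satisfied, in SOP order."""
    checks = [
        ("owner approval", req.owner_approval),
        ("maintenance window", req.maintenance_window),
        ("evidence ref", bool(req.evidence_ref.strip())),
        ("dry-run", req.dry_run_done),
        ("post-check", req.post_check_planned),
    ]
    return [name for name, ok in checks if not ok]


def remediation_authorized(req: RemediationRequest) -> int:
    """Mirror the 0/1 style of remediation_authorized: 0 unless every gate passes."""
    return 0 if missing_gates(req) else 1
```

Returning the list of missing gates, rather than a bare boolean, keeps the report explicit about which evidence is still absent.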

2026-06-18 14:51 production event-packet readback:

Host runaway alert-to-event packet is deployed in production.
Deploy marker: 2d278568 chore(cd): deploy f358a0f [skip ci].
Runtime image: awoooi-api / awoooi-web / awoooi-worker use f358a0f6c3e614e407dedb6eee89bf10b2bc8173.
ArgoCD readback: awoooi-prod Synced / Healthy.
Alert mapping: HostOrphanBrowserSmokeHighCpu -> orphan_browser_smoke_runaway_process; HostCiRunnerLoadSaturation -> ci_runner_load_saturation.
Allowed declaration: monitoring, alert rules, live scrape, AI event packet routing, PlayBook / KM contract, and production deployment are complete for this evidence set.
Forbidden declaration: Telegram send, Bot API call, Gateway queue write, process kill, Docker/systemd restart, Nginx reload, firewall/K8s action, or runtime remediation is authorized.

2026-06-18 16:08 P3-009 Host Runaway AIOps product readback:

Host runaway AIOps closed-loop read model is deployed in production.
Deploy marker: 42c08ece chore(cd): deploy 27143fb [skip ci].
API endpoint: /api/v1/agents/agent-host-runaway-aiops-loop-readiness.
Production readback: schema_version=host_runaway_aiops_loop_readiness_v1, current_task_id=P3-009, next_task_id=P3-010, completion=100, loop_stage_count=6, alert_lane_count=2, asset_writeback_contract_count=5.
Host 110 live readback in the model: orphan browser groups=0, active CI containers=2, remediation_authorized=0, runtime/write counters=0.
Governance route: /zh-TW/governance?tab=automation-inventory shows P3-009 on desktop 1440x1100 and mobile 390x844 with missing text=0, console/page errors=0, horizontal overflow=false.
Allowed declaration: monitoring, alert rules, AI event packet, PlayBook / KM contract, Verifier/writeback contract, gated remediation dry-run boundary, and product-visible readback are complete for this evidence set.
Forbidden declaration: AI runtime remediation is enabled. Process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain forbidden without owner approval, maintenance window, evidence ref, dry-run, and post-check.

| Item | 2026-06-24 11:35 Asia/Taipei live result | Verdict |
|---|---|---|
| Overall recovery readiness | 98% | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED |
| P0 host / K3s recovery | 100% | DONE |
| P1 backup / alert / escrow | 96% | BLOCKED_DR_ESCROW |
| P2 service / data truth | 96% | BLOCKED_MOMO_DATA_FRESHNESS |
| P3 docs / automation contracts | 100% | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY |
| 110 host runtime | fwupd-refresh.timer intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; systemctl --failed returns 0 loaded units listed; rollback is sudo systemctl enable --now fwupd-refresh.timer | GREEN_WITH_FWUPD_TIMER_DISABLED |
| 110 host runaway process guard | 14:31-14:32 live scrape confirms monitor_up=1, orphan browser groups 0, active Gitea Actions containers 2, load5_per_core≈0.79-0.81, swap_used_ratio≈1.0, and remediation_authorized=0; exporter/helper also remain in Ansible textfile exporter source-of-truth. | LIVE_SCRAPED_RUNTIME_GATE_0 |
| 120 reachability | ping OK, SSH OK, boot around 2026-06-14 02:23, K3s active, node mon Ready | GREEN |
| 121 reachability | ping OK, SSH OK, failed units 0 | GREEN |
| 188 host runtime | production services green, but host systemctl degraded by awoooi-startup.service, postgresql@14-main.service, certbot.service, and snap.certbot.renew.service; host PostgreSQL cluster 14/main is down while product DB containers/exporters are healthy; certbot renewal for shared sentry.wooo.work certificate is failing but public cert is still valid until 2026-07-09 UTC | SERVICE_GREEN_HOST_HYGIENE_BLOCKED |
| K3s node state | mon Ready control-plane, mon1 Ready control-plane; bad pods 0; FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0 | GREEN_WITH_RETAINED_EVIDENCE |
| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale known_hosts; backup /home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416; final repair backup /home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949; CD fix 80e6ec1a moves deploy trust to /home/wooo/.ssh/deploy_known_hosts; 01:28 global known_hosts still contains 120 / 188 and was not clobbered by deploy marker e4a349bc | GREEN_WITH_GUARDRAIL |
| Backup status | 11:20 status: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1; escrow readback still shows ESCROW_MISSING_COUNT=5 | GREEN_WITH_DR_ESCROW_WARNING |
| Offsite sync / verify | 01:28 textfile: awoooi_backup_offsite_remote_verify_ok=1, full_verify_fresh=1, all 13 repos have snapshot_count=1 and snapshot_latest_only=1; latest scheduled verifier log is 2026-06-12 07:20 | GREEN |
| Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | GREEN_WITH_EXPECTED_REDLIGHTS |
| Cold-start scorecard | 11:35 read-only scorecard: PASS=86 WARN=0 BLOCKED=1. Public routes / TLS, momo DB parity, backup exporters, 120/121 K3s, MinIO / Velero, and AWOOOI API/Web all pass; the only blocker is MOMO data freshness. | BLOCKED_MOMO_DATA_FRESHNESS |
| momo DB parity | `10936 10936 | |
| momo scheduler | container healthy; Drive listing from container works; pending folder 當日業績匯入 count is 0 for 即時業績_當日; no current Permission denied evidence in the latest readback | GREEN_WITH_SOURCE_ABSENT |
| ArgoCD app health | 11:35 readback: awoooi-prod sync Synced, health Healthy, source revision 7db7800e399caed5487a705c81ec993dec76c70f; API/Web/Worker ready. | GREEN |
| Workload balancing | Live API/Web/Worker/CronJob image is e999c16b3435f197b78fe2adfeec1c4faa6c4675; API/Web pods remain split across mon / mon1, Worker single replica remains healthy on mon | GREEN |
| Credential escrow | 5 non-secret evidence markers missing | BLOCKED |

Release rule:

Do not declare full cold-start green unless the latest scorecard has `WARN=0` and `BLOCKED=0`.
Do not declare aggregate backup green unless latest `backup-status` has `core_blockers=0`.
Do not declare DR scorecard complete while credential escrow markers are missing.
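
The three release rules above can be checked mechanically from a scorecard line plus two counters. The following is a minimal sketch: `parse_scorecard` and `forbidden_declarations` are hypothetical helpers, and the assumption is that the scorecard line always carries `PASS=`, `WARN=`, and `BLOCKED=` tokens in the form quoted throughout this section.

```python
import re
from typing import Dict, List


def parse_scorecard(line: str) -> Dict[str, int]:
    """Extract PASS / WARN / BLOCKED counts from a line such as 'PASS=86 WARN=0 BLOCKED=1'."""
    found = dict(re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line))
    missing = {"PASS", "WARN", "BLOCKED"} - set(found)
    if missing:
        # Fail closed: an incomplete scorecard line must never be read as green.
        raise ValueError(f"scorecard line is missing {sorted(missing)}")
    return {key: int(value) for key, value in found.items()}


def forbidden_declarations(scorecard_line: str, core_blockers: int, escrow_missing: int) -> List[str]:
    """Return the declarations that the release rule forbids for this evidence set."""
    counts = parse_scorecard(scorecard_line)
    forbidden = []
    if counts["WARN"] != 0 or counts["BLOCKED"] != 0:
        forbidden.append("full cold-start green")
    if core_blockers != 0:
        forbidden.append("aggregate backup green")
    if escrow_missing != 0:
        forbidden.append("DR scorecard complete")
    return forbidden
```

An empty list only means that none of these three rules forbids a declaration; it does not by itself prove DR completeness or any other acceptance.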

2026-06-14 18:15 live rule:

110 / 120 / 121 / 188 core service recovery remains available, but the latest 18:15 scorecard is DEGRADED because `WARN=1`.
GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
NO-GO for "full cold-start green" until `km-vectorize` failed Job is cleared by an official successful run.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
Current allowed wording: "core service and backup are available; 110 failed units are cleared after intentionally disabling `fwupd-refresh.timer`; recovery readback after the high-value config Owner Packet front-end sync shows no service regression; cold-start is degraded only by the `km-vectorize` official Job failure; DR complete still blocked by credential escrow; `km-vectorize` failed Job is retained but failed Pod/log are currently absent, so the next official 03:00 run remains the evidence gate."
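
The "max skew <= 1" placement claim in the rule above is simple arithmetic over pod counts per node. A minimal sketch, assuming the pod-to-node placement has already been read back as a plain list of node names (the `max_skew` helper and the sample placements are hypothetical):

```python
from collections import Counter
from typing import Iterable, Sequence


def max_skew(pod_nodes: Iterable[str], nodes: Sequence[str]) -> int:
    """Difference between the most and least loaded node for one workload.

    `nodes` must list every eligible node so that a node with zero pods
    still counts toward the minimum.
    """
    counts = Counter(pod_nodes)
    per_node = [counts.get(node, 0) for node in nodes]
    return max(per_node) - min(per_node)


# Two API replicas split across mon / mon1: skew 0, which satisfies <= 1.
balanced = max_skew(["mon", "mon1"], ["mon", "mon1"])
# Both replicas on mon: skew 2, so "workload balanced" must not be declared.
unbalanced = max_skew(["mon", "mon"], ["mon", "mon1"])
```

As noted in the route transient rule, the placement should be read only after terminating pods have disappeared, otherwise the counts are misleading.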

2026-06-18 12:17 live rule:

GO for controlled service availability: PASS=83 WARN=1 BLOCKED=0, public routes/TLS green, API health 200 healthy, API/Web/Worker/Canary ready after rollout convergence.
GO for repo-side reboot readiness mechanism: readiness audit PASS=185 WARN=1 BLOCKED=0; only skipped live gate warning before the live check was run.
NO-GO for "full cold-start green" until the retained stale failed Job evidence is either cleared by normal K8s history policy or explicitly accepted by an owner-provided readback package.
NO-GO for "DR complete" while credential escrow evidence markers remain missing.
Do not delete the failed Job manually during routine SOP verification. Keep it as evidence unless an approved maintenance window explicitly authorizes cleanup.
Current allowed wording: "SOP / Plan B / automation contracts are complete; live services are available with one retained stale km-vectorize failed Job warning; hard blockers are zero; DR remains blocked by credential escrow evidence."

2026-06-18 13:43 live rule:

GO for full cold-start service readiness for this evidence set: PASS=84 WARN=0 BLOCKED=0.
GO for controlled runner/CD release under the normal security gates; this is not a bypass for owner response, runtime writer, Telegram, Gateway, K8s, Docker, Nginx, firewall, or secret operations.
GO for retaining stale failed Job evidence: FAILED_JOBS=1 and STALE_FAILED_JOBS=1 are allowed when ACTIVE_FAILED_JOBS=0 and later official successful Jobs exist.
NO-GO for DR complete while credential escrow evidence markers remain missing: ESCROW_MISSING_COUNT=5.
NO-GO for deleting retained failed Jobs during routine verification. Cleanup requires an explicit maintenance window and owner acceptance.
Current allowed wording: "full-stack service recovery is green for the current evidence set; stale km-vectorize failure is retained as historical evidence, not an active blocker; DR complete remains blocked by credential escrow evidence."
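
The FAILED_JOBS / STALE_FAILED_JOBS / ACTIVE_FAILED_JOBS split used in the 13:43 rule can be sketched as follows. This is an illustration only: `JobRun` and `classify_failed_jobs` are hypothetical, all runs are assumed to belong to the same CronJob and to be official scheduled runs, and the timestamps of the later successful runs are illustrative rather than read back from the cluster.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class JobRun:
    name: str
    started: datetime
    succeeded: bool


def classify_failed_jobs(runs: List[JobRun]) -> Dict[str, int]:
    """A failed run is 'stale' when a later run of the same CronJob succeeded."""
    latest_success = max((r.started for r in runs if r.succeeded), default=None)
    failed = [r for r in runs if not r.succeeded]
    stale = [r for r in failed if latest_success is not None and r.started < latest_success]
    return {
        "FAILED_JOBS": len(failed),
        "STALE_FAILED_JOBS": len(stale),
        "ACTIVE_FAILED_JOBS": len(failed) - len(stale),
    }


runs = [
    JobRun("km-vectorize-29689620", datetime(2026, 6, 14, 3, 0), succeeded=False),
    JobRun("km-vectorize-29692500", datetime(2026, 6, 16, 3, 0), succeeded=True),  # illustrative time
    JobRun("km-vectorize-29695380", datetime(2026, 6, 18, 3, 0), succeeded=True),  # illustrative time
]
summary = classify_failed_jobs(runs)
```

With these inputs the retained failure counts as stale, which matches the rule that it stays as evidence without blocking service readiness.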

After any future 120 recovery, rerun this exact chain from 110:

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
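
The chain above could be driven by a small wrapper that runs the steps in order. This is a sketch, not an existing script: `run_chain` is hypothetical, and stopping at the first non-zero exit code is an assumption (the SOP only says to rerun the exact chain). The commands and flags are copied from the list above.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

CHAIN: List[List[str]] = [
    ["/backup/scripts/backup-configs.sh"],
    ["/backup/scripts/backup-all.sh"],
    ["/backup/scripts/sync-offsite-backups.sh", "--mode", "sync"],
    ["/backup/scripts/verify-offsite-full-sync.sh", "--write-textfile", "--no-color"],
    ["/home/wooo/scripts/full-stack-cold-start-check.sh", "--monitor-read-only",
     "--no-color", "--watch", "--interval", "1", "--max-attempts", "1"],
]


def _run(argv: Sequence[str]) -> int:
    """Run one step on the local host and return its exit code."""
    return subprocess.run(list(argv), check=False).returncode


def run_chain(
    chain: Sequence[Sequence[str]] = CHAIN,
    run: Callable[[Sequence[str]], int] = _run,
) -> Optional[int]:
    """Run steps in order; return the index of the first failing step, or None."""
    for index, argv in enumerate(chain):
        if run(argv) != 0:
            return index  # assumption: later steps are skipped after a failure
    return None
```

Passing a fake `run` callable makes it possible to rehearse the ordering without touching the backup scripts.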

0.1 When To Use This

Use this SOP when any of these happen:

  • 110/120/121/188 reboot unexpectedly.
  • All services are abnormal after a power/network event.
  • K3s is stuck activating.
  • Host load remains high during startup and service health is mixed.
  • Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.

The rule is simple: recover the dependency chain, not the loudest symptom.

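0.2 Startup Judgment Layers

A single signal is never enough to declare completion after a reboot. Each host, and the platform as a whole, must be judged layer by layer:

| Level | Meaning | Minimum evidence | Does not imply |
|---|---|---|---|
| HOST_POWERED | Host or VM appears to be powered on | console / hypervisor shows running, or LAN ARP entries start to appear | OS has finished booting |
| HOST_BOOTED | OS has reached an interactive state | ping OK, SSH port open, who -b shows the boot time of this boot | systemd / Docker / K3s are healthy |
| HOST_READY | Host base services can carry the next layer | systemctl is-system-running is not degraded; failed units are explainable; cron / docker / DB / K3s are normal for the host's role | public routes or business data are normal |
| SERVICE_READY | Services hosted on the machine are usable | service health, port, container health, and DB / Redis / K3s / Harbor / Alertmanager checks pass | backups, schedules, alerts, data consistency, and data freshness are verified |
| FULL_STACK_GREEN | Reboot recovery may be declared complete | cold-start scorecard WARN=0, BLOCKED=0; backup / offsite / DB / alerts / schedules / data freshness all green | Can never be declared while 120 is unreachable or MOMO business data is stale |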
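
Because each level presupposes the ones before it, the highest declarable level is the last one in an unbroken run of satisfied levels. A minimal sketch, assuming the evidence for each level has already been reduced to a boolean (`highest_layer` and the evidence dictionary are hypothetical):

```python
from typing import Dict, Optional

LAYERS = [
    "HOST_POWERED",
    "HOST_BOOTED",
    "HOST_READY",
    "SERVICE_READY",
    "FULL_STACK_GREEN",
]


def highest_layer(evidence: Dict[str, bool]) -> Optional[str]:
    """Return the highest level reached without skipping a lower one."""
    reached = None
    for layer in LAYERS:
        if not evidence.get(layer, False):
            break  # a missing or failed lower level caps the declaration
        reached = layer
    return reached
```

For example, a platform whose routes respond but whose scorecard still has a blocker reports `SERVICE_READY`, never `FULL_STACK_GREEN`.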

Convergence verdict for the 2026-06-12 110/120 incident:

110 HOST_READY = yes
120 HOST_READY = yes
Core public services SERVICE_READY = yes
FULL_STACK_GREEN = yes, because cold-start scorecard is PASS=83 WARN=0 BLOCKED=0
DR_COMPLETE = no, because credential escrow evidence is incomplete

Verdict for the 2026-06-24 MOMO data staleness:

110 / 120 / 121 / 188 HOST_READY = yes
Core public services SERVICE_READY = yes
MOMO_RELEASE_CURRENT = yes, because mo.wooo.work health is V10.653 and Gitea main / CD #904 deployed commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73
MOMO_DB_PARITY = yes
MOMO_DATA_FRESH = no, because latest daily_sales_snapshot date is 2026-06-17 and stale age is 7 days as of 2026-06-24 22:40
MOMO_SOURCE_AVAILABLE = no, because Drive intake 當日業績匯入 has no newer 即時業績_當日 source, scheduler stats show repeated file_count=0 runs, and Mac Mini / MacBook candidate files only contain old or header-only data
FULL_STACK_GREEN = no, because live cold-start scorecard is PASS=86 WARN=0 BLOCKED=1 and repo-side v1.42 dry-run is PASS=88 WARN=0 BLOCKED=1 with blocker "188 momo source file absent while daily sales data stale"
DR_COMPLETE = no, because credential escrow evidence is incomplete

MOMO source absence recovery gate:

GO: declare MOMO service recovered when health is healthy, containers are healthy, scheduler runs, DB parity matches, and release version matches Gitea/CD.
NO-GO: declare MOMO data current while Drive intake has no newer 即時業績_當日 source file and latest DB bounds stop at 2026-06-17.
NO-GO: re-import stale local samples, product catalog exports, header-only sheets, or already imported archive files to fake freshness.
NO-GO: truncate, whole-DB restore, manual Drive movement, or manual import without explicit maintenance approval.
UNBLOCK: new legitimate PChome daily-sales source appears in 當日業績匯入 or an owner-approved safe import path; import job succeeds with sync_success=true; source file moves only after success; daily_sales_snapshot and realtime_sales_monthly bounds match; MOMO_DAILY_FRESHNESS <= 2.
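
The freshness arithmetic behind this gate can be worked through directly. The sketch below assumes that `MOMO_DAILY_FRESHNESS` is the age in whole days between the latest `daily_sales_snapshot` date and the check date; that reading matches the "stale age is 7 days" figure above, but the function names are hypothetical.

```python
from datetime import date


def stale_age_days(latest_snapshot: date, as_of: date) -> int:
    """Whole days between the latest snapshot date and the check date."""
    return (as_of - latest_snapshot).days


def momo_data_fresh(latest_snapshot: date, as_of: date, max_age_days: int = 2) -> bool:
    """Apply the UNBLOCK threshold: freshness must be <= 2 days."""
    age = stale_age_days(latest_snapshot, as_of)
    return 0 <= age <= max_age_days


# 2026-06-24 verdict: latest snapshot 2026-06-17 is 7 days old, so NO-GO.
blocked_age = stale_age_days(date(2026, 6, 17), date(2026, 6, 24))
```

A negative age (snapshot dated in the future) is treated as not fresh, so a clock or import error cannot pass the gate.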

All reports must use this vocabulary so that "service plane available" is never misreported as "overall DR complete".

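0.3 Codex Workstation Handoff Judgment

If Codex development must continue from the Mac Mini / MacBook Pro after a reboot, confirm the Codex safe handoff artifacts separately. Do not conflate service recovery with synchronization of raw Codex conversations.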

2026-06-24 22:17 Asia/Taipei readback

MacBook Pro 192.168.0.111 SSH = OK
Safe artifacts synced = Start Here and workstation dashboard readback matched; current SHA-256 values are tracked in the workstation dashboard artifact and local sha256sum readback
Start Here readback = registry_ready 3, registry_blocked 8, latest_dev_on_gitea 3, production_on_gitea 8, raw_history_sync False
Workstation dashboard readback = artifact_sync_synced 2, artifact_sync_blocked 0, MOMO current main baseline ready 2
MOMO Pro Mac Mini workspace = /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty 0
MOMO Pro MacBook workspace = /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73, SYSTEM_VERSION V10.653, dirty 0
AwoooGo MacBook workspace = ready on dev commit 8471b376d97c1436d4612ece17f51ba0950f114d, dirty 0

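Allowed declarations:

The Mac Mini / MacBook Pro have synced the Codex start entry point and the governance snapshot.
MOMO Pro work can start on the Mac Mini / MacBook Pro from the Gitea current-main Codex baseline; before implementation, a new codex/<task> branch must still be cut from codex/momo-current-main-dev-base-20260624.
The MOMO import-boundary fix has been deployed to production via main / CD #904; the next real import file is still needed to verify that the failure boundary prevents the file move.

Forbidden declarations:

Raw Codex / ChatGPT chat history has been synced.
All products can be developed in sync on both machines.
Treating MOMO Pro program version V10.653 as proof that MOMO business data is up to date.
2026FIFA / Agent Bounty owner preflight has passed.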

1. Golden Startup Order

0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed

Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.

1.1 Dependency Graph

flowchart TD
  network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
  network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
  data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
  obs110 --> k3s
  k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
  workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
  workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
  public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
  schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
  highload --> ai["AI auto-remediation: limited execution"]

This is also captured in the machine-readable baseline:

ops/reboot-recovery/full-stack-cold-start-baseline.yml

The YAML baseline is the source of truth for:

  • hosts, roles, and SSH users
  • phase ordering
  • service startup dependencies
  • endpoint success codes
  • schedule freshness thresholds
  • stateful-service protection boundaries
  • AI automation release gates

1.2 Phase Gate Logic

Each phase has the same decision rule:

| Result | Meaning | Action |
|---|---|---|
| BLOCKED | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| WARN | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| GREEN | All checks in scope passed. | Release the next phase only. |

The cold-start flow is intentionally conservative:

P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution
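
The conservative flow above amounts to walking the phases in order and holding at the first one that is not GREEN. A minimal sketch, assuming phase results are available as strings; `first_held_phase` is hypothetical, and it deliberately treats WARN like a hold, which is stricter than the table (WARN only forbids releasing runner/CD/AI full execution).

```python
from typing import Dict, Optional

PHASES = [
    "P0 network",
    "P0 188 data",
    "P0 110 registry/observability",
    "P1 K3s",
    "P2 workload + alert chain",
    "P2 public routes",
    "P2 schedules",
    "P3 high-load services and runners/CD",
    "AI auto-remediation limited execution",
]


def first_held_phase(results: Dict[str, str]) -> Optional[str]:
    """Return the first phase that is not GREEN; every later phase stays held.

    A phase with no recorded result is treated as BLOCKED (fail closed).
    """
    for phase in PHASES:
        if results.get(phase, "BLOCKED") != "GREEN":
            return phase
    return None  # every phase is GREEN
```

The returned phase is the "first blocked gate" to fix; nothing after it should be released until it turns GREEN.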

The final release condition is not "containers are running". It is:

PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN

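1.3 Reboot GO / NO-GO Decision Tree

Before each maintenance, use this table to decide whether a reboot is allowed and up to which level recovery may be declared afterwards.

| Scenario | GO / NO-GO | Allowed scope | Declaration ceiling |
|---|---|---|---|
| 03:00 offsite sync is running | NO-GO | Read-only observation; wait for the sync to finish and the verifier to run | Do not declare maintenance complete |
| 120 unreachable, but only 110 is rebooted | CONDITIONAL GO | May only declare 110 / public service recovery; must not run the 120 backup fix | SERVICE_READY; not FULL_STACK_GREEN |
| 188 data layer unhealthy | NO-GO | Fix PostgreSQL / Redis / Docker / SignOz / momo DB first | Do not release K3s / runner / AI |
| 110 Harbor / Registry unhealthy | NO-GO for K3s deploy | Fix the registry first; K3s image pulls may fail | Do not release CD / deploy |
| 120 / 121 both Ready, offsite verifier green | GO | The full cold-start release chain may run | Requires scorecard WARN=0 BLOCKED=0 |
| credential escrow marker missing | GO for service reboot; NO-GO for DR complete | Services may be restored; DR scorecard complete must not be declared | SERVICE_READY; BLOCKED_DR_ESCROW |
| Alertmanager required rules not visible | NO-GO for unattended window | Fix alert rules / drift guard first | Do not release AI auto-remediation |

GO only grants permission to execute the specified scope; it does not mean completion. Completion must always be checked against §15 Done Criteria.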

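1.4 Plan B: Degraded Operation and Return Path

Plan B is not a second reboot procedure that bypasses preflight, and it is not authorization to modify hosts ad hoc during an incident. For the case where Plan A cannot reach FULL_STACK_GREEN within the maintenance window, Plan B predefines the minimum acceptable service target, the stop line, the degradation levels, the per-host paths, and the conditions for returning to Plan A.

The goal of Plan A is:

B4_FULL_STACK_GREEN: cold-start scorecard WARN=0 / BLOCKED=0; backup, offsite, DB, alert, scheduler, K3s, public route, and business data freshness are all green.

The goal of Plan B is:

Protect core services and data integrity first, do not widen the blast radius, do not misreport partial availability as full-stack green, and leave the next blocker as a trackable ticket.

The machine-readable contract for Plan B is fixed in the plan_b block of ops/reboot-recovery/full-stack-cold-start-baseline.yml; scripts/reboot-recovery/reboot-recovery-readiness-audit.sh must check that both the SOP and the baseline retain B0-B5, the T+120 stop line, and the three closing states. If any of these fields is missing, the readiness audit must return BLOCKED.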

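Plan B Red Lines

| Red line | Specific requirement |
|---|---|
| No false green | Do not use route 200, pod up, container up, UI visible, CD success, or a single smoke pass to declare full recovery. |
| Do not silence correct red lights | Red lights for 120 / backup / credential escrow / alert / scheduler must be kept when they reflect a real gap. |
| No unauthorized write operations | Without a maintenance window and human approval: do not restart the Docker daemon, reload Nginx, change firewall / iptables, kubectl patch live, read secrets, or perform destructive recovery. |
| Do not release high-risk automation | CD runner, AI auto-remediation, heavy crawler, batch import, and repair bot stay frozen until the dependency chain is green. |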

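Plan B Trigger Conditions

| Trigger | Immediate action | Declaration ceiling |
|---|---|---|
| 03:00 offsite sync, 02:00 backup, or full verifier still running | Postpone the reboot; wait read-only for completion | B0_ABORTED_BEFORE_REBOOT |
| Any P0 host still unreachable by ping / SSH 15 minutes after reboot | Stop releasing the next layer; start the corresponding host path | B1_HOST_RECOVERY_ONLY |
| Any core data-plane component among 188 PostgreSQL / Redis / momo / SignOz is unhealthy | Freeze K3s deploy, runner, and AI auto-remediation | B1_HOST_RECOVERY_ONLY |
| 110 Harbor / Gitea / Alertmanager / Prometheus unhealthy | Freeze CD / deploy / image-pull related flows | Below B2_CORE_SERVICE_READY |
| Either 120 or 121 is unhealthy, but the other control-plane node can carry the load | Enter single-node K3s service mode; keep the HA red light | B2_CORE_SERVICE_READY |
| public route available, but any of DB / backup / alert / schedule is not green | Mark ROUTE_GREEN_ONLY; do not declare service green | B2_CORE_SERVICE_READY |
| cold-start WARN>0, BLOCKED=0 | May declare service available but still degraded | B3_SERVICE_AVAILABLE_DEGRADED |
| credential escrow missing | Service recovery may complete; DR complete must not be declared | B4_FULL_STACK_GREEN or below; B5_DR_COMPLETE forbidden |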

Plan B Host Paths

| Failure domain | Degradation path | Conditions to return to Plan A |
|---|---|---|
| 110 failure | Keep 120 / 121 K3s and 188 data; freeze CD, runner, Harbor image push, and Alertmanager outbound; first confirm whether Gitea / Harbor / Prometheus / Alertmanager is only a host-service-layer problem. | 110 HOST_READY; Harbor / Gitea / Prometheus / Alertmanager healthy; backup-status has no 110 core blocker; cold-start 110 checks green. |
| 120 failure | 121 carries the K3s control-plane; keep the 120_DEGRADED red light; do not declare K3s AA; do not run the 120 backup fix; use console / fsck recovery if necessary. | 120 ping / SSH OK, root filesystem rw, k3s active, node mon Ready, and the backup-configs / backup-all / offsite / cold-start chain all pass. |
| 121 failure | 120 carries the K3s control-plane; keep the 121_DEGRADED red light; do not declare workload balanced; avoid unnecessary rollouts. | 121 ping / SSH OK, k3s active, node mon1 Ready, and API/Web placement back to max skew <= 1. |
| 188 failure | Protect the data plane first: PostgreSQL, Redis, momo DB, SignOz, Ollama / AI provider; freeze batch / crawler / AI flows that write data or generate heavy load. | 188 HOST_READY; PostgreSQL / Redis / momo parity / SignOz / AI provider route healthy; backup-status has no 188 core blocker. |
| K3s degraded | Keep the existing healthy Pods; first inspect nodes / pods / events / VIP / NodePort; avoid blindly restarting k3s or deleting Pods. | mon / mon1 Ready; API/Web/Worker rollout healthy; public API/Web / alert webhook / scorecard pass. |
| Public gateway degraded | Protect internal API / VIP / data; do not reload Nginx or change DNS/TLS/certbot/firewall unless there is an owner-approved maintenance window. | Nginx config owner evidence, route smoke, TLS / ACME, rollback owner, and post-check plan pass. |

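Plan B Service Levels

During maintenance every report must use one of the following levels; wording such as "almost done" or "should be fine" is forbidden:

| Level | Meaning | Minimum evidence |
|---|---|---|
| B0_ABORTED_BEFORE_REBOOT | Preflight found a NO-GO; the reboot was cancelled or postponed | No runtime write operation was performed; the NO-GO blocker is recorded. |
| B1_HOST_RECOVERY_ONLY | Only host-layer recovery is complete | Target host ping / SSH / boot time / systemd base state can be judged; services are not yet fully verified. |
| B2_CORE_SERVICE_READY | Core services are available, but the full dependency chain has not passed | public route, API, DB, or K3s main plane available; backup / alert / scheduler / scorecard not yet all green. |
| B3_SERVICE_AVAILABLE_DEGRADED | Core services are available; cold-start has no hard block but still has WARN | cold-start BLOCKED=0; WARN items are explicitly listed and not silenced. |
| B4_FULL_STACK_GREEN | Recovery from this reboot is complete | cold-start PASS>0 WARN=0 BLOCKED=0; backup / offsite / DB / alert / scheduler / data freshness all green. |
| B5_DR_COMPLETE | DR is complete | B4 plus credential escrow missing = 0; restore / escrow / offsite evidence complete. |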
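
The level table above can be read as a ladder evaluated from the bottom up. The following is a rough sketch only: `plan_b_level` and its inputs are hypothetical, it assumes the cold-start scorecard already covers the backup / offsite / DB / alert / scheduler / data freshness checks required for B4, and it collapses the host-level detail behind B1 into a single boolean.

```python
def plan_b_level(
    preflight_go: bool,
    core_service_ready: bool,
    passed: int,
    warn: int,
    blocked: int,
    escrow_missing: int,
    dr_evidence_complete: bool,
) -> str:
    """Map a reduced evidence set to the highest Plan B level it supports."""
    if not preflight_go:
        return "B0_ABORTED_BEFORE_REBOOT"
    if not core_service_ready:
        return "B1_HOST_RECOVERY_ONLY"
    if blocked > 0 or passed == 0:
        # Services answer, but the scorecard has a hard block or did not run.
        return "B2_CORE_SERVICE_READY"
    if warn > 0:
        return "B3_SERVICE_AVAILABLE_DEGRADED"
    if escrow_missing == 0 and dr_evidence_complete:
        return "B5_DR_COMPLETE"
    return "B4_FULL_STACK_GREEN"
```

With the 2026-06-18 13:43 evidence (PASS=84 WARN=0 BLOCKED=0, ESCROW_MISSING_COUNT=5) this sketch yields B4, and B5 stays out of reach until the escrow markers exist.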

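Plan B Execution Timeline

T+0      Freeze CD / runner / AI auto-remediation / heavy batch; preserve console, journal, backup, and scorecard evidence.
T+5      Judge HOST_POWERED / HOST_BOOTED / HOST_READY; if any P0 host is unreachable, enter the host Plan B.
T+15     If 188 data or 110 registry / observability is unhealthy, stop releasing K3s, runner, and AI.
T+30     If public routes are available but DB / backup / alert / scheduler have not passed, report B2 only; full green is not allowed.
T+60     The cold-start scorecard must be run; if it is still WARN / BLOCKED, record the Plan B level and the next blocker.
T+120    If B4 is still not reached, open an incident / follow-up; do not extend the window to perform unauthorized runtime write operations.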

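Plan B Closing Conditions

Plan B may only close in one of the following three states:

| Closing state | Condition | Next step |
|---|---|---|
| RETURNED_TO_PLAN_A | The blocker is cleared and full Plan A chain verification is complete | Update the reboot ledger; record the actual elapsed time and any differences from the SOP. |
| SERVICE_AVAILABLE_DEGRADED | Services are available but the scorecard is still WARN, or a DR / escrow / governance gate is incomplete | Keep the red light; open the next owner / evidence / maintenance task. |
| OPEN_INCIDENT_REQUIRED | Any of P0 host, data, K3s, gateway, backup, or alert is still hard blocked | Stop the maintenance window, preserve evidence, and escalate to incident handling. |

The professional standard of Plan B is not "guarantee green every time". It is to guarantee that after every reboot it is quickly known which layer has been reached, what must not be declared, what the next blocker is, and whether it is safe to return to Plan A.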


2. Automation Freeze

Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.

| Item | Cold-start policy | Reason |
|---|---|---|
| Gitea/GitHub runners | Release last | Build jobs can saturate 110 CPU/RAM. |
| momo-scheduler / crawlers | Release last | Chrome and batch work can saturate 188. |
| Sentry/Snuba consumers | Controlled | Kafka backlog and ClickHouse merge can create temporary high load. |
| Alertmanager outbound notification | Gate | Avoid alert storms before API webhook and Telegram are verified. |
| AI auto-repair | Observe-only | Metrics, Redis, KM, and playbooks may be incomplete. |
| Stateful DB restart | Human approval | PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets. |

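2.1 Freeze Checklist

After entering the maintenance window, first reduce every source that can amplify an incident to observe-only, or defer its release. If this step is skipped, later load and alert signals become hard to interpret.

| Order | Target | Read-only check | Allowed action | Forbidden action |
|---|---|---|---|---|
| 1 | runner / CD | systemctl list-units "actions.runner.*", Gitea Actions running jobs | Pause new jobs; let jobs that can finish complete | Restarting the Docker daemon to interrupt jobs |
| 2 | AI auto-remediation | Prometheus / Alertmanager / cold-start monitor status | Switch to observe-only; keep alerts | Automatically restarting stateful services |
| 3 | momo scheduler / crawler | container health, recent logs, DB parity | Defer heavy imports; keep existing data | Forcing an import while the DB is not green |
| 4 | Sentry / Snuba | ClickHouse / Kafka health, consumer restart loop | Control the consumer release order | Generic compose down/up restart of the whole stack |
| 5 | K3s workload | node readiness, pods, events | cordon/drain according to node state | Claiming drain succeeded while 120 is unreachable |

When several working sessions handle the incident at the same time, the first priority is to avoid interrupting each other: while anyone is converging Docker / Nginx / firewall / K3s write operations, all other sessions stay read-only until there is an explicit handover.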

2.2 CD / SSH Trust Guardrail

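The false red light during the 2026-06-13 cold start showed that if a CD workflow runs ssh-keyscan ... > /home/wooo/.ssh/known_hosts, it overwrites the user-level global SSH trust on 110, so strict SSH checks from 110 to 120 / 188 fail. A host that has actually recovered is then misjudged as blocked.

Fixed rules:

| Item | Correct practice | Forbidden |
|---|---|---|
| Deploy-only host keys | Write to /home/wooo/.ssh/deploy_known_hosts | Writing to or overwriting /home/wooo/.ssh/known_hosts |
| Deploy SSH options | -o UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts | Sharing the operator / cold-start known_hosts |
| Cold-start SSH trust | Keep the verified fingerprints for 120 / 188; back up before any repair | Running ssh-keygen -R or rebuilding the whole file without cross-checking fingerprints |
| Verification | After CD, check the known_hosts mtime, the 120 / 188 entries, and strict SSH | Looking only at the CD success badge |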
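
A repo-side lint can catch the clobbering pattern before it reaches 110. The following is a minimal sketch, not the fix that was shipped in the source commit: `check_known_hosts_clobber` is a hypothetical helper, and it only detects shell redirections (`>` or `>>`) into a path ending in `.ssh/known_hosts`; writes through `tee`, `cp`, or similar would need additional patterns.

```shell
# Hypothetical guard: print workflow lines that redirect output into the
# global known_hosts file. Returns 0 when the file is clean, 1 otherwise.
# $1: path to a workflow or deploy script.
check_known_hosts_clobber() {
  # deploy_known_hosts does not match because the pattern requires the
  # file name to start directly after ".ssh/"; the trailing group keeps
  # names such as known_hosts.bak from matching.
  if grep -nE '>>?[[:space:]]*[^[:space:]]*/\.ssh/known_hosts([^._[:alnum:]-]|$)' "$1"; then
    return 1
  fi
  return 0
}
```

A CI step could run the function over each workflow file and fail the job on a non-zero return, which complements the mtime check listed in the verification row.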

2026-06-13 repair anchors:

  • Source fix: Gitea main contains 80e6ec1a fix(ci): avoid clobbering runner known hosts.
  • Deploy marker: after e4a349bc chore(cd): deploy 414413a [skip ci], the /home/wooo/.ssh/known_hosts mtime stayed at 2026-06-13 01:20:02 +0800 and was not overwritten by CD.
  • Deploy isolated file: /home/wooo/.ssh/deploy_known_hosts mtime 2026-06-13 01:24:05 +0800.
  • Global strict entries: the 120 ED25519 entry on line 4 and the 188 ED25519 entry on line 5 are still present; strict SSH to wooo@192.168.0.120 and ollama@192.168.0.188 must pass.

3. P0 Evidence And Network

Run from any machine on the same LAN:

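# ping -W is in seconds on Linux but in milliseconds on macOS/BSD; adjust for the client OS.
for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done

arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
# nc -G is the connect-timeout flag of the macOS/BSD nc; most Linux nc builds use -w 3 instead.
for h in 110 120 121 188; do
  nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done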

Then capture reboot evidence:

ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'

If any host has ARP incomplete or SSH port down, stop here and fix physical/network first.

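3.1 Host Boot Determination Criteria

After each host reboots, first run the four-stage boot determination. Move on to service recovery only when every stage matches the expectation for that host's role.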

for h in 110 120 121 188; do
  ip="192.168.0.$h"
  echo "=== $ip ==="
  ping -c 2 -W 2 "$ip" >/dev/null && echo "HOST_POWERED_OR_LAN_OK=1" || echo "HOST_POWERED_OR_LAN_OK=0"
  arp -an | grep "$ip" || true
  nc -G 3 -z "$ip" 22 && echo "SSH_PORT_OPEN=1" || echo "SSH_PORT_OPEN=0"
done

Once SSH is available:

ssh wooo@192.168.0.110 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true; free -h; swapon --show'
ssh wooo@192.168.0.121 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true'
ssh ollama@192.168.0.188 'hostname; date; who -b; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true; free -h'

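If 120 is not reachable over SSH, its state can only be HOST_POWERED_UNKNOWN / HOST_BOOTED_UNKNOWN. Go through console / VM / network checks, and do not substitute a stale Kubernetes node object for the host's actual state.

| Determination | Required conditions | Next step |
|---|---|---|
| HOST_BOOTED | ping or ARP responds, SSH port is open, and who -b shows the boot time of this boot | Check role services |
| HOST_READY | systemctl is-system-running reports running, or every degraded unit has been explained and does not affect this host's role | Enter service-layer verification |
| HOST_DEGRADED | Failed units exist and affect this host's role, or swap is full, root is read-only, or there is a boot storage error | Fix the host first; do not release the next layer |
| HOST_UNREACHABLE | ping/SSH/ARP fail | Drop any assumption of remote repair; switch to console/VM/network |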
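
The determination table can be expressed as a small classifier so that every host report uses the same vocabulary. This is a sketch under stated assumptions: `classify_host` is a hypothetical helper, its inputs are collected by the operator with the probes shown earlier, and `role_impact` is a human judgment (1 when failed units, full swap, a read-only root, or a boot storage error affect the host's role).

```shell
# Hypothetical classifier for the host determination table.
# $1 lan_ok      1 when ping or ARP responds, else 0
# $2 ssh_ok      1 when TCP port 22 is open, else 0
# $3 sys_state   output of `systemctl is-system-running`
# $4 role_impact 1 when a fault affects this host's role, else 0
classify_host() {
  if [ "$1" -ne 1 ] || [ "$2" -ne 1 ]; then
    echo "HOST_UNREACHABLE"
    return
  fi
  if [ "$4" -eq 1 ]; then
    echo "HOST_DEGRADED"
    return
  fi
  case "$3" in
    # "degraded" counts as ready only because role_impact=0 means every
    # degraded unit has been explained and does not affect the role.
    running|degraded) echo "HOST_READY" ;;
    # Any other state (starting, maintenance, ...) is booted but not ready.
    *) echo "HOST_BOOTED" ;;
  esac
}
```

For example, `classify_host 1 1 degraded 1` prints `HOST_DEGRADED`, while the same host with every degraded unit explained (`role_impact` 0) prints `HOST_READY`.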

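Lesson from the 2026-06-12 110 incident: if a failed unit points to a legacy path that no longer exists, first confirm whether it still belongs to an active service. Disabling a stale timer can clear the host degraded state, but the source-of-truth must be cleaned up in step; do not mask the problem with repeated reset-failed.

Lesson from the 2026-06-26 188 incident: the PostgreSQL host cluster, the Docker product DB, and the exporter must be judged separately. pg_isready, pg_up=1, or a public route returning 200 only proves that some PostgreSQL endpoint is available; it does not prove that postgresql@14-main has recovered. If the journal shows could not locate a valid checkpoint record, neither startup scripts nor AI may run pg_resetwal automatically; the case must go through the DB owner / backup restore / maintenance window / rollback owner / post-check gate.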


4. P0 188 Data Layer

188 is the first real service dependency because multiple product data planes, exporters, and AI / observability services depend on PostgreSQL-compatible endpoints. Do not assume the host cluster postgresql@14-main, Docker product databases, and exporter target are the same endpoint; prove the authoritative endpoint before repair.

4.1 Startup order

  1. containerd
  2. docker
  3. postgresql@14-main
  4. k3s_datastore.kine maintenance
  5. redis-server on 6380
  6. ollama or current AI proxy dependencies
  7. nginx
  8. Docker networks
  9. MinIO / OpenClaw / SignOz
  10. momo / litellm / batch services after load is stable

4.2 Read-only check

ssh ollama@192.168.0.188 '
hostname; date; uptime; free -h
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
pg_lsclusters 2>/dev/null || true
ss -ltnp "sport = :5432" 2>/dev/null || ss -ltn "sport = :5432" || true
pg_isready -h localhost -p 5432 || true
redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'

4.3 PostgreSQL WAL checkpoint damage

Signature:

PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...

This may block if the affected cluster is the authoritative runtime datastore:

  • 188:5432
  • K3s startup on 120/121
  • AWOOOI API DB access
  • Alertmanager webhook if API cannot start

2026-06-26 counterexample: host cluster 14/main can be down while product DB containers and exporters still serve traffic. Therefore pg_isready is not enough and failed postgresql@14-main is not automatically a product outage. First map the listening process / container, current app DB configuration, and backup freshness.

Break-glass example only after DB owner approval, backup evidence, maintenance window, rollback owner, and post-check plan:

sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"

Do not run pg_resetwal, DROP, reinitialize the cluster, delete /var/lib/postgresql, or restore an old backup from AI/startup automation. These are break-glass actions only.
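
One way to make the five preconditions explicit for the human operator is a pre-check that refuses to proceed unless every approval reference has been recorded. This is a hedged sketch: `break_glass_gate` and its argument order are hypothetical, it only checks that the references are non-empty, and it is meant for an interactive operator session, never for startup scripts or AI automation.

```shell
# Hypothetical pre-check for the break-glass path. It validates nothing
# about the references themselves; it only proves that all five were
# written down before anyone types a destructive command.
# $1 DB owner approval ref   $2 backup evidence ref
# $3 maintenance window ref  $4 rollback owner
# $5 post-check plan ref
break_glass_gate() {
  if [ "$#" -ne 5 ]; then
    echo "BREAK_GLASS_GATE=0"
    return 1
  fi
  for ref in "$@"; do
    if [ -z "$ref" ]; then
      echo "BREAK_GLASS_GATE=0"
      return 1
    fi
  done
  echo "BREAK_GLASS_GATE=1"
}
```

The printed line can be pasted into the incident record next to the references, so the review can see that the gate was evaluated before the cluster was stopped.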


5. P0/P1 110 Registry And Observability

110 must recover Harbor/Gitea/Monitoring early, but runners last.

5.1 Startup order

  1. docker
  2. Remove Exited (128) / Exited (137) orphan containers
  3. Harbor harbor-log
  4. Harbor full stack
  5. Gitea
  6. Prometheus / Alertmanager / Grafana / exporters
  7. Langfuse
  8. SignOz
  9. Sentry DB layer
  10. Sentry web/worker/consumer layer
  11. Gitea host runner and actions runners

5.2 Checks

ssh wooo@192.168.0.110 '
hostname; date; uptime; free -h
systemctl is-active docker || true
curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'

Harbor healthy means /v2/ returns 200 or 401. Do not treat 401 as failure.
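
The accepted status codes are easy to misread during an incident, so they can be encoded once. The sketch below uses a hypothetical helper named `service_http_ok`; the Harbor codes come from the rule above, and the Gitea and Sentry codes are taken from the stable runtime baseline in section 12.1.

```shell
# Hypothetical helper: return 0 when an HTTP status code counts as healthy
# for a 110 service, 1 otherwise.
# $1 service name (harbor, gitea, sentry)
# $2 status code as printed by curl -w "%{http_code}"
service_http_ok() {
  case "$1:$2" in
    harbor:200|harbor:401) return 0 ;;            # 401 means the registry answers and wants auth
    gitea:200|gitea:302) return 0 ;;
    sentry:200|sentry:302|sentry:400) return 0 ;;
    *) return 1 ;;                                # includes 000, which curl prints when no response arrives
  esac
}
```

A typical call feeds it the curl output directly, for example `service_http_ok harbor "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:5000/v2/)"`.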

5.3 Runner gate

Runner may start only after all are true:

  • 188 PostgreSQL ready
  • 110 Harbor ready
  • 110 Gitea ready
  • 120/121 K3s nodes ready
  • AWOOOI API health passes
  • 110 load/core is below 1.0 for at least 15 minutes
  • runner systemd guardrails are active: CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0

Check:

ssh wooo@192.168.0.110 '
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
  echo "=== $u ==="
  systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
done
'

If WatchdogUSec is not 0, apply the guardrail script manually with sudo:

sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
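
The per-unit output of the check loop can be evaluated automatically. This is a sketch under assumptions that should be confirmed on 110: `runner_guardrails_ok` is a hypothetical helper, and it assumes `systemctl show` renders CPUQuota=200% as `CPUQuotaPerSecUSec=2s`, MemoryMax=2G as `MemoryMax=2147483648`, and a disabled watchdog as `WatchdogUSec=0`.

```shell
# Hypothetical checker: reads `systemctl show <unit> -p ...` output on stdin
# and returns 0 only when all three runner guardrails are in place.
runner_guardrails_ok() {
  awk -F= '
    $1 == "CPUQuotaPerSecUSec" && $2 == "2s"         { cpu = 1 }
    $1 == "MemoryMax"          && $2 == "2147483648" { mem = 1 }
    $1 == "WatchdogUSec"       && $2 == "0"          { wd = 1 }
    END {
      if (cpu && mem && wd) exit 0
      exit 1
    }
  '
}
```

Inside the loop above, `systemctl show "$u" -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec | runner_guardrails_ok || echo "GUARDRAIL_MISSING $u"` would flag any runner unit that still needs the guardrail script.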

6. P1 120/121 K3s

K3s must wait for 188 PostgreSQL and 110 Harbor.

6.1 Startup order

  1. 120 k3s.service
  2. 121 k3s.service, k3s-agent.service, or its live role
  3. CNI / kube-proxy
  4. Nodes Ready
  5. Core pods
  6. awoooi-prod pods
  7. keepalived VIP 192.168.0.125
  8. NodePorts 32334 and 32335

6.2 Checks

ssh wooo@192.168.0.120 '
hostname; uptime
pg_isready -h 192.168.0.188 -p 5432 || true
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
kubectl get nodes -o wide 2>/dev/null || true
kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

ssh wooo@192.168.0.121 '
hostname; uptime
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

If K3s is activating while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.

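6.3 120 / 121 AA / AS and Load-Balancing Determination

The 2026-06-12 15:19 live check confirmed that 120 and 121 are both K3s control-plane nodes, each with k3s active and k3s-agent inactive. They therefore form an active-active (AA) K3s control plane, not a traditional active-standby (AS) pair.

Control-plane AA does not mean business workload AA. After 120 recovered from a root filesystem fsck, most ArgoCD / AWOOOI / Velero / kube-system workloads were still concentrated on 121, and 120 mainly ran DaemonSet pods. After every 120 / 121 reboot or recovery, run an additional pod placement check: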

ssh wooo@192.168.0.120 '
sudo kubectl get nodes -o wide
sudo kubectl get pods -A -o wide
sudo kubectl top nodes 2>/dev/null || true
sudo kubectl top pods -A --sort-by=cpu 2>/dev/null | head -30 || true
'

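Determination rules:

| Determination | Condition | May claim |
|---|---|---|
| K3S_CONTROL_PLANE_AA | 120 and 121 are both Ready control-plane nodes | Control plane is available on two nodes |
| WORKLOAD_IMBALANCED | The main deployments / pods all land on a single node | Must not claim service AA; scheduling governance is needed |
| WORKLOAD_BALANCED | Core API / Web with replicas >= 2 are spread across 120 / 121 | May claim the serving layer is distributed |
| STATEFUL_AA | storage replication, backup / restore drill, and failover drill all pass | Only then may data-layer AA be claimed |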
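
The balanced versus imbalanced distinction can be checked from the pod placement output. The sketch below is a coarse approximation, not the repo's scorecard logic: `workload_spread` is a hypothetical helper that expects one node name per line (one line per pod) and only tests whether the pods occupy at least two distinct nodes.

```shell
# Hypothetical helper: read one node name per pod on stdin and report
# whether the pods are spread over at least two distinct nodes.
workload_spread() {
  awk '
    NF { count[$1]++ }
    END {
      nodes = 0
      for (n in count) nodes++
      if (nodes >= 2) print "WORKLOAD_BALANCED"
      else print "WORKLOAD_IMBALANCED"
    }
  '
}
```

Run it once per core deployment, for example `sudo kubectl get pods -n awoooi-prod -l <selector> -o custom-columns=NODE:.spec.nodeName --no-headers | workload_spread`, where `<selector>` is a placeholder for the deployment's label selector. Empty input is reported as imbalanced, which keeps a failed kubectl call from reading as green.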

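The formal baseline document for load balancing and migration assessment is docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md. During recovery, finish the P0 backup chain and the cold-start scorecard first, then do topology spread or service migration.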


7. P2 AWOOOI Workloads

Run after K3s nodes are Ready:

ssh wooo@192.168.0.120 '
kubectl get deploy -n awoooi-prod
kubectl get pods -n awoooi-prod -o wide
kubectl get svc -n awoooi-prod
kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'

curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/

If pods are ImagePullBackOff, go back to 110 Harbor.

If API health fails because DB/Redis is down, go back to 188.


8. P2 Alert Chain

Current main path:

Prometheus/Alertmanager on 110
  -> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
  -> AWOOOI API
  -> TelegramGateway
  -> Telegram

Alertmanager health alone is not enough. Run E2E:

curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'

Expected: API returns success and Telegram receives the test alert.
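
The example payload above carries a fixed `startsAt` value. If a fresh timestamp is preferred, the payload can be generated at send time. This is a minimal sketch: `alert_test_payload` is a hypothetical helper, and whether the AWOOOI API treats old `startsAt` values differently is not stated in this SOP.

```shell
# Hypothetical helper: print the cold-start E2E test payload with a
# caller-supplied or current UTC startsAt timestamp.
# $1 optional RFC 3339 timestamp; defaults to the current UTC time.
alert_test_payload() {
  starts_at=${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  printf '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"%s","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}\n' "$starts_at"
}
```

The output can be piped straight into the same endpoint with `alert_test_payload | curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager -H 'Content-Type: application/json' -d @-`.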


9. P2 Schedules And Delayed Work

Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.

| Host / Layer | Required check | Success baseline |
|---|---|---|
| 188 cron | systemctl is-active cron and crontab -l | cron active; backup, restart exporter, stats exporter entries present |
| 188 backup-from-110 | backup_110_last_success_timestamp in textfile/Prometheus | last success age < 25h |
| 188 momo-scheduler | docker inspect momo-scheduler and docker logs --since 6h momo-scheduler | container running healthy; 全部排程任務已註冊 ("all scheduled tasks registered"); Google Drive auth works; dashboard URLs use container-reachable hostnames |
| 188 momo import | manual run_auto_import_task() after parser changes | selected sheet is 即時業績明細; imported date range has matching rows in daily_sales_snapshot and realtime_sales_monthly |
| 110 cron | systemctl is-active cron | cron active; Docker/systemd textfile exporters fresh |
| 110 startup units | systemctl --failed | zero failed units; stale momo-startup-complete and wooo-staggered-startup disabled |
| 120 K8s CronJobs | kubectl get cronjobs -n awoooi-prod | unsuspended; no failed Jobs remain after current validation |
| 121 DR drill | crontab -l | DR drill cron present unless explicitly paused |

Useful checks:

ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'
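
The "last success age < 25h" baseline can be checked directly against a textfile metric. The sketch below is illustrative: `metric_age_ok` is a hypothetical helper, and it assumes the metric appears as an unlabeled `name value` line holding a Unix timestamp in seconds; the actual file name that carries `backup_110_last_success_timestamp` is not given in this SOP.

```shell
# Hypothetical helper: check that a timestamp metric in a textfile-collector
# file is younger than a maximum age.
# $1 path to the .prom file   $2 metric name
# $3 maximum age in seconds   $4 optional "now" as a Unix epoch
# Returns 0 when fresh, 1 when stale, 2 when the metric is absent.
metric_age_ok() {
  now=${4:-$(date +%s)}
  awk -v m="$2" -v max="$3" -v now="$now" '
    $1 == m { ts = $2; found = 1 }
    END {
      if (!found) exit 2
      if (now - ts < max) exit 0
      exit 1
    }
  ' "$1"
}
```

With 25 hours expressed as 90000 seconds, a call looks like `metric_age_ok <file>.prom backup_110_last_success_timestamp 90000`, where `<file>` is a placeholder. The distinct return code for a missing metric keeps an absent exporter from being mistaken for a stale backup.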

If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.


10. P2/P3 Stateful Service Guardrails

| Tier | Examples | Automation |
|---|---|---|
| BLOCK | PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB | No automatic destructive action. Human approval only. |
| CRITICAL_HITL | Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse | Human-in-the-loop restart/repair. |
| STANDARD_HITL | API/Web/worker, OpenClaw, litellm | Restart only with evidence and blast-radius check. |
| AUTO | Stateless exporters, blackbox, nginx exporter | Auto restart allowed after verification. |

Never use generic docker restart $(docker ps -q) during cold start.

10.1 Dirty-Reboot Storage Corruption

Treat these log signatures as storage corruption, not ordinary service flakiness:

  • Bad message
  • Structure needs cleaning
  • Unknown codec
  • PANIC: could not locate a valid checkpoint record
  • Kafka Malformed line in checkpoint files
  • ClickHouse broken and needs manual correction

Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns Bad message or Structure needs cleaning, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.
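
Evidence collection for these signatures can stay read-only. The following sketch scans log text for the patterns listed above; `scan_corruption_signatures` and `storage_corruption_flag` are hypothetical helpers, and the Kafka and ClickHouse patterns are shortened to the stable part of each message.

```shell
# Hypothetical read-only scanner: print log lines (with line numbers) that
# match a dirty-reboot storage corruption signature. Reads log text on stdin
# and returns 0 when at least one signature is found.
scan_corruption_signatures() {
  grep -nE 'Bad message|Structure needs cleaning|Unknown codec|could not locate a valid checkpoint record|Malformed line in checkpoint|broken and needs manual correction'
}

# Hypothetical wrapper: reduce the scan to a single machine-readable flag.
storage_corruption_flag() {
  if scan_corruption_signatures >/dev/null; then
    echo "STORAGE_CORRUPTION_SUSPECTED=1"
  else
    echo "STORAGE_CORRUPTION_SUSPECTED=0"
  fi
}
```

Typical inputs are `docker logs --since 1h <container> 2>&1` or `journalctl -u postgresql@14-main --no-pager`, piped into either function. A flag of 1 is a reason to stop restart attempts and preserve data, not a trigger for automated repair.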

10.2 ClickHouse Clean-Clone Recovery Pattern

Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.

1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.

Do not replace this with rm -rf store/... unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:

/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS

10.3 Kafka Checkpoint Recovery Pattern

If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:

log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint

Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.


11. P3 High-Load Services

Only release these after P0/P1/P2 gates are green:

| Host | Service | Release condition |
|---|---|---|
| 188 | momo-scheduler / crawler | load/core < 1.0 for 15 minutes and DB healthy |
| 188 | SignOz | ClickHouse healthy and merge backlog trending down |
| 188 | litellm | /health/liveliness good and provider route verified |
| 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
| 110 | Sentry uptime-checker | Sentry web/DB healthy |
| 110 | runners | all previous gates green, host_runaway_process.prom fresh, orphan browser group count 0, and load/core < 1.0 for 15 minutes unless the remaining load is explicitly attributed to active CI |
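
The load/core condition that recurs in these release gates can be computed rather than eyeballed. This is a sketch with one important simplification: `load_per_core_ok` is a hypothetical helper, and using the 15-minute load average as a stand-in for "below 1.0 for 15 minutes" is an approximation, since the load average is an exponentially damped value rather than a strict window.

```shell
# Hypothetical helper: return 0 when load divided by core count is strictly
# below the threshold, 1 otherwise.
# $1 load average (for example the 15-minute value)
# $2 CPU core count
# $3 optional threshold, default 1.0
load_per_core_ok() {
  awk -v load="$1" -v cores="$2" -v limit="${3:-1.0}" '
    BEGIN {
      if (cores > 0 && load / cores < limit) exit 0
      exit 1
    }
  '
}
```

On a Linux host the inputs can come from `load_per_core_ok "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)"`. Sampling it more than once across the 15-minute window gives stronger evidence than a single reading.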

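11.1 110 Runaway Browser / CI Load Triage

The 2026-06-18 110 CPU saturation incident showed that the generic HostHighCpuLoad alert only says the host is busy; it does not tell the operator whether a process should be killed. 110 must now use the dedicated host runaway process metrics as the first layer of triage: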

grep -E 'awoooi_host_runaway_|awoooi_host_gitea_actions_|awoooi_host_load5_per_core|awoooi_host_swap_used_ratio' \
  /home/wooo/node_exporter_textfiles/host_runaway_process.prom

Prometheus must also be able to read the same textfile. The 2026-06-18 14:31-14:32 live scrape confirmed awoooi_host_runaway_process_monitor_up{host="110"}=1, orphan group count 0, active CI container count 2, remediation_authorized=0, and that neither the missing nor the orphan alert was firing.

Interpretation:

| Metric combination | Determination | Action |
|---|---|---|
| awoooi_host_runaway_browser_orphan_group_count > 0 and CPU >= 100 | orphan headless browser / smoke process group | Run host-runaway-process-remediation.py in dry-run; a gated SIGTERM is allowed only after human confirmation |
| orphan count 0 and awoooi_host_gitea_actions_active_container_count > 0 | Legitimate CI build/test load | Watch the Gitea Actions queue / workflow timeout; do not kill processes |
| awoooi_host_runaway_process_monitor_up missing or stale | Monitoring blind spot | Fix cron / textfile collector / Ansible role; do not claim AI Ops observability |
| awoooi_host_runaway_process_remediation_authorized > 0 | The monitor has been mistakenly turned into an executor | Roll back immediately; runtime remediation must go only through the gated helper |
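
The interpretation table can be applied consistently with a small decision function. This sketch makes assumptions the SOP does not state: `classify_runaway` and its output labels are hypothetical, all inputs are integers, and the row precedence (blind spot first, then the authorization check, then orphan groups, then CI load) is one reasonable ordering rather than a documented rule.

```shell
# Hypothetical classifier for the runaway process triage table.
# $1 monitor_up (pass an empty string when the metric is missing or stale)
# $2 orphan group count        $3 active CI container count
# $4 remediation_authorized    $5 CPU value used by the ">= 100" rule
classify_runaway() {
  if [ "$1" != "1" ]; then
    echo "MONITOR_BLIND_SPOT"
    return
  fi
  if [ "$4" -gt 0 ]; then
    echo "MONITOR_TURNED_EXECUTOR"
    return
  fi
  if [ "$2" -gt 0 ] && [ "$5" -ge 100 ]; then
    echo "ORPHAN_BROWSER_GROUP"
    return
  fi
  if [ "$2" -eq 0 ] && [ "$3" -gt 0 ]; then
    echo "LEGITIMATE_CI_LOAD"
    return
  fi
  echo "UNCLASSIFIED"
}
```

With the 2026-06-18 scrape values (monitor up 1, orphan count 0, active CI containers 2, authorized 0), the function prints `LEGITIMATE_CI_LOAD` regardless of the CPU value. The function only classifies; any SIGTERM still goes through the gated helper and human confirmation.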

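Official PlayBook:

docs/runbooks/HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md

This PlayBook does not replace the Docker / Sentry / Harbor / K3s / backup SOPs. It only classifies orphan browser smoke processes versus CI load, so that high CPU does not lead to a mistaken Docker restart or to killing a legitimate build.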


12. Baseline And AI Auto-Remediation Gate

12.1 Stable Runtime Baseline

These are release gates after the first cold-start recovery pass:

| Area | Baseline |
|---|---|
| 188 host | PostgreSQL accepting, Redis PONG, momo /health 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers |
| 110 host | Harbor /v2/ 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
| K3s | 120/121 nodes Ready, VIP 192.168.0.125 present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
| Public routes | https://awoooi.wooo.work/api/v1/health 2xx/3xx, https://mo.wooo.work/health 2xx/3xx |
| Guardrails | Docker/systemd/storage/backup/runaway-process textfile exporters fresh, runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0 |
| Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success < 25h |
| Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |

If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.

12.2 AI Auto-Remediation Gate

AI auto-repair can move from observe-only to limited execution only after:

  • Prometheus rules are loaded.
  • docker/systemd textfile exporter files are fresh.
  • runaway process textfile exporter is fresh and remediation_authorized=0.
  • blackbox probes have stable results.
  • cron/CronJob schedule checks are green.
  • AWOOOI API /api/v1/health passes.
  • Alertmanager E2E webhook passes.
  • Redis/KM/playbook health is available.
  • No active restart storm.
  • Host load/core remains below 1.0 for 15 minutes.

Until then:

  • diagnose only
  • notify only
  • require human approval for remediation
  • no DB/ClickHouse/Harbor/Sentry destructive action
  • no generic restart action against stateful services
  • no process kill unless host-runaway-process-remediation.py has dry-run evidence plus owner approval, maintenance window, and evidence ref

13. One-Command Readiness Script

13.1 Single Pass

Run this when you want one read-only snapshot:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh

The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:

  • P0-NETWORK
  • P0-188-DATA
  • P0-110-REGISTRY-OBSERVABILITY
  • P1-K3S
  • P2-WORKLOAD-ALERTCHAIN
  • P2-PUBLIC-ROUTES
  • P2-SCHEDULES
  • runner guardrail state inside P0-110-REGISTRY-OBSERVABILITY

If it prints BLOCKED, fix the first blocked gate before moving forward.

13.2 Professional Watch Mode

Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test

This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is GREEN or the last attempt remains degraded/blocked.

Use --send-alert-test for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without --send-alert-test, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.

13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

bash scripts/reboot-recovery/install-cold-start-monitor-110.sh

This command is not read-only. It copies scripts to 110, rewrites the marked wooo crontab block, and immediately refreshes the textfile metric. Run it only inside an approved maintenance window or explicit owner-approved live-sync change.

This installs two scripts under /home/wooo/scripts/, adds a marked user-cron block, and writes:

  • /home/wooo/node_exporter_textfiles/cold_start_recovery.prom
  • /home/wooo/reboot-recovery/cold-start-last.log

The cron path uses --monitor-read-only, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

  • awoooi_cold_start_monitor_up
  • awoooi_cold_start_pass_gates
  • awoooi_cold_start_warn_gates
  • awoooi_cold_start_blocked_gates
  • awoooi_cold_start_last_run_timestamp
  • awoooi_cold_start_last_green_timestamp
  • awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}

Prometheus rules in ops/monitoring/alerts-unified.yml alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
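
For a quick local reading without Prometheus, the textfile can be reduced to a single state. The sketch below is illustrative: `cold_start_state` is a hypothetical helper, and it assumes the gate counters are written as unlabeled `name value` lines in cold_start_recovery.prom, which should be confirmed against the live file.

```shell
# Hypothetical helper: summarize a cold-start textfile into one state.
# $1 path to cold_start_recovery.prom
cold_start_state() {
  awk '
    $1 == "awoooi_cold_start_monitor_up"    { up = $2; seen_up = 1 }
    $1 == "awoooi_cold_start_pass_gates"    { pass = $2 }
    $1 == "awoooi_cold_start_warn_gates"    { warn = $2 }
    $1 == "awoooi_cold_start_blocked_gates" { blocked = $2 }
    END {
      if (!seen_up || up != 1) print "CHECK_FAILED"
      else if (blocked > 0)    print "BLOCKED"
      else if (warn > 0)       print "DEGRADED"
      else if (pass > 0)       print "GREEN"
      else                     print "CHECK_FAILED"
    }
  ' "$1"
}
```

On 110 this would be called as `cold_start_state /home/wooo/node_exporter_textfiles/cold_start_recovery.prom`. The four outputs mirror the `result` label values of awoooi_cold_start_last_result, and a missing or zero monitor_up is reported as a failed check rather than as green.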

13.3.1 110 cold-start monitor live-sync gate

Use this gate whenever the repo-side cold-start script changes. This prevents a false-green where repo evidence is newer than the live 110 monitor.

Current read-only evidence, 2026-06-24 23:15 Asia/Taipei:

Repo script hash: f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05
110 live script hash: 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8
verify result: BLOCKED full-stack-cold-start-check.sh hash mismatch

Read-only verification:

bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh

Approved apply path, only after maintenance-window / owner approval:

bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1

Completion criteria:

  • verify-cold-start-monitor-deploy.sh reports hash parity for full-stack-cold-start-check.sh and cold-start-textfile-exporter.sh.
  • The live 110 cold-start output includes the expected current fields, including MOMO_SOURCE_EMPTY_EVIDENCE_LINES, MOMO_IMPORT_CONFIG, and MOMO_LATEST_IMPORT_JOB while MOMO data freshness remains blocked by source absence.
  • The textfile monitor refreshes without creating alert spam.
  • LOGBOOK records local hash, remote hash, command type, approval reference, and final cold-start result.
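
The hash parity criterion can be reproduced by hand when the verify script itself is in question. This is a minimal sketch, not the logic of verify-cold-start-monitor-deploy.sh: `sha256_of` and `hash_parity` are hypothetical helpers, and the use of SHA-256 is inferred from the 64-character hashes shown in the evidence above.

```shell
# Hypothetical helper: print the SHA-256 digest of a file, using whichever
# of sha256sum or shasum is available on the client.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Hypothetical helper: compare a repo file with a copy fetched read-only
# from the live host. Returns 0 on parity, 1 on mismatch.
# $1 local repo file   $2 local copy of the live file
hash_parity() {
  a=$(sha256_of "$1")
  b=$(sha256_of "$2")
  if [ -n "$a" ] && [ "$a" = "$b" ]; then
    echo "HASH_PARITY=1"
  else
    echo "HASH_PARITY=0 local=$a remote=$b"
    return 1
  fi
}
```

The mismatch line carries both digests, which matches the LOGBOOK requirement to record the local hash and the remote hash together.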

NO-GO:

  • Do not run the installer as part of routine read-only triage.
  • Do not call repo-side v1.42 deployed on 110 while the hash mismatch remains.
  • Do not patch 110 manually with ad hoc scp; use the existing installer or Ansible source-of-truth path under an approved change.

13.4 Script-To-SOP Coverage Map

| Script gate | SOP coverage | Blocks |
|---|---|---|
| P0-NETWORK | host reachability, ARP, SSH | every later phase |
| P0-188-DATA | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| P0-110-REGISTRY-OBSERVABILITY | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| P1-K3S | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| P2-WORKLOAD-ALERTCHAIN | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| P2-PUBLIC-ROUTES | external AWOOOI and momo URLs | external release |
| P2-SCHEDULES | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |

13.5 Next-Reboot Operator Contract

  1. Run the watch command above.
  2. If it stops at BLOCKED, repair the first blocked gate and rerun watch mode.
  3. If it stops at WARN, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
  4. Release high-load services only after GREEN and load/core stays below 1.0 for 15 minutes.
  5. Record the final output summary and any manual repair in docs/LOGBOOK.md.

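13.6 2026-05-29 Addendum: 188 Public Gateway and Backup Alerts

The 188 public gateway for aiops.wooo.work must no longer point at the single node 192.168.0.120:31234/31235. When 120 is unreachable, that makes the public route return 502 directly. The official baseline must go through the K3s VIP: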

location /api/ {
    proxy_pass http://192.168.0.125:32334/api/;
}

location /api/v1/ws {
    proxy_pass http://192.168.0.125:32334/api/v1/ws;
}

location / {
    proxy_pass http://192.168.0.125:32335;
}

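Changes must originate from infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 and then be converged with infra/ansible/playbooks/nginx-sync.yml. Editing only the live file on 188 without writing the change back to the Ansible baseline is forbidden.

Backup alerts have two layers, and both are required:

  • ops/monitoring/alerts-unified.yml is the repo canonical copy.
  • On 110, the live /home/wooo/monitoring/alerts.yml and /home/wooo/monitoring/alerts-unified.canonical.yml must match; otherwise prometheus-rule-drift-guard may pull the rules back to an old version.

Mandatory checks after a reboot: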

curl -s http://127.0.0.1:9090/api/v1/rules \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'

cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom

If 120 has not yet recovered, BackupConfigCapturePartial{target="120-k3s-host-configs"} and cold-start blocked are correct signals and must not be silenced. After 120 recovers, rerun:

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color

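13.7 2026-05-29 Addendum: momo PostgreSQL Index and Data Sync

For mo.wooo.work, a 200 from /health or the home page is not enough. After a reboot or fsck, a PostgreSQL index problem can let the import flow appear to finish while daily_sales_snapshot is not synced to realtime_sales_monthly. Symptoms in this incident:

  • daily_sales_snapshot already had 17,353 rows for 2026-05-01 through 2026-05-28.
  • realtime_sales_monthly had 0 rows for the same date range.
  • The momo-scheduler log showed the PostgreSQL internal error posting list tuple ... cannot be split.

Standard handling order:

# 188 / momo-db: rebuild the index only; do not delete data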
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL

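Only after the index is rebuilt may an idempotent backfill sync be run for the missing dates. The formal procedure must first confirm the row count of realtime_sales_monthly for that date range. If it is not 0, save the query result first and decide whether the sync for the same range should be rerun. Do not truncate the whole table and do not restore the whole database. After the backfill, verify at least: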

SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

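For the same date range, the row counts and the lower and upper date bounds of the two tables must match. When done, clear the momo application cache: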

docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'

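14. Host Power-On, Shutdown, and Reboot SOP

This section is the standard procedure for every power operation involving 110 / 120 / 121 / 188. 112 is Kali: keep read-only evidence only, and do not include it in this recovery round or in routine reboot release.

14.1 Common Red Lines

| Type | Forbidden | Correct handling |
|---|---|---|
| 120 offline | Do not silence ColdStartHost120Unreachable, ColdStartRecoveryBlocked, or the 120 config backup alert | Keep the red light until console/VM recovery, then rerun the full chain |
| Filesystem | Do not run an online fsck on a mounted root filesystem | Repair offline only, from console/rescue/initramfs |
| Backup | Do not claim aggregate backup green from a single successful backup | backup-all, the offsite verifier, and the cold-start scorecard decide together |
| Credential | Do not write passwords, tokens, or private keys into the repo, LOGBOOK, or chat | Write only non-secret evidence markers / vault references |
| Stateful data | Do not truncate, DROP, restore a whole database, or bulk-delete volumes | Preserve evidence first; prefer REINDEX TABLE CONCURRENTLY / clean-clone / idempotent resync |
| Automation | Do not release runner/CD/AI full execution before P0/P1 are green | observe-only; runner/CD released last |

14.2 Pre-Shutdown SOP

The goals are to preserve evidence, stop high-load sources, and let stateful services shut down cleanly.

  1. Declare the maintenance window and create a reboot record draft in docs/LOGBOOK.md.
  2. Run the preflight snapshot: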
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
/backup/scripts/backup-status.sh --no-notify
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
  3. Save host reboot evidence:
for h in 110 120 121; do
  ssh wooo@192.168.0.$h 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20; systemctl --failed --no-pager' || true
done
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20; systemctl --failed --no-pager' || true
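  4. Pause high-load work and automation release:

| Order | Target | Operating principle |
|---|---|---|
| 1 | Gitea / actions runners | Stop new jobs; do not hard-stop in the middle of a build. Let jobs that can finish complete, or cancel them manually |
| 2 | AI auto-remediation | Switch to observe-only; automatic restart of stateful services is forbidden |
| 3 | momo crawler / scheduler / heavy batch | Pause work that launches Chrome, batch imports, or heavy DB writes |
| 4 | Sentry/Snuba/ClickHouse heavy consumers | Confirm there is no restart storm; use a controlled stop if necessary |
| 5 | K3s workload | Prefer drain / cordon on reachable nodes; do not pretend drain completed when 120 is already unreachable |

  5. Full-stack shutdown order: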
1. runner/CD and high-load batch
2. AI auto-remediation execution
3. AWOOOI workload layer
4. 121 K3s node (k3s or k3s-agent, per its live role)
5. 120 K3s node (k3s server)
6. 110 registry / observability, after evidence and backup status are captured
7. 188 data layer last
8. network / UPS / hypervisor last, if applicable

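188 must shut down last, because PostgreSQL / Redis / momo DB / K3s datastore are shared dependencies of the other layers.

14.3 Power-On SOP

The power-on order always follows the dependency chain; do not chase the noisiest alert.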

1. Physical network: switch, NIC, ARP, SSH
2. 188 data layer: PostgreSQL, Redis, Docker, momo DB, SignOz dependencies
3. 110 registry / observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry
4. 120 K3s server / VIP path
5. 121 K3s node (k3s or k3s-agent, per its live role) / failover path
6. AWOOOI API/Web workload
7. Public routes and Alertmanager E2E
8. Backups, cron, CronJobs, textfile exporters
9. momo scheduler / crawlers and high-load consumers
10. runners/CD
11. AI auto-remediation limited execution

Every layer needs live evidence after power-on. Minimum acceptance commands:

for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
  nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done

ssh ollama@192.168.0.188 'systemctl is-active docker postgresql@14-main redis-server nginx || true; pg_isready -h localhost -p 5432 || true; docker ps --format "{{.Names}}\t{{.Status}}" | head -80'
ssh wooo@192.168.0.110 'systemctl is-active docker cron || true; curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true; curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true'
ssh wooo@192.168.0.121 'sudo kubectl get nodes -o wide; sudo kubectl get pods -A | grep -v -E "Running|Completed" || true'

/backup/scripts/backup-status.sh --no-notify
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
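
The scorecard baselines recorded in this SOP pair `PASS=… WARN=… BLOCKED=…` counts with a result (for example WARN=0 BLOCKED=0 with GREEN, WARN=1 with DEGRADED). A minimal sketch of that mapping, assuming the summary line has exactly that shape; `classify_scorecard` is a hypothetical helper and is not part of full-stack-cold-start-check.sh:

```shell
# Hypothetical helper: classify a scorecard summary line such as
# "PASS=84 WARN=0 BLOCKED=0". The line format is assumed from the
# baselines recorded in this SOP, not read from the script source.
classify_scorecard() {
  local line="$1" warn blocked
  warn=$(printf '%s\n' "$line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
  blocked=$(printf '%s\n' "$line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
  if [ -z "$warn" ] || [ -z "$blocked" ]; then
    echo "UNKNOWN"   # never guess a verdict from an unparsable line
    return 2
  fi
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"
  elif [ "$warn" -gt 0 ]; then
    echo "DEGRADED"
  else
    echo "GREEN"
  fi
}
```

An unparsable line yields UNKNOWN rather than a verdict, which keeps a truncated log from being read as green.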

14.4 Single-host reboot SOP

| Host | Pre-reboot conditions | Must check after reboot | Done criteria |
| --- | --- | --- | --- |
| 110 | Not inside a backup-all / rclone / verify window; runner jobs stopped or manually cancelled; 188 healthy | Docker, Harbor, Gitea, Prometheus, Alertmanager, Sentry, cron, textfile exporters, /backup/scripts/backup-status.sh --no-notify | 110 services green; backup status shows no new stale / failed entries; runner/CD released last |
| 120 | Must be console-first maintenance; if reachable, cordon/drain first; if unreachable, do not claim the drain succeeded | power/VM/NIC/boot/initramfs/fsck state, SSH, kubectl get nodes, SchedulingDisabled cleared | 120 ping/SSH OK; mon Ready; backup configs/all/offsite/verify/cold-start chain rerun |
| 121 | 120 / 188 healthy; cordon/drain first when reachable | k3s-agent or live role, VIP state, kubectl get nodes, pod placement | mon1 Ready; VIP / NodePort path normal; no new failed pods in the workload |
| 188 | 110 backup status saved; momo heavy import stopped or postponed; no DB restore / migration in progress | PostgreSQL, Redis, Docker, momo DB parity, SigNoz/ClickHouse, cron, backup freshness | DB accepting connections; momo parity green; 188 backup jobs fresh; high-load services released last |

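14.4.1 110 post-reboot recovery command card

110 is the registry / observability / backup center. After a reboot, check the host and core ports first; do not restart the Docker daemon as the first move.

| Order | Check | Success baseline | Failure handling |
| --- | --- | --- | --- |
| 1 | systemctl is-system-running / failed units / Swap | running; failed units 0 or explainable; Swap not steadily growing | First distinguish stale units, active services, and storage/network problems |
| 2 | Docker daemon | systemctl is-active docker = active | If Docker is stuck in activating, read the journal first; do not restart/kill repeatedly |
| 3 | Harbor / registry | local /v2/ returns 200/401; public registry returns 401 when not logged in | Apply the minimal fix only to the failed upstream; avoid a daemon restart |
| 4 | Gitea / runners | Gitea 200/302; runners released last | Runner jobs must not compete for resources while P0/P1 is not green |
| 5 | Prometheus / Alertmanager | /-/ready and /-/healthy OK; required alerts visible | If alerts are missing, fix rules / drift guard first, then consider automation |
| 6 | Sentry / Langfuse / Stock / public tools | public 2xx/3xx; containers not in a restart loop | Fix only the clearly failed service; do not rebuild the whole compose stack |
| 7 | backup / offsite | backup-status --no-notify, offsite verifier | Keep the Configs red light while 120 is unreachable |

110 post-reboot minimal commands: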

ssh wooo@192.168.0.110 '
date; uptime; systemctl is-system-running || true; systemctl --failed --no-pager --plain || true
free -h; swapon --show
systemctl is-active docker cron || true
curl -s -o /dev/null -w "harbor_v2=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
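
The curl lines above print `harbor_v2=<code>` and `gitea=<code>`. A small sketch that checks those lines against the success baselines in the command card (Harbor /v2/ 200/401, Gitea 200/302); `check_http_line` is a hypothetical helper name:

```shell
# Hypothetical helper: judge one "label=http_code" line produced by the
# curl -w format strings above against the 110 command-card baselines.
check_http_line() {
  local line="$1"
  case "$line" in
    harbor_v2=200|harbor_v2=401) echo "OK $line" ;;
    gitea=200|gitea=302)         echo "OK $line" ;;
    *)                           echo "FAIL $line"; return 1 ;;
  esac
}
```

curl reports `000` when no HTTP response was received, so a timeout falls through to FAIL rather than being skipped.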

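2026-06-12 addendum: when stockplatform-shared-ui-monitor.timer points at a legacy path that no longer exists, the stale timer may be disabled to clear the host failed unit, but the official source-of-truth must be cleaned up afterwards; reset-failed must not be treated as a fix.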

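14.4.2 188 post-reboot recovery command card

188 is the data and AI/Web dependency host. Until it has recovered, do not release K3s, the AWOOOI API, momo heavy import, or AI auto-remediation.

| Order | Check | Success baseline |
| --- | --- | --- |
| 1 | PostgreSQL | pg_isready reports accepting; no checkpoint / WAL panic |
| 2 | Redis | PONG |
| 3 | Docker / containerd | active; momo-db / signoz / openclaw / litellm not in a restart loop |
| 4 | momo DB parity | daily_sales_snapshot and realtime_sales_monthly agree on row count and date bounds for the current month |
| 4a | momo Google Drive token writeback | /home/ollama/momo-pro/config/google_token.json owner matches the Docker userns scheduler UID; mode no wider than 600; the token content must not be read or printed |
| 4b | momo business data freshness | A latest daily_sales_snapshot date lagging 0-2 days is acceptable; a lag of 3 days or more is BLOCKED, and full-stack green must not be declared even if the homepage / health / DB parity all look normal |
| 5 | SigNoz / monitoring bridge | HTTP 200; ClickHouse not in a repair storm |
| 6 | momo scheduler | container healthy; recent activity pattern > 0; heavy import released only after the DB is green |
| 7 | backup freshness | 188 backup textfile / 110 backup-from-188 freshness OK |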
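
The freshness rule in row 4b (0-2 days acceptable, 3 or more BLOCKED) can be sketched as below. `freshness_verdict` is a hypothetical helper; how the lag in days is obtained from daily_sales_snapshot is left out on purpose and is an assumption of the caller:

```shell
# Hypothetical helper: apply the 4b rule to a lag expressed in whole days.
freshness_verdict() {
  local lag_days="$1"
  case "$lag_days" in
    ''|*[!0-9]*) echo "UNKNOWN"; return 2 ;;   # refuse non-numeric input
  esac
  if [ "$lag_days" -le 2 ]; then
    echo "OK"
  else
    echo "BLOCKED"
    return 1
  fi
}
```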

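After a 188 reboot, "homepage 200" must not stand in for DB parity, and DB parity must not stand in for data freshness. If posting list tuple ... cannot be split appears, use only REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;. Do not truncate and do not restore the whole database.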

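2026-06-25 addendum: if momo-scheduler logs show Google Drive 認證失敗 (authentication failed) / could not locate runnable browser / Permission denied: 'config/google_token.json', make a metadata-only assessment first and do not read the token content. The latest 10:35 readback shows that both the host path /home/ollama/momo-pro/config/google_token.json and the container-side config/google_token.json are missing, and the scheduler host UID is still 100000; the 2026-06-24 "change owner/mode only" fix conclusion therefore no longer applies. The minimal safe flow to clear the WARN is: obtain an owner-provided non-secret evidence ref; confirm the maintenance window and rollback owner; recreate or restore the token artifact without pasting the token; check only stat owner:group:mode and the scheduler auth readback; then rerun cold-start. Until this is done, MOMO health 200 and DB parity cannot substitute for token/writeback evidence.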
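
The "mode no wider than 600" check can be done on metadata alone. A minimal sketch, assuming the octal mode string comes from something like GNU `stat -c '%a'` on the token path; `mode_not_wider_than_600` is a hypothetical helper and never opens the file:

```shell
# Hypothetical helper: succeed only if an octal mode string grants nothing
# beyond owner read/write (0600). It works on the mode string only.
mode_not_wider_than_600() {
  local mode="$1"
  case "$mode" in
    ''|*[!0-7]*) return 2 ;;   # not an octal mode string
  esac
  # Any permission bit outside 0600 makes the mode too wide.
  [ $(( 0$mode & ~0600 )) -eq 0 ]
}

# Example (metadata only, GNU stat assumed):
#   mode_not_wider_than_600 "$(stat -c '%a' /home/ollama/momo-pro/config/google_token.json)"
```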

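14.4.3 120 recovery command card

120 is currently a console-first blocker. While it is unreachable, remote work is limited to evidence collection; do not pretend it has been repaired.

| State | Assessment | Correct action |
| --- | --- | --- |
| ping / SSH / ARP all fail | host / VM / network layer unknown | Go to the hypervisor / console and confirm power, NIC, boot screen |
| initramfs / fsck prompt | filesystem repair gate | Handle offline, following 120-fsck-maintenance-checklist.sh |
| SSH restored but K3s NotReady | K3s / runtime layer | Check journalctl -u k3s, containerd, and 188 PostgreSQL first, then remove the cordon |
| node Ready but SchedulingDisabled | scheduling state not cleared | After confirming health, run kubectl uncordon mon, then check the workload |

After 120 recovers, do not look only at kubectl get nodes. The following must be rerun: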

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1

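14.4.4 121 post-reboot recovery command card

121 is the K3s failover / secondary control-plane path. The core rule after its reboot is: do not let mon1 Ready mask mon being unreachable.

| Check | Success baseline | Note |
| --- | --- | --- |
| SSH / systemd | host ready; failed units explainable | 121 green does not mean 120 green |
| K3s role | kubectl get nodes -o wide is readable | If only mon1 is Ready, the cluster is still degraded |
| VIP / NodePort | VIP / public routes reachable | Must confirm the route goes through 192.168.0.125:32334/32335 |
| Cron / DR drill | cron present; DR drill not stopped by mistake | schedule green is part of the cold-start done criteria |

If, after 121 reboots, mon1 is Ready and mon is NotReady,SchedulingDisabled, the conclusion is "121 recovered, cluster still degraded"; a healthy 121 must not be reported as K3s fully green.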
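
A sketch of that rule applied to node listings. It assumes input in the shape of `kubectl get nodes --no-headers` (name in column 1, status in column 2) and an expected node count of 2 (mon and mon1); `cluster_verdict` is a hypothetical helper:

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` style lines on
# stdin and refuse to call the cluster green unless every expected node is
# exactly "Ready" (no NotReady, no SchedulingDisabled).
cluster_verdict() {
  awk -v expected="${1:-2}" '
    NF >= 2 {
      total++
      if ($2 != "Ready") { bad++; printf "NODE_NOT_GREEN %s %s\n", $1, $2 }
    }
    END {
      if (bad > 0 || total < expected) { print "CLUSTER_DEGRADED"; exit 1 }
      print "CLUSTER_GREEN"
    }'
}

# Example: sudo kubectl get nodes --no-headers | cluster_verdict 2
```

Counting nodes as well as statuses covers the case where mon has dropped out of the listing entirely and only mon1 is shown as Ready.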

14.5 Record format for every reboot

Every startup, shutdown, and reboot must append a record to docs/LOGBOOK.md and sync the necessary state to this SOP or the workplan.

## YYYY-MM-DD | Host reboot / shutdown / startup record

Scope:
- Hosts:
- Operation: shutdown / startup / reboot / recovery
- SOP version used:
- Operator:
- Maintenance window:

Pre-check:
- Cold-start scorecard:
- Backup status:
- Offsite verifier:
- Public routes:
- momo DB parity:
- Alertmanager rules / E2E:
- Credential escrow:

Execution:
- Start time:
- End time:
- Commands / console actions:
- Services paused:
- Services released:

Result:
- 110:
- 120:
- 121:
- 188:
- Cold-start scorecard after:
- Backup status after:
- Offsite verifier after:
- DB parity after:
- Alerts after:

Difference versus previous reboot:
- Faster:
- Slower:
- New blocker:
- Repeated blocker:
- False positive / detector tuning:
- SOP change required: yes/no

SOP update:
- Previous version:
- New version:
- Change reason:
- Files updated:

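14.6 SOP version comparison and revision rules

After every reboot, compare against the previous record; do not just write "recovered".

| Comparison item | How to measure |
| --- | --- |
| Time to SSH | From power-on until SSH is OK on each host |
| Time to K3s Ready | From 120/121 boot until nodes are Ready |
| Time to public routes | From K3s Ready until public routes return 2xx/3xx |
| Time to backup green | From 110 ready until backup status / offsite verifier are green |
| Persistent blockers | Any blocker seen in two or more consecutive reboots becomes an SOP hard gate |
| False positives | For example the momo scheduler detector WARN; record the direct evidence and the tuning direction clearly |
| Procedure drift | When live cron, Ansible template, or script path disagrees with the SOP, fix the canonical source first, then the SOP |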
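
The "Persistent blockers" row can be checked mechanically once each reboot record lists its blockers one per line. A minimal sketch under that assumption; `repeated_blockers` and the two-file layout are hypothetical, not an existing script:

```shell
# Hypothetical helper: print blockers that appear in both the previous and
# the current reboot record (one blocker label per line in each file).
repeated_blockers() {
  local prev="$1" curr="$2"
  # -F fixed strings, -x whole-line match, -f patterns from the previous record
  sort -u "$curr" | grep -Fx -f "$prev" || true
}
```

Anything this prints has occurred in two consecutive records and is a candidate for promotion to a hard gate.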

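Revision rules:

  • Only the live baseline or percentages change: no version bump; update only the date and evidence.
  • Adding, removing, or reordering operations: bump the minor version, for example v1.4 -> v1.5.
  • Anything touching destructive operations, data repair strategy, or human approval boundaries: raise a major-ready review and obtain human approval first.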
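
The minor bump in the second rule is simple arithmetic on the `vMAJOR.MINOR` string. A small sketch; `bump_minor` is a hypothetical helper:

```shell
# Hypothetical helper: v1.4 -> v1.5. Accepts only the vMAJOR.MINOR form.
bump_minor() {
  local v="$1" major minor
  case "$v" in
    v[0-9]*.[0-9]*) ;;
    *) echo "invalid version: $v" >&2; return 2 ;;
  esac
  major=${v#v}; major=${major%%.*}
  minor=${v##*.}
  case "$major$minor" in
    *[!0-9]*) echo "invalid version: $v" >&2; return 2 ;;
  esac
  echo "v${major}.$((minor + 1))"
}
```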

14.7 2026-06-06 reboot record comparison anchor

No new reboot was performed on 2026-06-06; this was a live recovery check. It still serves as the comparison baseline for the next reboot:

| Item | 2026-06-06 baseline |
| --- | --- |
| Overall | 65% BLOCKED |
| Cold-start | PASS=71 WARN=3 BLOCKED=3 |
| Remaining hard blocker | 120 ping / SSH / K3s read-only check |
| Backup aggregate | failed=1, Configs only, due to 120 config capture |
| Backup freshness | 110 and 188 fresh, no stale jobs |
| Offsite | 13 repos latest-only green |
| Escrow | 5 markers missing |
| momo scheduler | direct healthy; 15:03 scorecard no longer emits scheduler WARN |

14.8 2026-06-12 post-reboot comparison anchor

After the unplanned reboot of 110 on 2026-06-12, the new comparison anchor for SOP v1.5 is:

| Item | 2026-06-12 post-reboot baseline |
| --- | --- |
| 110 host | systemd running; failed units 0; Swap 0B/7.8GiB |
| 110 service recovery | Harbor / Gitea / Prometheus / Alertmanager / Sentry / Stock / public tools reachable |
| Cold-start | PASS=72 WARN=2 BLOCKED=3 |
| Remaining hard blocker | 120 ping / SSH / K3s read-only check |
| WARN | 120-driven backup aggregate/config component and 120 K3s schedule check |
| Backup freshness | 110 13/13 fresh, failed=1; 188 2/2 fresh, failed=0; stale none |
| Offsite | 13 repos latest-only green; REMOTE_LATEST_ONLY_OK=1; VERIFY_OK=1 |
| Alerts | Prometheus and Alertmanager expose all five required backup/cold-start/escrow alerts |
| momo scheduler | scorecard reads SCHEDULER_RECENT_ACTIVITY 1070 after detector fix |
| SOP change | v1.5 adds startup judgment layers, GO/NO-GO tree, host recovery cards, and timeline checks |

14.9 2026-06-13 post-CD recovery comparison anchor

2026-06-13 was not a host reboot; it is the comparison anchor used to verify whether "120/121 workload balancing + CD known_hosts guardrail" can withstand the next normal deployment.

| Item | 2026-06-14 03:10 baseline |
| --- | --- |
| Gitea / ArgoCD | Gitea main 8868c025; deploy marker 7b034b58; ArgoCD revision 8868c025; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 26b67d11f7b7de4f9c9d95c01bb1dacf4000e887 |
| K3s placement | API/Web verified split across mon / mon1 after the latest deploy marker; Worker single replica healthy |
| Cold-start | PASS=81 WARN=2 BLOCKED=0 |
| Public routes | Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | backup-status: 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5; last aggregate 2026-06-14 02:40:22 |
| Offsite | textfile remote_verify_ok=1; full_verify_fresh=1; 13 repos each snapshot_count=1 |
| SSH trust | Global known_hosts retained 120 / 188 entries after CD; deploy-specific trust moved to deploy_known_hosts |
| Remaining non-service debt | km-vectorize-29689620 official Job failed with BackoffLimitExceeded; failed Pod/log was deleted before inspection; credential escrow missing count remains 5; 110 has fwupd failed units |
| SOP change | v1.10 changes the first-screen declaration from full green back to degraded, records official km-vectorize failure evidence, and verifies live restartPolicy: Never / FallbackToLogsOnError evidence retention for the next official run |

14.10 2026-06-14 110 failed-unit cleanup comparison anchor

The 2026-06-14 08:24 change was not a host reboot; it removed the non-core fwupd failed-unit noise on 110 from the cold-start verdict. The anchor exists so that a firmware metadata refresh failure is not later misread as an AWOOOI runtime failure, while a rollback path is kept.

| Item | 2026-06-14 08:24 baseline |
| --- | --- |
| 110 failed units | systemctl --failed: 0 loaded units listed |
| fwupd policy | fwupd-refresh.timer: disabled / inactive, because a non-core firmware metadata refresh failure should not block AWOOOI service recovery |
| Rollback | To restore the firmware metadata refresh timer, run sudo systemctl enable --now fwupd-refresh.timer, then rerun cold-start |
| Cold-start | PASS=82 WARN=1 BLOCKED=0 |
| Remaining WARN | Only the K8s failed Job km-vectorize-29689620 remains; wait for the next official 03:00 schedule to succeed, or retain the failed Pod/log |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| Credential escrow | 5 non-secret evidence markers still missing; placeholders or secrets must not be used to clear the red light |
| SOP change | v1.11 changes the 110 failed-unit gate from GREEN_WITH_FWUPD_WARNING to GREEN_WITH_FWUPD_TIMER_DISABLED and fixes the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.11 2026-06-14 post-CD recovery readback

The 2026-06-14 08:40 change was not a host reboot; it confirmed that the latest CD deploy marker did not regress the reboot recovery state. This anchor is used to check whether the cold-start SOP still holds after governance / frontend / API CD.

| Item | 2026-06-14 08:40 post-CD baseline |
| --- | --- |
| Gitea / ArgoCD | Gitea main 18b867c3; ArgoCD revision 18b867c3; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag e0a6d339669fc635357d36ea94215df25e652fa9 |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi; restartPolicy: Never; terminationMessagePolicy: FallbackToLogsOnError; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon |
| Cold-start | PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.12 records the post-CD no-regression readback and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.12 2026-06-14 post-P2-135-deploy recovery readback

The 2026-06-14 09:27 change was not a host reboot; it confirmed that the reboot recovery baseline did not regress after the P2-135 deploy and formal verification. This anchor also records how to judge the brief 502 during stockplatform-v2 rollout warmup: recheck the route / TLS directly and rerun the full cold-start; escalate to a persistent public route blocker only if the rerun still fails.

| Item | 2026-06-14 09:27 post-P2-135 baseline |
| --- | --- |
| Gitea / ArgoCD | Gitea main 5bad267e; ArgoCD revision 5bad267e; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 280e0fbef0d5dccb10f1efe2cc18cf423544254e |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi; restartPolicy: Never; terminationMessagePolicy: FallbackToLogsOnError; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon1 |
| First cold-start | 09:26 first run saw stock.wooo.work 502 while stockplatform-v2 containers were less than one minute old; direct route and TLS recheck returned 200 |
| Final cold-start | 09:27 rerun returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.13 records the P2-135 post-deploy no-regression readback and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |
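
The warmup rule recorded in this anchor (recheck first, escalate only if the recheck still fails) can be sketched as a bounded retry around any probe command. `probe_with_retry`, the attempt count, and the delay are hypothetical choices, not values taken from the cold-start script:

```shell
# Hypothetical helper: run a probe command up to N times and only report a
# persistent blocker if every attempt fails.
# Usage: probe_with_retry <attempts> <delay_seconds> <command> [args...]
probe_with_retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      if [ "$i" -eq 1 ]; then
        echo "ROUTE_OK"
      else
        echo "ROUTE_OK_AFTER_WARMUP attempts=$i"   # keep as warmup evidence
      fi
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  echo "ROUTE_PERSISTENT_BLOCKER attempts=$attempts"
  return 1
}

# Example:
#   probe_with_retry 3 20 curl -fsS -o /dev/null --max-time 5 https://stock.wooo.work/
```

A success after the first attempt is reported separately so the transient 502 still shows up in the record instead of disappearing.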

14.13 2026-06-14 recovery readback after the P2-136 / AI Agent activity formal deployment

The 2026-06-14 09:56 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the formal deployment of P2-136 / AI Agent activity. This anchor specifically records that the deploy marker, ArgoCD revision, live image, and cold-start scorecard must be read together, so that gitea/main alone or a successful CD is not misreported as full-stack green.

| Item | 2026-06-14 09:56 post-P2-136 baseline |
| --- | --- |
| Gitea / ArgoCD | Latest docs head before this recovery commit a0fe7741; runtime deploy marker 60a0415c chore(cd): deploy a3de0ff [skip ci]; ArgoCD revision 60a0415c; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag a3de0ffb8275b6838604b6dff87cd978b8e91122 |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi; restartPolicy: Never; terminationMessagePolicy: FallbackToLogsOnError; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z; failed Job km-vectorize-29689620 remains retained |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon1 |
| Cold-start | 09:56 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.14 records the no-regression readback after the P2-136 / AI Agent activity formal deployment and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.14 2026-06-14 recovery readback after the P2-137 / CI smoke timeout fix

The 2026-06-14 10:40 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-137 formal deployment and the BusyBox timeout smoke fix. This anchor records only the recovery / cold-start readback and does not repeat the P2-137 formal verification.

| Item | 2026-06-14 10:40 post-P2-137 baseline |
| --- | --- |
| Gitea / ArgoCD | Latest docs head before this recovery commit 50d4f2ba; runtime deploy marker d023f5d7 chore(cd): deploy f737f27 [skip ci]; ArgoCD revision 50d4f2ba; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag f737f278dc14372ff1fb15b124b1370c20e1bb99 |
| CronJob readback | km-vectorize has KM_PROJECT_ID=awoooi; restartPolicy: Never; terminationMessagePolicy: FallbackToLogsOnError; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z; failed Job km-vectorize-29689620 remains retained |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon |
| Cold-start | 10:40 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.15 records the no-regression readback after the P2-137 / CI smoke timeout fix and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.15 2026-06-14 recovery readback after the P2-143 owner response preflight

The 2026-06-14 15:00 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-143 owner response preflight and rejection boundary were formally deployed. This anchor records only the recovery / cold-start readback, does not repeat the P2-142 / P2-143 formal verification, and does not treat the owner response preflight as runtime authorization.

| Item | 2026-06-14 15:00 post-P2-143 baseline |
| --- | --- |
| Gitea / ArgoCD | Latest docs baseline b09eb1c6 docs(ai): 校準 P2-143 正式驗證紀錄; runtime deploy marker 667d6329 chore(cd): deploy 755b0a8 [skip ci]; ArgoCD revision 4abf0c0f750254d3c7137eae049abdfd99630f5f; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 755b0a8d3038df2c52dee280067863d92db1eda5 |
| CronJob readback | km-vectorize schedule 0 3 * * *; timeZone=Asia/Taipei; suspend=false; failedJobsHistoryLimit=3; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z; failed Job km-vectorize-29689620 is still retained, but no readable failed Pod / log currently exists |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon |
| Cold-start | 15:00 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops all respond over TLS |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| P2-143 API boundary | Production endpoint returns current P2-143, next P2-144, completion 100; reviewer / Gateway queue, Telegram, Bot API, result capture, learning, PlayBook trust, production write, secret read, and destructive operation all remain 0 / false |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.16 records the no-regression readback after the P2-143 owner response preflight and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.16 2026-06-14 recovery readback after the P2-144 owner response readback

The 2026-06-14 15:58 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-144 owner response readback state and the subsequent deploy marker 180a6543 were formally deployed. This anchor records only the recovery / cold-start readback, does not repeat the P2-144 formal verification, and does not treat the owner response readback as runtime authorization, formal receipt, or owner acceptance.

| Item | 2026-06-14 15:58 post-P2-144 baseline |
| --- | --- |
| Gitea / ArgoCD | gitea/main has advanced to 180a6543 chore(cd): deploy fef94df [skip ci]; ArgoCD source revision 180a6543eaf26dd6b345d45114316926056a965a; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag fef94df877c5438f9f34ddbcace8ad8112a141ef |
| CronJob readback | km-vectorize schedule 0 3 * * *; timeZone=Asia/Taipei; suspend=false; failedJobsHistoryLimit=3; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z; failed Job km-vectorize-29689620 is still retained, but no readable failed Pod / log currently exists |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon1 |
| Cold-start | 15:58 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops all respond over TLS |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| P2-144 API boundary | Production endpoint returns current P2-144, next P2-145, completion 100; owner response received / accepted / rejected, reviewer / Gateway queue, Telegram, Bot API, result capture, learning, PlayBook trust, production write, secret read, and destructive operation all remain 0 / false |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.17 records the no-regression readback after the P2-144 owner response readback and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.17 2026-06-14 recovery readback after the P2-145 owner response acceptance gate

The 2026-06-14 16:29 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the P2-145 owner response acceptance gate was formally deployed. This anchor records only the recovery / cold-start readback, does not repeat the P2-145 formal verification, and does not treat the acceptance gate as owner response received / accepted, runtime authorization, or a formal write.

| Item | 2026-06-14 16:29 post-P2-145 baseline |
| --- | --- |
| Gitea / ArgoCD | Latest docs baseline 06fe0a8f docs(logbook): 記錄 P2-145 正式驗證 [skip ci]; runtime deploy marker 36fbfc6b chore(cd): deploy 386dbd0 [skip ci]; ArgoCD source revision 06fe0a8f14167824fea512f942d2569431bbcbc8; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag 386dbd078ef63401d9736048463f4ef5326442d9 |
| CronJob readback | km-vectorize schedule 0 3 * * *; timeZone=Asia/Taipei; suspend=false; failedJobsHistoryLimit=3; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z; failed Job km-vectorize-29689620 is still Failed 0/1 |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon |
| Cold-start | 16:29 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops all respond over TLS |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| P2-145 API boundary | Production endpoint returns current P2-145, next P2-146, completion 100; owner response received / accepted / rejected, reviewer / Gateway queue, Telegram, Bot API, result capture, learning, PlayBook trust, production write, secret read, and destructive operation all remain 0 / false |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.18 records the no-regression readback after the P2-145 owner response acceptance gate and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.18 2026-06-14 recovery readback after the IwoooS P0 configuration control priority

The 2026-06-14 17:04 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the IwoooS P0 configuration control priority was formally deployed. This anchor records only the recovery / cold-start readback, does not repeat the P0 configuration control formal verification, and does not treat dashboard visibility as Nginx reload, DNS / TLS probe, certbot renew, workflow / secret change, public route change, or runtime gate.

| Item | 2026-06-14 17:04 post-IwoooS-P0-config baseline |
| --- | --- |
| Gitea / ArgoCD | Latest docs baseline af62ec1f docs(iwooos): 記錄 P0 配置控管正式驗證 [skip ci]; runtime deploy marker ed651a98 chore(cd): deploy e992af8 [skip ci]; ArgoCD source revision af62ec1fe72b3e84e179d80e788e5a5902bdaf27; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag e992af89955f8aae40a383b2f2e2f645445a690d |
| CronJob readback | km-vectorize schedule 0 3 * * *; timeZone=Asia/Taipei; suspend=false; failedJobsHistoryLimit=3; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z; failed Job km-vectorize-29689620 is still Failed 0/1 |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon1 |
| Cold-start | 17:04 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops all respond over TLS; the additional readback of IwoooS route /zh-TW/iwooos returned 200 |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| IwoooS boundary | The P0 configuration control priority is visible, but live evidence received, runtime gate, Nginx live config, DNS / TLS probe, certbot renew, workflow / secret change, public route change, and production write must not be inferred as authorized from this readback |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.19 records the no-regression readback after the IwoooS P0 configuration control priority and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.20 2026-06-15 km-vectorize official success readback

The 2026-06-15 03:11 change was not a host reboot; it confirmed that the official 03:00 km-vectorize schedule succeeded and closed the ArgoCD fully healthy gate. This anchor records only the recovery / cold-start readback: no manual Job deletion, no manual Job creation, no kubectl patch against live, and no service restart. No backup / restore / escrow owner acceptance ledger is treated as authorization for a backup run, restore run, credential escrow marker write, host write, or production write.

| Item | 2026-06-15 03:11 km-vectorize official success baseline |
| --- | --- |
| ArgoCD | awoooi-prod sync Synced; health Healthy; revision d388e5b477333fd5e661527a729406a4e8215320 |
| CronJob readback | km-vectorize schedule 0 3 * * *; timeZone=Asia/Taipei; suspend=false; lastScheduleTime=2026-06-14T19:00:00Z; lastSuccessfulTime=2026-06-14T19:00:55Z |
| Job / Pod / log | Job km-vectorize-29691060 Complete; Pod km-vectorize-29691060-78xpz Completed, restart 0; log embed-all: 200 {"total":31,"success":31,"failed":0} |
| Cold-start | 03:11 returned PASS=81 WARN=2 BLOCKED=0; result DEGRADED |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; last aggregate 2026-06-15 02:40:13 |
| Escrow | ESCROW_MISSING_COUNT=5; missing restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery |
| Remaining warnings | 188 momo scheduler registration/activity unconfirmed; K8s still retains the old failed Job evidence |
| SOP change | v1.21 closes the km-vectorize official success gate, but the declaration ceiling remains SERVICE_AVAILABLE_ARGOCD_HEALTHY_DR_ESCROW_BLOCKED; full-stack green and DR complete must not be declared |

14.19 2026-06-14 recovery readback after the high-value configuration Owner Packet frontend sync

The 2026-06-14 18:15 change was not a host reboot; it confirmed that the reboot recovery baseline still had not regressed after the high-value configuration Owner Packet frontend sync was formally deployed. This anchor records only the recovery / cold-start readback and does not repeat the Owner Packet frontend formal verification, posture projection, or intake preflight. It also does not treat a visible frontend draft as request sent, owner response received / accepted, runtime gate, Nginx reload, DNS / TLS probe, certbot renew, workflow / secret change, host write, active scan, or production write.

| Item | 2026-06-14 18:15 post-owner-packet-frontend baseline |
| --- | --- |
| Gitea / ArgoCD | Latest repo docs baseline 0a4766dd docs(security): 新增高價值配置 owner request 草稿包 [skip ci]; runtime deploy marker 16c6b983 chore(cd): deploy e999c16 [skip ci]; feature commit e999c16b fix(iwooos): 同步高價值配置 owner packet 前台; ArgoCD source revision 0a4766ddc94b0690824ce3deba5c6b9a69764f94; sync Synced; health Degraded |
| K3s image readback | API/Web/Worker/CronJob image tag e999c16b3435f197b78fe2adfeec1c4faa6c4675 |
| CronJob readback | km-vectorize schedule 0 3 * * *; timeZone=Asia/Taipei; suspend=false; failedJobsHistoryLimit=3; lastScheduleTime=2026-06-13T19:00:00Z; lastSuccessfulTime=2026-06-04T11:00:37Z; failed Job km-vectorize-29689620 is still Failed 0/1 |
| K3s placement | API pods split mon / mon1; Web pods split mon / mon1; Worker single replica on mon |
| Cold-start | 18:15 returned PASS=82 WARN=1 BLOCKED=0 |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops all respond over TLS; the additional readbacks of IwoooS route /zh-TW/iwooos and AwoooP route /zh-TW/awooop both returned 200 |
| Backup | 110 13/13 fresh, failed=0; 188 2/2 fresh, failed=0; core_blockers=0; escrow_missing=5 |
| 110 host | systemctl --failed: 0 loaded units listed; fwupd-refresh.timer remains disabled / inactive |
| Owner Packet boundary | The Owner Packet frontend figures are visible, but request sent, owner response received / accepted / rejected, reviewer queue write, live evidence, runtime gate, Nginx live config, DNS / TLS probe, certbot renew, workflow / secret change, host write, active scan, and production write must not be inferred as authorized from this readback |
| Remaining gate | km-vectorize-29689620 official Job still failed; credential escrow missing count still 5 |
| SOP change | v1.20 records the no-regression readback after the high-value configuration Owner Packet frontend sync and keeps the declaration ceiling at SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED |

14.21 2026-06-18 Plan B degraded-operation path

The 2026-06-18 change was neither a host reboot nor a new live recovery readback; it turned the commander's Plan B requirement into an executable SOP. This anchor is used at the next reboot to check whether §1.4 was followed: decide Plan A / Plan B first, then the degradation level, the stop-lines, and the conditions for returning to Plan A.

| Item | 2026-06-18 Plan B baseline |
| --- | --- |
| SOP version | v1.22 |
| Plan B trigger | backup/offsite/verifier running; P0 host unreachable for 15 minutes; 188 data unhealthy; 110 registry / observability unhealthy; single K3s node degraded; route-only green; cold-start WARN; credential escrow missing |
| Service levels | B0_ABORTED_BEFORE_REBOOT, B1_HOST_RECOVERY_ONLY, B2_CORE_SERVICE_READY, B3_SERVICE_AVAILABLE_DEGRADED, B4_FULL_STACK_GREEN, B5_DR_COMPLETE |
| Host fallback paths | 110 / 120 / 121 / 188 / K3s / Public gateway each have a degradation path and conditions for returning to Plan A |
| Timeline | T+0 freeze, T+5 host boot, T+15 data / registry stop-line, T+30 route-only guard, T+60 cold-start scorecard, T+120 incident / follow-up |
| Closeout states | RETURNED_TO_PLAN_A, SERVICE_AVAILABLE_DEGRADED, OPEN_INCIDENT_REQUIRED |
| SOP change | v1.22 adds Plan B; Plan B must not be treated as runtime write authorization, and documenting Plan B does not justify declaring new service green, full-stack green, or DR complete |
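
The Timeline row can be read as a lookup from elapsed minutes to the latest checkpoint that should already have been reached. A minimal sketch of that reading; `planb_checkpoint` is a hypothetical helper, and treating each T+N as a lower bound is an assumption about how the timeline is meant to be used:

```shell
# Hypothetical helper: map minutes elapsed since T+0 to the most recent
# Plan B timeline checkpoint listed in the baseline table.
planb_checkpoint() {
  local m="$1"
  case "$m" in
    ''|*[!0-9]*) echo "UNKNOWN"; return 2 ;;
  esac
  if   [ "$m" -ge 120 ]; then echo "T+120 incident / follow-up"
  elif [ "$m" -ge 60 ];  then echo "T+60 cold-start scorecard"
  elif [ "$m" -ge 30 ];  then echo "T+30 route-only guard"
  elif [ "$m" -ge 15 ];  then echo "T+15 data / registry stop-line"
  elif [ "$m" -ge 5 ];   then echo "T+5 host boot"
  else                        echo "T+0 freeze"
  fi
}
```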

14.23 2026-06-18 repo-side readiness audit blocker closure

The second 2026-06-18 change was neither a live recovery nor a host reboot; it reduced the repo-side hard blockers from the previous readiness audit to verifiable contracts. This anchor means "the reboot SOP / baseline / scripts / Ansible source-of-truth / Gitea workflow contract pass the readiness audit inside the repo"; it does not mean the live hosts were re-verified that day.

| Item | 2026-06-18 repo-side readiness baseline |
| --- | --- |
| SOP version | v1.23 |
| Cold-start gate | full-stack-cold-start-check.sh adds NODE_FS_ERROR_EVENTS; when 120 / K3s node events show filesystem / fsck / read-only / I/O evidence, K3s must not be declared safe |
| Backup contract | backup-awoooi.sh drops the service-level direct offsite sync; offsite publishing goes only through the centralized sync-offsite-backups.sh / verifier gate |
| Ansible 110 source-of-truth | 110-devops.yml now covers cold-start monitor, runner guardrails, host textfile exporters, backup scripts, daily backup heartbeat, offsite evidence report, offsite full-sync verifier |
| Ansible 188 source-of-truth | 188-ai-web.yml now covers textfile exporters and pins the momo PostgreSQL backup entrypoint to the host-owned /home/ollama/bin/momo-pg-backup.sh |
| Nginx source-of-truth | nginx-sync.yml now covers 188-internal-tools-https.conf.j2 route sync |
| CI / workflow contract | .gitea/workflows/ansible-lint.yml switched to self-hosted validation; triggers cover Ansible, ops baseline, monitoring rules, backup scripts, reboot scripts, docs, and the workflow itself |
| Validation toolchain | bootstrap-ansible-validation-env.sh prefers Python 3.11 / 3.10 to build a pinned validation venv; ansible-validate.sh pins the repo roles path and guards syntax / loader readiness with the minimum lint profile |
| Repo-side readiness audit | PASS=185 WARN=1 BLOCKED=0, result READY WITH WARNINGS; the only warning is that --live was not run |
| Declaration limit | May declare REPO_SIDE_REBOOT_READINESS_READY_WITH_LIVE_CHECK_REQUIRED; must not declare FULL_STACK_GREEN, DR_COMPLETE, or live service recovery complete |

14.24 2026-06-18 live cold-start readback after repo-side closure

The 2026-06-18 12:13-12:17 readback is the same-day live verification after the repo-side readiness closure. It was neither a host reboot nor a runtime repair; its purpose is to separate "the mechanism is complete" from "the current live state" and so avoid false-green.

| Item | 2026-06-18 12:17 live baseline |
| --- | --- |
| SOP version | v1.24 |
| Cold-start read-only result | PASS=83 WARN=1 BLOCKED=0; result DEGRADED |
| Host reachability | 110 / 120 / 121 / 188 ping OK and SSH port OK |
| K3s | mon / mon1 Ready control-plane; VIP 192.168.0.125 present on 120; NODE_FS_ERROR_EVENTS 0 |
| 110 / 188 service checks | 110 Harbor / Gitea / Prometheus / Alertmanager / Sentry reachable; 188 PostgreSQL / Redis / momo / SigNoz reachable |
| Backup health | 110 backup health total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0; 188 backup health total=2 stale=0 |
| Public route / TLS | awoooi API/Web, mo, momo health, Gitea, Harbor, registry, Sentry, SigNoz, stock, Langfuse, Bitan, aiops all 2xx/3xx with TLS verified |
| AWOOOI rollout convergence | After transient 12:14 startup window, final readback shows API 2/2, Web 2/2, Worker 1/1, Canary 1/1, API health 200 healthy |
| Remaining warning | retained stale Job km-vectorize-29689620 from 2026-06-14 03:00; later official Jobs km-vectorize-29692500, 29693940, 29695380 are Complete |
| Declaration limit | May declare SERVICE_AVAILABLE_DEGRADED; must not declare FULL_STACK_GREEN, because WARN=1; must not declare DR_COMPLETE, because credential escrow evidence still requires real non-secret owner evidence |

14.25 2026-06-18 stale failed Job classification and service-green readback

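The 2026-06-18 13:43 change neither deletes a K8s Job nor creates one manually; it corrects the cold-start decision logic. A retained historical failed Job is evidence, and only a failed Job with no later official successful Job is an active blocker. Evidence retention and service readiness therefore no longer conflict.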

Item 2026-06-18 13:43 stale Job classification baseline
SOP version v1.25
Script change full-stack-cold-start-check.sh emits FAILED_JOBS, STALE_FAILED_JOBS, and ACTIVE_FAILED_JOBS
Active blocker rule ACTIVE_FAILED_JOBS > 0 causes warning; STALE_FAILED_JOBS > 0 is retained evidence and does not warn by itself
Readiness audit contract reboot-recovery-readiness-audit.sh requires both stale and active failed Job counters
Repo-side validation bash -n passed; readiness audit returned PASS=187 WARN=1 BLOCKED=0 with only the expected non-live warning
110 live script sync /home/wooo/scripts/full-stack-cold-start-check.sh hash b48af9c603aa5a1a4f9434d6cc510398bbecc2e46400a21410e735d5f7d177c4; previous version backed up to /home/wooo/scripts/full-stack-cold-start-check.sh.before-stale-active-job-classification.20260618-135516
Live cold-start readback PASS=84 WARN=0 BLOCKED=0, result GREEN
K8s Job evidence FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0, BAD_PODS=0
Backup / DR evidence 110 backup health 13/13 fresh failed=0; 188 backup health 2/2 fresh failed=0; escrow readback still ESCROW_MISSING_COUNT=5
Declaration limit May declare FULL_STACK_GREEN_FOR_SERVICE; must not declare DR_COMPLETE, credential escrow complete, or any runtime/security acceptance
SOP change v1.25 defines retained failed Job evidence vs active failed Job blocker; future reboot comparison must record all three counters
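
The stale/active rule above can be sketched as a small filter. This is a minimal illustration only, not the logic inside full-stack-cold-start-check.sh: the input format (one Job per line as owner, status, start epoch) and the function name `classify_failed_jobs` are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical input: one Job per line as "<owner-cronjob> <Failed|Complete> <start-epoch>".
# A failed Job counts as STALE when the same owner has a later Complete Job,
# otherwise it counts as ACTIVE (the only case that should raise a warning).
classify_failed_jobs() {
  awk '
    $2 == "Complete" { if (($3 + 0) > last_ok[$1]) last_ok[$1] = $3 + 0 }
    $2 == "Failed"   { n++; owner[n] = $1; ts[n] = $3 + 0 }
    END {
      for (i = 1; i <= n; i++) {
        if (last_ok[owner[i]] > ts[i]) { stale++ } else { active++ }
      }
      printf "FAILED_JOBS=%d STALE_FAILED_JOBS=%d ACTIVE_FAILED_JOBS=%d\n", n, stale, active
    }'
}

# Illustrative data: one old failure followed by two later successes.
printf '%s\n' \
  'km-vectorize Failed 100' \
  'km-vectorize Complete 200' \
  'km-vectorize Complete 300' | classify_failed_jobs
```

Recording all three counters, as the SOP change requires, keeps the retained failure visible as evidence while only `ACTIVE_FAILED_JOBS` drives the warning.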

14.26 2026-06-24 heartbeat noise / MOMO detector / rollout false-negative closure

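The 2026-06-24 change is not a host reboot; it converges two false signals in the reboot SOP. The healthy Telegram heartbeat no longer floods the channel every 30 minutes, and the MOMO scheduler / current-month parity detector no longer raises a false WARN because of an old log pattern or the wrong DB exec user. This anchor also records a CD rollout false-negative: the worker startup probe timed out once and the container restarted once, K8s ended up ready, but Gitea CD #3289 was marked Failure because rollout status timed out.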

Item 2026-06-24 live baseline
SOP version v1.27
Heartbeat code a84a5a0b fix(api): suppress healthy Telegram heartbeat noise
Deploy marker 4a7b5329 chore(cd): deploy a84a5a0 [skip ci]
Production image readback API/Web/Worker image tag a84a5a0bc4a672ac6feb95a85ac590aa2dd4bb71
Production rollout API 2/2, Web 2/2, Worker 1/1 Ready
CD result caveat Gitea CD #3289 shows Failure because worker rollout status timed out before old replica convergence; K8s deploy marker and production readiness are green
Healthy heartbeat rule When status=healthy and there are no warnings, only the suppression marker / log is updated and no Telegram is sent; warnings and recovery can still be sent
Live temporary suppression Redis keys heartbeat:silent_last_sent and heartbeat:healthy_suppressed_last_seen set with 24h TTL during deployment; no token or secret printed
110 live script sync /home/wooo/scripts/full-stack-cold-start-check.sh hash 47e67d0c018f741acfba17a93cb1d668779bd08745902099a10ee61e73ea55b6; previous version backed up to /home/wooo/scripts/full-stack-cold-start-check.sh.before-momo-detector-20260624-020759
MOMO scheduler evidence SCHEDULER_CONTAINER_RUNNING true; SCHEDULER_CONTAINER_HEALTH healthy; SCHEDULER_RECENT_ACTIVITY 1303
MOMO DB parity evidence `MOMO_MONTHLY_SYNC 10936
K3s node evidence NODE_FS_ERROR_EVENTS 0; NODE_READONLY_FILESYSTEM_TRUE 0; NODE_DISK_PRESSURE_TRUE 0; VIP 192.168.0.125 present
Live cold-start readback PASS=85 WARN=0 BLOCKED=0, result GREEN
Declaration limit May declare current service recovery scorecard green; must not declare DR_COMPLETE; credential escrow evidence missing remains 5
SOP change v1.27 requires heartbeat success-message suppression, MOMO detector parity using app-provided DB env, and rollout false-negative classification before retrying CD

For Worker / CronJob / queue services whose startup time may exceed the startup probe, a single failed rollout status --timeout=60s is not enough to declare production down. Check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health together. If the pod ends up ready but CD is red, that is CI timeout / probe tuning work, not a service restart incident; follow up by adjusting the startup probe or the post-deploy timeout.
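
A minimal sketch of that classification, assuming the operator has already collected the individual readbacks by hand. The function name, the argument order, and the three result labels are hypothetical; the sketch also checks only a subset of the evidence listed above (pod readiness, image tag, API health), so restart count and public route still need a separate look.

```shell
#!/bin/sh
# classify_rollout <rollout_status_exit> <ready> <desired> <image_tag_matches: yes|no> <health_http_code>
# All inputs are values the operator read back separately; nothing here talks to the cluster.
classify_rollout() {
  rollout_exit=$1; ready=$2; desired=$3; image_ok=$4; health=$5
  if [ "$ready" -eq "$desired" ] && [ "$image_ok" = yes ] && [ "$health" -eq 200 ]; then
    if [ "$rollout_exit" -eq 0 ]; then
      echo "ROLLOUT_OK"
    else
      # Pods converged after the CD step gave up: CI timeout / probe tuning work.
      echo "CD_FALSE_NEGATIVE"
    fi
  else
    # Not converged yet: keep investigating before making any declaration.
    echo "NOT_CONVERGED"
  fi
}

# Illustrative call: rollout status timed out (exit 1) but the worker is 1/1 and healthy.
classify_rollout 1 1 1 yes 200
```

Keeping the rollout exit code as just one input among several is the point of the rule: a red CD run alone never decides the outcome.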

2026-06-24 02:44 addendum: the 02:08 PASS=85 WARN=0 BLOCKED=0 result in this section has been superseded by the MOMO data freshness gate in §14.28; do not cite it again to declare full-stack green.

14.27 2026-06-24 188 node-exporter / backup health alert closure

The second 2026-06-24 change restores the 188 node-exporter textfile scrape. Both backup-status and cold-start can read a fresh 188 backup_health.prom over SSH, but when the Prometheus node-exporter-188 scrape is down, BackupHealthMonitorMissing188 fires correctly. In this situation do not silence the alert; restore the exporter.

Item 2026-06-24 188 exporter baseline
SOP version v1.28
Root cause 188 9100 connection refused; node_exporter / prometheus-node-exporter unit absent/inactive; Prometheus could not scrape backup_health.prom
False start Mounting /home/ollama/node_exporter_textfiles via /host/home/ollama/... failed because /home/ollama is 750 and textfile collector saw permission denied
Live restore Docker container node-exporter uses quay.io/prometheus/node-exporter:v1.8.2, restart=unless-stopped, -p 9100:9100, rootfs mount /host, direct textfile bind /home/ollama/node_exporter_textfiles:/textfile:ro
Repo helper scripts/ops/188-node-exporter-restore.sh
Local metrics awoooi_backup_health_monitor_up{host="188"} 1; node_textfile_scrape_error 0
Prometheus readback up{job="node-exporter-188"} 1; awoooi_backup_health_monitor_up{host="188"} 1; absent(awoooi_backup_health_monitor_up{host="188"}) empty
Alert readback ALERTS{alertname="BackupHealthMonitorMissing188"} empty
Declaration limit May declare 188 backup health scrape restored; must not treat this as credential escrow complete or backup retention policy complete

If BackupHealthMonitorMissing188 is active after a future reboot but SSH/backup-status shows a fresh backup_health.prom, check first:

curl -fsS http://192.168.0.188:9100/metrics | grep -E 'awoooi_backup_health_monitor_up|node_textfile_scrape_error'

If 9100 refuses connections or the textfile collector reports an error, restore the exporter with the repo helper first:

ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh

After the restore, check Prometheus / Alertmanager again; do not silence the alert directly.
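
The metrics check above can be turned into a single verdict. The sketch below is an assumption-laden illustration: the function name `exporter_textfile_state` and its four result labels are hypothetical, and it only inspects the two metric names this section already uses.

```shell
#!/bin/sh
# Reads node-exporter metrics text on stdin and prints one verdict.
# An empty stdin (for example when curl fails on connection refused)
# ends up as MONITOR_METRIC_MISSING.
exporter_textfile_state() {
  awk '
    index($0, "awoooi_backup_health_monitor_up{") == 1 { up = $NF + 0; seen_up = 1 }
    index($0, "node_textfile_scrape_error ") == 1      { err = $NF + 0 }
    END {
      if (err != 0)      { print "TEXTFILE_COLLECTOR_ERROR" }
      else if (!seen_up) { print "MONITOR_METRIC_MISSING" }
      else if (up == 1)  { print "SCRAPE_OK" }
      else               { print "MONITOR_DOWN" }
    }'
}

# Illustrative sample matching the healthy readback recorded in this section.
printf '%s\n' \
  'awoooi_backup_health_monitor_up{host="188"} 1' \
  'node_textfile_scrape_error 0' | exporter_textfile_state
```

In practice the input would come from the `curl -fsS http://192.168.0.188:9100/metrics` readback shown earlier; any verdict other than `SCRAPE_OK` points back to the restore helper rather than to a silence.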

14.28 2026-06-25 MOMO Google Drive token and data freshness blocker

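The third 2026-06-24 change makes "MOMO service is up but data is not fresh" a cold-start hard gate. At 2026-06-25 11:44 the MOMO service, public route, DB parity, scheduler activity, and backup/offsite were all shown to be available, but the Google Drive token artifact metadata was missing and data had stopped at 2026-06-17, so cold-start was correctly BLOCKED. In the latest state at 2026-06-25 14:16, the legitimate import job 57 has cleared that data freshness blocker: MOMO service health is V10.674, daily_sales_snapshot and realtime_sales_monthly both reach 2026-06-24, MOMO_DAILY_FRESHNESS is 1|2026-06-24, and the dedicated preflight reports PASS=18 WARN=3 BLOCKED=0. This still does not mean DR complete, nor does it permit reading or storing the Google Drive token content.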

Item 2026-06-25 MOMO freshness / token baseline
SOP version v1.51
Token current state MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000; dedicated preflight also saw host token metadata aligned to scheduler UID and container-side token artifact mode 600; token content was not read
Token recovery boundary Owner-gated maintenance only; do not read, paste, or store the token value / hash / partial; do not write chat passwords or workarounds into the repo
Drive auth behavior 2026-06-25 10:04 fail-closed evidence remains historical proof that auth failure does not become a fake success. 14:16 readback shows the later legitimate import succeeded and the blocker is cleared.
Drive pending folder 當日業績匯入; pattern 即時業績_當日; latest successful source recorded by job 57
Latest valid import Job 57 completed; 即時業績_當日.xlsx; 2026-06-25T13:16:47.359958..2026-06-25T13:18:02.964985; 15383/15383/0
DB parity `daily_sales_snapshot=109061
Data freshness `MOMO_DAILY_FRESHNESS 1
Live cold-start readback PASS=89 WARN=0 BLOCKED=0, result GREEN; dedicated MOMO preflight PASS=18 WARN=3 BLOCKED=0
110 live script sync /home/wooo/scripts/full-stack-cold-start-check.sh hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8
Alert behavior Drive auth failure must send failure notification; heartbeat success remains suppressed; stale data alert should clear only with fresh DB evidence like job 57 / freshness 1
Declaration limit May declare hosts/routes/K3s/backups/MOMO service/MOMO data freshness recovered for this evidence set; must not declare DR complete, credential escrow complete, Wazuh host registry accepted, or runtime/security acceptance

MOMO post-reboot minimal readback:

scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh

ssh ollama@192.168.0.188 '
stat -c "%u:%g:%a %n" /home/ollama/momo-pro/config/google_token.json 2>/dev/null || echo "google_token.json missing"
docker top momo-scheduler -eo pid,user,uid,gid,args | head -n 3
docker logs --since 2h momo-scheduler 2>&1 | grep -E "AutoImport|Google Drive|Permission denied|could not locate runnable browser|沒有找到|發現檔案|匯入失敗通知" | tail -120
'

ssh ollama@192.168.0.188 'db_user=$(docker exec momo-pro-system printenv POSTGRES_USER); db_name=$(docker exec momo-pro-system printenv POSTGRES_DB); db_pass=$(docker exec momo-pro-system printenv POSTGRES_PASSWORD); docker exec -i -e PGPASSWORD="$db_pass" momo-db psql -h 127.0.0.1 -U "$db_user" -d "$db_name" -At' <<'SQL'
SELECT 'daily_sales_snapshot|' || count(*) || '|' || min(snapshot_date)::date || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
SELECT 'realtime_sales_monthly|' || count(*) || '|' || min("日期")::date || '|' || max("日期")::date FROM realtime_sales_monthly;
SELECT 'daily_freshness|' || (CURRENT_DATE - max(snapshot_date)::date) || '|' || max(snapshot_date)::date FROM daily_sales_snapshot;
SQL

Preferred path is the scripted preflight. It is read-only and returns 0 for clean, 1 for WARN-only, and 2 for BLOCKED. 2026-06-25 14:16 live run returned PASS=18 WARN=3 BLOCKED=0: https://mo.wooo.work/health and local health both returned 200, health version was V10.674, app / scheduler / Telegram bot were healthy, scheduler restart count was 0, token metadata aligned to scheduler UID without reading token content, current-month DB parity matched, latest daily import job 57 was clean, and DB_DAILY_FRESHNESS 1|2026-06-24 cleared the MOMO hard blocker. The remaining WARNs are stability / future-evidence notes, not blockers.
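
The exit-code contract described above (0 clean, 1 WARN-only, 2 BLOCKED) can be consumed with a small wrapper. This is a sketch: the function name `preflight_status` and the label strings are hypothetical, and only the three exit codes come from this SOP.

```shell
#!/bin/sh
# Maps the preflight exit code to a label; any other code is surfaced as unknown
# instead of being folded into one of the three documented states.
preflight_status() {
  case "$1" in
    0) echo "CLEAN" ;;
    1) echo "WARN_ONLY" ;;
    2) echo "BLOCKED" ;;
    *) echo "UNKNOWN_EXIT_$1" ;;
  esac
}

# Illustrative call with the WARN-only code that matches PASS=18 WARN=3 BLOCKED=0.
preflight_status 1
```

A typical use would be to run the preflight script and then pass `$?` to `preflight_status`; treating unexpected codes as unknown avoids reading a crashed script as a clean result.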

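If the Drive token artifact is missing or the Drive pending folder has no new source file, do not truncate manually, do not re-import old archive files to manufacture a "latest" state, and do not treat DB parity as data freshness. The next evidence that clears the blocker must be:

  1. The owner supplies a non-secret evidence ref confirming that the Google Drive token artifact or a legitimate source file can be restored.
  2. The maintenance window, rollback owner, and post-check owner are explicitly recorded.
  3. The token artifact is verified by metadata only: owner aligned to the scheduler UID, mode no wider than 600, token content never printed.
  4. A new 即時業績_當日 source file is visible, or the scheduler can successfully list pending import sources.
  5. The import job succeeds with sync_success=true, and the Drive file is moved only after success.
  6. daily_sales_snapshot and realtime_sales_monthly date bounds match, and MOMO_DAILY_FRESHNESS <= 2.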
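
The metadata-only token check in step 3 can be sketched as follows. The function name `token_meta_ok` is hypothetical, and the input is assumed to be the `uid:gid:mode` triple produced by a `stat -c "%u:%g:%a"` style readback; the token file itself is never opened.

```shell
#!/bin/sh
# token_meta_ok "<uid>:<gid>:<mode>" <scheduler_uid>
# Returns 0 only when the owner matches the scheduler UID and the octal mode
# sets no bit outside owner read/write (that is, it is no wider than 600).
token_meta_ok() {
  stat_line=$1; scheduler_uid=$2
  uid=${stat_line%%:*}
  mode=${stat_line##*:}
  # Reject anything that is not a plain octal mode string.
  case "$mode" in ''|*[!0-7]*) return 1 ;; esac
  [ "$uid" = "$scheduler_uid" ] || return 1
  # 07177 covers setuid/setgid/sticky, owner execute, and all group/other bits.
  [ $(( 0$mode & 07177 )) -eq 0 ]
}

# Illustrative call using the host-side TOKEN_STAT value recorded in this SOP.
if token_meta_ok "100000:100000:600" 100000; then
  echo "TOKEN_METADATA_OK"
else
  echo "TOKEN_METADATA_MISMATCH"
fi
```

Checking only ownership and mode keeps the verification inside the token recovery boundary: no value, hash, or partial content is ever read.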

14.29 2026-06-24 188 MinIO / Velero, DB exporter, and 110 disk pressure recovery

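The fourth 2026-06-24 change restores the real backup and monitoring chain rather than silencing alerts. VeleroBackupNotRun, PostgreSQLDown, RedisDown, and 110 disk pressure are all valid red signals; the repair order must be source-of-truth / service / exporter / Prometheus / Alertmanager / cold-start scorecard.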

Item 2026-06-24 06:35 recovery baseline
SOP version v1.30
188 DB exporter root cause Under Docker user namespaces the exporter compose cannot use network_mode: host; the Redis live port is 6380
188 DB exporter source-of-truth ops/monitoring/docker-compose.exporters.yaml switched to bridge port mapping; the PostgreSQL DSN is injected only from the host .env.exporters; no password is stored in the repo
188 DB exporter helper scripts/ops/188-db-exporters-restore.sh; live path /home/ollama/bin/188-db-exporters-restore.sh
188 DB exporter readback local metrics pg_up=1, redis_up=1; Prometheus up{job="postgres-exporter"}=1, pg_up=1, up{job="redis-exporter"}=1, redis_up=1
110 disk pressure / from 92% used to 73% used after Docker image / build cache cleanup only; no Docker volume prune
MinIO / Velero root cause 188 MinIO endpoint 192.168.0.188:9000 was down; Velero BSL S3 list failed; MinIO data path had userns write denial
MinIO restore live /home/ollama/minio/docker-compose.override.yml adds userns_mode: host for the minio service; MinIO health endpoint is OK
Velero restore 120 BackupStorageLocation/default phase is Available; one-off backup reboot-recovery-202606240456 is Completed
Backup-health textfile 110 exporter refresh reports awoooi_velero_monitor_up=1, awoooi_velero_latest_completed_backup_fresh=1, restore-test cron present, failed jobs 0
Alert readback VeleroBackupNotRun, PostgreSQLDown, RedisDown, and 110 disk-pressure alerts resolved
Live cold-start readback PASS=86 WARN=0 BLOCKED=1, result BLOCKED; only blocker remains MOMO data freshness
Declaration limit May declare backup / exporter / MinIO / Velero chain recovered; must not declare full-stack green, MOMO data current, DR complete, or runtime/security acceptance

188 PostgreSQL / Redis exporter post-reboot recovery:

ssh ollama@192.168.0.188 'bash /home/ollama/bin/188-db-exporters-restore.sh'
curl -fsS http://192.168.0.188:9187/metrics | grep '^pg_up '
curl -fsS http://192.168.0.188:9121/metrics | grep '^redis_up '

188 MinIO / 120 Velero recovery from 110:

ssh wooo@192.168.0.110 '/home/wooo/scripts/188-minio-velero-restore.sh'

If a one-off reboot-recovery backup must be created during a maintenance window and the 110 backup-health textfile refreshed, set both flags explicitly:

ssh wooo@192.168.0.110 'CREATE_VELERO_BACKUP=true REFRESH_BACKUP_HEALTH=true /home/wooo/scripts/188-minio-velero-restore.sh'

The local repo helpers can be synced to the live scripts:

scp -q scripts/ops/188-db-exporters-restore.sh ollama@192.168.0.188:/home/ollama/bin/188-db-exporters-restore.sh
scp -q scripts/ops/188-minio-velero-restore.sh wooo@192.168.0.110:/home/wooo/scripts/188-minio-velero-restore.sh

110 disk pressure cleanup rule:

Allowed in incident recovery: Docker image / build cache cleanup after checking `docker system df`.
Forbidden without explicit owner approval: `docker volume prune`, deleting database / registry / MinIO / ClickHouse / Sentry / PostgreSQL volumes, or removing unknown bind-mounted state.
Done gate: filesystem use below 85%, no active disk-pressure alerts, and no service regression in cold-start scorecard.
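
The filesystem part of the done gate can be checked mechanically. The sketch below covers only the "below 85%" condition; alert state and scorecard regression still need their own readbacks. The function name `disk_gate` and its output labels are hypothetical.

```shell
#!/bin/sh
# Reads `df -P <mount>` output on stdin and compares the Capacity column
# of the first data row against a limit (default 85, per the done gate above).
disk_gate() {
  awk -v limit="${1:-85}" '
    NR == 2 {
      gsub(/%/, "", $5)
      if (($5 + 0) < (limit + 0)) { print "DISK_OK " $5 } else { print "DISK_PRESSURE " $5 }
    }'
}

# Read-only check of the root filesystem on the current host.
df -P / | disk_gate 85
```

Using `df -P` keeps each filesystem on one line, so the fifth column is reliably the usage percentage.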

14.30 2026-06-24 notification noise closure after reboot recovery

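The fifth 2026-06-24 change brings "service has recovered, but an old monitoring path or the success heartbeat keeps flooding Telegram" into the reboot SOP. This is not silencing: failures, warnings, data freshness, and backup / exporter / escrow red signals must still alert. The goal is to stop the same known failure from being pushed every 5 or 30 minutes, and to keep normal success heartbeats from filling the war room.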

Item 2026-06-24 notification baseline
SOP version v1.31
AWOOOI healthy heartbeat Production a84a5a0b; when healthy with no warnings, only Redis/log is updated and no Telegram is sent; warning changes are sent; a warning-to-healthy transition sends a single recovery
MOMO false-noise root cause The old /home/wooo/scripts/docker_health_monitor.sh on 110 probed http://192.168.0.188/health, got HTTP 502 continuously during the reboot, and produced a MOMO Pro alert every 5 minutes
MOMO monitor source-of-truth Added scripts/ops/momo-pro-health-monitor.sh; primary truth is https://mo.wooo.work/health; 188 local 127.0.0.1:5003/health and container state serve only as secondary evidence
MOMO live readback /home/wooo/scripts/docker_health_monitor.sh hash d7a6bc75549efa10176c42e6f9082c90b9856dbcbb335aba4a4fa4abb754eaef; manual run returned OK: public health 200; no alert
AWOOOI ops notify wrapper /home/wooo/awoooi-ops/notify-awoooi-ops.sh hash 12bf9ae124c56bb7f31be15ebeb501671b0686d695492bc3fa1d9abb5b683b67; repo MOMO monitor uses this wrapper instead of adding a new Telegram Bot API direct send
Docker monitor fallback scripts/ops/docker-health-monitor.sh keeps ACTION_COOLDOWN_SECONDS=300 for repair cadence but adds NOTIFY_COOLDOWN_SECONDS=1800 for direct Telegram fallback when AWOOOI API cannot receive the event
Docker monitor live readback /home/wooo/awoooi-ops/docker-health-monitor.sh hash 41d64f29048868c8e4c089132d299c8ee0e2b50ab2c513158d6d45cf92ea38e3 and exposes TELEGRAM_COOLDOWN lines for repeated fallback suppression
Bitan public-content check Live /home/wooo/apps/bitan-pharmacy-release/scripts/run-public-content-cleanliness-check.sh now writes public-content-cleanliness.notify.state, suppresses same failure fingerprint for 21600s, and sends one recovery notice after a failed state becomes pass
Bitan live readback Script hash 294ec7f75448c86688b8afc408c785efe4cf173d468ad0d82228ba638d3de2dc; manual no-notify run returned PASS for DB, public APIs, products/news pages, and content contract
Declaration limit May declare repeated healthy / same-failure notification noise is controlled for these paths; must not declare all product alerts migrated to the unified notification gateway or any real failure alert disabled

Post-reboot notification gate:

ssh wooo@192.168.0.110 '/home/wooo/scripts/docker_health_monitor.sh'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/logs/docker_health.log'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/awoooi-ops/monitor.log'
ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/apps/bitan-pharmacy-release/logs/public-content-cleanliness-check.cron.log'

Done gate:

MOMO monitor: public health 200 -> no Telegram.
AWOOOI heartbeat: healthy + no warnings -> suppressed; warning/recovery still send.
Generic docker-health monitor: API 200/202 path is primary; direct Telegram fallback is fingerprint-cooled.
Bitan public content: pass -> no failure Telegram; repeated same failure -> cooled; recovery after prior failure -> one notice.
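
The "same failure is cooled, recovery is sent once" behaviour in the done gate can be sketched with a one-line state file. This is an illustration under stated assumptions, not the code in any of the live monitors: the function name `notify_decision`, the state-file layout, and the decision labels are hypothetical, and fingerprints are assumed to contain no spaces. The 21600-second default mirrors the Bitan cooldown recorded above.

```shell
#!/bin/sh
# notify_decision <state_file> <pass|fail> <fingerprint> <now_epoch> [cooldown_seconds]
# State file holds one line: "<last_status> <last_fingerprint> <last_sent_epoch>".
notify_decision() {
  state_file=$1; status=$2; fp=$3; now=$4; cooldown=${5:-21600}
  last_status=pass; last_fp=-; last_sent=0
  if [ -s "$state_file" ]; then
    read -r last_status last_fp last_sent < "$state_file"
  fi
  if [ "$status" = "pass" ]; then
    # A pass only notifies when it follows a recorded failure.
    if [ "$last_status" = "fail" ]; then decision=SEND_RECOVERY; else decision=SUPPRESS; fi
    echo "pass - $now" > "$state_file"
  elif [ "$last_status" = "fail" ] && [ "$last_fp" = "$fp" ] \
       && [ $(( now - last_sent )) -lt "$cooldown" ]; then
    decision=SUPPRESS   # same failure fingerprint inside the cooldown window
  else
    decision=SEND_FAILURE
    echo "fail $fp $now" > "$state_file"
  fi
  echo "$decision"
}

# Illustrative run: the second identical failure 300 seconds later is suppressed.
state=$(mktemp)
notify_decision "$state" fail db-down 1000
notify_decision "$state" fail db-down 1300
rm -f "$state"
```

A new fingerprint or an expired cooldown still sends, so a changed failure is never hidden behind the suppression of an older one.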

14.31 2026-06-24 MOMO source-file absence decision gate

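The 2026-06-24 11:35 recovery decision splits MOMO into two questions: service availability and data freshness. Service availability has recovered; data freshness is still blocked. The purpose of this gate is to keep an operator from mistaking "no new source file" for "recovery complete" when the external site returns 200, the container is healthy, and DB parity is normal.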

Item 11:35 source-file absence baseline
SOP version v1.32
MOMO public health https://mo.wooo.work/health returns healthy; version V10.639
DB rows daily_sales_snapshot=104614; realtime_sales_monthly=786621
DB bounds daily 2025-07-01..2026-06-17; monthly 2024-01-01..2026-06-17
Current-month parity `10936
Latest successful import daily_sales job 56; created 2026-06-18 11:41; source 即時業績_當日.xlsx; sync_success=true
Pending source folder 當日業績匯入 count 0 for pattern 即時業績_當日
Archive latest 2026-06-18T01:30:39Z; already imported by job 56
Scheduler Drive readback container-side Drive listing works and currently returns count 0; no current Permission denied evidence in latest readback
Stale alert posture data_stale_alert has 24h dedupe; this is a true warning, not heartbeat spam
Blocking metric `MOMO_DAILY_FRESHNESS 7
Repo-side v1.42 scorecard evidence MOMO_SOURCE_EMPTY_EVIDENCE_LINES 21、`MOMO_IMPORT_CONFIG 當日業績匯入

2026-06-24 23:04 repo-side cold-start v1.42 dry-run returns PASS=88 WARN=0 BLOCKED=1 and classifies the only blocker as:

188 momo source file absent while daily sales data stale

This is repo-side source-of-truth enhancement only. 2026-06-24 23:15 read-only deploy parity check proves the live 110 script is still older: repo hash f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05, live hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8. Do not claim the live 110 deployed script has this v1.42 behavior until /home/wooo/scripts/full-stack-cold-start-check.sh is synced under an approved change and its hash/readback is recorded through §13.3.1.
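
The deploy parity check described above amounts to comparing two digests. The sketch below assumes the recorded hashes are SHA-256 (they are 64 hex characters, but the SOP does not name the algorithm); the function names `sha256_of` and `deploy_parity` and the result labels are hypothetical.

```shell
#!/bin/sh
# sha256_of <file>: prints the hex digest, using whichever common tool is present.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# deploy_parity <repo_file> <live_hash>: read-only comparison, no sync is performed.
deploy_parity() {
  repo_hash=$(sha256_of "$1")
  if [ -z "$repo_hash" ]; then
    echo "HASH_ERROR"
    return 2
  fi
  if [ "$repo_hash" = "$2" ]; then
    echo "PARITY_OK $repo_hash"
  else
    echo "PARITY_DRIFT repo=$repo_hash live=$2"
  fi
}
```

A call would pass the repo copy of the script and the hash read back from the live host; a `PARITY_DRIFT` result means repo-side behaviour must not be attributed to the deployed script until an approved sync and readback are recorded.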

GO / NO-GO:

GO: declare MOMO web/API/container/database service available.
GO: declare current-month table parity good.
NO-GO: declare MOMO business data current.
NO-GO: declare FULL_STACK_GREEN while MOMO_DAILY_FRESHNESS > 2.
NO-GO: re-import old archived files to fake freshness.
NO-GO: import product exports or manually constructed spreadsheets as daily sales source.
NO-GO: truncate tables, restore whole DB, or move Drive files when sync_success is false.

The only qualifying evidence for clearing the blocker:

1. New legitimate 即時業績_當日 source file appears in the expected Drive intake path, or owner supplies a verifiable source-evidence reference.
2. Import job completes with success=true and sync_success=true.
3. Drive file movement / archive evidence shows the source was handled once.
4. daily_sales_snapshot and realtime_sales_monthly counts and date bounds match for the imported range.
5. MOMO_DAILY_FRESHNESS <= 2.
6. backup / offsite / cold-start scorecard rerun after import remains green except known DR escrow blocker.

If the source file is absent, the correct report is:

MOMO service is recovered, data pipeline is waiting for upstream source file.
No safe import candidate exists.
Full-stack remains blocked by data freshness, not by service outage.
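
The freshness threshold used throughout this gate (`MOMO_DAILY_FRESHNESS <= 2`) can be applied to the `days|latest-date` value format that the preflight prints. This is a sketch: the function name `freshness_gate` and the output labels are hypothetical, and the stale input in the second call is an illustrative value rather than a recorded readback.

```shell
#!/bin/sh
# freshness_gate "<age_days>|<latest-date>" [max_days]
freshness_gate() {
  days=${1%%|*}; latest=${1#*|}; max=${2:-2}
  # Refuse to decide on anything that is not a plain non-negative integer.
  case "$days" in ''|*[!0-9]*) echo "PARSE_ERROR"; return 0 ;; esac
  if [ "$days" -le "$max" ]; then
    echo "DATA_FRESH latest=$latest"
  else
    echo "BLOCKED_DATA_STALE latest=$latest age_days=$days"
  fi
}

freshness_gate '1|2026-06-24'
freshness_gate '7|2026-06-17'   # illustrative stale input
```

The gate looks only at data age; service health and DB parity are deliberately not inputs, which matches the GO / NO-GO split above.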

14.32 2026-06-24 188 nginx-exporter / CD monitoring coverage gate

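The sixth 2026-06-24 change brings CD post-deploy monitoring coverage failures into the reboot SOP. The runtime deploy of 2ec7f6f4 wrote back 622bc372 and production API health is healthy, but the post-deploy checks of CD #3294 were left as Failure because the nginx-exporter target was down. The root cause is that the 188 nginx-exporter container was not running; it is not a failure of the Nginx public gateway, the API/Web rollout, or the MOMO service.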

Item 20:10 monitoring coverage baseline
SOP version v1.34
Affected CD run Gitea CD #3294 historical result remains Failure; deploy marker 622bc372 has been written
Root cause Prometheus job nginx-exporter down; target 192.168.0.188:9113 connection refused
Non-root cause Nginx stub_status is normal; no Nginx reload, no API/Web/MOMO restart, and no firewall change is needed
Live restore source /home/ollama/nginx-exporter.yml
Repo helper scripts/ops/188-nginx-exporter-restore.sh
Check mode --check only reads stub_status, compose config, container state, and metrics
Apply mode --apply runs docker compose -f /home/ollama/nginx-exporter.yml up -d after stub_status and compose config pass
Exporter metrics nginx_up 1, nginx_connections_active, nginx_http_requests_total
Monitoring coverage total jobs=14, all UP=14, real problems=0, expected coverage=100.0%
Declaration limit May declare exporter / monitoring coverage recovered; must not relabel the historical CD run as success, and must not declare full-stack green / DR complete

Post-reboot / post-CD 188 nginx-exporter check:

bash scripts/ops/188-nginx-exporter-restore.sh --check
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0

If --check fails only at the metrics stage, while stub_status and compose config both pass, and the maintenance window allows restoring a stateless exporter:

bash scripts/ops/188-nginx-exporter-restore.sh --apply
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0

Do not handle this symptom in any of the following ways:

NO-GO: reload Nginx before stub_status / exporter metrics prove Nginx config is the issue.
NO-GO: restart product containers because monitoring coverage alone is red.
NO-GO: silence monitoring coverage or mark CD green without target recovery evidence.
NO-GO: prune Docker volumes or delete exporter state not owned by this SOP.
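
The check-then-apply rule for the exporter can be written down as a decision helper. It is a sketch only: the function name `exporter_next_step`, the yes/no inputs, and the result labels are hypothetical, and the operator is assumed to have read the three stage results from the `--check` output by hand.

```shell
#!/bin/sh
# exporter_next_step <stub_status_ok> <compose_config_ok> <metrics_ok> <window_open>
# Every argument is "yes" or "no".
exporter_next_step() {
  if [ "$1" != yes ] || [ "$2" != yes ]; then
    # Not an exporter-only failure: investigate first, do not apply.
    echo "STOP_INVESTIGATE"
  elif [ "$3" = yes ]; then
    echo "NO_ACTION"
  elif [ "$4" = yes ]; then
    echo "RUN_APPLY"
  else
    echo "WAIT_FOR_WINDOW"
  fi
}

# Illustrative call: only the metrics stage fails and the window is open.
exporter_next_step yes yes no yes
```

`RUN_APPLY` is reachable only when stub_status and compose config both pass, which keeps the restore limited to the stateless exporter and away from Nginx or product containers.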

14.33 2026-06-24 MOMO V10.646 / source-file absence / dual-workstation baseline

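The seventh 2026-06-24 change splits MOMO "program version is current" and "business data is not fresh" into two independent gates, and pins the MOMO Codex workspaces on the Mac Mini / MacBook Pro to the latest Gitea main baseline. This prevents two post-reboot misjudgments: declaring data updated because /health shows the latest version, or assuming the service is still on an old version because data is stale.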

Item 20:42 MOMO / workstation baseline
SOP version v1.35
MOMO public health https://mo.wooo.work/health returns healthy, version V10.646
Gitea main truth wooo/ewoooc main=7cfca9375445ea03d6f5d10512d0276a20914d25, SYSTEM_VERSION = "V10.646"
Mac Mini workspace /Users/ogt/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 7cfca9375445ea03d6f5d10512d0276a20914d25, dirty 0
MacBook workspace /Users/ooo/codex-workspaces/momo-pro-dev, branch codex/momo-current-main-dev-base-20260624, commit 7cfca9375445ea03d6f5d10512d0276a20914d25, dirty 0
Remote baseline branch wooo/ewoooc codex/momo-current-main-dev-base-20260624 points to 7cfca9375445ea03d6f5d10512d0276a20914d25
DB parity current-month daily_sales_snapshot and realtime_sales_monthly match at 10936 rows, range 2026-06-01..2026-06-17
Data freshness `MOMO_DAILY_FRESHNESS 7
Source candidates inspected Mac Mini current daily file contains only 2025-07-01..2025-07-02; iCloud full-month file contains only 2025-06-01..2025-06-30; MacBook candidates are header-only or the same 2025-07-01..2025-07-02 file
Declaration limit May declare MOMO release current and Codex dual-workstation baseline ready; must not declare MOMO data current or full-stack green

The MOMO post-reboot decision must answer four questions together:

MOMO_RELEASE_CURRENT = yes/no
MOMO_DB_PARITY = yes/no
MOMO_DATA_FRESH = yes/no
MOMO_SOURCE_AVAILABLE = yes/no

The only safe path to clear the MOMO data freshness blocker:

1. A new legitimate 即時業績_當日 source file appears in the expected Drive intake, or the owner supplies a verifiable source-evidence reference.
2. The import job succeeds; it must not be marked completed when the realtime_sales_monthly sync fails.
3. Source file movement / archive evidence proves the file was processed only once.
4. daily_sales_snapshot and realtime_sales_monthly row count / date bounds match.
5. MOMO_DAILY_FRESHNESS <= 2.

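Do not treat any of the following as clearing the blocker:

NO-GO: re-importing an old archive, an old iCloud monthly file, a header-only file, or a test file.
NO-GO: treating V10.646 health as proof that data dates have reached today.
NO-GO: treating current-month parity as data freshness.
NO-GO: truncating or restoring the whole database to manufacture freshness.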

14.34 2026-06-24 MOMO import sync failure boundary hardening

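The eighth change, at 2026-06-24 21:57, brings the "partial success" risk of MOMO auto-import into the reboot SOP. The formal release readback was added at 2026-06-24 22:17: the same fix has been fast-forwarded to MOMO main, Gitea Actions cd.yaml #904 succeeded, and the 188 live source marker is confirmed. A successful daily_sales_snapshot write does not mean the whole import succeeded; when the realtime_sales_monthly sync fails, the job must fail, the source file must be kept, and the Google Drive file must not be moved to archive.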

Item 22:17 MOMO import-boundary production baseline
SOP version v1.40
Production health https://mo.wooo.work/health healthy, version V10.653
Live DB read-only daily_sales_snapshot=104614 rows, 2025/07/01..2026/06/17; realtime_sales_monthly=786621 rows, 2024/01/01..2026/06/17
Scheduler read-only In the last 12 hours 當日業績匯入 / 即時業績_當日 both showed 0 Excel files; the scheduler sends no success notification
Latest successful import job 56 completed, 10936 rows, 2026-06-18 11:41..11:42
Code / deploy MOMO main and codex/momo-current-main-dev-base-20260624 commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73; Gitea Actions cd.yaml #904 Success
Live source marker 188 /home/ollama/momo-pro/services/import_service.py contains _table_columns, 業績分析儀表板同步失敗, and 保留來源檔案等待重試,不移動 Google Drive 檔案
Regression pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q => 10 passed
Production deploy state Production patched for code boundary; data freshness still blocked until a legitimate newer source file imports successfully

MOMO import success 判定:

GO: process_daily_sales_import returns True only if daily_sales_snapshot write and realtime_sales_monthly sync / verification both pass.
GO: auto_import_from_drive may move the Drive source file only after process_daily_sales_import returns True.
NO-GO: mark import_jobs.status=completed when sync_success=false.
NO-GO: move or archive the Drive source file when realtime_sales_monthly sync failed.
NO-GO: send a generic success notification for file_count > 0 before verify_import_data_sync passes.

If MOMO data freshness is blocked after a reboot, first separate three layers and do not mix them:

1. Service availability: /health, container, DB connection.
2. Source availability: Drive pending folder has a legitimate new 即時業績_當日 source file.
3. Data correctness: import job completed with sync_success=true, and daily_sales_snapshot / realtime_sales_monthly match the imported date range.
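
The import success rule stated earlier in this section can be condensed into one decision. This is a sketch rather than the code in import_service.py: the function name `import_outcome`, its yes/no arguments, and the output fields are hypothetical.

```shell
#!/bin/sh
# import_outcome <daily_write_ok> <monthly_sync_ok> <verify_ok>   (each yes|no)
# Prints the job status, whether the Drive source may be moved, and which notice to send.
import_outcome() {
  if [ "$1" = yes ] && [ "$2" = yes ] && [ "$3" = yes ]; then
    echo "status=completed move_source=yes notify=success"
  else
    # Any failed step fails the whole job and keeps the source file for retry.
    echo "status=failed move_source=no notify=failure"
  fi
}

# Illustrative call: the daily write succeeded but the monthly sync did not.
import_outcome yes no no
```

Because the three conditions are combined before anything is moved or announced, a partial success cannot archive the source file or send a success notice.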

14.35 2026-06-25 MOMO preflight and 110 CPU orphan Chrome triage

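The ninth change, at 2026-06-25 11:01, turns two common misjudgments into a rerunnable SOP:

  1. MOMO service health green does not mean data is fresh.
  2. High load on 110 does not mean Docker may be restarted or CI cancelled.

MOMO dedicated preflight: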

scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh

This script performs only read-only SSH / Docker metadata / logs / DB queries; it does not read token content, import, move Drive files, or restart anything. 14:16 live result:

MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=18 WARN=3 BLOCKED=0 HOST=ollama@192.168.0.188 FRESHNESS_MAX_DAYS=2
MOMO_PUBLIC_HEALTH_CODE 200
MOMO_HEALTH_CODE 200
MOMO_HEALTH_VERSION V10.674
MOMO_APP_HEALTH healthy
SCHEDULER_RUNNING true
SCHEDULER_HEALTH healthy
SCHEDULER_RESTART_COUNT 0
TELEGRAM_BOT_HEALTH healthy
MOMO_CONTAINER_REPLACE_EVENTS_45M 11
TOKEN_STAT 100000:100000:600
CONTAINER_TOKEN_STAT 0:0:600
LOCAL_EXACT_DAILY_SOURCE_COUNT 0
LOCAL_EXACT_DAILY_SOURCE_LATEST none
DB_DAILY 109061|2025-07-01|2026-06-24
DB_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24
DB_DAILY_FRESHNESS 1|2026-06-24
DB_LATEST_DAILY_IMPORT_JOB 57|completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0

110 CPU triage:

| Evidence | Decision |
| --- | --- |
| ps shows stockplatform-review-bulk-ux Chrome groups with root process PPID 1, no parent node smoke, and sustained high CPU | Treat as orphan browser smoke. Run dry-run if available, then only with owner approval use targeted SIGTERM by process group. |
| Active Gitea Actions container is consuming CPU, e.g. GITEA-ACTIONS-TASK-*, next build, uv pip install, docker-buildx | Treat as legitimate CI/CD load. Do not kill unless there is explicit release owner approval to cancel the run. |
| vmstat shows high iowait or active swap in/out | Treat as storage / memory pressure, not browser runaway. Do not kill random processes; capture disk / memory evidence first. |
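
The first triage row can be supported by a dry-run listing that only prints candidate process groups and sends no signal. This is a sketch: the function name `orphan_group_candidates` and the 50% CPU threshold are assumptions made for the example, since the triage rule above says "sustained high CPU" without giving a number.

```shell
#!/bin/sh
# Reads `ps -eo pid,ppid,pgid,pcpu,args` output on stdin.
# Prints the PGID of every process whose parent is PID 1, whose command line
# contains the pattern, and whose CPU share meets the threshold. Dry-run only.
orphan_group_candidates() {
  awk -v pat="${1:-stockplatform-review-bulk-ux}" -v min_cpu="${2:-50}" '
    NR > 1 && $2 == 1 && ($4 + 0) >= (min_cpu + 0) && index($0, pat) > 0 { print $3 }
  ' | sort -un
}

# Read-only listing on the current host; an empty result means no candidate was found.
ps -eo pid,ppid,pgid,pcpu,args | orphan_group_candidates
```

A single `ps` sample cannot show that the CPU use is sustained, so the listing should be repeated over time and combined with the CI and vmstat evidence before any owner-approved SIGTERM.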

2026-06-25 10:58 user-approved action:

Targeted command type: process SIGTERM only.
Targeted process groups: 438005, 471295, 640155, 670628.
Scope: orphan `stockplatform-review-bulk-ux` Chrome groups on 110.
Post-check: `OLD_GROUPS_REMAINING` empty.
Not performed: Docker restart, systemd restart, Nginx reload, firewall/iptables change, K8s action, CI cancellation, Wazuh/SOC change, secret read.
Remaining load: active Gitea Actions / CI build work; observe queue and timeout instead of killing.

14.22 Post-reboot timeline verification

After every reboot, advance along the timeline; do not wait until the end to make a single decision.

| Time point | Goal | Required evidence | May declare |
| --- | --- | --- | --- |
| T+0 | power / VM / console started | console / hypervisor / UPS / operator note | maintenance started |
| T+5m | LAN / SSH restored | ping, ARP, SSH port, who -b | HOST_BOOTED |
| T+15m | host base services restored | systemctl is-system-running, failed units, Docker / PostgreSQL / Redis / K3s role checks | HOST_READY |
| T+30m | core services restored | 188 DB, 110 Harbor/Gitea/Prom/AM, K3s nodes, AWOOOI API/Web, public routes | SERVICE_READY for scoped hosts |
| T+45m | schedules and data consistency | backup status, offsite verifier, momo DB parity, CronJobs, alert visibility | service recovery confidence |
| T+60m | release high-load work and automation | cold-start scorecard, load/core, runner guardrails, AI observe-only gate | release runner/CD only if gates allow |

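If any time point stalls, record which gate it stalled at and do not skip to the next layer. If two consecutive reboots stall at the same gate, write it back to §16 Known Drift or the workplan.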
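
The "do not skip to the next layer" rule can be sketched as a filter that stops at the first gate that has not passed. The function name `first_blocked_gate` and the `<gate> <pass|fail>` input format are assumptions made for the example.

```shell
#!/bin/sh
# Reads "<gate> <pass|fail>" lines in timeline order and reports the first gate
# that did not pass; later gates are ignored even if they look green.
first_blocked_gate() {
  awk '
    $2 != "pass" { print "STUCK_AT " $1; found = 1; exit }
    END { if (!found) print "ALL_GATES_PASSED" }'
}

# Illustrative run: T+30m looks fine, but the timeline is still stuck at T+15m.
printf '%s\n' 'T+0 pass' 'T+5m pass' 'T+15m fail' 'T+30m pass' | first_blocked_gate
```

Saving the reported gate for each reboot makes the "same gate twice in a row" condition easy to spot when deciding whether to write back to Known Drift.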


15. Done Criteria

All must be true:

  • Four hosts reachable by SSH.
  • 188 PostgreSQL and Redis healthy.
  • 110 Harbor, Gitea, Prometheus, Alertmanager healthy.
  • 120/121 K3s nodes Ready.
  • VIP 192.168.0.125 present.
  • AWOOOI API and Web reachable through NodePort/VIP.
  • Alertmanager E2E webhook succeeds.
  • cron/CronJob schedules are active, unsuspended, and verified.
  • MOMO release version matches Gitea source-of-truth for the intended deployment branch.
  • momo daily_sales_snapshot and realtime_sales_monthly row counts match within the latest imported date range.
  • momo business data freshness is within the declared SLO, and the latest import source evidence is legitimate; DB parity alone is not enough.
  • Sentry and SigNoz are either healthy or explicitly in controlled backlog recovery.
  • High-load batch services are capped or delayed.
  • Runners are guarded and released last.
  • AI auto-remediation is not in full execution mode until all gates are green.
  • 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
  • 110 runaway process textfile monitor is fresh, and Prometheus has HostOrphanBrowserSmokeHighCpu plus CI load classification rules loaded.
  • 110 global /home/wooo/.ssh/known_hosts still contains verified 120 / 188 entries after any CD run; deploy jobs use /home/wooo/.ssh/deploy_known_hosts only.

15.1 可宣稱狀態

| Declarable statement | Required conditions |
| --- | --- |
| 110 host recovered | 110 HOST_READY; failed units are 0 or all explainable; core ports and cron / backup status checked |
| public core services recovered | public routes/TLS 2xx/3xx; AWOOOI API health and Harbor/Gitea/Stock/Sentry/SigNoz/Langfuse/Bitan smoke OK |
| backup/offsite current | backup-status --no-notify shows no stale entries; offsite verifier VERIFY_OK=1; any failed component has a clear owner |
| service recovery with known blocker | cold-start BLOCKED with only known blockers remaining (for example 120); alerts kept visible |
| full-stack green | all of §15 holds; cold-start WARN=0 BLOCKED=0 |
| DR complete | full-stack green and credential escrow missing count is 0 |
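
The scorecard side of these declarations can be sketched as an upper bound on what may be claimed. The function name `declarable_state` and the label `DR_COMPLETE_CANDIDATE` are hypothetical; the other labels reuse result words that appear in the baselines of this SOP. The full §15 done criteria still have to hold before anything is actually declared.

```shell
#!/bin/sh
# declarable_state <warn_count> <blocked_count> <escrow_missing_count>
# Returns the strongest state the counters alone do not rule out.
declarable_state() {
  if [ "$2" -gt 0 ]; then
    echo "BLOCKED"
  elif [ "$1" -gt 0 ]; then
    echo "DEGRADED"
  elif [ "$3" -gt 0 ]; then
    # Service scorecard is green, but credential escrow evidence is still missing.
    echo "FULL_STACK_GREEN_FOR_SERVICE"
  else
    echo "DR_COMPLETE_CANDIDATE"
  fi
}

# Illustrative call: WARN=0 BLOCKED=0 with five escrow items still missing.
declarable_state 0 0 5
```

Checking the blocked count first means a single blocker always wins over a clean warning count, matching the rule that full-stack green needs both counters at zero.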

16. Known Drift To Fix After Recovery

Clean these items up after the incident; do not make sweeping changes in passing during P0 recovery.

  • SERVICE-ENDPOINTS.md still has old Prometheus/Alertmanager locations.
  • Audit older docs for direct node webhook targets; current main path should be VIP 192.168.0.125:32334.
  • OpenClaw 8088 vs 8089 must be live-confirmed and normalized.
  • 188 compose paths drift between /home/ollama/* and Ansible /opt/*.
  • 110 runner docs still mention Docker runner in places; live startup prefers host gitea-act-runner-host.service.
  • scripts/setup-runner-watchdog.sh conflicts with the 2026-05-05 runner watchdog disablement guardrail.
  • grist.wooo.work / registry.wooo.work public HTTP/HTTPS currently route to aiops.wooo.work; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.
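  • stockplatform-shared-ui-monitor.timer / service source-of-truth still needs to be cleaned up or rebuilt; on 2026-06-12 only the stale timer was disabled to clear the host degraded state.
  • 111 local Ollama fallback is currently unreachable; the production provider is served by GCP-A / GCP-B, but 111 recovery should be tracked separately as AI provider resilience work.
  • Content added in SOP v1.5 was reinforced in Traditional Chinese; older sections still contain English passages, and later runbook hygiene should translate in batches rather than mixing large-scale reformatting into a P0 incident.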