# AWOOOI Full-Stack Cold Start SOP

> Version: v1.1
> Last updated: 2026-05-06 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

---

## 0. When To Use This

Use this SOP when any of these happen:

- 110/120/121/188 reboot unexpectedly.
- All services are abnormal after a power/network event.
- K3s is stuck `activating`.
- Host load remains high during startup and service health is mixed.
- Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.

The rule is simple: **recover the dependency chain, not the loudest symptom.**

---

## 1. Golden Startup Order

```text
0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed
```

Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.
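
The "dependency chain first" rule can be sketched as a small shell helper. This is a minimal illustration, not part of the SOP's scripts: the function name `first_unreleased_phase` and the `phase=STATE` argument format are assumptions made for the example. It walks the phases in golden startup order and reports the first one that is not `GREEN`, which is the one to repair next.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the SOP's scripts): given "phase=STATE"
# pairs listed in golden startup order, print the first phase that is not
# GREEN. Later phases are ignored on purpose, however loud they are.
first_unreleased_phase() {
  local pair
  for pair in "$@"; do
    # ${pair#*=} is the state, ${pair%%=*} is the phase name.
    if [ "${pair#*=}" != "GREEN" ]; then
      printf '%s\n' "${pair%%=*}"
      return 1
    fi
  done
  printf 'ALL_GREEN\n'
}
```

For example, `first_unreleased_phase network=GREEN data188=BLOCKED k3s=BLOCKED` prints `data188`, even though K3s is also failing.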
### 1.1 Dependency Graph

```mermaid
flowchart TD
    network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
    network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
    data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
    obs110 --> k3s
    k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
    workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
    workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
    public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
    schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
    highload --> ai["AI auto-remediation: limited execution"]
```

This is also captured in the machine-readable baseline:

```text
ops/reboot-recovery/full-stack-cold-start-baseline.yml
```

The YAML baseline is the source of truth for:

- hosts, roles, and SSH users
- phase ordering
- service startup dependencies
- endpoint success codes
- schedule freshness thresholds
- stateful-service protection boundaries
- AI automation release gates

### 1.2 Phase Gate Logic

Each phase has the same decision rule:

| Result | Meaning | Action |
|--------|---------|--------|
| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| `WARN` | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| `GREEN` | All checks in scope passed. | Release the next phase only. |

The cold-start flow is intentionally conservative:

```text
P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution
```

The final release condition is not "containers are running".
It is:

```text
PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
```

---

## 2. Automation Freeze

Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.

| Item | Cold-start policy | Reason |
|------|-------------------|--------|
| Gitea/GitHub runners | Last | Build jobs can saturate 110 CPU/RAM. |
| momo-scheduler / crawlers | Last | Chrome and batch work can saturate 188. |
| Sentry/Snuba consumers | Controlled | Kafka backlog and ClickHouse merge can create temporary high load. |
| Alertmanager outbound notification | Gate | Avoid alert storms before API webhook and Telegram are verified. |
| AI auto-repair | Observe-only | Metrics, Redis, KM, and playbooks may be incomplete. |
| Stateful DB restart | Human approval | PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets. |

---

## 3. P0 Evidence And Network

Run from any machine on the same LAN:

```bash
for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done

arp -an | grep -E '192\.168\.0\.(110|120|121|188)'

# nc -G (connect timeout) is the macOS/BSD flag; if the local nc lacks -G,
# -w 3 is the usual equivalent.
for h in 110 120 121 188; do
  nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done
```

Then capture reboot evidence:

```bash
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
```

If any host has ARP `incomplete` or SSH port down, stop here and fix physical/network first.

---

## 4. P0 188 Data Layer

188 is the first real service dependency because K3s datastore and AWOOOI DB depend on PostgreSQL.

### 4.1 Startup order

1. `containerd`
2. `docker`
3. `postgresql@14-main`
4. `k3s_datastore.kine` maintenance
5. `redis-server` on `6380`
6. `ollama` or current AI proxy dependencies
7. `nginx`
8. Docker networks
9. MinIO / OpenClaw / SignOz
10. momo / litellm / batch services after load is stable

### 4.2 Read-only check

```bash
ssh ollama@192.168.0.188 '
  hostname; date; uptime; free -h
  systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
  pg_isready -h localhost -p 5432 || true
  redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
  docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'
```

### 4.3 PostgreSQL WAL checkpoint damage

Signature:

```text
PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...
```

This blocks:

- `188:5432`
- K3s startup on 120/121
- AWOOOI API DB access
- Alertmanager webhook if API cannot start

Human-approved recovery command on 188:

```bash
sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
```

Do not run `DROP`, reinitialize the cluster, delete `/var/lib/postgresql`, or restore an old backup unless the commander explicitly approves it.

---

## 5. P0/P1 110 Registry And Observability

110 must recover Harbor/Gitea/Monitoring early, but runners last.

### 5.1 Startup order

1. `docker`
2. Remove `Exited (128)` / `Exited (137)` orphan containers
3. Harbor `harbor-log`
4. Harbor full stack
5. Gitea
6. Prometheus / Alertmanager / Grafana / exporters
7. Langfuse
8. SignOz
9. Sentry DB layer
10. Sentry web/worker/consumer layer
11. Gitea host runner and actions runners

### 5.2 Checks

```bash
ssh wooo@192.168.0.110 '
  hostname; date; uptime; free -h
  systemctl is-active docker || true
  curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
  curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
  curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
  curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
  curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
  docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
```

Harbor healthy means `/v2/` returns `200` or `401`. Do not treat `401` as failure.

### 5.3 Runner gate

Runner may start only after all are true:

- `188 PostgreSQL` ready
- `110 Harbor` ready
- `110 Gitea` ready
- `120/121 K3s` nodes ready
- AWOOOI API health passes
- 110 load/core is below `1.0` for at least 15 minutes
- runner systemd guardrails are active: `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0`

Check:

```bash
ssh wooo@192.168.0.110 '
  for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
    echo "=== $u ==="
    systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
  done
'
```

If `WatchdogUSec` is not `0`, apply the guardrail script manually with sudo:

```bash
sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
```

---

## 6. P1 120/121 K3s

K3s must wait for 188 PostgreSQL and 110 Harbor.

### 6.1 Startup order

1. 120 `k3s.service`
2. 121 `k3s-agent.service` or its live role
3. CNI / kube-proxy
4. Nodes Ready
5. Core pods
6. `awoooi-prod` pods
7. keepalived VIP `192.168.0.125`
8. NodePorts `32334` and `32335`

### 6.2 Checks

```bash
ssh wooo@192.168.0.120 '
  hostname; uptime
  pg_isready -h 192.168.0.188 -p 5432 || true
  systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
  kubectl get nodes -o wide 2>/dev/null || true
  kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
  kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
  ip addr show | grep 192.168.0.125 || true
'

ssh wooo@192.168.0.121 '
  hostname; uptime
  systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
  ip addr show | grep 192.168.0.125 || true
'
```

If K3s is `activating` while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.

---

## 7. P2 AWOOOI Workloads

Run after K3s nodes are Ready:

```bash
ssh wooo@192.168.0.120 '
  kubectl get deploy -n awoooi-prod
  kubectl get pods -n awoooi-prod -o wide
  kubectl get svc -n awoooi-prod
  kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'

curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/
```

If pods are `ImagePullBackOff`, go back to 110 Harbor. If API health fails because DB/Redis is down, go back to 188.

---

## 8. P2 Alert Chain

Current main path:

```text
Prometheus/Alertmanager on 110
  -> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
  -> AWOOOI API
  -> TelegramGateway
  -> Telegram
```

Alertmanager health alone is not enough.
Run E2E:

```bash
curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'
```

Expected: API returns success and Telegram receives the test alert.

---

## 9. P2 Schedules And Delayed Work

Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.

| Host / Layer | Required check | Success baseline |
|--------------|----------------|------------------|
| 188 cron | `systemctl is-active cron` and `crontab -l` | cron active; backup, restart exporter, stats exporter entries present |
| 188 backup-from-110 | `backup_110_last_success_timestamp` in textfile/Prometheus | last success age `< 25h` |
| 188 momo-scheduler | `docker inspect momo-scheduler` and `docker logs --since 6h momo-scheduler` | container `running healthy`; `全部排程任務已註冊`; Google Drive auth works; dashboard URLs use container-reachable hostnames |
| 188 momo import | manual `run_auto_import_task()` after parser changes | selected sheet is `即時業績明細`; imported date range has matching rows in `daily_sales_snapshot` and `realtime_sales_monthly` |
| 110 cron | `systemctl is-active cron` | cron active; Docker/systemd textfile exporters fresh |
| 110 startup units | `systemctl --failed` | zero failed units; stale `momo-startup-complete` and `wooo-staggered-startup` disabled |
| 120 K8s CronJobs | `kubectl get cronjobs -n awoooi-prod` | unsuspended; no failed Jobs remain after current validation |
| 121 DR drill | `crontab -l` | DR drill cron present unless explicitly paused |

Useful checks:

```bash
ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'
```

If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.

---

## 10. P2/P3 Stateful Service Guardrails

| Tier | Examples | Automation |
|------|----------|------------|
| BLOCK | PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB | No automatic destructive action. Human approval only. |
| CRITICAL_HITL | Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse | Human-in-the-loop restart/repair. |
| STANDARD_HITL | API/Web/worker, OpenClaw, litellm | Restart only with evidence and blast-radius check. |
| AUTO | Stateless exporters, blackbox, nginx exporter | Auto restart allowed after verification. |

Never use generic `docker restart $(docker ps -q)` during cold start.

### 10.1 Dirty-Reboot Storage Corruption

Treat these log signatures as storage corruption, not ordinary service flakiness:

- `Bad message`
- `Structure needs cleaning`
- `Unknown codec`
- `PANIC: could not locate a valid checkpoint record`
- Kafka `Malformed line` in checkpoint files
- ClickHouse `broken and needs manual correction`

Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory.

If a filesystem returns `Bad message` or `Structure needs cleaning`, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.
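
The signature list above can be turned into a read-only log filter. The following is a minimal sketch, not an existing script in this repo: the function name `scan_storage_corruption` is hypothetical, and it only matches the fixed strings listed in this section. It changes nothing on disk.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print every line that
# contains one of the storage-corruption signatures from section 10.1.
# Exit status follows grep: 0 if at least one signature matched, 1 if none.
scan_storage_corruption() {
  grep -F \
    -e 'Bad message' \
    -e 'Structure needs cleaning' \
    -e 'Unknown codec' \
    -e 'PANIC: could not locate a valid checkpoint record' \
    -e 'Malformed line' \
    -e 'broken and needs manual correction'
}
```

One possible use is piping recent container logs through it, for example `docker logs --tail 500 CONTAINER 2>&1 | scan_storage_corruption`, where `CONTAINER` is a placeholder for the failing container name. A match is evidence to preserve and escalate, not a trigger for automatic repair.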
### 10.2 ClickHouse Clean-Clone Recovery Pattern

Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.

```text
1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.
```

Do not replace this with `rm -rf store/...` unless the unreadable path is already backed up or the commander explicitly accepts data loss.

The preferred incident artifact is:

```text
/var/lib/docker/volumes//_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/--YYYYMMDD-HHMMSS
```

### 10.3 Kafka Checkpoint Recovery Pattern

If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:

```text
log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint
```

Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.

---

## 11. P3 High-Load Services

Only release these after P0/P1/P2 gates are green:

| Host | Service | Release condition |
|------|---------|-------------------|
| 188 | momo-scheduler / crawler | load/core < 1.0 for 15 minutes and DB healthy |
| 188 | SignOz ClickHouse | healthy and merge backlog trending down |
| 188 | litellm | `/health/liveliness` good and provider route verified |
| 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
| 110 | Sentry uptime-checker | Sentry web/DB healthy |
| 110 | runners | all previous gates green and load/core < 1.0 for 15 minutes |

---

## 12. Baseline And AI Auto-Remediation Gate

### 12.1 Stable Runtime Baseline

These are release gates after the first cold-start recovery pass:

| Area | Baseline |
|------|----------|
| 188 host | PostgreSQL accepting, Redis PONG, momo `/health` 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers |
| 110 host | Harbor `/v2/` 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
| K3s | 120/121 nodes Ready, VIP `192.168.0.125` present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
| Public routes | `https://awoooi.wooo.work/api/v1/health` 2xx/3xx, `https://mo.wooo.work/health` 2xx/3xx |
| Guardrails | Docker/systemd textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0` |
| Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success `< 25h` |
| Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |

If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.
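
The `load/core < 1.0` gate used throughout this SOP is a simple ratio, and a small sketch makes the arithmetic explicit. The helper names `load_per_core` and `load_below_threshold` are hypothetical and not taken from the repo's scripts. Using the 15-minute load average as a stand-in for "below 1.0 for 15 minutes" is also an assumption: it is a damped average, so it approximates the sustained condition rather than proving it.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the "load/core < 1.0" gate.

# Print load divided by core count, rounded to two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Succeed (exit 0) only when load/core is strictly below the threshold.
# The threshold defaults to 1.0 when the third argument is omitted.
load_below_threshold() {
  awk -v load="$1" -v cores="$2" -v max="${3:-1.0}" \
    'BEGIN { exit !((load / cores) < max) }'
}
```

For example, a load average of 6.0 on an 8-core host gives `load_per_core 6.0 8` = `0.75`, which passes the gate, while 8.0 on the same host does not. On a Linux host the inputs could come from the third field of `/proc/loadavg` and from `nproc`.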
### 12.2 AI Auto-Remediation Gate

AI auto-repair can move from observe-only to limited execution only after:

- Prometheus rules are loaded.
- docker/systemd textfile exporter files are fresh.
- blackbox probes have stable results.
- cron/CronJob schedule checks are green.
- AWOOOI API `/api/v1/health` passes.
- Alertmanager E2E webhook passes.
- Redis/KM/playbook health is available.
- No active restart storm.
- Host load/core remains below `1.0` for 15 minutes.

Until then:

- diagnose only
- notify only
- require human approval for remediation
- no DB/ClickHouse/Harbor/Sentry destructive action
- no generic restart action against stateful services

---

## 13. One-Command Readiness Script

### 13.1 Single Pass

Run this when you want one read-only snapshot:

```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh
```

The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes.

It reports gates:

- `P0-NETWORK`
- `P0-188-DATA`
- `P0-110-REGISTRY-OBSERVABILITY`
- `P1-K3S`
- `P2-WORKLOAD-ALERTCHAIN`
- `P2-PUBLIC-ROUTES`
- `P2-SCHEDULES`
- runner guardrail state inside `P0-110-REGISTRY-OBSERVABILITY`

If it prints `BLOCKED`, fix the first blocked gate before moving forward.

### 13.2 Professional Watch Mode

Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:

```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test
```

This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is `GREEN` or the last attempt remains degraded/blocked.

Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
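
How gate counts could map to the final result is sketched below. This is an illustration of the release condition (`PASS > 0`, `WARN = 0`, `BLOCKED = 0`), not the logic of `full-stack-cold-start-check.sh`, which may differ. The function name `overall_result` is hypothetical, and treating "no gate reported at all" as `CHECK_FAILED` is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the release decision; the real script may differ.
# Arguments: PASS count, WARN count, BLOCKED count.
overall_result() {
  local pass="$1" warn="$2" blocked="$3"
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"        # any blocked gate stops the release
  elif [ "$warn" -gt 0 ]; then
    echo "DEGRADED"       # warnings must be cleared or explicitly accepted
  elif [ "$pass" -gt 0 ]; then
    echo "GREEN"          # PASS > 0, WARN = 0, BLOCKED = 0
  else
    # Assumption: nothing passed and nothing failed means the check did not run.
    echo "CHECK_FAILED"
  fi
}
```

With this ordering a blocked gate always wins over warnings, so `overall_result 5 1 2` yields `BLOCKED` and only `overall_result 7 0 0` style input yields `GREEN`.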
### 13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

```bash
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
```

This installs two scripts under `/home/wooo/scripts/`, adds a marked user-cron block, and writes:

- `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom`
- `/home/wooo/reboot-recovery/cold-start-last.log`

The cron path uses `--monitor-read-only`, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

- `awoooi_cold_start_monitor_up`
- `awoooi_cold_start_pass_gates`
- `awoooi_cold_start_warn_gates`
- `awoooi_cold_start_blocked_gates`
- `awoooi_cold_start_last_run_timestamp`
- `awoooi_cold_start_last_green_timestamp`
- `awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}`

Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.

### 13.4 Script-To-SOP Coverage Map

| Script gate | SOP coverage | Blocks |
|-------------|--------------|--------|
| `P0-NETWORK` | host reachability, ARP, SSH | every later phase |
| `P0-188-DATA` | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| `P0-110-REGISTRY-OBSERVABILITY` | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| `P1-K3S` | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| `P2-WORKLOAD-ALERTCHAIN` | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |

### 13.5 Next-Reboot Operator Contract

1. Run the watch command above.
2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
3. If it stops at `WARN`, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
4. Release high-load services only after `GREEN` and load/core stays below `1.0` for 15 minutes.
5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`.

### 13.6 2026-05-29 Addendum: 188 Public Gateway And Backup Alerts

The 188 public gateway for `aiops.wooo.work` must no longer point at the single node `192.168.0.120:31234/31235`. When 120 is unreachable, that makes the public route return 502 immediately. The official baseline must go through the K3s VIP:

```nginx
location /api/ {
    proxy_pass http://192.168.0.125:32334/api/;
}
location /api/v1/ws {
    proxy_pass http://192.168.0.125:32334/api/v1/ws;
}
location / {
    proxy_pass http://192.168.0.125:32335;
}
```

Changes must originate in `infra/ansible/roles/nginx/templates/188-all-sites.conf.j2` and then be converged with `infra/ansible/playbooks/nginx-sync.yml`. Do not edit only the live file on 188 without writing the change back to the Ansible baseline.

Backup alerting has two layers, and both are required:

- `ops/monitoring/alerts-unified.yml` is the canonical copy in the repo.
- On 110, the live `/home/wooo/monitoring/alerts.yml` and `/home/wooo/monitoring/alerts-unified.canonical.yml` must match; otherwise `prometheus-rule-drift-guard` may pull the rules back to an older version.

Required checks after a reboot:

```bash
curl -s http://127.0.0.1:9090/api/v1/rules \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'
cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom
```

If 120 has not recovered yet, `BackupConfigCapturePartial{target="120-k3s-host-configs"}` and the cold-start blocked alert are correct signals; do not silence them. After 120 recovers, rerun:

```bash
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
```

### 13.7 2026-05-29 Addendum: momo PostgreSQL Index And Data Sync

Do not judge `mo.wooo.work` only by `/health` or a 200 on the home page. After a reboot or fsck, a PostgreSQL index problem can let the import flow appear to complete while `daily_sales_snapshot` is not synced to `realtime_sales_monthly`. Symptoms in this incident:

- `daily_sales_snapshot` already had 17,353 rows for 2026-05-01 through 2026-05-28.
- `realtime_sales_monthly` had 0 rows for the same date range.
- The momo-scheduler log showed the PostgreSQL internal error `posting list tuple ... cannot be split`.

Standard handling order:

```bash
# 188 / momo-db: rebuild indexes only, do not delete data
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL
```

Only after the reindex may you run an idempotent backfill sync for the missing dates. The official procedure is to first confirm the row count in `realtime_sales_monthly` for that date range. If it is not 0, save the query result first and confirm whether to rerun the sync for the same range. Do not truncate the whole table and do not restore the whole database. After the backfill, verify at least:

```sql
SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
```

Both tables must show the same row count and the same minimum and maximum dates for the date range. Afterwards, clear the momo application cache:

```bash
docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'
```

---

## 14. Done Criteria

All must be true:

- Four hosts reachable by SSH.
- 188 PostgreSQL and Redis healthy.
- 110 Harbor, Gitea, Prometheus, Alertmanager healthy.
- 120/121 K3s nodes Ready.
- VIP `192.168.0.125` present.
- AWOOOI API and Web reachable through NodePort/VIP.
- Alertmanager E2E webhook succeeds.
- cron/CronJob schedules are active, unsuspended, and verified.
- momo `daily_sales_snapshot` and `realtime_sales_monthly` have matching row counts for the latest imported date range.
- Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
- High-load batch services are capped or delayed.
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.

---

## 15. Known Drift To Fix After Recovery

These must be cleaned after the incident, not during P0:

- `SERVICE-ENDPOINTS.md` still has old Prometheus/Alertmanager locations.
- Audit older docs for direct node webhook targets; current main path should be VIP `192.168.0.125:32334`.
- OpenClaw `8088` vs `8089` must be live-confirmed and normalized.
- 188 compose paths drift between `/home/ollama/*` and Ansible `/opt/*`.
- 110 runner docs still mention Docker runner in places; live startup prefers host `gitea-act-runner-host.service`.
- `scripts/setup-runner-watchdog.sh` conflicts with the 2026-05-05 runner watchdog disablement guardrail.
- `grist.wooo.work` / `registry.wooo.work` public HTTP/HTTPS currently route to `aiops.wooo.work`; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.