awoooi/docs/runbooks/FULL-STACK-COLD-START-SOP.md

# AWOOOI Full-Stack Cold Start SOP

> Version: v1.1
> Last updated: 2026-05-06 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

---

## 0. When To Use This

Use this SOP when any of these happen:

- 110/120/121/188 reboot unexpectedly.
- All services are abnormal after a power/network event.
- K3s is stuck `activating`.
- Host load remains high during startup and service health is mixed.
- Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.

The rule is simple: **recover the dependency chain, not the loudest symptom.**

---

## 1. Golden Startup Order

```text
0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed
```

Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.

### 1.1 Dependency Graph

```mermaid
flowchart TD
  network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
  network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
  data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
  obs110 --> k3s
  k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
  workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
  workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
  public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
  schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
  highload --> ai["AI auto-remediation: limited execution"]
```

This is also captured in the machine-readable baseline:

```text
ops/reboot-recovery/full-stack-cold-start-baseline.yml
```

The YAML baseline is the source of truth for:

- hosts, roles, and SSH users
- phase ordering
- service startup dependencies
- endpoint success codes
- schedule freshness thresholds
- stateful-service protection boundaries
- AI automation release gates

### 1.2 Phase Gate Logic

Each phase has the same decision rule:

| Result | Meaning | Action |
|--------|---------|--------|
| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| `WARN` | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| `GREEN` | All checks in scope passed. | Release the next phase only. |

The cold-start flow is intentionally conservative:

```text
P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution
```

The final release condition is not "containers are running". It is:

```text
PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
```
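
The release condition above can be sketched as a small decision function. This is only an illustration of the table in 1.2, not the logic of the real check script; the function name `gate_result` is hypothetical, and treating "no passing checks at all" as `WARN` is an assumption.

```bash
# Sketch: map gate counters to a release decision.
# Usage: gate_result PASS WARN BLOCKED  -> prints GREEN, WARN, or BLOCKED
gate_result() {
  pass=$1; warn=$2; blocked=$3
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"   # any blocked gate stops phase release
  elif [ "$warn" -gt 0 ] || [ "$pass" -eq 0 ]; then
    echo "WARN"      # warnings, or no passing evidence at all (assumption)
  else
    echo "GREEN"     # PASS > 0, WARN = 0, BLOCKED = 0
  fi
}
```

For example, `gate_result 12 1 0` prints `WARN`, which means diagnosis may continue but runner/CD/AI full execution stays closed.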

---

## 2. Automation Freeze

Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.

| Item | Cold-start policy | Reason |
|------|-------------------|--------|
| Gitea/GitHub runners | Last | Build jobs can saturate 110 CPU/RAM. |
| momo-scheduler / crawlers | Last | Chrome and batch work can saturate 188. |
| Sentry/Snuba consumers | Controlled | Kafka backlog and ClickHouse merge can create temporary high load. |
| Alertmanager outbound notification | Gate | Avoid alert storms before API webhook and Telegram are verified. |
| AI auto-repair | Observe-only | Metrics, Redis, KM, and playbooks may be incomplete. |
| Stateful DB restart | Human approval | PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets. |

---

## 3. P0 Evidence And Network

Run from any machine on the same LAN:

```bash
for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done

arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
for h in 110 120 121 188; do
  nc -z -w 3 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done
```

Then capture reboot evidence:

```bash
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
```

If any host has ARP `incomplete` or SSH port down, stop here and fix physical/network first.
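
To make the ARP stop condition easy to spot, the `arp -an` output can be filtered for incomplete entries. A minimal sketch, assuming the usual `? (IP) at <incomplete> on IFACE` line shape; the function name `arp_incomplete_ips` is hypothetical.

```bash
# Sketch: list IPs whose ARP entry is incomplete.
# Reads `arp -an` output on stdin and prints one IP per line.
arp_incomplete_ips() {
  grep -i 'incomplete' | sed -n 's/^[^(]*(\([0-9.]*\)).*/\1/p'
}

# Example (run on the LAN):
#   arp -an | grep -E '192\.168\.0\.(110|120|121|188)' | arp_incomplete_ips
```

Any IP printed here is a reason to stop and fix the physical/network layer before touching services.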

---

## 4. P0 188 Data Layer

188 is the first real service dependency because both the K3s datastore and the AWOOOI DB depend on its PostgreSQL.

### 4.1 Startup order

1. `containerd`
2. `docker`
3. `postgresql@14-main`
4. `k3s_datastore.kine` maintenance
5. `redis-server` on `6380`
6. `ollama` or current AI proxy dependencies
7. `nginx`
8. Docker networks
9. MinIO / OpenClaw / SignOz
10. momo / litellm / batch services after load is stable

### 4.2 Read-only check

```bash
ssh ollama@192.168.0.188 '
hostname; date; uptime; free -h
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
pg_isready -h localhost -p 5432 || true
redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'
```

### 4.3 PostgreSQL WAL checkpoint damage

Signature:

```text
PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...
```

This blocks:

- `188:5432`
- K3s startup on 120/121
- AWOOOI API DB access
- Alertmanager webhook if API cannot start

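Human-approved recovery commands on 188. `pg_resetwal` is a last resort: it discards the write-ahead log and can lose recently committed transactions, so the file-level backup (the `tar` step) must finish before it runs: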

```bash
sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
```

Do not run `DROP`, reinitialize the cluster, delete `/var/lib/postgresql`, or restore an old backup unless the commander explicitly approves it.

---

## 5. P0/P1 110 Registry And Observability

On 110, recover Harbor, Gitea, and monitoring early, and start runners last.

### 5.1 Startup order

1. `docker`
2. Remove `Exited (128)` / `Exited (137)` orphan containers
3. Harbor `harbor-log`
4. Harbor full stack
5. Gitea
6. Prometheus / Alertmanager / Grafana / exporters
7. Langfuse
8. SignOz
9. Sentry DB layer
10. Sentry web/worker/consumer layer
11. Gitea host runner and actions runners

### 5.2 Checks

```bash
ssh wooo@192.168.0.110 '
hostname; date; uptime; free -h
systemctl is-active docker || true
curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
```

Harbor is healthy when `/v2/` returns `200` or `401`. Do not treat `401` as failure: it only means the registry is up and asking for credentials.
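
Because several endpoints accept more than one status code, a tiny helper keeps the accepted set explicit. A minimal sketch; the function name `code_allowed` is hypothetical, and the accepted codes are passed in by the caller rather than read from the YAML baseline.

```bash
# Sketch: succeed if an HTTP status code is in an allowed set.
# Usage: code_allowed CODE ALLOWED...  -> exit 0 if CODE matches one of ALLOWED
code_allowed() {
  code=$1; shift
  for ok in "$@"; do
    [ "$code" = "$ok" ] && return 0
  done
  return 1
}

# Example on 110 (curl prints 000 when the connection itself fails):
#   code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:5000/v2/)
#   code_allowed "$code" 200 401 && echo "harbor=OK($code)" || echo "harbor=FAIL($code)"
```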

### 5.3 Runner gate

Runners may start only after all of the following are true:

- `188 PostgreSQL` ready
- `110 Harbor` ready
- `110 Gitea` ready
- `120/121 K3s` nodes ready
- AWOOOI API health passes
- 110 load/core is below `1.0` for at least 15 minutes
- runner systemd guardrails are active: `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0`

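Check (`systemctl show` reports `CPUQuota=200%` as `CPUQuotaPerSecUSec=2s` and `MemoryMax=2G` as `2147483648` bytes):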

```bash
ssh wooo@192.168.0.110 '
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
  echo "=== $u ==="
  systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
done
'
```

If `WatchdogUSec` is not `0`, apply the guardrail script manually with sudo:

```bash
sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
```
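
The load/core condition in the runner gate can be checked with a short helper. This is a sketch under one loud assumption: it uses the 15-minute load average as a proxy for "below `1.0` for at least 15 minutes", which is a damped average and not a strict guarantee. The function name `load_per_core_ok` is hypothetical.

```bash
# Sketch: succeed if LOAD / CORES is below LIMIT (default 1.0).
# Usage: load_per_core_ok LOAD CORES [LIMIT]
load_per_core_ok() {
  awk -v l="$1" -v c="$2" -v m="${3:-1.0}" 'BEGIN { exit !(c > 0 && l / c < m) }'
}

# Example on a Linux host (field 3 of /proc/loadavg is the 15-minute average):
#   load_per_core_ok "$(cut -d" " -f3 /proc/loadavg)" "$(nproc)" \
#     && echo "LOAD_OK" || echo "LOAD_HIGH"
```

The same helper applies to the 188 and 110 release conditions in section 11, since they use the same threshold.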

---

## 6. P1 120/121 K3s

K3s must wait for 188 PostgreSQL and 110 Harbor.

### 6.1 Startup order

1. 120 `k3s.service`
2. 121 `k3s-agent.service` or its live role
3. CNI / kube-proxy
4. Nodes Ready
5. Core pods
6. `awoooi-prod` pods
7. keepalived VIP `192.168.0.125`
8. NodePorts `32334` and `32335`

### 6.2 Checks

```bash
ssh wooo@192.168.0.120 '
hostname; uptime
pg_isready -h 192.168.0.188 -p 5432 || true
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
kubectl get nodes -o wide 2>/dev/null || true
kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

ssh wooo@192.168.0.121 '
hostname; uptime
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'
```

If K3s is `activating` while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.
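
Once PostgreSQL is back, node readiness can be reduced to a list of nodes that still need attention. A minimal sketch that parses `kubectl get nodes --no-headers`; the function name `not_ready_nodes` is hypothetical.

```bash
# Sketch: print the names of nodes whose STATUS does not include "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '{
    ready = 0
    n = split($2, parts, ",")   # STATUS may be e.g. "Ready,SchedulingDisabled"
    for (i = 1; i <= n; i++) if (parts[i] == "Ready") ready = 1
    if (!ready) print $1
  }'
}

# Example on 120:
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every listed node reports `Ready`; it does not prove that both 120 and 121 are registered, so still compare against the expected node count.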

---

## 7. P2 AWOOOI Workloads

Run after K3s nodes are Ready:

```bash
ssh wooo@192.168.0.120 '
kubectl get deploy -n awoooi-prod
kubectl get pods -n awoooi-prod -o wide
kubectl get svc -n awoooi-prod
kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'

curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/
```

If pods are `ImagePullBackOff`, go back to 110 Harbor.

If API health fails because DB/Redis is down, go back to 188.

---

## 8. P2 Alert Chain

Current main path:

```text
Prometheus/Alertmanager on 110
  -> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
  -> AWOOOI API
  -> TelegramGateway
  -> Telegram
```

Alertmanager health alone is not enough. Run E2E:

```bash
curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'
```

Expected: API returns success and Telegram receives the test alert.
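
The payload above carries a fixed `startsAt`. A minimal sketch of generating the same payload with the current UTC time and capturing the HTTP status; the function name `build_alert_payload` is hypothetical, and the sketch does not assume which status code the API uses for success.

```bash
# Sketch: emit the cold-start test alert with startsAt set to "now" (UTC).
build_alert_payload() {
  starts_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  cat <<EOF
{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"${starts_at}","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}
EOF
}

# Example: POST it and print only the status code.
#   build_alert_payload | curl -s -o /dev/null -w "webhook=%{http_code}\n" --max-time 8 \
#     -X POST -H 'Content-Type: application/json' --data-binary @- \
#     http://192.168.0.125:32334/api/v1/webhooks/alertmanager
```

A good status code covers only the API half of the chain; the Telegram message still has to be confirmed by a human.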

---

## 9. P2 Schedules And Delayed Work

Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.

| Host / Layer | Required check | Success baseline |
|--------------|----------------|------------------|
| 188 cron | `systemctl is-active cron` and `crontab -l` | cron active; backup, restart exporter, stats exporter entries present |
| 188 backup-from-110 | `backup_110_last_success_timestamp` in textfile/Prometheus | last success age `< 25h` |
| 188 momo-scheduler | `docker inspect momo-scheduler` and `docker logs --since 6h momo-scheduler` | container `running healthy`; log contains `全部排程任務已註冊` ("all scheduled tasks registered"); Google Drive auth works; dashboard URLs use container-reachable hostnames |
| 188 momo import | manual `run_auto_import_task()` after parser changes | selected sheet is `即時業績明細` (the real-time sales detail sheet); imported date range has matching rows in `daily_sales_snapshot` and `realtime_sales_monthly` |
| 110 cron | `systemctl is-active cron` | cron active; Docker/systemd textfile exporters fresh |
| 110 startup units | `systemctl --failed` | zero failed units; stale `momo-startup-complete` and `wooo-staggered-startup` disabled |
| 120 K8s CronJobs | `kubectl get cronjobs -n awoooi-prod` | unsuspended; no failed Jobs remain after current validation |
| 121 DR drill | `crontab -l` | DR drill cron present unless explicitly paused |

Useful checks:

```bash
ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'
```

If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.
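
The `< 25h` backup baseline can be checked directly from a textfile metric. A minimal sketch, assuming the metric line has no labels and holds a Unix timestamp in seconds; the function names `metric_value` and `fresh_within` are hypothetical.

```bash
# Sketch: read one sample value from Prometheus textfile content on stdin.
# Usage: metric_value NAME
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Sketch: succeed if NOW - TS is below MAX_SECONDS.
# Usage: fresh_within TS NOW MAX_SECONDS
fresh_within() {
  awk -v ts="$1" -v now="$2" -v max="$3" 'BEGIN { exit !(ts > 0 && now - ts < max) }'
}

# Example on 188 (25h = 90000 seconds):
#   ts=$(cat /home/ollama/node_exporter_textfiles/*.prom | metric_value backup_110_last_success_timestamp)
#   fresh_within "$ts" "$(date +%s)" 90000 && echo "BACKUP_FRESH" || echo "BACKUP_STALE"
```

A missing metric yields an empty `ts`, which the freshness check treats as stale rather than passing silently.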

---

## 10. P2/P3 Stateful Service Guardrails

| Tier | Examples | Automation |
|------|----------|------------|
| BLOCK | PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB | No automatic destructive action. Human approval only. |
| CRITICAL_HITL | Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse | Human-in-the-loop restart/repair. |
| STANDARD_HITL | API/Web/worker, OpenClaw, litellm | Restart only with evidence and blast-radius check. |
| AUTO | Stateless exporters, blackbox, nginx exporter | Auto restart allowed after verification. |

Never use generic `docker restart $(docker ps -q)` during cold start.

### 10.1 Dirty-Reboot Storage Corruption

Treat these log signatures as storage corruption, not ordinary service flakiness:

- `Bad message`
- `Structure needs cleaning`
- `Unknown codec`
- `PANIC: could not locate a valid checkpoint record`
- Kafka `Malformed line` in checkpoint files
- ClickHouse `broken and needs manual correction`

Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns `Bad message` or `Structure needs cleaning`, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.
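
The signatures listed above can be matched mechanically so that a restart storm is classified before anyone restarts anything. A minimal sketch; the function name `has_storage_corruption` is hypothetical, and the patterns are exactly the strings from the list.

```bash
# Sketch: succeed if log text on stdin contains a storage-corruption signature.
has_storage_corruption() {
  grep -E -q \
    -e 'Bad message' \
    -e 'Structure needs cleaning' \
    -e 'Unknown codec' \
    -e 'could not locate a valid checkpoint record' \
    -e 'Malformed line' \
    -e 'broken and needs manual correction'
}

# Example (<container> is a placeholder):
#   docker logs --tail 500 <container> 2>&1 | has_storage_corruption \
#     && echo "STORAGE_CORRUPTION: stop restarts and collect evidence"
```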

### 10.2 ClickHouse Clean-Clone Recovery Pattern

Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.

```text
1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.
```

Do not replace this with `rm -rf store/...` unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:

```text
/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS
```
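
Step 3 of the pattern needs a list of files the filesystem refuses to read. One way to build it, as a sketch only: try to read every file in full and record the failures. The function name `unreadable_paths` is hypothetical, and reading whole files can be slow on large volumes.

```bash
# Sketch: print the paths that cannot be read in full.
# Reads newline-separated file paths on stdin; read-only, changes nothing.
unreadable_paths() {
  while IFS= read -r p; do
    cat -- "$p" >/dev/null 2>&1 || printf '%s\n' "$p"
  done
}

# Example (placeholders as in the artifact paths above):
#   find /var/lib/docker/volumes/<volume>/_data -type f | unreadable_paths \
#     > /var/backups/<service>-<component>-unreadable.txt
```

The resulting list is evidence first and an exclude list second; converting it into exclude patterns for the clean clone depends on the copy tool and is left out here.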

### 10.3 Kafka Checkpoint Recovery Pattern

If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:

```text
log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint
```

Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.
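
The "preserve and move only checkpoint files" rule can be sketched as follows. The function name `preserve_kafka_checkpoints` is hypothetical, and the Kafka log directory is an assumption the operator must supply; the sketch expects Kafka to be stopped and never touches topic directories.

```bash
# Sketch: move only the three checkpoint files into a preservation directory.
# Usage: preserve_kafka_checkpoints KAFKA_LOG_DIR KEEP_DIR
preserve_kafka_checkpoints() {
  log_dir=$1; keep_dir=$2
  mkdir -p "$keep_dir" || return 1
  for f in log-start-offset-checkpoint \
           recovery-point-offset-checkpoint \
           replication-offset-checkpoint; do
    if [ -f "$log_dir/$f" ]; then
      mv -- "$log_dir/$f" "$keep_dir/$f" && echo "moved $f"
    fi
  done
}

# Example (both paths are placeholders):
#   preserve_kafka_checkpoints <kafka-log-dir> /var/backups/<service>-kafka-checkpoints-$(date +%Y%m%d-%H%M%S)
```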

---

## 11. P3 High-Load Services

Only release these after P0/P1/P2 gates are green:

| Host | Service | Release condition |
|------|---------|-------------------|
| 188 | momo-scheduler / crawler | load/core < 1.0 for 15 minutes and DB healthy |
| 188 | SignOz ClickHouse | healthy and merge backlog trending down |
| 188 | litellm | `/health/liveliness` good and provider route verified |
| 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
| 110 | Sentry uptime-checker | Sentry web/DB healthy |
| 110 | runners | all previous gates green and load/core < 1.0 for 15 minutes |

---

## 12. Baseline And AI Auto-Remediation Gate

### 12.1 Stable Runtime Baseline

These are release gates after the first cold-start recovery pass:

| Area | Baseline |
|------|----------|
| 188 host | PostgreSQL accepting, Redis PONG, momo `/health` 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers |
| 110 host | Harbor `/v2/` 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
| K3s | 120/121 nodes Ready, VIP `192.168.0.125` present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
| Public routes | `https://awoooi.wooo.work/api/v1/health` 2xx/3xx, `https://mo.wooo.work/health` 2xx/3xx |
| Guardrails | Docker/systemd textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0` |
| Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success `< 25h` |
| Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |

If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.
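
The backlog row in the baseline table can be read as "lag did not rise in either of the two most recent checks". A minimal sketch under that interpretation, which is an assumption; the function name `backlog_not_increasing` is hypothetical, and the three samples are whatever lag or merge counter the operator collects.

```bash
# Sketch: succeed if the backlog did not increase across two consecutive checks.
# Usage: backlog_not_increasing OLDEST MIDDLE NEWEST
backlog_not_increasing() {
  awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { exit !(b <= a && c <= b) }'
}

# Example:
#   backlog_not_increasing 900 700 650 && echo "BACKLOG_DRAINING" || echo "BACKLOG_GROWING"
```

A flat series passes this check, so a backlog that is stuck at a high value still needs a human look.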

### 12.2 AI Auto-Remediation Gate

AI auto-repair can move from observe-only to limited execution only after:

- Prometheus rules are loaded.
- docker/systemd textfile exporter files are fresh.
- blackbox probes have stable results.
- cron/CronJob schedule checks are green.
- AWOOOI API `/api/v1/health` passes.
- Alertmanager E2E webhook passes.
- Redis/KM/playbook health is available.
- No active restart storm.
- Host load/core remains below `1.0` for 15 minutes.

Until then:

- diagnose only
- notify only
- require human approval for remediation
- no DB/ClickHouse/Harbor/Sentry destructive action
- no generic restart action against stateful services

---

## 13. One-Command Readiness Script

### 13.1 Single Pass

Run this when you want one read-only snapshot:

```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh
```

The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:

- `P0-NETWORK`
- `P0-188-DATA`
- `P0-110-REGISTRY-OBSERVABILITY`
- `P1-K3S`
- `P2-WORKLOAD-ALERTCHAIN`
- `P2-PUBLIC-ROUTES`
- `P2-SCHEDULES`
- runner guardrail state inside `P0-110-REGISTRY-OBSERVABILITY`

If it prints `BLOCKED`, fix the first blocked gate before moving forward.

### 13.2 Professional Watch Mode

Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:

```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test
```

This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts (about 30 minutes) and exits when the stack is `GREEN` or when the last attempt is still degraded or blocked.

Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
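
The `--interval` / `--max-attempts` behavior described above amounts to a bounded retry loop. The following is a sketch of that idea only, not the script's actual implementation; the function name `watch_until_green` is hypothetical, and it assumes the check command prints its result word on the last line.

```bash
# Sketch: rerun a check until it reports GREEN or attempts run out.
# Usage: watch_until_green CHECK_CMD INTERVAL_SECONDS MAX_ATTEMPTS
watch_until_green() {
  check=$1; interval=$2; max=$3; attempt=1
  while [ "$attempt" -le "$max" ]; do
    result=$($check | tail -n 1)
    echo "attempt=$attempt result=$result"
    [ "$result" = "GREEN" ] && return 0
    [ "$attempt" -lt "$max" ] && sleep "$interval"   # no sleep after the final attempt
    attempt=$((attempt + 1))
  done
  return 1   # last attempt was still degraded or blocked
}
```

The non-zero exit on exhaustion matters: it lets a wrapper or the operator distinguish "released" from "gave up".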

### 13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

```bash
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
```

This installs two scripts under `/home/wooo/scripts/`, adds a marked user-cron block, and writes:

- `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom`
- `/home/wooo/reboot-recovery/cold-start-last.log`

The cron path uses `--monitor-read-only`, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

- `awoooi_cold_start_monitor_up`
- `awoooi_cold_start_pass_gates`
- `awoooi_cold_start_warn_gates`
- `awoooi_cold_start_blocked_gates`
- `awoooi_cold_start_last_run_timestamp`
- `awoooi_cold_start_last_green_timestamp`
- `awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}`

Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
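
For orientation, a textfile monitor of this kind typically writes its metrics to a temporary file and renames it into place, so the collector never reads a half-written file. A minimal sketch using the metric names listed above; it is not the installed script, the function name `write_cold_start_metrics` is hypothetical, and `awoooi_cold_start_last_green_timestamp` is left out because it needs state from earlier runs.

```bash
# Sketch: write cold-start gate metrics atomically.
# Usage: write_cold_start_metrics OUT_FILE PASS WARN BLOCKED RESULT
write_cold_start_metrics() {
  out=$1; pass=$2; warn=$3; blocked=$4; result=$5
  tmp="$out.$$"   # same directory, but not ending in .prom
  {
    echo "awoooi_cold_start_monitor_up 1"
    echo "awoooi_cold_start_pass_gates $pass"
    echo "awoooi_cold_start_warn_gates $warn"
    echo "awoooi_cold_start_blocked_gates $blocked"
    echo "awoooi_cold_start_last_run_timestamp $(date +%s)"
    echo "awoooi_cold_start_last_result{result=\"$result\"} 1"
  } > "$tmp" && mv -- "$tmp" "$out"
}

# Example:
#   write_cold_start_metrics /home/wooo/node_exporter_textfiles/cold_start_recovery.prom 7 0 0 green
```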

### 13.4 Script-To-SOP Coverage Map

| Script gate | SOP coverage | Blocks |
|-------------|--------------|--------|
| `P0-NETWORK` | host reachability, ARP, SSH | every later phase |
| `P0-188-DATA` | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| `P0-110-REGISTRY-OBSERVABILITY` | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| `P1-K3S` | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| `P2-WORKLOAD-ALERTCHAIN` | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |

### 13.5 Next-Reboot Operator Contract

1. Run the watch command above.
2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
3. If it stops at `WARN`, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
4. Release high-load services only after `GREEN` and load/core stays below `1.0` for 15 minutes.
5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`.

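### 13.6 2026-05-29 Addendum: 188 Public Gateway and Backup Alerts

The 188 public gateway for `aiops.wooo.work` must no longer point at the single node `192.168.0.120:31234/31235`. When 120 is unreachable, that makes the public route return 502 immediately. The official baseline must go through the K3s VIP: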

```nginx
location /api/ {
    proxy_pass http://192.168.0.125:32334/api/;
}

location /api/v1/ws {
    proxy_pass http://192.168.0.125:32334/api/v1/ws;
}

location / {
    proxy_pass http://192.168.0.125:32335;
}
```

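Changes must originate in `infra/ansible/roles/nginx/templates/188-all-sites.conf.j2` and then be converged with `infra/ansible/playbooks/nginx-sync.yml`. Do not edit only the live file on 188 without writing the change back to the Ansible baseline.

Backup alerting has two layers, and both are required:

- `ops/monitoring/alerts-unified.yml` is the repo canonical copy.
- On 110, the live `/home/wooo/monitoring/alerts.yml` and `/home/wooo/monitoring/alerts-unified.canonical.yml` must be identical; otherwise `prometheus-rule-drift-guard` may roll the rules back to an old version.

Always check after a reboot (run on 110; the first command prints any expected rule names that are missing, so `[]` means all are loaded):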

```bash
curl -s http://127.0.0.1:9090/api/v1/rules \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'

cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom
```

If 120 has not recovered yet, `BackupConfigCapturePartial{target="120-k3s-host-configs"}` and a blocked cold-start gate are correct signals and must not be silenced. After 120 recovers, rerun:

```bash
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
```

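### 13.7 2026-05-29 Addendum: momo PostgreSQL Index and Data Sync

Do not judge `mo.wooo.work` only by `/health` or a 200 on the home page. After a reboot or fsck, a PostgreSQL index problem can let the import flow appear to complete while `daily_sales_snapshot` is never synced to `realtime_sales_monthly`. Symptoms in this incident:

- `daily_sales_snapshot` had 17,353 rows for 2026-05-01 through 2026-05-28.
- `realtime_sales_monthly` had 0 rows for the same date range.
- The momo-scheduler log showed the PostgreSQL internal error `posting list tuple ... cannot be split`.

Standard handling order: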

```bash
# 188 / momo-db: rebuild the index only, do not delete data
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL
```

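Only after the index is rebuilt may you run an idempotent backfill sync for the missing dates. The official procedure must first confirm the row count in `realtime_sales_monthly` for that date range; if it is not 0, save the query result first and confirm whether to rerun the sync for the same range. Do not truncate the whole table, and do not restore the whole database. After the backfill, verify at least: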

```sql
SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
```

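The row counts and the minimum and maximum dates for the same date range must match across both tables. When done, clear the momo application cache: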

```bash
docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'
```

---

## 14. Done Criteria

All must be true:

- Four hosts reachable by SSH.
- 188 PostgreSQL and Redis healthy.
- 110 Harbor, Gitea, Prometheus, Alertmanager healthy.
- 120/121 K3s nodes Ready.
- VIP `192.168.0.125` present.
- AWOOOI API and Web reachable through NodePort/VIP.
- Alertmanager E2E webhook succeeds.
- cron/CronJob schedules are active, unsuspended, and verified.
- momo `daily_sales_snapshot` and `realtime_sales_monthly` row counts match for the latest imported date range.
- Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
- High-load batch services are capped or delayed.
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.

---

## 15. Known Drift To Fix After Recovery

These must be cleaned after the incident, not during P0:

- `SERVICE-ENDPOINTS.md` still has old Prometheus/Alertmanager locations.
- Audit older docs for direct node webhook targets; current main path should be VIP `192.168.0.125:32334`.
- OpenClaw `8088` vs `8089` must be live-confirmed and normalized.
- 188 compose paths drift between `/home/ollama/*` and Ansible `/opt/*`.
- 110 runner docs still mention Docker runner in places; live startup prefers host `gitea-act-runner-host.service`.
- `scripts/setup-runner-watchdog.sh` conflicts with the 2026-05-05 runner watchdog disablement guardrail.
- `grist.wooo.work` / `registry.wooo.work` public HTTP/HTTPS currently route to `aiops.wooo.work`; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.