AWOOOI Full-Stack Cold Start SOP

Version: v1.1
Last updated: 2026-05-06 Asia/Taipei
Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

0. When To Use This

Use this SOP when any of these happen:

110/120/121/188 reboot unexpectedly.
All services are abnormal after a power/network event.
K3s is stuck activating.
Host load remains high during startup and service health is mixed.
Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.

The rule is simple: recover the dependency chain, not the loudest symptom.

1. Golden Startup Order

0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed

Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.

1.1 Dependency Graph

flowchart TD
  network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
  network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
  data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
  obs110 --> k3s
  k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
  workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
  workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
  public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
  schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
  highload --> ai["AI auto-remediation: limited execution"]

This is also captured in the machine-readable baseline:

ops/reboot-recovery/full-stack-cold-start-baseline.yml

The YAML baseline is the source of truth for:

hosts, roles, and SSH users
phase ordering
service startup dependencies
endpoint success codes
schedule freshness thresholds
stateful-service protection boundaries
AI automation release gates

1.2 Phase Gate Logic

Each phase has the same decision rule:

Result	Meaning	Action
`BLOCKED`	A dependency required by later phases is down.	Stop phase release and fix the first blocked gate.
`WARN`	Core dependency passed, but confidence is incomplete.	Continue diagnosis, but do not release runner/CD/AI full execution.
`GREEN`	All checks in scope passed.	Release the next phase only.

The cold-start flow is intentionally conservative:

P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution

The final release condition is not "containers are running". It is:

PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
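
As a minimal sketch of that release condition, the three counters can be folded into one result. The function name `gate_result` is hypothetical and is not taken from the readiness script. Treating "nothing was checked" (PASS = 0) as WARN is an assumption made here so that an empty run can never look green.

```shell
# Sketch only: combine gate counters into the release result.
# Usage: gate_result PASS WARN BLOCKED  -> prints GREEN, WARN, or BLOCKED
gate_result() {
  local pass="$1" warn="$2" blocked="$3"
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"   # a dependency required by later phases is down
  elif [ "$warn" -gt 0 ] || [ "$pass" -eq 0 ]; then
    echo "WARN"      # incomplete confidence (assumption: zero checks also counts)
  else
    echo "GREEN"     # PASS > 0, WARN = 0, BLOCKED = 0
  fi
}
```

For example, `gate_result 7 1 0` prints `WARN`, which means runner/CD/AI full execution stays held back.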

2. Automation Freeze

Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.

Item	Cold-start policy	Reason
Gitea/GitHub runners	Last	Build jobs can saturate 110 CPU/RAM.
momo-scheduler / crawlers	Last	Chrome and batch work can saturate 188.
Sentry/Snuba consumers	Controlled	Kafka backlog and ClickHouse merge can create temporary high load.
Alertmanager outbound notification	Gate	Avoid alert storms before API webhook and Telegram are verified.
AI auto-repair	Observe-only	Metrics, Redis, KM, and playbooks may be incomplete.
Stateful DB restart	Human approval	PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets.

3. P0 Evidence And Network

Run from any machine on the same LAN:

for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done

arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
# nc -G is the macOS/BSD connect-timeout flag; with Linux netcat variants use -w 3 instead.
for h in 110 120 121 188; do
  nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done

Then capture reboot evidence:

ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'

If any host shows an incomplete ARP entry or a closed SSH port, stop here and fix the physical/network layer first.

4. P0 188 Data Layer

188 is the first real service dependency because both the K3s datastore and the AWOOOI DB depend on PostgreSQL.

4.1 Startup order

containerd
docker
postgresql@14-main
k3s_datastore.kine maintenance
redis-server on 6380
ollama or current AI proxy dependencies
nginx
Docker networks
MinIO / OpenClaw / SignOz
momo / litellm / batch services after load is stable

4.2 Read-only check

ssh ollama@192.168.0.188 '
hostname; date; uptime; free -h
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
pg_isready -h localhost -p 5432 || true
redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'

4.3 PostgreSQL WAL checkpoint damage

Signature:

PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...

This blocks:

188:5432
K3s startup on 120/121
AWOOOI API DB access
Alertmanager webhook if API cannot start

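Human-approved recovery commands on 188. pg_resetwal discards the write-ahead log and can lose recently committed transactions, so the tar backup must complete before it runs: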

sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"

Do not run DROP, reinitialize the cluster, delete /var/lib/postgresql, or restore an old backup unless the commander explicitly approves it.

5. P0/P1 110 Registry And Observability

110 must recover Harbor/Gitea/Monitoring early, but runners last.

5.1 Startup order

docker
Remove Exited (128) / Exited (137) orphan containers
Harbor harbor-log
Harbor full stack
Gitea
Prometheus / Alertmanager / Grafana / exporters
Langfuse
SignOz
Sentry DB layer
Sentry web/worker/consumer layer
Gitea host runner and actions runners

5.2 Checks

ssh wooo@192.168.0.110 '
hostname; date; uptime; free -h
systemctl is-active docker || true
curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'

Harbor healthy means /v2/ returns 200 or 401. Do not treat 401 as failure.
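
The accepted-code rule can be sketched as a small lookup. The function name `endpoint_code_ok` is hypothetical. The Harbor codes come from the sentence above, and the Gitea and Sentry codes are copied from the baseline table in section 12.1. The YAML baseline remains the source of truth.

```shell
# Sketch only: decide whether an HTTP status code counts as healthy.
# Usage: endpoint_code_ok SERVICE CODE  -> exit 0 healthy, 1 unhealthy, 2 unknown service
endpoint_code_ok() {
  local service="$1" code="$2" accepted
  case "$service" in
    harbor) accepted="200 401" ;;      # 401 means the registry is up and asking for auth
    gitea)  accepted="200 302" ;;
    sentry) accepted="200 302 400" ;;
    *) return 2 ;;
  esac
  # Pad with spaces so "40" cannot match "400" or "401".
  case " $accepted " in
    *" $code "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

It can be fed from the existing curl pattern, for example `endpoint_code_ok harbor "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:5000/v2/)"`. curl reports `000` when the connection fails, which is rejected.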

5.3 Runner gate

Runners may start only after all of the following are true:

188 PostgreSQL ready
110 Harbor ready
110 Gitea ready
120/121 K3s nodes ready
AWOOOI API health passes
110 load/core is below 1.0 for at least 15 minutes
runner systemd guardrails are active: CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0

Check:

ssh wooo@192.168.0.110 '
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
  echo "=== $u ==="
  systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
done
'

If WatchdogUSec is not 0, apply the guardrail script manually with sudo:

sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
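
To turn the `systemctl show` output into a pass/fail answer, a sketch like the following could be used. The function name `runner_guardrails_ok` is hypothetical. It assumes systemd prints `CPUQuota=200%` as `CPUQuotaPerSecUSec=2s` and `MemoryMax=2G` as `MemoryMax=2147483648`; confirm this against a live unit before relying on it.

```shell
# Sketch only: read "systemctl show" properties for ONE unit on stdin and
# check them against the guardrails CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0.
runner_guardrails_ok() {
  local props cpu mem wd
  props="$(cat)"
  cpu="$(printf '%s\n' "$props" | sed -n 's/^CPUQuotaPerSecUSec=//p')"
  mem="$(printf '%s\n' "$props" | sed -n 's/^MemoryMax=//p')"
  wd="$(printf '%s\n' "$props" | sed -n 's/^WatchdogUSec=//p')"
  # Assumed display forms: 200% -> 2s, 2G -> 2147483648 bytes, disabled watchdog -> 0
  [ "$cpu" = "2s" ] && [ "$mem" = "2147483648" ] && [ "$wd" = "0" ]
}
```

Usage inside the loop above: `systemctl show "$u" -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec | runner_guardrails_ok || echo "GUARDRAIL_MISSING $u"`.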

6. P1 120/121 K3s

K3s must wait for 188 PostgreSQL and 110 Harbor.

6.1 Startup order

120 k3s.service
121 k3s-agent.service or its live role
CNI / kube-proxy
Nodes Ready
Core pods
awoooi-prod pods
keepalived VIP 192.168.0.125
NodePorts 32334 and 32335

6.2 Checks

ssh wooo@192.168.0.120 '
hostname; uptime
pg_isready -h 192.168.0.188 -p 5432 || true
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
kubectl get nodes -o wide 2>/dev/null || true
kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

ssh wooo@192.168.0.121 '
hostname; uptime
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

If K3s is activating while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.

7. P2 AWOOOI Workloads

Run after K3s nodes are Ready:

ssh wooo@192.168.0.120 '
kubectl get deploy -n awoooi-prod
kubectl get pods -n awoooi-prod -o wide
kubectl get svc -n awoooi-prod
kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'

curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/

If pods are ImagePullBackOff, go back to 110 Harbor.

If API health fails because DB/Redis is down, go back to 188.

8. P2 Alert Chain

Current main path:

Prometheus/Alertmanager on 110
  -> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
  -> AWOOOI API
  -> TelegramGateway
  -> Telegram

Alertmanager health alone is not enough. Run E2E:

curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'

Expected: API returns success and Telegram receives the test alert.
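
The test payload above carries a fixed `startsAt`. A sketch that stamps the current UTC time instead is shown below. The function name `cold_start_alert_payload` is hypothetical, and whether the AWOOOI API treats old timestamps differently is not stated in this SOP, so this is a convenience and not a requirement.

```shell
# Sketch only: print the same Alertmanager v4 test payload with startsAt
# set to "now" (UTC), or to the value passed as the first argument.
cold_start_alert_payload() {
  local starts_at
  starts_at="${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  cat <<EOF
{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"${starts_at}","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}
EOF
}
```

It plugs into the same request: `curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager -H 'Content-Type: application/json' -d "$(cold_start_alert_payload)"`.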

9. P2 Schedules And Delayed Work

Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.

Host / Layer	Required check	Success baseline
188 cron	`systemctl is-active cron` and `crontab -l`	cron active; backup, restart exporter, stats exporter entries present
188 backup-from-110	`backup_110_last_success_timestamp` in textfile/Prometheus	last success age `< 25h`
188 momo-scheduler	`docker inspect momo-scheduler` and `docker logs --since 6h momo-scheduler`	container `running healthy`; log line `全部排程任務已註冊` (all scheduled tasks registered) present; Google Drive auth works; dashboard URLs use container-reachable hostnames
188 momo import	manual `run_auto_import_task()` after parser changes	selected sheet is `即時業績明細` (real-time sales detail); imported date range has matching rows in `daily_sales_snapshot` and `realtime_sales_monthly`
110 cron	`systemctl is-active cron`	cron active; Docker/systemd textfile exporters fresh
110 startup units	`systemctl --failed`	zero failed units; stale `momo-startup-complete` and `wooo-staggered-startup` disabled
120 K8s CronJobs	`kubectl get cronjobs -n awoooi-prod`	unsuspended; no failed Jobs remain after current validation
121 DR drill	`crontab -l`	DR drill cron present unless explicitly paused

Useful checks:

ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'
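
The `< 25h` backup freshness baseline from the table can be sketched as two small helpers. The names `prom_value` and `backup_fresh` are hypothetical, and the sketch assumes the metric is written without labels as `backup_110_last_success_timestamp <unix seconds>`.

```shell
# Sketch only: print the value of an unlabelled metric from textfile content on stdin.
# Exits 1 if the metric is absent. %.0f also handles values like 1.7485e+09.
prom_value() {
  awk -v m="$1" '
    $1 == m { printf "%.0f\n", $2; found = 1; exit }
    END { if (!found) exit 1 }'
}

# Sketch only: succeed when the last success is less than 25 hours old.
# Usage: backup_fresh LAST_SUCCESS_EPOCH NOW_EPOCH
backup_fresh() {
  local last="$1" now="$2" max_age
  max_age=$((25 * 3600))
  [ $((now - last)) -lt "$max_age" ]
}
```

Usage on 188, assuming the metric lives in one of the textfiles listed above: `last="$(cat /home/ollama/node_exporter_textfiles/*.prom | prom_value backup_110_last_success_timestamp)" && backup_fresh "$last" "$(date +%s)"`.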

If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.

10. P2/P3 Stateful Service Guardrails

Tier	Examples	Automation
BLOCK	PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB	No automatic destructive action. Human approval only.
CRITICAL_HITL	Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse	Human-in-the-loop restart/repair.
STANDARD_HITL	API/Web/worker, OpenClaw, litellm	Restart only with evidence and blast-radius check.
AUTO	Stateless exporters, blackbox, nginx exporter	Auto restart allowed after verification.

Never use generic docker restart $(docker ps -q) during cold start.
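
One way to make the tier table enforceable before any restart is a lookup like the sketch below. The service keys (`postgresql`, `harbor-db`, and so on) are hypothetical labels chosen for illustration, not real container names. Defaulting unknown services to BLOCK is an assumption made here to stay on the safe side.

```shell
# Sketch only: map a service key to its automation tier from the table above.
remediation_tier() {
  case "$1" in
    postgresql|clickhouse-data|harbor-db|sentry-db)        echo "BLOCK" ;;
    redis|kafka|minio|signoz-clickhouse|sentry-clickhouse) echo "CRITICAL_HITL" ;;
    api|web|worker|openclaw|litellm)                       echo "STANDARD_HITL" ;;
    exporter|blackbox|nginx-exporter)                      echo "AUTO" ;;
    *)                                                     echo "BLOCK" ;;  # unknown: assume the strictest tier
  esac
}

# Only the AUTO tier may be restarted without a human in the loop.
auto_restart_allowed() {
  [ "$(remediation_tier "$1")" = "AUTO" ]
}
```

A restart wrapper would then call `auto_restart_allowed "$svc" || { echo "needs human approval: $svc"; exit 1; }` before touching any single, named container.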

10.1 Dirty-Reboot Storage Corruption

Treat these log signatures as storage corruption, not ordinary service flakiness:

Bad message
Structure needs cleaning
Unknown codec
PANIC: could not locate a valid checkpoint record
Kafka Malformed line in checkpoint files
ClickHouse broken and needs manual correction

Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns Bad message or Structure needs cleaning, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.
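
A read-only scan for those signatures could look like the sketch below. The function name `storage_corruption_hits` is hypothetical. The Kafka and ClickHouse patterns are deliberately shortened so that small wording differences in the real log lines still match.

```shell
# Sketch only: print log lines from stdin that match a storage-corruption
# signature. Exit status 0 means at least one hit, 1 means none.
storage_corruption_hits() {
  grep -F \
    -e 'Bad message' \
    -e 'Structure needs cleaning' \
    -e 'Unknown codec' \
    -e 'could not locate a valid checkpoint record' \
    -e 'Malformed line in checkpoint' \
    -e 'broken and need'
}
```

Usage: `docker logs --since 1h <container> 2>&1 | storage_corruption_hits`, where `<container>` is the failing container. A hit means the case follows the guardrails in this section and is not an ordinary restart.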

10.2 ClickHouse Clean-Clone Recovery Pattern

Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.

1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.

Do not replace this with rm -rf store/... unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:

/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS

10.3 Kafka Checkpoint Recovery Pattern

If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:

log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint

Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.
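
The "preserve and move only checkpoint files" step can be sketched as follows. The function name `preserve_kafka_checkpoints` and both directory arguments are hypothetical; the real Kafka log directory depends on the compose volume layout and must be confirmed on the host while Kafka is stopped.

```shell
# Sketch only: move the three checkpoint files out of the Kafka log dir
# into a preservation dir. Topic directories and log segments are not touched.
# Usage: preserve_kafka_checkpoints KAFKA_LOG_DIR KEEP_DIR
preserve_kafka_checkpoints() {
  local log_dir="$1" keep_dir="$2" f
  mkdir -p -- "$keep_dir" || return 1
  for f in log-start-offset-checkpoint \
           recovery-point-offset-checkpoint \
           replication-offset-checkpoint; do
    if [ -f "$log_dir/$f" ]; then
      mv -- "$log_dir/$f" "$keep_dir/$f" || return 1
      echo "moved $f"
    fi
  done
}
```

The preservation directory would typically follow the artifact naming from 10.2, for example a timestamped path under /var/backups.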

11. P3 High-Load Services

Only release these after P0/P1/P2 gates are green:

Host	Service	Release condition
188	momo-scheduler / crawler	load/core < 1.0 for 15 minutes and DB healthy
188	SignOz ClickHouse	healthy and merge backlog trending down
188	litellm	`/health/liveliness` good and provider route verified
110	Sentry Snuba consumers	ClickHouse healthy and Kafka backlog decreasing
110	Sentry uptime-checker	Sentry web/DB healthy
110	runners	all previous gates green and load/core < 1.0 for 15 minutes
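
The `load/core < 1.0` condition can be approximated with the sketch below. The function name `load_per_core_ok` is hypothetical. Using the 15-minute load average as a stand-in for "below 1.0 for 15 minutes" is an assumption: it is an average, so a short spike can hide inside it.

```shell
# Sketch only: succeed when load divided by core count is below the limit.
# Usage: load_per_core_ok LOAD CORES [LIMIT]   (LIMIT defaults to 1.0)
load_per_core_ok() {
  local load="$1" cores="$2" limit="${3:-1.0}"
  # awk does the floating-point comparison; zero cores is rejected.
  awk -v l="$load" -v c="$cores" -v t="$limit" \
    'BEGIN { exit !(c > 0 && l / c < t) }'
}
```

On a Linux host the inputs can come from `/proc/loadavg` and `nproc`: `load_per_core_ok "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)" && echo RELEASE_OK`.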

12. Baseline And AI Auto-Remediation Gate

12.1 Stable Runtime Baseline

These are release gates after the first cold-start recovery pass:

Area	Baseline
188 host	PostgreSQL accepting, Redis PONG, momo `/health` 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers
110 host	Harbor `/v2/` 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop
K3s	120/121 nodes Ready, VIP `192.168.0.125` present, AWOOOI API 2xx/3xx, Web 2xx/3xx
Public routes	`https://awoooi.wooo.work/api/v1/health` 2xx/3xx, `https://mo.wooo.work/health` 2xx/3xx
Guardrails	Docker/systemd textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0`
Schedules	cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success `< 25h`
Backlog	ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks

If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.

12.2 AI Auto-Remediation Gate

AI auto-repair can move from observe-only to limited execution only after:

Prometheus rules are loaded.
docker/systemd textfile exporter files are fresh.
blackbox probes have stable results.
cron/CronJob schedule checks are green.
AWOOOI API /api/v1/health passes.
Alertmanager E2E webhook passes.
Redis/KM/playbook health is available.
No active restart storm.
Host load/core remains below 1.0 for 15 minutes.

Until then:

diagnose only
notify only
require human approval for remediation
no DB/ClickHouse/Harbor/Sentry destructive action
no generic restart action against stateful services
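
The gate can be expressed as an all-or-nothing checklist. In this sketch the function name `ai_mode` and the `name=ok|fail` input format are hypothetical; how the real checks are collected is not shown in this SOP.

```shell
# Sketch only: read "check_name=ok" or "check_name=fail" lines on stdin.
# Prints "limited-execution" only when at least one check was seen and all
# are ok; otherwise prints "observe-only" plus the failing check names.
ai_mode() {
  awk -F= '
    NF < 2 { next }                        # ignore malformed lines
    { seen++ }
    $2 != "ok" { failed = failed " " $1 }
    END {
      if (seen > 0 && failed == "") { print "limited-execution"; exit }
      msg = "observe-only"
      if (failed != "") msg = msg " failing:" failed
      print msg
    }'
}
```

For example, `printf 'api_health=ok\nredis_km=fail\n' | ai_mode` prints `observe-only failing: redis_km`.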

13. One-Command Readiness Script

13.1 Single Pass

Run this when you want one read-only snapshot:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh

The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:

P0-NETWORK
P0-188-DATA
P0-110-REGISTRY-OBSERVABILITY
P1-K3S
P2-WORKLOAD-ALERTCHAIN
P2-PUBLIC-ROUTES
P2-SCHEDULES
runner guardrail state inside P0-110-REGISTRY-OBSERVABILITY

If it prints BLOCKED, fix the first blocked gate before moving forward.

13.2 Professional Watch Mode

Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test

This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is GREEN or the last attempt remains degraded/blocked.
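
For readers who want the shape of that loop, here is a generic sketch. It is not the implementation inside full-stack-cold-start-check.sh; the function name `watch_until_green` is hypothetical, and it only assumes that the check command exits 0 when the stack is GREEN.

```shell
# Sketch only: rerun a check command until it succeeds or attempts run out.
# Usage: watch_until_green INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
watch_until_green() {
  local interval="$1" max="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      echo "GREEN after attempt $attempt"
      return 0
    fi
    echo "attempt $attempt/$max not green"
    # No sleep after the final attempt.
    [ "$attempt" -lt "$max" ] && sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1
}
```

With the values above this would be `watch_until_green 60 30 some_check`, which gives up after roughly 30 minutes.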

Use --send-alert-test for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without --send-alert-test, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.

13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

bash scripts/reboot-recovery/install-cold-start-monitor-110.sh

This installs two scripts under /home/wooo/scripts/, adds a marked user-cron block, and writes:

/home/wooo/node_exporter_textfiles/cold_start_recovery.prom
/home/wooo/reboot-recovery/cold-start-last.log

The cron path uses --monitor-read-only, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

awoooi_cold_start_monitor_up
awoooi_cold_start_pass_gates
awoooi_cold_start_warn_gates
awoooi_cold_start_blocked_gates
awoooi_cold_start_last_run_timestamp
awoooi_cold_start_last_green_timestamp
awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}

Prometheus rules in ops/monitoring/alerts-unified.yml alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
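
A sketch of how such a textfile can be written safely is shown below. It is not the installed monitor script; the function name `write_cold_start_prom` is hypothetical and it emits only a subset of the metrics listed above. The write-then-rename step matters because node-exporter may read the file at any moment.

```shell
# Sketch only: write gate counters as Prometheus textfile metrics.
# Usage: write_cold_start_prom PASS WARN BLOCKED OUTPUT_FILE
write_cold_start_prom() {
  local pass="$1" warn="$2" blocked="$3" out="$4" now tmp
  now="$(date +%s)"
  tmp="${out}.tmp.$$"
  {
    echo "awoooi_cold_start_monitor_up 1"
    echo "awoooi_cold_start_pass_gates $pass"
    echo "awoooi_cold_start_warn_gates $warn"
    echo "awoooi_cold_start_blocked_gates $blocked"
    echo "awoooi_cold_start_last_run_timestamp $now"
  } > "$tmp" && mv -- "$tmp" "$out"   # rename is atomic within one filesystem
}
```

Usage: `write_cold_start_prom 7 0 0 /home/wooo/node_exporter_textfiles/cold_start_recovery.prom`.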

13.4 Script-To-SOP Coverage Map

Script gate	SOP coverage	Blocks
`P0-NETWORK`	host reachability, ARP, SSH	every later phase
`P0-188-DATA`	PostgreSQL, Redis, momo, SignOz	K3s, AWOOOI API, momo public site
`P0-110-REGISTRY-OBSERVABILITY`	Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas	image pulls, CD, alert rules, runners
`P1-K3S`	120/121 K3s, VIP, node readiness, pod health	workload and webhook health
`P2-WORKLOAD-ALERTCHAIN`	AWOOOI API/Web, Alertmanager webhook	AI auto-remediation and alert confidence
`P2-PUBLIC-ROUTES`	external AWOOOI and momo URLs	external release
`P2-SCHEDULES`	cron, CronJobs, backups, textfile exporters, DR drill	final done criteria

13.5 Next-Reboot Operator Contract

Run the watch command above.
If it stops at BLOCKED, repair the first blocked gate and rerun watch mode.
If it stops at WARN, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
Release high-load services only after GREEN and load/core stays below 1.0 for 15 minutes.
Record the final output summary and any manual repair in docs/LOGBOOK.md.

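13.6 2026-05-29 Addendum: 188 Public Gateway and Backup Alerts

The 188 public gateway for aiops.wooo.work must no longer point at the single node 192.168.0.120:31234/31235. When 120 is unreachable, that makes the public route return 502 directly. The official baseline must go through the K3s VIP: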

location /api/ {
    proxy_pass http://192.168.0.125:32334/api/;
}

location /api/v1/ws {
    proxy_pass http://192.168.0.125:32334/api/v1/ws;
}

location / {
    proxy_pass http://192.168.0.125:32335;
}

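Changes must originate from infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 and then be converged with infra/ansible/playbooks/nginx-sync.yml. Editing only the live file on 188 without writing the change back to the Ansible baseline is prohibited.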
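
A quick drift check for this rule can be sketched as a filter over the rendered nginx configuration. The function name `direct_node_upstreams` is hypothetical. It flags only upstreams that target 120 or 121 directly, on the assumption that no legitimate site on 188 proxies straight to a K3s node.

```shell
# Sketch only: print proxy_pass lines on stdin that bypass the VIP and point
# at a single K3s node. Exit status 0 means drift was found, 1 means none.
direct_node_upstreams() {
  grep -nE 'proxy_pass[[:space:]]+https?://192\.168\.0\.(120|121):'
}
```

Usage: `ssh ollama@192.168.0.188 'sudo nginx -T 2>/dev/null' | direct_node_upstreams`. The same filter can be run against the Ansible template before syncing.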

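Backup alerting has two layers, and both are required:

ops/monitoring/alerts-unified.yml is the repo canonical copy.
On 110, the live /home/wooo/monitoring/alerts.yml and /home/wooo/monitoring/alerts-unified.canonical.yml must match; otherwise prometheus-rule-drift-guard may pull the rules back to an old version.

Always check after a reboot: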

curl -s http://127.0.0.1:9090/api/v1/rules \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'

cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom

If 120 has not recovered yet, BackupConfigCapturePartial{target="120-k3s-host-configs"} and the cold-start blocked alert are correct signals and must not be silenced. After 120 recovers, rerun:

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color

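13.7 2026-05-29 Addendum: momo PostgreSQL Index and Data Sync

mo.wooo.work cannot be judged by /health or a 200 on the home page alone. After a reboot or fsck, a PostgreSQL index problem can let the import flow appear to finish while daily_sales_snapshot is not synced to realtime_sales_monthly. Symptoms in this incident:

daily_sales_snapshot had 17,353 rows for 2026-05-01 through 2026-05-28.
realtime_sales_monthly had 0 rows for the same date range.
The momo-scheduler log showed the PostgreSQL internal error posting list tuple ... cannot be split.

Standard handling order:

# 188 / momo-db: rebuild the index only, do not delete data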
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL

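Only after the reindex may you run an idempotent backfill sync for the missing dates. The formal procedure must first check the realtime_sales_monthly row count for that date range. If it is not 0, save the query result and confirm whether to rerun the sync for the same range. Do not truncate the whole table and do not restore the whole database. After the backfill, verify at least: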

SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

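For the same date range, the row counts and the lower and upper date bounds of the two tables must match. When done, clear the momo application cache: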

docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'
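
The match rule for the two verification queries can be sketched as a comparison of their unaligned psql output. The function name `sync_verified` is hypothetical. It assumes each query is run with `psql -At`, so a result looks like `17353|2026-05-01|2026-05-28`. Treating an empty range (count 0 on both sides) as not verified is an assumption made here.

```shell
# Sketch only: compare "count|min|max" results from the two tables.
# Usage: sync_verified SNAPSHOT_RESULT MONTHLY_RESULT
sync_verified() {
  local snap="$1" monthly="$2"
  if [ -z "$snap" ] || [ "$snap" != "$monthly" ]; then
    echo "MISMATCH daily_sales_snapshot=$snap realtime_sales_monthly=$monthly"
    return 1
  fi
  case "$snap" in
    0\|*) echo "EMPTY range, nothing to verify"; return 1 ;;
  esac
  echo "MATCH $snap"
}
```

With the symptoms recorded in this addendum, `sync_verified '17353|2026-05-01|2026-05-28' '0||'` reports a mismatch and returns non-zero.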

14. Done Criteria

All must be true:

Four hosts reachable by SSH.
188 PostgreSQL and Redis healthy.
110 Harbor, Gitea, Prometheus, Alertmanager healthy.
120/121 K3s nodes Ready.
VIP 192.168.0.125 present.
AWOOOI API and Web reachable through NodePort/VIP.
Alertmanager E2E webhook succeeds.
cron/CronJob schedules are active, unsuspended, and verified.
momo daily_sales_snapshot and realtime_sales_monthly have matching row counts for the latest imported date range.
Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
High-load batch services are capped or delayed.
Runners are guarded and released last.
AI auto-remediation is not in full execution mode until all gates are green.
110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.

15. Known Drift To Fix After Recovery

These must be cleaned after the incident, not during P0:

SERVICE-ENDPOINTS.md still has old Prometheus/Alertmanager locations.
Audit older docs for direct node webhook targets; current main path should be VIP 192.168.0.125:32334.
OpenClaw 8088 vs 8089 must be live-confirmed and normalized.
188 compose paths drift between /home/ollama/* and Ansible /opt/*.
110 runner docs still mention Docker runner in places; live startup prefers host gitea-act-runner-host.service.
scripts/setup-runner-watchdog.sh conflicts with the 2026-05-05 runner watchdog disablement guardrail.
grist.wooo.work / registry.wooo.work public HTTP/HTTPS currently route to aiops.wooo.work; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.
