AWOOOI Full-Stack Cold Start SOP
Version: v1.1
Last updated: 2026-05-06 Asia/Taipei
Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
0. When To Use This
Use this SOP when any of these happen:
- 110/120/121/188 reboot unexpectedly.
- All services are abnormal after a power/network event.
- K3s is stuck activating.
- Host load remains high during startup and service health is mixed.
- Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.
The rule is simple: recover the dependency chain, not the loudest symptom.
1. Golden Startup Order
0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed
Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.
1.1 Dependency Graph
flowchart TD
network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
obs110 --> k3s
k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
highload --> ai["AI auto-remediation: limited execution"]
This is also captured in the machine-readable baseline:
ops/reboot-recovery/full-stack-cold-start-baseline.yml
The YAML baseline is the source of truth for:
- hosts, roles, and SSH users
- phase ordering
- service startup dependencies
- endpoint success codes
- schedule freshness thresholds
- stateful-service protection boundaries
- AI automation release gates
1.2 Phase Gate Logic
Each phase has the same decision rule:
| Result | Meaning | Action |
|---|---|---|
| BLOCKED | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| WARN | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| GREEN | All checks in scope passed. | Release the next phase only. |
The cold-start flow is intentionally conservative:
P0 network green
-> P0 188 data green
-> P0 110 registry/observability green
-> P1 K3s green
-> P2 workload + alert chain green
-> P2 public routes green
-> P2 schedules green
-> P3 high-load services and runners/CD
-> AI auto-remediation limited execution
The final release condition is not "containers are running". It is:
PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
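The release condition above can be sketched as a small shell helper. This is an illustration only, not the logic of the real readiness script; the function name gate_result and the choice to report WARN when nothing has passed are assumptions.

```shell
# Sketch (hypothetical helper): derive the overall result from gate counts.
# Usage: gate_result PASS WARN BLOCKED
gate_result() {
  pass=$1; warn=$2; blocked=$3
  if [ "$blocked" -gt 0 ]; then
    # Any blocked gate stops phase release.
    echo BLOCKED
  elif [ "$warn" -gt 0 ] || [ "$pass" -eq 0 ]; then
    # Assumption: zero passing checks means no evidence, so never GREEN.
    echo WARN
  else
    echo GREEN
  fi
}
```

For example, gate_result 7 0 0 prints GREEN, while gate_result 7 1 0 prints WARN and keeps runner/CD/AI full execution held back.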
2. Automation Freeze
Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.
| Item | Cold-start policy | Reason |
|---|---|---|
| Gitea/GitHub runners | Last | Build jobs can saturate 110 CPU/RAM. |
| momo-scheduler / crawlers | Last | Chrome and batch work can saturate 188. |
| Sentry/Snuba consumers | Controlled | Kafka backlog and ClickHouse merge can create temporary high load. |
| Alertmanager outbound notification | Gate | Avoid alert storms before API webhook and Telegram are verified. |
| AI auto-repair | Observe-only | Metrics, Redis, KM, and playbooks may be incomplete. |
| Stateful DB restart | Human approval | PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets. |
3. P0 Evidence And Network
Run from any machine on the same LAN:
for h in 110 120 121 188; do
ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done
arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
# nc -G sets the connect timeout on macOS; with Linux netcat use -w 3 instead.
for h in 110 120 121 188; do
nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done
Then capture reboot evidence:
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
If any host has ARP incomplete or SSH port down, stop here and fix physical/network first.
4. P0 188 Data Layer
188 is the first real service dependency because K3s datastore and AWOOOI DB depend on PostgreSQL.
4.1 Startup order
- containerd
- docker
- postgresql@14-main
- k3s_datastore.kine maintenance
- redis-server on 6380
- ollama or current AI proxy dependencies
- nginx
- Docker networks
- MinIO / OpenClaw / SignOz
- momo / litellm / batch services after load is stable
4.2 Read-only check
ssh ollama@192.168.0.188 '
hostname; date; uptime; free -h
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
pg_isready -h localhost -p 5432 || true
redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'
4.3 PostgreSQL WAL checkpoint damage
Signature:
PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...
This blocks:
- 188:5432
- K3s startup on 120/121
- AWOOOI API DB access
- Alertmanager webhook if API cannot start
Human-approved recovery command on 188:
sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
Do not run DROP, reinitialize the cluster, delete /var/lib/postgresql, or restore an old backup unless the commander explicitly approves it.
5. P0/P1 110 Registry And Observability
110 must recover Harbor/Gitea/Monitoring early, but runners last.
5.1 Startup order
- docker
- Remove Exited (128) / Exited (137) orphan containers
- Harbor harbor-log
- Harbor full stack
- Gitea
- Prometheus / Alertmanager / Grafana / exporters
- Langfuse
- SignOz
- Sentry DB layer
- Sentry web/worker/consumer layer
- Gitea host runner and actions runners
5.2 Checks
ssh wooo@192.168.0.110 '
hostname; date; uptime; free -h
systemctl is-active docker || true
curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
Harbor healthy means /v2/ returns 200 or 401. Do not treat 401 as failure.
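The 200-or-401 rule can be made explicit so that a wrapper script does not misread 401 as an outage. A minimal sketch, assuming a hypothetical helper name harbor_code_ok:

```shell
# Sketch (hypothetical helper): accept the HTTP codes that mean "Harbor registry is up".
# 401 is expected because /v2/ requires authentication.
harbor_code_ok() {
  case "$1" in
    200|401) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use on 110 (same URL as the check above):
#   code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:5000/v2/)
#   harbor_code_ok "$code" && echo HARBOR_OK || echo "HARBOR_FAIL code=$code"
```

curl reports 000 when the connection itself fails, which this helper treats as a failure.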
5.3 Runner gate
Runner may start only after all are true:
- 188 PostgreSQL ready
- 110 Harbor ready
- 110 Gitea ready
- 120/121 K3s nodes ready
- AWOOOI API health passes
- 110 load/core is below 1.0 for at least 15 minutes
- runner systemd guardrails are active: CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0
Check:
ssh wooo@192.168.0.110 '
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
echo "=== $u ==="
systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
done
'
If WatchdogUSec is not 0, apply the guardrail script manually with sudo:
sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
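The load/core condition in the runner gate can be checked with a short helper. This is a sketch under assumptions: the name load_per_core_ok is hypothetical, and using the 15-minute load average from /proc/loadavg is only a proxy for "below 1.0 for at least 15 minutes", because that figure is a damped average rather than a strict guarantee.

```shell
# Sketch (hypothetical helper): succeed when load per core is strictly below a limit.
# Usage: load_per_core_ok LOAD CORES [LIMIT]   (LIMIT defaults to 1.0)
load_per_core_ok() {
  awk -v load="$1" -v cores="$2" -v limit="${3:-1.0}" \
    'BEGIN { exit !(cores > 0 && load / cores < limit) }'
}

# Example use on a Linux host (third field of /proc/loadavg is the 15-minute average):
#   load15=$(awk '{print $3}' /proc/loadavg)
#   load_per_core_ok "$load15" "$(nproc)" && echo LOAD_OK || echo LOAD_HIGH
```

A 15-minute load of 3.2 on 4 cores gives 0.8 per core and passes; 4.0 on 4 cores is exactly 1.0 and does not.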
6. P1 120/121 K3s
K3s must wait for 188 PostgreSQL and 110 Harbor.
6.1 Startup order
- 120 k3s.service
- 121 k3s-agent.service or its live role
- CNI / kube-proxy
- Nodes Ready
- Core pods
- awoooi-prod pods
- keepalived VIP 192.168.0.125
- NodePorts 32334 and 32335
6.2 Checks
ssh wooo@192.168.0.120 '
hostname; uptime
pg_isready -h 192.168.0.188 -p 5432 || true
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
kubectl get nodes -o wide 2>/dev/null || true
kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'
ssh wooo@192.168.0.121 '
hostname; uptime
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'
If K3s is activating while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.
7. P2 AWOOOI Workloads
Run after K3s nodes are Ready:
ssh wooo@192.168.0.120 '
kubectl get deploy -n awoooi-prod
kubectl get pods -n awoooi-prod -o wide
kubectl get svc -n awoooi-prod
kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'
curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/
If pods are ImagePullBackOff, go back to 110 Harbor.
If API health fails because DB/Redis is down, go back to 188.
8. P2 Alert Chain
Current main path:
Prometheus/Alertmanager on 110
-> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
-> AWOOOI API
-> TelegramGateway
-> Telegram
Alertmanager health alone is not enough. Run E2E:
curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
-H 'Content-Type: application/json' \
-d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'
Expected: API returns success and Telegram receives the test alert.
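The test payload above carries a fixed startsAt value. A minimal sketch that generates the same payload with the current UTC time is shown below; the function name alert_test_payload is hypothetical, and which HTTP status the API returns on success is not stated in this SOP, so the example only prints the code.

```shell
# Sketch (hypothetical helper): emit the cold-start test alert with a fresh startsAt.
alert_test_payload() {
  starts_at=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  cat <<EOF
{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"${starts_at}","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}
EOF
}

# Example use:
#   code=$(alert_test_payload | curl -s -o /dev/null -w "%{http_code}" --max-time 8 -X POST \
#     -H 'Content-Type: application/json' --data-binary @- \
#     http://192.168.0.125:32334/api/v1/webhooks/alertmanager)
#   echo "webhook=$code"
```

Telegram delivery still has to be confirmed by a person; the HTTP response alone does not prove the last hop.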
9. P2 Schedules And Delayed Work
Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.
| Host / Layer | Required check | Success baseline |
|---|---|---|
| 188 cron | systemctl is-active cron and crontab -l | cron active; backup, restart exporter, stats exporter entries present |
| 188 backup-from-110 | backup_110_last_success_timestamp in textfile/Prometheus | last success age < 25h |
| 188 momo-scheduler | docker inspect momo-scheduler and docker logs --since 6h momo-scheduler | container running healthy; log shows 全部排程任務已註冊 ("all scheduled tasks registered"); Google Drive auth works; dashboard URLs use container-reachable hostnames |
| 188 momo import | manual run_auto_import_task() after parser changes | selected sheet is 即時業績明細 (real-time sales detail); imported date range has matching rows in daily_sales_snapshot and realtime_sales_monthly |
| 110 cron | systemctl is-active cron | cron active; Docker/systemd textfile exporters fresh |
| 110 startup units | systemctl --failed | zero failed units; stale momo-startup-complete and wooo-staggered-startup disabled |
| 120 K8s CronJobs | kubectl get cronjobs -n awoooi-prod | unsuspended; no failed Jobs remain after current validation |
| 121 DR drill | crontab -l | DR drill cron present unless explicitly paused |
Useful checks:
ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'
If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.
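The "last success age < 25h" baseline for the 110 backup is a plain timestamp comparison. A minimal sketch follows; the helper name timestamp_fresh is hypothetical, and it assumes the metric value is an integer Unix epoch in seconds.

```shell
# Sketch (hypothetical helper): succeed when a last-success timestamp is recent enough.
# Usage: timestamp_fresh LAST_SUCCESS_EPOCH NOW_EPOCH MAX_AGE_HOURS
timestamp_fresh() {
  age=$(( $2 - $1 ))
  # A negative age means a timestamp in the future, which is treated as not fresh.
  [ "$age" -ge 0 ] && [ "$age" -lt $(( $3 * 3600 )) ]
}

# Example use (PROM_FILE is a placeholder for the textfile that holds the metric):
#   last=$(awk '$1 == "backup_110_last_success_timestamp" {print $2}' "$PROM_FILE")
#   timestamp_fresh "$last" "$(date +%s)" 25 && echo BACKUP_FRESH || echo BACKUP_STALE
```

If the exporter writes the value in floating-point or with labels, the awk extraction in the example needs adjusting before the comparison is meaningful.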
10. P2/P3 Stateful Service Guardrails
| Tier | Examples | Automation |
|---|---|---|
| BLOCK | PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB | No automatic destructive action. Human approval only. |
| CRITICAL_HITL | Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse | Human-in-the-loop restart/repair. |
| STANDARD_HITL | API/Web/worker, OpenClaw, litellm | Restart only with evidence and blast-radius check. |
| AUTO | Stateless exporters, blackbox, nginx exporter | Auto restart allowed after verification. |
Never use generic docker restart $(docker ps -q) during cold start.
10.1 Dirty-Reboot Storage Corruption
Treat these log signatures as storage corruption, not ordinary service flakiness:
- Bad message
- Structure needs cleaning
- Unknown codec
- PANIC: could not locate a valid checkpoint record
- Kafka Malformed line in checkpoint files
- ClickHouse broken and needs manual correction
Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns Bad message or Structure needs cleaning, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.
10.2 ClickHouse Clean-Clone Recovery Pattern
Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.
1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.
Do not replace this with rm -rf store/... unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:
/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS
10.3 Kafka Checkpoint Recovery Pattern
If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:
log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint
Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.
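The "preserve and move only checkpoint files" step can be sketched as follows. The helper name and both directory arguments are hypothetical; the actual Kafka log directory and preservation path must be confirmed on the live host, and Kafka must be stopped before running anything like this.

```shell
# Sketch (hypothetical helper): move only the three checkpoint files out of the
# Kafka log directory into a preservation directory. Topic directories and log
# segments are never touched.
# Usage: preserve_kafka_checkpoints KAFKA_LOG_DIR PRESERVE_DIR
preserve_kafka_checkpoints() {
  src=$1; dest=$2
  mkdir -p "$dest" || return 1
  for f in log-start-offset-checkpoint recovery-point-offset-checkpoint replication-offset-checkpoint; do
    if [ -f "$src/$f" ]; then
      mv "$src/$f" "$dest/$f" || return 1
      echo "MOVED $f"
    else
      echo "ABSENT $f"
    fi
  done
}
```

Because the files are moved rather than deleted, the originals remain available as incident evidence.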
11. P3 High-Load Services
Only release these after P0/P1/P2 gates are green:
| Host | Service | Release condition |
|---|---|---|
| 188 | momo-scheduler / crawler | load/core < 1.0 for 15 minutes and DB healthy |
| 188 | SignOz ClickHouse | healthy and merge backlog trending down |
| 188 | litellm | /health/liveliness good and provider route verified |
| 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
| 110 | Sentry uptime-checker | Sentry web/DB healthy |
| 110 | runners | all previous gates green and load/core < 1.0 for 15 minutes |
12. Baseline And AI Auto-Remediation Gate
12.1 Stable Runtime Baseline
These are release gates after the first cold-start recovery pass:
| Area | Baseline |
|---|---|
| 188 host | PostgreSQL accepting, Redis PONG, momo /health 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers |
| 110 host | Harbor /v2/ 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
| K3s | 120/121 nodes Ready, VIP 192.168.0.125 present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
| Public routes | https://awoooi.wooo.work/api/v1/health 2xx/3xx, https://mo.wooo.work/health 2xx/3xx |
| Guardrails | Docker/systemd textfile exporters fresh, runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0 |
| Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success < 25h |
| Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |
If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.
12.2 AI Auto-Remediation Gate
AI auto-repair can move from observe-only to limited execution only after:
- Prometheus rules are loaded.
- docker/systemd textfile exporter files are fresh.
- blackbox probes have stable results.
- cron/CronJob schedule checks are green.
- AWOOOI API /api/v1/health passes.
- Alertmanager E2E webhook passes.
- Redis/KM/playbook health is available.
- No active restart storm.
- Host load/core remains below 1.0 for 15 minutes.
Until then:
- diagnose only
- notify only
- require human approval for remediation
- no DB/ClickHouse/Harbor/Sentry destructive action
- no generic restart action against stateful services
13. One-Command Readiness Script
13.1 Single Pass
Run this when you want one read-only snapshot:
bash scripts/reboot-recovery/full-stack-cold-start-check.sh
The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:
- P0-NETWORK
- P0-188-DATA
- P0-110-REGISTRY-OBSERVABILITY
- P1-K3S
- P2-WORKLOAD-ALERTCHAIN
- P2-PUBLIC-ROUTES
- P2-SCHEDULES
- runner guardrail state inside P0-110-REGISTRY-OBSERVABILITY
If it prints BLOCKED, fix the first blocked gate before moving forward.
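"Fix the first blocked gate" depends on reading the gates in phase order. The sketch below illustrates that rule on a simplified input of one "GATE RESULT" pair per line; this input format and the helper name are assumptions and do not describe the real script's output.

```shell
# Sketch (hypothetical helper): print the first BLOCKED gate from stdin.
# Input: one "GATE RESULT" pair per line, already in phase order.
# Exit status is 0 when a blocked gate was found, 1 otherwise.
first_blocked_gate() {
  awk '$2 == "BLOCKED" { print $1; found = 1; exit } END { exit !found }'
}

# Example use:
#   printf 'P0-NETWORK GREEN\nP0-188-DATA BLOCKED\nP1-K3S BLOCKED\n' | first_blocked_gate
```

In the example, P1-K3S is also blocked, but it depends on the 188 data layer, so only P0-188-DATA is reported as the place to start.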
13.2 Professional Watch Mode
Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
--watch \
--interval 60 \
--max-attempts 30 \
--send-alert-test
This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is GREEN or the last attempt remains degraded/blocked.
Use --send-alert-test for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without --send-alert-test, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
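The watch behaviour (check at an interval, stop on GREEN or after the last attempt) follows a standard bounded retry loop. The sketch below shows that pattern only; it is not the script's implementation, the name watch_until_green is hypothetical, and it assumes the wrapped command exits 0 only when the stack is green.

```shell
# Sketch (hypothetical helper): rerun a check until it succeeds or attempts run out.
# Usage: watch_until_green INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
watch_until_green() {
  interval=$1; max=$2; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      echo "GREEN after $attempt attempt(s)"
      return 0
    fi
    echo "attempt $attempt/$max not green"
    # Do not sleep after the final attempt.
    [ "$attempt" -lt "$max" ] && sleep "$interval"
    attempt=$(( attempt + 1 ))
  done
  return 1
}
```

With an interval of 60 and 30 attempts, the loop gives the stack roughly half an hour to settle before it reports failure.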
13.3 Persistent Read-Only Monitor
After recovery, host 110 should run the same gate as a node-exporter textfile monitor:
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
This installs two scripts under /home/wooo/scripts/, adds a marked user-cron block, and writes:
- /home/wooo/node_exporter_textfiles/cold_start_recovery.prom
- /home/wooo/reboot-recovery/cold-start-last.log
The cron path uses --monitor-read-only, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:
- awoooi_cold_start_monitor_up
- awoooi_cold_start_pass_gates
- awoooi_cold_start_warn_gates
- awoooi_cold_start_blocked_gates
- awoooi_cold_start_last_run_timestamp
- awoooi_cold_start_last_green_timestamp
- awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}
Prometheus rules in ops/monitoring/alerts-unified.yml alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
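For readers unfamiliar with node-exporter textfile metrics, the sketch below shows how gate counts could be written to a .prom file atomically. It is an illustration, not the installed monitor: the helper name is hypothetical and it emits only a subset of the metrics listed above.

```shell
# Sketch (hypothetical helper): write gate counts as a node-exporter textfile.
# Usage: write_cold_start_metrics OUT_FILE PASS WARN BLOCKED
write_cold_start_metrics() {
  out=$1
  # Write to a temporary name first, then rename, so the collector never
  # reads a half-written file. The temp name does not end in .prom.
  tmp="${out}.$$"
  {
    echo "awoooi_cold_start_monitor_up 1"
    echo "awoooi_cold_start_pass_gates $2"
    echo "awoooi_cold_start_warn_gates $3"
    echo "awoooi_cold_start_blocked_gates $4"
    echo "awoooi_cold_start_last_run_timestamp $(date +%s)"
  } > "$tmp" && mv "$tmp" "$out"
}
```

The rename within one directory is what makes the update atomic, which matters because the staleness alert relies on awoooi_cold_start_last_run_timestamp being readable on every scrape.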
13.4 Script-To-SOP Coverage Map
| Script gate | SOP coverage | Blocks |
|---|---|---|
| P0-NETWORK | host reachability, ARP, SSH | every later phase |
| P0-188-DATA | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| P0-110-REGISTRY-OBSERVABILITY | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| P1-K3S | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| P2-WORKLOAD-ALERTCHAIN | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| P2-PUBLIC-ROUTES | external AWOOOI and momo URLs | external release |
| P2-SCHEDULES | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |
13.5 Next-Reboot Operator Contract
- Run the watch command above.
- If it stops at BLOCKED, repair the first blocked gate and rerun watch mode.
- If it stops at WARN, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
- Release high-load services only after GREEN and load/core stays below 1.0 for 15 minutes.
- Record the final output summary and any manual repair in docs/LOGBOOK.md.
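13.6 2026-05-29 Addendum: 188 Public Gateway And Backup Alerts
The 188 public gateway for aiops.wooo.work must no longer point at the single node 192.168.0.120:31234/31235. When 120 is unreachable, that target makes the public route return 502 directly. The official baseline must go through the K3s VIP: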
location /api/ {
proxy_pass http://192.168.0.125:32334/api/;
}
location /api/v1/ws {
proxy_pass http://192.168.0.125:32334/api/v1/ws;
}
location / {
proxy_pass http://192.168.0.125:32335;
}
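Changes must originate from infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 and then be converged with infra/ansible/playbooks/nginx-sync.yml. Editing only the live file on 188 without writing the change back to the Ansible baseline is prohibited.
Backup alerting has two layers, and both are required:
- ops/monitoring/alerts-unified.yml is the repo canonical.
- On 110, the live /home/wooo/monitoring/alerts.yml and /home/wooo/monitoring/alerts-unified.canonical.yml must be identical; otherwise prometheus-rule-drift-guard may pull the rules back to an older version.
Required checks after a reboot: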
curl -s http://127.0.0.1:9090/api/v1/rules \
| python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'
cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom
If 120 has not recovered yet, BackupConfigCapturePartial{target="120-k3s-host-configs"} and the cold-start blocked alert are correct signals and must not be silenced. After 120 recovers, rerun:
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
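13.7 2026-05-29 Addendum: momo PostgreSQL Index And Data Sync
mo.wooo.work cannot be judged by /health or a 200 on the home page alone. After a reboot or fsck, a damaged PostgreSQL index can let the import flow appear to complete while daily_sales_snapshot is not synced into realtime_sales_monthly. Symptoms in this incident:
- daily_sales_snapshot already has 17,353 rows for 2026-05-01 through 2026-05-28.
- realtime_sales_monthly has 0 rows for the same date range.
- The momo-scheduler log shows the PostgreSQL internal error posting list tuple ... cannot be split.
Standard handling order:
# 188 / momo-db: rebuild the index only, do not delete data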
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL
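Only after the index rebuild may you run an idempotent backfill sync for the missing dates. The official procedure must first confirm the realtime_sales_monthly row count for that date range. If it is not 0, save the query result first and confirm whether the same-range sync should be rerun. Do not truncate the whole table and do not restore the whole database. After the backfill, verify at least: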
SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
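The row counts and the lower and upper date bounds for the same date range must match across the two tables. When done, clear the momo application cache: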
docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'
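The row-count and date-bound comparison from the two verification queries can be scripted once each result is captured as a single "count|min|max" string, which is the shape psql produces with -A -t. The helper name compare_sync_summary is hypothetical, and how the two strings are captured is left to the operator.

```shell
# Sketch (hypothetical helper): compare the two verification query results.
# Usage: compare_sync_summary SNAPSHOT_SUMMARY MONTHLY_SUMMARY
# Each argument is one "count|min|max" line.
# Exit status: 0 match, 1 mismatch, 2 a result is missing.
compare_sync_summary() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "MISSING_RESULT"
    return 2
  fi
  if [ "$1" = "$2" ]; then
    echo "SYNC_OK $1"
    return 0
  fi
  echo "SYNC_MISMATCH snapshot=$1 monthly=$2"
  return 1
}
```

With the incident figures, a snapshot summary of 17353|2026-05-01|2026-05-28 against an empty monthly table is reported as SYNC_MISMATCH, so the backfill is not considered complete.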
14. Done Criteria
All must be true:
- Four hosts reachable by SSH.
- 188 PostgreSQL and Redis healthy.
- 110 Harbor, Gitea, Prometheus, Alertmanager healthy.
- 120/121 K3s nodes Ready.
- VIP 192.168.0.125 present.
- AWOOOI API and Web reachable through NodePort/VIP.
- Alertmanager E2E webhook succeeds.
- cron/CronJob schedules are active, unsuspended, and verified.
- momo daily_sales_snapshot and realtime_sales_monthly have matching row counts for the latest imported date range.
- Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
- High-load batch services are capped or delayed.
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
15. Known Drift To Fix After Recovery
These must be cleaned after the incident, not during P0:
- SERVICE-ENDPOINTS.md still has old Prometheus/Alertmanager locations.
- Audit older docs for direct node webhook targets; current main path should be VIP 192.168.0.125:32334.
- OpenClaw 8088 vs 8089 must be live-confirmed and normalized.
- 188 compose paths drift between /home/ollama/* and Ansible /opt/*.
- 110 runner docs still mention Docker runner in places; live startup prefers host gitea-act-runner-host.service.
- scripts/setup-runner-watchdog.sh conflicts with the 2026-05-05 runner watchdog disablement guardrail.
- grist.wooo.work / registry.wooo.work public HTTP/HTTPS currently route to aiops.wooo.work; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.