awoooi/docs/runbooks/FULL-STACK-COLD-START-SOP.md
2026-05-29 12:41:34 +08:00

AWOOOI Full-Stack Cold Start SOP

Version: v1.1
Last updated: 2026-05-06 Asia/Taipei
Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.


0. When To Use This

Use this SOP when any of these happen:

  • 110/120/121/188 reboot unexpectedly.
  • All services are abnormal after a power/network event.
  • K3s is stuck activating.
  • Host load remains high during startup and service health is mixed.
  • Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.

The rule is simple: recover the dependency chain, not the loudest symptom.


1. Golden Startup Order

0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed

Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.

1.1 Dependency Graph

flowchart TD
  network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
  network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
  data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
  obs110 --> k3s
  k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
  workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
  workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
  public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
  schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
  highload --> ai["AI auto-remediation: limited execution"]

This is also captured in the machine-readable baseline:

ops/reboot-recovery/full-stack-cold-start-baseline.yml

The YAML baseline is the source of truth for:

  • hosts, roles, and SSH users
  • phase ordering
  • service startup dependencies
  • endpoint success codes
  • schedule freshness thresholds
  • stateful-service protection boundaries
  • AI automation release gates

1.2 Phase Gate Logic

Each phase has the same decision rule:

| Result | Meaning | Action |
|---|---|---|
| BLOCKED | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| WARN | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| GREEN | All checks in scope passed. | Release the next phase only. |

The cold-start flow is intentionally conservative:

P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution

The final release condition is not "containers are running". It is:

PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
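
The release condition above can be written as a small classifier. This is a minimal sketch, not the logic of the actual check script: the function name overall_result is hypothetical, and reporting CHECK_FAILED when nothing passed is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fold per-gate counts into one release decision.
# Usage: overall_result <pass_count> <warn_count> <blocked_count>
overall_result() {
  local pass=$1 warn=$2 blocked=$3
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"        # any blocked gate stops the release
  elif [ "$warn" -gt 0 ]; then
    echo "WARN"           # keep diagnosing; no runner/CD/AI full execution
  elif [ "$pass" -gt 0 ]; then
    echo "GREEN"          # PASS > 0, WARN = 0, BLOCKED = 0
  else
    echo "CHECK_FAILED"   # assumption: zero checks ran, so nothing is proven
  fi
}
```

BLOCKED is tested first so that a single blocked gate always outranks any number of passes or warnings.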

2. Automation Freeze

Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.

| Item | Cold-start policy | Reason |
|---|---|---|
| Gitea/GitHub runners | Last | Build jobs can saturate 110 CPU/RAM. |
| momo-scheduler / crawlers | Last | Chrome and batch work can saturate 188. |
| Sentry/Snuba consumers | Controlled | Kafka backlog and ClickHouse merge can create temporary high load. |
| Alertmanager outbound notification | Gate | Avoid alert storms before API webhook and Telegram are verified. |
| AI auto-repair | Observe-only | Metrics, Redis, KM, and playbooks may be incomplete. |
| Stateful DB restart | Human approval | PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets. |

3. P0 Evidence And Network

Run from any machine on the same LAN:

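# Flags below match macOS: `nc -G` is the macOS connect-timeout flag (on Linux use `nc -w 3`),
# and `ping -W` is in milliseconds on macOS but in seconds on Linux.
for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done

arp -an | grep -E '192\.168\.0\.(110|120|121|188)'

for h in 110 120 121 188; do
  nc -G 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done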

Then capture reboot evidence:

ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'

If any host has ARP incomplete or SSH port down, stop here and fix physical/network first.
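
To make the "ARP incomplete" stop condition mechanical, the ARP table can be filtered for unresolved entries. This is a sketch under assumptions: the helper name arp_incomplete_hosts is hypothetical, and it assumes `arp -an` marks unresolved entries with the word "incomplete" (as `<incomplete>` on Linux or `(incomplete)` on macOS).

```shell
# Hypothetical helper: list IPs whose ARP entry is incomplete.
# Reads `arp -an` output on stdin; each line is assumed to look like
# "? (192.168.0.120) at <incomplete> on eth0".
arp_incomplete_hosts() {
  grep -i 'incomplete' | sed -E 's/^[^(]*\(([0-9.]+)\).*$/\1/'
}
```

Example use: `arp -an | arp_incomplete_hosts`. Empty output means no entry in the table is marked incomplete.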


4. P0 188 Data Layer

188 is the first real service dependency because K3s datastore and AWOOOI DB depend on PostgreSQL.

4.1 Startup order

  1. containerd
  2. docker
  3. postgresql@14-main
  4. k3s_datastore.kine maintenance
  5. redis-server on 6380
  6. ollama or current AI proxy dependencies
  7. nginx
  8. Docker networks
  9. MinIO / OpenClaw / SignOz
  10. momo / litellm / batch services after load is stable

4.2 Read-only check

ssh ollama@192.168.0.188 '
hostname; date; uptime; free -h
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
pg_isready -h localhost -p 5432 || true
redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'

4.3 PostgreSQL WAL checkpoint damage

Signature:

PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...

This blocks:

  • 188:5432
  • K3s startup on 120/121
  • AWOOOI API DB access
  • Alertmanager webhook if API cannot start

Human-approved recovery command on 188:

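# Stop PostgreSQL before touching the data directory.
sudo systemctl stop postgresql@14-main
# Keep a full copy of the cluster so the reset can be undone.
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
# Last resort: pg_resetwal discards WAL and may lose recent transactions.
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
# Refresh the K3s kine table after recovery.
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"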

Do not run DROP, reinitialize the cluster, delete /var/lib/postgresql, or restore an old backup unless the commander explicitly approves it.


5. P0/P1 110 Registry And Observability

110 must bring up Harbor, Gitea, and monitoring early, and release runners last.

5.1 Startup order

  1. docker
  2. Remove Exited (128) / Exited (137) orphan containers
  3. Harbor harbor-log
  4. Harbor full stack
  5. Gitea
  6. Prometheus / Alertmanager / Grafana / exporters
  7. Langfuse
  8. SignOz
  9. Sentry DB layer
  10. Sentry web/worker/consumer layer
  11. Gitea host runner and actions runners

5.2 Checks

ssh wooo@192.168.0.110 '
hostname; date; uptime; free -h
systemctl is-active docker || true
curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'

Harbor healthy means /v2/ returns 200 or 401. Do not treat 401 as failure.
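
The 200-or-401 rule can be captured as an allow-list check on the status code that curl prints. A minimal sketch; the helper names code_ok and harbor_ok are hypothetical.

```shell
# Hypothetical helper: succeed when an HTTP status code is in an allow-list.
# Usage: code_ok <code> <allowed>...
code_ok() {
  local code=$1 allowed
  shift
  for allowed in "$@"; do
    [ "$code" = "$allowed" ] && return 0
  done
  return 1
}

# Harbor: 200 and 401 both mean the registry is answering.
harbor_ok() { code_ok "$1" 200 401; }
```

Example use: `harbor_ok "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:5000/v2/)" && echo HARBOR_OK`. curl reports 000 when it cannot connect, which the check rejects.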

5.3 Runner gate

Runners may start only after all of the following are true:

  • 188 PostgreSQL ready
  • 110 Harbor ready
  • 110 Gitea ready
  • 120/121 K3s nodes ready
  • AWOOOI API health passes
  • 110 load/core is below 1.0 for at least 15 minutes
  • runner systemd guardrails are active: CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0

Check:

ssh wooo@192.168.0.110 '
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
  echo "=== $u ==="
  systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
done
'

If WatchdogUSec is not 0, apply the guardrail script manually with sudo:

sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
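
The guardrail values can also be verified mechanically from the `systemctl show` output of the check above. This is a sketch under assumptions: the helper name guardrails_ok is hypothetical, and it assumes systemd renders CPUQuota=200% as CPUQuotaPerSecUSec=2s and MemoryMax=2G as MemoryMax=2147483648; confirm the rendering on the live host before relying on it.

```shell
# Hypothetical helper: verify runner guardrails from `systemctl show` output on stdin.
# Assumed rendering: CPUQuotaPerSecUSec=2s, MemoryMax=2147483648, WatchdogUSec=0.
guardrails_ok() {
  awk -F= '
    $1 == "CPUQuotaPerSecUSec" && $2 == "2s"         { cpu = 1 }
    $1 == "MemoryMax"          && $2 == "2147483648" { mem = 1 }
    $1 == "WatchdogUSec"       && $2 == "0"          { wd = 1 }
    END { exit !(cpu && mem && wd) }
  '
}
```

Example use on 110: `systemctl show "$u" -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec | guardrails_ok || echo "GUARDRAIL_DRIFT $u"`.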

6. P1 120/121 K3s

K3s must wait for 188 PostgreSQL and 110 Harbor.

6.1 Startup order

  1. 120 k3s.service
  2. 121 k3s-agent.service or its live role
  3. CNI / kube-proxy
  4. Nodes Ready
  5. Core pods
  6. awoooi-prod pods
  7. keepalived VIP 192.168.0.125
  8. NodePorts 32334 and 32335

6.2 Checks

ssh wooo@192.168.0.120 '
hostname; uptime
pg_isready -h 192.168.0.188 -p 5432 || true
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
kubectl get nodes -o wide 2>/dev/null || true
kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

ssh wooo@192.168.0.121 '
hostname; uptime
systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
ip addr show | grep 192.168.0.125 || true
'

If K3s is activating while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.
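
Once PostgreSQL is back, node readiness can be reduced to a list of offenders. A minimal sketch; the helper name not_ready_nodes is hypothetical, and it assumes the default column order of `kubectl get nodes --no-headers` (NAME, STATUS, ...).

```shell
# Hypothetical helper: print nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}
```

Example use: `kubectl get nodes --no-headers | not_ready_nodes`. Empty output means every listed node reports Ready.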


7. P2 AWOOOI Workloads

Run after K3s nodes are Ready:

ssh wooo@192.168.0.120 '
kubectl get deploy -n awoooi-prod
kubectl get pods -n awoooi-prod -o wide
kubectl get svc -n awoooi-prod
kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'

curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/

If pods are ImagePullBackOff, go back to 110 Harbor.

If API health fails because DB/Redis is down, go back to 188.


8. P2 Alert Chain

Current main path:

Prometheus/Alertmanager on 110
  -> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
  -> AWOOOI API
  -> TelegramGateway
  -> Telegram

Alertmanager health alone is not enough. Run E2E:

curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'

Expected: API returns success and Telegram receives the test alert.
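
The test payload above carries a fixed startsAt value. One possible variation, sketched here, stamps the alert with the current UTC time so that each cold start produces a fresh event; the helper name alert_payload is hypothetical and the rest of the JSON is copied from the command above.

```shell
# Hypothetical helper: emit the cold-start test alert with startsAt set to now (UTC).
alert_payload() {
  local now
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"%s","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}\n' "$now"
}
```

Example use: `curl -s -o /dev/null -w 'webhook=%{http_code}\n' -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager -H 'Content-Type: application/json' -d "$(alert_payload)"`. Printing the status code makes the API half of the expectation explicit; Telegram delivery still has to be confirmed by eye.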


9. P2 Schedules And Delayed Work

Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.

| Host / Layer | Required check | Success baseline |
|---|---|---|
| 188 cron | systemctl is-active cron and crontab -l | cron active; backup, restart exporter, stats exporter entries present |
| 188 backup-from-110 | backup_110_last_success_timestamp in textfile/Prometheus | last success age < 25h |
| 188 momo-scheduler | docker inspect momo-scheduler and docker logs --since 6h momo-scheduler | container running healthy; log shows 全部排程任務已註冊 ("all scheduled tasks registered"); Google Drive auth works; dashboard URLs use container-reachable hostnames |
| 188 momo import | manual run_auto_import_task() after parser changes | selected sheet is 即時業績明細 ("real-time sales detail"); imported date range has matching rows in daily_sales_snapshot and realtime_sales_monthly |
| 110 cron | systemctl is-active cron | cron active; Docker/systemd textfile exporters fresh |
| 110 startup units | systemctl --failed | zero failed units; stale momo-startup-complete and wooo-staggered-startup disabled |
| 120 K8s CronJobs | kubectl get cronjobs -n awoooi-prod | unsuspended; no failed Jobs remain after current validation |
| 121 DR drill | crontab -l | DR drill cron present unless explicitly paused |

Useful checks:

ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'

If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.
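
The 25h backup freshness baseline can be checked directly against the textfile metric. This is a sketch under assumptions: the helper name backup_fresh is hypothetical, and it assumes backup_110_last_success_timestamp holds a Unix timestamp in seconds.

```shell
# Hypothetical helper: succeed when the last backup success is younger than 25h.
# Reads Prometheus textfile content on stdin.
# Usage: backup_fresh <now_epoch_seconds> < file.prom
backup_fresh() {
  awk -v now="$1" -v max_age=90000 '
    index($1, "backup_110_last_success_timestamp") == 1 {
      found = 1
      if (now - $NF < max_age) fresh = 1   # 25h = 90000 seconds
    }
    END { exit !(found && fresh) }
  '
}
```

Example use on 188: `cat /home/ollama/node_exporter_textfiles/*.prom | backup_fresh "$(date +%s)" || echo BACKUP_STALE`. A missing metric fails the check, the same as a stale one.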


10. P2/P3 Stateful Service Guardrails

| Tier | Examples | Automation |
|---|---|---|
| BLOCK | PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB | No automatic destructive action. Human approval only. |
| CRITICAL_HITL | Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse | Human-in-the-loop restart/repair. |
| STANDARD_HITL | API/Web/worker, OpenClaw, litellm | Restart only with evidence and blast-radius check. |
| AUTO | Stateless exporters, blackbox, nginx exporter | Auto restart allowed after verification. |

Never use generic docker restart $(docker ps -q) during cold start.

10.1 Dirty-Reboot Storage Corruption

Treat these log signatures as storage corruption, not ordinary service flakiness:

  • Bad message
  • Structure needs cleaning
  • Unknown codec
  • PANIC: could not locate a valid checkpoint record
  • Kafka Malformed line in checkpoint files
  • ClickHouse broken and needs manual correction

Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns Bad message or Structure needs cleaning, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.
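
Evidence collection can flag these signatures without touching any data. A minimal read-only sketch; the helper name storage_corruption_seen is hypothetical, and the patterns are shortened forms of the signatures listed above so that small wording differences in real logs still match.

```shell
# Hypothetical helper: succeed when log text on stdin contains a storage-corruption signature.
# Patterns are matched as fixed strings, not regular expressions.
storage_corruption_seen() {
  grep -q -F \
    -e 'Bad message' \
    -e 'Structure needs cleaning' \
    -e 'Unknown codec' \
    -e 'could not locate a valid checkpoint record' \
    -e 'Malformed line in checkpoint' \
    -e 'broken and need'
}
```

Example use: `docker logs --since 1h <container> 2>&1 | storage_corruption_seen && echo "storage corruption: escalate, do not restart-loop"`.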

10.2 ClickHouse Clean-Clone Recovery Pattern

Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.

1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.

Do not replace this with rm -rf store/... unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:

/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS
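
Both artifact paths are easier to correlate later when they share one timestamp. A small sketch that only prints names and changes nothing on disk; the helper name incident_artifacts is hypothetical.

```shell
# Hypothetical helper: print both incident artifact paths with one shared timestamp.
# Usage: incident_artifacts <volume> <service> <component> [YYYYMMDD-HHMMSS]
incident_artifacts() {
  local stamp=${4:-$(date +%Y%m%d-%H%M%S)}
  echo "/var/lib/docker/volumes/$1/_data.corrupt-$stamp"
  echo "/var/backups/$2-$3-$stamp"
}
```

When the fourth argument is omitted, the current local time is used for both paths.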

10.3 Kafka Checkpoint Recovery Pattern

If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:

log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint

Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.
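
The move-only rule can be enforced by a helper that knows nothing except the three file names. This is a sketch under assumptions: the helper name preserve_kafka_checkpoints is hypothetical, the Kafka log directory path must be supplied by the operator, and Kafka is assumed to be stopped while it runs.

```shell
# Hypothetical helper: move only the three checkpoint files out of a Kafka log dir.
# Usage: preserve_kafka_checkpoints <kafka_log_dir> <backup_dir>
# Topic directories and segment files are never touched.
preserve_kafka_checkpoints() {
  local log_dir=$1 backup_dir=$2 f
  mkdir -p "$backup_dir" || return 1
  for f in log-start-offset-checkpoint recovery-point-offset-checkpoint replication-offset-checkpoint; do
    if [ -f "$log_dir/$f" ]; then
      mv "$log_dir/$f" "$backup_dir/$f" || return 1
    fi
  done
}
```

Files that do not exist are skipped, so the helper can be rerun safely after a partial move.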


11. P3 High-Load Services

Only release these after P0/P1/P2 gates are green:

| Host | Service | Release condition |
|---|---|---|
| 188 | momo-scheduler / crawler | load/core < 1.0 for 15 minutes and DB healthy |
| 188 | SignOz | ClickHouse healthy and merge backlog trending down |
| 188 | litellm | /health/liveliness good and provider route verified |
| 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
| 110 | Sentry uptime-checker | Sentry web/DB healthy |
| 110 | runners | all previous gates green and load/core < 1.0 for 15 minutes |
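
The load/core condition is load average divided by CPU core count. A minimal sketch of that arithmetic; the helper name load_per_core_below is hypothetical, and using the 15-minute load average as a stand-in for "for 15 minutes" is an assumption.

```shell
# Hypothetical helper: succeed when load average per CPU core is below a threshold.
# Usage: load_per_core_below <load_average> <core_count> [threshold, default 1.0]
load_per_core_below() {
  awk -v load="$1" -v cores="$2" -v max="${3:-1.0}" \
    'BEGIN { exit !(cores > 0 && load / cores < max) }'
}
```

Example use on a Linux host: `load_per_core_below "$(awk '{print $3}' /proc/loadavg)" "$(nproc)" && echo LOAD_OK`. For instance, a load of 3.2 on 4 cores gives 0.8 and passes, while 4.0 on 4 cores gives exactly 1.0 and does not.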

12. Baseline And AI Auto-Remediation Gate

12.1 Stable Runtime Baseline

These are release gates after the first cold-start recovery pass:

| Area | Baseline |
|---|---|
| 188 host | PostgreSQL accepting, Redis PONG, momo /health 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers |
| 110 host | Harbor /v2/ 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
| K3s | 120/121 nodes Ready, VIP 192.168.0.125 present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
| Public routes | https://awoooi.wooo.work/api/v1/health 2xx/3xx, https://mo.wooo.work/health 2xx/3xx |
| Guardrails | Docker/systemd textfile exporters fresh, runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0 |
| Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success < 25h |
| Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |

If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.
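
The "not increasing for two consecutive checks" backlog rule needs three samples, which give two comparisons. A minimal sketch; the helper name backlog_not_increasing is hypothetical, and how the samples are collected (merge counts, consumer lag) is left to the operator.

```shell
# Hypothetical helper: succeed when three successive samples never increase,
# i.e. the backlog did not grow across two consecutive checks.
# Usage: backlog_not_increasing <oldest> <middle> <newest>
backlog_not_increasing() {
  awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { exit !(b <= a && c <= b) }'
}
```

A flat backlog passes this check; pair it with a health check so that a stalled consumer is not mistaken for a draining one.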

12.2 AI Auto-Remediation Gate

AI auto-repair can move from observe-only to limited execution only after:

  • Prometheus rules are loaded.
  • docker/systemd textfile exporter files are fresh.
  • blackbox probes have stable results.
  • cron/CronJob schedule checks are green.
  • AWOOOI API /api/v1/health passes.
  • Alertmanager E2E webhook passes.
  • Redis/KM/playbook health is available.
  • No active restart storm.
  • Host load/core remains below 1.0 for 15 minutes.

Until then:

  • diagnose only
  • notify only
  • require human approval for remediation
  • no DB/ClickHouse/Harbor/Sentry destructive action
  • no generic restart action against stateful services

13. One-Command Readiness Script

13.1 Single Pass

Run this when you want one read-only snapshot:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh

The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:

  • P0-NETWORK
  • P0-188-DATA
  • P0-110-REGISTRY-OBSERVABILITY
  • P1-K3S
  • P2-WORKLOAD-ALERTCHAIN
  • P2-PUBLIC-ROUTES
  • P2-SCHEDULES
  • runner guardrail state inside P0-110-REGISTRY-OBSERVABILITY

If it prints BLOCKED, fix the first blocked gate before moving forward.

13.2 Professional Watch Mode

Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:

bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test

This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is GREEN or the last attempt remains degraded/blocked.
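
The bounded retry behaviour described here can be sketched as a generic loop. This is not the script's actual code: the function name watch_until_green is hypothetical, and the check command is assumed to exit 0 only when the stack is GREEN.

```shell
# Hypothetical sketch of a bounded watch loop.
# Usage: watch_until_green <interval_seconds> <max_attempts> <check_command> [args...]
watch_until_green() {
  local interval=$1 max_attempts=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "GREEN after $attempt attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    [ "$attempt" -lt "$max_attempts" ] && sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "not GREEN after $max_attempts attempt(s)"
  return 1
}
```

With an interval of 60 and 30 attempts, a loop of this shape waits at most about 29 minutes in sleeps, plus the run time of each check.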

Use --send-alert-test for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without --send-alert-test, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.

13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

bash scripts/reboot-recovery/install-cold-start-monitor-110.sh

This installs two scripts under /home/wooo/scripts/, adds a marked user-cron block, and writes:

  • /home/wooo/node_exporter_textfiles/cold_start_recovery.prom
  • /home/wooo/reboot-recovery/cold-start-last.log

The cron path uses --monitor-read-only, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

  • awoooi_cold_start_monitor_up
  • awoooi_cold_start_pass_gates
  • awoooi_cold_start_warn_gates
  • awoooi_cold_start_blocked_gates
  • awoooi_cold_start_last_run_timestamp
  • awoooi_cold_start_last_green_timestamp
  • awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}

Prometheus rules in ops/monitoring/alerts-unified.yml alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
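
For orientation, a textfile writer for some of these metrics might look like the sketch below. It is not the installed monitor's code: the function name write_cold_start_metrics is hypothetical, and awoooi_cold_start_last_green_timestamp is omitted because it needs state from earlier runs.

```shell
# Hypothetical sketch: write gate counts as Prometheus textfile metrics, atomically.
# Usage: write_cold_start_metrics <out_file> <pass> <warn> <blocked> <result>
write_cold_start_metrics() {
  local out=$1 tmp="$1.tmp.$$" now
  now=$(date +%s)
  {
    echo "awoooi_cold_start_monitor_up 1"
    echo "awoooi_cold_start_pass_gates $2"
    echo "awoooi_cold_start_warn_gates $3"
    echo "awoooi_cold_start_blocked_gates $4"
    echo "awoooi_cold_start_last_run_timestamp $now"
    echo "awoooi_cold_start_last_result{result=\"$5\"} 1"
  } > "$tmp" || return 1
  # Rename within the same directory so the exporter never reads a half-written file.
  mv "$tmp" "$out"
}
```

Writing to a temporary file and renaming it is the usual way to keep a textfile collector from scraping partial output.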

13.4 Script-To-SOP Coverage Map

| Script gate | SOP coverage | Blocks |
|---|---|---|
| P0-NETWORK | host reachability, ARP, SSH | every later phase |
| P0-188-DATA | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| P0-110-REGISTRY-OBSERVABILITY | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| P1-K3S | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| P2-WORKLOAD-ALERTCHAIN | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| P2-PUBLIC-ROUTES | external AWOOOI and momo URLs | external release |
| P2-SCHEDULES | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |

13.5 Next-Reboot Operator Contract

  1. Run the watch command above.
  2. If it stops at BLOCKED, repair the first blocked gate and rerun watch mode.
  3. If it stops at WARN, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
  4. Release high-load services only after GREEN and load/core stays below 1.0 for 15 minutes.
  5. Record the final output summary and any manual repair in docs/LOGBOOK.md.

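13.6 2026-05-29 Addendum: 188 Public Gateway and Backup Alerts

The 188 public gateway for aiops.wooo.work must no longer point at the single node 192.168.0.120:31234/31235. When 120 is unreachable, that target makes the public route return 502 directly. The official baseline must go through the K3s VIP: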

location /api/ {
    proxy_pass http://192.168.0.125:32334/api/;
}

location /api/v1/ws {
    proxy_pass http://192.168.0.125:32334/api/v1/ws;
}

location / {
    proxy_pass http://192.168.0.125:32335;
}

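Changes must originate in infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 and then be converged with infra/ansible/playbooks/nginx-sync.yml. Editing only the live file on 188 without writing the change back to the Ansible baseline is prohibited.

Backup alerting has two layers, and both are required:

  • ops/monitoring/alerts-unified.yml is the repo canonical copy.
  • On 110, the live /home/wooo/monitoring/alerts.yml and /home/wooo/monitoring/alerts-unified.canonical.yml must match; otherwise prometheus-rule-drift-guard may revert the rules to an older version.

Checks required after a reboot (the first command prints any expected rule names that are missing; an empty list means all are loaded):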

curl -s http://127.0.0.1:9090/api/v1/rules \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'

cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom

If 120 has not recovered yet, BackupConfigCapturePartial{target="120-k3s-host-configs"} and cold-start blocked are correct signals and must not be silenced. After 120 recovers, rerun:

/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color

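13.7 2026-05-29 Addendum: momo PostgreSQL Index and Data Sync

mo.wooo.work cannot be judged healthy from /health or a 200 on the home page alone. After a reboot or fsck, a PostgreSQL index problem can let the import flow appear to complete while daily_sales_snapshot is never synced to realtime_sales_monthly. Symptoms in this incident:

  • daily_sales_snapshot already has 17,353 rows for 2026-05-01 through 2026-05-28.
  • realtime_sales_monthly has 0 rows for the same date range.
  • The momo-scheduler log shows the PostgreSQL internal error posting list tuple ... cannot be split.

Standard handling order:

# 188 / momo-db: rebuild the index only, do not delete data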
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL

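Only after the index is rebuilt may an idempotent backfill sync be run for the missing dates. The formal procedure must first confirm the realtime_sales_monthly row count for that date range. If it is not 0, save the query result first and confirm whether the sync should be rerun for the same range. Do not truncate the whole table and do not restore the whole database. After the backfill, verify at least: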

SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

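The row counts and the minimum and maximum dates for the same date range must match across both tables. When done, clear the momo application cache: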

docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'

14. Done Criteria

All must be true:

  • Four hosts reachable by SSH.
  • 188 PostgreSQL and Redis healthy.
  • 110 Harbor, Gitea, Prometheus, Alertmanager healthy.
  • 120/121 K3s nodes Ready.
  • VIP 192.168.0.125 present.
  • AWOOOI API and Web reachable through NodePort/VIP.
  • Alertmanager E2E webhook succeeds.
  • cron/CronJob schedules are active, unsuspended, and verified.
  • momo daily_sales_snapshot and realtime_sales_monthly have matching row counts for the latest imported date range.
  • Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
  • High-load batch services are capped or delayed.
  • Runners are guarded and released last.
  • AI auto-remediation is not in full execution mode until all gates are green.
  • 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.

15. Known Drift To Fix After Recovery

These must be cleaned after the incident, not during P0:

  • SERVICE-ENDPOINTS.md still has old Prometheus/Alertmanager locations.
  • Audit older docs for direct node webhook targets; current main path should be VIP 192.168.0.125:32334.
  • OpenClaw 8088 vs 8089 must be live-confirmed and normalized.
  • 188 compose paths drift between /home/ollama/* and Ansible /opt/*.
  • 110 runner docs still mention Docker runner in places; live startup prefers host gitea-act-runner-host.service.
  • scripts/setup-runner-watchdog.sh conflicts with the 2026-05-05 runner watchdog disablement guardrail.
  • grist.wooo.work / registry.wooo.work public HTTP/HTTPS currently route to aiops.wooo.work; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.