# AWOOOI Full-Stack Cold Start SOP

> Version: v1.1
> Last updated: 2026-05-06 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

---

## 0. When To Use This

Use this SOP when any of these happen:

- 110/120/121/188 reboot unexpectedly.
- All services are abnormal after a power/network event.
- K3s is stuck `activating`.
- Host load remains high during startup and service health is mixed.
- Monitoring, alerting, CD, AI auto-repair, and Docker Compose services disagree about the real state.

The rule is simple: **recover the dependency chain, not the loudest symptom.**

---

## 1. Golden Startup Order

```text
0. Freeze automation and preserve evidence
1. Physical/network layer
2. 188 data layer
3. 110 registry/observability layer
4. 120/121 K3s layer
5. AWOOOI workload layer
6. Public routes and alert chain
7. High-load batch/consumer/crawler services
8. Runner/CD
9. AI auto-remediation
10. 112 Kali scanner, if needed
```

Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.

### 1.1 Dependency Graph

```mermaid
flowchart TD
    network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
    network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
    data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
    obs110 --> k3s
    k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
    workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
    workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
    public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
    schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
    highload --> ai["AI auto-remediation: limited execution"]
```

This is also captured in the machine-readable baseline:

```text
ops/reboot-recovery/full-stack-cold-start-baseline.yml
```

The YAML baseline is the source of truth for:

- hosts, roles, and SSH users
- phase ordering
- service startup dependencies
- endpoint success codes
- schedule freshness thresholds
- stateful-service protection boundaries
- AI automation release gates

### 1.2 Phase Gate Logic

Each phase has the same decision rule:

| Result | Meaning | Action |
|--------|---------|--------|
| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| `WARN` | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| `GREEN` | All checks in scope passed. | Release the next phase only. |

The cold-start flow is intentionally conservative:

```text
P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution
```

The final release condition is not "containers are running". It is:

```text
PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
```

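The decision rule and the final release condition above can be sketched as a small shell function. This is only an illustration: `gate_result` and its argument order are hypothetical and are not taken from the real check script, and treating a run with zero passing checks as `WARN` is an assumption of this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one run from its gate counters.
# Arguments (assumed order): pass count, warn count, blocked count.
gate_result() {
    local pass="$1" warn="$2" blocked="$3"
    if [ "$blocked" -gt 0 ]; then
        echo "BLOCKED"    # a dependency needed by later phases is down
    elif [ "$warn" -gt 0 ] || [ "$pass" -eq 0 ]; then
        echo "WARN"       # confidence incomplete, or nothing was verified (assumption)
    else
        echo "GREEN"      # PASS > 0, WARN = 0, BLOCKED = 0
    fi
}
```

For example, `gate_result 12 1 0` prints `WARN`, which means runner/CD/AI full execution stays closed even though nothing is blocked.
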
---

## 2. Automation Freeze

Cold start creates noisy metrics and partial failures. During P0/P1, keep automation in observe-only mode.

| Item | Cold-start policy | Reason |
|------|-------------------|--------|
| Gitea/GitHub runners | Last | Build jobs can saturate 110 CPU/RAM. |
| momo-scheduler / crawlers | Last | Chrome and batch work can saturate 188. |
| Sentry/Snuba consumers | Controlled | Kafka backlog and ClickHouse merge can create temporary high load. |
| Alertmanager outbound notification | Gate | Avoid alert storms before API webhook and Telegram are verified. |
| AI auto-repair | Observe-only | Metrics, Redis, KM, and playbooks may be incomplete. |
| Stateful DB restart | Human approval | PostgreSQL, Redis, ClickHouse, Harbor DB, Sentry DB are not generic restart targets. |

---

## 3. P0 Evidence And Network

Run from any machine on the same LAN:

```bash
for h in 110 120 121 188; do
  ping -c 2 -W 2 192.168.0.$h >/dev/null && echo "PING_OK 192.168.0.$h" || echo "PING_FAIL 192.168.0.$h"
done

arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
# -w 3 is the timeout flag accepted by both OpenBSD (Linux) and macOS nc; -G exists only on macOS.
for h in 110 120 121 188; do
  nc -w 3 -z 192.168.0.$h 22 && echo "SSH_OK 192.168.0.$h" || echo "SSH_FAIL 192.168.0.$h"
done
```

Then capture reboot evidence:

```bash
ssh ollama@192.168.0.188 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.110 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.120 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
ssh wooo@192.168.0.121 'hostname; date; uptime; who -b; last -x reboot shutdown | head -20'
```

If any host has ARP `incomplete` or SSH port down, stop here and fix physical/network first.

---

## 4. P0 188 Data Layer

188 is the first real service dependency because K3s datastore and AWOOOI DB depend on PostgreSQL.

### 4.1 Startup order

1. `containerd`
2. `docker`
3. `postgresql@14-main`
4. `k3s_datastore.kine` maintenance
5. `redis-server` on `6380`
6. `ollama` or current AI proxy dependencies
7. `nginx`
8. Docker networks
9. MinIO / OpenClaw / SignOz
10. momo / litellm / batch services after load is stable

### 4.2 Read-only check

```bash
ssh ollama@192.168.0.188 '
  hostname; date; uptime; free -h
  systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx || true
  pg_isready -h localhost -p 5432 || true
  redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true
  docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" | head -120
'
```

### 4.3 PostgreSQL WAL checkpoint damage

Signature:

```text
PANIC: could not locate a valid checkpoint record
invalid primary checkpoint record
unexpected pageaddr ... in log segment ...
```

This blocks:

- `188:5432`
- K3s startup on 120/121
- AWOOOI API DB access
- Alertmanager webhook if API cannot start

Human-approved recovery command on 188:

```bash
# 1. Stop PostgreSQL and take a cold copy of the data directory before touching WAL.
sudo systemctl stop postgresql@14-main
sudo install -d -m 700 -o postgres -g postgres /var/backups/postgresql
sudo tar -C /var/lib/postgresql/14 -czf /var/backups/postgresql/14-main-before-pg-resetwal-$(date +%Y%m%d-%H%M%S).tgz main
# 2. Reset the WAL. This is a last resort and can discard recently committed transactions.
sudo -u postgres /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
# 3. Start PostgreSQL, confirm readiness, then maintain the K3s kine table.
sudo systemctl start postgresql@14-main
pg_isready -h localhost -p 5432
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
```

Do not run `DROP`, reinitialize the cluster, delete `/var/lib/postgresql`, or restore an old backup unless the commander explicitly approves it.

---

## 5. P0/P1 110 Registry And Observability

110 must recover Harbor/Gitea/Monitoring early, but runners last.

### 5.1 Startup order

1. `docker`
2. Remove `Exited (128)` / `Exited (137)` orphan containers
3. Harbor `harbor-log`
4. Harbor full stack
5. Gitea
6. Prometheus / Alertmanager / Grafana / exporters
7. Langfuse
8. SignOz
9. Sentry DB layer
10. Sentry web/worker/consumer layer
11. Gitea host runner and actions runners

### 5.2 Checks

```bash
ssh wooo@192.168.0.110 '
  hostname; date; uptime; free -h
  systemctl is-active docker || true
  curl -s -o /dev/null -w "harbor=%{http_code}\n" --max-time 5 http://127.0.0.1:5000/v2/ || true
  curl -s -o /dev/null -w "gitea=%{http_code}\n" --max-time 5 http://127.0.0.1:3001/ || true
  curl -s --max-time 5 http://127.0.0.1:9090/-/ready || true
  curl -s --max-time 5 http://127.0.0.1:9093/-/healthy || true
  curl -s -o /dev/null -w "sentry=%{http_code}\n" --max-time 10 http://127.0.0.1:9000/ || true
  docker ps --format "{{.Names}}\t{{.Status}}" | head -120
'
```

Harbor healthy means `/v2/` returns `200` or `401`. Do not treat `401` as failure.

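The "200 or 401" rule is easy to get wrong in ad hoc scripts that only accept 2xx. A minimal sketch of an allow-list check follows; `http_code_ok` and `harbor_ok` are hypothetical names introduced here, not functions from the real check script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if an HTTP status code is in an allow-list.
# Usage: http_code_ok <code> <allowed> [<allowed> ...]
http_code_ok() {
    local code="$1" allowed
    shift
    for allowed in "$@"; do
        [ "$code" = "$allowed" ] && return 0
    done
    return 1
}

# Harbor registry API: 401 means the registry answered and wants credentials.
harbor_ok() {
    http_code_ok "$1" 200 401
}

# Example on 110 (same curl flags as the check above):
#   code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:5000/v2/)
#   harbor_ok "$code" && echo "HARBOR_OK" || echo "HARBOR_FAIL $code"
```

When curl cannot connect at all it reports `000`, which this check rejects like any other code outside the list.
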
### 5.3 Runner gate

Runner may start only after all are true:

- `188 PostgreSQL` ready
- `110 Harbor` ready
- `110 Gitea` ready
- `120/121 K3s` nodes ready
- AWOOOI API health passes
- 110 load/core is below `1.0` for at least 15 minutes
- runner systemd guardrails are active: `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0`

Check (`systemctl show` reports `CPUQuota=200%` as `CPUQuotaPerSecUSec=2s` and `MemoryMax=2G` as `MemoryMax=2147483648`):

```bash
ssh wooo@192.168.0.110 '
  for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain | awk "{print \$1}"); do
    echo "=== $u ==="
    systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts
  done
'
```

If `WatchdogUSec` is not `0`, apply the guardrail script manually with sudo:

```bash
sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
```

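The load/core condition in the runner gate can be approximated with the 15-minute load average. The sketch below is an assumption about how one might check it, not the method the real check script uses; `load_per_core_ok` is a hypothetical name, and the 15-minute average is only a proxy for "below `1.0` for at least 15 minutes".

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when load average per core is below a limit.
# Arguments: load average, core count, optional limit (default 1.0).
load_per_core_ok() {
    local load="$1" cores="$2" limit="${3:-1.0}"
    # awk does the floating-point comparison; zero cores is rejected.
    awk -v l="$load" -v c="$cores" -v m="$limit" \
        'BEGIN { exit !(c > 0 && (l / c) < m) }'
}

# Example on a Linux host (third field of /proc/loadavg is the 15-minute average):
#   load_per_core_ok "$(cut -d" " -f3 /proc/loadavg)" "$(nproc)" && echo "LOAD_OK"
```

The comparison is strict, so a host sitting exactly at `1.0` per core does not pass the gate.
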
---

## 6. P1 120/121 K3s

K3s must wait for 188 PostgreSQL and 110 Harbor.

### 6.1 Startup order

1. 120 `k3s.service`
2. 121 `k3s-agent.service` or its live role
3. CNI / kube-proxy
4. Nodes Ready
5. Core pods
6. `awoooi-prod` pods
7. keepalived VIP `192.168.0.125`
8. NodePorts `32334` and `32335`

### 6.2 Checks

```bash
ssh wooo@192.168.0.120 '
  hostname; uptime
  pg_isready -h 192.168.0.188 -p 5432 || true
  systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
  kubectl get nodes -o wide 2>/dev/null || true
  kubectl get pods -A 2>/dev/null | grep -v -E "Running|Completed" || true
  kubectl get pods -n awoooi-prod -o wide 2>/dev/null || true
  ip addr show | grep 192.168.0.125 || true
'

ssh wooo@192.168.0.121 '
  hostname; uptime
  systemctl is-active k3s k3s-agent keepalived 2>/dev/null || true
  ip addr show | grep 192.168.0.125 || true
'
```

If K3s is `activating` while 188 PostgreSQL is down, fix PostgreSQL first. Restarting K3s repeatedly will not solve it.

---

## 7. P2 AWOOOI Workloads

Run after K3s nodes are Ready:

```bash
ssh wooo@192.168.0.120 '
  kubectl get deploy -n awoooi-prod
  kubectl get pods -n awoooi-prod -o wide
  kubectl get svc -n awoooi-prod
  kubectl get events -n awoooi-prod --sort-by=.lastTimestamp | tail -40
'

curl -s --max-time 8 http://192.168.0.125:32334/api/v1/health
curl -s -o /dev/null -w "web=%{http_code}\n" --max-time 8 http://192.168.0.125:32335/
```

If pods are `ImagePullBackOff`, go back to 110 Harbor.

If API health fails because DB/Redis is down, go back to 188.

---

## 8. P2 Alert Chain

Current main path:

```text
Prometheus/Alertmanager on 110
  -> http://192.168.0.125:32334/api/v1/webhooks/alertmanager
  -> AWOOOI API
  -> TelegramGateway
  -> Telegram
```

Alertmanager health alone is not enough. Run E2E:

```bash
curl -s -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"receiver":"cold-start-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartE2ETest","severity":"info"},"annotations":{"summary":"Cold start E2E test, ignore"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-test"}'
```

Expected: API returns success and Telegram receives the test alert.

---

## 9. P2 Schedules And Delayed Work

Do not mark the reboot complete until scheduled work is proven runnable. A container can be healthy while its cron path is broken.

| Host / Layer | Required check | Success baseline |
|--------------|----------------|------------------|
| 188 cron | `systemctl is-active cron` and `crontab -l` | cron active; backup, restart exporter, stats exporter entries present |
| 188 backup-from-110 | `backup_110_last_success_timestamp` in textfile/Prometheus | last success age `< 25h` |
| 188 momo-scheduler | `docker inspect momo-scheduler` and `docker logs --since 6h momo-scheduler` | container `running healthy`; log line `全部排程任務已註冊` ("all scheduled tasks registered") present; Google Drive auth works; dashboard URLs use container-reachable hostnames |
| 188 momo import | manual `run_auto_import_task()` after parser changes | selected sheet is `即時業績明細` (real-time sales detail); imported date range has matching rows in `daily_sales_snapshot` and `realtime_sales_monthly` |
| 110 cron | `systemctl is-active cron` | cron active; Docker/systemd textfile exporters fresh |
| 110 startup units | `systemctl --failed` | zero failed units; stale `momo-startup-complete` and `wooo-staggered-startup` disabled |
| 120 K8s CronJobs | `kubectl get cronjobs -n awoooi-prod` | unsuspended; no failed Jobs remain after current validation |
| 121 DR drill | `crontab -l` | DR drill cron present unless explicitly paused |

Useful checks:

```bash
ssh ollama@192.168.0.188 'systemctl is-active cron; crontab -l; ls -l /home/ollama/node_exporter_textfiles/*.prom'
ssh wooo@192.168.0.110 'systemctl --failed --no-pager; systemctl is-active cron; crontab -l'
ssh wooo@192.168.0.120 'sudo kubectl get cronjobs,jobs -n awoooi-prod'
ssh wooo@192.168.0.121 'systemctl is-active cron; crontab -l'
```

If a schedule succeeds but emits a false verification alert, fix the verification rule before releasing AI auto-remediation. False positives train operators to ignore real alarms.

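The `< 25h` backup freshness baseline in the table above reduces to a timestamp comparison. A minimal sketch, assuming the last-success value is an integer Unix epoch in seconds; `is_fresh` is a hypothetical helper, not part of the real check script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when a timestamp is younger than max_age seconds.
# Arguments: last-success epoch, current epoch, maximum age in seconds.
is_fresh() {
    local last_success="$1" now="$2" max_age="$3"
    [ $(( now - last_success )) -lt "$max_age" ]
}

# 25 hours = 90000 seconds. Example, with $ts read from the exporter textfile:
#   is_fresh "$ts" "$(date +%s)" 90000 && echo "BACKUP_FRESH" || echo "BACKUP_STALE"
```

Passing the current time as an argument keeps the function deterministic, so the same check can be replayed against a saved timestamp during a post-incident review.
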
---

## 10. P2/P3 Stateful Service Guardrails

| Tier | Examples | Automation |
|------|----------|------------|
| BLOCK | PostgreSQL data dir, ClickHouse data dir, Harbor DB, Sentry DB | No automatic destructive action. Human approval only. |
| CRITICAL_HITL | Redis, Kafka, MinIO, SignOz ClickHouse, Sentry ClickHouse | Human-in-the-loop restart/repair. |
| STANDARD_HITL | API/Web/worker, OpenClaw, litellm | Restart only with evidence and blast-radius check. |
| AUTO | Stateless exporters, blackbox, nginx exporter | Auto restart allowed after verification. |

Never use generic `docker restart $(docker ps -q)` during cold start.

### 10.1 Dirty-Reboot Storage Corruption

Treat these log signatures as storage corruption, not ordinary service flakiness:

- `Bad message`
- `Structure needs cleaning`
- `Unknown codec`
- `PANIC: could not locate a valid checkpoint record`
- Kafka `Malformed line` in checkpoint files
- ClickHouse `broken and needs manual correction`

Cold-start automation may stop a restart storm and collect evidence, but it must not delete the original data directory. If a filesystem returns `Bad message` or `Structure needs cleaning`, the real root cause is below the container layer. Online recovery can restore service from readable data, but complete historical recovery requires an offline filesystem check or backup restore.

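The signature list above can be turned into a read-only log scan. This is a sketch only: the pattern strings are copied from the list, while the variable and function names are hypothetical and the matching is a plain substring search that may need tuning against real logs.

```bash
#!/usr/bin/env bash
# Signatures copied from the list above, joined into one extended regex.
CORRUPTION_PATTERN='Bad message|Structure needs cleaning|Unknown codec|could not locate a valid checkpoint record|Malformed line|broken and needs manual correction'

# Hypothetical helper: reads log text on stdin, succeeds if any signature appears.
has_corruption_signature() {
    grep -E -q "$CORRUPTION_PATTERN"
}

# Example (read-only; <container> is a placeholder):
#   docker logs --tail 500 <container> 2>&1 | has_corruption_signature \
#     && echo "STORAGE_CORRUPTION_SUSPECTED"
```

A hit should move the service into the human-approval path rather than trigger another restart.
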
### 10.2 ClickHouse Clean-Clone Recovery Pattern

Use this pattern for Sentry ClickHouse or SignOz ClickHouse when individual corrupted parts cannot be moved because the host filesystem rejects reads.

```text
1. Stop the compose stack or at least stop dependent consumers.
2. Disable restart loops for the failing container.
3. Save logs and build an exclude list from unreadable store paths.
4. Preserve the original volume as _data.corrupt-YYYYMMDD-HHMMSS.
5. Create a clean _data clone with readable files only.
6. Add flags/force_restore_data.
7. Start ClickHouse first, then web/API, then consumers.
8. Verify HTTP, merge backlog, and restart count before releasing high-load services.
```

Do not replace this with `rm -rf store/...` unless the unreadable path is already backed up or the commander explicitly accepts data loss. The preferred incident artifact is:

```text
/var/lib/docker/volumes/<volume>/_data.corrupt-YYYYMMDD-HHMMSS
/var/backups/<service>-<component>-YYYYMMDD-HHMMSS
```

### 10.3 Kafka Checkpoint Recovery Pattern

If Kafka refuses to start with malformed checkpoint files after a dirty reboot, preserve and move only checkpoint files:

```text
log-start-offset-checkpoint
recovery-point-offset-checkpoint
replication-offset-checkpoint
```

Then start Kafka and confirm health before starting Snuba/Sentry consumers. Do not delete topic directories or Kafka logs during cold-start recovery.

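"Preserve and move only checkpoint files" can be sketched as a function that touches nothing except the three file names listed above. The function name and both directory arguments are hypothetical; the real Kafka log directory depends on the volume layout and must be confirmed on the host, and the sketch assumes Kafka is already stopped.

```bash
#!/usr/bin/env bash
# Hypothetical helper: move only the three checkpoint files out of a Kafka
# log directory into a preserved location. Topic directories, segment logs,
# and every other file are left untouched.
# Arguments: Kafka log directory, directory to preserve the files in.
preserve_kafka_checkpoints() {
    local log_dir="$1" keep_dir="$2" f
    mkdir -p "$keep_dir" || return 1
    for f in log-start-offset-checkpoint \
             recovery-point-offset-checkpoint \
             replication-offset-checkpoint; do
        if [ -f "$log_dir/$f" ]; then
            mv -- "$log_dir/$f" "$keep_dir/$f" || return 1
            echo "moved $f"
        fi
    done
}
```

Because the files are moved rather than deleted, they remain available as incident evidence and can be put back if the diagnosis turns out to be wrong.
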
---

## 11. P3 High-Load Services

Only release these after P0/P1/P2 gates are green:

| Host | Service | Release condition |
|------|---------|-------------------|
| 188 | momo-scheduler / crawler | load/core < 1.0 for 15 minutes and DB healthy |
| 188 | SignOz ClickHouse | healthy and merge backlog trending down |
| 188 | litellm | `/health/liveliness` good and provider route verified |
| 110 | Sentry Snuba consumers | ClickHouse healthy and Kafka backlog decreasing |
| 110 | Sentry uptime-checker | Sentry web/DB healthy |
| 110 | runners | all previous gates green and load/core < 1.0 for 15 minutes |

---

## 12. Baseline And AI Auto-Remediation Gate

### 12.1 Stable Runtime Baseline

These are release gates after the first cold-start recovery pass:

| Area | Baseline |
|------|----------|
| 188 host | PostgreSQL accepting, Redis PONG, momo `/health` 200, SignOz HTTP reachable, load/core < 1.0 sustained before crawlers |
| 110 host | Harbor `/v2/` 200/401, Gitea 200/302, Prometheus ready, Alertmanager healthy, Sentry HTTP 200/302/400, no ClickHouse/Kafka restart loop |
| K3s | 120/121 nodes Ready, VIP `192.168.0.125` present, AWOOOI API 2xx/3xx, Web 2xx/3xx |
| Public routes | `https://awoooi.wooo.work/api/v1/health` 2xx/3xx, `https://mo.wooo.work/health` 2xx/3xx |
| Guardrails | Docker/systemd textfile exporters fresh, runner `CPUQuota=200%`, `MemoryMax=2G`, `WatchdogUSec=0` |
| Schedules | cron active on 110/188/120/121; K8s CronJobs unsuspended; no current failed Jobs; 188 backup success `< 25h` |
| Backlog | ClickHouse merges and Kafka/Snuba lag trending down, not increasing for two consecutive checks |

If service health is green but load average remains high, check live CPU and IO before changing memory limits. High load after Sentry/Snuba or ClickHouse startup can be backlog drain; high CPU from runners/builds/crawlers is a release-order problem.

### 12.2 AI Auto-Remediation Gate

AI auto-repair can move from observe-only to limited execution only after:

- Prometheus rules are loaded.
- docker/systemd textfile exporter files are fresh.
- blackbox probes have stable results.
- cron/CronJob schedule checks are green.
- AWOOOI API `/api/v1/health` passes.
- Alertmanager E2E webhook passes.
- Redis/KM/playbook health is available.
- No active restart storm.
- Host load/core remains below `1.0` for 15 minutes.

Until then:

- diagnose only
- notify only
- require human approval for remediation
- no DB/ClickHouse/Harbor/Sentry destructive action
- no generic restart action against stateful services

---

## 13. One-Command Readiness Script

### 13.1 Single Pass

Run this when you want one read-only snapshot:

```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh
```

The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:

- `P0-NETWORK`
- `P0-188-DATA`
- `P0-110-REGISTRY-OBSERVABILITY`
- `P1-K3S`
- `P2-WORKLOAD-ALERTCHAIN`
- `P2-PUBLIC-ROUTES`
- `P2-SCHEDULES`
- runner guardrail state inside `P0-110-REGISTRY-OBSERVABILITY`

If it prints `BLOCKED`, fix the first blocked gate before moving forward.

### 13.2 Professional Watch Mode

Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:

```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test
```

This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is `GREEN` or the last attempt remains degraded/blocked.

Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.

### 13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

```bash
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
```

This installs two scripts under `/home/wooo/scripts/`, adds a marked user-cron block, and writes:

- `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom`
- `/home/wooo/reboot-recovery/cold-start-last.log`

The cron path uses `--monitor-read-only`, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

- `awoooi_cold_start_monitor_up`
- `awoooi_cold_start_pass_gates`
- `awoooi_cold_start_warn_gates`
- `awoooi_cold_start_blocked_gates`
- `awoooi_cold_start_last_run_timestamp`
- `awoooi_cold_start_last_green_timestamp`
- `awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}`

Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.

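During an incident it can help to read these metrics straight from the textfile instead of waiting for a Prometheus scrape. The sketch below assumes the file uses the standard Prometheus text exposition layout with one `name value` sample per line for unlabelled metrics; the exact contents of `cold_start_recovery.prom` are not shown in this SOP, and `read_metric` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of an unlabelled metric from a
# Prometheus textfile. Returns non-zero if the metric is not present.
# Arguments: textfile path, metric name.
read_metric() {
    local file="$1" name="$2"
    awk -v n="$name" '
        $1 == n { print $2; found = 1; exit }   # first exact name match wins
        END     { exit !found }                 # exit 1 when nothing matched
    ' "$file"
}

# Example on 110:
#   read_metric /home/wooo/node_exporter_textfiles/cold_start_recovery.prom \
#     awoooi_cold_start_blocked_gates
```

Labelled series such as `awoooi_cold_start_last_result{...}` are deliberately out of scope here because the first field would include the label set.
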
### 13.4 Script-To-SOP Coverage Map

| Script gate | SOP coverage | Blocks |
|-------------|--------------|--------|
| `P0-NETWORK` | host reachability, ARP, SSH | every later phase |
| `P0-188-DATA` | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| `P0-110-REGISTRY-OBSERVABILITY` | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| `P1-K3S` | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| `P2-WORKLOAD-ALERTCHAIN` | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |

### 13.5 Next-Reboot Operator Contract

1. Run the watch command above.
2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
3. If it stops at `WARN`, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
4. Release high-load services only after `GREEN` and load/core stays below `1.0` for 15 minutes.
5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`.

### 13.6 2026-05-29 Addendum: 188 Public Gateway And Backup Alerts

The 188 public gateway for `aiops.wooo.work` must no longer point at the single node `192.168.0.120:31234/31235`. When 120 is unreachable, that makes the public route return 502 immediately. The official baseline must go through the K3s VIP:

```nginx
location /api/ {
    proxy_pass http://192.168.0.125:32334/api/;
}

location /api/v1/ws {
    proxy_pass http://192.168.0.125:32334/api/v1/ws;
}

location / {
    proxy_pass http://192.168.0.125:32335;
}
```

Changes must originate from `infra/ansible/roles/nginx/templates/188-all-sites.conf.j2` and then be converged with `infra/ansible/playbooks/nginx-sync.yml`. Do not edit only the live file on 188 without writing the change back to the Ansible baseline.

Backup alerting has two layers, and both are required:

- `ops/monitoring/alerts-unified.yml` is the canonical copy in the repo.
- On 110, the live `/home/wooo/monitoring/alerts.yml` and `/home/wooo/monitoring/alerts-unified.canonical.yml` must match; otherwise `prometheus-rule-drift-guard` may pull the rules back to an old version.

Always check after a reboot:

```bash
# Prints the required alert rules that are missing from Prometheus; expect [].
curl -s http://127.0.0.1:9090/api/v1/rules \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'

cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom
```

If 120 has not recovered yet, `BackupConfigCapturePartial{target="120-k3s-host-configs"}` and cold-start blocked are correct signals and must not be silenced. After 120 recovers, rerun:

```bash
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
```

### 13.7 2026-05-29 Addendum: momo PostgreSQL Index And Data Sync

Do not judge `mo.wooo.work` only by `/health` or a 200 on the home page. After a reboot or fsck, a damaged PostgreSQL index can let the import flow appear to complete while `daily_sales_snapshot` is never synced into `realtime_sales_monthly`. Symptoms in this incident:

- `daily_sales_snapshot` had 17,353 rows for 2026-05-01 through 2026-05-28.
- `realtime_sales_monthly` had 0 rows for the same date range.
- The momo-scheduler log showed the PostgreSQL internal error `posting list tuple ... cannot be split`.

Standard handling order:

```bash
# 188 / momo-db: rebuild the index only, do not delete data
docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
SQL
```

Only after the index is rebuilt may you run an idempotent backfill sync for the missing dates. The official procedure is to check the `realtime_sales_monthly` row count for that date range first. If it is not 0, save the query result and confirm whether the sync should be rerun for the same range. Do not truncate the whole table, and do not restore the whole database. After the backfill, verify at least:

```sql
SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
FROM daily_sales_snapshot
WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';

SELECT count(*), min("日期"::date), max("日期"::date)
FROM realtime_sales_monthly
WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
```

The row counts and the minimum and maximum dates for the same date range must match across the two tables. Then clear the momo application cache:

```bash
docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'
```

---

## 14. Done Criteria

All must be true:

- Four hosts reachable by SSH.
- 188 PostgreSQL and Redis healthy.
- 110 Harbor, Gitea, Prometheus, Alertmanager healthy.
- 120/121 K3s nodes Ready.
- VIP `192.168.0.125` present.
- AWOOOI API and Web reachable through NodePort/VIP.
- Alertmanager E2E webhook succeeds.
- cron/CronJob schedules are active, unsuspended, and verified.
- momo `daily_sales_snapshot` and `realtime_sales_monthly` row counts match for the latest imported date range.
- Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
- High-load batch services are capped or delayed.
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.

---

## 15. Known Drift To Fix After Recovery

These must be cleaned after the incident, not during P0:

- `SERVICE-ENDPOINTS.md` still has old Prometheus/Alertmanager locations.
- Audit older docs for direct node webhook targets; current main path should be VIP `192.168.0.125:32334`.
- OpenClaw `8088` vs `8089` must be live-confirmed and normalized.
- 188 compose paths drift between `/home/ollama/*` and Ansible `/opt/*`.
- 110 runner docs still mention Docker runner in places; live startup prefers host `gitea-act-runner-host.service`.
- `scripts/setup-runner-watchdog.sh` conflicts with the 2026-05-05 runner watchdog disablement guardrail.
- `grist.wooo.work` / `registry.wooo.work` public HTTP/HTTPS currently route to `aiops.wooo.work`; their old 110 certbot renewal configs are disabled until public routing is corrected or DNS-01 renewal is configured.