docs(ops): codify full stack cold start recovery

Your Name
2026-05-06 00:07:57 +08:00
parent 2aa31c205a
commit 0315c2b510
4 changed files with 552 additions and 27 deletions


@@ -3269,3 +3269,48 @@ DATABASE_URL=postgresql+asyncpg://u:p@localhost:5432/test REDIS_URL=redis://loca
apps/api/tests/test_openclaw_alert_cloud_fallback_gate.py -q
# 15 passed
```
---
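## 2026-05-06 (Taipei): full-stack reboot cold-start SOP / baseline / watch mode
**Trigger:** After the unplanned reboots of 110 / 120 / 121 / 188 on the evening of 2026-05-05, the request was to fully document the recovery order, service dependencies, release logic, and final confirmation mechanism used this time, and to establish a standard procedure for fast recovery after the next reboot.
### Completed
| Artifact | Result |
|----------|--------|
| `docs/runbooks/FULL-STACK-COLD-START-SOP.md` | Upgraded to v1.1; added Golden Startup Order, Mermaid dependency graph, phase gate logic, script-to-SOP coverage map, and next-reboot operator contract |
| `ops/reboot-recovery/full-stack-cold-start-baseline.yml` | Added a machine-readable baseline that pins hosts, roles, startup order, endpoint codes, schedule freshness, stateful-service no-go zones, and the AI auto-remediation gate |
| `scripts/reboot-recovery/full-stack-cold-start-check.sh` | Added `--watch` / `--interval` / `--max-attempts` so the check can repeat after a reboot until `GREEN` |
### Standard release command for the next reboot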
```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
--watch \
--interval 60 \
--max-attempts 30 \
--send-alert-test
```
### Verification results
```bash
bash -n scripts/reboot-recovery/full-stack-cold-start-check.sh
# OK
ruby -e 'require "yaml"; YAML.load_file("ops/reboot-recovery/full-stack-cold-start-baseline.yml"); puts "YAML OK"'
# YAML OK
bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 1 --max-attempts 1 --send-alert-test
# PASS=50 WARN=0 BLOCKED=0
# Result: GREEN. Full stack is ready for controlled runner/CD release.
```
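### Release principles
- `BLOCKED`: stop releasing later phases and fix the first blocked gate first.
- `WARN`: do not release runner/CD/AI full execution; clear the warnings or explicitly accept them.
- `GREEN`: only means the next phase may start; high-load crawlers / Snuba / ClickHouse merges / runner/CD must still be released last.
- The stateful DB / ClickHouse / Kafka / Harbor / Sentry data layers must not be destructively repaired by AI automation.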


@@ -1,7 +1,7 @@
# AWOOOI Full-Stack Cold Start SOP
> Version: v1.0
> Last updated: 2026-05-05 Asia/Taipei
> Version: v1.1
> Last updated: 2026-05-06 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
---
@@ -38,6 +38,71 @@ The rule is simple: **recover the dependency chain, not the loudest symptom.**
Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.
### 1.1 Dependency Graph
```mermaid
flowchart TD
network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
obs110 --> k3s
k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
highload --> ai["AI auto-remediation: limited execution"]
```
This is also captured in the machine-readable baseline:
```text
ops/reboot-recovery/full-stack-cold-start-baseline.yml
```
The YAML baseline is the source of truth for:
- hosts, roles, and SSH users
- phase ordering
- service startup dependencies
- endpoint success codes
- schedule freshness thresholds
- stateful-service protection boundaries
- AI automation release gates
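As an illustration of how the endpoint success codes might be applied, here is a minimal sketch of an allow-list check. The helper name `code_allowed` is hypothetical and is not taken from the check script; the Harbor codes (200, 401) and the `2xx`/`3xx` classes come from the baseline.

```bash
# Hypothetical helper: succeed when an HTTP status code is in an allow-list.
# Usage: code_allowed CODE ALLOWED...
# Each ALLOWED entry is an exact code ("401") or a class pattern ("2xx").
code_allowed() {
  local code="$1" allowed
  shift
  for allowed in "$@"; do
    case "$allowed" in
      [1-5]xx)
        # Class pattern: same leading digit followed by exactly two digits.
        case "$code" in
          "${allowed%xx}"[0-9][0-9]) return 0 ;;
        esac
        ;;
      *)
        if [ "$code" = "$allowed" ]; then
          return 0
        fi
        ;;
    esac
  done
  return 1
}

# Possible use on host 110 (commented out so the sketch has no side effects):
# code="$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:5000/v2/)"
# code_allowed "$code" 200 401 && echo "Harbor gate passed"
```

Keeping the accepted codes as arguments means the same helper can serve both the exact-code gates and the `2xx`/`3xx` workload gates.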
### 1.2 Phase Gate Logic
Each phase has the same decision rule:
| Result | Meaning | Action |
|--------|---------|--------|
| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| `WARN` | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| `GREEN` | All checks in scope passed. | Release the next phase only. |
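The precedence in the table can be sketched as a small function: any blocked gate wins over warnings, and warnings win over green. The name `gate_result` is hypothetical; the ordering matches the script's `summary` function, which tests the blocked counter first and the warning counter second (and labels the warning case `DEGRADED` in its output).

```bash
# Hypothetical helper: map warning and blocked counts to a gate result.
# Usage: gate_result WARN_COUNT BLOCKED_COUNT
gate_result() {
  local warn="$1" blocked="$2"
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"   # a required dependency is down
  elif [ "$warn" -gt 0 ]; then
    echo "WARN"      # core passed, confidence incomplete
  else
    echo "GREEN"     # everything in scope passed
  fi
}
```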
The cold-start flow is intentionally conservative:
```text
P0 network green
-> P0 188 data green
-> P0 110 registry/observability green
-> P1 K3s green
-> P2 workload + alert chain green
-> P2 public routes green
-> P2 schedules green
-> P3 high-load services and runners/CD
-> AI auto-remediation limited execution
```
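To make the ordering concrete, the following is a conceptual sketch of "release each phase only after the previous one is green". It is not how the check script works (the script runs every check and then summarizes); `release_in_order` and the phase function names passed to it are hypothetical.

```bash
# Hypothetical sketch: run phase checks in order and stop at the first
# one that is not green. Each argument names a function or command that
# returns 0 when its phase is green.
release_in_order() {
  local phase
  for phase in "$@"; do
    if ! "$phase"; then
      echo "stopped at: $phase"
      return 1
    fi
    echo "green: $phase"
  done
  echo "all phases green"
}
```

Called as `release_in_order check_net check_data check_k3s` (placeholder names), it never evaluates a later phase once an earlier one fails, which is the conservative behavior the flow describes.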
The final release condition is not "containers are running". It is:
```text
PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
```
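A wrapper that wants to verify this condition from captured output could parse the counter line. The sketch below assumes the `PASS=<n> WARN=<n> BLOCKED=<n>` format that the script's `summary` function prints; the helper name `is_release_ready` is hypothetical.

```bash
# Hypothetical helper: succeed only when a summary line satisfies
# PASS > 0, WARN = 0, BLOCKED = 0. Any other format is rejected.
is_release_ready() {
  printf '%s\n' "$1" | awk '
    /^PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+$/ {
      split($1, p, "="); split($2, w, "="); split($3, b, "=")
      if (p[2] + 0 > 0 && w[2] + 0 == 0 && b[2] + 0 == 0) ok = 1
    }
    END { exit (ok ? 0 : 1) }'
}
```

Requiring `PASS > 0` guards against a run in which no checks executed at all, which would otherwise look clean.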
---
## 2. Automation Freeze
@@ -443,26 +508,63 @@ Until then:
## 13. One-Command Readiness Script
Run:
### 13.1 Single Pass
Run this when you want one read-only snapshot:
```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh
```
The script is read-only. It reports gates:
The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:
- `P0-NETWORK`
- `P0-188-DATA`
- `P0-110-REGISTRY`
- `P0-110-REGISTRY-OBSERVABILITY`
- `P1-K3S`
- `P2-WORKLOAD`
- `P2-ALERTCHAIN`
- `P2-WORKLOAD-ALERTCHAIN`
- `P2-PUBLIC-ROUTES`
- `P2-SCHEDULES`
- runner guardrail state inside `P0-110-REGISTRY-OBSERVABILITY`
If it prints `BLOCKED`, fix the first blocked gate before moving forward.
### 13.2 Professional Watch Mode
Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:
```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
--watch \
--interval 60 \
--max-attempts 30 \
--send-alert-test
```
This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is `GREEN` or the last attempt remains degraded/blocked.
Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
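Going by the `summary` function and argument parser in this commit, the script exits 0 for `GREEN`, 1 for `DEGRADED` (warnings remain), 2 for `BLOCKED`, and 64 for a usage error. A caller could map that status to the operator's next step roughly as follows; `next_action` is a hypothetical helper, not part of the script.

```bash
# Hypothetical helper: translate the check script's exit status into
# the operator's next step.
next_action() {
  case "$1" in
    0)  echo "GREEN: release the next phase" ;;
    1)  echo "DEGRADED: clear or explicitly accept each warning before release" ;;
    2)  echo "BLOCKED: repair the first blocked gate, then rerun watch mode" ;;
    64) echo "USAGE: fix the command-line arguments" ;;
    *)  echo "UNKNOWN: exit status $1 is not documented" ;;
  esac
}

# Possible use (commented out so the sketch has no side effects):
# bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
#   --watch --interval 60 --max-attempts 30 --send-alert-test
# next_action "$?"
```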
### 13.3 Script-To-SOP Coverage Map
| Script gate | SOP coverage | Blocks |
|-------------|--------------|--------|
| `P0-NETWORK` | host reachability, ARP, SSH | every later phase |
| `P0-188-DATA` | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| `P0-110-REGISTRY-OBSERVABILITY` | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| `P1-K3S` | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| `P2-WORKLOAD-ALERTCHAIN` | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |
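The `P2-SCHEDULES` row depends on freshness checks for textfiles and backups. A minimal sketch of such a check, assuming freshness is judged by file modification time, could look like this. The helper name `is_fresh` is hypothetical; the thresholds in the baseline are 300 seconds for textfiles and 90000 seconds (25 hours) for backup success.

```bash
# Hypothetical helper: succeed when FILE exists and was modified less
# than MAX_AGE seconds ago.
# Usage: is_fresh FILE MAX_AGE
is_fresh() {
  local file="$1" max_age="$2" now mtime
  [ -e "$file" ] || return 1
  now="$(date +%s)"
  # GNU stat uses -c %Y; BSD/macOS stat uses -f %m. Try GNU first.
  mtime="$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")"
  [ $((now - mtime)) -lt "$max_age" ]
}
```

A missing file is treated as stale, so an exporter that never wrote its textfile fails the gate instead of passing silently.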
### 13.4 Next-Reboot Operator Contract
1. Run the watch command above.
2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
3. If it stops at `WARN`, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
4. Release high-load services only after `GREEN` and load/core stays below `1.0` for 15 minutes.
5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`.
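Step 4's load condition can be approximated with a per-core comparison. The sketch below is an assumption about how one might check it, not the script's implementation: `load_per_core_below` is a hypothetical helper, and using the 15-minute load average from `/proc/loadavg` only approximates "stays below `1.0` for 15 minutes".

```bash
# Hypothetical helper: succeed when LOAD / CORES is strictly below THRESHOLD.
# Usage: load_per_core_below LOAD CORES THRESHOLD
load_per_core_below() {
  local load="$1" cores="$2" threshold="$3"
  # awk handles the floating-point division; zero cores is rejected.
  awk -v l="$load" -v c="$cores" -v t="$threshold" \
    'BEGIN { exit !(c > 0 && l / c < t) }'
}

# Linux-only use (assumes /proc/loadavg and nproc are available):
# read -r _ _ load15 _ < /proc/loadavg
# load_per_core_below "$load15" "$(nproc)" 1.0 && echo "load gate passed"
```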
---
## 14. Done Criteria


@@ -0,0 +1,303 @@
# AWOOOI full-stack cold-start dependency baseline.
# This is the machine-readable companion to docs/runbooks/FULL-STACK-COLD-START-SOP.md.
#
# Intent:
# - document the reboot startup order and service dependency graph
# - define release gates for operators and AI automation
# - keep stateful services out of generic auto-restart loops
version: "2026-05-06"
incident_reference: "2026-05-05 full-stack reboot recovery"
scope:
  managed_hosts:
    "110":
      address: "192.168.0.110"
      ssh_user: "wooo"
      roles:
        - registry
        - git
        - observability
        - sentry
        - runners
    "120":
      address: "192.168.0.120"
      ssh_user: "wooo"
      roles:
        - k3s_server
        - keepalived_vip
        - awoooi_nodeport
    "121":
      address: "192.168.0.121"
      ssh_user: "wooo"
      roles:
        - k3s_node
        - keepalived_peer
        - dr_drill
    "188":
      address: "192.168.0.188"
      ssh_user: "ollama"
      roles:
        - postgres_datastore
        - redis
        - momo
        - signoz
        - ai_proxy
  intentionally_skipped:
    "112":
      role: "kali"
      reason: "scanner host is not required for production cold-start release"
global_policy:
  startup_rule: "Recover the dependency chain before releasing high-load work."
  runner_cd_rule: "Release runners and CD only after data, registry, K3s, workload, routes, schedules, and alert E2E gates are green."
  ai_auto_repair_rule: "Observe-only until all green gates pass and host load stays below baseline."
  destructive_state_rule: "No DROP, data directory deletion, volume recreation, pg_resetwal, fsck, or backup restore without explicit human approval."
  no_generic_restart_rule: "Never run generic docker restart against all containers during cold start."
phases:
  - id: "P0-NETWORK"
    order: 0
    start_after: []
    owns:
      - "LAN reachability"
      - "SSH reachability"
      - "ARP evidence"
    gates:
      - "ping 192.168.0.110/120/121/188 succeeds"
      - "TCP 22 open on 192.168.0.110/120/121/188"
      - "reboot evidence captured before repair"
    blocks:
      - "all other phases"
  - id: "P0-188-DATA"
    order: 1
    start_after:
      - "P0-NETWORK"
    host: "188"
    service_order:
      - "containerd"
      - "docker"
      - "postgresql@14-main"
      - "k3s_datastore.kine maintenance"
      - "redis-server"
      - "ollama or current AI proxy dependencies"
      - "nginx"
      - "Docker networks"
      - "MinIO / OpenClaw / SignOz"
      - "momo / litellm / batch services"
    gates:
      - "PostgreSQL port 5432 open"
      - "pg_isready reports accepting connections"
      - "Redis replies PONG"
      - "momo health endpoint returns 200"
      - "SignOz HTTP route is reachable"
    blocks:
      - "120/121 K3s"
      - "AWOOOI API database access"
      - "Alertmanager webhook"
      - "momo public site"
  - id: "P0-110-REGISTRY-OBSERVABILITY"
    order: 2
    start_after:
      - "P0-NETWORK"
      - "P0-188-DATA"
    host: "110"
    service_order:
      - "docker"
      - "orphan Exited(128/137) cleanup if needed"
      - "Harbor log"
      - "Harbor registry stack"
      - "Gitea"
      - "Prometheus / Alertmanager / Grafana / exporters"
      - "Langfuse"
      - "SignOz or local observability companions"
      - "Sentry DB layer"
      - "Sentry web / worker / consumer layer"
      - "Gitea host runner and actions runners"
    gates:
      - "Harbor /v2/ returns 200 or 401"
      - "Gitea returns 200 or 302"
      - "Prometheus /-/ready returns 200"
      - "Alertmanager /-/healthy returns 200"
      - "Sentry HTTP returns 200, 302, or 400"
      - "runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0"
    blocks:
      - "K3s image pulls"
      - "runtime CD"
      - "alert rules deploy"
      - "code-review runners"
  - id: "P1-K3S"
    order: 3
    start_after:
      - "P0-188-DATA"
      - "P0-110-REGISTRY-OBSERVABILITY"
    hosts:
      - "120"
      - "121"
    service_order:
      - "120 k3s.service"
      - "121 k3s-agent.service or live role"
      - "CNI / kube-proxy"
      - "nodes Ready"
      - "core pods"
      - "awoooi-prod pods"
      - "keepalived VIP 192.168.0.125"
      - "NodePorts 32334 and 32335"
    gates:
      - "120 can reach 188:5432"
      - "K3s nodes show Ready"
      - "VIP 192.168.0.125 is present"
      - "awoooi-prod pods are Running or Completed"
    blocks:
      - "AWOOOI workload health"
      - "public AWOOOI route"
      - "Alertmanager webhook"
  - id: "P2-WORKLOAD-ALERTCHAIN"
    order: 4
    start_after:
      - "P1-K3S"
    owners:
      - "AWOOOI API"
      - "AWOOOI Web"
      - "Alertmanager webhook"
      - "Telegram delivery"
    gates:
      - "http://192.168.0.125:32334/api/v1/health returns 2xx/3xx"
      - "http://192.168.0.125:32335/ returns 2xx/3xx"
      - "Alertmanager webhook POST returns 2xx"
      - "K8s Telegram secrets are present and non-placeholder"
    blocks:
      - "AI auto-remediation"
      - "full alert confidence"
  - id: "P2-PUBLIC-ROUTES"
    order: 5
    start_after:
      - "P2-WORKLOAD-ALERTCHAIN"
    gates:
      - "https://awoooi.wooo.work/api/v1/health returns 2xx/3xx"
      - "https://awoooi.wooo.work/ returns 2xx/3xx"
      - "https://mo.wooo.work/ returns 2xx/3xx"
      - "https://mo.wooo.work/health returns 2xx/3xx"
    blocks:
      - "external release complete"
  - id: "P2-SCHEDULES"
    order: 6
    start_after:
      - "P2-PUBLIC-ROUTES"
    gates:
      - "110/120/121/188 cron services active"
      - "188 backup-from-110 success age below 25h"
      - "188 docker restart/stats textfiles fresh"
      - "110 docker/systemd textfiles fresh"
      - "120 awoooi-prod CronJobs present and unsuspended"
      - "120 awoooi-prod has no failed Jobs"
      - "121 DR drill cron present"
    blocks:
      - "done criteria"
      - "AI auto-remediation release"
  - id: "P3-HIGH-LOAD-RELEASE"
    order: 7
    start_after:
      - "P2-SCHEDULES"
    release_last:
      - "momo-scheduler / Chrome crawlers"
      - "Sentry Snuba consumers"
      - "SignOz ClickHouse merge-heavy work"
      - "Gitea actions runners"
      - "runtime CD jobs"
    gates:
      - "all prior gates green"
      - "host load per CPU below 1.0 for 15 minutes before releasing batch/runner work"
      - "ClickHouse/Kafka/Snuba backlog decreasing for two consecutive checks if backlog exists"
baselines:
  endpoints:
    awoooi_vip_api_health: "http://192.168.0.125:32334/api/v1/health"
    awoooi_vip_web: "http://192.168.0.125:32335/"
    awoooi_public_api_health: "https://awoooi.wooo.work/api/v1/health"
    awoooi_public_web: "https://awoooi.wooo.work/"
    momo_public_web: "https://mo.wooo.work/"
    momo_public_health: "https://mo.wooo.work/health"
    harbor_registry: "http://127.0.0.1:5000/v2/"
    gitea: "http://127.0.0.1:3001/"
    prometheus_ready: "http://127.0.0.1:9090/-/ready"
    alertmanager_healthy: "http://127.0.0.1:9093/-/healthy"
    sentry: "http://127.0.0.1:9000/"
  expected_codes:
    harbor_registry:
      - 200
      - 401
    gitea:
      - 200
      - 302
    prometheus_ready:
      - 200
    alertmanager_healthy:
      - 200
    sentry:
      - 200
      - 302
      - 400
    workload_and_public:
      - "2xx"
      - "3xx"
  runner_guardrails:
    CPUQuotaPerSecUSec: "2s"
    MemoryMax: "2147483648"
    WatchdogUSec: "0"
  freshness_seconds:
    docker_textfiles: 300
    systemd_textfiles: 300
    backup_success: 90000
stateful_services:
  hard_block_auto_repair:
    - "188 PostgreSQL data directory"
    - "188 k3s_datastore"
    - "188 momo database"
    - "110 Harbor DB"
    - "110 Sentry DB"
    - "Sentry ClickHouse data"
    - "SignOz ClickHouse data"
    - "Kafka topic/log directories"
  human_in_loop_required:
    - "pg_resetwal"
    - "ClickHouse clean-clone recovery"
    - "Kafka checkpoint file quarantine"
    - "backup restore"
    - "filesystem repair"
ai_automation_gate:
  observe_only_until:
    - "P0-NETWORK green"
    - "P0-188-DATA green"
    - "P0-110-REGISTRY-OBSERVABILITY green"
    - "P1-K3S green"
    - "P2-WORKLOAD-ALERTCHAIN green"
    - "P2-PUBLIC-ROUTES green"
    - "P2-SCHEDULES green"
    - "no active restart storm"
    - "host load per CPU below 1.0 for 15 minutes"
  allowed_before_green:
    - "diagnose"
    - "collect evidence"
    - "notify"
  blocked_before_green:
    - "stateful restart"
    - "destructive repair"
    - "runner/CD release"
    - "generic container restart"
final_confirmation:
  command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
  green_result:
    PASS: "greater than 0"
    WARN: 0
    BLOCKED: 0
    summary: "Result: GREEN"


@@ -6,26 +6,61 @@ set -uo pipefail
SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6)
SEND_ALERT_TEST=0
WATCH_MODE=0
WATCH_INTERVAL=60
WATCH_MAX_ATTEMPTS=30
for arg in "$@"; do
case "$arg" in
usage() {
cat <<'USAGE'
Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [options]
Options:
--send-alert-test POST one Alertmanager webhook test after AWOOOI API is ready.
--watch Repeat checks until all gates are GREEN or max attempts is reached.
--interval SECONDS Retry interval for --watch. Default: 60.
--max-attempts COUNT Max attempts for --watch. Default: 30. Use 0 for unlimited.
-h, --help Show this help.
Default mode is read-only and does not POST an Alertmanager test event.
Use --send-alert-test for the final release gate after AWOOOI API is expected to be ready.
USAGE
}
while [ "$#" -gt 0 ]; do
case "$1" in
--send-alert-test)
SEND_ALERT_TEST=1
;;
--watch)
WATCH_MODE=1
;;
--interval)
shift
if ! [[ "${1:-}" =~ ^[0-9]+$ ]] || [ "${1:-0}" -lt 1 ]; then
echo "--interval requires a positive integer number of seconds" >&2
exit 64
fi
WATCH_INTERVAL="$1"
;;
--max-attempts)
shift
if ! [[ "${1:-}" =~ ^[0-9]+$ ]]; then
echo "--max-attempts requires a non-negative integer" >&2
exit 64
fi
WATCH_MAX_ATTEMPTS="$1"
;;
-h|--help)
cat <<'USAGE'
Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [--send-alert-test]
Default mode is read-only and does not POST an Alertmanager test event.
Use --send-alert-test only after AWOOOI API is expected to be ready.
USAGE
usage
exit 0
;;
*)
echo "Unknown argument: $arg" >&2
echo "Unknown argument: $1" >&2
usage >&2
exit 64
;;
esac
shift
done
RED=$'\033[0;31m'
@@ -38,6 +73,12 @@ PASS=0
WARN=0
FAIL=0
reset_counters() {
PASS=0
WARN=0
FAIL=0
}
log_section() {
printf "\n%s=== %s ===%s\n" "$BLUE" "$1" "$NC"
}
@@ -104,6 +145,7 @@ print_header() {
echo "AWOOOI full-stack cold-start check"
date '+%Y-%m-%d %H:%M:%S %Z'
echo "Scope: 110 / 120 / 121 / 188. 112 Kali is intentionally skipped."
echo "Baseline: ops/reboot-recovery/full-stack-cold-start-baseline.yml"
}
check_network() {
@@ -385,21 +427,54 @@ summary() {
echo "PASS=$PASS WARN=$WARN BLOCKED=$FAIL"
if [ "$FAIL" -gt 0 ]; then
echo "Result: BLOCKED. Fix the first blocked gate before releasing runner/CD/AI auto-remediation."
exit 2
return 2
fi
if [ "$WARN" -gt 0 ]; then
echo "Result: DEGRADED. Core gates passed but warnings remain."
exit 1
return 1
fi
echo "Result: GREEN. Full stack is ready for controlled runner/CD release."
return 0
}
print_header
check_network
check_188
check_110
check_k3s
check_workload_and_alertchain
check_public_routes
check_schedules
summary
run_once() {
  reset_counters
  print_header
  check_network
  check_188
  check_110
  check_k3s
  check_workload_and_alertchain
  check_public_routes
  check_schedules
  summary
}
if [ "$WATCH_MODE" -eq 1 ]; then
  attempt=1
  while :; do
    if [ "$WATCH_MAX_ATTEMPTS" -eq 0 ]; then
      printf "\nWatch attempt %s/unlimited\n" "$attempt"
    else
      printf "\nWatch attempt %s/%s\n" "$attempt" "$WATCH_MAX_ATTEMPTS"
    fi
    run_once
    rc=$?
    if [ "$rc" -eq 0 ]; then
      exit 0
    fi
    if [ "$WATCH_MAX_ATTEMPTS" -ne 0 ] && [ "$attempt" -ge "$WATCH_MAX_ATTEMPTS" ]; then
      echo "Watch stopped before GREEN. Last result code: $rc"
      exit "$rc"
    fi
    echo "Waiting ${WATCH_INTERVAL}s before the next cold-start gate check..."
    sleep "$WATCH_INTERVAL"
    attempt=$((attempt + 1))
  done
fi
run_once
exit $?