diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md index 3f342400..8936a689 100644 --- a/docs/LOGBOOK.md +++ b/docs/LOGBOOK.md @@ -3269,3 +3269,48 @@ DATABASE_URL=postgresql+asyncpg://u:p@localhost:5432/test REDIS_URL=redis://loca apps/api/tests/test_openclaw_alert_cloud_fallback_gate.py -q # 15 passed ``` + +--- + +## 2026-05-06 (Taipei): Full-stack reboot cold-start SOP / baseline / watch mode + +**Trigger**: After hosts 110 / 120 / 121 / 188 rebooted unexpectedly on the evening of 2026-05-05, the request was to fully document the recovery order, service dependencies, release logic, and final confirmation mechanism used this time, and to establish a standard procedure for fast recovery after the next reboot. + +### Completed + +| Artifact | Result | +|----------|--------| +| `docs/runbooks/FULL-STACK-COLD-START-SOP.md` | Upgraded to v1.1; added the Golden Startup Order, Mermaid dependency graph, phase gate logic, script-to-SOP coverage map, and next-reboot operator contract | +| `ops/reboot-recovery/full-stack-cold-start-baseline.yml` | Added a machine-readable baseline that pins hosts, roles, startup order, endpoint codes, schedule freshness, stateful-service no-go zones, and the AI auto-remediation gate | +| `scripts/reboot-recovery/full-stack-cold-start-check.sh` | Added `--watch` / `--interval` / `--max-attempts` so the check can repeat after a reboot until `GREEN` | + +### Standard release command for the next reboot + +```bash +bash scripts/reboot-recovery/full-stack-cold-start-check.sh \ + --watch \ + --interval 60 \ + --max-attempts 30 \ + --send-alert-test +``` + +### Verification results + +```bash +bash -n scripts/reboot-recovery/full-stack-cold-start-check.sh +# OK + +ruby -e 'require "yaml"; YAML.load_file("ops/reboot-recovery/full-stack-cold-start-baseline.yml"); puts "YAML OK"' +# YAML OK + +bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 1 --max-attempts 1 --send-alert-test +# PASS=50 WARN=0 BLOCKED=0 +# Result: GREEN. Full stack is ready for controlled runner/CD release.
+``` + +### Release rules + +- `BLOCKED`: stop releasing later phases and fix the first blocked gate. +- `WARN`: do not release runner/CD/AI full execution; clear the warnings or explicitly accept them. +- `GREEN`: only means the next phase may start; high-load crawlers / Snuba / ClickHouse merges / runner/CD must still be released last. +- The stateful DB / ClickHouse / Kafka / Harbor / Sentry data layers must not be destructively repaired by AI automation. diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md index 9c04d3a2..0dbb612d 100644 --- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md +++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md @@ -1,7 +1,7 @@ # AWOOOI Full-Stack Cold Start SOP -> Version: v1.0 -> Last updated: 2026-05-05 Asia/Taipei +> Version: v1.1 +> Last updated: 2026-05-06 Asia/Taipei > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path. --- @@ -38,6 +38,71 @@ The rule is simple: **recover the dependency chain, not the loudest symptom.** Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy. +### 1.1 Dependency Graph + +```mermaid +flowchart TD + network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"] + network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"] + data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"] + obs110 --> k3s + k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"] + workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"] + workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"] + public --> schedules["Schedules: cron, CronJobs, backups, exporters"] + schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"] + highload --> ai["AI auto-remediation: limited execution"] +``` + +This is also captured in the machine-readable baseline: + +```text +ops/reboot-recovery/full-stack-cold-start-baseline.yml +``` + +The YAML baseline is the source of truth for: + +- hosts, roles, and SSH users +- phase ordering +- 
service startup dependencies +- endpoint success codes +- schedule freshness thresholds +- stateful-service protection boundaries +- AI automation release gates + +### 1.2 Phase Gate Logic + +Each phase has the same decision rule: + +| Result | Meaning | Action | +|--------|---------|--------| +| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. | +| `WARN` | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. | +| `GREEN` | All checks in scope passed. | Release the next phase only. | + +The cold-start flow is intentionally conservative: + +```text +P0 network green + -> P0 188 data green + -> P0 110 registry/observability green + -> P1 K3s green + -> P2 workload + alert chain green + -> P2 public routes green + -> P2 schedules green + -> P3 high-load services and runners/CD + -> AI auto-remediation limited execution +``` + +The final release condition is not "containers are running". It is: + +```text +PASS > 0 +WARN = 0 +BLOCKED = 0 +Result: GREEN +``` + --- ## 2. Automation Freeze @@ -443,26 +508,63 @@ Until then: ## 13. One-Command Readiness Script -Run: +### 13.1 Single Pass + +Run this when you want one read-only snapshot: ```bash bash scripts/reboot-recovery/full-stack-cold-start-check.sh ``` -The script is read-only. It reports gates: +The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates: - `P0-NETWORK` - `P0-188-DATA` -- `P0-110-REGISTRY` +- `P0-110-REGISTRY-OBSERVABILITY` - `P1-K3S` -- `P2-WORKLOAD` -- `P2-ALERTCHAIN` +- `P2-WORKLOAD-ALERTCHAIN` - `P2-PUBLIC-ROUTES` - `P2-SCHEDULES` - runner guardrail state inside `P0-110-REGISTRY-OBSERVABILITY` If it prints `BLOCKED`, fix the first blocked gate before moving forward. 
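
A wrapper can branch on the exit status instead of parsing the printed result. The sketch below assumes the exit codes introduced by this change (0 for `GREEN`, 1 for `DEGRADED` when warnings remain, 2 for `BLOCKED`, 64 for invalid arguments). `describe_exit_code` is a hypothetical helper name and is not part of the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the check script):
# map the check script's exit status to the operator action.
describe_exit_code() {
  case "$1" in
    0)  echo "GREEN: release the next phase only" ;;
    1)  echo "DEGRADED: warnings remain, do not release runner/CD/AI full execution" ;;
    2)  echo "BLOCKED: fix the first blocked gate" ;;
    64) echo "USAGE: invalid arguments" ;;
    *)  echo "UNKNOWN: unexpected exit code $1" ;;
  esac
}

# Example, commented out so the sketch itself contacts no host:
# bash scripts/reboot-recovery/full-stack-cold-start-check.sh
# describe_exit_code "$?"
```

Branching on the status keeps automation independent of the colored, human-oriented output.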
+### 13.2 Professional Watch Mode + +Run this after a full reboot when you want the machine to keep checking until the whole stack is ready: + +```bash +bash scripts/reboot-recovery/full-stack-cold-start-check.sh \ + --watch \ + --interval 60 \ + --max-attempts 30 \ + --send-alert-test +``` + +This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts. It exits with status 0 as soon as the stack is `GREEN`; otherwise it exits after the last attempt with status 1 (`DEGRADED`, warnings remain) or 2 (`BLOCKED`). + +Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete. + +### 13.3 Script-To-SOP Coverage Map + +| Script gate | SOP coverage | Blocks | +|-------------|--------------|--------| +| `P0-NETWORK` | host reachability, ARP, SSH | every later phase | +| `P0-188-DATA` | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site | +| `P0-110-REGISTRY-OBSERVABILITY` | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners | +| `P1-K3S` | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health | +| `P2-WORKLOAD-ALERTCHAIN` | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence | +| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release | +| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria | + +### 13.4 Next-Reboot Operator Contract + +1. Run the watch command above. +2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode. +3. If it stops with warnings (`WARN` above 0, printed as `Result: DEGRADED`), do not release runner/CD/AI full execution; clear or explicitly accept each warning. +4. Release high-load services only after `GREEN` and only once load per CPU core has stayed below `1.0` for 15 minutes. +5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`. 
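
The load condition in step 4 can be checked with a few lines of shell. This is a minimal sketch, assuming a Linux host where `/proc/loadavg` and `nproc` are available; `load_per_core_ok` is a hypothetical helper and not part of the check script. The 15-minute load average is a damped average, so it approximates "below `1.0` for 15 minutes" rather than proving it.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the check script):
# succeeds when load / cores is strictly below the limit (default 1.0).
# Fails when cores is 0 or negative, which also avoids a division by zero.
load_per_core_ok() {
  local load="$1" cores="$2" limit="${3:-1.0}"
  awk -v l="$load" -v c="$cores" -v m="$limit" \
    'BEGIN { exit !(c > 0 && (l / c) < m) }'
}

# Example on a Linux host, commented out to keep the sketch side-effect free.
# Field 3 of /proc/loadavg is the 15-minute load average.
# load_per_core_ok "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)" && echo "load OK"
```

Passing the load and core count as arguments keeps the comparison testable without depending on the host it runs on.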
+ --- ## 14. Done Criteria diff --git a/ops/reboot-recovery/full-stack-cold-start-baseline.yml b/ops/reboot-recovery/full-stack-cold-start-baseline.yml new file mode 100644 index 00000000..b28c0ed1 --- /dev/null +++ b/ops/reboot-recovery/full-stack-cold-start-baseline.yml @@ -0,0 +1,303 @@ +# AWOOOI full-stack cold-start dependency baseline. +# This is the machine-readable companion to docs/runbooks/FULL-STACK-COLD-START-SOP.md. +# +# Intent: +# - document the reboot startup order and service dependency graph +# - define release gates for operators and AI automation +# - keep stateful services out of generic auto-restart loops + +version: "2026-05-06" +incident_reference: "2026-05-05 full-stack reboot recovery" +scope: + managed_hosts: + "110": + address: "192.168.0.110" + ssh_user: "wooo" + roles: + - registry + - git + - observability + - sentry + - runners + "120": + address: "192.168.0.120" + ssh_user: "wooo" + roles: + - k3s_server + - keepalived_vip + - awoooi_nodeport + "121": + address: "192.168.0.121" + ssh_user: "wooo" + roles: + - k3s_node + - keepalived_peer + - dr_drill + "188": + address: "192.168.0.188" + ssh_user: "ollama" + roles: + - postgres_datastore + - redis + - momo + - signoz + - ai_proxy + intentionally_skipped: + "112": + role: "kali" + reason: "scanner host is not required for production cold-start release" + +global_policy: + startup_rule: "Recover the dependency chain before releasing high-load work." + runner_cd_rule: "Release runners and CD only after data, registry, K3s, workload, routes, schedules, and alert E2E gates are green." + ai_auto_repair_rule: "Observe-only until all green gates pass and host load stays below baseline." + destructive_state_rule: "No DROP, data directory deletion, volume recreation, pg_resetwal, fsck, or backup restore without explicit human approval." + no_generic_restart_rule: "Never run generic docker restart against all containers during cold start." 
+ +phases: + - id: "P0-NETWORK" + order: 0 + start_after: [] + owns: + - "LAN reachability" + - "SSH reachability" + - "ARP evidence" + gates: + - "ping 192.168.0.110/120/121/188 succeeds" + - "TCP 22 open on 192.168.0.110/120/121/188" + - "reboot evidence captured before repair" + blocks: + - "all other phases" + + - id: "P0-188-DATA" + order: 1 + start_after: + - "P0-NETWORK" + host: "188" + service_order: + - "containerd" + - "docker" + - "postgresql@14-main" + - "k3s_datastore.kine maintenance" + - "redis-server" + - "ollama or current AI proxy dependencies" + - "nginx" + - "Docker networks" + - "MinIO / OpenClaw / SignOz" + - "momo / litellm / batch services" + gates: + - "PostgreSQL port 5432 open" + - "pg_isready reports accepting connections" + - "Redis replies PONG" + - "momo health endpoint returns 200" + - "SignOz HTTP route is reachable" + blocks: + - "120/121 K3s" + - "AWOOOI API database access" + - "Alertmanager webhook" + - "momo public site" + + - id: "P0-110-REGISTRY-OBSERVABILITY" + order: 2 + start_after: + - "P0-NETWORK" + - "P0-188-DATA" + host: "110" + service_order: + - "docker" + - "orphan Exited(128/137) cleanup if needed" + - "Harbor log" + - "Harbor registry stack" + - "Gitea" + - "Prometheus / Alertmanager / Grafana / exporters" + - "Langfuse" + - "SignOz or local observability companions" + - "Sentry DB layer" + - "Sentry web / worker / consumer layer" + - "Gitea host runner and actions runners" + gates: + - "Harbor /v2/ returns 200 or 401" + - "Gitea returns 200 or 302" + - "Prometheus /-/ready returns 200" + - "Alertmanager /-/healthy returns 200" + - "Sentry HTTP returns 200, 302, or 400" + - "runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0" + blocks: + - "K3s image pulls" + - "runtime CD" + - "alert rules deploy" + - "code-review runners" + + - id: "P1-K3S" + order: 3 + start_after: + - "P0-188-DATA" + - "P0-110-REGISTRY-OBSERVABILITY" + hosts: + - "120" + - "121" + service_order: + - "120 k3s.service" + - "121 k3s-agent.service 
or live role" + - "CNI / kube-proxy" + - "nodes Ready" + - "core pods" + - "awoooi-prod pods" + - "keepalived VIP 192.168.0.125" + - "NodePorts 32334 and 32335" + gates: + - "120 can reach 188:5432" + - "K3s nodes show Ready" + - "VIP 192.168.0.125 is present" + - "awoooi-prod pods are Running or Completed" + blocks: + - "AWOOOI workload health" + - "public AWOOOI route" + - "Alertmanager webhook" + + - id: "P2-WORKLOAD-ALERTCHAIN" + order: 4 + start_after: + - "P1-K3S" + owns: + - "AWOOOI API" + - "AWOOOI Web" + - "Alertmanager webhook" + - "Telegram delivery" + gates: + - "http://192.168.0.125:32334/api/v1/health returns 2xx/3xx" + - "http://192.168.0.125:32335/ returns 2xx/3xx" + - "Alertmanager webhook POST returns 2xx" + - "K8s Telegram secrets are present and non-placeholder" + blocks: + - "AI auto-remediation" + - "full alert confidence" + + - id: "P2-PUBLIC-ROUTES" + order: 5 + start_after: + - "P2-WORKLOAD-ALERTCHAIN" + gates: + - "https://awoooi.wooo.work/api/v1/health returns 2xx/3xx" + - "https://awoooi.wooo.work/ returns 2xx/3xx" + - "https://mo.wooo.work/ returns 2xx/3xx" + - "https://mo.wooo.work/health returns 2xx/3xx" + blocks: + - "external release complete" + + - id: "P2-SCHEDULES" + order: 6 + start_after: + - "P2-PUBLIC-ROUTES" + gates: + - "110/120/121/188 cron services active" + - "188 backup-from-110 success age below 25h" + - "188 docker restart/stats textfiles fresh" + - "110 docker/systemd textfiles fresh" + - "120 awoooi-prod CronJobs present and unsuspended" + - "120 awoooi-prod has no failed Jobs" + - "121 DR drill cron present" + blocks: + - "done criteria" + - "AI auto-remediation release" + + - id: "P3-HIGH-LOAD-RELEASE" + order: 7 + start_after: + - "P2-SCHEDULES" + release_last: + - "momo-scheduler / Chrome crawlers" + - "Sentry Snuba consumers" + - "SignOz ClickHouse merge-heavy work" + - "Gitea actions runners" + - "runtime CD jobs" + gates: + - "all prior gates green" + - "host load per CPU below 1.0 for 15 minutes before 
releasing batch/runner work" + - "ClickHouse/Kafka/Snuba backlog decreasing for two consecutive checks if backlog exists" + +baselines: + endpoints: + awoooi_vip_api_health: "http://192.168.0.125:32334/api/v1/health" + awoooi_vip_web: "http://192.168.0.125:32335/" + awoooi_public_api_health: "https://awoooi.wooo.work/api/v1/health" + awoooi_public_web: "https://awoooi.wooo.work/" + momo_public_web: "https://mo.wooo.work/" + momo_public_health: "https://mo.wooo.work/health" + harbor_registry: "http://127.0.0.1:5000/v2/" + gitea: "http://127.0.0.1:3001/" + prometheus_ready: "http://127.0.0.1:9090/-/ready" + alertmanager_healthy: "http://127.0.0.1:9093/-/healthy" + sentry: "http://127.0.0.1:9000/" + expected_codes: + harbor_registry: + - 200 + - 401 + gitea: + - 200 + - 302 + prometheus_ready: + - 200 + alertmanager_healthy: + - 200 + sentry: + - 200 + - 302 + - 400 + workload_and_public: + - "2xx" + - "3xx" + runner_guardrails: + CPUQuotaPerSecUSec: "2s" + MemoryMax: "2147483648" + WatchdogUSec: "0" + freshness_seconds: + docker_textfiles: 300 + systemd_textfiles: 300 + backup_success: 90000 + +stateful_services: + hard_block_auto_repair: + - "188 PostgreSQL data directory" + - "188 k3s_datastore" + - "188 momo database" + - "110 Harbor DB" + - "110 Sentry DB" + - "Sentry ClickHouse data" + - "SignOz ClickHouse data" + - "Kafka topic/log directories" + human_in_loop_required: + - "pg_resetwal" + - "ClickHouse clean-clone recovery" + - "Kafka checkpoint file quarantine" + - "backup restore" + - "filesystem repair" + +ai_automation_gate: + observe_only_until: + - "P0-NETWORK green" + - "P0-188-DATA green" + - "P0-110-REGISTRY-OBSERVABILITY green" + - "P1-K3S green" + - "P2-WORKLOAD-ALERTCHAIN green" + - "P2-PUBLIC-ROUTES green" + - "P2-SCHEDULES green" + - "no active restart storm" + - "host load per CPU below 1.0 for 15 minutes" + allowed_before_green: + - "diagnose" + - "collect evidence" + - "notify" + blocked_before_green: + - "stateful restart" + - "destructive 
repair" + - "runner/CD release" + - "generic container restart" + +final_confirmation: + command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test" + green_result: + PASS: "greater than 0" + WARN: 0 + BLOCKED: 0 + summary: "Result: GREEN" diff --git a/scripts/reboot-recovery/full-stack-cold-start-check.sh b/scripts/reboot-recovery/full-stack-cold-start-check.sh index 42aaa78d..d45df327 100755 --- a/scripts/reboot-recovery/full-stack-cold-start-check.sh +++ b/scripts/reboot-recovery/full-stack-cold-start-check.sh @@ -6,26 +6,61 @@ set -uo pipefail SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6) SEND_ALERT_TEST=0 +WATCH_MODE=0 +WATCH_INTERVAL=60 +WATCH_MAX_ATTEMPTS=30 -for arg in "$@"; do - case "$arg" in +usage() { + cat <<'USAGE' +Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [options] + +Options: + --send-alert-test POST one Alertmanager webhook test after AWOOOI API is ready. + --watch Repeat checks until all gates are GREEN or max attempts is reached. + --interval SECONDS Retry interval for --watch. Default: 60. + --max-attempts COUNT Max attempts for --watch. Default: 30. Use 0 for unlimited. + -h, --help Show this help. + +Default mode is read-only and does not POST an Alertmanager test event. +Use --send-alert-test for the final release gate after AWOOOI API is expected to be ready. +USAGE +} + +while [ "$#" -gt 0 ]; do + case "$1" in --send-alert-test) SEND_ALERT_TEST=1 ;; + --watch) + WATCH_MODE=1 + ;; + --interval) + shift + if ! [[ "${1:-}" =~ ^[0-9]+$ ]] || [ "${1:-0}" -lt 1 ]; then + echo "--interval requires a positive integer number of seconds" >&2 + exit 64 + fi + WATCH_INTERVAL="$1" + ;; + --max-attempts) + shift + if ! 
[[ "${1:-}" =~ ^[0-9]+$ ]]; then + echo "--max-attempts requires a non-negative integer" >&2 + exit 64 + fi + WATCH_MAX_ATTEMPTS="$1" + ;; -h|--help) - cat <<'USAGE' -Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [--send-alert-test] - -Default mode is read-only and does not POST an Alertmanager test event. -Use --send-alert-test only after AWOOOI API is expected to be ready. -USAGE + usage exit 0 ;; *) - echo "Unknown argument: $arg" >&2 + echo "Unknown argument: $1" >&2 + usage >&2 exit 64 ;; esac + shift done RED=$'\033[0;31m' @@ -38,6 +73,12 @@ PASS=0 WARN=0 FAIL=0 +reset_counters() { + PASS=0 + WARN=0 + FAIL=0 +} + log_section() { printf "\n%s=== %s ===%s\n" "$BLUE" "$1" "$NC" } @@ -104,6 +145,7 @@ print_header() { echo "AWOOOI full-stack cold-start check" date '+%Y-%m-%d %H:%M:%S %Z' echo "Scope: 110 / 120 / 121 / 188. 112 Kali is intentionally skipped." + echo "Baseline: ops/reboot-recovery/full-stack-cold-start-baseline.yml" } check_network() { @@ -385,21 +427,54 @@ summary() { echo "PASS=$PASS WARN=$WARN BLOCKED=$FAIL" if [ "$FAIL" -gt 0 ]; then echo "Result: BLOCKED. Fix the first blocked gate before releasing runner/CD/AI auto-remediation." - exit 2 + return 2 fi if [ "$WARN" -gt 0 ]; then echo "Result: DEGRADED. Core gates passed but warnings remain." - exit 1 + return 1 fi echo "Result: GREEN. Full stack is ready for controlled runner/CD release." + return 0 } -print_header -check_network -check_188 -check_110 -check_k3s -check_workload_and_alertchain -check_public_routes -check_schedules -summary +run_once() { + reset_counters + print_header + check_network + check_188 + check_110 + check_k3s + check_workload_and_alertchain + check_public_routes + check_schedules + summary +} + +if [ "$WATCH_MODE" -eq 1 ]; then + attempt=1 + while :; do + if [ "$WATCH_MAX_ATTEMPTS" -eq 0 ]; then + printf "\nWatch attempt %s/unlimited\n" "$attempt" + else + printf "\nWatch attempt %s/%s\n" "$attempt" "$WATCH_MAX_ATTEMPTS" + fi + + run_once + rc=$? 
+ if [ "$rc" -eq 0 ]; then + exit 0 + fi + + if [ "$WATCH_MAX_ATTEMPTS" -ne 0 ] && [ "$attempt" -ge "$WATCH_MAX_ATTEMPTS" ]; then + echo "Watch stopped before GREEN. Last result code: $rc" + exit "$rc" + fi + + echo "Waiting ${WATCH_INTERVAL}s before the next cold-start gate check..." + sleep "$WATCH_INTERVAL" + attempt=$((attempt + 1)) + done +fi + +run_once +exit $?
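
The retry semantics of the watch loop can be exercised without touching any host by swapping the real checks for a stub. The following is a minimal sketch of the same retry pattern, not the script's actual code. `watch_until_green` and `stub_check` are hypothetical names, and the stub returns 2 (`BLOCKED`) twice before returning 0 (`GREEN`).

```bash
#!/usr/bin/env bash
# Hypothetical re-creation of the retry pattern for testing purposes.
# Usage: watch_until_green MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# MAX_ATTEMPTS=0 means unlimited, mirroring the --max-attempts option.
watch_until_green() {
  local max_attempts="$1" interval="$2"
  shift 2
  local attempt=1 rc
  while :; do
    rc=0
    "$@" || rc=$?          # capture the status without tripping `set -e`
    if [ "$rc" -eq 0 ]; then
      return 0             # GREEN: stop immediately
    fi
    if [ "$max_attempts" -ne 0 ] && [ "$attempt" -ge "$max_attempts" ]; then
      return "$rc"         # give up and report the last result code
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
}

# Stub standing in for the real checks: BLOCKED (2) on the first two
# calls, GREEN (0) from the third call onwards.
STUB_CALLS=0
stub_check() {
  STUB_CALLS=$((STUB_CALLS + 1))
  [ "$STUB_CALLS" -ge 3 ] && return 0
  return 2
}
```

With five attempts allowed, `watch_until_green 5 0 stub_check` returns 0 on the third call. With only two attempts allowed, it returns 2, the last result code.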