docs(ops): codify full stack cold start recovery
All checks were successful
Code Review / ai-code-review (push) Successful in 7s
@@ -3269,3 +3269,48 @@ DATABASE_URL=postgresql+asyncpg://u:p@localhost:5432/test REDIS_URL=redis://loca
|
||||
apps/api/tests/test_openclaw_alert_cloud_fallback_gate.py -q
|
||||
# 15 passed
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-06 (Taipei): Full-stack reboot cold-start SOP / baseline / watch mode
|
||||
|
||||
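**Trigger**: After the abnormal reboots of 110 / 120 / 121 / 188 on the evening of 2026-05-05, the recovery order, service dependencies, release logic, and final confirmation mechanism used in this recovery had to be fully documented, along with a standard procedure for fast recovery on the next reboot.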
|
||||
|
||||
### Completed
|
||||
|
||||
| Artifact | Result |
|----------|--------|
| `docs/runbooks/FULL-STACK-COLD-START-SOP.md` | Upgraded to v1.1; adds the Golden Startup Order, Mermaid dependency graph, phase gate logic, script-to-SOP coverage map, and next-reboot operator contract |
| `ops/reboot-recovery/full-stack-cold-start-baseline.yml` | New machine-readable baseline that pins hosts, roles, startup order, endpoint codes, schedule freshness, stateful-service no-go zones, and the AI auto-remediation gate |
| `scripts/reboot-recovery/full-stack-cold-start-check.sh` | Adds `--watch` / `--interval` / `--max-attempts` so the check can repeat after a reboot until `GREEN` |
|
||||
|
||||
### Standard release command for the next reboot
|
||||
|
||||
```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test
```
|
||||
|
||||
### Verification results
|
||||
|
||||
```bash
bash -n scripts/reboot-recovery/full-stack-cold-start-check.sh
# OK

ruby -e 'require "yaml"; YAML.load_file("ops/reboot-recovery/full-stack-cold-start-baseline.yml"); puts "YAML OK"'
# YAML OK

bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 1 --max-attempts 1 --send-alert-test
# PASS=50 WARN=0 BLOCKED=0
# Result: GREEN. Full stack is ready for controlled runner/CD release.
```
|
||||
|
||||
### Release principles
|
||||
|
||||
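- `BLOCKED`: stop releasing later phases and fix the first blocked gate.
- `WARN`: do not release runner/CD/AI full execution; clear the warnings or explicitly accept them.
- `GREEN`: only means the next phase may start; high-load crawlers / Snuba / ClickHouse merges / runner/CD are still released last.
- The stateful DB / ClickHouse / Kafka / Harbor / Sentry data layers must not be destructively repaired by AI automation.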
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
# AWOOOI Full-Stack Cold Start SOP
|
||||
|
||||
> Version: v1.0
|
||||
> Last updated: 2026-05-05 Asia/Taipei
|
||||
> Version: v1.1
|
||||
> Last updated: 2026-05-06 Asia/Taipei
|
||||
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
|
||||
|
||||
---
|
||||
@@ -38,6 +38,71 @@ The rule is simple: **recover the dependency chain, not the loudest symptom.**
|
||||
|
||||
Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.
|
||||
|
||||
### 1.1 Dependency Graph
|
||||
|
||||
```mermaid
flowchart TD
    network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
    network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
    data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
    obs110 --> k3s
    k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
    workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
    workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
    public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
    schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
    highload --> ai["AI auto-remediation: limited execution"]
```
|
||||
|
||||
This is also captured in the machine-readable baseline:
|
||||
|
||||
```text
ops/reboot-recovery/full-stack-cold-start-baseline.yml
```
|
||||
|
||||
The YAML baseline is the source of truth for:
|
||||
|
||||
- hosts, roles, and SSH users
- phase ordering
- service startup dependencies
- endpoint success codes
- schedule freshness thresholds
- stateful-service protection boundaries
- AI automation release gates
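
As an illustration of how the endpoint success codes might be applied, here is a minimal sketch. The helper name `code_is_expected` is hypothetical, and the expected codes are passed as arguments instead of being parsed from the YAML; the actual check script may work differently.

```bash
# Hypothetical helper: succeed if an HTTP status code matches one expected entry.
# Entries are exact codes (200, 401) or class patterns (2xx, 3xx).
code_is_expected() {
  local actual="$1" expected
  shift
  for expected in "$@"; do
    case "$expected" in
      [1-5]xx)
        # Class pattern: the leading digit must match a three-digit code.
        case "$actual" in
          "${expected%xx}"[0-9][0-9]) return 0 ;;
        esac
        ;;
      *)
        if [ "$actual" = "$expected" ]; then
          return 0
        fi
        ;;
    esac
  done
  return 1
}
```

For example, `code_is_expected 401 200 401` succeeds, which matches the baseline's rule that Harbor `/v2/` may answer 200 or 401, while `code_is_expected 500 2xx 3xx` fails.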
|
||||
|
||||
### 1.2 Phase Gate Logic
|
||||
|
||||
Each phase has the same decision rule:
|
||||
|
||||
| Result | Meaning | Action |
|--------|---------|--------|
| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
| `WARN` | Core dependency passed, but confidence is incomplete. | Continue diagnosis, but do not release runner/CD/AI full execution. |
| `GREEN` | All checks in scope passed. | Release the next phase only. |
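
The "fix the first blocked gate" rule can be sketched as a small helper. This is an illustration only: `first_unreleased_phase` is a hypothetical name, and it assumes each phase result is already available as a `PHASE=RESULT` pair in startup order.

```bash
# Hypothetical helper: print the first phase that is not GREEN and return 1.
# Returns 0 without output when every phase is GREEN.
first_unreleased_phase() {
  local pair
  for pair in "$@"; do
    # Strip everything up to the first "=" to get the result.
    if [ "${pair#*=}" != "GREEN" ]; then
      printf '%s\n' "$pair"
      return 1
    fi
  done
  return 0
}
```

Calling `first_unreleased_phase P0-NETWORK=GREEN P0-188-DATA=BLOCKED P1-K3S=WARN` prints `P0-188-DATA=BLOCKED`, so the operator repairs that gate before looking at anything later in the chain.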
|
||||
|
||||
The cold-start flow is intentionally conservative:
|
||||
|
||||
```text
P0 network green
  -> P0 188 data green
  -> P0 110 registry/observability green
  -> P1 K3s green
  -> P2 workload + alert chain green
  -> P2 public routes green
  -> P2 schedules green
  -> P3 high-load services and runners/CD
  -> AI auto-remediation limited execution
```
|
||||
|
||||
The final release condition is not "containers are running". It is:
|
||||
|
||||
```text
PASS > 0
WARN = 0
BLOCKED = 0
Result: GREEN
```
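
A minimal sketch of this release condition as a function follows. The name `release_result` is hypothetical. The return codes mirror the check script in this commit (2 for blocked, 1 for degraded, 0 for green); the `UNKNOWN` label for the `PASS = 0` case is an assumption, since the SOP only states that `PASS` must be greater than zero.

```bash
# Hypothetical helper: classify counters according to the final release condition.
# Arguments: PASS WARN BLOCKED
release_result() {
  local pass="$1" warn="$2" blocked="$3"
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"
    return 2
  fi
  if [ "$warn" -gt 0 ]; then
    echo "DEGRADED"
    return 1
  fi
  if [ "$pass" -le 0 ]; then
    # Assumption: nothing ran, so GREEN cannot be claimed.
    echo "UNKNOWN"
    return 1
  fi
  echo "GREEN"
  return 0
}
```

With the counters from the verification run (`release_result 50 0 0`) this prints `GREEN`; any non-zero `WARN` or `BLOCKED` value prevents it.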
|
||||
|
||||
---
|
||||
|
||||
## 2. Automation Freeze
|
||||
@@ -443,26 +508,63 @@ Until then:
|
||||
|
||||
## 13. One-Command Readiness Script
|
||||
|
||||
Run:
|
||||
### 13.1 Single Pass
|
||||
|
||||
Run this when you want one read-only snapshot:
|
||||
|
||||
```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh
```
|
||||
|
||||
The script is read-only. It reports gates:
|
||||
The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:
|
||||
|
||||
- `P0-NETWORK`
|
||||
- `P0-188-DATA`
|
||||
- `P0-110-REGISTRY`
|
||||
- `P0-110-REGISTRY-OBSERVABILITY`
|
||||
- `P1-K3S`
|
||||
- `P2-WORKLOAD`
|
||||
- `P2-ALERTCHAIN`
|
||||
- `P2-WORKLOAD-ALERTCHAIN`
|
||||
- `P2-PUBLIC-ROUTES`
|
||||
- `P2-SCHEDULES`
|
||||
- runner guardrail state inside `P0-110-REGISTRY-OBSERVABILITY`
|
||||
|
||||
If it prints `BLOCKED`, fix the first blocked gate before moving forward.
|
||||
|
||||
### 13.2 Professional Watch Mode
|
||||
|
||||
Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:
|
||||
|
||||
```bash
bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
  --watch \
  --interval 60 \
  --max-attempts 30 \
  --send-alert-test
```
|
||||
|
||||
This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts and exits only when the stack is `GREEN` or the last attempt remains degraded/blocked.
|
||||
|
||||
Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
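
An operator wrapper could branch on the exit status of the watch command. The sketch below assumes the exit codes used by the script in this commit (0 for `GREEN`, 1 for degraded, 2 for `BLOCKED`, 64 for invalid arguments); the function name `next_action_for_exit_code` is hypothetical and not part of the script.

```bash
# Hypothetical helper: map the check script's exit code to the next operator action.
next_action_for_exit_code() {
  case "$1" in
    0)  echo "GREEN: release the next phase" ;;
    1)  echo "DEGRADED: clear or explicitly accept each warning" ;;
    2)  echo "BLOCKED: repair the first blocked gate and rerun watch mode" ;;
    64) echo "USAGE: fix the command-line arguments" ;;
    *)  echo "UNKNOWN: inspect the script output" ;;
  esac
}
```

It would be called right after the watch command, for example as `next_action_for_exit_code "$?"`.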
|
||||
|
||||
### 13.3 Script-To-SOP Coverage Map
|
||||
|
||||
| Script gate | SOP coverage | Blocks |
|-------------|--------------|--------|
| `P0-NETWORK` | host reachability, ARP, SSH | every later phase |
| `P0-188-DATA` | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
| `P0-110-REGISTRY-OBSERVABILITY` | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
| `P1-K3S` | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
| `P2-WORKLOAD-ALERTCHAIN` | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |
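
The `P2-SCHEDULES` gate depends on freshness thresholds from the baseline: 300 seconds for the textfile exporters and 90000 seconds (25 hours) for backup success. A minimal sketch of such a comparison, using a hypothetical helper `is_fresh` that takes epoch seconds as arguments, might look like this:

```bash
# Hypothetical helper: succeed when the age of a timestamp is below max_age seconds.
# Arguments: last_epoch now_epoch max_age
is_fresh() {
  local last_epoch="$1" now_epoch="$2" max_age="$3"
  [ $((now_epoch - last_epoch)) -lt "$max_age" ]
}
```

On a host with GNU coreutils, the first argument could come from `stat -c %Y` on the exporter's textfile and the second from `date +%s`; how the real script collects these values is not shown in this diff.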
|
||||
|
||||
### 13.4 Next-Reboot Operator Contract
|
||||
|
||||
1. Run the watch command above.
2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
3. If it stops at `WARN`, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
4. Release high-load services only after `GREEN` and load/core stays below `1.0` for 15 minutes.
5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`.
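
The load condition in step 4 can be approximated with the 15-minute load average divided by the CPU count. This is a sketch under that assumption; the SOP does not say how "load/core" is measured, and `load_per_core_below` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed when load15 / cores is strictly below the threshold.
# Arguments: load15 cores [threshold, default 1.0]; awk handles the floating-point math.
load_per_core_below() {
  local load15="$1" cores="$2" threshold="${3:-1.0}"
  awk -v l="$load15" -v c="$cores" -v t="$threshold" \
    'BEGIN { exit !(c > 0 && (l / c) < t) }'
}
```

On Linux the inputs could be taken from the third field of `/proc/loadavg` and from `nproc`, for example `load_per_core_below "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)"`.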
|
||||
|
||||
---
|
||||
|
||||
## 14. Done Criteria
|
||||
|
||||
303
ops/reboot-recovery/full-stack-cold-start-baseline.yml
Normal file
@@ -0,0 +1,303 @@
# AWOOOI full-stack cold-start dependency baseline.
# This is the machine-readable companion to docs/runbooks/FULL-STACK-COLD-START-SOP.md.
#
# Intent:
# - document the reboot startup order and service dependency graph
# - define release gates for operators and AI automation
# - keep stateful services out of generic auto-restart loops

version: "2026-05-06"
incident_reference: "2026-05-05 full-stack reboot recovery"
scope:
  managed_hosts:
    "110":
      address: "192.168.0.110"
      ssh_user: "wooo"
      roles:
        - registry
        - git
        - observability
        - sentry
        - runners
    "120":
      address: "192.168.0.120"
      ssh_user: "wooo"
      roles:
        - k3s_server
        - keepalived_vip
        - awoooi_nodeport
    "121":
      address: "192.168.0.121"
      ssh_user: "wooo"
      roles:
        - k3s_node
        - keepalived_peer
        - dr_drill
    "188":
      address: "192.168.0.188"
      ssh_user: "ollama"
      roles:
        - postgres_datastore
        - redis
        - momo
        - signoz
        - ai_proxy
  intentionally_skipped:
    "112":
      role: "kali"
      reason: "scanner host is not required for production cold-start release"

global_policy:
  startup_rule: "Recover the dependency chain before releasing high-load work."
  runner_cd_rule: "Release runners and CD only after data, registry, K3s, workload, routes, schedules, and alert E2E gates are green."
  ai_auto_repair_rule: "Observe-only until all green gates pass and host load stays below baseline."
  destructive_state_rule: "No DROP, data directory deletion, volume recreation, pg_resetwal, fsck, or backup restore without explicit human approval."
  no_generic_restart_rule: "Never run generic docker restart against all containers during cold start."

phases:
  - id: "P0-NETWORK"
    order: 0
    start_after: []
    owns:
      - "LAN reachability"
      - "SSH reachability"
      - "ARP evidence"
    gates:
      - "ping 192.168.0.110/120/121/188 succeeds"
      - "TCP 22 open on 192.168.0.110/120/121/188"
      - "reboot evidence captured before repair"
    blocks:
      - "all other phases"

  - id: "P0-188-DATA"
    order: 1
    start_after:
      - "P0-NETWORK"
    host: "188"
    service_order:
      - "containerd"
      - "docker"
      - "postgresql@14-main"
      - "k3s_datastore.kine maintenance"
      - "redis-server"
      - "ollama or current AI proxy dependencies"
      - "nginx"
      - "Docker networks"
      - "MinIO / OpenClaw / SignOz"
      - "momo / litellm / batch services"
    gates:
      - "PostgreSQL port 5432 open"
      - "pg_isready reports accepting connections"
      - "Redis replies PONG"
      - "momo health endpoint returns 200"
      - "SignOz HTTP route is reachable"
    blocks:
      - "120/121 K3s"
      - "AWOOOI API database access"
      - "Alertmanager webhook"
      - "momo public site"

  - id: "P0-110-REGISTRY-OBSERVABILITY"
    order: 2
    start_after:
      - "P0-NETWORK"
      - "P0-188-DATA"
    host: "110"
    service_order:
      - "docker"
      - "orphan Exited(128/137) cleanup if needed"
      - "Harbor log"
      - "Harbor registry stack"
      - "Gitea"
      - "Prometheus / Alertmanager / Grafana / exporters"
      - "Langfuse"
      - "SignOz or local observability companions"
      - "Sentry DB layer"
      - "Sentry web / worker / consumer layer"
      - "Gitea host runner and actions runners"
    gates:
      - "Harbor /v2/ returns 200 or 401"
      - "Gitea returns 200 or 302"
      - "Prometheus /-/ready returns 200"
      - "Alertmanager /-/healthy returns 200"
      - "Sentry HTTP returns 200, 302, or 400"
      - "runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0"
    blocks:
      - "K3s image pulls"
      - "runtime CD"
      - "alert rules deploy"
      - "code-review runners"

  - id: "P1-K3S"
    order: 3
    start_after:
      - "P0-188-DATA"
      - "P0-110-REGISTRY-OBSERVABILITY"
    hosts:
      - "120"
      - "121"
    service_order:
      - "120 k3s.service"
      - "121 k3s-agent.service or live role"
      - "CNI / kube-proxy"
      - "nodes Ready"
      - "core pods"
      - "awoooi-prod pods"
      - "keepalived VIP 192.168.0.125"
      - "NodePorts 32334 and 32335"
    gates:
      - "120 can reach 188:5432"
      - "K3s nodes show Ready"
      - "VIP 192.168.0.125 is present"
      - "awoooi-prod pods are Running or Completed"
    blocks:
      - "AWOOOI workload health"
      - "public AWOOOI route"
      - "Alertmanager webhook"

  - id: "P2-WORKLOAD-ALERTCHAIN"
    order: 4
    start_after:
      - "P1-K3S"
    owners:
      - "AWOOOI API"
      - "AWOOOI Web"
      - "Alertmanager webhook"
      - "Telegram delivery"
    gates:
      - "http://192.168.0.125:32334/api/v1/health returns 2xx/3xx"
      - "http://192.168.0.125:32335/ returns 2xx/3xx"
      - "Alertmanager webhook POST returns 2xx"
      - "K8s Telegram secrets are present and non-placeholder"
    blocks:
      - "AI auto-remediation"
      - "full alert confidence"

  - id: "P2-PUBLIC-ROUTES"
    order: 5
    start_after:
      - "P2-WORKLOAD-ALERTCHAIN"
    gates:
      - "https://awoooi.wooo.work/api/v1/health returns 2xx/3xx"
      - "https://awoooi.wooo.work/ returns 2xx/3xx"
      - "https://mo.wooo.work/ returns 2xx/3xx"
      - "https://mo.wooo.work/health returns 2xx/3xx"
    blocks:
      - "external release complete"

  - id: "P2-SCHEDULES"
    order: 6
    start_after:
      - "P2-PUBLIC-ROUTES"
    gates:
      - "110/120/121/188 cron services active"
      - "188 backup-from-110 success age below 25h"
      - "188 docker restart/stats textfiles fresh"
      - "110 docker/systemd textfiles fresh"
      - "120 awoooi-prod CronJobs present and unsuspended"
      - "120 awoooi-prod has no failed Jobs"
      - "121 DR drill cron present"
    blocks:
      - "done criteria"
      - "AI auto-remediation release"

  - id: "P3-HIGH-LOAD-RELEASE"
    order: 7
    start_after:
      - "P2-SCHEDULES"
    release_last:
      - "momo-scheduler / Chrome crawlers"
      - "Sentry Snuba consumers"
      - "SignOz ClickHouse merge-heavy work"
      - "Gitea actions runners"
      - "runtime CD jobs"
    gates:
      - "all prior gates green"
      - "host load per CPU below 1.0 for 15 minutes before releasing batch/runner work"
      - "ClickHouse/Kafka/Snuba backlog decreasing for two consecutive checks if backlog exists"

baselines:
  endpoints:
    awoooi_vip_api_health: "http://192.168.0.125:32334/api/v1/health"
    awoooi_vip_web: "http://192.168.0.125:32335/"
    awoooi_public_api_health: "https://awoooi.wooo.work/api/v1/health"
    awoooi_public_web: "https://awoooi.wooo.work/"
    momo_public_web: "https://mo.wooo.work/"
    momo_public_health: "https://mo.wooo.work/health"
    harbor_registry: "http://127.0.0.1:5000/v2/"
    gitea: "http://127.0.0.1:3001/"
    prometheus_ready: "http://127.0.0.1:9090/-/ready"
    alertmanager_healthy: "http://127.0.0.1:9093/-/healthy"
    sentry: "http://127.0.0.1:9000/"
  expected_codes:
    harbor_registry:
      - 200
      - 401
    gitea:
      - 200
      - 302
    prometheus_ready:
      - 200
    alertmanager_healthy:
      - 200
    sentry:
      - 200
      - 302
      - 400
    workload_and_public:
      - "2xx"
      - "3xx"
  runner_guardrails:
    CPUQuotaPerSecUSec: "2s"
    MemoryMax: "2147483648"
    WatchdogUSec: "0"
  freshness_seconds:
    docker_textfiles: 300
    systemd_textfiles: 300
    backup_success: 90000

stateful_services:
  hard_block_auto_repair:
    - "188 PostgreSQL data directory"
    - "188 k3s_datastore"
    - "188 momo database"
    - "110 Harbor DB"
    - "110 Sentry DB"
    - "Sentry ClickHouse data"
    - "SignOz ClickHouse data"
    - "Kafka topic/log directories"
  human_in_loop_required:
    - "pg_resetwal"
    - "ClickHouse clean-clone recovery"
    - "Kafka checkpoint file quarantine"
    - "backup restore"
    - "filesystem repair"

ai_automation_gate:
  observe_only_until:
    - "P0-NETWORK green"
    - "P0-188-DATA green"
    - "P0-110-REGISTRY-OBSERVABILITY green"
    - "P1-K3S green"
    - "P2-WORKLOAD-ALERTCHAIN green"
    - "P2-PUBLIC-ROUTES green"
    - "P2-SCHEDULES green"
    - "no active restart storm"
    - "host load per CPU below 1.0 for 15 minutes"
  allowed_before_green:
    - "diagnose"
    - "collect evidence"
    - "notify"
  blocked_before_green:
    - "stateful restart"
    - "destructive repair"
    - "runner/CD release"
    - "generic container restart"

final_confirmation:
  command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
  green_result:
    PASS: "greater than 0"
    WARN: 0
    BLOCKED: 0
    summary: "Result: GREEN"
@@ -6,26 +6,61 @@ set -uo pipefail

 SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6)
 SEND_ALERT_TEST=0
+WATCH_MODE=0
+WATCH_INTERVAL=60
+WATCH_MAX_ATTEMPTS=30

-for arg in "$@"; do
-  case "$arg" in
+usage() {
+  cat <<'USAGE'
+Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [options]
+
+Options:
+  --send-alert-test     POST one Alertmanager webhook test after AWOOOI API is ready.
+  --watch               Repeat checks until all gates are GREEN or max attempts is reached.
+  --interval SECONDS    Retry interval for --watch. Default: 60.
+  --max-attempts COUNT  Max attempts for --watch. Default: 30. Use 0 for unlimited.
+  -h, --help            Show this help.
+
+Default mode is read-only and does not POST an Alertmanager test event.
+Use --send-alert-test for the final release gate after AWOOOI API is expected to be ready.
+USAGE
+}
+
+while [ "$#" -gt 0 ]; do
+  case "$1" in
     --send-alert-test)
       SEND_ALERT_TEST=1
       ;;
+    --watch)
+      WATCH_MODE=1
+      ;;
+    --interval)
+      shift
+      if ! [[ "${1:-}" =~ ^[0-9]+$ ]] || [ "${1:-0}" -lt 1 ]; then
+        echo "--interval requires a positive integer number of seconds" >&2
+        exit 64
+      fi
+      WATCH_INTERVAL="$1"
+      ;;
+    --max-attempts)
+      shift
+      if ! [[ "${1:-}" =~ ^[0-9]+$ ]]; then
+        echo "--max-attempts requires a non-negative integer" >&2
+        exit 64
+      fi
+      WATCH_MAX_ATTEMPTS="$1"
+      ;;
     -h|--help)
-      cat <<'USAGE'
-Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [--send-alert-test]
-
-Default mode is read-only and does not POST an Alertmanager test event.
-Use --send-alert-test only after AWOOOI API is expected to be ready.
-USAGE
+      usage
       exit 0
       ;;
     *)
-      echo "Unknown argument: $arg" >&2
+      echo "Unknown argument: $1" >&2
+      usage >&2
       exit 64
       ;;
   esac
+  shift
 done

 RED=$'\033[0;31m'
@@ -38,6 +73,12 @@ PASS=0
 WARN=0
 FAIL=0

+reset_counters() {
+  PASS=0
+  WARN=0
+  FAIL=0
+}
+
 log_section() {
   printf "\n%s=== %s ===%s\n" "$BLUE" "$1" "$NC"
 }
@@ -104,6 +145,7 @@ print_header() {
   echo "AWOOOI full-stack cold-start check"
   date '+%Y-%m-%d %H:%M:%S %Z'
   echo "Scope: 110 / 120 / 121 / 188. 112 Kali is intentionally skipped."
+  echo "Baseline: ops/reboot-recovery/full-stack-cold-start-baseline.yml"
 }

 check_network() {
@@ -385,21 +427,54 @@ summary() {
   echo "PASS=$PASS WARN=$WARN BLOCKED=$FAIL"
   if [ "$FAIL" -gt 0 ]; then
     echo "Result: BLOCKED. Fix the first blocked gate before releasing runner/CD/AI auto-remediation."
-    exit 2
+    return 2
   fi
   if [ "$WARN" -gt 0 ]; then
     echo "Result: DEGRADED. Core gates passed but warnings remain."
-    exit 1
+    return 1
   fi
   echo "Result: GREEN. Full stack is ready for controlled runner/CD release."
+  return 0
 }

-print_header
-check_network
-check_188
-check_110
-check_k3s
-check_workload_and_alertchain
-check_public_routes
-check_schedules
-summary
+run_once() {
+  reset_counters
+  print_header
+  check_network
+  check_188
+  check_110
+  check_k3s
+  check_workload_and_alertchain
+  check_public_routes
+  check_schedules
+  summary
+}
+
+if [ "$WATCH_MODE" -eq 1 ]; then
+  attempt=1
+  while :; do
+    if [ "$WATCH_MAX_ATTEMPTS" -eq 0 ]; then
+      printf "\nWatch attempt %s/unlimited\n" "$attempt"
+    else
+      printf "\nWatch attempt %s/%s\n" "$attempt" "$WATCH_MAX_ATTEMPTS"
+    fi
+
+    run_once
+    rc=$?
+    if [ "$rc" -eq 0 ]; then
+      exit 0
+    fi
+
+    if [ "$WATCH_MAX_ATTEMPTS" -ne 0 ] && [ "$attempt" -ge "$WATCH_MAX_ATTEMPTS" ]; then
+      echo "Watch stopped before GREEN. Last result code: $rc"
+      exit "$rc"
+    fi
+
+    echo "Waiting ${WATCH_INTERVAL}s before the next cold-start gate check..."
+    sleep "$WATCH_INTERVAL"
+    attempt=$((attempt + 1))
+  done
+fi
+
+run_once
+exit $?