1 Commits

Author SHA1 Message Date
Your Name
022dc2a12d docs(ops): refresh reboot SOP midday evidence 2026-06-13 12:46:19 +08:00
3 changed files with 33 additions and 15 deletions


@@ -33861,3 +33861,20 @@ production browser smoke:
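1. Continue advancing the S4.9 owner response real-reply data package. Required fields: owner role / team, decision, decision reason, affected scope, redacted evidence refs, and followup owner. Keep `0 / false` until acceptance.
2. Keep inventorying high-value configuration controls, prioritizing Nginx, K8s manifests, ArgoCD apps, Gitea workflows, registry / Harbor, Sentry / SigNoz / Alertmanager, the public gateway, AI provider routes, database migrations, and the secrets injection flow.
3. Any host maintenance, Kali update, or Nginx / Docker / firewall / active scan work still requires its own maintenance window and manual approval; the governance page or an AwoooP approval must not directly substitute for it.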
## 2026-06-13 — Reboot SOP midday live refresh
**Read-only verification**
- 12:43: ran `/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` from 110; result `PASS=83 WARN=0 BLOCKED=0`.
- Public API health returns `status=healthy`, `environment=prod`, `mock_mode=false`; PostgreSQL, Redis, OpenClaw, SigNoz, and GCP Ollama A/B are all up; `ollama_local` is still in a non-blocking cooldown.
- 12:44: `/backup/scripts/backup-status.sh --no-notify --no-refresh` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-13 02:34:06`.
- 120 / 121 remain `Ready control-plane`; public routes / TLS, momo scheduler, momo DB parity, and cron / CronJob / textfile checks all pass.
**Verdict adjustments**
- Service and cold-start: may still be declared `GREEN`.
- DR: still may not be declared complete; credential escrow evidence is missing `5` markers.
- Workload balancing: downgraded to `SERVICE_OK_PLACEMENT_DRIFT`. Topology spread constraints are still present, but live API pods are currently `2/2` on 120, Web is still split 120 / 121, and the single-replica Worker is on 120. This does not block service, but the API can no longer be described as balanced.
**Document updates**
- `docs/runbooks/FULL-STACK-COLD-START-SOP.md`: updated to the 12:43 live baseline and the allowed declaration wording.
- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`: updated the current verdict, P2 status, and midday live refresh evidence.
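
As a rough illustration of how the read-only evidence in this entry maps onto the service and DR verdicts, the Python sketch below parses the quoted summary strings. The function names, the `NOT_GREEN` / `COMPLETE` labels, and the rule that a non-zero `PASS` with `BLOCKED=0` is enough for service `GREEN` are assumptions made for this example; they are not the logic of `full-stack-cold-start-check.sh` or `backup-status.sh`.

```python
import re


def parse_counters(line: str) -> dict[str, int]:
    """Collect KEY=<integer> pairs; a repeated key keeps its last value."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)\b", line)}


def summarize(scorecard_line: str, backup_line: str) -> dict[str, str]:
    """Map the two summary lines onto hypothetical service / DR verdicts."""
    score = parse_counters(scorecard_line)
    backup = parse_counters(backup_line)
    # Assumption: a missing counter is treated as a failure, never as zero.
    service_green = score.get("PASS", 0) > 0 and score.get("BLOCKED", 1) == 0
    dr_complete = (
        backup.get("core_blockers", 1) == 0 and backup.get("escrow_missing", 1) == 0
    )
    return {
        "service": "GREEN" if service_green else "NOT_GREEN",
        "dr": "COMPLETE" if dr_complete else "BLOCKED",
    }
```

Fed the 12:43 scorecard line and the 12:44 backup line, this sketch reports the service as green and DR as blocked, because `escrow_missing=5` is non-zero.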


@@ -10,27 +10,27 @@
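This section is the first verdict anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.
| Item | 2026-06-13 01:29 Asia/Taipei live result | Verdict |
| Item | 2026-06-13 12:43 Asia/Taipei live result | Verdict |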
|------|-------------------------------------------|------|
| Overall recovery readiness | `95%` | `SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED` |
| Overall recovery readiness | `94%` | `SERVICE_GREEN_PLACEMENT_DRIFT_DR_ESCROW_BLOCKED` |
| P0 host / K3s recovery | `100%` | `DONE` |
| P1 backup / alert / escrow | `90%` | `BLOCKED_DR_ESCROW` |
| P2 service / data truth | `100%` | `VERIFIED_WORKLOAD_BALANCED` |
| P2 service / data truth | `98%` | `VERIFIED_SERVICE_GREEN_PLACEMENT_DRIFT` |
| P3 docs / automation contracts | `100%` | `DONE_WITH_VALIDATION_GAP` |
| 110 host runtime | `systemctl is-system-running=running`, failed units `0`, cold-start load `3.55 / 4.54 / 4.51`; Docker / Harbor / Gitea / Prometheus / Alertmanager / Sentry checks pass | `GREEN_WITH_LOAD_WATCH` |
| 110 host runtime | failed units `0`, cold-start load `2.23 / 2.95 / 2.85`; Docker / Harbor / Gitea / Prometheus / Alertmanager / Sentry checks pass | `GREEN_WITH_LOAD_WATCH` |
| 120 reachability | ping OK, SSH OK, boot `2026-06-12 15:13`, root ext4 `rw`, failed units `0` | `GREEN` |
| 121 reachability | ping OK, SSH OK, failed units `0` | `GREEN` |
| 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`, no failed jobs / bad pods; latest CD kept API/Web split across 120 / 121 | `GREEN_WORKLOAD_BALANCED` |
| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`, no failed jobs / bad pods; deployment topology spread remains configured, but live API pods both sit on 120 while Web remains split 120 / 121 | `GREEN_WITH_API_PLACEMENT_DRIFT` |
| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; final repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` moves deploy trust to `/home/wooo/.ssh/deploy_known_hosts`; 01:28 global `known_hosts` still contains 120 / 188 and was not clobbered by deploy marker `e4a349bc` | `GREEN_WITH_GUARDRAIL` |
| Backup status | 01:26 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-12 15:54:40` | `GREEN_WITH_DR_ESCROW_WARNING` |
| Backup status | 12:44 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-13 02:34:06` | `GREEN_WITH_DR_ESCROW_WARNING` |
| Offsite sync / verify | 01:28 textfile: `awoooi_backup_offsite_remote_verify_ok=1`, `full_verify_fresh=1`, all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log is 2026-06-12 07:20 | `GREEN` |
| Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | `GREEN_WITH_EXPECTED_REDLIGHTS` |
| Cold-start scorecard | 01:26 read-only scorecard after latest CD / trust guardrail: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
| Cold-start scorecard | 12:43 read-only scorecard: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
| momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
| ArgoCD app health | ArgoCD revision `e4a349bc` is `Synced`; app remains `Degraded` while `km-vectorize` `lastSuccessfulTime=2026-06-04T11:00:37Z`; corrected schedule remains `0 3 * * *` with `timeZone=Asia/Taipei` | `GOVERNANCE_DEBT_IN_REMEDIATION` |
| Workload balancing | Latest live deployments use images from `414413a5`; API/Web each have one pod on 120 and one pod on 121 after deploy marker `e4a349bc` | `GREEN` |
| Workload balancing | Topology spread constraints remain live for API/Web/Worker, but `whenUnsatisfiable=ScheduleAnyway` is a soft preference; 12:43 live placement has API `2/2` on 120, Web split 120 / 121, Worker on 120 | `SERVICE_OK_PLACEMENT_DRIFT` |
| Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |
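
The placement-drift verdict in the workload balancing row can be read as a skew calculation. The sketch below is a minimal Python illustration, assuming skew is the pod count on the busiest eligible node minus the count on the emptiest one, which mirrors how Kubernetes defines skew for topology spread. The function names, the hard-coded node list, and the threshold of 1 (borrowed from the SOP's "max skew <= 1" wording) are assumptions for this example.

```python
from collections import Counter


def placement_skew(pod_nodes: list[str], eligible_nodes: list[str]) -> int:
    """Return max minus min pod count across the eligible nodes."""
    if not eligible_nodes:
        raise ValueError("at least one eligible node is required")
    counts = Counter(pod_nodes)
    # Nodes hosting no pod still count as a domain with zero pods.
    per_node = [counts.get(node, 0) for node in eligible_nodes]
    return max(per_node) - min(per_node)


def is_balanced(pod_nodes: list[str], eligible_nodes: list[str], max_skew: int = 1) -> bool:
    """Hypothetical check: balanced when the observed skew is within max_skew."""
    return placement_skew(pod_nodes, eligible_nodes) <= max_skew


NODES = ["mon", "mon1"]  # 120 and 121, as named in the K3s node state row

# 12:43 live placement as described in the table above.
api_balanced = is_balanced(["mon", "mon"], NODES)      # both API pods on 120
web_balanced = is_balanced(["mon", "mon1"], NODES)     # Web split 120 / 121
worker_balanced = is_balanced(["mon"], NODES)          # single Worker replica on 120
```

Under these assumptions the API placement has a skew of 2 and fails the check, while Web (skew 0) and the single-replica Worker (skew 1) pass. Because `whenUnsatisfiable=ScheduleAnyway` only expresses a preference, the scheduler is allowed to produce the API placement; the hard alternative in Kubernetes is `whenUnsatisfiable: DoNotSchedule`.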
Release rule:
@@ -44,14 +44,14 @@ Do not declare DR scorecard complete while credential escrow markers are missing
2026-06-13 live rule:
```text
110 / 120 / 121 / 188 service recovery is full-stack green after the 01:26 scorecard.
110 / 120 / 121 / 188 service recovery is full-stack green after the 12:43 scorecard.
GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
GO for "AWOOOI service / cold-start green"; NO-GO for "API workload balanced" until API pods are again split across 120 / 121 or a hard placement policy is approved.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
Current allowed wording: "service / cold-start green; ArgoCD CronJob governance debt in remediation and waiting for 03:00 `km-vectorize` success evidence."
Current allowed wording: "service / cold-start green; DR escrow remains blocked; API placement drift is observed; ArgoCD CronJob governance debt is in remediation."
```
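
The `known_hosts` guardrail in the rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual CD workflow: the helper names and the target `wooo@deploy-target.example` are hypothetical, while `UserKnownHostsFile` and `StrictHostKeyChecking` are standard OpenSSH client options and the two paths are the ones quoted in the rule.

```python
import shlex

DEPLOY_KNOWN_HOSTS = "/home/wooo/.ssh/deploy_known_hosts"
GLOBAL_KNOWN_HOSTS = "/home/wooo/.ssh/known_hosts"


def deploy_ssh_argv(target: str, remote_command: str) -> list[str]:
    """Build an ssh argv that keeps deploy host keys in the isolated file."""
    return [
        "ssh",
        "-o", f"UserKnownHostsFile={DEPLOY_KNOWN_HOSTS}",
        "-o", "StrictHostKeyChecking=yes",
        target,
        remote_command,
    ]


def references_global_known_hosts(script_line: str) -> bool:
    """Flag a workflow line whose tokens name the global known_hosts file."""
    for token in shlex.split(script_line):
        # Handle both bare paths and KEY=/path style option values.
        value = token.split("=", 1)[-1]
        if value == GLOBAL_KNOWN_HOSTS:
            return True
    return False


# Hypothetical target; the real deploy hosts are not spelled out in this SOP.
argv = deploy_ssh_argv("wooo@deploy-target.example", "true")
```

A lint step built on `references_global_known_hosts` would only catch literal references to the global path; it is a coarse tripwire, not proof that a workflow leaves global trust untouched.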
After any future 120 recovery, rerun this exact chain from 110:


@@ -11,15 +11,15 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 01:26 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, API/Web remain spread across 120 / 121 after deploy marker `e4a349bc`, and CD no longer clobbers global `known_hosts`. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. |
| Overall recovery readiness | SERVICE_GREEN_PLACEMENT_DRIFT_DR_ESCROW_BLOCKED | 94% | 2026-06-13 12:43 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and CD no longer clobbers global `known_hosts`. Live deployment topology spread remains configured, but API pods drifted to `2/2` on 120 while Web remains split 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-13 01:26 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 01:28 offsite textfile has `remote_verify_ok=1`, `full_verify_fresh=1`, and all 13 repos `snapshot_count=1`; 01:27 Alertmanager exposes the five expected escrow gap alerts and Prometheus has all five required alert rule names healthy. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 01:26 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web both keep 120 / 121 split placement after latest deploy marker `e4a349bc`. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-13 12:44 `backup-status --no-notify --no-refresh` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-13 02:34:06`; backup textfile still reports all expected freshness metrics. |
| P2 service / data truth | VERIFIED_SERVICE_GREEN_PLACEMENT_DRIFT | 98% | 2026-06-13 12:43 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API live placement drifted to both replicas on 120 even though topology spread remains configured; Web remains split 120 / 121. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.8, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
2026-06-13 12:43 refresh: full cold-start is again green for the current evidence set (`PASS=83 WARN=0 BLOCKED=0`). Do not declare DR complete while credential escrow is missing. Do not declare API workload balanced until live API pods are split across 120 / 121 again or a hard placement policy is approved; current topology spread is present but soft (`ScheduleAnyway`) and API placement has drifted to 120.
---
@@ -56,6 +56,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
| 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 midday live refresh | SERVICE_GREEN_API_PLACEMENT_DRIFT_DR_ESCROW_BLOCKED | 12:43 cold-start from 110 returns `PASS=83 WARN=0 BLOCKED=0`; public API health is `healthy`; backup status returns `core_blockers=0` and `escrow_missing=5`; all public route/TLS and momo parity checks pass. K3s topology spread constraints remain configured for API/Web/Worker, but API pods are currently both on `mon` while Web remains split `mon` / `mon1`; this is service-safe drift, not a cold-start blocker. |
---