docs(ops): clarify the reboot Plan B degraded path [skip ci]
@@ -1,3 +1,26 @@
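## 2026-06-18|Reboot SOP: Plan B degraded-operation path made explicit

**Background**: The commander asked whether a Plan B exists and required that the reboot SOP stop describing status as "almost there, missing this, missing that". This round writes Plan B formally into the cold-start / host reboot SOP, so that before the next maintenance the team can directly determine Plan A, Plan B, the stop-lines, the degraded level, and the ceiling on what may be declared complete.

**Completed**:

- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is bumped to `v1.22` and gains `Plan B: Degraded Operation and Recovery Path`.
- Plan B is fixed as a degraded-operation and recovery path; it is neither a reboot procedure that bypasses preflight nor an authorization for runtime write operations.
- Added Plan B red lines: no false green, no silencing of legitimate red lights, no unauthorized Docker / Nginx / firewall / K8s / secret / destructive operations, no release of high-risk automation.
- Added Plan B triggers: backup/offsite still running, a P0 host unreachable for 15 minutes, 188 data unhealthy, 110 registry / observability unhealthy, a single K3s control-plane node degraded, route-only green, cold-start `WARN>0`, credential escrow missing.
- Added host-level and layer-level fallback paths for 110 / 120 / 121 / 188 / K3s / Public gateway, plus the conditions for returning to Plan A.
- Added service levels B0-B5: `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, `B5_DR_COMPLETE`.
- Added the T+0 / T+5 / T+15 / T+30 / T+60 / T+120 fallback timeline, so the maintenance window is neither extended indefinitely nor forced green through unauthorized operations.
- Added Plan B closeout states: `RETURNED_TO_PLAN_A`, `SERVICE_AVAILABLE_DEGRADED`, `OPEN_INCIDENT_REQUIRED`.
- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` has P3-008 and the progress record in sync, and requires every future reboot to be checked against SOP §1.4, recording the actual Plan B trigger, the B0-B5 level, the blocker, and whether Plan A was resumed.

**Completion sync**:

- Reboot SOP Plan B documentation: `0% -> 100%`.
- P3 docs / automation contracts: stays at `100%`, with executable Plan B decision criteria added.
- Overall recovery readiness: carries over the latest recorded live evidence, `98% SERVICE_AVAILABLE_ARGOCD_HEALTHY_DR_ESCROW_BLOCKED`. No live check was rerun this round, so the documentation update must not be treated as a new service green.
- DR complete: still blocked. The credential escrow missing count must still be reported from live evidence; the latest documented baseline still records `5`.

**Boundary**: This round only updates the SOP / workplan / LOGBOOK. No SSH, no host reboot, no change to Docker / systemd / Nginx / firewall / K8s / ArgoCD, no workflow dispatch, no secret read or stored, no active scan, no Telegram message sent, no runtime gate changed. Before the next real reboot, a same-day live preflight must still be rerun before choosing the Plan A or Plan B entry.

## 2026-06-18|AI Agent automation: full worklist and re-inventory of gaps against the mainstream

**Background**: The commander asked that every previously discussed work item covering OpenClaw, Hermes, NemoTron, 12-Agent War Room, Telegram Bot, daily / weekly / monthly reports, MCP, RAG, version updates, and automation of tools / packages / services / hosts / projects / website front and back ends be itemized in detail and kept moving. Because `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md` forbids creating isolated standalone MD files, this round consolidates the full list into the existing `docs/ai/AI_AGENT_AUTOMATION_WORKLIST_2026-06-04.md` and syncs MASTER §8.
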
@@ -1,7 +1,7 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP

> Version: v1.20
> Last updated: 2026-06-14 Asia/Taipei
> Version: v1.22
> Last updated: 2026-06-18 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

---

@@ -202,6 +202,91 @@ Result: GREEN
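
GO only means the specified scope is allowed to run; it does not mean done. Completion must always go back to the §15 Done Criteria.

### 1.4 Plan B: Degraded Operation and Recovery Path

Plan B is not a separate reboot procedure that can bypass preflight, nor is it authorization to modify hosts ad hoc during an incident. Plan B applies when Plan A cannot reach `FULL_STACK_GREEN` within the maintenance window. It predefines the minimum acceptable service target, the stop-lines, the degraded levels, the host paths, and the conditions for returning to Plan A.

The goal of Plan A is:

```text
B4_FULL_STACK_GREEN: cold-start scorecard WARN=0 / BLOCKED=0; backup, offsite, DB, alert, scheduler, K3s, and public route are all green.
```

The goal of Plan B is:

```text
Protect core services and data integrity first, do not widen the blast radius, do not misreport partial availability as full-stack green, and leave the next blocker as a trackable ticket.
```

#### Plan B Red Lines

| Red line | Specific requirement |
|----------|----------------------|
| No false green | Do not claim full recovery on the basis of route 200, pod up, container up, UI visible, CD success, or a single smoke pass. |
| Do not silence legitimate red lights | Red lights for 120 / backup / credential escrow / alert / scheduler must be kept if they reflect a real gap. |
| No unauthorized write operations | Without a maintenance window and human approval: do not restart the Docker daemon, do not reload Nginx, do not change firewall / iptables, do not `kubectl patch` live, do not read secrets, and do not perform destructive recovery. |
| No release of high-risk automation | CD runner, AI auto-remediation, heavy crawler, batch import, and repair bot stay frozen until the dependency chain is green. |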
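
The "no unauthorized write operations" red line can be read as a fail-closed gate. The following is a minimal Python sketch of that reading, not part of the SOP. The operation labels, the read-only list, and the `Approval` record are hypothetical names chosen for illustration, since the table names the operations only in prose.

```python
from dataclasses import dataclass

# Hypothetical labels: the red-line table describes these operations in prose.
READ_ONLY_OPERATIONS = frozenset({"ping", "read_journal", "read_scorecard"})
RED_LINE_WRITE_OPERATIONS = frozenset({
    "restart_docker_daemon",
    "reload_nginx",
    "change_firewall",
    "kubectl_patch_live",
    "read_secret",
    "destructive_recovery",
})


@dataclass(frozen=True)
class Approval:
    """Hypothetical record of the two preconditions named in the table."""
    maintenance_window_open: bool
    human_approved: bool


def is_permitted(operation: str, approval: Approval) -> bool:
    """Return True only when the operation stays inside the Plan B red lines."""
    if operation in READ_ONLY_OPERATIONS:
        return True
    if operation in RED_LINE_WRITE_OPERATIONS:
        # Both conditions are required, not either one.
        return approval.maintenance_window_open and approval.human_approved
    # Unknown operations fail closed: treat them as unauthorized writes.
    return False
```

Failing closed on unknown labels is a design assumption of this sketch: an operation the list does not know about is refused instead of being waved through.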

#### Plan B Triggers

| Trigger | Immediate action | Maximum claimable level |
|---------|------------------|-------------------------|
| 03:00 offsite sync, 02:00 backup, or full verifier still running | Postpone the reboot; wait read-only until it finishes | `B0_ABORTED_BEFORE_REBOOT` |
| Any P0 host still unreachable by ping / SSH 15 minutes after reboot | Stop releasing the next layer and start the corresponding host path | `B1_HOST_RECOVERY_ONLY` |
| Any core data plane on 188 (PostgreSQL / Redis / momo / SignOz) unhealthy | Freeze K3s deploy, runner, and AI auto-remediation | `B1_HOST_RECOVERY_ONLY` |
| 110 Harbor / Gitea / Alertmanager / Prometheus unhealthy | Freeze CD / deploy / image pull related flows | `B2_CORE_SERVICE_READY` or below |
| 120 or 121 unhealthy on its own, but the other control-plane node can carry the load | Enter single-node K3s service mode and keep the HA red light | `B2_CORE_SERVICE_READY` |
| Public route available, but any of DB / backup / alert / schedule not green | Mark `ROUTE_GREEN_ONLY`; do not claim service green | `B2_CORE_SERVICE_READY` |
| cold-start `WARN>0`, `BLOCKED=0` | May claim the service is available but still degraded | `B3_SERVICE_AVAILABLE_DEGRADED` |
| credential escrow missing | Service recovery may be completed, but DR complete may not be claimed | `B4_FULL_STACK_GREEN` or below; `B5_DR_COMPLETE` is forbidden |
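
When several triggers are active at once, the lowest ceiling in the table is the one that binds. The sketch below illustrates that rule in Python. The trigger keys are hypothetical labels invented for this example, and "`B2_CORE_SERVICE_READY` or below" is modelled simply as a B2 ceiling.

```python
# Levels in ascending order, as named in the SOP.
LEVELS = (
    "B0_ABORTED_BEFORE_REBOOT",
    "B1_HOST_RECOVERY_ONLY",
    "B2_CORE_SERVICE_READY",
    "B3_SERVICE_AVAILABLE_DEGRADED",
    "B4_FULL_STACK_GREEN",
    "B5_DR_COMPLETE",
)

# Hypothetical trigger keys; each ceiling is copied from the trigger table.
TRIGGER_CEILING = {
    "backup_or_verifier_running": "B0_ABORTED_BEFORE_REBOOT",
    "p0_host_unreachable_15m": "B1_HOST_RECOVERY_ONLY",
    "data_plane_188_unhealthy": "B1_HOST_RECOVERY_ONLY",
    "registry_observability_110_unhealthy": "B2_CORE_SERVICE_READY",
    "single_k3s_node_unhealthy": "B2_CORE_SERVICE_READY",
    "route_green_only": "B2_CORE_SERVICE_READY",
    "cold_start_warn": "B3_SERVICE_AVAILABLE_DEGRADED",
    "credential_escrow_missing": "B4_FULL_STACK_GREEN",
}


def claim_ceiling(active_triggers):
    """Return the highest level that may be claimed given the active triggers."""
    ceilings = [LEVELS.index(TRIGGER_CEILING[t]) for t in active_triggers]
    # With no active trigger there is no Plan B cap; otherwise the lowest cap wins.
    return LEVELS[min(ceilings, default=len(LEVELS) - 1)]
```

For example, `claim_ceiling(["cold_start_warn", "credential_escrow_missing"])` yields `B3_SERVICE_AVAILABLE_DEGRADED`, because the WARN ceiling is lower than the escrow ceiling. An unrecognized trigger key raises `KeyError` instead of being silently ignored.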

#### Plan B Host Paths

| Failure domain | Degraded path | Conditions for returning to Plan A |
|----------------|---------------|------------------------------------|
| 110 failure | Keep 120 / 121 K3s and 188 data; freeze CD, runner, Harbor image push, and Alertmanager outbound; first confirm whether Gitea / Harbor / Prometheus / Alertmanager is only a host-service-layer problem. | 110 `HOST_READY`; Harbor / Gitea / Prometheus / Alertmanager healthy; backup-status shows no 110 core blocker; cold-start 110 checks green. |
| 120 failure | 121 carries the K3s control-plane; keep the `120_DEGRADED` red light; do not claim K3s AA; do not run the 120 backup fix; use console / fsck recovery if necessary. | 120 ping / SSH OK; root filesystem rw; `k3s active`; node `mon Ready`; the backup-configs / backup-all / offsite / cold-start chain all pass. |
| 121 failure | 120 carries the K3s control-plane; keep the `121_DEGRADED` red light; do not claim workload balanced; avoid non-essential rollouts. | 121 ping / SSH OK; `k3s active`; node `mon1 Ready`; API/Web placement back to max skew <= 1. |
| 188 failure | Protect the data plane first: PostgreSQL, Redis, momo DB, SignOz, Ollama / AI provider; freeze any batch / crawler / AI flow that writes data or generates heavy load. | 188 `HOST_READY`; PostgreSQL / Redis / momo parity / SignOz / AI provider route healthy; backup/status shows no 188 core blocker. |
| K3s degraded | Keep the existing healthy Pods; first inspect nodes / pods / events / VIP / NodePort; avoid blindly restarting k3s or deleting Pods. | `mon` / `mon1` Ready; API/Web/Worker rollout healthy; public API/Web / alert webhook / scorecard pass. |
| Public gateway degraded | Protect internal API / VIP / data; do not reload Nginx or change DNS/TLS/certbot/firewall unless there is an owner-approved maintenance window. | Nginx config owner evidence, route smoke, TLS / ACME, rollback owner, and post-check plan all pass. |
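
The 121 return condition "API/Web placement back to max skew <= 1" can be checked mechanically once the pod-to-node placement is known. The sketch below assumes that "max skew" means the difference in pod count between the most and least loaded node for one workload, in the spirit of Kubernetes topology spread; the SOP does not define the term here. The node names `mon` and `mon1` come from the table, while the function names are illustrative.

```python
from collections import Counter


def max_skew(pod_nodes, nodes):
    """Pod-count difference between the most and least loaded node.

    pod_nodes: node name for each running pod of one workload.
    nodes: non-empty list of nodes that are expected to share the workload.
    """
    counts = Counter(pod_nodes)
    per_node = [counts.get(node, 0) for node in nodes]
    return max(per_node) - min(per_node)


def placement_ok(pod_nodes, nodes=("mon", "mon1"), limit=1):
    """True when the workload is spread within the allowed skew."""
    return max_skew(pod_nodes, nodes) <= limit
```

Three pods split as `["mon", "mon1", "mon1"]` pass the check, while three pods all on `mon1` do not. Pods reported on nodes outside `nodes` are ignored by this sketch.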
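
#### Plan B Service Levels

All status reports during maintenance must use one of the following levels; phrases such as "almost done" or "should be fine" are forbidden:

| Level | Meaning | Minimum evidence |
|-------|---------|------------------|
| `B0_ABORTED_BEFORE_REBOOT` | Preflight found a NO-GO; the reboot is cancelled or postponed | No runtime write operations were performed; the NO-GO blocker is recorded. |
| `B1_HOST_RECOVERY_ONLY` | Only host-layer recovery is complete | Ping / SSH / boot time / basic systemd state of the target host can be determined; services are not yet fully verified. |
| `B2_CORE_SERVICE_READY` | Core services are available, but the full dependency chain has not passed | Public route, API, DB, or the main K3s surface is available; backup / alert / scheduler / scorecard are not yet all green. |
| `B3_SERVICE_AVAILABLE_DEGRADED` | Core services are available; cold-start has no hard block but still has WARN | cold-start `BLOCKED=0`; each WARN is explicitly listed and not silenced. |
| `B4_FULL_STACK_GREEN` | Recovery for this reboot is complete | cold-start `PASS>0 WARN=0 BLOCKED=0`; backup / offsite / DB / alert / scheduler all green. |
| `B5_DR_COMPLETE` | DR is complete | `B4` plus credential escrow missing `0`, with complete restore / escrow / offsite evidence. |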
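
The minimum-evidence column for B2 through B5 can be expressed as a small decision function. This is a hedged sketch, not the scorecard's actual logic. It assumes the host layer and core services were already confirmed separately (B0 and B1 are decided before any scorecard exists), it treats an empty scorecard as no evidence, and its parameter names are illustrative.

```python
from enum import IntEnum


class Level(IntEnum):
    B0_ABORTED_BEFORE_REBOOT = 0
    B1_HOST_RECOVERY_ONLY = 1
    B2_CORE_SERVICE_READY = 2
    B3_SERVICE_AVAILABLE_DEGRADED = 3
    B4_FULL_STACK_GREEN = 4
    B5_DR_COMPLETE = 5


def highest_supported_level(passed, warn, blocked, chain_green,
                            escrow_missing, dr_evidence_complete):
    """Upper bound on the level the cold-start evidence supports.

    chain_green: backup / offsite / DB / alert / scheduler are all green.
    dr_evidence_complete: restore / escrow / offsite evidence is complete.
    """
    if blocked > 0 or not chain_green:
        return Level.B2_CORE_SERVICE_READY
    if passed + warn == 0:
        # Assumption: an empty scorecard proves nothing, so stay at B2.
        return Level.B2_CORE_SERVICE_READY
    if warn > 0:
        return Level.B3_SERVICE_AVAILABLE_DEGRADED
    if escrow_missing > 0 or not dr_evidence_complete:
        return Level.B4_FULL_STACK_GREEN
    return Level.B5_DR_COMPLETE
```

With two residual warnings and a credential escrow missing count of `5`, the function stays at `B3_SERVICE_AVAILABLE_DEGRADED` regardless of how many checks passed.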

#### Plan B Execution Timeline

```text
T+0   freeze CD / runner / AI auto-remediation / heavy batch; preserve console, journal, backup, and scorecard evidence.
T+5   determine HOST_POWERED / HOST_BOOTED / HOST_READY; if any P0 host is unreachable, enter the host Plan B.
T+15  if 188 data or 110 registry / observability is unhealthy, stop releasing K3s, runner, and AI.
T+30  if the public route is available but DB / backup / alert / scheduler have not passed, report B2 only; full green is not allowed.
T+60  the cold-start scorecard must be run; if still WARN / BLOCKED, record the Plan B level and the next blocker.
T+120 if B4 is still not reached, open an incident / follow-up; do not extend the window to perform unauthorized runtime write operations.
```
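
The timeline above lends itself to a simple checkpoint table. The following Python sketch shows one way to list the checkpoints already due at a given elapsed time and the next one still ahead. The short labels are paraphrases written for this example, and the helper names are hypothetical.

```python
# (minutes after T+0, short paraphrase of the timeline entry)
CHECKPOINTS = (
    (0, "freeze automation and preserve evidence"),
    (5, "judge host powered / booted / ready"),
    (15, "data and registry stop-line"),
    (30, "route-only guard: report B2 at most"),
    (60, "run the cold-start scorecard"),
    (120, "open incident or follow-up if below B4"),
)


def due_checkpoints(elapsed_minutes):
    """Labels of every checkpoint whose time has been reached."""
    return [label for minute, label in CHECKPOINTS if minute <= elapsed_minutes]


def next_checkpoint(elapsed_minutes):
    """The first checkpoint still in the future, or None after T+120."""
    for minute, label in CHECKPOINTS:
        if minute > elapsed_minutes:
            return minute, label
    return None
```

Returning `None` after T+120 mirrors the rule that the window is not extended: past the last checkpoint there is nothing further to wait for, only the incident or follow-up to open.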
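
#### Plan B Closeout Conditions

Plan B may only close out in one of the following three states:

| Closeout state | Condition | Next step |
|----------------|-----------|-----------|
| `RETURNED_TO_PLAN_A` | Blockers cleared and the full Plan A chain verification is complete | Update the reboot ledger; record actual elapsed time and deviations from the SOP. |
| `SERVICE_AVAILABLE_DEGRADED` | Service is available but the scorecard still shows WARN, or a DR / escrow / governance gate is incomplete | Keep the red light; open the next owner / evidence / maintenance task. |
| `OPEN_INCIDENT_REQUIRED` | Any of P0 host, data, K3s, gateway, backup, or alert is still hard blocked | Stop the maintenance window, preserve evidence, and escalate to incident handling. |

The professional standard for Plan B is not "guaranteed green every time". It is that after every reboot the team can quickly tell which layer has been reached, what cannot be claimed, what the next blocker is, and whether it is safe to return to Plan A.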
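
The three closeout states can be derived in a fixed priority order: hard blocks first, then residual warnings or open gates, then a clean return. The sketch below is one possible encoding under stated assumptions: the parameter names are hypothetical, and an unverified Plan A chain is conservatively reported as degraded, which is this example's choice and is not spelled out in the table.

```python
def closeout_state(hard_blocked, scorecard_warn, open_gates,
                   plan_a_chain_verified):
    """Pick the Plan B closeout state.

    hard_blocked: domains still hard blocked (host, data, K3s, gateway, ...).
    scorecard_warn: number of WARN entries left on the cold-start scorecard.
    open_gates: DR / escrow / governance gates that are still incomplete.
    plan_a_chain_verified: full Plan A chain verification has been completed.
    """
    if hard_blocked:
        return "OPEN_INCIDENT_REQUIRED"
    if scorecard_warn > 0 or open_gates or not plan_a_chain_verified:
        # Assumption: an unverified chain is reported as degraded, never green.
        return "SERVICE_AVAILABLE_DEGRADED"
    return "RETURNED_TO_PLAN_A"
```

While credential escrow remains an open gate, this function cannot return `RETURNED_TO_PLAN_A`, which matches the rule that documentation alone never upgrades the declared state.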

---

## 2. Automation Freeze

@@ -1363,7 +1448,21 @@ SOP update:
| Remaining gate | The official `km-vectorize-29689620` Job is still failed; the credential escrow missing count is still `5` |
| SOP change | v1.20 records the no-regression readback after the high-value configuration Owner Packet front-end sync, and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED` |

### 14.20 Post-Reboot Timeline Verification
### 14.21 2026-06-18 Plan B Degraded-Operation Path

The 2026-06-18 change is neither a host reboot nor a new live recovery readback. It writes the Plan B requested by the commander into the SOP in executable form. This anchor is used at the next reboot to check whether §1.4 was followed: deciding Plan A / Plan B first, then the degraded level, the stop-lines, and the conditions for returning to Plan A.

| Item | 2026-06-18 Plan B baseline |
|------|----------------------------|
| SOP version | `v1.22` |
| Plan B trigger | backup/offsite/verifier running, P0 host unreachable for 15 minutes, 188 data unhealthy, 110 registry / observability unhealthy, single K3s node degraded, route-only green, cold-start WARN, credential escrow missing |
| Service levels | `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, `B5_DR_COMPLETE` |
| Host fallback paths | 110 / 120 / 121 / 188 / K3s / Public gateway each have a degraded path and conditions for returning to Plan A |
| Timeline | `T+0` freeze, `T+5` host boot, `T+15` data / registry stop-line, `T+30` route-only guard, `T+60` cold-start scorecard, `T+120` incident / follow-up |
| Closeout states | `RETURNED_TO_PLAN_A`, `SERVICE_AVAILABLE_DEGRADED`, `OPEN_INCIDENT_REQUIRED` |
| SOP change | v1.22 adds Plan B; Plan B must not be treated as runtime write authorization, and documenting Plan B must not be used to claim a new service green, full-stack green, or DR complete |

### 14.22 Post-Reboot Timeline Verification

After every reboot, advance along the timeline; do not wait until the end to make a single judgment.

@@ -15,7 +15,7 @@
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P2 service / data truth | VERIFIED_ARGOCD_HEALTHY_WITH_RESIDUAL_WARNINGS | 99% | 2026-06-15 03:11 cold-start is degraded by two warnings only; public route/API smoke is green, VIP API/Web are reachable, momo current-month parity remains covered by the scorecard, schedules/services are mostly green, and 110 failed units remain `0`. `km-vectorize-29691060` succeeded, ArgoCD is `Healthy`, and API/Web remain split across 120 / 121. Remaining scorecard warnings are 188 momo scheduler registration/activity not confirmed and retained old K8s failed Job evidence. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.21, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, post-CD no-regression readback, P2-135 deploy recovery readback, P2-136 / recovery readback after the AI Agent activity production deployment, P2-137 / CI smoke timeout recovery readback, P2-143 recovery readback after owner response precheck, P2-144 recovery readback after owner response readback, P2-145 recovery readback after owner response acceptance threshold, recovery readback after IwoooS P0 configuration-control prioritization, recovery readback after the high-value configuration Owner Packet front-end sync, and the `km-vectorize` official success readback have all been updated; this workstation cannot run the Ansible syntax check. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.22, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, post-CD no-regression readback, P2-135 deploy recovery readback, P2-136 / recovery readback after the AI Agent activity production deployment, P2-137 / CI smoke timeout recovery readback, P2-143 recovery readback after owner response precheck, P2-144 recovery readback after owner response readback, P2-145 recovery readback after owner response acceptance threshold, recovery readback after IwoooS P0 configuration-control prioritization, recovery readback after the high-value configuration Owner Packet front-end sync, and the `km-vectorize` official success readback have all been updated; this workstation cannot run the Ansible syntax check. |

Full cold-start may be declared green only for the latest verified evidence set. As of 2026-06-15 03:11, `km-vectorize` and ArgoCD are healthy, but the latest scorecard is still `DEGRADED` by residual warnings. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

@@ -175,7 +175,7 @@ Next: <single next action>
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.20 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, 2026-06-14 110 failed-unit cleanup anchor, 2026-06-14 post-CD recovery readback, P2-135 deploy recovery readback, P2-136 / recovery readback after the AI Agent activity production deployment, P2-137 / CI smoke timeout recovery readback, P2-143 recovery readback after owner response precheck, P2-144 recovery readback after owner response readback, P2-145 recovery readback after owner response acceptance threshold, recovery readback after IwoooS P0 configuration-control prioritization, recovery readback after the high-value configuration Owner Packet front-end sync, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, and allowed declaration wording. | Use v1.20 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10 / §14.11 / §14.12 / §14.13 / §14.14 / §14.15 / §14.16 / §14.17 / §14.18 / §14.19 / §14.20. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remain red. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.22 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, B0-B5 service levels, T+0/T+120 fallback timeline, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, 2026-06-14 110 failed-unit cleanup anchor, 2026-06-14 post-CD recovery readback, P2-135 deploy recovery readback, P2-136 / recovery readback after the AI Agent activity production deployment, P2-137 / CI smoke timeout recovery readback, P2-143 recovery readback after owner response precheck, P2-144 recovery readback after owner response readback, P2-145 recovery readback after owner response acceptance threshold, recovery readback after IwoooS P0 configuration-control prioritization, recovery readback after the high-value configuration Owner Packet front-end sync, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, and allowed declaration wording. | Use v1.22 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, and blockers against §1.4 plus §14.8 / §14.9 / §14.10 / §14.11 / §14.12 / §14.13 / §14.14 / §14.15 / §14.16 / §14.17 / §14.18 / §14.19 / §14.20 / §14.21 / §14.22. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`, and blocks false green while escrow, scorecard WARN, or governed runtime debt remain red. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |

@@ -213,6 +213,16 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy

## 9. Progress Updates

```text
2026-06-18 11:41 Asia/Taipei
Phase: P3
Before: P3 100%
After: P3 100%
Evidence: docs/runbooks/FULL-STACK-COLD-START-SOP.md updated to v1.22 with explicit Plan B degraded-operation path, B0-B5 service levels, Plan B trigger table, host-specific fallback routes for 110/120/121/188/K3s/public gateway, T+0/T+120 fallback timeline, and Plan B closeout states. This workplan now requires every future reboot record to compare actual timing and blockers against SOP §1.4, not only the Plan A cold-start chain.
Blocked: no for documentation. Live reboot authorization still requires fresh same-day preflight before any maintenance window; DR complete remains blocked while credential escrow missing count is 5.
Next: before the next host reboot, rerun live preflight, choose Plan A or Plan B entry criteria, then record final level as B0/B1/B2/B3/B4/B5 with the exact blocker.
```

```text
2026-06-14 18:15 Asia/Taipei
Phase: P0/P1/P2/P3