docs(ops): record P2-135 recovery readback [skip ci]
@@ -53,6 +53,24 @@
**Next step**:
- `P2-136`: can now take over as the next read-only gate after the formal P2-135 verification; it still must not directly enable the result capture writer, learning writer, PlayBook trust writer, reviewer queue write, Gateway queue write, Telegram send, Bot API call, or production write.

## 2026-06-14|P2-135 post-deploy recovery readback: no regression

**Background**: Another work window has completed the formal P2-135 verification and pushed it to `gitea/main=5bad267e`. This round performs only a read-only readback from the reboot / cold-start / backup recovery perspective; it does not re-modify the P2-135 governance documents or the release authorization readback gate.

**Read-only verification:**
- ArgoCD `awoooi-prod`: `Synced / Degraded`, revision `5bad267ebad118c69a5a8f6d624503df7069f931`; Degraded is still caused by the old failed Job `km-vectorize-29689620` / the `lastSuccessfulTime` gate.
- `km-vectorize` CronJob live: image `192.168.0.110:5000/awoooi/api:280e0fbef0d5dccb10f1efe2cc18cf423544254e`, `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, `terminationMessagePolicy: FallbackToLogsOnError`, schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`.
- K3s placement: API pods are spread across `mon` / `mon1`, Web pods are spread across `mon` / `mon1`, and the single-replica Worker is on `mon1`; all are Running, bad pods `0`.
- 09:26 `backup-status.sh --no-notify`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, offsite/rclone fresh.
- 09:26 110 host readback: `systemctl --failed` returned `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive`.
- 09:26 first cold-start run: `stock.wooo.work` briefly returned `502` during the stockplatform-v2 rollout warmup, so the result was `PASS=80 WARN=2 BLOCKED=1`; a direct recheck of `stock.wooo.work` / TLS immediately returned `200`.
- 09:27 second cold-start run: `PASS=82 WARN=1 BLOCKED=0`; public routes / TLS / momo DB parity / backup exporters / 120/121 nodes / AWOOOI API-Web all passed; the only WARN is still the K8s failed Job `km-vectorize-29689620`.

**Verdict:**
- `Overall recovery readiness` stays at `97%`, not `100%`.
- It is valid to claim that the P2-135 post-deploy recovery readback shows no regression; the brief `stock` `502` turned green in the second scorecard run.
- It is not valid to claim full cold-start green, ArgoCD Healthy, or DR complete; the next real completion gates remain the next official 03:00 `km-vectorize` run successfully updating `lastSuccessfulTime`, and the 5 credential escrow non-secret evidence markers.
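The `lastSuccessfulTime` gate named in the verdict can be sketched as a plain timestamp comparison. This is a minimal sketch, not the project's actual check: the helper names `parse_k8s_time` and `official_run_succeeded` are hypothetical, the rule "the latest success must not predate the latest schedule time" is an assumed reading of the gate, and Asia/Taipei is modeled as a fixed UTC+8 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Taipei is modeled as a fixed UTC+8 offset (no DST).
TAIPEI = timezone(timedelta(hours=8))


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp such as '2026-06-13T19:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def official_run_succeeded(last_schedule: str, last_successful: str) -> bool:
    """Assumed gate: the latest success must not predate the latest schedule."""
    return parse_k8s_time(last_successful) >= parse_k8s_time(last_schedule)


# Values copied from the CronJob readback in this entry.
last_schedule = "2026-06-13T19:00:00Z"
last_successful = "2026-06-04T11:00:37Z"

# 19:00 UTC on 06-13 is the 03:00 Asia/Taipei slot on 06-14.
print(parse_k8s_time(last_schedule).astimezone(TAIPEI).isoformat())
print(official_run_succeeded(last_schedule, last_successful))
```

With the recorded values this prints `2026-06-14T03:00:00+08:00` and `False`, which matches the entry: the 03:00 slot was scheduled, but the last success is still the 2026-06-04 run.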
## 2026-06-14|Post-CD reboot recovery readback: no regression

**Background**: After `gitea/main` advanced from `2b22c9d6 docs(ops): record 110 fwupd cleanup [skip ci]` to deploy marker `18b867c3 chore(cd): deploy e0a6d33 [skip ci]`, it must be confirmed that the P2-134-related CD did not cause the host reboot / cold-start SOP state to regress.
@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP

> Version: v1.12
> Version: v1.13
> Last updated: 2026-06-14 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -10,7 +10,7 @@
This section is the first decision anchor after every handover, boot, shutdown, or reboot. If its date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

| Item | 2026-06-14 08:40 Asia/Taipei live result | Verdict |
| Item | 2026-06-14 09:27 Asia/Taipei live result | Verdict |
|------|-------------------------------------------|------|
| Overall recovery readiness | `97%` | `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED` |
| P0 host / K3s recovery | `100%` | `DONE` |
@@ -23,14 +23,14 @@
| 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`; latest CD kept API/Web split across 120 / 121; bad pods `0`; retained failed Job `km-vectorize-29689620` is the only WARN | `GREEN_WORKLOAD_BALANCED_WITH_KM_WARN` |
| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; final repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` moves deploy trust to `/home/wooo/.ssh/deploy_known_hosts`; 01:28 global `known_hosts` still contains 120 / 188 and was not clobbered by deploy marker `e4a349bc` | `GREEN_WITH_GUARDRAIL` |
| Backup status | 08:40 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22` | `GREEN_WITH_DR_ESCROW_WARNING` |
| Backup status | 09:26 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22` | `GREEN_WITH_DR_ESCROW_WARNING` |
| Offsite sync / verify | 01:28 textfile: `awoooi_backup_offsite_remote_verify_ok=1`, `full_verify_fresh=1`, all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log is 2026-06-12 07:20 | `GREEN` |
| Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | `GREEN_WITH_EXPECTED_REDLIGHTS` |
| Cold-start scorecard | 08:40 post-CD read-only scorecard: `PASS=82 WARN=1 BLOCKED=0`; remaining WARN is K8s failed Job `km-vectorize-29689620` | `DEGRADED_NO_BLOCKERS` |
| Cold-start scorecard | 09:27 post-P2-135 read-only scorecard rerun: `PASS=82 WARN=1 BLOCKED=0`; remaining WARN is K8s failed Job `km-vectorize-29689620`. 09:26 first run saw transient `stock.wooo.work` `502` during stockplatform-v2 warmup, but direct route/TLS recheck and the rerun returned `200`. | `DEGRADED_NO_BLOCKERS` |
| momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1223`; detector widened and deployed to 110 | `GREEN` |
| ArgoCD app health | Latest live app revision `18b867c3` remains `Synced / Degraded` because official Job `km-vectorize-29689620` failed with `BackoffLimitExceeded`. Root-cause candidate is missing `X-Project-ID` on the internal `/api/v1/knowledge/embed-all` call; deployed CronJob image `e0a6d339` sends `X-Project-ID` and live CronJob has `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, and `terminationMessagePolicy: FallbackToLogsOnError`. | `TENANT_CONTEXT_FIX_LIVE_WAITING_OFFICIAL_RUN` |
| Workload balancing | Live API/Web/Worker/CronJob image is `e0a6d339`; API/Web pods remain split across `mon` / `mon1`, Worker single replica remains healthy on `mon` | `GREEN` |
| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1251`; detector widened and deployed to 110 | `GREEN` |
| ArgoCD app health | Latest live app revision `5bad267e` remains `Synced / Degraded` because official Job `km-vectorize-29689620` failed with `BackoffLimitExceeded`. Root-cause candidate is missing `X-Project-ID` on the internal `/api/v1/knowledge/embed-all` call; deployed CronJob image `280e0fbe` sends `X-Project-ID` and live CronJob has `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, and `terminationMessagePolicy: FallbackToLogsOnError`. | `TENANT_CONTEXT_FIX_LIVE_WAITING_OFFICIAL_RUN` |
| Workload balancing | Live API/Web/Worker/CronJob image is `280e0fbe`; API/Web pods remain split across `mon` / `mon1`, Worker single replica remains healthy on `mon1` | `GREEN` |
| Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |

Release rule:
@@ -44,7 +44,7 @@ Do not declare DR scorecard complete while credential escrow markers are missing
2026-06-14 live rule:

```text
110 / 120 / 121 / 188 core service recovery remains available, but the latest 08:40 scorecard is DEGRADED because `WARN=1`.
110 / 120 / 121 / 188 core service recovery remains available, but the latest 09:27 scorecard is DEGRADED because `WARN=1`.
GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
@@ -1206,7 +1206,25 @@ SOP update:
| Remaining gate | `km-vectorize-29689620` official Job is still failed; Credential escrow missing count is still `5` |
| SOP change | v1.12 records the post-CD no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED` |

### 14.12 Post-reboot timeline verification
### 14.12 2026-06-14 P2-135 post-deploy recovery readback

The 2026-06-14 09:27 change is not a host reboot; it confirms that the reboot recovery baseline did not regress after the P2-135 deploy and formal verification. This anchor also records how to judge a brief `502` during the stockplatform-v2 rollout warmup: recheck the route / TLS directly and rerun the full cold-start; escalate to a persistent public route blocker only if the rerun still fails.

| Item | 2026-06-14 09:27 post-P2-135 baseline |
|------|----------------------------------------|
| Gitea / ArgoCD | Gitea main `5bad267e`, ArgoCD revision `5bad267e`, sync `Synced`, health `Degraded` |
| K3s image readback | API/Web/Worker/CronJob image tag `280e0fbef0d5dccb10f1efe2cc18cf423544254e` |
| CronJob readback | `km-vectorize` has `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, `terminationMessagePolicy: FallbackToLogsOnError`, `lastScheduleTime=2026-06-13T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z` |
| K3s placement | API pods split `mon` / `mon1`, Web pods split `mon` / `mon1`, Worker single replica on `mon1` |
| First cold-start | 09:26 first run saw `stock.wooo.work` `502` while stockplatform-v2 containers were less than one minute old; direct route and TLS recheck returned `200` |
| Final cold-start | 09:27 rerun returned `PASS=82 WARN=1 BLOCKED=0` |
| Public routes | Final scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5` |
| 110 host | `systemctl --failed` returned `0 loaded units listed`; `fwupd-refresh.timer` remains `disabled / inactive` |
| Remaining gate | `km-vectorize-29689620` official Job is still failed; Credential escrow missing count is still `5` |
| SOP change | v1.13 records the P2-135 post-deploy no-regression readback and keeps the declaration ceiling at `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED` |
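The transient-`502` rule recorded in this section (recheck the route, rerun the full cold-start, escalate only if the rerun still fails) can be sketched as a small classifier. This is an illustrative sketch only: `RouteEvidence`, `classify_route`, and the three result labels are hypothetical names, and treating HTTP `200` as the sole passing status is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEvidence:
    """Status codes observed for one public route (field names are hypothetical)."""
    first_run_status: int
    direct_recheck_status: int  # kept as evidence; the decision uses the rerun
    rerun_status: int


def classify_route(evidence: RouteEvidence) -> str:
    """Apply the rule: only a failing rerun escalates to a persistent blocker."""
    if evidence.first_run_status == 200:
        return "OK"
    if evidence.rerun_status == 200:
        return "TRANSIENT_WARMUP"
    return "PERSISTENT_PUBLIC_ROUTE_BLOCKER"


# stock.wooo.work as recorded: 502 on the 09:26 run, 200 on recheck and rerun.
stock = RouteEvidence(first_run_status=502, direct_recheck_status=200, rerun_status=200)
print(classify_route(stock))
```

For the recorded `stock.wooo.work` evidence this prints `TRANSIENT_WARMUP`. The direct recheck is stored but not used for the decision, because the text ties escalation to the rerun result alone.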
### 14.13 Post-reboot timeline verification

After every reboot, advance along the timeline step by step; do not wait until the end to make a single judgment.
@@ -11,13 +11,13 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | 97% | 2026-06-14 08:40 post-CD cold-start scorecard is `PASS=82 WARN=1 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public route/API smoke remains green, deploy marker `18b867c3` put API/Web/Worker/CronJob image `e0a6d339` live, and API/Web remain live-verified split across 120 / 121. The 110 `fwupd` failed-unit warning remains cleared with `fwupd-refresh.timer` disabled/inactive and `systemctl --failed` clean. Full cold-start still cannot be declared green because the official `km-vectorize-29689620` Job failed with `BackoffLimitExceeded`; DR remains blocked by five missing credential escrow evidence markers. |
| Overall recovery readiness | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | 97% | 2026-06-14 09:27 post-P2-135 cold-start scorecard rerun is `PASS=82 WARN=1 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public route/API smoke remains green after a transient stockplatform-v2 warmup `502`, deploy marker `8d575c1a` / docs commit `5bad267e` put API/Web/Worker/CronJob image `280e0fbe` live, and API/Web remain live-verified split across 120 / 121. The 110 `fwupd` failed-unit warning remains cleared with `fwupd-refresh.timer` disabled/inactive and `systemctl --failed` clean. Full cold-start still cannot be declared green because the official `km-vectorize-29689620` Job failed with `BackoffLimitExceeded`; DR remains blocked by five missing credential escrow evidence markers. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 08:40 readback shows 120 booted again around `02:23`, host is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-14 08:40 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED_WITH_KM_WARN | 99% | 2026-06-14 08:40 cold-start is degraded by one warning only; public route/API smoke is green, VIP API/Web are reachable, momo current-month parity remains covered by the scorecard, schedules/services are mostly green, and 110 failed units remain `0`. API/Web both keep 120 / 121 split placement after latest ArgoCD revision `18b867c3`, with live API/Web/Worker/CronJob image `e0a6d339`; the remaining exception is failed `km-vectorize-29689620`. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.12, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, post-CD no-regression readback, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-14 09:26 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED_WITH_KM_WARN | 99% | 2026-06-14 09:27 cold-start rerun is degraded by one warning only; public route/API smoke is green, VIP API/Web are reachable, momo current-month parity remains covered by the scorecard, schedules/services are mostly green, and 110 failed units remain `0`. API/Web both keep 120 / 121 split placement after latest ArgoCD revision `5bad267e`, with live API/Web/Worker/CronJob image `280e0fbe`; the remaining exception is failed `km-vectorize-29689620`. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.13, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, post-CD no-regression readback, P2-135 deploy recovery readback, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |

Full cold-start may be declared green only for the latest verified evidence set. As of 2026-06-14 08:40, the latest evidence set is degraded by `km-vectorize` only, not green. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
Full cold-start may be declared green only for the latest verified evidence set. As of 2026-06-14 09:27, the latest evidence set is degraded by `km-vectorize` only, not green. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
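The "latest verified evidence set" rule can be illustrated with a short sketch that picks the newest scorecard and lists which declarations stay blocked. The function names `parse_scorecard` and `blocked_declarations` are hypothetical, and the mapping from counters to blocked claims is an assumed reading of the rule, not the project's scorecard implementation.

```python
import re


def parse_scorecard(summary: str) -> dict[str, int]:
    """Extract counters from a summary such as 'PASS=82 WARN=1 BLOCKED=0'."""
    counts = {key: int(num) for key, num in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", summary)}
    missing = {"PASS", "WARN", "BLOCKED"} - counts.keys()
    if missing:
        raise ValueError(f"missing counters: {sorted(missing)}")
    return counts


def blocked_declarations(runs: dict[str, str], escrow_missing: int) -> dict[str, str]:
    """Return each declaration that is not allowed yet, with the reason."""
    latest_time = max(runs)  # 'YYYY-MM-DD HH:MM' keys sort chronologically
    counts = parse_scorecard(runs[latest_time])
    reasons = {}
    if counts["WARN"] or counts["BLOCKED"]:
        reasons["full cold-start green"] = (
            f"{latest_time} scorecard has WARN={counts['WARN']} BLOCKED={counts['BLOCKED']}"
        )
    if escrow_missing:
        reasons["DR complete"] = f"{escrow_missing} credential escrow evidence markers missing"
    return reasons


# Scorecard summaries recorded on 2026-06-14.
runs = {
    "2026-06-14 08:40": "PASS=82 WARN=1 BLOCKED=0",
    "2026-06-14 09:26": "PASS=80 WARN=2 BLOCKED=1",
    "2026-06-14 09:27": "PASS=82 WARN=1 BLOCKED=0",
}
for claim, reason in blocked_declarations(runs, escrow_missing=5).items():
    print(f"NO-GO {claim}: {reason}")
```

Only the 09:27 run is evaluated, so the transient 09:26 `BLOCKED=1` does not affect the result; both claims remain NO-GO because of `WARN=1` and the five missing escrow markers.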
@@ -66,6 +66,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-14 `km-vectorize` tenant context follow-up | ROOT_CAUSE_CANDIDATE_LIVE | Source audit shows `cron_km_vectorize.py` calls `/api/v1/knowledge/embed-all` without project context, while API middleware and `get_db_context()` require `X-Project-ID` / tenant context for fail-closed RLS. API logs show matching `db_context_missing` / `Missing tenant context` patterns. Deploy marker `ec03f0b7` put image `8ddb80d6` live; CronJob now has `KM_PROJECT_ID=awoooi`, script sends `X-Project-ID`, targeted pytest `7 passed`, and no manual Job was created. Completion still waits for the next official 03:00 success or retained failed Pod/log. |
| 2026-06-14 110 failed-unit cleanup | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | `fwupd-refresh.timer` is intentionally `disabled / inactive` after non-runtime firmware metadata refresh failed units were classified; rollback is `sudo systemctl enable --now fwupd-refresh.timer`. `systemctl --failed` now returns `0 loaded units listed`; 08:24 cold-start improved to `PASS=82 WARN=1 BLOCKED=0`. Remaining warning is only K8s failed Job `km-vectorize-29689620`; backup core remains green and `escrow_missing=5`. |
| 2026-06-14 post-CD recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | Gitea main / ArgoCD revision `18b867c3` synced after deploy marker `18b867c3 chore(cd): deploy e0a6d33 [skip ci]`; API/Web/Worker/CronJob image is `e0a6d339`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon`, public routes and TLS pass, backup core remains `0`, escrow missing remains `5`, and 08:40 cold-start remains `PASS=82 WARN=1 BLOCKED=0`. This proves no post-CD reboot recovery regression, but still not full green. |
| 2026-06-14 P2-135 deploy recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | Gitea main `5bad267e` and ArgoCD revision `5bad267e` are synced after deploy marker `8d575c1a`; API/Web/Worker/CronJob image is `280e0fbe`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon1`, backup core remains `0`, escrow missing remains `5`, and 09:27 cold-start rerun is `PASS=82 WARN=1 BLOCKED=0`. 09:26 first run saw transient `stock.wooo.work` `502` while stockplatform-v2 containers were under one minute old; direct route/TLS recheck and scorecard rerun returned `200`. This proves no persistent post-P2-135 recovery regression, but still not full green. |

---
@@ -167,7 +168,7 @@ Next: <single next action>
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.12 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, 2026-06-14 110 failed-unit cleanup anchor, 2026-06-14 post-CD recovery readback, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, and allowed declaration wording. | Use v1.12 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10 / §14.11 / §14.12. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remain red. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.13 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, 2026-06-14 110 failed-unit cleanup anchor, 2026-06-14 post-CD recovery readback, 2026-06-14 P2-135 deploy recovery readback, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, and allowed declaration wording. | Use v1.13 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10 / §14.11 / §14.12 / §14.13. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remain red. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
@@ -205,6 +206,16 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy
## 9. Progress Updates

```text
2026-06-14 09:27 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main 5bad267e and ArgoCD revision 5bad267e; deploy marker 8d575c1a put image 280e0fbe live for API/Web/Worker/CronJob; API/Web split across mon/mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; first cold-start had transient stock 502 during stockplatform-v2 warmup, direct route/TLS recheck returned 200, final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
```
```text
2026-06-14 08:40 Asia/Taipei
Phase: P0/P1/P2/P3