docs(ops): record reboot live readback stage verdict [skip ci]
@@ -1,3 +1,25 @@
## 2026-06-18 | Reboot live cold-start readback: services available, stale failed Job warning retained

**Background**: After the reboot SOP / Plan B / repo-side readiness blockers were pushed to `gitea/main=63d8361f`, this round immediately reran the cold-start gate in read-only mode and tracked the natural convergence of an AWOOOI rollout that happened at the same time, so that repo-side readiness would not be misreported as live full green.

**Live readback**:

- `bash scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` returned `PASS=83 WARN=1 BLOCKED=0` at 12:13, result `DEGRADED`.
- P0 reachability: ping and SSH port are OK on all of 110 / 120 / 121 / 188; 112 Kali remains excluded per the SOP.
- 188 data layer: PostgreSQL accepting connections, Redis `PONG`, momo health / SigNoz reachable; momo current-month snapshot vs. realtime parity is `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`.
- 110 registry / observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry reachable; 110 failed units `0`; 110 backup health `total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0`; 188 backup health `total=2 stale=0`.
- K3s: `mon` / `mon1` are both `Ready control-plane`; VIP `192.168.0.125` is present on 120; `NODE_FS_ERROR_EVENTS 0`.
- Public routes / TLS: awoooi API/Web, mo, momo health, Gitea, Harbor, registry, Sentry, SigNoz, stock, Langfuse, Bitan, and aiops all return 2xx/3xx and pass TLS verification.
- Only warning: K8s retains the old failed Job `km-vectorize-29689620`, created by the official CronJob at 2026-06-14 03:00, image `26b67d11...`, `Pods Statuses: 0 Active / 0 Succeeded / 1 Failed`, Events no longer present; the later `km-vectorize-29692500`, `29693940`, and `29695380` are all `Complete`, the most recent success about 9 hours ago.
- Concurrent rollout tracking: at 12:14 external health briefly returned `502` and the API/Web startup probes were not ready; from 12:15 on, API health stayed `200 healthy`; the 12:16 to 12:17 readback shows `awoooi-api 2/2`, `awoooi-web 2/2`, `awoooi-worker 1/1`, `awoooi-auto-repair-canary 1/1`, with all pods `1/1 Running`.
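
The stale-Job reasoning in the readback above can be sketched as a small read-only filter. This is an illustration only, not part of `full-stack-cold-start-check.sh`: the function name `classify_failed_jobs`, the two-column `<job-name> <status>` input format, and the `STALE_FAILED` / `ACTIVE_FAILURE` labels are all assumptions. It relies on the Kubernetes CronJob convention that the numeric Job suffix grows with the scheduled time.

```shell
# Sketch: mark a failed Job as stale when a later Job of the same CronJob completed.
# Input (assumed format, not real kubectl output): "<job-name> <status>" per line on stdin.
classify_failed_jobs() {
  awk '
    {
      name = $1; status = $2
      suffix = name; sub(/.*-/, "", suffix)   # numeric schedule suffix after the last "-"
      prefix = substr(name, 1, length(name) - length(suffix) - 1)
      if (status == "Complete" && suffix + 0 > latest_ok[prefix] + 0) latest_ok[prefix] = suffix
      if (status == "Failed") { failed[name] = prefix; fsuffix[name] = suffix }
    }
    END {
      for (name in failed) {
        if (latest_ok[failed[name]] + 0 > fsuffix[name] + 0) print name, "STALE_FAILED"
        else print name, "ACTIVE_FAILURE"
      }
    }
  ' | sort
}

# Job names and states taken from the readback above.
printf '%s\n' \
  'km-vectorize-29689620 Failed' \
  'km-vectorize-29692500 Complete' \
  'km-vectorize-29693940 Complete' \
  'km-vectorize-29695380 Complete' | classify_failed_jobs
```

A failed Job with no later `Complete` run of the same CronJob is reported as `ACTIVE_FAILURE`; under the rules in this entry, that case would presumably need investigation rather than a stale-evidence label.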

**Completion sync**:

- Reboot SOP / Plan B / repo-side automation contracts: `100%`.
- Live service availability after read-only check: `SERVICE_AVAILABLE_DEGRADED`, hard blocker `0`.
- Full cold-start green: still `NO-GO` because `WARN=1`; it must be labeled clearly as a stale failed Job warning and must not be reported as `WARN=0`.
- DR complete: still `NO-GO`; credential escrow evidence markers still must not be forged, and the latest governance figure remains `escrow_missing=5`.

**Boundary**: This round of live tracking was read-only throughout: no failed Job deleted, no Job created manually, no K8s patch, no ArgoCD sync, no service restart, no Docker / Nginx / firewall change, no secret read, no Telegram message sent, no active scan. Before the next formal reboot, the same live preflight must be rerun; if the only warning is the stale failed Job and later official Jobs succeeded, the path is Plan B `B3_SERVICE_AVAILABLE_DEGRADED`, or convergence to `B4_FULL_STACK_GREEN` after maintenance, but DEGRADED must never be described as full green.
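
The declaration rule above can be sketched as a mapping from a scorecard line to a Plan B level. This is a simplified sketch, not the SOP's implementation: `classify_scorecard` and the `HARD_BLOCKER` / `UNPARSEABLE` labels are hypothetical, and treating any `WARN>0` as B3 ignores the extra condition that the warning must be stale evidence only.

```shell
# Sketch: map "PASS=<n> WARN=<n> BLOCKED=<n>" to a declaration level.
# Assumption: any BLOCKED>0 is a hard blocker; any WARN>0 caps the level at B3.
classify_scorecard() {
  # Extract the integers after WARN= and BLOCKED=; empty when the key is missing.
  warn=$(printf '%s\n' "$1" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
  blocked=$(printf '%s\n' "$1" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
  if [ -z "$warn" ] || [ -z "$blocked" ]; then
    echo "UNPARSEABLE"
    return 2
  fi
  if [ "$blocked" -gt 0 ]; then
    echo "HARD_BLOCKER"                      # hypothetical label, below B3
  elif [ "$warn" -gt 0 ]; then
    echo "B3_SERVICE_AVAILABLE_DEGRADED"     # services up, never "full green"
  else
    echo "B4_FULL_STACK_GREEN"               # only when WARN=0 and BLOCKED=0
  fi
}

classify_scorecard "PASS=83 WARN=1 BLOCKED=0"
```

For the 12:13 scorecard this prints `B3_SERVICE_AVAILABLE_DEGRADED`, which matches the wording allowed in this entry.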
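
## 2026-06-18 | IwoooS SOC / SIEM / Kali 112 / Wazuh integration control: local verification complete

**Background**: The user asked for security monitoring, alerting, and the Kali 112 host to be fully integrated, and for mainstream industry security mechanisms to be adopted. The previous round completed the containment controls against external host intrusion, but a gate was still missing that ties Wazuh, Kali 112, Prometheus / Alertmanager, SigNoz, Sentry, Nginx / Gateway, host forensics, Docker / systemd, K8s / ArgoCD, Gitea / runner, Harbor / SBOM, and backup / DR into a single read-only SOC control pipeline.
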
@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP

> Version: v1.23
> Version: v1.24
> Last updated: 2026-06-18 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,6 +10,18 @@

This section is the first verdict anchor after every handover, power-on, shutdown, or reboot. If its date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

2026-06-18 12:17 live readback supersedes older service-availability wording:

```text
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=83 WARN=1 BLOCKED=0, Result=DEGRADED.
Service state: SERVICE_AVAILABLE_DEGRADED; 110/120/121/188 reachable, K3s mon/mon1 Ready, NODE_FS_ERROR_EVENTS=0, public routes/TLS green, 110/188 backup health fresh.
Rollout state after transient 12:14 startup window: awoooi-api 2/2, awoooi-web 2/2, worker 1/1, canary 1/1, public API health 200 healthy.
Only live warning: retained stale K8s Job km-vectorize-29689620 from 2026-06-14 03:00. Later official km-vectorize Jobs 29692500 / 29693940 / 29695380 are Complete.
Allowed declaration: services are available with one stale failed Job warning.
Forbidden declaration: full cold-start green, DR complete, or runtime/security acceptance.
```

| Item | 2026-06-14 18:15 Asia/Taipei live result | Verdict |
|------|-------------------------------------------|---------|
| Overall recovery readiness | `97%` | `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED` |
@@ -55,6 +67,17 @@ NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/kno
Current allowed wording: "core service and backup are available; 110 failed units are cleared after intentionally disabling `fwupd-refresh.timer`; the recovery readback after the front-end sync of the high-value config Owner Packet shows no service regression; cold-start is degraded only by the `km-vectorize` official Job failure; DR complete still blocked by credential escrow; `km-vectorize` failed Job is retained but failed Pod/log are currently absent, so the next official 03:00 run remains the evidence gate."
```

2026-06-18 12:17 live rule:

```text
GO for controlled service availability: PASS=83 WARN=1 BLOCKED=0, public routes/TLS green, API health 200 healthy, API/Web/Worker/Canary ready after rollout convergence.
GO for repo-side reboot readiness mechanism: readiness audit PASS=185 WARN=1 BLOCKED=0; only skipped live gate warning before the live check was run.
NO-GO for "full cold-start green" until the retained stale failed Job evidence is either cleared by normal K8s history policy or explicitly accepted by an owner-provided readback package.
NO-GO for "DR complete" while credential escrow evidence markers remain missing.
Do not delete the failed Job manually during routine SOP verification. Keep it as evidence unless an approved maintenance window explicitly authorizes cleanup.
Current allowed wording: "SOP / Plan B / automation contracts are complete; live services are available with one retained stale km-vectorize failed Job warning; hard blockers are zero; DR remains blocked by credential escrow evidence."
```

After any future 120 recovery, rerun this exact chain from 110:

```bash
@@ -1481,6 +1504,23 @@ SOP update:
| Repo-side readiness audit | `PASS=185 WARN=1 BLOCKED=0`, result `READY WITH WARNINGS`; the only warning is that `--live` was not run |
| Declaration limit | May declare `REPO_SIDE_REBOOT_READINESS_READY_WITH_LIVE_CHECK_REQUIRED`; must not declare `FULL_STACK_GREEN`, `DR_COMPLETE`, or live service recovery complete |

### 14.24 2026-06-18 live cold-start readback after repo-side closure

The 2026-06-18 12:13-12:17 readback is the same-day live verification that followed the repo-side readiness closure. It is neither a host reboot nor a runtime repair; its purpose is to keep "mechanism complete" separate from "current live state" and so avoid a false green.

| Item | 2026-06-18 12:17 live baseline |
|------|--------------------------------|
| SOP version | `v1.24` |
| Cold-start read-only result | `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED` |
| Host reachability | 110 / 120 / 121 / 188 ping OK and SSH port OK |
| K3s | `mon` / `mon1` Ready control-plane; VIP `192.168.0.125` present on 120; `NODE_FS_ERROR_EVENTS 0` |
| 110 / 188 service checks | 110 Harbor / Gitea / Prometheus / Alertmanager / Sentry reachable; 188 PostgreSQL / Redis / momo / SigNoz reachable |
| Backup health | 110 backup health `total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0`; 188 backup health `total=2 stale=0` |
| Public route / TLS | awoooi API/Web, mo, momo health, Gitea, Harbor, registry, Sentry, SigNoz, stock, Langfuse, Bitan, aiops all 2xx/3xx with TLS verified |
| AWOOOI rollout convergence | After transient 12:14 startup window, final readback shows API `2/2`, Web `2/2`, Worker `1/1`, Canary `1/1`, API health `200 healthy` |
| Remaining warning | retained stale Job `km-vectorize-29689620` from 2026-06-14 03:00; later official Jobs `km-vectorize-29692500`, `29693940`, `29695380` are `Complete` |
| Declaration limit | May declare `SERVICE_AVAILABLE_DEGRADED`; must not declare `FULL_STACK_GREEN` because `WARN=1`; must not declare `DR_COMPLETE`, since credential escrow evidence still requires real non-secret owner evidence |
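
The backup health strings in the table are plain `key=value` counters, so a freshness check can be sketched as below. This is a minimal sketch under stated assumptions: `backup_health_ok` is a hypothetical helper, and the rule "every counter except the `*total` fields must be zero" is inferred from the values shown here, not taken from the actual exporter.

```shell
# Sketch: succeed only when every counter except the *total fields is zero.
# Assumption: the string is space-separated key=value pairs as shown in the table.
backup_health_ok() {
  printf '%s\n' "$1" | tr ' ' '\n' | awk -F= '
    NF == 2 && $1 !~ /total$/ && $2 + 0 != 0 { bad = bad " " $1 "=" $2 }
    END {
      if (bad != "") { print "UNHEALTHY:" bad; exit 1 }
      print "HEALTHY"
    }
  '
}

backup_health_ok "total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0"
```

The function prints the offending counters and returns a non-zero status when any of them is non-zero, so it could be used as a gate in a larger read-only check.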

### 14.22 Post-reboot timeline verification

After every reboot, proceed along the timeline; do not wait until the end to make a single verdict.

@@ -11,13 +11,13 @@

| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_AVAILABLE_ARGOCD_HEALTHY_DR_ESCROW_BLOCKED | 98% | 2026-06-15 03:11: the official `km-vectorize` 03:00 gate succeeded. ArgoCD `awoooi-prod` is `Synced / Healthy`, CronJob `lastSuccessfulTime=2026-06-14T19:00:55Z`, Job `km-vectorize-29691060` `Complete`, log `embed-all: 200 {"total":31,"success":31,"failed":0}`. Backup core blockers remain `0`, 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, but `escrow_missing=5`. Full cold-start still cannot be declared green because the latest scorecard is `PASS=81 WARN=2 BLOCKED=0`; the warnings come from the unconfirmed 188 momo scheduler registration/activity and the old failed Job evidence still retained in K8s. |
| Overall recovery readiness | SERVICE_AVAILABLE_DEGRADED_STALE_KM_JOB_DR_ESCROW_BLOCKED | 98% | 2026-06-18 12:13 live cold-start read-only gate returned `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, public routes/TLS are green, 110 backup health is `13/13 fresh failed=0`, 188 backup health is `2/2 fresh failed=0`, and final rollout readback at 12:17 shows API `2/2`, Web `2/2`, Worker `1/1`, Canary `1/1`, API health `200 healthy`. Only warning is retained stale Job `km-vectorize-29689620` from 2026-06-14 03:00; later official `km-vectorize` Jobs `29692500` / `29693940` / `29695380` are `Complete`. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P2 service / data truth | VERIFIED_ARGOCD_HEALTHY_WITH_RESIDUAL_WARNINGS | 99% | 2026-06-15 03:11 cold-start is degraded by two warnings only; public route/API smoke is green, VIP API/Web are reachable, momo current-month parity remains covered by the scorecard, schedules/services are mostly green, and 110 failed units remain `0`. `km-vectorize-29691060` succeeded, ArgoCD is `Healthy`, and API/Web remain split across 120 / 121. Remaining scorecard warnings are 188 momo scheduler registration/activity not confirmed and retained old K8s failed Job evidence. |
| P3 docs / automation contracts | REPO_SIDE_READY_WITH_LIVE_CHECK_REQUIRED | 100% | Workplan, SOP v1.23, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readback, P2-135 deploy recovery readback, P2-136 / recovery readback after the formal deployment of AI Agent activity, P2-137 / CI smoke timeout recovery readback, P2-143 recovery readback after the owner response pre-check, P2-144 recovery readback after the owner response readback, P2-145 recovery readback after the owner response acceptance threshold, recovery readback after the IwoooS P0 configuration-control prioritization, recovery readback after the front-end sync of the high-value config Owner Packet, and the `km-vectorize` official success readback have all been updated. 2026-06-18 repo-side `reboot-recovery-readiness-audit.sh --no-color` returned `PASS=185 WARN=1 BLOCKED=0`; only warning is live gate skipped. |
| P2 service / data truth | VERIFIED_SERVICE_AVAILABLE_WITH_STALE_JOB_WARN | 99% | 2026-06-18 12:13 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness, VIP, and 110 / 188 runtime health. 12:17 post-rollout readback confirms AWOOOI deployments ready and API health `200 healthy`. The only service warning is old retained `km-vectorize-29689620` failed Job evidence; no hard blocker remains. |
| P3 docs / automation contracts | REPO_SIDE_READY_LIVE_CHECK_RUN_WITH_WARN | 100% | Workplan, SOP v1.24, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, and 2026-06-18 live readback are updated. Repo-side `reboot-recovery-readiness-audit.sh --no-color` returned `PASS=185 WARN=1 BLOCKED=0`; live cold-start returned `PASS=83 WARN=1 BLOCKED=0`. |
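
The momo parity string quoted in the P2 row lends itself to a mechanical check. The sketch below is illustrative only: `parity_ok` is a hypothetical helper, and the field layout (snapshot count, realtime count, then a start/end date pair for each side) is an assumption, since the source shows the six values without naming them.

```shell
# Sketch: verify a six-field parity string "a|b|s1|e1|s2|e2".
# Assumed meaning: a/b are the two row counts, s1..e1 and s2..e2 the two date ranges.
parity_ok() {
  printf '%s\n' "$1" | awk -F'|' '
    NF != 6 { print "MALFORMED"; bad = 1; exit }
    $1 == $2 && $3 == $5 && $4 == $6 { print "PARITY_OK"; next }
    { print "PARITY_MISMATCH"; bad = 1 }
    END { exit bad + 0 }
  '
}

parity_ok '10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17'
```

With the value recorded at 12:13 the counts and both date ranges agree, so the sketch reports `PARITY_OK`; any differing pair yields `PARITY_MISMATCH` and a non-zero exit status.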

Full cold-start may be declared green only for the latest verified evidence set. As of 2026-06-15 03:11, `km-vectorize` and ArgoCD are healthy, but the latest scorecard is still `DEGRADED` by residual warnings. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
Full cold-start may be declared green only for the latest verified evidence set. As of 2026-06-18 12:17, services are available and hard blockers are zero, but the latest scorecard is still `DEGRADED` by one retained stale `km-vectorize` failed Job warning. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.

@@ -175,7 +175,7 @@ Next: <single next action>
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.23 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, repo-side readiness audit blocker closure, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, 2026-06-14 110 failed-unit cleanup anchor, 2026-06-14 post-CD recovery readback, P2-135 deploy recovery readback, P2-136 / recovery readback after the formal deployment of AI Agent activity, P2-137 / CI smoke timeout recovery readback, P2-143 recovery readback after the owner response pre-check, P2-144 recovery readback after the owner response readback, P2-145 recovery readback after the owner response acceptance threshold, recovery readback after the IwoooS P0 configuration-control prioritization, recovery readback after the front-end sync of the high-value config Owner Packet, AA/AS verdict, workload distribution verdict, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, and allowed declaration wording. | Use v1.23 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, and blockers against §1.4 plus §14.8 through §14.23. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; repo-side `reboot-recovery-readiness-audit.sh --no-color` now returns `PASS=185 WARN=1 BLOCKED=0`, with the remaining warning only because live gate was intentionally skipped. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.24 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, repo-side readiness audit blocker closure, 2026-06-18 live cold-start readback, post-reboot / post-CD recovery anchors, AA/AS verdict, workload distribution verdict, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, and allowed declaration wording. | Use v1.24 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, and blockers against §1.4 plus §14.8 through §14.24. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; repo-side `reboot-recovery-readiness-audit.sh --no-color` returns `PASS=185 WARN=1 BLOCKED=0`, and live cold-start returns `PASS=83 WARN=1 BLOCKED=0` with one retained stale km-vectorize failed Job warning. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
@@ -214,6 +214,14 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy
## 9. Progress Updates

```text
2026-06-18 12:17 Asia/Taipei
Phase: P0/P2/P3 live readback
Before: repo-side readiness was complete, but live gate had not been rerun after the same-day push.
After: live cold-start is `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`; final rollout readback shows API `2/2`, Web `2/2`, Worker `1/1`, Canary `1/1`, and API health `200 healthy`.
Evidence: `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`; read-only K8s deployment/job snapshot from 120; public API health readback.
Blocked: no hard blocker. One warning remains: stale retained Job `km-vectorize-29689620` from 2026-06-14 03:00; later official km-vectorize Jobs are Complete. DR complete still blocked by real credential escrow evidence markers.
Next: before any actual reboot, rerun the same live preflight and classify as `B3_SERVICE_AVAILABLE_DEGRADED` if only stale evidence remains, or `B4_FULL_STACK_GREEN` only when `WARN=0 BLOCKED=0`.

2026-06-18 12:06 Asia/Taipei
Phase: P3
Before: repo-side readiness audit PASS=147 WARN=2 BLOCKED=37 before blocker batch; after Plan B-only guard it still had pre-existing blockers.
