@@ -11,15 +11,15 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 00:34 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`). |
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 00:34 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-12 15:54 `backup-all` completed `13/13`; 17:37 offsite full sync completed `13/13`; 18:55 verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status has `core_blockers=0` but `escrow_missing=5`. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 post-rollout cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web now both have 120 / 121 split placement. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.6, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, and host role / load-balancing assessment are updated; Ansible syntax check is unavailable on this workstation. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.7, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
2026-06-13 00:34 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing.
2026-06-13 00:34 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing. 00:50 `km-vectorize` remediation is `85%`: schedule/label manifest fix prepared, waiting for Gitea / ArgoCD sync and next 03:00 CronJob success readback.
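
The declaration rule above can be sketched as a small gate over the counters this workplan already records. This is a minimal illustration, not the project's actual tooling: the function names and the `NOT_GREEN` / `DR_COMPLETE` labels are hypothetical, and treating any nonzero `WARN` as not green is an assumption.

```python
import re


def parse_counters(text: str) -> dict[str, int]:
    """Extract KEY=<int> pairs such as PASS=83 or escrow_missing=5."""
    return {key: int(value) for key, value in re.findall(r"\b([A-Za-z_]+)=(\d+)\b", text)}


def allowed_declaration(scorecard: str, backup_status: str) -> str:
    """Return the strongest status the evidence supports (labels are hypothetical)."""
    sc = parse_counters(scorecard)
    bs = parse_counters(backup_status)

    # Missing counters default to a failing value so absent evidence never reads as green.
    cold_start_green = (
        sc.get("PASS", 0) > 0
        and sc.get("WARN", 1) == 0
        and sc.get("BLOCKED", 1) == 0
    )
    core_ok = bs.get("core_blockers", 1) == 0
    escrow_ok = bs.get("escrow_missing", 1) == 0

    if not (cold_start_green and core_ok):
        return "NOT_GREEN"
    # Service may be green while DR stays blocked on credential escrow evidence.
    return "DR_COMPLETE" if escrow_ok else "SERVICE_GREEN_DR_ESCROW_BLOCKED"


status = allowed_declaration(
    "PASS=83 WARN=0 BLOCKED=0",
    "core_blockers=0 escrow_missing=5",
)
print(status)
```

With the 2026-06-13 00:34 counters, the gate stops at the escrow check, which matches the wording that DR must not be declared complete yet.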
---
@@ -54,6 +54,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-12 blocker pursuit | WAITING_EXTERNAL_ACCESS | 15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo. |
| 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_85 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Manifest now corrects the schedule to `0 3 * * *` with `timeZone: Asia/Taipei` and adds Job labels. Next gate is Gitea / ArgoCD sync plus next 03:00 official CronJob readback. |
---
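
The schedule mismatch tracked in the `km-vectorize` rows can be checked with a quick time-zone conversion of the recorded readback. The sketch below is illustrative only; it assumes a fixed UTC+8 offset for Asia/Taipei (no DST in effect) and uses hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Taipei is modeled as a fixed UTC+8 offset.
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")


def to_taipei(utc_iso: str) -> datetime:
    """Convert an ISO-8601 UTC timestamp ending in 'Z' to Taipei local time."""
    return datetime.fromisoformat(utc_iso.replace("Z", "+00:00")).astimezone(TAIPEI)


# Readback value recorded for the CronJob's lastScheduleTime.
last_schedule = to_taipei("2026-06-12T11:00:00Z")
print(last_schedule.strftime("%Y-%m-%d %H:%M %Z"))

# The intended run time, 03:00 Taipei, expressed in UTC.
intended_local = datetime(2026, 6, 13, 3, 0, tzinfo=TAIPEI)
intended_utc = intended_local.astimezone(timezone.utc)
print(intended_utc.strftime("%Y-%m-%d %H:%M UTC"))
```

The recorded `lastScheduleTime` lands at 19:00 Taipei time, consistent with `0 19` being evaluated in `Asia/Taipei`. Because 03:00 in Taipei equals 19:00 UTC, `0 19` only matches the intended run time when the schedule is evaluated in UTC; once `timeZone: Asia/Taipei` is set, the hour field has to be `3`.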
@@ -125,6 +126,7 @@ Next: <single next action>
| P1-010 | DONE | 100 | Offsite sync manual backup repairs | 2026-06-12 17:37 full offsite sync completed `13/13` after controlled P0 runway override to 240m; 18:55 verifier confirmed 13 remote repos each have one snapshot. | Allow normal 03:00 full sync cadence unless another manual backup creates new snapshots. | `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, full sync `13/13`. |
| P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
| P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
| P1-013 | IN_PROGRESS | 85 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed, and the manifest schedule was semantically wrong (`0 19` with `timeZone: Asia/Taipei` ran at 19:00 Taipei time, not 03:00). Manual Job evidence is invalid because the controller deleted `km-vectorize-codex-002709` as `UnexpectedJob`. | Push manifest fix, let ArgoCD sync, then verify the next 03:00 official CronJob updates `lastSuccessfulTime` and ArgoCD returns `Healthy`. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`. |
---
@@ -156,6 +158,7 @@ Next: <single next action>
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.6 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 comparison anchor, T+0/T+60 verification timeline, AA/AS assessment, workload distribution assessment, and allowed declaration wording. | Use v1.6 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while 120 / escrow remain red. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, the invalid manual Job boundary, and the 85% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
---