|
|
|
|
@@ -15,7 +15,7 @@
|
|
|
|
|
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
|
|
|
|
|
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-13 12:43 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 13:10 escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. Owner request package is now ready; actual marker write remains blocked on real non-secret evidence IDs. |
|
|
|
|
|
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 01:26 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web both keep 120 / 121 split placement after latest deploy marker `e4a349bc`. |
|
|
|
|
|
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.8, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
|
|
|
|
|
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.9, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
|
|
|
|
|
|
|
|
|
|
Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
|
|
|
|
|
|
|
|
|
|
@@ -54,7 +54,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
|
|
|
|
|
| 2026-06-12 blocker pursuit | WAITING_EXTERNAL_ACCESS | 15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo. |
|
|
|
|
|
| 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
|
|
|
|
|
| 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
|
|
|
|
|
| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
|
|
|
|
|
| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_92 | 13:37 live readback: ArgoCD revision `88dc08e5` is `Synced / Degraded`; only unhealthy resource is `CronJob/awoooi-prod/km-vectorize` with message `CronJob has not completed its last execution successfully`. CronJob `lastScheduleTime=2026-06-12T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; no 2026-06-13 failed Job is retained because `failedJobsHistoryLimit=0`. GitOps candidate now changes `km-vectorize` to `failedJobsHistoryLimit=3` so future 03:00 failures keep inspectable Job/Pod evidence. Next gate is ArgoCD sync plus the next official 03:00 success readback. |
|
|
|
|
|
| 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
|
|
|
|
|
| 2026-06-13 API placement hardening | IN_PROGRESS | 12:43 live refresh showed cold-start `PASS=83 WARN=0 BLOCKED=0`, but API replicas `2/2` were on 120 even though topology spread existed. Root cause: `whenUnsatisfiable=ScheduleAnyway` is a soft preference. GitOps candidate changes API/Web/Worker to `minDomains=2` + `DoNotSchedule`; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun. |
|
|
|
|
|
| 2026-06-13 API rollout strategy hardening | LIVE_VERIFIED | First hard-spread rollout reached ArgoCD revision `17e017f5`; `DoNotSchedule` was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps rollout reached ArgoCD revision `60f653a0`, API/Web use `maxSurge=0`, `maxUnavailable=1`, `minDomains=2`, `DoNotSchedule`, and both deployments are split `mon` / `mon1`. Public API / governance route smoke passed and 12:59 cold-start returned `PASS=83 WARN=0 BLOCKED=0`. |
|
|
|
|
|
@@ -130,7 +130,7 @@ Next: <single next action>
|
|
|
|
|
| P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
|
|
|
|
|
| P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
|
|
|
|
|
| P1-014 | DONE | 100 | Publish credential escrow owner request package | 2026-06-13 13:10 live report confirms `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. | Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. | `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` and snapshot exist and validate. |
|
|
|
|
|
| P1-013 | IN_PROGRESS | 90 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed, and the manifest schedule was semantically wrong (`0 19` with `timeZone: Asia/Taipei` ran at 19:00 台北, not 03:00). Manual Job evidence is invalid because the controller deleted `km-vectorize-codex-002709` as `UnexpectedJob`. Gitea main `47ee96b0` is synced live and the CronJob spec is corrected. | Verify the next 03:00 official CronJob updates `lastSuccessfulTime` and ArgoCD returns `Healthy`. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`. |
|
|
|
|
|
| P1-013 | IN_PROGRESS | 92 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed. The schedule is now correct (`0 3` with `timeZone: Asia/Taipei`), but the 2026-06-13 03:00 run did not update `lastSuccessfulTime` and no failed Job was retained because `failedJobsHistoryLimit=0`. This hid the failure evidence and slowed triage. | Apply GitOps `failedJobsHistoryLimit=3`, verify ArgoCD sync, then verify the next 03:00 official CronJob either succeeds or leaves inspectable failure evidence. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`; if it fails, the failed Job/Pod/log remains available. |
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
@@ -159,7 +159,7 @@ Next: <single next action>
|
|
|
|
|
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
|
|
|
|
|
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
|
|
|
|
|
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
|
|
|
|
|
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.8 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, T+0/T+60 verification timeline, AA/AS 判定, workload 分散判定, CD SSH trust guardrail, and allowed declaration wording. | Use v1.8 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remain red. |
|
|
|
|
|
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.9 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, T+0/T+60 verification timeline, AA/AS 判定, workload 分散判定, CD SSH trust guardrail, CronJob failure evidence retention rule, and allowed declaration wording. | Use v1.9 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remain red. |
|
|
|
|
|
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
|
|
|
|
|
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
|
|
|
|
|
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
|
|
|
|
|
|