fix(k8s): retain km-vectorize failure evidence

2026-06-13 13:41:14 +08:00
parent 88dc08e595
commit 39246c6595
4 changed files with 29 additions and 11 deletions
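
The readback values recorded in this commit can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not part of the changed files. It assumes Asia/Taipei is a fixed UTC+08:00 offset with no DST, and the helper names `scheduled_run_utc` and `succeeded_after` are hypothetical. It shows that the `0 3 * * *` schedule for 2026-06-13 corresponds to `lastScheduleTime=2026-06-12T19:00:00Z`, and that the recorded `lastSuccessfulTime` predates that run.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Taipei modeled as a fixed UTC+08:00 offset (no DST).
TAIPEI = timezone(timedelta(hours=8))
K8S_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def scheduled_run_utc(local_date: str, hour: int = 3, minute: int = 0) -> str:
    """Return the UTC timestamp of a daily run at hour:minute Taipei time."""
    local = datetime.strptime(local_date, "%Y-%m-%d").replace(
        hour=hour, minute=minute, tzinfo=TAIPEI
    )
    return local.astimezone(timezone.utc).strftime(K8S_TIME_FORMAT)


def succeeded_after(last_successful: str, last_schedule: str) -> bool:
    """True when the last success is not older than the last scheduled run."""
    success = datetime.strptime(last_successful, K8S_TIME_FORMAT)
    schedule = datetime.strptime(last_schedule, K8S_TIME_FORMAT)
    return success >= schedule


# Values taken from the 13:37 readback recorded in this commit.
print(scheduled_run_utc("2026-06-13"))
print(succeeded_after("2026-06-04T11:00:37Z", "2026-06-12T19:00:00Z"))
```

A `False` result from `succeeded_after` matches the stale-success condition described in the diff below; this sketch does not reproduce how ArgoCD itself evaluates CronJob health.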
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -33966,3 +33966,21 @@ production browser smoke:
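 - This round did not read, collect, paste, or store any secret value, hash, prefix/suffix, or partial token.
 - This round did not write any live marker; `BackupCredentialEscrowEvidenceMissing` must keep firing until all five markers are filled in with real non-sensitive evidence IDs.
 - Service / cold-start remains `GREEN`; the DR scorecard is still `BLOCKED`.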
+
+## 2026-06-13 — km-vectorize failure evidence retention hardening
+
+**Live read-only evidence, 13:37 Asia/Taipei**:
+- Gitea main: `88dc08e5 docs(ops): add credential escrow evidence owner request [skip ci]`.
+- ArgoCD `awoooi-prod`: `sync=Synced`, `health=Degraded`, revision `88dc08e595878ee0728752a26f207689f7c062d4`.
+- Only unhealthy resource: `CronJob/awoooi-prod/km-vectorize`, message: `CronJob has not completed its last execution successfully`.
+- CronJob: `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, `suspend=false`, `lastScheduleTime=2026-06-12T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`.
+- The namespace retains only the completed `km-vectorize` Jobs from 2026-06-02 / 06-03 / 06-04; the failed 2026-06-13 03:00 Job left no inspectable object because `failedJobsHistoryLimit=0`.
+
+**Changes**:
+- `k8s/awoooi-prod/15-cronjob-km-vectorize.yaml`: change `failedJobsHistoryLimit` from `0` to `3`.
+- Bump the SOP to `v1.9` and add a CronJob failure evidence retention rule.
+- The reboot workplan advances `km-vectorize` remediation from `90%` to `92%`: the source / live schedule is already correct, and this change adds inspectability for future failures; ArgoCD still cannot be declared Healthy.
+
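+**Boundaries**:
+- This round is a GitOps source fix only: no manually created Jobs, no manual CronJob deletion, no `kubectl patch` against live, and no Pod restarts.
+- Next gate: after ArgoCD syncs to the new revision, the next official 03:00 `km-vectorize` run must succeed and update `lastSuccessfulTime`; if it fails, the failed Job / Pod / log must remain available for inspection.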
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold-Start and Host Reboot SOP

-> Version: v1.8
+> Version: v1.9
 > Last updated: 2026-06-13 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -29,7 +29,7 @@
 | Cold-start scorecard | 01:26 read-only scorecard after latest CD / trust guardrail: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
 | momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
 | momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
-| ArgoCD app health | ArgoCD revision `e4a349bc` is `Synced`; app remains `Degraded` while `km-vectorize` `lastSuccessfulTime=2026-06-04T11:00:37Z`; corrected schedule remains `0 3 * * *` with `timeZone=Asia/Taipei` | `GOVERNANCE_DEBT_IN_REMEDIATION` |
+| ArgoCD app health | 13:37 ArgoCD revision `88dc08e5` is `Synced`; app remains `Degraded` only because `km-vectorize` CronJob has not completed its last execution successfully. CronJob schedule is `0 3 * * *` with `timeZone=Asia/Taipei`; `lastScheduleTime=2026-06-12T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`. Source now retains failed Job evidence so the next failure is inspectable. | `GOVERNANCE_DEBT_IN_REMEDIATION` |
 | Workload balancing | Latest live deployments use images from `414413a5`; API/Web each have one pod on 120 and one pod on 121 after deploy marker `e4a349bc` | `GREEN` |
 | Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |

@@ -51,7 +51,7 @@ Do not fake or silence credential escrow alerts; they are the remaining correct
 GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
 NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
 NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
-Current allowed wording: "service / cold-start green; ArgoCD CronJob governance debt in remediation and waiting for 03:00 `km-vectorize` success evidence."
+Current allowed wording: "service / cold-start green; ArgoCD CronJob governance debt in remediation; `km-vectorize` failure evidence retention is fixed and the next 03:00 run must prove success."
 ```

 After any future 120 recovery, rerun this exact chain from 110:
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-13 12:43 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 13:10 escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. Owner request package is now ready; actual marker write remains blocked on real non-secret evidence IDs. |
 | P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 01:26 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web both keep 120 / 121 split placement after latest deploy marker `e4a349bc`. |
-| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.8, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
+| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.9, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |

 Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

@@ -54,7 +54,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
 | 2026-06-12 blocker pursuit | WAITING_EXTERNAL_ACCESS | 15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo. |
 | 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
 | 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
-| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
+| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_92 | 13:37 live readback: ArgoCD revision `88dc08e5` is `Synced / Degraded`; only unhealthy resource is `CronJob/awoooi-prod/km-vectorize` with message `CronJob has not completed its last execution successfully`. CronJob `lastScheduleTime=2026-06-12T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; no 2026-06-13 failed Job is retained because `failedJobsHistoryLimit=0`. GitOps candidate now changes `km-vectorize` to `failedJobsHistoryLimit=3` so future 03:00 failures keep inspectable Job/Pod evidence. Next gate is ArgoCD sync plus the next official 03:00 success readback. |
 | 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
 | 2026-06-13 API placement hardening | IN_PROGRESS | 12:43 live refresh showed cold-start `PASS=83 WARN=0 BLOCKED=0`, but API replicas `2/2` were on 120 even though topology spread existed. Root cause: `whenUnsatisfiable=ScheduleAnyway` is a soft preference. GitOps candidate changes API/Web/Worker to `minDomains=2` + `DoNotSchedule`; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun. |
 | 2026-06-13 API rollout strategy hardening | LIVE_VERIFIED | First hard-spread rollout reached ArgoCD revision `17e017f5`; `DoNotSchedule` was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps rollout reached ArgoCD revision `60f653a0`, API/Web use `maxSurge=0`, `maxUnavailable=1`, `minDomains=2`, `DoNotSchedule`, and both deployments are split `mon` / `mon1`. Public API / governance route smoke passed and 12:59 cold-start returned `PASS=83 WARN=0 BLOCKED=0`. |
@@ -130,7 +130,7 @@ Next: <single next action>
 | P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
 | P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
 | P1-014 | DONE | 100 | Publish credential escrow owner request package | 2026-06-13 13:10 live report confirms `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. | Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. | `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` and snapshot exist and validate. |
-| P1-013 | IN_PROGRESS | 90 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed, and the manifest schedule was semantically wrong (`0 19` with `timeZone: Asia/Taipei` ran at 19:00 Taipei time, not 03:00). Manual Job evidence is invalid because the controller deleted `km-vectorize-codex-002709` as `UnexpectedJob`. Gitea main `47ee96b0` is synced live and the CronJob spec is corrected. | Verify the next 03:00 official CronJob updates `lastSuccessfulTime` and ArgoCD returns `Healthy`. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`. |
+| P1-013 | IN_PROGRESS | 92 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed. The schedule is now correct (`0 3` with `timeZone: Asia/Taipei`), but the 2026-06-13 03:00 run did not update `lastSuccessfulTime` and no failed Job was retained because `failedJobsHistoryLimit=0`. This hid the failure evidence and slowed triage. | Apply GitOps `failedJobsHistoryLimit=3`, verify ArgoCD sync, then verify the next 03:00 official CronJob either succeeds or leaves inspectable failure evidence. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`; if it fails, the failed Job/Pod/log remains available. |

 ---

@@ -159,7 +159,7 @@ Next: <single next action>
 | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
 | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
 | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.8 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, CD SSH trust guardrail, and allowed declaration wording. | Use v1.8 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remain red. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.9 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, and allowed declaration wording. | Use v1.9 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remain red. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
--- a/k8s/awoooi-prod/15-cronjob-km-vectorize.yaml
+++ b/k8s/awoooi-prod/15-cronjob-km-vectorize.yaml
@@ -22,10 +22,10 @@ spec:
  timeZone: "Asia/Taipei"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
-  # 2026-05-24 Codex: do not retain Failed Jobs in the ArgoCD app tree; stale
-  # failure evidence belongs in AwoooP/KM governance, while retained Job
-  # objects keep the whole Application Degraded after recovery.
-  failedJobsHistoryLimit: 0
+  # 2026-06-13 Codex: retain a small failed-job window. Setting this to 0
+  # hid the 03:00 failure evidence and left ArgoCD Degraded with no Job/Pod
+  # to inspect, which slows post-reboot recovery triage.
+  failedJobsHistoryLimit: 3
  startingDeadlineSeconds: 300
  jobTemplate:
    metadata: