fix(k8s): retain km-vectorize failed pod evidence
@@ -1,3 +1,32 @@
+## 2026-06-14 | km-vectorize 03:00 official scheduled run failure and evidence retention hardening
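+
+**Background**: Following the reboot SOP, the 03:10 heartbeat ran read-only checks of ArgoCD on 120, the official `km-vectorize` CronJob, backup status, credential escrow, and the full cold-start scorecard. This round did not manually create a Job, delete any Job / Pod, run `kubectl patch` against live, or restart any service.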
+
+**Live verification:**
+- Current time: `2026-06-14 03:10 CST`.
+- Gitea main: `c82f320b docs(logbook): 記錄 P2-127 正式驗證 [skip ci]`.
+- ArgoCD `awoooi-prod`: revision `c82f320b97c9880508879eefc991055da800bc36`, `Synced / Degraded`.
+- `km-vectorize` CronJob: schedule `0 3 * * *`, `timeZone=Asia/Taipei`, `failedJobsHistoryLimit=3`, image `26b67d11f7b7de4f9c9d95c01bb1dacf4000e887`, last schedule `2026-06-14 03:00 +0800`.
+- Official Job `km-vectorize-29689620`: `Failed`, `BackoffLimitExceeded`, annotation `batch.kubernetes.io/cronjob-scheduled-timestamp=2026-06-14T03:00:00+08:00`; the Job object itself is retained.
+- Failed Pod `km-vectorize-29689620-nwpqz` showed `Back-off restarting failed container` but was then deleted by the Job controller, so the container log still cannot be read this time.
+- 110 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22`.
+- Credential escrow marker status: `restic_repository_password`, `offsite_provider_credentials`, `break_glass_admin_credentials`, `dns_registrar_recovery`, and `oauth_ai_provider_recovery` are still missing.
+- Full cold-start: `PASS=81 WARN=2 BLOCKED=0`, result `DEGRADED`; the warnings are the failed units `fwupd-refresh.service` / `fwupd.service` on 110 and the failed K3s Job `km-vectorize-29689620`.
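The official Job name and its `cronjob-scheduled-timestamp` annotation can be cross-checked offline. The sketch below assumes the CronJob controller's naming convention, in which the numeric suffix of the Job name is the scheduled time in minutes since the Unix epoch; the helper name `scheduled_time_from_job_name` is hypothetical and not part of this repo.

```python
from datetime import datetime, timedelta, timezone

# Asia/Taipei has a fixed UTC+8 offset (no DST), so a fixed-offset tz is enough here.
TAIPEI = timezone(timedelta(hours=8))


def scheduled_time_from_job_name(job_name: str) -> datetime:
    """Decode the trailing number of a CronJob-created Job name.

    Assumption: the suffix is the scheduled time in minutes since the Unix epoch.
    """
    suffix = job_name.rsplit("-", 1)[1]
    utc_time = datetime.fromtimestamp(int(suffix) * 60, tz=timezone.utc)
    return utc_time.astimezone(TAIPEI)


scheduled = scheduled_time_from_job_name("km-vectorize-29689620")
annotation = datetime.fromisoformat("2026-06-14T03:00:00+08:00")
print(scheduled.isoformat())    # 2026-06-14T03:00:00+08:00
print(scheduled == annotation)  # True
```

If the two values agree, the retained Job was created by the official 03:00 schedule rather than by a manual run.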
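+
+**GitOps fix:**
+- Updated `k8s/awoooi-prod/15-cronjob-km-vectorize.yaml`: `restartPolicy` changed from `OnFailure` to `Never`, and `terminationMessagePolicy: FallbackToLogsOnError` added.
+- The purpose is limited to evidence retention: if the next official 03:00 run fails again, the Job / failed Pod / termination message should remain in the cluster for read-only triage. This change does not guess at the root cause, adjust resources, or manually re-run the Job.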
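A read-only guard for this patch could be sketched as follows. It assumes the CronJob's Pod template spec has already been loaded into a plain dict (for example by a YAML parser); `evidence_retention_gaps` and the two sample specs are hypothetical illustrations, not code from the repo.

```python
def evidence_retention_gaps(pod_spec: dict) -> list[str]:
    """Return human-readable gaps that would let failure evidence disappear."""
    gaps = []
    if pod_spec.get("restartPolicy") != "Never":
        gaps.append("restartPolicy should be Never so failed Pods stay with the Job")
    for container in pod_spec.get("containers", []):
        if container.get("terminationMessagePolicy") != "FallbackToLogsOnError":
            gaps.append(f"container {container.get('name')} lacks FallbackToLogsOnError")
    return gaps


# Pod template spec after the patch, trimmed to the relevant fields.
patched = {
    "restartPolicy": "Never",
    "containers": [
        {"name": "km-vectorize", "terminationMessagePolicy": "FallbackToLogsOnError"}
    ],
}
# The same spec before the patch.
previous = {"restartPolicy": "OnFailure", "containers": [{"name": "km-vectorize"}]}

print(evidence_retention_gaps(patched))        # []
print(len(evidence_retention_gaps(previous)))  # 2
```

The check only reads the manifest, which matches the stated scope of the change: it verifies retention settings and says nothing about the root cause.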
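+
+**What can currently be claimed:**
+- Core services and the backup core on 110 / 120 / 121 / 188 remain available; backup core blockers remain `0`.
+- `km-vectorize` now has official failure evidence, and the repo carries the GitOps fix for retaining the failed Pod/log on the next failure.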
+
+**What cannot currently be claimed:**
+- Do not claim full cold-start green; the latest scorecard is `WARN=2 BLOCKED=0`.
+- Do not claim ArgoCD Healthy; the official 03:00 `km-vectorize` run still fails.
+- Do not claim DR complete; five credential escrow evidence markers are still missing.
+- Do not manually create a `km-vectorize` Job, delete CronJob status, delete the failed Job, fake evidence markers, or read secrets.
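The claim rules above reduce to a small decision table. The sketch below is one way to encode it, under stated assumptions: `LiveStatus`, `parse_scorecard`, and `allowed_claims` are hypothetical names, the scorecard line format is taken from this entry, and treating any `WARN` as blocking "green" follows the `DEGRADED` result recorded here.

```python
import re
from dataclasses import dataclass


@dataclass
class LiveStatus:
    scorecard: str      # e.g. "PASS=81 WARN=2 BLOCKED=0"
    argocd_health: str  # e.g. "Degraded"
    escrow_missing: int
    core_blockers: int


def parse_scorecard(line: str) -> dict:
    """Turn 'PASS=81 WARN=2 BLOCKED=0' into {'PASS': 81, 'WARN': 2, 'BLOCKED': 0}."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def allowed_claims(status: LiveStatus) -> dict:
    counts = parse_scorecard(status.scorecard)
    return {
        # Assumption: "available" means no backup core blocker and no BLOCKED check.
        "core services and backup available": status.core_blockers == 0 and counts["BLOCKED"] == 0,
        "full cold-start green": counts["WARN"] == 0 and counts["BLOCKED"] == 0,
        "ArgoCD Healthy": status.argocd_health == "Healthy",
        "DR complete": status.escrow_missing == 0,
    }


claims = allowed_claims(LiveStatus("PASS=81 WARN=2 BLOCKED=0", "Degraded", 5, 0))
for claim, ok in claims.items():
    print(f"{'GO' if ok else 'NO-GO'}: {claim}")
```

With the 03:10 values, only the first claim evaluates to GO, which agrees with the two lists in this entry.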
+
## 2026-06-14 | P2-127 Owner acceptance / maintenance window gate completed and formally verified

**Background**: P2-126 completed formal verification of the owner-approved execution rehearsal / no-write apply gate. Before actually approaching writer apply, execution apply, receipt write, result capture, learning, PlayBook trust, reviewer queue, Gateway queue, or Telegram, however, the owner acceptance, maintenance window, rollback owner, and post-apply verifier gate still need to be fixed into a package that is auditable, blockable, and reversible. P2-127 therefore only establishes the acceptance / maintenance gate: it does not apply the writer, open a maintenance window, confirm the rollback owner, write a receipt, write result capture / learning / PlayBook trust / reviewer queue / Gateway queue, send Telegram messages, or call the Bot API.
@@ -1,7 +1,7 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP

-> Version: v1.9
-> Last updated: 2026-06-13 Asia/Taipei
+> Version: v1.10
+> Last updated: 2026-06-14 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

---
@@ -10,27 +10,27 @@

This section is the first judgment anchor after every handover, boot, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

-| Item | 2026-06-13 14:16 Asia/Taipei live result | Verdict |
+| Item | 2026-06-14 03:10 Asia/Taipei live result | Verdict |
|------|-------------------------------------------|------|
-| Overall recovery readiness | `96%` | `SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED` |
+| Overall recovery readiness | `94%` | `SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED` |
| P0 host / K3s recovery | `100%` | `DONE` |
| P1 backup / alert / escrow | `92%` | `BLOCKED_DR_ESCROW` |
| P2 service / data truth | `100%` | `VERIFIED_WORKLOAD_BALANCED` |
| P3 docs / automation contracts | `100%` | `DONE_WITH_VALIDATION_GAP` |
-| 110 host runtime | `systemctl is-system-running=running`, failed units `0`, cold-start load `3.55 / 4.54 / 4.51`; Docker / Harbor / Gitea / Prometheus / Alertmanager / Sentry checks pass | `GREEN_WITH_LOAD_WATCH` |
+| 110 host runtime | `fwupd-refresh.service` and `fwupd.service` remain failed; Docker / Harbor / Gitea / Prometheus / Alertmanager / Sentry checks pass in scorecard | `GREEN_WITH_FWUPD_WARNING` |
| 120 reachability | ping OK, SSH OK, boot `2026-06-12 15:13`, root ext4 `rw`, failed units `0` | `GREEN` |
| 121 reachability | ping OK, SSH OK, failed units `0` | `GREEN` |
| 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`, no failed jobs / bad pods; latest CD kept API/Web split across 120 / 121 | `GREEN_WORKLOAD_BALANCED` |
| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; final repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` moves deploy trust to `/home/wooo/.ssh/deploy_known_hosts`; 01:28 global `known_hosts` still contains 120 / 188 and was not clobbered by deploy marker `e4a349bc` | `GREEN_WITH_GUARDRAIL` |
-| Backup status | 14:16 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-13 02:34:06` | `GREEN_WITH_DR_ESCROW_WARNING` |
+| Backup status | 03:11 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22` | `GREEN_WITH_DR_ESCROW_WARNING` |
| Offsite sync / verify | 01:28 textfile: `awoooi_backup_offsite_remote_verify_ok=1`, `full_verify_fresh=1`, all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log is 2026-06-12 07:20 | `GREEN` |
| Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | `GREEN_WITH_EXPECTED_REDLIGHTS` |
-| Cold-start scorecard | 14:16 read-only scorecard after production image `e897c8bf` verification: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
+| Cold-start scorecard | 03:11 read-only scorecard after official `km-vectorize` run: `PASS=81 WARN=2 BLOCKED=0` | `DEGRADED_NO_BLOCKERS` |
| momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
-| ArgoCD app health | 14:16 ArgoCD revision `a520c32d` is `Synced`; app remains `Degraded` only because `km-vectorize` CronJob has not completed its last execution successfully. CronJob schedule is `0 3 * * *` with `timeZone=Asia/Taipei`, `failedJobsHistoryLimit=3`, and `lastSuccessfulTime=2026-06-04T11:00:37Z`; next 03:00 run must prove success or leave inspectable failure evidence. | `GOVERNANCE_DEBT_IN_REMEDIATION` |
-| Workload balancing | Live API/Web/Worker image is `e897c8bf`; API/Web pods remain split across 120 / 121, Worker single replica remains healthy | `GREEN` |
+| ArgoCD app health | 03:10 ArgoCD revision `c82f320b` is `Synced`; app remains `Degraded` because official Job `km-vectorize-29689620` failed with `BackoffLimitExceeded`. CronJob schedule is `0 3 * * *` with `timeZone=Asia/Taipei`, `failedJobsHistoryLimit=3`, and image `26b67d11`; the retained Job proves failure, but the failed Pod/log was deleted before inspection. | `FAILED_EVIDENCE_RETENTION_PATCHED_IN_GIT` |
+| Workload balancing | Live API/Web/Worker image is `26b67d11`; API/Web pods remain ready, Worker single replica remains healthy | `GREEN` |
| Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |

Release rule:
@@ -44,14 +44,15 @@ Do not declare DR scorecard complete while credential escrow markers are missing
2026-06-13 live rule:

```text
-110 / 120 / 121 / 188 service recovery is full-stack green after the 14:16 scorecard.
+110 / 120 / 121 / 188 core service recovery remains available, but the latest 03:11 scorecard is DEGRADED because `WARN=2`.
GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
+NO-GO for "full cold-start green" until 110 failed units are resolved/accepted and `km-vectorize` failed Job is cleared by an official successful run.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
-Current allowed wording: "service / cold-start green; DR complete still blocked by credential escrow; ArgoCD CronJob governance debt in remediation; `km-vectorize` failure evidence retention is fixed and the next 03:00 run must prove success."
+Current allowed wording: "core service and backup are available; cold-start is degraded by `km-vectorize` official Job failure and 110 fwupd failed units; DR complete still blocked by credential escrow; `km-vectorize` failed Job is retained and GitOps now patches failed Pod/log retention for the next official run."
```

After any future 120 recovery, rerun this exact chain from 110:
@@ -1160,18 +1161,18 @@ SOP update:

2026-06-13 is not a host reboot; it is the comparison anchor used to verify whether "120/121 workload balancing + CD known_hosts guardrail" can withstand the next normal deployment.

-| Item | 2026-06-13 14:16 baseline |
+| Item | 2026-06-14 03:10 baseline |
|------|----------------------------|
-| Gitea / ArgoCD | Gitea main `a520c32d`, deploy marker `a520c32d`, ArgoCD revision `a520c32d`, sync `Synced` |
-| K3s image readback | API/Web/Worker image tag `e897c8bf20125f7792f9aee21d7f503f4358bcfb` |
+| Gitea / ArgoCD | Gitea main `c82f320b`, deploy marker `7b034b58`, ArgoCD revision `c82f320b`, sync `Synced`, health `Degraded` |
+| K3s image readback | API/Web/Worker/CronJob image tag `26b67d11f7b7de4f9c9d95c01bb1dacf4000e887` |
| K3s placement | API/Web verified split across `mon` / `mon1` after the latest deploy marker; Worker single replica healthy |
-| Cold-start | `PASS=83 WARN=0 BLOCKED=0` |
+| Cold-start | `PASS=81 WARN=2 BLOCKED=0` |
| Public routes | Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
-| Backup | `backup-status`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5` |
+| Backup | `backup-status`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22` |
| Offsite | textfile `remote_verify_ok=1`, `full_verify_fresh=1`, 13 repos each `snapshot_count=1` |
| SSH trust | Global `known_hosts` retained 120 / 188 entries after CD; deploy-specific trust moved to `deploy_known_hosts` |
-| Remaining non-service debt | `km-vectorize` waits for official 03:00 `lastSuccessfulTime`; credential escrow missing count remains `5`; 188 host degraded only by certbot units |
-| SOP change | v1.9 keeps CD / SSH trust guardrail, workload-balance proof, DR escrow red gate, and `km-vectorize` official schedule gate in the first-screen declaration rules |
+| Remaining non-service debt | `km-vectorize-29689620` official Job failed with `BackoffLimitExceeded`; failed Pod/log was deleted before inspection; credential escrow missing count remains `5`; 110 has `fwupd` failed units |
+| SOP change | v1.10 changes the first-screen declaration from full green back to degraded, records official `km-vectorize` failure evidence, and requires `restartPolicy: Never` / `FallbackToLogsOnError` evidence retention for the next official run |

### 14.10 Post-reboot timeline verification
@@ -11,13 +11,13 @@

| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
-| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 96% | 2026-06-13 14:16 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, deploy marker `a520c32d` put API/Web/Worker image `e897c8bf` live, API/Web remain live-verified split across 120 / 121, and CD no longer clobbers global `known_hosts`. 14:16 escrow status still has `escrow_missing=5`; DR remains blocked by five missing credential escrow evidence markers. ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. |
+| Overall recovery readiness | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | 94% | 2026-06-14 03:11 cold-start scorecard is `PASS=81 WARN=2 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public route/API smoke remains green, deploy marker `7b034b58` put API/Web/Worker image `26b67d11` live, and API/Web remain live-verified split across 120 / 121. The 03:00 official `km-vectorize-29689620` Job failed with `BackoffLimitExceeded`, and 110 has `fwupd` failed units, so full cold-start cannot be declared green. DR remains blocked by five missing credential escrow evidence markers. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
-| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-13 14:16 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-13 02:34:06`; 13:10 escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
-| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 14:16 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity remains covered by the scorecard, schedules/services are green. API/Web both keep 120 / 121 split placement after latest ArgoCD revision `a520c32d`, with live API/Web/Worker image `e897c8bf`. |
-| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.9, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
+| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-14 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
+| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED_WITH_CRON_WARN | 98% | 2026-06-14 03:11 cold-start is degraded by warnings only; public route/API smoke is green, VIP API/Web are reachable, momo current-month parity remains covered by the scorecard, schedules/services are mostly green. API/Web both keep 120 / 121 split placement after latest ArgoCD revision `c82f320b`, with live API/Web/Worker image `26b67d11`; the exception is failed `km-vectorize-29689620`. |
+| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.10, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |

-Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
+Full cold-start may be declared green only for the latest verified evidence set. As of 2026-06-14 03:11, the latest evidence set is degraded, not green. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
@@ -62,6 +62,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-13 security mirror production image closeout | LIVE_VERIFIED | Gitea main `64ea2444` records the Web rebuild trigger. Deploy marker `2cc02f1c chore(cd): deploy 6cf8d3c [skip ci]` put Web image `6cf8d3ca` live; ArgoCD source revision later advanced to `64ea2444` while Web image correctly remains `6cf8d3ca` because `64ea2444` is docs/changelog only. Public `/zh-TW/governance` and `/en/governance` return `200`, API health is `healthy`, `security-mirror-progress-guard.py` passes, and 14:10 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 final post-trigger deploy closeout | LIVE_VERIFIED | Deploy marker `834ccdba chore(cd): deploy bf86017 [skip ci]` put API/Web/Worker image `bf860177` live. ArgoCD revision `834ccdba` is `Synced / Degraded` only by `km-vectorize`; routes `/zh-TW/governance` and `/en/governance` return `200`, API health is `healthy`, source guards pass, backup status has `core_blockers=0` and `escrow_missing=5`, and 14:13 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 final goal audit refresh | SERVICE_GREEN_REMAINING_GATES_EXPLICIT | Clean worktree rebased onto `a520c32d` and reran source guards successfully; live ArgoCD tracks revision `a520c32d` with API/Web/Worker image `e897c8bf`, health `Degraded` only by `km-vectorize`; `km-vectorize` schedule remains `0 3 * * *`, `timeZone=Asia/Taipei`, `failedJobsHistoryLimit=3`, and no failed Job is currently retained. Public `/zh-TW/governance`, `/en/governance`, and `/api/v1/health` are green; backup core blockers remain `0`, `escrow_missing=5`; 14:16 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Remaining gates: five credential escrow markers and next official 03:00 `km-vectorize` success readback. |
+| 2026-06-14 `km-vectorize` official run follow-up | DEGRADED_EVIDENCE_RETENTION_PATCHED | 03:00 official `km-vectorize-29689620` ran from CronJob and failed with `BackoffLimitExceeded`; ArgoCD revision `c82f320b` remains `Synced / Degraded`. Job is retained, but failed Pod `km-vectorize-29689620-nwpqz` was deleted before logs could be read, so root cause remains unproven. GitOps candidate changes `restartPolicy: Never` and adds `terminationMessagePolicy: FallbackToLogsOnError` so the next official failure should retain Pod/log evidence. Backup core remains green, `escrow_missing=5`, and 03:11 cold-start is `PASS=81 WARN=2 BLOCKED=0`. |

---
@@ -134,7 +135,7 @@ Next: <single next action>
| P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
| P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
| P1-014 | DONE | 100 | Publish credential escrow owner request package | 2026-06-13 13:10 live report confirms `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. | Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. | `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` and snapshot exist and validate. |
-| P1-013 | IN_PROGRESS | 92 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed. The schedule is now correct (`0 3` with `timeZone: Asia/Taipei`), but the 2026-06-13 03:00 run did not update `lastSuccessfulTime` and no failed Job was retained because `failedJobsHistoryLimit=0`. This hid the failure evidence and slowed triage. | Apply GitOps `failedJobsHistoryLimit=3`, verify ArgoCD sync, then verify the next 03:00 official CronJob either succeeds or leaves inspectable failure evidence. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`; if it fails, the failed Job/Pod/log remains available. |
+| P1-013 | IN_PROGRESS | 94 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`. The schedule is correct (`0 3` with `timeZone: Asia/Taipei`) and `failedJobsHistoryLimit=3` retained the 2026-06-14 failed Job, but `restartPolicy: OnFailure` still allowed the failed Pod/log to be deleted before inspection. The 03:00 official Job `km-vectorize-29689620` failed with `BackoffLimitExceeded`; root cause remains unproven without logs. | Apply GitOps evidence-retention patch: `restartPolicy: Never` plus `terminationMessagePolicy: FallbackToLogsOnError`, verify ArgoCD sync, then verify the next 03:00 official CronJob either succeeds or leaves inspectable failed Pod/log evidence. Do not manual-run or patch live. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`; if it fails, the failed Job/Pod/log remains available for read-only triage. |

---
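The P1-013 acceptance criteria (success time after the manifest sync, Job `Complete`, ArgoCD `Healthy`) can be expressed as a single read-only predicate. This is a minimal sketch: `parse_utc`, `p1_013_can_close`, and the `SYNCED_AT` value are hypothetical, and the sync time in particular is a placeholder that would have to come from the real ArgoCD sync record.

```python
from datetime import datetime


def parse_utc(stamp: str) -> datetime:
    # Kubernetes writes RFC 3339 timestamps with a trailing "Z"; normalise it so
    # datetime.fromisoformat accepts it on Python versions before 3.11.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def p1_013_can_close(last_successful: str, manifest_synced: str,
                     job_state: str, argocd_health: str) -> bool:
    """All three acceptance conditions must hold at the same time."""
    return (
        parse_utc(last_successful) > parse_utc(manifest_synced)
        and job_state == "Complete"
        and argocd_health == "Healthy"
    )


# Placeholder sync time, NOT a value from the source documents.
SYNCED_AT = "2026-06-14T00:00:00Z"

# State recorded in this commit: stale success time, failed Job, Degraded app.
print(p1_013_can_close("2026-06-04T11:00:37Z", SYNCED_AT, "Failed", "Degraded"))  # False
```

Requiring all three conditions together keeps a manual Job or a stale `lastSuccessfulTime` from closing the item by accident.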
@@ -46,7 +46,11 @@ spec:
             component: km-vectorize
             phase: "4-3"
         spec:
-          restartPolicy: OnFailure
+          # 2026-06-14 Codex: keep failed Job pods inspectable. The 03:00
+          # official run failed with BackoffLimitExceeded, but restartPolicy
+          # OnFailure let the controller delete the failed pod before logs
+          # could be read. Never keeps failed pods attached to retained Jobs.
+          restartPolicy: Never
           containers:
             - name: km-vectorize
               # 2026-05-05 Codex: keep the API image placeholder so CD
@@ -54,6 +58,7 @@ spec:
               # awoooi-api:latest repo returns 400 from Harbor after reboot.
               image: 192.168.0.110:5000/library/api:IMAGE_TAG_PLACEHOLDER
               imagePullPolicy: IfNotPresent
+              terminationMessagePolicy: FallbackToLogsOnError
               command:
                 - python
                 - /app/scripts/cron_km_vectorize.py