fix(k8s): correct km vectorize cron schedule
All checks were successful
Code Review / ai-code-review (push) Successful in 14s
CD Pipeline / tests (push) Successful in 1m25s
CD Pipeline / build-and-deploy (push) Successful in 4m40s
CD Pipeline / post-deploy-checks (push) Successful in 2m8s

Your Name
2026-06-13 00:59:18 +08:00
parent bdfc5770bd
commit 47ee96b093
4 changed files with 40 additions and 7 deletions

@@ -1,3 +1,26 @@
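## 2026-06-13 | P1-ARGO `km-vectorize` CronJob health remediation
**Live finding (00:50 Asia/Taipei)**
- ArgoCD `awoooi-prod` is still `Synced / Degraded`; the only non-green source is `CronJob/km-vectorize`, whose `lastSuccessfulTime` is stuck at `2026-06-04T11:00:37Z`.
- The current CronJob has `lastScheduleTime=2026-06-12T11:00:00Z`, but no new Jobs were retained from 6/5 through 6/12; the retained Jobs from 6/2, 6/3, and 6/4 are all `Complete`, and their ownerReference points to the current CronJob.
- Most recent visible successful pod log: `embed-all: 200 {"total":32,"success":32,"failed":0}`, which shows the task logic could complete successfully at least as of 6/4.
- The manual Job created by `kubectl create job --from=cronjob/km-vectorize km-vectorize-codex-002709` was flagged as `UnexpectedJob` by the CronJob controller and deleted, so it cannot serve as completion evidence.
- The manifest comment says daily 03:00 Taipei, but `schedule: "0 19 * * *"` combined with `timeZone: Asia/Taipei` actually runs daily at 19:00 Taipei; the time semantics are wrong.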
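The time-semantics finding above can be checked with a short conversion. This is a minimal sketch, assuming Asia/Taipei can be modelled as a fixed UTC+8 offset with no DST; the helper name `daily_fire_time_utc` is hypothetical and it only handles the daily `M H * * *` form used by this manifest.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: Asia/Taipei is modelled as a fixed UTC+8 offset (no DST).
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")


def daily_fire_time_utc(schedule: str, local_day: date, tz: timezone = TAIPEI) -> datetime:
    """Return the UTC instant at which a daily 'M H * * *' cron fires on local_day.

    Hypothetical helper for illustration: only a fixed minute and hour with
    wildcard day-of-month, month, and day-of-week fields are supported.
    """
    minute, hour, dom, month, dow = schedule.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("only daily 'M H * * *' schedules are supported")
    # With spec.timeZone set, the cron fields are read as local time in that zone.
    local = datetime(local_day.year, local_day.month, local_day.day,
                     int(hour), int(minute), tzinfo=tz)
    return local.astimezone(timezone.utc)


# Old manifest value: fires at 19:00 Taipei.
old = daily_fire_time_utc("0 19 * * *", date(2026, 6, 12))
# Corrected value: fires at 03:00 Taipei.
new = daily_fire_time_utc("0 3 * * *", date(2026, 6, 13))

print(old.isoformat())  # 2026-06-12T11:00:00+00:00
print(new.isoformat())  # 2026-06-12T19:00:00+00:00
```

The first value lines up with the observed `lastScheduleTime=2026-06-12T11:00:00Z`, which is consistent with the old schedule firing at 19:00 Taipei rather than 03:00.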
**Fix (source-control)**
- `k8s/awoooi-prod/15-cronjob-km-vectorize.yaml` changed to `schedule: "0 3 * * *"`, keeping `timeZone: Asia/Taipei`.
- `jobTemplate.metadata.labels` adds `app=awoooi`, `component=km-vectorize`, and `phase=4-3`; the pod template also gains `phase=4-3`, so that later Job/Pod queries and the ArgoCD resource tree have consistent labels.
**Completion sync**
- P1-ARGO `80% -> 85%`: root cause narrowed and manifest fix prepared; waiting for Gitea main / ArgoCD sync, after which the next official 03:00 CronJob run should update `lastSuccessfulTime`.
- Overall service recovery stays at `95% SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED`; this is governance health debt, not a current API/Web/backup/cold-start blocker.
- Do not claim ArgoCD is fully green until the `km-vectorize` `lastSuccessfulTime` refreshes and ArgoCD health returns to `Healthy`.
**Boundary**: this entry does not restart Docker, reload Nginx, change the firewall, manually silence alerts, write credential escrow placeholders, or treat a manual Job as an official CronJob success.
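The "do not claim green" gate above can be written down as a small predicate. This is only a sketch under stated assumptions: the function names are hypothetical, the inputs are values an operator has already read back from the cluster and ArgoCD, and the manifest sync timestamp used in the example is a made-up placeholder, not a recorded value.

```python
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as '2026-06-04T11:00:37Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def argo_green_gate(last_successful_time: str,
                    manifest_synced_at: str,
                    scheduled_job_complete: bool,
                    argo_health: str) -> bool:
    """Return True only when every condition of the gate holds.

    Hypothetical helper: a success older than the manifest sync, a Job that
    is not an official scheduled run, or a non-Healthy app all keep the gate closed.
    """
    succeeded_after_sync = (
        parse_k8s_time(last_successful_time) > parse_k8s_time(manifest_synced_at)
    )
    return succeeded_after_sync and scheduled_job_complete and argo_health == "Healthy"


# Current readback: stale success and Degraded health, so the gate stays closed.
# "2026-06-12T17:30:00Z" is a placeholder sync time for illustration only.
print(argo_green_gate("2026-06-04T11:00:37Z", "2026-06-12T17:30:00Z", False, "Degraded"))
```

Keeping the three conditions in one predicate makes it harder to declare green on a partial signal, such as a manual Job that completed but was never an official scheduled run.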
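## 2026-06-13 | P2-103 task result audit trail
**Background**: P2-102 already pinned the 13 categories of candidate operations to dry-run evidence, side-effect count, verifier plan, and manual handoff. The Commander kept pointing out that after TG / AwoooP approval, items often still stall at `learning_recorded`, `manual_review`, or `no_action`, with no clear indication of whether the result should flow to KM, LOGBOOK, the audit trail, the timeline, or a manual-repair next step. P2-103 adds a result-routing contract so that states stuck after approval can be classified, tracked, and handed off.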

@@ -29,7 +29,7 @@
| Cold-start scorecard | 00:34 final read-only scorecard: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
| momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
| ArgoCD app health | Deployments/PDB/CronJobs mostly healthy; app Degraded because `km-vectorize` last success is stale | `GOVERNANCE_DEBT` |
| ArgoCD app health | Deployments/PDB healthy; app Degraded because `km-vectorize` last success is stale. 00:50 remediation corrected the source manifest schedule to 03:00 Asia/Taipei and added Job labels; waiting for official CronJob success readback. | `GOVERNANCE_DEBT_IN_REMEDIATION` |
| Workload balancing | Gitea main `acaae999` synced by ArgoCD; API/Web each have one pod on 120 and one pod on 121 | `GREEN` |
| Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |
@@ -49,6 +49,7 @@ GO for controlled runner/CD release; keep AI auto-remediation governed by normal
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
```
After any future 120 recovery, rerun this exact chain from 110:

@@ -11,15 +11,15 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 00:34 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`). |
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 00:34 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-12 15:54 `backup-all` completed `13/13`; 17:37 offsite full sync completed `13/13`; 18:55 verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status has `core_blockers=0` but `escrow_missing=5`. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 post-rollout cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web now both have 120 / 121 split placement. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.6, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, and host role / load-balancing assessment are updated; Ansible syntax check is unavailable on this workstation. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.7, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
2026-06-13 00:34 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing.
2026-06-13 00:34 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing. 00:50 `km-vectorize` remediation is `85%`: schedule/label manifest fix prepared, waiting for Gitea / ArgoCD sync and next 03:00 CronJob success readback.
---
@@ -54,6 +54,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-12 blocker pursuit | WAITING_EXTERNAL_ACCESS | 15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo. |
| 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_85 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Manifest now corrects the schedule to `0 3 * * *` with `timeZone: Asia/Taipei` and adds Job labels. Next gate is Gitea / ArgoCD sync plus next 03:00 official CronJob readback. |
---
@@ -125,6 +126,7 @@ Next: <single next action>
| P1-010 | DONE | 100 | Offsite sync manual backup repairs | 2026-06-12 17:37 full offsite sync completed `13/13` after controlled P0 runway override to 240m; 18:55 verifier confirmed 13 remote repos each have one snapshot. | Allow normal 03:00 full sync cadence unless another manual backup creates new snapshots. | `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, full sync `13/13`. |
| P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
| P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
| P1-013 | IN_PROGRESS | 85 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed, and the manifest schedule was semantically wrong (`0 19` with `timeZone: Asia/Taipei` ran at 19:00 Taipei, not 03:00). Manual Job evidence is invalid because the controller deleted `km-vectorize-codex-002709` as `UnexpectedJob`. | Push manifest fix, let ArgoCD sync, then verify the next 03:00 official CronJob updates `lastSuccessfulTime` and ArgoCD returns `Healthy`. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`. |
---
@@ -156,6 +158,7 @@ Next: <single next action>
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.6 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 comparison anchor, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, and allowed declaration wording. | Use v1.6 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while 120 / escrow remain red. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, the invalid manual Job boundary, and the 85% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
---

@@ -1,7 +1,7 @@
# =============================================================================
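# KM Vectorize CronJob - ADR-073 Phase 4-3 flywheel learning consolidation
# =============================================================================
# Daily at 03:00 Taipei (19:00 UTC the previous day): automatically vectorize newly added KM entries
# Daily at 03:00 Taipei: automatically vectorize newly added KM entries
# Ensures RAG queries can access the latest knowledge, completing the flywheel's "learning consolidation" node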
#
# 2026-04-12 ogt (ADR-073 Phase 4-3)
@@ -17,8 +17,8 @@ metadata:
    component: km-vectorize
    phase: "4-3"
spec:
  # Daily 19:00 UTC = 03:00 Taipei
  schedule: "0 19 * * *"
  # 2026-06-13 Codex: timeZone is already set to Asia/Taipei, so schedule must use Taipei local time.
  schedule: "0 3 * * *"
  timeZone: "Asia/Taipei"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
@@ -28,6 +28,11 @@ spec:
  failedJobsHistoryLimit: 0
  startingDeadlineSeconds: 300
  jobTemplate:
    metadata:
      labels:
        app: awoooi
        component: km-vectorize
        phase: "4-3"
    spec:
      backoffLimit: 2
      # 2026-05-05 Codex: allow post-reboot/post-migration catch-up batches.
@@ -39,6 +44,7 @@ spec:
          labels:
            app: awoooi
            component: km-vectorize
            phase: "4-3"
        spec:
          restartPolicy: OnFailure
          containers: