From 47ee96b093212e87d3fb342c29ef7b4bd98d079c Mon Sep 17 00:00:00 2001
From: Your Name
Date: Sat, 13 Jun 2026 00:59:18 +0800
Subject: [PATCH] fix(k8s): correct km vectorize cron schedule

---
 docs/LOGBOOK.md                               | 23 +++++++++++++++++++
 docs/runbooks/FULL-STACK-COLD-START-SOP.md    |  3 ++-
 ...oot-cold-start-backup-recovery-workplan.md |  9 +++++---
 k8s/awoooi-prod/15-cronjob-km-vectorize.yaml  | 12 +++++++---
 4 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 73aa8d5c..9586418f 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,26 @@
+## 2026-06-13|P1-ARGO `km-vectorize` CronJob health remediation
+
+**Live finding (00:50 Asia/Taipei)**:
+
+- ArgoCD `awoooi-prod` is still `Synced / Degraded`; the only non-green source is `CronJob/km-vectorize`, whose `lastSuccessfulTime` is stuck at `2026-06-04T11:00:37Z`.
+- The current CronJob reports `lastScheduleTime=2026-06-12T11:00:00Z`, but no new Jobs were retained from 6/5 through 6/12; the retained Jobs from 6/2, 6/3, and 6/4 are all `Complete`, and their ownerReference points to the current CronJob.
+- Most recent visible successful pod log: `embed-all: 200 {"total":32,"success":32,"failed":0}`, which shows the task logic could complete successfully at least as of 6/4.
+- The manual Job created by `kubectl create job --from=cronjob/km-vectorize km-vectorize-codex-002709` was marked `UnexpectedJob` by the CronJob controller and deleted, so it cannot be used as completion evidence.
+- The manifest comment says daily 03:00 Taipei, but `schedule: "0 19 * * *"` combined with `timeZone: Asia/Taipei` actually fires daily at 19:00 Taipei, so the time semantics are wrong.
+
+**Fix (source-control)**:
+
+- `k8s/awoooi-prod/15-cronjob-km-vectorize.yaml` now uses `schedule: "0 3 * * *"` and keeps `timeZone: Asia/Taipei`.
+- `jobTemplate.metadata.labels` gains `app=awoooi`, `component=km-vectorize`, and `phase=4-3`; the pod template also gains `phase=4-3`, so that later Job/Pod queries and the ArgoCD resource tree see consistent labels.
+
+**Completion sync**:
+
+- P1-ARGO `80% -> 85%`: root cause narrowed, manifest fix prepared; after Gitea main / ArgoCD sync, the next official 03:00 CronJob run is expected to update `lastSuccessfulTime`.
+- Overall service recovery stays at `95% SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED`; this is governance health debt, not a current API/Web/backup/cold-start blocker.
+- Do not declare ArgoCD fully green until `km-vectorize` refreshes `lastSuccessfulTime` and ArgoCD health returns to `Healthy`.
+
+**Boundaries**: this entry does not restart Docker, reload Nginx, change the firewall, manually silence alerts, write credential escrow placeholders, or treat a manual Job as an official CronJob success.
+
 ## 2026-06-13|P2-103 Task result audit trail
 
 **Background**: P2-102 already pinned the 13 candidate operation types to dry-run evidence, side-effect count, verifier plan, and manual handoff; the commander keeps pointing out that after TG / AwoooP approval, items still often stall at `learning_recorded`, `manual_review`, or `no_action`, without clearly showing whether the result should route to KM, LOGBOOK, the audit trail, the timeline, or a manual-repair next step. P2-103 adds a result-routing contract so that states stuck after approval can be classified, tracked, and handed off.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 865bbc49..fad0d089 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -29,7 +29,7 @@
 | Cold-start scorecard | 00:34 final read-only scorecard: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
 | momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
 | momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
-| ArgoCD app health | Deployments/PDB/CronJobs mostly healthy; app Degraded because `km-vectorize` last success is stale | `GOVERNANCE_DEBT` |
+| ArgoCD app health | Deployments/PDB healthy; app Degraded because `km-vectorize` last success is stale. 00:50 remediation corrected the source manifest schedule to 03:00 Asia/Taipei and added Job labels; waiting for official CronJob success readback. | `GOVERNANCE_DEBT_IN_REMEDIATION` |
 | Workload balancing | Gitea main `acaae999` synced by ArgoCD; API/Web each have one pod on 120 and one pod on 121 | `GREEN` |
 | Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |
 
@@ -49,6 +49,7 @@ GO for controlled runner/CD release; keep AI auto-remediation governed by normal
 NO-GO for "DR complete" while credential escrow evidence markers are missing.
 Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
 GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
+NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
 ```
 
 After any future 120 recovery, rerun this exact chain from 110:
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 318df561..94991864 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -11,15 +11,15 @@
 
 | Area | Status | Completion | Evidence |
 |------|--------|------------|----------|
-| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 00:34 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`). |
+| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 00:34 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt. |
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-12 15:54 `backup-all` completed `13/13`; 17:37 offsite full sync completed `13/13`; 18:55 verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status has `core_blockers=0` but `escrow_missing=5`. |
 | P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 post-rollout cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web now both have 120 / 121 split placement. |
-| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.6, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, and host role / load-balancing assessment are updated; Ansible syntax check is unavailable on this workstation. |
+| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.7, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
 
 Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
 
-2026-06-13 00:34 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing.
+2026-06-13 00:34 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing. 00:50 `km-vectorize` remediation is `85%`: schedule/label manifest fix prepared, waiting for Gitea / ArgoCD sync and next 03:00 CronJob success readback.
 
 ---
 
@@ -54,6 +54,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
 | 2026-06-12 blocker pursuit | WAITING_EXTERNAL_ACCESS | 15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo. |
 | 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
 | 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
+| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_85 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Manifest now corrects the schedule to `0 3 * * *` with `timeZone: Asia/Taipei` and adds Job labels. Next gate is Gitea / ArgoCD sync plus next 03:00 official CronJob readback. |
 
 ---
 
@@ -125,6 +126,7 @@ Next:
 | P1-010 | DONE | 100 | Offsite sync manual backup repairs | 2026-06-12 17:37 full offsite sync completed `13/13` after controlled P0 runway override to 240m; 18:55 verifier confirmed 13 remote repos each have one snapshot. | Allow normal 03:00 full sync cadence unless another manual backup creates new snapshots. | `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, full sync `13/13`. |
 | P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
 | P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
+| P1-013 | IN_PROGRESS | 85 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed, and the manifest schedule was semantically wrong (`0 19` with `timeZone: Asia/Taipei` ran at 19:00 Taipei, not 03:00). Manual Job evidence is invalid because the controller deleted `km-vectorize-codex-002709` as `UnexpectedJob`. | Push manifest fix, let ArgoCD sync, then verify the next 03:00 official CronJob updates `lastSuccessfulTime` and ArgoCD returns `Healthy`. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`. |
 
 ---
 
@@ -156,6 +158,7 @@ Next:
 | P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.6 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 comparison anchor, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, and allowed declaration wording. | Use v1.6 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while 120 / escrow remain red. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
+| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, the invalid manual Job boundary, and the 85% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
 
 ---
 
diff --git a/k8s/awoooi-prod/15-cronjob-km-vectorize.yaml b/k8s/awoooi-prod/15-cronjob-km-vectorize.yaml
index c14a04c9..20ac6a76 100644
--- a/k8s/awoooi-prod/15-cronjob-km-vectorize.yaml
+++ b/k8s/awoooi-prod/15-cronjob-km-vectorize.yaml
@@ -1,7 +1,7 @@
 # =============================================================================
 # KM Vectorize CronJob - ADR-073 Phase 4-3 flywheel learning consolidation
 # =============================================================================
-# Daily 03:00 Taipei (19:00 UTC the previous day): automatically vectorize newly added KM entries
+# Daily 03:00 Taipei: automatically vectorize newly added KM entries
 # Ensures RAG queries can access the latest knowledge, completing the flywheel's "learning consolidation" node
 #
 # 2026-04-12 ogt (ADR-073 Phase 4-3)
@@ -17,8 +17,8 @@ metadata:
     component: km-vectorize
     phase: "4-3"
 spec:
-  # Daily 19:00 UTC = 03:00 Taipei
-  schedule: "0 19 * * *"
+  # 2026-06-13 Codex: timeZone is already set to Asia/Taipei, so schedule must use Taipei local time.
+  schedule: "0 3 * * *"
   timeZone: "Asia/Taipei"
   concurrencyPolicy: Forbid
   successfulJobsHistoryLimit: 3
@@ -28,6 +28,11 @@ spec:
   failedJobsHistoryLimit: 0
   startingDeadlineSeconds: 300
   jobTemplate:
+    metadata:
+      labels:
+        app: awoooi
+        component: km-vectorize
+        phase: "4-3"
     spec:
       backoffLimit: 2
       # 2026-05-05 Codex: allow post-reboot/post-migration catch-up batches.
@@ -39,6 +44,7 @@ spec:
         labels:
           app: awoooi
           component: km-vectorize
+          phase: "4-3"
         spec:
           restartPolicy: OnFailure
           containers:
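
Reviewer note: the schedule change can be sanity-checked with a few lines of Python. The sketch below is not part of the patch; it assumes the CronJob `timeZone` field makes the controller interpret the hour and minute as Asia/Taipei local time, and the helper name `daily_fire_utc` is hypothetical. It converts a daily local fire time to UTC so the result can be compared with the `lastScheduleTime=2026-06-12T11:00:00Z` readback quoted in the LOGBOOK entry.

```python
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo, ZoneInfoNotFoundError
except ImportError:  # very old Python; the fallback below is used instead
    ZoneInfo = None
    ZoneInfoNotFoundError = Exception


def _load_tz(name: str):
    """Load an IANA zone; fall back to a fixed UTC+8 offset if tz data is missing.

    The fallback is an assumption that only suits Asia/Taipei, which currently
    uses a constant UTC+8 offset.
    """
    if ZoneInfo is not None:
        try:
            return ZoneInfo(name)
        except ZoneInfoNotFoundError:
            pass
    return timezone(timedelta(hours=8))


def daily_fire_utc(hour: int, minute: int, tz_name: str, local_date: tuple) -> datetime:
    """Return the UTC instant at which 'minute hour * * *' fires on a local date."""
    year, month, day = local_date
    local = datetime(year, month, day, hour, minute, tzinfo=_load_tz(tz_name))
    return local.astimezone(timezone.utc)


# Old manifest: "0 19 * * *" interpreted in Asia/Taipei.
old = daily_fire_utc(19, 0, "Asia/Taipei", (2026, 6, 12))
# New manifest: "0 3 * * *" interpreted in Asia/Taipei.
new = daily_fire_utc(3, 0, "Asia/Taipei", (2026, 6, 13))

print(old.isoformat())  # 2026-06-12T11:00:00+00:00
print(new.isoformat())  # 2026-06-12T19:00:00+00:00
```

The old schedule lands on 11:00 UTC, which agrees with the recorded `lastScheduleTime` and with the stale `lastSuccessfulTime` of `2026-06-04T11:00:37Z`. The new schedule lands on 19:00 UTC of the previous day, which is what the removed manifest comment intended.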