fix(k8s): pass project context to km vectorize
All checks were successful
Code Review / ai-code-review (push) Successful in 26s
CD Pipeline / tests (push) Successful in 1m57s
CD Pipeline / build-and-deploy (push) Successful in 6m1s
CD Pipeline / post-deploy-checks (push) Successful in 38s

Your Name
2026-06-14 08:09:39 +08:00
parent 46027e18ef
commit 8ddb80d63d
6 changed files with 72 additions and 5 deletions


@@ -0,0 +1,36 @@
from __future__ import annotations

import importlib.util
from pathlib import Path


def _load_cron_module():
    # Load the cron script directly from its file path under scripts/.
    root = Path(__file__).resolve().parents[3]
    script = root / "scripts" / "cron_km_vectorize.py"
    spec = importlib.util.spec_from_file_location("cron_km_vectorize", script)
    assert spec is not None
    assert spec.loader is not None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def test_project_headers_default_to_awoooi(monkeypatch):
    monkeypatch.delenv("KM_PROJECT_ID", raising=False)
    module = _load_cron_module()
    assert module._project_headers() == {"X-Project-ID": "awoooi"}


def test_project_headers_use_env_value(monkeypatch):
    monkeypatch.setenv("KM_PROJECT_ID", "tenant-a")
    module = _load_cron_module()
    assert module._project_headers() == {"X-Project-ID": "tenant-a"}


def test_project_headers_fallback_when_env_is_blank(monkeypatch):
    # A whitespace-only value must be treated the same as an unset variable.
    monkeypatch.setenv("KM_PROJECT_ID", " ")
    module = _load_cron_module()
    assert module._project_headers() == {"X-Project-ID": "awoooi"}
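
The path-based loading used by `_load_cron_module()` can be illustrated in isolation. The following is a minimal sketch, not part of the commit: `load_module_from_path`, `demo`, and the temporary `demo_script.py` are hypothetical names introduced only for illustration. It suggests that each call builds a fresh module object rather than reusing a cached one.

```python
from __future__ import annotations

import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(name: str, path: Path):
    """Load a Python file as a module without requiring it to be on sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo() -> tuple[str, bool]:
    # Hypothetical stand-in script; the real tests load scripts/cron_km_vectorize.py.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "demo_script.py"
        script.write_text("def greet():\n    return 'hello'\n", encoding="utf-8")
        first = load_module_from_path("demo_script", script)
        second = load_module_from_path("demo_script", script)
        # Each load executes the file again and returns a distinct module object.
        return first.greet(), first is not second
```

Because the module is never registered in `sys.modules` here, every test gets its own copy of the script, which keeps the tests independent of import order.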


@@ -1,3 +1,21 @@
## 2026-06-14: km-vectorize tenant context fix candidate
**Background:** The 03:00 official `km-vectorize-29689620` run is confirmed to have failed. Its Pod/log was deleted because of the old `restartPolicy: OnFailure`, so the root cause cannot be read directly. A follow-up read-only audit of the source and the runtime log patterns found that `/api/v1/knowledge/embed-all` enters `KnowledgeService.embed_all_entries()`, and that this service requires a `project_id` when it calls `get_db_context()`. The API middleware supports `X-Project-ID` / `X-Tenant-ID` / the `project_id` query parameter, but `scripts/cron_km_vectorize.py` previously sent no project context at all. This matches the multiple `db_context_missing` / `Missing tenant context: project_id is required` patterns in the API logs, so the current root-cause candidate is that the internal CronJob carried no tenant context and triggered fail-closed RLS.
**Fix:**
- `scripts/cron_km_vectorize.py` adds `_project_headers()`, which sends `X-Project-ID: awoooi` by default; if `KM_PROJECT_ID` is set, the env value is used; an empty or whitespace-only value falls back to `awoooi`.
- `k8s/awoooi-prod/15-cronjob-km-vectorize.yaml` adds `KM_PROJECT_ID=awoooi`, so the CronJob's tenant context is explicit and auditable.
- The previous round's evidence-retention fix is kept: `restartPolicy: Never` and `terminationMessagePolicy: FallbackToLogsOnError`.
**Verification:**
- Targeted pytest: `DATABASE_URL=postgresql+asyncpg://test:test@127.0.0.1:5432/test pytest apps/api/tests/test_cron_km_vectorize.py apps/api/tests/test_db_context_guard.py -q` reports `7 passed`.
- `kubectl kustomize k8s/awoooi-prod` renders and confirms `KM_PROJECT_ID=awoooi`, `restartPolicy: Never`, and `terminationMessagePolicy: FallbackToLogsOnError`.
- YAML parse and `git diff --check` pass.
**Boundaries:**
- This is a root-cause candidate fix, not proof of completion. It still has to wait until the next official 03:00 `km-vectorize` run succeeds and updates `lastSuccessfulTime`, or, if that run fails, until it leaves Pod/log evidence that allows the fix to continue.
- No manually created Job, no live patch, no deletion of the failed Job, and no fabricated credential escrow evidence.
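
The fail-closed behavior described in this entry can be sketched as follows. This is an illustration under stated assumptions, not the project's middleware: `resolve_project_id` and `MissingTenantContext` are hypothetical names, the precedence order (header, then alternate header, then query parameter) is assumed, and plain case-sensitive mappings stand in for a real request object.

```python
from __future__ import annotations

from typing import Mapping


class MissingTenantContext(Exception):
    """Hypothetical error raised when no project context can be resolved."""


def resolve_project_id(headers: Mapping[str, str], query: Mapping[str, str]) -> str:
    """Return the first non-blank project id, or fail closed if none is present."""
    # Assumed precedence: X-Project-ID, then X-Tenant-ID, then ?project_id=...
    candidates = (
        headers.get("X-Project-ID"),
        headers.get("X-Tenant-ID"),
        query.get("project_id"),
    )
    for value in candidates:
        if value and value.strip():
            return value.strip()
    # Fail closed: refuse to continue instead of falling back to a default tenant.
    raise MissingTenantContext("Missing tenant context: project_id is required")
```

Under this sketch, a request without any of the three inputs is rejected, which would explain why a CronJob call with no `X-Project-ID` header could not reach the embedding logic.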
## 2026-06-14: P2-133 Final release candidate readback completed and formally verified
**Background:** P2-132 completed formal verification of the post-release verifier / rollback gate. The verifier gate must still not be misread as post-release verifier ready, release verification passed, rollback release passed, live apply release passed, or final candidate approved. P2-133 therefore only establishes the final release candidate readback: it reads the post-release verifier gate, rollback release gate, release verification hold, live-apply post-release gate, and blocked post-release transition back into an auditable release candidate state. It does not approve owner release, approve a maintenance window, confirm a rollback owner, pass the final candidate, release live apply, apply the writer, write a receipt, or write result capture / learning / PlayBook trust / reviewer queue / Gateway queue, and it neither sends Telegram messages nor calls the Bot API.


@@ -29,7 +29,7 @@
| Cold-start scorecard | 03:11 read-only scorecard after official `km-vectorize` run: `PASS=81 WARN=2 BLOCKED=0` | `DEGRADED_NO_BLOCKERS` |
| momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
| ArgoCD app health | 03:17 ArgoCD revision `8868c025` is `Synced`; app remains `Degraded` because official Job `km-vectorize-29689620` failed with `BackoffLimitExceeded`. CronJob schedule is `0 3 * * *` with `timeZone=Asia/Taipei`, `failedJobsHistoryLimit=3`, image `26b67d11`, `restartPolicy: Never`, and `terminationMessagePolicy: FallbackToLogsOnError`; the retained Job proves failure, and the next official failed run should keep Pod/log evidence. | `FAILED_EVIDENCE_RETENTION_LIVE` |
| ArgoCD app health | Latest live app remains `Synced / Degraded` because official Job `km-vectorize-29689620` failed with `BackoffLimitExceeded`. Root-cause candidate is missing `X-Project-ID` on the internal `/api/v1/knowledge/embed-all` call; GitOps candidate now adds `KM_PROJECT_ID=awoooi` and sends `X-Project-ID`. Existing live evidence retention remains `restartPolicy: Never` and `terminationMessagePolicy: FallbackToLogsOnError`. | `TENANT_CONTEXT_FIX_WAITING_OFFICIAL_RUN` |
| Workload balancing | Live API/Web/Worker image is `26b67d11`; API/Web pods remain ready, Worker single replica remains healthy | `GREEN` |
| Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |
@@ -52,7 +52,7 @@ GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoC
NO-GO for "full cold-start green" until 110 failed units are resolved/accepted and `km-vectorize` failed Job is cleared by an official successful run.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
Current allowed wording: "core service and backup are available; cold-start is degraded by `km-vectorize` official Job failure and 110 fwupd failed units; DR complete still blocked by credential escrow; `km-vectorize` failed Job is retained and live CronJob now keeps failed Pod/log evidence for the next official run."
Current allowed wording: "core service and backup are available; cold-start is degraded by `km-vectorize` official Job failure and 110 fwupd failed units; DR complete still blocked by credential escrow; `km-vectorize` failed Job is retained, live CronJob keeps failed Pod/log evidence, and tenant context fix is waiting for the next official run."
```
After any future 120 recovery, rerun this exact chain from 110:


@@ -11,7 +11,7 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | 94% | 2026-06-14 03:11 cold-start scorecard is `PASS=81 WARN=2 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public route/API smoke remains green, deploy marker `7b034b58` put API/Web/Worker image `26b67d11` live, and API/Web remain live-verified split across 120 / 121. The 03:00 official `km-vectorize-29689620` Job failed with `BackoffLimitExceeded`, and 110 has `fwupd` failed units, so full cold-start cannot be declared green. ArgoCD auto-synced evidence retention patch `8868c025` at 03:17. DR remains blocked by five missing credential escrow evidence markers. |
| Overall recovery readiness | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | 95% | 2026-06-14 03:11 cold-start scorecard is `PASS=81 WARN=2 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public route/API smoke remains green, deploy marker `7b034b58` put API/Web/Worker image `26b67d11` live, and API/Web remain live-verified split across 120 / 121. The 03:00 official `km-vectorize-29689620` Job failed with `BackoffLimitExceeded`, and 110 has `fwupd` failed units, so full cold-start cannot be declared green. Evidence retention is live; root-cause candidate fix adds `KM_PROJECT_ID=awoooi` / `X-Project-ID` for fail-closed RLS. DR remains blocked by five missing credential escrow evidence markers. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-14 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-14 02:40:22`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED_WITH_CRON_WARN | 98% | 2026-06-14 03:11 cold-start is degraded by warnings only; public route/API smoke is green, VIP API/Web are reachable, momo current-month parity remains covered by the scorecard, schedules/services are mostly green. API/Web both keep 120 / 121 split placement after latest ArgoCD revision `c82f320b`, with live API/Web/Worker image `26b67d11`; the exception is failed `km-vectorize-29689620`. |
@@ -63,6 +63,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-13 final post-trigger deploy closeout | LIVE_VERIFIED | Deploy marker `834ccdba chore(cd): deploy bf86017 [skip ci]` put API/Web/Worker image `bf860177` live. ArgoCD revision `834ccdba` is `Synced / Degraded` only by `km-vectorize`; routes `/zh-TW/governance` and `/en/governance` return `200`, API health is `healthy`, source guards pass, backup status has `core_blockers=0` and `escrow_missing=5`, and 14:13 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 final goal audit refresh | SERVICE_GREEN_REMAINING_GATES_EXPLICIT | Clean worktree rebased onto `a520c32d` and reran source guards successfully; live ArgoCD tracks revision `a520c32d` with API/Web/Worker image `e897c8bf`, health `Degraded` only by `km-vectorize`; `km-vectorize` schedule remains `0 3 * * *`, `timeZone=Asia/Taipei`, `failedJobsHistoryLimit=3`, and no failed Job is currently retained. Public `/zh-TW/governance`, `/en/governance`, and `/api/v1/health` are green; backup core blockers remain `0`, `escrow_missing=5`; 14:16 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Remaining gates: five credential escrow markers and next official 03:00 `km-vectorize` success readback. |
| 2026-06-14 `km-vectorize` official run follow-up | DEGRADED_EVIDENCE_RETENTION_LIVE | 03:00 official `km-vectorize-29689620` ran from CronJob and failed with `BackoffLimitExceeded`; ArgoCD later auto-synced revision `8868c025` and remains `Synced / Degraded`. Job is retained, but failed Pod `km-vectorize-29689620-nwpqz` was deleted before logs could be read, so root cause remains unproven for this run. Live CronJob is now `restartPolicy: Never` plus `terminationMessagePolicy: FallbackToLogsOnError`, so the next official failure should retain Pod/log evidence. Backup core remains green, `escrow_missing=5`, and 03:11 cold-start is `PASS=81 WARN=2 BLOCKED=0`. |
| 2026-06-14 `km-vectorize` tenant context follow-up | ROOT_CAUSE_CANDIDATE_PATCHED | Source audit shows `cron_km_vectorize.py` calls `/api/v1/knowledge/embed-all` without project context, while API middleware and `get_db_context()` require `X-Project-ID` / tenant context for fail-closed RLS. API logs show matching `db_context_missing` / `Missing tenant context` patterns. GitOps candidate adds `KM_PROJECT_ID=awoooi` and sends `X-Project-ID`; targeted pytest `7 passed`, kustomize renders the env/header support path, and no manual Job was created. Completion still waits for the next official 03:00 success or retained failed Pod/log. |
---
@@ -135,7 +136,7 @@ Next: <single next action>
| P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
| P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
| P1-014 | DONE | 100 | Publish credential escrow owner request package | 2026-06-13 13:10 live report confirms `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. | Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. | `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` and snapshot exist and validate. |
| P1-013 | IN_PROGRESS | 95 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`. The schedule is correct (`0 3` with `timeZone: Asia/Taipei`) and `failedJobsHistoryLimit=3` retained the 2026-06-14 failed Job, but the old live `restartPolicy: OnFailure` deleted the failed Pod/log before inspection. The evidence-retention patch is now live: `restartPolicy: Never` plus `terminationMessagePolicy: FallbackToLogsOnError`. Root cause remains unproven until the next official run preserves logs or succeeds. | Verify the next 03:00 official CronJob either succeeds or leaves inspectable failed Pod/log evidence. Do not manual-run or patch live. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`; if it fails, the failed Job/Pod/log remains available for read-only triage. |
| P1-013 | IN_PROGRESS | 96 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`. The schedule is correct (`0 3` with `timeZone: Asia/Taipei`), evidence retention is live, and source/log audit found a strong root-cause candidate: the internal CronJob called `/api/v1/knowledge/embed-all` without `X-Project-ID`, while `get_db_context()` now fail-closes without project context. GitOps candidate adds `KM_PROJECT_ID=awoooi` and script header `X-Project-ID`. | Push the tenant-context patch, verify ArgoCD sync, then verify the next 03:00 official CronJob either succeeds or leaves inspectable failed Pod/log evidence. Do not manual-run or patch live. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`; if it fails, the failed Job/Pod/log remains available for read-only triage. |
---
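
The P1-013 closeout condition (a `lastSuccessfulTime` later than the manifest sync, an official Job that is `Complete`, and ArgoCD health `Healthy`) can be expressed as a small read-only check. This is a sketch only: `closeout_gate_passed` and `parse_k8s_time` are hypothetical helpers, and the caller is assumed to have already read the values from the cluster.

```python
from __future__ import annotations

from datetime import datetime


def parse_k8s_time(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as '2026-06-14T19:00:30Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def closeout_gate_passed(
    last_successful_time: str | None,
    manifest_sync_time: str,
    job_complete: bool,
    app_health: str,
) -> bool:
    """Return True only when every closeout condition holds."""
    if last_successful_time is None:
        # The CronJob has never succeeded, so the gate stays closed.
        return False
    succeeded_after_sync = parse_k8s_time(last_successful_time) > parse_k8s_time(
        manifest_sync_time
    )
    return succeeded_after_sync and job_complete and app_health == "Healthy"
```

The timestamps passed in are whatever the CronJob status and the ArgoCD sync history report; a success recorded before the sync does not count, because it would not have exercised the patched manifest.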


@@ -69,6 +69,10 @@ spec:
    # 2026-05-05 Codex: use the actual Service name; the old
    # awoooi-api DNS name does not exist in awoooi-prod.
    value: "http://awoooi-api-svc.awoooi-prod.svc.cluster.local:8000"
  - name: KM_PROJECT_ID
    # 2026-06-14 Codex: internal cron must send explicit tenant
    # context to pass fail-closed RLS in get_db_context().
    value: "awoooi"
resources:
  requests:
    cpu: "50m"


@@ -15,6 +15,14 @@ import sys
import httpx


def _project_headers() -> dict[str, str]:
    """Return internal API project context headers required by fail-closed RLS."""
    project_id = os.environ.get("KM_PROJECT_ID", "awoooi").strip()
    if not project_id:
        project_id = "awoooi"
    return {"X-Project-ID": project_id}


async def main() -> int:
    api_base = os.environ.get(
        "INTERNAL_API_URL",
@@ -24,7 +32,7 @@ async def main() -> int:
    async with httpx.AsyncClient(timeout=1800) as client:
        try:
-           resp = await client.post(url)
+           resp = await client.post(url, headers=_project_headers())
            print(f"embed-all: {resp.status_code} {resp.text[:200]}")
            if resp.status_code >= 400:
                print(f"ERROR: embed-all returned {resp.status_code}", file=sys.stderr)