awoooi

Author	SHA1	Message	Date
AWOOOI CD	6e76c5dfd5	chore(cd): deploy `c9393c3` [skip ci]	2026-04-30 14:41:46 +08:00
Your Name	c9393c3688	fix(cd): run post deploy checks on host runner [Some checks failed: Code Review / ai-code-review (push) successful in 27s; CD Pipeline / tests (push) successful in 2m46s; CD Pipeline / build-and-deploy (push) successful in 7m46s; CD Pipeline / post-deploy-checks (push) failing after 19s]	2026-04-30 14:31:12 +08:00
AWOOOI CD	19788302df	chore(cd): deploy `80defbe` [skip ci]	2026-04-30 14:26:44 +08:00
Your Name	80defbed7c	fix(aiops): fallback and escalate automation blockers [Some checks failed: CD Pipeline / tests (push) successful in 2m41s; Code Review / ai-code-review (push) successful in 24s; CD Pipeline / build-and-deploy (push) successful in 7m51s; CD Pipeline / post-deploy-checks (push) failing after 2m15s]	2026-04-30 14:13:57 +08:00
Your Name	82649c2cbb	fix(cd): run tests in explicit ci container [Some checks failed: CD Pipeline / build-and-deploy (push) cancelled; CD Pipeline / tests (push) cancelled; CD Pipeline / post-deploy-checks (push) cancelled; Code Review / ai-code-review (push) cancelled]	2026-04-30 14:11:39 +08:00
Your Name	ed2a4838f2	fix(auto): use action parser for repair gates [Some checks failed: CD Pipeline / tests (push) failing after 1m2s; CD Pipeline / build-and-deploy (push) skipped; CD Pipeline / post-deploy-checks (push) skipped; Code Review / ai-code-review (push) successful in 24s]	2026-04-30 14:06:09 +08:00
AWOOOI CD	9ee3cc6242	chore(cd): deploy `4723499` [skip ci]	2026-04-30 11:11:04 +08:00
Your Name	4723499955	fix(cd): install playwright system deps for smoke [All checks were successful: CD Pipeline / tests (push) successful in 1m34s; Code Review / ai-code-review (push) successful in 24s; CD Pipeline / build-and-deploy (push) successful in 6m58s; CD Pipeline / post-deploy-checks (push) successful in 3m7s]	2026-04-30 11:02:12 +08:00
Your Name	e27b462bef	fix(ops): keep disabled gitea runner stopped [All checks were successful: Code Review / ai-code-review (push) successful in 27s]	2026-04-30 10:59:46 +08:00
AWOOOI CD	a0be4ebb03	chore(cd): deploy `0f7e9d3` [skip ci]	2026-04-30 10:54:29 +08:00
Your Name	0f7e9d3467	fix(cd): run docker builds on host runner [All checks were successful: CD Pipeline / tests (push) successful in 1m33s; Code Review / ai-code-review (push) successful in 25s; CD Pipeline / build-and-deploy (push) successful in 9m20s; CD Pipeline / post-deploy-checks (push) successful in 1m33s]	2026-04-30 10:43:33 +08:00
Your Name	7cc10b2599	fix(cd): serialize gitea docker builds [Some checks failed: CD Pipeline / build-and-deploy (push) failing after 40s; Code Review / ai-code-review (push) successful in 24s]	2026-04-30 10:11:50 +08:00
Your Name	e91db52858	docs(logbook): record `639bb64` prod deployment [skip ci]	2026-04-30 09:45:48 +08:00
Your Name	9f15f3cfe4	chore(cd): deploy `639bb64` [skip ci]	2026-04-30 09:41:20 +08:00
Your Name	639bb64788	feat(flywheel): surface ai automation and code review [Some checks failed: Code Review / ai-code-review (push) successful in 31s; CD Pipeline / build-and-deploy (push) failing after 5m23s]	2026-04-30 00:09:25 +08:00
AWOOOI CD	d197e2785d	chore(cd): deploy `4a57c2d` [skip ci]	2026-04-29 15:48:24 +00:00
Your Name	4a57c2d04f	feat(flywheel): expose incident processing timeline [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 10m56s]	2026-04-29 23:38:30 +08:00
AWOOOI CD	dae0aa2312	chore(cd): deploy `d845d53` [skip ci]	2026-04-29 15:06:57 +00:00
Your Name	d845d53257	fix(security): keep Gemini key out of request URLs [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 15m5s]	2026-04-29 22:56:12 +08:00
AWOOOI CD	b857be0a64	chore(cd): deploy `fe2b8f4` [skip ci]	2026-04-29 14:47:51 +00:00
Your Name	fe2b8f4571	fix(flywheel): fallback on OpenClaw degraded responses [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 9m56s]	2026-04-29 22:38:57 +08:00
AWOOOI CD	525a243550	chore(cd): deploy `dccdcdb` [skip ci]	2026-04-29 13:59:53 +00:00
Your Name	dccdcdbaf5	fix(flywheel): unblock action safety and Claude fallback [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 9m45s]	2026-04-29 21:51:18 +08:00
AWOOOI CD	4c91d89dd2	chore(cd): deploy `4115ddd` [skip ci]	2026-04-29 13:04:37 +00:00
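Your Name	f5f41543c9	docs: ADR-105 overturns A2 + LOGBOOK 2026-04-29 "LLM flywheel revival". ADR-105 fully records the decision to overturn the A2 iron rule: - Context: historical background of A2 + how the factual basis changed 2 months later (GPU + qwen2.5:7b) - Decision: 4 changes (IntentType.DIAGNOSE override / chain / openclaw.py task_type / 6 regression tests) - Consequences: positive (flywheel revived) + negative (Ollama becomes a single point of failure) + known debt (ADR-106-109 follow-ups) - Validation: all 1635 tests green before deployment, 5 verification metrics after deployment - Rollback: env switch / git revert. LOGBOOK gains a 2026-04-29 entry: - True root cause: all 4 providers dead + the A2 iron rule excluded Ollama - CD chain of failures: 5 commits all failed (setup_test_schema.sql was missing columns) - Already landed (independent of CD): 17 Prometheus rules + Gemini sanitize - Memory index updated in sync (points to project_revert_a2_ollama_primary.md). Note: docs/ is not in the cd.yaml paths trigger, so this commit does not affect CD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:59:53 +08:00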
Your Name	4115ddde48	fix(cd-blocker-2): add KM columns to setup_test_schema.sql (fixes the real CD root cause) [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 14m4s] ## The earlier `c5b18101` patched the wrong place. I added an ALTER to db/base.py:init_db(), which did not solve the problem because CI does not run init_db(). ## Actual CD flow. The Integration Tests step in `.gitea/workflows/cd.yaml`: 1. Start a temporary `pg-test-b5` container (fresh PG) 2. Create the tables with `psql -f tests/integration/setup_test_schema.sql` 3. Run pytest tests/integration/test_b5_core_flows.py. The `knowledge_entries` table in setup_test_schema.sql lacks the `related_approval_id` + `path_type` columns → INSERT fails. ## Fix. setup_test_schema.sql:110 `CREATE TABLE knowledge_entries` gains: - related_approval_id VARCHAR(64) - path_type VARCHAR(50) - uix_knowledge_incident_path partial unique index - ix_knowledge_related_approval partial index ## Expected effect. CD #1119 (this commit) should succeed, unblocking the deployment backlog of the 4 stuck commits (1114-1118). `fb0c72db` (overturn A2, DIAGNOSE Ollama primary) finally reaches prod. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:54:54 +08:00
Your Name	c5b1810172	fix(cd-blocker): add defensive ALTER for knowledge_entries (fixes CD #1115-1117 all failing) [Some checks failed: CD Pipeline / build-and-deploy (push) failing after 1m38s] 🚨 True root cause: since `fb0c72db` was pushed yesterday, all 4 commits on the CD pipeline (1114-1117) have failed and the prod pod has not been updated for 28 hours → the Telegram alerts the Commander saw at 17:33/17:35 still read "llm_failed". It is not that ai_router failed to overturn A2; the deployment never reached prod. ## CD failure evidence (gitea actions API) ``` #1117 `7b471e7a` failure Gemini sanitize #1116 `3668d49f` failure W2 three items + KMWriter critic #1115 `fb0c72db` failure overturn A2, DIAGNOSE Ollama primary #1114 `8d24f151` failure PR-R1 4 Major fixes #1113 `681b5ac9` success PR-R1 rule→Playbook migration ← last success ``` ## Failure stack trace (job 1267 logs) ``` sqlalchemy.exc.ProgrammingError: column "related_approval_id" of relation "knowledge_entries" does not exist SQL: INSERT INTO knowledge_entries (..., related_approval_id, path_type, ...) test: tests/integration/test_b5_core_flows.py::test_knowledge_entry_view_count ``` ## Root cause. Commit `c22e5f33` (KMWriter) added the ORM columns `related_approval_id` + `path_type`: - `models.py` ORM Mapped columns ✅ - `knowledge.py` Pydantic schema ✅ - `migrations/p1_1_km_idempotent_path_type.sql` adds path_type ✅ - but `db/base.py:init_db()` has no matching ALTER ❌. The CI integration test builds PG from the prod schema → the existing table lacks the new columns → INSERT fails. Earlier I only added the ALTER for `timeline_events.incident_id` and missed `knowledge_entries`. ## Fix. `db/base.py:init_db()` gains 3 defensive SQL statements (same pattern as timeline_events): ```sql ALTER TABLE knowledge_entries ADD COLUMN IF NOT EXISTS related_approval_id VARCHAR(64), ADD COLUMN IF NOT EXISTS path_type VARCHAR(50); CREATE UNIQUE INDEX IF NOT EXISTS uix_knowledge_incident_path ON knowledge_entries(related_incident_id, path_type) WHERE related_incident_id IS NOT NULL AND path_type IS NOT NULL; CREATE INDEX IF NOT EXISTS ix_knowledge_related_approval ON knowledge_entries(related_approval_id) WHERE related_approval_id IS NOT NULL; ``` ## Verification - 1635 unit tests all green - CD #1118 (this commit) is expected to clear the deployment backlog of the 4 failed commits - only once deployment completes will the prod ai_router change in `fb0c72db` (overturn A2) actually take effect Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:44:23 +08:00
Your Name	7b471e7ae2	fix(secret-leak): remove Gemini API key from prod logs (P0 SECRET LEAK) [Some checks failed: CD Pipeline / build-and-deploy (push) failing after 2m6s] ## Problem (prod log evidence, 2026-04-29 11:50). The full Gemini API key appeared in plaintext in the prod log: ``` "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=<redacted>" event: gemini_provider_failed ``` Iron rules violated: - feedback_secret_debug_output_ban.md: debug output containing a secret string must never echo/log the raw value - feedback_secrets_leak_incidents_2026-04-18.md: 2 secret-leak incidents already on record ## Root cause. `gemini.py:118` `logger.warning("gemini_provider_failed", error=str(e), ...)`: str() of an httpx HTTPStatusError contains the full URL (including the ?key=... query string): - The Google Gemini API passes the API key in the query string by design (unlike Claude/NVIDIA, which use headers) - httpx writes the URL into the error message when it raises - str(e) is logged directly → the key enters the K8s pod log → audit log → Sentry → any downstream log consumer ## Fix. New `_sanitize_error()` function: - regex `([?&])key=[^&\s'"]+` → `\1key=<redacted>` - called at the `gemini_provider_failed` log exit point - AIResult.error also uses the sanitized value (no downstream contamination). Only Gemini is fixed (other providers use headers, or are on the intranet without a key): - Claude: API key in the `x-api-key` header → not in the URL → safe - OpenClaw: intranet 188:8088 → no API key → safe - Ollama: intranet 111:11434 → no API key → safe - NVIDIA: API key in the `Authorization: Bearer` header → safe ## Verification - 1635 unit tests all green (the fix breaks no existing behavior) - ran the sanitize function directly and confirmed the `AIzaSy` key is replaced with `<redacted>` ## Known debt - This commit only prevents new leaks; the key still exists in old logs (K8s pod log / Sentry / structlog backend) - The Gemini API key should still be rotated (a leaked key cannot be trusted) - The Commander must manually: 1. Create a new key at https://aistudio.google.com/apikey 2. Replace GEMINI_API_KEY in the K8s secret 3. Revoke the old key Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:49:09 +08:00
Your Name	3668d49f2f	feat(flywheel): W2 three items + KMWriter critic fixes (1635 tests all green) [Some checks failed: CD Pipeline / build-and-deploy (push) failing after 1m38s] W2 (second week of the onboarder 4-week flywheel 80→90 path) + all 5 critical/major findings from the critic PR review are fixed; every flag defaults to false, so there is no blast risk. ## W2 three PRs ### PR-R2: AOL → catalog confidence EWMA write-back (fixes flywheel broken link C2) - New file `apps/api/src/jobs/aol_to_catalog_writeback_job.py` - Logic: scan AOL hourly, compute EWMA confidence (alpha=0.3), write it back to alert_rule_catalog - Failure threshold: N=5 consecutive low success rates → review_status='draft' - Hermes _fetch_noisy_rules SQL adds OR review_status='draft' - ENABLE_AOL_WRITEBACK_JOB=false (default) - 8 tests (mock path corrected: lazy import → patch src.db.base.get_db_context) ### PR-V1: self_healing_validator integration (fixes flywheel broken link C6) - New file `apps/api/src/services/self_healing_validator.py` (pure function assess_self_healing) - Wired into post_execution_verifier.py step 5 (behind a feature flag gate) - evidence_snapshot.py adds self_healing_score / self_healing_detail fields - db/models.py + base.py ALTER IF NOT EXISTS - score < 0.5 → triggers a rollback-proposal Telegram alert (not executed automatically) - ENABLE_SELF_HEALING_VALIDATOR=false (default) - 7 tests ### PR-L1: KM ↔ Playbook bidirectional loop (fixes flywheel broken links C3+C4) - Three new pieces of logic in learning_service.py: 1. _write_playbook_evolution_km: promote/demote writes a KM evolution entry 2. _check_and_mark_playbook_review: accumulating N=5 triggers review_required 3. _demote_alert_rule_catalog_confidence: DEPRECATED → confidence×=0.5 - PlaybookRecord adds a review_required field (schema migration via base.py) - ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=false (default) - KM_PLAYBOOK_REVIEW_THRESHOLD=5, tunable - 6 tests ## KMWriter critic: 5 Critical/Major fixes (found by the earlier critic PR review). Already fixed in the previously pushed commit `c5753e1c`; this commit restores the corresponding files from the stash: - C1 km_writer.py:194 self-defeating backfill (fixed: synchronous await + DLQ) - C2 km_writer.py:391 KM_WRITE_AWAIT=false path tightened - M1 decision_manager.py:2178/2203 remove _fire_and_forget - M2 incident_service.py:1099 hand-rolled path gains retry+DLQ - M3 km_writer.py:166 idempotency claim aligned with the implementation (UPSERT + partial unique index) ## Verification - 1635 unit tests all green (+27 from 1608) - Coexists with `fb0c72db` (overturn A2, Ollama primary) without conflict - All new Jobs/Services default to flag=false (no blast risk) ## Expected impact. Flywheel broken links C2 + C3 + C4 + C6 all fixed. Flywheel autonomy score: 65 → 85 estimated (after W2 completes). Enablement order (once prod `fb0c72db` confirms OLLAMA primary runs): 1. ENABLE_AOL_WRITEBACK_JOB=true 2. ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=true 3. ENABLE_SELF_HEALING_VALIDATOR=true Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:44:04 +08:00
Your Name	fb0c72db42	feat(ai-router): overturn the A2 iron rule: DIAGNOSE primary switched to local-first Ollama [Some checks failed: CD Pipeline / build-and-deploy (push) failing after 2m26s] Commander's iron rule 2026-04-29: "Prefer Ollama on host 111 as primary" + feedback_ai_autonomous_direction.md: local free LLMs first + feedback_ollama_111_only.md: the only Ollama host = 111 ## Factual basis for overturning A2 (2026-04-27 INC-20260425). Old fact: Ollama = CPU-only deepseek-r1:14b @ 238s (unusable). New fact: prod Ollama 111 = M1 Pro Apple Silicon GPU + qwen2.5:7b-instruct fully loaded in 8.2GB VRAM, ctx 32k, measured 0.54s on a "hi" prompt. All cloud providers are dead (2026-04-29 prod log evidence): - OpenClaw 188:8088 → 500 Internal Server Error - Gemini → 429 Too Many Requests (quota exhausted) - Claude → 404 Not Found (model claude-3-haiku-20240307 retired). Without overturning A2 → 100% of incidents llm_failed → AI auto-repair never starts ## Scope of change (minimal, safe, verifiable) ### ai_router.py - `_diagnose_fallback_chain`: OLLAMA first (replaces the old "permanently excluded" comment). Order: [OLLAMA, OPENCLAW_NEMO, GEMINI, CLAUDE] - `_intent_provider_overrides[DIAGNOSE]`: OPENCLAW_NEMO → OLLAMA - _full_fallback_chain untouched (to avoid affecting RESTART/SCALE/CONFIG/DELETE) - _tool_calling_fallback_chain untouched - complexity_map untouched (critic M2 deferred) ### openclaw.py - Inject task_type="diagnose" into alert_context (critic C2 true root cause) - Fix the timeout alignment problem at ai_providers/ollama.py:77: - with task_type → OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s - without → OPENCLAW_TIMEOUT=30s (not enough for qwen2.5:7b inference) - root cause of the latency_ms=120014 seen in the prod log - Copy with dict(alert_context) so the original context is not polluted ## Regression tests updated in sync (5). All A2 iron-rule guard tests now reflect the new rule: - test_p0_diagnose_routing.py::test_diagnose_override_is_ollama (formerly test_diagnose_override_is_openclaw_nemo) - test_ai_router_diagnose_fallback.py::test_diagnose_fallback_chain_ollama_primary (formerly test_diagnose_fallback_chain_no_ollama) - test_ai_router_diagnose_fallback.py::test_diagnose_route_primary_is_ollama (formerly test_diagnose_route_fallback_chain_excludes_ollama) - test_ai_router_diagnose_fallback.py::test_diagnose_route_sync_primary_is_ollama (formerly test_diagnose_route_sync_fallback_chain_excludes_ollama) - test_ai_router_diagnose_fallback.py::test_build_fallback_chain_for_intent_diagnose_with_ollama_primary (formerly test_build_fallback_chain_for_intent_diagnose_no_ollama) - test_ai_router_failover_integration.py::test_router_uses_failover_for_diagnose_ollama_primary (formerly test_router_does_not_use_failover_for_openclaw_nemo). Each test docstring records the historical context + the reason for the overturn. ## Verification - 1608 unit tests all green - 16 LLM-path tests all green (including the 6 updated A2 guard tests) - complexity_scorer / failover_manager / intent_classifier unaffected ## Expected prod behavior (to verify after deployment). Incident arrives → DIAGNOSE intent → primary OLLAMA (qwen2.5:7b on M1 Pro GPU); fallback only on failure → OpenClaw 188 → Gemini → Claude. Ollama uses a 200s timeout (the previous 30s was not enough) → AI auto-repair can finally start, no longer 100% llm_failed ## Known debt (to handle later) - models.json:21 ollama.default is still deepseek-r1:14b (critic C1, but prod already auto-routes to the model actually loaded) - complexity 4/5 still hard-codes gemini/claude (critic M2) - Gemini API key is in plaintext in the prod log (needs rotation + sanitize) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:39:36 +08:00
Your Name	8d24f15183	fix(critic-review): PR-R1 4 Major fixes: wildcard filtering + second confirmation + unverified flag [Some checks failed: CD Pipeline / build-and-deploy (push) failing after 1m34s] The critic PR review of `681b5ac9` revealed 4 Major issues (no Critical); all are fixed. ## Major #1: generic_fallback wildcard pollutes the RAG corpus. Location: rule_to_playbook_migrator.py:128 `_build_symptom_pattern`. Problem: the generic_fallback rule's `alert_names=[""]` is written verbatim into PlaybookRecord; once it enters playbook_rag, the vectorized text "Alert: " becomes an ordinary token, and every query computes similarity against it → RAG top-k may return the fallback DRAFT and mislead recommendations. Fix: filter `["*"]` in `_build_symptom_pattern` (treated the same way as keywords). ## Major #2: CLI --commit has no second confirmation. Location: scripts/migrate_rules_to_playbooks.py. Problem: `--commit` writes 25 DRAFT rows straight into the prod DB, and an accidental run cannot be undone. Fix: - add a `--yes` flag (for CI / automation) - without `--yes`, prompt on stdin: "Type 'yes' to confirm" ## Major #3: yaml_rule kubectl_command does not pass through the SPF-2 action_parser. Location: rule_to_playbook_migrator.py:153 `_build_repair_steps`. Problem: a DRAFT is never auto-promoted (threshold 0.9), but the manual review path has no safety interceptor. If someone promotes with one click in the UI → a dangerous command containing the {target} placeholder goes straight to prod. Fix: add metadata to the step dict: - unverified_command: True - needs_action_parser_review: True - source: "yaml_rule_migration" (the promote flow must be forced through action_parser, to be implemented when SPF-2 lands) ## Minor fixes - Remove dead import `import re` (unused) - `enumerate([:3], start=2)` replaces `if idx >= 4: break` (the boundary form was easy to misread) ## Verification - 23 PR-R1 tests all green (the fixes break no existing behavior) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:56:32 +08:00
Your Name	681b5ac949	feat(flywheel): W1 PR-R1 rule→Playbook migration + PR-K1 timeline defensive ALTER [Some checks failed: run-migration / migrate (push) failing after 12s; Type Sync Check / check-type-sync (push) successful in 1m25s; CD Pipeline / build-and-deploy (push) failing after 1m48s] W1 second wave: the two remaining PRs on the onboarder flywheel 80→90 path. ## PR-R1: migrate 25 yaml rules → DRAFT Playbooks. Broken-link background (onboarder C2): 68% of the 25 rules in alert_rules.yaml hard-code RESTART and have no matching Playbook → RAG always falls back to generic_fallback → rule hit rates never feed back into the catalog. Fix: - New services/rule_to_playbook_migrator.py - Automatically parses each rule from alert_rules.yaml - Generates a PlaybookRecord (status=DRAFT, ai_confidence=0.3, source=YAML_RULE) - Honestly reports confidence 0.3 (not a fake 1.0, which would violate feedback_confidence_truthfulness) - INSERT ON CONFLICT is idempotent (dedup via name LIKE 'AutoMigrated: %', seed data untouched) - New scripts/migrate_rules_to_playbooks.py (CLI: --dry-run/--commit/--disable-flag) - ENABLE_RULE_MIGRATION_DRAFT=true (rollback flag) - 23 tests covering parse / build_dict / idempotent / dry_run / action_type / severity_map / feature_flag / wildcard_filter / partial_existing, etc. ## PR-K1: timeline_events defensive ALTER (db-expert finding). The task's original premise was wrong: the C7 broken link reported by onboarder (incident_id column) was already fixed in the ORM by P1.6 on 2026-04-24. But if the production table was created before P1.6, create_all skips existing tables → ORM writes and SELECTs may still fail to find the column. Fix: - db/base.py:init_db() adds a defensive ALTER: ALTER TABLE timeline_events ADD COLUMN IF NOT EXISTS incident_id VARCHAR(64); CREATE INDEX IF NOT EXISTS ix_timeline_incident_id ON timeline_events(incident_id); - IF NOT EXISTS is a safe no-op (nothing happens if the column already exists) - The stage column was a hallucination in the task description (0 writers in the codebase) and is not added. Not done: - alembic migration (the project does not use alembic; this follows the existing init_db ALTER pattern) - onboarder C7 is already fixed at the ORM layer; this commit ensures the prod schema is aligned ## Verification - 1608 unit tests all green (+23 from 1585) - The 23 PR-R1 tests pass independently ## Expected impact - Flywheel RAG finally has 25 DRAFT Playbooks to query → +5 points - Insurance for prod schema alignment → prevents ORM SELECT failures Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:49:25 +08:00
Your Name	c5753e1c57	fix(critic-review): make KMWriter match its stated contract + fix Alertmanager inhibition + move drift checker to AST. The critic PR review revealed 7 blockers in already-pushed commits; this commit fixes all of them. ## C1 + C2 + M1 + M2 + M3: KMWriter contract truly unified (the critic's 5 most severe findings) ### C1 km_writer.py:194: fix the self-defeating backfill - bare asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe() - synchronous await + dedicated DLQ km:backfill:dlq + try/except so the main write is not blocked - new km_backfill_reconciler_job.py (scans the DLQ every 5 minutes) + ENABLE_KM_BACKFILL_RECONCILER flag - prevents the race where Path B finishes before Path A → related_approval_id stays NULL forever ### C2 km_writer.py:391: KM_WRITE_AWAIT=false path tightened - from ensure_future (fire-and-forget, worse than the old synchronous write) - to await writer.write(retry=1, timeout=2.0) (still awaited, but only one attempt with a short timeout) - docstring explicitly states "for emergency rollback only, reliability not guaranteed" ### M1 decision_manager.py:2178/2203: remove the _fire_and_forget bypass - two call sites of _fire_and_forget(executor.write_execution_result_to_km(...)) - changed to await asyncio.shield(...) + BaseException protection (guards against interruption by an upstream cancel) - KM_WRITE_AWAIT=true finally really awaits on this path ### M2 incident_service.py:1099: hand-rolled path gains retry+DLQ - originally if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task - changed to 3 retries with exponential backoff + DLQ protection (calls km_writer private helpers) ### M3 km_writer.py:166: idempotency claim aligned with the implementation - knowledge_repository.create() gains an UPSERT path (pg_insert ON CONFLICT DO UPDATE) - KnowledgeEntryCreate / KnowledgeEntryRecord gain a path_type field - migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path ## M4 alertmanager.yml: equal: [] tightened (critic: prevent runaway inhibition) - OllamaInstanceDown / KMConverterDown inhibitions gain an equal: ['cluster'] constraint - prevents any single Ollama outage from wrongly inhibiting all AI/SLO alerts in a multi-cluster setup ## M5 Alertmanager version verification (confirmed v0.31.1, well above v0.22+) ## M6 governance_agent.py: health score distinguishes skipped vs ok vs violated - check_slo_compliance adds _meta {violated_count, skipped_count, ok_count, all_skipped, status} - run_self_check: when every SLO is skipped, emits a separate governance_slo_data_gap alert (does not pollute the self_failure count, because no_data means the emitter is unimplemented, not that the governance mechanism failed) ## M7 scripts/check_config_drift.py: switch to AST parsing - regex replaced by ast.parse locating Settings ClassDef AnnAssign Field(default=...) - avoids false negatives on multi-line lists / default_factory= / strings containing line breaks - 4 fields (AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL) all aligned ## New tests - test_km_writer_backfill_reconciler.py: 7 cases (C1 reconciler + safe helper) - test_km_writer_idempotent.py: 5 cases (M3 path_type injection + UPSERT branch) ## Verification - 1585 unit tests all green (+13 from 1572) - amtool check-config SUCCESS (8 inhibit_rules / 2 receivers) - AST-based drift checker: 4 fields all aligned - Alertmanager v0.31.1 confirmed to support the new syntax ## Expected impact - KMWriter matches its contract: the flywheel closed-loop KM write path is 100% reliable - M4 runaway-inhibition risk removed - the governance layer no longer stays silent on SLO no_data - drift checker false-negative risk removed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
Your Name	6878e62af7	feat(flywheel): W1 PR-P1 + ADR-091 T1: flywheel 80→90, first wave. Based on the 10 broken links uncovered by the onboarder end-to-end closed-loop audit + the critic's full survey of iron-rule violations, the W1 first wave fixes core broken link C1 behind flywheel hard evidence 1 + 2. ## W1 PR-P1: guard the four matched_playbook_id break points (C1 fix). Full-stack exploration found that the 4 break points had already been fixed in a previous session; this PR adds: - ENABLE_PLAYBOOK_MATCHING feature flag (default=true) rollback: kubectl set env deployment/awoooi-api ENABLE_PLAYBOOK_MATCHING=false - a flag check at the proposal_service._try_playbook_match_id entry point - 7 e2e tests as a safety net (previously no test coverage). Broken link C1 evidence chain: proposal_service.generate_proposal() → matched_playbook_id → approval_db → approval_repository → learning_service._update_playbook_stats. After 24h, playbooks.trust_score should show real EWMA updates. ## ADR-091 T1: auto_generate_rule dual-writes to the DB (hard evidence 1, first step). Flywheel hard evidence 1: alert_rule_catalog.source='ai_generated' has 0 rows across the whole codebase. auto_generate_rule() writes alert_rules.yaml but not the DB → what the AI learns on its own is decoupled from the catalog. Fix (per ADR-091 §1 D1): - New _insert_catalog_ai_generated(): after the YAML write succeeds, dual-write source='ai_generated', confidence=0.5, review_status='draft', created_by_agent - New _parse_for_to_seconds() helper ("30s"/"5m"/"2h" → seconds) - ON CONFLICT (rule_name) DO NOTHING guarantees idempotency - Transaction strategy: YAML + DB are not in the same transaction (YAML is already the SoT; a DB failure is only logged) - ENABLE_AI_RULE_CATALOG_WRITE feature flag (default=true) rollback: kubectl set env deployment/awoooi-api ENABLE_AI_RULE_CATALOG_WRITE=false. 13 tests: parse helper 8 + business logic 5 (success/db_fail/idempotent/flag/SQL_lit) ## Verification. 1572 unit tests all green (+20 new: PR-P1 7 + ADR-091 T1 13) ## Expected impact. Flywheel autonomy score: 42 → 65 (+23 = C1 +3 + hard evidence 1 +20) ## Known debt (revealed by the critic PR review, handled in the next commit) - 3 caller paths bypass the KMWriter unified contract (C1/M1/M2) - KMWriter idempotency claim does not match the implementation (M3 lacks ON CONFLICT) - Alertmanager equal:[] runaway inhibition + version unverified (M4/M5) - drift checker regex is fragile (M7 should move to AST) - governance health score distorted by skipped checks (M6) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
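Your Name	dc18b0ebd6	fix(prometheus_url): follow-up fix for residual drift: kured gatekeeper + monitoring API. After tracing the whole codebase back to the source, the debugger found 5 residual PROMETHEUS_URL drift sites (root cause: docs/reference/SERVICE-ENDPOINTS.md originally listed Prometheus on 188, which is the source of the drift across the codebase). This commit fixes the 2 most urgent: ## 🔴🔴 kured.yaml:132 (risk of the gatekeeper failing) - 188 → 110 - kured queries Prometheus alerts before rebooting; pointing at the wrong host = skipping the protection and rebooting the host directly - aligned with the ConfigMap + config.py PROMETHEUS_URL ## 🟡 monitoring.py:67 (single source of truth) - hard-coded 110:9090 replaced with settings.PROMETHEUS_URL - the host happened to be correct but bypassed the ConfigMap injection mechanism - avoids another drift if Prometheus migrates again ## Not fixed for now - k3s_monitor_service.py:38 uses 121:30090, a K3s NodePort intranet endpoint that is a different concept from the external PROMETHEUS_URL; it needs a new PROMETHEUS_INTERNAL_URL setting - other docstring + documentation drift (SERVICE-ENDPOINTS.md etc.) is left for later ## Verification. 1552 unit tests all green (no regressions) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00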
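Your Name	6eb33594c2	docs(logbook): T0 12-Agent full-panorama verification record. Following the previous session's wave2 (commit `143c15f0`) + DB cleanup + Gitea HMAC + ArgoCD/Sentry MCP, four experts were dispatched to verify in parallel (critic / db-expert / debugger / tool-expert). Details: B1/B2 ghost buttons + KM early exception swallowing + M1-M4 moderate issues + G1-G3 environment governance gaps. This commit mainly fills in the LOGBOOK index; for the P0/P1 fixes themselves, see the previous 2 commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00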
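Your Name	c22e5f334e	feat(km): P1-1 KMWriter unified contract + 5 callers switched + M4 reverse-lookup chain completed. The 12-Agent full-panorama diagnosis found that the 5 entry points of the KM write path had no unified contract, and fire-and-forget loses entries when a Pod is recycled. This commit extracts KMWriter and enforces 7 contract clauses. ## 7 enforced contract clauses 1. Synchronous baseline: mandatory await asyncio.wait_for(timeout) 2. Retry: 3 attempts with exponential backoff 1s/2s/4s (OperationalError / network-type exceptions) 3. Failure recovery: after 3 attempts, write to Redis DLQ km:dlq + log 4. Observability: structlog event + reserved metric hook (emitter to be added in P1-3) 5. Idempotency: incident_id + path_type as the unique key 6. No swallowed exceptions: except must log + raise/DLQ 7. M4 reverse-lookup chain: when the payload contains approval_id, automatically fill related_approval_id and backfill Path A ## Caller switch (5 entry points, one interface) - incident_service.py:1086 Path A (KB extractor + km_conversion) - approval_execution.py:771 Path B, manual - decision_manager.py:2178 Path B, automatic success (removes the cross-class private method call, M1) - decision_manager.py:2200 Path B, automatic failure (fixes B2 early exception swallowing) - playbook_service.py:210 PlaybookKM (the third path, missed by both T0 reports) ## M4 reverse-lookup chain completed - knowledge.py + models.py: add the related_approval_id ORM column - aligned with the phase26_incident_km_integration.sql:20 schema (partial index already exists) - approval↔KM bidirectional reverse-lookup chain complete (dual-path seam) ## Feature flag (rollback insurance) - KM_WRITE_AWAIT=true (default): await + timeout + DLQ enforced - KM_WRITE_AWAIT=false: fire-and-forget (old behavior) ## Tests - apps/api/tests/test_km_writer.py: 18 tests all green, covering success / timeout / retry / DLQ / idempotency / KMWriteError / on_failure=raise / reverse-lookup backfill - 1552 unit tests all green (no regressions) ## Acceptance. Core of the flywheel closed loop: KM writes are no longer silently lost, and the AI learning chain is not broken. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00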
Your Name	715dc3cb91	fix(observability): P0 stop the false alarms + align ConfigMap drift + governance tooling. P0/P1 observability-layer fixes triggered by the 12-Agent full-panorama diagnosis. ## P0 stop the false alarms (root cause of the 4-SLO avalanche) - governance_agent.py:306: an empty result no longer falls back to 0.0; it now does continue + log warning. Root cause: Prometheus returning no data (emitter unimplemented / rule not deployed) was misread as SLO=0, which always set violated=True and fired 4 false alerts ## P0 ghost-button guard - telegram_gateway.py:1654: when Redis fails for LLM dynamic buttons, btn_list.clear(); first_row (approve/reject, stateless HMAC nonce) is always kept by caller 1488. Aligned with the "three-missing-one" iron rule in feedback_no_ghost_buttons.md ## ConfigMap drift fixes (3 sites) - config.py:683 PROMETHEUS_URL: 188→110 (found by the drift checker = partial root cause of SPF-4) - config.py:705 ARGOCD_URL: 125→121 (known from T0 G3) - config.py:375 AI_FALLBACK_ORDER: add nvidia to match the ConfigMap ## P1 Alertmanager upgrade (amtool SUCCESS) - ops/alertmanager/alertmanager.yml: deprecated → v0.27+ new syntax - match/match_re → matchers - source_match/target_match → source_matchers/target_matchers - group_by adds the team label (prevents 4 SLO-avalanche alerts being pushed in the same second) - PostgreSQL/Redis inhibit rules gain equal: ['instance'] (prevents runaway inhibition) - 3 new causal inhibition groups: - OllamaInstanceDown → SLO_/AI_ (30 minutes) - KMConverterDown → SLO_KMGrowthRate* - SLO__FastBurn → SLO__(Medium|Slow)Burn ## Governance tooling landed - scripts/check_config_drift.py: detects drift between the ConfigMap and code defaults; it found that the PROMETHEUS_URL drift is the SPF-4 root cause (governance_agent connected to 188 instead of 110) - scripts/health_check_session.sh: full-panorama verification of 11 services + 4 SSH + drift + git ## Verification - 1552 unit tests all green - amtool check-config SUCCESS (8 inhibit_rules / 2 receivers) - drift checker: 4 fields all aligned - health check: all 11 services reachable Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
AWOOOI CD	20009cddcf	chore(cd): deploy `143c15f` [skip ci]	2026-04-28 07:36:19 +00:00
Your Name	143c15f052	feat(wave2+km): enable LLM dynamic buttons + automatic KM writes + mark AI Router dead code [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 9m52s] - ConfigMap: USE_LLM_DYNAMIC_BUTTONS=true (B2/B3/B4 handlers all ready) - decision_manager: the auto_execute failure path gains a fire-and-forget KM write - ai_router: _build_fallback_chain marked DEPRECATED 2026-04-28 - tests: test_golden_regression.py adds 172 lines of golden regression tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 15:27:33 +08:00
AWOOOI CD	2e6ae7fe84	chore(cd): deploy `7f200af` [skip ci]	2026-04-28 07:14:34 +00:00
Your Name	7f200aff5f	fix(solver): inject alert labels so params templates are filled with real values [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 10m45s] Root cause: the Solver LLM does not know the real namespace/pod/deployment/instance values, so the recommended_actions.params templates ({labels.namespace} etc.) cannot be filled in → Telegram shows kubectl scale deployment --replicas= (blank). Fix: - solver.run() gains an incident_labels parameter - _build_prompt() lists the labels explicitly for the LLM to reference - orchestrator reads them from snapshot.alert_info.labels and passes them in Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 15:05:06 +08:00
AWOOOI CD	b8a330f9e4	chore(cd): deploy `c1a1be6` [skip ci]	2026-04-27 12:21:13 +00:00
Your Name	c1a1be61bd	fix(ssh-auto): authorize automatic SSH diagnosis for host alerts (HostHighCpuLoad no longer stuck in manual review) [All checks were successful: CD Pipeline / build-and-deploy (push) successful in 9m7s] Root cause: SSH_MCP_ALLOWED_HOSTS was unset → _ssh_execute() blocked everything + auto_approve recognized only kubectl, not ssh → host alerts were always downgraded to manual handling. Fix: - ConfigMap: add a four-host SSH_MCP_ALLOWED_HOSTS allowlist - alert_rules: HostHighCpuLoad and similar rules changed from NO_ACTION to the ssh_diagnose command - auto_approve: _has_executable now recognizes commands starting with ssh - decision_manager: _ssh_execute() gains ssh_diagnose routing - ssh_provider: new ssh_diagnose tool (ps aux + free -h + df -h, read-only) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 20:13:07 +08:00
Your Name	277808758d	fix(failover): add the optional OllamaRoutingResult.health_188 field (dropped in a merge conflict) [Some checks failed: CD Pipeline / build-and-deploy (push) cancelled] During stash pop, --theirs overwrote the health_188 dataclass field definition, so to_dict() raised AttributeError (health_188 was referenced only inside the method). Added health_188: HealthReport | None = None; 37 failover tests ✅ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 20:04:49 +08:00
Your Name	877c2651bf	feat(p3.2.3): Telegram alert on provider version change + updated Gemini quota message [Some checks failed: CD Pipeline / build-and-deploy (push) failing after 1m40s] - FailoverAlerter.alert_provider_version_changed(): - separate dedup key per provider (TTL 3600s) to avoid frequent duplicate alerts - batched notification: one message per round of changes, listing which providers changed version - exceptions are caught by try/except at the tracker layer so the probe schedule is not interrupted - ModelVersionTracker.run_probe_cycle(): - calls alert_provider_version_changed() when changed_providers is non-empty - P3.2.3 integration complete; the alert chain probe → compare → DB → Telegram works end to end - Gemini quota alert message updated: removed the old "188 CPU fallback" wording, now Nemotron → Claude - 6 new tests, 1501 passed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 20:00:03 +08:00
Your Name	b6e4e87e57	test(p3.2): provider_version_alerter unit tests (6 passed) [Some checks failed: CD Pipeline / build-and-deploy (push) cancelled] Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 19:56:51 +08:00
Your Name	ae5e33d254	feat(failover+dispatcher): add the remaining unstaged service changes - callback_dispatcher: params type relaxed to support numeric values - failover_alerter: alert TTL corrected - model_version_tracker: minor adjustments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 19:56:51 +08:00
Your Name	3e382a4225	fix(telegram): P0 async race + P1 short_id collision + P0 incident_id fix - _build_llm_action_buttons is now async, and await setex completes before return (removes the "button sent → clicked → Redis write not finished" race) - short_id grows from 4 bytes → 8 bytes (16 hex chars), a 64-bit collision space - payload now includes incident_id, and the callback handler restores the real ID from the payload (fixes P0-2: keeps short_id out of the context, where it corrupted the KM learning chain) - separate responses for Redis failure and button expiry (P1) - HTML escaping to prevent XSS (P2) - _build_inline_keyboard is now async; both call sites add await - all tests switched to @pytest.mark.asyncio + AsyncMock redis (1495 passed in unit suite) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 19:56:51 +08:00
AWOOOI CD	ded17caca0	chore(cd): deploy `a0502b7` [skip ci]	2026-04-27 11:55:33 +00:00
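
The query-string redaction described in commit 7b471e7ae2 can be sketched in Python roughly as follows. This is only an illustration built around the regex quoted in that commit message; the function name `sanitize_error`, the constant name, and the sample URL are assumptions made for the example and are not taken from the repository's `_sanitize_error()`.

```python
import re

# Regex as quoted in the commit message: a "key=" query parameter preceded
# by "?" or "&", running until the next "&", whitespace, or quote character.
_KEY_PARAM = re.compile(r"""([?&])key=[^&\s'"]+""")


def sanitize_error(message: str) -> str:
    """Return `message` with any `key=...` query parameter value redacted.

    Hypothetical helper for illustration; the real implementation may differ.
    """
    return _KEY_PARAM.sub(r"\1key=<redacted>", message)


if __name__ == "__main__":
    # Sample error text (made up) resembling an HTTP client exception string.
    raw = "Client error for url 'https://example.invalid/v1/generate?key=SECRET123&alt=json'"
    print(sanitize_error(raw))
```

Applying the substitution at the point where the exception is converted to a string keeps the raw value out of every downstream sink at once, which matches the intent stated in the commit message.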

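
Commit 6878e62af7 mentions a `_parse_for_to_seconds()` helper that maps values such as "30s", "5m", and "2h" to seconds. A minimal sketch of such a conversion is shown below, assuming only the three units named in the commit message; the public function name, the rejection of other inputs with `ValueError`, and the integer-only amounts are assumptions for illustration, not the repository's behavior.

```python
import re

# Seconds per unit; only the units that appear in the commit message.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION = re.compile(r"(\d+)([smh])")


def parse_for_to_seconds(value: str) -> int:
    """Convert a duration such as "30s", "5m", or "2h" to seconds.

    Hypothetical sketch: anything outside <integer><s|m|h> raises ValueError.
    """
    match = _DURATION.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


if __name__ == "__main__":
    for sample in ("30s", "5m", "2h"):
        print(sample, parse_for_to_seconds(sample))
```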

1825 Commits