awoooi

Author	SHA1	Message	Date
OG T	c1f23cfabe	feat(coverage_evaluator): extend 4 dimensions: playbook/remediation/rule_matching/rule_creation. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. Review blind spot: of the 7 coverage dimensions only 3 were implemented (monitoring/alerting/km); the other 4 were always unknown. v2 adds: + auto_playbook: asset.name appears in playbooks.symptom_pattern/description (approved status) → green; no matching playbook but type='k8s_workload' → yellow. + auto_remediation: remediation_events.target_resource ILIKE asset.name in the past 30d → green; no target but k8s_workload/container → red (should have remediation capability but does not). + auto_rule_matching: in the past 30d incidents.affected_services ILIKE asset.name, or incidents.alertname matches alert_rule.labels.host/namespace → green; never fired → yellow (could mean no problems, could mean no coverage). + auto_rule_creation: alert_rule_catalog source='ai_generated' matches the asset → green; currently every rule is yaml_hardcoded → all red (AI has not yet created rules on its own initiative); this turns green once Hermes produces AI rules. Unlocks: a complete 7-dimension coverage SLO KPI (MASTER §7.1): - red count = real governance gaps - green ratio = automation maturity - AI can proactively recommend coverage actions for red assets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:54:36 +08:00
OG T	ba18ad2ef8	feat(hermes+rules): LLM upgrade for Hermes + Commander's decision to deprecate PostgreSQLDiskGrowthRate. All checks successful: Deploy Alert Rules / Deploy Prometheus Alert Rules (push) in 40s; CD Pipeline / build-and-deploy (push) in 8m37s. Commander's 2026-04-19 decisions: - Rule 1 PostgreSQLDiskGrowthRate → option C: deprecate + replace with new rules - Rule 2 NoAlertsReceived2Hours → keep (it guards the real alerting chain) - fix the noise_rate algorithm first (NO_ACTION does not count as fp), then tune dynamically after observation. 1. rule_stats_updater v2 noise algorithm: before: any EXPIRED approval counted as fp. Problem: NO_ACTION/OBSERVE/INVESTIGATE are pure AI observation and should not count as false alarms. Fix: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ... 2. hermes_rule_quality v2 LLM upgrade: new _llm_analyze_noisy_rule: - uses OpenClaw (Ollama/NemoTron/Gemini) to analyze each noisy rule - JSON output: probable_root_causes/recommended_actions/confidence/should_deprecate - 3-path parse fallback (direct / NemoTron wrapper / nested in description). _write_advisory_aol adds llm_analysis to output_payload. _send_telegram_summary adds the AI verdict + top 2 recommendations (capped at 8 rules to keep the message short). Complies with the Commander's iron rule: AI analyzes but never acts automatically; humans still decide. 3. ops/monitoring/alerts-unified.yml replaces Rule 1: removes PostgreSQLDiskGrowthRate (500MB/h growth → false alarms triggered by normal WAL behavior), adds HostDiskUsageHigh (>80% for 10m, warning), adds HostDiskUsageCritical (>90% for 5m, critical); both carry labels.supersedes='PostgreSQLDiskGrowthRate' for traceability (to be applied to Prometheus on the next deploy-alerts workflow run). 4. Mark deprecated in the DB immediately (so Hermes does not push it again before the alerts yaml deploys): UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:39:05 +08:00
OG T	6ab0ce9c75	feat(aiops): Hermes rule quality advisor: E3 AI rule-quality recommendations (conservative version). All checks successful: CD Pipeline / build-and-deploy (push) in 8m22s. After rule_stats ran, the data showed 2 rules with a 100% noise_rate: - PostgreSQLDiskGrowthRate (tp=0 fp=2) - NoAlertsReceived2Hours (tp=0 fp=1), plus MoWoooWorkDown (33%) and KubePodCrashLooping (25%). New hermes_rule_quality_job.py (~210 lines): analyzes alert_rule_catalog daily at 04:00 Taipei: - threshold: noise_rate >= 0.7 AND sample size >= 5 - writes aol('rule_rejected', proposed_action='review_or_deprecate') for each rule - pushes a Telegram summary to the SRE group. Alignment with the Commander's iron rule: ✅ never changes review_status automatically (deprecation is a human decision; AI only recommends) ✅ the threshold is a "discussion trigger", not a "final decision" ✅ aol(rule_rejected) leaves a trail that can later be upgraded to LLM debate. Unlocks the E3 Hermes foundation: LLM analysis of false-alarm root causes (expr missing a for: window, label match too broad, inherently noisy metric, etc.) can be added later to produce concrete improvement suggestions. Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 18:11:26 +08:00
OG T	e677773e39	fix(asset_scanner): Pod→Deployment via ReplicaSet bridge (fixes missing relationships). All checks successful: CD Pipeline / build-and-deploy (push) in 9m31s. Review blind spot: asset_relationship measured 52 rows, but all were Pod→StatefulSet + Pod→ConfigMap, with no Pod→Deployment at all. Root cause: in K8s, Pod.ownerReferences[0].kind = 'ReplicaSet' (99% of cases); a Deployment manages a ReplicaSet, which manages Pods (a two-level owner chain). The original code only matched kind in (deployment/statefulset/daemonset) → skipped ReplicaSet → every Pod→Deployment relationship was missed. Fix v3.1: 0. new collect-replicasets pass (bridge only, not written to asset_inventory); builds the rs_to_deployment map: {ns/rs_name: deployment_name} 2. Pod ownerRef.kind='ReplicaSet' → look up rs_to_deployment → create Pod→Deployment. Expected effect: - asset_relationship grows from 52 → 150+ (every Deployment-managed Pod has a relationship) - OpenClaw blast_radius counts the Pods affected by a Deployment correctly. ReplicaSets are not written as assets (they are ephemeral intermediaries that rolling updates create in bulk, which would pollute the inventory). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:26:57 +08:00
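The two-level owner lookup described in commit e677773e39 can be sketched as follows. This is a minimal illustration, assuming the inputs are the parsed `items` of `kubectl get ... -o json`; the function names `build_rs_to_deployment` and `pod_deployment` are hypothetical and not taken from the repository.

```python
def build_rs_to_deployment(replicasets):
    """Map 'namespace/replicaset_name' to the name of the owning Deployment."""
    mapping = {}
    for rs in replicasets:
        meta = rs["metadata"]
        for owner in meta.get("ownerReferences", []):
            if owner.get("kind") == "Deployment":
                mapping[f'{meta["namespace"]}/{meta["name"]}'] = owner["name"]
    return mapping


def pod_deployment(pod, rs_to_deployment):
    """Return the Deployment that owns the pod through a ReplicaSet, or None."""
    meta = pod["metadata"]
    for owner in meta.get("ownerReferences", []):
        if owner.get("kind") == "ReplicaSet":
            key = f'{meta["namespace"]}/{owner["name"]}'
            return rs_to_deployment.get(key)
    return None


# Hypothetical sample objects, trimmed to the fields the lookup reads.
replicasets = [{
    "metadata": {
        "namespace": "awoooi-prod",
        "name": "awoooi-api-7d9f",
        "ownerReferences": [{"kind": "Deployment", "name": "awoooi-api"}],
    }
}]
pod = {
    "metadata": {
        "namespace": "awoooi-prod",
        "name": "awoooi-api-7d9f-abcde",
        "ownerReferences": [{"kind": "ReplicaSet", "name": "awoooi-api-7d9f"}],
    }
}
rs_map = build_rs_to_deployment(replicasets)
owner_name = pod_deployment(pod, rs_map)
```

The ReplicaSet is used only as a lookup key here, which matches the commit's choice not to store ReplicaSets as assets.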
OG T	c8b263db06	fix(coverage_evaluator): KM column fix ke.body → ke.content + broader title matching. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. Measured after `df71c9a` was deployed and coverage_evaluator took effect: - monitoring: 2 hosts match Prometheus targets - alerting: 74 rows (22 green + 52 red) - km: 0 (error: column "ke.body" does not exist). Root cause: the knowledge_entries column is named 'content', not 'body'. Fix: ke.content ILIKE '%name%' OR ke.title ILIKE '%name%'. Also removes an unused import (typing.Any). The next coverage_evaluator tick will correctly UPDATE the auto_km_creation dimension. Unlocks complete 3-dimension coverage (monitoring/alerting/km). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:24:46 +08:00
OG T	92349bc37c	feat(aiops): asset_change_tracker: all 8 zero-writer tables now live. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. Review blind spot 10: asset_change_event still had 0 rows (the last zero-writer table). New asset_change_tracker_job.py (~180 lines): every 1h compares the two most recent asset_discovery_run rows and writes asset_change_event: ✅ asset_added: present in the newer run but not the older run (EXCEPT SET) ✅ asset_removed: present in the older run but not the newer ✅ lifecycle_changed: asset_inventory.lifecycle_state='deprecated' and updated_at within the last 2h. Uses SET EXCEPT to avoid N+1; a single INSERT covers all diffs. All 8 ADR-090 zero-writer tables now have a writer: ✅ asset_inventory / asset_discovery_run / asset_coverage_snapshot / asset_relationship / asset_change_event / asset_compliance_snapshot (asset_) ✅ alert_rule_catalog ✅ host_capacity_snapshot / capacity_violation_event (capacity_). Phase 7 asset inventory + coverage matrix + change tracking is fully implemented. Next, the Hermes AI agent can analyze quality (deprecate noisy rules, recommend coverage fixes). Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:18:34 +08:00
OG T	df71c9a37b	feat(aiops): rule_stats_updater: compute noise_rate + true/false positives. All checks successful: CD Pipeline / build-and-deploy (push) in 8m26s. Review blind spot 5: alert_rule_catalog has 68 rows, but noise_rate/TP/FP/last_fired_at were all NULL. New rule_stats_updater_job.py (~170 lines): every 1h UPDATEs the whole alert_rule_catalog table, deriving values from incidents + approval_records: - last_fired_at = max(incidents.created_at WHERE alertname=rule_name) - true_positive_count = count incidents.status='RESOLVED' past 30d - false_positive_count = count approval_records.status='EXPIRED' past 30d (EXPIRED = unhandled for 48h, treated as a false-alarm proxy) - noise_rate = fp / (tp + fp). Window: 30 days (configurable). Uses a single UPDATE + subquery to avoid N+1 (68 rules × 3 queries = 204 queries → 1 query). Unlocks E3 Hermes: the Hermes AI agent can later read alert_rule_catalog WHERE noise_rate > 0.5 and propose review_status='deprecated' or superseded_by_rule_id. Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:05:30 +08:00
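The noise_rate formula in commit df71c9a37b is simple enough to sketch directly. The commit does not say what happens when a rule has no samples, so returning `None` (SQL NULL) for `tp + fp == 0` is an assumption, as is the function name.

```python
def noise_rate(true_positives, false_positives):
    """noise_rate = fp / (tp + fp); None when there are no samples (assumed)."""
    total = true_positives + false_positives
    if total == 0:
        return None  # assumption: leave the column NULL rather than divide by zero
    return false_positives / total


# tp=0, fp=2 are the PostgreSQLDiskGrowthRate figures quoted in commit 6ab0ce9c75.
rate_all_noise = noise_rate(0, 2)
rate_quarter = noise_rate(3, 1)  # hypothetical counts
rate_unknown = noise_rate(0, 0)
```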
OG T	505232336b	feat(aiops): coverage_evaluator: upgrade coverage_snapshot from unknown to real status. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. Review blind spot 4: all 546 asset_coverage_snapshot rows were 'unknown' and carried no real meaning. New coverage_evaluator_job.py (~270 lines): every 1h upgrades 3 dimensions of the coverage_snapshot for the latest asset_discovery_run: ✅ auto_monitoring: checks the host asset IP against Prometheus /api/v1/targets → green (has target) / red (no target) ✅ auto_alerting: whether alert_rule_catalog.labels match the asset → three match strategies (host/namespace/layer), green/red ✅ auto_km_creation: knowledge_entries.body ILIKE asset.name → green (has KM) / yellow (no KM). The evidence JSONB records the basis for each upgrade for later AI audit. Not implemented (left unknown): ⏳ auto_rule_matching (needs alert history statistics) ⏳ auto_playbook / auto_remediation / auto_rule_creation (needs the playbook table). Expected effect (after the next evaluator run + coverage_snapshot UPDATE): - 546 coverage rows go from 100% unknown → 30-50% green/red/yellow - a real "coverage SLO" KPI becomes computable (MASTER §7.1) - AI can spot red assets in coverage_snapshot and proactively push remediation. Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:02:30 +08:00
OG T	fdf8b739f1	feat(asset_scanner): v3 adds more resource types + asset_relationship builder. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. Review blind spot: the original MVP only scanned pods (39 assets). This change adds resource-type scans: - nodes (asset_type='host'): physical hosts - deployments/statefulsets/daemonsets (asset_type='k8s_workload') - services (asset_type='k8s_resource') - configmaps (asset_type='k8s_resource'). Skips secrets (awoooi-executor RBAC forbids list, which is the correct design). Adds automatic asset_relationship creation: - Pod → Deployment/StatefulSet/DaemonSet (depends_on, via ownerReferences) - Service → Pod (routes_to, via spec.selector matching Pod.labels) - Pod → ConfigMap (depends_on, via spec.volumes[].configMap.name). Uses ON CONFLICT (from/to/type) DO UPDATE last_verified_at to stay idempotent. Adds the _fetch_kubectl_json helper (nodes omit --all-namespaces). Adds per-type asset builders _build_{pod,workload,service,node,configmap}_asset. Expected effect (after the next scan in 1h or on Pod restart): - asset_inventory: 39 → 300+ (multiple resource types cluster-wide) - asset_relationship: 0 → hundreds (OpenClaw blast-radius calculation finally has a topology). Unlocks downstream: - AI blast_radius calculation can query asset_relationship (previously no data) - the blast_radius_calculator in MASTER §3.3 D3 Declarative Remediation has a real dependency graph. Refs: ADR-090 §3.3, MASTER §3.1 L6×D1 (8D sensory topology) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:54:18 +08:00
OG T	02263445c2	fix(asset_scanner): switch kubectl to subprocess: K8sProvider does not support --all-namespaces. All checks successful: CD Pipeline / build-and-deploy (push) in 9m9s. After `5b9b36f` was deployed, asset_scanner ran 3 times but reported total=0, new=0: - asset_inventory still had 0 rows - running kubectl get pods --all-namespaces -o json manually inside the Pod returned JSON - Root cause: K8sProvider._kubectl_get puts the namespace argument into '-n $ns', so '--all-namespaces' becomes '-n --all-namespaces' (rejected by kubectl). Fix: - bypass K8sProvider and call asyncio.create_subprocess_exec directly - kubectl get pods --all-namespaces -o json - 30s timeout; rc != 0 raises RuntimeError, which triggers aol status='failed'. Verification: after deployment, asset_inventory should start receiving pod rows within 1 minute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:31:26 +08:00
OG T	4259a104f5	feat(aiops): capacity_scanner + compliance_scanner (remaining 2 of ADR-090 Phase 7). Checks failed: CD Pipeline / build-and-deploy (push) cancelled. Completes the 3rd and 4th ADR-090 Phase 7 services, unlocking 2 zero-writer tables: B3. apps/api/src/jobs/capacity_scanner_job.py (~300 lines) - pulls Prometheus node_exporter daily at 02:00 Taipei - writes host_capacity_snapshot (load1/5/15, cpu, iowait, mem, swap) - heuristic ai_verdict: cpu>80 or mem>85 → critical; >60/70 → warning - above the hard thresholds → writes capacity_violation_event - writes aol(capacity_recommendation). B4. apps/api/src/jobs/compliance_scanner_job.py (~270 lines) - iterates active assets in asset_inventory daily at 03:00 Taipei - writes a 7-dimension compliance snapshot per asset - secret_rotated: real check (metadata.creationTimestamp > 90d = warning) - the other 6 dimensions (ssl_cert_valid / cve_scan / backup_tested / audit_log_enabled / access_reviewed / encryption_at_rest) are 'unknown' placeholders + detail TODO, with logic to be filled in by later agents - writes an aol(coverage_recalculated) summary. main.py lifespan wires the 2 new loops. Expected unlocks (together with B1 asset_scanner + B2 rule_catalog_sync): - asset_inventory: 0 → hundreds (B1) - asset_discovery_run: 0 → 1 per hour (B1) - asset_coverage_snapshot: 0 → assets × 7 dimensions (B1) - alert_rule_catalog: 0 → ~68 rules (B2) - host_capacity_snapshot: 0 → hosts per day (B3) - capacity_violation_event: 0 → whenever thresholds are exceeded (B3) - asset_compliance_snapshot: 0 → assets × 7 dimensions (B4). automation_operation_log gains 4 op_types: asset_discovered / rule_created / rule_updated / capacity_recommendation / coverage_recalculated. All 8 zero-writer tables now have a writer; the ADR-090 Phase 7 implementation is complete. Refs: ADR-090 §4.2 Phase 4, MASTER §3.5 D5 (capacity-aware) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:23:27 +08:00
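The heuristic ai_verdict in commit 4259a104f5 reads as a two-tier threshold check. The sketch below assumes that ">60/70" means cpu > 60 or mem > 70, and that values below both tiers yield a verdict named "ok"; both points, and the function name, are assumptions rather than confirmed behavior.

```python
def capacity_verdict(cpu_pct, mem_pct):
    """Classify a host snapshot from CPU and memory utilization percentages."""
    if cpu_pct > 80 or mem_pct > 85:
        return "critical"
    if cpu_pct > 60 or mem_pct > 70:  # assumed reading of ">60/70"
        return "warning"
    return "ok"  # assumed label for the healthy case


verdicts = [
    capacity_verdict(85.0, 40.0),
    capacity_verdict(50.0, 72.0),
    capacity_verdict(30.0, 30.0),
]
```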
OG T	5b9b36f30d	fix(ci)+feat(aiops): cd.yaml shared network + rule_catalog_sync (ADR-090 E3). All checks successful: CD Pipeline / build-and-deploy (push) in 14m31s. CI fix (root cause of the third failure, `c0f3509`): the `c0f3509` log shows 'Detected act task network: (none, will fall back to bridge)' → the ACT_NET grep matched nothing in the CI environment → fallback to bridge → the default bridge does not support container-name DNS → pg-test-b5 failed to resolve. Fix (v3, proactively create a shared network): - B5_NET=b5-test-net (idempotent docker network create) - ci-runner runs docker network connect $HOSTNAME itself - pg-test-b5 --network=$B5_NET - both sides share a user-defined network → container-name DNS works. New rule_catalog_sync_job (2nd service of ADR-090 § Phase 7): + apps/api/src/jobs/rule_catalog_sync_job.py (~230 lines) - run_rule_catalog_sync_loop: 90s startup delay, syncs every 1h - sync_once: HTTP GET {PROMETHEUS_URL}/api/v1/rules (type=alert) - UPSERT alert_rule_catalog (rule_name is UNIQUE) - writes aol only when an INSERT/UPDATE actually happens (avoids N rules polluting the log) + main.py lifespan asyncio.create_task() wire. Expected unlocks: - alert_rule_catalog: 0 → number of active Prometheus rules (~68) - automation_operation_log: new 'rule_created' / 'rule_updated' op_types - E3 Hermes AI finally has a baseline for proposing rule fixes. Refs: ADR-090 §4.2 E3, MASTER §3.3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:08:34 +08:00
OG T	c0f3509d39	fix(drift-card): Drift Diff HTTP 400: accumulate length item by item to avoid cutting HTML. Checks failed: CD Pipeline / build-and-deploy (push) failing after 2m0s. The Commander reported that at 14:18, tapping [View Diff] returned 'Drift Diff 查詢失敗: HTTP error: 400' ("Drift Diff query failed"). Root cause (telegram_gateway.py:2087 _send_drift_diff_detail): - report_id=7ffe78ae has 48 items; the longest single git_value is 1794 characters (an env array) - the accumulated _full far exceeds 4096, so the _full[:3950] truncation runs - the cut can land in the middle of an HTML tag (<code>...) or in the middle of an entity - Telegram parse_mode='HTML' rejects incomplete HTML → 400. Fix: - accumulate length item by item, counting each item as the _block length + 1 - cap at 3800 (4096 - 250 buffer for the header + the '… N more items' notice) - guarantees that _full is always well-formed HTML. Verification: the next time a drift report appears and the Commander taps [View Diff], it should display normally (next cycle of this session). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:26:29 +08:00
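The item-by-item accumulation from commit c0f3509d39 can be sketched as below. Only the 3800 cap and the "+1 per item" accounting come from the commit; the helper names `format_item` and `build_diff_message`, the block markup, and the English wording of the trailing notice are hypothetical.

```python
import html


def format_item(field, value):
    """Render one drift item as a self-contained HTML block."""
    return f"<b>{html.escape(field)}</b>: <code>{html.escape(value)}</code>"


def build_diff_message(blocks, limit=3800):
    """Join whole blocks only, stopping before the limit would be exceeded."""
    kept = []
    used = 0
    for block in blocks:
        cost = len(block) + 1  # +1 for the newline that joins blocks
        if used + cost > limit:
            break  # never slice a block, so no tag is ever cut in half
        kept.append(block)
        used += cost
    text = "\n".join(kept)
    remaining = len(blocks) - len(kept)
    if remaining:
        text += f"\n… {remaining} more items"
    return text


# Five oversized items: only the first three fit under the cap.
blocks = [format_item("env", "x" * 1000) for _ in range(5)]
message = build_diff_message(blocks)
```

Because blocks are kept or dropped whole, every `<code>` in the result has its closing tag, which is the property the original slice-based truncation lacked.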
OG T	ddb902f1ff	fix(ci+aiops): cd.yaml grep set -e bug + new asset_scanner_job (ADR-090). Checks failed: CD Pipeline / build-and-deploy (push) cancelled. CI fix (root cause of the second failure, `b636d3b`): cd.yaml line 161 ACT_NET=$(docker network ls | grep -E '^GITEA-ACTIONS-...') The act runner uses 'bash -e -o pipefail', so when grep matches nothing it exits 1 → the whole step aborts (the earlier `e7ba8cb` failure was an unreachable PG IP, while b636d3b is the grep set -e bug: two different errors). Fix: ACT_NET=$(... | (grep -E '...' || echo "") | head -1) wraps grep in a subshell with || echo "" so that ACT_NET is an empty string on failure. New asset_scanner_job (1st service of ADR-090 § Phase 7): + apps/api/src/jobs/asset_scanner_job.py (~360 lines) - run_asset_scanner_loop: hourly cron, 60s initial delay - scan_once: uses K8sProvider kubectl_get pods --all-namespaces - UPSERT asset_inventory (asset_key is UNIQUE; the same asset_id is reused across runs) - writes a 7-dimension asset_coverage_snapshot for each active asset (default unknown) - writes automation_operation_log(asset_discovered) + main.py lifespan asyncio.create_task() wire. Expected unlocks: - asset_inventory: 0 → hundreds (pods in all namespaces) - asset_discovery_run: 1 row per hour - asset_coverage_snapshot: 7 dims per asset - automation_operation_log: new 'asset_discovered' op_type. The next stage (rule_catalog / capacity / compliance scanner) will be submitted in batches once CI passes. Refs: ADR-090 §4.1, MASTER §3.4 D4, project_blindspot_governance.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:15:45 +08:00
OG T	e7ba8cb181	fix(aiops): connect the AI self-learning chain: verifier switched to await + aol action feedback. Checks failed: CD Pipeline / build-and-deploy (push) failing after 7m29s. The Commander's 2026-04-19 panoramic audit found: - automation_operation_log: 22 rows (all drift_narrator); of the 33 approval actions in 7d, 0 were fed back - incident_evidence.verification_result: 1212 rows, 100% NULL; the verifier never wrote anything - Root cause: _run_post_execution_verify used asyncio.create_task fire-and-forget; when the Pod recycled the task was killed, so verification_result could never be written. Fix (connects the full verifier→learning→Playbook EWMA→finetune chain): approval_execution.py: + _log_aol_started: INSERT aol(playbook_executed, pending) when the main flow starts + _log_aol_completed: at the 4 return points, UPDATE aol to success/failed + duration + stderr └ NO_ACTION / parse_fail / K8s success / K8s failure all leave a trace ~ _run_post_execution_verify in two places (success + failure paths) changed from create_task to await + 60s timeout + on failure, stderr_feed_back is written to result.error → closes the E6 stderr feedback loop. declarative_remediation.py: ~ the _log_remediation_event task is now named and has add_done_callback, so task failures are logged (the original fire-and-forget wrote 0 rows; it is now possible to diagnose why the task died). Expected effect: - aol playbook_executed is visible immediately (33 actions per 7d produce data right away) - incident_evidence.verification_result starts accumulating → the finetune_exporter 7d cron finally has material - Playbook EWMA trust_score starts changing dynamically - stderr_feed_back connected → failure signals feed back into retry / negative Playbook reinforcement. No impact: - background_task runs in the background; the extra 60s delay does not block the API - aol write failures only call logger.warning and do not block the main execution flow. Refs: MASTER §3.1 L6×D1 (ADR-081 PostExecutionVerifier), MASTER §3.4 D4 (ADR-083 learning loop), ADR-090 monitoring blind-spot governance (2026-04-18 panoramic audit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 12:07:29 +08:00
OG T	2abc91e360	fix(drift-card): fix 2 drift card bugs: AI verdict copy styling + Diff button AttributeError. All checks successful: CD Pipeline / build-and-deploy (push) in 13m8s. Bug 1: tapping "🔍 View Diff" failed. Error: 'DriftReportRepository' object has no attribute 'get_by_id'. Root cause: the DriftReportRepository method is named get(), while every other repo uses get_by_id(). Fix: add a get_by_id() alias to align with the repo interface convention. Bug 2: the AI verdict was rendered as a code block with a copy button. Root cause: telegram_gateway line 1962 wrapped diff_summary in <pre>, but diff_summary is an AI narrative + emoji list, not code. Fix: remove <pre> and display it as plain text with a separator line + html.escape. Acceptance: - next drift card: the AI verdict paragraph is plain text (no purple code block + copy button) - tapping "🔍 View Diff" → sends the full diff detail (no AttributeError). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 11:27:13 +08:00
OG T	4b8be32610	fix(telegram+approval): TG-1 + AP-1/2/3: 4 Telegram UX fixes. Checks failed: CD Pipeline / build-and-deploy (push) failing after 25m27s; Ansible Lint / lint (push) cancelled. 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M). ## TG-1: add view to INFO_ACTIONS. security_interceptor.py: the 'view' button now uses the 2-part read format and no longer wrongly triggers the 4-part nonce write format. ## AP-1: persist approval_records.telegram_message_id. After telegram_gateway.send_approval_card sends successfully, it UPDATEs approval_records SET telegram_message_id, telegram_chat_id at the DB layer (not only in Redis, so the original card can still be found after a Pod restart). ## AP-2: edit the original card when approval execution finishes + KM/Playbook delta. approval_execution._push_execution_result_to_alert now replies to the original card and also calls editMessageReplyMarkup to remove the buttons (fixes cards stuck on "executing" forever). - also queries knowledge_entries/playbooks for additions within the last 2min and appends "📚 KM +N 🎯 Playbook updated ×M" to the message - success: ✅ execution succeeded + action + KM delta - failure: ❌ execution failed + reason + KM delta. ## AP-3: normalize primary_responsibility to lower the "❓ unknown" ratio. openclaw._parse_analysis_result: if the LLM leaves the field empty, None, or outside the whitelist (FE/BE/INFRA/DB/COLLAB), force a fallback: INFRA if a kubectl keyword is present, otherwise BE. Previously only "not in data" was checked, so None or an empty string slipped through. ## Skipped: TG-3 (refactor) + TG-5 (the webhook is a deprecated endpoint; the design uses Long Polling) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:15:58 +08:00
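OG T	68a42a3c97	fix(openclaw): hallucination validation on both paths + extract a shared helper. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M). Root cause: the Python hallucination validator from commit `7e9448f` was only installed in `analyze_alert` (webhook path), but the incident sweeper goes through `generate_incident_proposal` (line 1552), which had no validation → at 00:23 a PostgreSQLDiskGrowthRate card showed the "deployment/awoooi-prod" hallucination without being blocked. Fix: 1. extract the shared method `_validate_deployment_inventory(result, inventory, ns)` 2. `analyze_alert` (line 1322 area) calls this helper; the original inline logic is removed 3. `generate_incident_proposal` (line 1552) fetches the inventory dynamically + calls the helper 4. helper additions: - result.action_title = '[安全降級] 調查 {ns} 真實資源狀態' ("[Safe downgrade] investigate the real resource state of {ns}"; previously only description was changed and action_title was not, so the DB action column kept the old text) - every field assignment is wrapped in try/except so that a single failing field does not affect the others. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:11:09 +08:00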
OG T	fdce0a3ab9	fix(approval): NO_ACTION no longer mislabeled EXECUTION_FAILED (fixes MASTER §7.1 #11). Checks failed: CD Pipeline / build-and-deploy (push) cancelled. 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M). Root cause: approval.action='NO_ACTION - 待分析' ("pending analysis", a downgrade product of the hallucination validator) goes into parse_operation_from_action → operation_type=None → background_execution_skip → update_execution_status(success=False) → marked EXECUTION_FAILED. KPI pollution: MASTER §7.1 #11 auto_execute success rate = EXECUTION_SUCCESS / (SUCCESS+FAILED). NO_ACTION should never count as a failure, yet it was counted and dragged the metric down. A large share of the measured 30d success rate of 0.9% came from NO_ACTION mislabeling. Fix: when parsing fails, first check whether the action is NO_ACTION-like (contains NO_ACTION/OBSERVE/INVESTIGATE or similar keywords) → take a dedicated noop branch: - log event=background_execution_noop (info level) - update_execution_status(success=True) → EXECUTION_SUCCESS - timeline marks ✅ pure-observation action complete - reply to the original alert card showing success - return True. Real parse failures (not NO_ACTION) keep the original failure path but now include error_message (P0.2 extension), so rejection_reason holds "Could not parse operation type from action: <action>" instead of an empty string. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:08:16 +08:00
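The noop check that commit fdce0a3ab9 places in front of the failure path might look like the sketch below. The commit names three keywords and says "or similar", so the keyword tuple here is a partial, assumed list; the function name and the case-insensitive match are also assumptions.

```python
# Partial keyword list: the commit names these three and implies there are more.
NOOP_KEYWORDS = ("NO_ACTION", "OBSERVE", "INVESTIGATE")


def is_noop_action(action):
    """True when an unparseable action is a pure-observation action."""
    if not action:
        return False
    upper = action.upper()
    return any(keyword in upper for keyword in NOOP_KEYWORDS)


noop = is_noop_action("NO_ACTION - 待分析")
real = is_noop_action("kubectl rollout restart deployment/awoooi-api")
```

A caller would report success for `noop` actions and keep the existing failure path, with an error message, for everything else.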
OG T	2e988bdb81	fix(telegram): post the drift execution result back to the card + audit log user_id. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. The IDE flagged _stamp as unused (the result was never sent) and user_id as unused (an audit gap). Fix: 1. _edit_drift_card_outcome not only removes the buttons but also sends a sign-off stamp message (reply_to the original card if msg_id exists), format: ✅ Adopted by @username (success) Drift <report_id> 2. _handle_drift_action adds a drift_callback_dispatched log (audit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:07:13 +08:00
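OG T	877c8479e0	fix(telegram): TG-2 + TG-4 fix the drift button black hole. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M). The Commander's screenshot showed the problem directly: tapping "View Diff" turned the card into "executing", and the remaining 21 items were not visible. A panoramic review found 9 Telegram subsystem bugs; this commit fixes the 2 most painful ones: ## TG-2: the 3 buttons drift_view/drift_adopt/drift_revert had no handler. Tap → fallthrough → UX black hole / wrongly triggered approve path. Fix: handle_callback adds a Step 1.85 offroute after the state guard (after line 2752): the 3 drift_* actions are handled exclusively by _handle_drift_action and bypass the nonce approve/reject dispatch, so the execution flow is not triggered by mistake. The 3 button implementations: - drift_view: reads drift_reports → sends a new message showing all items (HIGH/MEDIUM/INFO emoji + Git vs K8s original values side by side, capped at 50 items / 4000 characters) - drift_adopt: calls drift_adopt_service.adopt_drift() - drift_revert: calls drift_remediator.revert(). ## TG-4: the drift card message_id was not stored in Redis → the card could not be edited afterwards. Fix: after send_drift_card succeeds, setex f"tg_drift:{incident_id}" with TTL 24h, so that _edit_drift_card_outcome can update the original card after adopt/revert runs (first remove the buttons, then add the "XX by @username (success/failure)" sign-off stamp). ## Not included (follow-up): TG-1 INFO_ACTIONS extension (view): next commit. TG-3 duplicate handler dispatch: under evaluation. TG-5 Bot webhook URL not set: needs the Commander's decision on a public URL. approval card NO_ACTION mislabeled FAILED: next commit. approval card contradictory description / unknown responsibility / post-execution edit. The full 9-bug list is in project_phase7_round3_telegram_subsystem_audit (to be created). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:06:30 +08:00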
OG T	98aef55b31	feat(kpi): ADR-090-D MASTER §7.1 North Star KPIs: all 5 broken links fixed. Checks failed: CD Pipeline / build-and-deploy (push) successful in 11m49s; run-migration / migrate (push) failing after 15s. 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M). Benchmarking the 15 MASTER §7.1 North Star KPIs against real measurements found 5 broken links: #3 fine-tune JSONL /week: the finetune_exports table does not exist. #4 MCP calls/24h: timeline_events has no mcp_call event_type. #6 Declarative remediation usage rate: the remediation_events table does not exist. #7 general fallback 17.3%: classify_alert_early misses 5 categories. #10 notification_outcomes /week: the table does not exist. This commit fixes all of them. ## 1. Migration: adr090d_kpi_data_sources.sql (3 tables) - finetune_exports: P3 fine-tune JSONL tracking - remediation_events: P5 declarative remediation tracking - notification_outcomes: notification quality + RLHF corpus. Idempotent (CREATE TABLE IF NOT EXISTS), already applied to prod. ## 2. classify_alert_early gains 4 category rules (reduces the general fallback) - test interception: Test/FPTest/FingerprintTest/ADR089Test/L4Closure/FreshUniq* → category='test', TYPE-1 notification only - HighCPU/Memory/Disk/Load → host_resource - TLS/SSL/ProbeFailure* → ssl_cert - PostgreSQL/MySQL/MongoDB/DiskGrowthRate → database. Expected: general 17.3% → 3-5% (target <10%). ## 3. finetune_exporter DB write: _run_export() writes one finetune_exports row at the end, with checksum/size/record_count. ## 4. declarative_remediation DB write: after evaluate(), a fire-and-forget _log_remediation_event() writes remediation_events (status='pending'; remediation_type is derived from the tier as declarative/imperative/gitops_pr). ## 5. telegram_gateway DB write (send_approval_card): after _send_request successfully returns a message_id, one notification_outcomes row is written, with channel='telegram', delivery_status='delivered|failed'. Later, when a human taps a button, user_action is updated → high-value RLHF training material. ## 6. pre_decision_investigator MCP call tracking: the finally block of _call_single_tool() writes timeline_events with event_type='mcp_call', including provider/tool/status/duration_ms/error. MCP calls within 24h become measurable via SQL. ## Expected quantitative improvement (KPI: before the fix → expected 24h after the fix): #3 fine-tune /week: 0 (table does not exist) → >=10 (weekly cron run); #4 MCP calls/24h: 0 → >0 (real runs write to timeline); #6 declarative usage rate: table does not exist → data present (pending/success/failed distribution); #7 general fallback: 17.3% → <10%; #10 notification_outcomes: 0 → one row per approval card. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:00:31 +08:00
OG T	898145d68e	refactor(openclaw): SuggestedAction uses a top-level import (avoids a triple-nested inline import). Checks failed: CD Pipeline / build-and-deploy (push) successful in 11m17s; Ansible Lint / lint (push) cancelled. The IDE raised a false positive on the inline "from src.models.ai import" (it ran fine). Moving it to a top-level import satisfies the IDE and is also more Pythonic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:28:19 +08:00
OG T	e6e484c1dc	fix(openclaw): import path fix: src.models.ai (not openclaw_schema). Some checks pending: CD Pipeline / build-and-deploy (push) has started running. A bug the IDE caught correctly (not a false positive): SuggestedAction lives in src/models/ai.py. _SA.NO_ACTION can now downgrade correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:26:45 +08:00
OG T	7e9448f6d0	fix(openclaw): two-layer defense against hallucinated deployment names: Prompt + Python validator. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M). Production incident (approval f763bedf, 22:58): - Alert: KubePodCrashLooping, labels.deployment="awoooi-api" - NEMOTRON, despite receiving the inventory "awoooi-api, awoooi-web, awoooi-worker", still output kubectl_command="kubectl rollout restart deployment/awoooi-prod" (mistaking the namespace for a deployment name) - Execution result: "Deployment 'awoooi-prod' not found in namespace 'awoooi-prod'". ## Layer 1: NEMOTRON_SYSTEM_PROMPT hardening (prompts.py). New "🔒 DEPLOYMENT NAME RULE (STRICTLY ENFORCED)" block: - namespace NEVER is a deployment name - "awoooi-prod" is a NAMESPACE; never write deployment/awoooi-prod - if an inventory is given, the deployment must be an exact match - prefer labels.deployment; unknown → NO_ACTION. ## Layer 2: Python post-validation (openclaw.py:1322+). After the LLM response is parsed, a regex extracts the deployment name and checks it against _k8s_inventory: - in the list → pass - not in the list → downgrade: * kubectl_command → "kubectl get deploy -n {ns}" (investigation only) * suggested_action → NO_ACTION * target_resource → "unknown(hallucinated)" * confidence → 0.0 * description gets a [安全降級] (safe downgrade) note and lists the valid inventory - logs 'openclaw_deployment_hallucination_detected'. Effect: even if the LLM ignores the prompt, the Python layer blocks it. A destructive kubectl command is never executed against a nonexistent deployment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:26:09 +08:00
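Layer 2 of commit 7e9448f6d0 (regex extraction plus an inventory check) can be sketched as follows. The downgrade values are the ones listed in the commit; the regular expression, the function name, and the returned dict shape are assumptions, and the pattern only covers the `deployment/<name>` form, not `deploy/<name>`.

```python
import re

# Assumed pattern: matches the 'deployment/<name>' form seen in the incident.
DEPLOYMENT_RE = re.compile(r"deployment/([a-z0-9][a-z0-9.-]*)")


def validate_deployment(kubectl_command, inventory, namespace):
    """Downgrade the command when it targets a deployment not in the inventory."""
    match = DEPLOYMENT_RE.search(kubectl_command)
    if match is None or match.group(1) in inventory:
        return {"kubectl_command": kubectl_command, "downgraded": False}
    return {
        "kubectl_command": f"kubectl get deploy -n {namespace}",  # read-only
        "suggested_action": "NO_ACTION",
        "target_resource": "unknown(hallucinated)",
        "confidence": 0.0,
        "downgraded": True,
    }


inventory = {"awoooi-api", "awoooi-web", "awoooi-worker"}
bad = validate_deployment(
    "kubectl rollout restart deployment/awoooi-prod", inventory, "awoooi-prod"
)
good = validate_deployment(
    "kubectl rollout restart deployment/awoooi-api", inventory, "awoooi-prod"
)
```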
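OG T	6ad73b4834	fix(flywheel): three fixes for L5/L6 broken links: RBAC expansion + failure reasons persisted + verifier also runs on failure. All checks successful: CD Pipeline / build-and-deploy (push) in 11m6s. 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M). A panoramic flywheel diagnosis exposed 3 real broken links: - L5 execution 30d: EXECUTION_FAILED 216 / EXECUTION_SUCCESS 2 (99% failure rate) - L6 verification 7d: verification_result all NULL (none of the 988 evidence rows were verified) - every rejection_reason / error_message column empty (impossible to diagnose). Root cause: the awoooi-executor ServiceAccount has insufficient RBAC; every kubectl get nodes/HPA in executor.py returns Forbidden, so not even evidence can be gathered, every later repair fails, the verifier never triggers because execution never succeeds, and the evidence verification result stays NULL forever. One RBAC fix resolves all 3 nodes. ## P0.1 RBAC expansion (k8s/awoooi-prod/07-rbac.yaml): new cluster-scope read permissions (list/get/watch only, zero writes): - nodes + nodes/status (required for evidence gathering) - horizontalpodautoscalers (HPA status) - metrics.k8s.io: nodes + pods (resource metrics) - statefulsets + daemonsets (full workload view). Already applied with kubectl and smoke-tested: kubectl get nodes works. ## P0.2 always write rejection_reason on failure (approval_db.py): update_execution_status() gains an error_message parameter; on failure it is written to rejection_reason (truncated to 2000 characters) → later diagnosis has something to go on. The approval_execution.py caller is updated too, so result.error is passed all the way into the DB. ## P0.3 verifier also runs on failure (approval_execution.py): old logic: the verifier was called only when result.success=True → it never ran under a 99% failure rate. New logic: the failure path also runs the verifier via create_task, with ":FAILED" appended to action_taken as a marker. The verifier captures post_state and writes verification_result='failed' back to incident_evidence. L7 learning now has failure samples to learn from, and the 2x negative decay of playbook trust actually takes effect. Expected effect: - the EXECUTION_FAILED rate should drop from 99% to <30% within 30d - the incident_evidence.verification_result NULL rate should drop from 100% to <10% - the approval_records.rejection_reason fill rate should rise from 0% to 100%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 20:12:57 +08:00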
OG T	b0d560dbb3	fix(drift-narrator): shortener uses replace: tolerate the LLM hallucinating a 'Resource/Name:' prefix. All checks successful: CD Pipeline / build-and-deploy (push) in 10m50s. 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7. In round 4 the LLM itself prepended a resource identifier to the field: 'Deployment/awoooi-web: spec.template.spec.containers', which defeated the startswith-based shortener (the prefix was no longer at the start). Defensive fix: if startswith does not match → fall back to replace, removing the prefix at any position. Result: 'Deployment/awoooi-web: containers' ✅ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 18:12:15 +08:00
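The startswith-then-replace behavior described in commit b0d560dbb3 can be sketched as below. The three prefixes come from the earlier commit b63aed72df; the function body is a guess at the shape, not the repository's code. One caveat of this sketch: the replace fallback also matches 'spec.' inside a longer word, so a stricter version might require a word boundary.

```python
# Longest prefix first, so 'spec.template.spec.' wins over 'spec.'.
PREFIXES = ("spec.template.spec.", "spec.template.", "spec.")


def shorten_field_path(path):
    """Strip a known spec prefix, at the start or anywhere in the string."""
    for prefix in PREFIXES:
        if path.startswith(prefix):
            return path[len(prefix):]
    # Fallback: the LLM may have prepended 'Kind/name: ' before the prefix.
    for prefix in PREFIXES:
        if prefix in path:
            return path.replace(prefix, "", 1)
    return path


plain = shorten_field_path("spec.template.spec.volumes")
prefixed = shorten_field_path("Deployment/awoooi-web: spec.template.spec.containers")
untouched = shorten_field_path("metadata.labels")
```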
OG T	b63aed72df	fix(drift-narrator): strip the spec.template.spec. prefix: fixes ugly Telegram auto-wrap layout. All checks successful: CD Pipeline / build-and-deploy (push) in 12m1s. 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7. The Commander's third live-fire visual report: the field name 'spec.template.spec.volumes' is 24 characters; adding the emoji + ': ' + summary exceeds the visual width of a Telegram <pre>, and the auto-wrap separates the emoji from the field name, leaving it alone on its own line. Fix: _shorten_field_path() strips 3 common prefixes: - 'spec.template.spec.' → '' - 'spec.template.' → '' (fallback) - 'spec.' → '' (fallback). Before/after: before: '🟡 spec.template.spec.affinity.podAntiAffinity.preferredDuringS: [list of 3 items]' after: '🟡 affinity.podAntiAffinity.preferredDuringS: [list of 3 items]' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 17:10:20 +08:00
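OG T	f3960f36d2	fix(drift-narrator): stronger fallback: label K8s default-value backfill + count actionable items separately. All checks successful: CD Pipeline / build-and-deploy (push) in 10m37s. 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 (1M). The Commander's live-fire test report: the card showed "securityContext: (unset) → {object with 0 fields}", which is meaningless. Root cause: _fallback_items treated the noise of "K8s controller auto-filling empty objects" as real changes and output it. In addition, the "29 more items" count included allowlisted + trivial items. 3 fixes: 1. _is_trivial_drift(), a new predicate: None/empty string/{}/[]/false/0 etc. are treated as mutually "no substantive change", which catches the K8s controller auto-fill case 2. _summarize_item() replaces the original smart_shorten - trivial → "K8s default-value backfill (no substantive change)" - None → value → "added xxx" - value → None → "deleted (was: xxx)" - otherwise → "from → to" 3. _fallback_items() improvements - sort by level (HIGH first) - filter the allowlist + HPA allowlist first 4. _count_nontrivial_drift() + Telegram rendering - new "actionable" count (excluding allowlist + trivial) - "N more items" uses the actionable count and no longer misleads - when items is empty, shows "all allowlisted or default-value backfill". Expected effect: before: "... 29 more items" (only 1 was real drift) now: "... 0 more items" or "(all allowlisted or default-value backfill)" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:29:49 +08:00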
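The trivial-drift predicate from commit f3960f36d2 could be sketched like this. The commit lists None, empty string, {}, [], false and 0 followed by "etc.", so the tuple below is the stated subset only; the helper names are hypothetical.

```python
# The empty-like values named in the commit; the real list may be longer.
EMPTY_VALUES = (None, "", {}, [], False, 0)


def is_empty_like(value):
    """True for values that carry no configuration (None, '', {}, [], False, 0)."""
    return any(
        type(value) is type(empty) and value == empty for empty in EMPTY_VALUES
    )


def is_trivial_drift(git_value, k8s_value):
    """A drift is trivial when both sides are empty-like, e.g. None vs {}."""
    return is_empty_like(git_value) and is_empty_like(k8s_value)


backfill = is_trivial_drift(None, {})                 # controller filled in {}
real_change = is_trivial_drift(None, {"runAsUser": 1000})
```

Comparing types as well as values keeps `0.0` or `"0"` from being treated as empty by accident; whether the original code is that strict is not stated in the commit.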
OG T	1606093dd2	fix(drift-narrator): two hotfixes: NEMOTRON wrapper parsing + tags asyncpg type. Checks failed: CD Pipeline / build-and-deploy (push) cancelled. 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 (1M). A live-fire test (report_id=80a34b58) exposed two bugs: ## Bug 1: LLM JSON swallowed by the NEMOTRON wrapper. Root cause: when openclaw.call() routes through NEMOTRON, it forces a {description,...} structure in the response, so the {narrative, items} structure my prompt asks for cannot get through. (Same root cause as the bare-JSON problem hit earlier in `1ff3405`.) Fix: three-path fallback parsing - Path 1: our {narrative, items} directly (Ollama, or a well-behaved LLM) - Path 2: NEMOTRON wrapper, with our structure as nested JSON inside description - Path 3: description is plain narrative → use it as narrative + Python fallback_items. ## Bug 2: tags parameter asyncpg DataError. Root cause: the literal string '{drift,type4d,llm_summary}' was passed, but asyncpg requires a Python list: '(a sized iterable container expected (got type str))'. Fix: pass tags as the Python list ['drift','type4d','llm_summary'] and remove CAST AS text[]; asyncpg infers text[] automatically. Live-fire results: - narrative ✅ generated (fallback path) - items ⚠️ only 1 (NEMOTRON did not emit our structure) - DB write ❌ tags type error - Telegram ✅ sent (fallback content, but visually OK). Expected after this commit: - LLM responses go through Path 2/3 → narrative + Python fallback items (5 smart summaries) - DB write succeeds → both automation_operation_log and ai_collaboration_trace have records - if the LLM later learns Path 1 (giving us {narrative, items}), the output upgrades automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:26:17 +08:00
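The three-path fallback parsing in commit 1606093dd2 can be sketched as follows. The function name, the `(narrative, items, path)` return shape, and the convention that `items` is `None` when the caller should build Python fallback items are all assumptions made for illustration.

```python
import json


def parse_narrator_response(raw):
    """Return (narrative, items, path); items is None when a fallback is needed."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return str(raw), None, 3  # not JSON at all: treat as plain narrative
    # Path 1: the model returned our {narrative, items} structure directly.
    if isinstance(data, dict) and "narrative" in data:
        return data["narrative"], data.get("items", []), 1
    description = data.get("description") if isinstance(data, dict) else None
    if isinstance(description, str):
        # Path 2: the wrapper's description field holds our structure as JSON.
        try:
            inner = json.loads(description)
        except ValueError:
            inner = None
        if isinstance(inner, dict) and "narrative" in inner:
            return inner["narrative"], inner.get("items", []), 2
        # Path 3: description is plain prose.
        return description, None, 3
    return str(raw), None, 3


direct = parse_narrator_response(json.dumps({"narrative": "n1", "items": ["a"]}))
wrapped = parse_narrator_response(
    json.dumps({"description": json.dumps({"narrative": "n2", "items": []})})
)
prose = parse_narrator_response(json.dumps({"description": "3 fields changed"}))
```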
OG T	a156566b17	feat(drift-narrator): ADR-090-C L4 audit loop: persist the notification_formatted op Some checks failed CD Pipeline / build-and-deploy (push) Successful in 10m47s run-migration / migrate (push) Failing after 14s 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 (1M) Enforcing the architectural iron rule: "An AI decision that was not recorded never happened." Every time drift_narrator calls the LLM to generate a summary, it must be written in full to automation_operation_log + ai_collaboration_trace, forming the L4 audit trail + RLHF corpus. This commit does two things: 1. apps/api/migrations/adr090c_notification_formatted_op_type.sql - extend the automation_operation_log.operation_type CHECK with 'notification_formatted' - idempotent DROP + ADD CONSTRAINT pattern - already applied to prod as awoooi (table owner) and verified 2. apps/api/src/services/drift_narrator_service.py - add _log_ai_action_to_db() for the DB audit write - called at the end of _generate_narrative_and_items() (on both success and fallback) - automation_operation_log: * operation_type='notification_formatted' * actor='drift_narrator' * input = {report_id, namespace, counts, items_scanned} * output = {narrative, items, items_count} * duration_ms, tags=['drift','type4d','llm_summary'] * parent_op_id looks up the alert_fired chain (future drift → alert correlation) - ai_collaboration_trace: * agent='drift_narrator', model=provider (ollama / nemotron / etc.) * prompt (capped at 8000 characters) + response (JSONB) * accepted = flag for successful LLM JSON parsing (future RLHF training material) - error handling: the DB write is wrapped in try/except and never breaks the main Telegram notification flow P2.4 event correlation: - SELECT the parent op via input->>'report_id' or 'drift_report_id' - if found, bind parent_op_id (forming an alert_fired → notification_formatted trace chain) - drift itself currently does not go through alert_fired, so parent is NULL (until the chain is connected) P2.5 RLHF corpus: - ai_collaboration_trace records with accepted=true are the "LLM parsed successfully" samples - later, when the Commander presses Telegram [✅ Accept change] / [⏪ Roll back], the matching trace can also update an outcome flag, forming a complete human-in-the-loop corpus Technical details: - get_db_context() auto-commits (src/db/base.py:128), so no manual commit is needed - prompt is at most 8000 characters (a typical drift is about 2-3k) - raw_response keeps its first 500 characters in the trace.response JSON Related: - feedback_ai_autonomous_direction.md L4 north star - feedback_secrets_leak_incidents_2026-04-18.md L1-L4 layering - ADR-090 11 neural-network tables - commit fb88512 (Plan B visual layer) The IDE may report that src.db.base cannot be found; that is a false positive (drift_repository.py uses the same path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:04:23 +08:00
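OG T	fb88512fcb	fix(drift-narrator): Plan B LLM-driven smart summary: eliminate brute-force str()[:30] truncation Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 (1M) Root cause: _format_drift_summary() called str()[:30] directly on dict/list-typed git_value/actual_value, producing garbage such as "[{'name': 'repair-ssh-key', 's" with a dict key cut in half, which contradicts the "AI autonomy" principle. Plan B architecture decision: "Drop the hard-coded Python string-parsing logic. Feed the raw Config Diff structure to Hermes/NemoTron as context, use the prompt to fix the output format, and let the LLM digest it and emit a human-readable Top 5 summary with red/yellow indicators." Implementation: 1. _NARRATIVE_PROMPT rewritten: requires the LLM to return {narrative, items[]} JSON - drift items are JSON-serialized into the prompt (keeping 200 characters of context) - items limited to 5, HIGH first - summary is 30 characters of colloquial Traditional Chinese (not a technical repr) 2. _generate_narrative_and_items(): new method that parses the LLM JSON and validates its structure 3. _format_drift_for_llm(): new method producing structured JSON for the LLM (replaces the old str version) 4. _render_telegram_body(): new method that assembles a clean Telegram card. Sample output: 🤖 AI assessment <4-5 lines of LLM narrative> 📊 Drift details (HIGH: 1 \| MEDIUM: 29) 🔴 spec.template.spec.volumes: added 2 repair-ssh-key mounts 🟡 spec.template.spec.serviceAccount: (unset) → awoooi-executor ... 27 more items (press 🔍 View Diff) 5. Fallback hardening: _smart_shorten() + _fallback_items(). When the LLM fails, use a type-aware Python summary (dict/list show their size, no brute-force repr). Removed: - _format_drift_summary(): the old brute-force truncation - _generate_narrative(): the old interface that returned only a string. Kept: - _fallback_narrative() / _format_intent_summary(): still useful - Redis cache / trigger conditions / DB update: logic unchanged. MVP stage: this commit only changes the visual presentation and does not touch the automation_operation_log / ai_collaboration_trace audit writes. The DB audit will be added in Phase 2 once the Telegram visuals are verified. Related: - feedback_ai_autonomous_direction.md north-star principle - this morning's raw-JSON hotfix `1ff3405` (fixed narrative only, not items) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 15:54:16 +08:00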
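OG T	1ff3405755	fix(drift-narrator): fix raw JSON leak: parse the description field from the NEMOTRON reply All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m44s Root cause: after openclaw.call() is routed through NEMOTRON the output is forced to JSON (NEMOTRON_SYSTEM_PROMPT iron rule), but _generate_narrative expected plain text → the whole JSON blob was dumped raw into the Telegram <pre> block. Fix: after receiving text, try to parse it as JSON first - success → take description / action_title / reasoning, in that priority order - failure (not JSON) → use the text as-is (backward compatible with plain-text replies from Ollama qwen). Effect: the Telegram Config Drift card shows a readable Traditional Chinese summary instead of raw JSON. 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 01:08:32 +08:00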
OG T	4f2e122fd2	fix(openclaw): Checkpoint-2 webhook path K8s inventory injection: prevent NemoTron from hallucinating awoooi-service All checks were successful CD Pipeline / build-and-deploy (push) Successful in 11m39s Root cause: on the webhook path (analyze_alert) NemoTron has no cluster context → it guesses deployment/awoooi-service → kubectl not found → EXECUTION_FAILED → trust score stays at 0 forever. Fix: - analyze_alert() Step 0.5: call _fetch_k8s_inventory_for_openclaw() to fetch the real Deployment list - inject a "🔒 Actual cluster resource inventory" section into full_prompt, forcing the LLM to pick resource names from the list - on failure/timeout → return an empty string → inject a warning hint; the main flow is not interrupted - the available_len calculation now includes the k8s_section length to prevent 4K truncation. Impact: - the Solver Agent path (solver_agent.py) was already fixed in `cf50a5c` - this commit fixes the Alertmanager webhook path (analyze_alert → NemoTron) - both paths are now K8s-aware, so the LLM no longer hallucinates resource names ADR-082: Phase 2 multi-agent collaboration 2026-04-17 ogt + Claude Sonnet 4.6 (Checkpoint-2 webhook path completion) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 00:53:27 +08:00
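OG T	cf50a5ce25	fix(solver+execution): Checkpoint-1 false-success fix + Checkpoint-2 K8s environment awareness All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m55s ## Checkpoint-1: eliminate false success - approval_execution.py: execute_approved_action now returns bool (it used to return None, so callers could not tell whether K8s accepted the command) - decision_manager.py auto-execute path: use _exec_success instead of the hard-coded success=True. Fix: when K8s rejects a command, send ❌ correctly instead of "✅ auto-remediation complete" ## Checkpoint-2: K8s environment awareness (inventory pre-flight) - solver_agent.py: add _fetch_k8s_inventory(): before generating kubectl commands, run kubectl get deployments,statefulsets -n awoooi-prod and inject the real name list into the Solver prompt; the LLM must choose from the list, which prevents hallucinations (awooiii-api with three i's) - on a 5s timeout or failure → return ""; the prompt shows a warning but the main flow is not interrupted Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 23:08:23 +08:00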
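OG T	cbb719b4a1	fix(decision_manager): ADR-091 hotfix: fix the zombie-gate logic hole from d5dbfc9 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 11m9s The gate condition `not action.strip()` introduced in d5dbfc9 evaluates to False when action="待分析" ("pending analysis", a non-empty string), so the gate fails and zombie cards still get broadcast. Root cause: the c759b4e P1 fix made suggested_action fall back to "待分析" instead of "", which rendered the original empty-string check useless. Fix: switch to a set membership test, `_action_text in {"", "待分析", "NO_ACTION", "待分析 - 系統自動保護"}`, which covers every known failure-state token and fully closes the zombie-card broadcast path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 22:44:53 +08:00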
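The difference between the old emptiness check and the set-based gate can be illustrated with a short sketch. The token set is copied from the commit message; the names `FAILED_ACTION_TOKENS` and `should_broadcast`, and the handling of `None` and surrounding whitespace, are assumptions.

```python
from typing import Optional

# Known failure-state tokens, as listed in the commit message.
FAILED_ACTION_TOKENS = frozenset(
    {"", "待分析", "NO_ACTION", "待分析 - 系統自動保護"}
)


def should_broadcast(action: Optional[str]) -> bool:
    """Gate: broadcast only when the action is not a known failure token."""
    return (action or "").strip() not in FAILED_ACTION_TOKENS


# The old check only caught the empty string, so the placeholder slipped through:
old_gate_blocks = not "待分析".strip()  # False: the old gate did not block it
```

A set of sentinels has to be kept in sync with every fallback string the pipeline can emit, which is the trade-off of this approach compared with a dedicated status field.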
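OG T	af2adb5b96	fix(telegram): ADR-091 forbid broadcasting "待分析" (pending analysis) zombie cards when Agent Debate analysis fails All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m51s Root cause: GET /incidents triggers Phase 2 Agent Debate → every LLM fails → description="待分析" + action="" → a new Telegram card is broadcast every few minutes → alert fatigue (the deadliest killer for SREs). Architectural defect (anti-pattern): a GET request (a read operation) produces an external broadcast side effect → violates RESTful principles. Fix (_push_decision_to_telegram): add a gate after the DB update completes and before the Telegram push: description="待分析" AND action="" → exit silently, never broadcast. ADR-091 iron rule: only an Alertmanager webhook POST (a real new alert) may trigger a Telegram broadcast; a failed Agent Debate analysis → silent DB update that does not pollute the channel. 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 22:26:35 +08:00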
OG T	604d8eea37	fix(schema-drift): bring prompts.py + the Claude API schema enum in sync (ADR-090) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 12m27s Problem: `fe77e6d` extended the models/ai.py enum to 8 values, but three places were not updated: 1. core/prompts.py L77: missing INVESTIGATE, OBSERVE 2. core/prompts.py L176 (NEMOTRON_SYSTEM_PROMPT): missing APPLY_HPA, INVESTIGATE, OBSERVE 3. openclaw.py L564 (_call_claude tools schema): still constrained to the old 4-value enum. Impact: the LLM did not know it could output INVESTIGATE/OBSERVE and could only pick from the old 4 values. Fix: align all three places on the 8 suggested_action values RESTART_DEPLOYMENT\|DELETE_POD\|SCALE_DEPLOYMENT\|APPLY_HPA\|TUNE_RESOURCES\|INVESTIGATE\|OBSERVE\|NO_ACTION Closes: ADR-090 Prompt-Model three-layer sync iron rule 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 22:10:18 +08:00
OG T	fe77e6d297	fix(ai): extend the SuggestedAction enum + Pydantic fallback guard Some checks failed CD Pipeline / build-and-deploy (push) Successful in 10m48s Type Sync Check / check-type-sync (push) Failing after 2m52s Root cause: NemoTron outputs "investigate" → Pydantic accepts only 4 values → validation blows up → openclaw_analysis_parse_failed → analysis_result=None → every fallback card shows "待分析" (pending analysis). Fix: 1. SuggestedAction enum gains INVESTIGATE/OBSERVE/APPLY_HPA/TUNE_RESOURCES (prompt.py listed 6 values while the enum had only 4; the prompt/model mismatch was the root cause) 2. normalize_suggested_action validator: uppercase + alias mapping + unknown values fall back to NO_ACTION, so no LLM output can make Pydantic fail and leave analysis_result = None Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 21:36:36 +08:00
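The normalization step (uppercase, alias mapping, unknown values fall back to NO_ACTION) can be sketched without Pydantic as a plain function. The eight enum members are the ones named in the neighbouring schema-sync commit; the alias table and the separator handling are hypothetical, since the commit does not list the actual aliases.

```python
from enum import Enum
from typing import Any


class SuggestedAction(str, Enum):
    RESTART_DEPLOYMENT = "RESTART_DEPLOYMENT"
    DELETE_POD = "DELETE_POD"
    SCALE_DEPLOYMENT = "SCALE_DEPLOYMENT"
    APPLY_HPA = "APPLY_HPA"
    TUNE_RESOURCES = "TUNE_RESOURCES"
    INVESTIGATE = "INVESTIGATE"
    OBSERVE = "OBSERVE"
    NO_ACTION = "NO_ACTION"


# Hypothetical aliases for illustration only; the real mapping is not in the log.
_ALIASES = {
    "RESTART": "RESTART_DEPLOYMENT",
    "SCALE": "SCALE_DEPLOYMENT",
    "NONE": "NO_ACTION",
}


def normalize_suggested_action(raw: Any) -> SuggestedAction:
    """Map arbitrary LLM output to a valid enum member, never raising."""
    if not isinstance(raw, str):
        return SuggestedAction.NO_ACTION
    key = raw.strip().upper().replace("-", "_").replace(" ", "_")
    key = _ALIASES.get(key, key)
    try:
        return SuggestedAction(key)
    except ValueError:
        return SuggestedAction.NO_ACTION
```

In a Pydantic model the same function body would typically sit in a "before" validator on the `suggested_action` field, so that enum validation only ever sees a known value.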
OG T	c759b4eeab	fix(webhook+decision): ADR-089 async webhook + timeout dirty-data fix All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m16s P0: async webhook (ADR-089): - on receiving an alert from Alertmanager, reply 202 immediately instead of waiting synchronously up to 90s for the LLM - add _process_new_alert_background(): LLM analysis / Approval / Incident / Telegram all move to the background - eliminates the Alertmanager fallback storm (timeout → resend → exponential-backoff storm) P1: timeout dirty data (decision_manager): - _package_to_proposal_data: blocked_reason must not enter desc_parts (must not reach the card) - _push_decision_to_telegram: suggested_action now falls back to "待分析" (pending analysis); description must not flow into it Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:29:24 +08:00
OG T	9d6aa7ea45	feat(trust): ADR-088 Trust Score persistence: the core of L4 auto-approval All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m40s TrustScoreManager is upgraded from in-memory to PostgreSQL persistence, so trust scores no longer reset to zero after a Pod restart and the AI can actually accumulate up to the L4 auto-approval threshold. Changes: - migrations/adr088_trust_score_persistence.sql: trust_records table - db/models.py: TrustRecordDB ORM model - repositories/interfaces.py: ITrustRepository Protocol - repositories/trust_repository.py: PG upsert ON CONFLICT DO UPDATE - services/trust_engine.py: bulk_load() startup warm-up - services/learning_service.py: _persist_trust() + 2 call sites - main.py: load_all() → bulk_load() at startup. Flow: approve 5 times → score=5 written to DB → Pod restarts → warm-up reads it back → evaluate_adjusted_risk MEDIUM→LOW → auto-execute 2026-04-17 ogt + Claude Sonnet 4.6 (APAC): ADR-088 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 16:14:44 +08:00
OG T	1ae9e9f389	fix(code-review): P0-1 action fallback semantics + P1-2 reason enum + P2-2 secops cleanup All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m7s Code review findings (2026-04-17 chief architect review): P0-1 auto_approve.py condition 1d semantics: - before: the kubectl check used the `action` variable (already fallen back as action or kubectl_command) → action="" + kubectl_command="kubectl get pods" → action="kubectl get pods" → 1d passes → _kubectl_cmd and action hold the same value (the same source is checked twice), which hides the case where action itself is natural language - after: use the raw value proposal_data.get("action", "") (_raw_action) → the action field itself is checked directly, making the logic unambiguous P1-2 auto_approve.py adds NO_EXECUTABLE_ACTION: - new AutoApproveReason.NO_EXECUTABLE_ACTION enum value - condition 1d now uses this reason (the former NO_PLAYBOOK means "no matching Playbook" and does not fit this case) - avoids polluting the root-cause categories in the KM flywheel learning data (ADR-068) P2-2 decision_manager.py secops branch: - threat_behavior now uses _parse_debate_summary → takes the diagnosis field - consistent with the BUG-A/BUG-C fixes; no longer dumps the first 150 characters of the full debate_summary ADR-082: Phase 2 multi-agent collaboration 2026-04-17 ogt + Claude Sonnet 4.6 (APAC): post-code-review fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 15:23:35 +08:00
OG T	93205ceab0	fix(auto_approve+solver): P1 kubectl gate + P2 force kubectl on the Nemo path All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m56s P1 security hole (auto_approve.py): - add condition 1d: action must contain the kubectl keyword to be auto-executed - the Solver, via the OpenClaw Nemo path, outputs natural language → condition 1c passes but nothing can be executed - fix: a natural-language action → downgraded to manual review (NO_PLAYBOOK reason) P2 execution blocker (solver_agent.py): - Nemo format path: action_title lacks kubectl → return [] → triggers _degraded_plan - _default_action_for_category: old natural-language text → real kubectl investigation commands - the degraded path now emits read-only commands such as kubectl get/top/exec, which auto_approve 1d can evaluate correctly ADR-082: Phase 2 multi-agent collaboration 2026-04-17 ogt + Claude Sonnet 4.6 (APAC): P1+P2 hotfix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:49:53 +08:00
OG T	f421e652d3	fix(telegram): BUG-C TYPE-3 layout cleanup + Approve/Reject always on top (ADR-075 UI third-wave fix) Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Checkpoint 1: decision_manager.py TYPE-3 root_cause cleanup: - old: root_cause=_smt(reasoning, 500) → the full debate_summary (diagnosis/plan/review/challenge) was dumped into the AI diagnosis field - new: _parse_debate_summary takes only the diagnosis field + _smt truncates to 300 characters - remove the _requires_human variable (no longer used) Checkpoint 2: telegram_gateway.py _build_inline_keyboard button order refactor: - old: K8s category buttons on top, Approve/Reject controlled by requires_human_approval → dead card - new: [✅ Approve][❌ Reject] always on the first row, K8s/DB/Host action buttons after - remove the requires_human_approval parameter (logic simplified to unconditional top placement). Scope of change: the decision_manager.py else routing section + _build_inline_keyboard + the send_approval_card signature; telegram_gateway.py templates/message formats are unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:42:29 +08:00
OG T	418d73540b	fix(telegram): BUG-A TYPE-1 + BUG-B TYPE-4D data preprocessing (ADR-075 UI second-wave fix) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m25s BUG-A (TYPE-1 informational notification): - old: message=reasoning[:200] → the full debate_summary was dumped (diagnosis/plan/review/challenge all together) - new: _parse_debate_summary(reasoning) takes only the diagnosis field + _smt truncates to 200 characters BUG-B (TYPE-4D Config Drift): - old: diff_summary=description[:500] → the LLM's raw JSON output was shown directly in the <pre> block - new: JSON Catcher: if json.loads(description) succeeds, format it as "📝 Suggested action / 📖 Explanation / ⏪ Rollback plan"; on failure (JSONDecodeError/TypeError/AttributeError) → degrade gracefully to plain-text truncation. Only the routing preparation section of decision_manager.py is modified; the telegram_gateway.py template layer is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:14:10 +08:00
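The "JSON Catcher" for BUG-B follows a try-parse-then-degrade pattern that might be sketched as below. The function name `format_diff_summary` and the JSON keys `action_title`, `description`, and `rollback` are assumptions; the commit names only the three rendered sections, the three caught exception types, and the 500-character plain-text fallback.

```python
import json
from typing import Optional


def format_diff_summary(description: Optional[str], limit: int = 500) -> str:
    """Render LLM JSON as labelled lines; fall back to truncated plain text."""
    try:
        data = json.loads(description)
        # .get raises AttributeError if the JSON is not an object (e.g. a list).
        return "\n".join(
            [
                f"📝 Suggested action: {data.get('action_title', '')}",
                f"📖 Explanation: {data.get('description', '')}",
                f"⏪ Rollback plan: {data.get('rollback', '')}",
            ]
        )
    except (json.JSONDecodeError, TypeError, AttributeError):
        return (description or "")[:limit]
```

The three exception types cover distinct failure modes: non-JSON text, a `None` input, and valid JSON that is not an object.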
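OG T	6baa2e91da	fix(telegram): three fixes: dead-card buttons + duplicate rendering + smart truncation All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m26s Problem 1: Approve/Reject buttons disappear (dead card). Root cause: when _build_inline_keyboard has dynamic alert_category buttons it takes the category path and the approve/reject row is skipped → a requires_human_approval card has no review trigger. Fix: add a requires_human_approval parameter; when True, force an Approve/Reject row after the dynamic buttons. Impact: decision_manager passes in proposal_data.requires_human_review. Problem 2: TYPE-8M renders the same text in three fields. Root cause: diagnosis/system_impact/probable_cause all used reasoning[:100] → the same passage. Fix: add _parse_debate_summary(), which splits debate_summary into its "diagnosis / plan / security review / challenge" fields and fills each component with semantically distinct content. Problem 3: phantom truncation "質疑：無（通" (cut off mid-parenthesis). Root cause: a crude [:N] slice cuts in the middle of brackets or Chinese words. Fix: add _smart_truncate(), which truncates at a sentence boundary (。！？；，) and appends a "…[截斷]" (truncated) marker. Verification: verify_telegram_ui.py passes in full (balanced brackets ✅, no duplicate fields ✅, buttons present ✅) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 13:57:42 +08:00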
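Boundary-aware truncation as described for Problem 3 can be sketched as follows. The boundary characters and the marker are taken from the commit message; the function name, the choice to keep the boundary character, and the hard-cut behaviour when no boundary is found are assumptions.

```python
BOUNDARIES = "。！？；，"  # sentence boundaries named in the commit message
MARKER = "…[截斷]"         # "truncated" marker named in the commit message


def smart_truncate(text: str, limit: int) -> str:
    """Cut at the last sentence boundary within `limit` characters."""
    if len(text) <= limit:
        return text
    head = text[:limit]
    # Position of the right-most boundary character inside the window, or -1.
    cut = max(head.rfind(ch) for ch in BOUNDARIES)
    if cut <= 0:
        # No usable boundary: fall back to a hard cut, but still mark it.
        return head + MARKER
    return head[: cut + 1] + MARKER
```

With the input `"診斷：磁碟已滿。質疑：無（通過審查）。"` and a limit of 14, a plain slice would end in the dangling `（通`, whereas this sketch backs up to the preceding `。`.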
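OG T	0ab92c20d6	fix(telegram): raise the root_cause truncation limit 300→500: fix the recurring phantom "質疑：無（通" truncation All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m31s Root cause: debate_summary is structured as "diagnosis (≤220 characters); plan; security review; challenge". When the diagnosis hypothesis is long, the total exceeds 300 chars → root_cause is cut at the character "通". Fix: 300 → 500 (safe within Telegram's 4096-character per-message limit) 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 13:28:16 +08:00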
OG T	58d9c0637a	fix(drift): drift_narrator now uses the OpenClaw AI Router: fix the blank "assessed cause" field. Root cause: _generate_narrative() in drift_narrator_service.py called Ollama directly over httpx (192.168.0.111:11434), bypassing the AI Router, with no fallback. 192.168.0.111 is a dead IP → the httpx connection fails → degrades to fallback_narrative() → in the fallback, interpretation.explanation exists but is truncated by the display layer → blank. Fix: use get_openclaw().call(prompt) so everything goes through the AI Router, the same approach as the drift_interpreter.py fix (d952435). Remove the unused httpx import 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 13:28:16 +08:00
OG T	e0bfcc7bd6	fix(phase5): fix the Solver action format: force kubectl command output Some checks failed CD Pipeline / build-and-deploy (push) Failing after 9m33s Root cause: the action example in _build_prompt() was "restart_service:awoooi-api" (a custom format), and the LLM imitated it, emitting natural-language descriptions instead of kubectl commands. Impact chain: Solver action = natural-language description → auto_approve Condition 1c rejects it (no kubectl keyword) → _auto_execute() is never called → blast_radius_calculator is never called → blast_radius_score fill rate = 0/14 = 0% (Phase 5 acceptance metric not met). Fix: 1. the blast_radius reference changes from abstract descriptions to real kubectl command examples 2. explicitly require the action field to be a real kubectl command (natural language not allowed) 3. correct example: kubectl rollout restart deployment/awoooi-api -n awoooi-prod Expected effect: the LLM outputs kubectl commands → auto_approve passes (low blast_radius cases) → blast_radius_calculator is called → fill rate trends toward 100% 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 12:44:36 +08:00
OG T	b7c2b691bb	fix(p2-backlog): fix suggested_action showing "待分析" (pending analysis): fall back to description when action is empty Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m26s Root cause: suggested_action in _push_decision_to_telegram() had only two branches: - action has a value → show action[:120] - action is empty → show "待分析". But _package_to_proposal_data() has already assembled description from the hypothesis (containing "root cause: ... (confidence X%); plan: ..."), so with action="" the card still showed "待分析" and the SRE could not see the AI's diagnosis on the Telegram card. Fix: when action is empty, prefer description[:120] as suggested_action (description already contains the root-cause summary and is more meaningful than "待分析") fallback chain: action → description → "待分析" 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 11:48:49 +08:00


1184 Commits