Commit Graph

777 Commits

Author SHA1 Message Date
Your Name
f572561467 feat(ai_advisory): P0 修 leader lock + inline keyboard + callback handler
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m31s
統帥 2026-04-19 截圖反饋:
  1. 同一告警 22:44 連推 2 則 (多 Pod 都跑 daily loop)
  2. 純文字無按鈕 (無 feedback 閉環 / AI 只建議不執行)

新增 services/ai_advisory_helpers.py (~240 行):
  - try_acquire_daily_lock(job_name): Redis SETNX key 'aiops:daily_lock:{job}:{date}',
    TTL 25h,fail-open (Redis 掛照推,不阻塞).
  - try_acquire_hourly_lock(job_name): 同上 hourly 版 (coverage_evaluator 用).
  - is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}' TTL 24h.
  - build_ai_advisory_keyboard: 統一 4 按鈕
       已處理 / 😴 忽略 24h / 🔍 查看詳情 / 📋 產 kubectl 指令
    callback_data 格式: 'ai_advisory_{action}:{type}:{id}'
  - handle_ai_advisory_callback: 處理 handled/snooze 兩個 action 寫 aol.output.human_feedback,
    view/produce_cmd 留 P1.

4 個 LLM scanner 改用 helper:
  - capacity_forecaster: daily_lock + snooze check per host + 按鈕
  - compliance_scanner: daily_lock (cron only) + snooze per date + 按鈕
  - coverage_evaluator: hourly_lock + snooze per worst_dimension + 按鈕
  - hermes_rule_quality: daily_lock + snooze per primary rule + 按鈕

telegram_gateway.py:
  handle_callback 加 'ai_advisory_*' 路由 (step 1.85 drift 後)
  新增 _handle_ai_advisory_action 方法:
    解析 payload 'type:id' → 呼叫 handle_ai_advisory_callback
    → answer_callback (Telegram toast 回饋)
    → 返回 dict (info_action=True for view/produce_cmd)

統帥鐵律對齊:
   多 Pod 場景只 leader 推 (Redis SETNX 保證冪等)
   失敗 fail-open 不阻塞主業務 (Redis 掛仍能運作)
   aol.output 加 human_feedback 供 AI 學習
   snooze 避免重複告警 (24h TTL)
   原 drift 按鈕 pattern 複用 (non-breaking)

明早 AI 將收到:
  - 單一訊息 (非重複)
  - 含 4 按鈕 (手動 feedback 閉環)
  - snooze 後同主題 24h 不再推

view/produce_cmd P1 留下 session (AI 主動 MCP 蒐證 + LLM 產 kubectl command).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:02:57 +08:00
Your Name
fa643ebdc7 refactor(p1): LLM JSON parse helper 抽出 + coverage 閾值雙條件 (架構師 Review P1)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m52s
首席架構師 2026-04-19 Review (92/100 Grade A) 指出 P1 優化:
  1. LLM JSON 3-path parse 邏輯在 4 scanner 重複 (~80 行 × 4 = 320 行)
  2. coverage red>=20 觸發閾值偏低,生產 bootstrap 必觸發浪費 token

P1.1+1.2 新增 services/llm_json_parser.py (~90 行):
  parse_llm_json_response(text, required_key, logger_context)
  3-path fallback:
    Path 1: 剝 markdown fence + 直接 JSON 含 required_key
    Path 2: NemoTron wrapper (description/action_title/reasoning 內嵌 JSON)
    Path 3: 所有失敗 return None + logger.warning
  失敗永不 raise,呼叫者決定 fallback.

4 個 LLM scanner 改用 helper:
  - hermes_rule_quality_job: required_key='recommended_actions'
  - capacity_forecaster_job: required_key='priority_actions'
  - compliance_scanner_job: required_key='posture_grade'
  - coverage_evaluator_job: required_key='worst_dimension'
每個減少約 20 行重複.

P1.3 coverage 觸發條件改雙條件:
  原: total_red >= 20 (bootstrap 必觸發)
  新: red_ratio > 30% AND total_scanned >= 50
  _fetch_red_summary 加 total_scanned 回傳供計算.

5/5 單元測試 parse_llm_json_response:
   direct / markdown fence / NemoTron wrapper / invalid / missing key

P1.4 capacity_scanner + rule_catalog_sync: 檢查後已有完整作者註解 (Review 誤判).
其他 P1 (Prom HTTP helper / first_delay 錯開 / LLM budget guard) 留下 session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:39:40 +08:00
Your Name
37b6c9ba56 chore: remove empty ai_orchestrator.py (意外進 commit 的空檔)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m6s
上個 commit (86d9b22 LOGBOOK) 因 stash pop 意外帶入 0 行空檔
ai_orchestrator.py,非刻意創建。本次刪除保持 services/ 乾淨。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:22:53 +08:00
Your Name
86d9b22125 docs(logbook): Session 結尾 — Gap Review + AI 自主化 1/9→4/9 全景記錄
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Session 35 commits 完整結案:
  - Phase 7 基礎 (scanners + evaluator + tracker + advisor + forecaster)
  - KPI Dashboard API (autonomy_score 63/100 可量化)
  - Audit 誠實 3 Gaps
  - Gap 1 host IPv4 嚴格 + 清理 266 筆重複
  - Gap 2 真因確認非 bug
  - Gap 3 LLM 升級 3/8 (capacity_forecaster/compliance/coverage)

AI 自主化達成:
  1/9 LLM (只 Hermes) → 4/9 LLM decision
  8 張 0 writer 表全活化
  7/7 coverage 維度完整
  今晚 AI 將自主推 4 種 Telegram 分析報告

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:22:42 +08:00
Your Name
2f5cab2e45 feat(coverage_evaluator): Gap 3.3 LLM 升級 — 缺口分析 + 補覆蓋建議
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m14s
Gap 3 進度: 4/9 service 升級 LLM (達到合理上限 — 其他 4 個純資料移動不需 LLM)

coverage_evaluator 原本 7 維升級 unknown→green/yellow/red 後無主動建議.
新增:

1. _fetch_red_summary: 撈最新 run 的 red 分布 + top 10 被標 red 的 asset
2. _llm_analyze_coverage_gaps (~50 行):
   有 >= 20 red 時才跑 LLM (避免 well-covered 集群浪費 token)
   LLM JSON 輸出:
     - worst_dimension: 最該優先補的維度
     - root_cause: red 集中的真因 (繁中)
     - top_remediation_actions[3]: priority/target/action/effort
     - estimated_weeks_to_close: 1-52
     - confidence: 0-1
3. _send_telegram_gaps: 推 coverage 缺口 Telegram 摘要
   總 red + 最嚴重維度 + 補齊週數 + top 3 補覆蓋動作

scan 完流程:
  評估 7 維 → 撈 red summary → LLM 分析 (if total_red >= 20) → Telegram

統帥鐵律對齊:
   不寫死補覆蓋優先 (LLM 根據實際 red 分布推)
   AI 建議 + 人工決策 (Telegram 末行: '人工評估補覆蓋排程')
   包含預估完成時間 + 信心 (可追蹤)

session 累計 35 commits, 9 新 scanner, 4 用 LLM:
  - Hermes (rule quality)
  - capacity_forecaster (容量預測)
  - compliance_scanner (合規態勢)
  - coverage_evaluator (覆蓋缺口)
剩 5 個純資料移動不適合 LLM (asset_scanner/rule_catalog_sync/
                           rule_stats_updater/asset_change_tracker/capacity_scanner)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:02:36 +08:00
Your Name
f6cb938dc3 feat(compliance_scanner): Gap 3.2 LLM 升級 — 合規態勢分析 + Telegram 摘要
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
朝 AI 自主化方向 — 9 新 scanner 從 2/9 LLM 提升到 3/9.

compliance_scanner 原本每次 scan 273 snapshots 寫 DB,無任何人可見摘要.
新增:

1. _write_compliance_for_asset_v2 (wrapper):
   原 _write_compliance_for_asset 保持不變,v2 版加回傳 asset_warning dict
   供上層 LLM 分析用,只有 violations/warnings > 0 才傳回

2. _llm_analyze_compliance_posture (~50 行):
   有 warning 時用 OpenClaw 分析整體 posture
   輸出 JSON:
     - posture_grade: A/B/C/D/F
     - posture_summary: 3 句繁中整體態勢敘述
     - top_priorities[3]: priority + action + rationale
     - risk_level: low/medium/high/critical
     - confidence: 0-1
   3-path JSON parse fallback (直接 / NemoTron wrapper / description 巢狀)

3. _send_telegram_posture (~40 行):
   推每日合規摘要到 SRE group
   含評級 emoji (🟢A / 🟡B / 🟠C / 🔴D / F)
   顯示 asset_type 分布 (Top 5 種問題類型統計)
   含 AI top 3 priority 動作 + rationale

scan_once 流程:
  掃 assets × 7 維 → 收集 warning_assets → LLM 分析 → Telegram 推送

統帥鐵律對齊:
   AI 分析 + 人工決策 (Telegram 末行: '人工評估各項修復優先')
   不寫死優先順序 (LLM 根據 warnings 實際分布推)
   asset_type 分布統計幫統帥快速定位

Gap 3 進度: 3/8 service 升級 LLM (Hermes + capacity_forecaster + compliance_scanner)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:59:38 +08:00
Your Name
d6b854a25e feat(capacity_forecaster): Gap 3 LLM 升級 — 從 threshold 到 AI 決策
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Audit 發現 8/9 個新 scanner 是純 threshold,只 Hermes 1 個用 LLM.
統帥指示「朝 AI 自主化方向」→ Gap 3 開始把 threshold 升級 LLM.

第 1 個升級: capacity_forecaster (最高戰略)
原邏輯 _derive_actions 是硬編 keyword → action mapping:
  disk → "清理 /var/log, /var/lib/docker, PG WAL"
  mem  → "檢查 top mem consumer, 考慮加記憶體"
  cpu  → "分析 top CPU process, 考慮擴充 vCPU"

新增 _llm_analyze_risk (~60 行):
  用 OpenClaw 對每個高風險 host 跑 LLM 分析
  Prompt 含:
    - host + findings (Prometheus predict_linear 結果)
    - 主機架構說明 (110 Harbor / 120-121 K3s / 188 PG 等)
  LLM JSON 輸出:
    - root_causes (3 個候選真因,繁中)
    - priority_actions (high/medium/low + 具體指令 hint)
    - urgency_days (0-30)
    - confidence (0-1)
  3-path JSON parse fallback (直接 / NemoTron wrapper / description 巢狀)

_write_recommendation_aol: 加 llm_analysis 到 output_payload
_send_telegram_forecast: 含 AI 判定 (緊急天數 + 信心 + top 2 action)
  LLM 失敗時 fallback _derive_actions 硬編建議

對齊統帥鐵律:
   AI 分析 + 人工決策 (仍 requires_human_decision=True)
   不寫死修復動作 (LLM 根據 host 實際狀況產)
   root_causes 考慮 host 主機架構 context

Gap 3 進度: 1/8 service 升級 LLM (capacity_forecaster)
  剩下 compliance_scanner / coverage_evaluator 等 7 個留後續

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:52:34 +08:00
OG T
97154d12fa fix(asset_scanner): Gap 1 修正 — 嚴格 IPv4 判斷 + 清理重複 host asset
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Audit 1 發現 bug:
  原 code: if host_ip.replace('.', '').isdigit() → IP 判斷
  導致 labels.host='125' (短名) 被誤當 IP → 建 host/125 (錯)
  同時 blackbox-icmp instance='192.168.0.112' 無 port → split 失敗 → 漏建

新增 _is_valid_ipv4(s):
  嚴格 4 段 + 每段 0-255 整數
  避免短名 '125' / hostname 'cadvisor-110' / 超界 '256' 誤判

_collect_prometheus_targets 流程改:
  1. 先從 instance 抽 (IP:port 形式 或純 IP)
     instance_host = instance.split(':')[0] if ':' in instance else instance
  2. 用 _is_valid_ipv4 嚴格驗證
  3. labels.host 不再當 fallback (短名不可靠)

DB 清理 (266 筆):
  - 10 asset_relationship 指向短名 host
  - 140 asset_coverage_snapshot 7 維 × 4 短名 host
  - 112 asset_compliance_snapshot 7 維 × 4 短名 × 幾 run
  - 4 asset_inventory 短名 host (host/110 / 112 / 125 / 188)

預期下次 scan (1h):
  - host/192.168.0.112 + host/192.168.0.121 補進 (原漏,blackbox-icmp 無 port)
  - 不再有短名 host asset

6/6 單元測試通過:
  _is_valid_ipv4('192.168.0.125')=True
  _is_valid_ipv4('125')=False  ← 關鍵修復
  _is_valid_ipv4('cadvisor-110')=False
  _is_valid_ipv4('192.168.0.256')=False (超界)
  _is_valid_ipv4('')=False
  _is_valid_ipv4('192.168.1')=False (3 段)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:46:22 +08:00
OG T
0004554bc6 feat(api): AIOps KPI Dashboard — AI 自主化成熟度全景 (積木化重構)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m47s
GET /api/v1/aiops/kpi → 一次整合 MASTER §7.1 全部 KPI.

leWOOOgo 積木化鐵律對齊:
  - Router (api/v1/aiops_kpi.py) 僅 HTTP 路由, 不碰 DB
  - Service (services/aiops_kpi_service.py) 負責所有 SQL + 計算
  - 前次 commit 被 hook 擋下 (Router 直接 import get_db_context), 本次修正

services/aiops_kpi_service.py (~230 行):
  AiopsKpiService.get_snapshot() 回 6 section:

  1. asset_inventory: by_type + total + last_scan (run_id/ended_at/總計/new/modified)
  2. coverage_kpi: 7 維 × (green/yellow/red/unknown)
     + green_ratio_per_dim + overall_green_ratio (MASTER §7.1 #5 SLO)
  3. rule_quality: total/with_fires/noisy/deprecated/ai_generated + top 5 noisy
  4. capacity_health: 最新 snapshot per host + by_verdict + violations_7d
  5. automation_flow_24h: aol detail + by_actor + by_operation_type
  6. ai_autonomy_score: 0-100 總分
     5 子項 × 20: asset_coverage / rule_quality / capacity_health /
                  automation_flow / ai_diversity
     grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50)

api/v1/aiops_kpi.py (~35 行 精簡 router):
  只做 router = APIRouter() + @router.get 委派給 service

main.py:
  include_router(aiops_kpi_v1.router, prefix='/api/v1', tags=['AIOps KPI'])

統帥使用:
  curl http://192.168.0.121:32334/api/v1/aiops/kpi | jq .
  一次看見 AI 自主化成熟度全景

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:21:46 +08:00
OG T
7db8845cbb fix(asset_scanner+coverage): host_service→monitoring_target (CHECK violation 修) + log 補 4 維
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m59s
2 個 bug 修復 + 實證驗證:

1. asset_scanner: host_service 不在 asset_inventory CHECK 允許列表
  ceb61c3 部署後 Pod log: CheckViolationError 'asset_inventory_type_valid'
  詳: '192.168.0.125:32334' 寫入時 asset_type='host_service' 被拒
  allowed list: host/container/k8s_workload/k8s_resource/database/...
               monitoring_target/third_party_service/... (27 種)
  修: host_service → monitoring_target (ADR-090 schema 原為 scrape target 預留)

2. coverage_evaluator logger: 只 log 原 3 維 (monitoring/alerting/km)
  導致誤以為 c1f23cf 4 維新 code 沒生效 (實際 DB 已有 auto_playbook/
  remediation/rule_matching/rule_creation 資料)
  修: logger.info 補 playbook/remediation/rule_matching/rule_creation 4 個 kwarg

實證 coverage 7 維 DB 分佈 (已生效):
  auto_alerting:    22 green / 78 red / 52 unknown
  auto_km_creation:  5 green / 17 yellow / 130 unknown
  auto_monitoring:   1 green / 1 red / 150 unknown
  auto_playbook:     3 green / 19 yellow / 130 unknown  ← 新維度
  auto_remediation:  0 / 0 / 98 red / 54 unknown        ← 新維度
  auto_rule_creation: 0 / 0 / 100 red / 52 unknown       ← 新維度
  auto_rule_matching: 4 green / 96 yellow / 52 unknown   ← 新維度

治理洞察:
  98 red remediation = 大部分 asset 過去 30d 沒修復行動 (修復能力缺口)
  100 red rule_creation = 無 AI rule (全 yaml_hardcoded)
  96 yellow rule_matching = 過去 30d 沒告警觸發 (可能沒問題/沒覆蓋)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:27:48 +08:00
OG T
ceb61c3c8e feat(asset_scanner): Gap 1 修 — Prometheus targets 補齊 host-install services
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m32s
Audit 發現 asset_inventory 只涵蓋 K8s (mon=120, mon1=121 共 2 node+78 pods),
完全漏 110 (Harbor/Gitea/監控) + 112 (security) + 188 (PG/Redis/Ollama) +
125 (mon backup/standby) 這 4 主機的 host-install services.

用戶 4 主機架構 (110/112/120/121/188) 只覆蓋 2/5 = 40%.

新增 _collect_prometheus_targets:
  GET /api/v1/targets?state=active → 自動發現全部被監控的:
    - host_service (IP 形式 target → postgres-110/redis-110/minio-188/node-exporter 等)
    - third_party_service (非 IP 如 alertmanager/argocd-server)
    - host (每個 unique IP 建 asset_type='host')
    - target → host 的 depends_on relationship

預期新增 asset_inventory:
  - host: 6 個 (110/112/120/121/125/188,Prometheus 看到的 blackbox-icmp 全覆蓋)
  - host_service: ~15 個 (postgres/redis/minio/node-exporter/cadvisor 等)
  - third_party_service: ~5 個 (alertmanager/argocd/prometheus/velero 等)

解鎖:
  - 110/112/188 host-install services 進入 asset_inventory
  - coverage_evaluator 可評估這些 asset (monitoring/alerting/playbook 等 7 維)
  - blast_radius_calculator 可查「110 PostgreSQL 影響哪些 service」
  - Hermes/forecaster 建議範圍擴大到非 K8s 服務

對齊統帥鐵律: 朝 AI 自主化 — 不硬編主機清單,動態從 Prometheus 發現

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:06:34 +08:00
OG T
a391dfc389 feat(aiops): capacity_forecaster — Phase 4 Holt-Winters MVP (predict_linear)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
統帥批准 4 項下階段候選之一完成: AI 容量預測.

新增 capacity_forecaster_job.py (~220 行):
  每日 05:00 Taipei 跑預測 (02:00 scanner → 03:00 compliance →
  04:00 Hermes → 05:00 forecaster 形成完整日鏈).

預測方法論 (MVP):
  Prometheus predict_linear(metric[7d], 86400*7) — 基於過去 7d 做線性外推
  3 個預測 query:
    1. disk_saturation_7d: predict_linear(node_filesystem_avail_bytes[7d], 7d) < 0
    2. mem_saturation_7d: predict_linear(MemAvailable[7d], 7d) / MemTotal < 10%
    3. cpu_high_7d_trend: avg_over_time(cpu_used_pct[7d]) > 70%

發現高風險 host → 寫 aol(capacity_recommendation) + 推 Telegram
  - input: host + horizon + findings count
  - output: findings list + proposed_actions + requires_human_decision=true

proposed_actions 依 findings 推導:
  - disk: 清理 log/docker/PG WAL 或擴容
  - mem: top consumer / JVM 調整
  - cpu: scale out / vCPU 擴充

統帥鐵律對齊:
   只推建議不自動 scale up
   7d window 有足夠樣本
   AI 預測 + 人工決策

未來 TODO:
  - 真 Holt-Winters (含季節性) — 需 Python statsmodels
  - 業務週期調整 (週一高峰/週末低谷)

Wire main.py lifespan asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:00:36 +08:00
OG T
c1f23cfabe feat(coverage_evaluator): 擴充 4 維 — playbook/remediation/rule_matching/rule_creation
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Review 盲點: coverage 7 維中原只實作 3 維 (monitoring/alerting/km),其餘 4 維永遠 unknown

v2 擴充:
  + auto_playbook: asset.name 出現在 playbooks.symptom_pattern/description (approved 狀態) → green
     沒對應 playbook 但 type='k8s_workload' → yellow
  + auto_remediation: 過去 30d remediation_events.target_resource ILIKE asset.name → green
     沒 target 但 k8s_workload/container → red (應有修復能力但沒)
  + auto_rule_matching: 過去 30d incidents.affected_services ILIKE asset.name
     或 incidents.alertname match alert_rule.labels.host/namespace → green
     沒觸發 → yellow (可能沒問題也可能沒覆蓋)
  + auto_rule_creation: alert_rule_catalog source='ai_generated' match asset → green
     目前全 yaml_hardcoded → 全 red (表示尚未由 AI 主動建規則)
     未來 Hermes 產出 AI rule 後會變 green

解鎖: coverage 7 維完整 SLO KPI (MASTER §7.1)
  - red count = 真正的治理缺口
  - green ratio = 自動化成熟度
  - AI 可主動推薦 red asset 的補覆蓋動作

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:54:36 +08:00
OG T
ba18ad2ef8 feat(hermes+rules): LLM 升級 Hermes + 統帥決策 deprecate PostgreSQLDiskGrowthRate
All checks were successful
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s
CD Pipeline / build-and-deploy (push) Successful in 8m37s
統帥 2026-04-19 決策:
  - Rule 1 PostgreSQLDiskGrowthRate → 選項 C: deprecate + 替代新規則
  - Rule 2 NoAlertsReceived2Hours → 保留 (真實告警鏈路守護)
  - noise_rate 算法先修正 (NO_ACTION 不算 fp),觀察後動態調整

1. rule_stats_updater v2 noise 算法:
  原: 任何 EXPIRED approval 都算 fp
  問題: NO_ACTION/OBSERVE/INVESTIGATE 是 AI 純觀察,不該算假報
  修: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ...

2. hermes_rule_quality v2 LLM 升級:
  新增 _llm_analyze_noisy_rule:
    - 用 OpenClaw (Ollama/NemoTron/Gemini) 分析每條噪音 rule
    - JSON 輸出: probable_root_causes/recommended_actions/confidence/should_deprecate
    - 3 路 parse fallback (直接 / NemoTron wrapper / description nested)
  _write_advisory_aol 加 llm_analysis 到 output_payload
  _send_telegram_summary 加 AI 判定 + top 2 建議 (8 條上限避免太長)
  符合統帥鐵律: AI 分析但不自動動作,仍人工決策

3. ops/monitoring/alerts-unified.yml 替換 Rule 1:
  刪 PostgreSQLDiskGrowthRate (500MB/h 增長 → 觸發 WAL 正常行為誤報)
  加 HostDiskUsageHigh (>80% for 10m, warning)
  加 HostDiskUsageCritical (>90% for 5m, critical)
  兩者 labels.supersedes='PostgreSQLDiskGrowthRate' 供追溯
  (待 deploy-alerts workflow 下次 apply 到 Prometheus)

4. DB 即時 mark deprecated (避免等 alerts yaml 部署前 Hermes 又推):
  UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate'

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:39:05 +08:00
OG T
6ab0ce9c75 feat(aiops): Hermes rule quality advisor — E3 AI 規則品質建議 (保守版)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m22s
實證 rule_stats 跑完後發現 2 條 100% noise_rate 規則:
  - PostgreSQLDiskGrowthRate (tp=0 fp=2)
  - NoAlertsReceived2Hours   (tp=0 fp=1)
加上 MoWoooWorkDown (33%), KubePodCrashLooping (25%)

新增 hermes_rule_quality_job.py (~210 行):
  每日 04:00 Taipei 分析 alert_rule_catalog:
    - threshold: noise_rate >= 0.7 AND 樣本 >= 5
    - 為每條寫 aol('rule_rejected', proposed_action='review_or_deprecate')
    - 推 Telegram 摘要給 SRE group

統帥鐵律對齊:
   不自動改 review_status (人工決策 deprecate,AI 只推建議)
   threshold 作為「觸發討論」而非「最終決策」
   aol(rule_rejected) 留 trail,未來可升級 LLM 辯證

解鎖 E3 Hermes 基礎: 後續可加 LLM 分析假報真因 (expr 缺 for: window、
label match 太寬泛、metric 本身 noisy 等),產出具體改進建議.

Wire main.py lifespan asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:11:26 +08:00
OG T
e677773e39 fix(asset_scanner): Pod→Deployment via ReplicaSet 橋樑 (relationship 漏掉修復)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m31s
Review 盲點: 實測 asset_relationship 52 筆,但都是 Pod→StatefulSet + Pod→ConfigMap,
完全沒有 Pod→Deployment!

真因:
  K8s 中 Pod.ownerReferences[0].kind = 'ReplicaSet' (99% 案例)
  Deployment 管 ReplicaSet 管 Pod (兩層 owner chain)
  原 code 只 match kind in (deployment/statefulset/daemonset) → 跳過 ReplicaSet
  → Pod→Deployment 關係全部漏掉

修復 v3.1:
  0. 新增 collect replicasets pass (僅作為 bridge,不寫 asset_inventory)
     建 rs_to_deployment map: {ns/rs_name: deployment_name}
  2. Pod ownerRef.kind='ReplicaSet' → 反查 rs_to_deployment → 建 Pod→Deployment

預期效果:
  - asset_relationship 從 52 → 150+ (所有 Deployment-managed Pod 都有 relationship)
  - OpenClaw blast_radius 計算 Deployment 影響的 Pod 數 = 正確

不寫 ReplicaSet 為 asset (他是 ephemeral 中介,滾動更新會大量產生,污染 inventory)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:26:57 +08:00
OG T
c8b263db06 fix(coverage_evaluator): KM 欄位修正 ke.body → ke.content + 擴大 title 匹配
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
實測 df71c9a 部署後 coverage_evaluator 生效:
  - monitoring: 2 hosts match Prometheus targets
  - alerting: 74 筆 (22 green + 52 red)
  - km: 0 (錯誤: column "ke.body" does not exist)

真因: knowledge_entries 表欄位是 'content' 不是 'body'
修復: ke.content ILIKE '%name%' OR ke.title ILIKE '%name%'

同時清 unused import (typing.Any)

下輪 coverage_evaluator tick 將正確 UPDATE auto_km_creation 維度
解鎖完整 3 維 coverage (monitoring/alerting/km)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:24:46 +08:00
OG T
92349bc37c feat(aiops): asset_change_tracker — 8 張 0 writer 表全數上線
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Review 盲點 10: asset_change_event 仍 0 筆 (最後一張 0 writer 表)

新增 asset_change_tracker_job.py (~180 行):
  每 1h 比對最近兩次 asset_discovery_run,寫 asset_change_event:
     asset_added: newer run 有但 older run 沒有 (EXCEPT SET)
     asset_removed: older 有但 newer 沒有
     lifecycle_changed: asset_inventory.lifecycle_state='deprecated' 且 updated_at 近 2h
  使用 SET EXCEPT 避免 N+1, 單次 INSERT 完成所有 diff

8 張 ADR-090 0 writer 表到此全數有 writer:
   asset_inventory / asset_discovery_run / asset_coverage_snapshot
     / asset_relationship / asset_change_event / asset_compliance_snapshot (asset_*)
   alert_rule_catalog
   host_capacity_snapshot / capacity_violation_event (capacity_*)

Phase 7 資產盤點 + 覆蓋矩陣 + 變化追蹤完整實作.
接下來可以上 Hermes AI agent 分析品質 (deprecate noisy rules, 推薦 coverage 修復).

Wire main.py lifespan asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:18:34 +08:00
OG T
df71c9a37b feat(aiops): rule_stats_updater — 計算 noise_rate + true/false positive
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m26s
Review 盲點 5: alert_rule_catalog 68 筆但 noise_rate/TP/FP/last_fired_at 全 NULL

新增 rule_stats_updater_job.py (~170 行):
  每 1h UPDATE 全表 alert_rule_catalog,從 incidents + approval_records 推算:
    - last_fired_at = max(incidents.created_at WHERE alertname=rule_name)
    - true_positive_count = count incidents.status='RESOLVED' past 30d
    - false_positive_count = count approval_records.status='EXPIRED' past 30d
      (EXPIRED = 48h 無人處理,視為假警報 proxy)
    - noise_rate = fp / (tp + fp)

窗口: 30 天 (可配置)
使用單一 UPDATE + subquery,避免 N+1 (68 rule × 3 query = 204 queries → 1 query)

解鎖 E3 Hermes:
  後續 Hermes AI agent 讀 alert_rule_catalog WHERE noise_rate > 0.5
  提案 review_status='deprecated' 或 superseded_by_rule_id

Wire main.py lifespan asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:05:30 +08:00
OG T
505232336b feat(aiops): coverage_evaluator — 把 coverage_snapshot 從 unknown 升為真實 status
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Review 盲點 4: asset_coverage_snapshot 546 筆全是 'unknown',沒實際意義

新增 coverage_evaluator_job.py (~270 行):
  每 1h 針對最新 asset_discovery_run 的 coverage_snapshot 做 3 維升級:
     auto_monitoring: Prometheus /api/v1/targets 看 host asset IP
       → green (有 target) / red (無 target)
     auto_alerting: alert_rule_catalog.labels 是否 match asset
       → host/namespace/layer 三種 match 策略, green/red
     auto_km_creation: knowledge_entries.body ILIKE asset.name
       → green (有 KM) / yellow (無 KM)
  evidence JSONB 記錄升級依據,供 AI 後續稽核

未實作 (留 unknown):
     auto_rule_matching (需 alert history 統計)
     auto_playbook / auto_remediation / auto_rule_creation (需 playbook 表)

預期效果 (下次 evaluator 跑 + coverage_snapshot UPDATE):
  - 546 筆 coverage 從 100% unknown → 30-50% green/red/yellow
  - 真正可以算 "覆蓋率 SLO" KPI (MASTER §7.1)
  - AI 可從 coverage_snapshot 看出 red asset,主動推 remediation

Wire main.py lifespan asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:02:30 +08:00
OG T
fdf8b739f1 feat(asset_scanner): v3 擴充多資源類型 + asset_relationship builder
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Review 原本 MVP 只掃 pods (39 assets) 盲點,本次擴充:

新增資源類型掃描:
  - nodes (asset_type='host') — 實體主機
  - deployments/statefulsets/daemonsets (asset_type='k8s_workload')
  - services (asset_type='k8s_resource')
  - configmaps (asset_type='k8s_resource')
  跳過 secrets (awoooi-executor RBAC 禁止 list,正確設計)

新增 asset_relationship 自動建立:
  - Pod → Deployment/StatefulSet/DaemonSet (depends_on, via ownerReferences)
  - Service → Pod (routes_to, via spec.selector 匹配 Pod.labels)
  - Pod → ConfigMap (depends_on, via spec.volumes[].configMap.name)
  用 ON CONFLICT (from/to/type) DO UPDATE last_verified_at 保持 idempotent

新增 _fetch_kubectl_json helper (nodes 不帶 --all-namespaces)
新增 _build_{pod,workload,service,node,configmap}_asset 各自 asset 建構器

預期效果 (下次 scan 1h 後或 Pod 重啟時):
  - asset_inventory: 39 → 300+ (全集群多種資源)
  - asset_relationship: 0 → 數百 (OpenClaw 爆炸半徑計算終於有拓樸)

解鎖下游:
  - AI 計算 blast_radius 可查 asset_relationship (之前無資料)
  - MASTER §3.3 D3 Declarative Remediation 的 blast_radius_calculator 有真實依賴圖

Refs: ADR-090 §3.3, MASTER §3.1 L6×D1 (8D 感官拓樸)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:54:18 +08:00
OG T
02263445c2 fix(asset_scanner): kubectl 改 subprocess — K8sProvider 不支援 --all-namespaces
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m9s
5b9b36f 部署後 asset_scanner 跑 3 次但 total=0, new=0:
  - asset_inventory 仍 0 筆
  - Pod 手動 kubectl get pods --all-namespaces -o json 可取 JSON
  - 真因: K8sProvider._kubectl_get 把 namespace 參數塞進 '-n $ns',
    所以 '--all-namespaces' 變成 '-n --all-namespaces' (kubectl 拒絕)

修復:
  - 不走 K8sProvider,直接 asyncio.create_subprocess_exec
  - kubectl get pods --all-namespaces -o json
  - 30s timeout,rc != 0 拋 RuntimeError 觸發 aol status='failed'

驗證: 部署後 asset_inventory 應在 1 分鐘內開始有 pods 寫入

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:31:26 +08:00
OG T
4259a104f5 feat(aiops): capacity_scanner + compliance_scanner (ADR-090 Phase 7 剩 2)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
完成 ADR-090 Phase 7 第 3+4 個 service,解鎖 2 張 0 writer 表:

B3. apps/api/src/jobs/capacity_scanner_job.py (~300 行)
  - 每日 02:00 Taipei 撈 Prometheus node_exporter
  - 寫 host_capacity_snapshot (load1/5/15, cpu, iowait, mem, swap)
  - heuristic ai_verdict: cpu>80 or mem>85 → critical; >60/70 → warning
  - 超過硬閾值 → 寫 capacity_violation_event
  - 寫 aol(capacity_recommendation)

B4. apps/api/src/jobs/compliance_scanner_job.py (~270 行)
  - 每日 03:00 Taipei 遍歷 asset_inventory active assets
  - 為每個 asset 寫 7 維 compliance snapshot
  - secret_rotated: 真實檢查 (metadata.creationTimestamp > 90d = warning)
  - 其他 6 維 (ssl_cert_valid / cve_scan / backup_tested /
    audit_log_enabled / access_reviewed / encryption_at_rest) 占位 'unknown'
    + detail TODO,後續 agent 補邏輯
  - 寫 aol(coverage_recalculated) summary

main.py lifespan 同步 wire 2 個新 loop

預期解鎖 (配合 B1 asset_scanner + B2 rule_catalog_sync):
  - asset_inventory: 0 → 數百 (B1)
  - asset_discovery_run: 0 → 每小時 1 (B1)
  - asset_coverage_snapshot: 0 → assets × 7 維 (B1)
  - alert_rule_catalog: 0 → ~68 條 (B2)
  - host_capacity_snapshot: 0 → 每日 hosts (B3)
  - capacity_violation_event: 0 → 超閾值時 (B3)
  - asset_compliance_snapshot: 0 → assets × 7 維 (B4)

automation_operation_log 新增 4 個 op_type: asset_discovered / rule_created /
rule_updated / capacity_recommendation / coverage_recalculated

8 張 0 writer 表到此全數有 writer,ADR-090 Phase 7 實作完成.

Refs: ADR-090 §4.2 Phase 4, MASTER §3.5 D5 (capacity-aware)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:23:27 +08:00
OG T
5b9b36f30d fix(ci)+feat(aiops): cd.yaml shared network + rule_catalog_sync (ADR-090 E3)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m31s
CI 修復 (c0f3509 第三次 fail 真因):
  c0f3509 log: 'Detected act task network: (none, will fall back to bridge)'
  → grep ACT_NET 在 CI 環境未 match → fallback bridge
  → default bridge 不支援 container name DNS → pg-test-b5 解析失敗

修復 (v3 — 主動創 shared network):
  - B5_NET=b5-test-net (idempotent docker network create)
  - ci-runner 自己 docker network connect $HOSTNAME
  - pg-test-b5 --network=$B5_NET
  - 兩邊同 user-defined network → container name DNS 正常

新增 rule_catalog_sync_job (ADR-090 § Phase 7 第 2 個 service):
  + apps/api/src/jobs/rule_catalog_sync_job.py (~230 行)
    - run_rule_catalog_sync_loop: 啟動延遲 90s,每 1h sync
    - sync_once: HTTP GET {PROMETHEUS_URL}/api/v1/rules (type=alert)
    - UPSERT alert_rule_catalog (rule_name 為 UNIQUE)
    - 只在實際 INSERT/UPDATE 發生時才寫 aol (避免 N 條 rule 污染)
  + main.py lifespan asyncio.create_task() wire

預期解鎖:
  - alert_rule_catalog: 從 0 → Prometheus active rules 數 (~68 條)
  - automation_operation_log: 新增 'rule_created' / 'rule_updated' op_type
  - E3 Hermes AI 終於有 baseline 可以提案規則修正

Refs: ADR-090 §4.2 E3, MASTER §3.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:08:34 +08:00
OG T
c0f3509d39 fix(drift-card): Drift Diff HTTP 400 — item-by-item 累計長度避免切斷 HTML
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m0s
統帥回報 14:18 點 [查看 Diff] 收到 'Drift Diff 查詢失敗: HTTP error: 400'

真因 (telegram_gateway.py:2087 _send_drift_diff_detail):
  - report_id=7ffe78ae 有 48 items,單筆 git_value 最長 1794 字 (env array)
  - 累計 _full 遠超 4096,執行 _full[:3950] 截斷
  - 截斷可能切在 HTML tag 中間 (<code>... 或 &lt; entity 中間)
  - Telegram parse_mode='HTML' 拒絕不完整 HTML → 400

修復:
  - item-by-item 累計長度,單個 item 算 _block 長度+1
  - 預留 3800 上限 (4096 - 250 buffer 給 header + '… 還有 X 項' 提示)
  - 確保 _full 永遠是完整 HTML 結構

驗證: 下次 drift report 出現 + 統帥點 [查看 Diff] 應正常顯示 (本 session 的下個 cycle)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:26:29 +08:00
OG T
ddb902f1ff fix(ci+aiops): cd.yaml grep set-e bug + 新增 asset_scanner_job (ADR-090)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CI 修復 (b636d3b 第二次 fail 真因):
  cd.yaml line 161 ACT_NET=$(docker network ls | grep -E '^GITEA-ACTIONS-...')
  act runner 用 'bash -e -o pipefail',grep 無 match 時 exit 1 → 整 step 中斷
  (前一次 e7ba8cb fail 是 PG IP 不通,b636d3b 是 grep set-e bug — 兩個不同錯誤)

修復:
  ACT_NET=$(... | (grep -E '...' || echo "") | head -1)
  把 grep 包在 subshell 並 || echo "" 確保失敗時 ACT_NET 為空字串

新增 asset_scanner_job (ADR-090 § Phase 7 第 1 個 service):
  + apps/api/src/jobs/asset_scanner_job.py (~360 行)
    - run_asset_scanner_loop: 每 1h cron,首次延遲 60s
    - scan_once: 用 K8sProvider kubectl_get pods --all-namespaces
    - UPSERT asset_inventory (asset_key 為 UNIQUE,跨 run 沿用同 asset_id)
    - 為每個 active asset 寫 7 維 asset_coverage_snapshot (預設 unknown)
    - 寫 automation_operation_log(asset_discovered)
  + main.py lifespan asyncio.create_task() wire

預期解鎖:
  - asset_inventory: 從 0 → 數百 (全 namespace pods)
  - asset_discovery_run: 每小時 1 筆
  - asset_coverage_snapshot: 每筆 asset × 7 dim
  - automation_operation_log: 新增 'asset_discovered' op_type

下一階段 (rule_catalog / capacity / compliance scanner) 待 CI 通過後分批提交.

Refs: ADR-090 §4.1, MASTER §3.4 D4, project_blindspot_governance.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:15:45 +08:00
OG T
e7ba8cb181 fix(aiops): 打通 AI 自主學習鏈 — verifier 改 await + aol 動作回灌
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 7m29s
統帥 2026-04-19 全景審計發現:
  - automation_operation_log: 22 筆 (全部 drift_narrator),33 件/7d approval 動作 0 筆回灌
  - incident_evidence.verification_result: 1212 筆 100% NULL,verifier 從未寫入
  - 根因: _run_post_execution_verify 用 asyncio.create_task fire-and-forget,
          Pod recycle 時 task 被殺,verification_result 永遠寫不進去

修復 (打通 verifier→learning→Playbook EWMA→finetune 全鏈):

approval_execution.py:
  + _log_aol_started: 主流程開始時 INSERT aol(playbook_executed, pending)
  + _log_aol_completed: 4 個 return 點 UPDATE aol 為 success/failed + duration + stderr
    └ NO_ACTION / parse_fail / K8s 成功 / K8s 失敗 全部留痕
  ~ _run_post_execution_verify 兩處 (成功+失敗 path) 從 create_task 改 await + 60s timeout
  + 失敗時 stderr_feed_back 寫入 result.error → 解開 E6 stderr 回灌閉環

declarative_remediation.py:
  ~ _log_remediation_event task 加 named + add_done_callback,task 失敗時有 log
    (原 fire-and-forget 0 筆寫入,現在可診斷為何 task 死掉)

預期效果:
  - aol playbook_executed 即時可見 (33 件/7d 立刻有資料)
  - incident_evidence.verification_result 開始累積 → finetune_exporter 7d cron 終於有料
  - Playbook EWMA trust_score 開始動態變化
  - stderr_feed_back 接通 → 失敗訊號回灌 retry/Playbook 負向強化

不影響:
  - background_task 跑在背景,+60s 延遲不阻塞 API
  - aol 寫入失敗只 logger.warning,不阻塞執行主流程

Refs: MASTER §3.1 L6×D1 (ADR-081 PostExecutionVerifier),
      MASTER §3.4 D4 (ADR-083 學習閉環),
      ADR-090 監控盲區治理 (2026-04-18 全景審計)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:07:29 +08:00
OG T
2abc91e360 fix(drift-card): 修 drift 卡片 2 bug — AI 研判 copy 樣式 + Diff 按鈕 AttributeError
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m8s
Bug 1: 按「🔍 查看 Diff」失敗
  錯誤: 'DriftReportRepository' object has no attribute 'get_by_id'
  根因: DriftReportRepository 方法叫 get(), 其他 repo 都叫 get_by_id()
  修法: 加 get_by_id() alias, 對齊 repo 介面慣例

Bug 2: AI 研判內容被渲染成 code block + copy 按鈕
  根因: telegram_gateway line 1962 用 <pre> 包 diff_summary
       但 diff_summary 是 AI 研判敘述 + emoji 清單, 非 code
  修法: 移除 <pre>, 改以分隔線 + html.escape 純文字顯示

驗收:
- 下次 drift 卡片: AI 研判段落純文字(無紫色 code block + copy)
- 按「🔍 查看 Diff」→ 送完整 diff 詳情(非 AttributeError)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:27:13 +08:00
OG T
4b8be32610 fix(telegram+approval): TG-1 + AP-1/2/3 — 4 修 Telegram UX
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 25m27s
Ansible Lint / lint (push) Has been cancelled
2026-04-19 凌晨(台北時區)— ogt + Claude Opus 4.7 (1M)

## TG-1: INFO_ACTIONS 加 view
security_interceptor.py — 'view' 按鈕現在走 2-part 讀格式,
不再誤觸發 4-part nonce 寫格式。

## AP-1: approval_records.telegram_message_id 持久化
telegram_gateway.send_approval_card send 成功後,在 DB 層 UPDATE
approval_records SET telegram_message_id, telegram_chat_id
(不只 Redis, Pod 重啟仍可找回原卡片)。

## AP-2: approval 執行完成原卡片 edit + KM/Playbook 增量
approval_execution._push_execution_result_to_alert 除了 reply 原卡片,
還 editMessageReplyMarkup 移除按鈕(修「永遠執行中」卡片問題)。
  - 同步查 knowledge_entries/playbooks 2min 內增量,附加到訊息
    顯示 "📚 KM +N  🎯 Playbook 更新×M"
  - 成功:  執行成功 + action + KM 增量
  - 失敗:  執行失敗 + 原因 + KM 增量

## AP-3: primary_responsibility 正規化降「 未知」比例
openclaw._parse_analysis_result: 若 LLM 填空/None/不在白名單
(FE/BE/INFRA/DB/COLLAB),強制 fallback: kubectl 關鍵字有 → INFRA,
否則 BE。之前只檢查 "not in data" 但 None 或空字串會穿過。

## 跳過: TG-3 (refactor) + TG-5 (webhook 為棄用 endpoint,design 採 Long Polling)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:15:58 +08:00
OG T
68a42a3c97 fix(openclaw): 幻覺驗證雙路徑覆蓋 + 抽出共用 helper
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-19 凌晨(台北時區)— ogt + Claude Opus 4.7 (1M)

根因:
  commit 7e9448f 的 Python hallucination validator 只裝在
  `analyze_alert` (webhook path),但 incident sweeper 走
  `generate_incident_proposal` (line 1552) 沒裝驗證 → 00:23
  PostgreSQLDiskGrowthRate 卡片出現 "deployment/awoooi-prod"
  幻覺未攔截。

修:
1. 抽出 `_validate_deployment_inventory(result, inventory, ns)` 共用方法
2. `analyze_alert` (line 1322 area) 呼叫此 helper — 原行內邏輯消除
3. `generate_incident_proposal` (line 1552) 動態抓 inventory + 呼叫 helper
4. helper 補:
   - result.action_title = '[安全降級] 調查 {ns} 真實資源狀態'
     (之前只改 description,action_title 沒變 → DB action 欄位仍殘留舊文字)
   - 每個欄位賦值 try/except 保底,單欄失敗不影響其他

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:11:09 +08:00
OG T
fdce0a3ab9 fix(approval): NO_ACTION 不再誤標 EXECUTION_FAILED (MASTER §7.1 #11 修)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-19 凌晨(台北時區)— ogt + Claude Opus 4.7 (1M)

根因:
approval.action='NO_ACTION - 待分析' (幻覺 validator 降級產物) 丟進
parse_operation_from_action → operation_type=None → background_execution_skip
→ update_execution_status(success=False) → 標為 EXECUTION_FAILED。

污染 KPI:
  MASTER §7.1 #11 auto_execute 成功率 = EXECUTION_SUCCESS / (SUCCESS+FAILED)
  NO_ACTION 本來就不該計入失敗,但卻被算進去拖垮指標。
  實測 30d 成功率 0.9% 有很大比例是 NO_ACTION 誤標造成。

修復:
parse 失敗時先判斷是否 NO_ACTION 類 (action 含 NO_ACTION/OBSERVE/INVESTIGATE
等關鍵字) → 走專屬 noop 分支:
  - log event=background_execution_noop (info 級)
  - update_execution_status(success=True) → EXECUTION_SUCCESS
  - timeline 標  純觀察類動作完成
  - reply 原告警卡片顯示成功
  - return True

真正解析失敗 (非 NO_ACTION) 保留原失敗路徑,但補上 error_message
(P0.2 延伸),讓 rejection_reason 有 "Could not parse operation type from
action: <action>" 而非空字串。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:08:16 +08:00
OG T
2e988bdb81 fix(telegram): drift 執行結果貼回卡片 + audit log user_id
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
IDE 抓到 _stamp 未使用(結果沒送)+ user_id 未使用(audit 缺漏)。

修:
1. _edit_drift_card_outcome 不只移除按鈕,還 send 簽核戳訊息
   (reply_to 原卡片,若 msg_id 存在),格式:
      已採納 by @username (成功)
     Drift <report_id>
2. _handle_drift_action 加 drift_callback_dispatched log(audit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:07:13 +08:00
OG T
877c8479e0 fix(telegram): TG-2 + TG-4 修 drift 按鈕 black hole
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-19 凌晨(台北時區)— ogt + Claude Opus 4.7 (1M)

統帥截圖直擊: 按「查看 Diff」→ 變成「執行中」,且看不到還有 21 項。

全景盤點發現 9 個 Telegram 子系統 bug,本 commit 修 2 個最痛的:

## TG-2: drift_view/drift_adopt/drift_revert 3 按鈕**無 handler**
  點擊 → fallthrough → UX 黑洞 / 誤觸發 approve 路徑。

修復: handle_callback 在 state guard 後(line 2752 後)加 Step 1.85
  offroute: 3 個 drift_* action → _handle_drift_action 專職處理,
  不走 nonce approve/reject dispatch,避免誤觸發執行流。

3 個按鈕實作:
  - drift_view: 讀 drift_reports → 送新訊息展示全部 items
    (HIGH/MEDIUM/INFO emoji + Git vs K8s 原值對照,上限 50 項 4000 字)
  - drift_adopt: 呼叫 drift_adopt_service.adopt_drift()
  - drift_revert: 呼叫 drift_remediator.revert()

## TG-4: drift card message_id 沒存 Redis → edit 回不了卡片
修復: send_drift_card 成功後 setex f"tg_drift:{incident_id}" TTL 24h,
  供 _edit_drift_card_outcome 在 adopt/revert 執行後更新原卡片(先移除
  按鈕 + 加「XX by @username (成功/失敗)」簽核戳)。

## 未包含(follow-up):
  TG-1 INFO_ACTIONS 擴充(view)  — 下一 commit
  TG-3 handler 重複分派 — 評估中
  TG-5 Bot webhook URL 未設 — 需統帥決策公開 URL
  approval card NO_ACTION 誤標 FAILED — 下一 commit
  approval card description 矛盾 / responsibility 未知 / 執行後 edit

全景 9 bug 清單詳見 project_phase7_round3_telegram_subsystem_audit(待建)。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:06:30 +08:00
OG T
98aef55b31 feat(kpi): ADR-090-D MASTER §7.1 北極星 KPI 5 斷鏈全修
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 11m49s
run-migration / migrate (push) Failing after 15s
2026-04-18 晚(台北時區)— ogt + Claude Opus 4.7 (1M)

MASTER §7.1 15 個北極星 KPI 實測對標發現 5 個斷鏈:
  #3  fine-tune JSONL /week        — finetune_exports 表不存在
  #4  MCP 呼叫/24h                 — timeline_events 沒 mcp_call event_type
  #6  Declarative 修復使用率       — remediation_events 表不存在
  #7  general 兜底 17.3%           — classify_alert_early 漏 5 類
  #10 notification_outcomes /week  — 表不存在

本 commit 全修。

## 1. Migration: adr090d_kpi_data_sources.sql (3 張表)

- finetune_exports       — P3 Fine-tune JSONL 追蹤
- remediation_events     — P5 Declarative 修復追蹤
- notification_outcomes  — 通知品質 + RLHF 語料

Idempotent (CREATE TABLE IF NOT EXISTS), 已 apply 進 prod。

## 2. classify_alert_early 擴 4 類規則 (降 general 兜底)

- test 攔截: Test*/FPTest/FingerprintTest/ADR089*Test/L4Closure*/*FreshUniq*
  → category='test', TYPE-1 純通知
- High*CPU/Memory/Disk/Load → host_resource
- TLS*/SSL*/*ProbeFailure* → ssl_cert
- PostgreSQL*/MySQL*/MongoDB*/*DiskGrowthRate → database

預期 general 17.3% → 3-5% (達標 <10%)。

## 3. finetune_exporter DB 寫入

_run_export() 結尾寫 finetune_exports 一筆,含 checksum/size/record_count。

## 4. declarative_remediation DB 寫入

evaluate() 後 fire-and-forget _log_remediation_event() 寫 remediation_events
(status='pending', remediation_type 依 tier 自動判為 declarative/imperative/gitops_pr)。

## 5. telegram_gateway DB 寫入 (send_approval_card)

_send_request 成功返回 message_id 後寫 notification_outcomes 一筆,
channel='telegram', delivery_status='delivered|failed'。未來人類按鈕時
update user_action → RLHF 訓料黃金。

## 6. pre_decision_investigator MCP 呼叫追蹤

_call_single_tool() finally 寫 timeline_events event_type='mcp_call',
含 provider/tool/status/duration_ms/error。24h 內 MCP 呼叫可 SQL 量測。

## 預期量化改善

| KPI | 修前 | 修後 24h 後應見 |
|-----|------|----------------|
| #3 fine-tune /week | 0 (表不存在) | >=10 (每週 cron 跑) |
| #4 MCP 呼叫/24h | 0 | >0 (實測將寫 timeline) |
| #6 declarative 使用率 | 表不存在 | 有資料 (pending/success/failed 分佈) |
| #7 general 兜底 | 17.3% | <10% |
| #10 notification_outcomes | 0 | 每次 approval card 寫一筆 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:00:31 +08:00
OG T
898145d68e refactor(openclaw): SuggestedAction 改用頂部 import (避免 inline 三重巢狀)
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 11m17s
Ansible Lint / lint (push) Has been cancelled
IDE 對 inline "from src.models.ai import" 誤報(但運行正常)。
改為頂部 import 既滿足 IDE 也更 Pythonic。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:28:19 +08:00
OG T
e6e484c1dc fix(openclaw): import path 修正 — src.models.ai (非 openclaw_schema)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
IDE 正確抓到的 bug(非 false positive),SuggestedAction 在 src/models/ai.py。
_SA.NO_ACTION 現在能正確降級。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:45 +08:00
OG T
7e9448f6d0 fix(openclaw): 幻覺 deployment 名雙層防禦 — Prompt + Python validator
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 晚(台北時區)— ogt + Claude Opus 4.7 (1M)

生產事件 (approval f763bedf, 22:58):
- Alert: KubePodCrashLooping, labels.deployment="awoooi-api"
- NEMOTRON 雖收 inventory "awoooi-api, awoooi-web, awoooi-worker"
  仍輸出 kubectl_command="kubectl rollout restart deployment/awoooi-prod"
  (把 namespace 誤當 deployment 名)
- 執行結果: "Deployment 'awoooi-prod' not found in namespace 'awoooi-prod'"

## Layer 1: NEMOTRON_SYSTEM_PROMPT 強化 (prompts.py)
新增「🔒 DEPLOYMENT NAME RULE (STRICTLY ENFORCED)」區塊:
- namespace NEVER is a deployment name
- "awoooi-prod" 是 NAMESPACE,不可寫 deployment/awoooi-prod
- 若有 inventory,deployment 必須 exact match
- 優先用 labels.deployment,unknown → NO_ACTION

## Layer 2: Python 後驗證 (openclaw.py:1322+)
LLM 回應解析後 regex 抽出 deployment 名,對照 _k8s_inventory:
- 在清單內 → 通過
- 不在清單內 → 降級:
    * kubectl_command → "kubectl get deploy -n {ns}"(純調查)
    * suggested_action → NO_ACTION
    * target_resource → "unknown(hallucinated)"
    * confidence → 0.0
    * description 加註 [安全降級] 並列出合法 inventory
- log 'openclaw_deployment_hallucination_detected' 記錄

效果: 就算 LLM 無視 prompt,Python 層也會擋下。
破壞性 kubectl 絕不執行於不存在的 deployment。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:09 +08:00
OG T
6ad73b4834 fix(flywheel): 三修 L5/L6 斷鏈 — RBAC 擴權 + 失敗原因入庫 + verifier 失敗時也跑
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m6s
2026-04-18 晚(台北時區) — ogt + Claude Opus 4.7 (1M)

全景飛輪診斷暴露 3 個真斷鏈:
  - L5 執行 30d: EXECUTION_FAILED 216 / EXECUTION_SUCCESS 2 (失敗率 99%)
  - L6 驗證 7d: verification_result 全 NULL (988 筆 evidence 都沒驗)
  - 所有 rejection_reason / error_message 欄位全空(無法診斷)

根因: awoooi-executor ServiceAccount RBAC 不足,executor.py 每次
kubectl get nodes/HPA 都 Forbidden,連 evidence 都抓不到,後面 repair
全炸,verifier 因為 execution 沒 success 永遠不 trigger,evidence
驗證結果永遠 NULL。修一個 RBAC 解 3 個節點。

## P0.1 RBAC 擴權 (k8s/awoooi-prod/07-rbac.yaml)

新增 cluster-scope 讀權(僅 list/get/watch,零寫入):
  - nodes + nodes/status (evidence gathering 必需)
  - horizontalpodautoscalers (HPA 狀態)
  - metrics.k8s.io: nodes + pods (resource metrics)
  - statefulsets + daemonsets (完整 workload 視圖)

已 kubectl apply + 煙霧測試: kubectl get nodes 可跑。

## P0.2 失敗時必寫 rejection_reason (approval_db.py)

update_execution_status() 新增 error_message 參數,失敗時寫入
rejection_reason (截 2000 字) → 之後診斷有依據。

approval_execution.py 呼叫端同步更新,result.error 一路傳進 DB。

## P0.3 Verifier 失敗時也跑 (approval_execution.py)

原邏輯: verifier 只在 result.success=True 時呼叫 → 99% 失敗下
永遠不跑。

新邏輯: 失敗 path 也 create_task 跑 verifier,action_taken 後綴
加 ":FAILED" 標記。verifier 抓 post_state 寫
verification_result='failed' 回 incident_evidence。

L7 learning 從此有失敗樣本可學,playbook trust 負向 2x 衰減才
真正生效。

預期效果:
  - EXECUTION_FAILED 率 30d 內應從 99% 降到 <30%
  - incident_evidence.verification_result NULL 率應從 100% 降到 <10%
  - approval_records.rejection_reason 補齊率從 0% 到 100%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 20:12:57 +08:00
OG T
b0d560dbb3 fix(drift-narrator): shortener 用 replace — 包容 LLM 加 'Resource/Name:' 前綴幻覺
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m50s
2026-04-18 下午(台北時區)— ogt + Claude Opus 4.7

Round 4 LLM 自己在 field 前加資源識別符:
  'Deployment/awoooi-web: spec.template.spec.containers'
導致 startswith 模式 shortener 失效(前綴不在開頭)。

防禦式修法: startswith 不中 → 改用 replace 清除任何位置的前綴。
結果:
  'Deployment/awoooi-web: containers' 

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:12:15 +08:00
OG T
b63aed72df fix(drift-narrator): 砍 spec.template.spec. 前綴 — 修 Telegram 自動換行醜陋排版
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m1s
2026-04-18 下午(台北時區)— ogt + Claude Opus 4.7

統帥實彈三輪視覺回報: 字段名 'spec.template.spec.volumes' 共 24 字元,
加上 emoji+': '+summary 超過 Telegram <pre> 視覺寬度,自動換行
造成 emoji 與 field name 斷開、單獨成行的醜狀。

修復: _shorten_field_path() 砍 3 種常見前綴:
  - 'spec.template.spec.' → ''
  - 'spec.template.' → ''  (後備)
  - 'spec.' → ''  (後備)

效果對比:
  前: '🟡 spec.template.spec.affinity.podAntiAffinity.preferredDuringS: [清單 3 項]'
  後: '🟡 affinity.podAntiAffinity.preferredDuringS: [清單 3 項]'

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:10:20 +08:00
OG T
f3960f36d2 fix(drift-narrator): fallback 強化 — 標註 K8s 預設值補齊 + 可操作數獨立計算
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m37s
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)

統帥實彈測試回報: 卡片顯示「securityContext: (未設) → {物件 0 欄位}」毫無意義。

根因: _fallback_items 對「K8s controller 自動補齊空物件」的噪音
     誤當成真實變更輸出。且「還有 29 項」數字包含白名單 + trivial。

修復 3 項:

1. _is_trivial_drift() 新判定函數
   None/空字串/{}/[]/false/0 等互相視為「無實質變更」
   捕捉 K8s controller 自動補齊場景

2. _summarize_item() 替代原本 smart_shorten
   - trivial → "K8s 預設值補齊 (無實質變更)"
   - None → value → "新增 xxx"
   - value → None → "已刪除 (原: xxx)"
   - 其他 → "from → to"

3. _fallback_items() 改進
   - 按 level 排序 (HIGH 優先)
   - 白名單 + HPA allowlist 先過濾

4. _count_nontrivial_drift() + Telegram 呈現
   - 新增「可操作」計數 (去掉白名單 + trivial)
   - 「還有 N 項」用可操作數,不會誤導
   - items 為空時顯示「全為白名單或預設值補齊」

預期效果:
  之前: "... 還有 29 項" (其實只 1 個是真實 drift)
  現在: "... 還有 0 項" 或 "(全部為白名單或預設值補齊)"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:29:49 +08:00
OG T
1606093dd2 fix(drift-narrator): 兩個 hotfix — NEMOTRON wrapper 解析 + tags asyncpg 型別
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)

Live-fire test (report_id=80a34b58) 暴露兩個 bug:

## Bug 1: LLM JSON 被 NEMOTRON wrapper 吞掉
根因: openclaw.call() 經 NEMOTRON 路由時強制回 {description,...} 結構,
     我的 prompt 要 {narrative, items} 無法穿透。
     (同 1ff3405 早前碰過的 JSON 裸奔問題根源)

修復: 三路 fallback 解析
  - Path 1: 直接我們的 {narrative, items}(Ollama 或 LLM 守規矩)
  - Path 2: NEMOTRON wrapper,description 巢狀 JSON 含我們結構
  - Path 3: description 是純敘述 → 當 narrative + Python fallback_items

## Bug 2: tags 參數 asyncpg DataError
根因: 傳 '{drift,type4d,llm_summary}' 字面量字串,asyncpg 要求 Python list
      '(a sized iterable container expected (got type str))'

修復: tags 改傳 ['drift','type4d','llm_summary'] Python list,移除 CAST AS text[]
     asyncpg 自動推斷 text[]

Live-fire 結果驗證:
  - narrative  生成(fallback path)
  - items ⚠️ 只 1 筆(NEMOTRON 未吐我們結構)
  - DB write  tags 型別錯
  - Telegram  送出(雖 fallback 內容但視覺 OK)

本 commit 後預期:
  - LLM 回應走 Path 2/3 → narrative + Python fallback items(5 筆 smart summary)
  - DB write 成功 → automation_operation_log + ai_collaboration_trace 皆有記錄
  - 若 LLM 未來學會走 Path 1(給我們 {narrative, items}),自動升級

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:26:17 +08:00
OG T
a156566b17 feat(drift-narrator): ADR-090-C L4 稽核閉環 — notification_formatted op 入庫
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 10m47s
run-migration / migrate (push) Failing after 14s
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)

架構鐵律執行:
「沒有被記錄的 AI 決策,就等於沒有發生過。」
drift_narrator 每次呼叫 LLM 生成摘要,必須完整寫入
automation_operation_log + ai_collaboration_trace,形成 L4 稽核 + RLHF 語料。

本 commit 兩件事:

1. apps/api/migrations/adr090c_notification_formatted_op_type.sql
   - 擴充 automation_operation_log.operation_type CHECK 加 'notification_formatted'
   - DROP + ADD CONSTRAINT idempotent 模式
   - 已用 awoooi(表 owner)apply 進 prod 驗證通過

2. apps/api/src/services/drift_narrator_service.py
   - 新增 _log_ai_action_to_db() 負責 DB 稽核寫入
   - 在 _generate_narrative_and_items() 結尾(success / fallback 都寫)呼叫
   - automation_operation_log:
     * operation_type='notification_formatted'
     * actor='drift_narrator'
     * input = {report_id, namespace, counts, items_scanned}
     * output = {narrative, items, items_count}
     * duration_ms, tags=['drift','type4d','llm_summary']
     * parent_op_id 查詢 alert_fired 鏈路(未來 drift → alert 關聯)
   - ai_collaboration_trace:
     * agent='drift_narrator', model=provider (ollama / nemotron / 等)
     * prompt(限 8000 字)+ response(JSONB)
     * accepted = LLM JSON 解析成功 flag(未來 RLHF 訓料金礦)
   - 錯誤處理: DB 寫入 try/except 包住,永不破壞 Telegram 通知主流程

P2.4 事件關聯:
  - SELECT parent op via input->>'report_id' 或 'drift_report_id'
  - 若找到則綁定 parent_op_id(形成 alert_fired → notification_formatted 追溯鏈)
  - 目前 drift 本身不經 alert_fired,parent 為 NULL(等未來鏈路接通)

P2.5 RLHF 語料:
  - ai_collaboration_trace.accepted=true 的紀錄即為「LLM 解析成功」樣本
  - 未來統帥按 Telegram [ 採納變更] / [ 回滾] 時,對應 trace 也可更新
    outcome flag,形成完整 Human-in-the-loop 語料

技術細節:
  - get_db_context() auto-commit(src/db/base.py:128),無需手動 commit
  - prompt 最長 8000 字(一般 drift 約 2-3k)
  - raw_response 保留前 500 字在 trace.response JSON 中

相關:
  - feedback_ai_autonomous_direction.md L4 北極星
  - feedback_secrets_leak_incidents_2026-04-18.md L1-L4 分層
  - ADR-090 11 張神經網路表
  - commit fb88512(B 方案視覺層)

IDE 可能顯示 src.db.base 找不到 —— 那是誤報(drift_repository.py 用同一條路徑)。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:04:23 +08:00
OG T
fb88512fcb fix(drift-narrator): B 方案 LLM 驅動智能摘要 — 徹底消滅 str()[:30] 暴力截斷
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)

根因:
_format_drift_summary() 對 dict/list 型別的 git_value/actual_value
直接呼叫 str()[:30] 暴力截斷,產生像 "[{'name': 'repair-ssh-key', 's"
這種亂碼掉半個 dict key 的亂七八糟輸出,徹底違背「AI 自主化」原則。

B 方案架構決策:
「捨棄 Python 寫死的字串解析邏輯。將原始 Config Diff 結構直接作為
Context,餵給 Hermes/NemoTron,利用 prompt 規定輸出格式,讓 LLM 自己
消化並輸出包含紅黃燈標示的 Top 5 人類易讀摘要。」

實作:
1. _NARRATIVE_PROMPT 重寫 — 要求 LLM 回傳 {narrative, items[]} JSON
   - drift items 以 JSON serialize 餵進 prompt(保留 200 字 context)
   - items 限 5 筆,HIGH 優先
   - summary 30 字繁中口語(非技術 repr)
2. _generate_narrative_and_items() 新方法 — 解析 LLM JSON 並驗證結構
3. _format_drift_for_llm() 新方法 — 結構化 JSON 給 LLM(取代舊 str 版)
4. _render_telegram_body() 新方法 — 組裝乾淨的 Telegram 卡片
   範例輸出:
     🤖 AI 研判
     <LLM 4-5 行敘述>

     📊 漂移明細 (HIGH: 1 | MEDIUM: 29)
     🔴 spec.template.spec.volumes: 新增 2 項 repair-ssh-key 掛載
     🟡 spec.template.spec.serviceAccount: (未設) → awoooi-executor
     ... 還有 27 項 (按 🔍 查看 Diff)

5. Fallback 強化 — _smart_shorten() + _fallback_items()
   LLM 失敗時用型別感知的 Python 摘要(dict/list 顯示大小,不暴力 repr)

移除:
- _format_drift_summary() — 舊的暴力截斷實作
- _generate_narrative() — 只回 string 的舊介面

保留:
- _fallback_narrative() / _format_intent_summary() — 仍有用
- Redis 快取 / trigger 條件 / DB update — 邏輯不變

MVP 階段:
本 commit 只改視覺呈現,沒動 automation_operation_log / ai_collaboration_trace
稽核寫入。等 Telegram 視覺驗證 OK 後再做 Phase 2 加入 DB 稽核。

相關:
  - feedback_ai_autonomous_direction.md 北極星原則
  - 1ff3405 今早的 JSON 裸奔 hotfix(只修了 narrative,沒修 items)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:54:16 +08:00
OG T
2d43751729 feat(ops): ADR-090-B 零信任收尾範本 — wrapper / sudoers / migrator / CI
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 12m17s
run-migration / migrate (push) Failing after 14s
2026-04-18 台北時區 —— ogt + Claude Opus 4.7 (1M)

本 commit 響應本 Session 兩次憑證外洩事故
(feedback_secrets_leak_incidents_2026-04-18.md),
交付統帥可直接部署的零信任基礎設施範本.

檔案清單:

1. scripts/host-ops/awoooi-hosts-add.sh
   - 110 主機 /etc/hosts 白名單 wrapper
   - 只允許預定義主機名,idempotent,帶 IP 格式驗證
   - 安裝: /usr/local/bin/awoooi-hosts-add (root:root 0755)

2. scripts/host-ops/awoooi-wrapper.sudoers
   - 配套 sudoers 規則 (NOPASSWD for wrapper + SIGHUP only)
   - 安裝: /etc/sudoers.d/awoooi-wrapper (root:root 0440)
   - 禁 tee / bash / sh 這類 generic shell access

3. apps/api/migrations/adr090b_awoooi_migrator_role.sql
   - PG 限權角色 awoooi_migrator
   - 只能 DDL (CREATE/ALTER/DROP/INDEX/COMMENT)
   - 明確 REVOKE 所有 DML + default privileges 鎖死
   - 本檔由統帥執行 (需 superuser),不由 Claude 執行

4. k8s/awoooi-prod/awoooi-migrator-secret.template.yaml
   - K8s Secret patch 範本
   - 新增 MIGRATION_DATABASE_URL key (awoooi_migrator 連線串)
   - 與應用 DATABASE_URL 拆開

5. .gitea/workflows/run-migration.yml
   - CI 自動套用新 migration (單 transaction + ON_ERROR_STOP)
   - 用 Gitea secret MIGRATION_DATABASE_URL,不走明碼
   - 每次成功寫一筆 asset_discovery_run (audit trail)

零信任三層防線 (對應 feedback_secrets_leak_incidents):
  L1 對話無密碼 -> wrapper 內建白名單
  L2 操作經 wrapper -> sudoers + awoooi_migrator
  L3 顯示強制遮蔽 -> CI 走 secret,不走 env

本 Session 發現的 3 次憑證外洩全部在 feedback_secrets_leak
memory 登記,並有對應 P0 輪替計畫.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:39 +08:00
OG T
5ae82d1d1f feat(db): ADR-090 L4 AIOps 地基 — 資產盤點 × 7 項自動化覆蓋矩陣永久化 DB
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)

MoWoooWorkDown 假警報 RCA 暴露三重結構性失守:
- 110/188 主機 load 18/16 × 13 天 / cadvisor 288% / K3s 120/121 無監控
- Prometheus 僅 35 targets / 58 rules(覆蓋不到三成)
- HostHighCpuLoad 量錯維度(CPU idle vs load_avg)

統帥戰略指令:
- 全景資產 × 七大自動化 × 永久化 DB
- AI 四分工(OpenClaw × NemoTron × Hermes × Claude LLM)
- 所有自動化操作歷程必進 DB,不靠 MD(MD 會漂移)

本 commit 交付:

1. SQL migration (apps/api/migrations/adr090_asset_inventory_foundation.sql)
   - 11 張表 + 33 indexes + 20 CHECK + 3 UNIQUE + 16 FK
   - pgcrypto extension dependency
   - 完整 idempotent(CREATE IF NOT EXISTS + single transaction)
   - 已 apply 進 awoooi_prod(188 PG),驗證通過

2. ADR-090 (docs/adr/ADR-090-monitoring-blindspot-governance.md)
   - 決策紀錄 + 7 引擎對映 + 4 替代方案否決

3. 主戰略文件 (docs/superpowers/specs/2026-04-18-blindspot-governance-capacity-l4.md)
   - §0-§14: 背景 / 根因 / Schema DDL / 4 層防禦 / 7 Phase 實施 /
     HARD_RULES / AI 分工矩陣 / 驗收指標 / 技術債 / 回滾 / 接手協議

4. MASTER §8 Living Changelog 追加 Phase 7 啟動條目

11 張表:
  asset_inventory / asset_discovery_run / asset_coverage_snapshot /
  asset_relationship / alert_rule_catalog / asset_change_event /
  asset_compliance_snapshot / host_capacity_snapshot /
  capacity_violation_event / automation_operation_log /
  ai_collaboration_trace

首筆 bootstrap 記錄已 seed 進 asset_discovery_run
(run_id=6760c5bf-57e5-4a40-b82d-31b794464652)

相關 Memory (未 commit,存於 ~/.claude/...):
  - project_blindspot_governance.md (跨 session 指針)
  - feedback_monitor_self_monitoring.md (監控工具必須被監控)
  - feedback_secrets_leak_incidents_2026-04-18.md (憑證外洩三防線)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:46 +08:00
OG T
1ff3405755 fix(drift-narrator): 修復 JSON 裸奔 — 從 NEMOTRON 回傳解析 description 欄位
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m44s
根因:openclaw.call() 經 NEMOTRON 路由後強制輸出 JSON(NEMOTRON_SYSTEM_PROMPT 鐵律)
      但 _generate_narrative 期待純文字 → JSON 整包吐到 Telegram <pre> 區塊裸奔

修復:收到 text 後先嘗試 JSON 解析
      - 成功 → 按優先順序取 description / action_title / reasoning
      - 失敗(非 JSON)→ 原文使用(向下相容 Ollama qwen 純文字回傳)

效果:Telegram Config Drift 卡片顯示繁中人話摘要,不再吐原始 JSON

2026-04-17 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 01:08:32 +08:00
OG T
4f2e122fd2 fix(openclaw): Checkpoint-2 webhook path K8s inventory injection — 防止 NemoTron 幻覺 awoooi-service
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m39s
根因:NemoTron 在 webhook path(analyze_alert)無叢集上下文
→ 盲猜 deployment/awoooi-service → kubectl not found → EXECUTION_FAILED → trust score 0 永遠

修復:
- analyze_alert() Step 0.5: 呼叫 _fetch_k8s_inventory_for_openclaw() 拉取真實 Deployment 清單
- 注入「🔒 叢集實際資源清單」section 到 full_prompt,強制 LLM 從清單選擇資源名
- 失敗/超時 → 返回空字串 → 注入警示提示,主流程不中斷
- available_len 計算納入 k8s_section 長度防止 4K 截斷

影響:
- Solver Agent path (solver_agent.py) 已在 cf50a5c 修復
- 本 commit 修復 Alertmanager webhook path(analyze_alert → NemoTron)
- 兩條路徑均有 K8s 環境感知,LLM 不再幻覺資源名

ADR-082: Phase 2 多 Agent 協作
2026-04-17 ogt + Claude Sonnet 4.6(Checkpoint-2 webhook path completion)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 00:53:27 +08:00
OG T
cf50a5ce25 fix(solver+execution): Checkpoint-1 假成功修復 + Checkpoint-2 K8s 環境感知
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m55s
## Checkpoint-1: 假成功根治
- approval_execution.py: execute_approved_action 改返回 bool
  (原返回 None,呼叫端無法判斷 K8s 是否接受指令)
- decision_manager.py auto-execute 路徑: 用 _exec_success 取代硬編 success=True
  修復: K8s 拒絕指令時正確發  而非  自動修復完成

## Checkpoint-2: K8s 環境感知 (Inventory Pre-flight)
- solver_agent.py: 新增 _fetch_k8s_inventory() — 生成 kubectl 指令前先拉
  kubectl get deployments,statefulsets -n awoooi-prod,將真實名稱清單
  注入 Solver prompt,LLM 必須從清單選擇,防止幻覺(awooiii-api 三個 i)
- 超時 5s 或失敗 → 返回 "",prompt 顯示警示但不中斷主流程

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 23:08:23 +08:00
OG T
cbb719b4a1 fix(decision_manager): ADR-091 hotfix — 修復 d5dbfc9 喪屍閘門邏輯漏洞
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m9s
d5dbfc9 引入的閘門條件 `not action.strip()` 在 action="待分析" 時
判斷為 False(非空字串),導致閘門失效,喪屍卡片仍然突圍廣播。

根本原因:c759b4e P1 修復讓 suggested_action fallback 為 "待分析"
而非 "",使原本的 empty-string 檢查形同虛設。

修復:改用集合判斷 `_action_text in {"", "待分析", "NO_ACTION", "待分析 - 系統自動保護"}`,
涵蓋所有已知失敗狀態 token,完全封堵喪屍卡片廣播路徑。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 22:44:53 +08:00