Your Name
86ee013cdf
feat(hermes-complete): three Hermes NL hardening items + ConsensusEngine + ADR wrap-up
CD Pipeline / build-and-deploy (push) Successful in 9m32s
## Hermes NL hardening (nl_gateway.py)
- T1: write to the hermes_dispatch_log DB table (non-blocking, via asyncio.create_task)
- T2: Redis rate limiting: 20 req/min per chat_id, fail-open
- T3: multi-turn sessions: hermes:session:{chat_id}:{user_id}, TTL=300s, last 3 turns
## ConsensusEngine (ADR-095 declarative design)
- consensus_engine.py: CONSENSUS_WEIGHTS class attribute
security=0.4 is locked; the 9 Claude Code agents share the remaining 0.6
- config.py: ENABLE_12AGENT_CONSENSUS=False feature flag
## ADR status
- ADR-093/094/095: Proposed → 🟡 approved, implementation in progress
- Added a v1.1 changelog to each ADR
## K8s ConfigMap
- prod 04-configmap.yaml: added 3 feature flags (all false)
- dev 02-configmap.yaml: added in sync
## LOGBOOK
- Recorded completion of WS0–WS6 plus the hardening items, and guidance for enabling the feature flags
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:22:40 +08:00
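The T2 item in commit 86ee013cdf (20 req/min per chat_id, fail-open) can be illustrated with a minimal sketch. It assumes a fixed-window counter; the commit does not say which algorithm nl_gateway.py uses. The key name `hermes:ratelimit:...`, the `allow_request` function, and the in-memory stand-ins for a Redis client are all hypothetical.

```python
import time


class InMemoryCounter:
    """Hypothetical stand-in for a Redis client: only incr() and expire()."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        pass  # TTL handling is omitted in this stand-in


class BrokenCounter:
    """Simulates an unreachable store so the fail-open path can be exercised."""

    def incr(self, key):
        raise ConnectionError("store unavailable")

    def expire(self, key, seconds):
        raise ConnectionError("store unavailable")


def allow_request(store, chat_id, limit=20, window_s=60, now=None):
    """Fixed-window limiter: allow up to `limit` requests per window per chat."""
    now = time.time() if now is None else now
    window = int(now // window_s)
    key = f"hermes:ratelimit:{chat_id}:{window}"  # hypothetical key name
    try:
        count = store.incr(key)
        if count == 1:
            store.expire(key, window_s)  # let the counter expire with its window
    except Exception:
        return True  # fail-open: a store outage must not block the chat
    return count <= limit
```

Failing open trades strictness for availability: when the store is down, requests pass unthrottled instead of being rejected.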
Your Name
2572ec46d2
feat(ws4): Hermes NL natural-language interface: 12-agent Claude SDK integration (ADR-094/095)
## hermes/ package (5 new modules)
### display_names.py
- Visual identity table for the 12 agents (emoji + hashtag + handle + short_name)
- format_response_header() builds the Telegram prefix
### agent_loader.py
- Parses .claude/agents/*.md frontmatter into a system prompt
- lru_cache avoids re-reading files
### safety_hooks.py
- Ports the 20 HARD BLOCK rules from awoooi-guard.js (DENY_PATTERNS)
- 5 MUTATE_PATTERNS that must go through the approval flow
### nl_gateway.py
- Layer 1: keyword regex routing (12 rules, <10ms)
- Layer 3: DEFAULT_AGENT = "debugger"
- Claude Agent SDK query() async streaming; takes ResultMessage.result
- Safe degradation: SDK error → friendly error message
### telegram_webhook.py
- WS4 Hermes NL hookup (triggered by an @tsenyangbot mention or a direct message)
- HERMES_NL_ENABLED=False (protected by a feature flag, off by default)
## telegram_gateway.py
- send_hermes_reply(text, chat_id, reply_to_message_id)
no 500-character truncation, so long agent replies are supported
## config.py
- HERMES_NL_ENABLED: bool = False
- TELEGRAM_BOT_USERNAME: str = "tsenyangbot"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:10:06 +08:00
Your Name
294e0e3387
feat(ws3): ADR-093 callback user-ID binding + ADR-094 webhook entry point
## T3.1/T3.2 Bound user check (security_interceptor.py)
- verify_callback() Step 0: check Redis cb_bind:{nonce}
→ if a binding exists and caller != bound_user_id → UserNotWhitelistedError
→ if the key does not exist (old format) → fall back to the whitelist (backward compatible)
→ if Redis is unavailable → degrade and continue (safe degradation)
- bind_callback_user(nonce, user_id): async method, TTL=48h
## T3.3 Telegram webhook entry point (ADR-094)
- apps/api/src/api/v1/telegram_webhook.py (new file)
POST /api/v1/telegram/webhook
- Validates the X-Telegram-Bot-Api-Secret-Token header
- TELEGRAM_WEBHOOK_SECRET="" → validation skipped in dev (does not break existing tests)
- Placeholder reserved for the WS4 Hermes NL hookup
## T3.4 config.py
- Added the TELEGRAM_WEBHOOK_SECRET field (defaults to an empty string)
## main.py
- Mounted telegram_webhook_v1.router under /api/v1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:10:06 +08:00
Your Name
6d5fd3c124
feat(ws2): ADR-093 routing unification: BIGINT + NotificationMatrix + feature flag
## Fixes
### T2.1 BigInteger overflow fix
- `db/models.py`: telegram_chat_id Integer → BigInteger
(the original int32 could not hold the group ID -1003711974679)
### T2.2 Remove the CAST workaround
- `approval_db.py:739`: removed CAST(:telegram_chat_id AS BIGINT)
the ORM now uses BigInteger correctly, so the workaround can be retired
### T2.3 Redis key consistency fix
- `heartbeat_report_service.py:575`: telegram:polling_leader → telegram:polling:leader
(telegram_gateway.py uses colon separators; the underscore in heartbeat was a bug)
## Additions
### T2.4 notification_matrix.py
- `services/notification_matrix.py`: ADR-093 routing matrix
- Destination(DM/GROUP/BOTH) + RoutingRule dataclass
- NOTIFICATION_ROUTING dict (complete mapping for TYPE-1 through TYPE-8M)
- resolve_chat_ids(type, dm, group, *, tg_group_cutover=False): gradual-cutover API
### T2.5 telegram_gateway.py feature flag protection
- line 43: added the notification_matrix import
- line 1827-1834: with TG_GROUP_CUTOVER=False the old behavior is kept
with TG_GROUP_CUTOVER=True the _interactive_types blacklist is lifted and the matrix takes control
### T2.6 Migration SQL
- `migrations/adr093_notification_routing.sql`:
- CREATE TABLE approval_records (telegram_chat_id BIGINT)
- CREATE ROLE awoooi_migrator (IF NOT EXISTS)
- Includes a guarded ALTER COLUMN int→bigint for older environments
## Test sync
- `tests/integration/setup_test_schema.sql`: telegram_chat_id BIGINT
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:10:06 +08:00
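The T2.1 overflow in commit 6d5fd3c124 comes down to integer ranges. The short check below shows why the group ID quoted in the commit needs a 64-bit column; the helper name `fits` is hypothetical and exists only for this illustration.

```python
# Signed 32-bit and 64-bit integer ranges.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def fits(value, lo, hi):
    """Return True when value lies in the inclusive range [lo, hi]."""
    return lo <= value <= hi


group_chat_id = -1003711974679  # group ID quoted in the commit message

fits_int32 = fits(group_chat_id, INT32_MIN, INT32_MAX)
fits_int64 = fits(group_chat_id, INT64_MIN, INT64_MAX)
```

Here `fits_int32` is False (the ID is below -2147483648) and `fits_int64` is True, which matches the Integer → BigInteger change.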
Your Name
55f111e0e3
fix(aiops): correct host alert fallback and resolved stamp
CD Pipeline / build-and-deploy (push) Successful in 8m54s
2026-04-25 00:14:07 +08:00
Your Name
0d81b28b1b
fix(aiops): bound phase2 timeout and repair incident links
E2E Health Check / e2e-health (push) Successful in 52s
CD Pipeline / build-and-deploy (push) Successful in 9m24s
2026-04-24 23:53:56 +08:00
Your Name
e75e4678a9
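feat(p2.4): Telegram interim-state push: "analyzing" placeholder card, auto-deleted on completion
CD Pipeline / build-and-deploy (push) Has been cancelled
P2.4 implementation 2026-04-24 ogt + Claude Sonnet 4.6
Problem: LLM analysis takes 10-30s, during which Telegram shows no response and users cannot tell the system is working
Fix:
- telegram_gateway.py: added send_analyzing_placeholder(), which sends an "AI is analyzing..." placeholder card
- telegram_gateway.py: added delete_message() to delete the placeholder card
- webhooks.py: sends the placeholder card within 3s, before LLM analysis starts (a timeout does not block the main flow)
- webhooks.py: _push_to_telegram_background receives placeholder_message_id → deletes the placeholder after the full card is sent
- webhooks.py: import asyncio (was missing)
Result: users see an "analyzing..." message within 3s of the alert arriving; it is cleared automatically once the full card appears
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>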
2026-04-24 15:56:26 +08:00
Your Name
bb5f16f8ef
fix(aiops-p2): P2.1 three LLM quality fixes: Evidence-First + consensus confidence + raw_evidence injection
CD Pipeline / build-and-deploy (push) Has been cancelled
Root causes:
- consensus_engine: all four ExpertAgents had confidence=0.0 → weighted vote total=0 → always returned NO_ACTION
- prompts.py had no Evidence-First instruction → the LLM reasoned from memory, unconstrained by the real environment
- openclaw.py analyze_alert built the prompt without injecting MCP evidence (diagnosis_context)
Fixes:
- consensus_engine: SRE/Security/Cost/Performance set confidence to 0.45~0.80 based on signal strength
- consensus_engine: _normalize_action adds the alias 「重新啟動」 ("restart") → RESTART
- consensus_engine: SecurityAgent drops the unused _target variable
- prompts.py: added Evidence-First Protocol + Skepticism Rules sections
- openclaw.py: analyze_alert extracts diagnosis_context → injected into full_prompt as <raw_evidence>
Verification: consensus score went from 0.0 → 0.744 (CrashLoop test case)
P2.1 fix 2026-04-24 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:52:25 +08:00
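The first root cause in commit bb5f16f8ef (every agent reporting confidence=0.0, so the weighted total is 0 and the result is always NO_ACTION) can be reproduced with a small sketch. This is not the actual consensus_engine code: the function name `weighted_vote`, the equal weights, and the score definition are assumptions made for illustration.

```python
def weighted_vote(votes, weights):
    """Pick the action with the largest sum of weight * confidence.

    votes:   {agent_name: (action, confidence)}
    weights: {agent_name: weight}
    Returns (action, score), where score is the winner's share of the total.
    """
    totals = {}
    for agent, (action, confidence) in votes.items():
        totals[action] = totals.get(action, 0.0) + weights[agent] * confidence
    total = sum(totals.values())
    if total == 0:
        # Every vote carried zero confidence: there is nothing to rank.
        return "NO_ACTION", 0.0
    best = max(totals, key=totals.get)
    return best, totals[best] / total


# Hypothetical equal weights for the four expert agents.
weights = {"sre": 0.25, "security": 0.25, "cost": 0.25, "performance": 0.25}

# Before the fix: all confidences are 0.0, so the vote degenerates.
zero_votes = {name: ("RESTART", 0.0) for name in weights}
before = weighted_vote(zero_votes, weights)

# After the fix: confidences follow signal strength (values invented here).
votes = {
    "sre": ("RESTART", 0.80),
    "security": ("RESTART", 0.60),
    "cost": ("NO_ACTION", 0.45),
    "performance": ("RESTART", 0.70),
}
after = weighted_vote(votes, weights)
```

With zero confidences the function returns ("NO_ACTION", 0.0) regardless of what the agents proposed, which is the failure mode the commit describes.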
Your Name
04ff22563e
fix(aiops-p1): Playbook learning loop, all 5 breakpoints fixed + DB migration (ADR-092 B4)
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s
[P0.4 patch] pre_decision_investigator: missing Prometheus query field
- _build_tool_params() now adds the "query" field (required by the prometheus_query tool)
- Added _build_prometheus_query(), which generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- After the fix the D3_METRICS sensing dimension actually gets data (previously 100% returned missing_query_parameter)
[P1 Playbook learning loop, B1-B5 all fixed]
- B2 db/models.py: ApprovalRecord gains a matched_playbook_id column + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent gains an incident_id column (for MCP auditing) + index
- B3 approval_db.py: record→ApprovalRequest restores incident_id + matched_playbook_id
- B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync)
- B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id → so the DB can store the value
[P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s)
- Root cause: the asyncio.create_task task was killed on Pod recycle, so KM writes were silently lost
- Fix: await asyncio.wait_for(..., timeout=30.0) + TimeoutError log
[Migration file] adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- Run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql
[Incidental agent changes]
- decision_manager: Phase 2 YAML NO_ACTION priority gate (host-level/external services skip Agent Debate)
- alert_rules.yaml: new rules for Sentry/ClickHouse + HostDiskUsageHigh/Critical
- solver_agent: action_title semantic-synthesis fallback (replaces silent dropping)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:41:35 +08:00
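The P1.5 change in commit 04ff22563e replaces a fire-and-forget task with a bounded await. A minimal sketch of that pattern follows; the helper name `write_with_deadline` and the two demo coroutines are hypothetical, and only `asyncio.wait_for` with a timeout is taken from the commit message.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def write_with_deadline(write_fn, timeout_s=30.0):
    """Await write_fn() but give up after timeout_s seconds.

    Unlike asyncio.create_task(write_fn()), the caller does not return
    until the write has finished or timed out, so the work cannot be
    silently dropped while the caller has already moved on.
    """
    try:
        await asyncio.wait_for(write_fn(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        logger.warning("write timed out after %.1fs", timeout_s)
        return False


async def fast_write():
    await asyncio.sleep(0)  # stands in for a quick KM write


async def slow_write():
    await asyncio.sleep(1)  # stands in for a write that hangs


fast_ok = asyncio.run(write_with_deadline(fast_write, timeout_s=0.5))
slow_ok = asyncio.run(write_with_deadline(slow_write, timeout_s=0.01))
```

The cost is added latency on the calling path, bounded by the timeout; the benefit is that a lost write becomes a logged timeout instead of a silent gap.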
Your Name
7f4088bcd0
fix(aiops-p0): full P0 fix for the six root problems (ADR-092 B4)
[P0.1] knowledge_extractor_service.py:210: AttributeError fix
- The Signal.description field does not exist (100% failure; root cause of the KM +5/day symptom)
- Now concatenates alert_name + annotations.summary as the text
[P0.2+P0.3] Gates 9+11: relaxed for read-only commands
- blast_radius_calculator: kubectl get/top/describe/logs/version → score=1 (not 50)
- operation_parser: recognizes an INVESTIGATE type (read-only kubectl no longer returns None)
- executor.py: OperationType gains an INVESTIGATE enum member
- approval_execution.py: the INVESTIGATE path calls execute_kubectl_command directly
[P0.4] MCP SSH/K8s provider fixes
- decision_manager: params= → parameters= (matches the MCPToolProvider.execute signature)
- decision_manager: MCPToolResult .get() → .success/.output (dataclass usage)
- decision_manager + ssh_provider: added hosts 120/121 (missing from the original default)
- auto_approve: the phase2_agent_debate source bypasses the confidence threshold
[P0.5] Alert rule semantic contradiction fix
- alert_rules.yaml: 8 kubectl query rules changed from RESTART_DEPLOYMENT → NO_ACTION
(CrashLoopBackOff/PostgreSQL connections/slow queries/MinIO disk/K3s nodes/alert pipeline/SSL/CoreDNS, etc.)
- incident_service.py: cAdvisor/CoreDNS split out of general into their own categories
[P0.6] proactive_inspector dynamic-baseline PromQL, all fixed
- All 5 MONITORED_METRICS PromQL queries corrected (cadvisor label/datname/blackbox)
- db_connection_pool: datname="awoooi" → "awoooi_prod"
- http_error_rate: invalid http_requests_total → blackbox probe_success
- cpu/memory: namespace label → name=~"k8s_api_awoooi-api.*"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:32:23 +08:00
Your Name
45dbe07188
fix(flywheel): fixes for six automation flywheel capabilities (ADR-092 B3)
run-migration / migrate (push) Failing after 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s
Type Sync Check / check-type-sync (push) Successful in 2m54s
CD Pipeline / build-and-deploy (push) Has been cancelled
Ansible Lint / lint (push) Has been cancelled
[Root-cause chain fix]
MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context
→ LLM times out → description="待分析" ("pending analysis") → blocked by the ADR-091 hard gate → tg_sent not set
→ W-2 Watchdog falsely reports a "silent failure"
[Six fixes]
1. Three MCP Provider bug fixes
- ssh_provider: asyncssh.run() → conn.run()
- prometheus_provider: KeyError 'query' → tolerant .get()
- k8s_provider: empty pod_name → early return of an error dict
2. Agent Debate / decision quality
- decision_manager: timeout fallback text changed to an explicit description (gets past the ADR-091 hard gate)
- intent_classifier: on LLM timeout, falls back to keyword classification (not None)
3. Watchdog false-positive fix (ADR-092 B3)
- W-2: tg_sent Redis TTL → telegram_message_id IS NULL (DB ground truth)
- W-5 added: suggested_action IN empty/待分析/NO_ACTION + tg_id IS NULL
- approval_timeout_resolver: 60min → 15min, batch 50 → 200
4. Config drift automation
- drift_adopt_service: auto_adopt_if_safe() six-condition safety gate
- drift.py: the background task tries auto-adoption first, then sends the manual Telegram card
5. Playbook flywheel stability
- playbook_seed_service: fixed idempotency (deprecated is not treated as missing)
- playbook_evolver: loads only DRAFT+APPROVED (not all 294 records)
6. Observability
- alert_rule_engine: auto_rule structured logs + Redis counters (pipeline)
- auto_approve: Redis counters for reject reasons
- heartbeat_report_service: added an "⚙️ Automation stats (today)" section
[Pending manual execution]
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:55:50 +08:00
Your Name
9244c5e845
feat(heartbeat): system report gains 5 dynamic sections
CD Pipeline / build-and-deploy (push) Successful in 13m50s
Added alert pipeline (24h), DB/Redis status, K8s Pods, Scanner status, and Telegram Bot sections
Each section is probed in parallel via asyncio.gather(return_exceptions=True), so one failure does not affect the others
Added AlertPipelineStats/DbRedisStats/PodInfo/ScannerStats/TelegramBotStats dataclasses
_build_warnings() now checks for DB/Redis anomalies, PENDING>10, and Pods that are not ready or have high restart counts
report_to_telegram_html() outputs the 5 corresponding new HTML sections
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 09:29:16 +08:00
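Commit 9244c5e845 relies on asyncio.gather(return_exceptions=True) so that one failing probe does not take down the whole report. The sketch below shows the pattern in isolation; the probe functions and `collect_sections` are hypothetical names, not the heartbeat service's real API.

```python
import asyncio


async def probe_db():
    return {"status": "ok"}  # stands in for a successful probe


async def probe_redis():
    raise RuntimeError("redis down")  # stands in for a failing probe


async def collect_sections(probes):
    """Run all probes concurrently; a failed probe yields None for its section."""
    names = list(probes)
    results = await asyncio.gather(
        *(probes[name]() for name in names),
        return_exceptions=True,  # exceptions come back as values, not raised
    )
    report = {}
    for name, result in zip(names, results):
        report[name] = None if isinstance(result, BaseException) else result
    return report


report = asyncio.run(collect_sections({"db": probe_db, "redis": probe_redis}))
```

gather preserves input order, so zipping the results with the probe names is safe; the Redis failure shows up as a None section while the DB section is still populated.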
Your Name
88af639651
fix(report): fix approval_records.status case mismatch
CD Pipeline / build-and-deploy (push) Successful in 9m46s
The DB stores the enum name via SQLEnum (EXECUTION_FAILED, uppercase),
not the enum value (execution_failed, lowercase).
The SQL now uses UPPER(status::text) so it matches regardless of case.
Verification: live DB query returns success=0, failed=2 (previously always 0/0)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 09:10:39 +08:00
Your Name
6810ab359d
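fix(report): two root-cause fixes: duplicate daily reports + auto-repair rate stuck at 0%
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem 1: the daily inspection report was sent multiple times (each Pod ran its own daily job)
- Root cause: run_daily_report_loop was not wired to the leader lock
the other scanners (capacity/hermes/compliance) all call
try_acquire_daily_lock; only the daily report loop was missing it
- Fix: call try_acquire_daily_lock("daily_report") after asyncio.sleep
a Pod that fails to get the lock simply continues and waits for the next 08:00
Problem 2: the auto-repair success rate was always 0.0%
- Root cause: _collect_repair_stats queried incidents.outcome->>'execution_success'
but the entire execution chain (approval_execution.py NO_ACTION + real execution)
never wrote execution_success back into the incidents.outcome JSON
so the query always returned 0
- Fix: query approval_records.status instead (EXECUTION_SUCCESS / EXECUTION_FAILED)
which is the only source of truth that is reliably written
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>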
2026-04-22 09:03:44 +08:00
Your Name
1625e7bd19
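fix(telegram): two root-cause fixes for silent button responses
CD Pipeline / build-and-deploy (push) Successful in 17m40s
Problem 1: ai_advisory_* buttons (capacity forecast/compliance, etc.)
- Pressing one only showed a toast (gone in 2-3 seconds); the group never got a reply
- Fix: _handle_ai_advisory_action takes a message_id parameter and,
after answer_callback, also sends a sendMessage reply to the original card
Problem 2: pressing "Approve" again on an already-resolved alert
- sign_approval returned early (status != pending) but
_notify_approval_result still sent "⚡ Executing..." → with no follow-up ever
- Fix: send "Executing..." only when approval.status == APPROVED
other terminal states send "ℹ️ This alert has already been handled (status: ...)" and return
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>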
2026-04-22 01:57:55 +08:00
Your Name
479f8d8971
refactor(tests): clear technical debt: remove FakeRepo/FakeSession mock-DB violations
CD Pipeline / build-and-deploy (push) Failing after 35s
## ai_router.py
- Extracted the pure function _aggregate_feedback_stats(); feedback_from_aider_events calls it
## aider_event_processor.py
- _process_one gains a _session_factory=None DI parameter (defaults to get_session_factory())
- A test factory can be injected without changing existing production logic
## test_ai_router_feedback.py (complete rewrite)
- Removed FakeRepo/FakeSession; tests the pure function _aggregate_feedback_stats directly
- Added the test_feedback_skips_missing_model edge case
- Kept the DB-failure degradation test (patches only get_session_factory, no FakeRepo)
## test_aider_event_processor.py (complete rewrite)
- Removed FakeRepo/FakeSession; uses a real PostgreSQL instead (real_factory fixture)
- Redis xack + IncidentEngine remain mocked (external broker/AI service, an allowed exception)
- Rolls back after each test so the dev DB is not polluted
## setup_test_schema.sql
- Added the aider_events_payload_gin GIN index (matches the adr091 production migration)
## integration/conftest.py
- Added a comment explaining the historical confusion around the password name awoooi_prod_2026
- Fixed the assert logic: checks the DB name rather than the URL string, so a password containing "prod" no longer triggers a false positive
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 01:33:30 +08:00
Your Name
4fc1f49dca
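fix(pipeline): three breakpoint fixes: SLO formula + NO_ACTION backlog + hallucination-downgrade risk
CD Pipeline / build-and-deploy (push) Successful in 14m3s
D1 flywheel_stats_service: the execution_count field does not exist → now reads
success_count+failure_count; removes the false impression that the flywheel execution success rate is always 0.0%
D2 openclaw._validate_deployment_inventory: after a hallucinated deployment was downgraded,
the original HIGH/CRITICAL risk was not reset → added result.risk_level = AIRiskLevel.LOW
D3 webhooks.py (two alert paths): the three non-destructive action types NO_ACTION/INVESTIGATE/OBSERVE
are forced to risk_level = LOW, skipping Telegram approval and going straight to auto-approve
→ the NO_ACTION handler in approval_execution.py immediately marks EXECUTION_SUCCESS
Root-cause chain: after the BUTTON_DATA_INVALID fix TG buttons could be sent, but the
backlog of 35 PENDING NO_ACTION records built up because HIGH risk could not take the auto-approve path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>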
2026-04-21 22:26:07 +08:00
Your Name
8fd31eca66
fix(telegram): compress the nonce UUID with base64url to resolve BUTTON_DATA_INVALID for good
CD Pipeline / build-and-deploy (push) Successful in 9m45s
The previous fix (truncating random) was incomplete: host_restart_service (20 chars) was
still 68 bytes > the 64-byte limit even without random.
Root fix: UUID (36 chars) → base64url-encode the UUID bytes → 22 chars
nonce format: {action}:{b64url_uuid}:{timestamp}:{random}
Longest case: host_restart_service(20)+22+10+8+3 colons = 63 bytes
generate_callback_nonce: UUID → base64url, 22 chars
parse_callback_data: 22-char b64url → restores the full UUID; handlers need no changes
All actions verified: approve/silence/reject/docker_restart/host_restart_service/renew_cert
are all ≤ 63 bytes, and the UUID round-trips correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 21:30:20 +08:00
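The UUID compression described in commit 8fd31eca66 can be sketched with the standard library. The function names `compress_uuid` and `expand_uuid` are hypothetical (the commit names generate_callback_nonce and parse_callback_data but does not show their bodies), and the sample UUID, timestamp, and random suffix are made up.

```python
import base64
import uuid


def compress_uuid(uuid_str: str) -> str:
    """36-char UUID string -> 22-char base64url token (padding stripped)."""
    raw = uuid.UUID(uuid_str).bytes  # 16 raw bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def expand_uuid(token: str) -> str:
    """22-char base64url token -> canonical 36-char UUID string."""
    raw = base64.urlsafe_b64decode(token + "==")  # restore the two pad chars
    return str(uuid.UUID(bytes=raw))


approval_id = "123e4567-e89b-12d3-a456-426614174000"  # made-up example UUID
token = compress_uuid(approval_id)

# Longest action from the commit, with a 10-digit timestamp and 8-char random.
nonce = f"host_restart_service:{token}:1745241020:abcd1234"
nonce_bytes = len(nonce.encode("utf-8"))
```

Sixteen bytes encode to 24 base64 characters, two of which are padding, hence 22. The resulting nonce is 20 + 22 + 10 + 8 + 3 = 63 bytes, under Telegram's 64-byte callback_data limit.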
Your Name
bd735482f7
fix(telegram): BUTTON_DATA_INVALID: root-cause fix for nonces over 64 bytes
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: Telegram callback_data is limited to 64 bytes.
5 long action names (docker_restart/host_restart_service, etc.) + UUID approval_id
= 71-77 bytes → BUTTON_DATA_INVALID.
Fix:
1. security_interceptor.generate_callback_nonce: if the nonce is > 63 bytes,
use the 3-part format (dropping random); the timestamp still provides time uniqueness.
2. security_interceptor.parse_callback_data: accepts the 3-part or 4-part format.
3. telegram_gateway: removed the debug payload logging (diagnosis complete).
Affected actions: docker_restart / host_restart_service / host_clear_log /
reload_nginx / renew_cert (all > 7 chars + UUID = 64 bytes or more).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 21:17:49 +08:00
Your Name
685f5c684f
debug(telegram): log full payload on 4xx to diagnose BUTTON_DATA_INVALID
CD Pipeline / build-and-deploy (push) Successful in 13m29s
The previous response_body logging confirmed the error code; this change logs the full payload (payload_preview, first
1000 bytes) to find the exact field that triggers BUTTON_DATA_INVALID.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 20:56:28 +08:00
Your Name
acab1cd95e
fix(gitea-review): PR/push AI analysis always failing: two root-cause fixes
CD Pipeline / build-and-deploy (push) Successful in 17m26s
Root cause 1 (push review): local_code_review_service.review_push() returns a
dict, but the caller accessed analysis.issues directly → AttributeError.
Fix: _call_openclaw_push_review converts the dict into a CodeReviewResult.
Root cause 2 (PR review): openclaw_http_service called
/api/v1/analyze/code-review, but OpenClaw never implemented this endpoint (404).
Fix: _call_openclaw_code_review now goes through local_code_review_service.review_pr()
(Ollama qwen2.5-coder + Gemini fallback).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 15:19:14 +08:00
Your Name
3323a9052c
debug: log telegram 400 response body to diagnose card send failure
CD Pipeline / build-and-deploy (push) Successful in 12m38s
2026-04-21 01:05:21 +08:00
Your Name
9e9bd8679f
fix(aider-watch): code-review fixes (4 issues)
CD Pipeline / build-and-deploy (push) Has been cancelled
1. aiderw: session_end now includes model+cwd (unblocks the AI Router feedback loop)
2. repository: model_stats_since SQL now uses COALESCE(session_end, session_start) model
3. aider_event_service: classify_severity no longer raises alerts on error_count (prevents false positives)
4. worker: run_aider_event_processor_loop wraps proc.start() in try/except (prevents silent crashes)
2026-04-20 @ Asia/Taipei
2026-04-21 00:59:21 +08:00
Your Name
de2d34d4cd
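fix(playbook): C1-C4 end-to-end wiring: evolver protection + seeder revival + immediate rule creation + watchdog W-4
CD Pipeline / build-and-deploy (push) Has been cancelled
C1: playbook_evolver: playbooks with the yaml_rule source get a YAML_RULE guard,
so the evolver no longer archives APPROVED playbooks created by the seeder, protecting the auto-repair chain
C2: playbook_seed_service: the idempotency SQL excludes DEPRECATED records,
so yaml_rule playbooks archived by the evolver can be revived after a restart
C3: alert_rule_engine: after an AI-generated rule succeeds, seed_playbooks_from_rules() is called immediately,
so the matching APPROVED Playbook is created without waiting for the next restart
C4: ai_slo_watchdog_job: added W-4, an alert when the APPROVED playbook count is 0,
so a broken chain raises TYPE-8M immediately; total checks go from 3 to 4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>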
2026-04-20 20:18:11 +08:00
Your Name
156a52f807
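fix(aiops): ADR-092 three fixes: Playbook enum crash + permanent Telegram silence + adoption failure + AI self-health check
CD Pipeline / build-and-deploy (push) Successful in 13m33s
B1 playbook_service.py: the evolver's setattr passed a str instead of a PlaybookStatus enum
→ playbook.status.value blew up in _pg_upsert (163 times/48h); fix: update_with_validation forces enum conversion
B2 approval_db.py + webhooks.py: find_by_fingerprint wrongly converged on PENDING records
→ PENDING ≠ sent to Telegram; fix: after a successful push, mark tg_sent:{fingerprint} in Redis (24h TTL)
→ in find_by_fingerprint, a PENDING record outside the debounce window converges only with Redis confirmation
drift_adopt_service.py: telegram_gateway called adopt_drift(report_id) but the method did not exist
→ added an adopt_drift() wrapper: loads the DriftReport from the DB and delegates to adopt(), fixing the adoption failure
B3 ai_slo_watchdog_job.py + main.py: the AI could not perceive its own failures (MASTER §1.1 blind spot)
→ added a self-health check every 15 minutes: W-1 SLO violation, W-2 TG silence detection, W-3 flywheel success rate
→ any anomaly → TYPE-8M send_meta_alert; Redis dedup for 1h
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>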
2026-04-20 20:00:06 +08:00
Your Name
40771cda6d
feat(ai_router): feedback_from_aider_events read-only hook (Phase 24 A8)
2026-04-20 19:40:01 +08:00
Your Name
cd894310dc
feat(api): POST /api/v1/aider/events HMAC webhook + Redis stream push
- Router layer: HTTP validation + HMAC-SHA256 signature verification
- Service layer: Redis stream push (aider_event_service.push_aider_batch_to_stream)
- Follows the leWOOOgo modular (building-block) layering: Router → Service → Redis
- All 6 tests passing (signature validation, batch limits, edge cases)
2026-04-20 19:40:01 +08:00
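The HMAC-SHA256 signature check mentioned in commit cd894310dc can be sketched with the standard library. How the real router reads the signature (header name, hex versus base64 encoding) is not stated in the commit, so the hex encoding and the function names `sign_body` and `verify_signature` are assumptions.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign_body(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_hex)


secret = b"example-shared-secret"  # made-up value for illustration
body = b'{"events": []}'
good = verify_signature(secret, body, sign_body(secret, body))
tampered = verify_signature(secret, body + b" ", sign_body(secret, body))
```

The signature must be computed over the exact bytes received; re-serializing parsed JSON before verifying would change the digest.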
Your Name
964427c5d4
feat(service): aider_event_service — classify + signal_data builder (uses existing debounce)
2026-04-20 19:40:01 +08:00
Your Name
54d60d04f5
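feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and classification + AI recommendation + target tracing
Commander's decision on the three questions: do all of them; the AI recommendation's 0.85 threshold is display-only, not automatic; check aol first, then fix
## RCA: source of the awoooi-service failures
- /api/v1/aiops/kpi shows 1 record in the past 24h with playbook_executed actor=approval_execution status=failed
- grep of the codebase: no code hard-codes awoooi-service (only historical comments)
- Most likely source: alert_rule_engine._extract_vars takes labels.service as the Deployment name
- cf5050c/4f2e122 (2026-04-18) already fixed two NEMOTRON hallucination paths; this commit fixes the third
## Fixes
### P0.3a alert_rule_engine._extract_vars
- labels.service downgrade: a trailing -service suffix is stripped first and the rest treated as the base name
- match_rule's return value gains a target_source field for tracing
- If awoooi-service recurs, the source is directly visible (label.service(stripped), etc.)
### P0.3c approval_execution._log_aol_started.input
- Added parsed_target/operation/namespace fields
- Future aol queries for failed records show the target directly, with no guesswork
### P0.1 telegram_gateway._send_drift_diff_detail
- Pagination (10 items/page) replaces flooding the chat with 30 items at once
- Header counts in 3 buckets: manual high-risk / general changes / K8s automatic
- ⬅️ / ➡️ paging buttons at the bottom (callback: drift_view_page:{report_id}_{page})
- security_interceptor INFO_ACTIONS whitelists drift_view_page
### P0.2 drift_narrator recommendation
- The LLM prompt gains a recommendation field (action/confidence/reason)
- action ∈ {adopt, revert, ignore, investigate}
- The top of the card shows "🎯 AI recommendation: ⏪ Revert (85%): reason"
- On LLM failure, falls back to _fallback_recommendation (rule-based mapping by intent)
- Card diff_summary limit raised 500 → 1500 characters to fit recommendation + narrative + items
- Commander's directive: display only, no automatic execution (the 0.85 threshold is kept for future use)
## Verification
- All 90 pytest tests pass (drift + rule_engine + approval_execution)
- AST syntax check passes for 5 files
## Next acceptance checks
1. Next drift trigger → the card shows "🎯 AI recommendation" at the top
2. Pressing drift_view → 3-bucket header + ⬅️ / ➡️
3. If awoooi-service recurs → query automation_operation_log.input.parsed_target directly
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>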
2026-04-20 04:04:13 +08:00
Your Name
f572561467
feat(ai_advisory): P0 fixes: leader lock + inline keyboard + callback handler
CD Pipeline / build-and-deploy (push) Successful in 8m31s
Commander's screenshot feedback, 2026-04-19:
1. The same alert was pushed twice in a row at 22:44 (multiple Pods all run the daily loop)
2. Plain text with no buttons (no feedback loop / the AI only advises, never executes)
Added services/ai_advisory_helpers.py (~240 lines):
- try_acquire_daily_lock(job_name): Redis SETNX on key 'aiops:daily_lock:{job}:{date}',
TTL 25h, fail-open (if Redis is down, push anyway without blocking).
- try_acquire_hourly_lock(job_name): hourly version of the above (used by coverage_evaluator).
- is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}', TTL 24h.
- build_ai_advisory_keyboard: a unified set of 4 buttons
✅ Handled / 😴 Ignore 24h / 🔍 View details / 📋 Generate kubectl command
callback_data format: 'ai_advisory_{action}:{type}:{id}'
- handle_ai_advisory_callback: handles the handled/snooze actions by writing aol.output.human_feedback;
view/produce_cmd are deferred to P1.
4 LLM scanners now use the helpers:
- capacity_forecaster: daily_lock + snooze check per host + buttons
- compliance_scanner: daily_lock (cron only) + snooze per date + buttons
- coverage_evaluator: hourly_lock + snooze per worst_dimension + buttons
- hermes_rule_quality: daily_lock + snooze per primary rule + buttons
telegram_gateway.py:
handle_callback adds 'ai_advisory_*' routing (step 1.85, after drift)
New _handle_ai_advisory_action method:
parses the payload 'type:id' → calls handle_ai_advisory_callback
→ answer_callback (Telegram toast feedback)
→ returns a dict (info_action=True for view/produce_cmd)
Alignment with the Commander's hard rules:
✅ In multi-Pod setups only the leader pushes (Redis SETNX guarantees idempotency)
✅ Failures are fail-open and do not block the main business flow (still works when Redis is down)
✅ aol.output gains human_feedback for AI learning
✅ snooze prevents repeated alerts (24h TTL)
✅ Reuses the existing drift button pattern (non-breaking)
Tomorrow morning's AI push will be:
- A single message (no duplicates)
- With 4 buttons (manual feedback loop)
- After snooze, no more pushes on the same topic for 24h
view/produce_cmd (P1) are left for the next session (AI-initiated MCP evidence gathering + LLM-generated kubectl command).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:02:57 +08:00
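The daily leader lock from commit f572561467 (SETNX on 'aiops:daily_lock:{job}:{date}', TTL 25h, fail-open) can be sketched as below. This is a synchronous illustration, not the helper's real code: the `FakeRedis` and `DownRedis` classes are hypothetical stand-ins that mimic the `set(key, value, nx=True, ex=...)` call shape of redis-py, and the ISO date format in the key is an assumption.

```python
import datetime


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client's set() with nx/ex."""

    def __init__(self):
        self.data = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.data:
            return None  # key already held: SET NX does nothing
        self.data[key] = value
        return True


class DownRedis:
    """Simulates an unreachable Redis."""

    def set(self, key, value, nx=False, ex=None):
        raise ConnectionError("redis unavailable")


def try_acquire_daily_lock(client, job_name, today=None, ttl_s=25 * 3600):
    """Return True if this caller should run today's job."""
    today = today or datetime.date.today()
    key = f"aiops:daily_lock:{job_name}:{today.isoformat()}"
    try:
        return bool(client.set(key, "1", nx=True, ex=ttl_s))
    except Exception:
        return True  # fail-open: with Redis down, every Pod pushes


client = FakeRedis()
day = datetime.date(2026, 4, 19)
first = try_acquire_daily_lock(client, "daily_report", today=day)
second = try_acquire_daily_lock(client, "daily_report", today=day)
```

Fail-open means duplicates can reappear during a Redis outage; the commit accepts that in exchange for never suppressing a report.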
Your Name
fa643ebdc7
refactor(p1): extract the LLM JSON parse helper + dual-condition coverage threshold (architect review P1)
CD Pipeline / build-and-deploy (push) Successful in 8m52s
The chief architect's 2026-04-19 review (92/100, Grade A) flagged these P1 optimizations:
1. The LLM JSON 3-path parse logic was duplicated across 4 scanners (~80 lines × 4 = 320 lines)
2. The coverage trigger threshold red>=20 was too low; a production bootstrap always triggered it and wasted tokens
P1.1+1.2 Added services/llm_json_parser.py (~90 lines):
parse_llm_json_response(text, required_key, logger_context)
3-path fallback:
Path 1: strip the markdown fence + direct JSON containing required_key
Path 2: NemoTron wrapper (JSON embedded in description/action_title/reasoning)
Path 3: everything failed: return None + logger.warning
Failures never raise; the caller decides the fallback.
4 LLM scanners now use the helper:
- hermes_rule_quality_job: required_key='recommended_actions'
- capacity_forecaster_job: required_key='priority_actions'
- compliance_scanner_job: required_key='posture_grade'
- coverage_evaluator_job: required_key='worst_dimension'
Each loses about 20 lines of duplication.
P1.3 coverage trigger changed to a dual condition:
Before: total_red >= 20 (always triggered at bootstrap)
After: red_ratio > 30% AND total_scanned >= 50
_fetch_red_summary now also returns total_scanned for the calculation.
5/5 unit tests for parse_llm_json_response:
✅ direct / markdown fence / NemoTron wrapper / invalid / missing key
P1.4 capacity_scanner + rule_catalog_sync: on inspection they already have complete author comments (the review was mistaken).
Other P1 items (Prom HTTP helper / staggered first_delay / LLM budget guard) are left for the next session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:39:40 +08:00
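The 3-path fallback described in commit fa643ebdc7 might look roughly like the sketch below. Only the function signature and the three paths come from the commit message; the fence-stripping regex, the `_loads_dict` helper, and the exact wrapper handling are assumptions.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

FENCE = "`" * 3  # markdown code-fence marker
_FENCE_RE = re.compile(r"^" + FENCE + r"(?:json)?\s*|\s*" + FENCE + r"$", re.IGNORECASE)
WRAPPER_FIELDS = ("description", "action_title", "reasoning")


def _loads_dict(text):
    """Parse text as JSON and return it only if it is a dict; never raises."""
    try:
        value = json.loads(text)
    except (TypeError, ValueError):
        return None
    return value if isinstance(value, dict) else None


def parse_llm_json_response(text, required_key, logger_context=""):
    """Return the dict containing required_key, or None if no path matches."""
    stripped = _FENCE_RE.sub("", (text or "").strip())
    outer = _loads_dict(stripped)
    # Path 1: direct JSON (after fence stripping) with the required key.
    if outer is not None and required_key in outer:
        return outer
    # Path 2: the JSON is embedded as a string inside a wrapper field.
    if outer is not None:
        for field in WRAPPER_FIELDS:
            inner = outer.get(field)
            if isinstance(inner, str):
                inner = _loads_dict(_FENCE_RE.sub("", inner.strip()))
            if isinstance(inner, dict) and required_key in inner:
                return inner
    # Path 3: nothing matched; log and let the caller choose a fallback.
    logger.warning("LLM JSON parse failed (%s): missing %r", logger_context, required_key)
    return None
```

For example, `parse_llm_json_response('{"posture_grade": "B"}', "posture_grade")` returns the dict unchanged, while non-JSON input returns None without raising.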
Your Name
37b6c9ba56
chore: remove empty ai_orchestrator.py (an empty file committed by accident)
CD Pipeline / build-and-deploy (push) Successful in 13m6s
The previous commit (86d9b22 LOGBOOK) accidentally picked up a 0-line empty file,
ai_orchestrator.py, via stash pop; it was not created on purpose. This commit deletes it to keep services/ clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:22:53 +08:00
Your Name
86d9b22125
docs(logbook): session wrap-up: Gap Review + full record of AI autonomy going 1/9→4/9
CD Pipeline / build-and-deploy (push) Has been cancelled
The session's 35 commits, fully closed out:
- Phase 7 foundations (scanners + evaluator + tracker + advisor + forecaster)
- KPI Dashboard API (autonomy_score 63/100, now quantifiable)
- Honest audit: 3 gaps
- Gap 1: strict host IPv4 + cleanup of 266 duplicate records
- Gap 2: true cause confirmed, not a bug
- Gap 3: LLM upgrade 3/8 (capacity_forecaster/compliance/coverage)
AI autonomy achieved:
1/9 LLM (Hermes only) → 4/9 LLM decision
All 8 tables that had 0 writers are now active
7/7 coverage dimensions complete
Tonight the AI will push 4 kinds of Telegram analysis reports on its own
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:22:42 +08:00
OG T
0004554bc6
feat(api): AIOps KPI Dashboard: full view of AI autonomy maturity (modular refactor)
CD Pipeline / build-and-deploy (push) Successful in 8m47s
GET /api/v1/aiops/kpi → consolidates all MASTER §7.1 KPIs in one call.
Alignment with the leWOOOgo modular hard rules:
- Router (api/v1/aiops_kpi.py) handles HTTP routing only and never touches the DB
- Service (services/aiops_kpi_service.py) owns all SQL + computation
- The previous commit was blocked by a hook (the Router imported get_db_context directly); fixed here
services/aiops_kpi_service.py (~230 lines):
AiopsKpiService.get_snapshot() returns 6 sections:
1. asset_inventory: by_type + total + last_scan (run_id/ended_at/total/new/modified)
2. coverage_kpi: 7 dimensions × (green/yellow/red/unknown)
+ green_ratio_per_dim + overall_green_ratio (MASTER §7.1 #5 SLO)
3. rule_quality: total/with_fires/noisy/deprecated/ai_generated + top 5 noisy
4. capacity_health: latest snapshot per host + by_verdict + violations_7d
5. automation_flow_24h: aol detail + by_actor + by_operation_type
6. ai_autonomy_score: total score 0-100
5 sub-items × 20: asset_coverage / rule_quality / capacity_health /
automation_flow / ai_diversity
grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50)
api/v1/aiops_kpi.py (~35 lines, slim router):
only router = APIRouter() + @router.get delegating to the service
main.py:
include_router(aiops_kpi_v1.router, prefix='/api/v1', tags=['AIOps KPI'])
Commander's usage:
curl http://192.168.0.121:32334/api/v1/aiops/kpi | jq .
See the full picture of AI autonomy maturity at a glance
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:21:46 +08:00
OG T
c0f3509d39
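fix(drift-card): Drift Diff HTTP 400: accumulate length item by item to avoid cutting HTML
CD Pipeline / build-and-deploy (push) Failing after 2m0s
The Commander reported at 14:18 that pressing [View Diff] returned 'Drift Diff 查詢失敗: HTTP error: 400' ("Drift Diff query failed")
True cause (telegram_gateway.py:2087 _send_drift_diff_detail):
- report_id=7ffe78ae has 48 items; the longest single git_value is 1794 characters (env array)
- The accumulated _full far exceeded 4096, so the _full[:3950] truncation ran
- The cut could land in the middle of an HTML tag (<code>...) or in the middle of an &lt; entity
- Telegram with parse_mode='HTML' rejects incomplete HTML → 400
Fix:
- Accumulate length item by item, counting each item as its _block length + 1
- Reserve a 3800 limit (4096 - 250 buffer for the header + the '… X more items' notice)
- Ensures _full is always a complete HTML structure
Verification: the next time a drift report appears and the Commander presses [View Diff], it should display normally (next cycle of this session)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>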
2026-04-19 14:26:29 +08:00
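The fix in commit c0f3509d39 replaces a blind slice with whole-item accumulation. A minimal sketch of the idea follows, assuming each item is rendered as one escaped `<code>` block; the function name `build_diff_message` and the rendering are hypothetical, while the 3800 budget and the "+1 per block" accounting come from the commit message.

```python
import html


def build_diff_message(header, items, limit=3800):
    """Append whole item blocks until the budget is spent; never cut mid-tag."""
    parts = [header]
    used = len(header)
    shown = 0
    for item in items:
        block = "<code>" + html.escape(item) + "</code>"
        cost = len(block) + 1  # +1 for the newline that joins blocks
        if used + cost > limit:
            break  # stop before this block instead of truncating it
        parts.append(block)
        used += cost
        shown += 1
    remaining = len(items) - shown
    if remaining:
        parts.append(f"… {remaining} more items")  # fits in the reserved buffer
    return "\n".join(parts)


items = ["a<b", "c&d", "x" * 5000, "tail"]
message = build_diff_message("<b>Drift</b>", items)
```

Because only complete blocks are ever appended, every opened `<code>` tag is closed and every entity is whole, which is what Telegram's HTML parse mode requires.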
OG T
e7ba8cb181
fix(aiops): unblock the AI self-learning chain: verifier switched to await + aol action feedback
CD Pipeline / build-and-deploy (push) Failing after 7m29s
The Commander's full audit on 2026-04-19 found:
- automation_operation_log: 22 records (all drift_narrator); of 33 approval actions in 7d, 0 were fed back
- incident_evidence.verification_result: 1212 records, 100% NULL; the verifier never wrote it
- Root cause: _run_post_execution_verify used asyncio.create_task fire-and-forget;
on Pod recycle the task was killed, so verification_result could never be written
Fix (unblocks the full verifier→learning→Playbook EWMA→finetune chain):
approval_execution.py:
+ _log_aol_started: INSERT aol(playbook_executed, pending) when the main flow starts
+ _log_aol_completed: the 4 return points UPDATE aol to success/failed + duration + stderr
└ NO_ACTION / parse_fail / K8s success / K8s failure all leave a trace
~ _run_post_execution_verify in two places (success + failure paths) changed from create_task to await + 60s timeout
+ On failure, stderr_feed_back is written to result.error → unblocks the E6 stderr feedback loop
declarative_remediation.py:
~ the _log_remediation_event task is now named and has add_done_callback, so task failures are logged
(the original fire-and-forget wrote 0 records; it is now possible to diagnose why the task died)
Expected effects:
- aol playbook_executed visible immediately (the 33 actions/7d get data right away)
- incident_evidence.verification_result starts accumulating → the finetune_exporter 7d cron finally has data
- Playbook EWMA trust_score starts changing dynamically
- stderr_feed_back connected → failure signals feed back into retry/Playbook negative reinforcement
No impact:
- background_task runs in the background, so the +60s delay does not block the API
- An aol write failure only logs logger.warning and does not block the main execution flow
Refs: MASTER §3.1 L6×D1 (ADR-081 PostExecutionVerifier),
MASTER §3.4 D4 (ADR-083 learning loop),
ADR-090 monitoring blind-spot governance (2026-04-18 full audit)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:07:29 +08:00
OG T
2abc91e360
fix(drift-card): fix 2 drift card bugs: AI assessment copy styling + Diff button AttributeError
CD Pipeline / build-and-deploy (push) Successful in 13m8s
Bug 1: pressing "🔍 View Diff" failed
Error: 'DriftReportRepository' object has no attribute 'get_by_id'
Root cause: the DriftReportRepository method is named get(), while other repos use get_by_id()
Fix: added a get_by_id() alias to match the repo interface convention
Bug 2: the AI assessment was rendered as a code block with a copy button
Root cause: telegram_gateway line 1962 wrapped diff_summary in <pre>
but diff_summary is an AI assessment narrative + emoji list, not code
Fix: removed <pre>; shown as plain text with a separator line + html.escape
Acceptance:
- Next drift card: the AI assessment paragraph is plain text (no purple code block + copy)
- Pressing "🔍 View Diff" → sends the full diff details (no AttributeError)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:27:13 +08:00
OG T
4b8be32610
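fix(telegram+approval): TG-1 + AP-1/2/3: 4 Telegram UX fixes
CD Pipeline / build-and-deploy (push) Failing after 25m27s
Ansible Lint / lint (push) Has been cancelled
Early morning of 2026-04-19 (Taipei time), ogt + Claude Opus 4.7 (1M)
## TG-1: add view to INFO_ACTIONS
security_interceptor.py: the 'view' button now uses the 2-part read format
and no longer wrongly triggers the 4-part nonce write format.
## AP-1: persist approval_records.telegram_message_id
After telegram_gateway.send_approval_card sends successfully, a DB-level UPDATE sets
approval_records SET telegram_message_id, telegram_chat_id
(not just Redis, so the original card can still be found after a Pod restart).
## AP-2: edit the original card when approval execution finishes + KM/Playbook delta
approval_execution._push_execution_result_to_alert, besides replying to the original card,
also calls editMessageReplyMarkup to remove the buttons (fixes cards stuck on "executing" forever).
- Also queries knowledge_entries/playbooks for additions within the last 2min and appends them to the message,
shown as "📚 KM +N 🎯 Playbook updates×M"
- Success: ✅ execution succeeded + action + KM delta
- Failure: ❌ execution failed + reason + KM delta
## AP-3: normalize primary_responsibility to reduce the share of "❓ Unknown"
openclaw._parse_analysis_result: if the LLM leaves it empty/None/outside the whitelist
(FE/BE/INFRA/DB/COLLAB), force a fallback: INFRA if a kubectl keyword is present,
otherwise BE. Previously only "not in data" was checked, so None or an empty string slipped through.
## Skipped: TG-3 (refactor) + TG-5 (webhook is a deprecated endpoint; the design uses Long Polling)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>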
2026-04-19 01:15:58 +08:00
OG T
68a42a3c97
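fix(openclaw): hallucination validation covers both paths + shared helper extracted
CD Pipeline / build-and-deploy (push) Has been cancelled
Early morning of 2026-04-19 (Taipei time), ogt + Claude Opus 4.7 (1M)
Root cause:
The Python hallucination validator from commit 7e9448f was installed only in
`analyze_alert` (webhook path), but the incident sweeper goes through
`generate_incident_proposal` (line 1552), which had no validation → at 00:23
the PostgreSQLDiskGrowthRate card showed a hallucinated "deployment/awoooi-prod"
that was not intercepted.
Fix:
1. Extracted the shared method `_validate_deployment_inventory(result, inventory, ns)`
2. `analyze_alert` (line 1322 area) calls this helper; the original inline logic is gone
3. `generate_incident_proposal` (line 1552) fetches the inventory dynamically + calls the helper
4. Helper additions:
- result.action_title = '[安全降級] 調查 {ns} 真實資源狀態' ("[Safe downgrade] Investigate the real resource state of {ns}")
(previously only description changed while action_title did not → the DB action column still held the old text)
- Each field assignment is wrapped in try/except, so one field failing does not affect the others
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>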
2026-04-19 01:11:09 +08:00
OG T
fdce0a3ab9
fix(approval): NO_ACTION 不再誤標 EXECUTION_FAILED (MASTER §7.1 #11 修)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
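2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M)
Root cause:
approval.action='NO_ACTION - 待分析' ("pending analysis", a product of the hallucination validator's downgrade) was passed to
parse_operation_from_action → operation_type=None → background_execution_skip
→ update_execution_status(success=False) → marked EXECUTION_FAILED.
KPI pollution:
MASTER §7.1 #11 auto_execute success rate = EXECUTION_SUCCESS / (SUCCESS+FAILED)
NO_ACTION should never count as a failure, yet it was counted and dragged the metric down.
A large share of the measured 30d success rate of 0.9% is caused by mislabeled NO_ACTION.
Fix:
When parsing fails, first check whether the action is NO_ACTION-like (action contains keywords such as
NO_ACTION/OBSERVE/INVESTIGATE) → take a dedicated noop branch:
- log event=background_execution_noop (info level)
- update_execution_status(success=True) → EXECUTION_SUCCESS
- timeline marked ✅ observation-only action completed
- reply to the original alert card showing success
- return True
Genuine parse failures (non-NO_ACTION) keep the original failure path but now fill in error_message
(extension of P0.2), so rejection_reason contains "Could not parse operation type from
action: <action>" instead of an empty string.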
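The branching described above can be sketched as follows. This is an illustrative reduction under stated assumptions: `is_noop_action` and `resolve_unparsed_action` are hypothetical names, the keyword tuple contains only the three keywords the commit names (the real list may be longer), and the logging, timeline, and Telegram reply side effects are omitted.

```python
from typing import Tuple

# Keywords named in the commit message; the real list may contain more.
NOOP_KEYWORDS = ("NO_ACTION", "OBSERVE", "INVESTIGATE")


def is_noop_action(action: str) -> bool:
    """True when the action text describes an observation-only step."""
    upper = (action or "").upper()
    return any(keyword in upper for keyword in NOOP_KEYWORDS)


def resolve_unparsed_action(action: str) -> Tuple[bool, str, str]:
    """Decide the outcome when no operation type could be parsed.

    Returns (success, status, error_message).
    """
    if is_noop_action(action):
        # Observation-only actions count as successes, so they no longer pollute the KPI.
        return True, "EXECUTION_SUCCESS", ""
    # Genuine parse failure: keep the failure path but record why.
    return False, "EXECUTION_FAILED", f"Could not parse operation type from action: {action}"
```

Returning the error message alongside the status keeps rejection_reason populated on the genuine failure path.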
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:08:16 +08:00
OG T
2e988bdb81
fix(telegram): post drift execution result back to the card + audit log user_id
...
CD Pipeline / build-and-deploy (push) Has been cancelled
The IDE flagged _stamp as unused (the result was never sent) and user_id as unused (audit gap).
Fix:
1. _edit_drift_card_outcome no longer only removes the buttons; it also sends a sign-off stamp message
(reply_to the original card, if msg_id exists), format:
✅ Adopted by @username (success)
Drift <report_id>
2. _handle_drift_action adds a drift_callback_dispatched log (audit)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:07:13 +08:00
OG T
877c8479e0
fix(telegram): TG-2 + TG-4 fix the drift button black hole
...
CD Pipeline / build-and-deploy (push) Has been cancelled
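2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M)
The Commander's screenshot shows it directly: pressing "View Diff" flips the card to "Executing", and the remaining 21 items cannot be seen.
A full audit found 9 Telegram subsystem bugs; this commit fixes the 2 most painful:
## TG-2: the 3 buttons drift_view/drift_adopt/drift_revert have **no handler**
Click → fallthrough → UX black hole / accidental trigger of the approve path.
Fix: in handle_callback, after the state guard (after line 2752), add Step 1.85
offroute: the 3 drift_* actions → handled exclusively by _handle_drift_action,
bypassing the nonce approve/reject dispatch so the execution flow is not triggered by mistake.
Implementation of the 3 buttons:
- drift_view: reads drift_reports → sends a new message showing all items
(HIGH/MEDIUM/INFO emoji + Git vs K8s original values side by side, capped at 50 items / 4000 characters)
- drift_adopt: calls drift_adopt_service.adopt_drift()
- drift_revert: calls drift_remediator.revert()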
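The drift_view cap (50 items, 4000 characters) can be sketched as a simple budgeted join. The function name `render_drift_items` and the one-line-per-item layout are assumptions for illustration; the 4000-character budget stays under Telegram's 4096-character message limit.

```python
from typing import List

MAX_ITEMS = 50    # item cap from the commit message
MAX_CHARS = 4000  # character cap from the commit message


def render_drift_items(lines: List[str]) -> str:
    """Join pre-formatted item lines, stopping at the item or character cap."""
    kept: List[str] = []
    used = 0
    for line in lines[:MAX_ITEMS]:
        # Each line after the first also costs one newline separator.
        cost = len(line) + (1 if kept else 0)
        if used + cost > MAX_CHARS:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)
```

Whole lines are dropped rather than cut mid-line, so an item is never shown half-rendered.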
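## TG-4: the drift card message_id was not stored in Redis → the card could not be edited afterwards
Fix: after send_drift_card succeeds, setex f"tg_drift:{incident_id}" with TTL 24h,
so _edit_drift_card_outcome can update the original card after adopt/revert runs (first remove the
buttons, then add a "XX by @username (success/failure)" sign-off stamp).
## Not included (follow-up):
TG-1 INFO_ACTIONS extension (view): next commit
TG-3 duplicate handler dispatch: under evaluation
TG-5 Bot webhook URL not set: needs the Commander's decision on a public URL
approval card NO_ACTION mislabeled FAILED: next commit
approval card contradictory description / unknown responsibility / post-execution edit
For the full list of 9 bugs see project_phase7_round3_telegram_subsystem_audit (to be created).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>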
2026-04-19 01:06:30 +08:00
OG T
98aef55b31
feat(kpi): ADR-090-D fix all 5 broken links in the MASTER §7.1 North Star KPIs
...
CD Pipeline / build-and-deploy (push) Successful in 11m49s
run-migration / migrate (push) Failing after 15s
2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M)
Measuring the 15 North Star KPIs in MASTER §7.1 against their targets revealed 5 broken links:
#3 fine-tune JSONL /week: the finetune_exports table does not exist
#4 MCP calls/24h: timeline_events has no mcp_call event_type
#6 Declarative remediation usage rate: the remediation_events table does not exist
#7 general fallback 17.3%: classify_alert_early misses 5 categories
#10 notification_outcomes /week: the table does not exist
This commit fixes all of them.
## 1. Migration: adr090d_kpi_data_sources.sql (3 tables)
- finetune_exports: P3 fine-tune JSONL tracking
- remediation_events: P5 declarative remediation tracking
- notification_outcomes: notification quality + RLHF corpus
Idempotent (CREATE TABLE IF NOT EXISTS), already applied to prod.
## 2. classify_alert_early gains 4 rule categories (lowers the general fallback)
- test interception: Test*/FPTest/FingerprintTest/ADR089*Test/L4Closure*/*FreshUniq*
→ category='test', TYPE-1 notification only
- High*CPU/Memory/Disk/Load → host_resource
- TLS*/SSL*/*ProbeFailure* → ssl_cert
- PostgreSQL*/MySQL*/MongoDB*/*DiskGrowthRate → database
Expected: general 17.3% → 3-5% (meets the <10% target).
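The four rule groups above can be sketched as glob patterns evaluated in order. This is not the actual classify_alert_early code: the function name `classify_alert_name`, the first-match-wins ordering, and the reading of "High*CPU/Memory/Disk/Load" as four separate `High*X*` globs are assumptions for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# (category, glob patterns) in evaluation order; the first matching group wins.
RULES: List[Tuple[str, List[str]]] = [
    ("test", ["Test*", "FPTest", "FingerprintTest", "ADR089*Test", "L4Closure*", "*FreshUniq*"]),
    # Assumed expansion of "High*CPU/Memory/Disk/Load".
    ("host_resource", ["High*CPU*", "High*Memory*", "High*Disk*", "High*Load*"]),
    ("ssl_cert", ["TLS*", "SSL*", "*ProbeFailure*"]),
    ("database", ["PostgreSQL*", "MySQL*", "MongoDB*", "*DiskGrowthRate"]),
]


def classify_alert_name(alertname: str) -> str:
    """Return the first matching category, or 'general' as the fallback bucket."""
    for category, patterns in RULES:
        # fnmatchcase keeps matching case-sensitive on every platform.
        if any(fnmatchcase(alertname, pattern) for pattern in patterns):
            return category
    return "general"
```

Anything that matches no rule still lands in 'general', which is the share the commit aims to bring below 10%.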
## 3. finetune_exporter DB write
_run_export() writes one finetune_exports row at the end, including checksum/size/record_count.
## 4. declarative_remediation DB write
After evaluate(), a fire-and-forget _log_remediation_event() writes to remediation_events
(status='pending'; remediation_type is derived from the tier as declarative/imperative/gitops_pr).
## 5. telegram_gateway DB write (send_approval_card)
After _send_request succeeds and returns a message_id, one notification_outcomes row is written,
with channel='telegram', delivery_status='delivered|failed'. Later, when a human presses a button,
user_action is updated → prime RLHF training data.
## 6. pre_decision_investigator MCP call tracking
_call_single_tool() writes timeline_events with event_type='mcp_call' in its finally block,
including provider/tool/status/duration_ms/error. MCP calls within 24h become measurable via SQL.
## Expected quantitative improvement
| KPI | Before | Expected 24h after the fix |
|-----|--------|----------------------------|
| #3 fine-tune /week | 0 (table missing) | >=10 (weekly cron run) |
| #4 MCP calls/24h | 0 | >0 (real runs will write to timeline) |
| #6 declarative usage rate | table missing | data present (pending/success/failed distribution) |
| #7 general fallback | 17.3% | <10% |
| #10 notification_outcomes | 0 | one row per approval card |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:00:31 +08:00
OG T
898145d68e
refactor(openclaw): import SuggestedAction at module top (avoid triple-nested inline import)
...
CD Pipeline / build-and-deploy (push) Successful in 11m17s
Ansible Lint / lint (push) Has been cancelled
The IDE raised a false positive on the inline "from src.models.ai import" (it ran fine).
Switching to a top-level import satisfies the IDE and is also more Pythonic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:28:19 +08:00
OG T
e6e484c1dc
fix(openclaw): correct import path to src.models.ai (not openclaw_schema)
...
CD Pipeline / build-and-deploy (push) Has started running
A real bug caught by the IDE (not a false positive): SuggestedAction lives in src/models/ai.py.
_SA.NO_ACTION now downgrades correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:45 +08:00
OG T
7e9448f6d0
fix(openclaw): two-layer defense against hallucinated deployment names (Prompt + Python validator)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M)
Production incident (approval f763bedf, 22:58):
- Alert: KubePodCrashLooping, labels.deployment="awoooi-api"
- NEMOTRON, although given the inventory "awoooi-api, awoooi-web, awoooi-worker",
still output kubectl_command="kubectl rollout restart deployment/awoooi-prod"
(mistaking the namespace for a deployment name)
- Execution result: "Deployment 'awoooi-prod' not found in namespace 'awoooi-prod'"
## Layer 1: strengthen NEMOTRON_SYSTEM_PROMPT (prompts.py)
Added a "🔒 DEPLOYMENT NAME RULE (STRICTLY ENFORCED)" block:
- namespace NEVER is a deployment name
- "awoooi-prod" is a NAMESPACE; never write deployment/awoooi-prod
- if an inventory is provided, the deployment must be an exact match
- prefer labels.deployment; unknown → NO_ACTION
## Layer 2: Python post-validation (openclaw.py:1322+)
After the LLM response is parsed, a regex extracts the deployment name and checks it against _k8s_inventory:
- in the list → pass
- not in the list → downgrade:
* kubectl_command → "kubectl get deploy -n {ns}" (investigation only)
* suggested_action → NO_ACTION
* target_resource → "unknown(hallucinated)"
* confidence → 0.0
* description is annotated with [安全降級] ("safe downgrade") and lists the valid inventory
- logged as 'openclaw_deployment_hallucination_detected'
Effect: even if the LLM ignores the prompt, the Python layer blocks it.
A destructive kubectl command is never executed against a nonexistent deployment.
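The Layer 2 check can be sketched as below. It is a simplified stand-in for the validator in openclaw.py, not a copy of it: the function name `validate_deployment`, the dict-based result, the regex, and the exact description format are assumptions, and the logging call is omitted.

```python
import re
from typing import Dict, List

# Matches "deployment/<name>" or "deploy/<name>" inside a kubectl command (assumed pattern).
_DEPLOY_RE = re.compile(r"deploy(?:ment)?/([a-z0-9][a-z0-9.-]*)")


def validate_deployment(result: Dict[str, object], inventory: List[str], ns: str) -> Dict[str, object]:
    """Return `result` unchanged if its deployment exists, else a downgraded copy."""
    command = str(result.get("kubectl_command", ""))
    match = _DEPLOY_RE.search(command)
    if match is None or match.group(1) in inventory:
        return result
    degraded = dict(result)
    degraded.update(
        kubectl_command=f"kubectl get deploy -n {ns}",  # investigation only
        suggested_action="NO_ACTION",
        target_resource="unknown(hallucinated)",
        confidence=0.0,
        # Tag plus the valid inventory, so the reader of the card sees the real names.
        description=f"[安全降級] {result.get('description', '')} (valid inventory: {', '.join(inventory)})",
    )
    return degraded
```

With the inventory from the incident above, a command targeting deployment/awoooi-prod is rewritten to the read-only `kubectl get deploy -n awoooi-prod`, while deployment/awoooi-api passes through untouched. Note that an empty inventory downgrades every command in this sketch.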
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:09 +08:00
OG T
6ad73b4834
fix(flywheel): three fixes for the L5/L6 broken links (RBAC expansion + failure reason persisted + verifier also runs on failure)
...
CD Pipeline / build-and-deploy (push) Successful in 11m6s
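2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M)
A full flywheel diagnosis exposed 3 real broken links:
- L5 execution 30d: EXECUTION_FAILED 216 / EXECUTION_SUCCESS 2 (99% failure rate)
- L6 verification 7d: verification_result all NULL (none of the 988 evidence rows was verified)
- every rejection_reason / error_message column is empty (impossible to diagnose)
Root cause: the awoooi-executor ServiceAccount lacks RBAC permissions, so every
kubectl get nodes/HPA in executor.py returns Forbidden and not even evidence can be collected; the repairs that follow
all fail, the verifier never triggers because execution never succeeds, and the evidence
verification result stays NULL forever. One RBAC fix resolves 3 nodes.
## P0.1 RBAC expansion (k8s/awoooi-prod/07-rbac.yaml)
Added cluster-scope read permissions (list/get/watch only, zero writes):
- nodes + nodes/status (required for evidence gathering)
- horizontalpodautoscalers (HPA status)
- metrics.k8s.io: nodes + pods (resource metrics)
- statefulsets + daemonsets (complete workload view)
Already applied with kubectl + smoke-tested: kubectl get nodes works.
## P0.2 always write rejection_reason on failure (approval_db.py)
update_execution_status() gains an error_message parameter; on failure it is written to
rejection_reason (truncated to 2000 characters) → later diagnosis has something to go on.
The approval_execution.py call sites are updated accordingly, and result.error is passed all the way into the DB.
## P0.3 verifier also runs on failure (approval_execution.py)
Old logic: the verifier was called only when result.success=True → with 99% failures it
never ran.
New logic: the failure path also runs the verifier via create_task, with a ":FAILED" suffix
appended to action_taken. The verifier captures post_state and writes
verification_result='failed' back to incident_evidence.
L7 learning now has failure samples to learn from, so the 2x negative decay of playbook trust
actually takes effect.
Expected effect:
- the EXECUTION_FAILED rate should drop from 99% to <30% within 30d
- the incident_evidence.verification_result NULL rate should drop from 100% to <10%
- the approval_records.rejection_reason fill rate should go from 0% to 100%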
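The two small data rules in P0.2 and P0.3 (truncating the failure reason to 2000 characters and tagging the verifier's action label on failure) can be sketched as pure helpers. The names `build_rejection_reason` and `verifier_action_label` are hypothetical, and the DB write and create_task scheduling are left out.

```python
from typing import Optional

MAX_REASON_LEN = 2000  # truncation length from the commit message


def build_rejection_reason(error_message: Optional[str]) -> str:
    """Truncate the error text so it fits the rejection_reason column."""
    return (error_message or "")[:MAX_REASON_LEN]


def verifier_action_label(action_taken: str, success: bool) -> str:
    """Append the ':FAILED' marker when the execution did not succeed."""
    return action_taken if success else f"{action_taken}:FAILED"
```

Keeping these as pure functions makes them easy to test apart from the database and the async scheduling around them.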
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 20:12:57 +08:00
OG T
b0d560dbb3
fix(drift-narrator): shortener uses replace to tolerate the LLM's hallucinated 'Resource/Name:' prefix
...
CD Pipeline / build-and-deploy (push) Successful in 10m50s
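2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7
In Round 4 the LLM itself prepended a resource identifier to the field:
'Deployment/awoooi-web: spec.template.spec.containers'
which broke the startswith-based shortener (the prefix is no longer at the start).
Defensive fix: when startswith misses, fall back to replace to strip the prefix at any position.
Result:
'Deployment/awoooi-web: containers' ✅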
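A minimal sketch of the two-stage shortener, assuming the three prefixes listed in the earlier spec.template.spec. commit (b63aed72df, further down this log). The function name mirrors `_shorten_field_path`, but the body is illustrative, not the project's implementation.

```python
# Longest prefix first, so "spec.template.spec." wins over "spec.".
_PREFIXES = ("spec.template.spec.", "spec.template.", "spec.")


def shorten_field_path(field: str) -> str:
    """Strip a known spec prefix, whether at the start or later in the string."""
    # Stage 1: the original behaviour, prefix at the very start.
    for prefix in _PREFIXES:
        if field.startswith(prefix):
            return field[len(prefix):]
    # Stage 2: defensive fallback, remove the first occurrence anywhere.
    for prefix in _PREFIXES:
        if prefix in field:
            return field.replace(prefix, "", 1)
    return field
```

Stage 2 is deliberately loose: a short prefix such as "spec." can also match inside an unrelated token, which is the price of tolerating arbitrary LLM-added prefixes.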
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 18:12:15 +08:00
OG T
b63aed72df
fix(drift-narrator): strip the spec.template.spec. prefix to fix ugly Telegram line wrapping
...
CD Pipeline / build-and-deploy (push) Successful in 12m1s
2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7
Visual report from the Commander after three rounds of live testing: the field name 'spec.template.spec.volumes' is 26 characters long;
with the emoji + ': ' + summary it exceeds the visual width of a Telegram <pre> block and wraps automatically,
leaving the emoji cut off from the field name on a line of its own.
Fix: _shorten_field_path() strips 3 common prefixes:
- 'spec.template.spec.' → ''
- 'spec.template.' → '' (fallback)
- 'spec.' → '' (fallback)
Before/after:
Before: '🟡 spec.template.spec.affinity.podAntiAffinity.preferredDuringS: [list, 3 items]'
After: '🟡 affinity.podAntiAffinity.preferredDuringS: [list, 3 items]'
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:10:20 +08:00
OG T
f3960f36d2
fix(drift-narrator): harden the fallback (label K8s default backfill + count actionable items separately)
...
CD Pipeline / build-and-deploy (push) Successful in 10m37s
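2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 (1M)
The Commander's live test reported: the card shows "securityContext: (unset) → {object, 0 fields}", which is meaningless.
Root cause: _fallback_items treated the noise of "K8s controller auto-filling empty objects"
as a real change and output it. In addition, the "29 more items" number included whitelisted + trivial items.
Fixes (4 items):
1. New predicate _is_trivial_drift()
None/empty string/{}/[]/false/0 are treated as equivalent, meaning "no substantive change"
Captures the K8s controller auto-fill scenario
2. _summarize_item() replaces the former smart_shorten
- trivial → "K8s default backfill (no substantive change)"
- None → value → "added xxx"
- value → None → "deleted (was: xxx)"
- otherwise → "from → to"
3. _fallback_items() improvements
- sort by level (HIGH first)
- filter the whitelist + HPA allowlist first
4. _count_nontrivial_drift() + Telegram rendering
- new "actionable" count (excluding whitelisted + trivial items)
- "N more items" uses the actionable count, so it no longer misleads
- when items is empty, show "all whitelisted or default backfill"
Expected effect:
Before: "... 29 more items" (only 1 was real drift)
Now: "... 0 more items" or "(all whitelisted or default backfill)"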
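The trivial-drift predicate, the item summary, and the actionable count can be sketched together as below. Function names drop the leading underscore, the output strings are English renderings of the card text, and the `(field, old, new)` tuple shape and whitelist set are assumptions made for the example.

```python
from typing import Any, Iterable, Set, Tuple


def _is_empty(value: Any) -> bool:
    """None, '', {}, [], False and 0 all count as 'nothing set'."""
    if value is None or value is False:
        return True
    if isinstance(value, (str, dict, list)):
        return len(value) == 0
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value == 0
    return False


def is_trivial_drift(old: Any, new: Any) -> bool:
    """Both sides empty: typically a K8s controller filling in defaults."""
    return _is_empty(old) and _is_empty(new)


def summarize_item(old: Any, new: Any) -> str:
    if is_trivial_drift(old, new):
        return "K8s default backfill (no substantive change)"
    if old is None:
        return f"added {new}"
    if new is None:
        return f"deleted (was: {old})"
    return f"{old} → {new}"


def count_actionable(items: Iterable[Tuple[str, Any, Any]], whitelist: Set[str]) -> int:
    """Count items that are neither whitelisted nor trivial."""
    return sum(
        1
        for field, old, new in items
        if field not in whitelist and not is_trivial_drift(old, new)
    )
```

Using the actionable count for the "N more items" footer is what keeps the card from advertising drift that is only whitelist or default-value noise.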
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:29:49 +08:00