Compare commits

8 Commits

Author SHA1 Message Date
OG T
4b8be32610 fix(telegram+approval): TG-1 + AP-1/2/3: four Telegram UX fixes
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 25m27s
Ansible Lint / lint (push) Has been cancelled
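2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M)

## TG-1: add view to INFO_ACTIONS
security_interceptor.py: the 'view' button now uses the 2-part read format
and no longer mistakenly triggers the 4-part nonce write format.

## AP-1: persist approval_records.telegram_message_id
After telegram_gateway.send_approval_card sends successfully, it runs a DB-level UPDATE
approval_records SET telegram_message_id, telegram_chat_id
(not only Redis, so the original card can still be found after a Pod restart).

## AP-2: edit the original card after approval execution + KM/Playbook delta
approval_execution._push_execution_result_to_alert now replies to the original card
and also calls editMessageReplyMarkup to remove the buttons (fixes cards stuck on "executing" forever).
  - It also counts knowledge_entries/playbooks rows from the last 2 minutes and appends them to the message,
    shown as "📚 KM +N  🎯 Playbook 更新×M" (更新 = "updated")
  - Success: execution succeeded + action + KM delta
  - Failure: execution failed + reason + KM delta

## AP-3: normalize primary_responsibility to lower the share of "unknown"
openclaw._parse_analysis_result: if the LLM leaves the field empty, sets it to None, or uses a value
outside the whitelist (FE/BE/INFRA/DB/COLLAB), force a fallback: INFRA if the kubectl keyword is present,
otherwise BE. Previously only "not in data" was checked, so None or an empty string slipped through.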
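
The AP-3 normalization can be sketched as a small standalone function. This is a minimal illustration of the rule described above, not the project's code: the function name and the module-level constant are hypothetical, and only the whitelist and the kubectl heuristic come from the commit.

```python
# Whitelist quoted in the commit message.
VALID_RESPONSIBILITIES = {"FE", "BE", "INFRA", "DB", "COLLAB"}


def normalize_primary_responsibility(data: dict) -> str:
    """Return a whitelisted responsibility, falling back to INFRA or BE.

    Hypothetical helper: None, "", and unknown values are all treated
    as missing, which is the gap a plain `"key" not in data` check leaves open.
    """
    current = str(data.get("primary_responsibility") or "").strip().upper()
    if current in VALID_RESPONSIBILITIES:
        return current
    # Heuristic from the commit: any mention of kubectl implies INFRA.
    return "INFRA" if "kubectl" in str(data) else "BE"
```

Because the heuristic searches `str(data)`, a key named `kubectl_command` matches even when its value is empty, so a stricter variant might inspect only the command value.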

## Skipped: TG-3 (refactor) + TG-5 (webhook is a deprecated endpoint; the design uses Long Polling)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:15:58 +08:00
OG T
68a42a3c97 fix(openclaw): hallucination validation covers both paths + extract a shared helper
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
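2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M)

Root cause:
  The Python hallucination validator from commit 7e9448f was installed only in
  `analyze_alert` (webhook path), but the incident sweeper goes through
  `generate_incident_proposal` (line 1552), which had no validation → the
  hallucinated "deployment/awoooi-prod" on the 00:23 PostgreSQLDiskGrowthRate
  card was not intercepted.

Fix:
1. Extract a shared method `_validate_deployment_inventory(result, inventory, ns)`
2. `analyze_alert` (line 1322 area) calls this helper; the original inline logic is removed
3. `generate_incident_proposal` (line 1552) fetches the inventory dynamically and calls the helper
4. Helper additions:
   - result.action_title = '[安全降級] 調查 {ns} 真實資源狀態' ("[safe downgrade] investigate the real resource state in {ns}")
     (previously only description was changed and action_title was not, so the DB action column still held the old text)
   - each field assignment is wrapped in its own try/except, so one failing field does not affect the others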
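
The inventory check at the heart of the helper can be sketched as follows. This is a simplified illustration under stated assumptions: the `Decision` dataclass is a hypothetical stand-in for the real decision object, the English title string is illustrative, and only the regex and the downgrade idea are taken from the diff.

```python
import re
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical stand-in for the LLM decision object."""
    kubectl_command: str
    action_title: str = ""
    confidence: float = 1.0


# Same pattern as in the diff: captures the name after "deployment/" or "deployment ".
_DEPLOY_RE = re.compile(r"deployment[/\s]+([a-z0-9][a-z0-9-]*)")


def validate_deployment(decision: Decision, inventory: str, ns: str) -> bool:
    """Downgrade the decision in place if it names an unknown deployment.

    `inventory` is a comma-separated list of deployment names.
    Returns True when a downgrade happened.
    """
    names = {n.strip() for n in inventory.split(",") if n.strip()}
    if not names:
        return False  # nothing to validate against
    match = _DEPLOY_RE.search(decision.kubectl_command.lower())
    if not match or match.group(1) in names:
        return False
    # Replace the destructive command with a read-only investigation.
    decision.kubectl_command = f"kubectl get deploy -n {ns}"
    decision.action_title = f"[safe downgrade] investigate resources in {ns}"
    decision.confidence = 0.0
    return True
```

Calling it from both entry points with the same arguments is what keeps the webhook path and the sweeper path consistent.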

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:11:09 +08:00
OG T
fdce0a3ab9 fix(approval): NO_ACTION is no longer mislabeled EXECUTION_FAILED (fixes MASTER §7.1 #11)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
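2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M)

Root cause:
approval.action='NO_ACTION - 待分析' ("pending analysis", a downgrade product of the hallucination validator) is passed to
parse_operation_from_action → operation_type=None → background_execution_skip
→ update_execution_status(success=False) → marked EXECUTION_FAILED.

KPI pollution:
  MASTER §7.1 #11 auto_execute success rate = EXECUTION_SUCCESS / (SUCCESS+FAILED)
  NO_ACTION should never count as a failure, yet it was counted and dragged the metric down.
  A large share of the measured 30d success rate of 0.9% is caused by mislabeled NO_ACTION.

Fix:
When parsing fails, first check whether the action is a NO_ACTION type (action contains keywords such as
NO_ACTION/OBSERVE/INVESTIGATE) → take a dedicated noop branch:
  - log event=background_execution_noop (info level)
  - update_execution_status(success=True) → EXECUTION_SUCCESS
  - timeline records "observation-only action completed"
  - reply to the original alert card showing success
  - return True

Genuine parse failures (non-NO_ACTION) keep the original failure path but now carry an error_message
(P0.2 follow-up), so rejection_reason holds "Could not parse operation type from
action: <action>" instead of an empty string.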
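
A minimal sketch of the noop classification and of the KPI formula it protects. The function names are hypothetical; the keyword list mirrors the checks visible in the diff, and the rate is simply the formula quoted above.

```python
from __future__ import annotations


def is_no_action(action: str | None) -> bool:
    """True when the action text denotes observation or investigation only."""
    text = action or ""
    upper = text.upper()
    return (
        any(tag in upper for tag in ("NO_ACTION", "NO-ACTION", "NOACTION"))
        or "(未設)" in text  # "(not set)" placeholder, as checked in the diff
        or upper.startswith(("OBSERVE", "INVESTIGATE"))
    )


def auto_execute_success_rate(success: int, failed: int) -> float | None:
    """EXECUTION_SUCCESS / (SUCCESS + FAILED); None when nothing was executed."""
    total = success + failed
    return success / total if total else None
```

Guarding with `action or ""` keeps the check safe when the action field is None.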

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:08:16 +08:00
OG T
2e988bdb81 fix(telegram): post the drift execution result back to the card + audit log user_id
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
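The IDE flagged _stamp as unused (the result was never sent) and user_id as unused (audit gap).

Fix:
1. _edit_drift_card_outcome no longer just removes the buttons; it also sends a sign-off stamp message
   (reply_to the original card, if msg_id exists), formatted as:
      Adopted by @username (success)
     Drift <report_id>
2. _handle_drift_action adds a drift_callback_dispatched log (audit)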

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:07:13 +08:00
OG T
877c8479e0 fix(telegram): TG-2 + TG-4 fix the drift button black hole
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
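2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M)

Caught in the commander's screenshot: pressing "View Diff" switched the card to "Executing", and the remaining 21 items were not visible.

A full audit found 9 Telegram subsystem bugs; this commit fixes the 2 most painful ones:

## TG-2: the 3 buttons drift_view/drift_adopt/drift_revert have **no handler**
  Click → fallthrough → UX black hole / mistakenly triggers the approve path.

Fix: handle_callback adds a Step 1.85 offroute after the state guard (after line 2752):
  the 3 drift_* actions go to the dedicated _handle_drift_action and skip the
  nonce approve/reject dispatch, so the execution flow is not triggered by mistake.

The 3 buttons:
  - drift_view: reads drift_reports → sends a new message showing all items
    (HIGH/MEDIUM/INFO emoji + Git vs K8s original values side by side, capped at 50 items / 4000 characters)
  - drift_adopt: calls drift_adopt_service.adopt_drift()
  - drift_revert: calls drift_remediator.revert()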
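
The offroute can be pictured as an early branch in the callback dispatcher. The sketch below is illustrative only: `route_callback` and the two demo handlers are hypothetical names, not the gateway's real methods.

```python
import asyncio

# The three drift button actions named in the commit message.
DRIFT_ACTIONS = frozenset({"drift_view", "drift_adopt", "drift_revert"})


async def route_callback(action: str, target_id: str, drift_handler, approval_handler) -> str:
    """Send drift_* actions to their own handler before the approval dispatch."""
    if action in DRIFT_ACTIONS:
        return await drift_handler(action, target_id)
    return await approval_handler(action, target_id)


async def demo_drift_handler(action: str, report_id: str) -> str:
    """Hypothetical drift handler used only for this example."""
    return f"drift:{action}:{report_id}"


async def demo_approval_handler(action: str, approval_id: str) -> str:
    """Hypothetical approve/reject handler used only for this example."""
    return f"approval:{action}:{approval_id}"
```

Branching before the nonce dispatch is what prevents a drift button from ever reaching the approve path.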

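## TG-4: the drift card message_id was not stored in Redis → the card could not be edited afterwards
Fix: after send_drift_card succeeds, setex f"tg_drift:{incident_id}" with TTL 24h,
  so _edit_drift_card_outcome can update the original card after adopt/revert runs (first remove
  the buttons, then add an "XX by @username (success/failure)" sign-off stamp).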
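
To make the key format and the 24h expiry concrete, here is a small sketch that uses an in-memory stand-in instead of Redis. `InMemoryTtlStore` is purely hypothetical and exists only so the example runs without a server; its `setex` mirrors the argument order of the Redis command (key, seconds, value).

```python
from __future__ import annotations

import time

DRIFT_CARD_TTL_SECONDS = 24 * 60 * 60  # 24h, as stated in the commit message


def drift_card_key(incident_id: str) -> str:
    """Key format quoted in the commit message."""
    return f"tg_drift:{incident_id}"


class InMemoryTtlStore:
    """Hypothetical stand-in for a Redis client, for illustration only."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data: dict[str, tuple[float, str]] = {}

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> str | None:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as if the key never existed
            return None
        return value
```

A caller that finds no entry has to skip the card edit, which is why the expiry bounds how long a drift card stays editable.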

## Not included (follow-up):
  TG-1 INFO_ACTIONS extension (view): next commit
  TG-3 duplicate handler dispatch: under evaluation
  TG-5 Bot webhook URL not set: needs the commander's decision on a public URL
  approval card NO_ACTION mislabeled FAILED: next commit
  approval card contradictory description / unknown responsibility / post-execution edit

The full 9-bug list is in project_phase7_round3_telegram_subsystem_audit (to be created).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:06:30 +08:00
AWOOOI CD
41e6b503e2 chore(cd): deploy 98aef55 [skip ci] 2026-04-18 16:11:01 +00:00
OG T
98aef55b31 feat(kpi): ADR-090-D MASTER §7.1 north-star KPIs, all 5 broken links fixed
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 11m49s
run-migration / migrate (push) Failing after 15s
2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M)

Measuring the 15 north-star KPIs in MASTER §7.1 against real data found 5 broken links:
  #3  fine-tune JSONL /week         : finetune_exports table does not exist
  #4  MCP calls/24h                 : timeline_events has no mcp_call event_type
  #6  Declarative remediation usage : remediation_events table does not exist
  #7  general fallback 17.3%        : classify_alert_early misses 5 categories
  #10 notification_outcomes /week   : table does not exist

This commit fixes all of them.

## 1. Migration: adr090d_kpi_data_sources.sql (3 tables)

- finetune_exports: P3 fine-tune JSONL tracking
- remediation_events: P5 declarative remediation tracking
- notification_outcomes: notification quality + RLHF corpus

Idempotent (CREATE TABLE IF NOT EXISTS); already applied to prod.

## 2. classify_alert_early gains 4 rule categories (reduces the general fallback)

- test interception: Test*/FPTest/FingerprintTest/ADR089*Test/L4Closure*/*FreshUniq*
  → category='test', TYPE-1 notification only
- High*CPU/Memory/Disk/Load → host_resource
- TLS*/SSL*/*ProbeFailure* → ssl_cert
- PostgreSQL*/MySQL*/MongoDB*/*DiskGrowthRate → database

Expected: general 17.3% → 3-5% (meets the <10% target).
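
The four rule groups can be sketched as an ordered chain of prefix and substring checks. `classify_extra` is a hypothetical name covering only the new rules; the (category, type) pairs follow the diff, and returning None stands for falling through to the existing rules.

```python
from __future__ import annotations


def classify_extra(alertname: str) -> tuple[str, str] | None:
    """Apply the four added rule groups in order; None means no rule matched."""
    # Test alerts are intercepted first so they never reach production categories.
    if (
        alertname.startswith(("Test", "FPTest", "FingerprintTest", "ADR089", "L4Closure"))
        or "FreshUniq" in alertname
    ):
        return "test", "TYPE-1"
    if alertname.startswith(("HighCPU", "HighMemory", "HighDisk", "HighLoad")):
        return "host_resource", "TYPE-3"
    if alertname.startswith(("TLS", "SSL")) or "ProbeFailure" in alertname:
        return "ssl_cert", "TYPE-3"
    if alertname.startswith(("PostgreSQL", "MySQL", "MongoDB")) or "DiskGrowthRate" in alertname:
        return "database", "TYPE-3"
    return None
```

Rule order matters here: any name beginning with "Test" is classified as a test alert before the later rules are consulted.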

## 3. finetune_exporter DB write

At the end of _run_export(), one row is written to finetune_exports, including checksum/size/record_count.
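
One way to compute the size and checksum recorded for an export is sketched below. `file_fingerprint` is a hypothetical helper; it reads the file in chunks, which is a design choice for large JSONL files rather than what the diff does (the diff reads the whole file at once).

```python
import hashlib
import os


def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> tuple[int, str]:
    """Return (size_bytes, sha256_hex) for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

Chunked hashing yields the same digest as hashing the full contents, while keeping memory use bounded by `chunk_size`.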

## 4. declarative_remediation DB write

After evaluate(), a fire-and-forget _log_remediation_event() writes to remediation_events
(status='pending'; remediation_type is derived from the tier as declarative/imperative/gitops_pr).
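
A common way to make fire-and-forget tasks robust is to hold a reference until they finish, since the event loop keeps only weak references to tasks. The sketch below shows that pattern with hypothetical names (`fire_and_forget`, `_background_tasks`, `demo`); it is not taken from the commit.

```python
import asyncio

# Strong references keep pending tasks from being garbage-collected mid-flight.
_background_tasks: set = set()


def fire_and_forget(coro) -> asyncio.Task:
    """Schedule `coro` on the running loop without awaiting it."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    # Drop the reference once the task is done.
    task.add_done_callback(_background_tasks.discard)
    return task


async def demo() -> tuple:
    """Run one background job and report (results, tasks still tracked)."""
    results = []

    async def job() -> None:
        results.append("logged")

    task = fire_and_forget(job())
    await task  # awaited here only so the demo can observe the outcome
    await asyncio.sleep(0)
    return results, len(_background_tasks)
```

`asyncio.create_task` raises RuntimeError when no loop is running, which is the case the try/except RuntimeError guard in the diff covers.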

## 5. telegram_gateway DB write (send_approval_card)

After _send_request returns a message_id, one row is written to notification_outcomes with
channel='telegram', delivery_status='delivered|failed'. Later, when a human presses a button,
user_action is updated → prime RLHF training material.

## 6. pre_decision_investigator MCP call tracking

The finally block of _call_single_tool() writes timeline_events with event_type='mcp_call',
including provider/tool/status/duration_ms/error. MCP calls within 24h can then be measured with SQL.
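
The status and duration bookkeeping can be sketched as a wrapper around any awaitable call. `timed_call` and the three demo coroutines are hypothetical; the status vocabulary (success/timeout/error) and the 200-character error cap follow the diff.

```python
import asyncio
import time


async def timed_call(coro_factory, timeout_sec: float) -> dict:
    """Await coro_factory() with a timeout and report status, error, duration."""
    started = time.monotonic()
    status, error = "failed", None
    try:
        await asyncio.wait_for(coro_factory(), timeout=timeout_sec)
        status = "success"
    except asyncio.TimeoutError:
        status, error = "timeout", f"timeout {timeout_sec}s"
    except Exception as exc:
        status, error = "error", str(exc)[:200]
    finally:
        # Runs on every path, so the duration is always recorded.
        duration_ms = int((time.monotonic() - started) * 1000)
    return {"status": status, "error": error, "duration_ms": duration_ms}


async def demo_ok() -> int:
    return 1


async def demo_slow() -> None:
    await asyncio.sleep(1)


async def demo_boom() -> None:
    raise ValueError("boom")
```

Computing the duration in `finally` means timeouts and errors are measured the same way as successes.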

## Expected quantitative improvement

| KPI | Before | Expected 24h after the fix |
|-----|--------|----------------------------|
| #3 fine-tune /week | 0 (table missing) | >=10 (weekly cron run) |
| #4 MCP calls/24h | 0 | >0 (real calls will write to timeline) |
| #6 declarative usage rate | table missing | has data (pending/success/failed distribution) |
| #7 general fallback | 17.3% | <10% |
| #10 notification_outcomes | 0 | one row per approval card |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:00:31 +08:00
AWOOOI CD
805230436d chore(cd): deploy 898145d [skip ci] 2026-04-18 15:38:17 +00:00
9 changed files with 770 additions and 48 deletions

@@ -0,0 +1,149 @@
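-- ADR-090-D: MASTER §7.1 north-star KPI data sources
-- Created: 2026-04-18 evening (Taipei time)
-- Created by: ogt + Claude Opus 4.7 (1M)
--
-- Background:
-- Benchmarking the 15 KPIs in MASTER §7.1 found that 3 key tables were never created, so the
-- following KPIs could never be measured:
-- #3 fine-tune JSONL /week → finetune_exports table
-- #6 Declarative remediation usage rate → remediation_events table
-- #10 notification_outcomes → notification_outcomes table
--
-- This migration adds the 3 data source tables (idempotent).
--
-- Corresponding MASTER § metrics:
-- §3.3 D3 remediation abstraction (Imperative → Declarative)
-- §3.4 D4 learning depth (Fine-tune)
-- §3.6 D6 self-governance (notification quality)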
-- ═══════════════════════════════════════════════════════════════════
-- 1. finetune_exports: Phase 3 fine-tune JSONL output tracking
-- ═══════════════════════════════════════════════════════════════════
CREATE TABLE IF NOT EXISTS finetune_exports (
export_id BIGSERIAL PRIMARY KEY,
export_type TEXT NOT NULL, -- 'evidence_snapshot' | 'agent_session' | 'decision_outcome'
source_table TEXT, -- source table name (incidents / agent_sessions ...)
source_ids TEXT[], -- ids of the source records covered
file_path TEXT, -- path of the exported JSONL file
record_count INT NOT NULL DEFAULT 0,
size_bytes BIGINT,
checksum_sha256 TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT finetune_export_type_valid CHECK (export_type IN (
'evidence_snapshot','agent_session','decision_outcome',
'incident_rca','playbook_outcome','rlhf_trace'
))
);
COMMENT ON TABLE finetune_exports IS
'ADR-090-D: MASTER §7.1 #3 fine-tune JSONL output tracking. One row is written per finetune_exporter export.';
CREATE INDEX IF NOT EXISTS idx_finetune_exports_created
ON finetune_exports(created_at DESC);
CREATE INDEX IF NOT EXISTS idx_finetune_exports_type
ON finetune_exports(export_type);
-- ═══════════════════════════════════════════════════════════════════
-- 2. remediation_events: Phase 5 declarative remediation tracking
-- ═══════════════════════════════════════════════════════════════════
CREATE TABLE IF NOT EXISTS remediation_events (
event_id BIGSERIAL PRIMARY KEY,
incident_id TEXT,
approval_id TEXT,
remediation_type TEXT NOT NULL, -- 'declarative' | 'imperative' | 'gitops_pr' | 'kubectl'
action_name TEXT,
target_resource TEXT, -- e.g. deployment/awoooi-api
namespace TEXT,
dry_run BOOLEAN NOT NULL DEFAULT false,
status TEXT NOT NULL, -- 'pending' | 'success' | 'failed' | 'rolled_back'
error_message TEXT,
blast_radius_score INT,
duration_ms INT,
executed_by TEXT, -- 'ai_agent' | 'human:ogt' | 'cron'
triggered_by_op_id UUID, -- references automation_operation_log.op_id
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT remediation_type_valid CHECK (remediation_type IN (
'declarative','imperative','gitops_pr','kubectl','ansible','helm','argocd_sync'
)),
CONSTRAINT remediation_status_valid CHECK (status IN (
'pending','success','failed','rolled_back','dry_run_ok','dry_run_failed'
))
);
COMMENT ON TABLE remediation_events IS
'ADR-090-D: MASTER §7.1 #6 declarative remediation usage rate. One row is written per declarative_remediation run.';
CREATE INDEX IF NOT EXISTS idx_remediation_events_time
ON remediation_events(created_at DESC);
CREATE INDEX IF NOT EXISTS idx_remediation_events_type
ON remediation_events(remediation_type);
CREATE INDEX IF NOT EXISTS idx_remediation_events_incident
ON remediation_events(incident_id) WHERE incident_id IS NOT NULL;
-- ═══════════════════════════════════════════════════════════════════
-- 3. notification_outcomes: notification outcome tracking
-- ═══════════════════════════════════════════════════════════════════
CREATE TABLE IF NOT EXISTS notification_outcomes (
outcome_id BIGSERIAL PRIMARY KEY,
incident_id TEXT,
approval_id TEXT,
channel TEXT NOT NULL, -- 'telegram' | 'email' | 'slack' | 'webhook'
notification_type TEXT, -- TYPE-1/2/3/4/4D/5S/6B/7E/8M
recipient TEXT, -- chat_id / email / user
message_id TEXT, -- e.g. telegram message_id
sent_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
delivery_status TEXT NOT NULL, -- 'delivered' | 'failed' | 'pending'
delivery_error TEXT,
-- Human interaction tracking (prime RLHF corpus)
user_action TEXT, -- 'approved' | 'rejected' | 'silenced' | 'ignored' | 'no_response'
user_action_at TIMESTAMPTZ,
user_comment TEXT,
-- Notification quality
snoozed_count INT NOT NULL DEFAULT 0,
time_to_action_sec INT, -- seconds from receipt to button press
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT notif_channel_valid CHECK (channel IN (
'telegram','email','slack','webhook','sms','discord'
)),
CONSTRAINT notif_delivery_valid CHECK (delivery_status IN (
'delivered','failed','pending','rate_limited'
))
);
COMMENT ON TABLE notification_outcomes IS
'ADR-090-D: MASTER §7.1 #10 notification_outcomes tracking. One row is written per telegram_gateway push; user_action is updated when the user presses a button.';
CREATE INDEX IF NOT EXISTS idx_notification_outcomes_sent
ON notification_outcomes(sent_at DESC);
CREATE INDEX IF NOT EXISTS idx_notification_outcomes_incident
ON notification_outcomes(incident_id) WHERE incident_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_notification_outcomes_approval
ON notification_outcomes(approval_id) WHERE approval_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_notification_outcomes_pending_action
ON notification_outcomes(sent_at DESC)
WHERE user_action IS NULL AND delivery_status='delivered';
-- ═══════════════════════════════════════════════════════════════════
-- Acceptance checks (can be run manually after applying)
-- ═══════════════════════════════════════════════════════════════════
-- SELECT table_name FROM information_schema.tables
-- WHERE table_schema='public'
-- AND table_name IN ('finetune_exports','remediation_events','notification_outcomes')
-- ORDER BY table_name;
-- Expected: 3 rows
-- SELECT conname FROM pg_constraint WHERE conrelid IN (
-- 'finetune_exports'::regclass,
-- 'remediation_events'::regclass,
-- 'notification_outcomes'::regclass
-- ) AND contype='c' ORDER BY conname;

@@ -144,14 +144,57 @@ class ApprovalExecutionService:
namespace = parsed.namespace
if operation_type is None or resource_name is None:
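# 2026-04-19 ogt + Claude Opus 4.7: distinguish NO_ACTION from a genuine parse failure
# NO_ACTION is the AI deliberately choosing "investigate only, nothing destructive"; it must not be mislabeled EXECUTION_FAILED
# and pollute the auto_execute success-rate KPI (MASTER §7.1 #11)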
_action_upper = (approval.action or "").upper()
_is_no_action = (
"NO_ACTION" in _action_upper
or "NO-ACTION" in _action_upper
or "NOACTION" in _action_upper
or "(未設)" in approval.action
or _action_upper.startswith("OBSERVE")
or _action_upper.startswith("INVESTIGATE")
)
if _is_no_action:
logger.info(
"background_execution_noop",
approval_id=str(approval.id),
action=approval.action,
reason="NO_ACTION - 純調查/觀察類,不執行破壞動作",
)
# Mark as SUCCESS (observing/investigating is itself a successful completion)
await service.update_execution_status(approval.id, success=True)
await timeline.add_event(
event_type="exec",
status="success",
title="✅ 純觀察類動作完成 (NO_ACTION)",
description=f"Action: {approval.action[:120]}",
actor="leWOOOgo",
actor_role="executor",
approval_id=str(approval.id),
)
# Reply to the original alert card with the execution result
asyncio.create_task(
self._push_execution_result_to_alert(
approval, success=True, error=None,
)
)
return True # NO_ACTION counts as successfully completed
# Genuine parse failure (not NO_ACTION)
logger.warning(
"background_execution_skip",
approval_id=str(approval.id),
reason="Could not parse operation type from action",
action=approval.action,
)
# Phase 5: update the database status
await service.update_execution_status(approval.id, success=False)
# Phase 5: update the database status + include error_message (P0.2)
await service.update_execution_status(
approval.id, success=False,
error_message=f"Could not parse operation type from action: {approval.action[:150]}",
)
await timeline.add_event(
event_type="exec",
status="error",
@@ -453,11 +496,53 @@ class ApprovalExecutionService:
settings = get_settings()
gateway = get_telegram_gateway()
# 2026-04-19 ogt + Claude Opus 4.7, fix AP-2: besides replying,
# also edit the original card to remove the buttons + update the status stamp (so the card does not stay on "executing" forever)
try:
await gateway._send_request("editMessageReplyMarkup", {
"chat_id": settings.OPENCLAW_TG_CHAT_ID,
"message_id": orig_msg_id,
"reply_markup": {"inline_keyboard": []},
})
except Exception as _edit_e:
logger.debug("push_execution_edit_buttons_failed",
approval_id=str(approval.id), error=str(_edit_e))
# Append the KM/Playbook delta (the queries below count all rows from the last 2 minutes; they are not filtered by incident)
km_info = ""
try:
from sqlalchemy import text as _sql
from src.db.base import get_db_context
async with get_db_context() as _db:
_km_row = await _db.execute(
_sql("""SELECT COUNT(*) FROM knowledge_entries
WHERE created_at > NOW() - interval '2 minutes'"""),
)
_km_count = _km_row.scalar() or 0
_pb_row = await _db.execute(
_sql("""SELECT COUNT(*) FROM playbooks
WHERE updated_at > NOW() - interval '2 minutes'"""),
)
_pb_count = _pb_row.scalar() or 0
if _km_count or _pb_count:
km_info = f"\n📚 KM +{_km_count} 🎯 Playbook 更新×{_pb_count}"
except Exception:
pass
if success:
text = f"✅ <b>執行成功</b>\n<code>{(approval.action or '')[:180]}</code>"
text = (
f"✅ <b>執行成功</b>\n"
f"<code>{(approval.action or '')[:180]}</code>"
f"{km_info}"
)
else:
err_short = (error or "未知錯誤")[:150]
text = f"❌ <b>執行失敗</b>\n<code>{(approval.action or '')[:180]}</code>\n原因: {err_short}"
text = (
f"❌ <b>執行失敗</b>\n"
f"<code>{(approval.action or '')[:180]}</code>\n"
f"原因: {err_short}"
f"{km_info}"
)
await gateway._http_client.post(
f"https://api.telegram.org/bot{settings.OPENCLAW_TG_BOT_TOKEN}/sendMessage",

@@ -166,6 +166,16 @@ class DeclarativeRemediation:
can_auto=spec.can_auto_execute,
action=action[:80],
)
# 2026-04-18 ADR-090-D: write to the remediation_events table (data source for the MASTER §7.1 #6 KPI)
# fire-and-forget, does not block the main flow
try:
import asyncio as _a
_a.create_task(_log_remediation_event(spec, action, target, namespace))
except RuntimeError:
# Not in an async context (regular calls are all async); skip silently
pass
return spec
@@ -173,6 +183,54 @@ class DeclarativeRemediation:
# Helpers
# ─────────────────────────────────────────────────────────────────────────────
async def _log_remediation_event(
spec: "DeclarativeSpec",
action: str,
target: str,
namespace: str,
) -> None:
"""
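2026-04-18 ADR-090-D: writes to the remediation_events table (data source for the MASTER §7.1 #6 KPI).
Writes one 'pending' record after each DeclarativeRemediation.evaluate() call.
The actual execution status is to be updated later by approval_execution.py (future iteration).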
"""
try:
from sqlalchemy import text as _sql
from src.db.base import get_db_context
# Determine remediation_type
_rt = "declarative" if spec.can_auto_execute else "imperative"
if spec.requires_gitops_pr:
_rt = "gitops_pr"
async with get_db_context() as db:
await db.execute(
_sql("""
INSERT INTO remediation_events (
remediation_type, action_name, target_resource, namespace,
dry_run, status, blast_radius_score, executed_by,
metadata
) VALUES (
:rt, :an, :tr, :ns,
:dr, 'pending', :br, 'ai_agent',
CAST(:md AS jsonb)
)
"""),
{
"rt": _rt,
"an": action[:200],
"tr": target[:100] if target else None,
"ns": namespace[:50],
"dr": spec.dry_run_required,
"br": spec.blast_radius_score,
"md": '{"tier":"' + spec.tier + '"}',
},
)
except Exception as _e:
logger.warning("remediation_events_db_write_failed", error=str(_e))
def _build_constraints(action: str, namespace: str, score: int) -> list[str]:
"""依動作特性建立安全約束清單。"""
constraints: list[str] = []

@@ -50,7 +50,7 @@ from datetime import timedelta
from pathlib import Path
import structlog
from sqlalchemy import and_, select
from sqlalchemy import and_, select, text as sql_text
from src.db.base import get_session_factory
from src.db.models import AgentSession, AutoRepairExecution, IncidentEvidence
@@ -143,6 +143,40 @@ class FineTuneExporter:
row_count=len(rows),
path=output_path,
)
# 2026-04-18 ADR-090-D: write to the finetune_exports table (data source for the MASTER §7.1 #3 KPI)
try:
import hashlib, os
_size = os.path.getsize(output_path) if output_path and os.path.exists(output_path) else None
_checksum = None
if output_path and os.path.exists(output_path):
with open(output_path, 'rb') as _f:
_checksum = hashlib.sha256(_f.read()).hexdigest()
_ids = [str(ev.id) for ev in evidences]
async with session_factory() as _db:
await _db.execute(
sql_text("""
INSERT INTO finetune_exports (
export_type, source_table, source_ids,
file_path, record_count, size_bytes, checksum_sha256,
metadata
) VALUES (
'evidence_snapshot', 'incident_evidence', :ids,
:fp, :rc, :sz, :cs, CAST(:md AS jsonb)
)
"""),
{
"ids": _ids,
"fp": output_path,
"rc": len(rows),
"sz": _size,
"cs": _checksum,
"md": json.dumps({"lookback_days": EXPORT_LOOKBACK_DAYS}),
},
)
except Exception as _db_e:
logger.warning("finetune_exports_db_write_failed", error=str(_db_e))
return output_path, len(rows)
async def _build_row(self, db, ev: IncidentEvidence) -> dict | None:

@@ -184,6 +184,40 @@ def classify_alert_early(alertname: str, severity: str, labels: dict | None = No
):
return "backup", "TYPE-1"
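# 2026-04-18 ogt + Claude Opus 4.7: extend the rules to reduce the general fallback (MASTER §7.1 #7 <10%)
# Compiled from the 17 alertnames that fell into general over 7d of measurement:
#
# 5.1 Intercept test alerts (avoid polluting production metrics)
# TestAlert / FingerprintTest / E2ETestAlert / ADR089Test / L4ClosureLoop
# FP[A-Z]... / *FreshUniq* → test category (TYPE-1 notification only)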
if (
alertname.startswith(("Test", "FingerprintTest", "ADR089", "L4Closure", "FPTest"))
or "FreshUniq" in alertname
or alertname in ("E2ETestAlert",)
or alertname.startswith("FP") and alertname[2:3].isupper() # FPTestB, FPTestA
):
return "test", "TYPE-1"
# 5.2 HighCPU / HighMemory / other High* host-resource alerts
if alertname.startswith(("HighCPU", "HighMemory", "HighMem", "HighDisk", "HighLoad")):
return "host_resource", "TYPE-3"
# 5.3 TLS / SSL / ProbeFailure → ssl_cert or external_site
if (
alertname.startswith(("TLS", "SSL", "Certificate"))
or "ProbeFailure" in alertname
or alertname in ("TestConnectivity",) # synonym for ProbeFailure (note: the 5.1 "Test" prefix rule above matches this name first)
):
return "ssl_cert", "TYPE-3"
# 5.4 PostgreSQL in detail (adds PostgreSQL* variants; the original rule uses startswith("Postgres"),
# which should cover PostgreSQLDiskGrowthRate, but in practice it fell into general → add a safety-net rule)
if (
alertname.startswith(("PostgreSQL", "MySQL", "MongoDB"))
or "DiskGrowthRate" in alertname
):
return "database", "TYPE-3"
# 6. Host resources (split out from infrastructure, per the commander's ADR-075 decision)
if alertname.startswith("Host"):
return "host_resource", "TYPE-3"

@@ -1144,6 +1144,77 @@ class OpenClawService:
return None
def _validate_deployment_inventory(
self,
result: "OpenClawDecision | None",
k8s_inventory: str,
k8s_ns: str,
) -> None:
"""
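2026-04-19 ogt + Claude Opus 4.7 (extracted from analyze_alert):
Detects and downgrades hallucinated deployment names. Shared by both paths (analyze_alert + generate_incident_proposal).
Root cause: even with the inventory in the prompt, NEMOTRON still uses the namespace as the deployment name
→ runs kubectl rollout restart deployment/awoooi-prod → "not found"
Fix: extract the deployment name from the kubectl command with a regex and check it against the inventory whitelist;
not in the whitelist → downgrade to NO_ACTION + switch to investigation-only get deploy + confidence 0.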
"""
if not result or not k8s_inventory:
return
_inventory_names = {n.strip() for n in k8s_inventory.split(",") if n.strip()}
if not _inventory_names:
return
_kcmd = (result.kubectl_command or "").lower()
import re as _re
_m = _re.search(r"deployment[/\s]+([a-z0-9][a-z0-9-]*)", _kcmd)
if not _m:
return
_deploy_guess = _m.group(1)
if _deploy_guess in _inventory_names:
return
logger.warning(
"openclaw_deployment_hallucination_detected",
hallucinated=_deploy_guess,
inventory=sorted(_inventory_names),
original_kubectl_cmd=result.kubectl_command,
original_action=(
result.suggested_action.value
if hasattr(result.suggested_action, "value")
else str(result.suggested_action)
),
namespace=k8s_ns,
)
# Downgrade to a safe investigative action; do not run destructive operations
try:
result.kubectl_command = f"kubectl get deploy -n {k8s_ns}"
except Exception:
pass
try:
result.target_resource = "unknown(hallucinated)"
except Exception:
pass
try:
result.suggested_action = SuggestedAction.NO_ACTION
except Exception:
pass
try:
result.action_title = f"[安全降級] 調查 {k8s_ns} 真實資源狀態"
except Exception:
pass
try:
result.description = (
f"[安全降級] 原 LLM 建議的 deployment '{_deploy_guess}' 不在叢集 inventory "
f"({', '.join(sorted(_inventory_names))})。"
f"已降級為純調查動作(kubectl get deploy),請手動確認實際問題資源。"
)
except Exception:
pass
try:
result.confidence = 0.0
except Exception:
pass
def _parse_analysis_result(self, raw_response: str) -> OpenClawDecision | None:
"""
Parse the LLM analysis result, using Pydantic Schema Enforcement
@@ -1198,7 +1269,12 @@ class OpenClawService:
data["confidence"] = 0.0 # truncated/missing → 0.0, must not be fabricated
if "risk_level" not in data:
data["risk_level"] = "low"
if "primary_responsibility" not in data:
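# 2026-04-19 ogt + Claude Opus 4.7, fix AP-3:
# the LLM sometimes sets primary_responsibility to an empty string/None → resp_display shows "❓ 未知" ("unknown")
# Force normalization: empty/None/not in whitelist → infer INFRA or BE from whether kubectl is present (never "unknown")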
_valid_resp = {"FE", "BE", "INFRA", "DB", "COLLAB"}
_cur_resp = str(data.get("primary_responsibility") or "").strip().upper()
if _cur_resp not in _valid_resp:
data["primary_responsibility"] = "INFRA" if "kubectl" in str(data) else "BE"
if "suggested_action" not in data:
data["suggested_action"] = "RESTART_DEPLOYMENT" if "restart" in str(data).lower() else "NO_ACTION"
@@ -1322,44 +1398,8 @@ Trace URL: {signoz_trace_url}
# Parse the result
result = self._parse_analysis_result(raw_response)
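# 2026-04-18 ogt + Claude Opus 4.7: hallucinated deployment name detection and downgrade (Checkpoint-3)
# Root cause: even with the inventory in the prompt, NEMOTRON still uses the namespace "awoooi-prod" as the deployment name
# → at execution time kubectl rollout restart deployment/awoooi-prod → "not found"
# Fix: after the LLM responds, Python verifies that the deployment name in kubectl_command is in the inventory
# if not → downgrade to NO_ACTION + switch to the investigative kubectl get deploy (non-destructive, diagnosis only)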
if result and _k8s_inventory:
_inventory_names = {n.strip() for n in _k8s_inventory.split(",") if n.strip()}
_kcmd = (result.kubectl_command or "").lower()
import re as _re
_m = _re.search(r"deployment[/\s]+([a-z0-9][a-z0-9-]*)", _kcmd)
if _m:
_deploy_guess = _m.group(1)
if _deploy_guess not in _inventory_names:
logger.warning(
"openclaw_deployment_hallucination_detected",
hallucinated=_deploy_guess,
inventory=sorted(_inventory_names),
original_kubectl_cmd=result.kubectl_command,
original_action=result.suggested_action.value if hasattr(result.suggested_action, 'value') else str(result.suggested_action),
)
# Downgrade to a safe investigative action; do not run destructive operations
result.kubectl_command = f"kubectl get deploy -n {_k8s_ns}"
result.target_resource = "unknown(hallucinated)"
# Pydantic enum handling: SuggestedAction is already imported at the top of the file (line 34)
try:
result.suggested_action = SuggestedAction.NO_ACTION
except Exception:
pass
result.description = (
f"[安全降級] 原 LLM 建議的 deployment '{_deploy_guess}' 不在叢集 inventory "
f"({', '.join(sorted(_inventory_names))})。"
f"已降級為純調查動作,請手動確認實際問題資源。"
)
# Reset confidence to zero
try:
result.confidence = 0.0
except Exception:
pass
# 2026-04-18 → 2026-04-19: hallucinated deployment name detection and downgrade (shared helper)
self._validate_deployment_inventory(result, _k8s_inventory, _k8s_ns)
if result:
logger.info(
@@ -1551,6 +1591,15 @@ Focus on:
# Parse the LLM result
result = self._parse_analysis_result(raw_response)
# 2026-04-19 ogt + Claude Opus 4.7: like analyze_alert, this path also needs hallucination validation
# This path has no pre-fetched inventory, so fetch it dynamically
_k8s_ns_for_validate = alert_context.get("namespace", "awoooi-prod") if "alert_context" in dir() else "awoooi-prod"
try:
_k8s_inv = await _fetch_k8s_inventory_for_openclaw(namespace=_k8s_ns_for_validate)
except Exception:
_k8s_inv = ""
self._validate_deployment_inventory(result, _k8s_inv, _k8s_ns_for_validate)
if result:
logger.info(
"proposal_generation_complete",

@@ -265,6 +265,9 @@ class PreDecisionInvestigator:
tool_name = reg.tool.name
snapshot.mcp_health[tool_name] = False # default to failed; overwritten on success
_started = asyncio.get_event_loop().time()
_mcp_status = "failed"
_mcp_error = None
try:
result = await asyncio.wait_for(
reg.provider.execute(tool_name, params),
@@ -277,10 +280,12 @@ class PreDecisionInvestigator:
tool=tool_name,
error=result.error,
)
_mcp_error = str(result.error)[:200] if result.error else "unknown"
return
snapshot.mcp_health[tool_name] = True
snapshot.sensors_succeeded += 1
_mcp_status = "success"
# Fill the corresponding field according to the sensor dimension
raw = result.output
@@ -288,8 +293,73 @@ class PreDecisionInvestigator:
except asyncio.TimeoutError:
logger.warning("investigator_tool_timeout", tool=tool_name, timeout=MCP_TOOL_TIMEOUT_SEC)
except Exception:
_mcp_status = "timeout"
_mcp_error = f"timeout {MCP_TOOL_TIMEOUT_SEC}s"
except Exception as _e:
logger.exception("investigator_tool_error", tool=tool_name)
_mcp_status = "error"
_mcp_error = str(_e)[:200]
finally:
# 2026-04-18 ADR-090-D: log MCP calls to timeline_events (MASTER §7.1 #4 KPI)
try:
_duration_ms = int((asyncio.get_event_loop().time() - _started) * 1000)
asyncio.create_task(_log_mcp_call_to_timeline(
snapshot_incident_id=getattr(snapshot, "incident_id", None),
provider_name=reg.provider.name,
tool_name=tool_name,
status=_mcp_status,
error=_mcp_error,
duration_ms=_duration_ms,
))
except Exception:
pass
async def _log_mcp_call_to_timeline(
snapshot_incident_id: str | None,
provider_name: str,
tool_name: str,
status: str,
error: str | None,
duration_ms: int,
) -> None:
"""
2026-04-18 ADR-090-D: writes MCP calls to timeline_events to support measuring the MASTER §7.1 #4
"MCP calls/24h > 0" KPI.
"""
try:
from sqlalchemy import text as _sql
from src.db.base import get_db_context
import json as _json
_description = _json.dumps({
"provider": provider_name,
"tool": tool_name,
"status": status,
"error": error,
"duration_ms": duration_ms,
}, ensure_ascii=False)
async with get_db_context() as _db:
await _db.execute(
_sql("""
INSERT INTO timeline_events (
incident_id, event_type, status, title, description, actor,
actor_role, created_at
) VALUES (
:iid, 'mcp_call', :st, :tl, :desc, :actor,
'mcp', NOW()
)
"""),
{
"iid": snapshot_incident_id or "unknown",
"st": status,
"tl": f"MCP {provider_name}.{tool_name}"[:100],
"desc": _description[:500],
"actor": provider_name[:50],
},
)
except Exception:
# Fail silently: timeline_events is for auditing and must not disrupt the main MCP flow
pass
# ─────────────────────────────────────────────────────────────────────────────

@@ -1688,6 +1688,64 @@ class TelegramGateway:
message_id=_msg_id,
)
# 2026-04-18 ADR-090-D: write to notification_outcomes (MASTER §7.1 #10 KPI)
try:
from sqlalchemy import text as _sql
from src.db.base import get_db_context
_delivered = "delivered" if _msg_id else "failed"
_notif_type = f"TYPE-3-{alert_category}" if alert_category else "TYPE-3"
async with get_db_context() as _db:
await _db.execute(
_sql("""
INSERT INTO notification_outcomes (
approval_id, channel, notification_type, recipient,
message_id, delivery_status, metadata
) VALUES (
:aid, 'telegram', :nt, :rp,
:mid, :ds, CAST(:md AS jsonb)
)
"""),
{
"aid": approval_id,
"nt": _notif_type,
"rp": str(settings.OPENCLAW_TG_CHAT_ID),
"mid": str(_msg_id) if _msg_id else None,
"ds": _delivered,
"md": '{"risk_level":"' + str(risk_level) + '"}',
},
)
except Exception as _db_e:
logger.warning("notification_outcomes_db_write_failed", error=str(_db_e))
# 2026-04-19 ogt + Claude Opus 4.7: fix AP-1, also store message_id in
# approval_records.telegram_message_id, not only in Redis (which is lost on restart)
if _msg_id:
try:
from src.services.approval_db import get_approval_service
_svc = get_approval_service()
if hasattr(_svc, "update_telegram_message"):
# If an update_telegram_message method exists (usually keyed by incident_id)
# update by incident_id first, then fall back to a direct UPDATE approval_records
from sqlalchemy import text as _sql2
from src.db.base import get_db_context as _gdc
async with _gdc() as _db2:
await _db2.execute(
_sql2("""
UPDATE approval_records
SET telegram_message_id = :mid,
telegram_chat_id = :cid
WHERE id = :aid
"""),
{
"mid": int(_msg_id),
"cid": int(settings.OPENCLAW_TG_CHAT_ID),
"aid": str(approval_id),
},
)
except Exception as _db_e2:
logger.warning("approval_tg_msg_id_db_persist_failed",
approval_id=str(approval_id), error=str(_db_e2))
# 2026-04-10 Claude Sonnet 4.6 Asia/Taipei: store message_id so the card can be updated after auto-repair
# key: tg_approval:{approval_id}, TTL 24h
if _msg_id:
@@ -1935,7 +1993,7 @@ class TelegramGateway:
]
}
return await self._send_request(
_result = await self._send_request(
"sendMessage",
{
"chat_id": settings.OPENCLAW_TG_CHAT_ID,
@@ -1945,6 +2003,176 @@ class TelegramGateway:
},
)
# 2026-04-19 ogt + Claude Opus 4.7: fix TG-4, store the drift message_id in Redis
# so the original card can be edited after drift_adopt/drift_revert runs
try:
_msg_id = _result.get("result", {}).get("message_id")
if _msg_id:
await get_redis().setex(
f"tg_drift:{incident_id}", 86400, str(_msg_id)
)
except Exception as _e:
logger.warning("tg_drift_msg_id_store_failed", incident_id=incident_id, error=str(_e))
return _result
# =========================================================================
# 2026-04-19 ogt + Claude Opus 4.7: drift_* button handler (fixes TG-2)
# =========================================================================
async def _handle_drift_action(
self,
action: str,
approval_id: str,
callback_query_id: str,
user_id: int,
username: str,
user: dict,
) -> dict:
"""
Handles the drift_view / drift_adopt / drift_revert buttons.
On a drift card, approval_id is the report_id (by design of send_drift_card).
"""
report_id = approval_id
logger.info(
"drift_callback_dispatched",
action=action, report_id=report_id,
user_id=user_id, username=username,
)
try:
if action == "drift_view":
await self._answer_callback(callback_query_id, action, text="🔍 撈全部 Diff...")
await self._send_drift_diff_detail(report_id)
return {
"action": action, "approval_id": approval_id,
"user": user, "success": True, "info_action": True,
}
if action == "drift_adopt":
await self._answer_callback(callback_query_id, action, text="✅ 採納中...")
try:
from src.services.drift_adopt_service import get_drift_adopt_service
_adopt_result = await get_drift_adopt_service().adopt_drift(report_id)
_ok = bool(_adopt_result.get("success") if isinstance(_adopt_result, dict) else _adopt_result)
except Exception as _e:
logger.warning("drift_adopt_failed", report_id=report_id, error=str(_e))
_ok = False
await self._edit_drift_card_outcome(
report_id=report_id, verb="已採納", by=username, ok=_ok,
)
return {"action": action, "approval_id": approval_id, "user": user, "success": _ok}
if action == "drift_revert":
await self._answer_callback(callback_query_id, action, text="⏪ 回滾中...")
try:
from src.services.drift_remediator import get_drift_remediator
_revert_result = await get_drift_remediator().revert(report_id)
_ok = bool(_revert_result.get("success") if isinstance(_revert_result, dict) else _revert_result)
except Exception as _e:
logger.warning("drift_revert_failed", report_id=report_id, error=str(_e))
_ok = False
await self._edit_drift_card_outcome(
report_id=report_id, verb="已回滾", by=username, ok=_ok,
)
return {"action": action, "approval_id": approval_id, "user": user, "success": _ok}
except Exception as _outer:
logger.exception("drift_action_handler_error", action=action, error=str(_outer))
return {"action": action, "approval_id": approval_id, "user": user, "success": False}
async def _send_drift_diff_detail(self, report_id: str) -> None:
"""
Send the full Drift Diff to Telegram (response to the drift_view button).
Shows all items (including the HIGH, MEDIUM, actionable, and trivial groups).
"""
try:
from src.repositories.drift_repository import get_drift_repository
_rpt = await get_drift_repository().get_by_id(report_id)
if not _rpt:
await self._send_request("sendMessage", {
"chat_id": settings.OPENCLAW_TG_CHAT_ID,
"text": f"⚠️ 找不到 Drift report <code>{html.escape(report_id)}</code>",
"parse_mode": "HTML",
})
return
_lines = [f"📊 <b>完整 Drift Diff</b> — <code>{html.escape(report_id)}</code>"]
_lines.append(f"Namespace: <code>{html.escape(_rpt.namespace)}</code>")
_lines.append(f"HIGH×{_rpt.high_count} MEDIUM×{_rpt.medium_count} INFO×{_rpt.info_count}")
_lines.append("" * 20)
for i, _item in enumerate(_rpt.items[:50], 1):
_level = getattr(_item.drift_level, "value", str(_item.drift_level))
_emoji = "🔴" if _level == "high" else ("🟡" if _level == "medium" else "")
_field = (_item.field_path or "")[:80]
_git = str(_item.git_value)[:40] if _item.git_value is not None else "(未設)"
_k8s = str(_item.actual_value)[:40] if _item.actual_value is not None else "(未設)"
_lines.append(f"{_emoji} <b>{html.escape(_field)}</b>")
_lines.append(f" Git: <code>{html.escape(_git)}</code>")
_lines.append(f" K8s: <code>{html.escape(_k8s)}</code>")
if len(_rpt.items) > 50:
_lines.append(f"… 還有 {len(_rpt.items) - 50} 項未顯示")
_full = "\n".join(_lines)
# Telegram message limit is 4096 characters
if len(_full) > 4000:
_full = _full[:3950] + "\n… (截斷)"
await self._send_request("sendMessage", {
"chat_id": settings.OPENCLAW_TG_CHAT_ID,
"text": _full,
"parse_mode": "HTML",
"disable_web_page_preview": True,
})
except Exception as _e:
logger.warning("drift_diff_detail_send_failed", report_id=report_id, error=str(_e))
await self._send_request("sendMessage", {
"chat_id": settings.OPENCLAW_TG_CHAT_ID,
"text": f"⚠️ Drift Diff 查詢失敗: <code>{html.escape(str(_e)[:150])}</code>",
"parse_mode": "HTML",
})
async def _edit_drift_card_outcome(
self, report_id: str, verb: str, by: str, ok: bool,
) -> None:
"""
After drift_adopt/drift_revert runs:
1. Remove the buttons from the original card (via editMessageReplyMarkup)
2. Reply under the original card with the outcome (verb / by / success or failure)
"""
_icon = "" if ok else ""
_stamp = (
f"{_icon} <b>{html.escape(verb)}</b> by @{html.escape(by)} "
f"({'成功' if ok else '失敗'})\n"
f"Drift <code>{html.escape(report_id)}</code>"
)
_msg_id: int | None = None
try:
_msg_id_raw = await get_redis().get(f"tg_drift:{report_id}")
if _msg_id_raw:
_msg_id = int(_msg_id_raw)
# Remove the buttons first
await self._send_request("editMessageReplyMarkup", {
"chat_id": settings.OPENCLAW_TG_CHAT_ID,
"message_id": _msg_id,
"reply_markup": {"inline_keyboard": []},
})
except Exception as _e:
logger.warning("drift_card_buttons_remove_failed", report_id=report_id, error=str(_e))
# Send the sign-off stamp message (reply_to the original card, if msg_id is available)
try:
_payload: dict = {
"chat_id": settings.OPENCLAW_TG_CHAT_ID,
"text": _stamp,
"parse_mode": "HTML",
}
if _msg_id:
_payload["reply_to_message_id"] = _msg_id
await self._send_request("sendMessage", _payload)
except Exception as _e:
logger.warning("drift_outcome_stamp_send_failed", report_id=report_id, error=str(_e))
# =========================================================================
# ADR-075: TYPE-8M Meta-System alerts (flywheel / alert-pipeline health)
# 2026-04-12 ogt
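`_send_drift_diff_detail` in the hunk above cuts the joined HTML at a fixed character offset, which can land inside a `<code>` or `<b>` tag. A minimal sketch of a line-based alternative, assuming a hypothetical `join_within_limit` helper (not part of this diff) and the same 4000-character budget the diff uses:

```python
from typing import Iterable


def join_within_limit(
    lines: Iterable[str], limit: int = 4000, notice: str = "… (truncated)"
) -> str:
    """Join lines with newlines, dropping whole trailing lines that do not fit.

    Each line is kept or dropped as a unit, so a line that is well-formed
    HTML on its own is never cut in the middle of a tag.
    """
    kept: list[str] = []
    used = 0
    # Reserve room for the notice line and the newline before it.
    budget = limit - len(notice) - 1
    for line in lines:
        cost = len(line) + (1 if kept else 0)  # +1 for the joining newline
        if used + cost > budget:
            kept.append(notice)
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


if __name__ == "__main__":
    sample = ["<b>spec.replicas</b>"] * 500
    text = join_within_limit(sample)
    print(len(text), text.splitlines()[-1])
```

This only helps when every line closes the tags it opens, which holds for the lines built in `_send_drift_diff_detail`.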
@@ -2722,6 +2950,21 @@ class TelegramGateway:
if guard_result is not None:
return guard_result
# ===================================================================
# Step 1.85: 2026-04-19 ogt + Claude Opus 4.7: handle drift_* buttons directly
# Fixes Telegram subsystem bug TG-2: drift_view/drift_adopt/drift_revert
# previously had no handler, so a press stayed "running" forever or fell through and wrongly triggered approve
# ===================================================================
if action in ("drift_view", "drift_adopt", "drift_revert"):
return await self._handle_drift_action(
action=action,
approval_id=approval_id, # this is the report_id itself
callback_query_id=callback_query_id,
user_id=user_id,
username=username,
user=user,
)
# ===================================================================
# Step 1.9: Phase 5 Sprint 5.3: routing for write-type actions from classification buttons
# 2026-04-14 Claude Sonnet 4.6

@@ -39,7 +39,7 @@ resources:
images:
- name: 192.168.0.110:5000/library/api:IMAGE_TAG_PLACEHOLDER
newName: 192.168.0.110:5000/awoooi/api
- newTag: 6ad73b48345326756677d98e17bfaf72eec74f9d
+ newTag: 98aef55b3176827f9d4edfa47a70f6ba586af688
- name: 192.168.0.110:5000/library/web:IMAGE_TAG_PLACEHOLDER
newName: 192.168.0.110:5000/awoooi/web
- newTag: 6ad73b48345326756677d98e17bfaf72eec74f9d
+ newTag: 98aef55b3176827f9d4edfa47a70f6ba586af688