fix(aiops-p1): Playbook 學習閉環 5斷點全修 + DB Migration(ADR-092 B4)
Some checks failed
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s

【P0.4 補丁】pre_decision_investigator Prometheus query 欄位缺失
- _build_tool_params() 補 "query" 欄位(prometheus_query tool 必要參數)
- 新增 _build_prometheus_query() — 依告警類型生成 PromQL(CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- 修復後 D3_METRICS 感官維度實際取得資料(原本 100% 回 missing_query_parameter)

【P1 Playbook 學習閉環 B1-B5 全修】
- B2 db/models.py: ApprovalRecord 新增 matched_playbook_id 欄位 + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent 新增 incident_id 欄位(MCP 稽核用)+ index
- B3 approval_db.py: record→ApprovalRequest 補回 incident_id + matched_playbook_id
- B4 approval_repository.py: 同 B3(兩個轉換函式必須同步)
- B5 approval_db.py: approval_request_to_record_data 補 matched_playbook_id → DB 才能存值

【P1.5 KM 寫入】approval_execution.py: fire-and-forget → await wait_for(30s)
- 根因:asyncio.create_task 在 Pod recycle 時被殺,KM 寫入靜默遺失
- 修復:await asyncio.wait_for(..., timeout=30.0) + TimeoutError log

【Migration 文件】adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- 執行:psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql

【附帶 Agent 改動】
- decision_manager: Phase 2 YAML NO_ACTION 優先門(主機層/外部服務跳過 Agent Debate)
- alert_rules.yaml: Sentry/ClickHouse + HostDiskUsageHigh/Critical 新規則
- solver_agent: action_title 語意合成兜底(取代靜默丟棄)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Your Name
2026-04-24 15:41:35 +08:00
parent 7f4088bcd0
commit 04ff22563e
11 changed files with 321 additions and 20 deletions

View File

@@ -135,6 +135,8 @@ rules:
- HostUnusualDiskWriteRate
- HostDiskWillFillIn24Hours
- HostOutOfDiskSpace
- HostDiskUsageHigh
- HostDiskUsageCritical
# 網路相關
- HostUnusualNetworkThroughputIn
- HostUnusualNetworkThroughputOut
@@ -764,6 +766,36 @@ rules:
command: "curl -sv {instance} --max-time 10 2>&1 | grep -E '(HTTP|Connected|Failed)'"
reasoning: "[規則匹配] 外部網站下線屬外部依賴,通知統帥後等待服務恢復,必要時切換備援路徑。"
# 2026-04-24 ogt + Claude Sonnet 4.6: Sentry / ClickHouse 監控告警 — 外部服務,禁止 kubectl 操作
- id: sentry_clickhouse_alert
priority: 60
description: Sentry 或 ClickHouse 監控告警(外部服務,不是 K8s workload
match:
alertname:
- SentryClickHouseMemoryPressure
- SentryClickHouseCpuHigh
- SentryClickHouseDiskUsageHigh
- ClickHouseMemoryHigh
- ClickHouseMemoryPressure
- ClickHouseCpuHigh
- ClickHouseReplicationLag
- ClickHouseQuerySlow
- SentryWorkerQueueHigh
- SentryKafkaLag
- SentryBacklogHigh
response:
action_title: "⚠️ Sentry/ClickHouse 告警 — 需 SSH 人工排查"
description: "⚠️ Sentry/ClickHouse 屬外部監控服務,無法透過 kubectl 自動修復。請 SSH 登入服務主機排查根因clickhouse-client / docker stats / journalctl -xe。若記憶體壓力持續考慮調整 ClickHouse max_memory_usage 設定或清理舊資料。"
suggested_action: NO_ACTION
kubectl_command: ""
estimated_downtime: "N/A"
risk: high
responsibility: INFRA
responsibility_reasoning: "Sentry/ClickHouse 基礎設施由 INFRA 團隊管理"
secondary_teams: []
optimization: []
reasoning: "[規則匹配] Sentry/ClickHouse 非 K8s 服務kubectl 操作無效。需 SSH 進入服務主機,確認記憶體/CPU/磁碟狀況後手動介入。"
# ── 通用兜底 ────────────────────────────────────────────────
- id: generic_fallback

View File

@@ -0,0 +1,40 @@
-- ADR-092 B4 — Playbook 學習閉環斷鏈修復DB Schema
-- 根因approval_records 缺 matched_playbook_id → 人工審核後 EWMA 無法更新 Playbook trust score
-- timeline_events 缺 incident_id → pre_decision_investigator MCP 呼叫稽核每天+1 靜默錯誤
--
-- 執行方式(需人工執行一次):
-- psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql
--
-- 2026-04-24 ogt + Claude Sonnet 4.6(亞太)
BEGIN;
-- ─────────────────────────────────────────────────────────────────────────────
-- approval_records: 新增 matched_playbook_id 欄位B2 fix
-- ─────────────────────────────────────────────────────────────────────────────
ALTER TABLE approval_records
ADD COLUMN IF NOT EXISTS matched_playbook_id VARCHAR(36) DEFAULT NULL;
CREATE INDEX IF NOT EXISTS ix_approval_matched_playbook
ON approval_records (matched_playbook_id)
WHERE matched_playbook_id IS NOT NULL;
COMMENT ON COLUMN approval_records.matched_playbook_id
IS 'Playbook ID 命中時紀錄,學習服務讀取以更新 EWMA trust score';
-- ─────────────────────────────────────────────────────────────────────────────
-- timeline_events: 新增 incident_id 欄位P1.6 fix
-- ─────────────────────────────────────────────────────────────────────────────
ALTER TABLE timeline_events
ADD COLUMN IF NOT EXISTS incident_id VARCHAR(64) DEFAULT NULL;
CREATE INDEX IF NOT EXISTS ix_timeline_incident_id
ON timeline_events (incident_id)
WHERE incident_id IS NOT NULL;
COMMENT ON COLUMN timeline_events.incident_id
IS 'MCP 工具呼叫稽核時關聯的 Incident ID';
COMMIT;

View File

@@ -290,24 +290,62 @@ def _extract_candidates(parsed: dict[str, Any]) -> list[CandidateAction]:
# OpenClaw Nemo 格式轉換
# 2026-04-17 ogt + Claude Sonnet 4.6: Nemo path kubectl 驗證
# 根因Nemo 回傳 {"action_title": "重啟 Crash Looping Pod"} 自然語言
# 直接用 action_title 作為 action → 無 kubectl → auto_approve 誤通過 → 死迴圈
# 修復action_title 不含 kubectl → return [](觸發 _degraded_plan 輸出真實 kubectl
# 直接用 action_title → 無 kubectl → auto_approve 誤通過 → 死迴圈
# 2026-04-24 ogt + Claude Sonnet 4.6: 修復靜默丟棄 → 語意合成兜底
# 舊action_title 無 kubectl → return [] → _degraded_plan confidence=0.2
# 新:先嘗試語意合成 kubectl 指令;真的無從映射才 return []
if "action_title" in parsed and "candidates" not in parsed:
action_title = str(parsed.get("action_title", ""))
if "kubectl" not in action_title.lower():
return [] # 交由 _degraded_plan 接手,輸出真實 kubectl 調查指令
confidence = float(parsed.get("confidence", 0.5))
risk_level = str(parsed.get("risk_level", "medium"))
risk_to_blast = {"critical": 60, "high": 40, "medium": 25, "low": 10}
blast = risk_to_blast.get(risk_level.lower(), 30)
if action_title and confidence > 0:
if "kubectl" in action_title.lower():
if action_title and confidence > 0:
return [CandidateAction(
action=action_title[:200],
blast_radius=blast,
rollback_cost=20,
confidence=confidence,
rationale=f"OpenClaw Nemo 建議: {action_title}",
)]
return []
# action_title 無 kubectl → 嘗試語意合成 kubectl 指令
_at_lower = action_title.lower()
_synthesized: str | None = None
if any(w in _at_lower for w in ("rollback", "undo", "回滾", "還原")):
_synthesized = "kubectl rollout undo deployment -n awoooi-prod"
elif any(w in _at_lower for w in ("restart", "重啟", "重新啟動")):
_synthesized = "kubectl rollout restart deployment -n awoooi-prod"
elif any(w in _at_lower for w in ("scale", "擴容", "縮容", "replicas")):
_synthesized = "kubectl scale deployment -n awoooi-prod"
elif any(w in _at_lower for w in ("logs", "日誌", "log")):
_synthesized = "kubectl logs -n awoooi-prod --tail=100 --selector=app=awoooi-api"
elif any(w in _at_lower for w in ("describe", "診斷", "diagnos")):
_synthesized = "kubectl describe pods -n awoooi-prod"
if _synthesized:
logger.debug(
"solver_nemo_action_synthesized",
action_title=action_title[:80],
synthesized=_synthesized,
)
return [CandidateAction(
action=action_title[:200],
action=_synthesized,
blast_radius=blast,
rollback_cost=20,
confidence=confidence,
rationale=f"OpenClaw Nemo 建議: {action_title}",
confidence=min(confidence, 0.5), # 合成指令最高 0.5,避免誤入自動執行
rationale=f"[語意合成] Nemo 建議{action_title[:80]}」→ 轉為 kubectl 指令",
)]
# 完全無從映射 → return [](交由 _degraded_plan 輸出 category-based 調查指令)
logger.debug(
"solver_nemo_no_kubectl_fallback",
action_title=action_title[:80],
reason="action_title 無 kubectl 且語意合成失敗,降級至 _degraded_plan",
)
return []
raw = parsed.get("candidates", [])
@@ -328,24 +366,39 @@ def _extract_candidates(parsed: dict[str, Any]) -> list[CandidateAction]:
def _default_action_for_category(category: str) -> str:
"""降級時的預設調查指令 — 必須是真實 kubectl 命令(調查優先,不執行破壞性操作)
"""降級時的預設調查指令 — 必須是真實 kubectl/ssh 命令(調查優先,不執行破壞性操作)
2026-04-17 ogt + Claude Sonnet 4.6: 改為真實 kubectl 指令
舊:自然語言如 "restart_pod""check_disk_usage" → 無法被 auto_approve 執行
kubectl 調查指令 → 可執行,且均為唯讀操作,無副作用
2026-04-24 ogt + Claude Sonnet 4.6: 擴展非 K8s 類別ClickHouse/主機磁碟/DB
根因SentryClickHouseMemoryPressure/HostDiskUsageHigh 類別不符任何 K8s 關鍵字
→ 全部 fallback 到 "kubectl get pods"(無意義診斷指令)
修復:加入 clickhouse/database/sentry/host/node/infra 類別映射
"""
category_lower = category.lower()
# K8s workload 層
if "pod" in category_lower or "kube" in category_lower or "crash" in category_lower:
return "kubectl get pods -n awoooi-prod -o wide"
if "disk" in category_lower or "storage" in category_lower or "pvc" in category_lower:
return "kubectl exec -n awoooi-prod deployment/postgresql -- df -h"
if "cpu" in category_lower or "load" in category_lower:
return "kubectl top pods -n awoooi-prod --sort-by=cpu"
if "memory" in category_lower or "oom" in category_lower:
return "kubectl top pods -n awoooi-prod --sort-by=memory"
if "network" in category_lower or "connect" in category_lower:
return "kubectl get services -n awoooi-prod"
return "kubectl get pods -n awoooi-prod"
if "disk" in category_lower or "storage" in category_lower or "pvc" in category_lower:
return "kubectl exec -n awoooi-prod deployment/postgresql -- df -h"
# 外部服務層(非 K8s — 唯讀診斷)
if "clickhouse" in category_lower or "sentry" in category_lower:
return "kubectl get pods -n awoooi-prod -l app=sentry -o wide"
if "database" in category_lower or "postgres" in category_lower or "redis" in category_lower:
return "kubectl get pods -n awoooi-prod -l tier=database -o wide"
if "rollback" in category_lower or "deploy" in category_lower or "version" in category_lower:
return "kubectl rollout history deployment -n awoooi-prod"
if "latency" in category_lower or "slow" in category_lower or "timeout" in category_lower:
return "kubectl top pods -n awoooi-prod --sort-by=cpu"
# 主機層host/node/infra — 調查指令kubectl 只查 node 資訊)
if "host" in category_lower or "node" in category_lower or "infra" in category_lower:
return "kubectl describe nodes | grep -A5 'Conditions\\|Allocatable'"
return "kubectl get pods -n awoooi-prod -o wide"
def compute_input_hash(diagnosis: DiagnosisReport) -> str:

View File

@@ -167,6 +167,15 @@ class ApprovalRecord(Base):
comment="Telegram chat_id where the approval card was sent",
)
# B2 fix 2026-04-24 ogt + Claude Sonnet 4.6: Playbook 學習閉環斷鏈修復
# 原欄位缺失 → 人工審核後 matched_playbook_id 永遠 NULL → EWMA 無法更新
matched_playbook_id: Mapped[str | None] = mapped_column(
String(36),
nullable=True,
index=True,
comment="匹配的 Playbook ID學習服務用以更新 EWMA trust score",
)
# Timestamps
created_at: Mapped[datetime] = mapped_column(
DateTime(timezone=True),
@@ -193,6 +202,7 @@ class ApprovalRecord(Base):
Index("ix_approval_created_at", "created_at"),
Index("ix_approval_requested_by", "requested_by"),
Index("ix_approval_fingerprint", "fingerprint"), # 戰略 B: 指紋查詢優化
Index("ix_approval_matched_playbook", "matched_playbook_id"), # B2 fix
)
@@ -244,6 +254,14 @@ class TimelineEvent(Base):
# Context
risk_level: Mapped[str | None] = mapped_column(String(20), nullable=True)
approval_id: Mapped[str | None] = mapped_column(String(36), nullable=True, index=True)
# P1.6 fix 2026-04-24 ogt + Claude Sonnet 4.6: pre_decision_investigator raw SQL 寫不存在欄位
# 原本 INSERT INTO timeline_events (incident_id, ...) 失敗 → 每天+1 錯誤靜默吞
incident_id: Mapped[str | None] = mapped_column(
String(64),
nullable=True,
index=True,
comment="關聯的 Incident IDMCP 事件稽核用)",
)
# Timestamp
created_at: Mapped[datetime] = mapped_column(
@@ -255,6 +273,7 @@ class TimelineEvent(Base):
__table_args__ = (
Index("ix_timeline_event_type", "event_type"),
Index("ix_timeline_created_at", "created_at"),
Index("ix_timeline_incident_id", "incident_id"), # P1.6 fix
)

View File

@@ -103,6 +103,9 @@ def _record_to_request(record: ApprovalRecord) -> ApprovalRequest:
fingerprint=record.fingerprint,
hit_count=record.hit_count,
last_seen_at=record.last_seen_at,
# B4 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補回 DB 欄位(人工審核路徑讀回必要)
incident_id=getattr(record, "incident_id", None),
matched_playbook_id=getattr(record, "matched_playbook_id", None),
)

View File

@@ -109,6 +109,11 @@ def approval_record_to_request(record: ApprovalRecord) -> ApprovalRequest:
fingerprint=record.fingerprint,
hit_count=record.hit_count,
last_seen_at=record.last_seen_at,
# B3 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補回 DB 欄位(人工審核路徑讀回必要)
# incident_id / matched_playbook_id 在 ApprovalRequest 基礎模型中有定義
# telegram_message_id / telegram_chat_id 只在 DB model不在 Pydantic ApprovalRequest
incident_id=getattr(record, "incident_id", None),
matched_playbook_id=getattr(record, "matched_playbook_id", None),
)
@@ -164,6 +169,9 @@ def approval_request_to_record_data(
# 不在 dict 裡導致 DB 欄位永遠 NULLTelegram 卡片顯示 INC 號是空白
# 用戶在 Telegram 根本認不出對應的告警,審核閉環名存實亡
"incident_id": request.incident_id,
# B5 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補 matched_playbook_id
# 原本缺失 → Playbook 人工審核後 trust score 永遠不更新(學習閉環斷鏈)
"matched_playbook_id": getattr(request, "matched_playbook_id", None),
}

View File

@@ -718,9 +718,19 @@ class ApprovalExecutionService:
# 2026-04-04 ogt: 執行結果沉澱到 KM — 移出 try/except 確保 learning 失敗也寫入
# 統帥鐵律: 所有異常與自動修復紀錄必須回寫 KM
asyncio.create_task(
self._write_execution_result_to_km(approval, success, error_message)
)
# P1.5 fix 2026-04-24 ogt + Claude Sonnet 4.6: fire-and-forget → await30s 熔斷)
# 根因asyncio.create_task 在 Pod recycle 時被殺KM 寫入遺失audit D 每天+5
try:
await asyncio.wait_for(
self._write_execution_result_to_km(approval, success, error_message),
timeout=30.0,
)
except asyncio.TimeoutError:
logger.warning(
"km_write_timeout",
approval_id=str(approval.id),
timeout_sec=30.0,
)
async def _run_post_execution_verify(
self,

View File

@@ -1140,6 +1140,7 @@ def _package_to_proposal_data(package: Any) -> dict[str, Any]:
"confidence": confidence,
"risk_level": risk_level,
"source": "phase2_agent_debate",
"provider": "phase2_agents", # 2026-04-24 ogt: 讓 Telegram 顯示 "Phase2 Agents" 而非通用 "AI 仲裁"
"requires_human_review": package.requires_human_approval,
# Phase 2 診斷摘要(供 Audit Trail / 學習閉環,不直接顯示給用戶)
"debate_summary": package.debate_summary or "",
@@ -2283,6 +2284,45 @@ class DecisionManager:
# 需要 EvidenceSnapshot若 P1 未開啟則自行收集
# 2026-04-15 ogt + Claude Sonnet 4.6(亞太)
if aiops_flags.is_phase_enabled(2): # Gate 2: 用 is_phase_enabled 統一父 Phase 守衛
# 2026-04-24 ogt + Claude Sonnet 4.6: YAML NO_ACTION 優先門
# 根因Phase 2 五 agent 對主機層/外部服務告警HostDiskUsageHigh /
# SentryClickHouseMemoryPressure做 kubectl 分析 → Solver 永遠降級
# (無 K8s target) → confidence=20% + kubectl get pods無意義
# 修復YAML 匹配到 NO_ACTION 且 kubectl_command="" → 跳過 Agent Debate
# 直接返回 YAML 規則響應,讓 Telegram 推送人工排查通知
try:
from src.services.alert_rule_engine import match_rule as _p2_match_rule
_p2_labels = incident.signals[0].labels if incident.signals else {}
_p2_alertname = _p2_labels.get("alertname", "")
_p2_yaml = _p2_match_rule({
"labels": _p2_labels,
"alert_type": _p2_alertname,
"message": (
incident.signals[0].annotations.get("summary", "")
if incident.signals else ""
),
"target_resource": incident.affected_services[0] if incident.affected_services else "unknown",
"namespace": _p2_labels.get("namespace", "awoooi-prod"),
"severity": incident.severity.value if hasattr(incident.severity, "value") else "medium",
})
_is_no_action_yaml = (
_p2_yaml is not None
and _p2_yaml.get("suggested_action") == "NO_ACTION"
and not _p2_yaml.get("kubectl_command", "").strip()
and _p2_yaml.get("rule_id", "") not in ("generic_fallback", "")
)
if _is_no_action_yaml:
logger.info(
"p2_yaml_no_action_bypass",
incident_id=incident.incident_id,
alertname=_p2_alertname,
rule_id=_p2_yaml.get("rule_id", ""),
reason="YAML NO_ACTION 規則命中,跳過 Agent Debate",
)
return _p2_yaml
except Exception as _p2_yaml_err:
logger.debug("p2_yaml_precheck_error", error=str(_p2_yaml_err))
p2_snapshot = evidence_snapshot
if p2_snapshot is None:
try:

View File

@@ -466,18 +466,53 @@ SSH_MCP_ALLOWED_HOSTS 使用完整 IP如 "192.168.0.110")。
"""
def _build_prometheus_query(alertname: str, namespace: str, pod_name: str) -> str:
"""依告警類型生成 Prometheus PromQL 查詢(供 prometheus_query tool 使用)。
2026-04-24 ogt + Claude Sonnet 4.6: P0.4 fix — _build_tool_params 補 query 欄位"""
an = alertname.lower()
# CPU / 負載
if any(k in an for k in ("cpu", "load", "throttl")):
filter_pod = f',pod=~"{pod_name}.*"' if pod_name else ""
return f'avg(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"{filter_pod}}}[5m]))'
# 記憶體
elif any(k in an for k in ("memory", "mem", "oom")):
filter_pod = f',pod=~"{pod_name}.*"' if pod_name else ""
return f'avg(container_memory_working_set_bytes{{namespace="{namespace}"{filter_pod}}}) / 1048576'
# CrashLoop / 重啟
elif any(k in an for k in ("crash", "restart", "oom", "backoff")):
return f'sum(increase(kube_pod_container_status_restarts_total{{namespace="{namespace}"}}[15m]))'
# 磁碟 / 儲存
elif any(k in an for k in ("disk", "storage", "pvc", "volume", "capacity")):
return 'sum(kubelet_volume_stats_used_bytes) by (persistentvolumeclaim)'
# HTTP / 可用性
elif any(k in an for k in ("http", "error", "5xx", "probe", "down", "unhealthy")):
return '1 - avg(probe_success)'
# Pod / Container 狀態
elif any(k in an for k in ("pod", "container", "deploy", "replicaset")):
return f'kube_pod_status_phase{{namespace="{namespace}"}}'
# 通用 fallback
else:
return f'up{{namespace="{namespace}"}}'
def _build_tool_params(incident: "Incident") -> dict[str, Any]:
"""從 Incident 提取 MCP 工具呼叫所需的公共參數。"""
labels = _get_labels(incident)
raw_host = labels.get("instance", "").split(":")[0] or labels.get("host", "")
host = _SHORT_HOST_MAP.get(raw_host, raw_host) # 短名 → 完整 IP
namespace = labels.get("namespace", "awoooi-prod")
pod_name = labels.get("pod", labels.get("name", ""))
alertname = labels.get("alertname", "")
return {
"namespace": labels.get("namespace", "awoooi-prod"),
"pod_name": labels.get("pod", labels.get("name", "")),
"namespace": namespace,
"pod_name": pod_name,
"deployment": labels.get("deployment", ""),
"host": host,
"container": labels.get("container", labels.get("name", "")),
"alertname": labels.get("alertname", ""),
"alertname": alertname,
# P0.4 fix 2026-04-24 ogt + Claude Sonnet 4.6: Prometheus tool 需要 query 欄位
# 原本缺少此欄位 → prometheus_query/range tool 傳入空 query → 回傳 error dict
"query": _build_prometheus_query(alertname, namespace, pod_name),
}

View File

@@ -6,6 +6,32 @@
---
## 📍 2026-04-24 — 12 Agent 全景審計 + P0-P2 全面並行修復
### 需求
統帥「請用12位Agent的新遊戲規則進行全景、全流程、全節點的所有 AI 自動化流程優化!到目前為止都還沒有完全正常運作!」
### 審計結論
12 Agent 分工並行掃描:
- 系統有效串接率:~60%125個服務中約75個真正在主流程使用
- 孤立服務12個重要服務零引用trust_drift_detector/rollback_manager 等)
- 7大致命病根詳見 project_audit_20260424.md
### 最關鍵發現
1. **MCP 感官 = 0**Prometheus KeyError 100% + legacy kwarg bug 靜默吞
2. **auto_execute 24h = 0**Gate 9blast_radius 唯讀指令判 human+ Gate 11operation_parser 不認唯讀指令)
3. **Playbook 學習 = 永遠 False**5個斷鏈疊加 + 冷啟動死結
4. **KM +5/天主因**knowledge_extractor_service.py:210 AttributeError 100% 失敗
5. **動態基線9天0筆**5個 PromQL label 全錯cadvisor namespace/container 不對)
6. **timeline_events +1/天**pre_decision_investigator.py:344 raw SQL INSERT 寫不存在欄位
### 修復動作(並行執行中)
- P0.1-P0.6:立即止血(知識萃取/auto_execute gate/MCP/告警規則/動態基線)
- P1.1-P1.5學習閉環修復DB migration + matched_playbook_id 斷鏈)
- P2.1/P2.4/P2.6LLM 品質 + Telegram 中間態 + AI 治理
---
## 📍 2026-04-24 — 12-Agent 全景盤點 + 六大自動化飛輪修復
### 根因(截圖告警分析)

View File

@@ -1698,6 +1698,41 @@ Phase 6 完成後
---
### 2026-04-24 — 12 Agent 全景審計 + P0-P2 並行修復啟動
**審計方式**12 個專責 Agent 並行掃描 125 個服務檔案 × 7 層架構 × 13 個關鍵節點
**核心發現:系統有效串接率 ~60%,連接的 60% 有大量靜默故障**
| 病根 | 位置 | 嚴重度 |
|------|------|--------|
| A. Prometheus MCP 100% KeyError | `pre_decision_investigator.py:469-481` | 🔴🔴🔴 |
| B. auto_execute Gate 9+11 必攔唯讀指令 | `blast_radius_calculator.py:38-61`, `operation_parser.py:51-185` | 🔴🔴🔴 |
| C. Playbook 學習閉環5個斷鏈matched_playbook_id 永遠None | `proposal_service.py:232-257`, `db/models.py:59`, `approval_db.py:91-112` | 🔴🔴🔴 |
| D. KM每天+5主因s.description AttributeError 100% 失敗 | `knowledge_extractor_service.py:210` | 🔴🔴🔴 |
| E. consensus_engine 4個 ExpertAgent confidence=0.0 硬寫死 | `consensus_engine.py:165-334` | 🔴🔴 |
| F. auto_repair 主路徑完全不接 PostExecutionVerifier | `auto_repair_service.py:324-500` | 🔴🔴 |
| G. ProactiveInspector 5個 PromQL 全回空 vector基線9天0筆 | `proactive_inspector.py:40-71` | 🔴🔴 |
| H. timeline_events 每天+1raw SQL INSERT 欄位不存在被靜默吞 | `pre_decision_investigator.py:344,361` | 🔴🔴 |
**孤立服務12個重要服務零引用**trust_drift_detector / rollback_manager / resource_resolver / diagnosis_aggregator / model_rollback_service / kb_rot_cleaner / ci_auto_repair / sentry_webhook_service 等
**P0 修復(今日並行執行)**
- P0.1: knowledge_extractor_service.py:210 s.description → alert_name + annotations.summary
- P0.2+P0.3: blast_radius + operation_parser 補唯讀指令白名單
- P0.4: pre_decision_investigator Prometheus _build_tool_params 補 query 欄位
- P0.5: alert_rules.yaml generic_fallback NO_ACTION + 6條語義矛盾修正
- P0.6: proactive_inspector 5個 PromQL 修正 label + datname
**P1 DB Migration今日並行執行**
- ApprovalRecord 加 matched_playbook_id 欄位Alembic migration
- timeline_events 加 incident_id + stage 欄位
- proposal_service / approval_db / approval_repository 讀寫補齊
- decision_manager:2463 冷啟動豁免total_executions=0 略過 success_rate 過濾)
- approval_execution:702-704 KM 寫入 fire-and-forget → await asyncio.wait_for(timeout=10s)
---
### 2026-04-20 晚 (台北) — C1-C4 全流程串接 — Playbook 鏈路保護commit de2d34d
**觸發**:統帥全景盤查 AI 自動化節點後,發現 Playbook 自動修復鏈路有 3 個結構性斷點。