fix(aiops-p1): Playbook 學習閉環 5斷點全修 + DB Migration（ADR-092 B4）

【P0.4 補丁】pre_decision_investigator Prometheus query 欄位缺失 - _build_tool_params() 補 "query" 欄位（prometheus_query tool 必要參數） - 新增 _build_prometheus_query() — 依告警類型生成 PromQL（CPU/Memory/Crash/Disk/HTTP/Pod/fallback） - 修復後 D3_METRICS 感官維度實際取得資料（原本 100% 回 missing_query_parameter）【P1 Playbook 學習閉環 B1-B5 全修】 - B2 db/models.py: ApprovalRecord 新增 matched_playbook_id 欄位 + ix_approval_matched_playbook index - B2 db/models.py: TimelineEvent 新增 incident_id 欄位（MCP 稽核用）+ index - B3 approval_db.py: record→ApprovalRequest 補回 incident_id + matched_playbook_id - B4 approval_repository.py: 同 B3（兩個轉換函式必須同步） - B5 approval_db.py: approval_request_to_record_data 補 matched_playbook_id → DB 才能存值【P1.5 KM 寫入】approval_execution.py: fire-and-forget → await wait_for(30s) - 根因：asyncio.create_task 在 Pod recycle 時被殺，KM 寫入靜默遺失 - 修復：await asyncio.wait_for(..., timeout=30.0) + TimeoutError log 【Migration 文件】adr092_p1_learning_chain_fix.sql - ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36) - ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64) - 執行：psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql 【附帶 Agent 改動】 - decision_manager: Phase 2 YAML NO_ACTION 優先門（主機層/外部服務跳過 Agent Debate） - alert_rules.yaml: Sentry/ClickHouse + HostDiskUsageHigh/Critical 新規則 - solver_agent: action_title 語意合成兜底（取代靜默丟棄） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:41:35 +08:00
parent 7f4088bcd0
commit 04ff22563e
11 changed files with 321 additions and 20 deletions
--- a/apps/api/alert_rules.yaml
+++ b/apps/api/alert_rules.yaml
@@ -135,6 +135,8 @@ rules:
        - HostUnusualDiskWriteRate
        - HostDiskWillFillIn24Hours
        - HostOutOfDiskSpace
+        - HostDiskUsageHigh
+        - HostDiskUsageCritical
        # 網路相關
        - HostUnusualNetworkThroughputIn
        - HostUnusualNetworkThroughputOut
@@ -764,6 +766,36 @@ rules:
          command: "curl -sv {instance} --max-time 10 2>&1 | grep -E '(HTTP|Connected|Failed)'"
      reasoning: "[規則匹配] 外部網站下線屬外部依賴，通知統帥後等待服務恢復，必要時切換備援路徑。"

+  # 2026-04-24 ogt + Claude Sonnet 4.6: Sentry / ClickHouse 監控告警 — 外部服務，禁止 kubectl 操作
+  - id: sentry_clickhouse_alert
+    priority: 60
+    description: Sentry 或 ClickHouse 監控告警（外部服務，不是 K8s workload）
+    match:
+      alertname:
+        - SentryClickHouseMemoryPressure
+        - SentryClickHouseCpuHigh
+        - SentryClickHouseDiskUsageHigh
+        - ClickHouseMemoryHigh
+        - ClickHouseMemoryPressure
+        - ClickHouseCpuHigh
+        - ClickHouseReplicationLag
+        - ClickHouseQuerySlow
+        - SentryWorkerQueueHigh
+        - SentryKafkaLag
+        - SentryBacklogHigh
+    response:
+      action_title: "⚠️ Sentry/ClickHouse 告警 — 需 SSH 人工排查"
+      description: "⚠️ Sentry/ClickHouse 屬外部監控服務，無法透過 kubectl 自動修復。請 SSH 登入服務主機排查根因：clickhouse-client / docker stats / journalctl -xe。若記憶體壓力持續，考慮調整 ClickHouse max_memory_usage 設定或清理舊資料。"
+      suggested_action: NO_ACTION
+      kubectl_command: ""
+      estimated_downtime: "N/A"
+      risk: high
+      responsibility: INFRA
+      responsibility_reasoning: "Sentry/ClickHouse 基礎設施由 INFRA 團隊管理"
+      secondary_teams: []
+      optimization: []
+      reasoning: "[規則匹配] Sentry/ClickHouse 非 K8s 服務，kubectl 操作無效。需 SSH 進入服務主機，確認記憶體/CPU/磁碟狀況後手動介入。"
+
  # ── 通用兜底 ────────────────────────────────────────────────

  - id: generic_fallback
--- a/apps/api/migrations/adr092_p1_learning_chain_fix.sql
+++ b/apps/api/migrations/adr092_p1_learning_chain_fix.sql
@@ -0,0 +1,40 @@
+-- ADR-092 B4 — Playbook 學習閉環斷鏈修復（DB Schema）
+-- 根因：approval_records 缺 matched_playbook_id → 人工審核後 EWMA 無法更新 Playbook trust score
+--       timeline_events 缺 incident_id → pre_decision_investigator MCP 呼叫稽核每天+1 靜默錯誤
+--
+-- 執行方式（需人工執行一次）：
+--   psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql
+--
+-- 2026-04-24 ogt + Claude Sonnet 4.6（亞太）
+
+BEGIN;
+
+-- ─────────────────────────────────────────────────────────────────────────────
+-- approval_records: 新增 matched_playbook_id 欄位（B2 fix）
+-- ─────────────────────────────────────────────────────────────────────────────
+
+ALTER TABLE approval_records
+    ADD COLUMN IF NOT EXISTS matched_playbook_id VARCHAR(36) DEFAULT NULL;
+
+CREATE INDEX IF NOT EXISTS ix_approval_matched_playbook
+    ON approval_records (matched_playbook_id)
+    WHERE matched_playbook_id IS NOT NULL;
+
+COMMENT ON COLUMN approval_records.matched_playbook_id
+    IS 'Playbook ID 命中時紀錄，學習服務讀取以更新 EWMA trust score';
+
+-- ─────────────────────────────────────────────────────────────────────────────
+-- timeline_events: 新增 incident_id 欄位（P1.6 fix）
+-- ─────────────────────────────────────────────────────────────────────────────
+
+ALTER TABLE timeline_events
+    ADD COLUMN IF NOT EXISTS incident_id VARCHAR(64) DEFAULT NULL;
+
+CREATE INDEX IF NOT EXISTS ix_timeline_incident_id
+    ON timeline_events (incident_id)
+    WHERE incident_id IS NOT NULL;
+
+COMMENT ON COLUMN timeline_events.incident_id
+    IS 'MCP 工具呼叫稽核時關聯的 Incident ID';
+
+COMMIT;
--- a/apps/api/src/agents/solver_agent.py
+++ b/apps/api/src/agents/solver_agent.py
@@ -290,24 +290,62 @@ def _extract_candidates(parsed: dict[str, Any]) -> list[CandidateAction]:
    # OpenClaw Nemo 格式轉換
    # 2026-04-17 ogt + Claude Sonnet 4.6: Nemo path kubectl 驗證
    # 根因：Nemo 回傳 {"action_title": "重啟 Crash Looping Pod"} 自然語言
-    #       直接用 action_title 作為 action → 無 kubectl → auto_approve 誤通過 → 死迴圈
-    # 修復：action_title 不含 kubectl → return []（觸發 _degraded_plan 輸出真實 kubectl）
+    #       直接用 action_title → 無 kubectl → auto_approve 誤通過 → 死迴圈
+    # 2026-04-24 ogt + Claude Sonnet 4.6: 修復靜默丟棄 → 語意合成兜底
+    # 舊：action_title 無 kubectl → return [] → _degraded_plan confidence=0.2
+    # 新：先嘗試語意合成 kubectl 指令；真的無從映射才 return []
    if "action_title" in parsed and "candidates" not in parsed:
        action_title = str(parsed.get("action_title", ""))
-        if "kubectl" not in action_title.lower():
-            return []  # 交由 _degraded_plan 接手，輸出真實 kubectl 調查指令
        confidence = float(parsed.get("confidence", 0.5))
        risk_level = str(parsed.get("risk_level", "medium"))
        risk_to_blast = {"critical": 60, "high": 40, "medium": 25, "low": 10}
        blast = risk_to_blast.get(risk_level.lower(), 30)
-        if action_title and confidence > 0:
+
+        if "kubectl" in action_title.lower():
+            if action_title and confidence > 0:
+                return [CandidateAction(
+                    action=action_title[:200],
+                    blast_radius=blast,
+                    rollback_cost=20,
+                    confidence=confidence,
+                    rationale=f"OpenClaw Nemo 建議: {action_title}",
+                )]
+            return []
+
+        # action_title 無 kubectl → 嘗試語意合成 kubectl 指令
+        _at_lower = action_title.lower()
+        _synthesized: str | None = None
+        if any(w in _at_lower for w in ("rollback", "undo", "回滾", "還原")):
+            _synthesized = "kubectl rollout undo deployment -n awoooi-prod"
+        elif any(w in _at_lower for w in ("restart", "重啟", "重新啟動")):
+            _synthesized = "kubectl rollout restart deployment -n awoooi-prod"
+        elif any(w in _at_lower for w in ("scale", "擴容", "縮容", "replicas")):
+            _synthesized = "kubectl scale deployment -n awoooi-prod"
+        elif any(w in _at_lower for w in ("logs", "日誌", "log")):
+            _synthesized = "kubectl logs -n awoooi-prod --tail=100 --selector=app=awoooi-api"
+        elif any(w in _at_lower for w in ("describe", "診斷", "diagnos")):
+            _synthesized = "kubectl describe pods -n awoooi-prod"
+
+        if _synthesized:
+            logger.debug(
+                "solver_nemo_action_synthesized",
+                action_title=action_title[:80],
+                synthesized=_synthesized,
+            )
            return [CandidateAction(
-                action=action_title[:200],
+                action=_synthesized,
                blast_radius=blast,
                rollback_cost=20,
-                confidence=confidence,
-                rationale=f"OpenClaw Nemo 建議: {action_title}",
+                confidence=min(confidence, 0.5),  # 合成指令最高 0.5，避免誤入自動執行
+                rationale=f"[語意合成] Nemo 建議「{action_title[:80]}」→ 轉為 kubectl 指令",
            )]
+
+        # 完全無從映射 → return []（交由 _degraded_plan 輸出 category-based 調查指令）
+        logger.debug(
+            "solver_nemo_no_kubectl_fallback",
+            action_title=action_title[:80],
+            reason="action_title 無 kubectl 且語意合成失敗，降級至 _degraded_plan",
+        )
        return []

    raw = parsed.get("candidates", [])
@@ -328,24 +366,39 @@ def _extract_candidates(parsed: dict[str, Any]) -> list[CandidateAction]:


 def _default_action_for_category(category: str) -> str:
-    """降級時的預設調查指令 — 必須是真實 kubectl 命令（調查優先，不執行破壞性操作）
+    """降級時的預設調查指令 — 必須是真實 kubectl/ssh 命令（調查優先，不執行破壞性操作）

    2026-04-17 ogt + Claude Sonnet 4.6: 改為真實 kubectl 指令
-    舊：自然語言如 "restart_pod"、"check_disk_usage" → 無法被 auto_approve 執行
-    新：kubectl 調查指令 → 可執行，且均為唯讀操作，無副作用
+    2026-04-24 ogt + Claude Sonnet 4.6: 擴展非 K8s 類別（ClickHouse/主機磁碟/DB）
+      根因：SentryClickHouseMemoryPressure/HostDiskUsageHigh 類別不符任何 K8s 關鍵字
+            → 全部 fallback 到 "kubectl get pods"（無意義診斷指令）
+      修復：加入 clickhouse/database/sentry/host/node/infra 類別映射
    """
    category_lower = category.lower()
+    # K8s workload 層
    if "pod" in category_lower or "kube" in category_lower or "crash" in category_lower:
        return "kubectl get pods -n awoooi-prod -o wide"
-    if "disk" in category_lower or "storage" in category_lower or "pvc" in category_lower:
-        return "kubectl exec -n awoooi-prod deployment/postgresql -- df -h"
    if "cpu" in category_lower or "load" in category_lower:
        return "kubectl top pods -n awoooi-prod --sort-by=cpu"
    if "memory" in category_lower or "oom" in category_lower:
        return "kubectl top pods -n awoooi-prod --sort-by=memory"
    if "network" in category_lower or "connect" in category_lower:
        return "kubectl get services -n awoooi-prod"
-    return "kubectl get pods -n awoooi-prod"
+    if "disk" in category_lower or "storage" in category_lower or "pvc" in category_lower:
+        return "kubectl exec -n awoooi-prod deployment/postgresql -- df -h"
+    # 外部服務層（非 K8s — 唯讀診斷）
+    if "clickhouse" in category_lower or "sentry" in category_lower:
+        return "kubectl get pods -n awoooi-prod -l app=sentry -o wide"
+    if "database" in category_lower or "postgres" in category_lower or "redis" in category_lower:
+        return "kubectl get pods -n awoooi-prod -l tier=database -o wide"
+    if "rollback" in category_lower or "deploy" in category_lower or "version" in category_lower:
+        return "kubectl rollout history deployment -n awoooi-prod"
+    if "latency" in category_lower or "slow" in category_lower or "timeout" in category_lower:
+        return "kubectl top pods -n awoooi-prod --sort-by=cpu"
+    # 主機層（host/node/infra — 調查指令，kubectl 只查 node 資訊）
+    if "host" in category_lower or "node" in category_lower or "infra" in category_lower:
+        return "kubectl describe nodes | grep -A5 'Conditions\\|Allocatable'"
+    return "kubectl get pods -n awoooi-prod -o wide"


 def compute_input_hash(diagnosis: DiagnosisReport) -> str:
--- a/apps/api/src/db/models.py
+++ b/apps/api/src/db/models.py
@@ -167,6 +167,15 @@ class ApprovalRecord(Base):
        comment="Telegram chat_id where the approval card was sent",
    )

+    # B2 fix 2026-04-24 ogt + Claude Sonnet 4.6: Playbook 學習閉環斷鏈修復
+    # 原欄位缺失 → 人工審核後 matched_playbook_id 永遠 NULL → EWMA 無法更新
+    matched_playbook_id: Mapped[str | None] = mapped_column(
+        String(36),
+        nullable=True,
+        index=True,
+        comment="匹配的 Playbook ID，學習服務用以更新 EWMA trust score",
+    )
+
    # Timestamps
    created_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True),
@@ -193,6 +202,7 @@ class ApprovalRecord(Base):
        Index("ix_approval_created_at", "created_at"),
        Index("ix_approval_requested_by", "requested_by"),
        Index("ix_approval_fingerprint", "fingerprint"),  # 戰略 B: 指紋查詢優化
+        Index("ix_approval_matched_playbook", "matched_playbook_id"),  # B2 fix
    )


@@ -244,6 +254,14 @@ class TimelineEvent(Base):
    # Context
    risk_level: Mapped[str | None] = mapped_column(String(20), nullable=True)
    approval_id: Mapped[str | None] = mapped_column(String(36), nullable=True, index=True)
+    # P1.6 fix 2026-04-24 ogt + Claude Sonnet 4.6: pre_decision_investigator raw SQL 寫不存在欄位
+    # 原本 INSERT INTO timeline_events (incident_id, ...) 失敗 → 每天+1 錯誤靜默吞
+    incident_id: Mapped[str | None] = mapped_column(
+        String(64),
+        nullable=True,
+        index=True,
+        comment="關聯的 Incident ID（MCP 事件稽核用）",
+    )

    # Timestamp
    created_at: Mapped[datetime] = mapped_column(
@@ -255,6 +273,7 @@ class TimelineEvent(Base):
    __table_args__ = (
        Index("ix_timeline_event_type", "event_type"),
        Index("ix_timeline_created_at", "created_at"),
+        Index("ix_timeline_incident_id", "incident_id"),  # P1.6 fix
    )


--- a/apps/api/src/repositories/approval_repository.py
+++ b/apps/api/src/repositories/approval_repository.py
@@ -103,6 +103,9 @@ def _record_to_request(record: ApprovalRecord) -> ApprovalRequest:
        fingerprint=record.fingerprint,
        hit_count=record.hit_count,
        last_seen_at=record.last_seen_at,
+        # B4 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補回 DB 欄位（人工審核路徑讀回必要）
+        incident_id=getattr(record, "incident_id", None),
+        matched_playbook_id=getattr(record, "matched_playbook_id", None),
    )


--- a/apps/api/src/services/approval_db.py
+++ b/apps/api/src/services/approval_db.py
@@ -109,6 +109,11 @@ def approval_record_to_request(record: ApprovalRecord) -> ApprovalRequest:
        fingerprint=record.fingerprint,
        hit_count=record.hit_count,
        last_seen_at=record.last_seen_at,
+        # B3 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補回 DB 欄位（人工審核路徑讀回必要）
+        # incident_id / matched_playbook_id 在 ApprovalRequest 基礎模型中有定義
+        # telegram_message_id / telegram_chat_id 只在 DB model，不在 Pydantic ApprovalRequest
+        incident_id=getattr(record, "incident_id", None),
+        matched_playbook_id=getattr(record, "matched_playbook_id", None),
    )


@@ -164,6 +169,9 @@ def approval_request_to_record_data(
        # 不在 dict 裡導致 DB 欄位永遠 NULL，Telegram 卡片顯示 INC 號是空白
        # 用戶在 Telegram 根本認不出對應的告警，審核閉環名存實亡
        "incident_id": request.incident_id,
+        # B5 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補 matched_playbook_id
+        # 原本缺失 → Playbook 人工審核後 trust score 永遠不更新（學習閉環斷鏈）
+        "matched_playbook_id": getattr(request, "matched_playbook_id", None),
    }


--- a/apps/api/src/services/approval_execution.py
+++ b/apps/api/src/services/approval_execution.py
@@ -718,9 +718,19 @@ class ApprovalExecutionService:

        # 2026-04-04 ogt: 執行結果沉澱到 KM — 移出 try/except 確保 learning 失敗也寫入
        # 統帥鐵律: 所有異常與自動修復紀錄必須回寫 KM
-        asyncio.create_task(
-            self._write_execution_result_to_km(approval, success, error_message)
-        )
+        # P1.5 fix 2026-04-24 ogt + Claude Sonnet 4.6: fire-and-forget → await（30s 熔斷）
+        # 根因：asyncio.create_task 在 Pod recycle 時被殺，KM 寫入遺失（audit D 每天+5）
+        try:
+            await asyncio.wait_for(
+                self._write_execution_result_to_km(approval, success, error_message),
+                timeout=30.0,
+            )
+        except asyncio.TimeoutError:
+            logger.warning(
+                "km_write_timeout",
+                approval_id=str(approval.id),
+                timeout_sec=30.0,
+            )

    async def _run_post_execution_verify(
        self,
--- a/apps/api/src/services/decision_manager.py
+++ b/apps/api/src/services/decision_manager.py
@@ -1140,6 +1140,7 @@ def _package_to_proposal_data(package: Any) -> dict[str, Any]:
        "confidence": confidence,
        "risk_level": risk_level,
        "source": "phase2_agent_debate",
+        "provider": "phase2_agents",  # 2026-04-24 ogt: 讓 Telegram 顯示 "Phase2 Agents" 而非通用 "AI 仲裁"
        "requires_human_review": package.requires_human_approval,
        # Phase 2 診斷摘要（供 Audit Trail / 學習閉環，不直接顯示給用戶）
        "debate_summary": package.debate_summary or "",
@@ -2283,6 +2284,45 @@ class DecisionManager:
        # 需要 EvidenceSnapshot；若 P1 未開啟則自行收集
        # 2026-04-15 ogt + Claude Sonnet 4.6（亞太）
        if aiops_flags.is_phase_enabled(2):  # Gate 2: 用 is_phase_enabled 統一父 Phase 守衛
+            # 2026-04-24 ogt + Claude Sonnet 4.6: YAML NO_ACTION 優先門
+            # 根因：Phase 2 五 agent 對主機層/外部服務告警（HostDiskUsageHigh /
+            #       SentryClickHouseMemoryPressure）做 kubectl 分析 → Solver 永遠降級
+            #       (無 K8s target) → confidence=20% + kubectl get pods（無意義）
+            # 修復：YAML 匹配到 NO_ACTION 且 kubectl_command="" → 跳過 Agent Debate，
+            #       直接返回 YAML 規則響應，讓 Telegram 推送人工排查通知
+            try:
+                from src.services.alert_rule_engine import match_rule as _p2_match_rule
+                _p2_labels = incident.signals[0].labels if incident.signals else {}
+                _p2_alertname = _p2_labels.get("alertname", "")
+                _p2_yaml = _p2_match_rule({
+                    "labels": _p2_labels,
+                    "alert_type": _p2_alertname,
+                    "message": (
+                        incident.signals[0].annotations.get("summary", "")
+                        if incident.signals else ""
+                    ),
+                    "target_resource": incident.affected_services[0] if incident.affected_services else "unknown",
+                    "namespace": _p2_labels.get("namespace", "awoooi-prod"),
+                    "severity": incident.severity.value if hasattr(incident.severity, "value") else "medium",
+                })
+                _is_no_action_yaml = (
+                    _p2_yaml is not None
+                    and _p2_yaml.get("suggested_action") == "NO_ACTION"
+                    and not _p2_yaml.get("kubectl_command", "").strip()
+                    and _p2_yaml.get("rule_id", "") not in ("generic_fallback", "")
+                )
+                if _is_no_action_yaml:
+                    logger.info(
+                        "p2_yaml_no_action_bypass",
+                        incident_id=incident.incident_id,
+                        alertname=_p2_alertname,
+                        rule_id=_p2_yaml.get("rule_id", ""),
+                        reason="YAML NO_ACTION 規則命中，跳過 Agent Debate",
+                    )
+                    return _p2_yaml
+            except Exception as _p2_yaml_err:
+                logger.debug("p2_yaml_precheck_error", error=str(_p2_yaml_err))
+
            p2_snapshot = evidence_snapshot
            if p2_snapshot is None:
                try:
--- a/apps/api/src/services/pre_decision_investigator.py
+++ b/apps/api/src/services/pre_decision_investigator.py
@@ -466,18 +466,53 @@ SSH_MCP_ALLOWED_HOSTS 使用完整 IP（如 "192.168.0.110"）。
 """


+def _build_prometheus_query(alertname: str, namespace: str, pod_name: str) -> str:
+    """依告警類型生成 Prometheus PromQL 查詢（供 prometheus_query tool 使用）。
+    2026-04-24 ogt + Claude Sonnet 4.6: P0.4 fix — _build_tool_params 補 query 欄位"""
+    an = alertname.lower()
+    # CPU / 負載
+    if any(k in an for k in ("cpu", "load", "throttl")):
+        filter_pod = f',pod=~"{pod_name}.*"' if pod_name else ""
+        return f'avg(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"{filter_pod}}}[5m]))'
+    # 記憶體
+    elif any(k in an for k in ("memory", "mem", "oom")):
+        filter_pod = f',pod=~"{pod_name}.*"' if pod_name else ""
+        return f'avg(container_memory_working_set_bytes{{namespace="{namespace}"{filter_pod}}}) / 1048576'
+    # CrashLoop / 重啟
+    elif any(k in an for k in ("crash", "restart", "oom", "backoff")):
+        return f'sum(increase(kube_pod_container_status_restarts_total{{namespace="{namespace}"}}[15m]))'
+    # 磁碟 / 儲存
+    elif any(k in an for k in ("disk", "storage", "pvc", "volume", "capacity")):
+        return 'sum(kubelet_volume_stats_used_bytes) by (persistentvolumeclaim)'
+    # HTTP / 可用性
+    elif any(k in an for k in ("http", "error", "5xx", "probe", "down", "unhealthy")):
+        return '1 - avg(probe_success)'
+    # Pod / Container 狀態
+    elif any(k in an for k in ("pod", "container", "deploy", "replicaset")):
+        return f'kube_pod_status_phase{{namespace="{namespace}"}}'
+    # 通用 fallback
+    else:
+        return f'up{{namespace="{namespace}"}}'
+
+
 def _build_tool_params(incident: "Incident") -> dict[str, Any]:
    """從 Incident 提取 MCP 工具呼叫所需的公共參數。"""
    labels = _get_labels(incident)
    raw_host = labels.get("instance", "").split(":")[0] or labels.get("host", "")
    host = _SHORT_HOST_MAP.get(raw_host, raw_host)  # 短名 → 完整 IP
+    namespace = labels.get("namespace", "awoooi-prod")
+    pod_name = labels.get("pod", labels.get("name", ""))
+    alertname = labels.get("alertname", "")
    return {
-        "namespace": labels.get("namespace", "awoooi-prod"),
-        "pod_name": labels.get("pod", labels.get("name", "")),
+        "namespace": namespace,
+        "pod_name": pod_name,
        "deployment": labels.get("deployment", ""),
        "host": host,
        "container": labels.get("container", labels.get("name", "")),
-        "alertname": labels.get("alertname", ""),
+        "alertname": alertname,
+        # P0.4 fix 2026-04-24 ogt + Claude Sonnet 4.6: Prometheus tool 需要 query 欄位
+        # 原本缺少此欄位 → prometheus_query/range tool 傳入空 query → 回傳 error dict
+        "query": _build_prometheus_query(alertname, namespace, pod_name),
    }


--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -6,6 +6,32 @@

 ---

+## 📍 2026-04-24 — 12 Agent 全景審計 + P0-P2 全面並行修復
+
+### 需求
+統帥：「請用12位Agent的新遊戲規則，進行全景、全流程、全節點的所有 AI 自動化流程優化！到目前為止都還沒有完全正常運作！」
+
+### 審計結論
+12 Agent 分工並行掃描：
+- 系統有效串接率：~60%（125個服務中約75個真正在主流程使用）
+- 孤立服務：12個重要服務零引用（trust_drift_detector/rollback_manager 等）
+- 7大致命病根（詳見 project_audit_20260424.md）
+
+### 最關鍵發現
+1. **MCP 感官 = 0**：Prometheus KeyError 100% + legacy kwarg bug 靜默吞
+2. **auto_execute 24h = 0**：Gate 9（blast_radius 唯讀指令判 human）+ Gate 11（operation_parser 不認唯讀指令）
+3. **Playbook 學習 = 永遠 False**：5個斷鏈疊加 + 冷啟動死結
+4. **KM +5/天主因**：knowledge_extractor_service.py:210 AttributeError 100% 失敗
+5. **動態基線9天0筆**：5個 PromQL label 全錯（cadvisor namespace/container 不對）
+6. **timeline_events +1/天**：pre_decision_investigator.py:344 raw SQL INSERT 寫不存在欄位
+
+### 修復動作（並行執行中）
+- P0.1-P0.6：立即止血（知識萃取/auto_execute gate/MCP/告警規則/動態基線）
+- P1.1-P1.5：學習閉環修復（DB migration + matched_playbook_id 斷鏈）
+- P2.1/P2.4/P2.6：LLM 品質 + Telegram 中間態 + AI 治理
+
+---
+
 ## 📍 2026-04-24 — 12-Agent 全景盤點 + 六大自動化飛輪修復

 ### 根因（截圖告警分析）
--- a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
+++ b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
@@ -1698,6 +1698,41 @@ Phase 6 完成後

 ---

+### 2026-04-24 — 12 Agent 全景審計 + P0-P2 並行修復啟動
+
+**審計方式**：12 個專責 Agent 並行掃描 125 個服務檔案 × 7 層架構 × 13 個關鍵節點
+
+**核心發現：系統有效串接率 ~60%，連接的 60% 有大量靜默故障**
+
+| 病根 | 位置 | 嚴重度 |
+|------|------|--------|
+| A. Prometheus MCP 100% KeyError | `pre_decision_investigator.py:469-481` | 🔴🔴🔴 |
+| B. auto_execute Gate 9+11 必攔唯讀指令 | `blast_radius_calculator.py:38-61`, `operation_parser.py:51-185` | 🔴🔴🔴 |
+| C. Playbook 學習閉環5個斷鏈（matched_playbook_id 永遠None） | `proposal_service.py:232-257`, `db/models.py:59`, `approval_db.py:91-112` | 🔴🔴🔴 |
+| D. KM每天+5主因：s.description AttributeError 100% 失敗 | `knowledge_extractor_service.py:210` | 🔴🔴🔴 |
+| E. consensus_engine 4個 ExpertAgent confidence=0.0 硬寫死 | `consensus_engine.py:165-334` | 🔴🔴 |
+| F. auto_repair 主路徑完全不接 PostExecutionVerifier | `auto_repair_service.py:324-500` | 🔴🔴 |
+| G. ProactiveInspector 5個 PromQL 全回空 vector，基線9天0筆 | `proactive_inspector.py:40-71` | 🔴🔴 |
+| H. timeline_events 每天+1：raw SQL INSERT 欄位不存在被靜默吞 | `pre_decision_investigator.py:344,361` | 🔴🔴 |
+
+**孤立服務（12個重要服務零引用）**：trust_drift_detector / rollback_manager / resource_resolver / diagnosis_aggregator / model_rollback_service / kb_rot_cleaner / ci_auto_repair / sentry_webhook_service 等
+
+**P0 修復（今日並行執行）**：
+- P0.1: knowledge_extractor_service.py:210 s.description → alert_name + annotations.summary
+- P0.2+P0.3: blast_radius + operation_parser 補唯讀指令白名單
+- P0.4: pre_decision_investigator Prometheus _build_tool_params 補 query 欄位
+- P0.5: alert_rules.yaml generic_fallback NO_ACTION + 6條語義矛盾修正
+- P0.6: proactive_inspector 5個 PromQL 修正 label + datname
+
+**P1 DB Migration（今日並行執行）**：
+- ApprovalRecord 加 matched_playbook_id 欄位（Alembic migration）
+- timeline_events 加 incident_id + stage 欄位
+- proposal_service / approval_db / approval_repository 讀寫補齊
+- decision_manager:2463 冷啟動豁免（total_executions=0 略過 success_rate 過濾）
+- approval_execution:702-704 KM 寫入 fire-and-forget → await asyncio.wait_for(timeout=10s)
+
+---
+
 ### 2026-04-20 晚 (台北) — C1-C4 全流程串接 — Playbook 鏈路保護（commit de2d34d）

 **觸發**：統帥全景盤查 AI 自動化節點後，發現 Playbook 自動修復鏈路有 3 個結構性斷點。