fix(aiops-p1): Playbook 學習閉環 5斷點全修 + DB Migration(ADR-092 B4)
【P0.4 補丁】pre_decision_investigator Prometheus query 欄位缺失 - _build_tool_params() 補 "query" 欄位(prometheus_query tool 必要參數) - 新增 _build_prometheus_query() — 依告警類型生成 PromQL(CPU/Memory/Crash/Disk/HTTP/Pod/fallback) - 修復後 D3_METRICS 感官維度實際取得資料(原本 100% 回 missing_query_parameter) 【P1 Playbook 學習閉環 B1-B5 全修】 - B2 db/models.py: ApprovalRecord 新增 matched_playbook_id 欄位 + ix_approval_matched_playbook index - B2 db/models.py: TimelineEvent 新增 incident_id 欄位(MCP 稽核用)+ index - B3 approval_db.py: record→ApprovalRequest 補回 incident_id + matched_playbook_id - B4 approval_repository.py: 同 B3(兩個轉換函式必須同步) - B5 approval_db.py: approval_request_to_record_data 補 matched_playbook_id → DB 才能存值 【P1.5 KM 寫入】approval_execution.py: fire-and-forget → await wait_for(30s) - 根因:asyncio.create_task 在 Pod recycle 時被殺,KM 寫入靜默遺失 - 修復:await asyncio.wait_for(..., timeout=30.0) + TimeoutError log 【Migration 文件】adr092_p1_learning_chain_fix.sql - ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36) - ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64) - 執行:psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql 【附帶 Agent 改動】 - decision_manager: Phase 2 YAML NO_ACTION 優先門(主機層/外部服務跳過 Agent Debate) - alert_rules.yaml: Sentry/ClickHouse + HostDiskUsageHigh/Critical 新規則 - solver_agent: action_title 語意合成兜底(取代靜默丟棄) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -135,6 +135,8 @@ rules:
|
||||
- HostUnusualDiskWriteRate
|
||||
- HostDiskWillFillIn24Hours
|
||||
- HostOutOfDiskSpace
|
||||
- HostDiskUsageHigh
|
||||
- HostDiskUsageCritical
|
||||
# 網路相關
|
||||
- HostUnusualNetworkThroughputIn
|
||||
- HostUnusualNetworkThroughputOut
|
||||
@@ -764,6 +766,36 @@ rules:
|
||||
command: "curl -sv {instance} --max-time 10 2>&1 | grep -E '(HTTP|Connected|Failed)'"
|
||||
reasoning: "[規則匹配] 外部網站下線屬外部依賴,通知統帥後等待服務恢復,必要時切換備援路徑。"
|
||||
|
||||
# 2026-04-24 ogt + Claude Sonnet 4.6: Sentry / ClickHouse 監控告警 — 外部服務,禁止 kubectl 操作
|
||||
- id: sentry_clickhouse_alert
|
||||
priority: 60
|
||||
description: Sentry 或 ClickHouse 監控告警(外部服務,不是 K8s workload)
|
||||
match:
|
||||
alertname:
|
||||
- SentryClickHouseMemoryPressure
|
||||
- SentryClickHouseCpuHigh
|
||||
- SentryClickHouseDiskUsageHigh
|
||||
- ClickHouseMemoryHigh
|
||||
- ClickHouseMemoryPressure
|
||||
- ClickHouseCpuHigh
|
||||
- ClickHouseReplicationLag
|
||||
- ClickHouseQuerySlow
|
||||
- SentryWorkerQueueHigh
|
||||
- SentryKafkaLag
|
||||
- SentryBacklogHigh
|
||||
response:
|
||||
action_title: "⚠️ Sentry/ClickHouse 告警 — 需 SSH 人工排查"
|
||||
description: "⚠️ Sentry/ClickHouse 屬外部監控服務,無法透過 kubectl 自動修復。請 SSH 登入服務主機排查根因:clickhouse-client / docker stats / journalctl -xe。若記憶體壓力持續,考慮調整 ClickHouse max_memory_usage 設定或清理舊資料。"
|
||||
suggested_action: NO_ACTION
|
||||
kubectl_command: ""
|
||||
estimated_downtime: "N/A"
|
||||
risk: high
|
||||
responsibility: INFRA
|
||||
responsibility_reasoning: "Sentry/ClickHouse 基礎設施由 INFRA 團隊管理"
|
||||
secondary_teams: []
|
||||
optimization: []
|
||||
reasoning: "[規則匹配] Sentry/ClickHouse 非 K8s 服務,kubectl 操作無效。需 SSH 進入服務主機,確認記憶體/CPU/磁碟狀況後手動介入。"
|
||||
|
||||
# ── 通用兜底 ────────────────────────────────────────────────
|
||||
|
||||
- id: generic_fallback
|
||||
|
||||
40
apps/api/migrations/adr092_p1_learning_chain_fix.sql
Normal file
40
apps/api/migrations/adr092_p1_learning_chain_fix.sql
Normal file
@@ -0,0 +1,40 @@
|
||||
-- ADR-092 B4 — Playbook 學習閉環斷鏈修復(DB Schema)
|
||||
-- 根因:approval_records 缺 matched_playbook_id → 人工審核後 EWMA 無法更新 Playbook trust score
|
||||
-- timeline_events 缺 incident_id → pre_decision_investigator MCP 呼叫稽核每天+1 靜默錯誤
|
||||
--
|
||||
-- 執行方式(需人工執行一次):
|
||||
-- psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql
|
||||
--
|
||||
-- 2026-04-24 ogt + Claude Sonnet 4.6(亞太)
|
||||
|
||||
BEGIN;
|
||||
|
||||
-- ─────────────────────────────────────────────────────────────────────────────
|
||||
-- approval_records: 新增 matched_playbook_id 欄位(B2 fix)
|
||||
-- ─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
ALTER TABLE approval_records
|
||||
ADD COLUMN IF NOT EXISTS matched_playbook_id VARCHAR(36) DEFAULT NULL;
|
||||
|
||||
CREATE INDEX IF NOT EXISTS ix_approval_matched_playbook
|
||||
ON approval_records (matched_playbook_id)
|
||||
WHERE matched_playbook_id IS NOT NULL;
|
||||
|
||||
COMMENT ON COLUMN approval_records.matched_playbook_id
|
||||
IS 'Playbook ID 命中時紀錄,學習服務讀取以更新 EWMA trust score';
|
||||
|
||||
-- ─────────────────────────────────────────────────────────────────────────────
|
||||
-- timeline_events: 新增 incident_id 欄位(P1.6 fix)
|
||||
-- ─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
ALTER TABLE timeline_events
|
||||
ADD COLUMN IF NOT EXISTS incident_id VARCHAR(64) DEFAULT NULL;
|
||||
|
||||
CREATE INDEX IF NOT EXISTS ix_timeline_incident_id
|
||||
ON timeline_events (incident_id)
|
||||
WHERE incident_id IS NOT NULL;
|
||||
|
||||
COMMENT ON COLUMN timeline_events.incident_id
|
||||
IS 'MCP 工具呼叫稽核時關聯的 Incident ID';
|
||||
|
||||
COMMIT;
|
||||
@@ -290,24 +290,62 @@ def _extract_candidates(parsed: dict[str, Any]) -> list[CandidateAction]:
|
||||
# OpenClaw Nemo 格式轉換
|
||||
# 2026-04-17 ogt + Claude Sonnet 4.6: Nemo path kubectl 驗證
|
||||
# 根因:Nemo 回傳 {"action_title": "重啟 Crash Looping Pod"} 自然語言
|
||||
# 直接用 action_title 作為 action → 無 kubectl → auto_approve 誤通過 → 死迴圈
|
||||
# 修復:action_title 不含 kubectl → return [](觸發 _degraded_plan 輸出真實 kubectl)
|
||||
# 直接用 action_title → 無 kubectl → auto_approve 誤通過 → 死迴圈
|
||||
# 2026-04-24 ogt + Claude Sonnet 4.6: 修復靜默丟棄 → 語意合成兜底
|
||||
# 舊:action_title 無 kubectl → return [] → _degraded_plan confidence=0.2
|
||||
# 新:先嘗試語意合成 kubectl 指令;真的無從映射才 return []
|
||||
if "action_title" in parsed and "candidates" not in parsed:
|
||||
action_title = str(parsed.get("action_title", ""))
|
||||
if "kubectl" not in action_title.lower():
|
||||
return [] # 交由 _degraded_plan 接手,輸出真實 kubectl 調查指令
|
||||
confidence = float(parsed.get("confidence", 0.5))
|
||||
risk_level = str(parsed.get("risk_level", "medium"))
|
||||
risk_to_blast = {"critical": 60, "high": 40, "medium": 25, "low": 10}
|
||||
blast = risk_to_blast.get(risk_level.lower(), 30)
|
||||
if action_title and confidence > 0:
|
||||
|
||||
if "kubectl" in action_title.lower():
|
||||
if action_title and confidence > 0:
|
||||
return [CandidateAction(
|
||||
action=action_title[:200],
|
||||
blast_radius=blast,
|
||||
rollback_cost=20,
|
||||
confidence=confidence,
|
||||
rationale=f"OpenClaw Nemo 建議: {action_title}",
|
||||
)]
|
||||
return []
|
||||
|
||||
# action_title 無 kubectl → 嘗試語意合成 kubectl 指令
|
||||
_at_lower = action_title.lower()
|
||||
_synthesized: str | None = None
|
||||
if any(w in _at_lower for w in ("rollback", "undo", "回滾", "還原")):
|
||||
_synthesized = "kubectl rollout undo deployment -n awoooi-prod"
|
||||
elif any(w in _at_lower for w in ("restart", "重啟", "重新啟動")):
|
||||
_synthesized = "kubectl rollout restart deployment -n awoooi-prod"
|
||||
elif any(w in _at_lower for w in ("scale", "擴容", "縮容", "replicas")):
|
||||
_synthesized = "kubectl scale deployment -n awoooi-prod"
|
||||
elif any(w in _at_lower for w in ("logs", "日誌", "log")):
|
||||
_synthesized = "kubectl logs -n awoooi-prod --tail=100 --selector=app=awoooi-api"
|
||||
elif any(w in _at_lower for w in ("describe", "診斷", "diagnos")):
|
||||
_synthesized = "kubectl describe pods -n awoooi-prod"
|
||||
|
||||
if _synthesized:
|
||||
logger.debug(
|
||||
"solver_nemo_action_synthesized",
|
||||
action_title=action_title[:80],
|
||||
synthesized=_synthesized,
|
||||
)
|
||||
return [CandidateAction(
|
||||
action=action_title[:200],
|
||||
action=_synthesized,
|
||||
blast_radius=blast,
|
||||
rollback_cost=20,
|
||||
confidence=confidence,
|
||||
rationale=f"OpenClaw Nemo 建議: {action_title}",
|
||||
confidence=min(confidence, 0.5), # 合成指令最高 0.5,避免誤入自動執行
|
||||
rationale=f"[語意合成] Nemo 建議「{action_title[:80]}」→ 轉為 kubectl 指令",
|
||||
)]
|
||||
|
||||
# 完全無從映射 → return [](交由 _degraded_plan 輸出 category-based 調查指令)
|
||||
logger.debug(
|
||||
"solver_nemo_no_kubectl_fallback",
|
||||
action_title=action_title[:80],
|
||||
reason="action_title 無 kubectl 且語意合成失敗,降級至 _degraded_plan",
|
||||
)
|
||||
return []
|
||||
|
||||
raw = parsed.get("candidates", [])
|
||||
@@ -328,24 +366,39 @@ def _extract_candidates(parsed: dict[str, Any]) -> list[CandidateAction]:
|
||||
|
||||
|
||||
def _default_action_for_category(category: str) -> str:
|
||||
"""降級時的預設調查指令 — 必須是真實 kubectl 命令(調查優先,不執行破壞性操作)
|
||||
"""降級時的預設調查指令 — 必須是真實 kubectl/ssh 命令(調查優先,不執行破壞性操作)
|
||||
|
||||
2026-04-17 ogt + Claude Sonnet 4.6: 改為真實 kubectl 指令
|
||||
舊:自然語言如 "restart_pod"、"check_disk_usage" → 無法被 auto_approve 執行
|
||||
新:kubectl 調查指令 → 可執行,且均為唯讀操作,無副作用
|
||||
2026-04-24 ogt + Claude Sonnet 4.6: 擴展非 K8s 類別(ClickHouse/主機磁碟/DB)
|
||||
根因:SentryClickHouseMemoryPressure/HostDiskUsageHigh 類別不符任何 K8s 關鍵字
|
||||
→ 全部 fallback 到 "kubectl get pods"(無意義診斷指令)
|
||||
修復:加入 clickhouse/database/sentry/host/node/infra 類別映射
|
||||
"""
|
||||
category_lower = category.lower()
|
||||
# K8s workload 層
|
||||
if "pod" in category_lower or "kube" in category_lower or "crash" in category_lower:
|
||||
return "kubectl get pods -n awoooi-prod -o wide"
|
||||
if "disk" in category_lower or "storage" in category_lower or "pvc" in category_lower:
|
||||
return "kubectl exec -n awoooi-prod deployment/postgresql -- df -h"
|
||||
if "cpu" in category_lower or "load" in category_lower:
|
||||
return "kubectl top pods -n awoooi-prod --sort-by=cpu"
|
||||
if "memory" in category_lower or "oom" in category_lower:
|
||||
return "kubectl top pods -n awoooi-prod --sort-by=memory"
|
||||
if "network" in category_lower or "connect" in category_lower:
|
||||
return "kubectl get services -n awoooi-prod"
|
||||
return "kubectl get pods -n awoooi-prod"
|
||||
if "disk" in category_lower or "storage" in category_lower or "pvc" in category_lower:
|
||||
return "kubectl exec -n awoooi-prod deployment/postgresql -- df -h"
|
||||
# 外部服務層(非 K8s — 唯讀診斷)
|
||||
if "clickhouse" in category_lower or "sentry" in category_lower:
|
||||
return "kubectl get pods -n awoooi-prod -l app=sentry -o wide"
|
||||
if "database" in category_lower or "postgres" in category_lower or "redis" in category_lower:
|
||||
return "kubectl get pods -n awoooi-prod -l tier=database -o wide"
|
||||
if "rollback" in category_lower or "deploy" in category_lower or "version" in category_lower:
|
||||
return "kubectl rollout history deployment -n awoooi-prod"
|
||||
if "latency" in category_lower or "slow" in category_lower or "timeout" in category_lower:
|
||||
return "kubectl top pods -n awoooi-prod --sort-by=cpu"
|
||||
# 主機層(host/node/infra — 調查指令,kubectl 只查 node 資訊)
|
||||
if "host" in category_lower or "node" in category_lower or "infra" in category_lower:
|
||||
return "kubectl describe nodes | grep -A5 'Conditions\\|Allocatable'"
|
||||
return "kubectl get pods -n awoooi-prod -o wide"
|
||||
|
||||
|
||||
def compute_input_hash(diagnosis: DiagnosisReport) -> str:
|
||||
|
||||
@@ -167,6 +167,15 @@ class ApprovalRecord(Base):
|
||||
comment="Telegram chat_id where the approval card was sent",
|
||||
)
|
||||
|
||||
# B2 fix 2026-04-24 ogt + Claude Sonnet 4.6: Playbook 學習閉環斷鏈修復
|
||||
# 原欄位缺失 → 人工審核後 matched_playbook_id 永遠 NULL → EWMA 無法更新
|
||||
matched_playbook_id: Mapped[str | None] = mapped_column(
|
||||
String(36),
|
||||
nullable=True,
|
||||
index=True,
|
||||
comment="匹配的 Playbook ID,學習服務用以更新 EWMA trust score",
|
||||
)
|
||||
|
||||
# Timestamps
|
||||
created_at: Mapped[datetime] = mapped_column(
|
||||
DateTime(timezone=True),
|
||||
@@ -193,6 +202,7 @@ class ApprovalRecord(Base):
|
||||
Index("ix_approval_created_at", "created_at"),
|
||||
Index("ix_approval_requested_by", "requested_by"),
|
||||
Index("ix_approval_fingerprint", "fingerprint"), # 戰略 B: 指紋查詢優化
|
||||
Index("ix_approval_matched_playbook", "matched_playbook_id"), # B2 fix
|
||||
)
|
||||
|
||||
|
||||
@@ -244,6 +254,14 @@ class TimelineEvent(Base):
|
||||
# Context
|
||||
risk_level: Mapped[str | None] = mapped_column(String(20), nullable=True)
|
||||
approval_id: Mapped[str | None] = mapped_column(String(36), nullable=True, index=True)
|
||||
# P1.6 fix 2026-04-24 ogt + Claude Sonnet 4.6: pre_decision_investigator raw SQL 寫不存在欄位
|
||||
# 原本 INSERT INTO timeline_events (incident_id, ...) 失敗 → 每天+1 錯誤靜默吞
|
||||
incident_id: Mapped[str | None] = mapped_column(
|
||||
String(64),
|
||||
nullable=True,
|
||||
index=True,
|
||||
comment="關聯的 Incident ID(MCP 事件稽核用)",
|
||||
)
|
||||
|
||||
# Timestamp
|
||||
created_at: Mapped[datetime] = mapped_column(
|
||||
@@ -255,6 +273,7 @@ class TimelineEvent(Base):
|
||||
__table_args__ = (
|
||||
Index("ix_timeline_event_type", "event_type"),
|
||||
Index("ix_timeline_created_at", "created_at"),
|
||||
Index("ix_timeline_incident_id", "incident_id"), # P1.6 fix
|
||||
)
|
||||
|
||||
|
||||
|
||||
@@ -103,6 +103,9 @@ def _record_to_request(record: ApprovalRecord) -> ApprovalRequest:
|
||||
fingerprint=record.fingerprint,
|
||||
hit_count=record.hit_count,
|
||||
last_seen_at=record.last_seen_at,
|
||||
# B4 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補回 DB 欄位(人工審核路徑讀回必要)
|
||||
incident_id=getattr(record, "incident_id", None),
|
||||
matched_playbook_id=getattr(record, "matched_playbook_id", None),
|
||||
)
|
||||
|
||||
|
||||
|
||||
@@ -109,6 +109,11 @@ def approval_record_to_request(record: ApprovalRecord) -> ApprovalRequest:
|
||||
fingerprint=record.fingerprint,
|
||||
hit_count=record.hit_count,
|
||||
last_seen_at=record.last_seen_at,
|
||||
# B3 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補回 DB 欄位(人工審核路徑讀回必要)
|
||||
# incident_id / matched_playbook_id 在 ApprovalRequest 基礎模型中有定義
|
||||
# telegram_message_id / telegram_chat_id 只在 DB model,不在 Pydantic ApprovalRequest
|
||||
incident_id=getattr(record, "incident_id", None),
|
||||
matched_playbook_id=getattr(record, "matched_playbook_id", None),
|
||||
)
|
||||
|
||||
|
||||
@@ -164,6 +169,9 @@ def approval_request_to_record_data(
|
||||
# 不在 dict 裡導致 DB 欄位永遠 NULL,Telegram 卡片顯示 INC 號是空白
|
||||
# 用戶在 Telegram 根本認不出對應的告警,審核閉環名存實亡
|
||||
"incident_id": request.incident_id,
|
||||
# B5 fix 2026-04-24 ogt + Claude Sonnet 4.6: 補 matched_playbook_id
|
||||
# 原本缺失 → Playbook 人工審核後 trust score 永遠不更新(學習閉環斷鏈)
|
||||
"matched_playbook_id": getattr(request, "matched_playbook_id", None),
|
||||
}
|
||||
|
||||
|
||||
|
||||
@@ -718,9 +718,19 @@ class ApprovalExecutionService:
|
||||
|
||||
# 2026-04-04 ogt: 執行結果沉澱到 KM — 移出 try/except 確保 learning 失敗也寫入
|
||||
# 統帥鐵律: 所有異常與自動修復紀錄必須回寫 KM
|
||||
asyncio.create_task(
|
||||
self._write_execution_result_to_km(approval, success, error_message)
|
||||
)
|
||||
# P1.5 fix 2026-04-24 ogt + Claude Sonnet 4.6: fire-and-forget → await(30s 熔斷)
|
||||
# 根因:asyncio.create_task 在 Pod recycle 時被殺,KM 寫入遺失(audit D 每天+5)
|
||||
try:
|
||||
await asyncio.wait_for(
|
||||
self._write_execution_result_to_km(approval, success, error_message),
|
||||
timeout=30.0,
|
||||
)
|
||||
except asyncio.TimeoutError:
|
||||
logger.warning(
|
||||
"km_write_timeout",
|
||||
approval_id=str(approval.id),
|
||||
timeout_sec=30.0,
|
||||
)
|
||||
|
||||
async def _run_post_execution_verify(
|
||||
self,
|
||||
|
||||
@@ -1140,6 +1140,7 @@ def _package_to_proposal_data(package: Any) -> dict[str, Any]:
|
||||
"confidence": confidence,
|
||||
"risk_level": risk_level,
|
||||
"source": "phase2_agent_debate",
|
||||
"provider": "phase2_agents", # 2026-04-24 ogt: 讓 Telegram 顯示 "Phase2 Agents" 而非通用 "AI 仲裁"
|
||||
"requires_human_review": package.requires_human_approval,
|
||||
# Phase 2 診斷摘要(供 Audit Trail / 學習閉環,不直接顯示給用戶)
|
||||
"debate_summary": package.debate_summary or "",
|
||||
@@ -2283,6 +2284,45 @@ class DecisionManager:
|
||||
# 需要 EvidenceSnapshot;若 P1 未開啟則自行收集
|
||||
# 2026-04-15 ogt + Claude Sonnet 4.6(亞太)
|
||||
if aiops_flags.is_phase_enabled(2): # Gate 2: 用 is_phase_enabled 統一父 Phase 守衛
|
||||
# 2026-04-24 ogt + Claude Sonnet 4.6: YAML NO_ACTION 優先門
|
||||
# 根因:Phase 2 五 agent 對主機層/外部服務告警(HostDiskUsageHigh /
|
||||
# SentryClickHouseMemoryPressure)做 kubectl 分析 → Solver 永遠降級
|
||||
# (無 K8s target) → confidence=20% + kubectl get pods(無意義)
|
||||
# 修復:YAML 匹配到 NO_ACTION 且 kubectl_command="" → 跳過 Agent Debate,
|
||||
# 直接返回 YAML 規則響應,讓 Telegram 推送人工排查通知
|
||||
try:
|
||||
from src.services.alert_rule_engine import match_rule as _p2_match_rule
|
||||
_p2_labels = incident.signals[0].labels if incident.signals else {}
|
||||
_p2_alertname = _p2_labels.get("alertname", "")
|
||||
_p2_yaml = _p2_match_rule({
|
||||
"labels": _p2_labels,
|
||||
"alert_type": _p2_alertname,
|
||||
"message": (
|
||||
incident.signals[0].annotations.get("summary", "")
|
||||
if incident.signals else ""
|
||||
),
|
||||
"target_resource": incident.affected_services[0] if incident.affected_services else "unknown",
|
||||
"namespace": _p2_labels.get("namespace", "awoooi-prod"),
|
||||
"severity": incident.severity.value if hasattr(incident.severity, "value") else "medium",
|
||||
})
|
||||
_is_no_action_yaml = (
|
||||
_p2_yaml is not None
|
||||
and _p2_yaml.get("suggested_action") == "NO_ACTION"
|
||||
and not _p2_yaml.get("kubectl_command", "").strip()
|
||||
and _p2_yaml.get("rule_id", "") not in ("generic_fallback", "")
|
||||
)
|
||||
if _is_no_action_yaml:
|
||||
logger.info(
|
||||
"p2_yaml_no_action_bypass",
|
||||
incident_id=incident.incident_id,
|
||||
alertname=_p2_alertname,
|
||||
rule_id=_p2_yaml.get("rule_id", ""),
|
||||
reason="YAML NO_ACTION 規則命中,跳過 Agent Debate",
|
||||
)
|
||||
return _p2_yaml
|
||||
except Exception as _p2_yaml_err:
|
||||
logger.debug("p2_yaml_precheck_error", error=str(_p2_yaml_err))
|
||||
|
||||
p2_snapshot = evidence_snapshot
|
||||
if p2_snapshot is None:
|
||||
try:
|
||||
|
||||
@@ -466,18 +466,53 @@ SSH_MCP_ALLOWED_HOSTS 使用完整 IP(如 "192.168.0.110")。
|
||||
"""
|
||||
|
||||
|
||||
def _build_prometheus_query(alertname: str, namespace: str, pod_name: str) -> str:
|
||||
"""依告警類型生成 Prometheus PromQL 查詢(供 prometheus_query tool 使用)。
|
||||
2026-04-24 ogt + Claude Sonnet 4.6: P0.4 fix — _build_tool_params 補 query 欄位"""
|
||||
an = alertname.lower()
|
||||
# CPU / 負載
|
||||
if any(k in an for k in ("cpu", "load", "throttl")):
|
||||
filter_pod = f',pod=~"{pod_name}.*"' if pod_name else ""
|
||||
return f'avg(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"{filter_pod}}}[5m]))'
|
||||
# 記憶體
|
||||
elif any(k in an for k in ("memory", "mem", "oom")):
|
||||
filter_pod = f',pod=~"{pod_name}.*"' if pod_name else ""
|
||||
return f'avg(container_memory_working_set_bytes{{namespace="{namespace}"{filter_pod}}}) / 1048576'
|
||||
# CrashLoop / 重啟
|
||||
elif any(k in an for k in ("crash", "restart", "oom", "backoff")):
|
||||
return f'sum(increase(kube_pod_container_status_restarts_total{{namespace="{namespace}"}}[15m]))'
|
||||
# 磁碟 / 儲存
|
||||
elif any(k in an for k in ("disk", "storage", "pvc", "volume", "capacity")):
|
||||
return 'sum(kubelet_volume_stats_used_bytes) by (persistentvolumeclaim)'
|
||||
# HTTP / 可用性
|
||||
elif any(k in an for k in ("http", "error", "5xx", "probe", "down", "unhealthy")):
|
||||
return '1 - avg(probe_success)'
|
||||
# Pod / Container 狀態
|
||||
elif any(k in an for k in ("pod", "container", "deploy", "replicaset")):
|
||||
return f'kube_pod_status_phase{{namespace="{namespace}"}}'
|
||||
# 通用 fallback
|
||||
else:
|
||||
return f'up{{namespace="{namespace}"}}'
|
||||
|
||||
|
||||
def _build_tool_params(incident: "Incident") -> dict[str, Any]:
|
||||
"""從 Incident 提取 MCP 工具呼叫所需的公共參數。"""
|
||||
labels = _get_labels(incident)
|
||||
raw_host = labels.get("instance", "").split(":")[0] or labels.get("host", "")
|
||||
host = _SHORT_HOST_MAP.get(raw_host, raw_host) # 短名 → 完整 IP
|
||||
namespace = labels.get("namespace", "awoooi-prod")
|
||||
pod_name = labels.get("pod", labels.get("name", ""))
|
||||
alertname = labels.get("alertname", "")
|
||||
return {
|
||||
"namespace": labels.get("namespace", "awoooi-prod"),
|
||||
"pod_name": labels.get("pod", labels.get("name", "")),
|
||||
"namespace": namespace,
|
||||
"pod_name": pod_name,
|
||||
"deployment": labels.get("deployment", ""),
|
||||
"host": host,
|
||||
"container": labels.get("container", labels.get("name", "")),
|
||||
"alertname": labels.get("alertname", ""),
|
||||
"alertname": alertname,
|
||||
# P0.4 fix 2026-04-24 ogt + Claude Sonnet 4.6: Prometheus tool 需要 query 欄位
|
||||
# 原本缺少此欄位 → prometheus_query/range tool 傳入空 query → 回傳 error dict
|
||||
"query": _build_prometheus_query(alertname, namespace, pod_name),
|
||||
}
|
||||
|
||||
|
||||
|
||||
@@ -6,6 +6,32 @@
|
||||
|
||||
---
|
||||
|
||||
## 📍 2026-04-24 — 12 Agent 全景審計 + P0-P2 全面並行修復
|
||||
|
||||
### 需求
|
||||
統帥:「請用12位Agent的新遊戲規則,進行全景、全流程、全節點的所有 AI 自動化流程優化!到目前為止都還沒有完全正常運作!」
|
||||
|
||||
### 審計結論
|
||||
12 Agent 分工並行掃描:
|
||||
- 系統有效串接率:~60%(125個服務中約75個真正在主流程使用)
|
||||
- 孤立服務:12個重要服務零引用(trust_drift_detector/rollback_manager 等)
|
||||
- 7大致命病根(詳見 project_audit_20260424.md)
|
||||
|
||||
### 最關鍵發現
|
||||
1. **MCP 感官 = 0**:Prometheus KeyError 100% + legacy kwarg bug 靜默吞
|
||||
2. **auto_execute 24h = 0**:Gate 9(blast_radius 唯讀指令判 human)+ Gate 11(operation_parser 不認唯讀指令)
|
||||
3. **Playbook 學習 = 永遠 False**:5個斷鏈疊加 + 冷啟動死結
|
||||
4. **KM +5/天主因**:knowledge_extractor_service.py:210 AttributeError 100% 失敗
|
||||
5. **動態基線9天0筆**:5個 PromQL label 全錯(cadvisor namespace/container 不對)
|
||||
6. **timeline_events +1/天**:pre_decision_investigator.py:344 raw SQL INSERT 寫不存在欄位
|
||||
|
||||
### 修復動作(並行執行中)
|
||||
- P0.1-P0.6:立即止血(知識萃取/auto_execute gate/MCP/告警規則/動態基線)
|
||||
- P1.1-P1.5:學習閉環修復(DB migration + matched_playbook_id 斷鏈)
|
||||
- P2.1/P2.4/P2.6:LLM 品質 + Telegram 中間態 + AI 治理
|
||||
|
||||
---
|
||||
|
||||
## 📍 2026-04-24 — 12-Agent 全景盤點 + 六大自動化飛輪修復
|
||||
|
||||
### 根因(截圖告警分析)
|
||||
|
||||
@@ -1698,6 +1698,41 @@ Phase 6 完成後
|
||||
|
||||
---
|
||||
|
||||
### 2026-04-24 — 12 Agent 全景審計 + P0-P2 並行修復啟動
|
||||
|
||||
**審計方式**:12 個專責 Agent 並行掃描 125 個服務檔案 × 7 層架構 × 13 個關鍵節點
|
||||
|
||||
**核心發現:系統有效串接率 ~60%,連接的 60% 有大量靜默故障**
|
||||
|
||||
| 病根 | 位置 | 嚴重度 |
|
||||
|------|------|--------|
|
||||
| A. Prometheus MCP 100% KeyError | `pre_decision_investigator.py:469-481` | 🔴🔴🔴 |
|
||||
| B. auto_execute Gate 9+11 必攔唯讀指令 | `blast_radius_calculator.py:38-61`, `operation_parser.py:51-185` | 🔴🔴🔴 |
|
||||
| C. Playbook 學習閉環5個斷鏈(matched_playbook_id 永遠None) | `proposal_service.py:232-257`, `db/models.py:59`, `approval_db.py:91-112` | 🔴🔴🔴 |
|
||||
| D. KM每天+5主因:s.description AttributeError 100% 失敗 | `knowledge_extractor_service.py:210` | 🔴🔴🔴 |
|
||||
| E. consensus_engine 4個 ExpertAgent confidence=0.0 硬寫死 | `consensus_engine.py:165-334` | 🔴🔴 |
|
||||
| F. auto_repair 主路徑完全不接 PostExecutionVerifier | `auto_repair_service.py:324-500` | 🔴🔴 |
|
||||
| G. ProactiveInspector 5個 PromQL 全回空 vector,基線9天0筆 | `proactive_inspector.py:40-71` | 🔴🔴 |
|
||||
| H. timeline_events 每天+1:raw SQL INSERT 欄位不存在被靜默吞 | `pre_decision_investigator.py:344,361` | 🔴🔴 |
|
||||
|
||||
**孤立服務(12個重要服務零引用)**:trust_drift_detector / rollback_manager / resource_resolver / diagnosis_aggregator / model_rollback_service / kb_rot_cleaner / ci_auto_repair / sentry_webhook_service 等
|
||||
|
||||
**P0 修復(今日並行執行)**:
|
||||
- P0.1: knowledge_extractor_service.py:210 s.description → alert_name + annotations.summary
|
||||
- P0.2+P0.3: blast_radius + operation_parser 補唯讀指令白名單
|
||||
- P0.4: pre_decision_investigator Prometheus _build_tool_params 補 query 欄位
|
||||
- P0.5: alert_rules.yaml generic_fallback NO_ACTION + 6條語義矛盾修正
|
||||
- P0.6: proactive_inspector 5個 PromQL 修正 label + datname
|
||||
|
||||
**P1 DB Migration(今日並行執行)**:
|
||||
- ApprovalRecord 加 matched_playbook_id 欄位(Alembic migration)
|
||||
- timeline_events 加 incident_id + stage 欄位
|
||||
- proposal_service / approval_db / approval_repository 讀寫補齊
|
||||
- decision_manager:2463 冷啟動豁免(total_executions=0 略過 success_rate 過濾)
|
||||
- approval_execution:702-704 KM 寫入 fire-and-forget → await asyncio.wait_for(timeout=10s)
|
||||
|
||||
---
|
||||
|
||||
### 2026-04-20 晚 (台北) — C1-C4 全流程串接 — Playbook 鏈路保護(commit de2d34d)
|
||||
|
||||
**觸發**:統帥全景盤查 AI 自動化節點後,發現 Playbook 自動修復鏈路有 3 個結構性斷點。
|
||||
|
||||
Reference in New Issue
Block a user