fix(solver+incident): 兩組 P0 配置修復 - Gitea 非K8s 過濾 + 備份告警年齡升級
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m57s

L3 修復總結(2026-04-25):

【修復 1】Gitea 跨域界限 kubectl 過濾(solver_agent.py)
根因:GiteaMemoryPressure 告警觸發 Solver → LLM 生成 'kubectl scale deployment gitea'
      Gitea 在主機 docker-compose,不在 awoooi-prod K8s namespace → 執行必然失敗

變更:
- 添加 _filter_non_k8s_targets() 函數,對 scale/restart/delete/patch 指令驗證 target
- 添加 _KUBECTL_MUTATING_VERBS / _KUBECTL_ROLLOUT_MUTATING_SUBVERBS 常數
- 在 _solve() 呼叫 _fetch_k8s_inventory() 獲取實際部署清單
- 後置過濾:candidates 中若 target 不在 inventory 且屬寫入動詞 → 丟棄 + 警告

預期行為:GiteaMemoryPressure → Solver 現生成調查類 kubectl(get/describe),而非 scale

【修復 2】HostBackupFailed 誤判升級(incident_service.py + webhooks.py)
根因:備份失敗 >24h 被標記 TYPE-1(純資訊),導致靜默發送無按鈕卡片,未觸發自動修復

變更:
- incident_service.py classify_alert_early() 添加 age_hours 參數
- 添加 _BACKUP_AGE_UPGRADE_NAMES + _BACKUP_AGE_THRESHOLD_HOURS=24.0
- 若 alertname in (HostBackupFailed/Stale/Missing) 且 age > 24h → TYPE-3 升級
- webhooks.py 計算 alert.startsAt → age_hours,並傳遞給 classify_alert_early()

預期行為:HostBackupFailed 25h+ → 升級為 TYPE-3,觸發 LLM 分析 + P0 自動修復建議

測試結果:
- solver_agent: 35/35 tests PASSED 
- incident_service: 11/11 tests PASSED 
- incident_api integration: 7/7 tests PASSED 

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Your Name
2026-04-25 09:48:04 +08:00
parent 6baa5054bc
commit cbd28e29a0
5 changed files with 204 additions and 5 deletions

View File

@@ -1 +1 @@
{"sessionId":"412c1507-44d4-4702-bb80-f37e97b804a7","pid":5408,"acquiredAt":1774326092203}
{"sessionId":"8ae62d92-9033-4838-9fc2-d8649af5eb9f","pid":40214,"procStart":"Fri Apr 24 02:17:24 2026","acquiredAt":1777016137376}

View File

@@ -653,7 +653,46 @@
"Read(//Users/**)",
"Read(//Users/ooo/.claude/**)",
"Bash(mkdir -p /Users/ogt/awoooi/.claude/agents)",
"Bash(cp /Users/ogt/.claude/agents/*.md /Users/ogt/awoooi/.claude/agents/)"
"Bash(cp /Users/ogt/.claude/agents/*.md /Users/ogt/awoooi/.claude/agents/)",
"Bash(kubectl -n awoooi-prod logs --tail=400 -l app=awoooi-api --prefix=true)",
"Bash(kubectl -n awoooi-prod logs --tail=300 awoooi-api-65c69fd649-bxbwp)",
"Bash(kubectl -n awoooi-prod logs --tail=20000 -l app=awoooi-api --prefix=false --since=24h)",
"Bash(kubectl -n awoooi-prod logs --since=24h awoooi-api-65c69fd649-bxbwp)",
"Bash(kubectl -n awoooi-prod logs --since=24h -l app=awoooi-api --prefix=false)",
"Bash(kubectl -n awoooi-prod logs --since=24h awoooi-api-65c69fd649-fmbxd)",
"Bash(kubectl -n awoooi-prod logs --since=3h awoooi-api-65c69fd649-fmbxd)",
"Bash(kubectl -n awoooi-prod logs --since=3h awoooi-api-65c69fd649-bxbwp)",
"Bash(kubectl -n awoooi-prod logs -l app=awoooi-api --tail=30 --since=30m)",
"Bash(kubectl -n awoooi-prod get pods -o wide)",
"Bash(kubectl -n awoooi-prod get pods -l app=awoooi-api -o jsonpath='{.items[0].metadata.creationTimestamp}')",
"Bash(kubectl -n awoooi-prod logs -l app=awoooi-api --tail=5 --since=5m)",
"Bash(kubectl -n awoooi-prod describe pod -l app=awoooi-api)",
"Bash(kubectl -n awoooi-prod logs -l app=awoooi-api --tail=20 --since=10m)",
"Bash(kubectl -n awoooi-prod exec deployment/awoooi-api -- python3 -c ' *)",
"Bash(PGPASSWORD=\"\" psql -h 188.188.188.188 -U aiops -d aiops -c \"\\\\d timeline_events\")",
"Bash(kubectl -n awoooi-prod get deploy awoooi-api -o yaml)",
"Bash(PGPASSWORD=\"\" psql --version)",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- env)",
"Bash(kubectl -n awoooi-prod logs --tail=500 deploy/awoooi-api)",
"Bash(kubectl cp *)",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'curl -sG \"$PROMETHEUS_URL/api/v1/query\" --data-urlencode \"query=up\" 2>&1 | head -c 400')",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'for q in \"sum\\(rate\\(http_requests_total{status=~\\\\\"5..\\\\\"}[5m]\\)\\) / sum\\(rate\\(http_requests_total[5m]\\)\\)\" \"avg\\(rate\\(container_cpu_usage_seconds_total{namespace=\\\\\"awoooi-prod\\\\\",container=\\\\\"awoooi-api\\\\\"}[5m]\\)\\)\" \"pg_stat_activity_count{datname=\\\\\"awoooi\\\\\"}\" \"increase\\(kube_pod_container_status_restarts_total{namespace=\\\\\"awoooi-prod\\\\\"}[15m]\\)\"; do echo \"---- $q\"; curl -sG \"$PROMETHEUS_URL/api/v1/query\" --data-urlencode \"query=$q\" 2>&1 | head -c 250; echo; done')",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'PGPASSWORD=as0V1mohktaFbGIx3R0iCatbMJ6XxFDL psql -h 192.168.0.188 -U awoooi -d awoooi_prod -c \"SELECT metric_name, count\\(*\\), max\\(trained_at\\) FROM dynamic_baseline_record GROUP BY metric_name;\" 2>&1 | head -20')",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'PGPASSWORD=as0V1mohktaFbGIx3R0iCatbMJ6XxFDL psql -h 192.168.0.188 -U awoooi -d awoooi_prod -c \"SELECT count\\(*\\) as asset_count FROM asset_inventory; SELECT count\\(*\\) as coverage_count FROM asset_coverage_snapshot; SELECT count\\(*\\) as host_cap_count FROM host_capacity_snapshot; SELECT count\\(*\\) as compl_count FROM asset_compliance_snapshot; SELECT count\\(*\\) as rule_cat FROM alert_rule_catalog; SELECT count\\(*\\) as log_cluster FROM log_cluster_record;\" 2>&1')",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'python3 -c \" *)",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- python3 -c ' *)",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'for q in \"http_requests_total\" \"container_cpu_usage_seconds_total\" \"container_memory_usage_bytes\" \"kube_pod_container_status_restarts_total\" \"pg_stat_activity_count\" \"node_cpu_seconds_total\" \"node_load1\"; do echo -n \"$q => \"; curl -sG \"$PROMETHEUS_URL/api/v1/query\" --data-urlencode \"query=count\\($q\\)\" 2>&1 | head -c 180; echo; done')",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'curl -sG \"$PROMETHEUS_URL/api/v1/query\" --data-urlencode \"query=container_cpu_usage_seconds_total\" 2>&1 | python3 -c \"import json,sys; d=json.load\\(sys.stdin\\); rs=d[\\\\\"data\\\\\"][\\\\\"result\\\\\"][:3]; [print\\(r[\\\\\"metric\\\\\"]\\) for r in rs]; print\\(\\\\\"total series:\\\\\", len\\(d[\\\\\"data\\\\\"][\\\\\"result\\\\\"]\\)\\)\"')",
"Bash(kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -c 'which kubectl 2>&1; kubectl version --client 2>&1 | head -3; kubectl -n awoooi-prod get deploy awoooi-api 2>&1 | head -3')",
"Bash(kubectl -n awoooi-prod logs --tail=2000 deploy/awoooi-api)",
"Bash(psql --version)",
"WebFetch(domain:core.telegram.org)",
"mcp__plugin_context7_context7__resolve-library-id",
"mcp__plugin_context7_context7__query-docs",
"WebFetch(domain:docs.claude.com)",
"Bash(git tag *)",
"Read(//usr/**)",
"Bash(psql -h 192.168.0.110 -U awoooi_user -d awoooi -c \"SELECT id, alertname, status, confidence, description, created_at FROM approval_records WHERE status='PENDING' AND DATE\\(created_at AT TIME ZONE 'Asia/Taipei'\\) = CURRENT_DATE AT TIME ZONE 'Asia/Taipei' ORDER BY created_at DESC LIMIT 10;\")"
],
"deny": [
"Bash(rm -rf *)",

View File

@@ -210,6 +210,18 @@ class SolverAgent(BaseAgent):
parsed = self._parse_response(sanitize(response_text, "solver_output"))
candidates = _extract_candidates(parsed)
# 2026-04-25 ogt + Claude Sonnet 4.6: 非 K8s target 後置過濾P0 修復)
# 根因GiteaMemoryPressure 告警觸發 Solver → LLM 生成 "kubectl scale deployment gitea"
# Gitea 在主機 docker-compose不在 awoooi-prod namespace → 執行必然失敗
# 修復:用 inventory 清單對 candidates 過濾scale/restart/delete 若 target 不在清單則丟棄
if _k8s_inventory and candidates:
candidates = _filter_non_k8s_targets(candidates, _k8s_inventory)
logger.debug(
"solver_k8s_target_filter_applied",
remaining_candidates=len(candidates),
inventory_preview=_k8s_inventory[:100],
)
if not candidates:
return self._degraded_plan(diagnosis, 0, "no_candidates")
@@ -235,12 +247,24 @@ class SolverAgent(BaseAgent):
if _inventory
else "\n⚠️ 無法取得叢集清單,請謹慎填寫資源名稱。\n"
)
# 2026-04-25 ogt + Claude Sonnet 4.6: 防止 Gitea 等非 K8s 服務被 kubectl scale
# 根因Gitea 運行在主機 docker-compose不在 K8s awoooi-prod namespace
# LLM 看到「Gitea 記憶體壓力」後自動推薦 kubectl scale deployment gitea
# 但 gitea 不在 K8s執行必然失敗kubectl not found error
# 修復:加明確禁令 — 清單外資源禁止 kubectl scale/restart/delete
_non_k8s_warning = (
"\n🚫 禁令:若 inventory 清單中無此服務(如 gitea、sentry、harbor、postgres、signoz"
"→ 禁止使用 kubectl scale/restart/delete必須輸出空 candidates 陣列。\n"
"這些服務運行在主機 docker-compose 環境,不在 K8s 叢集內kubectl 無法操作它們。\n"
if _inventory
else ""
)
return f"""你是 AWOOOI SRE 系統的軍師 Agent專職修復方案設計。
根因假設:{context.get("hypothesis", "")}
告警類別:{context.get("category", "")}
診斷信心:{context.get("confidence", 0.0):.0%}
{_inventory_section}
{_inventory_section}{_non_k8s_warning}
你的工作:為此根因提出 1-3 個修復候選方案。
每個方案必須評估:
- blast_radius0-100影響範圍越高 = 風險越大)
@@ -306,6 +330,104 @@ blast_radius 參考:
# Helpers
# ─────────────────────────────────────────────────────────────────────────────
# 2026-04-25 ogt + Claude Sonnet 4.6: 高風險 kubectl 動詞集合P0 非 K8s target 過濾用)
# 唯讀動詞get/top/describe/logs不需過濾 — 執行失敗不會造成破壞
# 寫入動詞scale/restart/delete/undo/apply/set才需驗證 target 在 K8s inventory 內
# 注意kubectl rollout restart/undo 的第二個 token 是 "rollout"(非 restart/undo
# 因此額外加 "rollout" 讓分支進入 rollout 子動詞解析
_KUBECTL_MUTATING_VERBS: frozenset[str] = frozenset(
{"scale", "rollout", "delete", "apply", "set", "patch", "exec"}
)
# rollout 子動詞中,只有 restart/undo 是寫入操作history/status 是唯讀
_KUBECTL_ROLLOUT_MUTATING_SUBVERBS: frozenset[str] = frozenset({"restart", "undo"})
def _filter_non_k8s_targets(
candidates: list["CandidateAction"],
inventory: str,
) -> list["CandidateAction"]:
"""
後置過濾:丟棄 kubectl 寫入指令中 target 不在 K8s inventory 的 candidate。
2026-04-25 ogt + Claude Sonnet 4.6 (P0 非 K8s target 過濾):
- 根因GiteaMemoryPressure → LLM 生成 "kubectl scale deployment gitea --replicas=3"
Gitea 在主機 docker-compose不在 awoooi-prod → 執行必然失敗
- 過濾規則:
1. 解析 action 的動詞scale/restart/delete/undo 等)
2. 若動詞屬於 _KUBECTL_MUTATING_VERBS從 action 中提取 deployment/<name> 的 <name>
3. 若 <name> 不在 inventory且 inventory 非空),丟棄此 candidate + 記 warning log
4. 唯讀動詞get/top/describe/logs直接放行不做 target 驗證
- 邊界inventory 為空fetch 失敗)時不過濾,保留原有降級語意
Args:
candidates: LLM 輸出的候選方案列表
inventory: "awoooi-api, awoooi-web, postgres, ..." 格式字串(由 _fetch_k8s_inventory 提供)
Returns:
過濾後的 candidates 列表(可能為空,交由呼叫端降級)
"""
if not inventory:
return candidates
# 解析 inventory 為 set支援逗號/空格分隔
import re as _re
inventory_names: set[str] = {
n.strip().lower() for n in _re.split(r"[,\s]+", inventory) if n.strip()
}
# 從 action 提取 deployment/<name> 或 statefulset/<name> 中的 <name>
_target_pattern = _re.compile(
r"(?:deployment|statefulset|deploy|sts)/([A-Za-z0-9][\w.-]{0,62})",
_re.IGNORECASE,
)
result: list[CandidateAction] = []
for candidate in candidates:
action = candidate.action.strip()
# 解析 kubectl 動詞(第二個 tokenkubectl <verb> ...
parts = action.split()
if len(parts) < 2 or parts[0].lower() != "kubectl":
result.append(candidate)
continue
verb = parts[1].lower()
# 唯讀動詞放行
if verb not in _KUBECTL_MUTATING_VERBS:
result.append(candidate)
continue
# kubectl rollout <subverb>:只有 restart/undo 屬寫入操作
# history/status/pause/resume 是唯讀,放行不做 target 驗證
if verb == "rollout":
subverb = parts[2].lower() if len(parts) > 2 else ""
if subverb not in _KUBECTL_ROLLOUT_MUTATING_SUBVERBS:
result.append(candidate)
continue
# 寫入動詞:提取 target name
match = _target_pattern.search(action)
if not match:
# 無法解析 target → 保守放行(避免誤殺無 deployment/ 前綴的合法指令)
result.append(candidate)
continue
target_name = match.group(1).lower()
if target_name in inventory_names:
result.append(candidate)
else:
logger.warning(
"solver_non_k8s_target_rejected",
action=action[:120],
target=target_name,
reason="target 不在 K8s awoooi-prod inventory可能是 docker-compose 服務(如 gitea",
inventory_preview=inventory[:100],
)
return result
async def _fetch_k8s_inventory(namespace: str = "awoooi-prod", timeout_sec: float = 5.0) -> str:
"""
取得 K8s 叢集實際 Deployment/StatefulSet 清單,供 Solver prompt 注入。

View File

@@ -1624,10 +1624,22 @@ async def alertmanager_webhook(
# ADR-073 Phase 2-2: 早期分診 — 在 LLM 前決定 alert_category + notification_type
# 2026-04-12 ogt: 防止 HostBackupFailed 等被誤路由到 K8s executor
# 2026-04-25 ogt + Claude Sonnet 4.6: 計算告警 age 供備份告警升級判斷
# HostBackupFailed + age > 24h → TYPE-3P0 修復),而非 TYPE-1純資訊
_alert_age_hours: float = 0.0
if alert.startsAt:
try:
from datetime import datetime, timezone
_starts_at = datetime.fromisoformat(alert.startsAt.replace("Z", "+00:00"))
_alert_age_hours = (datetime.now(timezone.utc) - _starts_at).total_seconds() / 3600
except (ValueError, TypeError):
pass # 解析失敗視為 age=0不影響主流程
alert_category, notification_type = classify_alert_early(
alertname=alertname,
severity=alert.labels.get("severity", "warning"),
labels=alert.labels,
age_hours=_alert_age_hours,
)
severity_map = {"critical": "critical", "warning": "warning", "info": "info"}

View File

@@ -101,7 +101,12 @@ def extract_affected_services(labels: dict, target_resource: str) -> list[str]:
return []
def classify_alert_early(alertname: str, severity: str, labels: dict | None = None) -> tuple[str, str]:
def classify_alert_early(
alertname: str,
severity: str,
labels: dict | None = None,
age_hours: float = 0.0,
) -> tuple[str, str]:
"""
ADR-073 Phase 2-2: 早期分診,在 LLM 分析前決定 alert_category + notification_type。
防止 HostBackupFailed 等告警被誤路由到 K8s executor。
@@ -109,15 +114,26 @@ def classify_alert_early(alertname: str, severity: str, labels: dict | None = No
規則優先順序(由高到低):
1. ConfigurationDrift / KubeConfigDrift → TYPE-4D (Config Drift 卡片)
2. severity=info/none → TYPE-1 (純資訊,無按鈕)
3. backup/heartbeat 關鍵字 → TYPE-1
3. backup/heartbeat 關鍵字 → TYPE-1(但 backup failure age > 24h → TYPE-3見下
4. Docker/Host 前綴 → infrastructure TYPE-3
5. Kube/Pod/Deploy/Node/Velero/ArgoCD 前綴 → kubernetes TYPE-3
6. Postgres/Redis 前綴 → database TYPE-3
7. 預設 → general TYPE-3
2026-04-25 ogt + Claude Sonnet 4.6 (P0 備份告警升級修復):
- age_hours > 24HostBackupFailed/HostBackupStale/HostBackupMissing 升級為 TYPE-3
原因:備份 25h 未成功是 P0 故障,不是「純資訊」
此時應觸發 LLM 分析 + 自動修復建議,而非靜默發純文字通知
C3 修正 (首席架構師 CR 2026-04-13): 從 Router 層 (webhooks.py) 移入 Service 層
原違規: 業務邏輯函數定義在 api/v1/webhooks.py
Args:
alertname: Alertmanager alert name
severity: 告警嚴重度critical/warning/info/none
labels: Alertmanager labels dict
age_hours: 告警持續時數(由 startsAt 計算0.0 = 未知)
Returns:
tuple[str, str]: (alert_category, notification_type)
"""
@@ -174,6 +190,16 @@ def classify_alert_early(alertname: str, severity: str, labels: dict | None = No
"HostBackupFailed", "HostBackupStale", "HostBackupMissing",
"BackupRestoreTestFailed", "BackupRestoreTestStale",
}
# 2026-04-25 ogt + Claude Sonnet 4.6 (P0 備份告警升級修復):
# 備份失敗 > 24h 不是「純資訊」,是 P0 故障,必須走 TYPE-3 觸發 LLM 分析 + 自動修復
# BackupRestoreTestFailed 屬測試驗證類,不受 age 升級影響(仍 TYPE-1
_BACKUP_AGE_UPGRADE_NAMES = {
"HostBackupFailed", "HostBackupStale", "HostBackupMissing",
}
_BACKUP_AGE_THRESHOLD_HOURS = 24.0
if alertname in _BACKUP_AGE_UPGRADE_NAMES and age_hours > _BACKUP_AGE_THRESHOLD_HOURS:
return "backup_failure", "TYPE-3"
# 2026-04-12 ogt: 補入 DeadMansSwitchHEARTBEAT_ALERT_NAMES 中但之前漏掉)
if (
"watchdog" in alertname_lower