fix(aiops): route backup decisions through ssh
Your Name
2026-05-01 12:50:01 +08:00
parent 337bcb912e
commit 11673d80ea
11 changed files with 276 additions and 19 deletions

View File

@@ -1344,6 +1344,43 @@ Architecture Review 發現的安全要求2026-04-11
3. **Group B tools require trust_score >= 0.8** (hard-coded guard)
### Host/Backup SSH Route Invariants (2026-05-01)
`backup_failure` is a host-layer category. Keep it aligned anywhere
`host_resource` is routed, especially:
- `DecisionManager`: non-`kubectl` actions must route to SSH MCP before
`parse_kubectl_action()`. Otherwise SSH diagnosis strings with shell syntax
are blocked as `forbidden_shell_metachar`.
- `DecisionManager`: `kubectl` actions from `host_resource` or
`backup_failure` must be blocked and escalated to emergency intervention.
- `AutoRepairService`: host/backup incidents must not fall back to K8s
rollout Playbooks.
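The two `DecisionManager` invariants above can be read as an ordered decision. The following is a minimal sketch, assuming a hypothetical helper `classify_route()` that returns a label instead of executing anything; the category sets are copied from this commit, and the `AutoRepairService` invariant is not covered here.

```python
from typing import Optional

# Category sets as introduced in this commit's decision_manager.py hunk.
HOST_LAYER_SSH_CATEGORIES = {"infrastructure", "host_resource", "backup_failure"}
NON_K8S_HOST_CATEGORIES = {"host_resource", "backup_failure"}


def classify_route(category: Optional[str], action: str) -> str:
    """Return 'ssh', 'escalate', or 'kubectl' for an (alert_category, action) pair.

    Hypothetical helper for illustration only; the real DecisionManager
    performs these checks inline and then executes the chosen path.
    """
    category = category or ""
    is_kubectl = action.startswith("kubectl")
    # 1. Host-layer, non-kubectl actions go to SSH MCP before any kubectl
    #    parsing, so shell syntax never reaches the metachar guard.
    if category in HOST_LAYER_SSH_CATEGORIES and action and not is_kubectl:
        return "ssh"
    # 2. kubectl actions on host/backup alerts are blocked and escalated.
    if category in NON_K8S_HOST_CATEGORIES and is_kubectl:
        return "escalate"
    # 3. Everything else continues to the kubectl parser and safety guard.
    return "kubectl"
```

Note that `infrastructure` is in the SSH set but not in the K8s-block set, so a `kubectl` action on an `infrastructure` alert still reaches the kubectl parser.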
Runtime baseline for host/backup repair:
```bash
kubectl -n awoooi-prod get secret ssh-mcp-key awoooi-repair-ssh-key awoooi-repair-known-hosts
kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -lc '
ls -l /run/secrets/ssh_mcp_key /etc/ssh-mcp/known_hosts \
/etc/repair-ssh/id_ed25519 /etc/repair-known-hosts/known_hosts
'
kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -lc '
for h in 192.168.0.110 192.168.0.120 192.168.0.121; do
ssh -i /run/secrets/ssh_mcp_key -o BatchMode=yes \
-o StrictHostKeyChecking=yes -o ConnectTimeout=5 wooo@$h "echo OK:$h"
done
ssh -i /run/secrets/ssh_mcp_key -o BatchMode=yes \
-o StrictHostKeyChecking=yes -o ConnectTimeout=5 ollama@192.168.0.188 "echo OK:188"
'
```
`awoooi-executor` RBAC must include read-only backup evidence:
`jobs.batch`, `cronjobs.batch`, PVCs, and Velero backup resources. It may patch
`statefulsets.apps` / `daemonsets.apps` only for safe rollout restart.
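One way to spot-check these grants is `kubectl auth can-i` with impersonation. The sketch below only builds the command lines; it assumes the ServiceAccount is named `awoooi-executor` and lives in `awoooi-prod`, which this document does not state explicitly, and `can_i_commands()` is a hypothetical helper. Running the commands requires impersonation rights on the caller's side.

```python
def can_i_commands(service_account="awoooi-executor", namespace="awoooi-prod"):
    """Build `kubectl auth can-i` argv lists for the RBAC baseline.

    Assumption: the ServiceAccount name and namespace defaults are guesses
    based on the naming used in this document.
    """
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    # (verb, resource, namespace to check in)
    checks = [
        ("list", "jobs.batch", namespace),
        ("list", "cronjobs.batch", namespace),
        ("list", "persistentvolumeclaims", namespace),
        ("list", "backups.velero.io", "velero"),
        ("patch", "statefulsets.apps", namespace),
        ("patch", "daemonsets.apps", namespace),
    ]
    return [
        ["kubectl", "auth", "can-i", verb, resource, "--as", subject, "-n", ns]
        for verb, resource, ns in checks
    ]


if __name__ == "__main__":
    for argv in can_i_commands():
        print(" ".join(argv))
```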
---
## 🚀 Sprint C: DR Backup and Recovery (2026-04-11) ✅
@@ -1503,4 +1540,3 @@ ssh-mcp-key ✅ (ssh_mcp_key + known_hosts)
### Runbook
`docs/runbooks/ssh-mcp-setup.md`

View File

@@ -786,6 +786,31 @@ kubectl -n awoooi-prod logs -l app=awoooi-api --tail=50 | \
| `Permission denied (publickey)` | known_hosts is missing this host | Run SSH via Pod exec and read the error message |
| `Load key ... Permission denied` | fsGroup is not set | Pod exec `ls -la /etc/repair-ssh/` |
| `Connection refused/timeout` | NetworkPolicy blocks port 22 | Pod exec `ssh -v` to trace the connection |
| `forbidden_shell_metachar` and the action is `ssh ... '...'` | The host/backup category is not routed to SSH before the DecisionManager kubectl parser | Check whether `alert_category` is `backup_failure` and confirm `_is_host_layer_ssh_category()` covers it |
### Telegram Button E2E Checks (2026-05-01)
Alert card buttons are not pure UI. Every button must have a matching
callback pattern in `callback_action_spec.yaml` and be routed by
`callback_dispatcher.py` to an actual handler.
| Card / scenario | Required buttons | Expected handling |
|-----------|----------|----------|
| Approval / LLM action | approve, reject, details, ignore | Write the approval decision, execute or reject, show details, ignore the alert |
| Auto repair unavailable / emergency | investigate, escalate/assign, rollback when applicable | Notify a human or AI Agent to intervene; never fail silently |
| Drift TYPE-4D | view diff, adopt, rollback, ignore | View the diff, adopt the change, roll back, ignore |
| Backup / host diagnosis | restart only when rule allows, charts/logs/details, cleanup when safe | Must not offer a K8s-only repair button as the primary action for host/backup |
| Post-verification degraded/failed | rollback proposal, investigate, details | No automatic rollback; a human or the emergency AI Agent must take over |
Regression test target: button callback names emitted by `telegram_gateway.py`
must stay in sync with `callback_action_spec.yaml`; stale buttons are a
production bug because Telegram cards can outlive code deploys.
Provider name drift is also a ghost-button bug. `callback_action_spec.yaml`
may use friendly names (`k8s`, `ssh`), but dispatcher must normalize to actual
registered MCP providers (`kubernetes`, `ssh_host`) before `get_provider()`.
`backup_failure` cards must expose read-only diagnostics before any write
action: host disk, backup jobs, and Velero backup status.
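The two ghost-button failure modes described above (stale callback names and provider name drift) can be checked together. This is a minimal sketch, assuming a simplified spec shape of `{action_name: {"provider": ...}}` rather than the real nested `mcp.provider` layout; `audit_buttons()` is a hypothetical helper, and only the alias map is taken from this commit.

```python
# Alias map as introduced in this commit's callback_dispatcher.py hunk.
PROVIDER_ALIASES = {"k8s": "kubernetes", "ssh": "ssh_host"}


def resolve_provider(name: str) -> str:
    """Normalize a friendly provider name to the registered MCP provider name."""
    return PROVIDER_ALIASES.get(name, name)


def audit_buttons(emitted_callbacks, spec, registered_providers):
    """Return (stale_buttons, unregistered_providers), both sorted.

    emitted_callbacks: callback names the gateway would put on cards.
    spec: simplified mapping of action name -> {"provider": friendly_name}.
    registered_providers: provider names known to the MCP registry.
    """
    # Buttons that are emitted but have no entry in the spec.
    stale = sorted(set(emitted_callbacks) - set(spec))
    # Providers that still do not resolve to a registered name after aliasing.
    resolved = {resolve_provider(entry["provider"]) for entry in spec.values()}
    unregistered = sorted(resolved - set(registered_providers))
    return stale, unregistered
```

A regression test could assert that both returned lists are empty for the names emitted by `telegram_gateway.py`.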
---

View File

@@ -22,7 +22,7 @@
# description: <description>
version: "1.0"
last_updated: "2026-04-14"
last_updated: "2026-05-01"
actions:
# ==========================================================================
@@ -188,6 +188,53 @@ actions:
timeout_sec: 1
description: "返回飛輪儀表板 URL"
backup_check_host_disk:
label: "查主機磁碟"
emoji: "💾"
risk: low
callback_format: info
category: backup_failure
mcp:
provider: ssh
tool: ssh_get_disk_usage
params:
host: "{labels.instance}"
reply_format: code
timeout_sec: 8
description: "備份失敗時檢查主機磁碟容量與 Docker 目錄大小"
backup_check_jobs:
label: "查備份 Job"
emoji: "📦"
risk: low
callback_format: info
category: backup_failure
mcp:
provider: k8s
tool: kubectl_get
params:
namespace: "awoooi-prod"
resource: "jobs"
reply_format: truncated
timeout_sec: 8
description: "列出 awoooi-prod 內的備份相關 Job 狀態"
backup_check_velero:
label: "查 Velero"
emoji: "🧰"
risk: low
callback_format: info
category: backup_failure
mcp:
provider: k8s
tool: kubectl_get
params:
namespace: "velero"
resource: "backups.velero.io"
reply_format: truncated
timeout_sec: 8
description: "列出 Velero backup CR 狀態"
# ==========================================================================
# Write buttons (have side effects; 4-part nonce callback)
# ==========================================================================

View File

@@ -35,6 +35,18 @@ import yaml
logger = structlog.get_logger(__name__)
_PROVIDER_ALIASES = {
"k8s": "kubernetes",
"ssh": "ssh_host",
}
def _resolve_provider_name(provider_name: str) -> str:
"""Normalize legacy callback spec provider names to registered MCP providers."""
return _PROVIDER_ALIASES.get(provider_name, provider_name)
# =============================================================================
# Data Types
# =============================================================================
@@ -262,14 +274,15 @@ async def dispatch_action(
# MCP registry dispatch
from src.plugins.mcp.registry import get_provider
provider = get_provider(spec.mcp_provider)
provider_name = _resolve_provider_name(spec.mcp_provider)
provider = get_provider(provider_name)
if not provider:
duration = (time.perf_counter() - start) * 1000
return DispatchResult(
success=False, action=action_name, incident_id=incident_id,
user_id=user_id,
result_text=f"{spec.emoji} {spec.label} 失敗MCP provider '{spec.mcp_provider}' 未註冊",
error=f"provider_not_found: {spec.mcp_provider}",
result_text=f"{spec.emoji} {spec.label} 失敗MCP provider '{provider_name}' 未註冊",
error=f"provider_not_found: {provider_name}",
duration_ms=duration,
)

View File

@@ -85,6 +85,22 @@ def _should_escalate_auto_approve_rejection(reason: Any) -> bool:
}
_HOST_LAYER_SSH_CATEGORIES = {"infrastructure", "host_resource", "backup_failure"}
_NON_K8S_HOST_CATEGORIES = {"host_resource", "backup_failure"}
def _is_host_layer_ssh_category(category: str | None) -> bool:
"""Return True when DecisionManager must route non-kubectl actions to SSH."""
return (category or "") in _HOST_LAYER_SSH_CATEGORIES
def _is_non_k8s_host_category(category: str | None) -> bool:
"""Return True for host/backup alerts that must not auto-run kubectl."""
return (category or "") in _NON_K8S_HOST_CATEGORIES
async def _escalate_decision_auto_repair_unavailable(
*,
incident: Incident,
@@ -1990,36 +2006,36 @@ class DecisionManager:
except Exception as _rescue_err:
logger.debug("target_rescue_skipped", error=str(_rescue_err))
# ADR-073 Phase 3-2: infrastructure alerts (Docker/Host) → SSH MCP routing (2026-04-12 ogt)
# alert_category = "infrastructure" means a Docker alert; non-kubectl actions → SSH
# ADR-073 Phase 3-2: infrastructure/host/backup alerts → SSH MCP routing.
# alert_category = "backup_failure" uses the same host-layer path as AutoRepairService.
# P1-1 fix 2026-04-12: routing must happen before the kubectl safety guard; otherwise docker commands are blocked by _action_safe=False
_alert_category = getattr(incident, "alert_category", None) or ""
if _alert_category in {"infrastructure", "host_resource"} and action and not action.startswith("kubectl"):
if _is_host_layer_ssh_category(_alert_category) and action and not action.startswith("kubectl"):
await self._ssh_execute(incident, token, action, _target)
return
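# 2026-04-15 ogt: host_resource alerts (HostHighCpuLoad etc.) are not K8s workload problems
# 2026-04-15 ogt: host_resource/backup_failure alerts are not K8s workload problems
# kubectl operations must not run; downgrade to manual review instead
# Root cause: originally only infrastructure was blocked; host_resource must not go through K8s either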
if _alert_category == "host_resource" and action and action.startswith("kubectl"):
if _is_non_k8s_host_category(_alert_category) and action and action.startswith("kubectl"):
logger.warning(
"auto_execute_blocked_host_resource_no_k8s",
"auto_execute_blocked_host_layer_no_k8s",
incident_id=incident.incident_id,
alert_category=_alert_category,
action=action[:80],
reason="host_resource 告警不應執行 K8s kubectl 操作,降級人工審核",
reason="host/backup 告警不應執行 K8s kubectl 操作,降級人工審核",
)
token.state = DecisionState.READY
token.proposal_data["auto_executed"] = False
token.proposal_data["mcp_all_failed"] = True
token.proposal_data["blocked_reason"] = "host_resource 告警禁止 K8s kubectl請人工排查主機"
token.proposal_data["blocked_reason"] = f"{_alert_category} 告警禁止 K8s kubectl請人工排查主機/備份"
await self._save_token(token)
_fire_and_forget(
_escalate_decision_auto_repair_unavailable(
incident=incident,
token=token,
failure_reason=token.proposal_data["blocked_reason"],
attempted_actions="auto_execute -> host_resource_k8s_block -> emergency_intervention",
attempted_actions=f"auto_execute -> {_alert_category}_k8s_block -> emergency_intervention",
)
)
_fire_and_forget(_push_decision_to_telegram(incident, token.proposal_data))

View File

@@ -4,7 +4,11 @@ from src.api.v1.webhooks import (
_should_bypass_alertmanager_llm,
_should_use_alertmanager_rule_first,
)
from src.services.decision_manager import _should_escalate_auto_approve_rejection
from src.services.decision_manager import (
_is_host_layer_ssh_category,
_is_non_k8s_host_category,
_should_escalate_auto_approve_rejection,
)
from src.services.telegram_gateway import _format_resolved_guard_stamp
@@ -84,6 +88,18 @@ def test_manual_gate_reasons_escalate_to_emergency_intervention():
assert _should_escalate_auto_approve_rejection("critical_operation") is False
def test_backup_failure_routes_to_decision_ssh_before_kubectl_parser():
assert _is_host_layer_ssh_category("backup_failure") is True
assert _is_host_layer_ssh_category("host_resource") is True
assert _is_host_layer_ssh_category("kubernetes") is False
def test_backup_failure_blocks_k8s_auto_execute():
assert _is_non_k8s_host_category("backup_failure") is True
assert _is_non_k8s_host_category("host_resource") is True
assert _is_non_k8s_host_category("infrastructure") is False
def test_resolved_guard_stamp_without_timestamp_is_clean():
assert _format_resolved_guard_stamp(None) == "✅ 此事件已解決"

View File

@@ -21,6 +21,7 @@ from src.services.callback_dispatcher import (
list_actions_for_category,
load_action_registry,
_lookup_context,
_resolve_provider_name,
_resolve_template,
)
@@ -68,6 +69,11 @@ class TestRegistryLoading:
assert spec and spec.callback_format == "info", \
f"{qa} should use info format"
def test_legacy_provider_aliases_resolve_to_registered_names(self):
assert _resolve_provider_name("k8s") == "kubernetes"
assert _resolve_provider_name("ssh") == "ssh_host"
assert _resolve_provider_name("prometheus") == "prometheus"
# =============================================================================
# Category filtering
@@ -91,6 +97,16 @@ class TestCategoryFiltering:
assert any(a.callback_format == "info" for a in actions), "需至少 1 個查類"
assert any(a.callback_format == "nonce" for a in actions), "需至少 1 個寫類"
def test_backup_failure_has_read_only_diagnostics(self):
actions = list_actions_for_category("backup_failure")
names = {a.name for a in actions}
assert {
"backup_check_host_disk",
"backup_check_jobs",
"backup_check_velero",
}.issubset(names)
assert all(a.callback_format == "info" for a in actions)
# =============================================================================
# Template variable resolution

View File

@@ -12,19 +12,31 @@ Live e2e 用 `HostBackupFailed` 打 Alertmanager 後發現 aged backup 告警會
### Completed
- `_should_use_alertmanager_rule_first()` / `_should_bypass_alertmanager_llm()` now include `backup_failure`, so the backup-failure YAML `SSH_DIAGNOSE` is no longer overridden by the LLM into a K8s action.
- The `DecisionManager` SSH route is aligned with the `AutoRepairService` classification: non-kubectl actions for `backup_failure` go to SSH MCP first and no longer fall into `parse_kubectl_action()`, where they were blocked by `forbidden_shell_metachar`.
- The `DecisionManager` host/backup K8s block now includes `backup_failure`: if the LLM or a Playbook produces a kubectl action, it goes straight to emergency escalation instead of mistakenly applying a K8s repair to a backup alert.
- `AutoRepairService` adds a host/backup Playbook guard: if a host/backup incident matches a K8s rollout Playbook, it is blocked as `HOST_BACKUP_K8S_PLAYBOOK` and routed to emergency intervention.
- `AutoRepairService` post-verification rollback guard: when verification fails for a host/backup or non-K8s Playbook, it no longer synthesizes `kubectl rollout restart deployment/{target}`; it goes to emergency escalation and does not auto-resolve the incident.
- `EmergencyEscalationService` reuses the existing `APPROVAL_ESCALATED` DB enum when writing AOL, so the emergency channel does not fail to leave a record because a new enum has not been migrated.
- Added `phase25_knowledge_enum_names.sql` so that the `AUTO_RUNBOOK` / `ANTI_PATTERN` enum names can be written to PG, fixing the failure to persist auto runbook KM entries.
- The `NodeExporterDown` Prometheus rule `auto_repair` is changed to `true`, consistent with the exporter restart strategy in the YAML rule catalog.
- `awoooi-executor` RBAC gains backup/DR diagnostic permissions: PVC, Jobs/CronJobs, and Velero resources read-only, plus StatefulSet/DaemonSet safe rollout patch.
- NetworkPolicy adds K3s master/worker `22/tcp` egress so that SSH MCP can reach 120/121, not only 110/188.
- Telegram category buttons gain provider alias normalization: `k8s` → `kubernetes`, `ssh` → `ssh_host`, so a rendered button no longer fails because the dispatcher cannot find the MCP provider.
- `backup_failure` gains three read-only diagnostic buttons: check host disk, check backup Jobs, check Velero. Backup alerts no longer offer only the generic approve/reject/details buttons.
- Added `backup_failure` NO_ACTION / SSH_DIAGNOSE unit tests.
### Verification
- `python3 -m py_compile apps/api/src/api/v1/webhooks.py` passed.
- `python3 -m py_compile apps/api/src/services/decision_manager.py apps/api/src/services/callback_dispatcher.py` passed.
- `cd apps/api && pytest tests/test_alertmanager_rule_bypass.py tests/test_telegram_ai_automation_block.py tests/test_ai_router_diagnose_fallback.py -q` → 24 passed.
- `cd apps/api && pytest tests/test_auto_repair_service.py tests/test_alertmanager_rule_bypass.py -q` → 27 passed.
- `cd apps/api && pytest tests/test_auto_repair_service.py tests/test_alertmanager_rule_bypass.py -q` → 29 passed.
- `cd apps/api && pytest tests/test_alertmanager_rule_bypass.py tests/test_callback_dispatcher.py tests/test_telegram_button_consistency.py -q` → 56 passed.
- YAML parse of `ops/monitoring/alerts-unified.yml` and `apps/api/alert_rules.yaml` passed.
- YAML parse of `callback_action_spec.yaml`, `07-rbac.yaml`, and `02-network-policy.yaml` passed.
- Live Secret/mount check: `ssh-mcp-key`, `awoooi-repair-ssh-key`, and `awoooi-repair-known-hosts` exist and their mounts are readable.
- Live SSH MCP key check: `wooo@192.168.0.110` and `ollama@192.168.0.188` are OK; `wooo@192.168.0.120/121` passed host key verification, but the remote `authorized_keys` does not yet include the public key, so they return `Permission denied (publickey,password)`.
- The live RBAC apply was reverted by Argo to match the Git state; `07-rbac.yaml` must be pushed to Gitea and synced by Argo before re-verifying `can-i`.
## 2026-04-30 | ADR-104 Playbook versioned lineage

View File

@@ -140,6 +140,48 @@ MoWoooWorkDown → Jaccard 匹配 momo-app-down-repair → SSH ollama@192.168.0.
---
## Appendix B — Backup Failure Route Parity (2026-05-01)
`HostBackupFailed` and other backup alerts have the `alert_category` `backup_failure`. It must be handled on par with `host_resource` in every host-layer automation path:
| Layer | Required behavior |
|-------|----------|
| Alertmanager rule-first | YAML `SSH_DIAGNOSE` / `NO_ACTION` is not overridden by the LLM |
| AutoRepairService | `backup_failure` is treated as host-layer; K8s Playbook fallback is refused |
| DecisionManager | Non-`kubectl` actions are routed to SSH MCP before the kubectl parser |
| DecisionManager K8s guard | When `backup_failure` produces `kubectl`, downgrade to emergency escalation |
| Telegram buttons | `backup_failure` shows read-only diagnostic buttons: host disk, backup Jobs, Velero status |
2026-05-01 root cause: the DecisionManager SSH route contained only `infrastructure` / `host_resource` and missed `backup_failure`, so read-only diagnostic actions such as `ssh 192.168.0.110 '...;...'` fell into `parse_kubectl_action()` and were blocked by `forbidden_shell_metachar`.
The same-day button audit also found that the friendly provider names on category buttons drift: `callback_action_spec.yaml` uses `k8s` / `ssh`, but the actual MCP registry names are `kubernetes` / `ssh_host`. The dispatcher must normalize provider aliases; otherwise the card button is displayed but fails at execution with `provider_not_found`.
### Runtime permission baseline
- K8s Secret:
- `awoooi-repair-ssh-key` mounted at `/etc/repair-ssh/`
- `awoooi-repair-known-hosts` mounted at `/etc/repair-known-hosts/`
- `ssh-mcp-key` mounted at `/run/secrets/ssh_mcp_key` and `/etc/ssh-mcp/known_hosts`
- Remote `authorized_keys`:
- `wooo@192.168.0.110`
- `wooo@192.168.0.120`
- `wooo@192.168.0.121`
- `ollama@192.168.0.188`
- NetworkPolicy egress:
- `192.168.0.110:22`
- `192.168.0.120:22`
- `192.168.0.121:22`
- `192.168.0.188:22`
- `awoooi-executor` RBAC:
- read `jobs.batch`, `cronjobs.batch`
- read `persistentvolumeclaims`
- read Velero `backups`, `backupstoragelocations`, `backuprepositories`, `podvolumebackups`, `podvolumerestores`, `restores`, `schedules`
- patch `statefulsets.apps` / `daemonsets.apps` only for safe rollout restart
If SSH MCP fails, the incident must not silently become a manual approval card; it must raise the emergency intervention path with the exact SSH failure reason when available.
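To make "the exact SSH failure reason" concrete, the sketch below maps SSH stderr to a short reason string using the error signatures from the runbook troubleshooting table in this commit. The reason codes and the `ssh_failure_reason()` helper are hypothetical and not part of the codebase; the sketch only illustrates how an escalation message could carry the first recognizable stderr line.

```python
import re

# Error signatures come from the runbook troubleshooting table;
# the reason codes on the right are hypothetical labels.
# Order matters: the "Load key" pattern is checked before the generic
# "Permission denied (publickey" pattern.
_SSH_FAILURE_PATTERNS = [
    (re.compile(r"Load key .*Permission denied"), "ssh_key_unreadable"),
    (re.compile(r"Permission denied \(publickey"), "ssh_auth_rejected"),
    (re.compile(r"Connection refused|timed out"), "ssh_network_blocked"),
]

_UNKNOWN = "ssh_failed: no recognizable error in stderr"


def ssh_failure_reason(stderr: str) -> str:
    """Return '<code>: <stderr line>' for the first recognized SSH error line."""
    for line in stderr.splitlines():
        for pattern, code in _SSH_FAILURE_PATTERNS:
            if pattern.search(line):
                return f"{code}: {line.strip()}"
    # Nothing matched; the caller should still escalate, just without detail.
    return _UNKNOWN
```

The returned string could be passed as the failure reason of the emergency escalation so that the responder sees the raw SSH error instead of a generic failure.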
---
## Chief Architect Review Record (2026-04-05)
Score: **72/100 → 88/100 after fixes**

View File

@@ -1,8 +1,9 @@
# AWOOOI production zero-trust network policy
# Owner: CIO
# Version: v1.5
# Date: 2026-04-14
# Version: v1.6
# Date: 2026-05-01
# Changes:
# - v1.6: add K3s node 120/121 SSH egress for SSH MCP host diagnosis/repair
# - v1.5: add keepalived VIP 192.168.0.125/32 ArgoCD NodePort 30443 egress to fix the heartbeat probe
# - v1.4: add ArgoCD MCP egress (argocd namespace, port 80/443)
# - v1.3: add the 192.168.0.111 Ollama host (M1 Pro); remove the Ollama port from 188
@@ -168,7 +169,7 @@ spec:
- protocol: TCP
port: 8080
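# Allow access to the K8s API (Executor runs kubectl commands)
# Allow access to the K8s API + K3s master SSH (Executor runs kubectl/host diagnosis)
# 2026-03-23 fix: Y button execution timed out
# Important: ClusterIP (10.43.0.1:443) routes to the actual endpoint (192.168.0.120:6443)
# Both must be allowed; otherwise traffic is blocked by the 192.168.0.0/16 exclusion rule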
@@ -180,8 +181,11 @@ spec:
port: 443
- to:
- ipBlock:
cidr: 192.168.0.120/32 # K3s Master actual API Server endpoint + ArgoCD NodePort
cidr: 192.168.0.120/32 # K3s Master actual API Server endpoint + ArgoCD NodePort + SSH MCP
ports:
# SSH MCP — K3s master host diagnosis/repair path
- protocol: TCP
port: 22
- protocol: TCP
port: 6443
# ArgoCD MCP NodePort (2026-04-11): ClusterIP DNAT across namespaces is unstable; use NodePort instead
@@ -221,6 +225,9 @@ spec:
- ipBlock:
cidr: 192.168.0.121/32
ports:
# SSH MCP — K3s worker host diagnosis/repair path
- protocol: TCP
port: 22
- protocol: TCP
port: 6443
- protocol: TCP

View File

@@ -62,6 +62,11 @@ rules:
resources: ["services", "configmaps"]
verbs: ["get", "list", "watch"]
# 2026-05-01: backup/disk diagnostics need PVC visibility; read-only only.
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses"]
verbs: ["get", "list", "watch"]
@@ -89,6 +94,23 @@ rules:
resources: ["statefulsets", "daemonsets"]
verbs: ["get", "list", "watch"]
# 2026-05-01: HostBackupFailed / VeleroBackupFailed diagnosis needs backup job status.
- apiGroups: ["batch"]
resources: ["jobs", "cronjobs"]
verbs: ["get", "list", "watch"]
# 2026-05-01: Velero backup status is read-only evidence for backup_failure alerts.
- apiGroups: ["velero.io"]
resources:
- backups
- backupstoragelocations
- backuprepositories
- podvolumebackups
- podvolumerestores
- restores
- schedules
verbs: ["get", "list", "watch"]
# ============================================================================
# Write permissions (Write) - troubleshooting operations only
# ============================================================================
@@ -104,6 +126,11 @@ rules:
resources: ["deployments"]
verbs: ["patch"]
# 2026-05-01: allow the same safe rollout restart primitive on controller types.
- apiGroups: ["apps"]
resources: ["statefulsets", "daemonsets"]
verbs: ["patch"]
# Scale Deployments (scale up/down)
- apiGroups: ["apps"]
resources: ["deployments/scale"]