fix(aiops): route backup decisions through ssh
@@ -1344,6 +1344,43 @@ Security requirements found in the Architecture Review (2026-04-11):

3. **Group B tools require trust_score >= 0.8** (hard-coded guard)

### Host/Backup SSH Route Invariants (2026-05-01)

`backup_failure` is a host-layer category. Keep it aligned anywhere
`host_resource` is routed, especially:

- `DecisionManager`: non-`kubectl` actions must route to SSH MCP before
  `parse_kubectl_action()`. Otherwise SSH diagnosis strings with shell syntax
  are blocked as `forbidden_shell_metachar`.
- `DecisionManager`: `kubectl` actions from `host_resource` or
  `backup_failure` must be blocked and escalated to emergency intervention.
- `AutoRepairService`: host/backup incidents must not fall back to K8s
  rollout Playbooks.
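
The ordering in the first two invariants can be sketched as a small pure function. This is only an illustration under stated assumptions: the two category sets mirror the ones this commit adds to `decision_manager.py`, while `choose_route` and its return labels are hypothetical names that do not exist in the codebase.

```python
from typing import Optional

# Category sets as introduced by this commit.
HOST_LAYER_SSH_CATEGORIES = {"infrastructure", "host_resource", "backup_failure"}
NON_K8S_HOST_CATEGORIES = {"host_resource", "backup_failure"}


def choose_route(category: Optional[str], action: str) -> str:
    """Return a hypothetical route label for an action (illustration only)."""
    cat = category or ""
    is_kubectl = action.startswith("kubectl")
    # 1. Host-layer, non-kubectl actions go to SSH before any kubectl parsing,
    #    so shell syntax in the SSH string never reaches the kubectl parser.
    if cat in HOST_LAYER_SSH_CATEGORIES and action and not is_kubectl:
        return "ssh_mcp"
    # 2. kubectl actions on host/backup alerts are blocked and escalated.
    if cat in NON_K8S_HOST_CATEGORIES and is_kubectl:
        return "emergency_escalation"
    # 3. Everything else continues to the kubectl parser and its safety guard.
    return "kubectl_parser"
```

Note that `infrastructure` is in the SSH set but not in the K8s-block set, so a `kubectl` action on an `infrastructure` alert still reaches the kubectl parser in this sketch, matching the unit tests in this commit.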

Runtime baseline for host/backup repair:

```bash
kubectl -n awoooi-prod get secret ssh-mcp-key awoooi-repair-ssh-key awoooi-repair-known-hosts

kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -lc '
  ls -l /run/secrets/ssh_mcp_key /etc/ssh-mcp/known_hosts \
    /etc/repair-ssh/id_ed25519 /etc/repair-known-hosts/known_hosts
'

kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -lc '
  for h in 192.168.0.110 192.168.0.120 192.168.0.121; do
    ssh -i /run/secrets/ssh_mcp_key -o BatchMode=yes \
      -o StrictHostKeyChecking=yes -o ConnectTimeout=5 wooo@$h "echo OK:$h"
  done
  ssh -i /run/secrets/ssh_mcp_key -o BatchMode=yes \
    -o StrictHostKeyChecking=yes -o ConnectTimeout=5 ollama@192.168.0.188 "echo OK:188"
'
```

`awoooi-executor` RBAC must include read-only backup evidence:
`jobs.batch`, `cronjobs.batch`, PVCs, and Velero backup resources. It may patch
`statefulsets.apps` / `daemonsets.apps` only for safe rollout restart.
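
One way to check the RBAC statement after Argo has synced is to run `kubectl auth can-i` while impersonating the executor. The sketch below only builds the command lines. It assumes the ServiceAccount is named `awoooi-executor` in `awoooi-prod` (the source names the RBAC role, not the ServiceAccount) and that Velero backups live in the `velero` namespace, as the callback spec in this commit suggests. `can_i_commands` is a hypothetical helper.

```python
import shlex

NAMESPACE = "awoooi-prod"            # namespace used by the baseline commands
SERVICE_ACCOUNT = "awoooi-executor"  # assumption: ServiceAccount name matches the role

# (verb, resource, namespace) triples derived from the RBAC statement above.
EXPECTED = [
    ("list", "jobs.batch", NAMESPACE),
    ("list", "cronjobs.batch", NAMESPACE),
    ("list", "persistentvolumeclaims", NAMESPACE),
    ("list", "backups.velero.io", "velero"),
    ("patch", "statefulsets.apps", NAMESPACE),
    ("patch", "daemonsets.apps", NAMESPACE),
]


def can_i_commands(service_account: str = SERVICE_ACCOUNT) -> list:
    """Build one `kubectl auth can-i` command line per expected permission."""
    subject = f"system:serviceaccount:{NAMESPACE}:{service_account}"
    return [
        shlex.join(
            ["kubectl", "auth", "can-i", verb, resource, "-n", ns, f"--as={subject}"]
        )
        for verb, resource, ns in EXPECTED
    ]


if __name__ == "__main__":
    for cmd in can_i_commands():
        print(cmd)
```

Each printed command should answer `yes` once the updated role is live; a `no` for the `patch` lines would mean the rollout-restart permission has not been synced yet.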

---

## 🚀 Sprint C — DR Backup and Recovery (2026-04-11) ✅
@@ -1503,4 +1540,3 @@ ssh-mcp-key ✅ (ssh_mcp_key + known_hosts)

### Runbook
`docs/runbooks/ssh-mcp-setup.md`

@@ -786,6 +786,31 @@ kubectl -n awoooi-prod logs -l app=awoooi-api --tail=50 | \
| `Permission denied (publickey)` | known_hosts is missing the host | Run SSH via Pod exec and read the error message |
| `Load key ... Permission denied` | fsGroup is not set | Pod exec `ls -la /etc/repair-ssh/` |
| `Connection refused/timeout` | NetworkPolicy blocks port 22 | Pod exec `ssh -v` and follow the connection steps |
| `forbidden_shell_metachar` and the action is `ssh ... '...'` | The host/backup category was not routed to SSH before the DecisionManager kubectl parser | Check whether `alert_category` is `backup_failure` and confirm `_is_host_layer_ssh_category()` covers it |

### Telegram Button E2E Check (2026-05-01)

Alert card buttons are not pure UI. Every button must have a matching
callback pattern in `callback_action_spec.yaml` and be routed by
`callback_dispatcher.py` to a real handler.

| Card / scenario | Required buttons | Expected handling |
|-----------------|------------------|-------------------|
| Approval / LLM action | approve, reject, details, ignore | Write the approval decision, execute or reject, show details, ignore the alert |
| Auto repair unavailable / emergency | investigate, escalate/assign, rollback when applicable | Notify a human or AI Agent to intervene; never fail silently |
| Drift TYPE-4D | view diff, adopt, rollback, ignore | View the diff, adopt the change, roll back, ignore |
| Backup / host diagnosis | restart only when rule allows, charts/logs/details, cleanup when safe | Must not offer a K8s-only repair button as the primary host/backup action |
| Post-verification degraded/failed | rollback proposal, investigate, details | No automatic rollback; a human or the emergency AI Agent must take over |

Regression test target: button callback names emitted by `telegram_gateway.py`
must stay in sync with `callback_action_spec.yaml`; stale buttons are a
production bug because Telegram cards can outlive code deploys.
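
The regression target above reduces to a set difference. A minimal sketch, assuming the spec has already been parsed into a dict with a top-level `actions` mapping and that the emitted callback names have been collected into a list; `stale_buttons` and `backup_restart_agent` are hypothetical names used only for illustration.

```python
def stale_buttons(emitted_names, spec):
    """Return emitted callback names with no matching key under spec["actions"]."""
    known = set(spec.get("actions", {}))
    return sorted(set(emitted_names) - known)


# Stand-in for the parsed callback_action_spec.yaml (only the keys matter here).
spec = {
    "actions": {
        "backup_check_host_disk": {},
        "backup_check_jobs": {},
        "backup_check_velero": {},
    }
}

# Stand-in for the names a card builder emits; the last one is hypothetical.
emitted = ["backup_check_host_disk", "backup_check_velero", "backup_restart_agent"]

print(stale_buttons(emitted, spec))
```

A real test would assert that the returned list is empty; here it contains the one hypothetical name that has no spec entry.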

Provider name drift is also a ghost-button bug. `callback_action_spec.yaml`
may use friendly names (`k8s`, `ssh`), but dispatcher must normalize to actual
registered MCP providers (`kubernetes`, `ssh_host`) before `get_provider()`.
`backup_failure` cards must expose read-only diagnostics before any write
action: host disk, backup jobs, and Velero backup status.
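
Provider drift can be caught before a card is ever rendered by resolving every spec provider and comparing it with the registry. The sketch below assumes each action nests its provider under `mcp.provider` (as the spec entries in this commit do) and that the registered provider names are available as a set. `unresolved_providers`, the `registered` contents, and the `argo` entry are illustrative assumptions.

```python
# Alias table as added to callback_dispatcher.py in this commit.
PROVIDER_ALIASES = {"k8s": "kubernetes", "ssh": "ssh_host"}


def unresolved_providers(spec, registered):
    """Map action name -> resolved provider for providers missing from the registry."""
    missing = {}
    for name, action in spec.get("actions", {}).items():
        provider = (action.get("mcp") or {}).get("provider")
        if provider is None:
            continue  # action without an MCP call
        resolved = PROVIDER_ALIASES.get(provider, provider)
        if resolved not in registered:
            missing[name] = resolved
    return missing


registered = {"kubernetes", "ssh_host"}  # assumed registry contents
spec = {
    "actions": {
        "backup_check_host_disk": {"mcp": {"provider": "ssh"}},
        "backup_check_jobs": {"mcp": {"provider": "k8s"}},
        "example_unknown": {"mcp": {"provider": "argo"}},  # hypothetical entry
    }
}
```

With these inputs only `example_unknown` is reported, because `ssh` and `k8s` resolve to registered names; any non-empty result would correspond to a button that ends in `provider_not_found`.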

---

@@ -22,7 +22,7 @@
# description: <description>

version: "1.0"
-last_updated: "2026-04-14"
+last_updated: "2026-05-01"

actions:
# ==========================================================================
@@ -188,6 +188,53 @@ actions:
timeout_sec: 1
description: "返回飛輪儀表板 URL"

backup_check_host_disk:
label: "查主機磁碟"
emoji: "💾"
risk: low
callback_format: info
category: backup_failure
mcp:
provider: ssh
tool: ssh_get_disk_usage
params:
host: "{labels.instance}"
reply_format: code
timeout_sec: 8
description: "備份失敗時檢查主機磁碟容量與 Docker 目錄大小"

backup_check_jobs:
label: "查備份 Job"
emoji: "📦"
risk: low
callback_format: info
category: backup_failure
mcp:
provider: k8s
tool: kubectl_get
params:
namespace: "awoooi-prod"
resource: "jobs"
reply_format: truncated
timeout_sec: 8
description: "列出 awoooi-prod 內的備份相關 Job 狀態"

backup_check_velero:
label: "查 Velero"
emoji: "🧰"
risk: low
callback_format: info
category: backup_failure
mcp:
provider: k8s
tool: kubectl_get
params:
namespace: "velero"
resource: "backups.velero.io"
reply_format: truncated
timeout_sec: 8
description: "列出 Velero backup CR 狀態"

# ==========================================================================
# Write buttons (have side effects, 4-part nonce callback)
# ==========================================================================

@@ -35,6 +35,18 @@ import yaml
logger = structlog.get_logger(__name__)


_PROVIDER_ALIASES = {
    "k8s": "kubernetes",
    "ssh": "ssh_host",
}


def _resolve_provider_name(provider_name: str) -> str:
    """Normalize legacy callback spec provider names to registered MCP providers."""

    return _PROVIDER_ALIASES.get(provider_name, provider_name)


# =============================================================================
# Data Types
# =============================================================================
@@ -262,14 +274,15 @@ async def dispatch_action(

    # MCP registry dispatch
    from src.plugins.mcp.registry import get_provider
-    provider = get_provider(spec.mcp_provider)
+    provider_name = _resolve_provider_name(spec.mcp_provider)
+    provider = get_provider(provider_name)
    if not provider:
        duration = (time.perf_counter() - start) * 1000
        return DispatchResult(
            success=False, action=action_name, incident_id=incident_id,
            user_id=user_id,
-            result_text=f"{spec.emoji} {spec.label} 失敗:MCP provider '{spec.mcp_provider}' 未註冊",
-            error=f"provider_not_found: {spec.mcp_provider}",
+            result_text=f"{spec.emoji} {spec.label} 失敗:MCP provider '{provider_name}' 未註冊",
+            error=f"provider_not_found: {provider_name}",
            duration_ms=duration,
        )

@@ -85,6 +85,22 @@ def _should_escalate_auto_approve_rejection(reason: Any) -> bool:
    }


_HOST_LAYER_SSH_CATEGORIES = {"infrastructure", "host_resource", "backup_failure"}
_NON_K8S_HOST_CATEGORIES = {"host_resource", "backup_failure"}


def _is_host_layer_ssh_category(category: str | None) -> bool:
    """Return True when DecisionManager must route non-kubectl actions to SSH."""

    return (category or "") in _HOST_LAYER_SSH_CATEGORIES


def _is_non_k8s_host_category(category: str | None) -> bool:
    """Return True for host/backup alerts that must not auto-run kubectl."""

    return (category or "") in _NON_K8S_HOST_CATEGORIES


async def _escalate_decision_auto_repair_unavailable(
    *,
    incident: Incident,
@@ -1990,36 +2006,36 @@ class DecisionManager:
        except Exception as _rescue_err:
            logger.debug("target_rescue_skipped", error=str(_rescue_err))

-        # ADR-073 Phase 3-2: infrastructure alerts (Docker/Host) → SSH MCP routing (2026-04-12 ogt)
-        # alert_category = "infrastructure" means a Docker alert; non-kubectl action → SSH
+        # ADR-073 Phase 3-2: infrastructure/host/backup alerts → SSH MCP routing.
+        # alert_category = "backup_failure" uses the same host-layer path as AutoRepairService.
        # P1-1 fix 2026-04-12: routing must happen before the kubectl safety guard, otherwise docker commands are intercepted by _action_safe=False
        _alert_category = getattr(incident, "alert_category", None) or ""
-        if _alert_category in {"infrastructure", "host_resource"} and action and not action.startswith("kubectl"):
+        if _is_host_layer_ssh_category(_alert_category) and action and not action.startswith("kubectl"):
            await self._ssh_execute(incident, token, action, _target)
            return

-        # 2026-04-15 ogt: host_resource alerts (HostHighCpuLoad etc.) are not K8s workload problems
+        # 2026-04-15 ogt: host_resource/backup_failure alerts are not K8s workload problems
        # kubectl operations must not run; downgrade to manual review
        # Root cause: originally only infrastructure was blocked; host_resource must not go through K8s either
-        if _alert_category == "host_resource" and action and action.startswith("kubectl"):
+        if _is_non_k8s_host_category(_alert_category) and action and action.startswith("kubectl"):
            logger.warning(
-                "auto_execute_blocked_host_resource_no_k8s",
+                "auto_execute_blocked_host_layer_no_k8s",
                incident_id=incident.incident_id,
                alert_category=_alert_category,
                action=action[:80],
-                reason="host_resource 告警不應執行 K8s kubectl 操作,降級人工審核",
+                reason="host/backup 告警不應執行 K8s kubectl 操作,降級人工審核",
            )
            token.state = DecisionState.READY
            token.proposal_data["auto_executed"] = False
            token.proposal_data["mcp_all_failed"] = True
-            token.proposal_data["blocked_reason"] = "host_resource 告警禁止 K8s kubectl,請人工排查主機"
+            token.proposal_data["blocked_reason"] = f"{_alert_category} 告警禁止 K8s kubectl,請人工排查主機/備份"
            await self._save_token(token)
            _fire_and_forget(
                _escalate_decision_auto_repair_unavailable(
                    incident=incident,
                    token=token,
                    failure_reason=token.proposal_data["blocked_reason"],
-                    attempted_actions="auto_execute -> host_resource_k8s_block -> emergency_intervention",
+                    attempted_actions=f"auto_execute -> {_alert_category}_k8s_block -> emergency_intervention",
                )
            )
            _fire_and_forget(_push_decision_to_telegram(incident, token.proposal_data))

@@ -4,7 +4,11 @@ from src.api.v1.webhooks import (
    _should_bypass_alertmanager_llm,
    _should_use_alertmanager_rule_first,
)
-from src.services.decision_manager import _should_escalate_auto_approve_rejection
+from src.services.decision_manager import (
+    _is_host_layer_ssh_category,
+    _is_non_k8s_host_category,
+    _should_escalate_auto_approve_rejection,
+)
from src.services.telegram_gateway import _format_resolved_guard_stamp


@@ -84,6 +88,18 @@ def test_manual_gate_reasons_escalate_to_emergency_intervention():
    assert _should_escalate_auto_approve_rejection("critical_operation") is False


def test_backup_failure_routes_to_decision_ssh_before_kubectl_parser():
    assert _is_host_layer_ssh_category("backup_failure") is True
    assert _is_host_layer_ssh_category("host_resource") is True
    assert _is_host_layer_ssh_category("kubernetes") is False


def test_backup_failure_blocks_k8s_auto_execute():
    assert _is_non_k8s_host_category("backup_failure") is True
    assert _is_non_k8s_host_category("host_resource") is True
    assert _is_non_k8s_host_category("infrastructure") is False


def test_resolved_guard_stamp_without_timestamp_is_clean():
    assert _format_resolved_guard_stamp(None) == "✅ 此事件已解決"

@@ -21,6 +21,7 @@ from src.services.callback_dispatcher import (
    list_actions_for_category,
    load_action_registry,
    _lookup_context,
    _resolve_provider_name,
    _resolve_template,
)

@@ -68,6 +69,11 @@ class TestRegistryLoading:
            assert spec and spec.callback_format == "info", \
                f"{qa} should use info format"

    def test_legacy_provider_aliases_resolve_to_registered_names(self):
        assert _resolve_provider_name("k8s") == "kubernetes"
        assert _resolve_provider_name("ssh") == "ssh_host"
        assert _resolve_provider_name("prometheus") == "prometheus"


# =============================================================================
# Category filtering
@@ -91,6 +97,16 @@ class TestCategoryFiltering:
        assert any(a.callback_format == "info" for a in actions), "需至少 1 個查類"
        assert any(a.callback_format == "nonce" for a in actions), "需至少 1 個寫類"

    def test_backup_failure_has_read_only_diagnostics(self):
        actions = list_actions_for_category("backup_failure")
        names = {a.name for a in actions}
        assert {
            "backup_check_host_disk",
            "backup_check_jobs",
            "backup_check_velero",
        }.issubset(names)
        assert all(a.callback_format == "info" for a in actions)


# =============================================================================
# Template variable resolution

@@ -12,19 +12,31 @@ After a live e2e run fired `HostBackupFailed` at Alertmanager, it turned out that aged backup alerts would

### Done
- `_should_use_alertmanager_rule_first()` / `_should_bypass_alertmanager_llm()` now include `backup_failure`, so the backup-failure YAML `SSH_DIAGNOSE` is no longer overwritten by the LLM with a K8s action.
- The `DecisionManager` SSH route is aligned with the `AutoRepairService` categories: non-kubectl `backup_failure` actions go to SSH MCP first instead of falling into `parse_kubectl_action()` and being blocked by `forbidden_shell_metachar`.
- The `DecisionManager` host/backup K8s block now includes `backup_failure`. If the LLM or a Playbook produces a kubectl action, it goes straight to emergency escalation instead of wrongly applying a K8s repair to a backup alert.
- `AutoRepairService` adds a host/backup Playbook guard: when a host/backup incident matches a K8s rollout Playbook, it is blocked as `HOST_BACKUP_K8S_PLAYBOOK` and routed to emergency intervention.
- `AutoRepairService` post-verification rollback guard: when verification fails for a host/backup or non-K8s Playbook, it no longer synthesizes `kubectl rollout restart deployment/{target}`. It goes to emergency escalation and does not auto-resolve the incident.
- `EmergencyEscalationService` reuses the existing `APPROVAL_ESCALATED` DB enum when writing AOL, so the emergency channel does not fail to leave a trace because a new enum has not been migrated.
- Added `phase25_knowledge_enum_names.sql` so the `AUTO_RUNBOOK` / `ANTI_PATTERN` enum names can be written to PG, fixing the failed persistence of auto runbooks into KM.
- The `NodeExporterDown` Prometheus rule now sets `auto_repair` to `true`, consistent with the exporter restart strategy in the YAML rule catalog.
- `awoooi-executor` RBAC gains backup/DR diagnostic permissions: read-only access to PVCs, Jobs/CronJobs, and Velero resources, plus the StatefulSet/DaemonSet safe rollout patch.
- NetworkPolicy adds `22/tcp` egress to the K3s master/worker so SSH MCP can cover 120/121, not only 110/188.
- Telegram category buttons add provider alias normalization: `k8s` → `kubernetes`, `ssh` → `ssh_host`, so a rendered button no longer fails because the dispatcher cannot find the MCP provider.
- `backup_failure` gains three read-only diagnostic buttons: check host disk, check backup Jobs, check Velero. Backup alerts no longer have only the generic approve/reject/details buttons.
- Added `backup_failure` NO_ACTION / SSH_DIAGNOSE unit tests.

### Verification
- `python3 -m py_compile apps/api/src/api/v1/webhooks.py` passed.
- `python3 -m py_compile apps/api/src/services/decision_manager.py apps/api/src/services/callback_dispatcher.py` passed.
- `cd apps/api && pytest tests/test_alertmanager_rule_bypass.py tests/test_telegram_ai_automation_block.py tests/test_ai_router_diagnose_fallback.py -q` → 24 passed.
- `cd apps/api && pytest tests/test_auto_repair_service.py tests/test_alertmanager_rule_bypass.py -q` → 27 passed.
- `cd apps/api && pytest tests/test_auto_repair_service.py tests/test_alertmanager_rule_bypass.py -q` → 29 passed.
- `cd apps/api && pytest tests/test_alertmanager_rule_bypass.py tests/test_callback_dispatcher.py tests/test_telegram_button_consistency.py -q` → 56 passed.
- YAML parse of `ops/monitoring/alerts-unified.yml` and `apps/api/alert_rules.yaml` passed.
- YAML parse of `callback_action_spec.yaml`, `07-rbac.yaml`, and `02-network-policy.yaml` passed.
- Live Secret/mount check: `ssh-mcp-key`, `awoooi-repair-ssh-key`, and `awoooi-repair-known-hosts` exist and their mounts are readable.
- Live SSH MCP key check: `wooo@192.168.0.110` and `ollama@192.168.0.188` are OK. `wooo@192.168.0.120/121` pass host key verification, but the remote `authorized_keys` does not yet include this public key, so they return `Permission denied (publickey,password)`.
- The live RBAC apply was reverted by Argo to the Git state. `07-rbac.yaml` must be pushed to Gitea and synced by Argo before `can-i` is re-checked.

## 2026-04-30 | ADR-104 Playbook versioned lineage

@@ -140,6 +140,48 @@ MoWoooWorkDown → Jaccard match momo-app-down-repair → SSH ollama@192.168.0.

---

## Appendix B — Backup Failure Route Parity (2026-05-01)

The `alert_category` of `HostBackupFailed` and other backup alerts is `backup_failure`. It must be handled at the same level as `host_resource` in every host-layer automation path:

| Layer | Required behavior |
|-------|-------------------|
| Alertmanager rule-first | YAML `SSH_DIAGNOSE` / `NO_ACTION` is not overridden by the LLM |
| AutoRepairService | Treats `backup_failure` as host-layer and refuses K8s Playbook fallback |
| DecisionManager | Routes non-`kubectl` actions to SSH MCP before the kubectl parser |
| DecisionManager K8s guard | Downgrades to emergency escalation when `backup_failure` produces a `kubectl` action |
| Telegram buttons | `backup_failure` shows read-only diagnostic buttons: host disk, backup Jobs, Velero status |

2026-05-01 root cause: the DecisionManager SSH route covered only `infrastructure` / `host_resource` and missed `backup_failure`, so read-only diagnostic actions such as `ssh 192.168.0.110 '...;...'` fell into `parse_kubectl_action()` and were blocked by `forbidden_shell_metachar`.

The button audit on the same day also found that the friendly provider names of category buttons drift: `callback_action_spec.yaml` uses `k8s` / `ssh`, but the MCP registry names are actually `kubernetes` / `ssh_host`. The dispatcher must normalize provider aliases; otherwise the card shows the button but execution fails with `provider_not_found`.

### Runtime Permission Baseline

- K8s Secret:
  - `awoooi-repair-ssh-key` mounted at `/etc/repair-ssh/`
  - `awoooi-repair-known-hosts` mounted at `/etc/repair-known-hosts/`
  - `ssh-mcp-key` mounted at `/run/secrets/ssh_mcp_key` and `/etc/ssh-mcp/known_hosts`
- Remote `authorized_keys`:
  - `wooo@192.168.0.110`
  - `wooo@192.168.0.120`
  - `wooo@192.168.0.121`
  - `ollama@192.168.0.188`
- NetworkPolicy egress:
  - `192.168.0.110:22`
  - `192.168.0.120:22`
  - `192.168.0.121:22`
  - `192.168.0.188:22`
- `awoooi-executor` RBAC:
  - read `jobs.batch`, `cronjobs.batch`
  - read `persistentvolumeclaims`
  - read Velero `backups`, `backupstoragelocations`, `backuprepositories`, `podvolumebackups`, `podvolumerestores`, `restores`, `schedules`
  - patch `statefulsets.apps` / `daemonsets.apps` only for safe rollout restart

If SSH MCP fails, the incident must not silently become a manual approval card; it must raise the emergency intervention path with the exact SSH failure reason when available.
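
The failure rule above can be sketched as a small decision function. This is an illustration, not the project's implementation: `SshOutcome`, `next_path`, and the path labels are hypothetical, and the only behavior taken from the text is that a failed SSH call goes to emergency intervention and carries the exact SSH error when one was captured.

```python
from dataclasses import dataclass


@dataclass
class SshOutcome:
    """Hypothetical result of one SSH MCP call (not a type from the codebase)."""
    ok: bool
    stderr: str = ""


def next_path(outcome: SshOutcome) -> dict:
    """Pick the follow-up path for a host/backup incident after SSH MCP ran."""
    if outcome.ok:
        return {"path": "post_verification", "failure_reason": None}
    # Prefer the exact SSH error; fall back to a generic marker only when
    # nothing was captured, so the escalation is never silent.
    reason = outcome.stderr.strip() or "ssh_mcp_failed: no error output captured"
    return {"path": "emergency_intervention", "failure_reason": reason}
```

For example, an outcome carrying `Permission denied (publickey,password)` (the error seen on 120/121 during live verification) would be escalated with that string as the failure reason rather than being downgraded to a plain approval card.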

---

## Chief Architect Review Record (2026-04-05)

Score: **72/100 → 88/100 after fixes**

@@ -1,8 +1,9 @@
# AWOOOI production zero-trust network policy
# Owner: CIO
-# Version: v1.5
-# Date: 2026-04-14
+# Version: v1.6
+# Date: 2026-05-01
# Changes:
# - v1.6: Added SSH egress to K3s nodes 120/121 for SSH MCP host diagnosis/repair
# - v1.5: Added keepalived VIP 192.168.0.125/32 ArgoCD NodePort 30443 egress (fixes heartbeat probe)
# - v1.4: Added ArgoCD MCP egress (argocd namespace port 80/443)
# - v1.3: Added 192.168.0.111 Ollama host (M1 Pro); removed the Ollama port on 188
@@ -168,7 +169,7 @@ spec:
        - protocol: TCP
          port: 8080

-    # Allow access to the K8s API (Executor runs kubectl commands)
+    # Allow access to the K8s API + K3s master SSH (Executor runs kubectl/host diagnosis)
    # 2026-03-23 fix: Y button execution timeout
    # Important: ClusterIP (10.43.0.1:443) routes to the actual endpoint (192.168.0.120:6443)
    # Both must be allowed, otherwise traffic is blocked by the 192.168.0.0/16 exclusion rule
@@ -180,8 +181,11 @@ spec:
          port: 443
    - to:
        - ipBlock:
-            cidr: 192.168.0.120/32 # K3s Master actual API Server endpoint + ArgoCD NodePort
+            cidr: 192.168.0.120/32 # K3s Master actual API Server endpoint + ArgoCD NodePort + SSH MCP
      ports:
        # SSH MCP — K3s master host diagnosis/repair path
        - protocol: TCP
          port: 22
        - protocol: TCP
          port: 6443
        # ArgoCD MCP NodePort (2026-04-11): ClusterIP DNAT across namespaces is unstable, so NodePort is used instead
@@ -221,6 +225,9 @@ spec:
        - ipBlock:
            cidr: 192.168.0.121/32
      ports:
        # SSH MCP — K3s worker host diagnosis/repair path
        - protocol: TCP
          port: 22
        - protocol: TCP
          port: 6443
        - protocol: TCP

@@ -62,6 +62,11 @@ rules:
    resources: ["services", "configmaps"]
    verbs: ["get", "list", "watch"]

  # 2026-05-01: backup/disk diagnostics need PVC visibility; read-only only.
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list"]

  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch"]
@@ -89,6 +94,23 @@ rules:
    resources: ["statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]

  # 2026-05-01: HostBackupFailed / VeleroBackupFailed diagnosis needs backup job status.
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]

  # 2026-05-01: Velero backup status is read-only evidence for backup_failure alerts.
  - apiGroups: ["velero.io"]
    resources:
      - backups
      - backupstoragelocations
      - backuprepositories
      - podvolumebackups
      - podvolumerestores
      - restores
      - schedules
    verbs: ["get", "list", "watch"]

  # ============================================================================
  # Write permissions - troubleshooting operations only
  # ============================================================================
@@ -104,6 +126,11 @@ rules:
    resources: ["deployments"]
    verbs: ["patch"]

  # 2026-05-01: allow the same safe rollout restart primitive on controller types.
  - apiGroups: ["apps"]
    resources: ["statefulsets", "daemonsets"]
    verbs: ["patch"]

  # Scale Deployments (scale up/down)
  - apiGroups: ["apps"]
    resources: ["deployments/scale"]
