diff --git a/.agents/skills/04-awoooi-devops-commander.md b/.agents/skills/04-awoooi-devops-commander.md
index 5a340413..3931d176 100644
--- a/.agents/skills/04-awoooi-devops-commander.md
+++ b/.agents/skills/04-awoooi-devops-commander.md
@@ -1344,6 +1344,43 @@ Security requirements identified by the Architecture Review (2026-04-11):
 3. **Group B tools require trust_score >= 0.8** (hard-coded guard)

+### Host/Backup SSH Route Invariants (2026-05-01)
+
+`backup_failure` is a host-layer category. Keep it aligned anywhere
+`host_resource` is routed, especially:
+
+- `DecisionManager`: non-`kubectl` actions must route to SSH MCP before
+  `parse_kubectl_action()`. Otherwise SSH diagnosis strings with shell syntax
+  are blocked as `forbidden_shell_metachar`.
+- `DecisionManager`: `kubectl` actions from `host_resource` or
+  `backup_failure` must be blocked and escalated to emergency intervention.
+- `AutoRepairService`: host/backup incidents must not fall back to K8s
+  rollout Playbooks.
+
+Runtime baseline for host/backup repair:
+
+```bash
+kubectl -n awoooi-prod get secret ssh-mcp-key awoooi-repair-ssh-key awoooi-repair-known-hosts
+
+kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -lc '
+  ls -l /run/secrets/ssh_mcp_key /etc/ssh-mcp/known_hosts \
+    /etc/repair-ssh/id_ed25519 /etc/repair-known-hosts/known_hosts
+'
+
+kubectl -n awoooi-prod exec deploy/awoooi-api -- sh -lc '
+  for h in 192.168.0.110 192.168.0.120 192.168.0.121; do
+    ssh -i /run/secrets/ssh_mcp_key -o BatchMode=yes \
+      -o StrictHostKeyChecking=yes -o ConnectTimeout=5 wooo@$h "echo OK:$h"
+  done
+  ssh -i /run/secrets/ssh_mcp_key -o BatchMode=yes \
+    -o StrictHostKeyChecking=yes -o ConnectTimeout=5 ollama@192.168.0.188 "echo OK:188"
+'
+```
+
+`awoooi-executor` RBAC must include read-only backup evidence:
+`jobs.batch`, `cronjobs.batch`, PVCs, and Velero backup resources. It may patch
+`statefulsets.apps` / `daemonsets.apps` only for safe rollout restart.
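The two `DecisionManager` invariants above can be read as a single routing decision. The following is a minimal Python sketch, not the actual implementation: the route labels (`ssh_mcp`, `emergency_escalation`, `kubectl_parser`) and the function name `route_action` are hypothetical, and only the two category sets are taken from this change. The real `DecisionManager` also handles token state and escalation side effects.

```python
from __future__ import annotations

# Category sets mirror the ones this change adds to decision_manager.py.
HOST_LAYER_SSH_CATEGORIES = {"infrastructure", "host_resource", "backup_failure"}
NON_K8S_HOST_CATEGORIES = {"host_resource", "backup_failure"}


def route_action(category: str | None, action: str) -> str:
    """Return a hypothetical route label for an (alert category, action) pair."""
    category = category or ""
    is_kubectl = action.startswith("kubectl")

    # Invariant 1: host-layer, non-kubectl actions go to SSH before any
    # kubectl parsing, so shell syntax is never seen by the kubectl parser.
    if category in HOST_LAYER_SSH_CATEGORIES and action and not is_kubectl:
        return "ssh_mcp"

    # Invariant 2: kubectl on host/backup alerts is blocked and escalated.
    if category in NON_K8S_HOST_CATEGORIES and is_kubectl:
        return "emergency_escalation"

    # Everything else continues to the kubectl parser and its safety guard.
    return "kubectl_parser"
```

Under these assumptions, `route_action("backup_failure", "ssh 192.168.0.110 'df -h'")` yields `"ssh_mcp"`, while a `kubectl` action on the same category yields `"emergency_escalation"`. Note that `infrastructure` is in the SSH set but not in the block set, so its `kubectl` actions still reach the parser.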
+
 ---

 ## 🚀 Sprint C: DR Backup and Recovery (2026-04-11) ✅
@@ -1503,4 +1540,3 @@ ssh-mcp-key ✅ (ssh_mcp_key + known_hosts)
 ### Runbook

 `docs/runbooks/ssh-mcp-setup.md`
-
diff --git a/.agents/skills/05-awoooi-sre-qa.md b/.agents/skills/05-awoooi-sre-qa.md
index 764fabb3..003a153c 100644
--- a/.agents/skills/05-awoooi-sre-qa.md
+++ b/.agents/skills/05-awoooi-sre-qa.md
@@ -786,6 +786,31 @@ kubectl -n awoooi-prod logs -l app=awoooi-api --tail=50 | \
 | `Permission denied (publickey)` | known_hosts is missing this host | Run SSH via Pod exec and read the error message |
 | `Load key ... Permission denied` | fsGroup not set | Pod exec `ls -la /etc/repair-ssh/` |
 | `Connection refused/timeout` | NetworkPolicy blocks port 22 | Run `ssh -v` via Pod exec to trace the connection |
+| `forbidden_shell_metachar` and the action is `ssh ... '...'` | host/backup category is not routed to SSH before the DecisionManager kubectl parser | Check whether `alert_category` is `backup_failure` and confirm `_is_host_layer_ssh_category()` covers it |
+
+### Telegram Button E2E Checks (2026-05-01)
+
+Alert card buttons are not pure UI. Every button must have a callback
+pattern in `callback_action_spec.yaml` and be routed through
+`callback_dispatcher.py` to a real handler.
+
+| Card / scenario | Required buttons | Expected handling |
+|-----------|----------|----------|
+| Approval / LLM action | approve, reject, details, ignore | Write the approval decision, execute or reject, show details, ignore the alert |
+| Auto repair unavailable / emergency | investigate, escalate/assign, rollback when applicable | Notify a human or AI Agent to intervene; must never be silent |
+| Drift TYPE-4D | view diff, adopt, rollback, ignore | View the diff, adopt the change, roll back, ignore |
+| Backup / host diagnosis | restart only when rule allows, charts/logs/details, cleanup when safe | Must not offer a K8s-only repair button as the primary host/backup action |
+| Post-verification degraded/failed | rollback proposal, investigate, details | No automatic rollback; a human or the emergency AI Agent must take over |
+
+Regression test target: button callback names emitted by `telegram_gateway.py`
+must stay in sync with `callback_action_spec.yaml`; stale buttons are a
+production bug because Telegram cards can outlive code deploys.
+
+Provider name drift is also a ghost-button bug.
+`callback_action_spec.yaml` may use friendly names (`k8s`, `ssh`), but the
+dispatcher must normalize them to the registered MCP provider names
+(`kubernetes`, `ssh_host`) before calling `get_provider()`.
+`backup_failure` cards must expose read-only diagnostics before any write
+action: host disk, backup jobs, and Velero backup status.

 ---

diff --git a/apps/api/src/services/callback_action_spec.yaml b/apps/api/src/services/callback_action_spec.yaml
index adbfa566..3d07088b 100644
--- a/apps/api/src/services/callback_action_spec.yaml
+++ b/apps/api/src/services/callback_action_spec.yaml
@@ -22,7 +22,7 @@
 #     description: <description>

 version: "1.0"
-last_updated: "2026-04-14"
+last_updated: "2026-05-01"

 actions:
   # ==========================================================================
@@ -188,6 +188,53 @@ actions:
     timeout_sec: 1
     description: "返回飛輪儀表板 URL"

+  backup_check_host_disk:
+    label: "查主機磁碟"
+    emoji: "💾"
+    risk: low
+    callback_format: info
+    category: backup_failure
+    mcp:
+      provider: ssh
+      tool: ssh_get_disk_usage
+      params:
+        host: "{labels.instance}"
+    reply_format: code
+    timeout_sec: 8
+    description: "備份失敗時檢查主機磁碟容量與 Docker 目錄大小"
+
+  backup_check_jobs:
+    label: "查備份 Job"
+    emoji: "📦"
+    risk: low
+    callback_format: info
+    category: backup_failure
+    mcp:
+      provider: k8s
+      tool: kubectl_get
+      params:
+        namespace: "awoooi-prod"
+        resource: "jobs"
+    reply_format: truncated
+    timeout_sec: 8
+    description: "列出 awoooi-prod 內的備份相關 Job 狀態"
+
+  backup_check_velero:
+    label: "查 Velero"
+    emoji: "🧰"
+    risk: low
+    callback_format: info
+    category: backup_failure
+    mcp:
+      provider: k8s
+      tool: kubectl_get
+      params:
+        namespace: "velero"
+        resource: "backups.velero.io"
+    reply_format: truncated
+    timeout_sec: 8
+    description: "列出 Velero backup CR 狀態"
+
   # ==========================================================================
   # Write buttons (side effects, 4-part nonce callback)
   # ==========================================================================
diff --git a/apps/api/src/services/callback_dispatcher.py
b/apps/api/src/services/callback_dispatcher.py
index 0c28d85a..aa9ac037 100644
--- a/apps/api/src/services/callback_dispatcher.py
+++ b/apps/api/src/services/callback_dispatcher.py
@@ -35,6 +35,18 @@ import yaml

 logger = structlog.get_logger(__name__)

+_PROVIDER_ALIASES = {
+    "k8s": "kubernetes",
+    "ssh": "ssh_host",
+}
+
+
+def _resolve_provider_name(provider_name: str) -> str:
+    """Normalize legacy callback spec provider names to registered MCP providers."""
+
+    return _PROVIDER_ALIASES.get(provider_name, provider_name)
+
+
 # =============================================================================
 # Data Types
 # =============================================================================
@@ -262,14 +274,15 @@ async def dispatch_action(
     # MCP registry dispatch
     from src.plugins.mcp.registry import get_provider

-    provider = get_provider(spec.mcp_provider)
+    provider_name = _resolve_provider_name(spec.mcp_provider)
+    provider = get_provider(provider_name)
     if not provider:
         duration = (time.perf_counter() - start) * 1000
         return DispatchResult(
             success=False, action=action_name,
             incident_id=incident_id, user_id=user_id,
-            result_text=f"{spec.emoji} {spec.label} 失敗:MCP provider '{spec.mcp_provider}' 未註冊",
-            error=f"provider_not_found: {spec.mcp_provider}",
+            result_text=f"{spec.emoji} {spec.label} 失敗:MCP provider '{provider_name}' 未註冊",
+            error=f"provider_not_found: {provider_name}",
             duration_ms=duration,
         )

diff --git a/apps/api/src/services/decision_manager.py b/apps/api/src/services/decision_manager.py
index ed1f931e..638bc4a8 100644
--- a/apps/api/src/services/decision_manager.py
+++ b/apps/api/src/services/decision_manager.py
@@ -85,6 +85,22 @@ def _should_escalate_auto_approve_rejection(reason: Any) -> bool:
     }


+_HOST_LAYER_SSH_CATEGORIES = {"infrastructure", "host_resource", "backup_failure"}
+_NON_K8S_HOST_CATEGORIES = {"host_resource", "backup_failure"}
+
+
+def _is_host_layer_ssh_category(category: str | None) -> bool:
+    """Return True when DecisionManager
+    must route non-kubectl actions to SSH."""
+
+    return (category or "") in _HOST_LAYER_SSH_CATEGORIES
+
+
+def _is_non_k8s_host_category(category: str | None) -> bool:
+    """Return True for host/backup alerts that must not auto-run kubectl."""
+
+    return (category or "") in _NON_K8S_HOST_CATEGORIES
+
+
 async def _escalate_decision_auto_repair_unavailable(
     *,
     incident: Incident,
@@ -1990,36 +2006,36 @@ class DecisionManager:
             except Exception as _rescue_err:
                 logger.debug("target_rescue_skipped", error=str(_rescue_err))

-        # ADR-073 Phase 3-2: infrastructure alerts (Docker/Host) → SSH MCP routing (2026-04-12 ogt)
-        # alert_category = "infrastructure" means a Docker alert; non-kubectl actions go to SSH
+        # ADR-073 Phase 3-2: infrastructure/host/backup alerts → SSH MCP routing.
+        # alert_category = "backup_failure" uses the same host-layer path as AutoRepairService.
         # P1-1 fix 2026-04-12: routing must happen before the kubectl safety guard; otherwise docker commands are blocked by _action_safe=False
         _alert_category = getattr(incident, "alert_category", None) or ""
-        if _alert_category in {"infrastructure", "host_resource"} and action and not action.startswith("kubectl"):
+        if _is_host_layer_ssh_category(_alert_category) and action and not action.startswith("kubectl"):
             await self._ssh_execute(incident, token, action, _target)
             return

-        # 2026-04-15 ogt: host_resource alerts (HostHighCpuLoad etc.) are not K8s workload problems
+        # 2026-04-15 ogt: host_resource/backup_failure alerts are not K8s workload problems
         # kubectl operations must not run; downgrade to manual review
         # Root cause: only infrastructure was blocked originally; host_resource also must not go through K8s
-        if _alert_category == "host_resource" and action and action.startswith("kubectl"):
+        if _is_non_k8s_host_category(_alert_category) and action and action.startswith("kubectl"):
             logger.warning(
-                "auto_execute_blocked_host_resource_no_k8s",
+                "auto_execute_blocked_host_layer_no_k8s",
                 incident_id=incident.incident_id,
                 alert_category=_alert_category,
                 action=action[:80],
-                reason="host_resource 告警不應執行 K8s kubectl 操作,降級人工審核",
+                reason="host/backup 告警不應執行 K8s kubectl 操作,降級人工審核",
             )
             token.state =
DecisionState.READY
             token.proposal_data["auto_executed"] = False
             token.proposal_data["mcp_all_failed"] = True
-            token.proposal_data["blocked_reason"] = "host_resource 告警禁止 K8s kubectl,請人工排查主機"
+            token.proposal_data["blocked_reason"] = f"{_alert_category} 告警禁止 K8s kubectl,請人工排查主機/備份"
             await self._save_token(token)
             _fire_and_forget(
                 _escalate_decision_auto_repair_unavailable(
                     incident=incident,
                     token=token,
                     failure_reason=token.proposal_data["blocked_reason"],
-                    attempted_actions="auto_execute -> host_resource_k8s_block -> emergency_intervention",
+                    attempted_actions=f"auto_execute -> {_alert_category}_k8s_block -> emergency_intervention",
                 )
             )
             _fire_and_forget(_push_decision_to_telegram(incident, token.proposal_data))
diff --git a/apps/api/tests/test_alertmanager_rule_bypass.py b/apps/api/tests/test_alertmanager_rule_bypass.py
index 4b10a46d..80cd892d 100644
--- a/apps/api/tests/test_alertmanager_rule_bypass.py
+++ b/apps/api/tests/test_alertmanager_rule_bypass.py
@@ -4,7 +4,11 @@ from src.api.v1.webhooks import (
     _should_bypass_alertmanager_llm,
     _should_use_alertmanager_rule_first,
 )
-from src.services.decision_manager import _should_escalate_auto_approve_rejection
+from src.services.decision_manager import (
+    _is_host_layer_ssh_category,
+    _is_non_k8s_host_category,
+    _should_escalate_auto_approve_rejection,
+)
 from src.services.telegram_gateway import _format_resolved_guard_stamp


@@ -84,6 +88,18 @@ def test_manual_gate_reasons_escalate_to_emergency_intervention():
     assert _should_escalate_auto_approve_rejection("critical_operation") is False


+def test_backup_failure_routes_to_decision_ssh_before_kubectl_parser():
+    assert _is_host_layer_ssh_category("backup_failure") is True
+    assert _is_host_layer_ssh_category("host_resource") is True
+    assert _is_host_layer_ssh_category("kubernetes") is False
+
+
+def test_backup_failure_blocks_k8s_auto_execute():
+    assert _is_non_k8s_host_category("backup_failure") is True
+    assert _is_non_k8s_host_category("host_resource") is
True
+    assert _is_non_k8s_host_category("infrastructure") is False
+
+
 def test_resolved_guard_stamp_without_timestamp_is_clean():
     assert _format_resolved_guard_stamp(None) == "✅ 此事件已解決"

diff --git a/apps/api/tests/test_callback_dispatcher.py b/apps/api/tests/test_callback_dispatcher.py
index 8296ac43..02b83531 100644
--- a/apps/api/tests/test_callback_dispatcher.py
+++ b/apps/api/tests/test_callback_dispatcher.py
@@ -21,6 +21,7 @@ from src.services.callback_dispatcher import (
     list_actions_for_category,
     load_action_registry,
     _lookup_context,
+    _resolve_provider_name,
     _resolve_template,
 )

@@ -68,6 +69,11 @@ class TestRegistryLoading:
             assert spec and spec.callback_format == "info", \
                 f"{qa} should use info format"

+    def test_legacy_provider_aliases_resolve_to_registered_names(self):
+        assert _resolve_provider_name("k8s") == "kubernetes"
+        assert _resolve_provider_name("ssh") == "ssh_host"
+        assert _resolve_provider_name("prometheus") == "prometheus"
+

 # =============================================================================
 # Category filtering
@@ -91,6 +97,16 @@ class TestCategoryFiltering:
         assert any(a.callback_format == "info" for a in actions), "需至少 1 個查類"
         assert any(a.callback_format == "nonce" for a in actions), "需至少 1 個寫類"

+    def test_backup_failure_has_read_only_diagnostics(self):
+        actions = list_actions_for_category("backup_failure")
+        names = {a.name for a in actions}
+        assert {
+            "backup_check_host_disk",
+            "backup_check_jobs",
+            "backup_check_velero",
+        }.issubset(names)
+        assert all(a.callback_format == "info" for a in actions)
+

 # =============================================================================
 # Template variable resolution
diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index de3b7d25..84a9ab10 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -12,19 +12,31 @@ Live e2e with `HostBackupFailed` sent to Alertmanager showed that aged backup alerts would
 ### Completed

 - `_should_use_alertmanager_rule_first()` / `_should_bypass_alertmanager_llm()` now include
  `backup_failure`, so the backup-failure YAML `SSH_DIAGNOSE` is no longer overridden by the LLM into a K8s action.
+- `DecisionManager` SSH routing now matches the `AutoRepairService` categories: non-kubectl `backup_failure` actions go to SSH MCP first instead of falling into `parse_kubectl_action()` and being blocked as `forbidden_shell_metachar`.
+- The `DecisionManager` host/backup K8s block now includes `backup_failure`: if the LLM or a Playbook produces a kubectl action, it goes straight to emergency escalation instead of wrongly applying a K8s repair to a backup alert.
 - `AutoRepairService` adds a host/backup Playbook guard: when a host/backup incident matches a K8s rollout Playbook, it is blocked as `HOST_BACKUP_K8S_PLAYBOOK` and sent to emergency intervention.
 - `AutoRepairService` post-verification rollback guard: when verification fails for a host/backup or non-K8s Playbook, it no longer synthesizes `kubectl rollout restart deployment/{target}`; it goes to emergency escalation and does not auto-resolve the incident.
 - `EmergencyEscalationService` reuses the existing `APPROVAL_ESCALATED` DB enum when writing to the AOL, so the emergency channel does not fail to leave an audit trail because a new enum has not been migrated.
 - Added `phase25_knowledge_enum_names.sql` so the `AUTO_RUNBOOK` / `ANTI_PATTERN` enum names can be written to PG, fixing the failure to persist auto runbooks into KM.
 - The `NodeExporterDown` Prometheus rule sets `auto_repair` to `true`, matching the exporter restart strategy in the YAML rule catalog.
+- `awoooi-executor` RBAC gains backup/DR diagnostic permissions: read-only PVCs, Jobs/CronJobs, and Velero resources, plus StatefulSet/DaemonSet patch for safe rollout restart.
+- NetworkPolicy adds `22/tcp` egress to the K3s master/worker so SSH MCP can reach 120/121, not only 110/188.
+- Telegram category buttons gain provider alias normalization: `k8s` → `kubernetes`, `ssh` → `ssh_host`, so a rendered button no longer fails because the dispatcher cannot find the MCP provider.
+- `backup_failure` gains three read-only diagnostic buttons: check host disk, check backup Jobs, check Velero; backup alerts no longer show only the generic approve/reject/details buttons.
 - Added `backup_failure` NO_ACTION / SSH_DIAGNOSE unit tests.

 ### Verification

 - `python3 -m py_compile apps/api/src/api/v1/webhooks.py` passes.
+- `python3 -m py_compile apps/api/src/services/decision_manager.py apps/api/src/services/callback_dispatcher.py` passes.
 - `cd apps/api && pytest tests/test_alertmanager_rule_bypass.py tests/test_telegram_ai_automation_block.py tests/test_ai_router_diagnose_fallback.py -q` → 24 passed.
 - `cd apps/api && pytest tests/test_auto_repair_service.py tests/test_alertmanager_rule_bypass.py -q` → 27 passed.
 - `cd apps/api && pytest
tests/test_auto_repair_service.py tests/test_alertmanager_rule_bypass.py -q` → 29 passed.
+- `cd apps/api && pytest tests/test_alertmanager_rule_bypass.py tests/test_callback_dispatcher.py tests/test_telegram_button_consistency.py -q` → 56 passed.
 - YAML parse of `ops/monitoring/alerts-unified.yml` and `apps/api/alert_rules.yaml` passes.
+- YAML parse of `callback_action_spec.yaml`, `07-rbac.yaml`, and `02-network-policy.yaml` passes.
+- Live Secret/mount check: `ssh-mcp-key`, `awoooi-repair-ssh-key`, and `awoooi-repair-known-hosts` exist and their mounts are readable.
+- Live SSH MCP key check: `wooo@192.168.0.110` and `ollama@192.168.0.188` are OK; `wooo@192.168.0.120/121` pass host key verification, but the remote `authorized_keys` does not yet include the public key, so they return `Permission denied (publickey,password)`.
+- The live RBAC apply was reverted by Argo to the Git state; `07-rbac.yaml` must be pushed to Gitea and synced by Argo before re-checking `can-i`.

 ## 2026-04-30 | ADR-104 Playbook versioning lineage

diff --git a/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md b/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md
index e0bcdd8f..faa883d6 100644
--- a/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md
+++ b/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md
@@ -140,6 +140,48 @@ MoWoooWorkDown → Jaccard match momo-app-down-repair → SSH ollama@192.168.0.
 ---

+## Appendix B: Backup Failure Route Parity (2026-05-01)
+
+`HostBackupFailed` and other backup alerts have `alert_category` `backup_failure`. Every host-layer automation path must treat it on par with `host_resource`:
+
+| Layer | Required behavior |
+|-------|----------|
+| Alertmanager rule-first | YAML `SSH_DIAGNOSE` / `NO_ACTION` is not subject to LLM override |
+| AutoRepairService | `backup_failure` is treated as host-layer; K8s Playbook fallback is refused |
+| DecisionManager | Non-`kubectl` actions are routed to SSH MCP before the kubectl parser |
+| DecisionManager K8s guard | A `kubectl` action on `backup_failure` is downgraded to emergency escalation |
+| Telegram buttons | `backup_failure` shows read-only diagnostic buttons: host disk, backup Jobs, Velero status |
+
+2026-05-01 root cause: the DecisionManager SSH route covered only `infrastructure` / `host_resource` and missed `backup_failure`, so read-only diagnostic actions such as `ssh 192.168.0.110 '...;...'` fell into `parse_kubectl_action()` and were blocked as `forbidden_shell_metachar`.
+
+The button audit on the same day also found that friendly provider names on category buttons drift: `callback_action_spec.yaml` uses `k8s` / `ssh`, but the MCP registry names are `kubernetes` / `ssh_host`. The dispatcher must normalize provider aliases; otherwise card buttons render but fail at execution with `provider_not_found`.
+
+### Runtime Permission Baseline
+
+- K8s Secret:
+  - `awoooi-repair-ssh-key` mounted at `/etc/repair-ssh/`
+  - `awoooi-repair-known-hosts` mounted at `/etc/repair-known-hosts/`
+  - `ssh-mcp-key` mounted at `/run/secrets/ssh_mcp_key` and `/etc/ssh-mcp/known_hosts`
+- Remote `authorized_keys`:
+  - `wooo@192.168.0.110`
+  - `wooo@192.168.0.120`
+  - `wooo@192.168.0.121`
+  - `ollama@192.168.0.188`
+- NetworkPolicy egress:
+  - `192.168.0.110:22`
+  - `192.168.0.120:22`
+  - `192.168.0.121:22`
+  - `192.168.0.188:22`
+- `awoooi-executor` RBAC:
+  - read `jobs.batch`, `cronjobs.batch`
+  - read `persistentvolumeclaims`
+  - read Velero `backups`, `backupstoragelocations`, `backuprepositories`, `podvolumebackups`, `podvolumerestores`, `restores`, `schedules`
+  - patch `statefulsets.apps` / `daemonsets.apps` only for safe rollout restart
+
+If SSH MCP fails, the incident must not silently become a manual approval card; it must raise the emergency intervention path with the exact SSH failure reason when
available. + +--- + ## 首席架構師 Review 記錄 (2026-04-05) 評分:**72/100 → 修正後 88/100** diff --git a/k8s/awoooi-prod/02-network-policy.yaml b/k8s/awoooi-prod/02-network-policy.yaml index 404e88a6..b09c4813 100644 --- a/k8s/awoooi-prod/02-network-policy.yaml +++ b/k8s/awoooi-prod/02-network-policy.yaml @@ -1,8 +1,9 @@ # AWOOOI 正式環境零信任網路策略 # 負責人: CIO -# 版本: v1.5 -# 日期: 2026-04-14 +# 版本: v1.6 +# 日期: 2026-05-01 # 變更: +# - v1.6: 新增 K3s node 120/121 SSH egress,供 SSH MCP 主機診斷/修復 # - v1.5: 新增 keepalived VIP 192.168.0.125/32 ArgoCD NodePort 30443 egress(修復 heartbeat probe) # - v1.4: 新增 ArgoCD MCP egress(argocd namespace port 80/443) # - v1.3: 新增 192.168.0.111 Ollama 主機 (M1 Pro),移除 188 的 Ollama port @@ -168,7 +169,7 @@ spec: - protocol: TCP port: 8080 - # 允許訪問 K8s API (Executor 執行 kubectl 指令) + # 允許訪問 K8s API + K3s master SSH (Executor 執行 kubectl/host diagnosis) # 2026-03-23 修復: Y 按鈕執行超時 # 重要: ClusterIP (10.43.0.1:443) 會路由到實際端點 (192.168.0.120:6443) # 必須同時允許兩者,否則流量會被 192.168.0.0/16 排除規則阻擋 @@ -180,8 +181,11 @@ spec: port: 443 - to: - ipBlock: - cidr: 192.168.0.120/32 # K3s Master 實際 API Server 端點 + ArgoCD NodePort + cidr: 192.168.0.120/32 # K3s Master 實際 API Server 端點 + ArgoCD NodePort + SSH MCP ports: + # SSH MCP — K3s master host diagnosis/repair path + - protocol: TCP + port: 22 - protocol: TCP port: 6443 # ArgoCD MCP NodePort (2026-04-11): ClusterIP DNAT 跨 namespace 不穩定,改用 NodePort @@ -221,6 +225,9 @@ spec: - ipBlock: cidr: 192.168.0.121/32 ports: + # SSH MCP — K3s worker host diagnosis/repair path + - protocol: TCP + port: 22 - protocol: TCP port: 6443 - protocol: TCP diff --git a/k8s/awoooi-prod/07-rbac.yaml b/k8s/awoooi-prod/07-rbac.yaml index 09b9fe60..a9ee6120 100644 --- a/k8s/awoooi-prod/07-rbac.yaml +++ b/k8s/awoooi-prod/07-rbac.yaml @@ -62,6 +62,11 @@ rules: resources: ["services", "configmaps"] verbs: ["get", "list", "watch"] + # 2026-05-01: backup/disk diagnostics need PVC visibility; read-only only. 
+  - apiGroups: [""]
+    resources: ["persistentvolumeclaims"]
+    verbs: ["get", "list"]
+
   - apiGroups: ["networking.k8s.io"]
     resources: ["ingresses"]
     verbs: ["get", "list", "watch"]
@@ -89,6 +94,23 @@ rules:
     resources: ["statefulsets", "daemonsets"]
     verbs: ["get", "list", "watch"]

+  # 2026-05-01: HostBackupFailed / VeleroBackupFailed diagnosis needs backup job status.
+  - apiGroups: ["batch"]
+    resources: ["jobs", "cronjobs"]
+    verbs: ["get", "list", "watch"]
+
+  # 2026-05-01: Velero backup status is read-only evidence for backup_failure alerts.
+  - apiGroups: ["velero.io"]
+    resources:
+      - backups
+      - backupstoragelocations
+      - backuprepositories
+      - podvolumebackups
+      - podvolumerestores
+      - restores
+      - schedules
+    verbs: ["get", "list", "watch"]
+
   # ============================================================================
   # Write permissions - troubleshooting operations only
   # ============================================================================
@@ -104,6 +126,11 @@ rules:
     resources: ["deployments"]
     verbs: ["patch"]

+  # 2026-05-01: allow the same safe rollout restart primitive on controller types.
+  - apiGroups: ["apps"]
+    resources: ["statefulsets", "daemonsets"]
+    verbs: ["patch"]
+
   # Scale Deployments (scale up/down)
   - apiGroups: ["apps"]
     resources: ["deployments/scale"]
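
The RBAC additions in `07-rbac.yaml` are meant to keep backup evidence read-only while granting `patch` only on controller types. One way to guard that intent in a test is sketched below in Python. This is a sketch under assumptions: the inline `RULES` list copies a subset of the added rules as already-parsed dictionaries, and the helper name `write_verbs_for` is hypothetical, not part of the repository.

```python
READ_VERBS = {"get", "list", "watch"}

# Subset of the rules added in this change, as they would look after YAML parsing.
RULES = [
    {"apiGroups": [""], "resources": ["persistentvolumeclaims"], "verbs": ["get", "list"]},
    {"apiGroups": ["batch"], "resources": ["jobs", "cronjobs"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["velero.io"], "resources": ["backups", "restores", "schedules"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["statefulsets", "daemonsets"], "verbs": ["patch"]},
]


def write_verbs_for(rules, api_group, resource):
    """Collect every non-read verb granted on one resource across all rules."""
    verbs = set()
    for rule in rules:
        if api_group in rule["apiGroups"] and resource in rule["resources"]:
            verbs |= set(rule["verbs"]) - READ_VERBS
    return verbs
```

With the rules above, `write_verbs_for(RULES, "velero.io", "backups")` is empty and `write_verbs_for(RULES, "apps", "statefulsets")` contains only `patch`. A real test would load the full manifest instead of the inline list.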