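docs: complete governance architecture ADR-010/011/012 + CLAUDE.md iron-rule update

2026-03-23 major incident remediation and governance:

1. ADR-010: Centralized secrets management (Bitwarden + Sealed Secrets)
2. ADR-011: NetworkPolicy change governance (detection + alerting + human decision)
3. ADR-012: Dangerous operations governance (tier classification + CI/CD interception + audit)
4. UX-001: Alert fatigue solution (time decay + smart grouping)

CLAUDE.md updates:
- Added top-priority iron rules (ClawBot banned, OpenClaw is the core, dangerous APIs banned)
- Added a table of Memory files that must be read before starting a task

Incident lessons:
- The Telegram token was invalidated by logOut three times in a row
- AWOOOI API code calling logOut caused the disaster
- Telegram is now disabled in AWOOOI API; OpenClaw is the sole gateway

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>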
36
CLAUDE.md
@@ -21,14 +21,36 @@
 | Testing/Diagnostics | `.agents/skills/05-awoooi-sre-qa.md` |
 | Git/Dependencies | `.agents/skills/06-awoooi-monorepo-master.md` |

-## Core Iron Rules (Short Version)
+## Core Iron Rules (Required Reading)

-1. **Zero hard-coded i18n strings** - All UI text must go through next-intl
-2. **No SQLite** - PostgreSQL only
-3. **No latest tag** - K8s images must pin an exact version
-4. **CORS allowlist** - `*` is forbidden
-5. **Dry-run first** - K8s changes must be dry-run first
-6. **Memory sync** - Update the Memory MD files after completing a task
+### 🔴 Top-Priority Iron Rules
+
+1. **ClawBot is banned** - The whole project uses OpenClaw; any occurrence of ClawBot must be renamed immediately
+2. **OpenClaw is the product core** - Retiring or replacing OpenClaw (192.168.0.188) is forbidden
+3. **Dangerous Telegram APIs are banned** - `logOut` and `close` must never enter the codebase
+4. **Stop first, then swap** - Before updating a Token/Secret, stop every service that uses it
+5. **Single Telegram Gateway** - Only OpenClaw may use Telegram; AWOOOI API may not
+
+### Standard Iron Rules
+
+6. **Zero hard-coded i18n strings** - All UI text must go through next-intl
+7. **No SQLite** - PostgreSQL only
+8. **No latest tag** - K8s images must pin an exact version
+9. **CORS allowlist** - `*` is forbidden
+10. **Dry-run first** - K8s changes must be dry-run first
+11. **Memory sync** - Update the Memory MD files after completing a task
+
+### Required Reading Before Starting a Task
+
+**When a task touches any of the topics below, read the corresponding Memory first:**
+
+| Topic | Required Memory |
+|------|-------------|
+| Telegram | `feedback_telegram_token_disaster.md`, `reference_telegram_token.md` |
+| OpenClaw | `feedback_architecture_openclaw_core.md`, `feedback_openclaw_naming.md` |
+| NetworkPolicy | `docs/adr/ADR-011-networkpolicy-governance.md` |
+| Dangerous operations | `docs/adr/ADR-012-dangerous-operations-governance.md` |
+| Secrets | `docs/adr/ADR-010-secrets-management.md` |

 ## Memory System

393
docs/adr/ADR-010-secrets-management.md
Normal file
@@ -0,0 +1,393 @@
# ADR-010: Centralized Secrets Management Architecture

**Status**: Approved
**Date**: 2026-03-23
**Decision maker**: Commander

## Background

### Current Pain Points

```
Problem: passwords are scattered everywhere; change one place → everything breaks

Where the Harbor password lives:
├── GitHub Secrets (wooo-aiops)
├── GitHub Secrets (awoooi)     ← added only today
├── K8s Secrets (awoooi-prod)
├── Self-hosted Runner environment variables
├── Developers' local ~/.docker/config.json
└── Docs / Memory MD files

Consequences:
- Changing a password means syncing 5+ places
- Places are often missed → CI/CD breaks
- No audit log
- No way to trace who changed what, and when
```

### Secrets That Need to Be Managed

| Category | Secret name | Used by |
|------|------------|----------|
| **Registry** | HARBOR_USER / HARBOR_PASSWORD | CI/CD, K8s |
| **Database** | DATABASE_URL (PostgreSQL) | API, Worker |
| **Cache** | REDIS_URL | API, Worker, OpenClaw |
| **AI** | ANTHROPIC_API_KEY | API (Agent Teams) |
| **AI** | GEMINI_API_KEY | API (Fallback) |
| **AI** | OLLAMA_URL | API (Local LLM) |
| **Notification** | OPENCLAW_TG_BOT_TOKEN | OpenClaw |
| **Notification** | OPENCLAW_TG_CHAT_ID | OpenClaw |
| **K8s** | KUBECONFIG | CI/CD |
| **Webhook** | WEBHOOK_HMAC_SECRET | API |

---

## Decision

Adopt a **hybrid secrets management architecture**:

1. **Bitwarden (Self-hosted)** - Single Source of Truth
2. **Sealed Secrets** - Encrypts secrets inside K8s
3. **GitHub Secrets** - CI/CD only (synced from Bitwarden)

---

## Architecture Design

```
┌─────────────────────────────────────────────────────────────────────┐
│               AWOOOI Secrets Management Architecture                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  Bitwarden (Self-hosted on 192.168.0.188)                     │  │
│  │  ════════════════════════════════════════                     │  │
│  │  🔐 Single Source of Truth                                    │  │
│  │                                                               │  │
│  │  Organizations:                                               │  │
│  │  └── AWOOOI/                                                  │  │
│  │      ├── Infrastructure/                                      │  │
│  │      │   ├── Harbor (admin / ******)                          │  │
│  │      │   ├── PostgreSQL (awoooi / ******)                     │  │
│  │      │   └── Redis (password / ******)                        │  │
│  │      ├── AI-Services/                                         │  │
│  │      │   ├── Anthropic API Key                                │  │
│  │      │   ├── Gemini API Key                                   │  │
│  │      │   └── Ollama URL                                       │  │
│  │      └── Notifications/                                       │  │
│  │          ├── Telegram Bot Token                               │  │
│  │          └── Telegram Chat ID                                 │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│           ┌────────────────────┼────────────────────┐               │
│           │                    │                    │               │
│           ▼                    ▼                    ▼               │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│  │ GitHub Secrets  │  │ Sealed Secrets  │  │ Local Dev       │      │
│  │ (CI/CD)         │  │ (K8s)           │  │ (.env.local)    │      │
│  │                 │  │                 │  │                 │      │
│  │ HARBOR_USER     │  │ Encrypted,      │  │ bw get password │      │
│  │ HARBOR_PASSWORD │  │ stored in Git   │  │ "Harbor"        │      │
│  │ OP_TOKEN (sync) │  │ Auto-decrypted  │  │                 │      │
│  └────────┬────────┘  └────────┬────────┘  └─────────────────┘      │
│           │                    │                                    │
│           ▼                    ▼                                    │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  K3s Cluster (192.168.0.120/121)                              │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │  awoooi-prod namespace                                  │  │  │
│  │  │  ├── awoooi-secrets (from Sealed Secrets)               │  │  │
│  │  │  ├── awoooi-config (ConfigMap, non-sensitive)           │  │  │
│  │  │  └── Pods (consume secrets as env vars)                 │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Implementation Plan

### Phase 1: Bitwarden Self-hosted (Day 1-2)

```bash
# Deploy Vaultwarden (a lightweight Bitwarden-compatible server) on 192.168.0.188
docker run -d --name vaultwarden \
  -v /data/vaultwarden:/data \
  -e ADMIN_TOKEN='your-admin-token' \
  -e DOMAIN='https://vault.wooo.work' \
  -p 8088:80 \
  vaultwarden/server:latest

# Nginx configuration (goes in the Nginx site config, not the shell)
server {
    listen 443 ssl;
    server_name vault.wooo.work;

    location / {
        proxy_pass http://localhost:8088;
    }
}
```

**Initialization steps:**
1. Visit https://vault.wooo.work
2. Create the Organization: AWOOOI
3. Create the Collections: Infrastructure, AI-Services, Notifications
4. Import the existing passwords

### Phase 2: Sealed Secrets (Day 2-3)

```bash
# Step 1: Install the Sealed Secrets Controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.26.0/controller.yaml

# Step 2: Fetch the public key
kubeseal --fetch-cert \
  --controller-name=sealed-secrets-controller \
  --controller-namespace=kube-system \
  > /Users/ogt/awoooi/k8s/sealed-secrets-cert.pem

# Step 3: Encrypt the existing Secrets (run from the repository root)
cat <<EOF > /tmp/secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: awoooi-secrets
  namespace: awoooi-prod
type: Opaque
stringData:
  DATABASE_URL: "postgresql+asyncpg://awoooi:xxx@192.168.0.188:5432/awoooi_prod"
  REDIS_URL: "redis://:xxx@192.168.0.188:6380/10"
  HARBOR_USER: "admin"
  HARBOR_PASSWORD: "xxx"
  ANTHROPIC_API_KEY: "sk-ant-xxx"
EOF

# Use the certificate written in Step 2 (k8s/sealed-secrets-cert.pem in the repo)
kubeseal --cert=k8s/sealed-secrets-cert.pem \
  --format=yaml \
  < /tmp/secrets.yaml \
  > k8s/awoooi-prod/03-sealed-secrets.yaml

# Remove the plaintext copy
rm /tmp/secrets.yaml

# Step 4: Commit the encrypted version
git add k8s/awoooi-prod/03-sealed-secrets.yaml
git commit -m "feat(security): migrate to Sealed Secrets"
```

### Phase 3: GitHub Actions Integration (Day 3-4)

```yaml
# .github/workflows/sync-secrets.yml
name: Sync Secrets from Bitwarden

on:
  workflow_dispatch:
  schedule:
    - cron: '0 0 * * 0'  # sync every Sunday

jobs:
  sync:
    runs-on: [self-hosted, harbor, k8s]
    steps:
      - name: Install Bitwarden CLI
        run: |
          npm install -g @bitwarden/cli
          bw config server https://vault.wooo.work

      - name: Login & Sync
        env:
          # BW_EMAIL is an assumed secret name; a runner has no TTY, so the
          # login must be fully non-interactive.
          BW_EMAIL: ${{ secrets.BW_EMAIL }}
          BW_PASSWORD: ${{ secrets.BW_MASTER_PASSWORD }}
        run: |
          export BW_SESSION=$(bw login "$BW_EMAIL" --passwordenv BW_PASSWORD --raw)

          # Fetch the Harbor credentials
          HARBOR_USER=$(bw get username "Harbor")
          HARBOR_PASSWORD=$(bw get password "Harbor")

          # Update GitHub Secrets
          gh secret set HARBOR_USER --body "$HARBOR_USER"
          gh secret set HARBOR_PASSWORD --body "$HARBOR_PASSWORD"

          echo "✅ Secrets synced from Bitwarden"
```

### Phase 4: Developer Local Environment (Day 4)

```bash
# Install the Bitwarden CLI
brew install bitwarden-cli

# Log in
bw config server https://vault.wooo.work
bw login

# Generate .env.local
cat > generate-env.sh << 'EOF'
#!/bin/bash
export BW_SESSION=$(bw unlock --raw)

echo "DATABASE_URL=$(bw get password 'PostgreSQL')" > .env.local
echo "REDIS_URL=$(bw get password 'Redis')" >> .env.local
echo "ANTHROPIC_API_KEY=$(bw get password 'Anthropic')" >> .env.local

echo "✅ .env.local generated from Bitwarden"
EOF

chmod +x generate-env.sh
```

---

## Password Rotation SOP

### Password Change Procedure

```
┌─────────────────────────────────────────────────────────────┐
│  Password Change SOP (Single Source of Truth)               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Step 1: Update the password in Bitwarden                   │
│          └── vault.wooo.work → Edit → Save                  │
│                                                             │
│  Step 2: Trigger the sync workflow                          │
│          └── gh workflow run sync-secrets.yml               │
│                                                             │
│  Step 3: Re-seal the K8s Secrets                            │
│          └── ./scripts/reseal-secrets.sh                    │
│                                                             │
│  Step 4: Deploy the update                                  │
│          └── kubectl apply -f k8s/awoooi-prod/              │
│                                                             │
│  Step 5: Verify                                             │
│          └── curl https://awoooi.wooo.work/api/v1/health    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Automated Rotation Script

```bash
#!/bin/bash
# scripts/rotate-secret.sh
# Abort on the first failing step so the success message is never printed falsely.
set -euo pipefail

SECRET_NAME=${1:-}
NEW_VALUE=${2:-}

if [ -z "$SECRET_NAME" ] || [ -z "$NEW_VALUE" ]; then
  echo "Usage: ./rotate-secret.sh <secret-name> <new-value>"
  exit 1
fi

echo "🔄 Rotating secret: $SECRET_NAME"

# Step 1: Update Bitwarden
# `bw edit item` takes an item id plus encoded JSON, so fetch the item,
# patch the password with jq (required), re-encode it, and write it back.
export BW_SESSION=$(bw unlock --raw)
ITEM_ID=$(bw get item "$SECRET_NAME" | jq -r '.id')
bw get item "$ITEM_ID" \
  | jq --arg v "$NEW_VALUE" '.login.password = $v' \
  | bw encode \
  | bw edit item "$ITEM_ID"

# Step 2: Sync to GitHub
gh workflow run sync-secrets.yml

# Step 3: Re-seal K8s secrets
./scripts/reseal-secrets.sh

# Step 4: Apply to cluster
kubectl apply -f k8s/awoooi-prod/03-sealed-secrets.yaml

# Step 5: Rolling restart
kubectl rollout restart deployment -n awoooi-prod

echo "✅ Secret rotated successfully"
```

---

## Emergency Revocation Procedure

```bash
# If a password leaks, run the following immediately:

# 1. Generate a new password with Bitwarden
bw generate -ulns --length 32

# 2. Update every affected service
./scripts/rotate-secret.sh "Harbor" "$(bw generate -ulns --length 32)"

# 3. Revoke the old token (where applicable)
# Harbor: Admin UI → Users → Reset Password
# GitHub: Settings → Secrets → Delete old, Add new

# 4. Audit check
bw list items --search "Harbor" --pretty
```

---

## Monitoring and Alerting

```yaml
# Add to the Prometheus alerting rules
groups:
  - name: secrets-monitoring
    rules:
      - alert: SecretRotationOverdue
        expr: time() - secret_last_rotation_timestamp > 86400 * 90
        for: 1d
        labels:
          severity: warning
        annotations:
          summary: "Secret {{ $labels.secret_name }} has not been rotated in over 90 days"

      - alert: SecretAccessAnomaly
        expr: rate(secret_access_total[1h]) > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Abnormal secret access rate"
```
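
The `SecretRotationOverdue` expression reduces to a timestamp comparison against a 90-day threshold. A minimal Python sketch of the same check, which could run as a pre-flight step before a release; `MAX_AGE`, `is_rotation_overdue`, and `next_rotation_due` are illustrative names, not part of this ADR:

```python
from datetime import datetime, timedelta, timezone

# Assumption: 90 days, mirroring the `86400 * 90` threshold in the alert rule.
MAX_AGE = timedelta(days=90)


def is_rotation_overdue(last_rotated: datetime, now: datetime) -> bool:
    """Return True once the secret is strictly older than MAX_AGE."""
    return now - last_rotated > MAX_AGE


def next_rotation_due(last_rotated: datetime) -> datetime:
    """Return the moment at which the secret reaches MAX_AGE."""
    return last_rotated + MAX_AGE


if __name__ == "__main__":
    harbor = datetime(2026, 3, 23, tzinfo=timezone.utc)
    print(next_rotation_due(harbor).date())
```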

---

## Cost Analysis

| Item | Solution | Cost |
|------|------|------|
| Secrets storage | Vaultwarden (Self-hosted) | $0 |
| K8s encryption | Sealed Secrets | $0 |
| Sync tool | Bitwarden CLI | $0 |
| **Total** | | **$0/year** |

---

## Acceptance Criteria

- [ ] Vaultwarden deployed on 192.168.0.188
- [ ] All Secrets imported into Bitwarden
- [ ] Sealed Secrets Controller running
- [ ] K8s Secrets encrypted and stored in Git
- [ ] GitHub Actions can sync from Bitwarden
- [ ] Developers can generate .env.local with the CLI
- [ ] Rotation SOP documented
- [ ] Monitoring alerts configured

---

## Appendix: Secrets Inventory (Confidential)

> ⚠️ This list is for internal reference only; the actual passwords are stored in Bitwarden

| Name | Type | Last updated | Next rotation |
|------|------|----------|---------------|
| Harbor | Registry | 2026-03-23 | 2026-06-23 |
| PostgreSQL | Database | TBD | TBD |
| Redis | Cache | TBD | TBD |
| Anthropic | API Key | TBD | TBD |
| Gemini | API Key | TBD | TBD |
| Telegram Bot | Token | 2026-03-22 | 2026-06-22 |

288
docs/adr/ADR-011-networkpolicy-governance.md
Normal file
@@ -0,0 +1,288 @@
# ADR-011: NetworkPolicy Change Governance Architecture

**Status**: Proposed
**Date**: 2026-03-23
**Decision maker**: Commander
**Trigger**: Repeated NetworkPolicy changes causing production incidents

## Problem Statement

```
Incident timeline:
├── 2026-03-20: Worker cannot reach Redis → Egress found to be blocked
├── 2026-03-22: OTEL reporting fails → Ports 24317/24318 found not opened
├── 2026-03-23: Y button execution times out → K8s API 192.168.0.120:6443 not opened
└── Each time, the NetworkPolicy problem surfaced only through after-the-fact diagnosis
```

**Root causes**:
1. Anyone with kubectl access can modify a NetworkPolicy directly
2. No alert or audit record follows a modification
3. The Git version and the cluster version are frequently out of sync
4. No dry-run / diff mechanism guards against mistakes

---

## Decision: Three-Layer Protection Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│             NetworkPolicy Change Governance Architecture             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Layer 1: GitOps (Single Source of Truth)                            │
│  ════════════════════════════════════════                            │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐             │
│  │ Developer   │ ──▶ │ PR review   │ ──▶ │ ArgoCD      │             │
│  │ edits YAML  │     │ ≥ 2 people  │     │ auto-sync   │             │
│  └─────────────┘     └─────────────┘     └─────────────┘             │
│         │                   │                   │                    │
│         ▼                   ▼                   ▼                    │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐             │
│  │ Git         │     │ GitHub      │     │ K8s         │             │
│  │ Commit      │     │ Actions     │     │ Cluster     │             │
│  └─────────────┘     └─────────────┘     └─────────────┘             │
│                                                                      │
│  Layer 2: Policy Validation (error prevention)                       │
│  ═════════════════════════════════════════════                       │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Kyverno / OPA Gatekeeper                                      │  │
│  │  ────────────────────────                                      │  │
│  │  Rule 1: NetworkPolicy must carry a system label               │  │
│  │  Rule 2: No podSelector: {} (empty) overriding existing rules  │  │
│  │  Rule 3: Egress must name explicit ports (no allow-all)        │  │
│  │  Rule 4: Production NetworkPolicy must carry annotations       │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  Layer 3: Change Alerting (monitoring)                               │
│  ═════════════════════════════════════                               │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  Falco / kube-state-metrics + Alertmanager                     │  │
│  │  ─────────────────────────────────────────                     │  │
│  │  Alert: NetworkPolicy created/modified/deleted                 │  │
│  │  Alert: direct kubectl apply (bypassing GitOps)                │  │
│  │  Alert: non-allowlisted user modifies a critical resource      │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

---

## Implementation Plan

### Phase 1: Immediate Measures (Day 1) ✅

```yaml
# 1. Add change metadata to every NetworkPolicy
metadata:
  annotations:
    awoooi.io/last-modified: "2026-03-23T15:30:00Z"
    awoooi.io/modified-by: "ogt"
    awoooi.io/change-reason: "Fix K8s API connectivity - Y button execution timeout"
    awoooi.io/ticket: "AWOOOI-123"
```

```bash
# 2. Set up auditing of kubectl changes (enable immediately)
# Add to the K3s server arguments (K3s forwards API server flags via --kube-apiserver-arg):
--kube-apiserver-arg=audit-policy-file=/etc/rancher/k3s/audit-policy.yaml
--kube-apiserver-arg=audit-log-path=/var/log/k3s/audit.log
```

### Phase 2: GitOps (Week 1)

```yaml
# ArgoCD Application (lightweight setup)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: awoooi-networkpolicy
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/awoooi.git
    path: k8s/awoooi-prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: awoooi-prod
  syncPolicy:
    automated:
      prune: false     # never delete automatically
      selfHeal: true   # manual changes are reverted automatically
    syncOptions:
      - CreateNamespace=false
```

### Phase 3: Policy Validation (Week 2)

```yaml
# Kyverno Policy: require annotations on every NetworkPolicy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-networkpolicy-annotations
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-change-reason
      match:
        resources:
          kinds:
            - NetworkPolicy
          namespaces:
            - awoooi-prod
      validate:
        message: "NetworkPolicy must carry the awoooi.io/change-reason and awoooi.io/modified-by annotations"
        pattern:
          metadata:
            annotations:
              awoooi.io/change-reason: "?*"
              awoooi.io/modified-by: "?*"
```
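
The `"?*"` pattern requires each annotation to be present with at least one character. The same rule can be checked locally before a PR is opened. The sketch below is an illustration only: the function name `missing_annotations` and the dict-shaped manifest are assumptions, not part of this ADR.

```python
from typing import Any, Dict, List

# Annotation keys taken from the Kyverno pattern in this phase.
REQUIRED_ANNOTATIONS = (
    "awoooi.io/change-reason",
    "awoooi.io/modified-by",
)


def missing_annotations(manifest: Dict[str, Any]) -> List[str]:
    """Return the required annotation keys that are absent or empty.

    `manifest` is assumed to be a NetworkPolicy already parsed into a dict
    (for example by a YAML loader). Hypothetical helper for illustration.
    """
    annotations = (manifest.get("metadata") or {}).get("annotations") or {}
    return [
        key
        for key in REQUIRED_ANNOTATIONS
        if not str(annotations.get(key) or "")
    ]
```

A pre-commit hook could call this for each changed policy file and refuse the commit when the returned list is non-empty, so that the admission controller is not the first place the omission shows up.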

### Phase 4: Change Alerting (Week 2)

```yaml
# Falco Rule: alert on NetworkPolicy changes
- rule: NetworkPolicy Modified
  desc: Detect a NetworkPolicy being created, modified, or deleted
  condition: >
    ka.target.resource = "networkpolicies" and
    ka.verb in (create, update, patch, delete)
  output: >
    🚨 NetworkPolicy change alert
    [%ka.verb] %ka.target.namespace/%ka.target.name
    by %ka.user.name from %ka.sourceips
  priority: WARNING
  source: k8s_audit
  tags: [network, security, k8s]
```

```yaml
# Prometheus Alert: direct kubectl modification (bypassing GitOps)
- alert: DirectKubectlNetworkPolicyChange
  expr: |
    increase(apiserver_audit_event_total{
      verb=~"create|update|patch|delete",
      resource="networkpolicies",
      user_agent!~"argocd.*"
    }[5m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: "Someone modified a NetworkPolicy directly with kubectl (bypassing GitOps)"
    description: "{{ $labels.user }} modified a NetworkPolicy in {{ $labels.namespace }}"
```
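
The intent of the alert is to flag NetworkPolicy writes that did not come from ArgoCD. A minimal sketch of that classification over a single Kubernetes audit event, assuming the event has been parsed into a dict with the standard `verb`, `userAgent`, and `objectRef.resource` fields; the function name `bypassed_gitops` is illustrative. Prometheus regex matchers are fully anchored, so `re.fullmatch` is used to mirror `user_agent!~"argocd.*"`.

```python
import re
from typing import Any, Dict

WRITE_VERBS = {"create", "update", "patch", "delete"}
# Same expression as the alert's user_agent matcher.
GITOPS_AGENT = re.compile(r"argocd.*")


def bypassed_gitops(event: Dict[str, Any]) -> bool:
    """Return True for a NetworkPolicy write whose user agent is not ArgoCD."""
    resource = (event.get("objectRef") or {}).get("resource")
    if resource != "networkpolicies" or event.get("verb") not in WRITE_VERBS:
        return False
    user_agent = event.get("userAgent") or ""
    return GITOPS_AGENT.fullmatch(user_agent) is None
```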

---

## PR Review Process (CODEOWNERS)

```
# .github/CODEOWNERS
# NetworkPolicy changes need dual review by CIO + SRE
k8s/*/02-network-policy.yaml    @awoooi/cio @awoooi/sre-team
k8s/*/networkpolicy*.yaml       @awoooi/cio @awoooi/sre-team
```

---

## Rollback Mechanism (added at the Commander's request)

### Baseline Management

```bash
k8s/awoooi-prod/
├── 02-network-policy.yaml              # current version
├── .baselines/
│   ├── LATEST_KNOWN_GOOD.yaml          # last version verified good
│   └── 02-network-policy.{date}.yaml   # historical versions
```

### One-Command Rollback

```bash
# Emergency rollback (< 30 seconds)
./scripts/rollback-networkpolicy.sh awoooi-prod

# Roll back to a specific date
./scripts/rollback-networkpolicy.sh awoooi-prod 2026-03-22
```

### Drift Detection + Human Decision (no blind auto-rollback)

- Every 5 minutes, compare the cluster state against the Git baseline
- Drift detected → alert immediately and show the diff
- **Human judgment**: legitimate change → sync it to Git; erroneous change → roll back
- The only auto-rollback condition: the change caused health checks to fail + the change was made within the last 10 minutes + the baseline was previously healthy
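
The single auto-rollback condition is a conjunction of three checks; every other case goes to a human. A sketch of that decision, where `DriftEvent`, `decide`, and the two return labels are hypothetical names chosen for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: "within 10 minutes" is measured from the time of the change.
AUTO_ROLLBACK_WINDOW = timedelta(minutes=10)


@dataclass(frozen=True)
class DriftEvent:
    changed_at: datetime          # when the cluster drifted from the baseline
    health_check_failed: bool     # did health checks fail after the change?
    baseline_was_healthy: bool    # was the baseline healthy before the change?


def decide(event: DriftEvent, now: datetime) -> str:
    """Return "auto_rollback" only when all three conditions hold."""
    age = now - event.changed_at
    recent = timedelta(0) <= age <= AUTO_ROLLBACK_WINDOW
    if event.health_check_failed and recent and event.baseline_was_healthy:
        return "auto_rollback"
    return "alert_and_wait_for_human"
```

Keeping the human path as the default means that any missing or ambiguous signal falls back to alerting rather than to an automatic change.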

### Rollback vs. Fix Decision Tree

```
Problem detected
    │
    ├─ Production affected?
    │   ├─ Yes → roll back immediately (< 30 seconds)
    │   └─ No  → assess, then decide
    │
    └─ After rollback
        ├─ Record the RCA
        ├─ Fix the root cause
        └─ Update the baseline
```

---

## Acceptance Criteria

| Item | Status |
|------|------|
| NetworkPolicy can be modified only through a PR | ⬜ |
| PRs require 2 reviewers | ⬜ |
| A direct kubectl apply triggers an alert | ⬜ |
| Alerts are delivered to Telegram + OpenClaw | ⬜ |
| Audit logs are retained for 90 days | ⬜ |
| Kyverno enforces the annotation rule | ⬜ |
| **Baseline snapshot saved daily** | ⬜ |
| **One-command rollback script available** | ⬜ |
| **Drift detection runs every 5 minutes** | ⬜ |
| **Rollback takes < 30 seconds** | ⬜ |

---

## Cost

| Component | Solution | Cost |
|------|------|------|
| GitOps | ArgoCD (already in place) | $0 |
| Policy | Kyverno (open source) | $0 |
| Alerting | Falco + Alertmanager (open source) | $0 |
| **Total** | | **$0** |

---

## Appendix: Root Cause Analysis of Today's Incident

```
2026-03-23 Y button execution timeout

Root cause:
  NetworkPolicy allow-required-egress omitted the actual K8s API endpoint

Problem chain:
  1. ClusterIP 10.43.0.1:443 was allowed ✓
  2. But the traffic is actually routed to 192.168.0.120:6443 ✗
  3. 192.168.0.120 falls inside the excluded range 192.168.0.0/16 → blocked

Fix:
  Added 192.168.0.120:6443 to allow-required-egress

Lessons:
  1. A K8s Service ClusterIP ≠ the actual endpoint
  2. A NetworkPolicy must allow the complete routing path
  3. Changes should be validated with a dry-run first
```
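
The problem chain can be reproduced offline with Python's standard `ipaddress` module. The sketch assumes the policy shape implied by the analysis (a broad egress rule that excludes 192.168.0.0/16, plus explicit IP:port allowances); `egress_allowed` and this rule layout are illustrative and are not the actual manifest.

```python
import ipaddress
from typing import Iterable, Tuple

# Assumption: the LAN range excluded from the broad egress rule.
EXCLUDED_RANGE = ipaddress.ip_network("192.168.0.0/16")


def egress_allowed(dest_ip: str, port: int, explicit: Iterable[Tuple[str, int]]) -> bool:
    """True if (dest_ip, port) is explicitly allowed or lies outside the excluded range."""
    if (dest_ip, port) in set(explicit):
        return True
    return ipaddress.ip_address(dest_ip) not in EXCLUDED_RANGE


if __name__ == "__main__":
    before_fix = [("10.43.0.1", 443)]
    after_fix = before_fix + [("192.168.0.120", 6443)]
    print(egress_allowed("192.168.0.120", 6443, before_fix))  # real endpoint, before the fix
    print(egress_allowed("192.168.0.120", 6443, after_fix))   # real endpoint, after the fix
```

Checking the real endpoint rather than the ClusterIP is the point of lesson 1: the address the policy evaluates is the one the traffic is finally routed to.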
370
docs/adr/ADR-012-dangerous-operations-governance.md
Normal file
@@ -0,0 +1,370 @@
# ADR-012: Dangerous Operations Governance Architecture

**Status**: Proposed
**Date**: 2026-03-23
**Decision maker**: Commander
**Trigger event**: The Telegram token logOut disaster, three times in a row

---

## Background

### Incident Summary

On 2026-03-23, a `logOut` API call in the AWOOOI API code permanently invalidated the Telegram Bot Token three times in a row:

```
Timeline:
11:09 - First token hit by logOut (old AWOOOI API build)
19:31 - Second token hit by logOut (old Pod not fully terminated)
19:39 - Third token succeeds (Telegram disabled in AWOOOI API)
```

### Root Causes

| Problem | Description |
|------|------|
| Dangerous code reached production | `logOut` permanently invalidates the token, and nothing intercepted it |
| No code review | Dangerous operations touching external services were not reviewed |
| No runtime alert | Nothing was sent when the dangerous operation ran |
| No audit log | No way to trace who ran what, and when |
| Wrong change order | The old service was still running when the new token was issued |

---

## Decision

Establish a **three-layer governance architecture for dangerous operations**:

1. **Prevention layer** - CI/CD intercepts dangerous patterns
2. **Execution layer** - Dangerous operations require human confirmation
3. **Audit layer** - Every operation is recorded and traceable

---

## Architecture Design

```
┌─────────────────────────────────────────────────────────────────────┐
│            Dangerous Operations Governance Architecture             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  Layer 1: Prevention                                          │  │
│  │  ═══════════════════                                          │  │
│  │                                                               │  │
│  │  CI/CD Pipeline                                               │  │
│  │  ├── Danger-pattern scan (grep logOut, deleteWebhook, etc.)   │  │
│  │  ├── External-service API allowlist check                     │  │
│  │  └── PR labeling (dangerous operations need extra review)     │  │
│  │                                                               │  │
│  │  Code Review                                                  │  │
│  │  ├── CODEOWNERS: dangerous files need CTO/CISO review         │  │
│  │  └── PR template: checkbox "touches external services?"       │  │
│  │                                                               │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                  │                                  │
│                                  ▼                                  │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  Layer 2: Execution Control                                   │  │
│  │  ══════════════════════════                                   │  │
│  │                                                               │  │
│  │  Operation tiers                                              │  │
│  │  ├── Tier 0 (forbidden): logOut, revokeToken                  │  │
│  │  ├── Tier 1 (approval): deleteWebhook, setWebhook             │  │
│  │  ├── Tier 2 (alert): sendMessage (bulk)                       │  │
│  │  └── Tier 3 (automatic): getMe, getUpdates                    │  │
│  │                                                               │  │
│  │  Pre-execution checks                                         │  │
│  │  ├── Create an approval card (Tier 1 operations)              │  │
│  │  ├── Send a Telegram alert (Tier 1-2 operations)              │  │
│  │  └── Record in Active Incidents                               │  │
│  │                                                               │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                  │                                  │
│                                  ▼                                  │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  Layer 3: Audit Trail                                         │  │
│  │  ════════════════════                                         │  │
│  │                                                               │  │
│  │  Recorded fields                                              │  │
│  │  ├── Operation type, time, actor                              │  │
│  │  ├── Impact scope (which services/tokens)                     │  │
│  │  ├── Result (success/failure)                                 │  │
│  │  └── Linked Incident ID                                       │  │
│  │                                                               │  │
│  │  Storage                                                      │  │
│  │  ├── PostgreSQL: audit_logs table                             │  │
│  │  ├── SigNoz: trace tracking                                   │  │
│  │  └── Active Incidents: shown on the Dashboard                 │  │
│  │                                                               │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Dangerous API Inventory

### Telegram API

| API | Tier | Description | Handling |
|-----|------|------|----------|
| `logOut` | Tier 0 (forbidden) | Permanently invalidates the token | Blocked by CI/CD; must never reach production |
| `close` | Tier 0 (forbidden) | Closes all sessions | Blocked by CI/CD |
| `deleteWebhook` | Tier 1 (approval) | May affect other services | Approval card required |
| `setWebhook` | Tier 1 (approval) | May overwrite other settings | Approval card required |
| `sendMessage` (bulk) | Tier 2 (alert) | May spam users | Send an alert |
| `getUpdates` | Tier 3 (automatic) | Only a single instance may poll | Restricted to OpenClaw |
| `getMe` | Tier 3 (automatic) | No side effects | Allowed |

### K8s API

| Operation | Tier | Description | Handling |
|------|------|------|----------|
| `kubectl delete namespace` | Tier 0 (forbidden) | Deletes an entire namespace | Absolutely forbidden |
| `kubectl delete pvc` | Tier 1 (approval) | Deletes persistent data | Approval required |
| `kubectl rollout restart` | Tier 2 (alert) | Restarts services | Send an alert |
| `kubectl scale` | Tier 2 (alert) | Changes the replica count | Send an alert |
| `kubectl get` | Tier 3 (automatic) | Read-only operation | Allowed |

### Database

| Operation | Tier | Description | Handling |
|------|------|------|----------|
| `DROP DATABASE` | Tier 0 (forbidden) | Deletes the database | Absolutely forbidden |
| `TRUNCATE TABLE` | Tier 1 (approval) | Empties a table | Approval required |
| `DELETE FROM` (no WHERE) | Tier 1 (approval) | Deletes all rows | Approval required |
| `ALTER TYPE` | Tier 2 (alert) | Modifies the schema | Send an alert |

---

## Implementation Plan

### Phase 1: CI/CD Interception (Week 1)

```yaml
# .github/workflows/ci.yml
- name: Scan for Dangerous Operations
  run: |
    echo "🔍 Scanning for dangerous operations..."

    # Tier 0: forbidden operations
    BANNED_PATTERNS="logOut|\.close\(\)|revokeToken|DROP DATABASE"

    # dangerous_ops.py is excluded: the tier registry itself names these APIs
    if grep -rn --include="*.py" --exclude="dangerous_ops.py" -E "$BANNED_PATTERNS" apps/; then
      echo "❌ Tier 0 forbidden operation found!"
      echo "Remove the dangerous code listed above and resubmit"
      exit 1
    fi

    # Tier 1: operations that need approval (TRUNCATE TABLE is Tier 1 in the inventory)
    REVIEW_PATTERNS="deleteWebhook|setWebhook|TRUNCATE TABLE|DELETE FROM"

    if grep -rn --include="*.py" --exclude="dangerous_ops.py" -E "$REVIEW_PATTERNS" apps/; then
      echo "⚠️ Tier 1 operation found; CTO/CISO review required"
      echo "::warning::This PR contains dangerous operations; confirm that it has been reviewed"
    fi

    echo "✅ Dangerous operation scan complete"
```
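
A plain `grep -E "logOut"` also matches unrelated identifiers such as `logOutput`. The sketch below shows the same tiered scan with word boundaries, following the tier inventory in this ADR (where `TRUNCATE TABLE` is Tier 1). It is an illustration under those assumptions: `classify_line` and `scan` are hypothetical names, and the pattern list is deliberately partial.

```python
import re
from typing import List, Optional, Tuple

# Tier 0: forbidden. Word boundaries keep `logOutput` from matching `logOut`.
TIER0 = re.compile(r"\blogOut\b|\brevokeToken\b|\bDROP\s+DATABASE\b")
# Tier 1: needs approval.
TIER1 = re.compile(
    r"\bdeleteWebhook\b|\bsetWebhook\b|\bTRUNCATE\s+TABLE\b|\bDELETE\s+FROM\b"
)


def classify_line(line: str) -> Optional[int]:
    """Return 0 or 1 for a Tier 0 / Tier 1 hit, or None when the line is clean."""
    if TIER0.search(line):
        return 0
    if TIER1.search(line):
        return 1
    return None


def scan(source: str) -> List[Tuple[int, int, str]]:
    """Return (line number, tier, stripped line) for every hit in `source`."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        tier = classify_line(line)
        if tier is not None:
            findings.append((lineno, tier, line.strip()))
    return findings
```

A CI step could fail the build when any finding has tier 0 and emit a warning for tier 1, which matches the behavior of the shell step in this phase.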

### Phase 2: CODEOWNERS (Week 1)

```
# .github/CODEOWNERS

# Files that touch external services need extra review
apps/api/src/services/telegram_*.py    @awoooi/cto @awoooi/ciso
apps/api/src/services/executor.py      @awoooi/cto @awoooi/ciso
k8s/**/02-network-policy.yaml          @awoooi/cio @awoooi/sre

# Dangerous-operation related
apps/api/src/services/*gateway*.py     @awoooi/cto
```

### Phase 3: Runtime Control (Week 2)

```python
# apps/api/src/core/dangerous_ops.py

from enum import Enum
from typing import Any, Awaitable, Callable

import structlog

# NOTE: create_approval_request, send_telegram_alert and log_audit_trail are
# assumed to be defined elsewhere in the codebase; this ADR does not show them.

logger = structlog.get_logger()


class OperationTier(Enum):
    FORBIDDEN = 0          # absolutely forbidden
    REQUIRES_APPROVAL = 1  # requires sign-off
    REQUIRES_ALERT = 2     # requires an alert
    AUTOMATIC = 3          # allowed automatically


DANGEROUS_OPERATIONS = {
    "telegram.logOut": OperationTier.FORBIDDEN,
    "telegram.close": OperationTier.FORBIDDEN,
    "telegram.deleteWebhook": OperationTier.REQUIRES_APPROVAL,
    "telegram.setWebhook": OperationTier.REQUIRES_APPROVAL,
    "k8s.deleteNamespace": OperationTier.FORBIDDEN,
    "k8s.deletePVC": OperationTier.REQUIRES_APPROVAL,
    "k8s.rolloutRestart": OperationTier.REQUIRES_ALERT,
}


async def execute_dangerous_operation(
    operation_name: str,
    operation_fn: Callable[..., Awaitable[Any]],
    *args,
    **kwargs,
) -> Any:
    """Single entry point for executing dangerous operations."""

    tier = DANGEROUS_OPERATIONS.get(operation_name, OperationTier.AUTOMATIC)

    if tier == OperationTier.FORBIDDEN:
        logger.error(
            "forbidden_operation_blocked",
            operation=operation_name,
        )
        raise PermissionError(f"Operation {operation_name} is forbidden")

    if tier == OperationTier.REQUIRES_APPROVAL:
        # Create an approval card and wait for Y/n
        approval = await create_approval_request(
            title=f"Dangerous operation: {operation_name}",
            description=f"About to execute {operation_name}; human confirmation required",
            tier="Tier 1",
        )
        if not approval.approved:
            raise PermissionError("Operation rejected")

    if tier in [OperationTier.REQUIRES_APPROVAL, OperationTier.REQUIRES_ALERT]:
        # Send an alert
        await send_telegram_alert(
            f"⚠️ Dangerous operation in progress: {operation_name}"
        )

    # Write the audit log entry before executing
    await log_audit_trail(
        operation=operation_name,
        tier=tier.name,
        args=str(args),
        kwargs=str(kwargs),
    )

    # Execute the operation; record a failure before re-raising
    try:
        result = await operation_fn(*args, **kwargs)
    except Exception as exc:
        await log_audit_trail(
            operation=operation_name,
            tier=tier.name,
            result="failure",
            error_message=str(exc),
        )
        raise

    # Record the result
    await log_audit_trail(
        operation=operation_name,
        tier=tier.name,
        result="success",
    )

    return result
```

### Phase 4: Audit Log (Week 2)

```sql
-- Create the audit log table
CREATE TABLE IF NOT EXISTS audit_logs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    operation VARCHAR(255) NOT NULL,
    tier VARCHAR(50) NOT NULL,
    actor VARCHAR(255),        -- who ran it (system/user)
    target VARCHAR(255),       -- affected target
    args JSONB,
    result VARCHAR(50),        -- success/failure/blocked
    error_message TEXT,
    incident_id VARCHAR(50),   -- linked Incident
    trace_id VARCHAR(50),      -- SigNoz Trace ID
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_audit_logs_operation ON audit_logs(operation);
CREATE INDEX idx_audit_logs_timestamp ON audit_logs(timestamp);
CREATE INDEX idx_audit_logs_tier ON audit_logs(tier);
```

---

## 變更管理流程
|
||||
|
||||
### Token/Secret 更新 SOP
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Token/Secret 更新 SOP │
|
||||
├─────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ Step 1: 停止所有使用該 Token 的服務 │
|
||||
│ ──────────────────────────────────── │
|
||||
│ - kubectl scale deployment --replicas=0 │
|
||||
│ - docker stop <container> │
|
||||
│ - 等待所有實例完全終止 │
|
||||
│ │
|
||||
│ Step 2: 驗證沒有殘留實例 │
|
||||
│ ──────────────────────────── │
|
||||
│ - kubectl get pods (確認沒有 Running/Terminating) │
|
||||
│ - docker ps (確認沒有相關容器) │
|
||||
│ │
|
||||
│ Step 3: 取得新 Token │
|
||||
│ ──────────────────────── │
|
||||
│ - @BotFather → Revoke current token │
|
||||
│ - 複製新 Token │
|
||||
│ │
|
||||
│ Step 4: 更新到唯一的服務 │
|
||||
│ ──────────────────────── │
|
||||
│ - 更新 .env 或 K8s Secret │
|
||||
│ - 記錄到 memory/reference_telegram_token.md │
|
||||
│ │
|
||||
│ Step 5: 啟動服務並驗證 │
|
||||
│ ──────────────────────── │
|
||||
│ - 啟動服務 │
|
||||
│ - 檢查日誌確認無錯誤 │
|
||||
│ - 測試 Telegram 連線 │
|
||||
│ │
|
||||
│ Step 6: 更新文檔 │
|
||||
│ ──────────────── │
|
||||
│ - 更新 memory 相關 MD │
|
||||
│ - 發送 Telegram 確認訊息 │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
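
The ordering in the SOP (stop, verify, only then rotate) can be sketched as a small runner that refuses to fetch a new token while any instance is still alive. This is an illustration under assumptions: `TokenRotation`, `RotationAborted`, and every injected callable are hypothetical names, and the real stop/verify steps would wrap the `kubectl` and `docker` commands listed in the SOP. Step 6 (documentation) stays manual.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class RotationAborted(RuntimeError):
    """Raised when a precondition of the SOP is not met."""


@dataclass
class TokenRotation:
    # Each callable is supplied by the caller; none of them is a real project API.
    stop_services: Callable[[], None]      # Step 1
    count_running: Callable[[], int]       # Step 2: Running + Terminating instances
    fetch_new_token: Callable[[], str]     # Step 3
    store_token: Callable[[str], None]     # Step 4
    start_service: Callable[[], bool]      # Step 5: returns True when healthy
    log: List[str] = field(default_factory=list)

    def run(self) -> None:
        self.stop_services()
        self.log.append("stopped")

        remaining = self.count_running()
        if remaining > 0:
            # A leftover instance could invalidate the new token immediately.
            raise RotationAborted(f"{remaining} instance(s) still alive")
        self.log.append("verified")

        token = self.fetch_new_token()
        self.log.append("token_obtained")

        self.store_token(token)
        self.log.append("stored")

        if not self.start_service():
            raise RotationAborted("service did not come up healthy")
        self.log.append("started")
```

The point of the design is that `fetch_new_token` is unreachable until the verification step passes, which is the failure the 2026-03-23 incident exposed: a new token was issued while an old Pod was still running.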

---

## Acceptance Criteria

| Item | Status |
|------|--------|
| CI/CD dangerous-pattern scan | ⬜ |
| CODEOWNERS review for dangerous files | ⬜ |
| Tier 0 operations absolutely forbidden | ⬜ |
| Tier 1 operations require an approval card | ⬜ |
| Tier 2 operations send an alert | ⬜ |
| Audit log table created | ⬜ |
| Token rotation SOP documented | ✅ |
| Memory documents updated | ✅ |

---

## Appendix: Full Timeline of the 2026-03-23 Incident

```
11:09:47 - AWOOOI API starts and calls logOut (first token invalidated)
11:09:52 - OpenClaw fails to start: "Logged out"
...
19:31:16 - Second token issued
19:31:27 - Old AWOOOI Pod calls logOut (second token invalidated)
19:31:31 - OpenClaw fails to start: "Logged out"
19:33:00 - New AWOOOI Pod finishes deploying (but the token is already dead)
...
19:35:00 - Telegram token cleared from AWOOOI API
19:35:30 - AWOOOI API restarted (no longer touches Telegram)
19:39:01 - Third token issued
19:39:17 - OpenClaw starts successfully
19:39:36 - All systems healthy
```
327
docs/design/UX-001-incident-card-fatigue.md
Normal file
@@ -0,0 +1,327 @@
# UX-001: Alert-Fatigue Solution for Incident Cards

**Status**: Proposed
**Date**: 2026-03-23
**Proposer**: 統帥 (Commander)
**Problem**: Incident cards pile up for long periods, so users grow numb to them and the cards lose their warning effect

---

## Problem Analysis

```
Current state:
┌────────────────────────────────────────────────────────┐
│  Active incidents                                      │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐          │
│  │ P0 red     │ │ P0 red     │ │ P0 red     │          │
│  │ 09:39      │ │ 09:44      │ │ 09:57      │ ...      │
│  │ mitigating │ │ mitigating │ │ mitigating │          │
│  └────────────┘ └────────────┘ └────────────┘          │
│                                                        │
│  Problem: all look alike → visual fatigue → no action  │
└────────────────────────────────────────────────────────┘
```

### Psychological Principles

1. **Habituation (sensory adaptation)**: a sustained, unchanging stimulus is perceived less and less over time. The Weber-Fechner Law is related: perceived intensity grows only logarithmically with stimulus strength, so each additional identical red card adds little extra urgency
2. **Banner Blindness**: repeated visual elements are automatically ignored by the brain
3. **Decision Fatigue**: too many choices lead to making no choice at all

---

## Solution: Four Layers of Optimization

### Layer 1: Time-based Visual Decay

```
Visual difference between new and old cards:

🔴 Just happened (< 5 min)     → red pulsing animation + sound alert
🟠 Recent (5-30 min)           → orange gradient + subtle animation
🟡 Waiting (30 min - 2 hours)  → static yellow
⚪ Stale (> 2 hours)           → gray + faded to 50%
```

```tsx
// Time-decay styles: map a card's age to its visual treatment
const getCardStyle = (createdAt: Date) => {
  const ageMinutes = (Date.now() - createdAt.getTime()) / 60000;

  // Just happened (< 5 min)
  if (ageMinutes < 5) return {
    borderColor: 'red',
    animation: 'pulse 1s infinite',
    playSound: true,
  };
  // Recent (5-30 min)
  if (ageMinutes < 30) return {
    borderColor: 'orange',
    animation: 'subtle-glow 2s infinite',
  };
  // Waiting (30 min - 2 hours)
  if (ageMinutes < 120) return {
    borderColor: 'yellow',
    opacity: 0.9,
  };
  // Stale (> 2 hours)
  return {
    borderColor: 'gray',
    opacity: 0.5,
    filter: 'grayscale(50%)',
  };
};
```

### Layer 2: Smart Grouping

```
Layout after optimization:

┌────────────────────────────────────────────────────────┐
│ 🚨 Needs immediate action (2)             [Handle all] │
│ ┌────────────────────────────────────────────────────┐ │
│ │ 🔴 postgres-primary-0  P0  just now                │ │
│ │ 🔴 awoooi-worker       P0  3 min ago               │ │
│ └────────────────────────────────────────────────────┘ │
│                                                        │
│ ⏳ Awaiting action (3)                  [Bulk close ▼] │
│ ┌────────────────────────────────────────────────────┐ │
│ │ 🟡 harbor-core         P0  45 min ago  [Y] [n]     │ │
│ │ 🟡 postgres-native     P0  1 hour ago  [Y] [n]     │ │
│ │ 🟡 health-check-test   P2  2 hours ago [Y] [n]     │ │
│ └────────────────────────────────────────────────────┘ │
│                                                        │
│ 📦 Stale incidents (1)          [Auto-close countdown] │
│ ┌────────────────────────────────────────────────────┐ │
│ │ ⚪ postgres-primary-0  P0  3 hours ago closes in 1h │ │
│ └────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
```

### Layer 3: Escalation

```
Automatic escalation along the timeline:

0-15 min:   shown on the Dashboard
15-30 min:  Telegram notification (first)
30-60 min:  Telegram reminder + @mention the owner
1-2 hours:  phone call (P0/P1)
2 hours+:   auto-downgrade to P3 or auto-close (depending on the rule)
```

```yaml
# Escalation rule configuration
escalation_rules:
  P0:
    - after: 15m
      action: telegram_notify
      message: "🚨 P0 incident awaiting action: {incident_id}"
    - after: 30m
      action: telegram_mention
      mention: "@oncall"
    - after: 1h
      action: phone_call
      to: oncall_phone
    - after: 4h
      action: auto_close
      reason: "No one handled it for over 4 hours; closed automatically"

  P1:
    - after: 30m
      action: telegram_notify
    - after: 2h
      action: auto_downgrade
      to: P3

  P2:
    - after: 2h
      action: auto_close
      reason: "Low-priority incident timed out; closed automatically"
```
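
The YAML rules express delays as `15m` or `4h`, while the service code later in this document compares ages in minutes. A possible bridge is sketched below; `parse_after` and `normalize_rules` are hypothetical helper names, and only the `m` and `h` suffixes that appear in the configuration are handled.

```python
import re
from typing import Any, Dict, List

# Only the two suffixes used in the rule file are supported (assumption).
_UNIT_MINUTES = {"m": 1, "h": 60}
_DURATION_RE = re.compile(r"^(\d+)([mh])$")


def parse_after(value: str) -> int:
    """Convert a duration such as '15m' or '4h' into whole minutes."""
    match = _DURATION_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MINUTES[unit]


def normalize_rules(rules: Dict[str, List[Dict[str, Any]]]) -> Dict[str, List[Dict[str, Any]]]:
    """Replace each 'after' string with 'after_minutes' and sort steps by delay."""
    normalized: Dict[str, List[Dict[str, Any]]] = {}
    for priority, steps in rules.items():
        converted = []
        for step in steps:
            extra = {key: val for key, val in step.items() if key != "after"}
            converted.append({"after_minutes": parse_after(step["after"]), **extra})
        normalized[priority] = sorted(converted, key=lambda s: s["after_minutes"])
    return normalized
```

Rejecting unknown suffixes at load time means a typo such as `15min` fails when the configuration is read rather than silently never firing.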

### Layer 4: Smart Merge

```
Merge incidents of the same kind:

Before:
├── postgres-primary-0 crash (09:44)
├── postgres-primary-0 restart (09:57)
├── postgres-primary-0 OOM (10:15)

After:
├── 📦 postgres-primary-0 issue group (3 incidents)
│   └── Latest: OOM (10:15)
│   └── [Expand details]
```
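
One way to read the Before/After tree is as a group-by on the affected target. The sketch below assumes that "same kind" means "same target", which the example suggests but the document does not define; `Event`, `EventGroup`, and `merge_by_target` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    target: str
    kind: str
    time: str  # zero-padded "HH:MM" on the same day, so string order is time order


@dataclass
class EventGroup:
    target: str
    events: List[Event]

    @property
    def latest(self) -> Event:
        return max(self.events, key=lambda event: event.time)

    def summary(self) -> str:
        latest = self.latest
        return (
            f"{self.target} issue group ({len(self.events)} incidents), "
            f"latest: {latest.kind} ({latest.time})"
        )


def merge_by_target(events: List[Event]) -> List[EventGroup]:
    """Collapse events that share a target into one group, keeping first-seen order."""
    buckets: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        buckets[event.target].append(event)
    return [EventGroup(target, items) for target, items in buckets.items()]
```

With the three `postgres-primary-0` events from the diagram, `merge_by_target` yields a single group whose summary names the 10:15 OOM as the latest entry.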

---

## Implementation Plan

### Phase 1: Visual Optimization (Week 1)

```tsx
// apps/web/src/components/incident/IncidentCard.tsx

import { formatDistanceToNow } from 'date-fns';
import { zhTW } from 'date-fns/locale';
import { motion } from 'framer-motion';
// Assumed to be provided elsewhere in the project: cn, Badge, the Incident type,
// and the getAgeInMinutes / getUrgencyLevel / getAutoCloseCountdown helpers.

interface IncidentCardProps {
  incident: Incident;
}

export const IncidentCard = ({ incident }: IncidentCardProps) => {
  const ageMinutes = getAgeInMinutes(incident.created_at);
  const urgencyLevel = getUrgencyLevel(ageMinutes, incident.priority);

  return (
    <motion.div
      className={cn(
        "incident-card",
        urgencyLevel === 'critical' && "animate-pulse border-red-500",
        urgencyLevel === 'urgent' && "border-orange-400",
        urgencyLevel === 'waiting' && "border-yellow-300 opacity-90",
        urgencyLevel === 'stale' && "border-gray-300 opacity-50 grayscale-50"
      )}
      initial={{ scale: 0.95, opacity: 0 }}
      animate={{ scale: 1, opacity: 1 }}
    >
      {/* Age badge */}
      <div className="absolute top-2 right-2">
        <Badge variant={urgencyLevel}>
          {formatDistanceToNow(incident.created_at, {
            addSuffix: true,
            locale: zhTW
          })}
        </Badge>
      </div>

      {/* Auto-close countdown (stale incidents only) */}
      {/* Hard-coded here for brevity; per the i18n rule this text should come from next-intl */}
      {urgencyLevel === 'stale' && (
        <div className="text-xs text-gray-400 mt-2">
          ⏱️ Auto-closes in {getAutoCloseCountdown(incident)}
        </div>
      )}

      {/* ... other content */}
    </motion.div>
  );
};
```

### Phase 2: Grouped Layout (Week 2)

```tsx
// apps/web/src/components/incident/IncidentDashboard.tsx

import { useMemo } from 'react';
// Assumed to be provided elsewhere in the project: useIncidents, Section, Button,
// IncidentCard, and the getAgeInMinutes helper used by IncidentCard.

export const IncidentDashboard = () => {
  const { incidents } = useIncidents();

  // Recomputed only when `incidents` changes, so a card moves to the next
  // group on the next data refresh rather than the instant it ages out.
  const grouped = useMemo(() => {
    const age = (i: Incident) => getAgeInMinutes(i.created_at);
    return {
      critical: incidents.filter(i => age(i) < 15),
      waiting: incidents.filter(i => age(i) >= 15 && age(i) < 120),
      stale: incidents.filter(i => age(i) >= 120),
    };
  }, [incidents]);

  return (
    <div className="space-y-6">
      {/* Urgent section: red background */}
      {grouped.critical.length > 0 && (
        <Section
          title="🚨 Needs immediate action"
          count={grouped.critical.length}
          variant="critical"
          action={<Button>Handle all</Button>}
        >
          {grouped.critical.map(i => <IncidentCard key={i.id} incident={i} />)}
        </Section>
      )}

      {/* Waiting section: yellow background */}
      {grouped.waiting.length > 0 && (
        <Section
          title="⏳ Awaiting action"
          count={grouped.waiting.length}
          variant="waiting"
          collapsible
        >
          {grouped.waiting.map(i => <IncidentCard key={i.id} incident={i} />)}
        </Section>
      )}

      {/* Stale section: collapsible, collapsed by default */}
      {grouped.stale.length > 0 && (
        <Section
          title="📦 Stale incidents"
          count={grouped.stale.length}
          variant="stale"
          collapsible
          defaultCollapsed
        >
          {grouped.stale.map(i => <IncidentCard key={i.id} incident={i} />)}
        </Section>
      )}
    </div>
  );
};
```

### Phase 3: Escalation (Week 3)

```python
# apps/api/src/services/escalation_service.py

class EscalationService:
    """Incident escalation service."""

    # Mirrors the escalation_rules YAML, with delays expressed in minutes
    RULES = {
        'P0': [
            {'after_minutes': 15, 'action': 'telegram_notify'},
            {'after_minutes': 30, 'action': 'telegram_mention', 'mention': '@oncall'},
            {'after_minutes': 60, 'action': 'phone_call'},
            {'after_minutes': 240, 'action': 'auto_close'},
        ],
        'P1': [
            {'after_minutes': 30, 'action': 'telegram_notify'},
            {'after_minutes': 120, 'action': 'auto_downgrade', 'to': 'P3'},
        ],
        'P2': [
            {'after_minutes': 120, 'action': 'auto_close'},
        ],
    }

    async def check_escalations(self):
        """Periodically check for incidents that need escalation."""
        incidents = await self.get_open_incidents()

        for incident in incidents:
            age_minutes = self.get_age_minutes(incident)
            rules = self.RULES.get(incident.priority, [])

            for rule in rules:
                # Fire each rule at most once per incident
                if age_minutes >= rule['after_minutes']:
                    if not await self.is_escalation_sent(incident.id, rule):
                        await self.execute_escalation(incident, rule)
```
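
`check_escalations` relies on `is_escalation_sent` to avoid repeating a step on every polling cycle, but the document does not show how that state is kept. The sketch below is one possible shape, using an in-memory set keyed by incident, action, and delay; `InMemoryEscalationLedger`, `mark_sent`, and `due_rules` are hypothetical names, and a real service would presumably persist this state in PostgreSQL.

```python
import asyncio
from typing import Any, Dict, List, Set, Tuple

EscalationKey = Tuple[str, str, int]


class InMemoryEscalationLedger:
    """Remembers which (incident, action, delay) steps already fired."""

    def __init__(self) -> None:
        self._sent: Set[EscalationKey] = set()

    @staticmethod
    def _key(incident_id: str, rule: Dict[str, Any]) -> EscalationKey:
        return (incident_id, rule["action"], rule["after_minutes"])

    async def is_escalation_sent(self, incident_id: str, rule: Dict[str, Any]) -> bool:
        return self._key(incident_id, rule) in self._sent

    async def mark_sent(self, incident_id: str, rule: Dict[str, Any]) -> None:
        self._sent.add(self._key(incident_id, rule))


async def due_rules(
    ledger: InMemoryEscalationLedger,
    incident_id: str,
    age_minutes: float,
    rules: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Return the rules that are due and not yet sent, marking them as sent."""
    due = []
    for rule in rules:
        if age_minutes < rule["after_minutes"]:
            continue
        if await ledger.is_escalation_sent(incident_id, rule):
            continue
        await ledger.mark_sent(incident_id, rule)
        due.append(rule)
    return due
```

Including the delay in the key lets the same action appear twice in one priority's rule list (for example two Telegram reminders) without the second being suppressed.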

---

## Expected Results

| Metric | Current | After optimization |
|--------|---------|--------------------|
| Mean time to handle | > 2 hours | < 30 minutes |
| Ignore rate | ~80% | < 20% |
| User attention | Scattered | Focused on critical incidents |
| Alert fatigue | Severe | Manageable |

---

## Acceptance Criteria

- [ ] New incidents pulse to attract attention
- [ ] Stale incidents fade automatically
- [ ] Incidents are displayed grouped by age
- [ ] A P0 incident left unhandled for 15 minutes triggers an automatic Telegram notification
- [ ] Incidents auto-close after 4 hours (configurable)
- [ ] Incidents of the same kind are merged intelligently
@@ -116,6 +116,23 @@ spec:
        - protocol: TCP
          port: 8080

    # Allow access to the K8s API (the Executor runs kubectl commands)
    # 2026-03-23 fix: the Y button timed out during execution
    # Important: the ClusterIP (10.43.0.1:443) routes to the actual endpoint (192.168.0.120:6443)
    # Both must be allowed, otherwise the traffic is blocked by the 192.168.0.0/16 exclusion rule
    - to:
        - ipBlock:
            cidr: 10.43.0.1/32  # K3s API Server ClusterIP
      ports:
        - protocol: TCP
          port: 443
    - to:
        - ipBlock:
            cidr: 192.168.0.120/32  # Actual API Server endpoint on the K3s master
      ports:
        - protocol: TCP
          port: 6443

    # Allow DNS resolution
    - to:
        - namespaceSelector: {}