docs: 新增 ADR-025 告警鏈路 E2E 驗證 + 更新 Skills
新增: - ADR-025: 告警鏈路 E2E 驗證架構 (2026-03-26 事故教訓) 更新: - ADR-011: 新增 DNS 規則最佳實踐 (附錄 B) - Skill 04: 新增 NetworkPolicy DNS 規則 + CoreDNS 設定 - Skill 05: 新增告警鏈路 Smoke Test 要求 - CLAUDE.md: 新增告警鏈路驗證到任務前必讀 事故根因: 1. URL 路徑錯誤 (webhook vs webhooks) 2. NetworkPolicy DNS 規則標籤不匹配 3. CoreDNS 上游 DNS 依賴 systemd-resolved Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -179,6 +179,78 @@ metadata:
|
||||
- `📝` 用途說明
|
||||
- `⚠️` 注意事項
|
||||
|
||||
### 🔴🔴 NetworkPolicy DNS 規則 (2026-03-26)
|
||||
|
||||
> **血的教訓**: DNS 規則標籤錯誤導致 2 天無告警!
|
||||
|
||||
```yaml
|
||||
# ❌ 錯誤: 使用不存在的標籤
|
||||
- ports:
|
||||
- port: 53
|
||||
to:
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
environment: prod # CoreDNS 沒有這個標籤!
|
||||
k8s-app: kube-dns
|
||||
system: awoooi # CoreDNS 沒有這個標籤!
|
||||
|
||||
# ✅ 正確: 使用 namespace selector
|
||||
- ports:
|
||||
- port: 53
|
||||
protocol: UDP
|
||||
- port: 53
|
||||
protocol: TCP
|
||||
to:
|
||||
- namespaceSelector:
|
||||
matchLabels:
|
||||
kubernetes.io/metadata.name: kube-system
|
||||
podSelector:
|
||||
matchLabels:
|
||||
k8s-app: kube-dns
|
||||
```
|
||||
|
||||
### 🔴🔴 CoreDNS 上游 DNS 設定
|
||||
|
||||
> **血的教訓**: 容器內無法使用 127.0.0.53 (systemd-resolved)
|
||||
|
||||
```yaml
|
||||
# ❌ 錯誤: 使用 /etc/resolv.conf (指向 127.0.0.53)
|
||||
forward . /etc/resolv.conf
|
||||
|
||||
# ✅ 正確: 使用真實 DNS 伺服器
|
||||
forward . 8.8.8.8 1.1.1.1
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 🔴🔴🔴 告警鏈路 E2E 驗證 (ADR-025)
|
||||
|
||||
> **2026-03-26**: URL 路徑錯誤導致 2 天無告警 (`webhook` vs `webhooks`)
|
||||
|
||||
### 部署後 Smoke Test (強制)
|
||||
|
||||
```bash
|
||||
# 每次部署後必須執行
|
||||
curl -s -X POST "$API_URL/api/v1/webhooks/alertmanager" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"receiver":"smoke-test","status":"firing","alerts":[...]}' \
|
||||
| jq -e '.success == true' || exit 1
|
||||
```
|
||||
|
||||
### URL 路徑規範
|
||||
|
||||
| 正確 | 錯誤 |
|
||||
|-----|------|
|
||||
| `/api/v1/webhooks/alertmanager` | `/api/v1/webhook/alertmanager` |
|
||||
| 複數形式 `webhooks` | 單數形式 `webhook` |
|
||||
|
||||
### Alertmanager ConfigMap 修改流程
|
||||
|
||||
1. 提取 webhook URL
|
||||
2. curl 測試 URL 可達性
|
||||
3. 必須收到 200 或 422 (格式錯但端點存在)
|
||||
4. 驗證失敗 → **阻止 apply**
|
||||
|
||||
---
|
||||
|
||||
## Turborepo 快取強化協議
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
|
||||
| 欄位 | 值 |
|
||||
|------|-----|
|
||||
| **版本** | v1.4 |
|
||||
| **版本** | v1.5 |
|
||||
| **建立日期** | 2026-03-20 (台北) |
|
||||
| **建立者** | Claude Code |
|
||||
| **最後修改** | 2026-03-26 03:30 (台北) |
|
||||
@@ -25,6 +25,7 @@
|
||||
| v1.2 | 2026-03-25 | Claude Code | 加入文件資訊區塊 |
|
||||
| v1.3 | 2026-03-26 | Claude Code | **Phase 15 觀測性測試** |
|
||||
| v1.4 | 2026-03-26 | Claude Code | **Runner 殭屍進程診斷流程** |
|
||||
| v1.5 | 2026-03-26 | Claude Code | **LLM 測試策略** (首席架構師審查 P3) |
|
||||
|
||||
---
|
||||
|
||||
@@ -114,6 +115,68 @@ cd apps/web && node scripts/verify-frontend.js
|
||||
|
||||
---
|
||||
|
||||
## 🔴🔴🔴 告警鏈路 E2E 驗證 (2026-03-26 ADR-025)
|
||||
|
||||
> **血的教訓**: URL 路徑錯誤 (`webhook` vs `webhooks`) + DNS 規則錯誤,導致 2 天無 Telegram 告警
|
||||
|
||||
### 部署後 Smoke Test (強制)
|
||||
|
||||
**任何 API 或 Alertmanager 部署後必須執行:**
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# scripts/smoke-test-alert-chain.sh
|
||||
|
||||
API_URL="${1:-http://192.168.0.120:32334}"
|
||||
|
||||
echo "🔔 Testing alert chain..."
|
||||
|
||||
RESPONSE=$(curl -s -X POST "$API_URL/api/v1/webhooks/alertmanager" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"receiver": "smoke-test",
|
||||
"status": "firing",
|
||||
"alerts": [{
|
||||
"status": "firing",
|
||||
"labels": {"alertname": "SmokeTest", "severity": "info"},
|
||||
"annotations": {"summary": "Smoke test from CI"}
|
||||
}]
|
||||
}')
|
||||
|
||||
# 驗證成功
|
||||
if echo "$RESPONSE" | jq -e '.success == true' > /dev/null 2>&1; then
|
||||
echo "✅ Alert chain smoke test passed"
|
||||
echo "📬 Approval ID: $(echo "$RESPONSE" | jq -r '.approval_id')"
|
||||
exit 0
|
||||
else
|
||||
echo "❌ Alert chain smoke test FAILED"
|
||||
echo "$RESPONSE"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
### 驗證項目
|
||||
|
||||
| 項目 | 驗證方式 | 失敗動作 |
|
||||
|------|---------|---------|
|
||||
| Webhook 端點可達 | curl 收到 200 | 回滾部署 |
|
||||
| Telegram 通知送達 | 檢查 message_id | 檢查 DNS/Token |
|
||||
| Approval 建立成功 | approval_id 存在 | 檢查 DB 連線 |
|
||||
|
||||
### DNS 連通性檢查
|
||||
|
||||
```bash
|
||||
# 從 Pod 內測試
|
||||
kubectl exec -n awoooi-prod deployment/awoooi-api -- \
|
||||
python -c "import socket; print(socket.gethostbyname('api.telegram.org'))"
|
||||
```
|
||||
|
||||
**失敗原因**:
|
||||
- CoreDNS 上游 DNS 設定錯誤 (`127.0.0.53`)
|
||||
- NetworkPolicy DNS 規則標籤不匹配
|
||||
|
||||
---
|
||||
|
||||
## Playwright 自動化規範
|
||||
|
||||
### 測試腳本結構
|
||||
@@ -532,6 +595,86 @@ ls -la ~/actions-runner-awoooi*/_work/_temp/
|
||||
|
||||
---
|
||||
|
||||
## 🧠 LLM 測試策略 (2026-03-26 首席架構師審查)
|
||||
|
||||
> **背景**: LLM 測試天生非確定性,需特殊處理確保 CI 穩定
|
||||
> **ADR**: ADR-018-llm-testing-strategy.md (Deferred - 採用方案 A)
|
||||
|
||||
### 確定性參數 (必須)
|
||||
|
||||
```python
|
||||
# ✅ 所有 LLM 測試必須使用確定性參數
|
||||
response = await client.post(
|
||||
f"{OLLAMA_URL}/api/chat",
|
||||
json={
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"stream": False,
|
||||
"options": {
|
||||
"temperature": 0.0, # 🔴 確定性輸出
|
||||
"seed": 42, # 🔴 可重現性
|
||||
},
|
||||
},
|
||||
)
|
||||
```
|
||||
|
||||
### CI 分層策略
|
||||
|
||||
| 層級 | Workflow | 執行時間 | 包含測試 |
|
||||
|------|----------|----------|----------|
|
||||
| **Fast CI** | `ci.yaml` | ~3 min | Lint, Unit, Integration |
|
||||
| **Nightly LLM** | `nightly-llm.yaml` | ~45 min | Prompt Validation, Model Regression |
|
||||
| **Daily E2E** | `daily-e2e-health.yaml` | ~5 min | Health Check, K8s 驗證 |
|
||||
|
||||
### Ollama CPU 模式須知
|
||||
|
||||
> **192.168.0.188**: 純 CPU 推理 (無 GPU),速度 ~0.45 tok/s
|
||||
|
||||
```python
|
||||
# CPU 模式必須設定足夠長的 Timeout
|
||||
TIMEOUT = 300 # 秒 (CPU 推理需 ~222-666 秒)
|
||||
|
||||
async with httpx.AsyncClient(timeout=TIMEOUT) as client:
|
||||
response = await client.post(...)
|
||||
```
|
||||
|
||||
### 測試分類執行
|
||||
|
||||
```bash
|
||||
# 快速測試 (CI 每次)
|
||||
pytest apps/api/tests/ -k "not llm and not model" -v
|
||||
|
||||
# LLM 測試 (Nightly)
|
||||
pytest apps/api/tests/test_model_regression.py -v
|
||||
pytest apps/api/tests/test_prompt_validation.py -v
|
||||
```
|
||||
|
||||
### 繁體中文輸出驗證
|
||||
|
||||
```python
|
||||
# System Prompt 必須強調繁中
|
||||
AWOOOI_SYSTEM_PROMPT = """
|
||||
...
|
||||
- 【重要】必須使用台灣繁體中文回應 (Traditional Chinese Taiwan)
|
||||
- 禁止使用簡體中文字符 (如:与→與、说→說、这→這)
|
||||
...
|
||||
"""
|
||||
|
||||
# 驗證器範例
|
||||
def validate_traditional_chinese(response: str) -> bool:
|
||||
simplified_chars = ["与", "说", "这", "为", "时"]
|
||||
return not any(c in response for c in simplified_chars)
|
||||
```
|
||||
|
||||
### 參考
|
||||
|
||||
- `src/core/prompts.py`: 集中式 System Prompt (ADR-019)
|
||||
- `tests/test_model_regression.py`: 模型回歸測試
|
||||
- `tests/test_prompt_validation.py`: Prompt 品質測試
|
||||
- `.github/workflows/nightly-llm.yaml`: Nightly LLM Workflow
|
||||
|
||||
---
|
||||
|
||||
## 參考文檔
|
||||
|
||||
- `apps/web/playwright.config.ts`: Playwright 設定
|
||||
@@ -542,3 +685,7 @@ ls -la ~/actions-runner-awoooi*/_work/_temp/
|
||||
- `src/core/telemetry.py`: **Phase 15.2 Trace Context**
|
||||
- `memory/project_phase15_langfuse.md`: **📊 Phase 15 完整記錄**
|
||||
- `memory/feedback_runner_zombie_process.md`: **🚨 Runner 殭屍進程修復**
|
||||
- `docs/adr/ADR-018-llm-testing-strategy.md`: **🧠 LLM 測試策略 (Deferred)**
|
||||
- `docs/adr/ADR-019-system-prompt-management.md`: **📝 System Prompt 集中管理**
|
||||
- `.github/workflows/nightly-llm.yaml`: **🌙 Nightly LLM 測試**
|
||||
- `.github/workflows/daily-e2e-health.yaml`: **🏥 Daily E2E 健康檢查**
|
||||
|
||||
Reference in New Issue
Block a user