- docs/reference/ALERT-TAXONOMY-CATALOG.md: 16 categories, 56 alertnames, 24-rule priority table
- docs/ai/AI-MODEL-CARDS.md: 7 AI model governance cards (deepseek/qwen/gemini/claude/nemotron) plus fallback order
- docs/templates/POSTMORTEM-TEMPLATE.md: aligned with report_generation_service, [AUTO] fields marked
- docs/operations/ON-CALL-HANDBOOK.md: P0/P1 SOP, Kill Switch, SLO response, common command quick reference

Created: 2026-04-14 (Taipei time) by Claude Sonnet 4.6 (Tactic B Phase 1 full wrap-up)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

# AWOOOI Alert Taxonomy Catalog

> **Document type**: Authoritative alert taxonomy catalog
> **Version**: v1.0
> **Created**: 2026-04-14 (Taipei time)
> **Author**: Claude Sonnet 4.6 (chief architect)
> **Data sources**: `alert_types.py` (56 entries), `alert_rules.yaml` (24 rules), `classify_alert_early()`
> **Maintenance**: `alert_types.py` is the Layer 2 fallback; add new types to `alert_rules.yaml` (Layer 1) first

---

## Three-Layer Alert Classification Architecture

```
Layer 1 → alert_rules.yaml          (incident_type field, preferred)
Layer 2 → ALERTNAME_TO_TYPE dict    (alert_types.py, fallback)
Layer 3 → "custom"                  (no match, catch-all)
```

**Code entry point**: `alert_rule_engine.get_incident_type(alertname)`, which already implements the three-layer fallback

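The fallback chain can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual `alert_rule_engine` code: `RULE_INCIDENT_TYPES`, `lookup_rule_incident_type`, and the sample entries are hypothetical stand-ins for data that would be loaded from `alert_rules.yaml` and `alert_types.py`.

```python
from typing import Optional

# Hypothetical stand-ins for the two data sources; the real contents live in
# alert_rules.yaml (Layer 1) and alert_types.py (Layer 2).
RULE_INCIDENT_TYPES = {"MinIODown": "service_down"}
ALERTNAME_TO_TYPE = {"HostHighCpuLoad": "host_cpu"}
FALLBACK_TYPE = "custom"  # Layer 3 catch-all


def lookup_rule_incident_type(alertname: str) -> Optional[str]:
    """Layer 1: return the incident_type declared by a rule, if any."""
    return RULE_INCIDENT_TYPES.get(alertname)


def get_incident_type(alertname: str) -> str:
    """Resolve an alertname by trying each layer in order."""
    rule_type = lookup_rule_incident_type(alertname)
    if rule_type is not None:
        return rule_type
    # Layer 2 lookup, falling through to the Layer 3 catch-all.
    return ALERTNAME_TO_TYPE.get(alertname, FALLBACK_TYPE)
```

An alertname known to neither layer resolves to `"custom"`, so callers always receive a usable category.
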
---

## Category Catalog (16 Categories)

| # | Category ID | Description | Default Action | Risk Level | Count |
|---|-------------|-------------|----------------|------------|-------|
| 1 | `host_down` | Host completely unresponsive | `RESTART_SERVICE` | HIGH | 5 |
| 2 | `host_cpu` | Host CPU overload | `NO_ACTION` | MEDIUM | 2 |
| 3 | `host_memory` | Host out of memory | `NO_ACTION` | MEDIUM | 2 |
| 4 | `disk_full` | Disk space exhausted | `NO_ACTION` | HIGH | 3 |
| 5 | `backup_failure` | Backup job failed | `NO_ACTION` | MEDIUM | 3 |
| 6 | `k8s_node_failure` | K8s node failure | `NO_ACTION` | CRITICAL | 3 |
| 7 | `k8s_pod_crash` | Pod crashed / not ready | `KUBECTL_ROLLOUT_RESTART` | HIGH | 4 |
| 8 | `k8s_deployment_mismatch` | Deployment replica count mismatch | `KUBECTL_SCALE` | MEDIUM | 1 |
| 9 | `database_down` | Database service down | `RESTART_SERVICE` | CRITICAL | 2 |
| 10 | `database_performance` | Database performance anomaly | `NO_ACTION` | MEDIUM | 6 |
| 11 | `service_down` | Internal service down | `RESTART_SERVICE` | HIGH | 9 |
| 12 | `service_404` | External service/website unreachable | `NO_ACTION` | MEDIUM | 5 |
| 13 | `ssl_expiry` | SSL/TLS certificate expiry | `NO_ACTION` | MEDIUM | 2 |
| 14 | `alert_chain_broken` | Alert chain broken | `NO_ACTION` | CRITICAL | 4 |
| 15 | `docker_container_unhealthy` | Docker container unhealthy | `RESTART_SERVICE` | MEDIUM | 2 |
| 16 | `auto_repair_degraded` | Auto-repair efficiency degraded | `NO_ACTION` | HIGH | 2 |
| - | `high_cpu` / `high_memory` | Legacy compatibility aliases | - | - | 3 |
| - | `custom` | Catch-all for unmatched alerts | `NO_ACTION` | LOW | - |

---

## Detailed Category List

### 1. Host Layer

#### `host_down`: Host unresponsive

| alertname | Description | SSH available? |
|-----------|-------------|----------------|
| `HostDown` | Host has stopped responding to ICMP/TCP entirely | No (host unreachable) |

- **Decision path**: TYPE-3 (high risk, manual review)
- **Repair mode**: SSH path (once the host is back online) or manual intervention
- **Rule match**: `host_down` rule (priority 100)

#### `host_cpu`: CPU overload

| alertname | Description |
|-----------|-------------|
| `HostHighCpuLoad` | Host CPU load stays above the threshold |
| `HighCPUUsage` | Legacy alias (kept for compatibility) |

- **Decision path**: LLM analysis → identify CPU-heavy processes → NO_ACTION (INFO)
- **Rule match**: `host_high_cpu` rule (priority 115)

#### `host_memory`: Out of memory

| alertname | Description |
|-----------|-------------|
| `HostOutOfMemory` | Host available memory is below 10% |
| `HighMemoryUsage` | Legacy alias (kept for compatibility) |
| `RedisMemoryHigh` | Redis memory usage is above 80% |

- **Decision path**: NO_ACTION (INFO); manual cleanup recommended
- **Rule match**: `host_out_of_memory` rule (priority 110)

#### `disk_full`: Disk space exhausted

| alertname | Description |
|-----------|-------------|
| `HostOutOfDiskSpace` | Host is running out of free disk space |
| `DiskSpaceLow` | Legacy alias (kept for compatibility) |

- **Decision path**: TYPE-1 (INFO); NO_ACTION after log analysis
- **Rule match**: `disk_full` rule (priority 120)

#### `backup_failure`: Backup failed

| alertname | Description |
|-----------|-------------|
| `HostBackupFailed` | Host backup script failed |
| `VeleroBackupFailed` | Kubernetes Velero backup failed |
| `VeleroBackupNotRun` | Velero backup has not run past its scheduled time |

- **Decision path**: NO_ACTION (notify the Commander)
- **SLA impact**: RPO may be compromised; manual confirmation required

---

### 2. Kubernetes Layer (K8s)

#### `k8s_node_failure`: Node failure

| alertname | Description |
|-----------|-------------|
| `K3sNodeNotReady` | K3s node is in NotReady state |
| `KubeNodeNotReady` | Kubernetes node is NotReady |
| `KubeNodeUnreachable` | Node is unreachable |

- **Decision path**: TYPE-3 (CRITICAL, mandatory manual review)
- **No automatic repair**: node problems affect the whole cluster, so drain/cordon is never run automatically

#### `k8s_pod_crash`: Pod crash

| alertname | Description |
|-----------|-------------|
| `KubePodCrashLooping` | Pod crashes and restarts repeatedly (CrashLoopBackOff) |
| `KubePodNotReady` | Pod stays not ready |

- **Decision path**: LLM RCA → rollout restart
- **Standard kubectl**: `kubectl rollout restart deployment/{name} -n {namespace}`
- **Rule match**: `pod_crash_looping` / `pod_not_ready` rules (priority 50/55)

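As an illustration of how the `{name}` and `{namespace}` placeholders might be filled safely, the sketch below builds the command as an argument list rather than a shell string. The function name `build_rollout_restart` and the name check are assumptions made for this example; the check is a deliberately conservative DNS-label pattern and may be stricter than what Kubernetes itself accepts.

```python
import re

# Conservative DNS-1123 label pattern, used here only to reject unsafe input.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def build_rollout_restart(name: str, namespace: str) -> list[str]:
    """Return the argv for `kubectl rollout restart deployment/{name} -n {namespace}`."""
    for value in (name, namespace):
        if len(value) > 63 or not _DNS_LABEL.fullmatch(value):
            raise ValueError(f"invalid Kubernetes name: {value!r}")
    return ["kubectl", "rollout", "restart", f"deployment/{name}", "-n", namespace]
```

Passing the list to `subprocess.run` without `shell=True` keeps alert-supplied values from being interpreted by a shell.
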
#### `k8s_deployment_mismatch`: Replica count mismatch

| alertname | Description |
|-----------|-------------|
| `KubeDeploymentReplicasMismatch` | Actual Pod count ≠ desired replica count |

- **Decision path**: LLM analysis → kubectl scale (requires manual confirmation)
- **Note**: `--replicas=0` is blocked by the safety gate (ADR-064)

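A gate of this kind could look like the following sketch. It is not the ADR-064 implementation: `build_scale_command` is a hypothetical helper, and rejecting negative values alongside zero is an extra assumption made here.

```python
def build_scale_command(name: str, namespace: str, replicas: int) -> list[str]:
    """Return the argv for `kubectl scale`, refusing to scale a deployment to zero."""
    # Safety gate: --replicas=0 would take the service down entirely.
    # Negative values are rejected too (an assumption of this sketch).
    if replicas < 1:
        raise ValueError(f"replicas={replicas} is blocked by the safety gate")
    return [
        "kubectl", "scale", f"deployment/{name}",
        f"--replicas={replicas}", "-n", namespace,
    ]
```

Putting the check inside the command builder means no code path can produce a `--replicas=0` command by accident.
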
---

### 3. Database Layer

#### `database_down`: Database down

| alertname | Description |
|-----------|-------------|
| `PostgreSQLDown` | PostgreSQL is unreachable |
| `RedisDown` | Redis is unreachable |

- **Decision path**: TYPE-3 (CRITICAL) → rollout restart after manual review
- **Rule match**: `postgresql_down` / `redis_down` rules (priority 30/35)

#### `database_performance`: Database performance

| alertname | Description |
|-----------|-------------|
| `PostgreSQLHighConnections` | Connection count too high |
| `PostgreSQLSlowQueries` | Slow queries detected |
| `PostgreSQLDeadlocks` | Deadlocks detected |
| `PostgreSQLTooManyConnections` | Maximum connection limit exceeded |
| `RedisKeyEviction` | Redis keys are being evicted |
| `RedisConnectionsHigh` | Redis connection count too high |
| `RedisCommandLatencyHigh` | Redis command latency too high |

- **Decision path**: LLM analysis → NO_ACTION (INFO plus recommendations)

---

### 4. Service Layer

#### `service_down`: Internal service down

| alertname | Description | Host |
|-----------|-------------|------|
| `OpenClawDown` | OpenClaw AI engine down | 188 |
| `MinIODown` | MinIO object storage down | 188 |
| `HarborDown` | Harbor container registry down | 188 |
| `GiteaDown` | Gitea CI/CD service down | 188 |
| `AlertmanagerDown` | Alertmanager down (alert chain broken) | 110 |
| `SignOzDown` | SignOz monitoring platform down | 110 |
| `SentryDown` | Sentry error tracking down | - |
| `KaliScannerDown` | Kali security scanner down | 121 |

- **Rule coverage**:
  - `minio_down` (priority 10) → SSH docker restart
  - `gitea_down` (priority 125) → NO_ACTION (CI/CD service, not auto-repaired)
  - `openclaw_down` (priority 40) → kubectl rollout restart
- **Repair mode**: depends on the layer the service runs in (K8s Pod / Docker Container / host service)

#### `service_404`: External website/service failure

| alertname | Description |
|-----------|-------------|
| `MoWoooWorkDown` | Work-related external website down |
| `TsenyangWebsiteDown` | Personal website down |
| `StockWoooWorkDown` | Stock tools website down |
| `BitanWoooWorkDown` | Bitan-related website down |
| `ExternalSiteSSLExpiringSoon` | External website SSL certificate expiring |
| `TargetDown` | Legacy blackbox alert alias |

- **Rule match**: `external_site_down` (priority 127) → NO_ACTION (notify)
- **Decision path**: TYPE-1 (INFO)

#### `ssl_expiry`: SSL/TLS certificate expiry

| alertname | Description |
|-----------|-------------|
| `SSLCertExpiringSoon` | Certificate expires soon (< 30 days) |
| `ExternalSiteSSLExpiringSoon` | External website certificate expiring |

- **Rule match**: `ssl_cert_expiring` (priority 126) → NO_ACTION (reminder)
- **Remediation**: manual `certbot renew` or cert-manager renewal

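The 30-day threshold can be expressed as a small predicate. This is only a sketch of the comparison; how the real alert obtains the certificate's expiry time is not covered by this catalog, and `is_expiring_soon` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EXPIRY_WARNING = timedelta(days=30)  # "< 30 days" threshold from the table above


def is_expiring_soon(not_after: datetime, now: datetime) -> bool:
    """True when the certificate expires in fewer than 30 days (or already has)."""
    return not_after - now < EXPIRY_WARNING


now = datetime(2026, 4, 14, tzinfo=timezone.utc)
soon = is_expiring_soon(now + timedelta(days=29), now)
later = is_expiring_soon(now + timedelta(days=30), now)
```

Both timestamps should be timezone-aware (or both naive); Python raises `TypeError` when the two kinds are mixed in a subtraction.
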
---

### 5. Alert Chain Layer

#### `alert_chain_broken`: Alert chain broken

| alertname | Description |
|-----------|-------------|
| `AlertChainBroken_Alertmanager` | Alertmanager has stopped sending alerts |
| `AlertChainBroken_Sentry` | Sentry has stopped sending error notifications |
| `NoAlertsReceived2Hours` | No alerts received for 2 hours (watchdog failure) |
| `AlertChainUnhealthy` | Overall chain health is low |

- **Severity**: CRITICAL (a broken alert chain leaves the system blind)
- **Rule match**: `alert_chain_broken` (priority 20)
- **Notification**: Telegram emergency channel (immediate)

---

### 6. Container Layer (Docker)

#### `docker_container_unhealthy`: Docker container unhealthy

| alertname | Description |
|-----------|-------------|
| `DockerContainerUnhealthy` | Docker healthcheck failed |
| `DockerContainerExited` | Container exited unexpectedly (bare Docker, not managed by K8s) |

- **Repair mode**: SSH → `docker restart {container_name}`
- **Note**: restart only the affected container; never restart the whole Docker Engine, because that kills Gitea

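A minimal sketch of the SSH repair path, assuming the command is run through the system `ssh` client. `build_docker_restart` and `ssh_target` are hypothetical names for this example. The container name is shell-quoted because the remote side does interpret the command string.

```python
import shlex


def build_docker_restart(ssh_target: str, container_name: str) -> list[str]:
    """Return the argv that restarts a single container on a remote host."""
    if not container_name:
        raise ValueError("container_name must not be empty")
    # Only one named container is restarted; the Docker Engine is left alone.
    remote_command = f"docker restart {shlex.quote(container_name)}"
    return ["ssh", ssh_target, remote_command]
```

Quoting matters because a container name taken from an alert label is untrusted input by the time it reaches the remote shell.
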
---

### 7. Auto Repair Monitoring

#### `auto_repair_degraded`: Flywheel efficiency degraded

| alertname | Description |
|-----------|-------------|
| `AutoRepairLowSuccessRate` | SLO-1 < 80% (7-day rolling) |
| `PermanentFixRequired` | The same alert has been repaired more than 3 times |

- **Trigger**: HITL-8 (SLO-1 < 50% for 2 consecutive hours)
- **Handling**: TYPE-8M notification that recommends pausing auto repair
- **Root cause**: see `feedback_auto_repair_flywheel_v2.md`

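The two alert conditions in the table can be illustrated with the sketch below. It rests on several assumptions: `RepairAttempt` and `evaluate_flywheel` are hypothetical names, SLO-1 is treated as the plain success ratio of repair attempts, and the repeat count is taken over the same 7-day window, which the catalog does not specify.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)   # rolling window for SLO-1
SUCCESS_RATE_FLOOR = 0.80    # AutoRepairLowSuccessRate threshold
REPEAT_LIMIT = 3             # PermanentFixRequired threshold


@dataclass(frozen=True)
class RepairAttempt:
    alertname: str
    finished_at: datetime
    succeeded: bool


def evaluate_flywheel(attempts: list[RepairAttempt], now: datetime) -> list[str]:
    """Return the alertnames that would fire for the given repair history."""
    recent = [a for a in attempts if now - a.finished_at <= WINDOW]
    fired = []
    if recent:
        success_rate = sum(a.succeeded for a in recent) / len(recent)
        if success_rate < SUCCESS_RATE_FLOOR:
            fired.append("AutoRepairLowSuccessRate")
    # Count repairs per alertname inside the window (assumption of this sketch).
    repeats = Counter(a.alertname for a in recent)
    if any(count > REPEAT_LIMIT for count in repeats.values()):
        fired.append("PermanentFixRequired")
    return fired
```

An empty window fires nothing, which avoids a division by zero and a false alarm when no repairs were attempted.
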
---

## Alert Rule Priority Quick Reference

| Priority | Rule ID | alertname keywords | Action |
|----------|---------|--------------------|--------|
| 10 | `minio_down` | MinioDown, MinIODown | SSH restart |
| 20 | `alert_chain_broken` | AlertChainBroken_* | NO_ACTION |
| 30 | `postgresql_down` | PostgreSQLDown | kubectl restart |
| 35 | `redis_down` | RedisDown | kubectl restart |
| 40 | `openclaw_down` | OpenClawDown | kubectl restart |
| 50 | `pod_crash_looping` | KubePodCrashLooping | kubectl restart |
| 55 | `pod_not_ready` | KubePodNotReady | kubectl restart |
| 100 | `host_down` | HostDown | NO_ACTION |
| 110 | `host_out_of_memory` | HostOutOfMemory | NO_ACTION |
| 115 | `host_high_cpu` | HostHighCpuLoad | NO_ACTION |
| 120 | `disk_full` | HostOutOfDiskSpace | NO_ACTION |
| 125 | `gitea_down` | GiteaDown | NO_ACTION |
| 126 | `ssl_cert_expiring` | SSLCertExpiringSoon | NO_ACTION |
| 127 | `external_site_down` | MoWoooWorkDown | NO_ACTION |
| 999 | `generic_fallback` | * | NO_ACTION |

**Total**: 24 rules covering 56 alertnames

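To show how a priority table like this is typically evaluated, here is a minimal sketch. Two things are assumed rather than taken from the source: that a lower priority number is checked first (suggested by `generic_fallback` sitting at 999), and that keywords are shell-style glob patterns. The `Rule` class and `match_rule` function are hypothetical, and only four rows of the table are reproduced.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    priority: int
    rule_id: str
    patterns: tuple[str, ...]
    action: str


# A few rows from the table, deliberately listed out of order.
RULES = [
    Rule(999, "generic_fallback", ("*",), "NO_ACTION"),
    Rule(20, "alert_chain_broken", ("AlertChainBroken_*",), "NO_ACTION"),
    Rule(10, "minio_down", ("MinioDown", "MinIODown"), "SSH restart"),
    Rule(50, "pod_crash_looping", ("KubePodCrashLooping",), "kubectl restart"),
]


def match_rule(alertname: str, rules: list[Rule] = RULES) -> Rule:
    """Return the first matching rule, checking lower priority numbers first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if any(fnmatchcase(alertname, pattern) for pattern in rule.patterns):
            return rule
    raise LookupError(f"no rule matches {alertname!r}")
```

Because the `*` pattern at priority 999 matches everything, the `LookupError` branch is reachable only when the fallback rule is missing from the list.
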
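---

## SOP for Adding a New Alert Category

1. Check whether the alertname is already in `ALERTNAME_TO_TYPE`. If it is, you only need to add a rule to `alert_rules.yaml`.
2. Choose the `incident_type` (use one of the 16 existing categories; add a new one only when necessary).
3. Add the rule to `alert_rules.yaml`. `priority` must be below 999; values above 127 and below 999 are the safe range.
4. If the `incident_type` is new, add the dict entry to `alert_types.py`.
5. Update this document and `LOGBOOK.md`.
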
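Step 3 lends itself to an automated check. The sketch below enforces the safe range and, as an additional assumption not stated in the SOP, rejects a priority that an existing rule already uses. `validate_new_rule_priority` is a hypothetical helper.

```python
SAFE_MIN_EXCLUSIVE = 127  # highest priority used by the specific rules
SAFE_MAX_EXCLUSIVE = 999  # reserved for generic_fallback


def validate_new_rule_priority(priority: int, existing: set[int]) -> None:
    """Raise ValueError if a new rule's priority is outside the safe range."""
    if not SAFE_MIN_EXCLUSIVE < priority < SAFE_MAX_EXCLUSIVE:
        raise ValueError(
            f"priority {priority} must be greater than {SAFE_MIN_EXCLUSIVE} "
            f"and less than {SAFE_MAX_EXCLUSIVE}"
        )
    # Uniqueness is an assumption of this sketch, not a documented requirement.
    if priority in existing:
        raise ValueError(f"priority {priority} is already used by another rule")
```

A check like this could run in CI against `alert_rules.yaml` so that a new rule cannot shadow the existing ones or collide with the fallback.
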
---

*This document was created by Claude Sonnet 4.6 on 2026-04-14 (Taipei time), using `alert_types.py` (56 entries) and `alert_rules.yaml` (24 rules) as data sources.*