chore: end-of-day checkpoint (LOGBOOK + archive obsolete flywheel-alerts CRD)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
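2026-04-14 evening session wrap-up:
- LOGBOOK.md: record today's 6 commits + Mission C E2E verification + MASTER blueprint P0+P1 all green
- k8s/monitoring/flywheel-alerts.yaml: add 🔴 DEPRECATED marker (11/11 rules already migrated into ops/monitoring/alerts-unified.yml; this CRD file has no Operator to load it)

Backlog sweep inventory:
✅ C2 hasType4 hardcoded in the frontend (now wired to the real API)
✅ C3 WebSocket had no reconnect (exponential backoff + polling fallback)
✅ flywheel-alerts rewritten for the Docker deployment (already done; only the old file was left behind → archived in this commit)
✅ risk_level YAML-first logic (decision_manager:1663)
⏳ SMTP_USER/SMTP_PASSWORD still CHANGE_ME (needs credentials from the commander)
⏳ Various E2E verifications (need a real alert to fire)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>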
@@ -6,7 +6,74 @@

---

## 📍 Current status (2026-04-14: P0 documentation backfill complete, moat deployed e778e4d ✅)
## 📍 2026-04-14 evening: MASTER blueprint P0+P1 all green + live-fire E2E verification

**6 new commits today (cc42aa0 → dd0a778 → 0f48a50 CD)**:
- `cc42aa0`: GAP-A2 (3 alert rules: gitea/ssl/external_site) + GAP-A1 (`validate_kubectl_command` + 34 tests)
- `aae7c12`: GAP-C3 (KM extraction for SSH repairs + 18 tests)
- `43c9689`: 4 governance documents (Alert Taxonomy / AI Model Cards / Postmortem Template / On-Call Handbook)
- `dedd7c2`: BP-1 B.1 KM extraction quality refinement (`_write_execution_result_to_km` distinguishes automatic from manual execution + enriched metadata)
- `dd0a778`: GAP-B4 LLM timeout degradation ladder (inner layer 25s + NemoClaw 3s) 🔴 Tier 3 red zone
- `0f48a50`: CD deploy of dd0a778

**MASTER blueprint P0+P1 fully complete** (including items verified as already implemented: GAP-C2 retry, GAP-D1 trust feedback, GAP-A3 alert grouping)

**Live-fire E2E run (Mission C)**:
- `KubePodCrashLooping` via `/webhooks/alertmanager` → LLM (ollama, 1582t) → Playbook `high-cpu-restart` at 39% similarity → `incident_resolved_after_auto_repair` → Telegram msg 20723 → 1 KM entry (written via the `km_conversion_service` path)
- **Discovered the KM dual-path design** → created [feedback_km_dual_path_design.md](memory/feedback_km_dual_path_design.md)
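
For repeating a run like Mission C, a synthetic alert can be sent to the webhook by hand. The sketch below builds a minimal Alertmanager webhook body (format version 4) for `KubePodCrashLooping`. The alert name and the `/webhooks/alertmanager` path come from the log; the receiver name, label values, and the `base_url` argument are assumptions, and real Alertmanager payloads carry additional fields such as `groupKey` and `externalURL`.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alertmanager_payload(alertname: str, labels: dict[str, str]) -> dict:
    """Build a minimal Alertmanager webhook (version 4) body with one firing alert."""
    all_labels = {"alertname": alertname, **labels}
    alert = {
        "status": "firing",
        "labels": all_labels,
        "annotations": {"summary": f"{alertname} (synthetic E2E test)"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
        "endsAt": "0001-01-01T00:00:00Z",  # zero time: the alert is not resolved yet
    }
    return {
        "version": "4",
        "status": "firing",
        "receiver": "e2e-test",  # hypothetical receiver name
        "groupLabels": {"alertname": alertname},
        "commonLabels": all_labels,
        "commonAnnotations": alert["annotations"],
        "alerts": [alert],
    }


def post_alert(base_url: str, payload: dict) -> int:
    """POST the payload to <base_url>/webhooks/alertmanager and return the HTTP status."""
    req = urllib.request.Request(
        f"{base_url}/webhooks/alertmanager",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Building the payload is side-effect free, so it can be inspected or unit-tested before `post_alert` is pointed at a real environment.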

**Tests all green**: 152/152 tests passed

**Remaining backlog** (to be advanced tomorrow):
- GAP-D5 automatic report generation (requires APScheduler)
- Small backlog items in project_current_status.md (WebSocket reconnect, Blackbox E2E, flywheel-alerts.yaml Docker approach)

---

## 📍 Current status (2026-04-14 morning: aae7c12 ✅)

**Completed this session (Task 3.3)**:
- `approval_execution.py`: `_trigger_playbook_extraction` writes `approval.action → outcome.learning_notes`
- `playbook_service.py`: `_parse_ssh_command()` + SSH path in `_extract_repair_steps()` + `[SSH]` name prefix + ssh/host_layer tags
- `test_playbook_ssh_extraction.py`: 18 new tests (794 passed, 0 failed)
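
The SSH extraction path above can be illustrated with a small parser. This is a sketch under assumptions, not the code in `playbook_service.py`: `parse_ssh_command`, `SshCommand`, `to_repair_step`, and the dict shape are hypothetical names, and only the `[SSH]` name prefix and the `ssh`/`host_layer` tags are taken from the log.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class SshCommand:
    user: Optional[str]
    host: str
    remote_command: str


# ssh options that consume the following token as their value (a subset).
_OPTS_WITH_VALUE = {"-i", "-p", "-o", "-l", "-F", "-J"}


def parse_ssh_command(action: str) -> Optional[SshCommand]:
    """Split 'ssh [opts] [user@]host cmd...' into parts; None if it is not an ssh call."""
    try:
        tokens = shlex.split(action)
    except ValueError:  # e.g. unbalanced quotes
        return None
    if not tokens or tokens[0] != "ssh":
        return None
    i = 1
    while i < len(tokens) and tokens[i].startswith("-"):
        i += 2 if tokens[i] in _OPTS_WITH_VALUE else 1
    if i >= len(tokens):
        return None  # options only, no destination
    user, _, host = tokens[i].rpartition("@")
    return SshCommand(user or None, host, " ".join(tokens[i + 1:]))


def to_repair_step(cmd: SshCommand) -> dict:
    """Shape a parsed command like the repair step described in the log (hypothetical dict)."""
    return {"name": f"[SSH] {cmd.remote_command}", "tags": ["ssh", "host_layer"]}
```

For example, `parse_ssh_command("ssh -p 2222 ops@db-01 'systemctl restart nginx'")` yields user `ops`, host `db-01`, and the remote command `systemctl restart nginx`, while a non-ssh action such as `kubectl get pods` returns `None`.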

**Both hands of the flywheel aligned**:
- kubectl path: `decision_chain.reasoning_steps → KM` ✅ (existing)
- SSH path: `approval.action → learning_notes → SSH RepairStep → KM` ✅ (new in Task 3.3)

**Remaining (documentation only)**:

| Document | Path | Status |
|------|------|------|
| Alert taxonomy catalog (16 categories) | docs/reference/ALERT-TAXONOMY-CATALOG.md | To do |
| AI Model Card (5 models) | docs/ai/AI-MODEL-CARDS.md | To do |
| Postmortem template | docs/templates/POSTMORTEM-TEMPLATE.md | To do |
| On-call Handbook | docs/operations/ON-CALL-HANDBOOK.md | To do |

---

## 📍 Previous status (2026-04-14: Task 2.2+2.3 complete, cc42aa0 ✅)

**Added this session (Task 2.2 + Task 2.3)**:
- `alert_rules.yaml`: added 3 rule categories (gitea_down/ssl_cert_expiring/external_site_down), 24 rules in total
- `alert_rule_engine.py`: `validate_kubectl_command()` blocks delete pvc/namespace/drain/replicas=0/rm-rf/DROP TABLE/$() injection, integrated into `match_rule()`
- `test_alert_rule_engine_validation.py`: 34 new tests (776 passed, 0 failed)
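
The deny-list idea behind `validate_kubectl_command()` can be sketched as follows. The blocked categories mirror the ones named in the log; the function name `check_kubectl_command`, its return shape, and the exact regular expressions are assumptions and may differ from what `alert_rule_engine.py` actually does.

```python
import re

# One (pattern, label) pair per category named in the log; patterns are assumptions.
_BLOCKED_PATTERNS = [
    (re.compile(r"\bdelete\s+(pvc|persistentvolumeclaims?)\b", re.I), "delete pvc"),
    (re.compile(r"\bdelete\s+(ns|namespaces?)\b", re.I), "delete namespace"),
    (re.compile(r"\bdrain\b", re.I), "drain"),
    (re.compile(r"--replicas[=\s]+0\b"), "replicas=0"),
    (re.compile(r"\brm\s+-(?:[a-z]*r[a-z]*f|[a-z]*f[a-z]*r)", re.I), "rm -rf"),
    (re.compile(r"\bdrop\s+table\b", re.I), "DROP TABLE"),
    (re.compile(r"\$\("), "$() injection"),
]


def check_kubectl_command(command: str) -> tuple[bool, str]:
    """Return (ok, reason); ok is False as soon as one blocked pattern matches."""
    for pattern, label in _BLOCKED_PATTERNS:
        if pattern.search(command):
            return False, f"blocked: {label}"
    return True, "ok"
```

A deny-list like this is easy to test exhaustively, but it only catches the patterns it knows about; a caller such as `match_rule()` would still need to treat anything outside an expected command shape with care.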

**To do (remaining work list)**:

| Item | Type | Status |
|------|------|------|
| Task 3.3: KM extraction for SSH repairs (playbook_service.py) | Code | To do |
| `docs/reference/ALERT-TAXONOMY-CATALOG.md` | Docs | To do |
| `docs/ai/AI-MODEL-CARDS.md` | Docs | To do |
| `docs/templates/POSTMORTEM-TEMPLATE.md` | Docs | To do |
| `docs/operations/ON-CALL-HANDBOOK.md` | Docs | To do |
| Verify CD deployment of cc42aa0 | E2E | Monitoring |
| First daily report (08:00 Taipei) | E2E | Waiting |

---

## 📍 Previous status (2026-04-14: P0 documentation backfill complete, moat deployed e778e4d ✅)

**Added this session (2 P0 industry-standard documents)**:
- `docs/slo/SLO-SLI-DEFINITION.md`: 5 SLIs + SLO target table + Error Budget rules + milestones

@@ -1,12 +1,18 @@
# =============================================================================
# Flywheel health alert rules: ADR-074 M1
# 🔴 [ARCHIVED / DEPRECATED] Do not use this file
# =============================================================================
# Prometheus PrometheusRule CRD: flywheel self-monitoring alerts
# 2026-04-14 Claude Sonnet 4.6 (backlog sweep):
# - This file is in PrometheusRule CRD format and can only be loaded by the Prometheus Operator
# - AWOOOI's Prometheus, however, is deployed with Docker (188) and has no Operator
# - All 11 rules have been migrated into ops/monitoring/alerts-unified.yml (group: awoooi_flywheel_meta_alerts)
# - This file is kept for historical reference only; do not kubectl apply it
#
# Source of truth: ops/monitoring/alerts-unified.yml
# =============================================================================
# Historical info: flywheel health alert rules, ADR-074 M1
# Data source: /api/v1/stats/flywheel/metrics (awoooi-flywheel scrape job)
#
# Deploy: kubectl apply -f k8s/monitoring/flywheel-alerts.yaml
#
# 2026-04-12 ogt (ADR-074 M1)
# Created: 2026-04-12 ogt (ADR-074 M1)
# Archived: 2026-04-14 Claude Sonnet 4.6 (11/11 rules migrated into alerts-unified.yml)
# =============================================================================

apiVersion: monitoring.coreos.com/v1
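
The migration described in the file header works because a PrometheusRule object wraps the same `groups` list that a plain Prometheus rule file has at top level. A minimal sketch of that unwrapping follows, operating on already-parsed dicts so it needs no YAML library. `crd_to_rule_file` and the sample alert are hypothetical; only the group name `awoooi_flywheel_meta_alerts` comes from the header.

```python
def crd_to_rule_file(prometheus_rule: dict) -> dict:
    """Extract the plain rule-file body ({'groups': [...]}) from a PrometheusRule object."""
    if prometheus_rule.get("kind") != "PrometheusRule":
        raise ValueError("not a PrometheusRule object")
    groups = prometheus_rule.get("spec", {}).get("groups", [])
    return {"groups": groups}


# Hypothetical sample object; the alert name and expression are placeholders.
SAMPLE_CRD = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {"name": "flywheel-alerts"},
    "spec": {
        "groups": [
            {
                "name": "awoooi_flywheel_meta_alerts",
                "rules": [
                    {"alert": "ExampleFlywheelAlert", "expr": "up == 0", "for": "5m"}
                ],
            }
        ]
    },
}
```

Calling `crd_to_rule_file(SAMPLE_CRD)` returns a dict whose top-level key is `groups`, which is the shape a Docker-deployed Prometheus reads through `rule_files` without any Operator.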