- CI 894 finished, f5e33da2 deployed - flywheel outcome field fix confirmed - telegram _send_request fix confirmed (zero AttributeError) - Sweeper: 20/20 incidents from the last 48h carry a complete sweeper_done marker - E2E chain: full 7-node flow confirmed (36 incidents) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
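# LOGBOOK - AWOOOI Progress Trail

> **Purpose**: Progress tracking for AI agents, to prevent gaps between sessions
> **Rule**: Append one entry after completing each significant milestone
> **History**: Older entries have been compressed; see git log for details

---
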
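## 📍 2026-04-16 - E2E full-node verification + chained production bug fixes

### Background

On its first start, the Sweeper flooded Telegram with all 117 historical incidents (the oldest was 7 days old).
Users reported that "every alert message looks the same" (all were degraded to confidence=20%).

### Root cause chain

1. **Sweeper key bug**: it checked `decision:INC-*` (which does not exist) and never set a done marker → every round treated the incident as unanalyzed
2. **CAST SQL bug**: `decision_chain = :dc::json` → asyncpg syntax error → learning records could not be written to the DB
3. **Missing age filter**: every historical incident was triggered at once on startup → Telegram flood
4. **shadow_mode stuck**: the ConfigMap had been changed to false, but the Pod was created before the update → it loaded the old value true
5. **flywheel stats bug**: the `incidents.outcomes` column does not exist (it should be `outcome`) → the stats/summary API returned 500
6. **Telegram method bug**: `_make_request` does not exist (the correct method name is `_send_request`) → the push failed after analysis finished
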
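Root causes 1 and 3 can be illustrated together. The following is a minimal sketch, not the project's Sweeper code: it assumes a dict-like store in place of the real backend, and the helper names `done_key` and `select_for_analysis` are hypothetical. Only the `sweeper_done:INC-*` key format and the 48h cutoff come from this log.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=48)  # cutoff taken from the "48h" filter in this log


def done_key(incident_id: str) -> str:
    """Build the marker key in the `sweeper_done:INC-*` format."""
    return f"sweeper_done:{incident_id}"


def select_for_analysis(incidents, store, now):
    """Return IDs that still need analysis, marking each selected one as done.

    `incidents` is an iterable of (incident_id, created_at) pairs and `store`
    is any dict-like key/value store; both are stand-ins for the real backend.
    """
    selected = []
    for incident_id, created_at in incidents:
        if now - created_at > MAX_AGE:
            continue  # too old: skip instead of replaying history
        key = done_key(incident_id)
        if key in store:
            continue  # already swept in an earlier round
        store[key] = "1"  # set the marker so the next round skips it
        selected.append(incident_id)
    return selected


now = datetime(2026, 4, 16, 2, 0, tzinfo=timezone.utc)
incidents = [
    ("INC-1", now - timedelta(days=7)),   # historical, filtered out
    ("INC-2", now - timedelta(hours=3)),  # recent, analyzed exactly once
]
store = {}
first_round = select_for_analysis(incidents, store, now)
second_round = select_for_analysis(incidents, store, now)
print(first_round, second_round)
```

The second call returns nothing because the marker written in the first round is the same key the sweeper checks, which is the property the original `decision:INC-*` lookup lacked.
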
### Fix commits

| Commit | Fix | Status |
|--------|-----|--------|
| `20b3fef` | sweeper key format: `sweeper_done:INC-*` marker | ✅ Deployed |
| `0760315` | CAST SQL + shadow_mode=false | ✅ Deployed |
| `9bfa6fc` | sweeper filter for incidents older than 48h | ✅ Deployed |
| `1e86cc2` | flywheel `outcome` column | ✅ Deployed in `f5e33da2` |
| `f5e33da` | telegram `_send_request` method name | ✅ Deployed in `f5e33da2` |
| `381be78` | chore(cd): deploy f5e33da | ✅ CD complete |

### E2E verification result (final confirmation on `f5e33da2`, 2026-04-16 02:17 Taipei)

All 12 nodes passed verification; the E2E chain is fully connected:

```
Alert received → Incident created → Sweeper triggers analysis
  → decision_analyzing → evidence_snapshot_saved → investigator_done
  → agent_debate_start → agent_debate_done → agent_session_recorded
  → telegram_decision_pushed
```

All 36 incidents went through the full 7-node flow. Zero `AttributeError`, zero sweeper floods.

### Other improvements

- KM: 16 missing embeddings backfilled (867/867 now have vectors)
- AIOPS_P4_SHADOW_MODE=false is in effect (rollout restart + confirmed on the new pod)
- proactive_inspector: shadow_mode=false, anomalies=0 (system healthy)

---

## 📍 2026-04-15 late night - AI autonomy flywheel Phase 4-6 all complete + fully enabled in production 🎉

### Phase 4: anomaly detection upgrade (commit 14a0226, ADR-084)

| Deliverable | Path |
|-------------|------|
| TrendPredictor | `services/trend_predictor.py`: statsmodels ARIMA trend prediction |
| ProactiveInspector | `services/proactive_inspector.py`: proactive inspection (four layers, L1-L4) |
| 8D sensing upgrade | `services/pre_decision_investigator.py`: enriched anomaly_context |

### Phase 5: remediation abstraction (commit 655d1a5, ADR-086)

| Deliverable | Path |
|-------------|------|
| BlastRadiusCalculator | `services/blast_radius_calculator.py`: CRITICAL/HIGH/MEDIUM/LOW tiered control |
| DeclarativeRemediation | `services/declarative_remediation.py`: staged rollout, dry-run → apply |
| GitOpsPRService | `services/gitops_pr_service.py`: automatically generates PRs with IaC fixes |
| RollbackManager | `services/rollback_manager.py`: automatic rollback strategy |
| DecisionManager wiring | `AIOPS_P5_BLAST_RADIUS_CHECK` gate guard |

### Phase 6: self-governance loop (commits 05b7743 + 77a92eb)

| Deliverable | Path |
|-------------|------|
| AiSloCalculator | `services/ai_slo_calculator.py`: SLO calculator |
| TrustDriftDetector | `services/trust_drift_detector.py`: trust drift detection |
| KbRotCleaner | `jobs/kb_rot_cleaner.py`: knowledge base rot cleanup job |
| Self-degradation engine | wired into `services/decision_manager.py` |
| SLO REST API | `api/v1/ai_slo.py`: GET /api/v1/ai/slo |
| OfflineReplayService | `services/offline_replay_service.py`: offline replay validation |
| ModelRollbackService | `services/model_rollback_service.py`: model rollback mechanism |
| DB table | `db/models.py` AiGovernanceEvent + 3 indexes |

### P1-P6 all enabled (commit 76558a3)

```
AIOPS_P1_ENABLED=True ... AIOPS_P6_ENABLED=True (all)
Nemotron wired in + offline replay loop started
```

### Production patches (after full enablement, 2026-04-15 late night)

| Commit | Fix |
|--------|-----|
| `85c4e3b` | Root cause of every KM write being "unknown" (three nodes: alertname/affected_services/category) |
| `ecfb714` | Connected the core break between the YAML rule engine and the auto-execution path |
| `3696fb5` | host_resource wrongly issuing K8s kubectl + duplicate auto-execution storm |
| `67f4370` | Four fatal production bugs (outcome write / OpenClaw / Telegram / LLM rule display) |
| `256a24e` | Added the drain3/statsmodels dependencies + warmup skips old data |
| `c05bac6` | Playbook seed tuple unpack + text[]→jsonb migration |
| `da871fc` | Completed the AIOps P1/P2/P6 migration SQL (already applied in prod) |

### Technical debt (next sprint)

- `send_notification()` is not yet private (a raw text bypass is possible)
- `approval_repository.py:find_by_fingerprint()` has no TTL

---

## 📍 2026-04-15 - AI autonomy flywheel Phase 0: guardrails established

### Completed items

| Deliverable | Path | Notes |
|-------------|------|-------|
| MASTER v2 blueprint | `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md` | §0-§8 all filled in, 1456 lines, complete plan for 7 Phases |
| ADR-080 | `docs/adr/ADR-080-ai-autonomy-flywheel-overview.md` | 7 Phases + 4 North Stars + 7 architect Review Gates |
| Feature Flags | `apps/api/src/core/feature_flags.py` | P1~P6 all False + 15 fine-grained sub-flags |
| Jobs module | `apps/api/src/jobs/__init__.py` | Jobs directory initialization |
| Baseline snapshot job | `apps/api/src/jobs/baseline_snapshot.py` | Captures the current state of 6 key metrics before the flywheel starts |
| HARD_RULES v1.9 | `docs/HARD_RULES.md` | Added the iron rule on Phase exit criteria |

### Phase 0 baseline values (to be filled in after baseline_snapshot runs)

| Metric | Current (estimated) | Phase 6 target |
|--------|---------------------|----------------|
| MCP calls/24h | 0 | > 0 |
| Playbook avg_confidence | ~0.3 (static) | dynamic EWMA |
| Learning loop trigger rate | 0% | ≥ 99% |
| Share of `general` alerts | ~41% | < 10% |
| Share of RESTART remediations | ~68% | < 40% |
| Successful auto-executions/24h | 0 | > 0 |

### Next steps

- Commander reviews ADR-080 + MASTER v2 → Phase 1 starts after approval
- Phase 1: PreDecisionInvestigator + MCP ToolRegistry + EvidenceSnapshot + PostExecutionVerifier
- Run `python -m src.jobs.baseline_snapshot` to capture real baseline numbers

---

## 📍 2026-04-15 - AI autonomy flywheel Phase 1: sensing depth established

### Deliverables (ADR-081)

| Deliverable | Path | Notes |
|-------------|------|-------|
| DB Model | `apps/api/src/db/models.py` | IncidentEvidence table (8D sensing + pre/post-execution state + verification result) |
| EvidenceSnapshot | `apps/api/src/services/evidence_snapshot.py` | Immutable snapshot; build_summary() assembles the LLM context |
| SanitizationService | `apps/api/src/services/sanitization_service.py` | Prompt Injection 0-tolerance (12 patterns) + sensitive-term masking |
| MCPToolRegistry | `apps/api/src/services/mcp_tool_registry.py` | Dynamic tool registry; suggest_tools() does not hard-code alert types |
| PreDecisionInvestigator | `apps/api/src/services/pre_decision_investigator.py` | 8D parallel sensing collection, P99 < 8s, 30s Redis cache |
| PostExecutionVerifier | `apps/api/src/services/post_execution_verifier.py` | Waits for K8s convergence after execution + three-state evaluation (success/degraded/failed) |
| decision_manager wiring | `apps/api/src/services/decision_manager.py` | Guarded by the AIOPS_P1_PRE_DECISION_INVESTIGATOR flag |
| approval_execution wiring | `apps/api/src/services/approval_execution.py` | AIOPS_P1_POST_EXECUTION_VERIFIER, fire-and-forget |

### Test coverage

| Test file | Count |
|-----------|-------|
| test_sanitization_service.py | 28 |
| test_mcp_tool_registry.py | 33 |
| test_pre_decision_investigator.py | 28 |
| test_post_execution_verifier.py | 22 |
| **Total** | **111 new (Phase 1), all 130 passing** |

### Gate 1 fixes (4 items)

1. `evidence_snapshot.py`: rowcount < 1 → warning log (previously a silent zero-row update)
2. `post_execution_verifier.py`: removed the bare `"error"` failure signal (prevents false matches on the error_rate key)
3. `pre_decision_investigator.py`: D4/D5/D7/D8 now apply sanitize_dict_values (Prompt Injection 0-tolerance)
4. `feature_flags.py`: documented that flags are only reloaded after a Pod restart

### Next steps

- ~~Phase 2: 5 Agent skeleton + Orchestrator + AgentSession DB~~ → **✅ Done (commit d316221)**

---

## 📍 2026-04-15 late night - AI autonomy flywheel Phase 3: learning loop rebuild complete

### Deliverables (ADR-083, commit 7da64ea → Gitea)

| Deliverable | Path | Notes |
|-------------|------|-------|
| fire-and-forget fix | `services/approval_execution.py` | `create_task` → `await asyncio.wait_for(timeout=30)` in 2 places (success + failure paths) |
| matched_playbook_id field | `models/approval.py` | Added to `ApprovalRequestBase`; populated on the auto_execute path |
| _auto_execute propagation | `services/decision_manager.py` | `token.proposal_data.get("playbook_id")` → `ApprovalRequest.matched_playbook_id` |
| Dual-path lookup | `services/learning_service.py` | `matched_playbook_id` + `metadata` fallback |
| trust_score field | `models/playbook.py` | Added `trust_score: float = 0.3` (EWMA dynamic trust) |
| 2x EWMA update | `repositories/playbook_repository.py` | Success α=0.1, failure α=0.2; trust < 0.1 → warning |
| Evolver Agent | `services/playbook_evolver.py` | Low-trust archiving + dormancy archiving + Jaccard-similarity merging (new file) |
| ADR-083 | `docs/adr/ADR-083-learning-loop-reconstruction.md` | Decision record for the learning loop rebuild |
| MASTER §8 | `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md` | Phase 3 completion appended |

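The fire-and-forget fix in the first row can be sketched as follows. This is an illustration under assumptions, not the code in `approval_execution.py`: `run_learning_hook`, `fast_hook` and `stuck_hook` are hypothetical names, and only the `asyncio.wait_for` call with a 30s limit comes from the table.

```python
import asyncio

LEARNING_TIMEOUT_S = 30.0  # the 30s limit named in the table


async def run_learning_hook(hook, timeout=LEARNING_TIMEOUT_S):
    """Await `hook()` with a deadline instead of scheduling it and moving on."""
    try:
        return await asyncio.wait_for(hook(), timeout=timeout)
    except asyncio.TimeoutError:
        return "learning_timeout"  # give up rather than block the caller forever


async def fast_hook():
    """Stand-in for a learning write that finishes quickly."""
    await asyncio.sleep(0.01)
    return "learned"


async def stuck_hook():
    """Stand-in for a learning write that hangs."""
    await asyncio.sleep(5)
    return "never reached"


async def demo():
    ok = await run_learning_hook(fast_hook)
    late = await run_learning_hook(stuck_hook, timeout=0.05)
    return ok, late


ok, late = asyncio.run(demo())
print(ok, late)
```

Awaiting the hook keeps a strong reference to it until it finishes, while the timeout bounds how long a hung learning write can delay the caller.
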
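### Root cause vs. fix

| Root cause | Before | After |
|------------|--------|-------|
| Learning trigger rate | 0% (could be cancelled by GC at any time) | ≈100% (await + 30s circuit breaker) |
| Playbook EWMA | Stuck at 0.3 forever | Updated dynamically after every execution |
| Negative penalty | None | Failures decay 2x faster (α=0.2) |
| Knowledge base management | No retirement mechanism | Evolver automatically archives low-trust entries |
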
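The asymmetric EWMA can be worked through numerically. The sketch below assumes the standard form `trust = (1 - α) * trust + α * outcome` with outcome 1.0 for success and 0.0 for failure; the log gives only the two α values, the 0.3 starting point and the 0.1 warning threshold, so the formula and the function name `update_trust` are assumptions.

```python
SUCCESS_ALPHA = 0.1   # from the log: success α=0.1
FAILURE_ALPHA = 0.2   # from the log: failure α=0.2 (twice the weight)
WARN_BELOW = 0.1      # from the log: trust < 0.1 → warning
INITIAL_TRUST = 0.3   # from the log: default trust_score


def update_trust(trust: float, succeeded: bool) -> float:
    """One EWMA step (assumed form); failures move the score twice as fast."""
    alpha = SUCCESS_ALPHA if succeeded else FAILURE_ALPHA
    outcome = 1.0 if succeeded else 0.0
    return (1 - alpha) * trust + alpha * outcome


after_success = update_trust(INITIAL_TRUST, True)
after_failure = update_trust(INITIAL_TRUST, False)

# Count consecutive failures until a fresh playbook drops below the threshold.
trust, failures_until_warning = INITIAL_TRUST, 0
while trust >= WARN_BELOW:
    trust = update_trust(trust, False)
    failures_until_warning += 1

print(round(after_success, 4), round(after_failure, 4), failures_until_warning)
```

Under these assumptions one success lifts a fresh playbook from 0.3 to 0.37, one failure drops it to 0.24, and five consecutive failures take it below the warning threshold.
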
### Architecture status

```
AIOPS_P3_ENABLED=False (default): skeleton in place, to be enabled after Commander approval
AIOPS_P3_EVOLVER_ENABLED=False: the Evolver scheduled job awaits Commander approval
Learning path: ApprovalRequest.matched_playbook_id → learning_service → playbook_repository.update_stats(EWMA)
```

### Next steps

- Gate 3 architecture review (Chief Architect reviews Phase 3)
- E2E verification after setting `AIOPS_P3_ENABLED=True`
- Phase 4 anomaly detection upgrade (depends on Phase 3 being stable)

---

## 📍 2026-04-15 late night - AI autonomy flywheel Phase 2: multi-agent collaboration skeleton live

### Deliverables (ADR-082, commit d316221)

| Deliverable | Path | Notes |
|-------------|------|-------|
| Protocol type system | `apps/api/src/agents/protocol.py` | Data contract shared by the 5 Agents (dataclass, immutable) |
| DiagnosticianAgent | `apps/api/src/agents/diagnostician_agent.py` | RCA detective; confidence < 0.4 → ABSTAIN |
| SolverAgent | `apps/api/src/agents/solver_agent.py` | Remediation strategist; blast_radius scoring + degraded-mode mock |
| ReviewerAgent | `apps/api/src/agents/reviewer_agent.py` | Safety review; static HARD_RULES regex + blast_radius threshold |
| CriticAgent | `apps/api/src/agents/critic_agent.py` | Deliberate devil's advocate; forces a 3-question critique; critical → REJECT |
| CoordinatorAgent | `apps/api/src/agents/coordinator_agent.py` | Pure rule-based aggregation (no LLM), 6-level decision gate |
| AgentOrchestrator | `apps/api/src/services/agent_orchestrator.py` | 30s global timeout, Reviewer‖Critic in parallel, DB + Redis Streams |
| DecisionManager wiring | `apps/api/src/services/decision_manager.py` | `is_phase_enabled(2)` gate + `_package_to_proposal_data` bridge |
| AgentSession DB table | `apps/api/src/db/models.py` | Immutable Event Sourcing, 4 composite indexes |
| ADR-082 | `docs/adr/ADR-082-multi-agent-collaboration.md` | Architecture decision record |

### Gate 2 fixes (7 items)

| Severity | Issue | Fix location |
|----------|-------|--------------|
| CRITICAL | DELETE FROM regex lookahead was in the wrong position: it blocked safe statements and let dangerous ones through | reviewer_agent.py:58 |
| CRITICAL | REQUEST_REVISION could reach auto-execute (a proposal the Solver has not revised must not run) | coordinator_agent.py |
| IMPORTANT | `_extract_json` flat regex does not support nested JSON; LLM parsing failed silently for every Agent | base.py:167 |
| IMPORTANT | `all_degraded` omitted `verdict.degraded`, so a tripped Reviewer circuit breaker went unnoticed | coordinator_agent.py |
| IMPORTANT | Solver ABSTAIN guard let degraded hypotheses through (confidence=0.2 still triggered the LLM) | solver_agent.py:72 |
| IMPORTANT | `dataclasses.asdict()` keeps Enum instances; every DB audit write failed silently | agent_orchestrator.py |
| IMPORTANT | P2 gate read the attribute directly, bypassing the parent Phase guard (should use `is_phase_enabled(2)`) | decision_manager.py |

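The nested-JSON issue in the table is the kind of problem a real parser handles better than a flat pattern. The following sketch shows one way to pull the first JSON object out of free-form LLM text with the standard library's `json.JSONDecoder.raw_decode`; it is not the fix applied in `base.py`, and the function name `extract_json` and the sample reply are hypothetical.

```python
import json


def extract_json(text: str):
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one complete value starting at `start`,
            # so nested braces are handled by the JSON parser itself.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)  # not valid here, try the next brace
    return None


reply = 'Analysis done. {"verdict": "APPROVE", "risk": {"blast_radius": "LOW", "score": 0.2}} Thanks.'
parsed = extract_json(reply)
print(parsed["risk"]["blast_radius"])
```

Returning `None` on failure leaves the decision to the caller, so a parse failure can be logged instead of passing silently.
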
### Architecture status

```
AIOPS_P2_ENABLED=False (default): skeleton in place, to be enabled after Commander approval
Execution path: EvidenceSnapshot → Diagnostician → Solver → (Reviewer‖Critic) → Coordinator → DecisionPackage
Global timeout: 30s; per Agent: 5s; continues after degradation (does not block the Coordinator)
```

### Next steps

- Phase 2 tests: `test_agent_protocol.py` / `test_agent_orchestrator.py` / unit tests for each Agent
- Or move on to Phase 3 (learning loop rebuild) on the Commander's instruction

---

## 📍 2026-04-14 midnight - Phase 5: category buttons completed and fully live

**Sprint 5.0 → 5.4 all complete**, 26 commits shipped:

| Sprint | Output | Commit |
|--------|--------|--------|
| 5.0 Spec | callback_action_spec.yaml (24 actions) | `2e2f5a1` |
| 5.1 Dispatch framework | TelegramGateway._dispatch_category_action | `581b244` |
| 5.2 MCP integration | dispatcher uses the real MCP registry + internal + graceful | `208c28e` |
| 5.3 Write actions + audit | Step 1.9 nonce routing + Multi-Sig guard | `de8bbd8` |
| 5.4 Dynamic buttons | `_build_inline_keyboard` generated from the registry | `a92562d` |

**Bug A/B deep dive**:

- Bug B: the hard-coded LLM timeout of 120s/130s is properly fixed in `36754a8` (openclaw.py now uses OPENCLAW_TIMEOUT=30s)
- Bug A: approval.incident_id NULL, diagnostic logging added (waiting for live-fire to catch the real cause)

**Buttons brought back to life**:

- The original 28 dead buttons (wrong callback format + 0 handlers) have been removed
- New dynamic buttons: generated from yaml, the spec decides the format (nonce/info), the MCP dispatcher actually executes
- Full audit log + reply_to the original card

---

## 📍 2026-04-14 late-night wrap-up - GAP-A4 ends 8.3h of flywheel silence + technical debt handled

**Culprit found**: GAP-A4, rule templates failed to resolve placeholders

- Logs showed many `auto_execute_blocked_unresolved_placeholder` entries
- target fell back to alertname / unknown / IP:port → garbage kubectl commands
- The GAP-A1 injection guard did its job and blocked them → the auto-remediation path jammed → the flywheel went silent

**Fix `10b74af`** (three layers of protection):

1. `_strip_pod_suffix()`: three formats, Deployment/StatefulSet/Legacy pod
2. `_is_bad_target()`: garbage detection (empty/unknown/alertname/IP:port/contains whitespace)
3. `_extract_vars()`: multi-level label lookup (deployment > app > statefulset > pod > container)
4. `match_rule()`: double post-validation (bad target + leftover placeholder)

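The first two helpers can be sketched as below. This is a hypothetical reconstruction from the one-line descriptions above, not the code in commit `10b74af`: the three name shapes, the suffix alphabet and the function signatures are all assumptions.

```python
import re

# Assumed alphabet of Kubernetes-generated suffixes (no vowels, no 0/1/3).
_SAFE = "bcdfghjklmnpqrstvwxz2456789"

# Assumed name shapes, tried in this order:
#   Deployment pod:  <name>-<8-10 char hash>-<5 char suffix>
#   StatefulSet pod: <name>-<ordinal>
#   Legacy pod:      <name>-<5 char suffix>
_PATTERNS = (
    re.compile(rf"^(.+)-[{_SAFE}]{{8,10}}-[{_SAFE}]{{5}}$"),
    re.compile(r"^(.+)-\d+$"),
    re.compile(rf"^(.+)-[{_SAFE}]{{5}}$"),
)
_IP_PORT = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$")


def strip_pod_suffix(pod: str) -> str:
    """Reduce a pod name to its workload name; unknown shapes pass through."""
    for pattern in _PATTERNS:
        match = pattern.match(pod)
        if match:
            return match.group(1)
    return pod


def is_bad_target(target: str, alertname: str = "") -> bool:
    """True for empty, 'unknown', alertname, IP[:port] or whitespace targets."""
    if not target or any(ch.isspace() for ch in target):
        return True
    if target.lower() == "unknown" or (alertname and target == alertname):
        return True
    return bool(_IP_PORT.match(target))


print(strip_pod_suffix("api-7d9f8c6b5d-x2k4q"), strip_pod_suffix("postgres-0"))
print(is_bad_target("10.0.0.5:9100"), is_bad_target("api"))
```

Restricting the suffix pattern to the assumed alphabet keeps ordinary names such as `redis-cache` intact, which a looser `[a-z0-9]{5}` pattern would truncate.
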
**Tests**: 33 new GAP-A4 tests + 214/214 regression tests all green

**Technical debt handled**:

- ✅ report_generation retry mechanism (3 attempts with exponential backoff + degraded-mode notification on failure) `next commit`
- 🟡 DEFER: QueryBuilder abstraction (YAGNI, only 1 place uses a JSON path query)
- ✅ E2E tests (GAP-A4 TestMatchRuleRejection covers the full flow + Mission C verified in prod)

---

## 📍 2026-04-14 late night - MASTER blueprint: all 11/11 Tasks complete 🏆

**Closure documents**:

- [docs/reports/2026-04-14-MASTER-BLUEPRINT-CLOSURE.md](reports/2026-04-14-MASTER-BLUEPRINT-CLOSURE.md)
- [docs/adr/ADR-077-master-blueprint-completion.md](adr/ADR-077-master-blueprint-completion.md)

**Full list of today's 9 commits**:

`cc42aa0 → aae7c12 → 43c9689 → dedd7c2 → dd0a778 → 0f48a50 → b8b124c → 8de807c → f54dea4`

**Phase 4 (automated reporting system) complete**:

- Task 4.1 daily inspection report: `run_daily_report_loop` is running (`daily_report_next_in: 48993s`, tomorrow 08:00 Taipei)
- Task 4.2 automatic Postmortem assembly: `incident_service.resolve_incident` hook, `8de807c`
- f54dea4 fixes a DB field bug (ApprovalRequestRecord/PlaybookRecord → the actual IncidentRecord.outcome + Redis playbook_service)

**Architecture review**: CONDITIONAL PASS (decision_manager Tier 3 review recorded in ADR-077)

**Communication channel**: everything goes through Telegram; SMTP is not needed

---

## 📍 2026-04-14 evening - MASTER blueprint P0+P1 all green + live-fire E2E verification

**6 new commits today (cc42aa0 → dd0a778 → 0f48a50 CD)**:

- `cc42aa0`: GAP-A2 (3 alert rules: gitea/ssl/external_site) + GAP-A1 (validate_kubectl_command + 34 tests)
- `aae7c12`: GAP-C3 (KM extraction for SSH remediation + 18 tests)
- `43c9689`: 4 governance documents (Alert Taxonomy / AI Model Cards / Postmortem Template / On-Call Handbook)
- `dedd7c2`: BP-1 B.1 KM extraction quality refinement (`_write_execution_result_to_km` distinguishes automatic from manual + enriched metadata)
- `dd0a778`: GAP-B4 LLM timeout degradation ladder (inner 25s + NemoClaw 3s) 🔴 Tier 3 red zone
- `0f48a50`: CD deploy dd0a778

**MASTER blueprint P0+P1 all complete** (including items verified as already implemented: GAP-C2 retry, GAP-D1 trust feedback, GAP-A3 alert grouping)

**Live-fire E2E (Mission C)**:

- `KubePodCrashLooping` via `/webhooks/alertmanager` → LLM (ollama, 1582t) → Playbook `high-cpu-restart` at 39% similarity → `incident_resolved_after_auto_repair` → Telegram msg 20723 → 1 KM entry (written via the `km_conversion_service` path)
- **Discovered the KM dual-path design** → created [feedback_km_dual_path_design.md](memory/feedback_km_dual_path_design.md)

**Tests all green**: 152/152 tests passed

**Remaining backlog** (to continue tomorrow):

- GAP-D5 automatic report generation (needs APScheduler)
- Small backlog in project_current_status.md (WebSocket reconnect, Blackbox E2E, running flywheel-alerts.yaml via Docker)

---

## 📍 Current status (2026-04-14 morning, aae7c12 ✅)

**Completed this session (Task 3.3)**:

- `approval_execution.py`: `_trigger_playbook_extraction` writes `approval.action → outcome.learning_notes`
- `playbook_service.py`: `_parse_ssh_command()` + SSH path in `_extract_repair_steps()` + `[SSH]` name prefix + ssh/host_layer tags
- `test_playbook_ssh_extraction.py`: 18 new tests (794 passed, 0 failed)

**Both hands of the flywheel aligned**:

- kubectl path: `decision_chain.reasoning_steps → KM` ✅ (existing)
- SSH path: `approval.action → learning_notes → SSH RepairStep → KM` ✅ (added in Task 3.3)

**Remaining (documentation only)**:

| Document | Path | Status |
|----------|------|--------|
| Alert taxonomy catalog (16 categories) | docs/reference/ALERT-TAXONOMY-CATALOG.md | To do |
| AI Model Card (5 models) | docs/ai/AI-MODEL-CARDS.md | To do |
| Postmortem template | docs/templates/POSTMORTEM-TEMPLATE.md | To do |
| On-call Handbook | docs/operations/ON-CALL-HANDBOOK.md | To do |

---

## 📍 Previous status (2026-04-14, Task 2.2+2.3 complete, cc42aa0 ✅)

**Added this session (Task 2.2 + Task 2.3)**:

- `alert_rules.yaml`: added 3 rule categories (gitea_down/ssl_cert_expiring/external_site_down), 24 rules in total
- `alert_rule_engine.py`: `validate_kubectl_command()` blocks delete pvc/namespace/drain/replicas=0/rm-rf/DROP TABLE/$() injection, and is integrated into `match_rule()`
- `test_alert_rule_engine_validation.py`: 34 new tests (776 passed, 0 failed)

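A deny-list validator of the kind described above might look like the sketch below. The patterns are hypothetical, modelled only on the categories named in this entry; the real rules in `alert_rule_engine.py` are not shown in this log, and the `(ok, reason)` return shape is an assumption.

```python
import re

# Hypothetical deny-list; each entry pairs a pattern with a human-readable reason.
_BLOCKED = [
    (re.compile(r"\bdelete\s+(pvc|namespace|ns)\b", re.I), "delete pvc/namespace"),
    (re.compile(r"\bdrain\b", re.I), "drain"),
    (re.compile(r"--replicas[=\s]+0\b"), "replicas=0"),
    (re.compile(r"\brm\s+-(rf|fr)\b", re.I), "rm -rf"),
    (re.compile(r"\bdrop\s+table\b", re.I), "DROP TABLE"),
    (re.compile(r"\$\("), "$() command substitution"),
]


def validate_kubectl_command(command: str):
    """Return (ok, reason); ok is False as soon as one blocked pattern matches."""
    for pattern, reason in _BLOCKED:
        if pattern.search(command):
            return False, reason
    return True, ""


safe = validate_kubectl_command("kubectl rollout restart deployment/api -n prod")
scaled_to_zero = validate_kubectl_command("kubectl scale deploy/api --replicas=0")
injected = validate_kubectl_command('kubectl exec api -- sh -c "echo $(whoami)"')
print(safe, scaled_to_zero, injected)
```

Returning the reason alongside the verdict lets the caller log why a command was rejected.
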
**To do (remaining work)**:

| Item | Type | Status |
|------|------|--------|
| Task 3.3: KM extraction for SSH remediation (playbook_service.py) | Code | To do |
| `docs/reference/ALERT-TAXONOMY-CATALOG.md` | Docs | To do |
| `docs/ai/AI-MODEL-CARDS.md` | Docs | To do |
| `docs/templates/POSTMORTEM-TEMPLATE.md` | Docs | To do |
| `docs/operations/ON-CALL-HANDBOOK.md` | Docs | To do |
| Verify the CD deployment of cc42aa0 | E2E | Monitoring |
| First daily report (08:00 Taipei) | E2E | Waiting |

---

## 📍 Previous status (2026-04-14, P0 documents backfilled, moat deployed e778e4d ✅)

**Added this session (2 P0 industry-standard documents)**:

- `docs/slo/SLO-SLI-DEFINITION.md`: 5 SLIs + SLO target table + Error Budget rules + milestones
- `docs/operations/HUMAN-IN-THE-LOOP.md`: 9 trigger conditions + Kill Switch + fail-safe timeout behavior + SOP

**Moat status**: the yardstick (SLO) and the brake (HITL) are both in place, complementing the grouping/retry/report code from 684d6cf

**Monitoring**: waiting for tomorrow's 08:00 Taipei daily report push to verify 684d6cf E2E

**Next steps** (in priority order):

1. Wait for the CD deployment and observe E2E
2. Task 2.2: add 3 rule categories to alert_rules.yaml (storage/devops_tool/external_site)
3. Task 2.3: kubectl injection protection in alert_rule_engine.py
4. Task 3.3: KM extraction for SSH remediation

---

## 📍 Previous status (2026-04-14, all four Tactic B Tasks complete, 675 tests ✅)

**Added this session (4 Tasks, 6 files, 75 new tests)**:

- `feat(adr-076): Task 2`, `alert_grouping_service.py`: alert grouping engine with a 5-minute sliding window + 16 tests
- `feat(adr-076): Task 3`, `approval_execution.py`: retry on execution failure (MAX_RETRY=2, 30s, transient/permanent classification) + 29 tests
- `feat(adr-076): Task 4`, `report_generation_service.py`: daily inspection report (08:00 Taipei) + Postmortem + 30 tests
- `webhooks.py`: ADR-076 grouping logic integrated (after fingerprinting, before the LLM)
- `main.py`: daily report loop attached to lifespan

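A 5-minute sliding-window grouper can be sketched as below. This is a generic illustration, not the logic in `alert_grouping_service.py`: the class name `AlertGrouper`, the group key and the use of plain second-based timestamps are assumptions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # the 5-minute window named above


class AlertGrouper:
    """Counts alerts per group key inside a sliding time window."""

    def __init__(self, window: float = WINDOW_SECONDS):
        self.window = window
        self._seen = defaultdict(deque)  # group key -> timestamps still in the window

    def add(self, group_key: str, timestamp: float) -> int:
        """Record one alert and return the group's size within the window."""
        times = self._seen[group_key]
        # Evict timestamps that have slid out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        times.append(timestamp)
        return len(times)


grouper = AlertGrouper()
sizes = [grouper.add("prod/api", t) for t in (0, 120, 400, 1000)]
print(sizes)
```

A caller could treat a returned size of 1 as the start of a new group and anything larger as a repeat to fold into the existing group; that policy is a design choice this sketch leaves to the caller.
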
**Tests**: 600 → 675 passed (+75), 10 skipped, 0 failed

**Next step**: git push gitea main → verify the Pod deployment → observe E2E

---

## 📍 Previous status (2026-04-14, MASTER AIOps Blueprint complete, awaiting Commander approval)

**Added this session (no commits, documentation work only)**:

- `docs/superpowers/plans/2026-04-14-MASTER-aiops-full-automation-blueprint.md`: master plan v1.0 consolidating 4 planning documents
- Memory: `aiops_current_architecture_diagnosis.md`: full architecture diagnosis report

**Flywheel status**: Pod 38ff2bb, flywheel 83% complete, 4 Phases to be implemented after approval

**Industry-standard documentation gaps** (identified, not yet created): SLO/SLI, AI Model Card, Human-in-Loop Spec, Alert Taxonomy Catalog, Configuration Reference

**Next step**: start Phase 1 implementation once the Commander approves the MASTER plan

---

## 📍 Previous status (2026-04-14, flywheel bug fixes complete, fully deployed 38ff2bb ✅)

**Fixed this session (6 commits, all deployed, Pod running 38ff2bb)**:

- `38ff2bb` heartbeat → ADR-075 TYPE-1 format (INFO tree structure)
- `f1face4` standalone HostHighCpuLoad rule → NO_ACTION (stops `kubectl scale unknown`)
- `1a4b52e` fingerprint now includes alertname to prevent cross-alert fingerprint collisions + heartbeat classification added
- `b17a677` gitea webhook analysis.model_dump() dict bug
- `0c88f67` DIAGNOSE forced to deepseek-r1:14b (not gemma3:4b)
- `09134f5` incident.title bug + DIAGNOSE→NEMOTRON confidence=0.0 fix

**Flywheel status**: spec layers one through four all complete, ADR-075 all complete, and the extra fixes from this session are in

**Next step**: observe auto-remediation E2E, or continue with ADR-075 Phase 3 (Prometheus rules)

---

## 📍 Previous status (2026-04-12 late night, ADR-075 Phase 1+2+CR all complete, git push gitea main ✅)

**ADR-075 fully complete** (3 commits: 2cef209 → 561c1d8 → 1cb654c):

Phase 1 (4 breakpoints fixed):

- ✅ Breakpoint A: decision_manager extracts alert_category → send_approval_card
- ✅ Breakpoint B: send_approval_card gains a new parameter → _build_inline_keyboard
- ✅ Breakpoint C: interactive notifications (TYPE-3/4/4D/8M) must not be sent to the SRE group
- ✅ Breakpoint D: k8s_workload → kubernetes + button groups for 6 new categories
- ✅ classify_alert_early: 13 rules, 7 new categories, 52 tests

Phase 2 (TYPE-8M):

- ✅ send_meta_alert() ⚙️ META SYSTEM card
- ✅ decision_manager TYPE-8M elif branch

CR fixes:

- ✅ P0-2: NotificationType.TYPE_8M added to the enum + early return in classify_notification
- ✅ P1-1: removed the dead code _CATEGORY_BUTTONS
- ✅ P1-4: test docstring updated to 13 rules
- 664 tests all pass

**Next step**: ADR-075 Phase 3 (Prometheus rules), or evaluate the next ADR

---

## 📍 Previous status (2026-04-12, layers three+four all complete, CD push in progress)

**System status**: ADR-073 Phase 1-4 ✅ | ADR-074 M1-M5 ✅ | ADR-073-C C1-C4 ✅ | git push gitea main ✅

**Phase 1 checklist**:

- ✅ 1-1~1-3: Harbor confirmed + kustomization→105998d + ArgoCD sync
- ✅ 1-4: Pod image 105998d verified
- ✅ 1-5: `_collect_mcp_context` present in the Pod
- ✅ 1-6: debounce 5→30 min
- ✅ 1-7: alertname NULL root cause fixed (signals JSONB alias)
- ✅ 1-8: cold_start_playbooks.py, Playbooks 0→15
- ✅ 1-9: batch_vectorize_km.py, 711/713 KM entries vectorized

**Architecture fixes**:

- ArgoCD ignoreDifferences removed (image update path repaired)
- B5 CI break-glass (TODO: restore on 2026-04-13)

**Phase 2 checklist**:

- ✅ 2-1: DB migration, alertname column added (already run in the Pod)
- ✅ 2-2: classify_alert_early(), 6 rules: config_drift/info/backup/infra/k8s/db/general
- ✅ 2-3: outcome write point in _try_auto_repair_background()
- ✅ 2-4: notification_type/alert_category written in create_incident_for_approval()
- ✅ 2-5: 134 rows with alertname NULL → 0, backfill complete

**Next step**: Phase 3 (Tier 3, to run after Chief Architect authorization)

---

## Milestone summary (compressed)

| Date | Milestone | Commit |
|------|-----------|--------|
| 2026-04-08 | Sprint 5.1 data-safety guardrails complete (Guardrail BLOCK/HITL/MultiSig) | 88696db/0f5fecf |
| 2026-04-10 | ADR-068 flywheel closed-loop E2E acceptance (HostHighCpuLoad→PB-20260406) | - |
| 2026-04-10 | ADR-067 all five Ollama applications complete (Phase 30-34) | - |
| 2026-04-10 | B5 CI integration tests: 640 passed | - |
| 2026-04-11 | ADR-070 fully automated AIOps loop (MCP, 10 providers) | a2cc985 |
| 2026-04-11 | ADR-071 A-J: 10 Telegram notification card types | 1ec1965 |
| 2026-04-11 | ADR-072 bug fixes P0-P2 all complete | f34fe19 |
| 2026-04-11 | MCP Security Code Review P0/P1/P2 all patched | f323633 |
| 2026-04-11 | ADR-069 infrastructure rebuild Sprint A/B/C all complete | - |
| 2026-04-11 | D1 models.json v1.3.0 (9 purpose keys) | f2c18c4 |
| 2026-04-11 | M3 alertname_to_type extracted into constants | 1ede9f9 |
| 2026-04-11 | I1 ADR-064 Rule Engine get_incident_type() integration | 615822d |
| 2026-04-11 | ArgoCD MCP connection fix (IP 120:30443) | f23176c |
| 2026-04-11 | Chief Architect CR Round 1: get_incident_type rule.id bug + 11 tests | d77b2ad |
| 2026-04-11 | ADR-070 three major full-automation fixes (auto_approve/MCP context/target) | c439277 |
| 2026-04-11 | Chief Architect CR Round 2: Tier 3 ternary/timeout/DESTRUCTIVE_PATTERNS + 25 tests | 8be87b0 |
| 2026-04-12 | SSH MCP 188/110 connectivity verified, authorized_keys confirmed | 796517f |
| 2026-04-12 | fix(test): integration tests excluded (pytest addopts), all 618 unit tests pass | 6e0ee8b |
| 2026-04-12 | refactor: I2 DI for MCP Providers + config list bug + model_regression integration | 184b37a/a67a27f |
| 2026-04-12 | docs(spec): AIOps flywheel full-repair integration spec v1.0 (four-layer plan + monitoring gaps + ADR-074) | 77771c1 |
| 2026-04-12 | docs(adr): ADR-073 adds the ADR-071 integration work order + ADR-074 Sprint | f2b427d |
| 2026-04-12 | docs(spec): v2.2 §15 Subsystem 1 four-stage roadmap (finalized from screenshot) | d3ddaaf |
| 2026-04-12 | docs(spec): spec v2.0, four-stage detailed implementation steps + anti-drift rules (awaiting approval) | - |
| 2026-04-12 | fix(flywheel): Phase 1, kustomization→8be87b0 + debounce 30min + alertname NULL | 7c4b36c |
| 2026-04-12 | fix(ci): Break-Glass, B5 flaky PG test bypass, unblocks the P0 flywheel deployment | 105998d |
| 2026-04-12 | fix(argocd): removed ignoreDifferences image, repairs the broken GitOps image update path | - |
| 2026-04-12 | feat(flywheel): Phase 1 complete, 15 Playbooks cold-started ✅, 711/713 KM entries vectorized ✅ | - |
| 2026-04-12 | feat(flywheel): Phase 2 complete, classify_alert_early + outcome write + 134-row backfill | 54efa30/97fc49a |
| 2026-04-12 | feat(flywheel): Phase 3 complete, seven major Tier 3 fixes (authorized by the Chief Architect) | dbc77c5 |
| 2026-04-12 | feat(flywheel): Phase 4 complete, KM conversion hook + daily vectorize cron | f2fc471 |
| 2026-04-12 | feat(adr-074): M1 flywheel Prometheus Exporter + M2 host network monitoring | 16d6823 |
| 2026-04-12 | fix(cr): ADR-073 CR P0/P1/P2 all patched + B5 CI zero-collection fix | e770813/c09521a |
| 2026-04-12 | feat(m3-m5): ADR-074 M3 CI Webhook + M4 backup verification + M5 detailed alerts | 3489e05~ec6a341 |
| 2026-04-12 | feat(c2-c4): ADR-073-C frontend flywheel KPI + WebSocket + intervention path visualization | 4b51f9b~9b1812c |

---

## ADR-073 full flywheel inventory (2026-04-12)

| Item | Finding |
|------|---------|
| Playbooks | **0**: the flywheel was never cold-started |
| EXECUTION_SUCCESS | **2/380 (0.5%)**: auto-remediation almost never succeeded |
| KM vectorized | **all False (699 entries)**: RAG cannot query historical cases |
| alertname in DB | **all NULL**: signals JSONB structure problem |
| debounce window | **5 minutes** (should be 30 minutes): the same problem kept recreating Incidents |
| 8be87b0 deployment | ❌ CD failed, never went live: the three major fixes are not in effect |
| ADR-073 | Written to `docs/adr/ADR-073-flywheel-full-audit.md`; implementation awaits Commander approval |

---

## 2026-04-15 late night - P0 alert silence root-caused and fixed + Phase 6 self-governance loop wrapped up

### P0 alert silence RCA

| Root cause | Impact | commit |
|------------|--------|--------|
| `approval_db.py`: PENDING records had no TTL (zombie records with hit_count=77/30/17) | Telegram went completely silent | fab65e7 |
| `create_approval_with_fingerprint()` set expires_at=NULL | The auto-expiry logic was effectively a no-op | f31b4e3 |
| `openclaw.py:897`: DIAGNOSE had require_local=True (not synced with v4.3) | Every DIAGNOSE failed silently with privacy_skip | 3ce5025 |

Emergency action: expired the 7 zombie ApprovalRecord rows directly via kubectl

### P2 flywheel broken link + asyncpg CrashLoop fixes

| Fix | commit |
|-----|--------|
| New `approval_timeout_resolver.py`: overdue PENDING → EXPIRED + resolve_incident | f045506 |
| `anomaly_counter.py` + `incident_service.py`: timeout_ignored disposition | f045506 |
| `db/base.py`: split the multi-statement CREATE INDEX for asyncpg | f9ba200 |

### Phase 6 self-governance loop: all complete

| Component | File |
|-----------|------|
| AI SLO calculator | `services/ai_slo_calculator.py` |
| Trust Drift detector | `services/trust_drift_detector.py` |
| KB Rot cleanup job | `jobs/kb_rot_cleaner.py` |
| Self-degradation engine | `services/decision_manager.py` |
| SLO REST API | `api/v1/ai_slo.py` (GET /api/v1/ai/slo) |
| DB table + migration | `db/models.py` AiGovernanceEvent + 3 indexes |

Incidental fixes: heartbeat disabled (now forwarded to another group); ai_router.py notifications switched to the ADR-075 format

**Next step:** make `send_notification` private (seal off the raw text bypass)

---

## Known technical debt (to evaluate next sprint)

| Item | Notes |
|------|-------|
| NetworkPolicy ClusterIP 10.43.16.201/32 | Must be updated if ArgoCD is reinstalled |
| `_collect_mcp_context` instantiates Providers directly | ✅ Converted to DI (I2) in 184b37a |
| B3 Phase 15.5 Trace Context UI | Deferred by Commander decision |
| A-3 bitan Dockerization | P3, low priority |
| `approval_repository.py:find_by_fingerprint()` has no TTL | Not a hot path; latent bug, to fix in the next refactor |
| `send_notification()` is not private | Any caller can bypass the format; next PR |

---

## 2026-04-15 late night (Taipei) - Phase 3 learning loop fully landed

### Phase 3 root cause fixes complete

| Fix | commit |
|-----|--------|
| Root cause 3: verification result → learning + diagnosis feedback + knowledge forgetting + fine-tune pipeline | fb1bbd0 |
| AgentSession learning wiring: record_agent_session() + orchestrator debate signals | 66c4eda |
| Evolver loop scheduling + POST /api/v1/learning/evolver/run drill endpoint | 4718c76 |
| Evolver force=True bypass flag + import cleanup | 01fb531 e5e94f5 |

### All new Phase 3 components

| Component | File |
|-----------|------|
| Root cause 3 wiring | `services/approval_execution.py` → `record_verification_result()` |
| Verification/diagnosis/AgentSession learning | three new methods in `services/learning_service.py` |
| Knowledge forgetting job | `jobs/knowledge_decay_job.py` (daily, 30d purge) |
| Fine-tune pipeline | `services/finetune_exporter.py` (weekly Alpaca JSONL) |
| Evolver daily loop | `services/playbook_evolver.py:run_evolver_loop()` |
| Evolver drill endpoint | `api/v1/learning.py:POST /learning/evolver/run` |

### Phase 3 exit criteria

- [x] Root causes 1/2/3 all fixed
- [x] 2x EWMA + Evolver + diagnosis feedback
- [x] AgentSession learning wiring
- [x] Knowledge forgetting + fine-tune pipeline
- [x] Evolver drill endpoint deployed
- [x] 1 successful Evolver drill (HTTP 200 errors:[] ✅ 2026-04-15 21:20 Taipei)
- [ ] 7 days of production monitoring (trust_score updates, JSONL accumulation, null rate)

**Next step:** 7-day production observation for the Phase 3 exit criteria (checkpoint 2026-04-22)