docs: Phase 18 Chief Architect review + ADR-043
All checks were successful
E2E Health Check / e2e-health (push) Successful in 24s
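2026-03-31 Claude Code (Chief Architect) review results:
- Initial review: 91/100 OUTSTANDING (conditional pass)
- After P0 fixes: 95/100 OUTSTANDING

ADR-043: FailureWatcher and DecisionManager coordination
- Priority rule: DecisionManager > FailureWatcher
- Resource locking mechanism (Redis)
- Approval conflict check
- Source tagging for auditing

P1 improvements pending (Phase 19):
- Unify the DI injection pattern
- Integration test coverage
- Trace ID logging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>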
@@ -14,7 +14,7 @@
 | **Phase 22.4 naming cleanup** | ✅ **Done** (legacy ClawBot files removed) |
 | **P0-1 CD Secrets injection** | ✅ **Done** (mandated by ADR-035) |
 | **P0-2 NVIDIA model fix** | ✅ **Done** (nemotron-mini-4b) |
-| **Phase 18 automated failure repair** | ✅ **All done!** (18.1-18.6 + 40 tests) |
+| **Phase 18 automated failure repair** | ✅ **OUTSTANDING** (95/100 + P0 fix `138a56a`) |
 | **Phase 21 periodic reports** | ✅ **All done!** |
 | **Phase 21.1 Daily E2E** | ✅ **Done** (daily at 00:00 Taipei time) |
 | **Phase 21.2 K3s Report** | ✅ **Done** (daily at 09:00 Taipei time) |
@@ -0,0 +1,120 @@
# ADR-043: FailureWatcher and DecisionManager Coordination

> **Status**: Approved
> **Decision date**: 2026-03-31
> **Decision maker**: Claude Code (Chief Architect)
> **Related**: Phase 18 automated failure-repair closed loop

## Background

Phase 18 introduced FailureWatcher, an automated repair mechanism, while Phase 6.5 already provides DecisionManager for Incident decisions.
The two can compete for the same resource:

- FailureWatcher: execution failure → automatic repair
- DecisionManager: Incident alert → AI decision → proposal

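## Problem Statement

The same resource (for example `deployment/api`) may at the same time be:
1. Targeted by a FailureWatcher automatic repair (triggered by an execution failure)
2. Targeted by a DecisionManager repair proposal (triggered by an Incident alert)

This leads to:
- Duplicate operations (a double restart)
- Resource contention (concurrent modification of the same Deployment)
- Confusing logs (no way to trace which component triggered the repair)
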
## Decision

### Priority Rule

```
DecisionManager proposal > FailureWatcher automatic repair
```

**Rationale**:
- DecisionManager proposals are evaluated by the Trust Engine
- DecisionManager has a complete Approval flow
- FailureWatcher is a "remedial" mechanism and should not override an "active decision"

### Implementation

#### 1. Resource Locking (Redis)

```python
# Check performed before FailureWatcher acts (fragment from inside an async function)
lock_key = f"awoooi:resource_lock:{namespace}:{target_resource}"
existing_lock = await redis.get(lock_key)
if existing_lock:
    # Already locked by DecisionManager: skip the automatic repair
    logger.info("resource_locked_by_decision_manager", resource=target_resource)
    return {"next_action": "deferred_to_decision_manager"}
```

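The fragment above relies on surrounding context (`redis`, `namespace`, `target_resource`). As a minimal, self-contained sketch of the same check, the example below wraps it in a function and runs it against an in-memory stand-in. `InMemoryRedis`, `resource_lock_key`, `check_resource_lock`, and `demo` are hypothetical names introduced only for illustration; only the key format and the `next_action` value come from the ADR.

```python
import asyncio
from typing import Optional


class InMemoryRedis:
    """Hypothetical stand-in exposing only the async get/set used here."""

    def __init__(self) -> None:
        self._data: dict = {}

    async def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value


def resource_lock_key(namespace: str, target_resource: str) -> str:
    # Key format taken from the ADR fragment.
    return f"awoooi:resource_lock:{namespace}:{target_resource}"


async def check_resource_lock(redis, namespace: str, target_resource: str) -> Optional[dict]:
    """Return a deferral result if the resource is locked, otherwise None."""
    existing_lock = await redis.get(resource_lock_key(namespace, target_resource))
    if existing_lock:
        return {"next_action": "deferred_to_decision_manager"}
    return None


async def demo() -> tuple:
    redis = InMemoryRedis()
    before = await check_resource_lock(redis, "prod", "deployment/api")
    # Simulate DecisionManager taking the lock (the stored value is an assumption).
    await redis.set(resource_lock_key("prod", "deployment/api"), "decision_manager")
    after = await check_resource_lock(redis, "prod", "deployment/api")
    return before, after
```

Returning `None` for "no lock" lets the caller fall through to the next check without inspecting a result dictionary.
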
#### 2. Approval Check

```python
# Before acting, FailureWatcher checks whether a pending approval exists
pending = await approval_repo.get_pending_for_resource(target_resource)
if pending:
    logger.info("pending_approval_exists", resource=target_resource)
    return {"next_action": "awaiting_existing_approval"}
```

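To make the approval check concrete, here is a small sketch that runs the same logic against an in-memory repository. The `Approval` shape, its `status` values, `InMemoryApprovalRepo`, and `check_pending_approval` are assumptions made for illustration; the ADR only names `approval_repo.get_pending_for_resource`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Approval:
    resource: str
    status: str = "pending"  # hypothetical values: "pending" | "approved" | "rejected"


@dataclass
class InMemoryApprovalRepo:
    """Hypothetical repository; only the method named in the ADR is provided."""

    approvals: list = field(default_factory=list)

    async def get_pending_for_resource(self, target_resource: str) -> list:
        return [
            a for a in self.approvals
            if a.resource == target_resource and a.status == "pending"
        ]


async def check_pending_approval(approval_repo, target_resource: str) -> Optional[dict]:
    """Return a deferral result if a pending approval exists, otherwise None."""
    pending = await approval_repo.get_pending_for_resource(target_resource)
    if pending:
        return {"next_action": "awaiting_existing_approval"}
    return None
```

In this sketch an already approved or rejected record does not block FailureWatcher; whether that matches the intended semantics is left open by the ADR.
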
#### 3. Source Tagging

```python
# AuditLog extension
authorization_channel: str  # "auto_repair" | "decision_manager" | "manual"
source_system: str  # "failure_watcher" | "decision_manager" | "api"
```

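One possible way to keep these two fields trustworthy for auditing is to validate them when a record is created. The sketch below assumes a standalone dataclass; `AuditLogEntry`, `resource`, and `action` are hypothetical, and the real AuditLog model may look quite different. Only the two tagged fields and their allowed values come from the ADR.

```python
from dataclasses import dataclass

# Allowed values as listed in the ADR comments.
AUTHORIZATION_CHANNELS = {"auto_repair", "decision_manager", "manual"}
SOURCE_SYSTEMS = {"failure_watcher", "decision_manager", "api"}


@dataclass(frozen=True)
class AuditLogEntry:
    resource: str  # hypothetical field, for illustration only
    action: str  # hypothetical field, for illustration only
    authorization_channel: str
    source_system: str

    def __post_init__(self) -> None:
        # Reject unknown tags early so audit queries can rely on a closed set.
        if self.authorization_channel not in AUTHORIZATION_CHANNELS:
            raise ValueError(f"unknown authorization_channel: {self.authorization_channel!r}")
        if self.source_system not in SOURCE_SYSTEMS:
            raise ValueError(f"unknown source_system: {self.source_system!r}")
```

For example, `AuditLogEntry("deployment/api", "restart", "auto_repair", "failure_watcher")` is accepted, while a `source_system` of `"cron"` raises `ValueError`.
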
### Coordination Flow

```
Incident / execution failure
    ↓
Check resource lock (Redis)
    ├─ Locked → skip, record as deferred
    └─ Not locked → continue
        ↓
Check pending Approval
    ├─ Exists → skip, wait for the existing flow
    └─ None → continue
        ↓
Act according to source:
    ├─ DecisionManager → create Approval (with lock)
    └─ FailureWatcher → automatic repair (LOW risk)
```

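The flow above can be sketched end to end as a single function. This is an illustration under stated assumptions, not the project's implementation: `FakeRedis`, `FakeApprovalRepo`, `coordinate`, the `create` method, the 600-second lock TTL, and the `approval_created` / `auto_repair` outcomes are all hypothetical. The fake `set` mimics the `nx` and `ex` keyword arguments of redis-py's `set` so that the lock is taken only if no one else holds it.

```python
import asyncio


class FakeRedis:
    """In-memory stand-in; set() mimics the nx/ex keywords of redis-py's set()."""

    def __init__(self):
        self._data = {}

    async def get(self, key):
        return self._data.get(key)

    async def set(self, key, value, ex=None, nx=False):
        if nx and key in self._data:
            return None  # key already held by someone else
        self._data[key] = value  # the TTL (ex) is ignored by this fake
        return True


class FakeApprovalRepo:
    def __init__(self):
        self._pending = {}

    async def get_pending_for_resource(self, resource):
        return self._pending.get(resource, [])

    async def create(self, resource):  # hypothetical method
        self._pending.setdefault(resource, []).append({"resource": resource})


async def coordinate(source, namespace, resource, redis, approval_repo, lock_ttl=600):
    lock_key = f"awoooi:resource_lock:{namespace}:{resource}"
    # Step 1: an existing lock means DecisionManager owns the resource.
    if await redis.get(lock_key):
        return {"next_action": "deferred_to_decision_manager"}
    # Step 2: a pending approval means an existing flow is already in progress.
    if await approval_repo.get_pending_for_resource(resource):
        return {"next_action": "awaiting_existing_approval"}
    # Step 3: act according to the source.
    if source == "decision_manager":
        if not await redis.set(lock_key, source, ex=lock_ttl, nx=True):
            return {"next_action": "deferred_to_decision_manager"}  # lost a race
        await approval_repo.create(resource)
        return {"next_action": "approval_created"}
    if source == "failure_watcher":
        return {"next_action": "auto_repair"}
    raise ValueError(f"unknown source: {source!r}")


async def demo():
    redis, repo = FakeRedis(), FakeApprovalRepo()
    args = ("prod", "deployment/api", redis, repo)
    first = await coordinate("failure_watcher", *args)
    second = await coordinate("decision_manager", *args)
    third = await coordinate("failure_watcher", *args)
    return first, second, third
```

In `demo`, FailureWatcher may repair while the resource is free, DecisionManager then takes the lock and opens an approval, and the next FailureWatcher attempt is deferred.
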
## Consequences

### Positive
- Avoids duplicate operations
- Makes the priority explicit
- Provides a complete audit trail

### Negative
- Adds a dependency on Redis
- Increases logic complexity

### Risks
- A degradation strategy is needed when Redis fails (conservatively skip the repair)

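The "conservatively skip" strategy can be read as failing closed: if the lock state cannot be determined, FailureWatcher does not repair. A minimal sketch, assuming the client raises an `OSError` subclass on failure; `BrokenRedis`, `EmptyRedis`, `check_lock_fail_closed`, and the `deferred_lock_state_unknown` outcome are hypothetical. With redis-py, the exception to catch would be `redis.exceptions.RedisError` instead.

```python
import asyncio
from typing import Optional


class BrokenRedis:
    """Hypothetical client whose every call fails, simulating a Redis outage."""

    async def get(self, key: str):
        raise ConnectionError("redis unavailable")


class EmptyRedis:
    """Hypothetical healthy client that holds no locks."""

    async def get(self, key: str):
        return None


async def check_lock_fail_closed(redis, lock_key: str) -> Optional[dict]:
    try:
        existing_lock = await redis.get(lock_key)
    except OSError:
        # Built-in ConnectionError and TimeoutError are OSError subclasses.
        # Lock state is unknown, so skip the automatic repair.
        return {"next_action": "deferred_lock_state_unknown"}
    if existing_lock:
        return {"next_action": "deferred_to_decision_manager"}
    return None  # safe to continue to the next check
```

The trade-off is that automatic repair pauses for the duration of a Redis outage, which is consistent with the priority rule but delays recovery of failed executions.
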
## Implementation Status

| Item | Status |
|------|--------|
| Resource locking mechanism | 🟡 P1, not yet implemented |
| Approval check | 🟡 P1, not yet implemented |
| Source tagging | ✅ `authorization_channel` already exists |

## Related Documents

- Phase 18: Automated failure-repair closed loop
- Phase 6.5: DecisionManager dual-track decisions
- ADR-040: Global auto-repair circuit breaker

---

**Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>**