2026-03-31 Claude Code (Chief Architect) review results: initial review 91/100 OUTSTANDING (conditional pass); after P0 fixes 95/100 OUTSTANDING. ADR-043: FailureWatcher and DecisionManager coordination: priority rule DecisionManager > FailureWatcher; resource locking (Redis); Approval conflict check; source tagging for audit. P1 improvements pending (Phase 19): unify the DI injection pattern; integration test coverage; Trace ID logging. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# ADR-043: Coordination Between FailureWatcher and DecisionManager

> **Status**: Approved
> **Decision date**: 2026-03-31
> **Decision maker**: Claude Code (Chief Architect)
> **Related**: Phase 18 automated failure-repair loop

## Background

Phase 18 introduced FailureWatcher, an automatic repair mechanism. Phase 6.5 already provides DecisionManager, which handles Incident decisions.
The two can compete for the same resource:

- FailureWatcher: execution failure → automatic repair
- DecisionManager: Incident alert → AI decision → proposal

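## Problem Statement

The same resource (e.g. `deployment/api`) may, at the same time:

1. Trigger an automatic repair by FailureWatcher (because of an execution failure)
2. Receive a repair proposal generated by DecisionManager (because of an Incident alert)

This leads to:

- Duplicate operations (a double restart)
- Resource contention (concurrent modification of the same Deployment)
- Confusing logs (no way to trace which system triggered the repair)
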
## Decision

### Priority Rule

```
DecisionManager proposal > FailureWatcher automatic repair
```

**Rationale**:
- DecisionManager proposals are evaluated by the Trust Engine
- DecisionManager has a complete Approval workflow
- FailureWatcher is a "remedial" mechanism and should not override an "active decision"

### Implementation

#### 1. Resource Locking (Redis)

```python
# Check performed by FailureWatcher before it executes a repair
lock_key = f"awoooi:resource_lock:{namespace}:{target_resource}"
existing_lock = await redis.get(lock_key)
if existing_lock:
    # DecisionManager already holds the lock; skip the automatic repair
    logger.info("resource_locked_by_decision_manager", resource=target_resource)
    return {"next_action": "deferred_to_decision_manager"}
```

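The check above covers only the reader side (FailureWatcher). As a minimal sketch of the writer side, assuming DecisionManager takes the lock with an atomic set-if-absent plus a TTL: `InMemoryLockStore`, `lock_key_for`, `acquire_resource_lock`, the `owner` values, and the 300-second TTL are all hypothetical and not part of this ADR. The store only mimics the `get` / `set(..., nx=True, ex=...)` subset of an async Redis client so the example runs without a server.

```python
import asyncio
import time
from typing import Optional


class InMemoryLockStore:
    """Hypothetical stand-in for an async Redis client (get / set with nx, ex)."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[str, float]] = {}

    async def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave as if the key were absent.
            del self._data[key]
            return None
        return value

    async def set(self, key: str, value: str, *, nx: bool = False, ex: int = 60) -> bool:
        if nx and await self.get(key) is not None:
            return False  # Another owner already holds the lock.
        self._data[key] = (value, time.monotonic() + ex)
        return True


def lock_key_for(namespace: str, target_resource: str) -> str:
    # Same key layout as the FailureWatcher check in this ADR.
    return f"awoooi:resource_lock:{namespace}:{target_resource}"


async def acquire_resource_lock(
    store, namespace: str, target_resource: str, owner: str, ttl_seconds: int = 300
) -> bool:
    # nx=True makes the write succeed only when no lock exists yet;
    # ex=ttl_seconds lets a stale lock expire if the owner crashes.
    key = lock_key_for(namespace, target_resource)
    return bool(await store.set(key, owner, nx=True, ex=ttl_seconds))


async def demo() -> dict:
    store = InMemoryLockStore()
    first = await acquire_resource_lock(store, "prod", "deployment/api", "decision_manager")
    second = await acquire_resource_lock(store, "prod", "deployment/api", "failure_watcher")
    holder = await store.get(lock_key_for("prod", "deployment/api"))
    return {"first": first, "second": second, "holder": holder}


result = asyncio.run(demo())
```

Storing the owner name as the lock value is one possible choice; it would let the FailureWatcher log line identify who holds the lock rather than assuming it is DecisionManager.
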
#### 2. Approval Check

```python
# Before executing, FailureWatcher checks for a pending approval on the resource
pending = await approval_repo.get_pending_for_resource(target_resource)
if pending:
    logger.info("pending_approval_exists", resource=target_resource)
    return {"next_action": "awaiting_existing_approval"}
```

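The ADR does not define `approval_repo`. The following is a rough sketch of what `get_pending_for_resource` might look like, assuming it returns a list that is empty when nothing is pending (so that `if pending:` behaves as intended). The `Approval` fields, the status strings, and `InMemoryApprovalRepo` are hypothetical illustrations, not the project's actual repository.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Approval:
    approval_id: str
    resource: str
    status: str  # hypothetical values: "pending", "approved", "rejected"


class InMemoryApprovalRepo:
    """Hypothetical in-memory repository used only for illustration."""

    def __init__(self, approvals: list[Approval]) -> None:
        self._approvals = approvals

    async def get_pending_for_resource(self, resource: str) -> list[Approval]:
        # An empty list is falsy, so callers can write `if pending:`.
        return [
            a for a in self._approvals
            if a.resource == resource and a.status == "pending"
        ]


repo = InMemoryApprovalRepo([
    Approval("a-1", "deployment/api", "pending"),
    Approval("a-2", "deployment/api", "approved"),
    Approval("a-3", "deployment/worker", "pending"),
])
pending_api = asyncio.run(repo.get_pending_for_resource("deployment/api"))
pending_web = asyncio.run(repo.get_pending_for_resource("deployment/web"))
```
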
#### 3. Source Tagging

```python
# AuditLog extension
authorization_channel: str  # "auto_repair" | "decision_manager" | "manual"
source_system: str  # "failure_watcher" | "decision_manager" | "api"
```

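The two fields above are typed as plain `str`, with the allowed values listed only in comments. One way to make those values checkable is sketched below, assuming the values in the comments are exhaustive. `AuditSource` is a hypothetical helper, not the project's `AuditLog` model.

```python
from dataclasses import dataclass
from typing import Literal, get_args

AuthorizationChannel = Literal["auto_repair", "decision_manager", "manual"]
SourceSystem = Literal["failure_watcher", "decision_manager", "api"]


@dataclass(frozen=True)
class AuditSource:
    """Hypothetical value object carrying the two audit fields."""

    authorization_channel: AuthorizationChannel
    source_system: SourceSystem

    def __post_init__(self) -> None:
        # Literal types are not enforced at runtime, so validate explicitly.
        if self.authorization_channel not in get_args(AuthorizationChannel):
            raise ValueError(f"unknown authorization_channel: {self.authorization_channel!r}")
        if self.source_system not in get_args(SourceSystem):
            raise ValueError(f"unknown source_system: {self.source_system!r}")


auto_repair_entry = AuditSource("auto_repair", "failure_watcher")

try:
    AuditSource("auto_repair", "cron")  # "cron" is not an allowed source
    rejected = False
except ValueError:
    rejected = True
```
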
### Coordination Flow

```
Incident / execution failure
    ↓
Check resource lock (Redis)
    ├─ Locked → skip, record as deferred
    └─ Not locked → continue
        ↓
Check Pending Approval
    ├─ Exists → skip, wait for the existing workflow
    └─ Does not exist → continue
        ↓
Act according to source:
    ├─ DecisionManager → create Approval (with lock)
    └─ FailureWatcher → automatic repair (LOW risk)
```

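The flow above can be sketched as a single decision function. This is an illustration under stated assumptions, not the project's implementation: `coordinate` and its callback parameters are hypothetical, and the `create_approval_with_lock` and `auto_repair_low_risk` labels are invented here. Only `deferred_to_decision_manager` and `awaiting_existing_approval` come from the snippets in this ADR.

```python
import asyncio
from typing import Awaitable, Callable


async def coordinate(
    source: str,
    is_locked: Callable[[], Awaitable[bool]],
    has_pending_approval: Callable[[], Awaitable[bool]],
) -> dict:
    # Step 1: a held resource lock always wins.
    if await is_locked():
        return {"next_action": "deferred_to_decision_manager"}
    # Step 2: an existing pending approval also blocks new work.
    if await has_pending_approval():
        return {"next_action": "awaiting_existing_approval"}
    # Step 3: act according to the source of the request.
    if source == "decision_manager":
        return {"next_action": "create_approval_with_lock"}  # hypothetical label
    if source == "failure_watcher":
        return {"next_action": "auto_repair_low_risk"}  # hypothetical label
    raise ValueError(f"unknown source: {source}")


async def _yes() -> bool:
    return True


async def _no() -> bool:
    return False


async def demo() -> list[str]:
    cases = [
        ("failure_watcher", _yes, _no),   # locked
        ("failure_watcher", _no, _yes),   # pending approval exists
        ("decision_manager", _no, _no),   # free, DecisionManager path
        ("failure_watcher", _no, _no),    # free, FailureWatcher path
    ]
    return [(await coordinate(s, locked, pending))["next_action"] for s, locked, pending in cases]


outcomes = asyncio.run(demo())
```

Note that checking the lock and then acting on the result are two separate steps here, so a real implementation would still need an atomic lock acquisition on the DecisionManager path to close the gap between them.
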
## Consequences

### Positive
- Avoids duplicate operations
- Makes the priority explicit
- Provides a complete audit trail

### Negative
- Adds a dependency on Redis
- Increases logic complexity

### Risks
- A fallback strategy is required when Redis fails (skip conservatively)

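A minimal sketch of the conservative fallback, reading "skip conservatively" as "skip the automatic repair when the lock state cannot be determined". `FlakyStore`, `check_lock_conservatively`, and the `deferred_lock_store_unavailable` and `proceed` labels are hypothetical. The example catches Python's built-in `ConnectionError` and `TimeoutError`; a real client usually raises its own exception types (for redis-py, subclasses of `redis.exceptions.RedisError`), which would need to be caught instead.

```python
import asyncio


class FlakyStore:
    """Hypothetical store whose reads always fail, simulating a Redis outage."""

    async def get(self, key: str):
        raise ConnectionError("lock store unavailable")


class EmptyStore:
    """Hypothetical healthy store that holds no locks."""

    async def get(self, key: str):
        return None


async def check_lock_conservatively(store, lock_key: str) -> dict:
    try:
        existing = await store.get(lock_key)
    except (ConnectionError, TimeoutError):
        # Fail closed: if the lock state is unknown, do not auto-repair.
        return {"next_action": "deferred_lock_store_unavailable"}
    if existing:
        return {"next_action": "deferred_to_decision_manager"}
    return {"next_action": "proceed"}


key = "awoooi:resource_lock:prod:deployment/api"
during_outage = asyncio.run(check_lock_conservatively(FlakyStore(), key))
when_healthy = asyncio.run(check_lock_conservatively(EmptyStore(), key))
```

Failing closed trades availability of automatic repair for safety: during a Redis outage nothing is repaired automatically, but no repair can collide with a DecisionManager proposal.
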
## Implementation Status

| Item | Status |
|------|------|
| Resource locking | 🟡 P1, not yet implemented |
| Approval check | 🟡 P1, not yet implemented |
| Source tagging | ✅ `authorization_channel` already exists |

## Related Documents

- Phase 18: Automated failure-repair loop
- Phase 6.5: DecisionManager dual-track decisions
- ADR-040: Global auto-repair circuit breaker

---

**Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>**