awoooi/docs/adr/ADR-066-approval-execution-loop-fix.md
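OG T 309fe04698 docs(adr066): approval-execution loop fix record (LOGBOOK + ADR-066 + Skill 02 update)
- LOGBOOK.md: add the 2026-04-09 approval-execution loop fix status block
- ADR-066: record the root-cause chain, the decisions, and the affected files
- Skills/02: v2.7 adds the hard rule for backfilling Nemotron tool → kubectl_command, plus lessons learned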

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:23:55 +08:00

# ADR-066: Approval-Execution Loop Fix (Nemotron Tool Call → kubectl_command Backfill)
**Status**: Approved and implemented (2026-04-09)
**Proposer**: Claude Sonnet 4.6
**Chief Architect**: ogt
---
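## Background
Since Phase 22 (Nemotron collaboration) went live, every operation approved through the "Approve" button in Telegram or the frontend has **never actually executed any K8s operation**. Database records show `approval_records.action = "未知操作 | "` (the literal means "Unknown operation | "), and nobody noticed for several months.
## Problem Chain Analysis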
```
1. Nemotron produces tool_call: restart_deployment(deployment_name=sentry)
2. openclaw.py stores it only in proposal["nemotron_tools"]
3. proposal["kubectl_command"] remains an empty string
4. proposal_service.py takes llm_proposal["action"] = action_title (a Chinese-language title)
5. approval_records.action = "重啟 sentry 服務" ("Restart sentry service") or "未知操作 | " ("Unknown operation | ")
6. execute_approved_action() calls parse_operation_from_action("未知操作 | ")
7. It returns ParsedOperation(None, None, ...) → execution is SKIPPED
8. The status shows "Approved" but nothing happened
```
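The failure mode in steps 6 to 8 can be reproduced with a toy parser. The sketch below is illustrative only: `toy_parse_operation`, the regular expression, and the two-field `ParsedOperation` are hypothetical stand-ins, not the project's real `parse_operation_from_action()`. It shows why a human-readable title silently falls through to "skip" while a concrete kubectl command does not.

```python
import re
from typing import NamedTuple, Optional


class ParsedOperation(NamedTuple):
    # Hypothetical two-field shape; the real ParsedOperation has more fields.
    verb: Optional[str]
    args: Optional[str]


# Illustrative pattern only: accepts the three command shapes used in this ADR.
_KUBECTL_RE = re.compile(r"^kubectl\s+(rollout restart|scale|delete)\s+(.+)$")


def toy_parse_operation(action: str) -> ParsedOperation:
    """Return (None, None) when the action is not a recognisable kubectl command."""
    match = _KUBECTL_RE.match(action.strip())
    if match is None:
        return ParsedOperation(None, None)
    return ParsedOperation(match.group(1), match.group(2))


def should_execute(action: str) -> bool:
    """Mirror step 7: an unparseable action is skipped rather than executed."""
    return toy_parse_operation(action).verb is not None
```

With this parser, both `"未知操作 | "` and `"重啟 sentry 服務"` yield `ParsedOperation(None, None)`, so the approval is recorded but no command runs.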
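**Additional problems**:
- Pressing the Telegram approve/reject button produced no response at all (no answer callback, no message update)
- The Telegram message_id was stored only in Redis with a 24h TTL; once it expired, the message could no longer be tracked
## Decisions
### D1: Backfill kubectl_command from the Nemotron tool call
After Nemotron (or the Gemini fallback) succeeds, `openclaw.py` converts the tool call into a kubectl command and backfills it: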
```python
# Map each supported tool call to its concrete kubectl command.
if _tool_name == "restart_deployment":
    proposal["kubectl_command"] = f"kubectl rollout restart deployment/{_deploy} -n {_ns}"
elif _tool_name == "delete_pod":
    proposal["kubectl_command"] = f"kubectl delete pod {_pod} -n {_ns}"
elif _tool_name == "scale_deployment":
    proposal["kubectl_command"] = f"kubectl scale deployment/{_deploy} --replicas={_replicas} -n {_ns}"
```
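For readers who want to try the mapping in isolation, here is a self-contained sketch of the same idea. Only the `deployment_name` argument appears in this ADR. The function name `tool_call_to_kubectl`, the argument keys `pod_name`, `namespace`, and `replicas`, the `default` namespace fallback, and the name validation are all assumptions added for illustration, not the code in `openclaw.py`.

```python
import re
from typing import Any, Mapping, Optional

# Kubernetes object names are lowercase alphanumerics plus '-' and '.'.
# Validating them is a precaution added in this sketch, not something the ADR states.
_K8S_NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$")


def _checked(name: Any) -> str:
    """Return the name unchanged, or raise if it could not be a Kubernetes object name."""
    text = str(name)
    if not _K8S_NAME_RE.match(text):
        raise ValueError(f"invalid Kubernetes name: {text!r}")
    return text


def tool_call_to_kubectl(tool_name: str, args: Mapping[str, Any]) -> Optional[str]:
    """Translate one tool call into a kubectl command, or None if the tool is unsupported."""
    # Assumed fallback: the ADR does not say how the namespace is derived.
    ns = _checked(args.get("namespace", "default"))

    if tool_name == "restart_deployment":
        return f"kubectl rollout restart deployment/{_checked(args['deployment_name'])} -n {ns}"
    if tool_name == "delete_pod":
        return f"kubectl delete pod {_checked(args['pod_name'])} -n {ns}"
    if tool_name == "scale_deployment":
        replicas = int(args["replicas"])  # int() rejects non-numeric replica counts
        return f"kubectl scale deployment/{_checked(args['deployment_name'])} --replicas={replicas} -n {ns}"
    return None  # unknown tool: leave kubectl_command empty rather than guess
```

Returning `None` for an unknown tool keeps the caller honest: it can leave `proposal["kubectl_command"]` empty and surface the gap, instead of storing a command that only looks executable.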
### D2: proposal_service prefers kubectl_command
```python
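# Prefer the concrete kubectl command; fall back to the human-readable title.
# `or ""` also covers a key that is present but set to None.
_kubectl = (llm_proposal.get("kubectl_command") or "").strip()
action = _kubectl or llm_proposal["action"]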
```
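### D3: Respond to Telegram immediately after approve/reject
Add `_notify_approval_result()`:
1. `editMessageReplyMarkup`: remove the approve/reject buttons and keep the info buttons
2. `sendMessage reply_to`: reply under the original message with a status line
3. fallback: `send_notification`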
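The first two steps can be sketched as pure payload builders for the Telegram Bot API methods `editMessageReplyMarkup` and `sendMessage`. This is a hedged illustration, not the body of `_notify_approval_result()`: the helper names and the `approve:` / `reject:` callback_data prefixes are assumptions, and the reply is expressed with the Bot API's `reply_parameters` field (older clients use `reply_to_message_id`). No HTTP call is made here.

```python
from typing import Any, Dict, List

Keyboard = List[List[Dict[str, Any]]]

# Assumed callback_data prefixes for the decision buttons (hypothetical).
DECISION_PREFIXES = ("approve:", "reject:")


def strip_decision_buttons(keyboard: Keyboard) -> Keyboard:
    """Drop approve/reject buttons, keep all other buttons, discard emptied rows."""
    kept_rows: Keyboard = []
    for row in keyboard:
        kept = [
            button for button in row
            if not str(button.get("callback_data", "")).startswith(DECISION_PREFIXES)
        ]
        if kept:
            kept_rows.append(kept)
    return kept_rows


def build_edit_markup_payload(chat_id: int, message_id: int, keyboard: Keyboard) -> Dict[str, Any]:
    """Payload for editMessageReplyMarkup: same message, decision buttons removed."""
    return {
        "chat_id": chat_id,
        "message_id": message_id,
        "reply_markup": {"inline_keyboard": strip_decision_buttons(keyboard)},
    }


def build_status_reply_payload(chat_id: int, message_id: int, status_line: str) -> Dict[str, Any]:
    """Payload for sendMessage that replies under the original approval message."""
    return {
        "chat_id": chat_id,
        "text": status_line,
        "reply_parameters": {"message_id": message_id},
    }
```

Keeping the builders free of network code makes the button-filtering rule easy to unit test; the caller would then post each payload and fall back to `send_notification` if either request fails.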
### D4: Persist message_id to the DB
`approval_records` gains two columns, `telegram_message_id` and `telegram_chat_id`. The `ALTER TABLE` has already been run. Message tracking no longer relies solely on the Redis TTL.
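As a rough illustration of the schema change and the write path, the sketch below uses Python's built-in `sqlite3` with an in-memory table. The column types, the `id` primary key, and the signature of `update_telegram_message()` are assumptions; the production database and the real method in `approval_db.py` may differ (on PostgreSQL, for example, a 64-bit integer type would be the natural choice for Telegram chat ids).

```python
import sqlite3


def add_telegram_columns(conn: sqlite3.Connection) -> None:
    """Apply the two ALTER TABLE statements described in D4 (types are assumed)."""
    conn.execute("ALTER TABLE approval_records ADD COLUMN telegram_message_id INTEGER")
    conn.execute("ALTER TABLE approval_records ADD COLUMN telegram_chat_id INTEGER")
    conn.commit()


def update_telegram_message(
    conn: sqlite3.Connection, approval_id: int, chat_id: int, message_id: int
) -> bool:
    """Persist the Telegram message location; return True if exactly one row changed."""
    cursor = conn.execute(
        "UPDATE approval_records "
        "SET telegram_chat_id = ?, telegram_message_id = ? "
        "WHERE id = ?",
        (chat_id, message_id, approval_id),
    )
    conn.commit()
    return cursor.rowcount == 1
```

Returning a boolean lets the caller log a warning when the approval row does not exist, rather than assuming the write succeeded.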
## Affected Files
| File | Change |
|------|--------|
| `apps/api/src/services/openclaw.py` | Nemotron/Gemini tool call → backfill kubectl_command |
| `apps/api/src/services/proposal_service.py` | action prefers kubectl_command |
| `apps/api/src/services/telegram_gateway.py` | _notify_approval_result + _execute_approval_action responses |
| `apps/api/src/services/approval_db.py` | New method update_telegram_message() |
| `apps/api/src/services/decision_manager.py` | When storing message_id, write to both Redis and the DB |
| `apps/api/src/db/models.py` | ApprovalRecord gains two columns |
## Commits
- `ae97808` fix(proposal): action prefers kubectl_command
- `2c7d5d0` fix(openclaw): backfill kubectl_command from the Nemotron tool call
- `1483218` feat(approval): respond to Telegram immediately after approve/reject + persist message_id
## Follow-up Actions
- Clean up 5 orphaned `OBSERVE` approvals (incident_id=NULL) by setting them to EXPIRED
- Verify the full execution loop for a new alert: Alertmanager → Incident → Nemotron → approval → approve → kubectl → Telegram notification
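## Lessons Learned
1. Do not assume that "Approved" in the UI means the backend executed anything
2. When adding a collaboration module, verify that its output actually propagates to the execution path
3. Button interactions need immediate feedback; otherwise users cannot tell whether the action succeeded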