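- LOGBOOK.md: add a status block for the 2026-04-09 approval-execution closed-loop fix
- ADR-066: record the root-cause chain, decisions, and affected files
- Skills/02: v2.7 adds the Nemotron tool → kubectl_command backfill hard rule and lessons learned

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
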
# ADR-066: Approval-Execution Loop Fix (Nemotron Tool Call → kubectl_command Backfill)

**Status**: Approved and implemented (2026-04-09)
**Proposer**: Claude Sonnet 4.6
**Chief Architect**: ogt

---

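## Background

Since Phase 22 (Nemotron collaboration) went live, no approval made by pressing the "Approve" button in Telegram or the frontend **has ever actually executed a K8s operation**. Database records show `approval_records.action = "未知操作 | "` (a literal placeholder meaning "unknown operation"), and nobody noticed for several months.
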
## Problem Chain Analysis

```
1. Nemotron emits tool_call: restart_deployment(deployment_name=sentry)
2. openclaw.py stores it only in proposal["nemotron_tools"]
3. proposal["kubectl_command"] remains an empty string
4. proposal_service.py takes llm_proposal["action"] = action_title (a Chinese-language title)
5. approval_records.action = "重啟 sentry 服務" ("Restart sentry service") or "未知操作 | " ("unknown operation")
6. execute_approved_action() calls parse_operation_from_action("未知操作 | ")
7. It returns ParsedOperation(None, None, ...) → execution is SKIPPED
8. Status shows "Approved" but nothing happened
```

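Steps 6 and 7 can be reproduced with a small sketch. The real `parse_operation_from_action` and `ParsedOperation` are not shown in this ADR, so the parser below, its regular expression, the field names `verb` and `target`, and the helper `would_execute` are all assumptions made purely for illustration.

```python
import re
from typing import NamedTuple, Optional


class ParsedOperation(NamedTuple):
    # Field names are hypothetical; the ADR only shows ParsedOperation(None, None, ...).
    verb: Optional[str]
    target: Optional[str]


# Hypothetical pattern: recognizes only strings that start with "kubectl".
_KUBECTL_RE = re.compile(
    r"^kubectl\s+(?P<verb>rollout restart|delete|scale)\s+(?P<target>\S+)"
)


def parse_operation_from_action(action: str) -> ParsedOperation:
    """Stand-in parser: anything that is not a kubectl command parses to (None, None)."""
    match = _KUBECTL_RE.match(action.strip())
    if match is None:
        return ParsedOperation(None, None)
    return ParsedOperation(match.group("verb"), match.group("target"))


def would_execute(action: str) -> bool:
    """Mirror of step 7: an operation with no verb is skipped."""
    return parse_operation_from_action(action).verb is not None
```

Under these assumptions, both the placeholder string and a human-readable title are skipped, while a kubectl command is not. A check of this kind in the test suite would likely have exposed the broken chain early.
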
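**Additional issues**:
- Pressing the Telegram Approve/Reject buttons produced no response at all (no answer callback, no message update)
- The Telegram message_id was stored only in Redis with a 24h TTL, so it could not be tracked after expiry
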
## Decisions

### D1: Backfill kubectl_command from Nemotron tool calls

After Nemotron (or the Gemini fallback) succeeds, `openclaw.py` converts the tool call into a kubectl command and backfills it into the proposal:

```python
# Excerpt: the underscore-prefixed variables are populated earlier,
# presumably from the tool call arguments.
if _tool_name == "restart_deployment":
    proposal["kubectl_command"] = f"kubectl rollout restart deployment/{_deploy} -n {_ns}"
elif _tool_name == "delete_pod":
    proposal["kubectl_command"] = f"kubectl delete pod {_pod} -n {_ns}"
elif _tool_name == "scale_deployment":
    proposal["kubectl_command"] = f"kubectl scale deployment/{_deploy} --replicas={_replicas} -n {_ns}"
```

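For readers who want to try the D1 mapping in isolation, here is a minimal self-contained sketch. It is not the code in `openclaw.py`: the tool call shape (`name` / `arguments`), the argument keys `pod_name`, `replicas`, and `namespace`, the `default` namespace fallback, and both function names are assumptions. Only `deployment_name`, `nemotron_tools`, and `kubectl_command` appear in this ADR.

```python
import shlex
from typing import Any, Dict, Optional


def tool_call_to_kubectl(
    tool_name: str, args: Dict[str, Any], default_ns: str = "default"
) -> Optional[str]:
    """Translate a supported tool call into a kubectl command, or None if unsupported."""
    # shlex.quote guards against shell metacharacters in model-supplied names.
    ns = shlex.quote(str(args.get("namespace", default_ns)))
    if tool_name == "restart_deployment":
        deploy = shlex.quote(str(args["deployment_name"]))
        return f"kubectl rollout restart deployment/{deploy} -n {ns}"
    if tool_name == "delete_pod":
        pod = shlex.quote(str(args["pod_name"]))
        return f"kubectl delete pod {pod} -n {ns}"
    if tool_name == "scale_deployment":
        deploy = shlex.quote(str(args["deployment_name"]))
        replicas = int(args["replicas"])  # int() rejects non-numeric replica counts
        return f"kubectl scale deployment/{deploy} --replicas={replicas} -n {ns}"
    return None


def backfill_kubectl_command(proposal: Dict[str, Any]) -> Dict[str, Any]:
    """Fill proposal["kubectl_command"] from the first translatable tool call, if empty."""
    if proposal.get("kubectl_command"):
        return proposal
    for call in proposal.get("nemotron_tools", []):
        command = tool_call_to_kubectl(call.get("name", ""), call.get("arguments", {}))
        if command:
            proposal["kubectl_command"] = command
            break
    return proposal
```

Returning `None` for unknown tools keeps the "no command" case explicit, so a caller can log it rather than silently storing an empty string.
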
### D2: proposal_service prefers kubectl_command

```python
# Use the backfilled kubectl command when present; otherwise fall back to the title.
_kubectl = llm_proposal.get("kubectl_command", "").strip()
action = _kubectl if _kubectl else llm_proposal["action"]
```

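The same preference rule can be wrapped in a small function for unit testing. This is a sketch, not the code in `proposal_service.py`; the function name `resolve_action` is hypothetical, and the `or ""` guard is a variation that additionally tolerates an explicit `None` value.

```python
from typing import Any, Mapping


def resolve_action(llm_proposal: Mapping[str, Any]) -> str:
    """Return the kubectl command if one was backfilled, else the human-readable action."""
    # `or ""` also covers an explicit None, which .get(key, "") alone would not.
    kubectl = (llm_proposal.get("kubectl_command") or "").strip()
    return kubectl if kubectl else llm_proposal["action"]
```
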
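### D3: Respond to Telegram immediately after approve/reject

Add `_notify_approval_result()`, which does the following:
1. `editMessageReplyMarkup`: removes the Approve/Reject buttons and keeps the info buttons
2. `sendMessage` with `reply_to`: replies under the original message with a status line
3. Fallback: `send_notification`
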
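The first two steps can be sketched as plain payload builders for the Telegram Bot API, with no network calls. This is not the actual `_notify_approval_result()`: the function names, the `approve:` / `reject:` callback_data prefixes, and the choice of `reply_parameters` to express the reply are all assumptions.

```python
from typing import Any, Dict, List

Keyboard = List[List[Dict[str, Any]]]

# Hypothetical callback_data prefixes used by the decision buttons.
DECISION_PREFIXES = ("approve:", "reject:")


def strip_decision_buttons(keyboard: Keyboard) -> Keyboard:
    """Drop Approve/Reject buttons; keep other buttons and discard rows left empty."""
    kept_rows: Keyboard = []
    for row in keyboard:
        kept = [
            button
            for button in row
            if not str(button.get("callback_data", "")).startswith(DECISION_PREFIXES)
        ]
        if kept:
            kept_rows.append(kept)
    return kept_rows


def build_edit_markup_payload(
    chat_id: int, message_id: int, keyboard: Keyboard
) -> Dict[str, Any]:
    """Payload for editMessageReplyMarkup with the decision buttons removed."""
    return {
        "chat_id": chat_id,
        "message_id": message_id,
        "reply_markup": {"inline_keyboard": strip_decision_buttons(keyboard)},
    }


def build_status_reply_payload(
    chat_id: int, message_id: int, status_line: str
) -> Dict[str, Any]:
    """Payload for sendMessage that replies to the original approval message."""
    return {
        "chat_id": chat_id,
        "text": status_line,
        "reply_parameters": {"message_id": message_id},
    }
```

Keeping payload construction separate from the HTTP call makes the button-filtering logic testable without a bot token.
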
### D4: Persist message_id to the DB

Add `telegram_message_id` and `telegram_chat_id` columns to `approval_records`. The `ALTER TABLE` has already been run, so tracking no longer relies solely on the Redis TTL.

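A minimal sketch of the persistence step follows, using the standard-library `sqlite3` module so it runs anywhere. The production database, the column types, the `id` primary key, the signature of `update_telegram_message()`, and the helpers `migrate` and `get_telegram_message` are assumptions; only the table name, the two column names, and the method name come from this ADR.

```python
import sqlite3
from typing import Optional, Tuple


def migrate(conn: sqlite3.Connection) -> None:
    """Add the two Telegram columns; INTEGER is an assumed column type."""
    conn.execute("ALTER TABLE approval_records ADD COLUMN telegram_message_id INTEGER")
    conn.execute("ALTER TABLE approval_records ADD COLUMN telegram_chat_id INTEGER")
    conn.commit()


def update_telegram_message(
    conn: sqlite3.Connection, approval_id: int, message_id: int, chat_id: int
) -> None:
    """Persist the Telegram message reference so it survives Redis expiry."""
    conn.execute(
        "UPDATE approval_records "
        "SET telegram_message_id = ?, telegram_chat_id = ? WHERE id = ?",
        (message_id, chat_id, approval_id),
    )
    conn.commit()


def get_telegram_message(
    conn: sqlite3.Connection, approval_id: int
) -> Optional[Tuple[int, int]]:
    """Return (message_id, chat_id), or None if nothing has been stored yet."""
    row = conn.execute(
        "SELECT telegram_message_id, telegram_chat_id FROM approval_records WHERE id = ?",
        (approval_id,),
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return (row[0], row[1])
```

With the reference stored in the database, a lookup that misses in Redis after the 24h TTL can fall back to `get_telegram_message`.
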
## Affected Files

| File | Change |
|------|--------|
| `apps/api/src/services/openclaw.py` | Nemotron/Gemini tool call → backfill kubectl_command |
| `apps/api/src/services/proposal_service.py` | action prefers kubectl_command |
| `apps/api/src/services/telegram_gateway.py` | `_notify_approval_result` + `_execute_approval_action` response |
| `apps/api/src/services/approval_db.py` | New method `update_telegram_message()` |
| `apps/api/src/services/decision_manager.py` | Storing message_id writes to both Redis and the DB |
| `apps/api/src/db/models.py` | ApprovalRecord gains two columns |

## Commits

- `ae97808` fix(proposal): action prefers kubectl_command
- `2c7d5d0` fix(openclaw): backfill kubectl_command from Nemotron tool call
- `1483218` feat(approval): respond to Telegram immediately after approve/reject + persist message_id

## Follow-up Actions

- Clean up the 5 orphaned `OBSERVE` approvals (incident_id=NULL) by setting them to EXPIRED
- Verify the full closed loop end to end on a new alert: Alertmanager → Incident → Nemotron → approval → approve → kubectl → Telegram notification

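## Lessons Learned

1. Do not assume that "Approved" in the UI means the backend actually executed anything
2. When adding a collaboration module, verify that its output propagates into the execution path
3. Button interactions must give immediate feedback; otherwise users cannot tell whether the action succeeded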