From 309fe0469841647d58044eb44eb4dd40f9f2ee42 Mon Sep 17 00:00:00 2001
From: OG T
Date: Thu, 9 Apr 2026 18:23:55 +0800
Subject: [PATCH] docs(adr066): record the approval-execution loop fix (LOGBOOK + ADR-066 + Skill 02 update)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- LOGBOOK.md: add the 2026-04-09 approval-execution loop fix status block
- ADR-066: record the root-cause chain, the decisions, and the affected files
- Skills/02: v2.7 adds the Nemotron tool→kubectl_command backfill hard rule plus lessons learned

Co-Authored-By: Claude Sonnet 4.6
---
 .agents/skills/02-lewooogo-backend-core.md    | 35 ++++++++
 docs/LOGBOOK.md                               | 23 +++++
 .../ADR-066-approval-execution-loop-fix.md    | 89 +++++++++++++++++++
 3 files changed, 147 insertions(+)
 create mode 100644 docs/adr/ADR-066-approval-execution-loop-fix.md

diff --git a/.agents/skills/02-lewooogo-backend-core.md b/.agents/skills/02-lewooogo-backend-core.md
index 4bdce86f..e3388038 100644
--- a/.agents/skills/02-lewooogo-backend-core.md
+++ b/.agents/skills/02-lewooogo-backend-core.md
@@ -37,6 +37,7 @@
 | v2.4 | 2026-03-31 | Claude Code | 🏛️ Phase 22 chief architect review passed (mock violations + layering fixes all completed) |
 | v2.5 | 2026-04-01 | Claude Code | ♻️ Phase R-R2 completed (legacy -971 lines) + R-R2.1 P0/P1 fixes + ADR-046 type unification |
 | v2.6 | 2026-04-08 | Claude Code | 🛡️ Sprint 5.1 Data Safety Guardrails: Service Registry pattern + review-correction hard rules |
+| v2.7 | 2026-04-09 | Claude Sonnet 4.6 | 🔧 ADR-066 approval-execution loop fix: Nemotron tool→kubectl_command backfill hard rule |
 
 ---
 
@@ -729,6 +730,40 @@ Python stop() timeout: 75 # 15s less than K8s, as a buffer
 > **ConfigMap**: `AI_FALLBACK_ORDER: '["nvidia","gemini","ollama","claude"]'`
 > **Review result**: 85/100 after the P0 fix → 94/100 final
 
+### 🔴 Hard rule: Nemotron/Gemini tool calls must backfill kubectl_command (ADR-066)
+
+**Background**: For months the approve button had no effect at all, because the Nemotron tool result was never propagated to the execution path.
+
+```python
+# ✅ Correct: openclaw.py must backfill
+_tools = proposal["nemotron_tools"]
+if _tools:
+    _t = _tools[0]
+    if _t["tool"] == "restart_deployment":
+        proposal["kubectl_command"] = f"kubectl rollout restart deployment/{_deploy} -n {_ns}"
+    elif _t["tool"] == "delete_pod":
+        proposal["kubectl_command"] = f"kubectl delete pod {_pod} -n {_ns}"
+    elif _t["tool"] == "scale_deployment":
+        proposal["kubectl_command"] = f"kubectl scale deployment/{_deploy} --replicas={_replicas} -n {_ns}"
+
+# ✅ Correct: proposal_service prefers kubectl_command
+_kubectl = llm_proposal.get("kubectl_command", "").strip()
+action = _kubectl if _kubectl else llm_proposal["action"]
+
+# ❌ Forbidden: storing only nemotron_tools[] without backfilling kubectl_command
+proposal["nemotron_tools"] = result.get("tools", [])
+# (no backfill → parse_operation_from_action → None → SKIP)
+```
+
+**Why it matters**: `execute_approved_action` relies on `parse_operation_from_action(approval.action)` to decide what to run. If action is a Chinese title or "未知操作" ("unknown operation"), parsing fails and execution is silently skipped, yet the UI still shows "Approved".
+
+**Checklist**:
+- [ ] When adding a new tool-call tool, update the backfill logic in openclaw.py at the same time
+- [ ] Test that a record is written to `audit_logs` after approval
+- [ ] Confirm that Telegram receives the reply status message after approval
+
+---
+
 ### Hard rule: NVIDIA Nemotron arbitrates first
 
 ```python
diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 00918f50..5a22de71 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -6,6 +6,29 @@
 
 ---
 
+## 📍 Current status (2026-04-09 approval-execution loop fix + TG message persistence + AI pipeline transparency)
+
+| Item | Fix | Commit |
+|------|------|--------|
+| **Root bug: approval never triggers execution** | Nemotron tool backfills `kubectl_command` | 2c7d5d0 |
+| **Root bug: action="未知操作"** | proposal_service prefers kubectl_command | ae97808 |
+| **No response after pressing a TG button** | `_notify_approval_result()` replies with a status line | 1483218 |
+| **message_id kept only in Redis for 24h** | add a telegram_message_id column to approval_records | 1483218 |
+| **AI pipeline opaque** | TG message shows the model/backend (Ollama/NIM/Gemini) | d8c2969 |
+| **Ollama replaces NVIDIA NIM** | OllamaToolProvider llama3.1:8b, 44s→4s | 7857c25 |
+| 5 orphaned OBSERVE approvals | cleaned up in the DB by setting them directly to EXPIRED | n/a |
+
+**Root-cause chain**:
+```
+Nemotron→restart_deployment(sentry) → only stored in nemotron_tools[] → kubectl_command="" →
+approval_records.action="未知操作 | " → parse_operation_from_action→None →
+execute_approved_action SKIP → approve/reject buttons completely ineffective for months
+```
+
+**Next step**: deployment verification → full execution-loop test with a new alert
+
+---
+
 ## 📍 Current status (2026-04-10 Sprint 5R frontend refactor 13/14 done, CD deploying)
 
 | Step | Content | Commit | Status |
diff --git a/docs/adr/ADR-066-approval-execution-loop-fix.md b/docs/adr/ADR-066-approval-execution-loop-fix.md
new file mode 100644
index 00000000..2a22e6be
--- /dev/null
+++ b/docs/adr/ADR-066-approval-execution-loop-fix.md
@@ -0,0 +1,89 @@
+# ADR-066: Approval-Execution Loop Fix (Nemotron Tool Call → kubectl_command Backfill)
+
+**Status**: Approved and implemented (2026-04-09)
+**Proposer**: Claude Sonnet 4.6
+**Chief architect**: ogt
+
+---
+
+## Background
+
+Since Phase 22 (Nemotron collaboration) went live, every operation approved through the "Approve" button in Telegram or the frontend **never actually executed any K8s operation**. Database records show `approval_records.action = "未知操作 | "` ("unknown operation"), and nobody noticed for months.
+
+## Problem chain analysis
+
+```
+1. Nemotron produces tool_call: restart_deployment(deployment_name=sentry)
+2. openclaw.py only stores it in proposal["nemotron_tools"]
+3. proposal["kubectl_command"] remains an empty string
+4. proposal_service.py takes llm_proposal["action"] = action_title (a Chinese title)
+5. approval_records.action = "重啟 sentry 服務" or "未知操作 | "
+6. execute_approved_action() calls parse_operation_from_action("未知操作 | ")
+7. It returns ParsedOperation(None, None, ...) → execution is SKIPPED
+8. The status shows "Approved" but nothing happened
+```
+
+**Additional problems**:
+- Pressing the Telegram approve/reject buttons produced no response at all (no answer callback, no message update)
+- The Telegram message_id was stored only in Redis with a 24h TTL and could not be traced after it expired
+
+## Decisions
+
+### D1: Nemotron tool call backfills kubectl_command
+
+After Nemotron (or the Gemini fallback) succeeds, `openclaw.py` converts the tool call into a kubectl command and backfills it:
+
+```python
+if _tool_name == "restart_deployment":
+    proposal["kubectl_command"] = f"kubectl rollout restart deployment/{_deploy} -n {_ns}"
+elif _tool_name == "delete_pod":
+    proposal["kubectl_command"] = f"kubectl delete pod {_pod} -n {_ns}"
+elif _tool_name == "scale_deployment":
+    proposal["kubectl_command"] = f"kubectl scale deployment/{_deploy} --replicas={_replicas} -n {_ns}"
+```
+
+### D2: proposal_service prefers kubectl_command
+
+```python
+_kubectl = llm_proposal.get("kubectl_command", "").strip()
+action = _kubectl if _kubectl else llm_proposal["action"]
+```
+
+### D3: Respond to Telegram immediately after approve/reject
+
+Add `_notify_approval_result()`:
+1. `editMessageReplyMarkup`: remove the approve/reject buttons, keep the info buttons
+2. `sendMessage reply_to`: reply with a status line under the original message
+3. fallback: `send_notification`
+
+### D4: Persist message_id to the DB
+
+Add `telegram_message_id` + `telegram_chat_id` columns to `approval_records`. The `ALTER TABLE` has already been run. Tracking no longer relies solely on the Redis TTL.
+
+## Affected files
+
+| File | Change |
+|------|---------|
+| `apps/api/src/services/openclaw.py` | Nemotron/Gemini tool → backfill kubectl_command |
+| `apps/api/src/services/proposal_service.py` | action prefers kubectl_command |
+| `apps/api/src/services/telegram_gateway.py` | _notify_approval_result + _execute_approval_action responses |
+| `apps/api/src/services/approval_db.py` | new method update_telegram_message() |
+| `apps/api/src/services/decision_manager.py` | storing message_id writes to both Redis and the DB |
+| `apps/api/src/db/models.py` | ApprovalRecord gains two columns |
+
+## Commits
+
+- `ae97808` fix(proposal): action prefers kubectl_command
+- `2c7d5d0` fix(openclaw): Nemotron tool call backfills kubectl_command
+- `1483218` feat(approval): respond to Telegram immediately after approve/reject + persist message_id
+
+## Follow-up actions
+
+- Clean up 5 orphaned `OBSERVE` approvals (incident_id=NULL) by setting them to EXPIRED
+- Verify the full execution loop for a new alert: Alertmanager → Incident → Nemotron → approval → approve → kubectl → Telegram notification
+
+## Lessons learned
+
+1. Do not assume that a UI showing "Approved" means the backend actually executed anything
+2. When adding a collaboration module, always verify that its output propagates to the execution path
+3. Button interactions must give immediate feedback, otherwise users cannot tell whether the action succeeded
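
As a rough illustration of decisions D1 and D2 in the ADR above, the backfill and the action fallback can be sketched as two small functions. This is a minimal sketch under stated assumptions, not the code in `openclaw.py` or `proposal_service.py`: the helper names `backfill_kubectl_command` and `resolve_action`, the `args` key, and the `pod_name`, `replicas`, and `namespace` fields are hypothetical. Only the tool names, `deployment_name`, and the shape of the kubectl commands come from the ADR.

```python
from typing import Optional


def backfill_kubectl_command(proposal: dict, default_ns: str = "default") -> Optional[str]:
    """Turn the first stored tool call into a kubectl command and store it.

    Hypothetical helper: assumes each tool call looks like
    {"tool": <name>, "args": {...}}; the real payload shape may differ.
    Returns the command, or None when nothing could be backfilled.
    """
    tools = proposal.get("nemotron_tools") or []
    if not tools:
        return None

    call = tools[0]
    name = call.get("tool")
    args = call.get("args") or {}
    ns = args.get("namespace", default_ns)  # assumed field name

    if name == "restart_deployment" and args.get("deployment_name"):
        cmd = f"kubectl rollout restart deployment/{args['deployment_name']} -n {ns}"
    elif name == "delete_pod" and args.get("pod_name"):  # assumed field name
        cmd = f"kubectl delete pod {args['pod_name']} -n {ns}"
    elif name == "scale_deployment" and args.get("deployment_name") and "replicas" in args:
        cmd = (
            f"kubectl scale deployment/{args['deployment_name']} "
            f"--replicas={int(args['replicas'])} -n {ns}"
        )
    else:
        # Unknown tool or missing arguments: leave the proposal untouched
        # so the caller can detect the gap instead of skipping silently.
        return None

    proposal["kubectl_command"] = cmd
    return cmd


def resolve_action(llm_proposal: dict) -> str:
    """Prefer the backfilled kubectl command over the free-text action title."""
    kubectl = (llm_proposal.get("kubectl_command") or "").strip()
    return kubectl or llm_proposal.get("action", "")


if __name__ == "__main__":
    proposal = {
        "action": "未知操作 | ",
        "nemotron_tools": [
            {"tool": "restart_deployment",
             "args": {"deployment_name": "sentry", "namespace": "monitoring"}}
        ],
    }
    backfill_kubectl_command(proposal)
    print(resolve_action(proposal))
    # prints: kubectl rollout restart deployment/sentry -n monitoring
```

Returning `None` for an unrecognised tool, rather than guessing a command, keeps the failure visible to the caller, which is the property the original silent SKIP lacked.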