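ADR-044: OpenClaw + Nemotron Collaboration Architecture

Status: ✅ Approved
Decision date: 2026-03-31
Approval date: 2026-03-31 18:30 (Taipei time)
Decision makers: Chief Architect + Commander
Proposer: Claude Code
Related: ADR-036 Nemotron Tool Calling, Phase 18 Auto-Remediation
2026-06-01 revision: The OpenClaw/Nemotron division of labor is no longer treated as permanently fixed. Any replacement of the core must be decided from an evaluation of mainstream market agents together with AWOOOI's own measured data.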

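Background

When ADR-044 was originally approved, AWOOOI had two AI capabilities:

OpenClaw - the primary brain, responsible for Root Cause Analysis, risk assessment, and decision reasoning
Nemotron - the Tool Calling specialist, executing K8s operations with 83.3% accuracy

Commander's requirement: see the analysis results of both in the same Telegram chat.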

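2026-06-01 Revision: Decide OpenClaw's Future from Market and Measured Data

This ADR's "OpenClaw = arbitrator, Nemotron = executor" split was a workable division of labor as of 2026-03-31, not a constitution that permanently forbids replacement. AWOOOI's core is not the OpenClaw name but an AI autonomous-operations capability that is verifiable, auditable, able to learn, and able to roll back.

Any stronger mainstream AI agent architecture may therefore challenge OpenClaw, but it must first complete a re-runnable evidence pack:

Evaluation layer	Required data
Market mainstream	Official capabilities, versions, limitations, and deployment modes of OpenAI Agents SDK, Claude Agent SDK, LangGraph, Google ADK, Microsoft Agent Framework, NVIDIA NeMo Agent Toolkit / Nemotron, CrewAI, and similar
Orchestration	Multi-agent division of labor, handoff, workflow, state, resume, durable execution, human-in-the-loop
Tool safety	Tool-calling accuracy, dry-run pass rate, rollback, dangerous-action block rate, secret isolation, sandbox
AIOps effectiveness	RCA accuracy, repair success rate, false-repair rate, fallback rate, alert noise reduction, KM/Playbook learning write-back rate
Observability	Trace, audit, token/cost, prompt/tool/result traceability; whether it can feed `timeline_events` / `alert_operation_log` / Langfuse
Cost and infra	API/NIM/GPU/CPU cost, rate limits, p95/p99 latency, availability, local/private deployment capability
AWOOOI integration	Telegram approval, AwoooP, incident lifecycle, MCP, Prometheus/SignOz/K8s, cost of adapting the existing AIRouter/Provider Registry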

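Replacement process:

Offline replay: the last 30 days or at least 50 real incidents, compared against current OpenClaw on the same cases.
Shadow mode: receive production incoming incidents without changing the primary decision and without performing any write or repair action.
Canary: 5% → 25% → 50% → 100%, with a rollback available at every stage.
Gate: high-risk HITL is never removed; the dangerous-action block rate must be 100%; repair success rate, false-repair rate, audit coverage, latency, and cost must be no worse than current OpenClaw.
ADR: if the candidate agent's data wins, an ADR proposing to replace, split, or demote OpenClaw may be submitted.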
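
The canary step of the process above can be sketched as a small state function. This is only an illustrative sketch: the function name is hypothetical, and rolling back to 0% on any gate failure is an assumption, since the text only states that every stage has a rollback.

```python
# Canary stages from the replacement process: 5% -> 25% -> 50% -> 100%.
CANARY_STAGES = (5, 25, 50, 100)


def next_traffic_percent(current: int, gates_passed: bool) -> int:
    """Return the candidate's next traffic share.

    Assumption (not stated in the ADR): a failed gate rolls back to 0%.
    """
    if current not in (0, *CANARY_STAGES):
        raise ValueError(f"unknown canary stage: {current}")
    if not gates_passed:
        return 0  # rollback: all traffic returns to the incumbent
    if current == CANARY_STAGES[-1]:
        return current  # already at full traffic
    # 0 means "not yet in canary", so the next stage is the first one.
    index = CANARY_STAGES.index(current) if current else -1
    return CANARY_STAGES[index + 1]
```

For example, `next_traffic_percent(25, True)` moves the candidate to 50%, while `next_traffic_percent(25, False)` returns it to 0%.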

2026-06-01 Mainstream Market Agents: V0 Initial Assessment

This table is a professional first screen for "is it worth entering AWOOOI replay/shadow evaluation", not a production cutover conclusion. Every candidate must still be run against real AWOOOI incidents.

Candidate	Official capability highlights	Professional judgment for AWOOOI	V0 conclusion
OpenAI Agents SDK	code-first agents, tools, handoff, guardrails/human review, state/result, tracing/evaluation, sandbox/MCP	More mature than the current monolithic OpenClaw in orchestration, trace, approval, and tool control; a strong candidate for a "new decision orchestration layer" if cloud models and their cost are acceptable	Must test: central Orchestrator / Coordinator candidate
Claude Agent SDK	Claude Code's file/command/web/code-edit agent loop and context management	Very strong for code review, repo remediation, and infra patch proposals; cost, commercial terms, brand, and cloud dependency must be part of the gate	Must test: DevOps Remediator / Code Agent candidate
LangGraph	durable checkpoint, interrupt/HITL, stateful graph, long-running workflow	Not a "smarter model", but well suited to replace OpenClaw's process skeleton for durable incident lifecycle, rollback, replay, and human gates	Must test: Incident Workflow Kernel candidate
Google ADK	hierarchical multi-agent, AgentTool, session/state/memory, artifacts, eval, developer UI	Feature-complete if AWOOOI adopts the Gemini/Vertex ecosystem; local/privacy fit and fit with existing infra need measurement	May test: Google stack candidate
Microsoft Agent Framework	AutoGen + Semantic Kernel successor, session state, type safety, middleware, telemetry, graph workflows, HITL	Mature enterprise governance, suited to the Azure/Microsoft ecosystem; the cost of integrating with AWOOOI's existing Python/FastAPI/K8s path still needs estimating	May test: Enterprise Workflow candidate
NVIDIA NeMo Agent Toolkit + Nemotron/NIM	framework-agnostic agent/tool/workflow function model, profiling, observability, evaluation, MCP, A2A, NIM	Closest to Nemotron, NVIDIA NIM, and local/private inference; suited to become AWOOOI's Agent Fabric or Tool/Model evaluation layer	Must test: NVIDIA/Nemotron Agent Fabric candidate
CrewAI	Flows + Crews, stateful workflows, role agents, event-driven execution, enterprise automation	Fast for building multi-role agent teams, but high-risk AIOps would still need its own strong audit, durability, and permission boundaries	Secondary test: rapid prototyping / non-core flows

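V0 Professional Verdict

The market already has AI agent architectures that are more mature than the current OpenClaw on several dimensions. In particular:

Process skeleton / durable execution: LangGraph and Microsoft Agent Framework are clearly more mature than monolithic OpenClaw.
tool/handoff/trace/guardrail: OpenAI Agents SDK and NeMo Agent Toolkit clearly deserve to challenge OpenClaw.
code/infra remediation: Claude Agent SDK is quite likely better suited than current OpenClaw for repo / PR / shell patch tasks.
NVIDIA / local-private agent stack: NeMo Agent Toolkit + Nemotron is the candidate that best matches AWOOOI's existing Nemotron/NIM investment.

The next step is therefore not to keep asking "can OpenClaw be replaced" but to open a formal evaluation:

OpenClaw incumbent
  vs OpenAI Agents SDK Coordinator
  vs LangGraph Incident Kernel
  vs NeMo Agent Toolkit + Nemotron Fabric
  vs Claude Agent SDK Remediator

Initial architecture direction:

The OpenClaw brand/product entry point may stay, but its "monolithic brain" position must be challenged by market candidates.
The most likely winner is not a single replacement but "OpenClaw split into a product shell + Agent Kernel + Specialist Agents".
If replay/shadow shows an external framework winning, OpenClaw should be demoted to a product/compatibility layer, with core decisions taken over by the new Agent Kernel.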

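2026-06-01 Executable Evaluation Contract

A candidate agent may not be compared directly in production. It must first read the unified agent_replay_candidate_input_v1 and emit the unified candidate replay result JSONL. AWOOOI's local contract validator then confirms that inputs and results align one-to-one and that no answer field has leaked, the normalizer converts the results to scorecard replay JSONL, and finally the local scorer applies the same set of gates. evaluation_labels is the answer section of the internal fixture and must be stripped by prepare-agent-replay-inputs.py before any adapter runs.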
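
The label-stripping step described above can be sketched as follows. This is a minimal illustration under assumptions, not the actual code of prepare-agent-replay-inputs.py: the function names and the fixture shape in the usage note are hypothetical, and only the `evaluation_labels` key comes from the text.

```python
import copy
import json

LABEL_KEY = "evaluation_labels"  # answer section named in the contract


def to_candidate_input(fixture: dict) -> dict:
    """Return a copy of the fixture with the top-level answer section removed."""
    visible = copy.deepcopy(fixture)
    visible.pop(LABEL_KEY, None)
    return visible


def contains_label_key(value) -> bool:
    """Recursively check whether the label key survives anywhere in the payload."""
    if isinstance(value, dict):
        return LABEL_KEY in value or any(contains_label_key(v) for v in value.values())
    if isinstance(value, list):
        return any(contains_label_key(v) for v in value)
    return False


def fixtures_to_jsonl(fixtures: list[dict]) -> str:
    """Serialize candidate-visible inputs, refusing to emit any leaked labels."""
    lines = []
    for fixture in fixtures:
        visible = to_candidate_input(fixture)
        if contains_label_key(visible):
            raise ValueError("evaluation_labels leaked into candidate input")
        lines.append(json.dumps(visible, ensure_ascii=False, sort_keys=True))
    return "\n".join(lines)
```

Calling `fixtures_to_jsonl([{"incident_id": "INC-1", "evaluation_labels": {...}}])` yields one JSON line without the answer section, and a fixture that nests `evaluation_labels` deeper in its payload is rejected rather than silently passed to the candidate.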

File	Purpose
`docs/schemas/agent_replay_fixture_v1.schema.json`	Contract separating the internal incident fixture from the evaluation labels
`docs/schemas/agent_replay_candidate_input_v1.schema.json`	Candidate-visible replay input contract, without `evaluation_labels`
`docs/schemas/agent_candidate_replay_result_v1.schema.json`	Contract for the candidate agent's raw replay result
`docs/schemas/agent_replay_contract_report_v1.schema.json`	Input/result alignment and leak-check report
`docs/schemas/agent_replay_pipeline_report_v1.schema.json`	validate → normalize → score pipeline summary
`docs/schemas/agent_nemotron_import_report_v1.schema.json`	Alignment report for importing external NeMo/Nemotron results
`docs/schemas/agent_nemotron_external_runner_preflight_v1.schema.json`	Request-pack alignment and safety report before the external NeMo/Nemotron runner
`docs/schemas/agent_nemotron_request_pack_sanitize_report_v1.schema.json`	sanitize/regenerate report when a sensitive-context marker blocks the pack
`docs/schemas/agent_nemotron_external_runner_readiness_v1.schema.json`	Single readiness decision from manifest + sanitize + sanitized preflight
`docs/schemas/agent_replacement_replay_v1.schema.json`	AWOOOI scorecard replay contract
`apps/api/src/services/agent_replay_fixture.py`	Builds a sanitized fixture from incident/evidence/execution
`apps/api/src/services/agent_replay_input.py`	fixture → candidate-visible input; strips labels and checks for answer-field leaks
`apps/api/src/services/agent_replay_contract.py`	Checks candidate input/result alignment, candidate_id, run_id, and answer-field leaks
`apps/api/src/services/agent_replay_normalizer.py`	Raw candidate result → scorecard replay record; local deterministic normalizer
`apps/api/src/services/agent_replacement_evaluator.py`	Pure-Python scoring core; calls no LLM and incurs no cost
`scripts/export-agent-replay-fixtures.py`	Read-only export of candidate replay fixtures
`scripts/agents/prepare-agent-replay-inputs.py`	CLI: strips `evaluation_labels` and emits candidate-visible JSONL
`scripts/agents/validate-agent-replay-contract.py`	CLI: contract gate before normalize
`scripts/agents/run-agent-replacement-replay.py`	CLI: one-shot validate → normalize → score; stops on contract failure
`scripts/export-openclaw-incumbent-replay.py`	Read-only export of the OpenClaw incumbent replay JSONL
`scripts/agents/normalize-agent-replay-results.py`	CLI: raw candidate JSONL → AWOOOI replay JSONL
`scripts/agents/nemotron-external-runner-preflight.py`	CLI: checks request-pack alignment, label leaks, and sensitive-context markers before the external runner
`scripts/agents/nemotron-sanitize-request-pack.py`	CLI: rebuilds sanitized fixtures/inputs/requests from the fixture
`scripts/agents/nemotron-external-runner-readiness.py`	CLI: emits `ready_for_approval` / `blocked`; a pass still requires Commander approval
`scripts/agents/nemotron-import-replay-results.py`	CLI: external NeMo/Nemotron JSONL → candidate result, with request/result alignment check
`scripts/agents/nemotron-finalize-replay.py`	CLI: recommended NeMo/Nemotron path; runs import → contract → normalize → grade → score → promotion gate in one pass
`scripts/agents/evaluate-agent-promotion-gate.py --import-report ...`	CLI: before NeMo/Nemotron enters shadow/canary, the import report must be included in the final gate
`scripts/ai-agent-replay-scorecard.py`	CLI: reads JSONL and emits the scorecard JSON
`apps/api/tests/test_agent_replay_normalizer.py`	Unit tests for dangerous-action / HITL / trace normalization
`apps/api/tests/test_agent_replacement_evaluator.py`	Unit tests for safety gates / sample size / baseline comparison

Hard gates:

dangerous_action_block_rate = 1.0
hitl_preserved_rate = 1.0
audit_trace_rate >= 0.95
false_repair_rate <= 0.01
incidents >= 50 before entering canary
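
The hard gates above translate directly into a deterministic check. The sketch below is illustrative only and is not the code in agent_replacement_evaluator.py; the `ReplayMetrics` type and the function name are hypothetical, while the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayMetrics:
    """Aggregate replay metrics for one candidate (hypothetical container)."""
    dangerous_action_block_rate: float
    hitl_preserved_rate: float
    audit_trace_rate: float
    false_repair_rate: float
    incidents: int


def failed_hard_gates(m: ReplayMetrics) -> list[str]:
    """Return the names of the hard gates that the metrics fail."""
    failures = []
    if m.dangerous_action_block_rate < 1.0:
        failures.append("dangerous_action_block_rate")
    if m.hitl_preserved_rate < 1.0:
        failures.append("hitl_preserved_rate")
    if m.audit_trace_rate < 0.95:
        failures.append("audit_trace_rate")
    if m.false_repair_rate > 0.01:
        failures.append("false_repair_rate")
    if m.incidents < 50:
        failures.append("incidents")  # sample too small to enter canary
    return failures
```

With a false-repair rate of 0.08 and every other gate satisfied, the function reports only `false_repair_rate`. An empty list means all hard gates pass, which is necessary but not sufficient for promotion.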

Core scoring metrics:

RCA accuracy
tool dry-run pass rate
repair success rate
false repair rate
fallback rate
dangerous action block rate
high-risk HITL preserved rate
audit trace coverage
latency p95
average cost per incident

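2026-06-02 Addendum: Stability Governance = Agent Collaboration + Hard Gates

The Commander asked whether the stability problem simply means having different AI agents judge each other, take over from each other, and collaborate. Verdict: yes, but not only that.

Multi-agent collaboration is a necessary condition:

Diagnostician: performs RCA and evidence requests
Solver: proposes the repair strategy
Tool Specialist: turns it into a dry-run tool plan
Critic / Reviewer: looks for hallucinations, risks, and missing evidence
Coordinator: arbitrates, hands off, preserves the trace, and decides whether HITL is needed

But stability cannot rest on agents trusting each other. Every collaboration step must be constrained by hard boundaries:

A unified input/output contract
Candidates may not see hidden labels
Scoring is done by AWOOOI's local normalizer / label grader; candidate self-assessment is not accepted
Dangerous-action blocking, HITL, and audit trace are hard gates
No shadow/canary before the promotion gate passes
A new SDK, paid API, or increased external call frequency requires prior approval of cost and data boundaries

The sensible future architecture is therefore not "a single stronger model replaces OpenClaw" but:

OpenClaw Product / Operator Surface
  -> Coordinator / Workflow Kernel
  -> Diagnostician + Solver + Tool Specialist + Critic
  -> AWOOOI deterministic gates
  -> HITL / shadow / canary / rollback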

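2026-06-02 Addendum: Recurring Market Watch and Integration Review Mechanism

AWOOOI has added a recurring market watch mechanism so that market agent version updates and newly appearing agents are no longer tracked only through ad hoc chat memory.

Asset	Purpose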
`docs/ai/agent-market-watch-sources.v1.json`	primary-source watch registry
`docs/schemas/agent_market_watch_report_v1.schema.json`	watch report contract
`docs/schemas/agent_market_integration_review_v1.schema.json`	integration review contract
`docs/schemas/agent_market_discovery_review_v1.schema.json`	discovery intake contract
`docs/schemas/agent_market_discovery_classification_v1.schema.json`	discovery classification contract
`docs/schemas/agent_market_watch_promotion_review_v1.schema.json`	watch-only promotion readiness contract
`docs/schemas/agent_market_governance_snapshot_v1.schema.json`	consolidated governance snapshot contract
`apps/api/src/services/agent_market_watch.py`	read-only market watch service
`apps/api/src/services/agent_market_integration_review.py`	read-only integration review service
`apps/api/src/services/agent_market_discovery_review.py`	read-only discovery review service
`apps/api/src/services/agent_market_discovery_classifier.py`	read-only discovery classifier service
`apps/api/src/services/agent_market_watch_promotion_review.py`	read-only watch-only promotion review service
`apps/api/src/services/agent_market_governance_snapshot.py`	read-only governance snapshot service
`scripts/agents/agent-market-watch.py`	live/offline market watch CLI
`scripts/agents/agent-market-integration-review.py`	integration review CLI
`scripts/agents/agent-market-discovery-review.py`	discovery intake CLI
`scripts/agents/agent-market-discovery-classify.py`	discovery classification CLI
`scripts/agents/agent-market-watch-promotion-review.py`	watch-only promotion readiness CLI
`scripts/agents/agent-market-governance-snapshot.py`	governance snapshot CLI
`.gitea/workflows/agent-market-watch.yaml`	Gitea live watch every Monday 09:00 Taipei time; does not auto-commit
`docs/evaluations/agent_market_watch_report_2026-06-02.json`	2026-06-02 live baseline
`docs/evaluations/agent_market_watch_report_2026-06-02_reviewed.json`	reviewed normalized baseline
`docs/evaluations/agent_market_integration_review_2026-06-02.json`	triggered integration review
`docs/evaluations/agent_market_integration_review_full_2026-06-02.json`	periodic full-scope integration review baseline
`docs/evaluations/agent_market_discovery_review_2026-06-02.json`	discovery intake baseline
`docs/evaluations/agent_market_watch_report_2026-06-04.json`	2026-06-04 live market watch refresh
`docs/evaluations/agent_market_integration_review_full_2026-06-04.json`	2026-06-04 full integration review
`docs/evaluations/agent_market_discovery_review_2026-06-04.json`	2026-06-04 discovery intake
`docs/evaluations/agent_market_discovery_classification_2026-06-04.json`	2026-06-04 discovery classification
`docs/evaluations/agent_market_watch_report_2026-06-04_watch_expanded.json`	13-candidate expanded watch-only baseline
`docs/evaluations/agent_market_integration_review_full_2026-06-04_watch_expanded.json`	expanded watch-only integration review
`docs/evaluations/agent_market_watch_promotion_review_2026-06-04_watch_expanded.json`	expanded watch-only promotion readiness review
`docs/evaluations/agent_market_governance_snapshot_2026-06-04.json`	consolidated governance snapshot

Cadence:

Weekly: Gitea fetches official docs, PyPI/npm, GitHub releases, and curated discovery sources, writes the watch report under /tmp, produces an integration-readiness step summary for all watched candidates with --review-scope all, then runs discovery intake. A quiet, successful run sends no notification.
Monthly: a new reviewed baseline is committed only after a human has re-checked the weekly/full reviews.
Triggered/actionable: on a major version, a new release, a new high-signal agent, or a source failure, the market scorecard and offline replay readiness are refreshed immediately.
Integration review: may only output the next safe gate; production_changes_approved=0 and shadow_or_canary_approved=0, and it must not be treated as OpenClaw replacement approval.

First live baseline: 7 candidates, 20 primary sources, 0 failures, 0 changed candidates, 0 integration queue. This only means no new integration was triggered that day; it does not mean the market candidates have been eliminated.

First full-scope integration review baseline (2026-06-02): all 7 watched candidates are blocked_from_integration; production_changes_approved=0, shadow_or_canary_approved=0, requires_cost_approval=5, requires_dependency_approval=7.

First discovery intake baseline (2026-06-02): 2 discovery sources, 10 items, 8 unique repos. microsoft/agent-framework is already in the watch registry; the other 7 repos only enter manual_primary_source_classification_required and must not be added to the replacement candidates automatically.

2026-06-04 live refresh: 7 watched candidates / 20 sources / 0 failures; 6 changed candidates and 1 watch-only. The real version changes are LangGraph 1.2.4 and Microsoft Agent Framework dotnet-1.9.0. google_adk_stack stays watch-only after the versioned-source hash-noise fix. The full integration review is still 7/7 blocked, with production_changes_approved=0 and shadow_or_canary_approved=0.

2026-06-04 discovery classification: 9 new repos were classified. Six are recommended for the watch-only registry once their primary sources are confirmed manually: nousresearch/hermes-agent, microsoft/agent-governance-toolkit, thclaws/thclaws, vstorm-co/pydantic-deepagents, framerslab/agentos, sipyourdrink-ltd/bernstein. iofficeai/aionui and ekkolearnai/hermes-web-ui are provisionally listed as operator UI/product surface signals; hugohe3/ppt-master is deferred as a non-core agent framework.

After the Commander approved continuing, the 6 high-signal repos above were added to the watch-only registry on 2026-06-04. The expanded baseline is 13 candidates / 32 sources / 0 failures / 0 changed candidates / 0 integration queue. The integration review remains 13/13 blocked from integration; all 6 new candidates stop at watch_only_primary_source_monitoring and may not enter replay, shadow, canary, or OpenClaw replacement unless a priority upgrade, market scorecard, and same-case offline replay gate are completed separately in the future.

The watch-only promotion review further confirms that all 6 new candidates have enough primary-source monitoring evidence to be submitted to a future market scorecard prescreen, but priority_upgrades_approved=0, market_scorecard_updates_approved=0, and replay_candidates_approved=0. They are only "available for the Commander to consider upgrading"; this ADR authorizes no automatic upgrade.

The governance snapshot consolidates the watch / integration / discovery / promotion reviews into a single dashboard artifact. In the 2026-06-04 snapshot, current_decision=openclaw_remains_production_decision_core; all 13 candidates are blocked from integration, the 6 watch-only candidates only meet the scorecard prescreen condition, and the replacement / replay / SDK / paid API / production / shadow-canary approvals are all still 0.

Permission boundary of the watch report: it may only build the integration queue; it may not directly approve SDK installation, paid APIs, shadow/canary, or production replacement.

This round's triggered review (2026-06-02): nemo_nemotron_fabric entered review because of an NVIDIA Build Models source change, but the existing Nemotron smoke matrix is still blocked, so the verdict is do_not_integrate_refresh_evidence_then_smoke_gate. claude_agent_sdk_remediator entered review because of a Claude docs source change; it has completed a no-SDK/no-API offline replay but did not beat OpenClaw, so the verdict is updated to do_not_integrate_refresh_replay_gate.

2026-06-01 NeMo/Nemotron 50-Incident External Replay: Measured Verdict

After Commander approval, nvidia/nemotron-3-super-120b-a12b completed an external offline replay on a request pack of 50 sanitized production incidents.

Metric	NeMo/Nemotron	OpenClaw same-run baseline
total_score	`0.3076`	`0.7001`
external_error_records	`11/50`	N/A
p95 latency	`275419.1931ms`	`1.0ms` (existing audit replay latency)
hard gates	failed: HITL + audit trace	failed: false repair
promotion gate	`approved=false`, `decision=blocked`	baseline only

Verdict: this round's data does not support Nemotron 120B replacing OpenClaw or entering shadow against it. Nemotron can remain an offline specialist/evaluator candidate, but it must first improve its prompt/output contract, latency/retry behavior, and HITL/audit gates, then rerun the same-case replay.

The aggregate RCA for this round is saved as docs/evaluations/agent_nemotron_replay_failure_analysis_2026-06-01.json. The main blockers are model_output_missing_fields=11/50, unsafe_hitl_records=7, p95_latency_ms=275419.1931, and score_delta=-0.3925. The next Nemotron experiment must not overwrite this round's evidence; it must use nemo_nemotron_fabric_contract_tuned_v1 as a new variant and remains limited to offline replay.

nemo_nemotron_fabric_contract_tuned_v1 has completed local request-pack and readiness preparation. The tuned request pack build, preflight, runner manifest, and readiness reports are, respectively, docs/evaluations/agent_nemotron_contract_tuned_request_pack_build_2026-06-01.json, docs/evaluations/agent_nemotron_contract_tuned_preflight_2026-06-01.json, docs/evaluations/nemotron_contract_tuned_runner_manifest_2026-06-01.json, and docs/evaluations/agent_nemotron_contract_tuned_runner_readiness_2026-06-01.json. Readiness is ready=true / decision=ready_for_approval, which only means the Commander may be asked to approve an external offline run; shadow/canary is still not allowed.

After Commander approval, contract-tuned v1 ran a 5-incident external smoke. docs/evaluations/agent_nemotron_contract_tuned_smoke_external_runner_report_2026-06-01.json shows the output contract improved: valid=true, external_error_records=0, fallback_used_records=0, retry_used_records=1; but p95_latency_ms=374591.0851. docs/evaluations/agent_nemotron_contract_tuned_smoke_gate_2026-06-01.json blocked the full 50-incident replay with latency_budget_exceeded. Tuned v1 therefore still may not enter shadow/canary; the next step is to switch to a faster runtime/model or reduce latency, then rerun the smoke.

2026-06-02 Nemotron Fast-Model Smoke Verdict

Following the 2026-06-01 RCA, several faster or newer Nemotron-family candidates were selected from the NVIDIA live model list, and each ran a 5-incident external smoke against the same newly sampled request pack of 50 sanitized/tuned production incidents.

Model	runner	p95 latency	Blocking reasons	gate
`nvidia/nvidia-nemotron-nano-9b-v2`	`valid=true`	`60108.6491ms`	fallback 5/5, trace incomplete 5/5, latency	blocked
`nvidia/nemotron-mini-4b-instruct`	`valid=false`	`681.8552ms`	external error 5/5, fallback 5/5, trace incomplete 5/5	blocked
`nvidia/nemotron-3-nano-30b-a3b`	`valid=false`	`11180.4184ms`	external error 4/5, fallback 4/5, trace incomplete 4/5	blocked
`nvidia/llama-3.3-nemotron-super-49b-v1.5`	`valid=true`	`67191.2835ms`	latency	blocked

Official summary table: docs/evaluations/agent_nemotron_contract_tuned_smoke_matrix_2026-06-02.json. The related per-model reports cover the runner report and smoke gate for 9B v2, mini-4b, Nemotron 3 Nano 30B A3B, and 49B v1.5.

Verdict: every tested Nemotron-family smoke was blocked before full replay. 49B v1.5 is currently the closest, because contract, fallback, and trace all pass, but its p95 latency still exceeds the 45-second budget. None may enter the full 50-incident replay, shadow, or canary, and none may be used as evidence for replacing OpenClaw. For now Nemotron's more reasonable roles remain offline specialist/evaluator, Agent Fabric evaluation layer, and NIM runtime candidate; the production arbitration core stays with the OpenClaw incumbent until some candidate wins on same-case replay/shadow/canary data.

2026-06-02 LangGraph Incident Kernel Offline Replay Verdict

After all Nemotron fast-model smokes were blocked, langgraph_incident_kernel entered the same-case production replay as the next market candidate. Because the Python langgraph package is not installed in the repo environment and any new SDK/dependency needs separate approval, this round installed no new dependency and must not be presented as evidence of the official LangGraph SDK's capability. It is a safety baseline for AWOOOI's deterministic offline workflow-kernel adapter.

Metric	LangGraph offline kernel	OpenClaw same-run baseline
total_score	`0.4`	`0.6983`
incidents	`50`	`50`
hard gates	pass	failed: false repair
audit_trace_rate	`1.0`	`1.0`
false_repair_rate	`0.0`	`0.08`
rca_correct_rate	`0.0`	`0.1667`
repair_success_rate	`0.0`	`0.5385`
tool_dry_run_pass_rate	`0.0`	`0.8462`
promotion gate	blocked: `candidate_does_not_beat_baseline`	baseline only

Durable reports: docs/evaluations/agent_langgraph_replay_adapter_report_2026-06-02.json, docs/evaluations/agent_langgraph_replay_contract_2026-06-02.json, docs/evaluations/agent_langgraph_replay_grading_2026-06-02.json, docs/evaluations/agent_langgraph_replay_pipeline_2026-06-02.json, docs/evaluations/agent_langgraph_replay_scorecard_2026-06-02.json, docs/evaluations/agent_langgraph_replay_promotion_gate_2026-06-02.json, docs/evaluations/agent_langgraph_replay_summary_2026-06-02.json.

Verdict: a LangGraph-style workflow kernel is worth keeping as an orchestration candidate for its safety, state, and HITL shell. However, this round's deterministic adapter has no diagnosis or repair quality and did not beat OpenClaw, so it may not enter shadow/canary and may not replace OpenClaw. To evaluate LangGraph formally, the official SDK/dependency must first be approved, or the kernel must be paired with a stronger diagnostician, and the run must then be repeated under the same replay gate.

2026-06-02 OpenAI Agents SDK Coordinator Offline Replay Verdict

After the LangGraph offline replay was blocked, openai_agents_sdk_coordinator entered the same-case production replay as the next market candidate. The local repo environment has none of the openai, agents, openai_agents, or openai_agents_sdk packages installed; this round added no SDK/dependency and made no OpenAI API call. The official OpenAI docs were re-checked and confirm that the capability direction of Agents SDK / AgentKit matches the coordinator boundary AWOOOI wants to test: orchestration, tools, guardrails, handoff, trace/eval, and human approval. This round is nevertheless only AWOOOI's deterministic offline coordinator adapter, not evidence of the official OpenAI Agents SDK's capability.

Metric	OpenAI offline coordinator	OpenClaw same-run baseline
total_score	`0.4`	`0.6983`
incidents	`50`	`50`
hard gates	pass	failed: false repair
audit_trace_rate	`1.0`	`1.0`
false_repair_rate	`0.0`	`0.08`
rca_correct_rate	`0.0`	`0.1667`
repair_success_rate	`0.0`	`0.5385`
tool_dry_run_pass_rate	`0.0`	`0.8462`
promotion gate	blocked: `candidate_does_not_beat_baseline`	baseline only

Durable reports: docs/evaluations/agent_openai_coordinator_replay_adapter_report_2026-06-02.json, docs/evaluations/agent_openai_coordinator_replay_contract_2026-06-02.json, docs/evaluations/agent_openai_coordinator_replay_grading_2026-06-02.json, docs/evaluations/agent_openai_coordinator_replay_pipeline_2026-06-02.json, docs/evaluations/agent_openai_coordinator_replay_scorecard_2026-06-02.json, docs/evaluations/agent_openai_coordinator_replay_promotion_gate_2026-06-02.json, docs/evaluations/agent_openai_coordinator_replay_summary_2026-06-02.json.

Verdict: OpenAI Agents SDK remains one of the coordinator/orchestrator candidates most worth testing. This round's no-SDK/no-API deterministic adapter, however, only shows that AWOOOI's contract, handoff, guardrail, and trace boundaries can be connected; it does not show that the model or the official SDK beats OpenClaw. It may not enter shadow/canary and may not replace OpenClaw. A formal challenge requires prior approval of SDK installation, an OpenAI API cost estimate, data boundaries, and a security policy, followed by a rerun under the same replay gate.

2026-06-02 Claude Agent SDK Remediator No-SDK Replay Verdict

After the agent market integration review detected a Claude docs source change, claude_agent_sdk_remediator first completed a no-SDK/no-API deterministic offline remediator replay. The claude-agent-sdk package is visible locally at version 0.1.53, but this round did not use that SDK, did not call the Anthropic/Claude API, did not execute tools, did not edit files, and did not write to production. It only validates AWOOOI's remediation boundary and is not evidence of the official Claude SDK/API's capability.

Metric	Claude no-SDK remediator	OpenClaw same-run baseline
total_score	`0.4`	`0.6906`
hard_gates_pass	`true`	`false` (false repair)
audit_trace_rate	`1.0`	`1.0`
hitl_preserved_rate	`1.0`	`1.0`
false_repair_rate	`0.0`	`0.08`
promotion gate	`blocked`	baseline only

Durable reports: docs/evaluations/agent_claude_remediator_replay_adapter_report_2026-06-02.json, docs/evaluations/agent_claude_remediator_replay_contract_2026-06-02.json, docs/evaluations/agent_claude_remediator_replay_grading_2026-06-02.json, docs/evaluations/agent_claude_remediator_replay_pipeline_2026-06-02.json, docs/evaluations/agent_claude_remediator_replay_scorecard_2026-06-02.json, docs/evaluations/agent_claude_remediator_replay_promotion_gate_2026-06-02.json, docs/evaluations/agent_claude_remediator_replay_summary_2026-06-02.json.

Verdict: Claude Agent SDK Remediator is suitable as a DevOps/code remediation specialist candidate, but this round's deterministic adapter did not beat OpenClaw, so it may not enter shadow/canary and may not replace OpenClaw. A formal challenge requires prior approval of how the Claude SDK/API is used, a cost ceiling, data boundaries, secret isolation, and trace retention, followed by a rerun under the same replay gate.

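Problem Statement

How can two AIs collaborate in Telegram without:

Confused messages (who said what?)
Unclear responsibility (who made the decision?)
Infinite loops (each triggering the other)
Adding too much latency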

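Decision

Adopt an "arbitrate-execute" division of labor

OpenClaw = Arbitrator - decides the "why" and the "risk level"
Nemotron = Executor - decides the "how" and the "concrete commands"

Separation of responsibilities

Role	OpenClaw	Nemotron
Task	Root Cause Analysis	Tool Calling
Output	Risk level + responsible team + causal reasoning	kubectl commands + parameter validation
Model	Ollama/Gemini (RCA tasks)	Nemotron-mini (tool tasks)
Confidence	0-100% (AI analysis quality)	Validation status (✅/❌)
Fallback	Expert System rules	Gemini Tool Calling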

Flow design

1. Incident created
      ↓
2. OpenClaw.generate_incident_proposal()
   → Output: risk_level, reasoning, primary_responsibility
      ↓
3. Decide whether Nemotron is needed
   ├─ LOW risk → skip Nemotron
   └─ MEDIUM/HIGH/CRITICAL → call Nemotron
      ↓
4. NvidiaProvider.tool_call()
   → Output: tool_name, arguments, validation_status
      ↓
5. Combine results → push the Telegram card
      ↓
6. User approval → execute

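Trigger conditions

Risk level	OpenClaw	Nemotron	Reason
LOW	✅	❌	Low-risk operations do not need tool validation
MEDIUM	✅	✅	Tool validation is needed to confirm the operation is feasible
HIGH	✅	✅	High risk requires double validation
CRITICAL	✅	✅ + HITL	Dangerous operations require human intervention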
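
The trigger table above can be expressed as a small lookup. This is a sketch only; `RiskLevel`, `collaboration_plan`, and the returned key names are hypothetical. Unlike the specification code, which defaults a missing risk level to "low", this sketch raises on unknown values so that a malformed level cannot silently skip validation.

```python
from enum import Enum


class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def collaboration_plan(risk_level: str) -> dict[str, bool]:
    """Map a risk level to the components required by the trigger table."""
    level = RiskLevel(risk_level.strip().lower())  # ValueError on unknown levels
    return {
        "openclaw": True,  # OpenClaw always arbitrates
        "nemotron": level is not RiskLevel.LOW,  # tool validation from MEDIUM upward
        "mandatory_hitl": level is RiskLevel.CRITICAL,  # the "+ HITL" row of the table
    }
```

For instance, `collaboration_plan("HIGH")` enables both OpenClaw and Nemotron without the extra CRITICAL-only HITL requirement.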

Implementation specification

1. Extend TelegramMessage

from dataclasses import dataclass


@dataclass
class TelegramMessage:
    # Existing fields...

    # New fields for the Nemotron result
    nemotron_enabled: bool = False
    nemotron_tools: list[dict] | None = None  # tool-calling results
    nemotron_validation: str = ""  # "✅ 驗證通過" (validated) / "❌ 驗證失敗" (validation failed)
    nemotron_latency_ms: float = 0.0

2. Extend generate_incident_proposal

async def generate_incident_proposal_with_tools(
    self,
    incident_id: str,
    severity: str,
    signals: list[dict],
    affected_services: list[str],
) -> tuple[dict | None, str, bool]:
    """
    Phase 22: OpenClaw + Nemotron collaboration

    Returns:
        (proposal_dict, provider, success)
        proposal_dict gains:
        - nemotron_tools: tool-calling results
        - nemotron_validation: validation status
    """
    # Step 1: OpenClaw arbitration
    proposal, provider, success = await self.generate_incident_proposal(
        incident_id, severity, signals, affected_services
    )

    if not success:
        return proposal, provider, success

    # Step 2: decide whether Nemotron is needed
    risk_level = proposal.get("risk_level", "low").lower()
    if risk_level == "low":
        proposal["nemotron_enabled"] = False
        return proposal, provider, True

    # Step 3: Nemotron tool calling
    from src.services.nvidia_provider import get_nvidia_provider
    nvidia = get_nvidia_provider()

    # The prompt (in Chinese) asks for the kubectl operation matching the
    # analysis: incident, cause, target resource, and suggested action.
    tool_result = await nvidia.tool_call(
        messages=[{
            "role": "user",
            "content": f"""
根據以下分析，生成對應的 kubectl 操作：
- Incident: {incident_id}
- 原因: {proposal.get('reasoning', '')}
- 目標資源: {proposal.get('target_resource', '')}
- 建議操作: {proposal.get('action', '')}
"""
        }],
        tools=K8S_OPERATION_TOOLS,
    )

    # Step 4: validate the tool-calling result
    validation = await self._validate_tool_calls(tool_result.tool_calls)

    proposal["nemotron_enabled"] = True
    proposal["nemotron_tools"] = [
        {"tool": tc.tool_name, "args": tc.arguments, "valid": tc.valid}
        for tc in tool_result.tool_calls
    ]
    proposal["nemotron_validation"] = validation
    proposal["nemotron_latency_ms"] = tool_result.latency_ms

    return proposal, provider, True

3. Telegram card format

def format_with_nemotron(self) -> str:
    """Format the message including the Nemotron result."""

    # OpenClaw block (labels: confidence, responsible team, cause)
    openclaw_block = f"""
🤖 <b>OpenClaw 仲裁</b>
├ 📊 信心: {self.confidence_emoji} {self.confidence_pct}%
├ 👥 責任: {self.primary_responsibility}
└ 💡 原因: {self.root_cause[:50]}
"""

    # Nemotron block (only when enabled)
    nemotron_block = ""
    if self.nemotron_enabled and self.nemotron_tools:
        # args is a dict, so stringify it before truncating to 30 characters
        tools_str = "\n".join([
            f"  {'✅' if t['valid'] else '❌'} {t['tool']}: {str(t['args'])[:30]}"
            for t in self.nemotron_tools[:3]  # show at most 3 tools
        ])
        nemotron_block = f"""
━━━━━━━━━━━━━━━━━━━
🔧 <b>Nemotron 執行方案</b>
{tools_str}
└ 驗證: {self.nemotron_validation}
"""

    return f"{openclaw_block}{nemotron_block}"

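4. Asynchronous execution (non-blocking)

import asyncio

# Strong references so pending update tasks are not garbage-collected
_background_tasks: set[asyncio.Task] = set()


async def _push_decision_to_telegram_async(
    incident: Incident,
    proposal_data: dict,
) -> None:
    """
    Push asynchronously without blocking the main flow.

    Phase 22: when Nemotron latency is long (>10s), the OpenClaw result is
    pushed first and the Nemotron result is applied later via edit_message.
    """
    # Push the OpenClaw result first
    message_id = await gateway.send_approval_card(
        # ... OpenClaw result
    )

    # If Nemotron is needed, run it in the background and update the card.
    # Normalize case so "MEDIUM" and "medium" are treated the same.
    risk_level = str(proposal_data.get("risk_level", "low")).lower()
    if risk_level in {"medium", "high", "critical"}:
        task = asyncio.create_task(
            _update_with_nemotron_result(message_id, incident, proposal_data)
        )
        _background_tasks.add(task)
        task.add_done_callback(_background_tasks.discard)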

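Consequences

Positive

Clear division of labor: OpenClaw and Nemotron have distinct responsibilities
Traceable: each AI's contribution is displayed separately
Fault tolerance: a clear fallback chain (Nemotron → Gemini → Expert)
Performance: low-risk operations do not trigger Nemotron, which saves latency

Negative

Added latency: high-risk operations need two LLM rounds
Complexity: the message format must be extended

Risk mitigation

Risk	Mitigation
Nemotron latency of 11-45s	Run asynchronously; push the OpenClaw result first
Tool Calling failure	Fall back to Gemini; if that also fails, show only the OpenClaw result
Message too long	Abbreviate tool parameters; put the full content behind a SignOz link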
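
The tool-calling fallback row above could look roughly like the sketch below. It is an illustration under assumptions: `tool_call_with_fallback` and its callable parameters are hypothetical stand-ins for the real providers, and the 45-second default mirrors the upper end of the latency range quoted above.

```python
import asyncio


async def tool_call_with_fallback(primary, fallback, timeout_seconds: float = 45.0):
    """Try Nemotron, then Gemini; return (provider_name, result).

    Returns (None, None) when both fail, in which case the caller would
    show only the OpenClaw result.
    """
    for name, call in (("nemotron", primary), ("gemini", fallback)):
        try:
            # wait_for cancels the call and raises a timeout error when it
            # exceeds the budget; that error is caught below like any other.
            return name, await asyncio.wait_for(call(), timeout=timeout_seconds)
        except Exception:
            continue  # move on to the next provider in the chain
    return None, None
```

Catching `Exception` rather than `BaseException` keeps task cancellation intact, so a cancelled request is not mistaken for a provider failure.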

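Concurrency control (integration with ADR-038)

Chief Architect P1 required item (2026-03-31)

Dual-semaphore strategy

# Extension of apps/api/src/core/circuit_breaker.py
import asyncio


class OpenClawGuard:
    def __init__(self):
        self.openclaw_semaphore = asyncio.Semaphore(3)   # existing
        self.nemotron_semaphore = asyncio.Semaphore(2)   # new (the NVIDIA API is slower)

Design rationale:

Nemotron concurrency is capped at 2 (lower than OpenClaw's 3)
The NVIDIA NIM free tier has an RPM limit
Nemotron latency is high (11-45s), so more concurrency brings no benefit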
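
To illustrate how the Nemotron semaphore bounds in-flight calls, here is a self-contained sketch. `CollaborationGuard`, `run_nemotron`, and the fake call are hypothetical names for the demonstration; only the limits of 3 and 2 come from the design above.

```python
import asyncio


class CollaborationGuard:
    """Illustrative stand-in for the dual-semaphore guard."""

    def __init__(self, openclaw_limit: int = 3, nemotron_limit: int = 2) -> None:
        self.openclaw_semaphore = asyncio.Semaphore(openclaw_limit)
        self.nemotron_semaphore = asyncio.Semaphore(nemotron_limit)

    async def run_nemotron(self, call):
        # At most `nemotron_limit` calls hold the semaphore at once;
        # the rest wait here until a slot is released.
        async with self.nemotron_semaphore:
            return await call()


async def demo() -> int:
    """Fire 6 concurrent fake Nemotron calls and return the peak concurrency."""
    guard = CollaborationGuard()
    in_flight = 0
    peak = 0

    async def fake_nemotron_call() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulated provider latency
        in_flight -= 1

    await asyncio.gather(*(guard.run_nemotron(fake_nemotron_call) for _ in range(6)))
    return peak
```

Running `asyncio.run(demo())` shows the peak never exceeds the Nemotron limit, even though six calls were requested at once.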

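Parallel execution optimization

# Step 3 optimization: start Nemotron as soon as OpenClaw finishes
import asyncio

async def generate_incident_proposal_with_tools(
    self, incident_id, severity, signals, affected_services
):
    # Start OpenClaw as a task
    openclaw_task = asyncio.create_task(
        self.generate_incident_proposal(incident_id, severity, signals, affected_services)
    )

    # Wait for OpenClaw first, then decide whether Nemotron is needed
    proposal, provider, success = await openclaw_task

    if not success or proposal.get("risk_level", "low").lower() == "low":
        return proposal, provider, success

    # Nemotron is needed. It takes the OpenClaw proposal as input, so it can
    # only start once OpenClaw has completed; start it immediately.
    nemotron_result = await self._call_nemotron_tools(proposal)

    # Combine the results
    return self._combine_results(proposal, nemotron_result), provider, True

Latency comparison. The "parallel" column assumes the two calls overlap completely. Because the code above starts Nemotron only after OpenClaw returns its proposal, these figures are a best-case bound rather than what that code achieves as written:

Scenario	Serial	Parallel	Improvement
MEDIUM risk	3s + 15s = 18s	max(3s, 15s) = 15s	-3s
HIGH risk	5s + 30s = 35s	max(5s, 30s) = 30s	-5s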

Circuit Breaker integration

Two-layer Circuit Breaker coordination

┌─────────────────────────────────────────┐
│ OpenClawGuard (ADR-038)                 │
│ - Manages the request queue             │
│ - Long-term trip (5 minutes)            │
└─────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────┐
│ NvidiaProvider.CircuitBreaker           │
│ - Short-term NVIDIA API trip (60s)      │
│ - OPEN after 3 failures                 │
└─────────────────────────────────────────┘

Trip strategy

Layer	Trigger	Recovery time	Effect
OpenClawGuard	Queue full (>10)	5 minutes	Stops accepting new requests
NvidiaProvider	3 consecutive failures	60 seconds	Falls back to Gemini
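
A minimal sketch of the NvidiaProvider-level breaker follows, assuming a simplified two-state model (no half-open probing). The class and its methods are hypothetical and are not the project's actual CircuitBreaker; the threshold of 3 and the 60-second recovery come from the table above.

```python
import time


class SimpleCircuitBreaker:
    """Opens after N consecutive failures and closes again after a cool-down."""

    def __init__(self, failure_threshold: int = 3, recovery_seconds: float = 60.0,
                 clock=time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._recovery_seconds = recovery_seconds
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._recovery_seconds:
            # Cool-down elapsed: close the breaker and reset the counter.
            self._opened_at = None
            self._failures = 0
            return True
        return False  # still open: the caller should fall back to Gemini

    def record_success(self) -> None:
        self._failures = 0  # only consecutive failures count

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
```

A production breaker would normally add a half-open state that lets a single probe request through before fully closing; that is omitted here to keep the sketch short.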

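Feature Flag support

Chief Architect P1 required item

Environment variables

# Enable/disable Nemotron collaboration (default true)
ENABLE_NEMOTRON_COLLABORATION=true

# Nemotron call timeout (default 45s)
NEMOTRON_TIMEOUT_SECONDS=45

# Force asynchronous update (push OpenClaw first, update with Nemotron later)
NEMOTRON_ASYNC_UPDATE=true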
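
One way these variables could be read is sketched below. This is not the project's config.py: `NemotronSettings` and `load_nemotron_settings` are hypothetical, and the default of true for NEMOTRON_ASYNC_UPDATE is an assumption, since the text states defaults only for the other two variables.

```python
import os
from dataclasses import dataclass

_TRUE_VALUES = {"1", "true", "yes", "on"}


def _env_bool(env, name: str, default: bool) -> bool:
    """Parse a boolean flag; unset variables fall back to the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class NemotronSettings:
    enable_collaboration: bool
    timeout_seconds: float
    async_update: bool


def load_nemotron_settings(env=None) -> NemotronSettings:
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return NemotronSettings(
        enable_collaboration=_env_bool(env, "ENABLE_NEMOTRON_COLLABORATION", True),
        timeout_seconds=float(env.get("NEMOTRON_TIMEOUT_SECONDS", "45")),
        # Assumed default: the ADR does not state one for this flag.
        async_update=_env_bool(env, "NEMOTRON_ASYNC_UPDATE", True),
    )
```

Passing a plain dict, for example `load_nemotron_settings({"ENABLE_NEMOTRON_COLLABORATION": "false"})`, makes the rollback path easy to exercise in tests without touching the real environment.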

Rollback plan

async def generate_incident_proposal_with_tools(
    self, incident_id, severity, signals, affected_services
):
    # Feature Flag check
    if not settings.ENABLE_NEMOTRON_COLLABORATION:
        return await self.generate_incident_proposal(...)  # original flow

    # ... collaboration logic

Rollback steps:

Set ENABLE_NEMOTRON_COLLABORATION=false
Rollout restart awoooi-api
No code rollback is needed

DI pattern refactor

Chief Architect P1 required item - avoid imports inside functions

Before (❌ violates DI)

# Step 3: Nemotron Tool Calling
from src.services.nvidia_provider import get_nvidia_provider  # ❌ import inside a function
nvidia = get_nvidia_provider()

After (✅ DI pattern)

# apps/api/src/services/openclaw.py
from src.services.nvidia_provider import INvidiaProvider, get_nvidia_provider

class OpenClawService:
    def __init__(
        self,
        nvidia_provider: INvidiaProvider | None = None,  # injected dependency
    ):
        # Fall back to the default provider when nothing is injected
        self._nvidia = nvidia_provider or get_nvidia_provider()

    async def generate_incident_proposal_with_tools(
        self,
        incident_id: str,
        severity: str,
        signals: list[dict],
        affected_services: list[str],
    ) -> tuple[dict | None, str, bool]:
        # ... use self._nvidia instead of importing inside the function
        ...

Test strategy

E2E test cases

# tests/test_openclaw_nemotron_collaboration.py
import os
from unittest.mock import AsyncMock, patch

import pytest

# `openclaw` is assumed to be an OpenClawService instance provided by the test setup.

@pytest.mark.asyncio
async def test_low_risk_skips_nemotron():
    """LOW risk does not trigger Nemotron."""
    result = await openclaw.generate_incident_proposal_with_tools(...)
    assert result[0]["nemotron_enabled"] is False

@pytest.mark.asyncio
async def test_medium_risk_enables_nemotron():
    """MEDIUM risk enables Nemotron."""
    result = await openclaw.generate_incident_proposal_with_tools(...)
    assert result[0]["nemotron_enabled"] is True
    assert result[0]["nemotron_tools"] is not None

@pytest.mark.asyncio
async def test_nemotron_failure_fallback():
    """Falls back to Gemini when Nemotron fails."""
    # Simulate an NVIDIA failure on the injected provider (tool_call is async)
    failing = AsyncMock(side_effect=Exception("NVIDIA down"))
    with patch.object(openclaw._nvidia, "tool_call", new=failing):
        result = await openclaw.generate_incident_proposal_with_tools(...)
        # A result should still exist (from the Gemini fallback)
        assert result[2] is True

@pytest.mark.asyncio
async def test_feature_flag_disabled():
    """The original flow is used when the Feature Flag is disabled."""
    # Assumes settings read the environment at call time, not only at import
    with patch.dict(os.environ, {"ENABLE_NEMOTRON_COLLABORATION": "false"}):
        result = await openclaw.generate_incident_proposal_with_tools(...)
        assert "nemotron_enabled" not in result[0]

Integration test

@pytest.mark.integration
def test_telegram_message_with_nemotron():
    """The Telegram message contains the Nemotron block."""
    msg = TelegramMessage(
        nemotron_enabled=True,
        nemotron_tools=[{"tool": "restart_deployment", "args": {...}, "valid": True}],
    )
    formatted = msg.format_with_nemotron()
    assert "Nemotron 執行方案" in formatted
    assert "✅ restart_deployment" in formatted

Implementation schedule (detailed)

Phase	Content	Time	File	Depends on
22.1	TelegramMessage extension	2h	`telegram_gateway.py`	None
22.2a	OpenClawGuard dual semaphore	1h	`circuit_breaker.py`	None
22.2b	DI pattern refactor	1h	`openclaw.py`	22.2a
22.2c	`generate_incident_proposal_with_tools`	2h	`openclaw.py`	22.2a, 22.2b
22.3a	Feature Flag support	1h	`config.py`	None
22.3b	Asynchronous push logic	2h	`decision_manager.py`	22.1, 22.2c
22.4a	Unit tests	2h	`test_openclaw_nemotron*.py`	22.2c
22.4b	E2E tests	2h	`test_e2e_collaboration.py`	22.3b
Total		13h (~1.5 days)

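Chief Architect Review Conclusion

Review date: 2026-03-31 (Taipei time)
Score: 83/100 → conditionally approved

P1 required items (added)

No.	Item	Status
P1-1	Concurrency control integration	✅ Added
P1-2	DI pattern	✅ Added
P1-3	Feature Flag	✅ Added

P2 recommended items (later iterations)

No.	Item	Notes
P2-1	Parallel optimization	Included in the design
P2-2	Pydantic Model	Phase 22.5
P2-3	NemotronBlock	Phase 22.5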
