Commit Graph

1465 Commits

Author SHA1 Message Date
OG T
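cd1c0ffdb8 fix(openclaw): call() returns 3 values instead of 5, fixing the root cause of all AI analysis degradation
Root cause: _call_with_fallback() returns a 5-tuple, but call() returned it directly,
so callers (diagnostician/solver/critic agents) unpacking into 3 variables hit
→ ValueError: too many values to unpack → all analyses degraded to 20%

Fix: call() explicitly unpacks the result and returns (response, provider, success)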

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:27:48 +08:00
OG T
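5e4dbbbb41 fix(alertmanager): point webhook URL at VIP 192.168.0.125:32334
Root cause: Alertmanager was calling 120:32334 → Connection Refused
      Direct NodePort access on 120/121 does not work; only VIP 125:32334 is reachable
Impact: alerts could not reach the AWOOOI API at all; the pipeline failed silently (since 2026-04-12)
Fix: url → http://192.168.0.125:32334/api/v1/webhooks/alertmanager
Verification: manually injected a test alert; the API received it and triggered the full LLM analysis flow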

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:19:58 +08:00
AWOOOI CD
9a4fa5edf5 chore(cd): deploy 27ba97e [skip ci] 2026-04-15 19:12:19 +00:00
OG T
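62e2efda85 fix(heartbeat): restore the 30-minute heartbeat report to the SRE war room
The reason given for disabling it on 2026-04-15 (forwarded_to_separate_group) was wrong;
the SRE war room is SRE_GROUP_CHAT_ID and should not have been disabled.
Restores start_heartbeat_monitor(interval=30min).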

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:05:01 +08:00
OG T
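5ee76dc30d fix(cd): CI/CD Telegram notifications now use a structured HTML format
Deploy Start / Failure messages change from the plain-text pipe format to:
  🚀 AWOOOI deployment started
  ├ 📝 <commit>
  ├ 🔖 <sha>
  └ 👤 <actor>

The commit message is HTML-escaped to guard against special characters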

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:04:23 +08:00
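A small sketch of the escaping step described in 5ee76dc30d, assuming the notification is sent with Telegram's HTML parse mode. The function name `format_deploy_start` and the `<b>`/`<code>` markup are illustrative assumptions; only the tree layout and the HTML escaping of the commit message come from the commit.

```python
import html


def format_deploy_start(commit_message: str, sha: str, actor: str) -> str:
    # Escape user-controlled text so characters such as <, > and & in a
    # commit message cannot break the HTML-formatted notification.
    safe_message = html.escape(commit_message)
    return "\n".join(
        [
            "🚀 <b>AWOOOI deployment started</b>",
            f"├ 📝 {safe_message}",
            f"├ 🔖 <code>{html.escape(sha)}</code>",
            f"└ 👤 {html.escape(actor)}",
        ]
    )


text = format_deploy_start("fix: handle a <b> & c", "27ba97e", "ogt")
```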
OG T
27ba97e586 fix(ollama): remove all hard-coded 188:11434 fallbacks and point everything at the 111 GPU
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 15m59s
- decision_manager.py: two getattr fallbacks 188 → 111
- routes/agent.py: OLLAMA_BASE_URL 188 → 111
- knowledge_extractor_service.py: _OLLAMA_BASE 188 → 111

config.py already defaulted to 111; this change removes the hard-coded 188 values left in the code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:01:31 +08:00
OG T
5f9c9d84a2 fix(configmap): point Ollama at the 111 GPU + adjust fallback order
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- OLLAMA_URL: 188 (CPU-only) → 111 (RTX GPU, avg 10s)
- AI_FALLBACK_ORDER: nvidia→gemini→ollama→claude
  changed to ollama→nvidia→gemini→claude
  Local GPU first, external APIs as backup, cloud Claude as the last resort

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:00:16 +08:00
OG T
7e3cc8b3b0 fix(agents): remove the artificial per-agent timeout; the LLM must be awaited until its full response
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
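The original asyncio.wait_for(timeout_sec=25s) wrapper was an arbitrary cutoff:
whenever the LLM exceeded the limit, the result degraded to confidence=20% with no analysis at all.

Correct approach:
- Remove the asyncio.wait_for() wrapper from all 4 agents
- Keep only except Exception to catch real failures (connection failure, model crash)
- The whole flow is protected against hangs by the Orchestrator's GLOBAL_TIMEOUT_SEC=90s
- Remove the now-obsolete _PER_AGENT_TIMEOUT_SEC constant

Impact: the LLM is awaited for as long as inference takes, with no artificial cutoff,
      so models such as deepseek-r1:14b can output complete analysis results.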

2026-04-16 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:54:34 +08:00
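The timeout layout in 7e3cc8b3b0 can be sketched roughly as follows: agents run without their own `asyncio.wait_for()` and only the orchestrator applies a single global limit. `GLOBAL_TIMEOUT_SEC=90` is taken from the commit; `run_agent`, `orchestrate`, and the result dictionary shape are hypothetical names for illustration.

```python
import asyncio

GLOBAL_TIMEOUT_SEC = 90.0  # value quoted in the commit message


async def run_agent(name, llm_call):
    # No per-agent wait_for: wait for the complete LLM response and
    # degrade only on real failures (connection error, model crash).
    try:
        return {"agent": name, "result": await llm_call(), "degraded": False}
    except Exception as exc:
        return {"agent": name, "result": str(exc), "degraded": True}


async def orchestrate(agents, global_timeout=GLOBAL_TIMEOUT_SEC):
    async def pipeline():
        # Agents run sequentially, as in the original description.
        return [await run_agent(name, fn) for name, fn in agents]

    # One global guard prevents the whole flow from hanging forever.
    return await asyncio.wait_for(pipeline(), timeout=global_timeout)
```

When the global limit fires, `asyncio.wait_for` cancels the pipeline; the cancellation is not an `Exception` subclass, so the `except Exception` in `run_agent` does not swallow it and the caller sees the timeout.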
OG T
5a3a649f8a fix(decision): fix two root causes of repeated TYPE-1 alert flooding
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
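Root cause 1: the TYPE-1 bypass ran before the existing_token check
→ every get_or_create_decision() call pushed to TG regardless of whether a token existed
→ Fix: move the existing_token check ahead of the TYPE-1 bypass (single entry point)

Root cause 2: the TYPE-1 token TTL was only 3600s
→ the token expired after 1h, so the next scan recreated it and pushed to TG again
→ Fix: raise the TYPE-1 token TTL to 86400s (24h)

Impact: TYPE-1 alerts such as HostBackupFailed are pushed only once per incident (within 24h)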

2026-04-16 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:49:31 +08:00
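A simplified sketch of the ordering fix in 5a3a649f8a, using an in-memory store in place of Redis. The 86400s TYPE-1 TTL and the 3600s default TTL come from the commit messages in this log; `TokenStore`, the key format, the `push` callback, and the return strings are hypothetical.

```python
import time

TYPE1_TOKEN_TTL_SEC = 86400  # 24h, per the commit message
DEFAULT_TOKEN_TTL_SEC = 3600


class TokenStore:
    """In-memory stand-in for the Redis token store (hypothetical)."""

    def __init__(self):
        self._expiry = {}

    def exists(self, key, now):
        return self._expiry.get(key, 0) > now

    def set(self, key, ttl, now):
        self._expiry[key] = now + ttl


def get_or_create_decision(store, incident_id, is_type1, push, now=None):
    now = time.time() if now is None else now
    key = f"decision:{incident_id}"  # illustrative key, not the real format
    # Single entry point: the existing-token check runs before any bypass.
    if store.exists(key, now):
        return "existing"
    if is_type1:
        store.set(key, TYPE1_TOKEN_TTL_SEC, now)
        push(incident_id)  # pushed once per token lifetime
        return "type1_pushed"
    store.set(key, DEFAULT_TOKEN_TTL_SEC, now)
    return "created"
```

With the check first and a 24h TTL, a scan two hours after the first push finds the token and returns without pushing again.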
OG T
62bcc50770 fix(tg+km): complete Telegram operation-record disclosure and fix KM categorization (ADR-076)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
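New fields in Telegram messages:
- alert_category category label (🏷️ host/K8s/database/service, etc.)
- playbook_name shows the name of the matched Playbook
- Frequency stats threshold lowered from count_24h>1 to >=1 (first-time alerts are shown too)
- TelegramMessage gains alert_category/playbook_name fields
- decision_manager → send_approval_card passes playbook_name through

KM fixes:
- EntryType.PLAYBOOK → EntryType.AUTO_RUNBOOK (the former does not exist and raises AttributeError)
- category "auto_generated" → "ai_system" (the frontend i18n has a matching translation)
- runbook_generator category corrected to match
- Push a Telegram notification after a KM entry is created (best-effort)

New DB decision_chain fields:
- Added playbook_id / playbook_name / alert_category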

2026-04-16 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:46:17 +08:00
AWOOOI CD
44ecf609e0 chore(cd): deploy 9538f6c [skip ci] 2026-04-15 18:39:05 +00:00
OG T
9538f6cca4 fix(agents): fix the 5s agent timeout that caused all LLM inference to fail
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 16m14s
Root cause: deepseek-r1:14b inference was measured at 2.2-27.3s, avg 10.6s,
but Diagnostician/Critic/Solver/Reviewer all used timeout_sec=5.0 (a dev-machine test value)
→ 67% of agent inference timed out → degraded to confidence=20% → auto-remediation never triggered

Fix:
- _PER_AGENT_TIMEOUT_SEC: 5s → 25s (roughly a 2.3x buffer over avg)
- GLOBAL_TIMEOUT_SEC: 30s → 90s (3 sequential agents × 25s + buffer)
- Pass timeout_sec explicitly to all 4 agent calls

Expected effect: for normal alerts, AI analysis confidence ≥ 0.5 → auto-remediation triggers

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:28:05 +08:00
OG T
a07daf7e3f fix(incidents): add a 48h age filter to GET /incidents to stop old incidents from repeatedly triggering AI analysis
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
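Root cause: DECISION_TOKEN_TTL=3600s → tokens for old incidents expire every hour
→ GET /api/v1/incidents repeatedly triggers get_or_create_decision → OPENCLAW_NEMO timeout
→ Expert System fallback (confidence=20%) → Telegram flood

Fix: trigger background AI analysis only for incidents whose created_at is within 48h;
incidents older than 48h no longer trigger it (they still appear in the list but are not re-analyzed)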

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:21:53 +08:00
OG T
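e8bf37cfd9 docs(logbook): final confirmation that the f5e33da2 E2E pipeline works across all nodes (2026-04-16)
- CI 894 finished; f5e33da2 is deployed
- flywheel outcome column fix confirmed
- telegram _send_request fix confirmed (zero AttributeError)
- Sweeper: 20/20 incidents from the last 48h carry the sweeper_done marker
- E2E pipeline: full 7-node flow confirmed (36 incidents)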

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:18:45 +08:00
AWOOOI CD
381be78344 chore(cd): deploy f5e33da [skip ci] 2026-04-15 17:55:11 +00:00
OG T
588ecfd940 docs(logbook): 2026-04-16 E2E all-node verification + production bug fix record
2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:46:39 +08:00
OG T
f5e33da2fc fix(telegram): fix the _make_request → _send_request method name mismatch
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 15m48s
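7 call sites invoked _make_request, but the method is actually named _send_request,
causing telegram_decision_push_failed errors after the sweeper finished its analysis.

Affected methods: send_push_notification, send_drift_card, and others in the ADR-071 series.
_send_request is defined at line 1272 and is already covered by OTEL tracing.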

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:44:29 +08:00
OG T
1e86cc2896 fix(flywheel): fix the wrong column name incidents.outcomes → outcome
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
GET /api/v1/stats/summary raised UndefinedColumnError: column "outcomes" does not exist
The actual column is incidents.outcome (json type)

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:42:14 +08:00
AWOOOI CD
644cae33c3 chore(cd): deploy 9bfa6fc [skip ci] 2026-04-15 17:37:10 +00:00
OG T
9bfa6fc045 fix(sweeper): scan only incidents from the last 48h to keep old cases from flooding Telegram
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
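Problem:
  On first deployment the sweeper found 117 old incidents without a sweeper_done: marker
  (oldest 2026-04-09, 7 days earlier) → LLM analysis triggered for all of them
  Old incident data format → OPENCLAW_NEMO timeout → Expert System degradation
  confidence=0.2 "degraded" → Telegram flooded with identically formatted alerts

Fix:
  Add a _MAX_INCIDENT_AGE_HOURS=48 filter
  Process only INVESTIGATING incidents from the last 48h
  Make created_at timezone-safe (naive → UTC)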

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:27:02 +08:00
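A minimal sketch of the age filter from 9bfa6fc045, including the naive-to-UTC normalization. `_MAX_INCIDENT_AGE_HOURS=48` comes from the commit; the helper name `is_recent` and its signature are assumptions.

```python
from datetime import datetime, timedelta, timezone

_MAX_INCIDENT_AGE_HOURS = 48  # value from the commit message


def is_recent(created_at, now=None):
    now = now or datetime.now(timezone.utc)
    if created_at.tzinfo is None:
        # Treat naive timestamps as UTC so that aware and naive values
        # can be compared without raising TypeError.
        created_at = created_at.replace(tzinfo=timezone.utc)
    return now - created_at <= timedelta(hours=_MAX_INCIDENT_AGE_HOURS)


now = datetime(2026, 4, 16, tzinfo=timezone.utc)
recent = is_recent(datetime(2026, 4, 15), now)                     # naive input
stale = is_recent(datetime(2026, 4, 9, tzinfo=timezone.utc), now)  # 7 days old
```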
OG T
0760315059 fix(decision-manager): fix CAST syntax + turn off shadow_mode
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
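Root cause of decision_chain_persist_failed:
  asyncpg does not support the :dc::json syntax (:: clashes with the : of named parameters)
  Changed to CAST(:dc AS jsonb), the standard form for asyncpg

configmap:
  AIOPS_P4_SHADOW_MODE: true → false
  Enables real proactive monitoring (proactive_inspector output is read by PreDecisionInvestigator)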

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:24:48 +08:00
OG T
20b3fefca7 fix(sweeper): fix decision key format bug (decision:INC-* → sweeper_done:INC-*)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
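Root cause:
  The actual decision token key format is decision:DEC-{HEX12}
  The sweeper wrongly queried decision:{incident_id} (always = 0)
  → every 90s all 186 incidents were listed as "unanalyzed"
  → triggering a large number of duplicate AI analysis requests (although get_or_create_decision has dedup protection)

Fix:
  Use a lightweight sweeper_done:{incident_id} marker instead (TTL 1h)
  Set the marker only after analysis completes, so failed incidents are retried on the next round
  get_or_create_decision already dedups on COMPLETED/READY internally, giving two layers of protection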

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:20:16 +08:00
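The marker-after-success rule in 20b3fefca7 can be illustrated with a plain set standing in for the Redis keys. The `sweeper_done:{incident_id}` key format is from the commit; the `sweep` function, its parameters, and the omission of the 1h TTL are simplifications for illustration.

```python
def sweep(incident_ids, done_markers, analyze):
    """Analyze unmarked incidents; return the ones left for the next round."""
    retry_next_round = []
    for incident_id in incident_ids:
        key = f"sweeper_done:{incident_id}"
        if key in done_markers:
            continue  # already analyzed
        try:
            analyze(incident_id)
        except Exception:
            # No marker is written, so the next sweep retries this incident.
            retry_next_round.append(incident_id)
            continue
        done_markers.add(key)  # set only after the analysis completed
    return retry_next_round
```

Writing the marker before the analysis would hide failed incidents from later rounds, which is why the order matters here.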
OG T
bb7441ec8a docs: update CLAUDE.md + HARD_RULES.md v2.0 + LOGBOOK (2026-04-16)
- HARD_RULES.md v2.0: adds Self-Loop Workflow, Circuit Breaker Exception, State & Flow Validation
- CLAUDE.md: adds the §4 required-reading Memory table
- LOGBOOK: records AIOps E2E fix progress

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:20:16 +08:00
AWOOOI CD
3fc2c41216 chore(cd): deploy 457018c [skip ci] 2026-04-15 17:18:47 +00:00
OG T
457018c0f9 fix(decision-manager): write AI analysis results to incidents.decision_chain (long-term DB storage)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Gap fixed: the decision token lived only in Redis with a 12min TTL, so AI diagnosis history was permanently lost
  - Add a _persist_decision_to_db() method
  - After get_or_create_decision() completes, write to PG fire-and-forget
  - Written fields: ts / confidence / risk_level / provider / source / diagnosis[:200]
  - try/except swallows errors so the main flow is unaffected; a warning log tracks them

DB/cache layering:
  PG (long-term): incidents.decision_chain (history) + outcomes + KM entries
  Redis (short-term): decision token dedup + working memory + playbook cache

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:08:30 +08:00
OG T
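ce1a4d286e feat(sweeper): add Incident Analysis Sweeper to automatically trigger AI decisions for unanalyzed incidents
Gap fixed:
  After the Signal Worker creates an incident, AI analysis was triggered only when GET /api/v1/incidents was called
  If nobody visited the frontend, new incidents never got AI analysis or a Telegram notification

Solution:
  Add src/jobs/incident_analysis_sweeper.py
  Every 90 seconds, scan for INVESTIGATING incidents without a decision token
  Trigger get_or_create_decision() automatically in the background: Semaphore(3) rate limit, at most 5 per batch
  Mounted via asyncio.create_task() at main.py lifespan startup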

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:08:30 +08:00
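A rough sketch of one sweep pass as described in ce1a4d286e: at most 5 incidents per batch, with at most 3 analyses running concurrently. The limits come from the commit; `sweep_once`, `BATCH_SIZE`, `MAX_CONCURRENCY`, and the `analyze` callback are hypothetical names, and the 90-second scheduling loop is left out.

```python
import asyncio

BATCH_SIZE = 5       # "at most 5 per batch" in the commit message
MAX_CONCURRENCY = 3  # Semaphore(3) in the commit message


async def sweep_once(pending_ids, analyze):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def guarded(incident_id):
        # The semaphore caps how many analyses run at the same time.
        async with semaphore:
            return await analyze(incident_id)

    batch = pending_ids[:BATCH_SIZE]
    # return_exceptions=True keeps one failing incident from cancelling
    # the rest of the batch.
    return await asyncio.gather(
        *(guarded(incident_id) for incident_id in batch),
        return_exceptions=True,
    )
```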
AWOOOI CD
34dd20298a chore(cd): deploy d258a1f [skip ci] 2026-04-15 16:22:45 +00:00
OG T
d258a1fb87 test(ai-router): update the DIAGNOSE routing test, None → OPENCLAW_NEMO
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m52s
test_diagnose_override_is_none → test_diagnose_override_is_openclaw_nemo
Matches the ai_router.py DIAGNOSE routing fix (root-cause fix for the Ollama 238s timeout)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:13:00 +08:00
OG T
d4fed639f6 fix(ai-router): restore OPENCLAW_NEMO routing for DIAGNOSE, fixing all timeout degradation
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Root cause: the 2026-04-12 patch changed DIAGNOSE to None (complexity routing)
→ fell through to Rule 6 → Ollama deepseek-r1:14b (CPU 238s) → timeout
→ degraded to 20% confidence + "pending analysis" → everything unknown

Fix:
1. ai_router.py: DIAGNOSE → OPENCLAW_NEMO (via 188:8088 NVIDIA NIM, 2-27s)
2. ai_router.py: remove the DIAGNOSE deepseek special case from Rule 6 (no longer needed)
3. 04-configmap.yaml: AI_FALLBACK_ORDER now puts nvidia first
   gemini→ollama→nvidia (old) → nvidia→gemini→ollama (new)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:03:04 +08:00
AWOOOI CD
b55575b56b chore(cd): deploy c9efaa3 [skip ci] 2026-04-15 15:59:47 +00:00
OG T
c9efaa3740 fix(playbook-seed): fix the get_db_context import path (db.session → db.base)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause of the silent failure at seed startup:
  from src.db.session import get_db_context  ← module does not exist
  from src.db.base import get_db_context     ← correct path

This bug prevented yaml_rule playbooks from being created at all.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:49:56 +08:00
AWOOOI CD
7d3391cb69 chore(cd): deploy 800ab16 [skip ci] 2026-04-15 15:41:49 +00:00
OG T
800ab1685f fix(playbook+flywheel): fix PlaybookSource enum + repair_steps compatibility + KM stats raw SQL
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 14m58s
Type Sync Check / check-type-sync (push) Failing after 1m17s
Fixes three chained bugs so the Playbook seed can run normally:

1. PlaybookSource gains a YAML_RULE enum value (dedicated to alert_rules.yaml imports)
2. playbook_seed_service: source=YAML_RULE; dedup now uses raw SQL by name
   and no longer calls list_playbooks (old-format repair_steps cause a validation error)
3. playbook_repository._orm_to_pydantic: fill in the required step_number/action_type
   fields for old-format repair_steps (backward compatible)
4. flywheel_stats_service: embedding IS NULL now uses raw SQL,
   fixing the AttributeError caused by the KnowledgeEntryRecord ORM lacking an embedding attribute

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:32:04 +08:00
AWOOOI CD
4bee14ae08 chore(cd): deploy 77a92eb [skip ci] 2026-04-15 14:39:13 +00:00
OG T
77a92eb469 feat(P6): commit offline_replay_service + model_rollback_service (previously missed)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m59s
These are the two core services of the Phase 6 ADR-087 governance loop;
they were never git add-ed after being created and had stayed untracked.

2026-04-15 Claude Sonnet 4.6 Asia/Taipei
2026-04-15 22:29:09 +08:00
OG T
85c4e3b434 fix(km): fix the root cause of all KM writes being unknown (three points)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Bug-A: approval_execution.py called the nonexistent get_incident() → AttributeError
swallowed by except → alertname/alert_category/affected_services all fell back to defaults
Fix: use the dual path get_from_working_memory() + get_from_episodic_memory() instead

Bug-B: _record_to_incident() dropped the notification_type + alert_category
fields when restoring an Incident from PG → km_conversion read None
Fix: restore these two fields as well

Bug-C: main.py working_memory_warmup likewise omitted
notification_type + alert_category when rebuilding an Incident
Fix: added them there too

2026-04-15 Claude Sonnet 4.6 Asia/Taipei
2026-04-15 22:28:48 +08:00
AWOOOI CD
65c8eb587c chore(cd): deploy 256a24e [skip ci] 2026-04-15 14:20:27 +00:00
OG T
256a24e843 fix(deps+startup): add drain3/statsmodels to pyproject + warmup skips old data
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 17m23s
- pyproject.toml: add drain3>=0.9.11, statsmodels>=0.14.0, sse-starlette
  → Docker build installs from pyproject, so packages listed only in requirements.txt were never installed in the image
  → eliminates the 400 drain3_not_available alerts from the P4 LogAnomalyDetector
- main.py: working_memory warmup gets a per-record try/except
  → old incidents with an invalid source (node-exporter) are skipped instead of crashing the whole warmup

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:08:13 +08:00
OG T
c05bac6112 fix(playbook): seed tuple unpack + text[] → jsonb migration
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_seed_service.py: list_playbooks returns tuple[list, int];
  the missing unpack caused 'list' has no attribute 'source'
- fix_playbooks_array_to_jsonb.sql: source_incident_ids/tags text[] → jsonb
  (already applied manually to the prod DB)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:03:59 +08:00
OG T
da871fc149 chore(db): add the missing AIOps P1/P2/P6 migration SQL (already applied to prod)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Three tables: incident_evidence / agent_sessions / ai_governance_events.
Uses IF NOT EXISTS; manually confirmed present in the production DB and applied.

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:02:17 +08:00
OG T
76558a3cd9 feat(AIOps): enable all P1-P6 feature flags + Nemotron + offline replay loop
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- configmap: enable all AIOPS_P1~P6 master switches and sub-switches
- configmap: ENABLE_NEMOTRON_COLLABORATION=true (back to the 120s timeout)
- feature_flags.py: add the missing AIOPS_P6_GOVERNANCE_ENABLED field
- main.py: mount run_offline_replay_loop (ADR-087 Phase 6)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:59:51 +08:00
OG T
ecfb7148bf fix(prod): connect the YAML rule engine to the auto-execution path (core architectural break)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
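Root cause of the architectural break:
  The YAML rule engine (alert_rules.yaml) is the human-reviewed, authoritative source of actions,
  but the auto-execution path read only proposal_data["kubectl_command"] (LLM-generated).
  The two were completely disconnected → HostHighCpuLoad got a kubectl restart, and the SSH command
  for DockerContainerUnhealthy was overwritten by the LLM's kubectl.

Fix strategy:
  At the auto_execute entry point, look up the YAML match_rule first:
  1. YAML → NO_ACTION (e.g. HostHighCpuLoad) → return immediately without executing anything
  2. YAML → non-kubectl command (e.g. ssh docker restart) → override the LLM action,
     so that the downstream infrastructure SSH routing can take effect

Impact:
  - HostHighCpuLoad / NodeCPUUsageHigh → auto-execution stops; downgraded to manual review
  - DockerContainerUnhealthy → SSH docker restart (if labels include host/container)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): third batch of emergency production fixes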

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:50:25 +08:00
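The precedence rule in ecfb7148bf might look roughly like the sketch below, where a plain dictionary stands in for the rules loaded from alert_rules.yaml. Only `NO_ACTION`, `proposal_data["kubectl_command"]`, and the two numbered cases come from the commit; `resolve_action`, the `yaml_rules` mapping, the example commands, and the fall-through to the LLM proposal when no rule matches are assumptions.

```python
NO_ACTION = "NO_ACTION"


def resolve_action(alertname, proposal_data, yaml_rules):
    """Return the command to auto-execute, or None for manual review."""
    rule_action = yaml_rules.get(alertname)
    if rule_action == NO_ACTION:
        # Case 1: the reviewed rule says do nothing, so return immediately.
        return None
    if rule_action and not rule_action.startswith("kubectl"):
        # Case 2: a non-kubectl rule command overrides the LLM proposal.
        return rule_action
    # Assumption: otherwise fall back to the LLM-generated command.
    return proposal_data.get("kubectl_command")


# Hypothetical rules and proposal, for illustration only.
rules = {
    "HostHighCpuLoad": NO_ACTION,
    "DockerContainerUnhealthy": "ssh host-1 docker restart web",
}
proposal = {"kubectl_command": "kubectl rollout restart deployment/web"}
```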
OG T
3696fb5938 fix(prod): fix host_resource alerts wrongly issuing K8s kubectl + auto-execution repeat storm
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
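1. decision_manager: host_resource alerts (HostHighCpuLoad, etc.)
   must not run kubectl operations → downgraded to manual review
   Root cause: only infrastructure was blocked, so host_resource leaked into the K8s path
   → kubectl rollout restart deployment/HostHighCpuLoad was actually executed

2. decision_manager: add a Redis cooldown to the auto_execute path
   At most 2 automatic executions per target within 5 minutes, preventing the awoooi-worker 3x storm
   Root cause: the decision_manager auto-execution path had no cooldown protection at all

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): second batch of emergency production fixes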

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:45:46 +08:00
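A minimal in-memory sketch of the cooldown rule from 3696fb5938 (at most 2 automatic executions per target within 5 minutes). The commit uses Redis; this version keeps timestamps in a per-target deque instead, and the class name `AutoExecuteCooldown` and its `allow` method are hypothetical.

```python
import time
from collections import defaultdict, deque

COOLDOWN_WINDOW_SEC = 300  # 5 minutes, per the commit message
MAX_AUTO_EXECUTIONS = 2    # per target within the window


class AutoExecuteCooldown:
    def __init__(self):
        self._history = defaultdict(deque)

    def allow(self, target, now=None):
        now = time.time() if now is None else now
        history = self._history[target]
        # Drop executions that have fallen out of the sliding window.
        while history and now - history[0] >= COOLDOWN_WINDOW_SEC:
            history.popleft()
        if len(history) >= MAX_AUTO_EXECUTIONS:
            return False  # cooling down: hand over to manual review
        history.append(now)
        return True
```

A third request for the same target inside the window is refused, while other targets are unaffected.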
OG T
67f437043a fix(prod): fix four fatal production bugs: outcome writes / OpenClaw / Telegram notifications / LLM rule display
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. decision_manager: remove the verification_result column from UPDATE incidents
   (the incidents table has no such column → every outcome write failed with outcome_write_failed)

2. failure_watcher: get_openclaw_service → get_openclaw
   (wrong function name → all OpenClaw analysis crashed with ImportError)

3. failure_watcher: tg.send_message → tg.send_notification
   (TelegramGateway has no send_message method → remediation notifications could not be sent)

4. decision_manager: expert_analyze now includes the initial_diagnosis / diagnosis_description keys
   (openclaw.py reads these two keys, but expert_analyze provided only matched_rule / description
    → the LLM always saw Matched Rule=unknown and could not analyze correctly)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): emergency production fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:41:31 +08:00
OG T
e465ee1936 docs(Phase 3): Evolver drill complete, exit condition #6 passed
- MASTER spec §3/§7/§8: Evolver drill checked off in three places
- LOGBOOK: record the drill results + update the next step to 7 days of production monitoring

Drill result: POST /api/v1/learning/evolver/run → HTTP 200 errors:[] 2026-04-15

ADR-083 Phase 3, 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:24:33 +08:00
AWOOOI CD
e449b275aa chore(cd): deploy e5e94f5 [skip ci] 2026-04-15 13:19:00 +00:00
OG T
5f86da52d9 docs(LOGBOOK): record of the full Phase 3 rollout: 6 components + exit condition checklist
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:10:47 +08:00
OG T
e5e94f5fda fix(Phase 3): admin endpoint passes force=True so the Evolver drill is not blocked by the flag
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m56s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:09:13 +08:00
OG T
01fb531c02 fix(Phase 3): Evolver force=True bypass flag + clean up unused imports
- run_evolver(force=True): the manual admin endpoint can bypass the feature flag
- Remove the unused typing.Any import
- Remove the redundant calculate_jaccard_similarity import in _merge_similar

ADR-083 Phase 3, 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:09:01 +08:00
OG T
4718c7667c feat(Phase 3): Evolver loop scheduling + manual trigger endpoint, completing the merge drill gate
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_evolver.py: add run_evolver_loop() (24h infinite loop)
- main.py: mount run_evolver_loop via asyncio.create_task
- api/v1/learning.py: POST /api/v1/learning/evolver/run (Phase 3 exit #6 drill endpoint)
- MASTER §8: backfill 66c4eda AgentSession + the complete exit condition checklist for this Evolver work

ADR-083 Phase 3, 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:07:56 +08:00