Commit Graph

316 Commits

Author SHA1 Message Date
AWOOOI CD
f72419dd17 chore(cd): deploy b0da6da [skip ci] 2026-05-01 15:27:48 +08:00
AWOOOI CD
f53d7e5584 chore(cd): deploy f8e4497 [skip ci] 2026-05-01 14:41:18 +08:00
Your Name
f8e44971c1 feat(aiops): enable read-only agent loop canary
All checks were successful
CD Pipeline / tests (push) Successful in 1m43s
Code Review / ai-code-review (push) Successful in 31s
CD Pipeline / build-and-deploy (push) Successful in 10m22s
CD Pipeline / post-deploy-checks (push) Successful in 4m3s
2026-05-01 14:20:16 +08:00
AWOOOI CD
33a7148916 chore(cd): deploy b6cf616 [skip ci] 2026-05-01 14:02:59 +08:00
AWOOOI CD
1fe75e9f99 chore(cd): deploy 6ec3f11 [skip ci] 2026-05-01 13:45:55 +08:00
Your Name
3691402561 chore(cd): deploy 11673d80 api [skip ci] 2026-05-01 12:52:23 +08:00
Your Name
11673d80ea fix(aiops): route backup decisions through ssh
Some checks failed
CD Pipeline / tests (push) Successful in 1m35s
Code Review / ai-code-review (push) Successful in 34s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-01 12:50:01 +08:00
Your Name
ce4cf4c94b chore(cd): deploy 2c12bce api [skip ci] 2026-05-01 10:58:55 +08:00
Your Name
78bcc090ad chore(cd): deploy 97be5de api [skip ci] 2026-05-01 10:52:31 +08:00
AWOOOI CD
046d598e88 chore(cd): deploy e4aef6a [skip ci] 2026-05-01 10:43:56 +08:00
Your Name
fa6a78af2a chore(cd): deploy e4aef6a api [skip ci] 2026-05-01 10:42:07 +08:00
AWOOOI CD
7472eb2fcd chore(cd): deploy ca22ec2 [skip ci] 2026-05-01 10:24:48 +08:00
AWOOOI CD
3e0ab0f8c6 chore(cd): deploy f154ac0 [skip ci] 2026-05-01 00:14:36 +08:00
AWOOOI CD
f946e7b184 chore(cd): deploy 6e04fe9 [skip ci] 2026-04-30 23:18:20 +08:00
AWOOOI CD
64b09273f7 chore(cd): deploy e29aab5 [skip ci] 2026-04-30 15:58:18 +08:00
AWOOOI CD
a93fbe5d66 chore(cd): deploy 36967d0 [skip ci] 2026-04-30 15:44:46 +08:00
AWOOOI CD
38ffcf4395 chore(cd): deploy 712d3e5 [skip ci] 2026-04-30 15:20:33 +08:00
AWOOOI CD
ae52d51210 chore(cd): deploy 72945bf [skip ci] 2026-04-30 15:05:57 +08:00
AWOOOI CD
6e76c5dfd5 chore(cd): deploy c9393c3 [skip ci] 2026-04-30 14:41:46 +08:00
AWOOOI CD
19788302df chore(cd): deploy 80defbe [skip ci] 2026-04-30 14:26:44 +08:00
Your Name
80defbed7c fix(aiops): fallback and escalate automation blockers
Some checks failed
CD Pipeline / tests (push) Successful in 2m41s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Successful in 7m51s
CD Pipeline / post-deploy-checks (push) Failing after 2m15s
2026-04-30 14:13:57 +08:00
AWOOOI CD
9ee3cc6242 chore(cd): deploy 4723499 [skip ci] 2026-04-30 11:11:04 +08:00
AWOOOI CD
a0be4ebb03 chore(cd): deploy 0f7e9d3 [skip ci] 2026-04-30 10:54:29 +08:00
Your Name
9f15f3cfe4 chore(cd): deploy 639bb64 [skip ci] 2026-04-30 09:41:20 +08:00
AWOOOI CD
d197e2785d chore(cd): deploy 4a57c2d [skip ci] 2026-04-29 15:48:24 +00:00
AWOOOI CD
dae0aa2312 chore(cd): deploy d845d53 [skip ci] 2026-04-29 15:06:57 +00:00
AWOOOI CD
b857be0a64 chore(cd): deploy fe2b8f4 [skip ci] 2026-04-29 14:47:51 +00:00
AWOOOI CD
525a243550 chore(cd): deploy dccdcdb [skip ci] 2026-04-29 13:59:53 +00:00
AWOOOI CD
4c91d89dd2 chore(cd): deploy 4115ddd [skip ci] 2026-04-29 13:04:37 +00:00
Your Name
dc18b0ebd6 fix(prometheus_url): drift 殘存追修 — kured 守門員 + monitoring API
debugger 全 codebase 追根溯源後揪出 5 處 PROMETHEUS_URL drift 殘存
(根因:docs/reference/SERVICE-ENDPOINTS.md 早期把 Prometheus 標在 188
是整個 codebase drift 的源頭)。

本次修最急的 2 處:

## 🔴🔴 kured.yaml:132(守門員失效風險)
- 188 → 110
- kured 跑 reboot 前會查 Prometheus alerts,連錯主機 = 跳過保護直接 reboot 主機
- 對齊 ConfigMap + config.py PROMETHEUS_URL

## 🟡 monitoring.py:67(單一事實源)
- 寫死 110:9090 改用 settings.PROMETHEUS_URL
- 主機巧合正確但繞過 ConfigMap 注入機制
- 未來 Prometheus 再遷移避免再次 drift

## 暫不修
- k3s_monitor_service.py:38 用 121:30090 是 K3s NodePort 內網端點
  與外部 PROMETHEUS_URL 概念不同,需新增 PROMETHEUS_INTERNAL_URL setting
- 其他 docstring + 文件 drift(SERVICE-ENDPOINTS.md 等)留待後續

## 驗證
1552 unit tests 全綠(無回歸)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
AWOOOI CD
20009cddcf chore(cd): deploy 143c15f [skip ci] 2026-04-28 07:36:19 +00:00
Your Name
143c15f052 feat(wave2+km): LLM 動態按鈕啟用 + KM 自動寫入 + AI Router dead code 標記
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m52s
- ConfigMap: USE_LLM_DYNAMIC_BUTTONS=true(B2/B3/B4 handler 全就緒)
- decision_manager: auto_execute 失敗路徑補 KM fire-and-forget 寫入
- ai_router: _build_fallback_chain 標記 DEPRECATED 2026-04-28
- tests: test_golden_regression.py 新增 172 行 golden 回歸測試

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 15:27:33 +08:00
AWOOOI CD
2e6ae7fe84 chore(cd): deploy 7f200af [skip ci] 2026-04-28 07:14:34 +00:00
AWOOOI CD
b8a330f9e4 chore(cd): deploy c1a1be6 [skip ci] 2026-04-27 12:21:13 +00:00
Your Name
c1a1be61bd fix(ssh-auto): 主機告警 SSH 自動診斷授權(HostHighCpuLoad 不再卡人工審核)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m7s
根因:SSH_MCP_ALLOWED_HOSTS 未設定 → _ssh_execute() 全部攔截
      + auto_approve 只認 kubectl 不認 ssh → 主機告警永遠降級人工

修復:
- ConfigMap: 補 SSH_MCP_ALLOWED_HOSTS 四主機白名單
- alert_rules: HostHighCpuLoad 等從 NO_ACTION 改為 ssh_diagnose 指令
- auto_approve: _has_executable 加入 ssh 開頭識別
- decision_manager: _ssh_execute() 加入 ssh_diagnose 路由
- ssh_provider: 新增 ssh_diagnose tool(ps aux + free -h + df -h,只讀)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 20:13:07 +08:00
AWOOOI CD
ded17caca0 chore(cd): deploy a0502b7 [skip ci] 2026-04-27 11:55:33 +00:00
AWOOOI CD
0a22f49932 chore(cd): deploy e3bad58 [skip ci] 2026-04-27 08:21:06 +00:00
AWOOOI CD
dfbf3f8f20 chore(cd): deploy a184b82 [skip ci] 2026-04-27 08:08:52 +00:00
Your Name
c3fa03fc19 fix(solver): 補 AGENT_SOLVER_TIMEOUT_SEC=80 + prompt 禁無腦重啟
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
問題1:AGENT_SOLVER_TIMEOUT_SEC 預設 20s K8s 未設 → deepseek-r1:14b 必然
       timeout → candidates=[] → action="" → Telegram 顯示「待分析」+「規則分析」

問題2:Solver prompt JSON 範例只有 restart + kubectl top,LLM 模仿範例
       → 所有告警都推重啟,HostDisk/CPU 類應優先診斷+清理

修復:
- K8s 加 AGENT_SOLVER_TIMEOUT_SEC=80(< OPENCLAW_TIMEOUT=120,留 buffer)
- Solver prompt 加根因對應修復規則:HostDisk→df/du/journalctl,CPU→top/ps,
  OOM→kubectl logs,禁止「先重啟」
- JSON 範例改為 HostDisk SSH 診斷場景,不再只有 K8s 命令

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 15:51:42 +08:00
Your Name
1b6a4dc14c fix(k8s): 補 AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=100 救急 step_timeout
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因:deepseek-r1:14b 推理單題實測 28s,SRE prompt 更長必然 >30s
      AGENT_DIAGNOSTICIAN_TIMEOUT_SEC 預設 30s,K8s 沒有覆寫
      導致 diagnostician 必然 step_timeout → 信心 20% 降級

修復:K8s 加 AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=100(低於 OPENCLAW_TIMEOUT=120,留 20s buffer)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 15:40:46 +08:00
AWOOOI CD
e0ca1c1f78 chore(cd): deploy ea23972 [skip ci] 2026-04-27 07:30:40 +00:00
AWOOOI CD
92a5d94382 chore(cd): deploy f4998b3 [skip ci] 2026-04-27 07:15:37 +00:00
Your Name
1ab6786ce3 feat(ops): Ollama 容災 Runbook + Grafana 儀表板 + Consensus K8s ConfigMap patch
Some checks failed
run-migration / migrate (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Wave 6 P2.3 ops 配套 + tool-expert 部署文件:

新增:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 行)
  · 三大鐵律驗證步驟(自動切 Gemini / 自動切回 / quota 熔斷)
  · failover/recovery 完整 SOP
  · 故障排查清單(Ollama 111/188 不通、Gemini quota 超發等)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 行)
  · 4 panel:current primary / failover events / quota usage / health status
  · 對應 P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
  · ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
2026-04-27 08:11:40 +08:00
AWOOOI CD
b0bf3783e4 chore(cd): deploy 2c57b71 [skip ci] 2026-04-26 13:04:37 +00:00
Your Name
55c6b4e2d9 feat(p1): Ollama 多層容災系統 — P1.1 健康檢測 + P1.2 ai_router 整合 + P1.5 容災告警
ADR-092 P1 飛輪閉環的 Ollama 失敗轉移子系統,全部 Engineer-A2/C/C2 補上。

新服務 (1581 行):
- ollama_health_monitor.py (356):3 層健康檢測(TCP/HTTP/推理)
- ollama_failover_manager.py (571):111→188 自動切換 + Redis 持久化 + recovery callback
- ollama_auto_recovery.py (436):30s 背景監控 + 連續 3 次 HEALTHY → 切回 + clear_cache
- failover_alerter.py (218):P1.5 Telegram 容災告警

服務整合:
- ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain
- main.py lifespan: 啟動時 wire callback + start recovery,關閉時優雅 stop
- config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA(帳單熔斷)

K8s 配置:
- 04-configmap.yaml.patch-188-fallback:注入 OLLAMA_FALLBACK_URL=http://192.168.0.188:11434

測試 (2082 行):
- test_ollama_health_monitor.py (402)
- test_ollama_failover_manager.py (707)
- test_ollama_auto_recovery.py (580)
- test_ai_router_failover_integration.py (257)
- test_lifespan_failover_wiring.py (136)

依賴鏈:service 三件套 + ai_router + main.py 一起 commit,缺一就 ImportError。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:18:33 +08:00
AWOOOI CD
4a8c3ca5c4 chore(cd): deploy bb12647 [skip ci] 2026-04-25 02:39:34 +00:00
AWOOOI CD
f676b61282 chore(cd): deploy cbd28e2 [skip ci] 2026-04-25 01:55:58 +00:00
AWOOOI CD
b8b5c68f31 chore(cd): deploy f9f2263 [skip ci] 2026-04-24 19:37:26 +00:00
AWOOOI CD
411a285735 chore(cd): deploy 250eca9 [skip ci] 2026-04-24 19:23:08 +00:00
Your Name
c14f23b33a feat(k8s+notification): TG_GROUP_CUTOVER=true — 所有告警全切 SRE 群組
notification_matrix TYPE-5S: DM → GROUP(SignOz 事件補齊)
prod/dev ConfigMap TG_GROUP_CUTOVER: false → true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 03:07:28 +08:00