AWOOOI CD
|
f72419dd17
|
chore(cd): deploy b0da6da [skip ci]
|
2026-05-01 15:27:48 +08:00 |
|
AWOOOI CD
|
f53d7e5584
|
chore(cd): deploy f8e4497 [skip ci]
|
2026-05-01 14:41:18 +08:00 |
|
Your Name
|
f8e44971c1
|
feat(aiops): enable read-only agent loop canary
CD Pipeline / tests (push) Successful in 1m43s
Code Review / ai-code-review (push) Successful in 31s
CD Pipeline / build-and-deploy (push) Successful in 10m22s
CD Pipeline / post-deploy-checks (push) Successful in 4m3s
|
2026-05-01 14:20:16 +08:00 |
|
AWOOOI CD
|
33a7148916
|
chore(cd): deploy b6cf616 [skip ci]
|
2026-05-01 14:02:59 +08:00 |
|
AWOOOI CD
|
1fe75e9f99
|
chore(cd): deploy 6ec3f11 [skip ci]
|
2026-05-01 13:45:55 +08:00 |
|
Your Name
|
3691402561
|
chore(cd): deploy 11673d80 api [skip ci]
|
2026-05-01 12:52:23 +08:00 |
|
Your Name
|
11673d80ea
|
fix(aiops): route backup decisions through ssh
CD Pipeline / tests (push) Successful in 1m35s
Code Review / ai-code-review (push) Successful in 34s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-05-01 12:50:01 +08:00 |
|
Your Name
|
ce4cf4c94b
|
chore(cd): deploy 2c12bce api [skip ci]
|
2026-05-01 10:58:55 +08:00 |
|
Your Name
|
78bcc090ad
|
chore(cd): deploy 97be5de api [skip ci]
|
2026-05-01 10:52:31 +08:00 |
|
AWOOOI CD
|
046d598e88
|
chore(cd): deploy e4aef6a [skip ci]
|
2026-05-01 10:43:56 +08:00 |
|
Your Name
|
fa6a78af2a
|
chore(cd): deploy e4aef6a api [skip ci]
|
2026-05-01 10:42:07 +08:00 |
|
AWOOOI CD
|
7472eb2fcd
|
chore(cd): deploy ca22ec2 [skip ci]
|
2026-05-01 10:24:48 +08:00 |
|
AWOOOI CD
|
3e0ab0f8c6
|
chore(cd): deploy f154ac0 [skip ci]
|
2026-05-01 00:14:36 +08:00 |
|
AWOOOI CD
|
f946e7b184
|
chore(cd): deploy 6e04fe9 [skip ci]
|
2026-04-30 23:18:20 +08:00 |
|
AWOOOI CD
|
64b09273f7
|
chore(cd): deploy e29aab5 [skip ci]
|
2026-04-30 15:58:18 +08:00 |
|
AWOOOI CD
|
a93fbe5d66
|
chore(cd): deploy 36967d0 [skip ci]
|
2026-04-30 15:44:46 +08:00 |
|
AWOOOI CD
|
38ffcf4395
|
chore(cd): deploy 712d3e5 [skip ci]
|
2026-04-30 15:20:33 +08:00 |
|
AWOOOI CD
|
ae52d51210
|
chore(cd): deploy 72945bf [skip ci]
|
2026-04-30 15:05:57 +08:00 |
|
AWOOOI CD
|
6e76c5dfd5
|
chore(cd): deploy c9393c3 [skip ci]
|
2026-04-30 14:41:46 +08:00 |
|
AWOOOI CD
|
19788302df
|
chore(cd): deploy 80defbe [skip ci]
|
2026-04-30 14:26:44 +08:00 |
|
Your Name
|
80defbed7c
|
fix(aiops): fallback and escalate automation blockers
CD Pipeline / tests (push) Successful in 2m41s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Successful in 7m51s
CD Pipeline / post-deploy-checks (push) Failing after 2m15s
|
2026-04-30 14:13:57 +08:00 |
|
AWOOOI CD
|
9ee3cc6242
|
chore(cd): deploy 4723499 [skip ci]
|
2026-04-30 11:11:04 +08:00 |
|
AWOOOI CD
|
a0be4ebb03
|
chore(cd): deploy 0f7e9d3 [skip ci]
|
2026-04-30 10:54:29 +08:00 |
|
Your Name
|
9f15f3cfe4
|
chore(cd): deploy 639bb64 [skip ci]
|
2026-04-30 09:41:20 +08:00 |
|
AWOOOI CD
|
d197e2785d
|
chore(cd): deploy 4a57c2d [skip ci]
|
2026-04-29 15:48:24 +00:00 |
|
AWOOOI CD
|
dae0aa2312
|
chore(cd): deploy d845d53 [skip ci]
|
2026-04-29 15:06:57 +00:00 |
|
AWOOOI CD
|
b857be0a64
|
chore(cd): deploy fe2b8f4 [skip ci]
|
2026-04-29 14:47:51 +00:00 |
|
AWOOOI CD
|
525a243550
|
chore(cd): deploy dccdcdb [skip ci]
|
2026-04-29 13:59:53 +00:00 |
|
AWOOOI CD
|
4c91d89dd2
|
chore(cd): deploy 4115ddd [skip ci]
|
2026-04-29 13:04:37 +00:00 |
|
Your Name
|
dc18b0ebd6
|
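fix(prometheus_url): follow-up fix for leftover drift: kured gatekeeper + monitoring API
After tracing the whole codebase back to the root cause, the debugger found 5 leftover instances of PROMETHEUS_URL drift
(root cause: docs/reference/SERVICE-ENDPOINTS.md originally listed Prometheus on 188,
which is the source of the drift across the whole codebase).
This commit fixes the 2 most urgent ones:
## 🔴🔴 kured.yaml:132 (risk of the gatekeeper failing)
- 188 → 110
- kured queries Prometheus alerts before a reboot; pointing at the wrong host = skipping the safeguard and rebooting the host directly
- Aligned with the ConfigMap + config.py PROMETHEUS_URL
## 🟡 monitoring.py:67 (single source of truth)
- Replaced the hardcoded 110:9090 with settings.PROMETHEUS_URL
- The host happened to be correct, but the hardcoded value bypassed the ConfigMap injection mechanism
- Avoids further drift the next time Prometheus is migrated
## Deferred
- k3s_monitor_service.py:38 uses 121:30090, the K3s NodePort internal endpoint,
which is a different concept from the external PROMETHEUS_URL and needs a new PROMETHEUS_INTERNAL_URL setting
- Other docstring + documentation drift (SERVICE-ENDPOINTS.md etc.) is left for a later change
## Verification
All 1552 unit tests are green (no regressions)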
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
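The single-source-of-truth change in monitoring.py can be pictured with a minimal sketch like the one below. It is only an illustration: the `Settings` class, the `alerts_endpoint` helper, and the placeholder default URL are assumptions, not the project's actual config.py. Only the `PROMETHEUS_URL` name comes from the commit, and `/api/v1/alerts` is the standard Prometheus HTTP API path for active alerts.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; the real config.py may look different."""

    # Read from the environment (e.g. injected by a ConfigMap); the default
    # below is a placeholder, not a real host from the repository.
    PROMETHEUS_URL: str = field(
        default_factory=lambda: os.environ.get(
            "PROMETHEUS_URL", "http://prometheus.example:9090"
        )
    )


def alerts_endpoint(settings: Settings) -> str:
    """Build the alerts URL from the injected setting, never from a literal host."""
    return settings.PROMETHEUS_URL.rstrip("/") + "/api/v1/alerts"
```

With this shape, a future Prometheus migration only needs a ConfigMap change; no call site carries its own copy of the host.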
|
2026-04-29 10:44:39 +08:00 |
|
AWOOOI CD
|
20009cddcf
|
chore(cd): deploy 143c15f [skip ci]
|
2026-04-28 07:36:19 +00:00 |
|
Your Name
|
143c15f052
|
feat(wave2+km): enable LLM dynamic buttons + automatic KM writes + mark AI Router dead code
CD Pipeline / build-and-deploy (push) Successful in 9m52s
- ConfigMap: USE_LLM_DYNAMIC_BUTTONS=true (B2/B3/B4 handlers are all ready)
- decision_manager: added a fire-and-forget KM write on the auto_execute failure path
- ai_router: _build_fallback_chain marked DEPRECATED 2026-04-28
- tests: test_golden_regression.py adds 172 lines of golden regression tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-28 15:27:33 +08:00 |
|
AWOOOI CD
|
2e6ae7fe84
|
chore(cd): deploy 7f200af [skip ci]
|
2026-04-28 07:14:34 +00:00 |
|
AWOOOI CD
|
b8a330f9e4
|
chore(cd): deploy c1a1be6 [skip ci]
|
2026-04-27 12:21:13 +00:00 |
|
Your Name
|
c1a1be61bd
|
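fix(ssh-auto): authorize automatic SSH diagnosis for host alerts (HostHighCpuLoad no longer stuck in manual review)
CD Pipeline / build-and-deploy (push) Successful in 9m7s
Root cause: SSH_MCP_ALLOWED_HOSTS was not set → _ssh_execute() blocked everything
+ auto_approve recognized only kubectl, not ssh → host alerts were always downgraded to manual handling
Fix:
- ConfigMap: added the four-host SSH_MCP_ALLOWED_HOSTS allowlist
- alert_rules: HostHighCpuLoad and similar alerts changed from NO_ACTION to the ssh_diagnose command
- auto_approve: _has_executable now recognizes commands that start with ssh
- decision_manager: _ssh_execute() gains ssh_diagnose routing
- ssh_provider: new ssh_diagnose tool (ps aux + free -h + df -h, read-only)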
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
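The interaction between the host allowlist and the executable check described above might look roughly like the following sketch. All function names here are hypothetical, and the comma-separated format for `SSH_MCP_ALLOWED_HOSTS` is an assumption; the commit only states that an unset allowlist blocked every SSH action and that `ssh`-prefixed commands were not recognized as executable.

```python
from __future__ import annotations

# Prefixes treated as executable actions (assumed; the commit names kubectl and ssh).
EXECUTABLE_PREFIXES = ("kubectl", "ssh")


def parse_allowed_hosts(raw: str) -> frozenset[str]:
    """Parse a comma-separated allowlist; an empty string yields an empty set."""
    return frozenset(h.strip() for h in raw.split(",") if h.strip())


def may_auto_approve(action: str, host: str, allowed: frozenset[str]) -> bool:
    """Return True only for executable actions; SSH actions also need an allowed host."""
    command = action.strip()
    if not command.startswith(EXECUTABLE_PREFIXES):
        return False  # e.g. NO_ACTION falls through to manual review
    if command.startswith("ssh"):
        return host in allowed  # an empty allowlist blocks every SSH action
    return True
```

Under these assumptions, an unset allowlist reproduces the reported symptom: every `ssh_diagnose` action is rejected and falls back to manual review.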
|
2026-04-27 20:13:07 +08:00 |
|
AWOOOI CD
|
ded17caca0
|
chore(cd): deploy a0502b7 [skip ci]
|
2026-04-27 11:55:33 +00:00 |
|
AWOOOI CD
|
0a22f49932
|
chore(cd): deploy e3bad58 [skip ci]
|
2026-04-27 08:21:06 +00:00 |
|
AWOOOI CD
|
dfbf3f8f20
|
chore(cd): deploy a184b82 [skip ci]
|
2026-04-27 08:08:52 +00:00 |
|
Your Name
|
c3fa03fc19
|
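fix(solver): add AGENT_SOLVER_TIMEOUT_SEC=80 + prompt forbids mindless restarts
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem 1: AGENT_SOLVER_TIMEOUT_SEC defaults to 20s and was not set in K8s → deepseek-r1:14b inevitably
times out → candidates=[] → action="" → Telegram shows "pending analysis" (待分析) + "rule-based analysis" (規則分析)
Problem 2: the Solver prompt's JSON example contained only restart + kubectl top, and the LLM imitated the example
→ every alert was pushed toward a restart; HostDisk/CPU alerts should prioritize diagnosis + cleanup
Fix:
- K8s: added AGENT_SOLVER_TIMEOUT_SEC=80 (< OPENCLAW_TIMEOUT=120, leaving a buffer)
- Solver prompt: added rules mapping root cause to fix: HostDisk→df/du/journalctl, CPU→top/ps,
OOM→kubectl logs; "restart first" is forbidden
- JSON example changed to a HostDisk SSH diagnosis scenario, no longer K8s commands only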
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
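The diagnose-before-restart rule above lives in the prompt, but the same mapping could also be checked in code. The sketch below is an assumption-laden illustration: the dictionary keys, function names, and the idea of validating candidates after the LLM responds are not from the commit, which only lists the root-cause-to-command pairs.

```python
from __future__ import annotations

# Root cause category -> diagnostic commands to try first (pairs taken from the commit).
DIAGNOSTICS: dict[str, tuple[str, ...]] = {
    "HostDisk": ("df", "du", "journalctl"),
    "CPU": ("top", "ps"),
    "OOM": ("kubectl logs",),
}


def first_steps(root_cause: str) -> tuple[str, ...]:
    """Return the diagnostic commands for a root cause, or an empty tuple if unknown."""
    return DIAGNOSTICS.get(root_cause, ())


def violates_no_restart_first(candidates: list[str]) -> bool:
    """Flag a candidate list whose very first action is a restart."""
    return bool(candidates) and "restart" in candidates[0].lower()
```

A check like `violates_no_restart_first` would be a cheap guard in case the model still imitates a restart-heavy example.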
|
2026-04-27 15:51:42 +08:00 |
|
Your Name
|
1b6a4dc14c
|
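fix(k8s): add AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=100 as an emergency fix for step_timeout
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: deepseek-r1:14b measured 28s for a single inference question, so the longer SRE prompt inevitably takes >30s
AGENT_DIAGNOSTICIAN_TIMEOUT_SEC defaults to 30s, and K8s did not override it
As a result the diagnostician always hit step_timeout → downgraded to 20% confidence
Fix: K8s adds AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=100 (below OPENCLAW_TIMEOUT=120, leaving a 20s buffer)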
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
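The relationship between the step timeout and the outer timeout can be made explicit with a small sketch. The helper names `check_budget` and `run_step` are hypothetical, and returning `None` on timeout is a simplification of the step_timeout downgrade described in the commit; only the 100s/120s values and the 20s buffer come from the message.

```python
import asyncio
from typing import Any, Awaitable, Optional

OPENCLAW_TIMEOUT = 120                 # outer budget, from the commit
AGENT_DIAGNOSTICIAN_TIMEOUT_SEC = 100  # step budget, from the commit


def check_budget(step_timeout: float, outer_timeout: float, min_buffer: float = 20) -> None:
    """Raise if the step timeout does not leave the required buffer under the outer one."""
    if step_timeout + min_buffer > outer_timeout:
        raise ValueError(
            f"step timeout {step_timeout}s leaves less than {min_buffer}s "
            f"under outer timeout {outer_timeout}s"
        )


async def run_step(step: Awaitable[Any], timeout: float) -> Optional[Any]:
    """Await one agent step; return None on timeout so the caller can downgrade."""
    try:
        return await asyncio.wait_for(step, timeout=timeout)
    except asyncio.TimeoutError:
        return None


# The values from the commit satisfy the budget: 100 + 20 <= 120.
check_budget(AGENT_DIAGNOSTICIAN_TIMEOUT_SEC, OPENCLAW_TIMEOUT)
```

Running `check_budget` at startup would surface a misconfigured override before it shows up as a run of step timeouts.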
|
2026-04-27 15:40:46 +08:00 |
|
AWOOOI CD
|
e0ca1c1f78
|
chore(cd): deploy ea23972 [skip ci]
|
2026-04-27 07:30:40 +00:00 |
|
AWOOOI CD
|
92a5d94382
|
chore(cd): deploy f4998b3 [skip ci]
|
2026-04-27 07:15:37 +00:00 |
|
Your Name
|
1ab6786ce3
|
feat(ops): Ollama disaster-recovery runbook + Grafana dashboard + Consensus K8s ConfigMap patch
run-migration / migrate (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Wave 6 P2.3 ops support + tool-expert deployment docs:
Added:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines)
· Verification steps for the three iron rules (automatic switch to Gemini / automatic switch back / quota circuit breaker)
· Complete failover/recovery SOP
· Troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines)
· 4 panels: current primary / failover events / quota usage / health status
· Corresponds to the P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
· ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
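The quota circuit breaker that the runbook verifies could work along the lines of this minimal in-memory sketch. The class name, its fields, and the per-day reset are assumptions for illustration; the commit does not show how the daily Gemini call count is actually stored or enforced.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DailyQuotaBreaker:
    """Hypothetical in-memory breaker that refuses calls once a daily quota is used up."""

    daily_quota: int
    day: Optional[date] = None
    count: int = 0

    def allow(self, today: date) -> bool:
        """Return True and count the call if quota remains for `today`."""
        if self.day != today:
            # First call of a new day: reset the counter.
            self.day, self.count = today, 0
        if self.count >= self.daily_quota:
            return False  # breaker open: quota exhausted for today
        self.count += 1
        return True
```

A production version would need shared storage for the counter so that several replicas see the same total.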
|
2026-04-27 08:11:40 +08:00 |
|
AWOOOI CD
|
b0bf3783e4
|
chore(cd): deploy 2c57b71 [skip ci]
|
2026-04-26 13:04:37 +00:00 |
|
Your Name
|
55c6b4e2d9
|
feat(p1): Ollama multi-layer disaster-recovery system: P1.1 health checks + P1.2 ai_router integration + P1.5 failover alerts
The Ollama failover subsystem of the ADR-092 P1 flywheel closed loop, all filled in by Engineer-A2/C/C2.
New services (1581 lines):
- ollama_health_monitor.py (356): 3-layer health checks (TCP/HTTP/inference)
- ollama_failover_manager.py (571): automatic 111→188 switchover + Redis persistence + recovery callback
- ollama_auto_recovery.py (436): 30s background monitoring + 3 consecutive HEALTHY results → switch back + clear_cache
- failover_alerter.py (218): P1.5 Telegram failover alerts
Service integration:
- ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain
- main.py lifespan: wires the callback + starts recovery on startup, stops gracefully on shutdown
- config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA (billing circuit breaker)
K8s configuration:
- 04-configmap.yaml.patch-188-fallback: injects OLLAMA_FALLBACK_URL=http://192.168.0.188:11434
Tests (2082 lines):
- test_ollama_health_monitor.py (402)
- test_ollama_failover_manager.py (707)
- test_ollama_auto_recovery.py (580)
- test_ai_router_failover_integration.py (257)
- test_lifespan_failover_wiring.py (136)
Dependency chain: the service trio + ai_router + main.py must be committed together; leaving any one out causes an ImportError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
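The switch-back rule (three consecutive HEALTHY probes before returning to the primary) can be sketched as a small state tracker. The `RecoveryTracker` class and its fields are hypothetical and leave out the 30s polling loop, Redis persistence, and cache clearing that the commit mentions.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTracker:
    """Hypothetical tracker: switch back after N consecutive healthy probes."""

    required_healthy: int = 3
    streak: int = 0
    on_fallback: bool = True

    def observe(self, healthy: bool) -> bool:
        """Record one probe result; return True exactly when a switch-back should fire."""
        if not self.on_fallback:
            return False  # already on the primary, nothing to recover
        # Any unhealthy probe resets the streak to zero.
        self.streak = self.streak + 1 if healthy else 0
        if self.streak >= self.required_healthy:
            self.on_fallback = False
            self.streak = 0
            return True
        return False
```

Requiring a streak rather than a single healthy probe keeps a flapping primary from causing repeated switchovers.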
|
2026-04-26 20:18:33 +08:00 |
|
AWOOOI CD
|
4a8c3ca5c4
|
chore(cd): deploy bb12647 [skip ci]
|
2026-04-25 02:39:34 +00:00 |
|
AWOOOI CD
|
f676b61282
|
chore(cd): deploy cbd28e2 [skip ci]
|
2026-04-25 01:55:58 +00:00 |
|
AWOOOI CD
|
b8b5c68f31
|
chore(cd): deploy f9f2263 [skip ci]
|
2026-04-24 19:37:26 +00:00 |
|
AWOOOI CD
|
411a285735
|
chore(cd): deploy 250eca9 [skip ci]
|
2026-04-24 19:23:08 +00:00 |
|
Your Name
|
c14f23b33a
|
feat(k8s+notification): TG_GROUP_CUTOVER=true, all alerts cut over to the SRE group
notification_matrix TYPE-5S: DM → GROUP (fills in SignOz events)
prod/dev ConfigMap TG_GROUP_CUTOVER: false → true
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-25 03:07:28 +08:00 |
|