Commit Graph

1055 Commits

Author SHA1 Message Date
OG T
b1e207ffae fix(host_aggregator): E2E驗證後修正 HOST_CONFIGS — Ollama位置+NodePort+Nginx
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
從 K3s Pod 內 Python socket 實測確認後修正:
- 110: 加 Prometheus(9090) Grafana(3002),移除 GH Runner(3000 refused)
- 112: 移除 SSH:22 (K3s Pod NetworkPolicy 未開)
- 120: 移除 awoooi NodePort(只在121不在120)
- 188: 移除 Ollama(在111非188) 和 Nginx:443(Pod內打不通)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 20:27:46 +08:00
OG T
c200d7a52d fix(web+k8s): CSRF mismatch + NetworkPolicy 缺少監控端口
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m19s
1. pending-approvals-card: 改為點擊時即時 fetch 新 CSRF token
   避免多 useCSRF 實例互相覆蓋 cookie 導致 header/cookie 不一致
2. NetworkPolicy: 補開 110:3002(Grafana) 9090(Prometheus) 3001(Gitea)
   修正 monitoring probe "All connection attempts failed"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 20:11:00 +08:00
OG T
21567a7a6d fix(host_aggregator): 修正四台主機 probe 端點錯誤導致全部顯示 unhealthy
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m1s
- 110: Harbor http→tcp(5000), Docker 2375→Gitea tcp(3001)
- 120: K3s 6443 https(401誤判)→tcp, 移除 Traefik 80(closed)
- 188: OpenClaw 8089→8088 (實際端口)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:52:34 +08:00
OG T
8c2983b70a fix(api+web): CORS 補 K3s NodePort origins + sign 補 signer_id/name
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CORS (config.py):
- 補 http://192.168.0.125:32335 (K3s VIP NodePort)
- 補 http://192.168.0.120:32335 + 121:32335 (K3s nodes)
- 修前: 內網瀏覽器開 :32335 打 API 全 CORS blocked
  (incidents Failed to fetch / monitoring 無法連線根因)

sign body (pending-approvals-card.tsx):
- signer: 'web-ui' → signer_id: CURRENT_USER.id + signer_name: CURRENT_USER.name
- 修前: POST /approvals/{id}/sign 回 403 (缺必填欄位 422 誤報為 403)
  — 實際是 422 Field required signer_id + signer_name

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:50:48 +08:00
OG T
34f0228d92 fix(executor): K8s ClusterIP 10.43.0.1 不可達 — 加 K8S_API_SERVER_URL 覆蓋 + migration job
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m0s
問題: in-cluster config 讀到 10.43.0.1:443,但 K3s Pod 內 iptables/kube-proxy
      沒把流量導到實際 API server,導致 Connection refused,批准後 kubectl 永遠失敗

修復:
- executor.py: load_incluster_config() 後讀 K8S_API_SERVER_URL env 覆蓋 host
- 04-configmap.yaml: 設 K8S_API_SERVER_URL=https://192.168.0.120:6443
- migrate-sprint5r-telegram-message-id.yaml: approval_records 新增兩欄 migration job

E2E 驗證: kubectl rollout restart deployment/awoooi-worker success=True 

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:10:27 +08:00
OG T
ebccb88278 fix(approval_db): 修復 incident_id 篩選查空 DB 欄位而非 JSON 導致執行斷路
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
get_all_approvals(incident_id=...) 原本在應用層過濾
a.metadata.get("incident_id"),但 ApprovalRecord.incident_id
是直接欄位,不在 extra_metadata JSON,導致永遠返回空列表,
Telegram 批准後出現 telegram_approval_not_found_by_incident,
審批從未實際執行。改為 .where(ApprovalRecord.incident_id == incident_id)
DB 層直接篩選,同時效能更佳。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:05:48 +08:00
OG T
9a8f410f23 fix(web): PendingApprovalsCard 批准/拒絕補 CSRF Token — 修復 403
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因: fetch 沒帶 X-CSRF-Token header + credentials:include
     API 回 403 "CSRF token cookie missing"

修復: 加 useCSRF hook,sign/reject 請求帶 ...getHeaders() + credentials:include
     與 incident-card.tsx / openclaw-state-machine.tsx 同一模式

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:00:02 +08:00
OG T
a2a98452ad fix(web): 移除 AIModelStatus 假綠燈 — Gemini/NVIDIA 不應 assumed up
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因: /api/v1/health 的 components 只有 api/database/redis/ollama/openclaw
     d.components.gemini 永遠 undefined → healthy: true 是硬編碼假數據

修復: 改為只有 components 有對應 key 才更新狀態
     無 health 資料時保持 false(unknown),不顯示假綠燈

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:51:14 +08:00
OG T
a4d6b3f3e6 fix(review): 首席架構師+QA 修復 C1/P1/P2/I2/I3 — Sprint 5R Review 修正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
C1/P1-1: DB migration — approval_records 新增 telegram_message_id/telegram_chat_id
  - apps/api/migrations/sprint5r_telegram_message_id.sql (新增)
  - apps/api/src/db/base.py: init_db() 加 ALTER TABLE ADD COLUMN IF NOT EXISTS
  - k8s/jobs/migrate-sprint5r-telegram-message-id.yaml (追蹤)

P1-2: risk_map 補 "high" 鍵防止 LLM 回傳 high 時降為 MEDIUM
  - apps/api/src/services/proposal_service.py:183

I2/M3: kubectl_command 回填補齊 delete_deployment/drain_node/cordon_node/delete_service
       + 抽取 _backfill_kubectl_command() helper 消除重複邏輯
  - apps/api/src/services/openclaw.py

I3: _notify_approval_result 靜默 except 改為 logger.warning
  - apps/api/src/services/telegram_gateway.py

P2-2: PendingApprovalsCard 審批動作加 loading/disabled 防止重複點擊
  - apps/web/src/components/shared/pending-approvals-card.tsx

P2-3: SecurityPanel/CompliancePanel error 死碼修復 — catch() 補 setError()
  - apps/web/src/components/panels/SecurityPanel.tsx (含 'Unresolved' i18n)
  - apps/web/src/components/panels/CompliancePanel.tsx

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:38:10 +08:00
OG T
896bef94ee fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:38:08 +08:00
OG T
890e2a9568 fix(review): 架構審查修復 — P0 import crash + i18n 零 hardcode + 靜默錯誤
P0:
- proposal_service.py: 補 get_redis + INCIDENT_KEY_PREFIX import
  (修前: resolve_incident_after_approval 必 NameError crash)

P1 i18n:
- page.tsx: 拓撲群組移除 emoji,改用 tTopo() i18n key
- page.tsx: 主機標籤 (DevOps金庫等) 改 tTopo() i18n
- ai-model-status.tsx: 加 useTranslations,AI 模型狀態 → t('aiModelStatus')
- disposition-mini.tsx: 查看完整報表 → t('viewAllReport')
- recent-activity.tsx: 查看活動串流 → t('viewAllAlerts')

P2 品質:
- pending-approvals-card.tsx: approve/reject 加 r.ok 檢查+錯誤顯示,查看全部授權加路由+i18n
- page-tabs.tsx: TabSkeleton 載入中... → t('loading')
- page.tsx: ↑5% → tDashboard('trendUp', {pct}) 動態值
- page.tsx: Prometheus '23' hardcode → '-- targets'

i18n 新增 key (zh-TW + en 同步):
- dashboard: viewAllAlerts/viewAllAuth/viewAllReport/aiModelStatus/loading/trendUp
- topology: groupExternal/allReachable/investigating/hostDevops/hostAiData/hostK3sMaster/hostK3sWorker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:34:50 +08:00
OG T
309fe04698 docs(adr066): 批准執行閉環修復記錄 — LOGBOOK + ADR-066 + Skill 02 更新
- LOGBOOK.md: 新增 2026-04-09 批准執行閉環修復狀態區塊
- ADR-066: 記錄根本問題鏈條、決策與受影響檔案
- Skills/02: v2.7 新增 Nemotron tool→kubectl_command 回填鐵律 + 教訓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:23:55 +08:00
OG T
c01026be9b docs(skills+adr): 自動修復全鏈路知識更新 — ADR-058 Appendix A + Skills v2.5
ADR-058: 188白名單補完 + Appendix A (12 Bug修復記錄 + E2E驗證 + Playbook覆蓋矩陣)
Skill-04 DevOps v2.5: SSH自動修復架構章節 (白名單/SOP/陷阱)
Skill-05 SRE: 自動修復E2E驗收規範 + 診斷表

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:21:24 +08:00
OG T
2779233b25 docs: Sprint 5R 實施完成紀錄更新
- LOGBOOK: 13/14 步驟全部完成,CD 部署中
- ADR-065: 狀態更新為「實施完成」
- Skills 01 v1.8: Sprint 5R 完成記錄
- Memory: project_current_status + sprint5r_plan 已更新

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:19:57 +08:00
OG T
1483218bab feat(approval): 批准/拒絕後立即回應 Telegram + 持久化 message_id 到 DB
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m9s
問題:按下 TG 批准/拒絕按鈕後完全沒有任何回應,使用者不知道是否成功。
      Telegram message_id 只存 Redis 24h TTL,過期後無法追蹤。

修正:
- approval_records 加 telegram_message_id / telegram_chat_id 欄位(已 ALTER TABLE)
- approval_db.update_telegram_message() — 持久化 message_id 到 DB
- decision_manager: 發送告警卡片後同時寫 Redis + DB
- telegram_gateway._notify_approval_result() — 批准/拒絕後:
    1. editMessageReplyMarkup 移除批准/拒絕按鈕,保留資訊按鈕
    2. sendMessage reply_to 在原訊息下回覆狀態行
    3. fallback: send_notification 發新訊息
- _handle_group_command: chat_id 改為 _chat_id 消除 IDE lint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:19:31 +08:00
OG T
2c7d5d049c fix(openclaw): Nemotron tool call 回填 kubectl_command,讓批准後執行器能解析
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根本問題:Nemotron 產生的 restart_deployment(deployment_name=sentry) tool call
只存在 nemotron_tools[],沒有回填到 proposal["kubectl_command"]。
proposal_service 拿到的 kubectl_command 是空的,approval_records.action 存空值,
parse_operation_from_action 永遠返回 None,execute_approved_action 永遠 skip。

修正:Nemotron (和 Gemini fallback) 成功後,將 tool call 轉換為 kubectl 指令
並回填 proposal["kubectl_command"],讓 proposal_service 能取到可執行指令。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:15:01 +08:00
OG T
a39647d793 docs(logbook): 自動修復全鏈路完整閉環記錄 — 雙主機 E2E 驗證通過
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
docker-110: SentryDown → REPAIR_OK:sentry (6208ms)
docker-188: MoWoooWorkDown → REPAIR_OK:momo-app (3791ms)
20 Playbooks (8 auto-generated), repair-bot 雙主機白名單更新

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:14:17 +08:00
OG T
ae9780837d fix(proposal): action 優先用 kubectl_command,修復批准後永遠 skip 執行的根本 bug
根本問題:approval_records.action 存的是 LLM action_title(中文標題,如「重啟 sentry 服務」),
parse_operation_from_action() 無法解析,導致 execute_approved_action() 每次都 skip。

修正:action 優先取 llm_proposal["kubectl_command"](可執行的 kubectl 指令),
僅在沒有 kubectl_command 時才 fallback 到 action_title。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:13:22 +08:00
OG T
49a15e1ac9 feat(web): G1 骨架屏取代載入中 + S8 完整提交 — Sprint 5R
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- G1: PulseSkeleton + CardSkeleton 元件
- 首頁所有 LobsterLoading 替換為 PulseSkeleton/CardSkeleton
- Tab 2/4 載入狀態用 CardSkeleton
- 活躍事件載入用 PulseSkeleton

Sprint 5R Phase 1B+1C 全部完成:
S1(KPI卡片) S2(FlowPipeline OpenClaw) S3(AI提案) S4(環形圖)
S5(時間線) S6(Terminal) S7(待審批) S8(拓撲群組+主機)
S9(AI模型) S10(監控3×2) S11(Tab修復) S12(頁面修復) G1(骨架屏)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:09:26 +08:00
OG T
09c6eb3358 feat(web): S2 FlowPipeline 龍蝦→OpenClaw icon — Sprint 5R
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- LobsterSVG 替換為 OpenClawIcon (dashboardicons.com/openclaw PNG)
- 4 種嚴重度渲染全部更新 (P0/P1/P2/P3)
- icon 直接取代圓圈作為活躍步驟標記(非浮動)
- S3 確認: AI 提案橫幅已存在且樣式正確

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:07:53 +08:00
OG T
03b07d5bc5 feat(web): S8 基礎架構拓撲群組 2×2 + 主機 4 台 — Sprint 5R
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- 拓撲模式(預設): 4 群組 2×2 網格 (基礎設施/AI數據/K3s/外部)
  每群組含名稱+服務數+健康摘要+服務列表(色點)
  有 warning 的群組加橘色光暈
- 主機模式: 4 台 2×2 (110/188/120/121) 含 CPU/RAM 進度條
  優先使用 API 真實數據,fallback 靜態值
- 預設切換為拓撲模式 (設計稿要求)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:06:01 +08:00
OG T
07a097c259 fix(infra): Sprint 3 自動修復全鏈路修復 — docker-188 SSH egress + service registry 擴充
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
NetworkPolicy: 新增 192.168.0.188:22 egress — repair-bot-188.sh 執行路徑
service-registry.yaml: 新增 signoz/bitan-app (AUTO, 188主機)

修復覆蓋: Bug #11 補完 (188 SSH) + 188 服務分級覆蓋
E2E 驗證: MoWoooWorkDown → SSH → REPAIR_OK:momo-app (3791ms)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:04:19 +08:00
OG T
895784e646 feat(web): S7+S9+S10 待審批+AI模型+監控工具3×2 — Sprint 5R
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m15s
- S7: PendingApprovalsCard 含風險標籤 + 批准/拒絕按鈕
- S9: AIModelStatus 2×2 (OpenClaw/Ollama/Gemini/NVIDIA)
- S10: MonitoringTools 改 3×2 網格 (名稱+元資訊+左側色條)
- 右欄順序: OpenClaw → 待審批 → 基礎架構 → 監控工具 → AI模型

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 16:10:28 +08:00
OG T
a0f3a7d532 feat(web): S6 OpenClaw AI Terminal + 狀態數據 — Sprint 5R
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m15s
- 分隔線下方新增:模型名稱 + 運行狀態
- 即時統計:今日分析數 / 成功率 / MTTR
- AI 推理終端:#141413 背景 + #a0e8a0 螢光綠 + JetBrains Mono
- 最後一行黃色閃爍游標 ▎
- 資料來源:/api/v1/alert-operation-logs + /api/v1/stats/disposition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:56:03 +08:00
OG T
b85a0e232e feat(web): S4+S5 處置統計環形圖 + 最近活動時間線 — Sprint 5R
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- S4: DispositionMini 元件 (SVG 環形圖 + 四類列表)
- S5: RecentActivity 元件 (時間線 + 色點 + JetBrains Mono)
- 左欄改為 flex:6 可滾動多卡片列
- 右欄改為 flex:4 (60:40 比例)
- 左欄結構: 活躍事件 → 處置統計 → 最近活動

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:51:54 +08:00
OG T
7a2e07f74f feat(web): S1 KPI Strip 改 5 張卡片 — Sprint 5R Phase 1B
- 7 指標分隔線 → 5 張 kpi-card 卡片橫排
- 系統健康(進度條) / 活動事件(P1:P2) / 自動修復率(進度條+↑5%) / 待審批 / 本週操作
- 移除龍蝦游泳列(統帥指示移除)
- 新增 weeklyOps 從 /api/v1/audit-logs/stats 取得

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:48:04 +08:00
OG T
289dac6bd1 fix(web): S11+S12 載入失敗修復 — Sprint 5R Phase 1A
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- S11: Tab 2 approvals API path 修正 (?status=pending → /pending)
- S11: Tab 2 fetch 加 r.ok 檢查避免解析錯誤 JSON
- S12: 安全合規改用 SecurityPanel + CompliancePanel (解決 double AppLayout)
- S12: 知識庫改為 redirect 到 /knowledge-base (避免 lazy import 問題)
- S12: 拓撲圖加入 useDashboardStore.connect() 啟動 SSE

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:43:06 +08:00
OG T
c180bdaaac docs: Sprint 5R 前端重構批准 — ADR-065 + 設計稿 + Skills + LOGBOOK
- ADR-065: Sprint 5R 前端重構決策(版本 A 批准)
- sprint5r-approved-design.html: 統帥批准的設計稿存檔
- Skills 01 v1.7: 品牌 Logo/AwoooI 一致性鐵律
- LOGBOOK: Sprint 5R 開始實施

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:15:43 +08:00
OG T
d8c2969341 feat(telegram): AI 鏈路透明化 — 告警訊息顯示 OpenClaw + Tool Calling 模型/後端
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m12s
- nemotron.py: 偵測 OllamaToolProvider vs NvidiaProvider,記錄 tool_model/tool_backend
- openclaw.py: 傳播 nemotron_tool_model/nemotron_tool_backend 到 proposal
- decision_manager.py: 從 proposal_data 提取並傳給 send_approval_card()
- telegram_gateway.py: TelegramMessage 新增兩個欄位,format_with_nemotron 顯示
  "🔧 Tool Calling: llama3.1:8b (Ollama 本機)" 或 "NVIDIA 雲端"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:05:16 +08:00
OG T
aa2eb486ce docs(logbook): 自動修復 L7 閉環記錄 — 12 Bug 全修 E2E 6208ms 成功
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:55:40 +08:00
OG T
7857c25677 feat: Ollama 本機 Tool Calling 取代 NVIDIA 雲端 (44s→~5s)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- nvidia_provider.py: 新增 OllamaToolProvider
  - 實作 INvidiaProvider protocol,打 Ollama /v1/chat/completions
  - 模型: llama3.1:8b (tool calling 最穩定的 8B)
  - 延遲: 44s → ~5s(本機 M1 Pro 192.168.0.111)
  - get_nvidia_provider() 根據 USE_OLLAMA_TOOL_CALLING 切換
- config.py: USE_OLLAMA_TOOL_CALLING=True (預設開啟), OLLAMA_TOOL_MODEL=llama3.1:8b
- 回退: USE_OLLAMA_TOOL_CALLING=False → 恢復 NvidiaProvider 雲端

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:55:04 +08:00
OG T
77f2da9264 fix(k8s): Bug #11+#12 — SSH egress 白名單 + repair-ssh-key 讀取權限
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Bug #11 (NetworkPolicy): allow-required-egress 缺少 192.168.0.110:22
  - K8s Pod 到 110 的 SSH port 22 被 default-deny-all 封鎖
  - 自動修復的 SSH_COMMAND Playbook 必然 Connection refused
  - 修正: 加入 port 22 到 110 的 egress 白名單

Bug #12 (Deployment): repair-ssh-key Secret defaultMode=0400 (root-only)
  - Pod 以 appuser(UID 1000) 跑,無法讀取 root-owned 的 SSH key
  - ssh 報錯: "Load key: Permission denied"
  - 修正: 加入 securityContext.fsGroup=1000,讓 appuser 透過 group read 存取
  - 已驗證: Pod 內 ssh → repair-bot-110 → REPAIR_OK:sentry 

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:50:49 +08:00
OG T
4f80ba38c0 feat: 告警狀態變更在原訊息延續 (方案 B)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m28s
**telegram_gateway.py**
- 新增 append_incident_update(incident_id, status_line)
  - 從 Redis tg_msg:{id} 取 message_id
  - editMessageReplyMarkup: 移除 Row1(批准/拒絕/靜默),保留 Row2(詳情/重診/歷史)
  - sendMessage reply_to_message_id: 在原訊息下方追加狀態行
  - 找不到 message_id 回傳 False(呼叫方自行 fallback)

**decision_manager.py**
- _push_decision_to_telegram: send_approval_card 後存 tg_msg:{id}=message_id (TTL 24h)
- _push_auto_repair_result: 改用 append_incident_update,找不到 message_id 才 fallback 新訊息

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:21:33 +08:00
OG T
20a2ec1455 ci: 重觸發 CD — Bug #5+#6 修正部署 (ssh binary + target_resource) 2026-04-09 14:19:43 +08:00
OG T
2554ac1e60 fix: E2E test 告警識別 + 自動修復結果 Telegram 通知
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
**alert_rule_engine.py**
- _matches() 加入 instance_prefix 匹配(最高優先)
- match_rule() 傳入 instance label 至 _matches
- 用途: e2e-final-* / e2e-test-* instance 可被 YAML 規則識別

**alert_rules.yaml**
- 新增 e2e_smoke_test 規則 (priority=120)
- alertname: E2E_SMOKE_TEST / instance_prefix: e2e-final- / e2e-test- / test-host
- suggested_action: NO_ACTION,顯示「告警鏈路驗證成功」

**decision_manager.py**
- _auto_execute() 成功後發 Telegram 結果通知 
- _auto_execute() 失敗後發 Telegram 失敗通知 
- 新增 _push_auto_repair_result() 函數

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:16:15 +08:00
OG T
1fb0c0ca90 fix(auto-repair): Bug #5+#6 — SSH binary + affected_services 匹配修正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Bug #5 (webhooks.py): target_resource 現在優先用 component label
  - SentryDown alert 有 labels.component="sentry"
  - 舊邏輯: labels.instance="192.168.0.110:9000" → Playbook affected_services 不匹配
  - 新邏輯: component → pod → instance → alertname

Bug #6 (Dockerfile): python:3.11-slim 無 openssh-client
  - SSH_COMMAND Playbook 執行路徑調用 asyncio.create_subprocess_exec("ssh", ...)
  - image 沒有 ssh binary → 所有 SSH 修復必然失敗
  - 修正: 在 production stage 安裝 openssh-client

服務清單: 補 sentry 主服務到 service-registry.yaml (AUTO 級別)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:11:50 +08:00
OG T
73ef9c6b12 fix(web): QA 掃描 — alert-operation-logs i18n + classic emoji→icon + knowledge 載入中
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m28s
- alert-operation-logs: 30+ 處硬編碼中文改 useTranslations (18 event types + UI)
- classic: 告警 badge + 等待確認 + TOOL_EMOJI → Lucide icon
- knowledge: 載入中 → common.loading
- 新增 alertOpLogs i18n section (zh-TW + en)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:58:04 +08:00
OG T
1d88b7cd9d fix(webhooks): Signal.labels 補 alertname 讓 playbook 匹配能讀到原始 alertname
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
問題: create_incident_for_approval 建立 Signal 時 labels 只有
namespace/resource,沒有 alertname,導致 _extract_symptoms 讀
labels.alertname 取得 None,fallback 到 alert_name="custom",
playbook Jaccard 永遠無法匹配真實 alertname (如 SentryDown)。

修正: 新增 alertname 參數,傳入 Signal.labels["alertname"]。
兩個呼叫點 (LLM 成功 + fallback) 都補上 alertname=alertname。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:54:42 +08:00
OG T
08db3580a7 fix(monitoring): 修復 110 主機 CPU 高負載
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
根因 1: cadvisor 持續掃描 overlay2 磁碟用量 (每次 1-4s × N 容器)
  → 加 --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
  → --housekeeping_interval=30s --docker_only=true
  → CPU 從 239% 降到 <1%

根因 2: node_exporter scrape_timeout 預設 10s,高 load 下超時→broken pipe→瘋狂重試
  → 加 scrape_interval: 30s / scrape_timeout: 25s
  → CPU 從 48% 降到 0%

整體 load average: 20 → 9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:53:13 +08:00
OG T
e4070b2f86 fix(webhooks): 補 get_alert_operation_log_repository import 兩處
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m53s
alert_received_log_failed 錯誤原因:alertmanager_webhook 函數內
直接呼叫 get_alert_operation_log_repository() 但未在 local scope import,
導致 NameError 被 except 吞掉,ALERT_RECEIVED 事件無法記錄。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:29:48 +08:00
OG T
fc03eb1f4d fix(auto-repair): _extract_symptoms 優先用 labels.alertname 取得原始 alertname
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
問題: signal.alert_name 存的是 alert_type (如 "custom"),而非 Prometheus
alertname (如 "SentryDown"),導致 playbook Jaccard 匹配永遠失敗 (NO_MATCH)。

根本原因: webhook 的 alertname_to_type mapping 將未知 alertname 轉為 "custom",
存入 signal.alert_name,但 Playbook 的 symptom_pattern.alert_names 存原始名稱。

修正: 從 signal.labels["alertname"] 讀取原始 Prometheus alertname,
fallback 到 signal.alert_name (保持向下相容)。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:26:18 +08:00
OG T
5bd8a8a719 fix(monitoring): 補齊 blackbox-tcp scrape targets (11→15)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
SentryDown/HarborDown/SignOzDown 等告警引用的 instance 不在 scrape list
導致 absent metric = 0,告警持續 firing

新增缺少的 targets:
  192.168.0.125:6443/32334/32335 (K3s)
  192.168.0.110:9000/5000/3100 (Sentry/Harbor/Langfuse)
  192.168.0.188:3301/5432/6380/11434/8089 (SignOz/PG/Redis/Ollama/OpenClaw)

已在 110 主機 reload Prometheus,全部 15 targets UP

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:20:19 +08:00
OG T
af49a54728 fix(playbook): alert_names 完全匹配時 bypass 相似度門檻
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m58s
症狀: SentryDown/OllamaDown 告警觸發 incident,但 playbook 搜索
回傳 NO_MATCH,即使 alert_names 完全一致。

根本原因: Jaccard 加權計算中,affected_services 存的是 Prometheus
instance IP (192.168.0.110:9000),而 Playbook 存的是服務名 (sentry),
導致 services 維度得 0,最終 0.35 < min_similarity=0.4。

修正: alert_names 有交集時直接通過,不受其他維度拉低分數影響。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:05:07 +08:00
OG T
79a9a514dd fix(rules): ADR-064 L1 Redis 分散式鎖防止多 Pod 重複生成規則
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
問題: _generating set 是進程級,多 Pod 各自獨立,同一 alertname 可能被
  多個 Pod 同時送給 Ollama/Gemini 生成規則

修復: SET NX EX lock_key — 只有第一個 Pod 能取鎖,其他 Pod 直接跳過
降級: Redis 不可用時 fallback 進程級 set(保持原有行為)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:03:51 +08:00
OG T
6615432471 docs(logbook): Sprint 5.2 自動修復閉環完成記錄 2026-04-09 12:01:33 +08:00
OG T
b66263ad36 fix(decision_manager): resolved Incident 不重送 Telegram
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
dedup TTL 10分鐘過期後,已 resolve 的 Incident 仍被重新推送
加入狀態檢查,resolved/closed 直接跳過

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:00:39 +08:00
OG T
8d0042ed29 feat(ops): Sprint 5.2 docker-health-monitor 升級為自動修復模式
舊版: 純感知層 (L4-6),只送 Webhook,修復由 API 執行
新版: 感知 + 自動修復 + 回報

修復分級 (ADR-060):
- 一般容器: docker restart
- 監控棧 (prometheus/grafana/alertmanager): docker start (保護 WAL)
- DB/Redis/ClickHouse: 僅告警,禁止重啟

已部署到:
- 192.168.0.110 ~/awoooi-ops/docker-health-monitor.sh
- 192.168.0.188 ~/awoooi-ops/docker-health-monitor.sh
- 兩台 cron */5 * * * * 運行中

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:59:11 +08:00
OG T
b43e1f1818 feat(rules): L2-2 alerts-unified — 補充 14 條 Prometheus 告警規則 + target_down 自動修復
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
新增規則:
- postgresql_down / postgresql_connection_pool / postgresql_slow_queries
- redis_down / ollama_down / minio_down / minio_disk_high / harbor_down
- k3s_node_down / awoooi_api_down / alert_chain_broken / nvidia_circuit_breaker

修正:
- target_down: kubectl_command 從診斷改為自動重啟 exporter (docker restart / systemctl)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:49:28 +08:00
OG T
afe52c2c70 docs(logbook): Sprint 5 全面完成 + 監控告警全部修復
- C1-C4 + I1-I5 審查修正清零
- node-exporter Docker 部署 110+188
- RedisMemoryHigh 除以零誤報修正
- Prometheus 0 firing alerts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:48:58 +08:00
OG T
9361fd1fa7 fix(decision_manager): action 不應 strip_placeholders 避免截斷 deployment name
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
_strip_placeholders 移除 <...> 導致 kubectl rollout restart deployment/<name>
變成 kubectl rollout restart deployment/,Telegram 顯示建議指令不完整

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:45:33 +08:00