OG T
07a097c259
fix(infra): Sprint 3 自動修復全鏈路修復 — docker-188 SSH egress + service registry 擴充
...
CD Pipeline / build-and-deploy (push) Has been cancelled
NetworkPolicy: 新增 192.168.0.188:22 egress — repair-bot-188.sh 執行路徑
service-registry.yaml: 新增 signoz/bitan-app (AUTO, 188主機)
修復覆蓋: Bug #11 補完 (188 SSH) + 188 服務分級覆蓋
E2E 驗證: MoWoooWorkDown → SSH → REPAIR_OK:momo-app (3791ms)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 18:04:19 +08:00
OG T
895784e646
feat(web): S7+S9+S10 待審批+AI模型+監控工具3×2 — Sprint 5R
...
CD Pipeline / build-and-deploy (push) Successful in 12m15s
- S7: PendingApprovalsCard 含風險標籤 + 批准/拒絕按鈕
- S9: AIModelStatus 2×2 (OpenClaw/Ollama/Gemini/NVIDIA)
- S10: MonitoringTools 改 3×2 網格 (名稱+元資訊+左側色條)
- 右欄順序: OpenClaw → 待審批 → 基礎架構 → 監控工具 → AI模型
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 16:10:28 +08:00
OG T
a0f3a7d532
feat(web): S6 OpenClaw AI Terminal + 狀態數據 — Sprint 5R
...
CD Pipeline / build-and-deploy (push) Successful in 13m15s
- 分隔線下方新增:模型名稱 + 運行狀態
- 即時統計:今日分析數 / 成功率 / MTTR
- AI 推理終端:#141413 背景 + #a0e8a0 螢光綠 + JetBrains Mono
- 最後一行黃色閃爍游標 ▎
- 資料來源:/api/v1/alert-operation-logs + /api/v1/stats/disposition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 15:56:03 +08:00
OG T
b85a0e232e
feat(web): S4+S5 處置統計環形圖 + 最近活動時間線 — Sprint 5R
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- S4: DispositionMini 元件 (SVG 環形圖 + 四類列表)
- S5: RecentActivity 元件 (時間線 + 色點 + JetBrains Mono)
- 左欄改為 flex:6 可滾動多卡片列
- 右欄改為 flex:4 (60:40 比例)
- 左欄結構: 活躍事件 → 處置統計 → 最近活動
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 15:51:54 +08:00
OG T
7a2e07f74f
feat(web): S1 KPI Strip 改 5 張卡片 — Sprint 5R Phase 1B
...
- 7 指標分隔線 → 5 張 kpi-card 卡片橫排
- 系統健康(進度條) / 活動事件(P1:P2) / 自動修復率(進度條+↑5%) / 待審批 / 本週操作
- 移除龍蝦游泳列(統帥指示移除)
- 新增 weeklyOps 從 /api/v1/audit-logs/stats 取得
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 15:48:04 +08:00
OG T
289dac6bd1
fix(web): S11+S12 載入失敗修復 — Sprint 5R Phase 1A
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- S11: Tab 2 approvals API path 修正 (?status=pending → /pending)
- S11: Tab 2 fetch 加 r.ok 檢查避免解析錯誤 JSON
- S12: 安全合規改用 SecurityPanel + CompliancePanel (解決 double AppLayout)
- S12: 知識庫改為 redirect 到 /knowledge-base (避免 lazy import 問題)
- S12: 拓撲圖加入 useDashboardStore.connect() 啟動 SSE
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 15:43:06 +08:00
OG T
c180bdaaac
docs: Sprint 5R 前端重構批准 — ADR-065 + 設計稿 + Skills + LOGBOOK
...
- ADR-065: Sprint 5R 前端重構決策(版本 A 批准)
- sprint5r-approved-design.html: 統帥批准的設計稿存檔
- Skills 01 v1.7: 品牌 Logo/AwoooI 一致性鐵律
- LOGBOOK: Sprint 5R 開始實施
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 15:15:43 +08:00
OG T
d8c2969341
feat(telegram): AI 鏈路透明化 — 告警訊息顯示 OpenClaw + Tool Calling 模型/後端
...
CD Pipeline / build-and-deploy (push) Successful in 12m12s
- nemotron.py: 偵測 OllamaToolProvider vs NvidiaProvider,記錄 tool_model/tool_backend
- openclaw.py: 傳播 nemotron_tool_model/nemotron_tool_backend 到 proposal
- decision_manager.py: 從 proposal_data 提取並傳給 send_approval_card()
- telegram_gateway.py: TelegramMessage 新增兩個欄位,format_with_nemotron 顯示
"🔧 Tool Calling: llama3.1:8b (Ollama 本機)" 或 "NVIDIA 雲端"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 15:05:16 +08:00
OG T
aa2eb486ce
docs(logbook): 自動修復 L7 閉環記錄 — 12 Bug 全修 E2E 6208ms 成功
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:55:40 +08:00
OG T
7857c25677
feat: Ollama 本機 Tool Calling 取代 NVIDIA 雲端 (44s→~5s)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- nvidia_provider.py: 新增 OllamaToolProvider
- 實作 INvidiaProvider protocol,打 Ollama /v1/chat/completions
- 模型: llama3.1:8b (tool calling 最穩定的 8B)
- 延遲: 44s → ~5s(本機 M1 Pro 192.168.0.111)
- get_nvidia_provider() 根據 USE_OLLAMA_TOOL_CALLING 切換
- config.py: USE_OLLAMA_TOOL_CALLING=True (預設開啟), OLLAMA_TOOL_MODEL=llama3.1:8b
- 回退: USE_OLLAMA_TOOL_CALLING=False → 恢復 NvidiaProvider 雲端
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:55:04 +08:00
OG T
77f2da9264
fix(k8s): Bug #11+#12 — SSH egress 白名單 + repair-ssh-key 讀取權限
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Bug #11 (NetworkPolicy): allow-required-egress 缺少 192.168.0.110:22
- K8s Pod 到 110 的 SSH port 22 被 default-deny-all 封鎖
- 自動修復的 SSH_COMMAND Playbook 必然 Connection refused
- 修正: 加入 port 22 到 110 的 egress 白名單
Bug #12 (Deployment): repair-ssh-key Secret defaultMode=0400 (root-only)
- Pod 以 appuser(UID 1000) 跑,無法讀取 root-owned 的 SSH key
- ssh 報錯: "Load key: Permission denied"
- 修正: 加入 securityContext.fsGroup=1000,讓 appuser 透過 group read 存取
- 已驗證: Pod 內 ssh → repair-bot-110 → REPAIR_OK:sentry ✅
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:50:49 +08:00
OG T
4f80ba38c0
feat: 告警狀態變更在原訊息延續 (方案 B)
...
CD Pipeline / build-and-deploy (push) Successful in 12m28s
**telegram_gateway.py**
- 新增 append_incident_update(incident_id, status_line)
- 從 Redis tg_msg:{id} 取 message_id
- editMessageReplyMarkup: 移除 Row1(批准/拒絕/靜默),保留 Row2(詳情/重診/歷史)
- sendMessage reply_to_message_id: 在原訊息下方追加狀態行
- 找不到 message_id 回傳 False(呼叫方自行 fallback)
**decision_manager.py**
- _push_decision_to_telegram: send_approval_card 後存 tg_msg:{id}=message_id (TTL 24h)
- _push_auto_repair_result: 改用 append_incident_update,找不到 message_id 才 fallback 新訊息
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:21:33 +08:00
OG T
20a2ec1455
ci: 重觸發 CD — Bug #5+#6 修正部署 (ssh binary + target_resource)
2026-04-09 14:19:43 +08:00
OG T
2554ac1e60
fix: E2E test 告警識別 + 自動修復結果 Telegram 通知
...
CD Pipeline / build-and-deploy (push) Has been cancelled
**alert_rule_engine.py**
- _matches() 加入 instance_prefix 匹配(最高優先)
- match_rule() 傳入 instance label 至 _matches
- 用途: e2e-final-* / e2e-test-* instance 可被 YAML 規則識別
**alert_rules.yaml**
- 新增 e2e_smoke_test 規則 (priority=120)
- alertname: E2E_SMOKE_TEST / instance_prefix: e2e-final- / e2e-test- / test-host
- suggested_action: NO_ACTION,顯示「告警鏈路驗證成功」
**decision_manager.py**
- _auto_execute() 成功後發 Telegram 結果通知 ✅
- _auto_execute() 失敗後發 Telegram 失敗通知 ❌
- 新增 _push_auto_repair_result() 函數
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:16:15 +08:00
OG T
1fb0c0ca90
fix(auto-repair): Bug #5+#6 — SSH binary + affected_services 匹配修正
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Bug #5 (webhooks.py): target_resource 現在優先用 component label
- SentryDown alert 有 labels.component="sentry"
- 舊邏輯: labels.instance="192.168.0.110:9000" → Playbook affected_services 不匹配
- 新邏輯: component → pod → instance → alertname
Bug #6 (Dockerfile): python:3.11-slim 無 openssh-client
- SSH_COMMAND Playbook 執行路徑調用 asyncio.create_subprocess_exec("ssh", ...)
- image 沒有 ssh binary → 所有 SSH 修復必然失敗
- 修正: 在 production stage 安裝 openssh-client
服務清單: 補 sentry 主服務到 service-registry.yaml (AUTO 級別)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:11:50 +08:00
OG T
73ef9c6b12
fix(web): QA 掃描 — alert-operation-logs i18n + classic emoji→icon + knowledge 載入中
...
CD Pipeline / build-and-deploy (push) Successful in 12m28s
- alert-operation-logs: 30+ 處硬編碼中文改 useTranslations (18 event types + UI)
- classic: 告警 badge + 等待確認 + TOOL_EMOJI → Lucide icon
- knowledge: 載入中 → common.loading
- 新增 alertOpLogs i18n section (zh-TW + en)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 13:58:04 +08:00
OG T
1d88b7cd9d
fix(webhooks): Signal.labels 補 alertname 讓 playbook 匹配能讀到原始 alertname
...
CD Pipeline / build-and-deploy (push) Has been cancelled
問題: create_incident_for_approval 建立 Signal 時 labels 只有
namespace/resource,沒有 alertname,導致 _extract_symptoms 讀
labels.alertname 取得 None,fallback 到 alert_name="custom",
playbook Jaccard 永遠無法匹配真實 alertname (如 SentryDown)。
修正: 新增 alertname 參數,傳入 Signal.labels["alertname"]。
兩個呼叫點 (LLM 成功 + fallback) 都補上 alertname=alertname。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 13:54:42 +08:00
OG T
08db3580a7
fix(monitoring): 修復 110 主機 CPU 高負載
...
CD Pipeline / build-and-deploy (push) Has started running
根因 1: cadvisor 持續掃描 overlay2 磁碟用量 (每次 1-4s × N 容器)
→ 加 --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
→ --housekeeping_interval=30s --docker_only=true
→ CPU 從 239% 降到 <1%
根因 2: node_exporter scrape_timeout 預設 10s,高 load 下超時→broken pipe→瘋狂重試
→ 加 scrape_interval: 30s / scrape_timeout: 25s
→ CPU 從 48% 降到 0%
整體 load average: 20 → 9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 13:53:13 +08:00
OG T
e4070b2f86
fix(webhooks): 補 get_alert_operation_log_repository import 兩處
...
CD Pipeline / build-and-deploy (push) Successful in 12m53s
alert_received_log_failed 錯誤原因:alertmanager_webhook 函數內
直接呼叫 get_alert_operation_log_repository() 但未在 local scope import,
導致 NameError 被 except 吞掉,ALERT_RECEIVED 事件無法記錄。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 12:29:48 +08:00
OG T
fc03eb1f4d
fix(auto-repair): _extract_symptoms 優先用 labels.alertname 取得原始 alertname
...
CD Pipeline / build-and-deploy (push) Has been cancelled
問題: signal.alert_name 存的是 alert_type (如 "custom"),而非 Prometheus
alertname (如 "SentryDown"),導致 playbook Jaccard 匹配永遠失敗 (NO_MATCH)。
根本原因: webhook 的 alertname_to_type mapping 將未知 alertname 轉為 "custom",
存入 signal.alert_name,但 Playbook 的 symptom_pattern.alert_names 存原始名稱。
修正: 從 signal.labels["alertname"] 讀取原始 Prometheus alertname,
fallback 到 signal.alert_name (保持向下相容)。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 12:26:18 +08:00
OG T
5bd8a8a719
fix(monitoring): 補齊 blackbox-tcp scrape targets (11→15)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
SentryDown/HarborDown/SignOzDown 等告警引用的 instance 不在 scrape list
導致 absent metric = 0,告警持續 firing
新增缺少的 targets:
192.168.0.125:6443/32334/32335 (K3s)
192.168.0.110:9000/5000/3100 (Sentry/Harbor/Langfuse)
192.168.0.188:3301/5432/6380/11434/8089 (SignOz/PG/Redis/Ollama/OpenClaw)
已在 110 主機 reload Prometheus,全部 15 targets UP
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 12:20:19 +08:00
OG T
af49a54728
fix(playbook): alert_names 完全匹配時 bypass 相似度門檻
...
CD Pipeline / build-and-deploy (push) Successful in 12m58s
症狀: SentryDown/OllamaDown 告警觸發 incident,但 playbook 搜索
回傳 NO_MATCH,即使 alert_names 完全一致。
根本原因: Jaccard 加權計算中,affected_services 存的是 Prometheus
instance IP (192.168.0.110:9000),而 Playbook 存的是服務名 (sentry),
導致 services 維度得 0,最終 0.35 < min_similarity=0.4。
修正: alert_names 有交集時直接通過,不受其他維度拉低分數影響。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 12:05:07 +08:00
OG T
79a9a514dd
fix(rules): ADR-064 L1 Redis 分散式鎖防止多 Pod 重複生成規則
...
CD Pipeline / build-and-deploy (push) Has started running
問題: _generating set 是進程級,多 Pod 各自獨立,同一 alertname 可能被
多個 Pod 同時送給 Ollama/Gemini 生成規則
修復: SET NX EX lock_key — 只有第一個 Pod 能取鎖,其他 Pod 直接跳過
降級: Redis 不可用時 fallback 進程級 set(保持原有行為)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 12:03:51 +08:00
OG T
6615432471
docs(logbook): Sprint 5.2 自動修復閉環完成記錄
2026-04-09 12:01:33 +08:00
OG T
b66263ad36
fix(decision_manager): resolved Incident 不重送 Telegram
...
CD Pipeline / build-and-deploy (push) Has started running
dedup TTL 10分鐘過期後,已 resolve 的 Incident 仍被重新推送
加入狀態檢查,resolved/closed 直接跳過
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 12:00:39 +08:00
OG T
8d0042ed29
feat(ops): Sprint 5.2 docker-health-monitor 升級為自動修復模式
...
舊版: 純感知層 (L4-6),只送 Webhook,修復由 API 執行
新版: 感知 + 自動修復 + 回報
修復分級 (ADR-060):
- 一般容器: docker restart
- 監控棧 (prometheus/grafana/alertmanager): docker start (保護 WAL)
- DB/Redis/ClickHouse: 僅告警,禁止重啟
已部署到:
- 192.168.0.110 ~/awoooi-ops/docker-health-monitor.sh
- 192.168.0.188 ~/awoooi-ops/docker-health-monitor.sh
- 兩台 cron */5 * * * * 運行中
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:59:11 +08:00
OG T
b43e1f1818
feat(rules): L2-2 alerts-unified — 補充 14 條 Prometheus 告警規則 + target_down 自動修復
...
CD Pipeline / build-and-deploy (push) Has been cancelled
新增規則:
- postgresql_down / postgresql_connection_pool / postgresql_slow_queries
- redis_down / ollama_down / minio_down / minio_disk_high / harbor_down
- k3s_node_down / awoooi_api_down / alert_chain_broken / nvidia_circuit_breaker
修正:
- target_down: kubectl_command 從診斷改為自動重啟 exporter (docker restart / systemctl)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:49:28 +08:00
OG T
afe52c2c70
docs(logbook): Sprint 5 全面完成 + 監控告警全部修復
...
- C1-C4 + I1-I5 審查修正清零
- node-exporter Docker 部署 110+188
- RedisMemoryHigh 除以零誤報修正
- Prometheus 0 firing alerts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:48:58 +08:00
OG T
9361fd1fa7
fix(decision_manager): action 不應 strip_placeholders 避免截斷 deployment name
...
CD Pipeline / build-and-deploy (push) Has been cancelled
_strip_placeholders 移除 <...> 導致 kubectl rollout restart deployment/<name>
變成 kubectl rollout restart deployment/,Telegram 顯示建議指令不完整
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:45:33 +08:00
OG T
d467fc11be
fix(nemotron): 修復 deployment_name placeholder 問題
...
根因: Nemotron tool calling 收到 target_resource=DockerContainerUnhealthy
(非真實 K8s deployment name),不確定時填 <deployment_name>
修復:
1. prompt 明確標注 deployment_name 必須填入 target_resource
2. 收到 tool call 結果後,偵測 placeholder 並用 target_resource 覆蓋
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:44:25 +08:00
OG T
85d4857d1b
fix(monitoring): RedisMemoryHigh 誤報 — max_bytes=0 除以零修正
...
CD Pipeline / build-and-deploy (push) Has started running
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
- 加入 redis_memory_max_bytes > 0 前置條件
- 防止 Redis 未設 maxmemory 時除以零產生 +Inf 永遠觸發
- 影響: alerts-unified.yml + database-alerts.yaml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:41:10 +08:00
OG T
bf4ec18d0e
docs(adr): ADR-030 補充九-十章實作完成記錄
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:29:04 +08:00
OG T
580053394b
fix(web): C4 監控工具 emoji → Lucide icon (feedback_no_emoji_use_icons.md)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
TOOL_EMOJI Record<string> 改為 TOOL_ICON Record<React.ReactNode>
使用 BarChart3/Flame/Telescope/FlaskConical/Activity/GitBranch
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:28:53 +08:00
OG T
12b084e2e0
docs(logbook): 2026-04-09 Telegram 截斷修復 + Panel 抽取全完成
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:21:59 +08:00
OG T
4a94588766
fix(web): I3 approve/reject API + I4 SIGNOZ_URL env + I5 ErrorsPanel nothing-gray
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- I3: Approve/Reject 按鈕串接 /api/v1/approvals/{id}/sign|reject
- I4: ApmPanel SIGNOZ_URL 改用 NEXT_PUBLIC_SIGNOZ_URL 環境變數
- I5: ErrorsPanel 外框改用 nothing-gray 調色盤 inline style
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:20:44 +08:00
OG T
28d2ff704e
fix(web): C1 殘留 i18n — 5 處硬編碼中文改 useTranslations
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- 告警 badge: alertBadge / alertBadgeZero
- 等待確認: awaitingConfirm
- 主機/拓撲 toggle: hostView / topoView
- HOST_CATALOG description 確認未渲染,不需 i18n
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:18:05 +08:00
OG T
c5e475121a
fix(telegram): 修復建議指令被截斷 + decision_manager enum string 補正
...
CD Pipeline / build-and-deploy (push) Has been cancelled
根因 1: telegram_gateway.py suggested_action[:35] 剛好截到 deployment/ 後
→ 改為 [:80],完整顯示 kubectl command
根因 2: 舊 Incident proposal_data 存 enum string (RESTART_DEPLOYMENT)
→ decision_manager.py 加入偵測,用規則引擎重新查 kubectl command
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:14:30 +08:00
OG T
f51ef5e089
docs(logbook): 首席架構師審查 P0 修正完成記錄
2026-04-09 11:08:51 +08:00
OG T
fb66ecd2a0
refactor(web): Panel 抽取全面完成 — 三個整合頁面解決雙重 AppLayout
...
CD Pipeline / build-and-deploy (push) Has been cancelled
/observability: AppsPanel + ServicesPanel (共 5/5 Tab 完成)
/automation: AutoRepairPanel + NeuralCommandPanel + DriftPanel (3/3)
/operations: DeploymentsPanel + TicketsPanel + CostPanel + ActionLogsPanel + BillingPanel (5/5)
原始頁面全部精簡為 AppLayout + Panel,零雙重 Layout。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:06:57 +08:00
OG T
7934ade3a6
refactor(web): 全部 13 Panel 抽取完成 + 整合頁面雙重 AppLayout 修正
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Panel 抽取 (13 個):
- MonitoringPanel, ApmPanel, ErrorsPanel, AppsPanel, ServicesPanel
- AutoRepairPanel, NeuralCommandPanel, DriftPanel
- DeploymentsPanel, TicketsPanel, CostPanel, ActionLogsPanel, BillingPanel
整合頁面更新 (全部使用 Panel,無雙重 AppLayout):
- /observability: 5 Panel
- /automation: 3 Panel
- /operations: 5 Panel
首席架構師 I2 問題已解決
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-09 11:05:37 +08:00
OG T
9e10305acc
fix(web): C2 拓撲元件 i18n — 10+ 處硬編碼中文改 useTranslations
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-09 11:04:35 +08:00
OG T
7153395267
fix(web): 首席架構師 P0 修正 — i18n 硬編碼 + 效能輪詢
...
CD Pipeline / build-and-deploy (push) Has been cancelled
C1: 首頁 4 Tab 30+ 處硬編碼中文改為 useTranslations
- 新增 dashboard.tabs.* / alertEvents / approve / reject 等 30+ i18n key
- zh-TW + en 雙語同步
C3: automation/operations Loading 改用 LobsterLoading (i18n)
I1: 100ms setInterval 改為 popstate + 1s 低頻備援 (效能 10x 改善)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-09 11:01:07 +08:00
OG T
5ea6c3fb91
feat: alert_operation_log 查詢 API + 前端頁面 (Sprint 5.2)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
後端:
- 新增 list_recent() 分頁方法 (alert_operation_log_repository)
- 新增 /api/v1/alert-operation-logs GET + /stats 端點
- main.py 註冊 alert_operation_logs_v1.router
前端:
- /alert-operation-logs 頁面,18 種 event_type 顏色標記
- 分頁、event_type 篩選、incident_id 篩選
- 24h 統計卡片 (總數/護欄攔截/自動修復/已解決)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 10:57:40 +08:00
OG T
428e66c111
fix(arch-review): 首席架構師審查 S1×3 S2×3 S3×3 全修復 + ADR-064
...
CD Pipeline / build-and-deploy (push) Has been cancelled
S1 Critical:
- S1-1: asyncio 觸發移至 _call_with_fallback async 上下文,移除 sync 中的 get_event_loop()
- S1-2: _append_rule_to_yaml 加 textwrap.dedent() 正規化 LLM 輸出縮排
- S1-3: _matches() 對 alertname=["*"] 直接回傳 False,防意外命中
S2 Major:
- S2-1: auto_generate_rule() 改為 DI 參數注入 (ollama_url/model/gemini_api_key),移除 import settings
- S2-4: _generate_mock_response docstring 澄清為規則引擎生產路徑,非假數據
- S2-5: suggested_action .strip() 防空白字串繞過 or
S3 Minor:
- S3-2: priority 上界 min(next, 890)
- S3-3: alertname sanitize re.sub([{}]) 防 format KeyError
- S3-4: model_registry.py 最後修改時間戳更新
文件:
- ADR-064: Alert Rule Engine YAML 驅動 + AI 自動學習
- Skills 02: 告警規則引擎 DI 規範 + asyncio 禁止事項
- Skills 03: _generate_mock_response 語意澄清 + 規則引擎降級流程
- LOGBOOK: 本次 Session 完整記錄
2026-04-09 ogt: 首席架構師審查修正
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 10:52:40 +08:00
OG T
11fc2860cf
refactor(web): ErrorsPanel 抽取 — /observability 3 個 Tab 已無雙重 Layout
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-09 10:51:59 +08:00
OG T
22fa6ea413
refactor(web): ApmPanel 抽取 — /observability 的 monitoring+apm 兩個 Tab 無雙重 Layout
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-09 10:49:39 +08:00
OG T
4b3fdd82f9
fix(api): incidents list 不再同步等待 AI 決策 (效能修復)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
問題: GET /api/v1/incidents 對每個 incident await AI 分析 (120-180s)
多個活躍 incident 時 timeout 乘積爆炸 → 前端完全無法載入
修復:
- list endpoint 只查 Redis 已快取的決策 token (立即返回)
- 無快取時回 decision=null,背景 fire-and-forget 觸發 AI
- 前端對有興趣的 incident 再 GET 單筆端點取得決策結果
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 10:49:30 +08:00
OG T
f05a391d02
feat(web): panels/index.ts 匯出 + Panel 抽取進度標記
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-09 10:42:30 +08:00
OG T
5ead01abf7
feat(ops): dr-drill.sh — 每月 DR Drill 自動演練
...
每月第一個週日 03:00 (121 cron) 執行:
1. 找最新 Velero backup (Completed)
2. 還原到 awoooi-dr-test namespace
3. 等待 Pod Ready + API health 驗證
4. 清理 dr-test namespace + restore 資源
5. Telegram 通知 PASS/FAIL + 耗時
支援 --dry-run 模式 (只檢查 backup,不還原)。
dry-run 驗證通過: daily-awoooi-prod-20260409020003
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 10:42:12 +08:00
OG T
770667eed4
refactor(web): MonitoringPanel 抽取 — 解決 /observability 雙重 AppLayout
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-09 10:40:07 +08:00