Your Name
|
95110971f3
|
fix(telegram): close remaining DM alert routes
CD Pipeline / tests (push) Successful in 1m27s
Code Review / ai-code-review (push) Successful in 29s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-04-30 23:02:17 +08:00 |
|
Your Name
|
61f5a6a419
|
fix(telegram): route alerts to SRE war room
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-04-30 15:01:23 +08:00 |
|
Your Name
|
fb130c9a28
|
feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization hardening + additional metrics fields
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Wave 8 P3.1-T2 follow-up tests + supporting changes:
New tests:
- test_diagnosis_aggregator_stub.py (238 lines): 15 tests
· stub fixture verifies the _collect_diagnosis_aggregator wiring
· feature flag defaults to off; the aggregator is not called
· timeout boundaries / exceptions fail soft
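The three behaviors tested above (flag off means no call, timeouts are bounded, exceptions fail soft) can be sketched as follows. This is only an illustration: `collect_diagnosis`, its parameters, and the 5-second default are hypothetical and are not taken from the repository's `_collect_diagnosis_aggregator`.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def collect_diagnosis(
    aggregator: Callable[[], Awaitable[dict]],
    enabled: bool = False,      # assumption: the feature flag defaults to off
    timeout_sec: float = 5.0,   # assumption: illustrative timeout value
) -> Optional[dict]:
    """Call the aggregator only when enabled; never raise to the caller."""
    if not enabled:
        # Flag off: the aggregator is never invoked.
        return None
    try:
        return await asyncio.wait_for(aggregator(), timeout=timeout_sec)
    except asyncio.TimeoutError:
        # Timeout boundary: give up and let the caller continue without a diagnosis.
        return None
    except Exception:
        # Fail soft: any aggregator error degrades to "no diagnosis".
        return None
```

A caller would treat a `None` result as "no aggregated diagnosis available" and continue with its normal path.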
Modified:
- core/metrics.py +23: adds DiagnosisAggregator-related Prometheus metrics
- sanitization_service.py +24: hardens prompt sanitization boundaries (companion to vuln #4)
- RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor tweaks
Tests: 15 passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-27 08:30:26 +08:00 |
|
Your Name
|
fefe4c21cd
|
fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana
CD Pipeline / build-and-deploy (push) Has been cancelled
Continues the INC-20260425 fix from 595629c0, adding timeouts for the three agent stages and end-to-end observability:
A1 follow-up: wire up timeouts for the three agent stages (Solver/Critic):
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override)
- protocol.py: all three agents wrap their calls in a shared observe_agent_step()
· success/timeout/error outcome label
· histogram written to aiops_agent_step_duration_seconds
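A minimal sketch of how a shared wrapper like this might look, assuming an asyncio-based agent call. The signature, the `timeout_from_env` helper, and the in-memory `OBSERVATIONS` list (standing in for the aiops_agent_step_duration_seconds histogram) are all hypothetical; the real observe_agent_step() in protocol.py may differ.

```python
import asyncio
import os
import time
from typing import Awaitable, Callable, List, Tuple, TypeVar

T = TypeVar("T")

# Stand-in for a Prometheus histogram: (agent, outcome, duration_seconds) samples.
OBSERVATIONS: List[Tuple[str, str, float]] = []


def timeout_from_env(name: str, default: float) -> float:
    """Read a timeout such as AGENT_SOLVER_TIMEOUT_SEC, falling back to the default."""
    raw = os.environ.get(name)
    return float(raw) if raw else default


async def observe_agent_step(
    agent: str, step: Callable[[], Awaitable[T]], timeout_sec: float
) -> T:
    """Run one agent step under a timeout and record its outcome and duration."""
    start = time.monotonic()
    outcome = "error"  # anything that is neither success nor timeout counts as error
    try:
        result = await asyncio.wait_for(step(), timeout=timeout_sec)
        outcome = "success"
        return result
    except asyncio.TimeoutError:
        outcome = "timeout"
        raise
    finally:
        # Always record a sample, whichever way the step ended.
        OBSERVATIONS.append((agent, outcome, time.monotonic() - start))
```

Under these assumptions a Solver call would be wrapped as `await observe_agent_step("solver", run_solver, timeout_from_env("AGENT_SOLVER_TIMEOUT_SEC", 20.0))`, where `run_solver` is a placeholder for the actual coroutine.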
A2 follow-up: auto_repair_service switches to _diagnose_fallback_chain:
- auto_repair_service.py +46 lines: routes DIAGNOSE through the new chain (NEMO→GEMINI→CLAUDE)
- fully avoids the 238s secondary timeout on Ollama CPU
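The ordered fallback described above can be illustrated with a small sketch. The function name, the `(name, callable)` provider shape, and the error handling are assumptions made for illustration; they are not the actual _diagnose_fallback_chain.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, call) pair; `call` takes a prompt and returns an answer.
Provider = Tuple[str, Callable[[str], str]]


def diagnose_with_fallback(prompt: str, chain: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, answer) from the first success."""
    errors: List[str] = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Remember why this provider failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

With a chain ordered NEMO, GEMINI, CLAUDE, a failure in NEMO falls through to GEMINI, and CLAUDE is contacted only if both earlier providers fail.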
New metrics:
- core/metrics.py +59 lines: histogram buckets + label cardinality to support observe_agent_step
New tests (862 lines):
- test_agent_step_timeouts.py (475): timeout boundaries for each of the three agents + outcome labels
- test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain
New supporting files:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): incident troubleshooting + observability guide
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
· histogram alert rules for the three agents (p99 > 80% of timeout → warning)
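To make the warning condition concrete: with the 20.0s Solver timeout the threshold is 16.0s, and with the 15.0s Critic timeout it is 12.0s. The sketch below checks the same condition over raw samples using a nearest-rank percentile; the real rules presumably evaluate the Prometheus histogram instead, so treat this only as an illustration of the arithmetic (function names are hypothetical).

```python
from typing import Sequence


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def should_warn(samples: Sequence[float], timeout_sec: float, ratio: float = 0.8) -> bool:
    """True when p99 latency exceeds the given fraction of the step timeout."""
    return p99(samples) > ratio * timeout_sec
```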
Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)
Total work for the two INC-20260425 fixes (595629c0 + this commit):
· 5 service/agent files modified
· 1 new observability module
· 4 new test/supporting files
· 1372+187 = 1559 lines added
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 後續) <noreply@anthropic.com>
|
2026-04-27 08:15:53 +08:00 |
|
Your Name
|
1ab6786ce3
|
feat(ops): Ollama failover Runbook + Grafana dashboard + Consensus K8s ConfigMap patch
run-migration / migrate (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Wave 6 P2.3 ops supporting files + tool-expert deployment docs:
Added:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines)
· verification steps for the three iron rules (automatic failover to Gemini / automatic failback / quota circuit breaker)
· complete failover/recovery SOP
· troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines)
· 4 panels: current primary / failover events / quota usage / health status
· maps to P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
· ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
|
2026-04-27 08:11:40 +08:00 |
|
OG T
|
a1432c03ed
|
docs: ADR-070/071 + ssh-mcp-setup runbook + Skill-04 v2.7
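- ADR-070: decision record for the fully automated AIOps closed loop, MCP Phase 1-4
- ADR-071: decision record for the four alert notification types + the three-stage KM data closed loop
- docs/runbooks/ssh-mcp-setup.md: SSH MCP setup/verification/rotation SOP
- Skill-04: v2.7 adds the full record of Sprint C DR + the 10 ADR-070 MCP providers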
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 20:04:47 +08:00 |
|
OG T
|
43edff184d
|
feat(dr): Sprint C: host rsync backup + DR SOP documents
C-1 Velero: confirmed running (daily-awoooi-prod schedule, 13d, MinIO Available)
C-2 Host rsync backup:
scripts/ops/backup-from-110.sh: host 188 backs up host 110 via rsync daily at 01:00
- Harbor registry data (highest priority)
- Gitea repos
- bitan-pharmacy.git (if present)
- on success, writes /var/run/backup-110.last_success for Prometheus monitoring
- on failure, sends a Telegram alert
ops/monitoring/alerts-unified.yml: adds the HostBackupFailed alert rule
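The success marker makes a staleness check possible: if the last success is older than roughly one backup interval plus some slack, the backup should be considered failed. A minimal sketch of that comparison follows; the function name and the 26-hour threshold are assumptions, and the actual HostBackupFailed rule may use a different expression.

```python
def backup_is_stale(
    last_success_epoch: float,
    now_epoch: float,
    max_age_hours: float = 26.0,  # assumption: daily job plus 2 hours of slack
) -> bool:
    """True when the last successful backup is older than the allowed age."""
    return (now_epoch - last_success_epoch) > max_age_hours * 3600
```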
C-3 DR SOP documents:
docs/runbooks/disaster-recovery/DR-K8s-awoooi.md (<15 minutes)
docs/runbooks/disaster-recovery/DR-Nginx.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Harbor.md (<30 minutes)
docs/runbooks/disaster-recovery/DR-Bitan.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Stock.md (<5 minutes)
Backup script deployment instructions (must be run manually):
scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh
ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}"
ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 03:04:18 +08:00 |
|
OG T
|
66b12bf9eb
|
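fix(infra): root-cause fix for the Harbor Exited(128) race condition + resident self-healing harbor-watchdog
Root cause:
When awoooi-startup-110.sh starts Harbor, the first compose up -d starts
all containers at once. harbor-core/db/portal try to connect to syslog:1514 (harbor-log is not ready yet),
fail and exit(128), and restart:always retries until the backoff gives up.
Even once harbor-log later becomes healthy, the other containers no longer retry.
Fix 1: Harbor startup ordering in startup-110.sh (4-phase strategy):
Phase 1: remove all Exited Harbor containers (breaks the backoff deadlock)
Phase 2: start only harbor-log
Phase 3: wait for harbor-log to become healthy (up to 90s)
Phase 4: start all components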
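The four phases amount to a small ordering routine. The sketch below shows that control flow with the concrete actions injected as callables, so no Docker commands are assumed; every name and the 2-second poll interval are hypothetical, and the real logic lives in the shell script.

```python
import time
from typing import Callable


def start_harbor(
    remove_exited: Callable[[], None],    # Phase 1 action (hypothetical hook)
    start_log: Callable[[], None],        # Phase 2 action (hypothetical hook)
    log_is_healthy: Callable[[], bool],   # Phase 3 probe (hypothetical hook)
    start_all: Callable[[], None],        # Phase 4 action (hypothetical hook)
    wait_sec: float = 90.0,
    poll_sec: float = 2.0,                # assumption: poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Run the four phases in order; return False if harbor-log never turns healthy."""
    remove_exited()                       # Phase 1: clear Exited containers
    start_log()                           # Phase 2: start only harbor-log
    deadline = clock() + wait_sec
    while not log_is_healthy():           # Phase 3: wait, bounded by wait_sec
        if clock() >= deadline:
            return False                  # do not start dependents against a dead logger
        sleep(poll_sec)
    start_all()                           # Phase 4: start every component
    return True
```

Injecting `sleep` and `clock` keeps the routine testable without waiting for real time to pass.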
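Fix 2: harbor-watchdog.service (resident self-healing):
Type=simple long-running process, polls http://127.0.0.1:5000/v2/ every 60s
unhealthy → wait 5s and re-check → run the full Phase 1-4 repair
covers the "crash while running" scenario that the reboot-ordering fix cannot
Bug fix: curl -f treats HTTP 401 as a failure (exit 22), but
Harbor's /v2/ normally returns 401 (authentication required); switched to curl -s without -f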
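The key point of the bug fix is that an HTTP 401 from /v2/ means the registry is up. The watchdog itself uses curl; the Python sketch below only illustrates the same decision logic. The function names are hypothetical, and treating exactly 200 and 401 as healthy is an assumption that is stricter than plain `curl -s`, which succeeds on any HTTP response.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def probe_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # a 401 response arrives here and still proves liveness
    except (urllib.error.URLError, OSError):
        return None


def is_healthy(status: Optional[int]) -> bool:
    """Assumption: 200 and 401 (authentication required) both count as healthy."""
    return status in (200, 401)


def needs_repair(
    probe: Callable[[], Optional[int]],
    sleep: Callable[[float], None],
    recheck_delay: float = 5.0,
) -> bool:
    """Unhealthy only if two probes, recheck_delay seconds apart, both fail."""
    if is_healthy(probe()):
        return False
    sleep(recheck_delay)
    return not is_healthy(probe())
```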
REBOOT-RECOVERY-SOP.md → v5.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 12:13:21 +08:00 |
|
OG T
|
f51bf5a6a8
|
feat(backup): backup coverage for all services + alerting: 9/9 services complete
New backups (deployed to 110; all passed on first run):
- backup-langfuse.sh: Langfuse AI tracing/evaluation DB (7238 traces)
- backup-monitoring.sh: Prometheus + Grafana + Alertmanager volumes + configs
- backup-signoz.sh: SignOz ClickHouse + SQLite (distributed tracing/logs)
- backup-open-webui.sh: Open-WebUI LLM conversation history (volume on 188 via SSH)
- backup-clawbot.sh: ClawBot Redis state/cache (volume on 188 via SSH)
- backup-all.sh v3.0: integrates all 9/9 services
Alerting:
- common.sh: notify_clawbot now uses the correct /webhook/custom format
- failed → severity:critical → Telegram 🔴 immediate alert
- Alert test passed: {"status":"ok","alert_id":"878c4c59..."}
GFS retention: 30 daily / 12 weekly / 24 monthly (AWOOOI additionally keeps 28h of high-frequency backups)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 11:12:42 +08:00 |
|
OG T
|
91564c6ea3
|
docs(sop): REBOOT-RECOVERY-SOP.md v4.0
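Updates:
- adds Sentry /opt/sentry startup instructions (110 Step 7/9)
- adds a section on repairing Sentry corruption after reboot (PostgreSQL WAL/Redis RDB/ClickHouse parts)
- extends the silent-alert diagnostic tree with a "rules not deployed" diagnosis + the deploy-alerts.sh fix command
- E2E verification script now checks Sentry + the Prometheus rule count (≥25)
- architecture diagram now includes Sentry :9000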
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 03:11:27 +08:00 |
|
OG T
|
8f64affbdb
|
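docs(runbooks): REBOOT-RECOVERY-SOP v3.0 complete reboot automation plan
## Contents
A complete inventory, across all hosts, services, tools, and monitoring, of:
- startup order and dependency graph
- handling procedures for normal vs. abnormal reboots
- detailed startup sequence for each host (188/110/120/121)
- troubleshooting guide for common failures (silent alerts/CD failure/data loss/NodePort)
- E2E verification script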
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 01:48:29 +08:00 |
|
OG T
|
be3aa6069b
|
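feat(backup): AWOOOI high-frequency backup: back up awoooi_prod every 6 hours
awoooi_prod is the core production DB; with one backup per day, a worst-case loss of 24 hours is unacceptable:
- backup-awoooi-frequent.sh: backs up awoooi_prod every 6 hours (08:00/14:00/20:00)
- at 02:00, backup-all.sh runs the full backup (including dev/k3s)
- 4 backups/day in total; maximum data loss ≤ 6 hours
- GFS retention: 28h high-frequency + 30 daily + 12 weekly + 24 monthly
First run: ✅ 680K, 4s, snapshot db050dbc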
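The "≤ 6 hours" claim follows from the schedule: runs at 02:00, 08:00, 14:00, and 20:00 are evenly spaced, including the wrap-around from 20:00 to 02:00 the next day. A small helper (hypothetical, for illustration only) makes the arithmetic explicit:

```python
from typing import Iterable


def max_gap_hours(run_hours: Iterable[int]) -> int:
    """Largest gap in hours between consecutive daily runs, including the overnight wrap."""
    hours = sorted(set(run_hours))
    if not hours:
        raise ValueError("no scheduled runs")
    gaps = [later - earlier for earlier, later in zip(hours, hours[1:])]
    gaps.append(24 - hours[-1] + hours[0])  # last run today to first run tomorrow
    return max(gaps)
```

For the schedule above, `max_gap_hours([2, 8, 14, 20])` evaluates to 6, while a single daily run gives 24.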
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 01:14:50 +08:00 |
|
OG T
|
3136fc5ea0
|
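feat(backup): fully automated backups + AWOOOI DB + extended GFS retention
Chief architect backup audit: everything is now automated:
- backup-awoooi.sh: new AWOOOI PostgreSQL backup script
- awoooi_prod (KB/incidents/AutoRepair/Drift) + k3s_datastore
- SSHes from 110 to 188 to run pg_dump, integrated into restic
- First run: 680K, 9s, snapshot 8750748f ✅
- backup-all.sh v2.0: integrates AWOOOI DB as the 4th service
- GFS retention policy extended:
- daily: 7→30 copies (covers the last 30 days)
- weekly: 4→12 copies (covers the last 3 months)
- monthly: 6→24 copies (covers the last 2 years)
- BACKUP-STATUS.md: updated to a fully automated status overview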
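As a rough illustration of what a 30 daily / 12 weekly / 24 monthly GFS policy selects, the sketch below keeps the newest snapshot in each of the most recent N day, ISO-week, and month buckets. It is a simplified model written for this note, not restic's implementation, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Callable, Hashable, Iterable, List, Set


def keep_latest_per_bucket(
    snapshots_desc: List[datetime],
    bucket_of: Callable[[datetime], Hashable],
    count: int,
) -> Set[datetime]:
    """Keep the newest snapshot in each of the `count` most recent buckets."""
    kept: Set[datetime] = set()
    seen: Set[Hashable] = set()
    for ts in snapshots_desc:          # newest first
        bucket = bucket_of(ts)
        if bucket in seen:
            continue                   # this bucket already has a newer snapshot
        if len(seen) >= count:
            break                      # enough buckets covered
        seen.add(bucket)
        kept.add(ts)
    return kept


def gfs_keep(
    snapshots: Iterable[datetime], daily: int = 30, weekly: int = 12, monthly: int = 24
) -> Set[datetime]:
    """Union of the daily, weekly, and monthly selections."""
    ordered = sorted(snapshots, reverse=True)
    keep = keep_latest_per_bucket(ordered, lambda t: t.date(), daily)
    keep |= keep_latest_per_bucket(ordered, lambda t: tuple(t.isocalendar()[:2]), weekly)
    keep |= keep_latest_per_bucket(ordered, lambda t: (t.year, t.month), monthly)
    return keep
```

Snapshots not returned by `gfs_keep` would be the candidates for pruning under this model.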
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 01:11:31 +08:00 |
|
OG T
|
84cfdb6195
|
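docs(backup): full backup audit inventory + new AWOOOI DB and Gitea DB backup scripts
Chief architect backup audit findings:
- awoooi_prod PostgreSQL: ❌ no backup (P0 gap)
- Gitea SQLite DB: ❌ no backup (corrupted today; manual repair took 2h+)
Added:
- scripts/backup/backup-awoooi-db.sh (deployed on 188, 02:00 daily)
- scripts/backup/backup-gitea-db.sh (deployed on 110, 01:00 daily)
- docs/runbooks/BACKUP-STATUS.md (overview table + deployment steps + SOP)
- backup audit section in LOGBOOK.md
Pending manual deployment: the commander must scp the scripts to 188/110 and set up crontab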
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 01:01:58 +08:00 |
|
OG T
|
f4f454fd98
|
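feat(api): automatically warm up Redis Working Memory from PostgreSQL after reboot
- main.py lifespan: on startup, restores INVESTIGATING/MITIGATING incidents from the DB
- scripts/reboot-recovery: automation scripts for 188 + 110 + systemd services
- scripts/reboot-recovery: creates aiops-network automatically (ClawBot depends on it)
- docs/runbooks/REBOOT-RECOVERY-SOP.md: fully rewritten, including documentation of the automation scripts
Why: after a reboot Redis is empty, so the frontend shows 0 incidents (the DB still holds everything)
Approved by the commander: "All data must be recorded for the long term"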
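The warm-up idea can be sketched as a filter-and-copy step: read incident rows from the database and repopulate the working-memory cache with the ones still in progress. In the sketch below a plain dict stands in for Redis, and the row shape, the `incident:<id>` key format, and the function name are all assumptions rather than the actual main.py code.

```python
import json
from typing import Dict, Iterable, Mapping

ACTIVE_STATUSES = {"INVESTIGATING", "MITIGATING"}


def warm_up(rows: Iterable[Mapping[str, object]], cache: Dict[str, str]) -> int:
    """Copy active incidents into the cache; return how many were restored."""
    restored = 0
    for row in rows:
        if row["status"] in ACTIVE_STATUSES:
            # Hypothetical key format; the real service may use a different one.
            cache[f"incident:{row['id']}"] = json.dumps(dict(row))
            restored += 1
    return restored
```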
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 00:39:20 +08:00 |
|
OG T
|
08f73dfce8
|
docs: Phase O-5 Wave 5.4 alert pipeline E2E verification Runbook
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
- architecture diagram, manual test steps, smoke test checklist
- generate_monitoring.py usage notes
- known-issue exemption list, rollback commands
- first acceptance record: 2026-04-02, 8/8
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-02 21:34:43 +08:00 |
|
OG T
|
5a46998689
|
docs: Secrets management handbook (ADR-035+ unified source of truth for Secrets)
CD Pipeline / build-and-deploy (push) Successful in 5m23s
E2E Health Check / e2e-health (push) Successful in 17s
Creates docs/runbooks/SECRETS-MANAGEMENT.md:
- complete list of 7 Gitea Secrets + 12 K8s Secrets
- update SOP (API + Web UI)
- one-command status check
- guide to obtaining/updating each key
- emergency handling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 15:40:48 +08:00 |
|
OG T
|
89e05e6ea2
|
docs: ADR-037 + monitoring architecture proposal + Runbooks
- ADR-037 monitoring enhancement architecture
- MONITORING_MASTER_PLAN master plan
- MASTER_EXECUTION_SCHEDULE execution schedule
- Phase D/E/Worker HPA Runbooks
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:04:08 +08:00 |
|
OG T
|
95b46af986
|
docs: add audit report + inspiration lab + Runbook updates
- AWOOOI_COMPREHENSIVE_AUDIT_2026Q1.md full-scope audit
- INSPIRATION_LAB.md idea collection
- K3S-OPTIMIZATION-RUNBOOK.md optimization guide
- ADR-006 AI Fallback strategy update
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:03:41 +08:00 |
|
OG T
|
e03d99b871
|
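docs(runbook): K3s optimization Runbook v1.2 - mark completion status
Phase completion status:
- K0 ✅ Swap/PDB/backup/cleanup (chief architect 9.0/10)
- K-NET ✅ VIP 192.168.0.125 + CI/CD integration
- K-CLEAN ✅ cleaned up 9 RS + 1 Job
K-HA 📋 to be planned separately (requires a maintenance window)
Updates:
- version number 1.1 → 1.2
- table of contents marks completion status
- execution results added to each Phase
- Appendix A: actual execution timeline
- issue statistics (before/after cleanup comparison)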
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 18:52:13 +08:00 |
|
OG T
|
efb80b403e
|
feat(k8s): Phase K0.5 Startup Probe + PDB + revisionHistoryLimit
K3s production-grade optimization, Phase K0 changes:
- adds startupProbe to the API/Web/Worker Deployments (60s startup time)
- adds revisionHistoryLimit: 3 (fewer orphaned ReplicaSets)
- adds 09-pdb.yaml (PodDisruptionBudget protection)
- adds K3S-OPTIMIZATION-RUNBOOK.md (execution handbook)
- fixes the selector to match the existing Deployments (app+environment+system)
Chief architect review: 9.0/10 ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 11:13:44 +08:00 |
|