Your Name
2ec7f6f440
fix(ops): harden heartbeat and momo alert noise
Code Review / ai-code-review (push) Successful in 14s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 31s
CD Pipeline / tests (push) Successful in 1m59s
CD Pipeline / build-and-deploy (push) Successful in 7m36s
CD Pipeline / post-deploy-checks (push) Failing after 43s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-24 19:38:33 +08:00
Your Name
95f442adab
fix(ops): harden 188 backup exporter recovery [skip ci]
2026-06-24 06:37:44 +08:00
Your Name
93ac6030cf
fix(ops): sync source provider freshness alert rules
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 10s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 24s
2026-06-18 14:23:13 +08:00
Your Name
ff18872a23
feat(ops): add host runaway process aiops guard
Code Review / ai-code-review (push) Successful in 14s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Failing after 26s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-18 14:17:03 +08:00
Your Name
6efd186750
docs(security): create high-value configuration control inventory [skip ci]
2026-06-11 11:29:58 +08:00
Your Name
ae7b39d96a
fix(ops): harden reboot recovery and backup alerts
2026-05-29 12:41:34 +08:00
Your Name
ae9d0b7385
feat(monitoring): alert on stale source provider ingestion
Code Review / ai-code-review (push) Successful in 10s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 25s
CD Pipeline / tests (push) Successful in 3m26s
CD Pipeline / build-and-deploy (push) Successful in 3m38s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
2026-05-20 19:19:21 +08:00
Your Name
598f33ae8b
fix(monitoring): clarify alert chain smoke evidence
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 22s
CD Pipeline / tests (push) Successful in 3m55s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m33s
2026-05-20 13:11:44 +08:00
Your Name
21dcfbd991
fix(governance): collapse km slo fallback series
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 22s
CD Pipeline / tests (push) Successful in 1m6s
CD Pipeline / build-and-deploy (push) Successful in 5m17s
CD Pipeline / post-deploy-checks (push) Successful in 1m38s
2026-05-14 19:37:15 +08:00
Your Name
d2a4a17969
fix(governance): stabilize adr100 km growth slo
Code Review / ai-code-review (push) Successful in 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 25s
CD Pipeline / tests (push) Successful in 1m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-14 19:33:52 +08:00
Your Name
4111ea4f9f
fix(ai): remove 188 ollama provider
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-06 14:34:48 +08:00
Your Name
587551c1f1
fix(ops): monitor full-stack cold-start gates
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 18s
2026-05-06 00:48:05 +08:00
Your Name
23932773ef
fix(monitoring): route docker baseline alerts to ssh
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 19s
2026-05-06 00:00:12 +08:00
Your Name
2f50c67f5c
fix(monitoring): keep host alert ssh diagnostics canonical
Code Review / ai-code-review (push) Successful in 10s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 20s
E2E Health Check / e2e-health (push) Successful in 2m35s
2026-05-05 23:57:53 +08:00
Your Name
2221fd3256
fix(ops): persist host resource guardrails
CD Pipeline / tests (push) Successful in 5m25s
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m31s
CD Pipeline / post-deploy-checks (push) Successful in 5m10s
2026-05-05 16:13:19 +08:00
Your Name
1cc9de5722
fix(ops): point runner guardrail alerts to host script
CD Pipeline / tests (push) Successful in 5m31s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m45s
CD Pipeline / post-deploy-checks (push) Successful in 5m4s
2026-05-05 15:25:37 +08:00
Your Name
d08d1e4951
fix(ops): alert on missing docker resource limits
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 15:01:31 +08:00
Your Name
72d66e4ae6
fix(ops): align stale job cleanup thresholds
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 36s
2026-05-05 14:54:17 +08:00
Your Name
5e625f777d
fix(ops): add stale gitea job cleanup guard
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
2026-05-05 14:50:47 +08:00
Your Name
7d45f0cb58
fix(ops): alert on stale gitea actions jobs
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
2026-05-05 14:42:09 +08:00
Your Name
fe618960a8
fix(ops): monitor systemd runners in host baseline
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
2026-05-05 14:08:43 +08:00
Your Name
e8e6748f70
fix(ops): add docker host resource baseline guardrails
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 13:45:09 +08:00
Your Name
b1ef05fa8c
feat(ollama): ADR-110 GCP three-tier disaster-recovery architecture (GCP-A → GCP-B → Local → Gemini)
...
Code Review / ai-code-review (push) Successful in 50s
CD Pipeline / tests (push) Failing after 1m14s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
## Change summary
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x model load speed + 2x inference speed)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro local HDD, last line of defense)
- Retire ADR-105 (the "111 only" iron rule) and create ADR-110
## Core changes
- config.py: add OLLAMA_SECONDARY_URL; validator adds a GCP IP allowlist (34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: three-tier Ollama decision matrix; parallel health checks of all three hosts; health_111 → health_gcp_a
- ollama_health_monitor.py: host label extraction made generic (supports GCP public IPs)
- failover_alerter.py: failed/recovered host is displayed dynamically; "Ollama 111 (GPU)" is no longer hardcoded
- ollama_auto_recovery.py: notify_recovery changed to ollama_gcp_a; recovered_host is dynamic
- k8s/awoooi-prod: configmap + deployment + network-policy updated in sync (egress adds GCP /32)
- Service layer: 10 service files switched from hardcoded 192.168.0.111 to reading settings.OLLAMA_URL
- Tests: URL constants updated; added three-tier failover scenarios and GCP IP allowlist validation tests
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:49:23 +08:00
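The tier order described in commit b1ef05fa8c can be illustrated with a minimal sketch. It is not the code in ollama_failover_manager.py: the function name `pick_tier`, the tier keys, and the shape of the health mapping are all assumptions made for illustration. Only the ordering GCP-A → GCP-B → Local → Gemini comes from the commit message.

```python
from typing import Mapping, Sequence

# Ollama tiers in priority order, as listed in the commit summary.
# The key names are hypothetical; the real manager may use different ones.
TIER_ORDER: Sequence[str] = ("gcp_a", "gcp_b", "local")


def pick_tier(health: Mapping[str, bool]) -> str:
    """Return the first healthy Ollama tier, or "gemini" when none is healthy.

    `health` maps a tier key to the result of its health check. A missing
    key is treated as unhealthy (an assumption of this sketch).
    """
    for tier in TIER_ORDER:
        if health.get(tier, False):
            return tier
    # All three Ollama hosts failed their checks: fall through to Gemini.
    return "gemini"
```

With this shape, adding a tier only means extending `TIER_ORDER`; the health checks themselves (which the commit says run in parallel) stay outside the decision function.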
Your Name
577250a678
fix(governance): fix W-3/W-4 guards against alert silencing + add Prometheus missing-data alert
...
Code Review / ai-code-review (push) Successful in 52s
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m6s
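[Commander's reprimand: violation of the feedback_full_chain_first_then_fix.md iron rule]
The previous commit f1362fcc used skip conditions that swallowed alerts, which is a fix by silencing:
- W-3: total_exec<10 always skips → no alert even if Redis stays empty forever
- W-4: playbooks total==0 always skips → no alert even if the table is wiped
- Prometheus NaN sentinel stacked on the existing < 0.1 rule leaves no path that can alert
The Commander's reprimand: "the alerts have been made to disappear again", "this has already been done several times". This commit restores alert visibility.
[Fix: 30-minute startup grace period + after it expires, raise a new data-pipeline-broken alert instead]
- ai_slo_watchdog_job.py adds a module-level _PROCESS_START and a _grace_active() guard:
- W-3a: metric has data + rate<0.30 → existing "flywheel success rate too low" alert
- W-3b: rate=None and uptime>30min → new "flywheel data pipeline has no traffic" alert
- W-4a: playbooks total>0 + approved=0 → existing "auto-repair chain broken" alert
- W-4b: playbooks total=0 and uptime>30min → new "Playbook table initialization failed" alert
- 3 Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml,
ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add
FlywheelExecutionRateMissing: absent() or NaN sustained for 30 minutes → alert,
a second safeguard alongside watchdog W-3b
[Added to memory]
feedback_silencing_alerts_recurring_violation.md locks in the red-line iron rule:
"Using skip in a fresh deploy / init guard to swallow alerts = structural dereliction; the grace period must be split out,
and after it expires a new data-pipeline-broken alert must be raised instead"
[Verification]
All 106 governance-related unit tests pass:
test_trust_drift_watchdog / test_governance_agent / test_failover_alerter /
test_check_trust_drift_commit_outside_context_poc /
test_governance_remediation_dispatch / test_ai_governance_endpoints /
test_governance_dispatcher
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>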
2026-05-03 12:39:46 +08:00
Your Name
b371edb70c
fix host alert auto-repair routing and backup false positives
2026-05-02 23:44:12 +08:00
Your Name
da772a1605
fix(decision): block kubectl actions on bare_metal host alerts
...
Code Review / ai-code-review (push) Successful in 54s
CD Pipeline / tests (push) Successful in 3m47s
CD Pipeline / build-and-deploy (push) Successful in 13m26s
CD Pipeline / post-deploy-checks (push) Successful in 5m45s
When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host
(192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating
CPU), the LLM kept proposing "kubectl rollout restart awoooi-api",
which is a wrong-domain action — restarting awoooi cannot fix a
third-party process's CPU usage on the host. Auto-execute would then
either run the no-op kubectl restart (wasted) or escalate after
ssh_diagnose because no safe action was found, producing the
"AI 自動修復失敗" ("AI auto-repair failed") Telegram noise the user just complained about.
Adds a guard at the top of DecisionManager._auto_execute: if the
incident's primary signal carries host_type=bare_metal AND the
proposed action starts with "kubectl", refuse to execute. The
incident is marked READY with a clear blocked_reason so human
operators see why automation declined, and emergency_escalation
records the event in AOL for audit.
Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new
ops/monitoring/alerts.yml in repo) to add an explicit
auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory
that hints LLM toward `ssh ... ps aux` rather than kubectl restart.
Prometheus reload returned 200.
Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py
covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked,
(3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT
blocked, (5) missing host_type label NOT blocked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 17:41:28 +08:00
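The guard described in commit da772a1605 reduces to a small predicate. The sketch below is an illustration, not the body of DecisionManager._auto_execute: the function name `blocked_reason`, the label mapping, and the reason string are assumptions. The two conditions (host_type=bare_metal and an action starting with "kubectl") and the five test cases come from the commit message.

```python
from typing import Mapping, Optional


def blocked_reason(labels: Mapping[str, str], action: str) -> Optional[str]:
    """Return a reason string when the action must not be auto-executed.

    labels: labels of the incident's primary signal (hypothetical shape).
    action: the command proposed by the LLM.
    """
    is_bare_metal = labels.get("host_type") == "bare_metal"
    if is_bare_metal and action.startswith("kubectl"):
        # Any kubectl verb is refused, including read-only ones such as
        # "kubectl get", because the alert is about a non-Kubernetes host.
        return "kubectl action refused: alert targets a bare_metal host"
    return None
```

A missing host_type label is deliberately not blocked, which matches case (5) in the commit's test list.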
Your Name
3156ff1c69
feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts
...
Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune
with ≥75% disk usage gate) and routes "docker prune" actions through it.
Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider
routing labels so the flywheel can self-heal next disk-full event without
hitting the emergency_channel Telegram path.
Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by
empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every
disk-full alert through "Host key is not trusted → escalate" loop.
known_hosts patched live; this commit closes the playbook gap so the
next occurrence resolves without manual intervention.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:31:37 +08:00
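The ≥75% disk usage gate mentioned in commit 3156ff1c69 could look something like the sketch below. The real ssh_docker_prune tool runs over SSH against a remote host; this local version only illustrates the gating logic. The function names and the use of `shutil.disk_usage` are assumptions.

```python
import shutil

PRUNE_THRESHOLD = 0.75  # the ">= 75% disk usage" gate from the commit message


def disk_usage_ratio(path: str = "/") -> float:
    """Fraction of the filesystem containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def should_prune(ratio: float, threshold: float = PRUNE_THRESHOLD) -> bool:
    """Allow the prune only when usage has reached the threshold."""
    return ratio >= threshold
```

Gating on measured usage keeps the prune from running when an alert is stale or was routed to the wrong host.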
Your Name
ca22ec2fd2
fix(aiops): route backup failures rule-first
CD Pipeline / tests (push) Successful in 1m51s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Successful in 8m21s
CD Pipeline / post-deploy-checks (push) Successful in 4m18s
2026-05-01 10:11:10 +08:00
Your Name
f0d14ab6c4
fix(aiops): escalate blocked auto repair
CD Pipeline / tests (push) Successful in 1m33s
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-30 23:49:17 +08:00
Your Name
ed205489c1
feat(p3.2-tests+ci-schema): model_version tests + CI test_schema alignment + Grafana SLO Dashboard
...
CD Pipeline / build-and-deploy (push) Failing after 1m20s
P3.2 supporting tests + CI environment sync + ADR-100 Grafana visualization:
CI test_schema completed (a follow-on to unblocking 1162-1172):
- setup_test_schema.sql adds the ai_provider_version_history table
- Aligned with production p3_2_provider_version_history.sql (already live via K8s exec)
New tests (636 lines):
- test_model_version_probe.py (387): provider probe unit tests
- test_model_version_tracker.py (249): tracker integration tests
· 4 DB-dependent tests marked @pytest.mark.integration
· 15 unit + 4 integration (the unit step skips the integration class)
New supporting files:
- ai-slo-dashboard.json (496 lines): Grafana dashboard
· 4 main panels matching the ADR-100 SLO rules:
autonomous repair success rate / flywheel closed-loop latency / governance events / provider health
Modified:
- governance_agent.py +122 lines: SLO metric exposure + retrieve metric integration
Tests: 15 passed (probe + tracker unit), 4 deselected (integration class)
Production deployment status:
- p2_decision_fusion_columns.sql ✅ K8s exec complete (commit c58bdd0c)
- p3_2_provider_version_history.sql ✅ K8s exec complete (this commit)
- Both production migrations are live; CI test_schema has been brought in sync
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:57:16 +08:00
Your Name
025a493f06
feat(p3.2+adr-100): Model Version Tracker + SLO self-governance + KB rot cleaner
...
run-migration / migrate (push) Failing after 12s
CD Pipeline / build-and-deploy (push) Has been cancelled
Wave 8 P3.2 model version tracking + ADR-100 SLO self-governance + supporting work:
P3.2 - Model Version Tracking:
- model_version_probe.py (268 lines): probes the model version of providers such as Ollama / OpenRouter
- model_version_tracker.py (101 lines): aligned with the PG provider_version_history table
- migrations/p3_2_provider_version_history.sql + rollback: 25-line schema
- db/models.py +32 lines: ProviderVersionHistory ORM
ADR-100 - AI autonomy SLO:
- docs/adr/ADR-100-ai-autonomous-slo.md (167 lines): flywheel SLO design and thresholds
- ops/monitoring/slo-rules.yml (254 lines): Prometheus SLO recording rules + alerts
- ops/monitoring/tests/test_slo_rules.yaml (242 lines): promtool unit tests
Integration changes:
- main.py +72 lines: lifespan starts model_version_probe + KB rot cleaner schedule
- gitea_webhook.py +45 lines: webhook receives model version change notifications
- ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py: wired up accordingly
New tests:
- test_kb_rot_cleaner_schedule.py (120 lines): 9 tests pass
- test_slo_rules.yaml: promtool acceptance
Tests: 9 passed (test_kb_rot_cleaner_schedule)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>
2026-04-27 14:54:19 +08:00
Your Name
fb130c9a28
feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization hardening + additional metrics fields
...
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Wave 8 P3.1-T2 follow-up tests + supporting work:
New tests:
- test_diagnosis_aggregator_stub.py (238 lines): 15 tests
· stub fixture verifies the _collect_diagnosis_aggregator wiring
· not called when the feature flag is off (the default)
· timeout boundaries / exceptions fail soft
Modified:
- core/metrics.py +23: adds DiagnosisAggregator-related Prometheus metrics
- sanitization_service.py +24: hardens prompt sanitization boundaries (supports vuln #4)
- RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor tweaks
Tests: 15 passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 08:30:26 +08:00
Your Name
fefe4c21cd
fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Continues the 595629c0 INC-20260425 fix, covering the three agent stages + full-chain observability:
A1 follow-up - Solver/Critic three-stage timeout wiring:
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override)
- protocol.py: all three agents share the observe_agent_step() call wrapper
· success/timeout/error outcome label
· histogram written to aiops_agent_step_duration_seconds
A2 follow-up - auto_repair_service switches to _diagnose_fallback_chain:
- auto_repair_service.py +46 lines: switches DIAGNOSE routing to the new chain (NEMO→GEMINI→CLAUDE)
- Completely avoids the second 238s Ollama CPU timeout
New metrics:
- core/metrics.py +59 lines: histogram buckets + label cardinality for observe_agent_step
New tests (862 lines):
- test_agent_step_timeouts.py (475): timeout boundaries for each of the three agents + outcome labels
- test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain
New supporting files:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
· histogram alert rules for the three agents (p99 > 80% of timeout → warning)
Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)
Total work for the two INC-20260425 fixes (595629c0 + this commit):
· 5 service/agent files modified
· 1 new observability module
· 4 new test/supporting files
· 1372+187 = 1559 lines added
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 follow-up) <noreply@anthropic.com>
2026-04-27 08:15:53 +08:00
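A wrapper in the spirit of observe_agent_step() from commit fefe4c21cd might be structured as below. This is a sketch under assumptions, not the code in protocol.py: the name `observe_step`, its signature, and the `record` callback (standing in for the histogram write to aiops_agent_step_duration_seconds) are all hypothetical. Only the three outcome labels and the idea of a shared timeout wrapper come from the commit message.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional, Tuple


async def observe_step(
    step: Callable[[], Awaitable[Any]],
    timeout_sec: float,
    record: Callable[[str, float], None],
) -> Tuple[str, Optional[Any]]:
    """Run one agent step with a timeout and report (outcome, result).

    record(outcome, seconds) is called exactly once per step; in a real
    service it would observe a Prometheus histogram with an outcome label.
    """
    start = time.perf_counter()
    outcome = "error"
    result: Optional[Any] = None
    try:
        result = await asyncio.wait_for(step(), timeout=timeout_sec)
        outcome = "success"
    except asyncio.TimeoutError:
        outcome = "timeout"
    except Exception:
        # Fail soft: the caller decides what to do with an "error" outcome.
        outcome = "error"
    finally:
        record(outcome, time.perf_counter() - start)
    return outcome, result
```

Keeping the outcome set to three fixed strings bounds the label cardinality of the histogram, which is the concern the commit mentions for core/metrics.py.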
Your Name
1ab6786ce3
feat(ops): Ollama failover Runbook + Grafana dashboard + Consensus K8s ConfigMap patch
...
run-migration / migrate (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Wave 6 P2.3 ops supporting work + tool-expert deployment docs:
Added:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines)
· verification steps for the three iron rules (automatic switch to Gemini / automatic switch back / quota circuit breaker)
· complete failover/recovery SOP
· troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines)
· 4 panels: current primary / failover events / quota usage / health status
· maps to P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
· ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
2026-04-27 08:11:40 +08:00
Your Name
2c57b71db9
feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration
...
CD Pipeline / build-and-deploy (push) Successful in 10m45s
MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (multiple engineers finished the code before the quota ran out; this is the catch-up commit):
P2.2 - GovernanceAgent 4 self-checks:
- governance_agent.py (342 lines): self-check loop every hour:
· trust_drift (trust drift detection)
· knowledge_degradation (knowledge degradation detection)
· llm_hallucination (LLM hallucination detection)
· execution_blast_radius (execution blast radius detection)
- main.py lifespan: started via asyncio.create_task(run_governance_loop())
wrapped in try/except so a scheduling failure does not block the main flow
- failover_alerter.py: alert_governance(event_type, payload) with 1h dedup
four event types → Telegram MarkdownV2 alerts
P2.3 - Ollama health rules + Prometheus metrics:
- ops/monitoring/ollama_health_rules.yaml (148 lines):
· OllamaHealthDegraded / OllamaPrimaryDown
· OllamaFailoverTriggered / GeminiQuotaExceeded
· adds alert rules on the data Prometheus scrapes
- core/metrics.py (57 lines):
· GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge
· OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter
· OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge
- ollama_failover_manager.py:
· _check_gemini_quota: updates the Gauge on every check (so Prometheus reads the latest value)
· select_provider: on failover, increments the Counter + flips the Primary Gauge
· wrapped in try/except so a metric failure does not block the main routing
E2E tests:
- test_failover_e2e_dispatch.py (365 lines)
full dispatch path: health check → failover decide → alerter → metrics
Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (previous session, Wave 5) <noreply@anthropic.com>
2026-04-26 20:56:19 +08:00
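The 1h dedup that commit 2c57b71db9 attributes to alert_governance(event_type, payload) can be approximated with a per-event-type timestamp map. The class below is a hypothetical sketch: the name `GovernanceDeduper`, the in-memory dict, and the injectable `now` argument are assumptions, and the real alerter may well keep this state elsewhere (for example in Redis).

```python
import time
from typing import Dict, Optional

DEDUP_WINDOW_SEC = 3600.0  # "1h dedup" from the commit message


class GovernanceDeduper:
    """Suppress repeat alerts of the same event type within a time window."""

    def __init__(self, window_sec: float = DEDUP_WINDOW_SEC) -> None:
        self._window = window_sec
        self._last_sent: Dict[str, float] = {}

    def should_send(self, event_type: str, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        last = self._last_sent.get(event_type)
        if last is not None and current - last < self._window:
            return False  # same event type was sent less than a window ago
        self._last_sent[event_type] = current
        return True
```

Because the key is the event type alone, a trust_drift alert does not suppress an llm_hallucination alert raised in the same hour.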
Your Name
7cd53c0228
fix(monitoring): switch memory alerts to working_set to stop page cache false alarms
...
- alerts-unified.yml:
- SentryClickHouseMemoryPressure: usage_bytes → working_set_bytes, 0.8 → 0.85
- GiteaMemoryPressure: same fix (same root cause: usage inflated by page cache)
- ops/monitoring/tests/clickhouse_memory_test.yml: promtool 4 cases
- 04-awoooi-devops-commander.md v2.8: Prometheus metric selection guidelines + Gitea HMAC webhook guidelines
- LOGBOOK: records the five T0 parallel tasks (A buttons / B ClickHouse / C Gitea webhook / D ElephantAlpha / F Code review)
Evidence: 2026-04-23 23:13 sentry-clickhouse usage_bytes=88.5% vs working_set=7.8%
Root cause: container_memory_usage_bytes includes OS page cache, which the OOM killer does not treat as pressure
Fix: switch to working_set_bytes (RSS + active cache), the metric K8s/cadvisor recognize, with threshold 0.85
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:16:12 +08:00
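The evidence quoted in commit 7cd53c0228 makes the effect of the metric change easy to check by hand. The sketch below only replays that arithmetic; the function name `memory_pressure` is hypothetical, and the actual alert is a Prometheus rule rather than Python.

```python
def memory_pressure(ratio: float, threshold: float) -> bool:
    """True when memory used, as a fraction of the limit, exceeds the threshold."""
    return ratio > threshold


# Figures from the commit message (sentry-clickhouse, 2026-04-23 23:13).
usage_ratio = 0.885        # container_memory_usage_bytes / limit, includes page cache
working_set_ratio = 0.078  # working_set_bytes / limit

old_rule_fires = memory_pressure(usage_ratio, 0.80)        # old metric, old threshold
new_rule_fires = memory_pressure(working_set_ratio, 0.85)  # new metric, new threshold
```

Under the old rule the container looks close to its limit; under the new one it is nowhere near, which is consistent with the commit's claim that the earlier alerts were driven by page cache.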
OG T
ba18ad2ef8
feat(hermes+rules): upgrade Hermes with LLM analysis + Commander decision to deprecate PostgreSQLDiskGrowthRate
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s
CD Pipeline / build-and-deploy (push) Successful in 8m37s
Commander's decisions of 2026-04-19:
- Rule 1 PostgreSQLDiskGrowthRate → option C: deprecate + replace with new rules
- Rule 2 NoAlertsReceived2Hours → keep (guards the real alert chain)
- Fix the noise_rate algorithm first (NO_ACTION does not count as fp), then adjust dynamically after observation
1. rule_stats_updater v2 noise algorithm:
Before: any EXPIRED approval counted as fp
Problem: NO_ACTION/OBSERVE/INVESTIGATE are pure AI observations and should not count as false positives
Fix: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ...
2. hermes_rule_quality v2 LLM upgrade:
Adds _llm_analyze_noisy_rule:
- Uses OpenClaw (Ollama/NemoTron/Gemini) to analyze each noisy rule
- JSON output: probable_root_causes/recommended_actions/confidence/should_deprecate
- 3-way parse fallback (direct / NemoTron wrapper / nested in description)
_write_advisory_aol adds llm_analysis to output_payload
_send_telegram_summary adds the AI verdict + top 2 recommendations (capped at 8 entries to keep the message short)
Complies with the Commander's iron rule: AI analyzes but takes no automatic action; decisions remain manual
3. ops/monitoring/alerts-unified.yml replaces Rule 1:
Removes PostgreSQLDiskGrowthRate (500MB/h growth → false positives triggered by normal WAL behavior)
Adds HostDiskUsageHigh (>80% for 10m, warning)
Adds HostDiskUsageCritical (>90% for 5m, critical)
Both carry labels.supersedes='PostgreSQLDiskGrowthRate' for traceability
(to be applied to Prometheus on the next deploy-alerts workflow run)
4. Mark deprecated in the DB immediately (so Hermes does not push it again before the alerts yaml is deployed):
UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate'
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:39:05 +08:00
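The noise-rate correction in commit ba18ad2ef8 can be sketched in Python, although the real change is a SQL WHERE clause. Several things here are assumptions: the function names, the (status, action) tuple shape, and especially the denominator, since the commit does not state how noise_rate is normalized. The WHERE clause in the commit is also truncated, so including INVESTIGATE relies on the problem statement rather than on the query text.

```python
from typing import Iterable, Tuple

# Actions that are pure AI observation and must not count as false positives.
OBSERVE_ONLY = ("NO_ACTION", "OBSERVE", "INVESTIGATE")


def is_false_positive(status: str, action: str) -> bool:
    """An EXPIRED approval counts as fp unless its action is observe-only."""
    if status != "EXPIRED":
        return False
    upper = action.upper()  # mimic the case-insensitive ILIKE '%...%' match
    return not any(marker in upper for marker in OBSERVE_ONLY)


def noise_rate(approvals: Iterable[Tuple[str, str]]) -> float:
    """False positives divided by all approvals (the denominator is an assumption)."""
    rows = list(approvals)
    if not rows:
        return 0.0
    fp = sum(1 for status, action in rows if is_false_positive(status, action))
    return fp / len(rows)
```

Under the old algorithm every EXPIRED row would count, so a rule whose approvals mostly expire with NO_ACTION would look far noisier than it is.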
OG T
eab3f527cd
feat(monitoring): Phase 7 blind-spot governance: L2 quotas + self-monitoring alerts (ADR-090)
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m21s
CD Pipeline / build-and-deploy (push) Failing after 9m24s
Situation: 110 load=17 sustained for 13 days + 188 cadvisor at 321% CPU, restarts ineffective
Commander's iron rule: do not just reduce it, solve it for the long term → structural governance, not patches
This commit covers:
1. k8s/monitoring/docker-compose-110.yml
- cadvisor gets mem_limit 512M + cpus 1.0 (L2 blast-containment net)
- Notes that the live config on 110 has drifted from this file (to be brought under CD next session)
2. ops/monitoring/alerts-unified.yml adds the infra_self_monitoring group:
- CadvisorDown / MemoryPressure / CPUThrottled
- NodeExporterDown / CPUThrottled
- SentryClickHouseMemoryPressure / CPUThrottled
- GiteaMemoryPressure / CPUThrottled
- PrometheusDown (meta layer: monitoring the monitoring)
→ all evaluated dynamically as (memory usage / spec_memory_limit),
with no hardcoded 80% or MB figures, so thresholds follow automatically when a quota changes
Other supporting changes (outside this repo, already patched over SSH onto 110/188):
- /home/ollama/wooo-aiops/docker-compose.yml: 188 cadvisor adds --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c
- /home/wooo/monitoring/docker-compose.yml: 110 cadvisor + node-exporter brought under management + cardinality-reducing flags + quotas
- /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.)
- /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c
- /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c
Maps to:
- feedback_monitor_self_monitoring.md 🔴 🔴 🔴 monitoring tools must themselves be monitored
- feedback_ai_autonomous_direction.md dynamic thresholds ≠ hardcoded rules
- ADR-090 Layer 2 resource quota enforcement
Acceptance (48h):
- 188 cadvisor CPU from 321% → <50% (quota enforced)
- 110 load5 from 18 → <10 (after Sentry/Gitea pressure is relieved)
- No false positives from the self-monitoring alerts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:50:41 +08:00
OG T
946fe1fa7c
fix(monitoring): merge duplicate flywheel alert group + add notification_type: TYPE-8M
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 44s
awoooi_flywheel_health (duplicate) merged into awoooi_flywheel_meta_alerts:
- All 5 rules get notification_type: TYPE-8M
- Adds FlywheelAlertnameNullHigh (previously only in the old group)
- Removes the duplicate group, eliminating same-name alert conflicts in Prometheus
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:43:02 +08:00
OG T
bd75aca727
feat(adr-075): add the 2 missing Prometheus alert rules
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 49s
- MomoScraperSuccessLow: business scraper success rate <90% (business group)
- CoreDNSResolutionFailed: CoreDNS SERVFAIL rate >5% (kubernetes group)
ADR-075 Phase 3 complete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 21:59:18 +08:00
OG T
edb97fd29b
fix(monitoring): restore 4 Prometheus rule groups that existed only on the host
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 41s
On deploy, deploy-alerts.sh overwrote these 4 groups, which had never been committed to the repo:
- awoooi_flywheel_health (5 rules: Playbook/Success/Vectorization/NullRate/Stuck)
- awoooi_backup_restore (2 rules: RestoreTestFailed/TestStale)
- awoooi_infrastructure_detailed (3 rules: Container/RedisStream/DiskGrowth)
- awoooi_host_connectivity (1 rule: NetworkPartition)
Restored from /home/wooo/monitoring/alerts.yml.bak_20260412_183835.
The PromQL offset modifier has been corrected to apply to each selector rather than to the whole expression.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:14:39 +08:00
OG T
f52dc459e6
feat(adr075): Step 4: add 4 Prometheus rule groups secops/business/flywheel_meta
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 41s
New rule groups:
- awoooi_secops_alerts: UnauthorizedSSHLogin (>10 failures in 5min)
- awoooi_business_alerts: AITokenCostSpike + GeminiAPIErrorRateHigh
- awoooi_flywheel_meta_alerts:
FlywheelPlaybookZero / FlywheelExecutionSuccessLow
FlywheelKMVectorizationLow / FlywheelIncidentsStuck
The flywheel meta rules depend on ADR-074 Exporter metrics
The secops/business rules depend on node_exporter/awoooi custom metrics
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:51:23 +08:00
OG T
43edff184d
feat(dr): Sprint C: host rsync backup + DR SOP docs
...
C-1 Velero: confirmed running (daily-awoooi-prod schedule, 13d, MinIO Available)
C-2 Host rsync backup:
scripts/ops/backup-from-110.sh: 188 backs up 110 via rsync daily at 1:00 AM
- Harbor registry data (highest priority)
- Gitea repos
- bitan-pharmacy.git (if present)
- On success, writes /var/run/backup-110.last_success for Prometheus monitoring
- On failure, sends a Telegram alert
ops/monitoring/alerts-unified.yml: adds the HostBackupFailed alert rule
C-3 DR SOP docs:
docs/runbooks/disaster-recovery/DR-K8s-awoooi.md (<15 minutes)
docs/runbooks/disaster-recovery/DR-Nginx.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Harbor.md (<30 minutes)
docs/runbooks/disaster-recovery/DR-Bitan.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Stock.md (<5 minutes)
Backup script deployment instructions (must be run manually):
scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh
ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}"
ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 03:04:18 +08:00
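Commit 43edff184d has the backup script write /var/run/backup-110.last_success on success so that monitoring can detect a stale backup. How Prometheus actually reads that file is not stated in the commit; the sketch below only shows the freshness check itself, assuming the file's modification time marks the last success. The function names and the `max_age_sec` parameter are hypothetical.

```python
import os
from typing import Optional


def read_last_success(path: str) -> Optional[float]:
    """Return the marker file's mtime as a Unix timestamp, or None if absent."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def backup_is_stale(last_success: Optional[float], now: float, max_age_sec: float) -> bool:
    """A backup is stale if it never succeeded or last succeeded too long ago."""
    if last_success is None:
        return True
    return (now - last_success) > max_age_sec
```

Treating a missing marker as stale matters here: a backup that has never run should alert just like one that stopped running.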
OG T
6351e9a0e9
feat(mcp-phase2): MCP Phase 2 — Prometheus MCP + SSH MCP + alert labels
...
CD Pipeline / build-and-deploy (push) Successful in 13m37s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 35s
MCP-2b: prometheus_provider.py
- prometheus_query (PromQL instant query)
- prometheus_query_range (historical trends, default 15 minutes)
- prometheus_get_alert_history (alert firing history)
- config: PROMETHEUS_URL + PROMETHEUS_MCP_ENABLED
MCP-2a: ssh_provider.py
- Group A: 9 read-only diagnostic tools (top/disk/memory/logs/status/port/nginx/swap)
- Group B: 6 safe operation tools (restart/compose/systemctl/clear-log/ssl/nginx-reload)
- Four-layer security guard (allowlist/allowed_hosts/forbidden_patterns/trust_score)
- config: SSH_MCP_ENABLED + SSH_MCP_ALLOWED_HOSTS
K8s: 04-ssh-mcp-secret.example.yaml (ssh-mcp-key Secret template + creation steps)
Alert labels: alerts-unified.yml adds mcp_provider/host_type/alert_category
Covers: HostHighCpuLoad/HostOutOfMemory/HostOutOfDiskSpace/DockerContainer*
SignOzDown/SentryDown/HarborDown/GiteaDown
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 02:35:35 +08:00
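The four-layer guard named in commit 6351e9a0e9 (allowlist, allowed_hosts, forbidden_patterns, trust_score) could be laid out as a chain of early returns. Everything concrete below is an assumption for illustration: the function name `guard_ssh_call`, the parameter shapes, the substring matching for forbidden patterns, and the reason strings. Only the four layer names come from the commit message.

```python
from typing import Collection, Optional


def guard_ssh_call(
    tool: str,
    host: str,
    command: str,
    trust_score: float,
    *,
    allowed_tools: Collection[str],
    allowed_hosts: Collection[str],
    forbidden_patterns: Collection[str],
    min_trust: float,
) -> Optional[str]:
    """Return None when the call may proceed, else the reason it was refused."""
    if tool not in allowed_tools:              # layer 1: tool allowlist
        return "tool is not in the allowlist"
    if host not in allowed_hosts:              # layer 2: allowed hosts
        return "host is not allowed"
    if any(p in command for p in forbidden_patterns):  # layer 3: forbidden patterns
        return "command matches a forbidden pattern"
    if trust_score < min_trust:                # layer 4: trust score
        return "trust score is too low"
    return None
```

Returning the first failing layer as a reason string gives operators a direct explanation of why automation declined, in the same spirit as the blocked_reason recorded elsewhere in this history.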
OG T
e1dfbedf0e
fix(alerts): HostHighCpuLoad auto_repair: false → true
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
Root cause of the flywheel being stuck in GUARDRAIL_BLOCKED:
the Prometheus rule label auto_repair=false forces HITL
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:33:23 +08:00
OG T
ab3e266a23
fix(monitoring): Phase O-6.2 add the 9 missing K8s deployments to service-registry
...
Added:
- 5 argocd components (applicationset/dex/notifications/redis/repo-server)
- awoooi-dev/awoooi-api
- kube-state-metrics
- observability/event-exporter
- velero/velero
Result: prometheus coverage 94%→96%, errors 9→0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 10:44:36 +08:00
OG T
85d4857d1b
fix(monitoring): RedisMemoryHigh false positive: fix division by zero when max_bytes=0
...
CD Pipeline / build-and-deploy (push) Has started running
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
- Adds a redis_memory_max_bytes > 0 precondition
- Prevents division by zero from producing +Inf and firing forever when Redis has no maxmemory set
- Affects: alerts-unified.yml + database-alerts.yaml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:41:10 +08:00
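The precondition added in commit 85d4857d1b is a guard on the denominator. The actual fix is in PromQL; the Python below only mirrors the logic. The function names and the `threshold` parameter are hypothetical, since the commit does not state the alert's threshold.

```python
from typing import Optional


def redis_memory_ratio(used_bytes: float, max_bytes: float) -> Optional[float]:
    """Used memory as a fraction of maxmemory, or None when no limit is set.

    Redis reports max_bytes = 0 when maxmemory is unset. In PromQL, dividing
    by that zero yields +Inf, which exceeds any threshold and fires forever.
    """
    if max_bytes <= 0:
        return None  # no limit configured: the ratio is undefined, so do not alert
    return used_bytes / max_bytes


def redis_memory_high(used_bytes: float, max_bytes: float, threshold: float) -> bool:
    ratio = redis_memory_ratio(used_bytes, max_bytes)
    return ratio is not None and ratio > threshold
```

Returning None rather than 0.0 keeps "no limit set" distinguishable from "almost no memory used", should a caller want to report the two cases differently.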
OG T
9799a14f54
feat(monitoring): Plan C external website alerts: 4 websites + SSL certificate expiry warning
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 34s
Adds the external_website_alerts group:
- MoWoooWorkDown (mo.wooo.work, 188, momo-app)
- TsenyangWebsiteDown (tsenyang.com, 188, tsenyang-website)
- StockWoooWorkDown (stock.wooo.work, 110, stock-platform)
- BitanWoooWorkDown (bitan.wooo.work, 188, bitan-app)
- ExternalSiteSSLExpiringSoon (14-day warning, auto_repair:false)
blackbox-http already covers all targets; these are the structured alert rules.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 08:53:08 +08:00
OG T
3c6807d79c
ops(monitoring): trigger deploy-alerts: redeploy the 6 database_detail_alerts rules
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
d9e0fab added 6 detailed DB alert rules, but deploy-alerts failed because pyyaml was not installed
0f86c5c fixed the workflow; this commit triggers a redeploy
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:17:26 +08:00
OG T
d9e0fab3fe
feat(monitoring): Sprint 5.2 Plan B: detailed database alert rules
...
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Failing after 17s
Adds the database_detail_alerts rule group:
PostgreSQL:
- PostgreSQLSlowQueries: slow queries >60s
- PostgreSQLDeadlocks: deadlocks occurred
- PostgreSQLTooManyConnections: connections >50
Redis:
- RedisKeyEviction: key evictions
- RedisConnectionsHigh: connections >100
- RedisCommandLatencyHigh: command latency >10ms
Prerequisites: postgres-exporter:9187 + redis-exporter:9121 already deployed on 188 ✅
Prometheus scrape config updated ✅
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:19:03 +08:00