awoooi

Author	SHA1	Message	Date
Your Name	2ec7f6f440	fix(ops): harden heartbeat and momo alert noise	2026-06-24 19:38:33 +08:00
Your Name	95f442adab	fix(ops): harden 188 backup exporter recovery [skip ci]	2026-06-24 06:37:44 +08:00
Your Name	93ac6030cf	fix(ops): sync source provider freshness alert rules	2026-06-18 14:23:13 +08:00
Your Name	ff18872a23	feat(ops): add host runaway process aiops guard	2026-06-18 14:17:03 +08:00
Your Name	6efd186750	docs(security): establish high-value configuration control inventory [skip ci]	2026-06-11 11:29:58 +08:00
Your Name	ae7b39d96a	fix(ops): harden reboot recovery and backup alerts	2026-05-29 12:41:34 +08:00
Your Name	ae9d0b7385	feat(monitoring): alert on stale source provider ingestion	2026-05-20 19:19:21 +08:00
Your Name	598f33ae8b	fix(monitoring): clarify alert chain smoke evidence	2026-05-20 13:11:44 +08:00
Your Name	21dcfbd991	fix(governance): collapse km slo fallback series	2026-05-14 19:37:15 +08:00
Your Name	d2a4a17969	fix(governance): stabilize adr100 km growth slo	2026-05-14 19:33:52 +08:00
Your Name	4111ea4f9f	fix(ai): remove 188 ollama provider	2026-05-06 14:34:48 +08:00
Your Name	587551c1f1	fix(ops): monitor full-stack cold-start gates	2026-05-06 00:48:05 +08:00
Your Name	23932773ef	fix(monitoring): route docker baseline alerts to ssh	2026-05-06 00:00:12 +08:00
Your Name	2f50c67f5c	fix(monitoring): keep host alert ssh diagnostics canonical	2026-05-05 23:57:53 +08:00
Your Name	2221fd3256	fix(ops): persist host resource guardrails	2026-05-05 16:13:19 +08:00
Your Name	1cc9de5722	fix(ops): point runner guardrail alerts to host script	2026-05-05 15:25:37 +08:00
Your Name	d08d1e4951	fix(ops): alert on missing docker resource limits	2026-05-05 15:01:31 +08:00
Your Name	72d66e4ae6	fix(ops): align stale job cleanup thresholds	2026-05-05 14:54:17 +08:00
Your Name	5e625f777d	fix(ops): add stale gitea job cleanup guard	2026-05-05 14:50:47 +08:00
Your Name	7d45f0cb58	fix(ops): alert on stale gitea actions jobs	2026-05-05 14:42:09 +08:00
Your Name	fe618960a8	fix(ops): monitor systemd runners in host baseline	2026-05-05 14:08:43 +08:00
Your Name	e8e6748f70	fix(ops): add docker host resource baseline guardrails	2026-05-05 13:45:09 +08:00
Your Name	b1ef05fa8c	feat(ollama): ADR-110 GCP three-tier disaster-recovery architecture (GCP-A → GCP-B → Local → Gemini) ## Change summary - Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x load speed + 2x inference) - Secondary: http://34.21.145.224:11434 (GCP-B SSD) - Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD, last line of defense) - Retire ADR-105 ("111 only" iron rule), create ADR-110 ## Core changes - config.py: add OLLAMA_SECONDARY_URL; validator adds a GCP IP allowlist (34.143.170.20, 34.21.145.224) - ollama_failover_manager.py: three-tier Ollama decision matrix; parallel health checks on all three hosts; health_111 → health_gcp_a - ollama_health_monitor.py: host label extraction generalized (supports GCP public IPs) - failover_alerter.py: failed/recovered host displayed dynamically, no longer hard-coded as "Ollama 111 (GPU)" - ollama_auto_recovery.py: notify_recovery changed to ollama_gcp_a; recovered_host is dynamic - k8s/awoooi-prod: configmap + deployment + network-policy updated in sync (egress adds GCP /32) - Service layer: 10 service files with hard-coded 192.168.0.111 now read settings.OLLAMA_URL - Tests: URL constants updated, three-tier failover scenarios added, GCP IP allowlist validation tests added Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 22:49:23 +08:00
Your Name	577250a678	fix(governance): fix alert-silencing W-3/W-4 guards + add Prometheus missing-data alert [Commander's reprimand: violated the feedback_full_chain_first_then_fix.md iron rule] The previous commit `f1362fcc` swallowed alerts with skip conditions, which is a silencing fix: - W-3: total_exec<10 always skips → no alert even if Redis stays empty forever - W-4: playbooks total==0 always skips → no alert even if the table is wiped - Prometheus NaN sentinel stacked on the existing < 0.1 rule left no path that would alert. The commander's reprimand: "you made the alerts disappear again", "how many times has this been done already". This commit restores alert visibility. [Fix: 30-minute startup grace period; once it expires, fire a new data-pipeline-broken alert] - ai_slo_watchdog_job.py adds a module-level _PROCESS_START and a _grace_active() guard: - W-3a: metric has data + rate<0.30 → existing "flywheel success rate too low" alert - W-3b: rate=None and uptime>30min → new alert "flywheel data pipeline has no traffic" - W-4a: playbooks total>0 + approved=0 → existing "auto-repair chain broken" alert - W-4b: playbooks total=0 and uptime>30min → new alert "Playbook table initialization failed" - 3 Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml, ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add FlywheelExecutionRateMissing: absent() or NaN for 30 minutes → alert, a double safeguard alongside watchdog W-3b [Added to memory] feedback_silencing_alerts_recurring_violation.md locks in a red-line iron rule: "swallowing alerts with skip in a fresh deploy / init guard = structural dereliction; must branch on a grace period and, once it expires, fire a new data-pipeline-broken alert" [Verification] All 106 governance-related unit tests pass: test_trust_drift_watchdog / test_governance_agent / test_failover_alerter / test_check_trust_drift_commit_outside_context_poc / test_governance_remediation_dispatch / test_ai_governance_endpoints / test_governance_dispatcher 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:39:46 +08:00
Your Name	b371edb70c	fix host alert auto-repair routing and backup false positives	2026-05-02 23:44:12 +08:00
Your Name	da772a1605	fix(decision): block kubectl actions on bare_metal host alerts When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host (192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating CPU), the LLM kept proposing "kubectl rollout restart awoooi-api", which is a wrong-domain action: restarting awoooi cannot fix a third-party process's CPU usage on the host. Auto-execute would then either run the no-op kubectl restart (wasted) or escalate after ssh_diagnose because no safe action was found, producing the "AI 自動修復失敗" ("AI auto-repair failed") Telegram noise the user just complained about. Adds a guard at the top of DecisionManager._auto_execute: if the incident's primary signal carries host_type=bare_metal AND the proposed action starts with "kubectl", refuse to execute. The incident is marked READY with a clear blocked_reason so human operators see why automation declined, and emergency_escalation records the event in AOL for audit. Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new ops/monitoring/alerts.yml in repo) to add an explicit auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory that hints LLM toward `ssh ... ps aux` rather than kubectl restart. Prometheus reload returned 200. Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked, (3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT blocked, (5) missing host_type label NOT blocked. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 17:41:28 +08:00
Your Name	3156ff1c69	feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune with ≥75% disk usage gate) and routes "docker prune" actions through it. Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider routing labels so the flywheel can self-heal next disk-full event without hitting the emergency_channel Telegram path. Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every disk-full alert through "Host key is not trusted → escalate" loop. known_hosts patched live; this commit closes the playbook gap so the next occurrence resolves without manual intervention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	ca22ec2fd2	fix(aiops): route backup failures rule-first	2026-05-01 10:11:10 +08:00
Your Name	f0d14ab6c4	fix(aiops): escalate blocked auto repair	2026-04-30 23:49:17 +08:00
Your Name	ed205489c1	feat(p3.2-tests+ci-schema): model_version tests + CI test_schema alignment + Grafana SLO Dashboard P3.2 supporting tests + CI environment sync + ADR-100 Grafana visualization: CI test_schema completed (follow-up to unblocking 1162-1172): - setup_test_schema.sql adds the ai_provider_version_history table - aligned with production p3_2_provider_version_history.sql (already live via K8s exec) New tests (636 lines): - test_model_version_probe.py (387): provider probe unit tests - test_model_version_tracker.py (249): tracker integration tests · 4 DB-dependent tests marked @pytest.mark.integration · 15 unit + 4 integration (the unit step skips the integration class) New supporting files: - ai-slo-dashboard.json (496 lines): Grafana dashboard · 4 main panels matching the ADR-100 SLO rules: autonomous repair success rate / flywheel closed-loop latency / governance events / provider health Modified: - governance_agent.py +122 lines: SLO metric exposure + retrieve metric integration Tests: 15 passed (probe + tracker unit), 4 deselected (integration class) Production deployment status: - p2_decision_fusion_columns.sql ✅ K8s exec done (commit c58bdd0c) - p3_2_provider_version_history.sql ✅ K8s exec done (this commit) - both production migrations are live; CI test_schema brought in sync Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 14:57:16 +08:00
Your Name	025a493f06	feat(p3.2+adr-100): Model Version Tracker + SLO self-governance + KB rot cleaner Wave 8 P3.2 model version tracking + ADR-100 SLO self-governance + supporting changes: P3.2, Model Version Tracking: - model_version_probe.py (268 lines): probes the model version of providers such as Ollama / OpenRouter - model_version_tracker.py (101 lines): aligned with the PG provider_version_history table - migrations/p3_2_provider_version_history.sql + rollback: 25-line schema - db/models.py +32 lines: ProviderVersionHistory ORM ADR-100, AI autonomy SLO: - docs/adr/ADR-100-ai-autonomous-slo.md (167 lines): flywheel SLO design and thresholds - ops/monitoring/slo-rules.yml (254 lines): Prometheus SLO recording rules + alerts - ops/monitoring/tests/test_slo_rules.yaml (242 lines): promtool unit tests Integration changes: - main.py +72 lines: lifespan starts model_version_probe + KB rot cleaner schedule - gitea_webhook.py +45 lines: webhook receives model version change notifications - ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py: wiring adjustments New tests: - test_kb_rot_cleaner_schedule.py (120 lines): 9 tests pass - test_slo_rules.yaml: promtool acceptance Tests: 9 passed (test_kb_rot_cleaner_schedule) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>	2026-04-27 14:54:19 +08:00
Your Name	fb130c9a28	feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization hardening + added metrics fields Wave 8 P3.1-T2 follow-up tests + supporting changes: New tests: - test_diagnosis_aggregator_stub.py (238 lines): 15 tests · stub fixture verifies the _collect_diagnosis_aggregator wiring · feature flag default off does not call it · timeout boundary / exception fail-soft Modified: - core/metrics.py +23: adds DiagnosisAggregator-related Prometheus metrics - sanitization_service.py +24: hardens prompt sanitize boundaries (companion to vuln #4) - RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor tweaks Tests: 15 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 08:30:26 +08:00
Your Name	fefe4c21cd	fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana Continues the `595629c0` INC-20260425 fix, adding three-stage Agent timeouts + full-chain observability: A1 follow-up, Solver/Critic three-stage timeout wiring: - solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override) - critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override) - protocol.py: all three Agents share the observe_agent_step() call wrapper · success/timeout/error outcome label · histogram written to aiops_agent_step_duration_seconds A2 follow-up, auto_repair_service switches to _diagnose_fallback_chain: - auto_repair_service.py +46 lines: switches DIAGNOSE routing to the new chain (NEMO→GEMINI→CLAUDE) - fully avoids the Ollama CPU 238s second timeout New metrics: - core/metrics.py +59 lines: histogram buckets + label cardinality for observe_agent_step New tests (862 lines): - test_agent_step_timeouts.py (475): timeout boundaries + outcome label for each of the three Agents - test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain New supporting files: - docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide - ops/monitoring/grafana/agent_step_latency_rules.yaml (160) · histogram alert rules for the three Agents (p99 > 80% of timeout → warning) Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11) Total work for the two INC-20260425 fixes (595629c0 + this commit): · 5 service/agent files modified · 1 new observability module · 4 new test/supporting files · 1372+187 = 1559 lines added Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 follow-up) <noreply@anthropic.com>	2026-04-27 08:15:53 +08:00
Your Name	1ab6786ce3	feat(ops): Ollama failover Runbook + Grafana dashboard + Consensus K8s ConfigMap patch Wave 6 P2.3 ops supporting files + tool-expert deployment docs: Added: - docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines) · verification steps for the three iron rules (auto switch to Gemini / auto switch back / quota circuit breaker) · full failover/recovery SOP · troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.) - ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines) · 4 panels: current primary / failover events / quota usage / health status · maps to P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT - k8s/awoooi-prod/04-configmap.yaml.patch-consensus · ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>	2026-04-27 08:11:40 +08:00
Your Name	2c57b71db9	feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (multiple engineers finished the code before hitting the quota; this is the catch-up commit): P2.2, GovernanceAgent 4 self-checks: - governance_agent.py (342 lines): self-check loop every 1 hour: · trust_drift (trust drift detection) · knowledge_degradation (knowledge degradation detection) · llm_hallucination (LLM hallucination detection) · execution_blast_radius (execution blast radius detection) - main.py lifespan: asyncio.create_task(run_governance_loop()) startup wrapped in try/except, so a schedule failure does not block the main flow - failover_alerter.py: alert_governance(event_type, payload) with 1h dedup, four event types → Telegram MarkdownV2 alerts P2.3, Ollama health rules + Prometheus metrics: - ops/monitoring/ollama_health_rules.yaml (148 lines): · OllamaHealthDegraded / OllamaPrimaryDown · OllamaFailoverTriggered / GeminiQuotaExceeded · adds the alert rules for the data Prometheus scrapes - core/metrics.py (57 lines): · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge - ollama_failover_manager.py: · _check_gemini_quota: updates the Gauge on every check (so Prometheus gets the latest value) · select_provider: on failover, inc Counter + flip the Primary Gauge · wrapped in try/except, so a metric failure does not block main routing E2E tests: - test_failover_e2e_dispatch.py (365 lines) full dispatch path: health check → failover decide → alerter → metrics Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (previous session, Wave 5) <noreply@anthropic.com>	2026-04-26 20:56:19 +08:00
Your Name	7cd53c0228	fix(monitoring): switch memory alerts to working_set to stop page cache false alarms - alerts-unified.yml: - SentryClickHouseMemoryPressure: usage_bytes → working_set_bytes, 0.8 → 0.85 - GiteaMemoryPressure: same fix (same root cause: page cache inflation) - ops/monitoring/tests/clickhouse_memory_test.yml: promtool 4 cases - 04-awoooi-devops-commander.md v2.8: Prometheus metric selection guideline + Gitea HMAC Webhook guideline - LOGBOOK: records the five parallel T0 tasks (A button / B ClickHouse / C Gitea webhook / D ElephantAlpha / F Code review) Evidence: 2026-04-23 23:13 sentry-clickhouse usage_bytes=88.5% vs working_set=7.8% Root cause: container_memory_usage_bytes includes OS page cache, which the OOM killer does not treat as pressure Fix: switch to working_set_bytes, the metric K8s/cadvisor recognizes (RSS + active cache), threshold 0.85 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:16:12 +08:00
OG T	ba18ad2ef8	feat(hermes+rules): upgrade Hermes with LLM + commander decision to deprecate PostgreSQLDiskGrowthRate Commander's decisions, 2026-04-19: - Rule 1 PostgreSQLDiskGrowthRate → option C: deprecate + replace with new rules - Rule 2 NoAlertsReceived2Hours → keep (guards the real alert chain) - fix the noise_rate algorithm first (NO_ACTION does not count as fp), then adjust dynamically after observation 1. rule_stats_updater v2 noise algorithm: Before: any EXPIRED approval counted as fp Problem: NO_ACTION/OBSERVE/INVESTIGATE are pure AI observations and should not count as false positives Fix: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ... 2. hermes_rule_quality v2 LLM upgrade: adds _llm_analyze_noisy_rule: - uses OpenClaw (Ollama/NemoTron/Gemini) to analyze each noisy rule - JSON output: probable_root_causes/recommended_actions/confidence/should_deprecate - 3-way parse fallback (direct / NemoTron wrapper / description nested) _write_advisory_aol adds llm_analysis to output_payload _send_telegram_summary adds the AI verdict + top 2 recommendations (capped at 8 entries to avoid excessive length) Complies with the commander's iron rule: AI analyzes but takes no automatic action; decisions remain manual 3. ops/monitoring/alerts-unified.yml replaces Rule 1: remove PostgreSQLDiskGrowthRate (500MB/h growth → false positives triggered by normal WAL behavior) add HostDiskUsageHigh (>80% for 10m, warning) add HostDiskUsageCritical (>90% for 5m, critical) both carry labels.supersedes='PostgreSQLDiskGrowthRate' for traceability (pending the next deploy-alerts workflow apply to Prometheus) 4. Mark deprecated in the DB immediately (so Hermes does not push it again before the alerts yaml is deployed): UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:39:05 +08:00
OG T	eab3f527cd	feat(monitoring): Phase 7 blind-spot governance: L2 quotas + self-monitoring alerts (ADR-090) Situation: 110 at load=17 for 13 days + 188 cadvisor at 321% CPU, restarts ineffective. Commander's iron rule: do not just reduce it, solve it long term → structural governance rather than patches. This commit covers: 1. k8s/monitoring/docker-compose-110.yml - cadvisor adds mem_limit 512M + cpus 1.0 (L2 blast guard) - notes the drift between 110 live and this file (to be brought under CD next session) 2. ops/monitoring/alerts-unified.yml adds the infra_self_monitoring group: - CadvisorDown / MemoryPressure / CPUThrottled - NodeExporterDown / CPUThrottled - SentryClickHouseMemoryPressure / CPUThrottled - GiteaMemoryPressure / CPUThrottled - PrometheusDown (meta layer: monitoring the monitoring) → all use (memory usage / spec_memory_limit) for dynamic evaluation, with no hard-coded 80% or MB values, so thresholds follow automatically when quotas change Other supporting changes (outside this repo, already patched over SSH on 110/188): - /home/ollama/wooo-aiops/docker-compose.yml: 188 cadvisor adds --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c - /home/wooo/monitoring/docker-compose.yml: 110 cadvisor + node-exporter brought under management + dimension-reduction flags + quotas - /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.) - /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c - /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c Maps to: - feedback_monitor_self_monitoring.md 🔴🔴🔴 monitoring tools must themselves be monitored - feedback_ai_autonomous_direction.md dynamic thresholds ≠ hard-coded rules - ADR-090 Layer 2 resource quota enforcement Acceptance (48h): - 188 cadvisor CPU from 321% → <50% (quota enforced) - 110 load5 from 18 → <10 (after relieving Sentry/Gitea pressure) - no false positives from the self-monitoring alerts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:50:41 +08:00
OG T	946fe1fa7c	fix(monitoring): merge duplicate flywheel alert group + add notification_type: TYPE-8M awoooi_flywheel_health (duplicate) merged into awoooi_flywheel_meta_alerts: - all 5 rules get notification_type: TYPE-8M - add FlywheelAlertnameNullHigh (previously only in the old group) - delete the duplicate group, removing the Prometheus same-name alert conflict Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:43:02 +08:00
OG T	bd75aca727	feat(adr-075): add the 2 missing Prometheus alert rules - MomoScraperSuccessLow: business scraper success rate <90% (business group) - CoreDNSResolutionFailed: CoreDNS SERVFAIL rate >5% (kubernetes group) ADR-075 Phase 3 complete Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 21:59:18 +08:00
OG T	edb97fd29b	fix(monitoring): restore 4 Prometheus rule groups that existed only on the host During deployment, deploy-alerts.sh overwrote these 4 groups, which had never been committed to the repo: - awoooi_flywheel_health (5 rules: Playbook/Success/Vectorization/NullRate/Stuck) - awoooi_backup_restore (2 rules: RestoreTestFailed/TestStale) - awoooi_infrastructure_detailed (3 rules: Container/RedisStream/DiskGrowth) - awoooi_host_connectivity (1 rule: NetworkPartition) Restored from /home/wooo/monitoring/alerts.yml.bak_20260412_183835. The PromQL offset has been corrected to apply to each selector rather than to the whole expression. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:14:39 +08:00
OG T	f52dc459e6	feat(adr075): Step4 add 4 Prometheus rule groups secops/business/flywheel_meta New rule groups: - awoooi_secops_alerts: UnauthorizedSSHLogin (>10 failures in 5min) - awoooi_business_alerts: AITokenCostSpike + GeminiAPIErrorRateHigh - awoooi_flywheel_meta_alerts: FlywheelPlaybookZero / FlywheelExecutionSuccessLow / FlywheelKMVectorizationLow / FlywheelIncidentsStuck The flywheel meta rules depend on ADR-074 Exporter metrics; the secops/business rules depend on node_exporter/awoooi custom metrics Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 18:51:23 +08:00
OG T	43edff184d	feat(dr): Sprint C: host rsync backup + DR SOP docs C-1 Velero: confirmed running (daily-awoooi-prod schedule, 13d, MinIO Available) C-2 Host rsync backup: scripts/ops/backup-from-110.sh: 188 backs up 110 via rsync daily at 1:00 AM - Harbor registry data (highest priority) - Gitea repos - bitan-pharmacy.git (if present) - on success writes /var/run/backup-110.last_success for Prometheus monitoring - sends a Telegram alert on failure ops/monitoring/alerts-unified.yml: adds the HostBackupFailed alert rule C-3 DR SOP docs: docs/runbooks/disaster-recovery/DR-K8s-awoooi.md (<15 minutes) docs/runbooks/disaster-recovery/DR-Nginx.md (<5 minutes) docs/runbooks/disaster-recovery/DR-Harbor.md (<30 minutes) docs/runbooks/disaster-recovery/DR-Bitan.md (<5 minutes) docs/runbooks/disaster-recovery/DR-Stock.md (<5 minutes) Backup script deployment instructions (must be run manually): scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}" ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 03:04:18 +08:00
OG T	6351e9a0e9	feat(mcp-phase2): MCP Phase 2: Prometheus MCP + SSH MCP + alert labels MCP-2b: prometheus_provider.py - prometheus_query (real-time PromQL query) - prometheus_query_range (historical trends, default 15 minutes) - prometheus_get_alert_history (alert firing history) - config: PROMETHEUS_URL + PROMETHEUS_MCP_ENABLED MCP-2a: ssh_provider.py - Group A: 9 read-only diagnostic tools (top/disk/memory/logs/status/port/nginx/swap) - Group B: 6 safe operation tools (restart/compose/systemctl/clear-log/ssl/nginx-reload) - four-layer safety guard (allowlist/allowed_hosts/forbidden_patterns/trust_score) - config: SSH_MCP_ENABLED + SSH_MCP_ALLOWED_HOSTS K8s: 04-ssh-mcp-secret.example.yaml (ssh-mcp-key Secret template + creation steps) Alert labels: alerts-unified.yml adds mcp_provider/host_type/alert_category Covers: HostHighCpuLoad/HostOutOfMemory/HostOutOfDiskSpace/DockerContainer* SignOzDown/SentryDown/HarborDown/GiteaDown Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 02:35:35 +08:00
OG T	e1dfbedf0e	fix(alerts): HostHighCpuLoad auto_repair: false → true Root cause of the flywheel being stuck at GUARDRAIL_BLOCKED: the Prometheus rule label auto_repair=false forces HITL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 13:33:23 +08:00
OG T	ab3e266a23	fix(monitoring): Phase O-6.2 service-registry adds the 9 missing K8s deployments Added: - argocd 5 components (applicationset/dex/notifications/redis/repo-server) - awoooi-dev/awoooi-api - kube-state-metrics - observability/event-exporter - velero/velero Result: prometheus coverage 94%→96%, errors 9→0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 10:44:36 +08:00
OG T	85d4857d1b	fix(monitoring): RedisMemoryHigh false positive: fix division by zero when max_bytes=0 - add a redis_memory_max_bytes > 0 precondition - prevents division by zero from yielding +Inf and firing forever when Redis has no maxmemory set - affects: alerts-unified.yml + database-alerts.yaml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 11:41:10 +08:00
OG T	9799a14f54	feat(monitoring): Plan C external website alerts: 4 websites + SSL certificate early warning Adds the external_website_alerts group: - MoWoooWorkDown (mo.wooo.work, 188, momo-app) - TsenyangWebsiteDown (tsenyang.com, 188, tsenyang-website) - StockWoooWorkDown (stock.wooo.work, 110, stock-platform) - BitanWoooWorkDown (bitan.wooo.work, 188, bitan-app) - ExternalSiteSSLExpiringSoon (14-day early warning, auto_repair:false) blackbox-http already covers all targets; these are the structured alert rules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 08:53:08 +08:00
OG T	3c6807d79c	ops(monitoring): trigger deploy-alerts: redeploy the 6 database_detail_alerts rules `d9e0fab` added 6 detailed DB alert rules, but deploy-alerts failed because pyyaml was not installed. `0f86c5c` fixed the workflow; this commit triggers a redeploy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 21:17:26 +08:00
OG T	d9e0fab3fe	feat(monitoring): Sprint 5.2 Plan B: detailed database alert rules Adds the database_detail_alerts rule group: PostgreSQL: - PostgreSQLSlowQueries: slow queries >60s - PostgreSQLDeadlocks: deadlocks occurred - PostgreSQLTooManyConnections: connections >50 Redis: - RedisKeyEviction: key eviction - RedisConnectionsHigh: connections >100 - RedisCommandLatencyHigh: command latency >10ms Prerequisites: postgres-exporter:9187 + redis-exporter:9121 already deployed on 188 ✅ Prometheus scrape updated ✅ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 18:19:03 +08:00
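
The guard described in commit da772a1605 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `GuardResult`, `bare_metal_kubectl_guard`, the label dictionary shape, and the reason string are all hypothetical names chosen here; only the rule itself (host_type=bare_metal plus an action starting with "kubectl" is refused) comes from the commit message.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class GuardResult:
    """Hypothetical result type; the real code marks the incident READY instead."""
    blocked: bool
    blocked_reason: Optional[str] = None


def bare_metal_kubectl_guard(labels: Mapping[str, str], proposed_action: str) -> GuardResult:
    """Refuse kubectl actions when the primary signal is a bare-metal host alert.

    Assumption: `labels` holds the alert labels of the incident's primary signal.
    """
    is_bare_metal = labels.get("host_type") == "bare_metal"
    is_kubectl = proposed_action.strip().startswith("kubectl")
    if is_bare_metal and is_kubectl:
        return GuardResult(True, "kubectl action refused on bare_metal host alert")
    # Missing host_type, k8s hosts, and non-kubectl actions all pass through.
    return GuardResult(False)


if __name__ == "__main__":
    print(bare_metal_kubectl_guard({"host_type": "bare_metal"}, "kubectl get pods"))
```

The five cases listed in the commit's test description map directly onto this function: bare_metal with any kubectl command (including `kubectl get`) is blocked, while an SSH action, a k8s host_type, or a missing host_type label is not.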

58 Commits
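
The grace-period split introduced in commit 577250a678 (W-3a / W-3b) might look something like the sketch below. It assumes the watchdog receives the flywheel success rate as an optional float and the process uptime in seconds; `classify_flywheel_rate`, the constant names, and the returned alert keys are hypothetical. The 0.30 threshold and the 30-minute grace window are taken from the commit message.

```python
from typing import Optional

GRACE_SECONDS = 30 * 60   # startup grace window stated in the commit
RATE_THRESHOLD = 0.30     # success-rate floor stated in the commit


def classify_flywheel_rate(rate: Optional[float], uptime_seconds: float) -> Optional[str]:
    """Return a hypothetical alert key, or None when nothing should fire."""
    if rate is not None:
        # W-3a: the metric has data, so the existing low-success-rate check applies.
        return "success_rate_low" if rate < RATE_THRESHOLD else None
    if uptime_seconds > GRACE_SECONDS:
        # W-3b: still no data after the grace window, so report a broken pipeline
        # instead of silently skipping.
        return "pipeline_no_traffic"
    # Inside the grace window with no data yet: stay quiet for now.
    return None


if __name__ == "__main__":
    print(classify_flywheel_rate(None, 45 * 60))
```

The point of the design is that a missing metric is only tolerated temporarily; once the grace window passes, absence of data becomes its own alert rather than a reason to skip.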