awoooi

Author	SHA1	Message	Date
Your Name	6efd186750	docs(security): establish high-value configuration control inventory [skip ci]	2026-06-11 11:29:58 +08:00
Your Name	ae7b39d96a	fix(ops): harden reboot recovery and backup alerts	2026-05-29 12:41:34 +08:00
Your Name	6d2b0ed4cd	ops(runner): add isolation readiness gate [skip ci]	2026-05-24 09:56:47 +08:00
Your Name	4407b46bb6	ops(runner): inventory workflow labels [skip ci]	2026-05-24 09:52:04 +08:00
Your Name	22b45006b7	ops(runner): add pool inventory audit [skip ci]	2026-05-24 09:47:02 +08:00
Your Name	9b465ee140	ci(runner): drain legacy docker act runner safely	2026-05-21 18:53:45 +08:00
Your Name	b3ab4da03b	ci(cd): wait for host web build pressure	2026-05-21 15:51:36 +08:00
Your Name	ae9d0b7385	feat(monitoring): alert on stale source provider ingestion	2026-05-20 19:19:21 +08:00
Your Name	598f33ae8b	fix(monitoring): clarify alert chain smoke evidence	2026-05-20 13:11:44 +08:00
Your Name	21dcfbd991	fix(governance): collapse km slo fallback series	2026-05-14 19:37:15 +08:00
Your Name	d2a4a17969	fix(governance): stabilize adr100 km growth slo	2026-05-14 19:33:52 +08:00
Your Name	4111ea4f9f	fix(ai): remove 188 ollama provider	2026-05-06 14:34:48 +08:00
OG T	c4f40235f4	fix(alertmanager): gate direct telegram to alertchain emergencies	2026-05-06 13:45:33 +08:00
OG T	4753099155	fix(alertmanager): send direct alerts to sre group	2026-05-06 13:38:47 +08:00
Your Name	587551c1f1	fix(ops): monitor full-stack cold-start gates	2026-05-06 00:48:05 +08:00
Your Name	6e96623884	fix(ops): harden momo scheduler cold start gate	2026-05-06 00:15:14 +08:00
Your Name	0315c2b510	docs(ops): codify full stack cold start recovery	2026-05-06 00:07:57 +08:00
Your Name	23932773ef	fix(monitoring): route docker baseline alerts to ssh	2026-05-06 00:00:12 +08:00
Your Name	2f50c67f5c	fix(monitoring): keep host alert ssh diagnostics canonical	2026-05-05 23:57:53 +08:00
Your Name	2221fd3256	fix(ops): persist host resource guardrails	2026-05-05 16:13:19 +08:00
Your Name	1cc9de5722	fix(ops): point runner guardrail alerts to host script	2026-05-05 15:25:37 +08:00
Your Name	d08d1e4951	fix(ops): alert on missing docker resource limits	2026-05-05 15:01:31 +08:00
Your Name	72d66e4ae6	fix(ops): align stale job cleanup thresholds	2026-05-05 14:54:17 +08:00
Your Name	5e625f777d	fix(ops): add stale gitea job cleanup guard	2026-05-05 14:50:47 +08:00
Your Name	7d45f0cb58	fix(ops): alert on stale gitea actions jobs	2026-05-05 14:42:09 +08:00
Your Name	fe618960a8	fix(ops): monitor systemd runners in host baseline	2026-05-05 14:08:43 +08:00
Your Name	e8e6748f70	fix(ops): add docker host resource baseline guardrails	2026-05-05 13:45:09 +08:00
Your Name	ec013f662d	fix(watchdog): fix duplicate Trust Drift alerts + set up GCP Ollama nginx proxy - ai_slo_watchdog_job: switch to the statistics-only trust_drift_detector lib so it no longer triggers Telegram alerts that duplicate governance_agent's hourly self-check - infra/ansible: set up an nginx proxy on 110 forwarding to GCP-A/B: port 11435 -> 34.143.170.20:11434 (GCP-A), port 11436 -> 34.21.145.224:11434 (GCP-B) - docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md, full deployment guide - ops/nginx: manual deployment script to run directly on 110 Prerequisite for enabling ADR-110 three-tier disaster recovery: deploy the proxy first, then change the ConfigMap	2026-05-04 23:12:35 +08:00
Your Name	b1ef05fa8c	feat(ollama): ADR-110 GCP three-tier disaster recovery architecture (GCP-A → GCP-B → Local → Gemini) ## Change summary - Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x load speed + 2x inference) - Secondary: http://34.21.145.224:11434 (GCP-B SSD) - Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD, last line of defense) - Retire ADR-105's "111 only" iron rule; create ADR-110 ## Core changes - config.py: add OLLAMA_SECONDARY_URL; validator gains a GCP IP allowlist (34.143.170.20, 34.21.145.224) - ollama_failover_manager.py: three-tier Ollama decision matrix; parallel health checks across the three hosts; health_111 → health_gcp_a - ollama_health_monitor.py: host label extraction generalized (supports GCP public IPs) - failover_alerter.py: failed/recovered host shown dynamically; no longer hardcodes "Ollama 111 (GPU)" - ollama_auto_recovery.py: notify_recovery changed to ollama_gcp_a; recovered_host is dynamic - k8s/awoooi-prod: configmap + deployment + network-policy updated together (egress adds GCP /32) - Service layer: 10 service files that hardcoded 192.168.0.111 now read settings.OLLAMA_URL - Tests: URL constants updated; added three-tier disaster recovery scenarios and GCP IP allowlist validation tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 22:49:23 +08:00
Your Name	577250a678	fix(governance): fix W-3/W-4 guards against alert silencing + add Prometheus missing-data alert [Commander's reprimand: violation of the feedback_full_chain_first_then_fix.md iron rule] The previous commit `f1362fcc` swallowed alerts with skip conditions, which is a silencing fix: - W-3: total_exec<10 always skips → no alert even if Redis stays empty forever - W-4: playbooks total==0 always skips → no alert even if the table is wiped - Prometheus NaN sentinel combined with the existing < 0.1 rule leaves no path that alerts. The commander's reprimand: "You made the alerts disappear again" and "How many times has this happened already". This commit restores alert visibility. [Fix: 30-minute startup grace period + after it expires, raise a new data-pipeline-broken alert instead] - ai_slo_watchdog_job.py adds module-level _PROCESS_START and a _grace_active() guard: - W-3a: metric has data + rate<0.30 → existing "flywheel success rate too low" alert - W-3b: rate=None and uptime>30min → new alert "flywheel data pipeline has no traffic" - W-4a: playbooks total>0 + approved=0 → existing "auto-repair chain broken" alert - W-4b: playbooks total=0 and uptime>30min → new alert "Playbook table initialization failed" - 3 Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml, ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add FlywheelExecutionRateMissing: absent() or NaN for 30 minutes → alert, a second safeguard alongside watchdog W-3b [Added to memory] feedback_silencing_alerts_recurring_violation.md locks in a red-line iron rule: "Using skip in a fresh deploy / init guard to swallow alerts = structural dereliction; it must branch on a grace period and, once expired, raise a new data-pipeline-broken alert" [Verification] All 106 governance-related unit tests pass: test_trust_drift_watchdog / test_governance_agent / test_failover_alerter / test_check_trust_drift_commit_outside_context_poc / test_governance_remediation_dispatch / test_ai_governance_endpoints / test_governance_dispatcher 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:39:46 +08:00
Your Name	b371edb70c	fix host alert auto-repair routing and backup false positives	2026-05-02 23:44:12 +08:00
Your Name	da772a1605	fix(decision): block kubectl actions on bare_metal host alerts When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host (192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating CPU), the LLM kept proposing "kubectl rollout restart awoooi-api", which is a wrong-domain action: restarting awoooi cannot fix a third-party process's CPU usage on the host. Auto-execute would then either run the no-op kubectl restart (wasted) or escalate after ssh_diagnose because no safe action was found, producing the "AI auto-repair failed" Telegram noise the user just complained about. Adds a guard at the top of DecisionManager._auto_execute: if the incident's primary signal carries host_type=bare_metal AND the proposed action starts with "kubectl", refuse to execute. The incident is marked READY with a clear blocked_reason so human operators see why automation declined, and emergency_escalation records the event in AOL for audit. Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new ops/monitoring/alerts.yml in repo) to add an explicit auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory that hints the LLM toward `ssh ... ps aux` rather than kubectl restart. Prometheus reload returned 200. Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked, (3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT blocked, (5) missing host_type label NOT blocked. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 17:41:28 +08:00
Your Name	3156ff1c69	feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune with ≥75% disk usage gate) and routes "docker prune" actions through it. Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider routing labels so the flywheel can self-heal next disk-full event without hitting the emergency_channel Telegram path. Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every disk-full alert through "Host key is not trusted → escalate" loop. known_hosts patched live; this commit closes the playbook gap so the next occurrence resolves without manual intervention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	3650fc727a	docs(ci): record runner user service takeover state	2026-05-01 16:30:54 +08:00
Your Name	e7991b8e6c	fix(ci): keep runner installer idempotent without restart	2026-05-01 16:27:37 +08:00
Your Name	bc295eaec2	fix(ci): allow user service for gitea host runner	2026-05-01 16:24:45 +08:00
Your Name	cb5ab900c4	fix(ci): preserve gitea runner jobs on shutdown	2026-05-01 16:16:27 +08:00
Your Name	ca22ec2fd2	fix(aiops): route backup failures rule-first	2026-05-01 10:11:10 +08:00
Your Name	f0d14ab6c4	fix(aiops): escalate blocked auto repair	2026-04-30 23:49:17 +08:00
Your Name	e27b462bef	fix(ops): keep disabled gitea runner stopped	2026-04-30 10:59:46 +08:00
Your Name	0f7e9d3467	fix(cd): run docker builds on host runner	2026-04-30 10:43:33 +08:00
Your Name	7cc10b2599	fix(cd): serialize gitea docker builds	2026-04-30 10:11:50 +08:00
Your Name	c5753e1c57	fix(critic-review): make KMWriter live up to its name + fix Alertmanager inhibition + move drift checker to AST The critic PR review revealed 7 blockers in already-pushed commits; this commit fixes all of them. ## C1 + C2 + M1 + M2 + M3: KMWriter truly unified contract (the critic's 5 most severe findings) ### C1 km_writer.py:194: fix self-contradicting backfill - bare asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe() - synchronous await + dedicated DLQ km:backfill:dlq + try/except so the main write is not blocked - add km_backfill_reconciler_job.py (scans the DLQ every 5 minutes) + ENABLE_KM_BACKFILL_RECONCILER flag - prevents the race where Path B finishes before Path A → related_approval_id stays NULL forever ### C2 km_writer.py:391: tighten the KM_WRITE_AWAIT=false path - from ensure_future (fire-and-forget, worse than the old synchronous write) - to await writer.write(retry=1, timeout=2.0) (still awaited, but a single attempt with a short timeout) - docstring explicitly states "for emergency rollback only; reliability not guaranteed" ### M1 decision_manager.py:2178/2203: remove the _fire_and_forget bypass - two call sites of _fire_and_forget(executor.write_execution_result_to_km(...)) - changed to await asyncio.shield(...) + BaseException protection (guards against cancellation from the caller) - KM_WRITE_AWAIT=true now genuinely awaits on this path ### M2 incident_service.py:1099: add retry + DLQ to the hand-rolled path - previously: if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task - now: 3 retries with exponential backoff + DLQ protection (calls km_writer's private helper) ### M3 km_writer.py:166: align the idempotency claim with the implementation - knowledge_repository.create() gains an UPSERT path (pg_insert ON CONFLICT DO UPDATE) - KnowledgeEntryCreate / KnowledgeEntryRecord gain a path_type field - migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path ## M4 alertmanager.yml: tighten equal: [] (critic: prevent runaway inhibition) - OllamaInstanceDown / KMConverterDown inhibitions gain an equal: ['cluster'] constraint - prevents any single Ollama outage from wrongly inhibiting all AI/SLO alerts in multi-cluster setups ## M5 Alertmanager version verification (confirmed v0.31.1, well above v0.22+) ## M6 governance_agent.py: health score distinguishes skipped vs ok vs violated - check_slo_compliance adds _meta {violated_count, skipped_count, ok_count, all_skipped, status} - run_self_check: when all SLOs are skipped, emit a separate governance_slo_data_gap alert (does not pollute the self_failure count, because no_data means the emitter is not implemented, not that the governance mechanism failed) ## M7 scripts/check_config_drift.py: switch to AST parsing - regex replaced by ast.parse to find Settings ClassDef AnnAssign Field(default=...) - avoids false negatives from multi-line lists / default_factory= / strings containing line breaks - all 4 fields (AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL) aligned ## New tests - test_km_writer_backfill_reconciler.py: 7 cases (C1 reconciler + safe helper) - test_km_writer_idempotent.py: 5 cases (M3 path_type injection + UPSERT branch) ## Verification - all 1585 unit tests green (+13 from 1572) - amtool check-config SUCCESS (8 inhibit_rules / 2 receivers) - AST-based drift checker: all 4 fields aligned - Alertmanager v0.31.1 confirmed to support the new syntax ## Expected impact - KMWriter lives up to its name: the flywheel closed-loop KM write path is 100% reliable - M4 runaway-inhibition risk removed - governance layer no longer stays silent on SLO no_data - drift checker false-negative risk removed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
Your Name	715dc3cb91	fix(observability): P0 stop false alarms + align ConfigMap drift + governance tooling P0/P1 observability-layer fixes triggered by the 12-Agent panoramic diagnosis. ## P0 stop false alarms (root cause of the 4-SLO avalanche) - governance_agent.py:306: empty result no longer falls back to 0.0; now continue + log warning. Root cause: a Prometheus query returning no data (emitter not implemented / rule not deployed) was misread as SLO=0, which always set violated=True and fired 4 false alerts ## P0 ghost-button guard - telegram_gateway.py:1654: btn_list.clear() when Redis fails for LLM dynamic buttons; first_row (approve/reject, stateless HMAC nonce) is always kept by the caller at 1488; aligned with the "one of three missing" iron rule in feedback_no_ghost_buttons.md ## ConfigMap drift fixes (3 places) - config.py:683 PROMETHEUS_URL: 188→110 (caught by the drift checker = part of the SPF-4 root cause) - config.py:705 ARGOCD_URL: 125→121 (known from T0 G3) - config.py:375 AI_FALLBACK_ORDER: add nvidia to match the ConfigMap ## P1 Alertmanager upgrade (amtool SUCCESS) - ops/alertmanager/alertmanager.yml: deprecated → v0.27+ new syntax - match/match_re → matchers - source_match/target_match → source_matchers/target_matchers - group_by adds the team label (prevents the SLO avalanche of 4 alerts pushed in the same second) - PostgreSQL/Redis inhibits add equal: ['instance'] (prevents runaway inhibition) - add 3 causal inhibition groups: - OllamaInstanceDown → SLO_/AI_ (30 minutes) - KMConverterDown → SLO_KMGrowthRate* - SLO__FastBurn → SLO__(Medium\|Slow)Burn ## Governance tooling landed - scripts/check_config_drift.py: ConfigMap vs code default drift detection; found that PROMETHEUS_URL drift is the SPF-4 root cause (governance_agent connected to 188 instead of 110) - scripts/health_check_session.sh: panoramic verification of 11 services + 4 SSH + drift + git ## Verification - all 1552 unit tests green - amtool check-config SUCCESS (8 inhibit_rules / 2 receivers) - drift checker: all 4 fields aligned - health check: all 11 services reachable Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
Your Name	ed205489c1	feat(p3.2-tests+ci-schema): model_version tests + CI test_schema alignment + Grafana SLO Dashboard P3.2 supporting tests + CI environment sync + ADR-100 Grafana visualization: CI test_schema completed (follow-up to unblocking 1162-1172): - setup_test_schema.sql adds the ai_provider_version_history table - aligned with production p3_2_provider_version_history.sql (already live via K8s exec) New tests (636 lines): - test_model_version_probe.py (387): Provider probe unit tests - test_model_version_tracker.py (249): Tracker integration tests · 4 DB-dependent tests marked @pytest.mark.integration · 15 unit + 4 integration (the unit step skips the integration class) New supporting files: - ai-slo-dashboard.json (496 lines): Grafana dashboard · 4 main panels matching the ADR-100 SLO rules: autonomous repair success rate / flywheel closed-loop latency / governance events / Provider health Modified: - governance_agent.py +122 lines: SLO metric exposure + retrieve metric integration Tests: 15 passed (probe + tracker unit), 4 deselected (integration class) Production deployment status: - p2_decision_fusion_columns.sql ✅ K8s exec done (commit c58bdd0c) - p3_2_provider_version_history.sql ✅ K8s exec done (this commit) - both production migrations are live; CI test_schema brought in sync Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 14:57:16 +08:00
Your Name	025a493f06	feat(p3.2+adr-100): Model Version Tracker + SLO self-governance + KB rot cleaner Wave 8 P3.2 model version tracking + ADR-100 SLO self-governance + supporting changes: P3.2: Model Version Tracking: - model_version_probe.py (268 lines): probes the model version of providers such as Ollama / OpenRouter - model_version_tracker.py (101 lines): aligned with the PG provider_version_history table - migrations/p3_2_provider_version_history.sql + rollback: 25-line schema - db/models.py +32 lines: ProviderVersionHistory ORM ADR-100: AI autonomy SLO: - docs/adr/ADR-100-ai-autonomous-slo.md (167 lines): flywheel SLO design and thresholds - ops/monitoring/slo-rules.yml (254 lines): Prometheus SLO recording rules + alerts - ops/monitoring/tests/test_slo_rules.yaml (242 lines): promtool unit tests Integration changes: - main.py +72 lines: Lifespan starts model_version_probe + KB rot cleaner schedule - gitea_webhook.py +45 lines: webhook receives model version change notifications - ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py: wired up accordingly New tests: - test_kb_rot_cleaner_schedule.py (120 lines): 9 tests pass - test_slo_rules.yaml: promtool acceptance Tests: 9 passed (test_kb_rot_cleaner_schedule) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>	2026-04-27 14:54:19 +08:00
Your Name	fb130c9a28	feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization hardening + additional metrics fields Wave 8 P3.1-T2 follow-up tests + supporting changes: New tests: - test_diagnosis_aggregator_stub.py (238 lines): 15 tests · stub fixture verifies the _collect_diagnosis_aggregator wiring · feature flag default off: not called · timeout boundaries / exception fail-soft Modified: - core/metrics.py +23: adds DiagnosisAggregator-related Prometheus metrics - sanitization_service.py +24: hardens prompt sanitize boundaries (companion to vuln #4) - RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor tweaks Tests: 15 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 08:30:26 +08:00
Your Name	fefe4c21cd	fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana Continues the `595629c0` INC-20260425 fix, adding the three Agent stages + full-chain observability: A1 follow-up: Solver/Critic three-stage timeout wiring: - solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override) - critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override) - protocol.py: the three Agents share observe_agent_step() to wrap calls · success/timeout/error outcome label · histogram written to aiops_agent_step_duration_seconds A2 follow-up: auto_repair_service switches to _diagnose_fallback_chain: - auto_repair_service.py +46 lines: switches DIAGNOSE routing to the new chain (NEMO→GEMINI→CLAUDE) - fully avoids the second Ollama CPU 238s timeout New metrics: - core/metrics.py +59 lines: histogram buckets + label cardinality for observe_agent_step New tests (862 lines): - test_agent_step_timeouts.py (475): timeout boundaries for each of the three Agents + outcome label - test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain New supporting files: - docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide - ops/monitoring/grafana/agent_step_latency_rules.yaml (160) · histogram alert rules for the three Agents (p99 > 80% of timeout → warning) Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11) Total work for the two INC-20260425 fixes (595629c0 + this commit): · 5 service/agent files modified · 1 new observability module · 4 new test/supporting files · 1372+187 = 1559 lines added Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 follow-up) <noreply@anthropic.com>	2026-04-27 08:15:53 +08:00
Your Name	1ab6786ce3	feat(ops): Ollama disaster recovery Runbook + Grafana dashboard + Consensus K8s ConfigMap patch Wave 6 P2.3 ops supporting files + tool-expert deployment docs: Added: - docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines) · verification steps for the three iron rules (auto-switch to Gemini / auto-switch back / quota circuit breaker) · full failover/recovery SOP · troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.) - ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines) · 4 panels: current primary / failover events / quota usage / health status · maps to P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT - k8s/awoooi-prod/04-configmap.yaml.patch-consensus · ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>	2026-04-27 08:11:40 +08:00
Your Name	2c57b71db9	feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (multiple engineers finished the code before hitting the quota limit; this commit backfills it): P2.2: GovernanceAgent 4 self-checks: - governance_agent.py (342 lines): hourly self-check loop: · trust_drift (trust drift detection) · knowledge_degradation (knowledge degradation detection) · llm_hallucination (LLM hallucination detection) · execution_blast_radius (execution blast radius detection) - main.py lifespan: started via asyncio.create_task(run_governance_loop()), wrapped in try/except so a schedule failure does not block the main flow - failover_alerter.py: alert_governance(event_type, payload) with 1h dedup; four event types → Telegram MarkdownV2 alerts P2.3: Ollama health rules + Prometheus Metrics: - ops/monitoring/ollama_health_rules.yaml (148 lines): · OllamaHealthDegraded / OllamaPrimaryDown · OllamaFailoverTriggered / GeminiQuotaExceeded · adds alert rules for the data Prometheus scrapes - core/metrics.py (57 lines): · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge - ollama_failover_manager.py: · _check_gemini_quota: updates the Gauge on every check (so Prometheus reads the latest value) · select_provider: on failover, inc the Counter + flip the Primary Gauge · wrapped in try/except so a metric failure does not block the main routing E2E tests: - test_failover_e2e_dispatch.py (365 lines): full dispatch path: health check → failover decide → alerter → metrics Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (previous session Wave 5) <noreply@anthropic.com>	2026-04-26 20:56:19 +08:00

87 Commits