awoooi

Author	SHA1	Message	Date
Your Name	898d7b0ff2	docs(logbook): update Phase 2 progress (P0-05/06/11/12 all complete) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 13:55:14 +08:00
Your Name	f2f5148ca6	fix(awooop): Phase 2 second batch of P0 security hardening + Redis key namespace fixes ## P0-05 Callback nonce anti-forgery (ADR-116) - security_interceptor.py: generate_callback_nonce() now appends HMAC-SHA256[:16] - New 5-part format: {action}:{short_id}:{ts}:{rand}:{hmac16} - When CALLBACK_HMAC_SECRET is unset, degrades with a warning (backward compatible) - security_interceptor.py: parse_callback_data() adds a 5-part branch + HMAC verification - config.py: adds CALLBACK_HMAC_SECRET: str = Field(default="") ## P0-06 Webhook HMAC replay protection (ADR-116) - security_interceptor.py: adds check_webhook_nonce() (service layer; get_redis is legitimate at this layer) - webhooks.py: verify_webhook_signature() adds two optional headers - X-Webhook-Timestamp: ±300s window validation (if provided) - X-Webhook-Nonce: calls check_webhook_nonce() (Redis NX dedup, fail open) - Removes the direct get_redis import (leWOOOgo modularization fix) ## P0-11 ollama:current_primary Redis key migration Phase A (ADR-110) - ollama_auto_recovery.py: _REDIS_PRIMARY_KEY = "platform:ollama:current_primary" - Dual-writes the old key "ollama:current_primary" (Phase A, 30 days) - Reads prefer the new key and fall back to the old key ## P0-12 consensus Redis key gains a project namespace, Phase A - consensus_engine.py: adds _consensus_key() / _consensus_legacy_key() helpers - New key: {project_id}:consensus:{consensus_id} - When project_id=None, falls back to __platform__:consensus:{consensus_id} - Phase A dual-write + fallback reads; existing callers need no changes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 13:54:38 +08:00
Your Name	13e51802fe	feat(awooop): all Phase 0 ADRs + Phase 1 control plane schema (including four critic fixes) ## Phase 0 (documentation layer, all Accepted) - ADR-106/107: AwoooP platform architecture + storage strategy - ADR-111~118: seven core ADRs, Bootstrap → RLS - ADR-119~124: six ADRs, SAGA → Singleton Decomposition - ADR-UI-01~04: four Operator Console UI ADRs ## Phase 1 (DB schema + migration) - awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + trigger + RLS - Step 1: three roles (platform_admin/migration with BYPASSRLS, awooop_app subject to RLS) - Step 13: GRANT least privileges to awooop_app (7 statements) - Step 14: RLS fail-closed, removes the __platform__ backdoor - awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables - awooop_phase1_batch1_backfill.py: SKIP LOCKED batched backfill script - awooop_models.py: 7 SQLAlchemy 2.x models ## Critic fixes (4 Critical + 3 Major) - C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup - C-2: __mapper_args__ string list → primary_key=True on mapped_column - C-3: __platform__ RLS backdoor → removed entirely, replaced with a BYPASSRLS role - C-4: awooop_app role was never created → Step 1 + 7 GRANTs - M-1: active_pointer_guard SECURITY DEFINER (cross-tenant protection under FORCE RLS) - M-2: pg_partman create_parent gains an idempotency guard - M-3: immutability trigger now protects identity columns (project_id/family/contract_id) ## Task 1.2 patches - agent_loader.py: hard-coded Mac path → AGENTS_DIR environment variable - Dockerfile: adds the missing COPY .claude/agents/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 13:37:11 +08:00
Your Name	b1ef05fa8c	feat(ollama): ADR-110 GCP three-tier disaster recovery architecture (GCP-A → GCP-B → Local → Gemini) ## Change summary - Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x load speed + 2x inference) - Secondary: http://34.21.145.224:11434 (GCP-B SSD) - Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD, last line of defense) - Retires the ADR-105 "111-only iron rule" and creates ADR-110 ## Core changes - config.py: adds OLLAMA_SECONDARY_URL; validator adds a GCP IP allowlist (34.143.170.20, 34.21.145.224) - ollama_failover_manager.py: three-tier Ollama decision matrix; parallel health checks across all three hosts; health_111 → health_gcp_a - ollama_health_monitor.py: host label extraction generalized (supports GCP public IPs) - failover_alerter.py: the failed/recovered host is shown dynamically, no longer hard-coded as "Ollama 111 (GPU)" - ollama_auto_recovery.py: notify_recovery changed to ollama_gcp_a; recovered_host is dynamic - k8s/awoooi-prod: configmap + deployment + network-policy updated in sync (egress adds the GCP /32s) - Service layer: 10 service files switch from hard-coded 192.168.0.111 to reading settings.OLLAMA_URL - Tests: URL constants updated; adds three-tier failover scenarios and GCP IP allowlist validation tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 22:49:23 +08:00
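Your Name	0f009d9459	docs(adr): ADR-109 telegram_gateway unified dedup layer (P0 #1 design doc) P0 #1 (thorough long-term fix series): the 33 send_xxx methods, each with its own dedup logic, are consolidated into a single layer in `_send_request()`. Any future send_xxx method inherits dedup automatically by passing two kwargs (dedup_scope + dedup_fingerprint), which removes the design flaw where "missing one chain means bombarding the commander". Currently in Proposed status, awaiting review by the chief architect. Tier 2 orange zone. Includes: - dedup_scope mapping for the 33 send_xxx methods - A 5-6 hour / 3-commit incremental refactoring plan - Its relationship with ADR-108 (incident_id fingerprint). Both ADRs are in the design stage of the "thorough long-term fix" series, awaiting the commander's approval to execute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:54:19 +08:00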
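Your Name	62698158b0	docs(adr): ADR-108 incident_id fingerprint derivation (P1 design doc) P1 (thorough long-term fix series): addresses the root cause of all dedup problems by changing incident_id from a random uuid4()[:6] to a value derived from a fingerprint hash; while an incident is open, the same fingerprint is forced to reuse the same INC. Currently in Proposed status, awaiting review by the chief architect. Tier 3 red-zone change; no code is touched without approval. Includes: - Impact inventory (1435 reference points; an estimated ~10 files / ~20 places actually need changes) - 4-phase incremental migration (~7 hours) - Decision on cross-day reuse behavior - 5 risks and mitigations - 5 acceptance criteria Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:53:09 +08:00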
Your Name	b710f3f38f	feat(governance): normalize AI governance alert output and meta-alert resolution	2026-05-02 23:49:59 +08:00
Your Name	ed0553c337	docs(governance): add AI governance alert schema and consolidation playbook	2026-05-02 23:47:00 +08:00
Your Name	3059897318	feat(governance): auto-deprecate low-trust unused playbooks (>30d) trust_drift previously fired alerts forever for playbooks stuck below the 0.2 threshold. With user authorization for governance-class auto-fixes, check_trust_drift now retires playbooks that have been unused for 30+ days (or never used and created 30+ days ago) by flipping status to 'deprecated' before alerting. Alerts now report drifted_count, auto_deprecated_count, and the kept playbook_ids that still need human review (those in their 30d trial window). Existing alert noise from the four currently-drifted playbooks should drop to whatever fraction is genuinely in trial. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	607358c4dd	fix(approval): route SSH actions through SSHProvider on manual approve parse_operation_from_action only knew kubectl and Chinese restart phrases, so any "ssh host '...'" action approved via Telegram fell through to "Could not parse operation type" and reported a fake failure even though the LLM had proposed a valid host repair. Adds OperationType.SSH_HOST, makes the parser detect ssh prefixes (with optional flags / user@host) before kubectl patterns, and routes the SSH_HOST branch in approval_execution.execute_in_background through SSHProvider with the same tool keywords decision_manager uses (ssh_docker_prune / ssh_docker_restart / ssh_systemctl_restart / ssh_diagnose). Unroutable SSH actions now fail loudly with a descriptive error instead of silently breaking. Trigger: 2026-05-02 incidents INC-20260502-D6D0B7 / E12EE4 / 557055 were approved by the user but executor reported "Could not parse" and left the alerts pending. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	3156ff1c69	feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune with ≥75% disk usage gate) and routes "docker prune" actions through it. Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider routing labels so the flywheel can self-heal next disk-full event without hitting the emergency_channel Telegram path. Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every disk-full alert through "Host key is not trusted → escalate" loop. known_hosts patched live; this commit closes the playbook gap so the next occurrence resolves without manual intervention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	8cf559215c	docs(awooop): add Phase 1 Isolation Foundation implementation plan (ADR-106 P1)	2026-05-02 12:28:33 +08:00
Your Name	443947ffa1	fix(ci): avoid code review sigpipe on large diffs [skip ci]	2026-05-01 20:59:14 +08:00
Your Name	7795f027d2	fix(aiops): persist emergency intervention traces	2026-05-01 20:34:33 +08:00
Your Name	8e49f2ea88	fix(ci): preserve ssh mcp known hosts [skip ci]	2026-05-01 17:18:32 +08:00
Your Name	433f7b068e	fix(aiops): close ssh and telegram remediation gaps	2026-05-01 16:53:02 +08:00
Your Name	3650fc727a	docs(ci): record runner user service takeover state	2026-05-01 16:30:54 +08:00
Your Name	bc295eaec2	fix(ci): allow user service for gitea host runner	2026-05-01 16:24:45 +08:00
Your Name	cb5ab900c4	fix(ci): preserve gitea runner jobs on shutdown	2026-05-01 16:16:27 +08:00
Your Name	b0da6da1e9	feat(aiops): structure agent loop shadow output	2026-05-01 15:09:57 +08:00
Your Name	f8e44971c1	feat(aiops): enable read-only agent loop canary	2026-05-01 14:20:16 +08:00
Your Name	6ec3f116fd	fix(ci): normalize migration database url for psql	2026-05-01 13:30:32 +08:00
Your Name	7e4d995e4b	feat(aiops): add mcp agent loop foundation	2026-05-01 13:21:19 +08:00
Your Name	9db87f177e	fix(aiops): suppress repeated llm alert loops	2026-05-01 13:02:07 +08:00
Your Name	11673d80ea	fix(aiops): route backup decisions through ssh	2026-05-01 12:50:01 +08:00
Your Name	3a6acae408	fix(km): add phase25 knowledge enum labels	2026-05-01 11:03:03 +08:00
Your Name	2c12bce135	fix(aiops): use existing escalation event type	2026-05-01 10:56:59 +08:00
Your Name	97be5dedd7	fix(aiops): escalate failed host verification	2026-05-01 10:47:42 +08:00
Your Name	e4aef6ac4e	fix(aiops): block k8s playbooks for host repair	2026-05-01 10:33:52 +08:00
Your Name	ca22ec2fd2	fix(aiops): route backup failures rule-first	2026-05-01 10:11:10 +08:00
Your Name	f154ac022e	feat(playbook): version generated playbooks	2026-04-30 23:59:39 +08:00
Your Name	f0d14ab6c4	fix(aiops): escalate blocked auto repair	2026-04-30 23:49:17 +08:00
Your Name	6e04fe9c8a	feat(playbook): generate drafts with local llm	2026-04-30 23:04:58 +08:00
Your Name	95110971f3	fix(telegram): close remaining DM alert routes	2026-04-30 23:02:17 +08:00
Your Name	712d3e5a77	fix(ci): send workflow alerts to SRE group	2026-04-30 15:05:16 +08:00
Your Name	61f5a6a419	fix(telegram): route alerts to SRE war room	2026-04-30 15:01:23 +08:00
Your Name	ed2a4838f2	fix(auto): use action parser for repair gates	2026-04-30 14:06:09 +08:00
Your Name	4723499955	fix(cd): install playwright system deps for smoke	2026-04-30 11:02:12 +08:00
Your Name	e27b462bef	fix(ops): keep disabled gitea runner stopped	2026-04-30 10:59:46 +08:00
Your Name	0f7e9d3467	fix(cd): run docker builds on host runner	2026-04-30 10:43:33 +08:00
Your Name	7cc10b2599	fix(cd): serialize gitea docker builds	2026-04-30 10:11:50 +08:00
Your Name	e91db52858	docs(logbook): record `639bb64` prod deployment [skip ci]	2026-04-30 09:45:48 +08:00
Your Name	639bb64788	feat(flywheel): surface ai automation and code review	2026-04-30 00:09:25 +08:00
Your Name	4a57c2d04f	feat(flywheel): expose incident processing timeline	2026-04-29 23:38:30 +08:00
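Your Name	f5f41543c9	docs: ADR-105 overturns A2 + LOGBOOK 2026-04-29 LLM flywheel revival battle. ADR-105 fully records the decision to overturn the A2 iron rule: - Context: A2 historical background + how the factual basis changed two months later (GPU + qwen2.5:7b) - Decision: 4 modifications (IntentType.DIAGNOSE override / chain / openclaw.py task_type / 6 regression tests) - Consequences: positive (flywheel revived) + negative (Ollama single point of failure) + known debt (ADR-106-109 follow-ups) - Validation: all 1635 tests green before deployment; 5 validation metrics after deployment - Rollback: env switch / git revert. LOGBOOK adds a 2026-04-29 entry: - True root cause: all 4 providers dead + the A2 iron rule excluding Ollama - CD chain of painful failures: 5 commits all failed (setup_test_schema.sql missing a column) - Already landed (independent of CD): 17 Prometheus rules + Gemini sanitize - Memory index updated in sync (pointing to project_revert_a2_ollama_primary.md). Note: docs/ is not in the cd.yaml paths trigger, so this commit does not affect CD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:59:53 +08:00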
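Your Name	6eb33594c2	docs(logbook): T0 12-Agent full-panorama verification record. Continuing from the previous session's wave2 (commit `143c15f0`) + DB cleanup + Gitea HMAC + ArgoCD/Sentry MCP, four experts were dispatched to verify in parallel (critic / db-expert / debugger / tool-expert). Details: B1/B2 ghost buttons + KM swallowing exceptions early + M1-M4 moderate issues + G1-G3 environment governance gaps. This commit mainly fills in the LOGBOOK index; for the P0/P1 fixes themselves, see the previous 2 commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00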
Your Name	025a493f06	feat(p3.2+adr-100): Model Version Tracker + SLO self-governance + KB rot cleaner Wave 8 P3.2 model version tracking + ADR-100 SLO self-governance + supporting changes: P3.2, Model Version Tracking: - model_version_probe.py (268 lines): probes the model version of providers such as Ollama / OpenRouter - model_version_tracker.py (101 lines): aligns with the PG provider_version_history table - migrations/p3_2_provider_version_history.sql + rollback: 25-line schema - db/models.py +32 lines: ProviderVersionHistory ORM ADR-100, AI autonomy SLO: - docs/adr/ADR-100-ai-autonomous-slo.md (167 lines): flywheel SLO design and thresholds - ops/monitoring/slo-rules.yml (254 lines): Prometheus SLO recording rules + alerts - ops/monitoring/tests/test_slo_rules.yaml (242 lines): promtool unit tests Integration changes: - main.py +72 lines: lifespan starts model_version_probe + the KB rot cleaner schedule - gitea_webhook.py +45 lines: webhook receives model version change notifications - ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py: wiring adjustments to match New tests: - test_kb_rot_cleaner_schedule.py (120 lines): 9 tests pass - test_slo_rules.yaml: promtool acceptance Tests: 9 passed (test_kb_rot_cleaner_schedule) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>	2026-04-27 14:54:19 +08:00
Your Name	fb130c9a28	feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization hardening + added metrics fields Wave 8 P3.1-T2 follow-up tests + supporting changes: New tests: - test_diagnosis_aggregator_stub.py (238 lines): 15 tests · stub fixture verifies the _collect_diagnosis_aggregator wiring · feature flag defaults to off and makes no call · timeout boundaries / exception fail-soft Changes: - core/metrics.py +23: adds DiagnosisAggregator-related Prometheus metrics - sanitization_service.py +24: hardens prompt sanitization boundaries (companion to vuln #4) - RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor adjustments Tests: 15 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 08:30:26 +08:00
Your Name	fefe4c21cd	fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana. Continues the `595629c0` INC-20260425 fix, covering the three Agent stages + end-to-end observability: A1 follow-up, Solver/Critic three-stage timeout wiring: - solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override) - critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override) - protocol.py: all three Agents share an observe_agent_step() call wrapper · success/timeout/error outcome label · histogram written to aiops_agent_step_duration_seconds A2 follow-up, auto_repair_service switches to _diagnose_fallback_chain: - auto_repair_service.py +46 lines: routes DIAGNOSE to the new chain (NEMO→GEMINI→CLAUDE) - Completely avoids the 238s second timeout on Ollama CPU New metrics: - core/metrics.py +59 lines: histogram buckets + label cardinality for observe_agent_step New tests (862 lines): - test_agent_step_timeouts.py (475): timeout boundaries + outcome labels for each of the three Agents - test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain New supporting files: - docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide - ops/monitoring/grafana/agent_step_latency_rules.yaml (160) · histogram alert rules for the three Agents (p99 > 80% of timeout → warning) Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11) INC-20260425 total workload across both fixes (595629c0 + this commit): · 5 service/agent files modified · 1 new observability module · 4 new test/supporting files · 1372+187 = 1559 lines added Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 follow-up) <noreply@anthropic.com>	2026-04-27 08:15:53 +08:00
Your Name	1ab6786ce3	feat(ops): Ollama failover Runbook + Grafana dashboard + Consensus K8s ConfigMap patch Wave 6 P2.3 ops supporting changes + tool-expert deployment docs: Added: - docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines) · verification steps for the three iron rules (auto-switch to Gemini / auto-switch back / quota circuit breaker) · complete failover/recovery SOP · troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.) - ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines) · 4 panels: current primary / failover events / quota usage / health status · maps to P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT - k8s/awoooi-prod/04-configmap.yaml.patch-consensus · ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>	2026-04-27 08:11:40 +08:00
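
Commit f2f5148ca6 describes a 5-part callback format, {action}:{short_id}:{ts}:{rand}:{hmac16}, where the last field is a truncated HMAC-SHA256. The following is a minimal sketch of how such a token could be signed and checked, not the repository's actual code. The helper names `sign_callback` and `verify_callback` are hypothetical, and three details are assumptions: the HMAC covers the first four fields joined by ":", "[:16]" means the first 16 hex characters of the digest, and the secret is passed in directly rather than read from CALLBACK_HMAC_SECRET.

```python
import hashlib
import hmac
import secrets
import time


def _tag(payload: str, secret: bytes) -> str:
    # Assumption: "[:16]" refers to the first 16 hex characters of the digest.
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()[:16]


def sign_callback(action: str, short_id: str, secret: bytes) -> str:
    """Build '{action}:{short_id}:{ts}:{rand}:{hmac16}' (hypothetical helper)."""
    ts = str(int(time.time()))      # issue time in Unix seconds
    rand = secrets.token_hex(4)     # 8 hex characters of randomness
    payload = f"{action}:{short_id}:{ts}:{rand}"
    return f"{payload}:{_tag(payload, secret)}"


def verify_callback(data: str, secret: bytes) -> bool:
    """Return True only for a 5-part token whose HMAC field matches."""
    parts = data.split(":")
    if len(parts) != 5:
        return False
    payload = ":".join(parts[:4])
    expected = _tag(payload, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(parts[4].encode(), expected.encode())
```

This sketch assumes that neither `action` nor `short_id` contains a colon; a field with an embedded ":" would change the part count and fail verification.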

397 Commits
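
The webhook replay protection in commit f2f5148ca6 (an optional X-Webhook-Timestamp checked against a ±300s window, plus an optional X-Webhook-Nonce deduplicated with Redis NX, failing open) can be sketched roughly as below. This is an illustration under stated assumptions, not the project's `check_webhook_nonce()`: the timestamp is assumed to be Unix seconds, the `webhook:nonce:` key prefix is invented, and `InMemoryNonceStore` is a hypothetical stand-in so the example runs without Redis. With redis-py, the equivalent of `set_if_absent` would be `client.set(key, "1", nx=True, ex=ttl)`.

```python
import time
from typing import Optional, Protocol

WINDOW_SECONDS = 300  # the ±300s window from the commit message


class NonceStore(Protocol):
    def set_if_absent(self, key: str, ttl_seconds: int) -> bool: ...


class InMemoryNonceStore:
    """Hypothetical stand-in for Redis SET NX EX, for illustration only."""

    def __init__(self) -> None:
        self._expiry: dict[str, float] = {}

    def set_if_absent(self, key: str, ttl_seconds: int) -> bool:
        now = time.time()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # key still live, so this nonce was seen before
        self._expiry[key] = now + ttl_seconds
        return True


def is_fresh(
    timestamp: Optional[str],
    nonce: Optional[str],
    store: NonceStore,
    now: Optional[float] = None,
) -> bool:
    """Return False for stale or replayed requests; both headers are optional."""
    now = time.time() if now is None else now
    if timestamp is not None:
        try:
            ts = float(timestamp)  # assumption: Unix seconds
        except ValueError:
            return False
        # Written as "not <=" so that NaN is rejected as well.
        if not abs(now - ts) <= WINDOW_SECONDS:
            return False
    if nonce is not None:
        try:
            # Assumption: hypothetical key prefix, TTL equal to the window.
            if not store.set_if_absent(f"webhook:nonce:{nonce}", WINDOW_SECONDS):
                return False
        except Exception:
            # Fail open: a store outage must not block legitimate webhooks.
            return True
    return True
```

Failing open trades some replay resistance during a store outage for availability; the signature check still applies in that case, so only the dedup guarantee is lost.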