Commit Graph

85 Commits

Author SHA1 Message Date
Your Name
6bae94fa0b fix(api): restore Alertmanager project context
Some checks failed
CD Pipeline / tests (push) Successful in 1m26s
Code Review / ai-code-review (push) Successful in 13s
CD Pipeline / build-and-deploy (push) Successful in 4m23s
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-06-11 11:46:43 +08:00
Your Name
cfb866d055 feat(governance): add agent market automation surfaces
Some checks failed
Ansible Lint / lint (push) Successful in 35s
CD Pipeline / tests (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Failing after 11s
2026-06-04 21:50:55 +08:00
Your Name
9080ba3670 feat(awooop): run ansible check-mode evidence worker
All checks were successful
CD Pipeline / tests (push) Successful in 1m28s
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / build-and-deploy (push) Successful in 5m9s
CD Pipeline / post-deploy-checks (push) Successful in 1m30s
2026-05-31 12:53:22 +08:00
Your Name
9b01f1fa46 fix(api): serialize startup bootstrap ddl
All checks were successful
CD Pipeline / tests (push) Successful in 5m29s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 4m9s
CD Pipeline / post-deploy-checks (push) Successful in 1m57s
2026-05-24 17:10:26 +08:00
Your Name
edf97ad8ca feat(governance): process hermes km healthchecks
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 2m13s
CD Pipeline / build-and-deploy (push) Successful in 5m14s
CD Pipeline / post-deploy-checks (push) Successful in 1m55s
2026-05-19 22:32:55 +08:00
Your Name
c516f9fc71 fix(metrics): refresh alert chain timestamp from durable evidence
All checks were successful
Code Review / ai-code-review (push) Successful in 9s
CD Pipeline / tests (push) Successful in 5m53s
CD Pipeline / build-and-deploy (push) Successful in 4m13s
CD Pipeline / post-deploy-checks (push) Successful in 1m29s
2026-05-19 17:55:47 +08:00
Your Name
d0835a7be1 fix(api): reconcile completed stuck incidents
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 3m34s
CD Pipeline / post-deploy-checks (push) Successful in 1m35s
2026-05-19 11:45:15 +08:00
Your Name
13cf02b740 feat(governance): emit adr100 slo metrics
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m0s
CD Pipeline / build-and-deploy (push) Successful in 3m21s
CD Pipeline / post-deploy-checks (push) Successful in 1m16s
2026-05-14 18:57:03 +08:00
Your Name
b3dc41fcd4 fix(metrics): wire flywheel metrics into the main /metrics endpoint, fixing the dead FlywheelExecutionRateMissing alert
All checks were successful
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m3s
CD Pipeline / build-and-deploy (push) Successful in 3m28s
CD Pipeline / post-deploy-checks (push) Successful in 1m21s
Root cause of INC-20260507-99ADF2 (full-chain analysis per feedback_full_chain_first_then_fix.md):

[Chain break] Rule layer (added 5/3) vs. metrics layer (changed 5/6) vs. scrape layer (never synchronized)
- 577250a6 (5/3), the "anti-silencing" commit, added the FlywheelExecutionRateMissing
  rule, which requires the 110 Prom to scrape awoooi_flywheel_execution_success_rate;
- a2c4b3d4 (5/6): Codex changed FlywheelStatsService to use auto_repair_executions
  as the source of truth (with 1-9 samples in 24h it returns None so the W-3b watchdog takes over);
- but the awoooi_flywheel_* metrics have only ever been exposed at
  /api/v1/stats/flywheel/metrics, while the 110 Prom awoooi-api job scrapes /metrics,
  so absent() is always 1 → firing for 26h+ since 2026-05-06T04:14 UTC, i.e. a dead alert

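[Fix] Touch only awoooi-api; leave Codex's design and the 110 Prom config alone:
- Make the main.py /metrics endpoint async and, after generate_latest(), append
  FlywheelStatsService.compute() → to_prometheus_lines().
- The existing awoooi-api scrape job then picks up the flywheel metrics automatically.
- The Codex a2c4b3d4 design is fully preserved: 1-9 samples return None so the W-3b watchdog acts as a second safeguard.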
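
The append step described above can be sketched as a pure function. This is a minimal illustration, not the code in main.py: `append_metric_lines` is a hypothetical helper name, `base` stands for the bytes returned by generate_latest(), and `extra_lines` stands for whatever to_prometheus_lines() yields (assumed here to be a list of text-format lines, or nothing when compute() returns None).

```python
from typing import List, Optional


def append_metric_lines(base: bytes, extra_lines: Optional[List[str]]) -> bytes:
    """Append extra Prometheus text-format lines to an existing exposition payload.

    `base` plays the role of generate_latest() output; `extra_lines` plays the
    role of the flywheel lines. Both names are assumptions for illustration.
    """
    if not extra_lines:
        # Nothing to add (e.g. too few samples): serve the base payload unchanged.
        return base
    # The text format is line oriented, so make sure the base ends with a newline.
    body = base if (not base or base.endswith(b"\n")) else base + b"\n"
    return body + ("\n".join(extra_lines) + "\n").encode("utf-8")


payload = append_metric_lines(
    b"# TYPE up gauge\nup 1\n",
    ["awoooi_flywheel_execution_success_rate 0.9"],
)
print(payload.decode("utf-8"))
```

Keeping the merge separate from the HTTP handler makes the "no flywheel data" case easy to test without a running API.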

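[Left untouched]
- flywheel_stats_service.py is unchanged: the Codex 5/6 LOGBOOK explicitly states
  "the Redis playbook counter is inaccurate → use auto_repair_executions as the sole trusted source".
  Returning None for 1-9 samples is an anti-silencing double safeguard designed to work with
  the ai_slo_watchdog_job W-3b grace+30min, not a bug.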

Verification plan (after deploy):
1. curl /metrics | grep awoooi_flywheel  → flywheel metrics appear
2. Prom query awoooi_flywheel_execution_success_rate  → non-empty
3. ALERTS{alertname="FlywheelExecutionRateMissing"}  → resolved
4. Watch Telegram for 30 minutes: INC-20260507-99ADF2 no longer arrives

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 15:33:04 +08:00
Your Name
4111ea4f9f fix(ai): remove 188 ollama provider
All checks were successful
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-06 14:34:48 +08:00
Your Name
8629ac709b feat(awooop): Phase 1-8 full implementation: AwoooP Agent Platform six-plane architecture
Some checks failed
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 core tables + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: FORCE RLS + GRANT on all tables
- packages/awooop-contracts/: JSON Schema for the six contracts + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models (extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle (draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator (retry×3, E-SCHEMA-001)

## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill, three lines of defense
- src/core/context.py: PROJECT_ID ContextVar (inherited automatically by 31 background loops)
- src/db/base.py + models.py: project_id column + RLS set_config injection
- src/hermes/nl_gateway.py: project_id Redis key prefix (Phase A dual write)
- src/services/anomaly_counter.py: per-project rework (Phase A fallback)

## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs (Router→Service architecture)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop

## Phase 5: MCP Gateway five gates
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 tables + RLS
- src/plugins/mcp/gateway.py: McpGateway (Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: two-layer redaction + 16K truncation
- src/plugins/mcp/registry.py: __provider name mangling (ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref resolution
- tests/test_mcp_credential_isolation.py: 10 regression tests (guarding against secret-leak recurrence)

## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope (ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback (30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay protection + suggest mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:31:53 +08:00
Your Name
e45b055e0e feat(governance): four-track delivery of the AI governance event-handling chain (C/D/B/A)
Some checks failed
Code Review / ai-code-review (push) Successful in 48s
run-migration / migrate (push) Failing after 45s
CD Pipeline / tests (push) Successful in 3m46s
Type Sync Check / check-type-sync (push) Successful in 2m8s
CD Pipeline / build-and-deploy (push) Failing after 31m14s
CD Pipeline / post-deploy-checks (push) Has been skipped
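[Twelve-expert team full-landscape scan + four parallel implementation tracks]

After the commander asked "are the 12 agents actually collaborating?", the full chain was delivered under the team rules:
onboarder + critic + db-expert + debugger + frontend-designer scanned in parallel
and found 6 major gaps, which fullstack-engineer × 4 and refactor-specialist then implemented together.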

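[Track C: consolidating the trust_drift dual write]

Two independent paths wrote event_type=trust_drift without calling each other, so downstream
consumers received duplicate data and could not determine the source of truth. The consolidation
keeps governance_agent.check_trust_drift (more complete: auto-deprecate + Telegram + PG),
demotes TrustDriftDetector to a pure statistics lib, and changes the W-6 watchdog to call
governance_agent. A new TestSinglePgWritePerDriftScenario verifies that one drift scenario
triggers exactly one PG write.

  Changes:
    - apps/api/src/services/trust_drift_detector.py (lib only, no longer writes PG)
    - apps/api/tests/test_trust_drift_watchdog.py (W-6 now mocks governance_agent)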

[Track D: governance_remediation_dispatch dispatch table]

ai_governance_events is immutable Event Sourcing and cannot hold execution state. A new dispatch
table serves as the projection layer: 1 event → 0..N dispatches, with state that is mutable, retryable, and auditable.

  - PgEnum with 5 event_type values + a 7-stage state machine (pending → dispatched → executing →
    succeeded/failed/cancelled/skipped)
  - A failed retry INSERTs a new row (the old row's status is not changed, preserving the audit trail)
  - Partial unique index ux_grd_one_active_per_event enforces "one active dispatch per event"
  - 4 composite indexes support worker polling, dedup queries, and observability dashboards
  - FKs to ai_governance_events / playbooks / incidents / approval_records
    all use SET NULL (avoid cascade lock), except governance_event, which uses RESTRICT

  Changes:
    - apps/api/src/db/models.py (GovernanceRemediationDispatch ORM class)
    - apps/api/migrations/governance_remediation_dispatch_2026-05-03.sql
    - apps/api/src/repositories/governance_remediation_dispatch_repo.py
      (6 async functions + 3 custom exceptions: DispatchAlreadyActive /
       InvalidStatusTransition / DispatchNotFound)
    - apps/api/src/models/governance_dispatch.py (4 schemas, including DecisionContextV1)
    - apps/api/tests/test_governance_remediation_dispatch.py (29 tests)

[Track B: /governance page]

Backend PR1 delivers three endpoints; frontend PR2-5 delivers the complete three tabs.

PR1 backend:
  - GET /api/v1/ai/governance/events (events_tab, with event_type/severity/
    status/time-range filters + pagination)
  - GET /api/v1/ai/governance/queue (queue_tab, with graceful fallback:
    returns table_pending=True instead of raising a 500 when the dispatch table does not exist)
  - GET /api/v1/ai/governance/summary (slo_tab, 30d violation time-series chart)
  - severity mapping rules are hard-coded (critic suggests moving them to settings later)

PR2-5 frontend:
  - /governance route + AppLayout + Compliance Badge banner + PageTabs
  - SLO Tab: 3 KPI cards (Syne 28px + StatusOrb + 7d sparkline) +
    30d violation stacked BarChart
  - Events Tab: filter bar + table + inline expandable rows (JSON / remediation suggestions / dispatch records)
  - Queue Tab: HITL to-do cards + trust progress bar + approve/reject buttons (console.log only in this PR)
  - Sidebar gains an "AI Governance" entry (ShieldCheck icon)
  - Complete bilingual i18n (governance namespace + nav.governance)
  - 7 new components: slo-kpi-card / slo-violation-chart / events-table /
    events-filter-bar / event-detail-drawer / queue-item-card / queue-history-tabs

  Changes:
    - apps/api/src/api/v1/ai_governance.py (router)
    - apps/api/src/services/governance_query_service.py
    - apps/api/src/models/governance.py (Pydantic V2 schemas)
    - apps/api/tests/test_ai_governance_endpoints.py (21 tests)
    - apps/web/src/app/[locale]/governance/ (page + 3 tabs)
    - apps/web/src/components/governance/ (7 components)
    - apps/web/messages/{zh-TW,en}.json (governance namespace)
    - apps/web/src/components/layout/sidebar.tsx (+1 line)
    - apps/api/src/main.py (router include)

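[Track A: GovernanceDispatcher decision fusion]

Connects governance events to the remediation executor via North Star-aligned decision fusion
(LLM × Playbook trust × MCP), in line with the "no hard-coded rules" iron rule.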

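  - Design iron rule: DecisionFusionAdapter is a newly added wrapper that **does not modify any Tier 3 file**
    (decision_manager / learning_service / trust_engine) and only consumes existing APIs
  - Three-dimensional fusion formula: confidence = 0.4×llm + 0.3×playbook_trust + 0.3×mcp_consistency
    (the weights carry a TODO noting that the AI should learn to tune them in the future)
  - Three-branch decision path:
    confidence ≥ 0.85 → auto_dispatch (status=dispatched)
    0.65 ≤ confidence < 0.85 → pending_approval (HITL)
    confidence < 0.65 → skip + log
  - decision_context JSONB records a full snapshot of the three inputs (for future fine-tuning)
  - polls every 30s for unresolved events, modeled on the governance loop
  - duplicate events are deduplicated (via get_active_for_event)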
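
The fusion formula and the three-branch path above can be sketched as follows. This is a minimal illustration using the weights and thresholds stated in the message; the function names `fuse_confidence` and `decide` are hypothetical and are not taken from decision_fusion_adapter.py.

```python
# Weights and thresholds as stated in the commit message.
W_LLM, W_PLAYBOOK, W_MCP = 0.4, 0.3, 0.3
AUTO_THRESHOLD, HITL_THRESHOLD = 0.85, 0.65


def fuse_confidence(llm: float, playbook_trust: float, mcp_consistency: float) -> float:
    """Weighted sum of the three inputs, each expected in [0, 1]."""
    for name, value in (("llm", llm), ("playbook_trust", playbook_trust),
                        ("mcp_consistency", mcp_consistency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return W_LLM * llm + W_PLAYBOOK * playbook_trust + W_MCP * mcp_consistency


def decide(confidence: float) -> str:
    """Map a fused confidence to one of the three branches."""
    if confidence >= AUTO_THRESHOLD:
        return "auto_dispatch"
    if confidence >= HITL_THRESHOLD:
        return "pending_approval"
    return "skip"


for inputs in ((0.95, 0.9, 0.9), (0.8, 0.7, 0.6), (0.3, 0.5, 0.2)):
    score = fuse_confidence(*inputs)
    print(inputs, round(score, 3), decide(score))
```

Because the weights sum to 1, the fused value stays in [0, 1] whenever the inputs do, which is what makes fixed thresholds meaningful.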

  Changes:
    - apps/api/src/services/governance_dispatcher.py
    - apps/api/src/services/decision_fusion_adapter.py
    - apps/api/tests/test_governance_dispatcher.py (14 tests)
    - apps/api/src/main.py (lifespan task wired to run_governance_dispatcher_loop)

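[Verification]

All 1836 unit tests pass (29 skipped because of the pre-existing PG integration env issue)

[Orchestration lessons, recorded in memory]

- vuln-verifier should run **before** fullstack-engineer (to avoid misjudgments from reading already-fixed code in parallel)
- the critic's two-round review cannot be skipped (the second round caught the NaN sentinel + Prom rule chain effect)
- the North Star "no hard-coded rules" principle is genuinely implemented together with decision-fusion

[Tier 3 untouched, verified]

git diff confirms that this commit does not change decision_manager.py / learning_service.py /
trust_engine.py at all; it only adds a wrapper service that consumes existing APIs.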

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:42:40 +08:00
Your Name
6e04fe9c8a feat(playbook): generate drafts with local llm
Some checks failed
CD Pipeline / tests (push) Successful in 1m28s
Code Review / ai-code-review (push) Successful in 29s
Type Sync Check / check-type-sync (push) Failing after 2m41s
CD Pipeline / build-and-deploy (push) Successful in 8m40s
CD Pipeline / post-deploy-checks (push) Successful in 3m10s
2026-04-30 23:04:58 +08:00
Your Name
3668d49f2f feat(flywheel): three W2 items + KMWriter critic fixes (all 1635 tests green)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m38s
W2 (week two of the onboarder's 4-week flywheel 80→90 path) plus all 5 critical/major findings
from the critic PR review are fixed; default flag=false keeps it safe with no blow-up risk.

## Three W2 PRs

### PR-R2: AOL → catalog confidence EWMA write-back (fixes flywheel chain break C2)
- New file `apps/api/src/jobs/aol_to_catalog_writeback_job.py`
- Logic: every hour, scan AOL, compute the EWMA confidence (alpha=0.3), and write it back to alert_rule_catalog
- Failure threshold: N=5 consecutive low success rates → review_status='draft'
- Hermes _fetch_noisy_rules SQL adds OR review_status='draft'
- ENABLE_AOL_WRITEBACK_JOB=false (default)
- 8 tests (mock path fix: lazy import → patch src.db.base.get_db_context)
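
The EWMA write-back and the N=5 rule above can be sketched as below. This is an illustration under stated assumptions, not the job's code: alpha is applied to the newest observation (the usual EWMA convention), and the `low` cut-off of 0.5 for a "low success rate" is a hypothetical value, since the message does not state it.

```python
from typing import Sequence

ALPHA = 0.3          # smoothing factor from the commit message
REVIEW_AFTER_N = 5   # consecutive low observations before review_status='draft'


def ewma_update(previous: float, observed: float, alpha: float = ALPHA) -> float:
    """One EWMA step: the new observation gets weight alpha, history gets 1 - alpha."""
    return alpha * observed + (1.0 - alpha) * previous


def needs_review(success_rates: Sequence[float], low: float = 0.5,
                 n: int = REVIEW_AFTER_N) -> bool:
    """True when the last n observations are all below `low` (assumed cut-off)."""
    if len(success_rates) < n:
        return False
    return all(rate < low for rate in success_rates[-n:])


confidence = 0.8
history = [0.2, 0.1, 0.3, 0.0, 0.2]
for rate in history:
    confidence = ewma_update(confidence, rate)
print(round(confidence, 4), needs_review(history))
```

With alpha=0.3 a single bad hour moves the stored confidence by at most 30% of the gap, so a rule only drifts toward 'draft' territory after repeated poor results.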

### PR-V1: self_healing_validator integration (fixes flywheel chain break C6)
- New file `apps/api/src/services/self_healing_validator.py` (pure function assess_self_healing)
- post_execution_verifier.py step 5 hooks it in (feature flag gate)
- evidence_snapshot.py adds the self_healing_score / self_healing_detail fields
- db/models.py + base.py ALTER IF NOT EXISTS
- score < 0.5 → triggers a rollback-proposal Telegram alert (not executed automatically)
- ENABLE_SELF_HEALING_VALIDATOR=false (default)
- 7 tests

### PR-L1: KM ↔ Playbook bidirectional loop (fixes flywheel chain breaks C3+C4)
- Three new pieces of logic in learning_service.py:
  1. _write_playbook_evolution_km: promote/demote writes a KM evolution entry
  2. _check_and_mark_playbook_review: N=5 accumulated triggers review_required
  3. _demote_alert_rule_catalog_confidence: DEPRECATED → confidence×=0.5
- PlaybookRecord adds a review_required field (schema migration via base.py)
- ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=false (default)
- KM_PLAYBOOK_REVIEW_THRESHOLD=5 is tunable
- 6 tests

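## Fixes for the 5 KMWriter critic Critical/Major findings (found in the earlier critic PR review)
Already fixed in the previously pushed commit c5753e1c; this commit restores the corresponding files from the stash:
- C1 km_writer.py:194 self-defeating backfill (fixed: synchronous await + DLQ)
- C2 km_writer.py:391 KM_WRITE_AWAIT=false path tightened
- M1 decision_manager.py:2178/2203 _fire_and_forget removed
- M2 incident_service.py:1099 custom path gains retry+DLQ
- M3 km_writer.py:166 idempotency claim aligned (UPSERT + partial unique index)

## Verification
- All 1635 unit tests green (+27 from 1608)
- Coexists with fb0c72db (which overturned A2 Ollama primary) without conflict
- All new Jobs/Services default to flag=false (no blow-up)

## Expected impact
Flywheel chain breaks C2 + C3 + C4 + C6 all fixed
Flywheel autonomy score: 65 → 85 estimated (after W2 completes)

Enablement order (once prod fb0c72db verifies that the OLLAMA primary runs):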
1. ENABLE_AOL_WRITEBACK_JOB=true
2. ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=true
3. ENABLE_SELF_HEALING_VALIDATOR=true

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:44:04 +08:00
Your Name
c5753e1c57 fix(critic-review): make KMWriter live up to its name + fix Alertmanager inhibition + move drift checker to AST
The critic PR review revealed 7 blockers in already-pushed commits; this commit fixes all of them.

## C1 + C2 + M1 + M2 + M3: a genuinely unified KMWriter contract (the critic's 5 most severe findings)

### C1 km_writer.py:194: fix for the self-defeating backfill
- bare asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- synchronous await + dedicated DLQ km:backfill:dlq + try/except so the main write is not blocked
- added km_backfill_reconciler_job.py (scans the DLQ every 5 minutes) + ENABLE_KM_BACKFILL_RECONCILER flag
- prevents the race where Path B finishes before Path A and related_approval_id stays NULL forever

### C2 km_writer.py:391: KM_WRITE_AWAIT=false path tightened
- from ensure_future (fire-and-forget, worse than the old synchronous write)
- to await writer.write(retry=1, timeout=2.0) (still awaited, but a single attempt with a short timeout)
- the docstring explicitly notes "for emergency rollback, reliability not guaranteed"

### M1 decision_manager.py:2178/2203: _fire_and_forget bypass removed
- two call sites used _fire_and_forget(executor.write_execution_result_to_km(...))
- now await asyncio.shield(...) + a BaseException guard (prevents an upstream cancel from interrupting)
- with KM_WRITE_AWAIT=true this path finally awaits for real

### M2 incident_service.py:1099: custom path gains retry+DLQ
- previously: if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task
- now: 3 retries with exponential backoff + DLQ protection (calls km_writer private helpers)
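
The retry-then-DLQ pattern described above can be sketched like this. It is a generic illustration, not the km_writer helpers: `write_with_retry`, the in-memory `dlq` list (standing in for a Redis-backed queue), and the delay values are all assumptions.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional


async def write_with_retry(
    write: Callable[[Any], Awaitable[Any]],
    payload: Any,
    dlq: List[dict],
    attempts: int = 3,
    base_delay: float = 0.01,
) -> Optional[Any]:
    """Try `write` up to `attempts` times with exponential backoff.

    If every attempt fails, park the payload in `dlq` for a later reconciler
    instead of raising, so the caller's main flow is not interrupted.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return await write(payload)
        except Exception as exc:  # broad on purpose: any failure is retried
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                await asyncio.sleep(base_delay * (2 ** attempt))
    dlq.append({"payload": payload, "error": repr(last_error)})
    return None


async def _always_fails(_payload: Any) -> None:
    raise RuntimeError("KM store unavailable")


dead_letters: List[dict] = []
asyncio.run(write_with_retry(_always_fails, {"incident_id": "demo"}, dead_letters))
print(dead_letters)
```

The point of the DLQ is that a failed write is recorded somewhere recoverable, unlike a bare create_task whose exception can vanish unnoticed.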

### M3 km_writer.py:166: idempotency claim aligned with the implementation
- knowledge_repository.create() gains an UPSERT path (pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord gain a path_type field
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path

## M4 alertmanager.yml: equal: [] tightened (critic: prevent runaway inhibition)
- OllamaInstanceDown / KMConverterDown inhibitions gain an equal: ['cluster'] constraint
- prevents any single Ollama going down from wrongly inhibiting all AI/SLO alerts in multi-cluster setups

## M5 Alertmanager version check (confirmed v0.31.1, well above v0.22+)

## M6 governance_agent.py: health score distinguishes skipped vs ok vs violated
- check_slo_compliance adds _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: when every SLO is skipped, it emits a separate governance_slo_data_gap alert
  (without polluting the self_failure count, because no_data means the emitter is not implemented, not that the governance mechanism failed)

## M7 scripts/check_config_drift.py: switched to AST parsing
- regex replaced by ast.parse locating Settings ClassDef AnnAssign Field(default=...)
- avoids false negatives on multi-line lists / default_factory= / strings containing line breaks
- 4 fields (AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL) all aligned
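
The AST approach described above can be sketched with the standard `ast` module. This is an illustration, not scripts/check_config_drift.py: `extract_field_defaults` is a hypothetical name, and the sample `Settings` source and its values are invented for the demo.

```python
import ast
from typing import Any, Dict

SAMPLE_SOURCE = '''
class Settings:
    OLLAMA_URL: str = Field(default="http://ollama:11434")
    AI_FALLBACK_ORDER: list = Field(
        default=["ollama", "gemini"],
        description="multi-line call that a line-based regex can miss",
    )
    CACHE: dict = Field(default_factory=dict)
'''


def extract_field_defaults(source: str, class_name: str = "Settings") -> Dict[str, Any]:
    """Collect literal `Field(default=...)` values from annotated class attributes."""
    defaults: Dict[str, Any] = {}
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.ClassDef) and node.name == class_name):
            continue
        for stmt in node.body:
            if not (isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name)):
                continue
            call = stmt.value
            if not (isinstance(call, ast.Call)
                    and isinstance(call.func, ast.Name)
                    and call.func.id == "Field"):
                continue
            for keyword in call.keywords:
                if keyword.arg != "default":
                    continue  # e.g. default_factory= is skipped on purpose
                try:
                    defaults[stmt.target.id] = ast.literal_eval(keyword.value)
                except ValueError:
                    pass  # non-literal default: cannot be compared statically
    return defaults


print(extract_field_defaults(SAMPLE_SOURCE))
```

Since the source is only parsed and never executed, the checker needs neither pydantic nor the application's environment variables.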

## New tests
- test_km_writer_backfill_reconciler.py: 7 cases (C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases (M3 path_type injection + UPSERT branch)

## Verification
- All 1585 unit tests green (+13 from 1572)
- amtool check-config SUCCESS (8 inhibit_rules / 2 receivers)
- AST-based drift checker: all 4 fields aligned
- Alertmanager v0.31.1 confirmed to support the new syntax

## Expected impact
- KMWriter lives up to its name: the flywheel closed-loop KM write path is 100% reliable
- M4 runaway-inhibition risk removed
- The governance layer no longer stays silent on SLO no_data
- Drift checker false-negative risk removed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
Your Name
025a493f06 feat(p3.2+adr-100): Model Version Tracker + SLO self-governance + KB rot cleaner
Some checks failed
run-migration / migrate (push) Failing after 12s
CD Pipeline / build-and-deploy (push) Has been cancelled
Wave 8 P3.2 model version tracking + ADR-100 SLO self-governance + supporting changes:

P3.2 (Model Version Tracking):
- model_version_probe.py (268 lines): probes the model version of providers such as Ollama / OpenRouter
- model_version_tracker.py (101 lines): aligned with the PG provider_version_history table
- migrations/p3_2_provider_version_history.sql + rollback: 25-line schema
- db/models.py +32 lines: ProviderVersionHistory ORM

ADR-100 (AI autonomy SLO):
- docs/adr/ADR-100-ai-autonomous-slo.md (167 lines): flywheel SLO design and thresholds
- ops/monitoring/slo-rules.yml (254 lines): Prometheus SLO recording rules + alerts
- ops/monitoring/tests/test_slo_rules.yaml (242 lines): promtool unit tests

Integration changes:
- main.py +72 lines: lifespan starts model_version_probe + the KB rot cleaner schedule
- gitea_webhook.py +45 lines: webhook receives model version change notifications
- ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py: wiring to match

New tests:
- test_kb_rot_cleaner_schedule.py (120 lines): 9 tests pass
- test_slo_rules.yaml: promtool acceptance

Tests: 9 passed (test_kb_rot_cleaner_schedule)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>
2026-04-27 14:54:19 +08:00
Your Name
cc547736ab feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER fixes
Picks up the code that multiple Wave 6/7/8 engineers finished before hitting the agent quota, and commits it to resolve
a latent import error on production HEAD (decision_fusion was already referenced by decision_manager, but the file was untracked).

Added (backend core):
- decision_fusion.py (562 lines): P2.1 method III (three-LLM fusion of OpenClaw + Hermes + Elephant)
- aiops_timeline.py + aiops_timeline_service.py: critic B4 fix;
  /api/v1/aiops/timeline endpoint, with DB access moved into the service layer to follow leWOOOgo modularization
- migrations/p2_decision_fusion_columns.sql + rollback: approval_records fusion columns

Modified (backend integration):
- decision_manager.py: patches three fusion chain breaks (critic B1+B2+B3):
  · B1: writes _evidence_snapshot_ref to token.proposal_data
  · B2: computes complexity_score before fusion and writes it to the token
  · B3: writes the fusion composite to token.proposal_data["decision_fusion"]
- auto_approve.py: fusion + consensus awareness (critic B3+B5):
  · composite > 0.7 → auto_execute_eligible bypasses min_confidence
  · source=consensus_engine + score>=0.6 → trusted-rule path
- consensus_engine.py: db-fix, _save_consensus reuses agent_sessions
- governance_agent.py: db-fix, _alert writes to PG ai_governance_events
- approval_db.py: 3 fusion columns + 2 partial indexes + CheckConstraint
- db/models.py: schema aligned with the migration
- core/config.py: vuln #1 fix: the OLLAMA_URL/_FALLBACK_URL field_validator
  rejects public IPs + external domains and allows only a private-network/loopback/K8s SVC allowlist
- core/feature_flags.py: P2 fusion + consensus flags
- main.py: governance_agent lifespan startup
- failover_alerter.py: Wave8-X2: in-memory dedup fallback (no fail-open after Redis refuses)
- ollama_*.py: metrics integration + recovery improvements
- auto_repair_service.py: verifier wiring
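
The vuln #1 rule in the list above (reject public IPs and external domains, allow private-network, loopback, and K8s service hosts) can be sketched with the standard library. This is not the field_validator in core/config.py: `validate_internal_url` and the `.svc` / `.svc.cluster.local` suffix allowlist are assumptions for illustration.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of in-cluster DNS suffixes; the real allowlist is not shown in the commit.
ALLOWED_HOST_SUFFIXES = (".svc", ".svc.cluster.local")


def validate_internal_url(url: str) -> str:
    """Return `url` unchanged if its host is internal, otherwise raise ValueError."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treat it as a DNS name and check the allowlist.
        if host == "localhost" or host.endswith(ALLOWED_HOST_SUFFIXES):
            return url
        raise ValueError(f"external domain not allowed: {host}") from None
    if address.is_private or address.is_loopback:
        return url
    raise ValueError(f"public IP not allowed: {host}")


for candidate in ("http://192.168.0.188:11434", "http://8.8.8.8:11434", "https://example.com"):
    try:
        print("ok      ", validate_internal_url(candidate))
    except ValueError as exc:
        print("rejected", exc)
```

Validating at settings-load time means a misconfigured URL fails at startup rather than silently sending prompts to an outside host.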

Added (tests, 2438 lines):
- test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py
- test_p2_db_fixes.py / test_wave8_fusion_fixes.py
- test_config_url_validation.py (vuln #1, 12 tests)
- test_failover_alerter.py + Wave8-X2 in-memory dedup tests added

Acceptance: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus +
                            governance + p2_db_fixes + failover_alerter)

Conflict resolution:
- 3 files (config.py + auto_approve.py + decision_manager.py) hit git stash pop conflicts;
  kept the stashed side (the engineers' final version) and restored the 「公網 IP」 ("public IP") wording in the ValueError to match the test

Note: this commit resolves the latent import error on production HEAD
Still unfixed: vuln #4 prompt injection / debugger B14 quota fail-closed
               / B25-B26 drain_pending_tasks / B8 governance fail alert

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com>
2026-04-27 08:11:40 +08:00
Your Name
2c57b71db9 feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m45s
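MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (multiple engineers finished the code before hitting the quota; committed after the fact):

P2.2 (GovernanceAgent, 4 self-checks):
- governance_agent.py (342 lines): hourly self-check loop:
  · trust_drift (trust drift detection)
  · knowledge_degradation (knowledge degradation detection)
  · llm_hallucination (LLM hallucination detection)
  · execution_blast_radius (execution blast radius detection)
- main.py lifespan: started via asyncio.create_task(run_governance_loop()),
  wrapped in try/except so a scheduling failure does not block the main flow
- failover_alerter.py: alert_governance(event_type, payload) with 1h dedup;
  the four event types → Telegram MarkdownV2 alerts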

P2.3 (Ollama health rules + Prometheus metrics):
- ops/monitoring/ollama_health_rules.yaml (148 lines):
  · OllamaHealthDegraded / OllamaPrimaryDown
  · OllamaFailoverTriggered / GeminiQuotaExceeded
  · adds the alert rules for the data Prometheus scrapes
- core/metrics.py (57 lines):
  · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge
  · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter
  · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge
- ollama_failover_manager.py:
  · _check_gemini_quota: updates the Gauge on every check (so Prometheus reads the latest value)
  · select_provider: on failover, increments the Counter + flips the Primary Gauge
  · wrapped in try/except so a metric failure does not block main routing

E2E tests:
- test_failover_e2e_dispatch.py (365 lines)
  full dispatch path: health check → failover decide → alerter → metrics

Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (previous session, Wave 5) <noreply@anthropic.com>
2026-04-26 20:56:19 +08:00
Your Name
32affaffeb fix(critic-hotfix): 4 fixes for critic BLOCKER + HIGH findings (CD blocked + flywheel idling)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
After a full review of 6 commits, the critic found:

CD blocker fix:
- test_ai_router_failover_integration.py: 3 tests now use patch.object to mock
  _select_provider_and_model directly and force the initial provider to OLLAMA. The original
  IntentType.UNKNOWN mock was still reclassified inside the router as DIAGNOSE → openclaw_nemo, so failover never triggered.
  → 5/5 PASSED

BLOCKER B1: Gitea Telegram notifications could never be sent:
- apps/api/src/api/v1/gitea_webhook.py:399
  redis = await get_redis()  →  redis = get_redis()
  The original await raised a TypeError that the outer except swallowed → Task C PR merged + workflow_run
  failure notifications all silently failed (the green CI was an illusion: the test only checked HTTP 202, not actual delivery)

BLOCKER B2: the P1.3+P1.4 learning-chain loop was idling (same bug in two places):
- apps/api/src/api/v1/webhooks.py:261
- apps/api/src/services/approval_execution.py:771 (pre-existing)
  EvidenceSnapshot.get_latest_snapshot(...) is a module-level async function,
  not a classmethod → the AttributeError was swallowed by except and logged as a warning
  → the flywheel loop looked connected but actually ran empty (the feature flag defaulting to off kept it from blowing up for now)

HIGH H3: main.py lifespan ordering race:
- apps/api/src/main.py: configure_alerter() moved before _recovery_svc.start()
  Original order: start() triggers an immediate check → may call alert_recovery, but the alerter
  has not yet had Redis injected → dedup fails open, risking duplicate alerts.

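HIGH H1: Gemini quota dedup swallowed alerts across days:
- apps/api/src/services/failover_alerter.py:89
  the dedup key gains a :{YYYY-MM-DD} suffix, giving each day its own dedup window
  Previously, an alert fired at 22:00 yesterday meant one firing at 21:30 today was swallowed because the dedup had not yet expired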
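
The H1 scenario above can be reproduced with a small in-memory stand-in for a Redis key with a 24h TTL. This is an illustration only: the key prefix `alert:gemini_quota`, the class `InMemoryDedup`, and the dates are hypothetical, not taken from failover_alerter.py.

```python
from datetime import datetime, timedelta
from typing import Dict

DEDUP_TTL = timedelta(hours=24)


class InMemoryDedup:
    """Stand-in for Redis SET NX EX: maps a key to its expiry time."""

    def __init__(self) -> None:
        self._expiry: Dict[str, datetime] = {}

    def acquire(self, key: str, now: datetime) -> bool:
        """Return True if the alert may be sent (key absent or expired)."""
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + DEDUP_TTL
        return True


def quota_key(now: datetime, per_day: bool) -> str:
    base = "alert:gemini_quota"  # hypothetical key prefix
    return f"{base}:{now.date().isoformat()}" if per_day else base


yesterday_22 = datetime(2026, 4, 25, 22, 0)
today_2130 = datetime(2026, 4, 26, 21, 30)

for per_day in (False, True):
    dedup = InMemoryDedup()
    first = dedup.acquire(quota_key(yesterday_22, per_day), yesterday_22)
    second = dedup.acquire(quota_key(today_2130, per_day), today_2130)
    print(f"per_day={per_day}: yesterday sent={first}, today sent={second}")
```

With a single shared key the second alert lands 23.5 hours into the 24-hour window and is dropped; the date suffix gives today a fresh key regardless of when yesterday's alert fired.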

Tests: 14 passed (failover_alerter + ai_router_failover_integration + lifespan_wiring)

Deferred follow-ups:
- H2: rename the proactive_inspector memory metric + clean up baselines
- H4: probe_success NaN fallback
- M1-M4 / S1-S2: see the critic report

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:39:53 +08:00
Your Name
dcf2750b2b feat(p1.5): FailoverAlerter integration points 3+4 + 6 test cases completed
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m32s
P1.5 wrap-up (specified in the status document, lines 96-99):

Integration point 3: failover_manager triggers the Gemini quota alert:
- ollama_failover_manager.py: when _check_gemini_quota returns False, it calls
  alerter.alert_gemini_quota_exceeded({quota, current_count})
- reads ollama:gemini_daily_count:{date} from Redis to get current_count (fail-soft)
- 24h dedup inside the alerter (QUOTA_DEDUP_TTL_SEC=86400), sent only once per day
- wrapped in try/except: alert failures fail open and do not block routing

Integration point 4: main.py lifespan injects the Redis client:
- after _recovery_svc.start() and before yield
- calls configure_alerter(get_redis()) to replace the singleton and inject dedup capability
- wrapped in try/except: injection failures fail open (the alerter still works but dedup is disabled)

New tests (174 lines, 6/6 pass):
- test_alert_failover_dedup: a second alert for the same to_provider is suppressed by the 10min dedup
- test_alert_recovery_send: normal send + Markdown message + N consecutive HEALTHY
- test_no_telegram_chat_id_noop: fail-soft without raising when chat_id is missing
- test_quota_alert_dedup_24h: TTL=86400s, message includes quota+count
- test_configure_alerter_replaces_singleton: redis is available after lifespan injection
- test_dedup_fail_open_when_no_redis: Redis None → sending allowed

Mock note: _send() imports telegram_gateway/get_settings inline, so
the mock target must be src.services.telegram_gateway / src.core.config
rather than the alerter module itself.

Regression: the original 37 ollama_failover_manager + 3 lifespan_wiring tests are all green.

Flywheel autonomy score: ~75 → estimated ~80 (quota exhaustion now alerts, ops visibility +5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:28:29 +08:00
Your Name
55c6b4e2d9 feat(p1): Ollama multi-layer disaster-recovery system: P1.1 health checks + P1.2 ai_router integration + P1.5 failover alerts
The Ollama failover subsystem of the ADR-092 P1 flywheel loop, all added by Engineer-A2/C/C2.

New services (1581 lines):
- ollama_health_monitor.py (356): 3-layer health check (TCP/HTTP/inference)
- ollama_failover_manager.py (571): automatic 111→188 switchover + Redis persistence + recovery callback
- ollama_auto_recovery.py (436): 30s background monitoring + 3 consecutive HEALTHY → switch back + clear_cache
- failover_alerter.py (218): P1.5 Telegram failover alerts

Service integration:
- ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain
- main.py lifespan: wires the callback + starts recovery at startup, stops gracefully at shutdown
- config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA (billing circuit breaker)

K8s config:
- 04-configmap.yaml.patch-188-fallback: injects OLLAMA_FALLBACK_URL=http://192.168.0.188:11434

Tests (2082 lines):
- test_ollama_health_monitor.py (402)
- test_ollama_failover_manager.py (707)
- test_ollama_auto_recovery.py (580)
- test_ai_router_failover_integration.py (257)
- test_lifespan_failover_wiring.py (136)

Dependency chain: the three services + ai_router + main.py must be committed together; missing any one causes an ImportError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:18:33 +08:00
Your Name
294e0e3387 feat(ws3): ADR-093 Callback User-ID Binding + ADR-094 Webhook entry point
## T3.1/T3.2 Bound User Check (security_interceptor.py)
- verify_callback() Step 0: checks Redis cb_bind:{nonce}
  → if a binding exists and caller != bound_user_id → UserNotWhitelistedError
  → if the key does not exist (old format) → falls back to the whitelist (backward compatible)
  → if Redis is unavailable → degrades and continues (safe degradation)
- bind_callback_user(nonce, user_id): async method, TTL=48h
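
The Step 0 check above can be sketched as follows. This is a simplified, synchronous illustration and not security_interceptor.py: `check_bound_user`, the dict-like `store`, and the assumption that every non-rejected caller then goes through the whitelist check are all hypothetical; only the `cb_bind:{nonce}` key and the error name come from the message.

```python
from typing import Iterable, Optional


class UserNotWhitelistedError(Exception):
    """Raised when the caller may not act on this callback."""


def check_bound_user(store, nonce: str, caller_id: int, whitelist: Iterable[int]) -> None:
    """Reject a caller that does not match the user bound to `nonce`.

    `store` is any object with a dict-like get(); a real deployment would use
    an async Redis client instead (assumption for this sketch).
    """
    bound: Optional[str]
    try:
        bound = store.get(f"cb_bind:{nonce}")
    except ConnectionError:
        bound = None  # store unavailable: degrade to the whitelist-only check
    if bound is not None and str(bound) != str(caller_id):
        raise UserNotWhitelistedError(f"callback {nonce} is bound to another user")
    # No binding (old format), store down, or matching binding: whitelist still applies.
    if caller_id not in set(whitelist):
        raise UserNotWhitelistedError(f"user {caller_id} is not whitelisted")


bindings = {"cb_bind:abc123": "1001"}
check_bound_user(bindings, "abc123", 1001, whitelist=[1001, 1002])  # passes
try:
    check_bound_user(bindings, "abc123", 1002, whitelist=[1001, 1002])
except UserNotWhitelistedError as exc:
    print("rejected:", exc)
```

The binding narrows who may press a given button even among whitelisted users, which is the gap a whitelist alone leaves open.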

## T3.3 Telegram Webhook entry point (ADR-094)
- apps/api/src/api/v1/telegram_webhook.py (new)
  POST /api/v1/telegram/webhook
  - X-Telegram-Bot-Api-Secret-Token header validation
  - TELEGRAM_WEBHOOK_SECRET="" → skipped in dev (does not break existing tests)
  - placeholder reserved for WS4 Hermes NL integration

## T3.4 config.py
- adds the TELEGRAM_WEBHOOK_SECRET field (defaults to an empty string)

## main.py
- mounts telegram_webhook_v1.router at /api/v1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:10:06 +08:00
Your Name
156a52f807 fix(aiops): ADR-092 three fixes: Playbook enum crash + permanent Telegram silence + adoption failure + AI self-diagnosis
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m33s
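B1 playbook_service.py: the evolver's setattr passed a str instead of a PlaybookStatus enum
  → _pg_upsert blew up on playbook.status.value (163 times/48h); fix: update_with_validation forces enum conversion

B2 approval_db.py + webhooks.py: find_by_fingerprint wrongly converged on PENDING
  → PENDING ≠ Telegram sent; fix: after a successful push, mark tg_sent:{fingerprint} in Redis (24h TTL)
  → outside the debounce window, find_by_fingerprint converges on PENDING only after Redis confirms

drift_adopt_service.py: telegram_gateway called adopt_drift(report_id), but the method did not exist
  → added an adopt_drift() wrapper that loads the DriftReport from the DB and delegates to adopt(), fixing the adoption failure

B3 ai_slo_watchdog_job.py + main.py: the AI could not perceive its own failures (MASTER §1.1 blind spot)
  → added a self-diagnosis every 15 minutes: W-1 SLO violations, W-2 TG silence detection, W-3 flywheel success rate
  → any anomaly → TYPE-8M send_meta_alert; Redis dedup 1h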

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:00:06 +08:00
Your Name
e1539a813e feat(config+main): aider-watch v2 settings + router + lifespan register
- Add 4 settings to config.py: AIDER_WEBHOOK_SECRET, AIDER_EVENTS_STREAM_KEY, AIDER_PATTERN_EXTRACT_INTERVAL_HOURS, USE_AIDER_FEEDBACK (ADR-091)
- Import aider_events_v1 router in main.py imports (alphabetical after ai_slo_v1)
- Register aider_events_v1.router in include_router block (after alert_operation_logs_v1)
- Register run_aider_event_processor_loop() in lifespan (after compliance_scanner_loop)
- All 65 tests pass (24 action_parsing + 41 aider-watch tests)

Co-Authored-By: Claude Haiku 4.5 (1M context) <noreply@anthropic.com>
2026-04-20 19:40:02 +08:00
OG T
0004554bc6 feat(api): AIOps KPI Dashboard: full view of AI autonomy maturity (modular refactor)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m47s
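GET /api/v1/aiops/kpi → consolidates all MASTER §7.1 KPIs in one call.

Aligned with the leWOOOgo modularization iron rule:
  - Router (api/v1/aiops_kpi.py) does HTTP routing only and never touches the DB
  - Service (services/aiops_kpi_service.py) owns all SQL + computation
  - The previous commit was blocked by a hook (the Router imported get_db_context directly); fixed here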

services/aiops_kpi_service.py (~230 lines):
  AiopsKpiService.get_snapshot() returns 6 sections:

  1. asset_inventory: by_type + total + last_scan (run_id/ended_at/total/new/modified)
  2. coverage_kpi: 7 dimensions × (green/yellow/red/unknown)
     + green_ratio_per_dim + overall_green_ratio (MASTER §7.1 #5 SLO)
  3. rule_quality: total/with_fires/noisy/deprecated/ai_generated + top 5 noisy
  4. capacity_health: latest snapshot per host + by_verdict + violations_7d
  5. automation_flow_24h: aol detail + by_actor + by_operation_type
  6. ai_autonomy_score: total score 0-100
     5 sub-scores × 20: asset_coverage / rule_quality / capacity_health /
                        automation_flow / ai_diversity
     grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50)
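
The scoring in section 6 above can be sketched as below. This is an illustration, not aiops_kpi_service.py: `autonomy_score` is a hypothetical name, and treating the grade bands as half-open ranges (70 up to but excluding 90, and so on) is an assumption, since the message leaves the boundaries ambiguous.

```python
from typing import Dict, Tuple

SUB_SCORE_MAX = 20.0
SUB_SCORE_KEYS = (
    "asset_coverage", "rule_quality", "capacity_health",
    "automation_flow", "ai_diversity",
)


def autonomy_score(sub_scores: Dict[str, float]) -> Tuple[float, str]:
    """Sum five sub-scores (each clamped to 0..20) and map the total to a grade."""
    total = sum(
        min(SUB_SCORE_MAX, max(0.0, float(sub_scores.get(key, 0.0))))
        for key in SUB_SCORE_KEYS
    )
    if total >= 90:
        grade = "mature"
    elif total >= 70:
        grade = "in_progress"
    elif total >= 50:
        grade = "starter"
    else:
        grade = "initial"
    return total, grade


print(autonomy_score({
    "asset_coverage": 18, "rule_quality": 12, "capacity_health": 16,
    "automation_flow": 14, "ai_diversity": 11,
}))
```

Clamping each sub-score keeps one inflated dimension from masking weak ones in the total.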

api/v1/aiops_kpi.py (~35 lines, slim router):
  only router = APIRouter() + @router.get delegating to the service

main.py:
  include_router(aiops_kpi_v1.router, prefix='/api/v1', tags=['AIOps KPI'])

Commander usage:
  curl http://192.168.0.121:32334/api/v1/aiops/kpi | jq .
  shows the full picture of AI autonomy maturity at a glance

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:21:46 +08:00
OG T
a391dfc389 feat(aiops): capacity_forecaster — Phase 4 Holt-Winters MVP (predict_linear)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
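One of the 4 next-phase candidates approved by the commander is done: AI capacity forecasting.

Added capacity_forecaster_job.py (~220 lines):
  Runs the forecast daily at 05:00 Taipei (02:00 scanner → 03:00 compliance →
  04:00 Hermes → 05:00 forecaster form a complete daily chain).

Forecasting methodology (MVP):
  Prometheus predict_linear(metric[7d], 86400*7): linear extrapolation from the past 7d
  3 forecast queries: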
    1. disk_saturation_7d: predict_linear(node_filesystem_avail_bytes[7d], 7d) < 0
    2. mem_saturation_7d: predict_linear(MemAvailable[7d], 7d) / MemTotal < 10%
    3. cpu_high_7d_trend: avg_over_time(cpu_used_pct[7d]) > 70%
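
A rough Python sketch of the linear extrapolation behind the disk query above. It is not the job's code, and it simplifies Prometheus: the fitted line is evaluated relative to the last sample rather than the query evaluation time. The function names are hypothetical.

```python
def predict_linear(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Fit a least-squares line through (timestamp_s, value) samples and
    evaluate it horizon_s seconds after the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def disk_will_fill(samples: list[tuple[float, float]],
                   horizon_s: float = 7 * 86400) -> bool:
    """True when the extrapolated available bytes drop below zero."""
    return predict_linear(samples, horizon_s) < 0
```

Because the fit is a straight line, it cannot capture weekly seasonality, which is the gap the Holt-Winters TODO in this commit refers to.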

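High-risk host found → write aol(capacity_recommendation) + push to Telegram
  - input: host + horizon + findings count
  - output: findings list + proposed_actions + requires_human_decision=true

proposed_actions are derived from the findings:
  - disk: clean up logs/docker/PG WAL, or expand capacity
  - mem: inspect top consumers / tune the JVM
  - cpu: scale out / add vCPUs

Aligned with the Commander's hard rules:
   Recommendations only; never scale up automatically
   The 7d window provides enough samples
   AI forecast + human decision

Future TODO:
  - True Holt-Winters (with seasonality): requires Python statsmodels
  - Business-cycle adjustment (Monday peak / weekend trough)

Wired into the main.py lifespan via asyncio.create_task()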

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:00:36 +08:00
OG T
6ab0ce9c75 feat(aiops): Hermes rule quality advisor: E3 AI rule-quality recommendations (conservative version)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m22s
Running rule_stats on real data revealed 2 rules with a 100% noise_rate:
  - PostgreSQLDiskGrowthRate (tp=0 fp=2)
  - NoAlertsReceived2Hours   (tp=0 fp=1)
plus MoWoooWorkDown (33%) and KubePodCrashLooping (25%)

Adds hermes_rule_quality_job.py (~210 lines):
  Analyzes alert_rule_catalog daily at 04:00 Taipei:
    - threshold: noise_rate >= 0.7 AND sample count >= 5
    - writes aol('rule_rejected', proposed_action='review_or_deprecate') for each matching rule
    - pushes a Telegram summary to the SRE group
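
A minimal sketch of the threshold check, assuming "sample count" means tp + fp (the commit does not define it). RuleStats and needs_review are hypothetical names, not the job's API.

```python
from dataclasses import dataclass

NOISE_THRESHOLD = 0.7  # noise_rate >= 0.7
MIN_SAMPLES = 5        # assumed to mean tp + fp >= 5


@dataclass
class RuleStats:
    rule_name: str
    true_positive_count: int
    false_positive_count: int


def needs_review(rule: RuleStats) -> bool:
    """Flag a rule for human review only when it is both noisy and
    backed by enough samples."""
    samples = rule.true_positive_count + rule.false_positive_count
    if samples < MIN_SAMPLES:
        return False
    return rule.false_positive_count / samples >= NOISE_THRESHOLD
```

Under this reading, a rule with tp=0 fp=2 has a 100% noise_rate but is not flagged, because two samples fall below the minimum.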

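Aligned with the Commander's hard rules:
   Never changes review_status automatically (deprecation is a human decision; the AI only recommends)
   The threshold "triggers a discussion" rather than making the "final decision"
   aol(rule_rejected) leaves a trail that can later be upgraded to LLM-based argumentation

Unlocks the E3 Hermes foundation: LLM analysis of false-alarm root causes (expr missing a for: window,
overly broad label matching, an inherently noisy metric, etc.) can be added later to produce concrete improvement suggestions.

Wired into the main.py lifespan via asyncio.create_task()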

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:11:26 +08:00
OG T
92349bc37c feat(aiops): asset_change_tracker: all 8 zero-writer tables now have writers
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Review blind spot 10: asset_change_event still has 0 rows (the last zero-writer table)

Adds asset_change_tracker_job.py (~180 lines):
  Every 1h, compares the two most recent asset_discovery_run rows and writes asset_change_event:
     asset_added: present in the newer run but not the older one (SQL EXCEPT)
     asset_removed: present in the older run but not the newer one
     lifecycle_changed: asset_inventory.lifecycle_state='deprecated' and updated_at within the last 2h
  Uses the SQL EXCEPT set operation to avoid N+1 queries; a single INSERT writes the whole diff
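
The added/removed diff is a pair of set differences. The sketch below mirrors that idea in memory with Python sets. The job itself does this in SQL; diff_runs and the asset keys are illustrative assumptions.

```python
def diff_runs(older: set[str], newer: set[str]) -> list[tuple[str, str]]:
    """Return (event_type, asset_key) pairs for two discovery runs.

    newer - older corresponds to `newer EXCEPT older` (asset_added);
    older - newer corresponds to `older EXCEPT newer` (asset_removed).
    """
    added = [("asset_added", key) for key in sorted(newer - older)]
    removed = [("asset_removed", key) for key in sorted(older - newer)]
    return added + removed


older_run = {"pod/default/api", "pod/default/worker"}
newer_run = {"pod/default/api", "pod/default/scheduler"}
events = diff_runs(older_run, newer_run)
```

Both differences come from one pass over each run's key set, which is what lets the SQL version avoid a per-asset lookup.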

All 8 ADR-090 zero-writer tables now have a writer:
   asset_inventory / asset_discovery_run / asset_coverage_snapshot
     / asset_relationship / asset_change_event / asset_compliance_snapshot (asset_*)
   alert_rule_catalog
   host_capacity_snapshot / capacity_violation_event (capacity_*)

Phase 7 asset inventory + coverage matrix + change tracking is fully implemented.
Next up: the Hermes AI agent can analyze quality (deprecate noisy rules, recommend coverage fixes).

Wired into the main.py lifespan via asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:18:34 +08:00
OG T
df71c9a37b feat(aiops): rule_stats_updater: compute noise_rate + true/false positives
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m26s
Review blind spot 5: alert_rule_catalog has 68 rows, but noise_rate/TP/FP/last_fired_at are all NULL

Adds rule_stats_updater_job.py (~170 lines):
  Every 1h, UPDATEs the whole alert_rule_catalog table, deriving values from incidents + approval_records:
    - last_fired_at = max(incidents.created_at WHERE alertname=rule_name)
    - true_positive_count = count of incidents with status='RESOLVED' in the past 30d
    - false_positive_count = count of approval_records with status='EXPIRED' in the past 30d
      (EXPIRED = unhandled for 48h, treated as a proxy for a false alarm)
    - noise_rate = fp / (tp + fp)
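
An in-memory illustration of the derivation above. The real job uses a single SQL UPDATE. The dict shapes are assumptions, in particular that approval records carry alertname and created_at fields, and returning None when there are no samples is also an assumed choice.

```python
from datetime import datetime, timedelta, timezone


def rule_stats(rule_name, incidents, approvals, now, window_days=30):
    """Derive per-rule statistics from incident and approval records."""
    cutoff = now - timedelta(days=window_days)
    # last_fired_at is taken over all incidents, with no window applied.
    fired = [i["created_at"] for i in incidents if i["alertname"] == rule_name]
    tp = sum(
        1 for i in incidents
        if i["alertname"] == rule_name
        and i["status"] == "RESOLVED"
        and i["created_at"] >= cutoff
    )
    fp = sum(
        1 for a in approvals
        if a["alertname"] == rule_name
        and a["status"] == "EXPIRED"
        and a["created_at"] >= cutoff
    )
    total = tp + fp
    return {
        "last_fired_at": max(fired) if fired else None,
        "true_positive_count": tp,
        "false_positive_count": fp,
        # Guard the division: a rule with no samples has no noise_rate.
        "noise_rate": fp / total if total else None,
    }
```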

Window: 30 days (configurable)
Uses a single UPDATE with subqueries to avoid N+1 (68 rules × 3 queries = 204 queries → 1 query)

Unlocks E3 Hermes:
  The Hermes AI agent can later read alert_rule_catalog WHERE noise_rate > 0.5
  and propose review_status='deprecated' or superseded_by_rule_id

Wired into the main.py lifespan via asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:05:30 +08:00
OG T
505232336b feat(aiops): coverage_evaluator: upgrade coverage_snapshot from unknown to a real status
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Review blind spot 4: all 546 asset_coverage_snapshot rows are 'unknown', which carries no real meaning

Adds coverage_evaluator_job.py (~270 lines):
  Every 1h, upgrades 3 dimensions of the coverage_snapshot rows for the latest asset_discovery_run:
     auto_monitoring: checks Prometheus /api/v1/targets for the host asset IP
       → green (target exists) / red (no target)
     auto_alerting: whether alert_rule_catalog.labels match the asset
       → three match strategies (host/namespace/layer), green/red
     auto_km_creation: knowledge_entries.body ILIKE asset.name
       → green (KM exists) / yellow (no KM)
  The evidence JSONB column records the basis for each upgrade so the AI can audit it later

Not implemented (left as unknown):
     auto_rule_matching (needs alert-history statistics)
     auto_playbook / auto_remediation / auto_rule_creation (need the playbook table)

Expected effect (after the next evaluator run + coverage_snapshot UPDATE):
  - The 546 coverage rows go from 100% unknown → 30-50% green/red/yellow
  - The "coverage SLO" KPI (MASTER §7.1) can actually be computed
  - The AI can spot red assets in coverage_snapshot and proactively push remediation

Wired into the main.py lifespan via asyncio.create_task()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:02:30 +08:00
OG T
4259a104f5 feat(aiops): capacity_scanner + compliance_scanner (ADR-090 Phase 7, remaining 2)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Completes the 3rd and 4th ADR-090 Phase 7 services, unlocking 2 zero-writer tables:

B3. apps/api/src/jobs/capacity_scanner_job.py (~300 lines)
  - Pulls Prometheus node_exporter metrics daily at 02:00 Taipei
  - Writes host_capacity_snapshot (load1/5/15, cpu, iowait, mem, swap)
  - Heuristic ai_verdict: cpu>80 or mem>85 → critical; >60/70 → warning
  - Above the hard thresholds → writes capacity_violation_event
  - Writes aol(capacity_recommendation)
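
The verdict heuristic can be sketched as below, reading ">60/70" as cpu > 60 or mem > 70 (an interpretation, not confirmed by the commit). The "ok" label for the remaining case is also an assumption.

```python
def ai_verdict(cpu_pct: float, mem_pct: float) -> str:
    """Classify a host snapshot using the thresholds from the commit."""
    if cpu_pct > 80 or mem_pct > 85:
        return "critical"
    # Assumed reading of ">60/70": cpu above 60 or mem above 70.
    if cpu_pct > 60 or mem_pct > 70:
        return "warning"
    return "ok"  # assumed label; the commit names only critical and warning
```

Checking the critical band first keeps the two rules from overlapping: a host at cpu=85 is reported once, as critical.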

B4. apps/api/src/jobs/compliance_scanner_job.py (~270 lines)
  - Iterates over active assets in asset_inventory daily at 03:00 Taipei
  - Writes a 7-dimension compliance snapshot for each asset
  - secret_rotated: real check (metadata.creationTimestamp older than 90d = warning)
  - The other 6 dimensions (ssl_cert_valid / cve_scan / backup_tested /
    audit_log_enabled / access_reviewed / encryption_at_rest) are placeholders set to 'unknown'
    with a TODO in detail; a later agent will fill in the logic
  - Writes an aol(coverage_recalculated) summary

main.py lifespan wires in the 2 new loops

Expected unlocks (together with B1 asset_scanner + B2 rule_catalog_sync):
  - asset_inventory: 0 → several hundred (B1)
  - asset_discovery_run: 0 → 1 per hour (B1)
  - asset_coverage_snapshot: 0 → assets × 7 dimensions (B1)
  - alert_rule_catalog: 0 → ~68 rules (B2)
  - host_capacity_snapshot: 0 → hosts per day (B3)
  - capacity_violation_event: 0 → when thresholds are exceeded (B3)
  - asset_compliance_snapshot: 0 → assets × 7 dimensions (B4)

automation_operation_log gains 5 new op_types: asset_discovered / rule_created /
rule_updated / capacity_recommendation / coverage_recalculated

All 8 zero-writer tables now have a writer; the ADR-090 Phase 7 implementation is complete.

Refs: ADR-090 §4.2 Phase 4, MASTER §3.5 D5 (capacity-aware)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:23:27 +08:00
OG T
5b9b36f30d fix(ci)+feat(aiops): cd.yaml shared network + rule_catalog_sync (ADR-090 E3)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m31s
CI fix (true cause of the third failure, c0f3509):
  c0f3509 log: 'Detected act task network: (none, will fall back to bridge)'
  → the ACT_NET grep found no match in the CI environment → fell back to bridge
  → the default bridge does not support container-name DNS → pg-test-b5 failed to resolve

Fix (v3: proactively create a shared network):
  - B5_NET=b5-test-net (idempotent docker network create)
  - ci-runner connects itself with docker network connect $HOSTNAME
  - pg-test-b5 --network=$B5_NET
  - Both sides on the same user-defined network → container-name DNS works

Adds rule_catalog_sync_job (ADR-090 § Phase 7, 2nd service):
  + apps/api/src/jobs/rule_catalog_sync_job.py (~230 lines)
    - run_rule_catalog_sync_loop: 90s startup delay, then syncs every 1h
    - sync_once: HTTP GET {PROMETHEUS_URL}/api/v1/rules (type=alert)
    - UPSERT alert_rule_catalog (rule_name is UNIQUE)
    - Writes aol only when an INSERT/UPDATE actually happens (avoids polluting the log with N rules)
  + wired into the main.py lifespan via asyncio.create_task()

Expected unlocks:
  - alert_rule_catalog: from 0 → the number of active Prometheus rules (~68)
  - automation_operation_log: new 'rule_created' / 'rule_updated' op_types
  - E3 Hermes AI finally has a baseline for proposing rule fixes

Refs: ADR-090 §4.2 E3, MASTER §3.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:08:34 +08:00
OG T
ddb902f1ff fix(ci+aiops): cd.yaml grep set-e bug + add asset_scanner_job (ADR-090)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CI fix (true cause of the second failure, b636d3b):
  cd.yaml line 161 ACT_NET=$(docker network ls | grep -E '^GITEA-ACTIONS-...')
  The act runner uses 'bash -e -o pipefail'; grep exits 1 when nothing matches → the whole step aborts
  (the earlier e7ba8cb failure was an unreachable PG IP; b636d3b is the grep set-e bug: two different errors)

Fix:
  ACT_NET=$(... | (grep -E '...' || echo "") | head -1)
  Wrapping grep in a subshell with || echo "" ensures ACT_NET is an empty string on failure

Adds asset_scanner_job (ADR-090 § Phase 7, 1st service):
  + apps/api/src/jobs/asset_scanner_job.py (~360 lines)
    - run_asset_scanner_loop: runs every 1h, first run delayed 60s
    - scan_once: uses K8sProvider kubectl_get pods --all-namespaces
    - UPSERT asset_inventory (asset_key is UNIQUE; the same asset_id is reused across runs)
    - Writes a 7-dimension asset_coverage_snapshot (default unknown) for each active asset
    - Writes automation_operation_log(asset_discovered)
  + wired into the main.py lifespan via asyncio.create_task()

Expected unlocks:
  - asset_inventory: from 0 → several hundred (pods across all namespaces)
  - asset_discovery_run: 1 row per hour
  - asset_coverage_snapshot: each asset × 7 dims
  - automation_operation_log: new 'asset_discovered' op_type

The next stage (rule_catalog / capacity / compliance scanners) will be committed in batches once CI passes.

Refs: ADR-090 §4.1, MASTER §3.4 D4, project_blindspot_governance.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:15:45 +08:00
OG T
9d6aa7ea45 feat(trust): ADR-088 Trust Score persistence: core of L4 auto-approval
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m40s
TrustScoreManager is upgraded from in-memory storage to PostgreSQL persistence,
so trust scores no longer reset to zero after a Pod restart and the AI can
actually accumulate enough trust to reach the L4 auto-approval threshold.

Changes:
- migrations/adr088_trust_score_persistence.sql: trust_records table
- db/models.py: TrustRecordDB ORM model
- repositories/interfaces.py: ITrustRepository Protocol
- repositories/trust_repository.py: PG upsert ON CONFLICT DO UPDATE
- services/trust_engine.py: bulk_load() warm-up at startup
- services/learning_service.py: _persist_trust() + 2 call sites
- main.py: load_all() → bulk_load() at startup

Flow: 5 approvals → score=5 written to DB → Pod restart → warm-up reads it back
      → evaluate_adjusted_risk MEDIUM→LOW → automatic execution

2026-04-17 ogt + Claude Sonnet 4.6 (Asia-Pacific): ADR-088

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:14:44 +08:00
OG T
588b0d745b fix(aiops): fix root cause of sensors=0/0: MCPToolRegistry was never initialized at startup
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m44s
Three problems fixed together:

1. main.py: add the missing init_mcp_tool_registry() call
   - ADR-081 Phase 1 created MCPToolRegistry, but it was never called during lifespan startup
   - This left PreDecisionInvestigator at sensors=0/0 with evidence_summary always blank
   - Blank evidence → Diagnostician always ABSTAINs

2. signal_producer.py: str(dict) → json.dumps()
   - labels/annotations were serialized with Python str(), so they could not be deserialized after being written to Redis

3. brain/incident_engine.py: add a _parse_dict_field() helper
   - labels/annotations read back from Redis may be JSON strings
   - The isinstance(..., dict) guard is not enough; json.loads() must be tried first
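
A small sketch of why fix 2 and fix 3 go together: str(dict) emits single-quoted Python repr, which json.loads() rejects, while json.dumps() round-trips. parse_dict_field is a hypothetical stand-in for the commit's _parse_dict_field(), and falling back to an empty dict is an assumption.

```python
import json


def parse_dict_field(value) -> dict:
    """Return a dict for values that are a dict already or a JSON object
    string; fall back to an empty dict otherwise (assumed behavior)."""
    if isinstance(value, dict):
        return value
    if isinstance(value, (str, bytes)):
        try:
            parsed = json.loads(value)
        except ValueError:  # JSONDecodeError and UnicodeDecodeError subclass it
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


labels = {"alertname": "KubePodCrashLooping", "severity": "critical"}
good = parse_dict_field(json.dumps(labels))  # JSON text parses back to a dict
bad = parse_dict_field(str(labels))          # Python repr is not valid JSON
```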

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific): flywheel sensor fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 15:35:19 +08:00
OG T
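62e2efda85 fix(heartbeat): restore the 30-minute heartbeat report to the SRE war room
The reason given for disabling it on 2026-04-15 (forwarded_to_separate_group) was wrong:
the SRE war room is SRE_GROUP_CHAT_ID, so it should not have been disabled.
Restores start_heartbeat_monitor(interval=30min).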

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:05:01 +08:00
OG T
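ce1a4d286e feat(sweeper): add Incident Analysis Sweeper: automatically trigger AI decisions for unanalyzed Incidents
Gap fix:
  After the Signal Worker creates an Incident, AI analysis is triggered only when GET /api/v1/incidents is called
  If nobody visits the frontend, a new Incident never gets AI analysis or a Telegram notification

Solution:
  Adds src/jobs/incident_analysis_sweeper.py
  Every 90 seconds, scans for INVESTIGATING incidents without a decision token
  Triggers get_or_create_decision() in the background, throttled by Semaphore(3), at most 5 incidents per batch
  Mounted via asyncio.create_task() at main.py lifespan startup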
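
The throttling described above might look roughly like this. It is a sketch under assumptions: sweep_once and the analyze callback are hypothetical names, and swallowing a failed analysis so the rest of the batch continues is an assumed policy.

```python
import asyncio

MAX_CONCURRENCY = 3  # Semaphore(3)
BATCH_SIZE = 5       # at most 5 incidents per sweep


async def sweep_once(pending_ids, analyze):
    """Analyze up to BATCH_SIZE incidents, at most MAX_CONCURRENCY at a time.

    Returns the ids whose analysis completed without raising.
    """
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    batch = list(pending_ids)[:BATCH_SIZE]

    async def guarded(incident_id):
        async with sem:
            try:
                await analyze(incident_id)
                return incident_id
            except Exception:
                # One failed analysis should not abort the rest of the batch.
                return None

    results = await asyncio.gather(*(guarded(i) for i in batch))
    return [r for r in results if r is not None]
```

A periodic loop would call sweep_once and then sleep 90 seconds; anything beyond the first five incidents waits for the next sweep.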

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:08:30 +08:00
OG T
85c4e3b434 fix(km): fix the root cause of KM entries all being written as unknown (three points)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Bug-A: approval_execution.py called a nonexistent get_incident() → AttributeError
swallowed by except → alertname/alert_category/affected_services all fell back to defaults
Fix: use the two paths get_from_working_memory() + get_from_episodic_memory() instead

Bug-B: _record_to_incident() omitted the notification_type + alert_category fields
when restoring an Incident from PG → km_conversion read None
Fix: restore these two fields as well

Bug-C: the working_memory_warmup in main.py omitted notification_type + alert_category
in the same way when rebuilding an Incident
Fix: add them there too

2026-04-15 Claude Sonnet 4.6 Asia/Taipei
2026-04-15 22:28:48 +08:00
OG T
256a24e843 fix(deps+startup): add drain3/statsmodels to pyproject + warmup skips legacy data
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 17m23s
- pyproject.toml: add drain3>=0.9.11, statsmodels>=0.14.0, sse-starlette
  → the Docker build installs from pyproject, so packages listed only in requirements.txt were never installed into the image
  → clears the 400 drain3_not_available alerts from the P4 LogAnomalyDetector
- main.py: per-record try/except in the working_memory warmup
  → legacy incidents with an invalid source (node-exporter) → skipped instead of crashing the whole warmup

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:08:13 +08:00
OG T
76558a3cd9 feat(AIOps): enable all P1-P6 feature flags + Nemotron + offline replay loop
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- configmap: enable all AIOPS_P1~P6 master switches and sub-switches
- configmap: ENABLE_NEMOTRON_COLLABORATION=true (reverts to the 120s timeout)
- feature_flags.py: add the missing AIOPS_P6_GOVERNANCE_ENABLED field
- main.py: mount run_offline_replay_loop (ADR-087 Phase 6)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:59:51 +08:00
OG T
4718c7667c feat(Phase 3): Evolver loop scheduling + manual trigger endpoint: combined drill gate complete
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_evolver.py: add run_evolver_loop() (infinite loop with a 24h interval)
- main.py: mount run_evolver_loop via asyncio.create_task
- api/v1/learning.py: POST /api/v1/learning/evolver/run (Phase 3 exit #6 drill endpoint)
- MASTER §8: back-fill the record for 66c4eda AgentSession + the full Evolver exit-condition checklist from this commit

ADR-083 Phase 3, 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:07:56 +08:00
OG T
fb1bbd0e20 feat(Phase 3): complete the learning loop: Root cause 3 + diagnosis feedback + knowledge decay + fine-tune pipeline
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- approval_execution.py: _run_post_execution_verify() now calls record_verification_result()
  Closes Root cause 3: environment verification results (success/degraded/failed/timeout) are no longer orphaned
- learning_service.py: add record_verification_result(): verification result → Redis + Playbook EWMA
- learning_service.py: add record_diagnosis_outcome(): writes back negative signals for misdiagnoses (L3×D4)
- jobs/knowledge_decay_job.py: new 30d knowledge-decay Job (unreferenced draft/review → archived)
- services/finetune_exporter.py: new weekly JSONL export (EvidenceSnapshot × AgentSession)
- main.py: mount knowledge_decay_loop (24h) + finetune_export_loop (7d)
- MASTER §8: record that all Phase 3 core rework items have landed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:57:43 +08:00
OG T
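05b774386b feat(Phase 6): AI SLO REST API: GET /api/v1/ai/slo wraps up the phase
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
The last piece of the ADR-087 Phase 6 self-governance loop:

1. api/v1/ai_slo.py: GET /api/v1/ai/slo
   - Service-layer cache first (TTL 5min, AiSloCalculator.get_cached_report)
   - force_refresh=true forces a recomputation (AiSloCalculator.run)
   - The Router layer has zero direct Redis access (leWOOOgo modular hard rule)

2. main.py: mount ai_slo_v1.router (prefix=/api/v1)

3. Appended to the MASTER §8 Living Changelog:
   - Full RCA record of the 3 root causes of the P0 alert silence
   - Summary of the P2 flywheel broken-chain fix
   - Checklist of all completed Phase 6 components

Phase 6 exit conditions: 5/6 met (production verification pending image rollout)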

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:57:26 +08:00
OG T
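3ce5025ca7 fix(alerts): 3 silent flywheel nodes: DIAGNOSE routing + heartbeat disabled + notification format
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. openclaw.py: remove require_local=True from DIAGNOSE
   - v4.3 already decided that NIM is the primary provider and poses no privacy concern
   - require_local=True caused every provider to be privacy_skip'd → alerts always failed
   - After the fix, DIAGNOSE uses _full_fallback_chain (NIM → Gemini → Claude)

2. ai_router.py: require_local failure notifications now use the ADR-075 TYPE-1 format
   - Plain-text raw notifications are forbidden (Commander's hard rule: every message must follow a format template)
   - Uses a ├─ / └─ tree structure + semantic labels instead

3. main.py: disable the Telegram heartbeat monitor
   - The heartbeat is already forwarded to another Telegram group, so it need not be repeated in this channel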

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:49:43 +08:00
OG T
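f045506abd fix(flywheel): P2 overdue Approvals never closed → fix the broken KM learning chain
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m11s
Root cause:
  A PENDING approval left unhandled for more than 48h should automatically become EXPIRED,
  but get_pending_approvals() was triggered only when a user opened the UI.
  If nobody opened the UI → the Incident stayed PENDING forever → KM was never written
  → the Phase 6 SLO human_override_rate was underestimated and the EWMA lacked negative samples.

Fix:
  1. anomaly_counter.py: add a "timeout_ignored" disposition type,
     distinct from auto_repair / human_approved / manual_resolved
  2. incident_service.py: resolve_incident() gains a resolution_type parameter;
     resolution_type="timeout" records "timeout_ignored" instead of "manual_resolved"
  3. jobs/approval_timeout_resolver.py (new): scans hourly for overdue PENDING approvals,
     marks them EXPIRED in bulk, and calls resolve_incident("timeout") for each record that has an incident_id
  4. main.py: mount the approval_timeout_resolver schedule at startup (interval=3600s)

Effect:
  - Alert unhandled for 48h → Incident closed automatically → KM written → EWMA gets a sample
  - disposition="timeout_ignored" lets the SLO calculation correctly distinguish "AI recommendation ignored"
  - The flywheel learning chain is now closed for "unhandled alerts"

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)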

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:21:21 +08:00
OG T
14a02263ae feat(Phase 4): proactive inspection + trend prediction + 8D sensor upgrade all complete
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m32s
## Phase 4 full delivery (ADR-084)

### New services
- trend_predictor.py: numpy linear regression, 4h threshold-breach early warning, R² confidence score
- proactive_inspector.py: proactive inspection coordinator, runs every 5 minutes
  - DynamicBaselineService (3σ deviation)
  - LogAnomalyDetector (new Drain3 patterns)
  - TrendPredictor (slope extrapolation for a 4h forecast)
  - Shadow Mode + 30-minute deduplication + Holt-Winters background retraining

### 8D sensor upgrade (EvidenceSnapshot Phase 4 enhancement)
- PreDecisionInvestigator._collect_phase4_anomalies(): before a decision, reads
  the latest ProactiveInspector inspection cache + new LogAnomalyDetector patterns
- EvidenceSnapshot.anomaly_context: new field, Phase 4 dynamic anomaly context
- DiagnosticianAgent._build_prompt(): the prompt includes anomaly_context, so
  the LLM RCA can take dynamic-baseline deviations and trend warnings into account

### Database migration
- incident_evidence: ADD COLUMN anomaly_context JSONB (idempotent)

### main.py
- Starts the run_proactive_inspector_loop() asyncio task

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 4 all complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:47:05 +08:00
OG T
bf45b80bd2 feat(Phase 3.5 + Phase 4): persist AI learning results to PostgreSQL: fix the "AI amnesia" architecture flaw
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-085: AI learning results must not live in a cache

Architecture hard rules established:
- PostgreSQL = System of Record (the AI's permanent memory)
- Redis = Warm Cache (speeds up reads; restored from PG when the TTL expires)

Core changes:
1. models.py: add PlaybookRecord / DynamicBaselineRecord / LogClusterRecord ORM models
2. base.py: ALTER TABLE playbooks adds trust_score / requires_approval_level and other columns
3. playbook_repository.py: full dual-write implementation (PG upsert + Redis cache)
4. dynamic_baseline_service.py: Holt-Winters training results are written to PG; Redis serves only as a 24h warm cache
5. log_anomaly_detector.py: Drain3 cluster templates are written to PG (UPSERT on cluster_id)
6. main.py: run backfill_redis_to_pg() at startup (idempotent Redis → PG backfill)

Problems fixed:
- Playbook 7-day Redis TTL expired → the AI lost all its repair knowledge
- trust_score EWMA reset when the Redis TTL expired → the AI fell back to the initial trust level of 0.3
- Holt-Winters baseline 24h TTL → the AI relearned the definition of "normal" every day
- Drain3 clusters were not persisted → the AI repeatedly treated known log patterns as new

New Phase 4 services (statsmodels + drain3 + numpy already added to requirements.txt)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:34:04 +08:00
OG T
db9e304a14 feat(adr-080): Phase 0 guardrails established: AI autonomy flywheel kickoff
- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
  (1456 lines, §0-§8 fully filled in: 42-cell tactical matrix, 7-Phase plan, 7 ADR summaries,
   15 KPIs, 21 Feature Flags, 10 risk scenarios)

- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
  (7-Phase structure + 4 north stars + 7 architect Review Gates + Phase exit conditions)

- apps/api/src/core/feature_flags.py
  (AIOpsFeatureFlags: P1~P6 master switches all False + 15 fine-grained sub-switches,
   is_phase_enabled() / is_sub_flag_enabled() + safe bool casting)

- apps/api/src/jobs/__init__.py + baseline_snapshot.py
  (Phase 0 baseline snapshot Job: MCP calls / Playbook confidence / general ratio
   / learning loop rate / auto_repair, written to aiops:baseline:latest)

- apps/api/tests/test_feature_flags.py  (21 tests, all green)

- docs/HARD_RULES.md → v1.9
  (new Phase exit-condition hard rule: a Phase must not be declared complete before passing its exit conditions)

- CLAUDE.md anti-amnesia gate 1: mandatory read of MASTER §0 Session Resume Protocol

Gate 0 Pass: 21/21 tests green

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 12:44:53 +08:00
OG T
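684d6cfb43 feat(adr-076): all four Tactic B Tasks complete: alert grouping + retry + automatic reports
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 17m34s
Task 2: AlertGroupingService: Redis 5-minute sliding window to prevent alert storms
- apps/api/src/services/alert_grouping_service.py (new)
- webhooks.py integration: short-circuits child alerts after fingerprint generation and before the LLM
- Threshold=3, Graceful Degradation, 16 tests

Task 3: approval_execution.py retry on execution failure
- MAX_RETRY=2, RETRY_DELAY_SECONDS=30
- _is_transient_error() classifies errors as transient/permanent; permanent errors are not retried
- The Timeline records retry progress and, on success, notes the retry count; 29 tests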
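
A minimal sketch of the Task 3 retry policy. Treating TimeoutError and ConnectionError as the transient class, and reading MAX_RETRY=2 as "up to two retries after the first attempt", are both assumptions. The function names are hypothetical.

```python
import time

MAX_RETRY = 2
RETRY_DELAY_SECONDS = 30


def is_transient_error(exc: Exception) -> bool:
    """Assumed classification: timeouts and connection errors are transient."""
    return isinstance(exc, (TimeoutError, ConnectionError))


def execute_with_retry(action, delay=RETRY_DELAY_SECONDS, sleep=time.sleep):
    """Run action(); retry transient failures up to MAX_RETRY times.

    Returns (result, retries_used). Permanent errors, and transient errors
    that persist after the last retry, are re-raised.
    """
    attempt = 0
    while True:
        try:
            return action(), attempt
        except Exception as exc:
            if attempt >= MAX_RETRY or not is_transient_error(exc):
                raise
            attempt += 1
            sleep(delay)
```

Returning the retry count alongside the result gives the caller what it needs to note the number of retries on the Timeline after a success.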

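Task 4: report_generation_service.py automatic reports
- Daily inspection report: 08:00 Taipei time every day, pushed to the Telegram SRE group
- Postmortem: triggered automatically when an Incident is resolved and its duration > 10 minutes
- main.py lifespan mounts run_daily_report_loop(); 30 tests

Tests: 600 → 675 passing (+75), 0 failed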

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 14:39:14 +08:00
OG T
5aa0244c9a fix(aiops): ADR-072 P1 bug fixes: BUG-004/005/006
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
BUG-004 KM vectorization 108/112 = False:
  km_conversion_service: after a KM entry is created (embedding already triggered in the background),
  also sets incidents.vectorized = True so the flywheel loop (ADR-068) learning metric reports correctly

BUG-005 15 ready decisions with no reviewer:
  decision_manager: add resend_stale_ready_tokens(), which
  scans Redis decision:* keys, finds tokens with state=ready whose dedup_key has expired,
  and re-pushes the Telegram review card
  main.py: schedule resend_stale_ready_tokens() at lifespan startup (non-blocking via asyncio.create_task)

BUG-006 outcome/verification_result all null:
  _push_auto_repair_result: before pushing to Telegram, first writes
  incidents.outcome + incidents.verification_result to the DB

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:24:41 +08:00