feat(Phase 6): AI SLO REST API (GET /api/v1/ai/slo), Phase 6 wrap-up

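The last piece of the ADR-087 Phase 6 self-governance loop:

1. api/v1/ai_slo.py: GET /api/v1/ai/slo
   - Service-layer cache first (TTL 5 min, AiSloCalculator.get_cached_report)
   - force_refresh=true forces a recalculation (AiSloCalculator.run)
   - No direct Redis access in the router layer (leWOOOgo building-block rule)

2. main.py: mounts ai_slo_v1.router (prefix=/api/v1)

3. MASTER §8 Living Changelog additions:
   - Full RCA record of the 3 root causes behind the P0 alert silence
   - Summary of the P2 flywheel broken-chain fix
   - Completion list of all Phase 6 components

Phase 6 exit criteria: 5/6 met (production verification is pending image deployment)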

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-15 19:57:26 +08:00
parent 14579ce149
commit 05b774386b
3 changed files with 105 additions and 0 deletions


@@ -0,0 +1,58 @@
"""
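AI SLO REST API
===============
ADR-087 Phase 6 self-governance loop: query endpoint for AI decision-quality SLOs

Endpoints:
    GET /api/v1/ai/slo: return the latest SLO calculation result (Redis-cached)

Design principles:
- Read the service-layer cache first (TTL 5 min); recalculate only when the cache has expired
- Calculation failure → conservatively return any_violated=True (handled by AiSloCalculator)
- Force a recalculation: ?force_refresh=true
- The router layer does not access Redis directly (leWOOOgo building-block rule)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): initial Phase 6 version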
"""
from __future__ import annotations

import structlog
from fastapi import APIRouter, Query

from src.services.ai_slo_calculator import AiSloCalculator

logger = structlog.get_logger(__name__)
router = APIRouter()


@router.get("/ai/slo")
async def get_ai_slo(
    force_refresh: bool = Query(False, description="Ignore the cache and force a recalculation"),
) -> dict:
    """
    Return the latest AI decision-quality SLO result.

    Reads the Redis cache first (TTL 5 min); force_refresh=true forces a
    recalculation and updates the cache.

    Response:
        calculated_at   ISO timestamp
        window_days     calculation window (days)
        any_violated    whether any SLO is violated
        cache_hit       whether the cache was hit
        metrics[]       details of the three SLO metrics
    """
    calc = AiSloCalculator()

    # Cache-first path: skipped entirely when the caller forces a refresh.
    if not force_refresh:
        cached = await calc.get_cached_report()
        if cached:
            data = cached.to_dict()
            data["cache_hit"] = True
            return data

    # Cache miss or forced refresh: recalculate via the service layer.
    report = await calc.run()
    data = report.to_dict()
    data["cache_hit"] = False
    return data
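
The cache-first flow of this endpoint can be exercised without FastAPI or Redis. The sketch below is only an illustration: `FakeReport`, `FakeSloCalculator`, and `get_slo` are hypothetical stand-ins (the real `AiSloCalculator` is not shown in this commit), and the in-memory TTL check merely imitates the 5-minute Redis cache described above.

```python
import asyncio
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FakeReport:
    """Hypothetical stand-in for the report object returned by AiSloCalculator."""
    calculated_at: float
    any_violated: bool

    def to_dict(self) -> dict:
        return asdict(self)


class FakeSloCalculator:
    """In-memory stand-in; the real service is described as caching in Redis."""

    TTL_SECONDS = 300  # mirrors the 5 min TTL named in the commit message

    def __init__(self) -> None:
        self._cached: Optional[FakeReport] = None
        self._cached_at = 0.0
        self.runs = 0

    async def get_cached_report(self) -> Optional[FakeReport]:
        # Return the cached report only while it is younger than the TTL.
        if self._cached and time.monotonic() - self._cached_at < self.TTL_SECONDS:
            return self._cached
        return None

    async def run(self) -> FakeReport:
        # Recalculate and refresh the cache.
        self.runs += 1
        self._cached = FakeReport(calculated_at=time.time(), any_violated=False)
        self._cached_at = time.monotonic()
        return self._cached


async def get_slo(calc: FakeSloCalculator, force_refresh: bool = False) -> dict:
    """Same control flow as the endpoint: cache first, recompute on miss or force."""
    if not force_refresh:
        cached = await calc.get_cached_report()
        if cached:
            return {**cached.to_dict(), "cache_hit": True}
    report = await calc.run()
    return {**report.to_dict(), "cache_hit": False}


async def demo() -> list:
    calc = FakeSloCalculator()
    first = await get_slo(calc)                      # cold cache: recomputes
    second = await get_slo(calc)                     # warm cache: served from cache
    third = await get_slo(calc, force_refresh=True)  # forced: recomputes
    return [first["cache_hit"], second["cache_hit"], third["cache_hit"], calc.runs]
```

Calling `asyncio.run(demo())` walks through the three cases (cold cache, warm cache, forced refresh) and reports how many recalculations actually ran.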


@@ -34,6 +34,7 @@ from sentry_sdk.integrations.starlette import StarletteIntegration
from src.api.v1 import agents as agents_v1 # Phase 9.5: Agent Teams API
from src.api.v1 import ai as ai_v1
from src.api.v1 import ai_slo as ai_slo_v1 # Phase 6 ADR-087: AI SLO self-governance
from src.api.v1 import approvals as approvals_v1
from src.api.v1 import alert_operation_logs as alert_operation_logs_v1
from src.api.v1 import audit_logs as audit_logs_v1
@@ -514,6 +515,7 @@ app.include_router(csrf_v1.router, prefix="/api/v1", tags=["Security"]) # Phase
app.include_router(dashboard_v1.router, prefix="/api/v1", tags=["Dashboard"])
app.include_router(approvals_v1.router, prefix="/api/v1", tags=["HITL Approvals"])
app.include_router(ai_v1.router, prefix="/api/v1", tags=["AI Decision"])
app.include_router(ai_slo_v1.router, prefix="/api/v1", tags=["AI SLO"]) # Phase 6 ADR-087
app.include_router(webhooks_v1.router, prefix="/api/v1", tags=["Webhooks"])
app.include_router(timeline_v1.router, prefix="/api/v1", tags=["Timeline"])
app.include_router(audit_logs_v1.router, prefix="/api/v1", tags=["Audit Logs"])


@@ -1484,3 +1484,48 @@ After Phase 6 completes
- `AIOPS_P3_EVOLVER_ENABLED`
**Next steps:** ADR-083 draft → Gate 3 architecture review → Phase 3 commit push to Gitea

---
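### 2026-04-15 late night (Taipei): P0 alert-silence root fix + Phase 6 self-governance loop wrap-up

**P0 alert-silence RCA (3 root causes):**

| # | Root cause | Impact | Fix |
|---|------------|--------|-----|
| 1 | `approval_db.py:find_by_fingerprint()` PENDING has no TTL | Old PENDING records (hit_count=77/30/17) permanently absorbed alerts with the same fingerprint, so Telegram went completely silent | Added a `PENDING_TTL_HOURS=24` time limit; expired 7 zombie records directly via kubectl |
| 2 | `create_approval_with_fingerprint()` expires_at=NULL | A newly created ApprovalRecord never expires, so the auto-expiry logic has no effect | Added a `DEFAULT_APPROVAL_TTL_HOURS=48` default |
| 3 | `openclaw.py:897` DIAGNOSE require_local=True | v4.3 had already made NIM the primary provider but this was never updated; every DIAGNOSE request hit privacy_skip → the LLM failed silently | Removed DIAGNOSE from the require_local condition |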
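
Root causes 1 and 2 come down to two small rules: a PENDING record may only absorb a new alert while it is younger than the TTL, and every new record gets an expiry. The following is a minimal sketch under stated assumptions: the `Approval` dataclass, the in-memory list, and the choice of `created_at` as the TTL reference are hypothetical; only the two constant names and function names come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

PENDING_TTL_HOURS = 24           # constant name taken from the changelog
DEFAULT_APPROVAL_TTL_HOURS = 48  # constant name taken from the changelog


@dataclass
class Approval:
    """Hypothetical, simplified stand-in for ApprovalRecord."""
    fingerprint: str
    status: str
    created_at: datetime
    expires_at: Optional[datetime] = None
    hit_count: int = 1


def create_approval(fingerprint: str, now: datetime) -> Approval:
    """Root cause 2: always set expires_at instead of leaving it NULL."""
    return Approval(
        fingerprint,
        "PENDING",
        now,
        expires_at=now + timedelta(hours=DEFAULT_APPROVAL_TTL_HOURS),
    )


def find_by_fingerprint(
    records: List[Approval], fingerprint: str, now: datetime
) -> Optional[Approval]:
    """Root cause 1: only PENDING records younger than the TTL may absorb an alert."""
    cutoff = now - timedelta(hours=PENDING_TTL_HOURS)
    for rec in records:
        if (
            rec.fingerprint == fingerprint
            and rec.status == "PENDING"
            and rec.created_at >= cutoff
        ):
            return rec
    return None


now = datetime(2026, 4, 15, 12, 0, tzinfo=timezone.utc)
zombie = Approval("fp-1", "PENDING", now - timedelta(days=3), hit_count=77)
fresh = create_approval("fp-2", now - timedelta(hours=1))
records = [zombie, fresh]
```

With this filter the three-day-old `zombie` record no longer matches, so a new alert with fingerprint `fp-1` would create a fresh approval instead of being swallowed.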

**P2 flywheel broken-chain fix:**

- New `jobs/approval_timeout_resolver.py`: scans hourly for overdue PENDING records, marks them EXPIRED, and calls `resolve_incident(resolution_type="timeout")`
- `anomaly_counter.py` adds a `timeout_ignored` disposition (implicit negative feedback)
- `incident_service.py.resolve_incident()` adds a `resolution_type` parameter
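
The resolver loop described above can be sketched as follows. This is an illustration only: `PendingApproval`, `resolve_timeouts`, and the injected callback are hypothetical, and the hourly scheduling is left out; the one detail taken from the changelog is the `resolution_type="timeout"` argument.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List


@dataclass
class PendingApproval:
    """Hypothetical stand-in for an approval row."""
    incident_id: str
    status: str
    expires_at: datetime


def resolve_timeouts(
    approvals: List[PendingApproval],
    now: datetime,
    resolve_incident: Callable[..., None],
) -> int:
    """Mark overdue PENDING approvals EXPIRED and resolve their incidents."""
    expired = 0
    for approval in approvals:
        if approval.status == "PENDING" and approval.expires_at <= now:
            approval.status = "EXPIRED"
            resolve_incident(approval.incident_id, resolution_type="timeout")
            expired += 1
    return expired


calls = []


def fake_resolve_incident(incident_id: str, resolution_type: str) -> None:
    # Records the call instead of touching a real incident service.
    calls.append((incident_id, resolution_type))


now = datetime(2026, 4, 15, 20, 0, tzinfo=timezone.utc)
approvals = [
    PendingApproval("inc-1", "PENDING", now - timedelta(hours=2)),   # overdue
    PendingApproval("inc-2", "PENDING", now + timedelta(hours=2)),   # still valid
    PendingApproval("inc-3", "APPROVED", now - timedelta(hours=2)),  # already decided
]
expired_count = resolve_timeouts(approvals, now, fake_resolve_incident)
```

Passing `resolve_incident` in as a callable keeps the sketch testable without a database; the real job presumably calls the incident service directly.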

**asyncpg CrashLoopBackOff fix:**

- `db/base.py` Phase 6 migration: the three CREATE INDEX statements are split into separate execute calls (asyncpg does not allow multiple commands in one prepared statement)
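
The fix amounts to issuing one statement per `execute` call. The sketch below assumes a connection object that exposes an async `execute(sql)` method; the index and column names are hypothetical because the real DDL in `db/base.py` is not shown here, and `RecordingConn` is a test double that rejects multi-statement strings in roughly the way a prepared statement would.

```python
import asyncio
from typing import List, Protocol


class Executor(Protocol):
    """Anything with an async execute(sql) method, e.g. a DB connection wrapper."""

    async def execute(self, sql: str) -> None: ...


# Hypothetical DDL: only the table name appears in the changelog.
PHASE6_INDEX_DDL: List[str] = [
    "CREATE INDEX IF NOT EXISTS ix_gov_events_type ON ai_governance_events (event_type)",
    "CREATE INDEX IF NOT EXISTS ix_gov_events_created ON ai_governance_events (created_at)",
    "CREATE INDEX IF NOT EXISTS ix_gov_events_source ON ai_governance_events (source)",
]


async def run_migration(conn: Executor) -> None:
    # One statement per call, instead of a single ';'-joined string.
    for ddl in PHASE6_INDEX_DDL:
        await conn.execute(ddl)


class RecordingConn:
    """Test double that refuses strings containing more than one statement."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    async def execute(self, sql: str) -> None:
        if ";" in sql.strip().rstrip(";"):
            raise ValueError("cannot insert multiple commands into a prepared statement")
        self.executed.append(sql)


conn = RecordingConn()
asyncio.run(run_migration(conn))
```

Joining the three statements with `;` and passing them to `RecordingConn.execute` would raise, which is the failure mode the changelog attributes to the crash loop.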

**Phase 6 self-governance loop: all components complete:**

| Component | File | Description |
|-----------|------|-------------|
| AI SLO calculator | `services/ai_slo_calculator.py` | Three SLOs (success rate / override rate / false neg rate), 7d rolling window, Redis cache 5 min |
| Trust drift detector | `services/trust_drift_detector.py` | Detects optimism_bias / confidence_collapse and writes ai_governance_events |
| KB rot cleaner job | `jobs/kb_rot_cleaner.py` | ROT-1 Kubernetes API versions / ROT-2 Prometheus metrics / ROT-3 cases stale for 90d |
| Self-degradation engine | `services/decision_manager.py` | Automatically raises confidence_threshold when an SLO is violated, activating degradation protection |
| SLO REST API | `api/v1/ai_slo.py` | GET /api/v1/ai/slo (with the force_refresh parameter) |
| DB table + migration | `db/models.py` + `db/base.py` | AiGovernanceEvent immutable Event Sourcing table + 3 indexes |
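
The self-degradation row can be read as a simple rule: when any SLO is violated, demand more confidence before acting automatically. The sketch below is hypothetical throughout: the two threshold values and both function names are made up for illustration, and the real logic in `services/decision_manager.py` is not shown in this commit. Treating a missing `any_violated` key as a violation follows the conservative default stated in the API module docstring.

```python
BASE_THRESHOLD = 0.70      # hypothetical normal confidence threshold
DEGRADED_THRESHOLD = 0.90  # hypothetical raised threshold under SLO violation


def effective_confidence_threshold(slo_report: dict) -> float:
    """Raise the bar when any SLO is violated; a missing flag counts as violated."""
    if slo_report.get("any_violated", True):
        return DEGRADED_THRESHOLD
    return BASE_THRESHOLD


def should_auto_execute(confidence: float, slo_report: dict) -> bool:
    """Allow automatic execution only at or above the effective threshold."""
    return confidence >= effective_confidence_threshold(slo_report)
```

For example, a decision with confidence 0.8 would pass while the SLOs are healthy and be held back once `any_violated` is true.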
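
**Incidental fixes:**

- `main.py` disables the Telegram heartbeat monitor (it is already forwarded to another group)
- `ai_router.py` failure notifications now use the ADR-075 TYPE-1 format (raw text is prohibited)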

**Phase 6 exit criteria (§7.1):**

- [x] SLO calculator returns the three metrics
- [x] Trust drift detector writes ai_governance_events
- [x] KB rot cleaner job runs
- [x] decision_manager self-degradation logic is wired in
- [x] REST API `/api/v1/ai/slo` is queryable
- [ ] Production verification (observe after 3ce5025 / the Phase 6 image is deployed)

**Commit chain:** fab65e7 → f31b4e3 → f045506 → f9ba200 → 3ce5025 → (Phase 6 REST API commit)