awoooi/apps/api/src/jobs/approval_timeout_resolver.py
Your Name 45dbe07188
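fix(flywheel): repair six automation-flywheel capabilities (ADR-092 B3)
[Root-cause chain fix]
MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context
→ LLM timeout → description="待分析" ("pending analysis") → blocked by the ADR-091 iron gate → tg_sent never set
→ W-2 Watchdog falsely reports a "silent failure"

[Six fixes]
1. MCP Provider: three bug fixes
   - ssh_provider: asyncssh.run() → conn.run()
   - prometheus_provider: KeyError 'query' → tolerant .get() lookup
   - k8s_provider: empty pod_name → return an error dict early

2. Agent Debate / decision quality
   - decision_manager: timeout fallback text changed to an explicit description (to get past the ADR-091 iron gate)
   - intent_classifier: on LLM timeout, fall back to keyword classification (instead of returning None)

3. Watchdog false-positive fix (ADR-092 B3)
   - W-2: tg_sent Redis TTL → telegram_message_id IS NULL (the DB is the source of truth)
   - W-5 added: suggested_action IN (empty / 待分析 / NO_ACTION) + tg_id IS NULL
   - approval_timeout_resolver: 60min → 15min, batch 50 → 200

4. Config Drift automation
   - drift_adopt_service: auto_adopt_if_safe() six-condition safety gate
   - drift.py: background task tries auto-adoption first, then sends the manual-review Telegram card

5. Playbook flywheel stability
   - playbook_seed_service: fix idempotency (deprecated entries are not treated as missing)
   - playbook_evolver: load only DRAFT+APPROVED (not all 294 entries)

6. Observability
   - alert_rule_engine: auto_rule structured logging + Redis counters (pipeline)
   - auto_approve: Redis counters for reject reasons
   - heartbeat_report_service: add the "⚙️ 自動化統計(今日)" ("Automation stats (today)") section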
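
The intent_classifier change in item 2 (fall back to keyword classification on LLM timeout rather than returning None) could look roughly like the sketch below. The commit does not show the real classifier, so classify_intent, classify_by_keyword, KEYWORD_INTENTS, the intent labels, and the timeout value are all hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical keyword table; the real classifier's categories are not in the source.
KEYWORD_INTENTS = {
    "restart": "remediation",
    "disk": "capacity",
    "cpu": "performance",
}


def classify_by_keyword(text: str) -> str:
    """Return the intent of the first matching keyword, or "unknown"."""
    lowered = text.lower()
    for keyword, intent in KEYWORD_INTENTS.items():
        if keyword in lowered:
            return intent
    return "unknown"


async def classify_intent(
    text: str,
    llm_call: Callable[[str], Awaitable[str]],
    timeout_s: float = 5.0,
) -> str:
    """Ask the LLM first; on timeout, degrade to keyword matching (never None)."""
    try:
        return await asyncio.wait_for(llm_call(text), timeout=timeout_s)
    except asyncio.TimeoutError:
        return classify_by_keyword(text)
```

The point of the design is that downstream consumers always receive a string label, so a slow LLM degrades classification quality instead of leaving the decision without any intent at all.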

[Manual step required]
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:55:50 +08:00

159 lines
5.7 KiB
Python
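"""
AWOOOI - Approval Timeout Resolver (job that auto-closes expired approvals)
================================================================

Responsibility: every 15 minutes (hourly before ADR-092 B3), scan approval_records
for rows that have expired (expires_at < now) but are still PENDING, mark them
EXPIRED, and call resolve_incident on the associated Incident so that the KM
learning chain closes the loop.

Why this job is needed:
  get_pending_approvals() has auto-expiry logic, but it only runs when a user opens
  the pending list. If nobody opens the UI, PENDING records stay there forever, the
  associated Incident never becomes RESOLVED, km_conversion_service never fires, and
  the AI learning flywheel is completely blind to alerts that nobody handled.

disposition record:
  timeout_ignored: distinct from auto_repair / human_approved, so that
  anomaly_counter statistics reflect the "AI suggested, human ignored" case,
  used to calibrate the Phase 6 SLO human_override_rate.

Design principles:
  1. Only update the DB, never delete records (follows the archive_not_delete rule)
  2. resolve_incident uses resolution_type="timeout" so the correct disposition is recorded
  3. On failure, only log the error; the main path is not affected
  4. Every run records resolved_count / error_count

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): P2 flywheel broken-chain fix

ADR-092 B3 (2026-04-24 ogt + Claude Sonnet 4.6 Asia/Taipei):
  Run interval shortened from 3600s (1h) to 900s (15 minutes), matching the
  W-2 Watchdog frequency, to shrink the window in which a PENDING backlog
  causes W-2 false positives;
  BATCH_LIMIT raised from 50 to 200 to clear the existing backlog faster.
"""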
from __future__ import annotations

import asyncio
from datetime import UTC, datetime, timedelta

import structlog
from sqlalchemy import and_, select, update

from src.db.base import get_db_context
from src.db.models import ApprovalRecord
from src.models.approval import ApprovalStatus
from src.utils.timezone import now_taipei

logger = structlog.get_logger(__name__)

# Maximum number of records handled per run, so a single run does not block for too long.
# ADR-092 B3 (2026-04-24 ogt + Claude Sonnet 4.6 Asia/Taipei): raised from 50 to 200 to clear the backlog faster.
BATCH_LIMIT = 200


async def run_approval_timeout_resolver() -> None:
    """
    Infinite loop: run the expired-approval closing scan every 15 minutes.
    Mounted in main.py startup via asyncio.create_task.

    ADR-092 B3 (2026-04-24 ogt + Claude Sonnet 4.6 Asia/Taipei):
      Shortened from 3600s (1h) to 900s (15 minutes), the same frequency as
      ai_slo_watchdog_job, to shrink the window in which a PENDING backlog
      causes W-2 false positives.
    """
    while True:
        try:
            resolved, errors = await _resolve_expired_approvals()
            # Stay quiet on runs where nothing happened.
            if resolved > 0 or errors > 0:
                logger.info(
                    "approval_timeout_resolver_done",
                    resolved=resolved,
                    errors=errors,
                )
        except Exception as e:
            # Never let one failed scan kill the background task.
            logger.error("approval_timeout_resolver_loop_error", error=str(e))

        await asyncio.sleep(900)  # ADR-092 B3: run every 15 minutes (was 3600s)


async def _resolve_expired_approvals() -> tuple[int, int]:
    """
    Find expired PENDING approvals, mark them EXPIRED, and resolve the linked Incident.

    Returns:
        (resolved_count, error_count)
    """
    now = datetime.now(UTC)
    resolved = 0
    errors = 0

    # Step 1: find records that are past expires_at but still PENDING
    # (only rows that actually have an expires_at).
    async with get_db_context() as db:
        result = await db.execute(
            select(ApprovalRecord)
            .where(
                and_(
                    ApprovalRecord.status == ApprovalStatus.PENDING,
                    ApprovalRecord.expires_at.is_not(None),
                    ApprovalRecord.expires_at < now,
                )
            )
            .order_by(ApprovalRecord.expires_at)
            .limit(BATCH_LIMIT)
        )
        expired_records = result.scalars().all()

        if not expired_records:
            return 0, 0

        # Step 2: mark the whole batch as EXPIRED
        expired_ids = [r.id for r in expired_records]
        await db.execute(
            update(ApprovalRecord)
            .where(ApprovalRecord.id.in_(expired_ids))
            .values(status=ApprovalStatus.EXPIRED, resolved_at=now)
        )
        await db.commit()

    logger.info(
        "approval_timeout_batch_expired",
        count=len(expired_ids),
        ids=[str(i)[:8] for i in expired_ids[:10]],
    )

    # Step 3: call resolve_incident for every record that has an incident_id
    from src.services.incident_service import get_incident_service

    inc_svc = get_incident_service()

    for record in expired_records:
        incident_id = getattr(record, "incident_id", None)
        if not incident_id:
            continue

        try:
            result = await inc_svc.resolve_incident(
                incident_id=str(incident_id),
                resolution_type="timeout",
            )
            if result:
                resolved += 1
                logger.info(
                    "approval_timeout_incident_resolved",
                    approval_id=str(record.id)[:8],
                    incident_id=str(incident_id)[:8],
                )
            else:
                # Incident not found or already RESOLVED: not counted as an error
                logger.debug(
                    "approval_timeout_incident_skip",
                    approval_id=str(record.id)[:8],
                    incident_id=str(incident_id)[:8],
                    reason="not_found_or_already_resolved",
                )
        except Exception as e:
            errors += 1
            logger.error(
                "approval_timeout_resolve_error",
                approval_id=str(record.id)[:8],
                incident_id=str(incident_id)[:8],
                error=str(e),
            )

    return resolved, errors
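
The Step 1 selection rule (still PENDING, non-null expires_at, already in the past, oldest first, capped at the batch limit) can be checked without a database. The following is a minimal in-memory sketch, not part of the job itself: the Approval dataclass is a hypothetical stand-in for ApprovalRecord, plain strings stand in for ApprovalStatus, and select_expired is an illustrative helper name.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Approval:
    """Hypothetical stand-in for ApprovalRecord, holding only the fields the filter needs."""
    id: str
    status: str
    expires_at: Optional[datetime]


def select_expired(
    records: List[Approval], now: datetime, batch_limit: int = 200
) -> List[Approval]:
    """Mirror the SQL filter: PENDING, has expires_at, expires_at < now, oldest first."""
    expired = [
        r
        for r in records
        if r.status == "PENDING" and r.expires_at is not None and r.expires_at < now
    ]
    expired.sort(key=lambda r: r.expires_at)
    return expired[:batch_limit]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        Approval("recent", "PENDING", now - timedelta(minutes=5)),
        Approval("future", "PENDING", now + timedelta(minutes=5)),
        Approval("no-deadline", "PENDING", None),
        Approval("oldest", "PENDING", now - timedelta(hours=2)),
    ]
    # Only the two overdue PENDING rows are selected, oldest first.
    print([r.id for r in select_expired(sample, now)])
```

Ordering by expires_at before applying the cap means that when the backlog exceeds the batch limit, the longest-overdue approvals are closed first and the remainder is picked up on the next 15-minute run.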