[Root-cause chain fix]
MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context → LLM timeout → description="待分析" (pending analysis) → ADR-091 iron gate blocks it → tg_sent not set → W-2 Watchdog falsely reports a "silent failure"

[Six fixes]
1. Three MCP Provider bug fixes
   - ssh_provider: asyncssh.run() → conn.run()
   - prometheus_provider: KeyError 'query' → tolerant .get()
   - k8s_provider: empty pod_name → return an error dict early
2. Agent Debate / decision quality
   - decision_manager: timeout fallback text changed to an explicit description (bypasses the ADR-091 iron gate)
   - intent_classifier: on LLM timeout, fall back to keyword classification (instead of None)
3. Watchdog false-positive fix (ADR-092 B3)
   - W-2: tg_sent Redis TTL → telegram_message_id IS NULL (DB as the source of truth)
   - W-5 (new): suggested_action IN empty/待分析/NO_ACTION + tg_id IS NULL
   - approval_timeout_resolver: 60min → 15min, batch 50 → 200
4. Config Drift automation
   - drift_adopt_service: auto_adopt_if_safe() six-condition safety gate
   - drift.py: the background task tries auto-adoption first, then sends the manual Telegram card
5. Playbook flywheel stability
   - playbook_seed_service: fix idempotency (deprecated is not treated as missing)
   - playbook_evolver: load only DRAFT+APPROVED (not all 294 entries)
6. Observability
   - alert_rule_engine: auto_rule structured logs + Redis counters (pipeline)
   - auto_approve: Redis counter for reject reasons
   - heartbeat_report_service: new "⚙️ Automation stats (today)" block

[Pending manual execution]
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
159 lines
5.7 KiB
Python
"""
AWOOOI: Approval Timeout Resolver (job that auto-closes overdue Approvals)
================================================================
Responsibility: every 15 minutes, scan approval_records for rows that are past
their deadline (expires_at < now) but still PENDING, mark them EXPIRED, and call
resolve_incident on the linked Incident so the KM learning chain closes its loop.

Why is this job needed?
    get_pending_approvals() has auto-expiry logic, but it only runs when a user
    opens the pending list. If nobody opens the UI, PENDING rows stay forever,
    the linked Incident never becomes RESOLVED, km_conversion_service never fires,
    and the AI learning flywheel is completely blind to alerts nobody handled.

disposition record:
    timeout_ignored: kept distinct from auto_repair / human_approved so that
    anomaly_counter statistics reflect cases where the AI made a suggestion
    but humans ignored it, for calibrating the Phase 6 SLO human_override_rate.

Design principles:
    1. Only update the DB; never delete rows (per the archive_not_delete iron rule)
    2. resolve_incident uses resolution_type="timeout" so the correct disposition is recorded
    3. On failure, only log an error; do not affect the main path
    4. Each run records resolved_count / error_count

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): P2 flywheel broken-chain fix
ADR-092 B3 (2026-04-24 ogt + Claude Sonnet 4.6 Asia/Taipei):
    Run interval shortened from 3600s (1h) to 900s (15 minutes), the same
    frequency as the W-2 Watchdog, to shrink the window in which a PENDING
    backlog causes W-2 false positives;
    BATCH_LIMIT raised from 50 to 200 to clear the existing backlog faster.
"""

from __future__ import annotations

import asyncio
from datetime import UTC, datetime

import structlog
from sqlalchemy import and_, select, update

from src.db.base import get_db_context
from src.db.models import ApprovalRecord
from src.models.approval import ApprovalStatus

logger = structlog.get_logger(__name__)

# Maximum number of rows handled per run, so a single run does not block for too long
# ADR-092 B3 (2026-04-24 ogt + Claude Sonnet 4.6 Asia/Taipei): raised from 50 to 200 to clear the backlog faster
BATCH_LIMIT = 200


async def run_approval_timeout_resolver() -> None:
    """
    Infinite loop: run the overdue-Approval closing scan every 15 minutes.
    Mounted in main.py startup via asyncio.create_task.
    ADR-092 B3 (2026-04-24 ogt + Claude Sonnet 4.6 Asia/Taipei):
    Shortened from 3600s (1h) to 900s (15 minutes), the same frequency as
    ai_slo_watchdog_job, to shrink the window in which W-2 raises false
    positives because of a PENDING backlog.
    """
    while True:
        try:
            resolved, errors = await _resolve_expired_approvals()
            if resolved > 0 or errors > 0:
                logger.info(
                    "approval_timeout_resolver_done",
                    resolved=resolved,
                    errors=errors,
                )
        except Exception as e:
            logger.error("approval_timeout_resolver_loop_error", error=str(e))

        await asyncio.sleep(900)  # ADR-092 B3: run every 15 minutes (was 3600s)


async def _resolve_expired_approvals() -> tuple[int, int]:
    """
    Find overdue PENDING approvals, mark them EXPIRED, and resolve the linked Incident.

    Returns:
        (resolved_count, error_count)
    """
    now = datetime.now(UTC)
    resolved = 0
    errors = 0

    # Step 1: find rows that are overdue but still PENDING (expires_at is set and in the past)
    async with get_db_context() as db:
        result = await db.execute(
            select(ApprovalRecord)
            .where(
                and_(
                    ApprovalRecord.status == ApprovalStatus.PENDING,
                    ApprovalRecord.expires_at.is_not(None),
                    ApprovalRecord.expires_at < now,
                )
            )
            .order_by(ApprovalRecord.expires_at)
            .limit(BATCH_LIMIT)
        )
        expired_records = result.scalars().all()

        if not expired_records:
            return 0, 0

        # Step 2: mark the whole batch EXPIRED
        expired_ids = [r.id for r in expired_records]
        await db.execute(
            update(ApprovalRecord)
            .where(ApprovalRecord.id.in_(expired_ids))
            .values(status=ApprovalStatus.EXPIRED, resolved_at=now)
        )
        await db.commit()

    logger.info(
        "approval_timeout_batch_expired",
        count=len(expired_ids),
        ids=[str(i)[:8] for i in expired_ids[:10]],
    )

    # Step 3: call resolve_incident for every record that has an incident_id
    from src.services.incident_service import get_incident_service

    inc_svc = get_incident_service()

    for record in expired_records:
        incident_id = getattr(record, "incident_id", None)
        if not incident_id:
            continue

        try:
            result = await inc_svc.resolve_incident(
                incident_id=str(incident_id),
                resolution_type="timeout",
            )
            if result:
                resolved += 1
                logger.info(
                    "approval_timeout_incident_resolved",
                    approval_id=str(record.id)[:8],
                    incident_id=str(incident_id)[:8],
                )
            else:
                # Incident not found or already RESOLVED; not counted as an error
                logger.debug(
                    "approval_timeout_incident_skip",
                    approval_id=str(record.id)[:8],
                    incident_id=str(incident_id)[:8],
                    reason="not_found_or_already_resolved",
                )
        except Exception as e:
            errors += 1
            logger.error(
                "approval_timeout_resolve_error",
                approval_id=str(record.id)[:8],
                incident_id=str(incident_id)[:8],
                error=str(e),
            )

    return resolved, errors
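
The docstring of run_approval_timeout_resolver says the loop is mounted at startup with asyncio.create_task, but main.py is not shown here. The following is a minimal, self-contained sketch of that mount-and-cancel pattern, not the project's actual startup code. `run_periodic`, `fake_scan`, and `demo` are hypothetical names introduced for illustration; `fake_scan` stands in for `_resolve_expired_approvals()`, and the interval is shortened so the sketch finishes quickly.

```python
import asyncio
import contextlib
from collections.abc import Awaitable, Callable


async def run_periodic(
    scan: Callable[[], Awaitable[tuple[int, int]]],
    interval_seconds: float,
) -> None:
    """Hypothetical stand-in for the resolver loop: scan, then sleep, forever."""
    while True:
        try:
            await scan()
        except Exception as exc:
            # A failed scan is reported but must not end the loop.
            print(f"scan failed: {exc}")
        await asyncio.sleep(interval_seconds)


async def demo(min_scans: int = 3) -> tuple[int, bool]:
    """Mount the loop as a background task, wait for a few scans, then cancel it."""
    calls = 0

    async def fake_scan() -> tuple[int, int]:
        # Stand-in for _resolve_expired_approvals(); the second call fails on purpose.
        nonlocal calls
        calls += 1
        if calls == 2:
            raise RuntimeError("simulated DB outage")
        return 0, 0

    # Startup: keep a reference so the task is not garbage-collected.
    task = asyncio.create_task(run_periodic(fake_scan, interval_seconds=0.01))

    while calls < min_scans:
        await asyncio.sleep(0.005)

    # Shutdown: cancel, then await so the cancellation actually completes.
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return calls, task.cancelled()
```

Calling `asyncio.run(demo())` returns the number of scans that ran and whether the task ended in the cancelled state. Because `except Exception` does not catch `asyncio.CancelledError` (a `BaseException` subclass since Python 3.8), the simulated failure is swallowed and the loop continues, while `task.cancel()` still stops the loop cleanly.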