awoooi/apps/api/src/services/ai_advisory_helpers.py
Your Name f572561467
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m31s
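feat(ai_advisory): P0 fix: leader lock + inline keyboard + callback handler
Commander's screenshot feedback, 2026-04-19:
  1. The same alert was pushed twice at 22:44 (every Pod runs the daily loop)
  2. Plain text with no buttons (no feedback loop; the AI only advises, never executes)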

Add services/ai_advisory_helpers.py (~240 lines):
  - try_acquire_daily_lock(job_name): Redis SETNX on key 'aiops:daily_lock:{job}:{date}',
    TTL 25h, fail-open (if Redis is down, push anyway rather than block).
  - try_acquire_hourly_lock(job_name): hourly variant of the above (used by coverage_evaluator).
  - is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}', TTL 24h.
  - build_ai_advisory_keyboard: a unified set of 4 buttons
       ✅ Handled / 😴 Snooze 24h / 🔍 View details / 📋 Produce kubectl command
    callback_data format: 'ai_advisory_{action}:{type}:{id}'
  - handle_ai_advisory_callback: handles the handled and snooze actions by writing aol.output.human_feedback;
    view/produce_cmd are deferred to P1.

The 4 LLM scanners now use the helpers:
  - capacity_forecaster: daily_lock + per-host snooze check + buttons
  - compliance_scanner: daily_lock (cron only) + per-date snooze + buttons
  - coverage_evaluator: hourly_lock + snooze per worst_dimension + buttons
  - hermes_rule_quality: daily_lock + snooze per primary rule + buttons
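
The per-scanner sequence listed above (lock, then snooze check per target, then push) might look roughly like the sketch below. The function `run_advisory_scan` and its injected callables are assumptions made for illustration; the real scanners are not shown in this commit page and may order or combine these steps differently.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def run_advisory_scan(
    job_name: str,
    advisory_type: str,
    targets: Iterable[str],
    *,
    acquire_lock: Callable[[str], Awaitable[bool]],
    is_snoozed: Callable[[str, str], Awaitable[bool]],
    send: Callable[[str, str], Awaitable[None]],
) -> list[str]:
    """Push one advisory per target, skipping snoozed targets. Returns pushed targets."""
    if not await acquire_lock(job_name):
        return []  # another Pod is the leader for this slot
    pushed: list[str] = []
    for target in targets:
        if await is_snoozed(advisory_type, target):
            continue  # a human pressed "snooze 24h" for this target
        await send(advisory_type, target)
        pushed.append(target)
    return pushed


async def demo() -> list[str]:
    # Hypothetical stubs standing in for the Redis-backed helpers and Telegram.
    snoozed = {("capacity_forecast", "host-a")}

    async def acquire_lock(job_name: str) -> bool:
        return True

    async def is_snoozed(advisory_type: str, target: str) -> bool:
        return (advisory_type, target) in snoozed

    async def send(advisory_type: str, target: str) -> None:
        pass  # a real scanner would call Telegram sendMessage with reply_markup here

    return await run_advisory_scan(
        "capacity_forecaster",
        "capacity_forecast",
        ["host-a", "host-b"],
        acquire_lock=acquire_lock,
        is_snoozed=is_snoozed,
        send=send,
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```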

telegram_gateway.py:
  handle_callback gains an 'ai_advisory_*' route (after step 1.85, drift)
  New method _handle_ai_advisory_action:
    parses the payload 'type:id' → calls handle_ai_advisory_callback
    → answer_callback (Telegram toast feedback)
    → returns a dict (info_action=True for view/produce_cmd)
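
A minimal sketch of the routing step above, assuming the gateway receives the raw callback_data string. `parse_callback_data`, `build_callback_data`, and `fits_telegram_limit` are hypothetical helpers, not functions from telegram_gateway.py. The size check reflects the Telegram Bot API rule that callback_data must be 1 to 64 bytes.

```python
from __future__ import annotations

PREFIX = "ai_advisory_"
TELEGRAM_CALLBACK_DATA_MAX_BYTES = 64  # Bot API limit for callback_data


def build_callback_data(action: str, advisory_type: str, advisory_id: str) -> str:
    return f"{PREFIX}{action}:{advisory_type}:{advisory_id}"


def parse_callback_data(data: str) -> tuple[str, str, str] | None:
    """Split 'ai_advisory_{action}:{type}:{id}'; return None for other routes."""
    if not data.startswith(PREFIX):
        return None
    # maxsplit=2 keeps any further ':' inside advisory_id intact.
    parts = data[len(PREFIX):].split(":", 2)
    if len(parts) != 3 or not all(parts):
        return None
    action, advisory_type, advisory_id = parts
    return action, advisory_type, advisory_id


def fits_telegram_limit(data: str) -> bool:
    return 1 <= len(data.encode("utf-8")) <= TELEGRAM_CALLBACK_DATA_MAX_BYTES


if __name__ == "__main__":
    data = build_callback_data("snooze", "capacity_forecast", "10.0.0.5@2026-04-19")
    print(parse_callback_data(data), fits_telegram_limit(data))
```

One consequence worth checking against the real payloads: a 36-character UUID op_id combined with a long advisory type such as compliance_posture yields a string longer than 64 bytes under this format, so a shorter composite key or an abbreviated type may be needed.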

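Alignment with the Commander's iron rules:
  - In multi-Pod deployments only the leader pushes (Redis SETNX keeps the push idempotent while Redis is reachable)
  - Failures are fail-open and never block the main business flow (still works when Redis is down)
  - aol.output gains human_feedback for the AI to learn from
  - Snooze prevents repeated alerts (24h TTL)
  - Reuses the existing drift button pattern (non-breaking)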

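Tomorrow morning's AI advisory should arrive as:
  - A single message (no duplicates)
  - With 4 buttons (manual feedback loop)
  - After a snooze, the same topic is not pushed again for 24h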
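
The 24h snooze window can be illustrated with an in-memory store and an explicit clock. `SnoozeStore` is a hypothetical stand-in for the Redis key with a TTL; the actual helpers rely on Redis expiry rather than comparing timestamps themselves.

```python
class SnoozeStore:
    """Hypothetical in-memory model of 'aiops:snooze:{type}:{target}' keys with a TTL."""

    def __init__(self) -> None:
        self._expires_at: dict[str, float] = {}

    @staticmethod
    def _key(advisory_type: str, target: str) -> str:
        return f"aiops:snooze:{advisory_type}:{target}"

    def set_snooze(self, advisory_type: str, target: str, now: float, hours: int = 24) -> None:
        self._expires_at[self._key(advisory_type, target)] = now + hours * 3600

    def is_snoozed(self, advisory_type: str, target: str, now: float) -> bool:
        expires_at = self._expires_at.get(self._key(advisory_type, target))
        return expires_at is not None and now < expires_at


if __name__ == "__main__":
    store = SnoozeStore()
    store.set_snooze("rule_quality", "HighCPU", now=0.0)
    print(store.is_snoozed("rule_quality", "HighCPU", now=23 * 3600))  # inside the window
    print(store.is_snoozed("rule_quality", "HighCPU", now=24 * 3600))  # window elapsed
```

Because the key includes both the advisory type and the target, snoozing one host or rule leaves every other target unaffected.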

view/produce_cmd (P1) are left for the next session (the AI proactively gathers evidence via MCP, and the LLM produces the kubectl command).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:02:57 +08:00

263 lines
9.7 KiB
Python

"""
AI Advisory Helpers: Leader Lock + Snooze + Inline Keyboard (ADR-092 upgrade)
=============================================================================
Commander's feedback, 2026-04-19:
- Multiple Pods push duplicate Telegram messages (no leader mechanism)
- Plain text with no buttons (no feedback loop)
- The AI only "advises" and never "executes" (view_detail is missing)

This helper module is shared by the 4 LLM scanners and provides:
1. try_acquire_daily_lock / try_acquire_hourly_lock: Redis SETNX, so only the
   leader Pod runs the job (avoids duplicate runs across replicas)
2. is_snoozed / set_snooze: check or set a 24h snooze (avoids repeated alerts
   for the same host/rule)
3. build_ai_advisory_keyboard: a unified set of 4 buttons
   (handled / snooze 24h / view details / produce command)

callback_data format (contract with telegram_gateway):
    'ai_advisory_{action}:{advisory_type}:{advisory_id}'
    action: handled / snooze / view / produce_cmd
    advisory_type: capacity_forecast / compliance_posture / rule_quality / coverage_gap
    advisory_id: aol.op_id (UUID) or a composite key (host@date)

2026-04-19 ogt + Claude Opus 4.7 (1M context) Asia/Taipei
ADR-092 § AI Decision LLM extension layer, P0 fix
"""
from __future__ import annotations

from typing import Any

import structlog

logger = structlog.get_logger(__name__)

# ============================================================================
# Leader Lock: in multi-Pod deployments only the leader runs the daily job
# ============================================================================


async def try_acquire_daily_lock(job_name: str) -> bool:
    """
    Compete for today's lock with Redis SETNX. Only the Pod that wins runs the daily job.

    key: aiops:daily_lock:{job_name}:{YYYY-MM-DD-Taipei}
    TTL: 25h (covers the daily tick plus a safety margin for time skew)

    Returns:
        True: lock acquired (this Pod runs the job)
        False: another Pod already holds the lock (this Pod skips)

    Failure safety: if Redis is down, treat the lock as acquired (fail-open).
    This never blocks the main flow; it may cause one duplicate push across
    Pods, which is better than no push at all.
    """
    try:
        from src.core.redis_client import get_redis
        from src.utils.timezone import now_taipei

        today = now_taipei().date().isoformat()
        key = f"aiops:daily_lock:{job_name}:{today}"
        redis = await get_redis()
        # NX: set only if the key does not exist. EX: expire after 25h.
        # Returns True when set, None otherwise.
        acquired = await redis.set(key, "1", nx=True, ex=25 * 3600)
        if acquired:
            logger.info("daily_lock_acquired", job=job_name, date=today, key=key)
            return True
        logger.info("daily_lock_skipped_other_pod", job=job_name, date=today)
        return False
    except Exception as e:
        logger.warning("daily_lock_redis_failed_fail_open", job=job_name, error=str(e))
        return True  # fail-open


async def try_acquire_hourly_lock(job_name: str) -> bool:
    """
    Same as the daily lock but with hourly granularity. Intended for scanners
    that run every hour, such as coverage_evaluator.

    key: aiops:hourly_lock:{job_name}:{YYYY-MM-DD-HH-Taipei}
    TTL: 75 minutes (covers the hourly tick plus a buffer)

    Returns True when the lock is acquired or Redis fails (fail-open),
    False when another Pod already holds the lock for this hour.
    """
    try:
        from src.core.redis_client import get_redis
        from src.utils.timezone import now_taipei

        now = now_taipei()
        slot = f"{now.date().isoformat()}-{now.hour:02d}"
        key = f"aiops:hourly_lock:{job_name}:{slot}"
        redis = await get_redis()
        acquired = await redis.set(key, "1", nx=True, ex=75 * 60)
        if acquired:
            logger.info("hourly_lock_acquired", job=job_name, slot=slot)
            return True
        logger.info("hourly_lock_skipped_other_pod", job=job_name, slot=slot)
        return False
    except Exception as e:
        logger.warning("hourly_lock_redis_failed_fail_open", job=job_name, error=str(e))
        return True  # fail-open


# ============================================================================
# Snooze: suppress repeated alerts after "snooze 24h" is pressed
# ============================================================================


async def is_snoozed(advisory_type: str, target: str) -> bool:
    """
    Check whether an advisory is snoozed (after a human pressed "snooze 24h").

    key: aiops:snooze:{advisory_type}:{target}
    target: host IP / rule_name / worst_dimension, etc.

    Returns:
        True: snoozed, the Telegram push should be skipped
        False: not snoozed, or the Redis check failed
    """
    try:
        from src.core.redis_client import get_redis

        redis = await get_redis()
        key = f"aiops:snooze:{advisory_type}:{target}"
        val = await redis.get(key)
        return val is not None
    except Exception as e:
        logger.warning("snooze_check_failed", type=advisory_type, target=target, error=str(e))
        return False


async def set_snooze(advisory_type: str, target: str, hours: int = 24) -> None:
    """Set the snooze key with a TTL. Called by the callback handler."""
    try:
        from src.core.redis_client import get_redis

        redis = await get_redis()
        key = f"aiops:snooze:{advisory_type}:{target}"
        await redis.set(key, "1", ex=hours * 3600)
        logger.info("snooze_set", type=advisory_type, target=target, hours=hours)
    except Exception as e:
        logger.warning("snooze_set_failed", type=advisory_type, target=target, error=str(e))


# ============================================================================
# Inline Keyboard: unified button layout for AI advisories
# ============================================================================


def build_ai_advisory_keyboard(
    advisory_type: str,
    advisory_id: str,
    include_view: bool = True,
    include_produce_cmd: bool = False,
) -> dict[str, Any]:
    """
    Build the inline_keyboard for an AI advisory message.

    callback_data format: 'ai_advisory_{action}:{advisory_type}:{advisory_id}'

    Args:
        advisory_type: capacity_forecast / compliance_posture / rule_quality / coverage_gap
        advisory_id: aol.op_id or a composite key
        include_view: whether to include the "🔍 view details" button (P1, currently mostly disabled)
        include_produce_cmd: whether to include the "📋 produce command" button
            (P1, requires the LLM to produce a kubectl command)

    Returns:
        dict: {"inline_keyboard": [[...]]}, which can be passed directly as
        reply_markup to Telegram sendMessage
    """

    def cb(action: str) -> str:
        return f"ai_advisory_{action}:{advisory_type}:{advisory_id}"

    # Row 1: "handled" and "snooze 24h"
    row1 = [
        {"text": "✅ 已處理", "callback_data": cb("handled")},
        {"text": "😴 忽略 24h", "callback_data": cb("snooze")},
    ]
    # Row 2 (optional): "view details" and "produce kubectl command"
    row2: list[dict[str, str]] = []
    if include_view:
        row2.append({"text": "🔍 查看詳情", "callback_data": cb("view")})
    if include_produce_cmd:
        row2.append({"text": "📋 產 kubectl 指令", "callback_data": cb("produce_cmd")})
    keyboard: dict[str, Any] = {"inline_keyboard": [row1]}
    if row2:
        keyboard["inline_keyboard"].append(row2)
    return keyboard


# ============================================================================
# Callback Handler: handles the "handled" and "snooze 24h" buttons
# ============================================================================


async def handle_ai_advisory_callback(
    action: str,
    advisory_type: str,
    advisory_id: str,
    username: str,
) -> dict[str, Any]:
    """
    Handle an ai_advisory_* callback.

    Args:
        action: handled / snooze / view / produce_cmd
        advisory_type: one of the 4 advisory types
        advisory_id: the matching aol.op_id or composite key
        username: Telegram user who pressed the button (recorded in the feedback)

    Returns:
        {success: bool, feedback_text: str (shown via Telegram answer_callback)}

    Logic:
        - handled: UPDATE aol.output jsonb += {human_feedback, by, at}
        - snooze: set_snooze(24h) + UPDATE aol
        - view: (P1 TODO, future LLM-generated details)
        - produce_cmd: (P1 TODO, future LLM-generated kubectl command)
    """
    feedback_text = ""
    success = False
    if action == "handled":
        success = await _write_human_feedback(advisory_id, "handled", username)
        feedback_text = f"✅ 已記錄「已處理」by {username}"  # "Recorded as handled by ..."
    elif action == "snooze":
        # advisory_id itself is the snooze target (host / rule_name, etc.)
        await set_snooze(advisory_type, advisory_id, hours=24)
        success = await _write_human_feedback(advisory_id, "snoozed_24h", username)
        feedback_text = "😴 已忽略 24 小時"  # "Snoozed for 24 hours"
    elif action == "view":
        feedback_text = "🔍 詳情功能下階段實作"  # P1: "details coming in the next phase"
    elif action == "produce_cmd":
        feedback_text = "📋 產指令功能下階段實作"  # P1: "command generation coming in the next phase"
    else:
        feedback_text = f"❓ 未知 action: {action}"  # "unknown action"
    return {"success": success, "feedback_text": feedback_text}


async def _write_human_feedback(advisory_id: str, feedback: str, username: str) -> bool:
    """UPDATE aol.output.human_feedback so the AI can learn from it."""
    try:
        from sqlalchemy import text as _sql
        from src.db.base import get_db_context
        from src.utils.timezone import now_taipei

        async with get_db_context() as db:
            # advisory_id may be an op_id, an explicit advisory_id, a host, or a rule name,
            # so every candidate column is matched.
            await db.execute(
                _sql("""
                    UPDATE automation_operation_log
                    SET output = output
                        || jsonb_build_object(
                            'human_feedback', :fb,
                            'human_feedback_by', :who,
                            'human_feedback_at', :at
                        )
                    WHERE op_id::text = :aid
                       OR (input::jsonb->>'advisory_id' = :aid)
                       OR (output::jsonb->>'host' = :aid)
                       OR (output::jsonb->>'rule_name' = :aid)
                """),
                {
                    "fb": feedback,
                    "who": username[:50],  # cap the stored username at 50 characters
                    "at": now_taipei().isoformat(),
                    "aid": advisory_id,
                },
            )
        # Note: True means the statement executed, not that a row was matched.
        return True
    except Exception as e:
        logger.warning("write_human_feedback_failed", advisory_id=advisory_id, error=str(e))
        return False