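Screenshot feedback from the Commander (統帥), 2026-04-19:
1. The same alert was pushed twice in a row at 22:44 (every Pod runs the daily loop).
2. Plain text with no buttons (no feedback loop; the AI only advises and never executes).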
New file services/ai_advisory_helpers.py (~240 lines):
- try_acquire_daily_lock(job_name): Redis SETNX on key 'aiops:daily_lock:{job}:{date}',
  TTL 25h, fail-open (if Redis is down the push still goes out; nothing blocks).
- try_acquire_hourly_lock(job_name): the hourly variant of the above (used by coverage_evaluator).
- is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}', TTL 24h.
- build_ai_advisory_keyboard: a unified set of 4 buttons:
  ✅ 已處理 (Handled) / 😴 忽略 24h (Snooze 24h) / 🔍 查看詳情 (View details) / 📋 產 kubectl 指令 (Generate kubectl command)
  callback_data format: 'ai_advisory_{action}:{type}:{id}'
- handle_ai_advisory_callback: handles the handled and snooze actions by writing aol.output.human_feedback;
  view/produce_cmd are deferred to P1.
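
The callback_data layout above can be checked against Telegram's documented limit of 1-64 bytes for `callback_data`. The following is a minimal sketch, not code from this commit: `build_callback_data` and `fits_telegram_limit` are hypothetical names, and the host string and UUID are made-up sample values.

```python
TELEGRAM_CALLBACK_DATA_MAX_BYTES = 64  # Bot API limit for InlineKeyboardButton.callback_data


def build_callback_data(action: str, advisory_type: str, advisory_id: str) -> str:
    """Build 'ai_advisory_{action}:{type}:{id}' (same layout as the commit message)."""
    return f"ai_advisory_{action}:{advisory_type}:{advisory_id}"


def fits_telegram_limit(callback_data: str) -> bool:
    """Telegram counts bytes, not characters, so measure the UTF-8 encoding."""
    return 1 <= len(callback_data.encode("utf-8")) <= TELEGRAM_CALLBACK_DATA_MAX_BYTES


# Hypothetical sample ids: a short host key and a 36-character UUID.
short = build_callback_data("snooze", "capacity_forecast", "10.0.0.5")
long = build_callback_data(
    "handled", "capacity_forecast", "123e4567-e89b-12d3-a456-426614174000"
)

print(len(short.encode("utf-8")), fits_telegram_limit(short))
print(len(long.encode("utf-8")), fits_telegram_limit(long))
```

If advisory_id is a full UUID, the longer advisory_type values push the string past 64 bytes, so a shorter id or a server-side lookup key may be worth considering. That is a design option, not something this commit implements.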
4 LLM scanners now use the helper:
- capacity_forecaster: daily_lock + per-host snooze check + buttons
- compliance_scanner: daily_lock (cron only) + per-date snooze + buttons
- coverage_evaluator: hourly_lock + snooze per worst_dimension + buttons
- hermes_rule_quality: daily_lock + snooze per primary rule + buttons
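
The lock, then snooze check, then push order shared by the four scanners can be sketched roughly as follows. This is an illustration under assumptions: `push_if_allowed` and the stub callables are hypothetical and stand in for the real helpers and the Telegram send call, whose actual signatures in the scanners are not shown in this commit.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


async def push_if_allowed(
    job_name: str,
    advisory_type: str,
    targets: list[str],
    acquire_lock: Callable[[str], Awaitable[bool]],
    is_snoozed: Callable[[str, str], Awaitable[bool]],
    send: Callable[[str], Awaitable[None]],
) -> list[str]:
    """Return the targets that were actually pushed."""
    if not await acquire_lock(job_name):  # another Pod already owns this run
        return []
    pushed: list[str] = []
    for target in targets:
        if await is_snoozed(advisory_type, target):  # a human pressed snooze
            continue
        await send(target)
        pushed.append(target)
    return pushed


async def _demo() -> list[str]:
    # Stubs standing in for Redis-backed helpers and the Telegram gateway.
    async def always_leader(_job: str) -> bool:
        return True

    async def snoozed(_type: str, target: str) -> bool:
        return target == "host-b"

    async def send(_target: str) -> None:
        return None

    return await push_if_allowed(
        "capacity_forecaster",
        "capacity_forecast",
        ["host-a", "host-b", "host-c"],
        always_leader,
        snoozed,
        send,
    )


result = asyncio.run(_demo())
print(result)
```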
telegram_gateway.py:
  handle_callback gains an 'ai_advisory_*' route (after the step 1.85 drift route)
  New method _handle_ai_advisory_action:
    parse payload 'type:id' → call handle_ai_advisory_callback
    → answer_callback (Telegram toast feedback)
    → return dict (info_action=True for view/produce_cmd)
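
One way the routing step could split a callback string into its three fields is sketched below. `parse_ai_advisory_callback` and `AdvisoryCallback` are hypothetical names used for illustration; the real parsing lives in telegram_gateway.py and may differ.

```python
from typing import NamedTuple, Optional

PREFIX = "ai_advisory_"


class AdvisoryCallback(NamedTuple):
    action: str
    advisory_type: str
    advisory_id: str


def parse_ai_advisory_callback(data: str) -> Optional[AdvisoryCallback]:
    """Parse 'ai_advisory_{action}:{type}:{id}'; return None for anything else."""
    if not data.startswith(PREFIX):
        return None
    # maxsplit=2 keeps any ':' inside the id (the last field) intact.
    parts = data[len(PREFIX):].split(":", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return AdvisoryCallback(*parts)


parsed = parse_ai_advisory_callback("ai_advisory_snooze:capacity_forecast:10.0.0.5@2026-04-19")
print(parsed)
```

Splitting with a maximum of two separators is a deliberate choice in this sketch: a composite id that happens to contain a colon would otherwise be truncated.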
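Alignment with the Commander's iron rules:
✅ In multi-Pod deployments only the leader pushes (Redis SETNX guarantees idempotency while Redis is reachable)
✅ Failures are fail-open and never block the main business flow (still works when Redis is down)
✅ aol.output gains human_feedback for the AI to learn from
✅ snooze prevents repeated alerts (24h TTL)
✅ Reuses the existing drift button pattern (non-breaking)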
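
The first two points trade off against each other, which a small simulation can make concrete. This sketch uses in-memory stand-ins (`FakeRedis`, `BrokenRedis`, and `try_lock` are hypothetical, not the project's `get_redis` client) and only mimics the `set(..., nx=True, ex=...)` call shape that redis-py exposes.

```python
from __future__ import annotations

import asyncio
from typing import Optional


class FakeRedis:
    """In-memory stand-in that mimics only SET ... NX (the TTL is ignored)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None):
        if nx and key in self._data:
            return None  # redis-py also returns None when NX prevents the write
        self._data[key] = value
        return True


class BrokenRedis:
    """Stand-in for an unreachable Redis."""

    async def set(self, *args, **kwargs):
        raise ConnectionError("redis unavailable")


async def try_lock(redis, job: str, day: str) -> bool:
    try:
        key = f"aiops:daily_lock:{job}:{day}"
        return bool(await redis.set(key, "1", nx=True, ex=25 * 3600))
    except Exception:
        return True  # fail-open: every Pod proceeds, so duplicates are possible


async def _simulate(redis, pods: int) -> list[bool]:
    attempts = (try_lock(redis, "capacity_forecaster", "2026-04-19") for _ in range(pods))
    return list(await asyncio.gather(*attempts))


healthy = asyncio.run(_simulate(FakeRedis(), 3))
degraded = asyncio.run(_simulate(BrokenRedis(), 3))
print(healthy, degraded)
```

With a healthy store exactly one of the three simulated Pods wins; with the broken store all three proceed, which is the duplicate-push case the fail-open policy knowingly accepts.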
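What tomorrow morning's AI advisory push should look like:
- A single message (no duplicates)
- 4 buttons (manual feedback loop)
- After snooze, the same topic is not pushed again for 24h
view/produce_cmd (P1) are left for the next session (the AI proactively gathers evidence via MCP, and the LLM generates the kubectl command).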
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
263 lines
9.7 KiB
Python
"""
AI Advisory Helpers: Leader Lock + Snooze + Inline Keyboard (ADR-092 upgrade)
=============================================================================
Feedback from the Commander (統帥), 2026-04-19:
- Multiple Pods pushed the same Telegram message (no leader mechanism)
- Plain text with no buttons (no feedback loop)
- The AI only "advises" and never "executes" (view_detail missing)

This helper is shared by the 4 LLM scanners and provides:
1. try_acquire_daily_lock / try_acquire_hourly_lock: Redis SET NX, so only the
   leader Pod runs a scheduled job (avoids duplicate runs across replicas)
2. is_snoozed / set_snooze: check or set a 24h snooze (avoids repeated alerts
   for the same host/rule)
3. build_ai_advisory_keyboard: unified 4 buttons (Handled / Snooze 24h /
   View details / Generate command)

callback_data format (consumed by telegram_gateway):
    'ai_advisory_{action}:{advisory_type}:{advisory_id}'
    action: handled / snooze / view / produce_cmd
    advisory_type: capacity_forecast / compliance_posture / rule_quality / coverage_gap
    advisory_id: aol.op_id (UUID) or a composite key (host@date)

2026-04-19 ogt + Claude Opus 4.7 (1M context) Asia/Taipei
ADR-092 § AI Decision LLM extension layer, P0 fix
"""

from __future__ import annotations

from datetime import date as _date
from typing import Any

import structlog

logger = structlog.get_logger(__name__)


# ============================================================================
# Leader Lock: in multi-Pod deployments only the leader runs the daily job
# ============================================================================

async def try_acquire_daily_lock(job_name: str) -> bool:
    """
    Compete for today's lock with Redis SET NX. Only the winning Pod runs the daily job.

    key: aiops:daily_lock:{job_name}:{YYYY-MM-DD} (Asia/Taipei date)
    TTL: 25h (covers the daily tick plus a safety margin for clock skew)

    Returns:
        True: this Pod acquired the lock (it runs the job)
        False: another Pod already holds it (this Pod skips)

    Failure policy: if Redis is down, behave as if the lock was acquired
    (fail-open). The main flow is not blocked; this may cause one multi-Pod
    duplicate push, which is preferable to no push at all.
    """
    try:
        from src.core.redis_client import get_redis
        from src.utils.timezone import now_taipei

        today = now_taipei().date().isoformat()
        key = f"aiops:daily_lock:{job_name}:{today}"

        redis = await get_redis()
        # NX: set only if the key does not exist. EX: 25h. Returns True or None.
        acquired = await redis.set(key, "1", nx=True, ex=25 * 3600)
        if acquired:
            logger.info("daily_lock_acquired", job=job_name, date=today, key=key)
            return True
        logger.info("daily_lock_skipped_other_pod", job=job_name, date=today)
        return False
    except Exception as e:
        logger.warning("daily_lock_redis_failed_fail_open", job=job_name, error=str(e))
        return True  # fail-open


async def try_acquire_hourly_lock(job_name: str) -> bool:
    """
    Same as the daily lock but with hourly granularity. Intended for scanners
    that run every hour, such as coverage_evaluator.

    key: aiops:hourly_lock:{job_name}:{YYYY-MM-DD-HH} (Asia/Taipei time)
    TTL: 75 minutes (covers the hourly tick plus a buffer)
    """
    try:
        from src.core.redis_client import get_redis
        from src.utils.timezone import now_taipei

        now = now_taipei()
        slot = f"{now.date().isoformat()}-{now.hour:02d}"
        key = f"aiops:hourly_lock:{job_name}:{slot}"

        redis = await get_redis()
        acquired = await redis.set(key, "1", nx=True, ex=75 * 60)
        if acquired:
            logger.info("hourly_lock_acquired", job=job_name, slot=slot)
            return True
        logger.info("hourly_lock_skipped_other_pod", job=job_name, slot=slot)
        return False
    except Exception as e:
        logger.warning("hourly_lock_redis_failed_fail_open", job=job_name, error=str(e))
        return True  # fail-open, same policy as the daily lock


# ============================================================================
# Snooze: suppress repeated alerts after "Snooze 24h" is pressed
# ============================================================================

async def is_snoozed(advisory_type: str, target: str) -> bool:
    """
    Check whether an advisory is snoozed (after a human pressed "Snooze 24h").

    key: aiops:snooze:{advisory_type}:{target}
    target: host IP / rule_name / worst_dimension, etc.

    Returns:
        True: snoozed, so the Telegram push should be skipped
        False: not snoozed, or the Redis check failed
    """
    try:
        from src.core.redis_client import get_redis

        redis = await get_redis()
        key = f"aiops:snooze:{advisory_type}:{target}"
        val = await redis.get(key)
        return val is not None
    except Exception as e:
        logger.warning("snooze_check_failed", type=advisory_type, target=target, error=str(e))
        return False


async def set_snooze(advisory_type: str, target: str, hours: int = 24) -> None:
    """Set the snooze key with a TTL. Called by the callback handler."""
    try:
        from src.core.redis_client import get_redis

        redis = await get_redis()
        key = f"aiops:snooze:{advisory_type}:{target}"
        await redis.set(key, "1", ex=hours * 3600)
        logger.info("snooze_set", type=advisory_type, target=target, hours=hours)
    except Exception as e:
        logger.warning("snooze_set_failed", type=advisory_type, target=target, error=str(e))


# ============================================================================
# Inline Keyboard: unified button layout for AI advisories
# ============================================================================

def build_ai_advisory_keyboard(
    advisory_type: str,
    advisory_id: str,
    include_view: bool = True,
    include_produce_cmd: bool = False,
) -> dict[str, Any]:
    """
    Build the inline_keyboard for an AI advisory message.

    callback_data format: 'ai_advisory_{action}:{advisory_type}:{advisory_id}'

    Args:
        advisory_type: capacity_forecast / compliance_posture / rule_quality / coverage_gap
        advisory_id: aol.op_id or a composite key
        include_view: whether to include the "🔍 查看詳情" (View details) button
            (P1; currently disabled for most callers)
        include_produce_cmd: whether to include the "📋 產 kubectl 指令" (Generate
            kubectl command) button (P1; needs the LLM to generate the command)

    Returns:
        dict: {"inline_keyboard": [[...]]}, ready to pass as reply_markup to
        Telegram sendMessage
    """
    cb = lambda action: f"ai_advisory_{action}:{advisory_type}:{advisory_id}"  # noqa: E731

    # Row 1 is always present: Handled / Snooze 24h.
    row1 = [
        {"text": "✅ 已處理", "callback_data": cb("handled")},
        {"text": "😴 忽略 24h", "callback_data": cb("snooze")},
    ]
    # Row 2 is optional: View details / Generate kubectl command.
    row2 = []
    if include_view:
        row2.append({"text": "🔍 查看詳情", "callback_data": cb("view")})
    if include_produce_cmd:
        row2.append({"text": "📋 產 kubectl 指令", "callback_data": cb("produce_cmd")})

    keyboard = {"inline_keyboard": [row1]}
    if row2:
        keyboard["inline_keyboard"].append(row2)
    return keyboard


# ============================================================================
# Callback Handler: handles the "Handled" and "Snooze 24h" buttons
# ============================================================================

async def handle_ai_advisory_callback(
    action: str,
    advisory_type: str,
    advisory_id: str,
    username: str,
) -> dict[str, Any]:
    """
    Handle an ai_advisory_* callback.

    Args:
        action: handled / snooze / view / produce_cmd
        advisory_type: one of the 4 advisory types
        advisory_id: the matching aol.op_id or a composite key
        username: the Telegram user who pressed the button

    Returns:
        {success: bool, feedback_text: str (shown via Telegram answer_callback)}

    Logic:
        - handled: UPDATE aol.output jsonb += {human_feedback, by, at}
        - snooze: set_snooze(24h) + UPDATE aol
        - view: (P1 TODO: the LLM will produce the details)
        - produce_cmd: (P1 TODO: the LLM will produce the kubectl command)
    """
    feedback_text = ""
    success = False

    if action == "handled":
        success = await _write_human_feedback(advisory_id, "handled", username)
        feedback_text = f"✅ 已記錄「已處理」by {username}"  # "Recorded as handled"
    elif action == "snooze":
        # advisory_id doubles as the snooze target (host / rule_name, etc.)
        await set_snooze(advisory_type, advisory_id, hours=24)
        success = await _write_human_feedback(advisory_id, "snoozed_24h", username)
        feedback_text = "😴 已忽略 24 小時"  # "Snoozed for 24 hours"
    elif action == "view":
        feedback_text = "🔍 詳情功能下階段實作"  # P1: details come in the next phase
    elif action == "produce_cmd":
        feedback_text = "📋 產指令功能下階段實作"  # P1: command generation comes next
    else:
        feedback_text = f"❓ 未知 action: {action}"  # unknown action

    return {"success": success, "feedback_text": feedback_text}


async def _write_human_feedback(advisory_id: str, feedback: str, username: str) -> bool:
    """UPDATE aol.output.human_feedback so the AI can learn from it."""
    try:
        from sqlalchemy import text as _sql
        from src.db.base import get_db_context
        from src.utils.timezone import now_taipei

        async with get_db_context() as db:
            # Merge the feedback fields into the existing output jsonb.
            # advisory_id may be an op_id, an explicit advisory_id, a host, or a rule name.
            await db.execute(
                _sql("""
                    UPDATE automation_operation_log
                    SET output = output
                        || jsonb_build_object(
                            'human_feedback', :fb,
                            'human_feedback_by', :who,
                            'human_feedback_at', :at
                        )
                    WHERE op_id::text = :aid
                       OR (input::jsonb->>'advisory_id' = :aid)
                       OR (output::jsonb->>'host' = :aid)
                       OR (output::jsonb->>'rule_name' = :aid)
                """),
                {
                    "fb": feedback,
                    "who": username[:50],
                    "at": now_taipei().isoformat(),
                    "aid": advisory_id,
                },
            )
        # Note: True means the statement ran, not that any row matched.
        return True
    except Exception as e:
        logger.warning("write_human_feedback_failed", advisory_id=advisory_id, error=str(e))
        return False