抑制 EA 無實證空告警
All checks were successful
CD Pipeline / deploy (push) Successful in 1m2s

This commit is contained in:
OoO
2026-05-19 13:04:01 +08:00
parent 6fa425fb41
commit 6788379d10
6 changed files with 144 additions and 28 deletions

View File

@@ -4,6 +4,7 @@
================================================================================
【已完成】
- V10.253 修正 Elephant Alpha L3 HITL 空告警:價格類與資源調配低信心事件若沒有 Hermes/實證資料,只記 suppressed telemetry 與 cooldown不寫 pending human_review、不發 Telegram`resource_optimization` 會保留 queue/load 原始指標供追查。
- V10.251 修正 OpenClaw Q&A 備援遙測Ollama 主路徑仍為 GCP-A → GCP-B → 111Gemini 只記為 `openclaw_qa_gemini_fallback`NIM 只記為 `openclaw_qa_nim`AI Calls 會把 legacy `openclaw_qa + gemini` 標成 Gemini 備援,避免再次誤判 Gemini-first。
- V10.251 穩定 `/growth_analysis` 正式站速度:成長分析快取從單 worker memory 擴充為 `data/growth_analysis_cache.pkl` 跨 worker 共享快取,避免 Gunicorn 冷 worker 偶發掃明細表造成 5 秒級 TTFB補 `tests/test_cache_manager.py` 覆蓋 shared file roundtrip 與清除行為。
- V10.249 收斂 `/observability/ppt_audit_history` 手機與平板第一屏密度:將 4 個產線訊號從 hero 內移出成獨立狀態列,手機版維持 2 欄狀態卡並降低 hero 卡片間距;本機 10 個 AI 觀測台頁面 rendered visual contract 全數通過PPT 頁 hero 高度 desktop/tablet/mobile 為 214/361/398px。

View File

@@ -320,7 +320,7 @@ YOUTUBE_API_KEY = os.getenv('YOUTUBE_API_KEY', '')
# ==========================================
# 系統版本與路徑
# ==========================================
SYSTEM_VERSION = "V10.252"
SYSTEM_VERSION = "V10.253"
LOG_FILE_PATH = os.path.join(BASE_DIR, 'logs/system.log')
public_url = PUBLIC_URL # 用於模板顯示

View File

@@ -1,8 +1,8 @@
# MOMO PRO — AI 競價情報模組 Single Source of Truth
> **最後更新**: 2026-05-18 (台北時間)
> **最後更新**: 2026-05-19 (台北時間)
> **狀態**: 🟢 四 AI Agent 自動化閉環已落地LLM 路由紅線升級為 Ollama-first 三主機級聯Gemini 僅備援 / 鎖定場景
> **適用版本**: V10.179
> **適用版本**: V10.253
---
@@ -105,6 +105,7 @@ SQL漏斗(~300筆)
- Gunicorn runtime 預設 `worker_class = gthread``GUNICORN_THREADS=4``preload_app = False`;此組合讓 HUP 熱重載可用,也避免 Dashboard 長查詢完全阻塞 `/health`
- CD rebuild 模式必須先 build image 成功,再短暫 stop/rm/recreate 三應用容器,避免 no-cache build 造成長時間 502。
- ElephantAlpha 使用 NVIDIA NIM hosted APIproduction 預設模型為 `nvidia/llama-3.3-nemotron-super-49b-v1.5``ELEPHANT_ALPHA_FALLBACK_MODELS` 需保留至少一個可呼叫備援403/404、408/409/425/429、5xx、timeout 與 connection error 必須嘗試下一個模型。
- ElephantAlpha L3 HITL 只允許發送有實證、可審核、可行動的升級告警;價格類 trigger 無 Hermes 具體威脅、或 `resource_optimization` 低信心且沒有具體行動時,只記錄 suppressed escalation telemetry 與 cooldown不寫 pending `human_review`,不發 Telegram 空告警。
- OpenClaw/Hermes embedding 優先呼叫 Ollama `/api/embed`,只在舊節點不支援時 fallback `/api/embeddings`timeout 由 `EMBEDDING_TIMEOUT` / `OLLAMA_EMBED_TIMEOUT` 控制。
- PPT 自動產線由 `momo-scheduler` 依節奏執行 `run_ppt_auto_generation_task(schedule_kind)`:每日 20:30 產日報、週一 20:40 產週報/市場情報、每月 1 日 20:50 產月報與管理型簡報、季初 21:00 產季報、半年初 21:10 產半年報、年初 21:20 產年報,再交給 22:00 `ppt_vision_audit` 做視覺審核;每次嘗試會寫入 `ppt_generation_runs``/observability/ppt_audit_history` 以精準參數檢查目標版本是否已產生,並可用 `/observability/ppt_audit/generate_missing` 手動補齊缺漏,總開關為 `PPT_AUTO_GENERATION_ENABLED`。PPT vision 需 `PPT_VISION_ENABLED=true` 與容器內 LibreOffice`/observability/ppt_audit_file/<filename>` 會把 PPTX 轉成 PDF 快取供站內線上預覽,原始 PPTX 仍保留下載。

View File

@@ -45,9 +45,13 @@
蓋掉原本的 plan 元流程文字。**強制配套限制**
- `asyncio.wait_for(timeout=5)` 短超時Hermes 熱駐留 < 10s但冷啟動會拖到 30s+HITL 訊息延遲不可大於 10s
- Pre-fetch 失敗timeout / 0 threats / 全部缺金額)→ fallback 回原 plan 文字,**不中斷 escalation 主流程**
- Pre-fetch 失敗timeout / 0 threats / 全部缺金額)→ **不送 Telegram、不寫 pending human_review**,只記錄 suppressed escalation telemetry 與 cooldown避免把無實證 plan 當成可審核告警
- 「全部行皆缺金額」也視同無料 fallback避免「乾巴巴兩行 MOMO/PChome 比價」比 plan 文字更空泛
### 規則 1.1 — 非價格類低信心 escalation 必須有可審核內容
2026-05-19 補充:`resource_optimization` 不是 SKU/價格事件,不能套用 Hermes/SKU fallback 模板。當 EA 對 `resource_optimization` 只產出低信心決策且沒有具體可審核行動時,系統只記錄 telemetry 與 cooldown不發 Telegram也不建立 `ai_insights.status='pending'` 的人工審核。這避免「資源調配優化」告警顯示「Hermes 即時威脅清單不可用」這類錯誤診斷。
### 規則 2 — NemoTron 告警必填金額影響量化
新增模組級 helper `_compute_business_impact(threat) -> {revenue_loss_7d, recommended_price}`
@@ -96,6 +100,7 @@
### 正面
- EA 升級審核 Telegram 內容從元流程描述變為「具體 SKU + 價格 + 金額流失 + 建議調價」HITL 真正可決策
- 無實證的價格類與資源調配低信心 escalation 不再打擾人工,避免 Telegram 出現不可批准、不可駁回、不可操作的空告警
- NemoTron 既有告警再升級,每筆都帶可批准/駁回的金額判斷依據
- `momo:eig:` 按鈕首次有對應 handlerHITL 流程閉環完整
- pre-fetch 改用 5s 短超時 + fallback最壞情況退回原 plan 文字,不破壞既有行為
@@ -103,7 +108,7 @@
### 負面 / 風險
- 每次價格類 escalation 多花 ≤ 5sHermes 熱駐留實測 < 10s 但有 timeout整體告警延遲略增
- Hermes 在 5s 內若沒回應,告警內容降級回 plan 文字(仍維持原行為,無新增風險)
- Hermes 在 5s 內若沒回應,價格類 escalation 會被壓制並記錄 telemetry若後續需要追查需從 `ai_calls.meta.suppressed_escalation` 與 scheduler log 觀察
- `gap_pct ≤ 0` 案例的銷量下滑(非價格因素)將完全不顯示流失金額——若統帥需追蹤「非價格流失」需另開告警類型(待後續 ADR
### 監控指標
@@ -118,6 +123,7 @@
- [x] `services/nemoton_dispatcher_service.py` `_compute_business_impact` helper + 三條 dispatch 路徑注入
- [x] `services/elephant_alpha_autonomous_engine.py` `_fetch_hermes_threats_summary` + 5s timeout + fallback
- [x] `services/telegram_bot_service.py` `_handle_event_ignore_callback` + HTML escape + 空 id 拒絕
- [x] 2026-05-19無實證價格類 / `resource_optimization` 低信心 escalation 改為 suppressed telemetry不再送空泛 Telegram
- [x] Critic 審查通過Critical-1 / High-1 / High-2 / Medium-2 / Medium-3 全修)
- [x] Smoke test`_compute_business_impact` 對 gap≤0 / gap=0 / 銷量回升 / bogus type 四案例驗證
- [x] `docs/adr/README.md` 索引加 ADR-021

View File

@@ -123,6 +123,11 @@ _PRICE_RELATED_TRIGGERS = frozenset({
"threat_escalation",
})
# 這些低信心觸發若沒有具體實證,不應升級打擾人工。
_NO_CONCRETE_ESCALATION_SUPPRESSED_TRIGGERS = frozenset({
"resource_optimization",
})
def _zh_trigger(trigger_type: str) -> str:
return _TRIGGER_ZH.get(trigger_type, trigger_type)
@@ -415,8 +420,16 @@ class ElephantAlphaAutonomousEngine:
session.close()
async def _check_resource_optimization_trigger(self, trigger: AutonomousTrigger) -> bool:
return (self._get_action_queue_size() > 10
or self._get_system_load_percentage() > 80)
queue_size = self._get_action_queue_size()
system_load_pct = self._get_system_load_percentage()
trigger.conditions = dict(trigger.conditions or {})
trigger.conditions["_resource_metrics"] = {
"action_queue_size": queue_size,
"system_load_pct": system_load_pct,
"queue_threshold": 10,
"load_threshold_pct": 80,
}
return queue_size > 10 or system_load_pct > 80
async def _check_code_exception_trigger(self, trigger: AutonomousTrigger) -> bool:
containers = trigger.conditions.get("scan_containers", ["momo-pro-system", "momo-scheduler"])
@@ -885,8 +898,73 @@ class ElephantAlphaAutonomousEngine:
except Exception as e:
self._log.error("Telegram audit failed (non-blocking): %s", e)
@staticmethod
def _get_prefetched_concrete_actions(trigger: AutonomousTrigger) -> Optional[List[str]]:
actions = (trigger.conditions or {}).get("_prefetched_hermes_threats")
if not isinstance(actions, list):
return None
cleaned = [str(action).strip() for action in actions if str(action).strip()]
return cleaned[:5] or None
@staticmethod
def _should_suppress_no_concrete_escalation(trigger: AutonomousTrigger) -> bool:
return (
trigger.trigger_type in _PRICE_RELATED_TRIGGERS
or trigger.trigger_type in _NO_CONCRETE_ESCALATION_SUPPRESSED_TRIGGERS
)
def _record_suppressed_escalation(
self,
decision: StrategicDecision,
trigger: AutonomousTrigger,
reason: str,
) -> None:
self._log.warning(
"EA escalation suppressed: trigger=%s reason=%s confidence=%.2f conditions=%s",
trigger.trigger_type,
reason,
decision.confidence,
trigger.conditions,
)
try:
from services.ai_call_logger import log_ai_call
with log_ai_call(
caller="ea_engine",
provider="gcp_ollama",
model="hermes3:latest",
meta={
"suppressed_escalation": True,
"trigger": trigger.trigger_type,
"reason": reason,
"confidence": decision.confidence,
"conditions": trigger.conditions,
},
) as ctx:
ctx.set_tokens(input=0, output=0)
ctx.status = "cache_only"
except Exception:
self._log.warning(
"EA suppressed escalation telemetry failed; trigger=%s",
trigger.trigger_type,
exc_info=True,
)
async def _escalate_to_human(self, decision: StrategicDecision, trigger: AutonomousTrigger) -> None:
self._log.warning("Escalating to human: %s", trigger.trigger_type)
concrete_actions = self._get_prefetched_concrete_actions(trigger)
if not concrete_actions and trigger.trigger_type in _PRICE_RELATED_TRIGGERS:
try:
concrete_actions = await self._fetch_hermes_threats_summary(top_n=5)
except Exception as e:
self._log.warning("Pre-fetch threats raised (non-blocking): %s", e)
concrete_actions = None
if not concrete_actions and self._should_suppress_no_concrete_escalation(trigger):
self._store_escalation(trigger.trigger_type)
self._record_suppressed_escalation(decision, trigger, "no_concrete_evidence")
return
session = get_session()
try:
row = session.execute(
@@ -934,17 +1012,6 @@ class ElephantAlphaAutonomousEngine:
if not dedup_ts or (datetime.now().timestamp() - dedup_ts) / 60 >= cooldown_min:
self._store_escalation(trigger.trigger_type)
# A' 軌:價格類觸發前 pre-fetch Hermes 具體威脅清單,
# 取代「步驟 1:[OpenClaw] 生成策略」這類元流程文字。
# — Claude Opus 4.7 (2026-05-02)
concrete_actions: Optional[List[str]] = None
if trigger.trigger_type in _PRICE_RELATED_TRIGGERS:
try:
concrete_actions = await self._fetch_hermes_threats_summary(top_n=5)
except Exception as e:
self._log.warning("Pre-fetch threats raised (non-blocking): %s", e)
concrete_actions = None
# ─── Operation Ollama-First v5.0 修補:消除空泛幻覺訊息 ───
# 統帥反饋2026-05-03fallback 路徑帶 OpenClaw Gemini plan 文字 +
# decision.reasoning 全是「312 SKU / 23% / 14 項任務」幻覺數字,無 DB 鉤住,
@@ -964,27 +1031,23 @@ class ElephantAlphaAutonomousEngine:
f"參與模組:{', '.join(_AGENT_LABEL.get(a.lower(), a) for a in decision.agents_required)}"
)
else:
# 無實證數據路徑:極簡訊息,明確標註無數據
# 非價格類且允許升級的低信心事件,不能套 Hermes/SKU 模板。
self._log.warning(
"EA escalation 落入 no-concrete-data fallback (trigger=%s)"
"極簡訊息避免 LLM 幻覺數字誤導統帥",
"非價格類診斷訊息避免 LLM 幻覺數字誤導統帥",
trigger.trigger_type
)
ai_actions_payload = [
"⚠️ Hermes 即時威脅清單不可用5s timeout 或無 SKU 命中)",
"📋 建議:手動下 SQL 查詢過去 24h competitor_price_history 確認狀況",
"🔧 或SSH 188 跑 docker exec momo-pro-system python -c "
"'from services.hermes_analyst_service import HermesAnalystService;"
" print(HermesAnalystService().run().threats[:5])'",
f"檢查觸發條件:{json.dumps(trigger.conditions, ensure_ascii=False)[:300]}",
"不執行自動動作;請先在觀測台確認對應資料來源與最近錯誤紀錄。",
]
ai_summary_text = (
f"⚠️ 本訊息為**無實證**告警Hermes pre-fetch 失敗,"
f"以下原始決策內容含 LLM 自由發揮數字(非 DB 數據),請審慎參考。"
f"低信心且缺少可格式化的具體行動;已隱藏 LLM plan 文字,避免把推測當成事實。"
)
ai_cause_text = (
f"觸發類型:{_zh_trigger(trigger.trigger_type)} | "
f"信心度:{decision.confidence:.2f} | "
f"⚠️ 無 Hermes SKU 數據(不顯示 LLM 幻覺 plan 文字)"
f"缺少可直接審核的實證資料"
)
try:

View File

@@ -114,3 +114,48 @@ def test_execute_autonomous_decision_logs_short_circuit_telemetry_failure(monkey
asyncio.run(engine._execute_autonomous_decision(trigger))
assert "EA short-circuit telemetry failed" in caplog.text
def test_escalate_resource_optimization_without_evidence_is_suppressed(monkeypatch):
import services.elephant_alpha_autonomous_engine as engine_module
from services.elephant_alpha_autonomous_engine import (
AutonomousTrigger,
ElephantAlphaAutonomousEngine,
)
from services.elephant_alpha_orchestrator import StrategicDecision
engine = ElephantAlphaAutonomousEngine()
suppressed = []
cooldown = []
def _raise_if_db_opened():
raise AssertionError("no-concrete resource escalation should not write human_review")
monkeypatch.setattr(engine_module, "get_session", _raise_if_db_opened)
monkeypatch.setattr(engine, "_store_escalation", lambda trigger_type: cooldown.append(trigger_type))
monkeypatch.setattr(
engine,
"_record_suppressed_escalation",
lambda decision, trigger, reason: suppressed.append((trigger.trigger_type, reason)),
)
decision = StrategicDecision(
priority="medium",
agents_required=["openclaw"],
reasoning="資源調配建議信心不足",
expected_outcome="待人工確認",
confidence=0.60,
execution_plan=[],
resource_requirements={},
)
trigger = AutonomousTrigger(
trigger_type="resource_optimization",
conditions={"_resource_metrics": {"action_queue_size": 14, "system_load_pct": 52.0}},
threshold=0.6,
enabled=True,
)
asyncio.run(engine._escalate_to_human(decision, trigger))
assert cooldown == ["resource_optimization"]
assert suppressed == [("resource_optimization", "no_concrete_evidence")]