# ADR-124: Global Singleton Decomposition for Multi-tenancy

**Status**: Accepted
**Date**: 2026-05-03 (Taipei)
**Decision maker**: Commander (統帥)
**Scope**: Classification of the 13 global singletons, decomposition strategies, and Tier 3 safety boundaries
**Related**: ADR-111 (bootstrap order), ADR-118 (RLS), INV-9

---

## Background

INV-9 confirmed that the codebase contains **13** global singletons, all of them module-level variables (`_engine: Optional[X] = None`). In a multi-tenant environment these singletons cause two core problems:

1. **State shared across tenants**: components such as `AnomalyCounter` mix data from every tenant in a single instance.
2. **Unclear bootstrap timing**: a singleton is initialized only on its first call, which can happen before the project_id is known.

**Tier 3 restrictions (RED_ZONES.md)**:
- `DecisionManager` (decision_manager.py:1402): no modification without an architecture review
- `TrustEngine` (trust_engine.py:189): core trust computation; modification requires P10 authorization

---

## Decisions

### D1: Four decomposition strategies

**Strategy 1: Platform Singleton (keep, do not decompose)**

These singletons belong to the platform layer by design and need no per-tenant instance:

| Singleton | File | Reason |
|-----------|------|--------|
| `TelegramGateway` (polling lock) | `telegram_gateway.py:1324` | The polling leader is platform-level, not per-tenant |
| `HostRepairAgent` | `host_repair_agent.py` | Repairing a host is a platform operation, not per-tenant |
| `AIProviderRegistry` | `registry.py` | The provider list is platform-level (but needs the `__provider` patch, PR-04) |
| `FailoverAlerter` | `failover_alerter.py` | Alert routing is platform-level |

**Strategy 2: Tenant-Scoped Instance (Phase 2-3, instantiated on demand)**

These singletons must become per-project_id instances:

```python
from typing import Optional

# Before (global singleton)
_anomaly_counter: Optional[AnomalyCounter] = None

def get_anomaly_counter() -> AnomalyCounter:
    global _anomaly_counter
    if _anomaly_counter is None:
        _anomaly_counter = AnomalyCounter()
    return _anomaly_counter

# After (per-tenant): one instance per project_id, created on first use
_anomaly_counters: dict[str, AnomalyCounter] = {}

def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    if project_id not in _anomaly_counters:
        _anomaly_counters[project_id] = AnomalyCounter(project_id=project_id)
    return _anomaly_counters[project_id]
```

Singletons that need this strategy:

| Singleton | File | Decomposition difficulty |
|-----------|------|--------------------------|
| `AnomalyCounter` | `anomaly_counter.py:85` | LOW (PR-11, Phase 2) |
| `ConsensusEngine` | `consensus_engine.py:344` | MEDIUM (Phase 3) |
| `IntentClassifier` | `intent_classifier.py` | MEDIUM (Phase 3) |

**Strategy 3: Context-Injected (Phase 3, dependency injection)**

The instance is injected through `ContextVar` or FastAPI `Depends`; no module-level singleton is used:

```python
from fastapi import Depends

# DecisionManager is no longer a global singleton; it is injected per request
async def get_decision_manager(
    project_id: str = Depends(get_project_id),
    effective_policy: EffectivePolicy = Depends(get_effective_policy),
) -> DecisionManager:
    return DecisionManager(project_id=project_id, policy=effective_policy)
```

Singletons that need this strategy:

| Singleton | File | Notes |
|-----------|------|-------|
| `DecisionManager` | `decision_manager.py:1402` | Tier 3! Requires P10 architecture review |
| `TrustEngine` | `trust_engine.py:189` | Tier 3! Requires P10 authorization |

**Strategy 4: Module-Level Config (keep the singleton, inject config)**

The singleton stays, but its internal state is read dynamically from `project_id` instead of being initialized statically:

```python
# DecisionFusionAdapter: the singleton structure is unchanged, but methods accept project_id
class DecisionFusionAdapter:
    async def fuse(
        self,
        project_id: str,  # new project_id parameter
        decisions: list[Decision],
    ) -> FusedDecision:
        policy = await get_effective_policy(project_id)
        # use policy instead of the static config held on self
        ...
```

### D2: Decomposition priority

**Immediate (Phase 2)**:
- PR-04: `registry.py` `_provider` → `__provider` with a double underscore (encapsulation leak found by INV-9)
- PR-11: per-project `AnomalyCounter` (depends on PR-10 loop tagging)

**Phase 3**:
- `ConsensusEngine` per-tenant instance
- `IntentClassifier` per-tenant instance
- `DecisionFusionAdapter` method signatures gain `project_id`

**Phase 4+ (Tier 3, requires P10 review)**:
- `DecisionManager` → dependency-injection refactor (large effort, needs its own ADR)
- `TrustEngine` → dependency-injection refactor (Tier 3, requires authorization from the chief architect)

### D3: AnomalyCounter decomposition design (Phase 2, PR-11)

AnomalyCounter is the safest starting point for decomposition: its blast radius is small and it has no Tier 3 restriction.

```python
# anomaly_counter.py
import asyncio

_anomaly_counters: dict[str, AnomalyCounter] = {}
_counters_lock = asyncio.Lock()  # serializes first-time creation per project_id

async def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    async with _counters_lock:
        if project_id not in _anomaly_counters:
            _anomaly_counters[project_id] = AnomalyCounter(
                project_id=project_id,
                redis_prefix=f"anomaly:{project_id}:",  # per-tenant Redis key
            )
        return _anomaly_counters[project_id]
```

**Redis key migration** (INV-1 P2 keys):

```
Old: anomaly:counter:{metric}
New: anomaly:{project_id}:counter:{metric}
```

### D4: Safeguards for DecisionManager + TrustEngine (Phase 2, before decomposition)

Until the actual decomposition happens, Phase 2 applies these safeguards:

1. **Context injection**: ensure every DecisionManager method reads `project_id` from `contextvars`.
2. **Redis key isolation**: DecisionManager's internal Redis keys gain a `project_id` prefix (PR-06 already covers part of this).
3. **No direct calls to the global**: add a warning comment at the top of each Tier 3 file (documentation only, not code).

### D5: ConsensusEngine Redis key migration (PR-06)

INV-9 confirmed that the `CONSENSUS_PREFIX` used by `ConsensusEngine` has no project isolation:

```python
# Before (consensus_engine.py)
CONSENSUS_PREFIX = "consensus:"

# After (PR-06, Phase 2)
def get_consensus_prefix(project_id: str) -> str:
    return f"consensus:{project_id}:"
```

---

## Full decomposition plan for the 13 singletons

"Config-Injected" in the table below refers to Strategy 4 (Module-Level Config).

| Singleton | Strategy | Priority | Phase | Tier |
|-----------|----------|----------|-------|------|
| `TelegramGateway` | Platform Singleton (keep) | - | - | - |
| `HostRepairAgent` | Platform Singleton (keep) | - | - | - |
| `AIProviderRegistry` | Platform Singleton + PR-04 | P1 | Phase 2 | - |
| `FailoverAlerter` | Platform Singleton (keep) | - | - | - |
| `AnomalyCounter` | Tenant-Scoped | P1 | Phase 2 | - |
| `ConsensusEngine` | Tenant-Scoped + PR-06 | P1 | Phase 3 | - |
| `IntentClassifier` | Tenant-Scoped | P2 | Phase 3 | - |
| `DecisionFusionAdapter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouterExecutor` | Config-Injected | P2 | Phase 3 | - |
| `DecisionManager` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TrustEngine` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TelegramGateway` (messages) | Context-Injected | P2 | Phase 3 | - |

---

## Consequences

### Benefits

- Per-tenant AnomalyCounter: anomaly counts for EwoooC and AWOOOI no longer interfere with each other
- ConsensusEngine Redis key isolation (PR-06): consensus decisions are not contaminated across tenants
- The Tier 3 singletons (DecisionManager / TrustEngine) have a clear decomposition path while remaining protected, with no rush to change them

### Costs

- Per-tenant AnomalyCounter requires a Redis key migration (counters stored under the old key format are abandoned)
- Large-scale singleton decomposition in Phase 3+ is a major effort (each singleton needs its own PR plus critic review)

### Risks

- `_anomaly_counters: dict` has no GC mechanism of its own, so tenant instances can accumulate in a long-running process
  - Mitigation: use a WeakValueDictionary or periodic cleanup (drop the instance once a tenant has been inactive for a long time)
- A failed Tier 3 singleton decomposition → rolling the architecture back requires a hotfix
  - Mitigation: Tier 3 code must have complete test coverage before any change is made

---

## Acceptance criteria

- [ ] PR-04: `registry.py` `_provider` → `__provider` (Phase 2)
- [ ] PR-06: ConsensusEngine Redis prefix includes project_id (Phase 2)
- [ ] PR-11: per-tenant AnomalyCounter (Phase 2)
- [ ] Before Phase 3: ConsensusEngine / IntentClassifier tenant scoping is complete
- [ ] Before Phase 4+: DecisionManager / TrustEngine decomposition has its own ADR plus P10 authorization

## Related

- ADR-111 (bootstrap order, singleton initialization timing)
- ADR-118 (RLS, tenant isolation requires the correct project_id context)
- INV-9 (full list of the 13 singletons plus file locations)
- PR-04/PR-06/PR-11 (concrete AnoooP Phase 2 PRs)
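
## Appendix: idle-tenant cleanup sketch

The Risks section lists periodic cleanup as one mitigation for the unbounded per-tenant dict. The sketch below shows one way that could look; it is not the implementation planned in PR-11. `TenantRegistry`, `idle_ttl`, `evict_idle`, and the injectable `clock` are hypothetical names introduced for illustration, and the factory stands in for whatever constructs an `AnomalyCounter`.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class _Entry:
    instance: Any      # the per-tenant object, e.g. an AnomalyCounter
    last_used: float   # clock reading at the most recent get()


class TenantRegistry:
    """Hypothetical per-project_id instance cache with idle eviction."""

    def __init__(
        self,
        factory: Callable[[str], Any],
        idle_ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory    # builds an instance for a project_id
        self._idle_ttl = idle_ttl  # seconds of inactivity before eviction
        self._clock = clock        # injectable so tests can control time
        self._entries: Dict[str, _Entry] = {}

    def get(self, project_id: str) -> Any:
        """Return the tenant's instance, creating it on first use."""
        now = self._clock()
        entry = self._entries.get(project_id)
        if entry is None:
            entry = _Entry(self._factory(project_id), now)
            self._entries[project_id] = entry
        entry.last_used = now
        return entry.instance

    def evict_idle(self) -> List[str]:
        """Drop tenants unused for longer than idle_ttl; return their ids."""
        now = self._clock()
        stale = [
            pid
            for pid, entry in self._entries.items()
            if now - entry.last_used > self._idle_ttl
        ]
        for pid in stale:
            del self._entries[pid]
        return stale

    def __len__(self) -> int:
        return len(self._entries)
```

This sketch does no locking. In an asyncio service, `get` and `evict_idle` would presumably run under the same lock that D3 uses for creation, with `evict_idle` called from a periodic background task. An evicted tenant simply gets a fresh instance on its next request, so this only suits components whose durable state lives in Redis rather than in the instance.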