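# ADR-124: Global Singleton Decomposition for Multi-tenancy

- Status: Accepted
- Date: 2026-05-03 (Taipei)
- Decision maker: 統帥 (Commander)
- Scope: classification of the 13 global singletons, their decomposition strategies, and the Tier 3 safety boundary
- Related: ADR-111 (bootstrap order), ADR-118 (RLS), INV-9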
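## Background

INV-9 confirmed that the codebase contains 13 global singletons, all of them module-level variables (`_engine: Optional[X] = None`). In a multi-tenant environment these singletons have two core problems:

- Shared tenant state: components such as `AnomalyCounter` mix the data of every tenant in a single instance.
- Unclear bootstrap timing: each singleton is initialized on first call, which can be triggered before `project_id` is known.

Tier 3 restrictions (RED_ZONES.md):

- `DecisionManager` (`decision_manager.py:1402`): must not be modified without an architecture review.
- `TrustEngine` (`trust_engine.py:189`): core trust computation; modification requires P10 authorization.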
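## Decision

### D1: Four decomposition strategies

#### Strategy 1: Platform Singleton (keep, do not decompose)

These singletons belong to the platform layer by nature and need no per-tenant instance:

| Singleton | File | Reason |
|---|---|---|
| `TelegramGateway` (polling lock) | `telegram_gateway.py:1324` | The polling leader is platform-level, not per-tenant |
| `HostRepairAgent` | `host_repair_agent.py` | Host repair is a platform operation, not per-tenant |
| `AIProviderRegistry` | `registry.py` | The provider list is platform-level (but needs the `__provider` fix, PR-04) |
| `FailoverAlerter` | `failover_alerter.py` | Alert routing is platform-level |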
#### Strategy 2: Tenant-Scoped Instance (Phase 3, instantiated on demand)

These singletons need to become per-`project_id` instances:

```python
from typing import Optional

# Before (global singleton)
_anomaly_counter: Optional[AnomalyCounter] = None

def get_anomaly_counter() -> AnomalyCounter:
    global _anomaly_counter
    if _anomaly_counter is None:
        _anomaly_counter = AnomalyCounter()
    return _anomaly_counter

# After (per-tenant, Phase 3)
_anomaly_counters: dict[str, AnomalyCounter] = {}

def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    if project_id not in _anomaly_counters:
        _anomaly_counters[project_id] = AnomalyCounter(project_id=project_id)
    return _anomaly_counters[project_id]
```
Singletons that need this strategy:

| Singleton | File | Decomposition difficulty |
|---|---|---|
| `AnomalyCounter` | `anomaly_counter.py:85` | LOW (PR-11, Phase 2) |
| `ConsensusEngine` | `consensus_engine.py:344` | MEDIUM (Phase 3) |
| `IntentClassifier` | `intent_classifier.py` | MEDIUM (Phase 3) |
#### Strategy 3: Context-Injected (Phase 3, dependency injection)

Inject the instance through `ContextVar` or FastAPI `Depends` instead of using a module-level singleton:

```python
from fastapi import Depends

# DecisionManager is no longer a global singleton; it is injected per request
async def get_decision_manager(
    project_id: str = Depends(get_project_id),
    effective_policy: EffectivePolicy = Depends(get_effective_policy),
) -> DecisionManager:
    return DecisionManager(project_id=project_id, policy=effective_policy)
```
Singletons that need this strategy:

| Singleton | File | Notes |
|---|---|---|
| `DecisionManager` | `decision_manager.py:1402` | Tier 3! Requires P10 architecture review |
| `TrustEngine` | `trust_engine.py:189` | Tier 3! Requires P10 authorization |
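Strategy 3 names `ContextVar` as an alternative to `Depends`, but only the `Depends` route is shown. The following is a minimal sketch of the `ContextVar` route, not the project's implementation: `current_project_id`, `bind_project`, `handle` and `demo` are hypothetical names introduced for illustration. It shows that a value bound inside one asyncio task is not visible to a concurrently running task.

```python
import asyncio
import contextvars
from contextlib import contextmanager

# Hypothetical name: carries the tenant of the current request/task.
current_project_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "current_project_id"
)


@contextmanager
def bind_project(project_id: str):
    """Bind project_id for the duration of a request, then restore the old value."""
    token = current_project_id.set(project_id)
    try:
        yield
    finally:
        current_project_id.reset(token)


async def handle(project_id: str) -> str:
    """Stand-in for a request handler that needs the tenant deep in the call stack."""
    with bind_project(project_id):
        await asyncio.sleep(0)  # let the other task run; the binding stays task-local
        return current_project_id.get()


async def demo() -> list[str]:
    # gather() wraps each coroutine in a task, and each task runs in its own
    # copy of the context, so the two bindings cannot overwrite each other.
    return await asyncio.gather(handle("tenant-a"), handle("tenant-b"))
```

Running `asyncio.run(demo())` returns each task's own tenant. A per-request `DecisionManager` could read the variable the same way, at the cost of making the dependency implicit, which is one reason the `Depends` form may be easier to review for Tier 3 code.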
#### Strategy 4: Module-Level Config (keep the singleton, inject the config)

The singleton stays, but its internal state is read dynamically by `project_id` instead of being initialized statically:

```python
# DecisionFusionAdapter: the singleton structure is unchanged, but methods accept project_id
class DecisionFusionAdapter:
    async def fuse(
        self,
        project_id: str,  # new project_id parameter
        decisions: list[Decision],
    ) -> FusedDecision:
        policy = await get_effective_policy(project_id)
        # use policy instead of the static config stored on self
        ...
```
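To make Strategy 4 concrete, here is a runnable sketch under loud assumptions: the `EffectivePolicy` field `min_confidence`, the in-memory `_POLICIES` table, and the simplified `fuse` signature (a list of confidence scores in, a boolean out) are all invented for illustration and do not reflect the real classes. The point is only that one shared adapter instance can behave differently per tenant when the policy is looked up on every call.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectivePolicy:
    # Hypothetical field; the real EffectivePolicy shape is not given in this ADR.
    min_confidence: float


# Stand-in policy store (assumption); a real lookup would query the control plane.
_POLICIES = {
    "tenant-a": EffectivePolicy(min_confidence=0.5),
    "tenant-b": EffectivePolicy(min_confidence=0.9),
}


async def get_effective_policy(project_id: str) -> EffectivePolicy:
    return _POLICIES[project_id]


class DecisionFusionAdapter:
    """One shared instance that keeps no per-tenant state of its own."""

    async def fuse(self, project_id: str, confidences: list[float]) -> bool:
        # The policy is resolved per call, never cached on self.
        policy = await get_effective_policy(project_id)
        mean = sum(confidences) / len(confidences)
        return mean >= policy.min_confidence


adapter = DecisionFusionAdapter()  # the module-level singleton is kept


async def demo() -> tuple[bool, bool]:
    scores = [0.6, 0.8]  # same input for both tenants
    return (
        await adapter.fuse("tenant-a", scores),
        await adapter.fuse("tenant-b", scores),
    )
```

With the same input, the adapter accepts for `tenant-a` and rejects for `tenant-b`, because only the looked-up policy differs.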
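### D2: Decomposition priority

Immediate (Phase 2):

- PR-04: `registry.py` `_provider` → `__provider` double underscore (the encapsulation leak found by INV-9)
- PR-11: `AnomalyCounter` per-project (depends on PR-10 loop tagging)

Phase 3:

- `ConsensusEngine` per-tenant instance
- `IntentClassifier` per-tenant instance
- `DecisionFusionAdapter` method signatures gain `project_id`

Phase 4+ (Tier 3, requires P10 review):

- `DecisionManager` → dependency-injection refactor (a large effort that needs its own ADR)
- `TrustEngine` → dependency-injection refactor (Tier 3, requires authorization from the chief architect)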
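The PR-04 rename relies on Python's name mangling. The sketch below illustrates the mechanism with two hypothetical classes (`LooseRegistry`, `StrictRegistry`); it assumes `_provider` is an attribute assigned inside a class body, which the ADR does not state explicitly. Mangling only applies to names written inside a class definition.

```python
class LooseRegistry:
    def __init__(self) -> None:
        self._provider = "primary"  # single underscore: private by convention only


class StrictRegistry:
    def __init__(self) -> None:
        # Inside the class body this is stored as _StrictRegistry__provider.
        self.__provider = "primary"

    @property
    def provider(self) -> str:
        return self.__provider


loose = LooseRegistry()
loose._provider = "hijacked"  # external code replaces the real attribute

strict = StrictRegistry()
# Outside the class body no mangling happens, so this creates a new,
# unrelated attribute literally named "__provider".
strict.__provider = "hijacked"
```

After these assignments `loose._provider` has been overwritten, while `strict.provider` still returns the original value. Name mangling is a guard against accidental access, not a security boundary: the mangled name `_StrictRegistry__provider` remains reachable by anyone who spells it out.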
### D3: AnomalyCounter decomposition design (Phase 2, PR-11)

`AnomalyCounter` is the safest place to start decomposing (small blast radius, no Tier 3 restriction):

```python
# anomaly_counter.py
import asyncio

_anomaly_counters: dict[str, AnomalyCounter] = {}
_counters_lock = asyncio.Lock()

async def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    # The lock keeps concurrent first calls from creating duplicate instances
    async with _counters_lock:
        if project_id not in _anomaly_counters:
            _anomaly_counters[project_id] = AnomalyCounter(
                project_id=project_id,
                redis_prefix=f"anomaly:{project_id}:",  # per-tenant Redis key
            )
        return _anomaly_counters[project_id]
```
Redis key migration (INV-1 P2 keys):

- Old: `anomaly:counter:{metric}`
- New: `anomaly:{project_id}:counter:{metric}`
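The two key formats above can be captured in a pair of small helpers. This is a sketch only: `counter_key` and `is_legacy_key` are hypothetical names, and the validation rule (no empty value, no `:` inside `project_id`) is an assumption added so the key structure stays parseable.

```python
LEGACY_PREFIX = "anomaly:counter:"


def counter_key(project_id: str, metric: str) -> str:
    """Build a counter key in the new per-tenant format."""
    # Assumed rule: a ':' inside project_id would blur the key's segments.
    if not project_id or ":" in project_id:
        raise ValueError(f"invalid project_id: {project_id!r}")
    return f"anomaly:{project_id}:counter:{metric}"


def is_legacy_key(key: str) -> bool:
    """Return True for keys still in the old, tenant-less format.

    Caveat: a project literally named "counter" would produce new-format keys
    that also match this prefix, so this check assumes no such project exists.
    """
    return key.startswith(LEGACY_PREFIX)
```

A helper like `is_legacy_key` would mainly be useful for locating leftover old-format counters during cleanup; it does not carry their values over to any tenant.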
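### D4: Safeguards for DecisionManager + TrustEngine (before Phase 2)

Before the real decomposition takes place, Phase 2 applies these safeguards:

- Context injection: ensure that the `project_id` used by every DecisionManager method is read from `contextvars`.
- Redis key isolation: Redis keys inside DecisionManager gain a `project_id` prefix (PR-06 already covers part of this).
- No direct calls to the global: add a warning comment at the top of each Tier 3 file (documentation, not code).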
### D5: ConsensusEngine Redis key migration (PR-06)

INV-9 confirmed that the `CONSENSUS_PREFIX` of ConsensusEngine has no project isolation:

```python
# Before (consensus_engine.py)
CONSENSUS_PREFIX = "consensus:"

# After (PR-06, Phase 2)
def get_consensus_prefix(project_id: str) -> str:
    return f"consensus:{project_id}:"
```
## Full decomposition plan for the 13 singletons

| Singleton | Strategy | Priority | Phase | Tier |
|---|---|---|---|---|
| `TelegramGateway` | Platform Singleton (keep) | - | - | - |
| `HostRepairAgent` | Platform Singleton (keep) | - | - | - |
| `AIProviderRegistry` | Platform Singleton + PR-04 | P1 | Phase 2 | - |
| `FailoverAlerter` | Platform Singleton (keep) | - | - | - |
| `AnomalyCounter` | Tenant-Scoped | P1 | Phase 2 | - |
| `ConsensusEngine` | Tenant-Scoped + PR-06 | P1 | Phase 3 | - |
| `IntentClassifier` | Tenant-Scoped | P2 | Phase 3 | - |
| `DecisionFusionAdapter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouterExecutor` | Config-Injected | P2 | Phase 3 | - |
| `DecisionManager` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TrustEngine` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TelegramGateway` (messages) | Context-Injected | P2 | Phase 3 | - |
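## Consequences

### Benefits

- Per-tenant AnomalyCounter: the anomaly counts of EwoooC and AWOOOI no longer interfere with each other.
- ConsensusEngine Redis key isolation (PR-06): consensus decisions no longer leak across tenants.
- The Tier 3 singletons (DecisionManager / TrustEngine) have a clear decomposition path while staying protected and unrushed.

### Costs

- Per-tenant AnomalyCounter requires a Redis key migration (counters stored in the old format are abandoned).
- Large-scale singleton decomposition in Phase 3+ is a major effort (each singleton needs its own PR plus critic review).

### Risks

- The `_anomaly_counters: dict` itself has no GC mechanism, so a long-running process can accumulate tenant instances.
  - Mitigation: `WeakValueDictionary` or periodic cleanup (evict the instance of a tenant that has been inactive for a long time).
- A failed Tier 3 singleton decomposition → the architectural rollback requires a hotfix.
  - Mitigation: Tier 3 code must have full test coverage before any change is made.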
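The periodic-cleanup mitigation for the unbounded per-tenant dict could look like the sketch below. `IdleEvictingRegistry`, its `max_idle` parameter and the injectable `clock` are hypothetical names introduced for illustration; this is not the design chosen by PR-11.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class IdleEvictingRegistry(Generic[T]):
    """Per-tenant instance cache that drops entries idle for more than max_idle seconds."""

    def __init__(
        self,
        factory: Callable[[str], T],
        max_idle: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory
        self._max_idle = max_idle
        self._clock = clock
        self._entries: dict[str, tuple[T, float]] = {}  # project_id -> (instance, last used)

    def get(self, project_id: str) -> T:
        entry = self._entries.get(project_id)
        instance = entry[0] if entry is not None else self._factory(project_id)
        self._entries[project_id] = (instance, self._clock())  # refresh last-used time
        return instance

    def evict_idle(self) -> list[str]:
        """Remove idle entries and return the evicted project_ids."""
        now = self._clock()
        stale = [
            pid for pid, (_, used) in self._entries.items()
            if now - used > self._max_idle
        ]
        for pid in stale:
            del self._entries[pid]
        return stale
```

`evict_idle()` would be called from a periodic task. The sketch is not concurrency-guarded; in async code it would sit behind the same kind of lock used in D3. As for the other option, a `WeakValueDictionary` drops an entry as soon as no caller holds a strong reference to the instance, so any in-memory state on that instance would be lost between calls; explicit idle eviction gives more predictable lifetimes.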
## Acceptance criteria

- PR-04: `registry.py` `_provider` → `__provider` (Phase 2)
- PR-06: ConsensusEngine Redis prefix includes `project_id` (Phase 2)
- PR-11: AnomalyCounter per-tenant (Phase 2)
- Before Phase 3: ConsensusEngine / IntentClassifier tenant-scoped work is complete
- Before Phase 4+: the DecisionManager / TrustEngine decomposition has its own ADR plus P10 authorization
## Related

- ADR-111 (bootstrap order, singleton initialization timing)
- ADR-118 (RLS; tenant isolation needs a correct `project_id` context)
- INV-9 (full list of the 13 singletons and their file locations)
- PR-04/PR-06/PR-11 (the concrete AwoooP Phase 2 PRs)