awoooi/docs/adr/ADR-124-global-singleton-decomposition.md
Your Name 13e51802fe feat(awooop): all Phase 0 ADRs + Phase 1 control plane schema (including four critic fixes)
## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: seven core ADRs, Bootstrap → RLS
- ADR-119~124: six ADRs, SAGA → Singleton Decomposition
- ADR-UI-01~04: four Operator Console UI ADRs

## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + triggers + RLS
  - Step 1: three roles (platform_admin/migration with BYPASSRLS; awooop_app subject to RLS)
  - Step 13: least-privilege GRANTs for awooop_app (7 statements)
  - Step 14: RLS fail-closed; the __platform__ backdoor is removed
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: __mapper_args__ string list → primary_key=True on mapped_column
- C-3: __platform__ RLS backdoor → removed entirely; a BYPASSRLS role is used instead
- C-4: awooop_app role was never created → Step 1 + 7 GRANTs
- M-1: active_pointer_guard SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: idempotency guard added around pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes
- agent_loader.py: hard-coded Mac path → AGENTS_DIR environment variable
- Dockerfile: add the missing COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:37:11 +08:00


ADR-124: Global Singleton Decomposition for Multi-tenancy

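Status: Accepted
Date: 2026-05-03 (Taipei)
Decision maker: 統帥
Scope: classification of the 13 global singletons, decomposition strategies, and the Tier 3 safety boundary
Related: ADR-111 (bootstrap order), ADR-118 (RLS), INV-9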


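Background

INV-9 confirmed that the codebase contains 13 global singletons, all of them module-level variables (_engine: Optional[X] = None). In a multi-tenant environment these singletons have two core problems:

  1. Shared tenant state: components such as AnomalyCounter mix data from all tenants in a single instance.
  2. Unclear bootstrap timing: a singleton is initialized only on first call, which may be triggered before project_id is known.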

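Tier 3 restrictions (RED_ZONES.md):

  • DecisionManager (decision_manager.py:1402): must not be modified without architecture review.
  • TrustEngine (trust_engine.py:189): core trust computation; modification requires P10 authorization.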

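Decision

D1: Four decomposition strategies

Strategy 1: Platform Singleton (kept, not decomposed)

These singletons belong to the platform layer by nature and need no per-tenant instances:

| Singleton | File | Reason |
| --- | --- | --- |
| TelegramGateway (polling lock) | telegram_gateway.py:1324 | The polling leader is platform-level, not per-tenant |
| HostRepairAgent | host_repair_agent.py | Host repair is a platform operation, not per-tenant |
| AIProviderRegistry | registry.py | The provider list is platform-level (but needs the __provider fix, PR-04) |
| FailoverAlerter | failover_alerter.py | Alert routing is platform-level |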

Strategy 2: Tenant-Scoped Instance (Phase 3, instantiated on demand)

These singletons need to become per-project_id instances:

from typing import Optional

# Before (global singleton)
_anomaly_counter: Optional[AnomalyCounter] = None

def get_anomaly_counter() -> AnomalyCounter:
    global _anomaly_counter
    if _anomaly_counter is None:
        _anomaly_counter = AnomalyCounter()
    return _anomaly_counter

# After (per-tenant, Phase 3)
_anomaly_counters: dict[str, AnomalyCounter] = {}

def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    if project_id not in _anomaly_counters:
        _anomaly_counters[project_id] = AnomalyCounter(project_id=project_id)
    return _anomaly_counters[project_id]

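Singletons that need this strategy:

| Singleton | File | Decomposition difficulty |
| --- | --- | --- |
| AnomalyCounter | anomaly_counter.py:85 | LOW (PR-11, Phase 2) |
| ConsensusEngine | consensus_engine.py:344 | MEDIUM (Phase 3) |
| IntentClassifier | intent_classifier.py | MEDIUM (Phase 3) |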
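Strategy 3: Context-Injected (Phase 3, dependency injection)

Inject via ContextVar or FastAPI Depends instead of using a module-level singleton:

from fastapi import Depends

# DecisionManager is no longer a global singleton; it is injected per request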
async def get_decision_manager(
    project_id: str = Depends(get_project_id),
    effective_policy: EffectivePolicy = Depends(get_effective_policy)
) -> DecisionManager:
    return DecisionManager(project_id=project_id, policy=effective_policy)

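Singletons that need this strategy:

| Singleton | File | Notes |
| --- | --- | --- |
| DecisionManager | decision_manager.py:1402 | Tier 3; requires P10 architecture review |
| TrustEngine | trust_engine.py:189 | Tier 3; requires P10 authorization |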

Strategy 4: Module-Level Config (singleton kept, config injected)

The singleton stays, but its internal state is read dynamically from project_id rather than initialized statically:

# DecisionFusionAdapter: the singleton structure is unchanged, but methods accept project_id
class DecisionFusionAdapter:
    async def fuse(
        self,
        project_id: str,    # new project_id parameter
        decisions: list[Decision]
    ) -> FusedDecision:
        policy = await get_effective_policy(project_id)
        # use policy rather than the static config held on self
        ...

D2: Decomposition priority

Immediately (Phase 2):

  • PR-04: registry.py _provider → __provider (double underscore; encapsulation leak found by INV-9)
  • PR-11: AnomalyCounter per-project (depends on PR-10 loop tagging)
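
As background for the PR-04 rename, the sketch below shows what a double-underscore attribute does in Python when it lives inside a class: the name is mangled, so casual access from outside fails. `ProviderRegistrySketch`, its methods, and the example endpoint are hypothetical and are not the real registry.py. Name mangling applies only to attributes referenced inside a class body, so it would not protect a module-level variable.

```python
class ProviderRegistrySketch:
    """Hypothetical stand-in for a provider registry; not the real registry.py."""

    def __init__(self) -> None:
        # Double underscore triggers name mangling: the attribute is stored
        # on the instance as _ProviderRegistrySketch__providers.
        self.__providers: dict[str, str] = {}

    def register(self, name: str, endpoint: str) -> None:
        self.__providers[name] = endpoint

    def names(self) -> list[str]:
        # Return a new sorted list so callers cannot mutate internal state.
        return sorted(self.__providers)


registry = ProviderRegistrySketch()
registry.register("primary", "https://example.invalid/v1")

try:
    registry.__providers  # outside the class body the name is not mangled
    leaked = True
except AttributeError:
    leaked = False
```

Mangling is a convention aid rather than a security boundary: the mangled name remains reachable for anyone who spells it out.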

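Phase 3:

  • ConsensusEngine per-tenant instance
  • IntentClassifier per-tenant instance
  • DecisionFusionAdapter: add project_id to method signatures

Phase 4+ (Tier 3, requires P10 review):

  • DecisionManager → dependency-injection refactor (a large effort that needs its own ADR)
  • TrustEngine → dependency-injection refactor (Tier 3; requires authorization from the chief architect)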

D3: AnomalyCounter decomposition design (Phase 2, PR-11)

AnomalyCounter is the safest starting point for decomposition (small blast radius, no Tier 3 restrictions):

# anomaly_counter.py
import asyncio

_anomaly_counters: dict[str, AnomalyCounter] = {}
_counters_lock = asyncio.Lock()

async def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    async with _counters_lock:
        if project_id not in _anomaly_counters:
            _anomaly_counters[project_id] = AnomalyCounter(
                project_id=project_id,
                redis_prefix=f"anomaly:{project_id}:"  # per-tenant Redis key
            )
        return _anomaly_counters[project_id]

Redis key migration (INV-1 P2 keys):

Old: anomaly:counter:{metric}
New: anomaly:{project_id}:counter:{metric}
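
The old-to-new key mapping above can be sketched as a small pure function. The helper name `to_tenant_key` and the constant `OLD_PREFIX` are assumptions for illustration; the sketch shows only the naming rule, and the Costs section of this ADR states that counters in the old format are abandoned rather than copied.

```python
OLD_PREFIX = "anomaly:counter:"  # legacy layout: anomaly:counter:{metric}


def to_tenant_key(old_key: str, project_id: str) -> str:
    """Map anomaly:counter:{metric} to anomaly:{project_id}:counter:{metric}."""
    if not project_id:
        # Fail closed: never build a key without a tenant.
        raise ValueError("project_id must be non-empty")
    if not old_key.startswith(OLD_PREFIX):
        raise ValueError(f"not a legacy anomaly counter key: {old_key!r}")
    metric = old_key[len(OLD_PREFIX):]
    return f"anomaly:{project_id}:counter:{metric}"
```

The result matches what the redis_prefix in get_anomaly_counter produces once "counter:{metric}" is appended to it.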

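D4: Safeguards for DecisionManager + TrustEngine (before Phase 2)

Until the actual decomposition takes place, the Phase 2 safeguards are:

  1. Context injection: ensure every DecisionManager method reads project_id from contextvars.
  2. Redis key isolation: prefix DecisionManager's internal Redis keys with project_id (PR-06 already covers part of this).
  3. No direct calls to the global: add a warning comment at the top of each Tier 3 file (documentation, not code).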

D5: ConsensusEngine Redis key migration (PR-06)

INV-9 confirmed that ConsensusEngine's CONSENSUS_PREFIX has no project isolation:

# Before (consensus_engine.py)
CONSENSUS_PREFIX = "consensus:"

# After (PR-06, Phase 2)
def get_consensus_prefix(project_id: str) -> str:
    return f"consensus:{project_id}:"

Complete decomposition plan for the 13 singletons

| Singleton | Strategy | Priority | Phase | Tier |
| --- | --- | --- | --- | --- |
| TelegramGateway | Platform Singleton (kept) | - | - | - |
| HostRepairAgent | Platform Singleton (kept) | - | - | - |
| AIProviderRegistry | Platform Singleton + PR-04 | P1 | Phase 2 | - |
| FailoverAlerter | Platform Singleton (kept) | - | - | - |
| AnomalyCounter | Tenant-Scoped | P1 | Phase 2 | - |
| ConsensusEngine | Tenant-Scoped + PR-06 | P1 | Phase 3 | - |
| IntentClassifier | Tenant-Scoped | P2 | Phase 3 | - |
| DecisionFusionAdapter | Config-Injected | P2 | Phase 3 | - |
| AIRouter | Config-Injected | P2 | Phase 3 | - |
| AIRouterExecutor | Config-Injected | P2 | Phase 3 | - |
| DecisionManager | Context-Injected | P3 | Phase 4+ | Tier 3 |
| TrustEngine | Context-Injected | P3 | Phase 4+ | Tier 3 |
| TelegramGateway (messages) | Context-Injected | P2 | Phase 3 | - |

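Consequences

Benefits

  • AnomalyCounter per-tenant: anomaly counts for EwoooC and AWOOOI no longer interfere with each other.
  • ConsensusEngine Redis key isolation (PR-06): consensus decisions do not contaminate other tenants.
  • The Tier 3 singletons (DecisionManager / TrustEngine) have a clear decomposition path, yet stay protected and are not rushed.

Costs

  • AnomalyCounter per-tenant requires a Redis key migration (counters in the old format are abandoned).
  • Large-scale singleton decomposition in Phase 3+ is a major effort (each singleton needs its own PR + critic review).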

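Risks

  • The _anomaly_counters dict has no GC mechanism, so tenant instances may accumulate in long-running processes.
  • Mitigation: WeakValueDictionary, or periodic cleanup (evict an instance once its tenant has been inactive for a long time).
  • A failed Tier 3 singleton decomposition → rolling the architecture back requires a hotfix.
  • Mitigation: Tier 3 must have complete test coverage before any work begins.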
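
The periodic-cleanup mitigation for the tenant-instance risk could take a shape like the one below. It is a sketch under assumptions, not the planned implementation: `IdleEvictingRegistry`, `max_idle_s`, and the injectable `clock` are hypothetical names, and locking is omitted for brevity.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class IdleEvictingRegistry(Generic[T]):
    """Per-tenant instance cache that drops entries idle longer than max_idle_s."""

    def __init__(self, factory: Callable[[str], T], max_idle_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._factory = factory
        self._max_idle_s = max_idle_s
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[T, float]] = {}

    def get(self, project_id: str) -> T:
        now = self._clock()
        entry = self._entries.get(project_id)
        instance = entry[0] if entry else self._factory(project_id)
        self._entries[project_id] = (instance, now)  # refresh last-used time
        return instance

    def evict_idle(self) -> list[str]:
        """Remove and return the project_ids that exceeded the idle limit."""
        now = self._clock()
        stale = [pid for pid, (_, last) in self._entries.items()
                 if now - last > self._max_idle_s]
        for pid in stale:
            del self._entries[pid]
        return stale
```

A WeakValueDictionary behaves differently: an entry disappears as soon as no strong reference to the instance remains, so it only helps when callers hold on to the counter between uses. Explicit idle eviction keeps instances alive across requests and bounds growth by inactivity instead.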

Acceptance criteria

  • PR-04: registry.py _provider → __provider (Phase 2)
  • PR-06: ConsensusEngine Redis prefix includes project_id (Phase 2)
  • PR-11: AnomalyCounter per-tenant (Phase 2)
  • Before Phase 3: ConsensusEngine / IntentClassifier tenant-scoped work is complete
  • Before Phase 4+: DecisionManager / TrustEngine decomposition has its own ADR + P10 authorization

Related

  • ADR-111: bootstrap order (singleton initialization timing)
  • ADR-118: RLS (tenant isolation requires a correct project_id context)
  • INV-9: complete list of the 13 singletons + file locations
  • PR-04/PR-06/PR-11: the concrete AwoooP Phase 2 PRs