## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: Bootstrap → RLS, seven core ADRs
- ADR-119~124: SAGA → Singleton Decomposition, six ADRs
- ADR-UI-01~04: four Operator Console UI ADRs

## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + triggers + RLS
- Step 1: three roles (platform_admin/migration with BYPASSRLS; awooop_app subject to RLS)
- Step 13: GRANT least privileges to awooop_app (7 grants)
- Step 14: RLS fail-closed; remove the `__platform__` backdoor
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: `__mapper_args__` string list → primary_key=True on mapped_column
- C-3: `__platform__` RLS backdoor → removed entirely; use a BYPASSRLS role instead
- C-4: awooop_app role was never created → Step 1 + 7 GRANTs
- M-1: active_pointer_guard SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: idempotency guard added to pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes
- agent_loader.py: hard-coded Mac path → AGENTS_DIR environment variable
- Dockerfile: add the missing COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# ADR-124: Global Singleton Decomposition for Multi-tenancy

**Status**: Accepted
**Date**: 2026-05-03 (Taipei)
**Decision maker**: Commander (統帥)
**Scope**: Classification and decomposition strategy for the 13 global singletons, plus the Tier 3 safety boundary
**Related**: ADR-111 (bootstrap order), ADR-118 (RLS), INV-9

---

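## Background

INV-9 confirmed that the codebase contains **13** global singletons, all of them module-level variables (`_engine: Optional[X] = None`). In a multi-tenant environment these singletons cause two core problems:

1. **Shared tenant state**: components such as `AnomalyCounter` mix data from all tenants in a single instance
2. **Unclear bootstrap timing**: a singleton is initialized only on first call, which can happen before the project_id is known

**Tier 3 restrictions (RED_ZONES.md)**:
- `DecisionManager` (decision_manager.py:1402): must not be modified without an architecture review
- `TrustEngine` (trust_engine.py:189): core trust computation; changes require P10 authorization

---
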
## Decision

### D1: Four decomposition strategies

**Strategy 1: Platform Singleton (keep, do not decompose)**

These singletons belong at the platform layer by design and need no per-tenant instance:

| Singleton | File | Reason |
|-----------|------|--------|
| `TelegramGateway` (polling lock) | `telegram_gateway.py:1324` | The polling leader is platform-level, not per-tenant |
| `HostRepairAgent` | `host_repair_agent.py` | Host repair is a platform operation, not per-tenant |
| `AIProviderRegistry` | `registry.py` | The provider list is platform-level (but needs the `__provider` fix, PR-04) |
| `FailoverAlerter` | `failover_alerter.py` | Alert routing is platform-level |

**Strategy 2: Tenant-Scoped Instance (Phase 3, instantiated on demand)**

These singletons need to become per-project_id instances:

```python
from typing import Optional

# Before (global singleton)
_anomaly_counter: Optional[AnomalyCounter] = None

def get_anomaly_counter() -> AnomalyCounter:
    global _anomaly_counter
    if _anomaly_counter is None:
        _anomaly_counter = AnomalyCounter()
    return _anomaly_counter

# After (per-tenant, Phase 3)
_anomaly_counters: dict[str, AnomalyCounter] = {}

def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    if project_id not in _anomaly_counters:
        _anomaly_counters[project_id] = AnomalyCounter(project_id=project_id)
    return _anomaly_counters[project_id]
```

Singletons that need this strategy:

| Singleton | File | Decomposition difficulty |
|-----------|------|--------------------------|
| `AnomalyCounter` | `anomaly_counter.py:85` | LOW (PR-11, Phase 2) |
| `ConsensusEngine` | `consensus_engine.py:344` | MEDIUM (Phase 3) |
| `IntentClassifier` | `intent_classifier.py` | MEDIUM (Phase 3) |

**Strategy 3: Context-Injected (Phase 3, dependency injection)**

Instances are injected through `ContextVar` or FastAPI `Depends` instead of living in a module-level singleton:

```python
from fastapi import Depends

# DecisionManager is no longer a global singleton; it is injected per request.
# get_project_id and get_effective_policy are dependency providers assumed to exist elsewhere.
async def get_decision_manager(
    project_id: str = Depends(get_project_id),
    effective_policy: EffectivePolicy = Depends(get_effective_policy),
) -> DecisionManager:
    return DecisionManager(project_id=project_id, policy=effective_policy)
```

Singletons that need this strategy:

| Singleton | File | Notes |
|-----------|------|-------|
| `DecisionManager` | `decision_manager.py:1402` | Tier 3! Requires P10 architecture review |
| `TrustEngine` | `trust_engine.py:189` | Tier 3! Requires P10 authorization |

**Strategy 4: Module-Level Config (keep the singleton, inject config)**

The singleton stays, but its internal state is read dynamically per `project_id` rather than initialized statically:

```python
# DecisionFusionAdapter: the singleton structure is unchanged, but methods accept project_id
class DecisionFusionAdapter:
    async def fuse(
        self,
        project_id: str,  # new project_id parameter
        decisions: list[Decision],
    ) -> FusedDecision:
        policy = await get_effective_policy(project_id)
        # Use policy instead of the static config held on self
        ...
```

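
To make Strategy 4 concrete, here is a minimal runnable sketch of one shared adapter that resolves tenant-specific policy on every call. The `min_votes` field, the `_POLICIES` table, and the string-valued decisions are assumptions for illustration only; this ADR does not specify the real policy fields or decision types.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectivePolicy:
    """Hypothetical per-tenant policy; the real fields are not given in this ADR."""
    min_votes: int


# Hypothetical in-memory table standing in for the real policy source.
_POLICIES = {
    "tenant-a": EffectivePolicy(min_votes=1),
    "tenant-b": EffectivePolicy(min_votes=3),
}


async def get_effective_policy(project_id: str) -> EffectivePolicy:
    # Fail closed: an unknown tenant raises KeyError instead of getting a default.
    return _POLICIES[project_id]


class DecisionFusionAdapter:
    """One shared instance that keeps no tenant-specific state on self."""

    async def fuse(self, project_id: str, decisions: list[str]) -> bool:
        policy = await get_effective_policy(project_id)
        # Approve only if enough "approve" votes exist for this tenant's policy.
        return decisions.count("approve") >= policy.min_votes


adapter = DecisionFusionAdapter()  # still a module-level singleton


async def demo() -> tuple[bool, bool]:
    votes = ["approve", "approve"]
    return (
        await adapter.fuse("tenant-a", votes),
        await adapter.fuse("tenant-b", votes),
    )
```

The same two votes pass for `tenant-a` and fail for `tenant-b`, even though both calls go through the same `adapter` object.
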
### D2: Decomposition priority

**Immediate (Phase 2)**:
- PR-04: `registry.py` `_provider` → `__provider` (double underscore; the encapsulation gap found by INV-9)
- PR-11: per-project `AnomalyCounter` (depends on PR-10 loop tagging)

**Phase 3**:
- Per-tenant `ConsensusEngine` instance
- Per-tenant `IntentClassifier` instance
- Add `project_id` to the `DecisionFusionAdapter` method signatures

**Phase 4+ (Tier 3, requires P10 review)**:
- `DecisionManager` → dependency-injection refactor (large effort, needs its own ADR)
- `TrustEngine` → dependency-injection refactor (Tier 3, requires authorization from the chief architect)

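
As background for PR-04: Python rewrites ("mangles") a double-underscore name only when it appears inside a class body, and a module-level variable gets no such treatment. Whether `_provider` in `registry.py` is a class attribute or a module-level variable is not stated here, so the sketch below uses a hypothetical `Registry` class purely to show what the rename changes if it is a class attribute.

```python
class Registry:
    """Hypothetical stand-in; not the real AIProviderRegistry."""

    def __init__(self) -> None:
        self._single = "visible"   # single underscore: a convention only
        self.__double = "mangled"  # stored on the instance as _Registry__double

    def provider(self) -> str:
        # Inside the class body the short name is mangled automatically.
        return self.__double


r = Registry()
single_is_reachable = hasattr(r, "_single")
double_is_hidden = not hasattr(r, "__double")
mangled_name_exists = hasattr(r, "_Registry__double")
```

Mangling discourages accidental outside access; it does not prevent deliberate access through the mangled name.
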
### D3: AnomalyCounter decomposition design (Phase 2, PR-11)

AnomalyCounter is the safest starting point for decomposition (small blast radius, no Tier 3 restriction):

```python
# anomaly_counter.py
import asyncio

_anomaly_counters: dict[str, AnomalyCounter] = {}
# Serializes first-time creation per project_id. The lock is created at import
# time; on Python 3.10+ it binds to the running event loop on first use.
_counters_lock = asyncio.Lock()

async def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    async with _counters_lock:
        if project_id not in _anomaly_counters:
            _anomaly_counters[project_id] = AnomalyCounter(
                project_id=project_id,
                redis_prefix=f"anomaly:{project_id}:",  # per-tenant Redis key
            )
        return _anomaly_counters[project_id]
```

**Redis key migration** (INV-1 P2 keys):

```
Old: anomaly:counter:{metric}
New: anomaly:{project_id}:counter:{metric}
```

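
The key format change above can be expressed as a small pure function. This is only a sketch of the mapping: the helper name `migrate_key` is hypothetical, and because legacy keys carry no tenant information, the caller has to supply the `project_id`.

```python
OLD_PREFIX = "anomaly:counter:"


def migrate_key(old_key: str, project_id: str) -> str:
    """Map anomaly:counter:{metric} to anomaly:{project_id}:counter:{metric}."""
    if not old_key.startswith(OLD_PREFIX):
        raise ValueError(f"not a legacy anomaly counter key: {old_key!r}")
    metric = old_key[len(OLD_PREFIX):]
    return f"anomaly:{project_id}:counter:{metric}"
```

For example, `migrate_key("anomaly:counter:cpu", "p1")` yields `anomaly:p1:counter:cpu`, while a key that does not start with the legacy prefix is rejected rather than silently rewritten.
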
### D4: Safeguards for DecisionManager + TrustEngine (before Phase 2)

Before the real decomposition, Phase 2 applies these safeguards:

1. **Context injection**: ensure every DecisionManager method reads `project_id` from `contextvars`
2. **Redis key isolation**: DecisionManager's internal Redis keys get a `project_id` prefix (PR-06 already covers part of this)
3. **No direct calls to the global**: add a warning comment at the top of each Tier 3 file (documentation, not code)

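
Safeguard 1 can be sketched with the standard-library `contextvars` module. The variable name `current_project_id` and the helper `require_project_id` are hypothetical; the point is that a context variable without a default fails closed when no tenant has been bound, and that concurrent asyncio tasks each see their own value.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical context variable; it has no default, so reads fail until it is set.
current_project_id: ContextVar[str] = ContextVar("current_project_id")


def require_project_id() -> str:
    """Return the project_id bound to the current context, or fail closed."""
    try:
        return current_project_id.get()
    except LookupError:
        raise RuntimeError("project_id is not set in this context") from None


async def handle(project_id: str) -> str:
    current_project_id.set(project_id)  # e.g. done once by request middleware
    await asyncio.sleep(0)              # yield so the two tasks interleave
    return require_project_id()         # still sees this task's own value


async def demo() -> list[str]:
    # gather wraps each coroutine in a task, and each task gets its own context copy.
    return await asyncio.gather(handle("tenant-a"), handle("tenant-b"))
```

Running `asyncio.run(demo())` returns each task's own tenant id, and calling `require_project_id()` outside any bound context raises instead of falling back to a shared default.
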
### D5: ConsensusEngine Redis key migration (PR-06)

INV-9 confirmed that `ConsensusEngine`'s `CONSENSUS_PREFIX` has no project isolation:

```python
# Before (consensus_engine.py)
CONSENSUS_PREFIX = "consensus:"

# After (PR-06, Phase 2)
def get_consensus_prefix(project_id: str) -> str:
    return f"consensus:{project_id}:"
```

---

## Full decomposition plan for the 13 singletons

| Singleton | Strategy | Priority | Phase | Tier |
|-----------|----------|----------|-------|------|
| `TelegramGateway` | Platform Singleton (keep) | - | - | - |
| `HostRepairAgent` | Platform Singleton (keep) | - | - | - |
| `AIProviderRegistry` | Platform Singleton + PR-04 | P1 | Phase 2 | - |
| `FailoverAlerter` | Platform Singleton (keep) | - | - | - |
| `AnomalyCounter` | Tenant-Scoped | P1 | Phase 2 | - |
| `ConsensusEngine` | Tenant-Scoped + PR-06 | P1 | Phase 3 | - |
| `IntentClassifier` | Tenant-Scoped | P2 | Phase 3 | - |
| `DecisionFusionAdapter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouterExecutor` | Config-Injected | P2 | Phase 3 | - |
| `DecisionManager` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TrustEngine` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TelegramGateway` (messages) | Context-Injected | P2 | Phase 3 | - |

---

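## Consequences

### Benefits
- Per-tenant AnomalyCounter: anomaly counts for EwoooC and AWOOOI no longer interfere with each other
- ConsensusEngine Redis key isolation (PR-06): consensus decisions no longer contaminate other tenants
- The Tier 3 singletons (DecisionManager / TrustEngine) get a clear decomposition path while staying protected from rushed changes

### Costs
- Per-tenant AnomalyCounter requires a Redis key migration (counters in the old key format are abandoned)
- Large-scale singleton decomposition in Phase 3+ is a major effort (each one needs its own PR + critic review)

### Risks
- `_anomaly_counters: dict` has no GC mechanism of its own, so a long-running process can accumulate tenant instances
  - Mitigation: `WeakValueDictionary`, or periodic cleanup (evict the instance of any tenant that has been inactive for a long time)
- A failed Tier 3 singleton decomposition → rolling back the architecture requires a hotfix
  - Mitigation: Tier 3 code must have full test coverage before anyone touches it
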
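
The periodic-cleanup mitigation for the unbounded registry could look like the sketch below. It is not the PR-11 design: `CounterRegistry`, `idle_seconds`, and the injectable `clock` are hypothetical names, and `AnomalyCounter` here is a stand-in class. Idle-time eviction is shown rather than `WeakValueDictionary` because a weak-value mapping drops an entry as soon as no caller holds a strong reference to the counter, which may be earlier than intended.

```python
import time
from typing import Callable


class AnomalyCounter:
    """Stand-in for the real class, which this ADR does not show."""

    def __init__(self, project_id: str) -> None:
        self.project_id = project_id


class CounterRegistry:
    """Per-tenant instances with idle-time eviction."""

    def __init__(
        self,
        idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._idle_seconds = idle_seconds
        self._clock = clock
        # project_id -> (counter, last-used timestamp)
        self._entries: dict[str, tuple[AnomalyCounter, float]] = {}

    def get(self, project_id: str) -> AnomalyCounter:
        entry = self._entries.get(project_id)
        counter = entry[0] if entry else AnomalyCounter(project_id)
        self._entries[project_id] = (counter, self._clock())  # refresh last-used time
        return counter

    def evict_idle(self) -> list[str]:
        """Drop tenants unused for longer than idle_seconds and return their ids."""
        now = self._clock()
        stale = [
            pid
            for pid, (_, last_used) in self._entries.items()
            if now - last_used > self._idle_seconds
        ]
        for pid in stale:
            del self._entries[pid]
        return stale
```
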

---

## Acceptance criteria

- [ ] PR-04: `registry.py` `_provider` → `__provider` (Phase 2)
- [ ] PR-06: ConsensusEngine Redis prefix includes project_id (Phase 2)
- [ ] PR-11: per-tenant AnomalyCounter (Phase 2)
- [ ] Before Phase 3: ConsensusEngine / IntentClassifier tenant-scoped work complete
- [ ] Before Phase 4+: DecisionManager / TrustEngine decomposition has its own ADR + P10 authorization

## Related

- ADR-111 (bootstrap order, singleton initialization timing)
- ADR-118 (RLS; tenant isolation requires a correct project_id context)
- INV-9 (full list of the 13 singletons + file locations)
- PR-04/PR-06/PR-11 (the concrete AwoooP Phase 2 PRs)