awoooi/docs/adr/ADR-124-global-singleton-decomposition.md
Your Name 13e51802fe feat(awooop): all Phase 0 ADRs + Phase 1 control plane schema (including four critic fixes)
## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: seven core ADRs, Bootstrap → RLS
- ADR-119~124: six ADRs, SAGA → Singleton Decomposition
- ADR-UI-01~04: four Operator Console UI ADRs

## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + triggers + RLS
  - Step 1: three roles (platform_admin/migration with BYPASSRLS, awooop_app subject to RLS)
  - Step 13: GRANT minimal privileges to awooop_app (7 statements)
  - Step 14: RLS fail-closed, __platform__ backdoor removed
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: __mapper_args__ string list → primary_key=True on mapped_column
- C-3: __platform__ RLS backdoor → removed entirely, replaced by a BYPASSRLS role
- C-4: awooop_app role was never created → Step 1 + 7 GRANT statements
- M-1: active_pointer_guard made SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: idempotency guard added to pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes
- agent_loader.py: hard-coded Mac path → AGENTS_DIR environment variable
- Dockerfile: added COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:37:11 +08:00

# ADR-124: Global Singleton Decomposition for Multi-tenancy
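**Status**: Accepted
**Date**: 2026-05-03 (Taipei)
**Decision maker**: 統帥 (Commander)
**Scope**: classification of the 13 global singletons, decomposition strategies, and Tier 3 safety boundaries
**Related**: ADR-111 (bootstrap order), ADR-118 (RLS), INV-9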
---
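## Background

INV-9 confirmed that the codebase contains **13** global singletons, all of them module-level variables (`_engine: Optional[X] = None`). In a multi-tenant environment these singletons have two core problems:

1. **State shared across tenants**: components such as `AnomalyCounter` mix data from all tenants in a single instance.
2. **Unclear bootstrap timing**: a singleton is initialized on first call, which may happen before `project_id` is known.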
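Both problems can be reproduced with a few lines. The sketch below is illustrative only: this `AnomalyCounter` is a hypothetical stand-in with an assumed `record` method, not the real class from `anomaly_counter.py`.

```python
from collections import Counter
from typing import Optional


class AnomalyCounter:
    """Hypothetical stand-in: counts anomalies per metric, with no tenant dimension."""

    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def record(self, metric: str) -> int:
        self.counts[metric] += 1
        return self.counts[metric]


_anomaly_counter: Optional[AnomalyCounter] = None


def get_anomaly_counter() -> AnomalyCounter:
    # Lazy initialization: the instance is created on first call,
    # whether or not a project_id is known at that point.
    global _anomaly_counter
    if _anomaly_counter is None:
        _anomaly_counter = AnomalyCounter()
    return _anomaly_counter


# Two tenants report the same metric through the same global instance.
tenant_a_total = get_anomaly_counter().record("cpu_spike")  # while serving tenant A
tenant_b_total = get_anomaly_counter().record("cpu_spike")  # while serving tenant B
```

Tenant B's first event is counted on top of tenant A's, because nothing in the accessor or the instance distinguishes the two callers.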
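**Tier 3 restrictions (RED_ZONES.md)**:
- `DecisionManager` (`decision_manager.py:1402`): must not be modified without an architecture review
- `TrustEngine` (`trust_engine.py:189`): core trust computation; modification requires P10 authorization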
---
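## Decision

### D1: Four decomposition strategies

**Strategy 1: Platform Singleton (kept, not decomposed)**

These singletons belong at the platform layer by nature and need no per-tenant instance:

| Singleton | File | Reason |
|-----------|------|------|
| `TelegramGateway` (polling lock) | `telegram_gateway.py:1324` | The polling leader is platform-level, not per-tenant |
| `HostRepairAgent` | `host_repair_agent.py` | Host repair is a platform operation, not per-tenant |
| `AIProviderRegistry` | `registry.py` | The provider list is platform-level (but needs the `__provider` fix, PR-04) |
| `FailoverAlerter` | `failover_alerter.py` | Alert routing is platform-level |

**Strategy 2: Tenant-Scoped Instance (Phase 3, instantiated on demand)**

These singletons need to become per-`project_id` instances: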
```python
from typing import Optional

# Before (global singleton)
_anomaly_counter: Optional[AnomalyCounter] = None

def get_anomaly_counter() -> AnomalyCounter:
    global _anomaly_counter
    if _anomaly_counter is None:
        _anomaly_counter = AnomalyCounter()
    return _anomaly_counter

# After (per-tenant, Phase 3)
_anomaly_counters: dict[str, AnomalyCounter] = {}

def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    if project_id not in _anomaly_counters:
        _anomaly_counters[project_id] = AnomalyCounter(project_id=project_id)
    return _anomaly_counters[project_id]
```
Singletons that need this strategy:

| Singleton | File | Decomposition difficulty |
|-----------|------|---------|
| `AnomalyCounter` | `anomaly_counter.py:85` | LOW (PR-11, Phase 2) |
| `ConsensusEngine` | `consensus_engine.py:344` | MEDIUM (Phase 3) |
| `IntentClassifier` | `intent_classifier.py` | MEDIUM (Phase 3) |

**Strategy 3: Context-Injected (Phase 3, dependency injection)**

Instances are injected through `ContextVar` or FastAPI `Depends` instead of a module-level singleton:
```python
from fastapi import Depends

# DecisionManager is no longer a global singleton; it is injected per request.
# get_project_id and get_effective_policy are assumed to be defined elsewhere.
async def get_decision_manager(
    project_id: str = Depends(get_project_id),
    effective_policy: EffectivePolicy = Depends(get_effective_policy),
) -> DecisionManager:
    return DecisionManager(project_id=project_id, policy=effective_policy)
```
Singletons that need this strategy:

| Singleton | File | Notes |
|-----------|------|------|
| `DecisionManager` | `decision_manager.py:1402` | Tier 3; requires P10 architecture review |
| `TrustEngine` | `trust_engine.py:189` | Tier 3; requires P10 authorization |
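For code paths that do not run through FastAPI, the `ContextVar` variant of Strategy 3 might look like the sketch below. The variable name `current_project_id`, the `handle_request` helper, and the one-field `DecisionManager` are assumptions made for illustration; the real Tier 3 class is not shown in this ADR.

```python
import asyncio
from contextvars import ContextVar
from dataclasses import dataclass

# Hypothetical context variable; a real app would set it in request middleware.
current_project_id: ContextVar[str] = ContextVar("current_project_id")


@dataclass
class DecisionManager:
    """Hypothetical stand-in for the real Tier 3 class."""
    project_id: str


def get_decision_manager() -> DecisionManager:
    # .get() raises LookupError when no project_id was set, so a call made
    # outside a tenant context fails instead of silently using shared state.
    return DecisionManager(project_id=current_project_id.get())


async def handle_request(project_id: str) -> str:
    token = current_project_id.set(project_id)
    try:
        await asyncio.sleep(0)  # yield so the two requests interleave
        return get_decision_manager().project_id
    finally:
        current_project_id.reset(token)


async def main() -> list[str]:
    # Each task created by gather() runs in its own copy of the context.
    return await asyncio.gather(handle_request("tenant-a"), handle_request("tenant-b"))


results = asyncio.run(main())
```

Because each asyncio task carries its own context copy, interleaved requests for different tenants do not see each other's `project_id`.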
**Strategy 4: Module-Level Config (singleton kept, config injected)**

The singleton is kept, but its internal state is read dynamically per `project_id` rather than initialized statically:
```python
# DecisionFusionAdapter: the singleton structure is unchanged, but methods accept project_id
class DecisionFusionAdapter:
    async def fuse(
        self,
        project_id: str,  # new project_id parameter
        decisions: list[Decision],
    ) -> FusedDecision:
        policy = await get_effective_policy(project_id)
        # use policy instead of static config stored on self
        ...
```
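A runnable reduction of Strategy 4 is sketched below. The `EffectivePolicy` field `min_votes`, the in-memory `_POLICIES` store, and the boolean return value are all assumptions chosen to keep the example small; the real `fuse` returns a `FusedDecision` whose shape this ADR does not specify.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectivePolicy:
    """Hypothetical policy shape; the real fields are not given in this ADR."""
    min_votes: int


# Hypothetical in-memory policy store standing in for the real lookup.
_POLICIES = {
    "tenant-a": EffectivePolicy(min_votes=1),
    "tenant-b": EffectivePolicy(min_votes=3),
}


async def get_effective_policy(project_id: str) -> EffectivePolicy:
    return _POLICIES[project_id]


class DecisionFusionAdapter:
    async def fuse(self, project_id: str, decisions: list[str]) -> bool:
        # Policy is resolved per call, so the shared instance holds no tenant state.
        policy = await get_effective_policy(project_id)
        return len(decisions) >= policy.min_votes


adapter = DecisionFusionAdapter()  # one shared instance for all tenants


async def main() -> tuple[bool, bool]:
    votes = ["approve", "approve"]
    return (await adapter.fuse("tenant-a", votes), await adapter.fuse("tenant-b", votes))


accepted_a, accepted_b = asyncio.run(main())
```

The same two votes are accepted for one tenant and rejected for the other, even though both calls go through the same adapter instance.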
### D2: Decomposition priority

**Immediate (Phase 2)**:
- PR-04: `registry.py` `_provider` → `__provider` (double underscore; encapsulation leak found by INV-9)
- PR-11: `AnomalyCounter` per-project (depends on PR-10 loop tagging)
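The effect of the PR-04 rename can be illustrated as follows, assuming `_provider` is an instance attribute set inside a class body; the ADR does not show `registry.py`, so the class below is a hypothetical stand-in. Python name mangling only applies inside class definitions, and it guards against accidental access rather than providing real access control.

```python
class AIProviderRegistry:
    """Hypothetical stand-in; the real class in registry.py may differ."""

    def __init__(self) -> None:
        self._provider = "single-underscore"   # convention only, freely accessible
        self.__provider = "double-underscore"  # stored as _AIProviderRegistry__provider

    def current(self) -> str:
        # Inside the class body, __provider resolves to the mangled name.
        return self.__provider


registry = AIProviderRegistry()

# Outside the class, the attribute is not reachable under the name "__provider".
direct_access_blocked = not hasattr(registry, "__provider")

# It is still reachable through the mangled name, so this is not a security boundary.
mangled_value = registry._AIProviderRegistry__provider
```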
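**Phase 3**:
- `ConsensusEngine` per-tenant instance
- `IntentClassifier` per-tenant instance
- `DecisionFusionAdapter` method signatures gain `project_id`

**Phase 4+ (Tier 3, requires P10 review)**:
- `DecisionManager` → dependency-injection refactor (large effort, needs its own ADR)
- `TrustEngine` → dependency-injection refactor (Tier 3, requires chief architect authorization)

### D3: AnomalyCounter decomposition design (Phase 2, PR-11)

`AnomalyCounter` is the safest place to start (small blast radius, no Tier 3 restrictions):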
```python
# anomaly_counter.py
import asyncio

_anomaly_counters: dict[str, AnomalyCounter] = {}
_counters_lock = asyncio.Lock()


async def get_anomaly_counter(project_id: str) -> AnomalyCounter:
    async with _counters_lock:
        if project_id not in _anomaly_counters:
            _anomaly_counters[project_id] = AnomalyCounter(
                project_id=project_id,
                redis_prefix=f"anomaly:{project_id}:",  # per-tenant Redis key
            )
        return _anomaly_counters[project_id]
```
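The lock starts to matter once creating a counter involves an `await`, because other coroutines can then run between the membership check and the assignment. The sketch below assumes a hypothetical async `load` step and wraps the dict and lock in a small registry class (also an assumption) so the lock is created inside the running event loop.

```python
import asyncio


class AnomalyCounter:
    """Hypothetical stand-in whose setup needs an awaitable step."""

    def __init__(self, project_id: str) -> None:
        self.project_id = project_id

    async def load(self) -> None:
        await asyncio.sleep(0)  # placeholder for async setup such as a Redis read


class AnomalyCounterRegistry:
    def __init__(self) -> None:
        self._counters: dict[str, AnomalyCounter] = {}
        self._lock = asyncio.Lock()
        self.created = 0  # instrumentation for this example only

    async def get(self, project_id: str) -> AnomalyCounter:
        async with self._lock:
            if project_id not in self._counters:
                counter = AnomalyCounter(project_id)
                await counter.load()  # other coroutines may be scheduled here
                self._counters[project_id] = counter
                self.created += 1
            return self._counters[project_id]


async def main() -> tuple[int, int]:
    registry = AnomalyCounterRegistry()
    ids = ["tenant-a", "tenant-b"] * 50
    counters = await asyncio.gather(*(registry.get(pid) for pid in ids))
    distinct = len({id(c) for c in counters})
    return registry.created, distinct


created, distinct = asyncio.run(main())
```

One hundred concurrent lookups across two tenants produce exactly two instances. Without the lock, several coroutines could pass the membership check before the first one finished `load`.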
**Redis key migration** (INV-1 P2 keys), old format first, new format second:
```
anomaly:counter:{metric}
anomaly:{project_id}:counter:{metric}
```
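A minimal sketch of helpers for the two key formats follows. The function names and the validation rules are assumptions; in particular, rejecting a `project_id` equal to `counter` is a precaution added here because such an id would produce new-format keys that start with the old prefix.

```python
LEGACY_PREFIX = "anomaly:counter:"


def tenant_key(project_id: str, metric: str) -> str:
    """Build the per-tenant key in the new format."""
    # Hypothetical validation: a colon, or the literal id "counter",
    # would make new keys indistinguishable from old ones.
    if not project_id or ":" in project_id or project_id == "counter":
        raise ValueError(f"invalid project_id: {project_id!r}")
    return f"anomaly:{project_id}:counter:{metric}"


def is_legacy_key(key: str) -> bool:
    """True for keys in the old, tenant-less format."""
    return key.startswith(LEGACY_PREFIX)


def to_tenant_key(legacy_key: str, project_id: str) -> str:
    """Map an old key to its new-format equivalent for a given tenant."""
    if not is_legacy_key(legacy_key):
        raise ValueError(f"not a legacy key: {legacy_key!r}")
    return tenant_key(project_id, legacy_key[len(LEGACY_PREFIX):])
```

For example, `to_tenant_key("anomaly:counter:cpu", "awoooi")` yields `anomaly:awoooi:counter:cpu`.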
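### D4: Protective measures for DecisionManager + TrustEngine (before Phase 2)

Protective measures for Phase 2, applied before the actual decomposition:

1. **Context injection**: ensure every `DecisionManager` method reads `project_id` from `contextvars`
2. **Redis key isolation**: `DecisionManager`'s internal Redis keys gain a `project_id` prefix (partly covered by PR-06)
3. **No direct calls to the global**: add a warning comment at the top of each Tier 3 file (documentation, not code)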
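Measures 1 and 2 can be combined into a key builder that fails closed when no tenant is in context. Everything named below (`_project_id`, `MissingTenantContext`, the `decision:` key layout) is hypothetical; the ADR only states that `project_id` comes from `contextvars` and that keys carry it as a prefix.

```python
from contextvars import ContextVar, copy_context
from typing import Optional

# Hypothetical names; the ADR only says project_id is read from contextvars.
_project_id: ContextVar[Optional[str]] = ContextVar("project_id", default=None)


class MissingTenantContext(RuntimeError):
    """Raised when tenant-scoped code runs without a project_id."""


def require_project_id() -> str:
    project_id = _project_id.get()
    if project_id is None:
        raise MissingTenantContext("project_id is not set in the current context")
    return project_id


def decision_key(suffix: str) -> str:
    # Hypothetical key layout: every key is prefixed with the tenant.
    return f"decision:{require_project_id()}:{suffix}"


def _build_in_tenant(project_id: str, suffix: str) -> str:
    _project_id.set(project_id)
    return decision_key(suffix)


# copy_context().run() confines the set() to the copied context,
# so the surrounding context is left without a project_id.
key_a = copy_context().run(_build_in_tenant, "tenant-a", "pending")
```

Calling `decision_key` outside a tenant context raises instead of producing an unprefixed key, which matches the fail-closed stance of the RLS setup.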
### D5: ConsensusEngine Redis key migration (PR-06)

INV-9 confirmed that `ConsensusEngine`'s `CONSENSUS_PREFIX` has no project isolation:
```python
# Before (consensus_engine.py)
CONSENSUS_PREFIX = "consensus:"

# After (PR-06, Phase 2)
def get_consensus_prefix(project_id: str) -> str:
    return f"consensus:{project_id}:"
```
---
## Full decomposition plan for the 13 singletons

| Singleton | Strategy | Priority | Phase | Tier |
|-----------|------|-------|-------|------|
| `TelegramGateway` | Platform Singleton (kept) | - | - | - |
| `HostRepairAgent` | Platform Singleton (kept) | - | - | - |
| `AIProviderRegistry` | Platform Singleton + PR-04 | P1 | Phase 2 | - |
| `FailoverAlerter` | Platform Singleton (kept) | - | - | - |
| `AnomalyCounter` | Tenant-Scoped | P1 | Phase 2 | - |
| `ConsensusEngine` | Tenant-Scoped + PR-06 | P1 | Phase 3 | - |
| `IntentClassifier` | Tenant-Scoped | P2 | Phase 3 | - |
| `DecisionFusionAdapter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouter` | Config-Injected | P2 | Phase 3 | - |
| `AIRouterExecutor` | Config-Injected | P2 | Phase 3 | - |
| `DecisionManager` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TrustEngine` | Context-Injected | P3 | Phase 4+ | Tier 3 |
| `TelegramGateway` (messages) | Context-Injected | P2 | Phase 3 | - |
---
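## Consequences

### Benefits
- `AnomalyCounter` per-tenant: anomaly counts for EwoooC and AWOOOI no longer interfere with each other
- `ConsensusEngine` Redis key isolation (PR-06): consensus decisions do not leak across tenants
- Tier 3 singletons (`DecisionManager` / `TrustEngine`) have a clear decomposition path while staying protected and unhurried

### Costs
- Per-tenant `AnomalyCounter` requires a Redis key migration (counters in the old format are abandoned)
- Large-scale singleton decomposition in Phase 3+ is a major effort (each needs its own PR + critic review)

### Risks
- The `_anomaly_counters: dict` has no GC mechanism, so tenant instances may accumulate in a long-running process
  - Mitigation: `WeakValueDictionary` or periodic cleanup (drop the instance when a tenant has been inactive for a long time)
- A failed Tier 3 singleton decomposition → architecture rollback requires a hotfix
  - Mitigation: Tier 3 code must have complete test coverage before any change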
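The periodic-cleanup mitigation could be sketched as a registry that tracks last access per tenant. The class name, the `max_idle_seconds` parameter, and the injectable clock are assumptions for illustration. One caveat on the other option: a `WeakValueDictionary` drops an entry as soon as no caller holds a strong reference, so instances held only by the registry would be recreated on nearly every call.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class IdleEvictingRegistry(Generic[T]):
    """Per-tenant instances that are dropped after a period without access."""

    def __init__(self, factory: Callable[[str], T], max_idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._factory = factory
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._entries: dict[str, tuple[T, float]] = {}

    def get(self, project_id: str) -> T:
        entry = self._entries.get(project_id)
        instance = entry[0] if entry is not None else self._factory(project_id)
        self._entries[project_id] = (instance, self._clock())  # refresh last access
        return instance

    def evict_idle(self) -> list[str]:
        """Remove tenants idle longer than max_idle_seconds; return their ids."""
        now = self._clock()
        stale = [pid for pid, (_, seen) in self._entries.items()
                 if now - seen > self._max_idle]
        for pid in stale:
            del self._entries[pid]
        return stale


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
registry = IdleEvictingRegistry(lambda pid: {"project_id": pid},
                                max_idle_seconds=60, clock=lambda: fake_now[0])
first_a = registry.get("tenant-a")   # last seen at t=0
fake_now[0] = 50.0
first_b = registry.get("tenant-b")   # last seen at t=50
fake_now[0] = 100.0
evicted = registry.evict_idle()      # tenant-a idle for 100s, tenant-b for 50s
```

Eviction is only safe if the instance's durable state lives outside the process (for example in the per-tenant Redis keys), since an evicted tenant gets a fresh instance on its next call.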
---
## Acceptance criteria
- [ ] PR-04: `registry.py` `_provider` → `__provider` (Phase 2)
- [ ] PR-06: ConsensusEngine Redis prefix includes project_id (Phase 2)
- [ ] PR-11: AnomalyCounter per-tenant (Phase 2)
- [ ] Before Phase 3: ConsensusEngine / IntentClassifier tenant-scoped work complete
- [ ] Before Phase 4+: DecisionManager / TrustEngine decomposition has its own ADR + P10 authorization

## Related
- ADR-111 (bootstrap order: singleton initialization timing)
- ADR-118 (RLS: tenant isolation needs a correct project_id context)
- INV-9 (full list of the 13 singletons + file locations)
- PR-04/PR-06/PR-11 (concrete AwoooP Phase 2 PRs)