docs: enforce traditional chinese documentation

OG T
2026-05-06 14:07:02 +08:00
parent ffd767d4bb
commit 578bf3bc7c
4 changed files with 248 additions and 198 deletions


@@ -31,6 +31,9 @@
## 🔴 Absolutely forbidden → [HARD_RULES.md](docs/HARD_RULES.md)
## 🔴 Documentation language iron rule → [Documentation language standard](docs/HARD_RULES.md#文件語言規範)
Markdown files, ADRs, LOGBOOK, runbooks, handoff documents, and plan documents must all be written in Traditional Chinese; code symbols, APIs, commands, error codes, service names, and raw logs may stay in English.
## 🔴 Red-zone governance → [RED_ZONES.md](docs/RED_ZONES.md)
Changes to Tier 3 core files (decision_manager, trust_engine, config, etc.) require authorization from the chief architect.


@@ -8,11 +8,11 @@
| Field | Before | After |
|------|-----|-----|
| **Version** | v2.0 | v2.1 |
| **Created** | 2026-03-20 (Taipei) | 2026-03-20 (Taipei) |
| **Created by** | Claude Code | Claude Code |
| **Last modified** | 2026-04-16 (Taipei) | 2026-05-06 (Taipei) |
| **Modified by** | Claude Code + ogt (added the No Island Coding / proactive-execution circuit breaker / self-looping workflow / state-machine verification iron rules) | Codex + ogt (added the documentation language iron rule) |
### Change log
@@ -29,6 +29,7 @@
| v1.8 | 2026-04-03 | Claude Code | 🔴🔴🔴 Mandatory approval for cost changes (Commander's directive) |
| v1.9 | 2026-04-15 | Claude Code + ogt | 🔴🔴🔴 AI autonomous flywheel Phase exit-criteria iron rule (ADR-080) |
| v2.0 | 2026-04-16 | Claude Code + ogt | Added No Island Coding / proactive-execution circuit breaker / self-looping workflow / state-machine verification iron rules |
| v2.1 | 2026-05-06 | Codex + ogt | 🔴 Documentation language iron rule: Markdown/ADR/LOGBOOK/Runbook/handoff documents must all be in Traditional Chinese |
---
@@ -39,6 +40,7 @@
| CI/CD | `ubuntu-latest` | `self-hosted` | [→ GitHub Billing](#github-billing) |
| Telegram | `logOut()` | Stop first, then switch | [→ Telegram Token](#telegram-token) |
| Frontend | Hardcoded text | `next-intl` | [→ i18n](#i18n) |
| Documentation | English body text | Traditional Chinese | [→ Documentation language standard](#文件語言規範) |
| Database | SQLite | PostgreSQL | [→ DB](#database) |
| CORS | `*` | Allowlist | [→ CORS](#cors) |
| Data | Fake demo data | Real API | [→ No Fake Data](#no-fake-data) |
@@ -197,6 +199,38 @@ await bot.log_out()
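---
## Documentation language standard
**Commander's directive, 2026-05-06**: all documentation must be written in Traditional Chinese.
### Scope of application
| Type | Rule |
|------|------|
| Markdown documents | Body text, headings, table descriptions, and conclusions must all be in Traditional Chinese |
| ADR / LOGBOOK / Runbook | Traditional Chinese throughout; avoid mixing in English paragraphs |
| Handoff documents / implementation plan | Traditional Chinese throughout, so cross-session handoffs need no re-translation |
| Frontend user-visible text | Still follows i18n; hardcoding is not allowed |
### Content that may stay in English
| Type | Examples |
|------|------|
| Code symbols and paths | `project_id`, `apps/api/src`, `MCPToolResult` |
| API / commands / error codes | `GET /v1/platform/runs`, `kubectl`, `E-MCP-GATE-001` |
| Service and product names | `Alertmanager`, `Sentry`, `ClickHouse`, `Gemini` |
| Raw log / trace / commit message | Keep the original text; add a Traditional Chinese explanation below when necessary |
### Absolutely forbidden
```
❌ Adding ADR, Runbook, LOGBOOK, or handoff documents whose body text is English
❌ Turning a Chinese document into a Chinese-English mix without a necessary technical reason
❌ Wrapping Chinese project status in English headings, forcing later sessions to re-translate
```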
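A rule like this can be checked mechanically. The sketch below is one possible pre-commit check and is not part of this commit: it strips fenced code blocks and inline code spans (which the rule allows to stay in English), then flags a document when the share of CJK characters among the remaining letters falls below a threshold. The function names and the `0.3` threshold are assumptions chosen for illustration.

```python
import re

FENCE_RE = re.compile(r"```.*?```", re.DOTALL)  # fenced code blocks
INLINE_RE = re.compile(r"`[^`\n]*`")            # inline code spans
CJK_RE = re.compile(r"[\u4e00-\u9fff]")         # CJK Unified Ideographs
LATIN_RE = re.compile(r"[A-Za-z]")


def cjk_ratio(markdown: str) -> float:
    """Share of CJK characters among CJK + Latin letters in the prose."""
    prose = FENCE_RE.sub("", markdown)
    prose = INLINE_RE.sub("", prose)
    cjk = len(CJK_RE.findall(prose))
    latin = len(LATIN_RE.findall(prose))
    total = cjk + latin
    # A document with no prose letters at all cannot violate the rule.
    return 1.0 if total == 0 else cjk / total


def violates_language_rule(markdown: str, min_ratio: float = 0.3) -> bool:
    """True when the prose looks predominantly English (assumed threshold)."""
    return cjk_ratio(markdown) < min_ratio
```

A ratio over letters is a blunt instrument: a paragraph dense with service names written outside code spans can trip it, so a real check would likely need an allowlist for product names.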
---
## Database
**Memory:** AWOOOI Constitution


@@ -3584,6 +3584,23 @@ DockerContainerCpuSustainedHigh -> PB-20260505-F4197B
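---
## 2026-05-06 (Taipei): Documentation language iron rule converged on Traditional Chinese
**Background**: The Commander explicitly directed that all documentation must be in Traditional Chinese. The AwoooP integration master plan still contained large amounts of English body text, which would force later sessions to re-translate it on handoff and invite misunderstanding.
### Fixed
- `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md` was converted to Traditional Chinese throughout, keeping necessary technical identifiers, API paths, commands, error codes, and service names in their original form.
- `docs/HARD_RULES.md` was bumped to v2.1 with the new documentation language iron rule.
- `AGENTS.md` gained an entry-point rule for documentation language, so a session sees "Markdown / ADR / LOGBOOK / Runbook / handoff documents must all be in Traditional Chinese" as soon as it starts.
### Follow-up requirements
- When adding or modifying documentation, body text, headings, table descriptions, and conclusions must all be in Traditional Chinese.
- Raw logs, traces, commit messages, code symbols, and service names may stay in English, with a Traditional Chinese explanation added where necessary.
---
## 2026-05-06 (Taipei): Alertmanager bypass rerouted to the SRE group + Sentry Snuba repair
**Trigger**: Telegram received `🚨 [Alertmanager Fallback] DockerContainerRestartSpike`, and the message went to the OpenClaw private chat / bot conversation. The AWOOOI heartbeat was normal at the same time, which shows that the fallback bypass is not "triggered only when the API is offline" but is the direct Telegram bypass on Alertmanager's critical route.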
@@ -3592,8 +3609,8 @@ DockerContainerCpuSustainedHigh -> PB-20260505-F4197B
| Scope | Result |
|------|------|
| Alertmanager live config | In `/home/wooo/monitoring/alertmanager.yml`, `telegram-direct.chat_id` was switched from `OPENCLAW_TG_CHAT_ID` to `SRE_GROUP_CHAT_ID` and applied with a HUP reload |
| Alertmanager repo template | `ops/alertmanager/alertmanager.yml` now uses `SRE_GROUP_CHAT_ID_PLACEHOLDER`, so later deployments do not fall back to the private chat |
| Sentry / Snuba schema | On 110, ran `docker compose run --rm snuba-api bootstrap --force` in `/opt/sentry` to restore the missing ClickHouse Snuba tables |
| Kafka offset | `ingest-consumer` (`ingest-events:0`) and `generic-metrics-consumer` (`ingest-performance-metrics:0`) were reset to latest, fixing `OffsetOutOfRange` |
@@ -3601,7 +3618,7 @@ DockerContainerCpuSustainedHigh -> PB-20260505-F4197B
```text
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
# SUCCESS
Alertmanager telegram-direct chat_id
# group/supergroup, suffix=74679
@@ -3612,27 +3629,27 @@ ClickHouse tables
# default.transactions_local = 1
# default.metrics_raw_v2_local = 1
Sentry consumers after reset
# events-consumer healthy
# generic-metrics-consumer healthy
# snuba-errors/metrics/transactions consumers healthy
# recent 45s logs: no OffsetOutOfRange / UNKNOWN_TABLE / ERROR markers
```
### Second convergence pass: bypass made emergency-only
Later in the same round, the Alertmanager routing was tightened further:
| Scope | Result |
|------|------|
| Direct route gate | `telegram-direct` no longer matches every `severity=critical` alert; it only matches `AWOOOIApiDown` / `AlertmanagerDown` / `AlertChainBroken_*` / `AlertChainUnhealthy` / `NoAlertsReceived2Hours` |
| Main route | Ordinary critical alerts (including Docker/Sentry container restarts) go only through `awoooi-webhook`, returning to the AWOOOI API main chain of deduplication, AI analysis, Approval, and Audit |
| Live webhook URL | In `/home/wooo/monitoring/alertmanager.yml`, `192.168.0.121:32334` was aligned with the repo's VIP `192.168.0.125:32334` |
| Config check | `docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml` succeeded; HUP reload completed |
| Drift prevention | Added `scripts/ops/deploy-alertmanager-config.sh`, which injects the Telegram token / `SRE_GROUP_CHAT_ID` from a K8s Secret and validates with amtool before backing up and reloading |
| Deploy safety | Fixed the deploy script to overwrite the bind-mounted config through its original inode and to force `chmod 0644`, so the container does not enter a restart loop because the config is `0600` |
| Live firing state | After the fix, `ALERTS{alertname="DockerContainerRestartSpike",alertstate="firing"}` dropped to 0; Sentry consumers are back to healthy |
| Temporary silence | Because Alertmanager's own restarts during the permission fix triggered the 15-minute window, a 30-minute silence was created only for `alertname=DockerContainerRestartSpike, container_name=alertmanager`; other Docker/Sentry restarts are unaffected |
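
The deploy-safety row depends on a filesystem detail: a bind-mounted single file is tied to its inode, so replacing the file by writing a temporary copy and renaming it leaves the container looking at the old inode. The following is a minimal Python sketch of an in-place overwrite. The helper name `overwrite_in_place` is hypothetical, and the real `deploy-alertmanager-config.sh` is a shell script whose contents are not shown in this commit.

```python
import os


def overwrite_in_place(path: str, new_content: str, mode: int = 0o644) -> None:
    """Rewrite an existing file without changing its inode, then set its mode."""
    # "r+" opens the existing file for update, so the inode is preserved.
    # A write-temp-then-rename strategy would create a new inode instead.
    with open(path, "r+", encoding="utf-8") as fh:
        fh.seek(0)
        fh.write(new_content)
        fh.truncate()          # drop leftover bytes if the old file was longer
        fh.flush()
        os.fsync(fh.fileno())
    # The source reports a restart loop with 0600; 0644 lets the container read it.
    os.chmod(path, mode)
```

The trade-off is that an in-place write is not atomic: a reader could observe a half-written file, which is one reason to validate the new content (for example with `amtool check-config`) before overwriting and to keep a backup, as the table describes.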
### Notes


@@ -1,67 +1,68 @@
# AWOOOI / AwoooP / AI Autonomous Flywheel Integration Plan
**Date**: 2026-05-06
**Status**: Accepted integration baseline for AwoooP execution
**Scope**: AWOOOI alerting, AI decisioning, approval, MCP execution, KM/RAG/PlayBook learning, AwoooP Operator Console, audit, governance, and platform reliability
**Related**:
- `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`
- `docs/awooop/MASTER-WORKPLAN.md`
- `docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md`
- `docs/awooop/AWOOOP-MONITORING-ALERTING-CONVERGENCE.md`
- `docs/adr/ADR-106-agent-platform-architecture.md`
## 1. Architecture Decision
The current findings are not isolated bugs. They are evidence that the AI autonomous flywheel is not yet closed end-to-end.
AwoooP must not evolve as a separate product line beside AWOOOI automation. It is the human-and-governance plane for the same flywheel:
- **AI Flywheel**: engine, cognition, execution, learning loop.
- **AwoooP**: operator console, governance plane, audit plane, human approval plane.
- **MCP Gateway**: the only execution choke point for tool calls.
- **Approval / Trust**: the only authorization state machine.
- **KM / RAG / PlayBook**: the only knowledge substrate.
- **Channel Event / Audit**: the only observability, audit, and trace stream.
If AI automation and AwoooP are implemented as separate tracks, the system will drift into duplicate approval state machines, duplicate audit flows, duplicate routing gates, and duplicate MCP and Channel decisions, and the frontend will end up showing states that do not match the backend execution truth.
## 2. Shared Target Loop
The target loop is:
```text
monitoring / alerting
-> event classification
-> rule match or rule creation
-> PlayBook / KM / RAG retrieval
-> AI decision
-> Approval / Trust gate
-> MCP Gateway execution
-> audit and trace
-> execution verification
-> KM / rule / PlayBook feedback
```
AwoooP surfaces and governs that loop. It does not create an independent execution lane.
## 3. Architectural Invariants
1. **One execution gate**: production MCP calls must pass through MCP Gateway. Direct provider calls are compatibility debt and must progressively be marked `forbid-new`.
2. **One approval state machine**: AwoooP approvals, AWOOOI approval records, TrustEngine, and Telegram approval buttons must converge to one signed, auditable flow.
3. **One audit stream**: Channel events, runtime state, MCP audit, model routing, and approval decisions must be joinable by `project_id`, `run_id`, and `trace_id`.
4. **One knowledge substrate**: KM, RAG, and PlayBook embeddings must use consistent dimensions and model naming. Test fixtures must match production schema.
5. **One model-routing control plane**: provider/model fallback must resolve through EffectivePolicy or a wrapped legacy equivalent during strangler migration.
6. **No fail-open tenant boundary**: missing `project_id` must never broaden data visibility. Bootstrap exceptions must be explicit and audited.
7. **No client-only identity**: any operator identity supplied by frontend body is display metadata only; authorization identity must come from the authenticated principal.
8. **No policy decisions in channel adapters**: Telegram, LINE, Slack, and email adapters deliver and track messages, but do not decide model, tool, approval, or incident policy.
9. **All documentation in Traditional Chinese**: Markdown files, ADRs, LOGBOOK, runbooks, handoff documents, and plan documents must be written in Traditional Chinese; technical identifiers, API paths, code symbols, error codes, commands, service names, and raw logs may stay in English.
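
Invariant 6 (no fail-open tenant boundary) is the easiest of these to get subtly wrong, so a small illustration may help. The sketch below is not the project's RLS implementation; `RequestContext`, `visible_rows`, and the `is_audited_bootstrap` flag are hypothetical names, and the in-memory list stands in for a database query.

```python
from dataclasses import dataclass
from typing import Optional


class MissingProjectContext(Exception):
    """Raised instead of silently widening visibility."""


@dataclass(frozen=True)
class RequestContext:
    project_id: Optional[str]
    is_audited_bootstrap: bool = False  # hypothetical explicit exception flag


def visible_rows(rows: list[dict], ctx: RequestContext, audit: list[str]) -> list[dict]:
    if ctx.project_id is None:
        if ctx.is_audited_bootstrap:
            # The exception is explicit and leaves an audit record.
            audit.append("bootstrap access without project_id")
            return list(rows)
        # Fail closed: a missing project_id never means "all projects".
        raise MissingProjectContext("project_id is required")
    return [r for r in rows if r["project_id"] == ctx.project_id]
```

The fail-open version of this function would return every row when `project_id` is `None`; raising instead makes the missing context a visible error rather than a data leak.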
## 4. Integration Layers
### 4.1 AI Flywheel Core
Owns monitoring, alert ingestion, event classification, rule matching, rule generation, PlayBook, KM/RAG, AI decisions, safe repair, verification, and learning feedback.
Key existing surfaces:
- `apps/api/src/api/v1/webhooks.py`
- `apps/api/src/services/decision_manager.py`
@@ -71,11 +72,11 @@ Key existing surfaces:
- `apps/api/src/services/knowledge_rag_service.py`
- `apps/api/src/services/playbook_rag.py`
### 4.2 AwoooP Governance / Operator Layer
Owns operator runs, approvals, audit, channel events, MCP Gateway visibility, frontend control surface, i18n, and authenticated operator identity.
Key existing surfaces:
- `apps/api/src/api/v1/platform/*`
- `apps/api/src/services/platform_*`
@@ -84,11 +85,11 @@ Key existing surfaces:
- `apps/web/src/app/[locale]/awooop/*`
- `docs/awooop/*`
### 4.3 Infrastructure / Reliability Layer
Owns DB schema and RLS, Redis namespace migration, K8s manifests, NetworkPolicy, Ollama failover, Gitea CD, migration discipline, and release gates.
Key existing surfaces:
- `apps/api/migrations/*`
- `apps/api/src/db/*`
@@ -97,162 +98,157 @@ Key existing surfaces:
- `docs/runbooks/*`
- `docs/awooop/inventory/*`
## 5. Consolidated Risk Register
### 5.1 P0: Production-broken or Security-critical
| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P0-A | MCP Gateway is not yet a production choke point; callers can still reach `provider.execute()` directly. | Gateway gates, redaction, and audit can be bypassed. | Wrap all production MCP call sites; then mark direct provider access `forbid-new`. |
| P0-B | MCP blocked-call audit has gaps when `tool_row` is missing or Gate 1/2 rejects early. | Denied or suspicious calls can disappear from audit. | Audit attempt before and after gate evaluation with safe redaction. |
| P0-C | Legacy K8s tool execution still has command/shell injection risk. | Destructive command path can be polluted by LLM or user input. | Parse command structure, avoid shell, enforce operation schema and allowlists. |
| P0-D | `MCPToolResult(data=...)` mismatches dataclass fields in some success paths. | Sentry / ArgoCD or similar providers can crash on valid results. | Normalize result schema and regression-test all MCP providers. |
| P0-E | RAG/KM/PlayBook embedding dimensions remain split between 768 and 1024. | Search or backfill can silently fail or hide production drift. | Standardize on `bge-m3` 1024 dimensions and remove stale fixtures. |
| P0-F | KM backfill reconciler is missing required async imports or runtime dependencies. | Repair path that should recover embeddings may crash. | Compile and integration-test the reconciler path. |
| P0-G | Ollama routing still has direct `OLLAMA_URL` consumers. | GCP-A/GCP-B/111 ordering and Gemini fallback policy can be bypassed. | Inventory, wrap, and migrate call sites to resolver / EffectivePolicy. |
| P0-H | RLS is fail-open when `project_id` is null. | Cross-tenant data exposure risk. | Force fail-closed RLS and explicit platform bootstrap identities. |
| P0-I | Approval APIs trust frontend/body identity such as `approver_id: "operator"`. | Audit identity is not legally or operationally usable. | Decide endpoint must derive principal from auth/session/token, not body. |
| P0-J | API control plane lacks real authorization in some paths; CSRF is not authorization. | Operator APIs can be invoked without a real role boundary. | Add authenticated principal and role checks to AwoooP APIs. |
| P0-K | Alertmanager internal bypass plus `X-Forwarded-For` and broad trusted hosts may allow spoofed source identity. | Forged alert ingress can create incidents or approvals. | Require signed webhook or strict trusted proxy chain. |
| P0-L | AwoooP approval token service also depends on `aioredis`. | Python/runtime compatibility issue can break approvals. | Migrate Redis client path or wrap with a maintained async Redis interface. |
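
The direction for P0-C (parse the command structure, avoid the shell, enforce allowlists) can be sketched in a few lines. This is an illustration only: the verb allowlist and the function name `parse_kubectl` are assumptions, and the real operation schema is not described in this plan.

```python
import shlex

ALLOWED_VERBS = {"get", "describe", "logs", "rollout"}  # hypothetical allowlist
FORBIDDEN_CHARS = set(";|&$`<>\n")                      # shell metacharacters


class RejectedCommand(ValueError):
    pass


def parse_kubectl(command: str) -> list[str]:
    """Turn an untrusted command string into a vetted argv list."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        raise RejectedCommand("shell metacharacter in command")
    argv = shlex.split(command)  # raises ValueError on unbalanced quotes
    if len(argv) < 2 or argv[0] != "kubectl":
        raise RejectedCommand("only kubectl commands are accepted")
    if argv[1] not in ALLOWED_VERBS:
        raise RejectedCommand(f"verb not allowlisted: {argv[1]}")
    # The caller would run this as subprocess.run(argv, shell=False).
    return argv
```

Because the result is an argv list executed without a shell, text such as `; rm -rf /` is never interpreted; the metacharacter check simply rejects it early so the attempt can be audited. Requiring the verb in position 1 is deliberately strict and also rejects forms like `kubectl -n ns get pods`.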
### 5.2 P1: Governance consistency and reliability
| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P1-A | Approval repository / TrustEngine race conditions. | Double decision, stale decision, or missed approval. | Add row locks, idempotency keys, and single source of truth. |
| P1-B | TrustEngine uses process memory dict. | Pending approvals vanish or split across pods/restarts. | Move authoritative state to PostgreSQL; Redis only for cache/notification. |
| P1-C | NetworkPolicy contains rules whose names imply deny but semantics allow broad traffic. | False sense of isolation. | Reconcile names, policies, and live cluster behavior. |
| P1-D | Kustomize may omit service registry, HPA, VPA, backup cron, or related resources. | GitOps drift and partial deploys. | Inventory generated/live manifests and add missing resources. |
| P1-E | Hardcoded restart/SSH actions remain in alert rules. | Conflicts with AI automation direction and rule/PlayBook evidence loop. | Convert hardcoded actions into PlayBook proposals with trust gates. |
| P1-F | Startup schema mutation and scattered SQL migrations remain. | Runtime ALTER can diverge dev/prod schema. | Move to disciplined migration path and remove runtime mutation. |
| P1-G | Tests still use stale 768-dimension vectors. | Dev tests can pass while production RAG fails. | Test fixtures must mirror production vector dimensions. |
| P1-H | GCP-B can remain passive fallback only. | Active-active design is not realized. | Use GCP-B for batch, embedding, shadow, and canary lanes via policy. |
### 5.3 P2: Productization and long-term maintainability
| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P2-A | AwoooP frontend has incomplete i18n and inconsistent IA. | Operator experience is unprofessional and brittle. | Align UI text, sidebar, routes, and workflow labels. |
| P2-B | Operator Console lacks deep trace/journal views. | Operators cannot explain why AI made a decision. | Add run step journal, trace spans, model/tool/cost panes. |
| P2-C | Error code taxonomy is incomplete. | Frontend and channels cannot render actionable states. | Define platform error codes for auth, schema, budget, MCP, routing, approval, and RAG. |
| P2-D | Docs and runtime can drift. | Future sessions repeat the same analysis. | Keep ADR, LOGBOOK, inventory, and release checklists linked to commits. |
## 6. Corrections to Prior Inventories
The following corrections must be applied when consuming the earlier 12-agent inventories:
1. C21/C22 CronJob service account and image issues appear to have been partially fixed in the worktree, but must be verified against live cluster state before closure.
2. C1 `aioredis` is not necessarily a top-level import in all affected paths; at least one Gate 5 runtime import still makes the risk valid.
3. C2 is more precise as: blocked calls can skip audit when `tool_row is None`, not only due to a database `NOT NULL` failure.
4. C12 is partially stale for the main embedding service, but `knowledge_rag_service.py` and `playbook_rag.py` still need verification for 1024-dimensional consistency.
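
Correction 3 (and risk P0-B) comes down to ordering: if the audit record is written only after a tool row has been resolved, an early rejection leaves nothing behind. A minimal sketch of the "audit before and after gate evaluation" direction follows. The `audit_log` list, the `call_tool` function, and the outcome strings are stand-ins invented for illustration, not the gateway's actual API.

```python
from typing import Optional

audit_log: list[dict] = []  # stand-in for the real audit sink


def call_tool(tool_name: str, tool_row: Optional[dict], trace_id: str) -> str:
    # Record the attempt first, so an early rejection still leaves a trace.
    audit_log.append({"trace_id": trace_id, "tool": tool_name, "phase": "attempt"})
    if tool_row is None:
        outcome = "blocked:unknown_tool"
    elif not tool_row.get("enabled", False):
        outcome = "blocked:disabled"
    else:
        outcome = "allowed"
    # Record the result for every path, including the blocked ones.
    audit_log.append(
        {"trace_id": trace_id, "tool": tool_name, "phase": "result", "outcome": outcome}
    )
    return outcome
```

With this shape, a call for an unknown tool produces two audit entries sharing one `trace_id`, which is what makes denied calls joinable with the rest of the trace.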
## 7. Twelve-agent Ownership Matrix
| Owner | Scope | First verification |
| --- | --- | --- |
| Chief Architect | Total blueprint, dependency order, red-zone governance | This integration plan is linked from AwoooP master workplan and LOGBOOK. |
| Security | Auth, webhook spoofing, RLS, MCP security, token signing | Authenticated approval decision cannot be forged by body-only identity. |
| MCP Gateway | Gateway chokepoint, audit, provider bypass, result schema | Direct provider call inventory reaches zero untracked `forbid-new` exceptions. |
| K8s / SRE | CronJob, Kustomize, NetworkPolicy, RBAC, deployment verification | `kubectl diff` and live resource inventory match expected GitOps set. |
| DB / Migration | Alembic, RLS, JSONB, FK, approval race | RLS fail-closed tests and migration up/down pass. |
| RAG / KM | `bge-m3` 1024 dimensions, backfill, PlayBook embedding | Production vector dimension and test fixtures match. |
| AI Router / Ollama | Resolver, failover, direct URL cleanup, GCP-A/B/111 lanes | Alert route logs prove `GCP-A -> GCP-B -> 111 -> Gemini` ordering. |
| Core / Trust | TrustEngine, RedisLock, UnitOfWork, feature flags | Pending approval survives pod restart and HPA multi-pod execution. |
| API Layer | Platform endpoints auth, CSRF, tenant boundaries | Platform APIs reject unauthenticated and cross-project requests. |
| Frontend / AwoooP | i18n, sidebar, approver identity, internal IP ban | `/zh-TW/awooop` renders correctly and the browser bundle leaks no private `NEXT_PUBLIC_*` values. |
| QA / Test | Integration tests, no-mock scanning, regression matrix | High-risk paths have production-like integration tests, not only mocks. |
| Docs / Release | ADR, LOGBOOK, Memory handoff, runbooks, release checklist | Each wave has a release note, rollback path, and live verification evidence. |
## 8. Execution Waves
### Wave 0: Formal convergence baseline
Goal: convert the cross-session conclusion into a single source of operational truth.
Work:
- Add this integration plan.
- Link it from `MASTER-WORKPLAN.md` and `AWOOOP-MONITORING-ALERTING-CONVERGENCE.md`.
- Keep runtime untouched except already-approved urgent repairs.
- Define owner and wave for each high-risk item.
Exit criteria:
- Plan is committed.
- LOGBOOK records the integration baseline.
- Next implementation session can start from this document without re-deriving the architecture.
### Wave 1: P0 production-broken / security-critical
Goal: prevent the automation flywheel from bypassing governance.
Progress:
- 2026-05-06: P0-L first enforcement patch landed. AwoooP approval token storage and MCP Gateway Gate 5 now use the shared Redis pool instead of runtime `aioredis.from_url()` clients.
- 2026-05-06: P0-I first enforcement patch landed. The AwoooP approval decide endpoint now derives `approver_id` from trusted operator headers instead of frontend body data, and production fails closed when `AWOOOP_OPERATOR_API_KEY` is not configured.
Work:
- MCP Gateway bypass and audit gaps.
- `aioredis` runtime compatibility in approval/MCP paths.
- `MCPToolResult` schema normalization.
- K8s command injection hardening.
- Webhook spoofing and platform API approval auth.
- RAG/KM/PlayBook 1024-dimensional consistency.
- Ollama direct `OLLAMA_URL` cleanup for production alert paths.
Exit criteria:
- No production MCP caller can bypass Gateway without a tracked `forbid-new` exception.
- Approval decide endpoint uses authenticated principal.
- Embedding dimensions are production-consistent.
- Alert route logs prove Gemini is used as fallback only after GCP-A, GCP-B, and 111 have all failed.

### Wave 2: P1 governance consistency

Goal: make platform state durable and tenant-safe.

Work:

- RLS fail-closed and project context injection.
- Approval / Trust race fixes.
- TrustEngine state moves from process memory to PostgreSQL.
- NetworkPolicy semantic cleanup.
- Kustomize resource coverage.
- Ollama failover centralization and GCP-B active-active usage.

Exit criteria:

- Cross-project read tests fail closed.
- Pending approval survives pod restart.
- Live cluster resources match Git-managed resources.
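
As an illustration of what "fail closed" means for cross-project reads, the sketch below models the rule at application level. It is an assumption-laden stand-in for database RLS, not the platform's code: `Row`, `read_rows`, and `MissingProjectContext` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


class MissingProjectContext(Exception):
    """Raised when a read is attempted without a project context."""


@dataclass(frozen=True)
class Row:
    project_id: str
    payload: str


def read_rows(rows: Iterable[Row], project_id: Optional[str]) -> List[Row]:
    """Return only rows owned by `project_id`.

    Fail closed: a missing or empty project context raises an error.
    It never falls back to returning every row.
    """
    if not project_id:
        raise MissingProjectContext("no project context injected")
    return [row for row in rows if row.project_id == project_id]
```

A cross-project read test in this spirit would assert two things: a caller with project A's context sees none of project B's rows, and a caller with no context gets an error, not data.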

### Wave 3: P2 productization

Goal: make AwoooP a usable professional operator console, not a side dashboard.

Work:

- AwoooP i18n and sidebar IA.
- Run step journal and trace view.
- Platform error code taxonomy.
- Alembic and migration discipline.
- Internal IP bundle checks.
- Regression matrix and release checklists.

Exit criteria:

- Operator can explain a run from alert to MCP execution to KM feedback.
- All frontend user-facing strings go through i18n; none are hardcoded in components.
- Release checklist ties together commit, image, route health, and live URL verification.

## 9. Wave Dependencies

```text
Wave 0
@@ -261,40 +257,40 @@ Wave 0
-> Wave 3 Operator Console productization
```

Do not move a destructive or auto-remediation path to AwoooP enforcement until Wave 1 is green and Wave 2 tenant state is fail-closed.

## 10. Release and Verification Discipline

Each wave must publish:

- changed surface
- runtime behavior changed: yes/no
- contract family
- migration posture: keep, mirror, wrap, replace, forbid-new
- rollback or disable flag
- affected event types
- risk rating
- live verification command or URL
- LOGBOOK entry
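
The publish list above could be captured as a structured record so that an incomplete release note is caught before rollout. The following is a minimal sketch, not an existing schema: the class `WaveReleaseRecord`, its field names, and the `problems` method are hypothetical. The five migration posture values are the ones listed in this document.

```python
from dataclasses import dataclass
from typing import List

# Posture values as listed in this document.
MIGRATION_POSTURES = {"keep", "mirror", "wrap", "replace", "forbid-new"}


@dataclass
class WaveReleaseRecord:
    changed_surface: str
    runtime_behavior_changed: bool
    contract_family: str
    migration_posture: str
    rollback_or_disable_flag: str
    affected_event_types: List[str]
    risk_rating: str
    live_verification: str  # command or URL
    logbook_entry: str

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means publishable."""
        issues: List[str] = []
        if self.migration_posture not in MIGRATION_POSTURES:
            issues.append(f"unknown migration posture: {self.migration_posture!r}")
        required_text = (
            "changed_surface",
            "contract_family",
            "rollback_or_disable_flag",
            "risk_rating",
            "live_verification",
            "logbook_entry",
        )
        for name in required_text:
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        return issues
```

An empty `affected_event_types` list is accepted here on the assumption that some releases touch no event types; tighten that if the project requires at least one.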

For frontend releases, also verify:

- `pnpm --filter @awoooi/web typecheck`
- `NEXT_PUBLIC_API_URL=https://awoooi.wooo.work pnpm --filter @awoooi/web build`
- no private internal IPs from `NEXT_PUBLIC_*` values in the generated browser bundle
- live route status and content after CD rollout
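
The internal-IP bundle check can be approximated with a small scanner. This is a hedged sketch, not the project's actual check: `find_private_ips` is a hypothetical helper that flags any IPv4 literal in the text that Python's `ipaddress` module classifies as private. That classification covers RFC 1918 ranges and also loopback and link-local addresses.

```python
import ipaddress
import re
from typing import List

# Dotted-quad candidates that are not part of a longer dotted number.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_private_ips(bundle_text: str) -> List[str]:
    """Return the sorted, de-duplicated private IPv4 literals in the text."""
    found = set()
    for candidate in IPV4_CANDIDATE.findall(bundle_text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            # Not a valid address (for example an octet above 255).
            continue
        if address.is_private:
            found.add(candidate)
    return sorted(found)
```

Run it over the text of each generated bundle file and fail the release if the result is non-empty. Four-part version strings that happen to fall in a private range will be reported as well, so expect to review occasional false positives.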

For AI routing releases, also verify:

- active pod environment
- provider route logs
- cost-bearing provider fallback order
- `Gemini` appears only after all configured Ollama lanes fail or policy explicitly permits cloud fallback
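
The last check above can be expressed as a rule over the provider attempts recorded for one request. The sketch below assumes a hypothetical log shape, an ordered list of `(provider, outcome)` pairs with outcome `"ok"` or `"fail"`; the real route log format is not specified in this document. The lane names GCP-A, GCP-B, and 111 are taken from the Wave 1 exit criteria.

```python
from typing import List, Sequence, Tuple

# Ollama lanes named in this document; adjust to the configured set.
OLLAMA_LANES = ("GCP-A", "GCP-B", "111")

Attempt = Tuple[str, str]  # (provider, outcome); outcome is "ok" or "fail"


def gemini_fallback_violations(
    attempts: Sequence[Attempt],
    cloud_fallback_permitted: bool = False,
    lanes: Sequence[str] = OLLAMA_LANES,
) -> List[str]:
    """Report each Gemini attempt made before every lane had failed."""
    violations: List[str] = []
    failed = set()
    for provider, outcome in attempts:
        if provider == "Gemini":
            not_yet_failed = [lane for lane in lanes if lane not in failed]
            if not_yet_failed and not cloud_fallback_permitted:
                violations.append(
                    "Gemini used before lanes failed: " + ", ".join(not_yet_failed)
                )
        elif outcome == "fail":
            failed.add(provider)
    return violations
```

Passing `cloud_fallback_permitted=True` models the case where policy explicitly allows cloud fallback, in which case an early Gemini attempt is not reported.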

## 11. Immediate Next Items

1. Wire `scripts/ops/deploy-alertmanager-config.sh` into the reboot/release checklist, then consider whether CD should run it automatically for `ops/alertmanager/**` changes.
2. Continue Wave 1 with MCP Gateway bypass and MCP audit completeness, because production callers can still route around the gateway.
3. Keep GCP-A/GCP-B/111 Ollama routing verification in every alert-path release until EffectivePolicy becomes authoritative.
4. Add a Sentry/Snuba post-reboot health gate: ClickHouse table existence, Snuba migration status, and Kafka consumer offsets must be part of cold-start validation.
5. Add a post-deploy Alertmanager live check for `amtool check-config`, container status, and config-file mode; direct Telegram must remain emergency-only and target only the SRE group.