diff --git a/AGENTS.md b/AGENTS.md index 9bf9e38e..b6ba345c 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -31,6 +31,9 @@ ## 🔴 絕對禁止 → [HARD_RULES.md](docs/HARD_RULES.md) +## 🔴 文件語言鐵律 → [文件語言規範](docs/HARD_RULES.md#文件語言規範) +Markdown、ADR、LOGBOOK、Runbook、交接文件與計畫文件一律使用繁體中文;程式符號、API、指令、錯誤碼、服務名稱與原始 log 可保留英文。 + ## 🔴 紅區治理 → [RED_ZONES.md](docs/RED_ZONES.md) Tier 3 核心檔案 (decision_manager, trust_engine, config 等) 修改需首席架構師授權 diff --git a/docs/HARD_RULES.md b/docs/HARD_RULES.md index 8d04d444..c32a822d 100644 --- a/docs/HARD_RULES.md +++ b/docs/HARD_RULES.md @@ -8,11 +8,11 @@ | 欄位 | 值 | |------|-----| -| **版本** | v2.0 | +| **版本** | v2.1 | | **建立日期** | 2026-03-20 (台北) | | **建立者** | Claude Code | -| **最後修改** | 2026-04-16 (台北) | -| **修改者** | Claude Code + ogt (新增孤島開發/主動執行熔斷/自循環工作流/狀態機驗證鐵律) | +| **最後修改** | 2026-05-06 (台北) | +| **修改者** | Codex + ogt (新增文件語言鐵律) | ### 變更紀錄 @@ -29,6 +29,7 @@ | v1.8 | 2026-04-03 | Claude Code | 🔴🔴🔴 費用變更強制審批 (統帥指示) | | v1.9 | 2026-04-15 | Claude Code + ogt | 🔴🔴🔴 AI 自主化飛輪 Phase 退出條件鐵律 (ADR-080) | | v2.0 | 2026-04-16 | Claude Code + ogt | 新增 No Island Coding / 主動執行熔斷機制 / 自循環工作流 / 狀態機驗證鐵律 | +| v2.1 | 2026-05-06 | Codex + ogt | 🔴 文件語言鐵律:Markdown/ADR/LOGBOOK/Runbook/交接文件一律繁體中文 | --- @@ -39,6 +40,7 @@ | CI/CD | `ubuntu-latest` | `self-hosted` | [→ GitHub Billing](#github-billing) | | Telegram | `logOut()` | 先停後換 | [→ Telegram Token](#telegram-token) | | 前端 | 硬編碼文字 | `next-intl` | [→ i18n](#i18n) | +| 文件 | 英文主文 | 繁體中文 | [→ 文件語言規範](#文件語言規範) | | 資料庫 | SQLite | PostgreSQL | [→ DB](#database) | | CORS | `*` | 白名單 | [→ CORS](#cors) | | 數據 | 假數據 Demo | 真實 API | [→ No Fake Data](#no-fake-data) | @@ -197,6 +199,38 @@ await bot.log_out() --- +## 文件語言規範 + +**統帥指示 2026-05-06**:所有文件必須使用繁體中文。 + +### 適用範圍 + +| 類型 | 規範 | +|------|------| +| Markdown 文件 | 主文、標題、表格說明、結論一律繁體中文 | +| ADR / LOGBOOK / Runbook | 一律繁體中文,避免英文段落混入 | +| 交接文件 / implementation plan | 一律繁體中文,讓跨 session 接手時不再重譯 | +| 前端使用者可見文字 | 仍遵守 i18n,不得硬編碼 | + +### 可保留英文的內容 + +| 類型 | 範例 | +|------|------| +| 程式符號與路徑 | 
`project_id`, `apps/api/src`, `MCPToolResult` | +| API / 指令 / 錯誤碼 | `GET /v1/platform/runs`, `kubectl`, `E-MCP-GATE-001` | +| 服務與產品名稱 | `Alertmanager`, `Sentry`, `ClickHouse`, `Gemini` | +| 原始 log / trace / commit message | 保留原文,必要時在下方補繁中解釋 | + +### 絕對禁止 + +``` +❌ 新增英文主文的 ADR、Runbook、LOGBOOK 或 handoff 文件 +❌ 將中文文件改成中英混雜但沒有必要技術原因 +❌ 用英文標題包裝中文專案狀態,造成後續 session 還要重譯 +``` + +--- + ## Database **Memory:** AWOOOI 憲法 diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md index 64913749..55d477ef 100644 --- a/docs/LOGBOOK.md +++ b/docs/LOGBOOK.md @@ -3584,6 +3584,23 @@ DockerContainerCpuSustainedHigh -> PB-20260505-F4197B --- +## 2026-05-06(台北)— 文件語言鐵律收斂為繁體中文 + +**背景**:統帥明確指示所有文檔必須使用繁體中文;AwoooP 整合總圖仍殘留大量英文主文,會讓後續 session 接手時重新翻譯與誤解。 + +### 已修正 + +- `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md` 全文轉為繁體中文,保留必要技術識別碼、API path、指令、錯誤碼與服務名稱原文。 +- `docs/HARD_RULES.md` 升級到 v2.1,新增文件語言規範鐵律。 +- `AGENTS.md` 加入文件語言入口規則,讓 session 啟動即可看到「Markdown / ADR / LOGBOOK / Runbook / 交接文件一律繁體中文」。 + +### 後續要求 + +- 新增或修改文件時,主文、標題、表格說明與結論一律使用繁體中文。 +- 原始 log、trace、commit message、程式符號與服務名稱可保留英文,但必要時需補繁中解釋。 + +--- + ## 2026-05-06(台北)— Alertmanager 旁路改送 SRE 群組 + Sentry Snuba 修復 **觸發**:Telegram 收到 `🚨 [Alertmanager Fallback] DockerContainerRestartSpike`,且訊息發到 OpenClaw 私訊/機器人對話;同一時間 AWOOOI 心跳正常,表示 fallback 旁路不是「API 離線才觸發」,而是 Alertmanager critical route 的 direct Telegram 旁路。 @@ -3592,8 +3609,8 @@ DockerContainerCpuSustainedHigh -> PB-20260505-F4197B | 範圍 | 結果 | |------|------| -| Alertmanager live config | `/home/wooo/monitoring/alertmanager.yml` 的 `telegram-direct.chat_id` 已從 `OPENCLAW_TG_CHAT_ID` 切到 `SRE_GROUP_CHAT_ID`,並以 HUP reload | -| Alertmanager repo template | `ops/alertmanager/alertmanager.yml` 改用 `SRE_GROUP_CHAT_ID_PLACEHOLDER`,避免後續部署回退到私訊 | +| Alertmanager 現場設定 | `/home/wooo/monitoring/alertmanager.yml` 的 `telegram-direct.chat_id` 已從 `OPENCLAW_TG_CHAT_ID` 切到 `SRE_GROUP_CHAT_ID`,並以 HUP reload | +| Alertmanager repo 範本 | `ops/alertmanager/alertmanager.yml` 改用 
`SRE_GROUP_CHAT_ID_PLACEHOLDER`,避免後續部署回退到私訊 | | Sentry / Snuba schema | 110 上 `/opt/sentry` 執行 `docker compose run --rm snuba-api bootstrap --force`,補齊 ClickHouse Snuba tables | | Kafka offset | `ingest-consumer` 的 `ingest-events:0` 與 `generic-metrics-consumer` 的 `ingest-performance-metrics:0` reset 到 latest,修正 `OffsetOutOfRange` | @@ -3601,7 +3618,7 @@ DockerContainerCpuSustainedHigh -> PB-20260505-F4197B ```text docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml -# SUCCESS +# 成功 Alertmanager telegram-direct chat_id # group/supergroup, suffix=74679 @@ -3612,27 +3629,27 @@ ClickHouse tables # default.transactions_local = 1 # default.metrics_raw_v2_local = 1 -Sentry consumers after reset +Sentry consumers reset 後狀態 # events-consumer healthy # generic-metrics-consumer healthy # snuba-errors/metrics/transactions consumers healthy -# recent 45s logs: no OffsetOutOfRange / UNKNOWN_TABLE / ERROR markers +# 近 45 秒 log:沒有 OffsetOutOfRange / UNKNOWN_TABLE / ERROR 標記 ``` -### 第二段收斂:旁路改成 Emergency-only +### 第二段收斂:旁路改成緊急限定 後續同一輪已再收斂 Alertmanager 路由: | 範圍 | 結果 | |------|------| -| Direct route gate | `telegram-direct` 不再匹配所有 `severity=critical`,只匹配 `AWOOOIApiDown` / `AlertmanagerDown` / `AlertChainBroken_*` / `AlertChainUnhealthy` / `NoAlertsReceived2Hours` | -| Main route | 一般 critical(含 Docker/Sentry container restart)只走 `awoooi-webhook`,回到 AWOOOI API 去重、AI 分析、Approval 與 Audit 主鏈 | -| Live webhook URL | `/home/wooo/monitoring/alertmanager.yml` 從 `192.168.0.121:32334` 對齊 repo 的 VIP `192.168.0.125:32334` | -| Config check | `docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml` 成功,HUP reload 完成 | -| Drift prevention | 新增 `scripts/ops/deploy-alertmanager-config.sh`,從 K8s Secret 注入 Telegram token / `SRE_GROUP_CHAT_ID`,先 amtool 驗證再備份與 reload | -| Deploy safety | 修正部署腳本以原 inode 覆寫 bind-mounted config,並強制 `chmod 0644`,避免容器因 config `0600` 進入 restart loop | -| Live firing state | 修復後 
`ALERTS{alertname="DockerContainerRestartSpike",alertstate="firing"}` 已降為 0;Sentry consumers 回到 healthy | -| Temporary silence | 因 Alertmanager 權限修復期間自身 restart 觸發 15 分鐘窗口,已只針對 `alertname=DockerContainerRestartSpike, container_name=alertmanager` 建立 30 分鐘 silence;其他 Docker/Sentry restart 不受影響 | +| 直接路由閘門 | `telegram-direct` 不再匹配所有 `severity=critical`,只匹配 `AWOOOIApiDown` / `AlertmanagerDown` / `AlertChainBroken_*` / `AlertChainUnhealthy` / `NoAlertsReceived2Hours` | +| 主路由 | 一般 critical(含 Docker/Sentry container restart)只走 `awoooi-webhook`,回到 AWOOOI API 去重、AI 分析、Approval 與 Audit 主鏈 | +| 現場 webhook URL | `/home/wooo/monitoring/alertmanager.yml` 從 `192.168.0.121:32334` 對齊 repo 的 VIP `192.168.0.125:32334` | +| 設定檢查 | `docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml` 成功,HUP reload 完成 | +| 防漂移部署 | 新增 `scripts/ops/deploy-alertmanager-config.sh`,從 K8s Secret 注入 Telegram token / `SRE_GROUP_CHAT_ID`,先 amtool 驗證再備份與 reload | +| 部署安全性 | 修正部署腳本以原 inode 覆寫 bind-mounted config,並強制 `chmod 0644`,避免容器因 config `0600` 進入 restart loop | +| 現場 firing 狀態 | 修復後 `ALERTS{alertname="DockerContainerRestartSpike",alertstate="firing"}` 已降為 0;Sentry consumers 回到 healthy | +| 暫時 silence | 因 Alertmanager 權限修復期間自身 restart 觸發 15 分鐘窗口,已只針對 `alertname=DockerContainerRestartSpike, container_name=alertmanager` 建立 30 分鐘 silence;其他 Docker/Sentry restart 不受影響 | ### 注意 diff --git a/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md b/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md index c52f223b..75f9277d 100644 --- a/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md +++ b/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md @@ -1,67 +1,68 @@ -# AWOOOI / AwoooP / AI Autonomous Flywheel Integration Plan +# AWOOOI / AwoooP / AI 自主化飛輪整合計畫 -**Date**: 2026-05-06 -**Status**: Accepted integration baseline for AwoooP execution -**Scope**: AWOOOI alerting, AI decisioning, approval, MCP execution, KM/RAG/PlayBook 
learning, AwoooP Operator Console, audit, governance, and platform reliability -**Related**: +**日期**:2026-05-06 +**狀態**:已接受,作為 AwoooP 後續執行的整合基準 +**範圍**:AWOOOI 告警、AI 決策、簽核、MCP 執行、KM/RAG/PlayBook 學習、AwoooP Operator Console、稽核、治理與平台可靠性 +**相關文件**: - `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md` - `docs/awooop/MASTER-WORKPLAN.md` - `docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md` - `docs/awooop/AWOOOP-MONITORING-ALERTING-CONVERGENCE.md` - `docs/adr/ADR-106-agent-platform-architecture.md` -## 1. Executive Decision +## 1. 架構決策 -The current findings are not isolated bugs. They are evidence that the AI autonomous flywheel is not yet closed end-to-end. +目前盤點出的問題不是一批孤立 bug,而是 AI 自主化飛輪尚未端到端閉環的證據。 -AwoooP must not evolve as a separate product line beside AWOOOI automation. It is the human-and-governance plane for the same flywheel: +AwoooP 不得在 AWOOOI 自動化之外另長成一條產品線。AwoooP 是同一條飛輪的人機協作與治理平面: -- **AI Flywheel**: engine, cognition, execution, learning loop. -- **AwoooP**: operator console, governance plane, audit plane, human approval plane. -- **MCP Gateway**: the only execution choke point for tool calls. -- **Approval / Trust**: the only authorization state machine. -- **KM / RAG / PlayBook**: the only knowledge substrate. -- **Channel Event / Audit**: the only observability and trace stream. +- **AI Flywheel**:引擎、認知、執行與學習迴路。 +- **AwoooP**:操作員控制台、治理平面、稽核平面與人工簽核平面。 +- **MCP Gateway**:工具呼叫的唯一執行閘門。 +- **Approval / Trust**:唯一授權狀態機。 +- **KM / RAG / PlayBook**:唯一知識底座。 +- **Channel Event / Audit**:唯一觀測、稽核與追蹤流。 -If AI automation and AwoooP are implemented as separate tracks, the system will drift into duplicate approval state machines, duplicate audit flows, duplicate routing gates, duplicate MCP and Channel decisions, and frontend states that do not match the backend execution truth. +如果 AI 自動化與 AwoooP 分成兩條軌道推進,系統會漂移出兩套簽核狀態機、兩套稽核流、兩套路由閘門、兩套 MCP 與 Channel 決策,最後前端看到的狀態會與後端真實執行狀態不一致。 -## 2. Shared Target Loop +## 2. 
共同目標迴路 -The target loop is: +目標閉環如下: ```text -monitoring / alerting - -> event classification - -> rule match or rule creation - -> PlayBook / KM / RAG retrieval - -> AI decision - -> Approval / Trust gate - -> MCP Gateway execution - -> audit and trace - -> execution verification - -> KM / rule / PlayBook feedback +監控 / 告警 + -> 事件分類 + -> 規則匹配或規則建立 + -> PlayBook / KM / RAG 檢索 + -> AI 決策 + -> Approval / Trust Gate + -> MCP Gateway 執行 + -> 稽核與追蹤 + -> 執行驗證 + -> KM / Rule / PlayBook 回饋 ``` -AwoooP surfaces and governs that loop. It does not create an independent execution lane. +AwoooP 負責呈現與治理這條迴路,不建立獨立的執行 lane。 -## 3. Architectural Invariants +## 3. 架構不變式 -1. **One execution gate**: production MCP calls must pass through MCP Gateway. Direct provider calls are compatibility debt and must become `forbid-new`. -2. **One approval state machine**: AwoooP approvals, AWOOOI approval records, TrustEngine, and Telegram approval buttons must converge to one signed, auditable flow. -3. **One audit stream**: Channel events, runtime state, MCP audit, model routing, and approval decisions must be joinable by `project_id`, `run_id`, and `trace_id`. -4. **One knowledge substrate**: KM, RAG, and PlayBook embeddings must use consistent dimensions and model naming. Test fixtures must match production schema. -5. **One routing control plane**: provider/model fallback must resolve through EffectivePolicy or a wrapped legacy equivalent during strangler migration. -6. **No fail-open tenant boundary**: missing `project_id` must never broaden data visibility. Bootstrap exceptions must be explicit and audited. -7. **No client-only identity**: any operator identity supplied by frontend body is display metadata only; authorization identity must come from the authenticated principal. -8. **No channel policy decisions**: Telegram, LINE, Slack, and email adapters deliver and track messages, but do not decide model, tool, approval, or incident policy. +1. 
**唯一執行閘門**:production MCP 呼叫必須通過 MCP Gateway。直接呼叫 provider 是相容性技術債,必須逐步標記為 `forbid-new`。 +2. **唯一簽核狀態機**:AwoooP approvals、AWOOOI approval records、TrustEngine 與 Telegram 簽核按鈕必須收斂成一條可簽章、可稽核的流程。 +3. **唯一稽核流**:Channel events、runtime state、MCP audit、model routing 與 approval decisions 必須能以 `project_id`、`run_id`、`trace_id` 串接查詢。 +4. **唯一知識底座**:KM、RAG、PlayBook embeddings 必須使用一致的維度與模型命名。測試 fixtures 必須與 production schema 一致。 +5. **唯一模型路由控制面**:provider/model fallback 必須透過 EffectivePolicy,或在絞殺榕遷移期間使用受包裝的 legacy 等價層解析。 +6. **租戶邊界禁止 fail-open**:缺少 `project_id` 時不得擴大資料可見範圍。Bootstrap 例外必須明確且可稽核。 +7. **禁止 client-only 身份**:frontend body 提供的 operator identity 只能當顯示 metadata;授權身份必須來自已認證 principal。 +8. **Channel adapter 不做政策決策**:Telegram、LINE、Slack、Email adapter 只負責收送與追蹤訊息,不決定模型、工具、簽核或事件策略。 +9. **文件一律繁體中文**:Markdown 文件、ADR、LOGBOOK、Runbook、交接文件與計畫文件必須使用繁體中文;技術識別碼、API path、程式符號、錯誤碼、指令、服務名稱與原始 log 可保留英文。 -## 4. Integration Layers +## 4. 整合分層 -### 4.1 AI Flywheel Core +### 4.1 AI 飛輪核心層 -Owns monitoring, alert ingestion, event classification, rule matching, rule generation, PlayBook, KM/RAG, AI decisions, safe repair, verification, and learning feedback. +負責監控、告警入口、事件分類、規則匹配、規則生成、PlayBook、KM/RAG、AI 決策、安全修復、驗證與學習回饋。 -Key existing surfaces: +既有關鍵介面: - `apps/api/src/api/v1/webhooks.py` - `apps/api/src/services/decision_manager.py` @@ -71,11 +72,11 @@ Key existing surfaces: - `apps/api/src/services/knowledge_rag_service.py` - `apps/api/src/services/playbook_rag.py` -### 4.2 AwoooP Governance / Operator Layer +### 4.2 AwoooP 治理 / 操作層 -Owns operator runs, approvals, audit, channel events, MCP Gateway visibility, frontend control surface, i18n, and authenticated operator identity. 
+負責 operator runs、approvals、audit、channel events、MCP Gateway 可視化、frontend 操作面、i18n 與已認證 operator identity。 -Key existing surfaces: +既有關鍵介面: - `apps/api/src/api/v1/platform/*` - `apps/api/src/services/platform_*` @@ -84,11 +85,11 @@ Key existing surfaces: - `apps/web/src/app/[locale]/awooop/*` - `docs/awooop/*` -### 4.3 Infrastructure / Reliability Layer +### 4.3 基礎設施 / 可靠性層 -Owns DB schema and RLS, Redis namespace migration, K8s manifests, NetworkPolicy, Ollama failover, Gitea CD, migration discipline, and release gates. +負責 DB schema 與 RLS、Redis namespace migration、K8s manifests、NetworkPolicy、Ollama failover、Gitea CD、migration discipline 與 release gates。 -Key existing surfaces: +既有關鍵介面: - `apps/api/migrations/*` - `apps/api/src/db/*` @@ -97,162 +98,157 @@ Key existing surfaces: - `docs/runbooks/*` - `docs/awooop/inventory/*` -## 5. Consolidated Risk Register +## 5. 整合風險登錄表 -### 5.1 P0: Production-broken or Security-critical +### 5.1 P0:production 會壞或安全關鍵 -| ID | Risk | Impact | Required direction | +| ID | 風險 | 影響 | 必要方向 | | --- | --- | --- | --- | -| P0-A | MCP Gateway is not yet a production choke point; callers can still reach `provider.execute()` directly. | Gateway gates, redaction, and audit can be bypassed. | Wrap all production MCP call sites; then mark direct provider access `forbid-new`. | -| P0-B | MCP blocked-call audit has gaps when `tool_row` is missing or Gate 1/2 rejects early. | Denied or suspicious calls can disappear from audit. | Audit attempt before and after gate evaluation with safe redaction. | -| P0-C | Legacy K8s tool execution still has command/shell injection risk. | Destructive command path can be polluted by LLM or user input. | Parse command structure, avoid shell, enforce operation schema and allowlists. | -| P0-D | `MCPToolResult(data=...)` mismatches dataclass fields in some success paths. | Sentry / ArgoCD or similar tools can crash on valid results. | Normalize result schema and regression-test all MCP providers. 
| -| P0-E | RAG/KM/PlayBook embedding dimensions remain split between 768 and 1024. | Search or backfill can silently fail or hide production drift. | Standardize on `bge-m3` 1024 dimensions and remove stale fixtures. | -| P0-F | KM backfill reconciler is missing required async imports or runtime dependencies. | Repair path that should recover embeddings may crash. | Compile and integration-test the reconciler path. | -| P0-G | Ollama routing still has direct `OLLAMA_URL` consumers. | GCP-A/GCP-B/111 ordering and Gemini fallback policy can be bypassed. | Inventory, wrap, and migrate call sites to resolver / EffectivePolicy. | -| P0-H | RLS is fail-open when `project_id` is null. | Cross-tenant data exposure risk. | Force fail-closed RLS and explicit platform bootstrap identities. | -| P0-I | Approval APIs trust frontend/body identity such as `approver_id: "operator"`. | Audit identity is not legally or operationally usable. | Decide endpoint must derive principal from auth/session/token, not body. | -| P0-J | API control plane lacks real authorization in some paths; CSRF is not authorization. | Operator APIs can be invoked without a real role boundary. | Add authenticated principal and role checks to AwoooP APIs. | -| P0-K | Alertmanager internal bypass plus `X-Forwarded-For` and broad trusted hosts may allow spoofed source identity. | Forged alert ingress can create incidents or approvals. | Require signed webhook or strict trusted proxy chain. | -| P0-L | AwoooP approval token service also depends on `aioredis`. | Python/runtime compatibility issue can break approvals. | Migrate Redis client path or wrap with a maintained async Redis interface. 
| +| P0-A | MCP Gateway 尚未成為 production choke point;caller 仍可直接碰 `provider.execute()`。 | Gateway gate、redaction、audit 可能被繞過。 | 包裝所有 production MCP call sites,然後將 direct provider access 標記為 `forbid-new`。 | +| P0-B | MCP blocked-call audit 在 `tool_row` 缺失或 Gate 1/2 提早拒絕時有缺口。 | 被拒絕或可疑的呼叫可能從稽核中消失。 | 在 gate 評估前後都寫入 audit attempt,並套安全 redaction。 | +| P0-C | Legacy K8s tool execution 仍有 command/shell injection 風險。 | 破壞性指令路徑可能被 LLM 或使用者輸入污染。 | 解析指令結構、避免 shell、強制 operation schema 與 allowlist。 | +| P0-D | 部分成功路徑用 `MCPToolResult(data=...)`,與 dataclass 欄位不一致。 | Sentry / ArgoCD 等 provider 可能在有效結果上 crash。 | 正規化 result schema,並為所有 MCP providers 補 regression tests。 | +| P0-E | RAG/KM/PlayBook embedding 維度仍分裂為 768 與 1024。 | 搜尋或 backfill 可能靜默失敗,或掩蓋 production drift。 | 統一為 `bge-m3` 1024 維,移除 stale fixtures。 | +| P0-F | KM backfill reconciler 缺少必要 async import 或 runtime dependency。 | 原本應修復 embeddings 的補救路徑可能 crash。 | 對 reconciler path 做 compile 與 integration test。 | +| P0-G | Ollama routing 仍有 direct `OLLAMA_URL` consumers。 | GCP-A/GCP-B/111 順序與 Gemini fallback policy 可能被繞過。 | 盤點、包裝並遷移 call sites 到 resolver / EffectivePolicy。 | +| P0-H | `project_id` 為 null 時 RLS fail-open。 | 有跨租戶資料暴露風險。 | RLS 強制 fail-closed,並建立明確 platform bootstrap identity。 | +| P0-I | Approval APIs 信任 frontend/body identity,例如 `approver_id: "operator"`。 | 稽核身份無法作為法務或維運依據。 | Decide endpoint 必須從 auth/session/token 推導 principal,不信任 body。 | +| P0-J | 部分 API control plane 缺少真正 authorization;CSRF 不是 authorization。 | Operator APIs 可能在沒有真實角色邊界下被呼叫。 | 為 AwoooP APIs 加入 authenticated principal 與 role checks。 | +| P0-K | Alertmanager internal bypass 搭配 `X-Forwarded-For` 與過寬 trusted hosts,可能讓來源身份被 spoof。 | 偽造告警入口可能建立 incidents 或 approvals。 | 要求 signed webhook,或嚴格限制 trusted proxy chain。 | +| P0-L | AwoooP approval token service 也依賴 `aioredis`。 | Python/runtime 相容性問題可能打壞 approvals。 | 遷移 Redis client path,或包裝成已維護的 async Redis interface。 | -### 5.2 P1: Governance consistency and reliability +### 5.2 P1:治理一致性與可靠性 -| ID | Risk | 
Impact | Required direction | +| ID | 風險 | 影響 | 必要方向 | | --- | --- | --- | --- | -| P1-A | Approval repository / TrustEngine race conditions. | Double decision, stale decision, or missed approval. | Add row locks, idempotency keys, and single source of truth. | -| P1-B | TrustEngine uses process memory dict. | Pending approvals vanish or split across pods/restarts. | Move authoritative state to PostgreSQL; Redis only for cache/notification. | -| P1-C | NetworkPolicy contains rules whose names imply deny but semantics allow broad traffic. | False sense of isolation. | Reconcile names, policies, and live cluster behavior. | -| P1-D | Kustomize may omit service registry, HPA, VPA, backup cron, or related resources. | GitOps drift and partial deploys. | Inventory generated/live manifests and add missing resources. | -| P1-E | Hardcoded restart/SSH actions remain in alert rules. | Conflicts with AI automation direction and rule/PlayBook evidence loop. | Convert hardcoded actions into PlayBook proposals with trust gates. | -| P1-F | Startup schema mutation and scattered SQL migrations remain. | Runtime ALTER can diverge dev/prod schema. | Move to disciplined migration path and remove runtime mutation. | -| P1-G | Tests still use stale 768-dimension vectors. | Dev tests can pass while production RAG fails. | Test fixtures must mirror production vector dimensions. | -| P1-H | GCP-B can remain passive fallback only. | Active-active design is not realized. | Use GCP-B for batch, embedding, shadow, and canary lanes via policy. 
| +| P1-A | Approval repository / TrustEngine 有 race condition。 | 可能出現重複決策、stale decision 或 missed approval。 | 加入 row locks、idempotency keys 與 single source of truth。 | +| P1-B | TrustEngine 使用 process memory dict。 | pending approvals 會因 pod restart 或多副本拆裂而消失或分裂。 | 權威狀態移到 PostgreSQL;Redis 僅作 cache/notification。 | +| P1-C | NetworkPolicy 名稱像 deny,但語義上允許過寬流量。 | 造成隔離已完成的錯覺。 | 對齊名稱、政策與 live cluster 行為。 | +| P1-D | Kustomize 可能漏納 service registry、HPA、VPA、backup cron 或相關資源。 | GitOps drift 與 partial deploy。 | 盤點 generated/live manifests,補齊缺失資源。 | +| P1-E | alert rules 中仍殘留 hardcoded restart/SSH actions。 | 與 AI 自動化、rule/PlayBook evidence loop 衝突。 | 轉換為 PlayBook proposals,並走 trust gates。 | +| P1-F | startup schema mutation 與散落 SQL migrations 仍存在。 | Runtime ALTER 可能讓 dev/prod schema 漂移。 | 收斂到紀律化 migration path,移除 runtime mutation。 | +| P1-G | Tests 仍使用 stale 768 維 vectors。 | Dev tests 可能通過,但 production RAG 失效。 | Test fixtures 必須映射 production vector dimensions。 | +| P1-H | GCP-B 可能只停留在被動 fallback。 | Active-active 設計無法落地。 | 透過 policy 讓 GCP-B 承擔 batch、embedding、shadow、canary lanes。 | -### 5.3 P2: Productization and long-term maintainability +### 5.3 P2:產品化與長期可維護性 -| ID | Risk | Impact | Required direction | +| ID | 風險 | 影響 | 必要方向 | | --- | --- | --- | --- | -| P2-A | AwoooP frontend has incomplete i18n and inconsistent IA. | Operator experience is unprofessional and brittle. | Align UI text, sidebar, routes, and workflow labels. | -| P2-B | Operator Console lacks deep trace/journal views. | Operators cannot explain why AI made a decision. | Add run step journal, trace spans, model/tool/cost panes. | -| P2-C | Error code taxonomy is incomplete. | Frontend and channels cannot render actionable states. | Define platform error codes for auth, schema, budget, MCP, routing, approval, and RAG. | -| P2-D | Docs and runtime can drift. | Future sessions repeat the same analysis. | Keep ADR, LOGBOOK, inventory, and release checklists linked to commits. 
| +| P2-A | AwoooP frontend i18n 不完整,IA 不一致。 | Operator experience 不專業且脆弱。 | 對齊 UI 文字、sidebar、routes 與 workflow labels。 | +| P2-B | Operator Console 缺少深度 trace/journal views。 | Operators 無法解釋 AI 為什麼做出某個決策。 | 加入 run step journal、trace spans、model/tool/cost panes。 | +| P2-C | Error code taxonomy 不完整。 | Frontend 與 channels 無法呈現可行動狀態。 | 定義 auth、schema、budget、MCP、routing、approval、RAG 的 platform error codes。 | +| P2-D | Docs 與 runtime 可能漂移。 | 未來 sessions 會重複同樣盤點。 | ADR、LOGBOOK、inventory 與 release checklists 必須連到 commits。 | -## 6. Corrections to Prior Inventories +## 6. 對先前盤點的修正 -The following corrections must be applied when consuming the earlier 12-agent inventories: +消化早期 12-agent inventories 時,必須套用以下修正: -1. C21/C22 CronJob service account and image issues appear to have been partially fixed in the worktree, but must be verified against live cluster state before closure. -2. C1 `aioredis` is not necessarily a top-level import in all affected paths; at least one Gate 5 runtime import still makes the risk valid. -3. C2 is more precise as: blocked calls can skip audit when `tool_row is None`, not only due to a database `NOT NULL` failure. -4. C12 is partially stale for the main embedding service, but `knowledge_rag_service.py` and `playbook_rag.py` still need verification for 1024-dimensional consistency. +1. C21/C22 CronJob service account 與 image 問題在 worktree 中看起來已部分修正,但結案前必須對 live cluster state 驗證。 +2. C1 `aioredis` 不一定在所有受影響路徑都是 top-level import;至少 Gate 5 runtime import 仍讓風險成立。 +3. C2 的精準說法是:`tool_row is None` 時 blocked calls 可能跳過 audit,不只是資料庫 `NOT NULL` failure。 +4. C12 對主要 embedding service 的描述部分過期,但 `knowledge_rag_service.py` 與 `playbook_rag.py` 仍需要驗證 1024 維一致性。 -## 7. Twelve-agent Ownership Matrix +## 7. 12-agent 權責矩陣 -| Owner | Scope | First verification | +| Owner | 範圍 | 第一個驗證點 | | --- | --- | --- | -| Chief Architect | Total blueprint, dependency order, red-zone governance | This integration plan is linked from AwoooP master workplan and LOGBOOK. 
| -| Security | Auth, webhook spoofing, RLS, MCP security, token signing | Authenticated approval decision cannot be forged by body-only identity. | -| MCP Gateway | Gateway chokepoint, audit, provider bypass, result schema | Direct provider call inventory reaches zero `forbid-new` exceptions. | -| K8s / SRE | CronJob, Kustomize, NetworkPolicy, RBAC, deployment verification | `kubectl diff` and live resource inventory match expected GitOps set. | -| DB / Migration | Alembic, RLS, JSONB, FK, approval race | RLS fail-closed tests and migration up/down pass. | -| RAG / KM | `bge-m3` 1024 dimensions, backfill, PlayBook embedding | Production vector dimension and test fixtures match. | -| AI Router / Ollama | Resolver, failover, direct URL cleanup, GCP-A/B/111 lanes | New alert route logs prove `GCP-A -> GCP-B -> 111 -> Gemini` ordering. | -| Core / Trust | TrustEngine, RedisLock, UnitOfWork, feature flags | Pending approval survives pod restart and HPA multi-pod execution. | -| API Layer | Platform endpoints auth, CSRF, tenant boundaries | Platform APIs reject unauthenticated and cross-project requests. | -| Frontend / AwoooP | i18n, sidebar, approver identity, internal IP ban | `/zh-TW/awooop` renders without redirect error and no private `NEXT_PUBLIC_*` bundle leak. | -| QA / Test | Integration tests, no-mock scanning, regression matrix | High-risk paths have production-like integration tests, not only mocks. | -| Docs / Release | ADR, LOGBOOK, Memory handoff, runbooks, release checklist | Each wave has a release note, rollback path, and live verification evidence. 
| +| Chief Architect | 總藍圖、依賴順序、紅區治理 | 本整合計畫已從 AwoooP master workplan 與 LOGBOOK 串接。 | +| Security | Auth、webhook spoofing、RLS、MCP security、token signing | Approval decision 不能被 body-only identity 偽造。 | +| MCP Gateway | Gateway chokepoint、audit、provider bypass、result schema | Direct provider call inventory 達到零個未追蹤 `forbid-new` 例外。 | +| K8s / SRE | CronJob、Kustomize、NetworkPolicy、RBAC、deployment verification | `kubectl diff` 與 live resource inventory 符合 GitOps 預期集合。 | +| DB / Migration | Alembic、RLS、JSONB、FK、approval race | RLS fail-closed tests 與 migration up/down 通過。 | +| RAG / KM | `bge-m3` 1024 維、backfill、PlayBook embedding | Production vector dimension 與 test fixtures 一致。 | +| AI Router / Ollama | Resolver、failover、direct URL cleanup、GCP-A/B/111 lanes | 新 alert route logs 證明 `GCP-A -> GCP-B -> 111 -> Gemini` 順序。 | +| Core / Trust | TrustEngine、RedisLock、UnitOfWork、feature flags | Pending approval 可存活於 pod restart 與 HPA multi-pod 執行。 | +| API Layer | Platform endpoints auth、CSRF、tenant boundaries | Platform APIs 會拒絕 unauthenticated 與 cross-project requests。 | +| Frontend / AwoooP | i18n、sidebar、approver identity、internal IP ban | `/zh-TW/awooop` 可正常 render,且 browser bundle 無 private `NEXT_PUBLIC_*` 外洩。 | +| QA / Test | 整合測試、no-mock 掃描、回歸矩陣 | 高風險路徑有 production-like integration tests,不只靠 mocks。 | +| Docs / Release | ADR、LOGBOOK、Memory handoff、runbooks、release checklist | 每個 wave 都有 release note、rollback path 與 live verification evidence。 | -## 8. Execution Waves +## 8. 執行 waves -### Wave 0: Formal convergence baseline +### Wave 0:正式收斂基準 -Goal: convert the cross-session conclusion into a single source of operational truth. +目標:把跨 session 結論轉成單一營運真相來源。 -Work: +工作: -- Add this integration plan. -- Link it from `MASTER-WORKPLAN.md` and `AWOOOP-MONITORING-ALERTING-CONVERGENCE.md`. -- Keep runtime untouched except already-approved urgent repairs. -- Define owner and wave for each high-risk item. 
+- 新增本整合計畫。 +- 從 `MASTER-WORKPLAN.md` 與 `AWOOOP-MONITORING-ALERTING-CONVERGENCE.md` 串接本文件。 +- 除已授權的緊急修復外,不碰 runtime。 +- 為每個高風險項目定義 owner 與 wave。 -Exit criteria: +退出條件: -- Plan is committed. -- LOGBOOK records the integration baseline. -- Next implementation session can start from this document without re-deriving the architecture. +- 計畫已 commit。 +- LOGBOOK 記錄整合基準。 +- 下一個 implementation session 可直接從本文件開工,不必重推架構。 -### Wave 1: P0 production-broken / security-critical +### Wave 1:P0 production-broken / security-critical -Goal: prevent the automation flywheel from bypassing governance. +目標:防止自動化飛輪繞過治理。 -Progress: +進度: -- 2026-05-06: P0-L first enforcement patch landed. AwoooP approval token - storage and MCP Gateway Gate 5 now use the shared Redis pool instead of - runtime `aioredis.from_url()` clients. -- 2026-05-06: P0-I first enforcement patch landed. The AwoooP approval decide - endpoint now derives `approver_id` from trusted operator headers instead of - frontend body data, and production fails closed when `AWOOOP_OPERATOR_API_KEY` - is not configured. +- 2026-05-06:P0-L 第一段 enforcement patch 已落地。AwoooP approval token storage 與 MCP Gateway Gate 5 已改用 shared Redis pool,不再建立 runtime `aioredis.from_url()` client。 +- 2026-05-06:P0-I 第一段 enforcement patch 已落地。AwoooP approval decide endpoint 已從 trusted operator headers 推導 `approver_id`,不再信任 frontend body;production 缺 `AWOOOP_OPERATOR_API_KEY` 時 fail closed。 -Work: +工作: -- MCP Gateway bypass and audit gaps. -- `aioredis` runtime compatibility in approval/MCP paths. -- `MCPToolResult` schema normalization. -- K8s command injection hardening. -- Webhook spoofing and platform API approval auth. -- RAG/KM/PlayBook 1024-dimensional consistency. -- Ollama direct `OLLAMA_URL` cleanup for production alert paths. 
+- MCP Gateway bypass 與 audit gaps。 +- Approval/MCP 路徑中的 `aioredis` runtime 相容性。 +- `MCPToolResult` schema normalization。 +- K8s command injection hardening。 +- Webhook spoofing 與 platform API approval auth。 +- RAG/KM/PlayBook 1024 維一致性。 +- Production alert paths 的 Ollama direct `OLLAMA_URL` cleanup。 -Exit criteria: +退出條件: -- No production MCP caller can bypass Gateway without a tracked `forbid-new` exception. -- Approval decide endpoint uses authenticated principal. -- Embedding dimensions are production-consistent. -- Alert route logs prove Gemini is fallback only after GCP-A, GCP-B, and 111. +- 沒有 production MCP caller 可在未追蹤 `forbid-new` 例外下繞過 Gateway。 +- Approval decide endpoint 使用 authenticated principal。 +- Embedding dimensions 與 production 一致。 +- Alert route logs 證明 Gemini 只在 GCP-A、GCP-B、111 全部失敗後才作為 fallback。 -### Wave 2: P1 governance consistency +### Wave 2:P1 治理一致性 -Goal: make platform state durable and tenant-safe. +目標:讓 platform state 持久且 tenant-safe。 -Work: +工作: -- RLS fail-closed and project context injection. -- Approval / Trust race fixes. -- TrustEngine state moves from process memory to PostgreSQL. -- NetworkPolicy semantic cleanup. -- Kustomize resource coverage. -- Ollama failover centralization and GCP-B active-active usage. +- RLS fail-closed 與 project context injection。 +- Approval / Trust race fixes。 +- TrustEngine state 從 process memory 移到 PostgreSQL。 +- NetworkPolicy semantic cleanup。 +- Kustomize resource coverage。 +- Ollama failover centralization 與 GCP-B active-active usage。 -Exit criteria: +退出條件: -- Cross-project read tests fail closed. -- Pending approval survives pod restart. -- Live cluster resources match Git-managed resources. +- Cross-project read tests fail closed。 +- Pending approval 可存活於 pod restart。 +- Live cluster resources 符合 Git-managed resources。 -### Wave 3: P2 productization +### Wave 3:P2 產品化 -Goal: make AwoooP a usable professional operator console, not a side dashboard. 
+目標:讓 AwoooP 成為專業可用的 operator console,而不是旁支 dashboard。 -Work: +工作: -- AwoooP i18n and sidebar IA. -- Run step journal and trace view. -- Platform error code taxonomy. -- Alembic and migration discipline. -- Internal IP bundle checks. -- Regression matrix and release checklists. +- AwoooP i18n 與 sidebar IA。 +- Run step journal 與 trace view。 +- Platform error code taxonomy。 +- Alembic 與 migration discipline。 +- Internal IP bundle checks。 +- Regression matrix 與 release checklists。 -Exit criteria: +退出條件: -- Operator can explain a run from alert to MCP execution to KM feedback. -- Frontend has no hardcoded user-facing strings outside i18n. -- Release checklist ties commit, image, route health, and live URL verification. +- Operator 能從 alert 到 MCP execution 到 KM feedback 解釋一次 run。 +- Frontend user-facing strings 全部收斂到 i18n,不在元件內硬編碼。 +- Release checklist 串起 commit、image、route health 與 live URL verification。 -## 9. Wave Dependencies +## 9. Wave 依賴 ```text Wave 0 @@ -261,40 +257,40 @@ Wave 0 -> Wave 3 Operator Console productization ``` -Do not move a destructive or auto-remediation path to AwoooP enforcement until Wave 1 is green and Wave 2 tenant state is fail-closed. +Wave 1 綠燈且 Wave 2 tenant state fail-closed 前,不得把破壞性或自動修復路徑推進到 AwoooP enforcement。 -## 10. Release and Verification Discipline +## 10. 
Release 與驗證紀律 -Each wave must publish: +每個 wave 必須發布: -- changed surface -- runtime behavior changed: yes/no +- 變更面 +- runtime behavior 是否改變 - contract family -- migration posture: keep, mirror, wrap, replace, forbid-new -- rollback or disable flag -- affected event types -- risk rating -- live verification command or URL +- migration posture:keep、mirror、wrap、replace、forbid-new +- rollback 或 disable flag +- 受影響 event types +- 風險等級 +- live verification command 或 URL - LOGBOOK entry -For frontend releases, also verify: +Frontend release 也必須驗證: - `pnpm --filter @awoooi/web typecheck` - `NEXT_PUBLIC_API_URL=https://awoooi.wooo.work pnpm --filter @awoooi/web build` -- no private `NEXT_PUBLIC_*` internal IPs in generated browser bundle -- live route status and content after CD rollout +- 產出的 browser bundle 不得包含 private `NEXT_PUBLIC_*` internal IPs +- CD rollout 後檢查 live route status 與內容 -For AI routing releases, also verify: +AI routing release 也必須驗證: - active pod environment - provider route logs - cost-bearing provider fallback order -- `Gemini` appears only after all configured Ollama lanes fail or policy explicitly permits cloud fallback +- 只有在所有 configured Ollama lanes 都失敗,或 policy 明確允許 cloud fallback 時,`Gemini` 才能出現 -## 11. Immediate Next Items +## 11. 立即下一步 -1. Wire `scripts/ops/deploy-alertmanager-config.sh` into the reboot/release checklist, then consider whether CD should run it for `ops/alertmanager/**` changes. -2. Continue Wave 1 with MCP Gateway bypass and MCP audit completeness, because production callers can still route around the gateway. -3. Keep GCP-A/GCP-B/111 Ollama routing verification in every alert-path release until EffectivePolicy becomes authoritative. -4. Add a Sentry/Snuba post-reboot health gate: ClickHouse table existence, Snuba migration status, and Kafka consumer offsets must be part of cold-start validation. -5. 
Add a post-deploy Alertmanager live check for `amtool check-config`, container status, and config-file mode; direct Telegram must remain emergency-only and target the SRE group. +1. 將 `scripts/ops/deploy-alertmanager-config.sh` 納入 reboot/release checklist,並評估 `ops/alertmanager/**` 變更是否應由 CD 自動執行。 +2. 持續推進 Wave 1 的 MCP Gateway bypass 與 MCP audit completeness,因為 production callers 仍可能繞過 gateway。 +3. 每次 alert-path release 都必須驗證 GCP-A/GCP-B/111 Ollama routing,直到 EffectivePolicy 成為權威來源。 +4. 新增 Sentry/Snuba post-reboot health gate:ClickHouse table existence、Snuba migration status、Kafka consumer offsets 必須納入 cold-start validation。 +5. 新增 Alertmanager post-deploy live check:`amtool check-config`、container status、config-file mode;direct Telegram 必須維持 emergency-only,且目標只能是 SRE 群組。
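
Reviewer note on next-step item 5 (the Alertmanager post-deploy live check): the three checks it names could be sketched roughly as below. This is a minimal sketch, not the contents of `scripts/ops/deploy-alertmanager-config.sh`. The emergency alert names, the `0644` mode, and the `docker exec alertmanager amtool check-config` command are taken from the LOGBOOK entry in this diff; the function names and the idea of passing the direct route's alert names in as a plain list are assumptions made for illustration.

```python
import fnmatch
import os
import stat
import subprocess

# Emergency-only alert names, as listed in the LOGBOOK entry of this diff.
EMERGENCY_PATTERNS = (
    "AWOOOIApiDown",
    "AlertmanagerDown",
    "AlertChainBroken_*",
    "AlertChainUnhealthy",
    "NoAlertsReceived2Hours",
)

# The diff notes that a 0600 config sent the container into a restart loop.
EXPECTED_MODE = 0o644


def config_mode_ok(path: str) -> bool:
    """Return True when the file's permission bits are exactly 0644."""
    return stat.S_IMODE(os.stat(path).st_mode) == EXPECTED_MODE


def non_emergency_alerts(direct_route_alertnames: list[str]) -> list[str]:
    """Return alert names on the direct route that are not emergency-only.

    The input list is hypothetical: it stands in for whatever alert names
    the `telegram-direct` route is found to match.
    """
    return [
        name
        for name in direct_route_alertnames
        if not any(fnmatch.fnmatchcase(name, pat) for pat in EMERGENCY_PATTERNS)
    ]


def amtool_check_config() -> bool:
    """Run the amtool check quoted in the LOGBOOK; requires Docker on the host."""
    result = subprocess.run(
        [
            "docker", "exec", "alertmanager",
            "amtool", "check-config", "/etc/alertmanager/alertmanager.yml",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0
```

A post-deploy gate could fail the release when `config_mode_ok` returns False, `non_emergency_alerts` returns a non-empty list, or `amtool_check_config` returns False. Keeping the three checks as separate functions lets the first two run in CI without Docker.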