docs(skills): Phase 14.2 CI/CD architecture review + dependency-cruiser integration

- Skill 04: Runner zombie-process fix + cancel-in-progress: false
- Skill 05: Add SRE QA content
- Skill 06: dependency-cruiser dependency governance (Layer Model + ADR-014)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
OG T
2026-03-26 09:53:56 +08:00
parent 45c3656004
commit 163d94a35b
4 changed files with 201 additions and 4 deletions


@@ -10,7 +10,7 @@
| Field | Value |
|------|-----|
| **Version** | v1.5 |
| **Version** | v1.6 |
| **Created** | 2026-03-20 (Taipei) |
| **Created by** | Claude Code |
| **Last modified** | 2026-03-26 03:30 (Taipei) |
@@ -26,6 +26,7 @@
| v1.3 | 2026-03-25 | Claude Code | Added document info block |
| v1.4 | 2026-03-26 | Claude Code | Added iron rule for deployment-tier decisions |
| v1.5 | 2026-03-26 | Claude Code | **Phase 15 three-layer observability architecture (Deep Linking)** |
| v1.6 | 2026-03-26 | Claude Code | **Runner zombie-process fix + CI/CD cancel-in-progress: false** |
---
@@ -234,6 +235,83 @@ runs-on: [self-hosted, harbor, k8s]
# ❌ --no-gpg-sign
```
### Concurrency strategy (lesson from 2026-03-26)
```yaml
concurrency:
  group: cd-${{ github.workflow }}-${{ github.ref }}
  # 🔴 Queue new runs instead of cancelling, to avoid file conflicts in the Runner's _diag/pages
  cancel-in-progress: false
```
**Reason**: When the Runner does not clean up completely, `cancel-in-progress: true` can cause:
- `_diag/pages/*.log` file conflicts
- Session Conflict errors
- Missing set_output files
---
## 🚨 Runner zombie-process fix (lesson from 2026-03-26)
> **Problem**: CI/CD workflows fail repeatedly (set_output file missing / file already exists / Session Conflict)
> **Memory**: `feedback_runner_zombie_process.md`
### Symptoms
| Error message | Cause |
|---------|------|
| `Missing file at path: _runner_file_commands/set_output_*` | Runner directory permission problem |
| `File already exists: _diag/pages/*.log` | Zombie process not cleaned up |
| `TaskAgentSessionConflictException` | Multiple Runner.Listener processes running at once |
| `could not read Username for 'https://github.com'` | Failed to read the Git auth token |
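The symptom table can be turned into a small triage helper. The sketch below is only an illustration: the function name `classify_runner_error` and the cause labels it prints are hypothetical, and the match patterns are taken from the error messages listed in this commit, not from any official Runner error catalogue.

```bash
# classify_runner_error LINE
# Hypothetical helper: maps one log line to the cause named in the symptom table.
classify_runner_error() {
  case "$1" in
    *"Missing file at path"*"_runner_file_commands/set_output_"*)
      echo "runner directory permissions" ;;
    *"File already exists"*"_diag/pages/"*)
      echo "zombie process not cleaned up" ;;
    *TaskAgentSessionConflictException*)
      echo "multiple Runner.Listener processes" ;;
    *"could not read Username for"*|*"terminal prompts disabled"*)
      echo "git auth token read failure" ;;
    *)
      echo "unknown" ;;
  esac
}
```

One possible use is to pipe a failed job log through it line by line (`while IFS= read -r l; do classify_runner_error "$l"; done < job.log | sort | uniq -c`) to see which cause dominates; `job.log` is a placeholder file name.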
### Repair procedure (Tier 2: requires the Commander's confirmation)
```bash
# Step 1: Stop the services
sudo systemctl stop actions.runner.owenhytsai-awoooi.awoooi-110.service
sudo systemctl stop actions.runner.owenhytsai-awoooi.awoooi-110-2.service
# Step 2: Fix ownership (undo the root ownership left behind by sudo)
sudo chown -R wooo:wooo /home/wooo/actions-runner-awoooi
sudo chown -R wooo:wooo /home/wooo/actions-runner-awoooi-2
# Step 3: Kill the zombie processes
pkill -9 -u wooo -f 'Runner'
# Step 4: Safe cleanup (without sudo)
rm -rf /home/wooo/actions-runner-awoooi/_work/*
rm -rf /home/wooo/actions-runner-awoooi-2/_work/*
rm -rf /home/wooo/actions-runner-awoooi*/_diag/pages/*
# Step 5: Restart the services
sudo systemctl start actions.runner.owenhytsai-awoooi.awoooi-110.service
sudo systemctl start actions.runner.owenhytsai-awoooi.awoooi-110-2.service
```
### Diagnostic commands
```bash
# Check for zombie processes
ps aux | grep -E 'Runner.Listener|Runner.Worker' | grep -v grep
# Check the log for session conflicts
tail -50 ~/actions-runner-awoooi-2/_diag/Runner_*.log | grep -i conflict
# Verify ownership and permissions
ls -la ~/actions-runner-awoooi*/_work/_temp/
```
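Since `TaskAgentSessionConflictException` is attributed to several `Runner.Listener` processes running at once, it may help to check whether any single runner directory has more than one listener. The following is a minimal sketch, assuming each listener's command line contains the absolute path `<runner dir>/bin/Runner.Listener`; the function name `duplicate_listeners` is hypothetical.

```bash
# duplicate_listeners
# Reads process command lines on stdin (for example from `ps -eo args`)
# and prints every runner directory that has more than one Runner.Listener.
# Assumption: the command line contains "<runner dir>/bin/Runner.Listener".
duplicate_listeners() {
  grep -oE '[^ ]*/bin/Runner\.Listener' \
    | sort \
    | uniq -d \
    | sed 's|/bin/Runner\.Listener$||'
}
```

Run it as `ps -eo args | duplicate_listeners`. Empty output suggests each runner directory has at most one listener; any directory printed is a candidate for the repair procedure.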
### Workflow safeguard
```yaml
# Clean the temp directories at the start of every job
- name: "Clean Runner temp"
  run: |
    RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
    rm -rf "$RUNNER_TEMP"/* "$RUNNER_ROOT/_diag/pages"/* .claude/worktrees 2>/dev/null || true
```
### Telegram notification (closed loop)
```bash


@@ -10,7 +10,7 @@
| Field | Value |
|------|-----|
| **Version** | v1.3 |
| **Version** | v1.4 |
| **Created** | 2026-03-20 (Taipei) |
| **Created by** | Claude Code |
| **Last modified** | 2026-03-26 03:30 (Taipei) |
@@ -24,6 +24,7 @@
| v1.1 | 2026-03-24 | Claude Code | Iron rule: no mock tests |
| v1.2 | 2026-03-25 | Claude Code | Added document info block |
| v1.3 | 2026-03-26 | Claude Code | **Phase 15 observability tests** |
| v1.4 | 2026-03-26 | Claude Code | **Runner zombie-process diagnosis procedure** |
---
@@ -482,6 +483,55 @@ with restore_trace_context({"trace_id": "", "span_id": ""}) as span:
---
---
## 🚨 Runner zombie-process diagnosis (added 2026-03-26)
> **Problem**: CI/CD workflows fail repeatedly, and the error message keeps changing
> **Memory**: `feedback_runner_zombie_process.md`
### Diagnosis procedure
#### Step 1: Identify the symptom
| Error message | Likely cause |
|---------|---------|
| `Missing file at path: _runner_file_commands/set_output_*` | Permission problem (caused by sudo) |
| `File already exists: _diag/pages/*.log` | Zombie process not cleaned up |
| `TaskAgentSessionConflictException` | Multiple Runner.Listener processes |
| `terminal prompts disabled` | Failed to read the Git auth token |
#### Step 2: Diagnostic commands
```bash
# Check for zombie processes
ps aux | grep -E 'Runner.Listener|Runner.Worker' | grep -v grep
# Check for session conflicts
tail -50 ~/actions-runner-awoooi-2/_diag/Runner_*.log | grep -i conflict
# Check directory ownership (should be wooo:wooo)
ls -la ~/actions-runner-awoooi*/_work/_temp/
```
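The `ls -la` check only shows one directory level, so a root-owned file deeper in the tree can be missed. As a hedged alternative, the sketch below uses `find` with `! -user` to list everything not owned by the expected account; the function name `not_owned_by` is hypothetical.

```bash
# not_owned_by OWNER DIR...
# Prints every path under the given directories that is NOT owned by OWNER.
# Empty output means ownership is consistent.
not_owned_by() {
  local owner=$1
  shift
  find "$@" ! -user "$owner" -print 2>/dev/null
}
```

For the runners in this document the call would look like `not_owned_by wooo ~/actions-runner-awoooi ~/actions-runner-awoooi-2`. Any path it prints points at the permission problem in the Step 1 table.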
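#### Step 3: Decide the handling tier
| Situation | Tier | Action |
|------|------|------|
| Plain temp-file conflict | Tier 1 | Wait for the workflow to retry automatically |
| Permission problem | Tier 2 | Notify the Commander, then run the chown fix |
| Zombie process | Tier 2 | Notify the Commander, then run the pkill cleanup |
| Service completely hung | Tier 3 | The Commander restarts the service personally |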
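The tier decision can also be written as a lookup, which makes it harder to run a Tier 2 action without confirmation by accident. This is a sketch only: the situation keys (`temp-file-conflict`, `permissions`, `zombie-process`, `service-hung`) and the function name `tier_for` are hypothetical labels for the four table rows.

```bash
# tier_for SITUATION
# Prints the tier number for one of the four situations in the table;
# returns non-zero for anything else so callers do not guess a tier.
tier_for() {
  case "$1" in
    temp-file-conflict)         echo 1 ;;  # wait for the automatic retry
    permissions|zombie-process) echo 2 ;;  # notify the Commander first
    service-hung)               echo 3 ;;  # Commander handles the restart
    *)                          return 1 ;;
  esac
}
```

A wrapper script could refuse to continue unless `tier_for "$situation"` prints `1`, leaving tiers 2 and 3 to a human.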
### Full repair SOP (Tier 2)
```bash
# See Skill 04, "Runner zombie-process fix"
# or the Memory file: feedback_runner_zombie_process.md
```
---
## References
- `apps/web/playwright.config.ts`: Playwright configuration
@@ -491,3 +541,4 @@ with restore_trace_context({"trace_id": "", "span_id": ""}) as span:
- `src/core/deep_linking.py`: **👁️ Deep Linking URL generator**
- `src/core/telemetry.py`: **Phase 15.2 Trace Context**
- `memory/project_phase15_langfuse.md`: **📊 Phase 15 full record**
- `memory/feedback_runner_zombie_process.md`: **🚨 Runner zombie-process fix**


@@ -10,7 +10,7 @@
| Field | Value |
|------|-----|
| **Version** | v1.4 |
| **Version** | v1.5 |
| **Created** | 2026-03-20 (Taipei) |
| **Created by** | Claude Code |
| **Last modified** | 2026-03-26 15:40 (Taipei) |
@@ -25,6 +25,7 @@
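| v1.2 | 2026-03-26 | Claude Code | Added Red Zone governance + Git Hooks sections |
| v1.3 | 2026-03-26 | Claude Code | Chief architect review process + review cadence change (weekly) |
| v1.4 | 2026-03-26 | Claude Code | 🔴 Added the "archive, don't delete" policy (Commander's ruling) |
| v1.5 | 2026-03-26 | Claude Code | **dependency-cruiser dependency governance integration (Phase 14.2)** |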
---
@@ -323,6 +324,71 @@ scripts/hooks/pre-commit # source file (tracked)
---
---
## 🔗 dependency-cruiser dependency governance (Phase 14.2)
> **ADR-014**: Layered dependency governance for the frontend
> **Config file**: `.dependency-cruiser.cjs`
### Layer Model
```
Layer 0: app/ (Pages - may import from every layer)
Layer 1: components/ (Features - must not import each other)
│ ├── agent/
│ ├── approval/
│ ├── incident/
│ └── dashboard/
Layer 2: shared/layout (must not import from Layer 1)
Layer 3: ui/lib/stores/hooks (pure utility layer - must not import components)
```
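One way to read the Layer Model is that an import may only point at the same layer or a higher-numbered one, with the extra restriction that Layer 1 features never import each other. The sketch below encodes that reading. It is not the dependency-cruiser configuration itself. It assumes paths are relative to `apps/web/src`, and that `shared`, `layout` and `ui` live under `components/` while `lib`, `stores` and `hooks` sit beside it. All function names are hypothetical.

```bash
# layer_of PATH: prints the layer number (0-3), or -1 if PATH is outside the model.
layer_of() {
  case "$1" in
    app/*)                                  echo 0 ;;
    components/shared/*|components/layout/*) echo 2 ;;
    components/ui/*|lib/*|stores/*|hooks/*)  echo 3 ;;
    components/*)                           echo 1 ;;
    *)                                      echo -1 ;;
  esac
}

# feature_of PATH: components/agent/AgentChat.tsx -> agent
feature_of() {
  local rest=${1#components/}
  echo "${rest%%/*}"
}

# import_allowed FROM TO: exit status 0 if FROM may import TO under the reading above.
import_allowed() {
  local from to
  from=$(layer_of "$1")
  to=$(layer_of "$2")
  # Paths outside the model are not judged by this sketch.
  if [ "$from" -lt 0 ] || [ "$to" -lt 0 ]; then return 0; fi
  # Features may only import from their own feature directory.
  if [ "$from" -eq 1 ] && [ "$to" -eq 1 ]; then
    [ "$(feature_of "$1")" = "$(feature_of "$2")" ]
    return
  fi
  # Otherwise a dependency may only point at the same or a higher-numbered layer.
  [ "$to" -ge "$from" ]
}
```

For example, `import_allowed components/agent/AgentChat.tsx components/approval/ApprovalCard.tsx` fails, while the same source importing `components/ui/card.tsx` succeeds. The real enforcement remains the rules in `.dependency-cruiser.cjs`.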
### Check command
```bash
# Scan the frontend for dependency violations
pnpm dep-check
# Output format: severity | rule | from → to
```
### Rule list
| Rule | Severity | Description |
|------|--------|------|
| `feature-isolation-*` | error | Features must not import each other |
| `shared-no-feature-import` | error | Shared must not import Feature |
| `ui-no-feature-import` | error | UI must not import Feature/Shared |
| `components-no-app-import` | error | Components must not import app |
| `no-circular` | error | Circular dependencies are forbidden |
| `hooks-no-component-import` | warn | Hooks must not import Components |
| `stores-no-component-import` | warn | Stores must not import Components |
### Violation example
```typescript
// ❌ Violates feature-isolation-agent
// apps/web/src/components/agent/AgentChat.tsx
import { ApprovalCard } from '../approval/ApprovalCard' // error!
// ✅ Correct: use the shared layer
import { Card } from '../ui/card'
```
### CI integration
```yaml
# .github/workflows/ci.yaml
- name: Check dependencies
  run: pnpm dep-check
```
---
## References
- `turbo.json`: Turborepo configuration
@@ -331,3 +397,5 @@ scripts/hooks/pre-commit # source file (tracked)
- `docs/LOGBOOK.md`: progress tracking
- `docs/RED_ZONES.md`: Red Zone governance handbook
- `scripts/hooks/pre-commit`: Red Zone hook script
- `.dependency-cruiser.cjs`: **Phase 14.2 dependency governance rules**
- `docs/adr/ADR-014-dependency-governance.md`: **ADR-014 decision record**


@@ -8,7 +8,7 @@ Docker, K3s, Nginx, Host Networking
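## Core Constraints (AWOOOI Constitution)
1. **Split Brain Prevention**:
- Remember the four-host architecture: `.110` (vault), `.112` (security), `.120/.121` (K3s resources), `.188` (the only brain, hosting Nginx/Ollama/ClawBot/SigNoz).
- Remember the four-host architecture: `.110` (vault), `.112` (security), `.120/.121` (K3s resources), `.188` (the only brain, hosting Nginx/Ollama/OpenClaw/SigNoz).
- Never deploy a decision-making AI model on any host other than `.188`.
2. **Authorization Tiers**: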