awoooi/ops/runner/README.md
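OG T d15fb7d9f4 fix(cd): serialize builds to fix the Runner _runner_file_commands conflict
Root cause: the Set up job phase of parallel Jobs writes to RUNNER_TEMP at the same time
Fix: build-api needs build-web, which forces sequential execution
Removed: Job-level concurrency groups (no longer needed)
Updated: ops/runner/README.md v1.0→v2.0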

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 10:29:11 +08:00

# GitHub Actions Runner Stability Fix
## Problem: `_diag/pages` file conflict
```
Error: The file '/home/wooo/actions-runner-awoooi/_diag/pages/xxx.log' already exists.
```
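### Root cause analysis (full diagnosis, 2026-03-29)
1. **When it occurs**: during the "Set up job" phase, before any custom step runs
2. **Cause**: an internal bug in GitHub Actions Runner
   - The Runner writes `_diag/pages/*.log` while initializing a Job
   - Multiple Jobs running in parallel may write to the same UUID-named file
   - This happens **before** our cleanup step runs
3. **Secondary issue**: shared `RUNNER_TEMP`
   - `_work/_temp/_runner_file_commands` is shared across all Jobs
   - Cleaning this directory causes "Missing file at path" errors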
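The collision described in item 2 can be modelled in a few lines of shell. This is only an illustration: it assumes the Runner creates each page log with "create new, fail if present" semantics, which the error message suggests but this document does not confirm. `create_exclusive` is a hypothetical helper, not Runner code.

```bash
#!/usr/bin/env bash
# Illustration only: models the collision with exclusive file creation.
# `create_exclusive` is a hypothetical helper, not part of the Runner.
create_exclusive() {
  # With noclobber set, ">" refuses to overwrite an existing regular file,
  # so the second writer of the same path fails instead of truncating it.
  ( set -o noclobber; : > "$1" ) 2>/dev/null
}
```

Calling `create_exclusive` twice with the same path succeeds the first time and returns a non-zero status the second time, which mirrors two Jobs initializing with the same page file name.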
### Solution (v4, final, 2026-03-29)
#### 1. Sequential builds (core fix)
```yaml
# build-api must wait for build-web to finish
build-api:
  needs: [detect-changes, build-web]  # Key: depends on build-web
```
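**Root cause**: when Jobs run in parallel, their "Set up job" phases write to `_runner_file_commands` at the same time, which causes the conflict
**Fix**: run the Jobs sequentially so that only one Job is on the Runner at any time
#### 2. Workflow Concurrency (supporting)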
```yaml
concurrency:
  group: cd-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```
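Ensures that only one run of this workflow per ref is active at a time; a newer run cancels the one in progress
#### 3. Job-level cleanup (defensive)
Clean `_diag/pages` at the start of every Job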
```yaml
- name: "Clean Runner Diagnostics"
  run: |
    RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
    rm -rf "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
    mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
```
**Warning**: never clean `$RUNNER_TEMP/*`; doing so breaks `_runner_file_commands`
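One way to enforce this rule in cleanup scripts is a small guard that refuses to delete anything at or below `RUNNER_TEMP`. The following is a minimal sketch, not code from this repository: `safe_rm_dir` is a hypothetical helper name, and it compares path strings only, so it assumes absolute paths and does not resolve symlinks.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete a directory only if it lies outside RUNNER_TEMP.
# Assumes absolute paths; symlinks and ".." segments are not resolved.
safe_rm_dir() {
  local target="${1%/}"
  local runner_temp="${RUNNER_TEMP:-}"
  runner_temp="${runner_temp%/}"

  if [ -z "$runner_temp" ]; then
    echo "RUNNER_TEMP is not set; refusing to delete $target" >&2
    return 2
  fi

  # Match RUNNER_TEMP itself and everything beneath it.
  case "$target/" in
    "$runner_temp"/*)
      echo "refusing to delete inside RUNNER_TEMP: $target" >&2
      return 1
      ;;
  esac

  rm -rf -- "$target"
}
```

With this guard, `safe_rm_dir "$RUNNER_ROOT/_diag/pages"` proceeds as before, while `safe_rm_dir "$RUNNER_TEMP/_runner_file_commands"` returns a non-zero status and leaves the directory in place.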
#### 4. Systemd Timer (background cleanup)
Stale diagnostic files are cleaned automatically every 5 minutes:
```bash
# Deploy: log in to the Runner host, then run the deploy script there
ssh wooo@192.168.0.110
cd /path/to/awoooi/ops/runner
bash deploy-runner-cleanup.sh
```
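The contents of `cleanup-runner-diag.sh` are not reproduced here. As a rough sketch of what an age-based cleanup might look like, the function below deletes only `*.log` files older than a threshold. The function name `clean_stale_diag` and the default threshold of 10 minutes are assumptions for illustration and may differ from the real script.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an age-based cleanup; the real cleanup-runner-diag.sh may differ.
clean_stale_diag() {
  local pages_dir="$1"
  local max_age_min="${2:-10}"  # assumed threshold, not taken from the repository

  # Nothing to do if the directory does not exist.
  [ -d "$pages_dir" ] || return 0

  # Delete only regular *.log files older than the threshold. Fresh files are
  # left alone because a running Job may still be writing to them.
  find "$pages_dir" -maxdepth 1 -type f -name '*.log' -mmin "+$max_age_min" -delete
}
```

For example, `clean_stale_diag /home/wooo/actions-runner-awoooi/_diag/pages 10` would remove page logs last modified more than 10 minutes ago and keep everything newer.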
### Files
| File | Purpose |
|------|------|
| `cleanup-runner-diag.sh` | Cleanup script (installed into the Runner directory) |
| `runner-diag-cleanup.service` | Systemd service definition |
| `runner-diag-cleanup.timer` | Systemd timer (every 5 minutes) |
| `deploy-runner-cleanup.sh` | One-step deployment script |
### Monitoring
```bash
# Check the timer status
sudo systemctl status runner-diag-cleanup.timer
# Follow the cleanup logs
journalctl -u runner-diag-cleanup.service -f
# Trigger a cleanup manually
sudo systemctl start runner-diag-cleanup.service
```
### Related documents
- Memory: `feedback_runner_zombie_process.md`
- ADR: to be created (if the problem persists)
---
Version: v2.0 | Updated: 2026-03-29 | Author: Claude Code
Changes: v1.0→v2.0, sequential builds replace Job Concurrency Groups