# GitHub Actions Runner Stability Fixes

## Problem: `_diag/pages` file conflict

```
Error: The file '/home/wooo/actions-runner-awoooi/_diag/pages/xxx.log' already exists.
```
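
### Root cause analysis (full diagnosis, 2026-03-29)

1. **When it occurs**: during the "Set up job" phase, before any custom step runs.
2. **Cause**: an internal bug in the GitHub Actions Runner.
   - The runner writes `_diag/pages/*.log` while initializing a job.
   - Multiple jobs running in parallel may write to a file with the same UUID.
   - This happens **before** our cleanup step runs.
3. **Secondary issue**: `RUNNER_TEMP` is shared.
   - `_work/_temp/_runner_file_commands` is shared across all jobs.
   - Cleaning this directory causes "Missing file at path" errors.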

### Solution (v4, final version, 2026-03-29)

#### 1. Sequential builds (core fix)

```yaml
# build-api must wait for build-web to finish
build-api:
  needs: [detect-changes, build-web]  # key: depends on build-web
```

**Root cause**: when jobs run in parallel, their "Set up job" phases write to `_runner_file_commands` at the same time, which causes the conflict.

**Fix**: run the jobs sequentially so that only one job is on the runner at any time.

#### 2. Workflow concurrency (supplementary)

```yaml
concurrency:
  group: cd-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

This ensures that only one run of the workflow per ref executes at a time; a newer run cancels the one in progress.

#### 3. Job-level cleanup (defensive)

Clean `_diag/pages` at the start of every job:

```yaml
- name: "Clean Runner Diagnostics"
  run: |
    # RUNNER_TEMP is <runner root>/_work/_temp, so go up two levels
    RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
    rm -rf "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
    mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
```

**Warning**: never clean `$RUNNER_TEMP/*`; doing so breaks `_runner_file_commands`.
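
The path arithmetic in the cleanup step can be sketched as a standalone script. This is a minimal sketch, assuming `RUNNER_TEMP` points at `<runner root>/_work/_temp` as described above; the function names `runner_root_from_temp` and `clean_diag_pages` are illustrative and not part of the runner.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not part of the runner.

# Derive the runner root from RUNNER_TEMP (<root>/_work/_temp) by going up two levels.
runner_root_from_temp() {
  dirname "$(dirname "$1")"
}

# Recreate _diag/pages under the runner root. RUNNER_TEMP itself is never touched.
clean_diag_pages() {
  local runner_temp="$1"
  local root
  root="$(runner_root_from_temp "$runner_temp")"
  # Guard: refuse to act when the derived root is empty, "/" or ".".
  if [ -z "$root" ] || [ "$root" = "/" ] || [ "$root" = "." ]; then
    echo "refusing to clean: suspicious runner root '$root'" >&2
    return 1
  fi
  rm -rf "$root/_diag/pages"
  mkdir -p "$root/_diag/pages"
}
```

Inside a job this would be called as `clean_diag_pages "$RUNNER_TEMP"`. The guard is an addition for the sketch: an unset `RUNNER_TEMP` makes `dirname` return `.`, and the function then stops instead of deleting a relative `_diag/pages`.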

#### 4. Systemd timer (background cleanup)

Stale diagnostic files are cleaned up automatically every 5 minutes:

```bash
# Deploy
ssh wooo@192.168.0.110
cd /path/to/awoooi/ops/runner
bash deploy-runner-cleanup.sh
```
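
The kind of cleanup the timer runs could look like the sketch below. This is a hypothetical illustration, not the contents of `cleanup-runner-diag.sh`: the function name `clean_stale_diag` and the 30-minute default age are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a stale-file cleanup; the real cleanup-runner-diag.sh may differ.

# Delete regular files under DIR whose modification time is older than MAX_AGE_MIN minutes.
clean_stale_diag() {
  local dir="$1"
  local max_age_min="${2:-30}"   # assumed default, not taken from the source
  [ -d "$dir" ] || return 0      # nothing to do if the directory is missing
  find "$dir" -type f -mmin "+$max_age_min" -delete
}
```

A call such as `clean_stale_diag /home/wooo/actions-runner-awoooi/_diag/pages 30` removes only files older than the threshold, so a log that a job is writing right now is left alone.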

### Files

| File | Purpose |
|------|---------|
| `cleanup-runner-diag.sh` | Cleanup script (installed into the runner directory) |
| `runner-diag-cleanup.service` | Systemd service definition |
| `runner-diag-cleanup.timer` | Systemd timer (every 5 minutes) |
| `deploy-runner-cleanup.sh` | One-step deployment script |

### Monitoring

```bash
# Check the timer status
sudo systemctl status runner-diag-cleanup.timer

# Follow the cleanup logs
journalctl -u runner-diag-cleanup.service -f

# Trigger a cleanup manually
sudo systemctl start runner-diag-cleanup.service
```

### Related documents

- Memory: `feedback_runner_zombie_process.md`
- ADR: to be created (if the problem persists)

## Problem: parallel Docker builds on Gitea act-runner make the job container disappear

### Symptoms

```
Error response from daemon: RWLayer of container <id> is unexpectedly nil
Error response from daemon: No such container: <id>
```
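
### Root cause analysis (2026-04-30)

1. While `Build and Push Web` in the AWOOOI CD was still running the Next.js production build, the `gitea-runner` on host 110 picked up an Actions task from another repo.
2. Both tasks shared the same Docker daemon and the same act-runner container; act-runner's `capacity: 2` allows tasks from different repos to run in parallel.
3. After the second task started, the first AWOOOI job container was removed by Docker/act, and BuildKit subsequently saw only `RWLayer ... unexpectedly nil`.
4. Running `docker build` for the Web image directly on host 110 succeeds, which shows that this is not a build error in the Web code.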

### Layer 1 fix

1. The act-runner on host 110 must run one task at a time:

   ```yaml
   # /home/wooo/act-runner/config.yaml
   runner:
     capacity: 1
   ```

2. The AWOOOI CD workflow needs a global lock on the Docker daemon:

   ```yaml
   - name: Acquire Docker Build Lock
     run: docker network create awoooi-cd-docker-build-lock
   ```

   The implementation uses a Docker network as a host-global lock, because `/tmp/flock` exists only inside the transient job container and cannot take effect across repos or containers.

3. If a job aborts abnormally and leaves the lock behind, the next CD run removes the stale network once the lock is more than 2 hours old.
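
The lock and its stale-lock rule can be sketched as follows. This is an illustration under stated assumptions, not the workflow's actual implementation: recording the creation time in a `created_at` label, the 30-second polling interval, and the helper names are all hypothetical. Docker rejects a second network with the same name on a best-effort basis, which is what makes the network usable as a lock here.

```bash
#!/usr/bin/env bash
# Sketch only: the label name, polling interval, and helper names are assumptions.
LOCK_NAME="awoooi-cd-docker-build-lock"
LOCK_MAX_AGE=$((2 * 60 * 60))   # 2 hours, per the stale-lock rule above

# Pure helper: succeed if a lock created at $1 (epoch seconds) is older than $3 seconds at time $2.
lock_is_stale() {
  local created="$1" now="$2" max_age="$3"
  [ $((now - created)) -gt "$max_age" ]
}

# Block until the lock network can be created; remove it first if it is stale.
acquire_build_lock() {
  local created
  until docker network create --label "created_at=$(date +%s)" "$LOCK_NAME" >/dev/null 2>&1; do
    created="$(docker network inspect -f '{{index .Labels "created_at"}}' "$LOCK_NAME" 2>/dev/null || true)"
    if [ -n "$created" ] && lock_is_stale "$created" "$(date +%s)" "$LOCK_MAX_AGE"; then
      docker network rm "$LOCK_NAME" >/dev/null 2>&1 || true   # drop the stale lock, then retry
    else
      sleep 30   # lock is held (or has no timestamp label); wait and retry
    fi
  done
}

# Release the lock; safe to call even if the network is already gone.
release_build_lock() {
  docker network rm "$LOCK_NAME" >/dev/null 2>&1 || true
}
```

A workflow step would call `acquire_build_lock` before the build and `release_build_lock` in a step that always runs. Note that this sketch waits indefinitely on a lock network that lacks the `created_at` label.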

### Layer 2 fix: host label build/deploy

`capacity: 1` and the Docker network lock prevent parallel runs across repos, but a long-running `docker build` can still cause the transient act job container to disappear while the build is finishing up. Since 2026-04-30, the AWOOOI CD is split into three jobs:

| Job | Runner label | Purpose |
|-----|--------------|---------|
| `tests` | `ubuntu-latest` | API unit + B5 integration tests; still runs in the ci-runner container |
| `build-and-deploy` | `awoooi-host` | Harbor login, API/Web image build/push, GitOps deploy; runs directly on host 110 |
| `post-deploy-checks` | `ubuntu-latest` | Alert chain, monitoring coverage, Playwright smoke |

Host 110 keeps only the host-level `act_runner` daemon, and the same config declares both kinds of label:

```yaml
runner:
  capacity: 1
  shutdown_timeout: 1h
  labels:
    - "ubuntu-latest:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
    - "ubuntu-22.04:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
    - "ubuntu-24.04:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
    - "awoooi-host:host"
```
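
The Docker-wrapped `gitea-runner` container must be disabled so that it cannot use the same config to grab `awoooi-host` jobs, which would make a host job actually run inside the runner container. `scripts/ops/docker-health-monitor.sh` must also exclude `gitea-runner` by default; otherwise the Docker auto-repair that runs every 5 minutes will bring the disabled runner container back up.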
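
The exclusion rule can be sketched as a simple name filter. This is a hypothetical illustration; the real `docker-health-monitor.sh` may be structured differently, and the `EXCLUDE_CONTAINERS` variable and `filter_repairable` function are assumed names.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the real docker-health-monitor.sh may be structured differently.

# Space-separated container names that auto-repair must never restart (assumed variable name).
EXCLUDE_CONTAINERS="${EXCLUDE_CONTAINERS:-gitea-runner}"

# Read container names on stdin and print only those that are not on the exclude list.
filter_repairable() {
  local name skip excluded
  while IFS= read -r name; do
    if [ -z "$name" ]; then
      continue
    fi
    excluded=0
    for skip in $EXCLUDE_CONTAINERS; do
      if [ "$name" = "$skip" ]; then
        excluded=1
      fi
    done
    if [ "$excluded" -eq 0 ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

An auto-repair loop could then feed it stopped containers, for example `docker ps -a --filter status=exited --format '{{.Names}}' | filter_repairable`, and restart only the names that come out.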

### Layer 3 fix: graceful shutdown service

On 2026-05-01 we found that build/deploy had already pushed the GitOps deploy commit and production was `Synced Healthy`, yet the Gitea commit status still showed `build-and-deploy` as a failure. The root cause is that the host-level `act_runner` uses the default `runner.shutdown_timeout: 0s` when it receives a stop signal. The log shows:

```text
runner: wooo-runner shutdown initiated, waiting 0s for running jobs to complete
```

As a result, restarting the daemon immediately cancels a job that is still finishing up, producing an "actually deployed, but status write-back failed" outcome. Host 110 must install the systemd host runner service and pin the shutdown timeout to 1h:

```bash
cd /path/to/awoooi
RESTART_NOW=1 bash ops/runner/install-gitea-host-runner-service.sh
```

The script:

- Updates `shutdown_timeout: 1h` in `/home/wooo/act-runner/config.yaml`
- Installs `/etc/systemd/system/gitea-act-runner-host.service` when passwordless sudo is available
- Falls back to `~/.config/systemd/user/gitea-act-runner-host.service` when sudo is not available
- Disables the restart policy of the Docker-wrapped `gitea-runner` container
- Refuses to restart the runner while a `GITEA-ACTIONS-TASK-*` container is running
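
The last guard in the list above can be sketched like this. It is a minimal illustration, not the code of `install-gitea-host-runner-service.sh`: the function names are hypothetical, and the restart command assumes the system-level service was installed.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative; the install script may implement this differently.

# Read container names on stdin; succeed if any is a Gitea Actions task container.
has_running_task() {
  grep -q '^GITEA-ACTIONS-TASK-'
}

# Restart the host runner only when no task container is running.
safe_restart_runner() {
  if docker ps --format '{{.Names}}' | has_running_task; then
    echo "a GITEA-ACTIONS-TASK-* container is running; not restarting the runner" >&2
    return 1
  fi
  # Assumes the system-level unit; a user-level install would use `systemctl --user` instead.
  sudo systemctl restart gitea-act-runner-host.service
}
```

Keeping the name check separate from the `docker ps` call makes the rule easy to exercise with canned input.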

If the script falls back to the user-level service, check:

```bash
loginctl show-user wooo -p Linger
```

`Linger=no` means the service can keep the runner alive inside the current user manager, but after a host reboot the user service may not start automatically if there is no login session. Either have root run `loginctl enable-linger wooo`, or install the system-level service instead.
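
The linger check can be wrapped in a small script for use in a health check. This is a sketch with hypothetical function names; it relies only on the `Linger=yes` / `Linger=no` output of the `loginctl` command shown above.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative.

# Succeed if the given `loginctl show-user <user> -p Linger` output reports lingering enabled.
linger_enabled() {
  [ "$1" = "Linger=yes" ]
}

# Report the linger state for a user (defaults to wooo) and hint at the fix when disabled.
check_linger() {
  local user="${1:-wooo}"
  if linger_enabled "$(loginctl show-user "$user" -p Linger 2>/dev/null)"; then
    echo "linger enabled for $user"
  else
    echo "linger disabled for $user; run as root: loginctl enable-linger $user" >&2
    return 1
  fi
}
```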

---

Version: v2.0 | Updated: 2026-03-29 | Author: Claude Code

Changes: v1.0→v2.0 sequential builds replace Job Concurrency Groups