# GitHub Actions Runner Stability Fixes

## Problem: `_diag/pages` file conflict

```
Error: The file '/home/wooo/actions-runner-awoooi/_diag/pages/xxx.log' already exists.
```

### Root cause analysis (full diagnosis, 2026-03-29)

1. **When it happens**: during the "Set up job" phase, before any custom step runs.
2. **Cause**: an internal bug in the GitHub Actions Runner.
   - The runner writes `_diag/pages/*.log` while initializing a job.
   - Several jobs running in parallel can write to a file with the same UUID.
   - This happens **before** our cleanup step runs.
3. **Secondary problem**: `RUNNER_TEMP` is shared.
   - `_work/_temp/_runner_file_commands` is shared across all jobs.
   - Cleaning this directory causes "Missing file at path" errors.

### Solution (v4, final, 2026-03-29)

#### 1. Sequential builds (core fix)

```yaml
# build-api must wait for build-web to finish
build-api:
  needs: [detect-changes, build-web]  # key point: depend on build-web
```

**Root cause**: when jobs run in parallel, their "Set up job" phases write to `_runner_file_commands` at the same time and conflict.

**Fix**: run the jobs sequentially so that only one job is on the runner at a time.

#### 2. Workflow concurrency (supporting)

```yaml
concurrency:
  group: cd-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

Ensures that only one run per workflow and ref is active at a time; a newer run cancels the one in progress.

#### 3. Job-level cleanup (defensive)

Clean `_diag/pages` at the start of every job:

```yaml
- name: "Clean Runner Diagnostics"
  run: |
    # RUNNER_TEMP is <runner root>/_work/_temp, so two dirname calls yield the runner root
    RUNNER_ROOT=$(dirname "$(dirname "$RUNNER_TEMP")")
    rm -rf "$RUNNER_ROOT/_diag/pages" .claude/worktrees 2>/dev/null || true
    mkdir -p "$RUNNER_ROOT/_diag/pages" 2>/dev/null || true
```

**Warning**: never clean `$RUNNER_TEMP/*`; doing so breaks `_runner_file_commands`.

#### 4. Systemd timer (background cleanup)

Stale diagnostic files are cleaned automatically every 5 minutes:

```bash
# Deploy (run the remaining commands on 110 after logging in)
ssh wooo@192.168.0.110
cd /path/to/awoooi/ops/runner
bash deploy-runner-cleanup.sh
```

### Files

| File | Purpose |
|------|---------|
| `cleanup-runner-diag.sh` | Cleanup script (installed into the runner directory) |
| `runner-diag-cleanup.service` | Systemd service definition |
| `runner-diag-cleanup.timer` | Systemd timer (every 5 minutes) |
| `deploy-runner-cleanup.sh` | One-step deployment script |

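The contents of `cleanup-runner-diag.sh` are not reproduced in this document. As a rough illustration of what a periodic cleanup of stale diagnostic files could look like, the Python sketch below removes only regular files in a `_diag/pages` directory whose modification time is older than a threshold. The function name, the age-based rule, and the 300-second default are assumptions made for illustration, not the behavior of the actual script.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(pages_dir: Path,
                       max_age_seconds: float = 300.0,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files in pages_dir older than max_age_seconds.

    Returns the sorted names of the files that were removed.
    The threshold and the mtime-based rule are illustrative assumptions.
    """
    now = time.time() if now is None else now
    removed: List[str] = []
    if not pages_dir.is_dir():
        return removed
    for entry in pages_dir.iterdir():
        # Only plain files are candidates; directories and symlinks are left alone.
        if entry.is_symlink() or not entry.is_file():
            continue
        if now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink(missing_ok=True)
            removed.append(entry.name)
    return sorted(removed)
```

Keeping files younger than the threshold is one way to avoid deleting a log that a job in its "Set up job" phase has just written.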
### Monitoring

```bash
# Check the timer status
sudo systemctl status runner-diag-cleanup.timer

# Follow the cleanup log
journalctl -u runner-diag-cleanup.service -f

# Trigger a cleanup manually
sudo systemctl start runner-diag-cleanup.service
```

### Related documents

- Memory: `feedback_runner_zombie_process.md`
- ADR: to be created (if the problem persists)

## Problem: parallel Docker builds on Gitea act-runner make the job container disappear

### Symptoms

```
Error response from daemon: RWLayer of container <id> is unexpectedly nil
Error response from daemon: No such container: <id>
```

### Root cause analysis (2026-04-30)

1. While the AWOOOI CD `Build and Push Web` step was still running the Next.js production build, `gitea-runner` on 110 picked up an Actions task from another repo.
2. Both tasks shared the same Docker daemon and the same act-runner container; act-runner `capacity: 2` allowed cross-repo parallelism.
3. After the second task started, the first AWOOOI job container was removed by Docker/act, and BuildKit subsequently saw only `RWLayer ... unexpectedly nil`.
4. Running `docker build` for the Web image directly on the 110 host succeeds, which shows that the failure is not a build error in the Web code.

### Layer 1 fix

1. The act-runner on 110 must process one job at a time:

```yaml
# /home/wooo/act-runner/config.yaml
runner:
  capacity: 1
```

2. The AWOOOI CD workflow needs a global lock on the Docker daemon:

```yaml
- name: Acquire Docker Build Lock
  run: docker network create awoooi-cd-docker-build-lock
```

The implementation uses a Docker network as a host-global lock, because `/tmp/flock` exists only inside the transient job container and cannot take effect across repos or containers.

3. If a job aborts abnormally and leaves the lock behind, the next CD run removes the stale network once the lock is more than 2 hours old.

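The lock works because `docker network create` fails when a network with that name already exists, so creation doubles as an atomic test-and-set. The sketch below illustrates the acquire and stale-detection logic in Python. The lock name and the 2-hour threshold come from this document; reading the creation time from the JSON output of `docker network inspect`, and all function names, are assumptions, since the workflow's actual stale-removal code is not shown here.

```python
import json
import re
import subprocess
from datetime import datetime, timedelta
from typing import Optional

LOCK_NAME = "awoooi-cd-docker-build-lock"  # network name used by the workflow step
STALE_AFTER = timedelta(hours=2)           # stale threshold stated in this document


def parse_docker_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, trimming nanoseconds to microseconds."""
    match = re.fullmatch(
        r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})",
        value.strip(),
    )
    if match is None:
        raise ValueError(f"unrecognized timestamp: {value!r}")
    base, fraction, offset = match.groups()
    micros = (fraction or "0")[:6].ljust(6, "0")
    offset = "+00:00" if offset == "Z" else offset
    return datetime.fromisoformat(f"{base}.{micros}{offset}")


def lock_is_stale(created: datetime, now: datetime) -> bool:
    """A lock older than STALE_AFTER is treated as abandoned."""
    return now - created > STALE_AFTER


def try_acquire() -> bool:
    """Take the lock; the command exits non-zero if the network already exists."""
    result = subprocess.run(["docker", "network", "create", LOCK_NAME],
                            capture_output=True, text=True)
    return result.returncode == 0


def lock_created_at() -> Optional[datetime]:
    """Return the lock network's creation time, or None if it does not exist."""
    result = subprocess.run(["docker", "network", "inspect", LOCK_NAME],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None
    # `docker network inspect` prints a JSON array with one object per network.
    return parse_docker_time(json.loads(result.stdout)[0]["Created"])
```

A caller would try `try_acquire()` first and only consult `lock_created_at()` and `lock_is_stale()` when acquisition fails, so a healthy lock held by a running job is never removed.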
### Layer 2 fix: host label build/deploy

`capacity: 1` and the Docker network lock prevent cross-repo parallelism, but a long-running `docker build` can still cause the transient act job container to disappear as the build wraps up. Since 2026-04-30, AWOOOI CD is split into three stages:

| Job | Runner label | Purpose |
|-----|--------------|---------|
| `tests` | `awoooi-host` | API unit + B5 integration tests, run directly on the 110 host runner |
| `build-and-deploy` | `awoooi-host` | Harbor login, API/Web image build/push, GitOps deploy, run directly on the 110 host |
| `post-deploy-checks` | `awoooi-host` | Alert chain, monitoring coverage, Playwright smoke |

110 keeps only the host-level `act_runner` daemon, and the same config declares two kinds of labels:

```yaml
runner:
  capacity: 1
  shutdown_timeout: 1h
  labels:
    - "awoooi-ubuntu:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04"
    - "awoooi-host:host"
```

The Docker-wrapped `gitea-runner` container must be disabled so that it cannot use the same config to grab `awoooi-host` jobs, which would make a host job actually run inside the runner container. `scripts/ops/docker-health-monitor.sh` must also exclude `gitea-runner` by default; otherwise the Docker auto-repair that runs every 5 minutes brings the disabled runner container back up.

### Layer 3 fix: graceful shutdown service

On 2026-05-01, build/deploy had already pushed the GitOps deploy commit and production was `Synced Healthy`, yet the Gitea commit status still showed `build-and-deploy` as a failure. The root cause: when the host-level `act_runner` received a stop signal, it used the default `runner.shutdown_timeout: 0s`, and the log shows:

```text
runner: wooo-runner shutdown initiated, waiting 0s for running jobs to complete
```

A daemon restart therefore cancels jobs that are still wrapping up, producing the state "actually deployed, status write-back failed". 110 must install the systemd host runner service and pin the shutdown timeout to 1h:

```bash
cd /path/to/awoooi
RESTART_NOW=1 bash ops/runner/install-gitea-host-runner-service.sh
```

The script:

- updates `shutdown_timeout: 1h` in `/home/wooo/act-runner/config.yaml`
- installs `/etc/systemd/system/gitea-act-runner-host.service` when passwordless sudo is available
- falls back to `~/.config/systemd/user/gitea-act-runner-host.service` when sudo is not available
- disables the restart policy of the Docker-wrapped `gitea-runner` container
- refuses to restart the runner while a `GITEA-ACTIONS-TASK-*` container is running

If it falls back to the user-level service, check:

```bash
loginctl show-user wooo -p Linger
```

`Linger=no` means the service can keep the runner alive within the current user manager, but after a host reboot the user service may not start automatically unless there is a login session. Either have root run `loginctl enable-linger wooo`, or install the system-level service instead.

### Layer 4 fix: host Web build pressure gate

On 2026-05-21 a CD preflight layer was added: `.gitea/workflows/cd.yaml` calls `scripts/ci/wait-host-web-build-pressure.sh` after the Harbor login and before taking the Docker build lock.

Background: the AWOOOI workflow concurrency setting and the Docker network lock only protect AWOOOI's own runs and Docker build/push. Other repos can still run `next build` / `turbo build` / `vite build` directly on the same 110 host runner. Such host-side builds do not take AWOOOI's Docker lock, so they stack on top of the Next production build inside the AWOOOI Web image and cause high load on 110, Gitea API timeouts, Actions `context canceled` errors, or distorted post-deploy observations.

Gate behavior:

- It only reads `ps`; it never kills, renices, or resets any external process.
- It excludes AWOOOI's own checkout, local worktrees, and `/app/apps/web` processes inside the Web Docker build, so that it does not misjudge its own deployment.
- By default it waits at most 60 rounds of 10 seconds each; if external build / smoke / CI pressure is still present, it hard fails rather than stacking a new browser smoke onto the production host.
- It lets the run through with a warning only when `HOST_WEB_BUILD_PRESSURE_WARN_ONLY=1` is set explicitly; use this only for controlled reruns where the pressure source has been confirmed acceptable.

The long-term direction remains runner isolation or build offload. Until the shared runner is split, this gate is a conservative safeguard that keeps heavy frontend builds from trampling one another.

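The gate behavior described above can be sketched in Python as a read-only polling loop. The build command names, the `/app/apps/web` exclusion, the 60 × 10 second budget, and the `HOST_WEB_BUILD_PRESSURE_WARN_ONLY` variable come from this document; the function names, the substring-matching rule, and the use of `ps -eo args=` are assumptions, and the real `wait-host-web-build-pressure.sh` may well detect pressure differently.

```python
import os
import subprocess
import time
from typing import Callable, Iterable, List, Sequence

HEAVY_MARKERS = ("next build", "turbo build", "vite build")  # commands named in this document
OWN_MARKERS = ("/app/apps/web",)  # own Web Docker build; own checkout paths would be added here


def external_builds(ps_lines: Iterable[str],
                    own_markers: Sequence[str] = OWN_MARKERS) -> List[str]:
    """Return command lines that look like heavy builds not owned by this deployment."""
    hits = []
    for line in ps_lines:
        if not any(marker in line for marker in HEAVY_MARKERS):
            continue
        if any(marker in line for marker in own_markers):
            continue  # our own build must not block itself
        hits.append(line.strip())
    return hits


def read_ps() -> List[str]:
    """Read-only process snapshot: one command line per process, no header."""
    out = subprocess.run(["ps", "-eo", "args="], capture_output=True, text=True)
    return out.stdout.splitlines()


def wait_for_quiet(sample: Callable[[], Iterable[str]] = read_ps,
                   attempts: int = 60,
                   interval: float = 10.0,
                   sleep: Callable[[float], None] = time.sleep,
                   warn_only: bool = False) -> bool:
    """Poll until no external build is seen; True means the CD may proceed."""
    for attempt in range(attempts):
        if not external_builds(sample()):
            return True
        if attempt < attempts - 1:
            sleep(interval)
    # Pressure persisted for the whole budget: fail unless warn-only was requested.
    return warn_only


if __name__ == "__main__":
    warn = os.environ.get("HOST_WEB_BUILD_PRESSURE_WARN_ONLY") == "1"
    raise SystemExit(0 if wait_for_quiet(warn_only=warn) else 1)
```

Injecting `sample` and `sleep` keeps the loop testable without touching real processes, which matches the gate's read-only intent.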
### Layer 4 addendum: startup does not restart the Gitea runner automatically

After the 110 CPU incident was contained on 2026-06-27, `gitea-act-runner-host.service` is kept inactive on purpose to reduce load. `scripts/reboot-recovery/awoooi-startup-110.sh` can still correct the runner's `shutdown_timeout` and labels and still disables the legacy Docker runner, but by default it does not start the host runner. Startup brings the runner up only when the following switch is set explicitly:

```bash
AWOOOI_START_GITEA_RUNNER_ON_BOOT=1 /usr/local/bin/awoooi-startup-110.sh
```

Do not add this switch to the systemd environment until runner throttling / migration is complete.

### Layer 5 fix: legacy Docker runner drain

On 2026-05-21 it was confirmed again that two runners coexist on 110:

- the host-level `gitea-act-runner-host.service`
- the Docker-wrapped `gitea-runner`

Both use the same labels/config and both pick up jobs from repos such as `awoooi`, `stockplatform-v2`, and `ewoooc`. This blurs runner ownership for AWOOOI CD and makes the load on the shared Docker daemon unpredictable.

The correct handling is not to run `docker stop gitea-runner` while a task container is running. `ops/runner/install-gitea-host-runner-service.sh` now follows a drain procedure:

1. `docker update --restart=no gitea-runner`
2. If no `GITEA-ACTIONS-TASK-*` container exists, stop the container with a long timeout.
3. If a `GITEA-ACTIONS-TASK-*` container is still present, send `SIGINT` to `gitea-runner`.
4. Following `shutdown_timeout: 1h`, act-runner stops accepting new jobs and waits for its in-flight jobs to finish.

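The branching in the drain procedure can be made explicit with a small planning function. This is only a sketch of the decision, not the installer's code: the function name is hypothetical, the 3600-second stop timeout is an assumed value (the document only says "long timeout"), and delivering `SIGINT` through `docker kill --signal` is one possible mechanism.

```python
from typing import List

RUNNER = "gitea-runner"
TASK_PREFIX = "GITEA-ACTIONS-TASK-"
STOP_TIMEOUT_SECONDS = 3600  # assumed value; the document only says "long timeout"


def drain_plan(running_container_names: List[str]) -> List[List[str]]:
    """Return the docker commands for draining the legacy runner, in order."""
    # Step 1 always applies: make sure Docker will not restart the runner.
    plan = [["docker", "update", "--restart=no", RUNNER]]
    has_tasks = any(name.startswith(TASK_PREFIX) for name in running_container_names)
    if has_tasks:
        # Jobs are in flight: ask act-runner to shut down gracefully instead of stopping it.
        plan.append(["docker", "kill", "--signal=SIGINT", RUNNER])
    else:
        # Nothing is running: a plain stop with a generous timeout is enough.
        plan.append(["docker", "stop", "-t", str(STOP_TIMEOUT_SECONDS), RUNNER])
    return plan
```

The list of running names could come from `docker ps --format '{{.Names}}'`; returning the plan instead of executing it makes the decision easy to review before anything touches the live runner.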
On-site checks:

```bash
docker inspect gitea-runner --format 'Restart={{.HostConfig.RestartPolicy.Name}} Status={{.State.Status}}'
docker ps --format '{{.Names}}' | grep '^GITEA-ACTIONS-TASK-' || true
docker logs --since 10m gitea-runner
```

Target state:

```text
Restart=no Status=exited
```

### Layer 6 fix: shared runner label inventory

On 2026-05-21, T139 wrote CI/CD stage transitions back to AwoooP, but it also exposed the next infrastructure problem: the single user-level `gitea-act-runner-host.service` on 110 declares labels for AWOOOI and for other repos at the same time. Even with the Docker-wrapped `gitea-runner` disabled, the `capacity: 1` host runner still queues jobs across repos such as `awoooi`, `ewoooc`, and `stockplatform-v2`, which makes AWOOOI `post-deploy-checks` look like a stuck deployment.

This layer starts with a read-only inventory and does not change live labels directly:

```bash
# Run locally on 110
bash ops/runner/audit-runner-pool.sh

# Or run from a workstation over SSH
ssh 192.168.0.110 'TASK_LOG_LINES=20 bash -s' < ops/runner/audit-runner-pool.sh
```

The script outputs:

- The active state, substate, main PID, and restart count of `gitea-act-runner-host.service`.
- The `capacity`, `timeout`, `shutdown_timeout`, and labels in `/home/wooo/act-runner/config.yaml`.
- Labels that are neither AWOOOI nor shared CI labels, for example `ewoooc-host`, listed as `foreign_or_cross_repo`.
- Whether the Docker-wrapped `gitea-runner` is still `Restart=no Status=exited`.
- Whether any active `GITEA-ACTIONS-TASK-*` containers exist.
- The per-repo task counts in the runner journal for the last 2 hours, as live evidence before label isolation.

T140 live evidence summary (2026-05-24 09:45, Taipei time):

```text
service=gitea-act-runner-host.service active/running, NRestarts=0
runner.capacity=1
runner.shutdown_timeout=1h
docker gitea-runner: Restart=no Status=exited Running=false
active_action_containers=none
foreign_labels=ewoooc-host:docker://192.168.0.110:5000/awoooi/ci-runner:act-22.04
recent 2h repo counts: none
```

Assessment:

- Do not use `capacity: 2` as the fix, because the earlier `RWLayer unexpectedly nil` / `context canceled` failures were caused precisely by cross-repo parallelism and Docker daemon pressure.
- The next step is to read the labels each repo's workflows actually use, then plan repo label isolation or separate runner registrations; the live `ewoooc-host` label must not be removed before a replacement runner exists.

### Layer 7 fix: workflow label matrix

The runner config only shows which labels this runner is willing to accept; it cannot answer which repos actually use them. T141 adds a workflow label inventory tool:

```bash
ops/runner/audit-workflow-labels.py \
  --local-repo wooo/stockplatform-v2=/Users/ogt/stockplatform-v2
```

The tool reads `runs-on` from `.gitea/workflows/*.yml` / `.yaml` through the Gitea API. When Gitea is not readable, a local fallback can be specified. The Gitea token is resolved only from the environment or from the current repo's `gitea` remote and is never printed.

T141 evidence summary (2026-05-24, Taipei time):

```text
wooo/awoooi:
  awoooi-host: cd.yaml tests / build-and-deploy / post-deploy-checks
  ubuntu-latest: code-review, e2e-health, deploy-alerts, cd-dev, ansible-lint, type-sync, run-migration

wooo/ewoooc:
  ewoooc-host: cd.yaml deploy

wooo/stockplatform-v2:
  ubuntu-latest: ci.yaml hygiene / frontend
```

Risk assessment:

- `awoooi-host` is already a label dedicated to AWOOOI CD, but the same runner service still declares `ewoooc-host` and `ubuntu-latest` as well, so the runner queue is still shared.
- `ubuntu-latest` is the main shared entry point; AWOOOI code-review / e2e-health and stockplatform-v2 CI can still queue behind each other.
- Real isolation requires a new runner registration / service split, or moving the non-AWOOOI repos to another runner. Adding more labels to the same runner config is not enough, because `capacity: 1` still means a single queue.

### Layer 8 fix: runner isolation readiness

T142 adds a live readiness gate that answers whether the runner can be split safely right now:

```bash
ssh 192.168.0.110 'bash -s' < ops/runner/check-runner-isolation-readiness.sh
```

T142 live evidence summary (2026-05-24 09:54, Taipei time):

```text
primary_service=gitea-act-runner-host.service scope=user LoadState=loaded ActiveState=active SubState=running
primary_runner_dir=/home/wooo/act-runner
primary_registration_file=present
primary labels:
  ubuntu-latest / ubuntu-22.04 / ubuntu-24.04 -> shared_queue
  awoooi-host -> awoooi_dedicated
  ewoooc-host -> foreign_dedicated
mixed_owner_classes=1
split_dir=/home/wooo/act-runner-awoooi status=missing
split_dir=/home/wooo/act-runner-shared status=missing
split_dir=/home/wooo/act-runner-ewoooc status=missing
installed_split_services=0/3
active_action_containers=time-sensitive field; none at the initial 09:54 check, but the pre-commit recheck saw GITEA-ACTIONS-TASK-3435_WORKFLOW-ci_JOB-frontend
isolation_ready=false
blocker=single_registered_runner_with_mixed_owner_labels
safe_next_step=register_additional_runner_dirs_before_removing_live_labels
```

This means `ewoooc-host` and `ubuntu-latest` **cannot** simply be deleted at this point. The correct next step is to prepare new runner registrations / services first:

1. `act-runner-awoooi`: takes over `awoooi-host`, with priority on protecting production CD.
2. `act-runner-shared`: takes over `ubuntu-latest`, for code-review / health / stockplatform-v2 CI.
3. `act-runner-ewoooc`: takes over `ewoooc-host`, so that EwoooC CD no longer blocks behind AWOOOI.

Only after the smoke tests for all three split runners pass should the primary runner be drained and the mixed labels removed.

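The `mixed_owner_classes` signal in the evidence above comes down to classifying each declared label by owner and checking whether one registration serves more than one class. A minimal Python sketch of that classification follows. The class names are taken from the evidence output; the prefix-based rules and function names are assumptions, and `check-runner-isolation-readiness.sh` may classify labels differently.

```python
from typing import Iterable, Set


def label_name(label: str) -> str:
    """Drop the schema suffix, e.g. 'awoooi-host:host' becomes 'awoooi-host'."""
    return label.split(":", 1)[0]


def owner_class(name: str) -> str:
    """Map a label name to an owner class (the prefix rules are assumptions)."""
    if name.startswith("awoooi-"):
        return "awoooi_dedicated"
    if name.startswith("ubuntu-"):
        return "shared_queue"
    return "foreign_dedicated"


def owner_classes(labels: Iterable[str]) -> Set[str]:
    """Collect the distinct owner classes declared by one runner registration."""
    return {owner_class(label_name(label)) for label in labels}


def has_mixed_owners(labels: Iterable[str]) -> bool:
    """True when a single registration serves more than one owner class."""
    return len(owner_classes(labels)) > 1
```

With the primary labels listed in the T142 evidence, `has_mixed_owners` is true, which corresponds to the `single_registered_runner_with_mixed_owner_labels` blocker; a registration that declares only `awoooi-*` labels would not be mixed.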
2026-06-27 live update: `gitea-act-runner-host.service` on 110 has been deliberately left `inactive`; the labels in `/home/wooo/act-runner/config.yaml` have been narrowed to `awoooi-ubuntu` and `awoooi-host`, and capacity is still `1`. This is a load-reduction and label-isolation state. AWOOOI workflows should likewise use only `awoooi-ubuntu` or `awoooi-host` and must no longer use generic labels such as `ubuntu-latest` / `self-hosted`. This does not mean the runner migration is complete, nor that the runner can simply be restarted.

2026-06-28 live update: the 110 runner pressure incident confirmed an orphan daemon that invoked `/home/wooo/act-runner/act_runner.real-20260628-runner-pressure-guard` directly and that had bypassed the existing Gitea service names through the transient `awoooi-direct-runner-open.service`. The four live runner entry points have been replaced with immutable fail-closed stubs; the original ELF binaries are only quarantined and their contents are not read. The related systemd units stay inactive / masked:

- `/home/wooo/act-runner/act_runner`
- `/home/wooo/act-runner/act_runner.real-20260628-runner-pressure-guard`
- `/home/wooo/act-runner-controlled/act_runner`
- `/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner`

Legacy unit names that must also stay fail-closed (the Gitea and direct runners stay masked):

- `awoooi-direct-runner-open.service`
- `awoooi-direct-runner.service`
- `gitea-act-runner-host.service`
- `gitea-act-runner-awoooi-controlled.service`
- `gitea-awoooi-controlled-runner.service`
- `gitea-act-runner-awoooi-open.service`

`awoooi-cd-lane.service` is the dedicated controlled lane and is not part of the legacy runner mask list. It may be restored under control only when all of the following hold:

- `/run/awoooi-cd-lane-enabled` exists or `AWOOOI_START_CONTROLLED_CD_LANE=1` is set.
- `capacity=1`.
- Labels are limited to `awoooi-ubuntu` / `awoooi-host`, with no generic heavy labels such as `ubuntu-latest`, StockPlatform, headless, or Playwright.
- systemd CPU / memory / tasks limits are applied, root restore-source left is `0`, and the post-apply verifier can read back `CD_LANE_CONTROLLED ok=1`.

When these conditions are not met, cd-lane should return to the static `/bin/false` unit and shell stub.

Until runner migration, throttling, and smoke scheduling are complete, do not lift the legacy mask, restore generic runner labels, or change the host pressure gate default to warn-only.

2026-06-28 controlled update: the old manual-only / freeze guard has been changed to a split assessment. Legacy runners remain masked / fail-closed. The dedicated `awoooi-cd-lane.service` and `awoooi-cd-lane-drain.service` may serve as the AWOOOI-dedicated controlled deployment lane as long as they pass the capacity, label, binary, process, and systemd limit checks, root restore-source left `0`, and the post-apply verifier.

If the verifier fails, roll back to the inactive / masked / fail-closed stub. If the verifier passes, the generic runner fail-closed rule must no longer be used to kill the controlled lane. Legacy / generic runners still must not be unmasked or given generic labels again, and no workflow under `.gitea/workflows` that targets `awoooi-ubuntu` / `awoooi-host` may keep the automatic events `push`, `pull_request`, or `pull_request_target`.

### Layer 9 fix: workflow pressure source guard

On 2026-06-28 a source guard was added:

```bash
python3 ops/runner/guard-gitea-runner-pressure.py --root .
```

The guard only reads `.gitea/workflows/*.yml` / `.yaml` inside the repo and blocks two kinds of regression:

1. `push` / `pull_request` / `pull_request_target` automatic events that target `awoooi-ubuntu` or `awoooi-host`.
2. A Gitea workflow that brings back the generic labels `ubuntu-latest`, `ubuntu-*`, or `self-hosted`.

`scripts/ops/ansible-validate.sh` runs the same guard. To restore automatic events, there must first be a source-of-truth diff, a rollback, and a post-apply verifier for the runner migration or for non-110 hard limits. `cd.yaml` / `code-review.yaml` must not stay `workflow_dispatch`-only long term because of a non-incident-level guard; before automatic events are restored, the runner migration or non-110 hard-limit verifier must pass.

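The two regression rules can be expressed as a check over an already-parsed workflow mapping. The Python sketch below is an illustration of the rules only, not the code of `guard-gitea-runner-pressure.py`: the function names and message format are hypothetical, and YAML parsing is left out on purpose. One detail worth knowing if PyYAML is used for parsing: under YAML 1.1 rules it loads the bare key `on` as the boolean `True`, so the sketch looks up both keys.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List

AUTO_EVENTS = {"push", "pull_request", "pull_request_target"}
PROTECTED_LABELS = {"awoooi-ubuntu", "awoooi-host"}
GENERIC_PATTERNS = ("ubuntu-latest", "ubuntu-*", "self-hosted")


def _as_list(value: Any) -> List[Any]:
    """Normalize a scalar, list, or mapping (its keys) into a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def violations(workflow: Dict[Any, Any]) -> List[str]:
    """Return one message per rule violation found in a parsed workflow."""
    # PyYAML (YAML 1.1) turns the key `on` into True, so accept either key.
    triggers = workflow.get("on", workflow.get(True))
    auto = sorted(AUTO_EVENTS & {str(e) for e in _as_list(triggers)})
    found = []
    for job_id, job in (workflow.get("jobs") or {}).items():
        runs_on = job.get("runs-on")
        labels = runs_on if isinstance(runs_on, (str, list)) else None
        for label in _as_list(labels):
            if auto and label in PROTECTED_LABELS:
                found.append(f"{job_id}: automatic events {auto} target {label}")
            if any(fnmatchcase(label, pattern) for pattern in GENERIC_PATTERNS):
                found.append(f"{job_id}: generic label {label}")
    return found
```

Because `fnmatchcase` matches the whole label, `awoooi-ubuntu` does not match `ubuntu-*`, so the dedicated label is not mistaken for a generic one.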
### Layer 10 fix: non-110 runner readiness verifier

On 2026-06-28 a readiness gate for AWOOOI non-110 / hard-limited runners was added:

```bash
ssh 192.168.0.188 'TARGET_HOST_IP=192.168.0.188 bash -s' \
  < ops/runner/check-awoooi-non110-runner-readiness.sh
```

The verifier reads metadata only. It does not read `.runner`, the runner token, secrets, or raw config values; it outputs only the host IP, runner `capacity`, labels, systemd limits, rollback unit, active Actions containers, heavy build/smoke processes, and load/core. Before `push` / `pull_request` automatic events are restored, the target host must read back:

```text
AWOOOI_NON110_RUNNER_READY=1
```

Source-of-truth templates:

- `ops/runner/awoooi-non110-runner.service.example`
- `ops/runner/awoooi-non110-runner-rollback.service.example`
- `ops/runner/awoooi-non110-runner.user.service.example`
- `ops/runner/awoooi-non110-runner-rollback.user.service.example`
- `ops/runner/install-awoooi-non110-runner-user-service.sh`
- `ops/runner/check-awoooi-non110-runner-readiness.sh`

User-level installation on 188 allows only a non-secret apply; by default it neither registers nor starts the runner:

```bash
ssh ollama@192.168.0.188 'bash -s -- --apply' \
  < ops/runner/install-awoooi-non110-runner-user-service.sh
ssh ollama@192.168.0.188 'bash -s' \
  < ops/runner/check-awoooi-non110-runner-readiness.sh
```

`--enable` may be used only when `AWOOOI_NON110_ENABLE=1` is set, `act_runner` is executable, `config.yaml` is present, `.runner` is present, and the verifier has shown that the service's target / limits are correct. The installer does not read the contents of `.runner` and does not write a runner token; rollback only stops / disables the user service and removes the `.awoooi-non110-runner-enabled` sentinel. If the host binary / config has been moved to a disabled path or deleted by an external process, the installer can, during `--apply`, restore the binary from the disabled `act_runner` metadata path on the same machine and generate a token-free config with `capacity=1` and `awoooi-non110-*` labels; it still does not copy `.runner` or read the registration contents.

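The `--enable` preconditions lend themselves to a checklist function that reports every unmet condition and never opens the registration file. The Python sketch below is an assumption-laden illustration, not the installer's logic: the function name, the message wording, and passing the verifier result in as a boolean are all hypothetical.

```python
import os
from pathlib import Path
from typing import List, Mapping


def enable_blockers(runner_dir: Path,
                    env: Mapping[str, str],
                    verifier_ok: bool) -> List[str]:
    """Return the unmet preconditions for --enable; an empty list means go."""
    blockers = []
    if env.get("AWOOOI_NON110_ENABLE") != "1":
        blockers.append("AWOOOI_NON110_ENABLE is not set to 1")
    binary = runner_dir / "act_runner"
    if not (binary.is_file() and os.access(binary, os.X_OK)):
        blockers.append("act_runner is missing or not executable")
    if not (runner_dir / "config.yaml").is_file():
        blockers.append("config.yaml is missing")
    # Existence check only: the registration file is never opened or read.
    if not (runner_dir / ".runner").is_file():
        blockers.append(".runner is missing")
    if not verifier_ok:
        blockers.append("verifier has not confirmed service target / limits")
    return blockers
```

Returning all blockers at once, rather than stopping at the first, gives an operator the full picture in a single run.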
2026-06-28 20:32 live readback: the 188 host is reachable, the `ollama` user has `Linger=yes`, `/home/ollama/act-runner-awoooi/act_runner` is executable, `config.yaml` is present, action containers are `0`, heavy processes are `0`, and load/core is about `0.143333`. The verifier now flags the old user service as a target mismatch, because its `ExecStart` still points at the Docker image `gitea/act_runner:latest` and `/home/ollama/awoooi-non110-runner` rather than the verified `/home/ollama/act-runner-awoooi`. After the installer is applied in this round, `AWOOOI_NON110_RUNNER_READY=0` must still hold until the `.runner` registration metadata has been safely provisioned and the runner is explicitly enabled.

---

Version: v2.0 | Updated: 2026-03-29 | Author: Claude Code

Changes: v1.0→v2.0 sequential builds replace job concurrency groups