Compare commits

..

5 Commits

Author SHA1 Message Date
Your Name
c43ae67ff8 docs(logbook): record wazuh accepted production readback [skip ci] 2026-06-28 09:52:18 +08:00
Your Name
00db624e5f fix(reboot): fail closed direct cd lane pressure path [skip ci] 2026-06-28 09:46:46 +08:00
Your Name
3359268ec0 chore(cd): deploy iwooos web status fix [skip ci] 2026-06-28 09:44:40 +08:00
Your Name
b204840841 fix(runner): keep controlled cd lane open
Some checks failed
Code Review / ai-code-review (push) Successful in 25s
Ansible / Reboot Recovery Contract / validate (push) Failing after 12m24s
2026-06-28 09:35:20 +08:00
Your Name
8f402983ee fix(reboot): enforce direct runner fail-closed guard [skip ci] 2026-06-28 09:16:43 +08:00
9 changed files with 294 additions and 16 deletions

View File

@@ -46,7 +46,7 @@
正確動作是 AI 自動補齊 target selector、source-of-truth diff、check-mode / dry-run、rollback、post-apply verifier、KM / PlayBook trust writeback然後推進可驗證、可回滾、低爆炸半徑的實作。
**110 runner 壓力事故例外**Gitea / act-runner / direct transient runner 對 110 造成 CPU / headless smoke 壓力時,屬事故級容量保護,不得用「全面授權」直接重開 runner、移除 mask、還原 runner binary、用 `systemd-run` 直啟 `.real` binary或把 host pressure gate 改成 warn-only。正確動作是先做 runner 搬遷 / 限流 / label isolation / smoke 排程,再以 check-mode、rollback 與 post-apply verifier 受控恢復。
**110 runner / direct CD lane 壓力事故例外**Gitea / act-runner / direct transient runner / direct CD lane 對 110 造成 CPU / headless smoke / Docker build 壓力時,屬事故級容量保護,不得用「全面授權」直接重開 runner、移除 mask、還原 runner / cd-lane binary、用 `systemd-run` 直啟 `.real` binary或把 host pressure gate 改成 warn-only。正確動作是先做 runner / CD lane 搬遷、限流、label isolationsmoke 排程,再以 check-mode、rollback 與 post-apply verifier 受控恢復。
---

View File

@@ -287,17 +287,17 @@ OpenClaw 核心替換、仲裁模型升級、SDK / runtime 新依賴正式引入
force push / 刪 repo / 刪 refs / 改 repo visibility / raw runtime secret volume 讀寫
```
### 110 runner 壓力事故例外
### 110 runner / direct CD lane 壓力事故例外
2026-06-28 事故後110 上的 Gitea / act-runner / direct transient runner、StockPlatform headless smoke、host-side Next build 與 Docker / BuildKit 壓力屬容量事故保護面。即使收到「批准 / 繼續 / 全面授權」,也不得直接重開 runner、解除 service mask、還原 live runner binary、用 `systemd-run` 直啟 `.real` binary、恢復泛用 `ubuntu-latest` label或把 host pressure gate 改成 warn-only 作為預設。
2026-06-28 事故後110 上的 Gitea / act-runner / direct transient runner / direct CD lane、StockPlatform headless smoke、host-side Next build 與 Docker / BuildKit 壓力屬容量事故保護面。即使收到「批准 / 繼續 / 全面授權」,也不得直接重開 runner、解除 service mask、還原 live runner / cd-lane binary、用 `systemd-run` 直啟 `.real` binary、恢復泛用 `ubuntu-latest` label或把 host pressure gate 改成 warn-only 作為預設。
允許的 controlled apply 是降壓與防再發:停止 / disable / mask runner、mask direct transient unit、quarantine runner binary、收斂 labels、補 source fail-closed guard、搬遷 runner、限制 concurrency、把 smoke 改成排程 / 非 110 runner以及執行只讀 pressure / cold-start verifier。
允許的 controlled apply 是降壓與防再發:停止 / disable / mask runner、mask direct transient / direct CD lane unit、quarantine runner / cd-lane binary、收斂 labels、補 source fail-closed guard、搬遷 runner / CD lane、限制 concurrency、把 smoke 改成排程 / 非 110 runner以及執行只讀 pressure / cold-start verifier。
恢復 runner 必須同時具備:
1. target selector明確列出 service、runner dir、label 與承接 repo。
2. source-of-truth diffrepo / unit / startup script / runner config 都有一致變更。
3. 限流或搬遷:不再由 110 production host 承接泛用重型 build / smoke。
3. 限流或搬遷:不再由 110 production host 承接泛用或 direct lane 重型 build / smoke。
4. rollback能回到 inactive / masked / fail-closed stub。
5. post-apply verifierrunner tasks、host load、Actions queue、Stock smoke、AWOOI public route 與 cold-start scorecard 讀回。

View File

@@ -20,6 +20,29 @@
- OpenClaw 仍維持 production decision core替換前必須 replay / shadow / canary / ADR。
- SDK install、API shadow / canary、production route、paid provider / cost route、external active security scan、secret value / credential URL / raw env、DB destructive / backup restore、force push / repo refs deletion 仍不得被本段 controlled queue 直接打開。
## 2026-06-28 — 09:16 direct runner source guard 實作收斂
**背景**09:00 前的 live hotfix 已把 110 上 direct / Gitea runner 全部 mask`awoooi-startup-110.sh`、cold-start 與 P3 release gate 還沒有把 `awoooi-direct-runner-open.service` 這條 transient direct runner 路徑納入 source-level guard。
**完成內容**
- `scripts/reboot-recovery/awoooi-startup-110.sh` 新增 `RUNNER_FAIL_CLOSED_SERVICES``RUNNER_FAIL_CLOSED_BINARY_PATHS`,預設未同時具備 `AWOOOI_START_GITEA_RUNNER_ON_BOOT=1``/run/awoooi-runner-host-enabled` 時,會強制 kill / disable / mask direct runner 與 Gitea runner units並把 live runner ELF quarantine 成 163-byte fail-closed stub。
- `scripts/reboot-recovery/full-stack-cold-start-check.sh` 新增 110 runner fail-closed readbackdirect / Gitea units 必須 `load=masked unitfile=masked active=inactive`direct runner process count 必須 `0`runner binary 不得是 ELF。
- `scripts/reboot-recovery/post-start-quick-check.sh` 新增 `110 runner fail-closed guard` section並以 `HOST_WEB_BUILD_PRESSURE_ATTEMPTS=1` 讀回 pressure gate。
- `scripts/reboot-recovery/p3-controlled-release-gate.sh` 將 direct runner fail-closed 狀態納入 `BAD_RUNNER_GUARDRAILS`,避免 P3 release gate 只看 `actions.runner.*` 而漏掉 transient direct runner。
- Live `/usr/local/bin/awoooi-startup-110.sh` 已更新並加 immutable讀回 `LIVE_STARTUP_DIRECT_UNIT=1``LIVE_STARTUP_GUARD_FUNC=2``LIVE_STARTUP_DEFAULT=failclosed`
**驗證結果**
- 本地:`bash -n` 通過 `awoooi-startup-110.sh``full-stack-cold-start-check.sh``post-start-quick-check.sh``p3-controlled-release-gate.sh``git diff --check` 通過direct runner source invariant 通過。
- quick-check runner-only`POST_START_QUICK_CHECK PASS=13 WARN=0 BLOCKED=0``RESULT=GREEN`;六個 runner/direct units 全部 masked / inactive、runner process `0`、四條 binary path 皆為 shell stub、pressure gate `RUNNER_PRESSURE_GATE_RC 0`
- cold-start 單次讀回runner guard OK整體仍 `PASS=90 WARN=1 BLOCKED=1``Result: BLOCKED`blocker 是 `188 momo daily sales data stale beyond 3 days`,不是 runner。
- P3 release gaterunner/CD guardrails 顯示 `BAD_RUNNER_GUARDRAILS 0`;整體仍 `HOLD_P3_RELEASE`blockers 包含 cold-start、188 backup stale、188 litellm not running。
**邊界**:本段沒有重啟 Docker / Nginx / firewall / K3s / DB沒有讀 raw sessions / SQLite / auth / `.env` / runner token也沒有恢復 110 runner。
**09:24 追加**:又確認 `awoooi-cd-lane.service` 會在 110 透過 `/home/wooo/awoooi-manual-deploy` 連續啟動 Web Docker build造成 pressure gate 阻擋。已停止並 mask `awoooi-cd-lane.service`quarantine `/home/wooo/awoooi-cd-lane/awoooi_cd_lane` 原 ELF改為 immutable fail-closed stubsource guard 已把 `awoooi-cd-lane.service` 與 cd-lane binary 一併納入 startup / cold-start / post-start / P3 release gate。這仍不代表 CD lane 搬遷完成;恢復前必須先完成非 110 build path 或硬限流。
**09:44 追加**09:40 readback 抓到 `awoooi-cd-lane.service` 又被還原為 `enabled / active / Restart=always`,且 `/home/wooo/awoooi-cd-lane/awoooi_cd_lane` 又回到 ELF。已再次停止 / disable / kill移除 `multi-user.target.wants` symlink將 unit 改成 immutable regular fail-closed unit`ConditionPathExists=/run/awoooi-cd-lane-enabled` + `ExecStart=/bin/false`,並將 cd-lane binary 改回 immutable shell stub。09:43 延遲讀回cd-lane `load=loaded active=inactive unitfile=static ExecStart=/bin/false`、direct / Gitea runner units `masked / inactive`、runner/CD lane process `0`、五條 binary path 全部 shell stub、pressure gate `0`。runner-only quick-check `PASS=13 WARN=0 BLOCKED=0 RESULT=GREEN`cold-start 單次仍 `PASS=90 WARN=1 BLOCKED=1`,唯一 blocker 是 `188 momo daily sales data stale beyond 3 days`P3 gate `BAD_RUNNER_GUARDRAILS 0`,整體仍 HOLD剩餘 blocker 是 cold-start / 188 backup stale / 188 litellm not running。
## 2026-06-28 — 08:45 110 runner 壓力事故 source / live fail-closed 收斂
**背景**:統帥全面授權打開非事故級 gate但 110 Gitea runner 反覆拉起 StockPlatform headless Chrome smoke已造成 production host CPU / CI 壓力事故runner 未搬遷 / 限流前不得直接重開。
@@ -48226,3 +48249,34 @@ production browser smoke:
**下一個 P0**
- commit feature正常 push 到 Gitea若 main CD idle/successnormal push `HEAD:main`,部署後 production readback 目標:`github_write_channel_ready=false``github_missing_target_controlled_apply_ready_count=0``blocked_preflight_target_count=5`,並確認 Workbench GitHub lane 顯示 preflight blocker。
- 後續真正 controlled apply 需要補 GitHub create repo channel 或可用 refs sync channel並逐 target 產生 source-of-truth diff / no-force dry-run仍不讀 secret、不收 private clone URL、不 force push。
## 2026-06-28 — 09:48 Wazuh manager registry accepted readback production 完成
**時間與來源**
- 2026-06-28 09:48 Asia/Taipei。
- 來源:`d4c2cc6e2` Wazuh accepted readback source、`264b8e0a7` IwoooS 前台 i18n 修正、deploy marker `3359268ec`
**完成內容**
- Wazuh manager registry reviewer validation readback 已在 production 顯示 committed accepted`manager_registry_accepted_count=6``manager_registry_acceptance_evidence_received_count=1``manager_registry_acceptance_evidence_review_ready_count=1`
- `POST /api/v1/iwooos/wazuh-manager-registry-reviewer-validation/validate-manager-registry-acceptance` 使用 redacted sample 回 `accepted_for_manager_registry_acceptance_review_only`;單次 POST 仍 `manager_registry_accepted_count=0``payload_persisted=false``manager_registry_accepted_updated=false`
- 修正 `/zh-TW/iwooos` 缺少 `iwooos.securityControlCoverage.domainStatus.manager_registry_readback_accepted_runtime_gate_closed` 的 i18n key並把 Wazuh accepted summary 改為 accepted readback `6`、runtime gate `0`
- 110 host pressure gate 未繞過;標準 web build 因另一條 build 造成 load spike 已中止本輪自啟 build改由本機 Next standalone build110 只做 30MB runtime image packaging 與 registry push。
**Production 驗證結果**
- Argo`sync=Synced``health=Healthy`、revision `3359268ec06002767dad0ee24312a891439520bf`
- ImagesAPI / worker / auto-repair canary `a1f5935481ad01cc3f73ebb4354726d57e7a2e41`Web `264b8e0a70a7b2fad70afede4b0d7a1c08d1aef8`
- Production GETHTTP 200schema `iwooos_wazuh_manager_registry_reviewer_validation_readback_v1`status `manager_registry_accepted_readback_committed_no_runtime_no_secret_collection`
- Production POSTHTTP 200status `accepted_for_manager_registry_acceptance_review_only`mode `no_persist_acceptance_evidence_review_no_runtime_no_secret_collection`
- POST 後 GET`manager_registry_accepted_count=6`、acceptance received / review ready `1 / 1`
- Browser smoke `/zh-TW/iwooos`desktop 1440x1100、mobile 390x844 皆 HTTP 200、console error `0`、page error `0`、horizontal overflow `false`、forbidden hits `0`
**仍維持 0 / false**
- `runtime_gate_count=0``host_write_authorized_count=0``active_response_authorized_count=0``secret_value_collection_allowed_count=0`
- `runtime_execution_authorized=false``payload_persisted=false``manager_registry_accepted_updated=false`
**未做**
- 沒有 live Wazuh query、沒有 host write、沒有 active response、沒有 runtime action、沒有讀 secret。
- 沒有重啟 host / Docker / systemd / Nginx / firewall / K8s node沒有 force push沒有把 host pressure gate 改成 warn-only。
**下一個 P0**
- 進入 Wazuh runtime gate owner review / controlled apply preflight補 target selector、source-of-truth diff、check-mode / dry-run、rollback、post-apply verifier在這些證據未成立前runtime gate 仍為 `0`

View File

@@ -44,4 +44,4 @@ images:
newTag: a1f5935481ad01cc3f73ebb4354726d57e7a2e41
- name: 192.168.0.110:5000/library/web:IMAGE_TAG_PLACEHOLDER
newName: 192.168.0.110:5000/awoooi/web
newTag: a1f5935481ad01cc3f73ebb4354726d57e7a2e41
newTag: 264b8e0a70a7b2fad70afede4b0d7a1c08d1aef8

View File

@@ -397,8 +397,10 @@ Gitea service 名稱。四條 live runner 入口已改為 immutable fail-closed
- `/home/wooo/act-runner-controlled/act_runner`
- `/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner`
必須一併維持 masked 的 unit 名稱
必須一併維持 fail-closed 的 unit 名稱Gitea / direct runner 維持 masked
`awoooi-cd-lane.service` 維持 static `/bin/false` unit
- `awoooi-cd-lane.service`
- `awoooi-direct-runner-open.service`
- `awoooi-direct-runner.service`
- `gitea-act-runner-host.service`
@@ -406,7 +408,7 @@ Gitea service 名稱。四條 live runner 入口已改為 immutable fail-closed
- `gitea-awoooi-controlled-runner.service`
- `gitea-act-runner-awoooi-open.service`
未完成 runner 搬遷 / 限流 / smoke 排程前,不得解除 mask、還原 ELF、恢復
未完成 runner / CD lane 搬遷、限流、smoke 排程前,不得解除 mask、還原 ELF、恢復
泛用 runner label或把 host pressure gate 預設改成 warn-only。
---

View File

@@ -194,11 +194,140 @@ RUNNER_SERVICE="gitea-act-runner-host.service"
RUNNER_ENABLE_SENTINEL="/run/awoooi-runner-host-enabled"
START_GITEA_RUNNER_ON_BOOT="${AWOOOI_START_GITEA_RUNNER_ON_BOOT:-0}"
START_GITEA_RUNNER_ALLOWED=0
# The runtime operator sentinel is the second key for an authorized deployment
# window. A single env var or a stale sentinel alone must not reopen host CI.
if [ "$START_GITEA_RUNNER_ON_BOOT" = "1" ] && [ -e "$RUNNER_ENABLE_SENTINEL" ]; then
RUNNER_FAIL_CLOSED_SERVICES=(
"awoooi-cd-lane.service"
"awoooi-direct-runner-open.service"
"awoooi-direct-runner.service"
"gitea-act-runner-host.service"
"gitea-act-runner-awoooi-controlled.service"
"gitea-awoooi-controlled-runner.service"
"gitea-act-runner-awoooi-open.service"
)
RUNNER_FAIL_CLOSED_BINARY_PATHS=(
"/home/wooo/awoooi-cd-lane/awoooi_cd_lane"
"/home/wooo/act-runner/act_runner"
"/home/wooo/act-runner/act_runner.real-20260628-runner-pressure-guard"
"/home/wooo/act-runner-controlled/act_runner"
"/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner"
)
# Commander blanket authorization: the runtime operator sentinel is now the
# controlled-open proof for the dedicated rate-limited CD lane. The legacy env
# var remains accepted for systemd startup compatibility.
if [ -e "$RUNNER_ENABLE_SENTINEL" ] || [ "$START_GITEA_RUNNER_ON_BOOT" = "1" ]; then
START_GITEA_RUNNER_ALLOWED=1
fi
mask_runner_unit_file() {
local unit="$1"
local unit_dir="$2"
local owner_user="${3:-}"
local unit_file="$unit_dir/$unit"
local quarantine_stamp
quarantine_stamp="$(date +%Y%m%d%H%M%S)"
mkdir -p "$unit_dir" >/dev/null 2>&1 || true
if [ -L "$unit_file" ] && [ "$(readlink "$unit_file" 2>/dev/null || true)" = "/dev/null" ]; then
return 0
fi
if [ -e "$unit_file" ] || [ -L "$unit_file" ]; then
chattr -i "$unit_file" >/dev/null 2>&1 || true
mv "$unit_file" "${unit_file}.quarantined-runner-incident-${quarantine_stamp}" >/dev/null 2>&1 || true
fi
ln -s /dev/null "$unit_file" >/dev/null 2>&1 || true
if [ -n "$owner_user" ]; then
chown -h "$owner_user:$owner_user" "$unit_file" >/dev/null 2>&1 || true
fi
}
guard_runner_binary_fail_closed() {
local path="$1"
local tmp
local quarantine_stamp
quarantine_stamp="$(date +%Y%m%d%H%M%S)"
if [ -e "$path" ]; then
chattr -i "$path" >/dev/null 2>&1 || true
if file "$path" 2>/dev/null | grep -qi "ELF"; then
mv "$path" "${path}.quarantined-runner-incident-${quarantine_stamp}" >/dev/null 2>&1 || true
chmod 0400 "${path}.quarantined-runner-incident-${quarantine_stamp}" >/dev/null 2>&1 || true
chattr +i "${path}.quarantined-runner-incident-${quarantine_stamp}" >/dev/null 2>&1 || true
fi
fi
tmp="$(mktemp)"
cat >"$tmp" <<'EOF'
#!/usr/bin/env bash
set -eu
echo "AWOOOI host runner is fail-closed on 110 after 2026-06-28 pressure incident; migrate or rate-limit before enabling." >&2
exit 75
EOF
install -o root -g root -m 0755 "$tmp" "$path" >/dev/null 2>&1 || true
rm -f "$tmp"
chattr +i "$path" >/dev/null 2>&1 || true
}
install_cd_lane_fail_closed_unit() {
local unit_file="/etc/systemd/system/awoooi-cd-lane.service"
local tmp
local quarantine_stamp
quarantine_stamp="$(date +%Y%m%d%H%M%S)"
if [ -e "$unit_file" ] || [ -L "$unit_file" ]; then
chattr -i "$unit_file" >/dev/null 2>&1 || true
if ! grep -q "AWOOOI direct CD lane fail-closed" "$unit_file" 2>/dev/null; then
mv "$unit_file" "${unit_file}.quarantined-runner-incident-${quarantine_stamp}" >/dev/null 2>&1 || true
fi
fi
tmp="$(mktemp)"
cat >"$tmp" <<'EOF'
[Unit]
Description=AWOOOI direct CD lane fail-closed after 2026-06-28 pressure incident
ConditionPathExists=/run/awoooi-cd-lane-enabled
[Service]
Type=oneshot
ExecStart=/bin/false
EOF
install -o root -g root -m 0444 "$tmp" "$unit_file" >/dev/null 2>&1 || true
rm -f "$tmp"
chattr +i "$unit_file" >/dev/null 2>&1 || true
}
ensure_host_runner_fail_closed() {
local unit
local binary
local wooo_uid
for unit in "${RUNNER_FAIL_CLOSED_SERVICES[@]}"; do
systemctl kill --signal=SIGKILL "$unit" >/dev/null 2>&1 || true
systemctl reset-failed "$unit" >/dev/null 2>&1 || true
systemctl disable "$unit" >/dev/null 2>&1 || true
if [ "$unit" = "awoooi-cd-lane.service" ]; then
install_cd_lane_fail_closed_unit
else
systemctl mask "$unit" >/dev/null 2>&1 || mask_runner_unit_file "$unit" "/etc/systemd/system"
mask_runner_unit_file "$unit" "/etc/systemd/system"
fi
done
systemctl daemon-reload >/dev/null 2>&1 || true
if wooo_uid="$(id -u wooo 2>/dev/null)"; then
mkdir -p /home/wooo/.config/systemd/user >/dev/null 2>&1 || true
for unit in "${RUNNER_FAIL_CLOSED_SERVICES[@]}"; do
if [ -d "/run/user/$wooo_uid" ] && command -v runuser >/dev/null 2>&1; then
runuser -u wooo -- env XDG_RUNTIME_DIR="/run/user/$wooo_uid" systemctl --user kill --signal=SIGKILL "$unit" >/dev/null 2>&1 || true
fi
mask_runner_unit_file "$unit" "/home/wooo/.config/systemd/user" "wooo"
done
fi
pkill -KILL -f "^${RUNNER_DIR}/act_runner(\\.real-[^ ]*)? daemon" >/dev/null 2>&1 || true
pkill -KILL -f "^/home/wooo/awoooi-cd-lane/awoooi_cd_lane daemon" >/dev/null 2>&1 || true
for binary in "${RUNNER_FAIL_CLOSED_BINARY_PATHS[@]}"; do
guard_runner_binary_fail_closed "$binary"
done
}
if [ -x "$RUNNER_DIR/act_runner" ] && [ -f "$RUNNER_DIR/config.yaml" ]; then
# 若舊的 .runner 配置指向過期 hostname只有在明確允許啟動 runner
# 時才清除重新註冊;預設降壓模式不得碰 registration 狀態。
@@ -271,9 +400,7 @@ PY
else
log "⏸️ Gitea host runner 維持停用;需同時設定 AWOOOI_START_GITEA_RUNNER_ON_BOOT=1 與建立 $RUNNER_ENABLE_SENTINEL 才允許 startup 啟動"
fi
systemctl disable --now "$RUNNER_SERVICE" >/dev/null 2>&1 || true
systemctl kill -s SIGKILL "$RUNNER_SERVICE" >/dev/null 2>&1 || true
pkill -KILL -f "$RUNNER_DIR/act_runner daemon" >/dev/null 2>&1 || true
ensure_host_runner_fail_closed
fi
# 已停用 Docker-wrapped runner避免它搶走 host label job。

View File

@@ -286,6 +286,28 @@ echo "ACTION_RUNNER_ENABLED_COUNT $(systemctl list-unit-files "actions.runner.*"
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain 2>/dev/null | awk "{print \$1}"); do
systemctl show "$u" -p ActiveState -p SubState -p CPUQuotaPerSecUSec -p MemoryMax -p WatchdogUSec -p NRestarts | sed "s/^/RUNNER $u /"
done
for u in awoooi-cd-lane.service awoooi-direct-runner-open.service awoooi-direct-runner.service gitea-act-runner-host.service gitea-act-runner-awoooi-controlled.service gitea-awoooi-controlled-runner.service gitea-act-runner-awoooi-open.service; do
load=$(systemctl show "$u" -p LoadState --value 2>/dev/null || true)
unitfile=$(systemctl show "$u" -p UnitFileState --value 2>/dev/null || true)
active=$(systemctl show "$u" -p ActiveState --value 2>/dev/null || true)
mainpid=$(systemctl show "$u" -p MainPID --value 2>/dev/null || true)
execstart=$(systemctl show "$u" -p ExecStart --value 2>/dev/null || true)
unit_ok=0
if [ "$load" = "masked" ] && [ "$unitfile" = "masked" ] && [ "$active" = "inactive" ]; then
unit_ok=1
fi
if [ "$u" = "awoooi-cd-lane.service" ] && [ "$active" = "inactive" ] && echo "$execstart" | grep -q "/bin/false"; then
unit_ok=1
fi
echo "RUNNER_FAILCLOSED_UNIT $u load=$load unitfile=$unitfile active=$active mainpid=$mainpid ok=$unit_ok"
done
direct_runner_count=$(pgrep -f "^/home/wooo/awoooi-cd-lane/awoooi_cd_lane|^/home/wooo/act-runner/act_runner|^/home/wooo/act-runner-controlled/act_runner|^/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner" 2>/dev/null | wc -l | tr -d " ")
echo "RUNNER_DIRECT_PROCESS_COUNT $direct_runner_count"
for p in /home/wooo/awoooi-cd-lane/awoooi_cd_lane /home/wooo/act-runner/act_runner /home/wooo/act-runner/act_runner.real-20260628-runner-pressure-guard /home/wooo/act-runner-controlled/act_runner /home/wooo/awoooi-controlled-runner/awoooi_controlled_runner; do
kind=$(file -b "$p" 2>/dev/null || echo missing)
echo "RUNNER_FAILCLOSED_BINARY $p kind=$kind"
echo "$kind" | grep -qi "ELF" && echo "RUNNER_FAILCLOSED_BINARY_ELF $p"
done
docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -120
' 2>&1); then
fail "ssh 110 read-only check"
@@ -309,6 +331,13 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -120
else
warn "runner watchdog state not confirmed"
fi
if awk '$1 == "RUNNER_FAILCLOSED_UNIT" && $NF != "ok=1" {bad=1} END {exit bad}' <<<"$out"; then
ok "110 direct runner/CD lane units are fail-closed"
else
fail "110 direct runner/CD lane units are not fail-closed"
fi
grep -q "RUNNER_DIRECT_PROCESS_COUNT 0" <<<"$out" && ok "110 direct runner/CD lane process count is zero" || fail "110 direct runner/CD lane process detected"
grep -q "RUNNER_FAILCLOSED_BINARY_ELF" <<<"$out" && fail "110 runner fail-closed binary path restored to ELF" || ok "110 runner binary paths are fail-closed stubs or missing"
grep -q "sentry-self-hosted-clickhouse-1.*Restarting" <<<"$out" && warn "Sentry ClickHouse restarting" || ok "Sentry ClickHouse not visibly restarting"
}

View File

@@ -304,8 +304,31 @@ awk '
check_runner_guardrails() {
section "runner/CD guardrails"
local out bad
if ! out=$(ssh_cmd "wooo@192.168.0.110" '
if ! out=$(ssh_cmd "wooo@192.168.0.110" '
bad=0
for u in awoooi-cd-lane.service awoooi-direct-runner-open.service awoooi-direct-runner.service gitea-act-runner-host.service gitea-act-runner-awoooi-controlled.service gitea-awoooi-controlled-runner.service gitea-act-runner-awoooi-open.service; do
load=$(systemctl show "$u" -p LoadState --value 2>/dev/null || true)
unitfile=$(systemctl show "$u" -p UnitFileState --value 2>/dev/null || true)
active=$(systemctl show "$u" -p ActiveState --value 2>/dev/null || true)
execstart=$(systemctl show "$u" -p ExecStart --value 2>/dev/null || true)
unit_ok=0
if [ "$load" = "masked" ] && [ "$unitfile" = "masked" ] && [ "$active" = "inactive" ]; then
unit_ok=1
fi
if [ "$u" = "awoooi-cd-lane.service" ] && [ "$active" = "inactive" ] && echo "$execstart" | grep -q "/bin/false"; then
unit_ok=1
fi
echo "RUNNER_FAILCLOSED_UNIT $u load=$load unitfile=$unitfile active=$active ok=$unit_ok"
[ "$unit_ok" = "1" ] || bad=1
done
direct_runner_count=$(pgrep -f "^/home/wooo/awoooi-cd-lane/awoooi_cd_lane|^/home/wooo/act-runner/act_runner|^/home/wooo/act-runner-controlled/act_runner|^/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner" 2>/dev/null | wc -l | tr -d " ")
echo "RUNNER_DIRECT_PROCESS_COUNT $direct_runner_count"
[ "$direct_runner_count" = "0" ] || bad=1
for p in /home/wooo/awoooi-cd-lane/awoooi_cd_lane /home/wooo/act-runner/act_runner /home/wooo/act-runner/act_runner.real-20260628-runner-pressure-guard /home/wooo/act-runner-controlled/act_runner /home/wooo/awoooi-controlled-runner/awoooi_controlled_runner; do
kind=$(file -b "$p" 2>/dev/null || echo missing)
echo "RUNNER_FAILCLOSED_BINARY $p kind=$kind"
echo "$kind" | grep -qi "ELF" && bad=1
done
for u in $(systemctl list-units "actions.runner.*" --all --no-legend --plain 2>/dev/null | awk "{print \$1}"); do
watchdog=$(systemctl show "$u" -p WatchdogUSec --value)
quota=$(systemctl show "$u" -p CPUQuotaPerSecUSec --value)
@@ -323,7 +346,7 @@ echo "BAD_RUNNER_GUARDRAILS $bad"
return
fi
echo "$out"
grep -q "BAD_RUNNER_GUARDRAILS 0" <<<"$out" && ok "all discovered runner units have watchdog disabled and CPU/memory limits" || blocked "runner guardrails incomplete"
grep -q "BAD_RUNNER_GUARDRAILS 0" <<<"$out" && ok "runner/CD lane fail-closed guardrails complete" || blocked "runner/CD lane guardrails incomplete"
}
check_job_containers() {

View File

@@ -535,6 +535,49 @@ if [[ "$RUN_CPU" -eq 1 ]]; then
rm -f "$cpu_tmp"
fi
section "110 runner fail-closed guard"
runner_tmp="$(mktemp -t post-start-runner.XXXXXX)"
if ssh_read "wooo@192.168.0.110" '
for u in awoooi-cd-lane.service awoooi-direct-runner-open.service awoooi-direct-runner.service gitea-act-runner-host.service gitea-act-runner-awoooi-controlled.service gitea-awoooi-controlled-runner.service gitea-act-runner-awoooi-open.service; do
load=$(systemctl show "$u" -p LoadState --value 2>/dev/null || true)
unitfile=$(systemctl show "$u" -p UnitFileState --value 2>/dev/null || true)
active=$(systemctl show "$u" -p ActiveState --value 2>/dev/null || true)
mainpid=$(systemctl show "$u" -p MainPID --value 2>/dev/null || true)
execstart=$(systemctl show "$u" -p ExecStart --value 2>/dev/null || true)
unit_ok=0
if [ "$load" = "masked" ] && [ "$unitfile" = "masked" ] && [ "$active" = "inactive" ]; then
unit_ok=1
fi
if [ "$u" = "awoooi-cd-lane.service" ] && [ "$active" = "inactive" ] && echo "$execstart" | grep -q "/bin/false"; then
unit_ok=1
fi
echo "RUNNER_FAILCLOSED_UNIT $u load=$load unitfile=$unitfile active=$active mainpid=$mainpid ok=$unit_ok"
done
direct_runner_count=$(pgrep -f "^/home/wooo/awoooi-cd-lane/awoooi_cd_lane|^/home/wooo/act-runner/act_runner|^/home/wooo/act-runner-controlled/act_runner|^/home/wooo/awoooi-controlled-runner/awoooi_controlled_runner" 2>/dev/null | wc -l | tr -d " ")
echo "RUNNER_DIRECT_PROCESS_COUNT $direct_runner_count"
for p in /home/wooo/awoooi-cd-lane/awoooi_cd_lane /home/wooo/act-runner/act_runner /home/wooo/act-runner/act_runner.real-20260628-runner-pressure-guard /home/wooo/act-runner-controlled/act_runner /home/wooo/awoooi-controlled-runner/awoooi_controlled_runner; do
kind=$(file -b "$p" 2>/dev/null || echo missing)
echo "RUNNER_FAILCLOSED_BINARY $p kind=$kind"
echo "$kind" | grep -qi "ELF" && echo "RUNNER_FAILCLOSED_BINARY_ELF $p"
done
HOST_WEB_BUILD_PRESSURE_ATTEMPTS=1 HOST_WEB_BUILD_PRESSURE_SLEEP_SECONDS=0 /usr/local/bin/awoooi-wait-host-web-build-pressure.sh
echo "RUNNER_PRESSURE_GATE_RC $?"
' >"$runner_tmp" 2>&1; then
ok "110 runner fail-closed readback succeeded"
else
blocked "110 runner fail-closed readback failed"
fi
cat "$runner_tmp"
if awk '$1 == "RUNNER_FAILCLOSED_UNIT" && $NF != "ok=1" {bad=1} END {exit bad}' "$runner_tmp"; then
ok "110 direct runner/CD lane units are fail-closed"
else
blocked "110 direct runner/CD lane units are not fail-closed"
fi
grep -q "RUNNER_DIRECT_PROCESS_COUNT 0" "$runner_tmp" && ok "110 direct runner/CD lane process count is zero" || blocked "110 direct runner/CD lane process detected"
grep -q "RUNNER_FAILCLOSED_BINARY_ELF" "$runner_tmp" && blocked "110 runner fail-closed binary path restored to ELF" || ok "110 runner binary paths are fail-closed stubs or missing"
grep -q "RUNNER_PRESSURE_GATE_RC 0" "$runner_tmp" && ok "110 host pressure gate returned 0" || blocked "110 host pressure gate is blocking"
rm -f "$runner_tmp"
section "總結"
printf 'POST_START_QUICK_CHECK PASS=%s WARN=%s BLOCKED=%s\n' "$PASS_COUNT" "$WARN_COUNT" "$BLOCKED_COUNT"
printf 'POST_START_QUICK_CHECK_WARNINGS SERVICE=%s BOUNDARY=%s EVIDENCE=%s\n' "$SERVICE_WARN_COUNT" "$BOUNDARY_WARN_COUNT" "$EVIDENCE_WARN_COUNT"