From 2104f0f01a7eca69f7167bf3ed1f7da0aa4f0658 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Sun, 28 Jun 2026 16:32:14 +0800
Subject: [PATCH] fix(recovery): harden runner failclosed authority copy [skip ci]

---
 AGENTS.md                                     |  2 +-
 docs/HARD_RULES.md                            |  2 +-
 docs/LOGBOOK.md                               | 11 ++++++++++
 ...-04-15-MASTER-ai-autonomous-flywheel-v2.md |  2 +-
 ops/runner/README.md                          |  4 +++-
 ...awoooi-runner-failclosed-authority.service |  4 ++--
 .../awoooi-runner-failclosed-enforcer.service |  4 ++--
 .../awoooi-enforce-runner-failclosed-110.sh   |  4 ++++
 .../enforce-110-runner-failclosed.sh          | 20 ++++++++++++++-----
 9 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index cfca446f..73f9c95c 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -46,7 +46,7 @@

 The correct action is for the AI to automatically fill in the target selector, source-of-truth diff, check-mode / dry-run, rollback, post-apply verifier, and KM / PlayBook trust writeback, and then advance an implementation that is verifiable, reversible, and low in blast radius.

-**110 runner / controlled CD lane pressure-incident exception**: When Gitea / act-runner / direct transient runners, generic `ubuntu-latest`, or heavy StockPlatform / headless / Playwright-class jobs put CPU / Docker build pressure on 110, this counts as incident-level capacity protection. "Blanket authorization" must not be used to directly reopen the legacy runner, remove the legacy mask, restore the legacy runner binary, launch the `.real` binary directly via `systemd-run`, or switch the host pressure gate to warn-only. Until the runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` enforcer sources, the startup open drop-in, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, and the restore-source must also be archived or converted into fail-closed stubs. The Gitea `cd.yaml` / `code-review.yaml` push workflows remain manual-only.
+**110 runner / controlled CD lane pressure-incident exception**: When Gitea / act-runner / direct transient runners, generic `ubuntu-latest`, or heavy StockPlatform / headless / Playwright-class jobs put CPU / Docker build pressure on 110, this counts as incident-level capacity protection. "Blanket authorization" must not be used to directly reopen the legacy runner, remove the legacy mask, restore the legacy runner binary, launch the `.real` binary directly via `systemd-run`, or switch the host pressure gate to warn-only. Until the runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the cron / systemd authority must execute `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh`, so that automatic repair still works when an external opener overwrites the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`. The old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` enforcer sources, the startup open drop-in, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, and the restore-source must also be archived or converted into fail-closed stubs. The Gitea `cd.yaml` / `code-review.yaml` push workflows remain manual-only.

 ---

diff --git a/docs/HARD_RULES.md b/docs/HARD_RULES.md
index 2773577b..e668f17f 100644
--- a/docs/HARD_RULES.md
+++ b/docs/HARD_RULES.md
@@ -291,7 +291,7 @@ force push / delete repo / delete refs / change repo visibility / raw runtime secret volu

 Since the 2026-06-28 incident, Gitea / act-runner / direct transient runners, the StockPlatform headless smoke test, host-side Next builds, and Docker / BuildKit pressure on 110 fall within the capacity-incident protection surface. Even on receiving "approve / continue / blanket authorization", you must not directly reopen the legacy runner, lift the legacy service mask, restore the legacy runner binary, launch the `.real` binary directly via `systemd-run`, restore the generic `ubuntu-latest` label, or make warn-only the default for the host pressure gate.

-The permitted controlled apply is pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow the labels, add a source fail-closed guard, limit concurrency, move smoke tests to a schedule / non-110 runner, and run the read-only pressure / cold-start verifier. Until the runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; if an external opener temporarily restores a unit, it may only be restored as a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, and the next cron authority / authority / enforcer round must converge it back to masked / inactive. The verifier must no longer accept a single `controlled_open` lane.
+The permitted controlled apply is pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow the labels, add a source fail-closed guard, limit concurrency, move smoke tests to a schedule / non-110 runner, and run the read-only pressure / cold-start verifier. Until the runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the cron / systemd authority must execute `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh` and use that authority copy to repair the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`. If an external opener temporarily restores a unit or overwrites the canonical, it may only be restored as a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, and the next cron authority / authority / enforcer round must converge it back to masked / inactive. The verifier must no longer accept a single `controlled_open` lane.

 Restoring the runner requires all of the following:

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index faee421b..48293ca7 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,14 @@
+## 2026-06-28 — 16:22 110 runner fail-closed authority copy hardening
+
+**Background**: At 16:21 the P3 release gate again caught a short-lived external opener restoring `awoooi-cd-lane-drain.service` to `enabled / activating`, masking the fail-closed timers, and overwriting `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh` with a disabled stub. The original cron authority still existed, but if cron points at the overwritten canonical, automatic repair is lost.
+
+**Completed**:
+- `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` adds the authority copy `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh`; `--apply` installs / repairs the authority copy, the canonical, and the compatibility wrapper together.
+- `awoooi-runner-failclosed-enforcer.service`, `awoooi-runner-failclosed-authority.service`, and `/etc/cron.d/awoooi-runner-failclosed-authority` now execute the authority copy, so that when an external opener overwrites the canonical, the next cron / systemd authority round can still restore the canonical, timers, unit masks, sentinel, binary stub, and job container `0`.
+- `AGENTS.md`, `docs/HARD_RULES.md`, the MASTER spec, and `ops/runner/README.md` are updated in sync to pin down that, during the 110 runner/CD pressure incident, the canonical is not the sole root of trust; the authority copy is the automatic-repair entrypoint.
+
+**Boundaries**: No runner token / secret / raw session / SQLite / auth / `.env` was read; Docker / Nginx / firewall / K3s / DB were not restarted; neither the legacy runner nor the controlled drain lane was opened.
+
 ## 2026-06-28 — 15:20 IwoooS Wazuh live metadata owner packet no-persist validator

 **Completed**:
diff --git a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
index 96deb750..11a66747 100644
--- a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
+++ b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
@@ -153,7 +153,7 @@ AWOOOI / AwoooP / IwoooS are not merely a monitoring page, alert forwarder, or security inventory
 3. When a PlayBook, rollback, verifier, source-of-truth, evidence ref, or owner field is missing, the AI Agent must automatically generate a controlled apply package containing the target selector, source diff, check-mode, rollback, post-check, and KM / PlayBook trust writeback.
 4. A guard's job is not to block all work but to steer actions toward allowlist / check-mode / controlled apply / staged rollout / verifier / rollback; a guard that can only answer "handle manually" is itself a P0/P1 repair candidate.
 5. The incident-level hard blocks that still truly must not be opened directly are limited to: reading or exfiltrating secrets in plaintext; irreversible data destruction; DB DROP / TRUNCATE / destructive restore; reboot / node drain / irreversible firewall cutover; credentialed exploits / external attack-style active scans; switching paid providers / cost caps / production provider routes; replacing the OpenClaw core before replay / shadow / canary is complete; force push / damage to repo refs / visibility; and reading or writing raw runtime secret volumes.
-6. The 110 runner capacity incident is a hard-protection exception: do not reopen the legacy runner, lift the legacy fail-closed state, restore the generic label, or make the host pressure gate warn-only. Until the runner migration or non-110 hard rate limiting is complete, the AWOOOI controlled CD lane / drain lane must also be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` opener sources and the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts must be sealed into fail-closed stubs; and the workflow push trigger remains manual-only.
+6. The 110 runner capacity incident is a hard-protection exception: do not reopen the legacy runner, lift the legacy fail-closed state, restore the generic label, or make the host pressure gate warn-only. Until the runner migration or non-110 hard rate limiting is complete, the AWOOOI controlled CD lane / drain lane must also be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the cron / systemd authority must execute `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh` and repair the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`; the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` opener sources and the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts must be sealed into fail-closed stubs; and the workflow push trigger remains manual-only.
 7. The data freshness gate must be source-aware: if the Drive / provider source preflight proves there is no source newer than the last clean import, and the DB sync / import job is clean, stale business data is a source freshness warning; only anomalies in auth/source/failed-folder/DB sync are hard blockers.
 8. The provider proxy gate must avoid opening cost / routes by mistake: proxies such as LiteLLM that are not provisioned and are marked optional retired in the repo may only be listed as warnings; they must not be auto-started, nor the production provider route switched, just to pass the health gate.

diff --git a/ops/runner/README.md b/ops/runner/README.md
index 365094f0..ef86c4b3 100644
--- a/ops/runner/README.md
+++ b/ops/runner/README.md
@@ -418,7 +418,9 @@ quarantine restore source or `systemd-run` to bring them back to active.
 - `ops/runner/awoooi-runner-failclosed-authority.service`
 - `ops/runner/awoooi-runner-failclosed-authority.timer`

-Live 110 must install the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`;
+Live 110 must install the authority copy `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh`
+and the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`; the cron / systemd authority always executes
+the authority copy, so automatic repair still works when an external opener overwrites the canonical.
 `/usr/local/bin/awoooi-enforce-runner-failclosed-110.sh` is only a compatibility wrapper. You must enable
 `awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer`.
 `/etc/cron.d/awoooi-runner-failclosed-authority` must exist, as the third-layer convergence authority for when the systemd timers are masked by a short-lived external opener.
diff --git a/ops/runner/awoooi-runner-failclosed-authority.service b/ops/runner/awoooi-runner-failclosed-authority.service
index 41e005a1..17a935ef 100644
--- a/ops/runner/awoooi-runner-failclosed-authority.service
+++ b/ops/runner/awoooi-runner-failclosed-authority.service
@@ -1,10 +1,10 @@
 [Unit]
 Description=AWOOOI 110 runner/CD lane fail-closed authority
-Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
+Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
 Wants=network-online.target
 After=network-online.target docker.service

 [Service]
 Type=oneshot
-ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
+ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
 TimeoutStartSec=180
diff --git a/ops/runner/awoooi-runner-failclosed-enforcer.service b/ops/runner/awoooi-runner-failclosed-enforcer.service
index bf7867f5..4802aba3 100644
--- a/ops/runner/awoooi-runner-failclosed-enforcer.service
+++ b/ops/runner/awoooi-runner-failclosed-enforcer.service
@@ -1,10 +1,10 @@
 [Unit]
 Description=AWOOOI 110 runner/CD lane fail-closed enforcer
-Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
+Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
 Wants=network-online.target
 After=network-online.target docker.service

 [Service]
 Type=oneshot
-ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
+ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
 TimeoutStartSec=180
diff --git a/scripts/reboot-recovery/awoooi-enforce-runner-failclosed-110.sh b/scripts/reboot-recovery/awoooi-enforce-runner-failclosed-110.sh
index f7280d8e..6c14c955 100755
--- a/scripts/reboot-recovery/awoooi-enforce-runner-failclosed-110.sh
+++ b/scripts/reboot-recovery/awoooi-enforce-runner-failclosed-110.sh
@@ -8,4 +8,8 @@ if [ -x "$SCRIPT_DIR/enforce-110-runner-failclosed.sh" ]; then
   exec "$SCRIPT_DIR/enforce-110-runner-failclosed.sh" "$@"
 fi

+if [ -x /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh ]; then
+  exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh "$@"
+fi
+
 exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.sh "$@"
diff --git a/scripts/reboot-recovery/enforce-110-runner-failclosed.sh b/scripts/reboot-recovery/enforce-110-runner-failclosed.sh
index 5e19eab1..cb6fe681 100755
--- a/scripts/reboot-recovery/enforce-110-runner-failclosed.sh
+++ b/scripts/reboot-recovery/enforce-110-runner-failclosed.sh
@@ -9,6 +9,7 @@ MODE="check"
 STAMP="$(date +%Y%m%dT%H%M%S%z)"
 APPLY_PERFORMED=0
 CANONICAL_ENFORCER="/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh"
+AUTHORITY_ENFORCER="/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh"
 COMPAT_ENFORCER="/usr/local/bin/awoooi-enforce-runner-failclosed-110.sh"

 usage() {
@@ -335,16 +336,25 @@ repair_enforcer_entrypoints() {
   local tmp current="$(readlink -f "$0" 2>/dev/null || printf '%s' "$0")"

   as_root mkdir -p "$(dirname "$CANONICAL_ENFORCER")" >/dev/null 2>&1 || true
+  as_root mkdir -p "$(dirname "$AUTHORITY_ENFORCER")" >/dev/null 2>&1 || true
   if [ -f "$current" ] && [ "$current" != "$CANONICAL_ENFORCER" ]; then
     as_root chattr -i "$CANONICAL_ENFORCER" >/dev/null 2>&1 || true
     as_root install -o root -g root -m 0755 "$current" "$CANONICAL_ENFORCER" >/dev/null 2>&1 || true
   fi
   as_root chattr +i "$CANONICAL_ENFORCER" >/dev/null 2>&1 || true
+  if [ -f "$current" ] && [ "$current" != "$AUTHORITY_ENFORCER" ]; then
+    as_root chattr -i "$AUTHORITY_ENFORCER" >/dev/null 2>&1 || true
+    as_root install -o root -g root -m 0755 "$current" "$AUTHORITY_ENFORCER" >/dev/null 2>&1 || true
+  fi
+  as_root chattr +i "$AUTHORITY_ENFORCER" >/dev/null 2>&1 || true

   tmp="$(mktemp)"
   cat >"$tmp" <<'EOF'
 #!/usr/bin/env bash
 set -eu
+if [ -x /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh ]; then
+  exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh "$@"
+fi
 exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.sh "$@"
 EOF
   as_root chattr -i "$COMPAT_ENFORCER" >/dev/null 2>&1 || true
@@ -365,13 +375,13 @@ repair_enforcer_systemd_units() {
   cat >"$service_tmp" <<'EOF'
 [Unit]
 Description=AWOOOI 110 runner/CD lane fail-closed enforcer
-Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
+Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
 Wants=network-online.target
 After=network-online.target docker.service

 [Service]
 Type=oneshot
-ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
+ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
 TimeoutStartSec=180
 EOF

@@ -395,13 +405,13 @@ EOF
   cat >"$authority_service_tmp" <<'EOF'
 [Unit]
 Description=AWOOOI 110 runner/CD lane fail-closed authority
-Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
+Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
 Wants=network-online.target
 After=network-online.target docker.service

 [Service]
 Type=oneshot
-ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
+ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
 TimeoutStartSec=180
 EOF

@@ -455,7 +465,7 @@ repair_enforcer_cron_authority() {
   cat >"$tmp" <<'EOF'
 SHELL=/bin/bash
 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
-* * * * * root /usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply >>/var/log/awoooi-runner-failclosed-authority-cron.log 2>&1
+* * * * * root /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply >>/var/log/awoooi-runner-failclosed-authority-cron.log 2>&1
 EOF
   as_root install -o root -g root -m 0644 "$tmp" /etc/cron.d/awoooi-runner-failclosed-authority >/dev/null 2>&1 || true
   rm -f "$tmp"
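
Review note (not part of the patch): the self-repair idea behind `repair_enforcer_entrypoints()` can be tried without root in a scratch directory. The sketch below is only an illustration under stated assumptions: `demo_root`, `repair_entrypoints_demo`, the shortened file names, and the stub contents are hypothetical stand-ins, and it leaves out the `as_root`, `chattr`, and fixed `/usr/local/lib/awoooi` paths that the real script uses.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch layout standing in for /usr/local/lib/awoooi.
demo_root="$(mktemp -d)"
canonical="$demo_root/enforce.sh"
authority="$demo_root/enforce.authority.sh"

# Copy the currently running enforcer over every entrypoint that is not
# itself, mirroring the `[ "$current" != "$TARGET" ]` guard in the patch.
repair_entrypoints_demo() {
  local current="$1" target
  for target in "$canonical" "$authority"; do
    if [ -f "$current" ] && [ "$current" != "$target" ]; then
      install -m 0755 "$current" "$target"
    fi
  done
}

# 1. Seed both copies from a known-good source.
good="$demo_root/source.sh"
printf '#!/usr/bin/env bash\necho enforcing\n' >"$good"
repair_entrypoints_demo "$good"

# 2. Simulate an external opener overwriting only the canonical copy.
printf '#!/usr/bin/env bash\necho disabled\n' >"$canonical"

# 3. The next scheduled run starts from the authority copy and repairs
#    the canonical copy from it.
repair_entrypoints_demo "$authority"
result="$(bash "$canonical")"
echo "$result"
```

Because the scheduled entrypoint is the authority copy, overwriting only the canonical file does not stop the next run from restoring it. The sketch does not cover the case where the authority copy itself is overwritten.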