fix(recovery): harden runner failclosed authority copy [skip ci]

Your Name
2026-06-28 16:32:14 +08:00
parent f52ec0db26
commit 2104f0f01a
9 changed files with 40 additions and 13 deletions

View File

@@ -46,7 +46,7 @@
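The correct action is for the AI to automatically fill in the target selector, source-of-truth diff, check-mode / dry-run, rollback, post-apply verifier, and KM / PlayBook trust writeback, and then advance an implementation that is verifiable, rollback-capable, and low in blast radius.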
**110 runner / controlled CD lane pressure-incident exception**: When Gitea / act-runner / direct transient runner, generic `ubuntu-latest`, or StockPlatform / headless / Playwright-class heavy jobs put CPU / Docker build pressure on 110, this falls under incident-level capacity protection. "Blanket authorization" must not be used to directly reopen the legacy runner, remove the legacy mask, restore the legacy runner binary, launch the `.real` binary directly with `systemd-run`, or change the host pressure gate to warn-only. Until runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` enforcer sources, startup open drop-ins, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, and restore-sources must also be archived or replaced with fail-closed stubs. The Gitea `cd.yaml` / `code-review.yaml` push workflows remain manual-only.
**110 runner / controlled CD lane pressure-incident exception**: When Gitea / act-runner / direct transient runner, generic `ubuntu-latest`, or StockPlatform / headless / Playwright-class heavy jobs put CPU / Docker build pressure on 110, this falls under incident-level capacity protection. "Blanket authorization" must not be used to directly reopen the legacy runner, remove the legacy mask, restore the legacy runner binary, launch the `.real` binary directly with `systemd-run`, or change the host pressure gate to warn-only. Until runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`. The cron / systemd authority must execute `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh`, so that auto-repair still works when an external opener overwrites the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`. The old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` enforcer sources, startup open drop-ins, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, and restore-sources must also be archived or replaced with fail-closed stubs. The Gitea `cd.yaml` / `code-review.yaml` push workflows remain manual-only.
---

View File

@@ -291,7 +291,7 @@ force push / delete repo / delete refs / change repo visibility / raw runtime secret volu
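After the 2026-06-28 incident, Gitea / act-runner / direct transient runner, StockPlatform headless smoke, host-side Next build, and Docker / BuildKit pressure on 110 are treated as a capacity-incident protection surface. Even after receiving "approve / continue / blanket authorization", do not directly reopen the legacy runner, unmask the legacy service, restore the legacy runner binary, launch the `.real` binary directly with `systemd-run`, restore the generic `ubuntu-latest` label, or make warn-only the default for the host pressure gate.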
The permitted controlled apply is pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow labels, add a source fail-closed guard, limit concurrency, move smoke to scheduled / non-110 runners, and run the read-only pressure / cold-start verifier. Until runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; if an external opener temporarily restores a unit, it may only be restored as a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, and the next cron authority / authority / enforcer round must converge it back to masked / inactive. The verifier must no longer accept a single `controlled_open` lane.
The permitted controlled apply is pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow labels, add a source fail-closed guard, limit concurrency, move smoke to scheduled / non-110 runners, and run the read-only pressure / cold-start verifier. Until runner migration or non-110 hard rate limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`. The cron / systemd authority must execute `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh` and use that authority copy to repair the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`; if an external opener temporarily restores a unit or overwrites the canonical, it may only be restored as a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, and the next cron authority / authority / enforcer round must converge it back to masked / inactive. The verifier must no longer accept a single `controlled_open` lane.
Restoring the runner requires all of the following:

View File

@@ -1,3 +1,14 @@
## 2026-06-28 — 16:22 110 runner fail-closed authority copy hardening
**Background**: At 16:21 the P3 release gate again caught a short-lived external opener restoring `awoooi-cd-lane-drain.service` to `enabled / activating`, masking the fail-closed timers, and overwriting `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh` with a disabled stub. The existing cron authority was still present, but if cron points at the overwritten canonical, it loses its auto-repair capability.
**Completed**:
- `scripts/reboot-recovery/enforce-110-runner-failclosed.sh` adds the authority copy `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh`; `--apply` installs / repairs the authority copy, the canonical, and the compatibility wrapper together.
- `awoooi-runner-failclosed-enforcer.service`, `awoooi-runner-failclosed-authority.service`, and `/etc/cron.d/awoooi-runner-failclosed-authority` now execute the authority copy, so that when an external opener overwrites the canonical, the next cron / systemd authority round can still restore the canonical, timers, unit masks, sentinel, binary stub, and job container `0`.
- `AGENTS.md`, `docs/HARD_RULES.md`, the MASTER spec, and `ops/runner/README.md` are updated in sync to pin the rule: during the 110 runner/CD pressure incident, the canonical is not the sole root of trust; the authority copy is the auto-repair entry point.
**Boundaries**: No runner token / secret / raw session / SQLite / auth / `.env` was read; Docker / Nginx / firewall / K3s / DB were not restarted; neither the legacy runner nor the controlled drain lane was opened.
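The repair direction described in this entry (authority copy as the source, canonical as the repair target) can be illustrated with a read-only drift check. This is a minimal sketch and not code from `enforce-110-runner-failclosed.sh`: the function name `canonical_matches_authority` and its status strings are hypothetical, and it assumes the two files are byte-identical after a clean `--apply`, which the commit suggests (both are installed from the same source) but does not state outright.

```shell
#!/usr/bin/env bash
# Hypothetical read-only helper: reports whether the canonical enforcer still
# matches the authority copy. The default paths come from the commit; the
# function name and the byte-comparison approach are illustrative assumptions.
AUTHORITY_ENFORCER="${AUTHORITY_ENFORCER:-/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh}"
CANONICAL_ENFORCER="${CANONICAL_ENFORCER:-/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh}"

canonical_matches_authority() {
  local authority="$1" canonical="$2"
  # Without the authority copy there is nothing trustworthy to compare against.
  if [ ! -f "$authority" ]; then
    echo "authority-missing"
    return 2
  fi
  if [ ! -f "$canonical" ]; then
    echo "canonical-missing"
    return 1
  fi
  # cmp -s is silent and exits 0 only when both files are byte-identical.
  if cmp -s "$authority" "$canonical"; then
    echo "in-sync"
    return 0
  fi
  echo "canonical-drifted"
  return 1
}
```

A caller could run `canonical_matches_authority "$AUTHORITY_ENFORCER" "$CANONICAL_ENFORCER"` and treat any non-zero status as a reason to run the authority copy with `--apply`. The check itself writes nothing.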
## 2026-06-28 — 15:20 IwoooS Wazuh live metadata owner packet no-persist validator
**Completed**:

View File

@@ -153,7 +153,7 @@ AWOOOI / AwoooP / IwoooS is not merely a monitoring page, alert forwarder, or security inventory
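3. When a PlayBook, rollback, verifier, source-of-truth, evidence ref, or owner field is missing, the AI Agent must automatically generate a controlled apply package containing the target selector, source diff, check-mode, rollback, post-check, and KM / PlayBook trust writeback.
4. The Guard's job is not to block all work but to steer actions toward allowlist / check-mode / controlled apply / staged rollout / verifier / rollback; a guard that can only answer "handle manually" is itself a P0/P1 fix candidate.
5. The incident-level hard blocks that still must not be opened directly are limited to: plaintext secret reads or exfiltration, irreversible data destruction, DB DROP / TRUNCATE / destructive restore, reboot / node drain / irreversible firewall cutover, credentialed exploit / external attack-style active scan, paid provider / cost ceiling / production provider route switch, OpenClaw core replacement without completed replay / shadow / canary, force push / repo refs / visibility destruction, and raw runtime secret volume reads or writes.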
6. The 110 runner capacity incident is a hard-protection exception: do not reopen the legacy runner, lift legacy fail-closed, restore the generic label, or set the host pressure gate to warn-only; until runner migration or non-110 hard rate limiting is complete, the AWOOOI controlled CD lane / drain lane must also be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` opener sources and the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts must be sealed as fail-closed stubs; the workflow push trigger remains manual-only.
6. The 110 runner capacity incident is a hard-protection exception: do not reopen the legacy runner, lift legacy fail-closed, restore the generic label, or set the host pressure gate to warn-only; until runner migration or non-110 hard rate limiting is complete, the AWOOOI controlled CD lane / drain lane must also be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer`, `awoooi-runner-failclosed-authority.timer`, and `/etc/cron.d/awoooi-runner-failclosed-authority`; the cron / systemd authority must execute `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh` and repair the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`; the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` opener sources and the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts must be sealed as fail-closed stubs; the workflow push trigger remains manual-only.
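7. The data freshness gate must be source-aware: if the Drive / provider source preflight proves there is no source newer than the last clean import, and the DB sync / import job is clean, stale business data is a source freshness warning; only an anomaly in auth / source / failed-folder / DB sync is a hard blocker.
8. The provider proxy gate must avoid opening cost / routes by mistake: a proxy such as LiteLLM that is not provisioned and that the repo already marks as optional retired may only be listed as a warning; it must not be auto-started, nor may the production provider route be switched, just to pass the health gate.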

View File

@@ -418,7 +418,9 @@ quarantine restore source or `systemd-run` to bring them back to active.
- `ops/runner/awoooi-runner-failclosed-authority.service`
- `ops/runner/awoooi-runner-failclosed-authority.timer`
live 110 must install the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`;
live 110 must install the authority copy `/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh`
and the canonical `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`; the cron / systemd authority always executes
the authority copy, so that auto-repair still works when an external opener overwrites the canonical.
`/usr/local/bin/awoooi-enforce-runner-failclosed-110.sh` serves only as a compatibility wrapper. You must enable
`awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer`;
`/etc/cron.d/awoooi-runner-failclosed-authority` must exist, as the third-layer convergence authority for when the systemd timers are masked by a short-lived external opener.
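One way to confirm that the cron layer really points at the authority copy rather than the canonical is a small read-only check. The sketch below is an illustration under stated assumptions, not part of the repository: the function name `cron_targets_authority` is hypothetical, and it assumes the cron entry invokes the authority path followed by ` --apply` on one non-comment line.

```shell
#!/usr/bin/env bash
# Hypothetical read-only check: does a cron.d file invoke the authority copy
# with --apply? The path is taken from the README text above; the function
# name and the matching rule are illustrative assumptions.
cron_targets_authority() {
  local cron_file="$1"
  local authority="/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh"
  # Status 2 means the cron file itself is missing.
  [ -f "$cron_file" ] || return 2
  # Skip comment lines, then look for the literal "<authority> --apply" string.
  # index() does a plain substring match, so the dots in the path are not
  # treated as regex wildcards.
  awk -v needle="$authority --apply" '
    !/^[[:space:]]*#/ && index($0, needle) { found = 1 }
    END { exit found ? 0 : 1 }
  ' "$cron_file"
}
```

On a live host this might be called as `cron_targets_authority /etc/cron.d/awoooi-runner-failclosed-authority`. It exits `0` when the authority copy is the cron target, `1` when it is not, and `2` when the file is absent.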

View File

@@ -1,10 +1,10 @@
[Unit]
Description=AWOOOI 110 runner/CD lane fail-closed authority
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
Wants=network-online.target
After=network-online.target docker.service
[Service]
Type=oneshot
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
TimeoutStartSec=180

View File

@@ -1,10 +1,10 @@
[Unit]
Description=AWOOOI 110 runner/CD lane fail-closed enforcer
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
Wants=network-online.target
After=network-online.target docker.service
[Service]
Type=oneshot
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
TimeoutStartSec=180

View File

@@ -8,4 +8,8 @@ if [ -x "$SCRIPT_DIR/enforce-110-runner-failclosed.sh" ]; then
exec "$SCRIPT_DIR/enforce-110-runner-failclosed.sh" "$@"
fi
if [ -x /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh ]; then
exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh "$@"
fi
exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.sh "$@"

View File

@@ -9,6 +9,7 @@ MODE="check"
STAMP="$(date +%Y%m%dT%H%M%S%z)"
APPLY_PERFORMED=0
CANONICAL_ENFORCER="/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh"
AUTHORITY_ENFORCER="/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh"
COMPAT_ENFORCER="/usr/local/bin/awoooi-enforce-runner-failclosed-110.sh"
usage() {
@@ -335,16 +336,25 @@ repair_enforcer_entrypoints() {
local tmp
current="$(readlink -f "$0" 2>/dev/null || printf '%s' "$0")"
as_root mkdir -p "$(dirname "$CANONICAL_ENFORCER")" >/dev/null 2>&1 || true
as_root mkdir -p "$(dirname "$AUTHORITY_ENFORCER")" >/dev/null 2>&1 || true
if [ -f "$current" ] && [ "$current" != "$CANONICAL_ENFORCER" ]; then
as_root chattr -i "$CANONICAL_ENFORCER" >/dev/null 2>&1 || true
as_root install -o root -g root -m 0755 "$current" "$CANONICAL_ENFORCER" >/dev/null 2>&1 || true
fi
as_root chattr +i "$CANONICAL_ENFORCER" >/dev/null 2>&1 || true
if [ -f "$current" ] && [ "$current" != "$AUTHORITY_ENFORCER" ]; then
as_root chattr -i "$AUTHORITY_ENFORCER" >/dev/null 2>&1 || true
as_root install -o root -g root -m 0755 "$current" "$AUTHORITY_ENFORCER" >/dev/null 2>&1 || true
fi
as_root chattr +i "$AUTHORITY_ENFORCER" >/dev/null 2>&1 || true
tmp="$(mktemp)"
cat >"$tmp" <<'EOF'
#!/usr/bin/env bash
set -eu
if [ -x /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh ]; then
exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh "$@"
fi
exec /usr/local/lib/awoooi/enforce-110-runner-failclosed.sh "$@"
EOF
as_root chattr -i "$COMPAT_ENFORCER" >/dev/null 2>&1 || true
@@ -365,13 +375,13 @@ repair_enforcer_systemd_units() {
cat >"$service_tmp" <<'EOF'
[Unit]
Description=AWOOOI 110 runner/CD lane fail-closed enforcer
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
Wants=network-online.target
After=network-online.target docker.service
[Service]
Type=oneshot
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
TimeoutStartSec=180
EOF
@@ -395,13 +405,13 @@ EOF
cat >"$authority_service_tmp" <<'EOF'
[Unit]
Description=AWOOOI 110 runner/CD lane fail-closed authority
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh
Documentation=file:/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh
Wants=network-online.target
After=network-online.target docker.service
[Service]
Type=oneshot
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply
ExecStart=/usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply
TimeoutStartSec=180
EOF
@@ -455,7 +465,7 @@ repair_enforcer_cron_authority() {
cat >"$tmp" <<'EOF'
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
* * * * * root /usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --apply >>/var/log/awoooi-runner-failclosed-authority-cron.log 2>&1
* * * * * root /usr/local/lib/awoooi/enforce-110-runner-failclosed.authority.sh --apply >>/var/log/awoooi-runner-failclosed-authority-cron.log 2>&1
EOF
as_root install -o root -g root -m 0644 "$tmp" /etc/cron.d/awoooi-runner-failclosed-authority >/dev/null 2>&1 || true
rm -f "$tmp"