fix(recovery): seal runner failclosed disablers [skip ci]

Your Name
2026-06-28 15:50:36 +08:00
parent 27a7190c2c
commit ba054e698d
17 changed files with 1001 additions and 712 deletions


@@ -291,7 +291,7 @@ force push / delete repo / delete refs / change repo visibility / raw runtime secret volu
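After the 2026-06-28 incident, Gitea / act-runner / direct transient runner, StockPlatform headless smoke, host-side Next build, and Docker / BuildKit pressure on 110 all fall within the capacity-incident protection surface. Even after receiving "approve / continue / full authorization", do not directly reopen the legacy runner, remove the legacy service mask, restore the legacy runner binary, launch the `.real` binary directly with `systemd-run`, restore the generic `ubuntu-latest` label, or make warn-only the default for the host pressure gate.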
-The permitted controlled apply is pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow labels, add a source fail-closed guard, limit concurrency, move smoke to a schedule / non-110 runner, and run the read-only pressure / cold-start verifier. The dedicated `awoooi-cd-lane.service` and `awoooi-cd-lane-drain.service` may be opened under control when all of the following hold: `capacity=1`, no `ubuntu-latest` / StockPlatform / headless / Playwright label, systemd CPU / memory / tasks limits, root restore-source left `0`, a rollback-capable unit, a post-apply verifier, and legacy runner fail-closed. The verifier must evaluate it separately from the legacy runner.
+The permitted controlled apply is pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow labels, add a source fail-closed guard, limit concurrency, move smoke to a schedule / non-110 runner, and run the read-only pressure / cold-start verifier. Until runner migration or non-110 hard limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer`. If an external opener temporarily restores a unit, it may only come back as a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, and the next authority / enforcer round must converge it back to masked / inactive. The verifier must no longer accept a single `controlled_open` lane.
Restoring a runner requires all of the following at once:
@@ -301,7 +301,7 @@ force push / delete repo / delete refs / change repo visibility / raw runtime secret volu
4. rollback: can return to inactive / masked / fail-closed stub.
5. post-apply verifier: readback of runner tasks, host load, Actions queue, Stock smoke, AWOOI public route, and the cold-start scorecard.
-Until the above conditions are met, startup / recovery scripts must keep legacy fail-closed. Keeping `START_CONTROLLED_CD_LANE` / the drain lane requires all of: a capacity / label / binary / process / systemd limit verifier, root restore-source left `0`, a rollback unit, and post-apply readback. Generic or unthrottled runners must not be allowed to come back to life through the lane.
+Until the above conditions are met, startup / recovery scripts must keep fail-closed. They must not keep `START_CONTROLLED_CD_LANE`, the drain lane opener, the root restore-source opener, the old enforcer sources `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*`, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, or any push-trigger workflow that lets generic / unthrottled runners come back to life through the lane. Restoring the lane requires a separate source-of-truth diff that first removes the enforcer block and provides a migration / limiting verifier.
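The filename patterns named in this rule could be checked with a small scan like the one below. This is a minimal sketch, not the enforcer's actual logic: the function name, the tuple names, and the choice to match old enforcer sources by full path and disabler artifacts by basename are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns quoted from the rule text. How the real enforcer matches them is
# not shown in this diff; the matching strategy below is an assumption.
DISABLER_NAME_PATTERNS = (
    "awoooi-runner-failclosed-opened-*",
    "awoooi-runner-failclosed-*-opened-*",
    "awoooi-runner-failclosed-quarantine-*",
    "failclosed-final-mask-*",
)
OLD_ENFORCER_PATH_PATTERNS = (
    "/tmp/enforce-110-runner-failclosed.sh",
    "/tmp/awoooi-enforce-runner-failclosed-110.sh*",
)


def find_forbidden_artifacts(paths):
    """Return the paths that match an old-enforcer or disabler pattern.

    Only names are inspected; file contents are never opened.
    """
    hits = []
    for path in paths:
        name = PurePosixPath(path).name
        if any(fnmatchcase(path, pat) for pat in OLD_ENFORCER_PATH_PATTERNS):
            hits.append(path)
        elif any(fnmatchcase(name, pat) for pat in DISABLER_NAME_PATTERNS):
            hits.append(path)
    return hits
```

Matching on names only mirrors the stated boundary that artifacts are moved or stubbed without reading their contents.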
### Source freshness / provider proxy gate


@@ -29,6 +29,28 @@
**Boundaries**: did not start the legacy runner / controlled drain lane / generic runner; did not switch the host pressure gate to warn-only; did not read runner tokens / secrets / raw sessions / SQLite; did not force push.
+## 2026-06-28 — 14:55 110 runner / cd-lane fail-closed enforcer timer landed
+**Background**: After the 11:17 root restore-source fail-closed, the 14:00 live precheck again caught `awoooi-cd-lane-drain.service active/enabled`, `ACTIVE_JOB_CONTAINERS=1`, `LANE_PROCESS_COUNT=1`, and `ROOT_RESTORE_SOURCES_LEFT=1`, which shows that an external opener was still bringing the drain lane back.
+**Completed**:
+- Added `scripts/reboot-recovery/enforce-110-runner-failclosed.sh`. It inspects only service / process / container / path / binary kind and does not read runner config / token, raw sessions, SQLite, auth, or `.env`.
+- Added `ops/runner/awoooi-runner-failclosed-enforcer.service` / `.timer` and `ops/runner/awoooi-runner-failclosed-authority.service` / `.timer`. The live canonical script is installed as `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`; `/usr/local/bin/awoooi-enforce-runner-failclosed-110.sh` is only a compatibility wrapper. The enforcer timer uses `OnUnitInactiveSec=120s` and the authority timer uses `OnUnitInactiveSec=20s`.
+- `scripts/reboot-recovery/awoooi-startup-110.sh` drops the cd-lane / drain controlled-open branch; regular / drain / direct / Gitea runners are all covered by fail-closed.
+- `p3-controlled-release-gate.sh`, `full-stack-cold-start-check.sh`, and `post-start-quick-check.sh` now require enforcer / authority timer active / enabled / success, job container `0`, lane process `0`, sentinel `0`, and root restore-source left `0`, and no longer accept a single `controlled_open` lane. If an external opener only restores a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, the verifier may treat it as a sealed fallback.
+- The enforcer archives / overwrites `/tmp/enforce-110-runner-failclosed.sh`, `/tmp/awoooi-enforce-runner-failclosed-110.sh*`, old cd-lane unit templates, startup runner-open drop-ins, systemd unit backups, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, root live artifacts, and lane registration filenames. It does not read their contents; it only moves them or turns them into fail-closed stubs.
+- 15:37-15:43: fixed an enforcer self-repair gap. Before installing the enforcer / authority units, the `/dev/null` mask symlink is now removed explicitly, so that `install` does not write into `/dev/null` and leave a masked timer behind. Within the same apply round, disablers are sealed first and the authority timer is rebuilt afterwards, and `/tmp/enforce-110-runner-failclosed.sh` and `failclosed-final-mask-*` are archived.
+- `.gitea/workflows/cd.yaml` and `code-review.yaml` stay `workflow_dispatch` only; the push trigger will be reopened separately after runner migration or non-110 hard limiting.
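The self-repair fix in the list above (removing the `/dev/null` mask symlink before writing a unit file) can be illustrated as follows. This is a sketch under assumptions: `install_unit` and `demo` are hypothetical names, and the real enforcer is a shell script whose exact steps are not shown in this diff. A masked systemd unit is a symlink to `/dev/null`, so opening that path for writing follows the link and the content is discarded.

```python
import os
import tempfile


def install_unit(unit_path, content):
    """Write a unit file, first removing a /dev/null mask symlink if present.

    Without the unlink, open() would follow the symlink, the content would go
    to /dev/null, and the unit would stay masked.
    """
    if os.path.islink(unit_path) and os.readlink(unit_path) == "/dev/null":
        os.unlink(unit_path)
    with open(unit_path, "w", encoding="utf-8") as handle:
        handle.write(content)


def demo():
    """Exercise install_unit in a throwaway directory (POSIX only)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.timer")
        os.symlink("/dev/null", path)  # simulate a masked unit
        install_unit(path, "[Timer]\nOnUnitInactiveSec=20s\n")
        with open(path, encoding="utf-8") as handle:
            return os.path.islink(path), handle.read()
```

Calling `demo()` returns whether the path is still a symlink together with the content read back, which makes the before / after difference easy to check.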
+**Live verification results**:
+- Readback at 15:43 after a 120-second delay: live canonical enforcer SHA `22b306546c22336c96ed1864ace8f8574ccb49415f0e13885bb963c7e74e9eca`; the enforcer timer and authority timer are both `active/enabled`; both services report `Result=success`; `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, and `gitea-awoooi-controlled-runner.service` are `masked/inactive/masked`.
+- `ACTIVE_JOB_CONTAINERS=0`, `LANE_PROCESS_COUNT=0`, `RUNNER_PROCESS_COUNT=0`, `ROOT_RESTORE_SOURCES_LEFT=0`, `SENTINELS_LEFT=0`.
+- `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --check`: `RUNNER_UNITS_BAD_COUNT=0`; the old `/tmp/awoooi-enforce-runner-failclosed-110.sh` and `.codex` sources were turned into fail-closed stubs.
+- P3 release gate: `PASS=38 WARN=3 BLOCKED=0`, `RUNNER_FAILCLOSED_AUTHORITY active/enabled/success`, `BAD_RUNNER_GUARDRAILS 0`, `CD_LANE_GUARDRAILS_OK 1`.
+- full-stack cold-start read-only scorecard: `PASS=95 WARN=1 BLOCKED=0`, Result `DEGRADED`; the only warning is stale source freshness for 188 MOMO daily sales, and source preflight found no hard blocker.
+**Boundaries**: did not restart Docker / Nginx / firewall / K3s / DB; did not force push; did not read plaintext secrets or runner tokens; did not read raw sessions / SQLite / auth / `.env`.
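The zero-count readback reported in this entry could be evaluated mechanically, roughly as sketched below. The counter names come from the entry itself; the whitespace-separated `KEY=VALUE` input format, the function names, and the rule that a missing counter counts as a failure are assumptions for illustration, not the verifier scripts' real behavior.

```python
# Counters that this log entry reports as 0 after the enforcer landed.
REQUIRED_ZERO = (
    "ACTIVE_JOB_CONTAINERS",
    "LANE_PROCESS_COUNT",
    "RUNNER_PROCESS_COUNT",
    "ROOT_RESTORE_SOURCES_LEFT",
    "SENTINELS_LEFT",
    "RUNNER_UNITS_BAD_COUNT",
)


def parse_readback(text):
    """Parse whitespace-separated KEY=VALUE tokens with integer values."""
    values = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and value.isdigit():
            values[key] = int(value)
    return values


def failing_counters(text):
    """Return required counters that are non-zero or absent (fail closed)."""
    values = parse_readback(text)
    return [key for key in REQUIRED_ZERO if values.get(key) != 0]
```

Treating an absent counter as a failure keeps the check fail-closed: a verifier that silently skips a missing value would pass exactly when the readback is broken.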
## 2026-06-28 — 14:20 IwoooS Wazuh manager registry acceptance criteria consolidated
**Completed**:


@@ -153,7 +153,7 @@ AWOOOI / AwoooP / IwoooS is not merely a monitoring page, alert forwarder, or security inventory
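3. When a PlayBook, rollback, verifier, source-of-truth, evidence ref, or owner field is missing, the AI Agent must automatically generate a controlled apply package that includes a target selector, source diff, check-mode, rollback, post-check, and KM / PlayBook trust writeback.
4. A guard's job is not to block all work but to route actions toward allowlist / check-mode / controlled apply / staged rollout / verifier / rollback. A guard that can only answer "manual handling" is itself a P0/P1 fix candidate.
5. The incident-level hard blocks that still must not be opened directly are limited to: plaintext secret reads or exfiltration, irreversible data destruction, DB DROP / TRUNCATE / destructive restore, reboot / node drain / irreversible firewall cutover, credentialed exploit / external attack-style active scan, paid provider / cost ceiling / production provider route switching, OpenClaw core replacement before replay / shadow / canary is complete, force push / repo refs / visibility damage, and raw runtime secret volume reads or writes.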
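-6. The 110 runner capacity incident is a hard-protection exception: do not reopen the legacy runner, lift legacy fail-closed, restore generic labels, or make the host pressure gate warn-only. The dedicated AWOOOI controlled CD lane / drain lane may be controlled-opened when `capacity=1`, narrow labels, no generic heavy labels, systemd CPU / memory / tasks limits, root restore-source left `0`, a rollback unit, and a post-apply verifier all hold. Workflows must not stay manual-only long term because of a non-incident-level guard.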
+6. The 110 runner capacity incident is a hard-protection exception: do not reopen the legacy runner, lift legacy fail-closed, restore generic labels, or make the host pressure gate warn-only. Until runner migration or non-110 hard limiting is complete, the AWOOOI controlled CD lane / drain lane must also be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer`. The old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` opener sources and the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts must be sealed into fail-closed stubs, and the workflow push trigger stays manual-only.
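7. The data freshness gate must be source-aware: if the Drive / provider source preflight shows there is no source newer than the last clean import, and the DB sync / import job is clean, stale business data is a source freshness warning. Only anomalies in auth / source / failed-folder / DB sync are hard blockers.
8. The provider proxy gate must avoid accidentally opening cost / routes: proxies such as LiteLLM that are not provisioned and are marked optional retired in the repo may only be listed as warnings, and must not be auto-started or switched onto the production provider route just to pass the health gate.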
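The source-aware freshness rule in item 7 amounts to a small decision function. The sketch below is one possible reading, not the project's implementation: the `FreshnessInputs` fields, the `classify_freshness` name, and the result labels are hypothetical. The rule does not say what happens when data is stale and a newer source does exist, so that case is returned as `undecided` rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreshnessInputs:
    """Hypothetical gate inputs; field names are illustrative only."""
    data_is_stale: bool
    newer_source_available: bool   # preflight found a source newer than last clean import
    sync_and_import_clean: bool    # DB sync / import job finished cleanly
    auth_ok: bool = True
    source_ok: bool = True
    failed_folder_empty: bool = True


def classify_freshness(inputs):
    """Classify a freshness check as pass / warning / hard_blocker / undecided."""
    anomalies = not (inputs.auth_ok and inputs.source_ok
                     and inputs.failed_folder_empty
                     and inputs.sync_and_import_clean)
    if anomalies:
        return "hard_blocker"      # auth / source / failed-folder / DB sync anomaly
    if not inputs.data_is_stale:
        return "pass"
    if not inputs.newer_source_available:
        return "warning"           # stale only because upstream has nothing newer
    return "undecided"             # not covered by the rule text; left open here
```

This matches the cold-start scorecard case recorded in the same commit, where a stale source with no hard blocker in preflight is reported as a single warning.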
@@ -5343,8 +5343,8 @@ Trigger commit `f5cd37b7` 與 deploy marker `0ba92357` 已把 governance UI 的
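**Ruling:** From now on, any project that still parks low / medium / high events at read-only, owner gate, manual handoff, `runtime_write_gate=0`, or `needs_human=true`, without hitting one of the incident-level hard blocks above, must switch immediately to AI controlled apply / verifier / rollback / learning writeback.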
-### 2026-06-28 10:05 (Taipei) — 110 legacy runner and controlled cd-lane split
+### 2026-06-28 10:05 (Taipei) — 110 legacy runner and controlled cd-lane split (historical ruling, superseded)
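**Trigger:** The 110 pressure-incident fail-closed guard treated the dedicated `awoooi-cd-lane.service` and the legacy / direct runners as the same blocker, so the official CD lane kept being shut down even after full authorization from the commander.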
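-**Ruling:** Legacy `act-runner`, the direct transient runner, generic `ubuntu-latest`, and heavy StockPlatform / headless / Playwright-type jobs remain within the capacity-incident protection surface. The dedicated `awoooi-cd-lane.service`, however, may enter `controlled_open` when an independent sentinel, `capacity=1`, narrow labels, a rollback-capable unit, a post-apply verifier, and legacy runner fail-closed all hold at once. All startup, cold-start, post-start, and P3 release verifiers must evaluate `legacy runner fail-closed` and `CD_LANE_CONTROLLED ok=1` separately, and must no longer use "the cd-lane binary is an ELF" as a single hard block.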
+**Ruling update:** A later live incident showed that the controlled-open / drain lane opener gets exploited by external openers to restore cd-lane repeatedly. This includes the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh.codex`, which restore the old enforcer, and `awoooi-runner-failclosed-opened-*` / `awoooi-runner-failclosed-*-opened-*` / `awoooi-runner-failclosed-quarantine-*` / `failclosed-final-mask-*`, which disable the enforcer or leave replayable units behind. The effective rule is the fail-closed enforcer + authority: `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must stay masked / inactive / no process / no job container / root restore-source left `0`; old opener sources must be sealed into fail-closed stubs; and the `startup`, cold-start, post-start, and P3 release verifiers must require `awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer` to be active / enabled / success.
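The unit-state rule in this ruling update can be expressed as a pair of checks. This is a hedged sketch only: the function names and result labels are hypothetical, and reading a state triple such as `masked/inactive/masked` as load state / active state / unit-file state is an assumption, since the diff does not define the fields.

```python
STUB_CONDITION = "/run/awoooi-runner-migrated-or-hard-limited"
REQUIRED_TIMERS = (
    "awoooi-runner-failclosed-enforcer.timer",
    "awoooi-runner-failclosed-authority.timer",
)


def classify_runner_unit(load_state, active_state, condition_path_exists=None):
    """Classify one runner unit as sealed / sealed_fallback / open.

    A unit that is not masked is tolerated only as an inactive stub that
    carries the fail-closed ConditionPathExists guard.
    """
    if active_state != "inactive":
        return "open"
    if load_state == "masked":
        return "sealed"
    if condition_path_exists == STUB_CONDITION:
        return "sealed_fallback"
    return "open"


def timers_ok(timer_states):
    """timer_states maps timer name -> (active_state, unit_file_state, result)."""
    return all(
        timer_states.get(name) == ("active", "enabled", "success")
        for name in REQUIRED_TIMERS
    )
```

Under this reading a release gate would pass only when every runner unit is `sealed` or `sealed_fallback` and `timers_ok` is true, which is stricter than the superseded single `controlled_open` lane check.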