fix(recovery): seal runner failclosed disablers [skip ci]
@@ -291,7 +291,7 @@ force push / delete repo / delete refs / change repo visibility / raw runtime secret volu
Since the 2026-06-28 incident, Gitea / act-runner / direct transient runner, StockPlatform headless smoke, host-side Next build, and Docker / BuildKit pressure on 110 all fall within the capacity-incident protection surface. Even after receiving "approve / continue / full authorization", it is not permitted to directly reopen the legacy runner, remove the legacy service mask, restore the legacy runner binary, launch the `.real` binary directly with `systemd-run`, restore the generic `ubuntu-latest` label, or make warn-only the default for the host pressure gate.

-Permitted controlled apply is limited to pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow labels, add source fail-closed guards, limit concurrency, move smoke to a schedule / non-110 runner, and run read-only pressure / cold-start verifiers. The dedicated `awoooi-cd-lane.service` or `awoooi-cd-lane-drain.service` may be opened under control when all of the following hold: `capacity=1`; no `ubuntu-latest` / StockPlatform / headless / Playwright label; systemd CPU / memory / tasks limits; root restore-source left `0`; a rollbackable unit; a post-apply verifier; and legacy runner fail-closed. The verifier must evaluate it separately from the legacy runner.
+Permitted controlled apply is limited to pressure reduction and recurrence prevention: stop / disable / mask the legacy runner, mask the direct transient unit, quarantine the legacy runner binary, narrow labels, add source fail-closed guards, limit concurrency, move smoke to a schedule / non-110 runner, and run read-only pressure / cold-start verifiers. Until runner migration or non-110 hard limiting is complete, `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer`. If an external opener temporarily restores a unit, it may only be restored as a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, and the next authority / enforcer round must converge it back to masked / inactive. Verifiers must no longer accept a single `controlled_open` lane.

Restoring a runner requires all of the following:
@@ -301,7 +301,7 @@ force push / delete repo / delete refs / change repo visibility / raw runtime secret volu
4. rollback: able to return to inactive / masked / fail-closed stub.
5. post-apply verifier: readback of runner tasks, host load, Actions queue, Stock smoke, AWOOI public route, and the cold-start scorecard.

-Until the above conditions are met, startup / recovery scripts must keep legacy fail-closed; if `START_CONTROLLED_CD_LANE` or the drain lane is retained, it must come with capacity / label / binary / process / systemd limit verifiers, root restore-source left `0`, a rollback unit, and post-apply readback, and must not let a generic or unthrottled runner come back to life through the lane.
+Until the above conditions are met, startup / recovery scripts must stay fail-closed. They must not retain `START_CONTROLLED_CD_LANE`, a drain lane opener, a root restore-source opener, the old enforcer sources `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*`, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, or a push-trigger workflow that would let a generic / unthrottled runner come back to life through the lane. Restoring a lane requires a separate source-of-truth diff that first removes the enforcer block and supplies a migration / limiting verifier.
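
The forbidden leftovers named in this rule can be detected by name alone, without opening any file. The following is a minimal Python sketch under stated assumptions: the glob patterns are copied from the rule text, but the function name `find_disabler_artifacts`, the split into full-path and basename patterns, and the sample paths are hypothetical and are not taken from the enforcer script in this commit.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the rule text. How the real enforcer matches them
# is not shown in this diff; the split below is an assumption.
FULL_PATH_PATTERNS = (
    "/tmp/enforce-110-runner-failclosed.sh",
    "/tmp/awoooi-enforce-runner-failclosed-110.sh*",
)
BASENAME_PATTERNS = (
    "awoooi-runner-failclosed-opened-*",
    "awoooi-runner-failclosed-*-opened-*",
    "awoooi-runner-failclosed-quarantine-*",
    "failclosed-final-mask-*",
)


def find_disabler_artifacts(paths):
    """Return the paths whose name alone marks them as opener / disabler leftovers."""
    hits = []
    for raw in paths:
        base = PurePosixPath(raw).name
        if any(fnmatchcase(raw, p) for p in FULL_PATH_PATTERNS) or any(
            fnmatchcase(base, p) for p in BASENAME_PATTERNS
        ):
            hits.append(raw)
    return hits
```

Matching on names only keeps the check consistent with the stated boundary of not reading file contents. In this sketch the live timer unit names do not match any pattern, so only leftovers are reported.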

### Source freshness / provider proxy gate
@@ -29,6 +29,28 @@
**Boundary**: no legacy runner / controlled drain lane / generic runner was started; the host pressure gate was not changed to warn-only; no runner token / secret / raw session / SQLite was read; no force push.

+## 2026-06-28 - 14:55 110 runner / cd-lane fail-closed enforcer timer landed
+
+**Background**: after the 11:17 root restore-source fail-closed, the 14:00 live precheck again caught `awoooi-cd-lane-drain.service active/enabled`, `ACTIVE_JOB_CONTAINERS=1`, `LANE_PROCESS_COUNT=1`, and `ROOT_RESTORE_SOURCES_LEFT=1`, which shows that an external opener was still pulling the drain lane back.
+
+**Completed**:
+- Added `scripts/reboot-recovery/enforce-110-runner-failclosed.sh`, which inspects only service / process / container / path / binary kind and does not read runner config / token, raw sessions, SQLite, auth, or `.env`.
+- Added `ops/runner/awoooi-runner-failclosed-enforcer.service` / `.timer` and `ops/runner/awoooi-runner-failclosed-authority.service` / `.timer`; the live canonical install is `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh`, and `/usr/local/bin/awoooi-enforce-runner-failclosed-110.sh` is only a compatibility wrapper. The enforcer timer uses `OnUnitInactiveSec=120s`; the authority timer uses `OnUnitInactiveSec=20s`.
+- `scripts/reboot-recovery/awoooi-startup-110.sh` drops the cd-lane / drain controlled-open branch; regular / drain / direct / Gitea runners are all brought under fail-closed.
+- `p3-controlled-release-gate.sh`, `full-stack-cold-start-check.sh`, and `post-start-quick-check.sh` now require enforcer / authority timers active / enabled / success, job container `0`, lane process `0`, sentinel `0`, and root restore-source left `0`, and no longer accept a single `controlled_open` lane; if an external opener restores a unit only as a fail-closed stub carrying `ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited`, the verifier may treat it as a sealed fallback.
+- The enforcer archives / overwrites `/tmp/enforce-110-runner-failclosed.sh`, `/tmp/awoooi-enforce-runner-failclosed-110.sh*`, old cd-lane unit templates, startup runner-open drop-ins, systemd unit backups, the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts, root live artifacts, and lane registration file names; it does not read contents, and only moves files or turns them into fail-closed stubs.
+- 15:37-15:43: fixed a gap in enforcer self-repair. Before installing the enforcer / authority units, it now explicitly removes the `/dev/null` mask symlink, so that `install` does not write into `/dev/null` and leave a masked timer behind; the same apply round seals disablers first and then rebuilds the authority timer, and archives `/tmp/enforce-110-runner-failclosed.sh` and `failclosed-final-mask-*`.
+- `.gitea/workflows/cd.yaml` and `code-review.yaml` remain `workflow_dispatch` only; push triggers will be reopened separately after runner migration or non-110 hard limiting.
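
The acceptance rule that the three verifier scripts now share can be read as a single predicate over timers, counters, and units. Below is a minimal Python sketch of that reading. The real verifiers are shell scripts whose contents are not shown in this diff, so the types `UnitState` and `TimerState`, the counter key names, and the function names are all hypothetical.

```python
from dataclasses import dataclass


# Hypothetical snapshot types; the real verifiers are shell scripts
# whose contents are not shown in this diff.
@dataclass(frozen=True)
class UnitState:
    load_state: str           # e.g. "masked" or "loaded"
    active_state: str         # e.g. "inactive" or "active"
    has_seal_condition: bool  # carries ConditionPathExists=/run/awoooi-runner-migrated-or-hard-limited


@dataclass(frozen=True)
class TimerState:
    active: bool
    enabled: bool
    last_result: str          # e.g. "success"


# Counters that must read back as 0 (key names are assumptions).
REQUIRED_ZERO = ("job_containers", "lane_processes", "sentinels", "root_restore_sources_left")


def unit_is_sealed(unit: UnitState) -> bool:
    """Masked and inactive, or inactive as a fail-closed stub (sealed fallback)."""
    if unit.active_state != "inactive":
        return False
    return unit.load_state == "masked" or unit.has_seal_condition


def runners_failclosed(units, timers, counters) -> bool:
    """True only if every timer, counter, and unit satisfies the fail-closed rule."""
    # An empty timer list is rejected so that a missing timer cannot pass.
    timers_ok = bool(timers) and all(
        t.active and t.enabled and t.last_result == "success" for t in timers
    )
    # A missing counter is treated as non-zero so the check itself fails closed.
    counters_ok = all(counters.get(key, 1) == 0 for key in REQUIRED_ZERO)
    return timers_ok and counters_ok and all(unit_is_sealed(u) for u in units)
```

The sketch deliberately has no `controlled_open` branch: an active unit fails the check whatever its labels or capacity are, which mirrors the removal of the single-lane exception.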
+
+**Live verification results**:
+- 15:43 readback after a 120-second delay: live canonical enforcer SHA `22b306546c22336c96ed1864ace8f8574ccb49415f0e13885bb963c7e74e9eca`; enforcer timer and authority timer both `active/enabled`; both services `Result=success`; `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, and `gitea-awoooi-controlled-runner.service` all `masked/inactive/masked`.
+- `ACTIVE_JOB_CONTAINERS=0`, `LANE_PROCESS_COUNT=0`, `RUNNER_PROCESS_COUNT=0`, `ROOT_RESTORE_SOURCES_LEFT=0`, `SENTINELS_LEFT=0`.
+- `/usr/local/lib/awoooi/enforce-110-runner-failclosed.sh --check` returns `RUNNER_UNITS_BAD_COUNT=0`; the old `/tmp/awoooi-enforce-runner-failclosed-110.sh` and `.codex` sources were turned into fail-closed stubs.
+- P3 release gate: `PASS=38 WARN=3 BLOCKED=0`, `RUNNER_FAILCLOSED_AUTHORITY active/enabled/success`, `BAD_RUNNER_GUARDRAILS 0`, `CD_LANE_GUARDRAILS_OK 1`.
+- Full-stack cold-start read-only scorecard: `PASS=95 WARN=1 BLOCKED=0`, Result `DEGRADED`; the only warning is stale source freshness for 188 MOMO daily sales, and source preflight shows no hard blocker.
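
The counters in these results are plain `KEY=VALUE` tokens, so a downstream readback check could be sketched as follows. This is an illustration only: the counter names come from the log lines above, while `parse_readback`, `failing_counters`, and the choice to treat a missing counter as a failure are assumptions, not behavior of the scripts in this commit.

```python
import re

# Matches tokens such as ACTIVE_JOB_CONTAINERS=0 or PASS=95.
_TOKEN = re.compile(r"\b([A-Z][A-Z0-9_]*)=(\d+)\b")

# Counter names as they appear in the readback lines.
MUST_BE_ZERO = (
    "ACTIVE_JOB_CONTAINERS",
    "LANE_PROCESS_COUNT",
    "RUNNER_PROCESS_COUNT",
    "ROOT_RESTORE_SOURCES_LEFT",
    "SENTINELS_LEFT",
)


def parse_readback(text: str) -> dict:
    """Collect every upper-case KEY=<integer> token found in the text."""
    return {key: int(value) for key, value in _TOKEN.findall(text)}


def failing_counters(readback: dict) -> list:
    """List counters that are non-zero or absent (absence fails closed by assumption)."""
    return [key for key in MUST_BE_ZERO if readback.get(key) != 0]
```

Fed the 14:00 precheck values, `failing_counters` would list the three non-zero counters plus the two that the precheck line did not report; fed the 15:43 values, it returns an empty list.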
+
+**Boundary**: Docker / Nginx / firewall / K3s / DB were not restarted; no force push; no plaintext secret or runner token was read; no raw sessions / SQLite / auth / `.env` were read.
+

## 2026-06-28 - 14:20 IwoooS Wazuh manager registry acceptance criteria consolidated

**Completed**:
@@ -153,7 +153,7 @@ AWOOOI / AwoooP / IwoooS are not merely monitoring pages, alert forwarders, or security inventories
3. When a PlayBook, rollback, verifier, source-of-truth, evidence ref, or owner field is missing, the AI Agent must automatically generate a controlled apply package containing target selector, source diff, check-mode, rollback, post-check, and KM / PlayBook trust writeback.
4. A guard's job is not to block all work but to steer actions toward allowlist / check-mode / controlled apply / staged rollout / verifier / rollback; a guard that can only answer "handle manually" is itself a P0/P1 repair candidate.
5. The incident-level hard blocks that truly still cannot be opened directly are limited to: reading or exfiltrating secrets in plaintext, irreversible data destruction, DB DROP / TRUNCATE / destructive restore, reboot / node drain / irreversible firewall cutover, credentialed exploit / external offensive active scan, paid provider / cost ceiling / production provider route switching, OpenClaw core replacement without completed replay / shadow / canary, force push / repo refs / visibility destruction, and reads or writes of raw runtime secret volumes.
-6. The 110 runner capacity incident is a hard-protection exception: the legacy runner must not be reopened, legacy fail-closed must not be lifted, generic labels must not be restored, and the host pressure gate must not be made warn-only. The dedicated AWOOOI controlled CD lane / drain lane may be controlled-open when `capacity=1`, narrow labels, no generic heavy labels, systemd CPU / memory / tasks limits, root restore-source left `0`, a rollback unit, and a post-apply verifier are all in place; workflows must not stay manual-only long-term because of non-incident-level guards.
+6. The 110 runner capacity incident is a hard-protection exception: the legacy runner must not be reopened, legacy fail-closed must not be lifted, generic labels must not be restored, and the host pressure gate must not be made warn-only. Until runner migration or non-110 hard limiting is complete, the AWOOOI controlled CD lane / drain lane must also be kept masked / inactive / no process / no job container / root restore-source left `0` by `awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer`; the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh*` opener sources and the `awoooi-runner-failclosed-opened-*`, `awoooi-runner-failclosed-*-opened-*`, `awoooi-runner-failclosed-quarantine-*`, and `failclosed-final-mask-*` disabler artifacts must be sealed into fail-closed stubs; and the workflow push trigger stays manual-only.
7. The data freshness gate must be source-aware: if Drive / provider source preflight proves there is no source newer than the last clean import, and the DB sync / import job is clean, stale business data is a source freshness warning; only anomalies in auth / source / failed-folder / DB sync are hard blockers.
8. The provider proxy gate must avoid accidentally opening cost / routes: proxies such as LiteLLM that are not provisioned and that the repo already marks optional retired may only be listed as warnings; they must not be auto-started, nor may the production provider route be switched, just to pass a health gate.
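
Items 7 and 8 describe two small classification rules, which the following Python sketch tries to make concrete. It is one possible reading, not the gate's actual code: the `Verdict` enum, both function names, and all parameter names are hypothetical, and the branches marked as assumptions cover cases the rule text does not address.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    WARNING = "warning"
    HARD_BLOCKER = "hard_blocker"


def freshness_verdict(*, data_stale, newer_source_exists, auth_ok, source_ok,
                      failed_folder_empty, db_sync_clean) -> Verdict:
    """Source-aware freshness rule (item 7)."""
    # Anomalies in auth / source / failed-folder / DB sync are hard blockers.
    if not (auth_ok and source_ok and failed_folder_empty and db_sync_clean):
        return Verdict.HARD_BLOCKER
    if not data_stale:
        return Verdict.OK
    if newer_source_exists:
        # Assumption: the rule text does not cover this case; treated conservatively.
        return Verdict.HARD_BLOCKER
    # Stale data with no newer source and a clean import is only a warning.
    return Verdict.WARNING


def proxy_verdict(*, healthy, provisioned, marked_optional_retired) -> Verdict:
    """Provider proxy rule (item 8); never starts a proxy or switches a route."""
    if healthy:
        return Verdict.OK
    if not provisioned and marked_optional_retired:
        return Verdict.WARNING
    # Assumption: any other unhealthy proxy is escalated rather than auto-fixed.
    return Verdict.HARD_BLOCKER
```

Both functions are pure: they only classify. That matches the requirement that a health gate must not start a proxy or switch the production provider route as a side effect of being evaluated.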
@@ -5343,8 +5343,8 @@ Trigger commit `f5cd37b7` and deploy marker `0ba92357` have already taken the governance UI's
**Ruling:** From now on, any project that still parks low / medium / high events at read-only, owner gate, manual handoff, `runtime_write_gate=0`, or `needs_human=true` without hitting one of the incident-level hard blocks above must immediately switch to AI controlled apply / verifier / rollback / learning writeback.

-### 2026-06-28 10:05 (Taipei) - 110 legacy runner and controlled cd-lane split
+### 2026-06-28 10:05 (Taipei) - 110 legacy runner and controlled cd-lane split (historical ruling, no longer in effect)

**Trigger**: the 110 pressure-incident fail-closed guard treated the dedicated `awoooi-cd-lane.service` and the legacy / direct runners as the same blocker, so the official CD lane kept being shut down even after the commander's full authorization.

-**Ruling:** legacy `act-runner`, direct transient runner, generic `ubuntu-latest`, and StockPlatform / headless / Playwright-type heavy jobs remain within the capacity-incident protection surface; the dedicated `awoooi-cd-lane.service` may enter `controlled_open` when an independent sentinel, `capacity=1`, narrow labels, a rollbackable unit, a post-apply verifier, and legacy runner fail-closed all hold together. All startup, cold-start, post-start, and P3 release verifiers must evaluate `legacy runner fail-closed` and `CD_LANE_CONTROLLED ok=1` separately, and must no longer use "the cd-lane binary is ELF" as a single hard block.
+**Ruling update:** Later live incidents showed that the controlled-open / drain lane opener gets exploited by external openers to repeatedly restore the cd-lane: the old `/tmp/enforce-110-runner-failclosed.sh` and `/tmp/awoooi-enforce-runner-failclosed-110.sh.codex` restore the old enforcer, and `awoooi-runner-failclosed-opened-*` / `awoooi-runner-failclosed-*-opened-*` / `awoooi-runner-failclosed-quarantine-*` / `failclosed-final-mask-*` disable the enforcer or leave replayable units behind. The rule in force is fail-closed enforcer + authority: `awoooi-cd-lane.service`, `awoooi-cd-lane-drain.service`, the direct runner, and the Gitea runner must stay masked / inactive / no process / no job container / root restore-source left `0`; old opener sources must be sealed into fail-closed stubs; and `startup`, cold-start, post-start, and P3 release verifiers must require `awoooi-runner-failclosed-enforcer.timer` and `awoooi-runner-failclosed-authority.timer` to be active / enabled / success.