docs(recovery): record p0 reboot blocker readback [skip ci]

Your Name
2026-06-30 22:19:59 +08:00
parent 230ee54faa
commit 9540a479ba
2 changed files with 28 additions and 6 deletions

@@ -50704,10 +50704,32 @@ production browser smoke:
- `PYTHONPATH=apps/api python3.11 -m py_compile ...`: passed.
- `python3.11 -m json.tool docs/operations/ai-agent-log-intelligence-runtime-sample-readback.snapshot.json`: passed.
## 2026-06-30 — 22:08 P0 mainline live scorecard / 110 Harbor control channel blocker readback
**Readbacks completed in priority order**
- Used the clean worktree `/Users/ogt/codex-workspaces/awoooi-p0-006-postgres-readback-20260630`, fast-forwarded to Gitea `main` / `230ee54fa test(agent): align log loop writeback counts`; the old `/Users/ogt/awoooi` checkout is still behind and dirty, and was not touched.
- Public Gitea queue: CD `#4095` Running, CD `#4094` Canceled, CD `#4093` Failure, CD `#4091` Failure, Harbor repair `#4092` Scheduled / Waiting. The queue readback status is still `blocked_harbor_110_repair_no_matching_runner`: no `awoooi-host` runner is online. The jobs API returns stale/mismatched `ai-code-review` / `ubuntu-latest` entries for `#4092`, which does not mean the repair job has run. The CD `#4095` self-heal skip reason is still `not_110_host`.
- Live probes: `https://registry.wooo.work/v2/` 502, `http://192.168.0.110:5000/v2/` 502, `https://harbor.wooo.work/api/v2.0/health` 502, `https://signoz.wooo.work/` 502.
- StockPlatform public freshness / ingestion still returns `status=not_configured` with blocker `postgres_not_ready`. Production `/api/v1/agents/reboot-auto-recovery-slo-scorecard` still returns stale data from 2026-06-29 and must not be used as recovery evidence for this round.
- `post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260630-220250/summary.txt`: `POST_START_PASS=33 WARN=6 BLOCKED=8`, `SERVICE_GREEN=0`, `PRODUCT_DATA_GREEN=0`, `BACKUP_CORE_GREEN=0`, `HOST_188_SERVICE_GREEN=0`, `OVERALL_DECLARATION=SERVICE_BLOCKED`.
- `full-stack-cold-start-check.sh --monitor-read-only --no-color`: `PASS=66 WARN=5 BLOCKED=5`. Blocked items include 110 registry `/v2`, 110 SSH read-only check timeout, K3s AWOOOI image pull blocked / registry pull refused, and SigNoz TLS/public route 502.
- All-host probe `/tmp/awoooi-host-probe-live-20260630-2205.txt`: 99 is reachable but its uptime is unknown; 111 is unreachable; 188 has a failed/degraded startup unit; uptime on 110/112/120/121/188 is already past the 10-minute window.
- Reboot detector no-write readback `/tmp/awoooi-reboot-event-live-20260630-2205.json`: `reboot_detected=true`, but only 99 is fresh; `all_required_hosts_in_reboot_window=false`, `unreachable_hosts=["111"]`, `state_written=false`.
- SLO scorecard `/tmp/awoooi-reboot-slo-live-20260630-2205-scorecard.json`: `status=blocked_reboot_auto_recovery_slo_not_ready`, `can_claim_all_services_recovered_within_target=false`. Active blockers include: not all hosts are inside the 10-minute window, 111 unreachable, 99 uptime unknown, service / product data / backup / 188 service not green, Wazuh dashboard degraded, and Stock `postgres_not_ready`.
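
The 10-minute window check behind `all_required_hosts_in_reboot_window` can be illustrated with a minimal sketch. This is not the repository's detector: the function name, the input shape (host to uptime in seconds, `None` for an unreadable uptime), the output keys other than `all_required_hosts_in_reboot_window` and `unreachable_hosts`, and the `host-a`..`host-d` labels are all assumptions made for illustration. It conservatively treats an unknown uptime as "not in the window".

```python
from typing import Optional

# The 10-minute window named in the readback above.
REBOOT_WINDOW_SECONDS = 10 * 60


def evaluate_reboot_window(
    uptimes: dict[str, Optional[float]],
    unreachable: set[str],
    window: float = REBOOT_WINDOW_SECONDS,
) -> dict:
    """Classify required hosts against the reboot window (hypothetical helper).

    uptimes maps host -> uptime in seconds, or None when the host answers
    but its uptime could not be read. unreachable lists hosts with no answer.
    """
    reachable = {h: up for h, up in uptimes.items() if h not in unreachable}
    in_window = sorted(h for h, up in reachable.items()
                       if up is not None and up <= window)
    unknown = sorted(h for h, up in reachable.items() if up is None)
    past_window = sorted(h for h, up in reachable.items()
                         if up is not None and up > window)
    required = set(uptimes) | unreachable
    return {
        "in_window_hosts": in_window,
        "unknown_uptime_hosts": unknown,
        "past_window_hosts": past_window,
        "unreachable_hosts": sorted(unreachable),
        # True only when every required host has a known uptime in the window.
        "all_required_hosts_in_reboot_window": set(in_window) == required,
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the live probe.
    print(evaluate_reboot_window(
        {"host-a": 120.0, "host-b": None, "host-c": 5400.0},
        {"host-d"},
    ))
```

Keeping unknown-uptime and unreachable hosts in separate lists mirrors how the readback reports them as distinct blockers rather than folding them into one failure.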
**Source-side convergence**
- `106acf683` projected the Harbor / 110 control channel blocker into `harbor_registry_controlled_recovery_preflight` and the priority work-order summary; `2d677f8af` added the same blocker to the AI Loop runtime sample. The next step is explicit: restore 110 SSH or the `awoooi-host` runner control channel, then run the Harbor watchdog check / repair once / public `/v2` verifier.
- Updated the 2026-06-30 P0 table in the reboot workplan to the 22:05 live evidence, so later work does not fall back to the stale readings from #4061 / #4088 / the 2026-06-29 production API.
**Local verification results**
- `DATABASE_URL=sqlite+aiosqlite:////tmp/awoooi-codex-api-test.db PYTHONPATH=apps/api python3.11 -m pytest apps/api/tests/test_harbor_registry_controlled_recovery_preflight.py apps/api/tests/test_awoooi_priority_work_order_readback_api.py apps/api/tests/test_reboot_auto_recovery_slo_scorecard_api.py ops/runner/test_read_public_gitea_actions_queue.py ops/runner/test_check_awoooi_110_controlled_cd_lane_readiness.py ops/runner/test_cd_controlled_runtime_profile.py scripts/reboot-recovery/tests/test_reboot_auto_recovery_slo_scorecard.py scripts/reboot-recovery/tests/test_reboot_event_detector.py -q`: `82 passed`.
- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, `auto_branch_events_on_110=0`, `generic_runner_labels=0`.
- `node scripts/ci/check-gitea-step-env-secrets.js .gitea/workflows/cd.yaml .gitea/workflows/harbor-110-local-repair.yaml`: passed.
**Still in effect**
- Did not read secrets / tokens / `.env` / raw sessions / SQLite / auth; did not read `.runner` contents.
- Did not use GitHub / gh / GitHub API / GitHub Actions.
- Did not reboot any host; no Docker / Nginx / K3s / DB restart; no workflow_dispatch; no runtime write.
- Did not reboot any host; no Docker / Nginx / K3s / DB restart; no workflow_dispatch; no DB write / restore / prune.
**Next steps**
- After commit/push, read back the Gitea queue / closure / registry. If runtime is still stuck on `awoooi-host` no-matching, the next mainline item is the 110 local recovery package apply/readback, with the readback result written as the next AI Loop sample / post-apply verifier receipt.
- Keep reading back `#4095` / `#4092`. If runtime is still stuck on `awoooi-host` no-matching, P0 does not branch off to side work; the next controlled apply target remains the 110 control channel. From the 110 local console / root shell, or after restoring the `awoooi-host` controlled runner, run `recover-110-control-path-and-harbor-local.sh --check` / `--apply-all`, then re-read public/internal `/v2`, the Gitea queue, full-stack cold-start, Stock freshness, and backup-status, and write the post-apply verifier receipt back as the next AI Loop sample.
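
The "re-read public/internal `/v2`" step can be sketched as a small read-only verifier. A registry `/v2/` endpoint answers 200 when the request is authorized and 401 when authentication is required, so either code shows the registry API is being served; a 502 means the upstream behind the proxy is not. The function names, category strings, and dictionary keys below are hypothetical and are not the repository's verifier.

```python
import urllib.error
import urllib.request
from typing import Optional

# Status codes that show the registry API itself is answering.
HEALTHY_V2_STATUSES = {200, 401}


def probe_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Issue one read-only GET and return the HTTP status, or None if no answer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; the code is still evidence.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def classify_v2_probe(status_code: Optional[int]) -> str:
    """Map a status code to a coarse category (category names are assumptions)."""
    if status_code is None:
        return "unreachable"
    if status_code in HEALTHY_V2_STATUSES:
        return "registry_reachable"
    if status_code in (502, 503, 504):
        return "upstream_unavailable"
    return "unexpected"


def v2_recovered(probes: dict[str, Optional[int]]) -> bool:
    """True only when every probed route is answered by the registry."""
    return bool(probes) and all(
        classify_v2_probe(code) == "registry_reachable" for code in probes.values()
    )
```

For example, `v2_recovered({"public": probe_status("https://registry.wooo.work/v2/"), "internal": probe_status("http://192.168.0.110:5000/v2/")})` would stay `False` while either route keeps returning 502.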

@@ -15,13 +15,13 @@
| Priority | Status | Work item | 2026-06-30 evidence | Next step / completion criteria |
|------|------|--------|------------------|-------------------|
| P0-1 | BLOCKED | All-host cold-start / 10-minute auto-recovery SLO | 20:46 `post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260630-live-main/summary.txt`: `POST_START_PASS=33 WARN=6 BLOCKED=8`, `SERVICE_GREEN=0`, `PRODUCT_DATA_GREEN=0`, `BACKUP_CORE_GREEN=0`, `HOST_188_SERVICE_GREEN=0`, `OVERALL_DECLARATION=SERVICE_BLOCKED`; `full-stack-cold-start-check.sh --monitor-read-only --no-color`: `PASS=68 WARN=4 BLOCKED=4`; SLO scorecard `/tmp/awoooi-reboot-slo-live-20260630-2045/scorecard.json`: `can_claim_all_services_recovered_within_target=false`, active blockers `12`, readiness `18%`. 111 unreachable, 99 uptime unknown, 188 startup failed/degraded, 110/112/120/121/188 already past the 10-minute window. | Fix the first runtime blocker first: 110 control path / Harbor registry `/v2`. Re-run the same summary / cold-start / SLO scorecard until `SERVICE_GREEN=1`, `POST_START_BLOCKED=0`, `PASS` with no BLOCKED, all required hosts observed/reachable, and `awoooi_reboot_auto_recovery_slo_ready=1`; a route returning 200 alone is not enough to claim recovery. |
| P0-1 | BLOCKED | All-host cold-start / 10-minute auto-recovery SLO | 22:05 `post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260630-220250/summary.txt`: `POST_START_PASS=33 WARN=6 BLOCKED=8`, `SERVICE_GREEN=0`, `PRODUCT_DATA_GREEN=0`, `BACKUP_CORE_GREEN=0`, `HOST_188_SERVICE_GREEN=0`, `OVERALL_DECLARATION=SERVICE_BLOCKED`; `full-stack-cold-start-check.sh --monitor-read-only --no-color`: `PASS=66 WARN=5 BLOCKED=5`; SLO scorecard `/tmp/awoooi-reboot-slo-live-20260630-2205-scorecard.json`: `can_claim_all_services_recovered_within_target=false`, active blockers `11`; reboot detector read-only evaluation `/tmp/awoooi-reboot-event-live-20260630-2205.json`: `reboot_detected=true`, but only 99 is fresh, 111 is unreachable, and not all required hosts are inside the 10-minute window; 99 uptime unknown, 188 startup failed/degraded, 110/112/120/121/188 already past the 10-minute window. | Fix the first runtime blocker first: 110 control path / Harbor registry `/v2`. Re-run the same summary / cold-start / SLO scorecard until `SERVICE_GREEN=1`, `POST_START_BLOCKED=0`, `PASS` with no BLOCKED, all required hosts observed/reachable, and `awoooi_reboot_auto_recovery_slo_ready=1`; a route returning 200 alone is not enough to claim recovery. |
| P0-2 | DONE_THIS_INCIDENT | User-visible 502: Tsenyang | `www.tsenyang.com` / `tsenyang.com` recovered from 502 to 200; 188 `tsenyang-website` container running; local `127.0.0.1:3000` returns 200. | For the next 502 of this kind, check the release symlink / image / container first; do not start with Nginx, DNS, DB, or a host reboot. |
| P0-3 | BLOCKED | StockPlatform data freshness | Public `/healthz`, `/api/healthz` return 200; freshness / ingestion returns `not_configured`, `postgres_not_ready`. | After restoring the 110 control path, inspect `/home/wooo/stockplatform-v2` compose / DB schema / migration status read-only; fake freshness, manual DB rows, and restore/prune are prohibited. |
| P0-4 | BLOCKED | AWOOOI production version currency | Latest Gitea SSH `main` is `49a9f7309`, but public CD `#4061` failed: classifier=`harbor_registry_public_route_unavailable`, status code `502`, controlled repair attempted=`true`, skip reason=`not_110_host`. Production `/api/v1/agents/reboot-auto-recovery-slo-scorecard` still returns stale 2026-06-29 data and cannot be treated as current truth. | Bring deploy marker / runtime SHA / endpoint readback into agreement. Until Harbor `/v2` recovers, CD cannot ship the latest source to production; until they agree, AWOOOI must not be declared up to date. |
| P0-5 | BLOCKED | 110 control path / Harbor registry `/v2` | 20:46 live probe: `https://registry.wooo.work/v2/` 502, `http://192.168.0.110:5000/v2/` 502, `https://harbor.wooo.work/api/v2.0/health` 502. CD `#4061` self-heal skip reason=`not_110_host`; 110 SSH read-only / backup-status / CPU / controlled runner readback still times out or cannot be confirmed. | Get the 110-local repair workflow, or a 110 console/local script, to actually run `recover-110-control-path-and-harbor-local.sh --check` / `--apply-all`, and read back public/internal `/v2` as `200/401`. Stock DB, Gitea dump, and 110 backup completeness can only be verified after the SSH read-only command path is restored. |
| P0-4 | BLOCKED | AWOOOI production version currency | Latest Gitea SSH `main` is `230ee54fa`. Public CD `#4095` Running, `#4094` Canceled, `#4093` Failure, `#4091` Failure. The previous main CD with a clear Harbor failure shape, `#4087`: failure classifier=`harbor_registry_public_route_unavailable`, status code `502`, controlled repair attempted=`true`, skip reason=`not_110_host`. Production `/api/v1/agents/reboot-auto-recovery-slo-scorecard` still returns stale 2026-06-29 data, which is inconsistent with the 22:05 live Stock `postgres_not_ready` / Harbor 502 readings and cannot be treated as current truth. | Bring deploy marker / runtime SHA / endpoint readback into agreement. Until Harbor `/v2` recovers, CD cannot ship the latest source to production; until they agree, AWOOOI must not be declared up to date. |
| P0-5 | BLOCKED | 110 control path / Harbor registry `/v2` | 22:02 live probe: `https://registry.wooo.work/v2/` 502, `http://192.168.0.110:5000/v2/` 502, `https://harbor.wooo.work/api/v2.0/health` 502, `https://signoz.wooo.work/` 502. Public Gitea queue readback returns `status=blocked_harbor_110_repair_no_matching_runner`; Harbor repair `#4092` Scheduled / Waiting; `workflow_no_matching_runner_labels={"harbor-110-local-repair.yaml":"awoooi-host"}`. The jobs API still shows stale/mismatched `ai-code-review` / `ubuntu-latest`, and the `harbor-110-local-repair` job has not actually run. CD `#4095` still shows self-heal skip reason `not_110_host`; the 110 SSH read-only command path still times out. | Get the 110-local repair workflow, or a 110 console/local script, to actually run `recover-110-control-path-and-harbor-local.sh --check` / `--apply-all`, and read back public/internal `/v2` as `200/401`. Stock DB, Gitea dump, and 110 backup completeness can only be verified after the SSH read-only command path is restored. |
| P0-6 | BLOCKED_BACKUP_COMPLETENESS | Gitea repo visibility and full backup | Gitea version API 200; public repo search lists only 4 public repos; `stockplatform-v2` public page/API returns 404 but internal `git ls-remote` succeeds; 188 `/home/ollama/backup/110/gitea` was initially empty. A verified emergency bundle was created at `/home/ollama/backup/110/gitea/git-bundles/20260630-190931`: bundle verify + checksum succeeded for 4 public/internal repos; `AwoooGo`, `stockplatform-v2`, `vibework` failed closed on private auth. In the 20:18 summary, because 110 `backup-status` could not be read back, `BACKUP_CORE_GREEN=0`, `DR_ESCROW_BLOCKED=1`, `DR_ESCROW_EVIDENCE_UNKNOWN=1`. | The 188 `gitea_repo_mirror_from_110` subtree metric / alert has been added. The next step is still to restore the 110 SSH command path and then run a formal `gitea dump`, non-interactive private repo backup, repo count, backup-status, and restore drill readback. Unknown must not be treated as backup / DR green. |
| P0-7 | SOURCE_READY_RUNTIME_BLOCKED | 99 VMware / VM autostart | The repo already has `windows99-vmware-autostart.ps1`. The 20:46 host probe found 99 ping-reachable but with `boot_id=reachable_unknown_boot` / uptime unknown; 111 unreachable; 112/120/121/188 readable; 188 startup unit failed. An earlier read-only readback showed 99 RDP 3389 / SSH 22 reachable, WinRM 5985 failing, and `administrator@192.168.0.99` SSH publickey denied. | Restore a controllable channel to 99 or apply the script from the console. Afterwards, read back boot evidence for 111/188/120/121/112; all required hosts must be observed/reachable and 99 must no longer report unknown uptime. |
| P0-7 | SOURCE_READY_RUNTIME_BLOCKED | 99 VMware / VM autostart | The repo already has `windows99-vmware-autostart.ps1`. The 22:05 host probe found 99 ping-reachable but with `boot_id=reachable_unknown_boot` / uptime unknown; 111 unreachable; 112/120/121/188 readable; 188 startup unit failed/degraded. An earlier read-only readback showed 99 RDP 3389 / SSH 22 reachable, WinRM 5985 failing, and `administrator@192.168.0.99` SSH publickey denied. | Restore a controllable channel to 99 or apply the script from the console. Afterwards, read back boot evidence for 111/188/120/121/112; all required hosts must be observed/reachable and 99 must no longer report unknown uptime. |
| P0-8 | SOURCE_READY_RUNTIME_BLOCKED | 502 maintenance fallback / Telegram / backup alert | The L0/L1 fallback runbook, Nginx snippet, and reboot / backup alert rules are already in source; runtime still needs deployment and an external L1 provider readback. | L0: verify `X-AWOOOI-Fallback` with a test vhost. L1: needs an external cloud/CDN probe. Telegram: verify with a redacted alert receipt. |
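
Part of the P0-1 completion criteria (`SERVICE_GREEN=1`, `POST_START_BLOCKED=0`) could be checked mechanically against a summary artifact. The sketch below assumes the summary contains whitespace-separated `KEY=VALUE` tokens that include those two keys; the actual layout of `summary.txt` is not shown in this log, and `parse_summary` / `unmet_conditions` are hypothetical names. It covers only these two conditions, not the all-host or `awoooi_reboot_auto_recovery_slo_ready` checks.

```python
import re

# Matches upper-case KEY=VALUE tokens such as SERVICE_GREEN=0 (assumed format).
_TOKEN = re.compile(r"([A-Z][A-Z0-9_]*)=(\S+)")

# Two of the completion conditions named in the P0-1 row.
REQUIRED_VALUES = {"SERVICE_GREEN": "1", "POST_START_BLOCKED": "0"}


def parse_summary(text: str) -> dict[str, str]:
    """Collect KEY=VALUE tokens from summary text; later duplicates win."""
    return {key: value for key, value in _TOKEN.findall(text)}


def unmet_conditions(fields: dict[str, str]) -> list[str]:
    """Return human-readable reasons the gate fails; an empty list means pass."""
    return [
        f"{key} expected {want}, got {fields.get(key, 'missing')}"
        for key, want in REQUIRED_VALUES.items()
        if fields.get(key) != want
    ]


if __name__ == "__main__":
    # Hypothetical sample text, shaped after the counters quoted in the table.
    sample = "SERVICE_GREEN=0\nPOST_START_BLOCKED=8\nOVERALL_DECLARATION=SERVICE_BLOCKED\n"
    for reason in unmet_conditions(parse_summary(sample)):
        print(reason)
```

Treating a missing key as a failure keeps the gate consistent with the table's rule that unknown evidence must not count as green.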
Key lessons from this incident: