docs(ops): record orphan Chrome cleanup in reboot SOP [skip ci]
@@ -1,3 +1,42 @@
## 2026-06-25|110 orphan Chrome targeted cleanup and reboot SOP v1.59 evidence sync

**Background**: The 20:23 post-start quick check still showed elevated load on 110. Read-only `ps` / `vmstat` triage confirmed that two `stockplatform-review-bulk-ux` headless Chrome process groups had been running for about 5 hours, with the root Chrome process at `PPID=1` and no active parent test process; each GPU process was using about 96% CPU and each renderer about 22% CPU. This matches the SOP's "orphan browser smoke" triage condition and is not a Docker, Nginx, K3s, MOMO, Harbor, Sentry, or Wazuh service incident.

**Minimal live write**:
- Time: 2026-06-25 20:24 Asia/Taipei.
- Host: 110.
- Command type: write-with-approval, process `SIGTERM` only.
- Target PGIDs: `2756503`, `2829627`.
- Post-check: at `AFTER_TERM`, neither PGID had remaining processes; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash.
- Not performed: Docker restart, systemd restart, Nginx reload, firewall / iptables, K8s / ArgoCD action, CI cancellation, manual ingestion, manual DB write, Wazuh / SOC runtime change, secret read.
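The selection rule behind this cleanup (root process at `PPID=1`, long elapsed time, sustained CPU, known smoke marker) can be sketched in Python. This is a minimal illustration, not the SOP's actual tooling: the `Proc` record, both thresholds, and the function names are assumptions, and the input is assumed to be text shaped like the output of `ps -eo pid=,ppid=,pgid=,etimes=,pcpu=,args=`. The `terminate` helper defaults to a dry run, because the SOP requires owner approval before any real `SIGTERM`.

```python
import os
import signal
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    pid: int
    ppid: int
    pgid: int
    etimes: int  # elapsed seconds since process start
    pcpu: float  # CPU percentage as reported by ps
    args: str


def parse_ps(text: str) -> list[Proc]:
    """Parse rows of 'pid ppid pgid etimes pcpu args' into Proc records."""
    procs = []
    for line in text.strip().splitlines():
        pid, ppid, pgid, etimes, pcpu, args = line.split(None, 5)
        procs.append(Proc(int(pid), int(ppid), int(pgid), int(etimes), float(pcpu), args))
    return procs


def orphan_groups(procs, marker, min_elapsed=3600, min_cpu=50.0):
    """Return PGIDs that look like orphaned smoke groups.

    Assumed rule: the group contains a marker process re-parented to PID 1
    that has run at least min_elapsed seconds, and the whole group burns at
    least min_cpu percent CPU. Both thresholds are illustrative.
    """
    groups = {}
    for p in procs:
        groups.setdefault(p.pgid, []).append(p)
    hits = []
    for pgid, members in groups.items():
        rooted_at_init = any(
            p.ppid == 1 and marker in p.args and p.etimes >= min_elapsed
            for p in members
        )
        if rooted_at_init and sum(p.pcpu for p in members) >= min_cpu:
            hits.append(pgid)
    return sorted(hits)


def terminate(pgids, dry_run=True):
    """Send SIGTERM to each process group; only print when dry_run is set."""
    for pgid in pgids:
        if dry_run:
            print(f"DRY_RUN would send SIGTERM to PGID {pgid}")
        else:
            os.killpg(pgid, signal.SIGTERM)
```

Groups whose Chrome processes still have a live Playwright / CI parent do not match the `ppid == 1` condition, so they are left for observation rather than termination.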

**20:25 full post-start readback**:
- Host / SSH: ping and SSH port are OK on 110 / 120 / 121 / 188.
- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
- K3s: `mon` / `mon1` Ready; AWOOOI API/Web/Worker Running; VIP present; active failed Jobs `0`.
- MOMO: `V10.690`; momo app / scheduler / bot healthy; job `57` completed; `DB_DAILY_FRESHNESS 1|2026-06-24`; current-month parity `15383|15383`.
- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`.
- Public routes: AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps all returned the expected `2xx/3xx`.
- StockPlatform: route / health OK; live source and Gitea `main` are both at `fb91aa4c6272469d1d26e0820169629eac17d28a`; the cron entrypoints are repaired, but `/api/v1/system/freshness` is still `blocked` with blockers `core_margin_short_daily_missing,ai_recommendations_stale`. Direct DB readback: `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`.
- Wrapper: `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`; the only hard blocker is StockPlatform data freshness, and DR still cannot be declared complete because `escrow_missing=5`.
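How a scorecard line such as the one above could map to a result label is sketched below. The wrapper's real decision logic is not shown in this commit; the precedence here (any `BLOCKED` wins, then service warnings, then boundary / evidence warnings) is an assumption inferred from the two readings recorded in the runbooks: `PASS=18 WARN=3 BLOCKED=0` with `SERVICE=0` was reported as `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `PASS=37 WARN=2 BLOCKED=1` as `BLOCKED`. The function names are hypothetical.

```python
import re


def parse_counts(line: str) -> dict[str, int]:
    """Extract UPPERCASE=number pairs, e.g. 'PASS=37 WARN=2 BLOCKED=1'."""
    return {key: int(value) for key, value in re.findall(r"\b([A-Z]+)=(\d+)", line)}


def wrapper_result(scorecard: str, warning_split: str) -> str:
    """Derive a result label from the scorecard and its warning split.

    Assumed precedence, inferred from recorded readings only:
    hard blockers first, then service warnings, then other warnings.
    """
    counts = parse_counts(scorecard)
    split = parse_counts(warning_split)
    if counts["BLOCKED"] > 0:
        return "BLOCKED"
    if split.get("SERVICE", 0) > 0:
        return "DEGRADED"
    if counts["WARN"] > 0:
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return "GREEN"
```

Keeping the warning split separate from the totals is what lets a run with warnings but no service impact avoid being reported as `DEGRADED`.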

**SOP / documentation updates**:
- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` updated to v1.59.
- `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` updated to v1.4.
- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` gains the 20:24 / 20:25 evidence.
- `docs/runbooks/BACKUP-STATUS.md` gains the CPU cleanup note and the Stock data / backup boundary.

**Latest verdict**:
- Hosts, K3s, AWOOOI runtime, public routes, MOMO service/data, backup/offsite: green on this round's evidence.
- 110 CPU: the orphan Chrome groups have been cleared by targeted cleanup; the remaining CPU load comes from resident platform services and short-lived CI and is not a reboot service blocker.
- StockPlatform: source / cron entrypoint repaired; data freshness is still waiting on the official 2026-06-25 margin-short source and the subsequent AI pipeline.
- DR: still blocked, `escrow_missing=5`.
- Wazuh: route 200 owner-gated; manager registry accepted is still `0`, which is a security evidence blocker, not a reboot service blocker.
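The StockPlatform freshness verdict can be reproduced from the direct DB readback recorded in this entry (`name|date|rows`). The sketch below is illustrative only: `lagging_sources` is a hypothetical helper, and comparing each source date against the latest trading date is an assumed simplification of whatever `/api/v1/system/freshness` actually evaluates.

```python
from datetime import date


def lagging_sources(readback, latest_trading_date):
    """Return source names whose newest row is older than the latest trading date.

    Each readback entry is a 'name|YYYY-MM-DD|row_count' string.
    """
    lagging = []
    for row in readback:
        name, day, _row_count = row.split("|")
        if date.fromisoformat(day) < latest_trading_date:
            lagging.append(name)
    return lagging


readback = [
    "price|2026-06-25|1976",
    "chips|2026-06-25|1976",
    "margin|2026-06-24|1976",
    "ai_recommendations|2026-06-24|120",
]
print(lagging_sources(readback, date(2026, 6, 25)))
```

With the recorded values, only `margin` and `ai_recommendations` trail the 2026-06-25 trading date, which lines up with the two reported blockers.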
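
**Next steps**:
- After 21:00, confirm read-only whether StockPlatform `intelligence-sync` ran naturally through the restored Docker-backed `psql` shim; do not manually backfill production data.
- Follow the official margin-short source at 21:00 / 22:00 / 22:35 / 23:10; rerun the post-start quick check once the production cron is green.

## 2026-06-25|AI Agents global control dashboard production deployment verification

**Background**: The user requires that OpenClaw, Hermes, Nemotron, and later AI Agents not stop at discussion or a single-page demo. Hosts, products, websites, packages, services, tools, AI Agents, and AI solutions must all be brought under the primary AI Agents control view, and working-window conversation content must not be shown in the frontend.
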
@@ -22,6 +22,7 @@
> 2026-06-25 19:17 Codex latest recovery readback: post-start quick check is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; 110 backup `13/13 fresh failed=0`, 188 backup `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; MOMO data freshness is recovered through `2026-06-24`; DR still blocked only by `escrow_missing=5`.
> 2026-06-25 19:35 Codex product-data gate refresh: backup/offsite remains green, but overall "all products/data latest" is blocked by StockPlatform `/api/v1/system/freshness` (`core_margin_short_daily_missing`, `ai_recommendations_stale`). This is not a backup failure; keep `escrow_missing=5` as the DR blocker and Stock freshness as a separate product-data blocker.
> 2026-06-25 20:11 Codex StockPlatform cron-source recovery: StockPlatform Gitea/live source is now `fb91aa4c6272469d1d26e0820169629eac17d28a`; six missing production cron entrypoints are restored; natural cron runs for source remediation, market index, price, margin, chips, and AI no longer fail from missing files. Backup/offsite remains green. Stock freshness still blocks because official 2026-06-25 margin-short data is pending and AI recommendations correctly stay on 2026-06-24; this is still not a backup or restore incident.
> 2026-06-25 20:25 Codex 110 CPU cleanup: two orphan StockPlatform headless Chrome process groups were cleared by targeted approved `SIGTERM`; no Docker/systemd/Nginx/K8s/DB/backup write occurred. Backup/offsite remains green, DR still blocked by `escrow_missing=5`, and Stock freshness remains the only hard product-data blocker.
---
@@ -45,7 +46,7 @@ Read-only evidence sources: `/backup/scripts/backup-status.sh --no-notify --no-r
| Offsite / GDrive freshness | VERIFIED | `offsite_fresh=1`, `rclone_gdrive_fresh=1`. |
| Backup core blockers | GREEN | `core_blockers=0`. |
| MOMO data freshness | VERIFIED | `DB_DAILY_FRESHNESS 1|2026-06-24`; latest job 57 completed cleanly. |
| Full-stack service state | GREEN_WITH_DR_ESCROW_BLOCKED | `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`; service warnings 0. |
| Full-stack service state | STOCK_DATA_BLOCKED_WITH_CORE_SERVICES_GREEN | `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`; cold-start `PASS=89 WARN=0 BLOCKED=0`; blocker is StockPlatform freshness, not backup. |
| Credential escrow | BLOCKED | `escrow_missing=5`; only real non-secret owner evidence may close this. |
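The separation these rows draw between "backup is green" and "DR is complete" can be expressed as a small gate over the `key=value` counters printed by the status readback. This is a sketch under assumptions: `parse_status` and `backup_verdict` are hypothetical names, and the exact set of counters that `/backup/scripts/backup-status.sh` treats as blocking is not shown in this diff.

```python
def parse_status(text: str) -> dict[str, int]:
    """Collect integer key=value tokens from a whitespace/comma separated line."""
    status = {}
    for token in text.replace(",", " ").split():
        key, sep, value = token.partition("=")
        if sep and value.isdigit():
            status[key] = int(value)
    return status


def backup_verdict(status: dict[str, int]) -> dict[str, bool]:
    """Assumed gate: backup health and DR completeness are judged separately."""
    backup_green = (
        status["core_blockers"] == 0
        and status["offsite_fresh"] == 1
        and status["rclone_gdrive_fresh"] == 1
    )
    # DR additionally needs every credential escrow item to be evidenced.
    dr_complete = backup_green and status["escrow_missing"] == 0
    return {"backup_green": backup_green, "dr_complete": dr_complete}


line = "core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5"
print(backup_verdict(parse_status(line)))
```

With the counters recorded at 19:17, the gate reports a green backup and an incomplete DR, which is the same split the table records.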
## 2026-06-25 20:11 StockPlatform Cron Source / Backup Boundary
@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP

> Version: v1.58
> Version: v1.59
> Last updated: 2026-06-25 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -12,6 +12,8 @@
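
If you only need a quick post-reboot decision on whether recovery can be declared, first run the one-page check, `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist owns the fixed verdict within T+10 minutes of each reboot.
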
2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness is still blocked and DR remains incomplete.
2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
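The "live contract check" described above, confirming that every `scripts/ops/*.sh` referenced by the installer exists, can be approximated as follows. This is a hedged sketch, not the repository's actual check: `missing_entrypoints` is a hypothetical helper, and it assumes the installer mentions each entrypoint as a literal `scripts/ops/<name>.sh` path.

```python
import re
from pathlib import Path


def missing_entrypoints(installer_text: str, repo_root: Path) -> list[str]:
    """Return referenced scripts/ops/*.sh paths that are absent under repo_root."""
    referenced = sorted(set(re.findall(r"scripts/ops/[\w.-]+\.sh", installer_text)))
    return [rel for rel in referenced if not (repo_root / rel).is_file()]


if __name__ == "__main__":
    # Assumed layout: run from the repository root, installer under scripts/ops/.
    root = Path(".")
    installer = root / "scripts/ops/install-production-cron.sh"
    if installer.is_file():
        for rel in missing_entrypoints(installer.read_text(), root):
            print(f"MISSING {rel}")
```

An empty result corresponds to the green contract state; any `MISSING` line is the condition that would surface at cron time as `script_exit_127`.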
2026-06-25 19:35 product-version / data-freshness refresh supersedes the 19:06 data-complete wording. Host boot, K3s, AWOOOI runtime, MOMO service/data, backup/offsite, Bitan cleanliness, and expanded public routes are available, but the stricter post-start wrapper now checks StockPlatform `/api/v1/system/freshness` and correctly returns `RESULT=BLOCKED` when product data is not current. The 19:35 lightweight wrapper run used `--skip-cold-start --skip-backup --skip-cpu` after the 19:24 full host/cold-start/backup readback and returned `PASS=31 WARN=1 BLOCKED=1`, with the single blocker `StockPlatform freshness is blocked: core_margin_short_daily_missing,ai_recommendations_stale`. `stock.wooo.work`, `/healthz`, and `/api/healthz` all return `200`; public routes now covered by the wrapper include AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Do not declare "all products and data are latest" until StockPlatform freshness is `ok`; keep DR blocked until `escrow_missing=0`.
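The declaration rule in this paragraph, where routes may be green while the "all products and data are latest" claim stays blocked, is sketched below. The function and key names are hypothetical, and treating any `2xx/3xx` status as expected is a simplification of the wrapper's per-route expectations.

```python
def allowed_claims(route_statuses: dict[str, int], freshness_status: str) -> dict[str, bool]:
    """Decide which recovery claims the evidence supports.

    Assumption: a route is acceptable when its HTTP status is 2xx or 3xx,
    and product data is current only when the freshness status is 'ok'.
    """
    routes_green = all(200 <= code < 400 for code in route_statuses.values())
    return {
        "routes_green": routes_green,
        "all_products_and_data_latest": routes_green and freshness_status == "ok",
    }


# Statuses taken from the readbacks in this SOP: Stock 200, Sentry 302, AWOOOI web 307.
print(allowed_claims({"stock": 200, "sentry": 302, "awoooi-web": 307}, "blocked"))
```

Keeping the two claims as separate keys mirrors the SOP's point that route availability alone never justifies a data-freshness claim.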
@@ -21,17 +23,17 @@
```text
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: 2026-06-25 19:05 wrapper delegated cold-start PASS=89 WARN=0 BLOCKED=0, Result=GREEN.
Post-start quick check: 2026-06-25 19:05 PASS=18 WARN=3 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=2; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED; exit code 0.
Post-start quick check: 2026-06-25 20:25 PASS=37 WARN=2 BLOCKED=1; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1; Result=BLOCKED because StockPlatform data freshness remains blocked. Cold-start layer remains GREEN.
Repo-side cold-start v1.42+ live read-only run: MOMO source absence / stale data blocker is cleared by import job 57 and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Live 110 script sync is not claimed until a separate approved deployment/sync happens.
110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
Service state: HOST_AND_CORE_SERVICE_GREEN_STOCK_CRON_SOURCE_REPAIRED_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared. StockPlatform production cron source drift is repaired and verified by natural cron runs; product-data completeness is still blocked by official margin-short source availability and AI recommendation freshness.
Service state: HOST_AND_CORE_SERVICE_GREEN_CPU_ORPHAN_CLEARED_STOCK_CRON_SOURCE_REPAIRED_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared, and 110 orphan StockPlatform Chrome smoke groups cleared by targeted approved SIGTERM. StockPlatform production cron source drift is repaired and verified by natural cron runs; product-data completeness is still blocked by official margin-short source availability and AI recommendation freshness.
Runtime release state: API/Web/Worker live image tag is `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, and 19:06 K3s readback shows API/Web/Worker pods Running; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 19:05 route smoke returned 200 for AWOOOI API, IwoooS, MOMO health, and Stock; cold-start route gate also returned expected statuses for AWOOOI web, MOMO, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, Bitan, and AIOps. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan then returned 200 for five consecutive external route reads from 19:05:38 to 19:06:24. 19:35 expanded route readback returned expected 2xx/3xx for AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
MOMO release state: mo.wooo.work health is healthy on version V10.690. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 18:23 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain recent container-replace / scheduler fail-closed / notification evidence notes, but no service blocker remains.
MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly match through 2026-06-24: `daily_sales_snapshot=109061|2025-07-01|2026-06-24`, `MOMO_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Latest import job is `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
StockPlatform data state: `/api/v1/system/freshness` returns `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`. Current OK sources include `core.price_daily` 2026-06-25 / 1976 rows, `core.chips_daily` 2026-06-25 / 1976 rows, `core.market_index_daily.tw` 2026-06-25 / 2 rows, and `core.market_index_daily.global` 2026-06-25 / 2001 rows. Direct DB readback is `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`. The 20:05 margin ingestion ran successfully but returned `row_count=0` with official pending evidence for 2026-06-25; the 20:10 AI pipeline succeeded at the cron/job layer but remained blocked by `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. Do not claim StockPlatform data or AI recommendations are latest until this endpoint is `ok`.
Product version readback: StockPlatform live source `/home/wooo/stockplatform-v2` matches Gitea `wooo/stockplatform-v2.git` main `fb91aa4c6272469d1d26e0820169629eac17d28a`; VibeWork live image `192.168.0.110:5000/vibework/web:76a4ee15026af278a3660ad4b4547e9308b107be` matches Gitea `wooo/vibework.git` main `76a4ee15026af278a3660ad4b4547e9308b107be`; AwoooGo live source `/home/wooo/awooogo` matches Gitea `wooo/AwoooGo` main `6897972e9820cbb96c508fa9a80c66246c973307`; MOMO runtime uses `registry.wooo.work/wooo/momo-pro-system:stable` image id `df931906e158` created `2026-06-25T13:28:59+08:00`, while Gitea `wooo/momo-pro-system.git` main is `25120cbf21ba51affc94d0220ec87e607f59a833`; 188 runtime directory is a compose/image deployment path, not a git worktree, so add image revision label evidence before declaring code-image parity.
Google Drive / source-file state: 14:16 cold-start reports `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`. Dedicated preflight confirms host token metadata matches scheduler UID and restrictive mode; container token artifact exists with mode `600`. Token content was not read. Future Drive auth/API failure must still be treated as failed import evidence rather than no-file success.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 19:05 readback shows current higher load is mainly Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 20:24 readback found a second recurrence with orphan process groups `2756503` and `2829627`, root Chrome `PPID=1`, elapsed about 5h, no active parent smoke, GPU process CPU around 96%, and renderer CPU around 22%; approved targeted `SIGTERM` cleared both PGIDs. 20:25 post-check shows the Chrome groups are gone and CPU idle is around `85-90%`; remaining load is platform services and short-lived AWOOOI CD smoke/install work. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
Backup / monitoring state: 19:05 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
Route transient handling: post-deploy `502` on Stock or AwoooGo is a blocker only if it persists after upstream container health is ready and 3-5 consecutive external route reads still fail. For AwoooGo, live upstream is on 110 `192.168.0.110:32190`; do not test only `127.0.0.1` on 110 because the listener may bind the host address. For K3s workload balancing, wait for terminating pods to disappear before judging API/Web placement; final required state for two-replica API/Web is split across `mon` and `mon1`.
Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
@@ -1,6 +1,6 @@
# One-Page Post-Reboot Host Check

> Version: v1.3
> Version: v1.4
> Last updated: 2026-06-25 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are out of scope for this procedure.

@@ -226,6 +226,21 @@ CPU / runaway:orphan=? active_ci=? load=?

## 6. Latest verified baseline

2026-06-25 20:25 wrapper live run after targeted orphan Chrome cleanup:

- Command type: write-with-approval for process `SIGTERM` only, followed by read-only scorecard.
- CPU action: the two orphan `stockplatform-review-bulk-ux` Chrome PGIDs on 110, `2756503` and `2829627`, were cleared with targeted `SIGTERM`; `AFTER_TERM` shows nothing remaining. Docker, systemd, Nginx, and K3s were not restarted, no CI was cancelled, and no DB write was made.
- Wrapper: `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`.
- Warning split: `SERVICE=0 BOUNDARY=1 EVIDENCE=1`.
- Result: `BLOCKED`; the only hard blocker is StockPlatform freshness, not a host / route / K3s / MOMO / backup blocker.
- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
- MOMO: `V10.690`, dedicated preflight `PASS=19 WARN=2 BLOCKED=0`, job `57` clean, `DB_DAILY_FRESHNESS 1|2026-06-24`.
- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`.
- StockPlatform: live source and Gitea main `fb91aa4c6272469d1d26e0820169629eac17d28a`; cron entrypoints repaired; `/api/v1/system/freshness` still `blocked` with `core_margin_short_daily_missing,ai_recommendations_stale`. Direct DB: `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`.
- Public routes: expanded routes all returned expected `2xx/3xx`.
- DR: `escrow_missing=5`; DR must not be declared complete.
- Wazuh / SOC: route `200 disabled_waiting_iwooos_wazuh_owner_gate`; manager registry accepted `0`, not a reboot service blocker.

2026-06-25 18:23 wrapper live run after deploy marker `2a9e816a`:

- Gitea / CD: `code-review.yaml #3346` success, `cd.yaml #3345` success, deploy marker `2a9e816a chore(cd): deploy aa70835 [skip ci]`.

@@ -11,11 +11,11 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | HOST_AND_CORE_SERVICE_GREEN_STOCK_CRON_SOURCE_REPAIRED_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED | 97% | 2026-06-25 19:24 full post-start readback showed hosts / K3s / AWOOOI / MOMO / backup / offsite service gates green and `escrow_missing=5`; 2026-06-25 19:35 stricter product-data wrapper returned `POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1`, result `BLOCKED`, because StockPlatform `/api/v1/system/freshness` is `blocked` with `core_margin_short_daily_missing,ai_recommendations_stale`. 2026-06-25 20:11 follow-up repaired the StockPlatform production cron source drift: Gitea and live `/home/wooo/stockplatform-v2` are `fb91aa4c6272469d1d26e0820169629eac17d28a`, six missing cron entrypoints are present, live cron contract covers all referenced scripts, and natural cron runs at 19:56 / 20:00 / 20:02 / 20:05 / 20:06 / 20:10 no longer fail with `script_exit_127`. Expanded public route smoke covers AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps; all returned expected 2xx/3xx. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly, and Bitan public-content cleanliness direct check passed. Do not declare "all products/data latest" until StockPlatform freshness is `ok`; do not declare DR complete until `escrow_missing=0`. |
| Overall recovery readiness | HOST_AND_CORE_SERVICE_GREEN_CPU_ORPHAN_CLEARED_STOCK_CRON_SOURCE_REPAIRED_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED | 97% | 2026-06-25 20:25 full post-start readback showed hosts / K3s / AWOOOI / MOMO / backup / offsite service gates green and `escrow_missing=5`; cold-start returned `PASS=89 WARN=0 BLOCKED=0`. 20:24 targeted approved `SIGTERM` cleared orphan 110 `stockplatform-review-bulk-ux` Chrome PGIDs `2756503` and `2829627`; no Docker/systemd/Nginx/firewall/K8s/DB write was performed. 20:25 wrapper returned `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, result `BLOCKED`, because StockPlatform `/api/v1/system/freshness` is `blocked` with `core_margin_short_daily_missing,ai_recommendations_stale`. 20:11 follow-up repaired the StockPlatform production cron source drift: Gitea and live `/home/wooo/stockplatform-v2` are `fb91aa4c6272469d1d26e0820169629eac17d28a`, six missing cron entrypoints are present, live cron contract covers all referenced scripts, and natural cron runs at 19:56 / 20:00 / 20:02 / 20:05 / 20:06 / 20:10 no longer fail with `script_exit_127`. Expanded public route smoke covers AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps; all returned expected 2xx/3xx. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly. Do not declare "all products/data latest" until StockPlatform freshness is `ok`; do not declare DR complete until `escrow_missing=0`. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_STOCK_DATA_FRESHNESS | 94% | Service routes and core runtime are available, and StockPlatform cron-source drift is repaired, but product-data truth is not complete. 2026-06-25 20:11 StockPlatform `/api/v1/system/freshness` still returned `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`; OK sources include price / chips / market index for `2026-06-25`, while `core.margin_short_daily` and `ai.recommendations` stop at `2026-06-24`. Direct DB readback is `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`; 20:05 margin ingestion ran successfully but official source was still pending, and 20:10 AI pipeline correctly blocked on the margin gate. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; Bitan public-content cleanliness direct check passed; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GATE_V158 | 100% | Workplan, SOP v1.58, one-page post-start quick check v1.3, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
| P2 service / data truth | BLOCKED_STOCK_DATA_FRESHNESS | 94% | Service routes and core runtime are available, 110 orphan Chrome CPU pressure is cleared, and StockPlatform cron-source drift is repaired, but product-data truth is not complete. 2026-06-25 20:23 StockPlatform `/api/v1/system/freshness` still returned `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`; OK sources include price / chips / market index for `2026-06-25`, while `core.margin_short_daily` and `ai.recommendations` stop at `2026-06-24`. Direct DB readback is `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`; 20:05 margin ingestion ran successfully but official source was still pending, and 20:10 AI pipeline correctly blocked on the margin gate. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GATE_V159 | 100% | Workplan, SOP v1.59, one-page post-start quick check v1.4, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. Latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`, ArgoCD `Synced / Healthy`, API/Web/Worker image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.690`; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned `200`; 110 load is around `14.51 / 12.34 / 11.42`, with Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain. Wazuh route readback is now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but manager registry accepted remains `0`, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
@@ -25,6 +25,8 @@ Full cold-start service readiness may now be declared GREEN for the latest verif
2026-06-25 20:11 StockPlatform cron-source recovery closeout: the root cause of several StockPlatform stale/old-data symptoms included production source drift, where cron referenced scripts that were absent from live `/home/wooo/stockplatform-v2`, producing `script_exit_127` for source remediation, market index, price ingestion, and related monitors. Commit `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints` was pushed to `gitea/main` and fast-forward pulled on 110 only. Live post-pull checks confirm that all referenced cron scripts exist and that `bash -n` passes. Natural cron runs then recovered: source remediation 19:56 / 20:00 succeeded, market index 20:00 succeeded, price 20:02 succeeded, margin 20:05 succeeded, chips 20:06 succeeded, and AI pipeline 20:10 succeeded at cron/job level while correctly blocking on the still-pending official margin-short source. The remaining blocker is the official 2026-06-25 margin-short data and the dependent AI recommendation freshness, not source-version drift. The next natural follow-up is the 21:00 `intelligence-sync` run, which should prove the restored Docker-backed `psql` shim without manual production writes.
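The post-pull check described above (every cron-referenced script exists and passes `bash -n`) could be sketched roughly as follows. Exit status 127 is the shell's conventional "command not found" status, which is why a missing entrypoint surfaces that way. The helper names `referenced_scripts` and `check_script`, and the assumption that entrypoints are absolute paths ending in `.sh`, are illustrative and not taken from the live crontab. The sketch is read-only: it parses crontab text and runs a syntax check, and never executes a script.

```python
import os
import re
import subprocess

# Assumption: cron entrypoints are absolute paths ending in ".sh".
# The lookahead rejects longer names such as "/x/run.sh.bak".
SCRIPT_RE = re.compile(r"(/[^\s;|&]+\.sh)(?![^\s;|&])")


def referenced_scripts(crontab_text):
    """Return script paths referenced by non-comment crontab lines, in order."""
    found = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        for path in SCRIPT_RE.findall(stripped):
            if path not in found:
                found.append(path)
    return found


def check_script(path):
    """Classify one script as 'missing', 'syntax_error', or 'ok'."""
    if not os.path.isfile(path):
        return "missing"  # cron would typically fail with exit status 127
    # `bash -n` parses the script without executing any command in it.
    result = subprocess.run(["bash", "-n", path], capture_output=True)
    return "ok" if result.returncode == 0 else "syntax_error"
```

A caller would feed `referenced_scripts` the output of a read-only crontab listing and report any path whose `check_script` result is not `ok` before trusting the next natural cron run.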
2026-06-25 20:25 110 CPU orphan Chrome cleanup closeout: read-only process attribution found two `stockplatform-review-bulk-ux` Chrome process groups, `2756503` and `2829627`, each with a root Chrome process at `PPID=1`, elapsed time of about 5h, and sustained GPU/renderer CPU. With user approval, only those PGIDs received a targeted `SIGTERM`; the post-check showed no remaining group entries, CPU idle around `85-90%`, and `si/so=0`. The full post-start wrapper after cleanup returned cold-start `PASS=89 WARN=0 BLOCKED=0`, backup `core_blockers=0`, MOMO fresh, expanded public routes green, and overall `PASS=37 WARN=2 BLOCKED=1`, with `RESULT=BLOCKED` only because StockPlatform product data freshness remains blocked. This shows that the reboot SOP separates host/service recovery, runaway process cleanup, product-data freshness, DR escrow, and Wazuh security evidence.
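The attribution step above (a matching command line, `PPID=1`, long elapsed time) can be sketched as a filter over `ps` output. This is a minimal sketch, not the SOP's actual tooling: it assumes the input comes from something like `ps -eo pid,ppid,pgid,etimes,pcpu,args` (procps format specifiers), and the names `find_orphan_groups` and `terminate_groups`, the marker string, and the elapsed-time threshold are hypothetical. The signal step is gated behind an explicit `approved` flag to mirror the write-with-approval boundary.

```python
import os
import signal


def find_orphan_groups(ps_text, marker, min_elapsed_s):
    """Return sorted PGIDs that contain a long-running, reparented matching process.

    Expects columns: pid ppid pgid etimes pcpu args (args may contain spaces).
    """
    groups = set()
    for line in ps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[0].isdigit():
            continue  # header row or malformed line
        pid, ppid, pgid, etimes = (int(value) for value in fields[:4])
        args = fields[5]
        # PPID=1 means the original parent is gone and init adopted the process.
        if ppid == 1 and etimes >= min_elapsed_s and marker in args:
            groups.add(pgid)
    return sorted(groups)


def terminate_groups(pgids, approved=False):
    """Send SIGTERM to each process group, but only after explicit approval."""
    if not approved:
        return []  # read-only mode: report candidates, signal nothing
    for pgid in pgids:
        os.killpg(pgid, signal.SIGTERM)
    return list(pgids)
```

Keeping detection and signaling in separate functions means the read-only attribution can run in any quick check, while the `SIGTERM` step stays a distinct, approved live write followed by its own post-check.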
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
---