docs(ops): record full-stack green reboot closeout [skip ci]

ogt
2026-06-25 21:19:31 +08:00
parent 395b1a557f
commit 2e3202c692
5 changed files with 53 additions and 17 deletions


@@ -51,6 +51,35 @@
**Boundary**: This section only fixes the status chain and operator outcome. It did not run an Ansible apply, restart anything, modify hosts over SSH, send Telegram messages, open a runtime gate, or treat check-mode as a completed fix.
## 2026-06-25: StockPlatform natural-schedule catch-up and reboot SOP v1.60 final service green
**Background**: The 20:25 post-start wrapper had already shown that the hosts, K3s, public routes, MOMO, and backup/offsite had recovered, but StockPlatform `/api/v1/system/freshness` was still blocked by `core_margin_short_daily_missing` and `ai_recommendations_stale`. To avoid a manual backfill or a route-only false green, this round waited for the 21:00 / 21:10 natural cron runs to finish and then converged on read-only evidence.
**21:00 / 21:10 StockPlatform natural-schedule evidence**:
- Live source: 110 `/home/wooo/stockplatform-v2` is still at `fb91aa4c6272469d1d26e0820169629eac17d28a`, aligned with Gitea `main`.
- `intelligence-sync`: the 21:00 natural run went from the morning's missing-`psql` failure to `status=0`, with `matched_report_count=317`, `security_link_count=364`, `record_count=152`, and RAG `document_count=4551` / `chunk_count=4551`.
- `source-remediation-queue`: no more `script_exit_127` since 19:56; the 21:00 run is still `status=0`.
- `core.margin_short_daily`: caught up to `2026-06-25` / `1976` rows after 21:05.
- `ai-recommendation-pipeline`: the 21:10 natural run reported `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`, `draft_count=120`, `candidate_count=120`, `rag_documents=1000`, `rag_chunks=1000`, and `status=0`.
- `/api/v1/system/freshness`: at 21:13 it returned `status=ok`, `latest_trading_date=2026-06-25`, and blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`.
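The freshness gate can be expressed as a small check. This is a minimal sketch, assuming the endpoint returns a JSON object with `status`, `latest_trading_date`, and `blockers` keys. The commit only records those field names and values, so the JSON shape and the `freshness_is_green` helper are hypothetical.

```python
import json


def freshness_is_green(payload: str, expected_date: str) -> bool:
    """Return True only when the freshness payload is fully unblocked.

    Assumed shape (not confirmed by the source):
    {"status": "ok", "latest_trading_date": "YYYY-MM-DD", "blockers": [...]}
    """
    data = json.loads(payload)
    return (
        data.get("status") == "ok"
        and data.get("latest_trading_date") == expected_date
        and data.get("blockers") == []
    )


# Field values mirror the blocked and ok readbacks recorded in this commit.
green = json.dumps(
    {"status": "ok", "latest_trading_date": "2026-06-25", "blockers": []}
)
blocked = json.dumps(
    {
        "status": "blocked",
        "latest_trading_date": "2026-06-25",
        "blockers": ["core_margin_short_daily_missing", "ai_recommendations_stale"],
    }
)
print(freshness_is_green(green, "2026-06-25"))
print(freshness_is_green(blocked, "2026-06-25"))
```

Requiring all three conditions together is the point of the sketch: a route returning `200` does not make the gate green on its own.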
**21:14 full wrapper readback**:
- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
- Wrapper: `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, Result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`.
- DR: `escrow_missing=5`; DR still cannot be declared complete.
- 110 CPU: the current load is mainly the AWOOOI web `next build` / worker processes. This is CD/build load, not an orphan Chrome recurrence. No CI was cancelled and no build was killed.
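The counters in the readback above can be mapped to a result label roughly as follows. This is a sketch only: the precedence (any `BLOCKED` first, then service warnings, then boundary/evidence warnings) is inferred from the results recorded in this commit, not taken from the wrapper script, and `parse_counts` / `classify` are hypothetical names.

```python
import re


def parse_counts(line: str) -> dict:
    """Extract KEY=<int> pairs such as PASS=38 from a summary line."""
    return {key: int(value) for key, value in re.findall(r"\b([A-Z_]+)=(\d+)\b", line)}


def classify(summary: str, warning_split: str) -> str:
    """Map wrapper counters to a result label (assumed precedence)."""
    counts = parse_counts(summary)
    split = parse_counts(warning_split)
    if counts.get("BLOCKED", 0) > 0:
        return "BLOCKED"
    if split.get("SERVICE", 0) > 0:
        # Service-level warnings are treated as degraded service.
        return "DEGRADED"
    if counts.get("WARN", 0) > 0:
        # Only boundary / evidence warnings remain.
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return "GREEN"


split = "SERVICE=0 BOUNDARY=1 EVIDENCE=1"
print(classify("POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0", split))
print(classify("POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1", split))
print(classify("PASS=89 WARN=0 BLOCKED=0", ""))
```

The three calls reuse the 21:14 wrapper line, the 20:25 wrapper line, and the cold-start line from this commit.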
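**Command types run**:
- Read-only this round: StockPlatform logs, freshness endpoint, crontab, post-start wrapper.
- Docs-only this round: updated LOGBOOK / FULL-STACK-COLD-START-SOP / REBOOT-POST-START-QUICK-CHECK / recovery workplan / BACKUP-STATUS.
- No new runtime write: no Docker/systemd/Nginx/firewall/K8s action, no manual ingestion, no manual DB write, no backup restore, no Wazuh/SOC runtime change, and no secret read.
**Latest verdict**:
- Hosts, K3s, AWOOOI runtime, public routes, MOMO service/data, StockPlatform service/data, and backup/offsite: green on this round's evidence.
- SOP effectiveness: service recovery and the data freshness gate closed the loop this round; the SOP correctly waits for natural schedules and the data gate and is no longer misled by a route `200`.
- Still blocked: DR credential escrow evidence `5` and Wazuh manager registry accepted `0`. Both are governance/security evidence blockers, not host-reboot service blockers.
## 2026-06-25: 110 orphan Chrome targeted cleanup and reboot SOP v1.59 evidence sync
**Background**: The 20:23 post-start quick check still showed elevated load on 110. Read-only `ps` / `vmstat` triage confirmed that two `stockplatform-review-bulk-ux` headless Chrome process groups had been running for about 5 hours: the root Chrome process had `PPID=1`, there was no active parent test process, each GPU process used about 96% CPU, and each renderer used about 22% CPU. This matches the SOP's "orphan browser smoke" triage condition and is not a Docker, Nginx, K3s, MOMO, Harbor, Sentry, or Wazuh service incident.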
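The orphan triage condition can be sketched as a predicate over already-collected process-group facts. The `ProcessGroup` fields, the `is_orphan_candidate` name, and both thresholds are hypothetical; the source only states the observed values (`PPID=1`, about 5 hours elapsed, about 96% GPU-process CPU, no active parent).

```python
from dataclasses import dataclass


@dataclass
class ProcessGroup:
    pgid: int
    root_ppid: int           # PPID of the root Chrome process in the group
    elapsed_hours: float     # how long the group has been running
    cpu_percent: float       # sustained CPU of the busiest process in the group
    has_active_parent: bool  # True while a Playwright / CI parent is still alive


def is_orphan_candidate(group: ProcessGroup, min_hours: float = 1.0, min_cpu: float = 20.0) -> bool:
    """Flag a group for review; thresholds are illustrative assumptions."""
    return (
        group.root_ppid == 1
        and not group.has_active_parent
        and group.elapsed_hours >= min_hours
        and group.cpu_percent >= min_cpu
    )


# Values for the first group mirror the 20:23 readback; the second group is invented.
orphan = ProcessGroup(2756503, root_ppid=1, elapsed_hours=5.0, cpu_percent=96.0, has_active_parent=False)
ci_child = ProcessGroup(1234, root_ppid=4321, elapsed_hours=0.2, cpu_percent=80.0, has_active_parent=True)
print(is_orphan_candidate(orphan))
print(is_orphan_candidate(ci_child))
```

The predicate only identifies candidates. Sending `SIGTERM` stays a separate, approved step.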


@@ -23,6 +23,7 @@
> 2026-06-25 19:35 Codex product-data gate refresh: backup/offsite remains green, but overall "all products/data latest" is blocked by StockPlatform `/api/v1/system/freshness` (`core_margin_short_daily_missing`, `ai_recommendations_stale`). This is not a backup failure; keep `escrow_missing=5` as the DR blocker and Stock freshness as a separate product-data blocker.
> 2026-06-25 20:11 Codex StockPlatform cron-source recovery: StockPlatform Gitea/live source is now `fb91aa4c6272469d1d26e0820169629eac17d28a`; six missing production cron entrypoints are restored; natural cron runs for source remediation, market index, price, margin, chips, and AI no longer fail from missing files. Backup/offsite remains green. Stock freshness still blocks because official 2026-06-25 margin-short data is pending and AI recommendations correctly stay on 2026-06-24; this is still not a backup or restore incident.
> 2026-06-25 20:25 Codex 110 CPU cleanup: two orphan StockPlatform headless Chrome process groups were cleared by targeted approved `SIGTERM`; no Docker/systemd/Nginx/K8s/DB/backup write occurred. Backup/offsite remains green, DR still blocked by `escrow_missing=5`, and Stock freshness remains the only hard product-data blocker.
> 2026-06-25 21:14 Codex full wrapper refresh: StockPlatform 21:00 `intelligence-sync` and 21:10 AI pipeline naturally caught up; `/api/v1/system/freshness` is `status=ok` with blockers `[]`. Backup/offsite remains 110 `13/13` and 188 `2/2` fresh, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; full-stack service/data result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, with only `escrow_missing=5` blocking DR complete.
---
@@ -46,7 +47,7 @@ Read-only evidence sources: `/backup/scripts/backup-status.sh --no-notify --no-r
| Offsite / GDrive freshness | VERIFIED | `offsite_fresh=1`, `rclone_gdrive_fresh=1`. |
| Backup core blockers | GREEN | `core_blockers=0`. |
| MOMO data freshness | VERIFIED | `DB_DAILY_FRESHNESS 1|2026-06-24`; latest job 57 completed cleanly. |
| Full-stack service state | STOCK_DATA_BLOCKED_WITH_CORE_SERVICES_GREEN | `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`; cold-start `PASS=89 WARN=0 BLOCKED=0`; blocker is StockPlatform freshness, not backup. |
| Full-stack service state | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 21:14 `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`; cold-start `PASS=89 WARN=0 BLOCKED=0`; StockPlatform freshness is OK, and only `escrow_missing=5` blocks DR complete. |
| Credential escrow | BLOCKED | `escrow_missing=5`; only real non-secret owner evidence may close this. |
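The table keeps "backup is green" and "DR is complete" as separate verdicts. A minimal sketch of that split, assuming the flags arrive as `key=value` tokens on one line; the real `backup-status.sh` output format is not shown here, and `parse_flags` / `backup_verdict` are hypothetical helpers.

```python
def parse_flags(line: str) -> dict:
    """Parse space- or comma-separated key=value integer flags."""
    flags = {}
    for token in line.replace(",", " ").split():
        key, sep, value = token.partition("=")
        if sep and value.isdigit():
            flags[key] = int(value)
    return flags


def backup_verdict(flags: dict) -> dict:
    """Keep backup health and DR completeness as two separate answers."""
    backup_green = (
        flags.get("core_blockers") == 0
        and flags.get("offsite_fresh") == 1
        and flags.get("rclone_gdrive_fresh") == 1
    )
    # DR additionally needs every escrow evidence item present.
    dr_complete = backup_green and flags.get("escrow_missing") == 0
    return {"backup_green": backup_green, "dr_complete": dr_complete}


line = "core_blockers=0 offsite_fresh=1 rclone_gdrive_fresh=1 escrow_missing=5"
verdict = backup_verdict(parse_flags(line))
print(verdict)
```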
## 2026-06-25 20:11 StockPlatform Cron Source / Backup Boundary


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP
> Version: v1.59
> Version: v1.60
> Last updated: 2026-06-25 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -12,7 +12,9 @@
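If you only need to decide quickly after a reboot whether recovery can be declared, run the one-page check first: `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.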
2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness is still blocked and DR remains incomplete.
2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. After waiting for official schedules instead of manual ingestion, `intelligence-sync` 21:00 finished `status=0`, `core.margin_short_daily` reached `2026-06-25` / 1976 rows, and `ai-recommendation-pipeline` 21:10 finished `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25` with `draft_count=120`, `candidate_count=120`, and `rag_documents=1000`. StockPlatform `/api/v1/system/freshness` now returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`, with price / chips / margin / AI recommendations all on `2026-06-25`. The 21:14 full wrapper returns cold-start `PASS=89 WARN=0 BLOCKED=0` and overall `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. The only remaining recovery red gate is DR credential escrow evidence `escrow_missing=5`; Wazuh manager registry accepted remains `0` as a security evidence blocker, not a reboot service blocker.
2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.
2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
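The cron contract check described above (every `scripts/ops/*.sh` referenced by the installer must exist) could look roughly like this. It is a sketch, not the actual contract check: `missing_entrypoints` is a hypothetical helper, the installer text is invented, and only `run-intelligence-sync.sh` is a script name taken from the SOP.

```python
import re
import tempfile
from pathlib import Path


def missing_entrypoints(installer_text: str, repo_root: Path) -> list:
    """List referenced scripts/ops/*.sh paths that do not exist under repo_root."""
    referenced = sorted(set(re.findall(r"scripts/ops/[\w.-]+\.sh", installer_text)))
    return [rel for rel in referenced if not (repo_root / rel).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scripts" / "ops").mkdir(parents=True)
    (root / "scripts" / "ops" / "run-intelligence-sync.sh").write_text("#!/bin/sh\n")
    # Hypothetical installer content; the second script is deliberately absent.
    installer = (
        "0 21 * * * scripts/ops/run-intelligence-sync.sh\n"
        "10 21 * * * scripts/ops/run-example-missing.sh\n"
    )
    result = missing_entrypoints(installer, root)
    print(result)
```

An empty list is the passing state. A shell reports exit status 127 when a command cannot be found, which is consistent with the `script_exit_127` failures seen before the entrypoints were restored.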
@@ -23,17 +25,17 @@
```text
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: 2026-06-25 19:05 wrapper delegated cold-start PASS=89 WARN=0 BLOCKED=0, Result=GREEN.
Post-start quick check: 2026-06-25 20:25 PASS=37 WARN=2 BLOCKED=1; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1; Result=BLOCKED because StockPlatform data freshness remains blocked. Cold-start layer remains GREEN.
Post-start quick check: 2026-06-25 21:14 PASS=38 WARN=2 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. Cold-start layer remains GREEN and StockPlatform freshness is now OK; DR remains blocked by credential escrow evidence.
Repo-side cold-start v1.42+ live read-only run: MOMO source absence / stale data blocker is cleared by import job 57 and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Live 110 script sync is not claimed until a separate approved deployment/sync happens.
110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
Service state: HOST_AND_CORE_SERVICE_GREEN_CPU_ORPHAN_CLEARED_STOCK_CRON_SOURCE_REPAIRED_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared, and 110 orphan StockPlatform Chrome smoke groups cleared by targeted approved SIGTERM. StockPlatform production cron source drift is repaired and verified by natural cron runs; product-data completeness is still blocked by official margin-short source availability and AI recommendation freshness.
Service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, StockPlatform data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared, and 110 orphan StockPlatform Chrome smoke groups cleared by targeted approved SIGTERM. StockPlatform production cron source drift is repaired and verified by natural cron runs; product-data completeness is now green for the 2026-06-25 evidence set.
Runtime release state: API/Web/Worker live image tag is `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, and 19:06 K3s readback shows API/Web/Worker pods Running; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 19:05 route smoke returned 200 for AWOOOI API, IwoooS, MOMO health, and Stock; cold-start route gate also returned expected statuses for AWOOOI web, MOMO, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, Bitan, and AIOps. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan then returned 200 for five consecutive external route reads from 19:05:38 to 19:06:24. 19:35 expanded route readback returned expected 2xx/3xx for AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
MOMO release state: mo.wooo.work health is healthy on version V10.690. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 18:23 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain recent container-replace / scheduler fail-closed / notification evidence notes, but no service blocker remains.
MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly match through 2026-06-24: `daily_sales_snapshot=109061|2025-07-01|2026-06-24`, `MOMO_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Latest import job is `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
StockPlatform data state: `/api/v1/system/freshness` returns `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`. Current OK sources include `core.price_daily` 2026-06-25 / 1976 rows, `core.chips_daily` 2026-06-25 / 1976 rows, `core.market_index_daily.tw` 2026-06-25 / 2 rows, and `core.market_index_daily.global` 2026-06-25 / 2001 rows. Direct DB readback is `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`. The 20:05 margin ingestion ran successfully but returned `row_count=0` with official pending evidence for 2026-06-25; the 20:10 AI pipeline succeeded at the cron/job layer but remained blocked by `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. Do not claim StockPlatform data or AI recommendations are latest until this endpoint is `ok`.
StockPlatform data state: `/api/v1/system/freshness` returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`. Current OK sources include `core.price_daily` 2026-06-25 / 1976 rows, `core.chips_daily` 2026-06-25 / 1976 rows, `core.margin_short_daily` 2026-06-25 / 1976 rows, `core.market_index_daily.tw` 2026-06-25 / 2 rows, `core.market_index_daily.global` 2026-06-25 / 2001 rows, and `ai.recommendations` 2026-06-25 / 2868 rows. The 21:10 natural AI pipeline produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`; no manual ingestion or DB write was performed.
Product version readback: StockPlatform live source `/home/wooo/stockplatform-v2` matches Gitea `wooo/stockplatform-v2.git` main `fb91aa4c6272469d1d26e0820169629eac17d28a`; VibeWork live image `192.168.0.110:5000/vibework/web:76a4ee15026af278a3660ad4b4547e9308b107be` matches Gitea `wooo/vibework.git` main `76a4ee15026af278a3660ad4b4547e9308b107be`; AwoooGo live source `/home/wooo/awooogo` matches Gitea `wooo/AwoooGo` main `6897972e9820cbb96c508fa9a80c66246c973307`; MOMO runtime uses `registry.wooo.work/wooo/momo-pro-system:stable` image id `df931906e158` created `2026-06-25T13:28:59+08:00`, while Gitea `wooo/momo-pro-system.git` main is `25120cbf21ba51affc94d0220ec87e607f59a833`; 188 runtime directory is a compose/image deployment path, not a git worktree, so add image revision label evidence before declaring code-image parity.
Google Drive / source-file state: 14:16 cold-start reports `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`. Dedicated preflight confirms host token metadata matches scheduler UID and restrictive mode; container token artifact exists with mode `600`. Token content was not read. Future Drive auth/API failure must still be treated as failed import evidence rather than no-file success.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 20:24 readback found a second recurrence with orphan process groups `2756503` and `2829627`, root Chrome `PPID=1`, elapsed about 5h, no active parent smoke, GPU process CPU around 96%, and renderer CPU around 22%; approved targeted `SIGTERM` cleared both PGIDs. 20:25 post-check shows the Chrome groups are gone and CPU idle is around `85-90%`; remaining load is platform services and short-lived AWOOOI CD smoke/install work. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 20:24 readback found a second recurrence with orphan process groups `2756503` and `2829627`, root Chrome `PPID=1`, elapsed about 5h, no active parent smoke, GPU process CPU around 96%, and renderer CPU around 22%; approved targeted `SIGTERM` cleared both PGIDs. 21:14 CPU attribution shows current load is dominated by an active AWOOOI Web `next build` process group and its worker processes, not orphan Chrome. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
Backup / monitoring state: 19:05 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
Route transient handling: post-deploy `502` on Stock or AwoooGo is a blocker only if it persists after upstream container health is ready and 3-5 consecutive external route reads still fail. For AwoooGo, live upstream is on 110 `192.168.0.110:32190`; do not test only `127.0.0.1` on 110 because the listener may bind the host address. For K3s workload balancing, wait for terminating pods to disappear before judging API/Web placement; final required state for two-replica API/Web is split across `mon` and `mon1`.
Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.


@@ -1,6 +1,6 @@
# One-Page Post-Reboot Host Check
> Version: v1.4
> Version: v1.5
> Last updated: 2026-06-25 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this procedure.
@@ -10,6 +10,8 @@
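After any of the 110 / 120 / 121 / 188 hosts is booted, shut down, rebooted, restored after a power loss, put through a VMware console fsck, or subjected to a large Docker / K3s reschedule, run this page first and only then decide whether to declare recovery.
Latest baseline: the 2026-06-25 21:14 full wrapper returned `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, Result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. The StockPlatform 21:10 natural AI pipeline caught up to `as_of_date=2026-06-25` and `/api/v1/system/freshness` is `status=ok`; DR still cannot be declared complete because of `escrow_missing=5`.
This page answers only four questions:
1. Whether the hosts came up.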
@@ -226,17 +228,17 @@ CPU / runawayorphan=? active_ci=? load=?
## 6. Latest verified baseline
2026-06-25 20:25 wrapper live run after targeted orphan Chrome cleanup:
2026-06-25 21:14 wrapper live run after StockPlatform natural-cron catch-up:
- Command type: write-with-approval for process `SIGTERM` only, followed by read-only scorecard.
- CPU action: the two orphan `stockplatform-review-bulk-ux` Chrome PGIDs `2756503` and `2829627` on 110 were cleared with a targeted `SIGTERM`; `AFTER_TERM` shows no leftovers. Docker, systemd, Nginx, and K3s were not restarted; no CI was cancelled; no DB write was made.
- Wrapper: `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`.
- Command type: read-only scorecard / logs / freshness endpoint. This section adds no new runtime write; the 20:24 targeted `SIGTERM` is listed separately in the historical evidence.
- CPU state: the main load on 110 is currently the AWOOOI Web `next build` and its worker processes, not an orphan Chrome recurrence. Docker, systemd, Nginx, and K3s were not restarted; no CI was cancelled; no DB write was made.
- Wrapper: `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`.
- Warning split: `SERVICE=0 BOUNDARY=1 EVIDENCE=1`.
- Result: `BLOCKED`; the only hard blocker is StockPlatform freshness, not a host / route / K3s / MOMO / backup blocker.
- Result: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; host / route / K3s / MOMO / StockPlatform / backup service-data gates are green, but DR is not complete.
- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
- MOMO: `V10.690`, dedicated preflight `PASS=19 WARN=2 BLOCKED=0`, job `57` clean, `DB_DAILY_FRESHNESS 1|2026-06-24`.
- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`.
- StockPlatform: live source and Gitea main `fb91aa4c6272469d1d26e0820169629eac17d28a`; cron entrypoints repaired; `/api/v1/system/freshness` still `blocked` with `core_margin_short_daily_missing,ai_recommendations_stale`. Direct DB: `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`.
- StockPlatform: live source and Gitea main `fb91aa4c6272469d1d26e0820169629eac17d28a`; cron entrypoints repaired; 21:00 `intelligence-sync` succeeded; 21:10 `ai-recommendation-pipeline` produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`; `/api/v1/system/freshness` is `status=ok`, blockers `[]`. Sources: price / chips / margin / AI recommendations all current for `2026-06-25`.
- Public routes: expanded routes all returned expected `2xx/3xx`.
- DR: `escrow_missing=5`; DR cannot be declared complete.
- Wazuh / SOC: route `200 disabled_waiting_iwooos_wazuh_owner_gate`, manager registry accepted `0`; not a reboot service blocker.


@@ -11,11 +11,11 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | HOST_AND_CORE_SERVICE_GREEN_CPU_ORPHAN_CLEARED_STOCK_CRON_SOURCE_REPAIRED_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED | 97% | 2026-06-25 20:25 full post-start readback showed hosts / K3s / AWOOOI / MOMO / backup / offsite service gates green and `escrow_missing=5`; cold-start returned `PASS=89 WARN=0 BLOCKED=0`. 20:24 targeted approved `SIGTERM` cleared orphan 110 `stockplatform-review-bulk-ux` Chrome PGIDs `2756503` and `2829627`; no Docker/systemd/Nginx/firewall/K8s/DB write was performed. 20:25 wrapper returned `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, result `BLOCKED`, because StockPlatform `/api/v1/system/freshness` is `blocked` with `core_margin_short_daily_missing,ai_recommendations_stale`. 20:11 follow-up repaired the StockPlatform production cron source drift: Gitea and live `/home/wooo/stockplatform-v2` are `fb91aa4c6272469d1d26e0820169629eac17d28a`, six missing cron entrypoints are present, live cron contract covers all referenced scripts, and natural cron runs at 19:56 / 20:00 / 20:02 / 20:05 / 20:06 / 20:10 no longer fail with `script_exit_127`. Expanded public route smoke covers AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps; all returned expected 2xx/3xx. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly. Do not declare "all products/data latest" until StockPlatform freshness is `ok`; do not declare DR complete until `escrow_missing=0`. |
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-25 21:14 full post-start wrapper showed hosts / K3s / AWOOOI / public routes / MOMO / StockPlatform / backup / offsite service and data gates green. Cold-start returned `PASS=89 WARN=0 BLOCKED=0`; wrapper returned `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. 20:24 targeted approved `SIGTERM` cleared orphan 110 `stockplatform-review-bulk-ux` Chrome PGIDs `2756503` and `2829627`; 21:14 CPU attribution shows current load is active AWOOOI Web `next build`, not orphan Chrome. StockPlatform Gitea/live source is `fb91aa4c6272469d1d26e0820169629eac17d28a`; 21:00 `intelligence-sync` succeeded, 21:10 `ai-recommendation-pipeline` produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`, and `/api/v1/system/freshness` is now `status=ok` with blockers `[]`. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0` and is a security evidence blocker, not a reboot service blocker. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_STOCK_DATA_FRESHNESS | 94% | Service routes and core runtime are available, 110 orphan Chrome CPU pressure is cleared, and StockPlatform cron-source drift is repaired, but product-data truth is not complete. 2026-06-25 20:23 StockPlatform `/api/v1/system/freshness` still returned `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`; OK sources include price / chips / market index for `2026-06-25`, while `core.margin_short_daily` and `ai.recommendations` stop at `2026-06-24`. Direct DB readback is `price|2026-06-25|1976`, `chips|2026-06-25|1976`, `margin|2026-06-24|1976`, `ai_recommendations|2026-06-24|120`; 20:05 margin ingestion ran successfully but official source was still pending, and 20:10 AI pipeline correctly blocked on the margin gate. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GATE_V159 | 100% | Workplan, SOP v1.59, one-page post-start quick check v1.4, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 orphan Chrome CPU pressure is cleared, and StockPlatform cron-source drift is repaired. 2026-06-25 21:13 StockPlatform `/api/v1/system/freshness` returned `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GREEN_V160 | 100% | Workplan, SOP v1.60, one-page post-start quick check v1.5, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform 21:00 / 21:10 natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm in which older deploy markers were superseded by later commits. Latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`, ArgoCD `Synced / Healthy`, API/Web/Worker image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, direct route smoke returning `200` for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.690`; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned `200`; 110 load is around `14.51 / 12.34 / 11.42`, with Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible; none of this load is an AWOOOI service blocker. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain. Wazuh route readback is now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but manager registry accepted remains `0`, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
@@ -27,6 +27,8 @@ Full cold-start service readiness may now be declared GREEN for the latest verif
2026-06-25 20:25 110 CPU orphan Chrome cleanup closeout: read-only process attribution found two `stockplatform-review-bulk-ux` Chrome process groups (PGIDs `2756503` and `2829627`) whose root Chrome process had `PPID=1`, about 5h of elapsed time, and sustained GPU/renderer CPU usage. With user approval, only those PGIDs received targeted `SIGTERM`; post-check showed no remaining group entries, CPU idle around `85-90%`, and `si/so=0`. Full post-start wrapper after cleanup returned cold-start `PASS=89 WARN=0 BLOCKED=0`, backup `core_blockers=0`, MOMO fresh, expanded public routes green, and overall `PASS=37 WARN=2 BLOCKED=1`; the result was `RESULT=BLOCKED` only because StockPlatform product-data freshness was still blocked at that time. This confirms the reboot SOP is effective at separating host/service recovery, runaway process cleanup, product-data freshness, DR escrow, and Wazuh security evidence.
2026-06-25 21:14 StockPlatform natural-cron closeout: after waiting for official schedules, 21:00 `intelligence-sync` succeeded with `status=0`, `core.margin_short_daily` reached `2026-06-25`, and 21:10 `ai-recommendation-pipeline` produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`. `/api/v1/system/freshness` is `status=ok`, blockers `[]`, with price / chips / margin / AI recommendations all current for `2026-06-25`. Full wrapper returned `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. This is the current service/data recovery baseline: all reboot service and product-data gates are green; DR remains blocked only by missing credential escrow evidence (`escrow_missing=5`), and Wazuh registry remains a separate security evidence blocker.
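
The wrapper outcomes recorded in these closeouts (`BLOCKED` at 20:25, `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` at 19:06 and 21:14, and not `DEGRADED` when service warnings are `0`) can be read as a small decision rule. The sketch below is illustrative only and is not the wrapper's actual implementation: the names `WrapperSummary`, `parse_summary`, and `classify` are hypothetical, and the precedence order (blocked first, then service warnings, then boundary / evidence warnings) is an assumption inferred from the readbacks above.

```python
import re
from dataclasses import dataclass

# Matches the "PASS=<n> WARN=<n> BLOCKED=<n>" counters seen in the readbacks.
SUMMARY_RE = re.compile(r"PASS=(\d+)\s+WARN=(\d+)\s+BLOCKED=(\d+)")


@dataclass(frozen=True)
class WrapperSummary:
    """Hypothetical container for one wrapper summary line."""
    passed: int
    warned: int
    blocked: int


def parse_summary(line: str) -> WrapperSummary:
    """Extract the PASS / WARN / BLOCKED counters from a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"no PASS/WARN/BLOCKED counters in: {line!r}")
    passed, warned, blocked = (int(value) for value in match.groups())
    return WrapperSummary(passed, warned, blocked)


def classify(summary: WrapperSummary, service_warnings: int) -> str:
    """Assumed precedence: blockers, then service warnings, then other warnings."""
    if summary.blocked > 0:
        return "BLOCKED"
    if service_warnings > 0:
        return "DEGRADED"
    if summary.warned > 0:
        # Only DR boundary / evidence warnings remain.
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return "GREEN"


if __name__ == "__main__":
    line = "POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0"
    print(classify(parse_summary(line), service_warnings=0))
```

Under these assumptions, the 21:14 readback classifies as `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` and the 20:25 readback (`BLOCKED=1`) as `BLOCKED`, matching the results recorded above.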
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
---