diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 953225c0..4d6ccedd 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -51016,3 +51016,20 @@ production browser smoke:
 - Did not read secrets / tokens / `.env` / raw sessions / SQLite / auth; did not read `.runner` contents.
 - Did not use GitHub / gh / GitHub API / GitHub Actions.
 - No host reboot; no Docker / Nginx / K3s / DB restart; no workflow_dispatch; no DB write / restore / prune.
+
+## 2026-07-01 - 07:58 P0 188 hot-path index controlled apply
+
+**Completed**:
+- Live evidence answering "why is it still high": the source migration had been pushed, but the runtime DB had not applied it. The live index list for `awooop_conversation_event` still contained only the pkey, `idx_conv_event_run`, `idx_conv_event_subject`, and `uix_conv_event_dedup`; CD `#4182` has failed / is backlogged, and Harbor repair `#4176` is still Waiting and missing `awoooi-host`.
+- 188 preflight: `awooop_conversation_event` table size is about `93 MB`; before the apply, `k3s-postgres-recovery` was at about `7.9277` CPU cores and 188 `load5=10.57`.
+- On 188, ran the repo migration `apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01.sql` inside the `k3s-postgres-recovery` container over the postgres local socket; executed only `CREATE INDEX CONCURRENTLY IF NOT EXISTS`, with `lock_timeout=5s` and `statement_timeout=0`.
+- Post-apply verifier: all 12 new hot-path indexes are `indisvalid=true` / `indisready=true`; `pg_stat_activity` converged to idle `35`, unknown `5`, active `1`; the first CPU readback showed `k3s-postgres-recovery=1.0552` cores, and 20 seconds later it was no longer among the top 3 CPU containers on 188.
+- Added receipt `docs/operations/awooop-conversation-event-hot-path-index-apply-receipt-2026-07-01.snapshot.json` and recorded this experience in `docs/runbooks/FULL-STACK-COLD-START-SOP.md` v1.84.
+
+**Still in effect**:
+- Did not read secrets / tokens / `.env` / raw sessions / SQLite / auth; did not read `.runner` contents.
+- Did not use GitHub / gh / GitHub API / GitHub Actions.
+- No host reboot; no Docker / Nginx / K3s / DB restart; no workflow_dispatch; no DROP / TRUNCATE / restore / prune.
+
+**Next steps**:
+- 188 DB CPU has dropped; 110 is still high, and the cause is still the `gitea` / queue / `awoooi-host` control path: 110 `load5=27.22`, `gitea=3.4019` cores, Harbor repair `#4176 Waiting`, no matching `awoooi-host`. The mainline next step is to continue 110 Gitea queue / controlled lane recovery, without restoring the generic runner and without rebooting the host.
diff --git a/docs/operations/awooop-conversation-event-hot-path-index-apply-receipt-2026-07-01.snapshot.json b/docs/operations/awooop-conversation-event-hot-path-index-apply-receipt-2026-07-01.snapshot.json
new file mode 100644
index 00000000..c7595ff7
--- /dev/null
+++ b/docs/operations/awooop-conversation-event-hot-path-index-apply-receipt-2026-07-01.snapshot.json
@@ -0,0 +1,61 @@
+{
+  "schema_version": "awooop_conversation_event_hot_path_index_apply_receipt_v1",
+  "generated_at": "2026-07-01T07:58:00+08:00",
+  "scope": "188:k3s-postgres-recovery:awoooi_prod:awooop_conversation_event",
+  "source_commit": "c29771a2d1b592e94fe3a1051b3a9d3842ec20f4",
+  "migration": "apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01.sql",
+  "rollback": "apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01_down.sql",
+  "operation": {
+    "type": "controlled_db_migration",
+    "statements": "CREATE INDEX CONCURRENTLY IF NOT EXISTS only",
+    "lock_timeout": "5s",
+    "statement_timeout": "0",
+    "runtime_write_performed": true,
+    "destructive_db_operation_performed": false,
+    "drop_truncate_restore_performed": false,
+    "service_restart_performed": false,
+    "secret_value_read": false,
+    "runner_token_read": false
+  },
+  "pre_apply": {
+    "table_size": "93 MB",
+    "indexes_present": [
+      "awooop_conversation_event_pkey",
+      "idx_conv_event_run",
+      "idx_conv_event_subject",
+      "uix_conv_event_dedup"
+    ],
+    "k3s_postgres_recovery_cpu_cores": 7.9277,
+    "host_188_load5": 10.57
+  },
+  "post_apply": {
+    "indexes_valid_ready": [
+      "idx_awooop_conv_event_project_provider_event_recent",
+      "idx_awooop_conv_event_project_provider_lower_recent",
+      "idx_awooop_conv_event_project_provider_recent",
+      "idx_awooop_conv_event_project_run_id_text_recent",
+      "idx_awooop_conv_event_source_refs_alert_ids_gin",
+      "idx_awooop_conv_event_source_refs_approval_ids_gin",
+      "idx_awooop_conv_event_source_refs_event_ids_gin",
+      "idx_awooop_conv_event_source_refs_fingerprints_gin",
+      "idx_awooop_conv_event_source_refs_incident_ids_gin",
+      "idx_awooop_conv_event_source_refs_sentry_issue_ids_gin",
+      "idx_awooop_conv_event_source_refs_signoz_alerts_gin",
+      "idx_conv_event_recent"
+    ],
+    "pg_stat_activity": {
+      "idle": 35,
+      "unknown": 5,
+      "active": 1
+    },
+    "k3s_postgres_recovery_cpu_cores_after_first_readback": 1.0552,
+    "k3s_postgres_recovery_top3_after_20_seconds": false,
+    "host_188_load5_after_20_seconds": 10.04
+  },
+  "remaining_blockers": [
+    "host_110_gitea_cpu_pressure",
+    "harbor_110_repair_no_matching_runner:awoooi-host",
+    "cd_4182_failure_or_waiting_backlog"
+  ],
+  "safe_next_step": "continue_110_gitea_queue_control_path_recovery_without_generic_runner_or_host_reboot"
+}
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index faa48cb0..27141ab8 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,7 +1,7 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP
 
-> Version: v1.83
-> Last updated: 2026-06-30 Asia/Taipei
+> Version: v1.84
+> Last updated: 2026-07-01 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
 
 ---
@@ -22,6 +22,8 @@ v1.80 / v1.81 credential escrow intake scorecard rule: same-round owner response
 
 v1.83 Gitea CD running retry rule: `read-public-gitea-actions-queue.py --json` must check both `latest_visible_cd_failure_classifier` and `latest_visible_cd_inflight_classifier`. Before the final `BLOCKER harbor_registry_public_route_unavailable` appears, as long as `latest_visible_cd_harbor_public_route_retrying_unavailable=true` and `latest_visible_cd_harbor_latest_registry_v2_status` is not `200/401`, treat it as in-flight production deploy blocker evidence; if the Harbor repair workflow is also `Waiting` or no-matching `awoooi-host`, the next step is to restore the 110 local repair control path, not to wait for the CD timeout, rerun an ineffective CD, use workflow_dispatch, or treat `Running` as meaning the version is up to date.
 
+2026-07-01 07:58 live host-pressure update: the sustained high CPU on 188 was not ordinary reboot noise but hot-path index drift on `awooop_conversation_event` inside `k3s-postgres-recovery`. The live DB retained only `awooop_conversation_event_pkey`, `idx_conv_event_run`, `idx_conv_event_subject`, and `uix_conv_event_dedup`, and was missing the base `idx_conv_event_recent` and the provider/source_refs hot-path indexes; at that time `k3s-postgres-recovery` was at about `7.9277` CPU cores and 188 `load5=10.57`. A controlled DB migration was run per `apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01.sql`, executing only `CREATE INDEX CONCURRENTLY IF NOT EXISTS` with `lock_timeout=5s`, and with no DROP / TRUNCATE / restore / DB restart / Docker restart / secret read. The post-apply verifier shows all 12 new indexes at `indisvalid=true`, `indisready=true`; `pg_stat_activity` converged to active `1`, the first readback showed `k3s-postgres-recovery` down to about `1.0552` cores, and 20 seconds later it was no longer among the top 3 CPU containers on 188. Receipt: `docs/operations/awooop-conversation-event-hot-path-index-apply-receipt-2026-07-01.snapshot.json`. The remaining high load on 110 is not the same DB problem: 110 `gitea` is still at about `3.4019` cores, and the public queue is still `blocked_harbor_110_repair_no_matching_runner` / `awoooi-host`; the next step is fixed as 110 Gitea queue / controlled lane recovery. Do not restore the generic runner and do not reboot the host.
+
 v1.82 bounded summary rule: the SSH helpers in `post-start-quick-check.sh` and `188-host-hygiene-maintenance-checklist.sh` must have a command timeout, a single connection attempt, ServerAlive, and no password prompt; whenever any 110 / 188 read-only control path hangs, it must converge to a blocker / evidence rather than leaving `post-reboot-readiness-summary.sh` waiting indefinitely. If backup / escrow evidence cannot be read, `ESCROW_MISSING_COUNT=unknown` must also emit `DR_ESCROW_BLOCKED=1` and `DR_ESCROW_EVIDENCE_UNKNOWN=1`, and put `backup_core_readback_recovery` and `credential_escrow_evidence` into `NEXT_REQUIRED_GATES`; unknown must not be interpreted as DR or backup green.
 
 2026-06-29 09:13 previous live summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260629-091918/summary.txt` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_SERVICE_WARNINGS=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=6`, `RUNTIME_ACTION_AUTHORIZED=0`, `NEXT_REQUIRED_GATES=credential_escrow_evidence`. This baseline has been superseded by the evidence gathered after the 2026-06-30 20:18 all-host reboot and must not be used to claim that the current state is green.
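
Reviewer note: the post-apply verifier step in the 2026-07-01 entry (all 12 new indexes at `indisvalid=true` / `indisready=true`) could be sketched roughly as below. This is a minimal illustration, not the repo's actual verifier: `IndexState`, `verify_indexes`, and `INDEX_STATE_SQL` are hypothetical names, and the rows are assumed to have been fetched separately with a read-only catalog query. Only the index names (taken from the receipt) and the `pg_index` columns come from the change itself.

```python
# Hypothetical sketch of a post-apply index check; helper names are assumptions.
from typing import Iterable, NamedTuple

# Read-only query against the standard PostgreSQL catalogs pg_index / pg_class.
# Assumed to be run by the caller; this sketch only evaluates the returned rows.
INDEX_STATE_SQL = """
SELECT c.relname, i.indisvalid, i.indisready
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
WHERE i.indrelid = 'awooop_conversation_event'::regclass
"""

# The 12 index names listed under post_apply.indexes_valid_ready in the receipt.
EXPECTED_INDEXES = frozenset({
    "idx_awooop_conv_event_project_provider_event_recent",
    "idx_awooop_conv_event_project_provider_lower_recent",
    "idx_awooop_conv_event_project_provider_recent",
    "idx_awooop_conv_event_project_run_id_text_recent",
    "idx_awooop_conv_event_source_refs_alert_ids_gin",
    "idx_awooop_conv_event_source_refs_approval_ids_gin",
    "idx_awooop_conv_event_source_refs_event_ids_gin",
    "idx_awooop_conv_event_source_refs_fingerprints_gin",
    "idx_awooop_conv_event_source_refs_incident_ids_gin",
    "idx_awooop_conv_event_source_refs_sentry_issue_ids_gin",
    "idx_awooop_conv_event_source_refs_signoz_alerts_gin",
    "idx_conv_event_recent",
})


class IndexState(NamedTuple):
    """One row of INDEX_STATE_SQL."""
    name: str
    indisvalid: bool
    indisready: bool


def verify_indexes(rows: Iterable[IndexState],
                   expected: frozenset = EXPECTED_INDEXES) -> dict:
    """Report expected indexes that are absent or not both valid and ready."""
    seen = {row.name: row for row in rows}
    present = set(seen)
    missing = sorted(expected - present)
    not_usable = sorted(
        name for name in expected & present
        if not (seen[name].indisvalid and seen[name].indisready)
    )
    return {
        "ok": not missing and not not_usable,
        "missing": missing,
        "not_usable": not_usable,
    }
```

Checking `indisvalid` explicitly matters here because a `CREATE INDEX CONCURRENTLY` that fails partway leaves an invalid index behind, and a later `IF NOT EXISTS` run skips that name instead of rebuilding it. A non-empty `not_usable` list would therefore call for a separately reviewed rebuild rather than a plain re-run of the migration.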