docs(recovery): record conversation event index apply
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Failing after 3m18s
AWOOOI Harbor 110 Local Repair / workflow-shape (push) Successful in 0s
AI 技術雷達監控 / ai-technology-watch (push) Successful in 36s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
AWOOOI Harbor 110 Local Repair / harbor-110-local-repair (push) Has been cancelled
@@ -51016,3 +51016,20 @@ production browser smoke:
- No secret / token / `.env` / raw sessions / SQLite / auth was read; `.runner` contents were not read.
- GitHub / gh / GitHub API / GitHub Actions were not used.
- No host reboot; no Docker / Nginx / K3s / DB restart; no workflow_dispatch; no DB write / restore / prune.

## 2026-07-01 — 07:58 P0 188 hot-path index controlled apply

**Completed**:

- Live evidence answering "why is it still high": the source migration had been pushed but not yet applied to the runtime DB; the live index list for `awooop_conversation_event` still contained only the pkey, `idx_conv_event_run`, `idx_conv_event_subject`, and `uix_conv_event_dedup`; CD `#4182` had failed / was backlogged; Harbor repair `#4176` was still Waiting and lacked a matching `awoooi-host`.
- 188 preflight: the `awooop_conversation_event` table size was about `93 MB`; before the apply, `k3s-postgres-recovery` used about `7.9277` CPU cores and 188 `load5=10.57`.
- On 188, ran the repo migration `apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01.sql` inside the `k3s-postgres-recovery` container over the postgres local socket; it executed only `CREATE INDEX CONCURRENTLY IF NOT EXISTS`, with `lock_timeout=5s` and `statement_timeout=0`.
- Post-apply verifier: all 12 new hot-path indexes report `indisvalid=true` / `indisready=true`; `pg_stat_activity` converged to idle `35`, unknown `5`, active `1`; the first CPU readback showed `k3s-postgres-recovery=1.0552` cores, and 20 seconds later it was no longer among the top 3 CPU containers on 188.
- Added the receipt `docs/operations/awooop-conversation-event-hot-path-index-apply-receipt-2026-07-01.snapshot.json` and recorded this lesson in `docs/runbooks/FULL-STACK-COLD-START-SOP.md` v1.84.

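
The post-apply verifier step above could be sketched roughly as follows. This is a minimal illustration, not the verifier actually used: the helper name `verify_indexes` and the `IndexState` tuple are hypothetical, and the query is only one plausible way to read `indisvalid` / `indisready` from the standard `pg_index` and `pg_class` catalogs.

```python
from typing import Iterable, NamedTuple

# A catalog query a verifier might run (assumption: executed by any client
# against the target database; the rows are then passed to verify_indexes).
INDEX_STATE_SQL = """
SELECT c.relname, i.indisvalid, i.indisready
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
WHERE i.indrelid = 'awooop_conversation_event'::regclass
"""


class IndexState(NamedTuple):
    """One row of the query above (hypothetical container type)."""
    name: str
    indisvalid: bool
    indisready: bool


def verify_indexes(rows: Iterable[IndexState], expected: Iterable[str]) -> dict:
    """Report expected indexes that are absent or not both valid and ready."""
    by_name = {row.name: row for row in rows}
    expected = list(expected)
    missing = sorted(n for n in expected if n not in by_name)
    not_usable = sorted(
        n for n in expected
        if n in by_name
        and not (by_name[n].indisvalid and by_name[n].indisready)
    )
    return {
        "missing": missing,
        "not_usable": not_usable,
        "ok": not missing and not not_usable,
    }
```

Checking validity explicitly matters because a failed `CREATE INDEX CONCURRENTLY` can leave an invalid index behind, and `IF NOT EXISTS` only looks at the index name, so a rerun would skip it without repairing it.
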
**Still in effect**:

- No secret / token / `.env` / raw sessions / SQLite / auth was read; `.runner` contents were not read.
- GitHub / gh / GitHub API / GitHub Actions were not used.
- No host reboot; no Docker / Nginx / K3s / DB restart; no workflow_dispatch; no DROP / TRUNCATE / restore / prune.

**Next steps**:

- 188 DB CPU has dropped; 110 is still high, and the cause remains the `gitea` / queue / `awoooi-host` control path: 110 `load5=27.22`, `gitea=3.4019` cores, Harbor repair `#4176 Waiting`, no matching `awoooi-host`. The mainline next step is to continue 110 Gitea queue / controlled lane recovery, without restoring the generic runner and without rebooting the host.

@@ -0,0 +1,61 @@
{
  "schema_version": "awooop_conversation_event_hot_path_index_apply_receipt_v1",
  "generated_at": "2026-07-01T07:58:00+08:00",
  "scope": "188:k3s-postgres-recovery:awoooi_prod:awooop_conversation_event",
  "source_commit": "c29771a2d1b592e94fe3a1051b3a9d3842ec20f4",
  "migration": "apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01.sql",
  "rollback": "apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01_down.sql",
  "operation": {
    "type": "controlled_db_migration",
    "statements": "CREATE INDEX CONCURRENTLY IF NOT EXISTS only",
    "lock_timeout": "5s",
    "statement_timeout": "0",
    "runtime_write_performed": true,
    "destructive_db_operation_performed": false,
    "drop_truncate_restore_performed": false,
    "service_restart_performed": false,
    "secret_value_read": false,
    "runner_token_read": false
  },
  "pre_apply": {
    "table_size": "93 MB",
    "indexes_present": [
      "awooop_conversation_event_pkey",
      "idx_conv_event_run",
      "idx_conv_event_subject",
      "uix_conv_event_dedup"
    ],
    "k3s_postgres_recovery_cpu_cores": 7.9277,
    "host_188_load5": 10.57
  },
  "post_apply": {
    "indexes_valid_ready": [
      "idx_awooop_conv_event_project_provider_event_recent",
      "idx_awooop_conv_event_project_provider_lower_recent",
      "idx_awooop_conv_event_project_provider_recent",
      "idx_awooop_conv_event_project_run_id_text_recent",
      "idx_awooop_conv_event_source_refs_alert_ids_gin",
      "idx_awooop_conv_event_source_refs_approval_ids_gin",
      "idx_awooop_conv_event_source_refs_event_ids_gin",
      "idx_awooop_conv_event_source_refs_fingerprints_gin",
      "idx_awooop_conv_event_source_refs_incident_ids_gin",
      "idx_awooop_conv_event_source_refs_sentry_issue_ids_gin",
      "idx_awooop_conv_event_source_refs_signoz_alerts_gin",
      "idx_conv_event_recent"
    ],
    "pg_stat_activity": {
      "idle": 35,
      "unknown": 5,
      "active": 1
    },
    "k3s_postgres_recovery_cpu_cores_after_first_readback": 1.0552,
    "k3s_postgres_recovery_top3_after_20_seconds": false,
    "host_188_load5_after_20_seconds": 10.04
  },
  "remaining_blockers": [
    "host_110_gitea_cpu_pressure",
    "harbor_110_repair_no_matching_runner:awoooi-host",
    "cd_4182_failure_or_waiting_backlog"
  ],
  "safe_next_step": "continue_110_gitea_queue_control_path_recovery_without_generic_runner_or_host_reboot"
}
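
A reader consuming this receipt might want to confirm the safety flags and the size of the CPU drop mechanically. The sketch below is one possible way to do that. The function name `summarize_receipt` and the idea of treating the five listed flags as "must be false" are assumptions made for illustration; only the field names come from the receipt.

```python
import json

# Flags the receipt records as false; the tuple itself is an assumption
# about which fields a reviewer would treat as safety invariants.
EXPECTED_FALSE_FLAGS = (
    "destructive_db_operation_performed",
    "drop_truncate_restore_performed",
    "service_restart_performed",
    "secret_value_read",
    "runner_token_read",
)


def summarize_receipt(receipt: dict) -> dict:
    """Summarize safety flags, new index count, and relative CPU drop."""
    operation = receipt["operation"]
    violated = [f for f in EXPECTED_FALSE_FLAGS if operation.get(f) is not False]
    before = receipt["pre_apply"]["k3s_postgres_recovery_cpu_cores"]
    after = receipt["post_apply"][
        "k3s_postgres_recovery_cpu_cores_after_first_readback"
    ]
    return {
        "violated_flags": violated,
        "new_index_count": len(receipt["post_apply"]["indexes_valid_ready"]),
        "cpu_drop_fraction": (before - after) / before,
    }


# Trimmed copy of the receipt (index list shortened to two entries for brevity).
TRIMMED_RECEIPT = json.loads("""
{
  "operation": {
    "destructive_db_operation_performed": false,
    "drop_truncate_restore_performed": false,
    "service_restart_performed": false,
    "secret_value_read": false,
    "runner_token_read": false
  },
  "pre_apply": {"k3s_postgres_recovery_cpu_cores": 7.9277},
  "post_apply": {
    "indexes_valid_ready": [
      "idx_conv_event_recent",
      "idx_awooop_conv_event_project_provider_recent"
    ],
    "k3s_postgres_recovery_cpu_cores_after_first_readback": 1.0552
  }
}
""")

summary = summarize_receipt(TRIMMED_RECEIPT)
print(summary["violated_flags"], round(summary["cpu_drop_fraction"], 3))
```

With the recorded values, the first readback corresponds to a drop of roughly 87% in `k3s-postgres-recovery` CPU cores. The host `load5` fell only from `10.57` to `10.04` within 20 seconds, which is expected for a five-minute average.
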
@@ -1,7 +1,7 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP

-> Version: v1.83
-> Last updated: 2026-06-30 Asia/Taipei
+> Version: v1.84
+> Last updated: 2026-07-01 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

---
@@ -22,6 +22,8 @@ v1.80 / v1.81 credential escrow intake scorecard rule: same-round owner response

v1.83 Gitea CD running retry rule: the output of `read-public-gitea-actions-queue.py --json` must be checked for both `latest_visible_cd_failure_classifier` and `latest_visible_cd_inflight_classifier`. Before the final `BLOCKER harbor_registry_public_route_unavailable` appears, as long as `latest_visible_cd_harbor_public_route_retrying_unavailable=true` and `latest_visible_cd_harbor_latest_registry_v2_status` is not `200/401`, treat it as in-flight production deploy blocker evidence. If the Harbor repair workflow is at the same time `Waiting` or has no matching `awoooi-host`, the next step is to restore the 110 local repair control path, rather than waiting for the CD timeout, rerunning an ineffective CD, using workflow_dispatch, or treating `Running` as proof that the version is up to date.
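
The v1.83 decision can be sketched as a small function. This is an illustration under stated assumptions, not the logic inside `read-public-gitea-actions-queue.py`: the function name, the two Harbor repair parameters, and the `next_step` labels are hypothetical, and only the two `latest_visible_cd_harbor_*` keys are taken from the rule text.

```python
# Registry /v2 statuses that the rule treats as "route reachable".
REACHABLE_REGISTRY_V2_STATUSES = {"200", "401"}


def classify_inflight_cd(
    queue: dict,
    harbor_repair_status: str,
    matching_runner_available: bool,
) -> dict:
    """Apply the v1.83 rule to one parsed queue snapshot.

    `queue` is assumed to be the parsed JSON output of the queue reader;
    the Harbor repair arguments are hypothetical inputs for this sketch.
    """
    retrying = (
        queue.get("latest_visible_cd_harbor_public_route_retrying_unavailable")
        is True
    )
    # Normalize to str so that 200 and "200" compare the same way.
    status = str(queue.get("latest_visible_cd_harbor_latest_registry_v2_status"))
    inflight_blocker = retrying and status not in REACHABLE_REGISTRY_V2_STATUSES

    repair_path_stuck = (
        harbor_repair_status == "Waiting" or not matching_runner_available
    )

    if inflight_blocker and repair_path_stuck:
        next_step = "restore_110_local_repair_control_path"
    elif inflight_blocker:
        next_step = "record_inflight_blocker_evidence"  # hypothetical label
    else:
        next_step = "no_inflight_blocker_evidence"  # hypothetical label
    return {"inflight_blocker": inflight_blocker, "next_step": next_step}
```

Note that none of the branches returns "wait for CD timeout" or "rerun CD", which mirrors the rule's list of actions to avoid.
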

2026-07-01 07:58 live host-pressure update: the sustained high CPU on 188 was not ordinary reboot noise but hot-path index drift on `awooop_conversation_event` inside `k3s-postgres-recovery`. The live DB had only `awooop_conversation_event_pkey`, `idx_conv_event_run`, `idx_conv_event_subject`, and `uix_conv_event_dedup` left, and was missing the base `idx_conv_event_recent` as well as the provider/source_refs hot-path indexes; at that time `k3s-postgres-recovery` used about `7.9277` CPU cores and 188 `load5=10.57`. A controlled DB migration was applied per `apps/api/migrations/awooop_conversation_event_hot_path_indexes_2026-07-01.sql`, executing only `CREATE INDEX CONCURRENTLY IF NOT EXISTS` with `lock_timeout=5s`, and with no DROP / TRUNCATE / restore / DB restart / Docker restart / secret read. The post-apply verifier showed all 12 new indexes with `indisvalid=true` and `indisready=true`; `pg_stat_activity` converged to active `1`, the first readback showed `k3s-postgres-recovery` down to about `1.0552` cores, and 20 seconds later it was no longer among the top 3 CPU containers on 188. Receipt: `docs/operations/awooop-conversation-event-hot-path-index-apply-receipt-2026-07-01.snapshot.json`. The high load on 110 is not the same DB problem: 110 `gitea` is still at about `3.4019` cores, and the public queue is still `blocked_harbor_110_repair_no_matching_runner` / `awoooi-host`. The next step is fixed as 110 Gitea queue / controlled lane recovery; the generic runner must not be restored and the host must not be rebooted.
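
One way to enforce the "only `CREATE INDEX CONCURRENTLY IF NOT EXISTS`" constraint before a controlled apply is a statement allowlist check. The sketch below is an assumption-laden illustration: the function name is hypothetical, the sample SQL is invented (the real migration's columns are not shown in this document), and the parsing is deliberately naive, splitting on `;` and stripping `--` comments without handling string literals.

```python
import re

# Statement prefixes this sketch permits; allowing the two SET statements is
# an assumption based on the lock_timeout / statement_timeout settings above.
ALLOWED_PREFIXES = (
    re.compile(
        r"^CREATE\s+(UNIQUE\s+)?INDEX\s+CONCURRENTLY\s+IF\s+NOT\s+EXISTS\b",
        re.IGNORECASE,
    ),
    re.compile(r"^SET\s+(lock_timeout|statement_timeout)\b", re.IGNORECASE),
)


def disallowed_statements(sql: str) -> list[str]:
    """Return every statement that is not on the allowlist."""
    without_comments = re.sub(r"--[^\n]*", "", sql)
    statements = [s.strip() for s in without_comments.split(";")]
    return [
        s for s in statements
        if s and not any(p.match(s) for p in ALLOWED_PREFIXES)
    ]


# Hypothetical sample; the column name created_at is illustrative only.
SAMPLE_MIGRATION = """
-- illustrative content, not the real migration file
SET lock_timeout = '5s';
SET statement_timeout = 0;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_conv_event_recent
    ON awooop_conversation_event (created_at DESC);
"""

print(disallowed_statements(SAMPLE_MIGRATION))
```

An empty result means every statement matched the allowlist; anything else (for example a `DROP INDEX`) would be returned for review instead of being executed.
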

v1.82 bounded summary rule: the SSH helpers in `post-start-quick-check.sh` and `188-host-hygiene-maintenance-checklist.sh` must have a command timeout, a single connection attempt, ServerAlive, and no password prompt. Whenever any 110 / 188 read-only control path hangs, it must be reduced to a blocker / evidence rather than letting `post-reboot-readiness-summary.sh` wait indefinitely. If backup / escrow evidence cannot be read, `ESCROW_MISSING_COUNT=unknown` must also emit `DR_ESCROW_BLOCKED=1` and `DR_ESCROW_EVIDENCE_UNKNOWN=1`, and put `backup_core_readback_recovery` and `credential_escrow_evidence` into `NEXT_REQUIRED_GATES`; unknown must not be interpreted as DR or backup green.
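
The two parts of the v1.82 rule can be illustrated as follows. This is a sketch, not the contents of the shell scripts named above: the function names, timeout values, and blocker labels are assumptions, while the `ssh -o` options (`BatchMode`, `ConnectionAttempts`, `ConnectTimeout`, `ServerAliveInterval`, `ServerAliveCountMax`) are standard OpenSSH client options.

```python
import subprocess
from typing import Optional


def bounded_ssh_argv(host: str, remote_command: str, connect_timeout_s: int = 5) -> list:
    """Build an ssh command line that cannot hang on prompts or retries."""
    return [
        "ssh",
        "-o", "BatchMode=yes",            # never prompt for a password
        "-o", "ConnectionAttempts=1",     # single connection attempt
        "-o", f"ConnectTimeout={connect_timeout_s}",
        "-o", "ServerAliveInterval=5",    # detect a dead session
        "-o", "ServerAliveCountMax=2",
        host,
        remote_command,
    ]


def run_bounded(host: str, remote_command: str, command_timeout_s: int = 20) -> dict:
    """Run a read-only check; turn a hang or failure into a blocker record."""
    try:
        done = subprocess.run(
            bounded_ssh_argv(host, remote_command),
            capture_output=True, text=True, timeout=command_timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"blocker": "ssh_command_timeout", "stdout": ""}
    except OSError:
        return {"blocker": "ssh_client_unavailable", "stdout": ""}
    if done.returncode != 0:
        return {"blocker": f"ssh_exit_{done.returncode}", "stdout": done.stdout}
    return {"blocker": None, "stdout": done.stdout}


def escrow_outputs(missing_count: Optional[int]) -> dict:
    """Map escrow evidence to summary keys; None means evidence was unreadable.

    Only the unknown case is specified by the rule; handling of known
    counts is intentionally left out of this sketch.
    """
    if missing_count is None:
        return {
            "ESCROW_MISSING_COUNT": "unknown",
            "DR_ESCROW_BLOCKED": "1",
            "DR_ESCROW_EVIDENCE_UNKNOWN": "1",
            "NEXT_REQUIRED_GATES": [
                "backup_core_readback_recovery",
                "credential_escrow_evidence",
            ],
        }
    return {"ESCROW_MISSING_COUNT": str(missing_count)}
```

The design point is that every failure mode returns a value: a timeout becomes a named blocker, and unreadable evidence becomes an explicit "unknown" with the gates attached, so the summary step never blocks and never reads silence as green.
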

2026-06-29 09:13 previous live summary: the `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260629-091918/summary.txt` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_SERVICE_WARNINGS=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=6`, `RUNTIME_ACTION_AUTHORIZED=0`, `NEXT_REQUIRED_GATES=credential_escrow_evidence`. This baseline has been superseded by the evidence collected after the 2026-06-30 20:18 all-host reboot and must no longer be used to claim that the system is currently green.