fix(recovery): point signoz route to live upstream
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled

Your Name
2026-07-01 18:49:01 +08:00
parent 3213f9016e
commit 81ff04d019
5 changed files with 27 additions and 5 deletions


@@ -1,3 +1,25 @@
+## 2026-07-01 - 18:47 SignOz public route source drift fix
+**Problems fixed on the main line**
+- The cold-start hard blockers have narrowed to SignOz public 502 / TLS and stale 188 MOMO daily sales data; this round handles SignOz first.
+- A live read-only probe shows that `https://signoz.wooo.work/` returns Nginx `502`; `192.168.0.110:8080` returns the SignOz UI with `200`; Docker on 110 shows the `signoz` container healthy with port `8080->8080`; 188 has no SignOz container; and `:8080` on 188 is only the nginx welcome page.
+- Both the live 188 Nginx and the repo templates proxy `signoz.wooo.work` to `127.0.0.1:3301`, but that upstream does not exist; the root cause is public gateway upstream drift.
+- The SignOz proxy in `infra/ansible/roles/nginx/templates/188-all-sites.conf.j2` and `188-internal-tools-https.conf.j2` now points to `http://192.168.0.110:8080`.
+- The SignOz upstream probe in `full-stack-cold-start-check.sh` is also changed to `http://192.168.0.110:8080/`, so that cold-start does not keep checking the nonexistent 188 localhost 3301 after the source is fixed.
+**Verification**
+- `curl http://192.168.0.110:8080/`: `200`.
+- `bash -n scripts/reboot-recovery/full-stack-cold-start-check.sh`: passed.
+- `bash scripts/reboot-recovery/reboot-recovery-readiness-audit.sh --no-color`: `PASS=199 WARN=1 BLOCKED=0`.
+**Not done / blockers**
+- The live apply was not performed: `/etc/nginx/sites-enabled/*.conf` on 188 is root-owned, and running `sudo -n nginx -t` as `ollama` returns `sudo: a password is required`.
+- The sudo password was not read, requested, or stored, and Nginx was not reloaded. The public `signoz.wooo.work` route therefore still needs a privileged apply/readback before it can be declared recovered.
+**Boundaries**: no host was rebooted; Docker / DB / K3s / firewall were not restarted; no secret / token / `.env` / raw sessions / SQLite / auth was read; GitHub / `gh` / GitHub API were not used.
+**Next step**: apply the SignOz upstream diff through a controlled Nginx apply path that has 188 sudo / console access, run `nginx -t` and then reload, and finally rerun the public route TLS check and cold-start.
 ## 2026-07-01 - 18:40 P0 cold-start readiness / 110 monitor parity convergence
 **Problems fixed on the main line**
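Reviewer note: the 18:47 entry names a validate, reload, read back sequence as the next step. A minimal sketch of that sequence follows, assuming a host where `sudo -n` works for the operator. The function name `apply_signoz_route`, the `RUN` dry-run hook, and the use of `systemctl reload nginx` are assumptions for illustration and are not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): validate the Nginx config, reload only
# if validation passes, then read back the public route status code.
# RUN is an assumed hook: set RUN=echo to print the commands instead of running them.
apply_signoz_route() {
  local run="${RUN:-}"
  # $run is intentionally unquoted so an empty value expands to nothing.
  # sudo -n never prompts; it fails fast when a password would be required.
  $run sudo -n nginx -t || { echo "BLOCKED nginx -t failed or sudo unavailable"; return 1; }
  $run sudo -n systemctl reload nginx || { echo "BLOCKED nginx reload failed"; return 1; }
  # Read back the public route; curl prints 000 when the request itself fails.
  $run curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 https://signoz.wooo.work/
}
```

Running `RUN=echo apply_signoz_route` prints the three commands without touching the host, which keeps the sketch inside the read-only boundary the entry describes.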


@@ -18,7 +18,7 @@ v1.79 active owner response template rule: in the same round, after the owner packet is generated, p
 v1.80 / v1.81 credential escrow intake scorecard rule: after the owner response preflight in the same round, the DR escrow gate must be converged with `scripts/reboot-recovery/post-reboot-credential-escrow-intake-scorecard.py --summary-file "$ARTIFACT_DIR/summary.txt" --owner-packet-file <owner-packets.json> --response-file <owner-response-template-or-candidate.json> --offsite-report-file <offsite-report.txt> --escrow-status-file <escrow-status.txt>`. The scorecard reads only sanitized artifacts; it must not read secret values, write markers, send owner requests, or open runtime gates. The placeholder readback is expected to be `STATUS=blocked_waiting_non_secret_credential_escrow_evidence`, `EFFECTIVE_ESCROW_MISSING_COUNT=5`, `OWNER_RESPONSE_RECEIVED_COUNT=0`, `OWNER_RESPONSE_ACCEPTED_COUNT=0`, `RUNTIME_GATE_COUNT=0`, and `CREDENTIAL_MARKER_WRITE_AUTHORIZED_COUNT=0`. If a qualifying redacted owner response is received later and preflight returns `ready_for_independent_reviewer_acceptance`, the scorecard should move to `STATUS=ready_for_independent_reviewer_acceptance`; even if the marker has not yet been written, it may only proceed to `independent_reviewer_acceptance_then_marker_dry_run` and must not write the marker directly or declare `DR_COMPLETE`.
-2026-07-01 18:40 latest live summary: after a full-host reboot, 10-minute automatic recovery still cannot be claimed, but the main blocker has shifted from 110/Harbor/Gitea source parity to actual runtime/data blockers. Repo-side readiness has been fixed to `PASS=199 WARN=1 BLOCKED=0`; `full-stack-cold-start-check.sh --monitor-read-only --no-color` returns `PASS=86 WARN=8 BLOCKED=2`. The 110 P0 section has read back Harbor / Gitea / Prometheus / Alertmanager / Sentry OK, legacy/direct runner fail-closed, controlled CD lane fail-closed, root restore source left `0`, storage clean, and textfile exporters fresh. The `wooo@192.168.0.110` command channel timed out before the backup in the same round, and a bounded retest then returned `SSH_COMMAND_OK`, so this must be recorded as intermittent 110 control-channel evidence; 110 SSH must not be declared permanently stable. The 110 live cold-start monitor has been synced with the existing installer; rollback evidence is kept at `/home/wooo/scripts/full-stack-cold-start-check.sh.before-p0-readiness-20260701-183215` and `/home/wooo/scripts/full-stack-cold-start-check.sh.before-hostkey-policy-20260701-183637`; the live hashes are `full-stack-cold-start-check.sh=e320c061f5afd31c2a682576218f549b683f25dafd43dd52acc13b6283b33712` and `cold-start-textfile-exporter.sh=c52ea4fe8dd58688a87c01ca6288f8f6050aeb82417852213db3e2be69b29568`. `verify-cold-start-monitor-deploy.sh` now judges only deploy parity (hash, host-key policy, monitor-up); runtime green is judged separately by cold-start / scorecard, so that source parity is not misreported as a deploy mismatch while SignOz/MOMO are still unrecovered. The 18:40 scorecard reads back `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `CORE_COLD_START_BLOCKED_GATES=2`, `CORE_COLD_START_FIRING_ALERTS=3`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, and `RECOVERY_STATE=CORE_NOT_READY_DR_OFFSITE_PENDING`. The current hard blockers are the `signoz.wooo.work` public 502 / TLS failure and `188 momo daily sales data stale beyond 3 days`; DR still lacks 5 pieces of credential escrow non-secret evidence. Must not be claimed: full-stack green, 10-minute full-service recovery, MOMO daily data up to date, SignOz public route normal, DR complete, or 110 SSH permanently stable. The next step is fixed as SignOz public route / TLS repair and MOMO source freshness readback; if 110 times out on commands again, use the local console / control-channel recovery package, without rebooting hosts or restoring the generic runner.
+2026-07-01 18:40 latest live summary: after a full-host reboot, 10-minute automatic recovery still cannot be claimed, but the main blocker has shifted from 110/Harbor/Gitea source parity to actual runtime/data blockers. Repo-side readiness has been fixed to `PASS=199 WARN=1 BLOCKED=0`; `full-stack-cold-start-check.sh --monitor-read-only --no-color` returns `PASS=86 WARN=8 BLOCKED=2`. The 110 P0 section has read back Harbor / Gitea / Prometheus / Alertmanager / Sentry OK, legacy/direct runner fail-closed, controlled CD lane fail-closed, root restore source left `0`, storage clean, and textfile exporters fresh. The `wooo@192.168.0.110` command channel timed out before the backup in the same round, and a bounded retest then returned `SSH_COMMAND_OK`, so this must be recorded as intermittent 110 control-channel evidence; 110 SSH must not be declared permanently stable. The 110 live cold-start monitor has been synced with the existing installer; rollback evidence is kept at `/home/wooo/scripts/full-stack-cold-start-check.sh.before-p0-readiness-20260701-183215` and `/home/wooo/scripts/full-stack-cold-start-check.sh.before-hostkey-policy-20260701-183637`; the live hashes are `full-stack-cold-start-check.sh=e320c061f5afd31c2a682576218f549b683f25dafd43dd52acc13b6283b33712` and `cold-start-textfile-exporter.sh=c52ea4fe8dd58688a87c01ca6288f8f6050aeb82417852213db3e2be69b29568`. `verify-cold-start-monitor-deploy.sh` now judges only deploy parity (hash, host-key policy, monitor-up); runtime green is judged separately by cold-start / scorecard, so that source parity is not misreported as a deploy mismatch while SignOz/MOMO are still unrecovered. The 18:40 scorecard reads back `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `CORE_COLD_START_BLOCKED_GATES=2`, `CORE_COLD_START_FIRING_ALERTS=3`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, and `RECOVERY_STATE=CORE_NOT_READY_DR_OFFSITE_PENDING`. The current hard blockers are the `signoz.wooo.work` public 502 / TLS failure and `188 momo daily sales data stale beyond 3 days`; DR still lacks 5 pieces of credential escrow non-secret evidence. The 18:47 source fix moved the SignOz source-of-truth from `127.0.0.1:3301` to the actual 110 upstream `192.168.0.110:8080` and changed the cold-start probe to check the same upstream; the live apply has still not been performed because `sudo -n nginx -t` on 188 returns `sudo: a password is required`. Must not be claimed: full-stack green, 10-minute full-service recovery, MOMO daily data up to date, SignOz public route normal, DR complete, or 110 SSH permanently stable. The next step is fixed as a privileged Nginx config apply/readback, or applying the SignOz route from the local console, followed by MOMO source freshness; if 110 times out on commands again, use the local console / control-channel recovery package, without rebooting hosts or restoring the generic runner.
 2026-06-30 22:55 latest live summary: after a full-host reboot, 10-minute automatic recovery still cannot be claimed. `SSH_COMMAND_TIMEOUT_SECONDS=8 SSH_BATCH_MODE=yes bash scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color`, artifact `/tmp/awoooi-cold-start-live-after-ff.log`, returns `PASS=68 WARN=4 BLOCKED=4`; the hard blockers are the 110 registry external `/v2`, the 110 SSH read-only check, K3s registry pull refused by `110:5000`, and SignOz TLS / 502. StockPlatform public freshness / ingestion at 22:50 still returns `status=not_configured` with `blockers=["postgres_not_ready"]`. The public Gitea queue at 22:55 returns `status=blocked_harbor_110_repair_no_matching_runner`; the latest CD `#4105` shows `Running`, but its build log already contains `latest_visible_cd_inflight_classifier=harbor_registry_public_route_unavailable_pending_retry`, `latest_visible_cd_harbor_latest_registry_v2_status=502`, `latest_visible_cd_harbor_public_route_retrying_unavailable=true`, and `harbor_controlled_repair_skipped=not_110_host`; Harbor repair is still `Waiting` and lacks an `awoooi-host` runner. Ruling: a CD in `Running` must not be treated as a neutral wait; if registry retries have returned 502 / 000 consecutively and repair has no 110 control path, handle it immediately as the 110 control path / Harbor `/v2` main-line blocker. Must not be claimed: full-service recovery, latest version deployed to production, Stock data up to date, backup core green, DR complete, or 188 hygiene green. The next step remains fixed as restoring the 110-local repair control path / Harbor `/v2`, then rerunning the post-reboot summary, cold-start, Stock freshness / ingestion, and the SLO scorecard.
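Reviewer note: the summaries above gate every claim on counters such as `PASS=86 WARN=8 BLOCKED=2`. A minimal sketch of reading one counter out of such a line and refusing to proceed while anything is blocked is shown here. The helper names `summary_value` and `has_no_blockers` are hypothetical, and the space-separated `KEY=value` layout is assumed from the lines quoted in this hunk.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one KEY=value token from a summary line.
summary_value() {
  local key="$1" line="$2"
  # Put each space-separated token on its own line, then strip the "KEY=" prefix
  # from the first token that carries it.
  printf '%s\n' "$line" | tr ' ' '\n' | sed -n "s/^${key}=//p" | head -n 1
}

# Hypothetical gate: succeeds only when the line carries BLOCKED=0.
has_no_blockers() {
  local blocked
  blocked="$(summary_value BLOCKED "$1")"
  [ -n "$blocked" ] && [ "$blocked" -eq 0 ]
}
```

Note that `BLOCKED=0` from the repo-side audit is not the same as runtime green: per the 18:40 summary, the live cold-start check still reports `BLOCKED=2`, so a gate like this would have to be applied to the cold-start line rather than the audit line.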


@@ -91,7 +91,7 @@ server {
 }
 location / {
-proxy_pass http://127.0.0.1:3301;
+proxy_pass http://192.168.0.110:8080;
 proxy_http_version 1.1;
 proxy_set_header Host $host;
 proxy_set_header X-Real-IP $remote_addr;


@@ -42,7 +42,7 @@ server {
 ssl_certificate_key /etc/letsencrypt/live/sentry.wooo.work/privkey.pem;
 location / {
-proxy_pass http://127.0.0.1:3301;
+proxy_pass http://192.168.0.110:8080;
 proxy_http_version 1.1;
 proxy_set_header Host $host;
 proxy_set_header X-Real-IP $remote_addr;
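Reviewer note: both Nginx templates carried the same stale upstream, so a drift check across rendered or template files could catch a third copy. A minimal sketch, assuming the stale value is the literal `127.0.0.1:3301` from this diff; the function name `find_stale_upstream` is hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every line in the given files that still references
# the old SignOz upstream. Exit status follows grep: 0 if any match, 1 if none.
find_stale_upstream() {
  local stale='127.0.0.1:3301'
  # -F: match a fixed string, -n: show line numbers, -H: always show the file name.
  grep -FnH -- "$stale" "$@"
}
```

For example, `find_stale_upstream infra/ansible/roles/nginx/templates/*.j2` would print any remaining references; an exit status of 1 means none were found.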


@@ -282,7 +282,7 @@ echo "SYSTEMD $(systemctl is-active containerd docker postgresql@14-main redis-s
 echo "PG $(pg_isready -h localhost -p 5432 2>&1)"
 echo "REDIS $(redis-cli -p 6380 ping 2>/dev/null || redis-cli ping 2>/dev/null || true)"
 echo "PORT5432 $(nc -z -w 2 127.0.0.1 5432 >/dev/null 2>&1 && echo OPEN || echo CLOSED)"
-echo "SIGNOZ_CODE $(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:3301/ || true)"
+echo "SIGNOZ_CODE $(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://192.168.0.110:8080/ || true)"
 echo "MOMO_HEALTH_CODE $(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:5003/health || true)"
 docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -80
 ' 2>&1); then
@@ -296,7 +296,7 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -80
 grep -q "accepting connections" <<<"$out" && ok "188 PostgreSQL accepting connections" || fail "188 PostgreSQL not accepting connections"
 grep -q "REDIS PONG" <<<"$out" && ok "188 Redis PONG" || warn "188 Redis not confirmed"
 grep -q "momo-db.*Restarting" <<<"$out" && warn "188 momo-db restarting" || ok "188 momo-db not in visible restart loop"
-grep -Eq "SIGNOZ_CODE (200|302|307)" <<<"$out" && ok "188 SignOz HTTP reachable" || warn "188 SignOz HTTP not confirmed"
+grep -Eq "SIGNOZ_CODE (200|302|307)" <<<"$out" && ok "SignOz UI upstream reachable from 188" || warn "SignOz UI upstream not confirmed from 188"
 grep -q "MOMO_HEALTH_CODE 200" <<<"$out" && ok "188 momo health reachable" || warn "188 momo health not confirmed"
 }
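Reviewer note: the check above treats `200`, `302`, and `307` as "reachable" and everything else, including curl's `000` on connection failure, as "not confirmed". The same decision can be sketched as a standalone function, which makes the accepted set easy to test without a network. The names `signoz_code_ok` and `probe_signoz` are hypothetical; only the URL, the curl flags, and the accepted codes come from the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the accepted codes in the grep above.
signoz_code_ok() {
  case "$1" in
    200|302|307) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: probe the upstream and print OK or WARN with the code.
# The default URL is the upstream introduced by this commit.
probe_signoz() {
  local url="${1:-http://192.168.0.110:8080/}" code
  # curl prints 000 via -w when the request fails; "|| true" keeps the
  # assignment from returning curl's non-zero exit status.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
  if signoz_code_ok "$code"; then
    echo "OK $code"
  else
    echo "WARN ${code:-000}"
  fi
}
```

Keeping the classification separate from the network call means the accepted set can be changed or tested in one place, while `probe_signoz` stays a thin wrapper around the same curl invocation the script already uses.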