46 Commits

Author SHA1 Message Date
ogt
3274607af8 fix(ops): expose momo source absence after reboot [skip ci] 2026-06-27 11:56:34 +08:00
ogt
89b9e67a41 fix(ops): harden reboot API warmup evidence flow
Some checks failed
Code Review / ai-code-review (push) Successful in 13s
E2E Health Check / e2e-health (push) Successful in 31s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-26 23:59:06 +08:00
ogt
18a35c5e62 fix(ops): avoid unknown stock blockers when fresh
Some checks failed
Code Review / ai-code-review (push) Successful in 13s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-26 23:26:57 +08:00
ogt
6afa3e4f35 ops(reboot): classify stock eod freshness window
Some checks failed
Code Review / ai-code-review (push) Successful in 19s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-26 18:24:42 +08:00
ogt
35dba35253 ops(reboot): persist summary evidence and classify warmup routes
Some checks failed
Code Review / ai-code-review (push) Successful in 14s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-26 17:56:13 +08:00
ogt
ec8377e732 ops(reboot): add post-reboot owner response preflight
Some checks failed
Code Review / ai-code-review (push) Successful in 14s
AI 技術雷達監控 / ai-technology-watch (push) Successful in 38s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-26 13:30:41 +08:00
ogt
71261c122e ops(reboot): close 188 hygiene and dynamic post-reboot gates
Some checks failed
Code Review / ai-code-review (push) Successful in 15s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-26 12:40:00 +08:00
ogt
be35ad5861 ops(reboot): guard post-reboot declarations [skip ci] 2026-06-26 11:28:26 +08:00
ogt
75c9314528 ops(reboot): include Wazuh detail in post-reboot summary [skip ci] 2026-06-26 08:54:00 +08:00
ogt
c45f274d5e ops(reboot): guard post-reboot owner packets [skip ci] 2026-06-26 08:45:52 +08:00
ogt
02bcf0a31e ops(reboot): add post-reboot owner packet JSON [skip ci] 2026-06-26 08:32:30 +08:00
ogt
a4ac7be310 ops(reboot): add post-reboot next gate dispatch [skip ci] 2026-06-26 08:22:32 +08:00
ogt
63545353dc ops(reboot): add post-reboot readiness summary [skip ci] 2026-06-26 07:50:36 +08:00
ogt
1c32053ffe ops(reboot): add 188 hygiene read-only checklist [skip ci] 2026-06-26 07:37:30 +08:00
ogt
6250a94b7e fix(ops): harden 188 startup data recovery gate
Some checks failed
Code Review / ai-code-review (push) Successful in 13s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-26 06:54:49 +08:00
ogt
bae6423d72 docs(ops): show escrow gaps in reboot quick check [skip ci] 2026-06-26 06:37:04 +08:00
ogt
482ff21af5 docs(ops): refresh reboot readback route retry [skip ci] 2026-06-26 06:33:04 +08:00
ogt
5e4887d15c fix(ops): gate reboot recovery on product freshness [skip ci] 2026-06-25 19:39:42 +08:00
ogt
4abd654e52 fix(ops): classify cold-start warning-only quick checks [skip ci] 2026-06-25 15:08:37 +08:00
ogt
c5d76eb360 chore(ops): clarify momo token metadata wording [skip ci] 2026-06-25 14:52:36 +08:00
ogt
65209cbbc1 docs(ops): record post-start wrapper live readback [skip ci] 2026-06-25 14:52:06 +08:00
ogt
37ab97d4e1 docs(ops): add executable post-start quick check [skip ci] 2026-06-25 14:52:06 +08:00
ogt
fc51a8f295 docs(ops): refresh momo preflight recovery evidence [skip ci] 2026-06-25 14:52:05 +08:00
ogt
d2854edcd8 docs(ops): add momo preflight and cpu triage evidence [skip ci] 2026-06-25 14:52:05 +08:00
ogt
6f5e22ba69 fix(ops): classify momo source absence in cold-start gate [skip ci] 2026-06-24 23:05:42 +08:00
Your Name
2b12f44547 docs(ops): add MOMO data freshness reboot gate [skip ci] 2026-06-24 02:51:28 +08:00
Your Name
8aeeadbde1 docs(ops): record heartbeat noise and cold-start detector closure [skip ci] 2026-06-24 02:19:30 +08:00
Your Name
ff18872a23 feat(ops): add host runaway process aiops guard
Some checks failed
Code Review / ai-code-review (push) Successful in 14s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Failing after 26s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-18 14:17:03 +08:00
Your Name
f89f59c647 fix(ops): distinguish stale failed Jobs in cold-start classification [skip ci] 2026-06-18 13:54:00 +08:00
Your Name
63d8361f2a docs(ops): converge reboot repo-side readiness blockers [skip ci] 2026-06-18 12:11:56 +08:00
Your Name
b997016991 docs(ops): lock down reboot Plan B mechanism checks [skip ci] 2026-06-18 11:50:53 +08:00
Your Name
cfb866d055 feat(governance): add agent market automation surfaces
Some checks failed
Ansible Lint / lint (push) Successful in 35s
CD Pipeline / tests (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Failing after 11s
2026-06-04 21:50:55 +08:00
Your Name
ae7b39d96a fix(ops): harden reboot recovery and backup alerts 2026-05-29 12:41:34 +08:00
Your Name
9b465ee140 ci(runner): drain legacy docker act runner safely
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
2026-05-21 18:53:45 +08:00
Your Name
587551c1f1 fix(ops): monitor full-stack cold-start gates
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 18s
2026-05-06 00:48:05 +08:00
Your Name
6e96623884 fix(ops): harden momo scheduler cold start gate
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
2026-05-06 00:15:14 +08:00
Your Name
0315c2b510 docs(ops): codify full stack cold start recovery
All checks were successful
Code Review / ai-code-review (push) Successful in 7s
2026-05-06 00:07:57 +08:00
Your Name
1dcc6d61dc fix(ops): retry cold-start HTTP probes
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
2026-05-05 22:56:57 +08:00
Your Name
a4e9a04982 fix(ops): harden cold-start schedule recovery
Some checks failed
Code Review / ai-code-review (push) Successful in 10s
run-migration / migrate (push) Successful in 7s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
2026-05-05 22:17:10 +08:00
Your Name
cb5ab900c4 fix(ci): preserve gitea runner jobs on shutdown
All checks were successful
Code Review / ai-code-review (push) Successful in 46s
2026-05-01 16:16:27 +08:00
OG T
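3f7a742683 fix(infra): Chief Architect review fixes: C1/I1/I2/I3/I4/S1
C1: Remove the plaintext password from deploy-to-110.sh; use an SSH key + sudoers NOPASSWD instead
I1: Add /var/lock/harbor-repair.lock to prevent the watchdog and startup from repairing concurrently
I2: docker compose stderr is no longer silenced (output now goes through tee -a log | while read)
I3: Wrap the watchdog while loop in a subshell + || true, so a subshell failure does not terminate the watchdog
I4: Capture the exit code of the key repair_harbor command (harbor-log startup)
S1: Post-repair verification wait changed from 5s/10s to 30s (harbor-core initialization needs enough time)
S2: docker ps now uses --filter status=exited instead of grep/awk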

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:18:41 +08:00
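Item I1 of commit 3f7a742683 names the lock file but not how it is taken. As a minimal sketch of one common approach, a non-blocking advisory `flock` lets whichever process arrives second skip the repair instead of running it in parallel. The real scripts are shell; the Python below and the `repair_lock` helper are assumptions made for illustration, and only the lock path comes from the commit message.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def repair_lock(path="/var/lock/harbor-repair.lock"):
    """Yield True if this process holds the repair lock, False if another does.

    Hypothetical helper: the lock path is from the commit message, the
    mechanism (advisory flock, non-blocking) is an assumption.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # someone else is already repairing
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap the repair in `with repair_lock() as held:` and return early when `held` is false, so the watchdog and the startup script cannot both run the repair at the same time.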
OG T
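66b12bf9eb fix(infra): fix the root cause of the Harbor Exited(128) race condition + resident self-healing harbor-watchdog
Root cause:
  When awoooi-startup-110.sh starts Harbor, the first compose up -d starts
  all containers at once. harbor-core/db/portal try to connect to syslog:1514
  (harbor-log not ready yet), fail and exit(128); restart:always retries until
  the backoff gives up. Even once harbor-log later becomes healthy, the other
  containers no longer retry.

Fix 1: startup-110.sh Harbor sequencing (4-phase strategy):
  Phase 1: Remove all Exited Harbor containers (breaks the backoff deadlock)
  Phase 2: Start only harbor-log
  Phase 3: Wait for harbor-log to be healthy (up to 90s)
  Phase 4: Start all components

Fix 2: harbor-watchdog.service (resident self-healing):
  Type=simple resident process that polls http://127.0.0.1:5000/v2/ every 60s
  Unhealthy → wait 5s and re-check → run the full Phase 1-4 repair
  Covers the "crash while running" scenario that the reboot sequencing fix cannot cover

Bug fix: curl -f treats HTTP 401 as a failure (exit 22);
  Harbor /v2/ normally returns 401 (authentication required), so use curl -s without -f

REBOOT-RECOVERY-SOP.md → v5.0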

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:13:21 +08:00
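The watchdog rule in commit 66b12bf9eb (401 from /v2/ counts as healthy, and an unhealthy result is re-checked after 5s before repair) can be sketched as below. This is an illustration in Python, not the actual watchdog, which is a shell service using curl; the function names are hypothetical, and treating 200 as healthy alongside 401 is an assumption. The URL and the 5s delay come from the commit message.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

# 401 means the registry answered and merely wants credentials.
HEALTHY_STATUSES = {200, 401}


def probe_registry(url: str = "http://127.0.0.1:5000/v2/",
                   timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the registry is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx, much like `curl -f`; keep the code.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def is_healthy(status: Optional[int]) -> bool:
    return status in HEALTHY_STATUSES


def should_repair(probe: Callable[[], Optional[int]] = probe_registry,
                  sleep: Callable[[float], None] = time.sleep,
                  confirm_delay: float = 5.0) -> bool:
    """Repair only if two probes, confirm_delay seconds apart, both fail."""
    if is_healthy(probe()):
        return False
    sleep(confirm_delay)
    return not is_healthy(probe())
```

Passing `probe` and `sleep` as parameters keeps the decision logic testable without a running registry; a polling loop would call `should_repair()` every 60s and trigger the Phase 1-4 repair when it returns true.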
OG T
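4ba62132e2 ops(startup): add Step 7 Sentry auto-start to startup-110.sh
Sentry was installed at /opt/sentry (2026-03-24) but did not start automatically after reboot
Add non-blocking startup + post-reboot corruption repair logic:
  - sentry-postgres WAL corruption → automatic repair with pg_resetwal -f
  - sentry-redis dump.rdb corruption → automatically deleted and rebuilt
  - non-blocking health verification 20s after startup

Root cause: after the 2026-04-05 reboot, the PostgreSQL WAL, Redis RDB, and ClickHouse parts were all corrupted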

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 03:09:20 +08:00
OG T
ad4abefcd9 fix(k8s+ops): repair the alerting chain + Gitea runner auto-start
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 21s
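## Fixes

1. Add 192.168.0.110 to NetworkPolicy allow-nginx-ingress
   - Alertmanager (on 110) needs to POST webhooks directly from 110 to the API pod
   - Before: 110 was blocked by the NetworkPolicy default-deny, so webhooks timed out
   - After: 110 is on the ingress allowlist and the alerting chain is restored

2. Add Gitea Act Runner to awoooi-startup-110.sh
   - Step 6: start /home/wooo/act-runner (gitea-runner container)
   - Before: the runner was offline after reboot and the CD pipeline failed entirely
   - After: the runner restarts automatically; if its configuration has expired, it is cleared and re-registered automatically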

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:42:52 +08:00
OG T
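c0c903dc48 fix(startup): add MinIO to the 188 startup script to resolve Velero BSL Unavailable
MinIO does not start automatically after reboot, leaving the Velero BackupStorageLocation Unavailable
Add MinIO docker compose up -d to the STEP 7 Docker Compose services section

⚠️ The Commander must manually run the following commands for the startup script on 188 to take effect: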
  sudo cp /tmp/awoooi-startup.sh /usr/local/bin/awoooi-startup.sh
  sudo chmod +x /usr/local/bin/awoooi-startup.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:52:13 +08:00
OG T
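f4f454fd98 feat(api): automatically warm up Redis Working Memory from PostgreSQL after reboot
- main.py lifespan: restore INVESTIGATING/MITIGATING incidents from the DB at startup
- scripts/reboot-recovery: automation scripts for 188 + 110 + systemd services
- scripts/reboot-recovery: create aiops-network automatically (ClawBot depends on it)
- docs/runbooks/REBOOT-RECOVERY-SOP.md: complete rewrite, including documentation of the automation scripts

Why: after reboot Redis was emptied, so the frontend showed 0 incidents (the DB retained everything)
Commander approval: "All data must be recorded for the long term"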

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:39:20 +08:00
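The warm-up described in commit f4f454fd98 amounts to copying still-active incidents from the durable store back into the cache at startup. A minimal sketch of that filter-and-restore step follows, assuming rows arrive as dictionaries with `id` and `status` fields and that the cache behaves like a mapping. The function name, key prefix, and row shape are hypothetical; only the two status values come from the commit message, and the real code in main.py is not shown in this log.

```python
import json
from typing import Iterable, Mapping, MutableMapping

# Statuses named in the commit message as the ones restored at startup.
ACTIVE_STATUSES = {"INVESTIGATING", "MITIGATING"}


def warm_up_working_memory(rows: Iterable[Mapping],
                           cache: MutableMapping[str, str],
                           key_prefix: str = "incident:") -> int:
    """Copy active incidents into the cache; return how many were restored.

    `rows` stands in for a PostgreSQL query result and `cache` for a Redis
    client; both shapes are assumptions made for this sketch.
    """
    restored = 0
    for row in rows:
        if row["status"] not in ACTIVE_STATUSES:
            continue  # closed incidents stay only in the database
        cache[f"{key_prefix}{row['id']}"] = json.dumps(dict(row), sort_keys=True)
        restored += 1
    return restored
```

Because the database remains the source of truth, running this on every startup is safe: restoring an incident that is already cached just overwrites the same key.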