Commit Graph

58 Commits

Author SHA1 Message Date
Your Name
c5753e1c57 fix(critic-review): KMWriter 名實統一 + Alertmanager 修抑制 + drift checker AST 化
critic PR review 揭示已 push commits 的 7 個 blocker,本 commit 全部修復。

## C1 + C2 + M1 + M2 + M3 — KMWriter 真正統一契約(critic 最嚴重 5 條)

### C1 km_writer.py:194 — backfill 自打臉修
- 裸 asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- 同步 await + 獨立 DLQ km:backfill:dlq + try/except 不阻塞主寫入
- 新增 km_backfill_reconciler_job.py(每 5 分鐘掃 DLQ)+ ENABLE_KM_BACKFILL_RECONCILER flag
- 防 Path B 比 Path A 先完成 → related_approval_id 永遠 NULL 的 race

### C2 km_writer.py:391 — KM_WRITE_AWAIT=false 路徑收緊
- 從 ensure_future(fire-and-forget 比舊版同步寫更糟)
- 改 await writer.write(retry=1, timeout=2.0)(仍 await 但只試一次、超時短)
- docstring 明確標註「緊急回滾用,不保證可靠性」

### M1 decision_manager.py:2178/2203 — 移除 _fire_and_forget 旁路
- 兩處 _fire_and_forget(executor.write_execution_result_to_km(...))
- 改 await asyncio.shield(...) + BaseException 保護(防上層 cancel 中斷)
- KM_WRITE_AWAIT=true 在這條路徑終於真正 await

### M2 incident_service.py:1099 — 自製 path 加 retry+DLQ
- 原本 if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task
- 改 3 次指數退避 retry + DLQ 保護(呼叫 km_writer 私有 helper)

### M3 km_writer.py:166 — 冪等聲明對齊實作
- knowledge_repository.create() 加 UPSERT 路徑(pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord 加 path_type 欄位
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path

## M4 alertmanager.yml — equal: [] 收緊(critic 防爆炸抑制)
- OllamaInstanceDown / KMConverterDown 抑制加 equal: ['cluster'] 約束
- 防多 cluster 場景下任一 Ollama down 誤抑全 AI/SLO 告警

## M5 Alertmanager 版本驗證(已確認 v0.31.1,遠超 v0.22+)

## M6 governance_agent.py — health score 區分 skipped vs ok vs violated
- check_slo_compliance 加 _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: SLO 全 skipped 時獨立發 governance_slo_data_gap 告警
  (不污染 self_failure 計數,因為 no_data 是 emitter 未實作不是治理機制故障)

## M7 scripts/check_config_drift.py — 改 AST 解析
- regex 改 ast.parse 找 Settings ClassDef AnnAssign Field(default=...)
- 避免多行 list / default_factory= / 含跳行字串的 false negative
- 4 欄位(AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL)全對齊

## 新增測試
- test_km_writer_backfill_reconciler.py: 7 cases(C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases(M3 path_type 注入 + UPSERT 分支)

## 驗證
- 1585 unit tests 全綠(+13 從 1572)
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker AST-based 4 欄位全對齊
- Alertmanager v0.31.1 確認支援新語法

## 期望影響
- KMWriter 名實統一:飛輪閉環 KM 寫入路徑 100% 可靠
- M4 抑制爆炸風險解除
- 治理層不再對 SLO no_data 靜默
- drift checker false negative 風險解除

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
Your Name
715dc3cb91 fix(observability): P0 假警報止血 + ConfigMap drift 對齊 + 治理工具
12-Agent 全景診斷觸發的 P0/P1 觀測層修復。

## P0 假警報止血(4 SLO 雪崩根因)
- governance_agent.py:306 — 空 result 不再 fallback 0.0,改 continue + log warning
  根因:Prometheus 查無資料(emitter 未實作 / rule 未部署)被誤判為 SLO=0
  必觸發 violated=True 噴 4 條假告警

## P0 鬼魂按鈕守門
- telegram_gateway.py:1654 — LLM 動態按鈕 Redis 失敗時 btn_list.clear()
  first_row(批准/拒絕,HMAC nonce 無狀態)由 caller 1488 永遠保留
  feedback_no_ghost_buttons.md 三缺一鐵律對齊

## ConfigMap drift 修復(3 處)
- config.py:683 PROMETHEUS_URL: 188→110(drift checker 揪出 = SPF-4 部分根因)
- config.py:705 ARGOCD_URL: 125→121(T0 G3 已知)
- config.py:375 AI_FALLBACK_ORDER: 補 nvidia 對齊 ConfigMap

## P1 Alertmanager 升級(amtool SUCCESS)
- ops/alertmanager/alertmanager.yml: deprecated → v0.27+ 新語法
  - match/match_re → matchers
  - source_match/target_match → source_matchers/target_matchers
  - group_by 加 team label(防 SLO 雪崩 4 條同秒推)
  - PostgreSQL/Redis inhibit 補 equal: ['instance'](防爆炸抑制)
- 新增 3 組因果抑制:
  - OllamaInstanceDown → SLO_*/AI_*(30 分鐘)
  - KMConverterDown → SLO_KMGrowthRate*
  - SLO_*_FastBurn → SLO_*_(Medium|Slow)Burn

## 治理工具落地
- scripts/check_config_drift.py: ConfigMap vs code default drift 檢測
  揪出 PROMETHEUS_URL drift 是 SPF-4 根因(governance_agent 連 188 而非 110)
- scripts/health_check_session.sh: 11 服務 + 4 SSH + drift + git 全景驗證

## 驗證
- 1552 unit tests 全綠
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker 4 欄位全對齊
- health check 11 服務全可達

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
Your Name
9e9bd8679f fix(aider-watch): code-review fixes (4 issues)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. aiderw: session_end 補 model+cwd (AI Router feedback loop 修通)
2. repository: model_stats_since SQL 改 COALESCE(session_end, session_start) model
3. aider_event_service: classify_severity 移除 error_count 觸發告警(防假陽性)
4. worker: run_aider_event_processor_loop 包 proc.start() try/except(防靜默崩潰)

2026-04-20 @ Asia/Taipei
2026-04-21 00:59:21 +08:00
Your Name
1744b1e923 fix(aider): stdlib logging → structlog + typing-extensions dep (E2E修復)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- aider_events.py: logging.getLogger → structlog.get_logger (keyword args compatible)
- pyproject.toml: add typing-extensions>=4.0 (python-ulid 3.x requires Self)

2026-04-20 @ Asia/Taipei
2026-04-20 19:59:35 +08:00
Your Name
ce918ee44e feat(client): B5 install.sh + launchd aider-flush plist
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m18s
Mac 端安裝腳本:pipx install aider-watch-client → symlink 到 /opt/homebrew/bin →
驗 ~/.aider-watch.env 必要 key → 建 ~/aider-watch 工作目錄 →
載 launchd com.awoooi.aider-flush(每 5min flush buffer)→ 跑 aider-watch doctor。

走 a 路線(LAN direct AIDER_API_URL=http://192.168.0.120:32334/api/v1/aider/events)。
全景檢查:家用場景,B3 buffer + 5min flush 已覆蓋短暫斷網,無需 Tailscale。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:40:02 +08:00
Your Name
b7d612526a chore(client): gitignore egg-info + remove accidentally committed generated files 2026-04-20 19:40:02 +08:00
Your Name
36610e2744 feat(client): Mac aider-watch client (B1-B4: scaffolding + api_client + buffer + aiderw) 2026-04-20 19:40:02 +08:00
OG T
2d43751729 feat(ops): ADR-090-B 零信任收尾範本 — wrapper / sudoers / migrator / CI
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 12m17s
run-migration / migrate (push) Failing after 14s
2026-04-18 台北時區 —— ogt + Claude Opus 4.7 (1M)

本 commit 響應本 Session 兩次憑證外洩事故
(feedback_secrets_leak_incidents_2026-04-18.md),
交付統帥可直接部署的零信任基礎設施範本.

檔案清單:

1. scripts/host-ops/awoooi-hosts-add.sh
   - 110 主機 /etc/hosts 白名單 wrapper
   - 只允許預定義主機名,idempotent,帶 IP 格式驗證
   - 安裝: /usr/local/bin/awoooi-hosts-add (root:root 0755)

2. scripts/host-ops/awoooi-wrapper.sudoers
   - 配套 sudoers 規則 (NOPASSWD for wrapper + SIGHUP only)
   - 安裝: /etc/sudoers.d/awoooi-wrapper (root:root 0440)
   - 禁 tee / bash / sh 這類 generic shell access

3. apps/api/migrations/adr090b_awoooi_migrator_role.sql
   - PG 限權角色 awoooi_migrator
   - 只能 DDL (CREATE/ALTER/DROP/INDEX/COMMENT)
   - 明確 REVOKE 所有 DML + default privileges 鎖死
   - 本檔由統帥執行 (需 superuser),不由 Claude 執行

4. k8s/awoooi-prod/awoooi-migrator-secret.template.yaml
   - K8s Secret patch 範本
   - 新增 MIGRATION_DATABASE_URL key (awoooi_migrator 連線串)
   - 與應用 DATABASE_URL 拆開

5. .gitea/workflows/run-migration.yml
   - CI 自動套用新 migration (單 transaction + ON_ERROR_STOP)
   - 用 Gitea secret MIGRATION_DATABASE_URL,不走明碼
   - 每次成功寫一筆 asset_discovery_run (audit trail)

零信任三層防線 (對應 feedback_secrets_leak_incidents):
  L1 對話無密碼 -> wrapper 內建白名單
  L2 操作經 wrapper -> sudoers + awoooi_migrator
  L3 顯示強制遮蔽 -> CI 走 secret,不走 env

本 Session 發現的 3 次憑證外洩全部在 feedback_secrets_leak
memory 登記,並有對應 P0 輪替計畫.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:39 +08:00
OG T
fb1d101902 fix(backup): HostBackupFailed P1 根治 — Prometheus textfile 指標 + docker socket 讀取
問題一:backup_110_last_success_timestamp 指標從未存在
根因:腳本只寫純文字 last_success 檔,從未輸出 .prom 格式
修復:成功時寫入 /home/ollama/node_exporter_textfiles/backup.prom
      node_exporter 新增 --collector.textfile.directory=/textfile_collector
      volume: /home/ollama/node_exporter_textfiles:/textfile_collector

問題二:Harbor/Gitea rsync 權限拒絕
根因:/var/lib/docker/volumes/ 是 710 root:root,docker group 無法直接存取 FS 路徑
修復:改用 docker run --rm -v <volume>:/source alpine tar czf -
      透過 docker socket(wooo 已在 docker group)讀取 volume 內容再解壓

驗證:備份腳本三項全 OK,node_exporter 9100/metrics 正確輸出指標
      Prometheus absent(backup_110_last_success_timestamp) 應在下次 scrape 後清除

2026-04-18 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 10:37:23 +08:00
OG T
5fe049de55 fix(backfill): 補充 ADR-075 三種新分類 (secops/flywheel_health/business)
_classify_alert() 與 classify_alert_early() 規則對齊,
確保回填腳本正確分類存量 incidents。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 19:13:07 +08:00
OG T
a28625f088 fix(cr): 首席架構師 CR P0/P1/P2 全修補
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
P0-1: incident_service.py — 刪除 classify_alert_early 死碼 L131-132
P0-2: cron_backup_restore_test.sh — date +%s%3N→+%s,修正毫秒時間戳
P1-2: gitea_webhook.py — fingerprint 移除 sha_short,收斂同 branch 失敗
heartbeat: 還原原始空格對齊格式(統帥要求原本怎樣就怎樣)

P1-1(積木化)/P1-3(TYPE-4)/P2-1(timeZone)/P2-2(IP)/P2-3(WS重連) 待後續處理

2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 16:10:46 +08:00
OG T
c1c96ab47b feat(m4): ADR-074 M4 — 備份還原週排程驗證 CronJob
- scripts/cron_backup_restore_test.sh: Velero restore dry-run 腳本
- k8s/awoooi-prod/16-cronjob-backup-restore-test.yaml: 每週日 02:00 台北執行
- k8s/awoooi-prod/17-configmap-backup-restore-scripts.yaml: 腳本 ConfigMap
- flywheel-alerts.yaml: BackupRestoreTestFailed + BackupRestoreTestStale 告警

失敗時寫入 node-exporter textfile → Prometheus 告警 → TYPE-3 Incident

2026-04-12 ogt (ADR-074 M4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:36:30 +08:00
OG T
0d239838b4 fix(cr): Code Review P2 — 測試覆蓋 + CronJob 腳本重構
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
P2-1: CronJob inline Python 抽成 scripts/cron_km_vectorize.py
      Dockerfile 加入 COPY scripts/,CronJob YAML 改用腳本路徑
P2-2: 新增 test_classify_alert_early.py — 23 tests 覆蓋 7 條分類規則
      含邊界情況:VeleroBackupFailed(backup優先於k8s)、優先順序驗證

595 unit tests passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:14:44 +08:00
OG T
5b956a9a47 feat(flywheel): Phase 2-3/2-5 — auto_repair outcome 寫入 + 134 筆 alertname 回填腳本
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 2-3: _try_auto_repair_background() 修復執行後寫入 Incident.outcome
- effectiveness_score: 5(成功) / 2(失敗)
- human_feedback: auto_repair:<playbook_id>:success|failed
- should_remember: True(成功) → KMConversionService 飛輪入口
- 讓 KMConversionService 可依 outcome 判斷 EXECUTION_SUCCESS

ADR-073 Phase 2-5: scripts/backfill_alertname.py
- UPDATE incidents SET alertname = COALESCE(signals->0->>'alertname', signals->0->>'alert_name')
- 已在 Pod 執行:134 筆 NULL → 0 筆 (2026-04-12 ogt)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:33:11 +08:00
OG T
6dc03c9a55 fix(argocd)+feat(flywheel): Phase 1 完成 — ArgoCD image 斷路修復 + 冷啟動腳本
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. k8s/argocd/awoooi-prod-app.yaml:
   移除 Deployment image ignoreDifferences
   - 原設計造成 CD 更新 kustomization.yaml 後 ArgoCD 不更新 image
   - 修復後 GitOps 閉環恢復正常

2. scripts/cold_start_playbooks.py:
   ADR-073 Phase 1 Step 8 — 生成 15 個基礎 Playbook (K8s/Docker/DB/Infra)
   執行結果: Playbooks 0 → 15

3. scripts/batch_vectorize_km.py:
   ADR-073 Phase 1 Step 9 — 批次向量化 KM
   執行結果: 711/713 embedding IS NOT NULL

Phase 1 全部完成,飛輪已解封:
- Pod 運行 105998d(含 8be87b0 所有修復)
- debounce 30min + alertname NULL 修復 + _collect_mcp_context 啟用
- 15 Playbooks + 711 KM 向量化

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:20:52 +08:00
OG T
de055778b3 fix(cd): CD_PUSH_TOKEN + backup 路徑使用 BACKUP_ROOT 環境變數
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- cd.yaml: GITEA_CD_TOKEN → CD_PUSH_TOKEN(Gitea 保留 GITEA_ 前綴)
- ADR-069: 同步更新 token 名稱說明
- backup-from-110.sh: 改用 BACKUP_ROOT 環境變數(預設 /home/ollama/backup/110)
  避免 /var/log /var/run 需要 root 權限
- 已部署到 188 + cron 0 1 * * * 設定完成

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:07:47 +08:00
OG T
43edff184d feat(dr): Sprint C — Host rsync 備份 + DR SOP 文件
C-1 Velero: 已確認運作中(daily-awoooi-prod schedule, 13d, MinIO Available)

C-2 Host rsync 備份:
  scripts/ops/backup-from-110.sh — 188 每日凌晨 1:00 rsync 備份 110
    - Harbor registry data(最高優先)
    - Gitea repos
    - bitan-pharmacy.git(若存在)
    - 成功寫入 /var/run/backup-110.last_success 供 Prometheus 監控
    - 失敗時 Telegram 告警
  ops/monitoring/alerts-unified.yml — 新增 HostBackupFailed 告警規則

C-3 DR SOP 文件:
  docs/runbooks/disaster-recovery/DR-K8s-awoooi.md  (<15分鐘)
  docs/runbooks/disaster-recovery/DR-Nginx.md        (<5分鐘)
  docs/runbooks/disaster-recovery/DR-Harbor.md       (<30分鐘)
  docs/runbooks/disaster-recovery/DR-Bitan.md        (<5分鐘)
  docs/runbooks/disaster-recovery/DR-Stock.md        (<5分鐘)

部署備份腳本說明 (需手動執行):
  scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh
  ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}"
  ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 03:04:18 +08:00
OG T
49bfbd573c feat(test): B5 整合測試框架 — 真實 DB, 5/5 通過
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m34s
新增:
- docker-compose.test.yml: CI 用臨時 pgvector PostgreSQL (port 15432)
- tests/factories.py: Incident/Approval/Knowledge/RAG 測試資料工廠
- tests/integration/test_b5_core_flows.py: 5 個 E2E 整合測試 (5/5 PASSED 1.03s)
- tests/integration/setup_test_schema.sql: CI schema 初始化 SQL
- cd.yaml: 新增 Integration Tests B5 step
- scripts/sync_dev_db.py: dev DB 同步工具

修正:
- .env.test: DATABASE_URL 指向 awoooi_dev (本機設定, gitignore 不入庫)

禁止 Mock 鐵律: 所有 DB 測試使用真實 PostgreSQL, 無 SQLite/MagicMock

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:22:57 +08:00
OG T
07570c3b85 feat(rag): 初始索引腳本 — ADR+Runbook 批次餵入 pgvector
scripts/rag_index_docs.py: 批次呼叫 /knowledge/rag/index
支援 --api-url 參數,含 0.5s 節流避免 Ollama 過載

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 09:59:13 +08:00
OG T
8d0042ed29 feat(ops): Sprint 5.2 docker-health-monitor 升級為自動修復模式
舊版: 純感知層 (L4-6),只送 Webhook,修復由 API 執行
新版: 感知 + 自動修復 + 回報

修復分級 (ADR-060):
- 一般容器: docker restart
- 監控棧 (prometheus/grafana/alertmanager): docker start (保護 WAL)
- DB/Redis/ClickHouse: 僅告警,禁止重啟

已部署到:
- 192.168.0.110 ~/awoooi-ops/docker-health-monitor.sh
- 192.168.0.188 ~/awoooi-ops/docker-health-monitor.sh
- 兩台 cron */5 * * * * 運行中

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:59:11 +08:00
OG T
5ead01abf7 feat(ops): dr-drill.sh — 每月 DR Drill 自動演練
每月第一個週日 03:00 (121 cron) 執行:
1. 找最新 Velero backup (Completed)
2. 還原到 awoooi-dr-test namespace
3. 等待 Pod Ready + API health 驗證
4. 清理 dr-test namespace + restore 資源
5. Telegram 通知 PASS/FAIL + 耗時

支援 --dry-run 模式 (只檢查 backup,不還原)。
dry-run 驗證通過: daily-awoooi-prod-20260409020003

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 10:42:12 +08:00
OG T
ec4ebaf310 fix(ops): pg-backup momo_analytics 改用 docker exec (無對外 port)
momo-db 容器無 port binding,TCP 127.0.0.1:5432 連到 host PG 非容器。
改用 docker exec momo-db pg_dump,實際備份 91M。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:57:05 +08:00
OG T
f98be41517 feat(ops): pg-backup.sh — PostgreSQL 每 6h 自動備份
備份目標 (188):
- awoooi_prod (host PostgreSQL, TCP 127.0.0.1)
- momo_analytics (momo-db 容器)

功能:
- gzip 壓縮,保留 7 天自動清理
- Telegram 通知 (成功/失敗)
- cron 0 */6 * * * 已設定

驗證: 兩個 DB 備份成功 (awoooi_prod 206K, gz 完整)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:16:21 +08:00
OG T
0e6c4b83d4 feat(ops): docker-health-monitor 完成部署 110+188
- 增加 EXCLUDE_CONTAINERS 排除清單(signoz init containers)
- max-time 30→60 支援 API 首次 AI 分析所需時間
- 110: wooo/awoooi-ops, cron */5, secrets.env 已設定
- 188: ollama/awoooi-ops, cron */5, secrets.env 已設定
- 驗證: 188→API webhook 200, Telegram 已收到告警

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:59:45 +08:00
OG T
d80153bdce fix(openclaw): NIM 完全失敗後 fallback 到 Gemini 產生執行方案
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m34s
NIM tool calling 多次 timeout 後,不再顯示空白執行方案,
改由 Gemini 代理產生 kubectl 操作指令(JSON 解析)。
只有 NIM 完全失敗才觸發,符合統帥「必須等到有回應」原則。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:55:25 +08:00
OG T
170ce2f11d fix(ci): 修正測試與 Sprint 5.2 部署腳本
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m38s
tests/test_auto_repair_service.py:
  - 更新 3個測試符合 2026-04-07 統帥指令移除門檻
  - APPROVED Playbook 直接通過 (低相似度/低品質/高風險均通過)

tests/test_phase22_nemotron_collab.py:
  - 更新 log key: nemotron_collaboration_failed → exhausted

ops/monitoring/docker-compose.exporters.yaml:
  - 修正 postgres DSN: awoooi:awoooi_prod_2026@localhost:5432/awoooi_prod

Sprint 5.2 新增腳本:
  - scripts/sprint51_e2e_validation.py: L7 E2E 驗收腳本 (T1-T5)
  - scripts/ops/deploy-docker-health-monitor.sh: Plan A 一鍵部署腳本

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:17:48 +08:00
OG T
88696dba9b feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
  - k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader

Layer 1 - DB Migration (已在 188 執行):
  - M-002: approval_records 新增 approval_level/votes/required_votes
  - M-003: alert_event_type ENUM 新增 8 個值

Layer 2 - IaC:
  - ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)

Layer 3 - Python Services:
  - service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
  - velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
  - preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)

Layer 1-M001 - Playbook model:
  - playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup

Layer 4 - 業務邏輯:
  - alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
  - auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
  - webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
  - db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
  - docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)

Layer 5 - Telegram 通知:
  - telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)

參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:24:09 +08:00
OG T
a421d2c5b8 feat(ops): Plan A docker-health-monitor.sh — Docker 容器健康監控自動修復
- 偵測 unhealthy / exited / dead 容器
- 排除清單: DB(PG/Redis)、Gitea、監控棧
- Prometheus/Grafana/Alertmanager exited → docker start (保護 WAL)
- 必須三段式通知: Intent→Action→Result (首席架構師裁示)
- HMAC-SHA256 簽章 → AWOOOI API /api/v1/webhooks/custom-alert
- Fallback: API down → 直接 Telegram Bot API
- 冷卻期 300s,防止重複修復

部署: cron */5 * * * * on 192.168.0.110 + 192.168.0.188
設定: /etc/awoooi-ops/secrets.env

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 11:48:39 +08:00
OG T
8235f91bc6 fix(scripts): generate-schemas 同時加入 apps/api 和 apps/api/src 到 sys.path
Some checks failed
Type Sync Check / check-type-sync (push) Failing after 56s
問題: CI type-sync-check 持續失敗
原因: 只加 apps/api/src 不夠,模型檔內部用 from src.utils.X import Y
     需要 apps/api 在 path 才能解析 src 套件
結果: 51 個型別全部正確生成

# 2026-04-06 ogt: fix CI type-sync blocking deployment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:00:18 +08:00
OG T
b133631b2d feat(scripts): Phase 26 補寫腳本 — 從 approval_records 反向建立 KM
225 筆歷史告警處理記錄全部補寫到 knowledge_entries (INCIDENT_CASE)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:47:47 +08:00
OG T
4b24ecd67f fix(sprint3): 首席架構師 Review C1/C2/C3/M3/m1 修正
C1: _ssh_execute 直接接收 key_path 參數,不反查 LAYER_SSH_CONFIG
C2: PlaybookService.create() proxy,Router 不再穿透呼叫 _repository
C3: CD Step 1b sed 替換 IMAGE_TAG_PLACEHOLDER,消除失敗中斷風險
M3: repair-bot 110/188 regex 統一 [a-z0-9][a-z0-9-]{0,30},禁止底線
m1: defaultMode 0400 加八進位說明注釋
m2: _ssh_execute 用 deadline 計算剩餘 timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:07:59 +08:00
OG T
b688eeecb7 fix(ops): seed 腳本支援 API_BASE 環境變數
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:23:55 +08:00
OG T
3f7a742683 fix(infra): 首席架構師 Review 修正 — C1/I1/I2/I3/I4/S1
C1: 移除 deploy-to-110.sh 密碼明文,改用 SSH key + sudoers NOPASSWD
I1: 加入 /var/lock/harbor-repair.lock 防止 watchdog 與 startup 並行修復
I2: docker compose 的 stderr 不再靜默(改用 tee -a log | while read 輸出)
I3: watchdog while loop 包在子 shell + || true,子 shell 異常不終止 watchdog
I4: repair_harbor 關鍵指令(harbor-log 啟動)加入退出碼捕捉
S1: 修復後驗證等待從 5s/10s 改為 30s(harbor-core 初始化需要足夠時間)
S2: docker ps 改用 --filter status=exited 取代 grep/awk

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:18:41 +08:00
OG T
66b12bf9eb fix(infra): 根治 Harbor Exited(128) Race Condition + harbor-watchdog 常駐自愈
問題根因:
  awoooi-startup-110.sh 在 Harbor 啟動時,第一次 compose up -d 會同時
  啟動所有容器。harbor-core/db/portal 嘗試連 syslog:1514(harbor-log 未就緒),
  失敗後 exit(128),restart:always 重試直到 backoff 放棄。
  即使後來 harbor-log healthy,其他容器已不再重試。

修復 1 — startup-110.sh Harbor 時序(4 Phase 策略):
  Phase 1: 清除所有 Exited Harbor 容器(打破 backoff 死鎖)
  Phase 2: 只啟動 harbor-log
  Phase 3: 等 harbor-log healthy(最多 90s)
  Phase 4: 啟動全組件

修復 2 — harbor-watchdog.service(常駐自愈):
  Type=simple 常駐進程,每 60s 輪詢 http://127.0.0.1:5000/v2/
  不健康 → 等 5s 再確認 → 執行 Phase 1-4 完整修復
  修復重開機時序問題無法覆蓋的「運行中崩潰」場景

Bug Fix:curl -f 會把 HTTP 401 視為失敗(exit 22),
  Harbor /v2/ 正常回傳 401(需認證),改用 curl -s 不加 -f

REBOOT-RECOVERY-SOP.md → v5.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:13:21 +08:00
OG T
76f7330c9d feat(api): POST /playbooks/ 建立端點 + seed-repair-playbooks.py (Task 14)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 57s
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been skipped
- playbooks.py: 新增 POST / 端點供直接建立 Playbook (seed/管理用)
- seed-repair-playbooks.py: 5個 Host Repair Playbooks (ssh_command)
  sentry/harbor/gitea/alertmanager (110) + openclaw (188)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:53:49 +08:00
OG T
892c5d53a7 k8s(secret): 加入 repair SSH key 建立說明 template
實際私鑰透過 kubectl create secret 手動建立,不上 Git
主機 110 (wooo) / 188 (ollama) 已設定 command= 受限 authorized_keys
SSH health check 驗證: REPAIR_BOT_HEALTHY:110 / REPAIR_BOT_HEALTHY:188

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:17:57 +08:00
OG T
f51bf5a6a8 feat(backup): 全服務備份覆蓋 + 告警機制 — 9/9 服務完整
新增備份(已部署到 110,首次執行全部通過):
- backup-langfuse.sh: Langfuse AI 追蹤/評測 DB (7238 traces)
- backup-monitoring.sh: Prometheus + Grafana + Alertmanager volumes + configs
- backup-signoz.sh: SignOz ClickHouse + SQLite (分散式追蹤/日誌)
- backup-open-webui.sh: Open-WebUI LLM 對話紀錄 (SSH 188 volume)
- backup-clawbot.sh: ClawBot Redis 狀態/快取 (SSH 188 volume)
- backup-all.sh v3.0: 整合至 9/9 服務

告警機制:
- common.sh: notify_clawbot 改用 /webhook/custom 正確格式
- failed → severity:critical → Telegram 🔴 立即告警
- 告警測試通過:{"status":"ok","alert_id":"878c4c59..."}

GFS 保留:30日/12週/24月 (AWOOOI 額外 28h 高頻)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:12:42 +08:00
OG T
77253a5d87 ops(repair-bot): 主機白名單修復腳本 (Sprint 3)
110: sentry/harbor/gitea/gitea-runner/langfuse/alertmanager/signoz
188: openclaw/minio/signoz (docker compose) + redis/nginx/ollama (systemd)

安全設計: SSH command= 限制 + 嚴格白名單 + /var/log/awoooi-repair-bot.log
已部署: 110:/home/wooo/bin/ + 188:/home/ollama/bin/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:11:55 +08:00
OG T
4ba62132e2 ops(startup): startup-110.sh 加入 Step 7 Sentry 自動啟動
Sentry 已安裝於 /opt/sentry (2026-03-24),但重開機後未自動啟動
加入非阻塞啟動 + 重開機損壞修復邏輯:
  - sentry-postgres WAL 損壞 → pg_resetwal -f 自動修復
  - sentry-redis dump.rdb 損壞 → 自動刪除重建
  - 啟動後 20s 非阻塞健康驗證

根因: 2026-04-05 重開機後 PostgreSQL WAL + Redis RDB + ClickHouse parts 全部損壞

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 03:09:20 +08:00
OG T
7becdcbaf6 ops(scripts): 加入 deploy-alerts.sh 自動部署 Prometheus 規則
功能: 驗證 YAML → 備份 → scp → reload → 驗證規則數+關鍵規則
同步啟用 Prometheus --web.enable-lifecycle (110 docker-compose.yml)
部署驗證: 28 條規則全部 ,關鍵規則 SentryDown/HarborDown/GiteaDown/OpenClawDown/AlertmanagerDown/AlertChainUnhealthy 已上線

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 02:29:21 +08:00
OG T
ad4abefcd9 fix(k8s+ops): 修復告警鏈路 + Gitea runner 自動啟動
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 21s
## 修復項目

1. NetworkPolicy allow-nginx-ingress 加入 192.168.0.110
   - Alertmanager (在 110) 需要從 110 直接 POST webhook 到 API pod
   - 修復前: 110 被 NetworkPolicy default-deny 阻擋,webhook timeout
   - 修復後: 110 加入 ingress 白名單,告警鏈路恢復

2. awoooi-startup-110.sh 加入 Gitea Act Runner
   - Step 6: 啟動 /home/wooo/act-runner (gitea-runner container)
   - 修復前: 重開機後 runner 離線,CD pipeline 全面失效
   - 修復後: runner 自動重啟,若配置過期自動清除重新註冊

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:42:52 +08:00
OG T
be3aa6069b feat(backup): AWOOOI 高頻備份 — 每 6 小時備份 awoooi_prod
awoooi_prod 為核心生產 DB,每日一次最大損失 24 小時不可接受:

- backup-awoooi-frequent.sh:每 6 小時備份 awoooi_prod(08/14/20:00)
- 02:00 由 backup-all.sh 完整備份(含 dev/k3s)
- 合計 4次/天,最大數據損失 ≤ 6 小時
- GFS 保留:28h 高頻 + 30日 + 12週 + 24月

首次執行: 680K,4s,snapshot db050dbc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:14:50 +08:00
OG T
3136fc5ea0 feat(backup): 全面自動化備份 + AWOOOI DB + GFS 延長保留
首席架構師備份審計 — 全部自動化完成:

- backup-awoooi.sh:新增 AWOOOI PostgreSQL 備份腳本
  - awoooi_prod (KB/事故/AutoRepair/Drift) + k3s_datastore
  - 從 110 SSH 到 188 執行 pg_dump,整合進 restic
  - 首次執行:680K,9s,snapshot 8750748f 

- backup-all.sh v2.0:整合第 4 個服務 AWOOOI DB

- GFS 保留策略延長:
  - 每日 7→30 份(覆蓋最近 30 天)
  - 每週 4→12 份(覆蓋最近 3 個月)
  - 每月 6→24 份(覆蓋最近 2 年)

- BACKUP-STATUS.md:更新為全自動化狀態總覽

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:11:31 +08:00
OG T
84cfdb6195 docs(backup): 備份審計完整盤點 + 新增 AWOOOI DB 與 Gitea DB 備份腳本
首席架構師備份審計結論:
- awoooi_prod PostgreSQL: 無備份 (P0 缺口)
- Gitea SQLite DB: 無備份 (今日已損壞,人工修復耗時 2h+)

新增:
- scripts/backup/backup-awoooi-db.sh (188 部署,02:00 daily)
- scripts/backup/backup-gitea-db.sh (110 部署,01:00 daily)
- docs/runbooks/BACKUP-STATUS.md (全景表 + 部署步驟 + SOP)
- LOGBOOK.md 備份審計段落

待手動部署:統帥需 scp 腳本至 188/110 並設定 crontab

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:01:58 +08:00
OG T
c0c903dc48 fix(startup): 188 啟動腳本加入 MinIO — 解決 Velero BSL Unavailable
MinIO 重開機後不會自動啟動,導致 Velero BackupStorageLocation Unavailable
加入 MinIO docker compose up -d 到 STEP 7 Docker Compose 服務區段

⚠️ 統帥需要手動執行以下指令讓 188 上的 startup script 生效:
  sudo cp /tmp/awoooi-startup.sh /usr/local/bin/awoooi-startup.sh
  sudo chmod +x /usr/local/bin/awoooi-startup.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:52:13 +08:00
OG T
f4f454fd98 feat(api): 重開機後自動 warm-up Redis Working Memory from PostgreSQL
- main.py lifespan: 啟動時從 DB restore INVESTIGATING/MITIGATING incidents
- scripts/reboot-recovery: 188 + 110 自動化腳本 + systemd services
- scripts/reboot-recovery: aiops-network 自動建立 (ClawBot 依賴)
- docs/runbooks/REBOOT-RECOVERY-SOP.md: 完整改寫,含自動化腳本說明

Why: 重開機後 Redis 清空導致前端 incidents 顯示 0 筆(DB 完整保存)
統帥批准: 「所有數據必須被長久記錄下來」

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:39:20 +08:00
OG T
827923b9b9 feat(monitoring): Phase O-5 Wave C.1 generate_monitoring.py 自動發現
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
- 查詢 Prometheus targets API 取得全量 scrape 狀態
- 10 個預期服務覆蓋率計算 (門檻 70%)
- 已知 DOWN targets 豁免清單 (不影響健康判斷)
- --json 機器可讀輸出 / --check CI 模式 (exit 1 if coverage < threshold)
- 首次執行: 100% 覆蓋率,無真實問題

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:33:28 +08:00
OG T
f5b8738185 fix(wave-a): Wave A 告警鏈路驗收修復
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
- sentry_webhook: 加入 GET /health endpoint (smoke test 探測用)
- smoke_test: alertmanager 路徑改為 /webhooks/health (已存在)
- smoke_test: Prometheus URL 改為正確的 110:9090
- smoke_test: Alert chain metric 標記 critical=False (初始化期正常)

Wave A.6 smoke test 現在 6/8 → 7/8 checks pass (sentry health deploy 後 8/8)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 20:08:26 +08:00
OG T
d2b337430a feat(cd): Phase O-4 Wave A 收尾 — Sentry Token 注入 + Alert Chain Smoke Test
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 35s
E2E Health Check / e2e-health (push) Successful in 17s
Wave A.1: SENTRY_AUTH_TOKEN CD 自動注入 K8s Secret
  - 每次部署自動 kubectl patch (遵循 ADR-035 鐵律)
  - Token 缺失時 warn 不 fail (降級保護)

Wave A.6 + B.2: Alert Chain Smoke Test
  - scripts/alert_chain_smoke_test.py (新建)
  - 檢查: API Health / Alert Chain Metric / 3 Webhook /
          SigNoz / OTEL Collector / Event Exporter
  - 整合進 cd.yaml (Alert Chain Smoke Test 步驟)
  - continue-on-error: true (不阻塞部署,結果顯示在 TG)
  - TG 部署通知新增 Alert Chain 狀態欄

Wave A.2/A.3/A.4: SignOz/Sentry 程式碼已在 2026-03-29 實作完成
  - signoz_webhook.py / sentry_webhook.py 均已部署
  - 待手動部署 SignOz 告警規則到 .188

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 18:22:13 +08:00
OG T
48ec6ee48e feat(types): 補充 NVIDIA 模型到共用型別 (P0 修復)
首席架構師審查發現 NVIDIA models 遺漏,現已補充:

新增 7 個型別:
- ToolFunction, ToolCall, NvidiaMessage
- NvidiaChoice, NvidiaUsage, NvidiaResponse
- ToolDefinition

總計: 44 → 51 個型別定義
審查評分: 72/100 → 85/100 (預計)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 19:26:44 +08:00