Your Name
|
587551c1f1
|
fix(ops): monitor full-stack cold-start gates
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 18s
|
2026-05-06 00:48:05 +08:00 |
|
Your Name
|
6e96623884
|
fix(ops): harden momo scheduler cold start gate
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-06 00:15:14 +08:00 |
|
Your Name
|
0315c2b510
|
docs(ops): codify full stack cold start recovery
Code Review / ai-code-review (push) Successful in 7s
|
2026-05-06 00:07:57 +08:00 |
|
Your Name
|
1dcc6d61dc
|
fix(ops): retry cold-start HTTP probes
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-05 22:56:57 +08:00 |
|
Your Name
|
ed7c6946cb
|
docs(awooop): define private Ollama mesh gateway
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-05 22:56:22 +08:00 |
|
Your Name
|
a4e9a04982
|
fix(ops): harden cold-start schedule recovery
Code Review / ai-code-review (push) Successful in 10s
run-migration / migrate (push) Successful in 7s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
|
2026-05-05 22:17:10 +08:00 |
|
Your Name
|
72d66e4ae6
|
fix(ops): align stale job cleanup thresholds
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 36s
|
2026-05-05 14:54:17 +08:00 |
|
Your Name
|
5e625f777d
|
fix(ops): add stale gitea job cleanup guard
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:50:47 +08:00 |
|
Your Name
|
7d45f0cb58
|
fix(ops): alert on stale gitea actions jobs
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:42:09 +08:00 |
|
Your Name
|
34d1c76be9
|
fix(ops): route systemd runner baseline alerts
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 14:19:58 +08:00 |
|
Your Name
|
fe618960a8
|
fix(ops): monitor systemd runners in host baseline
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
|
2026-05-05 14:08:43 +08:00 |
|
Your Name
|
e8e6748f70
|
fix(ops): add docker host resource baseline guardrails
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 13:45:09 +08:00 |
|
Your Name
|
8629ac709b
|
feat(awooop): Phase 1-8 完整實作 — AwoooP Agent Platform 六平面架構
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 張核心表 + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: 全部 FORCE RLS + GRANT
- packages/awooop-contracts/: 六合約 JSON Schema + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models(extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle(draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator(retry×3, E-SCHEMA-001)
## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill 三層防線
- src/core/context.py: PROJECT_ID ContextVar(31 background loop 自動繼承)
- src/db/base.py + models.py: project_id 欄位 + RLS set_config 注入
- src/hermes/nl_gateway.py: project_id Redis key 前綴(Phase A 雙寫)
- src/services/anomaly_counter.py: per-project 改造(Phase A fallback)
## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs(Router→Service 架構)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop
## Phase 5: MCP Gateway 五閘門
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 表 + RLS
- src/plugins/mcp/gateway.py: McpGateway(Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: 雙層 redaction + 16K 截斷
- src/plugins/mcp/registry.py: __provider name mangling(ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref 解析
- tests/test_mcp_credential_isolation.py: 10 個迴歸測試(secret leak 防再現)
## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope(ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback(30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay 防護 + suggest mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:31:53 +08:00 |
|
Your Name
|
cb5ab900c4
|
fix(ci): preserve gitea runner jobs on shutdown
Code Review / ai-code-review (push) Successful in 46s
|
2026-05-01 16:16:27 +08:00 |
|
Your Name
|
95110971f3
|
fix(telegram): close remaining DM alert routes
CD Pipeline / tests (push) Successful in 1m27s
Code Review / ai-code-review (push) Successful in 29s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-04-30 23:02:17 +08:00 |
|
Your Name
|
e27b462bef
|
fix(ops): keep disabled gitea runner stopped
Code Review / ai-code-review (push) Successful in 27s
|
2026-04-30 10:59:46 +08:00 |
|
Your Name
|
639bb64788
|
feat(flywheel): surface ai automation and code review
Code Review / ai-code-review (push) Successful in 31s
CD Pipeline / build-and-deploy (push) Failing after 5m23s
|
2026-04-30 00:09:25 +08:00 |
|
Your Name
|
c5753e1c57
|
fix(critic-review): KMWriter 名實統一 + Alertmanager 修抑制 + drift checker AST 化
critic PR review 揭示已 push commits 的 7 個 blocker,本 commit 全部修復。
## C1 + C2 + M1 + M2 + M3 — KMWriter 真正統一契約(critic 最嚴重 5 條)
### C1 km_writer.py:194 — backfill 自打臉修
- 裸 asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- 同步 await + 獨立 DLQ km:backfill:dlq + try/except 不阻塞主寫入
- 新增 km_backfill_reconciler_job.py(每 5 分鐘掃 DLQ)+ ENABLE_KM_BACKFILL_RECONCILER flag
- 防 Path B 比 Path A 先完成 → related_approval_id 永遠 NULL 的 race
### C2 km_writer.py:391 — KM_WRITE_AWAIT=false 路徑收緊
- 從 ensure_future(fire-and-forget 比舊版同步寫更糟)
- 改 await writer.write(retry=1, timeout=2.0)(仍 await 但只試一次、超時短)
- docstring 明確標註「緊急回滾用,不保證可靠性」
### M1 decision_manager.py:2178/2203 — 移除 _fire_and_forget 旁路
- 兩處 _fire_and_forget(executor.write_execution_result_to_km(...))
- 改 await asyncio.shield(...) + BaseException 保護(防上層 cancel 中斷)
- KM_WRITE_AWAIT=true 在這條路徑終於真正 await
### M2 incident_service.py:1099 — 自製 path 加 retry+DLQ
- 原本 if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task
- 改 3 次指數退避 retry + DLQ 保護(呼叫 km_writer 私有 helper)
### M3 km_writer.py:166 — 冪等聲明對齊實作
- knowledge_repository.create() 加 UPSERT 路徑(pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord 加 path_type 欄位
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path
## M4 alertmanager.yml — equal: [] 收緊(critic 防爆炸抑制)
- OllamaInstanceDown / KMConverterDown 抑制加 equal: ['cluster'] 約束
- 防多 cluster 場景下任一 Ollama down 誤抑全 AI/SLO 告警
## M5 Alertmanager 版本驗證(已確認 v0.31.1,遠超 v0.22+)
## M6 governance_agent.py — health score 區分 skipped vs ok vs violated
- check_slo_compliance 加 _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: SLO 全 skipped 時獨立發 governance_slo_data_gap 告警
(不污染 self_failure 計數,因為 no_data 是 emitter 未實作不是治理機制故障)
## M7 scripts/check_config_drift.py — 改 AST 解析
- regex 改 ast.parse 找 Settings ClassDef AnnAssign Field(default=...)
- 避免多行 list / default_factory= / 含跳行字串的 false negative
- 4 欄位(AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL)全對齊
## 新增測試
- test_km_writer_backfill_reconciler.py: 7 cases(C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases(M3 path_type 注入 + UPSERT 分支)
## 驗證
- 1585 unit tests 全綠(+13 從 1572)
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker AST-based 4 欄位全對齊
- Alertmanager v0.31.1 確認支援新語法
## 期望影響
- KMWriter 名實統一:飛輪閉環 KM 寫入路徑 100% 可靠
- M4 抑制爆炸風險解除
- 治理層不再對 SLO no_data 靜默
- drift checker false negative 風險解除
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-29 10:44:39 +08:00 |
|
Your Name
|
715dc3cb91
|
fix(observability): P0 假警報止血 + ConfigMap drift 對齊 + 治理工具
12-Agent 全景診斷觸發的 P0/P1 觀測層修復。
## P0 假警報止血(4 SLO 雪崩根因)
- governance_agent.py:306 — 空 result 不再 fallback 0.0,改 continue + log warning
根因:Prometheus 查無資料(emitter 未實作 / rule 未部署)被誤判為 SLO=0
必觸發 violated=True 噴 4 條假告警
## P0 鬼魂按鈕守門
- telegram_gateway.py:1654 — LLM 動態按鈕 Redis 失敗時 btn_list.clear()
first_row(批准/拒絕,HMAC nonce 無狀態)由 caller 1488 永遠保留
feedback_no_ghost_buttons.md 三缺一鐵律對齊
## ConfigMap drift 修復(3 處)
- config.py:683 PROMETHEUS_URL: 188→110(drift checker 揪出 = SPF-4 部分根因)
- config.py:705 ARGOCD_URL: 125→121(T0 G3 已知)
- config.py:375 AI_FALLBACK_ORDER: 補 nvidia 對齊 ConfigMap
## P1 Alertmanager 升級(amtool SUCCESS)
- ops/alertmanager/alertmanager.yml: deprecated → v0.27+ 新語法
- match/match_re → matchers
- source_match/target_match → source_matchers/target_matchers
- group_by 加 team label(防 SLO 雪崩 4 條同秒推)
- PostgreSQL/Redis inhibit 補 equal: ['instance'](防爆炸抑制)
- 新增 3 組因果抑制:
- OllamaInstanceDown → SLO_*/AI_*(30 分鐘)
- KMConverterDown → SLO_KMGrowthRate*
- SLO_*_FastBurn → SLO_*_(Medium|Slow)Burn
## 治理工具落地
- scripts/check_config_drift.py: ConfigMap vs code default drift 檢測
揪出 PROMETHEUS_URL drift 是 SPF-4 根因(governance_agent 連 188 而非 110)
- scripts/health_check_session.sh: 11 服務 + 4 SSH + drift + git 全景驗證
## 驗證
- 1552 unit tests 全綠
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker 4 欄位全對齊
- health check 11 服務全可達
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-29 10:44:39 +08:00 |
|
Your Name
|
9e9bd8679f
|
fix(aider-watch): code-review fixes (4 issues)
CD Pipeline / build-and-deploy (push) Has been cancelled
1. aiderw: session_end 補 model+cwd (AI Router feedback loop 修通)
2. repository: model_stats_since SQL 改 COALESCE(session_end, session_start) model
3. aider_event_service: classify_severity 移除 error_count 觸發告警(防假陽性)
4. worker: run_aider_event_processor_loop 包 proc.start() try/except(防靜默崩潰)
2026-04-20 @ Asia/Taipei
|
2026-04-21 00:59:21 +08:00 |
|
Your Name
|
1744b1e923
|
fix(aider): stdlib logging → structlog + typing-extensions dep (E2E修復)
CD Pipeline / build-and-deploy (push) Has been cancelled
- aider_events.py: logging.getLogger → structlog.get_logger (keyword args compatible)
- pyproject.toml: add typing-extensions>=4.0 (python-ulid 3.x requires Self)
2026-04-20 @ Asia/Taipei
|
2026-04-20 19:59:35 +08:00 |
|
Your Name
|
ce918ee44e
|
feat(client): B5 install.sh + launchd aider-flush plist
CD Pipeline / build-and-deploy (push) Successful in 10m18s
Mac 端安裝腳本:pipx install aider-watch-client → symlink 到 /opt/homebrew/bin →
驗 ~/.aider-watch.env 必要 key → 建 ~/aider-watch 工作目錄 →
載 launchd com.awoooi.aider-flush(每 5min flush buffer)→ 跑 aider-watch doctor。
走 a 路線(LAN direct AIDER_API_URL=http://192.168.0.120:32334/api/v1/aider/events)。
全景檢查:家用場景,B3 buffer + 5min flush 已覆蓋短暫斷網,無需 Tailscale。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-20 19:40:02 +08:00 |
|
Your Name
|
b7d612526a
|
chore(client): gitignore egg-info + remove accidentally committed generated files
|
2026-04-20 19:40:02 +08:00 |
|
Your Name
|
36610e2744
|
feat(client): Mac aider-watch client (B1-B4: scaffolding + api_client + buffer + aiderw)
|
2026-04-20 19:40:02 +08:00 |
|
OG T
|
2d43751729
|
feat(ops): ADR-090-B 零信任收尾範本 — wrapper / sudoers / migrator / CI
CD Pipeline / build-and-deploy (push) Successful in 12m17s
run-migration / migrate (push) Failing after 14s
2026-04-18 台北時區 —— ogt + Claude Opus 4.7 (1M)
本 commit 響應本 Session 兩次憑證外洩事故
(feedback_secrets_leak_incidents_2026-04-18.md),
交付統帥可直接部署的零信任基礎設施範本.
檔案清單:
1. scripts/host-ops/awoooi-hosts-add.sh
- 110 主機 /etc/hosts 白名單 wrapper
- 只允許預定義主機名,idempotent,帶 IP 格式驗證
- 安裝: /usr/local/bin/awoooi-hosts-add (root:root 0755)
2. scripts/host-ops/awoooi-wrapper.sudoers
- 配套 sudoers 規則 (NOPASSWD for wrapper + SIGHUP only)
- 安裝: /etc/sudoers.d/awoooi-wrapper (root:root 0440)
- 禁 tee / bash / sh 這類 generic shell access
3. apps/api/migrations/adr090b_awoooi_migrator_role.sql
- PG 限權角色 awoooi_migrator
- 只能 DDL (CREATE/ALTER/DROP/INDEX/COMMENT)
- 明確 REVOKE 所有 DML + default privileges 鎖死
- 本檔由統帥執行 (需 superuser),不由 Claude 執行
4. k8s/awoooi-prod/awoooi-migrator-secret.template.yaml
- K8s Secret patch 範本
- 新增 MIGRATION_DATABASE_URL key (awoooi_migrator 連線串)
- 與應用 DATABASE_URL 拆開
5. .gitea/workflows/run-migration.yml
- CI 自動套用新 migration (單 transaction + ON_ERROR_STOP)
- 用 Gitea secret MIGRATION_DATABASE_URL,不走明碼
- 每次成功寫一筆 asset_discovery_run (audit trail)
零信任三層防線 (對應 feedback_secrets_leak_incidents):
L1 對話無密碼 -> wrapper 內建白名單
L2 操作經 wrapper -> sudoers + awoooi_migrator
L3 顯示強制遮蔽 -> CI 走 secret,不走 env
本 Session 發現的 3 次憑證外洩全部在 feedback_secrets_leak
memory 登記,並有對應 P0 輪替計畫.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-18 13:23:39 +08:00 |
|
OG T
|
fb1d101902
|
fix(backup): HostBackupFailed P1 根治 — Prometheus textfile 指標 + docker socket 讀取
問題一:backup_110_last_success_timestamp 指標從未存在
根因:腳本只寫純文字 last_success 檔,從未輸出 .prom 格式
修復:成功時寫入 /home/ollama/node_exporter_textfiles/backup.prom
node_exporter 新增 --collector.textfile.directory=/textfile_collector
volume: /home/ollama/node_exporter_textfiles:/textfile_collector
問題二:Harbor/Gitea rsync 權限拒絕
根因:/var/lib/docker/volumes/ 是 710 root:root,docker group 無法直接存取 FS 路徑
修復:改用 docker run --rm -v <volume>:/source alpine tar czf -
透過 docker socket(wooo 已在 docker group)讀取 volume 內容再解壓
驗證:備份腳本三項全 OK,node_exporter 9100/metrics 正確輸出指標
Prometheus absent(backup_110_last_success_timestamp) 應在下次 scrape 後清除
2026-04-18 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-18 10:37:23 +08:00 |
|
OG T
|
5fe049de55
|
fix(backfill): 補充 ADR-075 三種新分類 (secops/flywheel_health/business)
_classify_alert() 與 classify_alert_early() 規則對齊,
確保回填腳本正確分類存量 incidents。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:13:07 +08:00 |
|
OG T
|
a28625f088
|
fix(cr): 首席架構師 CR P0/P1/P2 全修補
CD Pipeline / build-and-deploy (push) Has been cancelled
P0-1: incident_service.py — 刪除 classify_alert_early 死碼 L131-132
P0-2: cron_backup_restore_test.sh — date +%s%3N→+%s,修正毫秒時間戳
P1-2: gitea_webhook.py — fingerprint 移除 sha_short,收斂同 branch 失敗
heartbeat: 還原原始空格對齊格式(統帥要求原本怎樣就怎樣)
P1-1(積木化)/P1-3(TYPE-4)/P2-1(timeZone)/P2-2(IP)/P2-3(WS重連) 待後續處理
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 16:10:46 +08:00 |
|
OG T
|
c1c96ab47b
|
feat(m4): ADR-074 M4 — 備份還原週排程驗證 CronJob
- scripts/cron_backup_restore_test.sh: Velero restore dry-run 腳本
- k8s/awoooi-prod/16-cronjob-backup-restore-test.yaml: 每週日 02:00 台北執行
- k8s/awoooi-prod/17-configmap-backup-restore-scripts.yaml: 腳本 ConfigMap
- flywheel-alerts.yaml: BackupRestoreTestFailed + BackupRestoreTestStale 告警
失敗時寫入 node-exporter textfile → Prometheus 告警 → TYPE-3 Incident
2026-04-12 ogt (ADR-074 M4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:36:30 +08:00 |
|
OG T
|
0d239838b4
|
fix(cr): Code Review P2 — 測試覆蓋 + CronJob 腳本重構
CD Pipeline / build-and-deploy (push) Has been cancelled
P2-1: CronJob inline Python 抽成 scripts/cron_km_vectorize.py
Dockerfile 加入 COPY scripts/,CronJob YAML 改用腳本路徑
P2-2: 新增 test_classify_alert_early.py — 23 tests 覆蓋 7 條分類規則
含邊界情況:VeleroBackupFailed(backup優先於k8s)、優先順序驗證
595 unit tests passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:14:44 +08:00 |
|
OG T
|
5b956a9a47
|
feat(flywheel): Phase 2-3/2-5 — auto_repair outcome 寫入 + 134 筆 alertname 回填腳本
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 2-3: _try_auto_repair_background() 修復執行後寫入 Incident.outcome
- effectiveness_score: 5(成功) / 2(失敗)
- human_feedback: auto_repair:<playbook_id>:success|failed
- should_remember: True(成功) → KMConversionService 飛輪入口
- 讓 KMConversionService 可依 outcome 判斷 EXECUTION_SUCCESS
ADR-073 Phase 2-5: scripts/backfill_alertname.py
- UPDATE incidents SET alertname = COALESCE(signals->0->>'alertname', signals->0->>'alert_name')
- 已在 Pod 執行:134 筆 NULL → 0 筆 (2026-04-12 ogt)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:33:11 +08:00 |
|
OG T
|
6dc03c9a55
|
fix(argocd)+feat(flywheel): Phase 1 完成 — ArgoCD image 斷路修復 + 冷啟動腳本
CD Pipeline / build-and-deploy (push) Has been cancelled
1. k8s/argocd/awoooi-prod-app.yaml:
移除 Deployment image ignoreDifferences
- 原設計造成 CD 更新 kustomization.yaml 後 ArgoCD 不更新 image
- 修復後 GitOps 閉環恢復正常
2. scripts/cold_start_playbooks.py:
ADR-073 Phase 1 Step 8 — 生成 15 個基礎 Playbook (K8s/Docker/DB/Infra)
執行結果: Playbooks 0 → 15
3. scripts/batch_vectorize_km.py:
ADR-073 Phase 1 Step 9 — 批次向量化 KM
執行結果: 711/713 embedding IS NOT NULL
Phase 1 全部完成,飛輪已解封:
- Pod 運行 105998d(含 8be87b0 所有修復)
- debounce 30min + alertname NULL 修復 + _collect_mcp_context 啟用
- 15 Playbooks + 711 KM 向量化
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:20:52 +08:00 |
|
OG T
|
de055778b3
|
fix(cd): CD_PUSH_TOKEN + backup 路徑使用 BACKUP_ROOT 環境變數
CD Pipeline / build-and-deploy (push) Has been cancelled
- cd.yaml: GITEA_CD_TOKEN → CD_PUSH_TOKEN(Gitea 保留 GITEA_ 前綴)
- ADR-069: 同步更新 token 名稱說明
- backup-from-110.sh: 改用 BACKUP_ROOT 環境變數(預設 /home/ollama/backup/110)
避免 /var/log /var/run 需要 root 權限
- 已部署到 188 + cron 0 1 * * * 設定完成
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 09:07:47 +08:00 |
|
OG T
|
43edff184d
|
feat(dr): Sprint C — Host rsync 備份 + DR SOP 文件
C-1 Velero: 已確認運作中(daily-awoooi-prod schedule, 13d, MinIO Available)
C-2 Host rsync 備份:
scripts/ops/backup-from-110.sh — 188 每日凌晨 1:00 rsync 備份 110
- Harbor registry data(最高優先)
- Gitea repos
- bitan-pharmacy.git(若存在)
- 成功寫入 /var/run/backup-110.last_success 供 Prometheus 監控
- 失敗時 Telegram 告警
ops/monitoring/alerts-unified.yml — 新增 HostBackupFailed 告警規則
C-3 DR SOP 文件:
docs/runbooks/disaster-recovery/DR-K8s-awoooi.md (<15分鐘)
docs/runbooks/disaster-recovery/DR-Nginx.md (<5分鐘)
docs/runbooks/disaster-recovery/DR-Harbor.md (<30分鐘)
docs/runbooks/disaster-recovery/DR-Bitan.md (<5分鐘)
docs/runbooks/disaster-recovery/DR-Stock.md (<5分鐘)
部署備份腳本說明 (需手動執行):
scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh
ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}"
ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 03:04:18 +08:00 |
|
OG T
|
49bfbd573c
|
feat(test): B5 整合測試框架 — 真實 DB, 5/5 通過
CD Pipeline / build-and-deploy (push) Failing after 2m34s
新增:
- docker-compose.test.yml: CI 用臨時 pgvector PostgreSQL (port 15432)
- tests/factories.py: Incident/Approval/Knowledge/RAG 測試資料工廠
- tests/integration/test_b5_core_flows.py: 5 個 E2E 整合測試 (5/5 PASSED 1.03s)
- tests/integration/setup_test_schema.sql: CI schema 初始化 SQL
- cd.yaml: 新增 Integration Tests B5 step
- scripts/sync_dev_db.py: dev DB 同步工具
修正:
- .env.test: DATABASE_URL 指向 awoooi_dev (本機設定, gitignore 不入庫)
禁止 Mock 鐵律: 所有 DB 測試使用真實 PostgreSQL, 無 SQLite/MagicMock
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 11:22:57 +08:00 |
|
OG T
|
07570c3b85
|
feat(rag): 初始索引腳本 — ADR+Runbook 批次餵入 pgvector
scripts/rag_index_docs.py: 批次呼叫 /knowledge/rag/index
支援 --api-url 參數,含 0.5s 節流避免 Ollama 過載
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 09:59:13 +08:00 |
|
OG T
|
8d0042ed29
|
feat(ops): Sprint 5.2 docker-health-monitor 升級為自動修復模式
舊版: 純感知層 (L4-6),只送 Webhook,修復由 API 執行
新版: 感知 + 自動修復 + 回報
修復分級 (ADR-060):
- 一般容器: docker restart
- 監控棧 (prometheus/grafana/alertmanager): docker start (保護 WAL)
- DB/Redis/ClickHouse: 僅告警,禁止重啟
已部署到:
- 192.168.0.110 ~/awoooi-ops/docker-health-monitor.sh
- 192.168.0.188 ~/awoooi-ops/docker-health-monitor.sh
- 兩台 cron */5 * * * * 運行中
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 11:59:11 +08:00 |
|
OG T
|
5ead01abf7
|
feat(ops): dr-drill.sh — 每月 DR Drill 自動演練
每月第一個週日 03:00 (121 cron) 執行:
1. 找最新 Velero backup (Completed)
2. 還原到 awoooi-dr-test namespace
3. 等待 Pod Ready + API health 驗證
4. 清理 dr-test namespace + restore 資源
5. Telegram 通知 PASS/FAIL + 耗時
支援 --dry-run 模式 (只檢查 backup,不還原)。
dry-run 驗證通過: daily-awoooi-prod-20260409020003
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 10:42:12 +08:00 |
|
OG T
|
ec4ebaf310
|
fix(ops): pg-backup momo_analytics 改用 docker exec (無對外 port)
momo-db 容器無 port binding,TCP 127.0.0.1:5432 連到 host PG 非容器。
改用 docker exec momo-db pg_dump,實際備份 91M。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 09:57:05 +08:00 |
|
OG T
|
f98be41517
|
feat(ops): pg-backup.sh — PostgreSQL 每 6h 自動備份
備份目標 (188):
- awoooi_prod (host PostgreSQL, TCP 127.0.0.1)
- momo_analytics (momo-db 容器)
功能:
- gzip 壓縮,保留 7 天自動清理
- Telegram 通知 (成功/失敗)
- cron 0 */6 * * * 已設定
驗證: 兩個 DB 備份成功 (awoooi_prod 206K, gz 完整)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 09:16:21 +08:00 |
|
OG T
|
0e6c4b83d4
|
feat(ops): docker-health-monitor 完成部署 110+188
- 增加 EXCLUDE_CONTAINERS 排除清單(signoz init containers)
- max-time 30→60 支援 API 首次 AI 分析所需時間
- 110: wooo/awoooi-ops, cron */5, secrets.env 已設定
- 188: ollama/awoooi-ops, cron */5, secrets.env 已設定
- 驗證: 188→API webhook 200, Telegram 已收到告警
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 22:59:45 +08:00 |
|
OG T
|
d80153bdce
|
fix(openclaw): NIM 完全失敗後 fallback 到 Gemini 產生執行方案
CD Pipeline / build-and-deploy (push) Failing after 1m34s
NIM tool calling 多次 timeout 後,不再顯示空白執行方案,
改由 Gemini 代理產生 kubectl 操作指令(JSON 解析)。
只有 NIM 完全失敗才觸發,符合統帥「必須等到有回應」原則。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 22:55:25 +08:00 |
|
OG T
|
170ce2f11d
|
fix(ci): 修正測試與 Sprint 5.2 部署腳本
CD Pipeline / build-and-deploy (push) Failing after 1m38s
tests/test_auto_repair_service.py:
- 更新 3個測試符合 2026-04-07 統帥指令移除門檻
- APPROVED Playbook 直接通過 (低相似度/低品質/高風險均通過)
tests/test_phase22_nemotron_collab.py:
- 更新 log key: nemotron_collaboration_failed → exhausted
ops/monitoring/docker-compose.exporters.yaml:
- 修正 postgres DSN: awoooi:awoooi_prod_2026@localhost:5432/awoooi_prod
Sprint 5.2 新增腳本:
- scripts/sprint51_e2e_validation.py: L7 E2E 驗收腳本 (T1-T5)
- scripts/ops/deploy-docker-health-monitor.sh: Plan A 一鍵部署腳本
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 18:17:48 +08:00 |
|
OG T
|
88696dba9b
|
feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (已在 188 執行):
- M-002: approval_records 新增 approval_level/votes/required_votes
- M-003: alert_event_type ENUM 新增 8 個值
Layer 2 - IaC:
- ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
- velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
- preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)
Layer 1-M001 - Playbook model:
- playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - 業務邏輯:
- alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
- auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
- webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
- docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)
Layer 5 - Telegram 通知:
- telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 16:24:09 +08:00 |
|
OG T
|
a421d2c5b8
|
feat(ops): Plan A docker-health-monitor.sh — Docker 容器健康監控自動修復
- 偵測 unhealthy / exited / dead 容器
- 排除清單: DB(PG/Redis)、Gitea、監控棧
- Prometheus/Grafana/Alertmanager exited → docker start (保護 WAL)
- 必須三段式通知: Intent→Action→Result (首席架構師裁示)
- HMAC-SHA256 簽章 → AWOOOI API /api/v1/webhooks/custom-alert
- Fallback: API down → 直接 Telegram Bot API
- 冷卻期 300s,防止重複修復
部署: cron */5 * * * * on 192.168.0.110 + 192.168.0.188
設定: /etc/awoooi-ops/secrets.env
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 11:48:39 +08:00 |
|
OG T
|
8235f91bc6
|
fix(scripts): generate-schemas 同時加入 apps/api 和 apps/api/src 到 sys.path
Type Sync Check / check-type-sync (push) Failing after 56s
問題: CI type-sync-check 持續失敗
原因: 只加 apps/api/src 不夠,模型檔內部用 from src.utils.X import Y
需要 apps/api 在 path 才能解析 src 套件
結果: 51 個型別全部正確生成
# 2026-04-06 ogt: fix CI type-sync blocking deployment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 12:00:18 +08:00 |
|
OG T
|
b133631b2d
|
feat(scripts): Phase 26 補寫腳本 — 從 approval_records 反向建立 KM
225 筆歷史告警處理記錄全部補寫到 knowledge_entries (INCIDENT_CASE)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 11:47:47 +08:00 |
|
OG T
|
4b24ecd67f
|
fix(sprint3): 首席架構師 Review C1/C2/C3/M3/m1 修正
C1: _ssh_execute 直接接收 key_path 參數,不反查 LAYER_SSH_CONFIG
C2: PlaybookService.create() proxy,Router 不再穿透呼叫 _repository
C3: CD Step 1b sed 替換 IMAGE_TAG_PLACEHOLDER,消除失敗中斷風險
M3: repair-bot 110/188 regex 統一 [a-z0-9][a-z0-9-]{0,30},禁止底線
m1: defaultMode 0400 加八進位說明注釋
m2: _ssh_execute 用 deadline 計算剩餘 timeout
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 13:07:59 +08:00 |
|
OG T
|
b688eeecb7
|
fix(ops): seed 腳本支援 API_BASE 環境變數
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 12:23:55 +08:00 |
|
OG T
|
3f7a742683
|
fix(infra): 首席架構師 Review 修正 — C1/I1/I2/I3/I4/S1
C1: 移除 deploy-to-110.sh 密碼明文,改用 SSH key + sudoers NOPASSWD
I1: 加入 /var/lock/harbor-repair.lock 防止 watchdog 與 startup 並行修復
I2: docker compose 的 stderr 不再靜默(改用 tee -a log | while read 輸出)
I3: watchdog while loop 包在子 shell + || true,子 shell 異常不終止 watchdog
I4: repair_harbor 關鍵指令(harbor-log 啟動)加入退出碼捕捉
S1: 修復後驗證等待從 5s/10s 改為 30s(harbor-core 初始化需要足夠時間)
S2: docker ps 改用 --filter status=exited 取代 grep/awk
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 12:18:41 +08:00 |
|