OG T
|
dbb8104557
|
fix(drift): kubectl not found + RBAC services/configmaps/ingresses
CD Pipeline / build-and-deploy (push) Has been cancelled
drift_detector uses kubectl to compare the Git YAML against the actual K8s state, but:
1. The API image has no kubectl binary → No such file or directory: 'kubectl'
2. The awoooi-executor ClusterRole lacks list permission on services/configmaps/ingresses
Fix:
- Dockerfile: apt install curl + download kubectl v1.29.0 amd64
- 07-rbac.yaml: add get/list/watch on services/configmaps (core) + ingresses (networking.k8s.io)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 00:49:56 +08:00 |
|
OG T
|
62cb274735
|
feat(host_aggregator+k8s): add host monitoring for the 121 K3s Worker
CD Pipeline / build-and-deploy (push) Has been cancelled
Add 192.168.0.121 (K3s Worker) to HOST_CONFIGS:
- K3s API tcp:6443
- awoooi-api NodePort tcp:32334
- awoooi-web NodePort tcp:32335
NetworkPolicy: open egress to 121 on 6443/32334/32335
The NodePort services actually run on 121 (mon1), not 120 (mon)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 23:36:36 +08:00 |
|
OG T
|
c92cdeea0f
|
feat(drift): B4 persist drift_reports to the DB + CronJob fix
CD Pipeline / build-and-deploy (push) Successful in 12m17s
- drift_repository.py: DriftReportRepository (save/get/list/update)
- drift.py router: remove the in-memory dict, use the DB repository instead
- drift-cronjob.yaml: fix SA/NetworkPolicy/NodePort issues
- allow-intra-namespace NetworkPolicy (already applied to prod)
- migrate-phase8/9: symptoms_hash + drift_reports migration Job YAML
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 20:28:55 +08:00 |
|
OG T
|
c200d7a52d
|
fix(web+k8s): CSRF mismatch + NetworkPolicy missing monitoring ports
CD Pipeline / build-and-deploy (push) Successful in 12m19s
1. pending-approvals-card: fetch a fresh CSRF token at click time
to avoid multiple useCSRF instances overwriting each other's cookie and leaving the header and cookie inconsistent
2. NetworkPolicy: open 110:3002 (Grafana), 9090 (Prometheus), 3001 (Gitea)
fixes the monitoring probe error "All connection attempts failed"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 20:11:00 +08:00 |
|
OG T
|
34f0228d92
|
fix(executor): K8s ClusterIP 10.43.0.1 unreachable: add K8S_API_SERVER_URL override + migration job
CD Pipeline / build-and-deploy (push) Successful in 12m0s
Problem: the in-cluster config resolves to 10.43.0.1:443, but inside a K3s Pod iptables/kube-proxy
does not route that traffic to the actual API server, so connections are refused and kubectl always fails after approval
Fix:
- executor.py: after load_incluster_config(), read the K8S_API_SERVER_URL env var and override the host
- 04-configmap.yaml: set K8S_API_SERVER_URL=https://192.168.0.120:6443
- migrate-sprint5r-telegram-message-id.yaml: migration job adding two columns to approval_records
E2E verification: kubectl rollout restart deployment/awoooi-worker success=True ✅
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 19:10:27 +08:00 |
|
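The override described in the commit above can be sketched roughly as follows. This is a standard-library-only illustration, not the code in executor.py: `resolve_api_host` and `IN_CLUSTER_HOST` are hypothetical names, and the real change presumably applies the value to the Kubernetes client configuration after load_incluster_config(), which is not shown here.

```python
import os
from typing import Mapping, Optional

# Host the in-cluster config resolved to, per the commit message.
IN_CLUSTER_HOST = "https://10.43.0.1:443"


def resolve_api_host(in_cluster_host: str,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API server URL, preferring K8S_API_SERVER_URL when it is set.

    Hypothetical helper for this sketch; only the env var name comes
    from the commit message.
    """
    env = os.environ if env is None else env
    override = env.get("K8S_API_SERVER_URL", "").strip()
    # An empty or missing variable leaves the in-cluster value untouched.
    return override.rstrip("/") if override else in_cluster_host
```

Keeping the in-cluster value as the default means the same image still works on clusters where the ClusterIP is reachable.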
OG T
|
a4d6b3f3e6
|
fix(review): Chief Architect + QA fixes for C1/P1/P2/I2/I3 (Sprint 5R review corrections)
CD Pipeline / build-and-deploy (push) Has been cancelled
C1/P1-1: DB migration: add telegram_message_id/telegram_chat_id to approval_records
- apps/api/migrations/sprint5r_telegram_message_id.sql (new)
- apps/api/src/db/base.py: init_db() adds ALTER TABLE ADD COLUMN IF NOT EXISTS
- k8s/jobs/migrate-sprint5r-telegram-message-id.yaml (now tracked)
P1-2: add a "high" key to risk_map so that an LLM response of high is not downgraded to MEDIUM
- apps/api/src/services/proposal_service.py:183
I2/M3: kubectl_command backfill now covers delete_deployment/drain_node/cordon_node/delete_service
+ extract a _backfill_kubectl_command() helper to remove duplicated logic
- apps/api/src/services/openclaw.py
I3: _notify_approval_result: replace the silent except with logger.warning
- apps/api/src/services/telegram_gateway.py
P2-2: PendingApprovalsCard approval actions get loading/disabled states to prevent double clicks
- apps/web/src/components/shared/pending-approvals-card.tsx
P2-3: SecurityPanel/CompliancePanel error dead-code fix: catch() now calls setError()
- apps/web/src/components/panels/SecurityPanel.tsx (including 'Unresolved' i18n)
- apps/web/src/components/panels/CompliancePanel.tsx
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 18:38:10 +08:00 |
|
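The P1-2 fix in the commit above (a missing "high" key silently falling back to MEDIUM) might look like the sketch below. `RiskLevel`, `RISK_MAP`, and `parse_risk` are hypothetical names, and every map entry other than "high" is an assumption; only the MEDIUM fallback and the added "high" key come from the commit message.

```python
from enum import Enum


class RiskLevel(Enum):  # hypothetical enum for this sketch
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Without the "high" entry, an LLM answer of "high" would fall through
# to the MEDIUM default in parse_risk().
RISK_MAP = {
    "low": RiskLevel.LOW,
    "medium": RiskLevel.MEDIUM,
    "high": RiskLevel.HIGH,
    "critical": RiskLevel.CRITICAL,
}


def parse_risk(raw: str) -> RiskLevel:
    """Map an LLM-provided risk string to a RiskLevel, defaulting to MEDIUM."""
    return RISK_MAP.get(raw.strip().lower(), RiskLevel.MEDIUM)
```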
OG T
|
07a097c259
|
fix(infra): Sprint 3 auto-repair end-to-end fix: docker-188 SSH egress + service registry expansion
CD Pipeline / build-and-deploy (push) Has been cancelled
NetworkPolicy: add 192.168.0.188:22 egress (execution path for repair-bot-188.sh)
service-registry.yaml: add signoz/bitan-app (AUTO, host 188)
Coverage: completes Bug #11 (188 SSH) + service tiering coverage for 188
E2E verification: MoWoooWorkDown → SSH → REPAIR_OK:momo-app (3791ms)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 18:04:19 +08:00 |
|
OG T
|
77f2da9264
|
fix(k8s): Bug #11+#12: SSH egress allowlist + repair-ssh-key read permission
CD Pipeline / build-and-deploy (push) Has been cancelled
Bug #11 (NetworkPolicy): allow-required-egress is missing 192.168.0.110:22
- SSH port 22 from K8s Pods to 110 is blocked by default-deny-all
- The auto-repair SSH_COMMAND Playbook therefore always fails with Connection refused
- Fix: add port 22 on 110 to the egress allowlist
Bug #12 (Deployment): repair-ssh-key Secret defaultMode=0400 (root-only)
- The Pod runs as appuser (UID 1000) and cannot read the root-owned SSH key
- ssh error: "Load key: Permission denied"
- Fix: add securityContext.fsGroup=1000 so that appuser can read the key through group read
- Verified: ssh from inside the Pod → repair-bot-110 → REPAIR_OK:sentry ✅
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 14:50:49 +08:00 |
|
OG T
|
08db3580a7
|
fix(monitoring): fix high CPU load on host 110
CD Pipeline / build-and-deploy (push) Has started running
Root cause 1: cadvisor continuously scans overlay2 disk usage (1-4s per scan × N containers)
→ add --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
→ --housekeeping_interval=30s --docker_only=true
→ CPU dropped from 239% to <1%
Root cause 2: node_exporter scrape_timeout defaults to 10s; under high load it times out → broken pipe → aggressive retries
→ add scrape_interval: 30s / scrape_timeout: 25s
→ CPU dropped from 48% to 0%
Overall load average: 20 → 9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 13:53:13 +08:00 |
|
OG T
|
5bd8a8a719
|
fix(monitoring): fill in missing blackbox-tcp scrape targets (11→15)
CD Pipeline / build-and-deploy (push) Has been cancelled
Alerts such as SentryDown/HarborDown/SignOzDown reference instances that are not in the scrape list,
so absent metric = 0 and the alerts keep firing
Missing targets added:
192.168.0.125:6443/32334/32335 (K3s)
192.168.0.110:9000/5000/3100 (Sentry/Harbor/Langfuse)
192.168.0.188:3301/5432/6380/11434/8089 (SignOz/PG/Redis/Ollama/OpenClaw)
Prometheus reloaded on host 110; all 15 targets UP
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 12:20:19 +08:00 |
|
OG T
|
85d4857d1b
|
fix(monitoring): RedisMemoryHigh false positive: fix division by zero when max_bytes=0
CD Pipeline / build-and-deploy (push) Has started running
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
- Add a redis_memory_max_bytes > 0 precondition
- Prevents division by zero from yielding +Inf, which always fires, when Redis has no maxmemory set
- Affects: alerts-unified.yml + database-alerts.yaml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 11:41:10 +08:00 |
|
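The guard added in the commit above is an alert-rule precondition; the same logic can be shown as a small Python analogue. This sketch is illustrative only: the function names and the 0.9 threshold are assumptions, not values taken from alerts-unified.yml or database-alerts.yaml.

```python
from typing import Optional


def redis_memory_ratio(used_bytes: float, max_bytes: float) -> Optional[float]:
    """Return used/max, or None when no maxmemory limit is configured.

    Redis reports max_bytes == 0 when maxmemory is unset; dividing by it
    is what produced the permanently firing alert.
    """
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes


def redis_memory_high(used_bytes: float, max_bytes: float,
                      threshold: float = 0.9) -> bool:
    """Fire only when a limit exists and usage exceeds the (assumed) threshold."""
    ratio = redis_memory_ratio(used_bytes, max_bytes)
    return ratio is not None and ratio > threshold
```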
OG T
|
4aa7c179c1
|
feat(k8s): Sprint 5.1 Guardrail: mount the service-registry ConfigMap into the API container
CD Pipeline / build-and-deploy (push) Successful in 16m36s
Problem: the Docker container has no ops/ directory, so service_registry.py cannot find the YAML → everything degrades to AUTO
Solution: mount service-registry.yaml via ConfigMap at /app/ops/config/
Changes:
- k8s/awoooi-prod/15-service-registry-configmap.yaml (new ConfigMap)
- k8s/awoooi-prod/06-deployment-api.yaml (volumeMount + volume)
- .gitea/workflows/cd.yaml (Step 1c apply ConfigMap)
Effect: _find_registry_path() finds the YAML → BLOCK/CRITICAL_HITL/STANDARD_HITL take effect
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 22:12:29 +08:00 |
|
OG T
|
8b5db2f58e
|
feat(infra): switch Ollama to the M1 Pro at 192.168.0.111 + NetworkPolicy update
CD Pipeline / build-and-deploy (push) Has been cancelled
- OLLAMA_URL: 188 → 111 (M1 Pro, 40+ tok/s vs 0.45 tok/s)
- OPENCLAW_DEFAULT_MODEL: qwen2.5:7b-instruct → deepseek-r1:14b (strongest reasoning for SRE)
- OPENCLAW_TIMEOUT: 90s → 120s (slowest measured deepseek-r1:14b response: 54s)
- NetworkPolicy v1.3: add 192.168.0.111:11434 egress, remove the Ollama port on 188
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 22:05:14 +08:00 |
|
OG T
|
88696dba9b
|
feat(sprint5.1): Data Safety Guardrails end-to-end integration (L1-L5)
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (already run on 188):
- M-002: add approval_level/votes/required_votes to approval_records
- M-003: add 8 values to the alert_event_type ENUM
Layer 2 - IaC:
- ops/config/service-registry.yaml: stateful tiering list for all services (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: reads the YAML and provides is_blocked/requires_multisig/get_required_votes
- velero_client.py: queries Velero backup age via kubectl, falls back to 999h on failure
- preflight_service.py: pre-flight safety checks (Q2/Q4 decisions)
Layer 1-M001 - Playbook model:
- playbook.py: add requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - Business logic:
- alert_operation_log_repository.py: add 8 event_type values (Guardrail/Pre-flight/MultiSig/backup)
- auto_repair_service.py: inject the Service Registry Guardrail check (BLOCK → reject outright)
- webhooks.py: ALERT_RECEIVED provenance record + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord synced with the approval_level/votes/required_votes columns
- docker-health-monitor.sh: reworked into a pure sensing layer (all docker restart logic removed)
Layer 5 - Telegram notifications:
- telegram_gateway.py: six new notification methods T1-T6 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
References: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 16:24:09 +08:00 |
|
OG T
|
af07c23675
|
fix(k8s): mount known_hosts at the dedicated /etc/repair-known-hosts directory to fix a mount conflict
CD Pipeline / build-and-deploy (push) Successful in 12m11s
E2E Health Check / e2e-health (push) Successful in 34s
/etc/repair-ssh is already used by repair-ssh-key, so the subPath file mount conflicts
Switch to the dedicated directory /etc/repair-known-hosts and update KNOWN_HOSTS_PATH to match
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 15:06:28 +08:00 |
|
OG T
|
d56aae135d
|
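fix(k8s): repair-known-hosts secret optional:true so the Pod does not block waiting for the secret to be created
CD Pipeline / build-and-deploy (push) Failing after 8m35s
The secret is only created on the first CD run; optional lets the Pod start first,
and the secret is mounted automatically once CD creates it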
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 14:48:45 +08:00 |
|
OG T
|
d4cb9a4ac5
|
ops(k8s): known_hosts Secret + Ansible allowlist ConfigMap (Sprint 3 T2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 14:20:14 +08:00 |
|
OG T
|
23364423fa
|
feat(webhook): ADR-059 GitHub → Gitea webhook migration complete
- gitea_webhook.py: all headers changed to X-Gitea-*, workflow_run handler removed
- gitea_webhook_service.py: _fetch_pr_diff now calls httpx directly and no longer depends on github_api_service
- Remove all leftover github_ log keys from both files; review_id prefix changed to gitea-
- test_gitea_webhook.py: 10/10 passing, docstring corrected
- 03-secrets.yaml: add GITEA_WEBHOOK_SECRET placeholder
- cd.yaml: add GITEA_WEBHOOK_SECRET injection step
- ADR-059: architecture decision record created
Pending action by the Commander: Gitea Actions secret + Gitea UI webhook configuration
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 14:44:32 +08:00 |
|
OG T
|
4b24ecd67f
|
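fix(sprint3): Chief Architect review fixes C1/C2/C3/M3/m1
C1: _ssh_execute takes key_path as a direct parameter instead of looking it up in LAYER_SSH_CONFIG
C2: PlaybookService.create() proxy; the Router no longer reaches through to _repository
C3: CD Step 1b uses sed to replace IMAGE_TAG_PLACEHOLDER, removing the risk of aborting on failure
M3: repair-bot 110/188 regex unified to [a-z0-9][a-z0-9-]{0,30}; underscores are disallowed
m1: add a comment explaining that defaultMode 0400 is octal
m2: _ssh_execute computes the remaining timeout from a deadline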
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 13:07:59 +08:00 |
|
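Two items from the review commit above, the unified service-name regex (M3) and the deadline-based timeout (m2), can be sketched as follows. The pattern is taken from the commit message; the function names, the use of full-string matching, and the monotonic clock are assumptions for illustration and may differ from the repair-bot scripts and _ssh_execute.

```python
import re
import time
from typing import Optional

# Pattern from the commit message: lowercase alphanumerics and hyphens,
# 1 to 31 characters, no leading hyphen, no underscores.
SERVICE_NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,30}")


def is_valid_service_name(name: str) -> bool:
    """Accept a name only if the whole string matches the pattern."""
    return SERVICE_NAME_RE.fullmatch(name) is not None


def remaining_timeout(deadline: float, now: Optional[float] = None) -> float:
    """Seconds left until `deadline` (a time.monotonic() value), never negative."""
    now = time.monotonic() if now is None else now
    return max(0.0, deadline - now)
```

Computing each step's timeout from one shared deadline keeps the total wall-clock time bounded even when an operation consists of several blocking calls.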
OG T
|
4b934bb9fd
|
feat(k8s): mount the repair SSH key into the API Pod (Task 13)
- 06-deployment-api.yaml: volumeMount /etc/repair-ssh + volumes secret defaultMode 0400
- Corresponding K8s Secret: awoooi-repair-ssh-key
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 11:47:37 +08:00 |
|
OG T
|
892c5d53a7
|
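k8s(secret): add a template documenting how to create the repair SSH key
The actual private key is created manually with kubectl create secret and is not committed to Git
Hosts 110 (wooo) / 188 (ollama) are configured with command=-restricted authorized_keys
SSH health check verification: REPAIR_BOT_HEALTHY:110 / REPAIR_BOT_HEALTHY:188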
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 11:17:57 +08:00 |
|
OG T
|
ad4abefcd9
|
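fix(k8s+ops): fix the alert pipeline + auto-start the Gitea runner
CD Pipeline / build-and-deploy (push) Failing after 21s
## Fixes
1. Add 192.168.0.110 to NetworkPolicy allow-nginx-ingress
- Alertmanager (on 110) needs to POST webhooks directly from 110 to the API pod
- Before: 110 was blocked by the NetworkPolicy default-deny, so webhooks timed out
- After: 110 is on the ingress allowlist and the alert pipeline is restored
2. Add the Gitea Act Runner to awoooi-startup-110.sh
- Step 6: start /home/wooo/act-runner (gitea-runner container)
- Before: the runner stayed offline after a reboot, breaking the entire CD pipeline
- After: the runner restarts automatically; a stale configuration is cleared and the runner re-registered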
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 01:42:52 +08:00 |
|
OG T
|
3455044457
|
feat(phase25): Nemotron proactive defense, full implementation of three tracks P0+P1+P2
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE intent override changed to NEMOTRON (was OLLAMA)
- DIAGNOSE fallback uses the local-only chain and never touches the cloud
- When everything fails: REJECT + Telegram notification
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: select the timeout based on context[task_type]
P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN is published directly, no review needed)
- models/playbook.py: SymptomPattern.compute_hash() (16-character deterministic hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
- generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram review card
- generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram notification
- Uses nvidia.chat() (the correct interface); minimal fallback when Nemotron times out
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() supports symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate in decide() + runbook hook in execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint
P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: allowlist filtering + DriftLevel grading
- services/drift_interpreter.py: NemotronDriftInterpreter (intent analysis only, does not generate repair commands)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports table
- k8s/drift-cronjob.yaml: CronJob that scans automatically every hour
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-04 12:35:05 +08:00 |
|
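The commit above mentions a 16-character deterministic symptoms hash stored in a VARCHAR(16) column. One plausible way to build such a hash is sketched below; the choice of SHA-256, the normalization (trim, lowercase, de-duplicate, sort), and the name `compute_symptoms_hash` are all assumptions, since the commit does not say how SymptomPattern.compute_hash() is implemented.

```python
import hashlib
from typing import Iterable


def compute_symptoms_hash(symptoms: Iterable[str]) -> str:
    """Return a 16-hex-character hash that ignores order, case, and duplicates.

    Hypothetical stand-in for SymptomPattern.compute_hash().
    """
    # Normalize so that the same set of symptoms always hashes identically.
    normalized = sorted({s.strip().lower() for s in symptoms if s.strip()})
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()
    return digest[:16]  # fits the VARCHAR(16) column named in the commit
```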
OG T
|
c65ed5b1c9
|
feat(telegram): SRE war room group Triumvirate (ADR-053)
CD Pipeline / build-and-deploy (push) Successful in 7m6s
- config.py: add OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN / SRE_GROUP_CHAT_ID
- telegram_gateway.py: send_to_group / send_as_openclaw / send_as_nemotron / trigger_group_ai_discussion / _send_approval_card_to_group
- send_approval_card: after the alert is sent, asynchronously trigger a two-way AI discussion in the group
- configmap: SRE_GROUP_CHAT_ID=-1003711974679
- secrets: OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN CHANGE_ME placeholders
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-03 17:16:05 +08:00 |
|
OG T
|
ff5a77f7a9
|
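fix(telegram): enable Polling + fix InfraAlertMessage format
CD Pipeline / build-and-deploy (push) Successful in 6m52s
1. TELEGRAM_ENABLE_POLLING: false→true
- clawbot-v5 has stopped polling (STANDBY_MODE)
- The AWOOOI API takes over; the Commander can chat with both AIs, OpenClaw and NemoClaw
2. InfraAlertMessage.format() adds a note field
- A slow NIM is normal and no longer shows "auto-repair failed"
- Shown as a 💡 informational hint instead
3. NIM probe endpoint changed to /v1/models (lightweight, does not trigger billing)
timeout: 10s → 25s (NIM free tier cold start)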
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-03 16:43:40 +08:00 |
|
OG T
|
4284337249
|
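fix(config): NEMOTRON_TIMEOUT_SECONDS 30→55 pinned in the ConfigMap
CD Pipeline / build-and-deploy (push) Successful in 7m0s
NIM free-tier latency is 11-45s; the hard-coded 30s made every slow request time out.
Synced to the prod/dev ConfigMaps so the next CD deployment does not overwrite it.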
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-03 15:58:11 +08:00 |
|
OG T
|
48c65756da
|
chore(config): write USE_AI_ROUTER=true into the ConfigMap (Phase 24 B2)
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Prevents the next CD deploy from overwriting the value set via kubectl set env.
B2 observation period: 48h, ending 2026-04-04 18:40 Taipei time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-02 21:26:53 +08:00 |
|
OG T
|
3f339110dd
|
fix(observability): sync the actual .188 deployment adjustments into the repo
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
Differences from the original plan:
1. MinIO Bearer Token authentication
- Original plan: MINIO_PROMETHEUS_AUTH_TYPE=public (not supported by this version)
- Actual: mc admin prometheus generate produces a Bearer Token
- Update: bearer_token added to prometheus-config-phase-o.yaml
2. remote_write dropped → OTEL Collector Prometheus scrape
- Original plan: Prometheus remote_write → SigNoz OTEL /api/v1/write
- Actual: the SigNoz OTEL Collector does not support the Prometheus remote_write format (404)
- Replacement: the OTEL Collector prometheus receiver scrapes node-exporter + kube-state-metrics directly
- Added: ops/signoz/otel-collector-config-phase-o.yaml (version-controlled copy)
3. ADR-053 acceptance checklist updated with the actual results
Co-Authored-By: Claude Code <noreply@anthropic.com>
|
2026-04-02 21:23:47 +08:00 |
|
OG T
|
99be215e83
|
fix(monitoring): R1 review fixes: Blackbox DNS/PSA labels/alert thresholds
Critical: Blackbox Exporter replacement changed from K8s DNS to the host IP (192.168.0.188:9115)
Important: Descheduler namespace explicitly declares PSA restricted labels
Suggestion: failedJobsHistoryLimit 3→1, add MinioDiskUsageCritical 5% alert
R1 Review by: Chief Architect (Phase O-1)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-02 14:02:50 +08:00 |
|
OG T
|
41bf0681cf
|
feat(observability): Phase O-2/O-3 OTEL log pipeline + Event Exporter + Remote Write
O-2.1: OTEL Collector DaemonSet (filelog receiver)
- Collects Pod stdout/stderr from all K3s nodes → SigNoz ClickHouse
- CRI log parser (Go time layout for +08:00 timezone)
- filter processor excludes kube-system debug noise
- observability namespace PSA privileged (the log directory requires root)
- Resource limits: 50m-200m CPU / 64-128Mi Memory
O-2.2: kubernetes-event-exporter
- K8s Event → structured JSON log → SigNoz
- Warning/Error kept in full, high-frequency Normal events filtered out
- Solves the critical blind spot of Events being retained for only ~1hr by default
O-3: Prometheus remote_write configuration template
- Allowlist: ~50 key metric series (node/container/kube/api/db)
- Goal: 90 days of long-term storage in SigNoz ClickHouse
Deployed and verified: 3 Pods Running, 0 errors, filelog monitoring all namespaces normally
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-02 14:01:42 +08:00 |
|
OG T
|
6ce82ff883
|
fix(k3s): Phase O-1 infrastructure fixes: Descheduler + MinIO/Kali monitoring
O-1.1: Descheduler securityContext fix (PodSecurity restricted compliance)
- Add pod securityContext (runAsNonRoot, runAsUser:65534, seccompProfile)
- Add container securityContext (allowPrivilegeEscalation:false, drop ALL)
- Complete RBAC: list permission on namespaces + replicasets
- Deployed and verified: CronJob ran successfully (Status: Completed)
O-1.3: MinIO Prometheus scrape configuration + alert rules
O-1.4: Kali Blackbox TCP probe + alert rules
- MinioDown, MinioDiskUsageHigh, MinioOfflineDisk
- KaliScannerDown
Pending manual deployment: Prometheus config → .188, kubectl kubeconfig → 120/121
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-02 13:55:26 +08:00 |
|
OG T
|
73e8f8ab77
|
feat(ai): Phase 24-A+B1: AI Provider Registry + strangler wrapper (ADR-052)
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Brain Layer dual-track Registry architecture:
- New src/services/ai_providers/ directory (interfaces + 4 providers)
- OllamaProvider (local, rca/chat/code_review)
- GeminiProvider (cloud, rca/chat)
- ClaudeProvider (cloud, rca/chat/code_review)
- OpenClawNemoProvider (cloud, rca; delegates 188→NIM)
- Extend ai_router.py with:
- AIProviderRegistry (dynamic registration/enable/disable)
- AIRouterExecutor (Cache + gates CB/RL/Sem + execution)
- openclaw.py strangler wrapper: USE_AI_ROUTER=true takes the new path
- config.py + ConfigMap add USE_AI_ROUTER=false (safe default)
- ADR-052 formal document (14 decisions D1-D14)
- HARD_RULES v1.7 adds AI Router rules
Safety: USE_AI_ROUTER=false is off by default and must be enabled manually for observation
Rollback: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false
2026-04-02 ogt: Phase 24 first batch of implementation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-02 13:16:09 +08:00 |
|
OG T
|
3ecfe7b3f5
|
chore: clean up NemoNodeAnimation leftovers + fix the Migration YAML
E2E Health Check / e2e-health (push) Successful in 19s
CD Pipeline / build-and-deploy (push) Has been cancelled
- Remove nemo-node-animation.tsx (unreferenced, superseded by NemoClaw)
- Migration YAML: fix $$ inside the YAML heredoc being interpreted by the shell
by switching to the single-quoted string DO '' syntax
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-02 09:09:25 +08:00 |
|
OG T
|
d8be78b135
|
feat(api): Knowledge Base Phase 1 backend four-layer architecture
CD Pipeline / build-and-deploy (push) Successful in 7m0s
E2E Health Check / e2e-health (push) Successful in 17s
Type Sync Check / check-type-sync (push) Failing after 30s
- models/knowledge.py: Pydantic Schema (EntryType/Source/Status/CRUD)
- db/models.py: KnowledgeEntryRecord ORM (PostgreSQL)
- repositories/interfaces.py: IKnowledgeRepository Protocol
- repositories/knowledge_repository.py: PostgreSQL CRUD implementation
- services/knowledge_service.py: business logic (get_db_context manages the session internally)
- api/v1/knowledge.py: REST Router (get_knowledge_service, no direct DB access)
- main.py: mount the Knowledge Base Router
- k8s/jobs/migrate-knowledge-entries.yaml: DB Migration Job
API endpoints: GET/POST / | GET/PATCH/DELETE /{id} | POST /{id}/approve
GET /search | GET /categories
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-02 00:55:56 +08:00 |
|
OG T
|
9cf73bda4f
|
feat(llmops): enable Langfuse LLMOps tracing + CD auto-injects keys
CD Pipeline / build-and-deploy (push) Successful in 7m6s
E2E Health Check / e2e-health (push) Successful in 18s
- 04-configmap.yaml: LANGFUSE_ENABLED=true (Phase 15.1 key already in the K8s Secret)
- cd.yaml: add automatic CD injection of the Langfuse keys (LANGFUSE_PUBLIC/SECRET_KEY)
- LOGBOOK.md: naming corrected from ClawBot to OpenClaw
- .gitignore: add tsconfig.tsbuildinfo + .superpowers/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 22:19:22 +08:00 |
|
OG T
|
dc9a07bf20
|
fix(k8s): NetworkPolicy: correct OpenClaw port 8089→8088 (clawbot-v5 uses 8088)
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-01 ogt: OpenClaw migrated from the openclaw container (8089) to clawbot (8088)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 22:17:09 +08:00 |
|
OG T
|
5809d3e336
|
feat(ai): delegate Incident RCA to OpenClaw (Nemo): architecture iron-rule correction
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Architecture iron rule: OpenClaw = the AI brain; the AWOOOI API delegates arbitration over HTTP
Changes:
- openclaw.py: add _call_openclaw_analyze(), calling OpenClaw before the LLM fallback
- 04-configmap.yaml: OPENCLAW_URL corrected to :8088 (new container port)
- AI_FALLBACK_ORDER changed to ["ollama","claude"] (removes the paid Gemini API)
OpenClaw /api/v1/analyze/incident → qwen2.5:7b on local Ollama (Nemo)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 21:11:30 +08:00 |
|
OG T
|
bfee94bb6f
|
fix(ai): ADR-044 correct AI_FALLBACK_ORDER: Gemini first for RCA arbitration
CD Pipeline / build-and-deploy (push) Successful in 6m49s
E2E Health Check / e2e-health (push) Successful in 21s
Chain of problems:
1. NVIDIA nemotron-mini-4b truncates JSON → proposal_parse_failed
2. Concurrent Ollama requests from K8s → GPU queue timeout (90s)
3. Expert System fallback → 0% confidence
ADR-044: OpenClaw RCA should use Ollama/Gemini, not the NVIDIA tool-calling model
Fix: gemini > ollama > nvidia > claude
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 20:35:42 +08:00 |
|
OG T
|
555e808f39
|
fix(ai): ollama ahead of nvidia: fix 0% confidence caused by nemotron-mini-4b JSON truncation
CD Pipeline / build-and-deploy (push) Successful in 6m42s
E2E Health Check / e2e-health (push) Successful in 16s
Root cause: nemotron-mini-4b-instruct output JSON is truncated (raw_response={"confidence": )
→ proposal_parse_failed → fallback Expert System → AI arbitration 0% confidence
Fix: AI_FALLBACK_ORDER puts ollama first, NVIDIA demoted to second
(Ollama qwen2.5:7b-instruct at 192.168.0.188:11434 produces stable output quality)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 19:22:15 +08:00 |
|
OG T
|
5b938887c0
|
fix(telegram): disable TELEGRAM_ENABLE_POLLING in K8s prod to resolve 409 Conflict
CD Pipeline / build-and-deploy (push) Successful in 6m48s
E2E Health Check / e2e-health (push) Successful in 16s
The AWOOOI API and OpenClaw (192.168.0.188) were long polling at the same time, causing 409 Conflict,
which degraded AI arbitration to rule matching (0% confidence).
Architecture principle: OpenClaw is the only Telegram Gateway; K8s only sends messages.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 18:09:47 +08:00 |
|
OG T
|
71a4e0f8c8
|
fix(k8s): fix wrong field name in the dev RBAC RoleBinding
CD Pipeline / build-and-deploy (push) Successful in 6m54s
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline (Dev) / build-and-deploy-dev (push) Failing after 3m53s
apiRef → name (the correct Kubernetes field name)
Prevents RoleBinding creation from failing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 16:27:12 +08:00 |
|
OG T
|
9913f5dc6d
|
feat(infra): dev environment separation + BuildKit cache fix + circuit breaker tuning
CD Pipeline / build-and-deploy (push) Successful in 6m52s
E2E Health Check / e2e-health (push) Successful in 17s
CD Pipeline (Dev) / build-and-deploy-dev (push) Failing after 9s
1. k8s/awoooi-dev/: new dev namespace (configs 01-05)
- Namespace + ResourceQuota (cpu 2/4, mem 4Gi/8Gi)
- ConfigMap: ENVIRONMENT=dev, LOG_LEVEL=DEBUG, SHADOW_MODE=false
- Deployment: 1 replica, NodePort 32344, image dev-latest
- RBAC: awoooi-executor-dev ServiceAccount
2. .gitea/workflows/cd-dev.yaml: dev branch CD pipeline
- Trigger: dev branch push
- Build: --no-cache (prevents cache poisoning)
- Tag: dev-{sha} / dev-latest
- Deploy: awoooi-dev namespace, health check 32344
- Telegram: notifications prefixed with [DEV]
3. apps/api/Dockerfile: ARG CACHE_BUST=none (prevents BuildKit cache poisoning)
- deps layer (pip install) can still be cached
- src/ and models.json layers are rebuilt every time
4. .gitea/workflows/cd.yaml: production API build adds CACHE_BUST=git_sha
- Ensures config changes such as models.json make it into the image
5. apps/api/src/services/nvidia_provider.py: timeouts do not count toward the circuit breaker
- TimeoutException → log only, no record_failure()
- Only hard errors (auth/rate limit/exception) trip the breaker
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 16:22:21 +08:00 |
|
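Item 5 of the commit above (timeouts are logged but do not trip the circuit breaker) can be illustrated with the minimal sketch below. The `CircuitBreaker` class, `call_with_breaker`, and the failure threshold are hypothetical, and Python's built-in `TimeoutError` stands in for the TimeoutException named in the commit; only the record_failure() name and the log-only treatment of timeouts come from the source.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("nvidia_provider_sketch")


class CircuitBreaker:
    """Tiny consecutive-failure breaker (hypothetical, for illustration)."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        self.failures = 0


def call_with_breaker(breaker: CircuitBreaker, fn: Callable[[], T]) -> T:
    """Run fn; hard errors count toward the breaker, timeouts are only logged."""
    if breaker.is_open:
        raise RuntimeError("circuit open")
    try:
        result = fn()
    except TimeoutError:
        # A slow provider is not treated as a broken one.
        logger.warning("provider timed out; not counted toward the breaker")
        raise
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

The caller still sees the timeout and can fall over to the next provider; the difference is only that repeated slowness does not take the provider out of rotation.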
OG T
|
419dc2f8e0
|
fix(nvidia): timeout 60s→30s, NVIDIA first to stay on the free tier, fall over to Gemini on failure
CD Pipeline / build-and-deploy (push) Successful in 5m46s
E2E Health Check / e2e-health (push) Successful in 16s
- nvidia_provider.py: NVIDIA_TIMEOUT 60→30s
- models.json: timeout_seconds 60→30s
- configmap: NEMOTRON_TIMEOUT_SECONDS 45→30s, fallback order restored with nvidia first
Goal: Nemo has enough time to respond (free), failures fall over quickly to Gemini (backup), and the overall mechanism works
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 16:05:19 +08:00 |
|
OG T
|
4c622813af
|
fix(auto-repair): auto-repair thresholds that are actually reachable (Phase 22 P1)
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem: all four locks were stuck, so auto-repair never triggered
1. configmap: Gemini first (100ms vs NVIDIA 60s timeout)
2. auto_approve: confidence 0.90→0.65, trust 5→1, playbook 3→1
3. auto_approve: allow medium risk, require_playbook=False
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 16:02:16 +08:00 |
|
OG T
|
eccf61fbc9
|
fix(ai): fix fake confidence + lift Shadow Mode (Phase 22 P1)
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
1. openclaw.py: confidence on LLM truncation 0.82→0.0 (fabricated confidence is forbidden)
2. prompts.py: NEMOTRON schema example values replaced with placeholders so the model does not copy 0.75
3. configmap: SHADOW_MODE_ENABLED=false, allow automatic execution for low risk
Thresholds: confidence≥90% + trust_score≥5 + playbook_success≥95%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 15:59:42 +08:00 |
|
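Item 1 of the commit above (a truncated LLM response must yield confidence 0.0 rather than a made-up value) can be sketched like this. `parse_proposal` and the `parse_failed` flag are hypothetical; the actual handling in openclaw.py is not shown in the commit message.

```python
import json
from typing import Any, Dict


def parse_proposal(raw_response: str) -> Dict[str, Any]:
    """Parse an LLM JSON reply; any parse problem yields confidence 0.0."""
    failed: Dict[str, Any] = {"confidence": 0.0, "parse_failed": True}
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        # Truncated output such as '{"confidence": ' lands here.
        return failed
    if not isinstance(data, dict):
        return failed
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Clamp to [0, 1] so a malformed value cannot inflate the score.
    data["confidence"] = min(1.0, max(0.0, confidence))
    data["parse_failed"] = False
    return data
```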
OG T
|
cc6b18e3bc
|
fix(phase22): fix three Telegram chat bugs (ADR-044)
E2E Health Check / e2e-health (push) Successful in 18s
P0: security_interceptor.py adds an intercept_telegram() method
- Fixes the AttributeError in _handle_chat_message (fatal bug)
- Allowlist validation, no Nonce required (chat messages vs button callbacks)
P1: nvidia_provider.py chat() adds a use_json_mode parameter
- Defaults to False for chat (natural-language replies)
- RCA/analysis passes True (structured JSON output)
- openclaw.py RCA calls add use_json_mode=True
P2: K8s ConfigMap enables TELEGRAM_ENABLE_POLLING=true
- The K8s AWOOOI API takes over @tsenyangbot Long Polling
- OpenClaw (188) stops Telegram and becomes a pure REST service
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-03-31 21:53:09 +08:00 |
|
OG T
|
6da240219d
|
feat(k8s): Phase 22 enable OpenClaw + Nemotron collaboration (ADR-044)
E2E Health Check / e2e-health (push) Successful in 17s
- Add ENABLE_NEMOTRON_COLLABORATION=true
- Add NEMOTRON_TIMEOUT_SECONDS=45
- Add NEMOTRON_ASYNC_UPDATE=true
- Commander's requirement: @tsenyangbot shows OpenClaw arbitration and Nemotron execution together
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-03-31 21:39:14 +08:00 |
|
OG T
|
219525f64f
|
fix(k8s): #33 Sentry + OTEL fixes
E2E Health Check / e2e-health (push) Successful in 17s
Fixes:
1. SENTRY_DSN: K8s secret patched (empty → correct DSN)
2. OTEL_EXPORTER_OTLP_ENDPOINT: 24318 → 24317 (gRPC port)
Verification results:
- sentry_initialized ✅
- OTEL export without errors ✅
- Session Replay configured ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-31 12:18:06 +08:00 |
|
OG T
|
723e8ef251
|
feat(api): Phase 21.3 Weekly Report (ADR-041)
E2E Health Check / e2e-health (push) Successful in 16s
- Add WeeklyReportMessage dataclass (telegram_gateway.py)
- Add WeeklyReportService (integrates StatsService + K3sMonitor)
- Add CronJob (every Friday 18:00 Taipei)
- Add API endpoints (/stats/weekly/preview, /stats/weekly/report)
Phase 21 periodic reporting is fully complete!
- 21.1 Daily E2E Schedule ✅
- 21.2 K3s Telegram Report ✅
- 21.3 Weekly Report ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-31 11:28:46 +08:00 |
|
OG T
|
ce6b1b1c64
|
docs: update LOGBOOK - #17 i18n Hydration complete
Frontend P1 improvements all complete:
- #15 SSE + optimistic updates (8c8664c)
- #16 DOM Bypass (0b87018)
- #17 i18n Hydration (f25e94e)
Chief Architect review: 96/100 OUTSTANDING
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-31 11:23:38 +08:00 |
|