Commit Graph

286 Commits

Author SHA1 Message Date
OG T
fc03eb1f4d fix(auto-repair): _extract_symptoms 優先用 labels.alertname 取得原始 alertname
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
問題: signal.alert_name 存的是 alert_type (如 "custom"),而非 Prometheus
alertname (如 "SentryDown"),導致 playbook Jaccard 匹配永遠失敗 (NO_MATCH)。

根本原因: webhook 的 alertname_to_type mapping 將未知 alertname 轉為 "custom",
存入 signal.alert_name,但 Playbook 的 symptom_pattern.alert_names 存原始名稱。

修正: 從 signal.labels["alertname"] 讀取原始 Prometheus alertname,
fallback 到 signal.alert_name (保持向下相容)。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:26:18 +08:00
OG T
79a9a514dd fix(rules): ADR-064 L1 Redis 分散式鎖防止多 Pod 重複生成規則
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
問題: _generating set 是進程級,多 Pod 各自獨立,同一 alertname 可能被
  多個 Pod 同時送給 Ollama/Gemini 生成規則

修復: SET NX EX lock_key — 只有第一個 Pod 能取鎖,其他 Pod 直接跳過
降級: Redis 不可用時 fallback 進程級 set(保持原有行為)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:03:51 +08:00
OG T
b66263ad36 fix(decision_manager): resolved Incident 不重送 Telegram
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
dedup TTL 10分鐘過期後,已 resolve 的 Incident 仍被重新推送
加入狀態檢查,resolved/closed 直接跳過

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:00:39 +08:00
OG T
9361fd1fa7 fix(decision_manager): action 不應 strip_placeholders 避免截斷 deployment name
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
_strip_placeholders 移除 <...> 導致 kubectl rollout restart deployment/<name>
變成 kubectl rollout restart deployment/,Telegram 顯示建議指令不完整

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:45:33 +08:00
OG T
d467fc11be fix(nemotron): 修復 deployment_name placeholder 問題
根因: Nemotron tool calling 收到 target_resource=DockerContainerUnhealthy
  (非真實 K8s deployment name),不確定時填 <deployment_name>

修復:
1. prompt 明確標注 deployment_name 必須填入 target_resource
2. 收到 tool call 結果後,偵測 placeholder 並用 target_resource 覆蓋

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:44:25 +08:00
OG T
c5e475121a fix(telegram): 修復建議指令被截斷 + decision_manager enum string 補正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因 1: telegram_gateway.py suggested_action[:35] 剛好截到 deployment/ 後
  → 改為 [:80],完整顯示 kubectl command

根因 2: 舊 Incident proposal_data 存 enum string (RESTART_DEPLOYMENT)
  → decision_manager.py 加入偵測,用規則引擎重新查 kubectl command

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:14:30 +08:00
OG T
428e66c111 fix(arch-review): 首席架構師審查 S1×3 S2×3 S3×3 全修復 + ADR-064
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
S1 Critical:
- S1-1: asyncio 觸發移至 _call_with_fallback async 上下文,移除 sync 中的 get_event_loop()
- S1-2: _append_rule_to_yaml 加 textwrap.dedent() 正規化 LLM 輸出縮排
- S1-3: _matches() 對 alertname=["*"] 直接回傳 False,防意外命中

S2 Major:
- S2-1: auto_generate_rule() 改為 DI 參數注入 (ollama_url/model/gemini_api_key),移除 import settings
- S2-4: _generate_mock_response docstring 澄清為規則引擎生產路徑,非假數據
- S2-5: suggested_action .strip() 防空白字串繞過 or

S3 Minor:
- S3-2: priority 上界 min(next, 890)
- S3-3: alertname sanitize re.sub([{}]) 防 format KeyError
- S3-4: model_registry.py 最後修改時間戳更新

文件:
- ADR-064: Alert Rule Engine YAML 驅動 + AI 自動學習
- Skills 02: 告警規則引擎 DI 規範 + asyncio 禁止事項
- Skills 03: _generate_mock_response 語意澄清 + 規則引擎降級流程
- LOGBOOK: 本次 Session 完整記錄

2026-04-09 ogt: 首席架構師審查修正

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 10:52:40 +08:00
OG T
89da2d24be fix(model-registry): fallback config 更新為 deepseek-r1:14b + gemma3:4b
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m20s
- model_registry._get_default_config: ollama summary llama3.2:3b → gemma3:4b
- model_registry._get_default_config: ollama default/rca → deepseek-r1:14b
- 修正 test_smart_router::test_simple_context 失敗 (斷言 gemma3:4b)
- alert_rule_engine: 移除 asyncio/time unused import
- 2026-04-09 ogt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:52:47 +08:00
OG T
71437db0e9 feat(rule-engine): 自動規則生成 — generic_fallback 觸發 AI 學習
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m25s
流程:
1. 告警命中 generic_fallback 規則
2. 背景觸發 auto_generate_rule()
3. Ollama (deepseek-r1:14b) 生成 YAML 規則片段
4. Ollama 失敗 → Gemini 備援
5. 驗證格式 → append alert_rules.yaml → 清除 lru_cache
6. 下次同類告警直接命中專屬規則,不再走兜底

去重: 同一 alertname 進程內只生成一次
手寫規則 priority 1-499,AI 生成 500-899,兜底 999

2026-04-09 ogt: AI 自學規則引擎

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:20:33 +08:00
OG T
d1ede7f989 feat(openclaw): 告警規則引擎 — alert_rules.yaml 取代硬編碼 if/elif
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 alert_rules.yaml: 6 條規則 (docker/target_down/oom/cpu/5xx/crash) + 通用兜底
- 新增 alert_rule_engine.py: YAML 載入、匹配邏輯、變數填充
- openclaw.py _generate_mock_response: 重構為呼叫規則引擎 (v8.0)
- 新增規則只需修改 YAML,重啟 Pod 即可,不需改代碼
- 2026-04-09 ogt: 架構重構

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:05:23 +08:00
OG T
3abc7c2f85 fix(openclaw): DockerContainerUnhealthy + TargetDown 專屬規則匹配
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- DockerContainerUnhealthy: ssh docker inspect + docker restart,含 healthcheck 指令驗證
- TargetDown / IP:port instance: ssh 確認 exporter 存活
- 修正 target 混用 alertname 作為 deployment 名稱的問題
- alertname/labels 從 alert_context 提取供規則判斷
- 2026-04-09 ogt: 新增兩條專屬規則

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:00:31 +08:00
OG T
d80153bdce fix(openclaw): NIM 完全失敗後 fallback 到 Gemini 產生執行方案
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m34s
NIM tool calling 多次 timeout 後,不再顯示空白執行方案,
改由 Gemini 代理產生 kubectl 操作指令(JSON 解析)。
只有 NIM 完全失敗才觸發,符合統帥「必須等到有回應」原則。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:55:25 +08:00
OG T
86ac6ed028 perf(api): HostAggregator 效能優化 — probe timeout 縮短 + 30 秒記憶體快取 2026-04-08 22:42:01 +08:00
OG T
c9f1bcd122 fix(api): service_registry 安全降級 — Docker 無 YAML 時不 crash,fallback AUTO
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m37s
2026-04-08 21:47:38 +08:00
OG T
db4b28c49d fix(ci): 強制觸發 CD — service_registry.py Docker 路徑修正已包含於 1f9eea5
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m45s
Pod CrashLoopBackOff: IndexError parents[5]
修復: _find_registry_path() 安全搜尋 (parents[4]/parents[3]/絕對路徑)
1f9eea5 已修復但未觸發 CI,此 commit 強制重新 build

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:37:49 +08:00
OG T
1f9eea5b74 fix(api): service_registry.py Path 索引修正 — 相容 Docker 容器環境
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-08 21:34:40 +08:00
OG T
14cb015826 fix(openclaw): Nemotron 重試邏輯 + exhausted log key (未提交的修改)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- generate_incident_proposal_with_tools: 單次 try/except → 2次重試迴圈
- 失敗 log key: nemotron_collaboration_failed → nemotron_collaboration_exhausted
- 失敗時 nemotron_enabled=True (讓統帥看到失敗狀態)
- _call_nemotron_tools: timeout 超時改為拋出異常(讓外層重試)
- 這是之前 Session 的本地修改,修正測試與實際實作不一致問題

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:16:34 +08:00
OG T
0f5fecfef5 fix(sprint5.1): 首席架構師審查修正 — S1×4 S2×2 S3×1
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m40s
S1-1: service_registry/velero_client/preflight_service 改用 structlog
S1-2: velero_client datetime.now(UTC) 改用 now_taipei()(台北時區鐵律)
S1-3: Guardrail 失敗改為保守拒絕(原放行方向與安全目標相悖)
S1-4: service_registry import 移至模組頂部(移除函數內 import)
S2-1: telegram_gateway T1-T6 六個通知方法補齊 try/except
S2-2: webhooks.py Langfuse URL 改用 settings.LANGFUSE_URL(移除硬寫內網 IP)
S3-3: velero_client trigger_emergency_backup 改為 kubectl apply Backup CRD
      (原 kubectl create backup 語法不存在,審查發現靜默失敗風險)

審查評分: 70/100 → 修正後預計 90+/100

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:36:18 +08:00
OG T
88696dba9b feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
  - k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader

Layer 1 - DB Migration (已在 188 執行):
  - M-002: approval_records 新增 approval_level/votes/required_votes
  - M-003: alert_event_type ENUM 新增 8 個值

Layer 2 - IaC:
  - ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)

Layer 3 - Python Services:
  - service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
  - velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
  - preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)

Layer 1-M001 - Playbook model:
  - playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup

Layer 4 - 業務邏輯:
  - alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
  - auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
  - webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
  - db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
  - docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)

Layer 5 - Telegram 通知:
  - telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)

參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:24:09 +08:00
OG T
eee6f06215 feat(auto-repair): 所有操作強制寫入 DB — auto_repair_executions 表
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m32s
統帥指令: 所有自動修復操作(成功/失敗)必須持久化

變更:
- migrations/phase10_auto_repair_executions.sql: 新增表 + 4 個索引
- db/models.py: 新增 AutoRepairExecution SQLAlchemy model
- repositories/audit_log_repository.py: 新增 AutoRepairExecutionRepository (create/list_by_incident/get_stats)
- auto_repair_service.py: execute_auto_repair 成功/失敗分支都寫入 DB
  - 新增 similarity_score 參數傳遞
  - AutoRepairDecision 新增 similarity_score 欄位
- webhooks.py: 傳入 similarity_score 到 execute_auto_repair

已執行 migration: awoooi_prod@192.168.0.188:5432 

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:16:37 +08:00
OG T
68a2fff746 feat(auto-repair): 移除所有阻擋門檻 — 直接全部跳成自動修復
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m38s
統帥指令: 所有 APPROVED Playbook 直接執行,不再檢查:
- 相似度門檻 (MIN_SIMILARITY_SCORE 0.7 → 0.0)
- is_high_quality 品質門檻
- 冷啟動信任機制
- 動作風險等級門檻 (evaluate + execute 兩層)

保留: P0/P1 嚴重度人工審核、全域冷卻熔斷、APPROVED 狀態檢查

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:10:09 +08:00
OG T
b7ea362efc fix(api): Review #2 技術債清理 — I1/S1/S2/S3 全數修正
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m13s
I1: error_type 欄位補全
- AnomalyCounter.derive_key_from_incident() 新增
  從 signal.labels 提取 reason/error_type,確保四欄位完整

S1: 三處 signature 建構邏輯統一
- auto_repair_service._derive_anomaly_key() → 委託 derive_key_from_incident()
- approval_execution._get_anomaly_key_from_approval() → 同上
- incident_service.resolve_incident() B4 → 同上
- 消除 3 處重複的 signature 建構程式碼

S2: Redis Pipeline 批次查詢
- get_all_disposition_stats() 從 N+1 hgetall 改為 2 次 Pipeline
- Pipeline 1: 批次 hgetall 所有 disposition key
- Pipeline 2: 批次 hget metadata (alert_name)
- 效能從 O(2N) Redis round-trip 降至 O(2)

S3: auto_repair.py get_incident AttributeError 修復
- get_incident() → get_from_working_memory() (pre-existing bug)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 13:13:42 +08:00
OG T
de3935d1d4 feat: Sprint 4 Phase E+F — 前端處置統計 + 週報處置分佈
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Type Sync Check / check-type-sync (push) Failing after 1m2s
Phase E: 前端頁面
- E1: /reports 完整處置統計儀表板 (已在 Sprint F 完成)
- E2: 首頁 Metrics Strip — 從 disposition API 取得真實自動化率
  優先使用 /stats/disposition auto_rate,fallback 到 incidents 推算
- E3: /auto-repair 處置概況卡片 (已在 Sprint F 完成)
- E4: /neural-command stats tab 處置分佈 (已在 Sprint F 完成)
- E5: i18n 翻譯 zh-TW + en (已在 Sprint F 完成)

Phase F: 週報 + 文件
- F1: WeeklyReportMessage 新增 disposition 5 欄位
  週報格式加「📋 處置分佈」區塊 (自動/冷啟動/人工/手動 + 自動化率)
  weekly_report_service 整合 get_all_disposition_stats()
- message 字數上限從 900 提升到 1200 (適應處置區塊)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 13:02:20 +08:00
OG T
561bcb638b fix(api): Sprint 4 首席架構師 Review P0 修正 — hash 統一 + 積木化合規
P0-1: anomaly_key hash 推導統一
- B1: 新增 _derive_anomaly_key() 使用 AnomalyCounter.hash_signature()
  取代 symptoms.compute_hash()
- B3/B4: namespace 改用 signal.labels.get("namespace", "")
  修正 getattr(signal, "namespace", "") 永遠回傳空字串

P0-2: Router 層積木化合規
- C1/C2: 封裝 get_all_disposition_stats() 到 AnomalyCounter
- Router 不再直接存取 counter.redis
- stats.py 移除未使用的 days/stats 參數

P1: get_frequency() 填充 disposition 欄位
- 與 _record_anomaly_impl() 一致,回傳完整處置統計

首席架構師評分: 82/100 → P0 全數修正

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:53:12 +08:00
OG T
a85e9ced08 feat(api+telegram): Sprint 4 Phase C+D — API 端點 + Telegram 處置統計
Phase C: API 端點
- C1: GET /api/v1/stats/disposition — 完整處置分佈統計
  - DispositionSummary: auto/human/manual/cold_start + auto_rate
  - DispositionByAnomaly: 按異常類型明細 (最多 20 筆)
  - Redis SCAN + HGETALL 聚合
- C2: GET /api/v1/auto-repair/stats 擴充 disposition_summary

Phase D: Telegram 告警格式
- D1: 告警卡片加處置統計行
  - 🤖 自動: N | 👤 審核: N | 🔧 手動: N
  - 自動化率百分比
- D2: 歷史按鈕強化處置分佈明細
  - 完整 5 項計數 + 自動化率

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:17:20 +08:00
OG T
9253281d46 feat(api): Sprint 4 Phase A+B — 告警處置統計資料層+寫入層
Phase A: 資料層
- A1: IncidentFrequencyStats 新增 4 欄位 (human_approved/manual_resolved/cold_start_trust/total_resolution)
- A2: AnomalyCounter.record_disposition() — Redis HINCRBY 原子遞增
- A3: get_disposition_stats() — HGETALL 回傳處置分佈
- AnomalyFrequency dataclass 擴充 + to_dict() 同步
- _record_anomaly_impl() 整合 disposition stats

Phase B: 寫入層觸發點接線
- B1: 自動修復成功 → record_disposition("auto_repair")
- B2: 冷啟動信任成功 → record_disposition("cold_start_trust")
  - AutoRepairDecision 新增 is_cold_start flag
  - execute_auto_repair() 接收並區分處置類型
- B3: 人工批准執行成功 → record_disposition("human_approved")
  - 新增 _get_anomaly_key_from_approval() helper
- B4: 手動處理推斷 → resolve_incident() 排除法判定
  - 若 resolved 且無 auto/human/cold_start 紀錄 → manual_resolved

安全設計: 所有 disposition 記錄走 try/except,失敗不阻塞主流程

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:54:46 +08:00
OG T
53b2daeaca feat(api): 首次信任機制 — 打破自動修復冷啟動雞生蛋問題
問題: Playbook 需要 success_count >= 3 才算 is_high_quality,
但沒有自動修復就不會有成功紀錄 → 永遠達不到門檻。

方案 C: 首次信任 (Cold Start Trust)
- APPROVED 狀態 + 全步驟 risk=LOW + 執行次數 < 3 → 自動放行
- Redis counter 限制每日最多 5 次首次信任自動修復
- 累積 3 次成功後自動回歸正常 is_high_quality 門檻

安全邊界:
- 只有 LOW risk 步驟才能首次信任 (重啟容器等)
- HIGH/CRITICAL 仍需人工審核
- P0/P1 嚴重度仍需人工審核
- 每日上限防止失控

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:21:00 +08:00
OG T
2fe8062fb8 refactor(api): Re-Review S1/S2/S3 改善 — 消除重複+防禦性驗證+測試隔離
S1: 抽取 _execute_and_observe() 公用方法
  - 消除 repair_by_uri 中 3 處重複的 execute+audit+langfuse 邏輯
  - 統一 AuditLog + Langfuse trace 寫入路徑

S2: SSH username 防禦性驗證
  - 新增 validate_ssh_user() + _SSH_USER_RE 正則
  - 在 _ssh_execute() 入口驗證 user 參數
  - 防止 user@host 拼接產生非預期行為
  - 新增 8 個 username 驗證測試

S3: Singleton 測試重置
  - 新增 _reset_for_test() classmethod
  - 避免跨測試狀態污染
  - 新增 2 個 singleton reset 測試

測試: 55/55 全數通過 (原 45 + 新 10)
首席架構師 Re-Review: 91/100  通過,3 個 Suggestion 全數實裝

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:17:40 +08:00
OG T
78a8d3dfa5 fix(api): ansible 控制節點加白名單驗證,防環境變數繞過 (Re-Review Important)
首席架構師 Re-Review 指出: ANSIBLE_CONTROL_HOST 來自環境變數 (ConfigMap),
若 ConfigMap 被篡改可繞過 SSH_TARGET_WHITELIST。
在 _execute_ansible() 開頭加 validate_ssh_target_host(host) 閉環。

Re-Review 評分: 91/100  通過

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:13:49 +08:00
OG T
f8d4772abf fix(api): Sprint 3 P0-1/P0-2/P0-3/P0-4 Critical Security Fixes
P0-1: Complete shell metacharacter regex detection
  - Enhanced _SHELL_METACHAR_RE to detect: >, <, \n, ${}, $()
  - Prevents all shell injection vectors (redirects, variable expansion, newlines)
  - Added 5 new validation tests

P0-2: Add shlex.quote() protection for ansible playbook path
  - Wraps playbook_path in shlex.quote() before SSH command construction
  - Prevents shell injection if path contains special characters
  - Applied in _execute_ansible() method

P0-3: Add SSH target host whitelist validation
  - Introduces validate_ssh_target_host() function
  - Only allows SSH to: 192.168.0.110, 192.168.0.188
  - Prevents unauthorized SSH target exploitation
  - Added 5 new whitelist validation tests

P0-4: Convert HostRepairAgent to singleton pattern
  - Implements __new__() singleton with shared _in_process_locks dict
  - Ensures in-process locks persist across multiple auto_repair_service calls
  - Previously created new instance per call, making locks ineffective
  - Added singleton persistence test

Test Results: 45/45 passing (34 existing + 11 new P0 tests)
All security validations verified via comprehensive unit test coverage.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:09:45 +08:00
OG T
af07c23675 fix(k8s): known_hosts 改掛 /etc/repair-known-hosts 獨立目錄,修 mount 衝突
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m11s
E2E Health Check / e2e-health (push) Successful in 34s
/etc/repair-ssh 已被 repair-ssh-key 佔用,subPath 檔案掛載衝突
改為獨立目錄 /etc/repair-known-hosts,路徑同步更新 KNOWN_HOSTS_PATH

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 15:06:28 +08:00
OG T
1644fe6474 feat(api): auto_repair_service 整合 repair_by_uri (Sprint 3 T6)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:39:03 +08:00
OG T
a4e11bfa92 feat(api): AuditLog + Langfuse Trace for SSH_COMMAND (Sprint 3 T5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:38:59 +08:00
OG T
4561f141bb feat(api): Redis 冪等鎖防止重複修復 (Sprint 3 T4)
雙層鎖設計: in-process asyncio.Lock (必定生效) + Redis 分散式鎖 (跨 Pod best-effort)
同一 URI 的第二次修復呼叫立即返回 "already running" 錯誤

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:26:53 +08:00
OG T
1a654aa37d feat(api): HostRepairAgent 三條執行路徑 + known_hosts + Ansible 白名單 (Sprint 3 T3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:22:54 +08:00
OG T
5e8b2a6894 feat(api): URI scheme 解析器 + Shell Injection 防護 (Sprint 3 T1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:18:21 +08:00
OG T
f6332b4b2f fix(telegram): 修正 approval_id UUID 轉換錯誤 — 支援 INC-xxx 格式
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m24s
_execute_approval_action 用 UUID(approval_id) 但 approval_id 是 INC-xxx,
導致 'badly formed hexadecimal UUID string' 錯誤,簽核無法執行。

修正: 先嘗試 UUID 轉換,失敗則用 incident_id 查出對應的 pending approval UUID。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:53:48 +08:00
OG T
658337ec18 fix(phase26): 打通 Incident→DB→KM 完整鏈路 + namespace 修正
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
問題根因:
1. create_incident_for_approval 只存 Redis,不存 PostgreSQL
   → TTL 7天後消失,Playbook 萃取永遠找不到 Incident
2. ApprovalRecord 無 incident_id 欄位
   → _trigger_playbook_extraction 靠 regex 掃中文文字找 INC-,永遠失敗
3. operation_parser namespace fallback 是 "default"
   → 所有 deployment 在 awoooi-prod,203 次執行全失敗

修復:
- Incident 同時寫入 Redis + PostgreSQL (save_to_episodic_memory)
- ApprovalRecord 加入 incident_id 欄位 (model + ORM + migration)
- alertmanager_webhook 建立 Approval 後回寫 incident_id
- _trigger_playbook_extraction 直接用 approval.incident_id
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:46:05 +08:00
OG T
286a96d1aa fix(knowledge): entrystatus enum 大小寫修正 'archived' → 'ARCHIVED'
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m47s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:25:44 +08:00
OG T
09d965dab5 fix(telegram): 修正 editMessageText 400 錯誤 — 先移除按鈕再更新文字
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m46s
原因: original_text 來自 message.text (純文字),含 <>&等字符,
     用 parse_mode=HTML 發送時 Telegram 返回 400。

修正:
1. 先呼叫 editMessageReplyMarkup 移除按鈕 (確保按鈕一定消失)
2. 再 html.escape(original_text) 後嘗試更新文字
3. 文字更新失敗不影響整體流程 (按鈕已移除為首要目標)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:13:54 +08:00
OG T
8220027298 fix(telegram+cd): 兩個顯示 bug 修正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Nemotron args 顯示 Python dict 字串問題
   - restart_deployment: {'deployment_name': 'awoo'} → restart_deployment: deployment_name=awoooi-api
   - 改用 key=value 格式化,不再使用 str(dict)[:25]

2. CD 通知 ${MINUTES}/${SECONDS} 等變數未展開
   - TG_MSG 從 env: 移到 run: shell 中組裝
   - env: 中的 shell 變數在 bash 執行前是靜態字串,無法展開

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:47:52 +08:00
OG T
23364423fa feat(webhook): ADR-059 GitHub → Gitea Webhook 遷移完成
- gitea_webhook.py: Header 全部改 X-Gitea-*,移除 workflow_run handler
- gitea_webhook_service.py: _fetch_pr_diff 改直接 httpx,不依賴 github_api_service
- 清除兩個檔案的所有殘留 github_ log key,review_id prefix 改 gitea-
- test_gitea_webhook.py: 10/10 通過,docstring 修正
- 03-secrets.yaml: 新增 GITEA_WEBHOOK_SECRET 佔位
- cd.yaml: 新增 GITEA_WEBHOOK_SECRET 注入步驟
- ADR-059: 建立架構決策文件

待統帥操作: Gitea Actions secret + Gitea UI Webhook 設定

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:44:32 +08:00
OG T
6777532534 feat(webhook): Task 1+2 — config + service GitHub→Gitea 遷移 (ADR-059)
- config.py: GITHUB_WEBHOOK_SECRET/ALLOWED_REPOS → GITEA_*
- 新增 gitea_webhook_service.py: PR/Push review only, 移除 CI diagnosis
- 移除 CIFailureDiagnosis, diagnose_ci_failure, _call_openclaw_ci_diagnosis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:33:58 +08:00
OG T
22ee9b2fe3 fix(telegram): answerCallbackQuery result=true 導致 bool is not iterable
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m3s
Telegram answerCallbackQuery 成功時返回 {"ok": true, "result": true},
_send_request 中 "message_id" in result["result"] 對 bool 做 in 操作
報 "argument of type 'bool' is not iterable"。

修正:加 isinstance(result_val, dict) 防禦後再做 in 檢查。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:20:54 +08:00
OG T
4b4007db6c feat(telegram): SRE 群組告警格式升級為完整 v7.0
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
_send_approval_card_to_group 改用與個人 chat 相同的 TelegramMessage.format()
格式,包含 SignOz metrics、AI provider/model、Nemotron 協作、異常頻率統計等全部欄位。

統帥指示:群組收到的告警訊息要與個人 chat 格式完全一致。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:11:59 +08:00
OG T
76f3ffd7f7 fix(telegram): whitelist property 返回字串導致按鈕無反應
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m0s
security_interceptor.whitelist 返回 settings.OPENCLAW_TG_USER_WHITELIST
(字串),但 is_whitelisted 做 user_id in whitelist(int in str),
Python 報 "requires string as left operand, not int"。

修正:改呼叫 settings.get_tg_user_whitelist() 返回 list[int]。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:40:52 +08:00
OG T
4b24ecd67f fix(sprint3): 首席架構師 Review C1/C2/C3/M3/m1 修正
C1: _ssh_execute 直接接收 key_path 參數,不反查 LAYER_SSH_CONFIG
C2: PlaybookService.create() proxy,Router 不再穿透呼叫 _repository
C3: CD Step 1b sed 替換 IMAGE_TAG_PLACEHOLDER,消除失敗中斷風險
M3: repair-bot 110/188 regex 統一 [a-z0-9][a-z0-9-]{0,30},禁止底線
m1: defaultMode 0400 加八進位說明注釋
m2: _ssh_execute 用 deadline 計算剩餘 timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:07:59 +08:00
OG T
665f93e83f fix(telegram): 首席架構師 R1 修正 — I-1/I-2/M-1/M-2
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
I-1: webhooks/sentry_webhook/signoz_webhook 三個呼叫者補 TODO 說明
     無 incident_id 是已知限制(Approval 路徑未建 Incident 關聯)
I-2: TestPushRequest 新增 incident_id 欄位,使 QA 可驗證按鈕渲染
M-1: 移除 _build_inline_keyboard 呼叫中多餘的 `or message.incident_id`
M-2: 補充 900/1000 截斷長度差異說明

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:07:42 +08:00
OG T
4935cfc346 fix(telegram): 重設計訊息格式 + 修復 detail/reanalyze/history 按鈕失效
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m26s
- format() / format_with_nemotron(): 移除 ═══ 分隔符,改為簡潔換行佈局
- send_approval_card(): 新增 incident_id 參數,傳入 _build_inline_keyboard()
- decision_manager.py: 呼叫 send_approval_card() 時傳入 incident.incident_id
- 問題根因: incident_id 未傳入 _build_inline_keyboard() 導致第二排按鈕從未渲染

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:44:13 +08:00
OG T
53e1ae7ad7 fix(phase25): I2 NIM system prompt + I4 field_path 正則匹配修正
Some checks failed
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
I2: nemotron.analyze() 補上 system role (NIM 標準 message format)
    - 舊: messages=[{role:user, ...}]
    - 新: messages=[{role:system, ...}, {role:user, ...}]
    - 效果: K8s operator 角色定義,改善 tool calling 品質

I4: drift_detector._is_allowlisted/_is_critical 用正則取代 strip
    - 舊: replace('[*]','') 後 startswith/in → 無法匹配 containers[0]
    - 新: [*] → \[\d+\] 正則,正確匹配所有索引
    - 修復: containers[*].image 現在能匹配 containers[0].image
2026-04-05 12:11:05 +08:00