Your Name
359a6ee495
fix(test-schema): add the matched_playbook_id column to approval_records
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause of the CI B5 integration test failure: 04ff225 added matched_playbook_id to the ORM model,
but tests/integration/setup_test_schema.sql was not updated to match, so
test_approval_lifecycle / test_incident_approval_association raised
UndefinedColumnError and blocked CD Pipeline build-and-deploy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:48:37 +08:00
Your Name
a6788c2baa
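fix(tests): move DB tests to the integration layer to fix the CI asyncpg password error
...
CD Pipeline / build-and-deploy (push) Failing after 1m55s
The three real-DB tests in test_aider_event_processor.py aborted in the CI unit-test layer
(tests/) because the connection to the awoooi_dev DB failed (password mismatch).
Correct layout:
tests/ - unit tests, run directly by CI, no DB
tests/integration/ - integration tests, passed to --ignore in CI, covered by K8s E2E
Fix:
- tests/test_aider_event_processor.py keeps only the DB-free malformed payload test
- The three DB tests move to tests/integration/test_aider_event_processor_integration.py,
which uses the conftest db_session fixture instead of building its own engine (avoids a hardcoded password)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>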
2026-04-22 01:41:34 +08:00
Your Name
479f8d8971
refactor(tests): clear technical debt by removing the FakeRepo/FakeSession mock-DB violations
...
CD Pipeline / build-and-deploy (push) Failing after 35s
## ai_router.py
- Extract the pure function _aggregate_feedback_stats(); feedback_from_aider_events calls it
## aider_event_processor.py
- _process_one gains a _session_factory=None DI parameter (defaults to get_session_factory())
- A test factory can be injected without changing existing production logic
## test_ai_router_feedback.py (full rewrite)
- Remove FakeRepo/FakeSession; test the pure function _aggregate_feedback_stats directly
- Add the test_feedback_skips_missing_model edge case
- Keep the DB-failure degradation test (patches only get_session_factory, no FakeRepo)
## test_aider_event_processor.py (full rewrite)
- Remove FakeRepo/FakeSession; use real PostgreSQL (real_factory fixture)
- Redis xack + IncidentEngine stay mocked (external broker/AI service, allowed as an exception)
- Roll back after every test so the dev DB is not polluted
## setup_test_schema.sql
- Add the aider_events_payload_gin GIN index (matching the adr091 production migration)
## integration/conftest.py
- Add a comment explaining the historical confusion around the password name awoooi_prod_2026
- Fix the assert logic: check the DB name rather than the URL string, so a password containing "prod" no longer triggers a false positive
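The conftest assert fix above can be sketched as follows. This is only an illustration under assumptions: the helper names `database_name` and `is_production_database` and the example URL are hypothetical, and the real conftest.py may parse the URL differently.

```python
from urllib.parse import urlsplit


def database_name(url: str) -> str:
    """Return the database name, i.e. the URL path without its leading slash."""
    return urlsplit(url).path.lstrip("/")


def is_production_database(url: str) -> bool:
    """Check only the database name, never the full URL string."""
    return "prod" in database_name(url)


# Hypothetical dev URL whose password happens to contain "prod".
DEV_URL = "postgresql+asyncpg://tester:example_prod_pw@localhost:5432/awoooi_dev"

# A naive substring check on the whole URL would misfire on the password.
naive_check_misfires = "prod" in DEV_URL
name_check_misfires = is_production_database(DEV_URL)
```

Checking the parsed path keeps the guard independent of whatever the credentials contain.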
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 01:33:30 +08:00
Your Name
d0591c54b0
fix(security): health-check fixes, all 7 Critical/Major security issues fixed
...
CD Pipeline / build-and-deploy (push) Failing after 35s
## Critical fixes (C1-C5)
- C1: git rm --cached 03-secrets.yaml (the CHANGE_ME template is no longer tracked)
- C2: git rm --cached awoooi.db + add *.db to .gitignore (SQLite violates HARD_RULES)
- C3: sentry-tunnel SENTRY_HOST changed to a process.env fallback
- C4: config.py DATABASE_URL drops the changeme default and becomes required
- C5: run_migration.py now uses os.environ["DATABASE_URL"]
## Major fixes (M1-M4)
- M1: auto_repair /execute gains CSRF protection, with AutoRepairPanel.tsx updated to match
- M2: drift /rollback and /adopt gain CSRF protection (/internal/scan stays CSRF-free)
- M3: terminal /intent gains CSRF protection, with terminal.store.ts updated to match
- M4: live-dashboard HOST_IPS and host-grid VIP now come from env vars
## Other
- Add apps/web/.env.example (documents 6 env vars)
- K8s deployment-web gains 3 new env vars
- Integration tests: add real-DB tests for aider_event_repository + ai_router_feedback
- Fix the CSRF dependency override in test_terminal.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-22 01:27:39 +08:00
Your Name
40771cda6d
feat(ai_router): feedback_from_aider_events read-only hook (Phase 24 A8)
2026-04-20 19:40:01 +08:00
Your Name
df72da69e2
feat(worker): AiderEventProcessor — Redis stream consumer + incident + DB write
...
- Implement Task A7: background worker consuming signals:aider:events stream
- Parse AiderEventIn from Redis XREADGROUP messages
- Call IncidentEngine.process_signal for incident-worthy events
- Persist aider_events to PostgreSQL with optional incident_id FK
- XACK on success, preserve in pending list on DB failure (retry)
- ACK on parse failure (bad JSON avoids pending list jam)
- Match signal_worker.py pattern: no Active Sweeper (MVP)
- Unit tests: 4 tests covering incident creation, non-incident events, malformed payloads, engine failures
Tests: 37 passed (4 new + 33 existing regression)
2026-04-20 19:40:01 +08:00
Your Name
cd894310dc
feat(api): POST /api/v1/aider/events HMAC webhook + Redis stream push
...
- Router layer: HTTP validation + HMAC-SHA256 signature verification
- Service layer: Redis stream push (aider_event_service.push_aider_batch_to_stream)
- Follows the leWOOOgo building-block layering: Router → Service → Redis
- All 6 tests passing (signature validation, batch limits, edge cases)
2026-04-20 19:40:01 +08:00
Your Name
964427c5d4
feat(service): aider_event_service — classify + signal_data builder (uses existing debounce)
2026-04-20 19:40:01 +08:00
Your Name
803b389f6b
security(secrets): replace the real TG bot token in test fixtures with a fake value
...
run-migration / migrate (push) Failing after 20s
CD Pipeline / build-and-deploy (push) Successful in 9m10s
## Incident
The aider-watch v1 session wrote the real production TG bot token (NEMOTRON_BOT_TOKEN)
into the following tracked files as a test fixture (all already pushed to Gitea):
- apps/api/tests/test_secret_redactor.py
- docs/superpowers/plans/2026-04-19-aider-watch.md (3 occurrences)
- docs/superpowers/plans/2026-04-20-aider-watch-v2.md
This violates the L2 zero-trust rule in feedback_secrets_leak_incidents_2026-04-18.md (no secrets in source control).
## Response
- Commander's decision: do not revoke the token (risk accepted)
- Replace it with the fake value 111222333:A*35 (an obvious placeholder that still matches the format the redactor detects)
- Reduces future exposure through search engines / forks (the git history still contains the token)
## Verification
All 8 secret_redactor.py tests pass, and the telegram regex still recognizes the new fake-value format.
## P1 backlog
- Git history cleanup (git filter-repo) needs the commander's approval for a force push
- A pre-commit hook to prevent future leaks (grep for the TG token format / detect-secrets)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:23:09 +08:00
Your Name
4188df6fcc
fix(imports): standardize CI import paths on src.* (remove the fake apps.api.src.* PEP 420 dependency)
...
Type Sync Check / check-type-sync (push) Successful in 2m37s
CD Pipeline / build-and-deploy (push) Has started running
## Root cause
`apps.api.src.*` resolves through a PEP 420 namespace package only when the repository root is on sys.path
(because apps/ and apps/api/ have no __init__.py).
- CI rootdir=repo root → resolves (but is a fragile dependency)
- Local pytest rootdir=apps/api → resolution fails → the whole src.models.__init__ blows up
- CI error: `test_secret_redactor.py` could not import the module
## Fix
3 occurrences of `apps.api.src.*` in src.models.__init__ changed to `src.*`
1 occurrence of `apps.api.src.*` in src.models.incident changed to `src.*`
tests/test_aider_event_models.py import path standardized
tests/test_secret_redactor.py import path standardized
## Verification
All 138 pytest tests pass (drift + rule_engine + approval_execution + aider_event + incident + secret_redactor)
Every test uses the `from src.*` style (the existing codebase convention; pytest rootdir=apps/api provides src/ as the import root)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:13:02 +08:00
Your Name
5daae76147
feat(models): AiderEventIn + AiderBatchIn pydantic schemas
...
- Implement aider-watch v2 event schema with 7 event types
- Enforce timezone-aware timestamps via field_validator
- Batch schema supports up to 50 events per request
- Frozen + forbid extra fields (defensive engineering)
- Fix broken src.* imports in models package (incident.py, __init__.py)
Task A3 complete: 7/7 tests passing
2026-04-20 04:06:26 +08:00
Your Name
0db4534133
feat(utils): generic secret_redactor (7 patterns)
run-migration / migrate (push) Failing after 12s
CD Pipeline / build-and-deploy (push) Failing after 1m36s
2026-04-20 04:04:13 +08:00
OG T
d258a1fb87
test(ai-router): update the DIAGNOSE routing test, None → OPENCLAW_NEMO
...
CD Pipeline / build-and-deploy (push) Successful in 14m52s
test_diagnose_override_is_none → test_diagnose_override_is_openclaw_nemo
Accompanies the ai_router.py DIAGNOSE routing fix (root-cause fix for the Ollama 238s timeout)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:13:00 +08:00
OG T
f1cbf6db7d
feat(adr-081): Phase 1 sensory depth, 8D intelligence gathering + post-execution verification
...
Deliverables:
- IncidentEvidence DB model (8D senses + pre/post execution state)
- EvidenceSnapshot dataclass (build_summary → LLM context)
- SanitizationService (zero tolerance for prompt injection, 12 patterns)
- MCPToolRegistry (dynamic tool registration; suggest_tools does not hardcode alert types)
- PreDecisionInvestigator (8D parallel senses, P99 < 8s, 30s Redis cache)
- PostExecutionVerifier (10s warmup → post-state evaluation: success/degraded/failed)
- decision_manager + approval_execution wiring (guarded by a feature flag)
Gate 1 fixes: D4/D5/D7/D8 gain sanitize_dict_values; the bare "error" failure
signal is removed so the error_rate key is not misjudged; evidence_snapshot warns when rowcount is zero.
Tests: 130 passed (+111 new)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:08:38 +08:00
OG T
db9e304a14
feat(adr-080): Phase 0 guardrails established, AI autonomy flywheel kickoff
...
- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
(1456 lines, §0-§8 fully filled in: 42-cell tactics matrix, 7-Phase plan, 7 ADR summaries,
15 KPIs, 21 Feature Flags, 10 risk scenarios)
- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
(7-Phase structure + 4 north stars + 7 architect Review Gates + Phase exit conditions)
- apps/api/src/core/feature_flags.py
(AIOpsFeatureFlags: P1~P6 master switches all False + 15 fine-grained sub-flags,
is_phase_enabled() / is_sub_flag_enabled() + safe bool cast)
- apps/api/src/jobs/__init__.py + baseline_snapshot.py
(Phase 0 baseline snapshot job: MCP calls / Playbook confidence / general ratio
/ learning loop rate / auto_repair, written to aiops:baseline:latest)
- apps/api/tests/test_feature_flags.py (21 tests, all green)
- docs/HARD_RULES.md → v1.9
(new iron rule on Phase exit conditions: a Phase may not be declared complete before its exit conditions pass)
- CLAUDE.md anti-amnesia gate 1: mandatory read of MASTER §0 Session Resume Protocol
Gate 0 Pass: 21/21 tests green
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 12:44:53 +08:00
OG T
208c28ed09
feat(Phase 5 Sprint 5.2): connect the callback dispatcher to the real MCP registry
...
CD Pipeline / build-and-deploy (push) Successful in 14m38s
dispatch_action() upgrade:
- Upgraded from the Sprint 5.0 stub to real MCP calls
- internal provider: URL builder + authorization recording (does not go through MCP)
- Other providers: from src.plugins.mcp.registry import get_provider → execute
- asyncio.wait_for wraps timeout_sec (set per spec, different for each button)
Graceful degradation:
- Provider not registered → returns success=False + a 'provider_not_found' error
- MCP returned success=False → the reply includes the error message
- asyncio.TimeoutError → reply 「超時 Xs」 ("timed out after Xs") + log
New _handle_internal_action():
- build_signoz_url → https://signoz.wooo.work/services/{service}
- build_flywheel_url → https://awoooi.wooo.work/flywheel
- record_authorization → 24h same-origin silent confirmation
Test coverage (26/26):
- 3 new internal action tests (open_signoz/open_flywheel/secops_authorize)
- 1 MCP failure graceful test
- The existing 22 are kept (2 Sprint 5.0 stub tests updated to Sprint 5.2 graceful behavior)
Sprint 5.2 DOD:
✅ Dispatch path complete for the 10 query-type buttons
✅ 3 internal actions implemented
✅ Graceful failure (no crash)
✅ asyncio.wait_for timeout protection
⏳ Actual end-to-end test (requires all prod MCP providers to be registered)
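The timeout protection and graceful degradation described above might look roughly like this. It is a minimal sketch, not the dispatcher itself: `dispatch_with_timeout`, the two stand-in providers, and the result dictionary shape are all hypothetical.

```python
import asyncio


async def dispatch_with_timeout(call, timeout_sec: float) -> dict:
    """Await call() under a timeout and turn a timeout into a failure result."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_sec)
    except asyncio.TimeoutError:
        # Degrade gracefully: report the timeout instead of raising to the caller.
        return {"success": False, "error": f"timeout after {timeout_sec}s"}


async def slow_provider() -> dict:
    # Stand-in for a provider call that takes longer than the budget.
    await asyncio.sleep(1.0)
    return {"success": True, "output": "done"}


async def fast_provider() -> dict:
    # Stand-in for a provider call that returns immediately.
    return {"success": True, "output": "done"}


slow_result = asyncio.run(dispatch_with_timeout(slow_provider, 0.05))
fast_result = asyncio.run(dispatch_with_timeout(fast_provider, 0.05))
```

Passing a callable rather than a coroutine object keeps the coroutine from being created until the timeout wrapper is ready to await it.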
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 20:43:40 +08:00
OG T
2e2f5a1881
feat(Phase 5 Sprint 5.0): callback dispatcher spec + skeleton + 22 tests
...
CD Pipeline / build-and-deploy (push) Has been cancelled
The commander approved every Phase 5 sprint. Sprint 5.0 deliverables:
1. callback_action_spec.yaml (24 actions)
- 10 query-type (info, 2-part callback, no side effects): check_process, check_port,
check_log_*, check_health, check_pod_logs, describe_pod, open_signoz,
open_flywheel
- 10 write-type (nonce, 4-part, with side effects): k8s_restart/scale_up/scale_down/rollback,
host_restart_service/clear_log, docker_restart, minio_restart,
reload_nginx, renew_cert
- 4 secops (Multi-Sig CRITICAL): secops_isolate/block_ip/evict/authorize
2. callback_dispatcher.py
- Registry pattern (lru_cache): get_action_spec / list_actions_for_category
- Template variable substitution: {incident_id} / {labels.xxx} / {signals[0].xxx}
- dispatch_action() skeleton (MCP wired in from Sprint 5.2 onward)
- _format_reply: 4 formats, text/code/truncated/url
3. test_callback_dispatcher.py (all 22 tests pass)
- Registry loading correctness
- Category filtering
- Template resolution (including nested list index)
- The dispatch stub returns the correct spec hint
Next step, Sprint 5.1: connect the MCP registry + integrate the telegram callback_handler
The underlying MCP capabilities already exist (k8s 10+ tools, ssh 15 tools)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 20:34:14 +08:00
OG T
10b74affcf
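fix(GAP-A4): fix placeholder resolution in rule action templates, ending 8.3h of flywheel silence
...
CD Pipeline / build-and-deploy (push) Has been cancelled
🚨 Root-cause diagnosis (caught by the commander):
The API log shows a burst of auto_execute_blocked_unresolved_placeholder over the last hour:
- action: "kubectl rollout restart deployment HostHighCpuLoad" ← target=alertname
- action: "kubectl rollout restart deployment unknown"
- action: "kubectl scale deployment unknown --replicas=3"
Root cause: the target resolution logic in alert_rule_engine._extract_vars() was not robust enough.
When a Prometheus alert has no deployment label, it fell back to alertname or "unknown"
and produced garbage commands. The GAP-A1 injection gate correctly blocked them, but the auto-repair path then stalled,
nothing was written to KM, and the flywheel went silent.
Fix (three layers of protection):
1. New _strip_pod_suffix(): recover the Deployment base name from a K8s Pod name
- Deployment format: awoooi-api-7d6b776f78-4sgjl → awoooi-api
- StatefulSet: postgresql-0 → postgresql
- Legacy: my-job-x2m4k → my-job
2. New _is_bad_target(): detect garbage targets
- Empty string / "unknown" / "none" / "null"
- target == the alertname itself
- IP:port format, bare IP, or containing whitespace/parentheses/quotes
- Unresolved {placeholder}
3. Rewritten _extract_vars(): multi-level label lookup (most authoritative first):
deployment > app > statefulset > pod (suffix stripped) > container > service > target_resource
Every level is validated by _is_bad_target; if all fail → target="unknown"
4. Post-match double validation in match_rule():
- bad target → clear kubectl_command (degrade to LLM)
- leftover { or } → clear kubectl_command (template not fully filled)
Test coverage:
- 33 new unit tests (all four GAP-A4 scenarios covered)
- 214/214 regression tests pass
Impact:
- The path that used to produce "kubectl rollout restart deployment HostHighCpuLoad"
→ now logs `rule_kubectl_command_discarded_bad_target` and degrades to the LLM
- If the LLM can infer the real deployment from the error log, the flywheel resumes normal operation
- If the LLM cannot resolve it either, the incident goes to the TYPE-4 manual escalation path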
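The first two layers above could be approximated as follows. This is a hedged sketch: the regular expressions are assumptions chosen to reproduce the three examples in the commit message, not the patterns in alert_rule_engine.py, and the requirement that a legacy suffix contain a digit is a heuristic added here so that names such as `awoooi-redis` are left alone.

```python
import re

# Assumed patterns; the real engine may use different ones.
_DEPLOYMENT_POD_RE = re.compile(r"^(?P<base>.+)-[a-z0-9]{8,10}-[a-z0-9]{5}$")
_STATEFULSET_POD_RE = re.compile(r"^(?P<base>.+)-\d+$")
_LEGACY_POD_RE = re.compile(r"^(?P<base>.+)-(?=[a-z0-9]*\d)[a-z0-9]{5}$")
_IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$")
_SUSPICIOUS_CHARS_RE = re.compile(r"[\s()\[\]{}'\"]")


def strip_pod_suffix(name: str) -> str:
    """Map a Pod name back to its workload base name, most specific pattern first."""
    for pattern in (_DEPLOYMENT_POD_RE, _STATEFULSET_POD_RE, _LEGACY_POD_RE):
        match = pattern.match(name)
        if match:
            return match.group("base")
    return name


def is_bad_target(target: str, alertname: str) -> bool:
    """Return True for targets that should never reach a kubectl template."""
    value = (target or "").strip()
    if value.lower() in {"", "unknown", "none", "null"}:
        return True
    if value == alertname:
        return True
    if _IP_RE.match(value):
        return True
    # Whitespace, brackets, quotes, or an unresolved {placeholder}.
    return _SUSPICIOUS_CHARS_RE.search(value) is not None
```

A lookup such as the one in item 3 would then call `is_bad_target` on each candidate label in priority order and keep the first one that passes.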
2026-04-14 Claude Sonnet 4.6 (eliminating a hidden bug outside the MASTER blueprint)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 18:43:29 +08:00
OG T
aae7c12645
feat(adr-076): Task 3.3, KM extraction for SSH repairs (giving the flywheel both hands)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Motivation: after a successful SSH MCP repair (docker restart/systemctl), KM could not learn from it
because _extract_repair_steps handled only kubectl, so the SSH path was missed entirely.
approval_execution.py:
- _trigger_playbook_extraction: after a successful execution, write approval.action into
incident.outcome.learning_notes for the Playbook extractor to read
playbook_service.py:
- _parse_ssh_command(): new module-level function that parses the ssh [user@]host 'cmd' format
- _extract_repair_steps(): step 2 gains an SSH branch
ssh ... → ActionType.SSH_COMMAND + host recorded
kubectl ... → ActionType.KUBECTL (existing logic kept)
- _generate_name(): SSH repairs automatically get an [SSH] prefix
- _extract_tags(): SSH repairs automatically get the ssh + host_layer tags
test_playbook_ssh_extraction.py: 18 tests (100% passing)
Both hands of the flywheel aligned:
kubectl path: decision_chain.reasoning_steps → KM ✅ (existing)
SSH path: approval.action → learning_notes → KM ✅ (added in Task 3.3)
Tests: 794 passed, 26 skipped, 0 failed
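Parsing the `ssh [user@]host 'cmd'` format mentioned above might be sketched like this. The `SshCommand` tuple, the function body, and the `ops` user are hypothetical; ssh options (tokens starting with `-`) are deliberately not handled in this sketch.

```python
import shlex
from typing import NamedTuple, Optional


class SshCommand(NamedTuple):
    user: Optional[str]
    host: str
    command: str


def parse_ssh_command(action: str) -> Optional[SshCommand]:
    """Parse "ssh [user@]host 'cmd'"; return None for anything else."""
    try:
        tokens = shlex.split(action)
    except ValueError:
        # Unbalanced quotes: treat as unparseable.
        return None
    if len(tokens) < 3 or tokens[0] != "ssh":
        return None
    target = tokens[1]
    if target.startswith("-"):
        return None
    user, _, host = target.rpartition("@")
    if not host:
        return None
    return SshCommand(user or None, host, " ".join(tokens[2:]))
```

A caller in the extraction step could branch on the result: a non-None value marks the step as an SSH command and records `host`, otherwise the existing kubectl handling applies.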
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:19:54 +08:00
OG T
cc42aa0bdb
feat(adr-076): Task 2.2 + 2.3, rule expansion + kubectl injection protection
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Task 2.2: alert_rules.yaml adds 3 rule categories (priority 125-127)
- gitea_down: Gitea CI/CD offline → NO_ACTION (priority 125, critical)
- ssl_cert_expiring: SSL certificate expiring → NO_ACTION (priority 126, medium)
- external_site_down: MoWoooWork/Dev/Blackbox probe → NO_ACTION (priority 127, medium)
Total rules: 21 → 24
Task 2.3: kubectl injection protection in alert_rule_engine.py
- _RULE_ENGINE_DESTRUCTIVE_RE: blocks delete pvc/namespace/statefulset/deployment,
drain/cordon, --replicas=0, rm -rf, DROP TABLE, $() and backticks
- validate_kubectl_command(): public API; SSH commands and empty strings pass straight through
- match_rule() integration: validate after variable substitution; on a block, clear the command + log a warning
- test_alert_rule_engine_validation.py: 34 tests (100% passing)
Tests: 776 passed, 26 skipped, 0 failed
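A deny-list check along the lines of Task 2.3 could be sketched as below. The regular expression is an assumption assembled from the categories listed in the commit message; it is not the actual _RULE_ENGINE_DESTRUCTIVE_RE, and the function only mirrors the described behavior of validate_kubectl_command().

```python
import re

# Assumed deny-list, built from the categories named in the commit message.
DESTRUCTIVE_RE = re.compile(
    r"""
    \bdelete\s+(pvc|namespace|statefulset|deployment)\b
    | \b(drain|cordon)\b
    | --replicas=0\b
    | \brm\s+-rf\b
    | \bdrop\s+table\b
    | \$\(
    | `
    """,
    re.IGNORECASE | re.VERBOSE,
)


def validate_kubectl_command(command: str) -> bool:
    """Return True when the command may run, False when it must be discarded."""
    stripped = command.strip()
    # Empty strings and SSH commands are outside the scope of this check.
    if not stripped or stripped.startswith("ssh "):
        return True
    return DESTRUCTIVE_RE.search(stripped) is None
```

Running the check after variable substitution matters, because a harmless-looking template can become destructive once label values are filled in.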
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:10:10 +08:00
OG T
684d6cfb43
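feat(adr-076): all four Tactic B tasks complete, alert grouping + retry + automatic reports
...
CD Pipeline / build-and-deploy (push) Successful in 17m34s
Task 2: AlertGroupingService, a 5-minute Redis sliding window that prevents alert storms
- apps/api/src/services/alert_grouping_service.py (new)
- webhooks.py integration: short-circuits child alerts after fingerprint generation and before the LLM
- Threshold=3, graceful degradation, 16 tests
Task 3: retry on execution failure in approval_execution.py
- MAX_RETRY=2, RETRY_DELAY_SECONDS=30
- _is_transient_error() classifies errors as transient or permanent; permanent errors are not retried
- The timeline records retry progress and notes the retry count on success, 29 tests
Task 4: automatic reports in report_generation_service.py
- Daily inspection report: every day at 08:00 Taipei time, pushed to the Telegram SRE group
- Postmortem: triggered automatically when an Incident is resolved and its duration > 10 minutes
- run_daily_report_loop() mounted in the main.py lifespan, 30 tests
Tests: 600 → 675 passed (+75), 0 failed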
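The sliding-window idea behind Task 2 can be illustrated with an in-memory stand-in. This is not the Redis-backed AlertGroupingService: the class name is hypothetical, and the meaning of the threshold (the first three alerts in a window pass, later ones are suppressed) is an assumption made for the sketch.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # 5-minute window, as stated in the commit message
THRESHOLD = 3         # assumed meaning: alerts beyond the third are suppressed


class InMemoryAlertGrouper:
    """In-memory illustration of a per-fingerprint sliding window."""

    def __init__(self) -> None:
        self._seen = defaultdict(deque)

    def should_suppress(self, fingerprint: str, now: float) -> bool:
        window = self._seen[fingerprint]
        # Drop timestamps that have slid out of the window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        window.append(now)
        return len(window) > THRESHOLD


grouper = InMemoryAlertGrouper()
decisions = [grouper.should_suppress("fp-1", t) for t in (0, 10, 20, 30, 400)]
```

The timestamps are passed in explicitly so the behavior can be exercised without waiting for real time to elapse.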
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 14:39:14 +08:00
OG T
1a4b52ed28
fix(alert): add alertname to the fingerprint to prevent cross-alert collisions + add missing heartbeat classifications
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root causes:
1. generate_fingerprint used alert_type (many alertnames fall into "custom")
→ different alert names on the same target shared a fingerprint → the 30-minute debounce made them block each other
2. classify_alert_early missed DeadMansSwitch / NoAlertsReceived /
PrometheusNotConnectedToAlertmanager → they fell into TYPE-3 general alerts
Fix:
- alert_analyzer_service.py: the fingerprint is now namespace:deployment:alertname:target_resource;
alertname comes from labels (Alertmanager) and falls back to alert_type (other sources)
- incident_service.py: DeadMansSwitch → backup/TYPE-1;
NoAlertsReceived + PrometheusNotConnectedToAlertmanager → alertchain_health/TYPE-8M
- Add 2 tests; full suite 627 passed
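The new fingerprint layout can be sketched as a plain key builder. The signature and the choice to read every field from `labels` are assumptions; the real generate_fingerprint in alert_analyzer_service.py may take different arguments or hash the result.

```python
def generate_fingerprint(labels: dict, alert_type: str) -> str:
    """Build namespace:deployment:alertname:target_resource as a debounce key."""
    # Prefer the Alertmanager alertname label; fall back to alert_type otherwise.
    alertname = labels.get("alertname") or alert_type
    return ":".join([
        labels.get("namespace", ""),
        labels.get("deployment", ""),
        alertname,
        labels.get("target_resource", ""),
    ])


cpu = generate_fingerprint(
    {"namespace": "awoooi-prod", "deployment": "awoooi-api", "alertname": "HostHighCpuLoad"},
    "custom",
)
oom = generate_fingerprint(
    {"namespace": "awoooi-prod", "deployment": "awoooi-api", "alertname": "HostOutOfMemory"},
    "custom",
)
```

With only alert_type in the key, both alerts above would have collapsed to the same "custom" fingerprint and debounced each other.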
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:50:20 +08:00
OG T
db4d4280f5
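test(ai-router): update the DIAGNOSE routing test to reflect that NEMOTRON is currently paused
...
CD Pipeline / build-and-deploy (push) Successful in 14m28s
NEMOTRON is paused because of the confidence=0.0 problem; routing falls back to complexity-based routing (None)
To be restored once _parse_confidence() is fixed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>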
2026-04-12 22:22:52 +08:00
OG T
b3d4b9c8a9
test(telegram): fix the test_telegram_message_templates assertion
...
CD Pipeline / build-and-deploy (push) Successful in 14m24s
CRITICAL → 嚴重 (ADR-075 Chinese risk levels)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 21:20:16 +08:00
OG T
01e6d75ee7
test(telegram): fix test assertions to match the ADR-075 Chinese risk levels
...
CD Pipeline / build-and-deploy (push) Failing after 1m55s
HIGH→高風險, MEDIUM→中風險 (test_sentry / test_github webhook)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 21:08:48 +08:00
OG T
1cb654cf59
fix(adr-075): CR P0/P1 fixes, TYPE_8M enum + dead-code cleanup + docstring update
...
CD Pipeline / build-and-deploy (push) Has been cancelled
P0-2: NotificationType gains TYPE_8M = "TYPE-8M"
classify_notification returns TYPE-8M early
decision_manager now compares against the NotificationType.TYPE_8M enum (string literal removed)
P1-1: remove the unreachable alertchain_health/flywheel_health entries from _CATEGORY_BUTTONS
P1-4: test_classify_alert_early.py docstring updated to 13 rules / 10 categories
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:44:12 +08:00
OG T
2cef2098d3
feat(adr-075): fix 4 breakpoints in the Telegram dynamic buttons + add 7 alert categories
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Breakpoint A: decision_manager extracts alert_category/notification_type and passes them to send_approval_card
Breakpoint B: send_approval_card gains the new parameters and forwards them to _build_inline_keyboard
Breakpoint C: interactive notifications (TYPE-3/4/4D/8M) must not be sent to the SRE group, to prevent nonce leakage
Breakpoint D: _CATEGORY_BUTTONS k8s_workload → kubernetes + 6 new button groups
classify_alert_early adds: alertchain_health, flywheel_health, storage,
devops_tool, external_site, ssl_cert, host_resource (split out of infrastructure)
Test: 52 classify + 664 total passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:35:56 +08:00
OG T
1074936e54
fix(classify): restore the alert-card format for backup/heartbeat alerts with severity=warning/critical
...
CD Pipeline / build-and-deploy (push) Failing after 2m38s
Root cause: the backup rule in classify_alert_early() had no severity condition, so
VeleroBackupFailed / HostBackupFailed (warning/critical) were classified as TYPE-1
(informational only, no buttons) and lost the alert-card format.
Fix:
- backup/heartbeat keywords match TYPE-1 only when severity=info/none
- backup alerts with severity=warning/critical follow the correct prefix rules
(Velero→kubernetes TYPE-3, HostBackup→infrastructure TYPE-3)
- Watchdog (severity=none) is matched first by the severity rule and stays TYPE-1
- Strengthened tests: 25 cases, including VeleroBackupFailed critical → kubernetes TYPE-3
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:24:00 +08:00
OG T
0d239838b4
fix(cr): Code Review P2, test coverage + CronJob script refactor
...
CD Pipeline / build-and-deploy (push) Has been cancelled
P2-1: extract the CronJob inline Python into scripts/cron_km_vectorize.py
The Dockerfile adds COPY scripts/, and the CronJob YAML now uses the script path
P2-2: add test_classify_alert_early.py, 23 tests covering 7 classification rules,
including edge cases: VeleroBackupFailed (backup takes precedence over k8s) and priority-order verification
595 unit tests passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:14:44 +08:00
OG T
a67a27f780
fix(test): mark test_model_regression with @pytest.mark.integration (requires the Ollama service)
...
Consistent with global_repair_cooldown / anomaly_counter:
Ollama tests are excluded by default; run them with pytest -m integration when the real service is available
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:32:42 +08:00
OG T
8be87b0f32
fix(review): chief architect Code Review, c439277 Tier 3 red-zone fixes
...
CD Pipeline / build-and-deploy (push) Failing after 8m39s
Critical:
- C1: fix the Python ternary precedence bug in the container variable of decision_manager _collect_mcp_context
Before: `A or B or C[0] if list else ""` (the ternary governs the whole expression)
After: `A or B or (C[0] if list else "")` (explicit parentheses)
- C2: wrap every MCP call in asyncio.wait_for timeout=5s so it cannot block the main decision path
Also add an unknown-host warning log (C4)
- C3+M1: complete _DESTRUCTIVE_PATTERNS and move it to a module-level constant
Added: delete pods (plural)/kubectl drain/kubectl cordon/kubectl rollout undo/
docker rm/docker stop/docker kill/rm -rf/"replicas": 0 (JSON patch)
Important:
- I1: the webhooks.py IP exclusion now uses is_internal_ip(), covering all of RFC-1918 (10.x/172.16-31.x/192.168.x)
- I4: add test_destructive_patterns.py, 25 tests, all passing
Covers: constant exists, blocking, false-positive regression, critical always blocked
🔴 Tier 3 red zone: pushed after passing the chief architect's Code Review
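The C1 precedence bug is easy to reproduce in isolation. In the sketch below the function and parameter names are hypothetical stand-ins for `A`, `B`, and `C`; only the shape of the expression comes from the commit message.

```python
def pick_container_buggy(a: str, b: str, containers: list) -> str:
    # Parsed as: (a or b or containers[0]) if containers else ""
    return a or b or containers[0] if containers else ""


def pick_container_fixed(a: str, b: str, containers: list) -> str:
    # The conditional now applies only to the last fallback.
    return a or b or (containers[0] if containers else "")


# With an empty list the buggy form discards a perfectly good first choice.
buggy_value = pick_container_buggy("api", "", [])
fixed_value = pick_container_fixed("api", "", [])
```

Because a conditional expression binds more loosely than `or`, the unparenthesized form lets the emptiness of the list decide the whole result.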
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 22:05:52 +08:00
OG T
d77b2add73
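fix(review): chief architect Code Review fixes, I1 get_incident_type logic fix + test completion
...
CD Pipeline / build-and-deploy (push) Failing after 8m13s
The Code Review found 2 Critical + 2 Important issues:
Critical:
- rule.id means "rule identifier" and lives in a different namespace from incident_type, so the two must not be mixed
Remove the rule_id fallback path; when a YAML match has no incident_type, fall through to the static dict
- The critical path of get_incident_type() had no test coverage
Add test_get_incident_type.py: 11 tests across 4 categories (static/YAML priority/YAML error/custom), all passing
Important:
- Move the deferred ALERTNAME_TO_TYPE import to module top level (no circular-import risk)
- The alert_types.py TODO was stale → updated to the correct description after the I1 integration
Technical debt noted: the NetworkPolicy ArgoCD egress ClusterIP 10.43.16.201/32 must be updated after ArgoCD is reinstalled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>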
2026-04-11 21:33:19 +08:00
OG T
485b8cb003
fix(ci): add ssl=disable to the B5 integration tests, since asyncpg tries SSL by default and the container refuses it
...
CD Pipeline / build-and-deploy (push) Failing after 1m55s
Error: ConnectionRefusedError Connect call failed ('127.0.0.1', 15432)
Root cause: asyncpg goes through _create_ssl_connection, and the temporary postgres container has no SSL
Fix: append ?ssl=disable to both TEST_DATABASE_URL and the conftest default URL
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:40:40 +08:00
OG T
49bfbd573c
feat(test): B5 integration test framework, real DB, 5/5 passing
...
CD Pipeline / build-and-deploy (push) Failing after 2m34s
Added:
- docker-compose.test.yml: temporary pgvector PostgreSQL for CI (port 15432)
- tests/factories.py: test data factories for Incident/Approval/Knowledge/RAG
- tests/integration/test_b5_core_flows.py: 5 E2E integration tests (5/5 PASSED 1.03s)
- tests/integration/setup_test_schema.sql: CI schema initialization SQL
- cd.yaml: add the Integration Tests B5 step
- scripts/sync_dev_db.py: dev DB sync tool
Changed:
- .env.test: DATABASE_URL points to awoooi_dev (local setting, gitignored and not committed)
No-mock iron rule: all DB tests use real PostgreSQL, no SQLite/MagicMock
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:22:57 +08:00
OG T
e672635edf
fix(test): update TestHistoryMessageFormat for the Phase 27 two-tier strategy
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 01:12:00 +08:00
OG T
2bc2a2f174
test(integration): drift API + DB persistence integration tests
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Covers GET /drift/reports and POST /drift/internal/scan
Verifies that the DB has new data after a scan (extends the B5 integration test framework)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:36:17 +08:00
OG T
1e1f24c561
fix(test): update the ComplexityScorer model name, llama3.2:3b → gemma3:4b
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-09 09:01:59 +08:00
OG T
65e1edb0ad
feat(web): OpenClaw-style lobster SVG + three-color status light + test fix
...
CD Pipeline / build-and-deploy (push) Failing after 1m39s
Frontend:
- All-new OpenClawLobster SVG (based on dashboardicons.com/icons/openclaw)
rounded body + big eyes + claws + antennae + smile + small legs
- Three color variants: red (abnormal/default) / green (healthy) / yellow (warning)
- LobsterLoading switched to the new SVG
Test fix:
- test_nemotron_failure_still_returns_proposal: func_body slice 5000→10000
Reason: the function is longer than 5000 characters, so rfind could not find the final return
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 08:55:21 +08:00
OG T
b380b6a34c
fix(ci): fix the nemotron test's function-body truncation, 5000→10000 characters
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:09:19 +08:00
OG T
170ce2f11d
fix(ci): fix tests and the Sprint 5.2 deployment scripts
...
CD Pipeline / build-and-deploy (push) Failing after 1m38s
tests/test_auto_repair_service.py:
- Update 3 tests to match the commander's 2026-04-07 directive removing the thresholds
- An APPROVED Playbook passes directly (low similarity/low quality/high risk all pass)
tests/test_phase22_nemotron_collab.py:
- Update the log key: nemotron_collaboration_failed → exhausted
ops/monitoring/docker-compose.exporters.yaml:
- Fix the postgres DSN: awoooi:awoooi_prod_2026@localhost:5432/awoooi_prod
Scripts added in Sprint 5.2:
- scripts/sprint51_e2e_validation.py: L7 E2E acceptance script (T1-T5)
- scripts/ops/deploy-docker-health-monitor.sh: Plan A one-click deployment script
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:17:48 +08:00
OG T
b20a619a3d
fix(ci): CD fixes, shared-types type sync + test cold-start conflict
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Successful in 1m2s
1. pnpm shared-types generate: sync the Pydantic models added in Sprint 4
2. test_evaluate_not_high_quality fix: add a MEDIUM risk step to avoid
accidentally taking the cold-start path (Redis not initialized → COLD_START_DAILY_LIMIT)
11/11 auto_repair tests pass
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 13:09:17 +08:00
OG T
2fe8062fb8
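refactor(api): Re-Review S1/S2/S3 improvements, deduplication + defensive validation + test isolation
...
S1: extract the shared _execute_and_observe() method
- Removes 3 duplicated copies of the execute+audit+langfuse logic in repair_by_uri
- Unifies the AuditLog + Langfuse trace write path
S2: defensive SSH username validation
- Add validate_ssh_user() + the _SSH_USER_RE regex
- Validate the user parameter at the entry of _ssh_execute()
- Prevents unexpected behavior from user@host concatenation
- Add 8 username validation tests
S3: singleton test reset
- Add the _reset_for_test() classmethod
- Avoids state leaking across tests
- Add 2 singleton reset tests
Tests: 55/55 all passing (45 existing + 10 new)
Chief architect Re-Review: 91/100 ✅ passed, all 3 Suggestions implemented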
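The S2 username check could take a shape like the following. The pattern is an assumption modelled on common POSIX username conventions; the actual _SSH_USER_RE is not shown in the commit message and may be stricter or looser.

```python
import re

# Assumed pattern: starts with a lowercase letter or underscore, at most 32 characters.
SSH_USER_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def validate_ssh_user(user: str) -> bool:
    """Accept only names that are safe to splice into a user@host target."""
    # fullmatch rejects trailing newlines and any extra characters.
    return SSH_USER_RE.fullmatch(user) is not None
```

Rejecting a leading `-` is the important part: it keeps a crafted username from being interpreted by ssh as an option once it is concatenated with the host.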
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:17:40 +08:00
OG T
f8d4772abf
fix(api): Sprint 3 P0-1/P0-2/P0-3/P0-4 Critical Security Fixes
...
P0-1: Complete shell metacharacter regex detection
- Enhanced _SHELL_METACHAR_RE to detect: >, <, \n, ${}, $()
- Prevents all shell injection vectors (redirects, variable expansion, newlines)
- Added 5 new validation tests
P0-2: Add shlex.quote() protection for ansible playbook path
- Wraps playbook_path in shlex.quote() before SSH command construction
- Prevents shell injection if path contains special characters
- Applied in _execute_ansible() method
P0-3: Add SSH target host whitelist validation
- Introduces validate_ssh_target_host() function
- Only allows SSH to: 192.168.0.110, 192.168.0.188
- Prevents unauthorized SSH target exploitation
- Added 5 new whitelist validation tests
P0-4: Convert HostRepairAgent to singleton pattern
- Implements __new__() singleton with shared _in_process_locks dict
- Ensures in-process locks persist across multiple auto_repair_service calls
- Previously created new instance per call, making locks ineffective
- Added singleton persistence test
Test Results: 45/45 passing (34 existing + 11 new P0 tests)
All security validations verified via comprehensive unit test coverage.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:09:45 +08:00
OG T
a4e11bfa92
feat(api): AuditLog + Langfuse Trace for SSH_COMMAND (Sprint 3 T5)
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:38:59 +08:00
OG T
4561f141bb
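feat(api): Redis idempotency lock to prevent duplicate repairs (Sprint 3 T4)
...
Two-layer lock design: in-process asyncio.Lock (always effective) + Redis distributed lock (cross-Pod, best-effort)
A second repair call for the same URI immediately returns an "already running" error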
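The in-process layer of that design might be sketched as below; the Redis layer is omitted. The function name, the result dictionaries, and the example URI are hypothetical.

```python
import asyncio

# One lock per repair URI, shared by every call in this process.
_locks: dict[str, asyncio.Lock] = {}


async def repair_once(uri: str, work) -> dict:
    """Run work() unless a repair for the same URI is already in flight."""
    lock = _locks.setdefault(uri, asyncio.Lock())
    if lock.locked():
        # Fail fast instead of queueing behind the running repair.
        return {"success": False, "error": "already running"}
    async with lock:
        await work()
        return {"success": True}


async def _demo() -> list:
    async def slow_fix() -> None:
        await asyncio.sleep(0.05)

    # Hypothetical URI; both calls target the same resource concurrently.
    uri = "ssh://192.168.0.110/nginx"
    return await asyncio.gather(repair_once(uri, slow_fix), repair_once(uri, slow_fix))


results = asyncio.run(_demo())
```

There is no await between the `locked()` check and acquiring the free lock, so two coroutines on the same event loop cannot both pass the check.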
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:26:53 +08:00
OG T
1a654aa37d
feat(api): HostRepairAgent with three execution paths + known_hosts + Ansible whitelist (Sprint 3 T3)
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:22:54 +08:00
OG T
5e8b2a6894
feat(api): URI scheme parser + shell injection protection (Sprint 3 T1)
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:18:21 +08:00
OG T
8d496e84e2
fix(test): update the action_parsing tests, since the default namespace without -n is now awoooi-prod
...
CD Pipeline / build-and-deploy (push) Has been cancelled
action_planner.py default_namespace is already awoooi-prod, so the expected test values are updated to match.
kubectl commands that explicitly pass -n default are unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:49:24 +08:00
OG T
5499169996
feat(auto-repair): close the auto-repair loop (ADR-058)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 53s
Problem: the alert pipeline never called auto_repair_service, so the mechanism was a complete dead end
Fix:
1. webhooks.py: alertmanager_webhook triggers _try_auto_repair_background after creating the Incident
2. playbook.py: lower the is_high_quality thresholds (cold-start period)
- success_count: 10 → 3
- success_rate: 95% → 80%
3. tests: test_evaluate_not_high_quality updated for the new thresholds
Flow: Alertmanager → API → Incident → evaluate → P2 or lower + high-quality Playbook → automatic execution → Telegram notification
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:08:08 +08:00
OG T
9629367bc2
fix(webhook): fix the Gitea signature format, plain hex with no sha256= prefix
...
CD Pipeline / build-and-deploy (push) Successful in 13m12s
Gitea sends X-Gitea-Signature as plain hex (unlike GitHub's X-Hub-Signature-256)
- router: accepts both formats (backward compatible)
- tests: generate_signature now produces plain hex (matching Gitea's actual behavior)
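Accepting both header formats can be sketched as follows. The function name, secret, and payload are hypothetical; the only facts taken from the commit message are that Gitea sends plain hex and that the router also tolerates a `sha256=` prefix.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Verify an HMAC-SHA256 signature given as plain hex or as sha256=<hex>."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = header_value.strip()
    if received.startswith("sha256="):
        received = received[len("sha256="):]
    # Compare as bytes in constant time; bytes also tolerate non-ASCII input.
    return hmac.compare_digest(expected.encode(), received.lower().encode())


# Hypothetical secret and payload for illustration only.
SECRET = b"example-webhook-secret"
BODY = b'{"ref": "refs/heads/main"}'
PLAIN_HEX = hmac.new(SECRET, BODY, hashlib.sha256).hexdigest()
```

Stripping the optional prefix before the comparison lets one code path serve both senders without weakening the constant-time check.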
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:40:40 +08:00