Commit Graph

235 Commits

Author SHA1 Message Date
Your Name
ed2a4838f2 fix(auto): use action parser for repair gates
Some checks failed
CD Pipeline / tests (push) Failing after 1m2s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 24s
2026-04-30 14:06:09 +08:00
Your Name
4723499955 fix(cd): install playwright system deps for smoke
All checks were successful
CD Pipeline / tests (push) Successful in 1m34s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Successful in 6m58s
CD Pipeline / post-deploy-checks (push) Successful in 3m7s
2026-04-30 11:02:12 +08:00
Your Name
e27b462bef fix(ops): keep disabled gitea runner stopped
All checks were successful
Code Review / ai-code-review (push) Successful in 27s
2026-04-30 10:59:46 +08:00
Your Name
0f7e9d3467 fix(cd): run docker builds on host runner
All checks were successful
CD Pipeline / tests (push) Successful in 1m33s
Code Review / ai-code-review (push) Successful in 25s
CD Pipeline / build-and-deploy (push) Successful in 9m20s
CD Pipeline / post-deploy-checks (push) Successful in 1m33s
2026-04-30 10:43:33 +08:00
Your Name
7cc10b2599 fix(cd): serialize gitea docker builds
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 40s
Code Review / ai-code-review (push) Successful in 24s
2026-04-30 10:11:50 +08:00
Your Name
e91db52858 docs(logbook): record 639bb64 prod deployment [skip ci]
2026-04-30 09:45:48 +08:00
Your Name
639bb64788 feat(flywheel): surface ai automation and code review
Some checks failed
Code Review / ai-code-review (push) Successful in 31s
CD Pipeline / build-and-deploy (push) Failing after 5m23s
2026-04-30 00:09:25 +08:00
Your Name
4a57c2d04f feat(flywheel): expose incident processing timeline
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m56s
2026-04-29 23:38:30 +08:00
Your Name
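f5f41543c9 docs: ADR-105 overturns A2 + LOGBOOK 2026-04-29 LLM flywheel revival
ADR-105 fully documents the decision to overturn the A2 iron rule:
- Context: A2 historical background + how the factual basis changed 2 months later (GPU + qwen2.5:7b)
- Decision: 4 changes (IntentType.DIAGNOSE override / chain / openclaw.py task_type / 6 regression tests)
- Consequences: positive (flywheel revived) + negative (Ollama single point of failure) + known debt (follow-ups in ADR-106-109)
- Validation: all 1635 tests green before deployment; 5 verification metrics after deployment
- Rollback: env switch / git revert

LOGBOOK adds a 2026-04-29 entry:
- True root cause: all 4 providers dead + the A2 iron rule excluded Ollama
- CD chain of failures: 5 commits all failed (setup_test_schema.sql missing columns)
- Already landed (independent of CD): 17 Prometheus rules + Gemini sanitize
- Memory index updated in sync (points to project_revert_a2_ollama_primary.md)

Note: docs/ is not in the cd.yaml paths trigger, so this commit does not affect CD.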

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:59:53 +08:00
Your Name
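6eb33594c2 docs(logbook): T0 12-Agent full-panorama verification record
Following the previous session's wave2 (commit 143c15f0) + DB cleanup + Gitea HMAC + ArgoCD/Sentry MCP,
four experts were dispatched to verify in parallel (critic / db-expert / debugger / tool-expert).

Details: B1/B2 ghost buttons + KM swallowing exceptions early + M1-M4 moderate issues + G1-G3 environment governance gaps.
This commit mainly fills in the LOGBOOK index; for the P0/P1 fixes, see the previous 2 commits.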

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
Your Name
cc547736ab feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER fixes
Picks up code that multiple Wave 6/7/8 engineers finished before hitting agent quotas, and commits it to resolve
a latent import error on production HEAD (decision_fusion was already imported by decision_manager, but the file was untracked).

New (backend core):
- decision_fusion.py (562 lines): P2.1 Method III (fusion of three LLMs: OpenClaw + Hermes + Elephant)
- aiops_timeline.py + aiops_timeline_service.py: critic B4 fix
  /api/v1/aiops/timeline endpoint, with DB access moved into the service layer to follow leWOOOgo modularization
- migrations/p2_decision_fusion_columns.sql + rollback: fusion columns on approval_records

Modified (backend integration):
- decision_manager.py: repairs three broken fusion links (critic B1+B2+B3):
  · B1: write _evidence_snapshot_ref into token.proposal_data
  · B2: compute complexity_score before fusion and write it to the token
  · B3: write the fusion composite to token.proposal_data["decision_fusion"]
- auto_approve.py: fusion + consensus awareness (critic B3+B5):
  · composite > 0.7 → auto_execute_eligible bypasses min_confidence
  · source=consensus_engine + score>=0.6 → trusted rule path
- consensus_engine.py: db-fix, _save_consensus reuses agent_sessions
- governance_agent.py: db-fix, _alert writes to ai_governance_events in PG
- approval_db.py: 3 fusion columns + 2 partial indexes + CheckConstraint
- db/models.py: schema aligned with the migration
- core/config.py: vuln #1 fix: field_validator on OLLAMA_URL/_FALLBACK_URL
  rejects public IPs + external domains, allowing only a private-network/loopback/K8s SVC allowlist
- core/feature_flags.py: P2 fusion + consensus flags
- main.py: governance_agent started in lifespan
- failover_alerter.py: Wave8-X2: in-memory dedup fallback (no fail-open after Redis refuses)
- ollama_*.py: metrics integration + recovery improvements
- auto_repair_service.py: verifier wired in

New (tests, 2438 lines):
- test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py
- test_p2_db_fixes.py / test_wave8_fusion_fixes.py
- test_config_url_validation.py (vuln #1, 12 tests)
- test_failover_alerter.py + additional tests for Wave8-X2 in-memory dedup

Acceptance: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus +
                      governance + p2_db_fixes + failover_alerter)

Conflict resolution:
- 3 files (config.py + auto_approve.py + decision_manager.py) conflicted on git stash pop;
  kept stashed (the engineers' final version) and restored the "公網 IP" ("public IP") wording in the ValueError to match the test

Note: this commit resolves the latent import error on production HEAD
Still unfixed: vuln #4 prompt injection / debugger B14 quota fail-closed
       / B25-B26 drain_pending_tasks / B8 governance fail alert

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com>
2026-04-27 08:11:40 +08:00
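The vuln #1 fix in commit cc547736ab (rejecting public IPs and external domains for OLLAMA_URL) could look roughly like the sketch below. This is an illustration under assumptions, not the project's code: `validate_internal_url` and `ALLOWED_HOST_SUFFIXES` are hypothetical names, the real allowlist is not shown in the log, and the real error message contains the Chinese wording the test expects.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of in-cluster hostname suffixes (assumption).
ALLOWED_HOST_SUFFIXES = (".svc", ".svc.cluster.local")


def validate_internal_url(url: str) -> str:
    """Return url unchanged if its host is a private/loopback IP or an
    allowlisted K8s service name; otherwise raise ValueError."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: accept localhost and allowlisted names only.
        if host == "localhost" or host.endswith(ALLOWED_HOST_SUFFIXES):
            return url
        raise ValueError(f"external domain not allowed: {host}")
    if ip.is_private or ip.is_loopback:
        return url
    raise ValueError(f"public IP not allowed: {host}")
```

In a pydantic settings class, a function like this would typically be called from a field validator on both URL fields so that a bad value fails at startup rather than at first request.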
Your Name
7cd53c0228 fix(monitoring): switch memory alerts to working_set to stop false alarms from page cache
- alerts-unified.yml:
  - SentryClickHouseMemoryPressure: usage_bytes → working_set_bytes, 0.8 → 0.85
  - GiteaMemoryPressure: same fix (same root cause: page cache inflating usage)
- ops/monitoring/tests/clickhouse_memory_test.yml: promtool 4 cases
- 04-awoooi-devops-commander.md v2.8: Prometheus metric selection guidelines + Gitea HMAC webhook guidelines
- LOGBOOK: records the five parallel T0 tasks (A buttons / B ClickHouse / C Gitea webhook / D ElephantAlpha / F Code review)

Evidence: 2026-04-23 23:13 sentry-clickhouse usage_bytes=88.5% vs working_set=7.8%
Root cause: container_memory_usage_bytes includes the OS page cache, which the OOM killer does not treat as pressure
Fix: use working_set_bytes, the metric K8s/cadvisor recognize (RSS + active cache), with threshold 0.85

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:16:12 +08:00
Your Name
689839cd83 docs(logbook): record 2026-04-25 four automation-flywheel fixes + Hermes + qwen3
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 09:49:50 +08:00
Your Name
d467cac709 fix(hermes): call the anthropic Python SDK directly; drop claude-agent-sdk, which needs the claude CLI
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: claude-agent-sdk needs to spawn the claude CLI; the prod pod has no CLI, so the SDK returned empty.
Fix: call the API directly via anthropic.AsyncAnthropic().messages.create().
model: claude-haiku-4-5-20251001 (fast and low-cost, suitable for Telegram QA)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 03:08:51 +08:00
Your Name
86ee013cdf feat(hermes-complete): three Hermes NL enhancements + ConsensusEngine + ADR wrap-up
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m32s
## Hermes NL enhancements (nl_gateway.py)
- T1 hermes_dispatch_log DB write (asyncio.create_task, non-blocking)
- T2 Redis rate limit: 20 req/min per chat_id, fail-open
- T3 Multi-turn session: hermes:session:{chat_id}:{user_id} TTL=300s, last 3 turns

## ConsensusEngine (ADR-095 declarative design)
- consensus_engine.py: CONSENSUS_WEIGHTS class attribute
  security=0.4 locked; 9 Claude Code agents share 0.6
- config.py: ENABLE_12AGENT_CONSENSUS=False feature flag

## ADR status
- ADR-093/094/095: Proposed → 🟡 approved, implementation in progress
- each ADR gets a v1.1 changelog

## K8s ConfigMap
- prod 04-configmap.yaml: add 3 feature flags (all false)
- dev 02-configmap.yaml: added in sync

## LOGBOOK
- records WS0–WS6 + enhancements complete, plus a feature flag enablement guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:22:40 +08:00
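The T2 item in commit 86ee013cdf (20 requests per minute per chat_id, fail-open) can be sketched as a fixed-window counter. The sketch below is only an illustration: `InMemoryCounter` stands in for Redis, `allow_request` and the key format are hypothetical, and the log does not say whether the real limiter uses a fixed or sliding window.

```python
import time
from collections import defaultdict
from typing import Optional

LIMIT_PER_MINUTE = 20  # value quoted in the commit message


class InMemoryCounter:
    """Stand-in for a Redis INCR on a per-window key (assumption)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def incr(self, key: str) -> int:
        self._counts[key] += 1
        return self._counts[key]


def allow_request(counter, chat_id: int, now: Optional[float] = None) -> bool:
    """Fixed-window limiter: True if the request may proceed."""
    now = time.time() if now is None else now
    window = int(now // 60)  # one bucket per minute
    key = f"ratelimit:{chat_id}:{window}"  # hypothetical key format
    try:
        count = counter.incr(key)
    except Exception:
        # Fail-open: a broken backend must not block the chat.
        return True
    return count <= LIMIT_PER_MINUTE
```

Fail-open trades strictness for availability: if the counter backend is down, every request is allowed rather than every request being rejected.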
Your Name
55f111e0e3 fix(aiops): correct host alert fallback and resolved stamp
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m54s
2026-04-25 00:14:07 +08:00
Your Name
0d81b28b1b fix(aiops): bound phase2 timeout and repair incident links
All checks were successful
E2E Health Check / e2e-health (push) Successful in 52s
CD Pipeline / build-and-deploy (push) Successful in 9m24s
2026-04-24 23:53:56 +08:00
Your Name
4ea52d8e5d docs(logbook): ADR-092 P2.4+P2.6 completion record
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:58:19 +08:00
Your Name
e75e4678a9 feat(p2.4): Telegram interim-state push: "analyzing" placeholder card, auto-deleted on completion
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
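P2.4 implementation 2026-04-24 ogt + Claude Sonnet 4.6

Problem: LLM analysis takes 10-30s, during which Telegram shows no response, so users cannot tell the system is working

Fix:
- telegram_gateway.py: add send_analyzing_placeholder(), which sends the "AI 正在分析中..." ("AI is analyzing...") placeholder card
- telegram_gateway.py: add delete_message(), which deletes the placeholder card
- webhooks.py: send the placeholder card within 3s before LLM analysis (a timeout does not block the main flow)
- webhooks.py: when _push_to_telegram_background receives placeholder_message_id, it deletes the placeholder card after the full card is sent
- webhooks.py: import asyncio (previously missing)

Effect: users see the "analyzing..." message within <3s of the alert arriving; it is cleared automatically once the full card appears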

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:56:26 +08:00
Your Name
04ff22563e fix(aiops-p1): all 5 breakpoints in the Playbook learning loop fixed + DB Migration (ADR-092 B4)
Some checks failed
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s
[P0.4 patch] pre_decision_investigator Prometheus query field missing
- _build_tool_params() adds the "query" field (required parameter of the prometheus_query tool)
- add _build_prometheus_query(): generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- after the fix, the D3_METRICS sensory dimension actually gets data (previously 100% returned missing_query_parameter)

[P1 Playbook learning loop: B1-B5 all fixed]
- B2 db/models.py: ApprovalRecord adds a matched_playbook_id column + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent adds an incident_id column (for MCP auditing) + index
- B3 approval_db.py: record→ApprovalRequest restores incident_id + matched_playbook_id
- B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync)
- B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id so the DB can store the value

[P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s)
- Root cause: asyncio.create_task gets killed on Pod recycle, so KM writes were silently lost
- Fix: await asyncio.wait_for(..., timeout=30.0) + TimeoutError log

[Migration file] adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- Run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql

[Incidental Agent changes]
- decision_manager: Phase 2 YAML NO_ACTION priority gate (host-level/external services skip Agent Debate)
- alert_rules.yaml: new rules for Sentry/ClickHouse + HostDiskUsageHigh/Critical
- solver_agent: action_title semantic-synthesis fallback (replaces silent drop)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:41:35 +08:00
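The P1.5 change in commit 04ff22563e replaces a fire-and-forget `asyncio.create_task` with a bounded `await asyncio.wait_for(..., timeout=30.0)`. A minimal sketch of that pattern follows; `write_km_entry` and `write_km_bounded` are hypothetical stand-ins, since the real KM write function is not shown in the log.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def write_km_entry(delay: float) -> str:
    """Hypothetical stand-in for the KM write; sleeps to simulate I/O."""
    await asyncio.sleep(delay)
    return "written"


async def write_km_bounded(delay: float, timeout: float = 30.0) -> bool:
    """Await the write with an upper bound instead of fire-and-forget."""
    try:
        await asyncio.wait_for(write_km_entry(delay), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The failure is now visible in logs instead of being lost silently.
        logger.error("KM write timed out after %.1fs", timeout)
        return False
```

Awaiting keeps the write inside the request's lifetime, so a pod shutdown that waits for in-flight work also waits for the write; the timeout caps how long a slow backend can hold the caller.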
Your Name
45dbe07188 fix(flywheel): six automation-flywheel capability fixes (ADR-092 B3)
Some checks failed
run-migration / migrate (push) Failing after 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s
Type Sync Check / check-type-sync (push) Successful in 2m54s
CD Pipeline / build-and-deploy (push) Has been cancelled
Ansible Lint / lint (push) Has been cancelled
[Root-cause chain fix]
MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context
→ LLM timeout → description="待分析" ("pending analysis") → ADR-091 gate intercepts → tg_sent not set
→ W-2 Watchdog falsely reports "silent failure"

[Six fixes]
1. Three MCP Provider bug fixes
   - ssh_provider: asyncssh.run() → conn.run()
   - prometheus_provider: KeyError 'query' → tolerant .get()
   - k8s_provider: empty pod_name → return an error dict early

2. Agent Debate / decision quality
   - decision_manager: timeout-degradation text changed to an explicit description (bypasses the ADR-091 gate)
   - intent_classifier: on LLM timeout, degrade to keyword classification (not None)

3. Watchdog false-positive fix (ADR-092 B3)
   - W-2: tg_sent Redis TTL → telegram_message_id IS NULL (DB ground truth)
   - W-5 new: suggested_action IN empty/待分析/NO_ACTION + tg_id IS NULL
   - approval_timeout_resolver: 60min → 15min, batch 50 → 200

4. Config Drift automation
   - drift_adopt_service: auto_adopt_if_safe() six-condition safety gate
   - drift.py: the background task tries auto-adoption first, then sends the manual Telegram card

5. Playbook flywheel stability
   - playbook_seed_service: fix idempotency (deprecated is not treated as missing)
   - playbook_evolver: load only DRAFT+APPROVED (not all 294 records)

6. Observability
   - alert_rule_engine: auto_rule structured logs + Redis counters (pipeline)
   - auto_approve: Redis counter for reject reasons
   - heartbeat_report_service: new "⚙️ Automation stats (today)" section

[To be run manually]
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:55:50 +08:00
Your Name
e2742ce9f3 docs: record BUTTON_DATA_INVALID root fix + Gitea Code Review fix
LOGBOOK + ADR-092 Appendix C: 2026-04-21 fix record

E2E verification: telegram_approval_card_sent message_id=25045 (SignOzDown) ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 21:59:00 +08:00
Your Name
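994817a23a docs: ADR-092 Appendices A+B + LOGBOOK + MASTER §8 record the four fixes and the C1-C4 end-to-end wiring
- ADR-092: Appendix A (B1-B4 four fixes, root cause + commit) + Appendix B (C1-C4 breakpoint fix table + architecture iron rules)
- LOGBOOK: new 2026-04-20 evening C1-C4 section (breakpoint list + commits + acceptance steps)
- MASTER §8: append C1-C4 changelog (§3/§1.1 alignment + post-fix behavior notes)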

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 20:24:41 +08:00
Your Name
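54d60d04f5 feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and categorization + AI recommendation + target tracing
Commander's decisions on three questions: do all; the AI recommendation's 0.85 threshold is display-only, not automatic; query aol first, then fix

## RCA: source of the awoooi-service failure
- /api/v1/aiops/kpi shows 1 playbook_executed record with actor=approval_execution status=failed in the past 24h
- grep codebase: no code hard-codes awoooi-service (only historical comments)
- Most likely source: alert_rule_engine._extract_vars takes labels.service as the Deployment name
- cf5050c/4f2e122 (2026-04-18) already fixed two NEMOTRON hallucination paths; this commit fixes the third path

## Fixes
### P0.3a alert_rule_engine._extract_vars
- labels.service downgraded: a trailing -service suffix is stripped first and the rest is treated as the base name
- match_rule return value adds a target_source field for tracing
- if awoooi-service recurs, the source is directly visible (label.service(stripped), etc.)

### P0.3c approval_execution._log_aol_started.input
- add parsed_target/operation/namespace fields
- future aol queries for failed records show the target directly, with no guesswork

### P0.1 telegram_gateway._send_drift_diff_detail
- pagination (10 items/page) replaces flooding 30 items at once
- header with counts in 3 buckets: manual high-risk / general changes / K8s automatic
- ⬅️/➡️ paging buttons at the bottom (callback: drift_view_page:{report_id}_{page})
- security_interceptor INFO_ACTIONS adds drift_view_page to the allowlist

### P0.2 drift_narrator recommendation
- LLM prompt adds a recommendation field (action/confidence/reason)
- action ∈ {adopt, revert, ignore, investigate}
- card top shows "🎯 AI recommendation: revert (85%) - reason"
- on LLM failure, fall back to _fallback_recommendation (rule-based, mapped by intent)
- card diff_summary limit 500 → 1500 characters to fit recommendation + narrative + items
- Commander's directive: display only, no automatic execution (0.85 threshold reserved for the future)

## Verification
- all 90 pytest tests pass (drift + rule_engine + approval_execution)
- AST syntax check passes on 5 files

## Next acceptance
1. next drift trigger → card top shows "🎯 AI recommendation"
2. pressing drift_view → 3-bucket header + ⬅️/➡️
3. if awoooi-service recurs → query automation_operation_log.input.parsed_target directly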

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:04:13 +08:00
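The P0.3a change in commit 54d60d04f5 (strip a trailing `-service` from `labels.service` and report where the target came from) can be sketched as below. Only the `label.service(stripped)` source string appears in the log; the function name `resolve_target`, the other source strings, and the precedence given to a `deployment` label are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

SUFFIX = "-service"


def resolve_target(labels: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Return (target, target_source) derived from alert labels."""
    # Assumed precedence: an explicit deployment label wins.
    if labels.get("deployment"):
        return labels["deployment"], "label.deployment"
    service = labels.get("service", "")
    if service.endswith(SUFFIX) and len(service) > len(SUFFIX):
        # Treat "foo-service" as the base name "foo" and record that we did.
        return service[: -len(SUFFIX)], "label.service(stripped)"
    if service:
        return service, "label.service"
    return None, "none"
```

Returning the source alongside the value means a wrong target in a later failure record can be traced to the label it was derived from without re-running the rule engine.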
Your Name
86d9b22125 docs(logbook): session wrap-up: Gap Review + AI autonomy 1/9→4/9 full record
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
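Full close-out of the session's 35 commits:
  - Phase 7 foundation (scanners + evaluator + tracker + advisor + forecaster)
  - KPI Dashboard API (autonomy_score 63/100, quantifiable)
  - Audit: 3 honestly reported gaps
  - Gap 1: strict host IPv4 + cleanup of 266 duplicates
  - Gap 2: true cause confirmed, not a bug
  - Gap 3: LLM upgrade 3/8 (capacity_forecaster/compliance/coverage)

AI autonomy achieved:
  1/9 LLM (Hermes only) → 4/9 LLM decision
  all 8 zero-writer tables activated
  7/7 coverage dimensions complete
  tonight the AI will autonomously push 4 kinds of Telegram analysis reports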

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:22:42 +08:00
OG T
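53618b25c9 docs(logbook): 2026-04-19 20:00 full record of this session's 22 commits
Records:
  - Commander's decisions: Rule 1 deprecated + Rule 2 kept + noise algorithm corrected
  - Hermes LLM upgrade (OpenClaw analyzes the true cause of false reports)
  - coverage_evaluator extended by 4 dimensions (all 7 dimensions implemented)
  - deploy-alerts workflow deploys HostDiskUsageHigh/Critical to Prometheus
  - all 5 bugs found in review fixed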

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:56:56 +08:00
OG T
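c015a77011 docs(logbook): Phase 7 completion record: 8/8 tables written + 5 bugs fixed + Hermes E3
Records the 5 bugs found during this round's in-depth review + 8 new scanners/evaluators/advisors.
100% coverage of the 8 ADR-090 zero-writer tables.
2 rules with 100% noise await a manual decision after Hermes pushes its recommendations.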

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:28:28 +08:00
OG T
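5d011de917 docs(logbook): 2026-04-19 Phase 7 scanner complete + CI fix history
Records the full picture of this round's 6 commits and the 3 rounds of debugging CI cd.yaml B5,
so future sessions can understand current progress when taking over.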

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:36:30 +08:00
OG T
1ae9e9f389 fix(code-review): P0-1 action fallback semantics fix + P1-2 reason enum + P2-2 secops cleanup
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m7s
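Code review findings (2026-04-17 chief architect review):

P0-1 auto_approve.py condition 1d semantics fix:
- Before: used the `action` variable (already fallback = action or kubectl_command) for the kubectl check
  → action="" + kubectl_command="kubectl get pods" → action="kubectl get pods" → 1d passes
  → _kubectl_cmd and action hold the same value (same source checked twice), masking the case where action itself is natural language
- Fix: use the raw proposal_data.get("action", "") value (_raw_action)
  → checks the action field itself, making the logic's semantics explicit

P1-2 auto_approve.py adds NO_EXECUTABLE_ACTION:
- add the AutoApproveReason.NO_EXECUTABLE_ACTION enum value
- condition 1d now uses this reason (the original NO_PLAYBOOK means "no matching Playbook" and does not fit this case)
- avoids polluting the root-cause classification in the KM flywheel learning data (ADR-068)

P2-2 decision_manager.py secops branch:
- threat_behavior now uses _parse_debate_summary → takes the diagnosis field
- consistent with the BUG-A/BUG-C fixes; no longer dumps the first 150 characters of the full debate_summary

ADR-082: Phase 2 multi-Agent collaboration
2026-04-17 ogt + Claude Sonnet 4.6 (Asia-Pacific): fixes after Code Review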

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:23:35 +08:00
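The P0-1 finding in commit 1ae9e9f389 is that a variable which has already fallen back to another field cannot tell you anything about the original field. The sketch below illustrates that pitfall only; it assumes the kubectl check is a simple prefix test, and `looks_like_kubectl`, `check_with_fallback`, and `check_raw_action` are hypothetical names, since the real condition 1d in auto_approve.py is not shown in the log.

```python
from typing import Dict


def looks_like_kubectl(text: str) -> bool:
    """Assumed check: the text is a kubectl command line."""
    return text.strip().startswith("kubectl ")


def check_with_fallback(proposal_data: Dict[str, str]) -> bool:
    """Pre-fix shape: `action` silently inherits kubectl_command."""
    kubectl_cmd = proposal_data.get("kubectl_command", "")
    action = proposal_data.get("action", "") or kubectl_cmd
    return looks_like_kubectl(action)


def check_raw_action(proposal_data: Dict[str, str]) -> bool:
    """Post-fix shape: inspect the action field itself."""
    raw_action = proposal_data.get("action", "")
    return looks_like_kubectl(raw_action)
```

With `{"action": "", "kubectl_command": "kubectl get pods"}`, the case quoted in the commit, the fallback version reports a kubectl action while the raw version does not.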
OG T
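e8bf37cfd9 docs(logbook): final confirmation that f5e33da2 has the all-node E2E chain working (2026-04-16)
- CI 894 complete, f5e33da2 deployed
- flywheel outcome field fix confirmed
- telegram _send_request fix confirmed (zero AttributeError)
- Sweeper: 20/20 incidents from the last 48h fully marked sweeper_done
- E2E chain: complete 7-node flow confirmed (36 incidents)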

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:18:45 +08:00
OG T
588ecfd940 docs(logbook): 2026-04-16 all-node E2E verification + production bug fix record
2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:46:39 +08:00
OG T
bb7441ec8a docs: update CLAUDE.md + HARD_RULES.md v2.0 + LOGBOOK (2026-04-16)
- HARD_RULES.md v2.0: adds Self-Loop Workflow, Circuit Breaker Exception, State & Flow Validation
- CLAUDE.md: adds the §4 required-reading Memory table
- LOGBOOK: records AIOps E2E fix progress

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:20:16 +08:00
OG T
e465ee1936 docs(Phase 3): Evolver drill complete: exit condition #6 passed
- MASTER spec §3/§7/§8: Evolver drill checked off in three places
- LOGBOOK: drill results recorded + next step updated to 7-day production monitoring

Drill result: POST /api/v1/learning/evolver/run → HTTP 200 errors:[] 2026-04-15

ADR-083 Phase 3: 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:24:33 +08:00
OG T
5f86da52d9 docs(LOGBOOK): Phase 3 fully landed: 6 components + exit condition checklist
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:10:47 +08:00
OG T
ee486fbd2b docs(logbook): 2026-04-15 late-night wrap-up: P0/P2 RCA + Phase 6 loop closed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:58:09 +08:00
OG T
14a02263ae feat(Phase 4): proactive inspection + trend prediction + 8D sensory upgrade all complete
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m32s
## Phase 4 full delivery (ADR-084)

### New services
- trend_predictor.py: numpy linear regression, 4h threshold-breach early warning, R² confidence score
- proactive_inspector.py: proactive inspection coordinator, runs every 5 minutes
  - DynamicBaselineService (3σ deviation)
  - LogAnomalyDetector (new Drain3 patterns)
  - TrendPredictor (slope extrapolation, 4h forecast)
  - Shadow Mode + 30-minute dedup + Holt-Winters background retraining

### 8D sensory upgrade (EvidenceSnapshot Phase 4 enhancement)
- PreDecisionInvestigator._collect_phase4_anomalies(): before deciding, reads
  ProactiveInspector's latest inspection cache + new LogAnomalyDetector patterns
- EvidenceSnapshot.anomaly_context: new field, Phase 4 dynamic anomaly context
- DiagnosticianAgent._build_prompt(): the prompt includes anomaly_context, so
  LLM RCA can reference dynamic baseline deviations and trend warnings

### Database migration
- incident_evidence: ADD COLUMN anomaly_context JSONB (idempotent)

### main.py
- start the run_proactive_inspector_loop() asyncio task

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 4 all complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:47:05 +08:00
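The trend_predictor.py entry in commit 14a02263ae describes a linear regression whose slope is extrapolated to warn about a threshold breach within 4 hours, with R² as the confidence score. The sketch below shows that idea in pure Python rather than numpy so it has no dependencies; `fit_line` and `hours_until_breach` are hypothetical names, and the time axis is assumed to be in hours.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float, float]:
    """Ordinary least squares; returns (slope, intercept, r_squared).
    Needs at least two distinct x values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def hours_until_breach(
    xs: List[float], ys: List[float], threshold: float, horizon_hours: float = 4.0
) -> Optional[float]:
    """Hours from the last sample until the fitted line crosses threshold,
    or None if no upward crossing falls within the horizon."""
    slope, intercept, _ = fit_line(xs, ys)
    if slope <= 0:
        return None  # flat or falling trend never breaches from below
    eta = (threshold - intercept) / slope - xs[-1]
    return eta if 0 <= eta <= horizon_hours else None
```

A caller would typically also require a minimum R² before acting on the forecast, since a poor fit makes the extrapolated crossing time unreliable.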
OG T
f1cbf6db7d feat(adr-081): Phase 1 sensory depth: 8D intelligence gathering + post-execution verification
Deliverables:
- IncidentEvidence DB model (8D senses + pre/post execution state)
- EvidenceSnapshot dataclass (build_summary → LLM context)
- SanitizationService (Prompt Injection 0-tolerance, 12 patterns)
- MCPToolRegistry (dynamic tool registration; suggest_tools does not hard-code alert types)
- PreDecisionInvestigator (8D parallel senses, P99 < 8s, Redis 30s cache)
- PostExecutionVerifier (warmup 10s → post-state evaluation success/degraded/failed)
- decision_manager + approval_execution wired in (guarded by feature flag)

Gate 1 fixes: D4/D5/D7/D8 add sanitize_dict_values; remove the bare "error" failure
signal to prevent false matches on the error_rate key; warn when evidence_snapshot rowcount is zero.

Tests: 130 passed (+111 new)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:08:38 +08:00
OG T
db9e304a14 feat(adr-080): Phase 0 guardrails established: AI autonomy flywheel launched
- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
  (1456 lines, §0-§8 fully filled in: 42-cell tactical matrix, 7-Phase plan, 7 ADR summaries,
   15 KPIs, 21 Feature Flags, 10 risk scenarios)

- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
  (7-Phase structure + 4 north stars + 7 architect Review Gates + Phase exit conditions)

- apps/api/src/core/feature_flags.py
  (AIOpsFeatureFlags: P1~P6 master switches all False + 15 fine-grained sub-switches
   is_phase_enabled() / is_sub_flag_enabled() + safe bool cast)

- apps/api/src/jobs/__init__.py + baseline_snapshot.py
  (Phase 0 baseline snapshot Job: MCP calls / Playbook confidence / general ratio
   / learning loop rate / auto_repair, written to aiops:baseline:latest)

- apps/api/tests/test_feature_flags.py  (21 tests, all green)

- docs/HARD_RULES.md → v1.9
  (adds the Phase exit-condition iron rule: a Phase must not be declared complete before passing its exit conditions)

- CLAUDE.md anti-amnesia gate 1: mandatory read of MASTER §0 Session Resume Protocol

Gate 0 Pass — 21/21 tests green

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 12:44:53 +08:00
OG T
e3d7c92100 docs(Phase 5): ADR-079 status Completed + LOGBOOK midnight wrap-up
- ADR-079 Sprints 5.0-5.4 all complete; status changed to Completed
- LOGBOOK adds a midnight entry recording the Phase 5 landing

Overview of today's 26 commits:
cc42aa0 aae7c12 43c9689 dedd7c2 dd0a778 0f48a50 b8b124c 8de807c
f54dea4 6cac507 10b74af aa4e575 8b7e9cb 914c7e7 ca862c5 10e3043
72dd0c5 3f8d087 2a37d1c 094aa95 2e2f5a1 36754a8 581b244 208c28e
de8bbd8 a92562d

Covers:
- GAP-A1/A2/A3/A4 (4 gaps + Phase 2)
- GAP-B1/B4 (timeout fix)
- GAP-C1/C2/C3 (BP-1 + retry + SSH KM)
- GAP-D1/D5 (trust level + daily report + Postmortem)
- Phase 5, all Sprints (category buttons completed)
- 4 BLOCKER fixes + Bug A diagnosis + Bug B real fix
- retire dead buttons + relaunch new buttons (generated dynamically from the registry)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-15 10:46:40 +08:00
OG T
aa4e5757a2 fix: tech-debt cleanup: report_generation retry mechanism + GAP-A4 documentation
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 15m46s
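Tech debt #1: postmortem send failures silently swallowed
- 3 retries with exponential backoff (2s → 4s → 6s)
- after all fail, send a simplified degraded notification to the SRE group
- prevents postmortems from silently disappearing

Tech debt #2 (QueryBuilder abstraction): DEFER
- only 1 place in the whole project uses the outcome JSON path query
- violates "Don't design for hypothetical future requirements"
- abstract it once a second caller appears

Tech debt #3 (E2E tests): already covered
- test_gap_a4_placeholder_resolution.py TestMatchRuleRejection
- Mission C prod chain tested live (KubePodCrashLooping)
- Playwright K8s/Telegram staging deferred until the staging environment is ready

New documents:
- ADR-078-gap-a4-placeholder-resolution.md
- LOGBOOK 2026-04-14 late-night wrap-up entry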

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 18:46:25 +08:00
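Tech debt #1 in commit aa4e5757a2 describes retrying a failed postmortem send with waits of 2s, 4s, and 6s, then sending a simplified notification if everything fails. A minimal sketch follows, assuming one initial attempt plus up to three retries; `send_with_retry` and its parameters are hypothetical. The quoted delays grow linearly even though the message calls them exponential, and the sketch simply uses the quoted values.

```python
import time
from typing import Callable, Sequence


def send_with_retry(
    send: Callable[[], None],
    fallback: Callable[[], None],
    delays: Sequence[float] = (2.0, 4.0, 6.0),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if send() succeeded, False if the fallback was used."""
    # A leading 0.0 means the first attempt happens without waiting.
    for delay in (0.0, *delays):
        if delay:
            sleep(delay)
        try:
            send()
            return True
        except Exception:
            continue  # try again after the next delay
    # Every attempt failed: degrade instead of dropping the report.
    fallback()
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting; production code would leave the default.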
OG T
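6cac5071e4 docs: MASTER blueprint closure report + ADR-077 + LOGBOOK wrap-up
Final close-out of today's session (9 commits, 11/11 Tasks, 52 new tests):
- docs/reports/2026-04-14-MASTER-BLUEPRINT-CLOSURE.md: full closure report
- docs/adr/ADR-077-master-blueprint-completion.md: architecture review + decision record
- docs/LOGBOOK.md: adds a late-night wrap-up entry

Review verdict: CONDITIONAL PASS
Communication channel: everything goes through Telegram; SMTP is not needed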

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 18:36:59 +08:00
OG T
b8b124c917 chore: end-of-day save: LOGBOOK + archive obsolete flywheel-alerts CRD
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
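2026-04-14 evening session wrap-up:
- LOGBOOK.md: records today's 6 commits + Mission C E2E verification + MASTER blueprint P0+P1 all green
- k8s/monitoring/flywheel-alerts.yaml: add 🔴 DEPRECATED marker
  (11/11 rules already migrated into ops/monitoring/alerts-unified.yml; this CRD file has no Operator support)

Backlog cleanup inventory:
 C2 hasType4 hard-coded in the frontend (now wired to the real API)
 C3 WebSocket had no reconnect (exponential backoff + polling fallback)
 flywheel-alerts rewritten the Docker way (done; only the old file was not cleaned up → archived in this commit)
 risk_level YAML priority logic (decision_manager:1663)
 SMTP_USER/SMTP_PASSWORD CHANGE_ME (needs credentials from the commander)
 various E2E verifications (need real alerts to trigger)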

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 18:21:08 +08:00
OG T
be2ec4d761 docs(logbook): update current status: P0 docs backfilled, moat deployed
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 14:54:37 +08:00
OG T
684d6cfb43 feat(adr-076): all four Tactic B Tasks complete: alert grouping + retry + automatic reports
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 17m34s
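Task 2: AlertGroupingService: Redis 5-minute sliding window to prevent alert storms
- apps/api/src/services/alert_grouping_service.py (new)
- webhooks.py integration: child alerts short-circuited after fingerprint generation and before the LLM
- Threshold=3, Graceful Degradation, 16 tests

Task 3: approval_execution.py retry on execution failure
- MAX_RETRY=2, RETRY_DELAY_SECONDS=30
- _is_transient_error() classifies transient vs permanent; permanent errors are not retried
- Timeline records retry progress and notes the retry count on success; 29 tests

Task 4: report_generation_service.py automatic reports
- Daily inspection report: 08:00 Taipei time every day, pushed to the Telegram SRE group
- Postmortem: triggered automatically when Incident resolved + duration > 10 minutes
- main.py lifespan mounts run_daily_report_loop(); 30 tests

Tests: 600 → 675 passing (+75), 0 failed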

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 14:39:14 +08:00
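Task 2 in commit 684d6cfb43 groups alerts by fingerprint over a 5-minute sliding window with a threshold of 3, so that later alerts in a storm skip the LLM. The sketch below uses an in-memory deque in place of Redis; `AlertGrouper` and `is_child` are hypothetical names, and treating an alert as a child once the window holds more than the threshold is an assumption, since the log does not state the exact comparison.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # 5-minute window from the commit message
THRESHOLD = 3         # threshold from the commit message


class AlertGrouper:
    """Sliding-window counter per alert fingerprint (in-memory stand-in)."""

    def __init__(self, window: float = WINDOW_SECONDS, threshold: int = THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._seen = defaultdict(deque)

    def is_child(self, fingerprint: str, now: float) -> bool:
        """Record the alert; True means it should be grouped and skip the LLM."""
        stamps = self._seen[fingerprint]
        # Drop timestamps that have slid out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) > self.threshold
```

The "Graceful Degradation" noted in the commit would correspond to treating every alert as a parent when the backing store is unavailable, so a storage outage costs extra LLM calls rather than lost alerts.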
OG T
079d0e89b9 docs(adr-075): add implementation record + LOGBOOK update (Phase 1+2+CR all complete)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:44:57 +08:00
OG T
2f6859f76f docs(logbook): session wrap-up: Level 3 M3-M5 + Level 4 C2-C4 all complete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:43:06 +08:00
OG T
6dc03c9a55 fix(argocd)+feat(flywheel): Phase 1 complete: ArgoCD image break fixed + cold-start scripts
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. k8s/argocd/awoooi-prod-app.yaml:
   remove Deployment image ignoreDifferences
   - the original design meant ArgoCD did not update the image after CD updated kustomization.yaml
   - after the fix, the GitOps loop is back to normal

2. scripts/cold_start_playbooks.py:
   ADR-073 Phase 1 Step 8: generate 15 base Playbooks (K8s/Docker/DB/Infra)
   Result: Playbooks 0 → 15

3. scripts/batch_vectorize_km.py:
   ADR-073 Phase 1 Step 9: batch-vectorize KM
   Result: 711/713 embedding IS NOT NULL

Phase 1 fully complete; flywheel unblocked:
- Pod running 105998d (includes all fixes from 8be87b0)
- debounce 30min + alertname NULL fix + _collect_mcp_context enabled
- 15 Playbooks + 711 KM vectorized

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:20:52 +08:00
OG T
a4411f1386 docs(logbook): tech debt I2 DI refactor complete + milestone backfill
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:46:05 +08:00
OG T
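d32d494320 docs: detailed four-stage implementation steps + architecture-transition snapshot finalized + anti-drift rules
Spec v2.0 adds:
- §11 Detailed four-stage implementation steps (stages 1~4 each include an acceptance checklist)
  - Stage 1: CD unblock + debounce + alertname + cold-start Playbooks + KM vectorization (9 steps)
  - Stage 2: DB Migration + classify_alert_early + outcome write (5 steps)
  - Stage 3: triage station + SSH routing + TYPE-1/E/F + action parsing + risk_level (Tier3, 7 steps)
  - Stage 4: KMConversionService + manual repair records (4 steps)
- §12 Anti-drift rules (no skipping steps / Tier3 authorization / no scope changes / report anomalies immediately)

ADR-073 update: architecture-transition snapshot finalized (old architecture interrupted → new architecture triage flywheel)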

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:30:37 +08:00
OG T
cda09a229d docs(logbook): 2026-04-12 integrated spec complete, four-tier plan finalized
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:22:20 +08:00