Commit Graph

94 Commits

Author SHA1 Message Date
OG T
7da64eaad2 feat(Phase 3): 學習閉環重建 — 三根因修復 + 2x EWMA + Evolver Agent
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 19m7s
Type Sync Check / check-type-sync (push) Failing after 1m18s
ADR-083 Phase 3 學習閉環重建:

**三根因修復**
- approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2
  (成功路徑 L265 + 失敗路徑 L353,超時記錄 learning_trigger_timeout metric,主流程不 crash)
- models/approval.py: ApprovalRequestBase 新增 matched_playbook_id 欄位
- decision_manager.py: _auto_execute 建立 ApprovalRequest 時填充 matched_playbook_id
- learning_service.py: 雙路徑查找 _matched_pb_id(matched_playbook_id + metadata fallback)

**2x EWMA 負向強化**
- models/playbook.py: 新增 trust_score: float = 0.3(EWMA 動態信任度欄位)
- repositories/playbook_repository.py: update_stats 加 EWMA
  成功: trust = 0.9 × old + 0.1 × 1.0
  失敗: trust = 0.8 × old + 0.2 × 0.0(衰減速度 2x)
  trust < 0.1 → log warning,等 Evolver 封存

**Evolver Agent(新建)**
- services/playbook_evolver.py: 三功能全靜態規則
  1. 低信任封存: trust < 0.1 → DEPRECATED
  2. 休眠封存: 30d 未使用 AND trust < 0.5 → DEPRECATED
  3. 相似合併: 症狀 Jaccard > 0.9 → 保留高 trust,封存低 trust
  AIOPS_P3_EVOLVER_ENABLED=False 預設關閉

**文件**
- ADR-083 學習閉環重建
- MASTER §8 Phase 3 完工記錄

AIOPS_P3_ENABLED=False(預設),骨架就位等統帥批准開啟

Co-Authored-By: Claude Sonnet 4.6(亞太)<noreply@anthropic.com>
2026-04-15 14:01:37 +08:00
OG T
5ddba6d6e0 feat(adr-082): Phase 2 多 Agent 協作 — 5 角色辯證系統骨架上線
新增 5 個 Agent + Orchestrator + DecisionManager 接線:
- protocol.py: DiagnosisReport / ActionPlan / ReviewVerdict / CriticReport / DecisionPackage 型別系統
- DiagnosticianAgent: RCA 根因分析,confidence < 0.4 → ABSTAIN
- SolverAgent: 修復方案軍師,blast_radius 評分 + 降級 rule-based mock
- ReviewerAgent: 安全審查,HARD_RULES 靜態 pattern + blast_radius 閾值 (>50 revision, >80 reject)
- CriticAgent: 刻意唱反調,強制 3 問批判性思維,critical challenge → REJECT
- CoordinatorAgent: 純規則聚合,6 級決策閘,REQUEST_REVISION → 強制人工
- AgentOrchestrator: 30s 全局超時,Reviewer ‖ Critic 並行,DB Immutable Event Sourcing + Redis Streams
- DecisionManager: AIOPS_P2_ENABLED gate + _package_to_proposal_data 橋接既有 proposal_data 格式
- AgentSession DB table + 4 個複合 index
- ADR-082 決策記錄

Gate 2 修復(7 項):
- CRITICAL: DELETE FROM regex lookahead 位置錯誤(移至 FROM 後)
- CRITICAL: REQUEST_REVISION 可抵達 auto-execute 路徑(改回 requires_human_approval=True)
- IMPORTANT: _extract_json flat regex 不支援巢狀 JSON(改 find/rfind 邊界提取)
- IMPORTANT: all_degraded 遺漏 verdict.degraded(補全 4 個 Agent)
- IMPORTANT: Solver ABSTAIN guard 放行降級假設(改為無論 hypotheses 有無均跳過)
- IMPORTANT: dataclasses.asdict() Enum 未序列化導致 DB 寫入靜默失敗(加 json.dumps default handler)
- IMPORTANT: P2 gate 直讀屬性繞過父 Phase 守衛(改用 is_phase_enabled(2))

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:48:55 +08:00
OG T
db9e304a14 feat(adr-080): Phase 0 防護欄建立 — AI 自主化飛輪啟動
- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
  (1456 行,§0-§8 全填完:42-cell 戰術矩陣、7 Phase 計畫、7 ADR 摘要、
   15 KPI、21 Feature Flags、10 風險場景)

- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
  (7 Phase 結構 + 4 北極星 + 7 架構師 Review Gates + Phase 退出條件)

- apps/api/src/core/feature_flags.py
  (AIOpsFeatureFlags: P1~P6 總開關全 False + 15 細粒度子開關
   is_phase_enabled() / is_sub_flag_enabled() + bool cast 安全)

- apps/api/src/jobs/__init__.py + baseline_snapshot.py
  (Phase 0 基線快照 Job:MCP calls / Playbook confidence / general 比例
   / learning loop rate / auto_repair — 寫入 aiops:baseline:latest)

- apps/api/tests/test_feature_flags.py  (21 tests — 全綠)

- docs/HARD_RULES.md → v1.9
  (新增 Phase 退出條件鐵律:禁止未過 exit conditions 宣告 Phase 完成)

- CLAUDE.md 防失憶閘門 1:強制讀 MASTER §0 Session Resume Protocol

Gate 0 Pass — 21/21 tests green

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 12:44:53 +08:00
OG T
e3d7c92100 docs(Phase 5): ADR-079 狀態 Completed + LOGBOOK 午夜收官
- ADR-079 Sprint 5.0-5.4 全數完成,狀態改 Completed
- LOGBOOK 新增午夜條目記錄 Phase 5 落地

本日 26 commits 總覽:
cc42aa0 aae7c12 43c9689 dedd7c2 dd0a778 0f48a50 b8b124c 8de807c
f54dea4 6cac507 10b74af aa4e575 8b7e9cb 914c7e7 ca862c5 10e3043
72dd0c5 3f8d087 2a37d1c 094aa95 2e2f5a1 36754a8 581b244 208c28e
de8bbd8 a92562d

涵蓋:
- GAP-A1/A2/A3/A4 (4 個 gap + Phase 2)
- GAP-B1/B4 (timeout fix)
- GAP-C1/C2/C3 (BP-1 + retry + SSH KM)
- GAP-D1/D5 (信任度 + 日報 + Postmortem)
- Phase 5 全 Sprint (分類按鈕完整化)
- 4 BLOCKER 修復 + Bug A 診斷 + Bug B 真修
- 下架死按鈕 + 重啟新按鈕(從 registry 動態產生)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-15 10:46:40 +08:00
OG T
10e3043ce8 fix(UX): 下架 28 個鬼魂分類按鈕 + ADR-079 Phase 5 補完計畫
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
統帥 2026-04-14 20:00 完整 audit 揭露:
_CATEGORY_BUTTONS 28 個按鈕全死 3 天(從 2026-04-11 commit 325b3851)
- callback_data 格式全錯(3-part 不符 parser 4-part/2-part)
- grep apps/api/src 無任何 dispatch handler
- 統帥今天真踩到:點「查程序」沒反應 → 信任破壞

首席架構師裁示 (C 分級):
A. 立刻下架(本 commit):_CATEGORY_BUTTONS = {} fallback 通用按鈕
B. Phase 5 完整化(ADR-079 規劃,3-5 天,另 Sprint 實作)

保留通用按鈕(全 ):
- 批准 / 拒絕 / 靜默(4-part nonce)
- 詳情 / 歷史 / 重診(2-part info)

新增防禦性文件:
- ADR-079 — Phase 5 工作分解 + 每按鈕 checklist
- feedback_no_ghost_buttons.md(memory)— 鬼魂按鈕鐵律

設計原則永久入檔: 寧可沒按鈕,不可有死按鈕

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 20:19:25 +08:00
OG T
aa4e5757a2 fix: 技術債清理 — report_generation 重試機制 + GAP-A4 文件化
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 15m46s
技術債 #1: postmortem 發送失敗靜默吞掉
- 3 次指數退避重試 (2s → 4s → 6s)
- 全失敗後送簡化降級通知到 SRE 群組
- 防止事後檢討默默消失

技術債 #2 (QueryBuilder 抽象): DEFER
- 全專案僅 1 處用 outcome JSON path query
- 違反「Don't design for hypothetical future requirements」
- 待第二 caller 出現再抽

技術債 #3 (E2E 測試): 已涵蓋
- test_gap_a4_placeholder_resolution.py TestMatchRuleRejection
- Mission C prod 鏈路實測(KubePodCrashLooping)
- Playwright K8s/Telegram staging 留待 staging 環境就緒

新增文件:
- ADR-078-gap-a4-placeholder-resolution.md
- LOGBOOK 2026-04-14 深夜收官條目

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 18:46:25 +08:00
OG T
6cac5071e4 docs: MASTER 藍圖結案報告 + ADR-077 + LOGBOOK 收尾
本日 Session 終極收案(9 commits, 11/11 Task, 52 新測試):
- docs/reports/2026-04-14-MASTER-BLUEPRINT-CLOSURE.md — 完整結案報告
- docs/adr/ADR-077-master-blueprint-completion.md — 架構審查 + 決議紀錄
- docs/LOGBOOK.md — 新增深夜收官條目

審查裁定: CONDITIONAL PASS
通訊渠道: 全走 Telegram,SMTP 不需要

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 18:36:59 +08:00
OG T
079d0e89b9 docs(adr-075): 加入實作記錄 + LOGBOOK 更新(Phase 1+2+CR 全完成)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 18:44:57 +08:00
OG T
d32d494320 docs: 四階段細化實施步驟 + 架構轉型截圖定案 + 防偏差守則
規格書 v2.0 新增:
- §十一 四階段細化實施步驟(階段1~4各含驗收清單)
  - 階段1: CD解鎖+debounce+alertname+冷啟動Playbook+KM向量化(9步)
  - 階段2: DB Migration+classify_alert_early+outcome寫入(5步)
  - 階段3: 分診站+SSH路由+TYPE-1/E/F+action解析+risk_level(Tier3,7步)
  - 階段4: KMConversionService+手動修復記錄(4步)
- §十二 防偏差守則(不跳步驟/Tier3授權/不改範圍/異常立刻報告)

ADR-073 更新:架構轉型截圖定案(舊架構中斷→新架構分診飛輪)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:30:37 +08:00
OG T
f2b427d87c docs(adr): ADR-073 補充 ADR-071 整合工作序 + ADR-074 監控補全 Sprint
新增:
- Sprint ADR-073-B 補充:DB Migration + 檢傷分類站 + KMConversionService(ADR-071-A/A0/B/C/G/H)
- Sprint ADR-074:飛輪健康度Exporter + 主機間網路 + DNS + Gitea CD + 備份還原測試等9項監控缺口
- 參考指向完整規格書 2026-04-12-aiops-complete-flywheel-repair-design.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:21:27 +08:00
OG T
c7677750b5 docs(adr-070): 補全 c439277 全自動化三大修復 + Tier 3 CR 修補記錄
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 00:09:18 +08:00
OG T
99cc420429 docs(review): 首席架構師 Code Review 後 — ADR-064/067 + Skill 02 補全記錄
ADR-064: 補 I1 整合記錄(get_incident_type 三層降級、rule.id ≠ incident_type 設計決策)
ADR-067: 補 D1 集中化完成記錄(9 purpose keys 對應表)
Skill 02: 補 get_incident_type 使用規範 + Ollama D1 模型中央化禁令

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:35:25 +08:00
OG T
82e1c05df8 fix(review): Code Review C1/C2/I2/M2 修補
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
C1 drift_interpreter: 寫死 192.168.0.111 → settings.OLLAMA_URL
  違反 feedback_frontend_internal_ip_ban 鐵律(後端 service 層同樣禁止寫死內網 IP)

C2 km_conversion_service: BUG-004 補同步 Redis Working Memory vectorized 欄位
  原修復只更新 DB,Redis incident:{id} JSON 的 vectorized 未同步
  → 審計查 Redis 仍顯示 False,fly-wheel 閉環指標仍不準
  修復:DB 更新後 GET → JSON patch vectorized=True → SET(保留原 TTL)

I2 decision_manager: _ALERTNAME_KEYWORDS HostHighDiskUsage→HostOutOfDiskSpace
  + 補 DockerContainerExited
  + fallback 路徑加 debug log

M2 decision_manager: import json as _json 從 for 迴圈移至方法頂部

docs: ADR-072 新增 Code Review 發現與技術債記錄

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:36:59 +08:00
OG T
9382814d14 docs(adr-072): 全部完成 BUG-001~008
ADR-072 狀態更新為「全部修復完成」
BUG-007 確認不需修(alerts-unified.yml 全 42 規則均有 severity)
BUG-008 已修復(f34fe19)
LOGBOOK 新增 P2 完成條目 + 下一步說明

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:30:29 +08:00
OG T
85c71bf73c docs(adr-072): 更新 Bug 修復狀態 + LOGBOOK
ADR-072: BUG-001~006 標記已修復 (P0 commit 88e3197, P1 commit 5aa0244)
LOGBOOK: 新增 ADR-072 P0+P1 全修復條目
P2 待修: BUG-007 severity labels + BUG-008 alertname_to_type

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:26:27 +08:00
OG T
09982fdfaa docs(session6): Telegram 全面審計 + ADR-072 Bug 清單 + 規格整合
- LOGBOOK: Session 6 Redis DB10 審計結果(8個系統性問題,P0-P2分級)
- ADR-072: AIOps 閉環 Bug 修復清單(drift_interpreter/deployment_name/KM vectorization等)
- 規格文件 v2.2: 確認 Sprint A/B/C + MCP 1-4 + ADR-071 全部完成,標記下一步為 ADR-072

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:04:50 +08:00
OG T
a1432c03ed docs: ADR-070/071 + ssh-mcp-setup runbook + Skill-04 v2.7
- ADR-070: 全自動 AIOps 閉環 MCP Phase 1-4 決策文件
- ADR-071: 告警通知四類型 + KM 三段資料閉環決策文件
- docs/runbooks/ssh-mcp-setup.md: SSH MCP 建立/驗證/輪換 SOP
- Skill-04: v2.7 新增 Sprint C DR + ADR-070 MCP 10 providers 完整記錄

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:04:47 +08:00
OG T
de055778b3 fix(cd): CD_PUSH_TOKEN + backup 路徑使用 BACKUP_ROOT 環境變數
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- cd.yaml: GITEA_CD_TOKEN → CD_PUSH_TOKEN(Gitea 保留 GITEA_ 前綴)
- ADR-069: 同步更新 token 名稱說明
- backup-from-110.sh: 改用 BACKUP_ROOT 環境變數(預設 /home/ollama/backup/110)
  避免 /var/log /var/run 需要 root 權限
- 已部署到 188 + cron 0 1 * * * 設定完成

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:07:47 +08:00
OG T
7f4ec717ef feat(gitops): Sprint B-2/B-3 — ArgoCD Application + CD GitOps 模式
B-2: k8s/argocd/awoooi-prod-app.yaml
  - ArgoCD Application awoooi-prod 建立(已 apply 到 K8s)
  - automated sync: prune + selfHeal
  - ignoreDifferences: Deployment image + Secret data
  - 全部 17 個 K8s 資源已確認 Synced

B-3: .gitea/workflows/cd.yaml — Deploy step 重寫
  - 舊: kubectl set image(與 ArgoCD selfHeal 衝突)
  - 新: kustomize edit set image → git commit [skip ci] → push → ArgoCD sync
  - 新增等待 ArgoCD Synced + Healthy(最多 120s)
  - 需建立 Gitea Secret: GITEA_CD_TOKEN(見 ADR-069)

docs/adr/ADR-069-infra-gitops-sprint-b.md
  - 決策記錄:循環觸發防護 + ignoreDifferences 設計

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 02:57:42 +08:00
OG T
b52e2de968 docs(adr068): 飛輪冷啟動修復結案文件 + Skills v2.8
- ADR-068: 完整記錄五根因、四階段修復、首席架構師審查、E2E 驗收、驗證 Runbook
- LOGBOOK: 更新當前狀態,標記全閉環
- Skill 02 v2.8: 新增「自動修復飛輪六大鐵律」章節(affected_services/alert_name/Router層/Jaccard/alertname變體/Embedding雙軌)

2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:39:42 +08:00
OG T
2065665c9b docs(adr): ADR-067 Ollama 五大本地 AI 應用實施規格
批准五大應用 Phase 30-34,依序執行:
1. Drift 報告中文摘要 (qwen2.5:7b-instruct)
2. Log 異常摘要 (deepseek-r1:14b)
3. PR 自動審查 (qwen2.5-coder:7b)
4. RAG 知識庫 pgvector (nomic-embed-text)
5. 圖片分析 (llava:latest)

pgvector 0.8.2 已確認在 prod 就緒

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 01:35:07 +08:00
OG T
309fe04698 docs(adr066): 批准執行閉環修復記錄 — LOGBOOK + ADR-066 + Skill 02 更新
- LOGBOOK.md: 新增 2026-04-09 批准執行閉環修復狀態區塊
- ADR-066: 記錄根本問題鏈條、決策與受影響檔案
- Skills/02: v2.7 新增 Nemotron tool→kubectl_command 回填鐵律 + 教訓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:23:55 +08:00
OG T
c01026be9b docs(skills+adr): 自動修復全鏈路知識更新 — ADR-058 Appendix A + Skills v2.5
ADR-058: 188白名單補完 + Appendix A (12 Bug修復記錄 + E2E驗證 + Playbook覆蓋矩陣)
Skill-04 DevOps v2.5: SSH自動修復架構章節 (白名單/SOP/陷阱)
Skill-05 SRE: 自動修復E2E驗收規範 + 診斷表

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:21:24 +08:00
OG T
2779233b25 docs: Sprint 5R 實施完成紀錄更新
- LOGBOOK: 13/14 步驟全部完成,CD 部署中
- ADR-065: 狀態更新為「實施完成」
- Skills 01 v1.8: Sprint 5R 完成記錄
- Memory: project_current_status + sprint5r_plan 已更新

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:19:57 +08:00
OG T
c180bdaaac docs: Sprint 5R 前端重構批准 — ADR-065 + 設計稿 + Skills + LOGBOOK
- ADR-065: Sprint 5R 前端重構決策(版本 A 批准)
- sprint5r-approved-design.html: 統帥批准的設計稿存檔
- Skills 01 v1.7: 品牌 Logo/AwoooI 一致性鐵律
- LOGBOOK: Sprint 5R 開始實施

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:15:43 +08:00
OG T
bf4ec18d0e docs(adr): ADR-030 補充九-十章實作完成記錄
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:29:04 +08:00
OG T
428e66c111 fix(arch-review): 首席架構師審查 S1×3 S2×3 S3×3 全修復 + ADR-064
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
S1 Critical:
- S1-1: asyncio 觸發移至 _call_with_fallback async 上下文,移除 sync 中的 get_event_loop()
- S1-2: _append_rule_to_yaml 加 textwrap.dedent() 正規化 LLM 輸出縮排
- S1-3: _matches() 對 alertname=["*"] 直接回傳 False,防意外命中

S2 Major:
- S2-1: auto_generate_rule() 改為 DI 參數注入 (ollama_url/model/gemini_api_key),移除 import settings
- S2-4: _generate_mock_response docstring 澄清為規則引擎生產路徑,非假數據
- S2-5: suggested_action .strip() 防空白字串繞過 or

S3 Minor:
- S3-2: priority 上界 min(next, 890)
- S3-3: alertname sanitize re.sub([{}]) 防 format KeyError
- S3-4: model_registry.py 最後修改時間戳更新

文件:
- ADR-064: Alert Rule Engine YAML 驅動 + AI 自動學習
- Skills 02: 告警規則引擎 DI 規範 + asyncio 禁止事項
- Skills 03: _generate_mock_response 語意澄清 + 規則引擎降級流程
- LOGBOOK: 本次 Session 完整記錄

2026-04-09 ogt: 首席架構師審查修正

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 10:52:40 +08:00
OG T
0af5c2e89c docs(sprint5.1): LOGBOOK + ADR-062 + Skill 02 更新(首席架構師審查記錄)
- docs/LOGBOOK.md: 當前狀態更新至 L1-L5+審查完成,里程碑補充審查修正記錄
- docs/adr/ADR-062: 新增實施記錄章節(執行清單+審查問題+修正方式)
- .agents/skills/02-lewooogo-backend-core.md v2.5→v2.6:
    加入 Sprint 5.1 Service Registry 模式
    加入 Guardrail 保守原則(失敗 block 不放行)
    加入新 Service 標準樣板(structlog/now_taipei/DI setter/try-except)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:38:31 +08:00
OG T
6f7a4be2c7 docs: Sprint 5.1 資料安全護欄 — ADR-062/063 + 方案規範驗證
- ADR-062: Data Safety Guardrails (服務分級/Pre-flight/MultiSig)
- ADR-063: Service Registry IaC 設計規範
- Sprint 5.1 方案文件: 規範驗證通過,P1-P5 問題修正
  - P1: Playbook 存 Redis(非 SQL),M-001 改為 Pydantic model 修改
  - P2: velero_client.py 命名維持(與 signoz_client 慣例一致)
  - P3: docker-health-monitor 狀態釐清
  - P4/P5: DI setter + Deployment Verification 補充
- LOGBOOK: 當前焦點更新為 Sprint 5.1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:07:12 +08:00
OG T
876aa9a441 docs(adr): ADR-060 React Flow + elkjs 拓撲圖引擎技術選型 (方案 D+ 批准) 2026-04-08 11:56:58 +08:00
OG T
f525e657ca docs: ADR-060/061 全面監控+Event Sourcing架構決策記錄
- ADR-060: 全面基礎設施監控規劃 (Plan A/B/C/D/E)
- ADR-061: Alert Operation Log Event Sourcing 架構
- LOGBOOK: 2026-04-08 里程碑記錄更新

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 11:44:06 +08:00
OG T
23364423fa feat(webhook): ADR-059 GitHub → Gitea Webhook 遷移完成
- gitea_webhook.py: Header 全部改 X-Gitea-*,移除 workflow_run handler
- gitea_webhook_service.py: _fetch_pr_diff 改直接 httpx,不依賴 github_api_service
- 清除兩個檔案的所有殘留 github_ log key,review_id prefix 改 gitea-
- test_gitea_webhook.py: 10/10 通過,docstring 修正
- 03-secrets.yaml: 新增 GITEA_WEBHOOK_SECRET 佔位
- cd.yaml: 新增 GITEA_WEBHOOK_SECRET 注入步驟
- ADR-059: 建立架構決策文件

待統帥操作: Gitea Actions secret + Gitea UI Webhook 設定

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:44:32 +08:00
OG T
a49faf7baa docs: ADR-058 Host Auto-Repair SSH 白名單 + LOGBOOK 更新
首席架構師 Review 結果: 72→88/100
已修正: C1 C2 C3 M3 m1 m2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:09:58 +08:00
OG T
45458e8f33 docs(adr): ADR-057 狀態更新為已批准並實作
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:44:31 +08:00
OG T
15c7f6fcd3 docs(adr): 起草 ADR-054/055/056/057 — Phase 25 三方向架構決策
ADR-054: DIAGNOSE Privacy-First Routing (已批准)
  - _local_fallback_chain 設計決策
  - NEMOTRON privacy_level=local 首席架構師裁示
  - 全部 local 失敗 → REJECT + Telegram

ADR-055: Knowledge Auto-Harvesting (已批准)
  - AUTO_RUNBOOK DRAFT + ANTI_PATTERN PUBLISHED 設計理由
  - compute_hash() 碰撞風險說明
  - Fire-and-forget GC 防護強制規範

ADR-056: Config Drift Detection 四層架構 (已批准)
  - Detector→Analyzer→Interpreter→Remediator 職責邊界
  - AI 只做意圖分析不做修復決策
  - adopt() 暫停 + _recent_reports Phase 1 限制

ADR-057: adopt() Gitea PR API 實作路徑 (草案,待批准)
  - 解決 API Pod git add -A 安全風險
  - PR review 流程保障

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:24:50 +08:00
OG T
2d5f1a71ad chore(observability): ClickHouse TTL 設定完成 — Phase O 全驗收
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
signoz_logs: 30天 (已內建 _retention_days DEFAULT 30)
signoz_metrics 8個表: 233280000s(2700天) → 7776000s(90天)
  - samples_v4, samples_v4_agg_5m, samples_v4_agg_30m
  - exp_hist, time_series_v4, time_series_v4_6hrs
  - time_series_v4_1day, time_series_v4_1week

Phase O 驗收清單全部打勾 

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:38:39 +08:00
OG T
3f339110dd fix(observability): 同步 .188 實際部署調整至 repo
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
與原始計畫的差異:

1. MinIO Bearer Token 認證
   - 原計畫: MINIO_PROMETHEUS_AUTH_TYPE=public (此版本不支援)
   - 實際: mc admin prometheus generate 產生 Bearer Token
   - 更新: prometheus-config-phase-o.yaml 加入 bearer_token

2. remote_write 廢棄 → OTEL Collector Prometheus scrape
   - 原計畫: Prometheus remote_write → SigNoz OTEL /api/v1/write
   - 實際: SigNoz OTEL Collector 不支援 Prometheus remote_write 格式 (404)
   - 改用: OTEL Collector prometheus receiver 直接 scrape node-exporter + kube-state-metrics
   - 新增: ops/signoz/otel-collector-config-phase-o.yaml (版本控管副本)

3. ADR-053 驗收清單更新為實際結果

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:23:47 +08:00
OG T
3e4612f259 docs(observability): ADR-053 SigNoz 統一架構 + Phase O 驗收
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 36s
E2E Health Check / e2e-health (push) Successful in 16s
- 新增 ADR-053: 可觀測性統一架構決策記錄
- 更新 service-registry.yaml: 補齊 MinIO/Kali 監控入口
- 更新 LOGBOOK: Phase O 完成狀態

Phase O 驗收清單:
 kubectl Mac 本機免密碼
 OTEL Collector 2 Pod Running
 Event Exporter 1 Pod Running
 Descheduler CronJob Completed
 MinIO + Kali 告警規則
 Alert Chain Smoke Test
 CD Pipeline 整合
 ClickHouse TTL / remote_write / SigNoz rules (待 .188 手動)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 18:26:57 +08:00
OG T
73e8f8ab77 feat(ai): Phase 24-A+B1 — AI Provider Registry + 絞殺者包裝 (ADR-052)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Brain Layer 雙軌 Registry 架構:
- 新建 src/services/ai_providers/ 目錄 (interfaces + 4 providers)
  - OllamaProvider (local, rca/chat/code_review)
  - GeminiProvider (cloud, rca/chat)
  - ClaudeProvider (cloud, rca/chat/code_review)
  - OpenClawNemoProvider (cloud, rca — 委派 188→NIM)
- 擴展 ai_router.py 加入:
  - AIProviderRegistry (動態註冊/啟停)
  - AIRouterExecutor (Cache + 閘門 CB/RL/Sem + 執行)
- openclaw.py 絞殺者包裝: USE_AI_ROUTER=true 走新路徑
- config.py + ConfigMap 加入 USE_AI_ROUTER=false (安全預設)
- ADR-052 正式文件 (14 項決策 D1-D14)
- HARD_RULES v1.7 加入 AI Router 規範

安全: USE_AI_ROUTER=false 預設不啟用,需手動開啟觀察
回滾: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false

2026-04-02 ogt: Phase 24 首批實作

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:16:09 +08:00
OG T
c9c60c3a61 feat(mcp-integrations): Phase S 架構修復 + MCP 整合基礎建設
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 22s
Phase S 技術債修復 (首席架構師審查 82→完整):
- S-01: generate_alert_fingerprint 移至 AlertAnalyzer.generate_fingerprint() staticmethod
- S-04: 移除 Pydantic v2 deprecated json_encoders (直接用原生 datetime 序列化)

Sentry MCP 整合 (Phase 23):
- ADR-048: Sentry→OpenClaw AI Triage 架構決策
- sentry_webhook_service.py: parse/analyze/create_incident/build_message Service 層
- config.py: SENTRY_WEBHOOK_SECRET (Fail-Closed HMAC-SHA256)

Playwright MCP 整合 (短期):
- smoke.spec.ts: 5 頁面 E2E smoke test (home/dashboard/incidents/approvals/terminal)
- cd.yaml: E2E Smoke Test 步驟 + Telegram 🎭 Smoke 狀態通知

長期規劃 ADR:
- ADR-049: Figma Code Connect 設計系統同步
- ADR-050: Telegram 互動式 Incident 2.0 (6鍵 Inline Keyboard)
- ADR-051: Context7 依賴升級顧問 (Next.js 14→15, FastAPI 0.115→0.128)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:20:57 +08:00
OG T
22de22c989 refactor(phase-s): Phase S 技術債清理 - 五項架構改善
S-01: generate_alert_fingerprint() 移至 alert_analyzer_service (Router→Service)
S-02: 移除廢棄 USE_NEW_ENGINE config (Phase R 已完成歷史使命)
S-03: github_webhook.py linter 清理 (Field unused + delivery_id noqa)
S-04: Pydantic v2 遷移 - approval/incident models (class Config → ConfigDict)
S-05: Skill 09 v1.1 更新 (USE_NEW_ENGINE 廢棄說明)

測試: 393 passed, 零失敗

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 13:12:02 +08:00
OG T
6fed8be8c4 docs(adr): ADR-024 R4 Router 瘦身標記完成
Some checks failed
E2E Health Check / e2e-health (push) Successful in 17s
Type Sync Check / check-type-sync (push) Failing after 22s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:27:40 +08:00
OG T
5086bafa36 docs: ADR-045 Telegram Gateway 統一到 K8s AWOOOI API
記錄 2026-03-31 已實施的架構決策:
- 統一 Telegram 到 K8s AWOOOI API Webhook 模式
- 解決 OpenClaw (188) Long Polling 雙軌競爭問題

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:17:08 +08:00
OG T
a94bb57d8b feat(types): ADR-046 IncidentConverter + IncidentEngineAdapter
實作 ADR-046 Option B: IncidentConverter 轉換層,解決
BrainIncident (lewooogo-brain) 與 LocalIncident (apps/api) 型別邊界問題。

變更:
- 新增 src/utils/incident_converter.py
  - brain_to_local(): BrainIncident → LocalIncident
  - local_to_brain(): LocalIncident → BrainIncident
  - ESCALATED → MITIGATING 映射 (brain 無 ESCALATED)
- incident_engine.py: 新增 IncidentEngineAdapter 包裝層
  - process_signal() / get_incident() 輸出轉換為 LocalIncident
  - get_incident_engine() 返回 IncidentEngineAdapter
- incident_memory.py: 加入 brain_to_local import,更新 _record_to_incident 說明
- ADR-046: 標記三個轉換點全部完成

解鎖: #123 proposal_service.py 清理 (下一步)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:47:54 +08:00
OG T
2ba61acf72 fix(api): Phase R-R2.2 首席架構師 72/100 P2 修復
P2-01 signal_worker.py: persisted_to_pg 改用 getattr 防 BrainIncident AttributeError
P2-02 IIncidentEngine Protocol: update_incident_status → update_status 對齊 brain 實作
P2-03 config.py USE_NEW_ENGINE: 標記失效 + 回滾路徑更正 (git revert 而非 kubectl)
ADR-046: Option B (IncidentConverter) 決策完成,待實作清單更新
ADR-024: 審查結論 + 正式回滾指令更新
Skill 02: v2.5 版本記錄

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:33:08 +08:00
OG T
cd91560e0b docs: Phase R-R2 完成文件更新 + ADR-046 型別統一
- ADR-024: 更新執行進度 (R1 R2 R3 R4待執行)
- ADR-046: 新增跨套件 Incident 型別統一治理 (待決策)
  推薦 Option B: IncidentConverter 轉換層
- Skill 02: v2.5 記錄 Phase R-R2 + R-R2.1 + ADR-046
- LOGBOOK: 更新當前狀態

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:17:44 +08:00
OG T
dd526684ab feat(ai): Phase 22 OpenClaw + Nemotron 協作架構 (ADR-044)
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
統帥批准實作「仲裁-執行分工」架構:
- OpenClaw = 仲裁者 (Why + Risk Level)
- Nemotron = 執行者 (How + kubectl Command)

新增功能:
- config.py: ENABLE_NEMOTRON_COLLABORATION Feature Flag
- openclaw.py: generate_incident_proposal_with_tools()
- openclaw.py: _call_nemotron_tools() Nemotron 呼叫
- telegram_gateway.py: TelegramMessage Nemotron 欄位
- telegram_gateway.py: format_with_nemotron() 雙區塊格式
- decision_manager.py: 整合協作方法
- proposal_service.py: 整合協作方法

觸發條件:
- LOW 風險 → 僅 OpenClaw
- MEDIUM/HIGH/CRITICAL → OpenClaw + Nemotron 雙軌

首席架構師審查: 83/100 條件通過

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 18:52:53 +08:00
OG T
52177d7096 docs: 更新 ARCHITECTURE_MEMORY + ADR-041
All checks were successful
E2E Health Check / e2e-health (push) Successful in 16s
- ARCHITECTURE_MEMORY: 新增 Service 層架構說明
- ADR-041: 更新 Phase 21 定期報告狀態

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 16:06:51 +08:00
OG T
268b944d5c docs: Phase 18 首席架構師審查 + ADR-043
All checks were successful
E2E Health Check / e2e-health (push) Successful in 24s
2026-03-31 Claude Code (首席架構師)

審查結果:
- 初審: 91/100 OUTSTANDING (條件通過)
- P0 修復後: 95/100 OUTSTANDING

ADR-043: FailureWatcher 與 DecisionManager 協調
- 優先級規則: DecisionManager > FailureWatcher
- 資源鎖定機制 (Redis)
- Approval 衝突檢查
- 來源標記審計

P1 待改進 (Phase 19):
- DI 注入模式統一
- 整合測試覆蓋
- Trace ID 記錄

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 12:24:54 +08:00
OG T
bcd33e854f docs: ADR-042 前端效能優化模式 (DOM Bypass + Optimistic Updates)
All checks were successful
E2E Health Check / e2e-health (push) Successful in 16s
新增 ADR-042:
- Pattern 1: DOM Bypass (繞過 React 渲染,100x 效能提升)
- Pattern 2: Optimistic Updates (0ms UI 延遲 + 失敗回滾)
- Pattern 3: SSE Incremental Updates (增量更新,減少 API 請求)
- Pattern 4: AbortController (防止記憶體洩漏)

更新 Skills 01:
- v1.6 版本更新
- 新增效能優化模式章節
- 參考 ADR-042

首席架構師審查: 96-98/100 OUTSTANDING

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 11:36:21 +08:00