Commit Graph

351 Commits

Author SHA1 Message Date
Your Name
025a493f06 feat(p3.2+adr-100): Model Version Tracker + SLO self-governance + KB rot cleaner
Some checks failed
run-migration / migrate (push) Failing after 12s
CD Pipeline / build-and-deploy (push) Has been cancelled
Wave 8 P3.2 model version tracking + ADR-100 SLO self-governance + supporting changes:

P3.2 (Model Version Tracking):
- model_version_probe.py (268 lines): probes the model version of providers such as Ollama / OpenRouter
- model_version_tracker.py (101 lines): keeps the PG provider_version_history table in sync
- migrations/p3_2_provider_version_history.sql + rollback: 25-line schema
- db/models.py +32 lines: ProviderVersionHistory ORM

ADR-100 (AI autonomy SLO):
- docs/adr/ADR-100-ai-autonomous-slo.md (167 lines): flywheel SLO design and thresholds
- ops/monitoring/slo-rules.yml (254 lines): Prometheus SLO recording rules + alerts
- ops/monitoring/tests/test_slo_rules.yaml (242 lines): promtool unit tests

Integration changes:
- main.py +72 lines: lifespan starts model_version_probe + the KB rot cleaner schedule
- gitea_webhook.py +45 lines: webhook receives model version change notifications
- ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py: wiring to match

New tests:
- test_kb_rot_cleaner_schedule.py (120 lines): 9 tests pass
- test_slo_rules.yaml: promtool acceptance

Tests: 9 passed (test_kb_rot_cleaner_schedule)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>
2026-04-27 14:54:19 +08:00
Your Name
fb130c9a28 feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization hardening + additional metrics fields
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Wave 8 P3.1-T2 follow-up tests + supporting changes:

New tests:
- test_diagnosis_aggregator_stub.py (238 lines): 15 tests
  · stub fixture verifies the _collect_diagnosis_aggregator wiring
  · feature flag defaults to off, so no call is made
  · timeout boundaries / exceptions fail soft

Changes:
- core/metrics.py +23: adds DiagnosisAggregator-related Prometheus metrics
- sanitization_service.py +24: hardens prompt sanitization boundaries (companion to vuln #4)
- RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor tweaks

Tests: 15 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 08:30:26 +08:00
Your Name
fefe4c21cd fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Continues the 595629c0 INC-20260425 fix, completing the three-stage agent work plus end-to-end observability:

A1 follow-up: timeout wiring for the three stages (Solver/Critic):
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override)
- protocol.py: the three agents share observe_agent_step() to wrap calls
  · success/timeout/error outcome label
  · histogram written to aiops_agent_step_duration_seconds
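
The wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the code in protocol.py: the signature of observe_agent_step, the OBSERVATIONS list (a stand-in for the Prometheus histogram), and the way the env override is read are all assumptions.

```python
import asyncio
import os
import time

# Hypothetical in-memory stand-in for the Prometheus histogram:
# each entry is (agent, outcome, duration_seconds).
OBSERVATIONS = []

# Env override with the default quoted in the commit message.
SOLVER_TIMEOUT_SEC = float(os.environ.get("AGENT_SOLVER_TIMEOUT_SEC", "20.0"))


async def observe_agent_step(agent, coro, timeout):
    """Run one agent step under a timeout and record its outcome label."""
    start = time.monotonic()
    outcome = "success"
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        outcome = "timeout"
        raise
    except Exception:
        outcome = "error"
        raise
    finally:
        # Always record, whichever branch was taken.
        OBSERVATIONS.append((agent, outcome, time.monotonic() - start))
```

Recording in `finally` keeps the three outcome labels mutually exclusive and guarantees one observation per call.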

A2 follow-up: auto_repair_service switches to _diagnose_fallback_chain:
- auto_repair_service.py +46 lines: switches DIAGNOSE routing to the new chain (NEMO→GEMINI→CLAUDE)
- Completely avoids the 238s second timeout of Ollama on CPU

New metrics:
- core/metrics.py +59 lines: histogram buckets + label cardinality for observe_agent_step

New tests (862 lines):
- test_agent_step_timeouts.py (475): timeout boundaries + outcome label for each of the three agents
- test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain

New supporting files:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
  · histogram alert rules for the three agents (p99 > 80% of timeout → warning)

Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)

INC-20260425 total workload across both fixes (595629c0 + this commit):
  · 5 service/agent files modified
  · 1 new observability module
  · 4 new test/supporting files
  · 1372+187 = 1559 lines added

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 follow-up) <noreply@anthropic.com>
2026-04-27 08:15:53 +08:00
Your Name
1ab6786ce3 feat(ops): Ollama failover Runbook + Grafana dashboard + Consensus K8s ConfigMap patch
Some checks failed
run-migration / migrate (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Wave 6 P2.3 ops support + tool-expert deployment docs:

New:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines)
  · verification steps for the three iron rules (auto switch to Gemini / auto switch back / quota circuit breaker)
  · complete failover/recovery SOP
  · troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines)
  · 4 panels: current primary / failover events / quota usage / health status
  · maps to P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
  · ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
2026-04-27 08:11:40 +08:00
Your Name
cc547736ab feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER fixes
Picks up code that multiple Wave 6/7/8 engineers finished before hitting the agent quota, and commits it to resolve a
latent import error on production HEAD (decision_fusion was already referenced by decision_manager, but the file was untracked).

New (backend core):
- decision_fusion.py (562 lines): P2.1 method III (three-LLM fusion of OpenClaw + Hermes + Elephant)
- aiops_timeline.py + aiops_timeline_service.py: critic B4 fix
  /api/v1/aiops/timeline endpoint, with DB access moved into the service layer to follow leWOOOgo modularization
- migrations/p2_decision_fusion_columns.sql + rollback: approval_records fusion columns

Changes (backend integration):
- decision_manager.py: repairs three broken links in fusion (critic B1+B2+B3):
  · B1: write _evidence_snapshot_ref to token.proposal_data
  · B2: compute complexity_score before fusion and write it to the token
  · B3: write the fusion composite to token.proposal_data["decision_fusion"]
- auto_approve.py: made aware of fusion + consensus (critic B3+B5):
  · composite > 0.7 → auto_execute_eligible bypasses min_confidence
  · source=consensus_engine + score>=0.6 → trusted-rule path
- consensus_engine.py: db-fix, _save_consensus reuses agent_sessions
- governance_agent.py: db-fix, _alert writes to ai_governance_events in PG
- approval_db.py: 3 fusion columns + 2 partial indexes + CheckConstraint
- db/models.py: schema aligned with the migration
- core/config.py: vuln #1 fix: OLLAMA_URL/_FALLBACK_URL field_validator
  rejects public IPs + external domains, allowing only a private-network/loopback/K8s SVC allowlist
- core/feature_flags.py: P2 fusion + consensus flags
- main.py: governance_agent lifespan startup
- failover_alerter.py: Wave8-X2: in-memory dedup fallback (no fail-open after Redis refuses)
- ollama_*.py: metrics integration + recovery improvements
- auto_repair_service.py: verifier wiring
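
The URL allowlist described for core/config.py could look roughly like the sketch below. It uses only the standard library rather than the Pydantic field_validator named above; the function name, the K8s Service suffixes, and the error messages are hypothetical, and `is_private` covers more ranges than RFC 1918, so a production check may need to be stricter.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist of in-cluster Kubernetes Service DNS suffixes.
K8S_SVC_SUFFIXES = (".svc", ".svc.cluster.local")


def validate_internal_url(url):
    """Return url if it points at a private/loopback IP or a K8s Service name."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: accept only localhost or K8s Service names.
        if host == "localhost" or host.endswith(K8S_SVC_SUFFIXES):
            return url
        raise ValueError(f"external domain not allowed: {host}") from None
    if ip.is_private or ip.is_loopback:
        return url
    raise ValueError(f"public IP not allowed: {host}")
```

In a Pydantic settings class, a function like this would typically be called from the validator body so that a bad URL fails at startup rather than at first use.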

New (tests, 2438 lines):
- test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py
- test_p2_db_fixes.py / test_wave8_fusion_fixes.py
- test_config_url_validation.py (vuln #1, 12 tests)
- test_failover_alerter.py + additional Wave8-X2 in-memory dedup tests

Acceptance: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus +
                      governance + p2_db_fixes + failover_alerter)

Conflict resolution:
- 3 files (config.py + auto_approve.py + decision_manager.py) had git stash pop conflicts
  kept stashed (the engineers' final version) and restored the "公網 IP" (public IP) wording in the ValueError to match the tests

Note: this commit resolves the latent import error on production HEAD
Still unfixed: vuln #4 prompt injection / debugger B14 quota fail-closed
       / B25-B26 drain_pending_tasks / B8 governance fail alert

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com>
2026-04-27 08:11:40 +08:00
Your Name
7cd53c0228 fix(monitoring): memory alerts switch to working_set, stopping page-cache false alarms
- alerts-unified.yml:
  - SentryClickHouseMemoryPressure: usage_bytes → working_set_bytes, 0.8 → 0.85
  - GiteaMemoryPressure: same fix (same root cause: page cache inflating usage)
- ops/monitoring/tests/clickhouse_memory_test.yml: promtool 4 cases
- 04-awoooi-devops-commander.md v2.8: Prometheus metric selection guidelines + Gitea HMAC Webhook guidelines
- LOGBOOK: records the five parallel T0 tasks (A button / B ClickHouse / C Gitea webhook / D ElephantAlpha / F Code review)

Evidence: 2026-04-23 23:13 sentry-clickhouse usage_bytes=88.5% vs working_set=7.8%
Root cause: container_memory_usage_bytes includes OS page cache, which the OOM killer does not treat as pressure
Fix: switch to working_set_bytes as recognized by K8s/cadvisor (RSS + active cache), threshold 0.85

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:16:12 +08:00
Your Name
689839cd83 docs(logbook): record the 2026-04-25 automation flywheel four fixes + Hermes + qwen3
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 09:49:50 +08:00
Your Name
d467cac709 fix(hermes): call the anthropic Python SDK directly, dropping claude-agent-sdk, which requires the claude CLI
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: claude-agent-sdk needs to spawn the claude CLI; the prod pod has no CLI, so the SDK returned empty.
Fix: call the API directly via anthropic.AsyncAnthropic().messages.create().
model: claude-haiku-4-5-20251001 (fast and low-cost, suitable for Telegram QA)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 03:08:51 +08:00
Your Name
7d1c85eb86 fix(hermes): ANTHROPIC_API_KEY injection + solver confidence fix A + 12-Agent governance docs
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- nl_gateway.py: ClaudeAgentOptions injects ANTHROPIC_API_KEY (alias of CLAUDE_API_KEY) via env=,
  fixing the SDK not finding the API key (the SDK reads ANTHROPIC_API_KEY, but the K8s secret is named CLAUDE_API_KEY)
- solver_agent.py: fix A: priority path for the kubectl_command field; when OpenClaw Nemo returns a complete command,
  confidence is no longer compressed by semantic synthesis (the 0.9 → min(0.5) bug), 9 tests pass
- AGENTS.md: Codex CLI counterpart of CLAUDE.md (used at Codex session start)
- docs/12-agent-game-rules.md: 12-Agent task typing + owner/collaborator dispatch + mapping to 9 skills (v1.0)
- .agents/skills/06-awoooi-monorepo-master.md: v1.6, adds a 12-agent collaboration governance chapter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:33:43 +08:00
Your Name
86ee013cdf feat(hermes-complete): three Hermes NL enhancements + ConsensusEngine + ADR wrap-up
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m32s
## Hermes NL enhancements (nl_gateway.py)
- T1 hermes_dispatch_log DB write (asyncio.create_task, non-blocking)
- T2 Redis rate limiting: per-chat_id 20 req/min, fail-open
- T3 Multi-turn session: hermes:session:{chat_id}:{user_id} TTL=300s, most recent 3 turns
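
A fail-open, per-chat limiter like T2 can be sketched as below. This is an assumption-laden illustration: it uses a fixed one-minute window, an injectable `incr` callable standing in for Redis INCR, and a made-up key layout; only the 20 req/min limit and the fail-open behavior come from the commit message.

```python
import time
from typing import Callable, Optional

LIMIT_PER_MINUTE = 20  # limit quoted in the commit message


class InMemoryCounter:
    """Hypothetical stand-in for Redis INCR on a per-window key."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


def allow_request(chat_id: int, incr: Callable[[str], int],
                  now: Optional[float] = None) -> bool:
    """Fixed-window limiter: True if the request may proceed."""
    window = int((time.time() if now is None else now) // 60)
    key = f"hermes:rate:{chat_id}:{window}"  # hypothetical key layout
    try:
        return incr(key) <= LIMIT_PER_MINUTE
    except Exception:
        # Fail-open: if the counter backend is down, let the request through.
        return True
```

Fail-open trades protection for availability: a Redis outage disables throttling instead of blocking every chat.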

## ConsensusEngine (ADR-095 declarative design)
- consensus_engine.py: CONSENSUS_WEIGHTS class attribute
  security=0.4 locked, the 9 Claude Code agents share 0.6
- config.py: ENABLE_12AGENT_CONSENSUS=False feature flag

## ADR status
- ADR-093/094/095: Proposed → 🟡 approved, implementation in progress
- Each ADR gets a v1.1 change record

## K8s ConfigMap
- prod 04-configmap.yaml: adds 3 feature flags (all false)
- dev 02-configmap.yaml: added in sync

## LOGBOOK
- Records WS0–WS6 + enhancements completed, plus feature flag enablement guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:22:40 +08:00
Your Name
5675e7c3b0 fix(phase2+aiops): Phase 2 Agent timeout + AI Router intent hint + signoz incident_id
## Phase 2 Agent timeout (prevents a single LLM step from dragging down the whole debate)
- critic_agent.py: asyncio.wait_for + PHASE2_STEP_TIMEOUT_SEC=20s
- diagnostician_agent.py: same timeout protection
- solver_agent.py: same timeout protection

## AI Router optimization
- ai_router.py: _resolve_intent_from_context()
  Phase 2 agents pass intent_hint → Router fast path, without re-running the intent LLM

## SignOz Webhook fix
- signoz_webhook.py: passes the missing incident_id to send_approval_card() (removes TODO 2026-04-05)

## Alert handling flow fix
- webhooks.py: _should_bypass_alertmanager_llm()
  Host-type NO_ACTION alerts go straight to the manual-investigation card and no longer mistakenly trigger the LLM Agent Debate
- incident_repository.py: update_incident_status gains a resolved_at parameter
- incident_service.py / proposal_service.py / incident_approval_service.py: minor fixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:10:06 +08:00
Your Name
054d0ae422 docs(ws0): Hermes × 12-Agent Telegram integration governance docs (ADR-093/094/095)
## New
- ADR-093: full migration of Telegram alerts to the SRE war-room group
  - Hybrid-strategy allowlist mode (TYPE-3/4/4D/8M → group + user_id binding)
  - New nonce format apr:{short_id}:{action}:{user_id_hash} + Redis backend mapping
  - Feature flag TG_GROUP_CUTOVER for gradual traffic cutover
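
The nonce format above has to fit Telegram's 64-byte limit on callback_data, which is presumably why a short id plus a backend mapping is used. A minimal sketch of building and parsing such a string follows; the function names and the choice of a truncated SHA-256 for user_id_hash are assumptions, not the ADR's definition.

```python
import hashlib

TELEGRAM_CALLBACK_DATA_MAX_BYTES = 64  # Bot API limit for callback_data


def build_callback_data(short_id, action, user_id):
    """Build apr:{short_id}:{action}:{user_id_hash} and enforce the size limit."""
    # Hypothetical hash: first 8 hex chars of SHA-256 over the user id.
    user_hash = hashlib.sha256(str(user_id).encode()).hexdigest()[:8]
    data = f"apr:{short_id}:{action}:{user_hash}"
    if len(data.encode("utf-8")) > TELEGRAM_CALLBACK_DATA_MAX_BYTES:
        raise ValueError("callback_data exceeds 64 bytes")
    return data


def parse_callback_data(data):
    """Split the nonce back into (short_id, action, user_id_hash)."""
    prefix, short_id, action, user_hash = data.split(":")
    if prefix != "apr":
        raise ValueError(f"unexpected prefix: {prefix!r}")
    return short_id, action, user_hash
```

Binding the hash of the expected user into the nonce lets a group-chat handler reject button presses from anyone other than the bound approver.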

- ADR-094: Hermes natural-language interface (@mention conversations)
  - Option C: single bot + Claude Agent SDK virtual dispatch
  - Webhook secret_token + allowed_updates = [message, callback_query, chat_member]
  - Prompt injection protection: query/describe/summarize only; mutate goes through ApprovalRecord
  - Redis session TTL=300s + compression at turn>=5

- ADR-095: 12-Agent Claude SDK integration × Telegram visual dispatch
  - Complete emoji/hashtag/handle table for the 12 agents
  - ConsensusEngine weights extended (security=0.4 locked)
  - display_names.py naming isolation (.claude/agents/ vs src/agents/)

## Updates
- ADR-009: adds a v0.3 change record pointing to ADR-095
- ADR-075: adds an updated reference table (ADR-093 D4 allowlist sub-clause, ADR-094/095)
- docs/design/hermes-telegram-flows/hermes-flows.html: complete F1-F7 flow diagrams

## Pre-flight checks
- approval_records table does not exist yet → will be created from scratch with BIGINT
- docker-compose.yml:78 plaintext token 🔴 P0, to be fixed in WS1
- awoooi_migrator role not yet created → to be created in WS2
- claude-agent-sdk upgraded to 0.1.66 (latest)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:10:06 +08:00
Your Name
55f111e0e3 fix(aiops): correct host alert fallback and resolved stamp
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m54s
2026-04-25 00:14:07 +08:00
Your Name
0d81b28b1b fix(aiops): bound phase2 timeout and repair incident links
All checks were successful
E2E Health Check / e2e-health (push) Successful in 52s
CD Pipeline / build-and-deploy (push) Successful in 9m24s
2026-04-24 23:53:56 +08:00
Your Name
4ea52d8e5d docs(logbook): ADR-092 P2.4+P2.6 completion record
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:58:19 +08:00
Your Name
e75e4678a9 feat(p2.4): Telegram intermediate-state push: "analyzing" placeholder card + auto-delete on completion
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
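P2.4 implementation 2026-04-24 ogt + Claude Sonnet 4.6

Problem: LLM analysis takes 10-30s, during which Telegram shows no response, so users cannot tell the system is working

Fix:
- telegram_gateway.py: adds send_analyzing_placeholder(), which sends an "AI is analyzing..." placeholder card
- telegram_gateway.py: adds delete_message(), which deletes the placeholder card
- webhooks.py: sends the placeholder card within 3s before LLM analysis (a timeout does not block the main flow)
- webhooks.py: _push_to_telegram_background receives placeholder_message_id → deletes the placeholder card after the full card is sent
- webhooks.py: import asyncio (previously missing)

Effect: users see the "analyzing..." message within <3s of the alert arriving; it is cleared automatically once the full card appears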

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:56:26 +08:00
Your Name
04ff22563e fix(aiops-p1): Playbook learning loop: all 5 breakpoints fixed + DB Migration (ADR-092 B4)
Some checks failed
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s
[P0.4 patch] pre_decision_investigator Prometheus query field missing
- _build_tool_params() adds the "query" field (required parameter of the prometheus_query tool)
- Adds _build_prometheus_query(), which generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- After the fix, the D3_METRICS sensing dimension actually returns data (previously 100% returned missing_query_parameter)

[P1 Playbook learning loop: B1-B5 all fixed]
- B2 db/models.py: ApprovalRecord gains a matched_playbook_id column + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent gains an incident_id column (for MCP auditing) + index
- B3 approval_db.py: record→ApprovalRequest restores incident_id + matched_playbook_id
- B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync)
- B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id → so the DB can store the value

[P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s)
- Root cause: an asyncio.create_task is killed on Pod recycle, so KM writes are silently lost
- Fix: await asyncio.wait_for(..., timeout=30.0) + TimeoutError log
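
The P1.5 change can be illustrated with the sketch below, which awaits the write under a deadline instead of scheduling it with create_task. The helper name, logger name, and boolean return value are hypothetical; only the wait_for call, the 30s timeout, and the timeout log come from the commit message.

```python
import asyncio
import logging

logger = logging.getLogger("km_write")


async def write_km_with_deadline(write_coro, timeout=30.0):
    """Await the KM write so it completes (or times out) before the caller returns."""
    try:
        await asyncio.wait_for(write_coro, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Log loudly rather than losing the write silently.
        logger.error("KM write timed out after %.1fs", timeout)
        return False
```

Awaiting costs the caller up to `timeout` seconds of latency, but the write is no longer an orphaned background task that disappears when the process is recycled.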

[Migration file] adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- Run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql

[Accompanying agent changes]
- decision_manager: Phase 2 YAML NO_ACTION priority gate (host-level/external services skip Agent Debate)
- alert_rules.yaml: new Sentry/ClickHouse + HostDiskUsageHigh/Critical rules
- solver_agent: action_title semantic-synthesis fallback (replaces silent drop)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:41:35 +08:00
Your Name
45dbe07188 fix(flywheel): fixes six automation flywheel capabilities (ADR-092 B3)
Some checks failed
run-migration / migrate (push) Failing after 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s
Type Sync Check / check-type-sync (push) Successful in 2m54s
CD Pipeline / build-and-deploy (push) Has been cancelled
Ansible Lint / lint (push) Has been cancelled
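[Root-cause chain fix]
MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context
→ LLM times out → description="待分析" (pending analysis) → ADR-091 hard gate blocks → tg_sent not set
→ W-2 Watchdog falsely reports a "silent failure"

[Six fixes]
1. Three MCP Provider bug fixes
   - ssh_provider: asyncssh.run() → conn.run()
   - prometheus_provider: KeyError 'query' → tolerant .get()
   - k8s_provider: empty pod_name → early return of an error dict

2. Agent Debate / decision quality
   - decision_manager: timeout fallback text changed to an explicit description (gets past the ADR-091 hard gate)
   - intent_classifier: LLM timeout falls back to keyword classification (not None)

3. Watchdog false-positive fix (ADR-092 B3)
   - W-2: tg_sent Redis TTL → telegram_message_id IS NULL (DB ground truth)
   - W-5 new: suggested_action IN empty/待分析/NO_ACTION + tg_id IS NULL
   - approval_timeout_resolver: 60min → 15min, batch 50 → 200

4. Config Drift automation
   - drift_adopt_service: auto_adopt_if_safe() six-condition safety gate
   - drift.py: background task tries auto-adopt first, then sends the manual Telegram card

5. Playbook flywheel stability
   - playbook_seed_service: fixes idempotency (deprecated is not treated as missing)
   - playbook_evolver: loads only DRAFT+APPROVED (not all 294 records)

6. Observability
   - alert_rule_engine: auto_rule structured logs + Redis counters (pipeline)
   - auto_approve: Redis counter for reject reasons
   - heartbeat_report_service: adds an "⚙️ Automation stats (today)" section

[Pending manual execution]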
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:55:50 +08:00
Your Name
e2742ce9f3 docs: record of the BUTTON_DATA_INVALID root fix + Gitea Code Review fix
LOGBOOK + ADR-092 Appendix C: 2026-04-21 fix record

E2E verification: telegram_approval_card_sent message_id=25045 (SignOzDown) ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 21:59:00 +08:00
Your Name
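994817a23a docs: ADR-092 Appendices A+B + LOGBOOK + MASTER §8 record the four fixes and the C1-C4 end-to-end wiring
- ADR-092: Appendix A (B1-B4 four fixes: root cause + commit) + Appendix B (C1-C4 breakpoint fix table + architecture iron rules)
- LOGBOOK: adds the 2026-04-20 evening C1-C4 section (breakpoint list + commits + acceptance steps)
- MASTER §8: appends the C1-C4 changelog (§3/§1.1 alignment + description of post-fix behavior)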

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-20 20:24:41 +08:00
Your Name
39ac292c90 docs(master): §8 appends the ADR-092 four-fix record + project_current_status update
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:01:50 +08:00
Your Name
803b389f6b security(secrets): replace the real TG bot token in test fixtures with a fake value
Some checks failed
run-migration / migrate (push) Failing after 20s
CD Pipeline / build-and-deploy (push) Successful in 9m10s
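## Incident
An aider-watch v1 session wrote the real production TG bot token (NEMOTRON_BOT_TOKEN)
as a test fixture into the following tracked files (all already pushed to Gitea):
- apps/api/tests/test_secret_redactor.py
- docs/superpowers/plans/2026-04-19-aider-watch.md (3 places)
- docs/superpowers/plans/2026-04-20-aider-watch-v2.md

Violates feedback_secrets_leak_incidents_2026-04-18.md L2 zero trust (no secrets in source control).

## Handling
- Commander's decision: do not revoke the token (risk accepted)
- Replaced with the fake value 111222333:A*35 (an obvious placeholder that still matches the redactor's detection format)
- Reduces future exposure to search engines / forks (but the git history still retains it)

## Verification
All 8 secret_redactor.py tests pass; the telegram regex still recognizes the new fake value format.

## P1 backlog
- git history cleanup (git filter-repo) requires the commander's approval for a force push
- pre-commit hook to prevent future leaks (grep for the TG token format / detect-secrets)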

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:23:09 +08:00
Your Name
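54d60d04f5 feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and categorization + AI recommendation + target tracing
Commander's decisions on the three questions: do all of them; AI recommendation at the 0.85 threshold is display-only, not automatic; check aol first, then fix

## RCA: source of the awoooi-service failures
- /api/v1/aiops/kpi shows 1 playbook_executed actor=approval_execution status=failed record in the past 24h
- grep codebase: no code hard-codes awoooi-service (only historical comments)
- Most likely source: alert_rule_engine._extract_vars takes the value of labels.service as the Deployment name
- cf5050c/4f2e122 (2026-04-18) already fixed two NEMOTRON hallucination paths; this commit fixes the third path

## Fixes
### P0.3a alert_rule_engine._extract_vars
- labels.service fallback: a name ending in -service has the suffix stripped first and is treated as the base name
- match_rule return value gains a target_source field for tracing
- If awoooi-service recurs, the source can be read directly (label.service(stripped), etc.)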
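
The suffix-stripping fallback with a traceable source can be sketched as follows. The function name, the higher-priority `deployment` label, and the exact source strings other than label.service(stripped) are hypothetical; the real logic lives in alert_rule_engine._extract_vars.

```python
SERVICE_SUFFIX = "-service"


def extract_target(labels):
    """Return (target, target_source) so a bad target can be traced to its origin."""
    deployment = labels.get("deployment")  # hypothetical higher-priority label
    if deployment:
        return deployment, "label.deployment"
    service = labels.get("service")
    if service:
        # Strip a trailing "-service" and treat the rest as the base name.
        if service.endswith(SERVICE_SUFFIX) and len(service) > len(SERVICE_SUFFIX):
            return service[: -len(SERVICE_SUFFIX)], "label.service(stripped)"
        return service, "label.service"
    return None, "none"
```

Returning the source alongside the value means a failed execution log can show which label produced the target without re-deriving it from the alert.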

### P0.3c approval_execution._log_aol_started.input
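- Adds parsed_target/operation/namespace fields
- Future aol queries for failures can read the target directly, with no guesswork

### P0.1 telegram_gateway._send_drift_diff_detail
- Pagination (10 items/page) replaces flooding the chat with 30 items at once
- Header counts in 3 buckets: manual high-risk / general changes / K8s automatic
- ⬅️/➡️ pagination buttons at the bottom (callback: drift_view_page:{report_id}_{page})
- security_interceptor INFO_ACTIONS adds drift_view_page to the allowlist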
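
The pagination and navigation buttons might be built along these lines. Zero-based page numbers, the helper names, and the button dictionaries are assumptions for illustration; the page size of 10 and the callback format come from the commit message.

```python
import math

PAGE_SIZE = 10  # items per page, as stated in the commit message


def paginate(items, page, page_size=PAGE_SIZE):
    """Return (page_items, clamped_page, total_pages) for a zero-based page."""
    total_pages = max(1, math.ceil(len(items) / page_size))
    page = min(max(page, 0), total_pages - 1)
    start = page * page_size
    return items[start:start + page_size], page, total_pages


def nav_buttons(report_id, page, total_pages):
    """Build back/forward buttons using the drift_view_page callback format."""
    buttons = []
    if page > 0:
        buttons.append({"text": "⬅️",
                        "callback_data": f"drift_view_page:{report_id}_{page - 1}"})
    if page < total_pages - 1:
        buttons.append({"text": "➡️",
                        "callback_data": f"drift_view_page:{report_id}_{page + 1}"})
    return buttons
```

Clamping the requested page protects against stale buttons pressed after the report has shrunk.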

### P0.2 drift_narrator recommendation
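- LLM prompt adds a recommendation field (action/confidence/reason)
- action ∈ {adopt, revert, ignore, investigate}
- Card top shows "🎯 AI recommendation: revert (85%) - reason"
- On LLM failure, falls back to _fallback_recommendation (rule-based mapping by intent)
- Card diff_summary limit 500 → 1500 characters to fit recommendation + narrative + items
- Commander's directive: display only, no automatic execution (the 0.85 threshold is kept for the future)

## Verification
- All 90 pytest tests pass (drift + rule_engine + approval_execution)
- AST syntax check passes for 5 files

## Next acceptance checks
1. Next drift trigger → card top shows "🎯 AI recommendation"
2. Press drift_view → 3-bucket header + ⬅️/➡️
3. If awoooi-service recurs → query automation_operation_log.input.parsed_target directly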

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:04:13 +08:00
Your Name
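8d40bbff2b docs(aider-watch v2): fill 4 big-picture blind spots
The commander's 2026-04-20 reminder, "never lose the big picture on any update", prompted a second check before execution.
It found 4 blind spots the plan had not handled, now filled in:

Blind spot 1: Mac reachability from the external network
  - spec §8 + §8b add a choice of one of Tailscale/nginx/VPN
  - plan Task B5 install.sh adds an up-front reminder to choose a configuration

Blind spot 2: incident flooding (multiple errors in the same session)
  - spec §8 adds a coalesce strategy (60s window per session_id)
  - plan Task A5 service implementation of create_incident_for_event adds coalesce logic
  - adds 2 test cases verifying same-session reuse + separation across sessions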
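
The coalesce strategy for blind spot 2 can be sketched as below. The class name and integer incident ids are hypothetical, and the window is measured from the most recent event of the session, which is one possible reading of "60s window per session_id"; the spec may instead measure from the first event.

```python
import itertools

COALESCE_WINDOW_SEC = 60.0  # window quoted in the commit message


class IncidentCoalescer:
    """Reuse one incident per session while its events keep arriving within the window."""

    def __init__(self):
        self._last = {}  # session_id -> (incident_id, time of last event)
        self._ids = itertools.count(1)

    def incident_for(self, session_id, now):
        entry = self._last.get(session_id)
        if entry is not None and now - entry[1] <= COALESCE_WINDOW_SEC:
            incident_id = entry[0]  # same session, still inside the window
        else:
            incident_id = next(self._ids)  # new session or window expired
        self._last[session_id] = (incident_id, now)
        return incident_id
```

Passing `now` explicitly keeps the logic deterministic, which is what makes the two test cases mentioned above straightforward to write.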

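Blind spot 3: first-rollout risk of AI Router feedback
  - spec §8 adds a USE_AIDER_FEEDBACK flag, default false, to be enabled after a 7-day gradual rollout
  - plan Task A8 wraps the route() hook in an if settings.USE_AIDER_FEEDBACK block
  - plan Task A9 config adds USE_AIDER_FEEDBACK: bool = False

Blind spot 4: obtaining the AWOOOI_PG_PW secret
  - spec §8c adds a kubectl get secret → env → shred flow
  - plan Task A0 Step 1 spells out reading the K8s Secret + destroying the file immediately

Follows the big-picture thinking discipline of feedback_ai_autonomous_direction.md.
Execution strategy: fully subagent-driven (approved by the commander).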

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:04:13 +08:00
Your Name
345e6832da docs(aider-watch): v2 implementation plan — 18 tasks across server/client/E2E
Corresponds to the v2 spec 2026-04-20-aider-watch-v2-design.md:

Phase A (server, 10 tasks, TDD):
  A0 HMAC secret + env setup
  A1 adr091 migration
  A2 secret_redactor util
  A3 Pydantic AiderEventIn/AiderBatchIn
  A4 AiderEventRepository
  A5 aider_event_service (classify/incident/pattern)
  A6 API webhook HMAC-verified
  A7 Redis stream consumer job + daily pattern extract
  A8 ai_router feedback_from_aider_events hook
  A9 config settings + main.py lifespan register
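
Tasks A0 and A6 revolve around an HMAC-verified webhook. A minimal sketch of signing and verifying a request body follows; HMAC-SHA256 with a hex digest is an assumption (the plan does not state the algorithm or header format), and the function names are hypothetical.

```python
import hashlib
import hmac


def sign_body(secret, body):
    """Client side: hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret, body, signature_hex):
    """Server side: recompute the signature and compare in constant time."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The signature must be computed over the raw bytes as received; verifying against a re-serialized JSON body can fail on whitespace or key-order differences.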

Phase B (Mac client, 5 tasks):
  B1 scaffolding (parsers/config/redactor moved over from v1)
  B2 api_client HMAC + retry
  B3 JSONL buffer + flush
  B4 aiderw wrapper + cli
  B5 install.sh + launchd plist

Phase C (E2E, 3 tasks):
  C1 happy path Mac → awoooi
  C2 degradation + buffer flush
  C3 AI Router feedback verification (fixture-driven)

Self-review: 100% spec coverage, no placeholders, consistent types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:04:13 +08:00
Your Name
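8ce8efad29 docs(aider-watch): v2 design draft: full integration with the awoooi AI autonomy flywheel
The commander's 2026-04-20 directive was "route C + bot A". The v1 route, a standalone personal tool,
was disconnected from the big picture of the awoooi MASTER blueprint and violated the feedback_ai_autonomous_direction
north star (pure logging, not autonomy). v2 realigns:

- DB: goes into the main PG, with the aider_events table in the new adr091 migration
- Telegram: uses the existing telegram_gateway @tsenyangbot + Redis dedup
- Incident: aider errors automatically create incidents through the existing alert chain
- AI learning loop: symptom_pattern extraction + AI Router feedback hook
- Mac client: thin-shell HTTP POST + local JSONL fallback buffer

Fate of the v1 artifacts: events.py/redactor.py move into awoooi; the rest are discarded.
@NemoTronAwoooI_Bot is repurposed for sandbox use, not deleted.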

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:04:13 +08:00
Your Name
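712d146129 docs(adr+skills): ADR-092 AI Decision LLM layer + Skill 03 updated with the unified LLM pattern
Full documentation following the chief architect's 2026-04-19 review (92/100, Grade A):

**ADR-092 created (AI Decision LLM extension architecture)**:
  - Background: 8 of 14 scanners are pure threshold, violating feedback_ai_autonomous_direction
  - Decision: 4 LLM services + unified pattern (llm_json_parser)
  - 5 iron-rule constraints: never raise on failure / AI only suggests, never acts / openclaw as the single entry point /
                aol audit trail / Traditional Chinese + JSON schema
  - Throttling: daily cron + event trigger (red_ratio>30% and scanned>=50)
  - autonomy_score 0-100 for quantified tracking
  - Implementation results + remaining P1 + rollback plan

**Skill 03 openclaw-cognitive-expert update**:
  - Adds the "2026-04-19 AI Decision LLM extension layer" chapter
  - Pattern code template (instead of rewriting the 3-path parse every time)
  - 4 LLM service comparison table + required_key
  - Extended list of the 5 iron rules
  - autonomy_score tracking usage notes

So the next session's Claude can quickly find the LLM service pattern when taking over, without reinventing the wheel.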

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:42:58 +08:00
Your Name
55486ce2fd docs: aider-watch implementation plan (15 tasks, TDD + frequent commits)
A full breakdown of §1-§7 of spec 2026-04-19-aider-watch-design.md:
scaffold → events schema → redactor → config → tg format/send → PG DDL
→ storage → parsers → wrapper → CLI → reporter → launchd → install → E2E.

Each task includes TDD steps (test first → verify failure → implement → verify pass → commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:42:41 +08:00
Your Name
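8603bce23b docs: aider-watch design draft (final §1-§7 approved by the commander)
Full-session monitoring system for the aider CLI: a Python wrapper intercepts aider stdout + chat history
→ real-time Telegram DM push (session start/end/file edit/error/commit/silent
timeout) + cumulative storage in PG 192.168.0.188/aider_watch + daily (23:50) and weekly (Sunday
22:00) launchd reports.

Graceful degradation: when PG is unreachable, fall back to a local JSONL buffer + 5min
flush job; on Telegram 429, exponential backoff without blocking aider; secret patterns are masked automatically.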

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:39:40 +08:00
Your Name
86d9b22125 docs(logbook): session wrap-up: Gap Review + big-picture record of AI autonomy 1/9→4/9
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
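Session closed out with 35 commits:
  - Phase 7 foundations (scanners + evaluator + tracker + advisor + forecaster)
  - KPI Dashboard API (autonomy_score 63/100, quantifiable)
  - Audit: 3 honestly reported gaps
  - Gap 1: strict host IPv4 + cleanup of 266 duplicate records
  - Gap 2: root cause confirmed as not a bug
  - Gap 3: LLM upgrade 3/8 (capacity_forecaster/compliance/coverage)

AI autonomy achieved:
  1/9 LLM (Hermes only) → 4/9 LLM decision
  all 8 zero-writer tables activated
  7/7 coverage dimensions complete
  tonight the AI will autonomously push 4 kinds of Telegram analysis reports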

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:22:42 +08:00
OG T
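53618b25c9 docs(logbook): 2026-04-19 20:00 big-picture record of this session's 22 commits
Records:
  - Commander's decisions: deprecate Rule 1 + keep Rule 2 + noise algorithm correction
  - Hermes LLM upgrade (OpenClaw analyzes the true cause of false reports)
  - coverage_evaluator extended by 4 dimensions (all 7 dimensions implemented)
  - deploy-alerts workflow deploys HostDiskUsageHigh/Critical to Prometheus
  - All 5 bugs found in review fixed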

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:56:56 +08:00
OG T
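c015a77011 docs(logbook): Phase 7 completion record: 8/8 tables written + 5 bugs fixed + Hermes E3
Records the 5 bugs found during this round's in-depth review + 8 new scanners/evaluators/advisors.
100% coverage of the 8 ADR-090 zero-writer tables.
2 rules with 100% noise await a manual decision after Hermes pushes its recommendation.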

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:28:28 +08:00
OG T
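5d011de917 docs(logbook): 2026-04-19 Phase 7 scanner completion + CI fix history
Records the big picture of this round's 6 commits and the 3 rounds of debugging CI cd.yaml B5,
so future sessions can understand the current progress when taking over.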

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:36:30 +08:00
OG T
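2524aa983a docs(adr): ADR-091 formal document for the Telegram subsystem Round 3 full audit
- 11 buttons × handler coverage matrix finalized
- The "none of the three may be missing" iron rule (callback format + handler + capability) elevated to ADR level
- callback_data dual format (nonce vs INFO) formally recognized
- Long Polling confirmed as by design
- approval three-stamp iron rule (editMarkup + editText + DB message_id)
- NO_ACTION is no longer mislabeled FAILED, resolving MASTER §7.1 #11

Corresponding commits 877c847 → 4b8be32, git tag v7.3.0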
Memory: project_phase7_round3_telegram_subsystem.md
2026-04-19 01:32:52 +08:00
OG T
0670fe4d76 docs(master): §8 appends the Phase 7 Round 3 Telegram subsystem fix record
Round 3 Changelog entry:
- Inventory of 9 bugs + list of 5 commits
- git tag v7.3.0
- Handover guidance for the next session

2026-04-19 early morning: ogt + Claude Opus 4.7
2026-04-19 01:32:52 +08:00
OG T
5ae82d1d1f feat(db): ADR-090 L4 AIOps foundation: asset inventory × 7-item automation coverage matrix persisted to DB
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 afternoon (Taipei time): ogt + Claude Opus 4.7 (1M)

The MoWoooWorkDown false-alarm RCA exposed a threefold structural failure:
- Hosts 110/188 at load 18/16 for 13 days / cadvisor 288% / K3s 120/121 unmonitored
- Prometheus has only 35 targets / 58 rules (under 30% coverage)
- HostHighCpuLoad measures the wrong dimension (CPU idle vs load_avg)

Commander's strategic directives:
- Big-picture assets × seven automations × persisted to DB
- Four-way AI division of labor (OpenClaw × NemoTron × Hermes × Claude LLM)
- All automation operation history must go into the DB, not rely on MD (MD drifts)

This commit delivers:

1. SQL migration (apps/api/migrations/adr090_asset_inventory_foundation.sql)
   - 11 tables + 33 indexes + 20 CHECK + 3 UNIQUE + 16 FK
   - pgcrypto extension dependency
   - Fully idempotent (CREATE IF NOT EXISTS + single transaction)
   - Already applied to awoooi_prod (188 PG), verification passed

2. ADR-090 (docs/adr/ADR-090-monitoring-blindspot-governance.md)
   - Decision record + mapping to 7 engines + 4 rejected alternatives

3. Main strategy document (docs/superpowers/specs/2026-04-18-blindspot-governance-capacity-l4.md)
   - §0-§14: background / root cause / Schema DDL / 4 layers of defense / 7-phase rollout /
     HARD_RULES / AI division-of-labor matrix / acceptance metrics / tech debt / rollback / handover protocol

4. MASTER §8 Living Changelog appends the Phase 7 kickoff entry

11 tables:
  asset_inventory / asset_discovery_run / asset_coverage_snapshot /
  asset_relationship / alert_rule_catalog / asset_change_event /
  asset_compliance_snapshot / host_capacity_snapshot /
  capacity_violation_event / automation_operation_log /
  ai_collaboration_trace

First bootstrap record seeded into asset_discovery_run
(run_id=6760c5bf-57e5-4a40-b82d-31b794464652)

Related Memory (not committed, stored in ~/.claude/...):
  - project_blindspot_governance.md (cross-session pointer)
  - feedback_monitor_self_monitoring.md (monitoring tools must themselves be monitored)
  - feedback_secrets_leak_incidents_2026-04-18.md (three lines of defense against credential leaks)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:46 +08:00
OG T
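e4bc3ec0ee docs(hard-rules): Prompt-Model sync iron rule: LLM Schema Drift ban
Hard-won lesson (2026-04-17): the SuggestedAction enum fell out of sync between prompt and model
→ NemoTron output investigate → Pydantic blew up → the whole system fell back to "待分析" (pending analysis)

New mandatory iron rules:
- Changes to prompts.py must be accompanied by a matching update to models/ai.py
- Any Model that receives LLM JSON must have a validator + fallback
- Silent death is forbidden (the specific failing field must be logged)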
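
The second and third rules can be illustrated with a tolerant enum parser. This sketch uses the standard library instead of a Pydantic validator; the enum members, the fallback choice, and the function name are hypothetical, since the real SuggestedAction lives in models/ai.py.

```python
import logging
from enum import Enum

logger = logging.getLogger("llm_schema")


class SuggestedAction(str, Enum):
    # Hypothetical members for illustration only.
    RESTART = "restart"
    SCALE = "scale"
    NO_ACTION = "no_action"
    INVESTIGATE = "investigate"


def parse_suggested_action(payload):
    """Map LLM output to the enum, logging and falling back on unknown values."""
    raw = payload.get("suggested_action")
    try:
        return SuggestedAction(str(raw).strip().lower())
    except ValueError:
        # Name the failing field and value instead of failing silently.
        logger.warning(
            "suggested_action=%r not in enum; falling back to INVESTIGATE", raw
        )
        return SuggestedAction.INVESTIGATE
```

A test that asserts every action word listed in the prompt parses without hitting the fallback is one way to catch prompt/model drift before deployment.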

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 21:48:50 +08:00
OG T
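ba8cf6105d docs(adr): ADR-086 Telegram UI cleanup guidelines + ADR-087 AutoApprove kubectl gate
ADR-086: Telegram notification card UI cleanup guidelines
- _parse_debate_summary() design decisions and per-TYPE field cleanup rules
- TYPE-3 keyboard refactor: approve/reject always on the first row
- Tech debt: promote _parse_debate_summary to module level (P1-1)

ADR-087: AutoApprove security hardening: kubectl enforcement gate
- Condition 1d design: _raw_action semantics + NO_EXECUTABLE_ACTION reason
- Solver Nemo format kubectl validation
- Fallback commands changed to real read-only kubectl investigation
- Rationale for keeping min_trust_score=0 recorded (tech debt: TrustEngine persists only in memory)
- P0-2 risk recorded: kubectl exec not yet added to _DESTRUCTIVE_PATTERNS

2026-04-17 ogt + Claude Sonnet 4.6 (Asia-Pacific): session tech-debt cleanup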

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:25:34 +08:00
OG T
1ae9e9f389 fix(code-review): P0-1 action fallback semantics fix + P1-2 reason enum + P2-2 secops cleanup
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m7s
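Code Review findings (2026-04-17 chief architect review):

P0-1 auto_approve.py condition 1d semantics fix:
- Before: the kubectl check used the `action` variable (already fallen back = action or kubectl_command)
  → action="" + kubectl_command="kubectl get pods" → action="kubectl get pods" → 1d passes
  → _kubectl_cmd and action have the same value (the same source checked twice), masking the case where action itself is natural language
- Fix: use the raw proposal_data.get("action", "") value (_raw_action)
  → checks the action field itself, making the logic's semantics explicit

P1-2 auto_approve.py adds NO_EXECUTABLE_ACTION:
- Adds the AutoApproveReason.NO_EXECUTABLE_ACTION enum value
- Condition 1d now uses this reason (the old NO_PLAYBOOK means "no matching Playbook" and does not fit this scenario)
- Avoids polluting the root-cause classification of the KM flywheel's learning data (ADR-068)

P2-2 decision_manager.py secops branch:
- threat_behavior now uses _parse_debate_summary → takes the diagnosis field
- Consistent with the BUG-A/BUG-C fixes; no longer dumps the first 150 characters of the full debate_summary

ADR-082: Phase 2 multi-agent collaboration
2026-04-17 ogt + Claude Sonnet 4.6 (Asia-Pacific): post-code-review fixes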

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 15:23:35 +08:00
OG T
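e8bf37cfd9 docs(logbook): final confirmation that f5e33da2 opens the E2E chain across all nodes (2026-04-16)
- CI 894 complete, f5e33da2 deployed
- flywheel outcome field fix confirmed
- telegram _send_request fix confirmed (zero AttributeError)
- Sweeper: 20/20 incidents from the last 48h fully marked sweeper_done
- E2E chain: complete 7-node flow confirmed (36 incidents)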

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:18:45 +08:00
OG T
588ecfd940 docs(logbook): 2026-04-16 E2E all-node verification + production bug fix record
2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:46:39 +08:00
OG T
bb7441ec8a docs: update CLAUDE.md + HARD_RULES.md v2.0 + LOGBOOK (2026-04-16)
- HARD_RULES.md v2.0: adds Self-Loop Workflow, Circuit Breaker Exception, State & Flow Validation
- CLAUDE.md: supplements the §4 required-reading Memory table
- LOGBOOK: records AIOps E2E fix progress

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:20:16 +08:00
OG T
e465ee1936 docs(Phase 3): Evolver drill complete: exit condition #6 passed
- MASTER spec §3/§7/§8: Evolver drill checked off in three places
- LOGBOOK: drill results recorded + next step updated to 7-day production monitoring

Drill result: POST /api/v1/learning/evolver/run → HTTP 200 errors:[] 2026-04-15

ADR-083 Phase 3: 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:24:33 +08:00
OG T
5f86da52d9 docs(LOGBOOK): Phase 3 full landing record: 6 components + exit condition checklist
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:10:47 +08:00
OG T
4718c7667c feat(Phase 3): Evolver loop scheduling + manual trigger endpoint: merge drill gateway complete
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_evolver.py: add run_evolver_loop() (infinite loop, 24h interval)
- main.py: mount run_evolver_loop via asyncio.create_task
- api/v1/learning.py: POST /api/v1/learning/evolver/run (drill endpoint for Phase 3 exit #6)
- MASTER §8: back-fill the 66c4eda AgentSession entry + the full exit condition checklist for this Evolver work
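
The scheduling pattern above (an endless loop with a fixed interval, started as a background task) can be sketched as follows. This is an illustration under assumptions, not the code in playbook_evolver.py: `run_periodic` and its `max_runs` parameter are hypothetical names, and `max_runs` exists only so the loop can be exercised in a test.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_periodic(
    job: Callable[[], Awaitable[None]],
    interval_s: float,
    max_runs: Optional[int] = None,
) -> int:
    """Run `job` repeatedly, sleeping `interval_s` seconds after each run.

    With max_runs=None the loop never ends, which is the production shape.
    Returns the number of runs attempted.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await job()
        except Exception as exc:
            # One failed run must not kill the long-lived loop.
            print(f"periodic job failed: {exc!r}")
        runs += 1
        await asyncio.sleep(interval_s)
    return runs
```

At startup such a loop would typically be launched with `asyncio.create_task(run_periodic(job, 24 * 3600))`, keeping a reference to the returned task so it is not garbage-collected.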

ADR-083 Phase 3, 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:07:56 +08:00
OG T
fb1bbd0e20 feat(Phase 3): learning loop completion: root cause 3 + diagnosis feedback + knowledge decay + fine-tune pipeline
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- approval_execution.py: _run_post_execution_verify() now calls record_verification_result()
  Closes root cause 3: environment verification results (success/degraded/failed/timeout) are no longer orphaned
- learning_service.py: add record_verification_result(): verification result → Redis + Playbook EWMA
- learning_service.py: add record_diagnosis_outcome(): writes back the negative signal for misdiagnoses (L3×D4)
- jobs/knowledge_decay_job.py: new 30d knowledge decay job (unreferenced draft/review → archived)
- services/finetune_exporter.py: new weekly JSONL export (EvidenceSnapshot × AgentSession)
- main.py: mount knowledge_decay_loop (24h) + finetune_export_loop (7d)
- MASTER §8: record that all Phase 3 core rework items have landed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:57:43 +08:00
OG T
ee486fbd2b docs(logbook): 2026-04-15 late-night wrap-up: P0/P2 RCA + Phase 6 loop closed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:58:09 +08:00
OG T
05b774386b feat(Phase 6): AI SLO REST API: GET /api/v1/ai/slo, final piece
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
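The last piece of the ADR-087 Phase 6 self-governance loop:

1. api/v1/ai_slo.py: GET /api/v1/ai/slo
   - Cache-first at the service layer (TTL 5min, AiSloCalculator.get_cached_report)
   - force_refresh=true forces recomputation (AiSloCalculator.run)
   - No direct Redis access in the router layer (leWOOOgo modularization hard rule)

2. main.py: mount route ai_slo_v1.router (prefix=/api/v1)

3. MASTER §8 Living Changelog additions:
   - Full RCA record of the 3 root causes behind the P0 alert silence
   - Summary of the P2 flywheel broken-chain fix
   - Completion checklist for all Phase 6 components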
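
The cache-first behaviour in item 1 (serve the cached report for 5 minutes unless force_refresh is set) can be sketched as below. This is a minimal illustration under assumptions: `CachedSloReport` is a hypothetical in-memory stand-in, not AiSloCalculator, and the real service keeps its cache in Redis rather than in process memory.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_S = 300.0  # 5 minutes, as stated in the commit message


class CachedSloReport:
    """Hypothetical service-layer cache; the router would only ever call get()."""

    def __init__(
        self,
        compute: Callable[[], Dict],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute  # the expensive recomputation
        self._clock = clock      # injectable so tests can control time
        self._entry: Optional[Tuple[float, Dict]] = None

    def get(self, force_refresh: bool = False) -> Dict:
        now = self._clock()
        if not force_refresh and self._entry is not None:
            stored_at, report = self._entry
            if now - stored_at < CACHE_TTL_S:
                return report  # still fresh: serve from cache
        report = self._compute()
        self._entry = (now, report)
        return report
```

Putting the TTL logic behind one method is what lets the router stay free of any direct cache access: it passes `force_refresh` through and nothing else.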

Phase 6 exit conditions: 5/6 met (production verification pending image rollout)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:57:26 +08:00
OG T
14a02263ae feat(Phase 4): proactive inspection + trend prediction + 8D sensory upgrade, all complete
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m32s
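## Phase 4 full delivery (ADR-084)

### New services
- trend_predictor.py: numpy linear regression, 4h threshold-breach early warning, R² confidence score
- proactive_inspector.py: proactive inspection coordinator, runs every 5 minutes
  - DynamicBaselineService (3σ deviation)
  - LogAnomalyDetector (new Drain3 patterns)
  - TrendPredictor (slope extrapolation, 4h forecast)
  - Shadow Mode + 30-minute deduplication + Holt-Winters background retraining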
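
The trend prediction described above (fit a line, extrapolate 4 hours ahead, score confidence with R²) can be sketched as follows. This is an illustration under assumptions, not trend_predictor.py: it uses closed-form ordinary least squares in plain Python instead of numpy so that it has no dependencies, `predict_breach` is a hypothetical name, and it only considers upward breaches of an upper threshold.

```python
from typing import Sequence, Tuple

HORIZON_S = 4 * 3600  # 4-hour look-ahead, as stated in the commit message


def predict_breach(
    ts: Sequence[float], values: Sequence[float], threshold: float
) -> Tuple[bool, float]:
    """Fit y = slope * t + intercept and project HORIZON_S past the last sample.

    Returns (will_breach, r_squared); r_squared acts as the confidence score.
    """
    n = len(ts)
    if n < 2:
        return False, 0.0
    mean_t = sum(ts) / n
    mean_y = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    if s_tt == 0:
        return False, 0.0  # all samples share one timestamp: no slope
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, values))
    slope = s_ty / s_tt
    intercept = mean_y - slope * mean_t
    # R^2 = 1 - SS_res / SS_tot; a flat series has SS_tot == 0, scored as 0.
    ss_res = sum((y - (slope * t + intercept)) ** 2 for t, y in zip(ts, values))
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    projected = slope * (ts[-1] + HORIZON_S) + intercept
    return (slope > 0 and projected >= threshold), r2
```

For example, hourly samples of 50, 55, 60, 65 project to roughly 85 four hours after the last sample, so a threshold of 80 triggers a warning while a threshold of 90 does not.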

### 8D sensory upgrade (EvidenceSnapshot Phase 4 enhancement)
- PreDecisionInvestigator._collect_phase4_anomalies(): before a decision, reads the
  latest ProactiveInspector inspection cache + new LogAnomalyDetector patterns
- EvidenceSnapshot.anomaly_context: new field, Phase 4 dynamic anomaly context
- DiagnosticianAgent._build_prompt(): the prompt now includes anomaly_context,
  so LLM RCA can draw on dynamic baseline deviations and trend warnings

### Database migration
- incident_evidence: ADD COLUMN anomaly_context JSONB (idempotent)

### main.py
- Start the run_proactive_inspector_loop() asyncio task

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 4 fully complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:47:05 +08:00
OG T
7da64eaad2 feat(Phase 3): learning loop rebuild: three root-cause fixes + 2x EWMA + Evolver Agent
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 19m7s
Type Sync Check / check-type-sync (push) Failing after 1m18s
ADR-083 Phase 3 learning loop rebuild:

**Three root-cause fixes**
- approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2
  (success path L265 + failure path L353; a timeout records the learning_trigger_timeout metric and the main flow does not crash)
- models/approval.py: add matched_playbook_id field to ApprovalRequestBase
- decision_manager.py: _auto_execute populates matched_playbook_id when creating the ApprovalRequest
- learning_service.py: dual-path lookup for _matched_pb_id (matched_playbook_id + metadata fallback)
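
The first fix above replaces a fire-and-forget task with a bounded await. A minimal sketch of that pattern, assuming a plain dict stands in for the metrics backend; `trigger_learning_bounded` is a hypothetical name and not a function in approval_execution.py.

```python
import asyncio
from typing import Awaitable, Dict

LEARNING_TIMEOUT_S = 30.0  # timeout stated in the commit message


async def trigger_learning_bounded(
    coro: Awaitable[None],
    metrics: Dict[str, int],
    timeout_s: float = LEARNING_TIMEOUT_S,
) -> bool:
    """Await the learning trigger, but never longer than `timeout_s`.

    On timeout, bump the learning_trigger_timeout counter and return False
    so the caller's main flow continues instead of crashing.
    """
    try:
        await asyncio.wait_for(coro, timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        metrics["learning_trigger_timeout"] = metrics.get("learning_trigger_timeout", 0) + 1
        return False
```

Unlike a bare `create_task`, the awaited call cannot be silently dropped when the surrounding request finishes, and the timeout keeps a slow learning step from blocking the approval flow indefinitely.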

**2x EWMA negative reinforcement**
- models/playbook.py: add trust_score: float = 0.3 (EWMA dynamic trust field)
- repositories/playbook_repository.py: update_stats adds EWMA
  Success: trust = 0.9 × old + 0.1 × 1.0
  Failure: trust = 0.8 × old + 0.2 × 0.0 (2x decay rate)
  trust < 0.1 → log a warning and wait for Evolver to archive
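
The two update rules above are one EWMA with an asymmetric smoothing factor. A minimal sketch using the constants from this commit message; `update_trust` is a hypothetical name, not the update_stats method in playbook_repository.py.

```python
SUCCESS_ALPHA = 0.1      # weight of a success observation (target 1.0)
FAILURE_ALPHA = 0.2      # weight of a failure observation (target 0.0), twice as heavy
INITIAL_TRUST = 0.3      # default trust_score
ARCHIVE_THRESHOLD = 0.1  # below this, the playbook is flagged for archival


def update_trust(old: float, success: bool) -> float:
    """Return the new trust score after one execution outcome."""
    alpha = SUCCESS_ALPHA if success else FAILURE_ALPHA
    target = 1.0 if success else 0.0
    return (1.0 - alpha) * old + alpha * target


def failures_until_archive(trust: float = INITIAL_TRUST) -> int:
    """Count consecutive failures needed to push `trust` below the threshold."""
    count = 0
    while trust >= ARCHIVE_THRESHOLD:
        trust = update_trust(trust, success=False)
        count += 1
    return count
```

Starting from the default 0.3, one success raises trust to 0.37 and one failure lowers it to 0.24; five consecutive failures bring it to about 0.098, under the 0.1 threshold.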

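**Evolver Agent (new)**
- services/playbook_evolver.py: three functions, all static rules
  1. Low-trust archival: trust < 0.1 → DEPRECATED
  2. Dormancy archival: unused for 30d AND trust < 0.5 → DEPRECATED
  3. Similarity merge: symptom Jaccard > 0.9 → keep the higher-trust one, archive the lower-trust one
  AIOPS_P3_EVOLVER_ENABLED=False, off by default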
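
The three static rules above can be sketched as one pass over a list of playbooks. This is an illustration under assumptions, not playbook_evolver.py: the `Playbook` dataclass and its field names are hypothetical, "unused for 30d" is read as `days_unused >= 30`, and ties in trust are resolved by archiving the later playbook.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Playbook:
    # Hypothetical shape; field names are assumptions for illustration.
    id: str
    trust: float
    days_unused: int
    symptoms: FrozenSet[str]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_deprecated(playbooks: List[Playbook]) -> Set[str]:
    """Return the ids that the three rules would mark DEPRECATED."""
    deprecated: Set[str] = set()
    for pb in playbooks:
        if pb.trust < 0.1:                              # rule 1: low trust
            deprecated.add(pb.id)
        elif pb.days_unused >= 30 and pb.trust < 0.5:   # rule 2: dormant
            deprecated.add(pb.id)
    live = [pb for pb in playbooks if pb.id not in deprecated]
    for i, a in enumerate(live):
        for b in live[i + 1:]:
            if a.id in deprecated or b.id in deprecated:
                continue
            if jaccard(a.symptoms, b.symptoms) > 0.9:   # rule 3: near-duplicates
                loser = a if a.trust < b.trust else b
                deprecated.add(loser.id)
    return deprecated
```

Running rules 1 and 2 before the merge rule keeps an already-archived playbook from "winning" a similarity comparison against a healthy one.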

**Docs**
- ADR-083 learning loop rebuild
- MASTER §8 Phase 3 completion record

AIOPS_P3_ENABLED=False (default); the skeleton is in place, awaiting the commander's approval to enable

Co-Authored-By: Claude Sonnet 4.6 (Asia-Pacific) <noreply@anthropic.com>
2026-04-15 14:01:37 +08:00