Commit Graph

27 Commits

Author SHA1 Message Date
Your Name
04ff22563e fix(aiops-p1): Playbook 學習閉環 5斷點全修 + DB Migration(ADR-092 B4)
Some checks failed
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s
【P0.4 補丁】pre_decision_investigator Prometheus query 欄位缺失
- _build_tool_params() 補 "query" 欄位(prometheus_query tool 必要參數)
- 新增 _build_prometheus_query() — 依告警類型生成 PromQL(CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- 修復後 D3_METRICS 感官維度實際取得資料(原本 100% 回 missing_query_parameter)

【P1 Playbook 學習閉環 B1-B5 全修】
- B2 db/models.py: ApprovalRecord 新增 matched_playbook_id 欄位 + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent 新增 incident_id 欄位(MCP 稽核用)+ index
- B3 approval_db.py: record→ApprovalRequest 補回 incident_id + matched_playbook_id
- B4 approval_repository.py: 同 B3(兩個轉換函式必須同步)
- B5 approval_db.py: approval_request_to_record_data 補 matched_playbook_id → DB 才能存值

【P1.5 KM 寫入】approval_execution.py: fire-and-forget → await wait_for(30s)
- 根因:asyncio.create_task 在 Pod recycle 時被殺,KM 寫入靜默遺失
- 修復:await asyncio.wait_for(..., timeout=30.0) + TimeoutError log

【Migration 文件】adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- 執行:psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql

【附帶 Agent 改動】
- decision_manager: Phase 2 YAML NO_ACTION 優先門(主機層/外部服務跳過 Agent Debate)
- alert_rules.yaml: Sentry/ClickHouse + HostDiskUsageHigh/Critical 新規則
- solver_agent: action_title 語意合成兜底(取代靜默丟棄)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:41:35 +08:00
Your Name
45dbe07188 fix(flywheel): 自動化飛輪六大能力修復(ADR-092 B3)
Some checks failed
run-migration / migrate (push) Failing after 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s
Type Sync Check / check-type-sync (push) Successful in 2m54s
CD Pipeline / build-and-deploy (push) Has been cancelled
Ansible Lint / lint (push) Has been cancelled
【根因鏈修復】
MCP Provider bugs → PreDecisionInvestigator 失敗 → Agent Debate 無上下文
→ LLM 逾時 → description="待分析" → ADR-091 鐵閘攔截 → tg_sent 未設
→ W-2 Watchdog 誤報「靜默故障」

【六大修復】
1. MCP Provider 三蟲修復
   - ssh_provider: asyncssh.run() → conn.run()
   - prometheus_provider: KeyError 'query' → .get() 容錯
   - k8s_provider: 空 pod_name → 早返回錯誤字典

2. Agent Debate / 決策品質
   - decision_manager: 逾時降級文字改為明確描述(繞過 ADR-091 鐵閘)
   - intent_classifier: LLM 逾時降級至關鍵字分類(非 None)

3. Watchdog 誤報修復(ADR-092 B3)
   - W-2: tg_sent Redis TTL → telegram_message_id IS NULL(DB 真值)
   - W-5 新增: suggested_action IN 空/待分析/NO_ACTION + tg_id IS NULL
   - approval_timeout_resolver: 60min → 15min,batch 50 → 200

4. Config Drift 自動化
   - drift_adopt_service: auto_adopt_if_safe() 六條件安全閘
   - drift.py: 背景任務先嘗試自動採納再發人工 Telegram 卡片

5. Playbook 飛輪穩定
   - playbook_seed_service: 修復幂等性(deprecated 不視為缺失)
   - playbook_evolver: 只載 DRAFT+APPROVED(非全部 294 筆)

6. 可觀測性
   - alert_rule_engine: auto_rule 結構化日誌 + Redis 計數器(pipeline)
   - auto_approve: reject 原因 Redis 計數器
   - heartbeat_report_service: 新增「⚙️ 自動化統計(今日)」區塊

【待人工執行】
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:55:50 +08:00
Your Name
23fb5c4aaa feat(migration): adr091 rollback SQL
統帥全景檢查補:違反 feedback_dev_prod_separation — 直接對 awoooi_prod
套 adr091 migration 時應同時有回滾路徑。新增 DROP TABLE / DROP INDEX
腳本備用。資料不可復原,僅緊急用。

K8s Secret AIDER_WEBHOOK_SECRET 已加進 awoooi-prod.awoooi-secrets
(26 keys now, via kubectl patch)。

v1 repo ~/aider-watch README 標 DEPRECATED 並 tag v1-final。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 04:23:09 +08:00
Your Name
60b06ac54c feat(migration): adr091 aider_events table 2026-04-20 04:04:13 +08:00
OG T
98aef55b31 feat(kpi): ADR-090-D MASTER §7.1 北極星 KPI 5 斷鏈全修
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 11m49s
run-migration / migrate (push) Failing after 15s
2026-04-18 晚(台北時區)— ogt + Claude Opus 4.7 (1M)

MASTER §7.1 15 個北極星 KPI 實測對標發現 5 個斷鏈:
  #3  fine-tune JSONL /week        — finetune_exports 表不存在
  #4  MCP 呼叫/24h                 — timeline_events 沒 mcp_call event_type
  #6  Declarative 修復使用率       — remediation_events 表不存在
  #7  general 兜底 17.3%           — classify_alert_early 漏 5 類
  #10 notification_outcomes /week  — 表不存在

本 commit 全修。

## 1. Migration: adr090d_kpi_data_sources.sql (3 張表)

- finetune_exports       — P3 Fine-tune JSONL 追蹤
- remediation_events     — P5 Declarative 修復追蹤
- notification_outcomes  — 通知品質 + RLHF 語料

Idempotent (CREATE TABLE IF NOT EXISTS), 已 apply 進 prod。

## 2. classify_alert_early 擴 4 類規則 (降 general 兜底)

- test 攔截: Test*/FPTest/FingerprintTest/ADR089*Test/L4Closure*/*FreshUniq*
  → category='test', TYPE-1 純通知
- High*CPU/Memory/Disk/Load → host_resource
- TLS*/SSL*/*ProbeFailure* → ssl_cert
- PostgreSQL*/MySQL*/MongoDB*/*DiskGrowthRate → database

預期 general 17.3% → 3-5% (達標 <10%)。

## 3. finetune_exporter DB 寫入

_run_export() 結尾寫 finetune_exports 一筆,含 checksum/size/record_count。

## 4. declarative_remediation DB 寫入

evaluate() 後 fire-and-forget _log_remediation_event() 寫 remediation_events
(status='pending', remediation_type 依 tier 自動判為 declarative/imperative/gitops_pr)。

## 5. telegram_gateway DB 寫入 (send_approval_card)

_send_request 成功返回 message_id 後寫 notification_outcomes 一筆,
channel='telegram', delivery_status='delivered|failed'。未來人類按鈕時
update user_action → RLHF 訓料黃金。

## 6. pre_decision_investigator MCP 呼叫追蹤

_call_single_tool() finally 寫 timeline_events event_type='mcp_call',
含 provider/tool/status/duration_ms/error。24h 內 MCP 呼叫可 SQL 量測。

## 預期量化改善

| KPI | 修前 | 修後 24h 後應見 |
|-----|------|----------------|
| #3 fine-tune /week | 0 (表不存在) | >=10 (每週 cron 跑) |
| #4 MCP 呼叫/24h | 0 | >0 (實測將寫 timeline) |
| #6 declarative 使用率 | 表不存在 | 有資料 (pending/success/failed 分佈) |
| #7 general 兜底 | 17.3% | <10% |
| #10 notification_outcomes | 0 | 每次 approval card 寫一筆 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 00:00:31 +08:00
OG T
a156566b17 feat(drift-narrator): ADR-090-C L4 稽核閉環 — notification_formatted op 入庫
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 10m47s
run-migration / migrate (push) Failing after 14s
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)

架構鐵律執行:
「沒有被記錄的 AI 決策,就等於沒有發生過。」
drift_narrator 每次呼叫 LLM 生成摘要,必須完整寫入
automation_operation_log + ai_collaboration_trace,形成 L4 稽核 + RLHF 語料。

本 commit 兩件事:

1. apps/api/migrations/adr090c_notification_formatted_op_type.sql
   - 擴充 automation_operation_log.operation_type CHECK 加 'notification_formatted'
   - DROP + ADD CONSTRAINT idempotent 模式
   - 已用 awoooi(表 owner)apply 進 prod 驗證通過

2. apps/api/src/services/drift_narrator_service.py
   - 新增 _log_ai_action_to_db() 負責 DB 稽核寫入
   - 在 _generate_narrative_and_items() 結尾(success / fallback 都寫)呼叫
   - automation_operation_log:
     * operation_type='notification_formatted'
     * actor='drift_narrator'
     * input = {report_id, namespace, counts, items_scanned}
     * output = {narrative, items, items_count}
     * duration_ms, tags=['drift','type4d','llm_summary']
     * parent_op_id 查詢 alert_fired 鏈路(未來 drift → alert 關聯)
   - ai_collaboration_trace:
     * agent='drift_narrator', model=provider (ollama / nemotron / 等)
     * prompt(限 8000 字)+ response(JSONB)
     * accepted = LLM JSON 解析成功 flag(未來 RLHF 訓料金礦)
   - 錯誤處理: DB 寫入 try/except 包住,永不破壞 Telegram 通知主流程

P2.4 事件關聯:
  - SELECT parent op via input->>'report_id' 或 'drift_report_id'
  - 若找到則綁定 parent_op_id(形成 alert_fired → notification_formatted 追溯鏈)
  - 目前 drift 本身不經 alert_fired,parent 為 NULL(等未來鏈路接通)

P2.5 RLHF 語料:
  - ai_collaboration_trace.accepted=true 的紀錄即為「LLM 解析成功」樣本
  - 未來統帥按 Telegram [ 採納變更] / [ 回滾] 時,對應 trace 也可更新
    outcome flag,形成完整 Human-in-the-loop 語料

技術細節:
  - get_db_context() auto-commit(src/db/base.py:128),無需手動 commit
  - prompt 最長 8000 字(一般 drift 約 2-3k)
  - raw_response 保留前 500 字在 trace.response JSON 中

相關:
  - feedback_ai_autonomous_direction.md L4 北極星
  - feedback_secrets_leak_incidents_2026-04-18.md L1-L4 分層
  - ADR-090 11 張神經網路表
  - commit fb88512(B 方案視覺層)

IDE 可能顯示 src.db.base 找不到 —— 那是誤報(drift_repository.py 用同一條路徑)。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:04:23 +08:00
OG T
2d43751729 feat(ops): ADR-090-B 零信任收尾範本 — wrapper / sudoers / migrator / CI
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 12m17s
run-migration / migrate (push) Failing after 14s
2026-04-18 台北時區 —— ogt + Claude Opus 4.7 (1M)

本 commit 響應本 Session 兩次憑證外洩事故
(feedback_secrets_leak_incidents_2026-04-18.md),
交付統帥可直接部署的零信任基礎設施範本.

檔案清單:

1. scripts/host-ops/awoooi-hosts-add.sh
   - 110 主機 /etc/hosts 白名單 wrapper
   - 只允許預定義主機名,idempotent,帶 IP 格式驗證
   - 安裝: /usr/local/bin/awoooi-hosts-add (root:root 0755)

2. scripts/host-ops/awoooi-wrapper.sudoers
   - 配套 sudoers 規則 (NOPASSWD for wrapper + SIGHUP only)
   - 安裝: /etc/sudoers.d/awoooi-wrapper (root:root 0440)
   - 禁 tee / bash / sh 這類 generic shell access

3. apps/api/migrations/adr090b_awoooi_migrator_role.sql
   - PG 限權角色 awoooi_migrator
   - 只能 DDL (CREATE/ALTER/DROP/INDEX/COMMENT)
   - 明確 REVOKE 所有 DML + default privileges 鎖死
   - 本檔由統帥執行 (需 superuser),不由 Claude 執行

4. k8s/awoooi-prod/awoooi-migrator-secret.template.yaml
   - K8s Secret patch 範本
   - 新增 MIGRATION_DATABASE_URL key (awoooi_migrator 連線串)
   - 與應用 DATABASE_URL 拆開

5. .gitea/workflows/run-migration.yml
   - CI 自動套用新 migration (單 transaction + ON_ERROR_STOP)
   - 用 Gitea secret MIGRATION_DATABASE_URL,不走明碼
   - 每次成功寫一筆 asset_discovery_run (audit trail)

零信任三層防線 (對應 feedback_secrets_leak_incidents):
  L1 對話無密碼 -> wrapper 內建白名單
  L2 操作經 wrapper -> sudoers + awoooi_migrator
  L3 顯示強制遮蔽 -> CI 走 secret,不走 env

本 Session 發現的 3 次憑證外洩全部在 feedback_secrets_leak
memory 登記,並有對應 P0 輪替計畫.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:39 +08:00
OG T
5ae82d1d1f feat(db): ADR-090 L4 AIOps 地基 — 資產盤點 × 7 項自動化覆蓋矩陣永久化 DB
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)

MoWoooWorkDown 假警報 RCA 暴露三重結構性失守:
- 110/188 主機 load 18/16 × 13 天 / cadvisor 288% / K3s 120/121 無監控
- Prometheus 僅 35 targets / 58 rules(覆蓋不到三成)
- HostHighCpuLoad 量錯維度(CPU idle vs load_avg)

統帥戰略指令:
- 全景資產 × 七大自動化 × 永久化 DB
- AI 四分工(OpenClaw × NemoTron × Hermes × Claude LLM)
- 所有自動化操作歷程必進 DB,不靠 MD(MD 會漂移)

本 commit 交付:

1. SQL migration (apps/api/migrations/adr090_asset_inventory_foundation.sql)
   - 11 張表 + 33 indexes + 20 CHECK + 3 UNIQUE + 16 FK
   - pgcrypto extension dependency
   - 完整 idempotent(CREATE IF NOT EXISTS + single transaction)
   - 已 apply 進 awoooi_prod(188 PG),驗證通過

2. ADR-090 (docs/adr/ADR-090-monitoring-blindspot-governance.md)
   - 決策紀錄 + 7 引擎對映 + 4 替代方案否決

3. 主戰略文件 (docs/superpowers/specs/2026-04-18-blindspot-governance-capacity-l4.md)
   - §0-§14: 背景 / 根因 / Schema DDL / 4 層防禦 / 7 Phase 實施 /
     HARD_RULES / AI 分工矩陣 / 驗收指標 / 技術債 / 回滾 / 接手協議

4. MASTER §8 Living Changelog 追加 Phase 7 啟動條目

11 張表:
  asset_inventory / asset_discovery_run / asset_coverage_snapshot /
  asset_relationship / alert_rule_catalog / asset_change_event /
  asset_compliance_snapshot / host_capacity_snapshot /
  capacity_violation_event / automation_operation_log /
  ai_collaboration_trace

首筆 bootstrap 記錄已 seed 進 asset_discovery_run
(run_id=6760c5bf-57e5-4a40-b82d-31b794464652)

相關 Memory (未 commit,存於 ~/.claude/...):
  - project_blindspot_governance.md (跨 session 指針)
  - feedback_monitor_self_monitoring.md (監控工具必須被監控)
  - feedback_secrets_leak_incidents_2026-04-18.md (憑證外洩三防線)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:46 +08:00
OG T
9d6aa7ea45 feat(trust): ADR-088 Trust Score 持久化 — L4 自動放行核心
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m40s
TrustScoreManager 從記憶體升級為 PostgreSQL 持久化,
Pod 重啟後信任分數不再歸零,AI 能真正累積到 L4 自動放行門檻。

變更:
- migrations/adr088_trust_score_persistence.sql: trust_records 表
- db/models.py: TrustRecordDB ORM model
- repositories/interfaces.py: ITrustRepository Protocol
- repositories/trust_repository.py: PG upsert ON CONFLICT DO UPDATE
- services/trust_engine.py: bulk_load() 啟動 warm-up
- services/learning_service.py: _persist_trust() + 2 call sites
- main.py: 啟動時 load_all() → bulk_load()

流程: 批准 5 次 → score=5 寫入 DB → Pod 重啟 → warm-up 讀回
      → evaluate_adjusted_risk MEDIUM→LOW → 自動執行

2026-04-17 ogt + Claude Sonnet 4.6(亞太): ADR-088

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:14:44 +08:00
OG T
c05bac6112 fix(playbook): seed tuple unpack + text[] → jsonb migration
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_seed_service.py: list_playbooks 回傳 tuple[list, int],
  缺少解包導致 'list' has no attribute 'source'
- fix_playbooks_array_to_jsonb.sql: source_incident_ids/tags text[] → jsonb
  (已手動套用 prod DB)

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:03:59 +08:00
OG T
da871fc149 chore(db): 補齊 AIOps P1/P2/P6 migration SQL(已套用到 prod)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
incident_evidence / agent_sessions / ai_governance_events 三表
IF NOT EXISTS,production DB 已手動確認存在並 apply。

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:02:17 +08:00
OG T
325b3851b5 feat(adr-071): 告警通知四類型第一批 B/C/E/F/G/H 全實作
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 1m7s
ADR-071-B: classify_notification() — 五型分類器 (TYPE-1/2/3/4/4D)
ADR-071-C: send_info_notification() — TYPE-1 純資訊無按鈕卡片
ADR-071-E: _build_inline_keyboard() — 依 alert_category 動態組合 TYPE-3 按鈕
ADR-071-F: send_drift_card() — TYPE-4D Config Drift 卡片 + Diff 截斷
ADR-071-G: km_conversion_service.py — Incident RESOLVED 自動轉 KM
ADR-071-H: handle_manual_fix_done() — TYPE-4 手動修復 Bot 對話閉環

前批已完成: ADR-071-A (DB Migration) + ADR-071-D (狀態機守衛)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 02:24:20 +08:00
OG T
c6edfb5614 fix(flywheel): 四階段系統性修復 AUTO_REPAIR NO_MATCH 斷層
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Phase 1 — affected_services 污染根治
  - webhooks.py: _extract_affected_services() 從 labels 精準萃取服務名
    (component > job > pod deployment name > clean target_resource > [])
  - create_incident_for_approval: alert_labels 完整保留進 Signal
  - alert_name 從 alertname 取,不再用 "custom"

Phase 2 — Playbook alertname 變體擴充
  - alert_rules.yaml: 5 條規則新增 HostHighCpuLoad、KubePodCrashLooping 等變體
  - scripts/update_playbook_alert_variants.py: Redis index 已執行更新 

Phase 3 — Jaccard 通用型 Playbook 豁免
  - similarity.py: affected_services=[] → 1.0 豁免(基礎設施 Playbook 不針對特定服務)
  - severity_range=[] → 1.0 豁免(適用所有嚴重度)

Phase 4 — Playbook Embedding 持久化(冷啟動修復)
  - migrations/flywheel_playbook_embeddings.sql: pgvector 持久化表
  - services/playbook_embedding_service.py: 啟動時重建 Redis 向量快取 + 同步 DB
  - main.py: lifespan 啟動時 asyncio.create_task 非阻塞執行

2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:04:56 +08:00
OG T
af7b1591c1 feat(rag): phase35 ivfflat 向量索引 — 5814 chunks 已建立
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
已在 prod 執行: idx_rag_chunks_embedding (lists=100, cosine)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 10:33:32 +08:00
OG T
63e840ae42 feat(ollama): Phase 31-34 ADR-067 — Log摘要/PR審查/RAG知識庫/圖片分析
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Phase 31: log_summary_service.py — deepseek-r1:14b K8s Pod日誌異常摘要
  - 觸發: signoz_webhook 告警時背景呼叫
  - Redis快取 log_summary:{pod}:{date} TTL 24h
  - 敏感資料regex遮蔽

Phase 32: local_code_review_service.py — qwen2.5-coder:7b PR自動審查
  - Fallback: Gemini (diff > 50KB 或 Ollama超時)
  - semaphore 最多2個同時審查
  - 雙寫: Redis TTL 7d + pr_reviews表 (phase29 migration)

Phase 33: knowledge_rag_service.py — nomic-embed-text 768維 pgvector RAG
  - 向量化(188) + 生成(111) 雙Ollama
  - rag_chunks表 (phase28 migration)
  - 初期線性搜尋,>100筆啟用ivfflat索引

Phase 34: image_analysis_service.py — llava:latest Telegram圖片分析
  - download_and_analyze: Bot API getFile → 下載 → llava → 回應
  - Rate limit: 每chat_id每分鐘3次 (Redis sliding window)
  - telegram.py webhook新增photo分支

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 01:50:22 +08:00
OG T
89015d4527 feat(phase30): Drift 報告 AI 人話摘要 (ADR-067)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 DriftNarratorService — qwen2.5:7b-instruct (Ollama 111)
  - 觸發條件: high >= 1 or medium >= 3(HPA replicas 白名單)
  - Redis 快取: drift_narrative:{report_id} TTL 1h
  - LLM 失敗時 graceful fallback 結構化文字
- drift.py _analyze_and_notify: 接入 narrator(Phase 30 標記)
- Migration: drift_reports.narrative_text TEXT (已在 prod 執行)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 01:37:43 +08:00
OG T
88ac1c7f50 feat(phase27): 歷史按鈕雙層頻率統計 + DB frequency_snapshot 持久化
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m44s
- telegram_gateway: _send_incident_history 改為 Phase 27 雙層策略
  Layer 1: DB frequency_snapshot (建立時刻永久快照)
  Layer 2: Redis AnomalyCounter disposition 累積統計 (35d TTL)
  修復舊版呼叫 record_anomaly() 導致誤計數的 bug
- 新增 migration: phase27_incident_frequency_snapshot.sql (已在 prod 執行)
- CLAUDE.md: 精簡至 123 行,減少 Token 消耗

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 01:06:51 +08:00
OG T
896bef94ee fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:38:08 +08:00
OG T
88696dba9b feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
  - k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader

Layer 1 - DB Migration (已在 188 執行):
  - M-002: approval_records 新增 approval_level/votes/required_votes
  - M-003: alert_event_type ENUM 新增 8 個值

Layer 2 - IaC:
  - ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)

Layer 3 - Python Services:
  - service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
  - velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
  - preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)

Layer 1-M001 - Playbook model:
  - playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup

Layer 4 - 業務邏輯:
  - alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
  - auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
  - webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
  - db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
  - docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)

Layer 5 - Telegram 通知:
  - telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)

參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:24:09 +08:00
OG T
f20121ad41 feat(audit): Phase 11 告警操作完整溯源 — alert_operation_log + 歷史回填
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m29s
統帥指令「所有告警訊息通通寫入資料庫,並記錄相關操作」

變更:
- phase11_alert_operation_log.sql: 新表 (Event Sourcing,不可變)
- phase11b_backfill_alert_operation_log.sql: 歷史回填 654 筆
  - 14 筆 ALERT_RECEIVED (incidents)
  - 265 筆 TELEGRAM_SENT (approval_records)
  - 265 筆 USER_ACTION (approval_records)
  - 110 筆 EXECUTION_COMPLETED (audit_logs)
- db/models.py: AlertOperationLog SQLAlchemy model
- repositories/alert_operation_log_repository.py: append/list_by_incident/get_stats
- webhooks.py: _try_auto_repair_background 寫入 AUTO_REPAIR_TRIGGERED + EXECUTION_COMPLETED + TELEGRAM_RESULT_SENT
- webhooks.py: _push_to_telegram_background 寫入 TELEGRAM_SENT
- telegram.py: handle_callback 寫入 USER_ACTION (approve/reject)

已執行 migration: awoooi_prod@192.168.0.188 

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:22:03 +08:00
OG T
eee6f06215 feat(auto-repair): 所有操作強制寫入 DB — auto_repair_executions 表
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m32s
統帥指令: 所有自動修復操作(成功/失敗)必須持久化

變更:
- migrations/phase10_auto_repair_executions.sql: 新增表 + 4 個索引
- db/models.py: 新增 AutoRepairExecution SQLAlchemy model
- repositories/audit_log_repository.py: 新增 AutoRepairExecutionRepository (create/list_by_incident/get_stats)
- auto_repair_service.py: execute_auto_repair 成功/失敗分支都寫入 DB
  - 新增 similarity_score 參數傳遞
  - AutoRepairDecision 新增 similarity_score 欄位
- webhooks.py: 傳入 similarity_score 到 execute_auto_repair

已執行 migration: awoooi_prod@192.168.0.188:5432 

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:16:37 +08:00
OG T
658337ec18 fix(phase26): 打通 Incident→DB→KM 完整鏈路 + namespace 修正
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
問題根因:
1. create_incident_for_approval 只存 Redis,不存 PostgreSQL
   → TTL 7天後消失,Playbook 萃取永遠找不到 Incident
2. ApprovalRecord 無 incident_id 欄位
   → _trigger_playbook_extraction 靠 regex 掃中文文字找 INC-,永遠失敗
3. operation_parser namespace fallback 是 "default"
   → 所有 deployment 在 awoooi-prod,203 次執行全失敗

修復:
- Incident 同時寫入 Redis + PostgreSQL (save_to_episodic_memory)
- ApprovalRecord 加入 incident_id 欄位 (model + ORM + migration)
- alertmanager_webhook 建立 Approval 後回寫 incident_id
- _trigger_playbook_extraction 直接用 approval.incident_id
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:46:05 +08:00
OG T
3455044457 feat(phase25): Nemotron 主動防禦三方向 P0+P1+P2 完整實作
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE 意圖 override 改為 NEMOTRON (原 OLLAMA)
- DIAGNOSE fallback 使用 local-only 鏈,不觸碰雲端
- 全部失敗時 REJECT + Telegram 通知
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: 根據 context[task_type] 選擇 timeout

P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN 直接發布,無需審核)
- models/playbook.py: SymptomPattern.compute_hash() (16字元確定性 hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
  - generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram 審核 card
  - generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram 通知
  - 使用 nvidia.chat() (正確介面),Nemotron 超時時 Minimal fallback
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() 支援 symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate 在 decide() + runbook hook 在 execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint

P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: 白名單過濾 + DriftLevel 分級
- services/drift_interpreter.py: NemotronDriftInterpreter(意圖分析,不生成修復指令)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports 表
- k8s/drift-cronjob.yaml: 每小時自動掃描 CronJob

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:35:05 +08:00
OG T
df3ef9006c fix(auto-repair): 首席架構師 Review — 4 Critical/Important 修復
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m2s
Critical #1: KM write task 移出 try/except
- _trigger_learning 的 KM 寫入原在 try 內,learning 失敗時不寫 KM
- 移至 except 後確保成功/失敗都寫入
- 移除冗餘 import asyncio(已在頂層 import)
- Minor: approval.incident_id or None 防空字串

Important #2: migration 加 PRIMARY KEY
- playbook_id 從 UNIQUE 升為 PRIMARY KEY
- prod DB 已執行 ALTER TABLE ADD PRIMARY KEY

Important #3: s.sequence→s.step_number, s.description→s.command
- embed_playbook() 使用不存在的欄位名,RAG 向量索引靜默失敗
- RepairStep 正確欄位: step_number, command

Important #1: PlaybookService._get_rag_service 不再 Service 層快取
- 改為每次呼叫工廠 get_playbook_rag_service()
- 避免舊實例繞過工廠的 is_closed 重建邏輯

冷啟動修復 (首席架構師建議B+C):
- _trigger_playbook_extraction 執行成功後自動設定
  execution_success=True, effectiveness_score=4, status=RESOLVED
- skip 路徑 logger.debug → logger.info 提升可觀測性

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:02:03 +08:00
OG T
72d7536ead feat(auto-repair): 完整自動修復閉環 + KM 沉澱串接
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. DB Migration: playbooks 資料表 (phase7_playbooks_table.sql)
   - 這是自動修復無法啟動的根本原因 — table 從未建立
   - 5 個索引: status/tags/alert_names/source_incidents/created_at
   - 已在 prod DB 執行

2. playbook_service: 萃取後自動沉澱 KM
   - extract_from_incident() 完成後 fire-and-forget _write_to_km()
   - 內容含症狀模式、修復步驟、信心度、來源 Incident

3. approval_execution: 執行結果沉澱 KM
   - _trigger_learning() 後 fire-and-forget _write_execution_result_to_km()
   - 成功/失敗記錄都寫入,category=execution_result

完整閉環:
告警 → AI分析 → 查Playbook → 決策 → 執行 → 結果寫KM
                                              ↓
                              Incident解決 → KM(knowledge_extractor)
                                          → Playbook萃取 → KM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:54:15 +08:00
OG T
a1f7d1f495 fix(db): 固化 risklevel ADD VALUE 'high' 為正式 migration
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m58s
E2E Health Check / e2e-health (push) Successful in 18s
Phase 23 緊急修復已在 prod/dev 手動執行,此檔作為正式記錄
使用 DO 塊防止重複執行錯誤

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:36:15 +08:00
OG T
30153496d1 fix(api): 修復全部 lint 錯誤 (ruff --fix)
- Import sorting (I001)
- Unused imports (F401)
- f-string without placeholders (F541)
- Loop variable unused (B007)
- zip() strict parameter (B905)
- Exception chaining (B904)
- collections.abc imports (UP035)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 16:06:20 +08:00