feat(adr-080): establish Phase 0 guardrails and kick off the AI autonomy flywheel

- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
  (1456 lines, §0-§8 fully filled in: 42-cell tactics matrix, 7-Phase plan,
   7 ADR summaries, 15 KPIs, 21 feature flags, 10 risk scenarios)

- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
  (7-Phase structure + 4 north stars + 7 architect review gates + Phase exit conditions)

- apps/api/src/core/feature_flags.py
  (AIOpsFeatureFlags: P1~P6 master switches all False + 15 fine-grained sub-flags,
   is_phase_enabled() / is_sub_flag_enabled() + safe bool cast)

- apps/api/src/jobs/__init__.py + baseline_snapshot.py
  (Phase 0 baseline snapshot job: MCP calls / Playbook confidence / general ratio
   / learning loop rate / auto_repair, written to aiops:baseline:latest)

- apps/api/tests/test_feature_flags.py  (21 tests, all green)

- docs/HARD_RULES.md → v1.9
  (adds the Phase exit-condition hard rule: a Phase must not be declared
   complete before its exit conditions pass)

- CLAUDE.md anti-amnesia gate 1: mandatory read of MASTER §0 Session Resume Protocol

Gate 0 Pass: 21/21 tests green

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-15 12:44:53 +08:00
parent 6c7f648b60
commit db9e304a14
10 changed files with 2387 additions and 7 deletions

View File

@@ -7,10 +7,13 @@
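## ⚠️ First step of every session
**Before doing anything else, read:**
-1. `MEMORY.md`: memory index
-2. `docs/LOGBOOK.md`: latest progress
-3. `docs/HARD_RULES.md`: absolute prohibitions
-4. The `feedback_*.md` files for the topics involved
+1. 🔴🔴🔴 **`docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`**: AI autonomy flywheel MASTER blueprint (in progress)
+2. `MEMORY.md`: memory index
+3. `docs/LOGBOOK.md`: latest progress
+4. `docs/HARD_RULES.md`: absolute prohibitions
+5. The `feedback_*.md` files for the topics involved
+🔴🔴🔴 **AI autonomy engineering in progress**: any change related to alerting / remediation / rules / classification / notification must first read MASTER §0 Session Resume Protocol. Bypassing it is forbidden.
🔴🔴 **Check the last-updated date of `project_current_status.md`**: if it is more than 2 days old, run the Memory cleanup before starting work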

View File

@@ -0,0 +1,240 @@
"""
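AWOOOI AIOps Feature Flags
==========================
Feature switches for Phases 0-6 of the AI autonomy flywheel.

ADR-080: AI autonomy flywheel overview
MASTER: docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md

Safety rules:
- Every boolean flag defaults to False: a Phase only takes effect once it is explicitly enabled.
- When a Phase master switch is False, all of that Phase's sub-flags are treated as False.
- After a self-demotion (D6) the system must not re-promote itself automatically;
  promotion requires a human to set the env var.

Rollback:
    kubectl set env deployment/awoooi-api AIOPS_P1_ENABLED=false
    # or edit .env and redeploy

2026-04-15 ogt: Phase 0, initial creation; to be enabled after ADR-080 is approved
"""
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class AIOpsFeatureFlags(BaseSettings):
    """
    Feature flag set for the AI autonomy flywheel.

    One master switch per Phase plus fine-grained sub-flags.
    Resolution order: environment variables > .env file > defaults (all False).
    """

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=True,
        extra="ignore",
    )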

    # ==========================================================================
    # Phase master switches (set to True only once the Phase N exit conditions are met)
    # ==========================================================================
    AIOPS_P1_ENABLED: bool = Field(
        default=False,
        description="Phase 1 sensory depth: PreDecisionInvestigator + EvidenceSnapshot + PostExecutionVerifier",
    )
    AIOPS_P2_ENABLED: bool = Field(
        default=False,
        description="Phase 2 multi-agent collaboration: all 5 roles online (Diagnostician/Solver/Reviewer/Critic/Coordinator)",
    )
    AIOPS_P3_ENABLED: bool = Field(
        default=False,
        description="Phase 3 learning loop rebuild: 3 root-cause fixes + EWMA + Evolver + fine-tune pipeline",
    )
    AIOPS_P4_ENABLED: bool = Field(
        default=False,
        description="Phase 4 dynamic anomaly detection: Holt-Winters + Drain3 + Prophet + proactive inspection",
    )
    AIOPS_P5_ENABLED: bool = Field(
        default=False,
        description="Phase 5 remediation abstraction: Declarative + four-tier Blast Radius control + GitOps PR",
    )
    AIOPS_P6_ENABLED: bool = Field(
        default=False,
        description="Phase 6 self-governance loop: SLO + Trust Drift + KB Rot + offline replay + self-demotion",
    )

    # ==========================================================================
    # Phase 1 fine-grained sub-flags
    # ==========================================================================
    AIOPS_P1_PRE_DECISION_INVESTIGATOR: bool = Field(
        default=False,
        description="P1: whether PreDecisionInvestigator gathers MCP evidence before a decision (can be disabled independently)",
    )
    AIOPS_P1_POST_EXECUTION_VERIFIER: bool = Field(
        default=False,
        description="P1: whether PostExecutionVerifier verifies state after every execution",
    )

    # ==========================================================================
    # Phase 2 fine-grained sub-flags
    # ==========================================================================
    AIOPS_P2_CRITIC_ENABLED: bool = Field(
        default=False,
        description="P2: whether the Critic Agent's adversarial challenge is enabled (disabling lowers latency but removes the challenge step)",
    )
    AIOPS_P2_AGENT_TIMEOUT_SEC: int = Field(
        default=5,
        description="P2: per-agent circuit-breaker threshold in seconds; on timeout the Coordinator falls back to degraded handling",
    )

    # ==========================================================================
    # Phase 3 fine-grained sub-flags
    # ==========================================================================
    AIOPS_P3_FINETUNE_EXPORT: bool = Field(
        default=False,
        description="P3: whether the weekly fine-tune JSONL export to MinIO runs",
    )
    AIOPS_P3_EVOLVER_ENABLED: bool = Field(
        default=False,
        description="P3: whether the Evolver Agent automatically merges and archives Playbooks",
    )
    AIOPS_P3_KNOWLEDGE_DECAY: bool = Field(
        default=False,
        description="P3: whether the 30-day knowledge decay job runs (marks entries decayed and moves them to the cold index)",
    )

    # ==========================================================================
    # Phase 4 fine-grained sub-flags
    # ==========================================================================
    AIOPS_P4_DYNAMIC_BASELINE: bool = Field(
        default=False,
        description="P4: whether the Holt-Winters dynamic baseline service is enabled",
    )
    AIOPS_P4_LOG_ANOMALY: bool = Field(
        default=False,
        description="P4: whether Drain3 log anomaly detection is enabled",
    )
    AIOPS_P4_TREND_PREDICTOR: bool = Field(
        default=False,
        description="P4: whether Prophet trend prediction is enabled (predicts the risk of crossing a threshold within 4h)",
    )
    AIOPS_P4_PROACTIVE_INSPECTOR: bool = Field(
        default=False,
        description="P4: whether the proactive inspection runs every 5 min",
    )

    # ==========================================================================
    # Phase 5 fine-grained sub-flags
    # ==========================================================================
    AIOPS_P5_BLAST_RADIUS_CHECK: bool = Field(
        default=False,
        description="P5: whether Blast Radius assessment runs (False = everything is treated as low risk and auto-executed; dangerous)",
    )
    AIOPS_P5_GITOPS_PR: bool = Field(
        default=False,
        description="P5: whether high-risk remediation (Blast Radius > 50) goes through the GitOps Gitea PR flow",
    )
    AIOPS_P5_DRY_RUN_ENFORCED: bool = Field(
        default=False,
        description="P5: whether a dry-run is enforced before Declarative apply (False = skip dry-run; dangerous)",
    )

    # ==========================================================================
    # Phase 6 fine-grained sub-flags
    # ==========================================================================
    AIOPS_P6_SELF_DEMOTION: bool = Field(
        default=False,
        description="P6: whether self-demotion logic is enabled (SLO violation → confidence threshold raised automatically)",
    )
    AIOPS_P6_OFFLINE_REPLAY: bool = Field(
        default=False,
        description="P6: whether the weekly offline replay of 100 cases runs",
    )
    AIOPS_P6_KB_ROT_CLEANER: bool = Field(
        default=False,
        description="P6: whether the monthly KB rot cleanup job runs",
    )
    AIOPS_P6_TRUST_DRIFT_DETECTOR: bool = Field(
        default=False,
        description="P6: whether Playbook trust distribution drift detection is enabled",
    )

    def is_phase_enabled(self, phase: int) -> bool:
        """
        Check whether the master switch of the given Phase is enabled.

        Args:
            phase: Phase number (1-6)

        Returns:
            bool: whether the Phase is enabled (False for unknown Phase numbers)

        Usage:
            if flags.is_phase_enabled(1):
                await pre_decision_investigator.investigate(...)
        """
        phase_flags = {
            1: self.AIOPS_P1_ENABLED,
            2: self.AIOPS_P2_ENABLED,
            3: self.AIOPS_P3_ENABLED,
            4: self.AIOPS_P4_ENABLED,
            5: self.AIOPS_P5_ENABLED,
            6: self.AIOPS_P6_ENABLED,
        }
        return phase_flags.get(phase, False)

    def is_sub_flag_enabled(self, flag_name: str) -> bool:
        """
        Check a fine-grained sub-flag (the parent Phase switch is checked automatically).

        Args:
            flag_name: sub-flag name, e.g. "AIOPS_P1_PRE_DECISION_INVESTIGATOR"

        Returns:
            bool: True only when both the sub-flag and its parent Phase switch are True

        Usage:
            if flags.is_sub_flag_enabled("AIOPS_P1_PRE_DECISION_INVESTIGATOR"):
                ...
        """
        # Parse the Phase number out of the flag name (AIOPS_P<N>_...)
        parts = flag_name.split("_")
        if len(parts) < 3 or not parts[1].startswith("P"):
            return False
        try:
            phase = int(parts[1][1:])
        except ValueError:
            return False

        # The parent Phase must be enabled
        if not self.is_phase_enabled(phase):
            return False

        # Only a real boolean True counts as enabled, so non-bool settings such as
        # AIOPS_P2_AGENT_TIMEOUT_SEC (an int) never read as an enabled flag.
        return getattr(self, flag_name, False) is True


# Singleton, same pattern as `settings` in core/config.py
# Usage: from src.core.feature_flags import aiops_flags
aiops_flags = AIOpsFeatureFlags()


def get_aiops_flags() -> AIOpsFeatureFlags:
    """
    For FastAPI dependency injection.

    Usage:
        @router.get("/status")
        async def status(flags: AIOpsFeatureFlags = Depends(get_aiops_flags)):
            return {"p1": flags.AIOPS_P1_ENABLED}
    """
    return aiops_flags
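
The parent-gating rule in this file (a sub-flag only counts when its Phase master switch is also on) can be illustrated without pydantic. The following is a minimal, self-contained sketch under stated assumptions, not the project's implementation: `sub_flag_enabled` is a hypothetical helper, and the flags are read from a plain dict instead of the environment.

```python
import re

# Assumed naming convention, taken from the flag names above: AIOPS_P<N>_<NAME>.
_PHASE_RE = re.compile(r"^AIOPS_P(\d+)_")


def sub_flag_enabled(flags: dict[str, object], flag_name: str) -> bool:
    """Return True only if both the sub-flag and its Phase master switch are True."""
    match = _PHASE_RE.match(flag_name)
    if match is None:
        return False  # name does not follow the AIOPS_P<N>_... convention
    master = f"AIOPS_P{match.group(1)}_ENABLED"
    if flags.get(master) is not True:
        return False  # parent Phase is off, so every sub-flag reads as off
    return flags.get(flag_name) is True  # non-bool values never count as enabled


# Hypothetical flag values, for illustration only.
flags = {
    "AIOPS_P1_ENABLED": False,
    "AIOPS_P1_PRE_DECISION_INVESTIGATOR": True,
    "AIOPS_P2_ENABLED": True,
    "AIOPS_P2_CRITIC_ENABLED": True,
    "AIOPS_P2_AGENT_TIMEOUT_SEC": 5,
}
print(sub_flag_enabled(flags, "AIOPS_P1_PRE_DECISION_INVESTIGATOR"))  # parent is off
print(sub_flag_enabled(flags, "AIOPS_P2_CRITIC_ENABLED"))             # parent and sub on
print(sub_flag_enabled(flags, "AIOPS_P2_AGENT_TIMEOUT_SEC"))          # int, not a flag
```

One design choice in this sketch is that an integer setting such as the agent timeout is never reported as an enabled flag, which keeps callers from mistaking a tuning value for a switch.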

View File

@@ -0,0 +1,15 @@
"""
AWOOOI AIOps Jobs
==================
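Scheduled jobs (not Redis Streams workers).

Currently includes:
- baseline_snapshot: Phase 0 observability baseline snapshot
- knowledge_decay_job: Phase 3 30-day knowledge decay (to be built)
- detection_feedback_writer: Phase 3 false-positive alert write-back (to be built)
- offline_replay_service: Phase 6 weekly offline replay (to be built)
- kb_rot_cleaner: Phase 6 monthly KB rot cleanup (to be built)

ADR-080: AI autonomy flywheel overview
2026-04-15 ogt: Phase 0, initial creation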
"""

View File

@@ -0,0 +1,338 @@
"""
AWOOOI AIOps Phase 0: baseline snapshot job
============================================
Captures the state of the AI autonomy flywheel before launch, as the reference
point for measuring progress from Phase 0 to Phase 1.

The snapshot covers the 6 key indicators from the ADR-080 diagnosis table:
1. MCP calls / 24h (target: > 0; current estimate: 0)
2. Playbook trust/confidence distribution (target: dynamic; current: fully static)
3. Learning loop trigger rate (target: ≥ 99%; current: 0%, fire-and-forget)
4. Share of alerts classified as general (target: < 10%; current: ~41%)
5. Share of RESTART remediation actions (target: < 40%; current: ~68%)
6. Successful automatic executions / 24h (target: > 0; current: 0)

Storage:
- Redis key `aiops:baseline:{timestamp_iso}`: one historical record per snapshot (no TTL)
- Redis key `aiops:baseline:latest`: full copy of the most recent snapshot (no TTL, convenient for API reads)

Usage:
    python -m src.jobs.baseline_snapshot   # run directly (one-off)
    await take_baseline_snapshot()         # call from code

ADR-080: AI autonomy flywheel overview
MASTER: docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md §5 Phase 0

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 0, initial creation
"""
from __future__ import annotations

import asyncio
import json
from datetime import timedelta

import structlog
from sqlalchemy import func, select, text

from src.core.redis_client import get_redis
from src.db.base import get_db_context
from src.db.models import (
    AutoRepairExecution,
    IncidentRecord,
    KnowledgeEntryRecord,
)
from src.utils.timezone import now_taipei

logger = structlog.get_logger(__name__)

# Redis keys
BASELINE_KEY_PREFIX = "aiops:baseline:"
BASELINE_LATEST_KEY = "aiops:baseline:latest"

# Playbook Redis prefix (same as playbook_repository.py)
PLAYBOOK_KEY_PREFIX = "playbook:"


async def take_baseline_snapshot() -> dict:
    """
    Take one full baseline snapshot and write it to Redis.

    Returns:
        dict: snapshot contents (including the snapshot_at timestamp)
    """
    now = now_taipei()
    since_24h = now - timedelta(hours=24)
    ts_iso = now.isoformat()
    logger.info("baseline_snapshot_start", snapshot_at=ts_iso)

    snapshot = {
        "snapshot_at": ts_iso,
        "phase": "P0",
        "description": "AI autonomy flywheel baseline before Phase 0 launch",
        "metrics": {},
    }

    # ── 1. MCP calls / 24h ────────────────────────────────────────────────
    # In Phase 0, MCP is not wired into any decision flow yet, so 0 is expected.
    # After Phase 1 this should be > 0 (PreDecisionInvestigator starts calling MCP).
    mcp_calls_24h = await _count_mcp_calls_24h(since_24h)
    snapshot["metrics"]["mcp_calls_24h"] = mcp_calls_24h

    # ── 2. Playbook confidence distribution (Redis scan) ──────────────────
    playbook_stats = await _playbook_confidence_stats()
    snapshot["metrics"]["playbook"] = playbook_stats

    # ── 3. Learning-loop inputs + other DB metrics ────────────────────────
    db_metrics = await _db_metrics(since_24h)
    snapshot["metrics"].update(db_metrics)

    # ── 4. Derived metrics ────────────────────────────────────────────────
    snapshot["metrics"]["learning_loop_rate"] = _compute_learning_rate(
        db_metrics.get("auto_repair_24h", 0),
        db_metrics.get("learning_writes_24h", 0),
    )

    # ── Write to Redis ────────────────────────────────────────────────────
    await _persist_to_redis(ts_iso, snapshot)

    logger.info(
        "baseline_snapshot_done",
        snapshot_at=ts_iso,
        mcp_calls_24h=mcp_calls_24h,
        playbook_total=playbook_stats.get("total", 0),
        incidents_24h=db_metrics.get("incidents_24h", 0),
        auto_repair_success_24h=db_metrics.get("auto_repair_success_24h", 0),
    )
    return snapshot


# ─────────────────────────────────────────────────────────────────────────────
# Internal helpers
# ─────────────────────────────────────────────────────────────────────────────
async def _count_mcp_calls_24h(since_24h) -> int:
    """
    MCP calls in the last 24h.

    Phase 0: there is no MCP calls table yet, so try to count from audit_logs.
    Once Phase 1 builds PreDecisionInvestigator, switch this to query the
    mcp_tool_calls table.
    """
    try:
        async with get_db_context() as db:
            # Rows in audit_logs with action='mcp_call'; 0 expected in Phase 0
            result = await db.execute(
                text(
                    "SELECT COUNT(*) FROM audit_logs "
                    "WHERE action = 'mcp_call' AND created_at >= :since"
                ),
                {"since": since_24h},
            )
            return result.scalar_one_or_none() or 0
    except Exception:
        logger.exception("baseline_mcp_count_error")
        return 0


async def _playbook_confidence_stats() -> dict:
    """
    Scan every Playbook in Redis and summarize the ai_confidence distribution.

    How to read the result:
    - avg_confidence ≈ 0.3 supports the "fully static" diagnosis (Phase 0 baseline)
    - After the Phase 3 EWMA goes live the values should spread out dynamically
      (higher std_dev, possibly a higher avg)
    """
    stats = {
        "total": 0,
        "approved": 0,
        "avg_confidence": 0.0,
        "min_confidence": None,
        "max_confidence": None,
        "never_used": 0,  # success_count + failure_count == 0
        "restart_ratio": 0.0,  # always present, even if the scan fails
        "action_type_dist": {},
    }
    try:
        redis = get_redis()
        confidences: list[float] = []
        action_counts: dict[str, int] = {}

        async for key in redis.scan_iter(match=f"{PLAYBOOK_KEY_PREFIX}PB-*", count=200):
            raw = await redis.get(key)
            if not raw:
                continue
            try:
                pb = json.loads(raw)
            except json.JSONDecodeError:
                continue

            stats["total"] += 1
            if pb.get("status") == "approved":
                stats["approved"] += 1

            conf = pb.get("ai_confidence", 0.0) or 0.0
            confidences.append(conf)

            used = (pb.get("success_count", 0) or 0) + (pb.get("failure_count", 0) or 0)
            if used == 0:
                stats["never_used"] += 1

            # Count the first action_type in repair_steps (stands for the main repair action)
            steps = pb.get("repair_steps", [])
            if steps:
                first_action = steps[0].get("action_type", "unknown")
                action_counts[first_action] = action_counts.get(first_action, 0) + 1

        if confidences:
            stats["avg_confidence"] = round(sum(confidences) / len(confidences), 4)
            stats["min_confidence"] = round(min(confidences), 4)
            stats["max_confidence"] = round(max(confidences), 4)

        # RESTART share: evidence for the ADR-080 diagnosis (target < 40%)
        total_actions = sum(action_counts.values())
        restart_count = action_counts.get("restart_service", 0)
        stats["restart_ratio"] = round(restart_count / total_actions, 4) if total_actions else 0.0
        stats["action_type_dist"] = action_counts
    except Exception:
        logger.exception("baseline_playbook_stats_error")
    return stats


async def _db_metrics(since_24h) -> dict:
    """
    Fetch the core count metrics from PostgreSQL.
    """
    metrics: dict = {
        "incidents_24h": 0,
        "incidents_total": 0,
        "general_alert_ratio": 0.0,
        "auto_repair_24h": 0,
        "auto_repair_success_24h": 0,
        "km_total": 0,
        "km_vectorized": 0,
        "learning_writes_24h": 0,
        "audit_logs_24h": 0,
    }
    try:
        async with get_db_context() as db:
            # Incident counts (24h + total)
            r = await db.execute(
                select(func.count(IncidentRecord.incident_id)).where(
                    IncidentRecord.created_at >= since_24h
                )
            )
            metrics["incidents_24h"] = r.scalar_one_or_none() or 0

            r = await db.execute(select(func.count(IncidentRecord.incident_id)))
            metrics["incidents_total"] = r.scalar_one_or_none() or 0

            # Share of general alerts (alert_category = 'general'), over all incidents
            r = await db.execute(
                select(func.count()).where(
                    IncidentRecord.alert_category == "general"
                )
            )
            general_count = r.scalar_one_or_none() or 0
            total = metrics["incidents_total"]
            metrics["general_alert_ratio"] = round(general_count / total, 4) if total else 0.0

            # Auto-repair executions (24h)
            r = await db.execute(
                select(func.count(AutoRepairExecution.id)).where(
                    AutoRepairExecution.created_at >= since_24h
                )
            )
            metrics["auto_repair_24h"] = r.scalar_one_or_none() or 0

            r = await db.execute(
                select(func.count(AutoRepairExecution.id)).where(
                    AutoRepairExecution.created_at >= since_24h,
                    AutoRepairExecution.success.is_(True),
                )
            )
            metrics["auto_repair_success_24h"] = r.scalar_one_or_none() or 0

            # KM entry count + how many are vectorized
            r = await db.execute(select(func.count(KnowledgeEntryRecord.id)))
            metrics["km_total"] = r.scalar_one_or_none() or 0

            r = await db.execute(
                select(func.count()).where(
                    KnowledgeEntryRecord.embedding.is_not(None)
                )
            )
            metrics["km_vectorized"] = r.scalar_one_or_none() or 0

            # Learning writes (KM entries added in the last 24h)
            r = await db.execute(
                select(func.count()).where(
                    KnowledgeEntryRecord.created_at >= since_24h
                )
            )
            metrics["learning_writes_24h"] = r.scalar_one_or_none() or 0

            # audit_logs count over 24h (0 expected in Phase 0)
            r = await db.execute(
                text(
                    "SELECT COUNT(*) FROM audit_logs WHERE created_at >= :since"
                ),
                {"since": since_24h},
            )
            metrics["audit_logs_24h"] = r.scalar_one_or_none() or 0
    except Exception:
        logger.exception("baseline_db_metrics_error")
    return metrics


def _compute_learning_rate(auto_repair_24h: int, learning_writes_24h: int) -> float:
    """
    Learning loop trigger rate = learning_writes_24h / auto_repair_24h,
    returned as a fraction capped at 1.0.

    Phase 0 diagnosis: fire-and-forget, so the rate is 0% (even when
    auto_repair > 0, learning writes may still be 0).
    Target after the Phase 3 fix: ≥ 99%
    """
    if auto_repair_24h == 0:
        return 0.0
    return round(min(learning_writes_24h / auto_repair_24h, 1.0), 4)


async def _persist_to_redis(ts_iso: str, snapshot: dict) -> None:
    """
    Write the snapshot to Redis:
    - `aiops:baseline:{ts_iso}`: historical record (never expires)
    - `aiops:baseline:latest`: full copy of the latest snapshot (never expires)
    """
    try:
        redis = get_redis()
        payload = json.dumps(snapshot, ensure_ascii=False)
        # Historical record (every snapshot is kept)
        await redis.set(f"{BASELINE_KEY_PREFIX}{ts_iso}", payload)
        # Latest snapshot (for fast API reads)
        await redis.set(BASELINE_LATEST_KEY, payload)
        logger.info("baseline_snapshot_persisted", key=BASELINE_LATEST_KEY)
    except Exception:
        logger.exception("baseline_persist_error")


# ─────────────────────────────────────────────────────────────────────────────
# Entry point (direct execution)
# ─────────────────────────────────────────────────────────────────────────────
async def _main() -> None:
    snapshot = await take_baseline_snapshot()
    print(json.dumps(snapshot, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(_main())
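
The targets listed in this module's docstring can be checked mechanically against a snapshot. The following is a minimal sketch, assuming the `metrics` layout that `take_baseline_snapshot()` builds in this file. `check_targets` and the sample values are hypothetical and only illustrate the comparison; they are not measured results.

```python
def check_targets(metrics: dict) -> dict[str, bool]:
    """Compare snapshot metrics with the ADR-080 targets; True means the target is met."""
    playbook = metrics.get("playbook", {})
    return {
        "mcp_calls_24h > 0": metrics.get("mcp_calls_24h", 0) > 0,
        "learning_loop_rate >= 0.99": metrics.get("learning_loop_rate", 0.0) >= 0.99,
        "general_alert_ratio < 0.10": metrics.get("general_alert_ratio", 1.0) < 0.10,
        "restart_ratio < 0.40": playbook.get("restart_ratio", 1.0) < 0.40,
        "auto_repair_success_24h > 0": metrics.get("auto_repair_success_24h", 0) > 0,
    }


# Illustrative values only, loosely based on the estimates quoted in the docstring.
sample_metrics = {
    "mcp_calls_24h": 0,
    "learning_loop_rate": 0.0,
    "general_alert_ratio": 0.41,
    "playbook": {"restart_ratio": 0.68},
    "auto_repair_success_24h": 0,
}
results = check_targets(sample_metrics)
for name, met in results.items():
    print(f"{'PASS' if met else 'FAIL'}  {name}")
```

Missing keys default to the failing side of each comparison, so an incomplete snapshot cannot pass a target by accident.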

View File

@@ -69,6 +69,7 @@ from src.api.v1 import terminal as terminal_v1 # Phase 19.1: Omni-Terminal SSE
from src.api.v1 import timeline as timeline_v1
from src.api.v1 import webhooks as webhooks_v1
from src.core.config import settings
from src.core.feature_flags import aiops_flags  # ADR-080: AI autonomy flywheel feature flags, validated at startup
from src.core.http_client import close_all_http_clients, init_all_http_clients
from src.core.logging import get_logger, setup_logging
from src.core.redis_client import close_redis_pool, init_redis_pool

View File

@@ -0,0 +1,110 @@
"""
AIOps Feature Flags tests
=========================
ADR-080: AI autonomy flywheel Phase 0 exit conditions

Covers:
- All Phase master switches default to False
- is_sub_flag_enabled() enforces the parent Phase switch
- is_phase_enabled() edge cases
- bool-cast correctness (a non-bool sub-setting must not leak out as an int)

2026-04-15 Claude Sonnet 4.6 + ogt: Phase 0, initial creation
"""
import pytest

from src.core.feature_flags import AIOpsFeatureFlags


class TestPhaseDefaultsAllFalse:
    """Phase 0 exit condition: every Phase defaults to False."""

    def test_p1_disabled_by_default(self):
        flags = AIOpsFeatureFlags()
        assert flags.AIOPS_P1_ENABLED is False

    def test_p2_disabled_by_default(self):
        flags = AIOpsFeatureFlags()
        assert flags.AIOPS_P2_ENABLED is False

    def test_p3_disabled_by_default(self):
        flags = AIOpsFeatureFlags()
        assert flags.AIOPS_P3_ENABLED is False

    def test_p4_disabled_by_default(self):
        flags = AIOpsFeatureFlags()
        assert flags.AIOPS_P4_ENABLED is False

    def test_p5_disabled_by_default(self):
        flags = AIOpsFeatureFlags()
        assert flags.AIOPS_P5_ENABLED is False

    def test_p6_disabled_by_default(self):
        flags = AIOpsFeatureFlags()
        assert flags.AIOPS_P6_ENABLED is False


class TestSubFlagEnforcement:
    """is_sub_flag_enabled() must enforce the parent Phase switch."""

    def test_sub_flag_blocked_when_parent_disabled(self):
        """With the parent Phase off, the sub-flag must read False even if it is set to True."""
        flags = AIOpsFeatureFlags(
            AIOPS_P1_ENABLED=False,
            AIOPS_P1_PRE_DECISION_INVESTIGATOR=True,
        )
        assert flags.is_sub_flag_enabled("AIOPS_P1_PRE_DECISION_INVESTIGATOR") is False

    def test_sub_flag_allowed_when_parent_enabled(self):
        """Parent Phase on and sub-flag True → True."""
        flags = AIOpsFeatureFlags(
            AIOPS_P1_ENABLED=True,
            AIOPS_P1_PRE_DECISION_INVESTIGATOR=True,
        )
        assert flags.is_sub_flag_enabled("AIOPS_P1_PRE_DECISION_INVESTIGATOR") is True

    def test_sub_flag_false_when_sub_disabled(self):
        """Parent Phase on but sub-flag False → False."""
        flags = AIOpsFeatureFlags(
            AIOPS_P1_ENABLED=True,
            AIOPS_P1_PRE_DECISION_INVESTIGATOR=False,
        )
        assert flags.is_sub_flag_enabled("AIOPS_P1_PRE_DECISION_INVESTIGATOR") is False

    def test_sub_flag_returns_bool_not_int(self):
        """AIOPS_P2_AGENT_TIMEOUT_SEC is an int; is_sub_flag_enabled must not leak it as a truthy int."""
        flags = AIOpsFeatureFlags(AIOPS_P2_ENABLED=True)
        result = flags.is_sub_flag_enabled("AIOPS_P2_AGENT_TIMEOUT_SEC")
        assert isinstance(result, bool), f"Expected bool, got {type(result)}"

    def test_sub_flag_invalid_name_returns_false(self):
        flags = AIOpsFeatureFlags()
        assert flags.is_sub_flag_enabled("INVALID_FLAG_NAME") is False

    def test_sub_flag_nonexistent_field_returns_false(self):
        flags = AIOpsFeatureFlags(AIOPS_P1_ENABLED=True)
        assert flags.is_sub_flag_enabled("AIOPS_P1_NONEXISTENT") is False


class TestIsPhaseEnabled:
    """is_phase_enabled() edge cases."""

    def test_valid_phase_enabled(self):
        flags = AIOpsFeatureFlags(AIOPS_P3_ENABLED=True)
        assert flags.is_phase_enabled(3) is True

    def test_valid_phase_disabled(self):
        flags = AIOpsFeatureFlags(AIOPS_P3_ENABLED=False)
        assert flags.is_phase_enabled(3) is False

    def test_invalid_phase_returns_false(self):
        flags = AIOpsFeatureFlags()
        assert flags.is_phase_enabled(0) is False
        assert flags.is_phase_enabled(7) is False
        assert flags.is_phase_enabled(99) is False

    @pytest.mark.parametrize("phase", [1, 2, 3, 4, 5, 6])
    def test_all_phases_default_false(self, phase):
        flags = AIOpsFeatureFlags()
        assert flags.is_phase_enabled(phase) is False

View File

@@ -8,11 +8,11 @@
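| Field | Value |
|------|-----|
-| **Version** | v1.8 |
+| **Version** | v1.9 |
| **Created** | 2026-03-20 (Taipei) |
| **Created by** | Claude Code |
-| **Last modified** | 2026-04-03 (Taipei) |
-| **Modified by** | Claude Code (Commander's directive: mandatory approval for cost changes) |
+| **Last modified** | 2026-04-15 (Taipei) |
+| **Modified by** | Claude Code + ogt (ADR-080 AI autonomy flywheel Phase exit-condition hard rule) |
### Change log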
@@ -27,6 +27,7 @@
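| v1.6 | 2026-03-30 | Claude Code | 🔴🔴🔴 Frontend internal-IP ban (browser permission incident) |
| v1.7 | 2026-04-02 | Claude Code | Phase 24 AI Router refactoring rules (DI / privacy / strangler pattern) |
| v1.8 | 2026-04-03 | Claude Code | 🔴🔴🔴 Mandatory approval for cost changes (Commander's directive) |
| v1.9 | 2026-04-15 | Claude Code + ogt | 🔴🔴🔴 AI autonomy flywheel Phase exit-condition hard rule (ADR-080) |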
---
@@ -52,6 +53,7 @@
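| **🔴🔴🔴 Frontend build** | **Internal IP** | **Public domain name** | [→ Frontend Internal IP](#frontend-internal-ip) |
| **AI Router** | **Router imports a concrete Provider** | **Depend only on the Protocol** | [→ OpenClaw](#openclaw) |
| **🔴🔴🔴 Cost changes** | **Switching to or adding a paid AI Provider without authorization** | **Read Chapter 5 of the constitution first, then ask the Commander for approval** | [→ Cost Change Approval](#cost-change-approval) |
| **🔴🔴🔴 AI flywheel Phase** | **Declaring completion before the exit conditions pass** | **Verify the exit conditions one by one** | [→ AI Phase Exit Conditions](#ai-phase-exit-conditions) |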
---
@@ -455,6 +457,40 @@ def new_function():
---
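## AI Phase Exit Conditions
> **ADR-080 hard rule, 2026-04-15**: Phase N must not be declared complete before the Phase N exit conditions have passed.
**Reference:** `docs/adr/ADR-080-ai-autonomy-flywheel-overview.md` / MASTER §5
### Core rule
```
Do not declare "Phase N complete" unless:
✅ Every item on the MASTER §5 Phase N exit-condition checklist is ticked
✅ The related tests pass (pytest green)
✅ Architect review Gate N is complete (ADR-080 § Architect review framework)
✅ The completed items are recorded in LOGBOOK
```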
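
The four checks above can be expressed as a small gate object. The following is a minimal sketch for illustration only: `PhaseExitReview` and its field names are hypothetical and do not exist in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseExitReview:
    """Hypothetical record of the four checks required before a Phase is declared complete."""

    phase: int
    checklist_all_ticked: bool   # MASTER §5 exit-condition checklist
    tests_green: bool            # related pytest suites pass
    architect_gate_done: bool    # ADR-080 architect review Gate N
    logbook_recorded: bool       # completed items written to LOGBOOK

    def unmet(self) -> list[str]:
        """Names of the checks that have not passed yet."""
        checks = {
            "checklist_all_ticked": self.checklist_all_ticked,
            "tests_green": self.tests_green,
            "architect_gate_done": self.architect_gate_done,
            "logbook_recorded": self.logbook_recorded,
        }
        return [name for name, ok in checks.items() if not ok]

    def may_declare_complete(self) -> bool:
        """A Phase may be declared complete only when nothing is unmet."""
        return not self.unmet()


review = PhaseExitReview(
    phase=0,
    checklist_all_ticked=True,
    tests_green=True,
    architect_gate_done=False,  # Gate 0 review still pending in this example
    logbook_recorded=True,
)
print(review.may_declare_complete(), review.unmet())
```

Returning the list of unmet checks, rather than a bare boolean, makes it easy to report exactly which condition blocks the declaration.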
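### Exit conditions for the 7 Phases at a glance
| Phase | Most critical exit conditions | ADR |
|-------|--------------|-----|
| P0 | `feature_flags.py` created + `baseline_snapshot.py` created + HARD_RULES v1.9 | ADR-080 |
| P1 | MCP calls / 24h > 0; EvidenceSnapshot written to DB | ADR-081 |
| P2 | All 5 Agents have unit tests; Coordinator circuit-breaker test passes | ADR-082 |
| P3 | Learning loop trigger rate ≥ 99%; fire-and-forget bug eliminated | ADR-083 |
| P4 | Dynamic baseline coverage ≥ 80%; general alerts < 10% | ADR-084 |
| P5 | Blast Radius check at 100% coverage; dry-run enforced and passing | ADR-085 |
| P6 | SLO calculation available; no automatic re-promotion after a self-demotion | ADR-086 |
### Consequences of a violation
Declaring a Phase complete with unmet exit conditions risks a technical-debt blowup; it is like continuing to build on an unstable foundation. If the Commander finds a violation, roll back immediately to the last accepted Phase.
---
## How to add a rule
1. Add a new section to this file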
View File

@@ -6,6 +6,38 @@
---
## 📍 2026-04-15: AI autonomy flywheel Phase 0 guardrails established
### Completed items
| Deliverable | Path | Notes |
|------|------|------|
| MASTER v2 blueprint | `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md` | §0-§8 fully filled in, 1456 lines, complete 7-Phase plan |
| ADR-080 | `docs/adr/ADR-080-ai-autonomy-flywheel-overview.md` | 7 Phases + 4 north stars + 7 architect review gates |
| Feature Flags | `apps/api/src/core/feature_flags.py` | P1~P6 all False + 15 fine-grained sub-flags |
| Jobs module | `apps/api/src/jobs/__init__.py` | Initializes the jobs directory |
| Baseline snapshot job | `apps/api/src/jobs/baseline_snapshot.py` | Captures the 6 key indicators before flywheel launch |
| HARD_RULES v1.9 | `docs/HARD_RULES.md` | Adds the Phase exit-condition hard rule |
### Phase 0 baseline values (to be filled in after baseline_snapshot runs)
| Indicator | Current (estimate) | Phase 6 target |
|------|------------|------------|
| MCP calls / 24h | 0 | > 0 |
| Playbook avg_confidence | ~0.3 (static) | Dynamic EWMA |
| Learning loop trigger rate | 0% | ≥ 99% |
| Share of general alerts | ~41% | < 10% |
| Share of RESTART remediation | ~68% | < 40% |
| Successful automatic executions / 24h | 0 | > 0 |
### Next steps
- Commander reviews ADR-080 + MASTER v2 → Phase 1 starts once approved
- Phase 1: PreDecisionInvestigator + MCP ToolRegistry + EvidenceSnapshot + PostExecutionVerifier
- Run `python -m src.jobs.baseline_snapshot` to capture the real baseline numbers
---
## 📍 2026-04-14 midnight: Phase 5 classification buttons fully completed and live
**Sprint 5.0 → 5.4 all complete**, 26 commits pushed:

View File

@@ -0,0 +1,149 @@
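# ADR-080: AI Autonomy Flywheel Overview
> **Date**: 2026-04-15 (Taipei)
> **Status**: 🔵 Draft (work starts after the Commander approves)
> **Author**: Claude Sonnet 4.6 (chief architect) + Commander audit
> **Related**:
> - MASTER blueprint: `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`
> - Deprecated v1: `docs/superpowers/plans/2026-04-15-MASTER-ai-autonomous-flywheel.md`
> - ADR-070 fully automated AI self-remediation loop (background)
> - ADR-073 full flywheel audit (root-cause diagnosis)
---
## Background
### Current-state diagnosis (deep scan, 2026-04-15)
A deep diagnosis of the entire AWOOOI AIOps system on 2026-04-15 confirmed the following fundamental defects:
| Indicator | Current | Target |
|-----|-----|-----|
| MCP calls / 24h | **0** | > 0 |
| Playbook trust_score | **all 0.3 (static)** | Dynamic EWMA updates |
| Learning loop trigger rate | **0% (fire-and-forget)** | ≥ 99% |
| Share of alerts classified as general | **41%** | < 10% |
| Share of RESTART remediation actions | **68%** | < 40% |
| Successful automatic executions / 24h | **0** | > 0 |
**Root diagnosis**: the code foundation is there, the process skeleton is there, but **the AI intelligence at its core is missing**.
Every fix over the past 3 months (bypasses, blacklists, restart fallbacks) was a patch on the skeleton and made no progress toward AI autonomy.
### Driver
The Commander has stressed repeatedly (no fewer than 5 times):
> "I do not want any hardcoded rules; move toward AI. Every execution result must be written back so the system gets smarter."
---
## Decision
### Adopt a 7-Phase structural overhaul (Single Source of Truth: the MASTER MD)
**Abandon** every "hardcode first, add a fallback first" stopgap approach.
**Establish** four north stars for autonomy:
| Goal | Definition |
|-----|-----|
| **Autonomous learning** | Every execution writes back Playbook trust / KM embeddings so the next decision is smarter |
| **Autonomous remediation** | The AI actively gathers evidence through MCP and reasons about actions, without relying on hardcoded rules |
| **Autonomous alerting** | Classification / severity / aggregation / routing are all decided dynamically by the AI |
| **Autonomous notification** | Recipients / channel / timing / wording are chosen by the AI according to context |
### 7-Phase implementation sequence
```
Phase 0 Guardrails                  → Feature flag framework + baseline snapshot + HARD_RULES update
Phase 1 Sensory depth               → PreDecisionInvestigator + EvidenceSnapshot + PostExecutionVerifier
Phase 2 Multi-agent collaboration   → 5 roles (Diagnostician/Solver/Reviewer/Critic/Coordinator)
Phase 3 Learning mechanism rebuild  → 3 root-cause fixes + Evolver + fine-tune pipeline (most critical)
Phase 4 Anomaly detection upgrade   → Holt-Winters + Drain3 + Prophet + proactive inspection
Phase 5 Remediation abstraction     → Declarative + four-tier Blast Radius control + GitOps PR
Phase 6 Self-governance loop        → SLO + Trust Drift + KB Rot + offline replay + self-demotion
```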
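### Four anti-amnesia gates
To prevent direction drift across sessions, the following are established:
1. **Gate 1**: `CLAUDE.md` makes reading MASTER mandatory (first step of every session)
2. **Gate 2**: `project_master_aiops_blueprint.md`, the cross-session state pointer
3. **Gate 3**: MASTER `§0 Session Resume Protocol` (7 mandatory steps for the Claude taking over)
4. **Gate 4**: MASTER `§8 Living Changelog` (append-only, records every change)
### Architect review framework (mandatory)
| Gate | Trigger | Review items |
|------|-------|---------|
| Gate 0 | Phase 0 complete | Feature flag structure / ADR-080 completeness |
| Gate 1 | Phase 1 complete | Sensory architecture boundaries / prompt injection protection / EvidenceSnapshot schema |
| Gate 2 | Phase 2 complete | Agent interface design / circuit-breaker coverage / Redis Streams isolation |
| Gate 3 | Phase 3 complete | Three learning-loop root causes / EWMA correctness / fire-and-forget eliminated |
| Gate 4 | Phase 4 complete | Dynamic baseline accuracy / Drain3 integration / verified reduction of static rules |
| Gate 5 | Phase 5 complete | Correct Blast Radius tiering / dry-run enforced / GitOps PR flow |
| Gate 6 | Phase 6 complete | SLO calculation accuracy / no re-promotion after self-demotion / monthly KB rot cleanup usable |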
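
Gate 2 reviews circuit-breaker coverage, and the `AIOPS_P2_AGENT_TIMEOUT_SEC` setting in this commit defines a per-agent timeout (default 5 seconds) after which the Coordinator degrades. The source does not show how the Coordinator implements this; the sketch below is one common asyncio pattern under that assumption. `run_with_timeout`, `slow_agent`, and `fast_agent` are hypothetical names, and the timeouts are shortened so the example finishes quickly.

```python
import asyncio


async def run_with_timeout(agent_coro, timeout_sec: float, fallback: str) -> str:
    """Await one agent call; return `fallback` if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(agent_coro, timeout=timeout_sec)
    except asyncio.TimeoutError:
        # wait_for cancels the slow coroutine before raising, so nothing is left running.
        return fallback


async def slow_agent() -> str:
    await asyncio.sleep(0.2)  # simulates an agent that takes too long
    return "diagnosis"


async def fast_agent() -> str:
    await asyncio.sleep(0)  # simulates an agent that answers immediately
    return "diagnosis"


async def demo() -> tuple[str, str]:
    slow = await run_with_timeout(slow_agent(), timeout_sec=0.05, fallback="degraded")
    fast = await run_with_timeout(fast_agent(), timeout_sec=0.05, fallback="degraded")
    return slow, fast


slow_result, fast_result = asyncio.run(demo())
print(slow_result, fast_result)
```

In a real Coordinator the fallback would presumably be a degraded decision path rather than a string, and the timeout would come from the feature-flag settings.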
---
## Impact
### New files (across all Phases)
| Phase | Key additions |
|-------|---------|
| P0 | `core/feature_flags.py`, `jobs/baseline_snapshot.py` |
| P1 | `services/pre_decision_investigator.py`, `services/evidence_snapshot.py`, `services/post_execution_verifier.py`, `services/mcp_tool_registry.py`, `services/sanitization_service.py` |
| P2 | `agents/diagnostician_agent.py`, `agents/solver_agent.py`, `agents/reviewer_agent.py`, `agents/critic_agent.py`, `agents/coordinator_agent.py`, `services/agent_orchestrator.py` |
| P3 | `services/playbook_evolver.py`, `services/finetune_exporter.py`, `jobs/knowledge_decay_job.py`, `jobs/detection_feedback_writer.py` |
| P4 | `services/dynamic_baseline_service.py`, `services/log_anomaly_detector.py`, `services/trend_predictor.py`, `services/proactive_inspector.py` |
| P5 | `services/blast_radius_calculator.py`, `services/declarative_remediation.py`, `services/gitops_pr_service.py`, `services/rollback_manager.py` |
| P6 | `services/ai_slo_calculator.py`, `services/trust_drift_detector.py`, `services/model_rollback_service.py`, `jobs/offline_replay_service.py`, `jobs/kb_rot_cleaner.py` |
### Modified core files
- `services/decision_manager.py`: retire the 25 hard rules; input becomes EvidenceSnapshot, output becomes DeclarativeSpec
- `services/approval_execution.py`: fix the fire-and-forget (~line 471) and wire in PostExecutionVerifier
- `services/learning_service.py`: populate matched_playbook_id; three-stage snapshots; EWMA with 2x weight on negative outcomes
- `services/incident_service.py:classify_alert_early()`: input becomes EvidenceSnapshot
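
The `learning_service.py` item names an EWMA trust update in which negative outcomes count double. The source gives no formula, so the following is only a sketch of one possible reading: `update_trust`, the smoothing factor `alpha=0.1`, and the idea that a failure doubles the smoothing factor are all assumptions made for illustration.

```python
def update_trust(
    trust: float,
    success: bool,
    alpha: float = 0.1,
    negative_multiplier: float = 2.0,
) -> float:
    """One EWMA step toward 1.0 on success or 0.0 on failure.

    Failures use a larger smoothing factor (alpha * negative_multiplier, capped
    at 1.0), so trust drops faster than it rises.
    """
    target = 1.0 if success else 0.0
    step = alpha if success else min(alpha * negative_multiplier, 1.0)
    return (1.0 - step) * trust + step * target


trust = 0.3  # the static starting value quoted in the diagnosis table
for outcome in (True, True, False):
    trust = update_trust(trust, outcome)
    print(round(trust, 4))
```

With these assumed parameters, one failure roughly cancels two successes, which matches the intent of weighting negative evidence more heavily; the real parameters would have to come from the MASTER blueprint.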
### New DB tables
`incident_evidence` / `agent_sessions` / `playbook_trust_history` / `detection_feedback` /
`anomaly_baselines` / `anomaly_detections` / `forecasts` / `playbook_declarative_stats` /
`ai_governance_events` / `model_checkpoints`
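---
## Alternatives (rejected)
| Option | Reason for rejection |
|-----|---------|
| Keep patching the rule engine | Treats symptoms, not the cause; rules can never keep up with alert diversity |
| Add LLM classification only | Without MCP evidence, the LLM can still only guess RESTART |
| Hardcode first as a transition | Explicitly forbidden by the Commander; the technical debt would lock out the path to AI |
| Track in several separate MD files | Fragmentation causes amnesia; the Single Source of Truth is non-negotiable |
---
## Exit conditions (Phase 0 completion criteria)
- [ ] `apps/api/src/core/feature_flags.py` exists and `AIOPS_P1~P6_ENABLED` are all `False`
- [ ] `apps/api/src/jobs/baseline_snapshot.py` exists and can be run once to capture the baseline
- [ ] `docs/HARD_RULES.md` is updated to v1.9 (Phase exit-condition hard rule added)
- [ ] This ADR is committed
- [ ] All existing tests pass (`pytest apps/api/tests/` green)
---
## References
- MASTER blueprint v2: `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`
- North-star hard rule: `~/.claude/projects/-Users-ogt-awoooi/memory/feedback_ai_autonomous_direction.md`
- Sensory depth (D1): MASTER §3.1
- Multi-agent (D2): MASTER §3.2
- Remediation abstraction (D3): MASTER §3.3
- Learning depth (D4): MASTER §3.4
- Anomaly detection (D5): MASTER §3.5
- Self-governance (D6): MASTER §3.6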

File diff suppressed because it is too large