Your Name
|
6e04fe9c8a
|
feat(playbook): generate drafts with local llm
CD Pipeline / tests (push) Successful in 1m28s
Code Review / ai-code-review (push) Successful in 29s
Type Sync Check / check-type-sync (push) Failing after 2m41s
CD Pipeline / build-and-deploy (push) Successful in 8m40s
CD Pipeline / post-deploy-checks (push) Successful in 3m10s
|
2026-04-30 23:04:58 +08:00 |
|
Your Name
|
9908fdf50d
|
feat(p3.1-t2-patha): DiagnosisAggregator 路徑 A + Solver F4 critical reject + 對齊測試
CD Pipeline / build-and-deploy (push) Failing after 1m59s
Wave 8 P3.1-T2 PathA 啟用 + Solver F4 安全強化 + test 對齊:
PathA — DiagnosisAggregator 信號分類層補 PDI:
- ENABLE_DIAGNOSIS_AGGREGATOR default=False → True
· PathA 純信號分類層(OOMKilled/CrashLoop 等業務邏輯)
· 不重複呼叫 K8s/SignOz API(只取 PDI 已收集的 raw 資料)
· 安全 default on — 純邏輯處理,無外部依賴重疊
- diagnosis_aggregator.py +155 行(PathA 實作)
- pre_decision_investigator.py 已接 (commit 3a2cd151)
F4 — Solver critical risk reject:
- solver_agent.py: _validate_recommended_action 拒絕 risk=critical
· 鐵律:critical 動作必須走人工審批,不可變 Telegram 按鈕
· log warning + return None(被 _extract 過濾掉)
- _extract_recommended_actions 改返回 (list, status_str) tuple
· status="ok"/"empty"/"all_invalid" 供呼叫端決策
- protocol.py +16 / metrics.py +9 / ai_router.py +18 — 配套 metric + protocol field
測試對齊:
- test_solver_recommended_actions.py 拆 test_all_valid → low/medium/high accepted +
test_critical_rejected
- result tuple unpack: result, _ = _extract_recommended_actions(...)
- test_diagnosis_aggregator_stub.py: feature flag default 改 True 對齊 PathA
Tests: 51 passed (solver 28 + aggregator 16 + router fallback 8)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (Wave 8 P3.1-T2 PathA + F4) <noreply@anthropic.com>
|
2026-04-27 14:42:29 +08:00 |
|
Your Name
|
fb130c9a28
|
feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization 補強 + metrics 補欄
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Wave 8 P3.1-T2 後續補測 + 配套:
新增測試:
- test_diagnosis_aggregator_stub.py (238 行) — 15 tests
· stub fixture 驗證 _collect_diagnosis_aggregator 接線
· feature flag default off 不呼叫
· timeout 邊界 / exception fail-soft
修改:
- core/metrics.py +23 — 新增 DiagnosisAggregator 相關 Prometheus 指標
- sanitization_service.py +24 — 補強 prompt sanitize 邊界(vuln #4 配套)
- RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml — 微調
Tests: 15 passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-27 08:30:26 +08:00 |
|
Your Name
|
fefe4c21cd
|
fix(inc-20260425): A1+A2 後續 — Solver/Critic timeout + auto_repair 接線 + Runbook + Grafana
CD Pipeline / build-and-deploy (push) Has been cancelled
延續 595629c0 INC-20260425 修復,補三段 Agent + 全鏈路觀測:
A1 後續 — Solver/Critic 三段 timeout 接線:
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0(env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0(env override)
- protocol.py: 三 Agent 共用 observe_agent_step() 包裹呼叫
· success/timeout/error outcome label
· histogram 寫入 aiops_agent_step_duration_seconds
A2 後續 — auto_repair_service 改用 _diagnose_fallback_chain:
- auto_repair_service.py +46 行 — 切換 DIAGNOSE 路由到新 chain(NEMO→GEMINI→CLAUDE)
- 完全避開 Ollama CPU 238s 二次 timeout
新增 metrics:
- core/metrics.py +59 行 — 配合 observe_agent_step 的 histogram bucket + label cardinality
新增測試 (862 行):
- test_agent_step_timeouts.py (475) — 三 Agent 各 timeout 邊界 + outcome label
- test_ai_router_diagnose_fallback.py (387) — _diagnose_fallback_chain 正確序
新增配套:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350) — INC 故障排查 + 觀測指引
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
· 三 Agent histogram alert rules(p99 > timeout 80% → warning)
驗收: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)
INC-20260425 雙修總工作量(595629c0 + 此 commit):
· 5 個 service/agent 檔修改
· 1 個新 observability 模組
· 4 個新測試/配套檔
· 1372+187 = 1559 行新增
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 後續) <noreply@anthropic.com>
|
2026-04-27 08:15:53 +08:00 |
|
Your Name
|
2c57b71db9
|
feat(wave5-p2): GovernanceAgent 4 項自檢 + Ollama 健康告警規則 + Prometheus metrics 整合
CD Pipeline / build-and-deploy (push) Successful in 10m45s
MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 完成(multiple engineers 在限額前完成代碼,補 commit):
P2.2 — GovernanceAgent 4 項自檢:
- governance_agent.py (342 行) — 每 1 小時自檢循環:
· trust_drift(信任度漂移檢測)
· knowledge_degradation(知識退化檢測)
· llm_hallucination(LLM 幻覺檢測)
· execution_blast_radius(執行爆炸半徑檢測)
- main.py lifespan: asyncio.create_task(run_governance_loop()) 啟動
try/except 包裹,schedule 失敗不阻斷主流程
- failover_alerter.py: alert_governance(event_type, payload) 1h dedup
四類事件 → Telegram MarkdownV2 告警
P2.3 — Ollama 健康規則 + Prometheus Metrics:
- ops/monitoring/ollama_health_rules.yaml (148 行):
· OllamaHealthDegraded / OllamaPrimaryDown
· OllamaFailoverTriggered / GeminiQuotaExceeded
· 補 Prometheus 取資料的 alert rules
- core/metrics.py (57 行):
· GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge
· OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter
· OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge
- ollama_failover_manager.py:
· _check_gemini_quota: 每次 check 同步更新 Gauge(讓 Prometheus 取最新值)
· select_provider: failover 時 inc Counter + 切 Primary Gauge
· try/except 包裹,metric 失敗不阻斷主路由
E2E 測試:
- test_failover_e2e_dispatch.py (365 行)
完整 dispatch 路徑:health check → failover decide → alerter → metrics
Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (上 session Wave 5) <noreply@anthropic.com>
|
2026-04-26 20:56:19 +08:00 |
|
OG T
|
d89f0520f9
|
fix(api): 修復 34 個 Ruff lint 錯誤
- 自動修復 import 排序、unused imports
- 手動修復 raise from、isinstance union、unused variable
- scripts/ 暫時保留 (非 CI 阻擋)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:27:49 +08:00 |
|