OG T
|
07a097c259
|
fix(infra): Sprint 3 auto-repair end-to-end fix: docker-188 SSH egress + service registry expansion
CD Pipeline / build-and-deploy (push) Has been cancelled
NetworkPolicy: add egress to 192.168.0.188:22, the execution path used by repair-bot-188.sh
service-registry.yaml: add signoz/bitan-app (AUTO, host 188)
Fix coverage: completes Bug #11 (188 SSH) + tiering coverage for services on host 188
E2E verification: MoWoooWorkDown → SSH → REPAIR_OK:momo-app (3791ms)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 18:04:19 +08:00 |
|
OG T
|
1fb0c0ca90
|
fix(auto-repair): Bug #5+#6: SSH binary + affected_services matching fix
CD Pipeline / build-and-deploy (push) Has been cancelled
Bug #5 (webhooks.py): target_resource now prefers the component label
- The SentryDown alert carries labels.component="sentry"
- Old logic: labels.instance="192.168.0.110:9000" → no match against the Playbook's affected_services
- New logic: component → pod → instance → alertname
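
The label priority described for Bug #5 can be sketched roughly as follows. This is a minimal illustration, not the code in webhooks.py: the function name `resolve_target_resource`, the `LABEL_PRIORITY` constant, and the `"unknown"` default are assumptions made for the example; only the priority order and the sample labels come from the commit message.

```python
from typing import Mapping

# Priority order taken from the commit message:
# component → pod → instance → alertname.
LABEL_PRIORITY = ("component", "pod", "instance", "alertname")


def resolve_target_resource(labels: Mapping[str, str], default: str = "unknown") -> str:
    """Return the first non-empty label value, checked in priority order.

    Hypothetical helper: the real function name and fallback are not
    given in the commit message.
    """
    for key in LABEL_PRIORITY:
        value = (labels.get(key) or "").strip()
        if value:
            return value
    return default


# Labels as described for the SentryDown alert.
sentry_labels = {
    "alertname": "SentryDown",
    "component": "sentry",
    "instance": "192.168.0.110:9000",
}
print(resolve_target_resource(sentry_labels))  # sentry
```

With the component label preferred, the resolved value is a service name that can match a Playbook's affected_services entry, rather than a host:port string that cannot.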
Bug #6 (Dockerfile): python:3.11-slim ships without openssh-client
- The SSH_COMMAND Playbook execution path calls asyncio.create_subprocess_exec("ssh", ...)
- The image has no ssh binary → every SSH repair is bound to fail
- Fix: install openssh-client in the production stage
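
A defensive check on the Python side could make this class of failure easier to spot. The sketch below is an assumption, not part of the commit: `SSHBinaryMissing`, `require_ssh_binary`, `run_ssh`, and the `BatchMode=yes` option are illustrative choices. It only shows how a missing binary can be detected with `shutil.which` before `asyncio.create_subprocess_exec` is called.

```python
import asyncio
import shutil
from typing import Optional, Tuple


class SSHBinaryMissing(RuntimeError):
    """Hypothetical error type: no ssh client was found on the search path."""


def require_ssh_binary(search_path: Optional[str] = None) -> str:
    """Return the absolute path to ssh, or raise if the image lacks it.

    search_path=None means "use the process PATH" (shutil.which default).
    """
    path = shutil.which("ssh", path=search_path)
    if path is None:
        raise SSHBinaryMissing(
            "ssh not found on PATH; install openssh-client in the image"
        )
    return path


async def run_ssh(host: str, command: str, timeout: float = 30.0) -> Tuple[int, str]:
    """Run one remote command and return (exit code, combined output)."""
    ssh = require_ssh_binary()
    proc = await asyncio.create_subprocess_exec(
        ssh, "-o", "BatchMode=yes", host, command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a hung ssh process behind.
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, out.decode(errors="replace")
```

Failing fast with a named error separates "the image is missing ssh" from "the remote command failed", which a bare FileNotFoundError from the subprocess call tends to blur in logs.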
Service list: add the main sentry service to service-registry.yaml (AUTO level)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 14:11:50 +08:00 |
|
OG T
|
88696dba9b
|
feat(sprint5.1): Data Safety Guardrails end-to-end integration (L1-L5)
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (already run on 188):
- M-002: approval_records gains approval_level/votes/required_votes
- M-003: alert_event_type ENUM gains 8 new values
Layer 2 - IaC:
- ops/config/service-registry.yaml: Stateful tier list covering all services (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
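
To illustrate how a four-tier registry like this might be queried, here is a rough in-memory sketch. The tier names and the method names is_blocked/requires_multisig/get_required_votes come from this commit; everything else is an assumption: the vote counts per tier, the STANDARD_HITL default for unknown services, the `postgres-primary` entry, and the use of a plain dict in place of the parsed YAML file.

```python
from enum import Enum
from typing import Dict


class Tier(str, Enum):
    BLOCK = "BLOCK"
    CRITICAL_HITL = "CRITICAL_HITL"
    STANDARD_HITL = "STANDARD_HITL"
    AUTO = "AUTO"


# Assumed vote counts per tier; the real values are not stated in the commit.
REQUIRED_VOTES = {
    Tier.BLOCK: 0,          # never executed, so no votes apply
    Tier.CRITICAL_HITL: 2,
    Tier.STANDARD_HITL: 1,
    Tier.AUTO: 0,
}


class ServiceRegistry:
    """In-memory stand-in for the YAML-backed registry."""

    def __init__(self, entries: Dict[str, str], default: Tier = Tier.STANDARD_HITL):
        self._tiers = {name: Tier(level) for name, level in entries.items()}
        # Assumption: unknown services fall back to human review.
        self._default = default

    def tier(self, service: str) -> Tier:
        return self._tiers.get(service, self._default)

    def is_blocked(self, service: str) -> bool:
        return self.tier(service) is Tier.BLOCK

    def get_required_votes(self, service: str) -> int:
        return REQUIRED_VOTES[self.tier(service)]

    def requires_multisig(self, service: str) -> bool:
        return self.get_required_votes(service) > 1


registry = ServiceRegistry({
    "sentry": "AUTO",
    "signoz": "AUTO",
    "bitan-app": "AUTO",
    "postgres-primary": "BLOCK",  # hypothetical entry for illustration
})
print(registry.is_blocked("postgres-primary"))  # True
print(registry.get_required_votes("sentry"))    # 0
```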
Layer 3 - Python Services:
- service_registry.py: reads the YAML and exposes is_blocked/requires_multisig/get_required_votes
- velero_client.py: queries Velero backup age via kubectl; falls back to 999h on failure
- preflight_service.py: Pre-flight safety checks (decisions Q2/Q4)
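
The "fall back to 999h" behaviour of the backup-age lookup can be sketched like this. It is a minimal illustration under stated assumptions: `backup_age_hours` and the injected `fetch_timestamp` callable are hypothetical names, with the callable standing in for the kubectl query, and the timestamp is assumed to be an RFC 3339 UTC string of the form Kubernetes normally emits.

```python
from datetime import datetime, timezone
from typing import Callable, Optional

# From the commit message: any failure is reported as a 999h-old backup.
FALLBACK_AGE_HOURS = 999.0


def backup_age_hours(
    fetch_timestamp: Callable[[], str],
    now: Optional[datetime] = None,
) -> float:
    """Return the age of the latest backup in hours, or 999.0 on any failure.

    fetch_timestamp is a hypothetical hook that returns a string such as
    "2026-04-08T04:00:00Z"; in the real client it would wrap kubectl.
    """
    now = now or datetime.now(timezone.utc)
    try:
        raw = fetch_timestamp().strip()
        completed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
    except Exception:
        # Lookup or parse failed: treat the backup as very old (fail safe).
        return FALLBACK_AGE_HOURS
    return max((now - completed).total_seconds() / 3600.0, 0.0)


fixed_now = datetime(2026, 4, 8, 16, 0, 0, tzinfo=timezone.utc)
print(backup_age_hours(lambda: "2026-04-08T04:00:00Z", now=fixed_now))  # 12.0
```

Returning a very large age on failure means a pre-flight check that compares against a maximum backup age will refuse to proceed when the backup state is unknown, instead of silently passing.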
Layer 1-M001 - Playbook model:
- playbook.py: adds requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - Business logic:
- alert_operation_log_repository.py: adds 8 event_type values (Guardrail/Pre-flight/MultiSig/Backup)
- auto_repair_service.py: injects the Service Registry Guardrail check (BLOCK → rejected outright)
- webhooks.py: ALERT_RECEIVED provenance record + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord synced with the approval_level/votes/required_votes columns
- docker-health-monitor.sh: reworked into a pure sensing layer (all docker restart logic removed)
Layer 5 - Telegram notifications:
- telegram_gateway.py: six new notification methods T1-T6 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
References: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 16:24:09 +08:00 |
|