awoooi/docs/security/MONITORING-POST-INCIDENT-READBACK-PLAN.md
Your Name d89f271af3
feat(iwooos): add monitoring / alerting post-incident readback gate
2026-06-16 11:07:34 +08:00

# IwoooS Monitoring / Alerting / Observability Post-Incident Readback Plan

| Item | Value |
|------|------|
| Date | 2026-06-15 |
| Status | `post_incident_readback_plan_ready_no_runtime_action` |
| Tool | `scripts/security/monitoring-post-incident-readback-plan.py` |
| Snapshot | `docs/security/monitoring-post-incident-readback-plan.snapshot.json` |
| Source | `docs/security/monitoring-owner-response-acceptance.snapshot.json` |
| runtime gate | `0` |
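## 1. Purpose

This plan moves post-incident readback for Monitoring / Alerting / Observability beyond "the service looks recovered" and turns it into a read-only ledger that can be re-run, reviewed, and rejected.

Its job is not to issue commands to the production monitoring system. Instead, it defines up front which redacted refs a future owner must provide in order to explain who changed an alert / receiver / silence, when the anomaly occurred, the alert state before and after the change, whether the receiver actually received the notification, whether stale / silence conditions caused a false green, how to roll back, and how to prevent recurrence.

This artifact does not connect to live Prometheus, reload Alertmanager, apply anything to Grafana / SigNoz / Sentry / Langfuse, reload OTEL, change receiver routes, create silences, send Telegram messages, fire live alerts, run alert chain smoke tests, use SSH, run kubectl, read secret values, store raw alert payloads, or write to production.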
## 2. Fixed numbers

| Metric | Value |
|------|------|
| readback candidate | `60` |
| write-capable candidate | `11` |
| live evidence required candidate | `60` |
| alert rule readback candidate | `13` |
| deploy / reload readback candidate | `6` |
| required readback fields | `30` |
| reviewer checks | `28` |
| outcome lanes | `11` |
| blocked actions | `53` |
| post-incident readback received / accepted | `0 / 0` |
| runtime gate | `0` |
## 3. Required readback fields

Every candidate must supply the following metadata-only refs before it can enter reviewer review:

1. incident / change / outage ref.
2. actor role / team attribution ref.
3. change / outage time window ref.
4. change intent or break-glass reason ref.
5. before / after alert state refs.
6. rule, datasource, scrape, remote write, or exporter state ref.
7. receiver route / notification policy / webhook / Telegram route state ref.
8. reload / no-reload ref.
9. receiver receipt readback ref.
10. stale / pending / resolved review ref.
11. silence / mute / dedup / inhibit / maintenance rule review ref.
12. dashboard / trace / log freshness ref.
13. notification delivery metadata ref.
14. alert chain health readback ref.
15. cross-project sync ref.
16. rollback / disable validation ref.
17. post-change monitoring ref.
18. independent postcheck readback ref.
19. recurrence guard ref.
20. maintenance window, rollback owner, followup owner.
21. redacted evidence refs, no-secret-value, no-raw-payload, no-false-green, receipt-not-route-only, independent-postcheck, and noise-budget / silence-owner attestation.
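
The completeness check implied by the list above could be sketched roughly as follows. The key names in `REQUIRED_REFS` are hypothetical and cover only a few of the listed items; the actual snapshot schema is not shown in this document and may differ.

```python
from typing import List, Mapping

# Hypothetical keys, loosely named after a few items in the list above.
# The real tool may use different names and a longer list.
REQUIRED_REFS = (
    "incident_ref",
    "actor_role_ref",
    "time_window_ref",
    "change_intent_ref",
    "before_alert_state_ref",
    "after_alert_state_ref",
    "receiver_receipt_readback_ref",
    "rollback_validation_ref",
    "recurrence_guard_ref",
)


def missing_refs(packet: Mapping[str, object]) -> List[str]:
    """Return the required keys that are absent or blank in a readback packet."""
    missing = []
    for key in REQUIRED_REFS:
        value = packet.get(key)
        # A ref only counts if it is a non-empty string.
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing


# Illustrative packet with made-up, already-redacted ref values.
example_packet = {
    "incident_ref": "INC-REDACTED-001",
    "actor_role_ref": "role:oncall",
    "time_window_ref": "",
}
print(missing_refs(example_packet))
```

A packet with any missing ref would stay out of reviewer review under this sketch; only metadata refs are inspected, never the evidence they point to.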
## 4. Reviewer checks

The reviewer must confirm that the source snapshot is the current version and check each of the following item by item: actor, time window, change intent, before / after alert state, rule / datasource / scrape state, receiver route state, reload / no-reload, receiver receipt, stale / silence review, dashboard / trace / log freshness, notification metadata, alert chain health, cross-project sync, rollback, post-change monitoring, independent postcheck, recurrence guard, maintenance window, noise budget owner, redacted refs, secret absence, raw payload absence, no-false-green, runtime stays zero, and count transition safe.

A route `200`, dashboard up, container up, receiver reachable, CD success, or UI visibility must not be treated as acceptance of the alert chain.
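
One way to picture the false-green rule is a check that flags a claim whose only evidence is one of the signals named above. This is a minimal sketch; the label strings are hypothetical stand-ins, not identifiers taken from the tool.

```python
from typing import Iterable

# Hypothetical labels for the signals that must not count as acceptance.
FALSE_GREEN_SIGNALS = frozenset({
    "route_200",
    "dashboard_up",
    "container_up",
    "receiver_reachable",
    "cd_success",
    "ui_visible",
})


def is_false_green_claim(evidence_kinds: Iterable[str]) -> bool:
    """True when evidence was offered and all of it is a false-green signal."""
    kinds = set(evidence_kinds)
    return bool(kinds) and kinds <= FALSE_GREEN_SIGNALS


print(is_false_green_claim(["route_200", "cd_success"]))
print(is_false_green_claim(["route_200", "receiver_receipt_readback_ref"]))
```

An empty evidence set is deliberately not treated as a false-green claim here, since nothing has been claimed yet.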
## 5. Outcome lanes

| lane | Purpose |
|------|------|
| `waiting_post_incident_readback` | No readback packet received yet; all accepted / runtime counts stay at `0` |
| `request_actor_or_time_supplement` | Missing actor, time window, intent, or break-glass reason |
| `request_alert_state_supplement` | Missing before / after alert state, rule, datasource, scrape, remote write, or exporter state |
| `request_receiver_receipt_supplement` | Missing receiver route state, receipt proof, delivery metadata, or notification owner |
| `request_stale_silence_supplement` | Missing stale / pending / resolved, silence, mute, dedup, inhibit, or maintenance review |
| `request_freshness_or_alert_chain_supplement` | Missing freshness, alert chain health, post-change monitoring, or rollback validation |
| `quarantine_raw_payload` | Quarantine when a secret, raw config, raw alert payload, raw receiver payload, or unredacted log or screenshot is received |
| `reject_false_green_claim` | Reject when route 200, dashboard up, container up, receiver reachable, CD success, or UI visibility is offered as acceptance |
| `ready_for_monitoring_post_incident_review` | Once the metadata qualifies, the candidate may only proceed to reviewer review |
| `recurrence_guard_backfill_required` | Backfill required for recurrence guard, owner review, change freeze, automation block, or noise budget owner |
| `waiting_runtime_gate` | Even when the readback is accepted, the runtime gate still requires separate manual approval |
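
The lane table can be read as a first-match routing function. The sketch below uses the lane names from the table, but the boolean flag names and the evaluation order (quarantine first, then rejection, then supplements) are assumptions made for illustration only.

```python
from typing import Mapping, Optional

# (flag, lane) pairs checked in order. Flag names are hypothetical;
# lane names come from the table above.
SUPPLEMENT_CHECKS = (
    ("has_actor_and_time", "request_actor_or_time_supplement"),
    ("has_alert_state", "request_alert_state_supplement"),
    ("has_receiver_receipt", "request_receiver_receipt_supplement"),
    ("has_stale_silence_review", "request_stale_silence_supplement"),
    ("has_freshness_and_chain_health", "request_freshness_or_alert_chain_supplement"),
    ("has_recurrence_guard", "recurrence_guard_backfill_required"),
)


def choose_lane(packet: Optional[Mapping[str, bool]]) -> str:
    """Pick the first lane whose condition matches (assumed ordering)."""
    if packet is None:
        return "waiting_post_incident_readback"
    if packet.get("contains_raw_payload", False):
        return "quarantine_raw_payload"
    if packet.get("claims_false_green", False):
        return "reject_false_green_claim"
    for flag, lane in SUPPLEMENT_CHECKS:
        if not packet.get(flag, False):
            return lane
    # Metadata is complete: the only next step is reviewer review.
    # `waiting_runtime_gate` is a separate manual approval and is not modeled.
    return "ready_for_monitoring_post_incident_review"


complete = {flag: True for flag, _ in SUPPLEMENT_CHECKS}
print(choose_lane(None))
print(choose_lane({"contains_raw_payload": True}))
print(choose_lane(complete))
```

Note that no branch returns an accepted or runtime-authorized state, which matches the table: the furthest a packet can travel in this sketch is reviewer review.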
## 6. Fixed blocked actions

This stage explicitly blocks `prometheus_reload`, `alertmanager_reload`, `grafana_dashboard_apply`, `signoz_rule_apply`, `sentry_deploy`, `langfuse_config_change`, `otel_collector_reload`, `receiver_route_change`, `silence_policy_change`, `telegram_send`, `notification_route_change`, `webhook_receiver_change`, `remote_write_change`, `exporter_deploy`, `live_alert_fire`, `alert_chain_smoke`, `ssh_read`, `ssh_write`, `kubectl_action`, `secret_value_collection`, `host_write`, `active_scan`, `production_write`, `raw payload storage`, `raw live config storage`, `full receiver payload storage`, `Bot token / DSN / webhook secret collection`, `alert rule / scrape config / dashboard / notification policy / silence / inhibit change`, and any action button.
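
A guard over the blocked list might look like the sketch below. It includes only a subset of the action names from the paragraph above, and `BlockedActionError` and `guard_action` are hypothetical names introduced for illustration, not part of the documented tool.

```python
class BlockedActionError(RuntimeError):
    """Raised when a blocked runtime action is requested."""


# Subset of the blocked action names listed above, for illustration.
BLOCKED_ACTIONS = frozenset({
    "prometheus_reload",
    "alertmanager_reload",
    "receiver_route_change",
    "silence_policy_change",
    "telegram_send",
    "live_alert_fire",
    "alert_chain_smoke",
    "ssh_read",
    "ssh_write",
    "kubectl_action",
    "secret_value_collection",
    "production_write",
})


def guard_action(action: str) -> None:
    """Raise if the requested action is on the blocked list."""
    if action in BLOCKED_ACTIONS:
        raise BlockedActionError(f"'{action}' is blocked at this stage")


try:
    guard_action("telegram_send")
except BlockedActionError as exc:
    print(exc)
```

Because this stage authorizes no runtime action at all, a stricter variant could invert the check into an allow-list of read-only report steps; the deny-list form is shown only because it mirrors the paragraph above.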
## 7. Current boundary

This artifact only indicates that the post-incident readback plan has been established. `post_incident_readback_received_count`, `post_incident_readback_accepted_count`, `receiver_receipt_readback_accepted_count`, `stale_pending_resolved_review_accepted_count`, `silence_mute_dedup_inhibit_review_accepted_count`, `alert_chain_health_readback_accepted_count`, `prometheus_reload_authorized_count`, `alertmanager_reload_authorized_count`, `telegram_send_authorized_count`, `alert_chain_smoke_authorized_count`, `runtime_gate_count`, and `action_button_count` all remain at `0`.
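
The zero-count boundary lends itself to a simple invariant check. The counter names below are the ones listed above; treating them as top-level keys of a snapshot mapping is an assumption, since the snapshot JSON layout is not shown here.

```python
from typing import List, Mapping

# Counter names as listed in the paragraph above.
ZERO_COUNTERS = (
    "post_incident_readback_received_count",
    "post_incident_readback_accepted_count",
    "receiver_receipt_readback_accepted_count",
    "stale_pending_resolved_review_accepted_count",
    "silence_mute_dedup_inhibit_review_accepted_count",
    "alert_chain_health_readback_accepted_count",
    "prometheus_reload_authorized_count",
    "alertmanager_reload_authorized_count",
    "telegram_send_authorized_count",
    "alert_chain_smoke_authorized_count",
    "runtime_gate_count",
    "action_button_count",
)


def nonzero_counters(snapshot: Mapping[str, object]) -> List[str]:
    """Return counters that are missing, not an int, or not exactly 0."""
    violations = []
    for name in ZERO_COUNTERS:
        value = snapshot.get(name)
        # Require a real int so that False or None cannot pass as zero.
        if type(value) is not int or value != 0:
            violations.append(name)
    return violations


# Assumed layout: counters stored as top-level keys.
snapshot = dict.fromkeys(ZERO_COUNTERS, 0)
print(nonzero_counters(snapshot))
snapshot["telegram_send_authorized_count"] = 1
print(nonzero_counters(snapshot))
```

A missing counter is reported as a violation rather than assumed to be zero, so a truncated snapshot cannot pass silently.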