IwoooS Monitoring / Alerting / Observability Post-Incident Readback Plan
| Item | Value |
|---|---|
| Date | 2026-06-15 |
| Status | post_incident_readback_plan_ready_no_runtime_action |
| Tool | scripts/security/monitoring-post-incident-readback-plan.py |
| Snapshot | docs/security/monitoring-post-incident-readback-plan.snapshot.json |
| Source | docs/security/monitoring-owner-response-acceptance.snapshot.json |
| runtime gate | 0 |
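1. Purpose
This plan moves post-incident readback for Monitoring / Alerting / Observability beyond "the service looks recovered" to a read-only ledger that can be re-run, reviewed, and rejected.
The risk it addresses is not about issuing commands to the production monitoring system. It first defines which redacted refs a future owner must provide in order to show who changed an alert / receiver / silence, when the anomaly occurred, the alert state before and after the change, whether the receiver actually received the notification, whether stale / silence caused a false green, how to roll back, and how to prevent recurrence.
This artifact does not connect to live Prometheus, reload Alertmanager, apply Grafana / SigNoz / Sentry / Langfuse changes, reload OTEL, change receiver routes, create silences, send Telegram messages, fire live alerts, run alert chain smoke tests, SSH, run kubectl, read secret values, store raw alert payloads, or write to production.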
2. Fixed numbers
| Metric | Value |
|---|---|
| readback candidate | 60 |
| write-capable candidate | 11 |
| live evidence required candidate | 60 |
| alert rule readback candidate | 13 |
| deploy / reload readback candidate | 6 |
| required readback fields | 30 |
| reviewer checks | 28 |
| outcome lanes | 11 |
| blocked actions | 53 |
| post-incident readback received / accepted | 0 / 0 |
| runtime gate | 0 |
3. Required readback fields
Every candidate must supply the following metadata-only refs before it can enter reviewer review:
- incident / change / outage ref.
- actor role / team attribution ref.
- change / outage time window ref.
- change intent or break-glass reason ref.
- before / after alert state refs.
- rule, datasource, scrape, remote write, or exporter state ref.
- receiver route / notification policy / webhook / Telegram route state ref.
- reload / no-reload ref.
- receiver receipt readback ref.
- stale / pending / resolved review ref.
- silence / mute / dedup / inhibit / maintenance rule review ref.
- dashboard / trace / log freshness ref.
- notification delivery metadata ref.
- alert chain health readback ref.
- cross-project sync ref.
- rollback / disable validation ref.
- post-change monitoring ref.
- independent postcheck readback ref.
- recurrence guard ref.
- maintenance window, rollback owner, followup owner.
- redacted evidence refs, plus no-secret-value, no-raw-payload, no-false-green, receipt-not-route-only, independent-postcheck, and noise-budget / silence-owner attestations.
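The completeness rule above could be checked mechanically before a candidate reaches a reviewer. The following is a minimal sketch only: the plan names the required refs in prose, so the key names in `REQUIRED_REF_FIELDS` and the function `missing_ref_fields` are hypothetical, and only a subset of the fields is shown.

```python
from collections.abc import Mapping

# Hypothetical key names: the plan lists the required refs in prose only,
# so these identifiers illustrate the idea and are not the real schema.
REQUIRED_REF_FIELDS = (
    "incident_ref",
    "actor_role_ref",
    "time_window_ref",
    "change_intent_ref",
    "before_alert_state_ref",
    "after_alert_state_ref",
    "receiver_receipt_readback_ref",
    "rollback_validation_ref",
    "recurrence_guard_ref",
)


def missing_ref_fields(candidate: Mapping[str, object]) -> list[str]:
    """Return the required ref fields that are absent or blank."""
    missing = []
    for name in REQUIRED_REF_FIELDS:
        value = candidate.get(name)
        # A ref must be a non-empty string; anything else counts as missing.
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing
```

A candidate would move on to reviewer review only when `missing_ref_fields(candidate)` returns an empty list; otherwise the returned names indicate which supplement to request.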
4. Reviewer checks
The reviewer must confirm that the source snapshot is the current version, then check each of the following item by item: actor, time window, change intent, before / after alert state, rule / datasource / scrape state, receiver route state, reload / no-reload, receiver receipt, stale / silence review, dashboard / trace / log freshness, notification metadata, alert chain health, cross-project sync, rollback, post-change monitoring, independent postcheck, recurrence guard, maintenance window, noise budget owner, redacted refs, secret absence, raw payload absence, no-false-green, runtime stays zero, and count transition safe.
Route 200, dashboard up, container up, receiver reachable, CD success, or UI visibility must not be treated as acceptance of the alert chain.
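One way to read the false-green rule is that a claim backed only by these surface signals should be refused. The sketch below assumes that interpretation; the signal labels in `FALSE_GREEN_SIGNALS` are hypothetical snake_case renderings of the prose, and `is_false_green_claim` is an illustrative name, not part of the plan's tool.

```python
from collections.abc import Iterable

# Hypothetical labels for the signals the plan says are not acceptance evidence.
FALSE_GREEN_SIGNALS = frozenset({
    "route_200",
    "dashboard_up",
    "container_up",
    "receiver_reachable",
    "cd_success",
    "ui_visible",
})


def is_false_green_claim(evidence_kinds: Iterable[str]) -> bool:
    """True when the offered evidence consists solely of false-green signals."""
    kinds = set(evidence_kinds)
    # An empty submission is incomplete rather than a false-green claim.
    return bool(kinds) and kinds <= FALSE_GREEN_SIGNALS
```

Under this reading, `{"route_200", "dashboard_up"}` is rejected, while a submission that also carries a receiver receipt readback ref is passed on for the remaining checks.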
5. Triage lanes
| lane | Purpose |
|---|---|
| waiting_post_incident_readback | No readback package received yet; all accepted / runtime counts stay at 0 |
| request_actor_or_time_supplement | Missing actor, time window, intent, or break-glass reason |
| request_alert_state_supplement | Missing before / after alert state, rule, datasource, scrape, remote write, or exporter state |
| request_receiver_receipt_supplement | Missing receiver route state, receipt proof, delivery metadata, or notification owner |
| request_stale_silence_supplement | Missing stale / pending / resolved, silence, mute, dedup, inhibit, or maintenance review |
| request_freshness_or_alert_chain_supplement | Missing freshness, alert chain health, post-change monitoring, or rollback validation |
| quarantine_raw_payload | Quarantine on receipt of a secret, raw config, raw alert payload, raw receiver payload, or unredacted logs or screenshots |
| reject_false_green_claim | Reject when route 200, dashboard up, container up, receiver reachable, CD success, or UI visibility is presented as acceptance |
| ready_for_monitoring_post_incident_review | Once the metadata qualifies, the candidate may only proceed to reviewer review |
| recurrence_guard_backfill_required | Needs backfill of a recurrence guard, owner review, change freeze, automation block, or noise budget owner |
| waiting_runtime_gate | Even if the readback is accepted, the runtime gate still requires separate human approval |
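The lanes can be thought of as a small decision function. The sketch below uses the lane names from the table, but the `ReadbackPackage` shape, the missing-group labels, and the precedence order (quarantine first, then rejection, then supplements) are all assumptions made for illustration. `waiting_runtime_gate` is left out because it applies only after a readback has been accepted.

```python
from dataclasses import dataclass, field


@dataclass
class ReadbackPackage:
    """Hypothetical summary of a submitted readback package."""
    received: bool = False
    contains_raw_payload: bool = False
    claims_false_green: bool = False
    missing: set[str] = field(default_factory=set)  # hypothetical group labels


# Missing-evidence group -> lane name from the plan; the order is an assumed precedence.
SUPPLEMENT_LANES = (
    ("actor_or_time", "request_actor_or_time_supplement"),
    ("alert_state", "request_alert_state_supplement"),
    ("receiver_receipt", "request_receiver_receipt_supplement"),
    ("stale_silence", "request_stale_silence_supplement"),
    ("freshness_or_alert_chain", "request_freshness_or_alert_chain_supplement"),
    ("recurrence_guard", "recurrence_guard_backfill_required"),
)


def choose_lane(pkg: ReadbackPackage) -> str:
    """Pick one lane for a package; unsafe content wins over missing content."""
    if not pkg.received:
        return "waiting_post_incident_readback"
    if pkg.contains_raw_payload:
        return "quarantine_raw_payload"
    if pkg.claims_false_green:
        return "reject_false_green_claim"
    for group, lane in SUPPLEMENT_LANES:
        if group in pkg.missing:
            return lane
    return "ready_for_monitoring_post_incident_review"
```

Checking quarantine before any supplement request is a deliberate choice in this sketch: a package that leaks raw payloads should be isolated even if it is also incomplete.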
6. Fixed blocked actions
This stage explicitly blocks prometheus_reload, alertmanager_reload, grafana_dashboard_apply, signoz_rule_apply, sentry_deploy, langfuse_config_change, otel_collector_reload, receiver_route_change, silence_policy_change, telegram_send, notification_route_change, webhook_receiver_change, remote_write_change, exporter_deploy, live_alert_fire, alert_chain_smoke, ssh_read, ssh_write, kubectl_action, secret_value_collection, host_write, active_scan, production_write, raw payload storage, raw live config storage, full receiver payload storage, Bot token / DSN / webhook secret collection, changes to alert rule / scrape config / dashboard / notification policy / silence / inhibit, and any action button.
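A deny-list like this is easy to enforce as a guard that runs before anything else. The following is a sketch under assumptions: it copies only a subset of the action names from the list above, and `BlockedActionError` and `assert_action_allowed` are hypothetical names, not functions of the plan's script.

```python
# Subset of the blocked action identifiers listed in the plan.
BLOCKED_ACTIONS = frozenset({
    "prometheus_reload",
    "alertmanager_reload",
    "receiver_route_change",
    "silence_policy_change",
    "telegram_send",
    "live_alert_fire",
    "alert_chain_smoke",
    "ssh_read",
    "ssh_write",
    "kubectl_action",
    "secret_value_collection",
    "production_write",
})


class BlockedActionError(RuntimeError):
    """Raised when a caller requests an action this stage forbids."""


def assert_action_allowed(action: str) -> None:
    """Refuse any action on the deny-list; return None otherwise."""
    if action in BLOCKED_ACTIONS:
        raise BlockedActionError(f"{action} is blocked at this stage")
```

Raising an exception instead of returning a boolean makes it harder for a caller to ignore the refusal by accident.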
7. Current boundary
This artifact only indicates that the post-incident readback plan has been established. post_incident_readback_received_count, post_incident_readback_accepted_count, receiver_receipt_readback_accepted_count, stale_pending_resolved_review_accepted_count, silence_mute_dedup_inhibit_review_accepted_count, alert_chain_health_readback_accepted_count, prometheus_reload_authorized_count, alertmanager_reload_authorized_count, telegram_send_authorized_count, alert_chain_smoke_authorized_count, runtime_gate_count, and action_button_count all remain at 0.
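The zero-count boundary lends itself to a simple regression check on the snapshot. The counter names below are taken from the paragraph above; that they appear as top-level integer keys in the snapshot JSON is an assumption, and `nonzero_counters` is a hypothetical helper.

```python
# Counter names from the plan; their placement as top-level keys is assumed.
ZERO_COUNTERS = (
    "post_incident_readback_received_count",
    "post_incident_readback_accepted_count",
    "receiver_receipt_readback_accepted_count",
    "stale_pending_resolved_review_accepted_count",
    "silence_mute_dedup_inhibit_review_accepted_count",
    "alert_chain_health_readback_accepted_count",
    "prometheus_reload_authorized_count",
    "alertmanager_reload_authorized_count",
    "telegram_send_authorized_count",
    "alert_chain_smoke_authorized_count",
    "runtime_gate_count",
    "action_button_count",
)


def nonzero_counters(snapshot: dict) -> dict:
    """Return every counter that is missing or not exactly the integer 0."""
    violations = {}
    for name in ZERO_COUNTERS:
        value = snapshot.get(name)
        # Require a real int so that False or "0" does not pass as zero.
        if not (type(value) is int and value == 0):
            violations[name] = value
    return violations
```

An empty result means the boundary holds; any entry in the returned mapping names a counter that has drifted or is absent from the snapshot.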