awoooi/docs/security/MONITORING-POST-INCIDENT-READBACK-PLAN.md
feat(iwooos): add monitoring / alerting post-incident readback gate
2026-06-16 11:07:34 +08:00

IwoooS Monitoring / Alerting / Observability Post-Incident Readback Plan

| Item | Value |
| --- | --- |
| Date | 2026-06-15 |
| Status | `post_incident_readback_plan_ready_no_runtime_action` |
| Tool | `scripts/security/monitoring-post-incident-readback-plan.py` |
| Snapshot | `docs/security/monitoring-post-incident-readback-plan.snapshot.json` |
| Source | `docs/security/monitoring-owner-response-acceptance.snapshot.json` |
| runtime gate | 0 |

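1. Purpose

This plan moves post-incident readback for Monitoring / Alerting / Observability beyond "the service looks recovered" and turns it into a read-only ledger that can be re-run, reviewed, and rejected.

The risk it addresses is not about issuing commands to the production monitoring system. Instead, it first defines which redacted refs a future owner must provide in order to explain who changed an alert / receiver / silence, when the anomaly occurred, the alert state before and after the change, whether the receiver actually received the notification, whether stale / silence caused a false green, how to roll back, and how to prevent recurrence.

This artifact does not connect to live Prometheus, reload Alertmanager, apply anything to Grafana / SigNoz / Sentry / Langfuse, reload OTEL, change receiver routes, create silences, send Telegram messages, fire live alerts, run alert chain smoke tests, SSH, run kubectl, read secret values, store raw alert payloads, or write to production.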

2. Fixed numbers

| Metric | Value |
| --- | --- |
| readback candidate | 60 |
| write-capable candidate | 11 |
| live evidence required candidate | 60 |
| alert rule readback candidate | 13 |
| deploy / reload readback candidate | 6 |
| required readback fields | 30 |
| reviewer checks | 28 |
| outcome lanes | 11 |
| blocked actions | 53 |
| post-incident readback received / accepted | 0 / 0 |
| runtime gate | 0 |

3. Required readback fields

Every candidate must supply the following metadata-only refs before it can enter reviewer review:

  1. incident / change / outage ref.
  2. actor role / team attribution ref.
  3. change / outage time window ref.
  4. change intent or break-glass reason ref.
  5. before / after alert state refs.
  6. rule, datasource, scrape, remote write, or exporter state ref.
  7. receiver route / notification policy / webhook / Telegram route state ref.
  8. reload / no-reload ref.
  9. receiver receipt readback ref.
  10. stale / pending / resolved review ref.
  11. silence / mute / dedup / inhibit / maintenance rule review ref.
  12. dashboard / trace / log freshness ref.
  13. notification delivery metadata ref.
  14. alert chain health readback ref.
  15. cross-project sync ref.
  16. rollback / disable validation ref.
  17. post-change monitoring ref.
  18. independent postcheck readback ref.
  19. recurrence guard ref.
  20. maintenance window, rollback owner, followup owner.
  21. redacted evidence refs, no-secret-value, no-raw-payload, no-false-green, receipt-not-route-only, independent-postcheck, and noise-budget / silence-owner attestation.
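
The completeness rule above could be checked mechanically before a packet reaches a reviewer. The following is a minimal sketch, not the logic of `monitoring-post-incident-readback-plan.py`: the key names in `REQUIRED_REF_FIELDS` are hypothetical stand-ins for a few of the fields listed above, and the real snapshot schema is not shown in this document.

```python
# Hypothetical key names standing in for some of the required refs;
# the actual snapshot schema is an assumption.
REQUIRED_REF_FIELDS = (
    "incident_ref",
    "actor_role_ref",
    "time_window_ref",
    "change_intent_ref",
    "before_alert_state_ref",
    "after_alert_state_ref",
    "receiver_receipt_readback_ref",
    "rollback_validation_ref",
    "recurrence_guard_ref",
)


def missing_refs(candidate: dict) -> list[str]:
    """Return the required keys that are absent or blank in a candidate."""
    missing = []
    for key in REQUIRED_REF_FIELDS:
        value = candidate.get(key)
        # Only non-empty strings count as a ref; None, "", or other
        # types are treated as missing.
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing


def ready_for_review(candidate: dict) -> bool:
    """A candidate may enter reviewer review only when nothing is missing."""
    return not missing_refs(candidate)
```

A candidate with any blank or absent ref stays out of reviewer review, and the returned list says which supplement to request.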

4. Reviewer checks

The reviewer must confirm that the source snapshot is the current version and check each of the following item by item: actor, time window, change intent, before / after alert state, rule / datasource / scrape state, receiver route state, reload / no-reload, receiver receipt, stale / silence review, dashboard / trace / log freshness, notification metadata, alert chain health, cross-project sync, rollback, post-change monitoring, independent postcheck, recurrence guard, maintenance window, noise budget owner, redacted refs, secret absence, raw payload absence, no-false-green, runtime stays zero, and count transition safe.

Route 200, dashboard up, container up, receiver reachable, CD success, or UI visibility must not be treated as acceptance of the alert chain.
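
The false-green rule can be illustrated with a small check. This is a sketch under assumptions: the signal labels in `FALSE_GREEN_SIGNALS` are hypothetical identifiers for the six signals named above, and `receiver_receipt_readback` in the usage note is likewise an illustrative label.

```python
# Hypothetical labels for the liveness-style signals that must not
# count as alert chain acceptance.
FALSE_GREEN_SIGNALS = frozenset({
    "route_200",
    "dashboard_up",
    "container_up",
    "receiver_reachable",
    "cd_success",
    "ui_visible",
})


def is_false_green_claim(acceptance_evidence: list[str]) -> bool:
    """True when a non-empty claim rests only on false-green signals."""
    kinds = set(acceptance_evidence)
    # An empty list is not a claim at all, so it is not flagged here.
    return bool(kinds) and kinds <= FALSE_GREEN_SIGNALS
```

Under this sketch, `is_false_green_claim(["route_200", "cd_success"])` is flagged, while a claim that also cites `receiver_receipt_readback` is not, since it carries evidence beyond liveness.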

5. Triage lanes

| lane | Purpose |
| --- | --- |
| `waiting_post_incident_readback` | No readback packet received yet; all accepted / runtime counts stay at 0 |
| `request_actor_or_time_supplement` | Missing actor, time window, intent, or break-glass reason |
| `request_alert_state_supplement` | Missing before / after alert state, rule, datasource, scrape, remote write, or exporter state |
| `request_receiver_receipt_supplement` | Missing receiver route state, receipt proof, delivery metadata, or notification owner |
| `request_stale_silence_supplement` | Missing stale / pending / resolved, silence, mute, dedup, inhibit, or maintenance review |
| `request_freshness_or_alert_chain_supplement` | Missing freshness, alert chain health, post-change monitoring, or rollback validation |
| `quarantine_raw_payload` | Quarantine when a secret, raw config, raw alert payload, raw receiver payload, unredacted log, or screenshot is received |
| `reject_false_green_claim` | Reject when route 200, dashboard up, container up, receiver reachable, CD success, or UI visibility is offered as acceptance |
| `ready_for_monitoring_post_incident_review` | Once the metadata qualifies, the packet may only proceed to reviewer review |
| `recurrence_guard_backfill_required` | A recurrence guard, owner review, change freeze, automation block, or noise budget owner must be backfilled |
| `waiting_runtime_gate` | Even if the readback is accepted, the runtime gate still requires separate manual approval |

6. Fixed blocked actions

This stage explicitly blocks `prometheus_reload`, `alertmanager_reload`, `grafana_dashboard_apply`, `signoz_rule_apply`, `sentry_deploy`, `langfuse_config_change`, `otel_collector_reload`, `receiver_route_change`, `silence_policy_change`, `telegram_send`, `notification_route_change`, `webhook_receiver_change`, `remote_write_change`, `exporter_deploy`, `live_alert_fire`, `alert_chain_smoke`, `ssh_read`, `ssh_write`, `kubectl_action`, `secret_value_collection`, `host_write`, `active_scan`, `production_write`, raw payload storage, raw live config storage, full receiver payload storage, Bot token / DSN / webhook secret collection, any alert rule / scrape config / dashboard / notification policy / silence / inhibit change, and any action button.
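
A denylist like the one above is straightforward to enforce in tooling. The sketch below covers only the 23 identifier-style actions named in this section; `BlockedActionError`, `assert_action_allowed`, and the `render_readback_report` action in the usage note are hypothetical names, not part of the actual script.

```python
# The identifier-style actions listed in this section. The free-text
# entries (for example raw payload storage) are not modelled here.
BLOCKED_ACTIONS = frozenset({
    "prometheus_reload", "alertmanager_reload", "grafana_dashboard_apply",
    "signoz_rule_apply", "sentry_deploy", "langfuse_config_change",
    "otel_collector_reload", "receiver_route_change", "silence_policy_change",
    "telegram_send", "notification_route_change", "webhook_receiver_change",
    "remote_write_change", "exporter_deploy", "live_alert_fire",
    "alert_chain_smoke", "ssh_read", "ssh_write", "kubectl_action",
    "secret_value_collection", "host_write", "active_scan",
    "production_write",
})


class BlockedActionError(RuntimeError):
    """Raised when a blocked action is requested at this stage."""


def assert_action_allowed(action: str) -> None:
    """Raise if the requested action is on the blocked list."""
    if action in BLOCKED_ACTIONS:
        raise BlockedActionError(f"{action} is blocked at this stage")
```

Calling `assert_action_allowed("telegram_send")` raises, while a read-only step such as `assert_action_allowed("render_readback_report")` returns without error. A stricter design would allowlist the few permitted read-only steps instead of denying known writes.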

7. Current boundary

This artifact only indicates that the post-incident readback plan has been established. `post_incident_readback_received_count`, `post_incident_readback_accepted_count`, `receiver_receipt_readback_accepted_count`, `stale_pending_resolved_review_accepted_count`, `silence_mute_dedup_inhibit_review_accepted_count`, `alert_chain_health_readback_accepted_count`, `prometheus_reload_authorized_count`, `alertmanager_reload_authorized_count`, `telegram_send_authorized_count`, `alert_chain_smoke_authorized_count`, `runtime_gate_count`, and `action_button_count` all remain 0.
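
The zero-count boundary above lends itself to an automated invariant check. This sketch assumes the snapshot can be read as a flat mapping from counter name to integer; the actual layout of `monitoring-post-incident-readback-plan.snapshot.json` is not shown here, so that shape is an assumption.

```python
# Counter names as listed in this section.
ZERO_COUNTERS = (
    "post_incident_readback_received_count",
    "post_incident_readback_accepted_count",
    "receiver_receipt_readback_accepted_count",
    "stale_pending_resolved_review_accepted_count",
    "silence_mute_dedup_inhibit_review_accepted_count",
    "alert_chain_health_readback_accepted_count",
    "prometheus_reload_authorized_count",
    "alertmanager_reload_authorized_count",
    "telegram_send_authorized_count",
    "alert_chain_smoke_authorized_count",
    "runtime_gate_count",
    "action_button_count",
)


def nonzero_counters(snapshot: dict) -> dict:
    """Return every counter that is missing, non-integer, or not 0."""
    violations = {}
    for name in ZERO_COUNTERS:
        value = snapshot.get(name)
        # type(...) is int rejects booleans, so False cannot pass as 0.
        if type(value) is not int or value != 0:
            violations[name] = value
    return violations
```

An empty result means the boundary holds. A missing counter is reported with the value `None`, so it cannot pass silently.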