# IwoooS Monitoring / Alerting / Observability Post-Incident Readback Plan

| Item | Value |
|------|------|
| Date | 2026-06-15 |
| Status | `post_incident_readback_plan_ready_no_runtime_action` |
| Tool | `scripts/security/monitoring-post-incident-readback-plan.py` |
| Snapshot | `docs/security/monitoring-post-incident-readback-plan.snapshot.json` |
| Source | `docs/security/monitoring-owner-response-acceptance.snapshot.json` |
| runtime gate | `0` |

## 1. Purpose

This plan moves post-incident readback for Monitoring / Alerting / Observability from "the service looks recovered" to a read-only ledger that can be re-run, reviewed, and rejected.

It does not address the risk by issuing commands to the production monitoring systems. Instead, it first defines which redacted refs a future owner must provide in order to show who changed an alert / receiver / silence, when the anomaly occurred, the alert state before and after the change, whether the receiver actually received the notification, whether stale / silence caused a false green, how to roll back, and how to prevent recurrence.

This artifact does not connect to live Prometheus, reload Alertmanager, apply anything to Grafana / SigNoz / Sentry / Langfuse, reload OTEL, change receiver routes, create silences, send Telegram messages, fire live alerts, run an alert chain smoke test, use SSH, use kubectl, read secret values, store raw alert payloads, or write to production.

## 2. Fixed numbers

| Metric | Value |
|------|------|
| readback candidate | `60` |
| write-capable candidate | `11` |
| live evidence required candidate | `60` |
| alert rule readback candidate | `13` |
| deploy / reload readback candidate | `6` |
| required readback fields | `30` |
| reviewer checks | `28` |
| outcome lanes | `11` |
| blocked actions | `53` |
| post-incident readback received / accepted | `0 / 0` |
| runtime gate | `0` |

## 3. Required readback fields

Every candidate must supply the following metadata-only refs before it can enter reviewer review:

1. incident / change / outage ref.
2. actor role / team attribution ref.
3. change / outage time window ref.
4. change intent or break-glass reason ref.
5. before / after alert state refs.
6. rule, datasource, scrape, remote write, or exporter state ref.
7. receiver route / notification policy / webhook / Telegram route state ref.
8. reload / no-reload ref.
9. receiver receipt readback ref.
10. stale / pending / resolved review ref.
11. silence / mute / dedup / inhibit / maintenance rule review ref.
12. dashboard / trace / log freshness ref.
13. notification delivery metadata ref.
14. alert chain health readback ref.
15. cross-project sync ref.
16. rollback / disable validation ref.
17. post-change monitoring ref.
18. independent postcheck readback ref.
19. recurrence guard ref.
20. maintenance window, rollback owner, followup owner.
21. redacted evidence refs, no-secret-value, no-raw-payload, no-false-green, receipt-not-route-only, independent-postcheck, and noise-budget / silence-owner attestation.

## 4. Reviewer checks

The reviewer must confirm that the source snapshot is the current version, then check each of the following item by item: actor, time window, change intent, before / after alert state, rule / datasource / scrape state, receiver route state, reload / no-reload, receiver receipt, stale / silence review, dashboard / trace / log freshness, notification metadata, alert chain health, cross-project sync, rollback, post-change monitoring, independent postcheck, recurrence guard, maintenance window, noise budget owner, redacted refs, secret absence, raw payload absence, no-false-green, runtime stays zero, and count transition safe.

A route `200`, dashboard up, container up, receiver reachable, CD success, or UI visibility must not be treated as acceptance of the alert chain.

## 5. Triage lanes

| lane | Purpose |
|------|------|
| `waiting_post_incident_readback` | No readback package received yet; all accepted / runtime counts stay at `0` |
| `request_actor_or_time_supplement` | Missing actor, time window, intent, or break-glass reason |
| `request_alert_state_supplement` | Missing before / after alert state, rule, datasource, scrape, remote write, or exporter state |
| `request_receiver_receipt_supplement` | Missing receiver route state, receipt proof, delivery metadata, or notification owner |
| `request_stale_silence_supplement` | Missing stale / pending / resolved, silence, mute, dedup, inhibit, or maintenance review |
| `request_freshness_or_alert_chain_supplement` | Missing freshness, alert chain health, post-change monitoring, or rollback validation |
| `quarantine_raw_payload` | Quarantine on receipt of a secret, raw config, raw alert payload, raw receiver payload, or unredacted logs or screenshots |
| `reject_false_green_claim` | Reject when route 200, dashboard up, container up, receiver reachable, CD success, or UI visibility is offered as acceptance |
| `ready_for_monitoring_post_incident_review` | Once the metadata qualifies, the candidate may only proceed to reviewer review |
| `recurrence_guard_backfill_required` | Needs backfill of a recurrence guard, owner review, change freeze, automation block, or noise budget owner |
| `waiting_runtime_gate` | Even if the readback is accepted, the runtime gate still requires separate manual approval |

## 6. Fixed blocked actions

This stage explicitly blocks `prometheus_reload`, `alertmanager_reload`, `grafana_dashboard_apply`, `signoz_rule_apply`, `sentry_deploy`, `langfuse_config_change`, `otel_collector_reload`, `receiver_route_change`, `silence_policy_change`, `telegram_send`, `notification_route_change`, `webhook_receiver_change`, `remote_write_change`, `exporter_deploy`, `live_alert_fire`, `alert_chain_smoke`, `ssh_read`, `ssh_write`, `kubectl_action`, `secret_value_collection`, `host_write`, `active_scan`, `production_write`, `raw payload storage`, `raw live config storage`, `full receiver payload storage`, `Bot token / DSN / webhook secret collection`, `alert rule / scrape config / dashboard / notification policy / silence / inhibit change`, and any action button.

## 7. Current boundary

This artifact only indicates that the post-incident readback plan has been established. `post_incident_readback_received_count`, `post_incident_readback_accepted_count`, `receiver_receipt_readback_accepted_count`, `stale_pending_resolved_review_accepted_count`, `silence_mute_dedup_inhibit_review_accepted_count`, `alert_chain_health_readback_accepted_count`, `prometheus_reload_authorized_count`, `alertmanager_reload_authorized_count`, `telegram_send_authorized_count`, `alert_chain_smoke_authorized_count`, `runtime_gate_count`, and `action_button_count` all remain at `0`.
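
As a rough illustration of how the lanes in section 5 and the zero-count boundary in section 7 might be checked mechanically, the following Python sketch routes a metadata-only readback packet to a lane and lists any boundary counter that is not exactly `0`. It is not the logic of `scripts/security/monitoring-post-incident-readback-plan.py`. The lane names and counter names are taken from this document; the packet keys (`actor_ref`, `time_window_ref`, and so on), the `contains_raw_payload` and `acceptance_basis` fields, the flat snapshot layout, and the order of the checks are all assumptions made for the example.

```python
from typing import List, Mapping, Optional

# Counter names copied from section 7; a flat key/value snapshot is an assumption.
ZERO_COUNTERS = (
    "post_incident_readback_received_count",
    "post_incident_readback_accepted_count",
    "receiver_receipt_readback_accepted_count",
    "stale_pending_resolved_review_accepted_count",
    "silence_mute_dedup_inhibit_review_accepted_count",
    "alert_chain_health_readback_accepted_count",
    "prometheus_reload_authorized_count",
    "alertmanager_reload_authorized_count",
    "telegram_send_authorized_count",
    "alert_chain_smoke_authorized_count",
    "runtime_gate_count",
    "action_button_count",
)

# Lane names come from section 5; every packet key below is hypothetical.
SUPPLEMENT_LANES = (
    ("request_actor_or_time_supplement",
     ("actor_ref", "time_window_ref", "intent_ref")),
    ("request_alert_state_supplement",
     ("before_alert_state_ref", "after_alert_state_ref", "rule_state_ref")),
    ("request_receiver_receipt_supplement",
     ("receiver_route_state_ref", "receiver_receipt_ref", "delivery_metadata_ref")),
    ("request_stale_silence_supplement",
     ("stale_review_ref", "silence_review_ref")),
    ("request_freshness_or_alert_chain_supplement",
     ("freshness_ref", "alert_chain_health_ref",
      "post_change_monitoring_ref", "rollback_validation_ref")),
)

# Signals that section 4 says must not count as alert chain acceptance
# (hypothetical labels for them).
FALSE_GREEN_CLAIMS = frozenset({
    "route_200", "dashboard_up", "container_up",
    "receiver_reachable", "cd_success", "ui_visible",
})


def nonzero_counters(snapshot: Mapping[str, object]) -> List[str]:
    """Return the boundary counters that are missing or not exactly integer 0."""
    bad = []
    for name in ZERO_COUNTERS:
        value = snapshot.get(name)
        if type(value) is not int or value != 0:
            bad.append(name)
    return bad


def triage(packet: Optional[Mapping[str, object]]) -> str:
    """Pick a lane for one readback packet; the check order is an assumption."""
    if packet is None:
        return "waiting_post_incident_readback"
    if packet.get("contains_raw_payload"):
        return "quarantine_raw_payload"
    if FALSE_GREEN_CLAIMS & set(packet.get("acceptance_basis") or ()):
        return "reject_false_green_claim"
    for lane, keys in SUPPLEMENT_LANES:
        if any(not packet.get(key) for key in keys):
            return lane
    if not packet.get("recurrence_guard_ref"):
        return "recurrence_guard_backfill_required"
    return "ready_for_monitoring_post_incident_review"
```

The sketch never returns `waiting_runtime_gate` and never changes a counter: reaching `ready_for_monitoring_post_incident_review` only means the packet may go to reviewer review, and the runtime gate still requires separate manual approval.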