# IwoooS Monitoring / Alerting / Observability Post-Incident Readback Plan

| Item | Value |
|------|------|
| Date | 2026-06-15 |
| Status | `post_incident_readback_plan_ready_no_runtime_action` |
| Tool | `scripts/security/monitoring-post-incident-readback-plan.py` |
| Snapshot | `docs/security/monitoring-post-incident-readback-plan.snapshot.json` |
| Source | `docs/security/monitoring-owner-response-acceptance.snapshot.json` |
| runtime gate | `0` |

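## 1. Purpose

This plan turns post-incident readback for Monitoring / Alerting / Observability from "the service looks recovered" into a read-only ledger that can be rerun, reviewed, and rejected.

The risk it addresses is not about issuing commands to the production monitoring system. Instead, it first defines which redacted refs a future owner must provide in order to show who changed an alert / receiver / silence, when the anomaly occurred, the alert state before and after the change, whether the receiver actually received the notification, whether stale state or a silence caused a false green, how to roll back, and how to prevent recurrence.

This artifact does not connect to live Prometheus, reload Alertmanager, apply anything to Grafana / SigNoz / Sentry / Langfuse, reload OTEL, change receiver routes, create silences, send Telegram messages, fire live alerts, run alert chain smoke tests, SSH, run kubectl, read secret values, store raw alert payloads, or write to production.
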
## 2. Fixed numbers

| Metric | Value |
|------|------|
| readback candidate | `60` |
| write-capable candidate | `11` |
| live evidence required candidate | `60` |
| alert rule readback candidate | `13` |
| deploy / reload readback candidate | `6` |
| required readback fields | `30` |
| reviewer checks | `28` |
| outcome lanes | `11` |
| blocked actions | `53` |
| post-incident readback received / accepted | `0 / 0` |
| runtime gate | `0` |

## 3. Required readback fields

Every candidate must supply all of the following metadata-only refs before it can enter reviewer review:

1. incident / change / outage ref.
2. actor role / team attribution ref.
3. change / outage time window ref.
4. change intent or break-glass reason ref.
5. before / after alert state refs.
6. rule, datasource, scrape, remote write, or exporter state ref.
7. receiver route / notification policy / webhook / Telegram route state ref.
8. reload / no-reload ref.
9. receiver receipt readback ref.
10. stale / pending / resolved review ref.
11. silence / mute / dedup / inhibit / maintenance rule review ref.
12. dashboard / trace / log freshness ref.
13. notification delivery metadata ref.
14. alert chain health readback ref.
15. cross-project sync ref.
16. rollback / disable validation ref.
17. post-change monitoring ref.
18. independent postcheck readback ref.
19. recurrence guard ref.
20. maintenance window, rollback owner, and followup owner.
21. redacted evidence refs, plus no-secret-value, no-raw-payload, no-false-green, receipt-not-route-only, independent-postcheck, and noise-budget / silence-owner attestations.

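
The completeness rule above could be pre-screened along the following lines. This is only a minimal sketch: the key names in `REQUIRED_REFS` are hypothetical slugs for the first few list items, and the helper names are invented for illustration. The real schema used by the plan script is not shown in this document.

```python
# Hypothetical slugs for a subset of the required refs listed above.
# The actual snapshot schema is not documented here.
REQUIRED_REFS = [
    "incident_ref",
    "actor_attribution_ref",
    "time_window_ref",
    "change_intent_ref",
    "before_alert_state_ref",
    "after_alert_state_ref",
]


def missing_refs(candidate: dict) -> list[str]:
    """Return the required keys that are absent or blank in a candidate."""
    return [
        key for key in REQUIRED_REFS
        if not str(candidate.get(key) or "").strip()
    ]


def ready_for_review(candidate: dict) -> bool:
    """A candidate may enter reviewer review only when nothing is missing."""
    return not missing_refs(candidate)
```

A candidate that supplies only some refs stays out of review, and `missing_refs` tells the owner which ones to add.
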
## 4. Reviewer checks

The reviewer must confirm that the source snapshot is the current version, then check each of the following item by item: actor, time window, change intent, before / after alert state, rule / datasource / scrape state, receiver route state, reload / no-reload, receiver receipt, stale / silence review, dashboard / trace / log freshness, notification metadata, alert chain health, cross-project sync, rollback, post-change monitoring, independent postcheck, recurrence guard, maintenance window, noise budget owner, redacted refs, secret absence, raw payload absence, no-false-green, runtime stays zero, and count transition safe.

A route `200`, dashboard up, container up, receiver reachable, CD success, or visibility in the UI must not be treated as acceptance of the alert chain.

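
The no-false-green rule can be expressed as a set check. The sketch below assumes that each piece of submitted evidence carries a kind label. Both the labels and the function name are hypothetical and are not taken from the plan script.

```python
# Surface signals named in the text above, as hypothetical slug labels.
SURFACE_SIGNALS = {
    "route_200",
    "dashboard_up",
    "container_up",
    "receiver_reachable",
    "cd_success",
    "ui_visible",
}


def is_false_green(evidence_kinds: set[str]) -> bool:
    """True when a claim rests only on surface signals.

    An empty set is not a false green; it simply means nothing was submitted.
    """
    return bool(evidence_kinds) and evidence_kinds <= SURFACE_SIGNALS
```

Under this sketch a claim backed by `route_200` and `cd_success` alone is flagged, while a claim that also includes some other evidence kind, such as a receiver receipt readback, is passed on to the remaining checks.
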
## 5. Triage lanes

| lane | Purpose |
|------|------|
| `waiting_post_incident_readback` | Readback package not yet received; all accepted / runtime counts stay at `0` |
| `request_actor_or_time_supplement` | Missing actor, time window, intent, or break-glass reason |
| `request_alert_state_supplement` | Missing before / after alert state, or rule, datasource, scrape, remote write, or exporter state |
| `request_receiver_receipt_supplement` | Missing receiver route state, receipt proof, delivery metadata, or notification owner |
| `request_stale_silence_supplement` | Missing stale / pending / resolved, silence, mute, dedup, inhibit, or maintenance review |
| `request_freshness_or_alert_chain_supplement` | Missing freshness, alert chain health, post-change monitoring, or rollback validation |
| `quarantine_raw_payload` | Quarantine when a secret, raw config, raw alert payload, raw receiver payload, unredacted log, or screenshot is received |
| `reject_false_green_claim` | Reject when route 200, dashboard up, container up, receiver reachable, CD success, or visibility in the UI is presented as acceptance |
| `ready_for_monitoring_post_incident_review` | Once the metadata qualifies, the candidate may only proceed to reviewer review |
| `recurrence_guard_backfill_required` | Needs backfill of a recurrence guard, owner review, change freeze, automation block, or noise budget owner |
| `waiting_runtime_gate` | Even when the readback is accepted, the runtime gate still requires separate manual approval |

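
The lane table can be read as a routing function. The sketch below is one possible reading, not the plan script's logic. The submission keys (`contains_raw_payload`, `false_green_claim`, `missing`), the field slugs, and the precedence order are all assumptions. `waiting_runtime_gate` is deliberately not modeled, because it depends on a manual approval outside the readback itself.

```python
# Lane names come from the table above; the field slugs are hypothetical.
SUPPLEMENT_LANES = [
    ("request_actor_or_time_supplement",
     {"actor", "time_window", "intent"}),
    ("request_alert_state_supplement",
     {"before_alert_state", "after_alert_state", "rule_state"}),
    ("request_receiver_receipt_supplement",
     {"receiver_route_state", "receipt_proof", "delivery_metadata"}),
    ("request_stale_silence_supplement",
     {"stale_review", "silence_review"}),
    ("request_freshness_or_alert_chain_supplement",
     {"freshness", "alert_chain_health", "rollback_validation"}),
    ("recurrence_guard_backfill_required",
     {"recurrence_guard"}),
]


def choose_lane(submission):
    """Pick one lane for a submission (None means nothing was received)."""
    if submission is None:
        return "waiting_post_incident_readback"
    # Assumed precedence: safety lanes win over supplement requests.
    if submission.get("contains_raw_payload"):
        return "quarantine_raw_payload"
    if submission.get("false_green_claim"):
        return "reject_false_green_claim"
    missing = set(submission.get("missing", []))
    for lane, fields in SUPPLEMENT_LANES:
        if missing & fields:
            return lane
    return "ready_for_monitoring_post_incident_review"
```

Checking quarantine and rejection first means a package containing raw payloads is never routed to a supplement request, which seems consistent with the intent of the table, although the source does not state an ordering.
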
## 6. Fixed blocked actions

This stage explicitly blocks `prometheus_reload`, `alertmanager_reload`, `grafana_dashboard_apply`, `signoz_rule_apply`, `sentry_deploy`, `langfuse_config_change`, `otel_collector_reload`, `receiver_route_change`, `silence_policy_change`, `telegram_send`, `notification_route_change`, `webhook_receiver_change`, `remote_write_change`, `exporter_deploy`, `live_alert_fire`, `alert_chain_smoke`, `ssh_read`, `ssh_write`, `kubectl_action`, `secret_value_collection`, `host_write`, `active_scan`, `production_write`, `raw payload storage`, `raw live config storage`, `full receiver payload storage`, `Bot token / DSN / webhook secret collection`, `alert rule / scrape config / dashboard / notification policy / silence / inhibit change`, and any action button.

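
A deny-list like the one above is typically enforced by a guard that runs before any action is dispatched. The sketch below uses a subset of the action identifiers from the text. The `guard` function and `BlockedActionError` are hypothetical names, and nothing here reflects how the plan script actually enforces the list.

```python
# Subset of the blocked action identifiers listed above.
BLOCKED_ACTIONS = frozenset({
    "prometheus_reload",
    "alertmanager_reload",
    "telegram_send",
    "live_alert_fire",
    "alert_chain_smoke",
    "ssh_read",
    "ssh_write",
    "kubectl_action",
    "production_write",
})


class BlockedActionError(RuntimeError):
    """Raised when a caller requests an action this stage forbids."""


def guard(action: str) -> None:
    """Raise if the action is on the deny-list; otherwise do nothing."""
    if action in BLOCKED_ACTIONS:
        raise BlockedActionError(f"blocked in this stage: {action}")
```

Because this is a deny-list, unknown action names pass through. A stricter variant would allow only an explicit set of read-only operations.
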
## 7. Current boundary

This artifact only indicates that the post-incident readback plan has been established. `post_incident_readback_received_count`, `post_incident_readback_accepted_count`, `receiver_receipt_readback_accepted_count`, `stale_pending_resolved_review_accepted_count`, `silence_mute_dedup_inhibit_review_accepted_count`, `alert_chain_health_readback_accepted_count`, `prometheus_reload_authorized_count`, `alertmanager_reload_authorized_count`, `telegram_send_authorized_count`, `alert_chain_smoke_authorized_count`, `runtime_gate_count`, and `action_button_count` all remain at `0`.

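
The zero-count boundary lends itself to an automated check. The sketch below uses the counter names listed above, but it assumes they appear as top-level integer keys in the snapshot, which this document does not confirm. The function name is hypothetical.

```python
# Counter names as listed in the text above.
ZERO_COUNT_KEYS = [
    "post_incident_readback_received_count",
    "post_incident_readback_accepted_count",
    "receiver_receipt_readback_accepted_count",
    "stale_pending_resolved_review_accepted_count",
    "silence_mute_dedup_inhibit_review_accepted_count",
    "alert_chain_health_readback_accepted_count",
    "prometheus_reload_authorized_count",
    "alertmanager_reload_authorized_count",
    "telegram_send_authorized_count",
    "alert_chain_smoke_authorized_count",
    "runtime_gate_count",
    "action_button_count",
]


def nonzero_counts(snapshot: dict) -> dict:
    """Return every counter that is missing or not exactly the integer 0."""
    bad = {}
    for key in ZERO_COUNT_KEYS:
        value = snapshot.get(key)
        # Assumption: top-level keys. A missing key counts as a violation.
        if type(value) is not int or value != 0:
            bad[key] = value
    return bad
```

An empty result means the boundary holds. Treating a missing key as a violation keeps a truncated snapshot from passing silently.
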