fix(UX): take down 28 ghost category buttons + ADR-079 Phase 5 completion plan
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
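The Commander's (統帥) full audit at 2026-04-14 20:00 revealed:
All 28 buttons in _CATEGORY_BUTTONS have been dead for 3 days (since commit 325b3851 on 2026-04-11)
- callback_data format is wrong throughout (3-part does not match the parser's 4-part/2-part)
- grep of apps/api/src finds no dispatch handler at all
- The Commander hit it for real today: tapping "查程序" (check process) did nothing → trust broken
Chief architect's ruling (C: tiered):
A. Take down immediately (this commit): _CATEGORY_BUTTONS = {}, falling back to the generic buttons
B. Phase 5 completion (planned in ADR-079, 3-5 days, implemented in a separate sprint)
Generic buttons kept (all ✅):
- Approve / Reject / Silence (4-part nonce)
- Details / History / Re-diagnose (2-part info)
New defensive documentation:
- ADR-079: Phase 5 work breakdown + per-button checklist
- feedback_no_ghost_buttons.md (memory): the iron rule on ghost buttons
Design principle, permanently on record: better no button than a dead button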
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@@ -1376,65 +1376,16 @@ class TelegramGateway:
             alert_category: alert category (ADR-071-E: determines the TYPE-3 button set)
             notification_type: notification type (TYPE-1/2/3/4/4D)
         """
-        # TYPE-3 dynamic action buttons (ADR-071-E)
-        # ADR-075: standardize on kubernetes (drop the old k8s_workload); add storage/external_site/alertchain_health/flywheel_health
-        _CATEGORY_BUTTONS: dict[str, list[tuple[str, str]]] = {
-            "kubernetes": [
-                ("🔄 重啟", f"action:restart:{incident_id}"),
-                ("📈 擴容", f"action:scale_up:{incident_id}"),
-                ("📉 縮容", f"action:scale_down:{incident_id}"),
-                ("⏪ 回滾", f"action:rollback:{incident_id}"),
-            ],
-            "database": [
-                ("🛑 終止慢查詢", f"action:kill_slow_query:{incident_id}"),
-                ("🔄 清連線池", f"action:clear_conn_pool:{incident_id}"),
-            ],
-            "host_resource": [
-                ("🔍 查程序", f"action:check_process:{incident_id}"),
-                ("🔄 重啟服務", f"action:restart_service:{incident_id}"),
-                ("🗑 清 Log", f"action:clear_log:{incident_id}"),
-            ],
-            "network": [
-                ("🔄 重載 Nginx", f"action:reload_nginx:{incident_id}"),
-                ("🔌 查 Port", f"action:check_port:{incident_id}"),
-            ],
-            "devops_tool": [
-                ("🔄 重啟服務", f"action:restart_service:{incident_id}"),
-                ("📋 查 Log", f"action:check_log:{incident_id}"),
-            ],
-            "storage": [
-                ("🔄 重啟 MinIO", f"action:restart_service:{incident_id}"),
-                ("📋 查 Log", f"action:check_log:{incident_id}"),
-            ],
-            "external_site": [
-                ("🔍 查健康狀態", f"action:check_health:{incident_id}"),
-                ("📋 查 Log", f"action:check_log:{incident_id}"),
-            ],
-            # Category buttons added by ADR-075 (2026-04-12 ogt)
-            "secops": [
-                ("🚫 隔離資源", f"secops_isolate:{incident_id}"),
-                ("⛔ 封鎖來源 IP", f"secops_block_ip:{incident_id}"),
-                ("🔄 強制驅逐", f"secops_evict:{incident_id}"),
-                ("✅ 確認授權", f"secops_authorize:{incident_id}"),
-            ],
-            "business": [
-                ("⏸️ 暫停 1h", f"action:pause_1h:{incident_id}"),
-                ("🔍 查 SignOz", f"action:open_signoz:{incident_id}"),
-                ("❌ 忽略", f"action:ignore:{incident_id}"),
-            ],
-            "flywheel_health": [
-                ("🔄 觸發診斷", f"flywheel_diagnose:{incident_id}"),
-                ("📊 飛輪面板", f"action:open_flywheel:{incident_id}"),
-                ("🔕 靜默", f"action:silence:{incident_id}"),
-            ],
-            # alertchain_health → TYPE-8M → send_meta_alert; does not go through this dict
-            "ai_system": [
-                ("🔀 切換 Provider", f"action:switch_provider:{incident_id}"),
-            ],
-            "ssl_cert": [
-                ("🔐 更新憑證", f"action:renew_cert:{incident_id}"),
-            ],
-        }
+        # 2026-04-14 ruling by Claude Sonnet 4.6 (chief architect):
+        # All 28 category buttons in the original _CATEGORY_BUTTONS are taken down: trust > features
+        # Audit evidence (the Commander's full audit, 2026-04-14 20:00):
+        # - callback_data format is wrong throughout (3-part action:xxx:id does not match the parser's 4-part nonce / 2-part info)
+        # - 0 backend handlers (grep of apps/api/src finds no dispatch elif branch)
+        # - dead for 3 days, since commit 325b3851 on 2026-04-11
+        # - the Commander hit it for real today: tapping "查程序" (check process) did nothing → trust broken
+        # Officially taken down in favor of the generic buttons (approve/reject/silence/detail/history/reanalyze)
+        # Completion plan: MASTER blueprint Phase 5, "category button completion" (separate ADR)
+        _CATEGORY_BUTTONS: dict[str, list[tuple[str, str]]] = {}  # temporarily empty; falls back to the generic buttons

         # Generate nonce (replay protection, used for write operations)
         approve_nonce = self._security.generate_callback_nonce(approval_id, "approve")
86
docs/adr/ADR-079-category-buttons-phase5.md
Normal file
@@ -0,0 +1,86 @@
# ADR-079: Telegram Category Button Takedown + Phase 5 Completion Plan

> **Date**: 2026-04-14 (late night, Taipei)
> **Status**: ✅ Accepted (takedown effective immediately; Phase 5 awaits later implementation)
> **Author**: Claude Sonnet 4.6 (chief architect) + the Commander's audit
> **Related**: ADR-071 notification types, ADR-075 Telegram standard, [feedback_no_ghost_buttons.md](~/.claude/projects/-Users-ogt-awoooi/memory/feedback_no_ghost_buttons.md)

---
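
## Background

The Commander's full audit at 2026-04-14 20:00 found that all **28 category buttons** defined in `telegram_gateway._CATEGORY_BUTTONS` were **ghosts**: visible in the UI, with nothing behind them.

| Check | Result |
|------|------|
| callback_data format | ❌ 3-part `action:xxx:id`; the parser only accepts 4-part nonce / 2-part info |
| Backend dispatch handler | ❌ grep of `apps/api/src` finds no branch such as `elif action == "restart_service":` |
| Underlying MCP capability | ✅ 6 K8s tools + 15 SSH tools are all in place (the capability exists; it just was never wired up) |

The buttons had been dead for **3 days**, since commit `325b3851` on 2026-04-11. The Commander actually tapped "查程序" (check process) today and got no response at all → **trust broken**.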
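
The format mismatch in the first table row can be illustrated with a minimal sketch. The real parser is `_security.parse_callback_data()`, whose implementation is not shown in this ADR; `parse_callback_data_sketch` below is a hypothetical stand-in that assumes only what the table states, namely that 4-part and 2-part payloads are accepted and every other shape is rejected. The field names and sample payloads are assumptions. Note that a 2-part payload such as `secops_isolate:<id>` would pass this shape check and still be a ghost, because no handler exists for it.

```python
from typing import Optional


def parse_callback_data_sketch(data: str) -> Optional[dict]:
    """Hypothetical stand-in for the real parser: accept 4-part or 2-part payloads only."""
    parts = data.split(":")
    if len(parts) == 4:
        # Assumed layout for write actions protected by a nonce.
        action, target_id, nonce, signature = parts
        return {"kind": "nonce", "action": action, "id": target_id,
                "nonce": nonce, "sig": signature}
    if len(parts) == 2:
        # Assumed layout for read-only info buttons.
        action, target_id = parts
        return {"kind": "info", "action": action, "id": target_id}
    # Any other shape, including the 3-part ghost format, is rejected.
    return None


# The 3-part ghost payload is rejected; a 2-part info payload parses.
ghost = parse_callback_data_sketch("action:check_process:INC-1")
info = parse_callback_data_sketch("detail:INC-1")
```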

---

## Decision

### Immediate action (A: takedown)

**Stopgap**: set `_CATEGORY_BUTTONS: dict = {}` to an empty dict → fall back to the generic buttons (approve/reject/silence/detail/history/reanalyze).

**Impact**:
- Every TYPE-3 approval card shows only the 6 generic buttons (all verified ✅)
- Zero breakage: the approve/reject path works normally
- Trust repair: users can no longer tap a ghost button that does nothing
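
The fallback behaviour can be sketched as follows. This is an illustration, not the gateway's code: `buttons_for` and `GENERIC_ACTIONS` are hypothetical names, and the payloads are placeholders (the real generic buttons carry 4-part nonce or 2-part info data built elsewhere).

```python
_CATEGORY_BUTTONS: dict[str, list[tuple[str, str]]] = {}  # emptied by this ADR

# Placeholder list of the six generic actions named in the stopgap.
GENERIC_ACTIONS = ["approve", "reject", "silence", "detail", "history", "reanalyze"]


def buttons_for(alert_category: str, incident_id: str) -> list[tuple[str, str]]:
    """Return category-specific buttons if any exist, otherwise the generic ones."""
    specific = _CATEGORY_BUTTONS.get(alert_category, [])
    if specific:
        return specific
    # Placeholder payloads: only the fallback logic is the point here.
    return [(action, f"{action}:{incident_id}") for action in GENERIC_ACTIONS]


card_buttons = buttons_for("kubernetes", "INC-1")
```

With the dict empty, every category takes the fallback branch, which is why the takedown is a one-line change.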

### Follow-up plan (B: Phase 5)

**Phase 5, "category button completion"**, goes into the MASTER blueprint, estimated at 3-5 days.

#### Phase 5 work breakdown

| Sprint | Scope | Estimate |
|--------|------|------|
| 5.1 | Design the `action → MCP method` mapping table (write a spec for every button) | 0.5 day |
| 5.2 | Implement read-type buttons (no side effects): `check_process`/`check_port`/`check_log`/`check_health`/`open_signoz`/`open_flywheel` | 1 day |
| 5.3 | Implement write-type buttons (with side effects): `restart`/`scale_up`/`scale_down`/`rollback`/`restart_service`/`clear_log`/`reload_nginx`/`renew_cert` | 2 days |
| 5.4 | Security buttons (`secops_isolate`/`secops_block_ip`/`secops_evict`/`secops_authorize`) | 0.5 day |
| 5.5 | E2E tests: the full click → action → MCP → result reply chain | 1 day |

**Total**: 3-5 days, depending on resources (the sprint estimates sum to 5 days if run serially).
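
One possible shape for the sprint 5.1 mapping table is sketched below. It uses a lookup table rather than an `elif` chain, which is a choice made for this sketch and not the project's handler. Every MCP method name here is invented for illustration (this ADR does not list the actual tool names), and `ActionSpec`, `ACTION_TABLE`, and `dispatch` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ActionSpec:
    mcp_method: str        # hypothetical MCP tool name
    has_side_effect: bool  # write-type actions need a nonce and an audit log


# Illustrative subset only; the method names are assumptions.
ACTION_TABLE: dict[str, ActionSpec] = {
    "check_process": ActionSpec("ssh.list_processes", has_side_effect=False),
    "check_port": ActionSpec("ssh.check_port", has_side_effect=False),
    "restart": ActionSpec("k8s.rollout_restart", has_side_effect=True),
    "rollback": ActionSpec("k8s.rollout_undo", has_side_effect=True),
}


def dispatch(action: str, incident_id: str,
             call_mcp: Callable[[str, str], str]) -> str:
    """Look up the action and call MCP; unknown actions get an explicit reply."""
    spec = ACTION_TABLE.get(action)
    if spec is None:
        return f"Unsupported action: {action}"
    return call_mcp(spec.mcp_method, incident_id)


# Usage with a stub MCP client that just echoes its arguments.
reply = dispatch("check_process", "INC-1", lambda method, iid: f"{method}:{iid}")
```

An action missing from the table produces a visible "unsupported" reply instead of silence, so a mis-wired button fails loudly.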

#### Per-button checklist

Before any category button goes live, it must satisfy all of the following:
- [ ] Its callback_data passes validation by `_security.parse_callback_data()`
- [ ] The dispatch handler has a new `elif action == "xxx":` branch for it
- [ ] The MCP method it calls exists and is available
- [ ] Write-type buttons have a nonce + audit log
- [ ] E2E tests cover the full click → action → result reply chain
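
The first two checklist items lend themselves to an automated check. The sketch below flags any button whose payload fails a shape check, has no registered handler, or exceeds Telegram's 64-byte limit on `callback_data`. `find_ghost_buttons` and `sketch_parse_action` are hypothetical helpers; the shape check stands in for the real `_security.parse_callback_data()` and assumes the action sits in the first field.

```python
from typing import Optional

TELEGRAM_CALLBACK_DATA_MAX_BYTES = 64  # Bot API limit on callback_data


def sketch_parse_action(data: str) -> Optional[str]:
    """Hypothetical shape check: 4-part or 2-part payloads, action in the first field."""
    parts = data.split(":")
    return parts[0] if len(parts) in (2, 4) else None


def find_ghost_buttons(buttons, handled_actions, parse_action=sketch_parse_action):
    """Return (category, label, callback_data) for every button that cannot work."""
    ghosts = []
    for category, entries in buttons.items():
        for label, callback_data in entries:
            action = parse_action(callback_data)
            too_long = len(callback_data.encode("utf-8")) > TELEGRAM_CALLBACK_DATA_MAX_BYTES
            if action is None or action not in handled_actions or too_long:
                ghosts.append((category, label, callback_data))
    return ghosts


sample = {
    "host_resource": [("check process", "action:check_process:INC-1")],  # 3-part: bad shape
    "secops": [("isolate", "secops_isolate:INC-1")],                     # no handler
    "generic": [("detail", "detail:INC-1")],                             # fine
}
ghosts = find_ghost_buttons(sample, handled_actions={"detail", "history", "reanalyze"})
```

Run in CI against the real button dict and handler set, a check like this would fail the build when `ghosts` is non-empty.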
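
### Design principle (permanent)

> **Better no button than a dead button.**

- No button → the user knows to handle it manually; expectations are correct
- Dead button → the user assumes it will be handled automatically, taps it, and nothing happens → trust destroyed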

---

## Outcome

- ✅ Commit `XXXXX`: `_CATEGORY_BUTTONS = {}` takes down the 28 ghosts
- ✅ Memory `feedback_no_ghost_buttons.md` created (the iron rule is now documented)
- ⏳ Phase 5 queued in the MASTER blueprint, awaiting resource allocation
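
---

## Lessons

1. **Missed in PR review**: the ADR-075 commit only added the button definitions; nobody reviewed the callback_data format against the parser's rules
2. **No E2E tests**: unit tests only verified that buttons were generated, not the full flow after a tap
3. **Incomplete ADR spec**: ADR-075 only said "add _CATEGORY_BUTTONS" and never specified the action handler + MCP method behind each button
4. **Cost of trust**: for any user-facing UI element, what is visible must match what works; this is the core of productizing SRE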
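
Lesson 2 can be made concrete with a test that follows a tap all the way to the reply rather than stopping at button generation. The sketch below is an assumption-laden miniature: `FakeMcp` and `handle_click` are hypothetical, the handler map uses an invented MCP method name, and only the 2-part read path is modelled. The second test shows that a 3-part ghost payload produces an explicit reply and never reaches MCP.

```python
import unittest


class FakeMcp:
    """Records calls instead of talking to a real MCP server."""

    def __init__(self):
        self.calls = []

    def call(self, method, incident_id):
        self.calls.append((method, incident_id))
        return f"{method} ok for {incident_id}"


def handle_click(callback_data, mcp, handlers):
    """Hypothetical click path: parse, look up a handler, call MCP, return the reply."""
    parts = callback_data.split(":")
    if len(parts) != 2:
        return "Unsupported button"
    action, incident_id = parts
    method = handlers.get(action)
    if method is None:
        return "Unsupported button"
    return mcp.call(method, incident_id)


class ClickFlowTest(unittest.TestCase):
    def test_click_reaches_mcp_and_replies(self):
        mcp = FakeMcp()
        reply = handle_click("check_process:INC-1", mcp,
                             {"check_process": "ssh.list_processes"})
        self.assertEqual(mcp.calls, [("ssh.list_processes", "INC-1")])
        self.assertIn("INC-1", reply)

    def test_ghost_button_gets_explicit_reply(self):
        mcp = FakeMcp()
        reply = handle_click("action:check_process:INC-1", mcp, {})
        self.assertEqual(reply, "Unsupported button")
        self.assertEqual(mcp.calls, [])
```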

---

*Accepted by the Commander @ 2026-04-14, late night, Taipei*