Commit Graph

3678 Commits

Author SHA1 Message Date
Your Name
216b7d78e2 fix(backup): 接入 MOMO PG 備份失敗通知
Some checks failed
Code Review / ai-code-review (push) Successful in 11s
Ansible Lint / lint (push) Has been cancelled
2026-05-12 15:50:44 +08:00
Your Name
abdab85362 docs(awooop): record host backup notification deploy 2026-05-12 14:59:17 +08:00
Your Name
116fdbb33f docs(awooop): record ops notification deployment 2026-05-12 14:55:48 +08:00
AWOOOI CD
9db1e9b7a5 chore(cd): deploy 1a74286 [skip ci] 2026-05-12 14:48:50 +08:00
Your Name
1a74286dfa fix(awooop): mirror ops notifications through api
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
2026-05-12 14:43:09 +08:00
AWOOOI CD
b437a33043 chore(cd): deploy 03ba967 [skip ci] 2026-05-12 14:31:32 +08:00
Your Name
03ba9678d5 fix(awooop): label cicd outbound timeline
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m3s
CD Pipeline / build-and-deploy (push) Successful in 4m1s
CD Pipeline / post-deploy-checks (push) Successful in 1m26s
2026-05-12 14:26:29 +08:00
Your Name
d74beb2176 fix(ci): prevent docker lock self match
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
2026-05-12 14:21:57 +08:00
AWOOOI CD
f824308b6a chore(cd): deploy cb7151c [skip ci] 2026-05-12 06:12:20 +00:00
Your Name
cb7151cc27 fix(awooop): set shadow run defaults for mirrors
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m5s
CD Pipeline / build-and-deploy (push) Successful in 10m20s
CD Pipeline / post-deploy-checks (push) Successful in 2m33s
2026-05-12 14:01:03 +08:00
Your Name
ad8ead2546 fix(awooop): route ci notifications through event mirror
Some checks failed
Code Review / ai-code-review (push) Successful in 14s
CD Pipeline / tests (push) Successful in 1m18s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-12 13:58:08 +08:00
AWOOOI CD
d356cd32fc chore(cd): deploy 80c36ba [skip ci] 2026-05-07 19:00:45 +08:00
Your Name
80c36ba801 fix(incident): F2 NO_ACTION 觸發 resolve_incident + 冪等 guard
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m9s
CD Pipeline / build-and-deploy (push) Successful in 3m29s
CD Pipeline / post-deploy-checks (push) Successful in 1m30s
【根因】INC-20260507-99ADF2 飛輪斷流,566+ stuck incidents(30秒漲 1)核心
原因:NO_ACTION 路徑 (approval_execution.py:251) 提前 return True,跳過
line 482-495 已有的 resolve_incident 呼叫,incident 永遠卡 INVESTIGATING。

【修法】
- approval_execution.py NO_ACTION 分支補 resolve_incident 呼叫 + 成功/失敗
  log,背景 log 加 path="no_action" 用於 prod 量化修法生效率(debugger
  全鏈分析 + critic 1st/2nd 審查必修 #1)。
- incident_service.py resolve_incident 在 line 1106 加 RESOLVED 冪等 guard,
  早於所有副作用(status mutation / Redis / DB / postmortem / KB / KM /
  disposition),順帶修 success path line 482-495 重觸 postmortem 的潛在
  老風險(critic 必修 #2)。

【遵守 Codex 5/6 設計(feedback_respect_codex_design_intent.md)】
- 不動 flywheel_stats_service.py / heartbeat_report_service.py /
  auto_repair_service.record_auto_repair() / metrics_repository UPPER(status)。
- resolve_incident 不寫 auto_repair_executions 表(Codex 5/6 source of
  truth),不污染 24h KPI 計算。

【Test 覆蓋】
- test_approval_execution_no_action.py:NO_ACTION → resolve 被呼叫一次 +
  resolve raise 時仍 return True(NO_ACTION 不能因 resolve 失敗退化成 False,
  否則污染 auto_execute KPI line 207-208 註解契約)。
- test_incident_service_resolve_idempotency.py:RESOLVED → return existing +
  save_to_working_memory 不被呼叫;not_found → return None。

【驗收條件(部署後 24h)】
1. grep `path="no_action"` 中 incident_resolved_after_no_action_execution
   數量 vs background_execution_noop 數量,1:1 才算修復成功。
2. awoooi_flywheel_incidents_stuck 從每 30 秒漲 1 變平緩。
3. SRE 群 24h 內若湧入 >20 份 NO_ACTION postmortem 觸發 follow-up 評估
   resolution_type="no_action" 跳過 postmortem(critic Minor #3 方案 B)。

Refs: INC-20260507-99ADF2, debugger root cause #1 (鏈 A), critic 1st 必修
#1 #2, critic 2nd 必修 #1 #2 #3

Co-Authored-By: Codex (aider) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 18:55:58 +08:00
AWOOOI CD
afb5f9556e chore(cd): deploy b3dc41f [skip ci] 2026-05-07 15:37:50 +08:00
Your Name
b3dc41fcd4 fix(metrics): 串入飛輪指標到 /metrics 主端點,修復 FlywheelExecutionRateMissing 死告警
All checks were successful
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m3s
CD Pipeline / build-and-deploy (push) Successful in 3m28s
CD Pipeline / post-deploy-checks (push) Successful in 1m21s
INC-20260507-99ADF2 根因(feedback_full_chain_first_then_fix.md 全鏈分析):

【鏈路斷點】規則層(5/3 加)vs 指標層(5/6 改)vs scrape 層(從沒同步)
- 577250a6(5/3)「反消音化」commit 加了 FlywheelExecutionRateMissing
  rule,要求 110 Prom scrape 到 awoooi_flywheel_execution_success_rate;
- a2c4b3d4(5/6)Codex 改 FlywheelStatsService 用 auto_repair_executions
  作 source of truth(24h 樣本 1-9 筆回 None 給 W-3b watchdog 接管);
- 但 awoooi_flywheel_* 指標自始至終只在 /api/v1/stats/flywheel/metrics
  暴露,110 Prom awoooi-api job 抓的是 /metrics → absent() 永遠 1
  → 自 2026-05-06T04:14 UTC 起 firing 26h+ 屬 dead alert

【修法】只動 awoooi-api 一處,不碰 Codex 設計、不碰 110 Prom 配置:
- main.py /metrics endpoint 改 async,在 generate_latest() 後串入
  FlywheelStatsService.compute() → to_prometheus_lines()。
- 既有 awoooi-api scrape job 自動拿到飛輪指標。
- 完全保留 Codex a2c4b3d4 設計:1-9 筆回 None 讓 W-3b watchdog 雙保險。

【不碰的部分】
- flywheel_stats_service.py 不動:Codex 5/6 LOGBOOK 已明確說明
  「Redis playbook counter 失準 → 用 auto_repair_executions 為唯一信任源」,
  1-9 筆 return None 是配合 ai_slo_watchdog_job W-3b grace+30min 設計的
  反消音化雙保險,不是 bug。

驗證計畫(部署後):
1. curl /metrics | grep awoooi_flywheel  → 看到飛輪指標
2. Prom query awoooi_flywheel_execution_success_rate  → 非空
3. ALERTS{alertname="FlywheelExecutionRateMissing"}  → resolved
4. 30 分鐘觀察 Telegram 不再收 INC-20260507-99ADF2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 15:33:04 +08:00
Your Name
c88d82f2ac docs(logbook): record timeline label deploy [skip ci] 2026-05-07 10:48:24 +08:00
AWOOOI CD
395cf742b9 chore(cd): deploy 72d86ba [skip ci] 2026-05-07 10:44:52 +08:00
Your Name
72d86ba70b fix(awooop): label outbound timeline events
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m7s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m23s
2026-05-07 10:40:14 +08:00
Your Name
a26ccf8d80 docs(logbook): record capacity migration rollout [skip ci] 2026-05-07 10:35:55 +08:00
AWOOOI CD
77ef400598 chore(cd): deploy 32e8a04 [skip ci] 2026-05-07 10:33:09 +08:00
Your Name
08097f4070 fix(ci): harden migration audit logging
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
2026-05-07 10:32:41 +08:00
Your Name
32e8a045f4 fix(db): allow metric capacity violation types
Some checks failed
Code Review / ai-code-review (push) Successful in 11s
run-migration / migrate (push) Failing after 9s
CD Pipeline / tests (push) Successful in 1m4s
CD Pipeline / build-and-deploy (push) Successful in 3m29s
CD Pipeline / post-deploy-checks (push) Successful in 1m28s
2026-05-07 10:28:33 +08:00
Your Name
814f5d8c6c docs(logbook): record channel shadow run deploy [skip ci] 2026-05-07 10:21:23 +08:00
AWOOOI CD
4f0d677e18 chore(cd): deploy 5d38115 [skip ci] 2026-05-07 02:17:32 +00:00
Your Name
5d38115d2f fix(awooop): anchor legacy channel events to shadow runs
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 4m9s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-07 10:12:52 +08:00
Your Name
200b760512 docs(logbook): record approval timeline deploy [skip ci] 2026-05-07 10:09:42 +08:00
AWOOOI CD
83f4ab0dad chore(cd): deploy 2df36b1 [skip ci] 2026-05-07 10:06:30 +08:00
Your Name
2df36b11e2 fix(awooop): record approval decisions in run timeline
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 58s
CD Pipeline / build-and-deploy (push) Successful in 3m28s
CD Pipeline / post-deploy-checks (push) Successful in 1m21s
2026-05-07 10:01:58 +08:00
Your Name
1b7f46f02c docs(logbook): record cd 188 sync deploy [skip ci] 2026-05-07 09:56:17 +08:00
AWOOOI CD
6ae3a55aed chore(cd): deploy 94e680a [skip ci] 2026-05-07 01:52:22 +00:00
Your Name
94e680add4 fix(cd): split ssh and scp options for 188 sync
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
2026-05-07 09:46:17 +08:00
AWOOOI CD
4810125e9a chore(cd): deploy 3df2311 [skip ci] 2026-05-07 01:42:30 +00:00
Your Name
3df23112ef fix(awooop): reconnect approval decisions to run timeline
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 59s
CD Pipeline / build-and-deploy (push) Successful in 3m45s
CD Pipeline / post-deploy-checks (push) Successful in 1m17s
2026-05-07 09:37:45 +08:00
Your Name
2ccc9d3071 docs(logbook): record awooop action panel deploy [skip ci] 2026-05-07 09:32:40 +08:00
AWOOOI CD
624c1b26c3 chore(cd): deploy beba668 [skip ci] 2026-05-07 09:30:24 +08:00
Your Name
beba668a4c feat(awooop): add run detail action panel
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m7s
CD Pipeline / build-and-deploy (push) Successful in 3m27s
CD Pipeline / post-deploy-checks (push) Successful in 1m18s
2026-05-07 09:25:49 +08:00
Your Name
c52ebfc042 docs(logbook): record awooop run detail i18n deploy [skip ci] 2026-05-07 06:06:33 +08:00
AWOOOI CD
8b9a974c66 chore(cd): deploy f960a4a [skip ci] 2026-05-07 05:51:18 +08:00
Your Name
f960a4a19b fix(awooop): localize run detail timeline
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m22s
2026-05-07 05:46:31 +08:00
Your Name
9d85ec5e96 docs(logbook): record awooop timeline deploy [skip ci] 2026-05-07 05:05:16 +08:00
AWOOOI CD
c00c7be9ae chore(cd): deploy 336fd76 [skip ci] 2026-05-06 20:25:22 +00:00
Your Name
336fd76774 fix(ssh): suppress asyncssh info log formatting noise
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m22s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m17s
2026-05-07 04:20:26 +08:00
AWOOOI CD
cd637ef616 chore(cd): deploy 66e22e2 [skip ci] 2026-05-06 20:00:17 +00:00
Your Name
66e22e26cb feat(awooop): add run detail timeline
All checks were successful
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m18s
CD Pipeline / build-and-deploy (push) Successful in 3m58s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
2026-05-07 03:55:01 +08:00
Your Name
f10ab71c52 docs(logbook): record auto repair handoff card deploy [skip ci] 2026-05-07 02:15:48 +08:00
AWOOOI CD
d5555697a1 chore(cd): deploy 3f69e03 [skip ci] 2026-05-06 18:12:48 +00:00
Your Name
3f69e03fcb fix(telegram): clarify auto repair handoff cards
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m17s
CD Pipeline / build-and-deploy (push) Successful in 3m47s
CD Pipeline / post-deploy-checks (push) Successful in 1m57s
2026-05-07 02:07:43 +08:00
Your Name
57df3582dd docs(logbook): record grouped alert digest deploy [skip ci] 2026-05-07 02:00:34 +08:00
AWOOOI CD
14180182d3 chore(cd): deploy 6ac61ab [skip ci] 2026-05-06 17:56:12 +00:00
Your Name
6ac61ab6d7 fix(telegram): digest grouped alert storms
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 3m39s
CD Pipeline / post-deploy-checks (push) Successful in 1m18s
2026-05-07 01:51:31 +08:00