fix(db): multi-replica concurrent startup race: one transaction per table + DROP INDEX IF EXISTS

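Root cause: inside a single large transaction, two pods create the same table at the same time;
one pod's CREATE INDEX fails → the whole transaction is rolled back
→ the table disappears too → the next restart hits the same condition → endless CrashLoop.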
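
The rollback effect described above can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 rather than PostgreSQL, the `users` and `other` tables and the `ix_email` index are hypothetical, and a name collision stands in for the CREATE INDEX that fails on the losing pod. The ROLLBACK is issued explicitly here.

```python
import sqlite3


def table_names(conn):
    """Return the set of table names currently in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}


def failed_startup(conn):
    """Create a table and its index in ONE transaction; return the error text, if any."""
    conn.execute("BEGIN")
    try:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
        # Fails: the index name is already taken (stand-in for the losing pod).
        conn.execute("CREATE INDEX ix_email ON users (email)")
        conn.execute("COMMIT")
        return None
    except sqlite3.OperationalError as exc:
        conn.execute("ROLLBACK")  # undoes the CREATE TABLE as well
        return str(exc)


# isolation_level=None: BEGIN/COMMIT/ROLLBACK are issued by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE other (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX ix_email ON other (email)")  # occupies the index name

error = failed_startup(conn)
```

After the call, `error` holds the "already exists" message and `table_names(conn)` no longer contains `users`: the table created earlier in the same transaction is gone along with the failed index.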

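The fix has three layers:
1. Create each table in its own transaction (a failure does not affect the others)
2. Before creating a table, run DROP INDEX IF EXISTS to clear leftover orphan indexes
3. Catch "already exists" so a concurrent pod skips gracefully (no crash)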
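
The three layers above can be sketched in a self-contained form. This is a minimal illustration, not the code from this commit: it assumes Python's built-in sqlite3 instead of PostgreSQL with SQLAlchemy, and the `SCHEMA` dictionary, the table and index names, and the `present` parameter (used to simulate a stale existence probe) are all hypothetical.

```python
import sqlite3

# Hypothetical schema: table name -> (CREATE TABLE statement, {index name: CREATE INDEX statement}).
SCHEMA = {
    "users": (
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)",
        {"ix_users_email": "CREATE INDEX ix_users_email ON users (email)"},
    ),
    "orders": (
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)",
        {"ix_orders_user_id": "CREATE INDEX ix_orders_user_id ON orders (user_id)"},
    ),
}


def existing_tables(conn):
    """Return the set of table names currently in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {name for (name,) in rows}


def create_missing_tables(conn, schema=SCHEMA, present=None):
    """Create each missing table in its own transaction; return the names created."""
    if present is None:
        present = existing_tables(conn)  # probe once, outside any transaction
    created = []
    for table, (create_table, indexes) in schema.items():
        if table in present:
            continue
        try:
            conn.execute("BEGIN")  # layer 1: one transaction per table
            for index_name in indexes:
                # layer 2: clear leftover indexes before creating the table
                conn.execute(f'DROP INDEX IF EXISTS "{index_name}"')
            conn.execute(create_table)
            for create_index in indexes.values():
                conn.execute(create_index)
            conn.execute("COMMIT")
            created.append(table)
        except sqlite3.OperationalError as exc:
            if conn.in_transaction:
                conn.execute("ROLLBACK")  # also undoes the DROP INDEX above
            # layer 3: another writer won the race; skip instead of crashing
            if "already exists" not in str(exc).lower():
                raise
    return created


# isolation_level=None: BEGIN/COMMIT/ROLLBACK are issued by hand.
conn = sqlite3.connect(":memory:", isolation_level=None)
first_run = create_missing_tables(conn)
# Simulate a second pod whose probe ran before the first pod finished.
stale_run = create_missing_tables(conn, present=set())
```

Here `first_run` lists both tables and `stale_run` is empty. Because the DROP INDEX and the failing CREATE TABLE share one transaction, the rollback in the losing run also restores the winner's indexes.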

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-15 15:38:43 +08:00
parent 4a6aa16a94
commit 952c10955b

@@ -143,21 +143,33 @@ async def init_db() -> None:
    Call this at application startup.
    """
    engine = get_engine()
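    # 2026-04-15 ogt: fix for the multi-replica concurrent startup race
    # Problem: within a single large transaction, two pods create the same table at once → one pod's CREATE INDEX fails
    #   In PostgreSQL, any error inside a transaction causes the whole transaction to ROLLBACK
    #   → table + index both disappear → the next restart hits the same problem → endless CrashLoop
    # Fix: one transaction per table; first DROP INDEX IF EXISTS to clear leftover orphan indexes;
    #   catch "already exists" so concurrent pods skip gracefully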
    async with engine.connect() as probe_conn:
        existing = set(await probe_conn.run_sync(
            lambda c: __import__('sqlalchemy', fromlist=['inspect']).inspect(c).get_table_names()
        ))
    for table in Base.metadata.sorted_tables:
        if table.name not in existing:
            try:
                async with engine.begin() as conn:
                    # First clear leftover orphan indexes (partial state left by an earlier CrashLoop)
                    for index in table.indexes:
                        await conn.execute(text(f'DROP INDEX IF EXISTS "{index.name}"'))
                    await conn.run_sync(table.create)
            except Exception as exc:
                if "already exists" in str(exc).lower():
                    pass  # a concurrent pod already created it; ignore
                else:
                    raise
    async with engine.begin() as conn:
        # SQLAlchemy 2.0 issue: create_all(checkfirst=True) skips CREATE TABLE
        # but still emits a standalone CREATE INDEX for Index objects in __table_args__ → CrashLoopBackOff
        # Fix: inspect the existing tables first and call table.create() only for tables that do not exist,
        # so indexes are created only together with a new table and are never duplicated
        # 2026-04-15 Claude Sonnet 4.6 (Asia-Pacific), Phase 3 fix
        def _create_missing_tables(sync_conn):
            from sqlalchemy import inspect as sa_inspect
            existing = set(sa_inspect(sync_conn).get_table_names())
            for table in Base.metadata.sorted_tables:
                if table.name not in existing:
                    table.create(sync_conn)
        await conn.run_sync(_create_missing_tables)
        # 2026-04-02 Claude Code: ensure the risklevel enum includes the 'high' value
        # Added in Phase 23; avoids InvalidTextRepresentation on older DBs that lack this value
        await conn.execute(