Your Name
|
6efd186750
|
docs(security): establish high-value configuration control inventory [skip ci]
|
2026-06-11 11:29:58 +08:00 |
|
Your Name
|
cfb866d055
|
feat(governance): add agent market automation surfaces
Ansible Lint / lint (push) Successful in 35s
CD Pipeline / tests (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Failing after 11s
|
2026-06-04 21:50:55 +08:00 |
|
Your Name
|
d0163b2d69
|
docs(ops): document ollama 111 fallback diagnosis [skip ci]
|
2026-06-04 09:31:20 +08:00 |
|
Your Name
|
4ea6fb98a6
|
fix(ops): harden reboot recovery and backup alerts
|
2026-05-29 12:45:09 +08:00 |
|
Your Name
|
ae7b39d96a
|
fix(ops): harden reboot recovery and backup alerts
|
2026-05-29 12:41:34 +08:00 |
|
Your Name
|
de16c88418
|
chore(rls): apply outbound message canary
Code Review / ai-code-review (push) Successful in 11s
|
2026-05-12 21:55:23 +08:00 |
|
Your Name
|
edd06485e0
|
docs(rls): record projects canary apply
|
2026-05-12 21:41:14 +08:00 |
|
Your Name
|
7d92f0acd7
|
chore(rls): stage projects canary path
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m8s
CD Pipeline / build-and-deploy (push) Successful in 3m49s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
|
2026-05-12 21:25:24 +08:00 |
|
Your Name
|
b7af597459
|
chore(rls): apply tool registry canary wave1.1
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-12 21:15:14 +08:00 |
|
Your Name
|
1617b73a9d
|
docs(rls): record canary wave1 production apply
|
2026-05-12 20:55:40 +08:00 |
|
Your Name
|
8c4dc7a5a8
|
chore(rls): add manual script gate and canary wave1
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m5s
CD Pipeline / build-and-deploy (push) Failing after 10m6s
CD Pipeline / post-deploy-checks (push) Has been skipped
|
2026-05-12 20:23:27 +08:00 |
|
Your Name
|
ff30c61c4c
|
fix(rls): consolidate API DB access context
Code Review / ai-code-review (push) Successful in 21s
CD Pipeline / tests (push) Successful in 1m20s
CD Pipeline / build-and-deploy (push) Successful in 4m15s
CD Pipeline / post-deploy-checks (push) Successful in 1m58s
|
2026-05-12 19:55:13 +08:00 |
|
Your Name
|
33c0577e93
|
docs(ops): record RLS role bootstrap apply
|
2026-05-12 19:35:28 +08:00 |
|
Your Name
|
f0255e0300
|
chore(ops): strengthen RLS role bootstrap gate
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-12 18:36:35 +08:00 |
|
Your Name
|
0bc1878778
|
chore(ops): add RLS preflight and registry certbot repair package
Code Review / ai-code-review (push) Successful in 13s
|
2026-05-12 18:25:53 +08:00 |
|
Your Name
|
d3e1b61096
|
fix(ops): persist 188 ollama localhost binding
Code Review / ai-code-review (push) Successful in 11s
|
2026-05-06 15:27:19 +08:00 |
|
Your Name
|
f88a3a846b
|
fix(ops): contain 188 ollama gateway exposure
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-06 15:18:28 +08:00 |
|
Your Name
|
d441f70693
|
fix(ai): add 188 ollama retirement gate
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 9m2s
CD Pipeline / post-deploy-checks (push) Successful in 1m15s
|
2026-05-06 14:55:21 +08:00 |
|
Your Name
|
4111ea4f9f
|
fix(ai): remove 188 ollama provider
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
|
2026-05-06 14:34:48 +08:00 |
|
OG T
|
6e2ab7cedc
|
fix(alertmanager): make live config deployment safe
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-06 13:52:57 +08:00 |
|
Your Name
|
9ef9633aff
|
fix(alerts): bypass proxy timeout for GCP Ollama
|
2026-05-06 08:55:14 +08:00 |
|
Your Name
|
587551c1f1
|
fix(ops): monitor full-stack cold-start gates
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 18s
|
2026-05-06 00:48:05 +08:00 |
|
Your Name
|
6e96623884
|
fix(ops): harden momo scheduler cold start gate
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-06 00:15:14 +08:00 |
|
Your Name
|
0315c2b510
|
docs(ops): codify full stack cold start recovery
Code Review / ai-code-review (push) Successful in 7s
|
2026-05-06 00:07:57 +08:00 |
|
Your Name
|
c4854bb355
|
fix(ai): isolate heavy Ollama workloads from GCP alert lane
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m19s
CD Pipeline / post-deploy-checks (push) Successful in 3m12s
|
2026-05-05 23:06:07 +08:00 |
|
Your Name
|
ed7c6946cb
|
docs(awooop): define private Ollama mesh gateway
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-05 22:56:22 +08:00 |
|
Your Name
|
a4e9a04982
|
fix(ops): harden cold-start schedule recovery
Code Review / ai-code-review (push) Successful in 10s
run-migration / migrate (push) Successful in 7s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
|
2026-05-05 22:17:10 +08:00 |
|
Your Name
|
2221fd3256
|
fix(ops): persist host resource guardrails
CD Pipeline / tests (push) Successful in 5m25s
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m31s
CD Pipeline / post-deploy-checks (push) Successful in 5m10s
|
2026-05-05 16:13:19 +08:00 |
|
Your Name
|
1cc9de5722
|
fix(ops): point runner guardrail alerts to host script
CD Pipeline / tests (push) Successful in 5m31s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m45s
CD Pipeline / post-deploy-checks (push) Successful in 5m4s
|
2026-05-05 15:25:37 +08:00 |
|
Your Name
|
d08d1e4951
|
fix(ops): alert on missing docker resource limits
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 15:01:31 +08:00 |
|
Your Name
|
72d66e4ae6
|
fix(ops): align stale job cleanup thresholds
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 36s
|
2026-05-05 14:54:17 +08:00 |
|
Your Name
|
5e625f777d
|
fix(ops): add stale gitea job cleanup guard
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:50:47 +08:00 |
|
Your Name
|
7d45f0cb58
|
fix(ops): alert on stale gitea actions jobs
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:42:09 +08:00 |
|
Your Name
|
fe618960a8
|
fix(ops): monitor systemd runners in host baseline
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
|
2026-05-05 14:08:43 +08:00 |
|
Your Name
|
e8e6748f70
|
fix(ops): add docker host resource baseline guardrails
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 13:45:09 +08:00 |
|
Your Name
|
ec013f662d
|
fix(watchdog): fix duplicate Trust Drift alerts + set up GCP Ollama nginx proxy
Code Review / ai-code-review (push) Successful in 45s
Ansible Lint / lint (push) Has been cancelled
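- ai_slo_watchdog_job: switch to the trust_drift_detector pure-statistics lib
to avoid triggering duplicate Telegram messages alongside governance_agent's hourly self-check
- infra/ansible: set up an nginx proxy on 110 that forwards to GCP-A/B
port 11435 -> 34.143.170.20:11434 (GCP-A)
port 11436 -> 34.21.145.224:11434 (GCP-B)
- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md, the complete deployment guide
- ops/nginx: manual deployment script to run directly on 110
ADR-110 prerequisite for enabling three-tier disaster recovery: deploy the proxy first, then change the ConfigMap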
|
2026-05-04 23:12:35 +08:00 |
|
Your Name
|
13e51802fe
|
feat(awooop): Phase 0 all ADRs + Phase 1 control plane schema (including four critic fixes)
## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: Bootstrap → RLS, seven core ADRs
- ADR-119~124: SAGA → Singleton Decomposition, six ADRs
- ADR-UI-01~04: four Operator Console UI ADRs
## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + trigger + RLS
- Step 1: three roles (platform_admin/migration with BYPASSRLS, awooop_app subject to RLS)
- Step 13: GRANT least privileges to awooop_app (7 statements)
- Step 14: RLS fail-closed, remove the __platform__ backdoor
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models
## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: __mapper_args__ string list → primary_key=True on mapped_column
- C-3: __platform__ RLS backdoor → removed entirely, replaced by a BYPASSRLS role
- C-4: awooop_app role was never created → Step 1 + 7 GRANT statements
- M-1: active_pointer_guard SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: add an idempotency guard to pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)
## Task 1.2 fixes
- agent_loader.py: hardcoded Mac path → AGENTS_DIR environment variable
- Dockerfile: add the missing COPY .claude/agents/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:37:11 +08:00 |
|
Your Name
|
8e49f2ea88
|
fix(ci): preserve ssh mcp known hosts [skip ci]
|
2026-05-01 17:18:32 +08:00 |
|
Your Name
|
433f7b068e
|
fix(aiops): close ssh and telegram remediation gaps
CD Pipeline / tests (push) Successful in 2m7s
Code Review / ai-code-review (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Successful in 13m14s
CD Pipeline / post-deploy-checks (push) Successful in 4m29s
|
2026-05-01 16:53:02 +08:00 |
|
Your Name
|
95110971f3
|
fix(telegram): close remaining DM alert routes
CD Pipeline / tests (push) Successful in 1m27s
Code Review / ai-code-review (push) Successful in 29s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-04-30 23:02:17 +08:00 |
|
Your Name
|
61f5a6a419
|
fix(telegram): route alerts to SRE war room
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-04-30 15:01:23 +08:00 |
|
Your Name
|
fb130c9a28
|
feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization hardening + additional metrics fields
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Wave 8 P3.1-T2 follow-up tests + supporting changes:
New tests:
- test_diagnosis_aggregator_stub.py (238 lines): 15 tests
· stub fixture verifies the _collect_diagnosis_aggregator wiring
· feature flag defaults to off, so no call is made
· timeout boundaries / exception fail-soft
Changes:
- core/metrics.py +23: add DiagnosisAggregator-related Prometheus metrics
- sanitization_service.py +24: harden prompt sanitization boundaries (companion to vuln #4)
- RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor adjustments
Tests: 15 passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-27 08:30:26 +08:00 |
|
Your Name
|
fefe4c21cd
|
fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana
CD Pipeline / build-and-deploy (push) Has been cancelled
Continues the 595629c0 INC-20260425 fix, adding the three-stage Agent timeouts + end-to-end observability:
A1 follow-up: wire up the three-stage Solver/Critic timeouts:
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override)
- protocol.py: all three Agents wrap their calls in the shared observe_agent_step()
· success/timeout/error outcome label
· histogram written to aiops_agent_step_duration_seconds
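The timeout wrapper described in the A1 notes could look roughly like the sketch below. It is not the repository's code: the signature of `observe_agent_step`, and the in-memory `OUTCOMES` / `DURATIONS` containers standing in for the Prometheus histogram, are assumptions made for illustration. Only the function name, the `success`/`timeout`/`error` labels, and the 20.0s / 15.0s defaults come from the commit message.

```python
import asyncio
import os
import time
from collections import Counter

# Defaults from the commit message, overridable through the environment.
SOLVER_TIMEOUT_SEC = float(os.environ.get("AGENT_SOLVER_TIMEOUT_SEC", "20.0"))
CRITIC_TIMEOUT_SEC = float(os.environ.get("AGENT_CRITIC_TIMEOUT_SEC", "15.0"))

# Hypothetical stand-ins for the aiops_agent_step_duration_seconds histogram.
OUTCOMES = Counter()
DURATIONS = []


async def observe_agent_step(agent, coro, timeout_sec):
    """Await coro under a timeout and record one outcome label per call."""
    start = time.monotonic()
    outcome = "error"  # anything that is not a clean return or a timeout
    try:
        result = await asyncio.wait_for(coro, timeout=timeout_sec)
        outcome = "success"
        return result
    except asyncio.TimeoutError:
        outcome = "timeout"
        raise
    finally:
        # Runs on every path, so each call contributes exactly one sample.
        OUTCOMES[(agent, outcome)] += 1
        DURATIONS.append((agent, outcome, time.monotonic() - start))
```

A caller would await something like `observe_agent_step("solver", some_coroutine, SOLVER_TIMEOUT_SEC)`. The exception is re-raised after labelling so that the caller still decides how to recover.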
A2 follow-up: auto_repair_service switches to _diagnose_fallback_chain:
- auto_repair_service.py +46 lines: switch DIAGNOSE routing to the new chain (NEMO→GEMINI→CLAUDE)
- completely avoids the 238s secondary timeout on Ollama CPU
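An ordered fallback such as the NEMO→GEMINI→CLAUDE chain can be sketched as follows. The function `diagnose_with_fallback` and the provider callables are hypothetical names used for illustration; the real `_diagnose_fallback_chain` is not shown in this log, and only the provider order is taken from the commit message.

```python
from typing import Callable, Dict, Sequence, Tuple

# Order taken from the commit message; everything else here is assumed.
DIAGNOSE_CHAIN: Tuple[str, ...] = ("NEMO", "GEMINI", "CLAUDE")


def diagnose_with_fallback(
    prompt: str,
    providers: Dict[str, Callable[[str], str]],
    chain: Sequence[str] = DIAGNOSE_CHAIN,
) -> Tuple[str, str]:
    """Try providers in order; return (name, answer) from the first success."""
    errors = []
    for name in chain:
        try:
            return name, providers[name](prompt)
        except Exception as exc:
            # Any failure (including a missing provider) moves to the next one.
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Because Ollama is simply absent from the chain, a slow CPU-bound Ollama call can never be reached on this path, which matches the stated goal of avoiding the second timeout.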
New metrics:
- core/metrics.py +59 lines: histogram buckets + label cardinality to support observe_agent_step
New tests (862 lines):
- test_agent_step_timeouts.py (475): timeout boundaries + outcome labels for each of the three Agents
- test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain
New supporting files:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
· histogram alert rules for the three Agents (p99 > 80% of timeout → warning)
Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)
INC-20260425 total effort across both fixes (595629c0 + this commit):
· 5 service/agent files modified
· 1 new observability module
· 4 new test/supporting files
· 1372+187 = 1559 lines added
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 follow-up) <noreply@anthropic.com>
|
2026-04-27 08:15:53 +08:00 |
|
Your Name
|
1ab6786ce3
|
feat(ops): Ollama failover Runbook + Grafana dashboard + Consensus K8s ConfigMap patch
run-migration / migrate (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Wave 6 P2.3 ops support + tool-expert deployment docs:
Added:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines)
· verification steps for the three iron rules (auto-switch to Gemini / auto-switch back / quota circuit breaker)
· complete failover/recovery SOP
· troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines)
· 4 panels: current primary / failover events / quota usage / health status
· corresponding P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
· ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
|
2026-04-27 08:11:40 +08:00 |
|
OG T
|
a1432c03ed
|
docs: ADR-070/071 + ssh-mcp-setup runbook + Skill-04 v2.7
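- ADR-070: decision document for the fully automated AIOps closed-loop MCP, Phase 1-4
- ADR-071: decision document for the four alert notification types + the KM three-stage data closed loop
- docs/runbooks/ssh-mcp-setup.md: SOP for SSH MCP setup/verification/rotation
- Skill-04: v2.7 adds complete records for Sprint C DR + ADR-070 MCP 10 providers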
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 20:04:47 +08:00 |
|
OG T
|
43edff184d
|
feat(dr): Sprint C: host rsync backup + DR SOP docs
C-1 Velero: confirmed running (daily-awoooi-prod schedule, 13d, MinIO Available)
C-2 Host rsync backup:
scripts/ops/backup-from-110.sh: 188 backs up 110 via rsync daily at 1:00 AM
- Harbor registry data (highest priority)
- Gitea repos
- bitan-pharmacy.git (if present)
- on success, writes /var/run/backup-110.last_success for Prometheus monitoring
- on failure, sends a Telegram alert
ops/monitoring/alerts-unified.yml: add the HostBackupFailed alert rule
C-3 DR SOP docs:
docs/runbooks/disaster-recovery/DR-K8s-awoooi.md (<15 minutes)
docs/runbooks/disaster-recovery/DR-Nginx.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Harbor.md (<30 minutes)
docs/runbooks/disaster-recovery/DR-Bitan.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Stock.md (<5 minutes)
Backup script deployment instructions (must be run manually):
scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh
ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}"
ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
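The success marker mentioned above lends itself to a simple freshness check. The sketch below is an assumption about how such a check might work, not the actual HostBackupFailed rule: it judges freshness by the marker file's modification time, and the 26-hour threshold is an invented value (daily schedule plus slack). Only the marker path comes from the commit message.

```python
import os
import time
from typing import Optional

# Path from the commit message; the threshold is an assumed value.
MARKER_PATH = "/var/run/backup-110.last_success"
MAX_AGE_SEC = 26 * 3600


def backup_is_stale(path: str = MARKER_PATH,
                    max_age_sec: float = MAX_AGE_SEC,
                    now: Optional[float] = None) -> bool:
    """Return True if the marker is missing or older than max_age_sec."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No marker at all: treat it like a backup that never succeeded.
        return True
    return (now - mtime) > max_age_sec
```

Treating a missing marker as stale keeps the check fail-closed, so a host where the cron job was never installed still raises an alert.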
|
2026-04-11 03:04:18 +08:00 |
|
OG T
|
66b12bf9eb
|
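fix(infra): root-cause fix for the Harbor Exited(128) race condition + resident self-healing harbor-watchdog
Root cause:
When awoooi-startup-110.sh starts Harbor, the first compose up -d starts
all containers at once. harbor-core/db/portal try to connect to syslog:1514 (harbor-log is not ready yet),
exit(128) on failure, and restart:always retries until the backoff gives up.
Even after harbor-log later becomes healthy, the other containers no longer retry.
Fix 1: startup-110.sh Harbor sequencing (4-phase strategy):
Phase 1: remove all Exited Harbor containers (break the backoff deadlock)
Phase 2: start only harbor-log
Phase 3: wait for harbor-log to become healthy (up to 90s)
Phase 4: start all components
Fix 2: harbor-watchdog.service (resident self-healing):
Type=simple resident process that polls http://127.0.0.1:5000/v2/ every 60s
unhealthy → wait 5s and check again → run the full Phase 1-4 repair
Covers the "crash while running" scenario that the reboot sequencing fix cannot
Bug fix: curl -f treats HTTP 401 as a failure (exit 22);
Harbor /v2/ normally returns 401 (authentication required), so use curl -s without -f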
REBOOT-RECOVERY-SOP.md → v5.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
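The watchdog decision described above (poll, confirm after 5s, then repair, with 401 counted as healthy) can be sketched like this. It is an illustration under assumptions, not the contents of harbor-watchdog.service: `watchdog_tick`, `probe`, and `repair` are hypothetical names, `probe` is assumed to return the HTTP status code (or 0 when the connection fails), and counting 200 as healthy alongside 401 is also an assumption.

```python
import time
from typing import Callable

# Harbor's /v2/ answers 401 when unauthenticated, so 401 counts as "up".
HEALTHY_STATUSES = {200, 401}


def is_healthy(status_code: int) -> bool:
    """Treat both an authenticated 200 and an unauthenticated 401 as healthy."""
    return status_code in HEALTHY_STATUSES


def watchdog_tick(probe: Callable[[], int],
                  repair: Callable[[], None],
                  sleep: Callable[[float], None] = time.sleep,
                  confirm_delay_sec: float = 5.0) -> bool:
    """Run one poll cycle; return True if a repair was triggered."""
    if is_healthy(probe()):
        return False
    sleep(confirm_delay_sec)  # wait, then confirm before acting
    if is_healthy(probe()):
        return False          # transient blip, no repair needed
    repair()                  # stands in for the Phase 1-4 sequence
    return True
```

A resident process would call `watchdog_tick` once per 60s interval. Injecting `sleep` keeps the confirmation delay testable without real waiting.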
|
2026-04-05 12:13:21 +08:00 |
|
OG T
|
f51bf5a6a8
|
feat(backup): full-service backup coverage + alerting, 9/9 services complete
New backups (deployed to 110, all passed on first run):
- backup-langfuse.sh: Langfuse AI tracing/evaluation DB (7238 traces)
- backup-monitoring.sh: Prometheus + Grafana + Alertmanager volumes + configs
- backup-signoz.sh: SignOz ClickHouse + SQLite (distributed tracing/logs)
- backup-open-webui.sh: Open-WebUI LLM conversation history (SSH 188 volume)
- backup-clawbot.sh: ClawBot Redis state/cache (SSH 188 volume)
- backup-all.sh v3.0: consolidated to 9/9 services
Alerting:
- common.sh: notify_clawbot now uses the correct /webhook/custom format
- failed → severity:critical → immediate Telegram 🔴 alert
- alert test passed: {"status":"ok","alert_id":"878c4c59..."}
GFS retention: 30 daily / 12 weekly / 24 monthly (AWOOOI: additional 28h high-frequency)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
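A 30 daily / 12 weekly / 24 monthly GFS policy can be illustrated with the selection function below. This is a sketch under assumptions: the backup scripts' real pruning rule is not shown here, so `gfs_keep` is a hypothetical helper that keeps the newest backup per ISO week and per calendar month, and the extra 28h high-frequency tier for AWOOOI is not modelled.

```python
from datetime import date
from typing import Iterable, Set


def gfs_keep(backup_dates: Iterable[date],
             daily: int = 30, weekly: int = 12, monthly: int = 24) -> Set[date]:
    """Return the backup dates to retain under a daily/weekly/monthly policy."""
    ordered = sorted(set(backup_dates), reverse=True)  # newest first
    keep = set(ordered[:daily])                        # the most recent dailies

    def newest_per_bucket(key, limit):
        # Walking newest-first, the first date seen in a bucket is its newest.
        seen = set()
        chosen = set()
        for d in ordered:
            k = key(d)
            if k in seen:
                continue
            if len(seen) == limit:
                break  # already have `limit` buckets; older ones are dropped
            seen.add(k)
            chosen.add(d)
        return chosen

    keep |= newest_per_bucket(lambda d: tuple(d.isocalendar())[:2], weekly)
    keep |= newest_per_bucket(lambda d: (d.year, d.month), monthly)
    return keep
```

Anything not in the returned set is a candidate for deletion. The tiers overlap by design, so a recent Sunday is counted in both the daily and weekly tiers but stored once.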
|
2026-04-05 11:12:42 +08:00 |
|
OG T
|
91564c6ea3
|
docs(sop): REBOOT-RECOVERY-SOP.md v4.0
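Updates:
- add Sentry /opt/sentry startup instructions (110 Step 7/9)
- add a section on repairing Sentry corruption after a reboot (PostgreSQL WAL/Redis RDB/ClickHouse parts)
- extend the alert-silence diagnosis tree with a "rules not deployed" diagnosis + the deploy-alerts.sh fix command
- E2E verification script now checks Sentry + the Prometheus rule count (≥25)
- add Sentry :9000 to the architecture diagram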
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 03:11:27 +08:00 |
|
OG T
|
8f64affbdb
|
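docs(runbooks): REBOOT-RECOVERY-SOP v3.0 complete reboot automation plan
## Contents
A complete inventory, across all hosts, services, tools, and monitoring, of:
- startup order and dependency graph
- handling flows for normal vs. abnormal reboots
- detailed startup sequence for each host (188/110/120/121)
- troubleshooting handbook for common failures (alert silence / CD failure / data loss / NodePort)
- E2E verification script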
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 01:48:29 +08:00 |
|