awoooi/docs/runbooks/ANSIBLE-OPERATING-MODEL.md

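# AWOOOI Ansible Operating Model

> Last updated: 2026-05-12 (Taipei time)
> Scope: describes the role Ansible plays in operations, cold-start recovery, monitoring, and deployment safety on 110 / 120 / 121 / 188.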

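## Role in the Product Architecture

Ansible is the host-state convergence layer. It owns host state outside Kubernetes and Docker images: files, packages, systemd units, cron, nginx configuration, node-exporter textfile monitors, and host-level resource guardrails.

Ansible does not replace the following:

- Kubernetes manifests under `k8s/`
- Docker Compose application definitions managed by each service directory
- Database recovery decisions
- AI auto-repair execution
- Emergency console fsck

The target control flow is: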

```text
Git repo
  -> Ansible validates and converges host state
  -> Prometheus observes host/app gates
  -> Alertmanager fires alerts
  -> AWOOOI/AwoooP AI diagnoses and triages
  -> stateful or high-risk repairs go to a human for approval
```

## Currently Managed Scope

| Scope | Source of truth | Runtime target |
|---|---|---|
| Host inventory | `infra/ansible/inventory/hosts.yml` | Records 110 / 120 / 121 / 188 / 112 |
| 188 public nginx routes | `infra/ansible/roles/nginx/templates/*` + `playbooks/nginx-sync.yml` | `/etc/nginx/sites-enabled/*` |
| 110 Ollama proxy | `110-ollama-proxy.conf.j2` | `/etc/nginx/sites-enabled/110-ollama-proxy.conf` |
| 110 cold-start monitor | `roles/cold-start-monitor` | `/home/wooo/scripts`, cron, node-exporter textfile |
| 110 runner guardrails | `roles/runner-guardrails` | `actions.runner.*` systemd drop-ins |
| 110/188 Docker/systemd/storage/backup textfile exporters | `roles/host-textfile-exporters` | `/home/*/node_exporter_textfiles/docker_stats.prom`, `storage_health.prom`, `backup_health.prom`, and on 110 `systemd_units.prom` |
| 110 Sentry backup / integrity drill | `110-devops.yml --tags backup_jobs` | `/backup/scripts/backup-sentry.sh`, `check-backup-integrity.sh`, weekly/monthly cron |
| Host health description | `110-devops.yml`, `188-ai-web.yml` | Read-only checks and limited host-state repair |

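## Required Workflow

After relevant files change, the Gitea workflow `.gitea/workflows/ansible-lint.yml` runs `scripts/ops/ansible-validate.sh` and `ansible-lint` on a self-hosted runner. Still run validation locally first, so that obviously broken Ansible changes never reach CI.

### 1. Local validation

Before any Ansible change, run: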

```bash
bash scripts/ops/bootstrap-ansible-validation-env.sh --recreate
PATH="${ANSIBLE_VALIDATION_VENV:-/tmp/awoooi-ansible-venv}/bin:$PATH" \
  bash scripts/ops/ansible-validate.sh
```

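`bootstrap-ansible-validation-env.sh` builds a pinned validation toolchain: `ansible-core==2.17.14` and `ansible-lint==24.12.2`. If `ansible-playbook` is not installed locally, `ansible-validate.sh` still validates YAML and shell syntax and explicitly reports that the Ansible syntax-check was skipped. Reboot SOPs, CI, and handover audits should nevertheless use the bootstrap venv, so that validation is never only half done.

To audit whether the full reboot recovery package is complete: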

```bash
bash scripts/reboot-recovery/reboot-recovery-readiness-audit.sh --live --no-color
```

To confirm whether P3 high-load work can be released:

```bash
bash scripts/reboot-recovery/p3-controlled-release-gate.sh --no-color
```

### 2. Dry run (`--check`)

Run from the repo root:

```bash
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/site.yml --check
```

For a single targeted change:

```bash
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/nginx-sync.yml --tags 188 --check
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags cold_start_monitor --check
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags runner_guardrails --check
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags textfile_exporters --check
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags backup_jobs --check
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/188-ai-web.yml --tags textfile_exporters --check
```

### 3. Apply

Apply only the smallest necessary tag:

```bash
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/nginx-sync.yml --tags 188
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags cold_start_monitor
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags runner_guardrails
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags textfile_exporters
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/110-devops.yml --tags backup_jobs
ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/188-ai-web.yml --tags textfile_exporters
```

### 4. Post-apply verification

A successful Ansible apply does not mean the change is done; it is done only when the runtime gate turns green:

```bash
SSH_BATCH_MODE=yes bash scripts/reboot-recovery/full-stack-cold-start-check.sh --send-alert-test
curl -kLsS -o /dev/null -w '%{http_code}\n' https://awoooi.wooo.work/api/v1/health
curl -kLsS -o /dev/null -w '%{http_code}\n' https://mo.wooo.work/health
```
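The two `curl` probes in this step print a status code but do not fail on their own. The following is a minimal sketch of wrapping them so that a non-200 response returns non-zero. The helper names `is_healthy_code` and `probe` are hypothetical and not part of the repo, and treating only `200` as healthy is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not scripts from this repo.

# Succeed only for HTTP 200 (assumption: healthy endpoints answer 200).
is_healthy_code() {
  [ "$1" = "200" ]
}

# Probe one URL with the same curl flags used in this runbook.
# Prints OK/FAIL with the status code; returns 1 on any non-200 result.
probe() {
  local url="$1" code
  code="$(curl -kLsS -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)" || code="000"
  if is_healthy_code "$code"; then
    echo "OK   $code $url"
  else
    echo "FAIL $code $url"
    return 1
  fi
}
```

Under those assumptions, `probe https://awoooi.wooo.work/api/v1/health && probe https://mo.wooo.work/health` stops at the first endpoint that is not healthy.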

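## Cold-Start Integration

During reboot recovery:

1. If a host is stuck in initramfs, first use console/fsck to get the host to boot cleanly.
2. Manually restore the dependency chain only when necessary: 188 data layer, 110 registry/observability, K3s, public routes.
3. Once the stack is reachable, use Ansible to converge live state back to the repo/IaC.
4. Run the cold-start gate.
5. Until the gate turns green, AI auto-repair stays observe-only.

The cold-start monitor is managed by the following role/playbook: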

```text
infra/ansible/roles/cold-start-monitor
infra/ansible/playbooks/110-devops.yml --tags cold_start_monitor
```

It writes:

```text
/home/wooo/node_exporter_textfiles/cold_start_recovery.prom
/home/wooo/reboot-recovery/cold-start-last.log
```
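During recovery it can help to read a single value out of the textfile by hand. The following is a minimal sketch of a helper that prints the value of one metric from a node-exporter textfile. The function name `prom_value` is hypothetical, and this runbook does not list the metric names inside `cold_start_recovery.prom`, so the name used in the usage note is a placeholder only.

```bash
#!/usr/bin/env bash
# Sketch only: prom_value is a hypothetical helper, not a script from this repo.

# Print the value of the first sample whose metric name matches exactly.
# Handles both `name value` and `name{labels} value`; skips comment lines.
# Returns 1 if the metric is not present in the file.
prom_value() {
  local file="$1" metric="$2"
  awk -v m="$metric" '
    /^#/ { next }
    {
      name = $1
      i = index(name, "{")                 # strip the label set, if any
      if (i > 0) name = substr(name, 1, i - 1)
      if (name == m) { print $NF; found = 1; exit }
    }
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

For example, `prom_value /home/wooo/node_exporter_textfiles/cold_start_recovery.prom some_metric_name` prints the sample value, where `some_metric_name` stands in for whichever metric the monitor actually emits.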

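## Dirty Reboot and Filesystem Defenses

110 and 188 have previously stopped at an initramfs manual fsck after a reboot; this class of problem cannot be detected by website health checks alone. `roles/host-textfile-exporters` now also deploys `storage-health-textfile-exporter.py`, which writes the following every minute: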

```text
/home/wooo/node_exporter_textfiles/storage_health.prom
/home/ollama/node_exporter_textfiles/storage_health.prom
```

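This exporter only reads `/proc/mounts`, `/proc/stat`, `journalctl -k`, and fsck logs; it does not repair anything, restart anything, or write to any database. It reports whether the root filesystem is read-only, whether the current boot has storage/kernel errors, and whether the previous boot left evidence of a dirty reboot or fsck. In Prometheus, `host_storage_health_alerts` only alerts and blocks ramp-up of load; all fsck and data recovery still require human approval.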
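To illustrate the kind of read-only check involved, here is a minimal shell sketch that decides whether `/` is mounted read-only from a `/proc/mounts`-style file. It is not the code in `storage-health-textfile-exporter.py`; the function name `root_is_readonly` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: illustrates the check, not the exporter's actual implementation.

# Succeed (exit 0) if the root filesystem is mounted read-only.
# Input format per line: device mountpoint fstype options dump pass.
root_is_readonly() {
  local mounts="${1:-/proc/mounts}"
  awk '
    $2 == "/" { opts = $4 }              # the last entry for "/" is the active mount
    END {
      n = split(opts, o, ",")
      for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
      exit 1
    }
  ' "$mounts"
}
```

The option list is split on commas and compared exactly, so an option such as `errors=remount-ro` on a read-write mount does not count as read-only.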

## Backup Health and Configuration Backups

`roles/host-textfile-exporters` also manages `backup-health-textfile-exporter.py`, which writes the following every 10 minutes:

```text
/home/wooo/node_exporter_textfiles/backup_health.prom
/home/ollama/node_exporter_textfiles/backup_health.prom
```

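This exporter only reads cron, script paths, restic snapshot metadata, and existing textfiles; it does not run backups or restores. It is used to confirm that:

- On 110, `/backup/scripts/backup-all.sh`, the AWOOOI high-frequency backup, and the `/backup/configs` configuration backup all exist and are fresh.
- On 110, the dedicated `/backup/sentry` data-layer backup is fresh, and the weekly `restic check` and monthly restore drill have evidence of success.
- On 188, `backup-from-110` and the momo PostgreSQL daily backup are both fresh.
- On 120, the Velero schedule, the latest Completed backup, and the `backup-restore-test` CronJob/Job status can be queried.
- No expected script is missing, no cron entry is missing, and the most recent aggregate backup has no failed items.

Configuration backups are handled by `/backup/scripts/backup-configs.sh`, which is included in the daily `backup-all.sh`. It places nginx, systemd, cron, Docker Compose, K3s manifests, K8s Secret/ConfigMap/RBAC, certs, and runtime scripts into the encrypted restic repo `/backup/configs`. Secrets are allowed only in encrypted backups and must never appear in the repo, logs, Prometheus labels, or alert messages.

Sentry data-layer backups are handled by `/backup/scripts/backup-sentry.sh`, which is included in the daily `backup-all.sh`. It produces a Sentry Postgres logical dump and places the required state for ClickHouse, Kafka, Redis, SeaweedFS, Taskbroker, Vroom, Symbolicator, and related components into the encrypted restic repo `/backup/sentry`. This is a backup action only: it does not restore and does not stop the production stack.

Backup usability is verified by `/backup/scripts/check-backup-integrity.sh`:

- Weekly `--mode check`: runs `restic check --read-data-subset=1%` against the expected restic repos.
- Monthly `--mode restore-drill`: picks one small file from each repo and runs `restic dump latest <sample>` into a 0700 temporary directory to verify that the snapshot is readable.
- Run status is written to `/backup/integrity/check.status` and `/backup/integrity/restore-drill.status`, which `backup-health-textfile-exporter.py` converts into Prometheus metrics.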
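Freshness of a textfile or status file can also be checked by hand. The following is a minimal sketch of such a check based on file modification time. The function name `is_fresh` is hypothetical, it assumes GNU `stat` as found on the Linux hosts, and the age thresholds are for the operator to choose since this runbook does not define them.

```bash
#!/usr/bin/env bash
# Sketch only: is_fresh is a hypothetical helper, not a script from this repo.

# Succeed if FILE exists and was modified within MAX_AGE seconds.
is_fresh() {
  local file="$1" max_age="$2" now mtime
  [ -f "$file" ] || return 1
  now="$(date +%s)"
  mtime="$(stat -c %Y "$file")" || return 1   # GNU coreutils stat
  [ $(( now - mtime )) -le "$max_age" ]
}
```

For example, `is_fresh /home/wooo/node_exporter_textfiles/backup_health.prom 1200` would tolerate one missed 10-minute run; the 1200-second threshold is an assumption, not a value taken from the alert rules.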

## Next Items to Bring Under Ansible

| Priority | Item | Reason |
|---|---|---|
| P0 | 110 runner guardrails | `roles/runner-guardrails` is in place; the next step is a live dry-run/apply from an ops host that has Ansible, plus a CI syntax-check |
| P0 | Dedicated Sentry backup and restic integrity drill | `backup_jobs` is already in the 110 playbook; the next step is to accumulate nightly/weekly/monthly evidence of success |
| P0 | 188 nginx HTTPS route ownership | Prevents public tool routes from drifting again after an incident or a sync |
| P1 | certbot/snap certbot standardization | The current apt certbot/OpenSSL path is fragile; renewal needs a single unified path |
| P1 | 110/188 Docker/systemd/storage/backup textfile exporters | `roles/host-textfile-exporters` is in place; the next step is a dry-run/apply from the ops host, then confirming freshness of `docker_stats.prom` / `storage_health.prom` / `backup_health.prom` / `systemd_units.prom` |
| P1 | node-exporter/cAdvisor caps | Monitoring components must not become a load source themselves |
| P2 | K3s diagnostic-only host tasks | Only verify containerd/kubelet state; no destructive repair |
| P2 | 112 Kali inventory only | Record first; no scanning, no repair |

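## Safety Rules

- Run `--check` first by default.
- Use tags to limit scope; during an incident, avoid applying the full `site.yml` directly.
- Never put passwords in the repo, cron, inventory, or group vars.
- Never let Ansible perform destructive recovery of DB/ClickHouse/Kafka.
- Ansible performs only predictable host-state convergence; it does not handle repair of unknown data.
- Any stateful restart or quarantine still requires human approval.
- The runner guardrail role does not restart units by default; set `runner_guardrails_restart_units=true` only during a planned maintenance window.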
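The first two rules can be encoded in a small wrapper. The following is a minimal sketch that builds an `ansible-playbook` command which is a dry run unless the operator opts in, and which refuses to proceed without a tag. The function name `build_ansible_cmd` and the `APPLY=1` switch are hypothetical conventions, not something the repo provides.

```bash
#!/usr/bin/env bash
# Sketch only: build_ansible_cmd and APPLY are hypothetical, not from this repo.

# Print the ansible-playbook command for one playbook and one tag.
# Adds --check unless APPLY=1 is set; returns 2 if playbook or tag is missing.
build_ansible_cmd() {
  local playbook="$1" tag="$2" cmd
  if [ -z "$playbook" ] || [ -z "$tag" ]; then
    echo "usage: build_ansible_cmd <playbook.yml> <tag>" >&2
    return 2
  fi
  cmd="ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbooks/${playbook} --tags ${tag}"
  if [ "${APPLY:-0}" != "1" ]; then
    cmd="${cmd} --check"
  fi
  echo "$cmd"
}
```

The function prints the command instead of running it, so the operator can review the exact scope first. `build_ansible_cmd nginx-sync.yml 188` yields the dry-run form, and `APPLY=1 build_ansible_cmd nginx-sync.yml 188` yields the apply form.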

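## Definition of Done

A change managed by Ansible is done only when all of the following hold:

- `scripts/ops/ansible-validate.sh` passes.
- The target playbook dry run succeeds, or there is a documented reason for skipping the dry run.
- The target apply succeeds.
- For changes that affect runtime, `full-stack-cold-start-check.sh --send-alert-test` turns green.
- The relevant public routes or service health endpoints pass.
- `docs/LOGBOOK.md` records the applied scope and the verification results.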