docs(argocd): codify health persistence config [skip ci]

Your Name
2026-06-04 09:33:45 +08:00
parent d0163b2d69
commit 017dba8b00
3 changed files with 99 additions and 3 deletions


@@ -158,6 +158,29 @@
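- Runbook / operations cleanup: `40% -> 70%`; a read-only diagnostic script and a current-state runbook have been added, and recovery success evidence is still pending until 111 recovers.
- Full AI automation flywheel: unchanged at `67%`; this round covered inference fallback observability and tech-debt cleanup, with no new auto-execution or KM writeback.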
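## 2026-06-04: ArgoCD Application Health settings moved into GitOps
**Background**: While fixing the top-level `Degraded` state of the `awoooi-prod` ArgoCD Application live in the previous phase, we confirmed that the root causes included a stale global `/status` ignore in `argocd-cm` and child resource health persistence not being enabled in `argocd-cmd-params-cm`. Live was fixed at the time, but the repo had no source of truth for the Argo ConfigMaps, so a reinstall or manual drift could bring back the opaque degraded state.
**Changes in this round**
- Added `scripts/ops/apply-argocd-health-config.sh`, which maintains the Argo health settings idempotently:
  - If `argocd-cm.data.resource.customizations.ignoreResourceUpdates.all` exists, remove the old global `/status` ignore.
  - If `argocd-cmd-params-cm.data.controller.resource.health.persist` is not `"true"`, set it to `"true"`.
  - Restart `argocd-redis`, `argocd-server`, and `argocd-application-controller` only when a ConfigMap was actually changed.
- Updated `k8s/argocd/DEPLOY.md` with a new "Apply Application Health settings" section and verification commands.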
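The idempotent rules listed above can be sketched as a pure decision function. This is a minimal illustration, not the script itself: `plan_health_actions` and its action names are hypothetical, and the function takes the two current ConfigMap values as arguments instead of calling `kubectl`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the actions implied by the rules, one per line.
#   $1 = current value of resource.customizations.ignoreResourceUpdates.all ("" if absent)
#   $2 = current value of controller.resource.health.persist ("" if absent)
plan_health_actions() {
  local ignore_all persist changed
  ignore_all="$1"
  persist="$2"
  changed=0

  # Rule 1: the global ignore key must not exist.
  if [ -n "$ignore_all" ]; then
    echo "remove-ignore-all"
    changed=1
  fi

  # Rule 2: health persistence must be exactly "true".
  if [ "$persist" != "true" ]; then
    echo "set-persist-true"
    changed=1
  fi

  # Rule 3: restart only when something actually changed.
  if [ "$changed" = "1" ]; then
    echo "restart"
  else
    echo "no-change"
  fi
}
```

Because the decision depends only on the current values, running it a second time against an already-correct cluster yields `no-change`, which is what makes the real script safe to re-run.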
**Live verification**
- `bash scripts/ops/apply-argocd-health-config.sh` was run against live, with this output:
  - `ok argocd-cm: no global /status ignore`
  - `ok argocd-cmd-params-cm: controller.resource.health.persist=true`
  - `no change`
- Because live was already in the correct state, this round did not trigger an Argo restart.
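A check like the one above could be automated by scanning the script's output for the `no change` line. The sketch below assumes the output format shown in this entry; `is_noop_run` is a hypothetical helper, not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads the apply script's output on stdin and succeeds
# only if the run reported "no change" (no ConfigMap patched, no restart).
is_noop_run() {
  # -x matches whole lines only, so "no change" inside a longer line is ignored.
  grep -x 'no change' >/dev/null
}

# Example usage (assumes the script path from this entry):
#   bash scripts/ops/apply-argocd-health-config.sh | is_noop_run \
#     && echo "live already matches the repo"
```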
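**Progress update**
- ArgoCD health truthfulness: `85% -> 92%`; live and the repo procedure are now aligned, and the Argo base install still needs to be brought into a full Kustomize/Helm source.
- GitOps control-plane hygiene: `55% -> 68%`; the health settings now have a source of truth, and the Argo install source, RBAC, and notification config still need to be inventoried.
- Full AI automation flywheel: unchanged at `67%`; this round reduces the risk of GitOps observability regressions and adds no new auto-repair execution.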
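## 2026-06-03: AwoooP Work Items Owner Review Gate and Mobile Shell readability
**Background**: The commander required that AwoooP / AI governance not just fire alerts in Telegram: the frontend must show which workflow an incident has reached, who needs to take over, what the AI did, and which steps were blocked by a gate. This phase focuses on KM owner-review follow-up and mobile readability for `/zh-TW/awooop/work-items`: it links the `KM 需要更新` ("KM needs update") alert to per-item review, dry-run preview, Owner confirmation, writeback protection, and stale ratio backtesting in Work Items.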


@@ -16,14 +16,34 @@ kubectl apply -f k8s/argocd/argocd-metrics-nodeport.yaml
kubectl get svc -n argocd | grep nodeport
```
## 2. NodePort configuration
## 2. Apply Application Health settings
ArgoCD must persist child resource health; otherwise `Application.status.health` shows a top-level `Degraded` with no evidence from the child resources.
```bash
bash scripts/ops/apply-argocd-health-config.sh
kubectl -n argocd get app awoooi-prod \
-o jsonpath='{.status.sync.status}{" "}{.status.health.status}{"\n"}'
```
Required settings:
| ConfigMap | Key | Target |
|-----------|-----|------|
| `argocd-cm` | `resource.customizations.ignoreResourceUpdates.all` | Must not exist; a global `/status` ignore is prohibited |
| `argocd-cmd-params-cm` | `controller.resource.health.persist` | Must be `"true"` |
If the script actually modifies a ConfigMap, it automatically restarts `argocd-redis`, `argocd-server`, and `argocd-application-controller` so that the settings and the app tree cache take effect.
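The verification command in this section prints the sync status and health status on one line. A minimal sketch of checking that line follows; `check_app_status` is a hypothetical helper, and treating `Synced Healthy` as the target state is an assumption (the section does not state the expected output).

```bash
#!/usr/bin/env bash
# Hypothetical helper: takes the "<sync> <health>" line printed by the
# jsonpath query and succeeds only for "Synced Healthy" (assumed target).
check_app_status() {
  local line sync health
  line="$1"
  sync="${line%% *}"    # text before the first space
  health="${line#* }"   # text after the first space

  if [ "$sync" = "Synced" ] && [ "$health" = "Healthy" ]; then
    echo "ok"
    return 0
  fi
  echo "not ready: sync=${sync} health=${health}"
  return 1
}

# Example usage:
#   status="$(kubectl -n argocd get app awoooi-prod \
#     -o jsonpath='{.status.sync.status}{" "}{.status.health.status}')"
#   check_app_status "$status"
```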
## 3. NodePort configuration
| Service | NodePort | Purpose |
|---------|----------|------|
| argocd-metrics-nodeport | 30882 | Application Controller Metrics |
| argocd-server-metrics-nodeport | 30883 | ArgoCD Server Metrics |
## 3. Prometheus scrape endpoints
## 4. Prometheus scrape endpoints
```
http://192.168.0.121:30883/metrics # Server Metrics (Pod on mon1)
@@ -31,7 +51,7 @@ http://192.168.0.121:30883/metrics # Server Metrics (Pod on mon1)
> ⚠️ Note: the ArgoCD Server Pod runs on mon1 (121), so use that node's IP
## 4. Key metrics
## 5. Key metrics
| Metric | Description |
|------|------|


@@ -0,0 +1,53 @@
#!/usr/bin/env bash
set -euo pipefail

# Idempotent source-of-truth for AWOOOI ArgoCD health persistence.
# Keeps Application child resource health visible and prevents the old global
# /status ignore rule from making top-level health look degraded/opaque.

NAMESPACE="${NAMESPACE:-argocd}"
CM_NAME="${CM_NAME:-argocd-cm}"
PARAMS_CM_NAME="${PARAMS_CM_NAME:-argocd-cmd-params-cm}"

changed=0

get_jsonpath() {
  kubectl -n "${NAMESPACE}" get cm "$1" -o "jsonpath=$2" 2>/dev/null || true
}

ignore_all_value="$(get_jsonpath "${CM_NAME}" "{.data.resource\\.customizations\\.ignoreResourceUpdates\\.all}")"
if [[ -n "${ignore_all_value}" ]]; then
  echo "remove ${CM_NAME}.data.resource.customizations.ignoreResourceUpdates.all"
  kubectl -n "${NAMESPACE}" patch cm "${CM_NAME}" \
    --type=json \
    -p='[{"op":"remove","path":"/data/resource.customizations.ignoreResourceUpdates.all"}]'
  changed=1
else
  echo "ok ${CM_NAME}: no global /status ignore"
fi

persist_value="$(get_jsonpath "${PARAMS_CM_NAME}" "{.data.controller\\.resource\\.health\\.persist}")"
if [[ "${persist_value}" != "true" ]]; then
  echo "set ${PARAMS_CM_NAME}.data.controller.resource.health.persist=true"
  kubectl -n "${NAMESPACE}" patch cm "${PARAMS_CM_NAME}" \
    --type=merge \
    -p='{"data":{"controller.resource.health.persist":"true"}}'
  changed=1
else
  echo "ok ${PARAMS_CM_NAME}: controller.resource.health.persist=true"
fi

if [[ "${changed}" == "1" ]]; then
  echo "restart ArgoCD control-plane to load health config"
  kubectl -n "${NAMESPACE}" rollout restart deploy/argocd-redis
  kubectl -n "${NAMESPACE}" rollout restart deploy/argocd-server
  kubectl -n "${NAMESPACE}" rollout restart statefulset/argocd-application-controller
  kubectl -n "${NAMESPACE}" rollout status deploy/argocd-redis --timeout=180s
  kubectl -n "${NAMESPACE}" rollout status deploy/argocd-server --timeout=180s
  kubectl -n "${NAMESPACE}" rollout status statefulset/argocd-application-controller --timeout=180s
else
  echo "no change"
fi

kubectl -n "${NAMESPACE}" get cm "${CM_NAME}" "${PARAMS_CM_NAME}" \
  -o jsonpath='{range .items[*]}{.metadata.name}{": ignore-all="}{.data.resource\.customizations\.ignoreResourceUpdates\.all}{", health-persist="}{.data.controller\.resource\.health\.persist}{"\n"}{end}'
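
One property of the script worth noting: `get_jsonpath` discards errors with `2>/dev/null || true`, so a missing ConfigMap or an unreachable cluster returns the same empty string as an absent key, and the first check would then print `ok`. A precheck along the following lines could make that case fail loudly. This is a sketch only; `require_configmap` is a hypothetical helper that is not in the commit.

```bash
#!/usr/bin/env bash
# Hypothetical precheck: fail early if a ConfigMap cannot be read, so that
# "key absent" is never confused with "ConfigMap missing / cluster unreachable".
require_configmap() {
  local ns name
  ns="$1"
  name="$2"
  if ! kubectl -n "$ns" get cm "$name" >/dev/null 2>&1; then
    echo "error: cannot read configmap ${ns}/${name}" >&2
    return 1
  fi
}

# Example usage before any get_jsonpath call (names taken from the script):
#   require_configmap "${NAMESPACE}" "${CM_NAME}"
#   require_configmap "${NAMESPACE}" "${PARAMS_CM_NAME}"
```

Under `set -euo pipefail`, a failing `require_configmap` call would abort the script before it reports any status.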