docs(argocd): codify health persistence config [skip ci]
@@ -158,6 +158,29 @@
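- Runbook / operations cleanup: `40% -> 70%`; a read-only diagnostic script and a current-state runbook have been added; recovery success evidence is still pending until 111 is restored.
- Full AI automation flywheel: holds at `67%`; this round covered inference fallback observability and tech-debt cleanup, with no new auto-execution or KM writeback.
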
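## 2026-06-04 | Bringing ArgoCD Application Health settings under GitOps

**Background**: While live-fixing the top-level `Degraded` status of the `awoooi-prod` ArgoCD Application in the previous phase, we confirmed that the root causes included a stale global `/status` ignore rule in `argocd-cm` and child resource health persistence not being enabled in `argocd-cmd-params-cm`. The live cluster was fixed at the time, but the repo had no source of truth for the Argo ConfigMaps, so a reinstall or manual drift could bring back an opaque degraded state.
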
**Changes in this round**:
- Added `scripts/ops/apply-argocd-health-config.sh`, which maintains the Argo health settings idempotently:
  - If `argocd-cm.data.resource.customizations.ignoreResourceUpdates.all` exists, remove the key to drop the old global `/status` ignore.
  - If `argocd-cmd-params-cm.data.controller.resource.health.persist` is not `"true"`, set it to `"true"`.
  - Restart `argocd-redis`, `argocd-server`, and `argocd-application-controller` only when a ConfigMap was actually changed.
- Updated `k8s/argocd/DEPLOY.md` with a new "Apply Application Health settings" section and verification commands.

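The decision logic of the script can be sketched as a pure function that takes the two current ConfigMap values and prints the planned actions. This is only an illustration for reasoning about idempotency: `plan_changes` and its action strings are hypothetical and do not appear in the repo, and the function never calls `kubectl`.

```shell
# plan_changes IGNORE_ALL_VALUE PERSIST_VALUE
# Hypothetical helper (not part of the repo script): prints one planned
# action per line, or "no change" when both values are already compliant.
plan_changes() {
  local ignore_all_value="$1"
  local persist_value="$2"
  local changed=0

  # Any non-empty value means the global ignore key still exists.
  if [[ -n "${ignore_all_value}" ]]; then
    echo "remove ignoreResourceUpdates.all"
    changed=1
  fi

  # Anything other than the literal string "true" must be corrected.
  if [[ "${persist_value}" != "true" ]]; then
    echo "set controller.resource.health.persist=true"
    changed=1
  fi

  # A restart is planned only when at least one ConfigMap would change.
  if [[ "${changed}" == "1" ]]; then
    echo "restart"
  else
    echo "no change"
  fi
}
```

Calling `plan_changes "" "true"` corresponds to the already-correct state, where no restart is planned.
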
**Live verification**:
- `bash scripts/ops/apply-argocd-health-config.sh` was run against live, with this output:
  - `ok argocd-cm: no global /status ignore`
  - `ok argocd-cmd-params-cm: controller.resource.health.persist=true`
  - `no change`
- Because live was already in the correct state, this round did not trigger an Argo restart.

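**Progress update**:
- ArgoCD health truthfulness: `85% -> 92%`; live and the repo procedure are now aligned; bringing the Argo base install into a full Kustomize/Helm source is still pending.
- GitOps control-plane hygiene: `55% -> 68%`; a source of truth for the health settings has been added; the Argo install source, RBAC, and notification config still need to be inventoried.
- Full AI automation flywheel: holds at `67%`; this round reduces the risk of GitOps observability regressions and adds no new auto-repair execution.
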
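## 2026-06-03 | AwoooP Work Items Owner Review Gate and Mobile Shell readability

**Background**: The commander requires that AwoooP / AI governance not just fire alerts in Telegram; the frontend must show which process an event has reached, who needs to take over, what the AI did, and which steps were blocked by a gate. This phase focuses on KM owner-review follow-up and mobile readability for `/zh-TW/awooop/work-items`: it connects the `KM 需要更新` ("KM needs update") alert to per-item review, dry-run preview, owner confirmation, writeback protection, and stale-ratio backtesting in Work Items.
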
@@ -16,14 +16,34 @@ kubectl apply -f k8s/argocd/argocd-metrics-nodeport.yaml
kubectl get svc -n argocd | grep nodeport
```

## 2. NodePort configuration
## 2. Apply Application Health settings

ArgoCD must retain child resource health; otherwise `Application.status.health` shows a top-level `Degraded` with no child-resource evidence behind it.

```bash
bash scripts/ops/apply-argocd-health-config.sh

kubectl -n argocd get app awoooi-prod \
  -o jsonpath='{.status.sync.status}{" "}{.status.health.status}{"\n"}'
```

Required settings:

| ConfigMap | Key | Target |
|-----------|-----|------|
| `argocd-cm` | `resource.customizations.ignoreResourceUpdates.all` | Must not exist; a global `/status` ignore is forbidden |
| `argocd-cmd-params-cm` | `controller.resource.health.persist` | Must be `"true"` |

If the script actually modifies a ConfigMap, it automatically restarts `argocd-redis`, `argocd-server`, and `argocd-application-controller` so that the settings and the app tree cache take effect.

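The script ends by printing one summary line per ConfigMap in the form `<name>: ignore-all=<value>, health-persist=<value>`. A minimal sketch of checking those lines against the required settings follows. `check_health_summary` is a hypothetical helper, not part of the repo, and it assumes that a missing key is rendered as an empty string in the summary.

```shell
# check_health_summary [CM_NAME] [PARAMS_CM_NAME]
# Hypothetical helper: reads the script's summary lines on stdin and
# returns 0 only when both required settings are satisfied.
check_health_summary() {
  local cm_name="${1:-argocd-cm}"
  local params_cm_name="${2:-argocd-cmd-params-cm}"
  local line cm_ok=0 params_ok=0

  while IFS= read -r line; do
    case "${line}" in
      # argocd-cm must report an empty ignore-all value (key absent).
      "${cm_name}: ignore-all=, health-persist="*) cm_ok=1 ;;
      # The params ConfigMap must report health-persist=true.
      "${params_cm_name}: ignore-all="*", health-persist=true") params_ok=1 ;;
    esac
  done

  [[ "${cm_ok}" == "1" && "${params_ok}" == "1" ]]
}
```

One possible use is to keep only the summary lines from a run and pipe them in, for example `bash scripts/ops/apply-argocd-health-config.sh | tail -n 2 | check_health_summary`. The `tail -n 2` relies on each value fitting on one line, which holds in the compliant state.
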
## 3. NodePort configuration

| Service | NodePort | Purpose |
|---------|----------|------|
| argocd-metrics-nodeport | 30882 | Application Controller Metrics |
| argocd-server-metrics-nodeport | 30883 | ArgoCD Server Metrics |

## 3. Prometheus scrape endpoints
## 4. Prometheus scrape endpoints

```
http://192.168.0.121:30883/metrics  # Server Metrics (Pod on mon1)
@@ -31,7 +51,7 @@ http://192.168.0.121:30883/metrics # Server Metrics (Pod on mon1)

> ⚠️ Note: the ArgoCD Server Pod runs on mon1 (121), so use that node's IP.

## 4. Key metrics
## 5. Key metrics

| Metric | Description |
|------|------|
53
scripts/ops/apply-argocd-health-config.sh
Executable file
@@ -0,0 +1,53 @@
#!/usr/bin/env bash
set -euo pipefail

# Idempotent source-of-truth for AWOOOI ArgoCD health persistence.
# Keeps Application child resource health visible and prevents the old global
# /status ignore rule from making top-level health look degraded/opaque.

NAMESPACE="${NAMESPACE:-argocd}"
CM_NAME="${CM_NAME:-argocd-cm}"
PARAMS_CM_NAME="${PARAMS_CM_NAME:-argocd-cmd-params-cm}"

changed=0

get_jsonpath() {
  kubectl -n "${NAMESPACE}" get cm "$1" -o "jsonpath=$2" 2>/dev/null || true
}

ignore_all_value="$(get_jsonpath "${CM_NAME}" "{.data.resource\\.customizations\\.ignoreResourceUpdates\\.all}")"
if [[ -n "${ignore_all_value}" ]]; then
  echo "remove ${CM_NAME}.data.resource.customizations.ignoreResourceUpdates.all"
  kubectl -n "${NAMESPACE}" patch cm "${CM_NAME}" \
    --type=json \
    -p='[{"op":"remove","path":"/data/resource.customizations.ignoreResourceUpdates.all"}]'
  changed=1
else
  echo "ok ${CM_NAME}: no global /status ignore"
fi

persist_value="$(get_jsonpath "${PARAMS_CM_NAME}" "{.data.controller\\.resource\\.health\\.persist}")"
if [[ "${persist_value}" != "true" ]]; then
  echo "set ${PARAMS_CM_NAME}.data.controller.resource.health.persist=true"
  kubectl -n "${NAMESPACE}" patch cm "${PARAMS_CM_NAME}" \
    --type=merge \
    -p='{"data":{"controller.resource.health.persist":"true"}}'
  changed=1
else
  echo "ok ${PARAMS_CM_NAME}: controller.resource.health.persist=true"
fi

if [[ "${changed}" == "1" ]]; then
  echo "restart ArgoCD control-plane to load health config"
  kubectl -n "${NAMESPACE}" rollout restart deploy/argocd-redis
  kubectl -n "${NAMESPACE}" rollout restart deploy/argocd-server
  kubectl -n "${NAMESPACE}" rollout restart statefulset/argocd-application-controller
  kubectl -n "${NAMESPACE}" rollout status deploy/argocd-redis --timeout=180s
  kubectl -n "${NAMESPACE}" rollout status deploy/argocd-server --timeout=180s
  kubectl -n "${NAMESPACE}" rollout status statefulset/argocd-application-controller --timeout=180s
else
  echo "no change"
fi

kubectl -n "${NAMESPACE}" get cm "${CM_NAME}" "${PARAMS_CM_NAME}" \
  -o jsonpath='{range .items[*]}{.metadata.name}{": ignore-all="}{.data.resource\.customizations\.ignoreResourceUpdates\.all}{", health-persist="}{.data.controller\.resource\.health\.persist}{"\n"}{end}'