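diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 9f19a855..0ac57b18 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -158,6 +158,29 @@
 - Runbook / operations cleanup: `40% -> 70%`; a read-only diagnostic script and a current-state runbook have been added; recovery success evidence is still pending until 111 recovers.
 - Full AI automation flywheel: holds at `67%`; this round covered inference fallback observability and technical-debt cleanup, with no new auto-execution or KM writeback.
 
+## 2026-06-04 | Bringing ArgoCD Application Health Config Under GitOps
+
+**Background**: While fixing the top-level `Degraded` status of the `awoooi-prod` ArgoCD Application on live in the previous phase, we confirmed that the root causes included a stale global `/status` ignore rule in `argocd-cm` and child resource health persistence not being enabled in `argocd-cmd-params-cm`. Live was fixed at the time, but the repo had no source of truth for the Argo ConfigMaps, so a reinstall or manual drift could bring back the opaque `Degraded` state.
+
+**Changes in this round**:
+- Added `scripts/ops/apply-argocd-health-config.sh`, which maintains the Argo health settings idempotently:
+  - If `argocd-cm.data.resource.customizations.ignoreResourceUpdates.all` is set (non-empty), remove the stale global `/status` ignore rule.
+  - If `argocd-cmd-params-cm.data.controller.resource.health.persist` is not `"true"`, set it to `"true"`.
+  - Restart `argocd-redis`, `argocd-server`, and `argocd-application-controller` only when a ConfigMap was actually changed.
+- Updated `k8s/argocd/DEPLOY.md` with a new "Apply Application Health Config" section and verification commands.
+
+**Live verification**:
+- Ran `bash scripts/ops/apply-argocd-health-config.sh` on live; output:
+  - `ok argocd-cm: no global /status ignore`
+  - `ok argocd-cmd-params-cm: controller.resource.health.persist=true`
+  - `no change`
+- Live was already in the correct state, so no Argo restart was triggered this round.
+
+**Progress update**:
+- ArgoCD health truthfulness: `85% -> 92%`; live and the repo procedure are now aligned; the Argo base install still needs to be brought into a full Kustomize/Helm source.
+- GitOps control-plane hygiene: `55% -> 68%`; the health-config source of truth has been added; the Argo install source, RBAC, and notification config still need to be inventoried.
+- Full AI automation flywheel: holds at `67%`; this round reduces the risk of GitOps observability regressions and adds no new auto-repair execution.
+
 ## 2026-06-03 | AwoooP Work Items Owner Review Gate and Mobile Shell Readability
 
 **Background**: The Commander requires that AwoooP / AI governance not just fire alerts into Telegram; the frontend must show which flow an event has reached, who needs to take over, what the AI did, and which steps were blocked by a gate. This phase focuses on KM owner-review follow-up handling and mobile readability for `/zh-TW/awooop/work-items`: it connects the `KM 需要更新` ("KM needs update") alert to per-item review, dry-run preview, Owner confirmation, writeback protection, and stale-ratio backtesting in Work Items.
diff --git a/k8s/argocd/DEPLOY.md b/k8s/argocd/DEPLOY.md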
index 88df2b9b..2a7dfbef 100644
--- a/k8s/argocd/DEPLOY.md
+++ b/k8s/argocd/DEPLOY.md
@@ -16,14 +16,34 @@ kubectl apply -f k8s/argocd/argocd-metrics-nodeport.yaml
 kubectl get svc -n argocd | grep nodeport
 ```
 
-## 2. NodePort Configuration
+## 2. Apply Application Health Config
+
+ArgoCD must persist child resource health; otherwise `Application.status.health` can show a top-level `Degraded` with no child-resource evidence to explain it.
+
+```bash
+bash scripts/ops/apply-argocd-health-config.sh
+
+kubectl -n argocd get app awoooi-prod \
+  -o jsonpath='{.status.sync.status}{" "}{.status.health.status}{"\n"}'
+```
+
+Required settings:
+
+| ConfigMap | Key | Target |
+|-----------|-----|------|
+| `argocd-cm` | `resource.customizations.ignoreResourceUpdates.all` | Must not exist; a global `/status` ignore is forbidden |
+| `argocd-cmd-params-cm` | `controller.resource.health.persist` | Must be `"true"` |
+
+If the script actually modifies a ConfigMap, it automatically restarts `argocd-redis`, `argocd-server`, and `argocd-application-controller` so that the settings take effect and the app tree cache is rebuilt.
+
+## 3. NodePort Configuration
 
 | Service | NodePort | Purpose |
 |---------|----------|------|
 | argocd-metrics-nodeport | 30882 | Application Controller Metrics |
 | argocd-server-metrics-nodeport | 30883 | ArgoCD Server Metrics |
 
-## 3. Prometheus Scrape Endpoints
+## 4. Prometheus Scrape Endpoints
 
 ```
 http://192.168.0.121:30883/metrics # Server Metrics (Pod on mon1)
@@ -31,7 +51,7 @@ http://192.168.0.121:30883/metrics # Server Metrics (Pod on mon1)
 
 > ⚠️ Note: the ArgoCD Server Pod runs on mon1 (121), so that node's IP must be used
 
-## 4. Key Metrics
+## 5. Key Metrics
 
 | Metric | Description |
 |------|------|
diff --git a/scripts/ops/apply-argocd-health-config.sh b/scripts/ops/apply-argocd-health-config.sh
new file mode 100755
index 00000000..a6a985f7
--- /dev/null
+++ b/scripts/ops/apply-argocd-health-config.sh
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Idempotent source-of-truth for AWOOOI ArgoCD health persistence.
+# Keeps Application child resource health visible and prevents the old global
+# /status ignore rule from making top-level health look degraded/opaque.
+
+NAMESPACE="${NAMESPACE:-argocd}"
+CM_NAME="${CM_NAME:-argocd-cm}"
+PARAMS_CM_NAME="${PARAMS_CM_NAME:-argocd-cmd-params-cm}"
+
+changed=0
+
+get_jsonpath() {
+  kubectl -n "${NAMESPACE}" get cm "$1" -o "jsonpath=$2" 2>/dev/null || true
+}
+
+ignore_all_value="$(get_jsonpath "${CM_NAME}" "{.data.resource\\.customizations\\.ignoreResourceUpdates\\.all}")"
+if [[ -n "${ignore_all_value}" ]]; then
+  echo "remove ${CM_NAME}.data.resource.customizations.ignoreResourceUpdates.all"
+  kubectl -n "${NAMESPACE}" patch cm "${CM_NAME}" \
+    --type=json \
+    -p='[{"op":"remove","path":"/data/resource.customizations.ignoreResourceUpdates.all"}]'
+  changed=1
+else
+  echo "ok ${CM_NAME}: no global /status ignore"
+fi
+
+persist_value="$(get_jsonpath "${PARAMS_CM_NAME}" "{.data.controller\\.resource\\.health\\.persist}")"
+if [[ "${persist_value}" != "true" ]]; then
+  echo "set ${PARAMS_CM_NAME}.data.controller.resource.health.persist=true"
+  kubectl -n "${NAMESPACE}" patch cm "${PARAMS_CM_NAME}" \
+    --type=merge \
+    -p='{"data":{"controller.resource.health.persist":"true"}}'
+  changed=1
+else
+  echo "ok ${PARAMS_CM_NAME}: controller.resource.health.persist=true"
+fi
+
+if [[ "${changed}" == "1" ]]; then
+  echo "restart ArgoCD control-plane to load health config"
+  kubectl -n "${NAMESPACE}" rollout restart deploy/argocd-redis
+  kubectl -n "${NAMESPACE}" rollout restart deploy/argocd-server
+  kubectl -n "${NAMESPACE}" rollout restart statefulset/argocd-application-controller
+  kubectl -n "${NAMESPACE}" rollout status deploy/argocd-redis --timeout=180s
+  kubectl -n "${NAMESPACE}" rollout status deploy/argocd-server --timeout=180s
+  kubectl -n "${NAMESPACE}" rollout status statefulset/argocd-application-controller --timeout=180s
+else
+  echo "no change"
+fi
+
+kubectl -n "${NAMESPACE}" get cm "${CM_NAME}" "${PARAMS_CM_NAME}" \
+  -o jsonpath='{range .items[*]}{.metadata.name}{": ignore-all="}{.data.resource\.customizations\.ignoreResourceUpdates\.all}{", health-persist="}{.data.controller\.resource\.health\.persist}{"\n"}{end}'
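
Review note: the script above patches the cluster as soon as it detects drift. For a read-only preview of what it would do, the same two checks can be factored into a pure function that only receives the values already read from the ConfigMaps. The sketch below is not part of this diff; the function name `plan_health_config` and its "would ..." messages are hypothetical, and it mirrors only the decision logic (a non-empty ignore rule means remove, a persist value other than `true` means set), not the patch or restart steps.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (NOT part of the diff above).
# It mirrors the two drift checks in apply-argocd-health-config.sh but never
# calls kubectl itself, so it can be exercised without cluster access.

# plan_health_config IGNORE_ALL_VALUE PERSIST_VALUE
#   IGNORE_ALL_VALUE: current value of
#     argocd-cm .data["resource.customizations.ignoreResourceUpdates.all"]
#   PERSIST_VALUE: current value of
#     argocd-cmd-params-cm .data["controller.resource.health.persist"]
# Prints one line per pending action and returns 1 if anything would change;
# prints "no change" and returns 0 if the values already match the target.
plan_health_config() {
  local ignore_all_value="${1-}"
  local persist_value="${2-}"
  local drift=0

  # Same condition as the script: any non-empty value counts as drift.
  if [[ -n "${ignore_all_value}" ]]; then
    echo "would remove argocd-cm: resource.customizations.ignoreResourceUpdates.all"
    drift=1
  fi

  # Same condition as the script: anything other than the string "true".
  if [[ "${persist_value}" != "true" ]]; then
    echo "would set argocd-cmd-params-cm: controller.resource.health.persist=true"
    drift=1
  fi

  if [[ "${drift}" == "0" ]]; then
    echo "no change"
  fi
  return "${drift}"
}

# Example wiring (assumes kubectl access to the argocd namespace):
#   plan_health_config \
#     "$(kubectl -n argocd get cm argocd-cm -o 'jsonpath={.data.resource\.customizations\.ignoreResourceUpdates\.all}')" \
#     "$(kubectl -n argocd get cm argocd-cmd-params-cm -o 'jsonpath={.data.controller\.resource\.health\.persist}')"
```

Because the function returns non-zero on drift, it could serve as a CI or cron check that alerts when live has drifted, leaving the actual patch and restart to the script in this diff.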