feat(monitoring): P1/P2 改進 - ArgoCD Metrics + TLS 證書告警

## P1: ArgoCD Metrics
- 新增 ArgoCD Metrics NodePort (30882, 30883)
- 更新 NetworkPolicy 允許 Prometheus (188) 抓取
- 提供 Prometheus scrape config 範本

## P1: NetworkPolicy AI API
- 文檔標註 K8s NetworkPolicy 不支援 FQDN 限制
- 維持現有配置避免 AI 功能中斷

## P2: TLS 證書告警
- 新增 TLSCertExpiringIn30Days (30天預警)
- 新增 TLSCertExpiringIn7Days (7天緊急)
- 新增 TLSCertExpired (已過期)
- 新增 TLSProbeFailure (探測失敗)

## P2: Multi-Sig E2E 測試
- 標記為條件式執行 (API 不可用時自動跳過)
- 避免 CI/CD 因無法連接生產 API 而失敗

首席架構師審查: 2026-03-29 (台北時間)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
OG T
2026-03-28 23:48:57 +08:00
parent 6ac0f8c0e5
commit e75e578547
7 changed files with 350 additions and 1 deletions

View File

@@ -9,11 +9,39 @@ import { test, expect } from '@playwright/test'
* 1. CRITICAL 授權需要 2 人簽核
* 2. 同一人不能重複簽核 (Identity Check)
* 3. 第二人簽核後 → APPROVED
*
* ⚠️ 2026-03-29 首席架構師審查:
* - 此測試需要後端 API 連線 (localhost:8000 或 192.168.0.125:32334)
* - CI/CD 環境無法連接生產 API故標記為條件式執行
* - 本地開發環境可正常執行
*/
const API_BASE_URL = 'http://localhost:8000'
const API_BASE_URL = process.env.TEST_API_URL || 'http://localhost:8000'
// 檢查 API 是否可用
async function isApiAvailable(): Promise<boolean> {
try {
const response = await fetch(`${API_BASE_URL}/api/v1/health`, {
method: 'GET',
signal: AbortSignal.timeout(5000),
})
return response.ok
} catch {
return false
}
}
test.describe('Multi-Sig Security Verification', () => {
// 條件式跳過: 當 API 不可用時
test.beforeAll(async () => {
const apiAvailable = await isApiAvailable()
if (!apiAvailable) {
console.log('⚠️ Multi-Sig tests skipped: Backend API not available')
console.log(` Attempted URL: ${API_BASE_URL}`)
console.log(' To run these tests locally, start the API server first')
test.skip()
}
})
test.setTimeout(120000)
// 輔助函數: 建立 CRITICAL 授權

56
k8s/argocd/DEPLOY.md Normal file
View File

@@ -0,0 +1,56 @@
# ArgoCD 配置部署指南
> **版本**: v3.3.6
> **用途**: GitOps 持續部署
> **建立日期**: 2026-03-29 (台北時間)
---
## 1. 部署 Metrics NodePort
```bash
# 在 K3s Master (192.168.0.120) 執行
kubectl apply -f k8s/argocd/argocd-metrics-nodeport.yaml
# 驗證
kubectl get svc -n argocd | grep nodeport
```
## 2. NodePort 配置
| Service | NodePort | 用途 |
|---------|----------|------|
| argocd-metrics-nodeport | 30882 | Application Controller Metrics |
| argocd-server-metrics-nodeport | 30883 | ArgoCD Server Metrics |
## 3. Prometheus 抓取端點
```
http://192.168.0.120:30882/metrics # Application Controller
http://192.168.0.120:30883/metrics # Server Metrics
```
## 4. 關鍵指標
| 指標 | 說明 |
|------|------|
| `argocd_app_info` | Application 狀態資訊 |
| `argocd_app_sync_total` | 同步次數 |
| `argocd_app_health_status` | 健康狀態 |
| `argocd_cluster_api_resources_count` | API 資源數量 |
---
## 架構圖
```
┌─────────────────────────────────────────────────────────┐
│ Prometheus (188:9090) │
├─────────────────────────────────────────────────────────┤
│ scrape_configs: │
│ - job_name: argocd │
│ targets: │
│ - 192.168.0.120:30882 (app-controller metrics) │
│ - 192.168.0.120:30883 (server metrics) │
└─────────────────────────────────────────────────────────┘
```

View File

@@ -0,0 +1,80 @@
# =============================================================================
# ArgoCD Metrics Network Policy - 允許外部 Prometheus 抓取
# =============================================================================
# 建立者: Claude Code (首席架構師)
# 日期: 2026-03-29 (台北時間)
# 用途: 覆蓋 ArgoCD 預設 NetworkPolicy允許 192.168.0.188 (Prometheus) 抓取指標
# =============================================================================
---
# Application Controller Metrics - 允許 Prometheus 抓取
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: argocd-application-controller-network-policy
namespace: argocd
labels:
app.kubernetes.io/component: application-controller
app.kubernetes.io/name: argocd-application-controller
app.kubernetes.io/part-of: argocd
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: argocd-application-controller
policyTypes:
- Ingress
ingress:
# 允許叢集內部存取
- from:
- namespaceSelector: {}
ports:
- port: 8082
protocol: TCP
# 允許外部 Prometheus (188) 抓取指標
- from:
- ipBlock:
cidr: 192.168.0.188/32
ports:
- port: 8082
protocol: TCP
---
# ArgoCD Server Metrics - 允許 Prometheus 抓取
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: argocd-server-network-policy
namespace: argocd
labels:
app.kubernetes.io/component: server
app.kubernetes.io/name: argocd-server
app.kubernetes.io/part-of: argocd
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: argocd-server
policyTypes:
- Ingress
ingress:
# 允許叢集內部存取
- from:
- namespaceSelector: {}
ports:
- port: 8080
protocol: TCP
- port: 8083
protocol: TCP
# 允許外部 Prometheus (188) 抓取指標
- from:
- ipBlock:
cidr: 192.168.0.188/32
ports:
- port: 8083
protocol: TCP
# 允許外部使用者訪問 UI (NodePort 30443)
- from:
- ipBlock:
cidr: 192.168.0.0/24
ports:
- port: 8080
protocol: TCP

View File

@@ -0,0 +1,47 @@
# =============================================================================
# ArgoCD Metrics NodePort Services
# =============================================================================
# 建立者: Claude Code (首席架構師)
# 日期: 2026-03-29 (台北時間)
# 用途: 暴露 ArgoCD Metrics 供外部 Prometheus (188) 抓取
# =============================================================================
---
# ArgoCD Application Controller Metrics
apiVersion: v1
kind: Service
metadata:
name: argocd-metrics-nodeport
namespace: argocd
labels:
app.kubernetes.io/component: metrics
app.kubernetes.io/name: argocd-application-controller
spec:
type: NodePort
selector:
app.kubernetes.io/name: argocd-application-controller
ports:
- name: metrics
port: 8082
targetPort: 8082
nodePort: 30882
---
# ArgoCD Server Metrics
apiVersion: v1
kind: Service
metadata:
name: argocd-server-metrics-nodeport
namespace: argocd
labels:
app.kubernetes.io/component: metrics
app.kubernetes.io/name: argocd-server
spec:
type: NodePort
selector:
app.kubernetes.io/name: argocd-server
ports:
- name: metrics
port: 8083
targetPort: 8083
nodePort: 30883

View File

@@ -170,6 +170,11 @@ spec:
port: 53
# 允許訪問外部 AI API (雲端備援: Gemini / Claude)
# ⚠️ 2026-03-29 首席架構師審查:
# - K8s NetworkPolicy 不支援 FQDN (域名),僅支援 CIDR
# - Google/Anthropic API IP 範圍動態變化
# - 若限定 CIDRAPI 變更時會中斷 AI 功能
# - 建議: 維持 0.0.0.0/0:443透過 Egress Gateway 強化 (未來)
- to:
- ipBlock:
cidr: 0.0.0.0/0

View File

@@ -175,3 +175,53 @@ groups:
annotations:
summary: "⚠️ 容器 OOM Killed"
description: "{{ $labels.pod }}/{{ $labels.container }} 過去 1 小時因 OOM 被終止"
# ===== TLS 證書告警 =====
- name: tls_certificate_alerts
rules:
# TLS 證書即將過期 (30 天內)
- alert: TLSCertExpiringIn30Days
expr: probe_ssl_earliest_cert_expiry{job="blackbox-https"} - time() < 86400 * 30
for: 1h
labels:
severity: warning
team: ops
component: tls
annotations:
summary: "⚠️ TLS 證書即將過期"
description: "{{ $labels.instance }} 證書將在 {{ $value | humanizeDuration }} 後過期"
# TLS 證書即將過期 (7 天內)
- alert: TLSCertExpiringIn7Days
expr: probe_ssl_earliest_cert_expiry{job="blackbox-https"} - time() < 86400 * 7
for: 1h
labels:
severity: critical
team: ops
component: tls
annotations:
summary: "🔴 TLS 證書即將過期 (7 天內)"
description: "{{ $labels.instance }} 證書將在 {{ $value | humanizeDuration }} 後過期,請立即更新!"
# TLS 證書已過期
- alert: TLSCertExpired
expr: probe_ssl_earliest_cert_expiry{job="blackbox-https"} - time() < 0
labels:
severity: critical
team: ops
component: tls
annotations:
summary: "🔴🔴 TLS 證書已過期"
description: "{{ $labels.instance }} 證書已過期HTTPS 服務可能無法正常運作"
# TLS 探測失敗
- alert: TLSProbeFailure
expr: probe_success{job="blackbox-https"} == 0
for: 5m
labels:
severity: warning
team: ops
component: tls
annotations:
summary: "⚠️ TLS 探測失敗"
description: "無法連線到 {{ $labels.instance }} 進行 TLS 檢查"

View File

@@ -0,0 +1,83 @@
# =============================================================================
# Prometheus Config Additions - P1/P2 改進
# =============================================================================
# 建立者: Claude Code (首席架構師)
# 日期: 2026-03-29 (台北時間)
# 用途: 新增 ArgoCD Metrics + TLS 證書監控
# 部署位置: 192.168.0.188 /etc/prometheus/prometheus.yml
# =============================================================================
#
# 部署方式:
# 1. SSH 到 192.168.0.188
# 2. 編輯 /etc/prometheus/prometheus.yml
# 3. 在 scrape_configs 區塊新增以下內容
# 4. 執行 systemctl reload prometheus
#
# =============================================================================
# ===== 新增 scrape_configs =====
# ArgoCD Metrics (需先部署 NodePort: k8s/argocd/argocd-metrics-nodeport.yaml)
- job_name: argocd
honor_timestamps: true
scrape_interval: 30s
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
static_configs:
- targets:
- 192.168.0.120:30883 # ArgoCD Server Metrics
labels:
component: server
# 註: Application Controller Metrics (30882) 需確認 ArgoCD 版本是否支援
# TLS 證書監控 (使用 Blackbox Exporter)
- job_name: blackbox-https
honor_timestamps: true
scrape_interval: 60s
scrape_timeout: 30s
metrics_path: /probe
params:
module: [http_2xx_ssl]
static_configs:
- targets:
- https://awoooi.wooo.work # AWOOOI 主站
- https://192.168.0.120:30443 # ArgoCD UI
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
# ===== Blackbox Exporter 模組配置 =====
# 檔案: /etc/blackbox_exporter/blackbox.yml
# 新增模組:
#
# modules:
# http_2xx_ssl:
# prober: http
# timeout: 10s
# http:
# valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
# valid_status_codes: [] # 預設 2xx
# method: GET
# tls_config:
# insecure_skip_verify: false
# fail_if_ssl: false
# fail_if_not_ssl: true
# =============================================================================
# 驗證指令
# =============================================================================
#
# 1. 檢查 ArgoCD 指標:
# curl -s http://192.168.0.120:30883/metrics | grep argocd_info
#
# 2. 檢查 TLS 證書過期時間:
# curl -s "http://192.168.0.188:9090/api/v1/query?query=probe_ssl_earliest_cert_expiry" | jq
#
# 3. 驗證告警規則載入:
# curl -s http://192.168.0.188:9090/api/v1/rules | jq '.data.groups[] | select(.name | contains("tls"))'
#