From b9597d8d701055c813c499b535bbbbf1f49ec48e Mon Sep 17 00:00:00 2001
From: Your Name
Date: Mon, 18 May 2026 12:36:09 +0800
Subject: [PATCH] docs(awooop): record t49 host mcp evidence [skip ci]

---
 docs/LOGBOOK.md | 118 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 76b00530..c30345cf 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -9627,3 +9627,121 @@ generation=23 observed=23 ready=1/1 restartedAt=2026-05-13T17:10:43Z
 - This is not full automation: governance alerts (for example `knowledge_degradation` / `governance_slo_data_gap`) are still pushed to Telegram repeatedly, and there is currently no visibility into a corresponding `governance_remediation_dispatch` stage.
 - Next phase, T17: governance alert leader/dedupe, ADR-100 SLO emitter fix, KM stale refresh task, governance PlayBook seed, and an AwoooP frontend Timeline showing per-stage status and MCP usage evidence.
 - Progress update: the Alertmanager low-risk auto-repair mainline is about 95% done; full productization of AI-automated management is about 70% (governance alerts, Ansible execution evidence, the frontend incident dossier, and the MCP usage overview are still incomplete).
+
+### 2026-05-18: T49 Signal-worker Host alerts gain AwoooP truth-chain / MCP gateway evidence
+
+**Trigger**:
+
+- Telegram cards for `HostErrorLogFlood` / Config Drift alerts could not answer: is this a duplicate, did it enter AI investigation, were MCP/Sentry/SigNoz/Prometheus used, and was it auto-repaired or handed to a human?
+- Production truth-chain quality showed many signal-worker events stuck at received-only: 42 `HostErrorLogFlood` events in the past 24h had a timeline but no evidence / outbound / MCP gateway records.
+
+**Fix**:
+
+- After `signal_worker` successfully processes a Redis signal, it now also writes the AwoooP internal inbound/outbound records, a completed shadow run, raw signal evidence, and an observation timeline; it does not claim auto-repair, and the status is explicitly `observed_not_executed`.
+- The standalone `awoooi-worker` now matches the API lifespan at startup by initializing the MCP provider registry and the MCP tool registry; one-off/backfill observations also bootstrap the MCP runtime themselves.
+- MCP tool suggestion fixes for Host alerts:
+  - `namespace` alone is no longer treated as a K8s workload locator, so Host alerts are not misrouted to K8s tools.
+  - Tool selection is now provider-balanced, so no single provider fills the 8D budget and SSH + SigNoz + Prometheus corroborating evidence is retained.
+- PDI Host parameters now include `container_name` / `filter_name` / `service`, so `ssh_get_container_logs` and `ssh_get_container_status` no longer fail on missing parameters.
+- The SigNoz log client now matches the live ClickHouse schema: `signoz_logs.distributed_logs_v2` + `resources_string` / `attributes_string`.
+
+**local verification**:
+
+```text
+py_compile:
+apps/api/src/services/signal_observation_service.py
+apps/api/src/workers/signal_worker.py
+apps/api/src/services/mcp_tool_registry.py
+apps/api/src/services/pre_decision_investigator.py
+apps/api/src/services/signoz_client.py
+OK
+
+ruff --select F,E9:
+changed API service/test files
+OK
+
+pytest:
+apps/api/tests/test_mcp_tool_registry.py
+apps/api/tests/test_pre_decision_investigator.py
+apps/api/tests/test_signal_observation_service.py
+apps/api/tests/test_signoz_client_logs.py
+67 passed
+
+pytest:
+apps/api/tests/test_channel_hub_grouped_alert_events.py
+apps/api/tests/test_awooop_truth_chain_service.py
+38 passed
+```
+
+**production deploy / smoke (done)**:
+
+```text
+Commits:
+a023c535 fix(awooop): bridge signal worker observations
+98a10cbc fix(awooop): initialize mcp runtime for signal worker
+64c70442 fix(mcp): balance host alert tool suggestions
+5cb10a6d fix(mcp): enrich host log evidence params
+
+Deploy markers:
+df7d9573 chore(cd): deploy a023c53 [skip ci]
+989390f7 chore(cd): deploy 98a10cb [skip ci]
+0e7fe211 chore(cd): deploy 64c7044 [skip ci]
+749b2109 chore(cd): deploy 5cb10a6 [skip ci]
+
+Gitea Actions:
+2274 tests/build-and-deploy/post-deploy-checks -> success
+2276 tests/build-and-deploy/post-deploy-checks -> success
+2279 tests/build-and-deploy/post-deploy-checks -> success
+
+Latest deployed image:
+awoooi-api 192.168.0.110:5000/awoooi/api:5cb10a6d2d417d8af2a0f906cd2483f644ddf3a9
+awoooi-worker 192.168.0.110:5000/awoooi/api:5cb10a6d2d417d8af2a0f906cd2483f644ddf3a9
+
+Worker startup:
+mcp_registry_initialized providers=10 tools=56
+signal_worker_mcp_runtime_initialized
+signal_worker_started
+
+health:
+GET https://awoooi.wooo.work/api/v1/health -> healthy, mock_mode=false
+```
+
+**production DB verification**:
+
+```text
+Representative live check:
+incident=INC-20260518-792684
+gateway_total: 15 -> 23
+evidence_total: 4 -> 5
+sensors_attempted=8
+sensors_succeeded=8
+latest_8 gateway tools:
+ssh_diagnose success
+ssh_get_container_logs success
+ssh_get_container_status success
+ssh_get_top_processes success
+query_logs success
+error_logs_summary success
+prometheus_query success
+prometheus_query_range success
+
+24h HostErrorLogFlood backfill:
+attempted=40
+with_gateway=40
+failures=[]
+
+24h final:
+host_24h_total=41
+with_evidence=41
+with_channel_event=41
+with_mcp_gateway=41
+missing_mcp_gateway=0
+gateway_rows=343
+```
+
+**Assessment**:
+
+- T49 turned the last 24h of `HostErrorLogFlood` events from Telegram/received-only into traceable AwoooP events: inbound/outbound/evidence/MCP gateway records are all queryable.
+- These Host events are still not "auto-repair succeeded"; the correct current status is `observed_not_executed`, meaning AI/PDI/MCP have investigated but nothing has entered the executor/auto-repair path. This is deliberate, to avoid a false green.
+- Still pending for the next phase: fix the HTTP 400 from the truth-chain detail/history API, show MCP gateway/evidence/outbound in the frontend AwoooP Runs/Incident detail views, and complete governance alert leader/dedupe and the Ansible executor/diff/apply/rollback audit.
+- Progress update: the AwoooP alert observability chain is about 86%; the low-risk auto-repair closed loop is about 95%; full productization of AI-automated management is about 78%.
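
Reviewer note: the provider-balanced tool selection described in the fix list can be illustrated with a short round-robin sketch. This is a minimal illustration under stated assumptions, not the code in `mcp_tool_registry.py`: the function name `select_provider_balanced`, the `(provider, tool)` input shape, and the provider label attached to each tool are hypothetical. Only the tool names are taken from the gateway output recorded in the entry.

```python
from collections import defaultdict, deque
from typing import Iterable


def select_provider_balanced(
    candidates: Iterable[tuple[str, str]], budget: int = 8
) -> list[str]:
    """Pick up to `budget` tools, rotating across providers round-robin.

    `candidates` is an ordered iterable of (provider, tool) pairs, highest
    priority first. Order within each provider is preserved.
    """
    queues: dict[str, deque[str]] = defaultdict(deque)
    for provider, tool in candidates:
        queues[provider].append(tool)

    selected: list[str] = []
    # dicts keep insertion order, so providers rotate in first-seen order.
    while len(selected) < budget and any(queues.values()):
        for queue in queues.values():
            if queue and len(selected) < budget:
                selected.append(queue.popleft())
    return selected


# Hypothetical candidate list: the provider labels are assumptions.
CANDIDATES = [
    ("ssh", "ssh_diagnose"),
    ("ssh", "ssh_get_container_logs"),
    ("ssh", "ssh_get_container_status"),
    ("ssh", "ssh_get_top_processes"),
    ("signoz", "query_logs"),
    ("prometheus", "prometheus_query"),
]

picked = select_provider_balanced(CANDIDATES, budget=4)
print(picked)
```

With a plain first-four cut, every slot in this example would go to an SSH tool. The round-robin pass instead keeps one SigNoz and one Prometheus tool inside the budget before returning to the remaining SSH tools.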