docs(awooop): record t49 host mcp evidence [skip ci]

Your Name
2026-05-18 12:36:09 +08:00
parent 749b210997
commit b9597d8d70


@@ -9627,3 +9627,121 @@ generation=23 observed=23 ready=1/1 restartedAt=2026-05-13T17:10:43Z
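- This does not mean full automation is complete: governance alerts (for example `knowledge_degradation` / `governance_slo_data_gap`) still trigger repeated Telegram pushes, and there is currently no visibility into a corresponding `governance_remediation_dispatch` stage.
- Next phase, T17: governance alert leader/dedupe, ADR-100 SLO emitter fix, KM stale refresh task, governance PlayBook seed, and an AwoooP frontend Timeline that shows per-stage status and MCP usage evidence.
- Progress update: the Alertmanager low-risk auto-repair mainline is about 95% done; productization of full AI-automated management is about 70% (governance alerts, Ansible execution evidence, the frontend incident dossier, and the MCP usage overview are still incomplete).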
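### 2026-05-18 - T49: signal-worker Host alerts now carry AwoooP truth-chain / MCP gateway evidence
**Trigger**
- Telegram cards for `HostErrorLogFlood` / Config Drift alerts could not answer four questions: is this a duplicate, did it enter AI investigation, were MCP/Sentry/SignOz/Prometheus used, and was it auto-repaired or handed off to a human?
- Production truth-chain quality showed many signal-worker events stuck at received-only: 42 `HostErrorLogFlood` events in the last 24h had a timeline but no evidence / outbound / MCP gateway records.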
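**Fix**
- After `signal_worker` successfully processes a Redis signal, it now back-writes the AwoooP internal inbound/outbound records, a completed shadow run, raw signal evidence, and an observation timeline. It does not claim auto-repair; the status is explicitly `observed_not_executed`.
- The standalone `awoooi-worker` now aligns its startup with the API lifespan and initializes the MCP provider registry + MCP tool registry; one-off/backfill observations also bootstrap the MCP runtime themselves.
- MCP tool suggestion fixes for Host alerts:
  - `namespace` on its own is no longer treated as a K8s workload locator, so Host alerts are no longer steered toward K8s tools.
  - Tool selection is now provider-balanced, so a single provider cannot fill the whole 8D budget and the SSH + SignOz + Prometheus corroborating evidence is preserved.
- PDI Host parameters now include `container_name` / `filter_name` / `service`, so `ssh_get_container_logs` and `ssh_get_container_status` no longer fail on missing parameters.
- The SignOz log client now matches the live ClickHouse schema: `signoz_logs.distributed_logs_v2` + `resources_string` / `attributes_string`.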
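
The provider-balanced selection described above can be illustrated with a round-robin pick across providers. This is only a minimal sketch under stated assumptions, not the code in `mcp_tool_registry.py`: the `ToolCandidate` type, the `select_balanced` function, and the reading of the 8D budget as "at most 8 tools" are all hypothetical.

```python
from collections import OrderedDict, deque
from typing import Deque, Iterable, List, NamedTuple


class ToolCandidate(NamedTuple):
    """Hypothetical candidate record: which provider offers which tool."""
    provider: str
    name: str


def select_balanced(candidates: Iterable[ToolCandidate], budget: int = 8) -> List[ToolCandidate]:
    """Pick up to `budget` tools, taking one per provider per round.

    Candidates are assumed to arrive already ranked; ranking order is
    preserved inside each provider's queue.
    """
    queues: "OrderedDict[str, Deque[ToolCandidate]]" = OrderedDict()
    for cand in candidates:
        queues.setdefault(cand.provider, deque()).append(cand)

    selected: List[ToolCandidate] = []
    while queues and len(selected) < budget:
        # One round: visit every provider that still has candidates.
        for provider in list(queues):
            if len(selected) >= budget:
                break
            queue = queues[provider]
            selected.append(queue.popleft())
            if not queue:
                del queues[provider]
    return selected


if __name__ == "__main__":
    # Illustrative input: one provider with many tools, two with few.
    ranked = [ToolCandidate("ssh", f"ssh_tool_{i}") for i in range(10)]
    ranked += [ToolCandidate("signoz", f"signoz_tool_{i}") for i in range(2)]
    ranked += [ToolCandidate("prometheus", f"prometheus_tool_{i}") for i in range(2)]
    for tool in select_balanced(ranked, budget=8):
        print(tool.provider, tool.name)
```

With a plain "top 8 by rank" cut, the ten SSH candidates in this example would take every slot. The round-robin keeps the log and metrics providers in the result as long as they have any candidate at all.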
**Local verification**
```text
py_compile:
  apps/api/src/services/signal_observation_service.py
  apps/api/src/workers/signal_worker.py
  apps/api/src/services/mcp_tool_registry.py
  apps/api/src/services/pre_decision_investigator.py
  apps/api/src/services/signoz_client.py
  OK
ruff --select F,E9:
  changed API service/test files
  OK
pytest:
  apps/api/tests/test_mcp_tool_registry.py
  apps/api/tests/test_pre_decision_investigator.py
  apps/api/tests/test_signal_observation_service.py
  apps/api/tests/test_signoz_client_logs.py
  67 passed
pytest:
  apps/api/tests/test_channel_hub_grouped_alert_events.py
  apps/api/tests/test_awooop_truth_chain_service.py
  38 passed
```
**Production deploy / smoke (done)**
```text
Commits:
  a023c535 fix(awooop): bridge signal worker observations
  98a10cbc fix(awooop): initialize mcp runtime for signal worker
  64c70442 fix(mcp): balance host alert tool suggestions
  5cb10a6d fix(mcp): enrich host log evidence params
Deploy markers:
  df7d9573 chore(cd): deploy a023c53 [skip ci]
  989390f7 chore(cd): deploy 98a10cb [skip ci]
  0e7fe211 chore(cd): deploy 64c7044 [skip ci]
  749b2109 chore(cd): deploy 5cb10a6 [skip ci]
Gitea Actions:
  2274 tests/build-and-deploy/post-deploy-checks -> success
  2276 tests/build-and-deploy/post-deploy-checks -> success
  2279 tests/build-and-deploy/post-deploy-checks -> success
Latest deployed image:
  awoooi-api 192.168.0.110:5000/awoooi/api:5cb10a6d2d417d8af2a0f906cd2483f644ddf3a9
  awoooi-worker 192.168.0.110:5000/awoooi/api:5cb10a6d2d417d8af2a0f906cd2483f644ddf3a9
Worker startup:
  mcp_registry_initialized providers=10 tools=56
  signal_worker_mcp_runtime_initialized
  signal_worker_started
health:
  GET https://awoooi.wooo.work/api/v1/health -> healthy, mock_mode=false
```
**Production DB verification**
```text
Representative live check:
  incident=INC-20260518-792684
  gateway_total: 15 -> 23
  evidence_total: 4 -> 5
  sensors_attempted=8
  sensors_succeeded=8
  latest_8 gateway tools:
    ssh_diagnose success
    ssh_get_container_logs success
    ssh_get_container_status success
    ssh_get_top_processes success
    query_logs success
    error_logs_summary success
    prometheus_query success
    prometheus_query_range success
24h HostErrorLogFlood backfill:
  attempted=40
  with_gateway=40
  failures=[]
24h final:
  host_24h_total=41
  with_evidence=41
  with_channel_event=41
  with_mcp_gateway=41
  missing_mcp_gateway=0
  gateway_rows=343
```
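
The "24h final" counters above read as a coverage roll-up over per-incident row counts. The sketch below shows one way such a roll-up could be computed. It is an assumption-laden illustration, not the query that produced these numbers: the `HostEventCounts` type, its fields, and `summarize_coverage` are hypothetical, and only the output key names are taken from the log.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class HostEventCounts:
    """Hypothetical per-incident row counts; field names are illustrative."""
    incident_id: str
    evidence_rows: int
    channel_event_rows: int
    gateway_rows: int


def summarize_coverage(events: Iterable[HostEventCounts]) -> Dict[str, int]:
    """Roll per-incident counts up into the counters shown in the log."""
    rows = list(events)
    with_gateway = sum(1 for e in rows if e.gateway_rows > 0)
    return {
        "host_24h_total": len(rows),
        "with_evidence": sum(1 for e in rows if e.evidence_rows > 0),
        "with_channel_event": sum(1 for e in rows if e.channel_event_rows > 0),
        "with_mcp_gateway": with_gateway,
        # An incident is "missing" when it has no gateway row at all.
        "missing_mcp_gateway": len(rows) - with_gateway,
        "gateway_rows": sum(e.gateway_rows for e in rows),
    }


if __name__ == "__main__":
    # Made-up sample data, not production rows.
    sample = [
        HostEventCounts("INC-A", evidence_rows=1, channel_event_rows=1, gateway_rows=8),
        HostEventCounts("INC-B", evidence_rows=1, channel_event_rows=1, gateway_rows=0),
    ]
    print(summarize_coverage(sample))
```

Under this reading, `missing_mcp_gateway=0` means every one of the 41 Host incidents has at least one gateway row, while `gateway_rows=343` is the total across all of them.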
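**Interpretation**
- T49 has backfilled the last 24h of `HostErrorLogFlood` alerts from Telegram/received-only into traceable AwoooP events: inbound/outbound/evidence/MCP gateway records are all queryable.
- These Host events are still not "auto-repair succeeded". The correct current status is `observed_not_executed`, meaning AI/PDI/MCP has investigated but nothing has reached the executor/auto-repair. This deliberately avoids a false green.
- Still pending for the next phase: fix the HTTP 400 from the truth-chain detail/history API, show MCP gateway/evidence/outbound in the frontend AwoooP Runs/Incident detail, and complete governance alert leader/dedupe plus the Ansible executor/diff/apply/rollback audit.
- Progress update: the AwoooP alert observability chain is about 86%; the low-risk auto-repair closed loop is about 95%; productization of full AI-automated management is about 78%.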