docs(awooop): record t49 host mcp evidence [skip ci]
docs/LOGBOOK.md
@@ -9627,3 +9627,121 @@ generation=23 observed=23 ready=1/1 restartedAt=2026-05-13T17:10:43Z
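- This is not full automation yet: governance alerts (for example `knowledge_degradation` / `governance_slo_data_gap`) still produce repeated Telegram pushes, and the corresponding `governance_remediation_dispatch` stage currently has no visibility.
- Next phase, T17: governance alert leader/dedupe, ADR-100 SLO emitter fix, KM stale refresh task, governance PlayBook seed, and an AwoooP frontend Timeline that shows per-stage status and MCP usage evidence.
- Progress update: the Alertmanager low-risk auto-repair main line is about 95% complete; productization of full AI automated management is about 70% (governance alerts, Ansible execution evidence, the frontend incident dossier, and the MCP usage overview are still unfinished).
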
### 2026-05-18 — T49 Signal-worker Host alerts: backfill AwoooP truth-chain / MCP gateway evidence

**Trigger**:

- Telegram cards of the `HostErrorLogFlood` / Config Drift type could not answer: is this a duplicate, did it enter AI investigation, were MCP/Sentry/SignOz/Prometheus used, and was it auto-repaired or handed to a human?
- Production truth-chain quality showed many signal-worker events stuck at received-only: 42 `HostErrorLogFlood` events in the last 24h had a timeline but no evidence / outbound / MCP gateway records.

**Fix**:

- After `signal_worker` successfully processes a Redis signal, it now also writes the AwoooP internal inbound/outbound records, a completed shadow run, raw signal evidence, and an observation timeline. It does not claim auto-repair; the status is explicitly `observed_not_executed`.
- The standalone `awoooi-worker` now matches the API lifespan at startup and initializes the MCP provider registry and MCP tool registry; one-off/backfill observations also bootstrap the MCP runtime themselves.
- MCP tool suggestion fixes for Host alerts:
  - `namespace` alone is no longer treated as a K8s workload locator, so Host alerts are not misrouted to K8s tools.
  - Tool selection is provider-balanced, so a single provider cannot fill the whole 8D budget and SSH + SignOz + Prometheus corroborating evidence is preserved.
  - PDI Host parameters now include `container_name` / `filter_name` / `service`, so `ssh_get_container_logs` and `ssh_get_container_status` no longer fail on missing parameters.
- The SignOz log client now matches the live ClickHouse schema: `signoz_logs.distributed_logs_v2` + `resources_string` / `attributes_string`.

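
The two tool-suggestion rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `mcp_tool_registry.py`: the label keys in `K8S_WORKLOAD_KEYS`, the `(provider, tool)` tuple shape, the function names, and the default budget of 8 slots are all assumptions made for the example.

```python
from typing import Iterable

# Hypothetical label keys that identify a concrete K8s workload.
K8S_WORKLOAD_KEYS = ("pod", "deployment", "statefulset", "daemonset")


def has_k8s_workload_locator(labels: dict[str, str]) -> bool:
    """Return True only when a concrete workload label is present.

    A bare `namespace` label is deliberately not enough, so Host alerts
    that happen to carry one are not routed to K8s tools.
    """
    return any(labels.get(key) for key in K8S_WORKLOAD_KEYS)


def select_balanced(
    candidates: Iterable[tuple[str, str]], budget: int = 8
) -> list[str]:
    """Pick up to `budget` tools, taking one tool per provider per round.

    `candidates` is an ordered iterable of (provider, tool) pairs
    (an assumed shape, chosen for this sketch).
    """
    # Group tools by provider, preserving first-seen provider order.
    queues: dict[str, list[str]] = {}
    for provider, tool in candidates:
        queues.setdefault(provider, []).append(tool)

    selected: list[str] = []
    # Round-robin across providers until the budget or the queues run out.
    while len(selected) < budget and any(queues.values()):
        for tools in queues.values():
            if tools and len(selected) < budget:
                selected.append(tools.pop(0))
    return selected
```

With a budget smaller than the candidate list, the round-robin keeps at least one tool from each provider before any provider gets a second slot, which is the property the fix relies on to keep SSH, SignOz, and Prometheus evidence side by side.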

**local verification**:

```text
py_compile:
  apps/api/src/services/signal_observation_service.py
  apps/api/src/workers/signal_worker.py
  apps/api/src/services/mcp_tool_registry.py
  apps/api/src/services/pre_decision_investigator.py
  apps/api/src/services/signoz_client.py
  OK

ruff --select F,E9:
  changed API service/test files
  OK

pytest:
  apps/api/tests/test_mcp_tool_registry.py
  apps/api/tests/test_pre_decision_investigator.py
  apps/api/tests/test_signal_observation_service.py
  apps/api/tests/test_signoz_client_logs.py
  67 passed

pytest:
  apps/api/tests/test_channel_hub_grouped_alert_events.py
  apps/api/tests/test_awooop_truth_chain_service.py
  38 passed
```

**production deploy / smoke (done)**:

```text
Commits:
  a023c535 fix(awooop): bridge signal worker observations
  98a10cbc fix(awooop): initialize mcp runtime for signal worker
  64c70442 fix(mcp): balance host alert tool suggestions
  5cb10a6d fix(mcp): enrich host log evidence params

Deploy markers:
  df7d9573 chore(cd): deploy a023c53 [skip ci]
  989390f7 chore(cd): deploy 98a10cb [skip ci]
  0e7fe211 chore(cd): deploy 64c7044 [skip ci]
  749b2109 chore(cd): deploy 5cb10a6 [skip ci]

Gitea Actions:
  2274 tests/build-and-deploy/post-deploy-checks -> success
  2276 tests/build-and-deploy/post-deploy-checks -> success
  2279 tests/build-and-deploy/post-deploy-checks -> success

Latest deployed image:
  awoooi-api 192.168.0.110:5000/awoooi/api:5cb10a6d2d417d8af2a0f906cd2483f644ddf3a9
  awoooi-worker 192.168.0.110:5000/awoooi/api:5cb10a6d2d417d8af2a0f906cd2483f644ddf3a9

Worker startup:
  mcp_registry_initialized providers=10 tools=56
  signal_worker_mcp_runtime_initialized
  signal_worker_started

health:
  GET https://awoooi.wooo.work/api/v1/health -> healthy, mock_mode=false
```

**production DB verification**:

```text
Representative live check:
  incident=INC-20260518-792684
  gateway_total: 15 -> 23
  evidence_total: 4 -> 5
  sensors_attempted=8
  sensors_succeeded=8
  latest_8 gateway tools:
    ssh_diagnose success
    ssh_get_container_logs success
    ssh_get_container_status success
    ssh_get_top_processes success
    query_logs success
    error_logs_summary success
    prometheus_query success
    prometheus_query_range success

24h HostErrorLogFlood backfill:
  attempted=40
  with_gateway=40
  failures=[]

24h final:
  host_24h_total=41
  with_evidence=41
  with_channel_event=41
  with_mcp_gateway=41
  missing_mcp_gateway=0
  gateway_rows=343
```
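
As a rough illustration of how counters like those in the `24h final` block could be derived, the sketch below aggregates per-incident row counts. The `IncidentRow` shape and its field names are hypothetical and do not reflect the production schema; only the output keys are taken from the log above.

```python
from dataclasses import dataclass


@dataclass
class IncidentRow:
    # Hypothetical flattened view of one incident; not the real table schema.
    incident_id: str
    evidence_count: int
    channel_event_count: int
    gateway_row_count: int


def summarize(rows: list[IncidentRow]) -> dict[str, int]:
    """Aggregate per-incident counts into coverage counters."""
    total = len(rows)
    with_gateway = sum(1 for r in rows if r.gateway_row_count > 0)
    return {
        "host_24h_total": total,
        "with_evidence": sum(1 for r in rows if r.evidence_count > 0),
        "with_channel_event": sum(1 for r in rows if r.channel_event_count > 0),
        "with_mcp_gateway": with_gateway,
        # An incident is "missing" when it has no gateway rows at all.
        "missing_mcp_gateway": total - with_gateway,
        "gateway_rows": sum(r.gateway_row_count for r in rows),
    }
```

Under this reading, `missing_mcp_gateway=0` means every incident in the window has at least one gateway row, while `gateway_rows` counts individual tool calls across all incidents.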
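
**Interpretation**:

- T49 has turned the last 24h of `HostErrorLogFlood` events from Telegram/received-only into events traceable in AwoooP: inbound/outbound/evidence/MCP gateway records are all queryable.
- These Host events are still not "auto-repair succeeded"; the correct current status is `observed_not_executed`, meaning AI/PDI/MCP has investigated but nothing has entered the executor/auto-repair path. This is deliberate, to avoid a false green.
- Still pending for the next phase: fix the HTTP 400 from the truth-chain detail/history API; show MCP gateway/evidence/outbound in the frontend AwoooP Runs/Incident detail; complete governance alert leader/dedupe and the Ansible executor/diff/apply/rollback audit.
- Progress update: AwoooP alert observability chain about 86%; low-risk auto-repair closed loop about 95%; productization of full AI automated management about 78%.
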
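
The "no false green" rule can be pictured as a small status derivation. This is only a sketch under assumptions: `observed_not_executed` is the status named in this entry, while `RunStatus`, `derive_status`, and the other two labels are hypothetical names introduced for illustration.

```python
from enum import Enum


class RunStatus(str, Enum):
    RECEIVED_ONLY = "received_only"                   # hypothetical label
    OBSERVED_NOT_EXECUTED = "observed_not_executed"   # status named in this entry
    EXECUTED = "executed"                             # hypothetical label


def derive_status(has_investigation_evidence: bool, executor_ran: bool) -> RunStatus:
    """Report execution only when an executor actually ran."""
    if executor_ran:
        return RunStatus.EXECUTED
    if has_investigation_evidence:
        # Investigated via AI/PDI/MCP, but no repair was attempted.
        return RunStatus.OBSERVED_NOT_EXECUTED
    return RunStatus.RECEIVED_ONLY
```

The point of keeping investigation and execution as separate inputs is that collecting evidence alone can never promote an event to a repaired state.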