docs(awooop): record t50 mcp run evidence [skip ci]
107
docs/LOGBOOK.md
@@ -9745,3 +9745,110 @@ gateway_rows=343
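- These Host events are still not "auto-repair succeeded"; the correct current status is `observed_not_executed`, meaning AI/PDI/MCP has investigated but nothing has entered executor/auto-repair. This deliberately avoids a false green.
- Still pending for the next phase: fix the HTTP 400 from the truth-chain detail/history API; show MCP gateway/evidence/outbound in the frontend AwoooP Runs/Incident detail; complete governance-alert leader/dedupe and the Ansible executor/diff/apply/rollback audit.
- Progress update: AwoooP alert observability chain about 86%; low-risk auto-repair closed loop about 95%; full AI automation management productization about 78%.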

### 2026-05-18 — T50 AwoooP Runs: add MCP-only investigation evidence and frontend filter

**Trigger**:

- T49 added Signal-worker Host alerts to `mcp_audit_log`, but the AwoooP Run List still showed `remediation_summary.status=no_evidence`.
- Run Detail showed the legacy/self-built MCP evidence, but the Run List did not feed these MCP investigations back into the overview, which could lead operators to think the AI had not investigated.

**Fix**:

- The run remediation summary in `platform_operator_service` gains an `mcp_observed` status.
- When there is no ADR-100 dry-run but legacy/self-built MCP investigation records exist, the Run List returns:
  - `status=mcp_observed`
  - `source=mcp_audit_log`
  - `evidence_total`
  - `has_mcp_investigation=true`
  - `mcp_observation_total`
  - `mcp_observation_success`
  - `mcp_observation_failed`
  - `latest_route`
- `/api/v1/platform/runs/list` supports `remediation_status=mcp_observed`.
- The frontend AwoooP Runs / Approvals views add a matching `mcp_observed` filter, status copy, evidence count, and summary card.
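
The summary fields listed above can be illustrated with a minimal sketch. It is not the actual `platform_operator_service` code: the `McpAuditRow` shape, the `summarize_mcp_only` function name, the oldest-first row ordering, and the fields returned for the `no_evidence` case are all assumptions made for illustration. It only covers the case where no ADR-100 dry-run exists.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence


@dataclass(frozen=True)
class McpAuditRow:
    """Hypothetical shape of one mcp_audit_log row (real schema not shown here)."""

    route: str
    success: bool


def summarize_mcp_only(rows: Sequence[McpAuditRow]) -> Dict[str, Any]:
    """Build a Run List remediation summary when no ADR-100 dry-run exists.

    Assumes `rows` is ordered oldest-first, so the last row is the latest.
    """
    if not rows:
        # No MCP investigation recorded: keep the pre-T50 status.
        return {"status": "no_evidence", "has_mcp_investigation": False}

    succeeded = sum(1 for row in rows if row.success)
    return {
        "status": "mcp_observed",
        "source": "mcp_audit_log",
        "evidence_total": len(rows),
        "has_mcp_investigation": True,
        "mcp_observation_total": len(rows),
        "mcp_observation_success": succeeded,
        "mcp_observation_failed": len(rows) - succeeded,
        "latest_route": rows[-1].route,
    }
```

In this sketch `mcp_observed` only states that investigation happened; it says nothing about execution, which matches the intent of not reporting a false green.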

**local verification**:

```text
py_compile:
  apps/api/src/services/platform_operator_service.py
  apps/api/src/api/v1/platform/operator_runs.py
  apps/api/tests/test_awooop_operator_timeline_labels.py
  OK

ruff --select F,E9:
  apps/api/src/services/platform_operator_service.py
  apps/api/src/api/v1/platform/operator_runs.py
  apps/api/tests/test_awooop_operator_timeline_labels.py
  OK

pytest:
  apps/api/tests/test_awooop_operator_timeline_labels.py
  16 passed

web:
  zh-TW/en JSON parse OK
  pnpm --filter @awoooi/web typecheck OK
  next lint target files OK, only existing i18n literal warnings
  NEXT_PUBLIC_API_URL=https://awoooi.wooo.work pnpm --filter @awoooi/web build OK

git diff --check OK
```

**production deploy / smoke (done)**:

```text
Code commit:
  9d02ab80 feat(awooop): surface mcp investigation evidence

Deploy marker:
  5d36638c chore(cd): deploy 9d02ab8 [skip ci]

Gitea Actions:
  2282 Code Review -> success
  2281 CD tests/build-and-deploy/post-deploy-checks -> success

K8s image:
  awoooi-api 2/2 192.168.0.110:5000/awoooi/api:9d02ab80803a0167a2195fd121e4219fffa14172
  awoooi-web 2/2 192.168.0.110:5000/awoooi/web:9d02ab80803a0167a2195fd121e4219fffa14172
  awoooi-worker 1/1 192.168.0.110:5000/awoooi/api:9d02ab80803a0167a2195fd121e4219fffa14172

health:
  GET https://awoooi.wooo.work/api/v1/health -> healthy, prod, mock_mode=false
  components: api/postgresql/redis/ollama/openclaw/signoz all up

Run List smoke:
  incident=INC-20260518-792684
  total=4
  all 4 runs:
    status=mcp_observed
    source=mcp_audit_log
    evidence_total=23
    mcp_observation_total=23
    mcp_observation_success=18
    has_mcp_investigation=true
    latest_route=pre_decision_investigator/ssh_host.ssh_diagnose/read

Filter smoke:
  GET /api/v1/platform/runs/list?...&remediation_status=mcp_observed
  total=4
```
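
The filter smoke above could be scripted along the following lines. This is a sketch under stated assumptions: the response shape (`total`, a `runs` array, and a per-run `remediation_summary` object) is hypothetical, as are the helper names. The query parameters elided as `...` in the log are not reproduced; only `remediation_status` is taken from this entry.

```python
from typing import Any, Dict
from urllib.parse import urlencode

RUNS_LIST_PATH = "/api/v1/platform/runs/list"


def runs_list_url(base_url: str, remediation_status: str) -> str:
    """Build the Run List URL filtered by remediation status."""
    query = urlencode({"remediation_status": remediation_status})
    return f"{base_url.rstrip('/')}{RUNS_LIST_PATH}?{query}"


def smoke_ok(payload: Dict[str, Any], expected_total: int) -> bool:
    """Check that every returned run reports MCP-only investigation evidence.

    Assumes a hypothetical payload of the form
    {"total": int, "runs": [{"remediation_summary": {...}}, ...]}.
    """
    runs = payload.get("runs", [])
    if payload.get("total") != expected_total or len(runs) != expected_total:
        return False
    for run in runs:
        summary = run.get("remediation_summary", {})
        if summary.get("status") != "mcp_observed":
            return False
        if not summary.get("has_mcp_investigation"):
            return False
        # Successful observations can never exceed the total observed.
        if summary.get("mcp_observation_success", 0) > summary.get(
            "mcp_observation_total", 0
        ):
            return False
    return True
```

For example, `runs_list_url("https://awoooi.wooo.work", "mcp_observed")` yields the filtered endpoint, and `smoke_ok(payload, 4)` mirrors the `total=4` expectation recorded in the smoke log.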
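
**Deployment observations / tech debt**:

- Early in the API rollout, a startup probe timing gap briefly left traffic on the old pod; the live API returned the new status only after the rollout completed.
- When the new Worker pod started, it hit one PostgreSQL deadlock on `ALTER TABLE incident_evidence ADD COLUMN IF NOT EXISTS anomaly_context`, but it still wrote its health files and completed the rollout. Startup-time DDL should be moved out of runtime startup into a migration/Ansible gate so the deployment and worker do not contend for the DDL lock at the same time.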
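
One possible shape for such a gate is sketched below: a single migration step that serializes DDL behind a PostgreSQL transaction-level advisory lock and a `lock_timeout`, run once before the workloads start. The function name, the lock key, and the injected `execute` callable are assumptions for illustration, not the project's migration tooling. The log does not state the column type for `anomaly_context`, so the real statement is not reproduced; callers pass their own statements.

```python
from typing import Callable, List, Sequence

# Application-chosen advisory lock key (hypothetical value).
DEFAULT_DDL_LOCK_KEY = 5020260518


def run_gated_ddl(
    execute: Callable[[str], None],
    statements: Sequence[str],
    lock_key: int = DEFAULT_DDL_LOCK_KEY,
    lock_timeout: str = "5s",
) -> List[str]:
    """Run DDL statements in one transaction behind an advisory lock.

    `execute` is any callable that sends one SQL string to PostgreSQL,
    for example a thin wrapper around a driver cursor.
    Returns the statements issued, in order, for auditing.
    """
    issued: List[str] = []

    def run(sql: str) -> None:
        execute(sql)
        issued.append(sql)

    run("BEGIN")
    try:
        # Fail fast instead of waiting indefinitely on a lock.
        run(f"SET LOCAL lock_timeout = '{lock_timeout}'")
        # Only one session at a time gets past this point; the lock is
        # released automatically when the transaction ends.
        run(f"SELECT pg_advisory_xact_lock({int(lock_key)})")
        for statement in statements:
            run(statement)
        run("COMMIT")
    except Exception:
        execute("ROLLBACK")
        raise
    return issued
```

Running this from a dedicated migration job rather than from every pod's startup path means the API deployment and the worker never issue the same `ALTER TABLE` concurrently.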
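
**Interpretation**:

- What T50 adds is that the frontend and API overview can now show that AI/MCP has investigated; it does not claim that Host alerts were auto-repaired successfully.
- Operators can now use `mcp_observed` in Runs/Approvals to identify runs where the AI has investigated through tools such as MCP/SignOz/Prometheus/SSH but nothing has entered executor/auto-repair yet.
- Next phase: fix the truth-chain detail/history HTTP 400, add the MCP/evidence/outbound timeline to Incident Detail, then wire up governance-alert dedupe/leader and the Ansible executor audit.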

**Current overall progress**:

- AwoooP alert observability chain: about 89%.
- Low-risk auto-repair closed loop: about 95%.
- Frontend sync for the AI automation management UI: about 86%.
- Full AI automation management productization: about 81%.