Root cause: mapToDualState() omitted the decision field,
so the [Y/n] buttons stayed permanently disabled.
It now passes incident.decision to the card component.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Backend:
- Add DecisionManager with state machine (INIT→ANALYZING→READY→EXECUTING)
- Implement Expert System rules engine (100% local, never fails)
- Dual-engine: LLM (primary) + Expert System (fallback)
- Auto-generate decision_token for each incident
- 30-second timeout guarantee
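
The dual-engine fallback with a timeout guarantee can be sketched roughly as
follows. This is a minimal illustration, not the DecisionManager code itself:
llm_propose, expert_propose, the Decision dataclass, and the rule contents are
all hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Decision:
    action: str
    source: str  # "AI" for the LLM path, "EXP" for the Expert System path


async def llm_propose(incident: dict) -> Decision:
    # Hypothetical stand-in for the LLM call; sleeps to simulate latency.
    await asyncio.sleep(incident.get("llm_latency", 0.0))
    return Decision(action="restart deployment", source="AI")


def expert_propose(incident: dict) -> Decision:
    # Hypothetical local rule: runs in-process with no network I/O.
    if incident.get("kind") == "CrashLoopBackOff":
        return Decision(action="rollback deployment", source="EXP")
    return Decision(action="escalate to on-call", source="EXP")


async def decide(incident: dict, timeout: float = 30.0) -> Decision:
    """Try the LLM first; fall back to local rules on timeout or error."""
    try:
        return await asyncio.wait_for(llm_propose(incident), timeout=timeout)
    except Exception:
        # Covers asyncio.TimeoutError as well as any LLM client failure.
        return expert_propose(incident)
```

Because the fallback is synchronous and local, decide() returns within roughly
`timeout` seconds either way, which is what lets the UI leave the hourglass state.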
Frontend:
- Use decision.state to unlock [Y/n] buttons
- Display AI action suggestion in card
- Show source indicator [AI] or [EXP]
- Generate proposal on-demand if needed
Fixes: UI locked with hourglass when LLM times out
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add active:scale-95 active:bg-neutral-800 for physical click feedback
- Add disabled:opacity-30 for clearer disabled state
- Add tooltip "大腦分析中..." ("Brain analyzing...") when proposalId is missing
- Add comprehensive console.log diagnostics for authorization flow
- Add reason parameter "Authorized via WarRoom" for audit trail
- Implement optimistic UI with immediate loading state transition
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ActionExecutor enhancements:
- Add execute_kubectl_command() using asyncio.create_subprocess_shell
- Security: Only kubectl commands allowed, forbidden patterns blocked
- Shadow Mode: Simulate execution without actual kubectl calls
- Capture stdout/stderr with PIPE, handle timeout gracefully
New execute_approved_proposal() function:
- Background task entry point for approved proposals
- Read approval from Redis/DB, verify status='approved'
- Extract kubectl_command from metadata
- Execute via execute_kubectl_command()
- Update status to 'executed' or 'failed' with execution_log
Security guardrails:
- Forbid delete namespace/ns, rm -rf, drop database
- Forbid batch deletion patterns
- 60 second default timeout
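
A rough sketch of how these guardrails might fit together. The regular
expressions, the validate_command/run_kubectl names, and the returned dict
shape are assumptions made for illustration; only the use of
asyncio.create_subprocess_shell, PIPE capture, Shadow Mode, and the 60 second
default come from the notes above.

```python
import asyncio
import re
import shlex

# Assumed patterns approximating the forbidden operations listed above.
FORBIDDEN_PATTERNS = [
    re.compile(r"\bdelete\s+(namespace|ns)\b", re.IGNORECASE),
    re.compile(r"\brm\s+-rf\b", re.IGNORECASE),
    re.compile(r"\bdrop\s+database\b", re.IGNORECASE),
    re.compile(r"\bdelete\b.*\s--all\b", re.IGNORECASE),  # batch deletion
]


def validate_command(command: str) -> None:
    """Raise ValueError unless this is a kubectl command with no forbidden pattern."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "kubectl":
        raise ValueError("only kubectl commands are allowed")
    for pattern in FORBIDDEN_PATTERNS:
        if pattern.search(command):
            raise ValueError(f"forbidden pattern: {pattern.pattern}")


async def run_kubectl(command: str, shadow: bool = True, timeout: float = 60.0) -> dict:
    validate_command(command)
    if shadow:
        # Shadow Mode: report what would run without calling kubectl.
        return {"status": "simulated", "stdout": "", "stderr": ""}
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return {"status": "failed", "stdout": "", "stderr": "timeout"}
    status = "executed" if proc.returncode == 0 else "failed"
    return {"status": status, "stdout": out.decode(), "stderr": err.decode()}
```

A deny-list cannot catch every shell metacharacter trick, so passing
shlex.split(command) to asyncio.create_subprocess_exec is a possible hardening
step that avoids shell interpretation altogether.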
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
DualStateIncidentCard:
- Add proposalId prop for approval actions
- Add onApprovalChange callback for status updates
- Implement handleApprove() calling POST /api/v1/approvals/{id}/sign
- Implement handleReject() calling POST /api/v1/approvals/{id}/reject
- Add ButtonState management (idle/loading/approved/rejected/error)
- Loading spinner during API call
- Success state: green "已授權" ("Authorized") / red "已拒絕" ("Rejected")
- Error state: orange "錯誤" ("Error") with auto-recovery
API Client:
- Fix endpoint mismatch: rename approveApproval to signApproval
- Use correct endpoint /sign instead of /approve
- Add signer parameter for multi-sig support
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove lewooogo-brain local dependency that breaks Docker context.
Inline Proposal/Guardrails definitions in proposals.py mock.
Phase 6.4i will address proper monorepo Docker packaging.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: Worker used SQLITE_DATABASE_URL, causing "no such table: incidents"
because each Pod had its own isolated SQLite file instead of the shared PostgreSQL.
Fixes:
- db/base.py: Use DATABASE_URL (PostgreSQL) instead of SQLITE_DATABASE_URL
- Added SQLite prohibition guard with logging
- Added pool_size and pool_pre_ping for production stability
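
The SQLite prohibition guard could look something like the sketch below. The
function name require_postgres_url and the error messages are hypothetical;
the actual check in db/base.py may differ.

```python
import logging

logger = logging.getLogger("db.base")


def require_postgres_url(url: str) -> str:
    """Return url unchanged if it targets PostgreSQL; refuse anything else."""
    # Inspect only the scheme so credentials never end up in the log.
    scheme = url.split("://", 1)[0].lower()
    if scheme.startswith("sqlite"):
        logger.error("SQLite database URL rejected (scheme=%s)", scheme)
        raise RuntimeError("SQLite is forbidden; set DATABASE_URL to PostgreSQL")
    if not scheme.startswith("postgresql"):
        raise RuntimeError(f"unsupported database scheme: {scheme}")
    return url
```

Failing at startup makes a misconfigured Pod crash visibly instead of silently
writing to a private file that no other Pod can see.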
New: packages/lewooogo-data PgMemoryProvider (Phase 6.4d)
- Episodic Memory implementation for PostgreSQL
- init_pg_engine() with auto table creation
- SQLite forbidden by Commander's decree
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: Worker shared Redis pool with API (socket_timeout=5s),
but XREADGROUP blocks for 5s causing timeout errors every cycle.
Fix:
- Add init_worker_redis_pool() with socket_timeout=None
- Worker now uses get_worker_redis() for XREADGROUP operations
- API continues using get_redis() with short timeout
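
To illustrate why the two pools need different settings, here is a small
sketch. The dict names and the helper are hypothetical; only the
socket_timeout values (5s for the API, None for the worker) and the 5s block
duration come from the notes above.

```python
from typing import Optional

# Assumed shape of the keyword arguments each pool is created with.
API_POOL_KWARGS = {"socket_timeout": 5.0}      # short timeout for request/response calls
WORKER_POOL_KWARGS = {"socket_timeout": None}  # no timeout for blocking stream reads


def read_can_time_out(socket_timeout: Optional[float], block_ms: int) -> bool:
    """True if a read that blocks for block_ms can hit the socket timeout first."""
    if socket_timeout is None:
        return False
    return block_ms / 1000.0 >= socket_timeout
```

With a 5000 ms block on the shared pool the check is true, which matches the
timeout error seen on every idle cycle; with the worker pool it is false.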
Also destroyed 50 zombie consumers via:
XGROUP DESTROY stream:awoooi_signals awoooi_workers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New packages:
- packages/lewooogo-brain: AI reasoning & decision engine
  - IProposalEngine interface (ABC)
  - IIncidentProcessor interface (ABC)
  - Pydantic models: Proposal, Guardrails, Incident, Signal
- packages/lewooogo-data: Memory provider abstraction
  - IMemoryProvider interface (ABC)
  - IDualMemoryProvider for Working + Episodic memory
  - Generic type support for flexible data models
Documentation:
- ADR-008: Python modular packages architecture decision
- ARCHITECTURE_MEMORY.md: Module map index for AI developers
- LOGBOOK.md: Updated milestones and Phase 6.4 status
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New sections in 05-awoooi-sre-qa.md:
- Worker CrashLoopBackOff diagnosis procedure
- Telegram alert system health check
- Frontend race condition diagnosis (Polling vs API)
- Import name mismatch detection pattern
Lessons learned from:
- 7+ hour outage due to undetected worker crash
- Approval card flicker due to Zustand polling race condition
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
signal_worker.py was importing the non-existent init_redis/close_redis.
The correct names are init_redis_pool/close_redis_pool.
Root cause of:
- No Telegram alerts for 7+ hours
- No new approval cards
- No incident processing
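
One way to catch this class of mismatch before deploy is a small check that
the names a worker expects actually exist in the target module. This is a
hypothetical helper, not something the commit added, and it is demonstrated on
a stdlib module since the project's Redis module is not shown here.

```python
import importlib
from typing import Iterable, List


def missing_names(module_name: str, names: Iterable[str]) -> List[str]:
    """Return the names that module_name does not define."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Example against the stdlib: "init_redis" is reported, "loads" is not.
print(missing_names("json", ["loads", "init_redis"]))
```

Run in CI or at container start, a non-empty result would fail fast instead of
leaving the worker in a crash loop for hours.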
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Race condition between polling (5s interval) and sign/reject operations
caused cards to flicker and reappear after being approved.
Fix:
- Pause polling during sign/reject API calls
- Resume polling after 1 second delay to allow backend state sync
- Apply same pattern to both signApproval and rejectApproval
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The optimization_suggestions field is list[dict], not list[object].
Use .get() to access dict keys instead of attribute access.
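
A minimal illustration of the difference, using made-up suggestion entries
(the "title" and "impact" keys are assumptions, not the real schema):

```python
def summarize(suggestions: list[dict]) -> list[str]:
    # Each item is a plain dict, so s.title would raise AttributeError.
    # .get() also tolerates entries that omit a key.
    return [
        f"{s.get('title', 'untitled')} ({s.get('impact', 'unknown')})"
        for s in suggestions
    ]


optimization_suggestions = [
    {"title": "Raise memory limit", "impact": "high"},
    {"title": "Add readiness probe"},
]
print(summarize(optimization_suggestions))
```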
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable Signal Worker to process Redis Streams signals
and trigger Incident Engine for alert aggregation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The toDateTime64(nanoseconds, 9) caused Decimal overflow.
Switched to simpler `now() - INTERVAL X MINUTE` syntax.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The SigNoz trace data is stored in distributed_signoz_index_v3,
not v2. This fixes GlobalPulse showing all zeros.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause analysis:
1. OTEL gRPC endpoint had http:// prefix which is invalid for gRPC
2. SigNoz query was targeting wrong table (signoz_metrics.distributed_samples_v4)
3. Should query signoz_traces.distributed_signoz_index_v2 for trace data
Fixes:
- Remove http:// prefix from OTEL_EXPORTER_OTLP_ENDPOINT (gRPC needs host:port)
- Update SigNoz client to query traces table instead of metrics table
- Fix timestamp format (nanoseconds for DateTime64(9))
- statusCode: 0=Unset, 1=Ok, 2=Error
This should enable OTEL traces to reach SigNoz and GlobalPulse to show real metrics.
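
As a sketch of the endpoint fix, a helper like the one below reduces a URL to
the host:port form described above. The function name and the example host are
hypothetical; the status code names are taken from the list in this commit.

```python
from urllib.parse import urlsplit

# statusCode values as listed above.
STATUS_CODES = {0: "Unset", 1: "Ok", 2: "Error"}


def to_grpc_target(endpoint: str) -> str:
    """Strip any URL scheme and path, leaving host:port."""
    if "://" in endpoint:
        return urlsplit(endpoint).netloc
    return endpoint


print(to_grpc_target("http://otel-collector.example:4317"))
```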
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace flat text format with structured HTML layout
- Add emoji section headers and visual separators
- Replace raw URLs with Inline Keyboard buttons
  - Success: "查看部署紀錄" ("View deployment log") + "開啟正式站" ("Open production site") buttons
  - Failure: Only "查看部署紀錄" ("View deployment log") button
- Use JSON payload for proper Telegram API formatting
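
The resulting payload might be built roughly like this. The function name,
arguments, and message text are assumptions for illustration; chat_id, text,
parse_mode, and reply_markup.inline_keyboard are standard sendMessage fields
in the Telegram Bot API.

```python
import html
import json


def build_deploy_message(chat_id: int, success: bool, log_url: str, site_url: str) -> dict:
    """Build a sendMessage payload with URL buttons instead of raw links."""
    status = "Deploy succeeded" if success else "Deploy failed"
    buttons = [{"text": "查看部署紀錄", "url": log_url}]
    if success:
        # The production-site button is only offered after a successful deploy.
        buttons.append({"text": "開啟正式站", "url": site_url})
    return {
        "chat_id": chat_id,
        "text": f"<b>{html.escape(status)}</b>",
        "parse_mode": "HTML",
        "reply_markup": {"inline_keyboard": [buttons]},
    }


payload = build_deploy_message(123, True, "https://example.com/log", "https://example.com")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Sending the payload as a JSON body keeps the nested reply_markup object
intact, which is awkward to express as form-encoded fields.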
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- SigNoz OTEL Collector maps container:4317 to host:24317
- Updated NetworkPolicy egress to allow 24317/24318
- Updated ConfigMap with correct OTEL_EXPORTER_OTLP_ENDPOINT
- Fixed OpenClaw port from 8089 to 8088
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pydantic v2 does not allow field names with leading underscores.
Changed from @property pattern to method pattern.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>