- Try load_incluster_config() first (for pods running in K8s)
- Fallback to kubeconfig file (for local development)
- Fixes "K8s connection not available" error in production
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 擴展...副本數 pattern ("scale ... replica count", a scale variant)
- Add 重新啟動 ("restart") without the 服務 ("service") suffix
- Auto-detect StatefulSet Pod names (xxx-N) for DELETE_POD
- Strip -deployment suffix from resource names
Fixes Y button execution failure when LLM generates Chinese actions
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
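
A minimal sketch of the name handling described in the commit above, assuming a StatefulSet Pod name is any name ending in a dash and a numeric ordinal (xxx-N). The helper names `is_statefulset_pod` and `strip_deployment_suffix` are hypothetical and not taken from the repository.

```python
import re

# Assumption: a StatefulSet Pod name ends with "-<ordinal>", e.g. "redis-0".
_STATEFULSET_POD = re.compile(r"^.+-\d+$")


def is_statefulset_pod(name: str) -> bool:
    """Return True if the name looks like a StatefulSet Pod (xxx-N)."""
    return _STATEFULSET_POD.match(name) is not None


def strip_deployment_suffix(name: str) -> str:
    """Drop a trailing "-deployment" so "api-deployment" resolves to "api"."""
    suffix = "-deployment"
    if name.endswith(suffix):
        return name[: -len(suffix)]
    return name
```

A caller could use `is_statefulset_pod` to choose DELETE_POD over a Deployment restart when the LLM names a Pod directly.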
Backend:
- Add DecisionManager with state machine (INIT→ANALYZING→READY→EXECUTING)
- Implement Expert System rules engine (runs entirely locally, so it does not depend on LLM availability)
- Dual-engine: LLM (primary) + Expert System (fallback)
- Auto-generate decision_token for each incident
- 30-second timeout guarantee
Frontend:
- Use decision.state to unlock [Y/n] buttons
- Display AI action suggestion in card
- Show source indicator [AI] or [EXP]
- Generate proposal on-demand if needed
Fixes: UI locked with hourglass when LLM times out
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
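
The dual-engine flow in the commit above might look roughly like the sketch below: try the LLM under a timeout, and fall back to a local rule when it times out or raises. The function names, the incident fields, and the single rule are illustrative assumptions; the real DecisionManager is not shown in this log.

```python
import asyncio
from enum import Enum


class DecisionState(Enum):
    INIT = "INIT"
    ANALYZING = "ANALYZING"
    READY = "READY"
    EXECUTING = "EXECUTING"


def expert_system_proposal(incident: dict) -> dict:
    """Hypothetical local rule; involves no network calls."""
    if incident.get("reason") == "CrashLoopBackOff":
        action = "RESTART"
    else:
        action = "INVESTIGATE"
    return {"action": action, "source": "EXP"}


async def decide(incident: dict, llm_call, timeout_s: float = 30.0) -> dict:
    """Return a READY decision, using the LLM first and rules as fallback."""
    try:
        proposal = await asyncio.wait_for(llm_call(incident), timeout=timeout_s)
        proposal = {**proposal, "source": "AI"}
    except Exception:
        # Covers asyncio.TimeoutError as well as LLM client errors.
        proposal = expert_system_proposal(incident)
    return {"state": DecisionState.READY, **proposal}
```

Because both branches end in READY, the frontend can unlock the [Y/n] buttons on `decision.state` regardless of which engine answered.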
ActionExecutor enhancements:
- Add execute_kubectl_command() using asyncio.create_subprocess_shell
- Security: Only kubectl commands allowed, forbidden patterns blocked
- Shadow Mode: Simulate execution without actual kubectl calls
- Capture stdout/stderr with PIPE, handle timeout gracefully
New execute_approved_proposal() function:
- Background task entry point for approved proposals
- Read approval from Redis/DB, verify status='approved'
- Extract kubectl_command from metadata
- Execute via execute_kubectl_command()
- Update status to 'executed' or 'failed' with execution_log
Security guardrails:
- Forbid delete namespace/ns, rm -rf, drop database
- Forbid batch deletion patterns
- 60 second default timeout
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
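
A sketch of the guardrail check described in the commit above, covering validation only and running nothing. The regexes are an assumed reading of "forbidden patterns" and "batch deletion", and the rejection of shell metacharacters is an extra assumption added here because the command is passed to a shell; `validate_kubectl_command` is a hypothetical name.

```python
import re
import shlex
from typing import Tuple

# Assumed patterns, based on the guardrails listed in the commit message.
FORBIDDEN_PATTERNS = [
    re.compile(r"\bdelete\s+(namespaces?|ns)\b", re.IGNORECASE),
    re.compile(r"\brm\s+-rf\b", re.IGNORECASE),
    re.compile(r"\bdrop\s+database\b", re.IGNORECASE),
    re.compile(r"\bdelete\b.*\s--all\b", re.IGNORECASE),  # batch deletion
]

# Assumption: reject characters that let a shell chain or redirect commands.
SHELL_METACHARS = re.compile(r"[;&|`$<>]")


def validate_kubectl_command(command: str) -> Tuple[bool, str]:
    """Return (allowed, reason) without executing anything."""
    if SHELL_METACHARS.search(command):
        return False, "shell metacharacters are not allowed"
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False, "command could not be parsed"
    if not tokens or tokens[0] != "kubectl":
        return False, "only kubectl commands are allowed"
    for pattern in FORBIDDEN_PATTERNS:
        if pattern.search(command):
            return False, "forbidden pattern: " + pattern.pattern
    return True, "ok"
```

Running this check before the subprocess call, and also in Shadow Mode, keeps simulated and real executions subject to the same rules.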
Remove lewooogo-brain local dependency that breaks Docker context.
Inline Proposal/Guardrails definitions in proposals.py mock.
Phase 6.4i will address proper monorepo Docker packaging.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: the Worker used SQLITE_DATABASE_URL, causing "no such table: incidents"
because each Pod had its own isolated SQLite file instead of the shared PostgreSQL database.
Fixes:
- db/base.py: Use DATABASE_URL (PostgreSQL) instead of SQLITE_DATABASE_URL
- Added SQLite prohibition guard with logging
- Added pool_size and pool_pre_ping for production stability
New: packages/lewooogo-data PgMemoryProvider (Phase 6.4d)
- Episodic Memory implementation for PostgreSQL
- init_pg_engine() with auto table creation
- SQLite forbidden by Commander's decree
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
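
The SQLite prohibition guard from the commit above could be approximated as follows. This is a sketch that inspects only the URL scheme; the function name `require_postgres_url` and the example URLs are assumptions, not the contents of db/base.py.

```python
import logging

logger = logging.getLogger(__name__)


def require_postgres_url(url: str) -> str:
    """Return the URL unchanged if it targets PostgreSQL, else raise."""
    # "postgresql+asyncpg://..." -> dialect "postgresql"
    scheme = url.split("://", 1)[0].lower()
    dialect = scheme.split("+", 1)[0]
    if dialect == "sqlite":
        logger.error("SQLite is prohibited; each Pod would get its own file")
        raise ValueError("SQLite database URLs are not allowed")
    if dialect not in ("postgresql", "postgres"):
        raise ValueError("unsupported database dialect: " + dialect)
    return url
```

Calling the guard once at startup makes a misconfigured Worker fail immediately instead of later reporting a missing table.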
Root cause: Worker shared Redis pool with API (socket_timeout=5s),
but XREADGROUP blocks for 5s causing timeout errors every cycle.
Fix:
- Add init_worker_redis_pool() with socket_timeout=None
- Worker now uses get_worker_redis() for XREADGROUP operations
- API continues using get_redis() with short timeout
Also destroyed 50 zombie consumers via:
XGROUP DESTROY stream:awoooi_signals awoooi_workers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
signal_worker.py was importing the non-existent init_redis/close_redis.
The correct names are init_redis_pool/close_redis_pool.
Root cause of:
- No Telegram alerts for 7+ hours
- No new approval cards
- No incident processing
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The optimization_suggestions field is list[dict], not list[object].
Use .get() to access dict keys instead of attribute access.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
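
To illustrate the fix above: when each suggestion is a plain dict, attribute access such as `s.title` raises AttributeError, while `.get()` also tolerates missing keys. The keys `title` and `impact` below are hypothetical; the commit does not list the actual fields.

```python
from typing import Dict, List


def summarize(suggestions: List[Dict]) -> List[str]:
    """Format each suggestion dict, with defaults for missing keys."""
    return [
        "{} [{}]".format(s.get("title", "(untitled)"), s.get("impact", "unknown"))
        for s in suggestions
    ]


optimization_suggestions = [
    {"title": "Raise memory limit", "impact": "high"},
    {"title": "Add HPA"},  # no "impact" key
]
lines = summarize(optimization_suggestions)
```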
The toDateTime64(nanoseconds, 9) caused Decimal overflow.
Switched to simpler `now() - INTERVAL X MINUTE` syntax.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The SigNoz trace data is stored in distributed_signoz_index_v3,
not v2. This fixes GlobalPulse showing all zeros.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause analysis:
1. OTEL gRPC endpoint had http:// prefix which is invalid for gRPC
2. SigNoz query was targeting the wrong table (signoz_metrics.distributed_samples_v4)
3. Should query signoz_traces.distributed_signoz_index_v2 for trace data
Fixes:
- Remove http:// prefix from OTEL_EXPORTER_OTLP_ENDPOINT (gRPC needs host:port)
- Update SigNoz client to query traces table instead of metrics table
- Fix timestamp format (nanoseconds for DateTime64(9))
- statusCode: 0=Unset, 1=Ok, 2=Error
This should enable OTEL traces to reach SigNoz and GlobalPulse to show real metrics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
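
The endpoint fix in the commit above amounts to reducing the configured value to host:port. A small sketch, assuming only the `http://` prefix needs removing; the function name and the collector hostname in the usage line are illustrative.

```python
def normalize_grpc_endpoint(endpoint: str) -> str:
    """Strip a leading "http://" and any trailing slash, leaving host:port."""
    prefix = "http://"
    if endpoint.lower().startswith(prefix):
        endpoint = endpoint[len(prefix):]
    return endpoint.rstrip("/")


# Hypothetical value for OTEL_EXPORTER_OTLP_ENDPOINT before the fix.
fixed = normalize_grpc_endpoint("http://signoz-otel-collector:4317/")
```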
- SigNoz OTEL Collector maps container:4317 to host:24317
- Updated NetworkPolicy egress to allow 24317/24318
- Updated ConfigMap with correct OTEL_EXPORTER_OTLP_ENDPOINT
- Fixed OpenClaw port from 8089 to 8088
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pydantic v2 does not allow field names with leading underscores.
Changed from @property pattern to method pattern.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>