# Monitoring

## Health Checks
Both services expose health and readiness endpoints.
### Controller

| Endpoint | Method | What it checks |
|---|---|---|
| `/health/` | GET | Database connectivity |
| `/ready/` | GET | Database + Redis + MinIO |

```bash
curl -s http://localhost:8000/health/ | python -m json.tool
curl -s http://localhost:8000/ready/ | python -m json.tool
```
### Dispatcher

| Endpoint | Method | What it checks |
|---|---|---|
| `/health` | GET | Service alive |
| `/ready` | GET | Database + Redis connectivity |
### Load Balancer Configuration

Use `/health` (Dispatcher) or `/health/` (Controller) for liveness probes. Use `/ready` or `/ready/` for readiness probes -- these check downstream dependencies and will fail if Postgres or Redis is unreachable.
```yaml
# Kubernetes liveness/readiness probes
livenessProbe:
  httpGet:
    path: /health/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready/
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
```
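For local docker-compose setups, where Kubernetes probes don't apply, a comparable healthcheck can be sketched. This is an illustrative fragment, not the project's actual compose file; the `controller` service name and the timings are assumptions:

```yaml
# docker-compose healthcheck sketch; "controller" service name is an assumption
services:
  controller:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/"]
      interval: 10s
      timeout: 3s
      retries: 3
```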
## Dashboards

### Temporal UI
Access at http://localhost:8088 in local dev. Provides:
- Workflow execution history and status
- Signal delivery tracking
- Activity task queue depth
- Worker availability
### MinIO Console
Access at http://localhost:9001 (dev credentials: minioadmin / minioadmin). Shows:
- Brief and skill storage usage
- Object lifecycle and versioning
- Bucket access patterns
### Controller Admin
Django admin at http://localhost:8000/admin provides:
- Task list with status filtering
- Dispatcher instance status and last heartbeat
- Workflow instance status and stage progress
- Skill and config version history
### Dispatcher Operator UI
The Dispatcher serves an HTMX-based operator UI at http://localhost:8080. Features:
- Agent gallery -- running agents as tiles with live status
- noVNC links for visual agent tiers (terminal, browser, desktop)
- Task telemetry -- duration, tokens used, exit codes
## Key Metrics to Track

### Controller
| Metric | Source | Alert threshold |
|---|---|---|
| Task dispatch latency | Application logs | > 5s |
| Brief assembly time | Celery task duration | > 10s |
| Failed dispatches | Task status counts | > 5% failure rate |
| Pending workflow instances | Temporal UI | Growing unbounded |
| Database connection pool | PgBouncer stats | > 80% utilization |
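The failure-rate check in the table above can be sketched as a small function over task status counts. A minimal sketch, assuming hypothetical status strings (`failed`, `dispatch_error`) that would need to be mapped to the Controller's actual task model:

```python
from collections import Counter

# Hypothetical terminal states; map these to the Controller's real task statuses.
FAILED_STATES = {"failed", "dispatch_error"}

def dispatch_failure_rate(statuses):
    """Fraction of sampled tasks that ended in a failed state."""
    if not statuses:
        return 0.0
    counts = Counter(statuses)
    return sum(counts[s] for s in FAILED_STATES) / len(statuses)

# 2 failures out of 20 tasks -> 10%, above the 5% alert threshold.
sample = ["succeeded"] * 18 + ["failed"] * 2
assert dispatch_failure_rate(sample) > 0.05
```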
### Dispatcher
| Metric | Source | Alert threshold |
|---|---|---|
| Queue depth | Redis `LLEN` or stream `XLEN` | Growing unbounded |
| Active tasks | Redis task state | Approaching concurrency limit |
| Container spawn failures | Application logs | Any |
| Image pull latency | Application logs | > 60s (cold pull) |
| Monitor loop duration | Application logs | > configured interval |
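The "growing unbounded" condition on queue depth can be made concrete as a trend check over periodic `LLEN`/`XLEN` samples. A minimal sketch in pure Python (the Redis sampling itself is left out):

```python
def queue_is_growing(depths, min_points=5):
    """True if depth rose at every successive sample in the window,
    i.e. no consumer progress was observed."""
    if len(depths) < min_points:
        return False
    return all(b > a for a, b in zip(depths, depths[1:]))

assert queue_is_growing([3, 7, 12, 20, 31]) is True
assert queue_is_growing([3, 7, 5, 9, 11]) is False  # depth dipped: consumers made progress
```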
## Log Aggregation

### Application Logs

Both services log to stdout in JSON format in production. Route to your log aggregation system:

- Container stdout is captured automatically by ECS/EKS log drivers.
- Use Promtail to scrape container logs, labeled by service name.
- Use the Datadog agent with Docker or K8s autodiscovery.
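For reference, one-JSON-object-per-line stdout logging can be produced with a small `logging.Formatter`. This is an illustrative sketch, not the services' actual logging configuration; the field names and logger name are assumptions:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # aggregators scrape container stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("kohakku")           # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("task dispatched")
```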
### Agent Logs
Agent logs follow two paths:
- Packaged session logs -- Agent writes a session log, packages it on completion, ships it back as part of the result payload. The Dispatcher stores the reference.
- Cluster-level log forwarding -- Container stdout piped to the cluster's logging facility via runtime-level config. Infrastructure concern, not a Kohakku concern.
## Alerting

**Alert philosophy:** All alerts must require human interaction. If an alert doesn't need a human response, it's noise, not signal -- turn it into a dashboard metric instead.
### Critical Alerts

- Controller `/ready/` returns non-200 for > 2 minutes
- Dispatcher `/ready` returns non-200 for > 2 minutes
- Task failure rate exceeds threshold in a rolling window
- Queue depth growing with no consumer progress
- Temporal worker count drops to zero
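The "non-200 for > 2 minutes" rule needs debouncing so a single failed probe doesn't page anyone. A minimal sketch, assuming probe results are recorded as `(timestamp, status_code)` pairs:

```python
def should_alert(probes, now, window=120):
    """Alert only if every probe inside the trailing `window` seconds
    (2 minutes by default) returned non-200."""
    recent = [code for t, code in probes if now - t <= window]
    return bool(recent) and all(code != 200 for code in recent)

# Probes every 30s; the endpoint started failing at t=30.
history = [(0, 200), (30, 503), (60, 503), (90, 503), (120, 503), (150, 503)]
assert should_alert(history, now=150) is True   # the whole 2-minute window failed
assert should_alert(history, now=100) is False  # window still contains the 200 at t=0
```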
### Warning Alerts
- Brief assembly p95 latency exceeds 10 seconds
- Dispatcher heartbeat missed (stale dispatcher in registry)
- Object storage usage approaching quota
- Database connection pool > 80% utilized