# Monitoring

## Health Checks
Both services expose health and readiness endpoints.
### Controller

| Endpoint | Method | What it checks |
|---|---|---|
| `/health/` | GET | Database connectivity |
| `/ready/` | GET | Database + Redis + MinIO |

```bash
curl -s http://localhost:8000/health/ | python -m json.tool
curl -s http://localhost:8000/ready/ | python -m json.tool
```
### Dispatcher

| Endpoint | Method | What it checks |
|---|---|---|
| `/health` | GET | Service alive |
| `/ready` | GET | Database + Redis connectivity |
### Load Balancer Configuration

Use `/health` (Dispatcher) or `/health/` (Controller) for liveness probes. Use `/ready` or `/ready/` for readiness probes -- these check downstream dependencies and will fail if Postgres or Redis is unreachable.
```yaml
# Kubernetes liveness/readiness probes
livenessProbe:
  httpGet:
    path: /health/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready/
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
```
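For local docker-compose setups, where Kubernetes probes don't apply, a comparable healthcheck can be sketched. This is an illustrative fragment, not the project's actual compose file; the `controller` service name and the timings are assumptions:

```yaml
# docker-compose healthcheck sketch; "controller" service name is an assumption
services:
  controller:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/"]
      interval: 10s
      timeout: 3s
      retries: 3
```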
## Dashboards

### Temporal UI
Access at http://localhost:8088 in local dev. Provides:
- Workflow execution history and status
- Signal delivery tracking
- Activity task queue depth
- Worker availability
### MinIO Console
Access at http://localhost:9001 (dev credentials: minioadmin / minioadmin). Shows:
- Brief and skill storage usage
- Object lifecycle and versioning
- Bucket access patterns
### Controller Admin
Django admin at http://localhost:8000/admin provides:
- Task list with status filtering
- Dispatcher instance status and last heartbeat
- Workflow instance status and stage progress
- Skill and config version history
### Dispatcher Operator UI
The Dispatcher serves an HTMX-based operator UI at http://localhost:8080. Features:
- Agent gallery -- running agents as tiles with live status
- noVNC links for visual agent tiers (terminal, browser, desktop)
- Task telemetry -- duration, tokens used, exit codes
## Key Metrics to Track

### Controller
| Metric | Source | Alert threshold |
|---|---|---|
| Task dispatch latency | Application logs | > 5s |
| Brief assembly time | Celery task duration | > 10s |
| Failed dispatches | Task status counts | > 5% failure rate |
| Pending workflow instances | Temporal UI | Growing unbounded |
| Database connection pool | PgBouncer stats | > 80% utilization |
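The failure-rate check in the table above can be sketched as a small function over task status counts. A minimal sketch, assuming hypothetical status strings (`failed`, `dispatch_error`) that would need to be mapped to the Controller's actual task model:

```python
from collections import Counter

# Hypothetical terminal states; map these to the Controller's real task statuses.
FAILED_STATES = {"failed", "dispatch_error"}

def dispatch_failure_rate(statuses):
    """Fraction of sampled tasks that ended in a failed state."""
    if not statuses:
        return 0.0
    counts = Counter(statuses)
    return sum(counts[s] for s in FAILED_STATES) / len(statuses)

# 2 failures out of 20 tasks -> 10%, above the 5% alert threshold.
sample = ["succeeded"] * 18 + ["failed"] * 2
assert dispatch_failure_rate(sample) > 0.05
```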
### Dispatcher
| Metric | Source | Alert threshold |
|---|---|---|
| Queue depth | Redis `LLEN` or stream `XLEN` | Growing unbounded |
| Active tasks | Redis task state | Approaching concurrency limit |
| Container spawn failures | Application logs | Any |
| Image pull latency | Application logs | > 60s (cold pull) |
| Monitor loop duration | Application logs | > configured interval |
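The "growing unbounded" condition on queue depth can be made concrete as a trend check over periodic `LLEN`/`XLEN` samples. A minimal sketch in pure Python (the Redis sampling itself is left out):

```python
def queue_is_growing(depths, min_points=5):
    """True if depth rose at every successive sample in the window,
    i.e. no consumer progress was observed."""
    if len(depths) < min_points:
        return False
    return all(b > a for a, b in zip(depths, depths[1:]))

assert queue_is_growing([3, 7, 12, 20, 31]) is True
assert queue_is_growing([3, 7, 5, 9, 11]) is False  # depth dipped: consumers made progress
```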
## Log Aggregation

### Application Logs

Both services log to stdout in JSON format in production. Route to your log aggregation system:

- Container stdout is captured automatically by ECS/EKS log drivers.
- Use Promtail to scrape container logs, labeled by service name.
- Use the Datadog agent with Docker or K8s autodiscovery.
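For reference, one-JSON-object-per-line stdout logging can be produced with a small `logging.Formatter`. This is an illustrative sketch, not the services' actual logging configuration; the field names and logger name are assumptions:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # aggregators scrape container stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("kohakku")           # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("task dispatched")
```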
### Agent Logs
Agent logs follow two paths:
- Packaged session logs -- Agent writes a session log, packages it on completion, ships it back as part of the result payload. The Dispatcher stores the reference.
- Cluster-level log forwarding -- Container stdout piped to the cluster's logging facility via runtime-level config. Infrastructure concern, not a Kohakku concern.
## Alerting

**Alert philosophy:** All alerts must require human interaction. If an alert doesn't need a human response, it's noise, not signal -- turn it into a dashboard metric instead.
### Critical Alerts

- Controller `/ready/` returns non-200 for > 2 minutes
- Dispatcher `/ready` returns non-200 for > 2 minutes
- Task failure rate exceeds threshold in a rolling window
- Queue depth growing with no consumer progress
- Temporal worker count drops to zero
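The "non-200 for > 2 minutes" rule needs debouncing so a single failed probe doesn't page anyone. A minimal sketch, assuming probe results are recorded as `(timestamp, status_code)` pairs:

```python
def should_alert(probes, now, window=120):
    """Alert only if every probe inside the trailing `window` seconds
    (2 minutes by default) returned non-200."""
    recent = [code for t, code in probes if now - t <= window]
    return bool(recent) and all(code != 200 for code in recent)

# Probes every 30s; the endpoint started failing at t=30.
history = [(0, 200), (30, 503), (60, 503), (90, 503), (120, 503), (150, 503)]
assert should_alert(history, now=150) is True   # the whole 2-minute window failed
assert should_alert(history, now=100) is False  # window still contains the 200 at t=0
```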
### Warning Alerts
- Brief assembly p95 latency exceeds 10 seconds
- Dispatcher heartbeat missed (stale dispatcher in registry)
- Object storage usage approaching quota
- Database connection pool > 80% utilized