Backup & Recovery

Postgres

Automated Backups

# Daily backup via cron (add to crontab; cron does not support backslash
# line continuations, so each entry must stay on a single line)
0 2 * * * pg_dump -Fc -h localhost -U dbadmin kohakku-controller > /backups/controller-$(date +\%Y\%m\%d).dump
0 2 * * * pg_dump -Fc -h localhost -U postgres kohakku-dispatcher > /backups/dispatcher-$(date +\%Y\%m\%d).dump
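
The cron entries above write one dump per day and never remove old ones, so /backups grows without bound. A minimal sketch of a pruning step, assuming a 14-day retention window; the function name prune_old_backups and the 14-day value are illustrative and not part of the original setup:

```shell
# Sketch: delete .dump files older than a retention window.
# prune_old_backups is a hypothetical helper name.
prune_old_backups() {
  dir=$1
  days=$2
  # -mtime +N matches files whose age, rounded down to whole days,
  # is greater than N. Only regular files ending in .dump are removed.
  find "$dir" -type f -name '*.dump' -mtime +"$days" -exec rm -f {} +
}

# Example call (assumed values): prune_old_backups /backups 14
```

This could run from the same crontab shortly after the dumps finish. Other files in the directory are left untouched.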

Point-in-Time Recovery

For managed databases (RDS, Cloud SQL):

  • Set backup retention to 7-30 days
  • Schedule the backup window during low-traffic hours
  • Enable WAL archiving for PITR

Manual Backup and Restore

# Backup
pg_dump -Fc -h $POSTGRES_HOST -U $POSTGRES_USER $POSTGRES_DB > backup.dump

# Restore (--clean drops existing objects first; --if-exists avoids errors when they are absent)
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB --clean --if-exists backup.dump

# Restore specific tables
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -t tasks_task backup.dump

Redis

Persistence Configuration

Redis is configured with AOF persistence in docker-compose:

redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru

Backup

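# Trigger RDB snapshot (BGSAVE returns immediately; the save runs in the background)
redis-cli BGSAVE

# Once the save has finished (rdb_bgsave_in_progress:0 in `redis-cli INFO persistence`), copy the dump file
cp /data/dump.rdb /backups/redis-$(date +%Y%m%d).rdb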
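
Because BGSAVE only starts the snapshot, copying dump.rdb right away can pick up the previous snapshot. A minimal sketch of waiting for completion by polling INFO persistence; the function name snapshot_and_wait, the REDIS_CLI override variable, and the 60-second default are assumptions made for illustration:

```shell
# REDIS_CLI is an assumed override so the function can be pointed at a
# different binary (or a stub) without editing it.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Start a background save and block until Redis reports it is no longer
# in progress, giving up after max_wait seconds.
snapshot_and_wait() {
  max_wait=${1:-60}
  $REDIS_CLI BGSAVE >/dev/null || return 1
  waited=0
  while $REDIS_CLI INFO persistence | grep -q 'rdb_bgsave_in_progress:1'; do
    if [ "$waited" -ge "$max_wait" ]; then
      echo "BGSAVE still running after ${max_wait}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # Succeed only if the most recent background save completed cleanly.
  $REDIS_CLI INFO persistence | grep -q 'rdb_last_bgsave_status:ok'
}
```

A possible use is to chain the copy behind it: snapshot_and_wait 120 && cp /data/dump.rdb /backups/redis-$(date +%Y%m%d).rdb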

Recovery

Redis data is reconstructible

Task state, queue depth, and progress data can all be rebuilt from Postgres, so losing Redis is an operational inconvenience, not a data loss event. On restart, the monitor re-registers any running tasks.

MinIO / Object Storage

Back Up Briefs and Skills

# Mirror to S3 (or another MinIO instance)
mc mirror local/kohakku s3/kohakku-backup

# Mirror specific prefixes
mc mirror local/kohakku/briefs s3/kohakku-backup/briefs
mc mirror local/kohakku/skills s3/kohakku-backup/skills
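
After mirroring, it can be useful to confirm that the backup matches the source. A minimal sketch using mc diff, assuming it prints one line per differing object and nothing when both sides match; the function name mirror_in_sync is hypothetical:

```shell
# Returns 0 when mc diff reports no differences, 1 when it lists any,
# and 2 when the comparison itself could not be run.
mirror_in_sync() {
  src=$1
  dst=$2
  diff_out=$(mc diff "$src" "$dst") || return 2
  if [ -n "$diff_out" ]; then
    # Show which objects differ so the operator can investigate.
    printf '%s\n' "$diff_out" >&2
    return 1
  fi
}

# Example call: mirror_in_sync local/kohakku s3/kohakku-backup
```

Any non-zero result should be treated as "backup not verified" rather than as a precise diagnosis.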

Recovery

mc mirror s3/kohakku-backup local/kohakku

Production recommendation

Use S3 with versioning and cross-region replication instead of MinIO.

Disaster Recovery Procedures

Total Database Loss

  1. Restore Postgres from latest backup
  2. Restart Controller and Dispatcher -- they reconnect automatically
  3. Redis will be rebuilt by the monitor (re-registers active containers)
  4. Running agent containers continue -- they check back on their own schedule

Redis Loss

  1. Restart Redis -- it recovers from AOF/RDB
  2. If AOF is corrupted: start with empty Redis
  3. The monitor will re-register active containers from Postgres
  4. Queue depth resets -- pending tasks in Postgres can be re-dispatched

MinIO / S3 Loss

  1. Restore from backup mirror
  2. Briefs for completed tasks are not needed (results already recorded)
  3. Skills can be re-uploaded from source
  4. Running agents that already downloaded their brief are unaffected

Full Cluster Recovery

  1. Restore Postgres first (source of truth)
  2. Start Redis (AOF recovery or fresh)
  3. Start MinIO (restore from backup)
  4. Start Controller, Dispatcher, Temporal
  5. Running agents will time out and be cleaned up by the monitor
  6. Re-dispatch any in-progress tasks that were lost
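
The ordering in steps 1-4 matters, so a script that stops at the first failure can help. The following is a sketch only: it assumes the services run under docker compose with service names redis, minio, controller, dispatcher, and temporal, none of which are confirmed by this page. It defaults to a dry run that prints each command instead of executing it. Steps 5 and 6 are left to the monitor and the operator.

```shell
# DRY_RUN=1 (the default here) prints commands without running them.
DRY_RUN=${DRY_RUN:-1}

# Announce a step, then run the given command unless in dry-run mode.
run_step() {
  desc=$1
  shift
  echo "==> $desc"
  if [ "$DRY_RUN" = "1" ]; then
    echo "    (dry run) $*"
    return 0
  fi
  "$@"
}

# Each step runs only if the previous one succeeded (&& chaining).
# Service names below are assumptions; adjust to the real compose file.
recover_cluster() {
  run_step "Restore Postgres" \
    pg_restore -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" --clean backup.dump &&
  run_step "Start Redis" docker compose up -d redis &&
  run_step "Start MinIO" docker compose up -d minio &&
  run_step "Restore MinIO data" mc mirror s3/kohakku-backup local/kohakku &&
  run_step "Start application services" docker compose up -d controller dispatcher temporal
}
```

Running recover_cluster with DRY_RUN=1 first shows the planned sequence; setting DRY_RUN=0 executes it.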

Testing Backups

# Verify the archive is readable (lists its table of contents; does not validate the data)
pg_restore --list backup.dump | head -20

# Test restore to a temporary database
createdb kohakku-restore-test
pg_restore -d kohakku-restore-test backup.dump
psql kohakku-restore-test -c "SELECT count(*) FROM tasks_task"
dropdb kohakku-restore-test
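
The manual restore test above leaves the temporary database behind if any step fails partway. A minimal sketch that wraps the same commands in a function and always drops the database; the function name verify_backup and the per-process database name are assumptions, and connection settings are expected to come from the usual PG* environment variables:

```shell
# Restore a dump into a scratch database, print the row count of one
# table, and drop the scratch database whether or not the restore worked.
verify_backup() {
  dump=$1
  table=${2:-tasks_task}          # trusted input; interpolated into SQL
  db="kohakku_restore_test_$$"    # $$ keeps concurrent runs apart

  createdb "$db" || return 1
  status=0
  if pg_restore -d "$db" "$dump"; then
    # -A (unaligned) and -t (tuples only) make psql print just the number.
    rows=$(psql -At -d "$db" -c "SELECT count(*) FROM $table") || status=1
  else
    status=1
  fi
  dropdb "$db"
  [ "$status" -eq 0 ] || return 1
  echo "$rows"
}

# Example call: verify_backup backup.dump tasks_task
```

A non-zero exit status or a row count of 0 would both be worth investigating before trusting the backup.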