Backup & Recovery¶

Postgres¶

Automated Backups¶

# Daily backup via cron (add to crontab)
# Each crontab entry must stay on one line: cron does not support backslash line continuation.
0 2 * * * pg_dump -Fc -h localhost -U dbadmin kohakku-controller > /backups/controller-$(date +\%Y\%m\%d).dump
0 2 * * * pg_dump -Fc -h localhost -U postgres kohakku-dispatcher > /backups/dispatcher-$(date +\%Y\%m\%d).dump
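
The cron entries above redirect straight into the final file, so a pg_dump that fails partway leaves a truncated dump under a valid-looking name. One way to guard against that, as a minimal sketch: write to a temporary name and rename only on success. The function name `backup_db`, the `PG_DUMP` override variable, and the `<database>-<date>.dump` naming are assumptions for illustration, not part of the deployment described here.

```shell
# Hypothetical wrapper: dump one database to DIR/DB-YYYYMMDD.dump.
# PG_DUMP is an assumed override hook so the command can be swapped out.
# Connection settings are expected to come from PGHOST / PGUSER / ~/.pgpass.
backup_db() {
  db="$1"
  dir="$2"
  out="$dir/$db-$(date +%Y%m%d).dump"
  tmp="$out.partial"

  # Write to a temporary name first so a failed dump never looks complete.
  if "${PG_DUMP:-pg_dump}" -Fc "$db" > "$tmp"; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    echo "backup of $db failed" >&2
    return 1
  fi
}
```

A cron entry could then call a script that sources this function and runs, for example, `backup_db kohakku-controller /backups`.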

Point-in-Time Recovery¶

For managed databases (RDS, Cloud SQL):

Retention: 7-30 days
Backup window: during low-traffic hours
Enable point-in-time recovery (the provider handles WAL archiving once it is on)

Manual Backup and Restore¶

# Backup
pg_dump -Fc -h $POSTGRES_HOST -U $POSTGRES_USER $POSTGRES_DB > backup.dump

# Restore
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB --clean backup.dump

# Restore specific tables
pg_restore -h $POSTGRES_HOST -U $POSTGRES_USER -d $POSTGRES_DB -t tasks_task backup.dump
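
Because `--clean` drops existing objects before recreating them, it can help to confirm the archive is readable before touching the target database. The following is a sketch under stated assumptions: `safe_restore` and the `PG_RESTORE` override variable are hypothetical names, and connection settings are assumed to come from the standard `PGHOST` / `PGUSER` environment variables.

```shell
# Hypothetical helper: refuse to run a destructive restore from an unreadable archive.
# PG_RESTORE is an assumed override hook so the command can be swapped out.
safe_restore() {
  dump="$1"
  db="$2"

  # --list only reads the archive's table of contents; it does not touch any database.
  if ! "${PG_RESTORE:-pg_restore}" --list "$dump" > /dev/null 2>&1; then
    echo "refusing to restore: $dump is not a readable archive" >&2
    return 1
  fi

  # --if-exists (valid only together with --clean) avoids errors for objects already absent.
  "${PG_RESTORE:-pg_restore}" -d "$db" --clean --if-exists "$dump"
}
```

Usage would look like `safe_restore backup.dump "$POSTGRES_DB"`.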

Redis¶

Persistence Configuration¶

Redis is configured with AOF persistence in docker-compose:

redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru

Backup¶

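# Trigger RDB snapshot (BGSAVE returns immediately; the save runs in the background)
redis-cli BGSAVE

# Wait until the background save has finished
while [ "$(redis-cli INFO persistence | grep -c 'rdb_bgsave_in_progress:1')" -gt 0 ]; do sleep 1; done

# Copy the dump file
cp /data/dump.rdb /backups/redis-$(date +%Y%m%d).rdb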

Recovery¶

Redis data is reconstructible

Redis data is reconstructible from Postgres -- task state, queue depth, and progress data can be rebuilt. Redis loss is an operational inconvenience, not a data loss event. Running tasks will need to be re-registered by the monitor on restart.

MinIO / Object Storage¶

Back Up Briefs and Skills¶

# Mirror to S3 (or another MinIO instance)
mc mirror local/kohakku s3/kohakku-backup

# Mirror specific prefixes
mc mirror local/kohakku/briefs s3/kohakku-backup/briefs
mc mirror local/kohakku/skills s3/kohakku-backup/skills

Recovery¶

mc mirror s3/kohakku-backup local/kohakku

Production recommendation

Use S3 with versioning and cross-region replication instead of MinIO.

Disaster Recovery Procedures¶

Total Database Loss¶

Restore Postgres from latest backup
Restart Controller and Dispatcher -- they reconnect automatically
Redis will be rebuilt by the monitor (re-registers active containers)
Running agent containers continue -- they check back on their own schedule

Redis Loss¶

Restart Redis -- it recovers from AOF/RDB
If AOF is corrupted: start with empty Redis
The monitor will re-register active containers from Postgres
Queue depth resets -- pending tasks in Postgres can be re-dispatched

MinIO / S3 Loss¶

Restore from backup mirror
Briefs for completed tasks are not needed (results already recorded)
Skills can be re-uploaded from source
Running agents that already downloaded their brief are unaffected

Full Cluster Recovery¶

Restore Postgres first (source of truth)
Start Redis (AOF recovery or fresh)
Start MinIO (restore from backup)
Start Controller, Dispatcher, Temporal
Running agents will time out and be cleaned up by the monitor
Re-dispatch any in-progress tasks that were lost
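
The start order above can be sketched as a script. This assumes a Docker Compose deployment with services named `postgres`, `redis`, `minio`, `controller`, `dispatcher`, and `temporal`; those service names, the `run` helper, and the `DRY_RUN` switch are all hypothetical and would need to match the actual environment.

```shell
# Hypothetical recovery runbook. Service names and the DRY_RUN switch are assumptions.
run() {
  # Print the command in dry-run mode, execute it otherwise.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_cluster() {
  dump="$1"

  # 1. Postgres first: it is the source of truth.
  #    In a real run, wait until Postgres accepts connections (e.g. with pg_isready) before restoring.
  run docker compose up -d postgres || return 1
  run pg_restore -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" --clean "$dump" || return 1

  # 2. Redis: recovers from AOF, or starts empty.
  run docker compose up -d redis || return 1

  # 3. MinIO, then restore objects from the backup mirror.
  run docker compose up -d minio || return 1
  run mc mirror s3/kohakku-backup local/kohakku || return 1

  # 4. Application services last.
  run docker compose up -d controller dispatcher temporal || return 1
}
```

Running `DRY_RUN=1 recover_cluster backup.dump` prints the planned commands without executing them, which is a cheap way to review the sequence before a real recovery. Re-dispatching lost in-progress tasks is left out because it depends on application tooling not shown here.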

Testing Backups¶

# Check that the archive is readable (lists its table of contents; does not validate the data)
pg_restore --list backup.dump | head -20

# Test restore to a temporary database
createdb kohakku-restore-test
pg_restore -d kohakku-restore-test backup.dump
psql kohakku-restore-test -c "SELECT count(*) FROM tasks_task"
dropdb kohakku-restore-test
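
For unattended runs, the test restore above can be wrapped so that the scratch database is dropped even when a step fails, and the exit status reports the result. This is a sketch only: `test_restore` and the command override variables (`CREATEDB`, `PG_RESTORE`, `PSQL`, `DROPDB`) are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: restore a dump into a scratch database and check that a table is readable.
# CREATEDB, PG_RESTORE, PSQL and DROPDB are assumed override hooks so commands can be swapped out.
test_restore() {
  dump="$1"
  scratch="kohakku-restore-test"
  status=0

  "${CREATEDB:-createdb}" "$scratch" || return 1

  # A failing step marks the test as failed, but the scratch database is still dropped below.
  "${PG_RESTORE:-pg_restore}" -d "$scratch" "$dump" || status=1
  if [ "$status" -eq 0 ]; then
    # -A -t prints the bare count with no header or alignment.
    "${PSQL:-psql}" -At "$scratch" -c "SELECT count(*) FROM tasks_task" || status=1
  fi

  "${DROPDB:-dropdb}" "$scratch"
  return "$status"
}
```

A scheduled job could call `test_restore /backups/<file>.dump` and alert on a non-zero exit status.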