Backup & Recovery
Postgres
Automated Backups
# Daily backup via cron (add to crontab).
# Each entry must stay on a single line: cron does not support backslash
# line continuation, and % must be escaped as \%.
0 2 * * * pg_dump -Fc -h localhost -U dbadmin kohakku-controller > /backups/controller-$(date +\%Y\%m\%d).dump
0 2 * * * pg_dump -Fc -h localhost -U postgres kohakku-dispatcher > /backups/dispatcher-$(date +\%Y\%m\%d).dump
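A plain shell redirect creates the target file even when pg_dump fails, which can leave a truncated dump that looks like a valid backup. One way to guard against this is a small wrapper that writes to a temporary name and renames only on success. This is a sketch, not the project's actual tooling: the `backup_db` function name, the `.partial` suffix, and the `BACKUP_DIR` variable are assumptions.

```shell
# Hypothetical wrapper around pg_dump; BACKUP_DIR defaults to the path used by the cron entries.
BACKUP_DIR="${BACKUP_DIR:-/backups}"

backup_db() {
  local db="$1" user="$2" label="$3"
  local target="$BACKUP_DIR/$label-$(date +%Y%m%d).dump"
  local tmp="$target.partial"
  # Write to a temporary name first so a failed dump never looks complete.
  if pg_dump -Fc -h localhost -U "$user" -f "$tmp" "$db"; then
    mv "$tmp" "$target"
    echo "$target"
  else
    rm -f "$tmp"
    echo "backup of $db failed" >&2
    return 1
  fi
}

# Example calls (same databases and users as the cron entries):
# backup_db kohakku-controller dbadmin controller
# backup_db kohakku-dispatcher postgres dispatcher
```

A cron entry could then call a script containing these functions instead of invoking pg_dump directly, and the non-zero exit status makes failures visible to whatever monitors the job.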
Point-in-Time Recovery
For managed databases (RDS, Cloud SQL):
- Retention: 7-30 days
- Backup window: during low-traffic hours
- Enable point-in-time recovery (PITR); the provider then handles WAL archiving
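On RDS, a point-in-time restore creates a new instance rather than rewinding the existing one. The sketch below wraps the AWS CLI call for that. It assumes the AWS CLI is installed and configured; `restore_rds_pitr` is a hypothetical helper name, and the instance identifiers and timestamp in the example are placeholders.

```shell
# Hypothetical helper: restore an RDS instance into a NEW instance at a given time.
restore_rds_pitr() {
  local source_id="$1" target_id="$2" restore_time="$3"
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier "$source_id" \
    --target-db-instance-identifier "$target_id" \
    --restore-time "$restore_time"
}

# Example (placeholder identifiers, UTC timestamp):
# restore_rds_pitr kohakku-prod kohakku-prod-restored 2024-01-15T01:30:00Z
```

After the new instance is available, the services would need to be pointed at its endpoint; how that is done depends on the deployment.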
Manual Backup and Restore
# Backup (custom format, readable by pg_restore)
pg_dump -Fc -h "$POSTGRES_HOST" -U "$POSTGRES_USER" "$POSTGRES_DB" > backup.dump
# Restore (--clean drops existing objects before recreating them)
pg_restore -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" --clean backup.dump
# Restore specific tables
pg_restore -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d "$POSTGRES_DB" -t tasks_task backup.dump
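Before a selective restore it can help to confirm that the dump actually contains data for the table. `pg_restore --list` prints the archive's table of contents without touching any database. The sketch below assumes the usual TOC line layout (`... TABLE DATA <schema> <table> <owner>`); `dump_has_table` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the dump's table of contents
# lists a TABLE DATA entry for the given table (in any schema).
dump_has_table() {
  local dump="$1" table="$2"
  pg_restore --list "$dump" | grep -Eq " TABLE DATA [^ ]+ ${table} "
}

# Example:
# dump_has_table backup.dump tasks_task && echo "tasks_task is in the dump"
```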
Redis
Persistence Configuration
Redis is configured with AOF persistence in docker-compose.
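To confirm that a running instance actually has AOF enabled, the setting can be queried at runtime. A minimal sketch, assuming redis-cli can reach the instance with its default connection settings; the `aof_enabled` function name is hypothetical.

```shell
# Hypothetical helper: succeed only if the running Redis reports appendonly=yes.
# CONFIG GET prints the setting name on one line and its value on the next.
aof_enabled() {
  [ "$(redis-cli CONFIG GET appendonly | tail -n 1 | tr -d '\r')" = "yes" ]
}

# Example:
# aof_enabled || echo "warning: AOF persistence is off" >&2
```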
Backup
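# Trigger RDB snapshot (BGSAVE returns immediately; the save runs in the background)
redis-cli BGSAVE
# Wait until the background save has finished
while redis-cli INFO persistence | grep -q 'rdb_bgsave_in_progress:1'; do sleep 1; done
# Copy the dump file
cp /data/dump.rdb /backups/redis-$(date +%Y%m%d).rdb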
Recovery
Redis data is reconstructible
Task state, queue depth, and progress data can all be rebuilt from Postgres, so losing Redis is an operational inconvenience, not a data loss event. After a restart, the monitor must re-register any running tasks.
MinIO / Object Storage
Backup Briefs and Skills
# Mirror to S3 (or another MinIO instance)
mc mirror local/kohakku s3/kohakku-backup
# Mirror specific prefixes
mc mirror local/kohakku/briefs s3/kohakku-backup/briefs
mc mirror local/kohakku/skills s3/kohakku-backup/skills
Recovery
Production recommendation
Use S3 with versioning and cross-region replication instead of MinIO.
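Where MinIO is in use, restoring is the backup mirror run in the opposite direction. A sketch, assuming the same `local` and `s3` mc aliases and bucket names as the backup commands; `restore_prefix` is a hypothetical helper name.

```shell
# Hypothetical helper: copy one prefix back from the backup bucket.
restore_prefix() {
  local prefix="$1"
  # Source and target are swapped relative to the backup direction.
  # --overwrite replaces objects on the target that differ from the backup.
  mc mirror --overwrite "s3/kohakku-backup/$prefix" "local/kohakku/$prefix"
}

# Example:
# restore_prefix briefs
# restore_prefix skills
```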
Disaster Recovery Procedures
Total Database Loss
- Restore Postgres from latest backup
- Restart Controller and Dispatcher -- they reconnect automatically
- Redis will be rebuilt by the monitor (re-registers active containers)
- Running agent containers continue -- they check back on their own schedule
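Finding the latest backup for the first step can be scripted by picking the newest date-stamped file. A sketch, assuming the `/backups/<name>-YYYYMMDD.dump` naming used by the cron entries; `latest_backup` is a hypothetical helper name.

```shell
# Hypothetical helper: print the newest dump for a given prefix.
# Date-stamped names (YYYYMMDD) sort chronologically as plain strings.
latest_backup() {
  local prefix="$1" dir="${2:-/backups}"
  ls -1 "$dir"/"$prefix"-*.dump 2>/dev/null | sort | tail -n 1
}

# Example:
# pg_restore -h localhost -U dbadmin -d kohakku-controller --clean "$(latest_backup controller)"
```

The function prints nothing when no matching file exists, so a caller should check for an empty result before handing it to pg_restore.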
Redis Loss
- Restart Redis -- it recovers from AOF/RDB
- If AOF is corrupted: start with empty Redis
- The monitor will re-register active containers from Postgres
- Queue depth resets -- pending tasks in Postgres can be re-dispatched
MinIO / S3 Loss
- Restore from backup mirror
- Briefs for completed tasks are not needed (results already recorded)
- Skills can be re-uploaded from source
- Running agents that already downloaded their brief are unaffected
Full Cluster Recovery
- Restore Postgres first (source of truth)
- Start Redis (AOF recovery or fresh)
- Start MinIO (restore from backup)
- Start Controller, Dispatcher, Temporal
- Running agents will time out and be cleaned up by the monitor
- Re-dispatch any in-progress tasks that were lost
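Because the order matters here, a recovery script would typically wait for each dependency to accept connections before starting the next component. The sketch below shows a generic retry helper. `wait_for` is a hypothetical name, and the commands in the example are only one way to probe readiness.

```shell
# Hypothetical helper: retry a command once per second, up to a maximum
# number of attempts. Returns 0 as soon as the command succeeds.
wait_for() {
  local tries="$1"
  shift
  local i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example: block until Postgres and Redis respond before starting services.
# wait_for 30 pg_isready -h "$POSTGRES_HOST" || exit 1
# wait_for 30 redis-cli ping || exit 1
```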